Variational Bayesian Quantile Regression with Non-Ignorable Missing Response Data

Zhang, Juanjuan; Wang, Weixian; Tian, Maozai

doi:10.3390/axioms14060408

Open AccessArticle

Variational Bayesian Quantile Regression with Non-Ignorable Missing Response Data

by

Juanjuan Zhang

¹,

Weixian Wang

²

and

Maozai Tian

^3,4,*

¹

School of Digital Economy and Trade, Guangzhou Huashang College, Guangzhou 511300, China

²

School of Mathematics and Statistics, Guangxi Normal University, Guilin 541006, China

³

School of Statistics and Data Science, Xinjiang University of Finance and Economics, Urumqi 830012, China

⁴

Center for Applied Statistics, School of Statistics, Renmin University of China, Beijing 100872, China

^*

Author to whom correspondence should be addressed.

Axioms 2025, 14(6), 408; https://doi.org/10.3390/axioms14060408

Submission received: 22 April 2025 / Revised: 21 May 2025 / Accepted: 23 May 2025 / Published: 27 May 2025

(This article belongs to the Special Issue Recent Stochastic and Statistical Approaches for Modeling Complex Systems and Dependent Variables)

Download

Browse Figures

Versions Notes

Abstract

:

For non-ignorable missing response variables, the mechanism of whether the response variable is missing can be modeled through logistic regression. In Bayesian computation, the lack of a conjugate prior for the logistic function poses a significant challenge. Introducing a new Pólya-Gamma variable and employing lower-bound approximation are two common methods for parameter inference in conjugate Bayesian logistic regression. It can be observed that these two methods yield essentially the same variational posterior in the calculation of the variational Bayesian posterior. This paper applies a popular Bayesian spike-and-slab LASSO prior for variable selection in quantile regression with non-ignorable missing response variables, which demonstrates good performance in both simulations and practical applications.

Keywords:

non-ignorable missingness; Bayesian quantile regression; logistic regression

MSC:

62-08

1. Introduction

In many application fields, such as economics, sociology, and biomedicine, some subjects may have missing responses or predictors due to various reasons, including study dropout, unwillingness of study participants to answer certain questions in the questionnaire, and information loss caused by uncontrollable factors. Statistical inference for missing data problems is quite challenging.

Rubin [1] categorized missing data into three mechanisms: missing completely at random (MCAR), where the missingness process is independent of the observed and missing quantities; missing at random (MAR), where the missingness process depends on the observed quantities but not on the missing quantities; and non-ignorable missingness or not missing at random (NMAR), where the missingness process depends on both the observed and missing quantities. In missing data analysis, the NMAR assumption may be more reasonable than the classical MAR assumption.

Bayesian parameter estimation methods are often used to address the estimation problem of parametric models involving missing data for several reasons. First, Markov chain Monte Carlo (MCMC) methods widely used in statistical computing, such as the Gibbs algorithm [2] and the Metropolis–Hastings (MH) algorithm [3,4], can be employed to estimate the posterior distributions of parameters, nonparametric functions, and missing data. Second, compared to the setting without missing data, Bayesian methods with missing data only require an additional step in the Gibbs sampler. Therefore, Bayesian methods can easily handle missing data without the need for new statistical inference techniques [5]. Third, some prior information can be directly incorporated into the analysis, resulting in more accurate parameter estimation when good prior assumptions are available. Fourth, sampling-based Bayesian methods do not rely on asymptotic theory and may provide more reliable statistical inference even in small sample situations. In recent years, there have been many studies on NMAR data analysis, such as those by Lee and Tang [6], Tang and Zhao [7], and Xu and Tang [8].

Furthermore, variable selection can be viewed as a special case of model selection, which is achieved through spike-and-slab priors in Bayesian variable selection. This paper chooses the spike-and-slab LASSO prior [9] for parameter estimation and variable selection. The missingness mechanism of the response variable can be obtained through logistic regression, and variational Bayesian algorithms are considered for model parameter estimation. In the calculation of variational posteriors, due to the lack of a conjugate prior for logistic regression, we cannot compute the variational posteriors by specifying a reasonable variational family. Introducing Pólya-Gamma latent variables [10] and employing lower-bound approximation [11] are two methods that can yield conjugate posteriors for Bayesian logistic regression. In the posterior computation process of variational Bayesian methods, these two methods will have the same variational posteriors. Considering the characteristics of the spike-and-slab prior, it is unnecessary to calculate the complex variational lower bound, and the algorithm can still converge.

In this study, we propose a variational Bayesian quantile regression algorithm to address the challenges posed by non-ignorable response missing data. Unlike traditional methods, the variational Bayesian approach offers an efficient and scalable solution by transforming the posterior inference problem into an optimization task, ensuring faster convergence and reducing computational burden. The quantile regression framework provides a robust alternative to mean-based models, capturing the conditional distribution of the response variable across different quantiles, which is particularly valuable when the data exhibit heteroscedasticity or skewness. While Li’s paper [12] also considers missing covariates and response variables, it does not address variable selection. In our work, we employ a prior that enables variable selection, allowing for effective variable screening alongside parameter estimation. Additionally, the convergence criteria of our algorithm differ from Li’s. Li uses the minimal change in the variational lower bound as the stopping condition, which involves more complex computations. In contrast, our approach is more computationally efficient. Moreover, the proposed method incorporates variable selection through a Bayesian shrinkage prior, effectively identifying significant predictors while accounting for the missing data mechanism. This combination of variational inference, quantile regression, and variable selection not only enhances estimation accuracy but also offers a flexible and computationally efficient tool for analyzing complex missing data structures. These features highlight the novelty and practical relevance of our approach.

This article is organized as follows: Section 2 introduces the model, prior, and variational Bayesian logistic regression used in this paper; Section 3 proposes the corresponding variational Bayesian algorithm for data with non-ignorable missing responses; Section 4 conducts simulation studies for the proposed algorithm; Section 5 applies the algorithm to real data analysis; and relevant conclusions are presented in Section 6.

2. Model Prior and Variational Bayesian Logistic Regression

In this paper, the response variable

y_{i}

,

i = 1, 2, \dots, n

may be missing, while all covariates (explanatory variables)

X_{i} = (x_{i 1}, \dots, x_{i p})

are completely observable. The incomplete observations are as follows:

{(X_{i}, y}_{i}, r_{i}), i = 1, 2, \dots, n .

(1)

where

r_{i} \in \{0, 1\}

determines whether

y_{i}

is missing. When

r_{i} = 1

,

y_{i}

is missing. Let

y = {y_{o}, y_{m}}

,

r = (r_{1}, r_{2}, \dots, r_{n})

, where

y_{o}

and

y_{m}

represent the observed response variables and the missing response variables, respectively. Let

p (r_{i} ∣ y_{i}, X_{i}, φ)

denote the conditional distribution of

r_{i}

given

y_{i}

,

X_{i}

, and φ, where φ is the unknown parameter vector in the conditional probability function

p (r_{i} ∣ y_{i}, X_{i}, φ)

. The missingness mechanism of the data is completely determined by this conditional distribution.

We consider the following non-ignorable missingness mechanism:

p (r_{i} ∣ y_{i}, X_{i}, φ) = {\{p (r_{i} = 1 ∣ y_{i}, X_{i}, φ)\}}^{r_{i}} {\{1 - p (r_{i} = 1 ∣ y_{i}, X_{i}, φ)\}}^{1 - r_{i}}

(2)

where

p (r_{i} ∣ y_{i}, X_{i}, φ)

can be modeled through logistic regression:

l o g i t (p (r_{i} ∣ y_{i}, X_{i}, φ)) = φ_{01} + φ_{11} y_{i} + φ_{21} x_{i 1} + \dots + φ_{2 p} x_{i p} = W_{i}^{T} φ

(3)

where

φ = {(φ_{01}, φ_{11}, φ_{21}, \dots, φ_{2 p})}^{T}

are the logistic regression model parameters, and

W_{i}^{T} = (1, y_{i}, x_{i 1}, \dots, x_{i p})

are the covariates of the logistic regression model.

The quantile regression model with non-ignorable missing responses. For

0 < τ < 1

, it is constrained such that the

τ

-th quantile equals zero. The

τ

-th quantile regression model is expressed as follows:

Q_{τ} (y_{i} ∣ x_{i}) = X_{i}^{T} β, i = 1, \dots, n .

(4)

p (r_{i} = 1 ∣ y_{i}, x_{i}, φ) = \frac{e^{W_{i}^{T} φ}}{1 + e^{w_{i}^{T} φ}} = σ (W_{i}^{T} φ)

(5)

where β = (β₁, …, β_p)^T are the quantile regression model parameters to be estimated when

r_{i} = 1, y_{i}

is missing.

2.1. Spike-and-Slab LassoLASSO Prior

The spike-and-slab LassoLASSO (SSL) prior [9] is represented as follows:

β_{j} ∣ γ_{j} = γ_{j} φ_{1} (β_{j}) + (1 - γ_{j}) φ_{0} (β_{j}), j = 1, 2, \dots, p,

(6)

φ_{1} (β_{j}) = \frac{λ_{1}}{2} \exp (- λ_{1} |β_{j}|), φ_{0} (β_{j}) = \frac{λ_{0}}{2} \exp (- λ_{0} |β_{j}|) .

(7)

where λ₁ is chosen to be a smaller value, while λ₀ should be chosen to be a larger value. In the Bayesian framework, the Laplace distribution is not conjugate, but it can be hierarchically represented using the normal distribution (⋅, ⋅) and the exponential distribution Exp(⋅):

p (β_{j} ∣ τ_{1 j}^{2}, γ_{j} = 1) ~ N (0, τ_{1 j}^{2}), {p (τ}_{1 j}^{2} |λ_{1}^{2}) ~ E x p (\frac{λ_{1}^{2}}{2})

(8)

p (β_{j}∣ τ_{0 j}^{2}, γ_{j} = 0) ~ N (0, τ_{0 j}^{2}), {p (τ}_{0 j}^{2} |λ_{0}^{2}) ~ E x p (\frac{λ_{0}^{2}}{2})

(9)

Figure 1 describes four types of spike-and-slab priors with a mixing proportion of

γ_{j} = 1 / 2

. They are the Normal mixture (where both

φ_{1} (β_{j})

and

φ_{0} (β_{j})

are normal distributions), the Normal and Point-mass mixture (where

φ_{0} (β_{j})

is a point-mass function at 0 and

φ_{1} (β_{j})

is a normal distribution), the Laplace and Point-mass mixture (where

φ_{0} (β_{j})

is a point-mass function at 0 and

φ_{1} (β_{j})

is a Laplace distribution), and the SSL (where both are Laplace distributions). It can be seen that the Normal mixture cannot adequately penalize smaller coefficients, making it difficult to achieve variable selection. On the other hand, the point-mass spike-and-slab prior has an over-shrinkage problem and may miss important variables. Therefore, the SSL prior can be considered as a balance between the two.

Penalized priors (such as LASSO) and spike-and-slab priors are common priors in Bayesian variable selection. When

λ_{1} = λ_{0} = λ,

the SSL prior degenerates to the LassoLASSO prior. When

λ_{0} \to \infty

,

φ_{0} (β_{j}) \to 0

. That is, in the limit case, SSL can be transformed into the “gold standard” point-mass spike-and-slab prior. Thus, SSL integrates the penalized likelihood (LassoLASSO) and the spike-and-slab prior. SSL is also adaptive (there is no spike-and-slab adaptive LassoLASSO), and the proof of its adaptivity can be found in the discussion by Rockova and George [9]. SSL uses the spike component to encourage sparsity by shrinking many regression coefficients to zero, while the slab component captures larger signals, enabling simultaneous variable selection and parameter estimation. Unlike traditional LassoLASSO with fixed regularization, SSL automatically adjusts the sparsity parameter based on the data, reducing the need for manual tuning and providing multiplicity correction, which lowers the false positive rate.

2.2. Bayesian Logistic Regression Based on Pólya-Gamma Latent Variables

The logistic function does not have a conjugate prior, which poses a challenge for Bayesian inference in logistic regression. Polson et al. [10] proposed a new data augmentation strategy for Bayesian inference in logistic regression models.

If ω∼P (b, c), where P (b, c) denotes the Pólya-Gamma distribution with parameters (b, c), its expectation is

E (ω) = \frac{b}{2 c} \tanh (\frac{c}{2}) = \frac{b}{2 c} (\frac{e^{c} - 1}{e^{c} + 1})

The probability density of ω has the following property:

p (ω ∣ b, c) \propto e x p (- \frac{c^{2}}{2} ω) p (ω ∣ b, 0)

If ω∼(b, 0) and p(ω) represents its probability density, then

\frac{{(e^{ψ})}^{a}}{{(1 + e^{ψ})}^{b}} = 2^{- b} e^{κ ψ} \int_{0}^{\infty} e^{- ω ψ^{2} / 2} p (ω) d ω

where

b > 0, κ = a - b / 2

. Applying this result to (5) (see Section 3.1 of Polson et al. [10]), we obtain that

L (φ, ω) = \prod_{i = 1}^{n} \frac{{(e^{{W_{i}}^{T} φ})}^{r_{i}}}{1 + e^{{W_{i}}^{T} φ}} \propto \prod_{i = 1}^{n} e x p \{(r_{i} - \frac{1}{2}) ({W_{i}}^{T} φ) - \frac{ω_{i}}{2} {({W_{i}}^{T} φ)}^{2}\} p (ω_{i} ∣ 1, 0)

where

ω = (ω_{1}, \dots, ω_{n})

,

ω_{i} ~ P G (1, 0)

, and its probability density is denoted as

p (ω_{i} ∣ 1, 0)

.

Assuming a prior

p (φ)

for

φ

, the posterior of

φ

is as follows:

q (φ ∣ ω, y) \propto p (φ) \prod_{i = 1}^{n} e x p \{κ_{i} {W_{i}}^{T} φ - ω_{i} {({W_{i}}^{T} φ)}^{2} / 2\}

where

κ_{i} = r_{i} - 1 / 2

. If

p (φ)

is a Gaussian prior, then the posterior

q (φ ∣ ω, y)

is conjugate with the prior. The posterior density of the variable

ω_{i}

is

q (ω_{i} | φ) \propto e x p \{- \frac{ω_{i}}{2} {({W_{i}}^{T} φ)}^{2}\} p (ω_{i} ∣ 1, 0)

Then, the posterior of

ω_{i}

is

P G (1, {W_{i}}^{T} φ)

, and

E (ω_{i} | φ) = \frac{1}{2 {W_{i}}^{T} φ} \tanh (\frac{{W_{i}}^{T} φ}{2})

.

2.3. Bayesian Logistic Regression Based on Lower-Bound Approximation

The log-likelihood of

(5)

is as follows:

l (φ) = \log L (φ) = \log \prod_{i = 1}^{n} \frac{{(e^{{W_{i}}^{T} φ})}^{r_{i}}}{1 + e^{{W_{i}}^{T} φ}} = \sum_{i = 1}^{n} r_{i} {W_{i}}^{T} φ - g ({W_{i}}^{T} φ)

where

g (t) = \log (1 + e^{t})

,

t \in R

. We take the logarithm of the logistic function:

\log σ (t) = \log {(1 + e^{- t})}^{- 1} = \frac{t}{2} - \log (e^{\frac{t}{2}} + e^{- \frac{t}{2}})

Jaakkola and Jordan [11] approximate

\log (e^{\frac{t}{2}} + e^{- \frac{t}{2}})

using a first-order Taylor expansion:

\log σ (t) \geq \frac{t}{2} - \log (e^{\frac{η}{2}} + e^{- \frac{η}{2}}) - \frac{1}{4 η} \tanh (\frac{η}{2}) (t^{2} - η^{2}) = \frac{t - η}{2} + \log σ (η) - \frac{1}{4 η} \tanh (\frac{η}{2}) (t^{2} - η^{2})

Substituting

- g ({W_{i}}^{T} φ) = \log σ ({- W}_{i}^{T} φ)

into

l (φ)

:

l (φ) \geq \sum_{i = 1}^{n} \log σ (η_{i}) - \frac{η_{i}}{2} + (r_{i} - \frac{1}{2}) {W_{i}}^{T} φ - \frac{1}{4 η_{i}} t a n h (η_{i} / 2) ({({W_{i}}^{T} φ)}^{2} - η_{i}^{2}) = : f (φ, η)

We want the lower bound

f (φ, η)

of

l (φ)

to be as large as possible. For a given

φ

, the lower bound

f (φ, η) = f (η)

, so we maximize

f (η)

. We define a function

f_{a} (R \to R, a \geq 0)

:

f_{a} (x) = l o g σ (x) - \frac{x}{2} - \frac{1}{4 x} t a n h (x / 2) (a^{2} - x^{2})

Then,

f_{a} (x)

is symmetric about

x = 0

, and

f_{a} (x)

reaches its maximum at

x = \pm a

. The proof of this conclusion can be found in Kolyan Ray et al.’s work [13]. Therefore, when

η_{i} = {W_{i}}^{T} φ

, the lower bound

f (η)

reaches its maximum value.

The posterior of

φ

is as follows:

q (φ) = c p (φ) L (φ) \approx c p (φ) e x p {f (φ, η)} \propto p (φ) \prod_{i = 1}^{n} e x p \{(r_{i} - \frac{1}{2}) {W_{i}}^{T} φ - \frac{1}{4 η_{i}} t a n h (η_{i} / 2) {({W_{i}}^{T} φ)}^{2}\} = p (φ) \prod_{i = 1}^{n} e x p \{(r_{i} - \frac{1}{2}) {W_{i}}^{T} φ - E (ω_{i} | φ) {({W_{i}}^{T} φ)}^{2} / 2\}

Comparing the posterior q(

φ

) under these two methods, we find that whether using the lower bound approximation or introducing the new variable Pólya-Gamma method,

φ

will have essentially the same variational posterior.

3. Variational Bayesian Algorithm for Bayesian Quantile Regression with Non-Ignorable Missing Responses

3.1. Hierarchical Model and Prior

In Bayesian quantile regression, it is often assumed that

y_{i}

follows an asymmetric Laplace (AL) distribution, where

f (y_{i} ∣ β, σ) = \frac{τ (1 - τ)}{σ} \exp \{- ρ_{τ} (\frac{y - X_{i j}^{T} β}{σ})\} .

where

ρ_{τ} (u) = u (τ - I (u < 0))

,

I (\cdot)

is an indicator function.according According to the findings of Kozumi and Kobayashi [14], the asymmetric Laplace (AL) distribution is expressed as a mixture of a normal distribution and an exponential distribution.

\{\begin{matrix} y_{i} | β, σ, e_{i} ~ N (X_{i}^{T} β + k_{1} e_{i}, σ k_{2} e_{i}), \\ e_{i} | σ ~ E x p (\frac{1}{σ}), \end{matrix}

(10)

where

k_{1} = (1 - 2 τ) / (τ (1 - τ))

,

k_{2} = 2 / (τ (1 - τ))

.

The Bayesian quantile regression model with non-ignorable missing responses has the following hierarchical representation:

y_{i} \sim N (X_{i}^{T} β + k_{1} e_{i}, k_{2} σ e_{i}), w h e n r_{i} = 1, y_{i} i s m i s s i n g . r_{i} | ω_{i} \propto e x p \{(r_{i} - \frac{1}{2}) ({W_{i}}^{T} φ) - \frac{ω_{i}}{2} {({W_{i}}^{T} φ)}^{2}\}, ω_{i} \sim P G (0, 1), i = 1, 2, \dots, n . e_{i} ∣ σ \sim e x p (\frac{1}{σ}) .

SSL prior can be hierarchically represented using the normal distribution and the exponential distribution:

β_{j} |τ_{1 j}^{2}, γ_{j} = 1 \sim N (0, τ_{1 j}^{2}), τ_{1 j}^{2}| λ_{1}^{2} \sim E x p (λ_{1}^{2} / 2), j = 1, 2, \dots, p . β_{j} |τ_{0 j}^{2}, γ_{j} = 0 \sim N (0, τ_{0 j}^{2}), τ_{0 j}^{2}| λ_{0}^{2} \sim E x p (λ_{0}^{2} / 2) . γ_{j} ∣ π \sim B e r n o u l l i (π)

The prior of hyperparameters:

σ \sim I G (a_{σ_{0}}, b_{σ_{0}}), π \sim B e t a (a_{0}, b_{0}) λ_{1}^{2} \sim G a (c_{1}, d_{1}), λ_{0}^{2} \sim G a (c_{0}, d_{0})

where

G a (a, b)

and

I G (a, b)

represent the gamma distribution and inverse gamma distribution, respectively, with shape parameter

a

and scale parameter

b

.

B e t a (\cdot, \cdot)

isas beta distribution. If

p (φ)

is a Gaussian prior:

φ \sim N (μ_{φ_{0}}, Σ_{φ_{0}})

In the incomplete data setting, the missing data

y_{m}

introduce additional complexity because the posterior distribution of the model parameters

(β, σ, φ)

is a marginal distribution:

q (β, σ, φ| y_{o}, X, r) = \int q (β, σ, φ, y_{m}| y_{o}, X, r) d y_{m}

This marginalization often results in a complicated and intractable posterior distribution since the integral rarely has a closed-form solution. By augmenting the data with

y_{m}

, the posterior becomes a complete-data posterior:

q (β, σ, φ, y_{m}| y_{o}, X, r)

which is typically easier to work with because—conditionally on the imputed

y_{m}

—the model often simplifies to standard forms like conjugate priors, making Bayesian updating more straightforward.

3.2. Coordinate Ascent Variational Inference Algorithms

Let

θ \in Θ

,

D

is the observed data, and

p (θ | D)

is the conditional posterior distribution of

θ

. Variational Bayesian methods aim to find a distribution

q (θ)

within a given family of distributions

F

, minimizing the Kullback–Leibler (KL) divergence between

q (θ)

and

p (θ | D)

. Denoting

p (θ)

as the prior distribution of the parameters, we have

\log p (D) = \int \log \frac{p (θ) p (D| θ)}{q (θ)} q (θ) d θ + \int \log \frac{q (θ)}{p (θ| D)} q (θ) d θ .

In this equation, the first term is the evidence lower bound observed (ELBO), and the second term is the Kullback–Leibler (KL) divergence between

q (θ)

and

p (θ | D)

, denoted as

K L (q | | p)

. The logarithm of the likelihood of the observed data,

\log p (D)

, is independent of the choice of

q (θ)

.

Therefore, minimizing

K L (q | | p)

is equivalent to maximizing the evidence lower bound:

\hat{q} (θ) = \arg \max_{q (θ) \in F} \{E_{q (θ)} [\log p (D, θ)] - E_{q (θ)} [\log q (θ)]\} .

For ease of solving

\hat{q} (θ)

, consider mean-field Variational Bayesian Bayes (MFVB). Coordinate Ascent Variational Inference (CAVI) is one of the most commonly used VB algorithms. CAVI iteratively optimizes each factor of the mean-field variational density while keeping the other factors fixed, ultimately increasing the ELBO until a local optimum is reached. Let

\hat{q} (θ) = \prod_{t = 1}^{T} q_{t} (θ_{t})

, and denote

θ_{- t} = (θ_{1}, \dots, θ_{t - 1}, θ_{t + 1}, . . ., θ_{T})

. According to the results of Blei et al. [15], the optimal variational posterior is given by

q^{*} (θ_{t}) \propto \exp \{E_{θ_{- t}} [\log p (θ_{t}| D, θ_{- t})]\} .

3.3. Variational Posterior

In this paper, the parameters under consideration are

θ = (β, {\{τ_{0 j}^{2}, τ_{1 j}^{2}, γ_{j}\}}_{j = 1}^{p}, π, λ_{1}^{2}, λ_{2}^{2}, {\{e_{i}, ω_{i}\}}_{i = 1}^{n}, σ, φ, y_{m})

The derivation of the variational posterior is provided in Appendix A, and the variational posterior of β is

q (β) \sim N (μ, Σ),

The variational posterior of

τ_{1 j}^{2}

is

q (τ_{1 j}^{2}) \sim G I G (\frac{1}{2}, a_{1 j}, b_{1 j}),

The variational posterior of

τ_{0 j}^{2}

is

q (τ_{0 j}^{2}) \sim G I G (\frac{1}{2}, a_{0 j}, b_{0 j}),

The Variationalvariational posterior of

γ_{j}

is:

q (γ_{j}) \sim Bernoulli (ϕ_{j}),

The Variationalvariational posterior of

π

is

q (π) ~ Beta (a, b),

The Variationalvariational posterior of

λ_{1}^{2}

is

q (λ_{1}^{2}) ~ Γ (\tilde{c_{1}}, \tilde{d_{1}}),

The Variationalvariational posterior of

λ_{0}^{2}

is

q (λ_{0}^{2}) ~ Γ (\tilde{c_{0}}, \tilde{d_{0}}),

The Variationalvariational posterior of

e_{i}

:

q (e_{i}) ~ G I G (\frac{1}{2}, a_{e} = \frac{{(y_{i} - {X_{i}}^{T} E [β])}^{2}}{σ k_{2}}, b_{e} = ({k_{1}}^{2} + 2 k_{2}) / (σ k_{2})) .

The Variationalvariational posterior of

σ

:

q (σ) ~ I G (a_{σ} = \frac{3 n}{2} + r_{0}, b_{σ}),

The Variationalvariational posterior (

φ

):

q (φ) ~ N (μ_{φ}, Σ_{φ}),

The Variationalvariational posterior of

ω_{i}

is

q (ω_{i}) ~ P G (1, {\hat{c}}_{i}),

For a missing individual

i

, denoted as

y_{i m}

, suppose there are

n_{m}

individuals missing in total, the Variationalvariational posterior is

q (y_{i m}) \sim N (μ_{y}, σ_{y}),

The CAVI algorithm is used to update the parameters in the posterior distributions until convergence. A common convergence criterion is that the ELBO no longer changes (or the change is small enough to be ignored). Since we focus on variable selection and the ELBO involves many complex expectations and entropy calculations, we calculate the entropy of

ϕ = (ϕ_{1}, \dots, ϕ_{p})

until it no longer changes (or the change is very small, less than a given threshold). It can be seen that, when ϕ tends to 0, the algorithm converges. Please refer to Algorithm 1.

E(ϕ) = ϕ log(ϕ) + (1_p − ϕ) log (1_p − ϕ)

Algorithm 1: Variational Bayesian algorithm with non-negligible missing response

Variational Bayesian parameter initialization;

Setting initial values

: δ = 10^{- 3}, J = 1000, t = 1, E n t (ϕ)^{(0)} = 0

;

while

|E n t (ϕ)^{(t)} - E n t (ϕ)^{(t - 1)}| ⩾ δ

and

1 ⩽ t ⩽ J

do

Σ^{(t)} \leftarrow {(\frac{E^{(t - 1)} [1 / σ]}{k_{2}} \sum_{i = 1}^{n} \frac{X_{i}^{τ} X_{i}}{E [e_{i}]} + E^{(t - 1)} [D_{τ}^{- 1}])}^{- 1};

μ^{(t)} \leftarrow Σ (\frac{E^{(t - 1)} [1 / σ]}{k_{2}} \sum_{i = 1}^{n} X_{i}^{T} (E^{(t - 1)} [\frac{1}{e_{i}}] (y_{i} - k_{1})));

a_{1 j}^{(t)} \leftarrow E^{(t - 1)} [γ_{j}] E^{(t - 1)} [β_{j}^{2}], b_{1 j}^{(t)} \leftarrow E^{(t - 1)} [γ_{j}] E^{(t - 1)} [λ_{1}^{2}];

a_{0 j}^{(t)} \leftarrow E^{(t - 1)} [(1 - γ_{j})] E^{(t - 1)} [β_{j}^{2}], b_{0 j}^{(t)} = E^{(t - 1)} [(1 - γ_{j})] E^{(t - 1)} [λ_{0}^{2}];

l o g i t {(ϕ_{j})}^{(t)} \leftarrow E^{(t - 1)} [\frac{1}{2} l o g (\frac{τ_{0 j}^{2}}{τ_{1 j}^{2}}) + \frac{β_{1}^{2}}{2} (\frac{1}{τ_{0 j}^{2}} - \frac{1}{τ_{1 j}^{2}}) + l o g (\frac{λ_{1}^{2}}{λ_{0}^{2}}) + \frac{λ_{1}^{2}}{2} τ_{0 j}^{2} - \frac{λ_{1}^{2}}{2} τ_{1 j}^{2} + l o g (\frac{π}{1 - π})]

;

a^{(t)} \leftarrow a_{0} + \sum_{j = 1}^{p} E^{(t - 1)} [γ_{j}], b^{(t)} \leftarrow b_{0} + p - \sum_{j = 1}^{p} E^{(t - 1)} [γ_{j}];

{\tilde{c}}_{1}^{(t)} \leftarrow c_{1} + \sum_{j = 1}^{p} E^{(t - 1)} [γ_{j}], {\tilde{d}}_{1}^{(t)} \leftarrow d_{1} + \sum_{j = 1}^{p} \frac{E^{(j - 1)} [γ_{j}] E^{(i - 1)} [r_{1}^{2}]}{2};

{\tilde{c}}_{0}^{(t)} \leftarrow c_{0} + p - \sum_{j = 1}^{p} E^{(t - 1)} [γ_{j}], {\tilde{d}}_{0}^{(t)} \leftarrow d_{0} + \sum_{j = 1}^{p} \frac{E^{(t - 1)} [(1 - γ_{j})] E^{(t - 1)} [τ_{0}^{2}]}{2};

a_{e}^{(t)} \leftarrow \frac{{(y_{1} - X_{i}^{T} E^{(t - 1)} [β])}^{2}}{σ k_{2}}, b_{e}^{(t)} \leftarrow ({k_{1}}^{2} + 2 k_{2}) / (σ k_{2});

a_{σ}^{(t)} \leftarrow \frac{3 n}{2} + a_{σ_{0}}, b_{σ}^{(t)} \leftarrow b_{σ_{0}} + \frac{1}{2 k_{2}} \sum_{i = 1}^{n} \frac{{(y_{i} - X_{i}^{T} E^{(t - 1)} [β] - k_{1} E^{(t - 1)} [e_{i}])}^{2}}{E^{[(- 1)} [e_{i}]} + \sum_{i = 1}^{n} E^{(t - 1)} [e_{i}];

Σ_{φ}^{(t)} \leftarrow {(W E^{(t - 1)} [d i a g (ω)] W^{T} + Σ_{φ_{0}}^{- 1})}^{- 1};

μ_{φ}^{(t)} \leftarrow Σ_{φ} (W (r - \frac{1}{2} 1_{n}) + Σ_{φ_{0}}^{- 1} μ_{φ_{0}});

{\hat{c}}_{i}^{(t)} \leftarrow \sqrt{E^{(t - 1)} [{(w_{i}^{T} φ)}^{2}]};

σ_{y}^{(t)} \leftarrow {(φ_{1}^{2} E^{(t - 1)} [ω_{i m}] + E^{(t - 1)} [\frac{1}{2 σ k_{2} e_{i m}}])}^{- 1};

μ_{y}^{(t)} \leftarrow σ_{y} (φ_{1} (r_{i m} - \frac{1}{2}) + E^{(t - 1)} [\frac{X_{m m}^{T} β + k_{1} e_{m m}}{σ k_{2} e_{i m}}]);

Update

E n t (ϕ)^{(t)}, t \leftarrow t + 1

;

end while

4. Simulation Study

In this section, we generate simulated data as follows:

y_{i} = X_{i}^{T} β + ε_{i}, i = 1, 2, \dots, n

Let

x_{i j}

is the jth element of

X_{i}

and

x_{i j}

∼(0, 1), ε_i are independently and identically distributed. We consider the following distributions for ε_i: (1) normal distribution ε_i∼(0, 1); (2) Cauchy distribution ε_i∼C(0, 1); (3) t-distribution with 3 degrees of freedom ε_i∼t(3). The values of β are set as follows:

Simulation 1: β = (3, 1.5, 0, 0, 2, 0, 0, 0), a sparse model;

Simulation 2: β = (5, 5, 5, 5, 5, 5, 5, 0), a dense model;

Simulation 3: β = (5, 0, 0, 0, 0, 0, 0, 0), an ultra-sparse model.

The missing data mechanism M₀

φ_{01} = φ_{11} = φ_{21} = \dots = φ_{28} = 0.1

l o g i t (p (r_{i} ∣ y_{i}, X_{i}, φ)) = 0.1 + 0.1 y_{i} + 0.1 x_{i 1} + \dots + 0.1 x_{i 8},

We generate n = 50 data points. To reduce the influence of priors on the results, the hyperparameters in the priors are set as

a_{0}

=

b_{0}

= 0.01;

a_{σ_{0}}

=

b_{σ_{0}}

= 0.01;

c_{1}

=

d_{1}

= 0.01;

c_{0}

=

d_{0}

= 0.01; μ_φ0 is a vector consisting of the means of non-missing data, and Σφ₀ is a diagonal matrix of the variances of non-missing data. We use the mean squared error (MSE) of β_j estimates and running time (in seconds, T) to measure the estimation accuracy and computational efficiency of each method. The running time is obtained using the tic and toc functions in the “tictoc” package in R (unit: seconds). For each setting, we conduct 100 simulations. The tables report the average results of 100 simulations, with standard deviations in parentheses. The quantiles considered are τ = 0.2, 0.5, 0.8. In R (version 4.3.1), there are no publicly available software packages for methods related to non-ignorable missing response variables. Therefore, we only present the parameter estimation and variable selection results of the algorithm proposed in this chapter.

Table 1, Table 2 and Table 3 show the average results of β estimates using the algorithm proposed in this chapter for Simulations 1, 2, and 3, respectively, when ε follows a normal distribution, Cauchy distribution, and t-distribution. The values in parentheses are the average MSE of β over 100 simulations. The results indicate that our algorithm can provide good estimates of β. When ε follows a Cauchy distribution, the estimation error of β is slightly larger, but the difference is not significant. This is consistent with the characteristics of the Cauchy distribution for ε.

In the algorithm proposed in this chapter, the values of ϕ = (ϕ₁, ⋯, ϕ_p) determine the variable selection. When ϕ_j ≥ 1/2, the j-th variable is selected; otherwise, the j-th variable is not selected. Table 4, Table 5 and Table 6 show the variable selection results when ε follows a normal distribution, Cauchy distribution, and t-distribution, respectively. A value of 1 indicates that the variable is selected, while 0 indicates that it is not selected. It can be observed that all important covariates are identified.

We stop the proposed algorithm when the entropy of ϕ no longer changes (or changes very little, less than a given threshold). Figure 2, Figure 3 and Figure 4 show the entropy calculation results of ϕ when ε follows a normal distribution, Cauchy distribution, and t-distribution, respectively. It can be seen that, in all simulations, our algorithm converges within 100 iterations, which is far fewer than the number of samples in sampling-based MCMC algorithms. Therefore, the algorithm we propose has high computational efficiency.

Variational Bayesian (VB) inference is often considered superior to MCMC in terms of speed, scalability, and efficiency. Unlike MCMC, which relies on iterative sampling and can be computationally expensive, VB transforms the inference problem into an optimization task, leading to faster and more predictable convergence. VB produces deterministic results, avoiding the Monte Carlo error inherent in MCMC, and is more memory-efficient as it does not require storing large numbers of posterior samples. It is particularly well-suited for large-scale models, where MCMC may struggle due to slow sampling and high computational costs. Additionally, VB tends to offer more interpretable posterior distributions by approximating them with simpler parametric families. However, while VB is faster and more scalable, it may underestimate uncertainty due to its reliance on approximations, whereas MCMC remains more flexible and accurate for complex posterior distributions.

5. Real Data Analysis

In this section, we analyze data from HIV-positive patients in the AIDS Clinical Trials Group (ACTG175) study [16], which can be obtained using the command data (ACTG175) in the R package “BART”. The ACTG175 dataset contains 27 variables, and 2139 HIV-positive patients were randomly assigned to the following four groups: (1) 532 received zidovudine treatment; (2) 522 received didanosine treatment; (3) 524 received a combination of zidovudine and didanosine; and (4) 561 received a combination of zidovudine and zalcitabine. These patients were monitored at weeks 2, 4, and 8 after the start of the experiment and then every 12 weeks thereafter. Monitoring ended when the CD4 T-cell count declined by 50% or more, or the patient died.

We are interested in the relationship between the dependent variable Y, the CD4 T-cell count at 96 ± 5 weeks, and the covariates age (X₁), weight (X₂), baseline CD4 T-cell count (X₃), CD4 T-cell count at 20 ± 5 weeks (X₄), baseline CD8 T-cell count (X₅), and CD8 T-cell count at 20 ± 5 weeks (X₆). Due to patient death or dropout, the dependent variable Y has missing records, with missing rates of 39.66%, 36.21%, 35.69%, and 37.43% for the four groups, respectively. Relevant medical research indicates that CD4 T-cell count is related to disease progression, and patients with lower CD4 T-cell counts are more likely to drop out of the study. This suggests that the missingness of Y (CD4 T-cell count at 96 ± 5 weeks) is related to the CD4 T-cell count. In summary, the missingness of Y is not random, and we can establish the following model:

Q_τ (y_i ∣ x_i) = β₁x_i1 + ⋯ + β₆x_i6, i = 1, ⋯, n.

p (r_{i} = 1 ∣ y_{i}, x_{i}, φ) = \frac{e^{φ_{0} + φ_{1} y_{i} + φ_{21} x_{i 1} + \dots + φ_{26} x_{i 6}}}{1 + e^{φ_{0} + φ_{1} y_{i} + φ_{21} x_{i 1} + \dots + φ_{26} x_{i 6}}}

Table 7 summarizes the coefficients and 95% confidence intervals for the quantile regression of the four treatment groups when = 0.5. The QR row shows the results of quantile regression estimation when there is no missing data. It can be seen that the proposed method effectively imputes missing data. Figure 5 shows that our algorithm achieves convergence in the analysis of the AIDS data. From the estimated coefficients of the quantile regression, age and weight factors do not significantly influence the observed CD4 T-cell count at 96 ± 5 weeks. The observed value at 20 ± 5 weeks has a significant positive effect on the CD4 T-cell count at 96 ± 5 weeks and is the main influencing factor for the 96 ± 5-week measurement. The influence of the baseline measurement at a longer interval on the 96 ± 5-week measurement is relatively small and not even significant in groups (1) and (2) (corresponding to variable selection with

ϕ < 1 / 2

). Interestingly, the CD8 T-cell count at 20 ± 5 weeks has a significant negative effect in groups (2) and (3), indicating that higher previous CD4 levels and lower CD8 levels contribute to an increase in CD4 T-cell count. This finding is similar to the use of the CD4/CD8 ratio as an indicator of antiretroviral therapy efficacy in existing studies, suggesting that a higher baseline ratio favors CD4 T-cell count recovery and immune function reconstruction. The results of this study also suggest that, in the mid-term assessment of HIV infection treatment efficacy, attention should be paid not only to CD4 T-cell levels but also to CD8 T-cell levels. With unchanged baseline CD4 levels, a low CD4/CD8 ratio may negatively impact long-term treatment outcomes.

6. Conclusions

In this paper, we consider a Bayesian quantile regression model with non-ignorable missing response data, where the non-ignorable missing mechanism is specified by a logistic function. The model parameters of the quantile regression are assigned a spike-and-slab LassoLASSO prior, which enables effective parameter estimation and variable selection without requiring further processing of the parameter posteriors. Bayesian variable selection based on the spike-and-slab prior is computationally expensive, and variational Bayes is a popular Bayesian computational method. Since the logistic function does not have a conjugate prior, we introduce a new Pólya-Gamma variable in the logistic function, making the parameter posteriors of the logistic function conjugate. Finally, we propose a variational algorithm for Bayesian quantile regression with non-ignorable missing response data, which performs well in both simulation studies and real data analysis.

Author Contributions

J.Z. and W.W. contributed equally to this work. They were both involved in the development of the methodology, data analysis, and drafting of the manuscript. M.T. served as the corresponding author, providing guidance on the research design, supervising the study, and revising the manuscript critically for important intellectual content. All authors have read and agreed to the published version of the manuscript.

Funding

The research is supported by Guangzhou Huashang College Project: 2023HSDS25 and Beijing Natural Science Foundation (1242005).

Data Availability Statement

The researchers can be obtained using the command data (ACTG175) in the R package “BART”.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Denote

D_{τ} = diag (τ_{j}^{2} = γ_{j} τ_{1 j}^{2} + (1 - γ_{j}) τ_{0 j}^{2}

. The variational posterior of β is

q (β) \propto e x p \{E_{- β} [- \frac{1}{2 σ k_{2}} \sum_{i = 1}^{n} \frac{{(y_{i} - X_{i}^{T} β - k_{1} e_{i})}^{2}}{e_{i}} - \frac{1}{2} β^{T} D_{τ}^{- 1} β]\}

Then,

q (β) \sim N (μ, Σ)

, where

\begin{matrix} Σ = {(\frac{E [1 / σ]}{k_{2}} \sum_{i = 1}^{n} \frac{X_{i}^{T} X_{i}}{E [e_{i}]} + E [D_{τ}^{- 1}])}^{- 1}, \\ μ = Σ (\frac{E [1 / σ]}{k_{2}} \sum_{i = 1}^{n} X_{i}^{T} (E [\frac{1}{e_{i}}] (y_{i} - k_{1}))) . \end{matrix}

The Variationalvariational posterior of

τ_{1 j}^{2}

is

q (τ_{1 j}^{2}) \propto e x p \{E_{- τ_{1 j}^{2}} [γ_{j} (- \frac{1}{2} l o g (τ_{1 j}^{2}) - \frac{β_{j}^{2}}{2} \frac{1}{τ_{1 j}^{2}} - \frac{λ_{1}^{2}}{2} τ_{1 j}^{2})]\}

Then,

q (τ_{1 j}^{2}) \sim G I G (\frac{1}{2}, a_{1 j}, b_{1 j})

,

G I G (\cdot, \cdot, \cdot)

is Generalized Inverse Gaussian distribution. Where, where

a_{1 j} = E [γ_{j}] E [β_{j}^{2}], b_{1 j} = E [γ_{j}] E [λ_{1}^{2}]

The Variationalvariational posterior of

τ_{0 j}^{2}

is

q (τ_{0 j}^{2}) \propto e x p \{E_{- τ_{0 j}^{2}} [(1 - γ_{j}) (- \frac{1}{2} l o g (τ_{0 j}^{2}) - \frac{β_{j}^{2}}{2} \frac{1}{τ_{0 j}^{2}} - \frac{λ_{0}^{2}}{2} τ_{0 j}^{2})]\},

Then,

q (τ_{0 j}^{2}) \sim G I G (\frac{1}{2}, a_{0 j}, b_{0 j})

, where

a_{0 j} = E [(1 - γ_{j})] E [β_{j}^{2}], b_{0 j} = E [(1 - γ_{j})] E [λ_{0}^{2}]

The Variationalvariational posterior of

γ_{j}

is

q (γ_{j}) \propto \exp \{E_{- γ_{j}} [γ_{j} (\frac{1}{2} \log (\frac{τ_{0 j}^{2}}{τ_{1 j}^{2}}) + \frac{{β_{j}}^{2}}{2} (\frac{1}{τ_{0 j}^{2}} - \frac{1}{τ_{1 j}^{2}}) + \log (\frac{λ_{1}^{2}}{λ_{0}^{2}}) + \frac{λ_{0}^{2}}{2} τ_{0 j}^{2} - \frac{λ_{1}^{2}}{2} τ_{1 j}^{2} + \log (\frac{π}{1 - π}))]\}

Then,

q (γ_{j}) \sim Bernoulli (ϕ_{j})

,

E [γ_{j}] = ϕ_{j}

, where

ϕ_{j} = σ (E [\frac{1}{2} \log (\frac{τ_{0 j}^{2}}{τ_{1 j}^{2}}) + \frac{{β_{j}}^{2}}{2} (\frac{1}{τ_{0 j}^{2}} - \frac{1}{τ_{1 j}^{2}}) + \log (\frac{λ_{1}^{2}}{λ_{0}^{2}}) + \frac{λ_{0}^{2}}{2} τ_{0 j}^{2} - \frac{λ_{1}^{2}}{2} τ_{1 j}^{2} + \log (\frac{π}{1 - π})])

The Variationalvariational posterior of

π

is

q (π) \propto \exp \{E_{- π} [(a_{0} - 1 + \sum_{j = 1}^{p} γ_{j}) \log π + (b_{0} - 1 + p - \sum_{j = 1}^{p} γ_{j}) \log (1 - π)]\}

Then,

q (π) ~ Beta (a, b)

, where

a = a_{0} + \sum_{j = 1}^{p} E [γ_{j}] {, b = b}_{0} + p - \sum_{j = 1}^{p} E [γ_{j}],

The Variationalvariational posterior of

λ_{1}^{2}

is

q (λ_{1}^{2}) \propto \exp \{E_{- λ_{1}^{2}} [(c_{1} - 1 + \sum_{j = 1}^{p} γ_{j}) \log λ_{1}^{2} - (d_{1} + \sum_{j = 1}^{p} \frac{{γ_{j} τ}_{1 j}^{2}}{2} λ_{1}^{2})]\}

Then,

q (λ_{1}^{2}) ~ Γ (\tilde{c_{1}}, \tilde{d_{1}})

, where

\tilde{c_{1}} = c_{1} + \sum_{j = 1}^{p} E [γ_{j}], \tilde{d_{1}} = d_{1} + \sum_{j = 1}^{p} \frac{E [γ_{j}] E [τ_{1 j}^{2}]}{2} .

The Variationalvariational posterior of

λ_{0}^{2}

is

q (λ_{0}^{2}) \propto \exp \{E_{- λ_{0}^{2}} [(c_{0} - 1 + p - \sum_{j = 1}^{p} γ_{j}) \log λ_{0}^{2} - (d_{0} + \sum_{j = 1}^{p} \frac{(1 - γ_{j}) τ_{0 j}^{2}}{2} λ_{0}^{2})]\}

Then,

q (λ_{0}^{2}) ~ Γ (\tilde{c_{0}}, \tilde{d_{0}})

, where

\tilde{c_{0}} = c_{0} + p - \sum_{j = 1}^{p} E [γ_{j}], \tilde{d_{0}} = d_{0} + \sum_{j = 1}^{p} \frac{E [{(1 - γ}_{j})] E [τ_{0 j}^{2}]}{2},

The Variationalvariational posterior of

e_{i}

is

q (e_{i}) \propto \exp \{E_{- e_{i}} [- \frac{1}{2} \log (e_{i}) - \frac{1}{2 σ k_{2}} \frac{{(y_{i} - {X_{i}}^{T} β - k_{1} e_{i})}^{2}}{e_{i}} - \frac{e_{i}}{σ}]\}

Then,

q (e_{i}) ~ G I G (\frac{1}{2}, a_{e} = \frac{{(y_{i} - {X_{i}}^{T} E [β])}^{2}}{σ k_{2}}, b_{e} = ({k_{1}}^{2} + 2 k_{2}) / (σ k_{2}))

.

The Variationalvariational posterior of

σ

:

q (σ) \propto e x p {E_{- σ} [- \frac{n}{2} \log (σ) - \frac{1}{2 σ k_{2}} \sum_{i = 1}^{n} \frac{{(y_{i} - {X_{i}}^{T} β - k_{1} e_{i})}^{2}}{e_{i}} - n \log (σ) - \sum_{i = 1}^{n} \frac{e_{i}}{σ} - (r_{0} + 1) \log (σ) - \frac{s_{0}}{σ}]}

Then,

q (σ) ~ I G (a_{σ} = \frac{3 n}{2} + r_{0}, b_{σ})

, where

b_{σ} = s_{0} + \frac{1}{2 k_{2}} \sum_{i = 1}^{n} \frac{{(y_{i} - {X_{i}}^{T} E [β] - k_{1} E [e_{i}])}^{2}}{E [e_{i}]} + \sum_{i = 1}^{n} E [e_{i}]

The Variationalvariational posterior of

φ

is

q (φ) \propto \exp \{E_{- φ} [\sum_{i = 1}^{n} (r_{i} - \frac{1}{2}) ({W_{i}}^{T} φ) - \frac{ω_{i}}{2} {({W_{i}}^{T} φ)}^{2} - \frac{1}{2} {(ϕ - μ_{φ_{0}})}^{T} Σ_{φ_{0}}^{- 1} (ϕ - μ_{φ_{0}})]\},

Then,

q (φ) ~ N (μ_{φ}, Σ_{φ})

, where

Σ_{φ} = {(W E [d i a g (ω)] W^{T} + Σ_{φ_{0}}^{- 1})}^{- 1}, μ_{φ} = Σ_{φ} (W (r - \frac{1}{2} 1_{n}) + Σ_{φ_{0}}^{- 1} μ_{φ_{0}}), r = {(r_{1}, \dots, r_{n})}^{T}, 1_{n} = \underset{n o n e s}{\underset{⏟}{{(1, 1, \dots, 1)}^{T}}} .

The Variationalvariational posterior of

ω_{i}

is

q (ω_{i}) \propto \exp \{E_{- ω_{i}} [- \frac{ω_{i}}{2} {({W_{i}}^{T} φ)}^{2}]\} p (ω_{i} ∣ 1, 0),

Then,

q (ω_{i}) ~ P G (1, {\hat{c}}_{i})

,

{\hat{c}}_{i} = \sqrt{E [{({W_{i}}^{T} φ)}^{2}]} .

For a missing individual

i

, denoted as

y_{i m}

, suppose there are

n_{m}

individuals missing in total, and the Variationalvariational posterior is

q (y_{i m}) \propto e x p \{E_{- y_{i m}} [(r_{i m} - \frac{1}{2}) (y_{i m} φ_{1}) - \frac{ω_{i m}}{2} {(y_{i m} φ_{1})}^{2} - \frac{{(y_{i m} - X_{i m}^{T} β - k_{1} e_{i m})}^{2}}{2 σ k_{2} e_{i m}}]\}

Then,

q (y_{i m}) \sim N (μ_{y}, σ_{y})

, where

σ_{y} = {(φ_{1}^{2} E [ω_{i m}] + E [\frac{1}{2 σ k_{2} e_{i m}}])}^{- 1}, μ_{y} = σ_{y} (φ_{1} (r_{i m} - \frac{1}{2})) .

References

Rubin, D.B. Inference and missing data. Biometrika 1976, 63, 581–592. [Google Scholar] [CrossRef]
Geman, S.; Geman, D. Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Trans. Pattern Anal. Mach. Intell. 1984, 6, 721–741. [Google Scholar] [CrossRef] [PubMed]
Hastings, W.K. Monte Carlo Sampling Methods Using Markov Chains and Their Applications; Oxford University Press: Oxford, UK, 1970; pp. 97–103. [Google Scholar]
Metropolis, N.; Rosenbluth, A.W.; Rosenbluth, M.N.; Teller, A.H.; Teller, E. Equation of State Calculations by Fast Computing Machines. J. Chem. Phys. 1953, 21, 1087–1092. [Google Scholar] [CrossRef]
Ibrahim, J.G.; Chen, M.H.; Lipsitz, S.R.; Herring, A.H. Missing-Data Methods for Generalized Linear Models: A Comparative Review. J. Am. Stat. Assoc. 2005, 100, 332–346. [Google Scholar] [CrossRef]
Lee, S.Y. Bayesian analysis of nonlinear structural equation models with nonignorable missing data. Psychometrika 2006, 71, 541–564. [Google Scholar] [CrossRef]
Tang, N.S.; Zhao, H. Bayesian analysis of nonlinear reproductive dispersion mixed models for longitudinal data with nonignorable missing covariates. Commun. Stat.-Simul. Comput. 2014, 43, 1265–1287. [Google Scholar] [CrossRef]
Xu, D.; Tang, N. Bayesian adaptive Lasso for quantile regression models with nonignorably missing response data. Commun. Stat.-Simul. Comput. 2019, 48, 2727–2742. [Google Scholar] [CrossRef]
Ročková, V.; George, E.I. The spike-and-slab lasso. J. Am. Stat. Assoc. 2018, 113, 431–444. [Google Scholar] [CrossRef]
Polson, N.G.; Scott, J.G.; Windle, J. Bayesian inference for logistic models using Pólya–Gamma latent variables. J. Am. Stat. Assoc. 2013, 108, 1339–1349. [Google Scholar] [CrossRef]
Jaakkola, T.S.; Jordan, M.I. Bayesian parameter estimation via variational methods. Stat. Comput. 2000, 10, 25–37. [Google Scholar] [CrossRef]
Li, X.; Tuerde, M.; Hu, X. Variational Bayesian Inference for Quantile Regression Models with Nonignorable Missing Data. Mathematics 2023, 11, 3926. [Google Scholar] [CrossRef]
Ray, K.; Szabó, B.; Clara, G. Spike and slab variational Bayes for high dimensional logistic regression. Adv. Neural Inf. Process. Syst. 2020, 33, 14423–14434. [Google Scholar]
Kozumi, H.; Kobayashi, G. Gibbs sampling methods for Bayesian quantile regression. J. Stat. Comput. Simul. 2011, 81, 1565–1578. [Google Scholar] [CrossRef]
Blei, D.M.; Kucukelbir, A.; McAuliffe, J.D. Variational Bayesian inference: A review for statisticians. J. Am. Stat. Assoc. 2017, 112, 859–877. [Google Scholar] [CrossRef]
Hammer, S.M.; Katzenstein, D.A.; Hughes, M.D.; Gundacker, H.; Schooley, R.T.; Haubrich, R.H.; Henry, W.K.; Lederman, M.M.; Phair, J.P.; Niu, M.; et al. A Trial Comparing Nucleoside Monotherapy with Combination Therapy in HIV-Infected Adults with CD4 Cell Counts from 200 to 500 per Cubic Millimeter. N. Engl. J. Med. 1996, 335, 1081–1090. [Google Scholar] [CrossRef] [PubMed]

Figure 1. Four types of spike-and-slab priors.

Figure 2. Convergence iterations of the VB algorithm for 3 simulations when ε∼(0, 1). (a) Number of iterations in Simulation 1; (b) Number of iterations in Simulation 2; (c) Number of iterations in Simulation 3.

Figure 3. Convergence iterations of the VB algorithm for 3 simulations when ε∼(0, 1). (a) Number of iterations in Simulation 1; (b) Number of iterations in Simulation 2; (c) Number of iterations in Simulation 3.

Figure 4. Convergence iterations of the VB algorithm for 3 simulations when ε∼(3). (a) Number of iterations in Simulation 1; (b) Number of iterations in Simulation 2; (c) Number of iterations in Simulation 3.

Figure 5. Convergence iterations of the VB algorithm for the AIDS data. (a) Number of iterations for Group (1); (b) Number of iterations for Group (2); (c) Number of iterations for Group (3); (d) Number of iterations for Group (4).

Table 1. Comparison of parameter estimation and running time for 3 simulations when ε∼N(0, 1).

Simulation	τ	$\hat{β}$ ₁	$\hat{β}$ ₂	$\hat{β}$ ₃	$\hat{β}$ ₄	$\hat{β}$ ₅	$\hat{β}$ ₆	$\hat{β}$ ₇	$\hat{β}$ ₈	T
Simulation 1	β	3.000	1.500	0.000	0.000	2.000	0.000	0.000	0.000
	τ = 0.2	3.492	1.036	0.153	0.159	2.374	−0.460	−0.246	−0.374	0.041
		(0.242)	(0.214)	(0.023)	(0.025)	(0.140)	(0.211)	(0.060)	(0.139)
	τ = 0.5	2.986	1.478	−0.008	0.002	2.058	0.047	−0.028	0.042	0.039
		(0.059)	(0.022)	(0.034)	(0.051)	(0.033)	(0.071)	(0.045)	(0.027)
	τ = 0.8	2.696	1.294	−0.004	0.308	1.735	−0.075	−0.174	−0.068	0.040
		(0.517)	(0.728)	(0.611)	(0.615)	(0.429)	(0.390)	(0.381)	(0.533)
Simulation 2	β	5.000	5.000	5.000	5.000	5.000	5.000	5.000	0.000
	τ = 0.2	5.015	4.808	5.044	4.985	5.005	4.904	4.977	−0.001	0.012
		(0.466)	(0.564)	(0.238)	(0.317)	(0.384)	(0.502)	(0.433)	(0.205)
	τ = 0.5	4.978	4.947	4.995	4.956	4.993	4.979	5.013	0.006	0.011
		(0.019)	(0.022)	(0.019)	(0.029)	(0.019)	(0.024)	(0.033)	(0.027)
	τ = 0.8	5.046	4.864	4.899	4.774	5.002	4.971	4.874	0.107	0.012
		(0.503)	(0.389)	(0.325)	(0.346)	(0.360)	(0.364)	(0.428)	(0.285)
Simulation 3	β	5.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
	τ = 0.2	5.312	−0.169	−0.045	−0.144	0.445	0.030	−0.215	0.073	0.019
		(0.526)	(0.410)	(0.413)	(0.285)	(0.437)	(0.109)	(0.320)	(0.209)
	τ = 0.5	4.993	0.009	−0.054	−0.015	−0.115	−0.063	−0.008	−0.019	0.012
		(0.010)	(0.008)	(0.013)	(0.023)	(0.016)	(0.019)	(0.022)	(0.010)
	τ = 0.8	5.025	−0.242	0.080	−0.038	0.042	−0.058	−0.109	−0.241	0.019
		(0.326)	(0.582)	(0.142)	(0.324)	(0.153)	(0.430)	(0.611)	(0.510)

Table 2. Comparison of parameter estimation and running time for 3 simulations when ε∼C(0, 1).

Simulation	τ	$\hat{β}$ ₁	$\hat{β}$ ₂	$\hat{β}$ ₃	$\hat{β}$ ₄	$\hat{β}$ ₅	$\hat{β}$ ₆	$\hat{β}$ ₇	$\hat{β}$ ₈	T
Simulation 1	β	3.000	1.500	0.000	0.000	2.000	0.000	0.000	0.000
	τ = 0.2	2.890	1.360	0.186	−0.010	2.157	−0.029	−0.255	0.178	0.024
		(0.621)	(0.210)	(0.343)	(0.267)	(0.137)	(0.199)	(0.170)	(0.219)
	τ = 0.5	3.032	1.591	−0.224	−0.071	2.255	−0.304	0.284	−0.246	0.029
		(0.291)	(0.317)	(0.131)	(0.299)	(0.137)	(0.129)	(0.331)	(0.229)
	τ = 0.8	3.037	1.872	−0.404	−0.182	2.122	0.242	−0.194	0.099	0.023
		(0.259)	(0.262)	(0.433)	(0.425)	(0.137)	(0.129)	(0.323)	(0.219)
Simulation 2	β	5.000	5.000	5.000	5.000	5.000	5.000	5.000	0.000
	τ = 0.2	4.752	4.825	4.778	5.298	4.587	4.818	4.742	0.321	0.016
		(0.225)	(0.310)	(0.399)	(0.255)	(0.324)	(0.119)	(0.390)	(0.219)
	τ = 0.5	4.645	4.769	4.544	4.642	4.774	5.281	4.908	4.945	0.018
		(0.276)	(0.218)	(0.233)	(0.295)	(0.247)	(0.112)	(0.184)	(0.105)
	τ = 0.8	5.182	5.543	4.750	5.552	4.931	5.339	4.718	0.280	0.018
		(0.326)	(0.433)	(0.447)	(0.215)	(0.337)	(0.179)	(0.290)	(0.219)
Simulation 3	β	5.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
	τ = 0.2	4.970	0.018	−0.243	−0.154	0.230	0.244	−0.238	0.251	0.021
		(0.373)	(0.220)	(0.383)	(0.275)	(0.267)	(0.119)	(0.391)	(0.119)
	τ = 0.5	5.411	−0.117	0.185	0.238	0.355	0.163	0.193	201	0.020
		(0.284)	(0.310)	(0.217)	(0.195)	(0.197)	(0.199)	(0.110)	(0.199)
	τ = 0.8	4.879	0.119	0.295	0.423	0.148	0.253	0.165	0.114	0.026
		(0.371)	(0.388)	(0.313)	(0.385)	(0.237)	(0.169)	(0.333)	(0.288)

Table 3. Comparison of MSE and running time for 3 simulations when ε∼t(3).

Simulation	τ	$\hat{β}$ ₁	$\hat{β}$ ₂	$\hat{β}$ ₃	$\hat{β}$ ₄	$\hat{β}$ ₅	$\hat{β}$ ₆	$\hat{β}$ ₇	$\hat{β}$ ₈	T
Simulation 1	β	3.000	1.500	0.000	0.000	2.000	0.000	0.000	0.000
	τ = 0.2	2.923	1.449	0.145	0.210	1.762	0.324	0.181	0.149	0.037
		(0.164)	(0.120)	(0.213)	(0.185)	(0.125)	(0.129)	(0.410)	(0.189)
	τ = 0.5	3.018	1.409	0.012	−0.222	2.020	−0.036	0.056	0.161	0.036
		(0.186)	(0.220)	(0.198)	(0.117)	(0.237)	(0.119)	(0.298)	(0.119)
	τ = 0.8	3.245	1.418	0.129	−0.176	2.195	0281	−0.124	0.115	0.022
		(0.121)	(0.210)	(0.213)	(0.285)	(0.237)	(0.109)	(0.320)	(0.209)
Simulation 2	β	5.000	5.000	5.000	5.000	5.000	5.000	5.000	0.000
	τ = 0.2	4.850	5.362	4.987	4.863	5.081	5.079	4.552	0.171	0.007
		(0.123)	(0.245)	(0.265)	(0.134)	(0.239)	(0.221)	(0.179)	(0.238)
	τ = 0.5	4.997	4.857	4.952	4.930	5.007	5.028	5.028	−0.074	0.011
		(0.298)	(0.210)	(0.256)	(0.235)	(0.277)	(0.219)	(0.335)	(0.349)
	τ = 0.8	4.812	4.977	4.577	4.737	4.793	5.020	4.746	0.210	0.012
		(0.253)	(0.203)	(0.323)	(0.335)	(0.177)	(0.187)	(0.289)	(0.219)
Simulation 3	β	5.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
	τ = 0.2	5.151	−0.127	−0.029	0.022	0.039	−0.129	0.014	−0.134	0.016
		(0.253)	(0.137)	(0.128)	(0.134)	(0.267)	(0.214)	(0.197)	(0.298)
	τ = 0.5	5.171	0.029	−0.086	0.084	−0.097	−0.201	−0.474	0.277	0.012
		(0.203)	(0.110)	(0.278)	(0.235)	(0.117)	(0.279)	(0.220)	(0.289)
	τ = 0.8	5.256	0.306	−0.200	−0.012	−0.005	0.254	−0.288	0.136	0.017
		(0.184)	(0.212)	(0.196)	(0.187)	(0.117)	(0.109)	(0.120)	(0.109)

Table 4. Variable selection results for 3 simulations when ε∼N(0, 1).

Simulation	τ	β₁	β₂	β₃	β₄	β₅	β₆	β₇	β₈
Simulation 1	β	3.000	1.500	0.000	0.000	2.000	0.000	0.000	0.000
	τ = 0.2	1	1	0	0	1	0	0	0
	τ = 0.5	1	1	0	0	1	0	0	0
	τ = 0.8	1	1	0	0	1	0	0	0
Simulation 2	β	5.000	5.000	5.000	5.000	5.000	5.000	5.000	0.000
	τ = 0.2	1	1	1	1	1	1	1	0
	τ = 0.5	1	1	1	1	1	1	1	0
	τ = 0.8	1	1	1	1	1	1	1	0
Simulation 3	β	5.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
	τ = 0.2	1	0	0	0	0	0	0	0
	τ = 0.5	1	0	0	0	0	0	0	0
	τ = 0.8	1	0	0	0	0	0	0	0

Table 5. Variable selection results for 3 simulations when ε∼C(0, 1).

Simulation	τ	β₁	β₂	β₃	β₄	β₅	β₆	β₇	β₈
Simulation 1	β	3.000	1.500	0.000	0.000	2.000	0.000	0.000	0.000
	τ = 0.2	1	1	0	0	1	0	0	0
	τ = 0.5	1	1	0	0	1	0	0	0
	τ = 0.8	1	1	0	0	1	0	0	0
Simulation 2	β	5.000	5.000	5.000	5.000	5.000	5.000	5.000	0.000
	τ = 0.2	1	1	1	1	1	1	1	0
	τ = 0.5	1	1	1	1	1	1	1	0
	τ = 0.8	1	1	1	1	1	1	1	0
Simulation 3	β	5.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
	τ = 0.2	1	0	0	0	0	0	0	0
	τ = 0.5	1	0	0	0	0	0	0	0
	τ = 0.8	1	0	0	0	0	0	0	0

Table 6. Variable selection results for 3 simulations when ε∼t(3).

Simulation	τ	β₁	β₂	β₃	β₄	β₅	β₆	β₇	β₈
Simulation 1	β	3.000	1.500	0.000	0.000	2.000	0.000	0.000	0.000
	τ = 0.2	1	1	0	0	1	0	0	0
	τ = 0.5	1	1	0	0	1	0	0	0
	τ = 0.8	1	1	0	0	1	0	0	0
Simulation 2	β	5.000	5.000	5.000	5.000	5.000	5.000	5.000	0.000
	τ = 0.2	1	1	1	1	1	1	1	0
	τ = 0.5	1	1	1	1	1	1	1	0
	τ = 0.8	1	1	1	1	1	1	1	0
Simulation 3	β	5.000	0.000	0.000	0.000	0.000	0.000	0.000	0.000
	τ = 0.2	1	0	0	0	0	0	0	0
	τ = 0.5	1	0	0	0	0	0	0	0
	τ = 0.8	1	0	0	0	0	0	0	0

Table 7. Results of the AIDS data analysis.

		β₁	β₂	β₃	β₄	β₅	β₆
(1)	QR	0.050	−0.028	0.128	0.531	−0.008	−0.107
	Est	0.052	−0.030	0.134	0.550	−0.010	−0.111
	CI	(−0.060, 0.134)	(−0.158, 0.045)	(−0.040, 0.204)	(0.453, 0.612)	(−0.135, 0.090)	(−0.195, 0.019)
(2)	QR	0.048	−0.035	0.125	0.652	0.122	−0.166
	Est	0.055	−0.031	0.129	0.620	0.125	−0.169
	CI	(−0.041, 0.141)	(−0.105, 0.043)	(−0.002, 0.200)	(0.519, 0.650)	(0.005, 0.274)	(−0.301, −0.139)
(3)	QR	0.040	0.081	0.201	0.513	0.060	−0.211
	Est	0.036	0.079	0.196	0.507	0.062	−0.209
	CI	(−0.054, 0.121)	(−0.003, 0.159)	(0.112, 0.309)	(0.398, 0.614)	(−0.070, 0.227)	(−0.337, −0.098)
(4)	QR	0.059	0.064	0.271	0.515	−0.095	−0.063
	Est	0.065	0.062	0.262	0.511	−0.098	−0.061
	CI	(−0.030, 0.149)	(−0.038, 0.152)	(0.131, 0.465)	(0.398, 0.615)	(−0.188, 0.037)	(−0.187, 0.069)

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Zhang, J.; Wang, W.; Tian, M. Variational Bayesian Quantile Regression with Non-Ignorable Missing Response Data. Axioms 2025, 14, 408. https://doi.org/10.3390/axioms14060408

AMA Style

Zhang J, Wang W, Tian M. Variational Bayesian Quantile Regression with Non-Ignorable Missing Response Data. Axioms. 2025; 14(6):408. https://doi.org/10.3390/axioms14060408

Chicago/Turabian Style

Zhang, Juanjuan, Weixian Wang, and Maozai Tian. 2025. "Variational Bayesian Quantile Regression with Non-Ignorable Missing Response Data" Axioms 14, no. 6: 408. https://doi.org/10.3390/axioms14060408

APA Style

Zhang, J., Wang, W., & Tian, M. (2025). Variational Bayesian Quantile Regression with Non-Ignorable Missing Response Data. Axioms, 14(6), 408. https://doi.org/10.3390/axioms14060408

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Variational Bayesian Quantile Regression with Non-Ignorable Missing Response Data

Abstract

1. Introduction

2. Model Prior and Variational Bayesian Logistic Regression

2.1. Spike-and-Slab LassoLASSO Prior

2.2. Bayesian Logistic Regression Based on Pólya-Gamma Latent Variables

2.3. Bayesian Logistic Regression Based on Lower-Bound Approximation

3. Variational Bayesian Algorithm for Bayesian Quantile Regression with Non-Ignorable Missing Responses

3.1. Hierarchical Model and Prior

3.2. Coordinate Ascent Variational Inference Algorithms

3.3. Variational Posterior

4. Simulation Study

5. Real Data Analysis

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Appendix A

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI