1. Introduction
In many application fields, such as economics, sociology, and biomedicine, some subjects may have missing responses or predictors due to various reasons, including study dropout, unwillingness of study participants to answer certain questions in the questionnaire, and information loss caused by uncontrollable factors. Statistical inference for missing data problems is quite challenging.
Rubin [1] categorized missing data into three mechanisms: missing completely at random (MCAR), where the missingness process is independent of the observed and missing quantities; missing at random (MAR), where the missingness process depends on the observed quantities but not on the missing quantities; and non-ignorable missingness, or not missing at random (NMAR), where the missingness process depends on both the observed and missing quantities. In missing data analysis, the NMAR assumption may be more reasonable than the classical MAR assumption.
Bayesian parameter estimation methods are often used to address the estimation problem of parametric models involving missing data, for several reasons. First, Markov chain Monte Carlo (MCMC) methods widely used in statistical computing, such as the Gibbs algorithm [2] and the Metropolis–Hastings (MH) algorithm [3,4], can be employed to estimate the posterior distributions of parameters, nonparametric functions, and missing data. Second, compared to the setting without missing data, Bayesian methods with missing data require only an additional step in the Gibbs sampler; therefore, Bayesian methods can easily handle missing data without the need for new statistical inference techniques [5]. Third, prior information can be directly incorporated into the analysis, resulting in more accurate parameter estimation when good prior assumptions are available. Fourth, sampling-based Bayesian methods do not rely on asymptotic theory and may provide more reliable statistical inference even in small-sample situations. In recent years, there have been many studies on NMAR data analysis, such as those by Lee and Tang [6], Tang and Zhao [7], and Xu and Tang [8].
Furthermore, variable selection can be viewed as a special case of model selection, which is achieved through spike-and-slab priors in Bayesian variable selection. This paper adopts the spike-and-slab LASSO prior [9] for parameter estimation and variable selection. The missingness mechanism of the response variable is modeled through logistic regression, and variational Bayesian algorithms are used for model parameter estimation. In computing the variational posteriors, because logistic regression lacks a conjugate prior, the variational posteriors cannot be obtained by simply specifying a convenient variational family. Introducing Pólya-Gamma latent variables [10] and employing a lower-bound approximation [11] are two methods that yield conjugate posteriors for Bayesian logistic regression; in the variational Bayesian posterior computation, the two methods lead to the same variational posteriors. Moreover, owing to the characteristics of the spike-and-slab prior, it is unnecessary to compute the complex variational lower bound, and the algorithm still converges.
In this study, we propose a variational Bayesian quantile regression algorithm to address the challenges posed by non-ignorable missing responses. Unlike traditional sampling-based methods, the variational Bayesian approach offers an efficient and scalable solution by transforming posterior inference into an optimization task, ensuring faster convergence and reducing the computational burden. The quantile regression framework provides a robust alternative to mean-based models, capturing the conditional distribution of the response variable across different quantiles, which is particularly valuable when the data exhibit heteroscedasticity or skewness. While the paper by Li [12] also considers missing covariates and response variables, it does not address variable selection; in our work, we employ a Bayesian shrinkage (spike-and-slab) prior that enables effective variable screening alongside parameter estimation while accounting for the missing data mechanism. Additionally, the convergence criterion of our algorithm differs from Li's: Li uses the minimal change in the variational lower bound as the stopping condition, which involves more complex computations, whereas our approach is more computationally efficient. This combination of variational inference, quantile regression, and variable selection not only enhances estimation accuracy but also offers a flexible and computationally efficient tool for analyzing complex missing data structures. These features highlight the novelty and practical relevance of our approach.
This article is organized as follows: Section 2 introduces the model, the priors, and variational Bayesian logistic regression; Section 3 proposes the corresponding variational Bayesian algorithm for data with non-ignorable missing responses; Section 4 conducts simulation studies of the proposed algorithm; Section 5 applies the algorithm to real data analysis; and relevant conclusions are presented in Section 6.
2. Model, Prior, and Variational Bayesian Logistic Regression
In this paper, the response variable $y_i$, $i = 1, \dots, n$, may be missing, while all covariates (explanatory variables) $x_i = (x_{i1}, \dots, x_{ip})^T$ are completely observable. The incomplete observations are as follows:
$$\{(y_i, x_i, r_i),\ i = 1, \dots, n\},$$
where the indicator $r_i$ determines whether $y_i$ is missing: when $r_i = 1$, $y_i$ is missing. Let $Y = (Y_{\mathrm{obs}}, Y_{\mathrm{mis}})$, where $Y_{\mathrm{obs}}$ and $Y_{\mathrm{mis}}$ represent the observed response variables and the missing response variables, respectively. Let $p(r_i \mid y_i, x_i, \varphi)$ denote the conditional distribution of $r_i$ given $y_i$, $x_i$, and $\varphi$, where $\varphi$ is the unknown parameter vector in this conditional probability function. The missingness mechanism of the data is completely determined by this conditional distribution.
We consider the following non-ignorable missingness mechanism:
$$P(r_i = 1 \mid y_i, x_i, \varphi) = \pi(y_i, x_i; \varphi),$$
where $\pi(y_i, x_i; \varphi)$ can be modeled through logistic regression:
$$\mathrm{logit}\, \pi(y_i, x_i; \varphi) = z_i^T \varphi,$$
where $\varphi = (\varphi_0, \varphi_1, \dots, \varphi_{p+1})^T$ are the logistic regression model parameters, and $z_i = (1, y_i, x_i^T)^T$ are the covariates of the logistic regression model.
We now state the quantile regression model with non-ignorable missing responses. The error $\varepsilon_i$ is constrained such that its $\tau$-th quantile equals zero. The $\tau$-th quantile regression model is expressed as follows:
$$y_i = x_i^T \beta + \varepsilon_i, \qquad i = 1, \dots, n,$$
where $\beta = (\beta_1, \dots, \beta_p)^T$ are the quantile regression model parameters to be estimated, and the response $y_i$ may be missing.
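For intuition, the $\tau$-th quantile is the value minimizing the expected check loss $\rho_\tau(u) = u\{\tau - I(u < 0)\}$. The short R sketch below (our illustration, not code from the paper) verifies this numerically for a standard normal error.

```r
# Quantile check loss rho_tau(u) = u * (tau - I(u < 0)); minimizing its
# expectation over b recovers the tau-th quantile of the error.
check_loss <- function(u, tau) u * (tau - (u < 0))

# Illustration: for N(0,1) errors, the minimizer of the average check loss
# approximates the tau-th quantile qnorm(tau).
set.seed(1)
eps <- rnorm(1e5)
tau <- 0.8
b_grid <- seq(-3, 3, by = 0.01)
risk <- sapply(b_grid, function(b) mean(check_loss(eps - b, tau)))
b_grid[which.min(risk)]   # close to qnorm(0.8) = 0.8416
```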
2.1. Spike-and-Slab LASSO Prior
The spike-and-slab LASSO (SSL) prior [9] is represented as follows:
$$\pi(\beta_j \mid \phi_j) = \phi_j\, \psi(\beta_j \mid \lambda_1) + (1 - \phi_j)\, \psi(\beta_j \mid \lambda_0), \qquad \psi(\beta \mid \lambda) = \frac{\lambda}{2} e^{-\lambda |\beta|},$$
where $\lambda_1$ is chosen to be a smaller value (the slab), while $\lambda_0$ should be chosen to be a larger value (the spike). In the Bayesian framework, the Laplace distribution is not conjugate, but it can be hierarchically represented using the normal distribution $N(\cdot, \cdot)$ and the exponential distribution $\mathrm{Exp}(\cdot)$:
$$\beta_j \mid \tau_j^2 \sim N(0, \tau_j^2), \qquad \tau_j^2 \sim \mathrm{Exp}(\lambda^2 / 2).$$
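The shape of this prior is easy to inspect numerically; the following R sketch plots the SSL density under illustrative hyperparameters ($\lambda_1 = 1$, $\lambda_0 = 20$, $\phi = 0.5$ are our choices, not values from the paper).

```r
# Density of the spike-and-slab LASSO prior: a two-component mixture of
# Laplace densities with mixing weight phi (illustrative values).
laplace_pdf <- function(b, lambda) (lambda / 2) * exp(-lambda * abs(b))
ssl_pdf <- function(b, phi, lambda1, lambda0) {
  phi * laplace_pdf(b, lambda1) + (1 - phi) * laplace_pdf(b, lambda0)
}

b <- seq(-4, 4, length.out = 401)
plot(b, ssl_pdf(b, phi = 0.5, lambda1 = 1, lambda0 = 20), type = "l",
     ylab = "density", main = "SSL prior: sharp spike at 0, heavy-tailed slab")
```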
Figure 1 describes four types of spike-and-slab priors with a fixed mixing proportion. They are the normal mixture (where both the spike and the slab are normal distributions), the normal and point-mass mixture (where the spike is a point mass at 0 and the slab is a normal distribution), the Laplace and point-mass mixture (where the spike is a point mass at 0 and the slab is a Laplace distribution), and the SSL (where both the spike and the slab are Laplace distributions). It can be seen that the normal mixture cannot adequately penalize smaller coefficients, making it difficult to achieve variable selection. On the other hand, the point-mass spike-and-slab priors have an over-shrinkage problem and may miss important variables. Therefore, the SSL prior can be considered a balance between the two.
Penalized priors (such as the LASSO) and spike-and-slab priors are common priors in Bayesian variable selection. When $\lambda_1 = \lambda_0$, the SSL prior degenerates to the LASSO prior. When $\lambda_0 \to \infty$, the spike component $\psi(\beta_j \mid \lambda_0)$ converges to a point mass at 0. That is, in the limit, SSL can be transformed into the "gold standard" point-mass spike-and-slab prior. Thus, SSL integrates the penalized likelihood (LASSO) and the spike-and-slab prior. SSL is also adaptive (there is no need for a separate spike-and-slab adaptive LASSO), and the proof of its adaptivity can be found in the discussion by Rockova and George [9]. SSL uses the spike component to encourage sparsity by shrinking many regression coefficients toward zero, while the slab component captures larger signals, enabling simultaneous variable selection and parameter estimation. Unlike the traditional LASSO with a fixed regularization parameter, SSL automatically adjusts the sparsity parameter based on the data, reducing the need for manual tuning and providing a multiplicity correction, which lowers the false-positive rate.
2.2. Bayesian Logistic Regression Based on Pólya-Gamma Latent Variables
The logistic function does not have a conjugate prior, which poses a challenge for Bayesian inference in logistic regression. Polson et al. [10] proposed a new data augmentation strategy for Bayesian inference in logistic regression models.
If $\omega \sim \mathrm{PG}(b, c)$, where $\mathrm{PG}(b, c)$ denotes the Pólya-Gamma distribution with parameters $(b, c)$, its expectation is
$$E(\omega) = \frac{b}{2c} \tanh\!\left(\frac{c}{2}\right).$$
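As a sanity check, the following R sketch compares a Monte Carlo estimate of this mean against the formula, assuming the BayesLogit package (which provides the rpg() sampler of Polson et al.) is installed; the values of b and c are illustrative.

```r
# Check the PG(b, c) mean formula by Monte Carlo using BayesLogit::rpg.
library(BayesLogit)

b <- 1; c <- 2.5
draws <- rpg(num = 1e5, h = b, z = c)   # 1e5 draws from PG(b, c)
mean(draws)                             # Monte Carlo estimate
(b / (2 * c)) * tanh(c / 2)             # analytic expectation
```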
The probability density of $\omega$ has the following property: if $\omega \sim \mathrm{PG}(b, 0)$ and $p(\omega)$ represents its probability density, then
$$\frac{(e^{\psi})^{a}}{(1 + e^{\psi})^{b}} = 2^{-b}\, e^{\kappa \psi} \int_0^{\infty} e^{-\omega \psi^2 / 2}\, p(\omega)\, d\omega,$$
where $\kappa = a - b/2$. Applying this result to (5) (see Section 3.1 of Polson et al. [10]), we obtain that
$$\frac{(e^{z_i^T \varphi})^{r_i}}{1 + e^{z_i^T \varphi}} = \frac{1}{2}\, e^{\kappa_i z_i^T \varphi} \int_0^{\infty} e^{-\omega_i (z_i^T \varphi)^2 / 2}\, p(\omega_i)\, d\omega_i,$$
where $\kappa_i = r_i - 1/2$, $\omega_i \sim \mathrm{PG}(1, 0)$, and its probability density is denoted as $p(\omega_i)$.
Assuming a prior $\pi(\varphi)$ for $\varphi$, the posterior of $\varphi$ is as follows:
$$\pi(\varphi \mid r, \omega) \propto \pi(\varphi) \prod_{i=1}^{n} \exp\!\left\{\kappa_i z_i^T \varphi - \frac{\omega_i (z_i^T \varphi)^2}{2}\right\},$$
where $\kappa_i = r_i - 1/2$. If $\pi(\varphi)$ is a Gaussian prior, then the posterior $\pi(\varphi \mid r, \omega)$ is conjugate with the prior. The posterior density of the variable $\omega_i$ is
$$p(\omega_i \mid \varphi) \propto e^{-\omega_i (z_i^T \varphi)^2 / 2}\, p(\omega_i).$$
Then, the posterior of $\omega_i$ is $\mathrm{PG}(1, z_i^T \varphi)$, and $E(\omega_i \mid \varphi) = \frac{1}{2 z_i^T \varphi} \tanh\!\left(\frac{z_i^T \varphi}{2}\right)$.
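Under a Gaussian prior $\varphi \sim N(\mu_{\varphi_0}, \Sigma_{\varphi_0})$, the resulting conditional update for $\varphi$ has closed-form Gaussian moments; the following R sketch implements one such update (the function name and argument layout are ours, not the paper's).

```r
# One Polya-Gamma conditional update for the logistic parameters phi,
# given current omega: phi | r, omega ~ N(m, V) with
#   V = (Z' diag(omega) Z + Sigma0^{-1})^{-1},  m = V (Z' kappa + Sigma0^{-1} mu0).
update_phi <- function(Z, r, omega, mu0, Sigma0) {
  kappa <- r - 1 / 2
  prec0 <- solve(Sigma0)
  V <- solve(t(Z) %*% (omega * Z) + prec0)     # posterior covariance
  m <- V %*% (t(Z) %*% kappa + prec0 %*% mu0)  # posterior mean
  list(mean = m, cov = V)
}
```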
2.3. Bayesian Logistic Regression Based on Lower-Bound Approximation
The log-likelihood of $\varphi$ is as follows:
$$\ell(\varphi) = \sum_{i=1}^{n} \left[ r_i \log \sigma(t_i) + (1 - r_i) \log \sigma(-t_i) \right],$$
where $t_i = z_i^T \varphi$ and $\sigma(t) = 1 / (1 + e^{-t})$ is the logistic function. We take the logarithm of the logistic function:
$$\log \sigma(t) = -\log(1 + e^{-t}).$$
Jaakkola and Jordan [11] approximate $\log \sigma(t)$, viewed as a function of $t^2$, using a first-order Taylor expansion around $\xi^2$:
$$\log \sigma(t) \ge \log \sigma(\xi) + \frac{t - \xi}{2} - \lambda(\xi)\left(t^2 - \xi^2\right), \qquad \lambda(\xi) = \frac{1}{2\xi}\left[\sigma(\xi) - \frac{1}{2}\right] = \frac{\tanh(\xi / 2)}{4\xi}.$$
Substituting this bound into $\ell(\varphi)$ gives
$$\ell(\varphi) \ge L(\xi) = \sum_{i=1}^{n} \left[ \log \sigma(\xi_i) + \left(r_i - \frac{1}{2}\right) t_i - \frac{\xi_i}{2} - \lambda(\xi_i)\left(t_i^2 - \xi_i^2\right) \right].$$
We want the lower bound $L(\xi)$ of $\ell(\varphi)$ to be as large as possible. For a given $t$, the lower bound satisfies $L(\xi) \le \ell(\varphi)$, so we maximize $L(\xi)$ with respect to $\xi$. We define a function $g(\xi)$ as the bound on $\log \sigma(t)$ for fixed $t$:
$$g(\xi) = \log \sigma(\xi) + \frac{t - \xi}{2} - \lambda(\xi)\left(t^2 - \xi^2\right).$$
Then, $g(\xi)$ is symmetric about $\xi = 0$, and $g(\xi)$ reaches its maximum at $\xi = \pm t$. The proof of this conclusion can be found in the work of Ray et al. [13]. Therefore, when $\xi_i = |t_i|$, the lower bound $L(\xi)$ reaches its maximum value.
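The following R sketch (ours) computes $\lambda(\xi)$ and checks numerically that the bound is tight exactly at $\xi = |t|$.

```r
# Jaakkola-Jordan quantities: lambda(xi) and the quadratic lower bound on
# log sigma(t); a quick numerical check that the bound is tight at xi = |t|.
sigma_  <- function(t) 1 / (1 + exp(-t))
lambda_ <- function(xi) tanh(xi / 2) / (4 * xi)
jj_bound <- function(t, xi) {
  log(sigma_(xi)) + (t - xi) / 2 - lambda_(xi) * (t^2 - xi^2)
}

t <- 1.7
log(sigma_(t))           # exact value of log sigma(t)
jj_bound(t, xi = 0.5)    # strictly below the exact value
jj_bound(t, xi = abs(t)) # equals the exact value (bound is tight)
```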
With a Gaussian prior $\varphi \sim N(\mu_{\varphi_0}, \Sigma_{\varphi_0})$, the variational posterior of $\varphi$ is as follows:
$$q(\varphi) = N(\mu_{\varphi}, \Sigma_{\varphi}), \qquad \Sigma_{\varphi}^{-1} = \Sigma_{\varphi_0}^{-1} + 2 \sum_{i=1}^{n} \lambda(\xi_i)\, z_i z_i^T, \qquad \mu_{\varphi} = \Sigma_{\varphi} \left[ \Sigma_{\varphi_0}^{-1} \mu_{\varphi_0} + \sum_{i=1}^{n} \left(r_i - \frac{1}{2}\right) z_i \right].$$
Comparing the posterior $q(\varphi)$ under these two methods, we find that, whether we use the lower-bound approximation or introduce the Pólya-Gamma latent variables, $\varphi$ has essentially the same variational posterior: the Pólya-Gamma conditional mean $E(\omega_i \mid \varphi)$ evaluated at $\xi_i$ equals $2\lambda(\xi_i)$, so both approaches induce the same Gaussian precision.
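A quick numerical check of this identity in R (our illustration):

```r
# E[omega] under PG(1, xi) equals 2 * lambda(xi), so the PG and lower-bound
# Gaussian updates for phi share the same precision matrix.
xi <- c(0.1, 1, 2.5)
tanh(xi / 2) / (2 * xi)       # Polya-Gamma conditional means E(omega | xi)
2 * tanh(xi / 2) / (4 * xi)   # 2 * lambda(xi) from the Jaakkola-Jordan bound
```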
4. Simulation Study
In this section, we generate simulated data as follows:
Let $y_i = x_i^T \beta + \varepsilon_i$, where $x_{ij}$, the $j$th element of $x_i$, is drawn from $N(0, 1)$, and the $\varepsilon_i$ are independently and identically distributed. We consider the following distributions for $\varepsilon_i$: (1) the normal distribution $\varepsilon_i \sim N(0, 1)$; (2) the Cauchy distribution $\varepsilon_i \sim C(0, 1)$; (3) the $t$-distribution with 3 degrees of freedom, $\varepsilon_i \sim t(3)$. The values of $\beta$ are set as follows:
Simulation 1: β = (3, 1.5, 0, 0, 2, 0, 0, 0), a sparse model;
Simulation 2: β = (5, 5, 5, 5, 5, 5, 5, 0), a dense model;
Simulation 3: β = (5, 0, 0, 0, 0, 0, 0, 0), an ultra-sparse model.
The missing data mechanism, denoted $M_0$, is generated from the logistic regression model described in Section 2. We generate $n = 50$ data points. To reduce the influence of the priors on the results, the hyperparameters in the priors are all set to 0.01; $\mu_{\varphi_0}$ is a vector consisting of the means of the non-missing data, and $\Sigma_{\varphi_0}$ is a diagonal matrix of the variances of the non-missing data. We use the mean squared error (MSE) of the $\beta_j$ estimates and the running time (T, in seconds) to measure the estimation accuracy and computational efficiency of each method. The running time is obtained using the tic and toc functions in the "tictoc" package in R. For each setting, we conduct 100 simulations. The tables report the average results of the 100 simulations, with standard deviations in parentheses. The quantiles considered are $\tau = 0.2, 0.5, 0.8$. In R (version 4.3.1), there are no publicly available software packages for methods related to non-ignorable missing response variables; therefore, we only present the parameter estimation and variable selection results of the algorithm proposed in this paper.
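For concreteness, the R sketch below generates one replicate under Simulation 1; the missingness parameter vector phi is an illustrative choice of ours, since the paper's exact values for $M_0$ are not reproduced here.

```r
# Generate one replicate of Simulation 1: sparse beta, N(0,1) errors, and
# non-ignorable missingness via a logistic model on (1, y, x). The phi
# values below are illustrative, not the paper's.
set.seed(42)
n <- 50
beta <- c(3, 1.5, 0, 0, 2, 0, 0, 0)
p <- length(beta)

X <- matrix(rnorm(n * p), n, p)
y <- drop(X %*% beta) + rnorm(n)   # tau = 0.5 quantile of N(0,1) error is 0

phi <- c(-1, 0.1, rep(0, p))       # intercept, y-coefficient, x-coefficients
eta <- cbind(1, y, X) %*% phi
r <- rbinom(n, 1, plogis(eta))     # r = 1 indicates a missing response
y_obs <- ifelse(r == 1, NA, y)
mean(r)                            # realized missing rate
```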
Table 1, Table 2 and Table 3 show the average results of the $\beta$ estimates obtained by the proposed algorithm for Simulations 1, 2, and 3, respectively, when $\varepsilon$ follows a normal distribution, a Cauchy distribution, and a $t$-distribution. The values in parentheses are the average MSE of $\beta$ over the 100 simulations. The results indicate that our algorithm provides good estimates of $\beta$. When $\varepsilon$ follows a Cauchy distribution, the estimation error of $\beta$ is slightly larger, but the difference is not significant; this is consistent with the heavy tails of the Cauchy distribution.
In the proposed algorithm, the values of $\phi = (\phi_1, \cdots, \phi_p)$ determine the variable selection. When $\phi_j \ge 1/2$, the $j$-th variable is selected; otherwise, the $j$-th variable is not selected.
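In R, this selection rule is immediate (the $\phi$ values below are illustrative):

```r
# Median-probability variable selection rule: keep variable j iff phi_j >= 1/2.
phi_hat <- c(0.98, 0.91, 0.04, 0.10, 0.95, 0.02, 0.07, 0.01)  # illustrative
selected <- as.integer(phi_hat >= 0.5)
which(selected == 1)   # indices of the selected variables
```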
Table 4, Table 5 and Table 6 show the variable selection results when $\varepsilon$ follows a normal distribution, a Cauchy distribution, and a $t$-distribution, respectively. A value of 1 indicates that the variable is selected, while 0 indicates that it is not selected. It can be observed that all important covariates are identified.
We stop the proposed algorithm when the entropy of $\phi$ no longer changes (or changes by less than a given threshold).
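One plausible R form of this stopping rule, with the Bernoulli entropy summed over the components of $\phi$ (the threshold value is our illustrative choice):

```r
# Total Bernoulli entropy of the inclusion probabilities phi; the algorithm
# stops once successive iterations change this quantity by less than tol.
entropy_phi <- function(phi, eps = 1e-12) {
  phi <- pmin(pmax(phi, eps), 1 - eps)   # guard against log(0)
  -sum(phi * log(phi) + (1 - phi) * log(1 - phi))
}

converged <- function(phi_new, phi_old, tol = 1e-6) {
  abs(entropy_phi(phi_new) - entropy_phi(phi_old)) < tol
}
```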
Figure 2, Figure 3 and Figure 4 show the entropy of $\phi$ across iterations when $\varepsilon$ follows a normal distribution, a Cauchy distribution, and a $t$-distribution, respectively. It can be seen that, in all simulations, our algorithm converges within 100 iterations, far fewer than the number of draws required by sampling-based MCMC algorithms. Therefore, the proposed algorithm is computationally efficient.
Variational Bayesian (VB) inference is often considered superior to MCMC in terms of speed, scalability, and efficiency. Unlike MCMC, which relies on iterative sampling and can be computationally expensive, VB transforms the inference problem into an optimization task, leading to faster and more predictable convergence. VB produces deterministic results, avoiding the Monte Carlo error inherent in MCMC, and is more memory-efficient as it does not require storing large numbers of posterior samples. It is particularly well-suited for large-scale models, where MCMC may struggle due to slow sampling and high computational costs. Additionally, VB tends to offer more interpretable posterior distributions by approximating them with simpler parametric families. However, while VB is faster and more scalable, it may underestimate uncertainty due to its reliance on approximations, whereas MCMC remains more flexible and accurate for complex posterior distributions.
5. Real Data Analysis
In this section, we analyze data from HIV-positive patients in the AIDS Clinical Trials Group (ACTG175) study [16], which can be obtained using the command data(ACTG175) in the R package "BART". The ACTG175 dataset contains 27 variables, and 2139 HIV-positive patients were randomly assigned to the following four groups: (1) 532 received zidovudine treatment; (2) 522 received didanosine treatment; (3) 524 received a combination of zidovudine and didanosine; and (4) 561 received a combination of zidovudine and zalcitabine. These patients were monitored at weeks 2, 4, and 8 after the start of the trial and every 12 weeks thereafter. Monitoring ended when the CD4 T-cell count declined by 50% or more, or when the patient died.
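The data can be loaded and the response and covariates assembled roughly as follows; the column names (cd496, cd420, and so on) follow the usual ACTG175 coding but are our assumption here and should be checked against the package documentation.

```r
# Load the ACTG175 data and assemble the response and covariates used here.
# Column names (age, wtkg, cd40, cd420, cd80, cd820, cd496, arms) follow the
# usual ACTG175 coding; verify against the package help page.
library(BART)
data(ACTG175)

d <- ACTG175
Y <- d$cd496                          # CD4 count at 96 +/- 5 weeks (has NAs)
X <- with(d, cbind(age, wtkg, cd40, cd420, cd80, cd820))

tapply(is.na(Y), d$arms, mean)        # missing rate of Y per treatment arm
```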
We are interested in the relationship between the dependent variable $Y$, the CD4 T-cell count at 96 ± 5 weeks, and the covariates age ($X_1$), weight ($X_2$), baseline CD4 T-cell count ($X_3$), CD4 T-cell count at 20 ± 5 weeks ($X_4$), baseline CD8 T-cell count ($X_5$), and CD8 T-cell count at 20 ± 5 weeks ($X_6$). Due to patient death or dropout, the dependent variable $Y$ has missing records, with missing rates of 39.66%, 36.21%, 35.69%, and 37.43% for the four groups, respectively. Relevant medical research indicates that the CD4 T-cell count is related to disease progression, and patients with lower CD4 T-cell counts are more likely to drop out of the study. This suggests that the missingness of $Y$ (the CD4 T-cell count at 96 ± 5 weeks) is related to the CD4 T-cell count itself. In summary, the missingness of $Y$ is not random, and we can establish the following model:
$$Y_i = \beta_1 X_{i1} + \cdots + \beta_6 X_{i6} + \varepsilon_i, \qquad \mathrm{logit}\, P(r_i = 1 \mid Y_i, X_i) = z_i^T \varphi,$$
with the $\tau$-th quantile of $\varepsilon_i$ equal to zero and $z_i = (1, Y_i, X_{i1}, \dots, X_{i6})^T$, as in Section 2.
Table 7 summarizes the coefficients and 95% confidence intervals for the quantile regression of the four treatment groups when $\tau = 0.5$. The QR row shows the results of quantile regression estimation when there are no missing data. It can be seen that the proposed method effectively imputes the missing data.
Figure 5 shows that our algorithm achieves convergence in the analysis of the AIDS data. From the estimated coefficients of the quantile regression, the age and weight factors do not significantly influence the observed CD4 T-cell count at 96 ± 5 weeks. The observed value at 20 ± 5 weeks has a significant positive effect on the CD4 T-cell count at 96 ± 5 weeks and is the main influencing factor for the 96 ± 5-week measurement. The influence of the baseline measurement, taken at a longer interval, on the 96 ± 5-week measurement is relatively small and not even significant in groups (1) and (2) (corresponding to variable selection with $\phi_j < 1/2$). Interestingly, the CD8 T-cell count at 20 ± 5 weeks has a significant negative effect in groups (2) and (3), indicating that higher previous CD4 levels and lower CD8 levels contribute to an increase in the CD4 T-cell count. This finding is similar to the use of the CD4/CD8 ratio as an indicator of antiretroviral therapy efficacy in existing studies, suggesting that a higher baseline ratio favors CD4 T-cell count recovery and immune function reconstruction. The results of this study also suggest that, in the mid-term assessment of HIV treatment efficacy, attention should be paid not only to CD4 T-cell levels but also to CD8 T-cell levels. With baseline CD4 levels unchanged, a low CD4/CD8 ratio may negatively impact long-term treatment outcomes.
). Interestingly, the CD8 T-cell count at 20 ± 5 weeks has a significant negative effect in groups (2) and (3), indicating that higher previous CD4 levels and lower CD8 levels contribute to an increase in CD4 T-cell count. This finding is similar to the use of the CD4/CD8 ratio as an indicator of antiretroviral therapy efficacy in existing studies, suggesting that a higher baseline ratio favors CD4 T-cell count recovery and immune function reconstruction. The results of this study also suggest that, in the mid-term assessment of HIV infection treatment efficacy, attention should be paid not only to CD4 T-cell levels but also to CD8 T-cell levels. With unchanged baseline CD4 levels, a low CD4/CD8 ratio may negatively impact long-term treatment outcomes.