Variational Bayesian Inference for Quantile Regression Models with Nonignorable Missing Data

Abstract: Quantile regression models are powerful tools for conducting regression analyses when the data are subject to missingness. Missing values may arise under three mechanisms: missing completely at random, missing at random, or missing not at random, and may stem from system malfunction during data collection or human error during data preprocessing. It is important to deal with missing values before analyzing data, since ignoring or omitting them may yield biased or misleading results. This paper studies quantile regression from a Bayesian perspective. By proposing a hierarchical model framework, we develop an alternative approach based on deterministic variational Bayes approximations. Logistic and probit models are adopted to specify the propensity scores for missing responses and covariates, respectively. A Bayesian variable selection method is proposed to identify significant covariates. Several simulation studies and real examples illustrate the advantages of the proposed methodology and point to possible directions for future research.


Introduction
Koenker and Bassett [1] introduced the quantile regression (QR) model. Unlike ordinary least squares (OLS) regression, which models only the mean of the response variable, QR estimates regression coefficients at different quantiles of the response, capturing how the association between the covariates and the response varies across its distribution. Quantile regression models are widely used in fields such as finance, biology, and ecology (Baur and Dirk [2]; Huang et al. [3]; Cade and Noon [4]) owing to their robustness to outliers and wide applicability. Yu and Moye [5] examined Bayesian quantile regression by recasting QR in terms of the asymmetric Laplace distribution. Building on this work, Kozumi and Kobayashi [6] proposed an efficient Gibbs sampling method for the QR model by exploiting a mixture decomposition of the asymmetric Laplace distribution. Alhamzawi and Rahim [7] explored Bayesian estimation of the composite quantile regression model, while Yuan et al. [8] studied Bayesian composite quantile regression for a single-index model. Additionally, Hu et al. [9] introduced a Bayesian joint quantile regression model. An important aspect of building quantile regression models is the selection of predictor variables. Li et al. [10] approached the regularization of quantile regression from a Bayesian perspective, focusing on variable selection. To address the instability of posterior estimates caused by variable selection in Gibbs sampling and convergence issues due to diffuse priors, Alhamzawi and Yu [11] proposed a stochastic search variable selection method (ISSVS) within a Bayesian framework. Mallick et al. [12] introduced reciprocal lasso (rLasso) regularization, which offers advantages over lasso regularization in estimation, prediction, and variable selection within the Bayesian framework. Considerable progress has thus been made on Bayesian statistical inference and variable selection for quantile regression. However, all the aforementioned studies assume fully observed data.
Missing data are inevitable in research and can result from various uncontrollable factors. For instance, some results may be lost due to machine malfunction, while in questionnaires, individuals may be unwilling to disclose their income. The study of missing data originated in the 1970s, and Little and Rubin [13] defined three missing data mechanisms based on the causes of missingness: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). The first two mechanisms, MCAR and MAR, are unrelated to the missing values themselves. The third mechanism, MNAR, is directly associated with the missing values and arises frequently in practice. Ignoring nonignorable missing data, i.e., data missing under the MNAR mechanism, can lead to erroneous conclusions. Recognizing the distinctive attributes of such data and the many benefits of quantile regression, several researchers have explored quantile regression models as a means to handle nonignorable missing data. For instance, Yuan and Yin [14] examined Bayesian quantile regression models for longitudinal data with nonignorable missingness. Zhao et al. [15] studied several inverse probability weighting (IPW) estimators for quantile regression models in which either the covariates or the response variables are subject to nonignorable missingness. Wang and Tang [16] investigated nonlinear dynamic factor analysis models with mixed discrete and nonignorable missing covariates, employing Bayesian quantile regression for statistical inference. Tang et al. [17] and Mulati and Tang [18] explored nonlinear dynamic factor analysis models with nonignorable missing data using Bayesian nonparametric and semiparametric approaches, respectively. All of these studies adopted the Bayesian framework and carried out inference with the Markov chain Monte Carlo (MCMC) algorithm. However, MCMC has certain inherent limitations: sampling can be challenging, especially for large datasets or complex models.
The variational Bayesian algorithm transforms the high-dimensional integration problem in Bayesian inference into an optimization problem: it maximizes the evidence lower bound to obtain a variational approximation of the posterior distribution, which makes it well suited to analyzing large datasets. The development of variational inference (VI) has also benefited high-dimensional data analysis, for example through stochastic variational inference (SVI) by Hoffman et al. [19], "black box" variational inference by Ranganath et al. [20], and amortized variational inference by Ganguly et al. [21]. VI has been applied in many contexts, including Bayesian approximate inference for models with missing covariates (Faes et al. [22]), parameter uncertainty in dynamic factor models with missing data (Erik [23]), and statistical inference for gene regulatory networks with missing data (Liu et al. [24]). However, no existing work has applied the variational Bayesian algorithm to quantile regression (QR) models with missing data.
The main contributions of this paper are as follows: (i) a variational Bayesian approach for parameter estimation in quantile regression models; (ii) a variational Bayesian approach for quantile regression models with missing covariates and missing response variables, whose benefits are demonstrated through simulations; and (iii) a variational Bayesian approach for variable selection in quantile regression models with missing data, which performs well in our experiments.
This paper is organized as follows. Section 2 presents the variational Bayesian algorithm, the QR model, the lasso penalty, and missing data. In Section 3, variational inference is performed for QR in the presence of missing covariates and missing response variables, respectively. In Section 4, variable selection and parameter estimation are performed for lasso-penalized QR in the presence of missing covariates and missing response variables, respectively. Four simulation experiments in Section 5 illustrate the feasibility of variational Bayes for QR with missing data and its advantages over the Gibbs sampler. Section 6 analyzes real data using the variational Bayesian algorithm.

Quantile Regression Model
The quantile regression model was initially introduced by Koenker and Bassett [1] and can be written as y_i = X_i^T β_p + ε_i, where X_i = (x_i1, ..., x_ik)^T is a k × 1 vector of covariates and y_i denotes the response variable. Let p (0 < p < 1) denote the quantile level. The parameter vector β_p is a k-dimensional vector associated with the quantile level p, and the random error term ε_i follows a general distribution whose p-th quantile is zero, i.e., P(ε_i ≤ 0 | X_i) = p. Given X_i and y_i, the conditional quantile is Q_p(y_i | X_i) = X_i^T β_p. The estimate of β_p is obtained by minimizing

min_β Σ_{i=1}^n ρ_p(y_i − X_i^T β_p),

where the loss function is ρ_p(u) = u{p − I(u < 0)} and I(·) denotes the indicator function. Due to the non-differentiability of the loss function, obtaining estimates of β_p is challenging. Yu and Moye [5] demonstrated that the quantile regression model can be estimated within a Bayesian framework when the error term follows an asymmetric Laplace distribution (ALD). However, the ALD is not a standard distribution, making its posterior density complex and inevitably increasing the computational burden. According to Kozumi and Kobayashi [6], the ALD can be expressed as a mixture of exponential and normal distributions. Specifically,

ε_i = k_1 z_i + k_2 √(σ z_i) v_i, with k_1 = (1 − 2p)/(p(1 − p)) and k_2² = 2/(p(1 − p)),

where z_i ∼ exp(1/σ) and v_i ∼ N(0, 1) are independent of each other. Finally, the quantile regression model can be expressed as the hierarchical model

y_i | z_i ∼ N(X_i^T β_p + k_1 z_i, k_2² σ z_i). (3)
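As a concrete illustration of the check loss and the exponential-normal mixture, the following sketch (our own, not from the paper) draws ALD errors via the mixture representation and verifies the defining quantile property empirically; σ is fixed at 1 so the rate-versus-mean convention for the exponential is moot.

```python
import numpy as np

def check_loss(u, p):
    """Quantile check loss rho_p(u) = u * (p - I(u < 0))."""
    return u * (p - (u < 0))

def ald_mixture_sample(n, p, sigma=1.0, rng=None):
    """Draw ALD(0, sigma, p) errors via the exponential-normal mixture:
    eps = k1*z + k2*sqrt(sigma*z)*v, z exponential, v standard normal."""
    rng = np.random.default_rng(rng)
    k1 = (1 - 2 * p) / (p * (1 - p))
    k2 = np.sqrt(2 / (p * (1 - p)))
    z = rng.exponential(scale=sigma, size=n)   # exponential with mean sigma
    v = rng.standard_normal(n)                 # v ~ N(0, 1)
    return k1 * z + k2 * np.sqrt(sigma * z) * v

# The p-th quantile of the ALD error is 0, so roughly a fraction p
# of the draws should fall below zero.
eps = ald_mixture_sample(100_000, p=0.25, rng=0)
print(abs(np.mean(eps < 0) - 0.25) < 0.01)  # True
```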

Elements of Variational Bayes
Taking into account the latent variables Z = (Z_1, ..., Z_m) and the observed values X = (x_1, ..., x_n), the joint density can be written P(X, Z). Variational Bayesian inference no longer relies on sampling to approximate the posterior density of complex problems; instead, it employs optimization. Within an acceptable range of error, a simpler density q(Z) ∈ Θ (where Θ represents a family of candidate densities for the latent variables Z) is used to approximate the posterior density P(Z | X). The difference between these two densities is measured by the Kullback-Leibler (KL) divergence,

KL(q(Z) ∥ P(Z | X)) = E_q[log q(Z)] − E_q[log P(Z | X)].

Alternatively,

log P(X) = L(q) + KL(q(Z) ∥ P(Z | X)), where L(q) = E_q[log P(X, Z)] − E_q[log q(Z)].

The term L(q) is referred to as the evidence lower bound. Since log P(X) is fixed, minimizing the KL divergence is equivalent to maximizing the lower bound.
This is equivalent to solving the optimization problem q* = argmax_{q ∈ Θ} L(q). Under the mean-field assumption below, coordinate ascent yields the best approximation for each factor as

q*(Z_i) ∝ exp{E_{−i}[log P(Z_i | Θ_r)]},

where Θ_r = {X, Z_1, ..., Z_{i−1}, Z_{i+1}, ..., Z_m} and E_{−i} denotes expectation over all factors except q(Z_i). For convenience, q*(·) is denoted q(·) in the following paragraphs.
Additionally, the complexity of the assumed family of densities for the latent variables determines the complexity of the optimization algorithm. Therefore, we consider the mean-field variational family q(Z) = ∏_{i=1}^m q(Z_i).
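To make the coordinate-ascent recipe concrete, here is a self-contained toy example (our own, not from the paper): mean-field variational inference for the mean and precision of Gaussian data, the classic conjugate case. All prior values and the data-generating settings are illustrative assumptions.

```python
import numpy as np

# Mean-field VI for x_i ~ N(mu, tau^-1) with priors
# mu ~ N(mu0, (lam0*tau)^-1) and tau ~ Gamma(a0, b0).
rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=0.5, size=500)
N, xbar, sumsq = len(x), x.mean(), np.sum(x**2)
mu0, lam0, a0, b0 = 0.0, 1.0, 1.0, 1.0

E_tau = 1.0  # initialize E_q[tau]
for _ in range(50):  # coordinate ascent: update q(mu), then q(tau)
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)   # q(mu) = N(mu_N, 1/lam_N)
    lam_N = (lam0 + N) * E_tau
    E_mu, E_mu2 = mu_N, mu_N**2 + 1.0 / lam_N
    a_N = a0 + (N + 1) / 2                        # q(tau) = Gamma(a_N, b_N)
    b_N = b0 + 0.5 * (sumsq - 2 * E_mu * N * xbar + N * E_mu2
                      + lam0 * (E_mu2 - 2 * mu0 * E_mu + mu0**2))
    E_tau = a_N / b_N

print(round(mu_N, 2), round(np.sqrt(b_N / a_N), 2))  # close to the true (2.0, 0.5)
```

Each update has exactly the q*(Z_i) ∝ exp{E[log P(Z_i | Θ_r)]} form above; the same pattern drives the algorithms in Sections 3 and 4.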

Bayesian Lasso Regularized Quantile Regression
Li and Zhu [25] proposed lasso-regularized quantile regression. Specifically, they introduced an L_1-norm penalty term to control the complexity of the model, aiming to shrink some regression coefficients to zero. The objective function can be expressed as

min_β Σ_{i=1}^n ρ_p(y_i − X_i^T β) + λ Σ_{j=1}^k |β_j|, (10)

where λ > 0 is the regularization parameter used to adjust the balance between model complexity and fit to the data, and ρ_p(·) is the loss function proposed by Koenker and Bassett [1], defined as ρ_p(u) = u{p − I(u < 0)}. Bayesian lasso regularization for quantile regression was introduced by Li et al. [10]; it improves upon standard lasso regularization by incorporating a Bayesian framework, which tends to accommodate outliers in the data and improve robustness. By introducing suitable priors on β, the solution to (10) is equivalent to the Bayesian maximum a posteriori (MAP) estimate.
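The penalized objective can be written down directly. The sketch below (illustrative, with simulated data and an arbitrary λ) evaluates and numerically minimizes it for the median, p = 0.5:

```python
import numpy as np
from scipy.optimize import minimize

def check_loss(u, p):
    return u * (p - (u < 0))

def lasso_qr_objective(beta, X, y, p, lam):
    """sum_i rho_p(y_i - X_i^T beta) + lam * ||beta||_1"""
    return np.sum(check_loss(y - X @ beta, p)) + lam * np.abs(beta).sum()

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = X @ np.array([1.0, 1.0]) + 0.25 * rng.standard_normal(n)

# Direct minimization; Nelder-Mead copes with the non-smooth loss
fit = minimize(lasso_qr_objective, x0=np.zeros(2),
               args=(X, y, 0.5, 1.0), method="Nelder-Mead")
print(np.round(fit.x, 1))  # intercept and slope near (1.0, 1.0)
```

The Bayesian treatment replaces this direct optimization with priors whose MAP estimate solves the same problem.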
Considering an error term that follows an asymmetric Laplace distribution (ALD), the posterior distribution of β is proportional to the ALD likelihood multiplied by a Laplace prior on β. To complete the model, gamma priors are placed on λ and η², resulting in a hierarchical model.

Missing Data
In the context of missing data, a comprehensive study was conducted by Little and Rubin [13], who provided insights into missing data mechanisms and patterns. Missing data mechanisms can be classified into three types: missing completely at random (MCAR), missing at random (MAR), and missing not at random (MNAR). In the vast proportion of previous studies, the emphasis has been on random missingness mechanisms. However, missingness frequently depends on the values that are missing, suggesting nonignorable missingness. Studying such nonignorable missing data holds significant value, as disregarding it can result in erroneous conclusions during analysis.

Furthermore, the missing-data quantile regression models discussed in Sections 3 and 4 are hierarchical models that can be represented using probabilistic directed acyclic graphs (DAGs). In these graphs, the nodes correspond to the parameters in the model, and the arrows illustrate the dependencies between the parameters (Bishop [26]; Wasserman [27]); in the DAGs presented in this paper, the unshaded nodes correspond to the observed data. Now consider Equation (3), where y_i and X_i denote the observed values of the response variable and covariates for the ith observation. Assuming that both y_i and X_i are prone to nonignorable missingness, but are never missing simultaneously, we obtain a quantile regression model that accommodates nonignorable missing data. To model the missing data mechanism effectively, we introduce the pertinent parameters and assign their prior distributions below.

Missing Covariate Variables
In Equation (3), we define the corresponding data matrix. When the covariate x is subject to missingness, we introduce an indicator vector R = (R_1, ..., R_n)^T recording, for each i, whether x_i is observed. Here, x_mis represents the n_mis-dimensional subvector consisting of the missing x_i, and n_obs + n_mis = n. However, x = (x_obs, x_mis)^T may not correspond to the original data x = (x_1, ..., x_n)^T in terms of subscripts. Given the binary nature of the indicator, R_i follows a Bernoulli distribution with success probability H(X_i^T φ), where X_i denotes the ith row of the matrix X. We adopt the probit regression model as the link function, H(·) = Φ(·), the standard normal distribution function, so that P(R_i = 1 | X_i) = Φ(X_i^T φ). To implement the approach proposed by Albert and Chib [28], we introduce n independent latent variables a_1, a_2, ..., a_n in the missing data mechanism, where a_i follows a normal distribution with mean X_i^T φ and variance 1. For i = 1, ..., n, we define R_i = 1 if a_i > 0 and R_i = 0 if a_i ≤ 0. The interactions between the regression parameters and the parameters of the missing data mechanism are shown in Figure 1.
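The Albert-Chib augmentation can be checked by simulation. The sketch below (our own illustration; the φ value is borrowed from Simulation I as a plausible setting) generates missingness indicators through the latent normal variables a_i and confirms that they match the probit probabilities:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 50_000
X = np.column_stack([np.ones(n), rng.normal(0.0, 0.6, size=n)])
phi = np.array([0.75, -1.05])  # illustrative value, borrowed from Simulation I

# Albert-Chib augmentation: a_i ~ N(X_i^T phi, 1) with R_i = 1{a_i > 0}
# is equivalent to R_i ~ Bernoulli(Phi(X_i^T phi)).
a = X @ phi + rng.standard_normal(n)
R = (a > 0).astype(int)

print(abs(R.mean() - norm.cdf(X @ phi).mean()) < 0.01)  # True
```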

Variational Bayesian Inference of QR Model Parameters
Based on the above parameters, we consider the mean-field variational family q(β, z, σ, x_mis, µ_x, σ²_x, φ, a) = q(β)q(z)q(σ)q(x_mis)q(µ_x)q(σ²_x)q(φ)q(a), where z = (z_1, ..., z_n), a = (a_1, ..., a_n), and x_mis = (x_mis,1, ..., x_mis,n_mis). Since z_1, ..., z_n are mutually independent, a_1, ..., a_n are mutually independent, and x_mis,1, ..., x_mis,n_mis are mutually independent, each of these factors further factorizes over its components. The optimal density function for each parameter is derived as follows; please refer to Appendix A for the detailed derivation of each optimal density, as well as the calculations of E_q(X_i X_i^T) and E_q(X), where l = 1, ..., n_mis.
q(x_mis,l) ∼ N(µ_q(x_mis,l), σ²_q(x_mis,l)). For the latent scale variables, q(z_i) ∼ GIG(1/2, a_z_i, b_z_i), a generalized inverse Gaussian distribution whose moments involve K_p(·), the modified Bessel function of the second kind of order p. For the scale parameter, q(σ) ∼ IG(a_σ, b_σ), an inverse gamma distribution; here ψ(·) denotes the digamma function. In addition, the required expectations for the auxiliary variable a follow from its (truncated normal) optimal density. The evidence lower bound of the model is assembled from these quantities. The specific algorithm is shown in Algorithm 1.
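The GIG updates require expectations such as E_q(z_i) and E_q(1/z_i), which are ratios of modified Bessel functions. A small helper (ours, assuming the parameterization GIG(p, a, b) with density proportional to z^(p−1) exp(−(a z + b/z)/2); the paper's exact convention is in Appendix A) computes them:

```python
import numpy as np
from scipy.special import kv

def gig_moments(p, a, b):
    """E[z] and E[1/z] under GIG(p, a, b) with density proportional
    to z^(p-1) * exp(-(a*z + b/z)/2); K_p is the modified Bessel
    function of the second kind (scipy.special.kv)."""
    s = np.sqrt(a * b)
    r = kv(p + 1, s) / kv(p, s)
    return np.sqrt(b / a) * r, np.sqrt(a / b) * r - 2 * p / b

# For p = 1/2 (the case arising for q(z_i)), the half-integer Bessel
# functions reduce to elementary forms and E[1/z] collapses to sqrt(a/b).
p, a, b = 0.5, 2.0, 3.0
E_z, E_inv_z = gig_moments(p, a, b)
print(np.isclose(E_inv_z, np.sqrt(a / b)))  # True
```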
Algorithm 1: VI with Nonignorable Missing Covariates in QR Models.

Missing Response Variables
For Equation (3), let us consider the scenario where the response variable y = (y_1, ..., y_n) contains missing values. Similar to missing covariates, we introduce an indicator vector R = (R_1, ..., R_n) recording, for each i, whether y_i is observed. Here, y_mis represents the n_mis-dimensional subvector consisting of the missing y_i, and we have n_obs + n_mis = n. However, y = (y_obs, y_mis) does not necessarily correspond to the original data y = (y_1, ..., y_n) in terms of subscripts.
Similar to Section 3.1, we introduce the indicator function R_i and take the link function to be a logistic regression model. Consequently, we have

P(R_i = 1 | X_i) = exp(X_i^T φ) / (1 + exp(X_i^T φ)).
Here, X_i represents the i-th row of the matrix X, and φ is the parameter vector.

Auxiliary Variables
Polson et al. [29] proposed a Bayesian estimation method for logistic regression models by introducing auxiliary variables ω = (ω_1, ..., ω_n) that follow the Pólya-Gamma distribution. This approach addresses the challenges associated with the complex integration and Metropolis-Hastings (MH) sampling otherwise required.
In their methodology, they consider the logistic model P(y*_i = 1 | x*_i, β*) = exp(x*_i^T β*) / (1 + exp(x*_i^T β*)) and place a Gaussian prior on β*, denoted β* ∼ N(b*, B). The auxiliary variable ω is then incorporated into the model.
The posterior density, after adding the auxiliary variable ω, is proportional to P(y* | β*, x*) P(β*) P(ω) P(x*), where P(y* | β*, x*) represents the likelihood function, P(β*) is the prior distribution of β*, P(ω) is the distribution of the auxiliary variable ω, and P(x*) denotes the prior distribution of the covariates x*.
The specific form of the posterior density and the method of inference depend on the choice of prior distributions, the likelihood function, and the implementation of the Pólya-Gamma auxiliary variables.
Variational Bayesian inference is typically applicable to specific classes of models, particularly conditionally conjugate exponential-family models. Logistic regression models with Gaussian priors are an exception, because the logistic likelihood is not conjugate to the Gaussian prior. To address this limitation, Daniele et al. [30] proposed introducing an additional variable ω = (ω_1, ..., ω_n) that follows the Pólya-Gamma distribution. Incorporating this variable into models with logistic components restores conjugacy between the augmented likelihood and the Gaussian prior, enabling statistical inference for this class of models within the Bayesian framework using a variational Bayesian algorithm.
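For variational inference, the Pólya-Gamma draw itself is never needed: the coordinate-ascent updates only use the mean of a PG(1, c_i) variable, which has the closed form tanh(c_i/2)/(2c_i). The sketch below (our own) implements this and checks it against the PG infinite-series representation:

```python
import numpy as np

def E_omega(b, c):
    """Mean of a Polya-Gamma PG(b, c) variable: b/(2c) * tanh(c/2);
    this expectation is all the variational updates require."""
    c = np.asarray(c, dtype=float)
    return np.where(np.abs(c) < 1e-8, b / 4.0,          # limit as c -> 0
                    b * np.tanh(c / 2.0) / (2.0 * c))

# Check against the PG series representation:
# omega = (1/(2*pi^2)) * sum_k g_k / ((k - 1/2)^2 + c^2/(4*pi^2)), g_k ~ Exp(1),
# so E[omega] is the same sum with each g_k replaced by its mean 1.
c = 1.0
k = np.arange(1, 200_000)
series_mean = np.sum(1.0 / ((k - 0.5) ** 2 + c**2 / (4 * np.pi**2))) / (2 * np.pi**2)
print(np.isclose(float(E_omega(1, c)), series_mean, atol=1e-4))  # True
```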
In this section, we model the missing data mechanism as a logistic regression. Consequently, we introduce additional variables ω = (ω_1, ..., ω_n) that follow the Pólya-Gamma distribution. The interaction between the parameters is shown in Figure 2.
In the scenario of missing response variables, the posterior conditionals P(β | Θ_r), P(z | Θ_r), and P(σ | Θ_r) remain the same as in the missing-covariate case and are not reiterated here. However, we provide the full conditional posteriors for P(φ | Θ_r), P(ω | Θ_r), and P(y_mis | Θ_r). Following the approach of Polson et al. [29], we take the prior distribution of ω_i to be PG(1, 0).
Here, Ω denotes the diagonal matrix with diagonal elements ω_i, i = 1, ..., n. The optimal densities of the parameters in the missing-data-mechanism part are given first; the optimal density functions of the remaining parameters follow, and the detailed derivation of each optimal density, as well as of E_q(Y), is similar to that of E_q(X).
q(y_mis,l) ∼ N(µ_q(y_mis,l), σ²_q(y_mis,l)). The evidence lower bound of the model is assembled analogously. The specific algorithm is shown in Algorithm 2.
Algorithm 2: VI with Nonignorable Missing Response in QR Models.

7 Compute the optimal density function q(σ) ∼ IG(a_σ, b_σ) via Equation (37), and update µ_q(σ), µ_q(1/σ), µ_q(log(σ)); compute the evidence lower bound L(q) via Equation (39).

Covariates with Nonignorable Missing Variable Choices
For Equation (15), the relevant data matrix is defined as follows: x_i represents the covariate vector for the i-th observation, while y_i denotes the corresponding response variable. When any covariate in X, specifically the j-th dimension covariate with j ranging from 1 to k, is missing, we follow the approach of Section 3.1: we introduce the indicator vector R = (R_1, ..., R_n)^T to identify the missing covariates and model the missing data mechanism as a probit regression.
Furthermore, to incorporate variable selection, we introduce the parameters s and η². The prior distribution for each s_j is exponential, s_j ∼ Exp(η²/2), while the parameter η² is assigned a gamma prior, η² ∼ Ga(c, d). The parameter interactions are shown in Figure 3. The optimal densities of the parameters z, σ, µ_x, σ²_x, a, φ are similar to those of Algorithm 1 and will not be repeated. In the following, we give the optimal density functions for the parameters β, s, η², and x_j,mis,l, where l = 1, ..., n_mis.
The evidence lower bound of the model is assembled analogously; the specific algorithm is shown in Algorithm 3.

Response with Nonignorable Missing Variable Choices
Only when the response variable has nonignorable missing values, similar to Section 3.3, do we model the missing data mechanism as a logistic regression with parameters s and η. The prior distributions for these parameters follow the same approach as described in Section 4.1. The interactions between the parameters are shown in Figure 4.

Input: the prior distribution of each parameter, including σ²_β, σ²_φ, µ_x, σ²_x, σ²_µ_x, a*, b, c, d, and a convergence tolerance on the evidence lower bound, tol = 10⁻⁵
1 while L(q)^(t+1) − L(q)^(t) >= tol do
2 Compute the optimal density function q(s_j) ∼ GIG(1/2, a_s_j, b_s) via Equation (41), and update µ_q(s_j), µ_q(1/s_j), µ_q(log(s_j)), where j = 1, ..., k;
3 Compute the optimal density function q*(η²) ∼ Ga(c_q, d_q) via Equation (42), and update µ_q(η), σ²_q(η);

12 Compute the evidence lower bound L(q) via Equation (27).

For the above parameters, we consider the mean-field variational family q(β, s, η², z, σ, y_mis, φ, ω) = q(β)q(s)q(η²)q(z)q(σ)q(y_mis)q(φ)q(ω). Since s = (s_1, ..., s_k) with the s_j mutually independent, q(s) = q(s_1) ⋯ q(s_k). The optimal density functions for the parameters s and η are given in Section 4.1, and those for z, σ, y_mis, φ, ω are given in Section 3.3; only the optimal density function for β is given here.
The evidence lower bound of this model is as follows, and the specific algorithm is described in Algorithm 4.
Algorithm 4: VI with Nonignorable Missing Response in Bayesian Lasso-Regularized QR Models.

Simulation Studies
In this section, we conduct four experiments to validate the effectiveness of variational inference for parameter estimation in quantile regression models with nonignorable missing data. The experiments were run under the Windows operating system on an Intel® Core i5-8400 six-core processor. The experimental data were generated from the quantile regression model y_i = X_i^T β + ε_i, where X_i = (x_i1, ..., x_ik), β = (β_0, ..., β_{k−1}), and x_i1 = 1. Furthermore, we also performed Gibbs sampling for the scenarios in Simulations I and II. We compared the results in terms of CPU time, the deviation of the parameter estimates from the true values, BIAS = (1/100) Σ_{i=1}^{100} (β̂^(i) − β), the root mean square error, RMSE = ((1/100) Σ_{i=1}^{100} (β̂^(i) − β)²)^{1/2}, and the standard deviation (SD). For Simulations I and II, we introduced approximately 10% missing data (denoted M1) and approximately 20% missing data (denoted M2) under different error distributions.
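The replication metrics can be computed as follows (a straightforward sketch of the BIAS/RMSE/SD definitions above; the simulated estimates are purely illustrative):

```python
import numpy as np

def replication_metrics(beta_hats, beta_true):
    """BIAS, RMSE, and SD of estimates over replications.
    beta_hats: (n_rep, k) array of estimates; beta_true: (k,) truth."""
    beta_hats = np.asarray(beta_hats)
    bias = beta_hats.mean(axis=0) - beta_true
    rmse = np.sqrt(((beta_hats - beta_true) ** 2).mean(axis=0))
    sd = beta_hats.std(axis=0, ddof=1)
    return bias, rmse, sd

# For (nearly) unbiased estimates, RMSE and SD should almost coincide,
# which is criterion (i) discussed below.
rng = np.random.default_rng(0)
est = 1.0 + 0.05 * rng.standard_normal((100, 1))
bias, rmse, sd = replication_metrics(est, np.array([1.0]))
print(np.allclose(rmse, sd, atol=0.01))  # True
```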

Simulation I:
In this simulation, we take n = 200 or n = 30 and k = 2, generate x_i2 from the normal distribution N(0, 0.6), and consider the case of missing covariate x_i2. When n = 200, we take the parameters β = (β_0, β_1) and σ at their specified true values. The prior distributions for the parameters β and σ are set as β ∼ N(0, I) and σ ∼ IG(1, 0.0009). We consider different distributions for ε_i and examine various missingness cases at different quantile levels within each distribution. The details of these cases are as follows. (C1): ε_i follows a normal distribution N(0, 0.7). With the true value of φ set to (2.5, −2.7), the missing rate of x_i2 is approximately 10.9%; alternatively, with φ = (0.75, −1.05), the missing rate of x_i2 is about 23%. (C2): ε_i follows a mixed normal distribution 0.9N(0, 0.7) + 0.1N(0, 1). With φ = (2.4, −2.8), the missing rate of x_i2 is approximately 11%; alternatively, with φ = (0.75, −1), the missing rate is about 22%. (C3): ε_i follows a scaled chi-squared distribution 0.3χ²_5. With φ = (2.55, −2.75), the missing rate of x_i2 is approximately 9%; alternatively, with φ = (0.75, −1), the missing rate is about 23%. When n = 30, we examined the M1 situation, maintaining the parameters and error distributions detailed above.
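For reference, a data-generating sketch for case (C1) (our own illustration: the regression coefficients are set to (1, 1) as placeholder true values, and N(0, 0.6) is read as variance 0.6; under either reading of the variance/sd convention the empirical missing rate lands roughly in the 10-15% range):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 20_000  # large n so the empirical missing rate is stable

x2 = rng.normal(0.0, np.sqrt(0.6), size=n)   # reading N(0, 0.6) as variance 0.6
X = np.column_stack([np.ones(n), x2])
beta = np.array([1.0, 1.0])                  # placeholder true coefficients
y = X @ beta + rng.normal(0.0, np.sqrt(0.7), size=n)   # error of case (C1)

# Probit missingness for x2: x2_i is observed with probability Phi(X_i^T phi)
phi = np.array([2.5, -2.7])
R = (rng.uniform(size=n) < norm.cdf(X @ phi)).astype(int)
x2_obs = np.where(R == 1, x2, np.nan)        # R_i = 0 marks a missing x2_i

print(round(float(np.mean(R == 0)), 3))      # empirical missing rate
```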

Simulation II:
In this simulation, we consider n = 200 or n = 30 and k = 2. The covariate x_i2 is randomly generated from a normal distribution N(0, 0.6). When n = 200, we focus on the case of a missing response variable y_i. The true values of the parameters are β_0 = β_1 = 1 and σ = 1. The prior distributions for the parameters β and σ are set as β ∼ N(0, 0.8I) and σ ∼ IG(1, 0.05). We consider different distributions for ε_i and examine various missingness cases at different quantile levels within each distribution. The details of these cases are as follows. (C1): ε_i follows a normal distribution N(0, 0.3). With the true value of φ set to (1.5, 1.5), the missing rate of y_i is approximately 11%; alternatively, with φ = (0.82, 0.85), the missing rate of y_i is approximately 19%. (C2): ε_i follows a mixed normal distribution 0.9N(0, 0.1) + 0.1N(0, 0.3). With φ = (2.1, 0.29), the missing rate of y_i is approximately 12%; alternatively, with φ = (1.5, 0.3), the missing rate is approximately 23%. (C3): ε_i follows a scaled chi-squared distribution 0.1χ²_3. With φ = (2, 0.45), the missing rate of y_i is approximately 9%; alternatively, with φ = (2, −0.15), the missing rate is approximately 19%. When n = 30, we examined the M1 situation, maintaining the parameters and error distributions detailed above.
By comparing the results with and without the scale parameter σ in Simulations I and II with n = 200, we observed the following for the different missingness cases and error distributions. The parameter estimation results obtained using the VI algorithm for Simulations I and II are presented in Tables 1 and 2, respectively, while the results obtained using Gibbs are displayed in Tables A1 and A2 (Appendix B). The Bayesian estimates obtained by the proposed method perform satisfactorily in that (i) they have small BIAS, RMSE, and SD, with the SD and RMSE values close to each other; and (ii) comparing Tables 1 and A1, and Tables 2 and A2, reveals that the error distribution, the missingness case, and the inclusion of the parameter σ all have a significant impact on CPU time. When covariates are subject to nonignorable missingness and σ is included, VI runs over ten times faster than Gibbs; without σ, VI is nearly 50 times faster. Similarly, in the presence of nonignorably missing response variables with σ included, VI is nearly 50 times faster than Gibbs; without σ, VI speeds up by almost 80 times. This means our proposed variational Bayesian method is faster than MCMC while preserving parameter estimation accuracy. When n = 30, although VI is almost a hundred times faster than Gibbs, its accuracy is somewhat lower; this difference is even more noticeable when response values are missing, as shown in Table 3.
Based on the good performance of variational inference observed in Simulations I and II, we employ the Bayesian lasso for variable selection in the models with missing covariates and missing response variables in Simulations III and IV, respectively. Furthermore, variational Bayesian inference achieved convergence in both Simulations I and II. For convenience, we provide the convergence results of C1 at different quantile levels of M1, both when the covariates have missing values and when the response variables have missing values, as depicted in Figure 5; see Figure 6 for run times.

Simulation IV:
In this simulation, we employ a Bayesian lasso quantile regression model for variable selection and parameter estimation in the presence of missing response variables. The data generation process is similar to Simulation III and does not include the scale parameter σ. We set β = (1, −1, 1, 0, 0, 0, 0) and assume a prior distribution of N(0, I). The error term ε_i is considered under three distributions, following the pattern of Simulations I and II when y_i is missing. (C1): ε_i follows a normal distribution N(0, 0.3); with true values φ = (0.82, 0.80), the missing rate of y_i is approximately 23%. (C2): ε_i follows a mixed normal distribution 0.9N(0, 0.1) + 0.1N(0, 0.9); with true values φ = (1.08, 0.60), the missing rate of y_i is approximately 16%. (C3): ε_i follows a scaled chi-squared distribution 0.1χ²_3; with true values φ = (1.00, 0.45), the missing rate of y_i is about 18%. In Simulations III and IV, we performed 100 replications and evaluated the performance of variable selection and parameter estimation using the following metrics: (1) the L_2 distance between the parameter estimates and the true values; (2) the mean squared error ("MSE"); (3) the number of parameters whose true value is zero and that are correctly identified as zero, recorded as "C"; and (4) the number of parameters whose true value is not zero but that are incorrectly identified as zero, recorded as "IC".
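The "C" and "IC" counts can be computed in a few lines (an illustrative sketch; the zero-threshold tol is our own choice, and the example estimates are invented):

```python
import numpy as np

def selection_metrics(beta_hat, beta_true, tol=1e-3):
    """'C': true zeros estimated as zero; 'IC': true nonzeros wrongly
    estimated as zero (estimates below tol in magnitude count as zero)."""
    est_zero = np.abs(beta_hat) < tol
    true_zero = beta_true == 0
    return int(np.sum(est_zero & true_zero)), int(np.sum(est_zero & ~true_zero))

beta_true = np.array([1.0, -1.0, 1.0, 0.0, 0.0, 0.0, 0.0])
beta_hat = np.array([0.97, -1.02, 0.97, 0.0002, 0.0, -0.0004, 0.0])
C, IC = selection_metrics(beta_hat, beta_true)
print(C, IC)  # 4 0
```

A perfect selector achieves C equal to the number of true zeros (four here) and IC equal to zero.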
We investigated three quantile levels (0.25, 0.5, 0.75), with results shown in Tables 4 and 5: (1) the "IC" value is extremely close to zero, and "C" closely approximates the number of zeros in the true parameter vector; (2) the L_2 values at the three quantile levels are below 0.03, indicating a negligible difference from the truth; (3) the "MSE" value at all three quantile levels is less than 0.09.
These findings demonstrate the strong performance of our proposed VI for variable selection.

A Real Example
In this section, we demonstrate the application of Algorithm 3 using data obtained from the American Press Institute, specifically the 1995 American University News Report dataset. The dataset can be accessed at http://lib.stat.cmu.edu/datasets/colleges (accessed on 21 April 2023). Our focus is on examining the relationship between graduation rates (y) and various factors.
We select the following factors for analysis: the public/private indicator as x 1 , the student-to-faculty ratio as x 2 , the natural logarithm of the number of applicants accepted as x 3 , the average Combined SAT score as x 4 , the average ACT score as x 5 , the natural logarithm of the number of applications received as x 6 , the natural logarithm of the number of new students enrolled as x 7 , room and board costs as x 8 , and the natural logarithm of instructional expenditure per student as x 9 .
Furthermore, to facilitate statistical inference on the data, we apply two preprocessing steps. First, we take the natural logarithm of the variables x3, x6, x7, x8, and x9. Second, we standardize all the variables to eliminate scale differences among them. These steps ensure that the data are appropriately transformed and normalized for subsequent analysis and inference.
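The two preprocessing steps (log transform, then standardization) can be expressed as a short sketch. The column names below are hypothetical, since the raw dataset's field labels are not reproduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical column names; the raw colleges dataset uses different labels.
df = pd.DataFrame({
    "accepted":     [1200.0,  800.0, 3000.0, 2200.0],
    "apps":         [2000.0, 1500.0, 5000.0, 3600.0],
    "enrolled":     [ 400.0,  300.0, 1200.0,  900.0],
    "room_board":   [4000.0, 3500.0, 5200.0, 4800.0],
    "instr_expend": [7000.0, 6500.0, 9000.0, 8200.0],
})

# Step 1: natural logarithm of the count/cost variables (x3, x6, x7, x8, x9).
df = np.log(df)

# Step 2: standardize every column to zero mean and unit variance.
df = (df - df.mean()) / df.std(ddof=0)
```

After these steps, every column has mean zero and unit variance, so the regression coefficients are comparable across covariates.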
We consider the following quantile regression model.
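In standard form (with notation assumed here, not taken verbatim from the paper), a lasso-regularized quantile regression at level τ minimizes the check loss plus an ℓ1 penalty over the nine covariates:

```latex
\hat{\beta}_\tau = \arg\min_{\beta} \sum_{i=1}^{n} \rho_\tau\!\left(y_i - x_i^{\top}\beta\right)
  + \lambda \sum_{k=1}^{9} |\beta_k|,
\qquad
\rho_\tau(u) = u\left(\tau - I(u < 0)\right).
```

Equivalently, in the Bayesian formulation the check loss is replaced by an asymmetric Laplace working likelihood with location x_i^⊤β_τ, following the representation of Bayesian quantile regression cited in the Introduction.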

Discussion
In existing research on quantile regression models with nonignorable missing data, most studies have relied on Markov chain Monte Carlo (MCMC) algorithms for Bayesian inference, despite drawbacks such as sampling difficulties and long computation times. In this paper, we propose employing a variational Bayesian algorithm for statistical inference in such models while incorporating variable selection. Specifically, we adopt a probit regression model for the missing-data mechanism of the covariates and a logistic regression model for that of the response variable, and we employ lasso regularization for variable selection in both cases. Simulation studies and a real-data example demonstrate that, when covariates or response variables are subject to nonignorable missingness in quantile regression models, the variational Bayesian method not only maintains inferential precision but also requires less computation time than MCMC.
Based on our experience, we often encounter situations where both covariates and response variables have missing values. In such settings, the variational Bayesian approach developed above faces a well-known ill-posed problem in practical applications. To address this issue, in future work we will explore variational Bayesian parameter estimation for quantile regression models when covariates and response variables are missing simultaneously.
Our proposed method has some potential drawbacks. For instance, as the number of dimensions increases, the convergence of variational Bayes may slow down. Furthermore, the required expectations may lack analytical solutions, and the assumed missing-data mechanism must be correctly specified. To address these issues, commonly used machine learning techniques such as deep learning and neural networks could be employed to model the missing-data mechanism. In addition, stochastic variational Bayesian methods can enhance the algorithm's efficiency in the future, while Bayesian dimensionality-reduction approaches can tackle high-dimensional quantile regression problems.

Figure 1. DAG for the missing covariates in the quantile regression model.

Figure 2. DAG for the missing response in the quantile regression model.

Figure 3. DAG for the missing covariates in the Bayesian lasso regularized quantile regression model.

Figure 4. DAG for the missing response in the Bayesian lasso regularized quantile regression model.

Algorithm 3: VI with nonignorable missing covariates in Bayesian lasso regularized QR models.

Figure 6. Comparison of CPU time of variational Bayes versus Gibbs sampling when covariates are missing.

Figure 7. Convergence of the evidence lower bound in the case analysis.
(1) Missing Completely at Random (MCAR): the missingness is related neither to the observed data nor to the missing data. When x_ik or y_i is missing, the probability of missingness P(R_i | x_ik, y_i, φ) follows a link function H(·) parameterized by φ that involves no data, i.e., P(R_i | x_ik, y_i, φ) = H(R_i | φ). (2) Missing at Random (MAR): the missingness is related only to the observed data. For example, when the covariate x_ik is missing, the probability of missingness is conditional only on the observed response y_i, i.e., P(R_i | x_ik, y_i, φ) = H(R_i | y_i, φ). (3) Missing Not at Random (MNAR): the missingness is related to the missing data itself. For example, when the covariate x_ik is missing, the probability of missingness depends on the unobserved value x_ik, i.e., P(R_i | x_ik, y_i, φ) = H(R_i | x_ik, φ).
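The three mechanisms can be contrasted in a minimal sketch, assuming a logistic link for H(·) and hypothetical parameter and data values:

```python
import numpy as np

def H(z):
    """Logistic link; the paper's H(.) may be any link function."""
    return 1.0 / (1.0 + np.exp(-z))

x_ik, y_i = 0.7, -1.2   # hypothetical covariate and response values
phi = (0.4, 0.9)        # hypothetical propensity parameters

p_mcar = H(phi[0])                   # MCAR: depends on no data at all
p_mar  = H(phi[0] + phi[1] * y_i)    # MAR: depends only on observed y_i
p_mnar = H(phi[0] + phi[1] * x_ik)   # MNAR: depends on the missing x_ik
```

Only the MNAR case makes the propensity depend on a value that is itself unobserved, which is why it requires the joint modeling of the data and the missingness mechanism developed in this paper.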

Bayesian Variable Selection of the QR Model
Covariates with Nonignorable Missing Variables

Table 1. Parameter estimation and CPU time in Simulation I using VI, n = 200.

Table 2. Parameter estimation and CPU time in Simulation II using VI, n = 200.

Table 3. Parameter estimation and CPU time in Simulation I and Simulation II, n = 30.

Table 4. Results of variable selection for Simulation III.

Table 5. Results of variable selection for Simulation IV.

Table 6. Bayesian estimates (EST) and 95% credible intervals (CI) of the parameters in the real example.

Table A1. Parameter estimation and CPU time in Simulation I using Gibbs, n = 200.

Table A2. Parameter estimation and CPU time in Simulation II using Gibbs, n = 200.