Variational Bayesian Inference in High-Dimensional Linear Mixed Models

Abstract: In high-dimensional regression models, the Bayesian lasso with Gaussian spike and slab priors is widely adopted to select variables and estimate unknown parameters. However, it involves large matrix computations in a standard Gibbs sampler. To solve this issue, the Skinny Gibbs sampler is employed to draw the observations required for Bayesian variable selection. However, when the sample size is much smaller than the number of variables, the computation is rather time-consuming. As an alternative to the Skinny Gibbs sampler, we develop a variational Bayesian approach to simultaneously select variables and estimate parameters in high-dimensional linear mixed models under Gaussian spike and slab priors on the population-specific fixed-effects regression coefficients, which are reformulated as a mixture of a normal distribution and an exponential distribution. A coordinate ascent algorithm, which can be implemented efficiently, is proposed to optimize the evidence lower bound. The Bayes factor, which can be computed with the path sampling technique, is presented to compare two competing models in the variational Bayesian framework. Simulation studies are conducted to assess the performance of the proposed variational Bayesian method, and an empirical example is analyzed by the proposed methodologies.


Introduction
Linear mixed models are widely used to analyze longitudinal and correlated data in many fields, such as psychology, medicine, epidemiology and econometrics, by accounting for the between-subject and within-subject variations and incorporating random effects to capture heterogeneity among subjects. Various methods have been developed to estimate fixed-effects parameters and the variance-covariance matrices of the unobservable random effects and noises, or to select fixed-effects and random-effects components, even though variable selection and parameter estimation in linear mixed models are quite challenging. For example, see [1] for restricted maximum likelihood estimation of parameters, ref [2] for the EM algorithm for parameter estimation, refs [3,4] for Bayesian parameter estimation, ref [5] for Bayesian random effects selection and [6] for a moment-based method for random effects selection. The aforementioned methods mainly focus on low-dimensional linear mixed models, while high-dimensional data have become increasingly common with the rapid development of modern information technologies that facilitate data collection. Thus, the aforementioned methods do not work well in high-dimensional linear mixed models, and some penalized methods have been developed to simultaneously estimate parameters and select variables in high-dimensional linear mixed models. For example, Bondell, Krishna and Ghosh [7] and Ibrahim et al. [8] proposed penalized likelihood methods for joint selection of fixed and random effects; Schelldorfer, Bühlmann and van de Geer [9] proposed an ℓ1-penalized estimation procedure; Fan and Li [10] investigated the problem of fixed and random effects selection when the cluster sizes are balanced; Li et al.
[11] presented a doubly regularized approach to simultaneously select fixed and random effects; Bradic, Claeskens and Gueuning [12] considered testing a single parameter of the fixed effects in high-dimensional linear mixed models with fixed cluster sizes, fixed numbers of random effects and sub-Gaussian designs; Li, Cai and Li [13] proposed a penalized quasi-likelihood method for statistical inference on unknown parameters in high-dimensional linear mixed-effects models. However, the aforementioned regularization methods are computationally complex and unstable, and they do not incorporate prior information on the fixed-effects parameters and variance-covariance matrices, which may lead to unsatisfactory estimation accuracy for parameters or variance-covariance matrices. Bayesian approaches for variable selection and parameter estimation have received much attention over the past years: by imposing various priors on the model parameters, they can largely improve the accuracy and efficiency of parameter estimation, consistently select important variables and provide more information for variable selection than the corresponding penalization methods, which involve highly non-convex optimization problems. For example, see [14] for the reference prior, ref [15] for the normal mixture prior, ref [16] for the spike and slab prior, ref [17] for the horseshoe prior and [18] for the shrinking and diffusing prior. In the high-dimensional setting, the Bayesian lasso, the Bayesian adaptive lasso or the indicator model method, together with the Markov chain Monte Carlo (MCMC) algorithm, is widely used to select important variables. For example, see [19] for the Bayesian lasso, ref [20] for the Bayesian adaptive lasso and [21,22] for the EM approach in the Bayesian framework. The above-mentioned literature involves the implementation of the standard Gibbs sampler for posterior computation, which is not scalable to large numbers of fixed-effects components [23].
To address this issue, Narisetty, Shen and He [23] proposed the Skinny Gibbs algorithm, which replaces the high-dimensional variance-covariance matrix with a sparse matrix and thereby avoids large matrix operations. However, implementing the above MCMC algorithms in the presence of high-dimensional data may still suffer from the well-known ill-posed problems, i.e., low efficiency, slow convergence and huge memory requirements.
As an alternative to MCMC, the variational Bayesian method, also called ensemble learning, is widely adopted to approximate the intractable integrals involved in Bayesian inference or machine learning due to its good properties, such as high-speed computation. Its basic idea is to transform the high-dimensional integration problem arising in Bayesian inference into an optimization problem: one optimizes the evidence lower bound (ELB), which can be computed efficiently, and then uses the optimized ELB to obtain a variational approximation to the posterior distribution. The variational Bayesian approach has been applied to some familiar models, for example, latent variable models [24], mixtures of factor analyzers [25], graphical models [26] and partially linear mean shift models with high-dimensional data [27].
Motivated by the aforementioned variational Bayesian studies, we develop a novel variational Bayesian approach to estimate model parameters and select important variables under the Skinny Gibbs sampling framework in a linear mixed model with low-dimensional random effects and high-dimensional fixed effects. We specify the spike and slab priors for the population-specific fixed-effects regression coefficients with distinct shrinkage parameters for the spike and slab components, which overcomes the problem of selecting a high-dimensional vector of shrinkage parameters. We reformulate the spike and slab priors of the parameters as a mixture of a normal distribution and an exponential distribution, which avoids the high-dimensional integration problem. The coordinate ascent algorithm, which can be implemented efficiently, is proposed to optimize the ELB. The Bayes factor, which can be computed with the path sampling technique, is presented to compare two competing models in the variational Bayesian framework. The merits of the proposed variational Bayesian method are (i) simultaneously estimating parameters and variance-covariance matrices and selecting fixed- and random-effects components at quite a low computational cost, (ii) efficiently analyzing high-dimensional data without requiring non-convex optimization and avoiding the curse of dimensionality, (iii) automatically incorporating the shrinkage parameters and (iv) avoiding large matrix computations.
The rest of the article is organized as follows: Section 2 introduces the linear mixed model setup, including the spike and slab priors. Section 3 describes the Skinny Gibbs sampler algorithm for selecting fixed- and random-effects components and estimating the regression coefficients and variance-covariance matrices via the Bayesian lasso method. Section 4 develops a variational Bayesian approach to approximate the posterior distributions of parameters and random effects and presents the Bayes factor for model comparison.
The coordinate ascent algorithm adopted to optimize the ELB is also given in Section 4. Simulation studies are presented in Section 5. An empirical example is analyzed by the proposed methodologies in Section 6. A brief discussion is given in Section 7. Technical details are presented in Appendix A, Appendix B and Appendix C.

Model
Consider a dataset with n subjects. For the ith subject, let y_ij be the observation of the response variable, x_ij be a p × 1 vector of covariates associated with the fixed effects and z_ij be a q × 1 vector of covariates associated with the random effects, which may be a subvector of x_ij, for j = 1, . . . , n_i, where n_i is the number of repeated observations on the ith subject. Generally, n_i varies across subjects. For simplicity, we suppose that y_ij has been centered at zero to avoid the need for an intercept and that n_1 = . . . = n_n = m, i.e., a balanced design. It is assumed that p ≫ n and that only a small number of the covariates in x_ij contribute to the response variable y_ij, i.e., the coefficient vector is sparse, while q is smaller than n.
For the dataset D = {(y_ij, x_ij, z_ij) : i = 1, . . . , n, j = 1, . . . , m}, we consider the following linear mixed model:

y_ij = x_ij β + z_ij b_i + ε_ij,

where β = (β_1, . . . , β_p) is a p × 1 vector of population-specific fixed-effects regression coefficients, b_i is a q × 1 vector of subject-specific random effects and ε_ij is measurement error. Here, we assume that b_1, . . . , b_n are independent and identically distributed (i.i.d.) as the multivariate normal distribution with mean zero and covariance matrix Q, and that the ε_ij's are independently distributed as N(0, σ²_j), where N(·, ·) represents the normal distribution. Here, σ²_1, . . . , σ²_m are not necessarily all distinct; some of them may be identical. Under the aforementioned assumptions, a penalized likelihood approach estimates β by

β̂ = arg min_β {−ℓ(β|D) + f_λ(β)},

where ℓ(β|D) denotes the log-likelihood and f_λ(β) is some appropriate penalty function indexed by the penalty parameter λ. In the variable selection literature, f_λ(β) is usually taken to be, for example, the ℓ1-norm, the MCP penalty [28], the SCAD penalty [29] or the Elastic-Net penalty [30]. Recently, it has been widely recognized that β̂ can be regarded as a posterior mode of β under some proper prior. Inspired by this idea, we consider a Bayesian variable selection procedure based on some proper prior on β.
Following [31], we consider the following spike and slab prior for β:

f(β_k|γ_k) = γ_k g_1(β_k|λ_1) + (1 − γ_k) g_0(β_k|λ_0), k = 1, . . . , p,

where γ = (γ_1, . . . , γ_p), in which γ_k is a binary latent variable following a Bernoulli distribution with probability ρ_k = Pr(γ_k = 1), i.e., γ_k = 1 indicates that the kth covariate is active and γ_k = 0 implies that the kth covariate is inactive; g_1(β_k|λ_1) is usually referred to as a diffuse slab prior reflecting the effect of an active covariate, while g_0(β_k|λ_0) is called a concentrated spike prior reflecting the negligible effect of an inactive covariate, for k = 1, . . . , p. Let f(γ|ρ) be the prior distribution of γ indexed by ρ. It is assumed that f(γ|ρ) has the form

f(γ|ρ) = ∏_{k=1}^{p} ρ_k^{γ_k} (1 − ρ_k)^{1−γ_k},

where ρ = (ρ_1, . . . , ρ_p). For simplicity, we assume ρ_1 = . . . = ρ_p = ρ, which is the expected proportion of active covariates. Generally, one can take g_0(·) and g_1(·) to be normal distributions with a small and a large variance, respectively. However, for the spike and slab lasso, we take the slab and spike priors

g_1(β_k|λ_1) = (λ_1/2) exp(−λ_1|β_k|) and g_0(β_k|λ_0) = (λ_0/2) exp(−λ_0|β_k|),

respectively, where λ_1 should tend to zero and λ_0 should tend to ∞ as the sample size becomes sufficiently large, which implies that the inactive covariates will be detected as zeros in that small values of β_k relative to 1/λ_0 are truncated to zero. Following [32], the density g(β_k|λ) = (λ/2) exp(−λ|β_k|) can be hierarchically written as a mixture of a normal distribution and an exponential distribution, i.e.,

(λ/2) exp(−λ|β_k|) = ∫_0^∞ N(β_k|0, ξ²) (λ²/2) exp(−λ²ξ²/2) dξ².

Incorporating the above idea shows that the posterior distributions of the binary latent variables can be employed to distinguish active covariates from inactive ones in the considered model.
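To make the normal-exponential representation concrete, the following sketch (an illustration we add here, not code from the paper; the values of ρ, λ_0 and λ_1 are made up) draws each β_k from the spike and slab lasso prior by first sampling the mixing variance ξ² from an exponential distribution with rate λ²/2 and then sampling β_k ~ N(0, ξ²), which marginally yields the Laplace density (λ/2)exp(−λ|β_k|):

```python
import numpy as np

def sample_laplace_via_mixture(lam, size, rng):
    """Draw from (lam/2)*exp(-lam*|beta|) via the scale mixture:
    xi^2 ~ Exp(rate = lam^2/2), then beta | xi^2 ~ N(0, xi^2)."""
    xi2 = rng.exponential(scale=2.0 / lam**2, size=size)  # mean 2/lam^2
    return rng.normal(loc=0.0, scale=np.sqrt(xi2))

def sample_spike_slab(rho, lam0, lam1, p, rng):
    """Spike-and-slab lasso prior: gamma_k ~ Bernoulli(rho); diffuse slab
    Laplace(lam1) if gamma_k = 1 (lam1 small), concentrated spike
    Laplace(lam0) if gamma_k = 0 (lam0 large)."""
    gamma = rng.binomial(1, rho, size=p)
    beta = np.where(gamma == 1,
                    sample_laplace_via_mixture(lam1, p, rng),
                    sample_laplace_via_mixture(lam0, p, rng))
    return gamma, beta

rng = np.random.default_rng(0)
gamma, beta = sample_spike_slab(rho=0.05, lam0=50.0, lam1=0.5, p=20000, rng=rng)
# Spike draws concentrate near zero (E|beta_k| = 1/lam0), while slab
# draws are diffuse (E|beta_k| = 1/lam1), mirroring the prior above.
```

Averaging |β_k| within each group recovers 1/λ_0 and 1/λ_1, respectively, which is exactly the mechanism that lets the posterior of γ_k separate active from inactive covariates.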
For the covariance matrix Q, the proportion ρ, λ²_0, λ²_1 and σ²_j, we consider the following priors:

Q ∼ IW(S_0, ν_0), ρ ∼ Beta(a_γ, b_γ), λ²_0 ∼ Γ(c_0, d_0), λ²_1 ∼ Γ(c_1, d_1), σ²_j ∼ IG(c_2, d_2),

where IW(·, ·) denotes the inverted Wishart distribution, Beta(·, ·) represents the Beta distribution, Γ(·, ·) is the gamma distribution, IG(·, ·) is the inverse gamma distribution and S_0, ν_0, a_γ, b_γ, c_0, d_0, c_1, d_1, c_2 and d_2 are user-specified hyperparameters. As mentioned above, λ_1 should tend to zero and λ_0 should tend to ∞ as the sample size becomes sufficiently large, which implies that c_1/d_1 should be smaller than c_0/d_0. To this end, we assume c_1 ≪ c_0 and d_0 ≪ d_1. Based on the above discussion, we can rewrite the considered linear mixed model together with the spike and slab lasso prior as a set of hierarchical models.
To avoid expensive computation in running the Gibbs sampler, similarly to [23], at each Gibbs iteration we divide the parameter vector β into two subvectors corresponding to the active (i.e., γ_k = 1) and inactive (i.e., γ_k = 0) covariates, respectively. To wit, we define β = (β_A, β_I), where β_A and β_I are the subvectors of β associated with γ_k = 1 and γ_k = 0, respectively. Suppose that the cardinality of the active set A is r. Without loss of generality, it is assumed that the first r components of β correspond to β_A and the last p − r components correspond to β_I. Similarly, we decompose x_ij as x_ij = (x_ijA, x_ijI). Under the above assumptions, the Gibbs sampler is implemented by iteratively drawing the observations required at each Gibbs iteration from the conditional distributions given in Appendix A. Although the Skinny Gibbs sampler introduced above can be easily conducted, it is rather time-consuming for a sufficiently large p. To address this issue, we investigate a fast yet efficient alternative, namely, the variational Bayesian method.
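The computational benefit of the active/inactive partition can be sketched in a few lines. The toy fragment below is our own simplified illustration (made-up dimensions, a Gaussian spike and slab instead of the Laplace mixture, and a single iteration): the active block β_A is drawn from its r-dimensional multivariate normal full conditional, so the only matrix factorized is r × r, while the inactive components are updated independently, which is the key idea of the Skinny Gibbs step:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p, r = 60, 300, 5                      # stacked rows (n*m), p >> r active
X = rng.normal(size=(N, p))
beta_true = np.zeros(p)
beta_true[:r] = 1.0
y = X @ beta_true + rng.normal(scale=0.5, size=N)

gamma = np.zeros(p, dtype=bool)
gamma[:r] = True                          # current inclusion indicators
sigma2, tau1, tau0 = 0.25, 10.0, 1e-3     # noise var, slab var, spike var

# Active block: r-dimensional Gaussian full conditional (r x r solve only).
XA = X[:, gamma]
prec_A = XA.T @ XA / sigma2 + np.eye(gamma.sum()) / tau1
cov_A = np.linalg.inv(prec_A)
mean_A = cov_A @ (XA.T @ y) / sigma2
beta_A = rng.multivariate_normal(mean_A, cov_A)

# Inactive block: cross-correlations are skipped, so each component is
# drawn independently from its concentrated spike; no (p-r) x (p-r)
# matrix is ever formed.
beta_I = rng.normal(scale=np.sqrt(tau0), size=(~gamma).sum())
```

The full Skinny Gibbs sampler of [23] additionally corrects for the dropped cross-terms when updating γ; the fragment only shows why the per-iteration cost scales with r rather than p.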

Variational Bayes
It follows from the principle of variational inference that it is necessary to first construct a variational set F of densities for the random variables Ξ having the same support as the posterior density f(Ξ|D), where Ξ = {β, b, ξ_0, ξ_1, Q, γ, σ², ϑ}. It is assumed that q(Ξ) ∈ F is any variational density for approximating f(Ξ|D). Variational Bayes aims to find the best approximation to f(Ξ|D) in terms of the Kullback-Leibler divergence between q(Ξ) and f(Ξ|D), i.e., to solve the optimization problem

q*(Ξ) = arg min_{q(Ξ)∈F} KL(q(Ξ)‖f(Ξ|D)), where KL(q(Ξ)‖f(Ξ|D)) = E_{q(Ξ)}{log q(Ξ) − log f(Ξ|D)},

in which E_{q(Ξ)}(·) is the expectation taken with respect to q(Ξ). Here, KL(q(Ξ)‖f(Ξ|D)) equals zero if and only if q(Ξ) ≡ f(Ξ|D). Due to the intractable high-dimensional integral involved in f(Ξ|D), it is quite troublesome to solve the above optimization problem directly.
Writing L{q(Ξ)} = E_{q(Ξ)}{log f(Y, Ξ|X, Z)} − E_{q(Ξ)}{log q(Ξ)}, we obtain the decomposition log f(Y|X, Z) = L{q(Ξ)} + KL(q(Ξ)‖f(Ξ|D)). Since the KL divergence is nonnegative, L{q(Ξ)} is a lower bound of log f(Y|X, Z) and is usually referred to as the evidence lower bound (ELB). Then, minimizing KL(q(Ξ)‖f(Ξ|D)) is equivalent to maximizing L{q(Ξ)} in that log f(Y|X, Z) does not depend on q(Ξ). That is, the problem of finding the best approximation to f(Ξ|D) is transformed into an optimization problem of maximizing L{q(Ξ)} over the variational family F. The complexity of this optimization problem is determined by that of the variational set F. Thus, it is rather desirable to carry out the optimization over a relatively simple variational set F.
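The decomposition of the log evidence into the ELB plus a KL term can be checked numerically on any toy model. The snippet below is our own illustrative example (a single binary latent variable with tabulated densities, unrelated to the linear mixed model): for an arbitrary variational distribution q, the ELB plus the KL divergence recovers the log evidence exactly, and the KL term is nonnegative, so the ELB is indeed a lower bound:

```python
import numpy as np

# Toy model: binary latent z, one fixed observation y, all densities tabulated.
prior = np.array([0.6, 0.4])       # p(z)
lik = np.array([0.2, 0.9])         # p(y | z) evaluated at the observed y
joint = prior * lik                # p(y, z)
evidence = joint.sum()             # p(y)
posterior = joint / evidence       # p(z | y)

q = np.array([0.5, 0.5])           # any variational distribution over z

elbo = np.sum(q * (np.log(joint) - np.log(q)))    # E_q[log p(y,z) - log q(z)]
kl = np.sum(q * (np.log(q) - np.log(posterior)))  # KL(q || p(z|y))

# The identity log p(y) = ELB + KL holds for every q, with KL >= 0.
print(np.log(evidence), elbo + kl)
```

Maximizing the ELB over q therefore drives the KL term to zero, which is precisely the optimization problem described above.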
Following the widely used methods for constructing a relatively simple variational set, we take F to be the mean-field variational family, in which the components of Ξ are mutually independent and each has a distinct factor in the variational density. Thus, we assume that the variational density q(Ξ) has the form

q(Ξ) = q(ζ_1, . . . , ζ_S) = ∏_{s=1}^{S} q_s(ζ_s),

where the factors q_s(ζ_s) are unspecified but the above factorization across components is prespecified. As in much of the variational literature, the optimal q_s(ζ_s) can be obtained by maximizing L{q(ζ_1, . . . , ζ_S)} via the coordinate ascent method. Following the idea of the coordinate ascent method given in [33-35], when fixing the other variational factors q_j(ζ_j) for j ≠ s, i.e., ζ_{−s} = {ζ_j : j ≠ s, j = 1, . . . , S}, the optimal variational density q*_s(ζ_s) maximizing L{q(Ξ)} with respect to q_s(ζ_s) has the form

q*_s(ζ_s) ∝ exp{E_{−s}[log f(ζ_s|ζ_{−s}, D)]}, (15)

where f(ζ_s|ζ_{−s}, D) is the conditional density of ζ_s given (ζ_{−s}, D) and E_{−s}(·) represents the expectation evaluated with respect to q_{−s}(ζ_{−s}) = ∏_{j≠s} q_j(ζ_j). Equation (15) implies that E_{−s}(·) does not involve the sth variational factor q_s(ζ_s); however, the optimal variational density q*_s(ζ_s) cannot be obtained in one step because the q_{−s}(ζ_{−s}) on the right-hand side are not yet the optimal ones. To address this issue, the coordinate updating algorithm is employed to iteratively update q*_s(ζ_s) via Equation (15). After the coordinate updating algorithm converges, we can take the mean or mode of the optimal variational density q*_s(ζ_s) as a variational Bayesian estimate of the parameter vector ζ_s and regard a covariate as active if its corresponding variational Bayesian estimate deviates from zero.
It is easily shown from Equation (15) that the optimal density q*_β(β) factorizes over the active and inactive subvectors β_A and β_I, each being multivariate normal, so that the estimated posterior means and variance matrices of β_A and β_I for the variational densities are available in closed form. Similarly, the optimal density q*_{b_i}(b_i) is multivariate normal, and the estimated posterior mean and variance matrix of b_i follow directly. For the latent indicators, let C_k = {ℓ : γ_ℓ = 1, ℓ ≠ k} = A\{k} denote the index set with the kth index deleted from the set A. The latent variable γ_k is then sampled from the Bernoulli distribution with probability ς_k = ℓ_k/(ℓ_k + 1), i.e., γ_k|D, b, σ ∼ Bernoulli(ς_k) for k = 1, . . . , p, and the estimated posterior mean and variance of γ_k for the variational density q*_{γ_k}(γ_k) are those of this Bernoulli distribution.
The optimal density q*_Q(Q) is of inverted Wishart form with updated hyperparameters S*_0 and ν*_0, and the mode estimator Q^q of Q is given by Q^q = S*_0/(ν*_0 + q + 1). The optimal density of σ^{−2}_j is of gamma form, yielding the mode estimator σ^{−2q}_j of σ^{−2}_j for j = 1, . . . , m. The optimal density q*_ρ(ρ) is of Beta form, from which the mode estimator of ρ is obtained. Finally, the optimal densities q*_{λ_0}(λ²_0) and q*_{λ_1}(λ²_1) are of gamma form, giving the mode estimators λ^{2q}_0 and λ^{2q}_1 of λ²_0 and λ²_1 for the variational densities q*_{λ_0}(λ²_0) and q*_{λ_1}(λ²_1).

Optimizing L{q(Ξ)} via Coordinate Ascent Algorithm
The elaborated steps for optimizing L{q(Ξ)} via the coordinate ascent algorithm are given below:
Step (a) Given the initial values of the variational densities, compute the lower bound L{q(Ξ)} (denoted as L^(0){q(Ξ)}) and set κ = 1.
Step (b) Compute each optimal variational density in turn via Equation (15) and update the corresponding expectation; for example, compute q*_{b_i}(b_i) and update E*_{b_i}(b_i), compute q*_Q(Q) and update E*_Q(Q), and similarly for the remaining factors of Ξ.
Step (c) Compute the lower bound L^(κ){q(Ξ)}; if the increase L^(κ){q(Ξ)} − L^(κ−1){q(Ξ)} is smaller than a prespecified tolerance, stop; otherwise, set κ = κ + 1 and return to Step (b).
The coordinate ascent algorithm presented above for computing variational Bayesian estimates of parameters is summarized as Algorithm 1 and converges to the solution of the optimization problem (13) because it satisfies the well-known KKT condition for the considered model.
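As a self-contained illustration of how such coordinate ascent updates alternate and converge, the sketch below applies the same scheme to the classical normal model with unknown mean and precision (a standard textbook example, not the linear mixed model above; the data, priors and tolerance are made up). Each pass updates q(μ) holding q(τ) fixed and then q(τ) holding q(μ) fixed, exactly in the spirit of Steps (a)-(c):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=2.0, scale=0.5, size=500)   # data: N(mu, 1/tau), tau = 4
N, xbar = x.size, x.mean()

mu0, lam0, a0, b0 = 0.0, 1e-3, 1e-3, 1e-3      # weak conjugate hyperparameters
E_tau = 1.0                                    # initial guess for E_q[tau]

for _ in range(50):                            # coordinate ascent sweeps
    # Update q(mu) = N(mu_N, 1/lam_N), holding q(tau) fixed.
    mu_N = (lam0 * mu0 + N * xbar) / (lam0 + N)
    lam_N = (lam0 + N) * E_tau
    # Update q(tau) = Gamma(a_N, b_N), holding q(mu) fixed.
    a_N = a0 + 0.5 * (N + 1)
    E_sq = np.sum((x - mu_N) ** 2) + N / lam_N  # E_q[sum_i (x_i - mu)^2]
    b_N = b0 + 0.5 * (E_sq + lam0 * ((mu_N - mu0) ** 2 + 1.0 / lam_N))
    E_tau_new = a_N / b_N
    if abs(E_tau_new - E_tau) < 1e-10:          # convergence monitor
        break
    E_tau = E_tau_new
# mu_N approximates the posterior mean of mu; a_N/b_N approximates E[tau].
```

In the linear mixed model the same loop cycles through the factors of Ξ, with the ELB computed at the end of each sweep serving as the stopping criterion.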

Model Comparison
The Bayes factor is a vital statistic for model comparison within the Bayesian framework and is widely employed to choose a better model among the considered competing models due to its merits for model selection: (i) it is a consistent selector; (ii) it plays the part of an Occam's razor, preferring the simpler model for similar fits; (iii) it does not need the models to be nested. For instance, see [36] for structural equation models and [37] for non-ignorable missing data. Denote f(Y|X, Z, Ξ_h, H_h) as the probability density of the data {Y, X, Z} associated with the model H_h, where Ξ_h is the parameter vector in the model H_h. Define f(Ξ_h|H_h) as the prior of Ξ_h for h = 0, 1. The Bayes factor for comparing two competing models H_0 and H_1 can be written as

B_10 = f(Y|X, Z, H_1)/f(Y|X, Z, H_0),

where f(Y|X, Z, H_h) is the marginal likelihood for the model H_h for h = 0 and 1. However, computing the Bayes factor B_10 is a non-trivial task for our considered high-dimensional linear mixed model because of the intractable integral involved. Considerable methods have been developed to compute the marginal likelihood f(Y|X, Z, H_h) or the Bayes factor, for example, Laplace's method [38], annealed importance sampling [39], bridge sampling [40], path sampling (also called thermodynamic integration) [41], nested sampling [42], power posteriors [43] and a hybrid method combining simulation and asymptotic approximations [44]. For a comprehensive review, refer to [45]. Here, a path sampling or thermodynamic integration method is adopted to compute B_10 via a link model indexed by a continuous parameter ζ ∈ [0, 1] such that H_ζ reduces to H_0 at ζ = 0 and to H_1 at ζ = 1. Define

Q(ζ) = ∫ f(Y, ζ|X, Z, Ξ) f(Ξ) dΞ,

where f(Y, ζ|X, Z, Ξ) is the density of Y given X and Z under H_ζ and f(Ξ) is the prior of Ξ. Under the above definition, it is easily seen that Q(0) = f(Y|X, Z, H_0) and Q(1) = f(Y|X, Z, H_1). Following the argument of [41], we obtain

log B_10 = log Q(1) − log Q(0) = ∫_0^1 E{U(Y, ζ, Ξ|X, Z)} dζ, (26)

where E(·) represents the expectation taken with respect to the conditional density f(Ξ, ζ|Y, X, Z) and U(Y, ζ, Ξ|X, Z) = d log f(Y, ζ, Ξ|X, Z)/dζ.
Thus, applying thermodynamic integration [41] or the powered posteriors method [43] to Equation (26), log B_10 can be estimated by the trapezoidal rule

log B̂_10 = (1/2) ∑_{ℓ=1}^{L} (ζ^(ℓ) − ζ^(ℓ−1)) (Ū^(ℓ) + Ū^(ℓ−1)), with Ū^(ℓ) = J^{−1} ∑_{τ=1}^{J} U(Y, ζ^(ℓ), Ξ^(τ)|X, Z),

in which {Ξ^(τ) : τ = 1, . . . , J} are observations sampled from the variational density q*(Ξ|ζ^(ℓ)) for ℓ = 0, 1, . . . , L. Following [46], H_1 is selected when log B_10 > 1; otherwise, H_0 is selected.
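The mechanics of this estimator can be demonstrated on a toy problem we construct here (not from the paper): estimating the log ratio of the normalizing constants of two zero-mean normal densities, for which the true answer is log σ. The path density at each grid point ζ^(ℓ) is sampled exactly, the path derivative U is averaged over J draws, and the averages are combined by the trapezoidal rule, mirroring the formula above:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 2.0                # target: log(Z1/Z0) = log(sigma)
L, J = 10, 20000           # grid points on the path, draws per grid point

def U(x):
    # d/dzeta of log q_zeta(x), with unnormalized path density
    # q_zeta(x) = exp(-x^2/2)^(1-zeta) * exp(-x^2/(2*sigma^2))^zeta
    return -x**2 / (2.0 * sigma**2) + x**2 / 2.0

zetas = np.linspace(0.0, 1.0, L + 1)
means = []
for z in zetas:
    prec = (1.0 - z) + z / sigma**2                  # p_zeta is N(0, 1/prec)
    x = rng.normal(scale=np.sqrt(1.0 / prec), size=J)
    means.append(U(x).mean())                        # Monte Carlo E_{p_zeta}[U]

m = np.array(means)
log_B = np.sum((m[:-1] + m[1:]) / 2.0 * np.diff(zetas))  # trapezoidal rule
# log_B estimates log(sigma) = log 2
```

In the paper's setting the exact path samples are replaced by draws Ξ^(τ) from the variational density q*(Ξ|ζ^(ℓ)), but the quadrature over the grid ζ^(0), . . . , ζ^(L) is identical.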

Simulation Studies
Several simulation studies are implemented to assess the performance of the introduced variational Bayesian methodologies. For comparison, we also take the Bayesian lasso method into consideration. In this simulation study, the response variables y_ij are independently sampled from the normal distribution y_ij ∼ N(x_ij β + z_ij b_i, σ²_j), where x_ij, z_ij and b_i are independently drawn from the multivariate normal distributions N_p(0, Σ_x), N_q(0, I) and N_q(0, Q), respectively, for i = 1, . . . , n, j = 1, . . . , m. The true value of β is taken to be (−0.5, 0.8, 2, 0.8, 0.5, 0.0, . . . , 0.0), which implies that there are five active variables and p − 5 inactive variables. As an illustration, we set m = 6, q = 4, n = 100, 200 and 300, and p = 500, 1000 and 2000, so that n ≪ p. The true values of the σ²_j's are set to be σ²_1 = σ²_2 = 0.8, σ²_3 = σ²_4 = 0.9 and σ²_5 = σ²_6 = 1.0. The true value of Q is taken with diagonal elements equal to 1.0 and the remaining components equal to 0.1.
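The data-generating design above can be reproduced in a few lines; the sketch below (our own illustration, shown for n = 100 and p = 500 under the Type I covariate structure) draws the covariates, random effects and heteroscedastic errors exactly as specified and assembles the responses:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, p, q = 100, 6, 500, 4

beta = np.zeros(p)
beta[:5] = [-0.5, 0.8, 2.0, 0.8, 0.5]                 # five active covariates
sigma2 = np.array([0.8, 0.8, 0.9, 0.9, 1.0, 1.0])     # per-visit noise variances
Q = np.full((q, q), 0.1) + np.diag(np.full(q, 0.9))   # diag 1.0, off-diag 0.1

# Type I covariate structure: independent components, Sigma_x = I_p.
X = rng.normal(size=(n, m, p))
Z = rng.normal(size=(n, m, q))
b = rng.multivariate_normal(np.zeros(q), Q, size=n)   # subject random effects

eps = rng.normal(scale=np.sqrt(sigma2), size=(n, m))  # broadcast over visits j
y = np.einsum("imp,p->im", X, beta) + np.einsum("imq,iq->im", Z, b) + eps
```

The marginal variance of y_ij is βᵀβ + tr(Q) + σ²_j ≈ 10.7 under this design, which is a quick sanity check on any implementation of the generator.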
We consider the following two types of covariance structures for Σ x = (σ x jk ) p×p .
Type I. Components of the covariate vector x_ij are independent of each other, i.e., σ^x_jk = 0.0 when j ≠ k and σ^x_jj = 1.0, for 1 ≤ j, k ≤ p. Type II. x_ij has an autoregressive correlation structure, i.e., σ^x_jk = 0.5^|j−k| when j ≠ k and σ^x_jj = 1.0, for 1 ≤ j, k ≤ p.
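Both covariance structures are simple to construct; the helper below (our own illustration, with a hypothetical function name) builds Σ_x for either type, the Type II matrix via a vectorized 0.5^|j−k| outer difference:

```python
import numpy as np

def sigma_x(p, structure):
    """Covariance of the covariate vector x_ij for the two simulation designs."""
    if structure == "I":                  # independent components
        return np.eye(p)
    if structure == "II":                 # AR(1)-type: sigma_jk = 0.5^|j-k|
        idx = np.arange(p)
        return 0.5 ** np.abs(idx[:, None] - idx[None, :])
    raise ValueError(structure)

S1 = sigma_x(5, "I")
S2 = sigma_x(5, "II")   # unit diagonal, geometrically decaying off-diagonals
```

Under Type II, adjacent covariates have correlation 0.5 and the correlation decays geometrically with the index distance, which makes variable selection harder than under Type I.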
In implementing the variational Bayesian approach presented above together with the spike and slab priors, we take the hyperparameters ν_0 = 1 and S_0 = 0.02I_{q×q}, leading to a flat prior for Q, and set a_γ = b_γ = 0.5. For the spike and slab priors of the β_k's, to achieve appropriate shrinkage and model selection consistency, we take c_0 = 500 and c_1 = 0.3, indicating c_1 ≪ c_0, and d_0 = 5 and d_1 = 30, implying d_0 ≪ d_1, which guarantees the sparsity of the model. In this simulation, 100 replications are conducted to select active variables and estimate model parameters. To assess the accuracy of parameter estimation via the proposed variational Bayesian method, we calculate the average value of the RMSes for the unknown parameters, where "RMS" indicates the root mean square between the Bayesian estimates based on 100 replications and the true values of the unknown parameters. To assess the performance of the variable selection procedure, we compute TP and FP, where TP represents the average number of active covariates correctly identified as active and FP denotes the average number of inactive covariates incorrectly detected as active. Generally, the closer TP is to the true number of active covariates and the smaller FP is, the better the variable selection method behaves. Results are reported in Table 1. Examination of Table 1 shows that the proposed variational Bayesian method behaves better than the Bayesian lasso method, regardless of the values of p and n and the covariance structures, in that its TP values are closer to the true number of active covariates and its FP values are closer to zero. For parameter estimation, the proposed variational Bayesian method also behaves better than the Bayesian lasso method in that the average values of the RMSes for the former are smaller than those for the latter, regardless of the values of p and n and the covariance structures.
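For a single replication, the TP and FP criteria reduce to simple set comparisons between the selected support and the true one; the following sketch (our own helper, with a hypothetical function name) makes the definitions concrete:

```python
import numpy as np

def tp_fp(selected, truth):
    """TP: active covariates correctly identified as active;
    FP: inactive covariates incorrectly detected as active."""
    selected = np.asarray(selected, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = int(np.sum(selected & truth))
    fp = int(np.sum(selected & ~truth))
    return tp, fp

truth = np.zeros(10, dtype=bool)
truth[:5] = True                         # five active covariates, as in Table 1
selected = np.zeros(10, dtype=bool)
selected[[0, 1, 2, 3, 7]] = True         # a hypothetical selection result
print(tp_fp(selected, truth))            # (4, 1)
```

Averaging these counts over the 100 replications yields the TP and FP values reported in Table 1.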
To investigate the sensitivity of the choice of the hyperparameters a_γ and b_γ, we take a_γ = 0.1 and b_γ = 0.9 and calculate the corresponding results for the Type I structure of Σ_x; the results are given in Table 1. These empirical results indicate that the proposed variational Bayesian method is not sensitive to the hyperparameters in that the same pattern is observed regardless of a_γ and b_γ. As an illustration of model comparison via the proposed Bayes factor, we consider a second simulation study, in which the data {(x_ij, z_ij, y_ij) : i = 1, . . . , n, j = 1, . . . , m} are generated as in the first simulation study with the covariance structure of Σ_x taken to be Type I. To this end, we consider three competing models: H_0, the true linear mixed model y_ij = x_ij β + z_ij b_i + ε_ij; H_1, a competing linear mixed model containing only random effects without fixed effects; and H_2, a competing linear mixed model that misspecifies the distribution of the measurement error. We define a path t ∈ [0, 1] to link any two of the three models above. For example, H_0 and H_1 can be linked by H_t01 : y_ij = (1 − t)x_ij β + z_ij b_i + ε_ij, which indicates that H_t01 is just H_0 for t = 0 and becomes H_1 for t = 1; similarly, H_0 and H_2 are linked by a path H_t02 along which the error distribution moves from that of H_0 to that of H_2, which implies that H_t02 reduces to H_0 with t = 0 and becomes H_2 with t = 1.
To calculate the estimated log Bayes factors (i.e., log B_10 and log B_20) via the path sampling procedure proposed above, we take ζ^(ℓ) = ℓ/L for ℓ = 0, 1, . . . , L, with L = 10, J = 1000 and σ²_0 = 0.5, and the same priors as those given in the first simulation study. Results are given in Table 2, which indicates that H_0 is strongly selected, as expected, regardless of n and p.

An Empirical Example
As an illustration of the variational Bayesian method developed above, we consider the ADNI-2 data [47]; the ADNI study was launched in 2003 and has proceeded through the ADNI-1, ADNI-GO and ADNI-2 phases. This study aims to predict the mini-mental state examination (MMSE) score, which is an important index for detecting Alzheimer's disease (AD) stages in that different MMSE scores indicate different progression of an AD patient. AD is the most common type of dementia in elderly people and the sixth leading cause of death in the United States, and it results in the loss of memory and the impairment of cognitive and language skills. More importantly, there is no effective treatment to slow the progression of the disease [48]. The number of AD patients has grown exponentially with the aging of the population, bringing a socioeconomic burden to both families and society [49]. Details on the ADNI database are available at http://adni.loni.usc.edu (accessed on 20 May 2021).
The ADNI-2 data were analyzed by [48] using the factor analysis model to impute missing values. As an illustration, we utilize 340 complete magnetic resonance imaging (MRI) features with 62 samples and 3 medical visits (6-month, 12-month and 24-month), take five features among 340 features as covariates associated with random effects and set the MMSE score as the response variable. That is, n = 62, p = 340, q = 5 and m = 3. In this case, covariates are high-dimensional compared with the sample size. Here, we assume that only a small fraction of covariates contribute to the response variable.
The variational Bayesian method introduced above, together with the linear mixed model and the same priors as those in the first simulation study, is utilized to fit the above-mentioned MRI data. Here, the hyperparameters are taken as ν_0 = 1, S_0 = 0.02I_{q×q}, a_γ = b_γ = 0.5, c_0 = 10, d_0 = 1, c_1 = 1 and d_1 = 10 to ensure the sparsity of the model. The proposed variational Bayesian method selects three features as active variables: the thickness average of the right fusiform (denoted as "x_1"), the thickness standard deviation of the right posterior cingulate (denoted as "x_2") and the thickness standard deviation of the left postcentral (denoted as "x_3"). Their corresponding parameter estimates are 1.9, 0.25 and 0.4, respectively, which show that the three active variables have positive effects on MMSE, consistent with those given in [48]. Bayesian estimates of the random effects b_i are −0.003, −0.0021, −0.0013, −0.0058 and −0.0054, respectively, which imply that the selected five covariates associated with random effects have negative effects on MMSE. Table 3 also presents the RMSE and MAP values for the models with all 340 covariates (denoted as the "Complete" model) and with the selected three active covariates (denoted as the "Selected" model), where RMSE = {n^{-1} ∑_{i=1}^{n} (ŷ_i − y_i)²}^{1/2} and MAP = n^{-1} ∑_{i=1}^{n} |ŷ_i − y_i|, and ŷ_i is the fitted value of the response y_i. Examination of Table 3 shows that the selected model has smaller RMSE and MAP values than the complete model, i.e., the selected model fits the ADNI-2 data better than the complete model. For the selected model, we also compute the Bayes factors for the three competing models H_0, H_1 and H_2 given in the second simulation study, which are log B_10 = −558 and log B_20 = −46.93, leading to the conclusion that H_0 is strongly selected.

Discussions
This paper investigates simultaneous estimation of model parameters and selection of variables in linear mixed models with high-dimensional fixed effects and low-dimensional random effects in the Bayesian framework. A novel variational Bayesian approach is developed to address the time-consuming nature of the traditional Bayesian lasso method, which stems from the ill-posed problems and large matrix computations involved in the presence of high-dimensional data. The Gaussian spike and slab priors of the population-specific fixed-effects regression coefficients are specified to identify important fixed effects by allowing the tuning parameters to tend to zero. For the sake of sampling observations, the spike and slab priors are reformulated as a mixture of a normal distribution and an exponential distribution. In the variational Bayesian framework, the problem of best approximating the posterior density is transformed into an optimization problem, i.e., maximizing the evidence lower bound. For ease of computation, the coordinate ascent algorithm, which can be implemented efficiently, is employed to optimize the evidence lower bound. For model comparison, the Bayes factor is computed by the path sampling method. Simulation studies are conducted to investigate the performance of the proposed variational Bayesian method, and a real example is analyzed by the proposed methodologies. Empirical results show that the proposed variational Bayesian method behaves better than the traditional Bayesian lasso method, whether in terms of the accuracy of parameter estimation, the consistency of variable selection or computational flexibility and complexity.
The proposed variational Bayesian method has the following advantages:
• Overcoming the problem of selecting a high-dimensional vector of shrinkage parameters required for the Bayesian lasso method;
• Simultaneously estimating model parameters and variance-covariance matrices and selecting fixed-effects and random-effects components with a relatively low computational cost;
• Avoiding large matrix computations and the curse of dimensionality problem;
• Providing a flexible and efficient approach to compute the Bayes factor for model comparison.
The proposed variational Bayesian method can be extended to more complicated models, such as generalized linear mixed models with mixed discrete and missing data. However, such extensions pose substantial challenges, including the closed-form derivation of the optimal variational densities, the specification of the priors, the learning of the data-driven hyperparameters and the computational complexity. In addition, this paper does not consider the selection of high-dimensional random effects, which is a rather challenging topic. Moreover, to speed up the convergence of the chain, one might consider some important and relevant Gibbs sampling schemes, for example, herded Gibbs sampling, a deterministic variant of the Gibbs sampling scheme that generates observations by matching the full conditionals rather than by sampling from them at random [50]; the recycling Gibbs sampler, which generates auxiliary observations whose information would ordinarily be discarded but can be recycled within the Gibbs algorithm to improve efficiency at no extra cost [51]; and the blocking and parameterization method [52].
In addition, we did not consider the BIC criterion for model comparison because BIC is only an approximation to the log marginal likelihood of the data given each hypothesis, on which the Bayes factor is based. Moreover, due to the random effects involved in the considered models, BIC behaves unstably.
Acknowledgments: The authors are grateful to the associate editor and the three referees for their constructive comments, which largely improved an earlier version of the manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.


Appendix A. Conditional Distributions Required in Implementing the Gibbs Sampler
By the definitions and priors of β_A and β_I, it is easily shown from Equation (9) that the conditional distributions f_A(β_A|D, b, σ) and f_I(β_I|D) can be derived in closed form; the conditional distribution f(b_i|D, β, σ, Q) follows similarly. The conditional distributions f(ξ²_0k|β_k, γ_k) and f(ξ²_1k|β_k, γ_k) imply that the reciprocals 1/ξ²_0k and 1/ξ²_1k follow inverse Gaussian distributions, where IvG(a, b) represents the inverse Gaussian distribution with parameters a and b.