Variable Selection Using Nonlocal Priors in High-Dimensional Generalized Linear Models With Application to fMRI Data Analysis

High-dimensional variable selection is an important research topic in modern statistics. While methods using nonlocal priors have been thoroughly studied for variable selection in linear regression, the crucial high-dimensional model selection properties for nonlocal priors in generalized linear models have not been investigated. In this paper, we consider a hierarchical generalized linear regression model with the product moment nonlocal prior over coefficients and examine its properties. Under standard regularity assumptions, we establish strong model selection consistency in a high-dimensional setting, where the number of covariates is allowed to increase at a sub-exponential rate with the sample size. The Laplace approximation is implemented for computing the posterior probabilities and the shotgun stochastic search procedure is suggested for exploring the posterior space. The proposed method is validated through simulation studies and illustrated by a real data example on functional activity analysis in fMRI study for predicting Parkinson’s disease.


Introduction
With the increasing ability to collect and store data in large scales, we are facing the opportunities and challenges to analyze data with a large number of covariates per observation, the so-called high-dimensional problem. When this situation arises, variable selection is one of the most commonly used techniques, especially in radiological and genetic research, due to the nature of high-dimensional data extracted from imaging scans and gene sequencing. In the context of regression, when the number of covariates is greater than the sample size, the parameter estimation problem becomes ill posed, and variable selection is usually the first step for dimension reduction.
A good amount of work has recently been done for variable selection from both frequentist and Bayesian perspectives. On the frequentist side, extensive studies on variable selection have emerged ever since the appearance of least absolute shrinkage and selection operator (Lasso) [1]. Other penalization approaches for sparse model selection including smoothly clipped absolute deviation (SCAD) [2], minimum concave penalty (MCP) [3] and many variations have also been introduced. Most of these methods are first considered in the context of linear regression and then extended to generalized linear models. Because all the methods share the basic desire of shrinkage toward sparse models, it has been understood that most of these frequentist methods can be interpreted from a Bayesian perspective and many analogous Bayesian methods have also been proposed. See for example [4][5][6] that discuss the connection between penalized likelihood-based methods and Bayesian approaches. These Bayesian methods employed local priors, which still preserve positive values at null parameter values, to achieve desirable shrinkage.
In this paper, we are interested in nonlocal densities [7] that are identically zero whenever a model parameter is equal to its null value. Compared to local priors, nonlocal prior distributions have relatively appealing properties for Bayesian model selection. In particular, nonlocal priors discard spurious covariates faster as the sample size grows, while preserving exponential learning rates to detect nontrivial coefficients [7]. Johnson and Rossell [8] and Shin et al. [9] study the behavior of nonlocal densities for variable selection in a linear regression setting. When the number of covariates is much smaller than the sample size, [10] establish the posterior convergence rate for nonlocal priors in a logistic regression model and suggest a Metropolis-Hastings algorithm for computation.
To the best of our knowledge, a rigorous investigation of high-dimensional posterior consistency properties for nonlocal priors has not been undertaken in the context of generalized linear regression. Although [11] investigated the model selection consistency of nonlocal priors in generalized linear models, they assumed a fixed dimension p. Motivated by this gap, our first goal was to examine the model selection property for nonlocal priors, particularly, the product moment (pMOM) prior [8] in a high-dimensional generalized linear model. It is known that the computation problem can arise for Bayesian approaches due to the non-conjugate structure in generalized linear regression. Hence, our second goal was to develop efficient algorithms for exploring the massive posterior space. These were challenging goals of course, as the posterior distributions are not available in closed form for this type of nonlocal priors.
As the main contributions of this paper, we first establish model selection consistency for generalized linear models with pMOM prior on regression coefficients (Theorems 1-3) when the number of covariates grows at a sub-exponential rate of the sample size. Next, n terms of computation, we first obtain the posteriors via Laplace approximation and then implement an efficient shotgun stochastic search (SSS) algorithm for exploring the sparsity pattern of the regression coefficients. In particular, the SSS-based methods have been shown to significantly reduce the computational time compared with standard Markov chain Monte Carlo (MCMC) algorithms in various settings [9,12,13]. We demonstrate that our model can outperform existing state-of-the-art methods including both penalized likelihood and Bayesian approaches in different settings. Finally, the proposed method is applied to a functional Magnetic Resonance Imaging (fMRI) data set for identifying alternative brain activities and for predicting Parkinson's disease.
The rest of paper is organized as follows. Section 2 provides background material regarding generalized linear models and revisits the pMOM distribution. We detail strong selection consistency results in Section 3, and proofs are provided in the Appendix A. The posterior computation algorithm is described in Section 4, and we show the performance of the proposed method and compare it with other competitors through simulation studies in Section 5. In Section 6, we conduct a real data analysis for predicting Parkinson's disease and show our method yields better prediction performance compared with other contenders. To conclude our paper, a discussion is given in Section 7.

Model Specification for Logistic Regression
We first describe the framework for Bayesian variable selection in logistic regression followed by our hierarchical model specification. Let y ∈ {0, 1} n be the binary response vector and X = (x ij ) ∈ R n×p be the design matrix. Without loss of generality, we assume that the columns of X are standardized to have zero mean and unit variance. Let x i ∈ R p denote the ith row vector of X that contains the covariates for the ith subject. Let β be the p × 1 vector of regression coefficients. We first consider the following standard logistic regression model: We will work in a scenario where the dimension of predictors, p grows with the sample size n. Thus, we consider the number of predictors is function of n, i.e., p = p n , but we denote it as p for notational simplicity.
Our goal is variable selection, i.e., the correct identification of all non-zero regression coefficients.
In light of that, we denote a model by k = k 1 , k 2 , . . . , k |k| ⊆ [p] =: {1, 2, . . . , p} if and only if all the nonzero elements of β are β k 1 , β k 2 , . . . , β k |k| and denote β k = β k 1 , β k 2 , . . . , β k |k| , where |k| is the cardinality of k. For any m × p matrix A, let A k ∈ R m×|k| denote the submatrix of A containing the columns of A indexed by model k. In particular, for 1 ≤ i ≤ n, we denote x ik as the subvector of x i containing the entries of x i corresponding to model k.
The class of pMOM densities [8] can be used for model selection through the following hierarchical model Here U is a p × p nonsingular matrix, r is a positive integer referred to as the order of the density and d k is the normalizing constant independent of the positive constant τ. Please note that prior (2) is obtained as the product of the density of multivariate normal distribution and even powers of . This results in π(β k | τ, k) = 0 at β k = 0, which is desirable because (2) is a prior for the nonzero elements of β. Some standard regularity assumptions on the hyperparameters will be provided later in Section 3. In (3), m n ∈ [p] is a positive integer restricting the size of the largest model, and a uniform prior is placed on the model space restricting our analysis to models having size less than or equal to m n . Similar structure has also been considered in [5,9,14]. An alternative is to use a complexity prior [15] that takes the form of for some positive constants c 1 , c 2 . The essence is to force the estimated model to be sparse by penalizing dense models. As noted in [9], the model selection consistency result based on the nonlocal priors derives strength directly from the marginal likelihood and does not require strong penalty over model size. This is indeed reflected in the simulation studies in [14], where the authors compare the model selection performance under uniform prior and complexity prior. The result under uniform prior is much better than that under complexity prior, as the complexity prior always tends to prefer the sparse models. By the hierarchical model (1) to (3) and Bayes' rule, the resulting posterior probability for model k is denoted by, where π(y) is the marginal density of y, and m k (y) is the marginal density of y under model k given by where is the log likelihood function. In particular, these posterior probabilities can be used to select a model by computing the posterior mode defined bŷ k = arg max k π(k|y).
Of course, the closed form of these posterior probabilities cannot be obtained due to not only the nature of logistic regression but also the structure of nonlocal prior. Therefore, special efforts need to be devoted to both consistency analysis and computational strategy as we shall see in the following sections.

Extension to Generalized Linear Model
We can easily extend our previous discussion on logistic regression to a generalized linear model (GLM) [16]. Given predictors x i and an outcome y i for 1 ≤ i ≤ n, a probability density function (or probability mass function) of a generalized linear model has the following form of the exponential family p(y i |θ) = exp a(θ) in which a(·) is a continuously differentiable function with respect to θ with nonzero derivative, b(·) is also a continuously differentiable function of θ, c(·) is some constant function of y, and θ is also known as the natural parameter that relates the response to the predictors through the linear function is the inverse of some chosen link function.
The class of pMOM densities specified in (2) can still be used for model selection in this generalized setting by noting that the log likelihood function in (5) and (6) now takes the general form of After obtaining the posterior probabilities in (4) with the log likelihood substituted as (8), we can select a model by computing the posterior mode. In Section 4, we will adopt some search algorithm that use these posterior probabilities to target the mode in a more efficient way.

Main Results
In this section, we show that the proposed Bayesian model enjoys desirable theoretical properties. Let t ⊆ [p] be the true model, which means that the nonzero locations of the true coefficient vector are t = (j, j ∈ t). We consider t to be a fixed vector. Let β 0 ∈ R p be the true coefficient vector and β 0,t ∈ R |t| be the vector of the true nonzero coefficients. In the following analysis, we will focus on logistic regression, but our argument can be easily extended to any other GLM as well. In particular, as the negative Hessian of L n (β k ), where Σ(β k ) ≡ Σ k = diag(σ 2 1 (β k ), . . . , σ 2 n (β k )), In the rest of the paper, we denote Σ = Σ(β t ) and σ 2 i = σ 2 i (β t ) for simplicity. Before we establish our main results, the following notations are needed for stating our assumptions. For any a, b ∈ R, a ∨ b and a ∧ b mean the maximum and minimum of a and b, respectively. For any sequences a n and b n , we denote a n b n , or equivalently a n = O(b n ), if there exists a constant C > 0 such that |a n | ≤ C|b n | for all large n. We denote a n b n , or equivalently a n = o(b n ), if a n /b n −→ 0 as n → ∞. Without loss of generality, if a n ≥ b n > 0 and there exist constants C 1 > C 2 > 0 such that C 2 < b n /a n ≤ a n /b n < C 1 , we denote a n ∼ b n . For a given vector For any real symmetric matrix A, λ max (A) and λ min (A) are the maximum and minimum eigenvalue of A, respectively. To attain desirable asymptotic properties of our posterior, we assume the following conditions: Condition (A1) ensures our proposed method can accommodate high dimensions where the number of predictors grows at a sub-exponential rate of n. Condition (A1) also specifies the parameter m n in the uniform prior (3) that restricts our analysis on a set of reasonably large models. Similar assumptions restricting the model size have been commonly assumed in the sparse estimation literature [4,5,9,17].
gives lower and upper bounds of n −1 H n (β 0,k ) and n −1 X k X k , respectively, where k is a large model satisfying |k| ≤ m n + |t|. The lower bound condition can be regarded as a restricted eigenvalue condition for 0 -sparse vectors. Restricted eigenvalue conditions are routinely assumed in high-dimensional theory to guarantee some level of curvature of the objective function and are satisfied with high probability for sub-Gaussian design matrices [5]. Similar conditions have also been used in the linear regression literature [18][19][20]. The last assumption in Condition (A2) says that the magnitude of true signals is bounded above (log p) d up to some constant, which allows the magnitude of signals to increase to infinity.
Condition (A3) gives a lower bound for nonzero signals, which is called the beta-min condition. In general, this type of condition is necessary for catching every nonzero signal. Please note that due to Conditions (A1) and (A2), the right-hand side of (9) decreases to zero as n → ∞. Thus, it allows the smallest nonzero coefficients to tend to zero as we observe more data.

Condition (A4)
For some small constant δ > 0, the hyperparameters τ and r satisfy Condition (A4) suggests appropriate conditions for the hyperparameter τ in (2). A similar assumption has also been considered in [9]. The scale parameter τ in the nonlocal prior density reflects the dispersion of the nonlocal prior density around zero, and implicitly determines the size of the regression coefficients that will be shrunk to zero [8,9]. For the below theoretical results, we assume that U = I for simplicity, but our results are still valid for other choices of U as long as λ max (U) = O(1) and λ min (U) = O(1).

Theorem 1 (No super set). Under conditions (A1), (A2) and (A4),
Theorem 1 says that, asymptotically, our posterior will not overfit the model, i.e., not include unnecessarily many variables. Of course, it does not guarantee that the posterior will concentrate on the true model. To capture every significant variable, we require the magnitudes of nonzero entries in β 0,t not to be too small. Theorem 2 shows that with an appropriate lower bound specified in Condition (A3), the true model t will be the mode of the posterior.
Posterior ratio consistency is a useful property especially when we are interested in the point estimation with the posterior mode, but does not provide how large probability the posterior puts on the true model. In the following theorem, we state that our posterior achieves strong selection consistency. By strong selection consistency, we mean that the posterior probability assigned to the true model t converges to 1. Since strong selection consistency implies posterior ratio consistency, it requires a slightly stronger condition on the lower bound for the magnitudes of nonzero entries in β 0,t , i.e., a larger value of c 0 , compared to that in Theorem 2.

Computational Strategy
In this section, we describe how to approximate the marginal density of the data and to conduct the model selection procedure. The integral formulation in (4) leads to the posterior probabilities not available in closed form. Hence, we use Laplace approximation to compute m k (y) and π(k|y). A similar approach to compute posterior probabilities has been used in [8][9][10].
Please note that for any model k, when U k = I k , the normalization constant d k in (2) is given by For any model k, the Laplace approximation of m k (y) is given by (2π) |k| whereβ k = arg max β k f (β k ) obtained via the optimization function optim in R using a quasi-Newton method and V (β k ) is a |k| × |k| symmetric matrix which can be calculated as: The above Laplace approximation can be used to compute the log of the posterior probability ratio between any given model k and true model t, and select a model k with the highest probability.
We then adopt the shotgun stochastic search (SSS) algorithm [9,12] to efficiently navigate through the massive model space and identify the global maxima. Using the Laplace approximations of the marginal probabilities in (10), the SSS method aims at exploring "interesting" regions of the resulting high-dimensional model spaces and quickly identifies regions of high posterior probability over models.
The SSS procedure is described in Algorithm 1.

Algorithm 1 Shotgun Stochastic Search (SSS)
Choose an initial model k (1)

Simulation Studies
In this section, we demonstrate the performance of the proposed method in various settings. Let X be the design matrix whose first |t| columns correspond to the active covariates for which we have nonzero coefficients, while the rest correspond to the inactive ones with zero coefficients. In all the simulation settings, we generate x i i.i.d. ∼ N p (0, Σ) for i = 1, . . . , n under the following two different cases of Σ: • Case 1: Isotropic design, where Σ = I p , i.e., no correlation imposed between different covariates.
The correlations among different covariates are set to different values.
Following the simulation settings in [9,10], we consider the following two designs, each with the same sample size n = 100 and number of predictors being either p = 100 or 150: We investigate the following two settings for the true coefficient vector β 0,t to include different combinations of small and large signals. Finally, for given X and 1 ≤ i ≤ n, we sample y i from the following logistic model as in (1) We will refer to our proposed method as "nonlocal" and its performance will then be compared with other existing methods including Spike and Slab prior-based model selection [21], empirical Bayesian LASSO (EBLasso) [22], Lasso [23] and SCAD [24]. The tuning parameters in the regularization approaches are chosen by 5-fold cross-validation. Spike and slab prior method is implemented via the BoomSpikeSlab package in R. For the nonlocal prior, the hyperparameters are set at U = I, r = 1 and we tune τ = 10 −i n −1/2 p 2+0.05 for four different values of i = 0, 1, 2, 3. We choose the optimal τ by the mean squared prediction error through 5-fold cross-validation. Please note that this implies that τ is data-dependent and the resulting procedure is similar to an empirical-Bayesian approach in the high-dimensional Bayesian literature given the prior knowledge about the sparse true model [13]. For the SSS procedure, the initial model was set by randomly taking three coefficients to be active and the remaining to be inactive. The detailed steps for our method are coded in R and publicly available at https://github.com/xuan-cao/Nonlocal-Logistic-Selection. In particular, the stochastic search is implemented via the SSS function in the R package BayesS5.
To evaluate the performance of variable selection, the precision, sensitivity, specificity, Matthews correlation coefficient (MCC) [25] and mean-squared prediction error (MSPE) are reported at Tables 1-4, where each simulation setting is repeated for 20 times. The criteria are defined as where TP, TN, FP and FN are true positive, true negative, false positive and false negative, respectively.
Here we denoteŷ i = x iβ , whereβ is the estimated coefficient based on each method. For Bayesian methods, the usual GLM estimates based on the selected support are used asβ. We generated test samples y test,1 , . . . , y test,n test with n test = 50 to calculate the MSPE. Based on the above simulation results, we notice that under the first isotropic covariance case, the nonlocal-based approach overall works better than other methods especially in the strong signal setting (i.e., Setting 1), where our method is able to consistently achieve perfect estimation accuracy. This is because as signal strength gets stronger, the consistency conditions of our method are easier to satisfy which leads to better performance. When the covariance is autoregressive, our method suffers from lower sensitivity compared with the frequentist approaches in high-dimensional design ( Table 4), but still has higher precision, specificity and MCC. The poor precision of the regularization methods has also been discussed in previous literature in the sense that selection of the regularization parameter using cross-validation is optimal with respect to prediction but tends to include too many noise predictors [26]. Again we observe under the autoregressive design, the performance of our method improves as the true signals strengthen. To sum up, the above simulation studies indicate that the proposed method can perform well under a variety of configurations with different data generation mechanisms.

Application to fMRI Data Analysis
In this section, we apply the proposed model selection method to an fMRI data set for identifying aberrant functional brain activities to aid the diagnosis of Parkinson's Disease (PD) [27]. Data consists of 70 PD patients and 50 healthy controls (HC). All the demographic characteristics and clinical symptom ratings have been collected before MRI scanning. In particular, we adopt the mini-mental state examination (MMSE) for cognitive evaluation and the Hamilton Depression Scale (HAMD) for measuring the severity of depression.

Image Feature Extraction
Functional imaging data for all subjects are collected and retrieved from the archive by neuroradiologists. Image preprocessing procedure is carried out via Statistical Parametric Mapping (SPM12) operated on the Matlab platform. For each subject, we first discard the first 5 time points for signal equilibrium and the remaining 135 images underwent slice-timing and head motion corrections. Four subjects with more than 2.5 mm maximum displacement in any of the three dimensions or 2.5 • of any angular motion are removed. The functional images are spatially normalized to the Montreal Neurological Institute space with 3 × 3 × 3 mm 3 cubic voxels and smoothed with a 4 mm full width at half maximum (FWHM) Gaussian kernel. We further regress out nuisance covariates and applied temporal filter (0.01 Hz < f < 0.08 Hz) to diminish high-frequency noise.
Zang et al. [28] proposed the method of Regional Homogeneity (ReHo) to analyze characteristics of regional brain activity and to reflect the temporal homogeneity of neural activity. Since some preprocessing methods especially spatial smoothing fMRI time series may significantly change the ReHo magnitudes [29], preprocessed fMRI data without the spatial smoothing step are used for calculating ReHo. In particular, we focus on the mReHo maps obtained by dividing the mean ReHo of the whole brain within each voxel in the ReHo map. We further segment the mReHo maps and extract all the 112 ROI signals based on the Harvard-Oxford atlas (HOA) using the Resting-State fMRI Data Analysis Toolkit.
Slow fluctuations in activity are fundamental features of the resting brain for determining correlated activity between brain regions and resting state networks. The relative magnitude of these fluctuations can discriminate between brain regions and subjects. Amplitude of Low Frequency Fluctuations (ALFF) [30] are related measures that quantify the amplitude of these low frequency oscillations. Leveraging the preprocessed data, we retain the standardized mALFF maps after dividing the ALFF of each voxel by the global mean ALFF. Using the HOA, we again obtain 112 mALFF values via extracting the ROI signals based on the mALFF maps. Voxel-Mirrored Homotopic Connectivity (VMHC) quantifies functional homotopy by providing a voxel-wise measure of connectivity between hemispheres [31]. By segmenting the VMHC maps according to HOA, we also extract 112 VHMC values.

Results
Our candidate features consist of 336 radiomic variables along with all the clinical characteristics. We now consider a standard logistic regression model with the binary disease indicator as the outcome and all the radiomic variables together with five clinical factors as predictors. Various models including the proposed and other competing methods will then be implemented for classifying subjects based on these extracted features. The dataset is randomly divided into a training set (80%) and a testing set (20%) while maintaining the PD:HC ratio in both sets. For Bayesian methods, we first obtain the identified variables, and then evaluate the testing set performance using standard GLM estimates based on the selected features. The penalty parameters in all frequentist methods are tuned via 5-fold cross validation in the training set. The hyperparameters for the proposed method are set as in simulation studies.
The HAMD score and nine radiomic features including five mALFFs, two ReHos, two VHMCs are selected by the SSS procedure under pMOM prior. In Figure 1, we plot the histograms of selected radiomic features with different colors representing different groups. The predictive performance of various methods in the test set is summarized in Table 5. We can tell from Table 5 that the nonlocal prior-based approach has overall better prediction performance compared with other methods. Our nonlocal approach has higher precision and specificity compared with all the other methods, but yields a lower sensitivity than the frequentist approaches. Based on the most comprehensive measure MCC, our method outperforms all the other methods.

Conclusions
In this paper, we propose a Bayesian hierarchical model with a pMOM prior specification over regression coefficients to perform variable selection in high-dimensional generalized linear models. The model selection consistency of our method is established under mild conditions and the shotgun stochastic search algorithm can be used for the implementation of our proposed approach. Our simulation and real data studies indicate that the proposed method has better performance for variable selection compared to a variety of state-of-the-art competing methods. In the fMRI data analysis, our method is able to identify abnormal functional brain activities for PD that occur in the regions of interest including cingulate gyrus, central opercular cortex, occipital pole, brainstem, left amygdala, occipital pole, inferior temporal gyrus, and juxtapositional lobule cortex. These findings suggest disease-related alterations of functional activities that provide physicians sufficient information to get involved with early diagnosis and treatment. Our findings are also coherent with the alternative functional features in cortical regions, brainstem, and limbic regions discovered in previous studies [32][33][34][35].
Our fMRI study certainly has limitations. First, we would like to note that fMRI data are typically treated as spatio-temporal objects and a generalized linear model with spatially varying coefficients can be implemented for brain decoding [36]. However, in our application, for each subject, a total of 135 fMRI scans were obtained, each with the dimension of 64 × 64 × 31. If we take each voxel as a covariate to perform the whole-brain functional analysis, it would be computationally challenging and impractical given the extremely high dimension. Hence, we adopt the radiomics approach to extract three different types of features that can summarize the functional activity of the brain, and take these radiomic features as covariates in our generalized linear model. For future studies, we will focus on several regions of interest rather than the entire brain and take the spatio-temporal dependency among voxels into consideration.
Second, although ReHo, ALFF, and VHMC are different types of radiomic features that quantify the functional activity of the brain, it is definitely possible that in some regions, three measures are highly correlated with each other. Our current theoretical and computational strategy can accommodate a reasonable amount of correlations among covariates, but might not work in the presence of high correlation structure. For future studies, we will first carefully examine the potential correlations among features and might only retain one feature for each region if significant correlations are detected.
One possible extension of our methodology is to address the potential misspecification of the hyperparameter τ. The scale parameter τ is of particular importance in the sense that it can reflect the dispersion of the nonlocal density around zero, and implicitly determine the size of the regression coefficients that will be shrunk to zero [8]. Cao et al. [14] investigated the model selection consistency for the hyper-pMOM priors in linear regression setting, where an additional inverse-gamma prior is placed over τ. Wu et al. [11] proved the model selection consistency using hyper-pMOM prior in generalized linear models, but assumed a fixed dimension p. For future study, we will consider this fully Bayesian approach to carefully examine the theoretical and empirical properties for such hyper-pMOM prior in the context of high-dimensional generalized linear regression. We can also extend our method to develop a Bayesian approach for growth models in the context of non-linear regression [37], where the log-transformation is typically used to recover the additive structure.
However, then the model does not fall into the category of GLMs, which is beyond the current setting in this paper. Therefore, we leave it as a future work.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: for any n ≥ N(δ * ). However, as stated in [5], there always exists δ * > 0 satisfying inequality (A1), so it is not really a restriction. Since we will focus on sufficiently large n, δ * can be considered an arbitrarily small constant, so we can always assume that δ > δ * .
Proof of Theorem 1. Let M 1 = {k : k t, |k| ≤ m n } and where t ⊆ [r] is the true model. We will show that By Taylor's expansion of L n (β k ) around β k , which is the MLE of β k under the model k, we have for someβ k such that β k − β k 2 ≤ β k − β k 2 . Furthermore, by Lemmas A.1 and A.3 in [5] and Condition (A2), with probability tending to 1, for any k ∈ M 1 and β k such that β k − β 0,k 2 < c |k|Λ |k| log p/n =: cw n , where = n := c m 2 n Λ m n log p/n = o(1), for some constants c, c > 0. Please note that for β k such that β k − β k 2 = cw n /2, where the second inequality holds due to Condition (A2). It also holds for any β k such that β k − β k 2 > cw n /2 by concavity of L n (·) and the fact that β k maximizes L n (β k ). Define the set B := β k : β k − β k 2 ≤ cw n /2 , then we have B ⊂ {β k : β k − β 0,k 2 ≤ cw n } for some large c > 0 and any k ∈ M 1 , with probability tending to 1.
where E k (.) denotes the expectation with respect to a multivariate normal distribution with mean β * k and covariance matrix (A k + U k /τ) −1 . It follows from Lemma 6 in the supplementary material for [8] that for some constant C > 0. Further note that for some constant C > 0 and some large constant c > 0, by Conditions (A1), (A2) and (A4). Therefore, it follows from (A3) that for some constant C > 0. Next, note that it follows from Lemma A.3 in the supplementary material for [5] that for any k ∈ M 1 and some constant C 1 > 0. Similarly, by Lemma 4 in the supplementary material for [8] and the similar arguments leading up to (A5), with probability tending to 1, we have by Lemma A1, where A t = (1 + )H n (β 0,t ). Therefore, with probability tending to 1, for any k ∈ M 1 , where the second inequality holds by Lemma 2 in [38], Conditions (A2) and (A4), and the third inequality follows from Lemma 3 in [38], which implies for any k ∈ M 1 with probability tending to 1, where b n = (1 + δ * )(1 + 2w) log p with some small constant w > 0 satisfying 1 + δ > (1 + δ * )(1 + 2w).
Proof. Please note that by Condition (A2), which implies that Thus, we complete the proof if we show that 1 nτλ β k H n (β 0,k ) β k ≤ C for some constant C > 0 and any k ∈ M 1 with probability tending to 1. By Lemma A.3 in [5] and Condition (A2), for any k ∈ M 1 with probability tending to 1.