A Dirichlet Process Prior Approach for Covariate Selection

The variable selection problem in general, and specifically for the ordinary linear regression model, is considered in the setup in which the number of covariates is large enough to prevent the exploration of all possible models. In this context, Gibbs-sampling is needed to perform stochastic model exploration to estimate, for instance, the model inclusion probability. We show that under a Bayesian non-parametric prior model for analyzing Gibbs-sampling output, the usual empirical estimator is just the asymptotic version of the expected posterior inclusion probability given the simulation output from Gibbs-sampling. Other posterior conditional estimators of inclusion probabilities can also be considered as related to the latent probabilities distributions on the model space which can be sampled given the observed Gibbs-sampling output. This paper will also compare, in this large model space setup the conventional prior approach against the non-local prior approach used to define the Bayes Factors for model selection. The approach is exposed along with simulation samples and also an application of modeling the Travel and Tourism factors all over the world.


Introduction
The variable selection problem in regression analysis consists of finding a suitable set of predictors for the response variable y from the columns p of a fixed design matrix X n×p with n rows (observations). In a Bayesian setting we have to evaluate the posterior probability π(M γ |y) of each regression model M γ , where with the usual notation γ = (γ 1 , . . . , γ p ), with γ j = 1 if column X j , j = 1, . . . , p is in model M γ . The aim of this paper is to estimate π(M γ |y) for all γ ∈ Γ. The main point is that the cardinality of Γ is of order 2 p , huge even for moderate values of p ≈ 1000 even if it were p < n. In what follows, whatever p is, we only evaluate the evidence for models in which the size is no larger than n including the intercept and the regression the error variance.
Each π(M γ |y) can be calculated exactly (not just estimated) by means of all 2 p Bayes Factors (BF) BF γ0 , π(M γ |y) = π(M γ )BF γ0 (1 + ∑ ∀γ =0 BF γ0 ) −1 , where the marginal distribution of model M γ is compared against that of a common null nested model M 0 . The reference null model considered here is the regression model with the intercept only. To obtain π(M γ |y), a prior on model space Γ, π(M γ ) must be chosen.
Different priors on Γ have been proposed, starting from the discrete uniform π(M γ ) = 2 −p to fulfill the insufficient reasonable principle requirements, up to the hierarchical uniform prior [1] π(M γ ) ∝ ( p p γ ) in which p γ indicates the size of model γ. Under the hierarchical uniform, prior models of the same sizes receive the same prior probability and this also controls for the false discovery rate in declaring a covariate important when it is not important. This is relevant for large p and when the model is sparse [1]. More than focusing on comparing different definitions of π(M γ ), we consider different definitions of BF γ0 given by the prior probability of model parameters. In particular, we center on the semi-conjugate multivariate normal on regression coefficients, which leads to closed-form expression of BF γ0 , which is important for computational feasibility. For instance, the conventional prior approach in [2] has suitable properties (or desiderata). An important one is the so-called predictive matching property, which assures that BF γ0 = 1 if there is not enough information in the data to distinguish between M γ and M 0 . This is obtained by using the concept of effective sample size, which is the number used to rescale the prior covariance matrix to obtain unit information prior, which is a prior that bears the same information as one sample. The other approach is that of the non-local priors detailed in [3]. These priors are non-local in the sense that the minimum prior density is at the null hypothesis in contrast to other usual approaches. The effect is that such a set of priors, asymptotically in n, has a larger increment in recognizing the true set of important covariates (the so-called learning rate) more than the conventional prior approach, although they do not meet the predictive matching desiderata. In the sequel, we refer to these two sets of priors as conventional and non-local prior approaches and these are the only two definitions of BFs considered in this work.
Beyond choices of model parameters prior and/or model prior, the point of this work is that even under closed-form expression of BFs an exhaustive exploration of Γ is not feasible and thus a stochastic model search must be employed to obtain an estimation of π(M γ |y). By estimating π(M γ |y) and intending to interpret the nature of the true underlying model, a summary of π(M γ |y) which is of particular interest here is the inclusion probability of a covariate j, denoted by τ and defined by marginalizing π(M γ |y) overall γ such that γ j = 1. τ is necessary to define the median probability model [4] defined as the model in which covariates have τ > 0.5. If such a model exists, it is proved to be very near to the true model even under strong collinearity [5].
When Γ cannot be fully explored, a stochastic model search is employed in place of heuristic algorithms as they allow to approximate π(M γ |y). In the case of the normal linear regression model, a popular stochastic approach is that of the well known Gibbs-sampling in [6]. Such an algorithm implements a Markov chain by jumping from one model to another based on the marginal distribution of the data (the same that appears in the Bayes Factors). This algorithm returns a dependent sequence of M (S) = γ (1) ∈ Γ, . . . , γ (S) ∈ Γ assumed to be a sample of size S from π(M γ |y). The Gibbs-sampling operates on the space Γ using the fact that a closed-form expression of the marginal distributions is available for the normal regression model with the above-mentioned set of conjugate priors (conventional and non-local). The main advantage of Gibbs-sampling over heuristic approaches is that the theory assures that when the number of steps S → ∞ all Γ has been properly explored, whereas heuristic methods (like step-wise methods) may stop at local maxima and may not bear properly posterior model uncertainty.
The aim of this paper is to use M (S) to obtain an estimation of τ, namely τ, and in particular the new estimators τ b and τ f , later defined. A different definition of τ derived from the Gibbs-sampling algorithm has been studied in [7] applying the known concept of sampling theory. In particular, two estimators have been fully characterized: (i) The intuitive and popular empirical proportion of the sampled models in M (S) containing covariate j, τ e = S −1 ∑ S γ∈M (S) 1(γ) γ j =1 , where 1(γ) γ j =1 is the usual indicator function for the scalar γ j = 1 in the vector γ. This estimator is called the empirical estimator; (ii) The renormalized proportion of the sampled models containing covariate j, τ r = , called the renormalized estimator.
According to [7] both are consistent estimators of τ for S → ∞ and that the error of τ r is the sum of two components: the error of τ e plus (or minus) a term that depends on the correlation between posterior probability of a M γ and its probability to be a visited model for the Gibbs-sampling algorithm. Basically, if the Gibbs-sampling moves very little around a model M γ just because it has the highest posterior probability (but not necessarily the largest one), τ r will be more biased than τ e [7] otherwise it will be more precise than τ e . Moreover, if the model is sparse, the visited models, for finite S depends on the initial state of the chain (for instance: the null model M 0 or the full model for p < n).
The idea of this paper is to rethink about all these estimators of τ and use a proper Bayesian approach to analyze the sequence M (S) . In practice, M (S) can be viewed as a sequence of non-ordinal categorical variables with two levels, which gives rise to representing Γ as the set of a very sparse contingency table of dimension 2 p cells, which are the probabilities π(M γ |y). Therefore, estimating cell probabilities is equivalent to the estimate of the posterior model probabilities. The main observation here is that this is the same setup as in [8] except for the fact that samples M (S) are dependent instead of being independent as assumed in [8], although for large S and given that this is a Gibbs-Sampling, dependence becomes mild. Deriving the posterior distribution of cells probability is the same as deriving an estimation of π(M γ |y) and thus the τ. That is, from π(τ|M (S) ) we define τ b = E π(τ|M (S) ) τ and this is one of the proposed estimators that we will compare against τ e and τ r allowing also to properly account for the uncertainty around the obtained value of τ.
The following Section 2 will illustrate the Bayesian model for analyzing M (S) and its approximation that is used to obtain τ b and a τ f (specified below). Further, model implementation, simulation study, and real data application on Travel and Tourism data are considered in Section 3. Conclusion and final remarks are left for Section 4.

The Dirichlet Process Mixture Model for Estimating Posterior Model Probabilities
The Dirichlet process prior model was initially considered for sparse contingency tables in [8] and here it is applied to the more specific context of covariate selection and consequent estimation of covariate inclusion probabilities.
Parameters of interest are ß = π γ 1 γ 2 ···γ p , γ j = 0, 1, j = 1, . . . , p ∈ Π which is the set of all probabilities tensor on the space Γ of size 2 p . These are the joint probabilities of all covariates and thus of all M γ on space Γ, namely the 2 p cells probabilities, where In the actual setup, these probabilities are attempted to be estimated with the Gibbs-sampling output, M (S) regarded here as the data used to estimate ß. Specifically, a sample γ (s) j ∈ M (S) is a dichotomous unordered categorical variable and we denote the two categories of γ (s) j , by c j = 0, 1. The key idea in the Dirichlet process mixture model is representing probability ß by decomposing it as an addictive mixture of k (possibly infinite) sets of probabilities, h is the probability for covariate j to be included into the set of predictors given the probability distribution over all Γ labeled by h.
h is the j covariate inclusion probability and its estimator is also an estimator of τ if the distribution h has high posterior probability v h given M (S) . In particular, if sets of probabilities are ordered according to their posterior probabilities, and v 1 ≈ 1 then estimation of Ψ (j) 1 could be a good candidate for being a conditional (to h = 1) estimator of τ.
For this purpose, we define the estimator understood for all j, given that the first component of the mixture has the highest probability of v 1 . The likelihood of the two parameters v and Ψ given M (S) is Introducing the latent class indicator z s ∈ {1, . . . , k}, the conditional probability to a specific set h is Pr γ s j = c j |z s = h = ψ (j) hc j and we have that ψs are the inclusion probabilities of covariates conditional to the latent class indicator. Thus the marginal distribution of τ is obtained by marginalizing over all latent class indicators, , is the parameter of interest which leads to the definition of our estimator The larger the k is, the better is the representation of ß, thus we allow for k = ∞ by using the following non-parametric prior: where P 0j corresponds to a Dirichlet measure and Q to a Dirichlet process.
For the usual stick-breaking stochastic representation of this model, we have where the Multinomial notation is here over-engineered as M (S) observations are indicator variables. Parameters a j1 = . . . = a jc j = 1 induce non informative a priori information about the probabilities of each covariate being in the model. α is the usual concentration parameter on the space of latent classes of model probability distributions. That is for small values of α, the probability of having many classes of different probability distributions decreases. In what follows we will consider either α fixed or under another layer of uncertainty by setting a gamma prior with shape a α and scale b α with a α + b α = 1/2 and a α = 1/4 as in [8].
With this model at hand, we can obtain justification for the popular empirical estimator of τ. That is, S → ∞ τ b ≈ τ e as this is a usual regular Bayesian model in which the prior information is washed out by the sample size S. This is an asymptotic, although not Bayesian, justification of using τ e instead of the proposed τ b . In this comparison scenario, τ f instead plays the role of a more robust estimation as it is based on that distribution which has the largest probability. Therefore, comparing τ b against τ f is the same as comparing means over modes except that the underlying randomness is on probability distributions instead of random variables. Finally, comparing τ b and τ f against τ e and τ r is the same as comparing sampling-based estimators against Bayesian ones which incorporate the shrinkage effect of the prior. The claim here is that such an effect can be arbitrarily large for S finite and p arbitrary large.

Variational Algorithm for Approximate τ b
The above stochastic representation suggests to obtain π(M γ |y) by using another Gibbs-sampling exposed in [8] and it was used to obtain simulations from the posterior distribution of ψs and vs. However, this algorithm can be very slow for large p and the benefits of the proposed Bayesian approach can be compensated by just calculating τ e or τ r over larger values of S (if these were possible to be obtained).
To avoid this drawback of the proposed modelling approach we make use of a recent and faster variational algorithm illustrated in [9] and also implemented in the R package mixdir. The algorithm relies on using approximated distribution, for the posterior of v and ψ. Such distributions are derived by applying the mean-field theory to variational inference (see [10]). The approximating q distributions of the variational approach lay down to be a mixture of Dirichlet distributions. For more details, see [9]. This algorithm is much faster than the initial Gibbs-sampling in [8] and thus may compensate for the need of a larger S to obtain good estimations of τ.

Implementation and Examples
The R implementation of the proposed model is straightforward and it does not even need an ad hoc appendix because major packages are already available. In particular, the implementation requires the following packages for Gibbs-sampling M (S) : BayesVarSel for BF based on the conventional prior approach [2] and/or mombf for BF based on non-local priors [11]. Finally, package mixdir contains the variational Bayes method sketched above in Section 2.1.
The minimal implementation for a response y and a design matrix X with p columns requires two steps: 1.
Obtain a sample of M (S) use: In the examples and simulation studies illustrated below, we will mainly play with the number of rows S of above-calculated matrix gammas.

Riboflavin Simulation Study
In what follows we will consider the Riboflavin dataset (see [12]) related to the riboflavin production by Bacillus subtilis. We have n = 71 observations and p = 4088 predictors (gene expressions) and a one-dimensional response (riboflavin production), y. We assume that BF γ0 is obtained from the conventional prior and non-local prior. Priors on models, π(M γ ) is the Uniform prior.
We use the Riboflavin dataset only for fixing X p and simulating 10,000 times the response vector y of size n according to y = ∑ 3 i=1 i × X p i + , ∼ N(0, 2) and p 1 , p 2 , p 3 are three columns of X p picked at random for each simulated response (the rest of columns of X p are supposed to have no effect on y).
The Gibbs algorithm starts at the null model and S has different sizes S = 100, 500, 1000. The goal is to compare the proposed estimator against existing ones.
Results regarding the estimation of τ over all simulations with the Conventional and Non-Local priors are shown in Figure 1. It is possible to appreciate that when the signal in the data is small (coefficients are 1 or 2) and S is not large enough, the proposed estimators τ b and τ f perform better than the existing ones, while the τ r performs better under the non-local prior approach. This is because, as mentioned above, the non-local prior approach favors the model learning rate which is reflected in the values of the BF used to renormalize τ e .
However, such improvements depend on the specific simulation analysis, that is the design matrix, the noise in the response (here, two standard deviations) the specific value of the assumed coefficients (i.e., 1, 2 and 3) and also the values of S. In particular, it seems to disappear when the value of the coefficient is high (i.e., 2 or 3). To generalize these results concerning the choice of specific values of coefficients, regression error standard deviation, and S, we analyze the results using probit regressions (one for each prior). In particular, we transform 10,000 estimators τ on the probit scale and regress it with respect to the signal into the data (the value of the coefficient) and estimator type (four in total) with interactions among them (in total we have 12 probit regression coefficients: main effects plus interactions).
The resulting coefficient regressions that we focus on indicate the improvement to the τ e and the signal when the coefficient is 1. Such an improvement is an increment in the probit scale of the probability of having τ = 1 for each combination of coefficient and estimator type. These increments are between 0.62 and 2.14 (all highly significant) and are important if the original point in the probit scale is low, while they are negligible if the point on the probit scale is large. Such an initial point corresponds to a signal in the data and the value of S. The increment on τ and the signal in the data is reported in Figure 2. From Figure 2 we can see that the proposed estimators τ b and τ f generally outperform the existing estimator and popular τ e as the increment on τ is larger when the signal in the data is lower. The τ r perform better than τ f under the non-local prior as the corresponding BFs are very informative for detecting variables.

Travel and Tourism Data Set
In this application, we want to explore the determinants of Travel and Tourism (T&T) global expenditure using data of the World Economic Forum, based on the latest Travel and Tourism Competitiveness Report (2019). The statistical unit is the country and the response variable is the log of total tourism expenditure, obtained by multiplying the number of arrivals by the reported individual expenditure. Data are obtained at http://www3.weforum.org/docs/WEF_TTCR19_data_ for_download.xlsx and further filtered to have a complete dataset https://raw.githubusercontent. com/scabras/varseldmmp/master/tourism-data.csv The actual code to analyze it is here: https: //github.com/scabras/varseldmmp/blob/master/s-example-tourism.md.
Tourism worldwide is threatened by the Covid-19 spread and it represents an important source of income for many countries. For instance, in Spain, Mexico, and France, according to The Organization for Economic Co-operation and Development (OECD) in 2016, tourism represented more than 7% of the Gross Domestic Product. The total number of covariates considered here is p = 65 fully observed in n = 52 countries.
We apply the analysis using Conventional and non-local priors assuming that the number of latent probability distributions on the model space is k = 10 and a fixed concentration parameter α = 1 to be as parsimonious as possible on determining the median probability model for S = 1000 and a burn-in of 100 Gibbs-sampling steps.
It is important to note that to reach almost a substantial agreement between the actual estimators τ e , τ r on one side and the newly proposed ones τ b , τ f it is necessary to have S = 10,000, which corresponds to about seven times more the computational time used to obtain Table 1 which reports the estimated median probability models. Table 1. Tourism and Travel (T&T) data set. Estimated median probability models according to the different estimators τ crossed with BF definitions given by conventional and non-local priors. In parenthesis, τ is reported when the covariate is in the median probability model. The most important factor is precisely the one most affected by Covid-19, which is the number of operating airlines. The other factors are less related to Covid-19 but are still important. One is the organizations of the T&T govern monitoring represented by the timeliness of providing data on T&T. There are also variables related to country features: infrastructures (connection and power) internet search of T&T resources and the presence of natural resources (total known species). As observed above, non-local priors lead to more complex models than the conventional prior, but looking at the proposed estimators they seem to be more robust with respect to the choice of the prior on model parameters, in fact, τ b reports almost the same covariates regardless of the BF definition.

Remarks
This work aims to analyze the Gibbs-sampling output of model exploration according to a genuine Bayesian analysis and avoid frequentist approaches that are of a critical application for p large and S finite. Further, this is also the first work on comparing the conventional versus the non-local prior definition of BFs, which represent at the moment the state of the art in covariate selection under the normal linear model. The posterior mean of the τ is estimated with τ b results to be a robust estimator of the τ. This is because of the Dirichlet model, which shrinks to zero some of the observed empirical proportions, resulting from τ e , and increases the others. Moreover, it is also clear that the proposed τ r , studied only for conventional prior in [7], seems to work very well under the improved learning rate of the non-local prior approach.
As a further line of investigation, but beyond the scope of this paper, we note that π(M γ ) (here, the uniform) can be substituted by the Dirichlet process model illustrated above and thus used directly as π(M γ ) by incorporating it into the Gibbs-sampling. This is important in order to match ß in (1) with π(M γ ). This is incidentally done in [7], with the uniform π(M γ ) when using τ r . In general, the problem of model exploration would deserve more Bayesian methods than the existing ones.