Abstract
Zellner’s objective g-prior has been widely used in linear regression models due to its simple interpretation and computational tractability in evaluating marginal likelihoods. Moreover, the g-prior allows apportioning the prior variability between the portion explained by the linear predictor and that attributed to pure noise. In this paper, we propose a novel yet remarkably simple g-prior specification for the case where a subject-matter expert has information on the marginal distribution of the response. The approach is extended for use in mixed models, with some surprising but intuitive results. Simulation studies are conducted to compare the model fitting under the proposed g-prior with that under other existing priors.
1. Introduction
Incorporation of expert opinion has been an integral component of informative priors for Bayesian models in a wide variety of settings, many of them clinical [1,2,3]. Even in a highly regulated industry such as the medical devices field, guidance has existed for some time for how expert opinion might be incorporated into models [4]. However, the willingness of regulators to accept expert opinion does not necessarily mean that the process of obtaining and utilizing such information is straightforward. Existing approaches for leveraging prior opinions tend to be cumbersome and labor intensive [5,6,7,8]. This paper provides a simple, easy-to-use method for experts to specify g-priors for a wide class of mixed models focusing only on the marginal distribution of population responses.
A linear model is initially considered, $y_i=\mathbf{x}_i'\boldsymbol{\beta}+\varepsilon_i$ with $\varepsilon_i \stackrel{iid}{\sim} N(0,\sigma^2)$, $i=1,\dots,n$, and prior information is included such that, marginally, $E(y_i)=m$ and $\mathrm{Var}(y_i)=v$. Here, the notation $x\sim(\mu,\sigma^2)$ denotes that a random variable $x$ has mean $\mu$ and variance $\sigma^2$, $y_i$ is the $i$th response, $\mathbf{x}_i$ is a $p$-vector of covariates which usually includes an intercept, and $\boldsymbol{\beta}$ is the $p$-vector of regression coefficients. The errors are assumed Gaussian for the bulk of the paper, but this assumption can often be relaxed. Zellner’s g-prior [9,10] posits
$$\boldsymbol{\beta}\mid\sigma^2 \sim N_p\big(\boldsymbol{\beta}_0,\ g\sigma^2(\mathbf{X}'\mathbf{X})^{-1}\big),$$
where $\mathbf{X}=(\mathbf{x}_1,\dots,\mathbf{x}_n)'$ is the usual design matrix, yielding a posterior mean that is a weighted average of the usual ordinary least squares (OLS) estimator $\hat{\boldsymbol{\beta}}$ and the prior value $\boldsymbol{\beta}_0$, i.e., $E(\boldsymbol{\beta}\mid\mathbf{y})=\frac{g}{1+g}\hat{\boldsymbol{\beta}}+\frac{1}{1+g}\boldsymbol{\beta}_0$. Note that $g\to 0$ gives no weight to the outcome data and $g\to\infty$ gives complete weight to the data. The choice of g has received considerable interest in the literature, and the g-prior has been widely adopted for use in variable selection, e.g., [11,12]. It is not our intent to add to the burgeoning literature on variable selection here but rather to provide a useful prior for model parameters when some information about the data generating mechanism is known; in such cases, the “informative g-prior” developed here is competitive with existing approaches for variable selection (Section 4.1.2). In this paper, we propose an informative g-prior that can be used by default when prior information is lacking, or that can reflect available prior information on the marginal distribution of population responses. For example, if the outcome is cholesterol level in a certain population and the interest is to investigate how cholesterol level changes with covariates such as age, gender, ethnicity and body mass index, the expert might find that, marginally, $y_i\sim(m,v)$ with particular values of $m$ and $v$ gleaned from various studies. This marginal prior specification does not rely on any covariates, which makes the prior elicitation relatively easy. The theoretical marginal distribution of the $y_i$'s can be obtained from the population distribution $H$ of the covariates $\mathbf{x}_i$, the distribution on $\boldsymbol{\beta}$, and the value of $\sigma^2$ through the linear regression model under a specific form of the g-prior. Then, the g-prior can be derived by matching moments between this theoretical marginal distribution and the prior distribution of $y_i$ to ensure, e.g., $E(y_i)=m$ and $\mathrm{Var}(y_i)=v$. The method is further extended to provide default priors for mixed models, allowing for random-effects ANOVA, random coefficient models, etc.
The sampling distribution of the OLS estimator $\hat{\boldsymbol{\beta}}$ has covariance $\sigma^2(\mathbf{X}'\mathbf{X})^{-1}$. In a Bayesian analysis assuming normal errors, the flat prior $\pi(\boldsymbol{\beta})\propto 1$ yields the conditional posterior $\boldsymbol{\beta}\mid\sigma^2,\mathbf{y}\sim N_p\big(\hat{\boldsymbol{\beta}},\ \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\big)$. In either case, the covariance is approximately $\frac{\sigma^2}{n}\boldsymbol{\Sigma}_x^{-1}$, where $\boldsymbol{\Sigma}_x=E_H(\mathbf{x}\mathbf{x}')$ and $\frac{1}{n}\mathbf{X}'\mathbf{X}\approx\boldsymbol{\Sigma}_x$. That is, greater variability in the covariates implies greater precision in estimating $\boldsymbol{\beta}$. Thus, ref. [9] specifies a vague conditional prior for $\boldsymbol{\beta}$ that takes advantage of information on distributional shape based solely on $(\mathbf{X}'\mathbf{X})^{-1}$ and a flat prior on $\sigma^2$. The g-prior developed here further separates how much marginal variability in $y_i$ is explained a priori by the model from that of pure noise $\sigma^2$; a default specification assumes a flat uniform prior on this quantity.
Two popular classes of priors for regression models are conditional means priors [13,14] and power priors [15]. Conditional means priors require a subject matter expert to provide information on the mean response for several candidate vectors of covariates (that do not have to be among those actually observed); the usual specification requires the expert to be able to think about the mean responses independently, but this is not strictly required. Let the candidate vectors be $\tilde{\mathbf{x}}_1,\dots,\tilde{\mathbf{x}}_p$, where $p$ is the number of regression coefficients. The subject matter expert is asked to provide, say, a 95% interval that contains the mean response at covariates $\tilde{\mathbf{x}}_i$. This information on the conditional means is summarized as $\tilde{m}_i=\tilde{\mathbf{x}}_i'\boldsymbol{\beta}\sim N(\mu_i,v_i)$, independently for $i=1,\dots,p$, yielding $\tilde{\mathbf{m}}=\tilde{\mathbf{X}}\boldsymbol{\beta}\sim N_p(\boldsymbol{\mu},\mathbf{V})$, where $\tilde{\mathbf{X}}=(\tilde{\mathbf{x}}_1,\dots,\tilde{\mathbf{x}}_p)'$ and $\mathbf{V}=\mathrm{diag}(v_1,\dots,v_p)$. If $\tilde{\mathbf{X}}$ is invertible, requiring linearly independent candidate vectors, then the induced prior is simply $\boldsymbol{\beta}\sim N_p\big(\tilde{\mathbf{X}}^{-1}\boldsymbol{\mu},\ \tilde{\mathbf{X}}^{-1}\mathbf{V}(\tilde{\mathbf{X}}^{-1})'\big)$. Ref. [13] proposes methods for handling partial prior information on a subset of the conditional means, i.e., the subject matter expert need only specify a handful of priors for conditional means. In contrast, the g-prior developed here only requires information on the marginal distribution of the $y_i$'s, namely $y_i\sim(m,v)$.
Power priors are built from historical regression data having the same covariates as the current data. Say that the historical data are $(\mathbf{y}_h,\mathbf{X}_h)$ and the current data are $(\mathbf{y},\mathbf{X})$. The power prior is simply the posterior of $(\boldsymbol{\beta},\sigma^2)$ based on a reference prior, raised to the power $a_0$: $\pi(\boldsymbol{\beta},\sigma^2)\propto\big[\prod_{i=1}^{n_h}\phi(y_{hi};\mathbf{x}_{hi}'\boldsymbol{\beta},\sigma^2)\big]^{a_0}\pi_0(\boldsymbol{\beta},\sigma^2)$, where $\phi(\cdot;m,v)$ is the density of a normal random variable with mean $m$ and variance $v$. The parameter $a_0\in[0,1]$ provides the “degree of borrowing” from the historical data, with $a_0=0$ giving none and $a_0=1$ treating the historical data the same as the current study data. The choice of $a_0$ has also received considerable research [16,17,18]. In addition to the power and conditional means priors, ref. [19] proposed a natural conjugate reference informative prior by taking into account various degrees of certainty in covariates, and ref. [20] proposed a default prior for the $\beta_j$'s by using a normal distribution with mean zero and standard deviation equal to the standard error of the M-estimator of each $\beta_j$.
There are several notable limitations for conditional means and power priors. Conditional means priors involve the analyst thinking about various covariate combinations and providing information on the mean response for each covariate setting. As the number of predictors increases, this becomes increasingly difficult; it can be conceptually easier to think about marginal quantities such as the overall mean m and variance v in the population. Such marginal information may be available via census or through published summary data. The power prior requires a historical data set having a superset of the variables under consideration in the current study, which is often unavailable for new treatments.
One consequence of the proposed priors developed here is that proper, data-driven priors are given in closed form with default settings. Thus, standard model comparison via Bayes factors is possible, as no improper priors are used. Difficult-to-elicit prior information, such as the range of a variance component, is replaced with the question “How much variability in the data do you think the model explains?” If the answer to this is “I have no idea”, then a uniform distribution on the proportion of variability explained is suggested. The proposed priors do not have closed-form full conditional distributions for all parameters but are easily specified and fit in R using the statistical software Just Another Gibbs Sampler (JAGS) [21] via packages such as R2jags [22].
Bayesians have long known that injecting a small amount of prior information can often “fix” pathological MCMC schemes. The g-prior developed here can be viewed as a ridge prior that takes multicollinearity into account, with the added benefit that the ridge parameter is automatically chosen by the data. Section 2 introduces the informative g-prior for linear regression models. Section 3 extends the g-prior for use in mixed models. Section 4 presents a detailed set of simulation studies exploring the use of the g-prior and comparing it to other priors in common use. Section 5 concludes the paper with a discussion and an eye toward future research.
2. Prior for Linear Regression Models
2.1. The Prior in [23]
The g-prior in [23] was developed for logistic regression; this section carefully extends their approach to normal-errors regression, and Section 3 generalizes further to mixed models. Their g-prior is specified as
$$\boldsymbol{\beta}\mid g,\sigma^2 \sim N_p\big(\boldsymbol{\beta}_0,\ g\sigma^2(\mathbf{X}'\mathbf{X})^{-1}\big), \tag{1}$$
where $\boldsymbol{\beta}_0=(b_0,0,\dots,0)'$ and $\mathbf{X}=(\mathbf{x}_1,\dots,\mathbf{x}_n)'$ is the usual design matrix (with $\sigma^2=1$ in the logistic case). Assume $\mathbf{x}_i\stackrel{iid}{\sim}H$ for some distribution $H$ with finite second moments. Noting that $\mathbf{x}_i$ includes the intercept in the first element, the first element of $E_H(\mathbf{x}_i)$ is one and the first row along with the first column entries of $\mathrm{Cov}_H(\mathbf{x}_i)$ are all zeros. Given the data $\mathbf{X}$, for any new subject with response $y$ and covariates $\mathbf{x}\sim H$, assuming $\mathbf{x}$ and $\boldsymbol{\beta}$ are mutually independent, one has $E(\mathbf{x}'\boldsymbol{\beta})=E_H(\mathbf{x})'\boldsymbol{\beta}_0=b_0$ by the law of iterated expectations. In addition, by the law of total variance, one has $\mathrm{Var}(\mathbf{x}'\boldsymbol{\beta})=E_H\big[g\sigma^2\mathbf{x}'(\mathbf{X}'\mathbf{X})^{-1}\mathbf{x}\big]+\mathrm{Var}_H(\mathbf{x}'\boldsymbol{\beta}_0)=g\sigma^2\,\mathrm{tr}\big\{(\mathbf{X}'\mathbf{X})^{-1}E_H(\mathbf{x}\mathbf{x}')\big\}\approx g\sigma^2 p/n$,
where the second term vanishes because only the first (constant) element of $\mathbf{x}$ receives non-zero weight in $\boldsymbol{\beta}_0$, and the approximation originates from the fact that $\frac{1}{n}\mathbf{X}'\mathbf{X}\to_p E_H(\mathbf{x}\mathbf{x}')$, with $\to_p$ denoting convergence in probability. Hence, given $(g,\sigma^2)$, the g-prior in (1) implies that $\mathbf{x}'\boldsymbol{\beta}$ has a variance approximately equal to $g\sigma^2 p/n$ for any covariate vector randomly drawn from its population $H$. Ref. [23] found that $\mathbf{x}'\boldsymbol{\beta}$ also often approximately follows a normal distribution, and this approximation is good for a variety of $H$ considered in their simulations, even when some covariates are categorical. Therefore, it is reasonable to assume that $\mathbf{x}'\boldsymbol{\beta}$ approximately follows $N(b_0,\ g\sigma^2 p/n)$.
For the linear normal regression model $y_i=\mathbf{x}_i'\boldsymbol{\beta}+\varepsilon_i$, the g-prior in [23] can be applied as follows. Assume a subject-matter expert has in hand information on the distribution of marginal mean responses (i.e., $E(y\mid\mathbf{x})=\mathbf{x}'\boldsymbol{\beta}$) in a population, rather than the distribution of $y$, say $\mathbf{x}'\boldsymbol{\beta}\sim N(\mu_m,\sigma_m^2)$, with $(\mu_m,\sigma_m^2)$ being chosen to reflect the prior knowledge about the distribution of the mean response. Then, using the prior matching idea in [23], one can immediately solve for $\boldsymbol{\beta}_0$ and $g$ in (1) as $\boldsymbol{\beta}_0=(\mu_m,0,\dots,0)'$ and $g=n\sigma_m^2/(p\sigma^2)$, where $p$ is the number of regression coefficients. Although [24] finds the default prior given by [23] for logistic regression to provide the best predictive performance among several contenders, the performance for the linear regression model has not been well tested. In addition, it is not straightforward to set default values for $(\mu_m,\sigma_m^2)$, and its extension to linear mixed models is not readily available.
In this paper, we propose a new g-prior for the linear regression model for the case where a subject-matter expert has information on the marginal distribution of the response $y$ rather than of $\mathbf{x}'\boldsymbol{\beta}$, with reasonable default settings, and then extend it for use in linear mixed models.
2.2. New Prior Development
An easily implemented g-prior is first proposed for use in the linear regression model:
$$y_i=\mathbf{x}_i'\boldsymbol{\beta}+\varepsilon_i,\quad \varepsilon_i\stackrel{iid}{\sim}N(0,\sigma^2),\quad i=1,\dots,n. \tag{2}$$
Consider the situation where a subject matter expert has information on the marginal distribution of observations that can be synthesized as
$$y_i\sim(m,v),\qquad 1/v\sim\Gamma(a_v,b_v), \tag{3}$$
where $m$, $a_v$, and $b_v$ can be obtained from previous studies or published summary data; details are given in Section 2.3. Here, $\Gamma(a,b)$ denotes the gamma distribution with mean equal to $a/b$. The goal here is to develop a particular version of the g-prior on $(\boldsymbol{\beta},\sigma^2)$ in (2) that achieves the marginal distribution of $y_i$ in (3).
Consider the g-prior in (1). Given $(\sigma^2,v)$, the total expectation formula gives $E(y_i)=E_H(\mathbf{x}_i)'\boldsymbol{\beta}_0$, and the total variance formula gives $\mathrm{Var}(y_i)=\sigma^2+\mathrm{Var}(\mathbf{x}_i'\boldsymbol{\beta})\approx\sigma^2\big(1+gp/n\big)$.
For models with an intercept, setting $\boldsymbol{\beta}_0=(m,0,\dots,0)'$ satisfies the first moment condition $E(y_i)=m$. The larger $\sigma^2$ is, the more the prior shrinks toward the intercept-only model (with an intercept focused on $m$), and so is conservative in favoring the null of the overall F-test that no covariates are important.
To match the second moment condition $\mathrm{Var}(y_i)=v$, set $\sigma^2(1+gp/n)=v$ and solve for $g$ in (1) when $\sigma^2<v$, yielding $g=\frac{n}{p}\,\frac{v-\sigma^2}{\sigma^2}$. Since $g>0$ and $\mathrm{Var}(y_i)\approx v$ for all $\sigma^2\in(0,v)$, the marginal constraint of $\mathrm{Var}(y_i)=v$ approximately holds for any prior on $\sigma^2$ with support $(0,v)$. In particular, a special case of the generalized beta distribution,
$$\pi(\sigma^2\mid v)=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\,\frac{1}{v}\Big(\frac{\sigma^2}{v}\Big)^{a-1}\Big(1-\frac{\sigma^2}{v}\Big)^{b-1},\quad 0<\sigma^2<v, \tag{4}$$
denoted $\sigma^2\mid v\sim GB(a,b,0,v)$, allows flexibility in specifying how much variability the regression model explains relative to the total variability $v$; note $E(\sigma^2\mid v)=va/(a+b)$. If one had prior information that, say, the amount of variation explained by regression is $\tilde{R}^2=1-\sigma^2/v$ (similar to $R^2$ in OLS regression, but defined through the prior moments rather than estimates), then the parameters in (4) could be chosen such that $E(1-\sigma^2/v)=\tilde{R}^2$ with the total “sample size” going into the prior as $a+b=n_0$; solving yields $a=n_0(1-\tilde{R}^2)$ and $b=n_0\tilde{R}^2$. No prior preference gives $a=b=1$, i.e., $\sigma^2\mid v\sim U(0,v)$, a sensible default choice.
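To make the mapping from prior beliefs to $(a,b)$ concrete, the following R sketch implements the moment match just described; the function name and the treatment of $n_0$ as a prior “sample size” are our own illustrative choices under the reconstructed formulas above.

```r
# Hedged sketch: choose the generalized-beta parameters (a, b) from a prior
# guess R2_tilde of the variation explained and a prior "sample size" n0.
gb_params <- function(R2_tilde, n0) {
  c(a = n0 * (1 - R2_tilde), b = n0 * R2_tilde)  # reconstructed moment match
}
gb_params(0.5, 2)  # no-preference default: a = b = 1
```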
Encapsulating the above, a hierarchical prior that maintains $E(y_i)=m$ and $\mathrm{Var}(y_i)=v$ is
$$\boldsymbol{\beta}\mid\sigma^2,v\sim N_p\Big((m,0,\dots,0)',\ g\sigma^2(\mathbf{X}'\mathbf{X})^{-1}\Big),\quad g=\frac{n}{p}\,\frac{v-\sigma^2}{\sigma^2},\quad \sigma^2\mid v\sim GB(a,b,0,v),\quad 1/v\sim\Gamma(a_v,b_v). \tag{5}$$
This prior provides an intuitive interpretation given $v$: when $\sigma^2\to 0$ ($g\to\infty$) the model explains all variability in $y_i$; when $\sigma^2=v$ ($g=0$) the model explains nothing. Values $\sigma^2\in(0,v)$ indicate that the truth is somewhere between these two extremes, with $\sigma^2\mid v\sim U(0,v)$ reflecting no preference on how much variability the model explains. This formulation of the g-prior can be viewed as a type of ridge regression which further addresses multicollinearity among predictors, but where the ridge parameter is chosen automatically. The special form of the g-prior enables easy computation of the amount of variability the model explains, $1-\sigma^2/v$, relative to the total $v$.
Once a distribution (e.g., Gaussian) is assumed for the errors in the linear regression model given in (2), estimates of $\boldsymbol{\beta}$ and $\sigma^2$ can be obtained using statistical software such as JAGS [21] via the R package R2jags [22]; see the Supplementary Materials for the R code.
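As a concrete illustration, a minimal JAGS sketch of the model (2) under the prior (5) follows; it is based on the reconstructed form of (5) above with the uniform default $a=b=1$, and all variable names (`y`, `X`, `XtX`, `a_v`, `b_v`) are illustrative rather than the authors' supplementary code.

```r
library(R2jags)  # assumes JAGS is installed

# Minimal sketch of model (2) under the informative g-prior (5); the
# parameterization follows the reconstruction above.
gprior_model <- "
model {
  for (i in 1:n) { y[i] ~ dnorm(inprod(X[i, ], beta[]), 1 / sigma2) }
  w ~ dunif(0, 1)            # sigma2 | v ~ U(0, v), the a = b = 1 default
  sigma2 <- w * v
  inv_v ~ dgamma(a_v, b_v)   # 1/v ~ Gamma(a_v, b_v), mean a_v / b_v
  v <- 1 / inv_v
  g <- (n / p) * (v - sigma2) / sigma2
  beta[1:p] ~ dmnorm(beta0[], XtX[, ] / (g * sigma2))  # precision form
}
"
# Usage: write gprior_model to a file and pass its path as model.file to
# R2jags::jags(), with data list(y, X, XtX = crossprod(X), n, p,
# beta0 = c(m, rep(0, p - 1)), a_v, b_v).
```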
2.3. Hyper-Prior Elicitation for $(m, v)$
Our prior in (5) requires a specification for the hyperparameters $m$, $a_v$, and $b_v$. Suppose we have historical data $\mathbf{y}_h=(y_{h1},\dots,y_{hM})'$ from a similar study population. If we assume $y_{hi}\stackrel{iid}{\sim}N(m,v)$, then using a noninformative prior such as $\pi(m,v)\propto 1/v$ gives
$$m\mid v,\mathbf{y}_h\sim N(\bar{y}_h,\ v/M),\qquad 1/v\mid\mathbf{y}_h\sim\Gamma\Big(\frac{M-1}{2},\ \frac{(M-1)s_h^2}{2}\Big), \tag{6}$$
where $\bar{y}_h=\frac{1}{M}\sum_{i=1}^M y_{hi}$ and $s_h^2=\frac{1}{M-1}\sum_{i=1}^M(y_{hi}-\bar{y}_h)^2$. If one believes that the historical data come from the same population as the current observed response data $\mathbf{y}$, it is reasonable to set $m=\bar{y}_h$, $a_v=(M-1)/2$, and $b_v=(M-1)s_h^2/2$ in (5). If the historical data come from a population quite different from the current study or the population distribution is not plausibly normal, one may set lower values for $a_v$ and $b_v$ to put less weight on the historical data relative to the current data, e.g., via an effective historical size $M_0<M$ with $a_v=(M_0-1)/2$ and $b_v=(M_0-1)s_h^2/2$. If historical data are not available, we recommend setting $m=\bar{y}$, $a_v=1/2$, and $b_v=s_y^2/2$ instead of fixing $v=s_y^2$; this assumes that the unavailable historical data have the sample mean equal to $\bar{y}$ and the sample variance equal to $s_y^2$ and are given with the weight of two observations. In real applications, a sensitivity analysis can be performed by setting $M_0$ to several different values between 2 and $M$.
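The following R helper sketches the elicitation just described; the formulas are the reconstructed ones above, and `elicit_hyper` is an illustrative name rather than the paper's code.

```r
# Hedged helper: map historical responses y_h (optionally downweighted to an
# effective size M0 <= length(y_h)) to the hyperparameters (m, a_v, b_v) of (5).
elicit_hyper <- function(y_h, M0 = length(y_h)) {
  s2_h <- var(y_h)
  list(m = mean(y_h), a_v = (M0 - 1) / 2, b_v = (M0 - 1) * s2_h / 2)
}
# No historical data: weight-of-two default using the current responses y,
# e.g., elicit_hyper(y, M0 = 2) gives a_v = 1/2 and b_v = var(y)/2.
```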
The idea behind our hyperprior elicitation for $(m,v)$ is similar to the power prior [15], which is defined as the posterior of the model parameters given the historical data, raised to a power $a_0$, where $a_0$ provides the “degree of borrowing” from the historical data. Consider an intercept-only model $y_i=\beta_1+\varepsilon_i$. Note that our prior (5) simply reduces to $\beta_1\mid\sigma^2,v\sim N(m,\ v-\sigma^2)$, $\sigma^2\mid v\sim GB(a,b,0,v)$, $1/v\sim\Gamma(a_v,b_v)$. Given the historical data $\mathbf{y}_h$, setting $m=\bar{y}_h$, $a_v=(M-1)/2$, and $b_v=(M-1)s_h^2/2$ is exactly the power prior with $a_0=1$. Similarly, the values of $a_v$ and $b_v$ control the influence of historical data. For the general linear model in (2), the important difference is that our prior on $(\boldsymbol{\beta},\sigma^2)$ does not require any covariates in the historical data, since it depends on the historical data only through $(\bar{y}_h, s_h^2)$.
2.4. Comparing to the Mixture of g-Priors
For the linear model in (2) with Gaussian errors, the hyper-g prior in [12] can be expressed as
$$\pi(\beta_1,\sigma^2)\propto 1/\sigma^2,\quad \boldsymbol{\beta}\mid g,\sigma^2\sim N_p\big(\mathbf{0},\ g\sigma^2(\mathbf{X}'\mathbf{X})^{-1}\big),\quad \pi(g)=\frac{a-2}{2}(1+g)^{-a/2},\ g>0, \tag{7}$$
where $a>2$ is set to ensure a proper distribution. Ref. [12] show that the hyper-g prior is not consistent for model selection when the true model is the null model, and then propose a hyper-$g/n$ prior as
$$\pi(g)=\frac{a-2}{2n}\Big(1+\frac{g}{n}\Big)^{-a/2},\quad g>0. \tag{8}$$
Setting $a=b=1$, i.e., $\sigma^2\mid v\sim U(0,v)$, in our prior (5) gives
$$\pi(g)=\frac{p}{n}\Big(1+\frac{gp}{n}\Big)^{-2},\quad g>0, \tag{9}$$
free of $v$. If we further set $m=0$, $a_v=0$, and $b_v=0$, it is easy to show that our prior further becomes
$$\pi(\sigma^2)\propto 1/\sigma^2,\quad \boldsymbol{\beta}\mid g,\sigma^2\sim N_p\big(\mathbf{0},\ g\sigma^2(\mathbf{X}'\mathbf{X})^{-1}\big),\quad \pi(g)=\frac{p}{n}\Big(1+\frac{gp}{n}\Big)^{-2}, \tag{10}$$
which is similar to the hyper-$g/n$ prior in (8) with $a=4$, the only difference being that our $g$ is scaled by $n/p$ instead of $n$. Therefore, the proposed prior here naturally leads to a modified version of the hyper-$g/n$ prior considered in [12] when there is no history information on $(m,v)$.
2.5. Simple Example
Ref. [25] analyze data on the lengths $y_i$ (in meters) of $n=27$ dugongs (sea cows) having ages $x_i$ (in years). They fit a nonlinear exponential model for length based on age; we consider a linear model by transforming age, i.e., $y_i=\beta_1+\beta_2\log(x_i)+\varepsilon_i$. An example of a commonly used vague, proper prior is $\boldsymbol{\beta}\sim N_2(\mathbf{0},1000\,\mathbf{I}_2)$ and $1/\sigma^2\sim\Gamma(0.001,0.001)$. The prior marginal mean and variance for the response $y$ under this prior can be estimated via Monte Carlo (MC) by simulating $\boldsymbol{\beta}^{(s)}$ and $\sigma^{2(s)}$ from the prior and then $y_i^{(s)}\sim N(\mathbf{x}_i'\boldsymbol{\beta}^{(s)},\sigma^{2(s)})$, $i=1,\dots,n$, $s=1,\dots,1000$, yielding 1000 datasets. The simulation of $\sigma^{2(s)}$ is completed using the method of [26] designed for gamma distributions with small shape parameters. The average prior sample mean (across the 1000 datasets) and prior sample variance are nowhere near the observed sample mean and variance of the dugong lengths. In contrast, a similar simulation under our proposed new g-prior in (5) with $a=b=1$, $m=\bar{y}$, $a_v=1/2$, and $b_v=s_y^2/2$ yields an average sample mean of 2.305 with an MC standard deviation of 0.559 and an average sample variance of 0.442 with an MC standard deviation of 3.243. That is, the inference under our prior focuses on a much smaller set of potential models that could have conceivably generated the observed data. The posterior estimates for $\beta_1$, $\beta_2$, and $\sigma^2$ under our proposed new g-prior are 1.770 (0.047), 0.273 (0.021), and 0.0094 (0.0029), respectively, where the values in parentheses are posterior standard deviations. The commonly used vague priors specified above yield similar estimates but with slightly higher posterior standard deviations: 1.763 (0.047), 0.277 (0.021), and 0.0097 (0.0031).
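The prior predictive simulation under (5) takes only a few lines of R; the sketch below uses the reconstructed form of (5), with `prior_predict` an illustrative name rather than the paper's code, and requires the MASS package.

```r
# Hedged sketch of a prior-predictive check under the proposed prior (5)
# with defaults a = b = 1, m = mean(y), a_v = 1/2, b_v = var(y)/2.
prior_predict <- function(X, m, a_v, b_v, S = 1000) {
  n <- nrow(X); p <- ncol(X)
  XtX_inv <- solve(crossprod(X))
  replicate(S, {
    v      <- 1 / rgamma(1, a_v, b_v)
    sigma2 <- runif(1, 0, v)                 # sigma2 | v ~ U(0, v)
    g      <- (n / p) * (v - sigma2) / sigma2
    beta   <- MASS::mvrnorm(1, c(m, rep(0, p - 1)), g * sigma2 * XtX_inv)
    y_rep  <- rnorm(n, X %*% beta, sqrt(sigma2))
    c(mean = mean(y_rep), var = var(y_rep))  # compare to observed moments
  })
}
```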
The use of such prior predictive checks has recently been advocated by [27,28,29]; in particular, ref. [27] suggests that analysts “visualize simulations from the prior marginal distribution of the data to assess the consistency of the chosen priors with domain knowledge.” They further suggest the use of “weakly informative” priors to gently urge the prior in the direction of providing plausible marginal values. This requires some thought and visual exploration on the part of the user; the prior developed here provides a safe, default method for nudging the prior toward domain knowledge in the form of either historical marginal values or the sample moments seen in the data. The prior mean and variance exist whether the analyst wants to think about them or not; this example illustrates that “vague” priors are not necessarily noninformative.
2.6. Variable Selection
Consider the Gaussian linear regression model (2). Using the proposed g-prior in (5) for Bayesian variable selection requires the calculation of the marginal likelihood for each of the submodels, denoted as $M_{\boldsymbol{\gamma}}$, where $\boldsymbol{\gamma}=(\gamma_1,\dots,\gamma_p)'$ is a $p$-dimensional vector of indicators with $\gamma_j=1$ implying that the $j$th covariate is included in the model. Here, we always set $\gamma_1=1$ so that an intercept is included. Under model $M_{\boldsymbol{\gamma}}$, we have $\mathbf{y}\mid\boldsymbol{\beta}_{\boldsymbol{\gamma}},\sigma^2\sim N_n(\mathbf{X}_{\boldsymbol{\gamma}}\boldsymbol{\beta}_{\boldsymbol{\gamma}},\sigma^2\mathbf{I}_n)$, where $\mathbf{X}_{\boldsymbol{\gamma}}$ is the design matrix under model $M_{\boldsymbol{\gamma}}$, and $\boldsymbol{\beta}_{\boldsymbol{\gamma}}$ is the corresponding $p_{\boldsymbol{\gamma}}$-vector of regression coefficients. For model $M_{\boldsymbol{\gamma}}$, a default prior specification for $\boldsymbol{\beta}_{\boldsymbol{\gamma}}$ and $\sigma^2$ is given by
$$\boldsymbol{\beta}_{\boldsymbol{\gamma}}\mid\sigma^2,v\sim N_{p_{\boldsymbol{\gamma}}}\Big(\boldsymbol{\beta}_{0\boldsymbol{\gamma}},\ g_{\boldsymbol{\gamma}}\sigma^2(\mathbf{X}_{\boldsymbol{\gamma}}'\mathbf{X}_{\boldsymbol{\gamma}})^{-1}\Big),\quad g_{\boldsymbol{\gamma}}=\frac{n}{p_{\boldsymbol{\gamma}}}\,\frac{v-\sigma^2}{\sigma^2},\quad \sigma^2\mid v\sim U(0,v), \tag{11}$$
where $\boldsymbol{\beta}_{0\boldsymbol{\gamma}}=(m,0,\dots,0)'$ is $p_{\boldsymbol{\gamma}}$-dimensional.
To perform the variable selection, we need to calculate the Bayes factor for comparing each model $M_{\boldsymbol{\gamma}}$ with the null (intercept-only) model $M_0$. Note that under model $M_{\boldsymbol{\gamma}}$ ($M_{\boldsymbol{\gamma}}\neq M_0$) with the prior (11), the marginal likelihood given $\sigma^2$ and $v$ is available in closed form, given in (12), where $g_{\boldsymbol{\gamma}}$ is as in (11) and $R_{\boldsymbol{\gamma}}^2$ is the usual R-squared under model $M_{\boldsymbol{\gamma}}$. Under the null model $M_0$: $y_i=\beta_1+\varepsilon_i$, the prior (11) simply reduces to $\beta_1\mid\sigma^2,v\sim N(m,\ v-\sigma^2)$ with $\sigma^2\mid v\sim U(0,v)$, and the corresponding marginal likelihood given $\sigma^2$ and $v$ is given in (13). Note that (13) is a special case of (12) with $p_{\boldsymbol{\gamma}}=1$ and $R_{\boldsymbol{\gamma}}^2=0$.
When $v$ is fixed and known, the Bayes factor for comparing any model $M_{\boldsymbol{\gamma}}$ ($M_{\boldsymbol{\gamma}}\neq M_0$) to the null model, given in (14), is the ratio of (12) to (13) integrated over $w=\sigma^2/v$. It is easy to show that the Bayes factor in (14) is finite for all $R_{\boldsymbol{\gamma}}^2<1$. The integrals in (14) can be numerically computed using the R function integrate [30].
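Numerically, (14) is a ratio of one-dimensional integrals over $w=\sigma^2/v$, which is straightforward with integrate; the sketch below is generic, with `marg_lik` and `marg_lik_null` standing in for the closed forms (12) and (13), which are not reproduced here.

```r
# Hedged sketch: Bayes factor (14) as a ratio of 1-D integrals over
# w = sigma2 / v; marg_lik() and marg_lik_null() stand in for (12) and (13)
# and are assumed to be vectorized in their first argument w.
bf_vs_null <- function(marg_lik, marg_lik_null, ...) {
  num <- integrate(function(w) marg_lik(w, ...), lower = 0, upper = 1)$value
  den <- integrate(function(w) marg_lik_null(w, ...), lower = 0, upper = 1)$value
  num / den
}
```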
When the hyperprior on $v$ in (5) is used, the Bayes factor for comparing $M_{\boldsymbol{\gamma}}$ to $M_0$ becomes a ratio of expectations of (12) and (13), given in (15), where the expectations are taken under the prior for $v$ in (5). However, the calculation of the expectations in (15) is considerably more computationally demanding. Based on the competitive performance of our prior compared to other methods in simulation studies, we recommend using the Bayes factor in (14) with $v$ fixed at $\hat{v}$, where $\hat{m}$ and $\hat{v}$ are determined as follows. If there is no history information available for $(m,v)$, we simply use $\hat{m}=\bar{y}$ and $\hat{v}=s_y^2$ based on the current marginal data $\mathbf{y}$. If there is some history information for $(m,v)$ that can be summarized as the hyperprior in (3) with given $(m,a_v,b_v)$, we set $\hat{v}=E(v\mid\mathbf{y})$, the posterior mean estimate for $v$ based on only the marginal data $\mathbf{y}$; see Section 2.3 for the specification of $m$, $a_v$, and $b_v$ with historical data. Note that closed-form formulas for $\hat{m}$ and $\hat{v}$ can be derived; see [31] for the derivations. Once the model is selected, we can apply the prior (5) to fit the selected model.
Information Paradox
The information paradox [32] refers to situations where we have very strong information supporting a non-null model $M_{\boldsymbol{\gamma}}$, but the Bayes factor comparing $M_{\boldsymbol{\gamma}}$ to $M_0$ does not go to ∞ as the information in favor of $M_{\boldsymbol{\gamma}}$ accumulates (i.e., as $R_{\boldsymbol{\gamma}}^2\to 1$). The proposed informative g-prior resolves the information paradox in the sense that the Bayes factor in (14) goes to ∞ as $R_{\boldsymbol{\gamma}}^2\to 1$ with $n$ fixed. Note that the denominator in (14) is finite and free of $R_{\boldsymbol{\gamma}}^2$, and by the mean value theorem for definite integrals, there exists $c$ in $(0,1)$ such that the numerator of (14) equals its integrand evaluated at $w=c$.
Therefore, it suffices to show that the integrand of the numerator diverges as $R_{\boldsymbol{\gamma}}^2\to 1$ for any fixed $w\in(0,1)$.
Noting that the integrand is an increasing function of $R_{\boldsymbol{\gamma}}^2$, we can bound it below by a quantity that diverges as $R_{\boldsymbol{\gamma}}^2\to 1$,
for all $w\in(0,1)$, which completes the argument.
3. Mixed Models
3.1. One-Way Random Effects ANOVA
The g-prior developed in Section 2 for regression models can be immediately extended to mixed models in an analogous fashion. The shrinkage induced by the g-prior yields familiar exchangeable prior specifications already in widespread use as special cases, as well as some new default formulations. We first examine the simplest random effects model, a one-way ANOVA, typically formulated as
$$y_{ij}=\mu+u_i+\varepsilon_{ij},\quad i=1,\dots,k,\ j=1,\dots,J, \tag{16}$$
where $\varepsilon_{ij}\stackrel{iid}{\sim}N(0,\sigma^2)$, rewritten in matrix form as
$$\mathbf{y}=\mu\mathbf{1}_N+\mathbf{Z}\mathbf{u}+\boldsymbol{\varepsilon}, \tag{17}$$
where $\mathbf{u}=(u_1,\dots,u_k)'$, $\mathbf{1}_N$ is an $N$-vector of ones ($N=kJ$), and $\mathbf{Z}=\mathbf{I}_k\otimes\mathbf{1}_J$.
where , is a -vector of ones, and . Note that without further constraints, (16) is overparameterized; shrinkage on both the “fixed” and “random” portions separately is required for identifiability.
Noting that the design column for $\mu$ equals one for all observations, a g-prior on the first portion is $\mu\sim N\big(m,\ g_1\sigma^2(\mathbf{1}_N'\mathbf{1}_N)^{-1}\big)=N(m,\ g_1\sigma^2/N)$. Similarly, a g-prior on the second portion is $\mathbf{u}\sim N_k\big(\mathbf{0},\ g_2\sigma^2(\mathbf{Z}'\mathbf{Z})^{-1}\big)=N_k\big(\mathbf{0},\ (g_2\sigma^2/J)\,\mathbf{I}_k\big)$. This prior is the same as assuming exchangeable random effects, e.g., $u_i\stackrel{iid}{\sim}N(0,\sigma_u^2)$, where $\sigma_u^2=g_2\sigma^2/J$; placing a prior on $g_2$ is the same as placing a prior on $\sigma_u^2$. The g-prior as a ridge prior is evident here, with model identifiability achieved by shrinking $\mathbf{u}$ towards $\mathbf{0}$. The amount of shrinkage is controlled a priori via the parameter $g_2$. There are obvious links from the g-prior to ridge regression, shrinkage priors, and penalized likelihood.
The prior on $\sigma_u^2$ has received considerable interest; suggestions include the half-Cauchy prior and uniform priors [33], as well as approximations to Jeffreys’ prior, e.g., $1/\sigma_u^2\sim\Gamma(\epsilon,\epsilon)$ for small $\epsilon>0$, which permeated the Bayesian literature in the 1990s. Ref. [34] advocates a data-driven prior that is similar in spirit to what is presented here. Ref. [35] considers a shrinkage prior for $\sigma_u^2$ induced by a uniform prior on the shrinkage coefficient. Ref. [36] uses a g-prior for ANOVA with a diverging number of parameters. In contrast, we will build a prior that facilitates the borrowing of history information on the overall marginal mean $m$ and variance $v$ of the data $y_{ij}$.
3.2. Linear Mixed Models
Now consider the linear mixed model
$$y_{ij}=\mathbf{x}_{ij}'\boldsymbol{\beta}+\mathbf{z}_{ij}'\mathbf{u}_i+\varepsilon_{ij},\quad \varepsilon_{ij}\stackrel{iid}{\sim}N(0,\sigma^2), \tag{18}$$
where $\mathbf{u}_i$ is a $k$-vector of random effects, $i=1,\dots,c$, $j=1,\dots,n_i$. In this setting, $i$ denotes the data cluster associated with $y_{ij}$ and $n_i$ is the number of repeated measures within cluster $i$; the total sample size is $N=\sum_{i=1}^c n_i$. The variability in model (18) is portioned to $\mathbf{x}_{ij}'\boldsymbol{\beta}$, $\mathbf{z}_{ij}'\mathbf{u}_i$, and $\varepsilon_{ij}$. The first two components will have dependent g-priors, inducing differing amounts of shrinkage across the two regression models; the second portion is further shrunk toward zero. Set $\mathbf{X}_i=(\mathbf{x}_{i1},\dots,\mathbf{x}_{in_i})'$. Again, the goal here is to develop a prior on $(\boldsymbol{\beta},\mathbf{u}_1,\dots,\mathbf{u}_c,\sigma^2)$ that incorporates the marginal information $y_{ij}\sim(m,v)$, where a hyperprior on $(m,v)$ can be extracted from historical data or expert opinion. The usual g-prior on $\boldsymbol{\beta}$ for cluster $i$ is $\boldsymbol{\beta}\sim N_p\big(\boldsymbol{\beta}_0,\ g_1\sigma^2(\mathbf{X}_i'\mathbf{X}_i)^{-1}\big)$.
Let $\bar{\mathbf{x}}_i$ and $\mathbf{S}_i$ denote the mean and covariance of the $\mathbf{x}_{ij}$ for cluster $i$, so that $\mathbf{X}_i'\mathbf{X}_i\approx n_i(\mathbf{S}_i+\bar{\mathbf{x}}_i\bar{\mathbf{x}}_i')$. Similarly, let $\bar{\mathbf{x}}$ and $\mathbf{S}$ denote the overall mean and covariance across all clusters, and set $\mathbf{X}=(\mathbf{X}_1',\dots,\mathbf{X}_c')'$, so that $\mathbf{X}'\mathbf{X}\approx N(\mathbf{S}+\bar{\mathbf{x}}\bar{\mathbf{x}}')$. Noting that the same coefficient $\boldsymbol{\beta}$ is used for all clusters, the overall g-prior for $\boldsymbol{\beta}$ can be set as
$$\boldsymbol{\beta}\sim N_p\big((m,0,\dots,0)',\ g_1\sigma^2(\mathbf{X}'\mathbf{X})^{-1}\big). \tag{19}$$
The usual g-prior on $\mathbf{u}_i$ for cluster $i$ is
$$\mathbf{u}_i\sim N_k\big(\mathbf{0},\ g_2\sigma^2(\mathbf{Z}_i'\mathbf{Z}_i)^{-1}\big), \tag{20}$$
where $\mathbf{Z}_i=(\mathbf{z}_{i1},\dots,\mathbf{z}_{in_i})'$, $i=1,\dots,c$. Denote by $\bar{\mathbf{z}}$ and $\mathbf{S}_z$ the overall mean and covariance of the $\mathbf{z}_{ij}$ across all clusters. If the $\mathbf{z}_{ij}$s come from the same population, i.e., the cluster-level means and covariances match $\bar{\mathbf{z}}$ and $\mathbf{S}_z$, (20) is equivalent to $\mathbf{u}_i\sim N_k\big(\mathbf{0},\ \frac{g_2\sigma^2}{n_i}\boldsymbol{\Omega}^{-1}\big)$,
where $\boldsymbol{\Omega}=\mathbf{S}_z+\bar{\mathbf{z}}\bar{\mathbf{z}}'\approx E(\mathbf{z}_{ij}\mathbf{z}_{ij}')$. This final expression lies at the heart of hundreds of mixed model analyses; the derivation here clarifies that this is exactly what the g-prior gives us when the $\mathbf{z}_{ij}$s share a common population. Define $\bar{n}=N/c$. Noting that $\mathbf{Z}_i'\mathbf{Z}_i\approx n_i\boldsymbol{\Omega}$, a sensible default prior is
$$\mathbf{u}_i\stackrel{iid}{\sim}N_k\Big(\mathbf{0},\ \frac{g_2\sigma^2}{\bar{n}}\boldsymbol{\Omega}^{-1}\Big), \tag{21}$$
assuming a common population for the $\mathbf{z}_{ij}$s and $n_i\approx\bar{n}$ is approximately correct.
Let $T_r(\boldsymbol{\mu},\boldsymbol{\Sigma})$ be the $k$-dimensional multivariate t distribution with degrees of freedom $r$, mean $\boldsymbol{\mu}$ for $r>1$, and covariance $\frac{r}{r-2}\boldsymbol{\Sigma}$ for $r>2$. Taking a gamma mixing distribution on $1/g_2$ under the default prior (21), the induced marginal prior on $\mathbf{u}_i$ is a multivariate t distribution; see [37].
It is tempting to seek out a more flexible model via the Wishart distribution, but note that if the random-effects covariance is instead given an inverted-Wishart prior with matching moments, the same marginal multivariate t distribution is induced on $\mathbf{u}_i$; here, an inverted-Wishart distribution with the usual degrees-of-freedom and scale-matrix parameters is meant. One can play around with different settings for various hyperparameters, but the end result is typically a multivariate t distribution or something close. For example, ref. [38] proposed a default random effects specification for generalized linear models; under the normal errors model, their proposal is an inverted-Wishart prior on the random-effects covariance centered at a data-driven estimate inflated by a factor, and their induced marginal prior on $\mathbf{u}_i$ is again multivariate t. Note that our specification differs in what it conditions on; otherwise, all these priors induce a multivariate t-distribution with similar covariance structures. Ref. [38] compare their approach to the approximate uniform shrinkage prior of [39]. Ref. [40] extended the half-t prior [33] to the multivariate setting so that the prior on the covariance matrix induces half-t priors on standard deviations and uniform priors on correlations.
We proceed to build a prior that reflects the prior knowledge on the overall marginal mean $m$ and variance $v$ of the data, $y_{ij}\sim(m,v)$. Under prior (20) or (21) along with (19), we have $E(y_{ij})=m$ as before and now $\mathrm{Var}(y_{ij})\approx\sigma^2+\frac{g_1\sigma^2p}{N}+\frac{g_2\sigma^2k}{\bar{n}}$.
Certainly $0<\sigma^2<v$ and $\sigma^2+g_2\sigma^2k/\bar{n}\le v$ are reasonable bounds. The following default specification enforces the mean and variance constraint of $y_{ij}\sim(m,v)$:
$$\boldsymbol{\beta}\mid\sigma^2,v\sim N_p\Big((m,0,\dots,0)',\ g_1\sigma^2(\mathbf{X}'\mathbf{X})^{-1}\Big),\quad \mathbf{u}_i\mid\sigma^2,v\stackrel{iid}{\sim}N_k\Big(\mathbf{0},\ \frac{g_2\sigma^2}{\bar{n}}\boldsymbol{\Omega}^{-1}\Big),$$
$$g_1=\frac{N}{p}\,\frac{w_1v}{\sigma^2},\quad g_2=\frac{\bar{n}}{k}\,\frac{w_2v}{\sigma^2},\quad \sigma^2=(1-w_1-w_2)v,\quad 1/v\sim\Gamma(a_v,b_v). \tag{22}$$
A uniform prior on $(w_1,w_2)$ is specified over the simplex $\{w_1,w_2\ge 0,\ w_1+w_2\le 1\}$; a prior on $(g_1,g_2)$ obtains from the change of variables. When covariates come from quite different subpopulations across clusters, we recommend replacing the prior on $\mathbf{u}_i$ in (22) with $\mathbf{u}_i\sim N_k\big(\mathbf{0},\ g_2\sigma^2(\mathbf{Z}_i'\mathbf{Z}_i)^{-1}\big)$. The proposed prior (22) enables easy computation of the approximate amount of variation explained by the random effects (i.e., $w_2$) and the fixed effects (i.e., $w_1$) relative to the total $v$.
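For concreteness, a minimal JAGS sketch of a random-intercept special case of (22) follows; the simplex parameterization via a flat Dirichlet and all data names (`cluster`, `XtX`, `alpha`, etc.) are our illustrative reconstruction, not the authors' supplementary code.

```r
# Hedged JAGS sketch of (22) for a random-intercept model (k = 1, z_ij = 1);
# (w1, w2, 1 - w1 - w2) gets a flat Dirichlet, i.e., uniform on the simplex.
mixed_model <- "
model {
  for (r in 1:N) { y[r] ~ dnorm(inprod(X[r, ], beta[]) + u[cluster[r]], 1 / sigma2) }
  for (i in 1:c) { u[i] ~ dnorm(0, 1 / (w[2] * v)) }           # Var(u_i) = w2 * v
  beta[1:p] ~ dmnorm(beta0[], XtX[, ] * (p / (N * w[1] * v)))  # g1 form of (22)
  w[1:3] ~ ddirch(alpha[])   # pass alpha = c(1, 1, 1) as data
  sigma2 <- w[3] * v
  inv_v ~ dgamma(a_v, b_v)
  v <- 1 / inv_v
}
"
```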
The priors on the fixed and random effect portions of the model are tied together and correlated; this is necessary to conserve the marginal variance a priori. Ref. [41] notes that, although variance components are usually modeled independently in the prior, typically as inverse-gamma, uniform, or half-Cauchy, they are “linked as they are components of the total variation in the response…” and suggests modeling them jointly as we do here, though via generalized multivariate gamma or multivariate log-normal distributions.
3.3. Hyper-Prior Elicitation for $(m, v)$ in Mixed Models
Our prior in (22) requires specifying the hyperparameters $m$, $a_v$, and $b_v$ in the hyperprior for $v$. Suppose the historical data are $\mathbf{y}_h=\{y_{hij}\}$, clustered with $M$ total observations. We need to extract sensible hyperparameter values $m$, $a_v$, and $b_v$ so that the hyperprior for $v$ in (22) is close to the true posterior of $v$ based on the historical data. Assume that the historical data can be approximately fit by the one-way random ANOVA: $y_{hij}=m+u_{hi}+\varepsilon_{hij}$, $u_{hi}\stackrel{iid}{\sim}N(0,\sigma_{uh}^2)$, $\varepsilon_{hij}\stackrel{iid}{\sim}N(0,\sigma_h^2)$. Unbiased estimates for $m$, $\sigma_{uh}^2$ and $\sigma_h^2$ can be obtained using restricted maximum likelihood (REML) via the R function lmer in the package lme4 [42], denoted as $\hat{m}_h$, $\hat{\sigma}_{uh}^2$ and $\hat{\sigma}_h^2$. Then, $\hat{v}_h=\hat{\sigma}_{uh}^2+\hat{\sigma}_h^2$ is an unbiased estimate of $v$, and $\hat{\rho}_h=\hat{\sigma}_{uh}^2/\hat{v}_h$ is an estimate of the intraclass correlation coefficient. Based on some simulation trials, we find that the following posterior distributions approximately hold:
$$m\mid v,\mathbf{y}_h\sim N(\hat{m}_h,\ v/M_1),\qquad 1/v\mid\mathbf{y}_h\sim\Gamma\Big(\frac{M_2-1}{2},\ \frac{(M_2-1)\hat{v}_h}{2}\Big), \tag{23}$$
where $M_1$ and $M_2$ can be interpreted as effective sample sizes, i.e., design-effect-type adjustments of $M$ based on $\hat{\rho}_h$, that account for the intraclass dependency. Simple simulations (not shown here) reveal that the posterior distributions in (23) often provide empirical coverage probabilities for $(m,v)$ around the nominal level, and the interval width for $v$ is much narrower than the methods proposed in [43]. Further investigation is needed to understand the reason behind this. Fortunately, we use this approximate posterior only to select a reasonable hyperprior for $v$, not for our actual posterior inference based on the current data.
If one believes that the historical data come from the same population as the current observed response data $\mathbf{y}$, it is reasonable to set $m=\hat{m}_h$, $a_v=(M_2-1)/2$, and $b_v=(M_2-1)\hat{v}_h/2$ in the hyperprior of $v$ in (22). Setting lower values for $a_v$ and $b_v$ puts less weight on the historical data relative to the current data. If historical data are not available, we recommend setting $m=\hat{m}$, $a_v=1/2$, and $b_v=\hat{v}/2$, where $\hat{m}$ and $\hat{v}$ are the REML estimates of $(m,v)$ based on the current response data $\mathbf{y}$.
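A sketch of this REML-based elicitation using lme4 follows; `hist_data` (with columns `y` and `cluster`) and the design-effect style effective size are illustrative assumptions consistent with the description above, not the authors' code.

```r
library(lme4)

# Hedged sketch: REML-based hyperparameters for v from historical clustered data.
fit   <- lmer(y ~ 1 + (1 | cluster), data = hist_data, REML = TRUE)
m_h   <- unname(fixef(fit)[1])
vc    <- as.data.frame(VarCorr(fit))
v_h   <- sum(vc$vcov)                   # sigma_uh^2 + sigma_h^2
rho_h <- vc$vcov[1] / v_h               # intraclass correlation estimate
M     <- nrow(hist_data)
Mbar  <- M / length(unique(hist_data$cluster))
M_eff <- M / (1 + (Mbar - 1) * rho_h)   # assumed design-effect adjustment
hyper <- list(m = m_h, a_v = (M_eff - 1) / 2, b_v = (M_eff - 1) * v_h / 2)
```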
For the random effects one-way ANOVA model (16), the prior (22) reduces to $\mu\sim N(m,\ w_1v)$, $u_i\stackrel{iid}{\sim}N(0,\sigma_u^2)$ with $\sigma_u^2=w_2v$, and $\sigma^2=(1-w_1-w_2)v$. In addition, the prior information of $\mathrm{Var}(y_{ij})=v$ indicates $\sigma_u^2+\sigma^2\le v$, which further leads to a uniform prior on $(\sigma_u^2,\sigma^2)$ over the triangle $\{\sigma_u^2,\sigma^2>0:\ \sigma_u^2+\sigma^2<v\}$. Therefore, it is easy to show that the prior (22) for the random effects one-way ANOVA model finally reduces to
$$\mu\mid\sigma_u^2,\sigma^2,v\sim N\big(m,\ v-\sigma_u^2-\sigma^2\big),\quad (\sigma_u^2,\sigma^2)\mid v\sim U\{\sigma_u^2+\sigma^2<v\},\quad 1/v\sim\Gamma(a_v,b_v). \tag{24}$$
If we set $a_v=0$ and $b_v=0$, i.e., $\pi(v)\propto 1/v$, the prior in (24) is equivalent to
$$\pi(\sigma_u^2,\sigma^2)\propto\frac{1}{(\sigma_u^2+\sigma^2)^2}, \tag{25}$$
which is exactly the shrinkage prior considered in [35]. That is, our prior naturally reduces to a well-known shrinkage prior for the random one-way ANOVA when there is no history information available for $(m,v)$.
3.4. Rats Data Example
In the rats data example from the WinBUGS manual [44], 30 rats’ weights (in kg) were measured weekly for five weeks. Let $y_{ij}$ be the weight of the $i$th rat measured in week $j$ and $x_{ij}$ be the corresponding age, $i=1,\dots,30$, $j=1,\dots,5$. Consider the mixed model (18) with $\mathbf{x}_{ij}=(1,x_{ij})'$ and $\mathbf{z}_{ij}=1$, where $u_i\stackrel{iid}{\sim}N(0,\sigma_u^2)$ is a rat-specific random intercept. Typically, vague priors are used, e.g., $\boldsymbol{\beta}\sim N_2(\mathbf{0},1000\,\mathbf{I}_2)$, $1/\sigma^2\sim\Gamma(0.001,0.001)$, $1/\sigma_u^2\sim\Gamma(0.001,0.001)$. The marginal mean and variance for the response under this prior can be estimated via Monte Carlo (MC) by simulating the parameters from the prior and then the responses from the model, yielding 1000 datasets. The average prior sample mean (across the 1000 datasets) is far from any plausible rat weight, and the average prior sample variance is ∞ (as reported in R); these substantially differ from the observed sample mean and variance. In contrast, a similar simulation under our proposed new g-prior in (22) with a uniform prior on $(w_1,w_2)$, $m=\hat{m}$, $a_v=1/2$, and $b_v=\hat{v}/2$ yields an average sample mean of 0.249 with an MC standard deviation of 0.144 and an average sample variance of 0.024 with an MC standard deviation of 0.120. That is, the inference under our prior focuses on a much smaller set of potential models around those that could have conceivably generated the observed marginal data. The posterior estimates for $\beta_1$, $\beta_2$, and $\sigma^2$ under our proposed new g-prior are 0.1073 (0.0051), 0.0062 (0.0002), and 0.00004 (0.000006), respectively, where the values in parentheses are posterior standard deviations. The commonly used vague priors specified above yield similar estimates but with much higher posterior standard deviations: 0.1067 (0.0059), 0.0061 (0.0049), and 0.00006 (0.000009).
3.5. Model Fitting via Block MCMC
Although the previous section portions variability due to $\boldsymbol{\beta}$ and $\mathbf{u}=(\mathbf{u}_1',\dots,\mathbf{u}_c')'$ separately, ref. [45] note that updating $\boldsymbol{\theta}=(\boldsymbol{\beta}',\mathbf{u}')'$ in one large block virtually eliminates problematic MCMC mixing, as $\boldsymbol{\beta}$ and $\mathbf{u}$ are often highly correlated in the posterior. An optimal approach considers the full model (18) jointly,
$$\mathbf{y}=\mathbf{W}\boldsymbol{\theta}+\boldsymbol{\varepsilon},\qquad \boldsymbol{\varepsilon}\sim N_N(\mathbf{0},\sigma^2\mathbf{I}_N), \tag{26}$$
where $\mathbf{W}=[\mathbf{X}\ \mathbf{Z}]$ with $\mathbf{Z}$ the block-diagonal random-effects design matrix. Under the prior (22), the full conditional for $\boldsymbol{\theta}$ is
$$\boldsymbol{\theta}\mid\sigma^2,w_1,w_2,v,\mathbf{y}\sim N\big(\mathbf{V}\mathbf{b},\ \mathbf{V}\big), \tag{27}$$
where $\mathbf{V}=\big(\mathbf{W}'\mathbf{W}/\sigma^2+\mathbf{P}\big)^{-1}$, $\mathbf{b}=\mathbf{W}'\mathbf{y}/\sigma^2+\mathbf{P}\boldsymbol{\theta}_0$, $\mathbf{P}$ is the block-diagonal prior precision of $\boldsymbol{\theta}$ implied by (22), and $\boldsymbol{\theta}_0=\big((m,0,\dots,0)',\mathbf{0}'\big)'$.
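A direct R implementation of this blocked draw is short; the sketch below assumes the reconstructed full conditional (27), with `P` and `theta0` standing for the prior precision and prior mean implied by (22).

```r
# Hedged sketch of the blocked update (27): draw theta = (beta', u')' from
# N(V b, V), V = (W'W / sigma2 + P)^{-1}, b = W'y / sigma2 + P %*% theta0.
draw_theta <- function(W, y, sigma2, P, theta0) {
  V_inv <- crossprod(W) / sigma2 + P
  R     <- chol(V_inv)                 # V_inv = R'R, R upper triangular
  b     <- crossprod(W, y) / sigma2 + P %*% theta0
  mu    <- backsolve(R, forwardsolve(t(R), b))
  as.vector(mu + backsolve(R, rnorm(ncol(W))))  # adds N(0, V) noise
}
```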
The full conditionals for $v$ and $(g_1,g_2)$ (equivalently, $(w_1,w_2)$) do not correspond to any known distributions, so an adaptive Metropolis algorithm [46] can be used.
4. Simulation Study
In all simulation studies, for each MCMC run, 5000 scans were thinned from 20,000 after a burn-in period of 2000 iterations; convergence diagnostics deemed this more than adequate. We use posterior means as the point estimates for all parameters. R functions to implement the linear and linear mixed models using the proposed priors are provided in the Supplementary Materials.
4.1. Simulation I: Fixed Effects Model
Simulations were carried out to evaluate the proposed methodology and compare it to the benchmark prior, the local empirical Bayes (EB) approach, and the hyper-g prior considered in [12]. Data were generated from the Gaussian regression model $y_i=\mathbf{x}_i'\boldsymbol{\beta}+\varepsilon_i$,
where $\varepsilon_i\stackrel{iid}{\sim}N(0,\sigma^2)$. Let $\mathbf{X}$ be the usual centered design matrix. The benchmark and EB methods consider the priors $\pi(\beta_1,\sigma^2)\propto 1/\sigma^2$ and $\boldsymbol{\beta}\mid g,\sigma^2\sim N_p\big(\mathbf{0},\ g\sigma^2(\mathbf{X}'\mathbf{X})^{-1}\big)$,
where $g=\max(n,p^2)$ is set for the benchmark method and the plug-in $\hat{g}=\max\{F-1,0\}$, with $F=\frac{R^2/p}{(1-R^2)/(n-1-p)}$, is used for the EB approach, where $R^2$ is the R-squared value under the considered model.
The hyper-g prior is given by $\pi(g)=\frac{a-2}{2}(1+g)^{-a/2}$, $g>0$,
where we set $a=3$ in all simulations, which is the same as the setting used in [12].
4.1.1. Parameter Estimation
First, we evaluate the performance of the various methods for estimating model parameters. We generated $\mathbf{x}_i\stackrel{iid}{\sim}N_{p-1}(\mathbf{0},\boldsymbol{\Sigma}_x)$, where $\boldsymbol{\Sigma}_x$ has diagonal entries equal to 1 and a common off-diagonal entry $\rho$. The settings for $n$, $p$, $\boldsymbol{\beta}$, $\sigma^2$, and $\rho$ were chosen to yield R-squared values around 0.26. The true marginal mean and variance of $y_i$ are given by $m=\beta_1$ and $v=\boldsymbol{\beta}_{-1}'\boldsymbol{\Sigma}_x\boldsymbol{\beta}_{-1}+\sigma^2$, respectively, where $\boldsymbol{\beta}_{-1}$ excludes the intercept. We implemented our proposed prior in (5) with $a=b=1$.
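The covariate generation can be sketched as follows, with all arguments standing in for the elided simulation settings; this is an illustrative reconstruction, not the authors' simulation code.

```r
# Hedged sketch of the Simulation I design: equicorrelated Gaussian
# covariates plus an intercept; n, p, beta, sigma, rho are placeholders.
gen_data <- function(n, p, beta, sigma, rho) {
  Sigma_x <- matrix(rho, p - 1, p - 1)
  diag(Sigma_x) <- 1
  X <- cbind(1, MASS::mvrnorm(n, rep(0, p - 1), Sigma_x))
  list(X = X, y = as.vector(X %*% beta + rnorm(n, 0, sigma)))
}
```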
To evaluate how historical data can improve the parameter estimation accuracy, we additionally generated $y_h$'s of size $M$ in the same way as generating the $y_i$'s and considered three settings of the hyperprior for $v$: (V1) new-true, when infinite historical data are available, with $m$ set to the true marginal mean and $a_v,b_v$ chosen so that $v$ is fixed at the truth; (V2) new-hist, when a small set of historical data is available, with $m=\bar{y}_h$, $a_v=(M-1)/2$, and $b_v=(M-1)s_h^2/2$; (V3) new-none, when no historical data are available, with $m=\bar{y}$, $a_v=1/2$, and $b_v=s_y^2/2$.
Let $\theta$ be a generic parameter and $\hat{\theta}_r$ its estimate in the $r$th replicate. The mean squared error (MSE) for $\theta$ is defined as $\frac{1}{R}\sum_{r=1}^R(\hat{\theta}_r-\theta)^2$, and the bias is defined as $\frac{1}{R}\sum_{r=1}^R(\hat{\theta}_r-\theta)$, with $R=500$. Table 1 reports the average bias and MSE values and coverage probabilities with interval widths across the 500 Monte Carlo (MC) replicates. At the smaller sample size, our method without using history information (new-none) performs very similarly to the other three competing methods. When a little history information is available, our prior (new-hist) has significantly lower MSE values and reduced interval widths for estimating the variance parameters without compromising the coverage probabilities; the performance for estimating the $\beta_j$'s is also slightly better than the other approaches. When the true information on $(m,v)$ is available, the estimation performance under our prior (new-true) improves further compared with new-hist. Regarding the estimation bias, we can see that all informative priors lead to biased estimates, with a general trend that higher informativeness of the prior leads to larger biases. As the sample size increases, although our methods (new-hist and new-true) still outperform the other priors, the differences become smaller.
Table 1.
Simulation I: Average biases (MSEs) and coverage probabilities (interval widths) across 500 MC replicates in the simulation study for parameter estimation. Here, new-true, new-hist, and new-none correspond to the three hyperprior versions (V1), (V2), and (V3), respectively.
4.1.2. Variable Selection
For a given $p$, we generated $\mathbf{x}_i$ as follows: (i) simulate $\mathbf{x}_i\sim N_{p-1}(\mathbf{0},\boldsymbol{\Sigma}_x)$; (ii) dichotomize the even-numbered elements of $\mathbf{x}_i$, setting them to 0 if less than 1 and to 1 if greater than 1. We set the first $l$ regression coefficients (including the intercept) to non-zero values and the remaining coefficients to zero. That is, among the $p$ coefficients (including the intercept), there are $l$ of them having non-zero values. For each given $l$, we generated the responses from the Gaussian model using the corresponding design matrix. These settings yield R-squared values ranging from 0.11 to 0.30 as $l$ varies. For our method, we additionally generated $y_h$'s of size $M$ in the same way as generating the $y_i$'s and considered the same three versions of the hyperprior for $v$ as in Section 4.1.1: (V1) new-true; (V2) new-hist; (V3) new-none. To compare our methods to the benchmark, EB, and hyper-g approaches, we considered the following three cases under each prior: (C1) implement the variable selection procedure and obtain the OLS estimate under the selected model; (C2) obtain the Bayesian estimate under the true model; (C3) obtain the Bayesian estimate under the full model. Here, (C1) is used to compare the pure variable selection performance, (C2) is used to compare the predictive performance under the true model, and (C3) is used to compare the overall predictive performance when the model contains noisy covariates. For all Bayesian methods, posterior means were used for estimating $\boldsymbol{\beta}$.
Table 2 reports the average values across 200 MC replicates. When OLS is used for fitting the selected model, the three versions of our method perform very similarly, indicating that the history information on $(m,v)$ has little influence on variable selection accuracy. Compared with the EB and hyper-g priors, our methods perform slightly better when the true model size is small and very similarly otherwise. The benchmark prior works much better when the true model size is less than or equal to 7, but performs much worse as the true model size increases. The reason is that the benchmark prior sets $g=\max(n,p^2)$, which leads to a flatter prior on $\boldsymbol{\beta}$. When Bayesian estimation is used under the true or full model and there is some history information available on $(m,v)$, our methods (both new-hist and new-true) outperform the other methods, and the benchmark prior is the worst due to its large choice of $g$. The only case where new-hist and new-true do not perform better is when the full model is fit but the null model is the truth under (C3), for which more historical data (see new-true) will help. Even when we do not have any history information on $(m,v)$, the results under (C2) and (C3) show that our method performs slightly better than the other methods, especially when the true model size is large.
Table 2.
Simulation I: Average values across 200 MC replicates in the simulation study for variable selection.
4.2. Simulation II: Random One-Way ANOVA
Data are generated from the random one-way ANOVA model
$$y_{ij}=\mu+u_i+\varepsilon_{ij},\quad u_i\stackrel{iid}{\sim}N(0,\sigma_u^2),\quad \varepsilon_{ij}\stackrel{iid}{\sim}N(0,\sigma^2),$$
where $\mu$, $\sigma_u^2$, and $\sigma^2$ are set to fixed true values. In addition, we consider several numbers of clusters and cluster sizes $n_i\sim DU(a,b)$, where $DU(a,b)$ represents a discrete uniform distribution with support being all integers in $[a,b]$. The true marginal mean and variance of $y_{ij}$ are given by $m=\mu$ and $v=\sigma_u^2+\sigma^2$, respectively. We implement our proposed default prior in (22) with a uniform prior on $(w_1,w_2)$ and the hyper-prior settings recommended in Section 3.3. Then, $\sigma_u^2$ can be estimated from the posterior samples of $g$.
We additionally generate historical data $\mathbf{y}_h$ in the same way as generating the $y_{ij}$'s and consider three versions of the hyperprior for $v$: (V1) new-true, with $m$ and $v$ fixed at the truth; (V2) new-hist, with $m$, $a_v$, and $b_v$ elicited from the historical data; (V3) new-none, with $m$, $a_v$, and $b_v$ based on REML estimates from the current data; see Section 3.3 for the definitions of these hyper-parameters. We also compare our methods to the half-Cauchy prior [33], the $\Gamma(\epsilon,\epsilon)$ prior [47], the uniform prior [47], and the shrinkage prior [35]. For these alternative priors, typical vague priors are used on $\mu$ and $\sigma^2$.
Table 3 reports the average bias and MSE values and coverage probabilities with interval widths across 500 MC replicates, where the coverage probabilities for the $u_i$'s are defined as the average coverage across all $u_i$'s in each replicate. Our approach with new-hist or new-true has significantly lower MSE values and narrower interval widths for estimating all model parameters, while maintaining coverage probabilities around the nominal level, compared with the other methods in all cases. Even when history information on $(m,v)$ is not available, our method with new-none still has much lower MSE values for estimating the variance components and narrower interval widths than all the other priors. Note that the induced prior under new-true essentially assumes that the prior variance of $v$ is zero, so we did not report the coverage probability for $v$ there.
Table 3.
Simulation II: Average biases (MSEs) and coverage probabilities (interval widths) across 500 MC replicates in the simulation study for the random one-way ANOVA model.
4.3. Simulation III: Random Intercept Model
Data were generated from the random intercept model
$$y_{ij}=\mathbf{x}_{ij}'\boldsymbol{\beta}+u_i+\varepsilon_{ij},$$
where $u_i\stackrel{iid}{\sim}N(0,\sigma_u^2)$ and $\varepsilon_{ij}\stackrel{iid}{\sim}N(0,\sigma^2)$, with the covariate distribution and the true values of $\boldsymbol{\beta}$, $\sigma_u^2$, and $\sigma^2$ fixed across replicates. In addition, we consider the same numbers of clusters and cluster sizes as in Section 4.2. The true marginal mean and variance of $y_{ij}$ follow in closed form from these settings. The prior settings are the same as those used in Section 4.2.
Table 4 reports the average bias and MSE values and coverage probabilities with interval widths across 500 MC replicates. Our approach with new-hist or new-true has significantly lower MSE values and narrower interval widths for estimating all model parameters, while maintaining coverage probabilities around the nominal level, compared with the other methods in all cases. Even when history information on $(m,v)$ is not available, our method with new-none still has much lower MSE values for estimating the variance components, with slightly narrower interval widths than all the other priors.
Table 4.
Simulation III: Average biases (MSEs) and coverage probabilities (interval widths) across 500 MC replicates in the simulation study for the random intercept model.
4.4. Simulation IV: Linear Mixed Model
Data were generated from the mixed model
$$y_{ij}=\mathbf{x}_{ij}'\boldsymbol{\beta}+\mathbf{z}_{ij}'\mathbf{u}_i+\varepsilon_{ij},$$
where $\mathbf{u}_i\stackrel{iid}{\sim}N_k(\mathbf{0},\boldsymbol{\Sigma}_u)$ and $\varepsilon_{ij}\stackrel{iid}{\sim}N(0,\sigma^2)$, with the numbers of clusters, the cluster sizes, the covariate distributions, and the true values of $\boldsymbol{\beta}$, $\boldsymbol{\Sigma}_u$, and $\sigma^2$ fixed across replicates. Under this setting, the total random-effect variance has a known value, and three settings of this variance are considered, including 1.
We implement our proposed default prior in (22), which assumes a common covariance across clusters, and a more general version of it with cluster-specific covariances based on $\mathbf{Z}_i'\mathbf{Z}_i$ (denoted as new-i below). Regarding the hyperprior of $v$, we only consider new-hist and new-none as defined in Section 4.2, considering that the true marginal mean and variance of $y_{ij}$ are not available in closed form. We then compare our methods to the prior proposed in [38], an inverted-Wishart prior on $\boldsymbol{\Sigma}_u$ centered at a data-driven estimate.
Table 5 reports the average bias and MSE values and coverage probabilities with interval widths across 500 MC replicates, where the coverage probabilities for the $\mathbf{u}_i$'s are defined as the average coverage across all elements of the $\mathbf{u}_i$'s for each setting. Comparing our default prior in (22) with its more general version new-i, the new-i method has lower MSE values for estimating most model parameters and is markedly better for estimating the random-effects covariance parameters. Comparing our default prior (22) to the prior in [38] (both assuming a homogeneous covariance for the $\mathbf{u}_i$'s), our prior has much lower MSE values and narrower interval widths for estimating the random effects while maintaining coverage probabilities around the nominal level. When the more general prior new-i is used, our method consistently performs better than [38] in estimating all model parameters.
Table 5.
Simulation IV: Average biases (MSEs) and coverage probabilities (interval widths) across 500 MC replicates in the simulation study for the general mixed model. Here, the suffix -i refers to the prior in (22) with cluster-specific random-effects covariances; KN refers to the prior introduced in [38].
5. Discussion
Prior elicitation plays an important role in Bayesian inference. We have proposed a novel, yet remarkably simple class of informative g-priors for linear mixed models elicited from existing information on the marginal distribution of the responses. The prior is first developed for the linear regression model (2), assuming that a subject-matter expert has information on the marginal distribution $y_i\sim(m,v)$. A simple, intuitive interpretation of the prior is obtained: when $\sigma^2=v$ the model explains nothing (i.e., reduces to the null model), and when $\sigma^2\to 0$ the model explains all variability in the responses; furthermore, the use of a generalized beta prior on $\sigma^2$ given $v$ allows one to specify the prior information on the amount of variation explained by the considered model. The proposed prior also naturally reduces to a modified version of the hyper-$g/n$ prior introduced in [12] when there is no history information available for $(m,v)$. Under the Gaussian linear regression models with the proposed g-prior, Bayes factors for comparing all possible submodels can be easily computed for the purpose of variable selection and do not encounter the information paradox commonly seen in Zellner’s g-priors with fixed $g$. Our approach is further extended for use in linear mixed models. Interesting relationships between the proposed g-priors and some other commonly used priors in mixed models are discussed. For example, under the random effects one-way ANOVA, the proposed prior (22) with a reference hyperprior on $v$ reduces exactly to the shrinkage prior of [35]. Posterior sampling for all considered models can be obtained using JAGS via R. Finally, extensive simulation studies reveal that the proposed g-prior outperforms almost all other approaches under consideration when some history information on $(m,v)$ is available. Even without historical data, better performance of the proposed new g-prior over other priors is still seen in many settings. Interesting generalizations of the proposed idea may include additive penalized B-spline regression, variable selection in linear mixed models, and prior elicitation for generalized linear mixed models. Recently, ref. [48] proposed two informative priors for the between-cluster slope in a multilevel latent covariate model. However, the extension of their methods to multiple covariates has not been investigated. It would be interesting to extend the proposed g-prior here to general multilevel latent covariate models.
Supplementary Materials
The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/stats6010011/s1, R functions to fit the linear and linear mixed models.
Author Contributions
Conceptualization, Y.-F.C., H.Z. and T.H.; methodology, Y.-F.C., H.Z. and T.H.; software, Y.-F.C., H.Z. and T.H.; validation, Y.-F.C., H.Z. and T.H.; formal analysis, Y.-F.C., H.Z. and T.H.; investigation, Y.-F.C., H.Z. and T.H.; resources, Y.-F.C., H.Z., T.H. and T.L.; data curation, Y.-F.C., H.Z. and T.H.; writing—original draft preparation, Y.-F.C., H.Z. and T.H.; writing—review and editing, Y.-F.C., H.Z., T.H. and T.L.; visualization, Y.-F.C., H.Z. and T.H.; supervision, H.Z., T.H. and T.L.; project administration, H.Z. and T.H.; funding acquisition, H.Z., T.H. and T.L. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Not applicable.
Acknowledgments
The authors wish to thank the Editor and four anonymous referees for their insightful comments and suggestions that greatly improved the manuscript.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Sun, C.Q.; Prajna, N.V.; Krishnan, T.; Mascarenhas, J.; Rajaraman, R.; Srinivasan, M.; Raghavan, A.; O’Brien, K.S.; Ray, K.J.; McLeod, S.D.; et al. Expert Prior Elicitation and Bayesian Analysis of the Mycotic Ulcer Treatment Trial I. Investig. Ophthalmol. Vis. Sci. 2013, 54, 4167–4173.
- Hampson, L.V.; Whitehead, J.; Eleftheriou, D.; Brogan, P. Bayesian methods for the design and interpretation of clinical trials in very rare diseases. Stat. Med. 2014, 33, 4186–4201.
- Zhang, G.; Thai, V.V. Expert elicitation and Bayesian Network modeling for shipping accidents: A literature review. Saf. Sci. 2016, 87, 53–62.
- Food and Drug Administration. Guidance for the Use of Bayesian Statistics in Medical Device Clinical Trials; Guidance for Industry and FDA Staff; 2010; pp. 1–50.
- O’Hagan, A. Eliciting expert beliefs in substantial practical applications. J. R. Stat. Soc. Ser. D 1998, 47, 21–35.
- Kinnersley, N.; Day, S. Structured approach to the elicitation of expert beliefs for a Bayesian-designed clinical trial: A case study. Pharm. Stat. 2013, 12, 104–113.
- Dallow, N.; Best, N.; Montague, T.H. Better decision making in drug development through adoption of formal prior elicitation. Pharm. Stat. 2018, 17, 301–316.
- Hartmann, M.; Agiashvili, G.; Bürkner, P.; Klami, A. Flexible Prior Elicitation via the Prior Predictive Distribution. In Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), Virtual, 3–6 August 2020; Peters, J., Sontag, D., Eds.; PMLR: London, UK, 2020; Volume 124, pp. 1129–1138.
- Zellner, A. Applications of Bayesian Analysis in Econometrics. Statistician 1983, 32, 23–34.
- Zellner, A. On Assessing Prior Distributions and Bayesian Regression Analysis with g-Prior Distributions. In Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti; North-Holland/Elsevier: Amsterdam, The Netherlands, 1986; pp. 233–243.
- Li, Y.; Clyde, M.A. Mixtures of g-priors in generalized linear models. J. Am. Stat. Assoc. 2018, 113, 1828–1845.
- Liang, F.; Paulo, R.; Molina, G.; Clyde, M.A.; Berger, J.O. Mixtures of g priors for Bayesian variable selection. J. Am. Stat. Assoc. 2008, 103, 410–423.
- Bedrick, E.J.; Christensen, R.; Johnson, W. A New Perspective on Priors for Generalized Linear Models. J. Am. Stat. Assoc. 1996, 91, 1450–1460.
- Hosack, G.R.; Hayes, K.R.; Barry, S.C. Prior elicitation for Bayesian generalised linear models with application to risk control option assessment. Reliab. Eng. Syst. Saf. 2017, 167, 351–361.
- Ibrahim, J.G.; Chen, M.H. Power prior distributions for regression models. Stat. Sci. 2000, 15, 46–60.
- Ibrahim, J.G.; Chen, M.H.; Sinha, D. On optimality properties of the power prior. J. Am. Stat. Assoc. 2003, 98, 204–213.
- Hobbs, B.P.; Carlin, B.P.; Mandrekar, S.J.; Sargent, D.J. Hierarchical commensurate and power prior models for adaptive incorporation of historical information in clinical trials. Biometrics 2011, 67, 1047–1056.
- Ibrahim, J.G.; Chen, M.H.; Gwon, Y.; Chen, F. The power prior: Theory and applications. Stat. Med. 2015, 34, 3724–3749.
- Agliari, A.; Parisetti, C.C. A-g Reference Informative Prior: A Note on Zellner’s g-Prior. J. R. Stat. Soc. Ser. D 1988, 37, 271–275.
- van Zwet, E. A default prior for regression coefficients. Stat. Methods Med. Res. 2019, 28, 3799–3807.
- Plummer, M. JAGS: A Program for Analysis of Bayesian Graphical Models Using Gibbs Sampling. In Proceedings of the 3rd International Workshop on Distributed Statistical Computing (DSC 2003), Vienna, Austria, 20–22 March 2003; Hornik, K., Leisch, F., Zeileis, A., Eds.
- Su, Y.S.; Yajima, M. R2jags: Using R to Run ‘JAGS’; R Package Version 0.5-7; R Foundation for Statistical Computing: Vienna, Austria, 2015.
- Hanson, T.E.; Branscum, A.J.; Johnson, W.O. Informative g-Priors for Logistic Regression. Bayesian Anal. 2014, 9, 597–612.
- Lally, N.R. The Informative g-Prior vs. Common Reference Priors for Binomial Regression with an Application to Hurricane Electrical Utility Asset Damage Prediction. Master’s Thesis, University of Connecticut, Mansfield, CT, USA, 2015.
- Carlin, B.P.; Gelfand, A.E. An iterative Monte Carlo method for nonconjugate Bayesian analysis. Stat. Comput. 1991, 1, 119–128.
- Liu, C.; Martin, R.; Syring, N. Efficient simulation from a gamma distribution with small shape parameter. Comput. Stat. 2017, 32, 1767–1775.
- Gabry, J.; Simpson, D.; Vehtari, A.; Betancourt, M.; Gelman, A. Visualization in Bayesian workflow. J. R. Stat. Soc. Ser. A 2019, 182, 389–402.
- Gelman, A.; Simpson, D.; Betancourt, M. The Prior Can Often Only Be Understood in the Context of the Likelihood. Entropy 2017, 19, 555.
- Wesner, J.S.; Pomeranz, J.P.F. Choosing priors in Bayesian ecological models by simulating from the prior predictive distribution. Ecosphere 2021, 12, e03739.
- R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2021.
- Murphy, K.P. Conjugate Bayesian Analysis of the Gaussian Distribution; Technical Report; University of British Columbia: Vancouver, BC, Canada, 2007.
- Berger, J.O.; Pericchi, L.R.; Ghosh, J.; Samanta, T.; De Santis, F.; Berger, J.; Pericchi, L. Objective Bayesian methods for model selection: Introduction and comparison. Lect. Notes Monogr. Ser. 2001, 38, 135–207.
- Gelman, A. Prior distributions for variance parameters in hierarchical models. Bayesian Anal. 2006, 1, 515–533.
- Box, G.E.P.; Tiao, G.C. Bayesian Inference in Statistical Analysis; Addison-Wesley: Reading, MA, USA, 1973.
- Daniels, M.J. A prior for the variance in hierarchical models. Can. J. Stat. 1999, 27, 567–578.
- Wang, M. Mixtures of g-priors for analysis of variance models with a diverging number of parameters. Bayesian Anal. 2017, 12, 511–532.
- Lin, P.E. Some characterizations of the multivariate t distribution. J. Multivar. Anal. 1972, 2, 339–344.
- Kass, R.E.; Natarajan, R. A default conjugate prior for variance components in generalized linear mixed models (Comment on article by Browne and Draper). Bayesian Anal. 2006, 1, 535–542.
- Natarajan, R.; Kass, R.E. Reference Bayesian methods for generalized linear mixed models. J. Am. Stat. Assoc. 2000, 95, 227–237.
- Huang, A.; Wand, M.P. Simple marginally noninformative prior distributions for covariance matrices. Bayesian Anal. 2013, 8, 439–452.
- Demirhan, H.; Kalaylioglu, Z. Joint prior distributions for variance parameters in Bayesian analysis of normal hierarchical models. J. Multivar. Anal. 2015, 135, 163–174.
- Bates, D.; Mächler, M.; Bolker, B.; Walker, S. Fitting Linear Mixed-Effects Models Using lme4. J. Stat. Softw. 2015, 67, 1–48.
- Burdick, R.K.; Borror, C.M.; Montgomery, D.C. Design and Analysis of Gauge R and R Studies: Making Decisions with Confidence Intervals in Random and Mixed ANOVA Models; SIAM: Philadelphia, PA, USA, 2005.
- Spiegelhalter, D.; Thomas, A.; Best, N.; Lunn, D. WinBUGS User Manual, Version 1.4; Medical Research Council Biostatistics Unit: Cambridge, UK, 2003.
- Sargent, D.J.; Hodges, J.S.; Carlin, B.P. Structured Markov Chain Monte Carlo. J. Comput. Graph. Stat. 2000, 9, 217–234.
- Haario, H.; Saksman, E.; Tamminen, J. An Adaptive Metropolis Algorithm. Bernoulli 2001, 7, 223–242.
- Browne, W.J.; Draper, D. A comparison of Bayesian and likelihood-based methods for fitting multilevel models. Bayesian Anal. 2006, 1, 473–514.
- Zitzmann, S.; Helm, C.; Hecht, M. Prior specification for more stable Bayesian estimation of multilevel latent variable models in small samples: A comparative investigation of two different approaches. Front. Psychol. 2021, 11, 611267.