Posterior Averaging Information Criterion

We propose a new model selection method, named the posterior averaging information criterion, for Bayesian model assessment to minimize the risk of predicting independent future observations. The theoretical foundation is built on the Kullback–Leibler divergence to quantify the similarity between the proposed candidate model and the underlying true model. From a Bayesian perspective, our method evaluates the candidate models over the entire posterior distribution in terms of predicting a future independent observation. Without assuming that the true distribution is contained in the candidate models, the new criterion is developed by correcting the asymptotic bias of the posterior mean of the in-sample log-likelihood against out-of-sample log-likelihood, and can be generally applied even for Bayesian models with degenerate non-informative priors. Simulations in both normal and binomial settings demonstrate superior small sample performance.


Introduction
Model selection plays a key role in applied statistical practice. A clearly defined model selection criterion or score usually lies at the heart of any statistical model selection procedure, and facilitates the comparison of competing models through the assignment of some sort of preference or ranking to the alternatives. Standard criteria include the adjusted R² (Wherry, 1931), the Akaike information criterion (AIC; Akaike, 1973), minimum description length (MDL; Rissanen, 1978) and the Schwarz information criterion (SIC; Schwarz, 1978), to name but a few.
To choose a proper criterion for a statistical data analysis project, it is essential to identify the ultimate goal of modeling. Geisser and Eddy (1979) challenged research workers with two fundamental questions that should be asked in advance of any model selection procedure: 1. Which of the models best explains a given set of data? 2. Which of the models yields the best predictions for future observations from the same process that generated the given set of data?
The former question, which concerns the accuracy of the model in describing the current data, has been studied empirically for many years. It represents the explanatory perspective. The latter question, which represents the predictive perspective, concerns the accuracy of the model in predicting future data and has drawn substantial attention in recent decades. If an infinitely large quantity of data were available, the predictive and explanatory perspectives might not differ significantly. However, with the limited number of observations we encounter in practice, it is challenging for predictive model selection methods to achieve an optimal balance between goodness of fit and parsimony.
A substantial group of predictive model selection criteria have been proposed based on the Kullback-Leibler information divergence (Kullback and Leibler, 1951), an objective measure of the overall closeness between a probability distribution and the underlying true model. On both theoretical and applied fronts, the Kullback-Leibler divergence in model selection has drawn a huge amount of attention, and a large body of related literature now exists for both frequentist and Bayesian inference.
Bayesian approaches to statistical inference have specific concerns regarding the interpretation of parameters and models.However, most of the Kullback-Leibler based Bayesian criteria follow essentially the frequentist paradigm insofar as they select a model using plug-in estimators of the parameters.
Starting from the Bayesian predictive information criterion (BPIC; Ando, 2007), model selection criteria have been developed over the entire posterior distribution. Nevertheless, BPIC has a number of limitations, particularly with asymmetric posterior distributions. Furthermore, BPIC is undefined under improper prior distributions, while the expected penalized loss assumes that the true model is contained in the approximating family, which limits its use in practice.
To explain the intuition of the proposed Bayesian criterion, in Section 2 we review the Kullback-Leibler divergence, its application and development in frequentist statistics, and its adaptation to Bayesian modeling based on plug-in parameter estimation. In Section 3, major attention is given to the Kullback-Leibler based predictive criterion for models evaluated by averaging over the posterior distributions of the parameters. A generally applicable method, the posterior averaging information criterion (PAIC), is proposed for comparing different Bayesian statistical models under regularity conditions. Our criterion is developed by correcting the asymptotic bias of using the posterior mean of the log-likelihood as an estimator of its expected log-likelihood, and we prove that the asymptotic property holds even when the candidate models are misspecified. In Section 4 we present numerical studies in both normal and binomial cases to investigate its performance with small sample sizes. We conclude with a few summary remarks and discussions in Section 5.
2 Kullback-Leibler divergence and model selection

Kullback and Leibler (1951) derived an information measure to assess the directed 'distance' between any two models. If we assume that f(ỹ) and g(ỹ) respectively represent the probability density functions of the 'true model' and the 'approximate model' on the same measurable space, the Kullback-Leibler divergence is defined by

I(f, g) = E_ỹ[log{f(ỹ)/g(ỹ)}] = ∫ f(ỹ) log{f(ỹ)/g(ỹ)} dỹ,

which is always non-negative, reaching the minimum value of 0 when f is the same as g almost surely.
It is interpreted as the 'information' lost when g is used to approximate f. Namely, the smaller the value of I(f, g), the closer we consider the model g to be to the true distribution f.
Only the second term of I(f, g), E_ỹ[log g(ỹ)], is relevant in practice for comparing different candidate models without full knowledge of the true distribution. This is because the first term, E_ỹ[log f(ỹ)], is a constant that depends only on the unknown true distribution f and can be neglected in model comparison for given data.
Let y = (y_1, y_2, ..., y_n) be n independent observations in the data, and let ỹ, an unknown but potentially observable quantity, represent a future independent observation with the same probability density function f(ỹ). Consider an approximate model m with density g_m(ỹ|θ_m) among a list of potential models. For notational convenience, we omit the model index m when there is no ambiguity. The true model f is referred to as the unknown data-generating mechanism, which is not necessarily encompassed in the approximate model family.
As n → ∞, the average of the log-likelihood, n^{-1} Σ_i log g(y_i|θ), tends to E_ỹ[log g(ỹ|θ)] by the law of large numbers, which suggests how we can estimate the second term of I(f, g).
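This convergence is easy to check numerically. The following sketch (illustrative distributions chosen by us, not from the paper) compares the empirical average of log g(y_i|θ) with its exact expectation when f = N(0, 1) and g(·|θ) = N(θ, 1):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.5  # a fixed parameter value for the approximate model g(.|theta) = N(theta, 1)

def log_g(y, theta):
    # log density of N(theta, 1)
    return -0.5 * np.log(2 * np.pi) - 0.5 * (y - theta) ** 2

# Exact E_y[log g(y|theta)] under the true model f = N(0, 1):
# -log(2*pi)/2 - E[(y - theta)^2]/2 = -log(2*pi)/2 - (1 + theta**2)/2
exact = -0.5 * np.log(2 * np.pi) - 0.5 * (1 + theta ** 2)

y = rng.normal(0.0, 1.0, size=100_000)   # n draws from the true model
empirical = log_g(y, theta).mean()       # n^{-1} sum_i log g(y_i|theta)
print(empirical, exact)                  # close for large n
```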
Model selection based on the Kullback-Leibler divergence is straightforward when all the operating models are fixed probability distributions, i.e., g(ỹ|θ) = g(ỹ): the model with the largest empirical log-likelihood Σ_i log g(y_i) is favored. However, when the distribution family g(ỹ|θ) contains some unknown parameters θ, the model must first be fitted so that we know what values the free parameters will probably take given the data. The resulting log-likelihood is then not optimal for predictive evaluation, because the data are used twice, in both model fitting and model evaluation. For desirable out-of-sample predictive performance, a common idea is to identify a bias correction term to rectify the over-estimation bias of the in-sample estimate.
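The double use of the data can be made concrete with a small sketch (our own toy setup, not from the paper): fitting θ by maximum likelihood and evaluating the log-likelihood on the same data overestimates the out-of-sample expected log-likelihood by roughly K/n, the AIC-type correction.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 20, 5000
gaps = []
for _ in range(reps):
    y = rng.normal(0.0, 1.0, size=n)   # data from the true model N(0, 1)
    mu_hat = y.mean()                  # MLE under the model g(.|mu) = N(mu, 1)
    # in-sample average log-likelihood, evaluated at the fitted parameter
    insample = np.mean(-0.5 * np.log(2 * np.pi) - 0.5 * (y - mu_hat) ** 2)
    # out-of-sample expected log-likelihood at the same fitted parameter
    # (closed form under the true model): -log(2*pi)/2 - (1 + mu_hat**2)/2
    outsample = -0.5 * np.log(2 * np.pi) - 0.5 * (1 + mu_hat ** 2)
    gaps.append(insample - outsample)

print(np.mean(gaps))   # about 1/n = 0.05 here (K = 1 free parameter)
```

The positive gap is exactly the over-estimation bias that correction terms of this kind are designed to remove.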
In the frequentist setting, the general model selection procedure chooses candidate models specified by some point estimate θ̂ based on a certain statistical principle such as maximum likelihood. A considerable amount of theoretical research has addressed this problem by correcting for the bias of the maximized log-likelihood (Akaike, 1973; Takeuchi, 1976; Hurvich and Tsai, 1989; Murata et al., 1994; Konishi and Kitagawa, 1996). A nice review can be found in Burnham and Anderson (2002).
Since the introduction of the Akaike information criterion (AIC; Akaike, 1973), researchers have commonly applied frequentist model selection methods to Bayesian modeling. However, the differences in the underlying philosophies between Bayesian and frequentist statistical inference caution against such direct applications. There have also been a few attempts to specialize the Kullback-Leibler divergence for Bayesian model selection (Geisser and Eddy, 1979; San Martini and Spezzaferri, 1984; Laud and Ibrahim, 1995) in the last century. Such methods are limited either in the scope of methodology or in computational feasibility, especially when the parameters of the Bayesian models are in high-dimensional hierarchical structures.
The seminal work of Spiegelhalter et al. (2002, 2014) proposed the deviance information criterion (DIC) as a Bayesian adaptation of AIC and implemented it within Bayesian inference using Gibbs sampling (BUGS; Spiegelhalter et al., 1994). Although the estimation lacks a theoretical foundation (Meng and Vaida, 2006; Celeux et al., 2006a), −DIC/2n, as a model selection criterion, heuristically estimates E_ỹ[log g(ỹ|θ̂)], the expected out-of-sample log-likelihood specified at the posterior mean, after assuming that the proposed model encompasses the true model. Alternative methods can be found either using a similar approach for mixed-effects models (Vaida and Blanchard, 2005; Liang et al., 2009; Donohue et al., 2011) or using numerical approximation (Plummer, 2008) to estimate cross-validative predictive loss (Efron, 1983).
3 Posterior averaging information criterion

Posterior averaged discrepancy function for model selection
The preceding methods can in general be viewed as Bayesian adaptations of information criteria originally designed for frequentist statistics, where each model is assessed in terms of the similarity between the true distribution f and the model density function specified by the plug-in parameters. This may not be ideal since, in contrast to frequentist modeling, "Bayesian inference is the process of fitting a probability model to a set of data and summarizing the result by a probability distribution on the parameters of the model and on unobserved quantities such as predictions for new observations" (Gelman et al., 2003). Rather than considering a model specified by a point estimate, it is more reasonable to assess the goodness of a Bayesian model in terms of prediction against the posterior distribution.
Obtaining the posterior averaged Kullback-Leibler discrepancy, rather than the Kullback-Leibler discrepancy specified at some point estimate, can be more computationally intensive, requiring a large set of posterior samples for numerical averaging when an analytical form is not available. However, advances in computing have made this cost much more affordable for Bayesian model selection. Reviews of recent developments can be found in the next section. Ando (2007) proposed an estimator for the posterior averaged discrepancy function

η = E_ỹ[E_θ|y log g(ỹ|θ)].
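When no closed form is available, the posterior-averaged quantity is obtained by averaging over posterior draws. A minimal sketch (conjugate normal model assumed for illustration, so the Monte Carlo answer can be checked against the exact one):

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative conjugate setup: y_i | mu ~ N(mu, 1), prior mu ~ N(0, 10^2)
y = rng.normal(0.3, 1.0, size=25)
n, tau0_sq = y.size, 100.0
post_var = 1.0 / (1.0 / tau0_sq + n)   # posterior variance of mu
post_mean = post_var * y.sum()         # posterior mean (prior mean is 0)

# Posterior-averaged log density at a new point y_new, by Monte Carlo
mu_draws = rng.normal(post_mean, np.sqrt(post_var), size=50_000)
y_new = 0.0
mc = np.mean(-0.5 * np.log(2 * np.pi) - 0.5 * (y_new - mu_draws) ** 2)

# Closed form of E_{mu|y}[log g(y_new|mu)] for comparison
exact = -0.5 * np.log(2 * np.pi) - 0.5 * ((y_new - post_mean) ** 2 + post_var)
print(mc, exact)   # agree up to Monte Carlo error
```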

Posterior averaging information criterion
Under certain regularity conditions, it was shown that an asymptotically unbiased estimator of η is

η̂_BPIC = n^{-1} E_θ|y[log L(θ|y)] − BC, with BC = n^{-1} [E_θ|y log{π(θ)L(θ|y)} − log{π(θ̂)L(θ̂|y)} + tr{J_n^{-1}(θ̂)I_n(θ̂)} + K/2], (1)

which simplifies to

η̂_BPIC = n^{-1} [log{π(θ̂)L(θ̂|y)} − E_θ|y log π(θ) − tr{J_n^{-1}(θ̂)I_n(θ̂)} − K/2]. (2)

Here, BC denotes the bias correction term, θ̂ is the posterior mode, K is the cardinality of θ, and the matrices J_n and I_n are empirical estimators of the Bayesian asymptotic Hessian matrix and the Bayesian asymptotic Fisher information matrix, where log π_0(θ) = lim_{n→∞} n^{-1} log π(θ).
The BPIC is introduced as −2n·η̂_BPIC and is applicable even when the true model f is not necessarily in the specified family of probability distributions. In model comparison, the candidate model with the minimum BPIC value is favored. However, it has the following limitations in practice.
1. Equation (1) is the original presentation of BPIC in Ando (2007). After simple algebra cancelling the term n^{-1} E_θ|y[log L(θ|y)] appearing in both the estimator and the bias correction term, it is in fact the plug-in estimate n^{-1} log{π(θ̂)L(θ̂|y)}, as shown in equation (2), that estimates η with some bias correction. Compared with the natural estimator n^{-1} E_θ|y[log L(θ|y)], the estimation of η using the plug-in estimator is suboptimal when the posterior distribution is asymmetric or has non-zero correlation between parameters, which occurs in a majority of cases in Bayesian modeling. This will be further illustrated in our simulation studies when we compare the bias correction performance of various criteria with small sample sizes.
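The gap between plug-in and posterior-averaged quantities is easy to see numerically. A toy sketch (a skewed gamma "posterior" and a Poisson model, chosen by us purely for illustration): because log g(y|θ) is concave in θ here, Jensen's inequality makes the plug-in value systematically larger than the posterior average.

```python
import math
import numpy as np

rng = np.random.default_rng(3)

# A skewed toy "posterior": theta | y ~ Gamma(shape=2, scale=1)
theta = rng.gamma(2.0, 1.0, size=200_000)

# Poisson model density at y = 1: log g(1|theta) = log(theta) - theta
posterior_avg = np.mean(np.log(theta) - theta)   # E_{theta|y} log g(1|theta)
theta_bar = theta.mean()                         # posterior mean, about 2
plug_in = math.log(theta_bar) - theta_bar        # log g(1 | posterior mean)

print(posterior_avg, plug_in)  # plug-in exceeds the posterior average
```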
2. The BPIC cannot be calculated when the prior distribution π(θ) is degenerate, a situation that commonly occurs in Bayesian analysis when an objective non-informative prior is selected. For example, if we use the non-informative prior π(µ) ∝ 1 for the mean parameter µ of the normal distribution in Section 4.1 below, the values of log π(θ̂) and E_θ|y log π(θ) in equation (2) are undefined.

To avoid these drawbacks, we propose a new model selection criterion in terms of the posterior mean of the empirical log-likelihood, η̄ = n^{-1} Σ_i E_θ|y[log g(y_i|θ)], a natural estimator of η. Without losing any of the attractive properties of BPIC, the new criterion expands the model scope to all Bayesian models under regularity conditions, improves the unbiasedness property in small samples, and enhances the robustness of the estimation. Because all the data y are used for both model fitting and model selection, η̄ always overestimates η. To correct the estimation bias from the overuse of the data, we propose the following theorem.

Theorem 3.1. Let y = (y_1, y_2, ..., y_n) be n independent observations drawn from the cumulative distribution F(ỹ) with density function f(ỹ). Consider G = {g(ỹ|θ); θ ∈ Θ ⊆ R^p} as a family of candidate statistical models that do not necessarily contain the true distribution f, where θ = (θ_1, ..., θ_p) is the p-dimensional vector of unknown parameters, with prior distribution π(θ). Under the following three regularity conditions:

C1: both the log density function log g(ỹ|θ) and the log unnormalized posterior density log{L(θ|y)π(θ)} are twice continuously differentiable in the compact parameter space Θ;

C2: the expected posterior mode θ_0 = arg max_θ E_ỹ[log{g(ỹ|θ)π_0(θ)}] is unique in Θ; and

C3: the Hessian matrix of E_ỹ[log{g(ỹ|θ)π_0(θ)}] is non-singular at θ_0,

the bias of η̄ for η can be approximated asymptotically without bias by

E_y(η̄ − η) ≈ n^{-1} tr{J_n^{-1}(θ̂)I_n(θ̂)},

where θ̂ is the posterior mode that maximizes the unnormalized posterior density π(θ) Π_i g(y_i|θ), and J_n and I_n are the empirical matrices defined above.

Proof. Recall that the quantity of interest is E_ỹ E_θ|y log g(ỹ|θ). To estimate it, we first check E_ỹ E_θ|y log{g(ỹ|θ)π_0(θ)} = E_ỹ E_θ|y {log g(ỹ|θ) + log π_0(θ)} and expand it around θ_0 into a leading term I_1, a linear term I_2 and a quadratic remainder. The first term I_1 can be linked to the empirical log-likelihood function, where the last step holds due to Lemma 5.5 (provided, together with the other lemmas, in the Appendix).
The second term I_2 vanishes, since θ_0 is the expected posterior mode.
With the above result, we propose a new predictive criterion for Bayesian modeling, the posterior averaging information criterion (PAIC):

PAIC = −2 Σ_i E_θ|y[log g(y_i|θ)] + 2 tr{J_n^{-1}(θ̂)I_n(θ̂)}.

Candidate models with small criterion values are preferred for the purpose of model selection.
The PAIC has several attractive properties. First, it assesses Bayesian model performance with respect to the posterior distribution, which represents the current best knowledge about the family of candidate models from a Bayesian perspective. When the posterior distribution of the parameters is asymmetric, we expect it to perform better than any plug-in based approach. Secondly, it is an asymptotically unbiased estimator of the out-of-sample log-likelihood, a measure, in terms of the Kullback-Leibler divergence, of the similarity between the fitted model and the underlying true distribution. Thirdly, PAIC is derived free of the assumption that the approximating distributions contain the truth, indicating that PAIC is generally applicable even when some candidate models are misspecified. Lastly, unlike the BPIC, the PAIC is well-defined and can cope with degenerate non-informative prior distributions for parameters.
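To make the computation concrete, here is a minimal sketch of evaluating PAIC in the conjugate normal model, where the posterior is available in closed form. The tr{J_n^{-1} I_n} form of the correction and the exact definitions of the (here scalar) matrices J_n and I_n follow our reading of the notation above, so treat this as an assumption-laden illustration rather than a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(4)

# Conjugate model: y_i | mu ~ N(mu, 1), prior mu ~ N(0, tau0_sq)
y = rng.normal(0.5, 1.0, size=30)
n, tau0_sq = y.size, 4.0

# Exact posterior N(m, v); the posterior mode equals m here
v = 1.0 / (n + 1.0 / tau0_sq)
m = v * y.sum()

# Posterior mean of the log-likelihood, closed form in this conjugate case:
# E_{mu|y}[log g(y_i|mu)] = -log(2*pi)/2 - ((y_i - m)^2 + v)/2
avg_loglik = np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * ((y - m) ** 2 + v))

# Empirical Bayesian Hessian and Fisher information (scalars in this model)
J_n = 1.0 + 1.0 / (n * tau0_sq)          # -(1/n) d^2/dmu^2 log{L(mu|y)pi(mu)}
scores = (y - m) - m / (n * tau0_sq)     # per-observation scores at the mode
I_n = np.mean(scores ** 2)

paic = -2.0 * avg_loglik + 2.0 * I_n / J_n
print(paic)
```

For a well-specified model, tr{J_n^{-1}I_n} is close to the parameter count (here 1), so the correction behaves much like AIC's.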
Several Bayesian researchers have also focused on the posterior averaged Kullback-Leibler discrepancy using cross-validation. Plummer (2008) introduced the expected deviance penalized loss, with 'expected deviance' defined as the posterior expectation of the deviance of a replicate observation, a special case of the predictive discrepancy measure (Gelfand and Ghosh, 1998). The standard cross-validation method can also be applied in this circumstance to estimate η, simply by taking the Kullback-Leibler discrepancy as the utility function (Vehtari and Lampinen, 2002; Konishi and Kitagawa, 1996). Although numerical approaches such as importance sampling can be used for the intensive computation, one caveat is that they may produce inaccurate estimates in practice if some observation y_i is influential (Vehtari and Lampinen, 2002). To address that problem, Vehtari et al. (2017) proposed Pareto smoothed importance sampling, a new algorithm for regularizing importance weights, and developed a numerical tool (Vehtari et al., 2018) to facilitate computation. Watanabe (2010) established a singular learning theory and proposed a new criterion named the Watanabe-Akaike (Gelman et al., 2014), or widely applicable, information criterion (WAIC; Watanabe, 2008, 2009), where WAIC_1 was proposed for the plug-in discrepancy and WAIC_2 for the posterior averaged discrepancy. However, compared with BPIC and PAIC, we found that WAIC_2 tends to have larger bias and variation for regular Bayesian models, as shown in the simulation studies in the next section.
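WAIC_2 is computed directly from an S × n matrix of pointwise log-likelihood values over posterior draws, with a variance-based penalty; a sketch under the same conjugate normal setup (our illustrative choice, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(5)

# Conjugate normal example: y_i | mu ~ N(mu, 1), prior mu ~ N(0, 10^2)
y = rng.normal(0.0, 1.0, size=25)
n = y.size
v = 1.0 / (n + 0.01)                       # posterior variance
m = v * y.sum()                            # posterior mean
mu = rng.normal(m, np.sqrt(v), size=4000)  # S posterior draws

# Pointwise log-likelihood matrix, S draws by n observations
log_lik = -0.5 * np.log(2 * np.pi) - 0.5 * (y[None, :] - mu[:, None]) ** 2

lppd = np.sum(np.log(np.mean(np.exp(log_lik), axis=0)))  # log pointwise predictive density
p_waic2 = np.sum(np.var(log_lik, axis=0, ddof=1))        # variance-based penalty
waic = -2.0 * (lppd - p_waic2)
print(waic, p_waic2)   # p_waic2 is close to the parameter count here
```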

Simulation Study
In this section, we present numerical results to study the behavior of the proposed method under small and moderate sample sizes in both Gaussian and non-Gaussian settings. In the simulation experiments, we obtain the true value of η either analytically (§4.1) or numerically, by averaging E_θ|y[log g(ỹ|θ)] over a large number of extra independent draws of ỹ when there is no closed form for the integration (§4.2). To keep BPIC well-defined for comparison, only proper prior distributions are considered.

A case with closed-form expression for bias estimators
Suppose the observations y = (y_1, y_2, ..., y_n) are a vector of iid samples generated from N(µ_T, σ²_T), with unknown true mean µ_T and variance σ²_T = 1. Assume the data are analyzed by the approximating model g(y_i|µ) = N(µ, σ²_A) with prior π(µ) = N(µ_0, τ²_0), where σ²_A is fixed but not necessarily equal to the true variance σ²_T. When σ²_A ≠ σ²_T, the model is misspecified. The posterior distribution of µ is normal with mean µ̂ and variance σ̂², where

µ̂ = (µ_0/τ²_0 + Σ_i y_i/σ²_A) / (1/τ²_0 + n/σ²_A),  σ̂² = 1 / (1/τ²_0 + n/σ²_A).

Therefore, we obtain the Kullback-Leibler discrepancy function and its estimator in closed form. To eliminate the estimation error caused by the sampling of the observations y, we average the bias estimates over replications of y.

Figure 1: Performance of the bias estimators for E_y(η̄ − η). The top panels are under a relatively noninformative prior with τ²_0 = 10⁴; the middle panels are under the case that the prior distribution grows with the sample size, with τ²_0 = 10⁴/n; the bottom panels are under an informative prior with τ²_0 = 0.25. The left panels (a), (b) and (c) are under the scenario σ²_A = σ²_T = 1, i.e., the true distribution is contained in the candidate models. The middle panels (d), (e) and (f) are under the scenario σ²_A = 2.25, and the right panels (g), (h) and (i) are under the scenario σ²_A = 0.25, where the proposed model is misspecified from σ²_T = 1. The true bias b_µ is drawn as a dashed curve as a function of the sample size n. The averages of the different bias estimators are marked for PAIC, BPIC, p_opt, WAIC_2 and cross-validation. Each mark represents the mean of the estimated bias over 100,000 replications of y.
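The qualitative behavior in Figure 1 can be reproduced in outline with a short simulation. The sketch below (our own reimplementation of one informative-prior, well-specified panel; an approximation of the setting, not the authors' code) computes the realized bias η̄ − η in closed form and averages it over replications of y:

```python
import numpy as np

rng = np.random.default_rng(6)

mu_T, sig2_T = 0.0, 1.0        # true model N(0, 1)
sig2_A = 1.0                   # assumed model variance (well-specified here)
mu0, tau2_0 = 0.0, 0.25        # informative prior mu ~ N(0, 0.25)
n, reps = 10, 5000

biases = []
for _ in range(reps):
    y = rng.normal(mu_T, np.sqrt(sig2_T), size=n)
    v = 1.0 / (1.0 / tau2_0 + n / sig2_A)        # posterior variance
    m = v * (mu0 / tau2_0 + y.sum() / sig2_A)    # posterior mean
    # eta_bar: posterior mean of the in-sample average log-likelihood
    eta_bar = np.mean(-0.5 * np.log(2 * np.pi * sig2_A)
                      - ((y - m) ** 2 + v) / (2 * sig2_A))
    # eta: its out-of-sample counterpart, in closed form under the true model
    eta = (-0.5 * np.log(2 * np.pi * sig2_A)
           - (sig2_T + (mu_T - m) ** 2 + v) / (2 * sig2_A))
    biases.append(eta_bar - eta)

print(np.mean(biases))   # the positive optimism bias that Figure 1 tracks
```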
The results are in accordance with the theory. All of the estimates are close to the true bias-correction values when the model is well-specified with σ²_A = σ²_T = 1, especially when the sample size becomes moderately large (Figure 1, panels (a), (b) and (c)). The estimates based on the PAIC are consistently closer to the true values than those based on Ando's method, which underestimates the bias, or on WAIC_2, cross-validation and the expected deviance penalized loss, which overestimate the bias, especially when the sample size is small. When the models are misspecified, it is not surprising that in all of the plots given in panels (d)-(i) of Figure 1, only the expected deviance penalized loss misses the target even asymptotically, since its assumption is violated, whereas all the other approaches converge to b_µ. In summary, PAIC achieves the best overall performance.
A case without closed-form expression for bias estimators

We compare the performance of four asymptotically unbiased bias estimators in a hierarchical, asymmetric setting. The true value of η does not have an analytical form; we estimate it through numerical computation using independent simulation from the same data-generating process, assuming the underlying true values µ = 0 and τ = 1. The simulation scheme is as follows: 1. Draw β_{T,i} ∼ N(0, 1), y_i ∼ Bin(n_i, logit⁻¹(β_{T,i})), i = 1, ..., N from the true distribution.
6. Repeat steps 1-5. Table 1 summarizes the bias and standard deviation of the estimation error when we choose N = 15 and n_1 = ... = n_N = 50, and the β's are independently simulated from the standard normal distribution, assuming the true hyper-parameter mean µ = 0 and variance τ² = 1. The simulation is repeated for 1,000 scenarios, each with J = 20,000 out-of-sample draws for the estimation of η. PAIC and BPIC were calculated from their definitions; leave-one-out cross-validation and WAIC_2 were estimated using the R package loo v2.0.0. The actual error, the mean absolute error and the mean squared error were considered to assess the estimation error of the bias correction estimates. With respect to all three metrics, the bias estimation of PAIC is consistently superior to the other methods. Compared to BPIC, the second-best performing criterion, the bias and the mean squared error of PAIC are reduced by about 40%, while the absolute bias is reduced by about one quarter, which matches our expectation that the natural estimate n^{-1} Σ_i E_θ|y[log g(y_i|θ)] estimates the posterior averaged Kullback-Leibler discrepancy more precisely than the plug-in estimate n^{-1} Σ_i log g(y_i|θ̂) when the posterior distribution is asymmetric and correlated. Compared to WAIC_2, the bias, absolute error and mean squared error of PAIC are dramatically reduced, by at least 60%. In practice, we expect the improvement to be even larger when the proposed models are more complicated in terms of hierarchical structure.
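For completeness, the data-generating step of the scheme (step 1; the fitting and evaluation steps 2-5 are elided in the text above, so we do not guess at them) can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(7)

def inv_logit(x):
    return 1.0 / (1.0 + np.exp(-x))

# Step 1: N = 15 groups with n_i = 50 trials each; true random effects
# beta_{T,i} ~ N(mu, tau^2) with mu = 0 and tau = 1
N, n_i = 15, 50
beta_T = rng.normal(0.0, 1.0, size=N)
y = rng.binomial(n_i, inv_logit(beta_T))
print(y)
```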

Proof of Lemmas
We start with a few lemmas to support the proof of Theorem 3.1.
Lemma 5.1. Under the same regularity conditions as Theorem 3.1, √n(θ̂ − θ_0) is asymptotically normal with mean 0 and covariance J^{-1}(θ_0)I(θ_0)J^{-1}(θ_0).

Proof. Consider the Taylor expansion of ∂log{L(θ|y)π(θ)}/∂θ around θ_0. Note that θ̂ is the mode of log{L(θ|y)π(θ)} and satisfies ∂log{L(θ|y)π(θ)}/∂θ |_{θ=θ̂} = 0. Plugging this into the expansion, we have

√n(θ̂ − θ_0) ≈ J_n^{-1}(θ_0) · n^{-1/2} ∂log{L(θ_0|y)π(θ_0)}/∂θ. (6)

From the central limit theorem, the right-hand side (RHS) of equation (6) is approximately distributed as N(0, J^{-1}(θ_0)I(θ_0)J^{-1}(θ_0)).

Lemma 5.2. Under the same regularity conditions as Theorem 3.1, the posterior distribution of θ is approximately N(θ̂, n^{-1}J_n^{-1}(θ̂)).

Proof. Taylor-expand the logarithm of L(θ|y)π(θ) around the posterior mode θ̂:

log{L(θ|y)π(θ)} ≈ log{L(θ̂|y)π(θ̂)} − (n/2)(θ − θ̂)ᵀ J_n(θ̂)(θ − θ̂). (7)

Consider the RHS of equation (7) as a function of θ: the first term is a constant, whereas the second term is proportional to the logarithm of a normal density. This yields the approximation of the posterior distribution for θ,

θ | y ≈ N(θ̂, n^{-1} J_n^{-1}(θ̂)),

which completes the proof.
Alternatively, though less intuitively, this lemma can also be proved by applying the Bernstein–von Mises theorem.
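The content of Lemma 5.2 can be checked numerically in a conjugate case, where the exact posterior is normal and the Laplace-type approximation built from the mode and curvature of log{L(θ|y)π(θ)} should recover it (a sketch with our own illustrative numbers):

```python
import numpy as np

rng = np.random.default_rng(8)

# y_i | mu ~ N(mu, 1) with prior mu ~ N(0, 1): the exact posterior is normal
y = rng.normal(1.0, 1.0, size=40)
n = y.size

# Exact posterior moments
v_exact = 1.0 / (n + 1.0)
m_exact = v_exact * y.sum()

# Log unnormalized posterior, its numerical mode over a grid, and the
# curvature at the mode by a central second difference
def log_post(mu):
    return -0.5 * np.sum((y - mu) ** 2) - 0.5 * mu ** 2

grid = np.linspace(-2.0, 3.0, 200_001)
lp = -0.5 * (np.sum(y ** 2) - 2.0 * grid * y.sum() + n * grid ** 2) - 0.5 * grid ** 2
mode = grid[np.argmax(lp)]

h = 1e-3
curv = (log_post(mode + h) - 2.0 * log_post(mode) + log_post(mode - h)) / h ** 2
v_laplace = -1.0 / curv   # approximate posterior variance from the curvature

print(mode, m_exact, v_laplace, v_exact)   # the pairs agree closely
```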
Because the posterior mode θ̂ is the solution of ∂log{L(θ|y)π(θ)}/∂θ = 0, equation (9) can be rewritten accordingly. Substituting it into the second term of (8), the expansion of E_θ|y[log{L(θ|y)π(θ)}] splits into two terms: the former is asymptotically equal to n^{-1} J_n^{-1}(θ_0)I(θ_0)J_n^{-1}(θ_0) by Lemma 5.1, and the latter is negligible, of higher order o_p(n^{-1}), as shown in Lemma 5.3. The expansion can then be simplified accordingly.

The estimation of the bootstrap error correction η(b) − η̄(b), with bootstrap analogues η(b) = E_ỹ*[E_θ|y* log g(ỹ|θ)] and η̄(b) = E_ỹ*[n^{-1} E_θ|y* log L(θ|y*)], as an estimator of η − η̄, was discussed by Ando (2007) as a Bayesian adaptation of frequentist model selection (Konishi and Kitagawa, 1996) and further investigated by Gelman et al. (2014).

Discussion

The Kullback-Leibler divergence is a non-symmetric measure of the difference between two probability distributions. Frequentist theory employing the Kullback-Leibler divergence for parametric model selection emerged during the 1970s, and the development of related theory and applications has accelerated rapidly since then. Bayesian model selection in terms of the Kullback-Leibler divergence has drawn substantial attention in the past two decades. The availability of both fast computers and advanced numerical methods has enabled the empirical popularity of Bayesian modeling, which allows for additional flexibility to incorporate information beyond the data, as represented by the prior distribution. The fundamental assumption of Bayesian inference differs from that of frequentist statistics, in that the unknown parameters are treated as random variables in the form of a probability distribution. Taking this into account, it is important to have selection techniques specifically designed for Bayesian modeling. Before proposing any specific model selection criterion, two questions should first be investigated to guide the method development: 1. What is a natural Kullback-Leibler discrepancy for evaluating a Bayesian model? 2. What is a good estimate of that Kullback-Leibler discrepancy for a Bayesian model?

The prevailing plug-in parameter methods, such as DIC, presume that the candidate models are correct and assess the goodness of each candidate model with a density function specified by the plug-in parameters. By contrast, WAIC accommodates singular models with a modified bias correction term, after an algebraic geometrical transformation of the singular parameter space to a real d-dimensional manifold.

SUPPLEMENTAL MATERIALS

Lemmas for Proof of Theorem 3.1

Some important notations. By the law of large numbers, we have n^{-1} log{L(θ|y)π(θ)} → E_ỹ[log{g(ỹ|θ)π_0(θ)}] as n tends to infinity. Denote by θ_0 and θ̂ the expected and empirical posterior modes of the log unnormalized posterior density log{L(θ|y)π(θ)}, i.e.,

θ_0 = arg max_θ E_ỹ[log{g(ỹ|θ)π_0(θ)}],  θ̂ = arg max_θ n^{-1} log{L(θ|y)π(θ)},

and let J(θ) and I(θ) denote the Bayesian Hessian matrix and the Bayesian Fisher information matrix, respectively. Because of assumption (C1), the order of the integral and the derivative can be interchanged. Therefore

E_θ|y(θ − θ̂) = n^{-1} J_n^{-1}(θ̂) E_θ|y[∂log{L(θ|y)π(θ)}/∂θ] + O_p(n^{-1}) = O_p(n^{-1}).

Table 1: The estimation error of bias correction: the mean and standard deviation (in parentheses) from 1,000 replications.