Probabilistic Pairwise Model Comparisons Based on Bootstrap Estimators of the Kullback–Leibler Discrepancy

When choosing between two candidate models, classical hypothesis testing presents two main limitations: first, the models being tested have to be nested, and second, one of the candidate models must subsume the structure of the true data-generating model. Discrepancy measures have been used as an alternative method to select models without the need to rely upon the aforementioned assumptions. In this paper, we utilize a bootstrap approximation of the Kullback–Leibler discrepancy (BD) to estimate the probability that the fitted null model is closer to the underlying generating model than the fitted alternative model. We propose correcting for the bias of the BD estimator either by adding a bootstrap-based correction or by adding the number of parameters in the candidate model. We exemplify the effect of these corrections on the estimator of the discrepancy probability and explore their behavior in different model comparison settings.


Introduction
Hypothesis testing and p-values are routinely used in applied, empirically oriented research. However, practitioners of statistics often misinterpret p-values, particularly in settings where hypothesis tests are used for model comparisons. Riedle, Neath and Cavanaugh [1] attempt to address this issue by providing an alternate conceptualization of the p-value. The authors introduce and investigate the concept of the discrepancy comparison probability (DCP) and its bootstrapped estimator, called the bootstrap discrepancy comparison probability (BDCP). The authors establish a clear connection between the BDCP based on the Kullback-Leibler discrepancy (KLD) and the p-values derived from likelihood ratio tests. However, this connection only exists when using the bootstrap discrepancy (BD) that arises from the "plug-in" principle, which yields a biased approximation to the KLD. Similarly to complexity penalization of the Akaike Information Criterion (AIC), we establish that an intuitive bias correction to the BD is the addition of k, the number of functionally independent parameters in the candidate model. We also propose utilizing a bootstrap-based correction, which can be justified under less stringent assumptions. We analyze how well the bootstrap approach corrects the bias of the BDCP and the BD, and we show that, in most settings, its performance is comparable to simply adding k.

Background
When faced with the task of choosing amongst competing models, statisticians often use discrepancy or divergence functions. One of the most flexible and ubiquitous divergence measures is the Kullback-Leibler information. To introduce this measure in the present context, consider a vector of independent observations y = (y_1, y_2, . . . , y_n)^T such that y is generated from an unknown distribution g(y). Suppose that a candidate model f(y|θ) is proposed as an approximation for g(y), and that this model belongs to the parametric class of densities {f(y|θ) : θ ∈ Θ}, where Θ is the parameter space for θ. The Kullback-Leibler information, given by

I_KL(g, θ) = E_g[ log{ g(y) / f(y|θ) } ],

captures the separation between the proposed model f(y|θ) and the true data-generating model g(y).
Although not a formal metric, I_KL(g, θ) is characterized by two desirable properties. First, by Jensen's inequality, I_KL(g, θ) ≥ 0, with equality if and only if g(y) = f(y|θ). Second, as the dissimilarity between g(y) and f(y|θ) increases, I_KL(g, θ) increases accordingly.
Note that we can write

2 I_KL(g, θ) = E_g[−2 ℓ(θ|y)] − E_g[−2 log(g(y))],

where log(f(y|θ)) = ℓ(θ|y). In the preceding relation, for any proposed candidate model, the quantity E_g[−2 log(g(y))] is constant. Only the quantity E_g[−2 ℓ(θ|y)] changes across different models, which means it is the only quantity needed to distinguish among various models. The expression

d(g, θ) = E_g[−2 ℓ(θ|y)]

is known as the Kullback-Leibler discrepancy (KLD) and is often used as a substitute for I_KL(g, θ).
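To make the role of d(g, θ) concrete, the following sketch (not part of the original study; it assumes NumPy is available) estimates the discrepancy by Monte Carlo for two normal candidate densities against a standard normal truth. Because E_g[−2 log(g(y))] is common to both candidates, ranking models by d(g, θ) reproduces the ranking given by I_KL(g, θ):

```python
import numpy as np

rng = np.random.default_rng(0)

# True generating model g: N(0, 1). Two candidate densities f(y|theta):
# theta_a = (mu = 0.1, sd = 1) lies close to g; theta_b = (mu = 1.0, sd = 1) does not.
y = rng.normal(0.0, 1.0, size=100_000)  # large sample from g

def neg2_log_f(y, mu, sd):
    """Per-observation -2 log f(y | mu, sd) for a normal candidate density."""
    return np.log(2 * np.pi * sd ** 2) + ((y - mu) / sd) ** 2

# Monte Carlo estimates of d(g, theta) = E_g[-2 log f(y|theta)]
d_a = neg2_log_f(y, 0.1, 1.0).mean()
d_b = neg2_log_f(y, 1.0, 1.0).mean()

# The candidate closer to g attains the smaller discrepancy.
assert d_a < d_b
```

For a unit-variance normal candidate, E_g[(y − µ)²] = 1 + µ², so d_a and d_b land near log(2π) + 1.01 and log(2π) + 2, respectively.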

The Discrepancy Comparison Probability and Bootstrap Discrepancy Comparison Probability
Suppose that we have two nested models that are formulated to characterize the sample y, and we designate one of the models the null, represented by θ_1, and the other model the alternative, represented by θ_2. The discrepancies under the fitted null and alternative models are given by d(g, θ̂_1) and d(g, θ̂_2), respectively. We can use these discrepancies to define the Kullback-Leibler discrepancy comparison probability (KLDCP), which is given by

P = Pr( d(g, θ̂_1) < d(g, θ̂_2) ).

The KLDCP evaluates the probability that the fitted null model is closer to the true data-generating model than the fitted alternative. The values of d(g, θ̂_1) and d(g, θ̂_2) are calculated from the same sample. For example, a KLDCP of 0.8 means that the fitted null has a smaller discrepancy than the fitted alternative in 80% of the samples drawn from the same distribution and of the same size. The development and interpretation of the KLDCP is presented in depth by Riedle, Neath and Cavanaugh [1].
We can estimate the KLDCP using the bootstrap approximation of the joint distribution of d(g, θ̂_1) and d(g, θ̂_2). The bootstrap joint distribution is based on the discrepancy estimators that arise from the "plug-in" principle, as described by Efron and Tibshirani [2], which replaces all the elements of the KLD by their bootstrap analogues. Specifically, we replace g by the empirical distribution ĝ; y by the bootstrap sample from ĝ, which we call y*; and finally, θ̂ by the maximum likelihood estimate (MLE) derived under the bootstrap sample y*, which we call θ̂*. With these replacements, the bootstrap version of the KLD is given by

d(ĝ, θ̂*) = Σ_{i=1}^{n} −2 ℓ_i(θ̂*|y_i),

where ℓ_i represents the contribution to the likelihood based on the ith response y_i. Now, in order to build a bootstrap distribution, we must draw various bootstrap samples from y. Suppose that we draw j = 1, 2, . . . , J bootstrap samples, and for each of these samples, we calculate the MLE of θ, which we denote as θ̂*(j). This allows us to obtain a set of J different bootstrap discrepancies; this set is defined as {d(ĝ, θ̂*(j)) : j = 1, . . . , J}, and these variates can be used to construct the bootstrap analogue of the discrepancy distribution.
Finally, we can extend this procedure to the setting of the null and alternative models. For each bootstrap sample, we calculate θ̂*_1(j) and θ̂*_2(j), which are the bootstrap sample MLEs of θ_1 and θ_2, respectively. We then compute the discrepancies d(ĝ, θ̂*_1(j)) and d(ĝ, θ̂*_2(j)) for the null and alternative models, respectively. This collection of J pairs of null and alternative bootstrap discrepancies defines the set

{ ( d(ĝ, θ̂*_1(j)), d(ĝ, θ̂*_2(j)) ) : j = 1, . . . , J },

which characterizes the bootstrap analogue of the joint distribution of d(ĝ, θ̂_1) and d(ĝ, θ̂_2). The bootstrap distribution can be utilized to estimate the bootstrap analogue of the DCP, given by

P* = Pr*( d(ĝ, θ̂*_1) < d(ĝ, θ̂*_2) ).

By the law of large numbers, we can approximate P* by calculating the proportion of times when d(ĝ, θ̂*_1(j)) < d(ĝ, θ̂*_2(j)) in the J bootstrap samples that were drawn. Thus, if I(·) is an indicator function, we can define an estimator of the DCP, which we call the bootstrap discrepancy comparison probability (BDCP), as follows:

BDCP = (1/J) Σ_{j=1}^{J} I( d(ĝ, θ̂*_1(j)) < d(ĝ, θ̂*_2(j)) ).
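The procedure above can be sketched in a few lines. The following is a minimal illustration (assuming NumPy; the nested Gaussian linear-model setup and all function names are ours, not from the paper) of the uncorrected BDCP: each bootstrap replicate refits both models on the resampled data and evaluates the resulting plug-in discrepancies on the original sample y.

```python
import numpy as np

def neg2_loglik_lm(X, y, beta, sigma2):
    """-2 log-likelihood of a Gaussian linear model at (beta, sigma2)."""
    resid = y - X @ beta
    return len(y) * np.log(2 * np.pi * sigma2) + resid @ resid / sigma2

def fit_mle(X, y):
    """Gaussian linear-model MLEs: OLS coefficients and the /n variance."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return beta, resid @ resid / len(y)

def bdcp(X1, X2, y, J=200, seed=None):
    """Uncorrected BDCP: the proportion of bootstrap samples in which the
    fitted null model (design X1) attains a smaller plug-in discrepancy
    than the fitted alternative (design X2)."""
    rng = np.random.default_rng(seed)
    n, wins = len(y), 0
    for _ in range(J):
        idx = rng.integers(0, n, size=n)       # bootstrap sample from g-hat
        b1, s1 = fit_mle(X1[idx], y[idx])      # theta*_1(j)
        b2, s2 = fit_mle(X2[idx], y[idx])      # theta*_2(j)
        # d(g-hat, theta*) evaluates the bootstrap MLEs on the ORIGINAL y
        wins += neg2_loglik_lm(X1, y, b1, s1) < neg2_loglik_lm(X2, y, b2, s2)
    return wins / J

# Toy comparison: intercept-only null (correct) vs. an alternative that
# adds one irrelevant covariate.
rng = np.random.default_rng(1)
n = 200
y = 1.0 + rng.normal(size=n)
X1 = np.ones((n, 1))
X2 = np.column_stack([np.ones(n), rng.normal(size=n)])
p = bdcp(X1, X2, y, J=200, seed=2)
assert 0.0 <= p <= 1.0
```

The design choice worth noting is that the bootstrap MLEs are always scored against the original sample, which is exactly the plug-in form d(ĝ, θ̂*).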

Bias Corrections for the BDCP
An important issue that arises in the bootstrap estimation of the KLD is the negative bias of the discrepancy estimators that materializes from the "plug-in" principle. The following lemma establishes and quantifies this bias for large-sample settings under an appropriately specified candidate model.

Lemma 1.
For a large sample size, assuming that the candidate model subsumes the true model,

E*[ d(ĝ, θ̂*) ] ≈ −2 ℓ(θ̂|y) + k,

where E* is the expectation with respect to the bootstrap distribution, and k is the dimension of the model.

Proof.
For a maximum likelihood estimator θ̂, it is well known that for a large sample size and under certain regularity conditions, we have, approximately,

(θ̂ − θ)^T I(θ̂|y) (θ̂ − θ) ∼ χ²_k,    (2)

provided that the model is adequately specified. In the preceding, χ²_k denotes a centrally distributed chi-square random variable with k degrees-of-freedom, and I(θ̂|y) denotes the observed information matrix. Now, consider the second-order Taylor series expansion of −2 ℓ(θ̂*|y) about θ̂; since the first-order term vanishes at the maximizer θ̂, this results in

−2 ℓ(θ̂*|y) ≈ −2 ℓ(θ̂|y) + (θ̂* − θ̂)^T I(θ̂|y) (θ̂* − θ̂).    (3)
Finally, it has been established that if the true model is contained in the candidate class at hand, and if the large sample properties of MLEs hold, then AIC serves as an asymptotically unbiased estimator of the KLD. Thus,

E[ d(g, θ̂) ] ≈ E[ −2 ℓ(θ̂|y) + 2k ].

The preceding expression can be re-written as

E[ d(g, θ̂) ] ≈ E[ {−2 ℓ(θ̂|y) + k} + k ] ≈ E[ E*[ d(ĝ, θ̂*) ] + k ],

which implies that the bias correction k must be added to the bootstrap discrepancy in the estimation of the KLD. The BD estimator corrected by the addition of k will be called BDk. Now, focus again on Equation (3). By subtracting −2 ℓ(θ̂|y) from both sides of the equation, we obtain

−2 ℓ(θ̂*|y) − {−2 ℓ(θ̂|y)} ≈ (θ̂* − θ̂)^T I(θ̂|y) (θ̂* − θ̂).    (4)
As mentioned previously, if the candidate model is adequately specified, then the distributional approximation in (2) holds true. However, if this model specification assumption is not met, then we can utilize the approximation in (4) to find a suitable bias correction via the bootstrap. The bootstrap has been used for bias corrections in similar problem contexts [3,4].
By applying the expected value with respect to the bootstrap distribution of θ̂* to both sides of (4), and approximating that expectation by the average over the J bootstrap samples, we obtain the bootstrap-based bias correction

k_b = (1/J) Σ_{j=1}^{J} [ −2 ℓ(θ̂*(j)|y) − {−2 ℓ(θ̂|y)} ].

The BD estimator corrected by the addition of k_b will be called BDb. In the comparison of candidate models, the corrections k_{b1} and k_{b2} are computed in this manner under the null and alternative models, and they serve as the bootstrap counterparts of k_1 and k_2, where k_1 and k_2 are the number of functionally independent parameters that define the null and alternative models, respectively.
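As an illustration of the correction defined through (4), the sketch below (assuming NumPy; the i.i.d. normal model is our choice for brevity, not a model from the paper) computes k_b for a model with k = 2 parameters by averaging −2ℓ(θ̂*(j)|y) − {−2ℓ(θ̂|y)} over bootstrap replicates; under correct specification the average should land near k = 2.

```python
import numpy as np

def neg2_loglik_norm(y, mu, sigma2):
    """-2 log-likelihood of an i.i.d. N(mu, sigma2) model."""
    return len(y) * np.log(2 * np.pi * sigma2) + ((y - mu) ** 2).sum() / sigma2

rng = np.random.default_rng(0)
n, J = 200, 500
y = rng.normal(0.0, 1.0, size=n)

# MLEs under the original sample: sample mean and /n variance, so k = 2.
mu_hat, s2_hat = y.mean(), y.var()
base = neg2_loglik_norm(y, mu_hat, s2_hat)      # -2 l(theta-hat | y)

# Bootstrap bias correction: average of -2 l(theta*(j)|y) - {-2 l(theta-hat|y)}
diffs = np.empty(J)
for j in range(J):
    yb = y[rng.integers(0, n, size=n)]          # bootstrap sample
    diffs[j] = neg2_loglik_norm(y, yb.mean(), yb.var()) - base

k_b = diffs.mean()
# Under correct specification, k_b should land near k = 2.
assert 0.5 < k_b < 4.0
```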

Simulation Studies
The following simulation sets are designed to explore the bias when estimating both the DCP based on the Kullback-Leibler discrepancy (KLDCP) and the expected value of the KLD. We present different hypothesis testing scenarios, not all of which are conventional, under a linear data-generating model and for varying sample sizes. Each setting exhibits three different approaches to formulating the BD: adding the bootstrap-based correction (BDb), adding k (BDk), and leaving the estimator uncorrected.

Settings for Simulation Sets
For Sets 1 to 5, the true data-generating model is of the form

y = X_0 β_0 + ε, with each covariate vector x_i (a row of X_0) sampled as x_i ∼ N_{p−1}(µ, Σ),    (8)

where the entries of µ are chosen from {−1, 1} with equal probability, and Σ = diag_{p−1}(100). For Sets 1 to 4, we have ε_i ∼ N(0, σ²_0); for Set 5, we have that ε_i ∼ t_{df=5}, where t_{df} denotes the Student's t distribution based on df degrees of freedom; and for Set 6, the ε_i are generated from a mixture of normal distributions. In the setting at hand, the true data-generating model g has parameters θ = (β_0^T, σ²_0)^T. Hurvich and Tsai [6] showed that for the family of approximating models y = Xβ + ε, where X is the design matrix and ε ∼ N(0, σ² I_n), with maximum likelihood estimators given by

β̂ = (X^T X)^{−1} X^T y and σ̂² = (y − X β̂)^T (y − X β̂) / n,

the KLD measure d(g, θ̂) is given by

d(g, θ̂) = n log(2π σ̂²) + { n σ²_0 + (X_0 β_0 − X β̂)^T (X_0 β_0 − X β̂) } / σ̂².    (9)

The expected value of the KLD for the null and the alternative models was approximated by averaging the KLD over 5000 samples generated from g. These 5000 KLD values, computed using (9), approximate the joint distribution of d(g, θ̂_1) and d(g, θ̂_2); hence, with I(·) denoting the indicator function, the simulation-based estimator of the KLDCP is given by

(1/5000) Σ_{s=1}^{5000} I( d(g, θ̂_1(s)) < d(g, θ̂_2(s)) ).    (10)

This KLDCP estimate is calculated 100 times in order to estimate the KLDCP distribution and its expected value.
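The closed-form discrepancy (9) is straightforward to code. The sketch below (assuming NumPy; the toy design and the helper name are ours, not from the simulation sets) evaluates d(g, θ̂) for a correctly specified and an overspecified linear model against a known truth:

```python
import numpy as np

def kld_linear(X, y, true_mean, sigma2_0):
    """Closed-form KLD d(g, theta-hat) of a fitted Gaussian linear model,
    following (9): n*log(2*pi*s2_hat) + (n*sigma2_0 + ||gap||^2) / s2_hat."""
    n = len(y)
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # MLE of beta (OLS)
    resid = y - X @ beta_hat
    s2_hat = resid @ resid / n                        # MLE of sigma^2
    gap = true_mean - X @ beta_hat                    # X0 beta0 - X beta-hat
    return n * np.log(2 * np.pi * s2_hat) + (n * sigma2_0 + gap @ gap) / s2_hat

rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
true_mean = 1.0 + 0.5 * x                 # X0 @ beta0
y = true_mean + rng.normal(size=n)        # sigma2_0 = 1

X_null = np.column_stack([np.ones(n), x])                      # correct mean
X_alt = np.column_stack([np.ones(n), x, rng.normal(size=n)])   # overspecified
d_null = kld_linear(X_null, y, true_mean, 1.0)
d_alt = kld_linear(X_alt, y, true_mean, 1.0)
assert np.isfinite(d_null) and np.isfinite(d_alt)
```

Averaging such discrepancies over many samples from g, as described above, yields the simulation-based approximations of E(KLD) and the KLDCP.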
Finally, for each of the 5000 samples, we calculate the BD and the BDb using 200 bootstrap samples. However, to attenuate the simulation variability incurred by the mixture distribution, the number of bootstrap samples in Set 6 was increased to 500. The results displayed in the tables are based on averages over the 5000 samples.
Set 1: Null hypothesis is correctly specified, and alternative hypothesis is overspecified.
Consider the true data-generating model given by (8), with the covariate vector x_i sampled as indicated there.
For the hypothesis testing setting in Set 1, the null model is adequately specified, while the alternative model contains the true model plus four additional explanatory variables. These extra explanatory variables are generated from the distribution indicated in (8).
Set 2: Null hypothesis is underspecified, and alternative hypothesis is correctly specified.
For the hypothesis testing setting in Set 2, the alternative model has the same structure as the data-generating model, but the null model is missing one of the explanatory variables in the true model, namely x_5.
Set 3: Both null and alternative models are underspecified, but the null is closer to the data-generating model.
Consider the true data-generating model given by (8), with the covariate vector x_i sampled as indicated there.
For the hypothesis testing setting in Set 3, both the null and alternative candidate models have the same number of explanatory variables, and they are both missing variable x_4. However, there is a slight difference in the effect sizes of the variables for these models. For the alternative, the effect sizes are −0.5 and 0.1 for x_4 and x_6, respectively. On the other hand, the effect size for the null model is 0.5 for both x_2 and x_3. When comparing the null and alternative models, the smaller effect size on x_6 sets the alternative further away from the true model.
Set 4: Both null and alternative models are equally underspecified.
Consider the true data-generating model given by (8), with the covariate vector x_i = (x_{i1}, x_{i2}, . . . , x_{i7})^T sampled as indicated there.
For the hypothesis testing setting in Set 4, the null and alternative candidate models are equally underspecified: they have the same number of explanatory variables with the same effect sizes, and neither model captures the true data-generating model.
Set 5: Null model has correct mean specification and alternative model is overspecified, but both are misspecified with respect to the error distribution, which is a Student's t distribution.
Consider the true data-generating model given by (8), with errors following a Student's t distribution. For the hypothesis testing setting in Set 5, the null and alternative models follow the same mean specifications as in Set 1, where the null is properly specified while the alternative is overspecified. However, the models in the setting at hand inadequately specify the distribution of the errors.
Set 6: Null model has correct mean specification, and the alternative model is overspecified, but both are misspecified with respect to the error distribution, which is a mixture of normals.
Consider the true data-generating model given by (8), with errors generated from a mixture of normal distributions. For the hypothesis testing setting in Set 6, the null and alternative models are specified as in Set 5. This setting is similar to the one featured in Set 5; however, the errors in the setting at hand are generated from a mixture of normal distributions.

KLDCP Estimates From Simulations
For the tables showing the KLDCP simulation results, the columns are labeled as follows.
(1) KLDCP corresponds to results based on the distribution of 100 replicates of the KLDCP, where each KLDCP is calculated using (10). Note that the null and alternative KLD joint distribution is characterized based on discrepancy replicates obtained through (9).
(2) BDCPb corresponds to results based on the distribution of 5000 replicates of the BDCPb.

Estimates of the Expected KLD From Simulations
For the tables showing the KLD results, the columns are labeled as follows.
(1) E(KLD) corresponds to the average of 5000 discrepancies calculated using (9).
(2) E(BD) corresponds to the average of the 5000 BD estimates, where each BD is computed from M bootstrap samples. We have that M = 200 for Sets 1-5 and M = 500 for Set 6.
(3) ∆BDb corresponds to the difference between the estimate of E(BD), with each BD corrected by k_b, and the estimate of E(KLD) described in (1). In other words, if we let j ∈ {1, 2, . . . , 5000} index the simulated data sets, BD_j be the BD estimate for data set j, and k_jb be the k_b correction for data set j, then

∆BDb = (1/5000) Σ_{j=1}^{5000} (BD_j + k_jb) − E(KLD).

(4) ∆BDk shows the same difference described in (3), but using k instead of k_b, which results in

∆BDk = (1/5000) Σ_{j=1}^{5000} (BD_j + k) − E(KLD).

Discussion of Simulation Results
As mentioned previously, in the conventional hypothesis testing scenario for comparing nested models, Riedle, Neath and Cavanaugh [1] established that the uncorrected BDCP approximates the p-value derived from the likelihood ratio test. Therefore, in the case where the null candidate model is correctly specified, both the uncorrected BDCP and the p-value have a Uniform(0, 1) distribution. This behavior is displayed in Table 1, where for large sample sizes, the mean and median of the BDCP distribution are around 0.5. This is a problematic feature of the uncorrected BDCP and p-values because the measure does not reliably favor the null model in those settings where the null is true. However, we see that for large sample sizes, both the BDCPk and the BDCPb values are close to 1, which clearly favors the null model. Table 2 shows the results from the setting where the alternative hypothesis is correctly specified, while the null is underspecified. Here, we would expect all the discrepancy probabilities to be close to 0, as seen in the case where the sample size is N = 500. However, for smaller sample sizes, i.e., N = 25 and N = 50, we observe larger values for the discrepancy probabilities. In fact, for N = 25, the BDCPb is 0.89 and, with a mean and median close to 0.5, the uncorrected BDCP exhibits similar behavior to the case where the null is true. This phenomenon is expected within the framework of model selection, where additional explanatory variables are favorable only if there is a sufficient sample size to adequately estimate their effects. If the sample size is too small to construct reliable estimates, then it is best to choose smaller models, even at the expense of model misspecification.
The results from Tables 1, 3-6 show that when estimating the KLDCP with a small sample size (N = 25 to N = 100), the BDb performs either better than or as well as the BDk. For large sample sizes, all simulation sets exhibit a similar performance for both corrections.
For discrepancy estimation, Tables 7-10 show that across all sample sizes, k_b overcorrects for the bias of the discrepancy approximation, and the overcorrection is more prominent for small sample sizes. It is worth noting that this evident over-estimation from the BDb is accompanied by a superior bias reduction of the corresponding KLDCP estimator. For instance, Table 7 shows a significant over-estimation by BDb compared to BDk, especially in the small sample settings. However, the corresponding estimator of the KLDCP, displayed in Table 1, exhibits less bias for BDCPb than for BDCPk.
Finally, Tables 11 and 12 show that, across all sample sizes, the correction by k_b markedly reduces the bias compared to the correction by k. This means that in the setting where the mean structure is correctly specified for the null and overspecified for the alternative, but both models are incorrectly specified with respect to the error distribution, the bootstrap-based correction evidently outperforms the simple correction of k.
In most cases, however, the bias reductions resulting from the k_b and the k corrections are comparable. Therefore, our simulation studies suggest that if the null and/or the alternative models are misspecified, then correcting by either k_b or k will generally yield comparable estimators of the expected KLDCP.

Table 9. Expected value of the KLD, its bootstrap estimate, and the bias of the corrected bootstrap estimates for the null and alternative models in Set 3. Here, the null and alternative models are underspecified, but the null model is closer to the true data-generating model.

Table 11. Expected value of the KLD, its bootstrap estimate, and the bias of the corrected bootstrap estimates for the null and alternative models in Set 5. Here, the null and alternative models are misspecified with respect to the error distribution, and the errors are generated from a Student's t distribution.

Table 12. Expected value of the KLD, its bootstrap estimate, and the bias of the corrected bootstrap estimates for the null and alternative models in Set 6. Here, the null and alternative models are misspecified with respect to the error distribution, and the errors are generated from a mixture of normal distributions.

Application: Creatine Kinase Levels during Football Preseason
In this section, we apply the BDCP to a data set from a biomedical setting. The goal of this application is to understand the changes in creatine kinase (CK) levels observed in the blood samples of college football players during preseason training. In order to properly explain the variation of CK, we must select between competing models that use different demographic and clinical variables. We will analyze the models selected by the k_b-corrected, the k-corrected, and the uncorrected BDCP, and we will compare the results to the selection of models via the more conventional p-value approach.

Overview of Application
During strenuous exercise, skeletal muscle cells break down and release a variety of intracellular contents. When in excess, a condition known as exertional rhabdomyolysis (ER) can occur, which may result in life-threatening complications such as renal failure, cardiac arrhythmia and compartment syndrome. Creatine kinase (CK) is one of the proteins released during muscle breakdown, and measuring its levels is the most sensitive test for assessing muscular damage that could lead to ER [7].
During the off-season workouts in January 2011, a group of 13 University of Iowa football players developed ER. This event led to a prospective study in which 30 University of Iowa football athletes were followed during a 34-day preseason workout camp. CK levels were obtained from blood samples that were drawn on the first, third, and seventh days of the camp. Other demographic and clinical variables, such as age, body mass index (BMI), number of semesters in the program, and history of rhabdomyolysis, were also collected.
The initial results of the study, published by Smoot et al. [8], show that the CK levels at later time points were significantly different than the levels at earlier times. However, most of the clinical and demographic variables were not significant in explaining the levels of CK. One of the underlying issues with this type of modeling analysis is that the significance of each variable can only be assessed by hypothesis tests with nested models. For example, suppose that we wish to determine the significance of BMI in the presence of semesters in the program. To obtain a p-value for BMI, we need to formulate a hypothesis test where the null model only contains semesters in the program, while the alternative model contains both BMI and semesters in the program.
Although this setting may be useful in some scenarios, it is too limiting. For instance, suppose that we wish to choose between two non-nested models where one contains BMI and the other contains semesters in the program. Although a conventional test based on linear regression models would not be able to answer this question, the BDCP approach could indeed determine the propriety of either model in this type of non-nested setting.
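To illustrate how such a head-to-head comparison works mechanically, the following sketch (synthetic data only; "bmi" and "semesters" are hypothetical stand-ins, not the study's confidential measurements; assumes NumPy) computes the uncorrected BDCP for two non-nested single-covariate models, where one covariate truly drives the response:

```python
import numpy as np

rng = np.random.default_rng(0)
n, J = 100, 200

# Hypothetical stand-ins (NOT the study's data): "bmi" drives the response,
# "semesters" carries no signal.
bmi = rng.normal(size=n)
semesters = rng.normal(size=n)
ck3 = 0.8 * bmi + rng.normal(size=n)

def neg2_loglik_lm(X, y, beta, sigma2):
    """-2 log-likelihood of a Gaussian linear model at (beta, sigma2)."""
    resid = y - X @ beta
    return len(y) * np.log(2 * np.pi * sigma2) + resid @ resid / sigma2

def fit_mle(X, y):
    """OLS coefficients and the /n variance estimate (Gaussian MLEs)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return beta, resid @ resid / len(y)

# Non-nested pair: intercept + BMI versus intercept + Semesters.
X1 = np.column_stack([np.ones(n), bmi])
X2 = np.column_stack([np.ones(n), semesters])

wins = 0
for _ in range(J):
    idx = rng.integers(0, n, size=n)           # bootstrap sample
    b1, s1 = fit_mle(X1[idx], ck3[idx])
    b2, s2 = fit_mle(X2[idx], ck3[idx])
    wins += neg2_loglik_lm(X1, ck3, b1, s1) < neg2_loglik_lm(X2, ck3, b2, s2)

bdcp_val = wins / J
# With bmi truly predictive, the BMI model should win in nearly every sample.
assert bdcp_val > 0.9
```

Nothing in the comparison requires one design matrix to be contained in the other, which is exactly what a nested likelihood ratio test cannot accommodate.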
In the analysis of this data set, we let CK3 be the log of the CK levels measured on the seventh day of the camp, CK1 be the log of the CK levels measured on the first day of the camp, and Semesters be the number of semesters in the program. Of note, the log transformation is routinely applied in studies involving CK levels in order to justify approximate normality, as the raw levels tend to have heavily right-skewed distributions. Now, consider the following hypothesis testing settings.
Setting 1: Testing the propriety of the model containing CK1.
Setting 2: Testing the propriety of the model containing CK1 and Semesters over the model containing only CK1.
Setting 3: Head-to-head comparison of non-nested models.

Results of Application
The results for the application are summarized in Table 13. Settings 1 and 2 illustrate the congruence between the BDCP and p-values in the case of hypothesis testing based on nested models. Setting 1 assesses the propriety of a model that includes only the intercept against a model that includes both the intercept and the levels of CK1. The p-value for CK1 in this setting is 0.001, which means that, using a level α of 0.05, CK1 is significant in explaining the variation in CK3 levels. Both the BDCPk and BDCPb are 0.075, which means that the null model is preferred in only 7.5% of the bootstrap samples, indicating that the model containing CK1 is superior.
Once we establish that CK1 is an important variable to include in our model, the next step is to determine if additional variables can improve our model fit. Setting 2 displays a hypothesis test where the null model only contains CK1, while the alternative contains both CK1 and Semesters. The p-value for Semesters is 0.734, which means that Semesters is not statistically significant, and a reasonable investigator would choose to exclude Semesters from the final model. The corrected BDCP values arrive at the same conclusion. For instance, the BDCPb is 0.995, which indicates that, across multiple bootstrap samples, the null model is chosen 99.5% of the time; therefore, the BDCP encourages us to choose the model that excludes Semesters. The rationale for testing Semesters is based on the idea that more senior athletes tend to rigorously maintain their workout habits during the off-season, mostly because of experience and maturity. Therefore, Semesters is a variable that may confound the effects of CK1 on the variation of CK3. Additionally, the medical literature has shown that BMI highly correlates with CK levels and the development of ER [9], which means that one should also test the propriety of models that include BMI. Thus, one could ask if a model featuring BMI would be better than a model featuring Semesters. This results in a hypothesis testing scenario where the null and alternative models are non-nested, as exhibited in Setting 3.
First, note that the p-values displayed in the table for Setting 3 do not answer the question at hand. These p-values are obtained from partial tests applied to the full model containing both variables. On the other hand, the BDCP gives us meaningful information about the performance of adding BMI versus adding Semesters. The BDCPb tells us that there is a 78% probability that the model containing BMI is a better fit than the model containing Semesters. If we use the BDCPk instead, the probability increases to 81.5%. In both cases, if we are debating whether to include BMI or Semesters as an adjusting variable, the BDCP clearly favors the inclusion of BMI.

Conclusions
When deciding between two competing models, practitioners of statistics normally utilize traditional hypothesis testing methods that rely on the assumption that one of the candidate models is properly specified. This approach is problematic because it is unreasonable to assume that one of the proposed models is precisely true. In addition, these methods are only applicable to nested models. To avoid these underlying assumptions and model structure limitations, Riedle, Neath and Cavanaugh [1] propose the use of the bootstrap discrepancy comparison probability (BDCP) to assess the propriety of the fit of two candidate models. However, the bootstrap discrepancy (BD) utilized in this work provides a biased estimator of the Kullback-Leibler discrepancy (KLD).
When hypothesis testing assumptions are met, the BDCP asymptotically approximates the likelihood ratio test p-value. Therefore, similarly to p-values, the distribution of the BDCP is uniform if the null hypothesis is true. Hence, in settings when the null is true, the BDCP would be of limited value in choosing the appropriate model.
In this paper, we proposed utilizing the k_b or the k corrected BDCP, namely the BDCPb and the BDCPk, respectively. The BDCPb employs the BDb, a bootstrap-corrected estimator of the KLD, while the BDCPk uses the BDk, a BD corrected by adding the number of functionally independent parameters in the candidate model. We showed that for most settings, the BDb serves as an over-corrected estimator of the KLD, but the corresponding BDCPb is less biased than the BDCPk for the estimation of the KLDCP. However, in the case where there is distributional misspecification, we showed that the BDb has negligible bias for the estimation of the expected value of the KLD.
Moreover, the estimation of the bootstrap correction k b utilizes the same bootstrap samples that were used to calculate the BD; therefore, we argue that the computational requirements of estimating k b are not too burdensome. However, if the sample size is moderately large compared to the number of parameters in the model, then we showed that using k to correct the bias generally results in comparable values of the KLDCP estimates.

Data Availability Statement:
The R code used in generating the data for the simulation study is available on request from the corresponding author. The data for the application are not publicly available since the dataset is confidential.

Conflicts of Interest:
The authors declare no conflict of interest.