Article

Probabilistic Pairwise Model Comparisons Based on Bootstrap Estimators of the Kullback–Leibler Discrepancy

Department of Biostatistics, University of Iowa, 145 N. Riverside Drive, Iowa City, IA 52242, USA
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Entropy 2022, 24(10), 1483; https://doi.org/10.3390/e24101483
Submission received: 27 September 2022 / Revised: 15 October 2022 / Accepted: 16 October 2022 / Published: 18 October 2022
(This article belongs to the Special Issue Information and Divergence Measures)

Abstract

When choosing between two candidate models, classical hypothesis testing presents two main limitations: first, the models being tested have to be nested, and second, one of the candidate models must subsume the structure of the true data-generating model. Discrepancy measures have been used as an alternative method to select models without the need to rely upon the aforementioned assumptions. In this paper, we utilize a bootstrap approximation of the Kullback–Leibler discrepancy (BD) to estimate the probability that the fitted null model is closer to the underlying generating model than the fitted alternative model. We propose correcting for the bias of the BD estimator either by adding a bootstrap-based correction or by adding the number of parameters in the candidate model. We exemplify the effect of these corrections on the estimator of the discrepancy probability and explore their behavior in different model comparison settings.

1. Introduction

Hypothesis testing and p-values are routinely used in applied, empirically oriented research. However, practitioners of statistics often misinterpret p-values, particularly in settings where hypothesis tests are used for model comparisons. Riedle, Neath and Cavanaugh [1] attempt to address this issue by providing an alternate conceptualization of the p-value. The authors introduce and investigate the concept of the discrepancy comparison probability (DCP) and its bootstrapped estimator, called the bootstrap discrepancy comparison probability (BDCP). The authors establish a clear connection between the BDCP based on the Kullback–Leibler discrepancy (KLD) and the p-values derived from likelihood ratio tests. However, this connection only exists when using the bootstrap discrepancy (BD) that arises from the “plug-in” principle, which yields a biased approximation to the KLD. In a manner similar to the complexity penalization of the Akaike information criterion (AIC), we establish that an intuitive bias correction to the BD is the addition of $k$, the number of functionally independent parameters in the candidate model. We also propose utilizing a bootstrap-based correction, which can be justified under less stringent assumptions. We analyze how well the bootstrap approach corrects the bias of the BDCP and the BD, and we show that, in most settings, its performance is comparable to simply adding $k$.

2. Methodological Development

2.1. Background

When faced with the task of choosing amongst competing models, statisticians often use discrepancy or divergence functions. One of the most flexible and ubiquitous divergence measures is the Kullback–Leibler information. To introduce this measure in the present context, consider a vector of independent observations $y = (y_1, y_2, \ldots, y_n)^T$ such that $y$ is generated from an unknown distribution $g(y)$. Suppose that a candidate model $f(y \mid \theta)$ is proposed as an approximation for $g(y)$, and that this model belongs to the parametric class of densities
$$\mathcal{F} = \left\{ f(y \mid \theta) : \theta \in \Theta \right\},$$
where $\Theta$ is the parameter space for $\theta$. The Kullback–Leibler information, given by
$$I_{KL}(g, \theta) = \mathrm{E}_g\left[ \log \frac{g(y)}{f(y \mid \theta)} \right],$$
captures the separation between the proposed model $f(y \mid \theta)$ and the true data-generating model $g(y)$.
Although not a formal metric, $I_{KL}(g, \theta)$ is characterized by two desirable properties. First, by Jensen’s inequality, $I_{KL}(g, \theta) \geq 0$, with equality if and only if $g(y) = f(y \mid \theta)$. Second, as the dissimilarity between $g(y)$ and $f(y \mid \theta)$ increases, $I_{KL}(g, \theta)$ increases accordingly.
Note that we can write
$$2\,I_{KL}(g, \theta) = \mathrm{E}_g\left[ -2 \log f(y \mid \theta) \right] - \mathrm{E}_g\left[ -2 \log g(y) \right] = \mathrm{E}_g\left[ -2\,\ell(\theta \mid y) \right] - \mathrm{E}_g\left[ -2 \log g(y) \right],$$
where $\ell(\theta \mid y) = \log f(y \mid \theta)$. In the preceding relation, for any proposed candidate model, the quantity $\mathrm{E}_g[-2 \log g(y)]$ is constant. Only the quantity $\mathrm{E}_g[-2\,\ell(\theta \mid y)]$ changes across different models, which means it is the only quantity needed to distinguish among various models. The expression
$$d(g, \theta) = \mathrm{E}_g\left[ -2\,\ell(\theta \mid y) \right]$$
is known as the Kullback–Leibler discrepancy (KLD) and is often used as a substitute for $I_{KL}(g, \theta)$.
In practice, the goal is to determine the propriety of fitted models of the form $f(y \mid \hat{\theta})$, where $\hat{\theta} = \operatorname{argmax}_{\theta \in \Theta}\, \ell(\theta \mid y)$. The KL discrepancy for the fitted model is given by
$$d(g, \hat{\theta}) = \mathrm{E}_g\left[ -2\,\ell(\theta \mid y) \right] \Big|_{\theta = \hat{\theta}}.$$
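The article itself contains no code; as a concrete illustration, the following Python sketch (our own, not the authors’) approximates the fitted-model discrepancy $d(g, \hat{\theta})$ by Monte Carlo for a simple normal candidate model. The generating mean and standard deviation are assumed example values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed example values: g is N(2, 1.5^2); the candidate is a normal
# model with both mean and variance estimated by maximum likelihood.
mu_true, sd_true, n = 2.0, 1.5, 100
y = rng.normal(mu_true, sd_true, size=n)

mu_hat = y.mean()   # MLE of the mean
var_hat = y.var()   # MLE of the variance (1/n denominator)

def neg2_loglik(sample, mu, var):
    """-2 * log-likelihood of a N(mu, var) model evaluated on `sample`."""
    return np.sum(np.log(2 * np.pi * var) + (sample - mu) ** 2 / var)

# d(g, theta_hat) = E_g[-2 l(theta | y)] evaluated at theta = theta_hat,
# approximated by averaging -2 l(theta_hat | .) over fresh draws from g.
reps = [neg2_loglik(rng.normal(mu_true, sd_true, size=n), mu_hat, var_hat)
        for _ in range(2000)]
d_hat = np.mean(reps)
print(d_hat)
```

Note that the expectation is taken under $g$ with $\hat{\theta}$ held fixed, so the average stabilizes near $n\left[\log(2\pi\hat{\sigma}^2) + \sigma_0^2/\hat{\sigma}^2\right]$ plus a small term for the estimated mean.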

2.2. The Discrepancy Comparison Probability and Bootstrap Discrepancy Comparison Probability

Suppose that we have two nested models that are formulated to characterize the sample $y$, and we designate one of the models the null, represented by $\theta_1$, and the other model the alternative, represented by $\theta_2$. The discrepancies under the fitted null and alternative models are given by $d(g, \hat{\theta}_1)$ and $d(g, \hat{\theta}_2)$, respectively. We can use these discrepancies to define the Kullback–Leibler discrepancy comparison probability (KLDCP), which is given by
$$P = \Pr\left[\, d(g, \hat{\theta}_1) < d(g, \hat{\theta}_2) \,\right].$$
The KLDCP evaluates the probability that the fitted null model is closer to the true data-generating model than the fitted alternative. The values of $d(g, \hat{\theta}_1)$ and $d(g, \hat{\theta}_2)$ are calculated from the same sample. For example, a KLDCP of 0.8 means that the fitted null has a smaller discrepancy than the fitted alternative in 80% of the samples drawn from the same distribution and of the same size. The development and interpretation of the KLDCP are presented in depth by Riedle, Neath and Cavanaugh [1].
We can estimate the KLDCP using the bootstrap approximation of the joint distribution of $d(g, \hat{\theta}_1)$ and $d(g, \hat{\theta}_2)$. The bootstrap joint distribution is based on the discrepancy estimators that arise from the “plug-in” principle, as described by Efron and Tibshirani [2], which replaces all the elements of the KLD by their bootstrap analogues. Specifically, we replace $g$ by the empirical distribution $\hat{g}$; $y$ by the bootstrap sample from $\hat{g}$, which we call $y^*$; and finally, $\hat{\theta}$ by the maximum likelihood estimate (MLE) derived under the bootstrap sample $y^*$, which we call $\hat{\theta}^*$. With these replacements, the bootstrap version of the KLD is given by
$$d(\hat{g}, \hat{\theta}^*) = \mathrm{E}_{\hat{g}}\left[ -2\,\ell(\theta \mid y) \right] \Big|_{\theta = \hat{\theta}^*} = \sum_{i=1}^{n} -2\,\ell_i(\hat{\theta}^* \mid y_i) \quad (\text{because each } y_i \text{ is independent}) = -2\,\ell(\hat{\theta}^* \mid y),$$
where $\ell_i$ represents the contribution to the log-likelihood based on the $i$th response $y_i$.
Now, in order to build a bootstrap distribution, we must draw various bootstrap samples from $y$. Suppose that we draw $j = 1, 2, \ldots, J$ bootstrap samples, and for each of these samples, we calculate the MLE of $\theta$, which we denote as $\hat{\theta}^{*(j)}$. This allows us to obtain a set of $J$ different bootstrap discrepancies; this set is defined as
$$\left\{ d(\hat{g}, \hat{\theta}^{*(j)}) : j = 1, \ldots, J \right\},$$
and these variates can be used to construct the bootstrap analogue of the discrepancy distribution.
Finally, we can extend this procedure to the setting of the null and alternative models. For each bootstrap sample, we calculate $\hat{\theta}_1^{*(j)}$ and $\hat{\theta}_2^{*(j)}$, the bootstrap sample MLEs of $\theta_1$ and $\theta_2$, respectively. We then compute the discrepancies $d(\hat{g}, \hat{\theta}_1^{*(j)})$ and $d(\hat{g}, \hat{\theta}_2^{*(j)})$ for the null and alternative models, respectively. This collection of $J$ pairs of null and alternative bootstrap discrepancies defines the set
$$\left\{ \left( d(\hat{g}, \hat{\theta}_1^{*(j)}),\; d(\hat{g}, \hat{\theta}_2^{*(j)}) \right) : j = 1, \ldots, J \right\},$$
which characterizes the bootstrap analogue of the joint distribution of $d(\hat{g}, \hat{\theta}_1)$ and $d(\hat{g}, \hat{\theta}_2)$. The bootstrap distribution can be utilized to estimate the bootstrap analogue of the DCP, given by
$$P^* = \Pr_*\left[\, d(\hat{g}, \hat{\theta}_1^*) < d(\hat{g}, \hat{\theta}_2^*) \,\right].$$
By the law of large numbers, we can approximate $P^*$ by calculating the proportion of times that $d(\hat{g}, \hat{\theta}_1^{*(j)}) < d(\hat{g}, \hat{\theta}_2^{*(j)})$ in the $J$ bootstrap samples that were drawn. Thus, if $I$ is an indicator function, we can define an estimator of the DCP, which we call the bootstrap discrepancy comparison probability (BDCP), as follows:
$$\mathrm{BDCP} = \frac{1}{J} \sum_{j=1}^{J} I\left[ d(\hat{g}, \hat{\theta}_1^{*(j)}) < d(\hat{g}, \hat{\theta}_2^{*(j)}) \right]. \tag{1}$$
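The uncorrected BDCP described above can be sketched in Python for two nested normal linear models (our own illustration, not from the paper; the helper names and least-squares/ML fitting are assumptions of the example):

```python
import numpy as np

def fit_lm(y, X):
    """ML estimates (beta_hat, sigma2_hat) of a normal linear model."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return beta, resid @ resid / len(y)

def neg2_loglik_lm(y, X, beta, sigma2):
    """-2 log-likelihood of y = X beta + N(0, sigma2) errors."""
    resid = y - X @ beta
    return len(y) * np.log(2 * np.pi * sigma2) + resid @ resid / sigma2

def bdcp(y, X1, X2, J=200, seed=0):
    """Plug-in BDCP: the proportion of bootstrap samples in which the null
    fit attains the smaller discrepancy d(g_hat, theta*) = -2 l(theta* | y)."""
    rng = np.random.default_rng(seed)
    n, wins = len(y), 0
    for _ in range(J):
        idx = rng.integers(n, size=n)        # draw one bootstrap sample y*
        b1, s1 = fit_lm(y[idx], X1[idx])     # null MLE under y*
        b2, s2 = fit_lm(y[idx], X2[idx])     # alternative MLE under y*
        # evaluate each bootstrap MLE on the ORIGINAL sample y
        wins += neg2_loglik_lm(y, X1, b1, s1) < neg2_loglik_lm(y, X2, b2, s2)
    return wins / J
```

The key step, following the plug-in principle, is that the MLEs are computed on the bootstrap sample but each discrepancy is evaluated on the original sample.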

3. Bias Corrections for the BDCP

An important issue that arises in the bootstrap estimation of the KLD is the negative bias of the discrepancy estimators that materializes from the “plug-in” principle. The following lemma establishes and quantifies this bias for large-sample settings under an appropriately specified candidate model.
Lemma 1.
For a large sample size, assuming that the candidate model subsumes the true model, we have
$$\mathrm{E}_g\left\{ \mathrm{E}_*\left[ -2\,\ell(\hat{\theta}^* \mid y) \right] \right\} \approx \mathrm{E}_g\left[ d(g, \hat{\theta}) \right] - k,$$
where $\mathrm{E}_*$ is the expectation with respect to the bootstrap distribution, and $k$ is the dimension of the model.
Proof. 
For a maximum likelihood estimator θ ^ , it is well known that for a large sample size and under certain regularity conditions, we have
$$(\hat{\theta} - \theta)^T \mathcal{I}(\theta \mid y)\,(\hat{\theta} - \theta) \;\overset{\cdot}{\sim}\; \chi^2_k, \tag{2}$$
provided that the model is adequately specified. In the preceding, $\chi^2_k$ denotes a centrally distributed chi-square random variable with $k$ degrees of freedom.
Now, consider the second-order Taylor series expansion of 2 ( θ ^ * | y ) about θ ^ , which results in
$$-2\,\ell(\hat{\theta}^* \mid y) \approx -2\,\ell(\hat{\theta} \mid y) + (\hat{\theta}^* - \hat{\theta})^T \mathcal{I}(\hat{\theta} \mid y)\,(\hat{\theta}^* - \hat{\theta}). \tag{3}$$
By taking the expected value of both sides of (3) with respect to the bootstrap distribution of θ ^ * , we obtain
$$\begin{aligned} \mathrm{E}_*\left[ -2\,\ell(\hat{\theta}^* \mid y) \right] &\approx \mathrm{E}_*\left[ -2\,\ell(\hat{\theta} \mid y) \right] + \mathrm{E}_*\left[ (\hat{\theta}^* - \hat{\theta})^T \mathcal{I}(\hat{\theta} \mid y)\,(\hat{\theta}^* - \hat{\theta}) \right] \\ &\approx -2\,\ell(\hat{\theta} \mid y) + k \quad \text{(by the approximation in (2))} \\ &= \mathrm{AIC} - k, \end{aligned}$$
where AIC denotes the Akaike information criterion.
Finally, it has been established that if the true model is contained in the candidate class at hand, and if the large sample properties of MLEs hold, then AIC serves as an asymptotically unbiased estimator of the KLD. Thus,
$$\mathrm{E}_g\left\{ \mathrm{E}_*\left[ -2\,\ell(\hat{\theta}^* \mid y) \right] \right\} \approx \mathrm{E}_g(\mathrm{AIC}) - k \approx \mathrm{E}_g\left( d(g, \hat{\theta}) \right) - k.$$
The preceding expression can be re-written as
$$\mathrm{E}_g\left( d(g, \hat{\theta}) \right) \approx \mathrm{E}_g\left\{ \mathrm{E}_*\left[ -2\,\ell(\hat{\theta}^* \mid y) \right] \right\} + k,$$
which implies that the bias correction $k$ must be added to the bootstrap discrepancy in the estimation of the KLD. The BD estimator corrected by the addition of $k$ will be called BDk.
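As an informal numerical check of Lemma 1 (our own illustration, with assumed example values, not part of the paper), one can compare the bootstrap average of $-2\,\ell(\hat{\theta}^* \mid y)$ with $-2\,\ell(\hat{\theta} \mid y) + k$ for a correctly specified normal model with $k = 2$ parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 200, 2000
y = rng.normal(0.0, 1.0, n)   # correctly specified: data and model both normal

def neg2_loglik(sample, mu, var):
    """-2 log-likelihood of a N(mu, var) model on `sample`."""
    return len(sample) * np.log(2 * np.pi * var) + np.sum((sample - mu) ** 2) / var

k = 2                                      # parameters: mean and variance
m_hat = neg2_loglik(y, y.mean(), y.var())  # -2 l(theta_hat | y)

boot = np.empty(B)
for b in range(B):
    ystar = y[rng.integers(n, size=n)]     # bootstrap sample y*
    # bootstrap MLE theta*, evaluated on the original sample y
    boot[b] = neg2_loglik(y, ystar.mean(), ystar.var())

gap = boot.mean() - m_hat
print(gap)   # the lemma suggests this gap should be near k
```

Under the lemma, $\mathrm{E}_*[-2\,\ell(\hat{\theta}^* \mid y)] - (-2\,\ell(\hat{\theta} \mid y))$ should be close to $k = 2$ in this setting.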
Now, focus again on Equation (3). By subtracting $-2\,\ell(\hat{\theta} \mid y)$ from both sides of the equation, we obtain
$$-2\,\ell(\hat{\theta}^* \mid y) - \left( -2\,\ell(\hat{\theta} \mid y) \right) \approx (\hat{\theta}^* - \hat{\theta})^T \mathcal{I}(\hat{\theta} \mid y)\,(\hat{\theta}^* - \hat{\theta}). \tag{4}$$
As mentioned previously, if the candidate model is adequately specified, then the distributional approximation in (2) holds true. However, if this model specification assumption is not met, then we can utilize the approximation in (4) to find a suitable bias correction via the bootstrap. The bootstrap has been used for bias corrections in similar problem contexts [3,4].
By applying the expected value with respect to the bootstrap distribution of $\hat{\theta}^*$ to both sides of (4), we obtain
$$\mathrm{E}_*\left[ -2\,\ell(\hat{\theta}^* \mid y) - \left( -2\,\ell(\hat{\theta} \mid y) \right) \right] \approx \mathrm{E}_*\left[ (\hat{\theta}^* - \hat{\theta})^T \mathcal{I}(\hat{\theta} \mid y)\,(\hat{\theta}^* - \hat{\theta}) \right].$$
The goal is then to find an approximation of $\mathrm{E}_*\left[ -2\,\ell(\hat{\theta}^* \mid y) - \left( -2\,\ell(\hat{\theta} \mid y) \right) \right]$. Note that by the law of large numbers, we have that when $J \to \infty$,
$$\frac{1}{J} \sum_{j=1}^{J} -2\,\ell(\hat{\theta}^{*(j)} \mid y) \;\to\; \mathrm{E}_*\left( -2\,\ell(\hat{\theta}^* \mid y) \right).$$
Thus, for $J \to \infty$, we can assert
$$\frac{1}{J} \sum_{j=1}^{J} \left[ -2\,\ell(\hat{\theta}^{*(j)} \mid y) \right] - \left( -2\,\ell(\hat{\theta} \mid y) \right) \;\to\; \mathrm{E}_*\left( -2\,\ell(\hat{\theta}^* \mid y) \right) - \left( -2\,\ell(\hat{\theta} \mid y) \right).$$
The preceding result shows that $\frac{1}{J} \sum_{j=1}^{J} \left[ -2\,\ell(\hat{\theta}^{*(j)} \mid y) \right] - \left( -2\,\ell(\hat{\theta} \mid y) \right)$ serves as an asymptotically unbiased estimator of $\mathrm{E}_*\left( -2\,\ell(\hat{\theta}^* \mid y) \right) - \left( -2\,\ell(\hat{\theta} \mid y) \right)$. We therefore propose using
$$k^b = \frac{1}{J} \sum_{j=1}^{J} \left[ -2\,\ell(\hat{\theta}^{*(j)} \mid y) \right] - \left( -2\,\ell(\hat{\theta} \mid y) \right)$$
as a bootstrap-based correction of the BD. A more in-depth derivation and exploration of the $k^b$ correction can be found in Cavanaugh and Shumway [5].
Subsequently, the bootstrap approximation of the KLD with a bootstrap-based bias correction is expressed as $\mathrm{E}_*\left( -2\,\ell(\hat{\theta}^* \mid y) \right) + k^b$, and is estimated by
$$\mathrm{BDb} = \frac{1}{J} \sum_{j=1}^{J} \left[ -2\,\ell(\hat{\theta}^{*(j)} \mid y) \right] + k^b.$$
It follows that the bootstrap bias-corrected BDCP would be defined as
$$\mathrm{BDCPb} = \frac{1}{J} \sum_{j=1}^{J} I\left[ d(\hat{g}, \hat{\theta}_1^{*(j)}) + k_1^b < d(\hat{g}, \hat{\theta}_2^{*(j)}) + k_2^b \right], \tag{6}$$
where $k_1^b$ and $k_2^b$ correspond to the bootstrap-based corrections for the null and alternative models, respectively.
Similarly, the $k$ bias-corrected BD is expressed as
$$\mathrm{BDk} = \frac{1}{J} \sum_{j=1}^{J} \left[ -2\,\ell(\hat{\theta}^{*(j)} \mid y) \right] + k,$$
and the $k$ bias-corrected BDCP is given by
$$\mathrm{BDCPk} = \frac{1}{J} \sum_{j=1}^{J} I\left[ d(\hat{g}, \hat{\theta}_1^{*(j)}) + k_1 < d(\hat{g}, \hat{\theta}_2^{*(j)}) + k_2 \right], \tag{7}$$
where $k_1$ and $k_2$ are the numbers of functionally independent parameters that define the null and alternative models, respectively.
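The two corrected comparison probabilities can be sketched together in Python (our own illustration; the nested normal linear models and helper names are assumptions of the example). For each bootstrap sample, both models are refit, the plug-in discrepancies are evaluated on the original data, and the $k$ and $k^b$ corrections are added inside the indicator:

```python
import numpy as np

def fit_lm(y, X):
    """ML estimates (beta_hat, sigma2_hat) of a normal linear model."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return beta, resid @ resid / len(y)

def neg2_loglik_lm(y, X, beta, sigma2):
    """-2 log-likelihood of y = X beta + N(0, sigma2) errors."""
    resid = y - X @ beta
    return len(y) * np.log(2 * np.pi * sigma2) + resid @ resid / sigma2

def corrected_bdcps(y, X1, X2, J=200, seed=0):
    """Return (BDCPk, BDCPb) for a null design X1 vs. an alternative design X2."""
    rng = np.random.default_rng(seed)
    n = len(y)
    d1, d2 = np.empty(J), np.empty(J)
    for j in range(J):
        idx = rng.integers(n, size=n)      # one bootstrap sample for BOTH fits
        b1, s1 = fit_lm(y[idx], X1[idx])
        b2, s2 = fit_lm(y[idx], X2[idx])
        d1[j] = neg2_loglik_lm(y, X1, b1, s1)
        d2[j] = neg2_loglik_lm(y, X2, b2, s2)
    # -2 l(theta_hat | y) for each model, fit on the original sample
    m1 = neg2_loglik_lm(y, X1, *fit_lm(y, X1))
    m2 = neg2_loglik_lm(y, X2, *fit_lm(y, X2))
    # k: regression coefficients plus the error variance
    k1, k2 = X1.shape[1] + 1, X2.shape[1] + 1
    # k^b: bootstrap-based corrections (mean bootstrap discrepancy minus m)
    k1b, k2b = d1.mean() - m1, d2.mean() - m2
    bdcp_k = np.mean(d1 + k1 < d2 + k2)
    bdcp_b = np.mean(d1 + k1b < d2 + k2b)
    return bdcp_k, bdcp_b
```

Pairing matters here: the null and alternative discrepancies in each indicator come from the same bootstrap sample, matching the joint bootstrap distribution described earlier.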

4. Simulation Studies

The following simulation sets are designed to explore the bias when estimating both the DCP based on the Kullback–Leibler discrepancy (KLDCP) and the expected value of the KLD. We present different hypothesis testing scenarios, not all of which are conventional, under a linear data-generating model and for varying sample sizes. Each setting exhibits three different approaches to formulating the BD: adding the bootstrap-based correction (BDb), adding k (BDk), and leaving the estimator uncorrected.

4.1. Settings for Simulation Sets

For Sets 1 to 5, the true data-generating model is of the form
$$y_i = x_i^T \beta_0 + \epsilon_i,$$
with $\beta_0^T = \left( \beta_{0,1}, \beta_{0,2}, \ldots, \beta_{0,p} \right)$, $x_i^T = \left( 1, x_{i2}, \ldots, x_{ip} \right)$, and
$$\left( x_{i2}, \ldots, x_{ip} \right)^T \sim N_{p-1}(\mu, \Sigma), \tag{8}$$
where the entries of $\mu$ are chosen from $\{-1, 1\}$ with equal probability, and $\Sigma = \mathrm{diag}_{p-1}(100)$. For Sets 1 to 4, we have $\epsilon_i \sim N(0, \sigma_0^2)$; for Set 5, we have that $\epsilon_i \sim t_{df=5}$, where $t_{df}$ denotes the Student’s $t$ distribution based on $df$ degrees of freedom; and for Set 6, we have that $\epsilon_i \sim Z \cdot N(0, 1) + (1 - Z) \cdot N(0, 50)$, where $Z \sim \mathrm{Bernoulli}(\pi)$ with $\pi = 0.85$.
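A small generator for these settings might look as follows (our own sketch; the function name and arguments are illustrative, not from the paper). It draws the covariates from $N_{p-1}(\mu, \Sigma)$ as described above and supports the three error families used across the sets:

```python
import numpy as np

def simulate_set(n, beta0, error="normal", seed=0):
    """Draw one sample from the generating model y_i = x_i^T beta0 + eps_i,
    with covariates N_{p-1}(mu, diag(100)) and mu entries drawn from {-1, 1}."""
    rng = np.random.default_rng(seed)
    p = len(beta0)
    mu = rng.choice([-1.0, 1.0], size=p - 1)      # entries of mu
    Xcov = rng.normal(mu, 10.0, size=(n, p - 1))  # Sigma = diag(100) -> sd = 10
    X = np.column_stack([np.ones(n), Xcov])       # prepend the intercept column
    if error == "normal":                         # normal errors (variance 50 here)
        eps = rng.normal(0.0, np.sqrt(50.0), n)
    elif error == "t":                            # Set 5: Student's t, 5 df
        eps = rng.standard_t(5, n)
    else:                                         # Set 6: 0.85 N(0,1) + 0.15 N(0,50)
        z = rng.random(n) < 0.85
        eps = np.where(z, rng.normal(0.0, 1.0, n),
                       rng.normal(0.0, np.sqrt(50.0), n))
    return X, X @ beta0 + eps
```

The normal-error variance is hard-coded to 50 for this sketch, matching the value used in several of the sets; a fuller implementation would expose it as a parameter.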
In the setting at hand, the true data-generating model $g$ has parameters $\theta = (\beta_0^T, \sigma_0^2)^T$. Hurvich and Tsai [6] showed that for the family of approximating models $y = X\beta + \epsilon$, where $X$ is the design matrix and $\epsilon \sim N(0, \sigma^2 I_n)$, with maximum likelihood estimators given by
$$\hat{\beta} = (X^T X)^{-1} X^T y$$
and
$$\hat{\sigma}^2 = \frac{(y - X\hat{\beta})^T (y - X\hat{\beta})}{n},$$
the KLD measure $d(g, \hat{\theta})$ is given by
$$d(g, \hat{\theta}) = n \log(2\pi \hat{\sigma}^2) + \frac{n \sigma_0^2}{\hat{\sigma}^2} + \frac{(X\beta_0 - X\hat{\beta})^T (X\beta_0 - X\hat{\beta})}{\hat{\sigma}^2}. \tag{9}$$
The expected value of the KLD for the null and the alternative models was approximated by averaging the KLD over 5000 samples generated from $g$. These 5000 KLD values, computed using (9), approximate the joint distribution of $d(g, \hat{\theta}_1)$ and $d(g, \hat{\theta}_2)$; hence, the simulation-based estimator of the KLDCP is given by
$$\hat{P} = \frac{1}{5000} \sum_{i=1}^{5000} I\left[ d(g, \hat{\theta}_1^{(i)}) < d(g, \hat{\theta}_2^{(i)}) \right]. \tag{10}$$
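The closed-form discrepancy of Hurvich and Tsai can be coded directly (our own sketch; it assumes the candidate design $X$ contains the true mean structure, with $\beta_0$ zero-padded to the candidate’s columns where needed):

```python
import numpy as np

def kld_linear(X, beta0, sigma0_sq, y):
    """Closed-form KLD d(g, theta_hat) for a fitted normal linear model,
    given the true coefficients beta0 (on the columns of X) and sigma0^2."""
    n = len(y)
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta_hat
    sigma_hat_sq = resid @ resid / n         # ML variance estimate
    shift = X @ beta0 - X @ beta_hat         # mean misfit X beta0 - X beta_hat
    return (n * np.log(2 * np.pi * sigma_hat_sq)
            + n * sigma0_sq / sigma_hat_sq
            + shift @ shift / sigma_hat_sq)
```

Averaging this quantity over repeated samples from $g$, and comparing null and alternative fits sample by sample, yields the simulation-based estimator of the KLDCP.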
This KLDCP estimate is calculated 100 times in order to estimate the KLDCP distribution and its expected value.
Finally, for each of the 5000 samples, we calculate the BD and the BDb using 200 bootstrap samples. However, to attenuate the simulation variability incurred by the mixture distribution, the number of bootstrap samples in Set 6 was increased to 500. The results displayed in the tables are based on averages over the 5000 samples.
  • Set 1: Null hypothesis is correctly specified, and alternative hypothesis is overspecified.
Consider the true data-generating model given by
$$y_i = \beta_{0,1} + \beta_{0,2} x_{i2} + \beta_{0,3} x_{i3} + \epsilon_i,$$
where $\epsilon_i \sim N(0, 50)$, $\beta_{0,1} = 1$, $\beta_{0,2} = \beta_{0,3} = 0.5$, and $(x_{i2}, x_{i3})^T$ is sampled as indicated in (8).
For the hypothesis testing setting in Set 1, the null and alternative models are defined as
$$H_1: y_i = \beta_1 + \beta_2 x_{i2} + \beta_3 x_{i3}, \qquad H_2: y_i = \beta_1 + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5} + \beta_6 x_{i6} + \beta_7 x_{i7}.$$
Note that the null model is adequately specified, while the alternative model contains the true model plus four additional explanatory variables. These extra explanatory variables are generated from the distribution indicated in (8).
  • Set 2: Null hypothesis is underspecified, and alternative hypothesis is correctly specified.
Consider the true data-generating model given by
$$y_i = \beta_{0,1} + \beta_{0,2} x_{i2} + \beta_{0,3} x_{i3} + \beta_{0,4} x_{i4} + \beta_{0,5} x_{i5} + \epsilon_i,$$
where $\epsilon_i \sim N(0, 45)$, $\beta_{0,1} = 1$, $\beta_{0,2} = 0.11$, $\beta_{0,3} = 0.13$, $\beta_{0,4} = 0.12$, $\beta_{0,5} = 0.11$, and $(x_{i2}, x_{i3}, \ldots, x_{i5})^T$ is sampled as indicated in (8).
For the hypothesis testing setting in Set 2, the null and alternative models are
$$H_1: y_i = \beta_1 + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4}, \qquad H_2: y_i = \beta_1 + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} + \beta_5 x_{i5}.$$
Here, the alternative model has the same structure as the data-generating model, but the null model is missing one of the explanatory variables in the true model, namely $x_5$.
  • Set 3: Both null and alternative models are underspecified, but the null is closer to the data-generating model.
Consider the true data-generating model given by
$$y_i = \beta_{0,1} + \beta_{0,2} x_{i2} + \beta_{0,3} x_{i3} + \beta_{0,4} x_{i4} + \beta_{0,5} x_{i5} + \beta_{0,6} x_{i6} + \epsilon_i,$$
where $\epsilon_i \sim N(0, 50)$, $\beta_{0,1} = 1$, $\beta_{0,2} = \beta_{0,3} = 0.5$, $\beta_{0,4} = \beta_{0,5} = -0.5$, $\beta_{0,6} = 0.1$, and $(x_{i2}, x_{i3}, \ldots, x_{i6})^T$ is sampled as indicated in (8).
For the hypothesis testing setting in Set 3, the null and alternative models are
$$H_1: y_i = \beta_1 + \beta_2 x_{i2} + \beta_3 x_{i3}, \qquad H_2: y_i = \beta_1 + \beta_4 x_{i4} + \beta_6 x_{i6}.$$
In this setting, both the null and alternative candidate models have the same number of explanatory variables, and they are both missing variable $x_5$. However, there is a slight difference in the effect sizes of the variables for these models. For the alternative, the effect sizes are $-0.5$ and $0.1$ for $x_4$ and $x_6$, respectively. On the other hand, the effect size for the null model is $0.5$ for both $x_2$ and $x_3$. When comparing the null and alternative models, the smaller effect size on $x_6$ sets the alternative further away from the true model.
  • Set 4: Both null and alternative models are equally underspecified.
Consider the true data-generating model given by
$$y_i = \beta_{0,1} + \beta_{0,2} x_{i2} + \beta_{0,3} x_{i3} + \beta_{0,4} x_{i4} + \beta_{0,5} x_{i5} + \beta_{0,6} x_{i6} + \beta_{0,7} x_{i7} + \epsilon_i,$$
with $\epsilon_i \sim N(0, 50)$, $\beta_{0,1} = 1$, $\beta_{0,2} = \beta_{0,3} = \beta_{0,6} = \beta_{0,7} = 0.5$, $\beta_{0,4} = \beta_{0,5} = -0.5$, and $(x_{i2}, x_{i3}, \ldots, x_{i7})^T$ is sampled as indicated in (8).
For the hypothesis testing setting in Set 4, the null and alternative models are
$$H_1: y_i = \beta_1 + \beta_2 x_{i2} + \beta_3 x_{i3}, \qquad H_2: y_i = \beta_1 + \beta_4 x_{i4} + \beta_5 x_{i5}.$$
Here, the null and alternative candidate models are equally underspecified because they have the same number of explanatory variables with the same effect sizes, and neither model captures the true data-generating model.
  • Set 5: Null model has correct mean specification and alternative model is overspecified, but both are misspecified with respect to the error distribution, which is a Student’s t distribution.
Consider the true data-generating model given by
$$y_i = \beta_{0,1} + \epsilon_i,$$
with $\epsilon_i \sim t_{df=5}$ and $\beta_{0,1} = 1$. Therefore, $\sigma_0^2 = 5/3$.
For the hypothesis testing setting in Set 5, the null and alternative models are
$$H_1: y_i = \beta_1, \qquad H_2: y_i = \beta_1 + \beta_2 x_{i2},$$
where $x_{i2} \sim N(1, 100)$. This setting is similar to the one displayed in Set 1, where the null is properly specified while the alternative is overspecified. However, the models in the setting at hand inadequately specify the distribution of the errors.
  • Set 6: Null model has correct mean specification, and the alternative model is overspecified, but both are misspecified with respect to the error distribution, which is a mixture of normals.
Consider the true data-generating model given by
$$y_i = \beta_{0,1} + \epsilon_i,$$
with $\epsilon_i \sim Z \cdot N(0, 1) + (1 - Z) \cdot N(0, 50)$, where $Z \sim \mathrm{Bernoulli}(\pi)$ with $\pi = 0.85$. Therefore,
$$\sigma_0^2 = 0.85(1) + 0.15(50) = 8.35.$$
For the hypothesis testing setting in Set 6, the null and alternative models are
$$H_1: y_i = \beta_1, \qquad H_2: y_i = \beta_1 + \beta_2 x_{i2},$$
where $x_{i2} \sim N(1, 100)$. This setting is similar to the one featured in Set 5. However, the errors in the setting at hand are generated from a mixture of normal distributions.

4.2. KLDCP Estimates from Simulations

For the tables showing the KLDCP simulation results, the columns are labeled as follows.
(1) KLDCP corresponds to results based on the distribution of 100 replicates of the KLDCP, where each KLDCP is calculated using (10). Note that the null and alternative KLD joint distribution is characterized based on discrepancy replicates obtained through (9).
(2) BDCPb corresponds to results based on the distribution of 5000 replicates of BDCPb. Each BDCPb is computed using (6) with 200 bootstrap samples for Sets 1–5 and 500 bootstrap samples for Set 6.
(3) BDCPk corresponds to results based on the distribution of 5000 replicates of BDCPk. Each BDCPk is computed using (7) with 200 bootstrap samples for Sets 1–5 and 500 bootstrap samples for Set 6.
(4) BDCP corresponds to results based on the distribution of 5000 replicates of the uncorrected BDCP. Each BDCP is computed using (1) with 200 bootstrap samples for Sets 1–5 and 500 bootstrap samples for Set 6.

4.3. Estimates of the Expected KLD from Simulations

For the tables showing the KLD results, the columns are labeled as follows.
(1) E(KLD) corresponds to the average of 5000 discrepancies calculated using (9).
(2) E(BD) corresponds to the average of 5000 replicates of BD, where each BD is calculated by
$$\frac{1}{M} \sum_{m=1}^{M} -2\,\ell(\hat{\theta}^{*(m)} \mid y).$$
We have that $M = 200$ for Sets 1–5 and $M = 500$ for Set 6.
(3) $\Delta$BDb corresponds to the difference between the estimate of E(BD), with each BD corrected by $k^b$, and the estimate of E(KLD) described in (1). In other words, if we let $j \in \{1, 2, \ldots, 5000\}$ index the simulated data sets, $\widetilde{\mathrm{BD}}_j$ denote the BD estimate for data set $j$, and $k_j^b$ denote the $k^b$ correction for data set $j$, then
$$\Delta \mathrm{BDb} = \frac{1}{5000} \sum_{j=1}^{5000} \left( \widetilde{\mathrm{BD}}_j + k_j^b \right) - \mathrm{E}(\mathrm{KLD}).$$
(4) $\Delta$BDk shows the same difference described in (3), but using $k$ instead of $k^b$, which results in
$$\Delta \mathrm{BDk} = \frac{1}{5000} \sum_{j=1}^{5000} \left( \widetilde{\mathrm{BD}}_j + k \right) - \mathrm{E}(\mathrm{KLD}).$$

4.4. Discussion of Simulation Results

As mentioned previously, in the conventional hypothesis testing scenario for comparing nested models, Riedle, Neath and Cavanaugh [1] established that the uncorrected BDCP approximates the p-value derived from the likelihood ratio test. Therefore, in the case where the null candidate model is correctly specified, both the uncorrected BDCP and the p-value have a $\mathrm{Uniform}(0, 1)$ distribution. This behavior is displayed in Table 1, where for large sample sizes, the mean and median of the BDCP distribution are around 0.5. This is a problematic feature of the uncorrected BDCP and p-values because the measure does not reliably favor the null model in those settings where the null is true. However, we see that for large sample sizes, both the BDCPk and the BDCPb values are close to 1, which clearly favors the null model.
Table 2 shows the results from the setting where the alternative hypothesis is correctly specified, while the null is underspecified. Here, we would expect all the discrepancy probabilities to be close to 0, as seen in the case where the sample size is $N = 500$. However, for smaller sample sizes, i.e., $N = 25$ and $N = 50$, we observe larger values for the discrepancy probabilities. In fact, for $N = 25$, the BDCPb is 0.89 and, with a mean and median close to 0.5, the uncorrected BDCP exhibits similar behavior to the case where the null is true. This phenomenon is expected within the framework of model selection, where additional explanatory variables are favorable if there is a sufficient sample size to adequately estimate their effects. If the sample size is too small to construct reliable estimates, then it is best to choose smaller models, even at the expense of model misspecification.
The results from Table 1, Table 3, Table 4, Table 5 and Table 6 show that when estimating the KLDCP with a small sample size ($N = 25$ to $N = 100$), the BDb performs either better than or as well as the BDk. For large sample sizes, all simulation sets exhibit a similar performance for both corrections.
For discrepancy estimation, Table 7, Table 8, Table 9 and Table 10 show that across all sample sizes, $k^b$ over-corrects for the bias of the discrepancy approximation, and the over-correction is more prominent for small sample sizes. It is worth noting that this evident over-estimation from the BDb is accompanied by a superior bias reduction of the corresponding KLDCP estimator. For instance, Table 7 shows a significant over-estimation by BDb compared to BDk, especially in the small sample settings. However, the corresponding estimator of the KLDCP, displayed in Table 1, exhibits less bias for BDCPb than for BDCPk.
Finally, Table 11 and Table 12 show that, across all sample sizes, the correction by $k^b$ markedly reduces the bias compared to the correction by $k$. This means that in the setting where the mean structure is correctly specified for the null and overspecified for the alternative, but both models are incorrectly specified with respect to the error distribution, the bootstrap-based correction evidently outperforms the simple correction of $k$.
In most cases, however, the bias reductions resulting from the $k^b$ and the $k$ corrections are comparable. Therefore, our simulation studies suggest that if the null and/or the alternative models are misspecified, then correcting by either $k^b$ or $k$ will generally yield comparable estimators of the expected KLDCP.

5. Application: Creatine Kinase Levels during Football Preseason

In this section, we apply the BDCP to a data set from a biomedical setting. The goal of this application is to understand the changes in creatine kinase (CK) levels observed in the blood samples of college football players during preseason training. In order to properly explain the variation of CK, we must select between competing models that use different demographic and clinical variables. We will analyze the models selected by the $k^b$-corrected, the $k$-corrected and the uncorrected BDCP, and we will compare the results to the selection of models via the more conventional p-value approach.

5.1. Overview of Application

During strenuous exercise, skeletal muscle cells break down and release a variety of intracellular contents. When in excess, a condition known as exertional rhabdomyolysis (ER) can occur, which may result in life-threatening complications such as renal failure, cardiac arrhythmia and compartment syndrome. Creatine kinase (CK) is one of the proteins released during muscle breakdown, and measuring its levels is the most sensitive test for assessing muscular damage that could lead to ER [7].
During the off-season workouts in January 2011, a group of 13 University of Iowa football players developed ER. This event led to a prospective study where 30 University of Iowa football athletes were followed during a 34-day preseason workout camp. Variables such as body mass index (BMI) and CK levels were obtained from blood samples that were drawn on the first, third, and seventh days of the camp. Other demographic and clinical variables such as age, number of semesters in the program and history of rhabdomyolysis were also collected.
The initial results of the study, published by Smoot et al. [8], show that the CK levels at later time points were significantly different than the levels at earlier times. However, most of the clinical and demographic variables were not significant in explaining the levels of CK. One of the underlying issues with this type of modeling analysis is that the significance of each variable can only be assessed by hypothesis tests with nested models. For example, suppose that we wish to determine the significance of BMI in the presence of semesters in the program. To obtain a p-value for BMI, we need to formulate a hypothesis test where the null model only contains semesters in the program, while the alternative model contains both BMI and semesters in the program.
Although this setting may be useful in some scenarios, it is too limiting. For instance, suppose that we wish to choose between two non-nested models where one contains BMI and the other contains semesters in the program. Although a conventional test based on linear regression models would not be able to answer this question, the BDCP approach could indeed determine the propriety of either model in this type of non-nested setting.
In the analysis of this data set, we let $CK_3$ be the log of the CK levels measured on the seventh day of the camp, $CK_1$ be the log of the CK levels measured on the first day of the camp, and $\mathit{Semesters}$ be the number of semesters in the program. Of note, the log transformation is routinely applied in studies involving CK levels in order to justify approximate normality, as the raw levels tend to have heavily right-skewed distributions.
Now, consider the following hypothesis testing settings.
  • Setting 1: Testing the propriety of the model containing $CK_1$.
    $$H_1: CK_3 = \beta_1, \qquad H_2: CK_3 = \beta_1 + \beta_2\, CK_1.$$
  • Setting 2: Testing the propriety of the model containing $CK_1$ and $\mathit{Semesters}$ over the model containing only $CK_1$.
    $$H_1: CK_3 = \beta_1 + \beta_2\, CK_1, \qquad H_2: CK_3 = \beta_1 + \beta_2\, CK_1 + \beta_3\, \mathit{Semesters}.$$
  • Setting 3: Head-to-head comparison of non-nested models.
    $$H_1: CK_3 = \beta_1 + \beta_2\, CK_1 + \beta_3\, \mathit{BMI}, \qquad H_2: CK_3 = \beta_1 + \beta_2\, CK_1 + \beta_3\, \mathit{Semesters}.$$

5.2. Results of Application

The results for the application are summarized in Table 13. Settings 1 and 2 illustrate the congruence between the BDCP and p-values in the case of hypothesis testing based on nested models. Setting 1 assesses the propriety of a model that includes only the intercept against a model that includes both the intercept and the levels of $CK_1$. The p-value for $CK_1$ in this setting is 0.001, which means that, using a level $\alpha$ of 0.05, $CK_1$ is significant in explaining the variation in $CK_3$ levels. Both the BDCPk and BDCPb are 0.075, which means that the null model is preferred in only 7.5% of the bootstrap samples, indicating that the model containing $CK_1$ is superior.
Once we establish that $CK_1$ is an important variable to include in our model, the next step is to determine if additional variables can improve our model fit. Setting 2 displays a hypothesis test where the null model only contains $CK_1$, while the alternative contains both $CK_1$ and $\mathit{Semesters}$. The p-value for $\mathit{Semesters}$ is 0.734, which means that $\mathit{Semesters}$ is not statistically significant, and a reasonable investigator would choose to exclude $\mathit{Semesters}$ from the final model. The corrected BDCP values arrive at the same conclusion. For instance, the BDCPb is 0.995, which indicates that across multiple bootstrap samples, the null model is chosen 99.5% of the time; therefore, the BDCP encourages us to choose the model that excludes $\mathit{Semesters}$.
The rationale for testing $\mathit{Semesters}$ is based on the idea that more senior athletes tend to rigorously maintain their workout habits during the off-season, mostly because of experience and maturity. Therefore, $\mathit{Semesters}$ is a variable that may confound the effects of $CK_1$ on the variation of $CK_3$. Additionally, the medical literature has shown that BMI highly correlates with CK levels and the development of ER [9], which means that one should also test for the propriety of models that include $\mathit{BMI}$. Thus, one could ask if a model featuring $\mathit{BMI}$ would be better than a model featuring $\mathit{Semesters}$. This results in a hypothesis testing scenario where the null and alternative models are non-nested, as exhibited in Setting 3.
First, note that the p-values displayed in the table for Setting 3 do not answer the question at hand: these p-values are obtained from partial tests applied to the full model containing both variables. The BDCP, on the other hand, gives us meaningful information about the performance of adding BMI versus adding Semesters. The BDCPb tells us that there is a 78% probability that the model containing BMI is a better fit than the model containing Semesters. If we use the BDCPk instead, the probability increases to 81.5%. In both cases, if we are debating whether to include BMI or Semesters as an adjusting variable, the BDCP clearly favors the inclusion of BMI.
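The pairwise comparison behind these probabilities can be sketched in code. The following is an illustrative Python implementation, not the authors' R code, and all function and variable names are ours: it resamples cases with replacement, fits both candidate linear models to each bootstrap sample, evaluates each bootstrap fit against the original data via the Gaussian −2 log-likelihood, and reports the proportion of resamples in which the null model attains the smaller estimated discrepancy (the uncorrected BDCP).

```python
import numpy as np

def fit_ols(y, X):
    # Least-squares fit; sigma2 is the maximum likelihood error variance
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = float(np.mean((y - X @ beta) ** 2))
    return beta, sigma2

def neg2_loglik(y, X, beta, sigma2):
    # Gaussian -2 log-likelihood of the original sample under a fitted model
    resid = y - X @ beta
    return len(y) * np.log(2 * np.pi * sigma2) + float(resid @ resid) / sigma2

def bdcp(y, X_null, X_alt, B=200, seed=0):
    # Uncorrected BDCP: the proportion of bootstrap samples for which the
    # null model's estimated discrepancy (each bootstrap fit evaluated
    # against the original data) is smaller than the alternative's.
    rng = np.random.default_rng(seed)
    n = len(y)
    wins = 0
    for _ in range(B):
        idx = rng.integers(0, n, n)   # resample cases with replacement
        d_null = neg2_loglik(y, X_null, *fit_ols(y[idx], X_null[idx]))
        d_alt = neg2_loglik(y, X_alt, *fit_ols(y[idx], X_alt[idx]))
        wins += d_null < d_alt
    return wins / B
```

In a setting analogous to Setting 1, `X_null` would hold only an intercept column and `X_alt` would append the predictor columns; a BDCP near zero, as in Table 13, favors the alternative model. Note that the models need not be nested for this comparison to be meaningful.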

6. Conclusions

When deciding between two competing models, practitioners of statistics typically rely on traditional hypothesis testing methods, which assume that one of the candidate models is properly specified. This approach is problematic because it is unreasonable to assume that one of the proposed models is precisely true. In addition, these methods are only applicable to nested models. To avoid these assumptions and structural limitations, Riedle, Neath, and Cavanaugh [1] propose the use of the bootstrap discrepancy comparison probability (BDCP) to assess the propriety of the fit of two candidate models. However, the bootstrap discrepancy (BD) utilized in their work provides a biased estimator of the Kullback–Leibler discrepancy (KLD).
When hypothesis testing assumptions are met, the BDCP asymptotically approximates the likelihood ratio test p-value. Therefore, like the p-value, the BDCP is uniformly distributed when the null hypothesis is true. Hence, in settings where the null model holds, the BDCP is of limited value in choosing the appropriate model.
In this paper, we proposed utilizing the k_b- or k-corrected BDCP, namely the BDCPb and the BDCPk, respectively. The BDCPb employs the BDb, a bootstrap-corrected estimator of the KLD, while the BDCPk uses the BDk, a BD corrected by adding the number of functionally independent parameters in the candidate model. We showed that in most settings, the BDb serves as an over-corrected estimator of the KLD, yet the corresponding BDCPb is less biased than the BDCPk for the estimation of the KLDCP. However, in the case of distributional misspecification, we showed that the BDb has negligible bias for the estimation of the expected value of the KLD.
Moreover, the estimation of the bootstrap correction k_b utilizes the same bootstrap samples used to calculate the BD; therefore, we argue that the computational requirements of estimating k_b are not overly burdensome. However, if the sample size is moderately large relative to the number of parameters in the model, we showed that using k to correct the bias generally results in comparable values of the KLDCP estimates.
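As a concrete illustration of how these corrections enter the pairwise comparison, the sketch below (our own notation, not code from the paper) applies a per-model additive correction to arrays of uncorrected bootstrap discrepancy values before taking the proportion in which the null model wins. Setting the corrections to k, the number of functionally independent parameters in each model, yields the BDCPk; substituting estimated bootstrap corrections k_b, whose exact construction is developed in the paper and not reproduced here, yields the BDCPb.

```python
import numpy as np

def corrected_bdcp(bd_null, bd_alt, c_null, c_alt):
    # Apply an additive bias correction to each model's bootstrap
    # discrepancy values, then report the proportion of bootstrap
    # samples in which the corrected null discrepancy is smaller.
    null = np.asarray(bd_null) + c_null
    alt = np.asarray(bd_alt) + c_alt
    return float(np.mean(null < alt))

# Toy example with three bootstrap samples. For a Gaussian linear model,
# k counts the regression coefficients plus the error variance, so an
# intercept-only null has k = 2 and an intercept-plus-one-slope
# alternative has k = 3.
bd_null = [10.0, 12.0, 11.0]
bd_alt = [10.5, 11.0, 13.0]
print(corrected_bdcp(bd_null, bd_alt, 2, 3))  # 2 of 3 comparisons favor the null
```

Because the null model typically has fewer parameters, its corrected discrepancies are shifted up by less than the alternative's, which is what pulls the corrected BDCP values in Tables 1–6 away from the uncorrected BDCP.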

Author Contributions

Conceptualization, A.D. and J.C.; Formal analysis, A.D. and J.C.; Methodology, A.D. and J.C.; Supervision, J.C.; Writing—original draft, A.D. and J.C.; Writing—review and editing, A.D. and J.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The R code used in generating the data for the simulation study is available on request from the corresponding author. The data for the application are not publicly available since the dataset is confidential.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Riedle, B.; Neath, A.; Cavanaugh, J.E. Reconceptualizing the p-Value From a Likelihood Ratio Test: A Probabilistic Pairwise Comparison of Models Based on Kullback–Leibler Discrepancy Measures. J. Appl. Stat. 2020, 47, 13–15.
2. Efron, B.; Tibshirani, R. An Introduction to the Bootstrap, 2nd ed.; Chapman Hall: New York, NY, USA, 1993; pp. 31–37.
3. Efron, B. Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation. J. Am. Stat. Assoc. 1983, 78, 316–331.
4. Efron, B. How Biased is the Apparent Error Rate of a Prediction Rule? J. Am. Stat. Assoc. 1986, 81, 461–470.
5. Cavanaugh, J.E.; Shumway, R.H. A Bootstrap Variant of AIC for State-Space Model Selection. Stat. Sin. 1997, 7, 473–496.
6. Hurvich, C.M.; Tsai, C. Regression and Time Series Model Selection in Small Samples. Biometrika 1989, 76, 297–307.
7. Torres, P.; Helmstetter, J.; Kaye, A.M.; Kaye, A.D. Rhabdomyolysis: Pathogenesis, Diagnosis, and Treatment. Ochsner J. 2015, 15, 58–69.
8. Smoot, M.K.; Cavanaugh, J.E.; Amendola, A.; West, D.R.; Herwaldt, L.A. Creatine Kinase Levels During Preseason Camp in National Collegiate Athletic Association Division I Football Athletes. Clin. J. Sport Med. 2014, 5, 438–440.
9. Vasquez, C.R.; DiSanto, T.; Reilly, J.P.; Forker, C.M.; Holena, D.N.; Wu, Q.; Lanken, P.N.; Christie, J.D.; Shashaty, M.G.S. Relationship of Body Mass Index, Serum Creatine Kinase, and Acute Kidney Injury After Severe Trauma. J. Trauma Acute Care Surg. 2020, 89, 179–185.
Table 1. Distribution approximations for Set 1, where the null model is correctly specified, while the alternative model is overspecified.

Statistic | KLDCP | BDCPb | BDCPk | BDCP
N = 500
Mean      | 1.000 | 0.878 | 0.868 | 0.515
Median    | 1.000 | 1.000 | 1.000 | 0.515
SD        | 0.000 | 0.233 | 0.241 | 0.282
N = 100
Mean      | 1.000 | 0.918 | 0.864 | 0.564
Median    | 1.000 | 1.000 | 0.995 | 0.580
SD        | 0.000 | 0.186 | 0.225 | 0.256
N = 50
Mean      | 1.000 | 0.966 | 0.875 | 0.631
Median    | 1.000 | 1.000 | 0.980 | 0.650
SD        | 0.000 | 0.111 | 0.193 | 0.220
N = 25
Mean      | 1.000 | 0.999 | 0.886 | 0.739
Median    | 1.000 | 1.000 | 0.955 | 0.755
SD        | 0.000 | 0.012 | 0.144 | 0.156
Table 2. Distribution approximations for Set 2, where the null model is underspecified, while the alternative model is correctly specified.

Statistic | KLDCP | BDCPb | BDCPk | BDCP
N = 500
Mean      | 0.001 | 0.022 | 0.021 | 0.011
Median    | 0.001 | 0.000 | 0.000 | 0.000
SD        | 0.000 | 0.088 | 0.085 | 0.043
N = 100
Mean      | 0.156 | 0.470 | 0.428 | 0.264
Median    | 0.156 | 0.340 | 0.280 | 0.170
SD        | 0.005 | 0.390 | 0.378 | 0.257
N = 50
Mean      | 0.372 | 0.691 | 0.597 | 0.409
Median    | 0.372 | 0.905 | 0.630 | 0.360
SD        | 0.007 | 0.350 | 0.354 | 0.266
N = 25
Mean      | 0.617 | 0.890 | 0.698 | 0.536
Median    | 0.617 | 0.990 | 0.785 | 0.535
SD        | 0.006 | 0.213 | 0.280 | 0.222
Table 3. Distribution approximations for Set 3, where the null and alternative models are underspecified, but the null model is closer to the true data-generating model.

Statistic | KLDCP | BDCPb | BDCPk | BDCP
N = 500
Mean      | 1.000 | 1.000 | 1.000 | 1.000
Median    | 1.000 | 1.000 | 1.000 | 1.000
SD        | 0.000 | 0.013 | 0.013 | 0.013
N = 100
Mean      | 0.979 | 0.910 | 0.910 | 0.910
Median    | 0.979 | 1.000 | 1.000 | 1.000
SD        | 0.002 | 0.244 | 0.244 | 0.244
N = 50
Mean      | 0.916 | 0.807 | 0.808 | 0.808
Median    | 0.916 | 0.970 | 0.970 | 0.970
SD        | 0.004 | 0.311 | 0.309 | 0.309
N = 25
Mean      | 0.804 | 0.692 | 0.699 | 0.699
Median    | 0.805 | 0.845 | 0.840 | 0.840
SD        | 0.005 | 0.314 | 0.303 | 0.303
Table 4. Distribution approximations for Set 4, where the null and alternative models are equally underspecified.

Statistic | KLDCP | BDCPb | BDCPk | BDCP
N = 500
Mean      | 0.498 | 0.507 | 0.507 | 0.507
Median    | 0.498 | 0.570 | 0.580 | 0.580
SD        | 0.007 | 0.478 | 0.478 | 0.478
N = 100
Mean      | 0.500 | 0.510 | 0.509 | 0.509
Median    | 0.500 | 0.562 | 0.567 | 0.567
SD        | 0.007 | 0.442 | 0.442 | 0.442
N = 50
Mean      | 0.500 | 0.502 | 0.502 | 0.502
Median    | 0.500 | 0.505 | 0.515 | 0.515
SD        | 0.007 | 0.407 | 0.406 | 0.406
N = 25
Mean      | 0.501 | 0.501 | 0.501 | 0.501
Median    | 0.501 | 0.490 | 0.495 | 0.495
SD        | 0.007 | 0.353 | 0.345 | 0.345
Table 5. Distribution approximations for Set 5, where the null and alternative models are misspecified with respect to the error distribution. Here, the errors are generated from a Student’s t distribution.

Statistic | KLDCP | BDCPb | BDCPk | BDCP
N = 500
Mean      | 1.000 | 0.794 | 0.794 | 0.499
Median    | 1.000 | 1.000 | 1.000 | 0.500
SD        | 0.000 | 0.329 | 0.328 | 0.289
N = 100
Mean      | 1.000 | 0.807 | 0.794 | 0.507
Median    | 1.000 | 1.000 | 1.000 | 0.515
SD        | 0.000 | 0.318 | 0.323 | 0.284
N = 50
Mean      | 1.000 | 0.825 | 0.790 | 0.508
Median    | 1.000 | 1.000 | 0.995 | 0.505
SD        | 0.000 | 0.301 | 0.315 | 0.273
N = 25
Mean      | 1.000 | 0.862 | 0.790 | 0.525
Median    | 1.000 | 1.000 | 0.985 | 0.530
SD        | 0.000 | 0.270 | 0.306 | 0.261
Table 6. Distribution approximations for Set 6, where the null and alternative models are misspecified with respect to the error distribution. Here, the errors are generated from a mixture of normal distributions.

Statistic | KLDCP | BDCPb | BDCPk | BDCP
N = 500
Mean      | 1.000 | 0.783 | 0.786 | 0.487
Median    | 1.000 | 1.000 | 1.000 | 0.484
SD        | 0.000 | 0.338 | 0.335 | 0.289
N = 100
Mean      | 1.000 | 0.808 | 0.793 | 0.495
Median    | 1.000 | 1.000 | 0.998 | 0.496
SD        | 0.000 | 0.322 | 0.325 | 0.283
N = 50
Mean      | 1.000 | 0.851 | 0.793 | 0.502
Median    | 1.000 | 1.000 | 0.994 | 0.494
SD        | 0.000 | 0.286 | 0.311 | 0.269
N = 25
Mean      | 1.000 | 0.906 | 0.787 | 0.509
Median    | 1.000 | 1.000 | 0.986 | 0.490
SD        | 0.000 | 0.229 | 0.300 | 0.246
Table 7. Expected value of the KLD, its bootstrap estimate, and the bias of the corrected bootstrap estimates for the null and alternative models in Set 1. Here, the null model is correctly specified, while the alternative model is overspecified.

Hypothesis  | E(KLD)   | E(BD)    | ΔBDb   | ΔBDk
N = 500
Null        | 3378.949 | 3375.407 | 0.488  | 0.411
Alternative | 3383.138 | 3375.578 | 0.686  | 0.362
N = 100
Null        | 679.282  | 675.291  | 0.385  | −0.030
Alternative | 684.115  | 676.667  | 2.518  | 0.521
N = 50
Null        | 342.167  | 338.498  | 1.267  | 0.268
Alternative | 348.245  | 342.348  | 7.476  | 2.065
N = 25
Null        | 174.334  | 171.169  | 3.657  | 0.910
Alternative | 183.828  | 193.249  | 43.328 | 17.290
Table 8. Expected value of the KLD, its bootstrap estimate, and the bias of the corrected bootstrap estimates for the null and alternative models in Set 2. Here, the null model is underspecified, while the alternative model is correctly specified.

Hypothesis  | E(KLD)   | E(BD)    | ΔBDb   | ΔBDk
N = 500
Null        | 3340.491 | 3335.733 | 0.410  | 0.290
Alternative | 3328.467 | 3322.581 | 0.319  | 0.143
N = 100
Null        | 672.373  | 667.928  | 1.210  | 0.520
Alternative | 671.137  | 665.628  | 1.493  | 0.454
N = 50
Null        | 339.515  | 334.726  | 1.891  | 0.226
Alternative | 339.923  | 334.181  | 2.888  | 0.305
N = 25
Null        | 174.136  | 171.376  | 7.446  | 2.223
Alternative | 176.073  | 174.320  | 13.270 | 4.106
Table 9. Expected value of the KLD, its bootstrap estimate, and the bias of the corrected bootstrap estimates for the null and alternative models in Set 3. Here, the null and alternative models are underspecified, but the null model is closer to the true data-generating model.

Hypothesis  | E(KLD)   | E(BD)    | ΔBDb  | ΔBDk
N = 500
Null        | 3726.902 | 3726.159 | 3.401 | 3.332
Alternative | 3832.770 | 3832.395 | 3.704 | 3.626
N = 100
Null        | 745.967  | 745.809  | 4.358 | 3.943
Alternative | 766.212  | 766.813  | 4.947 | 4.528
N = 50
Null        | 373.419  | 373.704  | 5.309 | 4.325
Alternative | 383.156  | 384.020  | 5.843 | 4.858
N = 25
Null        | 187.563  | 188.745  | 8.082 | 5.245
Alternative | 191.924  | 194.082  | 8.878 | 6.088
Table 10. Expected value of the KLD, its bootstrap estimate, and the bias of the corrected bootstrap estimates for the null and alternative models in Set 4. Here, the null and alternative models are equally underspecified.

Hypothesis  | E(KLD)   | E(BD)    | ΔBDb  | ΔBDk
N = 500
Null        | 3923.423 | 3923.908 | 5.022 | 4.948
Alternative | 3923.580 | 3924.705 | 5.475 | 5.399
N = 100
Null        | 784.021  | 784.917  | 5.080 | 4.670
Alternative | 784.042  | 785.026  | 5.241 | 4.823
N = 50
Null        | 391.751  | 393.155  | 6.335 | 5.343
Alternative | 391.753  | 393.131  | 6.222 | 5.239
N = 25
Null        | 195.732  | 198.616  | 9.602 | 6.821
Alternative | 195.862  | 198.690  | 9.598 | 6.804
Table 11. Expected value of the KLD, its bootstrap estimate, and the bias of the corrected bootstrap estimates for the null and alternative models in Set 5. Here, the null and alternative models are misspecified with respect to the error distribution, and the errors are generated from a Student’s t distribution.

Hypothesis  | E(KLD)   | E(BD)    | ΔBDb   | ΔBDk
N = 500
Null        | 1678.652 | 1672.369 | −2.224 | −4.178
Alternative | 1679.695 | 1672.387 | −2.248 | −4.231
N = 100
Null        | 338.728  | 334.154  | −0.920 | −2.471
Alternative | 339.866  | 334.300  | −0.728 | −2.438
N = 50
Null        | 171.377  | 167.500  | −0.231 | −1.839
Alternative | 172.640  | 167.847  | 0.283  | −1.714
N = 25
Null        | 87.689   | 83.577   | −0.434 | −2.077
Alternative | 89.311   | 84.495   | 0.869  | −1.785
Table 12. Expected value of the KLD, its bootstrap estimate, and the bias of the corrected bootstrap estimates for the null and alternative models in Set 6. Here, the null and alternative models are misspecified with respect to the error distribution, and the errors are generated from a mixture of normal distributions.

Hypothesis  | E(KLD)   | E(BD)    | ΔBDb   | ΔBDk
N = 500
Null        | 2488.932 | 2480.154 | −0.389 | 6.554
Alternative | 2490.012 | 2480.141 | −0.310 | 6.659
N = 100
Null        | 508.122  | 497.000  | −0.383 | 8.404
Alternative | 509.426  | 497.237  | −0.597 | 8.459
N = 50
Null        | 263.382  | 252.424  | −2.852 | 8.590
Alternative | 264.974  | 253.245  | −3.930 | 8.361
N = 25
Null        | 144.895  | 131.870  | −4.361 | 10.842
Alternative | 147.551  | 134.298  | −7.782 | 9.956
Table 13. Results for Setting 1, Setting 2, and Setting 3. BDCPk is the BDCP corrected by k, BDCPb is the BDCP corrected by k_b, and BDCP is the uncorrected BDCP. Results are based on 200 bootstrap samples.

                   | Setting 1 | Setting 2 | Setting 3
BDCPk              | 0.075     | 0.990     | 0.815
BDCPb              | 0.075     | 0.995     | 0.780
BDCP               | 0.055     | 0.495     | 0.815
p-value: CK1       | 0.001     | 0.001     | 0.001
p-value: Semesters | –         | 0.734     | 0.936
p-value: BMI       | –         | –         | 0.176

Share and Cite

Dajles, A.; Cavanaugh, J. Probabilistic Pairwise Model Comparisons Based on Bootstrap Estimators of the Kullback–Leibler Discrepancy. Entropy 2022, 24, 1483. https://doi.org/10.3390/e24101483
