# Statistical Evidence Measured on a Properly Calibrated Scale across Nested and Non-nested Hypothesis Comparisons


## Abstract


## 1. Introduction

Abbreviation | Full Name | Description |
---|---|---|
BBP | Basic Behavior Pattern | a characteristic of what we mean by “statistical evidence” that any measure of evidence must recapitulate |
d.f. | Degrees of Freedom | one of two constants in the EqS, used to calibrate E across different forms of HC |
e | evidence | evidence measured on an empirical (uncalibrated) scale |
E | Evidence | evidence measured on an absolute (context-independent) scale |
EqS | Equation of State | used to calculate the evidence from features of the likelihood ratio graph |
HC | Hypothesis Contrast | the forms of the hypotheses in the numerator and denominator of the LR (nested or non-nested; composite or simple) |
LR | Likelihood Ratio | P(data \| Hypothesis 1) / P(data \| Hypothesis 2) |
S | Evidential Entropy | a particular form of Kullback-Leibler divergence, equal to the max log LR |
TrP | Transition Point | values of x/n at which evidence switches from supporting one hypothesis to supporting the other |
V | Volume | area (or more generally, volume) under the LR graph |

## 2. Review of Previous Results and the Problem with Two-Sided Hypothesis Comparisons

H_{1}: “coin is biased towards tails” (θ ≤ ½), vs. H_{2}: “coin is fair” (θ = ½). We articulate four BBPs up front (we have described the thought experiments used to motivate the BBPs in detail elsewhere; see, e.g., [5]; here we simply summarize the BBPs themselves). Note that the BBPs described in what follows are specific to this particular setup, which is the one we considered in previous work [4,5]. Extensions to other forms of hypotheses are considered in subsequent sections, albeit still in the context of the binomial model.

H_{1} or H_{2}, depending on x/n, but in either case, it increases with increasing n. BBP(i) is illustrated in Figure 1a.

H_{1} decreases up to some value of x/n, after which it increases in favor of H_{2}. We refer to the value of x/n at which the evidence switches from favoring H_{1} to favoring H_{2} as the transition point (TrP). We also expect the value of x/n at which a TrP occurs to shift as a function of n, as increasingly smaller departures from x/n = 0.5 support H_{1} over H_{2}. BBP(ii) is also illustrated in Figure 1a.

H_{1} by a greater amount if they are preceded by two tails in a row, compared to if they are preceded by 100 tails in a row. BBP(iv) is illustrated in Figure 1c. We reiterate that BBP(iv), like the other BBPs, is based on our feel for how the evidence is behaving, rather than any preconceived notion of how evidence ought to behave. It leads to an understanding of evidence in which data alone do not possess or convey a fixed quantity of information, but rather, the information conveyed by data regarding any given hypothesis contrast depends on the context in which the data are considered.

**Figure 1.** Basic Behavior Patterns for evidence e: (**a**) e as a function of x/n for different values of n, illustrating BBPs (i) and (ii) (dots mark the TrP, or minimum point, on each curve); (**b**) iso-e contours for different values of e (higher contours represent larger values of e), illustrating BBP(iii); (**c**) e as a function of n for any fixed x/n, illustrating BBP(iv). Because e is on an empirical (relative) measurement scale, numerical values are not assigned to e and e-axes are not labeled in the figures.

$S = c_{1}\log E + c_{2}\log V$ (3)

for c_{1}, c_{2} constants, where E represents evidence measured on an absolute, and not merely an empirical, scale [4]. Equation (3) is identical in form to the ideal gas EqS in physics, although we assign different (non-physical) meanings to each of the constituent terms.

c_{2} = 1 and c_{1} = 1.5. We have found that c_{2} = 1 is required to maintain the BBPs. We continue to use this value throughout the remainder of this paper, but we have retained c_{2} in the equations as a reminder that it may become important in future extensions of the theory. We return to resolution of c_{1} in Section 4 below.
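As a worked step, the ideal-gas-form EqS can be solved for E. This sketch assumes that the left-hand side of Equation (3) is the evidential entropy S, which is inferred from the EqS forms tabulated in Section 4 and is not stated in the surviving text here:

```latex
\begin{align*}
S &= c_{1}\log E + c_{2}\log V \\
c_{1}\log E &= S - c_{2}\log V \\
E &= \left(\frac{\exp(S)}{V^{c_{2}}}\right)^{1/c_{1}} \\
E &= \left(\frac{\exp(S)}{V}\right)^{1/1.5}
   \qquad \text{for } c_{1} = 1.5,\; c_{2} = 1
\end{align*}
```

Under that assumption, the last line agrees with the Class I(a) entry of the EqS table in Section 4.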

**Figure 2.** The problem with using the original equation of state in application to the two-sided hypothesis contrast: (**a**) E as a function of x/n for different values of n, using the original EqS, illustrating the absence of a true TrP (dots indicate minimum value of E); (**b**) the expected pattern of behavior of e in the two-sided case, illustrating the correct TrP behavior, with symmetric TrPs on either side of 0.5, converging towards 0.5 as n increases. In (**a**), because we are using Equation (4) to calculate the evidence, we label the y-axis E; however, because this is the wrong EqS here, numerical values of E are not labeled.

H_{1}: “coin is biased in either direction” vs. H_{2}: “coin is fair.” For given n and viewed as functions of x/n, Figure 1a and Figure 2a exhibit similar shapes. In Figure 1a (one-sided comparison) the minimum value of E corresponds to the TrP, the x/n value at which the evidence begins (reading left to right) to favor θ_{2} = ½. Figure 2a might at first appear to be a simple extrapolation, but in fact it must be fundamentally wrong. The minimum value should occur at the TrP, the x/n value at which the evidence begins to favor θ_{2} = ½, but here the minimum point is occurring at the value x/n = ½, regardless of n. Thus this minimum point no longer has the interpretation of being a TrP, that is, a point at which the evidence starts to favor θ_{2} = ½. Indeed, there is no such thing as evidence in favor of θ_{2} = ½ here, since even as n increases the evidence remains at a minimum when the data fit perfectly with H_{2}. Figure 2b illustrates the pattern (although not necessarily the actual numbers) we should obtain, which requires two TrPs, one on each side of θ_{2}. In contrast to Figure 2a, Figure 2b represents the correct reflection of the behavior in Figure 1a onto the region x/n > 0.5. In the following section we show how to adjust the EqS to produce the correct pattern as shown in Figure 2b.
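The failure described above can be illustrated numerically. The following is a minimal sketch, not the authors' code: it assumes that LR(θ) = L(θ)/L(½), that S is the maximum log LR, that V is the integral of the LR over [0, 1], and that the original EqS has the form E = (exp(S)/V)^{1/1.5}. The function names are hypothetical.

```python
import math


def log_beta(a, b):
    """Log of the Beta function B(a, b), computed via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def xlogp(k, p):
    """k * log(p), with the convention 0 * log(0) = 0."""
    return 0.0 if k == 0 else k * math.log(p)


def two_sided_E_original_eqs(x, n, c1=1.5):
    """E for 'theta != 1/2 vs. theta = 1/2' using the original EqS.

    Assumed definitions (illustrative only): S = max log LR and
    V = integral over [0, 1] of L(theta) / L(1/2).
    """
    theta_hat = x / n
    log_l_hat = xlogp(x, theta_hat) + xlogp(n - x, 1.0 - theta_hat)
    log_l_half = n * math.log(0.5)
    s = log_l_hat - log_l_half
    # Integral of theta^x (1 - theta)^(n - x) over [0, 1] is B(x+1, n-x+1).
    log_v = log_beta(x + 1, n - x + 1) - log_l_half
    return math.exp((s - log_v) / c1)


def argmin_x(n):
    """Number of heads x at which E is smallest for a given n."""
    return min(range(n + 1), key=lambda x: two_sided_E_original_eqs(x, n))


for n in (20, 50, 200):
    print(n, argmin_x(n) / n)
```

Under these assumptions the minimum sits at x/n = ½ for every n tried, so it cannot be read as a TrP that moves with n.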

## 3. Equations of State for Non-nested and Nested HCs

H_{1}, H_{2} respectively.

| | Class I Non-Nested | Class II Nested |
|---|---|---|
| (a) Composite vs. Simple | H_{1}: 0 ≤ θ ≤ ½; H_{2}: θ = ½ | H_{1}: 0 ≤ θ ≤ 1; H_{2}: θ = ½ |
| (b) Composite vs. Composite | H_{1}: 0 ≤ θ ≤ ½; H_{2}: ½ ≤ θ ≤ 1 | H_{1}: 0 ≤ θ ≤ 1; H_{2}: θ_{l} ≤ θ ≤ θ_{r} |

[θ_{l}, θ_{r}] is symmetric about θ = ½.

θ_{2} $\in$ [θ_{2l}, θ_{2r}] (where the subscript “l” stands for “left” and “r” for “right”) that are symmetric around the value ½. We have speculated from the start that the unconstrained maximum entropy state of a statistical system, which in the binomial case occurs when θ = ½, plays a special role in this theory. Indeed, in order to maintain the BBPs, calculations have shown that binomial HCs that are not “focused” in some sense on θ = ½ will require further corrections to the underlying EqS. We had also speculated previously that HCs in the form “A vs. not-A” play a special role. Here we extend the theory to include nested hypotheses.

_{i}.

_{i}, $\widehat{\theta}_{i}$ = θ_{i}; therefore, in application to the original one-sided HC, Equations (5) and (6) maintain the original definitions for S and V as given in Equations (1) and (2). From here on, we utilize the generalized definitions of S and V in Equations (5) and (6).

[θ_{2r} − θ_{2l}] narrows to ½ $\pm \epsilon$, the two HCs become (approximately) the same, namely, θ ≠ ½ vs. θ = ½. Therefore for any given data (n, x), as ε shrinks to 0, they must yield the same value of E. This strongly suggests that a single EqS should govern both types of HC, as indeed turns out to be the case.

## 4. The Constant c_{1} and Degrees of Freedom

_{2} on a boundary), and therefore the idea of adjusting the calculation of E to reflect differences in d.f. across HCs was moot, but the discovery that just two basic EqSs cover a wide range of HCs strongly suggests that any d.f. adjustment should be captured by some feature of the EqS as shown in Equations (4) and (7). Furthermore, as these equations show, c_{1} adjusts the magnitude of E for given S and V, which is on the face of it just what we need to do.

c_{1} a constant and then to vary it. We note, however, that in the thermodynamic analogues of our Equations (4) and (7), the position of our c_{1} is occupied by the physical constant c_{V}, the thermal capacity of a gas at constant volume. This constant varies, e.g., between monatomic and diatomic gases, reflecting the fact that a fixed influx of heat will raise the temperatures of the two gas types by different amounts. Similarly, we can view c_{1} as a factor that recognizes that different HCs will convert the same amount of new information (or data) into different changes in E. This viewpoint is consonant with our underlying information-dynamic theory [4,5], which treats transformations of LR graphs in terms of Q (a kind of evidential information influx) and W (information “wasted” during the transformation, in the sense that it does not get converted into a change in E); the sense in which E maintains constant meaning across applications relates specifically to aspects of these transformations (see [4,5] for details).

c_{1} for different HCs, as we describe in the following paragraph. Final validation of any specific numerical choices we make at this point regarding c_{1} will require returning to the original information-dynamic formalism. But we point out here that the choices we have made are far from ad hoc. The form of the EqS itself combined with constraints imposed by the BBPs place severe limitations on how values can be assigned to c_{1} while maintaining reasonable behavior for E within and across HCs.

| | Class I Non-Nested | Class II Nested |
|---|---|---|
| (a) Composite vs. Simple | $E = \left(\frac{\exp(S)}{V}\right)^{1/1.5}$ | $E = \left(\frac{\exp(S)}{V-b}\right)^{1/2}$ |
| (b) Composite vs. Composite | $E = \frac{\exp(S)}{V}$ | $E = \left(\frac{\exp(S)}{V-b}\right)^{1/(2+[\theta_{2r}-\theta_{2l}])}$ |
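The four entries in the table share one functional form, E = (exp(S)/(V − b))^{1/c_{1}}, with b = 0 for Class I. The sketch below encodes that form only; it is an illustration under that reading, the function name `evidence` and the input values are hypothetical, and the calculation of b (Appendix B) is not reproduced, so b is taken as a given input.

```python
import math


def evidence(s, v, c1, b=0.0):
    """E = (exp(S) / (V - b)) ** (1 / c1), computed on the log scale.

    s  : evidential entropy S (max log LR)
    v  : volume V under the LR graph
    c1 : the d.f. constant for the HC class
    b  : Class II correction term (0 for Class I); supplied by the caller
    """
    if v - b <= 0.0:
        raise ValueError("V - b must be positive")
    return math.exp((s - math.log(v - b)) / c1)


# Hypothetical inputs chosen only to make the arithmetic easy to follow.
s, v = math.log(8.0), 1.0
e_class_1a = evidence(s, v, c1=1.5)   # 8 ** (1 / 1.5)
e_class_1b = evidence(s, v, c1=1.0)   # 8 ** 1
e_class_2a = evidence(s, v, c1=2.0)   # 8 ** (1 / 2), taking b = 0 here
print(e_class_1a, e_class_1b, e_class_2a)
```

For fixed S and V with exp(S)/(V − b) > 1, a larger c_{1} yields a smaller E, which is the direction of adjustment discussed in this section.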

c_{1} > 0.5 in order to maintain BBP(iv). Thus we begin, somewhat arbitrarily but in order to start from an integer value, from a baseline c_{1} = 1.0. In the case of nested hypotheses ($\Theta_{2} \subset \Theta_{1}$), we add to this baseline value the sum [θ_{1r} − θ_{1l}] + [θ_{2r} − θ_{2l}] of the lengths of the intervals. Heuristically, we sum these lengths because it is possible for x/n to be in either or both intervals simultaneously; thus speaking very loosely, c_{1} captures a kind of conjunction of the two intervals. In the case of non-nested hypotheses, for which x/n can be in Θ_{1} or Θ_{2} but not both, a disjunction of the intervals, we add to the baseline the difference [θ_{1r} − θ_{1l}] − [θ_{2r} − θ_{2l}]. Using these rules we arrive at c_{1} = d.f. = 1.5, 1.0, 2.0 and 2 + [θ_{2r} − θ_{2l}], for Classes I(a), I(b), II(a) and II(b), respectively. Thus for Class II(b), 2 ≤ c_{1} ≤ 3. Table 3 shows the assigned values in the context of the EqS for each HC.
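The assignment rule just described reduces to a few lines of arithmetic. The sketch below is illustrative only: the function name and the interval representation (a simple hypothesis as a zero-length interval) are assumptions made for the example.

```python
def c1_degrees_of_freedom(theta1, theta2, nested, baseline=1.0):
    """Return c1 for an HC from the two parameter intervals.

    theta1, theta2 : (left, right) endpoints of the H1 and H2 intervals;
                     a simple hypothesis is written as a zero-length interval.
    nested         : True if the H2 interval is contained in the H1 interval.
    """
    len1 = theta1[1] - theta1[0]
    len2 = theta2[1] - theta2[0]
    if nested:
        # Nested: add the sum of the interval lengths to the baseline.
        return baseline + len1 + len2
    # Non-nested: add the difference of the interval lengths.
    return baseline + (len1 - len2)


c1_class_1a = c1_degrees_of_freedom((0.0, 0.5), (0.5, 0.5), nested=False)
c1_class_1b = c1_degrees_of_freedom((0.0, 0.5), (0.5, 1.0), nested=False)
c1_class_2a = c1_degrees_of_freedom((0.0, 1.0), (0.5, 0.5), nested=True)
c1_class_2b = c1_degrees_of_freedom((0.0, 1.0), (0.4, 0.6), nested=True)
print(c1_class_1a, c1_class_1b, c1_class_2a, c1_class_2b)
```

The four calls correspond to Classes I(a), I(b), II(a) and II(b), the last with the illustrative interval 0.4 ≤ θ_{2} ≤ 0.6.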

c_{1} increases, for given data, E decreases. Thus these values ensure some intuitively reasonable behavior in terms of the conventional role of d.f. adjustments. For instance, for given x/n, the two-sided Class II(a) HC will have lower E compared to the one-sided Class I(a) HC, which conforms to the frequentist pattern for one-sided vs. two-sided comparisons. We consider the behavior of E in greater detail within and across HCs in Section 5 below.

## 5. Behavior of E within and Across HC Classes

c_{2} = 1 and c_{1} as shown in Table 3. In this section we highlight important additional characteristics of E beyond the original BBPs. Some of these characteristics conform to intuitions we had formed in advance, but others constitute newly discovered properties of E-behaviors we did not anticipate, but which nevertheless seem to us to make sense once we observe them.

**Figure 3.** Behavior of E for Class II(b): (**a**) E as a function of x/n (n = 50) for different ranges for θ_{2}; (**b**) E as a function of x/n for 0.4 ≤ θ_{2} ≤ 0.6 for different n. Note that this graph utilizes the correct EqS. Therefore the y-axis is now labeled as E and numerical values are shown.

[θ_{2r} − θ_{2l}] of the $\Theta_{2}$ interval. Figure 3 illustrates the behavior of E for Class II(b). Several features of Figure 3 are worth noting. Intuition tells us that for x/n $\approx$ 0 or x/n $\approx$ 1, as [θ_{2r} − θ_{2l}] increases, the strength of the evidence in favor of θ_{1} should decrease, to reflect the fact that even such extreme data represent a smaller and smaller deviation from compatibility with θ_{2}. This pattern is seen in Figure 3a, where E = 6.2, 5.4, 4.6 for [θ_{2r} − θ_{2l}] = 0.02, 0.20 and 0.40, respectively. For any given x/n $\in$ $\Theta_{2}$, it also seems reasonable that the evidence, now in favor of θ_{2}, should decrease as [θ_{2r} − θ_{2l}] increases, again as seen in Figure 3a. This reflects the fact that $\Theta_{2} \subset \Theta_{1}$, so that evidence to differentiate the two hypotheses is smaller the more they overlap. At the same time, within this interval we would expect x/n $\approx$ ½ to yield the strongest evidence; however, given the overlap between $\Theta_{1}$ and $\Theta_{2}$, we would not necessarily expect the evidence at x/n $\approx$ ½ to be substantially larger than the evidence at x/n closer to the $\Theta_{2}$ boundary. Figure 3b illustrates this pattern for different values of n. Note that E is actually maximized at x/n = ½: e.g., for n = 50, at x/n = θ_{2l} = 0.4, E = 2.75, while at x/n = 0.5, E = 2.78. It is also interesting to note that the TrPs move outward as [θ_{2r} − θ_{2l}] increases, as might be expected (Figure 3a); while for each fixed [θ_{2r} − θ_{2l}], the TrPs are moving inward as n increases (Figure 3b), in all cases, converging towards the corresponding left (or right) boundary value of θ_{2}. Thus in all regards, the adjustment of c_{1} combined with the Class II EqS seems to yield sensible behavior for E for Class II(b).

[θ_{2r} − θ_{2l}] → 0, c_{1} becomes the same for Class II(b) and Class II(a), by design. Thus the line in Figure 3a representing 0.49 ≤ θ_{2} ≤ 0.51 (c_{1} = 2.02) is virtually identical to what we would obtain under Class II(a) (c_{1} = 2.00), and for the moment we treat it as a graph of Class II(a). We noted above that for x/n $\approx$ 0 or x/n $\approx$ 1, evidence decreases as [θ_{2r} − θ_{2l}] increases. We can now see from Figure 3a that this also means that evidence is decreasing relative to what would be obtained under a Class II(a) HC. Since under Class II(a) the HC always involves a comparison against θ = ½, it is reasonable that larger (nested) [θ_{2r} − θ_{2l}] would return smaller evidence at these x/n values relative to a comparison against the single value θ = ½. For x/n = ½, we might have guessed that E in favor of θ_{2} should be also smaller for Class II(b) than for Class II(a), since the data are perfectly consistent with both θ_{1} and θ_{2} but Class II(a) has the more specific H_{2}.

**Figure 4.** Comparative behavior of E as a function of x/n (n = 50) across all four HCs. For purposes of illustration, 0.4 ≤ θ_{2} ≤ 0.6 for Class II(b). TrPs are marked with circles (Class I(a), Class II(a)) or diamonds (Class I(b), Class II(b)).

_{1}, as discussed above, and it makes sense that nested hypotheses would be harder to distinguish compared to non-nested hypotheses for given n. Figure 4 also illustrates the relative placement of the TrPs across HCs, which is consistent with, and a generalization of, the BBPs involving the TrP considered in Section 2 in the context of a single HC. For instance, the TrPs for Class II(b) are further apart than for Class II(a), a pattern we might have anticipated.

θ_{1} and the segment to the right corresponding to evidence for θ_{2}.

**Figure 5.** Iso-E profiles comparing four HCs, for (**a**) E = 2; (**b**) E = 4. For purposes of illustration, 0.4 ≤ θ_{2} ≤ 0.6 for Class II(b).

θ_{1}, we would need n = 1.5, 1.1, 3.0 and 3.6 tails in a row (x/n = 0), for Class I(a), Class I(b), Class II(a) and Class II(b), respectively. Apparently E = 2 is quite easy to achieve, in the sense that relatively few tosses will yield E = 2 if they are all tails. By contrast, to get E = 4 one would need 7.0, 3.0, 15.2 and 20.5 tosses, all tails, for the four HCs respectively; while E = 8 (not shown in Figure) would require 21.8, 7.0, 67.3 and 106.6 tails, respectively. Another way to use the graphs is to see the “effect size” at which a given sample size n will return evidence E. As Figure 5 shows, whether the evidence favors θ_{1} (left of TrP) or θ_{2} (right of TrP), much larger samples are required to achieve a given E the closer x/n is to the TrP, or in other words, the less incompatible the data are with the non-favored hypothesis. For instance, for Class II(a) and considering evidence for θ_{1}, for n = 100, E = 4 for x/n ≈ 0.07; but for n = 300, that same E = 4 is achieved for x/n ≈ 0.25, a much smaller deviation from ½.

## 6. Discussion

_{1}, which corresponds to c

_{V}in the physical versions of these equations, seems to function in the information-dynamic equations as a kind of d.f. adjustment, allowing us for the first time to rigorously compare evidence across HCs of differing dimensionality. While these results represent substantial generalizations of the formalism, they remain specific to the binomial likelihood and will need to be extended to additional models before they are ready for general applications.

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## Appendix

## A. Maximum Log LR Plays the Role of Entropy, not Evidence

_{2}), as in the main text, we have what we call the observed KL divergence (“observed” because the expectation is taken with respect to a probability distribution based on the data), which is equal to the log of the maximum likelihood ratio (MLR):

_{OBS} as defined here corresponds to Zhang’s GLR [7] for non-nested hypotheses.

_{OBS}, or equivalently, in the form of the MLR, but not only information in this sense.
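The equality between the observed KL divergence and the log of the MLR can be checked numerically for the binomial model. This is a minimal sketch under stated assumptions: the numerator is maximized at the unrestricted estimate x/n, H_{2} is a simple hypothesis, and the observed divergence is taken as n times the per-toss Bernoulli KL divergence. The function names and the example data are hypothetical.

```python
import math


def bernoulli_kl(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)), with the convention 0 * log(0) = 0."""
    total = 0.0
    if p > 0.0:
        total += p * math.log(p / q)
    if p < 1.0:
        total += (1.0 - p) * math.log((1.0 - p) / (1.0 - q))
    return total


def log_max_lr(x, n, theta2):
    """log[ L(x/n) / L(theta2) ] for x heads in n binomial tosses."""

    def log_lik(theta):
        out = 0.0
        if x > 0:
            out += x * math.log(theta)
        if n - x > 0:
            out += (n - x) * math.log(1.0 - theta)
        return out

    return log_lik(x / n) - log_lik(theta2)


# Hypothetical data: 12 heads in 50 tosses, compared against theta2 = 1/2.
x, n, theta2 = 12, 50, 0.5
observed_kl = n * bernoulli_kl(x / n, theta2)
log_mlr = log_max_lr(x, n, theta2)
print(observed_kl, log_mlr)
```

The two printed quantities agree up to floating-point error, since n·KL expands term by term into the difference of the two log-likelihoods.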

## B. Calculation of b

**Figure 6.** Relationship among V, b and V − b using Equation (B1) to calculate b. Shown here are four Θ_{2} intervals: (**a**) [0.49, 0.51]; (**b**) [0.4, 0.6]; (**c**) [0.3, 0.7]; (**d**) [0.2, 0.8].

r_{1}, which controls the curvature of b over $\Theta_{2}$; and r_{2}, which controls the baseline value of b at the boundaries of this region, that is, at the points θ_{2l}, θ_{2r}. Let the value of b at these points be b(θ_{2l}) = b(θ_{2r}). We note up front that for given n, the minimum value of the Fisher information, Min FI(n) = $-E\left[\frac{d^{2}}{d\theta^{2}}\log L(\theta)\right]$, occurs when θ = ½. Then we have:

b(θ_{2l}) and 0 (on the left) or b(θ_{2r}) and 1 (on the right).

r_{1} = 2 − [θ_{2r} − θ_{2l}], so that the curvature of b depends on the width of the Θ_{2} interval. We found that we needed to constrain r_{2} such that ⅓ ≤ (r_{1} − (½)r_{2}) ≤ ⅔. Thus we used r_{2} = 2r_{1} − ⅔(1 + [θ_{2r} − θ_{2l}]). Figure 6 shows b and V for various $\Theta_{2}$ for n = 50.
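As a short check, the stated choice of r_{2} satisfies the constraint for every admissible interval width. The shorthand w = θ_{2r} − θ_{2l}, with 0 ≤ w ≤ 1, is introduced here only for this derivation:

```latex
\begin{align*}
r_{1} - \tfrac{1}{2} r_{2}
  &= r_{1} - \tfrac{1}{2}\left[ 2 r_{1} - \tfrac{2}{3}(1 + w) \right] \\
  &= \tfrac{1}{3}(1 + w), \\
0 \le w \le 1
  &\;\Longrightarrow\;
  \tfrac{1}{3} \le r_{1} - \tfrac{1}{2} r_{2} \le \tfrac{2}{3}.
\end{align*}
```

The lower bound is attained as the Θ_{2} interval shrinks to the single point ½, and the upper bound when it covers all of [0, 1].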

## References

1. Vieland, V.J. Thermometers: Something for statistical geneticists to think about. Hum. Hered. **2006**, 61, 144–156.
2. Vieland, V.J. Where’s the Evidence? Hum. Hered. **2011**, 71, 59–66.
3. Vieland, V.J.; Hodge, S.E. Measurement of Evidence and Evidence of Measurement (Invited Commentary). Stat. Appl. Genet. Mol. Biol. **2011**, 10.
4. Vieland, V.J.; Das, J.; Hodge, S.E.; Seok, S.-C. Measurement of statistical evidence on an absolute scale following thermodynamic principles. Theory Biosci. **2013**, 132, 181–194.
5. Vieland, V.J. Evidence, temperature, and the laws of thermodynamics. Hum. Hered. **2014**, 78, 153–163.
6. Chang, H. Inventing Temperature: Measurement and Scientific Progress; Oxford University Press: New York, NY, USA, 2004.
7. Zhang, Z. A Law of Likelihood for Composite Hypotheses. **2009**. arXiv:0901.0463v1.
8. Lele, S.R. Evidence Functions and the Optimality of the Law of Likelihood. In The Nature of Scientific Evidence: Statistical, Philosophical, and Empirical Considerations; Taper, M.L., Lele, S.R., Eds.; University of Chicago Press: Chicago, IL, USA, 2004.
9. Royall, R. Statistical Evidence: A likelihood Paradigm; Chapman & Hall: London, UK, 1997.
10. Kass, R.E.; Raftery, A.E. Bayes Factors. J. Am. Stat. Assoc. **1995**, 90, 773–795.
11. Vieland, V.J.; Huang, Y.; Seok, S.-C.; Burian, J.; Catalyurek, U.; O’Connell, J.; Segre, A.; Valentine-Cooper, W. KELVIN: A software package for rigorous measurement of statistical evidence in human genetics. Hum. Hered. **2011**, 72, 276–288.
12. Kullback, S. Information Theory and Statistics; Dover: New York, NY, USA, 1997.
13. Fermi, E. Thermodynamics; Dover Publications: New York, NY, USA, 1956.
14. Good, I.J. What are degrees of freedom? Am. Stat. **1973**, 27, 227–228.
15. Sober, E. Evidence and Evolution: The Logic Behind the Science; Cambridge University Press: Cambridge, UK, 2008.
16. Strug, L.J.; Hodge, S.E. An alternative foundation for the planning and evaluation of linkage analysis I. Decoupling “error probabilities” from “measures of evidence”. Hum. Hered. **2006**, 61, 166–188.
17. Krantz, D.H.; Luce, R.D.; Suppes, P.; Tversky, A. Foundations of Measurement Volume 1 1971; Dover: Mineola, NY, USA, 2007.
18. Baskurt, Z.; Evans, M. Hypothesis assessment and inequalities for Bayes factors and relative belief ratios. Bayesian Anal. **2013**, 8, 569–590.
19. Soofi, E.S. Principal Information Theoretic Approaches. J. Am. Stat. Assoc. **2000**, 95, 1349–1353.
20. Hacking, I. Logic of Statistical Inference; Cambridge University Press: London, UK, 1965.
21. Edwards, A. Likelihood; Johns Hopkins University Press: Baltimore, MD, USA, 1992.

© 2015 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Vieland, V.J.; Seok, S.-C.
Statistical Evidence Measured on a Properly Calibrated Scale across Nested and Non-nested Hypothesis Comparisons. *Entropy* **2015**, *17*, 5333-5352.
https://doi.org/10.3390/e17085333

**Chicago/Turabian Style**

Vieland, Veronica J., and Sang-Cheol Seok.
2015. "Statistical Evidence Measured on a Properly Calibrated Scale across Nested and Non-nested Hypothesis Comparisons" *Entropy* 17, no. 8: 5333-5352.
https://doi.org/10.3390/e17085333