Statistical Evidence Measured on a Properly Calibrated Scale Across Nested and Non-nested Hypothesis Comparisons

Statistical modeling is often used to measure the strength of evidence for or against hypotheses on given data. We have previously proposed an information-dynamic framework in support of a properly calibrated measurement scale for statistical evidence, borrowing some mathematics from thermodynamics, and showing how an evidential analogue of the ideal gas equation of state could be used to measure evidence for a one-sided binomial hypothesis comparison (coin is fair versus coin is biased towards heads). Here we take three important steps forward in generalizing the framework beyond this simple example. We (1) extend the scope of application to other forms of hypothesis comparison in the binomial setting; (2) show that doing so requires only the original ideal gas equation plus one simple extension, which has the form of the Van der Waals equation; (3) begin to develop the principles required to resolve a key constant, which enables us to calibrate the measurement scale across applications, and which we find to be related to the familiar statistical concept of degrees of freedom. This paper thus moves our information-dynamic theory substantially closer to the goal of producing a practical, properly calibrated measure of statistical evidence for use in general applications.


Introduction
Statistical modeling is used for a variety of purposes throughout the biological and social sciences, including hypothesis testing and parameter estimation among other things. But there is also a distinct purpose to statistical inference, namely, measurement of the strength of evidence for or against hypotheses in view of data. This is arguably the predominant use of statistical modeling from the point of view of most practicing scientists, as manifested by their persistence in interpreting the p-value as if it were a measure of evidence despite multiple lines of argument against such a practice.
In previous work, we have argued that for any measure of evidence to be reliably used for scientific purposes, it must be properly calibrated, so that one "degree" on the measurement scale always refers to the same amount of underlying evidence, within and across applications [1][2][3]. Towards this end, we proposed adapting some of the mathematics of thermodynamics as the basis for an absolute (context-independent) measurement scale for evidence [4]. The result was a new theory of information-dynamics, in which different types of information are conserved and interconverted under principles that resemble the first two laws of thermodynamics, with evidence emerging as a relationship among information types under certain kinds of transformations [5]. As we argued previously, this provides us both with a formal definition of statistical evidence and with an absolute scale for its measurement, much as thermodynamics itself did for Kelvin's temperature T. But unless this new theory can produce something useful, it is purely speculative and not really a theory at all in the scientific sense, so much as an overgrown analogy.
Until now, though, the theory has been too limited in scope to be of any practical use, for four reasons. (i) We have previously worked out a concrete application only for a simple coin-tossing model, and we speculated that extension to other statistical models (i.e., forms of the likelihood other than the binomial) might require derivation of a new underlying equation of state (EqS, that is, the formula for computing the evidence E; see below for details) for every distinct statistical model. The principles for deriving these new equations remained, however, unclear. (The text will be simplified by the introduction of a number of abbreviations, of which "EqS" is the first. To assist the reader, abbreviations are summarized in Table 1.) (ii) The original EqS also contained two constants, which we speculated might relate to calibrating evidence measurement across different statistical models, but again, the principles under which the constants could be found were unknown, rendering the issue of calibration across applications moot. (iii) Furthermore, the theory appeared to work correctly only in application to a one-sided hypothesis comparison ("coin is fair" versus "coin is biased towards heads"), failing even for the seemingly simple extension to a two-sided comparison ("coin is fair" versus "coin is biased in either direction"). (iv) Because we depended heavily on the arithmetic of thermodynamics in justifying some components of the theory, it was unclear how to move to general applications without relying upon additional equations borrowed from physics and applied in an ad hoc manner to the statistical problem. While this issue was ameliorated by the introduction of wholly information-based versions of the first and second laws of thermodynamics [5], it remained a concern, particularly in view of our inability to extend the theory beyond the one-sided binomial application.
Thus we were faced with the question of whether the striking connection we had found between the mathematical description of the dynamics of ideal gases and the mathematics of our simple statistical system was really telling us something useful on the statistical side, or whether, by contrast, we had simply stumbled upon a kind of underlying one-sided binomial representation of the ideal gas model in physics -a model of use neither to physicists nor to statisticians. With the results presented in this paper, however, we take an important step forward in laying this concern to rest. We show below how the theory is readily generalized to support a wider range of statistical applications than had previously been considered, and we make strides in laying out the principles under which both the equations of state and the constants can be resolved. In the process, we continue to see connections to the equations of thermodynamics.
Specifically, we generalize the original theory to address the four limitations mentioned above, albeit still in the context of binomial models. We find that equations of state are governed by the different possible forms of hypothesis contrast (HC). We are then able to extend the original results to other HCs, including the two-sided HC that thwarted our earlier attempts at generalization, by introducing a simple extension of the original EqS. We also show a connection between one of the constants and something closely related to the familiar statistical concept of degrees of freedom.
The remainder of the paper is organized as follows. We first (1) briefly review the key methodological principles and results from earlier work, and we illustrate the problem that arises when we move from one-sided to two-sided hypothesis comparisons. In (2) we group binomial HCs into two major Classes, non-nested and nested, and we show that the HCs in Class I can be handled via the original EqS, while a simple modification of this EqS suffices to handle the Class II HCs. In (3) we consider resolution of a key constant across different hypothesis contrasts, and find that it is related to the statistical concept of degrees of freedom. In (4) we illustrate aspects of the behavior of the resulting evidence measure E within and across HCs.

Review of previous results and the problem with two-sided hypothesis comparisons
We begin with a high level definition of evidence as a relationship between data and hypotheses in the context of a statistical model. We then pose a measurement question: How do we ensure a meaningfully calibrated mapping between (i) the object of measurement, i.e., the evidence or evidence strength, and (ii) the measurement value? Here the object of measurement cannot be directly observed, but must be inferred based on application of a law or principle that maps observable (computable) features of the data onto the evidence. This is known as a nomic measurement problem [6]. There are precedents for solutions to nomic measurement problems, particularly in physics; measurement of temperature is an example [6].
Our guiding methodological principle is that any measure of evidence must verifiably behave like the evidence, in situations in which such verification is possible. In order to establish basic behavior patterns (BBPs) expected of any evidence measure, we consider a simple model and a series of thought experiments, or appeals to intuition. This enables us to articulate basic operational characteristics of what we mean by "statistical evidence." We then check any proposed measure of evidence to be sure that it exhibits the correct BBPs. As the theory is developed, we are also able to observe new patterns of behavior. These are considered iteratively to assess their reasonableness.
These BBPs play a role here that is similar to the role played in some other treatments of evidence by axioms [7] or "conditions" [8]. However, in our methodology the BBPs themselves only support a measure of evidence e on an empirical, rather than absolute scale. Any proper empirical measure e must exhibit the BBPs. But as long as the only criterion is that e exhibits the BBPs, the units of e remain arbitrary and they are not necessarily comparable across applications.
Thus the BBPs constitute a set of necessary, but not sufficient, conditions for a proper evidence measure.
The primary set of thought experiments used to establish the current set of BBPs involves coin-tossing examples, for which our intuitions are clear and consensus is easy to achieve on key points. (Royall [9] also uses a simple binomial set-up as a canonical system for eliciting intuitions about evidence. However, his use of the binomial is quite different from ours. He appeals to intuition in order to calibrate strength of evidence across applications; we appeal to intuition to establish certain properties we expect evidence to exhibit. In our methodology, calibration is a separate process.) Consider a series of n independent coin tosses of which x land heads and n − x land tails. Let the probability that the coin lands heads be θ. And consider the two hypotheses H1: "coin is biased towards tails" (θ < ½), versus H2: "coin is fair" (θ = ½). We articulate four BBPs up front. (We have described the thought experiments used to motivate the BBPs in detail elsewhere; see, e.g., [5]. Here we simply summarize the BBPs themselves.)

(i) Change in evidence as a function of n for fixed x/n. For any fixed value of x/n, the evidence increases as n increases. The evidence may favor H1 or H2, depending on x/n, but in either case, it increases with increasing n. BBP(i) is illustrated in Figure 1(a).
(ii) Change in evidence as a function of x/n for fixed n. For any fixed n, as x/n increases from 0 to ½ the evidence in favor of H1 decreases up to some value of x/n, after which it increases in favor of H2. We refer to the value of x/n at which the evidence switches from favoring H1 to favoring H2 as the transition point (TrP). We also expect the value of x/n at which a TrP occurs to shift as a function of n, as increasingly smaller departures from x/n = 0.5 support H1 over H2.
BBP(ii) is also illustrated in Figure 1(a).
(iii) Change in x/n and n for fixed evidence. In order to maintain constant evidence, as x/n increases from 0 to the TrP, n increases; as x/n continues to increase from the TrP to ½, n decreases. These patterns follow from BBP(i) and BBP(ii). BBP(iii) is illustrated in Figure 1(b).
(iv) Rate of increase of evidence as a function of n for fixed x/n. The same quantity of new data (n, x) has a smaller impact on the evidence the larger the starting value of n, or equivalently, the stronger the evidence is before consideration of the new data. E.g., 5 tosses all of which land tails increase the evidence for H1 by a greater amount if they are preceded by 2 tails in a row than if they are preceded by 100 tails in a row. BBP(iv) is illustrated in Figure 1(c).
We can summarize by saying that the three quantities n, x, and evidence e enter into an EqS, in which holding any one of the three constant while allowing a second to change necessitates a compensatory change in the third. Here e itself is simply defined as the third fundamental entity in the set. At this point no particular measurement scale is assigned to e, and therefore numerical values are not assigned to e and e-axes are not labeled in the figures. Figure 1 is intended to illustrate behavior patterns only, rather than specific numerical results. (In Figure 1 and subsequent Figures, n and x are treated as continuous rather than integer, in order to smooth the graphs, particularly for small n.) Our overarching methodological principle is that any proposed measure of evidence must exhibit these basic patterns of behavior. While we are free to use any methods we like to discover or invent a statistical EqS, applying this principle to our simple set of BBPs severely constrains the possibilities.

We treat the likelihood ratio (LR) as fundamental. Originally [4] we considered only the special binomial form of the LR for a composite vs. simple HC, with the simple hypothesis on the boundary of the parameter space. The EqS for this set-up was originally derived via the information-dynamic analogue of thermodynamic systems [4]. Here we focus only on the EqS itself and not its derivation. This EqS turned out to be a function of two aspects of the LR (and two constants; see below): (i) the logarithm of the maximum LR, which we treated as an entropy term (see Appendix 1) and denoted as S; and (ii) the area under the LR graph, denoted V, which is related to, though distinct from, the Bayes factor [10] and the Bayes ratio in statistical genetics [11]. In [4] we derived a simple EqS in the form

S = c1 log E + c2 log V, (1.3)

for c1, c2 constants, where E represents evidence measured on an absolute, and not merely an empirical, scale [4].
Equation (1.3) is identical in form to the ideal gas EqS in physics, although we assign different (non-physical) meanings to each of the constituent terms.
Because the focus of this paper is on application of the theory, we do not address the meaning of E in any detail here. But in brief, E is defined as the proportionality between (i) the change in a certain form of information with the influx of new data, and (ii) the entropy, such that the degree of E retains constant meaning across the measurement scale and, given the correct EqS, across applications. See [5] for details.

From (1.3) we have a simple calculation formula for E as

E = exp[(S − c2 log V) / c1]. (1.4)
It is readily verified that using (1.4) yields an evidence measure E that exhibits the BBPs described above; in fact, Figure 1 was drawn by applying this equation. In previous work we noted that the principles for determining the constants remained to be discovered, and we set the values somewhat arbitrarily to c2 = 1 and c1 = 1.5. We have found that c2 = 1 is required to maintain the BBPs. We continue to use this value throughout the remainder of this paper, but we have retained c2 in the equations as a reminder that it may become important in future extensions of the theory. We return to resolution of c1 in §3 below.
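For concreteness, the computation implied by (1.4) can be sketched numerically. The snippet below is a minimal illustration of our own, not the authors' code: it assumes the EqS takes the form E = exp[(S − c2 log V)/c1], with S the log maximum LR and V the area under the LR graph, here for the original one-sided HC (θ ≤ ½, with θ = ½ the simple hypothesis) and c1 = 1.5, c2 = 1.

```python
import numpy as np

def binomial_lr(theta, n, x):
    """Likelihood ratio L(theta) / L(1/2) for x heads in n tosses.
    n and x may be non-integer, matching the paper's continuous treatment."""
    return (theta ** x) * ((1 - theta) ** (n - x)) / (0.5 ** n)

def evidence(n, x, c1=1.5, c2=1.0):
    """Sketch of E under the assumed form E = exp[(S - c2*log V)/c1]."""
    thetas = np.linspace(1e-9, 0.5, 200_001)
    vals = binomial_lr(thetas, n, x)
    S = np.log(vals.max())  # log maximum LR (the entropy term)
    # area under the LR graph, via the trapezoid rule
    V = (0.5 * (vals[:-1] + vals[1:]) * np.diff(thetas)).sum()
    return float(np.exp((S - c2 * np.log(V)) / c1))
```

Under these assumptions the BBPs can be spot-checked directly: for fixed x/n = 0, evidence(10, 0) exceeds evidence(5, 0), and for fixed x/n = ½, evidence(20, 10) exceeds evidence(10, 5), as BBP(i) requires.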

Equations of state for non-nested and nested HCs
We continue to consider the binomial model with the single parameter θ, and pairs of hypotheses specifying various ranges for θ. We restrict attention to HCs in the form H1: θ ∈ Θ1 versus H2: θ ∈ Θ2, and we consider only ranges θ2 ∈ [θ2l, θ2r] (where the subscript "l" stands for "left" and "r" for "right") that are symmetric around the value ½. We have speculated from the start that the unconstrained maximum entropy state of a statistical system, which in the binomial case occurs when θ = ½, plays a special role in this theory. Indeed, in order to maintain the BBPs, calculations have shown that binomial HCs that are not "focused" in some sense on θ = ½ will require further corrections to the underlying EqS. We had also speculated previously that HCs in the form "A vs. not-A" play a special role. Here we extend the theory to include nested hypotheses.
A little thought will show that Class I(b), like the original Class I(a), should have one TrP; while Class II(b), like Class II(a), should have two. The absence of the second TrP was the major reason for concluding that our original EqS did not cover the two-sided case Class II(a), and it appears that Class II(b) presents a similar challenge. Thus we expect both Class I HCs to exhibit the pattern illustrated in Figure 2(a), and both Class II HCs to exhibit the pattern illustrated in Figure 2(b).
Before proceeding we need to generalize our original notation to allow for the additional forms of HC. Let θ̂ = x/n, the value of θ that maximizes the likelihood L(θ). Let θ̂i = the value of θ ∈ Θi (i = 1, 2) that maximizes L(θ) within the range imposed by Hi. To handle the nested HCs we introduce a single extension of the original EqS, replacing V with (V − b), where b is a quantity described in Appendix 2:

S = c1 log E + c2 log(V − b). (2.3)

This extended EqS is identical in form to the Van der Waals equation [13].

It is readily verified that (2.3) returns the correct behavior, with two TrPs, as illustrated in Figure 2(b). Note that Class II(a) and Class II(b) must be closely related, since, as the interval [θ2l, θ2r] narrows to ½ ± ε, the two HCs become (approximately) the same, namely, θ ≠ ½ vs. θ = ½. Therefore for any given data (n, x), as ε shrinks to 0, they must yield the same value of E. This strongly suggests that a single EqS should govern both types of HC, as indeed turns out to be the case.

The constant c1 and degrees of freedom

Adherents of the likelihood principle generally eschew any kind of d.f. "correction" to LRs as indicators of evidence strength, even in the context of composite hypotheses (see, e.g., [7]). But the premise that a given value of the maximum LR corresponds to the same amount of evidence regardless of the amount of maximization being done strains credulity. Among other problems, this raises the problem of overfitting, in which a bigger maximum LR can almost always be achieved by maximizing over additional parameters (up to a model involving one independent parameter for each data point), even in circumstances in which the estimated model can be shown to be getting further from the true model as the maximum LR increases. (See, e.g., the discussion of model fitting versus predictive accuracy in [15]. See also [16], for a coherent pure-likelihoodist resolution of this problem, which avoids "corrections" to the LR for d.f., but which also precludes the possibility of meaningful comparisons of evidence strength across distinct HCs or distinct forms of the likelihood.) Prior to the new results in §2 above, we had been unable to derive the EqS for HCs other than the original one-sided binomial (with θ2 on a boundary), and therefore the idea of adjusting the calculation of E to reflect differences in d.f. across HCs was moot. But the discovery that just one simple pair of equations covers all of the HCs, with only the constant c1 varying among them, suggests another view. In thermodynamics, the constant playing the role of c1 (the heat capacity) differs across gas types, e.g., between monatomic and diatomic gases, reflecting the fact that a fixed influx of heat will raise the temperatures of the two gas types by different amounts. Similarly, we can view c1 as a factor that recognizes that different HCs will convert the same amount of new information (or data) into different changes in E.
This viewpoint is consonant with our underlying information-dynamic theory [4,5], which treats transformations of LR graphs in terms of Q (a kind of evidential information influx) and W (information "wasted" during the transformation, in the sense that it does not get converted into a change in E); the sense in which E maintains constant meaning across applications relates specifically to aspects of these transformations (see [4,5] for details).
The only remaining task then is to find the correct values of c 1 for different HCs, as we describe in the following paragraph. Final validation of any specific numerical choices we make at this point regarding c 1 will require returning to the original information-dynamic formalism.
But we point out here that the choices we have made are far from ad hoc. The form of the EqS itself combined with constraints imposed by the BBPs place severe limitations on how values can be assigned to c 1 while maintaining reasonable behavior for E within and across HCs.
We have found that we must have c1 > 0.5 in order to maintain BBP(iv). Thus we begin, somewhat arbitrarily but in order to start from an integer value, from a baseline c1 = 1.0. In the case of nested hypotheses (Θ2 ⊂ Θ1), we add to this baseline value the sum [θ1r − θ1l] + [θ2r − θ2l] of the lengths of the two intervals. Heuristically, we sum these lengths because it is possible for x/n to be in either or both intervals simultaneously; thus speaking very loosely, c1 captures a kind of conjunction of the two intervals. In the case of non-nested hypotheses, for which x/n can be in Θ1 or Θ2 but not both, a disjunction of the intervals, we add to the baseline the difference [θ1r − θ1l] − [θ2r − θ2l] of the lengths. Table 2 shows the assigned values in the context of the EqS for each HC.
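The assignment scheme just described is easy to mechanize. The helper below is our own illustration (the function name and the interval lengths used for each Class are our reading of the scheme, not the paper's notation): baseline 1.0, plus the sum of the two interval lengths for nested HCs, or their difference for non-nested HCs.

```python
def c1_value(len1, len2, nested):
    """Assign c1 from the lengths of the Theta_1 and Theta_2 intervals.

    Baseline 1.0; nested HCs add the sum of the interval lengths,
    non-nested HCs add the difference."""
    return 1.0 + (len1 + len2 if nested else len1 - len2)

# Class I(a): theta < 1/2 (length 0.5) vs. theta = 1/2 (length 0), non-nested
print(c1_value(0.5, 0.0, nested=False))  # 1.5, the original constant
# Class II(a): theta != 1/2 (length 1.0) vs. theta = 1/2 (length 0), nested
print(c1_value(1.0, 0.0, nested=True))   # 2.0
```

Reassuringly, this reading reproduces both the original c1 = 1.5 for Class I(a) and the c1 = 2.00 quoted for Class II(a) in §4 below.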
Note that as c1 increases, for given data, E decreases. Thus these values ensure some intuitively reasonable behavior in terms of the conventional role of d.f. adjustments. For instance, for given x/n, the two-sided Class II(a) HC will have lower E compared to the one-sided Class I(a) HC, which conforms to the frequentist pattern for one-sided versus two-sided comparisons.
We consider the behavior of E in greater detail within and across HCs in §4 below.

Behavior of E within and across HC classes
Our overarching goal here is to quantify statistical evidence on a common, underlying scale across all four HCs. As noted above, (1.4) and (2.3) ensure the BBPs in application to each HC considered on its own, provided that we set c2 = 1 and c1 as shown in Table 2. In this section we highlight important additional characteristics of E beyond the original BBPs. Some of these characteristics conform to intuitions we had formed in advance, but others constitute newly discovered properties of E - behaviors we did not anticipate, but which nevertheless seem to us to make sense once we observe them.
We begin with Class II(b) on its own, as a function of the size [θ2r − θ2l] of the Θ2 interval. For x/n ≈ 0 or x/n ≈ 1, the evidence decreases as [θ2r − θ2l] increases, as seen in Figure 4(a). This reflects the fact that Θ2 ⊂ Θ1, so that evidence to differentiate the two hypotheses is smaller the more they overlap. At the same time, within this interval we would expect x/n ≈ ½ to yield the strongest evidence; however, given the overlap between Θ1 and Θ2, we would not necessarily expect the evidence at x/n ≈ ½ to be substantially larger than the evidence at x/n closer to the Θ2 boundary. Figure 4 bears this out.

We can also assess the reasonableness of E for Class II(b) in comparison with Class II(a). As [θ2r − θ2l] shrinks, the Class II(b) graph approaches what we would obtain under Class II(a) (c1 = 2.00), and for the moment we treat it as a graph of Class II(a). We noted above that for x/n ≈ 0 or x/n ≈ 1, evidence decreases as [θ2r − θ2l] increases. We can now see from Figure 4(a) that this also means that evidence is decreasing relative to what would be obtained under a Class II(a) HC. Since under Class II(a) the HC always involves a comparison against θ = ½, it is reasonable that larger (nested) [θ2r − θ2l] would return smaller evidence at these x/n values relative to a comparison against the single value θ = ½. For x/n = ½, we might have guessed that E in favor of θ2 should also be smaller for Class II(b) than for Class II(a), since the data are perfectly consistent with both θ1 and θ2 but Class II(a) has the more specific H2.
Turning to comparisons across all four HCs, Figure 5 illustrates some additional important behaviors. Across the board, for given x/n, E is higher for the Class I HCs than it is for the Class II HCs. This is the result of our assignments for c1, as discussed above, and it makes sense that nested hypotheses would be harder to distinguish compared to non-nested hypotheses for given n.

Figure 5. Comparative behavior of E as a function of x/n (n = 50) across all four HCs. For purposes of illustration, 0.4 ≤ θ2 ≤ 0.6 for Class II(b). TrPs are marked with circles (Class I(a), Class II(a)) or diamonds (Class I(b), Class II(b)).

Figure 6 reorganizes the representation shown in Figure 5 in terms of "iso-E" contours through the (n, x/n) space for the different HCs; that is, these graphs display the sets of (n, x/n) pairs corresponding to the same evidence E. For simplicity, the x-axis is restricted to x/n ≤ 0.5.
(Recall that all HCs considered here are either restricted to x/n ≤ 0.5 or symmetric around x/n = 0.5. Recall too that n and x are treated as continuous here.) For each E and each HC, the maximum value of the iso-E curve occurs at the TrP, with the segment to the left corresponding to evidence for θ 1 and the segment to the right corresponding to evidence for θ 2 .
One way to use these graphs is to find the sample size n corresponding to a particular value of E for given x/n. For instance, to obtain E = 2 in favor of θ 1 , we would need n = 1.5, 1.1, 3.0 and 3.6 heads in a row (x/n = 0), for Class I(a), Class I(b), Class II(a) and Class II(b), respectively.
Apparently E = 2 is quite easy to achieve, in the sense that relatively few tosses will yield E = 2 if they are all heads. By contrast, to get E = 4 one would need 7.0, 3.0, 15.2 and 20.5 tosses, all heads, for the four HCs respectively; while E = 8 (not shown in Figure) would require 21.8, 7.0, 67.3 and 106.6 heads, respectively. Another way to use the graphs is to see the "effect size" at which a given sample size n will return evidence E. As Figure 6 shows, whether the evidence favors θ 1 (left of TrP) or θ 2 (right of TrP), much larger samples are required to achieve a given E the closer x/n is to the TrP, or in other words, the less incompatible the data are with the nonfavored hypothesis. For instance, for Class II(a) and considering evidence for θ 1 , for n = 100, E = 4 for x/n ≈ 0.07; but for n = 300, that same E = 4 is achieved for x/n ≈ 0.25, a much smaller deviation from ½. To our knowledge, ours is the only framework that generates a rigorous mathematical definition of what it means for evidence to be constant across different sets of data and different forms of HC. Note too that E is on a proper ratio scale [4,17], so that 6(b) represents a doubling of the strength of evidence as shown in 6(a) (and E = 8 represents a doubling again relative to E = 4). This is a unique feature of E compared to all other proposed evidence measures of which we are aware. Figure 6 is a type of graph that can be meaningfully produced only once one has a properly calibrated measurement E in hand.
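To illustrate how such iso-E look-ups can be computed, the sketch below uses bisection to find the n at which E reaches a target value for fixed x/n. It relies on the same assumed form of the EqS as in our earlier sketch (E = exp[(S − c2 log V)/c1], one-sided Class I(a), c1 = 1.5, c2 = 1), so the exact n it returns depends on that assumption; treat it as a sketch rather than a reproduction of Figure 6.

```python
import numpy as np

def evidence(n, x, c1=1.5, c2=1.0):
    """E for the one-sided binomial HC, assuming E = exp[(S - c2*log V)/c1]."""
    thetas = np.linspace(1e-9, 0.5, 200_001)
    vals = (thetas ** x) * ((1 - thetas) ** (n - x)) / (0.5 ** n)
    S = np.log(vals.max())
    V = (0.5 * (vals[:-1] + vals[1:]) * np.diff(thetas)).sum()
    return float(np.exp((S - c2 * np.log(V)) / c1))

def n_for_evidence(target_E, ratio=0.0, lo=0.5, hi=50.0, tol=1e-4):
    """Bisection on continuous n for the n at which evidence(n, ratio*n)
    reaches target_E; valid because E is monotone in n for fixed x/n (BBP(i))."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if evidence(mid, ratio * mid) < target_E:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Under these assumptions, n_for_evidence(2.0) lands between n = 1 and n = 2 for x/n = 0, in the same neighborhood as the n = 1.5 quoted above for Class I(a).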

Discussion
With the results presented above, we have taken important steps forward towards generalizing the theory. To date we have focused on building this novel "plero"-dynamics (from the Greek word for information) methodology and understanding its relationship to thermo-dynamics. Because our motivation - proper measure-theoretic calibration of evidence - is distinct from the objectives of other schools of statistical thought, we have found it challenging to relate plerodynamics to components of standard mathematical statistical theory. But what emerges from the current results is a novel concept of evidence as a relationship between the maximum log LR and the area (or more generally, volume) under the LR, where the relationship is mediated by a quantity related to the Fisher information (for nested HCs; see Appendix 2), and also by something related to statistical degrees of freedom. This strongly suggests that we should be able to tie the current results back to fundamental statistical theory. This will entail a detailed comparison of the concept of degrees of freedom as it appears in plerodynamics with its corresponding role in familiar statistical theory, and/or with its role in physical theory.
We had originally thought that every statistical model would require discovery of a separate EqS. But we now speculate that the EqS may depend only on the form of the HC, and not on the particular form of the likelihood. That is, our basic equations of state for the binomial model may extend to more complex models, based on general properties of likelihood ratios, at least under broad regularity conditions. Of course, so far the equations remain restricted to single-parameter models, and a somewhat restricted class of HCs (excluding "asymmetric" and non-partitioning HCs, as described above). We also have not considered extensions to continuous distributions.
However, we follow Baskurt and Evans [18] in considering all applications of statistical inference as fundamentally about discrete, rather than continuous, distributions. This also raises the possibility of another way of relating plerodynamics back to thermodynamics, since in this case plerodynamics in its most general form could perhaps be represented solely in terms of the Boltzmann distribution.
In this paper we have not focused on the "-dynamic" part of plerodynamics, but the underlying theory motivating the approach taken here is very closely aligned with the macroscopic description of thermodynamic systems in terms of conservation and interconversion of heat and work. As we have noted previously, there is, however, no direct mapping of the basic thermodynamic variables (volume, pressure, mechanical work, number of particles) onto corresponding statistical variables. For example, the number of observations n on the statistical side does not function in our EqS as the analogue of the number of particles in physics; rather, n appears to be part of the description of the statistical system's information "energy," rather than its size. (See [5] for discussion of this issue.) Therefore, perhaps counter-intuitively, we do not expect to see a simple alignment of plerodynamics with statistical mechanical (microscopic) descriptions of physical systems, even in the event that we are ultimately able to consolidate the theory under the umbrella family of Boltzmann distributions.
It remains an open question how deep the connection between plero- and thermo-dynamics really runs. Our discovery here that a simple revision to the ideal gas equation of state solves one of the difficulties we have faced until now - our inability to generalize from a one-sided to a two-sided hypothesis contrast - goes some distance towards vindicating our original co-opting of that particular equation of state for statistical purposes. The further discovery that the revised equation is identical in form to the Van der Waals equation surely takes us some distance further.

Appendix 1: Maximum log LR plays the role of entropy, not evidence
We consider here the idea of the maximum log LR as an entropy term. We continue to restrict attention to the binomial likelihood in θ under the HCs considered in the main text. We note first that the maximum log LR is equivalent to a particular form of KL divergence (KLD), where

KLD(θa, θb) = Eθa[log L(θa)/L(θb)].

If we now evaluate the KLD at θa = θ̂ = x/n and θb = θ̂2, as in the main text, we have what we call the observed KL divergence ("observed" because the expectation is taken with respect to a probability distribution based on the data), which is equal to the log of the maximum likelihood ratio (MLR):

KLD_OBS = n[θ̂ log(θ̂/θ̂2) + (1 − θ̂) log((1 − θ̂)/(1 − θ̂2))] = log MLR.

Note that Kullback [12] and others [19] treat Kullback-Leibler divergence as a key quantity in an entropy-based inferential framework (see also [5]), while the MLR or its logarithm is sometimes interpreted as the statistical evidence for or against some simple alternative value [9,20]. In our framework, the log MLR functions as the entropy term S in (2.1).
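The identity KLD_OBS = log MLR is easy to verify numerically for the binomial. The following sketch is our own (with theta2 standing in for θ̂2, as for a simple H2); it computes both sides for an arbitrary data set:

```python
import math

def kld_obs(n, x, theta2):
    """Observed KL divergence: n * KL(x/n || theta2) for the binomial."""
    p = x / n
    def term(a, b):
        # a*log(a/b), with the usual convention 0*log(0/b) = 0
        return 0.0 if a == 0.0 else a * math.log(a / b)
    return n * (term(p, theta2) + term(1 - p, 1 - theta2))

def log_mlr(n, x, theta2):
    """log [ L(theta_hat) / L(theta2) ] with theta_hat = x/n."""
    def loglik(t):
        out = 0.0
        if x > 0:
            out += x * math.log(t)
        if n - x > 0:
            out += (n - x) * math.log(1 - t)
        return out
    return loglik(x / n) - loglik(theta2)
```

For example, with n = 10, x = 7, and theta2 = ½, both functions return approximately 0.823, as the identity requires.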
There are several reasons why the MLR (or its logarithm) cannot be an evidence measure.
First, as noted in the main text, the MLR violates important BBPs. In particular, when more maximization is done in the numerator than in the denominator, MLR ≥ 1, and it cannot indicate evidence in favor of H2 or accumulate increasing evidence in favor of H2 as a function of increasing n, which violates elements of BBP(i)-(iii). Additionally, for fixed x/n, the MLR itself increases exponentially in n, while the log MLR increases linearly in n, both of which violate BBP(iv). (Indeed, the simple vs. simple LR itself, which is sometimes used to define the evidence [9,20,21], violates BBP(iv).) Yet there is clearly a reason why the MLR seems to function as a reasonably good proxy for an evidence measure under many circumstances.
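The linear growth of the log MLR in n for fixed x/n, which violates the diminishing-returns pattern of BBP(iv), can be seen directly. The snippet below is our own sketch (log_mlr is a hypothetical helper, not notation from the paper); it doubles the data while holding x/n fixed:

```python
import math

def log_mlr(n, x, theta2=0.5):
    """log maximum LR against the simple value theta2, with theta_hat = x/n."""
    def loglik(t):
        out = 0.0
        if x_cur > 0:
            out += x_cur * math.log(t)
        if n_cur - x_cur > 0:
            out += (n_cur - x_cur) * math.log(1 - t)
        return out
    n_cur, x_cur = n, x
    return loglik(x / n) - loglik(theta2)

# Doubling (n, x) exactly doubles the log MLR: no diminishing returns.
print(log_mlr(10, 7), log_mlr(20, 14))  # the second is exactly twice the first
```

An evidence measure satisfying BBP(iv) would instead show a smaller increment from the second batch of data than from the first.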
We interpret the log MLR as the difference in information provided by the data for H1 vs. H2.
As a general rule, as information (in an informal sense) goes up, so too does evidence. But information and evidence also must be distinguished, in the sense that increasing the amount of information might reduce the evidence for bias, if the more we toss the coin the closer x/n moves towards ½. Apparently evidence requires us to take account of information in the sense of KLD_OBS, or equivalently, in the form of the MLR, but not only information in this sense.

Appendix 2: Calculation of b
Here we describe the rationale for setting the constant b as it appears in the main text. The BBPs impose severe constraints on the set of available solutions for b, and it appears that there is little leeway in choosing a functional form for b that allows us to express E for both Class II(a) and Class II(b) through a single EqS while maintaining the BBPs.
By experimentation (informed trial and error), we arrived at the following definition, which incorporates two rate constants: r1, which controls the curvature of b over Θ2; and r2, which controls the baseline value of b at the boundaries of this region, that is, at the points θ2l, θ2r. Let the value of b at these points be b(θ2l) = b(θ2r). In the defining expression, j = (l, r) indexes the boundaries, and g is the linear function connecting the points b(θ2l) and 0 (on the left) or b(θ2r) and 1 (on the right).
We set r1 = 2 − [θ2r − θ2l], so that the curvature of b depends on the width of the Θ2 interval.
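The interaction between r1 and the constraint on r2 stated just below (½ ≤ r1 − ½r2 ≤ ¾, with r2 = 2r1 − ½(2 + [θ2r − θ2l])) reduces to simple arithmetic. The check below is our own; w denotes the interval width θ2r − θ2l.

```python
def r1(w):
    # w = theta_2r - theta_2l, the width of the Theta_2 interval
    return 2.0 - w

def r2(w):
    # r2 = 2*r1 - (1/2)*(2 + w), as in the text
    return 2.0 * r1(w) - 0.5 * (2.0 + w)

# r1 - r2/2 simplifies to 1/2 + w/4, which lies in [1/2, 3/4] for w in [0, 1]
for w in [0.0, 0.2, 0.5, 1.0]:
    val = r1(w) - 0.5 * r2(w)
    assert 0.5 <= val <= 0.75
```

So the stated choice of r2 keeps the constrained quantity inside [½, ¾] for every admissible interval width.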
We found that we needed to constrain r2 such that ½ ≤ (r1 − ½r2) ≤ ¾. Thus we used r2 = 2r1 − ½(2 + [θ2r − θ2l]). Figure 7 shows b and V for various Θ2 for n = 50.

Acknowledgments

This work was supported by a grant from the W.M. Keck Foundation. We thank Bill Stewart and Susan E. Hodge for helpful discussion, and we are indebted to Dr. Hodge for her careful reading of earlier drafts of this manuscript and for her many helpful comments, which led to substantial improvements in the paper.