The Measurement of Statistical Evidence as the Basis for Statistical Reasoning

There are various approaches to the problem of how one is supposed to conduct a statistical analysis. Different analyses can lead to contradictory conclusions in some problems so this is not a satisfactory state of affairs. It seems that all approaches make reference to the evidence in the data concerning questions of interest as a justification for the methodology employed. It is fair to say, however, that none of the most commonly used methodologies is absolutely explicit about how statistical evidence is to be characterized and measured. We will discuss the general problem of statistical reasoning and the development of a theory for this that is based on being precise about statistical evidence. This will be shown to lead to the resolution of a number of problems.


Introduction
There are a variety of approaches to conducting statistical analyses and most seem well-motivated from an intuitive point-of-view. If, in any particular context, these all led to approximately the same results, then this wouldn't be a problem. Unfortunately, it is possible that mutually contradictory results can be produced and the Jeffreys-Lindley paradox is a well-known example of this phenomenon. Given that these situations can arise, and with very simple problems, it can lead one to wonder if there are any true foundations for the subject. With the increasing importance of, what we will call here, statistical reasoning in almost all areas of science, this ambiguity can be seen as an important problem to resolve.
It is interesting to note that virtually all approaches make reference to the "evidence in the data" or even to the concept of "statistical evidence" itself. For example, p-values are considered as measuring the evidence against a hypothesis although serious doubts have been raised about their suitability in this regard. Similarly, likelihood ratios are considered as measuring the evidence in the data supporting the truth of one value versus supporting the truth of another and the Bayes factor is considered as a measure of statistical evidence. There are also treatments that recognize the concept of statistical evidence as a central aspect of statistical reasoning as with Birnbaum (1964), Shafer (1976), Royall (1997), Thompson (2007), Aitkin (2010), Morey, Romeijn, and Rouder (2016) and Vieland and Seok (2016). There is also a long tradition of the consideration of evidence in the philosophy of science, sometimes called confirmation theory, with Salmon (1973) being a very accessible summary and Achinstein (2001) being a more recent treatment.
While much of this literature has many relevant things to say about statistical evidence, it is fair to say that there is no treatment that gives an unambiguous definition of what statistical evidence is together with developing a satisfactory theory of statistical reasoning based on this. Evans (2015) contains a summary of a number of publications with co-authors that have attempted to deal with this problem. This paper motivates and describes that theory and provides a discussion of its advantages as well as its limitations.
As far as its benefits go, the theory provides a resolution of a number of problems and this is illustrated via specific examples such as the Jeffreys-Lindley paradox. It is also noted that the theory can be seen to contribute to a resolution of a well-known divide in the statistical community between frequentism and Bayesianism. In essence frequentism is concerned with design: what are the consequences of the possible data sets that could be obtained and how do we control for these effects? By contrast Bayesianism is concerned with inference: given the observed data, what are the answers to specific questions posed? So frequentism arises when the merits of a study are being assessed and this surely has an impact on the acceptability of the inferences produced via a Bayesian analysis.
While there are achievements of the approach taken here to statistical reasoning, there are also limitations. Simply put, it is not the case that the theory produces answers that are acceptable in all contexts where one might deem statistical reasoning as appropriate. But such situations typically fall outside the boundaries of what we might consider as acceptable statistical problems and so require various compromises. For example, suppose one is required to estimate $(\mu, \sigma^2)$ based on the model $\{N(\mu, \sigma^2) : \mu \in R^1, \sigma^2 > 0\}$ and a sample x of size n = 1 is provided. The estimate produced here is (x, 0) and one could argue that this is absurd unless it is known categorically that there is no variability. Clearly, however, this can't be considered as a context where the success of a theory is to be assessed. In such a situation some kind of compromise is required, e.g., basing the estimate of $\sigma^2$ on a priori beliefs about this quantity. Such compromises can be seen as violating some very basic principle of sound statistical reasoning, e.g., falsifiability of the model and prior here. This points to the need to be careful when identifying what the core problems are to which the theory is to be applied. Otherwise, the possibility of obtaining an acceptable theory seems limited and we would be left with the impression that the subject lacks a foundation. This isn't an argument against compromises, but only that these be made when necessary and that such compromises be clearly identified. In some ways this is similar in spirit to the distinction made by Mosteller and Tukey (1977) between exploratory and confirmatory analyses. The concern here is definitely with the confirmatory aspect and, moreover, with situations where design plays a role through ensuring data is collected properly and an adequate amount is collected according to various criteria subsequently discussed.
Even further restrictions are undoubtedly necessary to identify core statistical problems and some discussion on this will be provided. It is worth noting here too the papers Brown et al. (2002) and Ripamonti et al. (2017) that discuss appropriate confidence and testing inferences for some binomial contexts. So it is surely not correct to suggest that such problems are well-settled even in the simplest of problems. When compromises are required it seems natural to try and stay as close to a "gold standard" as possible, so it is necessary to identify what that is.

The Purpose of a Theory of Statistical Reasoning
To start it is necessary to ask: what is the purpose of statistics as a subject or what is its role in science? This is identified here as providing a theory that gives answers to two questions based on observed data x and some object of interest Ψ.
E: What is a suitable value ψ(x) for Ψ together with an assessment of the accuracy of this estimate?
H: Is there evidence that a specified value ψ 0 for Ψ is true or false and how strong is this evidence?
The requirement is placed on such a theory that the answers to E and H be based on a clear characterization of the statistical evidence in the data relevant to Ψ. If the data x was enough to answer these questions unambiguously that would undoubtedly be best, but it seems that more is needed. There are two aspects of this that need discussion, referenced hereafter as the Ingredients and Statistical Inference. These are to be taken as playing the same roles in a statistical argument as the Premises and Rules of Inference play in a logical argument or proof. A logical argument is valid whenever the rules of inference, such as modus ponens, have been applied correctly to the premises to derive a conclusion, but it is only a sound argument when it is valid and the premises are consistent and true. An important aspect of a logical argument, see Kneale and Kneale (1962), is that these two aspects are not confounded. So the rules of inference are not changed based upon the premises and, further, in applying the rules of inference, one is not concerned with the truth of the premises. The validity of the argument and the truth of the premises are separate issues. A similar position is adopted here.

The Ingredients
There are various criteria that the ingredients necessary for a theory of statistical reasoning should satisfy. A partial list is as follows.
I 1 . The minimum number of ingredients is required for a characterization of statistical evidence so that E and H can be answered.
I 2 . The chosen ingredients are such that the bias in these choices can be assessed and controlled.
I 3 . Each element specified needs to be falsifiable via the data.
. . .
It seems clear that in the majority of statistical problems the ingredients are chosen and are not specified by the application. As such the ingredients are subjective and that would seem to violate a fundamental aspect of science, namely, the goal of objectivity. The appropriateness of this is recognized here but rather as an ideal that is not strictly achievable but approached by dealing with the ingredients appropriately. I 2 suggests that it is possible to choose the ingredients in such a way that they bias the answers to E and H. This is indeed the case, but as long as the bias can be assessed and controlled, this need not be of great concern. I 3 is yet another way of dealing with the inherent subjectivity in a statistical analysis. For indeed it is reasonable to argue that the chosen ingredients are never correct. In essence the role of the ingredients is as devices that allow for a characterization of evidence so that the rules of inference can be applied to answer E and H. Their "correctness" is irrelevant unless the choices made are contradicted via the objective, when collected correctly, data x. The gold standard requires falsifiability of all ingredients to help place statistical reasoning firmly within the realm of science. Compromises to this need to be explicitly noted and only used when necessary.
So for a theory of statistical reasoning we need to state what the ingredients are, how to check for their reasonableness and specify the rules of statistical inference. The following ingredients, beyond the data x, are necessary for the developments here.

Ingredients
Model: $\{f_\theta : \theta \in \Theta\}$, a collection of conditional probability distributions for $x \in X$ given θ such that the object of interest ψ = Ψ(θ) is specified by the true distribution that gave rise to x.
Prior: π, a probability distribution on Θ.
Delta: δ, the difference that matters, so that $dist(\psi_1, \psi_2) \le \delta$, for some distance measure dist, means that $\psi_1$ and $\psi_2$ are for practical purposes indistinguishable.
So the model and prior specify a joint probability distribution for ω = (θ, x) ∼ π(θ)f θ (x). As such all uncertainties are handled within the context of probability which is interpreted here as measuring belief, no matter how the probabilities are assigned.
The role of δ will be subsequently discussed but it raises an interesting and relevant point concerning the role of infinities in statistical modelling. The position taken here is that whenever infinities appear their role is as approximations as expressed via limits and they do not represent reality. For example, data arises via a measurement process and as such all data is measured to a finite accuracy. So data is discrete and moreover sample spaces are bounded as measurements cannot be arbitrarily large/small positively or negatively. So whenever continuous probability distributions are used these are considered as approximations to essentially finite distributions. There are a variety of opinions about the use of infinity among physicists, mathematicians, statisticians, etc., but little is lost by taking this view of things here and there is a substantial gain for the theory through the avoidance of anomalies.
The question now is how to choose the model and the prior. Unfortunately, we are somewhat silent about general principles for choosing a model but, when it comes to the prior, our position is that it should be chosen via an elicitation algorithm which explains why the prior in question has been chosen. An inability to come up with a suitable elicitation suggests a lack of sufficient understanding of the scientific problem or an inappropriate choice of model where the real world meaning of θ is somewhat unclear. The existence of a suitable elicitation algorithm could be viewed as a necessity to place a context within the gold standard but, given the way models are currently chosen, we do not adopt that position. Still, whatever approach is taken to choosing π, it is subject to I 2 and I 3 . As will be seen, the implementation of I 2 requires the characterization of statistical evidence and so discussion of this is delayed until Section 6, while I 3 is addressed next.

Checking the Ingredients
If the observed data x is surprising for (i.e., lies in the "tails" of) each distribution in $\{f_\theta : \theta \in \Theta\}$, then this suggests a problem with the model, and otherwise the model is at least acceptable. There are a number of approaches available for checking the model and this isn't discussed further here, although the model is undoubtedly the most important ingredient chosen.
To be a bit more formal note that, when T is a minimal sufficient statistic for the model, then the joint factors as
\[ \pi(\theta) f_\theta(x) = \pi(\theta \mid T(x))\, m_T(T(x))\, f(x \mid T(x)) \]
where $f(\cdot \mid T(x))$ is a probability distribution, independent of θ, available for model checking, $m_T$ is the prior (predictive) distribution of T, available for checking the prior, while $\pi(\cdot \mid T(x))$ is the posterior of θ and provides probabilities for the inference step. Note that Box (1980) proposed using the prior (predictive) distribution of x for jointly checking the model and prior but, to ascertain where a problem exists when it does, it is more appropriate to split this into checking the model first and then, if the model passes its checks, checking the prior.
A prior fails when the true value lies in its tails. Evans and Moshonov (2006) proposed using the tail probability
\[ M_T(m_T(t) \le m_T(T(x))) \qquad (1) \]
for this purpose. This was generalized to include conditioning on ancillary statistics, to remove variation irrelevant to checking the prior, and also to further factorizations of $m_T(T(x))$ that allow for checking of individual components of the prior, to try and isolate a problem when one exists. In Evans and Jang (2011a) it is shown that (1) converges to $\Pi(\pi(\theta) \le \pi(\theta_{true}))$ as the amount of data increases and so the tail probability is a consistent check on the prior. Further refinements can be proposed to deal with invariance and Nott et al. (2018) generalizes (1) to provide a fully invariant check that connects nicely with the measure of evidence used for inference. It is shown in Al Labadi and Evans (2017) that, when prior-data conflict exists, then inferences can be very sensitive to perturbations of the prior. What does one do when a prior fails? Evans and Jang (2011b) provides an approach to defining what is meant by weakly informative priors based on an idea in Gelman et al. (2008). In essence a definition is provided for what it means for one prior to be weakly informative with respect to another, and this is quantified in terms of fewer prior-data conflicts being expected. This does not involve the observed data. One can conceive of a base elicited prior and a hierarchy of successively more weakly informative priors. So a prior for which conflict is detected can be replaced by one that is more weakly informative, and one can progress up the hierarchy until conflict is avoided. The point here is that the final prior is not, strictly speaking, dependent on the data.
There are many approaches to prior-data conflict discussed in the literature and this seems to be, like model checking, the kind of problem for which there isn't a definitive methodology. The significance of the work cited, however, is that it presents a package for dealing with the reasonableness of the prior as part of a theory of statistical reasoning.
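As a concrete illustration of a tail-probability check on the prior, the following minimal sketch assumes the location normal model with known variance, for which the sample mean is minimal sufficient and its prior predictive distribution under a N(µ*, τ*²) prior is N(µ*, τ*² + σ0²/n). Because this density is symmetric and unimodal, the prior predictive probability of a value of T whose density is no greater than that at the observed mean reduces to a two-sided normal tail probability. The function and argument names are hypothetical and are not taken from the cited papers.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prior_conflict_tail_prob(xbar, n, sigma0, mu_star, tau_star):
    """Tail probability M_T(m_T(t) <= m_T(xbar)) for the location normal model.

    Assumes x_1..x_n i.i.d. N(mu, sigma0^2) with a N(mu_star, tau_star^2) prior,
    so that the prior predictive of xbar is N(mu_star, tau_star^2 + sigma0^2/n).
    """
    pred_sd = math.sqrt(tau_star**2 + sigma0**2 / n)
    z = abs(xbar - mu_star) / pred_sd
    return 2.0 * (1.0 - normal_cdf(z))

# An observed mean near the prior mean gives no indication of conflict,
# while one far out in the tails of the prior predictive does.
print(prior_conflict_tail_prob(xbar=0.3, n=10, sigma0=1.0, mu_star=0.0, tau_star=1.0))
print(prior_conflict_tail_prob(xbar=5.0, n=10, sigma0=1.0, mu_star=0.0, tau_star=1.0))
```

A small value of this probability indicates that the observed value of T is surprising under the prior predictive, which is the sense in which the prior is said to conflict with the data.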

The Rules of Statistical Inference
There are three rules of inference that are used to determine inferences. These are stated for a probability model (Ω, F, P). Suppose interest is in whether or not the event A ∈ F is true after observing C ∈ F and it is supposed that both P(A) > 0 and P(C) > 0, with the null case handled via taking appropriate limits.
R 1 . Principle of conditional probability: beliefs about A, as expressed initially by P(A), are replaced by P(A | C).
R 2 . Principle of evidence: the observation of C is evidence in favor of A when P(A | C) > P(A), is evidence against A when P(A | C) < P(A) and is evidence neither for nor against A when P(A | C) = P(A).
R 3 . The evidence is measured quantitatively by the relative belief ratio
\[ RB(A \mid C) = \frac{P(A \mid C)}{P(A)}. \]
While R 1 doesn't seem controversial, its strict implementation in Bayesian contexts demands proper priors and priors that do not depend on the data. R 2 also seems quite natural and, as will be seen, really represents the central core of our approach to statistical reasoning. R 3 is perhaps not as obvious, but it is clear that RB(A | C) > 1 (< 1, = 1) indicates that evidence in favor of (against, neither for nor against) A has been obtained. In fact the relative belief ratio only plays a role when it is necessary to order alternatives.
There are other valid measures of evidence in the sense that they have a cut-off that determines evidence in favor or against, as specified by R 2 , and can be used to order alternatives, see the discussion in Evans (2015). The relative belief ratio, however, has some advantages, such as invariance under reparameterizations when continuous models are being used (see Section 7). As will be seen, any increasing function of RB leads to identical inferences. Other possible valid measures include the difference P(A | C) − P(A), with cut-off 0, or the Bayes factor $BF(A \mid C) = RB(A \mid C)/RB(A^c \mid C)$, with cut-off 1. As indicated, the Bayes factor can be defined in terms of the relative belief ratio but not conversely. Since RB(A | C) > 1 iff $RB(A^c \mid C) < 1$, the Bayes factor isn't really a comparison of the evidence for A with the evidence for its negation. Furthermore, in the continuous case, when RB and BF are defined as limits as below, they are equal. So RB is preferred to BF. Now consider the Bayesian context. When interest is in ψ = Ψ(θ), then generally the relative belief ratio at a value ψ equals
\[ RB_\Psi(\psi \mid x) = \lim_{\epsilon \to 0} \frac{\Pi_\Psi(N_\epsilon(\psi) \mid x)}{\Pi_\Psi(N_\epsilon(\psi))} = \frac{\pi_\Psi(\psi \mid x)}{\pi_\Psi(\psi)} \]
where $N_\epsilon(\psi)$ is a sequence of neighbourhoods of ψ converging nicely to {ψ} as ε → 0 and the equality on the right holds whenever the prior density $\pi_\Psi$ of Ψ is positive and continuous at ψ. This definition is motivated by the treatment of contexts where infinities arise.
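To make the comparison of measures of evidence concrete, the following minimal sketch computes, for an event A with prior probability P(A) and posterior probability P(A | C), the relative belief ratio P(A | C)/P(A), the Bayes factor taken as the ratio of the relative belief ratios of A and its complement, and the difference P(A | C) − P(A). It then checks that all three agree on the direction of the evidence, as the principle of evidence requires. The function names and the numerical values are illustrative assumptions.

```python
def relative_belief(p_post, p_prior):
    """RB(A | C) = P(A | C) / P(A), with cut-off 1."""
    return p_post / p_prior

def bayes_factor(p_post, p_prior):
    """BF(A | C) = RB(A | C) / RB(A^c | C), i.e. posterior odds over prior odds."""
    return relative_belief(p_post, p_prior) / relative_belief(1.0 - p_post, 1.0 - p_prior)

def difference(p_post, p_prior):
    """P(A | C) - P(A), with cut-off 0."""
    return p_post - p_prior

def direction(value, cutoff):
    """Classify a value of a measure of evidence relative to its cut-off."""
    if value > cutoff:
        return "in favor"
    if value < cutoff:
        return "against"
    return "neither"

# (prior, posterior) pairs chosen only for illustration
for p_prior, p_post in [(0.2, 0.5), (0.5, 0.3), (0.4, 0.4)]:
    rb = relative_belief(p_post, p_prior)
    bf = bayes_factor(p_post, p_prior)
    diff = difference(p_post, p_prior)
    print(p_prior, p_post, direction(rb, 1.0), direction(bf, 1.0), direction(diff, 0.0))
```

The three measures can order alternatives differently, but they never disagree about whether evidence in favor or against has been obtained.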

Problem E
Suppose that the range of possible values for ψ = Ψ(θ) is also denoted by Ψ. Then R 3 determines the relative belief estimate as
\[ \psi(x) = \arg\sup_{\psi \in \Psi} RB_\Psi(\psi \mid x) \]
as this maximizes the evidence in favor, and note that $\sup_{\psi \in \Psi} RB_\Psi(\psi \mid x) \ge 1$ always and generally the inequality is strict. To measure the accuracy of ψ(x) there are a number of possibilities but the plausible region
\[ Pl_\Psi(x) = \{\psi : RB_\Psi(\psi \mid x) > 1\}, \]
the set of ψ values where there is evidence in favor of the value being true, is surely central to this. If the "size", such as volume or prior content, of $Pl_\Psi(x)$ is small and its posterior content $\Pi_\Psi(Pl_\Psi(x) \mid x)$ is large, then this suggests an accurate estimate has been obtained.
There are several notable aspects of this. First, in the continuous case, the methodology is invariant. So if instead interest is in λ = Λ(Ψ(θ)), where Λ is 1-1 and smooth, then λ(x) = Λ(ψ(x)) and $Pl_\Lambda(x) = \Lambda(Pl_\Psi(x))$. Second, while ψ(x) can also be thought of as the MLE from an integrated likelihood, that approach does not lead to $Pl_\Psi(x)$ because likelihoods do not define evidence for specific values. Third, and perhaps most significant, the set $Pl_\Psi(x)$ is completely independent of how evidence is measured quantitatively. In other words, if any valid measure of evidence is used, then the same set $Pl_\Psi(x)$ is obtained. This has the happy consequence that, however we choose to estimate Ψ via an estimator that respects the principle of evidence, effectively the same quantification of error is obtained. This points to the possibility of using some kind of smoothing operation on ψ(x) to produce values that lie in $Pl_\Psi(x)$ when this is considered necessary. It is also possible to use, as part of measuring the accuracy of ψ(x), a γ-relative belief credible region for Ψ, namely, $C_{\Psi,\gamma}(x) = \{\psi : RB_\Psi(\psi \mid x) \ge c_\gamma(x)\}$, where $c_\gamma(x)$ is the largest constant for which the posterior content of the region is at least γ. It is necessary, however, that $\gamma \le \Pi_\Psi(Pl_\Psi(x) \mid x)$, as otherwise $C_{\Psi,\gamma}(x)$ contains values for which there is evidence against and so would not accurately represent the evidence in the data.
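The estimation rules can be illustrated on a small discrete problem. The sketch below assumes a binomial model with the success probability θ restricted to the grid 0.1, 0.2, ..., 0.9 and a uniform prior on that grid; these choices, the data and the function names are hypothetical. It computes the relative belief ratio at each grid value, the relative belief estimate (the maximizer), the plausible region (values with ratio greater than 1) and the posterior content of that region.

```python
import math

def binomial_pmf(x, n, theta):
    """Probability of x successes in n trials with success probability theta."""
    return math.comb(n, x) * theta**x * (1.0 - theta)**(n - x)

def relative_belief_estimation(thetas, prior, x, n):
    """Return (estimate, plausible region, posterior content of the region)."""
    joint = [p * binomial_pmf(x, n, t) for t, p in zip(thetas, prior)]
    m_x = sum(joint)                               # prior predictive at x
    post = [j / m_x for j in joint]                # posterior probabilities
    rb = [q / p for q, p in zip(post, prior)]      # relative belief ratios
    best = max(range(len(thetas)), key=lambda i: rb[i])
    plausible = [t for t, r in zip(thetas, rb) if r > 1.0]
    content = sum(q for q, r in zip(post, rb) if r > 1.0)
    return thetas[best], plausible, content

thetas = [i / 10 for i in range(1, 10)]
prior = [1.0 / len(thetas)] * len(thetas)          # uniform prior (an assumption)
estimate, plausible, content = relative_belief_estimation(thetas, prior, x=7, n=10)
print(estimate, plausible, round(content, 3))
```

With only ten observations the plausible region covers several grid values, so even though its posterior content is fairly high the estimate cannot be regarded as very accurate.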
The following example illustrates that probabilities do not measure evidence, although this is a common misconception.

Example 1. Prosecutor's Fallacy
Assume a uniform probability distribution on a population of size N of which some member has committed a crime. DNA evidence has been left at the crime scene and suppose this trait is shared by m ≪ N of the population. A prosecutor is criticized because they conclude that, because the trait is rare and a particular member possesses the trait, then they are guilty. In fact they misinterpret P("has trait" | "guilty") = 1 as the probability of guilt, which is P("guilty" | "has trait") = 1/m and this is small if m is large. But this probability does not reflect the evidence of guilt. For, if you have the trait, then clearly this is evidence in favor of guilt. Note that
\[ RB(\text{"guilty"} \mid \text{"has trait"}) = \frac{P(\text{"guilty"} \mid \text{"has trait"})}{P(\text{"guilty"})} = \frac{1/m}{1/N} = \frac{N}{m} > 1, \]
\[ RB(\text{"not guilty"} \mid \text{"has trait"}) = \frac{P(\text{"not guilty"} \mid \text{"has trait"})}{P(\text{"not guilty"})} = \frac{(m-1)/m}{(N-1)/N} < 1, \]
and Pl("has trait") = {"guilty"} with posterior content 1/m. So there is evidence of guilt but it is weak whenever m is large. Note too that the MAP estimate is "not guilty". So the prosecutor is correct that there is evidence of guilt and, because in general RB(A | C) = RB(C | A), this conclusion would have been supported if they had used P("has trait" | "guilty") correctly as part of a relative belief ratio, but of course the strength measurement would have been incorrect. A general weakness of MAP/HPD inferences is their lack of invariance under reparameterizations. Example 1 shows, however, that these inferences are generally inappropriate because they do not express evidence properly. Example 1 also demonstrates a distinction between decisions and inferences. Clearly when m is large there should not be a conviction on the basis of weak evidence. But suppose that "guilty" corresponds to being a carrier of a highly infectious deadly disease and "has trait" corresponds to some positive (but not definitive) test for this; then the same numbers should undoubtedly lead to a quarantine.
In essence a theory of statistical reasoning should tell us what the evidence says and decisions are made, partly on this basis, but employing many other criteria as well.
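The arithmetic of Example 1 can be checked directly. The sketch below uses exact fractions and hypothetical values of N and m; it computes the relative belief ratios of "guilty" and "not guilty" given "has trait", the posterior probability of guilt (which is also the posterior content of the plausible region), and confirms the symmetry RB(A | C) = RB(C | A).

```python
from fractions import Fraction

def prosecutor_evidence(N, m):
    """Evidence about guilt for a population of size N in which m share the trait."""
    prior_guilty = Fraction(1, N)        # uniform prior over the population
    post_guilty = Fraction(1, m)         # P("guilty" | "has trait")
    p_trait = Fraction(m, N)             # P("has trait")
    return {
        "rb_guilty": post_guilty / prior_guilty,
        "rb_not_guilty": (1 - post_guilty) / (1 - prior_guilty),
        # P("has trait" | "guilty") = 1, so RB("has trait" | "guilty") = 1 / P("has trait")
        "rb_trait_given_guilty": Fraction(1) / p_trait,
        "post_guilty": post_guilty,
    }

result = prosecutor_evidence(N=1_000_000, m=100)   # illustrative numbers only
for name, value in result.items():
    print(name, value, float(value))
```

For these numbers the relative belief ratio for guilt is large, so there is evidence in favor of guilt, yet the posterior probability of guilt is only 1/m, which is why the evidence is weak and the MAP estimate is "not guilty".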

Problem H
It is immediate that $RB_\Psi(\psi_0 \mid x)$ is the evidence concerning $H_0 : \Psi(\theta) = \psi_0$. The evidential ordering implies that the smaller $RB_\Psi(\psi_0 \mid x)$ is than 1, the stronger the evidence is against $H_0$ and the bigger it is than 1, the stronger the evidence is in favor of $H_0$. But how is one to measure this strength? In Baskurt and Evans (2013) it is proposed to measure the strength of the evidence via
\[ \Pi_\Psi(RB_\Psi(\psi \mid x) \le RB_\Psi(\psi_0 \mid x) \mid x) \qquad (2) \]
which is the posterior probability that the true value of ψ has evidence no greater than that obtained for the hypothesized value $\psi_0$. When $RB_\Psi(\psi_0 \mid x) < 1$ and (2) is small, then there is strong evidence against $H_0$ since there is a large posterior probability that the true value of ψ has a larger relative belief ratio. Similarly, if $RB_\Psi(\psi_0 \mid x) > 1$ and (2) is large, then there is strong evidence that the true value of ψ is given by $\psi_0$ since there is a large posterior probability that the true value is in $\{\psi : RB_\Psi(\psi \mid x) \le RB_\Psi(\psi_0 \mid x)\}$ and $\psi_0$ maximizes the evidence in this set. The strength measurement here results from comparing the evidence for $\psi_0$ with the evidence for each of the other possible ψ values. This is appropriate provided the values are all referencing similar objects. Note that, using Markov's inequality,
\[ \Pi_\Psi(RB_\Psi(\psi \mid x) \le RB_\Psi(\psi_0 \mid x) \mid x) \le RB_\Psi(\psi_0 \mid x) \]
and so a very small value of $RB_\Psi(\psi_0 \mid x)$ is immediately strong evidence against $H_0$. Also, when there is evidence in favor, then $\psi_0 \in Pl_\Psi(x)$ and so the size and posterior content of this set also provide an indication of the strength of the evidence.
When H 0 is false, then RB Ψ (ψ 0 | x) converges to 0 as does (2). When H 0 is true, then RB Ψ (ψ 0 | x) converges to its largest possible value (greater than 1 and often ∞) and, in the discrete case (2) converges to 1. In the continuous case, however, when H 0 is true, then (2) typically converges to a U (0, 1) random variable. This is easily resolved by requiring that a deviation δ > 0 be specified such that if dist(ψ 1 , ψ 2 ) < δ, where dist is some measure of distance determined by the application, then this difference is to be regarded as immaterial. This leads to redefining H 0 as H 0 = {ψ : dist(ψ, ψ 0 ) < δ} and typically a natural discretization of Ψ exists with H 0 as one of its elements. With this modification (2) converges to 1 as the amount of data increases when H 0 is true. Some discussion on determining a relevant δ can be found in Al-Labadi, Baskurt and Evans (2017) and Evans, Guttman and Li (2017). Typically the incorporation of such a δ makes computations easier as then there is no need to estimate densities when these are not available in closed form.
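The effect of discretizing with δ can be sketched as follows. The example assumes a binomial model with a uniform prior on θ in (0, 1), the hypothesized value 0.5 and δ = 0.05, so that the hypothesis becomes the cell [0.45, 0.55) of a partition of the unit interval. The relative belief ratio of a cell is its posterior probability divided by its prior probability, and the strength is the posterior probability of all cells whose ratio is no greater than that of the hypothesized cell. The partition, the midpoint-rule integration and all names are illustrative assumptions rather than the computations used in the cited papers.

```python
import math

def cell_posteriors(edges, x, n, steps=2000):
    """Posterior probability of each cell under a uniform prior on theta,
    using a midpoint rule on the kernel theta^x (1 - theta)^(n - x)."""
    def integrate(a, b):
        h = (b - a) / steps
        total = 0.0
        for k in range(steps):
            t = a + (k + 0.5) * h
            total += t**x * (1.0 - t)**(n - x)
        return h * total
    masses = [integrate(a, b) for a, b in zip(edges[:-1], edges[1:])]
    norm = sum(masses)
    return [mass / norm for mass in masses]

def evidence_and_strength(edges, x, n, h0_cell):
    """Relative belief ratio of the hypothesized cell and its strength."""
    prior = [b - a for a, b in zip(edges[:-1], edges[1:])]   # uniform prior
    post = cell_posteriors(edges, x, n)
    rb = [q / p for q, p in zip(post, prior)]
    rb0 = rb[h0_cell]
    strength = sum(q for q, r in zip(post, rb) if r <= rb0)
    return rb0, strength

# Cells [0, 0.05), [0.05, 0.15), ..., [0.85, 0.95), [0.95, 1]; index 5 is [0.45, 0.55)
edges = [0.0] + [0.05 + 0.1 * k for k in range(10)] + [1.0]
H0_CELL = 5

print(evidence_and_strength(edges, x=50, n=100, h0_cell=H0_CELL))  # data consistent with H0
print(evidence_and_strength(edges, x=70, n=100, h0_cell=H0_CELL))  # data inconsistent with H0
```

Only probabilities of cells are needed here, so no density estimation is required, and in the discretized problem the strength can reach 1 when the hypothesized cell has the largest relative belief ratio.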
The following simple example demonstrates the need to calibrate the measure of evidence.

Example 2. Location normal.
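Suppose $x = (x_1, \ldots, x_n)$ is i.i.d. $N(\mu, \sigma_0^2)$ with $\sigma_0^2$ known, π is a $N(\mu_*, \tau_*^2)$ prior and the hypothesis is $H_0 : \mu = \mu_0$. The posterior of µ is $N(\mu_x, \sigma_x^2)$ with $\sigma_x^2 = (n/\sigma_0^2 + 1/\tau_*^2)^{-1}$ and $\mu_x = \sigma_x^2 (n\bar{x}/\sigma_0^2 + \mu_*/\tau_*^2)$, so the relative belief ratio, as the ratio of the posterior density to the prior density at $\mu_0$, is
\[ RB(\mu_0 \mid x) = \left(1 + \frac{n\tau_*^2}{\sigma_0^2}\right)^{1/2} \exp\left\{ -\frac{(\mu_0 - \mu_x)^2}{2\sigma_x^2} + \frac{(\mu_0 - \mu_*)^2}{2\tau_*^2} \right\} \]
which, in this case, is the same as the Bayes factor for $\mu_0$ obtained via Jeffreys' mixture approach. From this it is easy to see that $RB(\mu_0 \mid x) \to \infty$ as $\tau_*^2 \to \infty$ or, when $\sqrt{n}|\bar{x} - \mu_0|/\sigma_0$ is fixed, as $n \to \infty$. If this is calibrated using the strength then, under these circumstances,
\[ \Pi(RB(\mu \mid x) \le RB(\mu_0 \mid x) \mid x) \to 2(1 - \Phi(\sqrt{n}|\bar{x} - \mu_0|/\sigma_0)), \qquad (3) \]
the classical p-value for assessing $H_0$. The Jeffreys-Lindley paradox is the apparent divergence between $RB(\mu_0 \mid x)$ and the p-value as measures of evidence concerning $H_0$. For example, see Robert (2014) and Villa and Walker (2017) for some recent discussion of the paradox. Using the relative belief framework, however, it is clear that while $RB(\mu_0 \mid x)$ may be a large value, and so apparently strong evidence in favor, (3) suggests it is weak evidence in favor. This doesn't fully explain why this arises, however, as clearly when $\sqrt{n}|\bar{x} - \mu_0|/\sigma_0$ is large, then we would expect there to be evidence against. To understand why this discrepancy arises, it is necessary to consider bias as discussed in the next section.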
Another aspect of (3) is of interest because it raises the obvious question: is the classical p-value a valid measure of evidence? The answer is no according to our definition, but the difference of two tail probabilities, with cut-off 0, is a valid measure of evidence. Note that the right-most tail probability converges almost surely to 0 as $n \to \infty$ or $\tau_*^2 \to \infty$ and this implies that, if the classical p-value is to be used to measure evidence, the significance level has to go to 0 as n increases. Note that if the $N(\mu_*, \tau_*^2)$ prior is replaced by a Uniform(−m, m) prior, then the same conclusion is reached as $n \to \infty$ or as $m \to \infty$. These conclusions are similar to those found in Berger and Sellke (1987) and Berger and Delampady (1987).
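The behaviour in Example 2 can be reproduced numerically. The sketch below assumes the location normal model with a N(µ*, τ*²) prior and uses the fact that, for the full parameter µ, the relative belief ratio is proportional to the likelihood and so is symmetric and unimodal about the sample mean; the strength is then the posterior probability that µ lies at least as far from the sample mean as µ0 does. The standardized deviation of 2.5, the sample size and the function names are illustrative assumptions.

```python
import math

def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def rb_and_strength(xbar, n, sigma0, mu0, mu_star, tau_star):
    """RB(mu0 | x) and its strength, the posterior probability of RB(mu | x) <= RB(mu0 | x)."""
    post_var = 1.0 / (n / sigma0**2 + 1.0 / tau_star**2)
    post_mean = post_var * (n * xbar / sigma0**2 + mu_star / tau_star**2)
    post_sd = math.sqrt(post_var)
    # log of posterior density over prior density at mu0
    log_rb = (math.log(tau_star / post_sd)
              - (mu0 - post_mean)**2 / (2.0 * post_var)
              + (mu0 - mu_star)**2 / (2.0 * tau_star**2))
    # RB(mu | x) <= RB(mu0 | x) exactly when |mu - xbar| >= |mu0 - xbar|
    d = abs(mu0 - xbar)
    inside = (normal_cdf((xbar + d - post_mean) / post_sd)
              - normal_cdf((xbar - d - post_mean) / post_sd))
    return math.exp(log_rb), 1.0 - inside

def classical_p_value(xbar, n, sigma0, mu0):
    """Two-sided p-value based on the standardized sample mean."""
    return 2.0 * (1.0 - normal_cdf(math.sqrt(n) * abs(xbar - mu0) / sigma0))

n, sigma0, mu0, mu_star = 50, 1.0, 0.0, 0.0
xbar = 2.5 * sigma0 / math.sqrt(n)        # standardized deviation held fixed at 2.5
for tau_star in [1.0, 10.0, 100.0, 1000.0]:
    rb, strength = rb_and_strength(xbar, n, sigma0, mu0, mu_star, tau_star)
    print(tau_star, rb, strength, classical_p_value(xbar, n, sigma0, mu0))
```

As the prior becomes more diffuse the relative belief ratio grows without bound while the strength settles at the classical p-value, so the apparently large evidence in favor is flagged as weak.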

Measuring Bias
One of the more serious concerns with Bayesian methodology is with the issue of bias. Can an analyst choose the ingredients in such a way that the results are a foregone conclusion with high prior probability? The answer is yes and the situation described in Example 1 is a good example of this. Still it is necessary to be precise about the meaning of bias and the principle of evidence provides this. Even though we will use the relative belief ratio in the discussion here, it is important to note that the bias measures discussed are the same no matter what valid measure of evidence is used. The approach to measuring bias described here for problem H appeared in Baskurt and Evans (2013) and this has since been extended to problem E in Evans and Guo (2019). Bias should be measured a priori as it can be controlled by design although a post hoc measurement is also possible. In essence the bias numbers give a measure of the merit of a statistical study. Studies with high bias cannot be considered as being reliable irrespective of the correctness of the ingredients. The measurement of bias also establishes important links with frequentism.

Bias for Problem H
Consider the hypothesis $H_0 : \Psi(\theta) = \psi_0$ and let $M(\cdot \mid \psi)$ denote the prior predictive distribution of the data given that Ψ(θ) = ψ. In general it is possible for there to be a high prior probability that evidence against $H_0$ will be obtained even when it is true. In such a circumstance, if evidence against $H_0$ is obtained, then the result seems highly suspect. Similarly, if there is a high prior probability of obtaining evidence in favor of $H_0$ even when it is meaningfully false, then actually obtaining such evidence based on observed data is hardly compelling. So measuring bias against and bias in favor of $H_0$ are important aspects of this problem.
Bias against $H_0$ means that the ingredients and the data x, of a prescribed size, are such that with high prior probability evidence will not be obtained in favor of $H_0$ even when it is true. Bias against $H_0$ is thus measured by
\[ M(RB_\Psi(\psi_0 \mid D) \le 1 \mid \psi_0). \qquad (4) \]
If (4) is large, then obtaining evidence against $H_0$ seems like a foregone conclusion. For bias in favor, consider a distance measure $dist : \Psi \times \Psi \to [0, \infty)$. Bias in favor of $H_0$ is then measured by
\[ \sup_{\psi_* : dist(\psi_*, \psi_0) \ge \delta} M(RB_\Psi(\psi_0 \mid D) \ge 1 \mid \psi_*) \qquad (5) \]
and note the necessity of including δ so that the measure is based only on those values of ψ that are meaningfully different from $\psi_0$. Typically $M(RB_\Psi(\psi_0 \mid D) \ge 1 \mid \psi_*)$ decreases as $dist(\psi_*, \psi_0)$ increases, so the supremum can then be taken instead over $\{\psi : dist(\psi, \psi_0) = \delta\}$. When (5) is large, then it suggests that finding evidence in favor of $H_0$ is meaningless. The choice of the prior can be used somewhat to control bias but typically a prior that makes one bias lower just results in making the other bias higher. It is established in Evans (2015) that, under quite general circumstances, both biases converge to 0 as the amount of data increases. So bias can be controlled by design a priori.
The following example illustrates some general characteristics of measuring and controlling biases.

Example 3. Location normal (continued).
The bias against and bias in favor can be computed in closed form in this case, see Evans and Guo (2019). Table 1 gives some values for the bias against when testing the hypothesis $H_0 : \mu = 0$ when $\sigma_0^2 = 1$ for two priors. The results here illustrate something that holds generally in this case. Provided the prior variance $\tau_*^2 > \sigma_0^2/n$, the maximum bias against is achieved when the prior mean $\mu_*$ equals the hypothesized mean $\mu_0$. This turns out to be very convenient as controlling the bias against for this case controls the bias against everywhere. This seems paradoxical, as the maximum amount of belief is being placed at $\mu_0$ when $\mu_* = \mu_0$. It is clear, however, that the more prior probability that is assigned to a value the harder it is for the probability to increase. This is another example of a situation where evidence works somewhat contrary to our intuition based on how belief works. Another conclusion from Table 1 is that bias against is not a problem. Provided the prior isn't chosen too concentrated, this is generally the case, at least in our experience. Figure 1 is a plot of $M(RB(0 \mid x) \ge 1 \mid \mu)$ as a function of µ for this problem when n = 20, $\mu_* = 1$, $\tau_*^2 = 1$. This demonstrates how the bias in favor drops off as the true value moves away from $\mu_0$. Table 2 contains values of the bias in favor for several priors with δ = 0.5 and dist taken to be Euclidean distance. It is clear that bias in favor is a much more serious issue. To a certain extent this can be reduced by choosing δ larger, but δ is determined by the application and really only sample size is available to control bias. As discussed in Evans and Guo (2019), when $\tau_*^2 \to \infty$ the bias against goes to 0 and the bias in favor goes to 1 in this problem. This explains the phenomenon associated with the Jeffreys-Lindley paradox, as taking a very diffuse prior induces extreme bias in favor.
[Table 1: bias against; columns n, $(\mu_*, \tau_*^2) = (1, 1)$, $(\mu_*, \tau_*^2) = (0, \ldots)$]
So this argues against using arbitrarily diffuse priors. Rather one should elicit the value for $(\mu_*, \tau_*^2)$, prescribe the value of $\delta$ and choose $n$ to make the biases acceptable.

Table 2: Bias in favor of the hypothesis $H_0 = \{0\}$ with a $N(\mu_*, \tau_*^2)$ prior for different sample sizes $n$ with $\sigma_0 = 1$ and $\delta = 0.5$.
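The closed-form computation referred to in Example 3 can be sketched as follows. This is a minimal illustration under our own assumptions, not the code of Evans and Guo (2019): the data enter only through the sample mean $\bar{x} \sim N(\mu, \sigma_0^2/n)$, the relative belief ratio is the ratio of the normal posterior density to the normal prior density, and the function names (`favor_interval`, `bias_against`, `bias_in_favor`) are hypothetical. The set of sample means giving $RB(\mu_0 \mid \bar{x}) > 1$ is an interval, so both biases reduce to normal probabilities of that interval.

```python
import math


def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def favor_interval(mu0, mu_star, tau2, n, sigma2=1.0):
    """Interval of sample means xbar for which RB(mu0 | xbar) > 1."""
    v_post = 1.0 / (n / sigma2 + 1.0 / tau2)  # posterior variance
    a = v_post * n / sigma2                    # posterior mean is a * xbar + b
    b = v_post * mu_star / tau2
    # RB(mu0 | xbar) > 1  iff  (mu0 - a * xbar - b)^2 < c
    c = v_post * (math.log(tau2 / v_post) + (mu0 - mu_star) ** 2 / tau2)
    half = math.sqrt(c)
    return (mu0 - b - half) / a, (mu0 - b + half) / a


def prob_favor(mu0, mu_true, mu_star, tau2, n, sigma2=1.0):
    """M(RB(mu0 | X) > 1 | mu_true), where xbar ~ N(mu_true, sigma2 / n)."""
    lo, hi = favor_interval(mu0, mu_star, tau2, n, sigma2)
    s = math.sqrt(sigma2 / n)
    return norm_cdf((hi - mu_true) / s) - norm_cdf((lo - mu_true) / s)


def bias_against(mu0, mu_star, tau2, n, sigma2=1.0):
    """Prior probability of not getting evidence in favor of mu0 when it is true."""
    return 1.0 - prob_favor(mu0, mu0, mu_star, tau2, n, sigma2)


def bias_in_favor(mu0, mu_star, tau2, n, delta, sigma2=1.0):
    """Sup of M(RB(mu0 | X) > 1 | mu) over true means with |mu - mu0| >= delta."""
    lo, hi = favor_interval(mu0, mu_star, tau2, n, sigma2)
    center = 0.5 * (lo + hi)
    # The probability of a fixed interval is unimodal in the true mean and
    # peaks at the interval's center, so only these candidates matter.
    candidates = [mu0 - delta, mu0 + delta]
    if abs(center - mu0) >= delta:
        candidates.append(center)
    return max(prob_favor(mu0, m, mu_star, tau2, n, sigma2) for m in candidates)


if __name__ == "__main__":
    for n in (5, 20, 100):
        for tau2 in (1.0, 1e6):  # a moderate prior versus a very diffuse prior
            ba = bias_against(0.0, 0.0, tau2, n)
            bf = bias_in_favor(0.0, 0.0, tau2, n, delta=0.5)
            print(f"n={n:3d} tau2={tau2:9.0f} against={ba:.3f} in_favor={bf:.3f}")
```

Under these assumptions the sketch shows the behaviour described above: with a very diffuse prior the bias against is negligible while the bias in favor approaches 1, and for a fixed prior and $\delta$ the bias in favor falls as $n$ grows. Scanning over $n$ in this way is one simple means of choosing a sample size that makes both biases acceptable.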

Bias for Problem E
The accuracy of a valid estimate of $\psi$ is measured by the size of the plausible region $Pl_\Psi(x) = \{\psi : RB_\Psi(\psi \mid x) > 1\}$. As such, if the plausible region is reported as containing the true value and it does not, then the evidence is misleading. It is argued, therefore, that biases in estimation problems be measured by prior coverage probabilities associated with $Pl_\Psi(x)$ and the implausible region $Im_\Psi(x) = \{\psi : RB_\Psi(\psi \mid x) < 1\}$, the set of values for which there is evidence against. As will be shown, there is a close connection between measuring bias for problems H and E.
The prior probability that the plausible region does not cover the true value measures bias against when estimating $\psi$. This equals

$$E_{\Pi_\Psi}\big(M(\psi \notin Pl_\Psi(X) \mid \psi)\big) \qquad (6)$$

which is also the average bias against over all hypothesis testing problems $H_0 : \Psi(\theta) = \psi$. Note that $1 - E_{\Pi_\Psi}(M(\psi \notin Pl_\Psi(X) \mid \psi)) = E_{\Pi_\Psi}(M(\psi \in Pl_\Psi(X) \mid \psi)) = E_M(\Pi_\Psi(Pl_\Psi(X) \mid X))$, which is the prior coverage probability of $Pl_\Psi$. So, if the bias against for estimation is small, the coverage probability is high. Note that

$$\sup_\psi M(\psi \notin Pl_\Psi(X) \mid \psi) \qquad (7)$$

is an upper bound on (6). Therefore, controlling (7) controls the bias against in estimation and all hypothesis assessment problems involving $\psi$. Also $1 - \sup_\psi M(\psi \notin Pl_\Psi(X) \mid \psi) \le E_{\Pi_\Psi}(M(\psi \in Pl_\Psi(X) \mid \psi)) = E_M(\Pi_\Psi(Pl_\Psi(X) \mid X))$, so using (7) implies lower bounds for the coverage probability of the plausible region and for the expected posterior content of the plausible region. In general, both (6) and (7) converge to 0 with increasing amounts of data. So it is possible to control for bias against in estimation problems by design.
The prior coverage probability that follows from (6) can be considered as a Bayesian confidence, as it is dependent on the full prior. The lower bound on the coverage probability that arises from (7) only depends on the conditional priors on the nuisance parameters, and such a situation could be considered as somewhat like the use of priors in random effects models. When $\Psi(\theta) = \theta$, so that there are no nuisance parameters, the coverage probability is a "pure frequentist" coverage probability.

Table 3: Average bias against $H_0 = 0$ when using a $N(0, \tau_*^2)$ prior for different sample sizes $n$.
Example 4. Location normal (continued). Table 3 gives some values of the bias against measure for estimation with different priors and sample sizes. Subtracting the probabilities in Table 3 from 1 gives the prior probability that the plausible region covers the true value and the expected posterior content of the plausible region. So when $n = 20$, $\tau_* = 1$, the prior probability of the plausible region containing the true value is $1 - 0.051 = 0.949$, so $Pl(x)$ is a 0.949 Bayesian confidence interval for $\mu$.
To use (7) it is necessary to maximize $M(RB(\mu \mid X) \le 1 \mid \mu)$ as a function of $\mu$ and it is seen that, at least when the prior is not overly concentrated, this maximum occurs at $\mu = \mu_*$. When using the $N(0, 1)$ prior with $n = 5$, the maximum occurs at $\mu = 0$ and, from the second column of Table 1, equals 0.143. The average bias against is 0.107, as recorded in Table 3.
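The average bias against in Example 4 can be approximated numerically. The following is a sketch under our own assumptions rather than the computation used to produce Table 3: the data enter through the sample mean, $M(RB(\mu \mid X) \le 1 \mid \mu)$ is evaluated from the normal posterior and prior densities, and the prior expectation is approximated by a trapezoid rule on a grid spanning eight prior standard deviations either side of the prior mean. The names `bias_against` and `average_bias_against`, and the grid settings, are hypothetical choices.

```python
import math


def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def norm_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def bias_against(mu, mu_star, tau2, n, sigma2=1.0):
    """M(RB(mu | X) <= 1 | mu) for the location normal model."""
    v_post = 1.0 / (n / sigma2 + 1.0 / tau2)  # posterior variance
    a = v_post * n / sigma2                    # posterior mean is a * xbar + b
    b = v_post * mu_star / tau2
    half = math.sqrt(v_post * (math.log(tau2 / v_post) + (mu - mu_star) ** 2 / tau2))
    lo, hi = (mu - b - half) / a, (mu - b + half) / a  # xbar values with RB > 1
    s = math.sqrt(sigma2 / n)
    return 1.0 - (norm_cdf((hi - mu) / s) - norm_cdf((lo - mu) / s))


def average_bias_against(mu_star, tau2, n, sigma2=1.0, grid=4001, width=8.0):
    """Trapezoid-rule approximation to the prior mean of the bias against."""
    tau = math.sqrt(tau2)
    step = 2.0 * width * tau / (grid - 1)
    total = 0.0
    for i in range(grid):
        mu = mu_star - width * tau + i * step
        weight = 0.5 if i in (0, grid - 1) else 1.0
        density = norm_pdf((mu - mu_star) / tau) / tau
        total += weight * density * bias_against(mu, mu_star, tau2, n, sigma2)
    return total * step


if __name__ == "__main__":
    for n in (5, 20):
        avg = average_bias_against(0.0, 1.0, n)
        worst = bias_against(0.0, 0.0, 1.0, n)  # maximum occurs at mu = mu_star here
        print(f"n={n:2d} average={avg:.3f} maximum={worst:.3f} coverage={1 - avg:.3f}")
```

With the $N(0, 1)$ prior the values obtained this way are close to those quoted in the text, and one minus the average gives the prior coverage probability of the plausible region.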
Bias in favor occurs when the prior probability that $Im_\Psi$ does not cover a false value is large, as this would seem to imply, ignoring the situation where there is no evidence either way, that the plausible region will cover a randomly selected false value from the prior with high prior probability. Considering only those values that differ meaningfully from the true value, the bias in favor for estimation is given by

$$E_{\Pi_\Psi}\Big(\sup_{\psi :\, \mathrm{dist}(\psi, \psi_0) \ge \delta} M(\psi_0 \notin Im_\Psi(X) \mid \psi)\Big) \qquad (8)$$

which is the prior mean of the biases in favor for $H_0 : \Psi(\theta) = \psi_0$, that is, the bias in favor averaged over $\psi_0$. An upper bound on (8) is commonly equal to 1, as illustrated in Figure 2, and so is not useful.
Example 5. Location normal (continued). Figure 2 plots $\sup_{\mu \pm \delta} M(\mu \notin Im_\Psi(X) \mid \mu)$ as a function of $\mu$ for a specific prior and $\delta$. Table 4 contains values of (8) for this situation with different values of $\delta$.

Table 4: Average bias in favor for estimation when using a $N(0, \tau_*^2)$ prior for different sample sizes $n$ and difference $\delta$.

Properties
The inferences possess a variety of good properties and are even optimal in the family of all Bayesian inferences. For example, the invariance of the inference under reparameterizations has already been noted. Another simple property is the relationship

$$RB_\Psi(\psi \mid x) = E_{\Pi(\cdot \mid \psi)}\big(RB(\theta \mid x)\big) \qquad (9)$$

where $\Pi(\cdot \mid \psi)$ is the conditional prior for $\theta$ given $\Psi(\theta) = \psi$. So the evidence concerning $\psi$ is obtained by averaging the evidence for each value of $\theta \in \Psi^{-1}\{\psi\}$ with respect to the conditional distribution on the nuisance parameters. This relationship leads to a result that may appear paradoxical, as discussed in the following example.
Example 6. Are evidence measures incoherent? Consider $A, B, C$ subsets of $\Omega$ with probability measure $P$ such that $P(A) > 0$, $P(B) > 0$, $P(C) > 0$ and suppose $A \cap B = \emptyset$. Then applying (9) gives

$$RB(A \cup B \mid C) = RB(A \mid C)\,P(A \mid A \cup B) + RB(B \mid C)\,P(B \mid A \cup B).$$

From this it can be deduced that when $RB(A \mid C) > 1$, then $RB(A \cup B \mid C) < 1$ iff $RB(B \mid C) < 1$ and

$$P(A \mid A \cup B) < \frac{1 - RB(B \mid C)}{RB(A \mid C) - RB(B \mid C)}. \qquad (10)$$

So it is possible that there is evidence in favor of $A$ but evidence against the superset $A \cup B$. The Bayes factor also has this property and this has been referred to as "incoherence" of a measure of evidence. It is clear from (10) that $A$ has to be a suitably small proportion of $A \cup B$ for this to hold. But note that, if $C = A$, then $RB(B \mid C) = 0$ and, even if $P(A \mid A \cup B)$ is very small, (10) fails. So the situation is more subtle.
To see how such a situation can arise consider the following. In a population $\Omega$ let $C$ correspond to a trait that exists in subpopulation $A \cup B$ only in $A$. So $RB(A \mid C) = 1/P(C) > 1$ and $RB(B \mid C) = 0$. This implies $RB(A \cup B \mid C) = P(A \mid A \cup B)/P(C)$ and so there is evidence against $A \cup B$ iff $P(A \mid A \cup B) < P(C)$, that is, iff the probability of possessing the trait within the subpopulation is smaller than the probability of possessing the trait within the full population. This makes sense and so our claim is that there is no incoherency in the relative belief ratio (or Bayes factor) as a measure of evidence. Measuring evidence is seen once again to be different than measuring belief.
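A small numerical instance of Example 6 may help fix ideas. The population below is a hypothetical one chosen purely for illustration: ten equally likely individuals, with the trait $C$ held by every member of $A$, by no member of $B$, and by four individuals outside $A \cup B$. Exact rational arithmetic is used so that the identities can be checked without rounding.

```python
from fractions import Fraction

# Hypothetical population of ten equally likely individuals.
omega = set(range(1, 11))
A = {1}                   # small part of the subpopulation, all with the trait
B = {2, 3, 4, 5, 6}       # rest of the subpopulation, none with the trait
C = {1, 7, 8, 9, 10}      # individuals possessing the trait


def prob(event, given=None):
    """P(event | given) under the uniform measure; given defaults to omega."""
    given = omega if given is None else given
    return Fraction(len(event & given), len(given))


def rb(event, given):
    """Relative belief ratio RB(event | given) = P(event | given) / P(event)."""
    return prob(event, given) / prob(event)


rb_A = rb(A, C)           # evidence in favor of A
rb_B = rb(B, C)           # evidence against B
rb_AB = rb(A | B, C)      # evidence against the superset A u B
p = prob(A, A | B)        # proportion of A within A u B

print(rb_A, rb_B, rb_AB, p)
```

Here $RB(A \mid C) = 2 = 1/P(C)$ and $RB(B \mid C) = 0$, while $RB(A \cup B \mid C) = 1/3 < 1$, because only one sixth of the subpopulation $A \cup B$ possesses the trait compared with one half of the full population.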
In Evans and Guo (2019) a number of results are established concerning the principle of evidence. For example, the following result establishes the unbiasedness of the inferences that result.

Theorem 1. Using the principle of evidence, (i) the prior probability of getting evidence in favor of $\psi_0$ given that it is true is greater than or equal to the prior probability of getting evidence in favor of $\psi_0$ given that $\psi_0$ is false and (ii) the prior probability of $Pl_\Psi$ covering the true value is always greater than or equal to the prior probability of $Pl_\Psi$ covering a false value.
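Part (i) of Theorem 1 can be checked numerically in the location normal example. The sketch below rests on an interpretive assumption of ours: since the prior is continuous, the probability of evidence in favor of $\mu_0$ "given that $\mu_0$ is false" is taken to be the probability under the prior predictive distribution of the sample mean, namely $N(\mu_*, \tau_*^2 + \sigma_0^2/n)$. The function names are hypothetical.

```python
import math


def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def favor_interval(mu0, mu_star, tau2, n, sigma2=1.0):
    """Interval of sample means xbar for which RB(mu0 | xbar) > 1."""
    v_post = 1.0 / (n / sigma2 + 1.0 / tau2)
    a = v_post * n / sigma2
    b = v_post * mu_star / tau2
    half = math.sqrt(v_post * (math.log(tau2 / v_post) + (mu0 - mu_star) ** 2 / tau2))
    return (mu0 - b - half) / a, (mu0 - b + half) / a


def interval_prob(lo, hi, mean, var):
    """Probability that a N(mean, var) variable falls in (lo, hi)."""
    s = math.sqrt(var)
    return norm_cdf((hi - mean) / s) - norm_cdf((lo - mean) / s)


def favor_given_true(mu0, mu_star, tau2, n, sigma2=1.0):
    """Probability of evidence in favor of mu0 when mu0 is the true mean."""
    lo, hi = favor_interval(mu0, mu_star, tau2, n, sigma2)
    return interval_prob(lo, hi, mu0, sigma2 / n)


def favor_prior_predictive(mu0, mu_star, tau2, n, sigma2=1.0):
    """Probability of evidence in favor of mu0 under the prior predictive."""
    lo, hi = favor_interval(mu0, mu_star, tau2, n, sigma2)
    return interval_prob(lo, hi, mu_star, tau2 + sigma2 / n)


if __name__ == "__main__":
    for mu0 in (0.0, 1.0, 2.0):
        t = favor_given_true(mu0, 0.0, 1.0, 5)
        f = favor_prior_predictive(mu0, 0.0, 1.0, 5)
        print(f"mu0={mu0:.1f} given true={t:.3f} prior predictive={f:.3f}")
```

In each case tried the first probability exceeds the second, as the theorem requires. This is an illustration for one model and a handful of settings, not a substitute for the proof in Evans and Guo (2019).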
Moreover, the principle of evidence is seen to satisfy a number of optimality properties. For example, suppose another principle is chosen for establishing evidence against or in favor of a value $\psi$ and let $D(\psi) \subset \mathcal{X}$ be the data sets that don't lead to evidence in favor of $\psi$. For the principle of evidence this is the set

$$R(\psi) = \{x : RB_\Psi(\psi \mid x) \le 1\}.$$

The rules that are potentially of interest satisfy

$$M(D(\psi) \mid \psi) \le M(R(\psi) \mid \psi) \qquad (11)$$

as this indicates that $D(\psi)$ will do no worse than $R(\psi)$ at identifying when $\psi$ is true and is better when the inequality is strict. The following Theorem holds where part (ii) is the relevant result and part (i) is used to obtain this.
Theorem 2. (i) The prior probability $M(D(\psi))$ is maximized among all rules satisfying (11) by $D(\psi) = R(\psi)$. (ii) If $\Pi_\Psi(\{\psi\}) = 0$, then $R(\psi)$ maximizes, among all rules satisfying (11), the prior probability of not obtaining evidence in favor of $\psi$ when it is false and otherwise maximizes this probability among all rules satisfying $M(D(\psi) \mid \psi) = M(R(\psi) \mid \psi)$.

Now consider $C(x) = \{\psi : x \notin D(\psi)\}$, which is the set of $\psi$ values for which there is evidence in their favor after observing $x$ according to some alternative evidence rule. If (11) holds for each $\psi$, then $E_{\Pi_\Psi}(M(\psi \in C(X) \mid \psi)) \ge E_{\Pi_\Psi}(M(\psi \in Pl_\Psi(X) \mid \psi))$, so the Bayesian coverage of $C$ is at least as large as that of $Pl_\Psi$ and $C$ represents a viable alternative to using $Pl_\Psi$. The following establishes an optimality result for $Pl_\Psi$ and again it is part (ii) that is relevant.
As discussed in Evans (2015), relative belief credible regions $C_{\Psi,\gamma}$ are unbiased and possess optimality properties similar to those stated for $Pl_\Psi$. Also, it is always the case that $RB(C_{\Psi,\gamma}(x) \mid x) \ge 1$, with the inequality generally strict. This implies that there is evidence that the true value is in $C_{\Psi,\gamma}(x)$ and this is not the case with other rules for forming credible regions. Also, it can be proved that $RB(C \mid x)$ is maximized by $C = C_{\Psi,\gamma}(x)$ among all sets $C$ satisfying $\Pi_\Psi(C \mid x) = \Pi_\Psi(C_{\Psi,\gamma}(x) \mid x)$. So $C_{\Psi,\gamma}$ represents those values that have the greatest increase in belief amongst all the possible $\gamma$ credible regions having the same posterior content.
Evans (2015) contains discussion concerning a number of other properties for these inferences. Overall the inferences derived via R 2 and R 3 are seen to have excellent properties.

Conclusions
This paper has described an approach to statistical reasoning. The central idea, which in fact we view as key for any such theory, is to be precise about the characterization of statistical evidence relevant to a particular problem. This is accomplished via the principle of evidence which overall plays the major role.
This principle is intuitively very appealing and, while very simple, has broad implications. The principle has been extensively discussed in the philosophy of science literature but statisticians have for the most part ignored it, and that may well be because of the requirement that prior beliefs be specified via a probability measure.
Another feature of the approach is a clear separation between the specification and checking of the ingredients and the rules of inference. Two necessary ingredients are a statistical model and a prior. Both of these are based upon choices made by an analyst, perhaps in conjunction with application area experts. As such, these ingredients are subjective in nature and it is fair to question their validity in light of the central role that the goal of objectivity plays in science. This is addressed in several ways. One is to acknowledge that the primary role of these ingredients is to allow for the reasoning process to present answers to the questions of scientific interest about the real-world object $\Psi$. In essence the correctness of the ingredients is not of primary importance and obsessing about this seems misplaced. Still, it is clear that the ingredients can be chosen very badly in the sense that they are strongly contradicted by the data and this may lead to misleading inferences. But there is a scientifically valid approach to dealing with this issue through checking the model and prior against the data, which are objective at least if they are collected properly. This may lead to the need to modify chosen ingredients and, at least for the prior, this can be done in a way that is only very weakly data dependent.
The chosen ingredients can also suffer from an even more serious defect as the choices may seriously bias the inferences. This bias is measured by the prior probabilities associated with obtaining evidence in favor of or against. The principle of evidence is seen to play the central role in measuring and controlling these biases and can even be considered the optimal way to do this. For us, the measurement of bias is an absolute necessity in assessing the value of a statistical study. If there is a high prior probability of obtaining misleading evidence in a study, then the value of any evidence that reflects this is questionable. Another consequence of measuring and controlling bias is that it establishes a cooperative and beneficial relationship between frequentist and Bayesian ideas, as one can't do without the other. Frequentism is seen to be an a priori concern with controlling the inferences that can result from potential data sets, while Bayesianism is concerned with the actual inferences that follow from the rules of inference applied to the specified ingredients together with the observed data.
One counter-argument to what has been presented here is that it is very idealistic, at least compared with many of the problems statisticians commonly face. Our response to this is that the theory is definitely focused on problems where, for example, the data collection process can be designed so that appropriate and adequate amounts of data can be collected, where parameters indexing models are sufficiently well-understood so that priors can be elicited, where differences of importance (values like $\delta$) can be specified and where the computations can be carried out to sufficient accuracy. Such problems are what might be considered the core statistical problems. If an appropriate theory of statistical reasoning cannot be developed for such contexts, then there is little reason to hope for a theory that can handle the kinds of problems where data collection is not designed, the data are far too few for the dimensionality of the problem and aspects of the models used are not well understood so that default choices of priors are employed. In any case a proposed theory has to handle the core problems and deal with the concept of statistical evidence appropriately. A better approach, in our view, is to identify the problems which can be considered central and, as discussed in the paper, when problems are encountered that lie outside this core, consider compromises to the theory that retain as much of the ideal approach as possible. Certainly Cao et al. (2015) and Evans and Tomal (2018) are in this spirit. Additional applications to more significant statistical problems than those discussed here can be found in the papers cited in this paper. Of some significance are recent applications in quantum state estimation that arise out of an approach to inference developed by B. Englert, see Shang et al. (2013), which has much in common with what has been presented here. Gu et al. (2019) is an example of this and is one where the concept of bias played a notable role.
The computations associated with the theory can indeed be difficult. For the computations associated with bias and prior-data conflict, however, this is mitigated by requiring only a very low level of accuracy for the relevant probabilities in question. See, for example, Nott et al. (2016), Nott et al. (2019) and Wang et al. (2018) where approximate calculation approaches have been successfully employed. In any case, we subscribe to the view that it is better to approximately compute what is believed to be correct rather than exactly compute what isn't. It is notable that most of the hard work involved in an application is associated with choosing and checking the ingredients and, since this is the aspect that involves the most thinking about the application and the relevance of the statistical analysis to it, this is as it should be.
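To illustrate why low accuracy suffices for bias computations, the following sketch estimates a bias against by plain Monte Carlo in the location normal example. It is our own illustration and not the approximate methods of the papers just cited: sample means are simulated under the hypothesized value, the relative belief ratio is evaluated for each, and the proportion with $RB \le 1$ is reported together with its standard error. The function names, the number of draws and the seed are arbitrary choices.

```python
import math
import random


def log_rb(mu0, xbar, mu_star, tau2, n, sigma2=1.0):
    """Log relative belief ratio of mu0 given the sample mean xbar."""
    v_post = 1.0 / (n / sigma2 + 1.0 / tau2)
    m_post = v_post * (n * xbar / sigma2 + mu_star / tau2)
    log_post = -0.5 * math.log(2.0 * math.pi * v_post) - (mu0 - m_post) ** 2 / (2.0 * v_post)
    log_prior = -0.5 * math.log(2.0 * math.pi * tau2) - (mu0 - mu_star) ** 2 / (2.0 * tau2)
    return log_post - log_prior


def mc_bias_against(mu0, mu_star, tau2, n, draws=2000, seed=0, sigma2=1.0):
    """Monte Carlo estimate of M(RB(mu0 | X) <= 1 | mu0) and its standard error."""
    rng = random.Random(seed)
    s = math.sqrt(sigma2 / n)  # standard deviation of the sample mean
    hits = 0
    for _ in range(draws):
        xbar = rng.gauss(mu0, s)
        if log_rb(mu0, xbar, mu_star, tau2, n, sigma2) <= 0.0:
            hits += 1
    estimate = hits / draws
    std_error = math.sqrt(estimate * (1.0 - estimate) / draws)
    return estimate, std_error


if __name__ == "__main__":
    est, se = mc_bias_against(0.0, 0.0, 1.0, 5)
    print(f"estimated bias against: {est:.3f} (standard error {se:.3f})")
```

A couple of thousand draws already determine the bias to roughly two decimal places, which is usually all that is needed to judge whether a design is acceptable.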