1. Introduction
There are a variety of approaches to conducting statistical analyses and most seem well-motivated from an intuitive point of view. If, in any particular context, these all led to approximately the same result, this would not be a problem, but that is not the case. With the increasing importance of what we will call here statistical reasoning in almost all areas of science, this ambiguity can be seen as an important problem to resolve.
It is interesting to note that virtually all approaches make reference to the “evidence in the data” or even to the concept of “statistical evidence” itself. For example,
p-values are considered as measuring the evidence against a hypothesis, although serious doubts have been raised about their suitability in this regard. Similarly, likelihood ratios are considered as measuring the evidence in the data supporting the truth of one value versus another, and the Bayes factor is considered as a measure of statistical evidence. There are treatments that recognize the concept of statistical evidence as a central aspect of statistical reasoning, as in [1,2,3,4,5,6,7]. There is also a tradition of considering the meaning of evidence in the philosophy of science, sometimes called confirmation theory, with [8] being an accessible summary.
While much of this literature has many relevant things to say about statistical evidence, it is fair to say that no treatment gives an unambiguous definition of what it is together with a satisfactory theory of statistical reasoning based on it. The text [9] summarizes a number of publications with coauthors that have attempted to deal with this problem. This paper describes that theory together with some recent developments beyond those described in [10].
2. Purpose of a Theory of Statistical Reasoning
To start it is necessary to ask: what is the purpose of statistics as a subject, or what is its role in science? This is identified here as providing a theory that gives answers to two questions based on observed data x and an object of interest ψ.
- E: What is a suitable value for ψ together with an assessment of the accuracy of this estimate?
- H: Is there evidence that a specified value for ψ is true or false and how strong is this evidence?
The requirement is placed on a theory that the answers to E and H be based on a clear characterization of the statistical evidence in the data relevant to ψ.
If the data x were enough to answer these questions unambiguously, that would undoubtedly be best, but it seems more is needed. There are two aspects of this that need discussion, referenced hereafter as the Ingredients and Rules of Statistical Inference. These are to be taken as playing the same roles in a statistical argument as the Premises and Rules of Inference play in a logical argument. For a logical argument is valid whenever the rules of inference, such as modus ponens, have been correctly applied to the premises to derive a conclusion, but it is only sound when it is valid and the premises are consistent and true. Similar considerations apply to statistical reasoning.
3. The Ingredients
There are various criteria that the ingredients necessary for a theory of statistical reasoning should satisfy. A partial list is as follows.
- I1: The ingredients are the minimum number required for a characterization of statistical evidence so that E and H can be answered.
- I2: The ingredients are such that any bias in these choices can be assessed and controlled.
- I3: Each ingredient must be falsifiable via the data.
In the majority of statistical problems the ingredients are chosen and are not specified by the application. As such, the ingredients are subjective, which seems to violate a fundamental aspect of science, namely, the goal of objectivity. The appropriateness of this goal is recognized, but rather as an ideal that is not strictly achievable, only approached by dealing with the ingredients appropriately. I2 suggests that it is possible to choose the ingredients in such a way that they bias the answers to E and H. This is indeed the case but, as long as the bias can be measured and controlled, this need not be a concern. I3 is another way of dealing with the inherent subjectivity in a statistical analysis. For it is reasonable to argue that the chosen ingredients are never correct, only serving as devices that allow for a characterization of evidence so that the rules of inference can be applied to answer E and H. Their “correctness” is irrelevant unless the choices made are contradicted by the objective (when collected correctly) data x. The gold standard requires falsifiability of all ingredients to help place statistical reasoning firmly within the realm of science.
So for a theory of statistical reasoning we need to state what the ingredients are and how to check their reasonableness. The following ingredients, beyond the data x, are necessary here.
- Model: {f_θ : θ ∈ Θ} is a collection of conditional probability distributions for the data x given θ, such that the object of interest ψ = Ψ(θ) is specified by the true distribution that gave rise to x.
- Prior: π is a probability distribution on Θ.
- Delta: δ is the difference that matters, so that dist(ψ₁, ψ₂) ≤ δ, for some distance measure dist, means that ψ₁ and ψ₂ are, for practical purposes, indistinguishable.
The model and prior specify a joint probability distribution for (θ, x). As such, all uncertainties are handled within the context of probability, interpreted here as measuring belief, no matter how the probabilities are assigned.
The role of δ will be discussed subsequently, but it raises an interesting and relevant point concerning the role of infinities in statistical modelling. The position taken here is that, whenever infinities appear, their role is as approximations expressed via limits rather than as representations of reality. For example, data arise via a measurement process and as such all data are measured to a finite accuracy. So data are discrete and, moreover, sample spaces are bounded, as measurements cannot be arbitrarily large or small. So whenever continuous probability distributions are used, these are considered as approximations to essentially finite distributions. Little is lost by taking this view and there is a substantial gain for the theory through the avoidance of anomalies.
The question now is how to choose the model and the prior. Unfortunately, we are somewhat silent about general principles for choosing a model but, when it comes to the prior, for us this is done via an elicitation algorithm which explains why the prior in question has been chosen. An inability to come up with a suitable elicitation suggests a lack of sufficient understanding of the scientific problem or an inappropriate choice of model where the real-world meaning of θ is not clear. The existence of a suitable elicitation algorithm could be viewed as a necessity for placing a context within the gold standard but, given the way models are currently chosen, we do not adopt that position. Still, whatever approach is taken to choosing the ingredients, they are subject to I2 and I3. As will be seen, the implementation of I2 requires the characterization of statistical evidence, so discussion of this is delayed and I3 is addressed next.
4. Checking the Ingredients
If the observed data x is surprising (lies in the “tails”) for each distribution in the model {f_θ : θ ∈ Θ}, then this suggests a problem with the model; otherwise the model is at least acceptable. There are a number of approaches available for checking the model and this is not discussed further here, although the model is undoubtedly the most important ingredient chosen.
To be a bit more formal, note that, when T is a minimal sufficient statistic for the model, the joint distribution of (θ, x) factors as

π(θ) f_θ(x) = f(x | T(x)) m_T(T(x)) π(θ | T(x)),

where f(· | T(x)) is a probability distribution, independent of θ, available for model checking, m_T is the prior (predictive) distribution of T, available for checking the prior, while π(· | T(x)) is the posterior of θ and provides the probabilities for the inference step. The paper [11] proposed using the prior predictive distribution of x for jointly checking the model and prior but, to ascertain where a problem exists when one does, it is more appropriate to split this into checking the model first and, when the model passes, then checking the prior.
A prior fails when the true value of θ lies in its tails. The paper [12] proposed using the tail probability

M_T(m_T(t) ≤ m_T(T(x)))

for this purpose, and [13] established that generally this converges to Π(π(θ) ≤ π(θ_true)) as the amount of data increases. This approach also included conditioning on ancillary statistics, to remove variation irrelevant to checking the prior, and further factorizations of m_T that allow for the checking of individual components of the prior. The paper [14] generalizes this to provide a fully invariant check that connects nicely with the measure of evidence used for inference, as discussed in Section 5. It is shown in [15] that, when prior-data conflict exists, inferences can be very sensitive to perturbations of the prior. The paper [16] defines what is meant by a prior being weakly informative with respect to another, quantifying this in terms of fewer prior-data conflicts expected a priori. One then specifies a base elicited prior and a hierarchy of successively more weakly informative priors so that, if a conflict is detected, the prior can be replaced by a more weakly informative one, progressing up the hierarchy until conflict is avoided. The hierarchy of priors does not depend on the data.
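As an illustration, the tail-probability check for prior-data conflict can be approximated by simulation. The sketch below assumes a location-normal model with known variance and a conjugate prior; all numerical settings (n, the prior hyperparameters, the observed mean t_obs) are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setting: x_i ~ N(mu, 1), n = 10, prior mu ~ N(0, 1);
# the minimal sufficient statistic is T = x-bar.
n, sigma0, mu_prior, tau = 10, 1.0, 0.0, 1.0

# Prior predictive of T = x-bar is N(mu_prior, tau^2 + sigma0^2/n).
pred_var = tau**2 + sigma0**2 / n

def m_T(t):
    """Prior predictive density of T at t."""
    return np.exp(-0.5 * (t - mu_prior)**2 / pred_var) / np.sqrt(2 * np.pi * pred_var)

t_obs = 3.1  # a hypothetical observed mean, far in the prior predictive tails

# Conflict check: prior predictive probability of a density value
# no larger than the observed one, estimated by Monte Carlo.
t_sim = rng.normal(mu_prior, np.sqrt(pred_var), size=100_000)
check = np.mean(m_T(t_sim) <= m_T(t_obs))
print(f"prior-data conflict check = {check:.3f}")  # a small value signals conflict
```

Here the check is small, indicating that the observed mean is surprising for the elicited prior and a more weakly informative prior up the hierarchy might be considered.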
5. The Rules of Statistical Inference
There are three rules that are used to determine inferences, stated here for a probability model (Ω, F, P). Suppose interest is in whether or not the event A ∈ F is true after observing that the event C ∈ F has occurred, where both P(A) > 0 and P(C) > 0.
- R1: Principle of conditional probability: beliefs about A, as expressed initially by P(A), are replaced by P(A | C).
- R2: Principle of evidence: observing C is evidence in favor of (against, irrelevant for) A when P(A | C) > P(A) (P(A | C) < P(A), P(A | C) = P(A)).
- R3: Relative belief: the evidence is measured quantitatively by the relative belief ratio RB(A | C) = P(A | C)/P(A).
While R1 does not seem controversial, its strict implementation in Bayesian contexts demands proper priors and priors that do not depend on the data. R2 also seems quite natural and, as will be seen, really represents the central core of our approach to statistical reasoning. R3 is perhaps not as obvious, but it is clear that RB(A | C) > 1 (< 1, = 1) indicates that evidence in favor of (against, irrelevant for) A has been obtained. In fact, the relative belief ratio only plays a role when it is necessary to order alternatives. In the Bayesian context, when interest is in ψ = Ψ(θ), then generally the relative belief ratio at a value ψ equals

RB_Ψ(ψ | x) = lim_{ε→0} RB(N_ε(ψ) | x) = π_Ψ(ψ | x)/π_Ψ(ψ),

where the neighborhoods N_ε(ψ) of ψ satisfy N_ε(ψ) → {ψ} as ε → 0, and the equality on the right holds whenever the prior density π_Ψ of ψ is positive and continuous at ψ.
There are other valid measures of evidence, in the sense that they have a cut-off that determines evidence in favor or against, as specified by R2, and can be used to order alternatives. For example, the difference P(A | C) − P(A), with cut-off 0, or the Bayes factor BF(A | C), with cut-off 1, could be used. As indicated, BF(A | C) can be defined in terms of RB(A | C) but not conversely. Since BF(A | C) = RB(A | C)/RB(Aᶜ | C) and RB(Aᶜ | C) < 1 iff RB(A | C) > 1, the Bayes factor is not a comparison of the evidence for A with the evidence for Aᶜ, as is sometimes claimed. Furthermore, in the continuous case, when BF(ψ | x) is defined as a limit it is equal to RB_Ψ(ψ | x). The relative belief ratio, or equivalently any strictly increasing function of it, has other advantages and possesses various optimality properties, as discussed in [9,17], and so we adopt it here.
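The three rules and the agreement of the different valid measures of evidence on direction can be illustrated with a toy discrete calculation; the probabilities below are hypothetical, chosen only so that conditioning on C raises the probability of A.

```python
# A minimal discrete illustration of the three rules (numbers hypothetical).
# Marginal and joint probabilities for events A and C in a finite model.
P_A, P_C, P_A_and_C = 0.2, 0.3, 0.12

P_A_given_C = P_A_and_C / P_C                       # principle of conditional probability
RB = P_A_given_C / P_A                              # relative belief ratio, cut-off 1
diff = P_A_given_C - P_A                            # difference measure, cut-off 0
RB_complement = (1 - P_A_given_C) / (1 - P_A)
BF = RB / RB_complement                             # Bayes factor, cut-off 1

print(RB, diff, BF)
# Principle of evidence: all three measures agree on the direction of the evidence.
assert (RB > 1) == (diff > 0) == (BF > 1)
```

Note that the three measures agree that C is evidence in favor of A, while their numerical values differ; only the relative belief ratio is used here for the quantitative ordering.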
5.1. Problem E
Suppose that the range of possible values for ψ = Ψ(θ) is also denoted by Ψ. Then R3 determines the relative belief estimate as ψ(x) = arg sup_{ψ∈Ψ} RB_Ψ(ψ | x), as this maximizes the evidence in favor, since RB_Ψ(ψ(x) | x) ≥ 1 with the inequality generally strict. To measure the accuracy of ψ(x) there are a number of possibilities, but the plausible region Pl_Ψ(x) = {ψ : RB_Ψ(ψ | x) > 1}, the set of values where there is evidence in favor of the value being true, is surely central to this. If the “size”, such as the volume or prior content, of Pl_Ψ(x) is small and its posterior content is large, then this suggests that an accurate estimate has been obtained.
There are several notable aspects of this. The methodology is invariant: if interest is in λ = Λ(ψ), where Λ is 1-1 and smooth, then λ(x) = Λ(ψ(x)) and Pl_Λ(x) = Λ(Pl_Ψ(x)). While ψ(x) can also be thought of as the MLE from an integrated likelihood, that approach does not lead to Pl_Ψ(x) because likelihoods do not define evidence for specific values. Most significantly, the set Pl_Ψ(x) is completely independent of how evidence is measured quantitatively. In other words, if any valid measure of evidence is used, then the same set Pl_Ψ(x) is obtained. This has the consequence that, however we choose to estimate ψ via an estimator that respects the principle of evidence, effectively the same quantification of error is obtained. This points to the possibility of using some kind of smoothing operation on ψ(x) to produce values that lie in Pl_Ψ(x) when this is considered necessary. It is also possible to use, as part of measuring the accuracy, a γ-relative belief credible region for ψ, namely, C_γ(x) = {ψ : RB_Ψ(ψ | x) ≥ c_γ(x)}, where c_γ(x) is the largest constant for which the posterior content of C_γ(x) is at least γ. It is necessary, however, that γ not exceed the posterior content of Pl_Ψ(x), as otherwise C_γ(x) contains values for which there is evidence against and so would contradict R2.
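As a minimal sketch of these definitions, consider a discrete parameter with a uniform prior, so that the relative belief estimate and the plausible region can be computed by direct enumeration; the binomial setting and the observed count are hypothetical.

```python
import numpy as np
from math import comb

# Hypothetical setting: theta in {0.1, ..., 0.9} with a uniform prior,
# and x ~ Binomial(10, theta) with x = 7 observed.
theta = np.arange(0.1, 1.0, 0.1)
prior = np.full(len(theta), 1 / len(theta))
n_trials, x_obs = 10, 7
likelihood = np.array([comb(n_trials, x_obs) * t**x_obs * (1 - t)**(n_trials - x_obs)
                       for t in theta])

posterior = prior * likelihood
posterior /= posterior.sum()

RB = posterior / prior                   # relative belief ratio at each value
estimate = theta[np.argmax(RB)]          # relative belief estimate
plausible = theta[RB > 1]                # plausible region: evidence in favor
content = posterior[RB > 1].sum()        # its posterior content

print(f"estimate = {estimate:.1f}, plausible region = {np.round(plausible, 1)}, "
      f"posterior content = {content:.3f}")
```

With a uniform prior the relative belief ordering coincides with the posterior ordering, but the plausible region is still determined by the evidence cut-off RB > 1 rather than by an arbitrary posterior probability level.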
Consider now a very simple example which demonstrates some of the benefits of this approach.
Example 1. Prosecutor’s Fallacy.
Assume a uniform probability distribution on a population of size N, some member of which has committed a crime. DNA evidence has been left at the crime scene and suppose this trait is shared by m members of the population. A prosecutor is criticized for concluding that, because the trait is rare and a particular member possesses the trait, that member is guilty. In fact the prosecutor misinterprets P(“has trait” | “guilty”) = 1 as the probability of guilt rather than P(“guilty” | “has trait”) = 1/m, which is small if m is large. However, this posterior probability does not reflect the evidence of guilt. For, if you have the trait, then clearly this is evidence in favor of guilt. Note that RB(“guilty” | “has trait”) = P(“guilty” | “has trait”)/P(“guilty”) = (1/m)/(1/N) = N/m > 1 and Pl(“has trait”) = {“guilty”} with posterior content 1/m. So there is evidence of guilt, and the prosecutor is correct to conclude this, but the evidence is weak whenever m is large, and in such a context a conviction does not seem appropriate. It is notable that the MAP (maximum a posteriori) estimate is “not guilty”. A well-known weakness of MAP/HPD inferences is their lack of invariance under reparameterizations. This example shows, however, that, even in very simple situations, these inferences are generally inappropriate because they do not express evidence properly. The example also demonstrates a distinction between decisions and inferences. Clearly, when m is large there should not be a conviction on the basis of weak evidence. However, suppose that “guilty” corresponds to being a carrier of a highly infectious deadly disease and “has trait” corresponds to some positive (but not definitive) test for this; then the same numbers should undoubtedly lead to a quarantine. In essence, a theory of statistical reasoning should tell us what the evidence says, and decisions are made partly on this basis but employing many other criteria as well.
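The arithmetic of the example is easily checked; the values of N and m below are hypothetical.

```python
# Prosecutor's fallacy arithmetic (N and m are hypothetical values).
N = 1_000_000   # population size
m = 10          # members of the population sharing the trait

prior_guilty = 1 / N                          # uniform prior over the population
posterior_guilty = 1 / m                      # P("guilty" | "has trait")
RB_guilty = posterior_guilty / prior_guilty   # relative belief ratio = N/m

print(RB_guilty)          # N/m >> 1: strong increase in belief, evidence of guilt ...
print(posterior_guilty)   # ... but the posterior content of the plausible region is small
# MAP picks the larger posterior probability, here "not guilty".
MAP = "guilty" if posterior_guilty > 0.5 else "not guilty"
print(MAP)
```

The calculation shows both halves of the point: the relative belief ratio N/m registers evidence in favor of guilt, while its small posterior content 1/m shows that evidence to be weak, and MAP inference reports "not guilty" despite the evidence.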
5.2. Problem H
It is immediate that RB_Ψ(ψ₀ | x) is the evidence concerning H₀ : Ψ(θ) = ψ₀. The evidential ordering implies that the smaller RB_Ψ(ψ₀ | x) is than 1, the stronger the evidence is against ψ₀, and the bigger it is than 1, the stronger the evidence is in favor of ψ₀. How, however, is one to measure this strength? In [18] it is proposed to measure the strength of the evidence via the posterior tail probability

Π_Ψ(RB_Ψ(ψ | x) ≤ RB_Ψ(ψ₀ | x) | x),     (1)

which is the posterior probability that the true value of ψ has evidence no greater than that obtained for the hypothesized value ψ₀. When RB_Ψ(ψ₀ | x) < 1 and (1) is small, then there is strong evidence against ψ₀, since there is a large posterior probability that the true value of ψ has a larger relative belief ratio. Similarly, if RB_Ψ(ψ₀ | x) > 1 and (1) is large, then there is strong evidence that the true value of ψ is given by ψ₀, since there is a large posterior probability that the true value is in the set {ψ : RB_Ψ(ψ | x) ≤ RB_Ψ(ψ₀ | x)} and ψ₀ maximizes the evidence in this set. The strength measurement here results from comparing the evidence for ψ₀ with the evidence for each of the possible ψ values.
When H₀ is false, then RB_Ψ(ψ₀ | x) converges to 0 as does (1). When H₀ is true, then RB_Ψ(ψ₀ | x) converges to its largest possible value (greater than 1 and often ∞) and, in the discrete case, (1) converges to 1. In the continuous case, however, when H₀ is true, then (1) typically converges to a Uniform(0, 1) random variable. This is easily resolved by requiring that a deviation δ be specified such that, if dist(ψ, ψ₀) ≤ δ, where dist is some measure of distance determined by the application, then this difference is to be regarded as immaterial. This leads to redefining H₀ as H₀ : dist(Ψ(θ), ψ₀) ≤ δ, and typically a natural discretization of Ψ exists with ψ₀ as one of its elements. With this modification, (1) converges to 1 as the amount of data increases when H₀ is true. Some discussion on determining a relevant δ can be found in [19,20]. Typically the incorporation of such a δ makes computations easier, as there is then no need to estimate densities when these are not available in closed form.
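The strength computation under a δ-discretization can be sketched for a location-normal model with a conjugate prior; all numerical settings below are hypothetical, and the standard normal CDF is implemented directly via the error function.

```python
import numpy as np
from math import erf, sqrt

def Phi(z):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Hypothetical setting: x_i ~ N(mu, 1), n = 20 with x-bar = 0.3,
# prior mu ~ N(0, 1), hypothesis mu = mu0 = 0, and delta = 0.1.
n, sigma0, xbar, tau, mu0, delta = 20, 1.0, 0.3, 1.0, 0.0, 0.1

post_var = 1 / (n / sigma0**2 + 1 / tau**2)        # conjugate normal posterior
post_sd = sqrt(post_var)
post_mean = post_var * n * xbar / sigma0**2        # prior mean is mu0 = 0

# Partition the effective range of mu into intervals of length delta,
# with mu0 at the centre of one of them.
edges = np.arange(mu0 - 5.05, mu0 + 5.05 + 1e-9, delta)
prior_c = np.array([Phi(b / tau) - Phi(a / tau) for a, b in zip(edges, edges[1:])])
post_c = np.array([Phi((b - post_mean) / post_sd) - Phi((a - post_mean) / post_sd)
                   for a, b in zip(edges, edges[1:])])

RB = post_c / prior_c                               # relative belief ratio per interval
i0 = np.searchsorted(edges, mu0, side="right") - 1  # interval containing mu0
strength = post_c[RB <= RB[i0]].sum()               # posterior prob. of no-larger evidence

print(f"RB(mu0) = {RB[i0]:.2f}, strength = {strength:.2f}")
```

In this hypothetical configuration the relative belief ratio at μ₀ exceeds 1, so there is evidence in favor, but the strength is small, so the evidence in favor is weak.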
Consider a simple example where some issues arise that require a discussion of bias to resolve.
Example 2. Jeffreys–Lindley Paradox.
Suppose x = (x₁, …, xₙ) is i.i.d. N(μ, σ₀²) with σ₀² known, π is a N(μ₀, τ₀²) prior and the hypothesis is H₀ : μ = μ₀. So, writing z = n^{1/2}(x̄ − μ₀)/σ₀,

RB(μ₀ | x) = (1 + nτ₀²/σ₀²)^{1/2} exp{−z²/(2(1 + σ₀²/nτ₀²))},

which, in this case, is the same as the Bayes factor for H₀ obtained via Jeffreys’ mixture approach. From this it is easy to see that RB(μ₀ | x) → ∞ as τ₀² → ∞ or, when z is fixed, as n → ∞. If this is calibrated using the strength (1) then, under these circumstances,

Str(μ₀ | x) → 2(1 − Φ(|z|)),     (2)

the classical p-value for assessing H₀. The Jeffreys–Lindley paradox is the apparent divergence between the Bayes factor and the p-value as measures of evidence concerning H₀; see [21,22] for some recent discussion of the paradox. Using the relative belief framework, however, it is clear that, while RB(μ₀ | x) may be a large value, and so apparently strong evidence in favor, (2) suggests it is weak evidence in favor. This does not fully explain why this arises, however, as clearly, when |z| is large, we would expect there to be evidence against. To understand why this discrepancy arises, it is necessary to consider bias, as discussed in the next section. Another aspect of (2) is of interest because it raises the obvious question: is the classical p-value a valid measure of evidence? The answer is no according to our definition, but a difference of tail probabilities, with cut-off 0, is a valid measure of evidence. The subtracted tail probability converges almost surely to 0 as n → ∞ or τ₀² → ∞, and this implies that, if the classical p-value is to be used to measure evidence, the significance level has to go to 0 as n increases. Note that, if the N(μ₀, τ₀²) prior is replaced by a Uniform(μ₀ − τ₀, μ₀ + τ₀) prior, then the same conclusion is reached as n → ∞ or as τ₀ → ∞. These conclusions are similar to those found in [23,24].
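The paradox is easily reproduced numerically. A standard conjugate-normal calculation gives a closed form for the relative belief ratio at μ₀, used below; the values of z, n and the sequence of prior variances are hypothetical.

```python
from math import sqrt, exp

# Numerical illustration of the Jeffreys-Lindley paradox (numbers hypothetical).
# For the conjugate normal model, a standard calculation gives
#   RB(mu0 | x) = sqrt(1 + n*tau2/sigma0^2) * exp(-z^2 / (2*(1 + sigma0^2/(n*tau2))))
# with z = sqrt(n)*(xbar - mu0)/sigma0.
def RB_mu0(z, n, tau2, sigma0=1.0):
    a = 1 + n * tau2 / sigma0**2
    return sqrt(a) * exp(-0.5 * z**2 / (1 + sigma0**2 / (n * tau2)))

z, n = 2.5, 100          # fixed standardized data: classical p-value about 0.012
for tau2 in [1, 100, 10_000, 1_000_000]:
    print(f"tau0^2 = {tau2:>9}: RB(mu0|x) = {RB_mu0(z, n, tau2):.1f}")
# RB grows without bound as the prior becomes more diffuse, even though the
# data (z = 2.5) would classically be regarded as evidence against H0.
```

Holding the data fixed and letting the prior variance grow drives the relative belief ratio to infinity, which is exactly the behavior the bias analysis of the next section explains.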
6. Measuring Bias
Perhaps the most serious concern with Bayesian methodology is the issue of bias. Can the ingredients be chosen in such a way that the results are a foregone conclusion with high prior probability? To answer this it is necessary to be precise about the meaning of bias, and the principle of evidence provides this. Even though the relative belief ratio is used in the discussion, the bias measures discussed are exactly the same for any measure of evidence satisfying R2. The approach to measuring bias described here for problem H appeared in [18] and has since been extended to problem E in [25], where links with frequentism are also established.
Bias should be measured a priori, although a post hoc measurement is also possible. In essence, the bias numbers measure the merit of a statistical study: studies with high bias cannot be considered reliable, irrespective of the correctness of the ingredients.
Consider the hypothesis H₀ : Ψ(θ) = ψ₀ and let M(· | ψ₀) denote the prior predictive distribution of the data given that Ψ(θ) = ψ₀. In general it is possible for there to be a high prior probability that evidence against H₀ will be obtained even when it is true. Bias against H₀ is thus measured by M(RB_Ψ(ψ₀ | x) ≤ 1 | ψ₀). If this is large, then obtaining evidence against ψ₀ seems like a foregone conclusion, and subsequently finding evidence against it is thus of dubious value. Similarly, if there is a high prior probability of obtaining evidence in favor of ψ₀ even when it is meaningfully false, then actually obtaining such evidence based on observed data is not compelling. Bias in favor of ψ₀ is measured by sup_{ψ : dist(ψ, ψ₀) ≥ δ} M(RB_Ψ(ψ₀ | x) ≥ 1 | ψ), and note the necessity of including δ so that the measure is based only on those values of ψ that are meaningfully different from ψ₀. Typically M(RB_Ψ(ψ₀ | x) ≥ 1 | ψ) decreases as dist(ψ, ψ₀) increases, so the supremum can then be taken instead over dist(ψ, ψ₀) = δ.
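Both bias measures can be approximated by simulating from the relevant prior predictive distributions. The sketch below uses the conjugate normal setting with hypothetical numerical choices, a standard closed form for the relative belief ratio at the hypothesized mean, and evaluates the bias in favor at a deviation equal to δ.

```python
import numpy as np
from math import sqrt

rng = np.random.default_rng(7)

# Monte Carlo sketch of the two bias measures for a conjugate normal example
# (numbers hypothetical): x_i ~ N(mu, 1), prior mu ~ N(mu0, tau2), H0: mu = mu0.
n, sigma0, mu0, tau2, delta = 20, 1.0, 0.0, 1.0, 0.5
N_sim = 50_000

def RB_mu0(xbar):
    """RB at mu0 from the standard conjugate-normal calculation."""
    z = sqrt(n) * (xbar - mu0) / sigma0
    a = 1 + n * tau2 / sigma0**2
    return np.sqrt(a) * np.exp(-0.5 * z**2 / (1 + sigma0**2 / (n * tau2)))

# Bias against: prior probability of evidence against H0 when mu = mu0 is true.
xbar = rng.normal(mu0, sigma0 / sqrt(n), size=N_sim)
bias_against = np.mean(RB_mu0(xbar) <= 1)

# Bias in favor: probability of evidence for H0 when the true mu differs by delta.
xbar = rng.normal(mu0 + delta, sigma0 / sqrt(n), size=N_sim)
bias_in_favor = np.mean(RB_mu0(xbar) >= 1)

print(f"bias against = {bias_against:.3f}, bias in favor = {bias_in_favor:.3f}")
```

In this configuration the bias against is small while the bias in favor is appreciable, so a larger n (or a smaller prior variance) would be needed before finding evidence in favor of H₀ could be considered compelling.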
The following example illustrates the relevance of bias in favor and, in particular, the dangers inherent in using diffuse priors to represent ignorance.
Example 2. (Continued).
The bias against and the bias in favor can be computed in closed form in this case; see [25]. Table 1 gives some values for the bias against when testing the hypothesis H₀ : μ = μ₀ for two priors. The results illustrate something that holds generally in this case: provided the prior variance τ₀² is not too small, the maximum bias against is achieved when the prior mean equals the hypothesized mean μ₀. This turns out to be very convenient, as controlling the bias against for this case controls the bias against everywhere. This seems paradoxical, as the maximum amount of belief is being placed at the hypothesized value. It is clear, however, that the more prior probability is assigned to a value, the harder it is for that probability to increase. This is another example of a situation where evidence works somewhat contrary to our intuition about how belief works. Another conclusion from Table 1 is that bias against is not a problem here. Provided the prior is not chosen too concentrated, this is generally the case, at least in our experience. As discussed in [25], when τ₀² → ∞ the bias against goes to 0 and the bias in favor goes to 1. This explains the phenomenon associated with the Jeffreys–Lindley paradox, as taking a very diffuse prior induces extreme bias in favor. So this argues against using arbitrarily diffuse priors. Rather, one should elicit the prior, prescribe the value of δ, and choose n to make the biases acceptable.
The accuracy of a valid estimate of ψ is measured by the size of the plausible region Pl_Ψ(x). As such, if the plausible region is reported as containing the true value and it does not, then the evidence is misleading. Biases in estimation problems are thus measured by the prior coverage probabilities associated with Pl_Ψ(x) and with the implausible region Im_Ψ(x) = {ψ : RB_Ψ(ψ | x) < 1}, the set of values for which there is evidence against. More details on this can be found in [25].
The choice of the prior can be used somewhat to control bias, but typically a prior that lowers one bias raises the other. It is established in [9] that, under quite general circumstances, both biases converge to 0 as the amount of data increases. So bias can be controlled by design a priori. There is a close connection between measuring bias for problems H and E and, moreover, controlling bias leads to confidence properties for Pl_Ψ(x). Frequentism is thus seen as an aspect of design, while inference is Bayesian and based on the observed data only. These issues are extensively discussed in [25].
7. Conclusions
This paper has been a survey of a foundational approach to a theory of statistical reasoning. Some recent applications to practical problems can be found in [26] (multiple testing and sparsity) and [27] (quantum physics).