Statistical Reasoning: Choosing and Checking the Ingredients, Inferences Based on a Measure of Statistical Evidence with Some Applications

The features of a logically sound approach to a theory of statistical reasoning are discussed. A particular approach that satisfies these criteria is reviewed. This is seen to involve the selection of a model, model checking, elicitation of a prior, checking the prior for bias, checking for prior-data conflict, and estimation and hypothesis assessment inferences based on a measure of evidence. A long-standing anomalous example is resolved by this approach to inference, and an application is made to a practical problem of considerable importance which, among other novel aspects of the analysis, involves the development of a relevant elicitation algorithm.


Introduction
It is relevant to ask what characteristics should be required of a theory of statistical reasoning. The phrase statistical reasoning is used here, as opposed to statistical inference, because there is a logical separation between how the ingredients to a statistical problem are chosen and checked for their validity, and the inference step that involves the application of the rules of a theory of inference to the ingredients. Thus, it is argued in Section 2 that there are two aspects to a theory of statistical reasoning: (i) specifying methodology for choosing and checking the ingredients to a statistical analysis beyond the data and (ii) specifying a theory of inference using these ingredients that is based on a measure of statistical evidence. These components correspond to the premises and the argument in logical reasoning.
In Section 3, a specific theory of statistical reasoning, as described in [1], that satisfies the criteria developed in Section 2 is reviewed. It is shown that an application of the theory of relative belief inference resolves difficulties in a problem that has led to anomalous results for other theories. It is to be noted that a number of additional examples have been published demonstrating that relative belief inference can lead to more satisfactory results than other approaches (see [2][3][4][5][6]). Particularly noteworthy is the Jeffreys-Lindley paradox where an increasingly diffuse prior typically leads to overwhelming evidence in favor of a hypothesis even when it seems contradicted by the data. The discussion of this paradox in [1] shows that the relative belief approach to inference leads to a clear resolution. A goal of this paper is to argue in favor of the relative belief approach to statistical inference based on its logical coherence and its utility in applications.
In Section 4, the theory of statistical reasoning is applied to an important practical problem where some inferential difficulties have arisen, namely, the problem of determining whether or not there is a relationship between a binary-valued response variable and predictors. For this, the response y ∈ {0, 1} is related to k quantitative predictors x = (x_1, . . . , x_k) via

y ∼ Bernoulli(p(x)) with p(x) = G(β_1 x_1 + · · · + β_k x_k),    (1)

where G is a known cdf and β = (β_1, . . . , β_k) ∈ R^k is unknown. This can be regarded as a case study for the overall approach and a number of novel results are obtained. Perhaps the biggest challenge with this model is determining a suitable prior and in Section 4.1 an elicitation algorithm is developed that improves on earlier efforts. The bias in the prior is measured in Section 4.2. Model checking is essential, namely, determining if (1) holds at least approximately. Since this is dealt with in [7], this approach is used here without much discussion. The check for prior-data conflict is developed in Section 4.3, together with an approach to modifying the prior when necessary, and relative belief inferences are applied in Section 4.4.
The main theme of the approach discussed here for (ii) centers around being clear about how statistical evidence is to be measured and, as will be seen, this has implications for (i) as well. Certainly, the concept of statistical evidence plays a key role in the development of the subject but, as discussed in Chapter 3 of [1], it can be argued that most approaches fail to deal adequately with this concept. There have, however, been a number of measures of evidence proposed in the philosophical and statistical literature and many of these are related to the relative belief ratio. For example, the Bayes factor is one such measure, but the relative belief ratio can be considered as more basic (see Section 3), and this comment applies to a number of other measures as well, as discussed in Chapter 4 of [1]. The relative belief ratio is certainly not a new measure of evidence, as [8] refers to the log of this quantity as the information, and there are some references to Keynes calling it the association factor. As far as we know, however, there have been no attempts to use this quantity as the central concept in the derivation of a theory of statistical inference. This leads to a number of unique characteristics for the theory outlined in Section 3.
There are many others who have also focused on treating statistical evidence and its measurement as a central concept in the subject. Jeffreys' earliest papers focus on this (see also [9]), and the paper [10] provides excellent coverage of this history. In addition, the paper [11] summarizes developments by these authors that are certainly in the same spirit as what is being advocated here for (ii). The philosophy of science literature on the topic of evidence is vast and we point to [12] as a nice introduction, with [13] being a more detailed discussion. The work of [14] is also particularly noteworthy in this regard.

The Foundations of Statistical Reasoning
When concerned with the foundations of statistics, it is reasonable to ask: what is the purpose of statistics as a subject or what is its role in science? To answer this, consider a context where an investigator has interest in some quantity and either wants to know (E) the value of this quantity or has a theory that leads to a specific value for the quantity and so wants to know (H) if this value is indeed correct and so test the theory. In addition, the investigator has available data d, produced in some fashion, which it is believed contains evidence concerning answers to (E) and (H). The purpose of statistical theory is to provide a reasoning process that can be applied to d to determine what the evidence has to say about (E) or (H), namely, produce an estimate of the quantity based on the evidence or assess whether there is evidence either in favor of or against the hypothesized value. In addition, as is widely recognized, estimation and hypothesis assessment should also produce a measure of the accuracy of the estimate and a measure of the strength of the evidence for or against the hypothesis. Answering (E) and/or (H) is called statistical inference and a sound, logical theory of statistical inference, which contains the fewest ingredients possible, can be viewed as a major goal of the subject.
Any theory that does not lead to specific answers to (E) and (H) or is dependent on ingredients or rules of reasoning that are not well-justified is unsatisfactory. In the end, the believability of the inferences drawn is entirely dependent on the soundness of the theory that produced them. Thus, statistics is not an empirical subject like physics, where conclusions can also be assessed against the empirical world, but is more like an extension of purely logical reasoning to contexts where the data does not lead to categorical answers to (E) and (H) and so produces uncertainty. The view is taken here that we want to maintain a close relationship between a theory of statistical reasoning and the theory of logical reasoning. This has a number of consequences with perhaps the most important being that it implies a separation of the assessment of the appropriateness of the ingredients specified to a statistical analysis beyond the data, and the theory producing the inferences. The ingredients play the role of the premises and the theory of statistical inference takes the role of the rules of inference used in a logical argument. The separation of these aspects of a logical argument has been understood since Aristotle (see [15]).
There are two main theses of the argument developed here: (I) all ingredients to a statistical analysis must be checkable against the data and (II) the theory of inference must be based on a measure of statistical evidence. The rationale for (I) and (II) are now considered with (II) discussed first, as it plays a key role.
The concept of the evidence in the data is clearly of utmost importance to statistical reasoning. There is no need, however, to provide a measure of the total evidence contained in the data, for the measure of evidence only has to deal with (E) and (H) for the quantity of interest. The measure of evidence must clearly indicate whether there is evidence for or against any specific value of the quantity of interest being the true value. This follows from the desired relationship with logic, as the rules of logical inference assume the truth of the premises, so the theory of statistical inference has to be based on the truth of the ingredients and this implies that one of the possible values for the quantity of interest is true. A theory of logical reasoning that could only determine falsity and never truth is not useful and similarly any valid measure of evidence must be able to indicate evidence in favor as well as evidence against.
Once a measure of statistical evidence is determined, an estimate of the quantity of interest is necessarily the value that maximizes the measure of evidence and the accuracy of the estimate can be assessed by looking at the size of the set of values that has evidence in their favor. The measure of evidence similarly necessarily determines whether there is evidence for or against a hypothesized value and the strength of this evidence can be assessed by comparison with the evidence associated with the other possible values for the quantity of interest.
Consider now requirement (I). If a satisfactory measure of evidence could be determined from the data alone, then this would be ideal, but currently this is not available and it is questionable whether it is even possible. It is assumed hereafter that the data x ∈ X can be regarded as having been produced by some probability distribution on the set X with unknown density f . If the data was collected via random sampling, then this assumption seems justified, but it is always an assumption. The density f is unknown and it is assumed that, once f is known, then this completely determines the answers to (E) and (H). The ingredients are then as follows: it is assumed that f ∈ { f θ : θ ∈ Θ}, a collection of densities on X indexed by the parameter θ ∈ Θ called the statistical model, and it is assumed that there is a probability measure Π with density π on Θ that represents beliefs of the investigator about the true value of θ ∈ Θ and called the prior. The ingredients correspond to the premises of a logical argument and these may be true or false.
It can be questioned whether both the model and prior are necessary for the development of a satisfactory theory and certainly minimizing the ingredients is desirable. However, as discussed in Section 3, it seems that a valid definition of a measure of evidence requires both, and the challenge remains open to develop a satisfactory measure of evidence that uses fewer ingredients. In particular, the use of a prior is often claimed to be inappropriate as it is subjective in nature and, as the goal of a scientific investigation is to be as objective as possible, the prior seems contrary to that. It needs to be recognized, however, that all ingredients to a statistical analysis beyond the data are subjective, as they are chosen by the statistician. As discussed in Section 3.2, it is possible to check both the model and the prior against the (objective) data to determine whether or not reasonable choices have been made. This can be considered as analogous to checking the consistency of the premises in a logical argument. In addition, it is possible to check whether or not the chosen ingredients have biased the results so that the inferences obtained are in fact foregone conclusions, namely, could have been made without even looking at the data. It is our view that checking for bias and checking for conflict with the data go a long way towards answering criticisms concerning the subjectivity inherent in a statistical analysis. Another implication of (I) is that no ingredient can be added to a statistical analysis unless it can be checked against the data and, as such, this rules out the use of loss functions.
It is not clear how the ingredients are to be chosen and guidance needs to be provided for this. Not much has been written about how the model is to be chosen, but certainly something needs to be said to justify a specific choice as part of the statistical reasoning argument. Much more has been written about the selection of the prior and the position is adopted here that it is necessary to base this on a clearly stated elicitation algorithm, namely, a prescription for how an expert can translate knowledge into beliefs as expressed via the prior.
In summary, the desiderata for a theory of statistical reasoning include the following: a methodology for choosing a model, an elicitation algorithm for selecting a prior, a methodology for assessing the bias in the chosen ingredients, procedures for model checking and for checking for prior-data conflict, and a theory of inference based upon a measure of statistical evidence.

A Theory of Statistical Reasoning
Choosing and checking the ingredients logically comes before inference, but it is convenient to discuss these in reverse order.

Relative Belief Inferences
Consider now a statistical problem with ingredients the data d, a model { f θ : θ ∈ Θ}, a prior π and interest is in making inference about ψ = Ψ(θ) for Ψ : Θ → Ψ where no distinction is made between the function and its range to save notation. Initially, suppose that all the probability distributions are discrete. This is not really a restriction in the discussion, however, as if something works for inference in the discrete case but does not work in the continuous case, then it is our view that the concept is not being applied correctly or the mathematical context is just too general. For us, the continuous case is always to be thought of as an approximation to a fundamentally discrete context, as measurements are always made to finite accuracy, and the approximation arises via taking limits. Some additional comments on the continuous case are made subsequently.
As discussed in Section 2, the basic object of inference is the measure of evidence and what is wanted is a measure of the evidence that any particular value ψ ∈ Ψ is true. Based on the ingredients specified, there are two probabilities associated with this value, namely, the prior probability π_Ψ(ψ), as given here by the marginal prior density evaluated at ψ, and the posterior probability π_Ψ(ψ | d), as given here by the marginal posterior density evaluated at ψ. In certain treatments of inference, π_Ψ(ψ | d) is taken as a measure of the evidence that ψ is the true value but, for a wide variety of reasons, this is not felt to be correct, and Example 1 provides a specific case where this fails. In addition, this measure suffers from the same basic problem as p-values, namely, there is no obvious dividing line between evidence for and evidence against. Moreover, it is to be noted that probabilities measure belief and not evidence. If we start with a large prior belief in ψ then, unless there is a large amount of data, there will still be a large posterior belief even if ψ is false; similarly, a small prior belief tends to remain small even when ψ is true. There is agreement, however, on using the principle of conditional probability to update beliefs after receiving information or data, and this is to be regarded as the first principle of the theory of relative belief.
Thus, how is the evidence that ψ is the true value to be measured? Basic to this is the principle of evidence: if π_Ψ(ψ | d) > π_Ψ(ψ), there is evidence in favor, as belief has increased; if π_Ψ(ψ | d) < π_Ψ(ψ), there is evidence against, as belief has decreased; and if π_Ψ(ψ | d) = π_Ψ(ψ), there is no evidence either way. This principle has a long history in the philosophical literature concerning evidence. The principle does not provide a specific measure of evidence, but at least it indicates clearly when there is evidence for or against, independent of the size of initial beliefs, and it suggests that any reasonable measure of the evidence depends on the difference, in some sense, between π_Ψ(ψ) and π_Ψ(ψ | d), namely, evidence is measured by change in belief rather than by belief. A number of measures of this change have been proposed (see [1] for a discussion) but, by far, the simplest and the one with the nicest properties is the relative belief ratio

RB_Ψ(ψ | d) = π_Ψ(ψ | d)/π_Ψ(ψ).    (2)

By the principle of evidence, there is evidence in favor of ψ being the true value if RB_Ψ(ψ | d) > 1, there is evidence against ψ being the true value if RB_Ψ(ψ | d) < 1 and no evidence either way if RB_Ψ(ψ | d) = 1. The use of the relative belief ratio to measure the evidence is the third and final principle of the theory, which we call the principle of relative belief. The relative belief ratio can also be written as RB_Ψ(ψ | d) = m(d | ψ)/m(d), where m is the prior predictive density of the data and m(· | ψ) is the conditional prior predictive density of the data given Ψ(θ) = ψ. Another natural candidate for a measure of evidence is the Bayes factor BF_Ψ(ψ | d), as this also satisfies the principle of evidence, namely, BF_Ψ(ψ | d) > (<, =) 1 when there is evidence for (against, neither) ψ being the true value. The Bayes factor can be defined in terms of the relative belief ratio as BF_Ψ(ψ | d) = RB_Ψ(ψ | d)/RB_Ψ({ψ}^c | d), where RB_Ψ({ψ}^c | d) = Π_Ψ({ψ}^c | d)/Π_Ψ({ψ}^c) and Π_Ψ, Π_Ψ(· | d) are the prior and posterior probability measures of ψ, respectively.
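The computation of the relative belief ratio in the discrete case can be sketched as follows. The discretized Bernoulli(θ) model, the uniform grid prior and the data (7 successes in 10 trials) are illustrative assumptions made for this sketch, not taken from the text.

```python
import math

# Illustrative discretized Bernoulli(theta) model: the grid, the uniform
# prior and the data are assumptions for this sketch.
thetas = [i / 20 for i in range(1, 20)]          # 0.05, 0.10, ..., 0.95
prior = {t: 1 / len(thetas) for t in thetas}     # uniform prior on the grid

n, s = 10, 7                                     # observed: 7 successes in 10 trials

def likelihood(t):
    return math.comb(n, s) * t**s * (1 - t)**(n - s)

# Posterior via the principle of conditional probability
m = sum(prior[t] * likelihood(t) for t in thetas)    # prior predictive of the data
posterior = {t: prior[t] * likelihood(t) / m for t in thetas}

# Relative belief ratio: the change from prior to posterior belief
RB = {t: posterior[t] / prior[t] for t in thetas}

# Principle of evidence: RB > 1 is evidence for, RB < 1 is evidence against
print(RB[0.70] > 1, RB[0.05] < 1)
```

Note that the prior expectation of RB is always 1, since summing π(θ)RB(θ | d) over the grid recovers the total posterior probability; evidence in favor at some values must be balanced by evidence against at others.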
It might appear that BF_Ψ(ψ | d) is a comparison between the evidence for ψ being true and the evidence for ψ being false but, since it is provable that RB_Ψ(A | d) > 1 implies RB_Ψ(A^c | d) < 1 and conversely, this is not the case. In addition, as subsequently discussed, in the continuous case it is natural to define the Bayes factor so that it equals the relative belief ratio. The principle of relative belief leads to an ordering of the possible values for ψ, as ψ_1 is preferred to ψ_2 whenever RB_Ψ(ψ_1 | d) > RB_Ψ(ψ_2 | d), since there is then more evidence for ψ_1 than for ψ_2. When Ψ(θ) = θ, this agrees with the likelihood ordering, but likelihood fails to provide such an ordering for general ψ. It is common to use the profile likelihood ordering, even though this is not a likelihood ordering and it does not agree with the relative belief ordering. It is noteworthy that the relative belief approach is consistent in the sense that inferences for every ψ = Ψ(θ) are based on a measure of the change from prior to posterior probabilities. The relative belief ordering leads immediately to a theory of estimation. Basing inferences on the evidence requires that the relative belief estimate be a value ψ(d) that maximizes RB_Ψ(ψ | d) and typically such a value is unique, so ψ(d) = arg sup_{ψ ∈ Ψ} RB_Ψ(ψ | d). It is also necessary to say something about the accuracy of this estimate in an application. For this, a set of values containing ψ(d) is quoted and the "size" of the set is taken as the measure of accuracy. Again, following the ordering based on the evidence, it is necessary that the set take the form {ψ : RB_Ψ(ψ | d) > c} for some c. However, what c should be used? It is perhaps natural to choose c so that {ψ : RB_Ψ(ψ | d) > c} contains some prescribed amount of posterior probability, so the set is a γ-credible region. There are, however, several problems with this approach, beginning with the question of what γ should be chosen.
Even if one is content with some particular γ, say γ = 0.95, there is the problem that the set may contain values ψ with RB_Ψ(ψ | d) < 1, and such values have been ruled out since there is evidence against such a ψ being true. It is argued in [1] that the plausibility set Pl_Ψ(d) = {ψ : RB_Ψ(ψ | d) > 1} be used, as Pl_Ψ(d) contains all the values for which there is evidence in favor of being the true value. In general circumstances, it is provable that Π_Ψ(Pl_Ψ(d) | d) ≥ Π_Ψ(Pl_Ψ(d)), so belief that the true value lies in Pl_Ψ(d) never decreases. There are several possible measures of the size of Pl_Ψ(d): certainly the posterior content Π_Ψ(Pl_Ψ(d) | d) is one, as this measures the belief that the true value is in Pl_Ψ(d), but a measure such as length or cardinality is also relevant. If Pl_Ψ(d) is small and Π_Ψ(Pl_Ψ(d) | d) is large, then ψ(d) can be judged to be an accurate estimate of ψ.
It is immediate that RB_Ψ(ψ_0 | d) measures the evidence concerning H_0 : Ψ(θ) = ψ_0. The evidential ordering implies that the smaller RB_Ψ(ψ_0 | d) is than 1, the stronger is the evidence against H_0, and the bigger it is than 1, the stronger is the evidence in favor of H_0, but how is one to measure this strength? In [16], it is proposed to measure the strength of the evidence via

Π_Ψ(RB_Ψ(ψ | d) ≤ RB_Ψ(ψ_0 | d) | d),    (3)

which is the posterior probability that the true value of ψ has evidence no greater than that obtained for the hypothesized value ψ_0. When RB_Ψ(ψ_0 | d) < 1 and (3) is small, then there is strong evidence against H_0, since there is a large posterior probability that the true value of ψ has a larger relative belief ratio. Similarly, if RB_Ψ(ψ_0 | d) > 1 and (3) is large, then there is strong evidence that the true value of ψ is given by ψ_0, since there is a large posterior probability that the true value is in the set {ψ : RB_Ψ(ψ | d) ≤ RB_Ψ(ψ_0 | d)} and ψ_0 maximizes the evidence in this set. Additional results concerning RB_Ψ(ψ_0 | d) as a measure of evidence and (3) can be found in [1,16].
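The estimate, the plausibility set and the strength calculation of (3) can be sketched together in the discrete case. The discretized Bernoulli(θ) model, uniform grid prior, data and hypothesized value below are illustrative assumptions.

```python
import math

# Illustrative discretized Bernoulli(theta) model with a uniform grid prior
thetas = [i / 20 for i in range(1, 20)]
prior = {t: 1 / len(thetas) for t in thetas}
n, s = 10, 7                                   # assumed data: 7 successes in 10

def like(t):
    return math.comb(n, s) * t**s * (1 - t)**(n - s)

m = sum(prior[t] * like(t) for t in thetas)
post = {t: prior[t] * like(t) / m for t in thetas}
RB = {t: post[t] / prior[t] for t in thetas}

# Relative belief estimate: the value maximizing the evidence
est = max(thetas, key=lambda t: RB[t])

# Plausibility set and its posterior content (the accuracy assessment)
Pl = [t for t in thetas if RB[t] > 1]
content = sum(post[t] for t in Pl)

# Strength of the evidence for H0: theta = 0.5, as in (3): the posterior
# probability that the true value has evidence no greater than RB(0.5)
psi0 = 0.5
strength = sum(post[t] for t in thetas if RB[t] <= RB[psi0])
print(est, Pl, round(content, 3), round(strength, 3))
```

Here RB(ψ_0 | d) > 1 while (3) is moderate, so there is evidence in favor of θ = 0.5 but it is not strong, which is the kind of graded conclusion the strength measure is designed to support.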
In the continuous case, the relative belief ratio is defined via the limit RB_Ψ(ψ | d) = lim_{ε→0} Π_Ψ(N_ε(ψ) | d)/Π_Ψ(N_ε(ψ)), where N_ε(ψ) is a sequence of sets converging nicely to {ψ} as ε → 0. When the densities are continuous at ψ, this limit equals (2), so this is a measure of evidence in general circumstances. In addition, it is natural to define the Bayes factor by the corresponding limit BF_Ψ(ψ | d) = lim_{ε→0} BF_Ψ(N_ε(ψ) | d), and this limit also equals (2). A variety of consistency results, as the amount of data increases, are proved in [1] concerning the estimation and hypothesis assessment procedures. In particular, when H_0 is false, then (2) converges to 0, as does (3). When H_0 is true, then (2) converges to its largest possible value (greater than 1 and often ∞) and, in the discrete case, (3) converges to 1. In the continuous case, however, when H_0 is true, then (3) typically converges to a U(0, 1) random variable. This simply reflects the approximate nature of the inferences and is easily resolved by requiring that a deviation δ > 0 be specified such that, if dist(ψ_1, ψ_2) < δ, where dist is some measure of distance determined by the application, then this difference is to be regarded as immaterial. This leads to redefining H_0 as H_0 = {ψ : dist(ψ, ψ_0) < δ} and typically a natural discretization of Ψ exists with H_0 as one of its elements. With this modification, (3) converges to 1 as the amount of data increases when H_0 is true. Given that data is always measured to finite accuracy, the value of a typical continuous-valued parameter can only be known to a certain finite accuracy no matter how much data is collected. Thus, such a δ always exists and it is part of an application to determine the relevant value (see Example 7 here, Al-Labadi, Baskurt and Evans [7] and Evans, Guttman and Li [17] for developments on determining δ). These results establish that, as the amount of data increases, relative belief inferences will inevitably produce the correct answers to estimation and hypothesis assessment problems.
It is immediate that relative belief inferences have some excellent properties. For example, any 1-1 increasing function of RB_Ψ(· | d), such as log RB_Ψ(· | d), can be used to measure evidence, as the inferences are invariant to this choice. In addition, RB_Ψ(· | d) is invariant under smooth reparameterizations and so all relative belief inferences possess this invariance property. By contrast, MAP (maximum a posteriori) inferences are not invariant and this leads to considerable doubt about their validity (see also Example 1). In [1], results from a number of papers are summarized establishing optimality results for relative belief inferences within the collection of all Bayesian inferences. For example, Al-Labadi and Evans [18] establish that relative belief inferences for ψ possess optimal robustness to the prior π_Ψ. In addition, as discussed in Section 3.2, since the inferences are based on a measure of evidence, a key criticism of Bayesian methodology can be addressed, namely, the extent to which the inferences are biased can be measured.
Relative belief prediction inferences for future data are naturally produced by using the ratio of the posterior to prior predictive densities for the quantity in question. The following example illustrates this and demonstrates significant advantages for relative belief.
The relative belief ratio for (y_1, . . . , y_f) is given in (5). With n = f = 20 and n x̄ = 6, Figure 1 gives the plot of (5) as a function of n ȳ. The best relative belief predictor of (y_1, . . . , y_f) is any sample with f ȳ = 6, and Pl_n(x_1, . . . , x_n) = {(y_1, . . . , y_f) : f ȳ = 2, 3, . . . , 10} has posterior content 0.893. Thus, there is reasonable belief that the plausibility set contains the "true" future sample, but certainly there are many such samples. By contrast with MAP, a sensible prediction is made using relative belief.
which equals the sum over all the summands that are greater than 1/(n + 1), where m_{n,f}((y_1, . . . , y_f) | (0, . . . , 0)) denotes the conditional prior predictive of the future sample. When ȳ = k/n, the corresponding term converges to (1/2)^{k+1}. Thus, for all n large enough, the sum (6) contains the terms for ȳ = 0, 1/n, . . . , k/n. Therefore, for ε > 0 and all n large enough, (6) is greater than (1/2)[1 + 1/2 + · · · + (1/2)^k] − ε = 1 − (1/2)^{k+1} − ε and the posterior content of Pl_n(0, . . . , 0) converges to 1. Thus, relative belief also behaves appropriately when f = n and x̄ = 0, while MAP does not. The failure of MAP might be attributed to the requirement that the entire sample (y_1, . . . , y_n) be predicted. If instead it was required only to predict the value n ȳ, then the prior predictive of this quantity is uniform on {0, 1, . . . , f}, the posterior of n ȳ equals RB((y_1, . . . , y_f) | (x_1, . . . , x_n))/(f + 1) and the relative belief ratio for n ȳ equals RB((y_1, . . . , y_f) | (x_1, . . . , x_n)). Thus, as is often the case when the quantity in question has a uniform prior, MAP and relative belief estimates are the same. However, even in this case, there is no natural cut-off for MAP inferences to say when there is evidence for or against a particular value. The fact that it is necessary to modify the problem in this way to get a reasonable inference is, in our view, a substantial failing of MAP. It seems reasonable to suggest that, when an inference approach is shown to perform poorly on such examples, it not be generally recommended. Additional examples of poor performance of MAP are discussed in [1].
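Relative belief prediction via the ratio of the posterior to prior predictive densities can be sketched for a Bernoulli model. The grid prior is an illustrative assumption; the counts n = f = 20 with n x̄ = 6 match the numbers quoted in the example, and here only the future total count is predicted rather than the entire future sample.

```python
import math

# Illustrative grid prior on theta for a Bernoulli model
thetas = [i / 20 for i in range(1, 20)]
prior = {t: 1 / len(thetas) for t in thetas}
n, s = 20, 6          # observed sample: s successes in n trials
f = 20                # a future sample of size f; predict its total count k

def binom(m, k, t):
    return math.comb(m, k) * t**k * (1 - t)**(m - k)

md = sum(prior[t] * binom(n, s, t) for t in thetas)
post = {t: prior[t] * binom(n, s, t) / md for t in thetas}

# Relative belief ratio for prediction: the ratio of the posterior predictive
# to the prior predictive of the future count k
def rb_pred(k):
    pri = sum(prior[t] * binom(f, k, t) for t in thetas)
    pst = sum(post[t] * binom(f, k, t) for t in thetas)
    return pst / pri

best = max(range(f + 1), key=rb_pred)   # relative belief predictor of the count
print(best)
```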
It is notable that, while the relative belief approach to inference has been described here using statistical models and priors, in reality, everything can be cast in terms of a single probability model however such an object arises. Thus, if P is a probability measure on a sample space Ω and A ⊂ Ω is an event whose truth value is unknown but C ⊂ Ω is known to be true, then the evidence concerning the truth of A is given by RB(A | C) = P(A | C)/P(A), with this defined by the appropriate limit when either A or C is a null event. As discussed in [1], the relative belief approach to inference can be seen as essentially probability theory together with the principles of evidence and relative belief.
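The single-probability-model formulation can be illustrated with a toy example: a fair six-sided die (an illustrative choice), an event A of interest and a known event C.

```python
from fractions import Fraction

# Single probability model: a fair six-sided die (illustrative)
omega = range(1, 7)
P = {w: Fraction(1, 6) for w in omega}

A = {2, 4, 6}        # event of interest: the roll is even
C = {4, 5, 6}        # known to be true: the roll is at least 4

P_A = sum(P[w] for w in A)
P_A_given_C = sum(P[w] for w in A & C) / sum(P[w] for w in C)

# RB(A | C) = P(A | C)/P(A) = (2/3)/(1/2) = 4/3 > 1: evidence in favor of A
RB = P_A_given_C / P_A
print(RB)
```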

Choosing and Checking the Ingredients
The first choice that must be made is the model and there are a number of standard models used in practice. Not a lot has been written about this step, however, and yet it is perhaps the most important step in solving a statistical problem. It is generally accepted that the correct way to choose a prior is through elicitation. This means that a methodology is prescribed that directs an expert in the application area on how to translate their knowledge into a prior. There are various default priors in use that avoid this elicitation step, but it is far better to recommend that sufficient time and energy be allocated for the elicitation of a proper prior. Staying within the context of probability also means that a variety of paradoxes and illogicalities are avoided.
Given the ingredients, the relative belief inferences may be applied correctly, but it is still reasonable to ask if these ingredients are appropriate for the particular application. If not, then the inferences drawn cannot be considered valid. There are at least two questions about the ingredients that need to be answered, namely, is there bias inherent in the choice of ingredients and are the ingredients contradicted by the data?
The concern about bias is best understood in terms of assessing the hypothesis H_0 : Ψ(θ) = ψ_0. Let M(· | ψ) denote the prior predictive distribution of the data given that Ψ(θ) = ψ. Bias against H_0 means that the ingredients are such that, with high probability, evidence will not be obtained in favor of H_0 even when it is true. Bias against is thus measured by

M(RB_Ψ(ψ_0 | D) ≤ 1 | ψ_0).    (7)

If (7) is large, then obtaining evidence against H_0 seems like a foregone conclusion. For bias in favor of H_0, consider M(RB_Ψ(ψ_0 | D) ≥ 1 | ψ_*), where dist(ψ_*, ψ_0) = δ, so ψ_* is a value that differs from the hypothesized value by just a meaningful amount. Bias in favor of H_0 is then measured by

sup_{ψ_* ∈ {ψ : dist(ψ, ψ_0) = δ}} M(RB_Ψ(ψ_0 | D) ≥ 1 | ψ_*).    (8)

If (8) is large, then obtaining evidence in favor of H_0 seems like a foregone conclusion. Typically, M(RB_Ψ(ψ_0 | D) ≥ 1 | ψ_*) decreases as dist(ψ_*, ψ_0) increases, so (8) is an appropriate measure of bias in favor of H_0. The choice of the prior can be used somewhat to control bias, but typically a prior that makes one bias lower just makes the other bias higher. It is established in [1] that, under quite general circumstances, both biases converge to 0 as the amount of data increases. Thus, bias can be controlled by design a priori.

The model needs to be checked against the data for, if the data d lies in the tails of every distribution in the model, then this suggests model failure. There is a wide variety of approaches to assessing this and these are not reviewed here. One relevant comment is that, at this time, there do not seem to exist general methodologies for modifying a model when model failure is encountered.
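The biases (7) and (8) can be estimated by simulating from the prior predictive distribution M(· | ψ). A minimal Monte Carlo sketch for a discretized Bernoulli model follows, with illustrative choices of grid, sample size, ψ_0 = 0.5 and δ = 0.1.

```python
import math, random

# Illustrative discretized Bernoulli model with a uniform grid prior
random.seed(1)
thetas = [i / 20 for i in range(1, 20)]
prior = {t: 1 / len(thetas) for t in thetas}
n = 10
psi0 = 0.5

def rb_at(t0, s):
    # Relative belief ratio at t0 given s successes: posterior/prior at t0
    like = {t: math.comb(n, s) * t**s * (1 - t)**(n - s) for t in thetas}
    m = sum(prior[t] * like[t] for t in thetas)
    return (prior[t0] * like[t0] / m) / prior[t0]

def sim_prob_rb_le_1(true_theta, reps=2000):
    # Monte Carlo estimate of M(RB(psi0 | D) <= 1 | true_theta)
    hits = 0
    for _ in range(reps):
        s = sum(random.random() < true_theta for _ in range(n))
        hits += rb_at(psi0, s) <= 1
    return hits / reps

bias_against = sim_prob_rb_le_1(psi0)       # measure (7): want this small
delta = 0.1
# Measure (8): probability of evidence in favor when H0 is meaningfully false,
# using P(RB > 1) as a stand-in for M(RB >= 1) (ties have negligible mass)
bias_in_favor = max(1 - sim_prob_rb_le_1(psi0 - delta),
                    1 - sim_prob_rb_le_1(psi0 + delta))
print(round(bias_against, 3), round(bias_in_favor, 3))
```

With only n = 10 observations, bias in favor is large here, which is consistent with the remark that both biases can only be driven down by collecting more data.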
The prior can also be checked for conflict with the data. A conflict means that the observed data lie in the tails of all those distributions in the model where the prior primarily places its mass. For a minimal sufficient statistic T for the model, Evans and Moshonov [20] used the tail probability

M_T(m_T(T) ≤ m_T(T(d))),    (9)

where m_T denotes the prior predictive density of T, to assess prior-data conflict, with (9) small indicating prior-data conflict. In [21], it is shown that, under general circumstances, (9) converges to Π(π(θ) ≤ π(θ_true)) as the amount of data increases. There is a variety of refinements of (9) that allow for looking at particular components of a prior to isolate where a problem with the prior may lie. In [22], a method is developed for replacing a prior when a prior-data conflict has been detected. This does not mean simply replacing the prior by one that is more diffuse, however, as is demonstrated in Section 4.1.
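The tail probability (9) can be computed directly in a small discrete setting. The sketch below assumes a Bernoulli model, where the success count T is minimal sufficient, with an illustrative prior concentrated near θ = 0.2 and observed data that conflict with it.

```python
import math

# Discretized Bernoulli model; T = number of successes in n trials is a
# minimal sufficient statistic. The prior weights (concentrated near 0.2)
# and the observed count are illustrative assumptions.
thetas = [i / 20 for i in range(1, 20)]
w = {t: t**2 * (1 - t)**8 for t in thetas}       # beta(3, 9)-shaped prior
Z = sum(w.values())
prior = {t: w[t] / Z for t in thetas}
n, s_obs = 10, 9                # 9 successes: far from where the prior sits

# Prior predictive density of T
def m_T(s):
    return sum(prior[t] * math.comb(n, s) * t**s * (1 - t)**(n - s)
               for t in thetas)

# Tail probability: P(m_T(T) <= m_T(observed)) under the prior predictive;
# a small value indicates prior-data conflict
tail = sum(m_T(s) for s in range(n + 1) if m_T(s) <= m_T(s_obs))
print(round(tail, 4))
```

Here the tail probability is tiny, so the check correctly flags that a prior concentrated on small θ conflicts with data showing 9 successes in 10 trials.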

Binary-Valued Response Regression Models
The following example, based on real data, is used to illustrate each aspect of the approach to statistical reasoning recommended here. Example 2. Bioassay experiment. Table 1 gives the results of exposing animals to various dosages, in g/mL, of a toxin, where x_2 is the log-dosage and the number of deaths is recorded at each dosage (see [23]). The dosages range from e^{-0.86} = 0.423 to e^{0.73} = 2.075 g/mL. The logistic regression model p(x_1, x_2) = G(β_1 + β_2 x_2) is considered for these data, so x_1 ≡ 1, G(z) = e^z/(1 + e^z), (β_1, β_2) ∈ R^2 and p(1, x_2) is the probability of death at dosage x_2. The counts T = (t_1, t_2, t_3, t_4) at the dosages comprise a minimal sufficient statistic for this problem, with observed value (0, 1, 3, 5). The conditional distribution of T given (β_1, β_2) is a product of binomials.
In [7], a goodness-of-fit test of this model was applied to these data using a uniform prior on the space [0, 1]^4 of all probabilities. Relative belief was used to assess the hypothesis that the model is correct, overwhelming evidence in favor of the model was obtained, and so model correctness is assumed here. One goal is the estimation of (β1, β2) and another is the assessment of the hypothesis H0 : β2 = 0. Acceptance of H0 implies that there is no relationship between the response and the predictor.

Eliciting the Prior
Elicitation of a prior can be difficult when the interpretation of the parameters is unclear. For example, with model (1), it is not clear what the β_i represent, in contrast to linear models, where they represent either location parameters or rates of change with respect to predictors. This leads to attempts to put default priors on these quantities, and there are problems with this approach. For example, suppose p(1, x) = G(β1 + β2 x), where G is the standard logistic cdf and the prior is given by the β_i being i.i.d. N(0, σ²), where σ² is chosen large to reflect little information about these values. In Figure 2, we have plotted the prior this induces on p(1, 1) when σ = 20. This reflects the fact that, as σ grows, all the prior probability for p(1, x) piles up at 0 and 1, so this is clearly a poor choice and it is certainly not noninformative. The strange behavior of diffuse normal priors has been noted by others. Bedrick, Christensen and Johnson [24,25], based on [26], recommend that priors instead be placed on the p(x_i), as these are parameters for which there is typically prior information. Their recommendation is that k of the x_i values be selected and then beta(α_{1i}, α_{2i}) priors be placed on the corresponding p(x_i) via eliciting prior quantiles. This results in more sensible priors but depends on the choice of the observed predictors, and it is unclear what kind of prior this induces on the β_i.
Following [24,25], priors here are elicited for the probabilities, but the approach is different. First, it is not required that the elicitation be carried out at observed values of the predictors. Rather, it is supposed that there is a set of linearly independent predictor vectors w_1, . . . , w_k where bounds can be placed on the probabilities in the sense that l(w_i) ≤ p(w_i) ≤ u(w_i) for i = 1, . . . , k with virtual certainty. By virtual certainty, it is meant that, for the prior probability measure Π,

Π(l(w_i) ≤ p(w_i) ≤ u(w_i) for i = 1, . . . , k) ≥ γ,   (10)

where γ is chosen to be close to 1. For example, γ = 0.99 certainly seems satisfactory for many applications, but a higher or lower standard can be chosen. The motivation for this is that typically information will be available about the probabilities, such as it being known that p(w_i) is very small (or very large), or that p(w_i) almost certainly lies in some specific range. Of course, for some of the w_i, virtually nothing may be known about p(w_i), and in that case taking [l(w_i), u(w_i)] = [0, 1] is appropriate. One implication of this is that, when the choice [l(w_i), u(w_i)] = [0, 1] is made for every i, the elicitation procedure should lead to a Π that is at least approximately uniform on the probabilities.
The approach to elicitation via stating bounds on parameter values that hold with virtual certainty has been successfully employed in [2] to determine a prior for the multivariate normal model and in [17] to determine a prior for the multinomial model. Another reason for allowing the elicitation procedure to be independent of the observed x_i is that prior beliefs about p(x_i) may apply equally well to p(x_j) for some j, simply because x_i and x_j are close, and then it seems that the correlation between the beliefs should be part of the prior. Modelling such correlations is harder and hopefully can be avoided by choosing the w_i carefully. For example, requiring the w_i to be mutually orthogonal seems like an appropriate way of achieving independence in many contexts.
The second way in which our approach differs from previous developments is that Π is restricted to the family of multivariate normal priors on β, as this allows us to see directly how (10) translates into information about β. Note that (10) is equivalent to

Π(G^{-1}(l(w_i)) ≤ w_i'β ≤ G^{-1}(u(w_i)) for i = 1, . . . , k) ≥ γ,   (11)

where Wβ ~ N_k(μ0, Σ0) with W the k × k matrix whose i-th row is w_i', and it is clear what this says about β. The task then is to determine (μ0, Σ0) so that (11) holds. Given that the w_i have been chosen so that prior beliefs about the probabilities p(w_i) are independent, this implies that the coordinates of Wβ are independent and so Σ0 = diag(σ1², . . . , σk²) for some choice of the prior variances σ_i². There are, however, typically many choices satisfying (11). For example, taking σ_i² = 0 for all i achieves this, but clearly this choice does not reflect what is actually known about the probabilities. As might be expected, the choice of the σ_i² is critical and dependent on G. Furthermore, as Figure 1 demonstrates, an injudicious choice results in absurdities.
Since G^{-1}(u(w_i)) − μ_0i > 0 and G^{-1}(l(w_i)) − μ_0i < 0, both these values are infinite iff [l(w_i), u(w_i)] = [0, 1], and so no information is being introduced via the prior. In such a case, a uniform[0, 1] prior on the probability results, and the appropriate normal distribution is determined by approximating the distribution function G by a normal cdf. Suppose then that at least one of G^{-1}(u(w_i)) and G^{-1}(l(w_i)) is finite and that σ_i satisfies

Φ((G^{-1}(u(w_i)) − μ_0i)/σ_i) − Φ((G^{-1}(l(w_i)) − μ_0i)/σ_i) ≥ γ^{1/k},   (12)

as then independence ensures that (11) is satisfied; here μ_0i is taken to be the midpoint (G^{-1}(l(w_i)) + G^{-1}(u(w_i)))/2 when both bounds are finite. In that case, the left side of (12) has the value 1 when σ_i = 0 and is strictly decreasing to the value 0 as σ_i → ∞, so there are always values of σ_i ≥ 0 satisfying (12). There is then a unique largest solution to (12), which is the preferred solution as it best represents the prior information, and it is easily found numerically by bisection. If u(w_i) = 1 and l(w_i) ∈ (0, 1), then (12) becomes 1 − Φ((G^{-1}(l(w_i)) − μ_0i)/σ_i) ≥ γ^{1/k}; if u(w_i) ∈ (0, 1) and l(w_i) = 0, then (12) becomes Φ((G^{-1}(u(w_i)) − μ_0i)/σ_i) ≥ γ^{1/k}. The following examples consider the situation l(w_i) = 0, u(w_i) = 1. In this case, μ_0i = 0 and G^{-1}(p(w_i)) will be distributed with cdf G when p(w_i) ~ U(0, 1). Generally, this leads to a need to approximate G by a normal cdf to obtain a normal prior, although no approximation is required in Example 3.
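The largest σ_i can be found by bisection because the coverage on the left of (12) is decreasing in σ_i, so its solution set is an interval [0, σ*]. A minimal sketch for the case of finite bounds, taking μ_0i as the midpoint and computing Φ from the error function:

```python
import math

def Phi(z):
    # standard normal cdf via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def largest_sigma(ginv_l, ginv_u, gamma_k, tol=1e-10):
    # largest sigma with Phi((ginv_u - mu0)/sigma) - Phi((ginv_l - mu0)/sigma)
    # >= gamma_k, assuming both bounds finite and mu0 the midpoint; the
    # coverage is 1 at sigma = 0 and strictly decreasing, so bisection applies
    mu0 = 0.5 * (ginv_l + ginv_u)
    def coverage(s):
        return Phi((ginv_u - mu0) / s) - Phi((ginv_l - mu0) / s)
    lo, hi = tol, 1.0
    while coverage(hi) >= gamma_k:   # expand until the crossing point is bracketed
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if coverage(mid) >= gamma_k:
            lo = mid
        else:
            hi = mid
    return lo

# logistic G, bounds [0.25, 0.75] on p(w_i), gamma = 0.99 with k = 2
sigma = largest_sigma(math.log(1 / 3), math.log(3), 0.99 ** 0.5)
```

Here sigma comes out near 0.39; wider bounds give a larger σ_i, and [l, u] → [0, 1] sends σ_i → ∞, which is why the noninformative case is handled separately.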

Example 3. Probit regression.
Here, G = Φ and so G^{-1}(p(w_i)) ~ N(0, 1) when p(w_i) ~ U(0, 1). As such, σ_i = 1 and the standard normal distribution on G^{-1}(p(w_i)) corresponds to no information about p(w_i). When there is no information about any of the p(w_i), then β ~ N_k(0, W^{-1}(W^{-1})'), which equals the N_k(0, I) distribution whenever W is an orthogonal matrix. In general, however, a lack of information about the probabilities leads to a prior on β that is dependent on W, namely, dependent on the values of the predictor variables corresponding to the probabilities.

Example 4. Logistic regression.
In this case, G is the standard logistic cdf and so w_i'β = G^{-1}(p(w_i)) is distributed standard logistic when p(w_i) ~ U(0, 1). A well-known N(0, λ²) approximation to the standard logistic distribution, as discussed in [27], leads to normal priors that are much easier to work with. The optimal choice of λ, in the sense that it minimizes max_{x∈R} |Φ(x/λ) − e^x/(1 + e^x)|, is λ = 1.702, and this leads to a maximum difference of just under 0.01. Clearly, this error will generally be irrelevant when considering priors for the probabilities in a logistic regression problem. Thus, when w_i'β ~ N(0, 1.702²), then p(w_i) is approximately distributed U(0, 1) with the same maximum error. Figure 3 contains plots of the density of p = e^z/(1 + e^z) when z ~ N(0, λ²) for various choices of λ, and it is indeed approximately uniform when λ = 1.702. Using normal probabilities rather than logistic probabilities leads to relatively small differences, so it seems reasonable to use a normal prior on β in a logistic regression.
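The quality of the λ = 1.702 approximation is easy to verify numerically by scanning a fine grid for the maximum absolute discrepancy between the two cdfs:

```python
import math

def Phi(z):
    # standard normal cdf via the error function
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))

lam = 1.702
# maximum |Phi(x/lambda) - logistic(x)| over a grid covering the support
max_err = max(abs(Phi(x / lam) - logistic_cdf(x))
              for x in (i / 1000.0 for i in range(-8000, 8001)))
```

On this grid the maximum discrepancy is roughly 0.0095, i.e. just under 0.01, confirming that the normal stand-in is harmless for prior specification.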

Example 5. t regression.
Suppose that G is taken to be the cdf of the t distribution with υ degrees of freedom. Table 2 presents the optimal choice of λ for a N(0, λ²) approximation to the t(υ) cdf, together with the maximum error. There does not appear to be much difference in using t(υ) probabilities instead of normal ones unless υ is quite low.

Consider now an application of the elicitation algorithm.

Example 6. Bioassay experiment (Example 2 continued).
In this example, k = 2. To determine the prior, it is necessary to choose W = (w_1 w_2) ∈ R^{2×2} and the bounds [l(W), u(W)]. The authors are not experts in bioassay but, given the range of dosages applied in the experiment, it is reasonable to suppose that an expert might be willing to put bounds on the probabilities that hold with prior probability γ = 0.99 when x_2 = −0.50 and x_2 = 0.50, leading to the elicited bounds on the corresponding probabilities.

Measuring the Bias in a Prior
Consider applying the approach discussed in Section 3.2 to measuring bias in the prior derived in Section 4.1 for the bioassay example.

Algorithm 1
(i) Simultaneously estimate the values m_T(t1, t2, t3, t4) for each (t1, t2, t3, t4) via a large sample from (13) and store these in a table.
(ii) Simultaneously estimate the values m_T(t1, t2, t3, t4 | β2 = 0) for each (t1, t2, t3, t4) via a large sample from π1(· | 0) and store these in a table.
(iii) Using the values in these two tables, estimate RB2(0 | T) for all values of T and then estimate the bias against, M(RB2(0 | T) ≤ 1 | β2 = 0).

The bias in favor can be estimated at ±δ in exactly the same way, but in step (ii) replace π1(· | 0) by π1(· | −δ) and by π1(· | δ). These computations were carried out and resulted in the bias against equaling 0.22 and the bias in favor equaling 0.77 at −δ and 0.78 at δ. Thus, there is some bias against H0 with this prior, but there is appreciable bias in favor of H0, at least when interest is in detecting deviations of size δ = 0.01. For β2 = 5, however, the bias in favor of H0 is 0.006, so there is in reality no bias in favor for large values of this parameter. One could contemplate modifying the prior to reduce the bias in favor at δ = 0.01, but typically this just results in trading bias in favor for bias against. The real cure for excessive bias of either variety is to collect more data.
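The tabulation steps of Algorithm 1 can be sketched as follows for the bioassay model. This is an illustration only: the N2(0, 1.702² I2) prior stands in for the elicited prior, and the group sizes and the two middle log-dosages are assumptions (the standard version of this dataset), so the output will not reproduce the bias values reported above:

```python
import math
import random
from collections import Counter

X2 = [-0.86, -0.30, -0.05, 0.73]   # log-dosages (middle two assumed)
N = 5                              # animals per dosage (assumed)
SIGMA = 1.702                      # stand-in prior sd for (beta1, beta2)

def sample_counts(beta1, beta2, rng):
    # draw T = (t1, t2, t3, t4) from the product-of-binomials model
    counts = []
    for x2 in X2:
        p = 1.0 / (1.0 + math.exp(-(beta1 + beta2 * x2)))
        counts.append(sum(rng.random() < p for _ in range(N)))
    return tuple(counts)

def predictive_table(sims, rng, beta2_fixed=None):
    # steps (i)/(ii): tabulate m_T (beta2_fixed=None) or m_T(. | beta2 fixed)
    table = Counter()
    for _ in range(sims):
        b1 = rng.gauss(0.0, SIGMA)
        b2 = rng.gauss(0.0, SIGMA) if beta2_fixed is None else beta2_fixed
        table[sample_counts(b1, b2, rng)] += 1
    return {t: c / sims for t, c in table.items()}

rng = random.Random(7)
m = predictive_table(20000, rng)                     # step (i)
m0 = predictive_table(20000, rng, beta2_fixed=0.0)   # step (ii)
# step (iii): RB2(0 | t) = m0[t]/m[t]; bias against = M(RB2(0 | T) <= 1 | beta2 = 0)
bias_against = sum(p0 for t, p0 in m0.items() if p0 <= m.get(t, 0.0))
```

The bias in favor at ±δ follows by fixing beta2_fixed = ∓δ in step (ii) and summing the probabilities of the tables where the ratio is at least 1.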
In general problems, the approach to the computations used here will not be feasible and so alternative methods are required. In certain examples, some aspects of the computations can be done exactly, but, in general, approximations such as those discussed in [28] will be necessary.

Checking and Modifying a Prior
Consider now checking the prior derived in Section 4.1 for the bioassay example.

Example 8. Bioassay experiment (Example 2 continued).
The tail probability for checking the prior is given by

M_T(m_T(t1, t2, t3, t4) ≤ m_T(0, 1, 3, 5)).   (14)

As part of the algorithm discussed in Section 4.2, the values of m_T(t1, t2, t3, t4) have been estimated, and the proportion of sampled values of m_T(t1, t2, t3, t4) that satisfy the inequality gives the estimate of (14).
In this example, (14) equals 0.41 so there is no prior-data conflict.
If prior-data conflict exists, the methods discussed in [21] are available to obtain a more weakly informative prior. In this case, it is necessary to be careful as it has been shown in Section 4.1 that simply increasing the variance of the prior will not necessarily accomplish this. On the other hand, there is the satisfying result that the N 2 (0, 1.702 2 I 2 ) prior, where I 2 is the identity matrix, will avoid prior-data conflict, so modifying the elicited prior to be closer to this prior is the appropriate thing to do when a conflict exists.

Inferences
Now, consider estimation and hypothesis assessment for the bioassay example.

Example 9. Bioassay experiment (Example 2 continued).
Consider first the assessment of the hypothesis H 0 : β 2 = 0. From the algorithm, the quantity RB 2 (0 | (0, 1, 3, 5)) is available and this indicates whether there is evidence in favor of or against H 0 . In this case, RB 2 (0 | (0, 1, 3, 5)) = 0.021 so there is evidence against H 0 . A calculation described below gives the value 0.001 for the strength, so it seems there is strong evidence against H 0 .
To obtain the joint relative belief estimate of (β1, β2), it is necessary to maximize RB(β1, β2 | T) as a function of (β1, β2), which yields the same value as the MLE. The plausibility region for this estimate is then {(β1, β2) : RB(β1, β2 | T) > 1}, and the size and posterior content of this set provide a measure of the accuracy with which the coordinates of β can be simultaneously known. It is worth noting, however, that the i-th coordinate of this joint estimate is not necessarily the value that has the greatest evidence in its favor; rather, that value is obtained by maximizing RB_i(β_i | T) as a function of β_i, with plausibility region {β_i : RB_i(β_i | T) > 1}. Thus, the evidence approach dictates that β_i be estimated by maximizing RB_i(β_i | T). In problems where components of a multidimensional parameter are related by some constraint, it is clearly necessary to estimate the components simultaneously, but that is not the case here.
Implementing this for β1, the estimate β1(0, 1, 3, 5) = 0.11 was obtained, with plausibility region [−0.21, 0.49] having posterior content 0.35. Thus, the range of plausible values for β1 is not large, but there is not a high belief that the true value is in this interval. Figure 5 is a plot of RB1(· | (0, 1, 3, 5)). An interesting phenomenon occurs when considering the estimation of β2. In Figure 6, the left panel plots RB2(· | (0, 1, 3, 5)) over the effective support of the marginal prior π2 for β2. From this, it is clear that the relative belief estimate of β2 lies outside this range. Recall, however, that the chosen prior passed the check for prior-data conflict. That check only tells us that the observed data are consistent with at least some of the probabilities determined by where the prior places its mass. The right panel of Figure 6 is a plot of RB2(· | (0, 1, 3, 5)) over a much wider range. Note too that, as shown in [18], RB2(· | (0, 1, 3, 5)) possesses an important robustness property, as it is only weakly dependent on π2. In this case, π2 does not place mass where it appears it should, but there is not enough data to detect the conflict. The relative belief estimate of β2 is β2(0, 1, 3, 5) = 7.31, and the plausibility region for β2 is [1.14, 30.48] with posterior content 0.83. As such, there is a great deal of uncertainty concerning the true value of β2.

As long as it is possible to sample from the posterior of a 1-dimensional parameter, the computations necessary for inferences about that parameter are feasible. As such, the Gibbs sampling algorithm of [29] is particularly relevant, although it is not needed in Example 9. The harder computations are those involving the various prior predictives, but these do not need to be highly accurate, as even one decimal place will indicate whether there is bias or prior-data conflict.
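Sample-based relative belief inference for a scalar parameter can be sketched as follows: estimate RB(ψ) as the ratio of binned posterior to prior proportions on a grid, take the maximizing grid point as the relative belief estimate, and read off {ψ : RB(ψ) > 1} as the plausibility region. The example below uses a Beta-Bernoulli model (7 successes in 10 trials, uniform prior); it is an illustration, not the bioassay computation:

```python
import random

def beta_sample(a, b, rng):
    # Beta(a, b) draw via two gamma variates
    x = rng.gammavariate(a, 1.0)
    return x / (x + rng.gammavariate(b, 1.0))

def rb_on_grid(prior_draws, post_draws, bins=40):
    # RB estimated as (posterior bin proportion)/(prior bin proportion)
    grid = [(i + 0.5) / bins for i in range(bins)]
    def hist(draws):
        h = [0] * bins
        for d in draws:
            h[min(int(d * bins), bins - 1)] += 1
        return [c / len(draws) for c in h]
    hp, hq = hist(prior_draws), hist(post_draws)
    # bins with no observed prior mass are excluded from the region (RB set to 0)
    rb = [q / p if p > 0 else 0.0 for p, q in zip(hp, hq)]
    return grid, rb

rng = random.Random(3)
prior = [beta_sample(1, 1, rng) for _ in range(50000)]
post = [beta_sample(8, 4, rng) for _ in range(50000)]   # posterior Beta(8, 4)
grid, rb = rb_on_grid(prior, post)
est = grid[max(range(len(rb)), key=lambda i: rb[i])]
plausible = [g for g, r in zip(grid, rb) if r > 1.0]
```

The maximizer est lands near the posterior mode 0.7, and the plausibility region is an interval around it whose posterior content measures the accuracy of the estimate.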

Conclusions
Criteria for a satisfactory theory of statistical reasoning have been developed. Perhaps more should be required, but it seems that those stated are necessary. In particular, the separation of the choosing and checking of the ingredients from the inference step has been emphasized as a key aspect based upon maintaining a desirable relationship with logical reasoning. An approach to statistical reasoning that satisfies these criteria has been presented. Application to a well-known example has shown that this approach can resolve anomalies/paradoxes that arise via commonly used methodology. Many other such instances of resolving inferential difficulties as well as results establishing optimal performance have been documented in [1]. An application of the approach to the problem of binary-valued response regression has been carried out, and it has been shown to lead to a number of novel insights and, in particular, a new elicitation algorithm has been developed for this problem.