Abstract
Relative belief inferences are shown to arise as Bayes rules or limiting Bayes rules. These inferences are invariant under reparameterizations and possess a number of optimal properties. In particular, relative belief inferences are based on a direct measure of statistical evidence.
1. Introduction
We consider a sampling model for data $x$, given by a collection of densities $\{f_\theta : \theta \in \Theta\}$ with respect to a support measure $\mu$ on a sample space $\mathcal{X}$, and a proper prior, given by a density $\pi$ with respect to a support measure $\nu$ on $\Theta$. When the data $x$ are observed, these ingredients lead to the posterior distribution on $\Theta$ with density $\pi(\theta\,|\,x) = \pi(\theta) f_\theta(x)/m(x)$ with respect to $\nu$, where

\[ m(x) = \int_\Theta \pi(\theta) f_\theta(x)\, \nu(d\theta) \]

is the prior predictive density of the data. In addition, there is a quantity of interest $\psi = \Psi(\theta)$, where $\Psi : \Theta \to \Psi$ (we use $\Psi$ to denote both the map and its range), for which inferences, such as an estimate or a hypothesis assessment, are required. Let $\pi_\Psi$ denote the marginal prior density of $\psi$ and

\[ m(x\,|\,\psi) = \int_{\Psi^{-1}\{\psi\}} \pi(\theta\,|\,\psi) f_\theta(x)\, \nu_\psi(d\theta) \]

be the conditional prior predictive of the data after integrating out the nuisance parameters via the prior conditional distribution $\pi(\cdot\,|\,\psi)$ of $\theta$ given $\Psi(\theta) = \psi$. Bayesian inferences for $\psi$ are then based on the ingredients $(\{f_\theta\}, \pi, \Psi, x)$ alone or by adding a loss function $L$. Note that the probability measure associated with $m$ will be denoted by $M$, and the probability measure associated with $m(\cdot\,|\,\psi)$ will be denoted by $M(\cdot\,|\,\psi)$, when these are used in the paper.
A natural question arises, namely, how are we to determine the inferences for $\psi$, namely, an estimate $\psi(x)$, or assess the hypothesis $H_0 : \Psi(\theta) = \psi_0$, based upon these ingredients? Several approaches have been put forward to answer this question. Two broad categories can be described, namely, the evidential/inferential approach and the behavioristic/decision-theoretic approach.
The evidential approach can be characterized as having the goal of letting the evidence in the data x determine the inferences and can be subdivided into frequentist, pure likelihood, and Bayesian theories. Central to this is the need to somehow characterize the concept of statistical evidence. The frequentist theory only uses the ingredients $(\{f_\theta\}, x)$ together with the idea that inferences are to be graded based on their behavior in hypothetical repeated sampling experiments. Despite the impressive accomplishments of Alan Birnbaum in attempting to formulate a definition of statistical evidence (see [1]), it is fair to say that there is still no such generally acceptable definition within the frequentist context. The pure likelihood theory is also based on the ingredients $(\{f_\theta\}, x)$, but the idea of using repeated sampling characteristics to determine the inferences is dropped, and the likelihood function, as defined by $L(\theta\,|\,x) = c f_\theta(x)$ for any positive constant $c$, is taken to be the proper characterization of statistical evidence. All inferences are then determined by the likelihood; for example, see the discussion in [2]. Again, there are gaps in this treatment, as it is unclear when the likelihood function provides evidence in favor of or against a particular value of $\theta$ being the true value, and it is unclear how the likelihood is to be used for marginal parameters $\psi = \Psi(\theta)$. The Bayesian approach based on the ingredients $(\{f_\theta\}, \pi, x)$ is more successful at characterizing statistical evidence through the principle of evidence, which, loosely speaking, says that if the data lead to the posterior probability of an event being greater than (less than) the corresponding prior probability, then there is evidence in favor of (against) the event being true. A precise statement of the principle of evidence is provided in Section 2.3. A full theory of inference based on this idea, called relative belief, has been developed over a number of years (see [3]) and is sketched in Section 2.3.
A much fuller discussion of the issues and developments within the context of the evidential approach to developing statistical theory can be found in [4].
The decision-theoretic approach can also be divided into frequentist and Bayesian theories. The frequentist approach is based on the ingredients $(\{f_\theta\}, x)$ together with a loss function $L$, where $L(\theta, d) \ge 0$ for all $(\theta, d)$ and, generally, $L(\theta, d)$ represents the loss or penalty incurred when $d$ is chosen as the true value of $\Psi(\theta)$. The idea then is to look for a decision procedure $\delta : \mathcal{X} \to \Psi$ that performs well with respect to the average loss or risk, namely, $R(\theta, \delta) = E_\theta(L(\theta, \delta(x)))$; that is, choose a $\delta$ that makes $R(\theta, \delta)$ small uniformly in $\theta$. The frequentist decision theory, however, is not always successful in determining a suitable $\delta$. The Bayesian theory of decision considers the prior risk $r(\delta) = E_\pi(R(\theta, \delta))$ and is generally successful in determining a $\delta$ that minimizes $r(\delta)$; this is referred to as a Bayes rule. The Bayesian theory of decision has been axiomatized (see [5]); this provides considerable support for the approach.
If the various approaches to determining inferences all lead to more or less the same answers, then there would be little controversy, but unfortunately, this is not the case. The goal of this paper is to show that relative belief inferences can also be developed within the context of decision theory even though their primary motivation is through the characterization of statistical evidence. It is of historical relevance and interest that two of the founders of the statistical discipline, Fisher and Neyman, disagreed profoundly on the purpose of statistical analyses. Fisher saw the purpose of statistics as summarizing what the evidence in the observed data says about questions of scientific interest, while Neyman described the purpose in behavioristic or decision-theoretic terms, where the goal is to minimize average losses in repeated performances. Ref. [6] described this debate, which continues to be part of the statistical profession. The significance of the results of this paper is that it demonstrates that relative belief inference allows for a possible resolution of this conflict and, as we will now discuss, also resolves a general criticism of decision theory.
A natural requirement for statistical analysis is that all the ingredients chosen by the statistician need to be checkable against the observed data to ensure they align with the objective data (or, at least, are correctly collected). The choice of the model, prior, and loss functions are typically subjective decisions made by the analyst, and many consider such subjectivity to be at odds with the demands of science. However, both the model and the prior can be checked against the observed data to determine whether the choices made are contradicted by the data. Model checking has long been an acceptable, even necessary, part of statistical practice. In recent years, methods have been developed to check for a conflict between the prior and data. These methods determine if the prior placed the bulk of its mass in a region of the parameter space unsupported by the data as containing the true value of the parameter. Ref. [7] contained a discussion about checking for prior-data conflicts and also on what to do when a prior fails its checks. While such checking does not establish the objectivity of these elements, it at least allows the objective data to comment on the relevance of the choices made. However, it is unclear how one can check the loss function L using the data, and this ambiguity may be considered a flaw in decision theory, particularly for scientific applications.
There are, however, loss functions that are considered intrinsic and that avoid this criticism. For example, Ref. [8] proposed using an intrinsic loss function based on a measure of distance between sampling distributions. Ref. [9] proposed using the intrinsic loss function based on the Kullback–Leibler divergence between $f_\theta$ and $f_d$. When $\Psi(\theta) = \theta$, this intrinsic loss function is given by

\[ L(\theta, d) = \int_{\mathcal{X}} f_\theta(x) \log\frac{f_\theta(x)}{f_d(x)}\, \mu(dx). \]

For a marginal parameter $\psi = \Psi(\theta)$, the intrinsic loss function is $L(\theta, d) = \inf_{\vartheta \in \Psi^{-1}\{d\}} L(\theta, \vartheta)$. These loss functions are intrinsic because they are based on the sampling model, allowing their suitability to be verified through model checking. The loss functions used to derive relative belief inferences for $\psi$ are based upon the prior (see Section 3 for the definitions) and so are also intrinsic and checkable against the data while checking for prior-data conflict.
In some contexts, relative belief inferences are Bayes rules, but in a general context, they are seen to arise as the limits of Bayes rules. This approach has some historical antecedents. For example, in [10], it is shown that the MLE is asymptotically a Bayes rule, but this conclusion is drawn under a fixed loss function, with increasing amounts of data and a sequence of priors. In the context discussed here, however, the amount of data is fixed, as are the model and prior, but there is a sequence of loss functions, all based on a single fixed prior. The loss functions relevant for deriving relative belief inferences are similar to those used to justify maximum a posteriori (MAP) inferences. It can be demonstrated that, under certain conditions, MAP inferences emerge as the limits of Bayes rules through a sequence of loss functions,

\[ L_\eta(\theta, d) = 1 - I_{B_\eta(d)}(\Psi(\theta)), \quad (1) \]

where $B_\eta(d)$ is the ball of radius $\eta$ centered at $d$, $I_A$ denotes the indicator function for set $A$ and, in the continuous case, the support measure on $\Psi$ is volume measure (see [11]). MAP inferences are not invariant under reparameterizations, and such invariance can be considered a desirable property of any inference method. Relative belief inferences, however, are invariant under reparameterizations.
Section 2 is concerned with describing the general characteristics of three approaches to deriving Bayesian inferences. Section 3 and Section 4 show how relative belief estimation and prediction inferences can be seen to arise from decision theory, and Section 5 does this for credible regions and hypothesis assessment. In particular, it is shown here that relative belief estimators, as used in practice, are admissible. The contents of Section 3, Section 4 and Section 5 are original contributions by the authors that were derived some years ago but not published. Some of this discussion has appeared in Ref. [3] and is included here to provide a complete exposition of the relationship between relative belief and decision theory. All proofs of theorems and corollaries are in Appendix A, except for cases where $\Psi$ is finite, as these are quite straightforward and provide motivation for the more complicated contexts.
It should be emphasized that the authors do not consider the fact that relative belief inferences can be derived within the context of decision theory as the primary justification for the approach. Rather, the justification lies within the Bayesian context, which leads—via the principle of evidence—to a clear characterization of statistical evidence. The specific loss functions used, while appealing, are not essential to this characterization. The fact that relative belief inferences are consistent with two of the major themes of statistical research pursued over the years, in our view, provides substantial support for their appropriateness.
2. Bayesian Inference
Some approaches to deriving Bayesian inferences will now be described in detail.
2.1. Bayesian Decision Theory
An ingredient that is commonly added to $(\{f_\theta\}, \pi, \Psi, x)$ is a loss function, namely, $L : \Theta \times \Psi \to [0, \infty)$, satisfying $L(\theta, \psi) = 0$ whenever and only when $\Psi(\theta) = \psi$. The goal is to find a procedure, say $\delta : \mathcal{X} \to \Psi$, which in some sense minimizes the loss based on the joint distribution of $(\theta, x)$. Given the assumptions on $L$, the loss function can instead be thought of as a map $L : \Psi \times \Psi \to [0, \infty)$ with $L(\psi', \psi) = 0$ iff $\psi' = \psi$, and the ingredients can be represented as $(\{f_\theta\}, \pi, \Psi, L, x)$.

The goal of a decision analysis is then to find a decision function $\delta$ that minimizes the prior risk,

\[ r(\delta) = \int_{\mathcal{X}} r(\delta(x)\,|\,x)\, M(dx), \]

where $r(d\,|\,x) = E(L(\Psi(\theta), d)\,|\,x)$ is the posterior risk. Such a $\delta$ is called a Bayes rule and, clearly, a $\delta(x)$ that minimizes $r(d\,|\,x)$ for each $x$ is a Bayes rule. Further discussion of Bayesian decision theory can be found in [12].
As noted in [9], a decision formulation also leads to credible regions for $\psi$; namely, a $\gamma$-lowest posterior loss credible region is defined by

\[ C_\gamma(x) = \{\psi : r(\psi\,|\,x) \le l_\gamma(x)\}, \quad (2) \]

where $l_\gamma(x) = \inf\{l : \Pi_\Psi(\{\psi : r(\psi\,|\,x) \le l\}\,|\,x) \ge \gamma\}$. Note that $\psi$ in (2) is interpreted as the decision function that takes the value $\psi$ constantly in $x$. Clearly, as $\gamma \to 0$, the set $C_\gamma(x)$ converges to the value of a Bayes rule at $x$. For example, with quadratic loss, the Bayes rule is given by the posterior mean, and a $\gamma$-lowest posterior loss region is the smallest sphere centered at the mean containing (at least) $\gamma$ of the posterior probability.
2.2. MAP Inferences
The highest posterior density (HPD) or MAP-based approach to determining inferences constructs credible regions of the following form

\[ C_\gamma(x) = \{\psi : \pi_\Psi(\psi\,|\,x) \ge c_\gamma(x)\}, \quad (3) \]

where $\pi_\Psi(\cdot\,|\,x)$ is the marginal posterior density of $\psi$ with respect to a support measure on $\Psi$, and $c_\gamma(x)$ is chosen so that $\Pi_\Psi(C_\gamma(x)\,|\,x) \ge \gamma$. It follows from (3) that, to assess the hypothesis $H_0 : \Psi(\theta) = \psi_0$, we can use the tail probability given by $1 - \inf\{\gamma : \psi_0 \in C_\gamma(x)\}$. Furthermore, the class of sets $C_\gamma(x)$ is naturally "centered" at the posterior mode (when it exists uniquely), as $C_\gamma(x)$ converges to this point as $\gamma \to 0$. The use of the posterior mode as an estimator is commonly referred to as MAP estimation. We can then think of the size of the set $C_\gamma(x)$, for some choice of $\gamma$, as a measure of how accurate the MAP estimator is in a given context. Furthermore, when $\Psi$ is an open subset of Euclidean space, $C_\gamma(x)$ minimizes the volume among all $\gamma$-credible regions.
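As a small illustration of this construction in the discrete case (the posterior values below are invented for the example), an HPD credible set can be built by accumulating the most probable values until the desired content is reached:

```python
# A minimal sketch: build a highest-posterior-density (HPD) credible set for a
# discrete posterior by adding values in order of decreasing posterior
# probability until the target credibility gamma is met.
def hpd_set(post, gamma):
    # post: dict mapping value -> posterior probability; gamma: credibility level
    order = sorted(post, key=post.get, reverse=True)
    region, content = [], 0.0
    for v in order:
        region.append(v)
        content += post[v]
        if content >= gamma:
            break
    return set(region), content

post = {0: 0.05, 1: 0.20, 2: 0.40, 3: 0.25, 4: 0.10}
region, content = hpd_set(post, 0.80)
print(region, content)   # the region contains the posterior mode 2
```

Note how shrinking gamma collapses the region onto the posterior mode, which is the MAP estimate.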
It is well known that HPD inferences suffer from a defect. In particular, in the continuous case, MAP inferences are not invariant under reparameterization. For example, this means that, if $\psi(x)$ is the MAP estimate of $\psi$, then it is not necessarily true that $\Upsilon(\psi(x))$ is the MAP estimate of $\upsilon = \Upsilon(\psi)$ when $\Upsilon$ is a 1-1 smooth transformation. The non-invariance of a statistical procedure seems very unnatural, as it implies that the statistical analysis depends on the parameterization, and typically there does not seem to be a good reason for this. Note, too, that estimates based upon taking posterior expectations will also suffer from this lack of invariance. It is also the case that MAP inferences are not based on a direct characterization of statistical evidence. Both of these issues motivate the development of relative belief inferences.
One justification for MAP inference is decision-theoretic via the loss functions defined in (1). It is common, however, to also consider posterior probabilities of events as expressions of evidence and so think of this approach as evidential in nature. Posterior probabilities, however, express beliefs rather than evidence. For instance, the posterior probability of an event may be very small yet larger than its prior probability, indicating that the data have increased belief in the event’s occurrence. This would suggest that the data provide evidence in favor of the event being true, rather than evidence against it, even though the posterior probability remains small. It appears that evidence is better characterized by how the data change beliefs, rather than by the beliefs themselves.
2.3. Relative Belief Inferences
Relative belief inferences, like MAP inferences, are based on the ingredients $(\{f_\theta\}, \pi, \Psi, x)$. Note that underlying both approaches is the principle (axiom) of conditional probability, which says that initial beliefs about $\psi$, as expressed by the prior $\pi_\Psi$, must be replaced by conditional beliefs, as expressed by the posterior $\pi_\Psi(\cdot\,|\,x)$. In this approach, however, a measure of statistical evidence is used, given by the relative belief ratio,

\[ RB_\Psi(\psi\,|\,x) = \frac{\pi_\Psi(\psi\,|\,x)}{\pi_\Psi(\psi)}. \quad (4) \]

The relative belief ratio produces the following conclusions: if $RB_\Psi(\psi\,|\,x) > 1$, then there is evidence in favor of $\psi$ being the true value; if $RB_\Psi(\psi\,|\,x) < 1$, there is evidence against $\psi$ being the true value; and if $RB_\Psi(\psi\,|\,x) = 1$, then there is no evidence either way. These implications follow from a very simple principle of inference.

Principle of evidence: for probability model $(\Omega, \mathcal{F}, P)$, if $C \in \mathcal{F}$ is observed to be true, where $P(C) > 0$, then there is evidence in favor of $A \in \mathcal{F}$ being true if $P(A\,|\,C) > P(A)$, evidence against $A$ being true if $P(A\,|\,C) < P(A)$, and no evidence either way if $P(A\,|\,C) = P(A)$.
This principle seems obvious when $P$ is a discrete probability measure. For the continuous case, where $\Pi_\Psi(\{\psi\}) = 0$, let $N_\epsilon(\psi)$ be a sequence of neighborhoods of $\psi$ converging nicely to $\{\psi\}$ as $\epsilon \to 0$ (see [13]); then, under weak conditions, e.g., $\pi_\Psi$ is continuous and positive at $\psi$,

\[ \lim_{\epsilon \to 0} \frac{\Pi_\Psi(N_\epsilon(\psi)\,|\,x)}{\Pi_\Psi(N_\epsilon(\psi))} = RB_\Psi(\psi\,|\,x), \]

and this justifies the general interpretation of $RB_\Psi(\psi\,|\,x)$ as a measure of evidence. The relative belief ratio determines the inferences.
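To make the discrete reading of (4) concrete, here is a small numerical sketch (the beta-binomial setup and all numbers are our own illustration, not from the paper): the relative belief ratio is computed over a grid of cells, and cells where the posterior exceeds the prior receive a ratio greater than 1.

```python
import math

# Illustrative discretized computation of a relative belief ratio.
# Assumed setup: theta has a uniform prior on (0, 1), x | theta ~ Binomial(n, theta),
# and the parameter space is cut into K equal cells.
n, x, K = 20, 15, 50
cells = [(i + 0.5) / K for i in range(K)]            # cell midpoints
prior = [1.0 / K] * K                                 # prior cell probabilities
like = [math.comb(n, x) * t**x * (1 - t)**(n - x) for t in cells]
m = sum(p * f for p, f in zip(prior, like))           # prior predictive m(x)
post = [p * f / m for p, f in zip(prior, like)]       # posterior cell probabilities
rb = [po / pr for po, pr in zip(post, prior)]         # relative belief ratios

# The relative belief estimate maximizes rb; with a uniform prior it is the MLE.
i_best = max(range(K), key=lambda i: rb[i])
print(cells[i_best])    # 0.75, the MLE x/n
print(rb[i_best] > 1)   # True: evidence in favor at the estimate
```

Values of $\theta$ far from the estimate have ratios below 1, i.e., evidence against them.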
A natural estimate of $\psi$ is the relative belief estimate,

\[ \psi(x) = \arg \sup_{\psi} RB_\Psi(\psi\,|\,x), \]

as it has maximum evidence in its favor. To assess the accuracy of $\psi(x)$, consider the plausible region $Pl_\Psi(x) = \{\psi : RB_\Psi(\psi\,|\,x) > 1\}$, which is the set of values with evidence supporting them as the true value. The size of $Pl_\Psi(x)$, along with its posterior content $\Pi_\Psi(Pl_\Psi(x)\,|\,x)$, which measures the belief that the true value is in $Pl_\Psi(x)$, provides an assessment of accuracy. So, if $Pl_\Psi(x)$ is "small" and $\Pi_\Psi(Pl_\Psi(x)\,|\,x) \approx 1$, then $\psi(x)$ is to be considered an accurate estimate of $\psi$, but not otherwise. A relative belief $\gamma$-credible region is given by

\[ C_{\Psi,\gamma}(x) = \{\psi : RB_\Psi(\psi\,|\,x) \ge c_{\Psi,\gamma}(x)\}, \]

where $c_{\Psi,\gamma}(x) = \sup\{c : \Pi_\Psi(\{\psi : RB_\Psi(\psi\,|\,x) \ge c\}\,|\,x) \ge \gamma\}$. Such a region can also be quoted provided $\gamma \le \Pi_\Psi(Pl_\Psi(x)\,|\,x)$, so that $C_{\Psi,\gamma}(x) \subset Pl_\Psi(x)$. The containment is necessary, as otherwise, $C_{\Psi,\gamma}(x)$ would contain a value for which there is evidence against it being the true value.
To assess the hypothesis $H_0 : \Psi(\theta) = \psi_0$, the value $RB_\Psi(\psi_0\,|\,x)$ indicates whether there is evidence in favor of or against $H_0$. The strength of this evidence can be measured by the posterior probability $\Pi_\Psi(\{\psi_0\}\,|\,x)$, as this measures the belief in what the evidence says. So, if $RB_\Psi(\psi_0\,|\,x) > 1$ and $\Pi_\Psi(\{\psi_0\}\,|\,x)$ is large, then there is strong evidence that $H_0$ is true, while when $RB_\Psi(\psi_0\,|\,x) < 1$ and $\Pi_\Psi(\{\psi_0\}\,|\,x)$ is small, there is strong evidence that $H_0$ is false. Since $\Pi_\Psi(\{\psi_0\}\,|\,x)$ can be small, even 0 in the continuous case, it makes more sense to measure the strength of the evidence in such a case by

\[ Str_\Psi(\psi_0\,|\,x) = \Pi_\Psi(RB_\Psi(\psi\,|\,x) \le RB_\Psi(\psi_0\,|\,x)\,|\,x). \]

If $RB_\Psi(\psi_0\,|\,x) > 1$ and $Str_\Psi(\psi_0\,|\,x) \approx 1$, then the evidence is strong that $\psi_0$ is the true value, as there is little belief that the true value of $\psi$ has more evidence in its favor than $\psi_0$. If $RB_\Psi(\psi_0\,|\,x) < 1$ and $Str_\Psi(\psi_0\,|\,x) \approx 0$, then the evidence is strong that $\psi_0$ is not the true value, as there is widespread belief that the true value of $\psi$ has more evidence in its favor than $\psi_0$. There is no reason to quote a single number to measure the strength; both $\Pi_\Psi(\{\psi_0\}\,|\,x)$ and $Str_\Psi(\psi_0\,|\,x)$ can be quoted when relevant.
An important aspect of both $RB_\Psi(\psi_0\,|\,x)$ and $Str_\Psi(\psi_0\,|\,x)$ is what happens as the amount of data increases. To ensure that these behave appropriately, namely, converge to the appropriate limits when $H_0$ is false (or true), it is necessary to take into account the difference that matters, $\delta$. By this, we mean that there is a distance measure $d_\Psi$ on $\Psi$ such that, if $d_\Psi(\psi_1, \psi_2) \le \delta$, then, in terms of the application, these values are considered equivalent. Such a $\delta$ always exists because measurements are always taken to finite accuracy. For example, if $\psi$ is real-valued, then there is a grid of values separated by $\delta$, and inferences are determined using the relative belief ratios of the intervals $(\psi_0 - \delta/2, \psi_0 + \delta/2]$. In effect, $\Psi$ is now discrete. When the computations are carried out in this way, then $RB_\Psi(\psi_0\,|\,x)$ and $Str_\Psi(\psi_0\,|\,x)$ do what is required. As a particular instance of this, see the results in Section 4, where such discretization plays a key role.
It is easy to see that the class of relative belief credible regions for $\psi$ is independent of the marginal prior $\pi_\Psi$. When a value $\gamma$ is specified, however, the set $C_{\Psi,\gamma}(x)$ depends on $\pi_\Psi$ through $c_{\Psi,\gamma}(x)$. So, the form of relative belief inferences about $\psi$ is completely robust to the choice of $\pi_\Psi$, but the quantification of the uncertainty in the inferences is not. For example, when $\Psi(\theta) = \theta$, then $\psi(x)$ is the MLE; however, in general, $\psi(x)$ is the maximizer of the integrated likelihood $m(x\,|\,\psi)$. Similarly, relative belief regions are likelihood regions in the case of the full parameter and are integrated likelihood regions otherwise. As such, likelihood regions can be seen as essentially Bayesian in character, with a clear and precise characterization of evidence through the relative belief ratio, and they now have probability assignments through the posterior. A relative belief ratio, while proportional to an integrated likelihood, cannot be multiplied by an arbitrary positive constant, as a likelihood can, without losing its interpretation in measuring statistical evidence. It has been established in [14] that relative belief inferences for $\psi$ are optimally robust to the prior $\pi_\Psi$.
As can be seen from (4), relative belief inferences are always invariant under smooth reparameterizations; this is at least one reason why they are preferable to MAP inferences. Any rule for measuring evidence that satisfies the principle of evidence also produces valid estimates, as these lie in $Pl_\Psi(x)$ and so will have the same "accuracy" as $\psi(x)$. For example, if, instead of the relative belief ratio, the difference $\pi_\Psi(\psi\,|\,x) - \pi_\Psi(\psi)$ is used as the measure of evidence with a cut-off of 0, then this satisfies the principle of evidence, but the resulting estimate is no longer necessarily invariant under reparameterizations. The Bayes factor with a cut-off of 1 is also a valid measure of evidence, but there are a number of reasons why the relative belief ratio is to be preferred to the Bayes factor for general inferences (see [15]).
We will now consider a simple example that illustrates the various concepts just discussed.
Example 1.
Location normal
Suppose that we have a sample $x = (x_1, \ldots, x_n)$ from a $N(\theta, \sigma_0^2)$ distribution, where the mean $\theta$ is unknown and the variance $\sigma_0^2$ is assumed known. Suppose interest lies in making inferences about $\theta$, and the prior $\pi$ on $\theta$ is given by a $N(\theta_0, \tau_0^2)$ distribution. In this context, $\bar{x}$ serves as a minimal sufficient statistic, allowing the focus to be restricted to the model $\bar{x} \sim N(\theta, \sigma_0^2/n)$ while ignoring the remaining aspects of the data, at least for inference. Certainly, the residuals are relevant for model checking. The prior predictive density $m$ of $\bar{x}$ is then given by the density of a $N(\theta_0, \tau_0^2 + \sigma_0^2/n)$ distribution and, as discussed in [7], this is relevant for checking for prior-data conflict via the tail probability $M(m(\bar{X}) \le m(\bar{x}))$, with small values indicating the existence of a conflict.
The posterior of $\theta$ is given by

\[ \theta\,|\,x \sim N(\mu(\bar{x}), \sigma_{post}^2), \quad \mu(\bar{x}) = \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{-1}\left(\frac{\theta_0}{\tau_0^2} + \frac{n\bar{x}}{\sigma_0^2}\right), \quad \sigma_{post}^2 = \left(\frac{1}{\tau_0^2} + \frac{n}{\sigma_0^2}\right)^{-1}. \]

If, as is common, squared error loss is employed, then the Bayes rule for estimating $\theta$ is given by $\mu(\bar{x})$, as this is also the posterior mean. On the other hand, the relative belief ratio is given by

\[ RB(\theta\,|\,x) = \frac{\pi(\theta\,|\,x)}{\pi(\theta)} = \frac{f_\theta(\bar{x})}{m(\bar{x})}, \]

as, since there are no nuisance parameters, $\pi(\theta\,|\,x)/\pi(\theta)$ equals the sampling density of $\bar{x}$ at $\theta$ divided by the prior predictive density $m(\bar{x})$.
From this, it is immediate that $\theta(x) = \bar{x}$, which is the MLE, a result that is generally true for relative belief when estimating the full model parameter. The plausible interval for $\theta$ is then, putting $s^2 = \sigma_0^2/n$ and $v^2 = \tau_0^2 + \sigma_0^2/n$, given by

\[ Pl(x) = \left( \bar{x} - s\left\{\log\frac{v^2}{s^2} + \frac{(\bar{x} - \theta_0)^2}{v^2}\right\}^{1/2},\ \bar{x} + s\left\{\log\frac{v^2}{s^2} + \frac{(\bar{x} - \theta_0)^2}{v^2}\right\}^{1/2} \right), \]

and note that $\log(v^2/s^2) > 0$, so this interval is always defined. The length of $Pl(x)$ and its posterior content, computed using the posterior of $\theta$, provide a measure of the accuracy of $\theta(x) = \bar{x}$. Notice that $Pl(x)$ converges almost surely to the degenerate interval consisting of the true value of $\theta$, and the posterior content of the interval converges to 1, as $n \to \infty$.
To assess a hypothesis, say $H_0 : \theta = \theta^*$, the relevant relative belief ratio is as follows:

\[ RB(\theta^*\,|\,x) = \frac{v}{s}\exp\left\{\frac{(\bar{x} - \theta_0)^2}{2v^2} - \frac{(\bar{x} - \theta^*)^2}{2s^2}\right\}. \]

This gives evidence in favor (or against) when

\[ \frac{(\bar{x} - \theta^*)^2}{s^2} < (>)\ \log\frac{v^2}{s^2} + \frac{(\bar{x} - \theta_0)^2}{v^2}, \]

and note that the right-hand side is always positive. The strength of this evidence is given by

\[ Str(\theta^*\,|\,x) = 1 - \Phi\left(\frac{\bar{x} + |\bar{x} - \theta^*| - \mu(\bar{x})}{\sigma_{post}}\right) + \Phi\left(\frac{\bar{x} - |\bar{x} - \theta^*| - \mu(\bar{x})}{\sigma_{post}}\right), \]

where $\Phi$ denotes the $N(0,1)$ cdf, since $RB(\theta\,|\,x) \le RB(\theta^*\,|\,x)$ iff $|\theta - \bar{x}| \ge |\theta^* - \bar{x}|$. As $n \to \infty$, the strength converges to 0 (the strongest possible evidence against) when $H_0$ is false and converges to 1 (the strongest possible evidence in favor) when $H_0$ is true.
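The closed forms in this example are easy to check numerically. A sketch, with assumed illustrative numbers for $\sigma_0$, $n$, $\theta_0$, $\tau_0$, and $\bar{x}$ (none are from the paper):

```python
import math

def norm_pdf(z, mean, var):
    return math.exp(-(z - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Assumed numbers: x_i ~ N(theta, 1) with n = 10, prior theta ~ N(0, 4).
sigma0, n, theta0, tau0 = 1.0, 10, 0.0, 2.0
xbar = 1.3
s2 = sigma0 ** 2 / n     # sampling variance of xbar
v2 = tau0 ** 2 + s2      # prior predictive variance of xbar

def rb(theta):
    # RB(theta | x) = f_theta(xbar) / m(xbar): no nuisance parameters here
    return norm_pdf(xbar, theta, s2) / norm_pdf(xbar, theta0, v2)

# The relative belief estimate is the MLE xbar.
grid = [i / 1000 for i in range(-3000, 3001)]
est = max(grid, key=rb)
print(est)   # 1.3

# Plausible interval: xbar +/- s * sqrt(log(v2/s2) + (xbar - theta0)^2 / v2)
half = math.sqrt(s2) * math.sqrt(math.log(v2 / s2) + (xbar - theta0) ** 2 / v2)
print(xbar - half, xbar + half)
```

At either endpoint of the interval, the relative belief ratio equals 1, which can be verified directly from the output of `rb`.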
3. Estimation: Discrete Parameter Space
The following theorem presents the basic definition of the loss function for the parameter of interest $\psi = \Psi(\theta)$ when the set of possible values of $\psi$, namely, $\Psi$, is finite. It establishes an important optimality result. The indicator function for the set $A$ is denoted as $I_A$.
Theorem 1.
Suppose that $\pi_\Psi(\psi) > 0$ for every $\psi$, where $\Psi$ is finite with support measure equal to counting measure on $\Psi$. Then, for the loss function

\[ L_{RB}(\theta, \psi) = \frac{I(\Psi(\theta) \ne \psi)}{\pi_\Psi(\Psi(\theta))}, \quad (5) \]

the relative belief estimator $\psi(x)$ is a Bayes rule.
Proof.
We have that

\[ r(\psi'\,|\,x) = \sum_{\psi \ne \psi'} \frac{\pi_\Psi(\psi\,|\,x)}{\pi_\Psi(\psi)} = \sum_{\psi \in \Psi} RB_\Psi(\psi\,|\,x) - RB_\Psi(\psi'\,|\,x). \quad (6) \]

Since $\Psi$ is finite, the first term in (6) is finite, and a Bayes rule at $x$ is given by the value $\psi'$ that maximizes the second term. Therefore, $\psi(x)$ is a Bayes rule. □
The loss function $L_{RB}$ seems very natural. Beliefs about the true value of $\psi$ are expressed by the prior $\pi_\Psi$. As such, consider values $\psi$ where $\pi_\Psi(\psi)$ is very low and $\psi$ is indeed false. It would then be misleading if inferences suggested such a value as being true, so it is appropriate for such values to bear large losses. In a sense, the statistician is acknowledging what such values are by the choice of the prior. Of course, the prior may be wrong in the sense that the bulk of its mass is placed in a region where the true value of $\psi$ does not lie. This is why checking for prior-data conflict, before conducting inference, is always recommended. Procedures for checking priors were discussed in [16,17], and an approach to replacing priors found to be at fault was developed in [7]. The loss function $L_{RB}$ motivates the other losses for relative belief discussed here, making this comment relevant to those losses as well.
The prior risk of $\psi(\cdot)$ satisfies the following,

\[ r(\psi(\cdot)) = \sum_{\psi \in \Psi} \pi_\Psi(\psi) \frac{M(\psi(X) \ne \psi\,|\,\psi)}{\pi_\Psi(\psi)} = \sum_{\psi \in \Psi} M(\psi(X) \ne \psi\,|\,\psi), \quad (7) \]

where $M(\cdot\,|\,\psi)$ is the conditional prior predictive probability measure of the data given $\Psi(\theta) = \psi$, so (7) is the sum of the conditional prior error probabilities over all values of $\psi$. If instead the loss function is taken to be $L_{MAP}(\theta, \psi) = I(\Psi(\theta) \ne \psi)$, as in (1), then the same proof used in Theorem 1 establishes that the MAP estimator $\psi_{MAP}(x)$ is a Bayes rule with respect to this loss, and the prior risk is given by,

\[ r(\psi_{MAP}(\cdot)) = \sum_{\psi \in \Psi} \pi_\Psi(\psi) M(\psi_{MAP}(X) \ne \psi\,|\,\psi), \quad (8) \]

which represents the prior probability of making an error. Both $L_{RB}$ and $L_{MAP}$ are two-valued loss functions but, when an incorrect decision is made, the loss is constant in $\theta$ for $L_{MAP}$, while it equals the reciprocal of the prior probability of $\Psi(\theta)$ for $L_{RB}$. So, $L_{RB}$ penalizes an incorrect decision much more severely when the true value of $\psi$ is in the tails of the prior. Note that $\psi(x) = \psi_{MAP}(x)$ when $\pi_\Psi$ is uniform. It is evident that (7) serves as an upper bound to (8), indicating that controlling losses based on $L_{RB}$ automatically controls the losses based on $L_{MAP}$.
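The two prior risks (7) and (8) can be computed directly in a toy finite problem (all numbers below are invented for illustration; they are not from the paper):

```python
# Toy finite problem: compare (7), the sum of conditional error probabilities
# for the relative belief rule, with (8), the prior probability of error for
# the MAP rule.
prior = {0: 0.7, 1: 0.2, 2: 0.1}
# f[psi][x]: sampling probability of data x in {0, 1} under psi
f = {0: [0.6, 0.4], 1: [0.3, 0.7], 2: [0.1, 0.9]}

def post(x):
    m = sum(prior[p] * f[p][x] for p in prior)
    return {p: prior[p] * f[p][x] / m for p in prior}

def rb_rule(x):   # maximizes RB = posterior/prior, i.e., maximizes f[psi][x]
    return max(prior, key=lambda p: post(x)[p] / prior[p])

def map_rule(x):  # maximizes the posterior
    return max(prior, key=lambda p: post(x)[p])

# risk (7): sum over psi of conditional error probabilities M(rule != psi | psi)
risk7 = sum(sum(f[p][x] for x in (0, 1) if rb_rule(x) != p) for p in prior)
# risk (8): prior probability of error for the MAP rule
risk8 = sum(prior[p] * sum(f[p][x] for x in (0, 1) if map_rule(x) != p)
            for p in prior)
print(risk7, risk8)   # (8) is bounded above by (7)
```

Here the MAP rule never selects the low-prior values 1 and 2, while the relative belief rule does when the data support them, which is the behavior discussed above.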
As already noted, $RB_\Psi(\cdot\,|\,x)$ is proportional to the integrated likelihood of $\psi$. So, under the conditions of Theorem 1, the maximum integrated likelihood estimator is a Bayes rule. Furthermore, the Bayes rule is the same for every choice of $\pi_\Psi$ and depends on the full prior $\pi$ only through the conditional prior placed on the nuisance parameters. When $\Psi(\theta) = \theta$, then $\psi(x)$ is the MLE of $\theta$, and so the MLE of $\theta$ is a Bayes rule for every prior $\pi$.

Note that, when $\Psi = \{0, 1\}$ with $\psi = \Psi(\theta) = I_{H_0}(\theta)$ for $H_0 \subset \Theta$, then $\psi(x) = 1$ iff $RB_\Psi(1\,|\,x) > RB_\Psi(0\,|\,x)$, so $\psi(x) = 1$ when $RB_\Psi(1\,|\,x) > 1$, and $\psi(x) = 0$ otherwise. This is the classical context for hypothesis testing, where $\psi(x) = 1$ can be viewed as acceptance of the hypothesis $H_0$, and $\psi(x) = 0$ as rejection of $H_0$. Theorem 1 establishes that relative belief offers a Bayes rule for the hypothesis testing problem.
The loss function (5) does not provide meaningful results when $\Psi$ is infinite, as (7) shows that the prior risk will be infinite. So, we modify (5) via a parameter $\eta > 0$ and define the loss function as follows,

\[ L_{RB,\eta}(\theta, \psi) = \frac{I(\Psi(\theta) \ne \psi)}{\max(\pi_\Psi(\Psi(\theta)), \eta)}. \quad (9) \]

Note that $L_{RB,\eta}$ is bounded by $1/\eta$. This loss function is like (5) but does not allow for arbitrarily large losses. The following result shows that we can restrict attention to values of $\eta$ that are sufficiently small.
Theorem 2.
Suppose that $\pi_\Psi(\psi) > 0$ for every $\psi \in \Psi$, where $\Psi$ is countable with support measure equal to counting measure, and that $\psi(x)$ is the unique maximizer of $RB_\Psi(\psi\,|\,x)$ for all $x$. For the loss function (9) and Bayes rule $\psi_\eta(x)$, then $\psi_\eta(x) \to \psi(x)$ as $\eta \to 0$ for every $x$.
The proof of Theorem 2 also establishes the following result.
Corollary 1.
For all sufficiently small $\eta$, the value of a Bayes rule at $x$ is given by $\psi(x)$.
The following is an immediate consequence of Theorem 1 and Corollary 1, as $\psi(x)$ is a Bayes rule.
Corollary 2.
$\psi(x)$ is an admissible estimator with respect to the loss function $L_{RB}$ when $\Psi$ is finite, and with respect to the loss $L_{RB,\eta}$ when $\eta$ is sufficiently small and $\Psi$ is countable.
In a general estimation problem, $\delta$ is risk-unbiased with respect to a loss function $L$ if $E_\theta(L(\Psi(\theta'), \delta(x))) \ge E_\theta(L(\Psi(\theta), \delta(x)))$ for all $\theta, \theta'$. This says that, on average, $\delta(x)$ is closer to the true value than to any other value when we interpret $L(\psi, d)$ as a measure of distance between $\psi$ and $d$. A definition of Bayesian unbiasedness for $\delta$ with respect to $L$ is given by the inequality,

\[ \int_\Theta \int_\Theta E_\theta(L(\Psi(\theta'), \delta(x)))\, \pi(d\theta')\, \pi(d\theta) \ge \int_\Theta E_\theta(L(\Psi(\theta), \delta(x)))\, \pi(d\theta), \]

as this retains the idea of being closer, on average, to the true value than to a false value. We will now consider a family of loss functions defined as follows,

\[ L_h(\theta, \psi) = h(\Psi(\theta)) I(\Psi(\theta) \ne \psi), \quad (10) \]

where $h$ is a nonnegative function satisfying $\sum_{\psi} \pi_\Psi(\psi) h(\psi) < \infty$. This includes $L_{RB}$ and $L_{MAP}$ when $\Psi$ is finite, and $L_{RB,\eta}$.
Theorem 3.
If $\Psi$ is finite or countable, then $\psi(x)$ is Bayesian-unbiased under the loss function (10).
Suppose that, after observing $x$, there is a need to predict a future (or concealed) value $y \sim g_\theta$, where $g_\theta$ is a density with respect to a support measure on the sample space for $y$, and it is assumed that the true value of $\theta$ in the model for $x$ gives the true value of $\theta$ in the model for $y$. The prior predictive density of $y$ is given by $q(y) = \int_\Theta \pi(\theta) g_\theta(y)\, \nu(d\theta)$, while the posterior predictive density is $q(y\,|\,x) = \int_\Theta \pi(\theta\,|\,x) g_\theta(y)\, \nu(d\theta)$. The relative belief ratio for a future value $y$ is, thus, $RB(y\,|\,x) = q(y\,|\,x)/q(y)$, and the relative belief prediction, namely, the value that maximizes $RB(y\,|\,x)$, is denoted as $y(x)$. When the set of possible values of $y$ is finite, using the same argument as in Theorem 1, $y(x)$ is a Bayes rule under the loss function $L(y, d) = I(d \ne y)/q(y)$. Also, it can be demonstrated that $y(x)$ is a limit of Bayes rules when the set of possible values of $y$ is countable.
We will now consider a common application where $\Psi$ is finite.
Example 2.
Classification
For a classification problem, there are $k$ categories prescribed by a function $\Psi : \Theta \to \{1, \ldots, k\}$, where $\Psi(\theta) = i$ indicates that $f_\theta$ belongs to the $i$-th class. Estimating $\psi$ is then equivalent to classifying the data as having come from one of the distributions in the classes specified by $\Psi$. The standard Bayesian solution to this problem is to use $\psi_{MAP}(x)$ as the classifier. From (8), we have that $\psi_{MAP}(x)$ minimizes the prior probability of misclassification, while from (7), $\psi(x)$ minimizes the sum of the conditional probabilities of misclassification. The essence of the difference is that $\psi(x)$ treats the misclassification errors equally, while $\psi_{MAP}(x)$ weights the errors by their prior probabilities.
The following shows that minimizing the sum of the conditional error probabilities is often more appropriate than minimizing the weighted sum. Suppose that $\Theta = \{0, 1\}$ and $x\,|\,\theta \sim$ Bernoulli$(p_0)$ or Bernoulli$(p_1)$, with $1 - \epsilon$ and $\epsilon$ representing the known proportions of individuals coming from population 0 or 1. For example, consider $p_0$ as the probability of a positive diagnostic test for a disease in the non-diseased population, while $p_1$ is this probability for the diseased population. Suppose that $p_0$ is very small and $p_1$ is large, indicating that the test is successful at identifying the disease while not yielding many false positives, and that $\epsilon$ is very small, so the disease is rare. The challenge then becomes assigning a randomly chosen individual to a population based on their test results.
The posterior is given by $\pi(1\,|\,x) = \epsilon p_1^x (1 - p_1)^{1-x}/m(x)$ and $\pi(0\,|\,x) = (1 - \epsilon) p_0^x (1 - p_0)^{1-x}/m(x)$, where $m(x) = \epsilon p_1^x (1 - p_1)^{1-x} + (1 - \epsilon) p_0^x (1 - p_0)^{1-x}$. Therefore,

\[ RB(1\,|\,x) = \frac{p_1^x (1 - p_1)^{1-x}}{m(x)}, \qquad RB(0\,|\,x) = \frac{p_0^x (1 - p_0)^{1-x}}{m(x)}. \]

This implies that $\psi_{MAP}(x)$ will always classify a person to the non-diseased population when $\epsilon$ is small enough, e.g., when $x = 1$ and $\epsilon p_1 < (1 - \epsilon) p_0$. In contrast, in this situation, $\psi(x)$ always classifies an individual with a positive test to the diseased population, since $p_1 > p_0$, and to the non-diseased population for a negative test. Since $M(\cdot\,|\,\theta = 1)$ is the Bernoulli$(p_1)$ distribution, when $p_0$ and $\epsilon$ are small enough, we have the following:

\[ M(\psi_{MAP}(X) \ne 1\,|\,\theta = 1) = 1, \qquad M(\psi(X) \ne 1\,|\,\theta = 1) = 1 - p_1. \]
This clearly illustrates the difference between these two procedures, as $\psi(x)$ does better than $\psi_{MAP}(x)$ on the diseased population when $p_0$ is small and $p_1$ is large, as would be the case for a good diagnostic. Of course, $\psi_{MAP}(x)$ minimizes the overall error rate, but at the price of ignoring the most important class in this problem, namely, those who have the disease. Note that this example can be extended to the situation where we need to estimate the $p_i$ based on samples from the respective populations, but this will not materially affect the overall conclusions.
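The comparison above can be sketched numerically. The diagnostic-test numbers here are assumed for illustration: $p_0$ is the false-positive rate, $p_1$ the detection rate, and $\epsilon$ the prevalence.

```python
# Assumed illustrative numbers for a rare-disease diagnostic problem.
p0, p1, eps = 0.05, 0.90, 0.001

def lik(x):
    # sampling probabilities of test result x under theta = 0 (healthy), 1 (diseased)
    return {0: p0 if x == 1 else 1 - p0, 1: p1 if x == 1 else 1 - p1}

def map_classify(x):
    # maximize the posterior, proportional to prior * likelihood
    return max((0, 1), key=lambda c: (eps if c == 1 else 1 - eps) * lik(x)[c])

def rb_classify(x):
    # maximize RB = posterior/prior, which reduces to maximizing the likelihood
    return max((0, 1), key=lambda c: lik(x)[c])

# conditional probabilities of misclassifying a diseased individual
err_map = sum(p for x, p in [(1, p1), (0, 1 - p1)] if map_classify(x) != 1)
err_rb = sum(p for x, p in [(1, p1), (0, 1 - p1)] if rb_classify(x) != 1)
print(err_map, err_rb)
```

With these numbers the MAP classifier assigns every individual, even one with a positive test, to the non-diseased population, so its conditional error probability on the diseased group is 1, while the relative belief classifier's is $1 - p_1$.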
We will now consider a situation where the class indicator $c$ is such that $c \sim$ Bernoulli$(\epsilon)$ and $x\,|\,c \sim f_c$, where $f_0$ and $f_1$ are known densities but $\epsilon$ is unknown with a prior. This is a generalization of the previous discussion, where $\epsilon$ is assumed to be known. Then, based on a sample $(c_1, x_1), \ldots, (c_n, x_n)$ from the joint distribution, the goal is to predict the value $c_{n+1}$ for a newly observed $x_{n+1}$.

The prior of $c$ is Bernoulli$(E(\epsilon))$ and, if $\epsilon \sim$ beta$(\alpha, \beta)$, the prior predictive of $c$ is Bernoulli$(\alpha/(\alpha + \beta))$. The posterior predictive density of $c_{n+1}$ equals the Bernoulli probability of $c_{n+1}$, weighted by $f_{c_{n+1}}(x_{n+1})$ and averaged against the beta$(\alpha + n_1, \beta + n - n_1)$ posterior of $\epsilon$, where $n_1 = c_1 + \cdots + c_n$.

It follows that, suppressing the dependence on the data, we have the following,

\[ c_{MAP} = 1 \iff (\alpha + n_1) f_1(x_{n+1}) > (\beta + n - n_1) f_0(x_{n+1}), \]
\[ c_{RB} = 1 \iff \beta(\alpha + n_1) f_1(x_{n+1}) > \alpha(\beta + n - n_1) f_0(x_{n+1}). \quad (11) \]

Note that $c_{MAP}$ and $c_{RB}$ are identical whenever $\alpha = \beta$.

From these formulas, it is apparent that a substantial difference will arise between $c_{MAP}$ and $c_{RB}$ when either $\alpha$ or $\beta$ is much bigger than the other. As in Example 2, these correspond to situations where we believe that $\epsilon$ or $1 - \epsilon$ is very small. Suppose we take $\alpha = 1$ and let $\beta$ be relatively large, as this corresponds to knowing a priori that $\epsilon$ is very small. Then, (11) implies that $c_{RB} = 1$ whenever $\beta(1 + n_1) f_1(x_{n+1}) > (\beta + n - n_1) f_0(x_{n+1})$, a far weaker requirement than that for $c_{MAP} = 1$, and so $c_{MAP} = 0$ in many cases where $c_{RB} = 1$. A similar conclusion arises when we take $\beta = 1$ and let $\alpha$ be relatively large.
To see what kind of improvement is possible, we consider a simulation study. Let $f_0$ be a $N(0, 1)$ density, let $f_1$ be a $N(\mu, 1)$ density, and let the prior on $\epsilon$ be beta$(\alpha, \beta)$ with $\alpha = 1$. Table 1 presents the conditional prior probabilities of misclassification for $c_{MAP}$ and $c_{RB}$ for various choices of $\beta$. When $\beta = 1$, they are equivalent, but we see that, as $\beta$ rises, the performance of $c_{MAP}$ deteriorates while that of $c_{RB}$ improves. Large values of $\beta$ correspond to having prior information that $\epsilon$ is small, with the bulk of the prior probability for $\epsilon$ concentrating ever closer to 0 as $\beta$ increases. We see that the misclassification rates for the small group stay about the same for $c_{RB}$ as $\beta$ increases, while they deteriorate markedly for $c_{MAP}$, as the MAP procedure basically ignores the small group.
Table 1.
Conditional prior probabilities of misclassification for $c_{MAP}$ and $c_{RB}$ for various values of $\beta$ in Example 2 when $\alpha = 1$ and $n = 10$.
We also investigated other choices for $n$ and $\mu$. There is very little change as $n$ increases. When $\mu$ moves toward 0, the error rates go up, and when $\mu$ moves away from 0, they go down, as one would expect; $c_{RB}$ always dominates $c_{MAP}$.
4. Estimation: Continuous Parameter Space
When $\psi$ has a continuous prior distribution, the argument in Theorem 2 does not work, as $\Pi_\Psi(\{\psi\}) = 0$ for every $\psi$. There are several possible ways to proceed, but one approach is to use a discretization of the problem that uses Theorem 2. For this, we will assume that the spaces involved are locally Euclidean, the mappings are sufficiently smooth, and the support measures are the analogs of Euclidean volume on the respective spaces. While the argument presented is broadly applicable, it has been simplified in this context by assuming that all spaces are open subsets of Euclidean spaces, with the support measures being the Euclidean volume on these sets.
For each , suppose there is a discretization of into a countable number of subsets with the following properties: and diam as So, if then For example, could be equal volume rectangles in Further, we assume that as for every This will hold whenever is continuous everywhere and converges nicely to as Let denote a point in such that whenever and put So, is a discretized version of We will call this a regular discretization of The discretized prior on is and the discretized posterior is
The loss function for the discretized problem is defined in Theorem 2 as follows,
and let denote a Bayes rule for this problem.
Theorem 4.
Suppose that is positive and continuous and we have a regular discretization of Furthermore, suppose that is the unique maximizer of and for any ,
Then, there exists as such that a Bayes rule under the loss converges to as for all
Theorem 4 states that is a limit of Bayes rules. So, when , the MLE is a limit of Bayes rules and, more generally, the MLE from an integrated likelihood is a limit of Bayes rules. The regularity conditions stated in Theorem 4 hold in many common statistical problems.
Now let be the relative belief estimate from the discretized problem, i.e., maximizes as a function of The following is immediate from the proof of Theorem 4, Theorem 3, and Corollary 2:
Corollary 3.
is admissible and Bayesian-unbiased for the discretized problem, and as for every
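The discretization argument behind Theorem 4 and Corollary 3 can be illustrated with a small numeric sketch. The ingredients are hypothetical, chosen only so the continuous answer is available in closed form: with a N(0, 1) prior and a N(1, 0.25) posterior for the parameter, the posterior-to-prior density ratio is maximized at 4/3, and the discretized relative belief estimate, computed from cell probabilities, approaches that value as the cell width shrinks:

```python
import math

def Phi(z):
    # standard normal CDF
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def rb_discretized(delta, lo=-5.0, hi=5.0):
    """Discretized relative belief estimate with cell width delta.
    Hypothetical ingredients: prior N(0,1), posterior N(1, 0.5**2)."""
    best_mid, best_rb = None, -1.0
    m = round((hi - lo) / delta)
    for k in range(m):
        a = lo + k * delta
        b = a + delta
        pr = Phi(b) - Phi(a)                           # prior cell probability
        po = Phi((b - 1.0) / 0.5) - Phi((a - 1.0) / 0.5)  # posterior cell probability
        if pr > 0 and po / pr > best_rb:
            best_rb = po / pr
            best_mid = (a + b) / 2
    return best_mid

# Continuous relative belief estimate (density-ratio maximizer) is 4/3 here;
# the discretized estimates converge to it as delta -> 0.
estimates = [rb_discretized(d) for d in (0.5, 0.1, 0.01)]
```

Each estimate is the midpoint of the cell maximizing the ratio of posterior to prior cell probabilities; refining the grid drives it toward the continuous maximizer, as the theorem asserts.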
By similar arguments, an analog of Theorem 4 for can be established. In this case, a simpler development can be followed in certain situations by using the loss function . For this, the posterior risk of in the discretized problem, is given by for some Now suppose is a cube centered at of edge length Suppose that for each , there exists , such that, when , then
Since is constant, a Bayes rule must then satisfy . This proves that is a limit of the Bayes rules. By contrast, for the loss
the posterior risk of is given by,
and the first term is generally unbounded unless is compact.
We will now consider an important example.
Example 3.
Regression
Suppose that , where is fixed of rank and To simplify the discussion, we will assume that is known but this is not necessary. Let π be a prior density for For every having observed then , the MLE of
It is interesting to contrast this result with more standard Bayesian estimates such as MAP or the posterior mean. For example, suppose that Then the posterior distribution of is , where
and note that Writing the spectral decomposition of as , we have that
Since and for each this implies that shrinks the MLE toward the prior mean of When the columns of X are orthonormal, then , where , and so the shrinkage is substantial unless is much larger than This shrinkage is often cited as a positive attribute of these estimates. Consider, however, the situation where the true value of β lies some distance from the prior mean. In that case, moving the estimate toward the prior mean seems wrong, so it is not clear that shrinking the MLE is necessarily beneficial, particularly as it requires giving up invariance.
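The shrinkage can be checked numerically. The following sketch assumes a hypothetical conjugate setup (known error variance, prior β ~ N(0, τ²I), and an orthonormal design); none of the numerical values come from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 50, 3
sigma2, tau2 = 1.0, 4.0        # assumed known error variance and prior variance

# Orthonormal design columns via QR, so X'X = I and the MLE is simply X'y.
X, _ = np.linalg.qr(rng.standard_normal((n, k)))
beta_true = np.array([3.0, -2.0, 1.5])
y = X @ beta_true + rng.normal(0.0, np.sqrt(sigma2), n)

b_mle = X.T @ y
# Conjugate posterior mean: (X'X/sigma2 + I/tau2)^{-1} X'y / sigma2 ...
post_mean = np.linalg.solve(X.T @ X / sigma2 + np.eye(k) / tau2,
                            X.T @ y / sigma2)
# ... which reduces to scalar shrinkage of the MLE when X'X = I.
shrunk = (tau2 / (tau2 + sigma2)) * b_mle
```

With σ² = 1 and τ² = 4, the posterior mean is 0.8 times the MLE componentwise, whereas the relative belief estimate, being the MLE here, is not shrunk toward the prior mean.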
Suppose that estimating the mean response at w for the predictors is required. The prior distribution of ψ is and the posterior distribution is Note the following relationships,
since for each Therefore, maximizing the ratio of the posterior to prior densities leads to
Then implies Note that when is much smaller than , in other words, when the posterior is much more concentrated than the prior, and are very similar. In general, is not equal to the plug-in MLE of , although it is the MLE from the integrated likelihood. Moreover, as , and when X has orthonormal columns,
Suppose predicting a response z at the predictor value is required. When , the prior distribution of z is and the posterior distribution is , where we have that
To obtain , it is necessary to maximize the ratio of the posterior to the prior densities of z; this leads to
Note that ; thus, , and is further from the prior mean than Also, when is small, then and are very similar. Finally, comparing (13) and (14), we have that
and so at w is more dispersed than the estimate of the mean at w; this makes good sense as we have to take into account the additional variation due to prediction. By contrast,
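The extra dispersion in prediction can be seen directly from the variance decomposition: the posterior predictive variance of a new response at w is the posterior variance of the mean response plus the error variance. A minimal sketch, with a hypothetical 2-dimensional posterior covariance for β (all values illustrative):

```python
import numpy as np

sigma2 = 1.0                                # assumed known error variance
# Hypothetical posterior for beta: N(m, Sigma)
Sigma = np.array([[0.8, 0.1], [0.1, 0.5]])
m = np.array([1.0, -0.5])
w = np.array([1.0, 2.0])                    # predictor value of interest

var_mean = w @ Sigma @ w       # posterior variance of the mean response w'beta
var_pred = var_mean + sigma2   # predictive variance of z = w'beta + error
```

The predictive variance always exceeds the mean-response variance by exactly σ², reflecting the additional variation due to prediction.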
5. Credible Regions and Hypothesis Assessment
Recall that a -relative-belief credible region for is given by , where There is some arbitrariness in using the greater-than-or-equal sign to define the credible region, as it could instead have been defined as , where In the latter case, is the -th quantile of the posterior distribution of the relative belief ratio. This definition has some advantages, as it implies that the plausible region satisfies , where Also, the strength of the evidence concerning the hypothesis satisfies where The key point is the close relationship between relative-belief credible regions, the plausible region, and the strength calculation. Thus, any decision-theoretic interpretation applicable to relative-belief credible regions also pertains to the plausible region and to the strength of the evidence. Throughout this section, we retain the definition of provided in Section 2.3.
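The plausible region and its posterior content are straightforward to compute numerically. A sketch under hypothetical ingredients chosen for illustration (a N(0, 1) prior and a N(1, 0.25) posterior for the quantity of interest, neither from the paper): the region where the relative belief ratio exceeds 1 is located on a grid, and its posterior content follows from the posterior CDF:

```python
import math

def Phi(z):
    # standard normal CDF
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def rb_ratio(p):
    # posterior N(1, 0.5**2) density over prior N(0, 1) density
    post = 2 / math.sqrt(2 * math.pi) * math.exp(-2 * (p - 1) ** 2)
    prior = math.exp(-p ** 2 / 2) / math.sqrt(2 * math.pi)
    return post / prior

# Plausible region {psi : RB(psi) > 1} located on a fine grid.
grid = [i / 1000 for i in range(-5000, 5001)]
plaus = [p for p in grid if rb_ratio(p) > 1]
lo, hi = min(plaus), max(plaus)
# Posterior content of the plausible region, from the posterior CDF.
content = Phi((hi - 1) / 0.5) - Phi((lo - 1) / 0.5)
```

The plausible region contains the relative belief estimate (the density-ratio maximizer, here 4/3), and its posterior content is the quantity reported alongside it when assessing the evidence.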
Now consider the lowest posterior loss -credible regions that arise from the prior-based loss functions considered here.
Theorem 5.
Suppose that for every , where is finite with equal to the counting measure. Then is a γ-lowest posterior loss-credible region for the loss function
Proof.
Now consider the case where is countable and we use the loss function Following the proof of Theorem 5, we see that a -lowest posterior loss region takes the form,
where
Theorem 6.
Suppose that for every , Ψ is countable with equal to the counting measure. For the loss function , whenever γ is such that , and whenever and
While Theorem 6 does not establish the exact convergence , it is likely that this holds under quite general circumstances due to the discreteness. Theorem 6 shows that the limit points of the class of sets always contain , and that their posterior probability content differs from by at most , where is the next-largest value for which exact content is attained.
Now, consider the continuous case with a regular discretization. For namely, is a subset of a discretized version of we define the un-discretized version of to be Now, let be the -relative belief region for the discretized problem and let be its un-discretized version. Note that, in a continuous context, we consider two sets equal if they differ only by a set of measure 0 with respect to The following result says that a -relative belief credible region for the discretized problem, after un-discretizing, converges to the -relative belief region for the original problem.
Theorem 7.
Suppose that is positive and continuous, there is regular discretization of and has a continuous posterior distribution. Then,
While Theorem 7 is interesting in its own right, it can also be used to prove that relative belief regions are limits of the lowest posterior loss regions.
Let be the -lowest posterior loss region obtained for the discretized problem using loss function (12), and let be the un-discretized version.
Theorem 8.
Suppose that is positive and continuous, we have a regular discretization of Ψ, and has a continuous posterior distribution. Then, we have that
In [18,19], additional properties of relative belief regions are developed. For example, it has been shown that a -relative belief region for , satisfying , minimizes among all (measurable) subsets of satisfying So, a -relative belief region is the smallest among all -credible regions for , where the size is measured using the prior measure. This property has several consequences. For example, the prior probability that a region contains a false value from the prior is given by , where a false value is a value of generated independently of It can be demonstrated that a -relative belief region minimizes this probability among all -credible regions for and is always unbiased in the sense that the probability of covering a false value is bounded above by Furthermore, a -relative belief region maximizes the relative belief ratio and the Bayes factor among all regions with
While the results in this section focus on obtaining credible regions for parameters, similar results can be proven for the construction of prediction regions.
6. Conclusions
Relative belief inferences are based on a clear characterization of statistical evidence and are closely related to likelihood inferences. This, together with their invariance and optimality properties, positions them as prime candidates for appropriate inferences in Bayesian contexts. This paper shows that relative belief inferences also arise naturally in a decision-theoretic formulation using loss functions based on the prior. Relative belief inferences thus represent a degree of unification between the evidential and decision-theoretic approaches to deriving statistical inferences.
Author Contributions
Conceptualization, M.E. and G.H.J.; Formal analysis, M.E. and G.H.J.; Writing—original draft, M.E.; Writing—review & editing, G.H.J. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the Natural Sciences and Engineering Research Council of Canada, grant RGPIN-2024-03839.
Data Availability Statement
Data is contained within the article.
Conflicts of Interest
The authors declare no conflicts of interest.
Appendix A
Proof of Theorem 2 and Corollary 1.
We have the following,
The first term in (A1) is bounded above by and does not depend on , so the value of the Bayes rule at x is obtained by finding , which maximizes the second term. Note that,
There are, at most, finitely many values of satisfying , and so assumes a maximum on this set, say at and when If then This proves that, for all , the maximizer of (A2) is given by , and the results are established. □
Proof of Theorem 3.
The prior risk of is given by
and
Therefore, is Bayesian-unbiased if and only if
This inequality holds when because is the density of with respect to , which implies that the maximum of this density is greater than or equal to 1. □
Proof of Theorem 4 and Corollary 3.
Just as in Theorem 2, a Bayes rule maximizes for Furthermore, as in Theorem 2, such a rule exists. Now, we define so that , and note that as As we have that
Let Let be such that diam for all Then, for and any satisfying we have
By (A4) and (A5), there exists , such that, for all then
Therefore, when a Bayes rule, satisfies
By (A5), (A6), and (A7) this implies that and the convergence is established.
Proof of Theorem 6.
For let and Note that as
Suppose c is such that Then, for all , and so This implies that and since , this implies that
Now suppose c is such that Then there exists , such that for all , we have Since when then Choosing for implies □
Proof of Theorem 7.
Let and
Recall that for every If there exists such that for all then and this implies that Now and so (after possibly deleting a set of -measure 0 from If then for infinitely many which implies that and, therefore, This proves (up to a set of -measure 0) so that for any
Let so and
Since then
and as Now consider the second term in (A8). Since has a continuous posterior distribution, is continuous in Let and note that for all small enough, and , which implies that and, therefore, As or then
For all sufficiently small , is bounded above by
and this upper bound converges to as Since is arbitrary, this implies that the second term in (A8) goes to 0 as and this proves the result. □
Proof of Theorem 8.
Without loss of generality suppose that Let and satisfy Put and note that By Theorem 7, we have and as so and as This implies that there is a such that for all then Therefore, by Theorem 6, we have that for all ,
From (A9) and Theorem 7, we conclude that
Since this establishes the result. □
References
- Birnbaum, A. On the foundations of statistical inference (with discussion). J. Am. Stat. Assoc. 1962, 57, 269–326.
- Royall, R.M. Statistical Evidence: A Likelihood Paradigm; Chapman & Hall: London, UK, 1997.
- Evans, M. Measuring Statistical Evidence Using Relative Belief. Chapman & Hall/CRC Monographs on Statistics & Applied Probability; Chapman and Hall/CRC: Boca Raton, FL, USA, 2015.
- Evans, M. The concept of statistical evidence: Historical roots and current developments. Encyclopedia 2024, 4, 1201–1216.
- Savage, L.J. The Foundations of Statistics; Dover Publications: Mineola, NY, USA, 1971.
- Lehmann, E.L. Neyman’s statistical philosophy. In Selected Works of E. L. Lehmann; Springer: Boston, MA, USA, 1995; pp. 1067–1073.
- Evans, M.; Jang, G.-H. Weak informativity and the information in one prior relative to another. Stat. Sci. 2011, 26, 423–439.
- Robert, C.P. Intrinsic losses. Theory Decis. 1996, 40, 191–214.
- Bernardo, J.M. Intrinsic credible regions: An objective Bayesian approach to interval estimation. Test 2005, 14, 317–384.
- Le Cam, L. On some asymptotic properties of maximum likelihood estimates and related Bayes’ estimates. Univ. Calif. Publ. Statist. 1953, 1, 277–329.
- Bernardo, J.M.; Smith, A.F.M. Bayesian Theory. Wiley Series in Probability and Statistics; John Wiley & Sons Ltd.: New York, NY, USA, 2000.
- Berger, J.O. Statistical Decision Theory and Bayesian Analysis; Springer: New York, NY, USA, 1985.
- Rudin, W. Real and Complex Analysis; McGraw Hill: New York, NY, USA, 1974.
- Al-Labadi, L.; Evans, M. Optimal robustness results for some Bayesian procedures and the relationship to prior-data conflict. Bayesian Anal. 2017, 12, 702–728.
- Al-Labadi, L.; Alzaatreh, A.; Evans, M. How to measure evidence and its strength: Bayes factors or relative belief ratios? arXiv 2024, arXiv:2301.08994.
- Nott, D.; Wang, X.; Evans, M.; Englert, B.-G. Checking for prior-data conflict using prior to posterior divergences. Stat. Sci. 2020, 35, 234–253.
- Evans, M.; Moshonov, H. Checking for prior-data conflict. Bayesian Anal. 2006, 1, 893–914.
- Evans, M.; Shakhatreh, M. Optimal properties of some Bayesian inferences. Electron. J. Stat. 2008, 2, 1268–1280.
- Evans, M.; Guttman, I.; Swartz, T. Optimality and computations for relative surprise inferences. Canad. J. Statist. 2006, 34, 113–129.