1. Introduction
Suppose that
k different experts choose models and priors for a statistical analysis concerning a common quantity of interest
which is a parameter or a future value. A problem then arises as to how the resulting statistical analyses should be combined so that the inferences presented can serve as a consensus inference. If all the models are the same, then this is the well-known problem of combining priors and this is covered by our discussion here. Even for the problem of combining priors, however, a somewhat different point-of-view is taken. A particular measure of statistical evidence is adopted, as discussed in
Section 3, such that the data set, sampling model and prior lead to either evidence in favor of or against each possible value of
Throughout the paper, the word ‘evidence’ is often used alone, but it always refers to the statistical evidence rather than some alternative kind of evidence. In this paper, it is concluded that the linear pooling rule, see [
1], is the most appropriate for combining evidence.
The purpose then is to determine a consensus on what the evidence indicates by combining the measures of statistical evidence rather than focusing on combining priors. Since the primary goal of a statistical analysis is to express what the evidence says about
, this seems appropriate. Also, it is perfectly reasonable that some analyses express evidence against while others express evidence in favor but the combined expression of the evidence is one way or the other, see
Section 2.
Before discussing the combination approach, however, it is necessary to be more precise about the problem and distinguish between somewhat different contexts where the problem can arise. It will be supposed here that is a parameter of interest but prediction problems are easily handled by a slight modification, see Example 3. Let denote a generic statistical model and where is onto and, to save notation, the function and its range have the same symbol.
Context I. Suppose there is a single statistical model for the data x and k distinct priors so there are k inference bases for It is assumed that the conditional priors on the nuisance parameters are all the same, as is satisfied when This situation arises when there is a group of analysts who agree on and perhaps use a default prior for the nuisance parameters, while each member puts forward a prior for
Context II. Suppose there are k data sets, models, and priors as given by the inference bases for and there is a common characteristic of interest with the true value of being the same for each model, as will occur when corresponds to some real-world quantity. Strictly speaking, the function also depends on i when the parameter spaces differ, but we suppress this dependence because each context is referring to the same real-world object.
It is a necessary part of any statistical analysis that a model be checked to see whether or not it is contradicted by the data, namely, determining if it is the case that the data lies in the tails of each distribution in the model. So in any situation where there is a lack of model fit, it is necessary to modify that component of the inference base. Similarly, each prior needs to be checked for prior–data conflict, namely, is there an indication that the true value lies in the tails of the prior, see [
2,
3]. If such a conflict is found, then the prior needs to be modified, see [
4]. For the purpose of the discussion here, however, it is assumed that all the models and priors have passed such checks. A salutary effect of a lack of prior–data conflict is that it rules out the possibility of trying to combine priors which have little overlap in terms of where they place their mass.
Given an inference base
and interest in
a Bayesian analysis has an important consistency property. In particular, this inference base is equivalent, for inference about
to the inference base
where
is the marginal prior on
and
with
the prior predictive density of the data obtained by integrating out the nuisance parameters via the conditional prior
for
given
So, for example, the posterior
for
obtained via these two inference bases is the same and moreover the evidence about
is also the same. This result has implications for the combination strategy as it is really the inference bases
that are relevant in Context I and it is the inference bases
that are relevant in Context II, namely, nuisance parameters are always integrated out before combining.
Note that if, in a collection of inference bases for all the models are based on sampling from the same basic model, and the conditional priors on the nuisance parameters are all the same, then it makes sense to combine the data sets, with the combined model being based on the full sample x, so we are in Context I as only the marginal priors differ. This combination would not be possible if the conditional priors on the nuisance parameters differed, as then the models would be different. We will assume hereafter that the following principle has been applied.
Combining inference bases rule: all data sets that are assumed to arise from the same set of basic distributions are combined whenever the conditional priors on the nuisance parameters are the same, so that separate data sets are associated with truly distinct models and/or priors.
This rule ensures that any combination reflects true differences among the beliefs concerning where the truth about lies as there is agreement on the other ingredients. It is assumed hereafter that this is applied before the inference bases are determined. Note that, even if the basic model is the same for each inference base, when the conditional priors on the nuisance parameters differ, this is Context II.
In
Section 2 a general family of rules for combining priors with given weights is presented. In
Section 3 the problem of combining evidence for Context I is analyzed, with given weights for the respective priors, and the linear pooling combination rule is seen to have the most appropriate properties with respect to evidence. In
Section 3.1 the problem of determining appropriate weights is considered. In
Section 4 the problem for Context II is discussed and a proposal is made for a rule that generalizes the rule for Context I. The rule for Context I possesses a natural consistency property as the combined evidence is the same whether considered as a mixture of the evidence arising from each inference base or obtained directly from the combined prior and the corresponding posterior. In particular, it is Bayesian in this generalized sense which differs from being externally Bayesian as discussed in [
5]; see
Section 3. This is not the case for Context II, however, because of differing nuisance parameters and ambiguities in the definition of the likelihood, but Jeffrey conditionalization provides a meaningful interpretation, at least when all the inference bases contain the same data.
The problem of combining priors has an extensive literature. Ref. [
6] is a basic reference and reviews can be found in [
7,
8,
9,
10]. Ref. [
11] is a significant recent application. Broadly speaking there are mathematical approaches and behavioral approaches. The mathematical approach provides a formal rule, as in
Section 2, while the behavioral approach provides methodology for a group of proposers to work towards a consensus through mutual interaction. For example, ref. [
12] considers the elicitation procedure where quantities concerning the object of interest are elicited by each member of a group and then the average elicited values are used to choose the prior. Ref. [
13] adopts a supra-Bayesian approach where the data generated during the elicitation process is conditioned on in a formal Bayesian analysis to choose a prior in a family on which an initial prior has been placed. Ref. [
14] presents an iterative methodology for a group of proposers to work towards a consensus prior based upon each proposer seeing how far their proposal deviated from a current grouped proposal. While the behavioral approach has a number of attractive features, there are also reservations as indicated by Kahneman in [
15].
The focus in this paper is on presenting a consensus assessment of the evidence via a combination of the evidence that each analyst obtains. In particular, the priors
need not arise via the same elicitation procedure and the proposers may not be aware of other proposals although the approach does not rule this out. Also, utility functions, necessary for decisions, are not part of the development as these may indeed lead to conflicts with what the evidence indicates and they are not generally checkable against the data as with models and priors. The assessment of statistical evidence as the primary driver of statistical methodology is a theme that many authors have pursued, for example, see [
16,
17,
18,
19]. Ref. [
20] reviews many of the attempts to provide a precise definition of the concept of statistical evidence. Ref. [
21] discusses the importance of the amalgamation of evidence, although evidence there references a more general concept than what is considered here.
Throughout the paper the densities of probability distributions will be represented by lower case symbols and the associated probability measures will be represented by the same symbol in upper case. For example, if a prior density is denoted by π, then the prior probability measure will be denoted by Π, with the posterior density denoted π(· | x) and the posterior probability measure by Π(· | x).
2. Combining Priors with Given Prior Weights
Let
the
-dimensional simplex for some
and, for now, suppose that
is given. While general combination rules could be considered, attention is restricted here to the power means of densities
where
and, for any
and sequence of nonnegative functions
defined on
then
is the relevant normalizing constant. Note that
and
do not depend on
For each
the mean
is nondecreasing in
see [
22], and two of the means are equal everywhere iff all priors are the same. Since
this implies that
is finite for all
whenever
If
is to be considered, then it is necessary to check on the integrability of the mean so that a proper prior is obtained and this will be assumed to hold whenever the case
is referenced. When
is finite, this is not an issue.
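As a concrete illustration of these power mean combinations, the following sketch (in Python; the discrete parameter space, the two priors and the weights are hypothetical, and the notation is ours rather than the paper's) computes the pooled prior for several values of t, including the linear pool (t = 1) and the logarithmic pool (t = 0).

```python
import numpy as np

def power_mean_pool(priors, alpha, t):
    """Pool k discrete priors (rows of `priors`) with weights alpha using the
    power mean of degree t; t = 1 is the linear pool, t = 0 the logarithmic pool."""
    priors = np.asarray(priors, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    if t == 0:
        # limiting case: weighted geometric mean of the densities
        pooled = np.exp(alpha @ np.log(priors))
    else:
        pooled = (alpha @ priors**t) ** (1.0 / t)
    return pooled / pooled.sum()   # normalize so a proper prior is obtained

# hypothetical example: two priors on a 5-point parameter space, equal weights
pi1 = np.array([0.05, 0.15, 0.40, 0.30, 0.10])
pi2 = np.array([0.30, 0.30, 0.20, 0.15, 0.05])
alpha = [0.5, 0.5]
for t in (1.0, 0.5, 0.0, -1.0):
    print(t, power_mean_pool([pi1, pi2], alpha, t))
```

For t = 1 the pool is the weighted average of the priors, for t = 0 it is the normalized weighted geometric mean, and for t = -1 it is a normalized weighted harmonic mean.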
The following result characterizes how the posterior behaves in terms of a combination of the individual posteriors. Let denote the i-th prior predictive density based on prior denote the prior predictive density obtained using the prior and • denotes component-wise multiplication of two vectors of the same dimension.
Proposition 1. For Context I, the posterior based on the combined prior equals the normalized power mean, of the same degree, of the individual posteriors with updated weights, and satisfies the stated inequality when
Proof. The expressions for
for
are obvious and
so the factor
cancels giving the result. Finally,
and this is bounded above (below) by
when
which gives the inequality. □
So the posterior is always proportional to a power mean of the individual posteriors of the same degree as the power mean of the priors but, excepting the
case, the weights have changed and when
or
the prior and posterior do not depend on
The posterior resulting when
is
and so is a linear combination of the individual posteriors but with different weights than the prior. The case
is called the
linear opinion pool, see [
1], and when
it is called the
logarithmic opinion pool.
The property of the weights staying constant from a priori to a posteriori for
or even independence from the weights, may seem appealing but, as discussed in
Section 3, these combination rules have properties that make them inappropriate for combining evidence. A combination rule is said to be
externally Bayesian when the rule for combining the posteriors is the same as the rule for combining the priors. As shown in [
5,
23], logarithmic pooling is characterized by being externally Bayesian while linear pooling only satisfies this when there is a dictatorship, namely,
for some
as otherwise the weights differ. Proposition 2 (iii) shows, however, that there is a sense in which linear pooling can be considered as Bayesian.
Linear pooling has a number of appealing properties.
Proposition 2. For Context I, linear pooling satisfies the following:
(i) the prior probability measures satisfy the same combination rule as the densities, namely, and similarly for the posterior measures,
(ii) marginal priors obtained from are equal to the same combination of the marginal priors obtained from the , and this is effectively the only rule with this property among all possible combination rules,
(iii) if is given the joint prior distribution with density , then the posterior density of θ is given by (1) and the weight is the posterior probability of the index i.
Proof. The proof of (i) is obvious while (ii) is proved in [
24] and holds here with no further conditions. For (iii), note that
is the conditional prior of
given
i and
is the conditional density of
x given
Once
x is observed, the posterior of
is then given by
which implies that the marginal posterior of
is (
1) and the posterior probability of
i is
. □
The significance of (i) is that the other combination rules considered here do not exhibit such simplicity and require more computation to obtain the measures. Property (ii) implies that integrating out nuisance parameters before or after combining does not affect inferences about a marginal parameter
in Context I, as the conditional priors on the nuisance parameters being the same implies that the marginal models for
are all the same. Ref. [
25] proves a similar result allowing for negative
. Property (iii) shows that both the prior
and the posterior
arise via valid probability calculations when
is known. A possible interpretation of this is that
represents the combiner’s prior belief in how well the
i-th prior represents appropriate beliefs concerning the true value of
relative to the other priors. The posterior weight
is then the appropriate modified belief after seeing the data, as the factor
reflects how well the
i-th inference base has done at predicting the observed data relative to the other inference bases. This is a somewhat different interpretation than that taken by [
26] where
represents the combiner’s prior belief that the
i-th inference base is the true one which, in this context, does not really apply.
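A small numerical sketch of this interpretation, using a hypothetical beta-binomial setup of our own rather than anything from the paper, computes the prior predictive probability of the observed data under each prior and the resulting posterior weights, which are proportional to the prior weight times the prior predictive.

```python
import numpy as np
from scipy.stats import betabinom

# Hypothetical Context I setup: x ~ binomial(n, theta) with k = 3 beta priors on theta.
n, x = 20, 14                                # observed data (hypothetical)
a = np.array([1.0, 10.0, 2.0])               # beta(a_i, b_i) prior parameters (hypothetical)
b = np.array([1.0, 10.0, 8.0])
alpha = np.full(3, 1/3)                      # prior weights on the inference bases

# prior predictive of x under prior i is beta-binomial(n, a_i, b_i)
m = betabinom.pmf(x, n, a, b)
post_weights = alpha * m / np.sum(alpha * m)
print(post_weights)   # bases whose priors predicted x well receive larger posterior weight
```

Inference bases whose priors did a relatively good job of predicting the observed data receive larger posterior weights, which is exactly the reading of the factor described above.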
One commonly cited negative property of linear pooling, see [
27], is that if
A and
C are independent events for each
then generally
It is to be noted that if also one of
or
is constant in
then independence is preserved and this will be seen to play a role in linear pooling behaving appropriately when considering statistical evidence, see Proposition 4(ii) and the discussion thereafter.
3. Combining Measures of Evidence in Context I
The criterion for choosing an appropriate combination should depend on how statistical evidence is characterized, as using the evidence to determine inferences is the ultimate purpose of a statistical analysis. The underlying idea concerning evidence used here is the following principle.
Principle of Evidence: there is evidence in favor of a value if its posterior probability is greater than its prior probability, there is evidence against the value if its posterior probability is less than its prior probability, and there is no evidence either way if the two are equal.
The basic idea is that, if the data has led to an increase in the belief that
is the true value from a priori to a posteriori, then the data contains evidence in favor of
etc. This interpretation is obviously the case when the prior is a discrete distribution and it also holds in the continuous case via a limit argument, see [
17]. The principle of evidence does not require that a specific numerical measure of evidence be chosen, only that any measure used be consistent with this principle, namely, that there is a cut-off such that a numerical value greater than (less than) the cut-off corresponds to evidence in favor of (against), as indicated by the principle. The relative belief ratio
the ratio of the posterior to the prior, with the cut-off 1, is used here as it has a number of good properties, see [
17]. It is also particularly appropriate for the combination of the evidence as easily interpretable formulas result. The Bayes factor is also a valid measure of evidence, but there are many reasons to prefer the relative belief ratio to measure evidence, as discussed in [
28].
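For reference, writing ψ for the quantity of interest and using notation introduced here, the relative belief ratio just described is the ratio of the posterior density to the prior density, with the principle of evidence applied using the cut-off 1:

\[
RB_\Psi(\psi \mid x)=\frac{\pi_\Psi(\psi \mid x)}{\pi_\Psi(\psi)},\qquad
\begin{cases}
RB_\Psi(\psi \mid x)>1 & \text{evidence in favor of } \psi,\\
RB_\Psi(\psi \mid x)<1 & \text{evidence against } \psi,\\
RB_\Psi(\psi \mid x)=1 & \text{no evidence either way.}
\end{cases}
\]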
The next result examines the behavior of the combination rules of
Section 2 with respect to evidence and is stated initially for the full model parameter
in Context I. For this
is the relative belief ratio for
that results from the
i-th inference base
and
is the relative belief ratio for
that results from combining the
k priors using the
t-th power mean combination rule.
Proposition 3. For Context I, the relative belief ratio for θ based on the prior satisfies (2).
Proof. Using
and
then
□
This result shows the value of using the relative belief ratio to express evidence, as the combination rule, at least for power means, is quite simple and natural. Notice too that if there are only l distinct priors, then the combination rules for the priors, posteriors and relative belief ratios are really only based on these distinct priors and the weights change only by summing the that correspond to common priors.
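To record the linear pooling case explicitly, and in notation we introduce here for concreteness, suppose Context I holds with common sampling density f_θ and prior predictives m_i(x) = ∫ f_θ(x) π_i(θ) dθ. Then the relative belief ratio based on the linear pool is the corresponding convex combination of the individual relative belief ratios,

\[
RB_{1,\alpha}(\theta \mid x)=\frac{f_\theta(x)}{m_\alpha(x)}
=\sum_{i=1}^{k}\frac{\alpha_i\, m_i(x)}{m_\alpha(x)}\,RB_i(\theta \mid x),
\qquad m_\alpha(x)=\sum_{j=1}^{k}\alpha_j\, m_j(x),
\]

where RB_i(θ | x) = f_θ(x)/m_i(x). Since the weights are nonnegative and sum to one, the combined relative belief ratio always lies between the smallest and largest of the individual ones, which is the source of the consensus-preserving behavior discussed below.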
The result in Proposition 3 is another indication that the correct way to combine priors, from the point of view of measuring evidence, is via linear pooling as
is always proportional to
The constant multiplying
in (
2) suggests that finding
t that minimizes
, leads to the power mean prior that maximizes the amount of mass the prior places at
see Proposition 5 (iv). But there is a significant reason for preferring
over the other possibilities. Suppose that
for all
or
for all
Then it is clear that
in the first case and
in the second case. In the first case there is a consensus that there is evidence against
being the true value and in the second case there is a consensus that there is evidence in favor of
being the true value. In other words
is consensus preserving and this seems like a necessary property for any approach to combining evidence.
A formal definition is now provided which takes into account that sometimes , indicating that there is no evidence either way, which implies that the i-th inference base is agnostic about whether or not is the true value.
Definition A rule for combining evidence about a parameter is called consensus preserving if, whenever at least one of the inference bases indicates evidence in favor of (against) a value of the parameter and the remaining inference bases do not give evidence against (in favor), then the rule gives evidence in favor of (against) the value and if no inference base indicates evidence one way or the other, then neither does the combination.
The following property is immediately obtained for linear pooling.
Proposition 4. For Context I, whenever for all then (i) is consensus preserving and (ii) whenever for all then iff for all i.
The property of preserving consensus is similar to the unanimity principle for priors, see [
7], which says that if all the priors are the same, then the combination rule must give back that prior and all the power mean rules satisfy this.
Proposition 4 (ii) indicates that linear pooling deals correctly with independent events at least with respect to evidence. For note that, for probability measure P and events A and C satisfying then A and C are statistically independent iff So, independence is equivalent to saying that the occurrence of C provides no evidence concerning the truth or falsity of A and conversely. Now consider the statistical context and suppose and further suppose that all the probabilities are discrete. This implies that which implies that the joint prior density at factors as and so the events and are statistically independent in the i-th inference base. If this holds for each then is constant in i and so indeed implies that these events are independent when the prior is the linear pool. With a continuous prior, then can also happen, but typically this event has prior probability 0.
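In symbols, and in notation used here, for a probability measure P and events A and C with P(C) > 0, the equivalence underlying this discussion is

\[
A \text{ and } C \text{ independent} \iff P(A \mid C)=P(A) \iff RB(A \mid C)=\frac{P(A \mid C)}{P(A)}=1,
\]

so independence of A and C is exactly the statement that the occurrence of C provides no evidence either way concerning A, and conversely.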
It is of interest to determine whether or not any of the other rules based on the means are consensus preserving. The inequality in Proposition 1 and Proposition 3 imply that, when then with with the inequality typically strict when This suggests that might even contradict the consensus of evidence in favor. A similar argument holds for The following example shows that generally the combination rules based on power means of priors are not consensus-preserving.
Example 1. Power means of priors are not generally consensus preserving.
Suppose and is observed. There are two priors given by and Then so both inference bases give evidence against when When , then so no evidence either way is obtained from the data when a statistician is categorical in their beliefs. Note that being categorical in your beliefs is a possible choice provided it does not lead to prior–data conflict. In this case, there is no prior–data conflict even with since there is a reasonable probability of observing when When so the two priors are being given equal weight, then When so statistician 1 is categorical in their beliefs, and then So, statistician 1 finds no evidence either way for a being the true value from the data and this is because, when a prior is categorical, the data is irrelevant as it does not change beliefs. Statistician 2 finds evidence against a and the posterior probability of indicates reasonably strong belief in a not being the true value. Linear pooling indicates evidence against a, as it should, and the posterior probability of indicates weak belief in a not being the true value, and this decrease in the strength of the evidence against is because of the first statistician’s complete confidence in the truth of a and the combination of beliefs. Note that indicates complete indifference between the quality of the statisticians’ priors but, if we put less weight on the first statistician’s prior, then the evidence against and its strength move closer to that of statistician 2.
Now consider logarithmic pooling where In particular, with then , no matter what α is, and for every By Proposition 3, with which indicates no evidence for or against a being the true value. Therefore, logarithmic pooling is not consensus preserving. The illogicality of this is readily apparent as it suggests that no evidence has been found one way or the other and that is not the case. Next consider the case so When and then and which shows that this combination rule is also not consensus preserving. In this context, and based on numerical computation, it seems that for every and so all of these combination rules are not consensus preserving and note that this includes the harmonic mean combination rule. If there is evidence against (in favor of) an event, then a property of the relative belief ratio gives that there is evidence in favor of (against) its complement and, if there is no evidence either way for an event, then there is no evidence either way for its complement, see [17], Proposition 4.2.3 (i). So in this example the priors and also do not preserve consensus with respect to So far no case has been found where a combination based on a power mean actually reverses a consensus and it is a reasonable conjecture, based on many examples, that this will never happen but a proof is not obvious. Also, other power means may preserve consensus but currently we do not have such a result. Logarithmic pooling could be considered as the main rival to linear pooling, but Example 1 shows that it does not preserve consensus.
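Since the specific numbers in Example 1 are set in the original text, the following sketch uses hypothetical values of our own that reproduce the structure of the example: a two-point parameter space, one categorical prior (a point mass at the value a) and one non-categorical prior, with data giving evidence against a under the second prior.

```python
import numpy as np
from scipy.stats import binom

# Hypothetical version of Example 1: theta takes one of two values,
# statistician 1 is categorical (point mass at theta = a), statistician 2 is not.
theta_vals = np.array([0.25, 0.75])      # a corresponds to theta = 0.75
x, n = 2, 10                             # observed data (hypothetical)
lik = binom.pmf(x, n, theta_vals)

pi1 = np.array([0.0, 1.0])               # categorical prior at a
pi2 = np.array([0.5, 0.5])
alpha = np.array([0.5, 0.5])

def posterior(prior):
    p = prior * lik
    return p / p.sum()

def rb(prior):                           # relative belief ratios, elementwise
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(prior > 0, posterior(prior) / prior, np.nan)

print("RB_1(a) =", rb(pi1)[1])           # 1: the data cannot change categorical beliefs
print("RB_2(a) =", rb(pi2)[1])           # < 1: evidence against a

# linear pool (t = 1): evidence against a is preserved
lin = alpha @ np.vstack([pi1, pi2])
print("linear pool RB(a) =", rb(lin)[1])

# logarithmic pool (t = 0): the point mass forces RB(a) = 1, losing the consensus
log_pool = np.prod(np.vstack([pi1, pi2]) ** alpha[:, None], axis=0)
log_pool = log_pool / log_pool.sum()
print("log pool RB(a) =", rb(log_pool)[1])
```

With these numbers the linear pool reports evidence against a, while the logarithmic pool returns a relative belief ratio of 1 at a, illustrating the failure of consensus preservation described above.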
There is another interesting consequence of Proposition 3 which is relevant when the goal is to estimate The natural estimate is the relative belief estimate where the accuracy of is assessed by the plausible region the set of values for which there is evidence in favor. For example, the “size” of and its posterior content together provide an a posteriori measure of how accurate is. Ideally we want “small” and its posterior content high. The size of can be measured in various ways such as Euclidean volume, cardinality, or prior content, with the context determining which is most suitable. Note that it is easy to show in general that so provided is not 1 for all which only occurs when the data indicates nothing about the true value.
Corollary 1. Whenever is not 1 for all θ and for all then
So the estimate of based on maximizing the evidence in favor is determined by linear pooling for every t. It is not the case, however, that the plausible region is independent of t because of the constant
The following underscores the role of linear pooling in preserving consensus.
Corollary 2. The set for all and
So the set of where there is a consensus that there is evidence in favor is always contained in the plausible region determined by linear pooling. A similar comment applies to the implausible region which is the set of all values where there is evidence against. While it might be tempting to quote the region there is no guarantee that any of the relative belief estimates will be in this set, whether determined by or any of the
The situation with respect to the assessment of the hypothesis
is a bit different. Clearly, if
for all
so there is a consensus that there is evidence in favor of (against)
then
preserves this consensus. In general, when the evidence in favor of or against
is assessed via a relative belief ratio
then the posterior probability
can be taken as a measure of the strength of the evidence, see [
17]. In the context under discussion here, it follows from (
2) that the event
for all
Of course, the posterior probability of this event will depend on
t but linear pooling completely determines the event.
Now suppose that interest is in the quantity
and the assumptions of Context I hold so that prior beliefs only differ concerning the value of
, which implies that the inference bases only differ with respect to the priors on
This situation may arise when the analysts all agree to use a common default prior on the nuisance parameters. Then we can treat
as the model parameter for the common model
and the relevant linear pooling rule is
where
is the relative belief ratio for
obtained from the
i-th inference base. Note that the results derived for
also apply for inferences about
In general it can be expected that some inference bases will indicate evidence in favor of
being the true value and some will indicate evidence against, but
will indicate evidence one way or the other or even perhaps no evidence either way. This depends on the values assumed by the
as well as the weights
, with larger values of a weight leading to a greater contribution to the overall inferences by the corresponding inference base. This aspect is discussed in
Section 3.1.
Consider now the context where
is an
sample. The following result gives the consistency of this approach when the model parameter space
is finite. Such results will hold more generally but require some mathematical constraints on densities and this is not pursued further here. Let
be the relative belief estimate of
based on linear pooling. All the convergence results are almost everywhere as
with the proofs in the
Appendix A.
Proposition 5. For Context I, suppose is an sample from a distribution in a model having a finite parameter space
and each prior for θ is everywhere positive on Then
(i) and (ii) and
(iii)
(iv)
Noting that when then Proposition 5 (i) says that the evidence in favor of (against) based on the combination goes to categorical when is true (false). Part (ii) says that the relative belief estimate based on the combination is consistent. Part (iii) implies that, when the priors are equally weighted, the inference base whose prior gives the largest value to the true value will inevitably have the largest weight in determining the combined evidence. As previously mentioned, part (iv) suggests choosing t to minimize the ratio as this can be associated with choosing the power combination prior that maximizes the amount of belief the prior places on the true value. This has the unnatural consequence, however, that the prior is being determined by the data.
Our overall conclusion, based on the results established here, is that linear pooling is the most natural way to combine evidence among the power means. As such, attention is restricted to this case hereafter. Various authors, when discussing the combination of priors, have come to a similar conclusion. For example, ref. [
10], when considering the full spectrum of methods for combining priors, contains the following assertion, “In general, it seems that a simple, equally weighted, linear opinion pool is hard to beat in practice”. The results developed here support such a conclusion when considering evidence.
3.1. Determining the Prior Weights
The discussion so far has assumed that
is known but arguments or methodologies for choosing
need to be considered. There are several possible approaches to determining a suitable choice of the prior weights and nothing novel is proposed here. As previously mentioned, the
can represent the combiner’s beliefs concerning how well the
i-th prior represents appropriate beliefs about
The combiner’s beliefs should of course be based upon experience or knowledge concerning the various proposers of the priors. In the absence of such knowledge, uniform weights, namely,
seem reasonable. Ref. [
29] provides a good survey of various approaches to choosing
. Also, ref. [
30,
31] present a novel iterative approach to determining a consensus
among the proposers.
In Context I notice that the weights only depend on the data through some function of the value of the minimal sufficient statistic (mss) for the model. So, for example, if the priors are distinct and equally weighted via then the weight of the i-th prior is and so more weight is given to those inference bases that do a better job, relatively speaking, of predicting a priori the observed value of this function of the mss. Since it is only the observed value of the mss that is relevant for inference, this seems sensible. There is the possibility, however, to weight some priors more than others for a variety of reasons.
A prior can also be placed on the results examined for a number of different choices of and summarized in a way that addresses the issue of whether or not the inferences are sensitive to For example, suppose the goal is to determine if there is evidence for or against the hypothesis For a given weighting the evidence for or against will be determined by the value Accordingly, a Dirichlet prior with mode at and with some degree of concentration around this point could be used to assess the robustness of the combination inferences. In particular, for each generated value of from the prior, one can record whether evidence in favor of or against was obtained together with the strength of the evidence. If a large proportion of the results are similar to those obtained with the weights then this would provide some assurance that the conclusions drawn are robust to deviations. A similar approach can be taken to estimation problems where the relative belief estimate is given by When is 1-dimensional, a histogram of the estimates obtained in the simulation and histograms of the prior and posterior contents of will provide an indication of the dependence on
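A minimal sketch of this sensitivity analysis, with the individual relative belief ratios at the hypothesized value, the prior predictive densities and the base weighting all hypothetical, might proceed as follows.

```python
import numpy as np
rng = np.random.default_rng(0)

# Hypothetical ingredients for k = 3 inference bases (Context I):
rb0 = np.array([1.8, 0.7, 1.3])       # RB_i at the hypothesized value (in favor / against)
m   = np.array([0.12, 0.30, 0.20])    # prior predictive densities m_i(x)
alpha_star = np.array([1/3, 1/3, 1/3])
conc = 50.0                            # concentration of the Dirichlet around alpha_star

def combined_rb(alpha):
    w = alpha * m
    w = w / w.sum()                    # posterior weights alpha_i(x)
    return w @ rb0                     # linear-pool relative belief ratio at the value

draws = rng.dirichlet(conc * alpha_star, size=10_000)
rbs = np.array([combined_rb(a) for a in draws])
print("proportion of weightings giving evidence in favor =", np.mean(rbs > 1))
print("baseline RB at alpha* =", combined_rb(alpha_star))
```

The proportion of draws giving evidence in the same direction as the baseline weighting provides a simple summary of how sensitive the conclusion is to the choice of the weights.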
4. The General Problem
The general Context II is more complicated and an overall solution is not proposed here. Rather, a special case is considered when there is a common data set. So, k analysts are making inferences about the same real-world object based on the same data, but they are using possibly different models and different priors. Since Context II covers Context I, it is necessary that any rule proposed for such situations agrees with what is determined for Context I when that applies.
While it may seem reasonable to take the prior on to be the linear mixture this cannot be viewed as a marginal prior obtained by integrating out nuisance parameters from , as in Context I, because the nuisance parameters vary with Also, even if we elected to use this prior, the overall posterior does not have a clear definition as it is not obvious how to form the likelihood. As such, a different approach and justification is required.
The simplest approach to characterizing the evidence in Context II is to use
where again
and
arise from the
i-th inference base and
. This will agree with the answer obtained in Context I when it applies, but generally
is not the ratio of the posterior of
to its prior. As such, it cannot be claimed that (
4) is a valid characterization of the evidence, as
is in Context I, even though each
is a valid measure of evidence.
One approach to defining a prior and a posterior in Context II is to use the argument known as Jeffrey conditionalization, see [
32]. This involves considering the probabilities on the partition given by
completely separately from the probabilities on
given
If we knew
then standard conditioning leads to
as the expression of posterior beliefs about
But
i is unknown and all that is available are the probabilities given by
and Jeffrey conditionalization suggests
and
as the appropriate expressions of prior and posterior beliefs.
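In notation introduced here for concreteness, with prior weights α_i and posterior weights α_i(x) proportional to α_i m_i(x), Jeffrey conditionalization as applied in this setting amounts to taking

\[
\pi_\Psi(\psi)=\sum_{i=1}^{k}\alpha_i\,\pi_{i,\Psi}(\psi),\qquad
\pi_\Psi^{*}(\psi \mid x)=\sum_{i=1}^{k}\alpha_i(x)\,\pi_{i,\Psi}(\psi \mid x),\qquad
\alpha_i(x)=\frac{\alpha_i\, m_i(x)}{\sum_{j=1}^{k}\alpha_j\, m_j(x)},
\]

with the combined evidence then reported as the corresponding mixture of the individual relative belief ratios, as in (4).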
But note that, based on the
k inference bases,
can be thought of as the prior probability distribution for
which leads to
as the prior for
Since the likelihood
depends on
however, Context I does not apply. Still, the joint prior for
is
and, after observing
from the combiner’s point-of-view, this leads to the posterior probability
for
From the
i-th analyst’s viewpoint,
gives the appropriate posterior for
and so, applying the Jeffrey conditionalization idea, leads to the combination posterior for
given by
This could be considered as a generalization of Jeffrey’s idea as now the probabilities on the partition elements and
both depend on the data. Furthermore, extending Jeffrey’s idea to the combination of the measurement of evidence, we obtain (
4). While this is not formally a valid measure of evidence, (
4) will satisfy all the properties of linear pooling established for Context I with the exception of Proposition 2 (iii). In particular,
will preserve a consensus about evidence in favor or against. A key reason for not using
and
as the prior and posterior to determine the evidence, is that the nice properties of linear pooling are lost, see Example 3.
The following result characterizes what happens as sample size grows and is proved in the
Appendix A. Again convergence is almost everywhere.
Proposition 6. Suppose is an sample from a distribution in at least one of the models and each of the parameter spaces is finite with the prior everywhere positive on Denoting the set of indices corresponding to the models containing the true distribution by then as
(i) and
(ii) which is greater than 1 when
(iii)
So Proposition 6 shows that and provide consistent inferences and the weights converge to appropriate values.
There is another significant difference between (
4) and (
3). In Context I the weights all depended on the data through the same function of a constant mss for the full common model. Furthermore, if
is an ancillary statistic for the full model, then it is seen that the
i-th weight satisfies
This implies that the weights are comparable as they are all concerned with predicting essentially the same data and moreover they are not concerned with predicting aspects of the data that have no relation to the quantity of interest. In Context II this is not necessarily the case which raises the question of whether or not the weights are comparable.
It is not obvious how to deal with this issue in general, but in some contexts the structure of the models is such that where L has fixed dimension and A is ancillary for each model. For example, if all the models are location models, then where is a column of 1’s, and is ancillary. In such a case, it is desirable to determine the weights based on how well the inference bases predict the value of and not To take account of this it is necessary that Jeffrey conditionalization be modified so that the i-th posterior weight is now proportional to where is the i-th prior predictive of the data given Examples 4 and 5 illustrate this modification.
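In the notation used here, writing a = A(x) for the ancillary component and m_i(· | a) for the i-th prior predictive density of the data given the ancillary, the modified posterior weight is taken to be proportional to

\[
\alpha_i(x)\;\propto\;\alpha_i\, m_i(x \mid A(x)),
\]

so that the inference bases are compared only on the part of the data that actually carries information about the quantity of interest.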
While Proposition 6 does not apply with the conditional weights, a similar result can be proved and for this some assumptions are imposed to simplify the proof. Let the basic sample space be such that there is a finite ancillary partition
applicable to each of the
k models, and for any
n the ancillary is given by
where
records the number of values in the sample that lie in
Then the probability distribution of
for the
i-th model is given by the multinomial
where the
are fixed and independent of the model parameter. Denote this probability function at the observed data by
where
Suppose that each parameter space
is finite with the prior
everywhere positive. Let
for
and
J denote the set of indices containing the true distribution. Calling these requirements condition
the following is proved in the
Appendix A.
Proposition 7. If condition ★
holds, then and
Of course, this result needs to be generalized to handle even a situation like the location model. For this, some conditions on the models and priors are undoubtedly required, but this is not pursued further here. One key component of the proof is the existence of the ancillary partition
and such a structural element seems necessary generally to obtain the comparability of the weights. In group-based models, like linear regression and many others, such a structure exists via the usual ancillaries, see Example 5. As an approximation, a finite ancillary partition can be constructed via the ancillary statistic in question and so Proposition 7 is applicable. It should also be noted that, if the original models are replaced by the conditional models given the ancillary, then (
4) gives the same answer as this modification as the values of
are unaffected by the conditioning.
Clearly there are connections between the combination rule for statistical evidence advocated here and Bayesian model averaging as discussed in [
33]. In fact, the posterior (
5) is the same as that obtained from Bayesian model averaging. The focus here, however, is on the inferences that arise from a direct measure of statistical evidence rather than basing these on the posterior alone and these inferences are different. That posterior probabilities do not provide a suitable measure of evidence can be seen from simple examples such as the Prosecutor’s Fallacy as discussed in [
28] (Example 4). It is shown there that the posterior probability of an event (guilt) being true can be very small but there is still clear evidence that the event is true. So, this is only weak evidence because the posterior probability indicates a small belief in what the evidence indicates. As has been demonstrated here, the consensus preserving feature supports the linear rule over other possible candidates for combining and this, together with Jeffrey conditionalization, also supports the posterior (5) obtained via Bayesian model averaging. Issues concerned with the comparability of the weights remain to be more fully addressed for both methodologies.
5. Examples
Some examples are now considered that demonstrate a number of considerations.
Example 2. Location-normal model with normal priors.
Suppose is a sample from a distribution where the mean is unknown but the variance is known. It might be more appropriate to model this with an unknown variance but this situation will suffice for illustrative purposes and there are applications for it in physics, where the variation arising from a given measurement process is well understood. The model is then given by, after reducing to the mss , the collection of distributions and so this is Context I. Suppose there are three analysts and they express their priors for μ as distributions for so the i-th posterior is and these ingredients determine the relative belief ratios. For combining, the prior predictives are also needed and the i-th prior predictive density for is the density. Suppose the inference bases are equally weighted, so the posterior weight of the i-th analysis relative to the others is determined by how well the observed value fits the distribution. Note, however, that even if there is a perfect fit, in the sense that the weight still depends on the quantity For example, if the are all equal and there is a perfect fit, then the i-th weight is proportional to and this weight goes to 0 as with the other prior variances constant and goes to its largest value when This suggests that making a prior quite diffuse leads to reducing the impact the corresponding inference base has in the combined analysis.
Consider a specific data example where the true value is with , and sample sizes Data were generated from the true distribution obtaining the values respectively. For the priors, we use equally weighted. Figure 1 plots the combined prior, posterior and relative belief ratio for the case. Table 1 records the estimates of the plausible regions together with the posterior and prior contents of these intervals for each inference base and linear pooling. Note that, in this case, because the model is the same for each inference base and μ is the model parameter, the estimates are all equal to the MLE of μ but the plausible intervals and their posterior contents differ. Consider now prediction, which produces the interesting consequence that Context II now obtains even when all the models are the same.
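The particular numbers for this example are set in the original text, so the following sketch reproduces only the structure of the computation with hypothetical data and hypothetical normal priors: it forms the conjugate posteriors, the posterior weights, the linearly pooled prior and posterior for μ on a grid, and the resulting relative belief estimate and plausible interval.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical stand-ins for the ingredients of Example 2: N(mu, sigma0^2) model,
# sigma0 known, three N(m_i, tau_i^2) priors on mu with equal prior weights.
sigma0, n = 1.0, 20
xbar = 0.31                                   # hypothetical observed sample mean
prior_means = np.array([0.0, 1.0, -2.0])      # hypothetical prior means m_i
prior_sds   = np.array([1.0, 0.5, 1.0])       # hypothetical prior standard deviations tau_i
alpha = np.full(3, 1/3)

# conjugate posteriors N(post_mean_i, post_sd_i^2) for mu under each prior
prec = 1/prior_sds**2 + n/sigma0**2
post_sds = np.sqrt(1/prec)
post_means = (prior_means/prior_sds**2 + n*xbar/sigma0**2) / prec

# prior predictive of xbar under prior i and the resulting posterior weights
m = norm.pdf(xbar, prior_means, np.sqrt(prior_sds**2 + sigma0**2/n))
w = alpha*m / np.sum(alpha*m)

# combined prior, posterior and relative belief ratio for mu on a grid
mu = np.linspace(-4, 4, 801)
prior = alpha @ norm.pdf(mu[None, :], prior_means[:, None], prior_sds[:, None])
post  = w     @ norm.pdf(mu[None, :], post_means[:, None],  post_sds[:, None])
rb = post / prior

mu_rb = mu[np.argmax(rb)]                     # relative belief estimate of mu
plausible = mu[rb > 1]                        # plausible region: values with evidence in favor
print(mu_rb, plausible.min(), plausible.max())
```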
Example 3. Prediction.
Consider Context I but suppose interest is in predicting a future value whose distribution is conditionally independent of the observed data x given θ and has model where with The first step in solving this problem is to determine the relevant inference bases and this is carried out by integrating out the nuisance parameter which in this case is So the i-th inference base is given by where is the density of the i-th prior for namely, and is the conditional density of x given Note that unconditionally x and y are not independent and now the collection of possible distributions for x is indexed by The i-th posterior density of y is then
The models are now not all the same so this is Context II with common data as discussed in Section 4. It is assumed, as is typically the case, that the mss for these models is constant in i so the weights are comparable. Applying (4), with the single data set leads to with and (5) leads to posterior Note that in this case the posterior of y given x is well-defined via Bayesian conditioning and equals so there is no need to invoke Jeffrey’s conditionalization for the posterior. It is notable, however, that if the relative belief ratio for y is computed using this posterior and the prior then this equals which does not equal Given that the weights in (6) depend on the object of interest this does not correspond to linear pooling of the evidence and this is because the model is not constant. There is no reason to suppose that (6) will retain the good properties of linear pooling and experience with it suggests that it is not the correct way to combine. As such, the recommended approach is via (4), based on Jeffrey’s conditionalization, which retains the good properties of linear pooling. Suppose now the context is as discussed in Example 2 but the goal is to make a prediction concerning a future independent value So the i-th prior is given by and the i-th posterior is Table 2 gives the results for predicting y using the data in Example 2. The final row indicates what happens as and note that the weights converge as well with the i-th limiting weight proportional to which depends on the relative accuracy of the i-th prior with respect to the true mean When all the prior variances are the same, the prior which has its mean closest to the true value will give the heaviest weight. Also, as the i-th weight goes to 0. Note that the limiting plausible intervals are dependent on the prior and the interval does not shrink to a point because y is random. The limiting posterior content of these intervals is the probability content given by the true distribution of For the limiting plausible intervals for y to still be dependent on the prior is different from the situation when making inference about a parameter as, in that case, the plausible intervals shrink to the true value as the amount of data increases. The difference is that there is not a “true” value for The limiting plausible interval does not allow for all possible values for y and the effect of the prior is to disallow some possible values because belief in such a value is less than that specified by the prior of As can be seen from Table 2 this effect is not great unless the prior, as with here, puts little mass near the true value. However, such an occurrence also reduces the limiting weight for such a component. Consider now an example where the weights require adjustment.
Example 4. Location-normal models with different variances.
Consider a situation similar to Example 2 but now with three distinct models so this is Context II. Here the i-th statistician assumes that the true distribution is where the are known but is unknown and interest is in The same three priors are assumed as in Example 2. So the statisticians disagree about the “known” variance of the sampling distribution and an ancillary needs to play a role to make the weights comparable.
In this case is ancillary for each model and is independently distributed from the common mss and Therefore, with equal weights for the priors, and taking the ancillaries into account, the i-th weight satisfies From this it is seen that the assumed variances and the prior both play a role in determining how much weight a given analysis should have. Note that as or , and all other parameters are fixed, then the weight of the i-th analysis goes to 0 as it should as, in the limit, no information is being provided about the true value of . Proposition 6 tells us that when and the i-th variance is correct and the others are not, then the i-th inference base will dominate. Consider now an example where the models are truly different.
Example 5. Location with quite different models.
Consider again the context of Example 2 but suppose that one of the models, say the one in is a (Cauchy) location model, while the other models and all the priors are as previously specified. For all three inference bases is ancillary. To ensure that has the same interpretation across all inference bases, the density is rescaled by so that the interval contains of the probability for all 3 distributions. This implies and, with the first model is where To obtain the corresponding weight, the following expression needs to be evaluated numerically, When applied to the data of Example 2, very similar results are obtained. Table 3 contains the weights for the inference bases for this situation. The following example is of considerable practical importance.
Example 6. Linear regression.
Suppose that the data is for and there are two analysts, both of whom propose a simple regression model where with and unknown and z is a sample from for analyst 1 and is a sample from a distribution for analyst 2 for some value In both models is the variance of a Letting be the least squares estimate of β and then where and is ancillary for both models. Further suppose that the quantity of inferential interest is the slope parameter Denoting the relevant density of a by the joint density of given is proportional to The posterior density of can be worked out in closed form when f is the density but generally it will require numerical integration to determine the posterior density and the posterior weights for the combination. For the prior, suppose both analysts agree on and gammarate Note that the zero mean for β may entail subtracting a known, fixed constant vector from y so this, and the assumption that may entail some preprocessing of the data. The prior distribution of the quantity of interest is then where denotes the t distribution on degrees of freedom.
Obtaining the hyperparameters of the prior requires elicitation and this can be carried out using the following method as described in [34]. Suppose that it is known with virtual certainty, based on our knowledge of the measurements being taken, that will lie in the interval for some for all a compact set centered at 0 and contained in on account of the standardization. The phrase ‘virtual certainty’ is interpreted here as a probability greater than or equal to γ where γ is some large probability like Therefore, the prior on β must satisfy for all which implies where with equality when An interval that will contain a response value y with virtual certainty, given predictor value is Suppose that we have lower and upper bounds and on the half-length of this interval so that or, equivalently, holds with virtual certainty. Combining (8) with (7) implies To obtain the relevant values of and let denote the cdf of the gammarate distribution and note that Therefore, the interval for implied by (8) contains with virtual certainty, when satisfy or equivalently It is a simple matter to solve these equations for For this, choose an initial value for and, using (9), find w such that which implies If the left side of (10) is less (greater) than then decrease (increase) the value of and repeat step 1. Continue iterating this process until satisfactory convergence is attained. Consider now a numerical example drawn from [35] where the response variable is income in U.S. dollars per capita (deflated), and the predictor variable is investment in dollars per capita (deflated) for the United States for the years 1922–1941. The data are provided in Table 4. The data vector y was replaced by as this centered the observations about 0. Taking leads to the values The following prior is then used for both models, Table 5 presents the weights that result when different error distributions are considered to be combined with the results from a error assumption. Presumably this arises when one analyst is concerned that tails longer than the normal are appropriate. As can be seen, the normal error assumption dominates except for when the inferences do not differ by much in any case. This is not surprising as various residual plots do not indicate any issue with the normality assumption for these data. These weights were computed using importance sampling and were found to be robust to the prior by repeating the computations after making small changes to the hyperparameters. The approach taken in this example is easily generalized to more general linear regression models including situations where the priors change.
6. Conclusions
The problem of how to combine evidence has been considered for a Bayesian context where each analyst proposes a model and prior for the same data. Linear opinion pooling is seen as the natural way to make such a combination, at least when the inference bases only differ in the priors on the parameter of interest. This has been shown to have appropriate properties, such as preserving a consensus with respect to the evidence, and, when combining evidence is considered as opposed to just combining priors, to behave appropriately when considering independent events. In certain contexts the idea can be extended in a logical way based on the idea underlying Jeffrey conditionalization. This approach has been shown to behave correctly asymptotically in a wide variety of situations.
There are a number of factors that need to be considered when implementing the methods discussed here. As mentioned in the Introduction, we have assumed that each of the sampling models and priors used has been subjected to model checks and checking for prior–data conflict, respectively. As such, we are not considering combining the evidence obtained from contexts where the ingredients are contradicted by the data, and this is to be regarded as a key part of the analysis. It is also worth noting that [
36] establishes that relative belief inferences are optimally robust to the choice of the prior on
, and so, provided there is no prior–data conflict, a degree of robustness to the used priors can be expected. This does not, however, address issues concerned with sensitivity to the sampling models or to the priors used for nuisance parameters. There are also issues that arise for the choice of
. Unless there are good reasons to do otherwise, using uniform weights seems like the best choice as then only the data determines the relative weighting. For Context II, however, as discussed in
Section 4, there are general concerns with the comparability of the model weights and that has been only partially addressed here.
The developments here do not cover contexts where there are different data sets and different models. If the models are all for the same basic responses, then one possibility is to simply combine data sets and proceed, as we have demonstrated here. More generally, it may be that the only aspect in common among the models is the characteristic of interest
, and then it is not clear how we should combine this. The combination rule given by
where
suggests itself as a generalization of what has been considered here. Further investigation is required, however, as it is necessary to ensure that the weights
are indeed comparable, so some modification is probably required that is context-dependent.
It does not seem essential that we restrict attention to combination rules based on power mean priors as we have done here. For example, one could consider combining the
themselves according to some rule. For instance, a power rule could be used to combine these quantities. Even in Context I, however, this loses the interpretation of the combination as a valid measure of evidence through the principle of evidence. Of course, (
3) arises in both such approaches to the problem and probably should, no matter which generalized rule is adopted, when it is applied to Context I.
The problem of combining evidence is an important one in science, as evidenced by extensive discussion in the literature over many years. What has been shown here is that a very natural definition of how to measure statistical evidence can lead to a natural solution in a number of significant contexts.