1. Introduction
Suppose that
k different experts choose models and priors for a statistical analysis concerning a common quantity of interest
which is a parameter or a future value. A problem then arises as to how the resulting statistical analyses should be combined so that the inferences presented can serve as a consensus inference. If all the models are the same, then this is the well-known problem of combining priors and this is covered by our discussion here. Even for the problem of combining priors, however, a somewhat different point-of-view is taken. A particular measure of statistical evidence is adopted, as discussed in
Section 3, such that the data set, sampling model and prior lead to either evidence in favor of or against each possible value of
Throughout the paper, the word ‘evidence’ is often used alone, but it always refers to the statistical evidence rather than some alternative kind of evidence. In this paper, it is concluded that the linear pooling rule, see [
1], is the most appropriate for combining evidence.
The purpose then is to determine a consensus on what the evidence indicates by combining the measures of statistical evidence rather than focusing on combining priors. Since the primary goal of a statistical analysis is to express what the evidence says about
, this seems appropriate. Also, it is perfectly reasonable that some analyses express evidence against while others express evidence in favor but the combined expression of the evidence is one way or the other, see
Section 2.
Before discussing the combination approach, however, it is necessary to be more precise about the problem and distinguish between somewhat different contexts where the problem can arise. It will be supposed here that is a parameter of interest but prediction problems are easily handled by a slight modification, see Example 3. Let denote a generic statistical model and where is onto and, to save notation, the function and its range have the same symbol.
Context I. Suppose there is a single statistical model for the data x and k distinct priors so there are k inference bases for It is assumed that the conditional priors on the nuisance parameters are all the same, as is satisfied when This situation arises when there is a group of analysts who agree on and perhaps use a default prior for the nuisance parameters, while each member puts forward a prior for
Context II. Suppose there are k data sets, models, and priors as given by the inference bases for and there is a common characteristic of interest with the true value of being the same for each model, as will occur when corresponds to some real-world quantity. Strictly speaking, the function also depends on i when the parameter spaces differ, but we suppress this dependence because each context is referring to the same real-world object.
It is a necessary part of any statistical analysis that a model be checked to see whether or not it is contradicted by the data, namely, determining if it is the case that the data lies in the tails of each distribution in the model. So in any situation where there is a lack of model fit, it is necessary to modify that component of the inference base. Similarly, each prior needs to be checked for prior–data conflict, namely, is there an indication that the true value lies in the tails of the prior, see [
2,
3]. If such a conflict is found, then the prior needs to be modified, see [
4]. For the purpose of the discussion here, however, it is assumed that all the models and priors have passed such checks. A salutary effect of a lack of prior–data conflict is that it rules out the possibility of trying to combine priors which have little overlap in terms of where they place their mass.
Given an inference base
and interest in
a Bayesian analysis has an important consistency property. In particular, this inference base is equivalent, for inference about
to the inference base
where
is the marginal prior on
and
with
the prior predictive density of the data obtained by integrating out the nuisance parameters via the conditional prior
for
given
So, for example, the posterior
for
obtained via these two inference bases is the same and moreover the evidence about
is also the same. This result has implications for the combination strategy as it is really the inference bases
that are relevant in Context I and it is the inference bases
that are relevant in Context II, namely, nuisance parameters are always integrated out before combining.
Note that if, in a collection of inference bases for all the models are based on sampling from the same basic model, and the conditional priors on the nuisance parameters are all the same, then it makes sense to combine the data sets, with the combined model being based on the full sample x, so we are in Context I as only the marginal priors differ. This combination would not be possible if the conditional priors on the nuisance parameters differed, as then the models would be different. We will assume hereafter that the following principle has been applied.
Combining inference bases rule: all data sets that are assumed to arise from the same set of basic distributions are combined whenever the conditional priors on the nuisance parameters are the same, so that separate data sets are associated with truly distinct models and/or priors.
This rule ensures that any combination reflects true differences among the beliefs concerning where the truth about lies as there is agreement on the other ingredients. It is assumed hereafter that this is applied before the inference bases are determined. Note that, even if the basic model is the same for each inference base, when the conditional priors on the nuisance parameters differ, this is Context II.
In
Section 2 a general family of rules for combining priors with given weights is presented. In
Section 3 the problem of combining evidence for Context I is analyzed, with given weights for the respective priors, and the linear pooling combination rule is seen to have the most appropriate properties with respect to evidence. In
Section 3.1 the problem of determining appropriate weights is considered. In
Section 4 the problem for Context II is discussed and a proposal is made for a rule that generalizes the rule for Context I. The rule for Context I possesses a natural consistency property as the combined evidence is the same whether considered as a mixture of the evidence arising from each inference base or obtained directly from the combined prior and the corresponding posterior. In particular, it is Bayesian in this generalized sense which differs from being externally Bayesian as discussed in [
5]; see
Section 3. This is not the case for Context II, however, because of differing nuisance parameters and ambiguities in the definition of the likelihood, but Jeffrey conditionalization provides a meaningful interpretation, at least when all the inference bases contain the same data.
The problem of combining priors has an extensive literature. Ref. [
6] is a basic reference and reviews can be found in [
7,
8,
9,
10]. Ref. [
11] is a significant recent application. Broadly speaking there are mathematical approaches and behavioral approaches. The mathematical approach provides a formal rule, as in
Section 2, while the behavioral approach provides methodology for a group of proposers to work towards a consensus through mutual interaction. For example, ref. [
12] considers the elicitation procedure where quantities concerning the object of interest are elicited by each member of a group and then the average elicited values are used to choose the prior. Ref. [
13] adopts a supra-Bayesian approach where the data generated during the elicitation process is conditioned on in a formal Bayesian analysis to choose a prior in a family on which an initial prior has been placed. Ref. [
14] presents an iterative methodology for a group of proposers to work towards a consensus prior based upon each proposer seeing how far their proposal deviated from a current grouped proposal. While the behavioral approach has a number of attractive features, there are also reservations as indicated by Kahneman in [
15].
The focus in this paper is on presenting a consensus assessment of the evidence via a combination of the evidence that each analyst obtains. In particular, the priors
need not arise via the same elicitation procedure and the proposers may not be aware of other proposals although the approach does not rule this out. Also, utility functions, necessary for decisions, are not part of the development as these may indeed lead to conflicts with what the evidence indicates and they are not generally checkable against the data as with models and priors. The assessment of statistical evidence as the primary driver of statistical methodology is a theme that many authors have pursued, for example, see [
16,
17,
18,
19]. Ref. [
20] reviews many of the attempts to provide a precise definition of the concept of statistical evidence. Ref. [
21] discusses the importance of the amalgamation of evidence, although evidence there references a more general concept than what is considered here.
Throughout the paper the densities of probability distributions will be represented by lower case symbols and the associated probability measures will be represented by the same symbol in upper case. For example, if a prior density is denoted by π, then the prior probability measure will be denoted by Π, with the posterior density denoted π(· | x) and the posterior probability measure by Π(· | x).
2. Combining Priors with Given Prior Weights
Let
the
-dimensional simplex for some
and, for now, suppose that
is given. While general combination rules could be considered, attention is restricted here to the power means of densities
where
and, for any
and sequence of nonnegative functions
defined on
then
is the relevant normalizing constant. Note that
and
do not depend on
For each
the mean
is nondecreasing in
see [
22], and two of the means are equal everywhere iff all priors are the same. Since
this implies that
is finite for all
whenever
If
is to be considered, then it is necessary to check on the integrability of the mean so that a proper prior is obtained and this will be assumed to hold whenever the case
is referenced. When
is finite, this is not an issue.
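As a concrete illustration of these power mean combinations, the following sketch (in Python; the discrete parameter space, the two priors and the weights are hypothetical, and the notation is ours rather than the paper's) computes the pooled prior for several values of t, including the linear pool (t = 1) and the logarithmic pool (t = 0).

```python
import numpy as np

def power_mean_pool(priors, alpha, t):
    """Pool k discrete priors (rows of `priors`) with weights alpha using the
    power mean of degree t; t = 1 is the linear pool, t = 0 the logarithmic pool."""
    priors = np.asarray(priors, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    if t == 0:
        # limiting case: weighted geometric mean of the densities
        pooled = np.exp(alpha @ np.log(priors))
    else:
        pooled = (alpha @ priors**t) ** (1.0 / t)
    return pooled / pooled.sum()   # normalize so a proper prior is obtained

# hypothetical example: two priors on a 5-point parameter space, equal weights
pi1 = np.array([0.05, 0.15, 0.40, 0.30, 0.10])
pi2 = np.array([0.30, 0.30, 0.20, 0.15, 0.05])
alpha = [0.5, 0.5]
for t in (1.0, 0.5, 0.0, -1.0):
    print(t, power_mean_pool([pi1, pi2], alpha, t))
```

For t = 1 the pool is the weighted average of the priors, for t = 0 it is the normalized weighted geometric mean, and for t = -1 it is a normalized weighted harmonic mean.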
The following result characterizes how the posterior behaves in terms of a combination of the individual posteriors. Let denote the i-th prior predictive density based on prior denote the prior predictive density obtained using the prior and • denotes component-wise multiplication of two vectors of the same dimension.
Proposition 1. For Context I, the posterior based on the combined prior equals the normalized power mean, of the same degree, of the individual posteriors with updated weights, and satisfies the stated inequality when
Proof. The expressions for
for
are obvious and
so the factor
cancels giving the result. Finally,
and this is bounded above (below) by
when
which gives the inequality. □
So the posterior is always proportional to a power mean of the individual posteriors of the same degree as the power mean of the priors but, excepting the
case, the weights have changed and when
or
the prior and posterior do not depend on
The posterior resulting when
is
and so is a linear combination of the individual posteriors but with different weights than the prior. The case
is called the
linear opinion pool, see [
1], and when
it is called the
logarithmic opinion pool.
The property of the weights staying constant from a priori to a posteriori for
or even independence from the weights, may seem appealing but, as discussed in
Section 3, these combination rules have properties that make them inappropriate for combining evidence. A combination rule is said to be
externally Bayesian when the rule for combining the posteriors is the same as the rule for combining the priors. As shown in [
5,
23], logarithmic pooling is characterized by being externally Bayesian while linear pooling only satisfies this when there is a dictatorship, namely,
for some
as otherwise the weights differ. Proposition 2 (iii) shows, however, that there is a sense in which linear pooling can be considered as Bayesian.
Linear pooling has a number of appealing properties.
Proposition 2. For Context I, linear pooling satisfies the following:
(i) the prior probability measures satisfy the same combination rule as the densities, namely, and similarly for the posterior measures,
(ii) marginal priors obtained from are equal to the same combination of the marginal priors obtained from the , and this is effectively the only rule with this property among all possible combination rules,
(iii) if is given the joint prior distribution with density , then the posterior density of θ is given by (1) and the weight is the posterior probability of the index i.
Proof. The proof of (i) is obvious while (ii) is proved in [
24] and holds here with no further conditions. For (iii), note that
is the conditional prior of
given
i and
is the conditional density of
x given
Once
x is observed, the posterior of
is then given by
which implies that the marginal posterior of
is (
1) and the posterior probability of
i is
. □
The significance of (i) is that the other combination rules considered here do not exhibit such simplicity and require more computation to obtain the measures. Property (ii) implies that integrating out nuisance parameters before or after combining does not affect inferences about a marginal parameter
in Context I, as the conditional priors on the nuisance parameters being the same implies that the marginal models for
are all the same. Ref. [
25] proves a similar result allowing for negative
. Property (iii) shows that both the prior
and the posterior
arise via valid probability calculations when
is known. A possible interpretation of this is that
represents the combiner’s prior belief in how well the
i-th prior represents appropriate beliefs concerning the true value of
relative to the other priors. The posterior weight
is then the appropriate modified belief after seeing the data, as the factor
reflects how well the
i-th inference base has done at predicting the observed data relative to the other inference bases. This is a somewhat different interpretation than that taken by [
26] where
represents the combiner’s prior belief that the
i-th inference base is the true one which, in this context, does not really apply.
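A small numerical sketch of this interpretation, using a hypothetical beta-binomial setup of our own rather than anything from the paper, computes the prior predictive probability of the observed data under each prior and the resulting posterior weights, which are proportional to the prior weight times the prior predictive.

```python
import numpy as np
from scipy.stats import betabinom

# Hypothetical Context I setup: x ~ binomial(n, theta) with k = 3 beta priors on theta.
n, x = 20, 14                                # observed data (hypothetical)
a = np.array([1.0, 10.0, 2.0])               # beta(a_i, b_i) prior parameters (hypothetical)
b = np.array([1.0, 10.0, 8.0])
alpha = np.full(3, 1/3)                      # prior weights on the inference bases

# prior predictive of x under prior i is beta-binomial(n, a_i, b_i)
m = betabinom.pmf(x, n, a, b)
post_weights = alpha * m / np.sum(alpha * m)
print(post_weights)   # bases whose priors predicted x well receive larger posterior weight
```

Inference bases whose priors did a relatively good job of predicting the observed data receive larger posterior weights, which is exactly the reading of the factor described above.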
One commonly cited negative property of linear pooling, see [
27], is that if
A and
C are independent events for each
then generally
It is to be noted that if also one of
or
is constant in
then independence is preserved and this will be seen to play a role in linear pooling behaving appropriately when considering statistical evidence, see Proposition 4(ii) and the discussion thereafter.
3. Combining Measures of Evidence in Context I
The criterion for choosing an appropriate combination should depend on how statistical evidence is characterized, as using the evidence to determine inferences is the ultimate purpose of a statistical analysis. The underlying idea concerning evidence used here is the following principle.
Principle of Evidence: there is evidence in favor of a value if its posterior probability is greater than its prior probability, there is evidence against the value if its posterior probability is less than its prior probability, and there is no evidence either way if the two are equal.
The basic idea is that, if the data has led to an increase in the belief that
is the true value from a priori to a posteriori, then the data contains evidence in favor of
etc. This interpretation is obviously the case when the prior is a discrete distribution and it also holds in the continuous case via a limit argument, see [
17]. The principle of evidence does not require that a specific numerical measure of evidence be chosen, only that any measure used be consistent with this principle, namely, that there is a cut-off such that a numerical value greater than (less than) the cut-off corresponds to evidence in favor of (against), as indicated by the principle. The relative belief ratio
the ratio of the posterior to the prior, with the cut-off 1, is used here as it has a number of good properties, see [
17]. It is also particularly appropriate for the combination of the evidence as easily interpretable formulas result. The Bayes factor is also a valid measure of evidence, but there are many reasons to prefer the relative belief ratio to measure evidence, as discussed in [
28].
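For reference, writing ψ for the quantity of interest and using notation introduced here, the relative belief ratio just described is the ratio of the posterior density to the prior density, with the principle of evidence applied using the cut-off 1:

\[
RB_\Psi(\psi \mid x)=\frac{\pi_\Psi(\psi \mid x)}{\pi_\Psi(\psi)},\qquad
\begin{cases}
RB_\Psi(\psi \mid x)>1 & \text{evidence in favor of } \psi,\\
RB_\Psi(\psi \mid x)<1 & \text{evidence against } \psi,\\
RB_\Psi(\psi \mid x)=1 & \text{no evidence either way.}
\end{cases}
\]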
The next result examines the behavior of the combination rules of
Section 2 with respect to evidence and is stated initially for the full model parameter
in Context I. For this
is the relative belief ratio for
that results from the
i-th inference base
and
is the relative belief ratio for
that results from combining the
k priors using the
t-th power mean combination rule.
Proposition 3. For Context I, the relative belief ratio for θ based on the prior satisfies (2).
Proof. Using
and
then
□
This result shows the value of using the relative belief ratio to express evidence, as the combination rule, at least for power means, is quite simple and natural. Notice too that if there are only l distinct priors, then the combination rules for the priors, posteriors and relative belief ratios are really only based on these distinct priors and the weights change only by summing the that correspond to common priors.
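To record the linear pooling case explicitly, and in notation we introduce here for concreteness, suppose Context I holds with common sampling density f_θ and prior predictives m_i(x) = ∫ f_θ(x) π_i(θ) dθ. Then the relative belief ratio based on the linear pool is the corresponding convex combination of the individual relative belief ratios,

\[
RB_{1,\alpha}(\theta \mid x)=\frac{f_\theta(x)}{m_\alpha(x)}
=\sum_{i=1}^{k}\frac{\alpha_i\, m_i(x)}{m_\alpha(x)}\,RB_i(\theta \mid x),
\qquad m_\alpha(x)=\sum_{j=1}^{k}\alpha_j\, m_j(x),
\]

where RB_i(θ | x) = f_θ(x)/m_i(x). Since the weights are nonnegative and sum to one, the combined relative belief ratio always lies between the smallest and largest of the individual ones, which is the source of the consensus-preserving behavior discussed below.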
The result in Proposition 3 is another indication that the correct way to combine priors, from the point of view of measuring evidence, is via linear pooling as
is always proportional to
The constant multiplying
in (
2) suggests that finding
t that minimizes
, leads to the power mean prior that maximizes the amount of mass the prior places at
see Proposition 5 (iv). But there is a significant reason for preferring
over the other possibilities. Suppose that
for all
or
for all
Then it is clear that
in the first case and
in the second case. In the first case there is a consensus that there is evidence against
being the true value and in the second case there is a consensus that there is evidence in favor of
being the true value. In other words
is consensus preserving and this seems like a necessary property for any approach to combining evidence.
A formal definition is now provided which takes into account that sometimes , indicating that there is no evidence either way, which implies that the i-th inference base is agnostic about whether or not is the true value.
Definition A rule for combining evidence about a parameter is called consensus preserving if, whenever at least one of the inference bases indicates evidence in favor of (against) a value of the parameter and the remaining inference bases do not give evidence against (in favor), then the rule gives evidence in favor of (against) the value and if no inference base indicates evidence one way or the other, then neither does the combination.
The following property is immediately obtained for linear pooling.
Proposition 4. For Context I, whenever for all then (i) is consensus preserving and (ii) whenever for all then iff for all i.
The property of preserving consensus is similar to the unanimity principle for priors, see [
7], which says that if all the priors are the same, then the combination rule must give back that prior and all the power mean rules satisfy this.
Proposition 4 (ii) indicates that linear pooling deals correctly with independent events at least with respect to evidence. For note that, for probability measure P and events A and C satisfying then A and C are statistically independent iff So, independence is equivalent to saying that the occurrence of C provides no evidence concerning the truth or falsity of A and conversely. Now consider the statistical context and suppose and further suppose that all the probabilities are discrete. This implies that which implies that the joint prior density at factors as and so the events and are statistically independent in the i-th inference base. If this holds for each then is constant in i and so indeed implies that these events are independent when the prior is the linear pool. With a continuous prior, then can also happen, but typically this event has prior probability 0.
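In symbols, and in notation used here, for a probability measure P and events A and C with P(C) > 0, the equivalence underlying this discussion is

\[
A \text{ and } C \text{ independent} \iff P(A \mid C)=P(A) \iff RB(A \mid C)=\frac{P(A \mid C)}{P(A)}=1,
\]

so independence of A and C is exactly the statement that the occurrence of C provides no evidence either way concerning A, and conversely.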
It is of interest to determine whether or not any of the other rules based on the means are consensus preserving. The inequality in Proposition 1 and Proposition 3 imply that, when then with with the inequality typically strict when This suggests that might even contradict the consensus of evidence in favor. A similar argument holds for The following example shows that generally the combination rules based on power means of priors are not consensus-preserving.
Example 1. Power means of priors are not generally consensus preserving.
Suppose and is observed. There are two priors given by and Then so both inference bases give evidence against when When , then so no evidence either way is obtained from the data when a statistician is categorical in their beliefs. Note that being categorical in your beliefs is a possible choice provided it does not lead to prior–data conflict. In this case, there is no prior–data conflict even with since there is a reasonable probability of observing when When so the two priors are being given equal weight, then When so statistician 1 is categorical in their beliefs, and then So, statistician 1 finds no evidence either way for a being the true value from the data and this is because, when a prior is categorical, the data is irrelevant as it does not change beliefs. Statistician 2 finds evidence against a and the posterior probability of indicates reasonably strong belief in a not being the true value. Linear pooling indicates evidence against a, as it should, and the posterior probability of indicates weak belief in a not being the true value, and this decrease in the strength of the evidence against is because of the first statistician’s complete confidence in the truth of a and the combination of beliefs. Note that indicates complete indifference between the quality of the statisticians’ priors but, if we put less weight on the first statistician’s prior, then the evidence against and its strength move closer to that of statistician 2.
Now consider logarithmic pooling where In particular, with then , no matter what α is, and for every By Proposition 3, with which indicates no evidence for or against a being the true value. Therefore, logarithmic pooling is not consensus preserving. The illogicality of this is readily apparent as it suggests that no evidence has been found one way or the other and that is not the case. Next consider the case so When and then and which shows that this combination rule is also not consensus preserving. In this context, and based on numerical computation, it seems that for every and so all of these combination rules are not consensus preserving and note that this includes the harmonic mean combination rule. If there is evidence against (in favor of) an event, then a property of the relative belief ratio gives that there is evidence in favor of (against) its complement and, if there is no evidence either way for an event, then there is no evidence either way for its complement, see [17], Proposition 4.2.3 (i). So in this example the priors and also do not preserve consensus with respect to So far no case has been found where a combination based on a power mean actually reverses a consensus and it is a reasonable conjecture, based on many examples, that this will never happen but a proof is not obvious. Also, other power means may preserve consensus but currently we do not have such a result. Logarithmic pooling could be considered as the main rival to linear pooling, but Example 1 shows that it does not preserve consensus.
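Since the specific numbers in Example 1 are set in the original text, the following sketch uses hypothetical values of our own that reproduce the structure of the example: a two-point parameter space, one categorical prior (a point mass at the value a) and one non-categorical prior, with data giving evidence against a under the second prior.

```python
import numpy as np
from scipy.stats import binom

# Hypothetical version of Example 1: theta takes one of two values,
# statistician 1 is categorical (point mass at theta = a), statistician 2 is not.
theta_vals = np.array([0.25, 0.75])      # a corresponds to theta = 0.75
x, n = 2, 10                             # observed data (hypothetical)
lik = binom.pmf(x, n, theta_vals)

pi1 = np.array([0.0, 1.0])               # categorical prior at a
pi2 = np.array([0.5, 0.5])
alpha = np.array([0.5, 0.5])

def posterior(prior):
    p = prior * lik
    return p / p.sum()

def rb(prior):                           # relative belief ratios, elementwise
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(prior > 0, posterior(prior) / prior, np.nan)

print("RB_1(a) =", rb(pi1)[1])           # 1: the data cannot change categorical beliefs
print("RB_2(a) =", rb(pi2)[1])           # < 1: evidence against a

# linear pool (t = 1): evidence against a is preserved
lin = alpha @ np.vstack([pi1, pi2])
print("linear pool RB(a) =", rb(lin)[1])

# logarithmic pool (t = 0): the point mass forces RB(a) = 1, losing the consensus
log_pool = np.prod(np.vstack([pi1, pi2]) ** alpha[:, None], axis=0)
log_pool = log_pool / log_pool.sum()
print("log pool RB(a) =", rb(log_pool)[1])
```

With these numbers the linear pool reports evidence against a, while the logarithmic pool returns a relative belief ratio of 1 at a, illustrating the failure of consensus preservation described above.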
There is another interesting consequence of Proposition 3 which is relevant when the goal is to estimate The natural estimate is the relative belief estimate where the accuracy of is assessed by the plausible region the set of values for which there is evidence in favor. For example, the “size” of and its posterior content together provide an a posteriori measure of how accurate is. Ideally we want “small” and its posterior content high. The size of can be measured in various ways such as Euclidean volume, cardinality, or prior content, with the context determining which is most suitable. Note that it is easy to show in general that so provided is not 1 for all which only occurs when the data indicates nothing about the true value.
Corollary 1. Whenever is not 1 for all θ and for all then
So the estimate of based on maximizing the evidence in favor is determined by linear pooling for every t. It is not the case, however, that the plausible region is independent of t because of the constant
The following underscores the role of linear pooling in preserving consensus.
Corollary 2. The set for all and
So the set of where there is a consensus that there is evidence in favor is always contained in the plausible region determined by linear pooling. A similar comment applies to the implausible region which is the set of all values where there is evidence against. While it might be tempting to quote the region there is no guarantee that any of the relative belief estimates will be in this set, whether determined by or any of the
The situation with respect to the assessment of the hypothesis
is a bit different. Clearly, if
for all
so there is a consensus that there is evidence in favor of (against)
then
preserves this consensus. In general, when the evidence in favor of or against
is assessed via a relative belief ratio
then the posterior probability
can be taken as a measure of the strength of the evidence, see [
17]. In the context under discussion here, it follows from (
2) that the event
for all
Of course, the posterior probability of this event will depend on
t but linear pooling completely determines the event.
Now suppose that interest is in the quantity
and the assumptions of Context I hold so that prior beliefs only differ concerning the value of
, which implies that the inference bases only differ with respect to the priors on
This situation may arise when the analysts all agree to use a common default prior on the nuisance parameters. Then we can treat
as the model parameter for the common model
and the relevant linear pooling rule is
where
is the relative belief ratio for
obtained from the
i-th inference base. Note that the results derived for
also apply for inferences about
In general it can be expected that some inference bases will indicate evidence in favor of
being the true value and some will indicate evidence against, but
will indicate evidence one way or the other or even perhaps no evidence either way. This depends on the values assumed by the
as well as the weights
, with larger values of a weight leading to a greater contribution to the overall inferences by the corresponding inference base. This aspect is discussed in
Section 3.1.
Consider now the context where
is an
sample. The following result gives the consistency of this approach when the model parameter space
is finite. Such results will hold more generally but require some mathematical constraints on densities and this is not pursued further here. Let
be the relative belief estimate of
based on linear pooling. All the convergence results are almost everywhere as
with the proofs in the
Appendix A.
Proposition 5. For Context I, suppose is an sample from a distribution in a model having a finite parameter space
and each prior for θ is everywhere positive on Then
(i) and (ii) and
(iii)
(iv)
Noting that when then Proposition 5 (i) says that the evidence in favor of (against) based on the combination goes to categorical when is true (false). Part (ii) says that the relative belief estimate based on the combination is consistent. Part (iii) implies that, when the priors are equally weighted, the inference base whose prior gives the largest value to the true value will inevitably have the largest weight in determining the combined evidence. As previously mentioned, part (iv) suggests choosing t to minimize the ratio as this can be associated with choosing the power combination prior that maximizes the amount of belief the prior places on the true value. This has the unnatural consequence, however, that the prior is being determined by the data.
Our overall conclusion, based on the results established here, is that linear pooling is the most natural way to combine evidence among the power means. As such, attention is restricted to this case hereafter. Various authors, when discussing the combination of priors, have come to a similar conclusion. For example, ref. [
10], when considering the full spectrum of methods for combining priors, contains the following assertion, “In general, it seems that a simple, equally weighted, linear opinion pool is hard to beat in practice”. The results developed here support such a conclusion when considering evidence.
3.1. Determining the Prior Weights
The discussion so far has assumed that
is known but arguments or methodologies for choosing
need to be considered. There are several possible approaches to determining a suitable choice of the prior weights and nothing novel is proposed here. As previously mentioned, the
can represent the combiner’s beliefs concerning how well the
i-th prior represents appropriate beliefs about
The combiner’s beliefs should of course be based upon experience or knowledge concerning the various proposers of the priors. In the absence of such knowledge, uniform weights, namely,
seem reasonable. Ref. [
29] provides a good survey of various approaches to choosing
. Also, ref. [
30,
31] present a novel iterative approach to determining a consensus
among the proposers.
In Context I notice that the weights only depend on the data through some function of the value of the minimal sufficient statistic (mss) for the model. So, for example, if the priors are distinct and equally weighted via then the weight of the i-th prior is and so more weight is given to those inference bases that do a better job, relatively speaking, of predicting a priori the observed value of this function of the mss. Since it is only the observed value of the mss that is relevant for inference, this seems sensible. There is the possibility, however, to weight some priors more than others for a variety of reasons.
A prior can also be placed on the results examined for a number of different choices of and summarized in a way that addresses the issue of whether or not the inferences are sensitive to For example, suppose the goal is to determine if there is evidence for or against the hypothesis For a given weighting the evidence for or against will be determined by the value Accordingly, a Dirichlet prior with mode at and with some degree of concentration around this point could be used to assess the robustness of the combination inferences. In particular, for each generated value of from the prior, one can record whether evidence in favor of or against was obtained together with the strength of the evidence. If a large proportion of the results are similar to those obtained with the weights then this would provide some assurance that the conclusions drawn are robust to deviations. A similar approach can be taken to estimation problems where the relative belief estimate is given by When is 1-dimensional, a histogram of the estimates obtained in the simulation and histograms of the prior and posterior contents of will provide an indication of the dependence on
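A minimal sketch of this sensitivity analysis, with the individual relative belief ratios at the hypothesized value, the prior predictive densities and the base weighting all hypothetical, might proceed as follows.

```python
import numpy as np
rng = np.random.default_rng(0)

# Hypothetical ingredients for k = 3 inference bases (Context I):
rb0 = np.array([1.8, 0.7, 1.3])       # RB_i at the hypothesized value (in favor / against)
m   = np.array([0.12, 0.30, 0.20])    # prior predictive densities m_i(x)
alpha_star = np.array([1/3, 1/3, 1/3])
conc = 50.0                            # concentration of the Dirichlet around alpha_star

def combined_rb(alpha):
    w = alpha * m
    w = w / w.sum()                    # posterior weights alpha_i(x)
    return w @ rb0                     # linear-pool relative belief ratio at the value

draws = rng.dirichlet(conc * alpha_star, size=10_000)
rbs = np.array([combined_rb(a) for a in draws])
print("proportion of weightings giving evidence in favor =", np.mean(rbs > 1))
print("baseline RB at alpha* =", combined_rb(alpha_star))
```

The proportion of draws giving evidence in the same direction as the baseline weighting provides a simple summary of how sensitive the conclusion is to the choice of the weights.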
4. The General Problem
The general Context II is more complicated and an overall solution is not proposed here. Rather, a special case is considered when there is a common data set. So, k analysts are making inferences about the same real-world object based on the same data, but they are using possibly different models and different priors. Since Context II covers Context I, it is necessary that any rule proposed for such situations agrees with what is determined for Context I when that applies.
While it may seem reasonable to take the prior on to be the linear mixture this cannot be viewed as a marginal prior obtained by integrating out nuisance parameters from , as in Context I, because the nuisance parameters vary with Also, even if we elected to use this prior, the overall posterior does not have a clear definition as it is not obvious how to form the likelihood. As such, a different approach and justification is required.
The simplest approach to characterizing the evidence in Context II is to use
where again
and
arise from the
i-th inference base and
. This will agree with the answer obtained in Context I when it applies, but generally
is not the ratio of the posterior of
to its prior. As such, it cannot be claimed that (
4) is a valid characterization of the evidence, as
is in Context I, even though each
is a valid measure of evidence.
One approach to defining a prior and a posterior in Context II is to use the argument known as Jeffrey conditionalization, see [
32]. This involves considering the probabilities on the partition given by
completely separately from the probabilities on
given
If we knew
then standard conditioning leads to
as the expression of posterior beliefs about
But
i is unknown and all that is available are the probabilities given by
and Jeffrey conditionalization suggests
and
as the appropriate expressions of prior and posterior beliefs.
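In notation introduced here for concreteness, with prior weights α_i and posterior weights α_i(x) proportional to α_i m_i(x), Jeffrey conditionalization as applied in this setting amounts to taking

\[
\pi_\Psi(\psi)=\sum_{i=1}^{k}\alpha_i\,\pi_{i,\Psi}(\psi),\qquad
\pi_\Psi^{*}(\psi \mid x)=\sum_{i=1}^{k}\alpha_i(x)\,\pi_{i,\Psi}(\psi \mid x),\qquad
\alpha_i(x)=\frac{\alpha_i\, m_i(x)}{\sum_{j=1}^{k}\alpha_j\, m_j(x)},
\]

with the combined evidence then reported as the corresponding mixture of the individual relative belief ratios, as in (4).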
But note that, based on the
k inference bases,
can be thought of as the prior probability distribution for
which leads to
as the prior for
Since the likelihood
depends on
however, Context I does not apply. Still, the joint prior for
is
and, after observing
from the combiner’s point-of-view, this leads to the posterior probability
for
From the
i-th analyst’s viewpoint,
gives the appropriate posterior for
and so, applying the Jeffrey conditionalization idea, leads to the combination posterior for
given by
This could be considered as a generalization of Jeffrey’s idea as now the probabilities on the partition elements and
both depend on the data. Furthermore, extending Jeffrey’s idea to the combination of the measurement of evidence, we obtain (
4). While this is not formally a valid measure of evidence, (
4) will satisfy all the properties of linear pooling established for Context I with the exception of Proposition 2 (iii). In particular,
will preserve a consensus about evidence in favor or against. A key reason for not using
and
as the prior and posterior to determine the evidence, is that the nice properties of linear pooling are lost, see Example 3.
The following result characterizes what happens as sample size grows and is proved in the
Appendix A. Again convergence is almost everywhere.
Proposition 6. Suppose is an sample from a distribution in at least one of the models and each of the parameter spaces is finite with the prior everywhere positive on Denoting the set of indices corresponding to the models containing the true distribution by then as
(i) and
(ii) which is greater than 1 when
(iii)
So Proposition 6 shows that and provide consistent inferences and the weights converge to appropriate values.
There is another significant difference between (
4) and (
3). In Context I the weights all depended on the data through the same function of a constant mss for the full common model. Furthermore, if
is an ancillary statistic for the full model, then it is seen that the
i-th weight satisfies
This implies that the weights are comparable as they are all concerned with predicting essentially the same data and moreover they are not concerned with predicting aspects of the data that have no relation to the quantity of interest. In Context II this is not necessarily the case which raises the question of whether or not the weights are comparable.
It is not obvious how to deal with this issue in general, but in some contexts the structure of the models is such that where L has fixed dimension and A is ancillary for each model. For example, if all the models are location models, then where is a column of 1’s, and is ancillary. In such a case, it is desirable to determine the weights based on how well the inference bases predict the value of and not To take account of this it is necessary that Jeffrey conditionalization be modified so that the i-th posterior weight is now proportional to where is the i-th prior predictive of the data given Examples 4 and 5 illustrate this modification.
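In the notation used here, writing a = A(x) for the ancillary component and m_i(· | a) for the i-th prior predictive density of the data given the ancillary, the modified posterior weight is taken to be proportional to

\[
\alpha_i(x)\;\propto\;\alpha_i\, m_i(x \mid A(x)),
\]

so that the inference bases are compared only on the part of the data that actually carries information about the quantity of interest.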
While Proposition 6 does not apply with the conditional weights, a similar result can be proved and for this some assumptions are imposed to simplify the proof. Let the basic sample space be such that there is a finite ancillary partition
applicable to each of the
k models, and for any
n the ancillary is given by
where
records the number of values in the sample that lie in
Then the probability distribution of
for the
i-th model is given by the multinomial
where the
are fixed and independent of the model parameter. Denote this probability function at the observed data by
where
Suppose that each parameter space
is finite with the prior
everywhere positive. Let
for
and
J denote the set of indices containing the true distribution. Calling these requirements condition
the following is proved in the
Appendix A.
Proposition 7. If condition ★
holds, then and
Of course, this result needs to be generalized to handle even a situation like the location model. For this, some conditions on the models and priors are undoubtedly required, but this is not pursued further here. One key component of the proof is the existence of the ancillary partition
and such a structural element seems necessary generally to obtain the comparability of the weights. In group-based models, like linear regression and many others, such a structure exists via the usual ancillaries, see Example 5. As an approximation, a finite ancillary partition can be constructed via the ancillary statistic in question and so Proposition 7 is applicable. It should also be noted that, if the original models are replaced by the conditional models given the ancillary, then (
4) gives the same answer as this modification as the values of
are unaffected by the conditioning.
Clearly there are connections between the combination rule for statistical evidence advocated here and Bayesian model averaging as discussed in [
33]. In fact, the posterior (
5) is the same as that obtained from Bayesian model averaging. The focus here, however, is on the inferences that arise from a direct measure of statistical evidence rather than basing these on the posterior alone and these inferences are different. That posterior probabilities do not provide a suitable measure of evidence can be seen from simple examples such as the Prosecutor’s Fallacy as discussed in [
28] (Example 4). It is shown there that the posterior probability of an event (guilt) being true can be very small but there is still clear evidence that the event is true. So, this is only weak evidence because the posterior probability indicates a small belief in what the evidence indicates. As has been demonstrated here, the consensus preserving feature supports the linear rule over other possible candidates for combining and this, together with Jeffrey conditionalization, also supports the posterior (5) obtained via Bayesian model averaging. Issues concerned with the comparability of the weights remain to be more fully addressed for both methodologies.
5. Examples
Some examples are now considered that demonstrate a number of considerations.
Example 2. Location-normal model with normal priors.
Suppose is a sample from a distribution where the mean is unknown but the variance is known. It might be more appropriate to model this with an unknown variance but this situation will suffice for illustrative purposes and there are applications for it in physics, where the variation arising from a given measurement process is well understood. The model is then given by, after reducing to the mss , the collection of distributions and so this is Context I. Suppose there are three analysts and they express their priors for μ as distributions for so the i-th posterior is and these ingredients determine the relative belief ratios. For combining, the prior predictives are also needed and the i-th prior predictive density for is the density. Suppose the inference bases are equally weighted, so the posterior weight of the i-th analysis relative to the others is determined by how well the observed value fits the distribution. Note, however, that even if there is a perfect fit, in the sense that the weight still depends on the quantity For example, if the are all equal and there is a perfect fit, then the i-th weight is proportional to and this weight goes to 0 as with the other prior variances constant and goes to its largest value when This suggests that making a prior quite diffuse leads to reducing the impact the corresponding inference base has in the combined analysis.
Consider a specific data example where the true value is with , and sample sizes Data were generated from the true distribution obtaining the values respectively. For the priors, we use equally weighted. Figure 1 plots the combined prior, posterior and relative belief ratio for the case. Table 1 records the estimates of the plausible regions together with the posterior and prior contents of these intervals for each inference base and linear pooling. Note that, in this case, because the model is the same for each inference base and μ is the model parameter, the estimates are all equal to the MLE of μ but the plausible intervals and their posterior contents differ. Consider now prediction, which produces the interesting consequence that Context II now obtains even when all the models are the same.
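The particular numbers for this example are set in the original text, so the following sketch reproduces only the structure of the computation with hypothetical data and hypothetical normal priors: it forms the conjugate posteriors, the posterior weights, the linearly pooled prior and posterior for μ on a grid, and the resulting relative belief estimate and plausible interval.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical stand-ins for the ingredients of Example 2: N(mu, sigma0^2) model,
# sigma0 known, three N(m_i, tau_i^2) priors on mu with equal prior weights.
sigma0, n = 1.0, 20
xbar = 0.31                                   # hypothetical observed sample mean
prior_means = np.array([0.0, 1.0, -2.0])      # hypothetical prior means m_i
prior_sds   = np.array([1.0, 0.5, 1.0])       # hypothetical prior standard deviations tau_i
alpha = np.full(3, 1/3)

# conjugate posteriors N(post_mean_i, post_sd_i^2) for mu under each prior
prec = 1/prior_sds**2 + n/sigma0**2
post_sds = np.sqrt(1/prec)
post_means = (prior_means/prior_sds**2 + n*xbar/sigma0**2) / prec

# prior predictive of xbar under prior i and the resulting posterior weights
m = norm.pdf(xbar, prior_means, np.sqrt(prior_sds**2 + sigma0**2/n))
w = alpha*m / np.sum(alpha*m)

# combined prior, posterior and relative belief ratio for mu on a grid
mu = np.linspace(-4, 4, 801)
prior = alpha @ norm.pdf(mu[None, :], prior_means[:, None], prior_sds[:, None])
post  = w     @ norm.pdf(mu[None, :], post_means[:, None],  post_sds[:, None])
rb = post / prior

mu_rb = mu[np.argmax(rb)]                     # relative belief estimate of mu
plausible = mu[rb > 1]                        # plausible region: values with evidence in favor
print(mu_rb, plausible.min(), plausible.max())
```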
Example 3. Prediction.
Consider Context I but suppose interest is in predicting a future value whose distribution is conditionally independent of the observed data x given θ and has model where with The first step in solving this problem is to determine the relevant inference bases and this is carried out by integrating out the nuisance parameter which in this case is So the i-th inference base is given by where is the density of the i-th prior for namely, and is the conditional density of x given Note that unconditionally x and y are not independent and now the collection of possible distributions for x is indexed by The i-th posterior density of y is then
The models are now not all the same so this is Context II with common data as discussed in Section 4. It is assumed, as is typically the case, that the mss for these models is constant in i so the weights are comparable. Applying (4), with the single data set leads to with and (5) leads to posterior Note that in this case the posterior of y given x is well-defined via Bayesian conditioning and equals so there is no need to invoke Jeffrey’s conditionalization for the posterior. It is notable, however, that if the relative belief ratio for y is computed using this posterior and the prior then this equals which does not equal Given that the weights in (6) depend on the object of interest this does not correspond to linear pooling of the evidence and this is because the model is not constant. There is no reason to suppose that (6) will retain the good properties of linear pooling and experience with it suggests that it is not the correct way to combine. As such, the recommended approach is via (4), based on Jeffrey’s conditionalization, which retains the good properties of linear pooling. Suppose now the context is as discussed in Example 2 but the goal is to make a prediction concerning a future independent value So the i-th prior is given by and the i-th posterior is Table 2 gives the results for predicting y using the data in Example 2. The final row indicates what happens as and note that the weights converge as well with the i-th limiting weight proportional to which depends on the relative accuracy of the i-th prior with respect to the true mean When all the prior variances are the same, the prior which has its mean closest to the true value will give the heaviest weight. Also, as the i-th weight goes to 0. Note that the limiting plausible intervals are dependent on the prior and the interval does not shrink to a point because y is random. The limiting posterior content of these intervals is the probability content given by the true distribution of For the limiting plausible intervals for y to still be dependent on the prior is different from the situation when making inference about a parameter as, in that case, the plausible intervals shrink to the true value as the amount of data increases. The difference is that there is not a “true” value for The limiting plausible interval does not allow for all possible values for y and the effect of the prior is to disallow some possible values because belief in such a value is less than that specified by the prior of As can be seen from Table 2 this effect is not great unless the prior, as with here, puts little mass near the true value. However, such an occurrence also reduces the limiting weight for such a component. Consider now an example where the weights require adjustment.
Example 4. Location-normal models with different variances.
Consider a situation similar to Example 2 but now with three distinct models so this is Context II. Here the i-th statistician assumes that the true distribution is where the are known but is unknown and interest is in The same three priors are assumed as in Example 2. So the statisticians disagree about the “known” variance of the sampling distribution and an ancillary needs to play a role to make the weights comparable.
In this case is ancillary for each model and is independently distributed from the common mss and Therefore, with equal weights for the priors, and taking the ancillaries into account, the i-th weight satisfies From this it is seen that the assumed variances and the prior both play a role in determining how much weight a given analysis should have. Note that as or , and all other parameters are fixed, then the weight of the i-th analysis goes to 0 as it should as, in the limit, no information is being provided about the true value of . Proposition 6 tells us that when and the i-th variance is correct and the others are not, then the i-th inference base will dominate. Consider now an example where the models are truly different.
Example 5. Location with quite different models.
Consider again the context of Example 2 but suppose that one of the models, say the one in is a (Cauchy) location model, while the other models and all the priors are as previously specified. For all three inference bases is ancillary. To ensure that has the same interpretation across all inference bases, the density is rescaled by so that the interval contains of the probability for all 3 distributions. This implies and, with the first model is where To obtain the corresponding weight, the following expression needs to be evaluated numerically, When applied to the data of Example 2, very similar results are obtained. Table 3 contains the weights for the inference bases for this situation. The following example is of considerable practical importance.
Example 6. Linear regression.
Suppose that the data is for and there are two analysts, both of whom propose a simple regression model where with and unknown and z is a sample from for analyst 1 and is a sample from a distribution for analyst 2 for some value In both models is the variance of a Letting be the least squares estimate of β and then where and is ancillary for both models. Further suppose that the quantity of inferential interest is the slope parameter Denoting the relevant density of a by the joint density of given is proportional to The posterior density of can be worked out in closed form when f is the density but generally it will require numerical integration to determine the posterior density and the posterior weights for the combination. For the prior, suppose both analysts agree on and gammarate Note that the zero mean for β may entail subtracting a known, fixed constant vector from y so this, and the assumption that may entail some preprocessing of the data. The prior distribution of the quantity of interest is then where denotes the t distribution on degrees of freedom.
Obtaining the hyperparameters of the prior requires elicitation and this can be carried out using the following method as described in [34]. Suppose that it is known with virtual certainty, based on our knowledge of the measurements being taken, that will lie in the interval for some for all a compact set centered at 0 and contained in on account of the standardization. The phrase ‘virtual certainty’ is interpreted here as a probability greater than or equal to γ where γ is some large probability like Therefore, the prior on β must satisfy for all which implies where with equality when An interval that will contain a response value y with virtual certainty, given predictor value is Suppose that we have lower and upper bounds and on the half-length of this interval so that or, equivalently, holds with virtual certainty. Combining (8) with (7) implies To obtain the relevant values of and let denote the cdf of the gammarate distribution and note that Therefore, the interval for implied by (8) contains with virtual certainty, when satisfy or equivalently It is a simple matter to solve these equations for For this, choose an initial value for and, using (9), find w such that which implies If the left side of (10) is less (greater) than then decrease (increase) the value of and repeat step 1. Continue iterating this process until satisfactory convergence is attained. Consider now a numerical example drawn from [35] where the response variable is income in U.S. dollars per capita (deflated), and the predictor variable is investment in dollars per capita (deflated) for the United States for the years 1922–1941. The data are provided in Table 4. The data vector y was replaced by as this centered the observations about 0. Taking leads to the values The following prior is then used for both models, Table 5 presents the weights that result when different error distributions are considered to be combined with the results from a error assumption. Presumably this arises when one analyst is concerned that tails longer than the normal are appropriate. As can be seen, the normal error assumption dominates except for when the inferences do not differ by much in any case. This is not surprising as various residual plots do not indicate any issue with the normality assumption for these data. These weights were computed using importance sampling and were found to be robust to the prior by repeating the computations after making small changes to the hyperparameters. The approach taken in this example is easily generalized to more general linear regression models including situations where the priors change.
6. Conclusions
The problem of how to combine evidence has been considered for a Bayesian context where each analyst proposes a model and prior for the same data. Linear opinion pooling is seen as the natural way to make such a combination, at least when the inference bases only differ in the priors on the parameter of interest. This has been shown to have appropriate properties, such as preserving a consensus with respect to the evidence, and, when combining evidence is considered as opposed to just combining priors, to behave appropriately when considering independent events. In certain contexts the idea can be extended in a logical way based on the idea underlying Jeffrey conditionalization. This approach has been shown to behave correctly asymptotically in a wide variety of situations.
There are a number of factors that need to be considered when implementing the methods discussed here. As mentioned in the Introduction, we have assumed that each of the sampling models and priors used has been subjected to model checks and checking for prior–data conflict, respectively. As such, we are not considering combining the evidence obtained from contexts where the ingredients are contradicted by the data, and this is to be regarded as a key part of the analysis. It is also worth noting that [
36] establishes that relative belief inferences are optimally robust to the choice of the prior on
, and so, provided there is no prior–data conflict, a degree of robustness to the used priors can be expected. This does not, however, address issues concerned with sensitivity to the sampling models or to the priors used for nuisance parameters. There are also issues that arise for the choice of
. Unless there are good reasons to do otherwise, using uniform weights seems like the best choice as then only the data determines the relative weighting. For Context II, however, as discussed in
Section 4, there are general concerns with the comparability of the model weights and that has been only partially addressed here.
The developments here do not cover contexts where there are different data sets and different models. If the models are all for the same basic responses, then one possibility is to simply combine data sets and proceed, as we have demonstrated here. More generally, it may be that the only aspect in common among the models is the characteristic of interest
, and then it is not clear how we should combine this. The combination rule given by
where
suggests itself as a generalization of what has been considered here. Further investigation is required, however, as it is necessary to ensure that the weights
are indeed comparable, so some modification is probably required that is context-dependent.
It does not seem essential that we restrict attention to combination rules based on power mean priors as we have done here. For example, one could consider combining the
themselves according to some rule. For instance, a power rule could be used to combine these quantities. Even in Context I, however, this loses the interpretation of the combination as a valid measure of evidence through the principle of evidence. Of course, (
3) arises in both such approaches to the problem and probably should, no matter which generalized rule is adopted, when it is applied to Context I.
The problem of combining evidence is an important one in science, as evidenced by extensive discussion in the literature over many years. What has been shown here is that a very natural definition of how to measure statistical evidence can lead to a natural solution in a number of significant contexts.