Objective Bayesianism and the Maximum Entropy Principle

Landes, Jürgen; Williamson, Jon

doi:10.3390/e15093528

Open AccessArticle

Objective Bayesianism and the Maximum Entropy Principle

by

Jürgen Landes

and

Jon Williamson

^*

Department of Philosophy, School of European Culture and Languages, University of Kent, Canterbury CT2 7NF, UK

^*

Author to whom correspondence should be addressed.

Entropy 2013, 15(9), 3528-3591; https://doi.org/10.3390/e15093528

Submission received: 28 June 2013 / Revised: 21 August 2013 / Accepted: 21 August 2013 / Published: 4 September 2013

(This article belongs to the Special Issue Maximum Entropy and Bayes Theorem)

Download

Browse Figure

Versions Notes

Abstract

:

Objective Bayesian epistemology invokes three norms: the strengths of our beliefs should be probabilities; they should be calibrated to our evidence of physical probabilities; and they should otherwise equivocate sufficiently between the basic propositions that we can express. The three norms are sometimes explicated by appealing to the maximum entropy principle, which says that a belief function should be a probability function, from all those that are calibrated to evidence, that has maximum entropy. However, the three norms of objective Bayesianism are usually justified in different ways. In this paper, we show that the three norms can all be subsumed under a single justification in terms of minimising worst-case expected loss. This, in turn, is equivalent to maximising a generalised notion of entropy. We suggest that requiring language invariance, in addition to minimising worst-case expected loss, motivates maximisation of standard entropy as opposed to maximisation of other instances of generalised entropy. Our argument also provides a qualified justification for updating degrees of belief by Bayesian conditionalisation. However, conditional probabilities play a less central part in the objective Bayesian account than they do under the subjective view of Bayesianism, leading to a reduced role for Bayes’ Theorem.

Keywords:

objective Bayesianism; g-entropy; generalised entropy; Bayesian conditionalisation; scoring rule; maximum entropy; maxent; minimax

1. Introduction

Objective Bayesian epistemology is a theory about the strength of belief. As formulated by Williamson [1], it invokes three norms:

Probability: The strengths of an agent’s beliefs should satisfy the axioms of probability. That is, there should be a probability function, $P_{E} : S L ⟶ [0, 1]$ , such that for each sentence θ of the agent’s language $L$ , $P_{E} (θ)$ measures the degree to which the agent with evidence E believes sentence θ. (Here, $L$ will be construed as a finite propositional language and $S L$ as the set of sentences of $L$ , formed by recursively applying the usual connectives.)
Calibration: The strengths of an agent’s beliefs should satisfy constraints imposed by her evidence E. In particular, if the evidence determines just that physical probability (aka chance), $P^{*}$ , is in some set $P^{*}$ of probability functions defined on $S L$ , then $P_{E}$ should be calibrated to physical probability insofar as it should lie in the convex hull, $E = 〈 P^{*} 〉$ , of the set $P^{*}$ . (We assume throughout this paper that chance is probabilistic, i.e., that $P^{*}$ is a probability function.)
Equivocation: The agent should not adopt beliefs that are more extreme than is demanded by her evidence E. That is, $P_{E}$ should be a member of $E$ that is sufficiently close to the equivocator function, $P_{=}$ , which gives the same probability to each $ω \in Ω$ , where the state descriptions or states, ω, are sentences describing the most fine-grained possibilities expressible in the agent’s language.

One way of explicating these norms proceeds as follows. Measure closeness of

P_{E}

to the equivocator by Kullback-Leibler divergence,

d (P_{E}, P_{=}) = \sum_{ω \in Ω} P_{E} (ω) log \frac{P_{E} (ω)}{P_{=} (ω)}

. Then, if there is some function in

E

that is closest to the equivocator,

P_{E}

should be such a function. If

E

is closed, then there is guaranteed to be some function in

E

closest to the equivocator; as

E

is convex, there is at most one such function. Then we have the maximum entropy principle [2]:

P_{E}

is the function in

E

that has maximum entropy H, where

H (P) = - \sum_{ω \in Ω} P (ω) log P (ω)

.

The question arises as to how the three norms of objective Bayesianism should be justified, and whether the maximum entropy principle provides a satisfactory explication of the norms.

The probability norm is usually justified by a Dutch book argument. Interpret the strength of an agent’s belief in θ to be a betting quotient, i.e., a number x, such that the agent is prepared to bet

x S

on θ with return S if θ is true, where S is an unknown stake, positive or negative. Then the only way to avoid the possibility that stakes may be chosen so as to force the agent to lose money, whatever the true state of the world, is to ensure that the betting quotients satisfy the axioms of probability (see e.g., Theorem 3.2 in [1]).

The calibration norm may be justified by a different sort of betting argument. If the agent bets repeatedly on sentences with known chance y with some fixed betting quotient x, then she is sure to lose money in the long run unless

x = y

(see, e.g., pp. 40–41 in [1]). Alternatively, on a single bet with known chance y, the agent’s expected loss is positive unless her betting quotient

x = y

, where the expectation is determined with respect to the chance function

P^{*}

(pp. 41–42 in [1]). More generally, if evidence E determines that

P^{*} \in P^{*}

and the agent makes such bets, then sure loss/positive expected loss can be forced unless

P_{E} \in 〈 P^{*} 〉

.

The equivocation norm may be justified by appealing to a third notion of loss. In the absence of any particular information about the loss

L (ω, P)

one incurs when one’s strengths of beliefs are represented by P and ω turns out to be the true state, one can argue that one should take the loss function L to be logarithmic,

L (ω, P) = - log P (ω)

(pp. 64–65 in [1]). Then the probability function P that minimises the worst case expected loss, subject to the information that

P^{*} \in E

, where

E

is closed and convex, is simply the probability function

P \in E

closest to the equivocator—equivalently, the probability function in

E

that has maximum entropy [3,4].

The advantage of these three lines of justification is that they make use of the rather natural connection between strength of belief and betting. This connection was highlighted by Frank Ramsey:

All our lives, we are in a sense betting. Whenever we go to the station, we are betting that a train will really run, and if we had not a sufficient degree of belief in this, we should decline the bet and stay at home.
(p. 183 in [5])

The problem is that the three norms are justified in rather different ways. The probability norm is motivated by avoiding sure loss. The calibration norm is motivated by avoiding sure long-run loss or by avoiding positive expected loss. The equivocation norm is motived by minimising worst-case expected loss. In particular, the loss function appealed to in the justification of the equivocation norm differs from that invoked by the justifications of the probability and calibration norms.

In this paper, we seek to rectify this problem. That is, we seek a single justification of the three norms of objective Bayesian epistemology.

The approach we take is to generalise the justification of the equivocation norm, outlined above, in order to show that only the strengths of beliefs that are probabilistic, calibrated and equivocal minimise worst-case expected loss. We shall adopt the following starting point: as discussed above,

E = 〈 P^{*} 〉

is taken to be convex and non-empty throughout this paper; we shall also assume that the strengths of the agent’s beliefs can be measured by non-negative real numbers—an assumption that is rejected by advocates of imprecise probability, a position that we will discuss separately in Section 5.3. We do not assume throughout that

E

is such that it admits some function that has maximum entropy—e.g., that

E

is closed—but we will be particularly interested in the case in which

E

does contain its entropy maximiser, in order to see whether some version of the maximum entropy principle is justifiable in that case.

In Section 2, we shall consider the scenario in which the agent’s belief function,

bel

, is defined over propositions, i.e., sets of possible worlds. Using ω to denote a possible world as well as the state of

L

that picks out that possible world, we have that

bel

is a function from the power set of a finite set Ω of possible worlds ω to the non-negative real numbers,

bel : P Ω ⟶ R_{\geq 0}

. When it comes to justifying the probability norm, this will give us enough structure to show that degrees of belief should be additive. Then, in Section 3, we shall consider the richer framework in which the belief function is defined over sentences, i.e.,

bel : S L ⟶ R_{\geq 0}

. This will allow us to go further by showing that different sentences that express the same proposition should be believed to the same extent. In Section 4, we shall explain how the preceding results can be used to motivate a version of the maximum entropy principle. In Section 5, we draw out some of the consequences of our results for Bayes’ theorem. In particular, conditional probabilities and Bayes’ theorem play a less central role under this approach than they do under subjective Bayesianism. Furthermore, in Section 5 we relate our work to the imprecise probability approach and suggest that the justification of the norms of objective Bayesianism presented here can be reinterpreted in a non-pragmatic way.

The key results of the paper are intended to demonstrate the following points. Theorem 1 (which deals with beliefs defined over propositions) and Theorem 4 (respectively, belief over sentences) show that only a logarithmic loss function satisfies certain desiderata that, we suggest, any default loss function should satisfy. This allows us to focus our attention on logarithmic loss. Theorems 2, 3 (for propositions) and Theorems 5, 6 (for sentences) show that minimising worst-case expected logarithmic loss corresponds to maximising a generalised notion of entropy. Theorem 7 justifies maximising standard entropy, by viewing this maximiser as a limit of generalised entropy maximisers. Theorem 9 demonstrates a level of agreement between updating beliefs by Bayesian conditionalisation and updating by maximising generalised entropy. Theorem 10 shows that the generalised notion of entropy considered in this paper is pitched at precisely the right level of generalisation.

Three appendices to the paper help to shed light on the generalised notion of entropy introduced in this paper. Appendix A motivates the notion by offering justifications of generalised entropy that mirror Shannon’s original justification of standard entropy. Appendix B explores some of the properties of the functions that maximise generalised entropy. Appendix C justifies the level of generalisation of entropy to which we appeal.

2. Belief over Propositions

In this section, we shall show that if a belief function defined on propositions is to minimise worst-case expected loss, then it should be a probability function, calibrated to physical probability, which maximises a generalised notion of entropy. The argument will proceed in several steps. As a technical convenience, in Section 2.1, we shall normalise the belief functions under consideration. In Section 2.2, we introduce the appropriate generalisation of entropy. In Section 2.3, we argue that, by default, loss should be taken to be logarithmic. Then, in Section 2.4, we introduce scoring rules, which measure expected loss. Finally, in Section 2.5, we show that worst-case expected loss is minimised just when generalised entropy is maximised.

For the sake of concreteness, we will take Ω to be generated by a propositional language,

L = {A_{1}, \dots, A_{n}}

, with propositional variables,

A_{1}, \dots, A_{n}

. The states, ω, take the form

_{\pm} A_{1} \land \dots \land_{\pm} A_{n}

, where

+ A_{i}

is just

A_{i}

and

- A_{i}

is

\neg A_{i}

. Thus, there are

2^{n}

states,

ω \in Ω = {_{\pm} A_{1} \land \dots \land_{\pm} A_{n}}

. We can think of each such state as representing a possible world. A proposition (or, in the terminology of the mathematical theory of probability, an ‘event’) may be thought of as a subset of Ω, and a belief function,

bel : P Ω ⟶ R_{\geq 0}

, thus assigns a degree of belief to each proposition that can be expressed in the agent’s language. For a proposition

F \subseteq Ω

, we will use

\bar{F}

to denote

Ω ∖ F

.

| F |

denotes the size of proposition

F \subseteq Ω

, i.e., the number of states under which it is true.

Let Π be the set of partitions of Ω; a partition

π \in Π

is a set of mutually exclusive and jointly exhaustive propositions. To control the proliferation of partitions, we shall take the empty set, ∅, to be contained only in one partition, namely

{Ω, \emptyset}

.

2.1. Normalisation

There are finitely many propositions (

P Ω

has

2^{2^{n}}

members), so any particular belief function,

bel

, takes values in some interval

[0, M] \subseteq R_{\geq 0}

. It is just a matter of convention as to the scale on which belief is measured, i.e., as to what upper bound M we might consider. For convenience, we shall normalise the scale to the unit interval,

[0, 1]

, so that all belief functions are considered on the same scale.

Definition 1 (Normalised belief function on propositions). Let

M = {max}_{π \in Π} \sum_{F \in π} bel (F)

. Given a belief function,

bel : P Ω ⟶ R_{\geq 0}

, that is not zero everywhere, its normalisation,

B : P Ω ⟶ [0, 1]

, is defined by setting

B (F) = bel (F) / M

for each

F \subseteq Ω

. We shall denote the set of normalised belief functions by

B

, so:

\begin{matrix} B = {B : P Ω ⟶ [0, 1] : \sum_{F \in π} B (F) \leq 1 for all π \in Π and \sum_{F \in π} B (F) = 1 for some π} \end{matrix}

Without loss of generality, we rule out of consideration the non-normalised belief function that gives zero degree of belief to each proposition; it will become clear in Section 2.4 that this belief function is of little interest, as it can never minimise worst-case expected loss. For purely technical convenience, we will often consider the convex hull,

〈 B 〉

, of

B

. In which case, we rule into consideration certain belief functions that are not normalised, but which are convex combinations of normalised belief functions. Henceforth, then, we shall focus our attention on belief functions in

B

and

〈 B 〉

.

Note that we do not impose any further restrictions on the agent’s belief function—such as additivity; or the requirement that

B (G) \leq B (F)

whenever

G \subseteq F

; or that the empty proposition, ∅, has belief zero; or the sure proposition, Ω, is assigned belief of one. Our aim is to show that belief functions that do not satisfy such conditions will expose the agent to avoidable loss.

For any

B \in 〈 B 〉

and every

F \subseteq Ω

, we have

B (F) + B (\bar{F}) \leq 1

, because

{F, \bar{F}}

is a partition. Indeed:

\begin{matrix} \sum_{F \subseteq Ω} B (F) & = \frac{1}{2} \cdot (\sum_{F \subseteq Ω} B (F) + \sum_{F \subseteq Ω} B (\bar{F})) \leq \frac{1}{2} \cdot | P Ω | = 2^{2^{n} - 1} \end{matrix}

(1)

Recall that a subset of

R^{N}

is compact, if and only if it is closed and bounded.

Lemma 1 (Compactness).

B

and

〈 B 〉

are compact.

Proof:

B \subset R^{| P Ω |}

is bounded, where ⊂ denotes strict subset inclusion. Now, consider a sequence,

{(B_{t})}_{t \in N} \in B

which converges to some

B \in R^{| P Ω |} .

Then, for all

π \in Π

, we find

\sum_{F \in π} B (F) \leq 1 .

Assume that

B \notin B

. Thus for all

π \in Π

, we have

\sum_{F \in π} B (F) < 1 .

However, then there has to exist a

t_{0} \in N

, such that for all

t \geq t_{0}

and all

π \in Π

,

\sum_{F \in π} B_{t} (F) < 1

. This contradicts

B_{t} \in B .

Thus,

B

is closed and, hence, compact.

〈 B 〉

is the convex hull of a compact set. Hence,

〈 B 〉 \subset R^{| P Ω |}

is closed and bounded and so compact. ■

We will be particularly interested in the subset,

P \subseteq B

, of belief functions defined by:

P = {B : P Ω ⟶ [0, 1] : \sum_{F \in π} B (F) = 1 for all π \in Π}

(2)

P

is the set of probability functions:

Proposition 1.

P \in P

if and only if

P : P Ω ⟶ [0, 1]

satisfies the axioms of probability:

P1:: $P (Ω) = 1$ and $P (\emptyset) = 0$ .
P2:: If $F \cap G = \emptyset$ , then $P (F) + P (G) = P (F \cup G)$ .

Proof: Suppose

P \in P

.

P (Ω) = 1

, because

{Ω}

is a partition.

P (\emptyset) = 0

, because

{Ω, \emptyset}

is a partition and

P (Ω) = 1

. If

F, G \subseteq Ω

are disjoint, then

P (F) + P (G) = P (F \cup G)

, because

{F, G, \bar{F \cup G}}

and

{F \cup G, \bar{F \cup G}}

are both partitions, so

P (F) + P (G) = 1 - P (\bar{F \cup G}) = P (F \cup G)

.

On the other hand, suppose P1 and P2 hold.

\sum_{F \in π} P (F) = 1

can be seen by induction on the size of π. If

| π | = 1

, then

π = {Ω}

and

P (Ω) = 1

by P1. Suppose, then, that

π = {F_{1}, \dots, F_{k + 1}}

for

k \geq 1

. Now,

\sum_{i = 1}^{k - 1} P (F_{i}) + P (F_{k} \cup F_{k + 1}) = 1

by the induction hypothesis and

P (F_{k} \cup F_{k + 1}) = P (F_{k}) + P (F_{k + 1})

by P2, so

\sum_{F \in π} P (F) = 1

, as required. ■

Example 1 (Contrasting $B$ with $P$ ). Using Equation (1), we find

\sum_{F \subseteq Ω} P (F) = \frac{| P Ω |}{2} \geq \sum_{F \subseteq Ω} B (F)

for all

P \in P

and

B \in B .

For probability functions,

P \in P

, probability is evenly distributed among the propositions of fixed size in the following sense:

\begin{matrix} \sum_{\begin{matrix} F \subseteq Ω \\ | F | = t \end{matrix}} P (F) & = \sum_{ω \in Ω} P (ω) \cdot | {F \subseteq Ω : | F | = t and ω \in F} | \end{matrix}

(3)

\begin{matrix} = \sum_{ω \in Ω} P (ω) (\binom{| Ω | - 1}{t - 1}) = (\binom{| Ω | - 1}{t - 1}) \end{matrix}

(4)

where

P (ω)

abbreviates

P ({ω}) .

For

B \in B

and

t > \frac{| Ω |}{2} \geq 2

, we have, in general, only the following inequality:

\begin{matrix} 0 \leq \sum_{\begin{matrix} F \subseteq Ω \\ | F | = t \end{matrix}} B (F) \leq | {F \subseteq Ω : | F | = t} | = (\binom{| Ω |}{t}) \end{matrix}

(5)

For

B_{1} \in B

, defined as

B_{1} (ω) = 1

, for some specific ω, and

B_{1} (F) = 0

for all other

F \subseteq Ω

, we have that the lower bound is tight. For

B_{2} \in B

, defined as

B_{2} (F) = 1

, for

| F | = t

, and

B_{2} (F) = 0

, for all other

F \subseteq Ω

, the upper bound is tight.

To illustrate the potentially uneven distribution of beliefs for a

B \in B,

let

A_{1}, A_{2}

be the propositional variables in

L,

so Ω contains four elements. Now, consider the

B \in B

such that

B (\emptyset) = 0,

B (F) = \frac{1}{100}

for

| F | = 1,

B (F) = \frac{1}{2}

for

| F | = 2,

B (F) = \frac{99}{100}

for

| F | = 3

and

B (Ω) = 1 .

Note, in particular, that there is no

P \in P

such that

B (F) \leq P (F)

for all

F \subseteq Ω .

2.2. Entropy

The entropy of a probability function is standardly defined as:

H_{Ω} (P) : = - \sum_{ω \in Ω} P (ω) log P (ω)

(6)

We shall adopt the usual convention that

- x log 0 = x \cdot \infty = \infty

, if

x > 0

and

0 log 0 = 0

.

We will need to extend the standard notion of entropy to apply to normalised belief functions, not just to probability functions. Note that the standard entropy only takes into account those propositions that are in the partition

{{ω} : ω \in Ω}

which partitions Ω into states. This is appropriate when entropy is applied to probability functions, because a probability function is determined by its values on the states. However, this is not appropriate if entropy is to be applied to belief functions: in that case, one cannot simply disregard all those propositions that are not in the partition of Ω into states—one needs to consider propositions in other partitions, too. In fact, there are a range of entropies of a belief function, according to how much weight is given to each partition π in the entropy sum:

Definition 2 (g-entropy). Given a weighting function

g : Π ⟶ R_{\geq 0}

, the generalised entropy or g-entropy of a normalised belief function is defined as:

H_{g} (B) : = - \sum_{π \in Π} g (π) \sum_{F \in π} B (F) log B (F)

(7)

The standard entropy,

H_{Ω}

, corresponds to

g_{Ω}

-entropy, where

g_{Ω} (π) = \{\begin{matrix} 1 & : & π = {{ω} : ω \in Ω} \\ 0 & : & otherwise \end{matrix}

(8)

We can define the partition entropy,

H_{Π}

, to be the

g_{Π}

-entropy, where

g_{Π} (π) = 1

for all

π \in Π

. Then:

\begin{matrix} H_{Π} (B) & = & - \sum_{π \in Π} \sum_{F \in π} B (F) log B (F) \\ = & - \sum_{F \subseteq Ω} par (F) B (F) log B (F) \end{matrix}

(9)

where

par (F)

is the number of partitions in which F occurs. Note that according to our convention,

par (\emptyset) = 1

and

par (Ω) = 2

, because Ω occurs in partitions,

{\emptyset, Ω}

and

{Ω}

. Otherwise,

par (F) = b_{| \bar{F} |}

, where

b_{k} : = \sum_{i = 1}^{k} 1 / i! \sum_{j = 0}^{i} {(- 1)}^{i - j} (\binom{i}{j}) j^{k}

is the k’th Bell number, i.e., the number of partitions of a set of k elements.

We can define the proposition entropy

H_{P Ω}

to be the

g_{P Ω}

-entropy, where

g_{P Ω} (π) = \{\begin{matrix} 1 & : & | π | = 2 \\ 0 & : & otherwise \end{matrix}

(10)

Then:

\begin{matrix} H_{P Ω} (B) & = & - \sum_{π \in Π} \sum_{\begin{matrix} F \in π \\ | π | = 2 \end{matrix}} B (F) log B (F) \\ = & - \sum_{F \subseteq Ω} B (F) log B (F) \end{matrix}

(11)

In general, we can express

H_{g} (B)

in the following way, which reverses the order of the summations:

H_{g} (B) = - \sum_{F \subseteq Ω} (\sum_{\begin{matrix} π \in Π \\ F \in π \end{matrix}} g (π)) B (F) log B (F)

(12)

As noted above, one might reasonably demand of a measure of the entropy of a belief function that each belief should contribute to the entropy sum, i.e., for each

F \subseteq Ω

,

\sum_{\begin{matrix} π \in Π \\ F \in π \end{matrix}} g (π) \neq 0

:

Definition 3 (Inclusive weighting function). A weighting function

g : Π ⟶ R_{\geq 0}

is inclusive if for all

F \subseteq Ω

, there is some partition π containing F such that

g (π) > 0

.

This desideratum rules out the standard entropy in favour of other candidate measures, such as the partition entropy and the proposition entropy.

We have seen so far that g-entropy is a natural generalisation of standard entropy from probability functions to belief functions. In Section 2.5, we shall see that g-entropy is of particular interest, because maximising g-entropy corresponds to minimising worst-case expected loss—this is our main reason for introducing the concept. However, there is a third reason why g-entropy is of interest. Shannon (§6 in [6]) provided an axiomatic justification of standard entropy as a measure of the uncertainty encapsulated in a probability function. Interestingly, as we show in Appendix A, Shannon’s argument can be adapted to give a justification of our generalised entropy measure. Thus, g-entropy can also be thought of as a measure of the uncertainty of a belief function.

In the remainder of this section, we will examine some of the properties of g-entropy.

Lemma 2. The function

- log : [0, 1] \to [0, \infty]

is continuous in the standard topology on

R_{\geq 0} \cup {+ \infty} .

Proof: To obtain the standard topology on

R_{\geq 0} \cup {+ \infty}

, take as open sets infinite unions and finite intersections over the open sets of

R_{\geq 0}

and sets of the form,

(r, \infty]

, where

r \in R .

In this topology on

[0, \infty],

a set

M \subseteq R_{\geq 0}

is open if and only if it is open in the standard topology in

R_{\geq 0} .

Hence,

- log

is continuous in this topology on

(0, 1] .

Let

{(a_{t})}_{t \in N}

be a sequence in

[0, 1]

with limit zero. For all

ϵ > 0

, there exists a

T \in N

such that

- log a_{t} > \frac{1}{ϵ}

for all

t > T .

Hence, for all open sets U containing

+ \infty

, there exists a K such that

- log a_{m} \in U

if

m > K .

Therefore,

- log a_{t}

converges to

+ \infty .

Thus,

{lim}_{t \to \infty} - log a_{t} = + \infty = - log {lim}_{t \to \infty} a_{t} .

■

Proposition 2. g-entropy is non-negative and, for inclusive g, strictly concave on

〈 B 〉

.

Proof:

B (F) \in [0, 1]

for all F, so

log B (F) \leq 0

, and

g (π) \sum_{F \in π} B (F) log B (F) \leq 0

. Hence,

\sum_{π \in Π} - g (π) \sum_{F \in π} B (F) log B (F) \geq 0,

i.e., g-entropy is non-negative.

Take distinct

B_{1}, B_{2} \in 〈 B 〉

and

λ \in (0, 1)

, and let

B = λ B_{1} + (1 - λ) B_{2}

. Now,

x log x

is strictly convex on

[0, 1]

, i.e.,

B (F) log B (F) \leq λ B_{1} (F) log B_{1} (F) + (1 - λ) B_{2} (F) log B_{2} (F)

(13)

with equality just when

B_{1} (F) = B_{2} (F)

.

Consider an inclusive weighting function, g.

\begin{matrix} H_{g} (λ B_{1} + (1 - λ) B_{2}) & = & - \sum_{π \in Π} g (π) \sum_{F \in π} B (F) log B (F) \\ \geq & - \sum_{π \in Π} g (π) \sum_{F \in π} (λ B_{1} (F) log B_{1} (F) + (1 - λ) B_{2} (F) log B_{2} (F)) \\ = & λ H_{g} (B_{1}) + (1 - λ) H_{g} (B_{2}) \end{matrix}

(14)

with equality iff for all F,

B_{1} (F) = B_{2} (F)

, since g is inclusive. However,

B_{1}

and

B_{2}

are distinct, so equality does not obtain. In other words, g-entropy is strictly concave. ■

Corollary 1. For inclusive g, if g-entropy is maximised by a function

P^{†}

in convex

E \subseteq P

, it is uniquely maximised by

P^{†}

in

E

.

Corollary 2. For inclusive g, g-entropy is uniquely maximised in the closure,

[E]

, of

E

.

If g is not inclusive, concavity is not strict. For example, if the standard entropy,

H_{Ω}

, is maximised by

B^{†}

, then it is also maximised by any belief function

C^{†}

that agrees with

B^{†}

on the states

ω \in Ω

.

Note that different g-entropy measures can have different maximisers on a convex subset

E

of probability functions. For example, when

Ω = {ω_{1}, ω_{2}, ω_{3}, ω_{4}}

and

E = {P \in P : P (ω_{1}) + 2.75 P (ω_{2}) + 7.1 P (ω_{3}) = 1.7, P (ω_{4}) = 0}

, then the proposition entropy maximiser, the standard entropy maximiser and the partition entropy maximiser are all different, as can be seen from Figure 1.

Figure 1. Plotted are the partition entropy, the standard entropy and the proposition entropy under the constraints,

P (ω_{1}) + P (ω_{2}) + P (ω_{3}) + P (ω_{4}) = 1

,

P (ω_{1}) + 2.75 P (ω_{2}) + 7.1 P (ω_{3}) = 1.7, P (ω_{4}) = 0

, as a function of

P (ω_{2}) .

The dotted lines indicate the respective maxima, which obtain for different values of

P (ω_{2}) .

Figure 1. Plotted are the partition entropy, the standard entropy and the proposition entropy under the constraints,

P (ω_{1}) + P (ω_{2}) + P (ω_{3}) + P (ω_{4}) = 1

,

P (ω_{1}) + 2.75 P (ω_{2}) + 7.1 P (ω_{3}) = 1.7, P (ω_{4}) = 0

, as a function of

P (ω_{2}) .

The dotted lines indicate the respective maxima, which obtain for different values of

P (ω_{2}) .

2.3. Loss

As Ramsey observed, all our lives, we are, in a sense, betting. The strengths of our beliefs guide our actions and expose us to possible losses. If we go to the station when the train happens not to run, we incur a loss: a wasted journey to the station and a delay in getting to where we want to go. Normally, when we are deliberating about how strongly to believe a proposition, we have no realistic idea as to the losses to which that belief will expose us. That is, when determining a belief function B, we do not know the true loss function,

L^{*}

.

Now, a loss function L is standardly defined as a function

L : Ω \times P ⟶ (- \infty, \infty]

, where

L (ω, P)

is the loss one incurs by adopting probability function

P \in P

when ω is the true state of the world. Note that a standard loss function will only evaluate an agent’s beliefs about the states, not the extent to which she believes other propositions. This is appropriate when belief is assumed to be probabilistic, because a probability function is determined by its values on the states. But we are concerned with justifying the probability norm here and, hence, need to consider the full range of the agent’s beliefs, in order to show that they should satisfy the axioms of probability. Hence, we need to extend the concept of a loss function to evaluate all of the agent’s beliefs:

Definition 4 (Loss function). A loss function is a function

L : P Ω \times 〈 B 〉 ⟶ (- \infty, \infty]

.

L (F, B)

is the loss incurred by a belief function B, when proposition F turns out to be true. We shall interpret this loss as the loss that is attributable to F in isolation from all other propositions, rather than the total loss incurred when proposition F turns out to be true. When F turns out to be true, so does any proposition G, for

F \subset G

. Thus, the total loss when F turns out to be true includes

L (G, B)

, as well as

L (F, B)

. The total loss on F turning out to be true might therefore be represented by

\sum_{G \supseteq F} L (G, B)

, with

L (F, B)

being the loss distinctive to F, i.e., the loss on F turning out to be true over and above the loss incurred by

G \supset F

.

Is there anything that one can presume about a loss function in the absence of any information about the true loss function,

L^{*}

? Plausibly:

L1.: $L (F, B) = 0$ if $B (F) = 1$ .
L2.: $L (F, B)$ strictly increases as $B (F)$ decreases from one towards zero.
L3.: $L (F, B)$ depends only on $B (F)$ .

To express the next condition, we need some notation. Suppose

L = L_{1} \cup L_{2}

: say that

L = {A_{1}, . . ., A_{n}}

,

L_{1} = {A_{1}, . . ., A_{m}}

,

L_{2} = {A_{m + 1}, . . ., A_{n}}

for some

1 < m < n

. Then,

ω \in Ω

takes the form

ω_{1} \land ω_{2}

, where

ω_{1} \in Ω_{1}

is a state of

L_{1}

, and

ω_{2} \in Ω_{2}

is a state of

L_{2}

. Given propositions,

F_{1} \subseteq Ω_{1}

and

F_{2} \subseteq Ω_{2}

, we can define

F_{1} \times F_{2} : = {ω = ω_{1} \land ω_{2} : ω_{1} \in F_{1}, ω_{2} \in F_{2}}

, a proposition of

L

. Given a fixed belief function, B, such that

B (Ω) = 1

,

L_{1}

and

L_{2}

are independent sublanguages, written

L_{1} ⫫_{B} L_{2}

, if

B (F_{1} \times F_{2}) = B (F_{1}) \cdot B (F_{2})

for all

F_{1} \subseteq Ω_{1}

and

F_{2} \subseteq Ω_{2}

, where

B (F_{1}) : = B (F_{1} \times Ω_{2})

and

B (F_{2}) : = B (Ω_{1} \times F_{2})

. The restriction

B_{⇂ L_{1}}

of B to

L_{1}

is a belief function on

L_{1}

defined by

B_{⇂ L_{1}} (F_{1}) = B (F_{1}) = B (F_{1} \times Ω_{2})

and similarly for

L_{2}

.

L4.: Losses are additive when the language is composed of independent sublanguages: if $L = L_{1} \cup L_{2}$ for $L_{1} ⫫_{B} L_{2}$ , then $L (F_{1} \times F_{2}, B) = L_{1} (F_{1}, B_{⇂ L_{1}}) + L_{2} (F_{2}, B_{⇂ L_{2}})$ , where $L_{1}, L_{2}$ are loss functions defined on $L_{1}, L_{2}$ , respectively.

L1 says that one should presume that fully believing a true proposition will not incur loss. L2 says that one should presume that the less one believes a true proposition, the more loss will result. L3 expresses the interpretation of

L (F, B)

as the loss attributable to F in isolation of all other propositions. This condition, which is sometimes called locality, rules out that

L (F, B)

depends on

B (F^{'})

for

F^{'} \neq F

; it also rules out a dependence on

| F |

, for instance. L4 expresses the intuition that, at least if one supposes two propositions to be unrelated, one should presume that the loss on both turning out to be true is the sum of the losses on each. (These four conditions correspond to conditions L1–4 of pp. 64–65 in [1], which were put forward in the special case of loss functions defined over probability functions, as opposed to belief functions.)

The four conditions taken together tightly constrain the form of a presumed loss function, L:

Theorem 1. If loss functions are assumed to satisfy L1–4, then

L (F, B) = - k log B (F)

for some constant,

k > 0

, that does not depend on

L

.

Proof: We shall first focus on a loss function, L, defined with respect to a language,

L

, that contains at least two propositional variables.

L3 implies that

L (F, B) = f_{L} (B (F))

, for some function,

f_{L} : [0, 1] ⟶ (- \infty, \infty]

.

For our fixed

L

and each

x, y \in [0, 1]

, choose some particular

B \in 〈 B 〉, L_{1}, L_{2}, F_{1} \subseteq Ω_{1}, F_{2} \subseteq Ω_{2}

, such that

L = L_{1} \cup L_{2}

, where

L_{1} ⫫_{B} L_{2}

,

B (F_{1}) = x

and

B (F_{2}) = y

. This is possible, because

L

has at least two propositional variables. Note in particular that since

L_{1}

and

L_{2}

are independent sublanguages, we have

B (Ω) = 1

.

Note that

1 = B (Ω) = B (Ω_{1} \times Ω_{2}) = B_{⇂ L_{1}} (Ω_{1})

(15)

and, similarly,

B_{⇂ L_{2}} (Ω_{2}) = 1

. By L1, then,

L_{1} (Ω_{1}, B_{⇂ L_{1}}) = L_{2} (Ω_{2}, B_{⇂ L_{2}}) = 0

.

Therefore, by applying L4 twice:

\begin{matrix} f_{L} (x y) & = & f_{L} (B (F_{1}) \cdot B (F_{2})) \\ = & L (F_{1} \times F_{2}, B) \\ = & L_{1} (F_{1}, B_{⇂ L_{1}}) + L_{2} (F_{2}, B_{⇂ L_{2}}) \\ = & [L (F_{1} \times Ω_{2}, B) - L_{2} (Ω_{2}, B_{⇂ L_{2}})] + [L (Ω_{1} \times F_{2}, B) - L_{1} (Ω_{1}, B_{⇂ L_{1}})] \\ = & L (F_{1} \times Ω_{2}, B) + L (Ω_{1} \times F_{2}, B) \\ = & f_{L} (x) + f_{L} (y) \end{matrix}

(16)

The negative logarithm on

(0, 1]

is characterisable up to a multiplicative constant,

k_{L}

, in terms of this additivity, together with the condition that

f_{L} (x) \geq 0

, which is implied by L1–2 (see, e.g., Theorem 0.2.5 in [7]). L2 ensures that

f_{L}

is not zero everywhere, so

k_{L} > 0

.

We thus know that

f_{L} (x) = - k_{L} log x

for

x \in (0, 1] .

Now, note that for all

y \in (0, 1]

, it needs to be the case that

f_{L} (0) = f_{L} (0 \cdot y) = f_{L} (0) + f_{L} (y)

, if

f_{L}

is to satisfy

f_{L} (x \cdot y) = f_{L} (x) + f_{L} (y)

for all

x, y \in [0, 1] .

Since

f_{L}

takes values in

(- \infty, + \infty]

, it follows that

f_{L} (0) = + \infty .

Thus far, we have shown that for a fixed language,

L

, with at least two propositional variables,

L (F, B) = - k_{L} log B (F)

on

[0, 1] .

Now consider an arbitrary language,

L_{1}

, and a loss function

L_{1}

on

L_{1}

which satisfies L1–L4. There exists some other language,

L_{2}

, and a belief function B on

L = L_{1} \cup L_{2}

such that

L_{1} ⫫_{B} L_{2} .

By the above, for the loss function L on

L

, it holds that

L (F, B) = - k_{L} log B (F)

on

[0, 1] .

By reasoning analogous to that above:

L_{1} (F_{1}, B_{⇂ L_{1}}) = L (F_{1} \times Ω_{2}, B) = f_{L} (B (F_{1} \times Ω_{2})) = f_{L} (B_{⇂ L_{1}} (F_{1}))

(17)

Therefore, the loss function for

L_{1}

is

L_{1} (F_{1}, B_{⇂ L_{1}}) = - k_{L} log B_{⇂ L_{1}} (F_{1}) .

Thus, the constant

k_{L}

does not depend on the particular language

L

after all.

In general, then,

L (F, B) = - k log B (F)

for some positive k. ■

Since multiplication by a constant is equivalent to a change of base, we can take log to be the natural logarithm. Since we will be interested in the belief functions that minimise loss, rather than in the absolute value of any particular losses, we can take

k = 1

without loss of generality. Theorem 1 thus allows us to focus on the logarithmic loss function:

L^{log} (F, B) : = - log B (F)

(18)

2.4. Score

In this paper, we are concerned with showing that the norms of objective Bayesianism must hold if an agent is to control her worst-case expected loss. Now, an expected loss function or scoring rule is standardly defined as

S_{Ω}^{L} : P \times P ⟶ [- \infty, \infty]

such that

S_{Ω}^{L} (P, Q) = \sum_{ω \in Ω} P (ω) L_{Ω} (ω, Q)

. This is interpretable as the expected loss incurred by adopting probability function Q as one’s belief function, when the probabilities are actually determined by P. (This is the statistical notion of a scoring rule as defined in [8]. More recently, a different, ‘epistemic’ notion of a scoring rule has been considered in the literature on non-pragmatic justifications of Bayesian norms; see, e.g., [9,10] and, also, a forthcoming paper by Landes, where similarities and differences of these two notions of a scoring rule are discussed. One difference that is significant to our purposes is that Predd et al.’s result in [11]—that for every epistemic scoring rule that is continuous and strictly proper, the set of non-dominated belief functions is the set

P

of probability functions—does not apply to statistical scoring rules. Furthermore, Predd et al. are only interested in justifying the probability norm by appealing to dominance as a decision theoretic norm. We are concerned with justifying three norms at once using worst-case loss avoidance as a desideratum. The epistemic approach is considered further in Section 5.4.)

While this standard definition of scoring rule is entirely appropriate when belief is assumed to be probabilistic, we make no such assumption here and need to consider scoring rules that evaluate all the agent’s beliefs, not just those concerning the states. In line with our discussion of entropy in Section 2.2, we shall consider the following generalisation:

Definition 5 (g-score). Given a loss function, L, and an inclusive weighting function,

g : Π ⟶ R_{\geq 0}

, the g-expected loss function or g-scoring rule or, simply, g-score is

S_{g}^{L} : P \times 〈 B 〉 ⟶ [- \infty, \infty]

, such that

S_{g}^{L} (P, B) = \sum_{π \in Π} g (π) \sum_{F \in π} P (F) L (F, B)

(19)

Clearly,

S_{Ω}^{L}

corresponds to

S_{g_{Ω}}^{L}

, where

g_{Ω}

, which is not inclusive, is defined as in Section 2.2. We require that g be inclusive in Definition 5, since only in that case does the g-score genuinely evaluate all the agent’s beliefs. We will focus on

S_{g}^{log} (P^{*}, B)

, i.e., the case in which the loss function is logarithmic and the expectation is taken with respect to the chance function,

P^{*}

, in order to show that an agent should satisfy the norms of objective Bayesianism if she is to control her worst-case g-expected logarithmic loss when her evidence determines that the chance function,

P^{*}

, is in

E

.

For example, with the logarithmic loss function, the partition Π-score is defined by setting

g = g_{Π}

:

S_{Π}^{log} (P, B) = - \sum_{π \in Π} \sum_{F \in π} P (F) log B (F)

(20)

Similarly, the proposition

P Ω

-score is defined by setting

g = g_{P Ω}

:

S_{P Ω}^{log} (P, B) = - \sum_{F \subseteq Ω} P (F) log B (F)

(21)

It turns out that the various logarithmic scoring rules have the following useful property:

Definition 6 (Strictly proper g-score). A scoring rule,

S_{g}^{L} : P \times 〈 B 〉 ⟶ [- \infty, \infty]

, is strictly proper, if for all

P \in P

, the function

S_{g}^{L} (P, \cdot) : 〈 B 〉 ⟶ [- \infty, \infty]

has a unique global minimum at

B = P

.

Definition 6 can be generalised: a scoring rule is strictly

X

-proper if it is strictly proper for belief functions taken to be from a set

X

. In Definition 6,

X = 〈 B 〉

. The logarithmic scoring rule in the standard sense, i.e.,

\sum_{ω \in Ω} P (ω) L (ω, Q)

, is well known to be the only strictly

P

-proper local scoring rule—see McCarthy [12] (p. 654), who credits Andrew Gleason for the uniqueness result; Shuford et al. [13] (p. 136) for the case of continuous scoring rules; Aczel and Pfanzagl [14] (Theorem 3, p. 101) for the case of differentiable scoring rules; and Savage [15] (§9.4). The logarithmic score in our sense, i.e.,

\sum_{F \subseteq Ω} P (F) L (F, B)

, is not strictly

Y

-proper when

Y

is the set of non-normalised belief functions:

S (P, bel)

is a global minimum, where

bel

is the belief function such that

bel (F) = 1

for all F. (While Joyce [9] (p. 276) suggests that logarithmic score is strictly

Y

-proper for

Y

a set of non-normalised belief functions, he is referring to a logarithmic scoring rule that is different to the usual one considered above and that does not satisfy the locality condition, L3.)

On the way to showing that logarithmic g-scores are strictly proper, it will be useful to consider the following natural generalisation of Kullback-Leibler divergence to our framework:

Definition 7 (g-divergence). For a weighting function,

g : Π ⟶ R_{\geq 0}

, the g-divergence is the function,

d_{g} : P \times 〈 B 〉 ⟶ [- \infty, \infty]

, defined by:

d_{g} (P, B) = \sum_{π \in Π} g (π) \sum_{F \in π} P (F) log \frac{P (F)}{B (F)}

(22)

Here, we adopt the usual convention that

0 log \frac{0}{0} = 0

and

x log \frac{x}{0} = + \infty

for

x \in (0, 1] .

We shall see that

d_{g} (P, B)

is a sensible notion of the divergence of P from B by appealing to the following useful inequality (see, e.g., Theorem 2.7.1 in [16]):

Lemma 3 (Log sum inequality). For

x_{i}, y_{i} \in R_{\geq 0}, i, j = 1, \dots, k

,

(\sum_{i = 1}^{n} x_{i}) log \frac{\sum_{i = 1}^{n} x_{i}}{\sum_{i = 1}^{n} y_{i}} \leq \sum_{i = 1}^{n} x_{i} log \frac{x_{i}}{y_{i}}

(23)

with equality, iff

x_{i} = c y_{i}

for some constant, c, and

i = 1, \dots, k

.

Proposition 3. The following are equivalent:

$d_{g} (P, B) \geq 0$ with equality iff $B = P$ .
g is inclusive.

Proof: First we shall see that if g is inclusive, then

d_{g} (P, B) \geq 0

with equality iff

B = P

.

\begin{matrix} d_{g} (P, B) & = & \sum_{π \in Π} g (π) \sum_{F \in π} P (F) log \frac{P (F)}{B (F)} \\ \geq & \sum_{π \in Π} g (π) [(\sum_{F \in π} P (F)) log \frac{\sum_{F \in π} P (F)}{\sum_{F \in π} B (F)}] \\ \geq & \sum_{π \in Π} g (π) [1 log \frac{1}{1}] \\ = & 0 \end{matrix}

(24)

where the first inequality is an application of the log-sum inequality and the second inequality is a consequence of B being in

〈 B 〉

. There is equality at the first inequality iff for all

F \subseteq Ω

and all π such that

F \in π

and

g (π) > 0

,

P (F^{'}) = c_{π} B (F^{'})

for all

F^{'} \in π

, and equality at the second inequality iff for all π such that

g (π) > 0

,

\sum_{F \in π} B (F) = 1

.

Clearly, if

B (F) = P (F)

for all F, then these two equalities obtain. Conversely, suppose the two equalities obtain. Then, for each F, there is some

π = {F = F_{1}, F_{2}, \dots, F_{k}}

such that

g (π) > 0

, because g is inclusive. The first equality condition implies that

P (F_{i}) = c_{π} B (F_{i})

for

i = 1, \dots, k

. The second equality implies that

\sum_{i = 1}^{k} B (F_{i}) = 1

. Hence,

1 = \sum_{i = 1}^{k} P (F_{i}) = c_{π} \sum_{i = 1}^{k} B (F_{i}) = c_{π}

, and so,

P (F_{i}) = B (F_{i})

for

i = 1, \dots, k

. In particular,

B (F) = P (F)

.

Next, we shall see that the condition that g is inclusive is essential.

If g were not inclusive, then there would be some

F \subseteq Ω

such that

g (π) = 0

, for all

π \in Π

such that

F \in π

. There are two cases.

(i): $\emptyset \subset F \subset Ω$ . Take some $P \in P$ such that $P (F) > 0 .$ Now, define $B (F) : = 0,$ and $B (F^{'}) : = P (F^{'})$ for all other $F^{'} .$ Then, $B (Ω) = 1$ and $\sum_{G \in π} B (G) \leq 1$ for all other $π \in Π$ , so $B \in B \subseteq 〈 B 〉$ . Furthermore, $d_{g} (P, P) = d_{g} (P, B) = 0$ .
(ii): $F = \emptyset$ or $F = Ω$ . Define $B (\emptyset) : = B (Ω) : = 0.5$ and $B (F) : = P (F)$ for all $\emptyset \subset F \subset P Ω .$ Then, $B (\emptyset) + B (Ω) = 1$ and $\sum_{G \in π} B (G) \leq 1$ for all other $π \in Π$ , so $B \in B \subseteq 〈 B 〉$ . Furthermore, $d_{g} (P, P) = d_{g} (P, B) = 0$ .

In either case, then,

d_{g} (P, B)

is not uniquely minimised by

B = P

. ■

Corollary 3. The logarithmic g-score is strictly proper.

Proof: Recall that in the context of a g-score, g is inclusive.

\begin{matrix} S_{g}^{log} (P, B) - S_{g}^{log} (P, P) & = & - \sum_{π \in Π} g (π) \sum_{F \in π} P (F) log \frac{B (F)}{P (F)} \\ = & \sum_{π \in Π} g (π) \sum_{F \in π} P (F) log \frac{P (F)}{B (F)} \\ = & d_{g} (P, B) \end{matrix}

(25)

Proposition 3 then implies that

S_{g}^{log} (P, B) - S_{g}^{log} (P, P) \geq 0

with equality, iff

B = P

, i.e.,

S_{g}^{log}

is strictly proper. ■

Finally, logarithmic g-scores are non-negative strictly convex functions in the following qualified sense:

Proposition 4. The logarithmic g-score

S_{g}^{log} (P, B)

is non-negative and convex as a function of

B \in 〈 B 〉

. Convexity is strict, i.e.,

S_{g}^{log} (P, λ B_{1} + (1 - λ) B_{2}) < λ S_{g}^{log} (P, B_{1}) + (1 - λ) S_{g}^{log} (P, B_{2})

for

λ \in (0, 1)

, unless

B_{1}

and

B_{2}

agree everywhere, except where

P (F) = 0

.

Proof: The logarithmic g-score is non-negative, because

B (F), P (F) \in [0, 1]

for all F; so,

log B (F) \leq 0

,

P (F) log B (F) \leq 0

, and

g (π) > 0

.

That

S_{g}^{log} (P, B)

is strictly convex as a function of

〈 B 〉

follows from the strict concavity of

log x

. Take distinct

B_{1}, B_{2} \in 〈 B 〉

and

λ \in (0, 1)

, and let

B = λ B_{1} + (1 - λ) B_{2}

. Now:

\begin{matrix} P (F) log B (F) & = P (F) log (λ \cdot B_{1} (F) + (1 - λ) B_{2} (F)) \\ \geq P (F) (λ log B_{1} (F) + (1 - λ) log B_{2} (F)) \\ = λ P (F) log B_{1} (F) + (1 - λ) P (F) log B_{2} (F) \end{matrix}

(26)

with equality iff either

P (F) = 0

or

B_{1} (F) = B_{2} (F)

.

Hence:

\begin{matrix} S_{g}^{log} (P, B) & = & - \sum_{π \in Π} g (π) \sum_{F \in π} P (F) log B (F) \\ \leq & λ S_{g}^{log} (P, B_{1}) + (1 - λ) S_{g}^{log} (P, B_{2}) \end{matrix}

(27)

with equality iff

B_{1}

and

B_{2}

agree everywhere, except possibly where

P (F) = 0

. ■

2.5. Minimising the Worst-Case Logarithmic g-Score

In this section, we shall show that the g-entropy maximiser minimises the worst-case logarithmic g-score.

In order to prove our main result (Theorem 2), we would like to apply a game-theoretic minimax theorem, which will allow us to conclude that:

inf_{B \in B} sup_{P \in E} S_{g}^{log} (P, B) = sup_{P \in E} inf_{B \in B} S_{g}^{log} (P, B)

Note that the expression on the left-hand side describes the minimising worst-case g-score, where the worst case refers to P ranging in

E .

Speaking in game-theoretic lingo: the player playing first on the left-hand side aims to find the belief function(s) that minimises worst-case g-expected loss; again, the worst case is taken with respect to varying

P .

For this approach to work, we would normally need

B

to be some set of mixed strategies. It is not obvious how

B

could be represented as a mixing of finitely many pure strategies. However, there exists a broad literature on minimax theorems [17], and we shall apply a theorem proven in König [18]. This theorem requires that certain level sets, in the set of functions in which the player aiming to minimise may chose his functions, are connected. To apply König’s result, we will thus allow the belief functions, B, to range in

〈 B 〉,

which has this property. It will follow that the

B \in 〈 B 〉 ∖ B

are never good choices for the minimising player playing first: the best choice is in

E

, which is a subset of

B .

Having established that the inf and the sup commute, the rest is straightforward. Since the scoring rule we employ,

S_{g}^{log}

, is strictly proper, we have that the best strategy for the minimising player, answering a move by the maximising player, is to select the same function as the maximising player. Thus, it is best for the maximising player playing first to choose a/the function that maximises

S_{g}^{log} (P, P) .

We will thus find that:

\begin{matrix} sup_{P \in E} inf_{B \in 〈 B 〉} S_{g}^{log} (P, B) = sup_{P \in E} inf_{B \in {P}} S_{g}^{log} (P, B) = sup_{P \in E} S_{g}^{log} (P, P) = sup_{P \in E} H_{g} (P) \end{matrix}

Thus, worst-case g-expected loss and g-entropy have the same value. In game-theoretic terms: we find that our zero-sum g-log-loss game has a value. It remains to be shown that both players, when playing first, have a unique best choice,

P^{†} .

First, then, we shall apply König’s result.

Definition 8 (König [18], p. 56). For

F : X \times Y \to [- \infty, \infty]

, we call

I \subset R

a border interval of F, if and only if I is an interval of the form

I = ({sup}_{x \in X} {inf}_{y \in Y} F (x, y), + \infty) .

Λ \subset R

is called a border set of F if and only if

inf Λ = {sup}_{x \in X} {inf}_{y \in Y} F (x, y) .

For

λ \in R

and

\emptyset \subset K \subseteq Y

, define

s_{λ}

and

σ_{λ}

to consist of

X

and of subsets of

X

of the form:

⋂_{y \in K} [F (\cdot, y) > λ] respectively ⋂_{y \in K} [F (\cdot, y) \geq λ]

For

λ \in R

and finite

\emptyset \subset H \subseteq X

, define

t_{λ}

and

τ_{λ}

to consist of subsets of

Y

of the form:

⋂_{x \in H} [F (x, \cdot) < λ] respectively ⋂_{x \in H} [F (x, \cdot) \leq λ]

The following may be found in König [18] (Theorem 1.3, p. 57):

Lemma 4 (König’s Minimax). Let

X, Y

be topological spaces,

Y

be compact and Hausdorff and let

F : X \times Y \to [- \infty, \infty]

be lower semicontinuous. Then, if Λ is some border set, I some border interval of F and if at least one of the following conditions holds:

for all $λ \in Λ$ , all members of $s_{λ}$ and $τ_{λ}$ are connected;
for all $λ \in Λ$ , all members of $s_{λ}$ are connected and all $λ \in I$ all $t_{λ}$ are connected;
for all $λ \in Λ$ , all members of $σ_{λ}$ and $t_{λ}$ are connected;
for all $λ \in Λ$ , all members of $σ_{λ}$ are connected and all $λ \in I$ all $τ_{λ}$ are connected;

then:

inf_{y \in Y} sup_{x \in X} F (x, y) = sup_{x \in X} inf_{y \in Y} F (x, y)

Lemma 5.

S_{g}^{log} : E \times 〈 B 〉 \to [0, \infty]

is lower semicontinuous.

Proof: It suffices to show that

{(P, B) \in E \times 〈 B 〉 | S_{g}^{log} (P, B) \leq r}

is closed for all

r \in R .

For

r \in R

consider a sequence

{(P_{t}, B_{t})}_{t \in N}

with

{lim}_{t \to \infty} (P_{t}, B_{t}) = (P, B)

, such that

S_{g}^{log} (P_{t}, B_{t}) \leq r

for all t. Then:

\begin{matrix} S_{g}^{log} (P, B) & = - \sum_{π \in Π} g (π) \sum_{F \in π} P (F) log B (F) \\ = \sum_{π \in Π} \sum_{\begin{matrix} F \in π \\ g (π) P (F) > 0 \end{matrix}} - g (π) P (F) log B (F) . \end{matrix}

(28)

If

g (π) P (F) > 0

and

B_{t} (F)

converges to zero, then there is an

T \in N

such that for all

t \geq T

,

- g (π) P_{t} (F) log B_{t} (F) > r + 1 .

Thus,

B_{t} (F)

cannot converge to zero if

P (F) > 0 .

Since

(B_{t})

converges, it has to converge to some

B (F) > 0 .

Thus, when

g (π) P (F) > 0

we have that

- g (π) P (F) log B (F) = {lim}_{t \to \infty} - g (π) P_{t} (F) log B_{t} (F) \leq r .

From

S_{g}^{log} (P_{t}, B_{t}) \leq r

, we conclude that:

\begin{matrix} \sum_{π \in Π} \sum_{\begin{matrix} F \in π \\ g (π) P (F) > 0 \end{matrix}} - g (π) P (F) log B (F) & = lim_{t \to \infty} \sum_{π \in Π} \sum_{\begin{matrix} F \in π \\ g (π) P (F) > 0 \end{matrix}} - g (π) P_{t} (F) log B_{t} (F) \\ \leq r \end{matrix}

(29)

■

Proposition 5. For all

E

:

\begin{matrix} inf_{B \in 〈 B 〉} sup_{P \in E} S_{g}^{log} (P, B) = sup_{P \in E} inf_{B \in 〈 B 〉} S_{g}^{log} (P, B) \end{matrix}

Proof: It suffices to verify that the conditions of Lemma 4 are satisfied.

E, 〈 B 〉

are subsets of

R^{| Ω |}

,

R^{| P Ω |}

respectively, thus naturally equipped with the induced topology.

〈 B 〉

is compact and Hausdorff (see Lemma 1).

S_{g}^{log} : E \times 〈 B 〉 \to [0, \infty]

is lower semicontinuous (see Lemma 5).

We need to show that one of the connectivity conditions holds. In fact, they all hold, as we shall see.

Note that

E, 〈 B 〉

are connected, since they are convex.

For the

s_{λ}

and

σ_{λ}

consider any

B \in 〈 B 〉

and suppose that

P, P^{'} \in E

are such that

S_{g}^{log} (P, B) \overset{\geq}{>} λ

and

S_{g}^{log} (P^{'}, B) \overset{\geq}{>} λ .

Then for

η \in (0, 1)

we have:

\begin{matrix} S_{g}^{log} (η P + (1 - η) P^{'}, B) & = - \sum_{π \in Π} g (π) \sum_{F \in π} (η P + (1 - η) P^{'}) (F) log B (F) \\ = η S_{g}^{log} (P, B) + (1 - η) S_{g}^{log} (P^{'}, B) \\ \overset{\geq}{>} λ \end{matrix}

(30)

Thus,

\begin{matrix} {P \in E | S_{g}^{log} (P, B) \overset{\geq}{>} λ} \end{matrix}

(31)

is convex for all

B \in 〈 B 〉 .

Thus, every intersection of such sets is convex. Hence, these intersections are connected. (If any such intersection is empty, then it is trivially connected.)

For the

t_{λ}

and

τ_{λ}

, note that for every

P \in P

, we have that

\begin{matrix} {B \in 〈 B 〉 | S_{g}^{log} (P, B) \overset{\leq}{<} λ} \end{matrix}

(32)

is convex, which follows from Proposition 4 by noting that for a convex function (here,

S_{g}^{log} (P, \cdot)

) on a convex set (here,

〈 B 〉

), the set of elements in the domain that are mapped to a number (strictly) less than λ is convex for all

λ \in R .

Thus, every intersection of such sets is convex. Hence, these intersections are connected. ■

The suprema and infima referred to in Proposition 5 may not be achieved at points of

E

. If not, they will be achieved instead at points in the closure,

[E]

, of

E

. We shall use

arg {sup}_{P \in E}

(and

arg {inf}_{P \in E}

) to refer to the points in

[E]

that achieve the supremum (respectively, infimum), whether or not these points are in

E

.

Theorem 2. As usual,

E

is taken to be convex and g inclusive. We have that:

\begin{matrix} arg sup_{P \in E} H_{g} (P) = arg inf_{B \in B} sup_{P \in E} S_{g}^{log} (P, B) \end{matrix}

(33)

Proof: We shall prove the following slightly stronger equality, allowing B to range in

〈 B 〉

, instead of

B

:

\begin{matrix} arg sup_{P \in E} H_{g} (P) = arg inf_{B \in 〈 B 〉} sup_{P \in E} S_{g}^{log} (P, B) \end{matrix}

(34)

The theorem then follows from the following fact. The right-hand side of Equation (34) is an optimization problem, where the optimum (here, we look for the infimum of

{sup}_{P \in E} S_{g}^{log} (P, \cdot)

) uniquely obtains for a certain value (here,

P^{†}

). Restricting the domain of the variables (here, from

〈 B 〉

to

B

) in the optimization problem, to a subdomain that contains optimum

P^{†} \in [E] \subseteq B \subseteq 〈 B 〉

, does not change where the optimum obtains nor the value of the optimum.

Note that:

\begin{matrix} sup_{P \in E} H_{g} (P) & = sup_{P \in E} S_{g}^{log} (P, P) \\ \overset{}{=} sup_{P \in E} inf_{B \in 〈 B 〉} S_{g}^{log} (P, B) \\ \overset{}{=} inf_{B \in 〈 B 〉} sup_{P \in E} S_{g}^{log} (P, B) \end{matrix}

(35)

The first equality is simply the definition of

H_{g} .

The second equality follows directly from strict propriety (Corollary 3). To obtain the third line, we apply Proposition 5.

It remains to show that we can introduce arg on both sides of Equation (33).

The following sort of argument seems to be folklore in game theory; we here adapt (Lemma 4.1 on p. 1384 in [3]) for our purposes. We have:

\begin{matrix} P^{†} : & = arg sup_{P \in E} S_{g}^{log} (P, P) \end{matrix}

(36)

\begin{matrix} = arg sup_{P \in E} inf_{B \in 〈 B 〉} S_{g}^{log} (P, B) \end{matrix}

(37)

The

arg sup

in Equation (36) is unique (Corollary 2). Equation (37) follows from strict propriety of

S_{g}^{log}

(Corollary 3). Now let:

\begin{matrix} B^{†} \in arg inf_{B \in 〈 B 〉} sup_{P \in E} S_{g}^{log} (P, B) \end{matrix}

(38)

Then:

\begin{matrix} S_{g}^{log} (P^{†}, P^{†}) & = sup_{P \in E} inf_{B \in 〈 B 〉} S_{g}^{log} (P, B) \\ = inf_{B \in 〈 B 〉} S_{g}^{log} (P^{†}, B) \\ \leq S_{g}^{log} (P^{†}, B^{†}) \\ \leq sup_{P \in E} S_{g}^{log} (P, B^{†}) \end{matrix}

(39)

\begin{matrix} = inf_{B \in 〈 B 〉} sup_{P \in E} S_{g}^{log} (P, B) \end{matrix}

(40)

The first equality follows from the definition of

P^{†}

; see Equations (36) and (37). That we may drop the sup again follows from the definition of

P^{†},

since

P^{†}

maximises

{inf}_{B \in 〈 B 〉} S_{g}^{log} (\cdot, B) .

The inequalities hold, since dropping a minimisation and introducing a maximisation can only lead to an increase. The final inequality is immediate from the definition of

B^{†}

minimising

{sup}_{P \in E} S_{g}^{log} (P, \cdot) .

By Proposition 5, all inequalities above are in fact equalities. From

S_{g}^{log} (P^{†}, P^{†}) = S_{g}^{log} (P^{†}, B^{†})

and strict propriety, we may now infer that

B^{†} = P^{†} .

■

In sum, then, if an agent is to minimise her worst-case g-score, then her belief function needs to be the probability function in

E

that maximises g-entropy, as long as this entropy maximiser is in

E

. That the belief function is to be a probability function is the content of the probability norm; that it is to be in

E

is the content of the calibration norm; that it is to maximise g-entropy is related to the equivocation norm. We shall defer a full discussion of the equivocation norm to Section 4. In the next section, we shall show that the arguments of this section generalise to belief as defined over sentences rather than propositions. This will imply that logically equivalent sentences should be believed to the same extent—an important component of the probability norm in the sentential framework.

We shall conclude this section by providing a slight generalisation of the previous result. Note that, thus far, when considering worst-case g-score, this worst case is with respect to a chance function taken to be in

E = 〈 P^{*} 〉

. However, the evidence determines something more precise, namely that the chance function is in

P^{*}

, which is not assumed to be convex. The following result indicates that our main argument will carry over to this more precise setting.

Theorem 3. Suppose

P^{*} \subseteq P

is such that the unique g-entropy maximiser,

P^{†}

, for

[E] = [〈 P^{*} 〉]

, is in

[P^{*}]

. Then:

\begin{matrix} P^{†} = arg sup_{P \in E} H_{g} (P) = arg inf_{B \in B} sup_{P \in P^{*}} S_{g}^{log} (P, B) \end{matrix}

(41)

Proof: As in the previous proof, we shall prove a slightly stronger equality:

\begin{matrix} P^{†} = arg sup_{P \in E} H_{g} (P) = arg inf_{B \in 〈 B 〉} sup_{P \in P^{*}} S_{g}^{log} (P, B) \end{matrix}

(42)

The result follows for the same reasons given in the proof of Theorem 2.

From the strict propriety of

S_{g}^{log}

, we have:

\begin{matrix} S_{g}^{log} (P^{†}, P^{†}) & = inf_{B \in 〈 B 〉} S_{g}^{log} (P^{†}, B) \\ \leq inf_{B \in 〈 B 〉} sup_{P \in P^{*}} S_{g}^{log} (P, B) \\ \leq inf_{B \in 〈 B 〉} sup_{P \in 〈 P^{*} 〉} S_{g}^{log} (P, B) \\ = sup_{P \in 〈 P^{*} 〉} S_{g}^{log} (P, P^{†}) \\ = S_{g}^{log} (P^{†}, P^{†}) \end{matrix}

(43)

where the last two equalities are simply Theorem 2. Hence:

\begin{matrix} inf_{B \in 〈 B 〉} sup_{P \in P^{*}} S_{g}^{log} (P, B) = S_{g}^{log} (P^{†}, P^{†}) = sup_{P \in E} H_{g} (P) = sup_{P \in P^{*}} H_{g} (P) \end{matrix}

That is, the lowest worst case expected loss is the same for

P \in [P^{*}]

and

P \in [〈 P^{*} 〉] .

Furthermore, since

S_{g}^{log} (P^{†}, P^{†}) = {sup}_{P \in [〈 P^{*} 〉]} S_{g}^{log} (P, P^{†})

and since

P^{†} \in [P^{*}]

, we have

S_{g}^{log} (P^{†}, P^{†}) = {sup}_{P \in P^{*}} S_{g}^{log} (P, P^{†})

. Thus,

B = P^{†}

minimises

{sup}_{P \in P^{*}} S_{g}^{log} (P, B) .

Now, suppose that

B^{'} \in 〈 B 〉

is different from

P^{†} .

Then:

\begin{matrix} sup_{P \in P^{*}} S_{g}^{log} (P, B^{'}) \geq S_{g}^{log} (P^{†}, B^{'}) > S_{g}^{log} (P^{†}, P^{†}) \end{matrix}

(44)

where the strict inequality follows from strict propriety. This shows that adopting

B^{'} \neq P^{†}

leads to an avoidably bad score.

Hence,

B = P^{†}

is the unique function in

〈 B 〉

which minimises

{sup}_{P \in P^{*}} S_{g}^{log} (P, B) .

■

3. Belief over Sentences

Armed with our results for beliefs defined over propositions, we now tackle the case of beliefs defined over sentences,

S L

, of a propositional language,

L .

The plan is as follows. First, we normalise the belief functions in Section 3.1. In Section 3.2, we motivate the use of logarithmic loss as a default loss function. We are able to define our logarithmic scoring rule in Section 3.3, and we show there that, with respect to our scoring rule, the generalised entropy maximiser is the unique belief function that minimises the worst-case expected loss.

Again, we shall not impose any restriction—such as additivity—on the agent’s belief function, now defined on the sentences of the propositional language

L

. In particular, we do not assume that the agent’s belief function assigns logically equivalent sentences the same degree of belief. We shall show that any belief function violating this property incurs an avoidable loss. Thus, the results of this section allow us to show more than we could in the case of belief functions defined over propositions.

Several of the proofs in this section are analogous to the proofs of corresponding results presented in Section 2. They are included here in full for the sake of completeness; the reader may wish to skim over those details that are already familiar.

3.1. Normalisation

S L

is the set of sentences of propositional language

L

, formed as usual by recursively applying the connectives,

\neg, \lor, \land, \to, \leftrightarrow

, to the propositional variables,

A_{1}, \dots, A_{n}

. A non-normalised belief function,

bel : S L ⟶ R_{\geq 0}

, is thus a function that maps any sentence of the language to a non-negative real number. As in Section 2.1, for technical convenience, we shall focus our attention on normalised belief functions.

Definition 9 (Representation). A sentence,

θ \in S L

, represents the proposition

F = {ω : ω ⊧ θ}

. Let

F

be a set of pairwise distinct propositions. We say that

Θ \subseteq S L

is a set of representatives of

F,

if and only if each sentence in Θ represents some proposition in

F

and each proposition in

F

is represented by a unique sentence in Θ. A set, ρ, of representatives of

P Ω

will be called a representation. We denote by ϱ the set of all representations. For a set of pairwise distinct propositions,

F

, and a representation,

ρ \in ϱ

, we denote by

ρ (F) \subset S L

the set of sentences in ρ that represent the propositions in

F .

We call

π_{L} \subseteq S L

a partition of

S L,

if and only if it is a set of representatives of some partition

π \in Π

of propositions. We denote by

Π_{L}

the set of these

π_{L} .

Definition 10 (Normalised belief function on sentences). Define the set of normalized belief functions on

S L

as:

\begin{matrix} B_{L} : = {B_{L} : S L ⟶ [0, 1] : \sum_{φ \in π_{L}} B_{L} (φ) \leq 1 for all π_{L} \in Π_{L} and \sum_{φ \in π_{L}} B_{L} (φ) = 1 for some π_{L} \in Π_{L}} \end{matrix}

The set of probability functions is defined as:

\begin{matrix} P_{L} : = {P_{L} : S L ⟶ [0, 1] : \sum_{φ \in π_{L}} P_{L} (φ) = 1 for all π_{L} \in Π_{L}} \end{matrix}

As in the proposition case, we have:

Proposition 6.

P_{L} \in P_{L}

iff

P_{L} : S L ⟶ [0, 1]

satisfies the axioms of probability:

P1:: $P_{L} (τ) = 1$ for all tautologies $τ .$
P2:: If $⊧ \neg (φ \land ψ)$ then $P_{L} (φ \lor ψ) = P_{L} (φ) + P_{L} (ψ)$ .

Proof: Suppose

P_{L} \in P_{L}

. For any tautology,

τ \in S L

, it holds that

P_{L} (τ) = 1

, because

{τ}

is a partition in

Π_{L} .

P_{L} (\neg τ) = 0

, because

{τ, \neg τ}

is a partition in

Π_{L}

and

P_{L} (τ) = 1

.

Suppose that

φ, ψ \in S L

are such that

⊧ \neg (φ \land ψ)

. We shall proceed by cases to show that

P_{L} (φ \lor ψ) = P_{L} (φ) + P_{L} (ψ)

. In the first three cases, one of the sentences is a contradiction, in the last two cases, there are no contradictions.

(i): $⊧ φ$ and $⊧ \neg ψ,$ then $⊧ φ \lor ψ .$ Thus, by the above $P_{L} (φ) = 1$ and $P_{L} (ψ) = 0$ , and hence, $P_{L} (φ \lor ψ) = 1 = P_{L} (φ) + P_{L} (ψ) .$
(ii): $⊧ \neg φ$ and $⊧ \neg ψ,$ then $⊧ \neg φ \lor \neg ψ .$ Thus, $P_{L} (φ \lor ψ) = 0 = P_{L} (φ) + P_{L} (ψ) .$
(iii): $⊭ \neg φ,$ $⊭ φ,$ and $⊧ \neg ψ,$ then ${φ \lor ψ, \neg φ \lor ψ}$ and ${φ, \neg φ \lor ψ}$ are both partitions in $Π_{L} .$ Thus, $P_{L} (φ \lor ψ) + P_{L} (\neg φ \lor ψ) = 1 = P_{L} (φ) + P_{L} (\neg φ \lor ψ) .$ Putting these observations together, we now find $P_{L} (φ \lor ψ) = P_{L} (φ) = P_{L} (φ) + P_{L} (ψ) .$
(iv): $⊭ \neg φ,$ $⊭ \neg ψ$ and $⊧ φ \leftrightarrow \neg ψ,$ then ${φ, ψ}$ is a partition and $φ \lor ψ$ is a tautology. Hence, $P_{L} (φ) + P_{L} (ψ) = 1$ and $P_{L} (φ \lor ψ) = 1$ . This now yields $P_{L} (φ) + P_{L} (ψ) = P_{L} (φ \lor ψ)$ .
(v): $⊭ \neg φ,$ $⊭ \neg ψ$ and $⊭ φ \leftrightarrow \neg ψ,$ then none of the following sentences is a tautology or a contradiction: $φ, ψ, φ \lor ψ, \neg (φ \lor ψ) .$ Since ${φ, ψ, \neg (φ \lor ψ)}$ and ${φ \lor ψ, \neg (φ \lor ψ)}$ are both partitions in $Π_{L}$ , we obtain $P_{L} (φ) + P_{L} (ψ) = 1 - P_{L} (\neg (φ \lor ψ)) = P_{L} (φ \lor ψ)$ . So, $P_{L} (φ) + P_{L} (ψ) = P_{L} (φ \lor ψ)$ .

On the other hand, suppose P1 and P2 hold. That

\sum_{φ \in π_{L}} P_{L} (φ) = 1

holds for all

π_{L} \in Π_{L}

can be seen by induction on the size of

π_{L}

. If

| π_{L} | = 1

, then

π = {τ}

for some tautology

τ \in S L

, and

P_{L} (τ) = 1

by P1. Suppose then that

π_{L} = {φ_{1}, \dots, φ_{k + 1}}

for

k \geq 1

. Now,

\sum_{i = 1}^{k - 1} P_{L} (φ_{i}) + P_{L} (φ_{k} \lor φ_{k + 1}) = 1

by the induction hypothesis. Furthermore,

P_{L} (φ_{k} \lor φ_{k + 1}) = P_{L} (φ_{k}) + P_{L} (φ_{k + 1})

by P2, so

\sum_{φ \in π_{L}} P_{L} (φ) = 1

, as required. ■

Definition 11 (Respects logical equivalence). We say that a belief function

B_{L} \in 〈 B_{L} 〉

respects logical equivalence if and only if

⊧ φ \leftrightarrow ψ

implies

B_{L} (φ) = B_{L} (ψ) .

Proposition 7. The probability functions

P_{L} \in P_{L}

respect logical equivalence.

Proof: Suppose

P_{L} \in P_{L}

and assume that

φ, ψ \in S L

are logically equivalent. Note that

ψ \land \neg φ ⊧ A_{1} \land \neg A_{1},

ψ \lor \neg φ ⊧ A_{1} \lor \neg A_{1}

and that

{φ, \neg φ}

and

{ψ, \neg φ}

are partitions in

Π_{L} .

Hence:

\begin{matrix} P_{L} (φ) + P_{L} (\neg φ) = 1 = P_{L} (ψ) + P_{L} (\neg φ) \end{matrix}

(45)

Therefore,

P_{L} (φ) = P_{L} (ψ) .

Thus, the

P_{L} \in P_{L}

assign logically equivalent formulae the same probability. ■

3.2. Loss

By analogy with the line of argument of Section 2.3, we shall suppose that a default loss function,

L : S L \times 〈 B_{L} 〉 \to (- \infty, \infty]

, satisfies the following requirements:

L1.: $L (φ, B_{L}) = 0,$ if $B_{L} (φ) = 1 .$
L2.: $L (φ, B_{L})$ strictly increases as $B_{L} (φ)$ decreases from one towards zero.
L3.: $L (φ, B_{L})$ only depends on $B_{L} (φ)$ .

Suppose we have a fixed belief function,

B_{L} \in 〈 B_{L} 〉

, such that

B_{L} (τ) = 1

for any tautology, τ, and

L = L_{1} \cup L_{2}

, where

L_{1}

and

L_{2}

are independent sublanguages, written

L_{1} ⫫_{B_{L}} L_{2}

, i.e.,

B_{L} (ϕ_{1} \land ϕ_{2}) = B_{L} (ϕ_{1}) \cdot B_{L} (ϕ_{2})

for all

ϕ_{1} \in S L_{1}

and

ϕ_{2} \in S L_{2}

. Let

B_{⇂ L_{1}} (ϕ_{1}) : = B_{L} (ϕ_{1})

,

B_{⇂ L_{2}} (ϕ_{2}) : = B_{L} (ϕ_{2})

.

L4.: Losses are additive when the language is composed of independent sublanguages: if $L = L_{1} \cup L_{2}$ for $L_{1} ⫫_{B_{L}} L_{2}$ , then $L (ϕ_{1} \land ϕ_{2}, B_{L}) = L_{1} (ϕ_{1}, B_{⇂ L_{1}}) + L_{2} (ϕ_{2}, B_{⇂ L_{2}})$ , where $L_{1}, L_{2}$ are loss functions defined on $L_{1}, L_{2}$ , respectively.

Theorem 4. If a loss function, L, on

S L \times 〈 B_{L} 〉

satisfies L1–4, then

L (φ, B_{L}) = - k log B_{L} (φ)

, where the constant,

k > 0

, does not depend on the language,

L

.

Proof: We shall first focus on a loss function, L, defined with respect to a language,

L

, that contains at least two propositional variables.

L3 implies that

L (φ, B_{L}) = f_{L} (B_{L} (φ))

for some function,

f_{L} : [0, 1] ⟶ (- \infty, \infty]

. For our fixed

L

and all

x, y \in [0, 1]

, choose some

B_{L} \in 〈 B_{L} 〉

such that

L = L_{1} \cup L_{2}

,

L_{1} ⫫_{B_{L}}, L_{2}

B_{L} (ϕ_{1}) = x

, and

B_{L} (ϕ_{2}) = y

for some

ϕ_{1} \in S L_{1}, ϕ_{2} \in S L_{2} .

This is possible, because

L

contains at least two propositional variables.

Note that since

L_{1}

and

L_{2}

are independent sublanguages, given some specific tautology,

τ_{1}

, of

L_{1}

:

1 = B_{L} (τ_{1}) = B_{⇂ L_{1}} (τ_{1})

(46)

B_{L} (τ_{1})

is well defined, since

τ_{1}

is a tautology of

S L_{1}

, and every sentence in

S L_{1}

is a sentence in

S L

. Similarly,

B_{⇂ L_{2}} (τ_{2}) = 1

for some specific tautology

τ_{2}

of

L_{2}

. By L1, then,

L_{1} (τ_{1}, B_{⇂ L_{1}}) = L_{2} (τ_{2}, B_{⇂ L_{2}}) = 0,

where

L_{1},

respectively,

L_{2},

are the loss functions with respect to

S L_{1}

and

S L_{2}

satisfying L1–4. Thus:

\begin{matrix} f_{L} (x \cdot y) & \overset{}{=} & f_{L} (B_{L} (ϕ_{1}) \cdot B_{L} (ϕ_{2})) \\ \overset{L_{3}}{=} & L (ϕ_{1} \land ϕ_{2}, B_{L}) \\ \overset{L 4}{=} & L_{1} (ϕ_{1}, B_{⇂ L_{1}}) + L_{2} (ϕ_{2}, B_{⇂ L_{2}}) \\ \overset{L 4}{=} & [L (ϕ_{1} \land τ_{2}, B_{L}) - L_{2} (τ_{2}, B_{⇂ L_{2}})] \\ + [L (τ_{1} \land ϕ_{2}, B_{L}) - L_{1} (τ_{1}, B_{⇂ L_{1}})] \\ \overset{L 1}{=} & L (ϕ_{1} \land τ_{2}, B_{L}) + L (τ_{1} \land ϕ_{2}, B_{L}) \\ \overset{L 3}{=} & f_{L} (B_{L} (ϕ_{1} \land τ_{2})) + f_{L} (B_{L} (ϕ_{2} \land τ_{1})) \\ \overset{}{=} & f_{L} (B_{⇂ L_{1}} (ϕ_{1}) \cdot B_{⇂ L_{2}} (τ_{2})) + f_{L} (B_{⇂ L_{1}} (τ_{1}) \cdot B_{⇂ L_{2}} (ϕ_{2})) \\ \overset{(46)}{=} & f_{L} (B_{⇂ L_{1}} (ϕ_{1})) + f_{L} (B_{⇂ L_{2}} (ϕ_{2})) \\ \overset{}{=} & f_{L} (B_{L} (ϕ_{1})) + f_{L} (B_{L} (ϕ_{2})) \\ \overset{}{=} & f_{L} (x) + f_{L} (y) \end{matrix}

(47)

The negative logarithm on

(0, 1]

is characterisable up to a multiplicative constant,

k_{L}

, in terms of this additivity, together with the condition that

f_{L} (x) \geq 0

, which is implied by L1–2 (see, e.g., Theorem 0.2.5 in [7]). L2 ensures that

f_{L}

is not zero everywhere, so

k_{L} > 0

. As in the corresponding proof for propositions, it follows that

f_{L} (0) = + \infty .

Thus far, we have shown that for a fixed language,

L

, with at least two propositional variables,

L (F, B_{L}) = - k_{L} log B_{L} (F)

on

[0, 1] .

Now, focus on an arbitrary language,

L_{1}

, and a corresponding loss function,

L_{1}

. We can choose

L_{2}, L, B_{L}

such that

L

is composed of independent sublanguages,

L_{1}

and

L_{2} .

By reasoning analogous to that above:

\begin{matrix} f_{L_{1}} (B_{⇂ L_{1}} (ϕ_{1})) & = & L_{1} (ϕ_{1}, B_{⇂ L_{1}}) \\ = & L (ϕ_{1} \land τ_{2}, B_{L}) \\ = & f_{L} (B_{L} (ϕ_{1} \land τ_{2})) \\ = & f_{L} (B_{L} (ϕ_{1}) \cdot 1) \\ = & - k_{L} log B_{⇂ L_{1}} (ϕ_{1}) \end{matrix}

(48)

Therefore, the loss function for

L_{1}

is

L_{1} (ϕ_{1}, B_{⇂ L_{1}}) = - k_{L} log B_{⇂ L_{1}} (ϕ_{1}) .

Thus, the constant,

k_{L}

, does not depend on

L

after all.

In general, then,

L (F, B_{L}) = - k log B_{L} (F)

for some positive k. ■

Since multiplication by a constant is equivalent to a change of base, we can take log to be the natural logarithm. Since we will be interested in the belief functions that minimise loss, rather than in the absolute value of any particular losses, we can take

k = 1

without loss of generality. Theorem 4 thus allows us to focus on the logarithmic loss function:

L^{log} (F, B_{L}) : = - log B_{L} (F)

(49)

3.3. Score, Entropy and Their Connection

In the case of belief over sentences, the expected loss varies according to which sentences are used to represent the various partitions of propositions. We can define the g-score to be the worst-case expected loss, where this worst case is taken over all possible representations:

Definition 12 (g-score). Given a loss function,

L,

an inclusive weighting function,

g : Π ⟶ R_{\geq 0}

, and a representation,

ρ \in ϱ

, we define the representation-relative g-score

S_{g, ρ}^{L} : P_{L} \times 〈 B_{L} 〉 ⟶ [- \infty, \infty]

by

S_{g, ρ}^{L} (P_{L}, B_{L}) : = \sum_{π \in Π} g (π) \sum_{φ \in ρ (π)} P_{L} (φ) L (φ, B_{L})

(50)

and the (representation-independent) g-score

S_{g, L}^{L} : P_{L} \times 〈 B_{L} 〉 ⟶ [- \infty, \infty]

by

S_{g, L}^{L} (P_{L}, B_{L}) : = sup_{ρ \in ϱ} S_{g, ρ}^{L} (P_{L}, B_{L})

(51)

In particular, for the logarithmic loss function under consideration here, we have:

S_{g, ρ}^{log} (P_{L}, B_{L}) : = - \sum_{π \in Π} g (π) \sum_{φ \in ρ (π)} P_{L} (φ) log B_{L} (φ)

(52)

and:

S_{g, L}^{log} (P_{L}, B_{L}) : = sup_{ρ \in ϱ} S_{g, ρ}^{log} (P_{L}, B_{L})

(53)

We can thus define the g-entropy of a belief function on

S L

as:

\begin{matrix} H_{g, L} (B_{L}) : = S_{g, L}^{log} (B_{L}, B_{L}) \end{matrix}

(54)

There is a canonical one-to-one correspondence between the

B_{L} \in 〈 B_{L} 〉

which respect logical equivalence and the

B \in 〈 B 〉 .

In particular,

P_{L}

can be identified with

P .

Moreover, any convex

E \subseteq P

is in one-to-one correspondence with a convex

E_{L} \subseteq P_{L} .

In the following, we shall make frequent use of this correspondence. For a

B_{L} \in 〈 B_{L} 〉

which respects logical equivalence, we denote by B the function in

〈 B 〉

with which it stands in one-to-one correspondence.

Lemma 6. If

B_{L} \in 〈 B_{L} 〉

respects logical equivalence, then for all

ρ \in ϱ

, we have

S_{g, L}^{log} (P_{L}, B_{L}) = {sup}_{ρ \in ϱ} S_{g, ρ}^{log} (P_{L}, B_{L}) = S_{g}^{log} (P, B) .

Proof: Simply note that

S_{g, ρ}^{log} (P_{L}, B_{L})

does not depend on

ρ .

■

Lemma 7. For all convex

E_{L} \subseteq P_{L}

:

\begin{matrix} B_{L}^{†} \in arg inf_{B_{L} \in 〈 B_{L} 〉} sup_{P_{L} \in E_{L}} sup_{ρ \in ϱ} S_{g, ρ}^{log} (P_{L}, B_{L}) \end{matrix}

respects logical equivalence.

Proof: Suppose that:

\begin{matrix} B_{L}^{†} \in arg inf_{B_{L} \in 〈 B_{L} 〉} sup_{P_{L} \in E_{L}} sup_{ρ \in ϱ} S_{g, ρ}^{log} (P_{L}, B_{L}) \end{matrix}

(55)

and assume that

B_{L}^{†}

does not respect logical equivalence. Then, define:

\begin{matrix} B_{L}^{inf} (φ) : = inf_{\begin{matrix} θ \in S L \\ ⊧ θ \leftrightarrow φ \end{matrix}} B_{L}^{†} (θ) \end{matrix}

(56)

Since

B_{L}^{†}

does not respect logical equivalence, there are logically equivalent

φ, ψ

such that

B_{L}^{†} (φ) \neq B_{L}^{†} (ψ) .

Hence,

B_{L}^{inf} (φ) < max {B_{L}^{†} (φ), B_{L}^{†} (ψ)} .

Thus, for every

π_{L} \in Π_{L}

with

φ \in π_{L}

, we have

\sum_{χ \in π_{L}} B_{L}^{inf} (χ) < 1 .

Thus,

B_{L}^{inf} \notin P_{L} .

B_{L}^{inf}

respects logical equivalence by definition.

Now, consider the function

B^{inf} : P Ω ⟶ [0, 1]

which is determined by

B_{L}^{inf} .

Clearly,

B^{inf} \notin P .

There are two cases to consider.

(a)

B^{inf} \in 〈 B 〉 ∖ P .

Since

B^{inf} \notin P

, by Theorem 2, we have that:

\begin{matrix} sup_{P \in E} S_{g}^{log} (P, B^{inf}) > inf_{B \in 〈 B 〉} sup_{P \in E} S_{g}^{log} (P, B) \end{matrix}

(57)

(b)

B^{inf} \notin 〈 B 〉

. Then, define

B^{'}

by

B^{'} (F) : = B^{inf} (F) + δ

for all

F \subseteq Ω

, where

δ \in (0, 1]

is minimal such that

B^{'} \in 〈 B 〉 .

In particular,

B^{'} (\emptyset) \geq δ > 0,

thus

B^{'} \notin P .

Moreover, whenever

P (F) > 0

, it holds that

- P (F) log B^{inf} (F) > - P (F) log B^{'} (F) < + \infty .

For the remainder of this proof, we shall extend the definition of the logarithmic g-score

S_{g}^{log} (P, B)

by allowing the belief function, B, to be any non-negative function defined on

P Ω,

rather than just

B \in 〈 B 〉

—if

B \notin 〈 B 〉

, we shall be careful not to appeal to results that assume

B \in 〈 B 〉

. We thus find for all

P \in P

that

S_{g}^{log} (P, B^{inf}) > S_{g}^{log} (P, B^{'}) < + \infty .

Thus, by Theorem 2, we obtain the sharp inequality in the following:

\begin{matrix} sup_{P \in E} S_{g}^{log} (P, B^{inf}) & \geq sup_{P \in E} S_{g}^{log} (P, B^{'}) \\ > inf_{B \in 〈 B 〉} sup_{P \in E} S_{g}^{log} (P, B) \end{matrix}

(58)

For both cases, we will obtain a contradiction:

\begin{matrix} S_{g}^{log} (P^{†}, P^{†}) & = sup_{P \in E} S_{g}^{log} (P, P^{†}) \end{matrix}

(59)

\begin{matrix} = sup_{P_{L} \in E_{L}} sup_{ρ \in ϱ} S_{g, ρ}^{log} (P_{L}, P_{L}^{†}) \end{matrix}

(60)

\begin{matrix} \geq inf_{B_{L} \in 〈 B_{L} 〉} sup_{P_{L} \in E_{L}} sup_{ρ \in ϱ} S_{g, ρ}^{log} (P_{L}, B_{L}) \end{matrix}

(61)

\begin{matrix} \overset{(55)}{=} sup_{P_{L} \in E_{L}} sup_{ρ \in ϱ} S_{g, ρ}^{log} (P_{L}, B_{L}^{†}) \end{matrix}

(62)

\begin{matrix} = sup_{P_{L} \in E_{L}} - \sum_{π \in Π} g (π) \sum_{φ \in ρ (π)} P_{L} (φ) inf_{\begin{matrix} θ \in S L \\ ⊧ θ \leftrightarrow φ \end{matrix}} log B_{L}^{†} (θ) for all ρ \in ϱ \end{matrix}

(63)

\begin{matrix} \overset{}{=} sup_{P_{L} \in E_{L}} S_{g, ρ}^{log} (P_{L}, B_{L}^{inf}) for all ρ \in ϱ \end{matrix}

(64)

\begin{matrix} = sup_{P \in E} S_{g}^{log} (P, B^{inf}) \end{matrix}

(65)

\begin{matrix} > inf_{B \in 〈 B 〉} sup_{P \in E} S_{g}^{log} (P, B) \end{matrix}

(66)

\begin{matrix} = S_{g}^{log} (P^{†}, P^{†}) \end{matrix}

(67)

We obtain Equation (59) by noticing that

P^{†}

is the unique function minimising worst-case g-expected loss (Theorem 2) and recalling that the expressions in Equation (39) and Equation (40) are equal.

Equation (60) is immediate, as the probability functions respect logical equivalence. For Equation (63), note that

P_{L}

respects logical equivalence. Furthermore, since

- log (\cdot)

is strictly decreasing, a smaller value of

B_{L} (φ)

leads to a greater score.

Equation (64) follows from Equation (56) and Lemma 6, since

B_{L}^{inf}

respects logical equivalence. Hence,

S_{g, ρ}^{log} (P, B_{L}^{inf})

does not depend on the partition

ρ .

The inequality (66), we have seen above in the two cases, Equation (57) and Equation (58). Equation (67) is again implied by Theorem 2.

We have thus found a contradiction. Hence, the

B_{L}^{†} \in arg inf_{B_{L} \in 〈 B_{L} 〉} sup_{P_{L} \in E_{L}} sup_{ρ \in ϱ} S_{g, ρ}^{log} (P_{L}, B_{L})

(68)

have to respect logical equivalence. ■

Theorem 2, the key result in the case of belief over propositions, generalises to the case of belief over sentences:

Theorem 5. As usual,

E_{L} \subseteq P_{L}

is taken to be convex and g inclusive. We have that:

\begin{matrix} arg sup_{P_{L} \in E_{L}} H_{g, L} (P_{L}) = arg inf_{B_{L} \in B_{L}} sup_{P_{L} \in E_{L}} S_{g, L}^{log} (P_{L}, B_{L}) \end{matrix}

Proof: As in the corresponding theorem for the proposition (Theorem 2), we shall prove a slightly stronger equality:

\begin{matrix} arg sup_{P_{L} \in E_{L}} H_{g, L} (P_{L}) = arg inf_{B_{L} \in 〈 B_{L} 〉} sup_{P_{L} \in E_{L}} S_{g, L}^{log} (P_{L}, B_{L}) \end{matrix}

Theorem 5 then follows for the same reasons given in the previous section.

Denote by

〈 B_{L}^{l e} 〉 \subset 〈 B_{L} 〉

the convex hull of functions

B_{L} \in B_{L}

that respect logical equivalence. Let

R e p : 〈 B 〉 ⟶ 〈 B_{L}^{l e} 〉

be the bijective map that assigns to any

B \in 〈 B 〉

the unique

B_{L} \in 〈 B_{L} 〉

which represents it (i.e.,

B (F) = B_{L} (φ),

whenever

F \subseteq Ω

is represented by

φ \in S L

).

\begin{matrix} arg inf_{B_{L} \in 〈 B_{L} 〉} sup_{P_{L} \in E_{L}} S_{g, L}^{log} (P_{L}, B_{L}) & = arg inf_{B_{L} \in 〈 B_{L}^{l e} 〉} sup_{P_{L} \in E_{L}} S_{g, L}^{log} (P_{L}, B_{L}) \end{matrix}

(69)

\begin{matrix} = R e p (arg inf_{B \in B} sup_{P \in E} S_{g}^{log} (P, B)) \end{matrix}

(70)

\begin{matrix} = R e p (P^{†}) \end{matrix}

(71)

\begin{matrix} = P_{L}^{†} \end{matrix}

(72)

Equation (69) is simply Lemma 7. Equation (70) follows directly from applying Lemma 6, and Equation (71) is simply Theorem 2. ■

In the above, we used

P_{L}^{†}

to denote the probability function in

E_{L}

which represents the g-entropy maximiser,

P^{†} \in E .

Now, note that

H_{g, L} (P_{L}) = H_{g} (P) .

Thus,

P_{L}^{†}

is not only the function representing

P^{†}

; it is also the unique function in

E_{L}

which maximises g-entropy

H_{g, L}

.

Theorem 3 also extends to the sentence framework. As we shall now see, the worst-case g-score can be taken with respect to a chance function in

P_{L}^{*}

, rather than

E_{L} = 〈 P_{L}^{*} 〉

.

Theorem 6. If

P_{L}^{*} \subseteq P_{L}

is such that the unique g-entropy maximiser,

P_{L}^{†}

, of

[E_{L}] = [〈 P_{L}^{*} 〉]

, is in

[P_{L}^{*}],

then:

\begin{matrix} P_{L}^{†} = arg sup_{P_{L} \in E_{L}} H_{g, L} (P_{L}) = arg inf_{B \in B_{L}} sup_{P_{L} \in P_{L}^{*}} S_{g, L}^{log} (P_{L}, B_{L}) \end{matrix}

Proof: Again, we shall prove a slightly stronger statement with

B_{L}

ranging in

〈 B_{L} 〉 .

Since g is inclusive, we have that

S_{g}^{log}

is a strictly proper scoring rule. Hence, for a fixed

ρ \in ϱ

,

S_{g, ρ}^{log} (P_{L}, \cdot)

is minimal if and only if

P_{L} (φ) = B_{L} (φ)

for all

φ \in ρ .

Now, suppose

B_{L} \in 〈 B_{L} 〉

is different from a fixed

P_{L} \in P_{L} .

Then, there is some

φ \in S L

such that

B_{L} (φ) \neq P_{L} (φ) .

Now, pick some

ρ^{'} \in ϱ

such that

φ \in ρ^{'} .

Then, strict propriety implies the sharp inequality below:

\begin{matrix} S_{g, L}^{log} (P_{L}, B_{L}) & = sup_{ρ \in ϱ} S_{g, ρ}^{log} (P_{L}, B_{L}) \\ \geq S_{g, ρ^{'}}^{log} (P_{L}, B_{L}) \\ > S_{g, ρ^{'}}^{log} (P_{L}, P_{L}) \\ = sup_{ρ \in ϱ} S_{g, ρ}^{log} (P_{L}, P_{L}) \\ = S_{g, L}^{log} (P_{L}, P_{L}) \end{matrix}

(73)

The second equality follows since the

P_{L} \in P_{L}

respect logical equivalence, and hence,

S_{g, ρ}^{L} (P_{L}, P_{L})

does not depend on

ρ .

Thus, for all

P_{L} \in P_{L}

, we find

arg {inf}_{B_{L} \in 〈 B_{L} 〉} S_{g}^{log} (P_{L}, B_{L}) = P_{L} .

Hence, for

P_{L} = P_{L}^{†}

, we obtain:

\begin{matrix} S_{g, L}^{log} (P_{L}^{†}, P_{L}^{†}) & = inf_{B_{L} \in 〈 B_{L} 〉} S_{g, L}^{log} (P_{L}^{†}, B_{L}) \\ \leq inf_{B_{L} \in 〈 B_{L} 〉} sup_{P_{L} \in P_{L}^{*}} S_{g, L}^{log} (P_{L}, B_{L}) \\ \leq inf_{B_{L} \in 〈 B_{L} 〉} sup_{P_{L} \in 〈 P_{L}^{*} 〉} S_{g, L}^{log} (P_{L}, B_{L}) \\ = sup_{P_{L} \in 〈 P_{L}^{*} 〉} S_{g, L}^{log} (P_{L}, P_{L}^{†}) \\ = S_{g, L}^{log} (P_{L}^{†}, P_{L}^{†}) \end{matrix}

(74)

where the last two equalities are simply Theorem 5. Hence:

\begin{matrix} inf_{B_{L} \in 〈 B_{L} 〉} sup_{P_{L} \in P_{L}^{*}} S_{g, L}^{log} (P_{L}, B_{L}) = S_{g, L}^{log} (P_{L}^{†}, P_{L}^{†}) = sup_{P_{L} \in 〈 P_{L}^{*} 〉} H_{g, L} (P) \end{matrix}

That is, the lowest worst-case expected loss is the same for

P_{L} \in [P_{L}^{*}]

and

P_{L} \in [〈 P_{L}^{*} 〉] .

Furthermore, since

S_{g, L}^{log} (P_{L}^{†}, P_{L}^{†}) = {sup}_{P_{L} \in 〈 P_{L}^{*} 〉} S_{g, L}^{log} (P_{L}, P_{L}^{†})

and since

P_{L}^{†} \in [P_{L}^{*}]

, we have

S_{g, L}^{log} (P_{L}^{†}, P_{L}^{†}) = {sup}_{P_{L} \in P_{L}^{*}} S_{g, L}^{log} (P_{L}, P_{L}^{†}) .

Thus,

B_{L} = P_{L}^{†}

minimises

{sup}_{P_{L} \in P_{L}^{*}} S_{g, L}^{log} (P_{L}, B_{L}) .

Now, suppose that

B_{L}^{'} \in 〈 B_{L} 〉

is different from

P_{L}^{†} .

Then:

\begin{matrix} sup_{P_{L} \in P_{L}^{*}} S_{g, L}^{log} (P_{L}, B_{L}^{'}) \geq S_{g, L}^{log} (P_{L}^{†}, B_{L}^{'}) > S_{g, L}^{log} (P_{L}^{†}, P_{L}^{†}) \end{matrix}

where the strict inequality follows as seen above. This now shows that adopting

B_{L}^{'} \neq P_{L}^{†}

leads to an avoidably bad score.

Hence,

B_{L} = P_{L}^{†}

is the unique function in

〈 B_{L} 〉

which minimises

{sup}_{P_{L} \in P_{L}^{*}} S_{g, L}^{log} (P_{L}, B_{L}) .

■

We see, then, that the results of Section 2 concerning beliefs defined on propositions extend naturally to beliefs defined on the sentences of a propositional language. In light of these findings, our subsequent discussions will, for ease of exposition, solely focus on propositions. It should be clear how our remarks generalise to sentences.

4. Relationship to Standard Entropy Maximisation

We have seen so far that there is a sense in which our notions of entropy and expected loss depend on the weight given to each partition under consideration—i.e., on the weighting function, g. It is natural to demand that no proposition should be entirely dismissed from consideration by being given zero weight—that g be inclusive. In which case, the belief function that minimises worst-case g-expected loss is just the probability function in

E

that maximises g-entropy, if there is such a function. This result provides a single justification of the three norms of objective Bayesianism: the belief function should be a probability function, it should be in

E

, i.e., calibrated to evidence of physical probability, and it should otherwise be equivocal, where the degree to which a belief function is equivocal can be measured by its g-entropy.

This line of argument gives rise to two questions. Which g-entropy should be maximised? Does the standard entropy maximiser count as a rational belief function?

On the former question, the task is to isolate some set,

G

, of appropriate weighting functions. Thus far, the only restriction imposed on a weighting function, g, has been that it should be inclusive; this is required in order that scoring rules evaluate all beliefs, rather than just a select few. We shall put forward two further conditions that can help to narrow down a proper subclass,

G

, of weighting functions.

A second natural desideratum is the following:

Definition 13 (Symmetric weighting function). A weighting function, g, is symmetric, if and only if whenever

π^{'}

can be obtained from π by permuting the

ω_{i}

in

π,

then

g (π^{'}) = g (π) .

For example, for

| Ω | = 4

and symmetric g, we have that

g ({{ω_{1}, ω_{2}}, {ω_{3}}, {ω_{4}}}) = g ({{ω_{1}, ω_{4}}, {ω_{2}}, {ω_{3}}}) .

Note that

g_{Ω}, g_{P Ω}

and

g_{Π}

are all symmetric. The symmetry condition can also be stated as follows:

g (π)

is only a function of the spectrum of π, i.e., of the multi-set of sizes of the members of π. In the above example, the spectrum of both partitions is

{2, 1, 1}

.

It turns out that inclusive and symmetric weighting functions lead to g-entropy maximisers that satisfy a variety of intuitive and plausible properties—see Appendix B.

In addition, it is natural to suppose that if

π^{'}

is a refinement of partition π, then g should not give any less weight to

π^{'}

than it does to π—there are no grounds to favour coarser partitions over more fine-grained partitions; although, as Keynes (Chapter 4 in [19]) argued, there may be grounds to prefer finer-grained partitions over coarser partitions.

Definition 14 (Refined weighting function). A weighting function, g, is refined, if and only if whenever

π^{'}

refines π, then

g (π^{'}) \geq g (π) .

g_{Π}

and

g_{Ω}

are refined, but

g_{P Ω}

is not.

Let

G_{0}

be the set of weighting functions that are inclusive, symmetric and refined. One might plausibly set

G = G_{0}

. We would at least suggest that all the weighting functions in

G_{0}

are appropriate weighting functions for scoring rules; we shall leave it open as to whether

G

should contain some weighting functions—such as the proposition weighting,

g_{P Ω}

—that lie outside

G_{0}

. We shall thus suppose in what follows that the set

G

of appropriate weighting functions is such that

G_{0} \subseteq G \subseteq G_{inc}

, where

G_{inc}

is the set of inclusive weighting functions.

One might think that the second question posed above—does the standard entropy maximiser count as a rational belief function?—should be answered in the negative. We saw in Section 2.2 that the standard entropy,

g_{Ω}

-entropy, has a weighting function,

g_{Ω}

, that is not inclusive. Therefore, there is no guarantee that the standard entropy maximiser minimises worst-case g-expected loss for some

g \in G .

Indeed, Figure 1 showed that the standard entropy maximiser need neither coincide with the partition entropy maximiser nor the proposition entropy maximiser.

However, it would be too hasty to conclude that the standard entropy maximiser fails to qualify as a rational belief function. Recall that the equivocation norm says that an agent’s belief function should be sufficiently equivocal, rather than maximally equivocal. This qualification is essential to cope with the situation in which there is no maximally equivocal function in

E

, i.e., the situation in which for any function in

E

, there is another function in

E

that is more equivocal. This arises, for instance, when one has evidence that a coin is biased in favour of tails,

E = P^{*} = {P : P (Tails) > 1 / 2}

. In this case,

{sup}_{P \in E} H_{g} (P)

is achieved by the probability function which gives probability

1 / 2

to tails, which is outside

E

. This situation also arises in certain cases when evidence is determined by quantified propositions (§2 in [20]). The best one can do in such a situation is adopt a probability function in

E

that is sufficiently equivocal, where what counts as sufficiently equivocal may depend on pragmatic factors, such as the required numerical accuracy of predictions and the computational resources available to isolate a suitable function.

Let

⇓ E

be the set of belief functions that are sufficiently equivocal. Plausibly:

E1:: $⇓ E \neq \emptyset$ . An agent is always entitled to hold some beliefs.
E2:: $⇓ E \subseteq E$ . Sufficiently equivocal belief functions are calibrated with evidence.
E3:: For all $g \in G$ , there is some $ϵ > {inf}_{B \in B} {sup}_{P \in E} S_{g}^{log} (P, B)$ such that if $R \in E$ and ${sup}_{P \in E} S_{g} (P, R) < ϵ$ , then $R \in ⇓ E .$ , i.e., if R has sufficiently low worst-case g-expected loss for some appropriate g, then R is sufficiently equivocal.
E4:: $⇓ ⇓ E = ⇓ E$ . Any function, from those that are calibrated with evidence, that is sufficiently equivocal, is a function, from those that are calibrated with evidence and are sufficiently equivocal, that is sufficiently equivocal.
E5:: If P is a limit point of $⇓ E$ and $P \in E$ , then $P \in ⇓ E$ .

A closely related set of conditions was put forward in [20]. Note that we will not need to appeal to E4 in this paper. E1 is a consequence of the other principles together with the fact that

E \neq \emptyset

.

Conditions E2, E3 and E5 allow us to answer our two questions. Which g-entropy should be maximised? By E3, it is rational to adopt any g-entropy maximiser that is in

E

, for

g \in G \supseteq G_{0}

. Does the standard entropy maximiser count as a rational belief function? Yes, if it is in

E

(which is the case, for instance, if

E

is closed):

Theorem 7 (Justification of maxent). If

E

contains its standard entropy maximiser,

P_{Ω}^{†} : = arg {sup}_{E} H_{Ω}

, then

P_{Ω}^{†} \in ⇓ E

.

Proof: We shall first see that there is a sequence of

{(g_{t})}_{t \in N}

in

G

such that the

g_{t}

-entropy maximisers

P_{t}^{†} \in [E]

converge to

P_{Ω}^{†}

. All respective entropy maximisers are unique, due to Corollary 2.

Let

g_{t} ({{ω} : ω \in Ω}) = 1,

and put

g_{t} (π) : = \frac{1}{t}

for all other

π \in Π .

The

g_{t}

are in

G

, because they are inclusive, symmetric and refined.

g_{t}

-entropy has the following form:

\begin{matrix} H_{t} : = sup_{P \in E} H_{g_{t}} (P) = sup_{P \in E} \sum_{π \in Π} - g_{t} (π) \sum_{F \in π} P (F) log P (F) \end{matrix}

Now note that

g_{t} (π)

converges to

g_{Ω} (π)

and that

P (F) log P (F)

is finite for all

F \subseteq Ω .

Thus, for all

P \in P

,

H_{t} (P)

converges to

H_{Ω} (P)

as t approaches infinity. Hence,

{sup}_{P \in E} H_{g_{t}} (P) = H_{t}

tends to

{sup}_{P \in E} H_{Ω} (P) = H_{Ω} .

Let us now compute:

\begin{matrix} | H_{Ω} (P_{t}^{†}) - H_{Ω} (P_{Ω}^{†}) | & = | H_{Ω} (P_{t}^{†}) - H_{g_{t}} (P_{t}^{†}) + H_{g_{t}} (P_{t}^{†}) - H_{Ω} (P_{Ω}^{†}) | \\ \leq | H_{Ω} (P_{t}^{†}) - H_{g_{t}} (P_{t}^{†}) | + | H_{g_{t}} (P_{t}^{†}) - H_{Ω} (P_{Ω}^{†}) | \\ = | H_{Ω} (P_{t}^{†}) - H_{g_{t}} (P_{t}^{†}) | + | H_{t} - H_{Ω} | \end{matrix}

(75)

As we noted above,

g_{t}

converges to

g_{Ω} .

Furthermore,

{(P_{t}^{†})}_{t \in N}

is a bounded sequence. Hence,

H_{g_{t}} (P_{t}^{†})

converges to

H_{Ω} (P_{t}^{†}) .

Furthermore, recall that

H_{t}

tends to

H_{Ω} .

Overall, we find that

{lim}_{t \to \infty} H_{Ω} (P_{t}^{†}) = H_{Ω} (P_{Ω}^{†}) .

Since

H_{Ω} (\cdot)

is a strictly concave function on

[E]

and

[E]

is convex, it follows that

P_{t}^{†}

converges to

P_{Ω}^{†} .

Note that the

P_{t}^{†}

are not necessarily in

E

. However, they are in

[E]

, and there will be some sequence of

P_{t}^{‡} \in ⇓ E

close to

P_{t}^{†}

such that

{lim}_{t \to \infty} P_{t}^{‡} = P_{Ω}^{†}

, as we shall now see.

If

P_{t}^{†} \in E,

then simply let

P_{t}^{‡} = P_{t}^{†},

which is in

⇓ E

by E3.

If

P_{t}^{†} \notin E,

then there exists a

P^{'} \in E

which is different from

P_{t}^{†}

, such that all the points on the line segment between

P_{t}^{†}

and

P^{'}

are in

E;

with the exception of

P_{t}^{†} .

Now define

P_{t, δ_{t}}^{‡} (ω) = (1 - δ_{t}) P_{t}^{†} (ω) + δ_{t} P^{'} (ω) = P_{t}^{†} (ω) + δ_{t} (P^{'} (ω) - P_{t}^{†} (ω)) .

Note that for

0 < δ_{t} < 1,

we have, for all

ω \in Ω

, that

P_{t}^{†} (ω) > 0

implies

P_{t, δ_{t}}^{‡} (ω) > 0 .

Then, with

\begin{matrix} m_{t} : = min_{\begin{matrix} ω \in Ω \\ P_{t}^{†} (ω) > 0 \end{matrix}} {P_{t}^{†} (ω)} \end{matrix}

and

0 < δ_{t} < m_{t}

, it follows from Proposition 18 that for all

F \subseteq Ω

and all

P \in E

,

P (F) > 0

implies

P_{t}^{†} (F) > 0 .

Thus, for such an F, we have

P_{t}^{†} (F) \geq m_{t} > δ_{t} > 0 .

Adopting the purely notational convention that

0 (log 0 - log 0) = 0

, we find for

P \in [E]

and

m_{t} > δ_{t}

that:

\begin{matrix} | S_{g_{t}}^{log} (P, P_{t, δ_{t}}^{‡}) - S_{g_{t}}^{log} (P, P_{t}^{†}) | & \leq \sum_{π \in Π} g_{t} (π) | \sum_{F \in π} P (F) (log P_{t, δ_{t}}^{‡} (F) - log P_{t}^{†} (F)) | \\ \leq \sum_{π \in Π} g_{t} (π) \sum_{\begin{matrix} F \in π \\ P (F) > 0 \end{matrix}} P (F) | log P_{t, δ_{t}}^{‡} (F) - log P_{t}^{†} (F) | \\ \leq \sum_{π \in Π} g_{t} (π) \sum_{\begin{matrix} F \in π \\ P (F) > 0 \end{matrix}} P (F) | log \frac{P_{t}^{†} (F) - δ_{t} \cdot | P^{'} (F) - P_{t}^{†} (F) |}{P_{t}^{†} (F)} | \\ \leq \sum_{π \in Π} g_{t} (π) \sum_{\begin{matrix} F \in π \\ P (F) > 0 \end{matrix}} P (F) | log \frac{P_{t}^{†} (F) - δ_{t}}{P_{t}^{†} (F)} | \\ \leq \sum_{π \in Π} g_{t} (π) \sum_{\begin{matrix} F \in π \\ P (F) > 0 \end{matrix}} P (F) | log \frac{m_{t} - δ_{t}}{m_{t}} | \\ = | log \frac{m_{t} - δ_{t}}{m_{t}} | \sum_{π \in Π} g_{t} (π) \end{matrix}

(76)

For fixed

g_{t}

and all

P \in [E],

| S_{g_{t}}^{log} (P, P_{t, δ_{t}}^{‡}) - S_{g_{t}}^{log} (P, P_{t}^{†}) |

becomes arbitrarily small for small

δ_{t};

moreover, the upper bound we established does not depend on

P .

In particular, for all

χ_{t} > 0

, there exists a

T \in N

such that for all

U_{t} > T

and all

P \in [E]

, it holds that

| S_{g_{t}}^{log} (P, P_{t, \frac{1}{U_{t}}}^{‡}) - S_{g_{t}}^{log} (P, P_{t}^{†}) | < χ_{t} .

Now, let

ϵ_{t} > {inf}_{B \in B} {sup}_{P \in E} S_{g_{t}}^{log} (P, B) = H_{t} .

Then, with

χ_{t} = \frac{ϵ_{t} - H_{t}}{2} > 0

, we have for big enough

U_{t}

that:

\begin{matrix} sup_{P \in E} S_{g_{t}}^{log} (P, P_{t, \frac{1}{U_{t}}}^{‡}) - sup_{P \in E} S_{g_{t}}^{log} (P, P_{t}^{†}) \leq χ_{t} \end{matrix}

(77)

Thus:

\begin{matrix} sup_{P \in E} S_{g_{t}}^{log} (P, P_{t, \frac{1}{U_{t}}}^{‡}) & \leq χ_{t} + sup_{P \in E} S_{g_{t}}^{log} (P, P_{t}^{†}) \\ = \frac{ϵ_{t} - H_{t}}{2} + H_{t} \\ < ϵ_{t} \end{matrix}

(78)

Hence,

P_{t, δ_{t}}^{‡} \in ⇓ E

by E3 for small enough

δ_{t},

since worst-case

g_{t}

-expected loss of

P_{t, δ_{t}}^{‡}

becomes arbitrarily close to

H_{t}

.

Now, pick a sequence

δ_{t} ↘ 0

, such that

δ_{t}

is small enough to ensure that for every t, it holds that

P_{t, δ_{t}}^{‡} \in ⇓ E .

Clearly, the sequence

{(P_{t, δ_{t}}^{‡})}_{t \in N}

converges to the limit of the sequence

P_{t}^{†},

and this limit is

P_{Ω}^{†} .

Therefore, the sequence

P_{t, δ_{t}}^{‡}

converges to

P_{Ω}^{†}

, which is, by our assumption, in

E .

By E5, we have

P_{Ω}^{†} \in ⇓ E .

■

So far, we have seen that, as long as the standard entropy maximiser is not ruled out by the available evidence, it is sufficiently equivocal, and hence, it is rational for an agent to adopt this function as her belief function. On the other hand, the above considerations also imply that if the entropy maximiser

P_{Ω}^{†}

is ruled out by the available evidence (i.e.,

P_{Ω}^{†} \in [E] ∖ E

), it is rational to adopt some function P close enough to

P_{Ω}^{†},

because such a function will be sufficiently equivocal:

Corollary 4. For all

ϵ > 0

, there exists a

P \in ⇓ E

such that

| P (ω) - P_{Ω}^{†} (ω) | < ϵ

for all

ω \in Ω .

Proof: Consider the same sequence,

g_{t}

, as in the above proof. Recall that

P_{t}^{†}

converges to

P_{Ω}^{†} .

Now, pick a t such that

| P_{t}^{†} (ω) - P_{Ω}^{†} (ω) | < \frac{ϵ}{2}

for all

ω \in Ω .

For this t, it holds that

P_{t, δ_{t}}^{‡} \in ⇓ E

for small enough

δ_{t}

and that

P_{t, δ_{t}}^{‡}

converges to

P_{t}^{†} .

Thus, for small enough

δ_{t}

, we have

| P_{t}^{†} (ω) - P_{t, δ_{t}}^{‡} (ω) | < \frac{ϵ}{2}

for all

ω \in Ω .

Thus,

| P_{t, δ_{t}}^{‡} (ω) - P_{Ω}^{†} (ω) | < ϵ

for all

ω \in Ω .

■

Is there anything that makes the standard entropy maximiser stand out among all those functions that are sufficiently equivocal? One consideration is language invariance. Suppose

g^{L}

is a family of weighting functions, defined for each

L

.

g^{L}

is language invariant, as long as merely adding new propositional variables to the language does not undermine the

g^{L}

-entropy maximiser:

Definition 15 (Language invariant family of weighting functions). Suppose we are given, as usual, a set

E

of probability functions on a fixed language

L

. For any

L^{'}

extending

L

, let

E^{'} = E \times P_{L^{'} ∖ L}

be the translation of

E

into the richer language

L^{'}

. A family of weighting functions is language invariant, if for any such

E, L,

any

P^{†} \in arg {sup}_{P \in E} H_{g^{L}} (P)

on

L

, and for any language

L^{'}

extending

L

, there is some

P^{‡} \in arg {sup}_{P \in E^{'}} H_{g^{L^{'}}} (P)

on

L^{'}

such that

P_{⇂ L}^{‡} = P^{†}

, i.e.,

P^{‡} (ω) = P^{†} (ω)

for each state ω of

L

.

It turns out that many families of weighting functions—including the partition weightings and the proposition weightings—are not language invariant:

Proposition 8. The family of partition weightings,

g_{Π}

, and the family of proposition weightings,

g_{P Ω}

, are not language invariant.

Proof: Let

L = {A_{1}, A_{2}}

and

E = {P \in P : P (ω_{1}) + 2 P (ω_{2}) + 3 P (ω_{3}) + 4 P (ω_{4}) = 1.7} .

The partition entropy maximiser

P_{Π}^{†}

and the proposition entropy maximiser

P_{P Ω}^{†}

for this language and this set

E

of calibrated functions are given in the first two rows of the table below.

Table 1. Partition entropy and proposition entropy maximisers on

L

and

L^{'}

.

**Table 1.** Partition entropy and proposition entropy maximisers on $L$ and $L^{'}$ .
	$ω_{1}$		$ω_{2}$		$ω_{3}$		$ω_{4}$
$P_{Π}^{†}$	$0.5331$		$0.2841$		$0.1324$		$0.0504$
$P_{P Ω}^{†}$	$0.5192$		$0.3008$		$0.1408$		$0.0392$
	$χ_{1}$	$χ_{2}$	$χ_{3}$	$χ_{4}$	$χ_{5}$	$χ_{6}$	$χ_{7}$	$χ_{8}$
$P_{Π}^{‡}$	$0.2649$	$0.2649$	$0.1441$	$0.1441$	$0.0671$	$0.0671$	$0.0239$	$0.0239$
$P_{P Ω}^{‡}$	$0.2510$	$0.2510$	$0.1594$	$0.1594$	$0.0783$	$0.0783$	$0.0113$	$0.0113$

We now add one propositional variable,

A_{3}

, to

L

and, thus, obtain

L^{'} .

Denote the states of

L^{'}

by

χ_{1} = ω_{1} \land \neg A_{3}, χ_{2} = ω_{1} \land A_{3}

, and so on. Assuming that we have no information at all concerning

A_{3}

, the set of calibrated probability functions is given by the solutions of the constraint,

(P^{'} (χ_{1}) + P^{'} (χ_{2})) + 2 (P^{'} (χ_{3}) + P^{'} (χ_{4})) + 3 (P^{'} (χ_{5}) + P^{'} (χ_{6})) + 4 (P^{'} (χ_{7}) + P^{'} (χ_{8})) = 1.7 .

Language invariance would now entail that

P^{†} (ω_{1}) = P^{‡} (χ_{1}) + P^{‡} (χ_{2}), P^{†} (ω_{2}) = P^{‡} (χ_{3}) + P^{‡} (χ_{4}), P^{†} (ω_{3}) = P^{‡} (χ_{5}) + P^{‡} (χ_{6}), P^{†} (ω_{4}) = P^{‡} (χ_{7}) + P^{‡} (χ_{8}) .

However, neither the partition entropy maximisers nor the proposition entropy maximisers form a language invariant family, as can be seen from the last two rows of the above table. ■

On the other hand, it is well known that standard entropy maximisation is language invariant (p. 76 in [21]). This can be seen to follow from the fact that certain families of weighting functions that only assign positive weight to a single partition are language invariant:

Lemma 8. Suppose a function f picks out a partition π for any language

L

, in such a way that if

L^{'} \supseteq L

, then

f (L^{'})

is a refinement of

f (L)

, with each

F \in f (L)

being refined into the same number k of members

F_{1}, \dots, F_{k} \in f (L^{'})

, for

k \geq 1

. Suppose

g^{L}

is such that for any

L

,

g^{L} (f (L)) = c > 0

, but

g^{L} (π) = 0

for all other partitions π. Then,

g^{L}

is language invariant.

Proof: Let

P^{†}

denote a

g^{L}

-entropy maximiser (in

[E]

), and let

P^{‡}

denote a

g^{L^{'}}

-entropy maximiser in

[E] \times P_{L^{'} ∖ L}

. Since

g^{L}

and

g^{L^{'}}

need not be inclusive,

H_{g, L}

and

H_{g, L^{'}}

need not be strictly concave. Thus, there need not be unique entropy maximisers. Given

F \subseteq Ω

refined into subsets

F_{1}, \dots, F_{k}

of

Ω^{'}

,

F^{'} \subseteq Ω^{'}

is defined by

F^{'} : = F_{1} \cup \dots \cup F_{k}

. One can restrict

P^{‡}

to

L

by setting

P^{‡} (ω) = \sum_{ω^{'} \in Ω^{'}, ω^{'} ⊧ ω} P^{‡} (ω^{'})

for

ω \in Ω

, so, in particular,

P^{‡} (F) = P^{‡} (F^{'}) = P^{‡} (F_{1}) + \dots + P^{‡} (F_{k})

for

F \in Ω

.

The

g^{L}

-entropy of

P^{†}

is closely related to the

g^{L^{'}}

-entropy of

P^{‡}

:

\begin{matrix} - c \sum_{F \in f (L)} & P^{†} (F) log P^{†} (F) \\ \geq - c \sum_{F \in f (L)} P^{‡} (F) log P^{‡} (F) \\ = - c \sum_{F \in f (L)} (P^{‡} (F_{1}) + \dots + P^{‡} (F_{k})) log (P^{‡} (F_{1}) + \dots + P^{‡} (F_{k})) \\ = - c \sum_{F \in f (L)} (P^{‡} (F_{1}) + \dots + P^{‡} (F_{k})) (log k + log \frac{P^{‡} (F_{1}) + \dots + P^{‡} (F_{k})}{k}) \\ \overset{L S I}{\geq} - c log k - c \sum_{F \in f (L)} P^{‡} (F_{1}) log P^{‡} (F_{1}) + \dots + P^{‡} (F_{k}) log P^{‡} (F_{k}) \\ = - c log k - c \sum_{G \in f (L^{'})} P^{‡} (G) log P^{‡} (G) \\ = - c log k - c \sum_{F \in f (L)} P^{‡} (F_{1}) log P^{‡} (F_{1}) + \dots + P^{‡} (F_{k}) log P^{‡} (F_{k}) \\ \geq - c log k - c \sum_{F \in f (L)} \frac{P^{†} (F)}{k} log \frac{P^{†} (F)}{k} + \dots + \frac{P^{†} (F)}{k} log \frac{P^{†} (F)}{k} \\ = - c log k - c \sum_{F \in f (L)} P^{†} (F) log \frac{P^{†} (F)}{k} \\ = - c \sum_{F \in f (L)} P^{†} (F) log P^{†} (F) \end{matrix}

(79)

LSI refers to the log sum inequality introduced in Lemma 3. The first and last inequality above follow from the fact that

P^{†}

and

P^{‡}

are entropy maximisers over

L,

L^{'}

, respectively. Hence, all inequalities are indeed equalities. These entropy maximisers are unique on

f (L), f (L^{'})

, so

P^{†} (F) = k \cdot P^{‡} (F_{1}) = \dots = k \cdot P^{‡} (F_{k}) = P^{‡} (F)

for

F \in f (L)

.

Now, take an arbitrary

P^{†} \in arg {sup}_{P \in E} H_{g^{L}} (P)

, and suppose

ω \in Ω

. Any

P^{‡}

such that

P^{‡} (ω) = P^{†} (ω)

and

P^{‡} (F_{1}) = \dots = P^{‡} (F_{k}) = P^{†} (F) / k

will be a

g^{L^{'}}

-entropy maximiser on

L^{'}

. Thus,

g^{L}

is language invariant.

Note that if, for some

L

,

f (L) = {Ω^{L}, \emptyset},

where

Ω^{L}

denotes the set of states of

L,

then

H_{g^{L}} (P) = - P (Ω^{L}) log P (Ω^{L}) - P (\emptyset) log P (\emptyset) = 0 - 0 = 0 .

Likewise, if

f (L^{'}) = {Ω^{L^{'}}},

then

H_{g^{L^{'}}} (P) = 0 .

For such g-entropies, every probability maximises g-entropy trivially, since all probability functions have the same g-entropy. ■

Taking

f (L) = {{ω} : ω \in Ω}

and

c = 1

, we have the language invariance of standard entropy maximisation:

Corollary 5. The family of weighting functions

g_{Ω}

is language invariant.

While giving weight in this way to just one partition is sufficient for language invariance, it is not necessary, as we shall now see. Define a family of weighting functions, the substate weighting functions, by giving weight to just those partitions that are partitions of states of sublanguages. For any sublanguage,

L^{-} \subseteq L = {A_{1}, \dots, A_{n}}

, let

Ω^{-}

be the set of states of

L^{-}

, and let

π^{-}

be the partition of propositions of

L

that represents the partition of states of the sublanguage,

L^{-}

, i.e.,

π^{-} = {{ω \in Ω : ω ⊧ ω^{-}} : ω^{-} \in Ω^{-}}

. Then,

g_{\subseteq}^{L} (π) = \{\begin{matrix} 1 & : & π = π^{-} for some L^{-} \subseteq L \\ 0 & : & otherwise \end{matrix}

(80)

Example 2. For

L = {A_{1}, A_{2}}

, there are three sublanguages:

L

itself and the two proper sublanguages,

{A_{1}}, {A_{2}} .

Then,

g_{\subseteq}^{L}

assigns the following three partitions of Ω the same positive weight:

{{A_{1} \land A_{2}, A_{1} \land \neg A_{2}}, {\neg A_{1} \land A_{2}, \neg A_{1} \land \neg A_{2}}}

,

{{A_{1} \land A_{2}, \neg A_{1} \land A_{2}}, {A_{1} \land \neg A_{2}, \neg A_{1} \land \neg A_{2}}}

,

{{A_{1} \land A_{2}}, {A_{1} \land \neg A_{2}}, {\neg A_{1} \land A_{2}}, {\neg A_{1} \land \neg A_{2}}}

.

g_{\subseteq}^{L}

assigns all other

π \in Π

weight zero.

Note that there are

2^{n} - 1

non-empty sublanguages of

L

, so

g_{\subseteq}^{L}

gives positive weight to

2^{n} - 1

partitions.

Proposition 9. The family of substate weighting functions is language invariant.

Proof: Consider an extension,

L^{'} = {A_{1}, \dots, A_{n}, A_{n + 1}}

, of

L

. Let

P^{†}, P^{‡}

be

g_{\subseteq}

-entropy maximisers on

L, L^{'}

, respectively. For simplicity of exposition, we shall view these functions as defined over sentences, so that we can talk of

P^{‡} (A_{n + 1} \land ω^{-})

, etc. For the purposes of the following calculation we shall consider the empty language to be a language. Entropies over the empty language vanish. Summing over the empty language ensures, for example, that the expression

P^{‡} (A_{n + 1}) log P^{‡} (A_{n + 1})

appears in Equation (81).

\begin{matrix} 2 H_{g_{\subseteq}^{L}} (P^{†}) & = & - 2 \sum_{L^{-} \subseteq L} \sum_{ω^{-} \in Ω^{-}} P^{†} (ω^{-}) log P^{†} (ω^{-}) \\ \geq & - 2 \sum_{L^{-} \subseteq L} \sum_{ω^{-} \in Ω^{-}} P^{‡} (ω^{-}) log P^{‡} (ω^{-}) \\ = & - \sum_{L^{-} \subseteq L} \sum_{ω^{-} \in Ω^{-}} P^{‡} (ω^{-}) log P^{‡} (ω^{-}) \\ - \sum_{L^{-} \subseteq L} \sum_{ω^{-} \in Ω^{-}} [P^{‡} (A_{n + 1} \land ω^{-}) + P^{‡} (\neg A_{n + 1} \land ω^{-})] \\ \times log [P^{‡} (A_{n + 1} \land ω^{-}) + P^{‡} (\neg A_{n + 1} \land ω^{-})] \\ = & - \sum_{L^{-} \subseteq L} \sum_{ω^{-} \in Ω^{-}} P^{‡} (ω^{-}) log P^{‡} (ω^{-}) \\ - \sum_{L^{-} \subseteq L} \sum_{ω^{-} \in Ω^{-}} [P^{‡} (A_{n + 1} \land ω^{-}) + P^{‡} (\neg A_{n + 1} \land ω^{-})] \\ \times log [2 \cdot \frac{P^{‡} (A_{n + 1} \land ω^{-}) + P^{‡} (\neg A_{n + 1} \land ω^{-})}{1 + 1}] \\ \geq & - \sum_{L^{-} \subseteq L} \sum_{ω^{-} \in Ω^{-}} P^{‡} (ω^{-}) log P^{‡} (ω^{-}) \\ - \sum_{L^{-} \subseteq L} \sum_{ω^{-} \in Ω^{-}} [log 2 + P^{‡} (A_{n + 1} \land ω^{-}) log P^{‡} (A_{n + 1} \land ω^{-}) \\ + P^{‡} (\neg A_{n + 1} \land ω^{-}) log P^{‡} (\neg A_{n + 1} \land ω^{-})] \end{matrix}

(81)

\begin{matrix} = & - c log 2 - \sum_{\begin{matrix} L^{-} \subseteq L^{'} \\ {A_{n + 1}} \notin L^{'} \end{matrix}} \sum_{ω^{-} \in Ω^{-}} P^{‡} (ω^{-}) log P^{‡} (ω^{-}) \\ - \sum_{\begin{matrix} L^{-} \subseteq L^{'} \\ {A_{n + 1}} \in L^{'} \end{matrix}} \sum_{ω^{-} \in Ω^{-}} P^{‡} (ω^{-}) log P^{‡} (ω^{-}) \\ = & - c log 2 - \sum_{L^{-} \subseteq L^{'}} \sum_{ω^{-} \in Ω^{-}} P^{‡} (ω^{-}) log P^{‡} (ω^{-}) \\ = & - c log 2 + H_{g_{\subseteq}^{L^{'}}} (P^{‡}) \\ = & - c log 2 - \sum_{L^{-} \subseteq L} \sum_{ω^{-} \in Ω^{-}} P^{‡} (ω^{-}) log P^{‡} (ω^{-}) \\ - \sum_{L^{-} \subseteq L} \sum_{ω^{-} \in Ω^{-}} [P^{‡} (A_{n + 1} \land ω^{-}) log P^{‡} (A_{n + 1} \land ω^{-}) \\ + P^{‡} (\neg A_{n + 1} \land ω^{-}) log P^{‡} (\neg A_{n + 1} \land ω^{-})] \\ \geq & - c log 2 - \sum_{L^{-} \subseteq L} \sum_{ω^{-} \in Ω^{-}} P^{†} (ω^{-}) log P^{†} (ω^{-}) - \sum_{L^{-} \subseteq L} \sum_{ω^{-} \in Ω^{-}} P^{†} (ω^{-}) log \frac{P^{†} (ω^{-})}{2} \\ = & - 2 \sum_{L^{-} \subseteq L} \sum_{ω^{-} \in Ω^{-}} P^{†} (ω^{-}) log P^{†} (ω^{-}) \\ = & 2 H_{g_{\subseteq}^{L}} (P^{†}) \end{matrix}

(82)

where c is some constant and where the second inequality is an application of the log-sum inequality. As in the previous proof, all inequalities are thus equalities,

P^{‡} (\pm A_{n + 1} \land ω) = P^{†} (ω) / 2

and

P^{‡}

extends

P^{†}

, as required. ■

In general the substate entropy maximisers differ from the standard entropy maximisers, as well as the partition entropy maximisers and the proposition entropy maximisers:

Example 3. For

L = {A_{1}, A_{2}}

and the substate weighting function,

g_{\subseteq}^{L}

on

L

(see Example 2), we find for

E = {P \in P : P (A_{1} \land A_{2}) + 2 P (A_{1} \land \neg A_{2}) = 0.1}

that the standard entropy maximiser, the partition entropy maximiser, the proposition entropy maximiser and the substate weighting entropy maximiser are pairwise different.

Table 2. Standard, partition, proposition and substate entropy maximisers.

**Table 2.** Standard, partition, proposition and substate entropy maximisers.
	$A_{1} \land A_{2}$	$A_{1} \land \neg A_{2}$	$\neg A_{1} \land A_{2}$	$\neg A_{1} \land \neg A_{2}$
$P_{Ω}^{†}$	$0.0752$	$0.0124$	$0.4562$	$0.4562$
$P_{Π}^{†}$	$0.0856$	$0.0072$	$0.4536$	$0.4536$
$P_{P Ω}^{†}$	$0.0950$	$0.0025$	$0.4513$	$0.4513$
$P_{g_{\subseteq}^{L}}^{†}$	$0.0950$	$0.0025$	$0.4293$	$0.4732$

Observe that the standard entropy maximiser, the partition entropy maximiser and the proposition entropy maximiser are all symmetric in

\neg A_{1} \land A_{2}

and

\neg A_{1} \land \neg A_{2},

while the substate weighting entropy maximiser is not. This break of symmetry is caused by the fact that

g_{\subseteq}^{L}

is not symmetric in

\neg A_{1} \land A_{2}

and

\neg A_{1} \land \neg A_{2} .

We have seen that the substate weighting functions are not symmetric. Neither are they inclusive nor refined. We conjecture that if

G = G_{0}

, the set of inclusive, symmetric and refined g, then the only language invariant family,

g^{L}

, that gives rise to entropy maximisers that are sufficiently equivocal is the family that underwrites standard entropy maximisation: if

g^{L}

is language invariant and the

g^{L}

-entropy maximiser is in

⇓ E

, then

g^{L} = g_{Ω}

.

In sum, there is a compelling reason to prefer the standard entropy maximiser over other g-entropy maximisers: the standard entropy maximiser is language invariant, while other—perhaps, all other—appropriate g-entropy maximisers are not. In Appendix B.3, we show that there are three further ways in which the standard entropy maximiser differs from other g-entropy maximisers: it satisfies the principles of irrelevance, relativisation and independence.

5. Discussion

5.1. Summary

In this paper, we have seen how the standard concept of entropy generalises rather naturally to the notion of g-entropy, where g is a function that weights the partitions that contribute to the entropy sum. If loss is taken to be logarithmic, as is forced by desiderata L1–4 for a default loss function, then the belief function that minimises worst-case g-expected loss, where the expectation is taken with respect to a chance function known to lie in a convex set

E

, is the probability function in

E

that maximises g-entropy, if there is such a function. This applies whether belief functions are thought of as defined over the sentences of an agent’s language or over the propositions picked out by those sentences.

This fact suggests a justification of the three norms of objective Bayesianism: a belief function should be a probability function, it should lie in the set

E

of potential chance functions and it should otherwise be equivocal in that it should have maximum g-entropy.

However, the probability function with maximum g-entropy may lie outside

E

, on its boundary, in which case that function is ruled out of contention by available evidence. Therefore, objective Bayesianism only requires that a belief function be sufficiently equivocal—not that it be maximally equivocal. Principles E1–5 can be used to constrain the set

⇓ E

, of sufficiently equivocal functions. Arguably, if the standard entropy maximiser is in

E

, then it is also in

⇓ E

. Moreover, the standard entropy maximiser stands out as being language invariant. This then provides a qualified justification of the standard maximum entropy principle: while an agent is rationally entitled to adopt any sufficiently equivocal probability function in

E

as her belief function, if the standard entropy maximiser is in

E

, then that function is a natural choice.

Some questions arise. First, what are the consequences of this sort of account for conditionalisation and Bayes’ theorem? Second, how does this account relate to imprecise probability, advocates of which reject our starting assumption that the strengths of an agent’s beliefs are representable by a single belief function? Third, the arguments of this paper are overtly pragmatic; can they be reformulated in a non-pragmatic way? We shall tackle these questions in turn.

5.2. Conditionalisation, Conditional Probabilities and Bayes’ Theorem

Subjective Bayesians endorse the probability norm and often also some sort of calibration norm, but do not go so far as to insist on equivocation. This leads to relatively weak constraints on degrees of belief, so subjective Bayesians typically appeal to Bayesian conditionalisation as a means to tightly constrain the way in which degrees of belief change in the light of new evidence. Objective Bayesians do not need to invoke Bayesian conditionalisation as a norm of belief change, because the three norms of objective Bayesianism already tightly constrain any new belief function that an agent can adopt. In fact, if the objective Bayesian adopts the policy of adopting the standard entropy maximiser as her belief function, then objective Bayesian updating often agrees with updating by conditionalisation, as shown by Seidenfeld (Result 1 in [22]):

Theorem 8. Suppose that

E

is the set of probability functions calibrated with evidence E, and that

E

can be written as the set of probability functions which satisfy finitely many constraints of the form,

c_{i} = \sum_{ω \in Ω} d_{i, ω} P (ω) .

Suppose

E^{'}

is the set of probability functions calibrated with evidence

E \cup {G}

, and that

P_{E}^{†}, P_{E \cup {G}}^{†}

are functions in

E, E^{'}

, respectively, that maximise standard entropy. If:

(i): $G \subseteq Ω$ ,
(ii): the only constraints imposed by $E \cup {G}$ are the constraints $c_{i} = \sum_{ω \in Ω} d_{i, ω} P (ω)$ imposed by E together with the constraint $P (G) = 1$ ,
(iii): the constraints in (ii) are consistent, and
(iv): $P_{E}^{†} (\cdot | G) \in E$ ,
then $P_{E \cup {G}}^{†} (F) = P_{E}^{†} (F | G)$ for all $F \subseteq Ω$ .

This fact has various consequences. First, it provides a qualified justification of Bayesian conditionalisation: a standard entropy maximiser can be thought of as applying Bayesian conditionalisation in many natural situations. Second, if conditions (i)–(iv) of Theorem 8 hold, then there is no need to maximise standard entropy to compute the agent’s new degrees of belief—instead, Bayesian conditionalisation can be used to calculate these degrees of belief. Third, conditions (i)–(iv) of Theorem 8 can each fail, so the two forms of updating do not always agree, and Bayesian conditionalisation is less central to an objective Bayesian who maximises standard entropy than it is to a subjective Bayesian. As pointed out in Williamson [1] (Chapter 4) and Williamson [23] (§§8,9), standard entropy maximisation is to be preferred over Bayesian conditionalisation where any of these conditions fail. Fourth, conditional probabilities, which are crucial to subjective Bayesianism on account of their use in Bayesian conditionalisation, are less central to the objective Bayesian, because conditionalisation is only employed in a qualified way. For the objective Bayesian, conditional probabilities are merely ratios of unconditional probabilities—they are not generally interpretable as conditional degrees of belief (§4.4.1 in [1]). Fifth, Bayes’ theorem, which is an important tool for calculating conditional probabilities, used routinely in Bayesian statistics, for example, is less central to objective Bayesianism, because of the less significant role played by conditional probabilities.

Interestingly, while Theorem 8 appeals to standard entropy maximisation, an analogous result holds for g-entropy maximisation, for any inclusive g, as we show in Appendix B.2:

Theorem 9. Suppose that convex and closed

E

is the set of probability functions calibrated with evidence E, and

E^{'}

is the set of probability functions calibrated with evidence

E \cup {G}

. Furthermore, suppose that

P_{E}^{†}, P_{E \cup {G}}^{†}

are functions in

E, E^{'}

, respectively, that maximise g-entropy for some fixed

g \in G_{inc} \cup {g_{Ω}} .

If:

(i): $G \subseteq Ω$ ,
(ii): the only constraints imposed by $E \cup {G}$ are the constraints imposed by E together with the constraint $P (G) = 1$ ,
(iii): the constraints in (ii) are consistent, and
(iv): $P_{E}^{†} (\cdot | G) \in E$ ,
then $P_{E \cup {G}}^{†} (F) = P_{E}^{†} (F | G)$ for all $F \subseteq Ω$ .

Thus, the preceding comments apply equally in the more general context of this paper.

5.3. Imprecise Probability

Advocates of imprecise probability argue that an agent’s belief state is better represented by a set of probability functions—for example, by the set

E

of probability functions calibrated with evidence—than by a single belief function [24]. This makes decision making harder. An agent whose degrees of belief are represented by a single probability function can use that probability function to determine which of the available acts maximises expected utility. However, an imprecise agent will typically find that the acts that maximise expected utility vary according to which probability function in her imprecise belief state is used to determine the expectation. The question then arises, with respect to which probability function in her belief state should such expectations be taken?

This question motivates a two-step procedure for imprecise probability: first, isolate a set of probability functions as one’s belief state; then, choose a probability function from within this set for decision making—this might be done in advance of any particular decision problem arising—and use that function to make decisions by maximising expected utility. While this sort of procedure is not the only way of thinking about imprecise probability, it does have some adherents. It is a component of the transferrable belief model of Smets and Kennes [25], for instance, and Keynes advocated a similar sort of view:

the prospect of a European war is uncertain, or the price of copper and the rate of interest twenty years hence, or the obsolescence of a new invention, or the position of private wealth-owners in the social system in 1970. About these matters there is no scientific basis on which to form any calculable probability whatever. We simply do not know. Nevertheless, the necessity for action and for decision compels us as practical men to do our best to overlook this awkward fact and to behave exactly as we should if we had behind us a good Benthamite calculation of a series of prospective advantages and disadvantages, each multiplied by its appropriate probability, waiting to be summed.
(p. 214 in [26])

(We are very grateful to an anonymous referee for pointing out that Smets and Kennes adopt this sort of position, and to Hykel Hosni for alerting us to this view of Keynes.)

The results of this paper can be applied at the second step of this two-step procedure. If one wants a probability function for decision making that controls worst-case g-expected default loss, then one should choose a function in one’s belief state with sufficiently high g-entropy (or a limit point of such functions), where g is in

G

, the set of appropriate weighting functions. The resulting approach to imprecise probability is conceptually different to objective Bayesian epistemology, but the two approaches are formally equivalent, with the decision function for imprecise probability corresponding to the belief function for objective Bayesian epistemology.

5.4. A Non-Pragmatic Justification

The line of argument in this paper is thoroughly pragmatic: one ought to satisfy the norms of objective Bayesianism in order to control worst-case expected loss. However, the question has recently arisen as to whether one can adapt arguments that appeal to scoring rules to provide a non-pragmatic justification of the norms of rational belief—see, e.g., Joyce [9]. There appears to be some scope for reinterpreting the arguments of this paper in non-pragmatic terms, along the following lines. Instead of viewing L1–4 as isolating an appropriate default loss function, one can view them as postulates on a measure of the inaccuracy of one’s belief in a true proposition: believing a true proposition does not expose one to inaccuracy; inaccuracy strictly increases as the degree of belief in the true proposition decreases; inaccuracy with respect to a proposition only depends on the degree of belief in that proposition; inaccuracy is additive over independent sublanguages. (L4 would need to be changed insofar as that it would need to be physical probability,

P^{*}

, rather than the agent’s belief function, B, that determines whether sublanguages are independent. This change does not affect the formal results.) A g-scoring rule then measures expected inaccuracy. Strict propriety implies that the physical probability function has minimum expected inaccuracy. (If

P^{*}

is deterministic, i.e.,

P^{*} (ω) = 1

for some

ω \in Ω,

then the unique probability function that puts all mass on ω has minimum expected inaccuracy. In this sense, we can say that strictly proper scoring rules are truth-tracking, which is an important epistemic good.) In order to minimise worst-case g-expected inaccuracy, one would need degrees of belief that are probabilities, that are calibrated to physical probability and that maximise g-entropy.

The main difference between the pragmatic and the non-pragmatic interpretations of the arguments of this paper appears to lie in the default nature of the conclusions under a pragmatic interpretation. It is argued here that loss should be taken to be logarithmic in the absence of knowledge of the true loss function. If one does know the true loss function,

L^{*}

, and this loss function turns out not to be logarithmic, then one should arguably do something other than minimising worst-case expected logarithmic loss—one should minimise worst-case expected

L^{*}

-loss. Under a non-pragmatic interpretation, on the other hand, one might argue that L1-4 characterises the correct measure of the inaccuracy of a belief in a true proposition, not a measure that is provisional in the sense that logarithmic loss is. Thus, the conclusions of this paper are arguably firmer—less provisional—under a non-pragmatic construal.

5.5. Questions for Further Research

We noted above that if one knows the true loss function,

L^{*}

, then one should arguably minimise worst-case expected

L^{*}

-loss. [3] generalise standard entropy in a different direction to that pursued in this paper, in order to argue that minimising worst-case expected

L^{*}

-loss requires maximising entropy in their generalised sense. One interesting question for further research is whether one can generalise the notion of g-entropy in an analogous way, to try to show that minimising worst-case g-expected

L^{*}

-loss requires maximising g-entropy in this further generalised sense.

A second question concerns whether one can extend the discussion of belief over sentences in Section 3 to predicate, rather than propositional, languages. A third question is whether other justifications of the logarithmic score can be used to the justify logarithmic g-score—for example, is the logarithmic g-score the only local strictly proper g-score? Fourth, we suspect that Theorem 3 can be further generalised. Finally, it would be interesting to investigate language invariance in more detail in order to test the conjecture at the end of Section 4.

Acknowledgements

This research was conducted as a part of the project, From objective Bayesian epistemology to inductive logic. We are grateful to the UK Arts and Humanities Research Council for funding this research and to Teddy Groves, Jeff Paris and three anonymous referees for very helpful comments.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix

A. Entropy of Belief Functions

Axiomatic characterizations of standard entropy on probability functions have featured heavily in the literature—see [27]. In this appendix, we provide two characterizations of g-entropy on belief functions, which closely resemble the original axiomatisation provided by Shannon (§6 in [6]). (We appeal to these characterisations in the proof of Proposition 12 in Section B.2.)

We shall need some new notation. Let

k \in N

and

x \in R,

then denote by

x @ k

the tuple

〈 x, x, \dots, x 〉 \in R^{k} .

For

x \in R

and

\vec{y} \in R^{l}

, we denote by

x \cdot \vec{y}

the vector,

〈 x \cdot y_{1}, \dots, x \cdot y_{l} 〉 \in R^{l} .

For a vector

\vec{x} \in R^{k}

, let

| \vec{x} |_{1} = x_{1} + \dots + x_{k} .

Assume in the following that all

x_{i}

and all

y_{i j}

are in

[0, 1] .

Furthermore, let

k, l

henceforth denote the number of components in

\vec{x}

, respectively,

\vec{y}

.

Proposition 10 (First characterisation). Let

H (B) = \sum_{π \in Π} g (π) f (π, B)

, where

f (π, B) : = h (B (F_{1}), . . ., B (F_{k}))

for

π = {F_{1}, . . ., F_{k}}

and:

\begin{matrix} h : ⋃_{k \geq 1} {〈 x_{1}, \dots, x_{k} 〉 : x_{i} \geq 0 & \sum_{i = 1}^{k} x_{i} \leq 1} ⟶ [0, \infty) \end{matrix}

Suppose also that the following conditions hold:

H1:: h is continuous;
H2:: if $1 \leq t_{1} < t_{2} \in N$ then $h (\frac{1}{t_{1}} @ t_{1}) < h (\frac{1}{t_{2}} @ t_{2})$ ;
H3:: if $0 < | \vec{x} |_{1} \leq 1$ and if $| \vec{y_{i}} |_{1} = 1$ for $1 \leq i \leq k,$ then

$\begin{matrix} h (x_{1} \cdot \vec{y_{1}}, \dots, x_{k} \cdot \vec{y_{k}}) = h (x_{1}, \dots, x_{k}) + \sum_{i = 1}^{k} x_{i} h (\vec{y_{i}}) \end{matrix}$
H4:: $q h (\frac{1}{t}) = h (\frac{1}{t} @ q)$ for $1 \leq q \leq t \in N$ ;

then,

H (B) = - \sum_{π \in Π} g (π) \sum_{F \in π} B (F) log B (F)

.

Proof: We first apply the proof of Paris [21] (pp. 77–78), which implies (using only H1, H2 and H3) that:

\begin{matrix} h (\vec{x}) = - c \sum_{i = 1}^{k} x_{i} log x_{i} \end{matrix}

(83)

for all

\vec{x}

with

| \vec{x} |_{1} = 1

, where

c \in R_{> 0}

is some constant.

Now suppose

0 < | \vec{x} |_{1} < 1 .

Then, with

y_{i} : = \frac{x_{i}}{| \vec{x} |_{1}}

, we have

\vec{x} = {| \vec{x} |}_{1} \cdot \vec{y}

and

| \vec{y} |_{1} = 1 .

Thus:

\begin{matrix} h (\vec{x}) & = h (| \vec{x} |_{1} \cdot \vec{y}) \overset{H 3}{=} h (| \vec{x} |_{1}) + | \vec{x} |_{1} h (\vec{y}) \overset{(83)}{=} h (| \vec{x} |_{1}) - | \vec{x} |_{1} c \sum_{i = 1}^{l} y_{i} log y_{i} . \end{matrix}

(84)

We will next show that

h (x) = - c x log x

for

x \in [0, 1) .

Thus, note that

h (\frac{1}{t}) \overset{H 4}{=} \frac{1}{t} h (\frac{1}{t} @ t) \overset{(83)}{=} \frac{1}{t} (- c t \frac{1}{t} log \frac{1}{t}) = - c \frac{1}{t} log \frac{1}{t} .

For

1 \leq q \leq t \in N

, we now find:

\begin{matrix} - c \frac{1}{t} log \frac{1}{t} = h (\frac{1}{t}) \overset{H 4}{=} \frac{1}{q} h (\frac{1}{t} @ q) = \frac{1}{q} h (\frac{q}{t} \cdot \frac{1}{q} @ q) & \overset{H 3}{=} \frac{1}{q} (h (\frac{q}{t}) + \frac{q}{t} h (\frac{1}{q} @ q)) \\ \overset{(83)}{=} \frac{1}{q} (h (\frac{q}{t}) + \frac{q}{t} (- c q \frac{1}{q} log \frac{1}{q})) \end{matrix}

(85)

Thus:

\begin{matrix} h (\frac{q}{t}) & = - c \frac{q}{t} (log (\frac{1}{t}) - log (\frac{1}{q})) \\ = - c \frac{q}{t} log \frac{q}{t} \end{matrix}

(86)

Hence, h is of the claimed form for rational numbers in

(0, 1] .

The continuity axiom, H1, now guarantees that

h (x) = - c x log x

for all

x \in [0, 1] \subset R .

Putting our results together, we obtain:

\begin{matrix} h (\vec{x}) & = - c | \vec{x} |_{1} log | \vec{x} |_{1} - c | \vec{x} |_{1} \sum_{i = 1}^{l} y_{i} log y_{i} = - c | \vec{x} |_{1} (\sum_{i = 1}^{l} y_{i} log | \vec{x} |_{1} + \sum_{i = 1}^{l} y_{i} log y_{i}) \\ = - c | \vec{x} |_{1} \sum_{i = 1}^{l} y_{i} log (| \vec{x} |_{1} \cdot y_{i}) \\ = - c \sum_{i = 1}^{l} x_{i} log x_{i} \end{matrix}

(87)

Finally, note that h does satisfy all the axioms. The constant, c, can then be absorbed into the weighting function, g, to give

H (B) = - \sum_{π \in Π} g (π) \sum_{F \in π} B (F) log B (F)

, as required. ■

A tighter analysis reveals that the axiomatic characterization above may be weakened. We may replace H3 by the following two instances of H3:

A: If

| \vec{x} |_{1} = 1

and if

| \vec{y_{i}} |_{1} = 1

for

1 \leq i \leq k,

then

\begin{matrix} h (x_{1} \cdot \vec{y_{1}}, \dots, x_{k} \cdot \vec{y_{k}}) = h (x_{1}, \dots, x_{k}) + \sum_{i = 1}^{k} x_{i} h (\vec{y_{i}}) \end{matrix}

B: If

0 < x < 1

and if

| \vec{y} |_{1} = 1,

then

\begin{matrix} h (x \cdot \vec{y}) = h (x) + x h (\vec{y}) \end{matrix}

Property A is, of course, Shannon’s original axiom H3. The axiom H3 used above is the straightforward generalization of Shannon’s H3 to vectors

\vec{x}

summing to less than one.

Proposition 11 (Second characterisation). Let

H (B) = \sum_{π \in Π} g (π) f (π, B)

, where

f (π, B) : = h (B (F_{1}), . . ., B (F_{k}))

for

π = {F_{1}, . . ., F_{k}}

and:

\begin{matrix} h : ⋃_{k \geq 1} {〈 x_{1}, \dots, x_{k} 〉 : x_{i} \geq 0 & \sum_{i = 1}^{k} x_{i} \leq 1} ⟶ [0, \infty) \end{matrix}

Suppose also that the following conditions hold:

H1:: h is continuous;
H2:: if $1 \leq t_{1} < t_{2} \in N$ then $h (\frac{1}{t_{1}} @ t_{1}) < h (\frac{1}{t_{2}} @ t_{2})$ ;
A:: if $| \vec{x} |_{1} = 1$ and if $| \vec{y_{i}} |_{1} = 1$ for $1 \leq i \leq k,$ then

$\begin{matrix} h (x_{1} \cdot \vec{y_{1}}, \dots, x_{k} \cdot \vec{y_{k}}) = h (x_{1}, \dots, x_{k}) + \sum_{i = 1}^{k} x_{i} h (\vec{y_{i}}) \end{matrix}$
B:: : if $0 < x < 1$ and if $| \vec{y} |_{1} = 1,$ then

$\begin{matrix} h (x \cdot \vec{y}) = h (x) + x h (\vec{y}) \end{matrix}$
C:: for $0 < x, y < 1,$ it holds that $h (x \cdot y) = x h (y) + y h (x)$ ;
D:: for $0 < x < 1,$ it holds that $h (x) = h (x, 1 - x) - h (1 - x)$ ;

then,

H (B) = - \sum_{π \in Π} g (π) \sum_{F \in π} B (F) log B (F)

.

Proof: We shall again invoke the proof in Paris [21] (pp. 77–78) to show (using only H1, H2 and A) that:

\begin{matrix} h (\vec{x}) = - c \sum_{i = 1}^{k} x_{i} log x_{i} \end{matrix}

(88)

for all

\vec{x}

with

| \vec{x} |_{1} = 1

and some constant,

c \in R_{> 0}

.

Now suppose

0 < | \vec{x} |_{1} < 1 .

Then, with

y_{i} : = \frac{x_{i}}{| \vec{x} |_{1}}

, we have

\vec{x} = {| \vec{x} |}_{1} \cdot \vec{y}

and

| \vec{y} |_{1} = 1 .

Thus:

\begin{matrix} h (\vec{x}) & = h (| \vec{x} |_{1} \cdot \vec{y}) \overset{H 3}{=} h (| \vec{x} |_{1}) + | \vec{x} |_{1} h (\vec{y}) \overset{(88)}{=} h (| \vec{x} |_{1}) - | \vec{x} |_{1} c \sum_{i = 1}^{l} y_{i} log y_{i} \end{matrix}

As we have seen in the previous proof, it now only remains to show that

h (x) = - c x log x

for

x \in [0, 1] \subset R

.

We next show by induction that for all non-zero

t \in N

,

h (\frac{1}{2^{t}}) = - c \frac{1}{2^{t}} log \frac{1}{2^{t}} .

The base case is immediate, observe that:

\begin{matrix} h (\frac{1}{2}) \overset{D}{=} \frac{1}{2} h (\frac{1}{2}, \frac{1}{2}) \overset{(88)}{=} - c \frac{1}{2} log \frac{1}{2} \end{matrix}

(89)

Using the induction hypothesis (IH), the inductive step is straightforward too:

\begin{matrix} h (\frac{1}{2^{t}}) & \overset{C}{=} \frac{1}{2^{t - 1}} h (\frac{1}{2}) + \frac{1}{2} h (\frac{1}{2^{t - 1}}) \\ \overset{I H}{=} - c (\frac{1}{2^{t}} log (\frac{1}{2}) + \frac{1}{2^{t}} log (\frac{1}{2^{t - 1}})) \\ = - c \frac{1}{2^{t}} log \frac{1}{2^{t}} \end{matrix}

(90)

We next show by induction on

t \geq 1

that for all non-zero natural numbers

m < 2^{t}

,

h (\frac{m}{2^{t}}) = - c \frac{m}{2^{t}} log \frac{m}{2^{t}} .

For the base case, simply note that

t = m = 1

, and thus:

\begin{matrix} h (\frac{1}{2^{1}}) = - c \frac{1}{2} log \frac{1}{2} \end{matrix}

(91)

The inductive step follows for

m < 2^{t - 1}

:

\begin{matrix} h (\frac{m}{2^{t}}) & \overset{C}{=} \frac{m}{2^{t - 1}} h (\frac{1}{2}) + \frac{1}{2} h (\frac{m}{2^{t - 1}}) \\ \overset{I H}{=} - c \frac{m}{2^{t - 1}} \frac{1}{2} log (\frac{1}{2}) - c \frac{1}{2} \frac{m}{2^{t - 1}} log (\frac{m}{2^{t - 1}}) \\ = - c \frac{m}{2^{t}} log \frac{m}{2^{t}} \end{matrix}

(92)

For

2^{t - 1} < m < 2^{t}

, we find:

\begin{matrix} h (\frac{m}{2^{t}}) & \overset{D}{=} h (\frac{m}{2^{t}}, \frac{2^{t} - m}{2^{t}}) - h (\frac{2^{t} - m}{2^{t}}) \\ \overset{(88)}{=} - c (\frac{m}{2^{t}} log (\frac{m}{2^{t}}) + \frac{2^{t} - m}{2^{t}} log (\frac{2^{t} - m}{2^{t}})) - h (\frac{2^{t} - m}{2^{t}}) \\ \overset{I H}{=} - c (\frac{m}{2^{t}} log (\frac{m}{2^{t}}) + \frac{2^{t} - m}{2^{t}} log (\frac{2^{t} - m}{2^{t}})) + c \frac{2^{t} - m}{2^{t}} log (\frac{2^{t} - m}{2^{t}}) \\ = - c \frac{m}{2^{t}} log \frac{m}{2^{t}} \end{matrix}

(93)

Since rational numbers of the form

\frac{m}{2^{t}}

are dense in

[0, 1] \subset R

, we can use the continuity axiom, H1, to conclude that h has to be of the desired form.

Finally, note that h does satisfy all the axioms. The constant, c, can then be absorbed into the weighting function, g, to give the required form of

H (B)

. ■

We can combine B and C to form one single axiom, H5, which implies B and C:

H5: if

0 < x < 1

and if

| \vec{y} |_{1} \leq 1,

then

\begin{matrix} h (x \cdot \vec{y}) = {| \vec{y} |}_{1} h (x) + x h (\vec{y}) \end{matrix}

(94)

Clearly, H5 is a natural way to generalize A to belief functions. It now follows easily that H1, H2, A, H5 and D jointly constrain h to

h (\vec{x}) = - c \sum_{i = 1}^{k} x_{i} log x_{i} .

Although it is certainly possible to consider the g-entropy of a belief function, maximising standard entropy over

B

—as opposed to

E \subseteq P

—has bizarre consequences. For

| Ω | = 2

, we have that

{B_{z} \in B : z \in [0, 1], B_{z} (Ω) = z, B_{z} (\emptyset) = 1 - z, B_{z} (ω_{1}) = B_{z} (ω_{2}) = \frac{1}{e}}

is the set of entropy maximisers. This follows from considering the following optimization problem:

\begin{matrix} maximize & - B (ω_{1}) log B (ω_{1}) - B (ω_{2}) log B (ω_{2}) \\ subject to & 0 \leq B (\emptyset), B (Ω), B (ω_{1}), B (ω_{2}) \\ B (ω_{1}) + B (ω_{2}) \leq 1 \\ B (\emptyset) + B (Ω) \leq 1 \\ B (\emptyset) + B (Ω) = 1 or B (ω_{1}) + B (ω_{2}) = 1 \end{matrix}

Putting

B (\emptyset) + B (Ω) = 1

ensures that the last two constraints are satisfied and permits the choice of

B (ω_{1})

,

B (ω_{2})

such that

B (ω_{1}) + B (ω_{2}) < 1 .

For non-negative

B (ω)

, we have that

- B (ω) log B (ω)

obtains the unique maximum at

B (ω) = \frac{1}{e} .

The claimed optimality result follows.

It is worth pointing out that this phenomenon does not depend on the base of the logarithm. For

| Ω | \geq 3

, however, intuition honed by considering the entropy of probability functions does not lead one astray. For

| Ω | \geq 3,

any belief function B with

B (ω) = \frac{1}{| Ω |}

for

ω \in Ω

does maximize standard entropy.

Similarly bizarre consequences also obtain in the case of other g-entropies. For

| Ω | = 2

and

g ({Ω}) + g ({Ω, \emptyset}) ≪ g ({ω_{1}}, {ω_{2}})

, belief functions maximizing g-entropy satisfy

B (ω_{1}) = B (ω_{2}) = \frac{1}{e} .

To see this, simply note that for such g, the optimum obtains for

B (Ω) + B (\emptyset) = 1 .

For the proposition entropy for

| Ω | = 2,

there are two entropy maximisers in

B .

They are

B_{1}^{†} (\emptyset) = B_{1}^{†} (Ω) = \frac{1}{2}, B_{1}^{†} (ω_{1}) = B_{1}^{†} (ω_{2}) = \frac{1}{e}

and

B_{2}^{†} (\emptyset) = B_{2}^{†} (Ω) = \frac{1}{e}, B_{2}^{†} (ω_{1}) = B_{2}^{†} (ω_{2}) = \frac{1}{2} .

Thus, an agent adopting a belief function maximizing g-entropy over

B

may violate the probability norm. Furthermore, the agent may have to choose a belief function from finitely or infinitely many such non-probabilistic functions. For an agent minimizing worst-case g-expected loss, these bizarre situations do not arise. From Theorem 2 and knowing that for inclusive g, minimizing worst-case g-expected loss forces the agent to adopt a probability function that maximizes g-entropy over the set

E

of calibrated probability functions. By Corollary 2, this probability function is unique.

B. Properties of g-Entropy Maximisation

The general properties of standard entropy (defined on probability functions) have been widely studied in the literature. Here, we examine general properties of the g-entropy of a probability function, for

g \in G .

We have already seen one difference between standard and g-entropy in Section 4: standard entropy satisfies language invariance; g-entropy, in general, need not. Surprisingly, language invariance seems to be an exception. Standard entropy and g-entropy behave in many respects in the same way.

B.1. Preserving the Equivocator

For example, as we shall see now, if g is inclusive and symmetric then the probability function that is deemed most equivocal—i.e., the function, out of all probability functions, with maximum g-entropy—is the equivocator function,

P_{=}

, which gives each state the same probability.

Definition 16 (Equivocator-preserving). A weighting function g is called equivocator-preserving, if and only if

arg {sup}_{P \in P} H_{g} (P) = P_{=} .

That symmetry and inclusiveness are sufficient for g to be equivocator-preserving will follow from the following lemma:

Lemma 9. For inclusive

g,

g is equivocator-preserving if and only if:

\begin{matrix} z (ω) : = \sum_{\begin{matrix} F \subseteq Ω \\ ω \in F \end{matrix}} \sum_{\begin{matrix} π \in Π \\ F \in π \end{matrix}} - g (π) (1 - log | Ω | + log | F |) = c \end{matrix}

(95)

for some constant, c.

Proof: Recall from Proposition 2 that g-entropy is strictly concave on

P .

Thus, every critical point in the interior of

P

is the unique maximiser of

H_{g} (\cdot)

on

P .

Now consider the Lagrange function,

L a g

:

\begin{matrix} L a g (P) & = λ (- 1 + \sum_{ω \in Ω} P (ω)) + H_{g} (P) \\ = λ (- 1 + \sum_{ω \in Ω} P (ω)) + \sum_{π \in Π} - g (π) \sum_{F \in π} (\sum_{ω \in F} P (ω)) (log \sum_{ω \in F} P (ω)) \end{matrix}

(96)

For fixed

ω \in Ω

and

π \in Π

, denote by

F_{ω, π}

the unique

F \subseteq Ω

such that

ω \in F

and

F \in π .

Taking derivatives, we obtain:

\begin{matrix} \frac{\partial}{\partial P (ω)} L a g (P) = λ + \sum_{π \in Π} - g (π) (1 + log \sum_{ν \in F_{ω, π}} P (ν)) for all ω \in Ω \end{matrix}

(97)

Now, if

P_{=}

maximises g-entropy, then for all

ω \in Ω

, the following must vanish:

\begin{matrix} \frac{\partial}{\partial P (ω)} L a g (P_{=}) & = λ + \sum_{π \in Π} - g (π) (1 + log P_{=} (F_{ω, π})) \\ = λ + \sum_{π \in Π} - g (π) (1 + log \frac{| F_{ω, π} |}{| Ω |}) \\ = λ + \sum_{π \in Π} - g (π) (1 - log | Ω | + log | F_{ω, π} |) \\ = λ + \sum_{\begin{matrix} F \subseteq Ω \\ ω \in F \end{matrix}} \sum_{\begin{matrix} π \in Π \\ F \in π \end{matrix}} - g (π) (1 - log | Ω | + log | F |) \end{matrix}

(98)

Since this expression has to vanish for all

ω \in Ω

, it does not depend on

ω .

On the other hand, if g is such that:

\sum_{\begin{matrix} F \subseteq Ω \\ ω \in F \end{matrix}} \sum_{\begin{matrix} π \in Π \\ F \in π \end{matrix}} - g (π) (1 - log | Ω | + log | F |)

(99)

does not depend on

ω,

then

P_{=}

is a critical point of

L a g (P)

and, thus, is the entropy maximiser. ■

Corollary 6. If g is symmetric and inclusive, then it is equivocator-preserving.

Proof: By Lemma 9, we only need to show that:

\begin{matrix} \sum_{\begin{matrix} F \subseteq Ω \\ ω \in F \end{matrix}} \sum_{\begin{matrix} π \in Π \\ F \in π \end{matrix}} - g (π) (1 - log | Ω | + log | F |) \end{matrix}

(100)

does not depend on

ω .

Denote by

π_{i j}

, respectively

F_{i j}

, the result of replacing

ω_{i}

by

ω_{j}

and vice versa in

π \in Π,

respectively

F \subseteq Ω .

By the symmetry of g, we have

g (π) = g (π_{i j}) .

Since

| F | = | F_{i j} |

, we then find for all

ω_{i}, ω_{j} \in Ω

:

\begin{matrix} \sum_{\begin{matrix} F \subseteq Ω \\ ω_{i} \in F \end{matrix}} \sum_{\begin{matrix} π \in Π \\ F \in π \end{matrix}} - g (π) (1 - log | Ω | + log | F |) & = \sum_{\begin{matrix} F \subseteq Ω \\ ω_{i} \in F \end{matrix}} \sum_{\begin{matrix} π \in Π \\ F \in π \end{matrix}} - g (π_{i j}) (1 - log | Ω | + log | F_{i j} |) \\ = \sum_{\begin{matrix} F \subseteq Ω \\ ω_{i} \in F \end{matrix}} \sum_{\begin{matrix} π \in Π \\ F_{i j} \in π \end{matrix}} - g (π) (1 - log | Ω | + log | F_{i j} |) \\ = \sum_{\begin{matrix} F \subseteq Ω \\ ω_{j} \in F \end{matrix}} \sum_{\begin{matrix} π \in Π \\ F \in π \end{matrix}} - g (π) (1 - log | Ω | + log | F |) \end{matrix}

(101)

■

Are there are any non-symmetric, inclusive g that are equivocator-preserving? We pose this as an interesting question for further research.

B.2. Updating

Next, we show that there is widespread agreement between updating by conditionalisation and updating by g-entropy maximisation, a result to which we alluded in Section 5.

Proposition 12. Suppose that

E

is the set of probability functions calibrated with evidence E. Let g be inclusive and

G \subseteq Ω

such that

E^{'} = {P \in E : P (G) = 1} \neq \emptyset

, where

E^{'}

is the set of probability functions calibrated with evidence

E \cup {G}

. Then, the following are equivalent:

$P_{E}^{†} (\cdot | G) \in [E]$
$P_{E \cup {G}}^{†} (\cdot) = P_{E}^{†} (\cdot | G)$ ,

where

P_{E}^{†}, P_{E \cup {G}}^{†}

are functions in

E, E^{'}

, respectively, that maximise g-entropy.

Proof: First, suppose that

P_{E}^{†} (\cdot | G) \in [E] .

Observe that if

E^{'} = E,

then there is nothing to prove. Thus, suppose that

E^{'} \subset E .

Hence, there exists a function

P \in E

with

P (\bar{G}) > 0 .

By Proposition 18, inclusive g are open-minded, hence

P_{E}^{†} (\bar{G}) > 0 .

(Note that the proof of Proposition 18 does not itself depend on Proposition 12.) Therefore,

P_{E}^{†} (\cdot | \bar{G})

is well defined.

Now let

P_{1}^{†} : = P_{E \cup {G}}^{†}

and

P^{†} : = P_{E}^{†} .

Then, assume for contradiction that

P_{1}^{†} (F) \neq P^{†} (F | G)

for some

F \subseteq Ω .

By Corollary 2, the g-entropy maximiser

P_{1}^{†}

in

[E^{'}]

is unique; furthermore,

P^{†} (\cdot | G) \in [E^{'}] .

It follows that:

\begin{matrix} \sum_{π \in Π} - g (π) \sum_{F^{'} \in π} P_{1}^{†} (F^{'}) log P_{1}^{†} (F^{'}) & = H_{g} (P_{1}^{†}) \\ > H_{g} (P^{†} (\cdot | G)) \\ = \sum_{π \in Π} - g (π) \sum_{F^{'} \in π} P^{†} (F^{'} | G) log P^{†} (F^{'} | G) \end{matrix}

(102)

Now, define

P^{'} (\cdot) = P^{†} (G) P_{1}^{†} (\cdot | G) + P^{†} (\bar{G}) P^{†} (\cdot | \bar{G}) .

Since

[E]

is convex,

P_{1}^{†}, P^{†} (\cdot | G) \in [E]

, and since

P_{1}^{†} (\cdot | G) = P_{1}^{†}

, we have that

P^{'} \in [E] .

Using the above inequality, we observe, using axiom A of Appendix A with

\vec{x} = 〈 P^{†} (G), P^{†} (\bar{G}) 〉,

{\vec{y}}_{1} = 〈 P_{1}^{†} (F^{'} | G) : F^{'} \in π 〉

and

{\vec{y}}_{2} = 〈 P^{†} (F^{'} | G) : F^{'} \in π 〉

that:

\begin{matrix} H_{g} (P^{'}) = & \sum_{π \in Π} - g (π) \sum_{F^{'} \in π} P^{'} (F^{'}) log P^{'} (F^{'}) \\ = & \sum_{π \in Π} - g (π) \sum_{F^{'} \in π} (P^{†} (G) P_{1}^{†} (F^{'} | G) + P^{†} (\bar{G}) P^{†} (F^{'} | \bar{G})) log (P^{†} (G) P_{1}^{†} (F^{'} | G) + P^{†} (\bar{G}) P^{†} (F^{'} | \bar{G})) \\ \overset{A}{=} & \sum_{π \in Π} - g (π) (P^{†} (G) log P^{†} (G) + P^{†} (\bar{G}) log P^{†} (\bar{G})) \\ + \sum_{π \in Π} - g (π) (P^{†} (G) \sum_{F^{'} \in π} P_{1}^{†} (F^{'} | G) log P_{1}^{†} (F^{'} | G) + P^{†} (\bar{G}) \sum_{F^{'} \in π} P^{†} (F^{'} | \bar{G}) log P^{†} (F^{'} | \bar{G})) \\ = & \sum_{π \in Π} - g (π) (P^{†} (G) log P^{†} (G) + P^{†} (\bar{G}) log P^{†} (\bar{G})) \\ + \sum_{π \in Π} - g (π) (P^{†} (G) \sum_{F^{'} \in π} P_{1}^{†} (F^{'}) log P_{1}^{†} (F^{'}) + P^{†} (\bar{G}) \sum_{F^{'} \in π} P^{†} (F^{'} | \bar{G}) log P^{†} (F^{'} | \bar{G})) \\ > & \sum_{π \in Π} - g (π) (P^{†} (G) log P^{†} (G) + P^{†} (\bar{G}) log P^{†} (\bar{G})) \\ + \sum_{π \in Π} - g (π) (P^{†} (G) \sum_{F^{'} \in π} P^{†} (F^{'} | G) log P^{†} (F^{'} | G) + P^{†} (\bar{G}) \sum_{F^{'} \in π} P^{†} (F^{'} | \bar{G}) log P^{†} (F^{'} | \bar{G})) \\ = & H_{g} (P^{†}) \end{matrix}

(103)

Our above calculation contradicts that

P^{†}

maximises g-entropy over

[E] .

Thus,

P_{1}^{†} (\cdot) = P^{†} (\cdot | G) .

Conversely, suppose that

P_{E}^{†} (\cdot | G) = P_{E \cup {G}}^{†} (\cdot) .

Now, simply observe

P_{E}^{†} (\cdot | G) \in [E^{'}] \subseteq [E] .

■

Theorem 9. Suppose that convex and closed

E

is the set of probability functions calibrated with evidence E, and

E^{'}

is the set of probability functions calibrated with evidence

E \cup {G}

. Furthermore, suppose that

P_{E}^{†}, P_{E \cup {G}}^{†}

are functions in

E, E^{'}

, respectively, that maximise g-entropy for some fixed

g \in G_{inc} \cup {g_{Ω}} .

If:

(i): $G \subseteq Ω$ ,
(ii): the only constraints imposed by $E \cup {G}$ are the constraints imposed by E together with the constraint $P (G) = 1$ ,
(iii): the constraints in (ii) are consistent, and
(iv): $P_{E}^{†} (\cdot | G) \in E$ ,

then,

P_{E \cup {G}}^{†} (F) = P_{E}^{†} (F | G)

for all

F \subseteq Ω

.

Proof: For

g \in G_{inc}

, this follows directly from Proposition 12. Simply note that

E = [E]

and, thus,

P_{E}^{†} (\cdot | G) \in [E]

.

The proof of Proposition 12 also goes through for

g = g_{Ω} .

This follows from the fact that all the ingredients in the proof—open-mindedness, uniqueness of the g-entropy maximiser on a convex set

E

and the axiomatic characterizations in Appendix A—also hold for standard entropy. ■

This extends Seidenfeld’s result for standard entropy, Theorem 8, to arbitrary convex sets,

E \subseteq P,

and, also, to inclusive weighting functions.

B.3. Paris-Vencovská Properties

The following eight principles have played a central role in axiomatic characterizations of the maximum entropy principle by Paris and Vencovská—c.f., [21,28,29,30]. The first seven principles were first put forward in [29]. [28] views all eight principles as following from the following single common-sense principle: “Essentially similar problems should have essentially similar solutions.”

While Paris and Vencovská mainly considered linear constraints, we shall consider arbitrary convex sets,

E, E_{1}

. Adopting their definitions and using our notation, we investigate the following properties:

Definition 17 (1: Equivalence).

P^{†}

only depends on

E

and not on the constraints that give rise to

E .

This clearly holds for every weighting function

g .

Definition 18 (2: Renaming). Let

p e r

be an element of the permutation group on

{1, \dots, | Ω |} .

For a proposition

F \subseteq Ω

with

F = {ω_{i_{1}}, \dots, ω_{i_{k}}}

, define

p e r (F) = {ω_{p e r (i_{1})}, \dots, ω_{p e r (i_{k})}} .

Next, let

p e r (B (F)) = B (p e r (F))

and

p e r (E) = {p e r (P) : P \in E} .

Then, g satisfies renaming if and only if

P_{E}^{†} (F) = P_{p e r (E)}^{†} (p e r (F)) .

Proposition 13. If g is inclusive and symmetric, then g satisfies renaming.

Proof: For

π \in Π

with

π = {F_{i_{1}}, \dots, F_{i_{f}}}

, define

p e r (π) = {p e r (F_{i_{1}}), \dots, p e r (F_{i_{f}})} .

Using that g is symmetric for the second equality, we find:

\begin{matrix} H_{g} (P) & = - \sum_{π \in Π} g (π) \sum_{F \in π} P (F) log P (F) \\ = - \sum_{π \in Π} g (p e r^{- 1} (π)) \sum_{F \in π} P (F) log P (F) \\ = - \sum_{π \in Π} g (π) \sum_{F \in p e r (π)} P (F) log P (F) \\ = - \sum_{π \in Π} g (π) \sum_{F \in π} P (p e r (F)) log P (p e r (F)) \\ = H_{g} (p e r (P)) \end{matrix}

(104)

Thus,

P_{p e r (E)}^{†} = p e r (P^{†})

, and hence,

P_{p e r (E)}^{†} (p e r (F)) = p e r (P^{†}) (p e r (F)) = P^{†} (F) .

■

Weighting functions g satisfying the renaming property satisfy a further symmetry condition, as we shall see now.

Definition 19 (Symmetric complement). For

P \in P

, define the symmetric complement of P with respect to

A_{i},

denoted by

σ_{i} (P),

as follows:

\begin{matrix} σ_{i} (P) & (_{\pm} A_{1} \land \dots \land_{\pm} A_{i - 1} \land_{\pm} A_{i} \land_{\pm} A_{i + 1} \land \dots \land_{\pm} A_{n}) \\ : = P (_{\pm} A_{1} \land \dots \land_{\pm} A_{i - 1} \land_{\pm} A_{i} \land_{\pm} A_{i + 1} \land \dots \land_{\pm} A_{n}) \end{matrix}

i.e.,

σ_{i} (P) (ω) = P (ω^{'})

, where

ω^{'}

is ω but with

A_{i}

negated. A function

P \in P

is called symmetric with respect to

A_{i}

if and only if

P = σ_{i} (P) .

We call

E \subseteq P

symmetric with respect to

A_{i}

just when the following condition holds:

P \in [E]

, if and only if

σ_{i} (P) \in [E] .

Corollary 7. For all symmetric and inclusive g and all

E

that are symmetric with respect to

A_{i}

, it holds that:

P^{†} = σ_{i} (P^{†})

Thus, if

E

is symmetric with respect to

A_{i}

so is

P^{†} .

Proof: Since g is symmetric and inclusive, there is some function

γ : N \to R_{> 0}

, such that

H_{g} (P) = \sum_{F \subseteq Ω} - γ (| F |) P (F) log P (F)

for all

P \in P .

Hence:

\begin{matrix} H_{g} (P^{†}) & = \sum_{F \subseteq Ω} - γ (| F |) P^{†} (F) log P^{†} (F) \\ = \sum_{F \subseteq Ω} - γ (| F |) \cdot σ_{i} (P^{†}) (F) \cdot log (σ_{i} (P^{†}) (F)) \\ = H_{g} (σ_{i} (P^{†})) \end{matrix}

(105)

Since

E

is symmetric with respect to

A_{i}

, we have that

σ_{i} (P^{†}) \in [E] .

Therefore, if

P^{†} \neq σ_{i} (P^{†})

then there are two different probability functions in

[E]

which both have maximum entropy. This contradicts the uniqueness of the g-entropy maximiser (Corollary 2). ■

This Corollary explains the symmetries exhibited in the tables in the proof of Proposition 8. Since in that proof,

E

is symmetric with respect to

A_{3},

the proposition entropy and the partition entropy maximisers are symmetric with respect to

A_{3} .

Thus,

P_{P Ω, L^{'}}^{†} (ω \land A_{3}) = P_{P Ω, L^{'}}^{†} (ω \land \neg A_{3})

and

P_{Π, L^{'}}^{†} (ω \land A_{3}) = P_{Π, L^{'}}^{†} (ω \land \neg A_{3})

for all

ω \in Ω .

Definition 20 (3: Irrelevance). Let

P_{1}, P_{2}

be the sets of probability functions on disjoint

L_{1}, L_{2}

, respectively. Then irrelevance holds if, for

E_{1} \subseteq P_{1}

and

E_{2} \subseteq P_{2}

, we have that

P_{E_{1}}^{†} (F \times Ω_{2}) = P_{E_{1} \times E_{2}}^{†} (F \times Ω_{2})

for all propositions F of

L_{1},

where

P_{E_{1}}^{†}, P_{E_{1} \times E_{2}}^{†}

are the g-entropy maximisers on

L_{1} \cup L_{2}

with respect to

E_{1} \times P_{2},

respectively,

E_{1} \times E_{2} .

Proposition 14. Neither the partition nor the proposition weighting satisfy irrelevance.

Proof: Let

L_{1} = {A_{1}, A_{2}},

L_{2} = {A_{3}},

E_{1} = {P \in P_{1} : P (A_{1} \land A_{2}) + 2 P (\neg A_{1} \land \neg A_{2}) = 0.2}

and

E_{2} = {P \in P_{2} : P (A_{3}) = 0.1} .

Then, with

ω_{1} = \neg A_{1} \land \neg A_{2} \land \neg A_{3},

ω_{2} = \neg A_{1} \land \neg A_{2} \land A_{3}

and so on, we find:

Table A1. Partition entropy and proposition entropy maximisers and irrelevance.

**Table A1.** Partition entropy and proposition entropy maximisers and irrelevance.
	$ω_{1}$	$ω_{2}$	$ω_{3}$	$ω_{4}$	$ω_{5}$	$ω_{6}$	$ω_{7}$	$ω_{8}$
$P_{Π, E_{1}}^{†}$	$0.0142$	$0.0142$	$0.2071$	$0.2071$	$0.2071$	$0.2071$	$0.0715$	$0.0715$
$P_{Π, E_{1} \times E_{2}}^{†}$	$0.0312$	$0.0004$	$0.3692$	$0.0466$	$0.3692$	$0.0466$	$0.1304$	$0.0064$
$P_{P Ω, E_{1}}^{†}$	$0.0050$	$0.0050$	$0.2025$	$0.2025$	$0.2025$	$0.2025$	$0.0901$	$0.0901$
$P_{P Ω, E_{1} \times E_{2}}^{†}$	$0.0211$	$6.2 \times 10^{- 9}$	$0.3606$	$0.0500$	$0.3606$	$0.0500$	$0.1577$	$2.3 \times 10^{- 6}$

Now, simply note that, for instance:

\begin{matrix} P_{Π, E_{1}}^{†} (\neg A_{1} \land \neg A_{2}) & = P_{Π, E_{1}}^{†} (ω_{1}) + P_{Π, E_{1}}^{†} (ω_{2}) \\ \neq P_{Π, E_{1} \times E_{2}}^{†} (ω_{1}) + P_{Π, E_{1} \times E_{2}}^{†} (ω_{2}) = P_{Π, E_{1} \times E_{2}}^{†} (\neg A_{1} \land \neg A_{2}) \end{matrix}

(106)

(As we are going to see in Proposition 18, none of the values in the table can be zero. Therefore, the small numerical values found by computer approximation are not artifacts of the approximations involved.) ■

Definition 21 (4: Relativisation). Let

\emptyset \subset F \subset Ω

,

E = {P \in P : P (F) = z} \cap E_{1} \cap E_{2}

and

E^{'} = {P \in P : P (F) = z} \cap E_{1} \cap E_{2}^{'}

, where

E_{1}

is determined by a set of constraints on the

P (G)

with

G \subseteq F

, and the

E_{2}, E_{2}^{'}

are determined by a set of constraints on the

P (G)

with

G \subseteq \bar{F} .

Then,

P_{E}^{†} (G) = P_{E^{'}}^{†} (G)

for all

G \subseteq F .

Proposition 15. Neither the partition not the proposition weighting satisfy relativisation.

Proof: Let

| Ω | = 8,

F = {ω_{1}, ω_{2}, ω_{3}, ω_{4}, ω_{5}},

P (F) = 0.5

, and put

E_{1} = {P \in P : P (ω_{1}) + 2 P (ω_{2}) + 3 P (ω_{3}) + 4 P (ω_{4}) = 0.2}, E_{2} = P, E_{2}^{'} = {P \in P : P (ω_{6}) + 2 P (ω_{7}) + 3 P (ω_{8}) = 0.7} .

Then,

P_{Π, E}^{†}

and

P_{Π, E^{'}}^{†}

differ substantially on three out of five

ω \in F

, as do

P_{P Ω, E}^{†}

and

P_{P Ω, E^{'}}^{†}

, as can be seen from the following table:

Table A2. Partition entropy and proposition entropy maximisers and relativisation.

**Table A2.** Partition entropy and proposition entropy maximisers and relativisation.
	$ω_{1}$	$ω_{2}$	$ω_{3}$	$ω_{4}$	$ω_{5}$	$ω_{6}$	$ω_{7}$	$ω_{8}$
$P_{Π, E}^{†}$	$0.1251$	$0.0308$	$0.0041$	$0.0003$	$0.3398$	$0.1667$	$0.1667$	$0.1667$
$P_{Π, E^{'}}^{†}$	$0.1242$	$0.0312$	$0.0041$	$0.0003$	$0.3402$	$0.3356$	$0.1288$	$0.0356$
$P_{P Ω, E}^{†}$	$0.1523$	$0.0239$	$5.5 \times 10^{- 7}$	$6.8 \times 10^{- 9}$	$0.3239$	$0.1667$	$0.1667$	$0.1667$
$P_{P Ω, E^{'}}^{†}$	$0.1495$	$0.0252$	$7.0 \times 10^{- 7}$	$7.6 \times 10^{- 9}$	$0.3252$	$0.3252$	$0.1495$	$0.0252$

■

Definition 22 (5: Obstinacy). If

E_{1}

is a subset of

E

such that

P_{E}^{†} \in [E_{1}],

then

P_{E}^{†} = P_{E_{1}}^{†} .

Proposition 16. If g is inclusive, then it satisfies the obstinacy principle.

Proof: This follows directly from the definition of

P_{E}^{†}

. ■

Definition 23 (6: Independence). If

E = {P \in P | P (A_{1} \land A_{3}) = α, P (A_{2} \land A_{3}) = β, P (A_{3}) = γ},

then for

γ > 0

, it holds that

P^{†} (A_{1} \land A_{2} \land A_{3}) = \frac{α β}{γ} .

Proposition 17. Neither the partition entropy nor the proposition weighting satisfy independence.

Proof: Let

L = {A_{1}, A_{2}, A_{3}},

α = 0.2, β = 0.35, γ = 0.6,

then:

\begin{matrix} P_{Π}^{†} (A_{1} \land A_{2} \land A_{3}) = 0.1197 \neq 0.1167 = \frac{0.2 \cdot 0.35}{0.6} \end{matrix}

and

\begin{matrix} P_{P Ω}^{†} (A_{1} \land A_{2} \land A_{3}) = 0.1237 \neq 0.1167 = \frac{0.2 \cdot 0.35}{0.6} \end{matrix}

■

Definition 24 (7: Open-mindedness). A weighting function g is open-minded, if and only if for all

E

and all

\emptyset \subseteq F \subseteq Ω

, it holds that

P^{†} (F) = 0

if and only if

P (F) = 0

for all

P \in E .

Proposition 18. Any inclusive g is open-minded.

Proof: First, observe that

P (F) = 0

for all

P \in E,

if and only if

P (F) = 0

for all

P \in [E] .

Now, note that if

P (F) = 0

for all

P \in [E],

then

P_{g}^{†} (F) = 0,

since

P_{g}^{†} \in [E] .

On the other hand, if there exists an

F \subseteq Ω

such that

P_{g}^{†} (F) = 0 < P (F)

for some

P \in [E],

then

S_{g}^{log} (P, P_{g}^{†}) = \infty > H_{g} (P_{g}^{†}) .

Thus, adopting

P_{g}^{†}

exposes one to an infinite loss, and by Theorem 2 adopting the g-entropy maximiser exposes one to the finite loss,

H_{g} (P_{g}^{†}) .

This is a contradiction. Thus,

P_{g}^{†} (F) > 0 .

Overall,

P_{g}^{†} (F) = 0

if and only if

P (F) = 0

for all

P \in [E] .

■

Definition 25 (8: Continuity). Let us recall the definition of the Blaschke metric, Δ, between two convex sets,

E, E_{1} \subseteq P

:

\begin{matrix} Δ (E, E_{1}) = inf {δ | & \forall P \in E \exists P_{1} \in E_{1} : | P, P_{1} | \leq δ \\ & & \forall P_{1} \in E_{1} \exists P \in E : | P, P_{1} | \leq δ} \end{matrix}

where

| \cdot, \cdot |

is the usual Euclidean metric between elements of

R^{| Ω |} .

g satisfies continuity, if and only if the function

arg {sup}_{P \in E} H_{g} (P)

is continuous in the Blaschke metric.

Proposition 19. Any inclusive g satisfies the continuity property.

Proof: Since the g-entropy is strictly concave (see Proposition 2), we may apply Theorem 7.5 on p. 91 in [21]. Thus, if

E

is determined by finitely many linear constraints, then g satisfies continuity. Paris [21] credits I. Maung for the proof of the theorem.

Now, let

E \subseteq P

be an arbitrary convex set. Note that we can approximate

E

arbitrarily closely by two sequences

E_{t}, E^{t}

, where each member of the sequences is determined by finitely many linear constraints, such that

E_{t} \subseteq E_{t + 1} \subseteq E \subseteq E^{t + 1} \subseteq E^{t} .

By this subset relation, we have

{sup}_{P \in E_{t}} H_{g} (P) \leq {sup}_{P \in E} H_{g} (P) \leq {sup}_{P \in E^{t}} H_{g} (P) .

With

P^{†_{t}} : = arg {sup}_{P \in E_{t}} H_{g} (P)

and

P^{†^{t}} : = arg {sup}_{P \in E^{t}} H_{g} (P)

, we have

{lim}_{t \to \infty} P^{†_{t}} = {lim}_{t \to \infty} P^{†^{t}}

by Maung’s theorem.

Since

E^{t}

converges to

E_{t}

in the Blaschke metric, we have by Maung’s theorem that

{lim}_{t \to \infty} {sup}_{P \in E_{t}} H_{g} (P) = {lim}_{t \to \infty} {sup}_{P \in E^{t}} H_{g} (P) = {sup}_{P \in E} H_{g} (P) .

Note that

{lim}_{t \to \infty} P^{†_{t}} \in [E] .

Moreover, since

E

is convex,

H_{g}

is strictly concave, and since

E_{t}

converges to

E

, we have

{lim}_{t \to \infty} H_{g} (P^{†_{t}}) = {sup}_{P \in E} H_{g} (P) .

By the uniqueness of the g-entropy maximiser on

E

, we thus find

{lim}_{t \to \infty} P^{†_{t}} = P^{†},

{lim}_{t \to \infty} P^{†^{t}} = P^{†}

and

{lim}_{t \to \infty} P^{†_{t}} = {lim}_{t \to \infty} P^{†^{t}} .

Since the sets determined by finitely many linear constraints are dense in the set of convex

E \subseteq P

, we can use a standard approximation argument yielding that

arg {sup}_{P \in E} H_{g} (P)

is continuous in the Blaschke metric on the set of convex

E \subseteq P .

■

B.4. The Topology of g-Entropy

We have so far investigated g-entropy for fixed

g \in G .

We now briefly consider the location and shape of the set of g-entropy maximisers.

For standard entropy maximisation and g-entropy maximisation with inclusive and symmetric g, the respective maximisers all obtain at

P_{=},

if

P_{=} \in [E];

cf. Corollary 6.

If

P_{=} \notin [E],

then the maxima all obtain at the boundary of

E

“facing”

P_{=} .

To make this latter observation precise, we denote for

P, P^{'} \in P

the line segment in

P

that connects P with

P^{'}

, end points included, by

\bar{P P^{'}}

.

Proposition 20 (g-entropy is maximised at the boundary). For inclusive and symmetric g,

\bar{P_{=} P^{†}} \cap [E] = {P^{†}} .

Proof: If

P_{=} \in [E],

then

P^{†} = P_{=},

by Corollary 6.

If

P_{=} \notin [E],

suppose that there exists a

P^{'} \in \bar{P_{=} P^{†}} \cap [E]

different from

P^{†} .

Then, by the concavity of g-entropy on

P

(Proposition 2) and the equivocator-preserving property (Corollary 6), we have

H_{g} (P_{=}) > H_{g} (P^{'}) > H_{g} (P^{†}) .

By the convexity of

[E]

and Proposition 2, we have

H_{g} (P^{†}) > H_{g} (P)

for all

P \in [E] ∖ {P^{†}} .

Contradiction. ■

We saw in Theorem 7 that for a particular sequence

g_{t}

converging to

g_{Ω}

,

P_{g_{t}}^{†}

converges to

P_{Ω}^{†} .

We shall now show that this is an instance of a more general phenomenon. We will demonstrate that

P_{g}^{†}

varies continuously for continuous changes in g for

g \in G .

Proposition 21 (Continuity of g-entropy maximisation). For all

E,

the function:

\begin{matrix} arg sup_{P \in E} H_{(\cdot)} (P) : G ⟶ [E], g \mapsto P_{g}^{†} \end{matrix}

is continuous on

G .

Proof: Consider a sequence,

{(g_{t})}_{t \in N} \subseteq G

, converging to some

g \in G .

We need to show that

P_{g_{t}}^{†}

converges to

P_{g}^{†} .

From

g_{t}

converging to g, it easily follows that

H_{g_{t}} (P)

converges to

H_{g} (P)

for all

P \in P .

Since g-entropy is strictly concave, we have that for every

P^{'} \in [E] ∖ {P_{g}^{†}}

, there exists some

ϵ > 0

such that

H_{g} (P^{'}) + ϵ = H_{g} (P_{g}^{†}) .

By the fact that

H_{g_{t}} (P)

converges to

H_{g} (P)

for all P, we find that

H_{g_{t}} (P^{'}) + \frac{ϵ}{2} < H_{g_{t}} (P_{g}^{†})

for all t which are greater than some

T \in N .

Since

H_{g_{t}} (P_{g}^{†}) \leq H_{g_{t}} (P_{g_{t}}^{†})

, it follows that

P^{'}

cannot be a point of accumulation of the sequence,

{(P_{g_{t}}^{†})}_{t \in N} .

The sequence

P_{g_{t}}^{†}

takes values in the compact set

[E],

so it has at least one point of accumulation. We have demonstrated above that

P_{g}^{†}

is the only possible point of accumulation. Hence,

P_{g}^{†}

is the only point of accumulation and, therefore, the limit of this sequence. ■

The continuity of g-entropy maximisation will be instrumental in proving the next proposition, which asserts that the g-entropy maximisers are clustered together.

Proposition 22. For any

E

, if

G \subseteq G_{inc}

is path-connected, then the set

{P_{g}^{†} : g \in G}

is path-connected.

Proof: By Proposition 21, the map

arg {sup}_{P \in E} H_{(\cdot)} (P)

is continuous. The image of a path-connected set under a continuous map is path-connected. ■

Corollary 8. For all

E

, the sets

{P_{g}^{†} : g \in G_{inc}}

and

{P_{g}^{†} : g \in G_{0}}

are path-connected.

Proof:

G_{inc}

and

G_{0}

are convex; thus, they are path-connected. Now, apply Proposition 22. ■

It is, in general, not the case that a convex combination of weighting functions generates a convex combination of the corresponding g-entropy maximisers:

Proposition 23. For a convex combination of weighting functions,

g = λ g_{1} + (1 - λ) g_{2}

, in general, it fails to hold that

P_{g}^{†} = λ P_{g_{1}}^{†} + (1 - λ) P_{g_{2}}^{†} .

Moreover, in general,

P_{g}^{†} \notin \bar{P_{g_{1}}^{†} P_{g_{2}}^{†}} .

Proof: Let

g_{1} = g_{Π}, g_{2} = g_{P Ω}

and

λ = 0.3 .

Then, for a language

L

with two propositional variables and

E = {P \in P : P (ω_{1}) + 2 P (ω_{2}) + 3 P (ω_{3}) + 4 P (ω_{4}) = 1.7}

, we can see from the following table that

P_{0.3 g_{Π} + 0.7 g_{P Ω}}^{†} \neq 0.3 P_{Π}^{†} + 0.7 P_{P Ω}^{†} .

Table A3. Partition entropy and proposition entropy maximisers and their convex combinations.

**Table A3.** Partition entropy and proposition entropy maximisers and their convex combinations.
	$ω_{1}$	$ω_{2}$	$ω_{3}$	$ω_{4}$
$P_{Π}^{†}$	$0.5331$	$0.2841$	$0.1324$	$0.0504$
$P_{P Ω}^{†}$	$0.5192$	$0.3008$	$0.1408$	$0.0392$
$0.3 P_{Π}^{†} + 0.7 P_{P Ω}^{†}$	$0.5234$	$0.2958$	$0.1383$	$0.0426$
$P_{0.3 g_{Π} + 0.7 g_{P Ω}}^{†}$	$0.5272$	$0.2915$	$0.1353$	$0.0459$
$\frac{P_{P Ω}^{†} - P_{0.3 g_{Π} + 0.7 g_{P Ω}}^{†}}{P_{P Ω}^{†} - P_{Π}^{†}}$	$0.5755$	$0.5569$	$0.6429$	$0.6036$

If

P_{0.3 g_{Π} + 0.7 g_{P Ω}}^{†}

were in

\bar{P_{Π}^{†} P_{P Ω}^{†}},

then the last line of the above table would be constant for all

ω \in Ω .

As we can see, the values in the last line do vary. ■

C. Level of Generalisation

In this section we shall show that the generalisation of entropy and score used in the text above is essentially the right one. We shall do this by defining broader notions of entropy and score of which the g-entropy and g-score are special cases, and showing that entropy maximisation only coincides with minimisation of worst-case score in the special case of g-entropy and g-score as they are defined above.

We will focus on the case of belief over propositions; belief over sentences behaves similarly. Our broader notions will be defined relative to a weighting

γ : P Ω ⟶ R_{\geq 0}

of propositions, rather than a weighting

g : Π ⟶ R_{\geq 0}

of partitions.

Definition 26 (γ-entropy). Given a function

γ : P Ω ⟶ R_{\geq 0}

, the γ-entropy of a normalised belief function is defined as:

H_{γ} (B) : = - \sum_{F \subseteq Ω} γ (F) B (F) log B (F)

(107)

Definition 27 (γ-score). Given a loss function, L, and a function

γ : P Ω ⟶ R_{\geq 0}

, the γ-expected loss function or γ-scoring rule, or simply γ-score, is

S_{γ}^{L} : P \times 〈 B 〉 ⟶ [- \infty, \infty]

such that

S_{γ}^{L} (P, B) = \sum_{F \subseteq Ω} γ (F) P (F) L (F, B)

.

Definition 28 (Equivalent to a weighting of partitions). A weighting of propositions

γ : P Ω ⟶ R_{\geq 0}

is equivalent to a weighting of partitions, if there exists a function

g : Π ⟶ R_{\geq 0}

, such that for all

F \subseteq Ω

:

γ (F) = \sum_{\begin{matrix} π \in Π \\ F \in π \end{matrix}} g (π)

(108)

We see then that the notions of g-entropy and g-score coincide with those of γ-entropy and γ-score, just when the weightings of propositions γ are equivalent to weightings of partitions. Next, we extend the notion of inclusivity to our more general weighting functions:

Definition 29 (Inclusive weighting of propositions). A weighting of propositions

γ : P Ω ⟶ R_{\geq 0}

is inclusive if

γ (F) > 0

for all

F \subseteq Ω

.

We shall also consider a slight generalisation of strict propriety (cf., discussion following Definition 6):

Definition 30 (Strictly $X$ -proper γ-score). For

P \subseteq X \subseteq 〈 B 〉

, a γ-score

S_{γ}^{L} : P \times 〈 B 〉 ⟶ [- \infty, \infty]

is strictly

X

-proper, if for all

P \in P

, the restricted function

S_{γ}^{L} (P, \cdot) : X ⟶ [- \infty, \infty]

has a unique global minimum at

B = P

. A γ-score is strictly proper if it is strictly

〈 B 〉

-proper. A γ-score is merely

X

-proper if for some P, this minimum at

B = P

is not the only minimum.

Note that if a γ-score is strictly

X

-proper, then it is strictly

Y

-proper for

P \subseteq Y \subseteq X

. Thus, if it is strictly proper, it is also strictly

B

-proper and strictly

P

-proper.

Proposition 24. The logarithmic γ-score

S_{γ}^{log} (P, B)

is non-negative and convex as a function of

B \in 〈 B 〉

. For inclusive γ, convexity is strict, i.e.,

S_{γ}^{log} (P, λ B_{1} + (1 - λ) B_{2}) < λ S_{γ}^{log} (P, B_{1}) + (1 - λ) S_{γ}^{log} (P, B_{2})

for

λ \in (0, 1)

, unless

B_{1}

and

B_{2}

agree everywhere except where

P (F) = 0

.

Proof: The logarithmic γ-score is non-negative because

B (F), P (F) \in [0, 1]

for all F, so

log B (F) \leq 0, γ (F) P (F) \geq 0

and

γ (F) P (F) log B (F) \leq 0

.

That

S_{γ}^{log} (P, B)

is strictly convex as a function of

〈 B 〉

follows from the strict concavity of

log x

. Take as distinct

B_{1}, B_{2} \in 〈 B 〉

and

λ \in (0, 1)

, and let

B = λ B_{1} + (1 - λ) B_{2}

. Now:

\begin{matrix} γ (F) P (F) log (B (F)) & = γ (F) P (F) log (λ \cdot B_{1} (F) + (1 - λ) B_{2} (F)) \\ \geq γ (F) P (F) (λ log B_{1} (F) + (1 - λ) log B_{2} (F)) \\ = λ γ (F) P (F) log B_{1} (F) + (1 - λ) γ (F) P (F) log B_{2} (F) \end{matrix}

(109)

with equality iff either

P (F) = 0

or

B_{1} (F) = B_{2} (F)

(since in the latter case,

γ (F) P (F) > 0

).

Hence:

\begin{matrix} S_{γ}^{log} (P, B) & = & - \sum_{F \subseteq Ω} γ (F) P (F) log B (F) \\ \leq & λ S_{γ}^{log} (P, B_{1}) + (1 - λ) S_{γ}^{log} (P, B_{2}) \end{matrix}

(110)

with equality if and only if

B_{1}

and

B_{2}

agree everywhere, except, possibly, where

P (F) = 0

. ■

Corollary 9. For inclusive γ and fixed

P \in P

,

arg {inf}_{B \in 〈 B 〉} S_{γ}^{log} (P, B)

is unique. For

B^{'} : = arg {inf}_{B \in 〈 B 〉} S_{γ}^{log} (P, B)

and for all

F \subseteq Ω

, we have

B^{'} (F) > 0

if and only if

P (F) > 0 .

Moreover,

B^{'} (Ω) = 1

and

B^{'} \in B .

Proof: First of all, suppose that there is an

F \subseteq Ω

such that

P (F) > 0

and

B (F) = 0 .

Then,

S_{γ}^{log} (P, B) = \infty .

Furthermore,

S_{γ}^{log} (P, P) < \infty

for all

P \in P .

Hence, for

B^{'} \in arg {inf}_{B \in 〈 B 〉} S_{γ}^{log} (P, B)

, it holds that

P (F) > 0

implies

B^{'} (F) > 0 .

Now, note that for

P \in P

, we have

P (Ω) = 1 - P (\emptyset) = 1

. Furthermore, there are only two partitions,

{Ω}

and

{Ω, \emptyset}

which contain Ω or

\emptyset .

Minimising

- γ (\emptyset) P (\emptyset) log B^{'} (\emptyset) - γ (Ω) P (Ω) log B^{'} (Ω)

, i.e.,

- γ (Ω) log B^{'} (Ω)

, subject to the constraint

B^{'} (\emptyset) + B^{'} (Ω) \leq 1

, is uniquely solved by taking

B^{'} (Ω) = 1

, and hence,

B^{'} (\emptyset) = 0

. Thus, for any

B^{'}

minimising

S_{γ}^{log} (P, \cdot)

, it holds that

B^{'} (\emptyset) = 0

and

B^{'} (Ω) = 1 .

Hence,

B^{'} \in 〈 B 〉

is in

B .

Now, consider a

P \in P

such that there is at least one

\emptyset \subset F \subset Ω

with

P (F) = 0 .

We will show that

B^{'} (F) = 0

for all

B^{'} \in arg {inf}_{B \in 〈 B 〉} S_{γ}^{log} (P, B) .

In the second step, we will show that there is a unique infimum,

B^{'} .

Therefore, suppose that the there is a

B^{'} \in arg {inf}_{B \in 〈 B 〉} S_{γ}^{log} (P, B)

such that

B^{'} (F) > 0 = P (F) .

Assume that

\emptyset \subset H \subset Ω

is for this

B^{'},

with respect to subset inclusion, one such largest subset of

Ω .

Now define

B^{''} : P Ω \to [0, 1]

by

B^{''} (G) : = 0

for all

G \subseteq H

and

B^{''} (F) : = B^{'} (F)

otherwise. From

B^{''} (Ω) = 1, B^{''} (\emptyset) = 0

, we see that

B^{''} \in B

; thus,

S_{γ}^{log} (P, B^{''})

is well defined. Since

P \in P

, we have for all

G \subseteq H

that

P (H) = P (G) = 0 .

Thus,

S_{γ}^{log} (P, B^{'}) = S_{γ}^{log} (P, B^{''}) .

Note that since

B^{'} \in 〈 B 〉

, we have

1 \geq B^{'} (\bar{H}) + B^{'} (H) > B^{'} (\bar{H}) = B^{''} (\bar{H})

. Now, define a function

B^{'''} \in 〈 B 〉

by:

\begin{matrix} B^{'''} (\bar{H}) & : = 1 \\ B^{'''} (F) & : = B^{''} (F) for all F \neq \bar{H} \end{matrix}

Since for all

F \subseteq Ω

,

B^{''} (F) \leq B^{'''} (F)

,

B^{''} (\bar{H}) < B^{'''} (\bar{H}) = 1

and

P (\bar{H}) \cdot γ (\bar{H}) = 1 \cdot γ (\bar{H}) > 0

, we have:

\begin{matrix} S_{γ}^{log} (P, B^{'}) & = S_{γ}^{log} (P, B^{''}) \\ > S_{γ}^{log} (P, B^{'''}) \end{matrix}

We assumed that

B^{'}

minimises

S_{γ}^{log} (P, \cdot)

over

〈 B 〉 .

Hence, we have a contradiction. We have thus proven that for every

B \in arg {inf}_{B \in 〈 B 〉} S_{γ}^{log} (P, B)

,

B (F) = 0

if and only if

P (F) = 0 .

Hence, for all

P \in P

:

\begin{matrix} arg inf_{B \in 〈 B 〉} S_{γ}^{log} (P, B) = arg inf_{{B \in 〈 B 〉 : P (F) = 0 \leftrightarrow B (F) = 0}} S_{γ}^{log} (P, B) \end{matrix}

(111)

By Proposition 24, we can assume that the right-hand side of Equation (111) is a strictly convex optimisation problem on a convex set, which has, hence, a unique infimum. ■

Corollary 10.

S_{γ}^{log}

is strictly

〈 B 〉

-proper if and only if

S_{γ}^{log}

is strictly

B

-proper.

Proof: Assume that

S_{γ}^{log}

is strictly

〈 B 〉

-proper. Then for all

P \in P

, we have

P = arg {inf}_{B \in 〈 B 〉} S_{γ}^{log} (P, B) .

Since

P \subset B \subset 〈 B 〉

, we hence have

P = arg {inf}_{B \in B} S_{γ}^{log} (P, B) .

For the converse, suppose that

S_{γ}^{log}

is strictly

B

-proper, i.e., for all

P \in P

we have

P = arg {inf}_{B \in B} S_{γ}^{log} (P, B) .

Note that strict propriety implies that γ is inclusive. Corollary 9 implies then that no

B \in 〈 B 〉 ∖ B

can minimise

S_{γ}^{log} (P, B) .

■

Definition 31 (Symmetric weighting of propositions). A weighting of propositions, γ, is symmetric, if and only if whenever

F^{'}

can be obtained from F by permuting the

ω_{i}

in

F,

then

γ (F^{'}) = γ (F) .

Note that γ is symmetric, if and only if

| F | = | F^{'} |

entails

γ (F) = γ (F^{'})

. For symmetric γ, we will sometimes write

γ (n)

for

γ (F),

if

| F | = n .

Proposition 25. For inclusive and symmetric γ,

S_{γ}^{log}

is strictly

P

-proper.

Proof: We have that for all

ω \in Ω

,

| {F \subseteq Ω : | F | = n, ω \in F} | = | {G \subseteq \bar{{ω}} : | G | = n - 1} | = (\binom{| Ω | - 1}{n - 1}) .

We recall from Example 1 that with

ν_{n} : = (\binom{| Ω | - 1}{n - 1})

, we have:

\begin{matrix} \sum_{\begin{matrix} F \subseteq Ω \\ | F | = n \end{matrix}} P (F) = ν_{n} \cdot \sum_{ω \in Ω} P (ω) = ν_{n} \end{matrix}

(112)

Multiplying the objective function in an optimisation problem by some positive constant does not change where optima obtain. Thus:

\begin{matrix} arg inf_{Q \in P} - \sum_{\begin{matrix} F \subseteq Ω \\ | F | = n \end{matrix}} γ (n) P (F) log Q (F) & = arg inf_{Q \in P} - \sum_{\begin{matrix} F \subseteq Ω \\ | F | = n \end{matrix}} \frac{P (F)}{ν_{n}} log Q (F) \\ = arg inf_{Q \in P} - \sum_{\begin{matrix} F \subseteq Ω \\ | F | = n \end{matrix}} \frac{P (F)}{ν_{n}} log (\frac{Q (F)}{ν_{n}} \cdot ν_{n}) \\ = arg inf_{Q \in P} - \sum_{\begin{matrix} F \subseteq Ω \\ | F | = n \end{matrix}} \frac{P (F)}{ν_{n}} (log \frac{Q (F)}{ν_{n}} + log ν_{n}) \\ = arg inf_{Q \in P} - \sum_{\begin{matrix} F \subseteq Ω \\ | F | = n \end{matrix}} \frac{P (F)}{ν_{n}} log \frac{Q (F)}{ν_{n}} \end{matrix}

(113)

Now, note that since

Q, P \in P

, we have that

\sum_{\begin{matrix} F \subseteq Ω \\ | F | = n \end{matrix}} P (F) = ν_{n} = \sum_{\begin{matrix} F \subseteq Ω \\ | F | = n \end{matrix}} Q (F)

, and hence,

\sum_{\begin{matrix} F \subseteq Ω \\ | F | = n \end{matrix}} \frac{P (F)}{ν_{n}} = 1 = \sum_{\begin{matrix} F \subseteq Ω \\ | F | = n \end{matrix}} \frac{Q (F)}{ν_{n}} .

Put

Ψ : = {F \subseteq Ω : | F | = n}

, and let us understand

\frac{P (\cdot)}{ν_{n}}, \frac{Q (\cdot)}{ν_{n}}

as functions

\frac{P (\cdot)}{ν_{n}}, \frac{Q (\cdot)}{ν_{n}} : Ψ ⟶ [0, 1]

with

\sum_{G \in Ψ} \frac{P (G)}{ν_{n}} = 1 = \sum_{G \in Ψ} \frac{Q (G)}{ν_{n}} .

It follows that

\frac{P (\cdot)}{ν_{n}}, \frac{Q (\cdot)}{ν_{n}}

are formally probability functions on Ψ, satisfying certain further conditions which are not relevant in the following. Let

P_{Ψ}

denote the set of probability functions on Ψ, and let

P_{Ω} \subseteq P_{Ψ}

be the set of probability functions of the above form,

\frac{P (\cdot)}{ν_{n}}, \frac{Q (\cdot)}{ν_{n}}

, where

P, Q \in P

.

Consider a scoring rule,

S (P, B)

, in the standard sense, i.e., expectations over losses are taken with respect to members x of some set X. (At the beginning of Section 2.4, we considered states

ω \in Ω

.) Let

X

denote the set of probability functions on the set X. Suppose that S is strictly

X

-proper. Then, for any fixed set

Y \subseteq X

, it holds that

arg {inf}_{B \in Y} S (P, B) = P

for all

P \in Y

. It is well known that the standard logarithmic scoring rule on a given universal set is strictly

X

-proper. Taking

X = Ψ

,

X = P_{Ψ}

and

Y = P_{Ω}

, we obtain for all

\frac{P (\cdot)}{ν_{n}} \in P_{Ω}

that:

\begin{matrix} \frac{P (\cdot)}{ν_{n}} & = arg inf_{\frac{Q (\cdot)}{ν_{n}} \in P_{Ω}} - \sum_{G \in Ψ} \frac{P (G)}{ν_{n}} log \frac{Q (G)}{ν_{n}} \\ = arg inf_{Q \in P} - \sum_{G \in Ψ} \frac{P (G)}{ν_{n}} log \frac{Q (G)}{ν_{n}} \end{matrix}

(114)

We thus find:

\begin{matrix} P = arg inf_{Q \in P} - \sum_{\begin{matrix} F \subseteq Ω \\ | F | = n \end{matrix}} γ (n) P (F) log Q (F) \end{matrix}

(115)

Since P minimises Equation (115) for every n, it also minimises the sum over all

n,

and hence:

\begin{matrix} P = arg inf_{Q \in P} - \sum_{1 \leq n \leq | Ω |} \sum_{\begin{matrix} F \subseteq Ω \\ | F | = n \end{matrix}} γ (F) P (F) log Q (F) = arg inf_{Q \in P} S_{g} (P, Q) \end{matrix}

(116)

■

Lemma 10. If γ is an inclusive weighting of propositions that is equivalent to a weighting of partitions, then

S_{γ}^{log}

is strictly

B

-proper.

Proof: While this result follows directly from Corollary 3, we shall give another proof that will provide the groundwork for the proof of the next result, Theorem 10.

First, we shall fix a

P \in P

and observe that the first part of Corollary 9 up to and including Equation (111) still holds with

B

substituted for

〈 B 〉 .

We shall thus concentrate on propositions

F \subset Ω

with

P (F) > 0,

since it follows from Corollary 9 that whenever

P (F) = 0,

we must have

B (F) = 0

and

B (Ω) = 1,

if

S_{γ}^{log} (P, B)

is to be minimised. We thus let

P^{+} Ω : = {\emptyset \subset F \subset Ω : P (F) > 0}

and:

\begin{matrix} B^{+} : = {B \in B : & 0 < B (F) \leq 1 for all F \in P^{+} Ω, \\ B (Ω) = 1 and B (F) = 0 for all other F \in P Ω ∖ P^{+} Ω} \end{matrix}

In the following optimisation problem, we will thus only consider

B (F)

to be a variable if

F \in P^{+} Ω .

We now investigate:

\begin{matrix} arg inf_{B \in B^{+}} S_{γ}^{log} (P, B) \end{matrix}

(117)

To this end, we shall first find for all fixed

t \geq 2

:

\begin{matrix} arg inf_{\begin{matrix} B \in B^{+} \\ B (F) \geq \frac{P (F)}{t} for all F \in P^{+} Ω \end{matrix}} S_{γ}^{log} (P, B) \end{matrix}

(118)

Making this restriction on

B (F)

allows us to evade any problems that arise from taking the derivative of

log B (F)

at

B (F) = 0

, which inevitably arise when we directly apply Karush-Kuhn-Tucker techniques to Equation (117).

With

Π^{'} : = {π \in Π : π \neq {Ω}, π \neq {Ω, \emptyset}}

, we thus need to solve the following optimisation problem:

\begin{matrix} minimize & S_{γ}^{log} (P, B) \\ subject to & B (F) \geq \frac{P (F)}{t} > 0 for t \geq 2 and all F \in P^{+} Ω \\ \sum_{\begin{matrix} G \in π \\ G \in P^{+} Ω \end{matrix}} B (G) \leq 1 for all π \in Π^{'} \\ B (Ω) = 1 and B (F) = 0 for all other F \in P Ω ∖ P^{+} Ω \end{matrix}

Note that the first and second constraints imply that

0 < B (F) \leq 1

for all

F \in P^{+} Ω .

Observe that for

π \in Π^{'}

with

G \in π,

| G | \geq 2

and

P (G) = 0,

there is another partition in

Π^{'}

that subdivides G and agrees with π everywhere else. These two partitions,

π, π^{'}

, will give rise to the exact same constraint on the

F \in P^{+} Ω .

Including the same constraint multiple times does not affect the applicability of the Karush-Kuhn-Tucker techniques. Thus, the solutions of this optimisation problem are the solutions of Equation (118).

With Karush-Kuhn-Tucker techniques in mind, we shall define the following function for

B \in B^{+}

:

\begin{matrix} L a g (B) = & \overset{S_{γ}^{log} (P, B)}{\overset{︷}{- \sum_{\begin{matrix} F \subseteq Ω \end{matrix}} γ (F) P (F) log B (F)}} + \overset{constraints}{\overset{︷}{\sum_{π \in Π^{'}} λ_{π} \cdot (- 1 + \sum_{\begin{matrix} G \in π \\ G \in P^{+} Ω \end{matrix}} B (G)) + \sum_{F \in P^{+} Ω} μ_{F} (\frac{P (F)}{t} - B_{F})}} \\ = & - \sum_{\begin{matrix} F \in P^{+} Ω \end{matrix}} γ (F) P (F) log B (F) + \sum_{π \in Π^{'}} λ_{π} \cdot (- 1 + \sum_{\begin{matrix} G \in π \\ G \in P^{+} Ω \end{matrix}} B (G)) + \sum_{F \in P^{+} Ω} μ_{F} (\frac{P (F)}{t} - B_{F}) \end{matrix}

(119)

First, recall that

B (F) = 0

iff

P (F) = 0

; thus, the first sum is always finite here. Since

B (F) > 0

for all

F \in P^{+} Ω

, we can take derivatives with respect to the variables

B (F) .

Recalling that

γ (F) > 0

for all

F \subseteq Ω

, we now find:

\begin{matrix} \frac{\partial}{\partial B (F)} L a g (B) & = - γ (F) \frac{P (F)}{B (F)} + \sum_{\begin{matrix} π \in Π^{'} \\ F \in π \end{matrix}} λ_{π} - μ_{F} for all F \in P^{+} Ω \end{matrix}

(120)

Equating these derivatives with zero, we obtain:

\begin{matrix} γ (F) \frac{P (F)}{B (F)} = \sum_{\begin{matrix} π \in Π^{'} \\ F \in π \end{matrix}} λ_{π} - μ_{F} for all F \in P^{+} Ω \end{matrix}

(121)

γ is, by our assumption, equivalent to a weighting of partitions,

γ (F) = \sum_{\begin{matrix} π \in Π^{'} \\ F \in π \end{matrix}} g (π) .

Letting

λ_{π} : = g (π), μ_{F} : = 0

and

B (F) = P (F)

for

F \in P^{+} Ω

solves the set of equations in Equation (121). For

B (F) = P (F)

when

F \in P^{+} Ω

, we trivially have

\sum_{\begin{matrix} G \in π \\ G \in P^{+} Ω \end{matrix}} B (G) = 1

, and hence,

λ_{π} (\sum_{\begin{matrix} G \in π \\ G \in P^{+} Ω \end{matrix}} B (G) - 1) = 0 .

Furthermore,

μ_{F} (\frac{P (F)}{t} - B (F)) = 0

for

F \in P^{+} Ω .

Thus, by the Karush-Kuhn-Tucker Theorem,

B (F) = P (F)

for

F \in P^{+} Ω

is a critical point of the optimisation problem in Equation (118) for all t and all

P \in P

, since all constraints are linear.

Note that the constraints,

B (Ω) = 1,

B (\emptyset) = 0

and

0 \leq \sum_{F \in π} B (F) \leq 1

for

π \in Π^{'}

, ensure that B is a member of

B

, regardless of the actual value of

B (F)

for

\emptyset \neq F \neq Ω .

Thus,

B \in B^{+}

, if and only if

B (Ω) = 1,

B (\emptyset) = 0,

0 \leq \sum_{F \in π} B (F) \leq 1

for

π \in Π^{'}

and

B (F) = 0

iff

P (F) = 0 .

Thus,

B^{+}

is convex. It follows that

B_{t}^{+} : = {B \in B^{+} : B (F) \geq \frac{P (F)}{t} for all F \in P^{+} Ω}

is convex for all

t \geq 2 .

Since

B^{+}

is the feasible region of Equation (118), the critical point of the convex minimisation problem is the unique minimum.

Letting

t > 0

tend to zero, we see that

B (F) = P (F)

for

F \in P^{+} Ω

is the unique solution of Equation (117).

Thus, any function

B \in B

minimizing

S_{γ}^{log} (P, \cdot)

has to agree with P on the

F \in P^{+} Ω .

By our introductory remarks, it has to hold that

B (Ω) = 1

and

B (G) = 0

for all other

G \subseteq Ω .

Thus,

B (F) = P (F)

for all

F \subseteq Ω .

We have thus shown that

S_{γ}^{log}

is strictly proper. ■

Theorem 10. For inclusive γ with

γ (Ω) \geq γ (\emptyset)

,

S_{γ}^{log}

is strictly proper if and only if γ is equivalent to a weighting of partitions.

Proof: From Lemma 10, we have that the existence of the

λ_{π}

ensures propriety.

For the converse, suppose that

S_{γ}^{log}

is strictly

B

-proper (equivalently, by Corollary 10, strictly proper). By our assumptions, we have

γ (Ω) \geq γ (\emptyset) > 0 .

We can thus put

g ({Ω, \emptyset}) : = γ (\emptyset)

and

g (Ω) : = γ (Ω) - γ (\emptyset) .

Then

γ (Ω) = g ({Ω, \emptyset}) + g (Ω) > 0

and

γ (\emptyset) = g ({Ω, \emptyset}) > 0 .

Observe that for all

P \in P

, for any infimum of the minimisation problem

arg {inf}_{B \in B} S_{γ}^{log} (P, B)

, there have to exist multipliers,

λ_{π} \geq 0

and

μ_{F} \geq 0

, that solve Equation (121) and

μ_{F} (\frac{P (F)}{t} - B (F)) = 0 .

Now, fix a

P \in P

such that

P (F) > 0

for all

\emptyset \subset F \subseteq Ω .

If

S_{γ}^{log}

is strictly

B

-proper, then the minimisation problem

arg {inf}_{B \in B} S_{γ}^{log} (P, B)

for this P has to be solved uniquely by

B = P

. Thus, strict

B

-propriety implies that:

\begin{matrix} 0 < γ (F) = \sum_{\begin{matrix} π \in Π \\ F \in π \end{matrix}} λ_{π} - μ_{F} for all \emptyset \subset F \subset Ω and μ_{F} \frac{1 - t}{t} P (F) = 0 for all F \in P^{+} Ω \end{matrix}

The latter conditions can only be satisfied if all

μ_{F}

vanish. Hence, we obtain the following conditions, which necessarily have to hold if

S_{γ}^{log} (P, \cdot)

is to be uniquely minimised by

B = P

:

\begin{matrix} 0 < γ (F) = \sum_{\begin{matrix} π \in Π \\ F \in π \end{matrix}} λ_{π} for all \emptyset \subset F \subset Ω \end{matrix}

Since all the constraints are inequalities, the corresponding multipliers,

λ_{π}

, have to be greater than or equal to zero.

Thus, strict propriety of

S_{γ}^{log}

implies the existence of these

λ_{π} \geq 0 .

This, in turn, implies that γ is equivalent to a weighting of partitions.

Note that for the purposes of this proof, we do not need to investigate what happens if

P \in P

is such that there exists a proposition,

\emptyset \subset F \subseteq Ω

, with

P (F) = 0 .

■

Note that

γ (Ω) \geq γ (\emptyset)

is not a real restriction. The first component in

S_{γ}^{log} (\cdot, \cdot)

is a probability function in the above proof. Thus,

P (\emptyset) = 0 .

Hence

γ (\emptyset) P (\emptyset) log B (\emptyset) = 0,

regardless of

γ (\emptyset) .

The particular value of

γ (\emptyset)

is thus irrelevant for strict propriety. Therefore, setting

γ (\emptyset) = γ (Ω)

fulfills the conditions of the Theorem, but does not change the value of the γ-score. (The condition is required, because if

γ (\emptyset) > γ (Ω)

, then, while

S_{γ}^{log}

may be strictly proper, it cannot be a weighting of partitions.)

The importance of the condition in Theorem 10 that γ should be equivalent to a weighting of partitions is highlighted in the following:

Example 4. Let

Ω = {ω_{1}, ω_{2}, ω_{3}}

and

γ (1) = γ (3) = 1,

and

γ (2) = 10 .

Now, consider

B \in B

, defined as

B (\emptyset) : = 0,

B (F) : = 0.2

if

| F | = 1

,

B (F) : = 0.8

if

| F | = 2

, and

B (Ω) : = 1 .

Then:

\begin{matrix} S_{γ}^{log} (P_{=}, P_{=}) = & - \sum_{ω \in Ω} P_{=} (ω) log P_{=} (ω) \\ - 10 \cdot (\sum_{\begin{matrix} F \subseteq Ω \\ | F | = 2 \end{matrix}} P_{=} (F) log P_{=} (F)) - P_{=} (Ω) \cdot log P_{=} (Ω) \\ = & - 3 \cdot \frac{1}{3} log \frac{1}{3} - 3 \cdot 10 \cdot \frac{2}{3} log \frac{2}{3} \approx 9.2079 \end{matrix}

(122)

\begin{matrix} S_{γ}^{log} (P_{=}, B) = & - \sum_{ω \in Ω} P_{=} (ω) log B (ω) \\ - 10 \cdot (\sum_{\begin{matrix} F \subseteq Ω \\ | F | = 2 \end{matrix}} P_{=} (F) log B (F)) - P_{=} (Ω) \cdot log B (Ω) \\ = & - 3 \cdot \frac{1}{3} log 0.2 - 3 \cdot 10 \cdot \frac{2}{3} log 0.8 \approx 6.0723 \end{matrix}

(123)

Thus,

S_{γ}^{log} (P_{=}, B) < S_{γ}^{log} (P_{=}, P_{=}) .

Hence,

S_{γ}^{log}

is not strictly

B

-proper, even though γ is inclusive and symmetric. Compare this with Proposition 25, where we proved that positivity and symmetry γ were enough to ensure that

S_{γ}^{log}

is strictly

P

-proper.

Note that strict propriety is exactly what is needed in order to derive Theorem 2, as is apparent from its proof (see, also, the discussion at the start of Section 2.5). By Theorem 10, only a weighting of propositions that is equivalent to a weighting of partitions can be strictly proper (up to an inconsequential value for

γ (\emptyset)

); hence, the generalisation of standard entropy and score in the main text, which focusses on weightings of partitions, is essentially the right one for our purposes.

Indeed, adopting a non-strictly proper scoring rule

S_{γ}^{log}

may result in Theorem 2 not holding:

Proposition 26. If

S_{γ}^{log}

is not strictly

X

-proper (with

P \subseteq X

), then the worst case γ-expected loss minimisation and γ-entropy maximisation are, in general, achieved by different functions.

Proof: If

S_{g}

is not merely proper, then there is a

P^{'} \in P

such that

S_{γ}^{log} (P^{'}, \cdot)

is not minimised over

X

by

P^{'} .

In particular, there is some

Q \in X

such that

S_{γ}^{log} (P^{'}, Q) < S_{γ}^{log} (P^{'}, P^{'}) .

Suppose that

E = {P^{'}} .

Trivially:

\begin{matrix} arg sup_{P \in E} S_{γ}^{log} (P, P) & = P^{'} \end{matrix}

By construction:

\begin{matrix} arg inf_{Q \in X} sup_{P \in E} S_{γ}^{log} (P, Q) = & arg inf_{Q \in X} sup_{P \in {P^{'}}} S_{γ}^{log} (P, Q) \\ = & arg inf_{Q \in X} S_{γ}^{log} (P^{'}, Q) \\ ∌ & P^{'} \end{matrix}

(124)

Thus, the γ-entropy maximiser in

E

(here,

P^{'}

) is not a function in

X

that minimises worst-case γ-expected loss.

Finally, consider the case in which

S_{γ}^{log}

is merely proper, i.e., there exists a

P^{'} \in P

such that

S_{γ}^{log} (P^{'}, \cdot)

is minimised by both

P^{'}

and members of a non-empty subset,

Q \subseteq B ∖ {P^{'}} .

Then, with

E = {P^{'}}

:

\begin{matrix} arg inf_{Q \in X} sup_{P \in E} S_{γ}^{log} (P, Q) = & arg inf_{Q \in X} sup_{P \in {P^{'}}} S_{γ}^{log} (P, Q) = arg inf_{Q \in X} S_{γ}^{log} (P^{'}, Q) = Q \cup {P^{'}} \end{matrix}

Thus there is some function other than the γ-entropy maximiser that also minimises the γ-score. ■

References

Williamson, J. In Defence of Objective Bayesianism; Oxford University Press: Oxford, UK, 2010. [Google Scholar]
Jaynes, E.T. Information theory and statistical mechanics. Phys. Rev. 1957, 106, 620–630. [Google Scholar] [CrossRef]
Grünwald, P.; Dawid, A.P. Game theory, maximum entropy, minimum discrepancy, and robust Bayesian decision theory. Ann. Stat. 2004, 32, 1367–1433. [Google Scholar]
Topsøe, F. Information theoretical optimization techniques. Kybernetika 1979, 15, 1–27. [Google Scholar]
Ramsey, F.P. Truth and Probability. In Studies in Subjective Probability; Kyburg, H.E., Smokler, H.E., Robert, E., Eds.; Krieger Publishing Company: Huntington, New York, NY, USA, 1926; pp. 23–52. [Google Scholar]
Shannon, C. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423, 623–656, Reprinted with corrections. Available online: http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf (accessed on 1 June 2013). [Google Scholar] [CrossRef]
Aczél, J.; Daróczy, Z. On Measures of Information and Their Characterizations; Academic Press: New York, NY, USA, 1975. [Google Scholar]
Dawid, A.P. Probability Forecasting. In Encyclopedia of Statistical Sciences; Kotz, S., Johnson, N.L., Eds.; Wiley: New York, USA, 1986; Volume 7, pp. 210–218. [Google Scholar]
Joyce, J.M. Accuracy and Coherence: Prospects for an Alethic Epistemology of Partial Belief. In Degrees of Belief; Huber, F., Schmidt-Petri, C., Eds.; Synthese Library 342; Springer: Dordrecht, The Netherlands, 2009. [Google Scholar]
Pettigrew, R. Epistemic Utility Arguments for Probabilism. In The Stanford Encyclopedia of Philosophy; Zalta, E.N., Ed.; The Metaphysics Research Lab, Center for the Study of Language and Information, Stanford University: Stanford, CA, USA, 2011. [Google Scholar]
Predd, J.; Seiringer, R.; Lieb, E.; Osherson, D.; Poor, H.; Kulkarni, S. Probabilistic coherence and proper scoring rules. IEEE Trans. Inf. Theory 2009, 55, 4786–4792. [Google Scholar] [CrossRef]
McCarthy, J. Measures of the value of information. Proc. Natl. Acad. Sci. USA 1956, 42, 654–655. [Google Scholar] [CrossRef] [PubMed]
Shuford, E.H.; Albert, A.; Massengill, H.E. Admissible probability measurement procedures. Psychometrika 1966, 31, 125–145. [Google Scholar] [CrossRef] [PubMed]
Aczel, J.; Pfanzagl, J. Remarks on the measurement of subjective probability and information. Metrika 1967, 11, 91–105. [Google Scholar] [CrossRef]
Savage, L.J. Elicitation of personal probabilities and expectations. J. Am. Stat. Assoc. 1971, 66, 783–801. [Google Scholar] [CrossRef]
Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley: New York, NY, USA, 1991. [Google Scholar]
Ricceri, B. Recent Advances in Minimax Theory and Applications. In Pareto Optimality, Game Theory And Equilibria; Chinchuluun, A., Pardalos, P., Migdalas, A., Pitsoulis, L., Eds.; Optimization and Its Applications; Springer: New York, USA, 2008; Volume 17, pp. 23–52. [Google Scholar]
König, H. A general minimax theorem based on connectedness. Arch. Math. 1992, 59, 55–64. [Google Scholar] [CrossRef]
Keynes, J.M. A Treatise on Probability; Macmillan: London, UK, 1948. [Google Scholar]
Williamson, J. From Bayesian epistemology to inductive logic. J. Appl. Logic 2013, in press. [Google Scholar] [CrossRef]
Paris, J.B. The Uncertain Reasoner’s Companion; Cambridge University Press: Cambridge, UK, 1994. [Google Scholar]
Seidenfeld, T. Entropy and uncertainty. Philos. Sci. 1986, 53, 467–491. [Google Scholar] [CrossRef]
Williamson, J. An Objective Bayesian Account of Confirmation. In Explanation, Prediction, and Confirmation: New Trends and Old Ones Reconsidered; Dieks, D., Gonzalez, W.J., Hartmann, S., Uebel, T., Weber, M., Eds.; Springer: Dordrecht, The Netherlands, 2011; pp. 53–81. [Google Scholar]
Kyburg, H.E., Jr. Are there degrees of belief? J. Appl. Logic 2003, 1, 139–149. [Google Scholar] [CrossRef]
Smets, P.; Kennes, R. The transferable belief model. Artif. Intell. 1994, 66, 191–234. [Google Scholar] [CrossRef]
Keynes, J.M. The general theory of employment. Q. J. Econ. 1937, 51, 209–223. [Google Scholar] [CrossRef]
Csiszàr, I. Axiomatic characterizations of information measures. Entropy 2008, 10, 261–273. [Google Scholar] [CrossRef]
Paris, J.B. Common sense and maximum entropy. Synthese 1998, 117, 75–93. [Google Scholar] [CrossRef]
Paris, J.B.; Vencovská, A. A note on the inevitability of maximum entropy. Int. J. Approx. Reason. 1990, 4, 183–223. [Google Scholar] [CrossRef]
Paris, J.B.; Vencovská, A. In defense of the maximum entropy inference process. Int. J. Approx. Reason. 1997, 17, 77–103. [Google Scholar] [CrossRef]

© 2013 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

Share and Cite

MDPI and ACS Style

Landes, J.; Williamson, J. Objective Bayesianism and the Maximum Entropy Principle. Entropy 2013, 15, 3528-3591. https://doi.org/10.3390/e15093528

AMA Style

Landes J, Williamson J. Objective Bayesianism and the Maximum Entropy Principle. Entropy. 2013; 15(9):3528-3591. https://doi.org/10.3390/e15093528

Chicago/Turabian Style

Landes, Jürgen, and Jon Williamson. 2013. "Objective Bayesianism and the Maximum Entropy Principle" Entropy 15, no. 9: 3528-3591. https://doi.org/10.3390/e15093528

APA Style

Landes, J., & Williamson, J. (2013). Objective Bayesianism and the Maximum Entropy Principle. Entropy, 15(9), 3528-3591. https://doi.org/10.3390/e15093528

Article Menu

Objective Bayesianism and the Maximum Entropy Principle

Abstract

1. Introduction

2. Belief over Propositions

2.1. Normalisation

2.2. Entropy

2.3. Loss

2.4. Score

2.5. Minimising the Worst-Case Logarithmic g-Score

3. Belief over Sentences

3.1. Normalisation

3.2. Loss

3.3. Score, Entropy and Their Connection

4. Relationship to Standard Entropy Maximisation

5. Discussion

5.1. Summary

5.2. Conditionalisation, Conditional Probabilities and Bayes’ Theorem

5.3. Imprecise Probability

5.4. A Non-Pragmatic Justification

5.5. Questions for Further Research

Acknowledgements

Conflicts of Interest

Appendix

A. Entropy of Belief Functions

B. Properties of g-Entropy Maximisation

B.1. Preserving the Equivocator

B.2. Updating

B.3. Paris-Vencovská Properties

B.4. The Topology of g-Entropy

C. Level of Generalisation

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI