#### 2.1. Uncertainty Reduction by Eliminating Options

In its most basic form, decision-making can be formalized as the process of looking for a decision $x\in \mathsf{\Omega}$ in a discrete set of options $\mathsf{\Omega}=\{{x}_{1},\cdots ,{x}_{N}\}$. We say that a decision $x\in \mathsf{\Omega}$ is certain if repeated queries of the decision-maker will result in the same decision, and it is uncertain if repeated queries can result in different decisions. Uncertainty reduction then corresponds to reducing the number of uncertain options. Hence, a decision-making process that transitions from a space $\mathsf{\Omega}$ of options to a strictly smaller subset $A\subsetneq \mathsf{\Omega}$ reduces the number of uncertain options from $N=\left|\mathsf{\Omega}\right|$ to ${N}_{A}:=\left|A\right|<N$, with the possible goal to eventually find a single certain decision ${x}^{\ast}$. Such a process is generally costly: the more uncertainty is reduced, the more resources it costs (Figure 1). The explicit mapping between uncertainty reduction and resource cost depends on the details of the underlying process and on which explicit quantity is taken as the resource. For example, if the resource is given by time (or any monotone function of time), then a search algorithm that eliminates options sequentially until the target value is found (linear search) is less cost efficient than an algorithm that takes a sorted list and in each step removes half of the options by comparing the midpoint to the target (logarithmic search). Abstractly, any real-valued function C on the power set of $\mathsf{\Omega}$ that satisfies $C\left({A}^{\prime}\right)<C\left(A\right)$ whenever $A\subsetneq {A}^{\prime}$ might be used as a cost function in the sense that $C\left(A\right)$ quantifies the expenses of reducing the uncertainty from $\mathsf{\Omega}$ to $A\subset \mathsf{\Omega}$.
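To make the contrast between linear and logarithmic search cost concrete, here is a minimal sketch (our own illustration, not from the paper) that takes the number of comparisons as the resource:

```python
def linear_search_steps(options, target):
    """Eliminate options one at a time until the target is found."""
    steps = 0
    for x in options:
        steps += 1
        if x == target:
            return steps
    raise ValueError("target not in options")

def binary_search_steps(sorted_options, target):
    """Halve the set of remaining options with every comparison."""
    lo, hi, steps = 0, len(sorted_options) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        steps += 1
        if sorted_options[mid] == target:
            return steps
        elif sorted_options[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    raise ValueError("target not in options")

options = list(range(1024))                   # N = 1024 sorted options
linear = linear_search_steps(options, 1000)   # cost grows linearly with N
binary = binary_search_steps(options, 1000)   # cost at most about log2(N)+1
assert binary < linear
```

Both algorithms fully reduce the uncertainty to a single certain option, but they traverse very different cost paths through the subsets of $\mathsf{\Omega}$.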

In utility theory, decision-making is modeled as an optimization process that maximizes a so-called utility function $U:\mathsf{\Omega}\to \mathbb{R}$ (which can itself be an expected utility with respect to a probabilistic model of the environment, in the sense of von Neumann and Morgenstern [1]). A decision-maker that is optimizing a given utility function U obtains a utility of $\frac{1}{{N}_{A}}{\sum}_{x\in A}U\left(x\right)\ge \frac{1}{N}{\sum}_{x\in \mathsf{\Omega}}U\left(x\right)$ on average after reducing the number of uncertain options from N to ${N}_{A}<N$ (see Figure 2). A decision-maker that completely reduces uncertainty by finding the optimum ${x}^{\ast}={\mathrm{argmax}}_{x\in \mathsf{\Omega}}U\left(x\right)$ is called rational (without loss of generality, we can assume that ${x}^{\ast}$ is unique, by redefining $\mathsf{\Omega}$ in the case when it is not). Since uncertainty reduction generally comes with a cost, a utility optimizing decision-maker with limited resources, correspondingly called bounded rational (see Section 3), will in contrast obtain only uncertain decisions from a subset $A\subset \mathsf{\Omega}$. Such decision-makers seek satisfactory rather than optimal solutions, for example by taking the first option that satisfies a minimal utility requirement, which Herbert A. Simon calls a satisficing solution [2].

Summarizing, we conclude that a decision-making process with decision space $\mathsf{\Omega}$ that successively eliminates options can be represented by a mapping $\varphi $ between subsets of $\mathsf{\Omega}$, together with a cost function C that quantifies the total expenses of arriving at a given subset, such that

$$\varphi \left({A}^{\prime}\right)=A\subsetneq {A}^{\prime}, \qquad (1)$$

$$C\left({A}^{\prime}\right)<C\left(A\right)\quad \text{whenever}\quad A\subsetneq {A}^{\prime}. \qquad (2)$$

For example, a rational decision-maker can afford $C\left(\left\{{x}^{\ast}\right\}\right)$, whereas a decision-maker with limited resources can typically only afford uncertainty reduction with cost $C\left(A\right)<C\left(\left\{{x}^{\ast}\right\}\right)$.

From a probabilistic perspective, a decision-making process as described above is a transition from a uniform probability distribution over N options to a uniform probability distribution over ${N}^{\prime}<N$ options, which converges to the Dirac measure ${\delta}_{{x}^{\ast}}$ centered at ${x}^{\ast}$ in the fully rational limit. From this point of view, the restriction to uniform distributions is artificial. A decision-maker that is uncertain about the optimal decision ${x}^{\ast}$ might indeed have a bias towards a subset A without completely excluding other options (the ones in ${A}^{c}=\mathsf{\Omega}\backslash A$), so that the behavior must properly be described by a probability distribution $p\in {\mathbb{P}}_{\mathsf{\Omega}}$. Therefore, in the following section, we extend Equations (1) and (2) to transitions between probability distributions. In particular, we must replace the power set of $\mathsf{\Omega}$ by the space of probability distributions on $\mathsf{\Omega}$, denoted by ${\mathbb{P}}_{\mathsf{\Omega}}$.

#### 2.2. Probabilistic Decision-Making

Let $\mathsf{\Omega}$ be a discrete decision space of $N=\left|\mathsf{\Omega}\right|<\infty $ options, so that ${\mathbb{P}}_{\mathsf{\Omega}}$ consists of discrete distributions p, often represented by probability vectors $p=({p}_{1},\cdots ,{p}_{N})$. However, many of the concepts presented in this and the following section can be generalized to the continuous case [27,28].

Intuitively, the uncertainty contained in a distribution $p\in {\mathbb{P}}_{\mathsf{\Omega}}$ is related to the relative inequality of its entries: the more similar its entries are, the higher the uncertainty. This means that uncertainty is increased by moving some probability weight from a more likely option to a less likely option. It turns out that this simple idea leads to a concept widely known as majorization [27,29,30,31,32,33], which has roots in the economic literature of the early 20th century [26,34,35], where it was introduced to describe income inequality, later known as the Pigou–Dalton Principle of Transfers. Here, the operation of moving weight from a more likely to a less likely option corresponds to the transfer of income from one individual of a population to a relatively poorer individual (also known as a Robin Hood operation [30]). Since a decision-making process can be viewed as a sequence of uncertainty reducing computations, we call the inverse of such a Pigou–Dalton transfer an elementary computation.

**Definition** **1** (Elementary computation)**.** A transformation ${T}_{\epsilon}$ on ${\mathbb{P}}_{\mathsf{\Omega}}$ of the form

$${\left({T}_{\epsilon}p\right)}_{k}=\begin{cases}{p}_{m}+\epsilon & \text{if}\ k=m,\\ {p}_{n}-\epsilon & \text{if}\ k=n,\\ {p}_{k} & \text{otherwise,}\end{cases} \qquad (3)$$

where $m,n$ are such that ${p}_{m}\le {p}_{n}$, and $0<\epsilon \le \frac{{p}_{n}-{p}_{m}}{2}$, is called a Pigou–Dalton transfer (see Figure 3). We call its inverse ${T}_{\epsilon}^{-1}$ an elementary computation.

Since making two probability values more similar or more dissimilar are the only two possibilities to minimally transform a probability distribution, elementary computations are the most basic principle of how uncertainty is reduced. Hence, we conclude that a distribution ${p}^{\prime}$ has more uncertainty than a distribution p if and only if p can be obtained from ${p}^{\prime}$ by finitely many elementary computations (and permutations, which are not considered an elementary computation due to the choice of $\epsilon $).
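A Pigou–Dalton transfer and its inverse can be sketched in a few lines (the function names are our own, not from the paper):

```python
# A Pigou-Dalton transfer T_eps moves weight eps from the more likely
# option n to the less likely option m, increasing uncertainty; its
# inverse is the "elementary computation" of Definition 1.

def pigou_dalton(p, m, n, eps):
    """Apply T_eps: p_m -> p_m + eps, p_n -> p_n - eps (requires p_m <= p_n)."""
    assert p[m] <= p[n] and 0 < eps <= (p[n] - p[m]) / 2
    q = list(p)
    q[m] += eps
    q[n] -= eps
    return q

def elementary_computation(p, m, n, eps):
    """Apply the inverse transfer: move weight eps back from m to n."""
    q = list(p)
    q[m] -= eps
    q[n] += eps
    return q

p = [0.2, 0.5, 0.3]
p_spread = pigou_dalton(p, 0, 1, 0.1)              # approx. [0.3, 0.4, 0.3]
p_back = elementary_computation(p_spread, 0, 1, 0.1)  # recovers p
```

The constraint $\epsilon \le ({p}_{n}-{p}_{m})/2$ guarantees that the transfer never overshoots, i.e., the receiving option never becomes more likely than the donating one.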

**Definition** **2** (Uncertainty)**.** We say that ${p}^{\prime}\in {\mathbb{P}}_{\mathsf{\Omega}}$ contains more uncertainty than $p\in {\mathbb{P}}_{\mathsf{\Omega}}$, denoted by

$${p}^{\prime}\prec p, \qquad (4)$$

if and only if p can be obtained from ${p}^{\prime}$ by a finite number of elementary computations and permutations.

Note that, mathematically, this defines a preorder on ${\mathbb{P}}_{\mathsf{\Omega}}$, i.e., a reflexive ($p\prec p$ for all $p\in {\mathbb{P}}_{\mathsf{\Omega}}$) and transitive (if ${p}^{\prime\prime}\prec {p}^{\prime}$ and ${p}^{\prime}\prec p$, then ${p}^{\prime\prime}\prec p$ for all $p,{p}^{\prime},{p}^{\prime\prime}\in {\mathbb{P}}_{\mathsf{\Omega}}$) binary relation.

In the literature, there are different names for the relation between p and ${p}^{\prime}$ expressed by Definition 2; for example, ${p}^{\prime}$ is called more mixed than p [36], more disordered than p [37], more chaotic than p [32], or an average of p [29]. Most commonly, however, p is said to majorize ${p}^{\prime}$, a usage that started with the early influences of Muirhead [38] and Hardy, Littlewood, and Pólya [29] and was developed by many authors into the field of majorization theory (a standard reference was published by Marshall, Olkin, and Arnold [27]), with far reaching applications until today, especially in non-equilibrium thermodynamics and quantum information theory [39,40,41].

There are plenty of equivalent (arguably less intuitive) characterizations of $p\prec {p}^{\prime}$, some of which are summarized below. However, one characterization makes use of a concept very closely related to Pigou–Dalton transfers, known as T-transforms [27,32], which expresses the fact that moving some weight from a more likely option to a less likely option is equivalent to taking (weighted) averages of the two probability values. More precisely, a T-transform is a linear operator on ${\mathbb{P}}_{\mathsf{\Omega}}$ with a matrix of the form $T=(1-\lambda )\mathbb{I}+\lambda \mathsf{\Pi}$, where $\mathbb{I}$ denotes the identity matrix on ${\mathbb{R}}^{N}$, $\mathsf{\Pi}$ denotes a permutation matrix of two elements, and $0\le \lambda \le 1$. If $\mathsf{\Pi}$ permutes ${p}_{m}$ and ${p}_{n}$, then ${\left(Tp\right)}_{k}={p}_{k}$ for all $k\notin \{m,n\}$, and

$${\left(Tp\right)}_{m}=\left(1-\lambda \right){p}_{m}+\lambda {p}_{n},\qquad {\left(Tp\right)}_{n}=\lambda {p}_{m}+\left(1-\lambda \right){p}_{n}. \qquad (5)$$

Hence, a T-transform considers any two probability values ${p}_{m}$ and ${p}_{n}$ of a given $p\in {\mathbb{P}}_{\mathsf{\Omega}}$, calculates their weighted averages with weights $(1-\lambda ,\lambda )$ and $(\lambda ,1-\lambda )$, and replaces the original values with these averages. From Equation (5), it follows immediately that a T-transform with parameter $0<\lambda \le \frac{1}{2}$ and a permutation $\mathsf{\Pi}$ of ${p}_{m},{p}_{n}$ with ${p}_{m}\le {p}_{n}$ is a Pigou–Dalton transfer with $\epsilon =({p}_{n}-{p}_{m})\lambda $. In addition, allowing $\frac{1}{2}\le \lambda \le 1$ means that T-transforms include permutations; in particular, ${p}^{\prime}\prec p$ if and only if ${p}^{\prime}$ can be derived from p by successive applications of finitely many T-transforms.
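The equivalence between T-transforms and Pigou–Dalton transfers can be checked numerically (a sketch with our own helper name):

```python
# A T-transform T = (1-lam)*I + lam*Pi acting on two entries of a
# probability vector, reducing to a Pigou-Dalton transfer with
# eps = (p_n - p_m)*lam for 0 < lam <= 1/2 (Equation (5)).

def t_transform(p, m, n, lam):
    """Replace p_m, p_n by their weighted averages with weights
    (1-lam, lam) and (lam, 1-lam); all other entries are unchanged."""
    assert 0.0 <= lam <= 1.0
    q = list(p)
    q[m] = (1 - lam) * p[m] + lam * p[n]
    q[n] = lam * p[m] + (1 - lam) * p[n]
    return q

p = [0.1, 0.6, 0.3]
m, n, lam = 0, 1, 0.25          # p[m] <= p[n], 0 < lam <= 1/2
eps = (p[n] - p[m]) * lam       # the equivalent Pigou-Dalton step size

q = t_transform(p, m, n, lam)
assert abs(q[m] - (p[m] + eps)) < 1e-12   # same as a Pigou-Dalton transfer
assert abs(q[n] - (p[n] - eps)) < 1e-12
assert abs(sum(q) - 1.0) < 1e-12          # probability mass is preserved

# lam = 1 recovers the pure permutation of the two entries
assert t_transform(p, m, n, 1.0) == [0.6, 0.1, 0.3]
```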

Due to a classic result by Hardy, Littlewood, and Pólya ([29] (p. 49)), this characterization can be stated in an even simpler form by using doubly stochastic matrices, i.e., matrices $A={\left({A}_{ij}\right)}_{i,j}$ with ${A}_{ij}\ge 0$ and ${\sum}_{i}{A}_{ij}=1={\sum}_{j}{A}_{ij}$ for all $i,j$. By writing $xA:={A}^{T}x$ for all $x\in {\mathbb{R}}^{N}$, and $e:=(1,\cdots ,1)$, these conditions are often stated as

$${A}_{ij}\ge 0,\qquad eA=e,\qquad Ae=e. \qquad (6)$$

Note that doubly stochastic matrices can be viewed as generalizations of T-transforms in the sense that a T-transform takes an average of two entries, whereas if ${p}^{\prime}=pA$ with a doubly stochastic matrix A, then ${p}_{j}^{\prime}={\sum}_{i}{A}_{ij}{p}_{i}$ is a convex combination, or a weighted average, of p with coefficients ${\left({A}_{ij}\right)}_{i}$ for each j. This is also why ${p}^{\prime}$ is then called more mixed than p [36]. Therefore, similar to T-transforms, we might expect that, if ${p}^{\prime}$ is the result of an application of a doubly stochastic matrix, ${p}^{\prime}=pA$, then ${p}^{\prime}$ is an average of p and therefore contains more uncertainty than p. This is exactly what is expressed by Characterization $\left(iii\right)$ in the following theorem. A similar characterization of ${p}^{\prime}\prec p$ is that ${p}^{\prime}$ must be given by a convex combination of permutations of the elements of p (see property $\left(iv\right)$ below).

Without yet having the concept of majorization, Schur proved that functions of the form $p\mapsto {\sum}_{i}f\left({p}_{i}\right)$ with a convex function f are monotone with respect to the application of a doubly stochastic matrix [42] (see property $\left(v\right)$ below). Functions of this form are an important class of cost functions for probabilistic decision-makers, as we discuss in Example 1.

**Theorem** **1** (Characterizations of ${p}^{\prime}\prec p$ [27])**.** For $p,{p}^{\prime}\in {\mathbb{P}}_{\mathsf{\Omega}}$, the following are equivalent:

- (i)
${p}^{\prime}\prec p$, i.e., ${p}^{\prime}$ contains more uncertainty than p (Definition 2)

- (ii)
${p}^{\prime}$ is the result of finitely many T-transforms applied to p

- (iii)
${p}^{\prime}=pA$ for a doubly stochastic matrix A

- (iv)
${p}^{\prime}={\sum}_{k=1}^{K}{\theta}_{k}{\mathsf{\Pi}}_{k}\left(p\right)$ where $K\in \mathbb{N}$, ${\sum}_{k=1}^{K}{\theta}_{k}=1$, ${\theta}_{k}\ge 0$, and ${\mathsf{\Pi}}_{k}$ is a permutation for all $k\in \{1,\cdots ,K\}$

- (v)
${\sum}_{i=1}^{N}f\left({p}_{i}^{\prime}\right)\le {\sum}_{i=1}^{N}f\left({p}_{i}\right)$ for all continuous convex functions f

- (vi)
${\sum}_{i=1}^{k}{\left({p}_{i}^{\prime}\right)}^{\downarrow}\le {\sum}_{i=1}^{k}{p}_{i}^{\downarrow}$ for all $k\in \{1,\cdots ,N-1\}$, where ${p}^{\downarrow}$ denotes the decreasing rearrangement of p
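Characterization $\left(vi\right)$ is easy to implement; the following sketch (our own helper names) tests majorization via partial sums and spot-checks the Schur-convexity property $\left(v\right)$:

```python
# Testing p' ≺ p via the partial-sum characterization (vi) of Theorem 1,
# plus a spot check of property (v) for the convex function f(t) = t^2.

def majorizes(p, p_prime, tol=1e-12):
    """Return True iff p' ≺ p: the partial sums of the decreasing
    rearrangement of p' never exceed those of p (with equal totals)."""
    ps = sorted(p, reverse=True)
    qs = sorted(p_prime, reverse=True)
    run_p = run_q = 0.0
    for a, b in zip(ps, qs):
        run_p += a
        run_q += b
        if run_q > run_p + tol:
            return False
    return abs(run_p - run_q) < 1e-9   # same total probability mass

p = [0.7, 0.2, 0.1]
p_unif = [1/3, 1/3, 1/3]
assert majorizes(p, p_unif)        # the uniform vector has more uncertainty
assert not majorizes(p_unif, p)

# property (v): sum f(p'_i) <= sum f(p_i) for every convex f
f = lambda t: t * t
assert sum(map(f, p_unif)) <= sum(map(f, p))
```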

As argued above, the equivalence between $\left(i\right)$ and $\left(ii\right)$ is straightforward. The equivalences among $\left(ii\right)$, $\left(iii\right)$, and $\left(vi\right)$ are due to Muirhead [38] and Hardy, Littlewood, and Pólya [29]. The implication $\left(v\right)\Rightarrow \left(iii\right)$ is due to Karamata [43] and Hardy, Littlewood, and Pólya [44], whereas $\left(iii\right)\Rightarrow \left(v\right)$ goes back to Schur [42]. Mathematically, $\left(iv\right)$ means that ${p}^{\prime}$ belongs to the convex hull of all permutations of the entries of p, and the equivalence $\left(iii\right)\iff \left(iv\right)$ is known as the Birkhoff–von Neumann theorem. Here, we state all relations for probability vectors $p\in {\mathbb{P}}_{\mathsf{\Omega}}$, even though they are usually stated for all $p,{p}^{\prime}\in {\mathbb{R}}^{N}$ with the additional requirement that ${\sum}_{i=1}^{N}{p}_{i}={\sum}_{i=1}^{N}{p}_{i}^{\prime}$.

Condition $\left(vi\right)$ is the classical and most commonly used definition of majorization [27,29,34], since it is often the easiest to check in practical examples. For example, from $\left(vi\right)$, it immediately follows that uniform distributions over N options contain more uncertainty than uniform distributions over ${N}^{\prime}<N$ options, since ${\sum}_{i=1}^{k}\frac{1}{N}=\frac{k}{N}\le \frac{k}{{N}^{\prime}}={\sum}_{i=1}^{k}\frac{1}{{N}^{\prime}}$ for all $k<N$; i.e., for $N\ge 3$ we have

$$\left({\textstyle \frac{1}{N}},\cdots ,{\textstyle \frac{1}{N}}\right)\prec \left({\textstyle \frac{1}{N-1}},\cdots ,{\textstyle \frac{1}{N-1}},0\right)\prec \cdots \prec \left(1,0,\cdots ,0\right). \qquad (7)$$

In particular, if $A\subset {A}^{\prime}\subset \mathsf{\Omega}$, then the uniform distribution over A contains less uncertainty than the uniform distribution over ${A}^{\prime}$, which shows that the notion of uncertainty introduced in Definition 2 is indeed a generalization of the notion of uncertainty given by the number of uncertain options introduced in the previous section.

Note that, since ≺ is only a preorder on ${\mathbb{P}}_{\mathsf{\Omega}}$, in general, two distributions ${p}^{\prime},p\in {\mathbb{P}}_{\mathsf{\Omega}}$ are not necessarily comparable, i.e., we can have both ${p}^{\prime}\nprec p$ and $p\nprec {p}^{\prime}$. In Figure 4, we visualize the regions of all comparable distributions for two exemplary distributions on a three-dimensional decision space ($N=3$), represented on the two-dimensional simplex of probability vectors $p=({p}_{1},{p}_{2},{p}_{3})$. For example, $p=(\frac{1}{2},\frac{1}{4},\frac{1}{4})$ and ${p}^{\prime}=(\frac{2}{5},\frac{2}{5},\frac{1}{5})$ cannot be compared under ≺, since $\frac{1}{2}>\frac{2}{5}$, but $\frac{3}{4}<\frac{4}{5}$.
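The incomparability of these two distributions is immediate from the partial sums of condition $\left(vi\right)$; a short sketch (`partial_sums` is our own helper):

```python
# Verifying that p = (1/2, 1/4, 1/4) and p' = (2/5, 2/5, 1/5) are
# incomparable under ≺, using the partial sums of condition (vi).

def partial_sums(p):
    """Partial sums of the decreasing rearrangement of p."""
    sums, run = [], 0.0
    for x in sorted(p, reverse=True):
        run += x
        sums.append(run)
    return sums

p  = [1/2, 1/4, 1/4]   # partial sums: 1/2, 3/4, 1
pp = [2/5, 2/5, 1/5]   # partial sums: 2/5, 4/5, 1

s, t = partial_sums(p), partial_sums(pp)
assert s[0] > t[0]   # 1/2 > 2/5: p' does not majorize p ...
assert s[1] < t[1]   # 3/4 < 4/5: ... and p does not majorize p' either
```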

Cost functions can now be generalized to probabilistic decision-making by noting that the property $C\left({A}^{\prime}\right)<C\left(A\right)$ whenever $A\subsetneq {A}^{\prime}$ in Equation (2) means that C is strictly monotonic with respect to the preorder given by set inclusion.

**Definition** **3** (Cost functions on ${\mathbb{P}}_{\mathsf{\Omega}}$)**.** We say that a function $C:{\mathbb{P}}_{\mathsf{\Omega}}\to {\mathbb{R}}_{+}$ is a cost function if it is strictly monotonically increasing with respect to the preorder ≺, i.e., if

$$C\left({p}^{\prime}\right)\le C\left(p\right)\quad \text{whenever}\quad {p}^{\prime}\prec p, \qquad (8)$$

with equality only if p and ${p}^{\prime}$ are equivalent, ${p}^{\prime}\sim p$, which is defined as ${p}^{\prime}\prec p$ and $p\prec {p}^{\prime}$. Moreover, for a parameterized family of posteriors ${\left({p}_{r}\right)}_{r\in I}$, we say that r is a resource parameter with respect to a cost function C if the mapping $I\to {\mathbb{R}}_{+},r\mapsto C\left({p}_{r}\right)$ is strictly monotonically increasing.

Since monotonic functions with respect to majorization were first studied by Schur [42], functions with this property are usually called (strictly) Schur-convex ([27] (Ch. 3)).

**Example** **1** (Generalized entropies)**.** From $\left(v\right)$ in Theorem 1, it follows that functions of the form

$$C\left(p\right)={\sum}_{i=1}^{N}f\left({p}_{i}\right), \qquad (9)$$

where f is strictly convex, are examples of cost functions. Since many entropy measures used in the literature can be seen to be special cases of Equation (9) (with a concave f), functions of this form are often called generalized entropies [45]. In particular, for the choice $f\left(t\right)=t\log t$, we have $C\left(p\right)=-H\left(p\right)$, where $H\left(p\right)$ denotes the Shannon entropy of p. Thus, if ${p}^{\prime}$ contains more uncertainty than p in the sense of Definition 2 (${p}^{\prime}\prec p$), then the Shannon entropy of ${p}^{\prime}$ is larger than the Shannon entropy of p, and therefore ${p}^{\prime}$ also contains more uncertainty than p in the sense of classical information theory. Similarly, for $f\left(t\right)=-\log t$ we obtain the (negative) Burg entropy, and for functions of the form $f\left(t\right)=\pm {t}^{\alpha}$ for $\alpha \in \mathbb{R}\backslash \{0,1\}$ we get the (negative) Tsallis entropy, where the sign is chosen depending on α such that f is convex (see, e.g., [46] for more examples). Moreover, the composition of any (strictly) monotonically increasing function g with Equation (9) generates another class of cost functions, which contains, for example, the (negative) Rényi entropy [23]. Note also that entropies of the form of Equation (9) are special cases of Csiszár's f-divergences [47] for uniform reference distributions (see Example 3 below). In Figure 5, several examples of cost functions are shown for $N=3$. In this case, the two-dimensional probability simplex ${\mathbb{P}}_{\mathsf{\Omega}}$ is given by the triangle in ${\mathbb{R}}^{3}$ with edges $(1,0,0)$, $(0,1,0)$, and $(0,0,1)$. Cost functions are visualized in terms of their level sets.

We prove in Proposition A1 in Appendix A that all cost functions of the form of Equation (9) are superadditive with respect to coarse-graining. This seems to be a new result and an improvement upon the fact that generalized entropies (and f-divergences) satisfy information monotonicity [48]. More precisely, if a decision in $\mathsf{\Omega}$, represented by a random variable Z, is split up into two steps by partitioning $\mathsf{\Omega}={\bigcup}_{i\in I}{A}_{i}$ and first deciding about the partition $i\in I$, correspondingly described by a random variable X with values in I, and then choosing an option inside of the selected partition ${A}_{i}$, represented by a random variable Y, i.e., $Z=(X,Y)$, then

$$C\left(X\right)+C\left(Y|X\right)\le C\left(Z\right), \qquad (10)$$

where $C\left(X\right):=C\left(p\left(X\right)\right)$ and $C\left(Y|X\right):={\mathbb{E}}_{p\left(X\right)}\left[C\left(p\left(Y|X\right)\right)\right]$. For symmetric cost functions (such as Equation (9)), this is equivalent to

$$C\left(\hat{p}\right)+{\sum}_{i\in I}{\hat{p}}_{i}\,C\left({p}^{\left(i\right)}\right)\le C\left(p\right),\qquad {\hat{p}}_{i}:={\sum}_{x\in {A}_{i}}p\left(x\right),\quad {p}^{\left(i\right)}:={\left(p\left(x\right)/{\hat{p}}_{i}\right)}_{x\in {A}_{i}}. \qquad (11)$$

The case of equality in Equations (10) and (11) (see Figure 6) is sometimes called separability [49], strong additivity [50], or recursivity [51], and it is often used to characterize Shannon entropy [23,52,53,54,55,56]. In fact, we also show in Appendix A (Proposition A2) that cost functions C that are additive under coarse-graining are proportional to the negative Shannon entropy $-H$. See also Example 3 in the next section, where we discuss the generalization to arbitrary reference distributions.

We can now refine the notion of a decision-making process introduced in the previous section as a mapping $\varphi $ together with a cost function C satisfying Equation (2). Instead of simply mapping from sets ${A}^{\prime}$ to smaller subsets $A\subsetneq {A}^{\prime}$ by successively eliminating options, we now allow $\varphi $ to be a mapping between probability distributions such that $\varphi \left(p\right)$ can be obtained from p by a finite number of elementary computations (without permutations), and we require C to be a cost function on ${\mathbb{P}}_{\mathsf{\Omega}}$, so that

$$p⋨\varphi \left(p\right)\qquad \text{and}\qquad C\left(p\right)<C\left(\varphi \left(p\right)\right). \qquad (12)$$

Here, $C\left(p\right)$ quantifies the total costs of arriving at a distribution p, and ${p}^{\prime}⋨p$ means that ${p}^{\prime}\prec p$ and $p\nprec {p}^{\prime}$. In other words, a decision-making process can be viewed as traversing probability space by moving pieces of probability from one option to another option such that uncertainty is reduced.
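The strict monotonicity of generalized-entropy cost functions under an uncertainty-reducing elementary computation can be checked directly (a sketch with our own function names; the Tsallis case uses $\alpha =2$):

```python
# Generalized-entropy cost functions C(p) = sum_i f(p_i) from Equation (9),
# checked for monotonicity under one elementary computation.

import math

def cost(p, f):
    """Generalized-entropy cost C(p) = sum_i f(p_i) for convex f."""
    return sum(f(x) for x in p if x > 0)

shannon = lambda t: t * math.log(t)   # C = -H (negative Shannon entropy)
tsallis2 = lambda t: t * t            # negative Tsallis entropy for alpha = 2

p = [0.4, 0.35, 0.25]
# one elementary computation: move eps = 0.1 from option 1 to option 0
p_sharper = [0.5, 0.25, 0.25]

for f in (shannon, tsallis2):
    assert cost(p_sharper, f) > cost(p, f)   # less uncertainty costs more

# the uniform distribution is the cost minimum (cf. Proposition 1 below)
u = [1/3, 1/3, 1/3]
assert cost(u, shannon) < cost(p, shannon)
```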

Up to now, we have ignored one important property of a decision-making process: the distribution q with minimal cost, i.e., satisfying $C\left(q\right)\le C\left(p\right)$ for all p, which must be identified with the initial distribution of a decision-making process with cost function C. As one might expect (see Figure 5), it turns out that all cost functions according to Definition 3 have the same minimal element.

**Proposition** **1** (Uniform distributions are minimal)**.** The uniform distribution $({\textstyle \frac{1}{N}},\cdots ,{\textstyle \frac{1}{N}})$ is the unique minimal element in ${\mathbb{P}}_{\mathsf{\Omega}}$ with respect to ≺, i.e.,

$$\left({\textstyle \frac{1}{N}},\cdots ,{\textstyle \frac{1}{N}}\right)\prec p\quad \text{for all}\quad p\in {\mathbb{P}}_{\mathsf{\Omega}}. \qquad (13)$$

Once Equation (13) is established, it follows from Equation (8) that $C\left(({\textstyle \frac{1}{N}},\cdots ,{\textstyle \frac{1}{N}})\right)\le C\left(p\right)$ for all p; in particular, the uniform distribution corresponds to the initial state of all decision-making processes with cost function C satisfying Equation (12). Moreover, it contains the maximum amount of uncertainty with respect to any entropy measure of the form of Equation (9), a property known as the second Khinchin axiom [49]; e.g., for Shannon entropy, $0\le H\left(p\right)\le \log N$. Proposition 1 follows from Characterization $\left(iv\right)$ in Theorem 1 after noticing that every $p\in {\mathbb{P}}_{\mathsf{\Omega}}$ can be transformed to a uniform distribution by permuting its elements cyclically (see Proposition A3 in Appendix A for a detailed proof).

Considering that a decision-maker may have prior information, for example originating from the experience of previous comparable decision-making tasks, the assumption of a uniform initial distribution seems artificial. Therefore, in the following section, we arrive at the final notion of a decision-making process by extending the results of this section to allow for arbitrary initial distributions.

#### 2.3. Decision-Making with Prior Knowledge

From the discussion at the end of the previous section, we conclude that, in full generality, a decision-maker transitions from an initial probability distribution $q\in {\mathbb{P}}_{\mathsf{\Omega}}$, called the prior, to a terminal distribution $p\in {\mathbb{P}}_{\mathsf{\Omega}}$, called the posterior. Note that, since options that have been eliminated remain excluded from the rest of the decision-making process, a posterior p must be absolutely continuous with respect to the prior q, denoted by $p\ll q$, i.e., $p\left(x\right)$ can be non-zero for a given $x\in \mathsf{\Omega}$ only if $q\left(x\right)$ is non-zero.

The notion of uncertainty (Definition 2) can be generalized with respect to a non-uniform prior $q\in {\mathbb{P}}_{\mathsf{\Omega}}$ by viewing the probabilities ${q}_{i}$ as the probabilities $Q\left({A}_{i}\right)$ of partitions ${A}_{i}$ of an underlying elementary probability space $\tilde{\mathsf{\Omega}}={\bigcup}_{i}{A}_{i}$ of equally likely elements under Q; in particular, Q represents q as the uniform distribution on $\tilde{\mathsf{\Omega}}$ (see Figure 7). The similarity of the entries of the corresponding representation $P\in {\mathbb{P}}_{\tilde{\mathsf{\Omega}}}$ of any $p\in {\mathbb{P}}_{\mathsf{\Omega}}$ (its uncertainty) then contains information about how close p is to q, which we call the relative uncertainty of p with respect to q (Definition 4 below).

The formal construction is as follows: Let $p,q\in {\mathbb{P}}_{\mathsf{\Omega}}$ be such that $p\ll q$ and ${q}_{i}\in \mathbb{Q}$ (the case when ${q}_{i}\in \mathbb{R}$ then follows from a simple approximation of each entry by a rational number). Let $\alpha \in \mathbb{N}$ be such that $\alpha \,{q}_{i}\in \mathbb{N}$ for all $i\in \{1,\cdots ,N\}$; for example, $\alpha $ could be chosen as the least common multiple of the denominators of the ${q}_{i}$. The underlying elementary probability space $\tilde{\mathsf{\Omega}}$ then consists of $\alpha $ elements, and there exists a partitioning ${\left\{{A}_{i}\right\}}_{i=1,\cdots ,N}$ of $\tilde{\mathsf{\Omega}}$ such that

$$\left|{A}_{i}\right|=\alpha \,{q}_{i}\quad \text{for all}\quad i\in \{1,\cdots ,N\}. \qquad (14)$$

In particular, if Q denotes the uniform distribution on $\tilde{\mathsf{\Omega}}$, it follows that

$$Q\left({A}_{i}\right)=\left|{A}_{i}\right|\,Q\left(\omega \right)=\frac{\left|{A}_{i}\right|}{\alpha}={q}_{i}\quad \text{for all}\quad \omega \in {A}_{i}, \qquad (15)$$

i.e., Q represents q in $\tilde{\mathsf{\Omega}}$ with respect to the partitioning ${\left\{{A}_{i}\right\}}_{i}$. Similarly, any $p\in {\mathbb{P}}_{\mathsf{\Omega}}$ can be represented as a distribution on $\tilde{\mathsf{\Omega}}$ by requiring that $P\left({A}_{i}\right)={p}_{i}$ for all $i\in \{1,\cdots ,N\}$ and letting P be constant inside of each partition; i.e., similar to Equation (15), we have $P\left({A}_{i}\right)=\left|{A}_{i}\right|\,P\left(\omega \right)={p}_{i}$ for all $\omega \in {A}_{i}$, and therefore, by Equation (14),

$$P\left(\omega \right)=\frac{{p}_{i}}{\left|{A}_{i}\right|}=\frac{{p}_{i}}{\alpha \,{q}_{i}}\quad \text{for all}\quad \omega \in {A}_{i}. \qquad (16)$$

Note that, if ${q}_{i}=0$, then ${p}_{i}=0$ by absolute continuity ($p\ll q$), in which case we can either exclude option i from $\mathsf{\Omega}$ or set $P\left(\omega \right)=0$.

**Example** **2.** For a prior $q=(\frac{1}{6},\frac{1}{2},\frac{1}{3})$, we put $\alpha =6$, so that $\tilde{\mathsf{\Omega}}=\{{\omega}_{1},\cdots ,{\omega}_{6}\}$ should be partitioned as $\tilde{\mathsf{\Omega}}=\left\{{\omega}_{1}\right\}\cup \{{\omega}_{2},{\omega}_{3},{\omega}_{4}\}\cup \{{\omega}_{5},{\omega}_{6}\}$. Then, ${q}_{i}$ corresponds to the probability of the ith partition under the uniform distribution $Q=\frac{1}{6}(1,\cdots ,1)$, while the distribution $p=(\frac{1}{6},\frac{3}{4},\frac{1}{12})$ is represented on $\tilde{\mathsf{\Omega}}$ by the distribution $P=({\textstyle \frac{1}{6}},{\textstyle \frac{1}{4}},{\textstyle \frac{1}{4}},{\textstyle \frac{1}{4}},{\textstyle \frac{1}{24}},{\textstyle \frac{1}{24}})$ (see Figure 7).

Importantly, if the components of the representation ${\mathsf{\Lambda}}_{q}p:=P$ in ${\mathbb{P}}_{\tilde{\mathsf{\Omega}}}$ given by Equation (16) are similar to each other, i.e., if P is close to uniform, then the components of p must be very similar to the components of q, which we express by the concept of relative uncertainty.
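The embedding ${\mathsf{\Lambda}}_{q}$ of Equation (16) can be sketched with exact rational arithmetic, reproducing the numbers of Example 2 (`lambda_q` is our own helper name):

```python
# The embedding Lambda_q: spread p_i uniformly over the |A_i| = alpha*q_i
# elements of partition A_i, so that P(w) = p_i / (alpha*q_i).

from fractions import Fraction as F

def lambda_q(p, q, alpha):
    """Return the representation P of p on the elementary space."""
    P = []
    for pi, qi in zip(p, q):
        size = alpha * qi
        assert size.denominator == 1        # alpha*q_i must be an integer
        P.extend([pi / size] * int(size))
    return P

q = [F(1, 6), F(1, 2), F(1, 3)]
p = [F(1, 6), F(3, 4), F(1, 12)]
P = lambda_q(p, q, 6)
assert P == [F(1, 6), F(1, 4), F(1, 4), F(1, 4), F(1, 24), F(1, 24)]
assert sum(P) == 1
# the prior itself is mapped to the uniform distribution on 6 elements
assert lambda_q(q, q, 6) == [F(1, 6)] * 6
```

The last assertion is the probabilistic content of Equation (15): the prior looks maximally uncertain from the viewpoint of its own elementary space.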

**Definition** **4** (Uncertainty relative to q)**.** We say that ${p}^{\prime}\in {\mathbb{P}}_{\mathsf{\Omega}}$ contains more uncertainty with respect to a prior $q\in {\mathbb{P}}_{\mathsf{\Omega}}$ than $p\in {\mathbb{P}}_{\mathsf{\Omega}}$, denoted by ${p}^{\prime}{\prec}_{q}p$, if and only if ${\mathsf{\Lambda}}_{q}{p}^{\prime}$ contains more uncertainty than ${\mathsf{\Lambda}}_{q}p$, i.e.,

$${p}^{\prime}{\prec}_{q}p\quad :\Longleftrightarrow \quad {\mathsf{\Lambda}}_{q}{p}^{\prime}\prec {\mathsf{\Lambda}}_{q}p, \qquad (17)$$

where ${\mathsf{\Lambda}}_{q}:{\mathbb{P}}_{\mathsf{\Omega}}\to {\mathbb{P}}_{\tilde{\mathsf{\Omega}}},p\mapsto P$ is given by Equation (16).

As we show in Theorem 2 below, it turns out that ${\prec}_{q}$ coincides with a known concept called q-majorization [57], majorization relative to q [27,28], or mixing distance [58]. Due to the lack of a characterization by partial sums, it is usually defined as a generalization of Characterization $\left(iii\right)$ in Theorem 1; that is, ${p}^{\prime}$ is q-majorized by p iff ${p}^{\prime}=pA$, where A is a so-called q-stochastic matrix, which means that it is a stochastic matrix ($Ae=e$) with $qA=q$. In particular, ${\prec}_{q}$ does not depend on the choice of $\alpha $ in the definition of ${\mathsf{\Lambda}}_{q}$. Here, we provide two new characterizations of q-majorization: the one given by Definition 4, and one using partial sums, generalizing the original definition of majorization.

**Theorem** **2** (Characterizations of ${p}^{\prime}{\prec}_{q}p$)**.** The following are equivalent:

- (i)
${p}^{\prime}{\prec}_{q}p$, i.e., ${p}^{\prime}$ contains more uncertainty relative to q than p (Definition 4).

- (ii)
${\mathsf{\Lambda}}_{q}p$ can be obtained from ${\mathsf{\Lambda}}_{q}{p}^{\prime}$ by a finite number of elementary computations and permutations on ${\mathbb{P}}_{\tilde{\mathsf{\Omega}}}$.

- (iii)
${p}^{\prime}=pA$ for a q-stochastic matrix A, i.e., $Ae=e$ and $qA=q$.

- (iv)
${\sum}_{i=1}^{N}{q}_{i}f\left(\frac{{p}_{i}^{\prime}}{{q}_{i}}\right)\le {\sum}_{i=1}^{N}{q}_{i}f\left(\frac{{p}_{i}}{{q}_{i}}\right)$ for all continuous convex functions f.

- (v)
${\sum}_{i=1}^{l-1}{\left({p}_{i}^{\prime}\right)}^{\downarrow}+{a}_{q}(k,l){\left({p}_{l}^{\prime}\right)}^{\downarrow}\phantom{\rule{4pt}{0ex}}\le \phantom{\rule{4pt}{0ex}}{\sum}_{i=1}^{l-1}{p}_{i}^{\downarrow}+{a}_{q}(k,l){p}_{l}^{\downarrow}$ for all $\alpha {\sum}_{i=1}^{l-1}{q}_{i}^{\downarrow}\le k\le \alpha {\sum}_{i=1}^{l}{q}_{i}^{\downarrow}$ and $1\le l\le N$, where ${a}_{q}(k,l):=(\frac{k}{\alpha}-{\sum}_{i=1}^{l-1}{q}_{i}^{\downarrow})/{q}_{l}^{\downarrow}$, and the arrows indicate that ${({p}_{i}^{\downarrow}/{q}_{i}^{\downarrow})}_{i}$ is ordered decreasingly.
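A simple numerical illustration of $\left(iii\right)$ and $\left(iv\right)$ (our own construction, not from the paper): the matrix $A=(1-\lambda )\mathbb{I}+\lambda \,{e}^{T}q$ is q-stochastic, so ${p}^{\prime}=pA=(1-\lambda )p+\lambda q$ is q-majorized by p, and every relative cost of the form $\left(iv\right)$ must decrease. Here we check this with the convex function $f\left(t\right)=t\log t$ (the Kullback-Leibler case of Example 3):

```python
# Checking characterization (iv) of Theorem 2 for p' = pA with the
# q-stochastic matrix A = (1-lam)*I + lam*(rows equal to q), i.e.,
# p' = (1-lam)*p + lam*q.

import math

def f_divergence(p, q, f):
    """Relative cost C_q(p) = sum_i q_i f(p_i / q_i), as in (iv)."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q) if qi > 0)

kl = lambda t: t * math.log(t) if t > 0 else 0.0

q = [0.5, 0.3, 0.2]
p = [0.7, 0.1, 0.2]
lam = 0.4
p_prime = [(1 - lam) * pi + lam * qi for pi, qi in zip(p, q)]  # p' = pA

# p' ≺_q p, so the relative cost decreases ...
assert f_divergence(p_prime, q, kl) < f_divergence(p, q, kl)
# ... and the prior itself has zero cost, the minimum (Proposition 2 below)
assert abs(f_divergence(q, q, kl)) < 1e-12
```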

To prove that $\left(i\right)$, $\left(iii\right)$, and $\left(v\right)$ are equivalent (see Proposition A4 in Appendix A), we make use of the fact that ${\mathsf{\Lambda}}_{q}:{\mathbb{P}}_{\mathsf{\Omega}}\to {\mathbb{P}}_{\tilde{\mathsf{\Omega}}}$ has a left inverse ${\mathsf{\Lambda}}_{q}^{-1}:{\mathsf{\Lambda}}_{q}\left({\mathbb{P}}_{\mathsf{\Omega}}\right)\to {\mathbb{P}}_{\mathsf{\Omega}}$, which can be verified by simply multiplying the corresponding matrices given in the proof of Proposition A4. The equivalence between $\left(iii\right)$ and $\left(iv\right)$ is shown in [28] (see also [27,58]). Characterization $\left(ii\right)$ follows immediately from Definitions 2 and 4.

As required by the discussion at the end of the previous section, q is indeed minimal with respect to ${\prec}_{q}$, which means that it contains the greatest amount of uncertainty relative to itself.

**Proposition** **2** (The prior is minimal)**.** The prior $q\in {\mathbb{P}}_{\mathsf{\Omega}}$ is the unique minimal element in ${\mathbb{P}}_{\mathsf{\Omega}}$ with respect to ${\prec}_{q}$, that is,

$$q\,{\prec}_{q}\,p\quad \text{for all}\quad p\in {\mathbb{P}}_{\mathsf{\Omega}}. \qquad (18)$$

This follows more or less directly from Proposition 1 and the equivalence of $\left(i\right)$ and $\left(iii\right)$ in Theorem 2 (see Proposition A5 in Appendix A for a detailed proof).

Order-preserving functions with respect to ${\prec}_{q}$ generalize the cost functions introduced in the previous section (Definition 3). According to Proposition 2, such functions have a unique minimum given by the prior q. Since cost functions are used in Definition 7 below to quantify the expenses of a decision-making process, we require their minimum to be zero, which can always be achieved by redefining a given cost function by an additive constant.

**Definition** **5** (Cost functions relative to q)**.** We say that a function ${C}_{q}:{\mathbb{P}}_{\mathsf{\Omega}}\to {\mathbb{R}}_{+}$ is a cost function relative to q if ${C}_{q}\left(q\right)=0$, if it is invariant under relabeling ${({q}_{i},{p}_{i})}_{i}$, and if it is strictly monotonically increasing with respect to the preorder ${\prec}_{q}$, that is, if

$${C}_{q}\left({p}^{\prime}\right)\le {C}_{q}\left(p\right)\quad \text{whenever}\quad {p}^{\prime}{\prec}_{q}p, \qquad (19)$$

with equality only if ${p}^{\prime}{\sim}_{q}p$, i.e., if ${p}^{\prime}{\prec}_{q}p$ and $p{\prec}_{q}{p}^{\prime}$. Moreover, for a parameterized family of posteriors ${\left({p}_{r}\right)}_{r\in I}$, we say that r is a resource parameter with respect to a cost function ${C}_{q}$ if the mapping $I\to {\mathbb{R}}_{+},r\mapsto {C}_{q}\left({p}_{r}\right)$ is strictly monotonically increasing.

Similar to the generalized entropy functions discussed in Example 1, in the literature there are many examples of relative cost functions, usually called divergences or measures of divergence.

**Example** **3** (f-divergences)**.** From $\left(iv\right)$ in Theorem 2, it follows that functions of the form

$${C}_{q}\left(p\right)=\sum_{i=1}^{N}{q}_{i}\,f\!\left(\frac{{p}_{i}}{{q}_{i}}\right),\qquad (20)$$

where f is continuous and strictly convex with $f\left(1\right)=0$, are examples of cost functions relative to q. Many well-known divergence measures belong to this class of relative cost functions, also known as Csiszár's f-divergences [47]: the Kullback-Leibler divergence (or relative entropy), the squared ${\ell}^{2}$ distance, the Hartley entropy, the Burg entropy, the Tsallis entropy, and many more [46,50] (see Figure 8 for visualizations of some of them in $N=3$ relative to a non-uniform prior).

As a generalization of Proposition A1 (superadditivity of generalized entropies), we prove in Proposition A6 in Appendix A that f-divergences are superadditive under coarse-graining, that is,

$${C}_{q}\left(Z\right)\ge {C}_{q}\left(X\right)+{C}_{q}\left(Y|X\right)\qquad (21)$$

whenever $Z=(X,Y)$, where ${C}_{q}\left(X\right):={C}_{q\left(X\right)}\left(p\left(X\right)\right)$ and ${C}_{q}\left(Y|X\right):={\mathbb{E}}_{p\left(X\right)}\left[{C}_{q\left(Y|X\right)}\left(p\left(Y|X\right)\right)\right]$. This generalizes Equation (10) to the case of a non-uniform prior. Similar to entropies, the case of equality in Equation (21) is sometimes called a composition rule [59], chain rule [60], or recursivity [50], and is often used to characterize the Kullback-Leibler divergence [8,50,59,60]. Indeed, we also show in Appendix A (Proposition A7) that all additive cost functions with respect to q are proportional to the Kullback-Leibler divergence (relative entropy). This goes back to Hobson's modification [59] of Shannon's original proof [22], after establishing the following monotonicity property for uniform distributions: if $f(M,N)$ denotes the cost ${C}_{{u}_{N}}\left({u}_{M}\right)$ of a uniform distribution ${u}_{M}$ over M elements relative to a uniform distribution ${u}_{N}$ over $N\ge M$ elements, then

$$f(M,{N}^{\prime})\le f(M,N)\quad \text{whenever}\quad M\le {N}^{\prime}\le N\qquad (22)$$

(see Figure 9).
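Costs of the form of Equation (20) are straightforward to evaluate numerically. The sketch below (Python, names are ours) instantiates two members of the family — the Kullback-Leibler divergence via $f(t)=t\log_2 t$ and a chi-squared-type divergence via $f(t)=(t-1)^2$ — and illustrates the unit-cost bookkeeping: reducing a uniform prior over $N=2^k$ options to a single certain option costs exactly $k$ bits under the Kullback-Leibler divergence, matching $k$ binary decisions at one bit each:

```python
import math

def f_divergence(p, q, f):
    """C_q(p) = sum_i q_i * f(p_i / q_i), the form of Equation (20)."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def f_kl(t):
    """f(t) = t*log2(t), continuously extended by f(0) = 0; gives the
    Kullback-Leibler divergence measured in bits."""
    return t * math.log2(t) if t > 0 else 0.0

def f_chi2(t):
    """f(t) = (t - 1)**2 gives a chi-squared-type divergence."""
    return (t - 1.0) ** 2

q = [0.5, 0.3, 0.2]                   # non-uniform prior
p = [0.7, 0.2, 0.1]                   # posterior
print(f_divergence(q, q, f_kl))       # 0.0: staying at the prior is free
print(f_divergence(p, q, f_kl) > 0)   # True: moving away from q costs
print(f_divergence(p, q, f_chi2) > 0) # True for any strictly convex f

# Reducing a uniform prior over N = 2**k options to a single option
# costs k bits, the same as k binary halvings at 1 bit per step.
k = 4
N = 2 ** k
uniform = [1.0 / N] * N
single = [1.0] + [0.0] * (N - 1)
print(f_divergence(single, uniform, f_kl))          # 4.0
print(f_divergence([1.0, 0.0], [0.5, 0.5], f_kl))   # 1.0
```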
Note that, even though our proof of Proposition A7 uses additivity under coarse-graining to show the monotonicity property in Equation (22), it is easy to see that any relative cost function of the form of Equation (20) also satisfies Equation (22), by using the convexity of f in the form $f\left(t\right)\le {\textstyle \frac{t}{s}}f\left(s\right)+(1-{\textstyle \frac{t}{s}})f\left(0\right)$ with $t=\frac{{N}^{\prime}}{M}<\frac{N}{M}=s$.

In terms of decision-making, superadditivity under coarse-graining means that decision-making costs can potentially be reduced by splitting up the decision into multiple steps, for example by a more intelligent search strategy. For instance, if $N={2}^{k}$ for some $k\in \mathbb{N}$ and ${C}_{q}$ is superadditive, then the cost for reducing uncertainty to a single option, i.e., $p=(1,0,\cdots ,0)$, when starting from a uniform distribution q, satisfies

$${C}_{q}\left(p\right)\ge k\,{C}_{{q}^{2}}(1,0)=k,$$

where ${q}^{n}:=(\frac{1}{n},\cdots ,\frac{1}{n})$, and we have set ${C}_{{q}^{2}}(1,0)=1$ as unit cost (corresponding to 1 bit in the case of Kullback-Leibler divergence). Thus, intuitively, the property of the Kullback-Leibler divergence of being additive under coarse-graining might be viewed as describing the minimal amount of processing costs that must be contained in any cost function, because it cannot be reduced by changing the decision-making process. Therefore, in the following, we call cost functions that are proportional to the Kullback-Leibler divergence simply informational costs.

In contrast to the previous section, in the definition of ${\prec}_{q}$ and its characterizations, we never use elementary computations on ${\mathbb{P}}_{\mathsf{\Omega}}$ directly. This is because permutations interact with the uncertainty relative to q, and therefore ${\prec}_{q}$ cannot be characterized by a finite number of elementary computations and permutations on ${\mathbb{P}}_{\mathsf{\Omega}}$. However, we can still define elementary computations relative to q as the inverses of Pigou–Dalton transfers ${T}_{\epsilon}$ of the form of Equation (3) such that ${T}_{\epsilon}p\,{\precnsim}_{q}\,p$ for $\epsilon >0$, which is arguably the most basic form of how to generate uncertainty with respect to q.

Even for small $\epsilon$, a regular Pigou–Dalton transfer does not necessarily increase uncertainty relative to q, because the similarity of the components now needs to be considered with respect to q. Instead, we compare the components of the representation $P={\mathsf{\Lambda}}_{q}p$ of $p\in {\mathbb{P}}_{\mathsf{\Omega}}$, and move some probability weight $\epsilon \ge 0$ from $P\left({A}_{n}\right)$ to $P\left({A}_{m}\right)$ whenever $P\left(\omega \right)\le P\left({\omega}^{\prime}\right)$ for $\omega \in {A}_{m}$ and ${\omega}^{\prime}\in {A}_{n}$, by distributing $\epsilon$ evenly among the elements in ${A}_{m}$ (see Figure 10), denoted by the transformation ${\tilde{T}}_{\epsilon}$. Here, $\epsilon$ must be small enough such that the inequality $\frac{1}{\alpha}\frac{{p}_{m}}{{q}_{m}}=P\left(\omega \right)\le P\left({\omega}^{\prime}\right)=\frac{1}{\alpha}\frac{{p}_{n}}{{q}_{n}}$ is preserved under ${\tilde{T}}_{\epsilon}$, which means that

$$0\le \epsilon \le \frac{{q}_{m}{q}_{n}}{{q}_{m}+{q}_{n}}\left(\frac{{p}_{n}}{{q}_{n}}-\frac{{p}_{m}}{{q}_{m}}\right).\qquad (23)$$

By construction, ${\tilde{T}}_{\epsilon}$ minimally increases uncertainty in ${\mathbb{P}}_{\tilde{\mathsf{\Omega}}}$ while staying in the image of ${\mathbb{P}}_{\mathsf{\Omega}}$ under ${\mathsf{\Lambda}}_{q}$, by keeping the values of P constant within each cell of the partition, and therefore ${T}_{\epsilon}:={\mathsf{\Lambda}}_{q}^{-1}{\tilde{T}}_{\epsilon}{\mathsf{\Lambda}}_{q}$ can be considered the most basic way of increasing uncertainty relative to q.

**Definition** **6** (Elementary computation relative to q)**.** We call a transformation on ${\mathbb{P}}_{\mathsf{\Omega}}$ of the form

$${T}_{\epsilon}\,p:=({p}_{1},\cdots ,{p}_{m}+\epsilon ,\cdots ,{p}_{n}-\epsilon ,\cdots ,{p}_{N}),$$

with $m,n$ such that $\frac{{p}_{m}}{{q}_{m}}\le \frac{{p}_{n}}{{q}_{n}}$, and ε satisfying Equation (23), a Pigou–Dalton transfer relative to q, and its inverse ${T}_{\epsilon}^{-1}$ an elementary computation relative to q.

We are now in the position to state our final definition of a decision-making process.
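A minimal Python sketch of such a transfer (names are ours; the bound on $\epsilon$ is the one from Equation (23)): weight moves from the option with the larger ratio ${p}_{n}/{q}_{n}$ to the one with the smaller ratio ${p}_{m}/{q}_{m}$, and the Kullback-Leibler cost relative to q decreases, i.e., uncertainty relative to q increases:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence, used as the cost function C_q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def pd_transfer_rel_q(p, q, m, n, eps):
    """Pigou-Dalton transfer relative to q: move weight eps from option n
    to option m, where p[m]/q[m] <= p[n]/q[n]."""
    assert p[m] / q[m] <= p[n] / q[n]
    # Equation (23): eps must preserve the ordering of the ratios.
    eps_max = (p[n] * q[m] - p[m] * q[n]) / (q[m] + q[n])
    assert 0 <= eps <= eps_max
    out = list(p)
    out[m] += eps
    out[n] -= eps
    return out

q = [0.5, 0.3, 0.2]
p = [0.2, 0.3, 0.5]
p_new = pd_transfer_rel_q(p, q, 0, 2, 0.05)
print(p_new)
print(kl(p_new, q) < kl(p, q))   # True: the transfer generates uncertainty
```

The inverse step (subtracting eps from option m, adding it to option n) is then an elementary computation relative to q, and increases the cost.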

**Definition** **7** (Decision-making process)**.** A decision-making process is a gradual transformation

$$q={p}^{\left(0\right)}\to {p}^{\left(1\right)}\to \cdots \to {p}^{\left(K\right)}=p$$

of a prior $q\in {\mathbb{P}}_{\mathsf{\Omega}}$ to a posterior $p\in {\mathbb{P}}_{\mathsf{\Omega}}$, such that each step decreases uncertainty relative to q. This means that p is obtained from q by successive application of a mapping ϕ between probability distributions on Ω, such that $\varphi \left({p}^{\prime}\right)$ can be obtained from ${p}^{\prime}$ by finitely many elementary computations relative to q, in particular

$${p}^{\prime}\,{\precnsim}_{q}\,\varphi \left({p}^{\prime}\right)\quad \text{and thus}\quad {C}_{q}\left({p}^{\prime}\right)<{C}_{q}\left(\varphi \left({p}^{\prime}\right)\right),$$

where ${C}_{q}\left({p}^{\prime}\right)$ quantifies the total costs of a distribution ${p}^{\prime}$, and ${p}^{\prime}{\precnsim}_{q}p$ means that ${p}^{\prime}{\prec}_{q}p$ and $p{\nprec}_{q}{p}^{\prime}$.

In other words, a decision-making process can be viewed as traversing probability space from prior q to posterior p by moving pieces of probability from one option to another such that uncertainty is reduced relative to q, while expending a certain amount of resources determined by the cost function ${C}_{q}$.
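Such a process can be simulated directly. The sketch below (Python, names are ours) starts at the prior and repeatedly applies an elementary computation, i.e., an inverse Pigou-Dalton transfer relative to q, concentrating weight on one option; the informational cost ${C}_{q}$, here the Kullback-Leibler divergence, increases strictly at every step:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence in bits, used as informational cost."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def elementary_computation(p, q, m, n, eps):
    """Inverse Pigou-Dalton transfer relative to q: move eps from the
    option with the smaller ratio p[m]/q[m] to the one with the larger
    ratio p[n]/q[n], reducing uncertainty relative to q."""
    assert p[m] / q[m] <= p[n] / q[n] and 0 <= eps <= p[m]
    out = list(p)
    out[m] -= eps
    out[n] += eps
    return out

q = [0.5, 0.3, 0.2]    # prior
p = list(q)            # the process starts at the prior
costs = [kl(p, q)]
for _ in range(5):     # gradually concentrate weight on option 2
    p = elementary_computation(p, q, 0, 2, 0.05)
    costs.append(kl(p, q))

print(costs[0])                                        # 0.0
print(all(a < b for a, b in zip(costs, costs[1:])))    # True
```

Each iteration moves a fixed piece of probability, so the trajectory traverses probability space away from q, and the monotone cost sequence realizes the strict inequality ${C}_{q}({p}^{\prime})<{C}_{q}(\varphi ({p}^{\prime}))$ of Definition 7.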