Divergence and Sufficiency for Convex Optimization

Harremoës, Peter

doi:10.3390/e19050206

GSK Department, Copenhagen Business College, Nørre Voldgade 34, 1358 Copenhagen K, Denmark

Entropy 2017, 19(5), 206; https://doi.org/10.3390/e19050206

Submission received: 30 December 2016 / Revised: 11 April 2017 / Accepted: 2 May 2017 / Published: 3 May 2017

(This article belongs to the Special Issue Convex Optimization and Entropy)

Abstract

Logarithmic score and information divergence appear in information theory, statistics, statistical mechanics, and portfolio theory. We demonstrate that all these topics involve some kind of optimization that leads directly to regret functions and such regret functions are often given by Bregman divergences. If a regret function also fulfills a sufficiency condition it must be proportional to information divergence. We will demonstrate that sufficiency is equivalent to the apparently weaker notion of locality and it is also equivalent to the apparently stronger notion of monotonicity. These sufficiency conditions have quite different relevance in the different areas of application, and often they are not fulfilled. Therefore sufficiency conditions can be used to explain when results from one area can be transferred directly to another and when one will experience differences.

Keywords: Bregman divergence; entropy; exergy; Kraft’s inequality; locality; monotonicity; portfolio; regret; scoring rule; sufficiency

1. Introduction

One of the main purposes of information theory is to compress data so that data can be recovered exactly or approximately. One of the most important quantities was called entropy because it is calculated according to a formula that mimics the calculation of entropy in statistical mechanics. Another key concept in information theory is information divergence (KL-divergence) that is defined for probability vectors P and Q as

D (P ∥ Q) = \sum_{x} P (x) ln \frac{P (x)}{Q (x)} .

It was introduced by Kullback and Leibler in 1951 in a paper entitled On Information and Sufficiency [1]. The link from information theory back to statistical physics was developed by E.T. Jaynes via the maximum entropy principle [2,3,4]. The link back to statistics is now well established [5,6,7,8,9].
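As a small illustration of the definition above, the following sketch evaluates the information divergence for two probability vectors. The function name `kl_divergence` and the example vectors are hypothetical and chosen only for this illustration; the conventions for zero probabilities are the usual ones and are stated in the comments.

```python
import math

def kl_divergence(p, q):
    """Information divergence D(P || Q) in nats for two probability vectors."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue  # usual convention: 0 * ln(0 / q) = 0
        if qx == 0.0:
            return math.inf  # P puts mass where Q has none
        total += px * math.log(px / qx)
    return total

# Hypothetical example vectors.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))  # strictly positive because P != Q
print(kl_divergence(p, p))  # zero
```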

Related quantities appear in information theory, statistics, statistical mechanics, and finance, and we are interested in a theory that describes when these relations are exact and when they just work by analogy. First we introduce some general results about optimization on state spaces of finite dimensional C*-algebras. This part applies exactly to all the topics under consideration and leads to Bregman divergences or more general regret functions. Secondly, we introduce several notions of sufficiency and show that this leads to information divergence. In a number of cases it is not possible or not relevant to impose the condition of sufficiency, which can explain why regret functions are not always equal to information divergence.

2. Structure of the State Space

Our knowledge about a system will be represented by a state space. In many applications the state space is given by a set of probability distributions on a sample space. In such cases the state space is a simplex, but it is well-known that the state space is not a simplex in quantum physics. For applications in quantum physics the state space is often represented by a set of density matrices, i.e., positive semidefinite complex matrices with trace 1. In some cases the states are represented as elements of a finite dimensional $C^{*}$-algebra, which is a direct sum of matrix algebras. A finite dimensional $C^{*}$-algebra that is a sum of $1 \times 1$ matrices has a state space that is a simplex, so the state spaces of finite dimensional $C^{*}$-algebras contain the classical probability distributions as special cases.

The extreme points in the set of states are the pure states. The pure states of a $C^{*}$-algebra can be identified with projections of rank 1. Two density matrices $s_{1}$ and $s_{2}$ are said to be orthogonal if $s_{1} s_{2} = s_{2} s_{1} = 0$. Any state $s$ has a decomposition

s = \sum λ_{i} s_{i}

where the $s_{i}$ are orthogonal pure states. Such a decomposition is not unique, but for a finite dimensional $C^{*}$-algebra the coefficients $λ_{1}, λ_{2}, \dots, λ_{n}$ are unique and are called the spectrum of the state.

Sometimes more general state spaces are of interest. In generalized probabilistic theories a state space is a convex set where mixtures are defined by randomly choosing certain states with certain probabilities [10,11]. A convex set where all orthogonal decompositions of a state have the same spectrum is called a spectral state space. Much of the theory in this paper can be generalized to spectral sets. The most important spectral sets are sets of positive elements with trace 1 in Jordan algebras. The study of Jordan algebras and other spectral sets is relevant for the foundation of quantum theory [12,13,14,15], but in this paper we will restrict our attention to states on finite dimensional $C^{*}$-algebras. Nevertheless some of the theorems and proofs are stated in such a way that they hold for more general state spaces.

3. Optimization

Let $S$ denote a state space of a finite dimensional $C^{*}$-algebra and let $A$ denote a set of self-adjoint operators. Each $a \in A$ is identified with a real valued measurement. The elements of $A$ may represent feasible actions (decisions) that lead to a payoff like the score of a statistical decision, the energy extracted by a certain interaction with the system, (minus) the length of a codeword of the next encoded input letter using a specific code book, or the revenue of using a certain portfolio. For each $s \in S$ the mean value of the measurement $a \in A$ is given by

〈a, s〉 = tr (as) .

In this way the set of actions may be identified with a subset of the dual space of $S$.

Next we define

F (s) = \sup_{a \in A} 〈a, s〉 .

We note that $F$ is convex, but $F$ need not be strictly convex. In principle $F (s)$ may be infinite, but we will assume that $F (s) < \infty$ for all states $s$. We also note that $F$ is lower semi-continuous. In this paper we will assume that the function $F$ is continuous. The assumption that $F$ is a real valued continuous function is fulfilled for all the applications we consider.

If $s$ is a state and $a \in A$ is an action then we say that $a$ is optimal for $s$ if $〈a, s〉 = F (s)$. A sequence of actions $a_{n} \in A$ is said to be asymptotically optimal for the state $s$ if $〈a_{n}, s〉 \to F (s)$ for $n \to \infty$.

If $a_{i}$ are actions and $(t_{i})$ is a probability vector then we may define the mixed action $\sum t_{i} \cdot a_{i}$ as the action where we do the action $a_{i}$ with probability $t_{i}$. We note that

〈\sum t_{i} \cdot a_{i}, s〉 = \sum t_{i} \cdot 〈a_{i}, s〉 .

We will assume that all such mixtures of feasible actions are also feasible. If $a_{1} (s) \geq a_{2} (s)$ almost surely for all states we say that $a_{1}$ dominates $a_{2}$, and if $a_{1} (s) > a_{2} (s)$ almost surely for all states $s$ we say that $a_{1}$ strictly dominates $a_{2}$. All actions that are dominated may be removed from $A$ without changing the function $F$.

Let $A_{F}$ denote the set of self-adjoint operators (observables) $a$ such that $〈a, s〉 \leq F (s)$ for all states $s$. Then

F (s) = \sup_{a \in A_{F}} 〈a, s〉 .

Therefore we may replace $A$ by $A_{F}$ without changing the optimization problem.

In the definition of regret we follow Savage [16] but with different notation.

Definition 1.

Let $F$ denote a convex function on the state space $S$. If $F (s)$ is finite the regret of the action $a$ is defined by

D_{F} (s, a) = F (s) - 〈a, s〉 .

(1)

The notion of regret has been discussed in detail in [17,18,19]. In [20] it was proved that if a regret based decision procedure is transitive then it must be equal to a difference in expected utility as in Equation (1), which rules out certain non-linear models in [17,19].

Proposition 1.

The regret $D_{F}$ of actions has the following properties:

$D_{F} (s, a) \geq 0$ with equality if a is optimal for s.
$s \to D_{F} (s, a)$ is a convex function.
If $\bar{a}$ is optimal for the state $\bar{s} = \sum t_{i} \cdot s_{i}$ where $(t_{1}, t_{2}, \dots, t_{ℓ})$ is a probability vector then

$\sum t_{i} \cdot D_{F} (s_{i}, a) = \sum t_{i} \cdot D_{F} (s_{i}, \bar{a}) + D_{F} (\bar{s}, a) .$
$\sum t_{i} \cdot D_{F} (s_{i}, a)$ is minimal if a is optimal for $\bar{s} = \sum t_{i} \cdot s_{i}$ .
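The definitions of $F$ and of the regret of an action can be made concrete in the classical case, where states are probability vectors and actions are payoff vectors. The sketch below is only an illustration under that assumption: the action set, states and weights are hypothetical, and only the listed extreme actions are enumerated, which suffices because a mixed action never has a larger mean payoff than the best of its components. It checks the first and third properties of Proposition 1 numerically.

```python
def mean_payoff(a, s):
    """<a, s> for a payoff vector a and a probability vector s."""
    return sum(ai * si for ai, si in zip(a, s))

def F(actions, s):
    """F(s) = sup over the (finite) action set of <a, s>."""
    return max(mean_payoff(a, s) for a in actions)

def regret(actions, s, a):
    """D_F(s, a) = F(s) - <a, s>."""
    return F(actions, s) - mean_payoff(a, s)

def optimal_action(actions, s):
    return max(actions, key=lambda a: mean_payoff(a, s))

# Hypothetical data: three actions on a two-point sample space.
actions = [(1.0, -1.0), (-1.0, 1.0), (0.2, 0.2)]
states = [(0.9, 0.1), (0.3, 0.7)]
weights = [0.25, 0.75]

# Mixture of the states and an action that is optimal for the mixture.
s_bar = tuple(sum(t * s[k] for t, s in zip(weights, states)) for k in range(2))
a_bar = optimal_action(actions, s_bar)

a = actions[0]  # an arbitrary action to compare against
lhs = sum(t * regret(actions, s, a) for t, s in zip(weights, states))
rhs = (sum(t * regret(actions, s, a_bar) for t, s in zip(weights, states))
       + regret(actions, s_bar, a))
print(lhs, rhs)  # the two sides of the identity in the third property
```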

If the state is $s_{1}$ but one acts as if the state were $s_{0}$ one may compare what one achieves and what could have been achieved. If the state $s_{0}$ has a unique optimal action $a$ we may simply define the regret of $s_{0}$ by

D_{F} (s_{1}, s_{0}) = D_{F} (s_{1}, a) .

The following definition leads to a regret function that is essentially equivalent to the so-called generalized Bregman divergences defined by Kiwiel [21,22].

Definition 2.

Let $F$ denote a convex function on the state space $S$. If $F (s_{1})$ is finite then we define the regret of the state $s_{0}$ as

D_{F} (s_{1}, s_{0}) = \inf_{(a_{n})} \lim_{n \to \infty} D_{F} (s_{1}, a_{n})

where the infimum is taken over all sequences of actions $(a_{n})$ that are asymptotically optimal for $s_{0}$.

With this definition the regret is always defined with values in $[0, \infty]$ and the value of the regret $D_{F} (s_{1}, s_{0})$ only depends on the restriction of the function $F$ to the line segment from $s_{0}$ to $s_{1}$. Let $f$ denote the function

f (t) = F ((1 - t) s_{0} + t s_{1})

where $t \in [0, 1]$. As illustrated in Figure 1 we have

D_{F} (s_{1}, s_{0}) = f (1) - (f (0) + f_{+}^{'} (0))

(2)

where $f_{+}^{'} (0)$ denotes the right derivative of $f$ at $t = 0$. Equation (2) is even valid when the regret is infinite if we allow the right derivative to take the value $-\infty$.

If the state $s_{0}$ has the unique optimal action $a \in A$ then

F (s_{1}) = D_{F} (s_{1}, s_{0}) + 〈a, s_{1}〉

(3)

so the function $F$ can be reconstructed from $D_{F}$ except for an affine function of $s_{1}$.

The following proposition follows from Alexandrov’s theorem ([23], Theorem 25.5).

Proposition 2.

A convex function on a finite dimensional convex set is differentiable almost everywhere with respect to the Lebesgue measure.

A state $s_{0}$ where $F$ is differentiable has a unique optimal action. Therefore Equation (3) holds for almost any state $s_{0}$. In particular the function $F$ can be reconstructed from $D_{F}$ except for an affine function.

Proposition 3.

The regret $D_{F}$ of states has the following properties:

$D_{F} (s_{1}, s_{0}) \geq 0$ with equality if there exists an action a that is optimal for both $s_{1}$ and $s_{0}$ .
$s_{1} \to D_{F} (s_{1}, s_{0})$ is a convex function.

Further, the following two conditions are equivalent:

$D_{F} (s_{1}, s_{0}) = 0$ implies $s_{1} = s_{0}$ .
The function F is strictly convex.

We say that a regret function $D_{F}$ is strict if $F$ is strictly convex. The last two properties of Proposition 1 do not carry over to regret for states except if the regret is a Bregman divergence as defined below. The regret is called a Bregman divergence if it can be written in the following form

D_{F} (s_{1}, s_{0}) = F (s_{1}) - (F (s_{0}) + 〈s_{1} - s_{0}, \nabla F (s_{0})〉)

(4)

where $〈\cdot, \cdot〉$ denotes the (Hilbert-Schmidt) inner product. In the context of forecasting and statistical scoring rules the use of Bregman divergences dates back to [24]. A similar but less general definition of regret was given by Rao and Nayak [25] where the name cross entropy was proposed. Although Bregman divergences have been known for many years they did not gain popularity before the paper [26] where a systematic study of Bregman divergences was presented.

We note that if $D_{F}$ is a Bregman divergence and $s_{0}$ minimizes $F$ then $\nabla F (s_{0}) = 0$ so that the formula for the Bregman divergence reduces to

D_{F} (s_{1}, s_{0}) = F (s_{1}) - F (s_{0}) .

Bregman divergences satisfy the Bregman identity

\sum t_{i} \cdot D_{F} (s_{i}, s) = \sum t_{i} \cdot D_{F} (s_{i}, \bar{s}) + D_{F} (\bar{s}, s),

(5)

but if $F$ is not differentiable this identity can be violated.
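For a differentiable $F$ the Bregman identity (5) can be checked numerically. The following sketch assumes the classical case with vectors in place of density matrices and uses the hypothetical choice $F (s) = \sum_{i} s_{i}^{2}$, for which Equation (4) gives the squared Euclidean distance; the states and weights are likewise made up for the illustration.

```python
def F(s):
    """A differentiable, strictly convex function (illustrative choice)."""
    return sum(si * si for si in s)

def grad_F(s):
    return [2.0 * si for si in s]

def bregman(s1, s0):
    """D_F(s1, s0) = F(s1) - (F(s0) + <s1 - s0, grad F(s0)>), as in Equation (4)."""
    inner = sum((x1 - x0) * g for x1, x0, g in zip(s1, s0, grad_F(s0)))
    return F(s1) - (F(s0) + inner)

# Hypothetical states, weights and reference state.
states = [(0.7, 0.2, 0.1), (0.1, 0.1, 0.8), (0.3, 0.4, 0.3)]
weights = [0.5, 0.3, 0.2]
s = (0.2, 0.5, 0.3)

# Mixture of the states.
s_bar = tuple(sum(t * st[k] for t, st in zip(weights, states)) for k in range(3))

lhs = sum(t * bregman(st, s) for t, st in zip(weights, states))
rhs = sum(t * bregman(st, s_bar) for t, st in zip(weights, states)) + bregman(s_bar, s)
print(lhs, rhs)  # both sides of the Bregman identity (5)
```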

Example 1.

Let the state space be the interval $[0, 1]$ with two actions $〈a_{0}, s〉 = 1 - 2 s$ and $〈a_{1}, s〉 = 2 s - 1$. Let $s_{0} = 0$ and $s_{1} = 1$. Let further $t_{0} = \frac{1}{3}$ and $t_{1} = \frac{2}{3}$. Then $\bar{s} = \frac{2}{3}$. If $s = \frac{1}{2}$ then

\sum t_{i} \cdot D_{F} (s_{i}, s) = 0,

but

\begin{aligned} \sum t_{i} \cdot D_{F} (s_{i}, \bar{s}) &= \frac{1}{3} \cdot (〈a_{0}, 0〉 - 〈a_{1}, 0〉) + \frac{2}{3} \cdot (〈a_{1}, 1〉 - 〈a_{1}, 1〉) \\ &= \frac{1}{3} \cdot (1 - (- 1)) \\ &= \frac{2}{3} . \end{aligned}

Clearly the Bregman identity (5) is violated and $\sum t_{i} \cdot D_{F} (s_{i}, s)$ will increase if $s$ is replaced by $\bar{s}$.
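The numbers in Example 1 can be reproduced with a few lines of code. The sketch below is a simplification rather than a general implementation of Definition 2: it assumes that, for a finite action set, the infimum over asymptotically optimal sequences can be replaced by a minimum over the listed actions that are optimal for $s_{0}$. Actions are stored as hypothetical (slope, intercept) pairs.

```python
# Actions of Example 1 as (slope, intercept): a0(s) = 1 - 2s, a1(s) = 2s - 1.
ACTIONS = [(-2.0, 1.0), (2.0, -1.0)]

def payoff(action, s):
    slope, intercept = action
    return slope * s + intercept

def F(s):
    return max(payoff(a, s) for a in ACTIONS)  # equals |2s - 1|

def state_regret(s1, s0, tol=1e-12):
    """D_F(s1, s0): smallest regret at s1 over actions that are optimal for s0."""
    best = F(s0)
    optimal = [a for a in ACTIONS if abs(payoff(a, s0) - best) <= tol]
    return min(F(s1) - payoff(a, s1) for a in optimal)

states = [0.0, 1.0]
weights = [1.0 / 3.0, 2.0 / 3.0]
s_bar = sum(t * s for t, s in zip(weights, states))

at_half = sum(t * state_regret(s, 0.5) for t, s in zip(weights, states))
at_bar = sum(t * state_regret(s, s_bar) for t, s in zip(weights, states))
print(at_half, at_bar)  # the mixture s_bar does not minimize the mean regret
```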

The following proposition is easily proved.

Proposition 4.

For a convex and continuous function $F$ on the state space $S$ the following conditions are equivalent:

The function F is differentiable in the interior of any face of $S$ .
The regret $D_{F}$ is a Bregman divergence.
The Bregman identity (5) is always satisfied.
For any probability vectors $(t_{1}, t_{2}, \dots, t_{n})$ the sum $\sum t_{i} \cdot D_{F} (s_{i}, s)$ is always minimal when $s = \sum t_{i} \cdot s_{i}$ .

4. Examples

In this section we shall see how regret functions are defined in some applications.

4.1. Information Theory

We recall that a code is uniquely decodable if any finite sequence of input symbols gives a unique sequence of output symbols. It is well-known that a uniquely decodable code satisfies Kraft’s inequality (see [27] and ([28], Theorem 3.8))

\sum_{a \in A} β^{- ℓ (a)} \leq 1

(6)

where $ℓ (a)$ denotes the length of the codeword corresponding to the input symbol $a \in A$ and $β$ denotes the size of the output alphabet $B$. Here the length of a codeword is an integer. If $P = (p_{a})_{a \in A}$ is a probability vector over the input alphabet, then the mean code-length is

\sum_{a \in A} ℓ (a) \cdot p_{a} .

Our goal is to minimize the expected code-length. Here the state space consists of probability distributions over the input alphabet and the actions are code-length functions.

Shannon established the inequality

- \sum_{a \in A} \log_{β} (p_{a}) \cdot p_{a} \leq \min \sum_{a \in A} ℓ (a) \cdot p_{a} \leq - \sum_{a \in A} \log_{β} (p_{a}) \cdot p_{a} + 1 .

It is a combinatorial problem to find the optimal code-length function. In the simplest case with a binary output alphabet the optimal code-length function is determined by the Huffman algorithm.
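Kraft's inequality and Shannon's bounds can be illustrated with the code lengths obtained by rounding $- \log p_{a}$ up to the next integer, where the logarithm is taken to the base given by the size of the output alphabet. These lengths are in general not optimal (they are not the Huffman lengths), but they satisfy Kraft's inequality and their mean lies within one symbol of the entropy, which is what gives the upper bound. The probability vector and the function names below are hypothetical.

```python
import math

def shannon_lengths(p, beta=2):
    """Code lengths ceil(-log_beta p_a); not optimal in general."""
    return [math.ceil(-math.log(pa) / math.log(beta)) for pa in p]

def kraft_sum(lengths, beta=2):
    """Left-hand side of Kraft's inequality (6)."""
    return sum(beta ** (-length) for length in lengths)

def entropy(p, beta=2):
    return -sum(pa * math.log(pa) / math.log(beta) for pa in p)

def mean_length(p, lengths):
    return sum(pa * length for pa, length in zip(p, lengths))

# Hypothetical source distribution over four input letters, binary output alphabet.
p = [0.4, 0.3, 0.2, 0.1]
lengths = shannon_lengths(p)
print(lengths, kraft_sum(lengths))
print(entropy(p), mean_length(p, lengths))
```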

A code-length function dominates another code-length function if all letters have shorter code-length. If a code-length function is not dominated by another code-length function then for all $a \in A$ the length is bounded by $ℓ (a) \leq |A| - 1$. For fixed alphabets $A$ and $B$ there exists only a finite number of code-length functions $ℓ$ that satisfy Kraft’s inequality and are not dominated by other code-length functions that satisfy Kraft’s inequality.

4.2. Scoring Rules

The use of scoring rules has a long history in statistics. An early contribution was the idea of minimizing the sum of square deviations that dates back to Gauss and works perfectly for Gaussian distributions. In the 1920s, Ramsey and de Finetti proved versions of the Dutch book theorem where determination of probability distributions was considered as a dual problem of maximizing a payoff function [29]. Later it was proved that any consistent inference procedure corresponds to optimizing with respect to some payoff function. A more systematic study of scoring rules was given by McCarthy [30].

Consider an experiment with $X = \{1, 2, \dots, ℓ\}$ as sample space. A scoring rule $f$ is defined as a function $X \times M_{1}^{+} (X) \to R$ such that the score is $f (x, Q)$ when a prediction has been given in terms of a probability distribution $Q$ and $x \in X$ has been observed.

has been observed. A scoring rule is proper if for any probability measure

P \in M_{1}^{+} (X)

the score

\sum_{x \in X} P (x) \cdot f (x, Q)

is minimal when

Q = P .

Here the state space consist of probability distributions over

X

and the actions are predictions over

X

, which are also probability distributions over

X

.

There is a correspondence between proper scoring rules and Bregman divergences as explained in [31,32]. If $D_{F}$ is a Bregman divergence and $g$ is a function with domain $X$ then $f$ given by

f (x, Q) = g (x) - D_{F} (δ_{x}, Q)

defines a scoring rule.

Assume that $f$ is a proper scoring rule. Then a function $F$ can be defined as

F (P) = \sum_{x \in X} P (x) \cdot f (x, P) .

This leads to the regret function

D_{F} (P, Q) = F (P) - \sum_{x \in X} P (x) \cdot f (x, Q) .

(7)

Since $f$ is assumed to be proper, $D_{F} (P, Q) \geq 0$. The Bregman identity (5) follows by straightforward calculations. With these two results we see that the regret function $D_{F}$ is a Bregman divergence and that

\begin{aligned} D_{F} (δ_{y}, Q) &= \sum_{x \in X} δ_{y} (x) \cdot f (x, δ_{y}) - \sum_{x \in X} δ_{y} (x) \cdot f (x, Q) \\ &= f (y, δ_{y}) - f (y, Q) . \end{aligned}

(8)

Hence a proper scoring rule $f$ has the form $f (x, Q) = g (x) - D_{F} (δ_{x}, Q)$ where $g (x) = f (x, δ_{x})$. A strictly proper scoring rule can be defined as a proper scoring rule where the corresponding Bregman divergence is strict.

Example 2.

The Brier score is given by

f (x, Q) = \frac{1}{n} (\sum_{y \in X} (Q (y) - δ_{x} (y))^{2}) .

The Brier score is generated by the strictly convex function $F (P) = \frac{1}{n} \sum_{x \in X} P (x)^{2}$.
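The link between the Brier score and a squared-distance Bregman divergence can be checked numerically: the expected Brier score under $P$ of a prediction $Q$ differs from its value at $Q = P$ by $\frac{1}{n} \sum_{x} (P (x) - Q (x))^{2}$. The sketch below assumes that $n$ is the size of the sample space (the text does not define $n$), indexes outcomes from 0, and uses hypothetical distributions.

```python
def brier(x, q):
    """Brier score of prediction q when outcome x (an index) is observed."""
    n = len(q)  # assumption: n is the number of outcomes
    return sum((q[y] - (1.0 if y == x else 0.0)) ** 2 for y in range(n)) / n

def expected_score(p, q):
    """Mean Brier score of prediction q when outcomes follow p."""
    return sum(p[x] * brier(x, q) for x in range(len(p)))

# Hypothetical distributions on a three-point sample space.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]

gap = expected_score(p, q) - expected_score(p, p)
squared_distance = sum((pi - qi) ** 2 for pi, qi in zip(p, q)) / len(p)
print(gap, squared_distance)  # the two numbers agree up to rounding
```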

4.3. Statistical Mechanics

Thermodynamics is the study of concepts like heat, temperature and energy. A major objective is to extract as much energy from a system as possible. The idea in statistical mechanics is to view the macroscopic behavior of a thermodynamic system as a statistical consequence of the interaction between a lot of microscopic components where the interactions between the components are governed by very simple laws. Here the central limit theorem and large deviation theory play a major role. One of the main achievements is the formula for entropy as a logarithm of a probability.

Here we shall restrict the discussion to the simplest kind of thermodynamic system from which we want to extract energy. We may think of a system of non-interacting spin particles in a magnetic field. For such a system the Hamiltonian is given by

\hat{H} (σ) = - μ \sum h_{j} σ_{j}

where $σ$ is the spin configuration, $μ$ is the magnetic moment, $h_{j}$ is the strength of an external magnetic field, and $σ_{j} = \pm 1$ is the spin of the $j$’th particle. If the system is in thermodynamic equilibrium the configuration probability is

P_{β} (σ) = \frac{\exp (- β \hat{H} (σ))}{Z (β)}

where $Z (β)$ is the partition function

Z (β) = \sum_{σ} \exp (- β \hat{H} (σ)) .

Here $β$ is the inverse temperature $(k T)^{- 1}$ of the spin system and $k = 1.381 \times 10^{- 23} \frac{J}{K}$ is Boltzmann’s constant.

The mean energy is given by

\sum_{σ} P_{β} (σ) \hat{H} (σ),

which will be identified with the internal energy $U$ defined in thermodynamics. The Shannon entropy can be calculated as

\begin{aligned} - \sum_{σ} P_{β} (σ) \ln P_{β} (σ) &= - \sum_{σ} P_{β} (σ) \ln \frac{\exp (- β \hat{H} (σ))}{Z (β)} \\ &= - \sum_{σ} P_{β} (σ) (- β \hat{H} (σ) - \ln Z (β)) \\ &= β \cdot U + \ln Z (β) . \end{aligned}

The Shannon entropy times $k$ will be identified with the thermodynamic entropy $S$.

The amount of energy that can be extracted from the system if a heat bath is available is called the exergy [33]. We assume that the heat bath has temperature $T_{0}$ and the internal energy and entropy of the system are $U_{0}$ and $S_{0}$ if the system has been brought in equilibrium with the heat bath. The exergy can be calculated by

\begin{aligned} E x &= U - U_{0} - T_{0} (S - S_{0}) \\ &= U - U_{0} - k T_{0} (β \cdot U + \ln Z (β) - β_{0} U_{0} - \ln Z (β_{0})) \\ &= k T_{0} ((β_{0} - β) \cdot U + \ln \frac{Z (β_{0})}{Z (β)}) . \end{aligned}

The information divergence between the actual state and the corresponding state that is in equilibrium with the environment is

\begin{aligned} D (P_{β}∥ P_{β_{0}}) &= \sum_{σ} P_{β} (σ) \ln \frac{P_{β} (σ)}{P_{β_{0}} (σ)} \\ &= \sum_{σ} P_{β} (σ) \ln \frac{\frac{\exp (- β \hat{H} (σ))}{Z (β)}}{\frac{\exp (- β_{0} \hat{H} (σ))}{Z (β_{0})}} \\ &= \sum_{σ} P_{β} (σ) (- β \hat{H} (σ) + β_{0} \hat{H} (σ) + \ln \frac{Z (β_{0})}{Z (β)}) \\ &= (β_{0} - β) \cdot \sum_{σ} P_{β} (σ) \hat{H} (σ) + \ln \frac{Z (β_{0})}{Z (β)} \\ &= (β_{0} - β) \cdot U + \ln \frac{Z (β_{0})}{Z (β)} . \end{aligned}

Hence

E x = k T_{0} D (P_{β}∥ P_{β_{0}}) .

This equation appeared already in [34].
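The identity between exergy and information divergence can be verified numerically for a small spin system by enumerating all configurations. In the sketch below the products $μ h_{j}$ (in joules) and the two temperatures are hypothetical values chosen so that the energies are comparable to $k T$; the exergy is computed from $U$ and $S$ as in the first line of the derivation and compared with $k T_{0}$ times the divergence.

```python
import itertools
import math

K_B = 1.381e-23                      # Boltzmann's constant in J/K
MU_H = [1.0e-21, 2.0e-21, 0.5e-21]   # hypothetical values of mu * h_j in joules
CONFIGS = list(itertools.product([-1, 1], repeat=len(MU_H)))

def energy(sigma):
    """Hamiltonian H(sigma) = -mu * sum_j h_j * sigma_j."""
    return -sum(m * s for m, s in zip(MU_H, sigma))

def gibbs(beta):
    """Equilibrium distribution P_beta over all spin configurations."""
    weights = [math.exp(-beta * energy(c)) for c in CONFIGS]
    z = sum(weights)  # partition function Z(beta)
    return [w / z for w in weights]

def mean_energy(p):
    return sum(pi * energy(c) for pi, c in zip(p, CONFIGS))

def shannon_entropy(p):
    return -sum(pi * math.log(pi) for pi in p)

def divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

T, T0 = 400.0, 300.0  # hypothetical system and heat-bath temperatures in K
p = gibbs(1.0 / (K_B * T))
p0 = gibbs(1.0 / (K_B * T0))

# Ex = U - U0 - T0 * (S - S0), with S = k * (Shannon entropy).
exergy = (mean_energy(p) - mean_energy(p0)
          - T0 * K_B * (shannon_entropy(p) - shannon_entropy(p0)))
print(exergy, K_B * T0 * divergence(p, p0))
```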

4.4. Portfolio Theory

The relation between information theory and gambling was established by J. L. Kelly [35]. Logarithmic terms appear because we are interested in the exponent in the exponential growth rate of our wealth. Kelly’s approach was later generalized to trading of stocks, although the relation to information theory is weaker [36].

Let $X_{1}, X_{2}, \dots, X_{k}$ denote price relatives for a list of $k$ assets. For instance $X_{5} = 1.04$ means that the 5th asset increases its value by 4%. Such price relatives are mapped into a price relative vector $\vec{X} = (X_{1}, X_{2}, \dots, X_{k})$.

Example 3.

A special asset is the safe asset where the price relative is 1 for any possible price relative vector. Investing in this asset corresponds to placing the money at a safe place with interest rate equal to 0%.

A portfolio is a probability vector $\vec{b} = (b_{1}, b_{2}, \dots, b_{k})$ where for instance $b_{5} = 0.3$ means that 30% of the money is invested in asset no. 5. We note that a portfolio may be traded just like the original assets. The price relative for the portfolio $\vec{b}$ is

X_{1} \cdot b_{1} + X_{2} \cdot b_{2} + \dots + X_{k} \cdot b_{k} = 〈\vec{X}, \vec{b}〉 .

The original assets may be considered as extreme points in the set of portfolios. If an asset has the property that the price relative is only positive for one of the possible price relative vectors, then we may call it a gambling asset.

We now consider a situation where the assets are traded once every day. For a sequence of price relative vectors $\vec{X}_{1}, \vec{X}_{2}, \dots, \vec{X}_{n}$ and a constant re-balancing portfolio $\vec{b}$ the wealth after $n$ days is

S_{n} = \prod_{i = 1}^{n} 〈\vec{X}_{i}, \vec{b}〉

(9)

= \exp (\sum_{i = 1}^{n} \ln 〈\vec{X}_{i}, \vec{b}〉)

(10)

= \exp (n \cdot E [\ln 〈\vec{X}, \vec{b}〉])

(11)

where the expectation is taken with respect to the empirical distribution of the price relative vectors. Here $E [\ln 〈\vec{X}, \vec{b}〉]$ is proportional to the doubling rate and is denoted $W (\vec{b}, P)$ where $P$ indicates the probability distribution of $\vec{X}$. Our goal is to maximize $W (\vec{b}, P)$ by choosing an appropriate portfolio $\vec{b}$.

The advantage of using constant rebalancing portfolios was demonstrated in [37].
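Equations (9) to (11) say that the wealth of a constant re-balancing portfolio is the exponential of $n$ times the mean logarithmic return under the empirical distribution. A minimal sketch with hypothetical price relative vectors (the third asset plays the role of a safe asset) and a hypothetical portfolio:

```python
import math

def inner(x, b):
    """Price relative <X, b> of the portfolio b for one trading day."""
    return sum(xi * bi for xi, bi in zip(x, b))

def wealth(price_relatives, b):
    """S_n as the product in Equation (9), starting from wealth 1."""
    s = 1.0
    for x in price_relatives:
        s *= inner(x, b)
    return s

def mean_log_return(price_relatives, b):
    """E[ln <X, b>] with respect to the empirical distribution."""
    return sum(math.log(inner(x, b)) for x in price_relatives) / len(price_relatives)

# Hypothetical data: three days, three assets.
price_relatives = [(1.04, 0.98, 1.0), (0.97, 1.05, 1.0), (1.10, 0.95, 1.0)]
b = (0.3, 0.5, 0.2)

s_n = wealth(price_relatives, b)
w = mean_log_return(price_relatives, b)
print(s_n, math.exp(len(price_relatives) * w))  # Equation (9) versus Equation (11)
```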

Definition 3.

Let $\vec{b}_{1}$ and $\vec{b}_{2}$ denote two portfolios. We say that $\vec{b}_{1}$ dominates $\vec{b}_{2}$ if $〈\vec{X}_{j}, \vec{b}_{1}〉 \geq 〈\vec{X}_{j}, \vec{b}_{2}〉$ for any possible price relative vector $\vec{X}_{j}$, $j = 1, 2, \dots, n$. We say that $\vec{b}_{1}$ strictly dominates $\vec{b}_{2}$ if $〈\vec{X}_{j}, \vec{b}_{1}〉 > 〈\vec{X}_{j}, \vec{b}_{2}〉$ for any possible price relative vector $\vec{X}_{j}$, $j = 1, 2, \dots, n$. A set $A$ of assets is said to dominate the set of assets $B$ if any asset in $B$ is dominated by a portfolio of assets in $A$.

The maximal doubling rate does not change if dominated assets are removed. Sometimes assets that are dominated but not strictly dominated may lead to non-uniqueness of the optimal portfolio.

Let $\vec{b}_{P}$ denote a portfolio that is optimal for $P$ and define

G (P) = W (\vec{b}_{P}, P) .

(12)

The regret of choosing a portfolio that is optimal for $Q$ when the distribution is $P$ is given by the regret function

D_{G} (P, Q) = W (\vec{b}_{P}, P) - W (\vec{b}_{Q}, P) .

(13)

If $\vec{b}_{Q}$ is not uniquely determined we take a minimum over all $\vec{b}$ that are optimal for $Q$.

Example 4.

Assume that the price relative vector is $(2, \frac{1}{2})$ with probability $1 - t$ and $(\frac{1}{2}, 2)$ with probability $t$. Then the portfolio concentrated on the first asset is optimal for $t \leq \frac{1}{5}$ and the portfolio concentrated on the second asset is optimal for $t \geq \frac{4}{5}$. For values of $t$ between $\frac{1}{5}$ and $\frac{4}{5}$ the optimal portfolio invests money on both assets as illustrated in Figure 2.
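The thresholds in Example 4 can be checked numerically. Writing the portfolio as $(b, 1 - b)$, setting the derivative of the expected logarithmic return to zero gives $b = (4 - 5 t) / 3$, which is clipped to $[0, 1]$; this closed form is derived here for the illustration and is not stated in the text. The sketch compares it with a simple ternary search, which is applicable because the objective is concave in $b$. Function names are hypothetical.

```python
import math

def doubling_rate(b, t):
    """Expected log return of the portfolio (b, 1 - b) in Example 4 (natural log)."""
    first = 2.0 * b + 0.5 * (1.0 - b)   # price relative under (2, 1/2)
    second = 0.5 * b + 2.0 * (1.0 - b)  # price relative under (1/2, 2)
    return (1.0 - t) * math.log(first) + t * math.log(second)

def optimal_fraction(t, iterations=200):
    """Ternary search for the maximizer of the concave function b -> W(b, t)."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if doubling_rate(m1, t) < doubling_rate(m2, t):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)

def closed_form(t):
    """Stationary point (4 - 5t) / 3 clipped to the interval [0, 1]."""
    return min(1.0, max(0.0, (4.0 - 5.0 * t) / 3.0))

for t in (0.1, 0.35, 0.5, 0.9):
    print(t, optimal_fraction(t), closed_form(t))
```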

Lemma 1.

If there are only two price relative vectors and the regret function is strict then either one of the assets dominates all other assets or two of the assets are orthogonal gambling assets that dominate all other assets.

Proof.

We will assume that no assets are dominated by other assets. Let

\begin{matrix} \vec{X} & = & (X_{1}, X_{2}, \dots, X_{k}) \\ \vec{Y} & = & (Y_{1}, Y_{2}, \dots, Y_{k}) \end{matrix}

denote the two price relative vectors. Without loss of generality we may assume that

\frac{X_{1}}{Y_{1}} \geq \frac{X_{2}}{Y_{2}} \geq \dots \geq \frac{X_{k}}{Y_{k}} .

If

\frac{X_{i}}{Y_{i}} = \frac{X_{i + 1}}{Y_{i + 1}}

then

\frac{X_{i}}{X_{i + 1}} = \frac{Y_{i}}{Y_{i + 1}}

so that if

X_{i} \leq X_{i + 1}

then

Y_{i} \leq Y_{i + 1}

and the asset i is dominated by the asset

i + 1 .

Since we have assumed that no assets are dominated we may assume that

\frac{X_{1}}{Y_{1}} > \frac{X_{2}}{Y_{2}} > \dots > \frac{X_{k}}{Y_{k}} .

If $P = (1 - t, t)$ is a probability vector over the two price relative vectors then according to [36] the portfolio $\vec{b} = (b_{1}, b_{2}, \dots, b_{k})$ is optimal if and only if

(1 - t) \frac{X_{i}}{b_{1} X_{1} + \dots + b_{k} X_{k}} + t \frac{Y_{i}}{b_{1} Y_{1} + \dots + b_{k} Y_{k}} \leq 1

for all $i \in \{1, 2, \dots, k\}$ with equality if $b_{i} > 0$.

Assume that the portfolio

\vec{b} = δ_{j}

is optimal. Now

(1 - t) \frac{X_{j + 1}}{X_{j}} + t \frac{Y_{j + 1}}{Y_{j}} \leq 1

is equivalent to

t \leq \frac{\frac{X_{j}}{Y_{j + 1}} - \frac{X_{j + 1}}{Y_{j + 1}}}{\frac{X_{j}}{Y_{j}} - \frac{X_{j + 1}}{Y_{j + 1}}} .

(14)

Similarly

(1 - t) \frac{X_{j - 1}}{X_{j}} + t \frac{Y_{j - 1}}{Y_{j}} \leq 1

is equivalent to

t \geq \frac{\frac{X_{j}}{Y_{j - 1}} - \frac{X_{j - 1}}{Y_{j - 1}}}{\frac{X_{j}}{Y_{j}} - \frac{X_{j - 1}}{Y_{j - 1}}} .

(15)

We have to check that

\frac{\frac{X_{j}}{Y_{j - 1}} - \frac{X_{j - 1}}{Y_{j - 1}}}{\frac{X_{j}}{Y_{j}} - \frac{X_{j - 1}}{Y_{j - 1}}} < \frac{\frac{X_{j}}{Y_{j + 1}} - \frac{X_{j + 1}}{Y_{j + 1}}}{\frac{X_{j}}{Y_{j}} - \frac{X_{j + 1}}{Y_{j + 1}}},

which is equivalent with

0 < X_{j} Y_{j - 1} - Y_{j - 1} X_{j + 1} - Y_{j} X_{j - 1} - (X_{j} Y_{j + 1} - Y_{j + 1} X_{j - 1} - Y_{j} X_{j + 1}) .

The right hand side equals the determinant

|\begin{matrix} X_{j + 1} - X_{j - 1} & X_{j} - X_{j - 1} \\ Y_{j + 1} - Y_{j - 1} & Y_{j} - Y_{j - 1} \end{matrix}|,

which is positive because asset j is not dominated by a portfolio based on asset

j - 1

and asset

j + 1 .

We see that the portfolio concentrated in asset j is optimal for t in an interval of positive length and the regret between distributions in such an interval will be zero. In particular the regret will not be strict.

Strictness of the regret function is only possible if there are only two assets and if a portfolio concentrated on one of these assets is only optimal for a singular probability measure. According to the formulas for the end points of intervals (14) and (15) this is only possible if the assets are gambling assets. ☐

Theorem 1.

If the regret function is strict it equals information divergence, i.e.,

D_{G} (P, Q) = D (P ∥ Q) .

(16)

Proof.

If the regret function is strict then it is also strict when we restrict to two price relative vectors. Therefore any two price relative vectors are orthogonal gambling assets. If the assets are orthogonal gambling assets we get the type of gambling described by Kelly [35], for which the equations can easily be derived [36]. ☐

5. Sufficiency Conditions

In this section we will introduce some conditions on a regret function. Under some mild conditions they turn out to be equivalent.

Theorem 2.

Let $D_{F}$ denote a regret function based on a continuous and convex function $F$ defined on the state space of a finite dimensional $C^{*}$-algebra. If the state space has at least three orthogonal states then the following conditions are equivalent:

The function F equals entropy times a negative constant plus an affine function.
The regret $D_{F}$ is proportional to information divergence.
The regret is monotone.
The regret satisfies sufficiency.
The regret is local.

In the rest of this section we will describe each of these equivalent conditions and prove that they are actually equivalent. The theorems and proofs will be stated so that they hold even for more general state spaces than the ones considered in this paper.

5.1. Entropy and Information Divergence

Definition 4.

Let $s$ denote an element in a state space. The entropy of $s$ is defined as

H (s) = \inf (- \sum_{i = 1}^{n} λ_{i} \ln (λ_{i}))

where the infimum is taken over all decompositions $s = \sum_{i = 1}^{n} λ_{i} s_{i}$ of $s$ into pure states $s_{i}$.

This definition of the entropy of a state was first given by Uhlmann [38]. Using the fact that entropy is decreasing under majorization we see that the entropy of s is attained at an orthogonal decomposition [13] and we obtain the familiar equation

H (s) = - tr [s ln (s)] .

In general this definition of entropy does not provide a concave function on a convex set. For instance, the entropy of points in the square has local maxima at four different points. A characterization of the convex sets with concave entropy functions is lacking.

Definition 5.

If the entropy is a concave function then the regret function

D_{- H}

is called information divergence.

The information divergence is also called Kullback–Leibler divergence, relative entropy or quantum relative entropy. In a C*-algebra we get

\begin{matrix} D_{- H} (s_{1}, s_{2}) & = - H (s_{1}) - (- H (s_{2}) + 〈s_{1} - s_{2}, - \nabla H (s_{2})〉) \\ = H (s_{2}) - H (s_{1}) + 〈s_{1} - s_{2}, \nabla H (s_{2})〉 \\ = tr [f (s_{2})] - tr [f (s_{1})] + tr [(s_{1} - s_{2}) f^{'} (s_{2})] \\ = tr [f (s_{2}) - f (s_{1}) + (s_{1} - s_{2}) f^{'} (s_{2})] \end{matrix}

where

f (x) = - x ln (x) .

Now

f^{'} (x) = - ln (x) - 1

so that

\begin{matrix} f (s_{2}) - f (s_{1}) + (s_{1} - s_{2}) f^{'} (s_{2}) & = - s_{2} ln (s_{2}) + s_{1} ln (s_{1}) + (s_{1} - s_{2}) (- ln (s_{2}) - 1) \\ = s_{1} (ln (s_{1}) - ln (s_{2})) + s_{2} - s_{1} . \end{matrix}

Hence

D_{- H} (s_{1}, s_{2}) = tr [s_{1} (ln (s_{1}) - ln (s_{2})) + s_{2} - s_{1}] .

For states

s_{1}, s_{2}

it reduces to the well-known formula

D_{- H} (s_{1}, s_{2}) = tr [s_{1} ln (s_{1}) - s_{1} ln (s_{2})] .

5.2. Monotonicity

We consider a set $T$ of maps of the state space into itself. The set $T$ will be used to represent those transformations that we are able to perform on the state space before we choose a feasible action $a \in A$. Let $Φ : S \to S$ denote a map. Then the dual map $Φ^{*}$ maps actions into actions and is given by

〈a, Φ (s)〉 = 〈Φ^{*} (a), s〉 .

Proposition 5 (The principle of lost opportunities).

If

Φ^{*}

maps the set of feasible actions

A

into itself then

F (Φ (s)) \leq F (s) .

(17)

Proof.

If $a \in A$ then

\begin{aligned} 〈a, Φ (s)〉 &= 〈Φ^{*} (a), s〉 \\ &\leq F (s) \end{aligned}

because $Φ^{*} (a) \in A$. Inequality (17) follows because $F (Φ (s)) = \sup_{a} 〈a, Φ (s)〉$. ☐

Corollary 1 (Semi-monotonicity).

Let Φ denote a map of the state space into itself such that

Φ^{*}

maps the set of feasible actions

A

into itself and let

s_{2}

denote a state that minimizes the function F. If

D_{F}

is a Bregman divergence then

D_{F} (Φ (s_{1}), Φ (s_{2})) \leq D_{F} (s_{1}, s_{2}) .

(18)

Proof.

Since

s_{2}

minimizes F and F is differentiable we have

\nabla F (s_{2}) = 0

. Since

s_{2}

minimizes F and

F (Φ (s_{2})) \leq F (s_{2})

we also have that

Φ (s_{2})

minimizes F and that

\nabla F (Φ (s_{2})) = 0

. Therefore

\begin{matrix} D_{F} (Φ (s_{1}), Φ (s_{2})) & = F (Φ (s_{1})) - (F (Φ (s_{2})) + 〈Φ (s_{1}) - Φ (s_{2}), \nabla F (Φ (s_{2}))〉) \\ = F (Φ (s_{1})) - F (Φ (s_{2})) \\ \leq F (s_{1}) - F (s_{2}) \\ = D_{F} (s_{1}, s_{2}), \end{matrix}

which proves the inequality. ☐

Next we introduce the stronger notion of monotonicity.

Definition 6.

Let

D_{F}

denote a regret function on the state space

S

of a finite dimensional C*-algebra. Then

D_{F}

is said to be monotone if

D_{F} (Φ (s_{1}), Φ (s_{2})) \leq D_{F} (s_{1}, s_{2})

for any affine map

Φ : S \to S .

Proposition 6.

If a regret function

D_{F}

based on a convex and continuous function F is monotone then it is a Bregman divergence.

Proof.

Assume that

D_{F}

is monotone. We have to prove that F is differentiable. Since F is convex it is sufficient to prove that any restriction of F to a line segment is differentiable. Let

s_{0}

and

s_{1}

denote states that are the end points of a line segment. The restriction of F to the line segment is given by the convex and continuous function

f (t) = F ((1 - t) s_{0} + t s_{1})

so we have to prove that f is differentiable.

If

0 < t_{1} < t_{2} < 1

then according to Equation (2) we have

D_{F} ((1 - t_{2}) s_{0} + t_{2} s_{1}, (1 - t_{1}) s_{0} + t_{1} s_{1}) = f (t_{2}) - (f (t_{1}) + (t_{2} - t_{1}) \cdot f_{+}^{'} (t_{1}))

where

f_{+}^{'}

denotes the derivative from the right. A dilation by a factor

r \leq 1

around

s_{0}

decreases the regret so that

r \to f (r \cdot t_{2}) - (f (r \cdot t_{1}) + r \cdot (t_{2} - t_{1}) \cdot f_{+}^{'} (r \cdot t_{1}))

(19)

is increasing. Since f is convex the function

r \to f_{+}^{'} (r \cdot t_{1})

is increasing. Assume that f is not differentiable so that

r \to f_{+}^{'} (r \cdot t_{1})

has a positive jump as illustrated on Figure 3.

This contradicts that the function (19) is increasing. Therefore

f_{+}^{'}

is continuous and f is differentiable. ☐

Recently it has been proved that information divergence on a complex Hilbert space is decreasing under positive trace preserving maps [39,40]. Previously this was only known to hold if some extra condition like complete positivity or 2-positivity was assumed [41].

Theorem 3.

Information divergence is monotone under any positive trace preserving map on the states of a finite dimensional

C^{*}

-algebra.

Proof.

Any finite dimensional

C^{*}

-algebra

B

can be embedded in

B (H)

and there exists a conditional expectation

E : B (H) \to B .

If

Φ

is a positive trace preserving map of the density matrices of

B

into itself then

Φ \circ E

is positive and trace preserving on

B (H) .

According to Müller-Hermes and Reeb [39] we have

D (Φ \circ E (s_{1})∥ Φ \circ E (s_{2})) \leq D (s_{1}∥ s_{2})

for density matrices in

B (H) .

In particular this inequality holds for density matrices in

B

and for such matrices we have

E (s_{i}) = s_{i}

. ☐
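
In the classical (commutative) case a positive trace preserving map is a column-stochastic matrix, and Theorem 3 reduces to the data processing inequality. The sketch below checks this numerically for one such matrix; the matrix `W`, the test distributions and the helper names are assumptions chosen for illustration.

```python
import math

def kl(p, q):
    """Information divergence D(P||Q) in nats, with 0 ln 0 = 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def apply_channel(W, p):
    """Apply the matrix W to the probability vector p; W[j][i] = P(output j | input i)."""
    return [sum(w * x for w, x in zip(row, p)) for row in W]

# A column-stochastic 3x3 matrix (each column sums to 1), mapping the simplex into itself.
W = [
    [0.9, 0.2, 0.0],
    [0.1, 0.7, 0.3],
    [0.0, 0.1, 0.7],
]

pairs = [
    ([0.5, 0.3, 0.2], [0.2, 0.2, 0.6]),
    ([0.1, 0.1, 0.8], [0.6, 0.3, 0.1]),
    ([0.3, 0.4, 0.3], [0.25, 0.5, 0.25]),
]

# For every pair, compare the divergence before and after the map.
results = [(kl(p, q), kl(apply_channel(W, p), apply_channel(W, q))) for p, q in pairs]
```

Each entry of `results` holds the divergence before and after the map; the second number should never exceed the first.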

5.3. Sufficiency

The notion of sufficiency plays an important role in statistics and related fields. We shall present a definition of sufficiency that is based on [42], but there are a number of other equivalent ways of defining this concept. We refer to [43] where the notion of sufficiency is discussed in great detail.

Definition 7.

Let

{(s_{θ})}_{θ}

denote a family of states and let Φ denote an affine map

S \to T

where

S

and

T

denote state spaces. A recovery map is an affine map

Ψ : T \to S

such that

Ψ (Φ (s_{θ})) = s_{θ} .

The map Φ is said to be sufficient for

{(s_{θ})}_{θ}

if Φ has a recovery map.

Proposition 7.

Assume

D_{F}

is a regret function based on a convex and continuous function F and assume that Φ is sufficient for

s_{1}

and

s_{2}

with recovery map Ψ. Assume that both

Φ^{*}

and

Ψ^{*}

map the set of feasible actions

A

into itself. Then

D_{F} (Φ (s_{1}), Φ (s_{2})) = D_{F} (s_{1}, s_{2}) .

Proof.

According to the principle of lost opportunities (Proposition 5) we have

\begin{matrix} F (s_{2}) & = F (Ψ (Φ (s_{2}))) \\ \leq F (Φ (s_{2})) \\ \leq F (s_{2}) . \end{matrix}

Therefore

F (Φ (s_{2})) = F (s_{2}) .

Let a denote an action that is optimal for

s_{2} .

Then

\begin{matrix} F (Φ (s_{2})) & = F (s_{2}) \\ = 〈a, s_{2}〉 \\ = 〈a, Ψ (Φ (s_{2}))〉 \\ = 〈Ψ^{*} (a), Φ (s_{2})〉 \end{matrix}

and we see that

Ψ^{*} (a)

is optimal for

Φ (s_{2}) .

Now

\begin{matrix} D_{F} (s_{1}, s_{2}) & = inf_{a} (F (s_{1}) - 〈a, s_{1}〉) \\ = inf_{a} (F (s_{1}) - 〈Ψ^{*} (a), Φ (s_{1})〉) \end{matrix}

where the infimum is taken over actions a that are optimal for

s_{2} .

Then

\begin{matrix} inf_{a} (F (s_{1}) - 〈Ψ^{*} (a), Φ (s_{1})〉) & \geq inf_{\tilde{a}} (F (Φ (s_{1})) - 〈\tilde{a}, Φ (s_{1})〉) \\ = D_{F} (Φ (s_{1}), Φ (s_{2})) \end{matrix}

so we have

D_{F} (s_{1}, s_{2}) \geq D_{F} (Φ (s_{1}), Φ (s_{2})) .

The reverse inequality is proved in the same way. ☐

The notion of sufficiency as a property of divergences was introduced in [44]. The crucial idea of restricting the attention to maps of the state space into itself was introduced in [45]. It was shown in [45] that a Bregman divergence on the simplex of distributions on an alphabet that is not binary and satisfies sufficiency equals information divergence up to a multiplicative factor. Here we extend the notion of sufficiency from Bregman divergences to regret functions.

Definition 8.

Let

D_{F}

denote a regret function based on a convex and continuous function F on a state space

S

. We say

D_{F}

satisfies sufficiency if

D_{F} (Φ (s_{1}), Φ (s_{2})) = D_{F} (s_{1}, s_{2})

for any affine map

S \to S

that is sufficient for

(s_{1}, s_{2}) .

Proposition 8.

Let

D_{F}

denote a regret function based on a convex and continuous function F on a state space

S

. If the regret function

D_{F}

is monotone then it satisfies sufficiency.

Proof.

Assume that the regret function

D_{F}

is monotone. Let

s_{1}

and

s_{2}

denote two states and let

Φ

and

Ψ

denote maps on the state space such that

Φ (Ψ (s_{i})) = s_{i}, i = 1, 2

. Then

\begin{matrix} D_{F} (s_{1}, s_{2}) & = D_{F} (Φ (Ψ (s_{1})), Φ (Ψ (s_{2}))) \\ \leq D_{F} (Ψ (s_{1}), Ψ (s_{2})) \\ \leq D_{F} (s_{1}, s_{2}) . \end{matrix}

Hence

D_{F} (Ψ (s_{1}), Ψ (s_{2})) = D_{F} (s_{1}, s_{2}) .

☐

Combining the previous results we get that information divergence satisfies sufficiency. Under some conditions there exists an inverse version of Proposition 8 stating that if monotonicity holds with equality then the map is sufficient. In statistics where the state space is a simplex, this result is well established. For density matrices over the complex numbers it has been proved for completely positive maps in [43]. Some new results on this topic can be found in [46].

5.4. Locality

Often it is relevant to use the following weak version of the sufficiency property.

Definition 9.

Let

D_{F}

denote a regret function based on a convex and continuous function F on a state space

S

. The regret function

D_{F}

is said to be local if

D_{F} (s_{1}, t \cdot s_{1} + (1 - t) \cdot σ) = D_{F} (s_{1}, t \cdot s_{1} + (1 - t) \cdot ρ)

when the states σ and ρ are orthogonal to

s_{1}

and

t \in]0, 1[.

Example 5.

On a 1-dimensional simplex (an interval) or on the Bloch sphere any regret function

D_{F}

is local. The reason is that if σ and ρ are states that are orthogonal to

s_{1}

then

σ = ρ .

Proposition 9.

Let

D_{F}

denote a regret function based on a convex and continuous function F on a state space

S

. If the regret function

D_{F}

satisfies sufficiency then

D_{F}

is local.

Proof.

Let

σ

and

ρ

be states that are orthogonal to

s_{1} .

Let p denote the projection supporting the state

s_{1}

. Let the maps

Φ

and

Ψ

be defined by

\begin{matrix} Φ (s) & = tr (p s) \cdot s_{1} + (1 - tr (p s)) \cdot ρ, \\ Ψ (s) & = tr (p s) \cdot s_{1} + (1 - tr (p s)) \cdot σ . \end{matrix}

Then

Φ (s_{1}) = Ψ (s_{1}) = s_{1}

and

Φ (σ) = ρ

and

Ψ (ρ) = σ .

Therefore

\begin{matrix} Φ (t \cdot s_{1} + (1 - t) \cdot σ) & = t \cdot s_{1} + (1 - t) \cdot ρ \\ Ψ (t \cdot s_{1} + (1 - t) \cdot ρ) & = t \cdot s_{1} + (1 - t) \cdot σ \end{matrix}

and

D_{F} (s_{1}, t \cdot s_{1} + (1 - t) \cdot σ) = D_{F} (s_{1}, t \cdot s_{1} + (1 - t) \cdot ρ),

which proves the Proposition. ☐
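
Locality can be illustrated on a simplex, where two distributions are orthogonal when they have disjoint supports. The following sketch, with hypothetical helper names and arbitrarily chosen test vectors, compares information divergence with squared Euclidean distance (the Bregman divergence generated by the squared norm): the former does not depend on the choice of the orthogonal state, the latter does.

```python
import math

def kl(p, q):
    """Information divergence D(P||Q) in nats, with 0 ln 0 = 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def squared_euclidean(p, q):
    """Bregman divergence generated by F(x) = ||x||^2."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mix(t, s, other):
    """The mixture t * s + (1 - t) * other."""
    return [t * a + (1 - t) * b for a, b in zip(s, other)]

s1 = [0.6, 0.4, 0.0, 0.0]      # supported on the first two points
sigma = [0.0, 0.0, 1.0, 0.0]   # orthogonal to s1
rho = [0.0, 0.0, 0.3, 0.7]     # also orthogonal to s1
t = 0.25

kl_sigma = kl(s1, mix(t, s1, sigma))
kl_rho = kl(s1, mix(t, s1, rho))
sq_sigma = squared_euclidean(s1, mix(t, s1, sigma))
sq_rho = squared_euclidean(s1, mix(t, s1, rho))
```

In this example both values of the information divergence equal -ln(t), whereas the two squared distances differ.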

Theorem 4.

Let

S

be the state space of a

C^{*}

-algebra with at least three orthogonal states, and let

D_{F}

denote a regret function based on a convex and continuous function F on the state space

S

. If the regret function

D_{F}

is local then it is the Bregman divergence generated by the entropy times a negative constant.

Proof.

In the following proof we will assume that the regret function is based on the convex function

F : S \to R .

First we will prove that the regret function is a Bregman divergence.

Let K denote the convex hull of a set

s_{0}, s_{1}, \dots s_{n}

of orthogonal states. For

x \in [0, 1]

let

g_{i}

denote the function

g_{i} (x) = D_{F} (s_{i}, x s_{i} + (1 - x) s_{i + 1})

. Note that

g_{i}

is decreasing and continuous from the left. Let

P = \sum p_{i} s_{i}

and

Q = \sum q_{i} s_{i}

where

p_{i}, q_{i} \in]0, 1[

for all

i = 0, 1, 2, \dots n

. If F is differentiable in P then locality implies that

\begin{matrix} D_{F} (P, Q) & = & \sum p_{i} D_{F} (s_{i}, Q) - \sum p_{i} D_{F} (s_{i}, P) \\ = & \sum p_{i} g_{i} (q_{i}) - \sum p_{i} g_{i} (p_{i}) \\ = & \sum p_{i} (g_{i} (q_{i}) - g_{i} (p_{i})) . \end{matrix}

Note that

P \to D_{F} (P, Q)

is a convex function and thereby it is continuous. Assume that

P_{0}

is an arbitrary element in K and let

{(P_{n})}_{n \in N}

denote a sequence such that

P_{n} \to P_{0}

for

n \to \infty .

The sequence

{(P_{n})}_{n \in N}

can be chosen so that regret is differentiable in

P_{n}

for all

n \in N .

Further the sequence

P_{n}

can be chosen such that

p_{n, i}

is increasing for all

i \neq j .

Then

D_{F} (P_{0}, Q) = \sum p_{0, i} (g_{i} (q_{i}) - g_{i} (p_{0, i})) + p_{0, j} g_{j} (p_{0, j}) - p_{0, j} lim_{n \to \infty} g_{j} (p_{n, j}) .

Similarly, if the sequence

P_{n}

can be chosen such that

p_{n, i}

is increasing for all

i \neq j, j + 1

then

\begin{matrix} D_{F} (P_{0}, Q) = \sum p_{0, i} (g_{i} (q_{i}) - g_{i} (p_{0, i})) + p_{0, j} g_{j} (p_{0, j}) - p_{0, j} lim_{n \to \infty} g_{j} (p_{n, j}) \\ + p_{0, j + 1} g_{j + 1} (p_{0, j + 1}) - p_{0, j + 1} lim_{n \to \infty} g_{j + 1} (p_{n, j + 1}), \end{matrix}

which implies that

p_{0, j + 1} g_{j + 1} (p_{0, j + 1}) - p_{0, j + 1} {lim}_{n \to \infty} g_{j + 1} (p_{n, j + 1}) = 0

and that

lim_{n \to \infty} g_{j + 1} (p_{n, j + 1}) = g_{j + 1} (p_{0, j + 1})

for all j. Therefore

D_{F} (P_{0}, Q) = \sum p_{0, i} (g_{i} (q_{i}) - g_{i} (p_{0, i}))

(20)

for all

P_{0}, Q

in the interior of K. In the following calculations we will assume that the distributions lie in the interior of K. The validity of the Bregman identity (5) follows directly from Equation (20) implying that

D_{F}

is a Bregman divergence.

As a function of Q the regret is minimal when

Q = P .

In the following calculations we write

x = p_{i}

,

z = p_{j}

,

y = q_{i}

, and

w = q_{j}

. If

p_{ℓ} = q_{ℓ}

for

ℓ \neq i, j

then non-negativity of regret can be written as

x (g_{i} (y) - g_{i} (x)) + z (g_{j} (w) - g_{j} (z)) \geq 0

and we note that this inequality should hold as long as

x + z = y + w \leq 1 .

Permutation of i and j leads to the inequality

x (g_{j} (y) - g_{j} (x)) + z (g_{i} (w) - g_{i} (z)) \geq 0

that implies

x (g_{i j} (y) - g_{i j} (x)) + z (g_{i j} (w) - g_{i j} (z)) \geq 0

(21)

where

g_{i j} = \frac{g_{i} + g_{j}}{2} .

Assume that

x = z = \frac{y + w}{2}

in Inequality (21). Then

\begin{matrix} x (g_{i j} (y) - g_{i j} (x)) + x (g_{i j} (w) - g_{i j} (x)) & \geq & 0 \\ g_{i j} (y) - g_{i j} (x) + g_{i j} (w) - g_{i j} (x) & \geq & 0 \\ \frac{g_{i j} (y) + g_{i j} (w)}{2} & \geq & g_{i j} (x) \end{matrix}

so that

g_{i j}

is mid-point convex, which for a measurable function implies convexity. Therefore

g_{i j}

is differentiable from left and right.

If

y = w

and

x = y + ϵ

and

z = y - ϵ

then we have

(y + ϵ) (g_{i j} (y) - g_{i j} (y + ϵ)) + (y - ϵ) (g_{i j} (y) - g_{i j} (y - ϵ)) \geq 0

with equality when

ϵ = 0 .

We differentiate with respect to

ϵ

from right.

(g_{i j} (y) - g_{i j} (y + ϵ)) + (y + ϵ) (- g_{i j +}^{'} (y + ϵ)) - (g_{i j} (y) - g_{i j} (y - ϵ)) + (y - ϵ) (g_{i j -}^{'} (y - ϵ)),

which is positive for

ϵ = 0

so that

\begin{matrix} - y \cdot g_{i j +}^{'} (y) + y \cdot g_{i j -}^{'} (y) & \geq & 0 \end{matrix}

(22)

\begin{matrix} y \cdot g_{i j -}^{'} (y) & \geq & y \cdot g_{i j +}^{'} (y) . \end{matrix}

(23)

Since

g_{i j}

is convex we have

g_{i j -}^{'} (y) \leq g_{i j +}^{'} (y)

which in combination with Inequality (23) implies that

g_{i j -}^{'} (y) = g_{i j +}^{'} (y)

so that

g_{i j}

is differentiable. Since

g_{i} = g_{i j} + g_{i k} - g_{j k}

the function

g_{i}

is also differentiable.

As a function of Q the Bregman divergence

D_{F} (P, Q)

has a minimum at

Q = P

under the condition

\sum q_{i} = 1

. Since the functions

g_{i}

are differentiable we can characterize this minimum using Lagrange multipliers. We have

\frac{\partial}{\partial q_{i}} D_{F} (P, Q) = p_{i} g_{i}^{'} (q_{i})

and

\frac{\partial}{\partial q_{i}} D_{F} {(P, Q)}_{∣ Q = P} = p_{i} \cdot g_{i}^{'} (p_{i}) .

Further

\frac{\partial}{\partial q_{i}} \sum q_{i} = 1

so there exists a constant

c_{K}

such that

p_{i} \cdot g_{i}^{'} (p_{i}) = c_{K} .

Hence

g_{i}^{'} (p_{i}) = \frac{c_{K}}{p_{i}}

so that

g_{i} (p_{i}) = c_{K} \cdot ln (p_{i}) + m_{i}

for some constant

m_{i} .

Now we get

\begin{matrix} D_{F} (P, Q) & = & \sum p_{i} (g_{i} (q_{i}) - g_{i} (p_{i})) \\ = & \sum p_{i} ((c_{K} \cdot ln (q_{i}) + m_{i}) - (c_{K} \cdot ln (p_{i}) + m_{i})) \\ = & - c_{K} \cdot \sum p_{i} ln \frac{p_{i}}{q_{i}} \\ = & D_{- c_{K} \cdot H} (P, Q) . \end{matrix}

Therefore there exists an affine function g_{K}, determined by K, such that

F_{∣ K} (P) = - c_{K} \cdot H_{∣ K} (P) + g_{K}

(24)

for all P in the interior of K. Since

H_{∣ K}

is continuous on K Equation (24) holds for any

P \in K

. If each of the sets K and L is a simplex and

x \in K \cap L

then

- c_{K} \cdot H_{∣ K} (x) + g_{K} (x) = - c_{L} \cdot H_{∣ L} (x) + g_{L} (x)

so that

(c_{L} - c_{K}) \cdot H_{∣ K} (x) = g_{L} (x) - g_{K} (x) .

If

K \cap L

has dimension greater than zero then the right hand side is affine so the left hand side is affine, which is only possible when

c_{K} = c_{L} .

Therefore we also have

g_{L} (x) = g_{K} (x)

for all

x \in K \cap L .

Therefore the functions

g_{K}

can be extended to a single affine function on the whole of

S .

☐

6. Applications

6.1. Information Theory

If only integer values of a code-length function ℓ are allowed then there are only finitely many actions that are not dominated. Therefore the function F given by

F (P) = - min_{ℓ} \sum ℓ (a) \cdot p_{a}

is piece-wise linear. In particular F is not differentiable so that the regret is not a Bregman divergence and cannot be monotone according to Proposition 6. In information theory monotonicity of a divergence function is closely related to the data processing inequality and since the data processing inequality is one of the most important tools for deriving inequalities in information theory we need to modify our notion of code-length function in order to achieve a data processing inequality.

We now formulate a version of Kraft’s inequality that allows the code length function to be non-integer valued.

Theorem 5.

Let

ℓ : A \to R

be a function. Then the function ℓ satisfies Kraft’s inequality (6) if and only if for all

ε > 0

there exists an integer n and a uniquely decodable fixed-to-variable length block code

κ : A^{n} \to B^{*}

such that

|{\bar{ℓ}}_{κ} (a^{n}) - \frac{1}{n} \sum_{i = 1}^{n} ℓ (a_{i})| \leq ε

where

{\bar{ℓ}}_{κ} (a^{n})

denotes the length

ℓ_{κ} (a^{n})

divided by

n .

The uniquely decodable block code can be chosen to be prefix free.

Proof.

Assume that ℓ satisfies Kraft’s inequality. Then

\sum_{a_{1} a_{2} \dots a_{n} \in A^{n}} β^{- \sum_{i = 1}^{n} ℓ (a_{i})} = {(\sum_{a \in A} β^{- ℓ (a)})}^{n} \leq 1^{n} = 1 .

Therefore the function

\tilde{ℓ} : A^{n} \to N

given by

\tilde{ℓ} (a_{1} a_{2} \dots a_{n}) = ⌈\sum_{i = 1}^{n} ℓ (a_{i})⌉

is integer valued and satisfies Kraft’s inequality (6) and there exists a prefix-free code

κ : A^{n} \to {\{0, 1\}}^{*}

such that

ℓ_{κ} (a_{1} a_{2} \dots a_{n}) = \tilde{ℓ} (a_{1} a_{2} \dots a_{n}) .

Therefore

|{\bar{ℓ}}_{κ} (a_{1} a_{2} \dots a_{n}) - \frac{1}{n} \sum_{i = 1}^{n} ℓ (a_{i})| = \frac{1}{n} |⌈\sum_{i = 1}^{n} ℓ (a_{i})⌉ - \sum_{i = 1}^{n} ℓ (a_{i})| \leq \frac{1}{n}

so for any

ε > 0

choose n such that

1 / n \leq ε .

Assume that for all

ε > 0

there exists a uniquely decodable fixed-to-variable length code

κ : A^{n} \to {\{0, 1\}}^{*}

such that

|{\bar{ℓ}}_{κ} (a_{1} a_{2} \dots a_{n}) - \frac{1}{n} \sum_{i = 1}^{n} ℓ (a_{i})| \leq ε

for all strings

a_{1} a_{2} \dots a_{n} \in A^{n} .

Then

n {\bar{ℓ}}_{κ} (a_{1} a_{2} \dots a_{n})

satisfies Kraft’s Inequality (6) and

\begin{matrix} {(\sum_{a \in A} β^{- ℓ (a)})}^{n} & = \sum_{a_{1} a_{2} \dots a_{n} \in A^{n}} β^{- \sum_{i = 1}^{n} ℓ (a_{i})} \\ \leq \sum_{a_{1} a_{2} \dots a_{n} \in A^{n}} β^{- n ({\bar{ℓ}}_{κ} (a_{1} a_{2} \dots a_{n}) - ε)} \\ = β^{n ε} \sum_{a_{1} a_{2} \dots a_{n} \in A^{n}} β^{- n {\bar{ℓ}}_{κ} (a_{1} a_{2} \dots a_{n})} \\ \leq β^{n ε} . \end{matrix}

Therefore

\sum_{a \in A} β^{- ℓ (a)} \leq β^{ε}

for all

ε > 0

and the result is obtained. ☐
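
The first half of the proof can be traced numerically. The sketch below takes non-integer lengths ℓ(a) = -log_2(p_a) for an assumed three-letter source, forms the integer block lengths ⌈∑ ℓ(a_i)⌉ over A^n, and evaluates both the Kraft sum and the largest per-symbol deviation. It only checks the length function; it does not construct the prefix-free code itself. The alphabet, the probabilities and the helper names are illustrative assumptions.

```python
import itertools
import math

beta = 2  # size of the output alphabet

# Hypothetical source alphabet with non-integer ideal code lengths.
prob = {"a": 0.5, "b": 0.3, "c": 0.2}
length = {x: -math.log2(px) for x, px in prob.items()}

def block_lengths(n):
    """Integer lengths ceil(sum_i l(a_i)) for every string a_1 ... a_n in A^n."""
    return {
        word: math.ceil(sum(length[x] for x in word))
        for word in itertools.product(prob, repeat=n)
    }

def kraft_sum(lengths):
    """Sum of beta^(-l) over the given lengths."""
    return sum(beta ** (-l) for l in lengths)

n = 4
blocks = block_lengths(n)
kraft = kraft_sum(blocks.values())

# Largest gap between the per-symbol block length and the average of l(a_i).
max_deviation = max(
    abs(l / n - sum(length[x] for x in word) / n) for word, l in blocks.items()
)
```

With these choices `kraft` stays below 1 and `max_deviation` stays below 1/n, so increasing n makes the block code approximate the non-integer lengths as closely as desired.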

As in Bayesian statistics we focus on finite sequences. Contrary to Bayesian statistics we should always consider a finite sequence as a prefix of longer finite sequences. Contrary to frequentist statistics we do not have to consider a finite sequence as a prefix of an infinite sequence.

If we minimize the mean code-length over functions that satisfy Kraft’s inequality (6), but without an integer constraint the code-length should be

ℓ (a) = - {log}_{β} (p_{a})

and the function F is given by

F (P) = \sum_{a} p_{a} \cdot {log}_{β} (p_{a}) .

The function F is proportional to the Shannon entropy and the (negative) proportionality factor is determined by the size of the output alphabet.
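
As a small numerical illustration, the sketch below uses a ternary output alphabet (β = 3, an arbitrary choice) and compares the mean code-length obtained with the ideal lengths for P against the lengths that would be ideal for another distribution Q. The excess is the regret, which equals information divergence measured in base β. The function names and test distributions are assumptions.

```python
import math

beta = 3.0  # size of the output alphabet (illustrative choice)

def F(p):
    """F(P) = sum_a p_a log_beta(p_a), the negative of the entropy in base beta."""
    return sum(pa * math.log(pa, beta) for pa in p if pa > 0)

def mean_code_length(p, q):
    """Mean of l(a) = -log_beta(q_a) when the symbols follow P."""
    return sum(pa * -math.log(qa, beta) for pa, qa in zip(p, q) if pa > 0)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.2, 0.6]

optimal = mean_code_length(p, p)      # equals -F(P)
mismatched = mean_code_length(p, q)   # code designed for Q, data from P
regret = mismatched + F(p)

# Information divergence in nats, for comparison.
kl_nats = sum(pa * math.log(pa / qa) for pa, qa in zip(p, q) if pa > 0)
```

Dividing `kl_nats` by ln(β) should reproduce `regret`, which shows how the proportionality factor depends on the size of the output alphabet.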

In lossy source coding and rate distortion theory it is important to choose a distortion function with tractable properties. The notion of sufficiency for divergence functions was introduced in [44] in order to characterize such tractable distortion functions. The main result of [44] was that sufficiency together with properties related to Bregman divergence leads directly to the information bottleneck method introduced by N. Tishby [47]. Logarithmic loss has also been studied for lossy compression in [48].

6.2. Statistics

In statistics one is often interested in scoring rules that are local, that is, scoring rules where the payoff depends only on the probability assigned to the observed value and not on the predicted distribution over unobserved values. The notion of locality has recently been extended by Dawid, Lauritzen and Parry [49], but here we shall focus on the original definition. The basic result is that the only local strictly proper scoring rule is the logarithmic score; this was proved by Bernardo under the assumption that the scoring rule is given by a smooth function [50].

Definition 10.

A local strictly proper scoring rule is a scoring rule of the form

f (x, Q) = g (Q (x)) .

Theorem 6.

On a finite space a local strictly proper scoring rule is given by a local regret function.

Proof.

The regret function of a local strictly proper scoring rule is given by

D (P, Q) = \sum_{x} P (x) (g (P (x)) - g (Q (x))) .

If

Q = (1 - t) P + t Q_{i}

and P and Q_{i} are mutually singular then

\begin{matrix} D (P, Q) & = \sum_{x} P (x) (g (P (x)) - g ((1 - t) P (x) + t Q_{i} (x))) \\ = \sum_{x} P (x) (g (P (x)) - g ((1 - t) P (x) + 0)) \end{matrix}

and we see that the regret does not depend on

Q_{i}

because

Q_{i}

vanishes on the support of

P .

Therefore the regret function is local. ☐

Corollary 2.

On a finite space with at least three elements a local strictly proper scoring rule is given by a function g of the form

g (x) = a \cdot ln (x) + b

for some constants a and

b .
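
To see why the logarithm is singled out, one can evaluate the regret of a rule of the form f(x, Q) = g(Q(x)) for different choices of g. The sketch below compares g = ln with the linear choice g(x) = x on a few arbitrarily chosen forecasts Q; the helper name and the test distributions are assumptions.

```python
import math

def regret(g, p, q):
    """D(P, Q) = sum_x P(x) (g(P(x)) - g(Q(x))), skipping outcomes with P(x) = 0."""
    return sum(px * (g(px) - g(qx)) for px, qx in zip(p, q) if px > 0)

p = [0.6, 0.3, 0.1]
forecasts = [
    [0.6, 0.3, 0.1],   # the true distribution
    [0.5, 0.25, 0.25],
    [0.8, 0.1, 0.1],   # overstates the most likely outcome
    [1 / 3, 1 / 3, 1 / 3],
]

log_regrets = [regret(math.log, p, q) for q in forecasts]
linear_regrets = [regret(lambda x: x, p, q) for q in forecasts]
```

With the logarithm all regrets are non-negative, whereas the linear rule yields a negative regret for the forecast that overstates the most likely outcome, so the linear rule is local but not proper.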

Also the notion of sufficiency plays an important role in statistics. Here we will restrict the discussion to 1-dimensional exponential families. A natural exponential family is a family of probability distributions of the form

\frac{d P_{β}}{d Q} = \frac{exp (β x)}{Z (β)}

where Q is a reference measure on the real numbers and Z is the moment generating function given by

Z (β) = \int exp (β x) d Q x

. Then

x_{1}^{n} \to x_{1} + x_{2} + \dots + x_{n}

is a sufficient statistic for the family

{(P_{β}^{\otimes n})}_{β} .

Example 6.

In a Bernoulli model a sequence

x_{1}^{n} \in {\{0, 1\}}^{n}

is predicted with probability

\prod_{i = 1}^{n} p^{x_{i}} {(1 - p)}^{1 - x_{i}} = exp ((\sum_{i = 1}^{n} x_{i}) ln (p) + (n - \sum_{i = 1}^{n} x_{i}) ln (1 - p)) .

The function

x_{1}^{n} \to x_{1} + x_{2} + \dots + x_{n}

induces a sufficient map Φ from probability distributions on

{\{0, 1\}}^{n}

to probability distributions on

\{0, 1, 2, \dots, n\} .

The reverse map maps a measure concentrated in

j \in \{0, 1, 2, \dots, n\}

into the uniform distribution over the sequences

x_{1}^{n} \in {\{0, 1\}}^{n}

that satisfy

\sum_{i = 1}^{n} x_{i} = j .
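
The maps in this example are easy to write out for small n. The following sketch, using n = 3 and hypothetical function names, implements the map induced by the sum and the reverse map, and then compares the information divergence between two Bernoulli product distributions before and after applying the sufficient map.

```python
import itertools
import math

n = 3
sequences = list(itertools.product([0, 1], repeat=n))

def bernoulli_product(p):
    """Distribution on {0,1}^n of n independent Bernoulli(p) variables."""
    return {x: p ** sum(x) * (1 - p) ** (n - sum(x)) for x in sequences}

def phi(dist):
    """Push a distribution on {0,1}^n forward to the distribution of the sum."""
    out = [0.0] * (n + 1)
    for x, weight in dist.items():
        out[sum(x)] += weight
    return out

def psi(counts):
    """Reverse map: spread the mass at j uniformly over the sequences with sum j."""
    return {x: counts[sum(x)] / math.comb(n, sum(x)) for x in sequences}

def kl(p, q):
    """Information divergence between two distributions given as lists."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

p1 = bernoulli_product(0.3)
p2 = bernoulli_product(0.6)

recovered = psi(phi(p1))
max_error = max(abs(recovered[x] - p1[x]) for x in sequences)

d_before = kl(list(p1.values()), list(p2.values()))
d_after = kl(phi(p1), phi(p2))
```

Up to rounding, `recovered` coincides with `p1` and the two divergences agree, as sufficiency requires.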

The mean value of

P_{β}

is

\int x \cdot \frac{exp (β x)}{Z (β)} d Q x .

The set of possible mean values is called the mean value range and is an interval. Let

P^{μ}

denote the element in the exponential family with mean value

μ .

Then a Bregman divergence on the mean value range is defined by

D (λ, μ) = D (P^{λ} ∥P^{μ}) .

Note that the mapping

μ \to P^{μ}

is not affine so the Bregman divergence

D (λ, μ)

will in general not be given by the formula for information divergence with the family of binomial distributions as the only exception. Nevertheless the Bregman divergence

D (λ, μ)

encodes important information about the exponential family. In statistics it is common to use squared Euclidean distance as distortion measure, but often it is better to use the Bregman divergence

D (λ, μ)

as a distortion measure. Note that

D (λ, μ)

is only proportional to squared Euclidean distance for the Gaussian location family.

Example 7.

An exponential distribution has density

f_{λ} (x) = \{\begin{matrix} \frac{1}{λ} exp (- \frac{x}{λ}), & for x \geq 0; \\ 0, & else . \end{matrix}

This leads to a Bregman divergence on the interval

[0; \infty[

given by

\begin{matrix} \int_{0}^{\infty} f_{λ} (x) ln (\frac{f_{λ} (x)}{f_{μ} (x)}) d x & = \frac{λ}{μ} - 1 - ln (\frac{λ}{μ}) \\ = D_{- ln} (λ, μ) . \end{matrix}

This Bregman divergence is called the Itakura-Saito distance. The Itakura-Saito distance is defined on an unbounded set so our previous results cannot be applied. Affine bijections on

[0; \infty[

have the form

Φ (x) = k \cdot x

for some constant

k > 0

. The Itakura-Saito distance obviously satisfies sufficiency for such maps and it is a simple exercise to check that the Itakura-Saito distance is the only Bregman divergence on

[0; \infty[

that satisfies sufficiency. Any affine map

[0; \infty[\to [0; \infty[

is composed of a map

x \to k \cdot x

where

k \geq 0

and a right translation

x \to x + t

where

t \geq 0 .

The Itakura-Saito distance decreases under right translations because

\begin{matrix} \frac{\partial}{\partial t} D_{- ln} (λ + t, μ + t) & = \frac{\partial}{\partial t} (\frac{λ + t}{μ + t} - 1 - ln (\frac{λ + t}{μ + t})) \\ = \frac{(μ + t) - (λ + t)}{{(μ + t)}^{2}} - \frac{1}{λ + t} + \frac{1}{μ + t} \\ = - \frac{{(λ - μ)}^{2}}{(λ + t) {(μ + t)}^{2}} \leq 0 . \end{matrix}

Thus the Itakura-Saito distance is monotone.
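
The properties used in this example can be checked with a few lines of Python. The sketch below, with hypothetical function names and arbitrary test values, compares the closed formula with the Bregman divergence generated by -ln, and then evaluates the distance under a scaling and under a sequence of right translations.

```python
import math

def itakura_saito(lam, mu):
    """The closed form lam/mu - 1 - ln(lam/mu) for lam, mu > 0."""
    ratio = lam / mu
    return ratio - 1.0 - math.log(ratio)

def bregman_neg_log(lam, mu):
    """Bregman divergence of F(x) = -ln(x): F(lam) - F(mu) - (lam - mu) F'(mu)."""
    return -math.log(lam) + math.log(mu) + (lam - mu) / mu

lam, mu = 2.0, 5.0
base = itakura_saito(lam, mu)

# Scaling x -> k * x leaves the distance unchanged.
k = 7.5
scaled = itakura_saito(k * lam, k * mu)

# Right translations x -> x + t with t >= 0 decrease the distance.
shifts = [0.0, 0.5, 1.0, 5.0, 10.0]
translated = [itakura_saito(lam + t, mu + t) for t in shifts]
```

The list `translated` is decreasing in t, in agreement with the sign of the derivative computed above.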

Both sufficiency and the Bregman identity are closely related to inference rules. In [51] I. Csiszár explained why information divergence is the only divergence function on the cone of positive measures that lead to certain tractable inference rules. One should observe that his inference rules are closely related to sufficiency and the Bregman identity, and the present paper may be viewed as a generalization of these results of I. Csiszár.

In the minimum description length approach to statistics [9] it is common to minimize the maximal regret of the model where the maximum is taken over all possible data sequences. For most exponential families this approach is computationally difficult and may cause problems with normalization of the optimal distribution over the parameter. In general this approach will also depend on the length of the data sequence in a way that is not transitive. That means that one cannot analyze a subsequence before the length of the whole data sequence is known. In [52] it was proved that among one-dimensional exponential families there are essentially three where these problems are avoided. These exponential families are the Gaussian location family, the Gamma distributions, and the Tweedie distributions of order

\frac{3}{2}

. The statistical analysis of the Gaussian location family reduces to minimizing the sum of squares. Similarly, the Gamma distributions can be analyzed using the Itakura–Saito distance (or information divergence), but the Tweedie family of order

\frac{3}{2}

is an exotic object that has not been analyzed in similar detail. For exponential models in two or more dimensions similar results are not known, but in general one should expect that most models are complicated to analyze exactly while certain models simplify due to some type of inner symmetry of the model.

6.3. Statistical Mechanics

Statistical mechanics can be stated based on classical mechanics or quantum mechanics. For our purpose this makes no difference because our theorems are valid for both classical systems and quantum systems.

As we have seen before

E x = k T_{0} \cdot D (s ∥s_{0}) .

(25)

Our general results for Bregman divergences imply that the Bregman divergence based on this exergy satisfies

D_{E x} (s_{1}, s_{2}) = k T_{0} \cdot D (s_{1} ∥s_{2}) .

Therefore

D_{E x} (Φ (s_{1}), Φ (s_{2})) = D_{E x} (s_{1}, s_{2})

for any map that is sufficient for

\{s_{1}, s_{2}\} .

The equality holds for any map that is reversible and conserves the state that is in equilibrium with the environment. Since a different temperature of the environment leads to a different equilibrium state, the equality holds for any reversible map that leaves some equilibrium state invariant. We see that

D_{E x} (s_{1}, s_{2})

is uniquely determined as long as there exists a sufficiently large set of maps that are reversible.

In this exposition we have made some short-cuts. First of all we did not derive Equation (25). In particular the notion of temperature was used without discussion. Secondly we identified the internal energy with the mean value of the Hamiltonian and identified the thermodynamic entropy with k times the Shannon entropy. Finally, in the argument above we need to verify in all details that the set of reversible maps is sufficiently large to determine the regret function. For classical thermodynamics a comprehensive exposition was made by Lieb and Yngvason [53,54]. In their exposition randomness was not taken into account. With the present framework it is also possible to handle randomness so that one can make a bridge between thermodynamics and statistical mechanics. The approach of Lieb and Yngvason was recently improved by C. Marletto [55], who uses the formalism of constructor theory to derive results. The basic idea in constructor theory is to distinguish between possible and impossible transformations in a way that is closely related to the ideas presented in this paper. A detailed exposition of such relations will be given in a future paper.

According to Equation (25) any bit of information can be converted into an amount of energy! One may ask how this is related to the mixing paradox (a special case of Gibbs’ paradox). Consider a container divided by a wall with a blue gas on one side and a yellow gas on the other, as illustrated in Figure 4. The question is how much energy can be extracted by mixing the blue and the yellow gas.

We lose one bit of information about each molecule by mixing the blue and the yellow gas, but if the color is the only difference no energy can be extracted. This seems to be in conflict with Equation (25), but in this case different states cannot be converted into each other by reversible processes. For instance one cannot convert the blue gas into the yellow gas. To get around this problem one can restrict the set of preparations and one can restrict the set of measurements. For instance one may simply ignore measurements of the color of the gas. What should be taken into account and what should be ignored can only be answered by an experienced physicist. Formally this solves the mixing paradox, but from a practical point of view nothing has been solved. If for instance the molecules in one of the gases are much larger than the molecules in the other gas then a semi-permeable membrane can be used to create an osmotic pressure that can be used to extract some energy. It is still an open question which differences in the properties of the two gases can be used to extract energy.

6.4. Monotone Regret for Portfolios

We know that in general a local regret function on a state space with at least three orthogonal states is proportional to information divergence. In portfolio theory we get the stronger result that monotonicity implies that we are in the situation of gambling introduced by Kelly [35].

Theorem 7.

Assume that none of the assets are dominated by a portfolio of other assets. If the regret function

D_{G} (P, Q)

given by (13) is monotone then the regret function equals information divergence and the measures P and Q are supported by k distinct price relative vectors of the form

(o_{1}, 0, 0, \dots 0)

,

(0, o_{2}, 0, \dots 0),

until

(0, 0, \dots o_{k}) .

Proof.

If there are at least three price relative vectors then a monotone regret function is always proportional to information divergence, which is a strict regret function. Therefore we may assume that there are only two price relative vectors. Assume that the regret function is not strict. Then the function G defined by (12) is not strictly convex. Assume that

D_{G} (P, Q) = 0

so that G is affine between P and Q. Let

Φ

be a contraction around one of the end points of intersection between the state space and the line through P and Q. Then monotonicity implies that

D_{G} (Φ (P), Φ (Q)) = 0

so that G is affine on the line between

Φ (P)

and

Φ (Q)

. This holds for contractions around any point. Therefore G is affine on the whole state space, which implies that there is a single portfolio that dominates all assets. Such a portfolio must be supported on a single asset. Therefore monotonicity implies that if two assets are not dominated then the regret function is strict, and by Theorem 1 a strict regret function in portfolio theory is proportional to information divergence. ☐

Example 8.

If the regret function is monotone and one of the assets is the safe asset then there exists a portfolio

\vec{b}

such that

b_{i} \cdot o_{i} \geq 1

for all

i .

Equivalently

b_{i} \geq o_{i}^{- 1}

which is possible if and only if

\sum o_{i}^{- 1} \leq 1 .

One says that the gamble is fair if

\sum o_{i}^{- 1} = 1

. If the gamble is super-fair, i.e.,

\sum o_{i}^{- 1} < 1

, then the portfolio

b_{i} = o_{i}^{- 1} / \sum o_{i}^{- 1}

gives a price relative equal to

{(\sum o_{i}^{- 1})}^{- 1} > 1

independently of what happens, which is a Dutch book.
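The arithmetic of Example 8 can be checked with a short sketch. The odds below are hypothetical and chosen so that the gamble is super-fair; the function name is likewise an assumption made for illustration.

```python
def dutch_book_portfolio(odds):
    """Return the sum of inverse odds, the portfolio b_i = o_i^{-1} / sum,
    and the price relative b_i * o_i obtained for each possible outcome."""
    inverse_sum = sum(1.0 / o for o in odds)
    portfolio = [(1.0 / o) / inverse_sum for o in odds]
    price_relatives = [b * o for b, o in zip(portfolio, odds)]
    return inverse_sum, portfolio, price_relatives


odds = [3.0, 4.0, 6.0]  # hypothetical odds with 1/3 + 1/4 + 1/6 = 0.75 < 1
inverse_sum, portfolio, price_relatives = dutch_book_portfolio(odds)

print(inverse_sum)       # sum of inverse odds, below 1 for a super-fair gamble
print(portfolio)         # fractions of the capital placed on each outcome
print(price_relatives)   # the same value for every outcome
```

With these odds every outcome gives the same price relative, equal to the reciprocal of the sum of inverse odds and therefore greater than 1, which is the Dutch book described in the example.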

Corollary 3.

In portfolio theory the regret function

D_{G} (P, Q)

given by (13) is monotone if and only if it is strict.

Proof.

We use that in portfolio theory the regret function is monotone if and only if it is proportional to information divergence. ☐

7. Concluding Remarks

Convexity of a Bregman divergence is an important property that was first studied systematically in [56] and extended from probability distributions to matrices in [57]. In [58] it was proved that if f is a function such that the Bregman divergence based on

tr (f (ρ))

is monotone on any (simple) C*-algebra then the Bregman divergence is jointly convex. As we have seen, monotonicity implies that the Bregman divergence must be proportional to information divergence, which is jointly convex in both arguments. We also see that in general joint convexity is not a sufficient condition for monotonicity, but in the case where the state space has only two orthogonal states it is not known whether joint convexity of a Bregman divergence is sufficient to conclude that the Bregman divergence is monotone.
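Joint convexity of information divergence on probability vectors can be spot-checked numerically. The sketch below is only an illustration on randomly drawn distributions, not a proof; the helper names, the alphabet size, and the number of trials are assumptions.

```python
import math
import random


def kl_divergence(p, q):
    """Information divergence D(P||Q) in nats for strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def random_distribution(rng, size):
    """Random probability vector with strictly positive entries."""
    weights = [rng.random() + 1e-3 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


def mix(a, b, lam):
    """Convex combination lam * a + (1 - lam) * b."""
    return [lam * x + (1 - lam) * y for x, y in zip(a, b)]


def joint_convexity_gap(p1, q1, p2, q2, lam):
    """Average of the divergences minus the divergence of the mixtures.
    Joint convexity means that this gap is never negative."""
    mixed = kl_divergence(mix(p1, p2, lam), mix(q1, q2, lam))
    average = lam * kl_divergence(p1, q1) + (1 - lam) * kl_divergence(p2, q2)
    return average - mixed


rng = random.Random(0)
gaps = []
for _ in range(1000):
    p1, q1, p2, q2 = (random_distribution(rng, 4) for _ in range(4))
    gaps.append(joint_convexity_gap(p1, q1, p2, q2, rng.random()))

smallest_gap = min(gaps)
print(smallest_gap)
```

A smallest gap that is non-negative up to rounding is consistent with joint convexity in both arguments.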

One should note that the type of optimization presented in this paper is closely related to a game theoretic model developed by F. Topsøe [59,60]. In his game theoretic model he needed what he called the perfect match principle. Using the terminology presented in this paper, the perfect match principle states that the regret function is a strict Bregman divergence. As we have seen, the perfect match principle is only fulfilled in portfolio theory if all the assets are gambling assets. Therefore, the theory of F. Topsøe can only be used to describe gambling, while our optimization model can describe general portfolio theory and our sufficiency conditions can explain exactly when our general model equals gambling. The formalism that has been developed in this paper is also closely related to constructor theory [61], but a discussion will be postponed to another article.

The original paper of Kullback and Leibler [1] was called “On Information and Sufficiency”. In the present paper, we have made the relation between information divergence and the notion of sufficiency more explicit. The results presented in this paper are closely related to the result that a divergence that is both an f-divergence and a Bregman divergence is proportional to information divergence (see [44] or [62] and references therein). All f-divergences satisfy a sufficiency condition, which is the reason why this class of divergences has played such a prominent role in the study of the relation between information theory and statistics. One major question has been to find reasons for choosing between the different f-divergences. For instance f-divergences of power type (often called Tsallis divergences or Cressie-Read divergences) are popular [5], but there are surprisingly few papers that can point at a single value of the power

α

that is optimal for a certain problem except if this value is 1. In this paper we have started with Bregman divergences because each optimization problem comes with a specific Bregman divergence. Often it is possible to specify a Bregman divergence for an optimization problem, and only in some cases is this Bregman divergence proportional to information divergence.
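That information divergence belongs to both classes can be verified directly. The following computation is a standard one and is not taken from the paper; it assumes probability vectors P and Q on a finite set, the common definition of the Bregman divergence generated by a convex function F, and the common definition of the f-divergence generated by a convex function f with f(1) = 0. The sign conventions may differ from those used earlier in the paper.

```latex
% Bregman divergence generated by F(P) = \sum_i p_i \log p_i
\begin{align*}
D_F(P, Q) &= F(P) - F(Q) - \langle \nabla F(Q), P - Q \rangle \\
          &= \sum_i p_i \log p_i - \sum_i q_i \log q_i
             - \sum_i (\log q_i + 1)(p_i - q_i) \\
          &= \sum_i p_i \log \frac{p_i}{q_i} - \sum_i p_i + \sum_i q_i
           = \sum_i p_i \log \frac{p_i}{q_i} .
\end{align*}
% f-divergence generated by f(t) = t \log t
\begin{align*}
D_f(P, Q) &= \sum_i q_i \, f\!\left(\frac{p_i}{q_i}\right)
           = \sum_i q_i \cdot \frac{p_i}{q_i} \log \frac{p_i}{q_i}
           = \sum_i p_i \log \frac{p_i}{q_i} .
\end{align*}
```

The last step of the first computation uses that both vectors sum to 1, so both constructions give the same quantity D(P‖Q).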

The idea of sufficiency has different relevance in different applications, but in all cases information divergence proves to be the quantity that converts the general notion of sufficiency into a number. In information theory, information divergence appears as a consequence of Kraft’s inequality. For code length functions of integer length we get functions that are piecewise linear. Only if we are interested in extendable sequences do we get a regret function that satisfies a data processing inequality. In this sense information theory is a theory of extendable sequences. For scoring functions in statistics the notion of locality is important. These applications do not refer to sequences. Similarly, the notion of sufficiency, which plays a major role in statistics, does not refer to sequences. Both sufficiency and locality imply that regret is proportional to information divergence, but these reasons are different from the reasons why information divergence is used in information theory. Our description of statistical mechanics does not go into technical details, but the main point is that the many symmetries in terms of reversible maps form a set of maps so large that our result on invariance of regret under sufficient maps applies. In this sense statistical mechanics and statistics both apply information divergence for reasons related to sufficiency. For portfolio theory the story is different. In most cases one has to apply the general theory of Bregman divergences because we deal with an optimization problem. The general Bregman divergences only reduce to information divergence when the assets are gambling assets.
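The role of Kraft’s inequality in the coding case can be made concrete with a small sketch. It assumes binary codes, measures regret as expected code length minus entropy in bits, and compares idealized real-valued lengths with rounded-up integer lengths. The distributions and function names are hypothetical and serve only as an illustration.

```python
import math


def kraft_sum(lengths):
    """Left-hand side of Kraft's inequality for binary code lengths."""
    return sum(2.0 ** (-length) for length in lengths)


def entropy_bits(p):
    """Entropy of P in bits."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


def kl_bits(p, q):
    """Information divergence D(P||Q) in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def coding_regret(p, lengths):
    """Expected code length under P minus the entropy of P, in bits."""
    return sum(pi * length for pi, length in zip(p, lengths)) - entropy_bits(p)


P = [0.4, 0.3, 0.2, 0.1]          # hypothetical source distribution
Q = [0.5, 0.25, 0.125, 0.125]     # hypothetical coding distribution

# Idealized lengths -log2(q_i): Kraft's inequality holds with equality
# and the regret coincides with information divergence.
ideal_lengths = [-math.log2(qi) for qi in Q]
ideal_regret = coding_regret(P, ideal_lengths)

# Integer lengths obtained by rounding -log2(p_i) up: Kraft's inequality
# still holds, but some regret remains even though the code is tuned to P.
integer_lengths = [math.ceil(-math.log2(pi)) for pi in P]
integer_regret = coding_regret(P, integer_lengths)

print(kraft_sum(ideal_lengths), ideal_regret, kl_bits(P, Q))
print(integer_lengths, kraft_sum(integer_lengths), integer_regret)
```

In this sketch the regret for the idealized lengths equals D(P‖Q), while the integer-length code has positive regret although it was designed for P itself.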

Often one talks about applications of information theory in statistics, statistical mechanics, and portfolio theory. In this paper we have argued that information theory is mainly a theory of sequences, while some problems in statistics and statistical mechanics are also relevant without reference to sequences. It would be more correct to say that convex optimization has various applications, such as information theory, statistics, statistical mechanics, and portfolio theory, and that certain conditions related to sufficiency lead to the same type of quantities in all these applications.

Acknowledgments

I want to thank Prasad Santhanam for inviting me to the Electrical Engineering Department, University of Hawai‘i at Mānoa, where many of the ideas presented in this paper were developed. I also want to thank Alexander Müller-Hermes, Frank Hansen, and Flemming Topsøe for stimulating discussions and correspondence. Finally, I want to thank the reviewers for their valuable comments.

Conflicts of Interest

The author declares no conflict of interest.

References

1. Kullback, S.; Leibler, R. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
2. Jaynes, E.T. Information Theory and Statistical Mechanics, I. Phys. Rev. 1957, 106, 620–630.
3. Jaynes, E.T. Information Theory and Statistical Mechanics, II. Phys. Rev. 1957, 108, 171–190.
4. Jaynes, E.T. Clearing up mysteries—The original goal. In Maximum Entropy and Bayesian Methods; Skilling, J., Ed.; Kluwer: Dordrecht, The Netherlands, 1989.
5. Liese, F.; Vajda, I. Convex Statistical Distances; Teubner: Leipzig, Germany, 1987.
6. Barron, A.R.; Rissanen, J.; Yu, B. The Minimum Description Length Principle in Coding and Modeling. IEEE Trans. Inf. Theory 1998, 44, 2743–2760.
7. Csiszár, I.; Shields, P. Information Theory and Statistics: A Tutorial; Foundations and Trends in Communications and Information Theory; Now Publishers Inc.: Delft, The Netherlands, 2004.
8. Grünwald, P.D.; Dawid, A.P. Game Theory, Maximum Entropy, Minimum Discrepancy, and Robust Bayesian Decision Theory. Ann. Stat. 2004, 32, 1367–1433.
9. Grünwald, P. The Minimum Description Length Principle; MIT Press: Cambridge, MA, USA, 2007.
10. Holevo, A.S. Probabilistic and Statistical Aspects of Quantum Theory; North-Holland Series in Statistics and Probability; North-Holland: Amsterdam, The Netherlands, 1982; Volume 1.
11. Krumm, M.; Barnum, H.; Barrett, J.; Müller, M. Thermodynamics and the structure of quantum theory. arXiv, 2016; arXiv:1608.04461.
12. Barnum, H.; Müller, M.P.; Ududec, C. Higher-order interference and single-system postulates characterizing quantum theory. New J. Phys. 2014, 16, 123029.
13. Harremoës, P. Maximum Entropy and Sufficiency. arXiv, 2016; arXiv:1607.02259.
14. Harremoës, P. Quantum information on Spectral Sets. arXiv, 2017; arXiv:1701.06688.
15. Barnum, H.; Lee, C.M.; Scandolo, C.M.; Selby, J.H. Ruling out higher-order interference from purity principles. arXiv, 2017; arXiv:1704.05106.
16. Savage, L.J. The Theory of Statistical Decision. J. Am. Stat. Assoc. 1951, 46, 55–67.
17. Bell, D.E. Regret in decision making under uncertainty. Oper. Res. 1982, 30, 961–981.
18. Fishburn, P.C. The Foundations of Expected Utility; Springer: Berlin/Heidelberg, Germany, 1982.
19. Loomes, G.; Sugden, R. Regret theory: An alternative theory of rational choice under uncertainty. Econ. J. 1982, 92, 805–824.
20. Bikhchandani, S.; Segal, U. Transitive regret. Theor. Econ. 2011, 6, 95–108.
21. Kiwiel, K.C. Proximal Minimization Methods with Generalized Bregman Functions. SIAM J. Control Optim. 1997, 35, 1142–1168.
22. Kiwiel, K.C. Free-steering Relaxation Methods for Problems with Strictly Convex Costs and Linear Constraints. Math. Oper. Res. 1997, 22, 326–349.
23. Rockafellar, R.T. Convex Analysis; Princeton University Press: Princeton, NJ, USA, 1970.
24. Hendrickson, A.D.; Buehler, R.J. Proper scores for probability forecasters. Ann. Math. Stat. 1971, 42, 1916–1921.
25. Rao, C.R.; Nayak, T.K. Cross Entropy, Dissimilarity Measures, and Characterizations of Quadratic Entropy. IEEE Trans. Inf. Theory 1985, 31, 589–593.
26. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman Divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749.
27. Kraft, L.G. A Device for Quantizing, Grouping and Coding Amplitude Modulated Pulses. Master’s Thesis, Department of Electrical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA, 1949.
28. Han, T.S.; Kobayashi, K. Mathematics of Information and Coding; Translations of Mathematical Monographs; American Mathematical Society: Providence, RI, USA, 2002; Volume 203.
29. De Finetti, B. Theory of Probability; Wiley: Hoboken, NJ, USA, 1974.
30. McCarthy, J. Measures of the value of information. Proc. Natl. Acad. Sci. USA 1956, 42, 654–655.
31. Gneiting, T.; Raftery, A.E. Strictly Proper Scoring Rules, Prediction, and Estimation. J. Am. Stat. Assoc. 2007, 102, 359–378.
32. Ovcharov, E.Y. Proper Scoring Rules and Bregman Divergences. arXiv, 2015; arXiv:1502.01178.
33. Gundersen, T. An Introduction to the Concept of Exergy and Energy Quality; Lecture notes; Norwegian University of Science and Technology: Trondheim, Norway, 2011.
34. Harremoës, P. Time and Conditional Independence; IMFUFA-Tekst; IMFUFA Roskilde University: Roskilde, Denmark, 1993; Volume 255.
35. Kelly, J.L. A New Interpretation of Information Rate. Bell Syst. Tech. J. 1956, 35, 917–926.
36. Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley: Hoboken, NJ, USA, 1991.
37. Cover, T.M. Universal portfolios. Math. Finance 1991, 1, 1–29.
38. Uhlmann, A. On the Shannon Entropy and Related Functionals on Convex Sets. Rep. Math. Phys. 1970, 1, 147–159.
39. Müller-Hermes, A.; Reeb, D. Monotonicity of the Quantum Relative Entropy under Positive Maps. Annales Henri Poincaré 2017, 18, 1777–1788.
40. Christandl, M.; Müller-Hermes, A. Relative Entropy Bounds on Quantum, Private and Repeater Capacities. arXiv, 2016; arXiv:1604.03448.
41. Petz, D. Monotonicity of Quantum Relative Entropy Revisited. Rev. Math. Phys. 2003, 15, 79–91.
42. Petz, D. Sufficiency of Channels over von Neumann algebras. Q. J. Math. Oxf. 1988, 39, 97–108.
43. Jenčová, A.; Petz, D. Sufficiency in quantum statistical inference. Commun. Math. Phys. 2006, 263, 259–276.
44. Harremoës, P.; Tishby, N. The Information Bottleneck Revisited or How to Choose a Good Distortion Measure. In Proceedings of the IEEE International Symposium on Information Theory, Nice, France, 24–29 June 2007; pp. 566–571.
45. Jiao, J.; Courtade, T.A.; No, A.; Venkat, K.; Weissman, T. Information Measures: The Curious Case of the Binary Alphabet. IEEE Trans. Inf. Theory 2014, 60, 7616–7626.
46. Jenčová, A. Preservation of a quantum Rényi relative entropy implies existence of a recovery map. J. Phys. A Math. Theor. 2017, 50, 085303.
47. Tishby, N.; Pereira, F.; Bialek, W. The information bottleneck method. In Proceedings of the 37th Annual Allerton Conference on Communication, Control and Computing, Urbana, IL, USA, 22–24 September 1999; pp. 368–377.
48. No, A.; Weissman, T. Universality of logarithmic loss in lossy compression. In Proceedings of the 2015 IEEE International Symposium on Information Theory (ISIT), Hong Kong, China, 14–19 June 2015; pp. 2166–2170.
49. Dawid, A.P.; Lauritzen, S.; Perry, M. Proper local scoring rules on discrete sample spaces. Ann. Stat. 2012, 40, 593–603.
50. Bernardo, J.M. Expected Information as Expected Utility. Ann. Stat. 1978, 7, 686–690.
51. Csiszár, I. Why least squares and maximum entropy? An axiomatic approach to inference for linear inverse problems. Ann. Stat. 1991, 19, 2032–2066.
52. Bartlett, P.; Grünwald, P.; Harremoës, P.; Hedayati, F.; Kotlowski, W. Horizon-Independent Optimal Prediction with Log-Loss in Exponential Families. In Proceedings of the Conference on Learning Theory (COLT 2013), Princeton, NJ, USA, 12–14 June 2013; p. 23.
53. Lieb, E.; Yngvason, J. A Guide to Entropy and the Second Law of Thermodynamics. Not. AMS 1998, 45, 571–581.
54. Lieb, E.; Yngvason, J. The Mathematics of the Second Law of Thermodynamics. In Visions in Mathematics; Alon, N., Bourgain, J., Connes, A., Gromov, M., Milman, V., Eds.; Birkhäuser: Basel, Switzerland, 2010; pp. 334–358.
55. Marletto, C. Constructor Theory of Thermodynamics. arXiv, 2016; arXiv:1608.02625.
56. Bauschke, H.H.; Borwein, J.M. Joint and Separate Convexity of the Bregman Distance. In Inherently Parallel Algorithms in Feasibility and Optimization and Their Applications; Butnariu, D., Censor, Y., Reich, S., Eds.; Elsevier: Amsterdam, The Netherlands, 2001; Volume 8, pp. 23–36.
57. Hansen, F.; Zhang, Z. Characterisation of Matrix Entropies. Lett. Math. Phys. 2015, 105, 1399–1411.
58. Pitrik, J.; Virosztek, D. On the Joint Convexity of the Bregman Divergence of Matrices. Lett. Math. Phys. 2015, 105, 675–692.
59. Topsøe, F. Game theoretical optimization inspired by information theory. J. Glob. Optim. 2008, 43, 553–564.
60. Topsøe, F. Cognition and Inference in an Abstract Setting. In Proceedings of the Fourth Workshop on Information Theoretic Methods in Science and Engineering (WITMSE 2011), Helsinki, Finland, 7–10 August 2011; pp. 67–70.
61. Deutsch, D.; Marletto, C. Constructor theory of information. Proc. R. Soc. A 2014, 471, 20140540.
62. Amari, S.I. α-Divergence Is Unique, Belonging to Both f-Divergence and Bregman Divergence Classes. IEEE Trans. Inf. Theory 2009, 55, 4925–4931.

Figure 1. The regret equals the vertical distance between curve and tangent.

Figure 2. The function G for the price relative vectors in Example 4.

Figure 3. Example of a dilation that increases regret.

Figure 4. Mixing of a blue and a yellow gas.

© 2017 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

