Divergence and Sufficiency for Convex Optimization

Logarithmic score and information divergence appear in information theory, statistics, statistical mechanics, and portfolio theory. We demonstrate that all these topics involve some kind of optimization that leads directly to regret functions and such regret functions are often given by a Bregman divergence. If the regret function also fulfills a sufficiency condition it must be proportional to information divergence. We will demonstrate that sufficiency is equivalent to the apparently weaker notion of locality and it is also equivalent to the apparently stronger notion of monotonicity. These sufficiency conditions have quite different relevance in the different areas of application, and often they are not fulfilled. Therefore sufficiency conditions can be used to explain when results from one area can be transferred directly to another and when one will experience differences.


Introduction
The maximum entropy method as introduced by Jaynes [1] works quite well in various applications, and an obvious question is why the same entropy formula appears in very different applications. In each of the applications the appearance of logarithmic terms has been given its own justification. Other scientists consider the maximum entropy principle as a general tool justified by information theory. It is quite obvious that the function that we would like to optimize should be convex or concave in order for our procedures to lead to a unique distribution. In this paper we will look at the consequences of requiring that our optimizer does not depend on irrelevant information, which is formulated as a sufficiency condition.
The use of scoring rules has a long history in statistics. An early contribution was the idea of minimizing the sum of square deviations, which dates back to Gauss and works perfectly for Gaussian distributions. In the 1920s Ramsey and de Finetti proved versions of the Dutch book theorem, where the determination of probability distributions was considered as a dual problem to maximizing a payoff function. Later it was proved that any consistent inference corresponds to optimizing with respect to some payoff function. A more systematic study of scoring rules was given by McCarthy [2]. The basic result is that the only strictly local proper scoring rule is logarithmic score. Our main theorem extends this result to general regret functions on convex sets.
Thermodynamics is the study of concepts like heat, temperature and energy. A major objective is to extract as much energy from a system as possible. Concepts like entropy and free energy play a significant role. The idea in statistical mechanics is to view the macroscopic behavior of a thermodynamic system as a statistical consequence of the interaction between a large number of microscopic components, where the interactions between the components are governed by very simple laws. Here the central limit theorem and large deviation theory play a major role. One of the main achievements is the formula for entropy as a logarithm of a probability.
One of the main purposes of information theory is to compress data so that data can be recovered exactly or approximately. One of the most important quantities was called entropy because it is calculated according to a formula that mimics the calculation of entropy in statistical mechanics. Another key concept in information theory is information divergence (KL-divergence), which was introduced by Kullback and Leibler in 1951 in a paper entitled "On Information and Sufficiency". The link from information theory back to statistical physics was developed by E.T. Jaynes via the maximum entropy principle. The link back to statistics is now well established [3,4,5].
The relation between information theory and gambling was established by Kelly [6]. Logarithmic terms appear because we are interested in the exponent in an exponential growth rate of our wealth. Later Kelly's approach was generalized to trading of stocks, although the relation to information theory is weaker [7].
Since related quantities appear in statistics, statistical mechanics, information theory and finance, we are interested in a general theory that describes when these relations are exact and when they just work by analogy. We introduce some general concepts related to optimization on convex sets. These concepts apply exactly to all the topics under consideration and lead to Bregman divergences. Then we introduce a notion of sufficiency and show that this leads to information divergence and logarithmic score. This second step is not always applicable, which explains when the different topics are really different. For applications in thermodynamics and gambling this is described in [8] and [9]. In this paper we will see how general convex optimization leads to the notion of Bregman divergence. If optimization is combined with the notion of sufficiency, the Bregman divergence is generated by a function that is proportional to the entropy. This result holds on any convex set, but it also imposes very severe restrictions on the shape of the convex set, to an extent that leads almost to the Hilbert space formalism of quantum mechanics.
Due to the limited space in the proceedings paper some of the proofs have been shortened or omitted.

Improved Caratheodory theorem
We consider a situation where our knowledge about a system is given by an element in a convex set. These elements are called states and convex combinations are formed by probabilistic mixing. States that cannot be distinguished by any measurement are considered as being the same state. The extreme points in the convex set are called pure states and all other states are called mixed states. See [10] for details about this definition of a state space. In this exposition we will assume that the state space, i.e. the convex set, is finite dimensional and compact.
Definition 1 Let $C$ denote a convex set. A test is an affine map from $C$ to $[0, 1]$. Let $s_0$ and $s_1$ denote states in the state space $C$. Then $s_0$ and $s_1$ are said to be mutually singular if there exists a test $\phi$ such that $\phi(s_0) = 0$ and $\phi(s_1) = 1$. The states $s_0$ and $s_1$ are said to be orthogonal if they are mutually singular in the smallest face $F$ of $C$ that contains both $s_0$ and $s_1$.
Let $s \in C$ be a state. Then the entropy of $s$ is defined as
$$H(s) = \inf \left( -\sum_{i=1}^{n} p_i \ln p_i \right) \tag{1}$$
where the infimum is taken over all probability vectors $p_1^n = (p_1, p_2, \dots, p_n)$ such that there exist extreme points $s_1, s_2, \dots, s_n$ such that
$$s = \sum_{i=1}^{n} p_i s_i . \tag{2}$$
Note that there is no restriction on the number $n$ in the definition of the entropy of $s$. If the points $s_1, s_2, \dots, s_n$ are orthogonal then the decomposition in Equation 2 is called an orthogonal decomposition, and if $p_1 \geq p_2 \geq \dots \geq p_n$ then the vector $p_1^n$ is called the spectrum of the decomposition.
Theorem 2 Let $C$ denote a convex compact set of dimension $d$ and let $s \in C$ denote some state. Then there exists $n \leq d + 1$ and an orthogonal decomposition $s = \sum_{i=1}^{n} p_i s_i$.
Proof. Define $H_n(s) = \inf \left( -\sum_{i=1}^{n} p_i \ln p_i \right)$ where $s = \sum_{i=1}^{n} p_i s_i$. According to Caratheodory's theorem there exists a decomposition of $s$ when $n > d$, which implies that $H_n(s) < \infty$. For fixed extreme points $(s_i)_{i=1,2,\dots,n}$ the set of probability vectors $p_1^n$ that satisfy this equation is a polytope, and the Shannon entropy is a concave and continuous function on this polytope. Therefore the minimum on this polytope is attained at one of the extreme points of the polytope.
According to Caratheodory's theorem the extreme points of the polytope have at most $d + 1$ non-zero coordinates. By compactness of $C$ and continuity of the Shannon entropy function the infimum is achieved. Therefore assume that $H(s) = -\sum_{i=0}^{d} p_i \ln(p_i)$. We will show that any pair of states $s_i$ and $s_j$ are orthogonal. Without loss of generality we may assume that $i = 0$ and $j = 1$. Now [...] We will prove that $s_0$ and $s_1$ are singular in the smallest face containing $s$.
Without loss of generality we may assume that $p_0 + p_1 = 1$ and that $s$ is an algebraically interior point.
The proof is by induction on $d$. If $d = 1$ the result is trivial. Assume that the theorem has been proved for $d < n$ and that $C$ has dimension $n$. Let $\tilde{C}$ be the intersection of $C$ and a hyperplane through $s_0$ and $s_1$. Then there exists a function $\psi : \tilde{C} \to [0, 1]$ such that $\psi(s_i) = i$. Let $\ell$ denote the subset of the hyperplane where $\psi(x) = 0$. Then $C$ can be projected into a 2-dimensional vector space along $\ell$. Therefore we just have to prove the result for $d = 2$.
Introduce a coordinate system so that $s_0 = (0, 0)$ and [...]
Orthogonal decompositions are only unique when the convex set is a simplex.
Definition 3 If all orthogonal decompositions of a state have the same spectrum then the common spectrum is called the spectrum of the state and the state is said to be spectral. We say that the convex compact set $C$ is spectral if all states in $C$ are spectral. The entropy $H(C)$ of a convex set is defined as the supremum of the entropies of the states in the set. The entropic rank of a set is defined as $\exp(H(C))$.
Proposition 4 For a spectral set the entropic rank equals the maximal number of orthogonal states.
Proof. Assume that the maximal number of orthogonal states is $n$. Then any state can be written as a mixture of at most $n$ orthogonal states, and a mixture of at most $n$ states has entropy at most $\ln n$. The uniform distribution on $n$ states has entropy $\ln n$, so the entropic rank is $\exp(\ln n) = n$.
Example 5 In the unit square with $(0, 0)$, $(1, 0)$, $(0, 1)$ and $(1, 1)$ as vertices the point with coordinates $(1/2, 1/4)$ has entropy $\frac{3}{2} \ln 2$. The entropic rank of the set is $8^{1/2}$.
A spectral set of entropic rank 2 is symmetric around a central point and all boundary points are extreme. Two states in the set are orthogonal if and only if they are antipodal. Any state can be decomposed into two antipodal states. If the state is not the centre this is the only orthogonal decomposition. The centre can be decomposed into a $1/2$ and $1/2$ mixture of any pair of antipodal points.
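The entropy value in Example 5 can be checked numerically. The following Python sketch is only an illustration: the function names `shannon_entropy` and `square_entropy` are hypothetical and not taken from the paper. It uses the observation that every decomposition of a point $(x, y)$ of the unit square into the four vertices is described by a single parameter $t$, the weight on the vertex $(1, 1)$, and that a concave function on a line segment attains its minimum at an endpoint.

```python
import math

def shannon_entropy(p):
    """Shannon entropy (natural logarithm) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0)

def square_entropy(x, y):
    """Entropy of the point (x, y) in the unit square.

    Weights on the vertices (0,0), (1,0), (0,1), (1,1) must be
    (1 - x - y + t, x - t, y - t, t) with t in [max(0, x+y-1), min(x, y)].
    The entropy is concave in t, so only the two endpoints are checked.
    """
    lo = max(0.0, x + y - 1.0)
    hi = min(x, y)
    values = []
    for t in (lo, hi):
        weights = (1.0 - x - y + t, x - t, y - t, t)
        values.append(shannon_entropy(weights))
    return min(values)

print(square_entropy(0.5, 0.25), 1.5 * math.log(2))
print(square_entropy(0.5, 0.5), math.log(2))
```

Under these assumptions the first line prints two numbers that agree up to rounding, matching the value $\frac{3}{2} \ln 2$ stated in Example 5, while the centre of the square only has entropy $\ln 2$.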

Proposition 6 In two dimensions a simplex and a balanced set are the only spectral sets.

Concavity of entropy in Jordan algebras
The density matrices with complex entries play a crucial role in the mathematical theory of quantum mechanics, and it is well known that the density matrices form a spectral set. For each density matrix the spectrum equals the usual spectrum calculated as the roots of the characteristic polynomial. In the 1930s Jordan generalized the notion of a Hermitean complex matrix to the notion of a Jordan algebra in an attempt to provide an alternative to complex Hilbert spaces as the mathematical basis of quantum mechanics.
A Euclidean Jordan algebra is an algebra with composition $\bullet$ that is commutative and satisfies the Jordan identity
$$(x \bullet y) \bullet (x \bullet x) = x \bullet (y \bullet (x \bullet x)) .$$
Further it is assumed that $\sum_{i=1}^{n} x_i^2 = 0$ implies that $x_i = 0$ for all $i$. In a Euclidean Jordan algebra we write $x \geq 0$ if $x$ is a sum of squares.
For an element $x$ in a real division algebra, $\mathrm{tr}(x)$ is defined as the real part of $x$, so that $\mathrm{tr}(xy) = \mathrm{tr}(yx)$. For a matrix $(M_{mn})$ the trace $\mathrm{Tr}$ is defined by $\mathrm{Tr}(M) = \mathrm{tr}\left(\sum_n M_{nn}\right)$. Then $\mathrm{Tr}(MN) = \mathrm{Tr}(NM)$.
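The identity $\mathrm{tr}(xy) = \mathrm{tr}(yx)$ can be illustrated for the quaternions, where the product itself is not commutative. The sketch below is a minimal illustration; the helper names `qmul` and `tr` are hypothetical, and a quaternion is represented as a tuple (real, i, j, k).

```python
def qmul(p, q):
    """Hamilton product of two quaternions given as (real, i, j, k) tuples."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    )

def tr(q):
    """Real part of a quaternion."""
    return q[0]

x = (1, 2, 3, 4)
y = (5, 6, 7, 8)
print(qmul(x, y))  # (-60, 12, 30, 24)
print(qmul(y, x))  # (-60, 20, 14, 32)
print(tr(qmul(x, y)) == tr(qmul(y, x)))  # True
```

The two products differ in their imaginary parts, but their real parts coincide, which is all that the definition of the trace requires.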
Any finite dimensional Euclidean Jordan algebra has a trace, and we may define the density operators as the positive elements with trace 1. Then the density operators of a Jordan algebra form a spectral set. The complex Hermitean operators form a Jordan algebra with the composition $x \bullet y = \frac{1}{2}(xy + yx)$. Any Jordan algebra can be decomposed into 5 types leading to the following convex sets:
Real: Density matrices over the real numbers.
Complex: Density matrices over the complex numbers.
Quaternionic: Density matrices over the quaternions.
Octonionic: Density matrices of size $3 \times 3$ over the octonions.
Spin type: A unit ball in $d$ dimensions.
See [11] for general results on Jordan algebras and [12] for details about quantum mechanics described using quaternions.
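As a small sanity check of the composition $x \bullet y = \frac{1}{2}(xy + yx)$, the following Python sketch verifies commutativity and the Jordan identity $(x \bullet y) \bullet (x \bullet x) = x \bullet (y \bullet (x \bullet x))$ for two complex Hermitean $2 \times 2$ matrices. The matrices and the helper names `matmul`, `jordan` and `max_abs_diff` are chosen for illustration only.

```python
def matmul(a, b):
    """Ordinary product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def jordan(x, y):
    """Jordan product (xy + yx) / 2."""
    xy, yx = matmul(x, y), matmul(y, x)
    n = len(x)
    return [[(xy[i][j] + yx[i][j]) / 2 for j in range(n)] for i in range(n)]

def max_abs_diff(a, b):
    """Largest absolute entrywise difference between two matrices."""
    return max(abs(u - v) for ra, rb in zip(a, b) for u, v in zip(ra, rb))

# Two Hermitean matrices (illustrative values).
x = [[1, 2 + 1j], [2 - 1j, -1]]
y = [[0, 1j], [-1j, 3]]

xx = jordan(x, x)
lhs = jordan(jordan(x, y), xx)
rhs = jordan(x, jordan(y, xx))

print(max_abs_diff(jordan(x, y), jordan(y, x)))  # commutativity
print(max_abs_diff(lhs, rhs))                    # Jordan identity
```

Both printed differences should vanish up to rounding. The check only covers the matrix case; it says nothing about the spin type or about matrices over non-associative division rings.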
The entropy is defined as for a general convex set, and we will prove that $H$ is a concave function on each of these sets of states. For the unit ball the entropy is centrally symmetric and is obviously concave. The following exposition is based on a similar result for complex matrix algebras stated in [13], but the proofs have been changed because neither commutativity nor associativity in the division ring is assumed.
Theorem 7 Let $A$ and $B$ denote Hermitean matrices. Assume that $A = \sum_{\ell} t_\ell E_\ell$ where the $E_\ell$ are orthogonal idempotents. If $f$ is a holomorphic function around the spectrum of $A$ then [...] where [...] for $t_m = t_n$.
Proof. Assume that $f(z) = z^r$. Then [...] As a consequence the theorem holds for any polynomial and also for any holomorphic function, because such functions can be approximated by polynomials.
Lemma 8 For Hermitean matrices $A$ and $B$ we have [...]
Proof. According to the previous theorem [...] and the equation has been proved.

Theorem 9 In a Jordan algebra the entropy of positive elements of trace one is a concave function.
Proof. Let $f$ denote the holomorphic function $f(z) = -z \ln z$, $z > 0$. We have to prove that $\mathrm{Tr}(f(\cdot))$ is concave [...] The second derivative can be calculated [...] for $t_m = t_n$.

Local Bregman divergences
Let $\mathcal{A}$ denote a subset of the feasible measurements such that $a \in \mathcal{A}$ maps $C$ into a distribution on the real numbers, i.e. a random variable. The elements of $\mathcal{A}$ may represent feasible actions (decisions) that lead to a payoff like the score of a statistical decision, the energy extracted by a certain interaction with the system, (minus) the length of a codeword of the next encoded input letter using a specific code book, or the revenue of using a certain portfolio. For each $s \in C$ we define $\langle a, s \rangle = E[a(s)]$ and $F(s) = \sup_{a \in \mathcal{A}} \langle a, s \rangle$.
Without loss of generality we may assume that the set of actions $\mathcal{A}$ is closed, so that there exists $a \in \mathcal{A}$ such that $F(s) = \langle a, s \rangle$, and in this case we say that $a$ is optimal for $s$. We note that $F$ is convex, but $F$ need not be strictly convex.
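To make the construction of $F$ concrete, the following Python sketch treats a toy case that is an assumption of this illustration and not part of the paper: the state space is the simplex of distributions on two outcomes, and each action is a payoff vector, so that $\langle a, s \rangle$ is an ordinary expectation. The names `expected_payoff`, `F`, `optimal_actions` and the particular action set are hypothetical.

```python
def expected_payoff(a, s):
    """<a, s>: expected payoff of action a when the state is s."""
    return sum(ai * si for ai, si in zip(a, s))

def F(s, actions):
    """F(s) = sup over actions of <a, s> (a maximum for a finite action set)."""
    return max(expected_payoff(a, s) for a in actions)

def optimal_actions(s, actions, tol=1e-12):
    """All actions that achieve F(s) up to a numerical tolerance."""
    best = F(s, actions)
    return [a for a in actions if best - expected_payoff(a, s) <= tol]

# Three illustrative actions: bet on outcome 0, bet on outcome 1, play safe.
ACTIONS = [(1.0, -1.0), (-1.0, 1.0), (0.2, 0.2)]

s1 = (0.9, 0.1)
s2 = (0.1, 0.9)
mid = tuple(0.5 * u + 0.5 * v for u, v in zip(s1, s2))

print(F(s1, ACTIONS), F(s2, ACTIONS), F(mid, ACTIONS))
print(optimal_actions(mid, ACTIONS))
```

Since $F$ is a maximum of affine functions it is convex, and here $F$ at the midpoint lies below the average of $F(s_1)$ and $F(s_2)$. The function is piecewise linear, which illustrates that $F$ need not be strictly convex.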
Definition 10 If $F(s)$ is finite the regret of the action $a$ is defined by
$$D_F(s, a) = F(s) - \langle a, s \rangle .$$
If the state is $s_1$ but one acts as if the state were $s_2$, one suffers a regret that equals the difference between what one achieves and what could have been achieved.
Definition 11 If $F(s_1)$ is finite the regret of the state $s_2$ is defined as
$$D_F(s_1, s_2) = \inf_{a} D_F(s_1, a)$$
where the infimum is taken over actions $a$ that are optimal for $s_2$.
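For logarithmic score the regret of Definitions 10 and 11 reduces to information divergence. The sketch below assumes, for illustration only, that the states are probability vectors $P$, that an action is a strictly positive probability vector $Q$, and that the payoff of outcome $i$ is $\ln q_i$. By Gibbs' inequality the optimal action for $P$ is then $Q = P$, and it is unique when $P$ is strictly positive, so the regret of the state $Q$ equals the regret of the action $Q$. The function names are hypothetical.

```python
import math

def expected_payoff(q, p):
    """<a, s>: expected logarithmic score of action q when the state is p."""
    return sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def F(p):
    """F(s) = sup_a <a, s>; for logarithmic score the supremum is at q = p."""
    return expected_payoff(p, p)

def regret(p, q):
    """D_F(s, a) = F(s) - <a, s> with state p and action q."""
    return F(p) - expected_payoff(q, p)

def kl(p, q):
    """Information divergence D(P || Q) with natural logarithm."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = (0.5, 0.3, 0.2)
q = (0.2, 0.5, 0.3)
print(regret(p, q), kl(p, q))
print(regret(p, p))
```

In this setting $F(P)$ is minus the Shannon entropy of $P$, and the two numbers on the first output line should agree up to rounding.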
The notion of sufficiency for Bregman divergences has been introduced in [14] and [15]. It was shown in [15] that the sufficiency condition for a Bregman divergence on the simplex of distributions on an alphabet that is not binary determines the divergence except for a multiplicative factor.
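Definition 12 Let $C$ denote a convex set, let $(s_\theta)_\theta$ denote a family of states in $C$, and let $\Phi : C \to C$ denote some affine map. Then $\Phi$ is said to be sufficient for the family of states $s_\theta$ if there exists an affine transformation $\Psi : C \to C$ such that $\Psi(\Phi(s_\theta)) = s_\theta$ for all $\theta$. The regret function $D_F$ is said to satisfy the sufficiency condition if
$$D_F(\Phi(s_1), \Phi(s_2)) = D_F(s_1, s_2)$$
for any states $s_1, s_2 \in C$ and any affine transformation $\Phi : C \to C$ that is sufficient for $s_1, s_2$.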
Recently it has been proved that divergence on a complex Hilbert space is decreasing under positive trace preserving maps [16,17]. Therefore information divergence satisfies the sufficiency condition on complex Hilbert spaces. Hence sufficiency is also satisfied on real Hilbert spaces. It is not known if sufficiency holds on more general Jordan algebras, so we introduce a weaker condition called locality.

Definition 13 The regret function $D_F$ is said to be local if
$$D_F(s_0, (1 - p) s_0 + p s_1) = D_F(s_0, (1 - p) s_0 + p s_2)$$
for all $p \in (0, 1)$ when $s_1$ and $s_2$ are states that are orthogonal to $s_0$.
Proposition 14 Let $C$ denote a spectral convex set. Then the Bregman divergence generated by the entropy is local.
Proof. Assume that $s = (1 - p) s_0 + p s_1$ where $s_0$ and $s_1$ are orthogonal. Then one can make orthogonal decompositions [...] Then [...] which does not depend on $s_1$ as long as $s_1$ is orthogonal to $s_0$.
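On a simplex the statement of Proposition 14 can be checked directly. The sketch below assumes that the state space is the simplex of distributions on four outcomes, where two states are orthogonal exactly when they have disjoint supports, and that the Bregman divergence generated by the entropy is information divergence. Locality is read here as the property stated at the end of the proof: the divergence from $s_0$ to $(1 - p) s_0 + p s_1$ does not depend on which state $s_1$ orthogonal to $s_0$ is used. All names and numbers are illustrative.

```python
import math

def kl(p, q):
    """Information divergence D(P || Q) with natural logarithm."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mix(t, a, b):
    """The mixture (1 - t) * a + t * b of two probability vectors."""
    return [(1 - t) * x + t * y for x, y in zip(a, b)]

s0 = [0.5, 0.5, 0.0, 0.0]
s1 = [0.0, 0.0, 0.3, 0.7]   # support disjoint from s0
s2 = [0.0, 0.0, 0.9, 0.1]   # another state with support disjoint from s0
p = 0.25

d1 = kl(s0, mix(p, s0, s1))
d2 = kl(s0, mix(p, s0, s2))
print(d1, d2, -math.log(1 - p))
```

Under these assumptions all three printed numbers coincide up to rounding: the divergence equals $-\ln(1 - p)$ regardless of the choice of the orthogonal state.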
Proposition 15 If the regret function $D_F$ on a convex set satisfies the sufficiency condition then it is local.
Proposition 16 Let $C$ denote a spectral convex set of entropic rank 2. Then the convex set is balanced and any Bregman divergence is local.
The following lemma follows from Alexandrov's theorem. See [18] Theorem 25.5 for details.
Lemma 17 A convex function on a finite dimensional convex set is differentiable almost everywhere with respect to the Lebesgue measure.
Theorem 18 Let $C$ be a convex set with at least three orthogonal states. If a regret function $D_F$ defined on $C$ is local then it is a Bregman divergence generated by the entropy times some constant.
Proof. In the following proof we will assume that the regret function is based on a convex differentiable function $F : C \to \mathbb{R}$. According to the previous lemma the argument works almost everywhere, so the same proof will work even if it is not assumed that $F$ is differentiable. Details about the case when $F$ is not assumed to be differentiable are omitted.
Let $K$ denote the convex hull of a set $s_0, s_1, \dots, s_n$ of orthogonal states. Let $f_i$ denote the function [...] Since $F$ is differentiable almost everywhere on $K$, continuity implies that the equality must hold for all distributions $P$. As a function of $Q$ it has its minimum when $Q = P$. We have [...] where $x + z = y + w$. We also have [...] so that $f_{ij}$ is convex. Therefore $f_{ij}$ is differentiable from the left and from the right. We have [...] with equality when $\epsilon = 0$. We differentiate with respect to $\epsilon$ from the right [...] which in combination with the previous inequality implies that $f'_{ij-}(y) = f'_{ij+}(y)$, so that $f_{ij}$ is differentiable. Since $f_i = f_{ij} + f_{ik} - f_{jk}$ the function $f_i$ is also differentiable.
We have [...] We have the condition $\sum_i q_i = 1$, so using Lagrange multipliers we get that there exists a constant $c_K$ such that $p_i \cdot f_i'(p_i) = c_K$. Hence $f_i'(p_i) = c_K / p_i$, so that $f_i(p_i) = c_K \cdot \ln(p_i) + m_i$ for some constant $m_i$. Therefore [...] Therefore there exists an affine function $g_K$ defined on $K$ such that [...] If $K \cap L$ has dimension greater than zero then the right hand side is affine, so the left hand side is affine, which is only possible when $c_K = c_L$. Therefore we also have $g_L(x) = g_K(x)$ for all $x \in K \cap L$. Therefore the functions $g_K$ can be extended to a function on the whole of $C$.
A careful inspection of the previous proof reveals that a convex set with a local Bregman divergence must be spectral. The notion of a spectral set is related to self-duality of the cone of positive elements, which leads to the following conjecture.
Conjecture 19 If a finite dimensional convex compact set has a local regret function and has a transitive symmetry group then the convex set can be represented as positive elements of a Jordan algebra with trace 1.