Generalised exponential families and associated entropy functions

A generalised notion of exponential families is introduced. It is based on the variational principle borrowed from statistical physics. It is shown that inequivalent generalised entropy functions lead to distinct generalised exponential families. The well-known result that the inequality of Cramér and Rao becomes an equality in the case of an exponential family can be generalised. However, this requires the introduction of escort probabilities.


Introduction
Generalised entropy functions have been studied intensively in the second half of the past century. They have been called quasi-entropies in [10]. Every entropy function is in fact minus a relative entropy, also called a divergence. It is relative to some reference measure c. Consider the f-divergence [3] with f(u) a convex function defined for u > 0 and strictly convex at u = 1. It is minus the entropy of p, relative to c. Taking c_a = 1 for all a and f(u) = u ln u one obtains the Boltzmann-Gibbs-Shannon entropy. Note that throughout the paper discrete probabilities are considered, with events a belonging to a finite or countable alphabet A.

Recent interest in these generalised entropies within statistical physics goes back to the introduction by Tsallis [14] of the q-entropy with q > 0. In the limit q → 1 it converges to (2). It had been studied before in the mathematics literature by Havrda and Charvát [5], and by Daróczy [4]. Investigations within the physics community have led to some interesting developments. One of them is the introduction of deformed logarithmic and exponential functions [15,6] (see Section 13). They have been very useful for generalising common concepts, such as that of an exponential family or of a Gaussian distribution. They have also helped to clarify the pitfalls of the generalisation process. One of the surprises is the necessity of introducing escort probability functions [17] (see Section 11). In a series of papers, including [7,8], the present author has elaborated a formalism based on deformed logarithms. In the present work it is shown that slightly more general results are obtained when these deformed logarithms are abandoned.
In Sections 2 to 6 the maximum entropy principle and the variational principle are discussed in the context of generalised entropies. In particular, a characterisation of the maximising probability distributions is given. This is used in Section 7 to define a generalised exponential family. In Section 8 it is shown that the intersection of distinct generalised exponential families is empty and that there exists a one-to-one relation with generalised entropy functions. Sections 9 to 12 discuss geometric aspects, starting with concepts from thermodynamics and introducing escort families and a generalised Fisher information matrix. Sections 13 and 14 discuss non-extensive thermostatistics and the percolation problem as examples of the generalised formalism. The paper ends with a short discussion in Section 15.

Generalised entropies
Let us fix some further notation. The space of probability distributions is denoted M^+_1(A). Expectation values are denoted ⟨p, X⟩ = Σ_{a∈A} p_a X(a). Here we follow the physics tradition of putting the elements of the dual space on the l.h.s.
It is rather common to define a generalised entropy as any function I(p) of the form where h(u) is a continuous strictly concave function, defined on [0, 1], which vanishes when u = 0 or u = 1. This is a special case of minus the f-divergence (1), with weights c_a = 1. The entropy function I(p) is defined for any p ∈ M^+_1(A) and takes values in [0, +∞]. In the present paper the function h(u) is allowed to be stochastic, that is, to depend also on a in A. For convenience of notation, however, this dependence will not be made explicit. Throughout the paper it is assumed that the derivative exists on the interval (0, 1) and defines a continuous function on the half-open interval (0, 1]. Because h(u) is strictly concave, f(u) is strictly increasing. Note that it is allowed to diverge to −∞ at u = 0. This is indeed the case when h(u) = −u ln u and f(u) = 1 + ln u.
The function f(u) can be used to rewrite the entropy I(p) as Note that the latter expression implies that The standard definition of the Bregman divergence [2] reads In the case that f(u) diverges at u = 0 it is well defined only when q_a = 0 implies p_a = 0. It is a convex function of its first argument. Note that one can write From the latter expression it is immediately clear that D(p||q) ≥ 0, with equality if and only if p = q.
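As a concrete check, the following sketch evaluates the Bregman divergence in the Boltzmann-Gibbs-Shannon case G(u) = u ln u, with derivative f(u) = 1 + ln u, and verifies that it reduces to the Kullback-Leibler divergence for normalised distributions. The helper names are hypothetical, and the explicit form D(p||q) = Σ_a [G(p_a) − G(q_a) − (p_a − q_a) f(q_a)] is the standard Bregman convention assumed here.

```python
import math

def bregman_div(p, q, G, f):
    # D(p||q) = sum_a [G(p_a) - G(q_a) - (p_a - q_a) f(q_a)], with f = G'
    return sum(G(pa) - G(qa) - (pa - qa) * f(qa) for pa, qa in zip(p, q))

G = lambda u: u * math.log(u)      # minus the BGS entropy density
f = lambda u: 1.0 + math.log(u)    # its derivative

p = [0.2, 0.3, 0.5]
q = [0.25, 0.25, 0.5]
kl = sum(pa * math.log(pa / qa) for pa, qa in zip(p, q))
assert abs(bregman_div(p, q, G, f) - kl) < 1e-12
```

The cross terms −q ln q − (p − q) ln q collapse to −p ln q, so the sum equals the KL divergence plus Σ(q_a − p_a), which vanishes for normalised p and q.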

Maximum entropy principle
Let there be given a finite number of real functions H_1(a), H_2(a), ..., H_n(a). Assume they are bounded from below. In a physical context these functions may be called Hamiltonians.
The maximum entropy problem deals with finding the probability distribution p that maximises I(p) under the constraint that the expectation values of the Hamiltonians H j attain given values U j , called energies. Introduce the notation Then one looks for the probability distribution p ∈ P U which maximises I(p).
Definition 1 A probability distribution p* ∈ P_U is said to satisfy the maximum entropy principle if it satisfies

In what follows a stronger condition is needed. It was introduced some 40 years ago [11] (see Theorem 7.4.1 of [12]) and is in fact a stability criterion.
Definition 2 A probability distribution p* is said to satisfy the variational principle if there exist parameters θ_1, θ_2, ..., θ_n such that

In statistical physics, a probability distribution satisfying the variational principle is called an equilibrium state.

Lagrange multipliers
A popular way to solve the maximum entropy problem is by the introduction of Lagrange multipliers. However, a difficulty arises, known as the cutoff problem: it may happen that some of the probabilities p_a of the optimising probability distribution vanish. Let us see how this problem arises. The Lagrangian reads Here, α is the parameter introduced to fix the normalisation condition Σ_{a∈A} p_a = 1, and the θ_j are introduced to cope with the constraints (10). Variation of L w.r.t. the p_a yields

The problem is that the r.h.s. of this expression need not belong to the range of the function f(u). This situation is particularly likely to occur when f(u) does not tend to −∞ as u tends to 0. If the r.h.s. is in the range of f(u), then p_a is determined uniquely by (14), because of the assumption that f(u) is a strictly increasing function.

The above problem is well known in optimisation theory. Because the constraints defining P_U are affine, the set P_U forms a simplex. Its faces are obtained by putting some of the probabilities p_a equal to zero. Because the entropy function I(p) is concave, it attains its maximum within one of these faces. This observation leads to the ansatz that the probability distribution p which maximises I(p) over P_U, if it exists, is determined by a subset A_0 = {a ∈ A : p_a = 0} and by the values of the parameters α and θ_j, which determine the remaining probabilities via (14). Let us now try to prove this statement.
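The cutoff mechanism can be illustrated numerically. The sketch below uses, purely as an assumed example, the quadratic entropy h(u) = u(1 − u), for which f(u) = 1 − 2u does not diverge at u = 0, together with a stationarity condition of the form f(p_a) = α − θH(a); the sign convention and the function name are assumptions, not taken from the source.

```python
def maxent_quadratic(H, theta, lo=-10.0, hi=10.0):
    """Maximise the quadratic entropy I(p) = sum_a p_a (1 - p_a), for which
    f(u) = 1 - 2u stays finite at u = 0, at fixed multiplier theta.
    Assumed stationarity condition: f(p_a) = alpha - theta * H(a), hence
    p_a = (1 - alpha + theta * H(a)) / 2, truncated at zero (the cutoff)."""
    def probs(alpha):
        return [max(0.0, (1.0 - alpha + theta * h) / 2.0) for h in H]
    # Bisect on alpha: sum_a p_a is continuous and decreasing in alpha.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if sum(probs(mid)) > 1.0:
            lo = mid
        else:
            hi = mid
    return probs(0.5 * (lo + hi))

# With theta = -1 the two highest energy levels are cut off entirely.
p = maxent_quadratic([0.0, 1.0, 2.0, 3.0], -1.0)
assert abs(sum(p) - 1.0) < 1e-9
assert p[2] == 0.0 and p[3] == 0.0
```

Because f(0) = 1 is finite, the linear formula for p_a can fall below zero; truncating at zero is exactly the cutoff discussed above, and the normalisation is restored by adjusting α.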

Characterisation
Let us first consider the more familiar situation in which f(0) = −∞.

Lemma 1 Assume f(0) = −∞. If a probability distribution p* satisfies the variational principle, then p*_a > 0 for all a ∈ A.

Proof
We prove the contrapositive. Because of the normalisation, there exists at least one a ∈ A for which p*_a > 0. Assume there is a b ∈ A such that p*_b = 0. Let us show that this implies that p* does not satisfy the variational principle.
Fix 0 < ε ≪ 1. Introduce a new probability distribution p which coincides with p* except that Let Then one has From the assumption f(0) = −∞ it then follows that This proves that p* does not satisfy the variational principle, because for ε sufficiently small M(ε) is strictly larger than M(0).
Theorem 1 A probability distribution p* satisfies the variational principle if and only if there exist α and θ_1, θ_2, ..., θ_n such that (14) holds for all a ∈ A.

Proof
First assume that p* satisfies (14). This implies that p*_a > 0 for all a ∈ A, because f(0) is not defined. Hence, the divergence D(p||p*) is well defined for all p. Next one calculates Because D(p||p*) ≥ 0, with equality if and only if p = p*, it follows that p* satisfies the variational principle.

Next assume that p* satisfies the variational principle (12). From the lemma it then follows that p*_a > 0 for all a ∈ A. Hence, the divergence D(p||p*) is well defined for all p ∈ M^+_1(A). It follows from the variational principle that Now, the function p → D(p||p*) is convex with continuous derivatives. The r.h.s. of the above expression is affine. Both l.h.s. and r.h.s. vanish for p = p*. One concludes that the r.h.s. is tangent to the convex function and must be identically zero, so that for all p This implies that f(p*_a) is of the form (14); take p_a = δ_{a,b} for some fixed b to see this.

The case with cutoff
Assume now that f(0) = lim_{u↓0} f(u) is finite. Then the divergence D(p||q) is well defined for any pair of probability distributions p, q.
Theorem 2 Assume that f(0) = lim_{u↓0} f(u) is finite. The following statements are equivalent:
1. p* satisfies the variational principle;
2. there exist parameters α and θ_1, θ_2, ..., θ_n, and a subset A_0 of A, such that

Note that the last condition expresses that the r.h.s. of (14) lies outside the range of f(u), because it takes a value less than f(0).

Proof

1) implies 2). As in the proof of the previous Theorem, one shows that (20) holds for all p. But now one cannot conclude (21), because some of the p*_a may vanish, so that p* lies in one of the faces of the simplex M^+_1(A). One can still derive (14) for all a for which p*_a ≠ 0. Assume now that p*_a = 0 for some given a ∈ A. Let Then the l.h.s. of (20) becomes On the other hand, the r.h.s. of (20) becomes From the inequality (20) it then follows that This implies the desired inequality because

2) implies 1). One calculates The variational principle now follows using the third assumption of the Theorem.

Statistical models
In the definition of the variational principle there is given a set of Hamiltonians H_1(a), H_2(a), ..., H_n(a), that is, real functions on the alphabet A, bounded from below. The equilibrium distribution p* is then characterised by a normalisation constant α, by parameters θ_1, θ_2, ..., θ_n, and by a subset A_0 of the alphabet A (see (14)). The emphasis now shifts towards these parameters.
Theorem 3 Let there be given Hamiltonians H_1(a), H_2(a), ..., H_n(a). For each θ in R^n there exists at most one probability distribution p* satisfying the variational principle (12) with these parameters θ.

Proof
If p* and q* both satisfy the variational principle (12) with the same parameters θ, then the convex combination r* = ½p* + ½q* has the same property, because the entropy function is concave. But then one can conclude from the inequalities (12) that I(r*) = ½I(p*) + ½I(q*). Because the entropy function is strictly concave, it follows that p* = q*.
The set of θ for which a p * exists, satisfying the variational principle (12), is denoted D. The probability distribution is denoted p θ instead of p * . The constant α appearing in (14) is replaced by α(θ).
A statistical model is a parametrised set of probability distributions. The above Theorem implies that the set (p θ ) θ∈D , of probability distributions satisfying the variational principle, is a statistical model. One can say that such a model belongs to the generalised exponential family.
Definition 3 Let there be given a generalised entropy function I(p) of the form (4). A statistical model (p_θ)_{θ∈D} belongs to the generalised exponential family if there exist real functions H_1(a), H_2(a), ..., H_n(a), bounded from below, such that each member p_θ of the model satisfies the variational principle (12) with these Hamiltonians and with the corresponding parameters θ.
Clearly, entropy functions which differ only by a scalar factor determine the same generalised exponential family.

Uniqueness theorem
Let us now turn to the question whether a given model (p_θ)_{θ∈D} can belong to two different generalised exponential families.

Theorem 4 Assume the technical condition (28). If the model belongs to two different generalised exponential families, one with entropy function I_1(p), the other with entropy function I_2(p), then there exists a constant λ such that I_2(p) = λI_1(p) for all p.

Proof
Take any point u in (0, 1) and a corresponding θ ∈ D_0 and a such that p_{θ,a} = u. From the previous theorems it follows that there exist functions α_i(θ) and Hamiltonians H_{i1}(a), H_{i2}(a), ..., H_{in}(a), with i = 1, 2, such that Let F_a = f_{2,a} ∘ f^{−1}_{1,a}. Note that this is a strictly increasing continuous function. Then one has This relation holds also on a vicinity of θ ∈ D_0. It therefore implies the existence of λ_a and K_{i,j} such that Then one can rewrite (30) as with valid on some neighbourhood of the given θ. Using the definition of F_a(v) one obtains valid on some neighbourhood of the given u ∈ (0, 1). Because u is arbitrary and the functions f_{i,a} are continuous, the same expression must hold on all of (0, 1]. From 0 = h_{i,a}(0) = h_{i,a}(1), so that ∫_0^1 du f_{i,a}(u) = 0, it now follows that γ_a(θ) = 0. Therefore (33) becomes In particular, λ_a does not depend on a ∈ A. One therefore concludes that there exists a λ such that f_{2,a}(u) = λf_{1,a}(u). This implies I_2(p) = λI_1(p).

Thermodynamics
Throughout this Section, let there be given a statistical model (p_θ)_{θ∈D} belonging to the generalised exponential family. Note that if p_θ and p_η both belong to the same set P_U then they satisfy I(p_θ) = I(p_η). Hence, a function S(U) can be defined by S(U) = I(p_θ) whenever ⟨p_θ, H_j⟩ = U_j for j = 1, 2, ..., n.
This function is called the thermodynamic entropy. The concept of thermodynamic entropy was first introduced by Clausius around 1850. The Legendre transform of the thermodynamic entropy is given by This function was introduced by Massieu in 1869. The supremum is taken over all U for which S(U) is defined by (36). The function is convex; this is a well-known property of Legendre transforms.

Proposition 1 One has
Proof Given θ ∈ D there exists p_θ for which the variational principle holds. Then one has, with U_j = ⟨p_θ, H_j⟩, This proves the inequality in one direction. Next, fix ε > 0 and let U, with S(U) defined by (36), be such that Then it follows from the definition of S(U) that there exists η ∈ D such that S(U) = I(p_η) with ⟨p_η, H_j⟩ = U_j, j = 1, 2, ..., n. The variational principle now implies that Because ε > 0 is arbitrary, the inequality in the other direction follows.
The inverse Legendre transformation reads It is a concave function.
Proposition 2 The double Legendre transform reproduces S(U) for all U for which S(U) is defined by (36).

Proof
From the definition of the Massieu function Φ(θ) it follows that This implies the first inequality. On the other hand, from the definition (36) of S(U) it follows that where θ is such that p_θ ∈ P_U. This yields the inequality in the other direction. The two inequalities together establish the desired equality.

Thermodynamic relations
As in the previous Section, there is given a statistical model (p_θ)_{θ∈D} belonging to the generalised exponential family. In addition, let D_0 be an open subset of D on which the map θ → ⟨p_θ, H_j⟩ is continuous.
The following results are typical properties of Legendre transforms. For completeness, proofs are given.

Proposition 3
The first derivative of the Massieu function Φ(θ) exists for θ in D_0. It satisfies

Proof From the definitions one has, for θ and θ + η in D_0, and Expression (45) now follows using the continuity of the map θ → ⟨p_θ, H_j⟩.
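In the classical Boltzmann-Gibbs-Shannon case, where Φ(θ) = ln Σ_a e^{−θH(a)} and p_{θ,a} ∝ e^{−θH(a)} (a sign convention assumed here, not stated in this form in the source), Proposition 3 reduces to the familiar identity ∂Φ/∂θ = −⟨p_θ, H⟩, which can be checked by finite differences:

```python
import math

def massieu(theta, H):
    # Phi(theta) = ln sum_a exp(-theta * H(a)), the BGS Massieu function
    return math.log(sum(math.exp(-theta * h) for h in H))

def mean_energy(theta, H):
    # <p_theta, H> for p_theta,a proportional to exp(-theta * H(a))
    z = sum(math.exp(-theta * h) for h in H)
    return sum(h * math.exp(-theta * h) for h in H) / z

H = [0.0, 1.0, 3.0]
theta, eps = 0.7, 1e-6
dPhi = (massieu(theta + eps, H) - massieu(theta - eps, H)) / (2 * eps)
assert abs(dPhi + mean_energy(theta, H)) < 1e-6
```

This is a sketch of the classical special case only; the generalised statement of Proposition 3 holds for every member of a generalised exponential family.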
Introduce the metric tensor Because the Massieu function Φ(θ) is convex, the matrix g(θ) is positive definite whenever it exists. By the previous Proposition one has for those θ in D_0 for which the derivative exists. In thermodynamics, the derivative of S(U) equals the inverse of the absolute temperature T. Here, the analogous property becomes

Proposition 4 Let θ ∈ D_0 and define U by U_j = ⟨p_θ, H_j⟩. Then one has ∂S/∂U_j = θ_j, j = 1, 2, ..., n.
Proof On a vicinity of θ one has S(U) = Φ(θ) + Σ_{j=1}^n θ_j U_j. Hence, one can write But the first term on the r.h.s. vanishes because of the previous Proposition. Hence, the desired result follows.
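Proposition 4 can be illustrated in the same classical Boltzmann-Gibbs-Shannon setting. For a two-level system with H = (0, 1) and energy U, the equilibrium distribution is p = (1 − U, U), so S(U) = −(1 − U) ln(1 − U) − U ln U, while the multiplier reproducing U is θ = ln((1 − U)/U). A finite-difference sketch of ∂S/∂U = θ (classical special case, not the generalised setting):

```python
import math

def S(U):
    # BGS entropy of a two-level system with H = (0, 1) and energy U
    return -(1 - U) * math.log(1 - U) - U * math.log(U)

U, eps = 0.3, 1e-6
dS = (S(U + eps) - S(U - eps)) / (2 * eps)
theta = math.log((1 - U) / U)   # multiplier for which <p_theta, H> = U
assert abs(dS - theta) < 1e-6
```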
The two relations (45) and (50) are dual in the sense of Amari [1]. In thermodynamics, the entropy S(U) and Massieu's function Φ(θ) are state functions, the energies U_j are extensive thermodynamic variables, and the parameters θ_j are intensive thermodynamic variables.

Escort probabilities
Let us now make the additional assumption that the function f(u), which enters the definition (6) of the generalised entropy, has a derivative f′(u). Because f(u) was supposed to be strictly increasing, one can write where φ(v) = 1/(df/dv) is a strictly positive function.

As before, there is given a statistical model (p_θ)_{θ∈D} belonging to the generalised exponential family, and D_0 is an open subset of D on which the map θ → ⟨p_θ, H_j⟩ is continuous. The set A_0(θ) is the set of a ∈ A for which p_{θ,a} = 0. From Theorems 1 and 2 it now follows that This expression was used in [8] as a condition under which a generalisation of the well-known bound of Cramér and Rao is optimal. An immediate consequence of (53) is

Proposition 5 Assume the regularity condition Assume in addition that where Σ′ denotes the sum over all a ∈ A \ A_0(θ). Then one has

Proof On a vicinity of the given θ one has (53). Hence, by summing (53) over a ∈ A \ A_0(θ) one obtains, using (54),

The probability distribution P_{θ,a} = φ(p_{θ,a})/z(θ), for p_{θ,a} ≠ 0, when it exists, is called the escort of the exponential family (p_θ)_{θ∈D}. With this notation, one can write the result of the Proposition as
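A minimal sketch of the escort construction (58), assuming the choice φ(u) = u^q that appears in Section 13; the function name is hypothetical.

```python
def escort(p, q):
    # Escort of p for phi(u) = u**q: P_a = p_a**q / z with z = sum_b p_b**q
    w = [pa ** q for pa in p]
    z = sum(w)
    return [wa / z for wa in w]

p = [0.1, 0.3, 0.6]
P = escort(p, 2.0)
assert abs(sum(P) - 1.0) < 1e-12
# q -> 1 recovers the original distribution
P1 = escort(p, 1.0)
assert all(abs(a - b) < 1e-12 for a, b in zip(P1, p))
# q > 1 accentuates the larger probabilities
assert P[2] > p[2]
```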

Generalised Fisher information
Let there be given a model (p_θ)_{θ∈D} for which z(θ), as given by (55), converges. The escort probabilities P_{θ,a} are defined by (58). Then one can define a generalised Fisher information matrix by where the score variables are defined by Note that in the standard case h(u) = −u ln u one has φ(u) = u, so that the escort probabilities P_θ coincide with the p_θ. Then (60) reduces to the conventional definition.

Fix now a set of Hamiltonians H_1(a), H_2(a), ..., H_n(a). Then one can define a covariance matrix σ(θ) by

Proposition 6 Assume a finite alphabet A. Then one has

Proof From (53) it follows that for all θ ∈ D_0 and a ∈ A \ A_0(θ). Hence, the Fisher information matrix becomes Using (59) it follows that I_{i,j}(θ) = z²(θ)σ_{i,j}(θ). On the other hand, from (49) and (53) it follows that Using (56) one obtains g_{i,j}(θ) = z(θ)σ_{i,j}(θ).
The assumption of a finite alphabet is made to ensure that the conditions of Proposition 5 are fulfilled and that the sum and derivative may be interchanged in (66).
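In the standard case φ(u) = u one has z(θ) = 1 and the escort coincides with p_θ, so Proposition 6 reduces to the classical statement that the Fisher information of a one-parameter exponential family equals the covariance of H. The sketch below verifies this numerically, again under the assumed convention p_{θ,a} ∝ e^{−θH(a)}.

```python
import math

def softmax_probs(theta, H):
    # p_theta,a proportional to exp(-theta * H(a))
    w = [math.exp(-theta * h) for h in H]
    z = sum(w)
    return [wa / z for wa in w]

H = [0.0, 1.0, 2.0]
theta, eps = 0.4, 1e-5
p = softmax_probs(theta, H)
# Fisher information from the score d/dtheta ln p_theta,a (finite differences)
pp = softmax_probs(theta + eps, H)
pm = softmax_probs(theta - eps, H)
scores = [(math.log(a) - math.log(b)) / (2 * eps) for a, b in zip(pp, pm)]
fisher = sum(pa * s * s for pa, s in zip(p, scores))
# Covariance of H under p_theta
mean = sum(pa * h for pa, h in zip(p, H))
var = sum(pa * (h - mean) ** 2 for pa, h in zip(p, H))
assert abs(fisher - var) < 1e-6
```

The score equals ⟨H⟩ − H(a) in this case, so its second moment is exactly the variance of H.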
The generalised inequality of Cramér and Rao, in the present notations, reads [8] with u and v arbitrary real vectors. The previous Proposition then implies that the inequality becomes an equality when u = v, when P is related to p via (58), and when p_θ belongs to a generalised exponential family.

Non-extensive thermostatistics
Define the q-deformed logarithm by [15,16]

ln_q(u) = (u^{1−q} − 1)/(1 − q).

It is a strictly increasing function, defined for u > 0. Indeed, its derivative equals u^{−q} > 0. In the limit q → 1 the q-deformed logarithm converges to the natural logarithm ln u.
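A short numerical sketch of the two properties just stated, the derivative u^{−q} and the q → 1 limit:

```python
import math

def ln_q(u, q):
    # q-deformed logarithm [15,16]; reduces to ln u as q -> 1
    if abs(q - 1.0) < 1e-12:
        return math.log(u)
    return (u ** (1.0 - q) - 1.0) / (1.0 - q)

u, q = 2.5, 0.7
# the q -> 1 limit recovers the natural logarithm
assert abs(ln_q(u, 1.0 + 1e-9) - math.log(u)) < 1e-6
# the derivative equals u**(-q)
eps = 1e-6
d = (ln_q(u + eps, q) - ln_q(u - eps, q)) / (2 * eps)
assert abs(d - u ** (-q)) < 1e-6
```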
The deformed logarithm can be used in more than one way to define an entropy function. The q-entropy (3) can be written as Comparison with (4) gives One has h(0) = h(1) = 0. Taking the derivative gives It is a strictly increasing function on (0, 1] when q > 0. The function φ(u) is given by The probability distributions belonging to the generalised exponential family corresponding with (70) are with [u]_+ = max{0, u}. This is indeed the kind of probability distribution discussed in the original paper of Tsallis [14]. However, more often used is the alternative of [17].
In the latter paper the concept of escort probability distributions was introduced into the literature. They were defined by which in the present notations corresponds with φ(u) proportional to u^q. This can be obtained by replacing the constant q by 2 − q in (70). The entropy function then reads which is not the expression that one would write down based on the information-theoretical argument that ln(1/p_a) is the amount of information (counted in units of ln 2) gained from an event occurring with probability p_a. Note that with this definition of the entropy function the condition q < 2 is needed in order to satisfy the requirement that the function f(u) = (d/du)(u ln_q(u)) is an increasing function.

The percolation problem
This example has been treated in [9]. It is a genuine example of an important model of statistical physics which does not belong to the exponential family. In addition, it is an example which fits into the present generalised context provided that one allows that the function h(u) appearing in the definition (4) of the generalised entropy function is stochastic.
In the site percolation problem [13], the points of a lattice are occupied with probability q, independently of each other. The point at the origin is either unoccupied, with probability p_∅, or it belongs to a cluster of shape i, with probability p_i. This cluster is finite with probability 1, provided that 0 ≤ q ≤ q_c, where q_c is the percolation threshold. The probability p_∞ that the origin belongs to an infinite cluster is strictly positive for q > q_c. However, for the sake of simplicity of the presentation, 0 < q < q_c will be assumed (see [9] for the general case).
These probabilities are given by where c_i is the number of distinct clusters of shape i, s(i) is the number of occupied sites in the cluster, and t(i) is the number of perimeter sites, that is, of unoccupied neighbouring sites. Note that (77) also holds when the origin is not occupied, provided that one sets c_∅ = 1, s(∅) = 0 and t(∅) = 1.
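The normalisation of (77) can be verified in the simplest case of the one-dimensional lattice, where a finite cluster of s occupied sites containing the origin can occupy c = s positions and always has t = 2 perimeter sites. This one-dimensional setting is an illustrative assumption of the sketch; the cited work [9] treats the general problem.

```python
def total_probability_1d(q, s_max=2000):
    """Sum of (77) over all cluster shapes on the 1D lattice, 0 < q < 1.
    A cluster of s sites containing the origin: c = s, t = 2.
    The empty event: c = 1, s = 0, t = 1, giving p_empty = 1 - q."""
    p_empty = 1.0 - q
    return p_empty + sum(s * q ** s * (1.0 - q) ** 2 for s in range(1, s_max))

# sum_{s>=1} s q^s = q / (1-q)^2, so the cluster terms contribute exactly q
assert abs(total_probability_1d(0.3) - 1.0) < 1e-12
```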

Choose the Hamiltonian
and introduce the parameter θ by Then one can write with This looks like an exponential family, except for the extra factor [s(i) + t(i)] in the r.h.s. Introduce the stochastic function Then the above expression is of the form (14). By integrating f_i(u) one obtains It is now straightforward to verify that the percolation problem belongs to a generalised exponential family. The relevant entropy function for the percolation model in the non-percolating region 0 < q < q_c is therefore

Discussion

Sections 3 to 6 of the present paper discuss the variational principle, which is stronger than the maximum entropy principle. It is shown that the method of Lagrange multipliers leads to the correct result, even in the context of generalised entropy functions. The difficulty that arises is known as the cutoff problem: the optimising probability distribution may assign vanishing probabilities to some of the events. To cope with this situation, two cases have been considered separately: Theorem 1 treats the standard case, Theorem 2 copes with the vanishing probabilities.

In Section 7, a generalised definition of an exponential family is given. It identifies the members of the generalised exponential family with the solutions of the variational principle, given a generalised entropy function of the usual form (4). The definition of the standard exponential family corresponds, of course, with the Boltzmann-Gibbs-Shannon entropy. Entropy functions I(p) and λI(p), with λ > 0, determine the same exponential family. Assuming some technical condition, the intersection of different generalised exponential families is empty (see Theorem 4). As a consequence, a one-to-one relation has been established between generalised exponential families and classes of equivalent entropy functions.
In [8], the notion of a phi-exponential family was introduced. The 'phi' in this name refers to the function φ(v) introduced in (52). It equals one divided by the derivative of the function f(v) appearing in the expression (6) for the entropy function I(p). The assumption that the derivative of f(v) exists for all v > 0 has been eliminated in the present paper. More importantly, the definition of a generalised exponential family is now given directly in terms of the entropy function I(p), via the variational principle, without relying on the notion of deformed exponential functions.
Sections 9 to 12 discuss the geometric properties of a generalised exponential family, using terminology coming from 150-year-old thermodynamics. The main result is (63), proving the equality of three quantities: the generalised Fisher information, the metric tensor multiplied by the partition sum z(θ), and the covariance matrix multiplied by z²(θ). The covariance matrix is calculated using the escort family of probability distributions.
Many applications of generalised exponential families are found in the literature, in the context of nonextensive thermostatistics. The latter has been discussed in Section 13. A completely different kind of example is found in percolation theory -see Section 14. It illustrates the possibility that the function f (u), which determines the entropy function I(p) via (6), is of a stochastic nature. One can expect that many other applications will be found in the near future.