Khinchin’s Fourth Axiom of Entropy Revisited

Abstract: The Boltzmann–Gibbs–Shannon (BGS) entropy is the only entropy form satisfying four conditions known as Khinchin’s axioms. The uniqueness theorem of the BGS entropy, plus the fact that Shannon’s mutual information completely characterizes independence between the two underlying random elements, puts the BGS entropy in a special place in many fields of study. In this article, the fourth axiom is replaced by a slightly weakened condition: an entropy whose associated mutual information is zero if and only if the two underlying random elements are independent. Under the weaker fourth axiom, other forms of entropy are sought by way of escort transformations. Two main results are reported in this article. First, there are many entropies other than the BGS entropy satisfying the weaker condition, yet retaining all the desirable utilities of the BGS entropy. Second, by way of escort transformations, the newly identified entropies are the only ones satisfying the weaker axioms.


Introduction and Summary
Let $X$ be a random element in a countable alphabet $\mathcal{X} = \{x_k; k \ge 1\}$, where $x_k$, $k \ge 1$, are distinct letters or labels with a probability distribution $\mathbf{p} = \{p_k; k \ge 1\} \in \mathcal{P}$, where $\mathcal{P}$ is the collection of all possible probability distributions on $\mathcal{X}$. Many random system properties of interest may be described by entropic quantities or entropies, that is, functions of $\mathbf{p}$ that are label-independent. For generality of discussion, letting $\mathbf{p}^{\downarrow} = \{p_{(k)}; k \ge 1\}$ be a non-increasingly ordered $\mathbf{p}$, a function of $\mathbf{p}$ that satisfies $H = H(\mathbf{p}) = H(\mathbf{p}^{\downarrow})$ for all $\mathbf{p} \in \mathcal{P}$ is referred to as an entropy. An entropy may also be denoted as $H(X)$ for notational simplicity. For example, the Boltzmann–Gibbs–Shannon entropy $H_{\mathrm{BGS}} = -\sum_{k \ge 1} p_k \ln p_k$, the Rényi entropy $H_R = (1 - \alpha)^{-1} \ln\left(\sum_{k \ge 1} p_k^{\alpha}\right)$ for $\alpha > 0$ and $\alpha \ne 1$, where $\alpha$ is a constant, and the Tsallis entropy $H_T = (\alpha - 1)^{-1}\left(1 - \sum_{k \ge 1} p_k^{\alpha}\right)$ for any $\alpha \ne 1$ are of high utility across many fields of study, such as statistical mechanics and information theory.
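As a numerical companion to these definitions (the code below is ours and merely illustrative; function names and the example distribution are arbitrary), the following Python sketch evaluates the three entropies in nats and illustrates that both $H_R$ and $H_T$ approach $H_{\mathrm{BGS}}$ as $\alpha \to 1$:

```python
import numpy as np

def bgs_entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # convention: 0 ln 0 = 0
    return -np.sum(p * np.log(p))

def renyi_entropy(p, alpha):          # alpha > 0, alpha != 1
    p = np.asarray(p, dtype=float)
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

def tsallis_entropy(p, alpha):        # alpha != 1
    p = np.asarray(p, dtype=float)
    return (1.0 - np.sum(p ** alpha)) / (alpha - 1.0)

p = [0.5, 0.25, 0.125, 0.125]
print(bgs_entropy(p))                               # 1.2130...
print(renyi_entropy(p, 2.0), tsallis_entropy(p, 2.0))
# As alpha -> 1, both H_R and H_T approach H_BGS:
print(renyi_entropy(p, 1.0001), tsallis_entropy(p, 1.0001))
```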
Ever since the information-theoretic utility of $H_{\mathrm{BGS}}$ was unlocked in [1], a large volume of research has been amassed in relation to $H_{\mathrm{BGS}}$. A considerable amount of the research effort has been placed in framing $H_{\mathrm{BGS}}$ in general forms. Many articles have been published under the theme of generalized entropy, but from different perspectives. One particular perspective is the axiomatic system studied by Khinchin in [2].
Let $(X, Y)$ be a pair of random elements on a joint countable alphabet, $\mathcal{X} \times \mathcal{Y} = \{(x_i, y_j); i \ge 1, j \ge 1\}$, where $(x_i, y_j)$ for $i \ge 1$ and $j \ge 1$ are distinct labels with a joint probability distribution, $\mathbf{p}_{X,Y} = \{p_{i,j}; i \ge 1, j \ge 1\}$, and the two marginal distributions are $\mathbf{p}_X = \{p_{i,\cdot}; i \ge 1\}$ and $\mathbf{p}_Y = \{p_{\cdot,j}; j \ge 1\}$, where $p_{i,\cdot} = \sum_{j \ge 1} p_{i,j}$ and $p_{\cdot,j} = \sum_{i \ge 1} p_{i,j}$, for $X$ and $Y$, respectively. Let $H(\mathbf{p})$ be an entropy. The four axioms of Khinchin are given below.

$K_1$ (Continuity): $H(\mathbf{p})$ is a continuous function of $\mathbf{p}$.
$K_2$ (Maximality): On an alphabet of $K$ letters, $H(\mathbf{p})$ is maximized by the uniform distribution, $p_k = 1/K$ for $k = 1, \dots, K$.
$K_3$ (Expandability): $H(p_1, \dots, p_K, 0) = H(p_1, \dots, p_K)$.
$K_4$:
$$H(X, Y) = H(X) + H(Y|X), \tag{1}$$

where $H(Y|X) = \sum_{i \ge 1} p_{i,\cdot} H(Y|X = x_i)$ and $H(Y|X = x_i)$ is the entropy of the conditional distribution of $Y$ given $X = x_i$. $K_4$ is sometimes also known as Strong Additivity.
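For a concrete check of $K_4$ (the code and the example joint distribution are ours, chosen for illustration), the following Python sketch verifies strong additivity for $H_{\mathrm{BGS}}$:

```python
import numpy as np

def bgs(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# A joint distribution p_{i,j} on a 3 x 2 alphabet.
P = np.array([[0.20, 0.10],
              [0.05, 0.25],
              [0.30, 0.10]])

pX = P.sum(axis=1)                                      # marginal of X
H_cond = sum(pX[i] * bgs(P[i] / pX[i]) for i in range(P.shape[0]))  # H(Y|X)

# K4 (strong additivity): H(X, Y) = H(X) + H(Y|X) for the BGS entropy.
print(bgs(P), bgs(pX) + H_cond)                         # the two numbers agree
```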
Fact 1 (The uniqueness theorem of entropy). For any $\mathbf{p} \in \mathcal{P}$, if an entropy $H(\mathbf{p})$, such that $H(\mathbf{p}) < \infty$, satisfies all four axioms, $K_1$–$K_4$, then $H(\mathbf{p})$ must be uniquely of the form $H_{\mathrm{BGS}} = -\sum_{k \ge 1} p_k \ln p_k$, up to a multiplicative constant.
Given $H = H_{\mathrm{BGS}}$, let

$$MI(X, Y) = H(X) + H(Y) - H(X, Y), \tag{2}$$

which is often referred to as Shannon's mutual information. The following fact is due to Shannon.

Fact 2.
Let $\mathbf{p}_{X,Y}$ be a joint probability distribution of $(X, Y)$ satisfying $H(X) < \infty$ and $H(Y) < \infty$.
Then, $X$ and $Y$ are independent if and only if $MI(X, Y) = 0$.
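Fact 2 is easy to probe numerically. In the sketch below (ours; the example distributions are arbitrary), $MI$ is zero, up to floating-point error, for a product joint distribution and strictly positive for a dependent one:

```python
import numpy as np

def bgs(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def shannon_mi(P):                    # MI(X, Y) as in (2)
    P = np.asarray(P, dtype=float)
    return bgs(P.sum(axis=1)) + bgs(P.sum(axis=0)) - bgs(P)

print(shannon_mi(np.outer([0.6, 0.4], [0.3, 0.7])))     # ~0: independent joint
print(shannon_mi(np.array([[0.4, 0.1], [0.1, 0.4]])))   # > 0: dependent joint
```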
The fact that the independence between $X$ and $Y$ may be characterized by a single-valued index, $MI$, under a general joint distribution on $\mathcal{X} \times \mathcal{Y}$ puts $MI$ in a very important place in information theory. Furthermore, the uniqueness theorem of entropy adds a special aura to $H_{\mathrm{BGS}}$.
However, $H_{\mathrm{BGS}}$, which satisfies

$$H(X, Y) = H(X) + H(Y) \tag{3}$$

under independence of $X$ and $Y$, is considered by many physicists to be overly rigid. In search of more general forms of entropy, Khinchin's axiom $K_4$ has been weakened in various ways, and a large number of research articles have been published under the weaker conditions. Many of these articles may be found in two excellent review articles; see [3,4]. The fourth axiom, $K_4$, may be weakened to different degrees across a spectrum. At one end of it, $K_4$ is ignored and generalized entropy forms are sought only under $\{K_1, K_2, K_3\}$. Other versions of the weakened $K_4$ are mostly given in more general forms of (3), under independence of $X$ and $Y$. For example, an entropy $H(\mathbf{p})$ may be required to satisfy

$$H(X, Y) = \Phi(H(X), H(Y)) \tag{4}$$

for some symmetric function of two variables, $\Phi$, of which a special case is

$$H(X, Y) = H(X) + H(Y) + (1 - \alpha) H(X) H(Y), \tag{5}$$

where $\alpha \ne 1$ is a real number. The condition in (5) is a centerpiece of non-extensive statistical mechanics. The Tsallis entropy satisfies (4) in general and (5) in specific. It is to be particularly noted that the conditions in (4) and (5) are necessary conditions of $X$ and $Y$ being independent, but are not sufficient conditions. Non-extensive physics aside, the rigidity of $H_{\mathrm{BGS}}$ has its remarkable utility, namely Fact 2: a single-valued index characterizes the stochastic association between two random elements on a non-meterized joint alphabet under general distributions. (Also see standardized mutual information in Chapter 5 of [5].)
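For concreteness, the pseudo-additivity in (5) can be checked directly for the Tsallis entropy on an independent joint distribution (the sketch and the example distributions are ours):

```python
import numpy as np

def tsallis(p, alpha):
    p = np.asarray(p, dtype=float).ravel()
    return (1.0 - np.sum(p ** alpha)) / (alpha - 1.0)

alpha = 2.0
pX = np.array([0.6, 0.4])
pY = np.array([0.3, 0.2, 0.5])
P = np.outer(pX, pY)                  # independent joint distribution

lhs = tsallis(P, alpha)
rhs = tsallis(pX, alpha) + tsallis(pY, alpha) \
      + (1.0 - alpha) * tsallis(pX, alpha) * tsallis(pY, alpha)
print(lhs, rhs)                       # equal: pseudo-additivity (5)
```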
It may be interesting to ask whether there exist other entropies, $H(\mathbf{p})$, such that, in addition to satisfying $\{K_1, K_2, K_3\}$, they also satisfy

$K_4'$: $H(X) + H(Y) - H(X, Y) = 0$ if and only if $X$ and $Y$ are independent. $\tag{6}$

Let it be noted that $K_4'$ is weaker than $K_4$ in the sense that $\{K_1, K_2, K_3, K_4\}$ implies $\{K_1, K_2, K_3, K_4'\}$. In the same sense, $K_4'$ is a stronger condition than

$$H(X) + H(Y) - H(X, Y) = 0 \quad \text{whenever } X \text{ and } Y \text{ are independent}, \tag{7}$$

since (7) is a necessary condition of $X$ and $Y$ being independent, but not a sufficient condition. The condition in (7) is a special case of (4), with $\Phi(u, v) = u + v$.
Example 1. The Rényi entropy, $H_R$, satisfies (7) but not $K_4'$.
Example 2. The no-name entropy, $H_N = -\sum_{k \ge 1} p_k^{\alpha} \ln p_k^{\alpha}$ for any $\alpha > 1$, satisfies $K_4'$ but not $K_4$. In passing, it may be interesting to note that $\lim_{\alpha \to 1} H_N = H_{\mathrm{BGS}}$.
As it turns out, there are many entropies satisfying $\{K_1, K_2, K_3, K_4'\}$. Consider the following family of entropies,

$$H_{\alpha}(\mathbf{p}) = -\sum_{k \ge 1} \frac{p_k^{\alpha}}{\sum_{j \ge 1} p_j^{\alpha}} \ln \frac{p_k^{\alpha}}{\sum_{j \ge 1} p_j^{\alpha}}, \tag{8}$$

for $\alpha > 1$, and its implied mutual information,

$$MI_{\alpha}(X, Y) = H_{\alpha}(X) + H_{\alpha}(Y) - H_{\alpha}(X, Y). \tag{9}$$

Obviously, the family of entropies in (8) also satisfies $\lim_{\alpha \to 1} H_{\alpha}(\mathbf{p}) = H_{\mathrm{BGS}}$, which, however, is only finitely defined for some of the distributions in $\mathcal{P}$. A significant advantage of (8) is that every member is finitely defined for each and every $\mathbf{p} \in \mathcal{P}$. The first main result established in this article is the fact that $X$ and $Y$ are independent if and only if $MI_{\alpha}(X, Y) = 0$ for any $\alpha \in (1, \infty)$, which immediately implies that there are many entropies satisfying $\{K_1, K_2, K_3, K_4'\}$.
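The family (8) and its implied mutual information (9) are straightforward to compute. The following Python sketch (function names and example distributions are ours) illustrates the first main result on a $2 \times 2$ alphabet: $MI_{\alpha}$ vanishes for an independent joint distribution and is strictly positive for a dependent one with the same marginals:

```python
import numpy as np

def escort(p, alpha):                 # power escort of p, cf. (12)
    w = np.asarray(p, dtype=float).ravel() ** alpha
    return w / w.sum()

def H_alpha(p, alpha):                # (8): BGS entropy of the power escort of p
    q = escort(p, alpha)
    q = q[q > 0]
    return -np.sum(q * np.log(q))

def MI_alpha(P, alpha):               # (9): implied mutual information
    P = np.asarray(P, dtype=float)
    return (H_alpha(P.sum(axis=1), alpha) + H_alpha(P.sum(axis=0), alpha)
            - H_alpha(P.ravel(), alpha))

alpha = 2.5
pX, pY = np.array([0.6, 0.4]), np.array([0.7, 0.3])
P_ind = np.outer(pX, pY)                        # independent joint
P_dep = np.array([[0.50, 0.10], [0.20, 0.20]])  # dependent joint, same margins
print(MI_alpha(P_ind, alpha))   # ~0 (up to floating-point error)
print(MI_alpha(P_dep, alpha))   # strictly positive
```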
A second question of interest is whether there exist other forms of entropy satisfying $\{K_1, K_2, K_3, K_4'\}$ beyond those in (8). The answer to this question is unknown in general. However, under certain restrictions on the functional forms of $H(\mathbf{p})$, the uniqueness of (8) may be established. This, in fact, is the second main result of this article.
In Section 2, the statements of the two main results are made more precise and are established in two separate subsections. The article ends with Section 3, where several related minor results are summarized.

Main Results
The path leading to both main results of this article goes through a mapping, $\Phi: \mathcal{P} \to \mathcal{P}^* \subseteq \mathcal{P}$, denoted $\mathbf{p}^* = \Phi(\mathbf{p})$ and referred to as the escort distribution of $\mathbf{p}$ on the same alphabet $\mathcal{X}$. $\mathbf{p}^*$ is constructed in a special way, according to the concept of escort distributions introduced in [6]. For a given function $\varphi(p) \ge 0$ on $[0, 1]$ and a distribution $\mathbf{p} \in \mathcal{P}$, provided that $0 < \sum_{k \ge 1} \varphi(p_k) < \infty$,

$$\mathbf{p}^* = \left\{ p_k^* = \frac{\varphi(p_k)}{\sum_{j \ge 1} \varphi(p_j)};\ k \ge 1 \right\} \tag{10}$$

is referred to as the escort distribution of $\mathbf{p}$ associated with the deformation function (also known as the escort function), $\varphi(p)$. The utility of escort distributions is discussed by many researchers, most notably [7] in the context of statistical mechanics and [8] regarding information geometry. Escort distributions, originally introduced as natural scanners of a single underpinning probability distribution, $\mathbf{p}$, in a multifractal structure, have been shown to be useful in a great variety of places and ways, ranging from information theory and coding theory to multifractal neural networks. For example, many interesting results and applications may be found in [9,10] and the references therein. Consider a special family of power escort functions,

$$\varphi_{\alpha}(p) = p^{\alpha}, \quad \alpha > 0. \tag{11}$$

When $\varphi(p)$ is a member of (11), (10) becomes

$$\mathbf{p}^* = \left\{ p_k^* = \frac{p_k^{\alpha}}{\sum_{j \ge 1} p_j^{\alpha}};\ k \ge 1 \right\}. \tag{12}$$

The Boltzmann–Gibbs–Shannon entropy of the escort distribution in (12) becomes $H_{\alpha}$ of (8), and $MI_{\alpha}$ of (9) becomes its implied Shannon's mutual information.
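As a small illustration of the mapping $\Phi$ in (10) (the code and example values are ours), the following sketch computes escort distributions under a power deformation function from (11) and under a non-power alternative:

```python
import numpy as np

def escort_map(p, phi):
    # The mapping Phi of (10): p*_k = phi(p_k) / sum_j phi(p_j).
    w = np.array([phi(pk) for pk in p], dtype=float)
    return w / w.sum()

p = [0.5, 0.3, 0.2]
print(escort_map(p, lambda t: t ** 2))          # power escort, a member of (11)
print(escort_map(p, lambda t: t / (1.0 + t)))   # a non-power deformation function
```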

Characterization of Independence
Theorem 1. $X$ and $Y$ are independent if and only if $MI_{\alpha}(X, Y) = 0$, for any fixed $\alpha > 1$.
Lemma 1. $X$ and $Y$ are independent if and only if $X^*$ and $Y^*$ are independent.
Proof. For each $\alpha > 1$, if $p_{i,j} = p_{i,\cdot} \times p_{\cdot,j}$ for all pairs $(i, j)$, $i \ge 1$ and $j \ge 1$, then

$$\sum_{i \ge 1} \sum_{j \ge 1} p_{i,j}^{\alpha} = \left( \sum_{i \ge 1} p_{i,\cdot}^{\alpha} \right) \left( \sum_{j \ge 1} p_{\cdot,j}^{\alpha} \right). \tag{13}$$

Therefore, the fact that

$$p_{i,j}^* = \frac{p_{i,j}^{\alpha}}{\sum_{s \ge 1} \sum_{t \ge 1} p_{s,t}^{\alpha}} \tag{14}$$

holds, which implies for each pair $(i, j)$, $i \ge 1$ and $j \ge 1$,

$$p_{i,j}^* = p_{i,\cdot}^{\alpha} \times p_{\cdot,j}^{\alpha} \times \left[ \left( \sum_{i \ge 1} p_{i,\cdot}^{\alpha} \right) \left( \sum_{j \ge 1} p_{\cdot,j}^{\alpha} \right) \right]^{-1}. \tag{15}$$

Noting that the third factor in the expression above does not depend on $i$ or $j$, the lemma immediately follows from the factorization theorem. Conversely, since $p_{i,j} \propto (p_{i,j}^*)^{1/\alpha}$, the same argument applied to the inverse power transformation establishes the other direction of the lemma. □
Remark 1. It is acknowledged that the proof of Lemma 1 above is inspired by a similar proof in [11], where a similar result with a more restrictive family of escort functions is established.
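The forward direction of Lemma 1 is easy to check numerically. The following sketch (ours, for illustration; any independent joint distribution would do) confirms that the power escort of a product joint distribution factors into the power escorts of its marginals:

```python
import numpy as np

alpha = 3.0
pX, pY = np.array([0.6, 0.4]), np.array([0.2, 0.3, 0.5])
P = np.outer(pX, pY)                  # independent joint distribution

W = P ** alpha
P_star = W / W.sum()                  # escort of the joint, cf. (14)

def escort(p, a):
    w = p ** a
    return w / w.sum()

# Under independence, the escort joint factors into the escorted marginals.
print(np.allclose(P_star, np.outer(escort(pX, alpha), escort(pY, alpha))))  # True
```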

A Uniqueness Theorem
In Theorem 1, it is established that $X$ and $Y$ are independent if and only if $X^*$ and $Y^*$ are independent, where $(X, Y)$ and $(X^*, Y^*)$ are linked by a power escort transformation, $\varphi(p)$, through the mapping, $\Phi$, between their respective joint distributions. Such an escort function may be reasonably referred to as an independence-dependence preserving deformation function. In concept, however, there may exist other such functions outside of the power family in (11). However, Theorem 2 below says otherwise.

Definition 1. A deformation function $\varphi(p)$ on $[0, 1]$ is said to be independence-dependence preserving with respect to a subclass $\mathcal{P}'_{X,Y} \subseteq \mathcal{P}_{X,Y}$ if, for each and every $\mathbf{p}_{X,Y} \in \mathcal{P}'_{X,Y}$ and its associated escort distribution $\mathbf{p}^*_{X,Y}$, $X$ and $Y$ are independent if and only if $X^*$ and $Y^*$ are independent. More specifically, a deformation function $\varphi(p)$ on $[0, 1]$ is said to be independence-dependence preserving if it is independence-dependence preserving with respect to $\mathcal{P}_{X,Y}$.

Theorem 2. A measurable and integrable deformation function, $\varphi(p)$ for $p \in (0, 1)$, is independence-dependence preserving if and only if it is a member of the family of power functions in (11).

Lemma 2. Suppose $f(x) > 0$ is a Lebesgue measurable function on $\mathbb{R}$ such that $f(x + y) = f(x) f(y)$ for all $x, y \in \mathbb{R}$. Then, $f(x) = e^{\alpha x}$ for all $x \in \mathbb{R}$ and some constant $\alpha \in \mathbb{R}$.
Lemma 2 above is due to [12]. In fact, it is also established in [12] that if the condition of Lebesgue measurability is not imposed, a nowhere-continuous $f(x)$, satisfying $f(x + y) = f(x) f(y)$ for all $x, y \in \mathbb{R}$, exists and is therefore not of the form $f(x) = e^{\alpha x}$.

Lemma 3. Suppose $g(x) > 0$ is a Lebesgue measurable function on $(0, \infty)$ such that $g(xy) = g(x) g(y)$ for all $x, y \in (0, \infty)$. Then, $g(x) = x^{\alpha}$ for all $x > 0$ and some constant $\alpha \in \mathbb{R}$.
Proof. For any $x, y \in (0, \infty)$, consider the variable transformation $s = \ln x$ and $t = \ln y$, and hence $x = e^s$ and $y = e^t$. Let $f(s) = g(e^s)$. It follows that, for all $x, y \in (0, \infty)$, $f(s + t) = g(e^{s+t}) = g(e^s e^t) = g(e^s) g(e^t) = f(s) f(t)$. By Lemma 2, $f(s) = e^{\alpha s}$ for some constant $\alpha \in \mathbb{R}$, and hence $g(x) = f(\ln x) = e^{\alpha \ln x} = x^{\alpha}$ for all $x > 0$. □
Lemma 4. $X$ and $Y$ are independent if and only if there exist two nonnegative sequences, $\{g_i; i \ge 1\}$ and $\{h_j; j \ge 1\}$, such that $p_{i,j} = g_i h_j$ for all $i \ge 1$ and $j \ge 1$.

Proof. Let $\{p_{i,j}; i \ge 1, j \ge 1\}$ be any joint distribution on $\mathcal{X} \times \mathcal{Y}$ and denote its marginal distributions as $\{p_{i,\cdot}; i \ge 1\}$ and $\{p_{\cdot,j}; j \ge 1\}$. Suppose $p_{i,j} = p_{i,\cdot} p_{\cdot,j}$ for every pair $(i, j)$. It follows that $p_{i,j} = g_i h_j$ with $g_i = p_{i,\cdot}$ and $h_j = p_{\cdot,j}$. Conversely, suppose $p_{i,j} = g_i h_j$ for all $(i, j)$. Then $p_{i,\cdot} = g_i \sum_{j \ge 1} h_j$, $p_{\cdot,j} = h_j \sum_{i \ge 1} g_i$, and $\left(\sum_{i \ge 1} g_i\right)\left(\sum_{j \ge 1} h_j\right) = 1$; hence $p_{i,\cdot} p_{\cdot,j} = g_i h_j = p_{i,j}$ for every pair $(i, j)$. □
Lemma 5. If $\varphi(p)$ is Lebesgue measurable and integrable on $(0, 1)$ and is independence-dependence preserving on $\mathcal{P}_{X,Y}$, then $\varphi(p)$ is a member of (11).
Proof. Suppose $X$ and $Y$ are independent, that is, $p_{i,j} = p_{i,\cdot} \times p_{\cdot,j}$ for all $(i, j)$. For $\varphi(\cdot)$ to preserve independence, it is to satisfy, for all $(i, j)$,

$$p_{i,j}^* = \frac{\varphi(p_{i,j})}{\sum_{s \ge 1} \sum_{t \ge 1} \varphi(p_{s,t})} = a_i b_j \tag{16}$$

for some nonnegative sequences $\{a_i\}$ and $\{b_j\}$, by Lemma 4, or to satisfy, for any two pairs of indices $(k, l)$ and $(s, t)$,

$$\frac{\varphi(p_{k,l})}{\varphi(p_{s,t})} = \frac{a_k b_l}{a_s b_t}. \tag{17}$$

More specifically, let $k = s$; then Equation (17) is reduced to

$$\frac{\varphi(p_{k,l})}{\varphi(p_{k,t})} = \frac{b_l}{b_t}. \tag{18}$$

Noting $p_{i,j} = p_{i,\cdot} \times p_{\cdot,j}$ for all pairs $(i, j)$, it follows from (18) that

$$\frac{\varphi(p_{k,\cdot}\, p_{\cdot,l})}{\varphi(p_{k,\cdot}\, p_{\cdot,t})} = \frac{b_l}{b_t}. \tag{19}$$

The right-hand side of (19) is independent of $k$, and so is the left-hand side of (19). It follows that

$$\frac{\varphi(p_{k,\cdot}\, p_{\cdot,l})}{\varphi(p_{k,\cdot}\, p_{\cdot,t})} = \frac{\varphi(p_{s,\cdot}\, p_{\cdot,l})}{\varphi(p_{s,\cdot}\, p_{\cdot,t})} \tag{20}$$

regardless if $k = s$ or $k \ne s$. It is to be noted that (20) implies

$$\varphi(pq)\, \varphi(p'q') = \varphi(pq')\, \varphi(p'q) \tag{21}$$

for any $p \in (0, 1)$, $p' \in (0, 1)$, $q \in (0, 1)$, and $q' \in (0, 1)$, but subject to the constraints $p + p' \le 1$ and $q + q' \le 1$. Now, it is desired to show that, without the constraints $p + p' \le 1$ and $q + q' \le 1$, (21) still holds. Toward that end, the proof is given in two steps: (1) $q + q' \le 1$ and (2) $q + q' > 1$.
Step 1: Suppose $q + q' \le 1$. Let

$$P_1 = \{p, c, 1 - p - c\}, \quad P_2 = \{p', c, 1 - p' - c\}, \quad \text{and} \quad Q = \{q, q', 1 - q - q'\}, \tag{22}$$

where $c > 0$ is chosen small enough that both $P_1$ and $P_2$ are probability distributions, for example, $c = \min\{1 - p, 1 - p'\}/2$. Applying $P_1$ and $Q$, as the marginal distributions of an independent pair $(X, Y)$, to Equation (20), it follows that

$$\frac{\varphi(pq)}{\varphi(pq')} = \frac{\varphi(cq)}{\varphi(cq')}. \tag{23}$$

Applying $P_2$ and $Q$ to Equation (20), and combining with (23), it follows that

$$\varphi(pq)\, \varphi(p'q') = \varphi(pq')\, \varphi(p'q). \tag{24}$$

Before moving on to Step 2, let it be noted that the constraint $p + p' \le 1$ is not used in the above proof of Step 1. That is to say, (24) holds under the condition $q + q' \le 1$ regardless of whether $p + p' \le 1$ or $p + p' > 1$. Step 2, for the case $q + q' > 1$, proceeds in the same manner with the roles of $(p, p')$ and $(q, q')$ exchanged, starting from (24) instead of (21); consequently, (21) holds for all $p, p', q, q' \in (0, 1)$, and $\varphi$ is then a member of (11) by Lemma 3. □
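Although the proof above is analytic, the functional Equation (21) itself is easy to probe numerically. In the sketch below (our own illustration; the particular non-power deformation function is an arbitrary choice), power functions satisfy (21) identically while a non-power deformation function does not:

```python
import numpy as np

def cross(phi, p, pp, q, qq):
    # phi(pq) phi(p'q') - phi(pq') phi(p'q), cf. (21); zero means (21) holds.
    return phi(p * q) * phi(pp * qq) - phi(p * qq) * phi(pp * q)

p, pp, q, qq = 0.3, 0.5, 0.2, 0.6
print(cross(lambda t: t ** 1.7, p, pp, q, qq))    # 0: power functions satisfy (21)
print(cross(lambda t: t + t ** 2, p, pp, q, qq))  # nonzero: non-power phi fails (21)
```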

Other Results
An independence-dependence preserving deformation function, $\varphi(p)$, with respect to a subspace, $\mathcal{P}'_{X,Y} \subset \mathcal{P}_{X,Y}$, need not be of the power form in (11). Let $\mathcal{P}_{K \times K}$ be the collection of all distributions of $(X, Y)$ on a $K \times K$ joint alphabet, where $K \ge 2$ is a finite integer.

Proposition 1. A Lebesgue measurable and integrable deformation function $\varphi(p)$ is independence-dependence preserving on a subspace $\mathcal{P}'_{X,Y}$ such that $\mathcal{P}_{3 \times 3} \subseteq \mathcal{P}'_{X,Y} \subseteq \mathcal{P}_{X,Y}$, if and only if $\varphi(p)$ is a member of the power functions in (11).
The proof of Proposition 1 is trivial because the proof of Lemma 5 directly applies in this case, noting that the existence of the constructed distributions, $P_1$, $P_2$, and $Q$, in the proof of Lemma 5 only requires the joint alphabet to be $3 \times 3$ or larger.
It may be interesting to note that Proposition 1 is a stronger statement than Theorem 2 in the sense that Theorem 2 may be considered a corollary of Proposition 1. This is due to the fact that, for two sub-classes of distributions $\mathcal{P}'_{X,Y} \subseteq \mathcal{P}''_{X,Y}$, an independence-dependence preserving $\varphi(p)$ with respect to $\mathcal{P}''_{X,Y}$ is independence-dependence preserving with respect to $\mathcal{P}'_{X,Y}$.
Example 3 below illustrates the fact that, on some restricted class of probability distributions, an independence-dependence preserving deformation function, $\varphi(p)$, need not be of the power form in (11).
The independence-dependence preserving property of a chosen $\varphi(p)$ may be important when dependence between two random elements is of importance in a study. However, when only a one-to-one escort deformation mapping $\Phi$ is desired, the deformation function $\varphi(p)$ need not be a member of (11). The following proposition provides a sufficient condition.
Proposition 2. If $\varphi(p)$ is strictly increasing on $[0, 1]$ and $\mathbf{p}^*$ is well-defined for each and every $\mathbf{p} \in \mathcal{P}$, then the mapping $\Phi: \mathcal{P} \to \mathcal{P}^*$ is one-to-one.

Proof. For every strictly increasing $\varphi(p)$ on $p \in [0, 1]$ for which $\mathbf{p}^*$ is well-defined for each and every $\mathbf{p} \in \mathcal{P}$, it suffices to show that, given any $\mathbf{p}^* \in \mathcal{P}^*$, $\Phi^{-1}(\mathbf{p}^*)$ is unique. Toward that end, suppose there are two distinct distributions, $\mathbf{p}_1 = \{p_{1,k}; k \ge 1\} \in \mathcal{P}$ and $\mathbf{p}_2 = \{p_{2,k}; k \ge 1\} \in \mathcal{P}$, such that $\Phi(\mathbf{p}_1) = \Phi(\mathbf{p}_2) = \mathbf{p}^*$, and write $c_1 = \sum_{k \ge 1} \varphi(p_{1,k})$ and $c_2 = \sum_{k \ge 1} \varphi(p_{2,k})$. If $c_1 = c_2$, then $\varphi(p_{1,k}) = \varphi(p_{2,k})$ for every $k$, and the strict monotonicity of $\varphi$ gives $\mathbf{p}_1 = \mathbf{p}_2$, a contradiction. If $c_1 < c_2$, then $\varphi(p_{1,k}) < \varphi(p_{2,k})$, and hence $p_{1,k} < p_{2,k}$, for every $k$, which contradicts $\sum_{k \ge 1} p_{1,k} = \sum_{k \ge 1} p_{2,k} = 1$. The case $c_1 > c_2$ is symmetric. □
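For the power family (11), the one-to-one property is explicit: the inverse mapping is itself a power escort with exponent $1/\alpha$. The following sketch (ours, purely illustrative) demonstrates the round trip:

```python
import numpy as np

def escort(p, alpha):
    w = np.asarray(p, dtype=float) ** alpha
    return w / w.sum()

def inverse_escort(p_star, alpha):
    # Recover p from p* when phi(p) = p^alpha: p_k is proportional to (p*_k)^(1/alpha).
    w = np.asarray(p_star, dtype=float) ** (1.0 / alpha)
    return w / w.sum()

p = np.array([0.5, 0.3, 0.2])
alpha = 2.0
print(inverse_escort(escort(p, alpha), alpha))   # recovers [0.5, 0.3, 0.2]
```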