Communication

Khinchin’s Fourth Axiom of Entropy Revisited

Zhiyi Zhang, Hongwei Huang and Hao Xu

1 Department of Mathematics and Statistics, University of North Carolina at Charlotte, Charlotte, NC 28223, USA
2 Wells Fargo Bank, Charlotte, NC 28282, USA
* Author to whom correspondence should be addressed.
Stats 2023, 6(3), 763-772; https://doi.org/10.3390/stats6030049
Submission received: 11 July 2023 / Revised: 25 July 2023 / Accepted: 26 July 2023 / Published: 27 July 2023
(This article belongs to the Section Data Science)

Abstract

The Boltzmann–Gibbs–Shannon (BGS) entropy is the only entropy form satisfying four conditions known as Khinchin’s axioms. The uniqueness theorem of the BGS entropy, plus the fact that Shannon’s mutual information completely characterizes independence between the two underlying random elements, puts the BGS entropy in a special place in many fields of study. In this article, the fourth axiom is replaced by a slightly weakened condition: an entropy whose associated mutual information is zero if and only if the two underlying random elements are independent. Under the weaker fourth axiom, other forms of entropy are sought by way of escort transformations. Two main results are reported in this article. First, there are many entropies other than the BGS entropy satisfying the weaker condition, yet retaining all the desirable utilities of the BGS entropy. Second, by way of escort transformations, the newly identified entropies are the only ones satisfying the weaker axioms.

1. Introduction and Summary

Let X be a random element in a countable alphabet $\mathcal{X} = \{x_k; k \ge 1\}$, where $x_k$, $k \ge 1$, are distinct letters or labels with a probability distribution $\mathbf{p} = \{p_k; k \ge 1\} \in \mathcal{P}$, where $\mathcal{P}$ is the collection of all possible probability distributions on $\mathcal{X}$. Many random system properties of interest may be described by entropic quantities or entropies, that is, functions of $\mathbf{p}$ that are label-independent. For generality of discussion, letting $\mathbf{p}_{\downarrow} = \{p_{(k)}; k \ge 1\}$ be a non-increasingly ordered $\mathbf{p}$, a function of $\mathbf{p}$ that satisfies $H = H(\mathbf{p}) = H(\mathbf{p}_{\downarrow})$ for all $\mathbf{p} \in \mathcal{P}$ is referred to as an entropy. An entropy may also be denoted as $H(X)$ for notational simplicity. For example, the Boltzmann–Gibbs–Shannon entropy $H_{\mathrm{BGS}} = -\sum_{k \ge 1} p_k \ln p_k$, the Rényi entropy $H_R = (1-\alpha)^{-1} \ln\left(\sum_{k \ge 1} p_k^{\alpha}\right)$, where $\alpha > 0$ and $\alpha \ne 1$ is a constant, and the Tsallis entropy $H_T = (\alpha - 1)^{-1}\left(1 - \sum_{k \ge 1} p_k^{\alpha}\right)$ for any $\alpha \ge 0$ are of high utility across many fields of study, such as statistical mechanics and information theory.
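For concreteness, the following short Python sketch evaluates the three entropy forms above for a small finite distribution. It is an illustration only: the function names, the example distribution, and the use of NumPy are assumptions, not part of the original text.

import numpy as np

def bgs_entropy(p):
    """Boltzmann-Gibbs-Shannon entropy: -sum p_k ln p_k (terms with p_k = 0 dropped)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def renyi_entropy(p, alpha):
    """Renyi entropy: (1 - alpha)^{-1} ln(sum p_k^alpha), alpha > 0, alpha != 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

def tsallis_entropy(p, alpha):
    """Tsallis entropy: (alpha - 1)^{-1} (1 - sum p_k^alpha)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return (1.0 - np.sum(p ** alpha)) / (alpha - 1.0)

p = [0.5, 0.25, 0.125, 0.125]            # an illustrative distribution
print(bgs_entropy(p))                     # = 1.75 * ln 2
print(renyi_entropy(p, 2.0), tsallis_entropy(p, 2.0))
print(renyi_entropy(p, 1.0001))           # close to the BGS value as alpha -> 1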
Ever since the information-theoretic utility of $H_{\mathrm{BGS}}$ was unlocked in [1], a large volume of research has been amassed in relation to $H_{\mathrm{BGS}}$. A considerable amount of this research effort has been devoted to framing $H_{\mathrm{BGS}}$ in more general forms. Many articles have been published under the theme of generalized entropy, but from different perspectives. One particular perspective is the axiomatic system studied by Khinchin in [2].
Let $(X, Y)$ be a pair of random elements on a joint countable alphabet, $\mathcal{X} \times \mathcal{Y} = \{(x_i, y_j); i \ge 1, j \ge 1\}$, where $(x_i, y_j)$ for $i \ge 1$ and $j \ge 1$ are distinct labels with a joint probability distribution, $\mathbf{p}_{X,Y} = \{p_{i,j}; i \ge 1, j \ge 1\}$, and the two marginal distributions are $\mathbf{p}_X = \{p_{i,\cdot}; i \ge 1\}$ and $\mathbf{p}_Y = \{p_{\cdot,j}; j \ge 1\}$, where $p_{i,\cdot} = \sum_{j \ge 1} p_{i,j}$ and $p_{\cdot,j} = \sum_{i \ge 1} p_{i,j}$, for $X$ and $Y$, respectively. Let $H(\mathbf{p})$ be an entropy. The four axioms of Khinchin are given below.
$K_1$: (Continuity) $H(\mathbf{p})$ is continuous with respect to all elements of $\mathbf{p}$.
$K_2$: (Maximality) Given $K = \sum_{k \ge 1} 1[p_k > 0] < \infty$, $H(\mathbf{p})$ is maximized in $\mathcal{P}$ at $p_k = 1/K$ for $k = 1, \dots, K$.
$K_3$: (Expansibility) Letting $\mathbf{p}_0 = \{0, p_1, p_2, \dots\}$, $H(\mathbf{p}) = H(\mathbf{p}_0)$.
$K_4$: (Separability) For any pair of random elements $(X, Y)$ on $\mathcal{X} \times \mathcal{Y}$ with a joint probability distribution $\mathbf{p}_{X,Y}$,
$H(X, Y) = H(X) + H(Y|X)$,   (1)
where $H(Y|X) = \sum_{i \ge 1} p_{i,\cdot} H(Y|X = x_i)$ and $H(Y|X = x_i)$ is the entropy of the conditional distribution of $Y$ given $X = x_i$.
$K_4$ is sometimes also known as Strong Additivity.
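The separability axiom $K_4$ can be checked numerically for $H_{\mathrm{BGS}}$ on any finite joint table. The sketch below, with an arbitrarily chosen $2 \times 3$ joint distribution, is only an illustration of the identity in (1); the names and numbers are hypothetical and assume NumPy.

import numpy as np

def bgs(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# An arbitrary (dependent) 2x3 joint distribution, chosen only for illustration.
P = np.array([[0.10, 0.25, 0.05],
              [0.20, 0.10, 0.30]])

p_x = P.sum(axis=1)                                  # marginal distribution of X
H_joint = bgs(P)
H_x = bgs(p_x)
# Conditional entropy H(Y|X) = sum_i p_{i,.} * H(Y | X = x_i)
H_y_given_x = sum(p_x[i] * bgs(P[i] / p_x[i]) for i in range(P.shape[0]))

print(H_joint, H_x + H_y_given_x)   # the two numbers agree: strong additivity of BGS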
Khinchin famously proved the following fact in [2].
Fact 1
(The uniqueness theorem of entropy). For any $\mathbf{p} \in \mathcal{P}$, if an entropy $H(\mathbf{p})$, such that $H(\mathbf{p}) < \infty$, satisfies all four axioms, $K_1$–$K_4$, then $H(\mathbf{p})$ must be uniquely of the form $H_{\mathrm{BGS}} = -\sum_{k \ge 1} p_k \ln p_k$, up to a multiplicative constant.
Let
$MI = MI(X, Y) = H_{\mathrm{BGS}}(X) + H_{\mathrm{BGS}}(Y) - H_{\mathrm{BGS}}(X, Y)$,   (2)
which is often referred to as Shannon's mutual information. The following fact is due to Shannon.
Fact 2.
Let $\mathbf{p}_{X,Y}$ be a joint probability distribution of $(X, Y)$ satisfying $H_{\mathrm{BGS}}(X) < \infty$ and $H_{\mathrm{BGS}}(Y) < \infty$. Then, $X$ and $Y$ are independent if and only if $MI(X, Y) = 0$.
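As a minimal numerical illustration of Fact 2 (assuming NumPy; the joint tables are arbitrary examples, not from the article), the mutual information in (2) vanishes for a product joint distribution and is strictly positive for a dependent one:

import numpy as np

def bgs(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_information(P):
    """Shannon's mutual information of a joint table P, per Equation (2)."""
    P = np.asarray(P, dtype=float)
    return bgs(P.sum(axis=1)) + bgs(P.sum(axis=0)) - bgs(P)

px, py = np.array([0.3, 0.7]), np.array([0.2, 0.5, 0.3])
P_indep = np.outer(px, py)                       # independent joint: MI = 0
P_dep = np.array([[0.25, 0.05], [0.05, 0.65]])   # dependent joint: MI > 0
print(mutual_information(P_indep))   # ~0, up to floating-point error
print(mutual_information(P_dep))     # strictly positive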
The fact that the independence between $X$ and $Y$ may be characterized by a single-valued index $MI$ under a general joint distribution on $\mathcal{X} \times \mathcal{Y}$ puts $MI$ in a very important place in information theory. Furthermore, the uniqueness theorem of entropy adds a special aura to $H_{\mathrm{BGS}}$.
However, $H_{\mathrm{BGS}}$, which satisfies
$H_{\mathrm{BGS}}(X, Y) = H_{\mathrm{BGS}}(X) + H_{\mathrm{BGS}}(Y)$   (3)
under independence of $X$ and $Y$, is considered by many physicists to be overly rigid. In search of more general forms of entropy, Khinchin's axiom $K_4$ has been weakened in various ways, and a large number of research articles have been published under the weaker conditions. Many of these articles may be found in two excellent review articles; see [3,4].
The fourth axiom, $K_4$, may be weakened to different degrees across a spectrum. At one end of it, $K_4$ is ignored and generalized entropy forms are sought only under $\{K_1, K_2, K_3\}$. Other versions of the weakened $K_4$ are mostly given as more general forms of (3), under independence of $X$ and $Y$. For example, an entropy $H(\mathbf{p})$ may be required to satisfy
$H(X, Y) = \Phi(H(X), H(Y))$   (4)
for some symmetric function of two variables, $\Phi$, of which a special case is
$H(X, Y) = H(X) + H(Y) + (1 - \alpha) H(X) H(Y)$,   (5)
where $\alpha \ne 1$ is a real number. The condition in (5) is a centerpiece of non-extensive statistical mechanics. The Tsallis entropy satisfies (4) in general and (5) in particular. It is to be particularly noted that the conditions in (4) and (5) are necessary conditions of $X$ and $Y$ being independent, but they are not sufficient conditions.
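A quick numerical check of the pseudo-additivity condition (5) for the Tsallis entropy under independence is sketched below; the value of $\alpha$ and the marginal distributions are arbitrary illustrative choices, and NumPy is assumed.

import numpy as np

def tsallis(p, alpha):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return (1.0 - np.sum(p ** alpha)) / (alpha - 1.0)

alpha = 1.7
px, py = np.array([0.4, 0.6]), np.array([0.1, 0.3, 0.6])
P = np.outer(px, py)                         # independent joint distribution

lhs = tsallis(P, alpha)
rhs = tsallis(px, alpha) + tsallis(py, alpha) \
      + (1 - alpha) * tsallis(px, alpha) * tsallis(py, alpha)
print(lhs, rhs)                              # the two sides agree, illustrating (5)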
Non-extensive physics aside, the rigidity of $H_{\mathrm{BGS}}$ has its remarkable utility, namely Fact 2: a single-valued index characterizes the stochastic association between two random elements on a non-meterized joint alphabet under general distributions. (Also see standardized mutual information in Chapter 5 of [5].)
It may be interesting to ask whether there exist other entropies, $H(\mathbf{p})$, such that, in addition to satisfying $\{K_1, K_2, K_3\}$, they also satisfy
$K_4'$: $H(X) + H(Y) - H(X, Y) = 0$ if and only if $X$ and $Y$ are independent.   (6)
Let it be noted that $K_4'$ is weaker than $K_4$ in the sense that $\{K_1, K_2, K_3, K_4\}$ implies $\{K_1, K_2, K_3, K_4'\}$. In the same sense, $K_4'$ is a stronger condition than
$K_4''$: $H(X) + H(Y) - H(X, Y) = 0$ if $X$ and $Y$ are independent,   (7)
since $K_4''$ is a necessary condition of $X$ and $Y$ being independent, but not a sufficient condition. The condition in (7) is a special case of (4).
Example 1.
The Rényi entropy, $H_R$, satisfies $K_4''$ but not $K_4'$.
Example 2.
The no-name entropy, $H_N = -\sum_{k \ge 1} p_k^{\alpha} \ln p_k^{\alpha}$ for any $\alpha > 1$, satisfies $K_4'$ but not $K_4$. By the way, it may be interesting to note that $\lim_{\alpha \to 1} H_N = H_{\mathrm{BGS}}$.
As it turns out, there are many entropies satisfying $\{K_1, K_2, K_3, K_4'\}$. Consider the following family of entropies,
$H_{\alpha}(\mathbf{p}) = -\sum_{k \ge 1} \frac{p_k^{\alpha}}{\sum_{i \ge 1} p_i^{\alpha}} \ln \frac{p_k^{\alpha}}{\sum_{i \ge 1} p_i^{\alpha}}$,   (8)
for $\alpha > 1$, and its implied mutual information,
$MI_{\alpha}(X, Y) = H_{\alpha}(X) + H_{\alpha}(Y) - H_{\alpha}(X, Y)$.   (9)
Obviously, the family of entropies in (8) also satisfies $\lim_{\alpha \to 1} H_{\alpha}(\mathbf{p}) = H_{\mathrm{BGS}}$, which, however, is only finitely defined for some of the distributions in $\mathcal{P}$. A significant advantage of (8) is that every member is finitely defined for each and every $\mathbf{p} \in \mathcal{P}$. The first main result established in this article is the fact that $X$ and $Y$ are independent if and only if $MI_{\alpha}(X, Y) = 0$ for any $\alpha \in (1, \infty)$, which immediately implies that there are many entropies satisfying $\{K_1, K_2, K_3, K_4'\}$.
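The following sketch, an illustration only with arbitrarily chosen distributions and $\alpha$ (the function names are hypothetical), computes $H_{\alpha}$ of (8) and $MI_{\alpha}$ of (9), and exhibits the behavior claimed above: $MI_{\alpha}$ is numerically zero for an independent joint distribution and nonzero for a dependent one.

import numpy as np

def h_alpha(p, alpha):
    """Entropy family (8): BGS entropy of the power escort of p."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    e = p ** alpha
    e /= e.sum()                 # power escort distribution, Equation (12)
    return -np.sum(e * np.log(e))

def mi_alpha(P, alpha):
    """Implied mutual information (9)."""
    P = np.asarray(P, dtype=float)
    return h_alpha(P.sum(axis=1), alpha) + h_alpha(P.sum(axis=0), alpha) - h_alpha(P, alpha)

alpha = 2.5
px, py = np.array([0.3, 0.7]), np.array([0.2, 0.5, 0.3])
print(mi_alpha(np.outer(px, py), alpha))                        # ~0: independent joint
print(mi_alpha(np.array([[0.25, 0.05], [0.05, 0.65]]), alpha))  # nonzero: dependent joint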
A second question of interest is whether there exist other forms of entropy satisfying $\{K_1, K_2, K_3, K_4'\}$ beyond those in (8). The answer to this question is unknown in general. However, under certain restrictions on the functional forms of $H(\mathbf{p})$, uniqueness of (8) may be established. This, in fact, is the second main result of this article.
In Section 2, the statements of the two main results are made more precise and are established in two separate subsections. The article ends with Section 3, where several related minor results are summarized.

2. Main Results

The path leading to both main results of this article goes through a mapping, $\Phi: \mathcal{P} \to \mathcal{P}^* \subseteq \mathcal{P}$, denoted $\mathbf{p}^* = \Phi(\mathbf{p})$ and referred to as the escort distribution of $\mathbf{p}$ on the same alphabet $\mathcal{X}$. $\mathbf{p}^*$ is constructed in a special way, according to the concept of escort distributions introduced in [6]. For a given function $\phi(p) \ge 0$ on $[0, 1]$ and a distribution $\mathbf{p} \in \mathcal{P}$, provided that $0 < \sum_{i \ge 1} \phi(p_i) = C(\mathbf{p}, \phi(\cdot)) < \infty$, $\mathbf{p}^* = \{p_k^*; k \ge 1\}$, where
$p_k^* = \frac{\phi(p_k)}{\sum_{i \ge 1} \phi(p_i)}$,   (10)
is referred to as the escort distribution of $\mathbf{p}$ associated with the deformation function (also known as the escort function), $\phi(p)$. The utility of escort distributions is discussed by many researchers, most notably [7] in the context of statistical mechanics, and [8] regarding information geometry. Escort distributions, originally introduced as natural scanners of a single underpinning probability distribution, $\mathbf{p}$, in a multifractal structure, have been shown to be useful in a great variety of places and ways, ranging from information theory and coding theory to multifractal neural networks. For example, many interesting results and applications may be found in [9,10] and the references therein.
Consider a special family of power escort functions,
$\{\phi(p) = \lambda p^{\alpha} : \alpha > 1, \lambda > 0\}$.   (11)
When $\phi(p)$ is a member of (11), (10) becomes
$p_k^* = \frac{p_k^{\alpha}}{\sum_{i \ge 1} p_i^{\alpha}}$.   (12)
The Boltzmann–Gibbs–Shannon entropy of the escort distribution, $\mathbf{p}^* = \{p_k^*; k \ge 1\}$, becomes $H_{\alpha}(\mathbf{p})$ of (8), and $MI_{\alpha}$ of (9) becomes its implied Shannon's mutual information.
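To make the escort construction concrete, the sketch below forms the escort distribution (10) for a power deformation function from (11), confirms that the multiplicative constant $\lambda$ cancels so that (12) results, and shows that the BGS entropy of the escort equals $H_{\alpha}(\mathbf{p})$ of (8) by construction. The code is a hypothetical illustration, not taken from the article; NumPy is assumed.

import numpy as np

def escort(p, phi):
    """Escort distribution (10): p*_k = phi(p_k) / sum_i phi(p_i)."""
    w = phi(np.asarray(p, dtype=float))
    return w / w.sum()

def bgs(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

p = np.array([0.5, 0.3, 0.2])        # an illustrative distribution
alpha = 2.0

escort_1 = escort(p, lambda x: x ** alpha)          # lambda = 1
escort_5 = escort(p, lambda x: 5.0 * x ** alpha)    # lambda = 5: same escort, as in (12)
print(escort_1, escort_5)

print(bgs(escort_1))                  # equals H_alpha(p) of (8) by construction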

2.1. Characterization of Independence

Theorem 1.
$X$ and $Y$ are independent if and only if $MI_{\alpha}(X, Y) = 0$, for any fixed $\alpha > 1$.
Let $(X, Y)$ be a pair of random elements on $\mathcal{X} \times \mathcal{Y}$, with a joint probability distribution, $\mathbf{p}_{X,Y} = \{p_{i,j}; i \ge 1, j \ge 1\}$, and two marginal distributions, $\mathbf{p}_X$ and $\mathbf{p}_Y$. For a fixed $\alpha > 1$, consider another pair of random elements $(X^*, Y^*)$ in the same alphabet $\mathcal{X} \times \mathcal{Y}$, but with an induced joint probability distribution, $\mathbf{p}^*_{X,Y} = \{p^*_{i,j}; i \ge 1, j \ge 1\}$, and two marginal distributions, $\mathbf{p}^*_X = \{p^*_{i,\cdot}; i \ge 1\}$ and $\mathbf{p}^*_Y = \{p^*_{\cdot,j}; j \ge 1\}$, where, for the fixed $\alpha > 1$,
$p^*_{i,j} = \frac{p_{i,j}^{\alpha}}{\sum_{s \ge 1, t \ge 1} p_{s,t}^{\alpha}}, \quad p^*_{i,\cdot} = \frac{\sum_{j \ge 1} p_{i,j}^{\alpha}}{\sum_{s \ge 1, t \ge 1} p_{s,t}^{\alpha}}, \quad \text{and} \quad p^*_{\cdot,j} = \frac{\sum_{i \ge 1} p_{i,j}^{\alpha}}{\sum_{s \ge 1, t \ge 1} p_{s,t}^{\alpha}}$.   (13)
Since (8) is the Boltzmann–Gibbs–Shannon entropy of the escort distribution, $\mathbf{p}^* = \{p_k^*; k \ge 1\}$, that is, $H_{\alpha}(X) = H_{\mathrm{BGS}}(X^*)$, and $MI_{\alpha}(X, Y) = MI(X^*, Y^*)$, by Fact 2, Theorem 1 is an immediate consequence of the following lemma.
Lemma 1.
$X$ and $Y$ are independent if and only if $X^*$ and $Y^*$ are independent.
Proof. 
For each $\alpha > 1$, if $p_{i,j} = p_{i,\cdot} \times p_{\cdot,j}$ for all pairs $(i, j)$, $i \ge 1$ and $j \ge 1$, then
$p^*_{i,j} = \frac{p_{i,j}^{\alpha}}{\sum_{s \ge 1, t \ge 1} p_{s,t}^{\alpha}} = \frac{p_{i,\cdot}^{\alpha}}{\sum_{s \ge 1} p_{s,\cdot}^{\alpha}} \times \frac{p_{\cdot,j}^{\alpha}}{\sum_{t \ge 1} p_{\cdot,t}^{\alpha}} = p^*_{i,\cdot} \times p^*_{\cdot,j}$.   (14)
Therefore, the fact that $H_{\alpha}(X, Y)$ is the Boltzmann–Gibbs–Shannon entropy of $(X^*, Y^*)$ implies $MI_{\alpha}(X, Y) = 0$.
Conversely, if $MI_{\alpha}(X, Y) = 0$, then the factorization in (14) holds, which implies, for each pair $(i, j)$, $i \ge 1$ and $j \ge 1$,
$p_{i,j} = \left(\frac{p_{i,\cdot}^{\alpha}}{\sum_{s \ge 1} p_{s,\cdot}^{\alpha}}\right)^{1/\alpha} \times \left(\frac{p_{\cdot,j}^{\alpha}}{\sum_{t \ge 1} p_{\cdot,t}^{\alpha}}\right)^{1/\alpha} \times \left(\sum_{s \ge 1, t \ge 1} p_{s,t}^{\alpha}\right)^{1/\alpha} = p_{i,\cdot} \times p_{\cdot,j} \times \left(\frac{\sum_{s \ge 1, t \ge 1} p_{s,t}^{\alpha}}{\sum_{s \ge 1} p_{s,\cdot}^{\alpha} \times \sum_{t \ge 1} p_{\cdot,t}^{\alpha}}\right)^{1/\alpha}$.   (15)
Noting that the third factor in the expression above does not depend on $i$ or $j$, the lemma immediately follows from the factorization theorem.    □
Remark 1.
It is acknowledged that the proof of Lemma 1 above is inspired by a similar proof in [11], where an analogous result with a more restrictive family of escort functions is established.

2.2. A Uniqueness Theorem

In Theorem 1, it is established that $X$ and $Y$ are independent if and only if $X^*$ and $Y^*$ are independent, where $(X, Y)$ and $(X^*, Y^*)$ are linked by a power escort transformation, $\phi(p)$, through the mapping, $\Phi$, between their respective joint distributions. Such an escort function may be reasonably referred to as an independence-dependence preserving deformation function. In concept, however, there may exist other such functions outside of the power family in (11). Theorem 2 below says otherwise.
Definition 1.
A deformation function $\phi(p)$ on $[0, 1]$ is said to be independence-dependence preserving with respect to a subclass $\mathcal{P}'_{X,Y} \subseteq \mathcal{P}_{X,Y}$ if, for each and every $\mathbf{p}_{X,Y} \in \mathcal{P}'_{X,Y}$ and its associated escort distribution $\mathbf{p}^*_{X,Y}$, $X$ and $Y$ are independent if and only if $X^*$ and $Y^*$ are independent. More specifically, a deformation function $\phi(p)$ on $[0, 1]$ is said to be independence-dependence preserving if it is independence-dependence preserving with respect to $\mathcal{P}_{X,Y}$.
Theorem 2.
A measurable and integrable deformation function, $\phi(p)$ for $p \in (0, 1)$, is independence-dependence preserving if and only if it is a member of the family of power functions in (11).
Lemma 2.
Suppose $f(x) > 0$ is a Lebesgue measurable function on $\mathbb{R}$ such that $f(x + y) = f(x) f(y)$ for all $x, y \in \mathbb{R}$. Then, $f(x) = e^{\alpha x}$ for all $x \in \mathbb{R}$ and some constant $\alpha \in \mathbb{R}$.
Lemma 2 above is due to [12]. In fact, it is also established in [12] that, if the condition of Lebesgue measurability is not imposed, a nowhere-continuous $f(x)$ satisfying $f(x + y) = f(x) f(y)$ for all $x, y \in \mathbb{R}$ exists and is therefore not of the form $f(x) = e^{\alpha x}$.
Lemma 3.
Suppose $g(x) > 0$ is a Lebesgue measurable function on $(0, \infty)$ such that $g(xy) = g(x) g(y)$ for all $x, y \in (0, \infty)$. Then, $g(x) = x^{\alpha}$ for all $x > 0$ and some constant $\alpha \in \mathbb{R}$.
Proof. 
For any $x, y \in (0, \infty)$, consider the variable transformation $s = \ln x$ and $t = \ln y$, and hence $x = e^s$ and $y = e^t$. Let $f(s) = g(e^s)$. It follows that, for all $s, t \in \mathbb{R}$,
$f(s + t) = g(e^{s + t}) = g(e^s e^t) = g(e^s) g(e^t) = f(s) f(t)$.
Since $g(x)$ is measurable, $f(s)$ is measurable, and therefore, by Lemma 2, $f(s) = e^{\alpha s}$ for some constant $\alpha \in \mathbb{R}$, or equivalently $g(x) = x^{\alpha}$.    □
Lemma 4.
If $\phi(p)$ is a member of (11), then $\phi(p)$ is independence-dependence preserving.
Proof. 
Let $\{p_{i,j}; i \ge 1, j \ge 1\}$ be any joint distribution on $\mathcal{X} \times \mathcal{Y}$ and denote its marginal distributions as $\{p_{i,\cdot}; i \ge 1\}$ and $\{p_{\cdot,j}; j \ge 1\}$. Suppose $p_{i,j} = p_{i,\cdot}\, p_{\cdot,j}$ for every pair $(i, j)$. It follows that
$\frac{\phi(p_{i,j})}{\sum_{s \ge 1, t \ge 1} \phi(p_{s,t})} = \frac{\lambda^2 p_{i,\cdot}^{\alpha} p_{\cdot,j}^{\alpha}}{\lambda^2 \sum_{s \ge 1, t \ge 1} p_{s,\cdot}^{\alpha} p_{\cdot,t}^{\alpha}} = \frac{\phi(p_{i,\cdot})}{\sum_{s \ge 1} \phi(p_{s,\cdot})} \times \frac{\phi(p_{\cdot,j})}{\sum_{t \ge 1} \phi(p_{\cdot,t})}$.
Conversely, suppose
$\frac{\phi(p_{i,j})}{\sum_{s \ge 1, t \ge 1} \phi(p_{s,t})} = \frac{\phi(p_{i,\cdot})}{\sum_{s \ge 1} \phi(p_{s,\cdot})} \times \frac{\phi(p_{\cdot,j})}{\sum_{t \ge 1} \phi(p_{\cdot,t})}$
holds for every pair $(i, j)$. Noting that $\lambda \sum_{s \ge 1, t \ge 1} \phi(p_{s,t}) = \sum_{s \ge 1} \phi(p_{s,\cdot}) \times \sum_{t \ge 1} \phi(p_{\cdot,t})$, it follows that $\lambda \phi(p_{i,j}) = \phi(p_{i,\cdot})\, \phi(p_{\cdot,j})$, $\lambda^2 p_{i,j}^{\alpha} = \lambda p_{i,\cdot}^{\alpha} \cdot \lambda p_{\cdot,j}^{\alpha}$, and $p_{i,j} = p_{i,\cdot}\, p_{\cdot,j}$ for every pair $(i, j)$.    □
Lemma 5.
If $\phi(p)$ is Lebesgue measurable and integrable on $(0, 1)$ and is independence-dependence preserving on $\mathcal{P}_{X,Y}$, then $\phi(p)$ is a member of (11).
Proof. 
Suppose $X$ and $Y$ are independent, that is, $p_{i,j} = p_{i,\cdot} \times p_{\cdot,j}$ for all $(i, j)$. For $\phi(\cdot)$ to preserve independence, it is to satisfy, for all $(i, j)$,
$p^*_{i,j} = p^*_{i,\cdot} \times p^*_{\cdot,j}$,
or to satisfy, for any two pairs of indices $(k, l)$ and $(s, t)$,
$\frac{\phi(p_{k,l})}{\phi(p_{s,t})} = \frac{\sum_j \phi(p_{k,j}) \times \sum_i \phi(p_{i,l})}{\sum_j \phi(p_{s,j}) \times \sum_i \phi(p_{i,t})}$.   (17)
More specifically, letting $k = s$, Equation (17) is reduced to
$\frac{\phi(p_{k,l})}{\phi(p_{k,t})} = \frac{\sum_i \phi(p_{i,l})}{\sum_i \phi(p_{i,t})}$.   (18)
Noting $p_{i,j} = p_{i,\cdot} \times p_{\cdot,j}$ for all pairs $(i, j)$, it follows from (18) that
$\frac{\phi(p_{k,\cdot} \times p_{\cdot,l})}{\phi(p_{k,\cdot} \times p_{\cdot,t})} = \frac{\sum_i \phi(p_{i,\cdot} \times p_{\cdot,l})}{\sum_i \phi(p_{i,\cdot} \times p_{\cdot,t})}$.   (19)
The right-hand side of (19) is independent of $k$, and so is the left-hand side of (19). It follows that
$\frac{\phi(p_{k,\cdot} \times q_{\cdot,l})}{\phi(p_{k,\cdot} \times q_{\cdot,t})} = \frac{\phi(p_{s,\cdot} \times q_{\cdot,l})}{\phi(p_{s,\cdot} \times q_{\cdot,t})}$   (20)
regardless of whether $k = s$ or $k \ne s$. It is to be noted that (20) implies
$\frac{\phi(p \cdot q)}{\phi(p \cdot q')} = \frac{\phi(p' \cdot q)}{\phi(p' \cdot q')}$   (21)
for any $p \in (0, 1)$, $p' \in (0, 1)$, $q \in (0, 1)$, and $q' \in (0, 1)$, subject to the constraints $p + p' \le 1$ and $q + q' \le 1$.
Now, it is desired to show that, without the constraints $p + p' \le 1$ and $q + q' \le 1$, (21) still holds. Toward that end, the proof is given in two steps: (1) $q + q' \le 1$ and (2) $q + q' > 1$.
Step 1: Suppose $q + q' \le 1$. Let
$\mathbf{P}_1 = \{p, \tfrac{1}{2}(1 - \max(p, p')), \dots\}, \quad \mathbf{P}_2 = \{p', \tfrac{1}{2}(1 - \max(p, p')), \dots\}, \quad \mathbf{Q} = \{q, q', \dots\}$.
Applying $\mathbf{P}_1$ and $\mathbf{Q}$ to Equation (20), it follows that
$\frac{\phi(p \cdot q)}{\phi(p \cdot q')} = \frac{\phi(\tfrac{1}{2}(1 - \max(p, p')) \cdot q)}{\phi(\tfrac{1}{2}(1 - \max(p, p')) \cdot q')}$.
Applying $\mathbf{P}_2$ and $\mathbf{Q}$ to Equation (20), it follows that
$\frac{\phi(p' \cdot q)}{\phi(p' \cdot q')} = \frac{\phi(\tfrac{1}{2}(1 - \max(p, p')) \cdot q)}{\phi(\tfrac{1}{2}(1 - \max(p, p')) \cdot q')}$,
and that
$\frac{\phi(p \cdot q)}{\phi(p \cdot q')} = \frac{\phi(p' \cdot q)}{\phi(p' \cdot q')}$.   (24)
Before moving on to Step 2, let it be noted that the constraint $p + p' \le 1$ is not used in the above proof of Step 1. That is to say that (24) holds under the condition $q + q' \le 1$ regardless of whether $p + p' \le 1$ or $p + p' > 1$.
Step 2: Suppose $q + q' > 1$. Let $q'' = \tfrac{1}{2}(1 - \max(q, q'))$. Noting $q + q'' \le 1$, evaluating (24) with $q''$ in place of $q'$ gives
$\frac{\phi(p \cdot q)}{\phi(p \cdot \tfrac{1}{2}(1 - \max(q, q')))} = \frac{\phi(p' \cdot q)}{\phi(p' \cdot \tfrac{1}{2}(1 - \max(q, q')))}$.   (25)
Noting $q' + q'' \le 1$, evaluating (24) with $(q', q'')$ in place of $(q, q')$ gives
$\frac{\phi(p \cdot q')}{\phi(p \cdot \tfrac{1}{2}(1 - \max(q, q')))} = \frac{\phi(p' \cdot q')}{\phi(p' \cdot \tfrac{1}{2}(1 - \max(q, q')))}$.   (26)
Combining (25) and (26) again gives (24), but now for all $p, p', q, q' \in (0, 1)$ without any constraints.
Moreover, (24) may be further simplified. Let $\epsilon$ be such that $0 < \epsilon < \min\{1/q, 1/q'\} - 1$. Then, for any $p \in (0, 1)$, $q \in (0, 1)$, $q' \in (0, 1)$,
$\frac{\phi(p \cdot q)}{\phi(p \cdot q')} = \frac{\phi(q)}{\phi(q')}$.   (27)
This is so because
$\frac{\phi(p \cdot q)}{\phi(p \cdot q')} = \frac{\phi\left(\frac{p}{1 + \epsilon} \cdot (1 + \epsilon) q\right)}{\phi\left(\frac{p}{1 + \epsilon} \cdot (1 + \epsilon) q'\right)} = \frac{\phi\left(\frac{1}{1 + \epsilon} \cdot (1 + \epsilon) q\right)}{\phi\left(\frac{1}{1 + \epsilon} \cdot (1 + \epsilon) q'\right)} = \frac{\phi(q)}{\phi(q')}$.
Next, it is to be established that $\phi(p)$ is continuous on $(0, 1)$. Since $\phi(\cdot)$ is measurable and integrable on $(0, 1)$ by assumption, substituting $u = xv$, $\Phi(x) = \int_0^x \phi(u)\, du = x \int_0^1 \phi(xv)\, dv$. By (27), for a fixed $s \in (0, 1)$,
$\frac{\phi(xv)}{\phi(sv)} = \frac{\phi(x)}{\phi(s)}, \quad \text{or} \quad \phi(xv) = \frac{\phi(x)\, \phi(sv)}{\phi(s)}$.
It follows that
$\Phi(x) = x\, \phi(x) \int_0^1 \frac{\phi(sv)}{\phi(s)}\, dv$.
Since $\Phi(x)$ is continuous in $x$, and $x \int_0^1 \frac{\phi(sv)}{\phi(s)}\, dv$ is a linear function of $x$, it follows that $\phi(x)$ is continuous in $x$ on $(0, 1)$. Denoting $\phi(1) = \lim_{x \to 1} \phi(x)$, by (27),
$\phi(p \cdot q) = \frac{\phi(p) \cdot \phi(q)}{\phi(1)}$,
which, by Lemma 3, implies that $\phi(p)$ is a member of (11).    □
Lemmas 4 and 5 immediately give Theorem 2.

3. Other Results

An independence-dependence preserving deformation function, $\phi(p)$, with respect to a subspace, $\mathcal{P}'_{X,Y} \subseteq \mathcal{P}_{X,Y}$, need not be of the power form in (11). Let $\mathcal{P}_{K \times K}$ be the collection of all distributions of $(X, Y)$ on a $K \times K$ joint alphabet, where $K \ge 2$ is a finite integer.
Proposition 1.
A Lebesgue measurable and integrable deformation function $\phi(p)$ is independence-dependence preserving on a subspace $\mathcal{P}'_{X,Y}$ such that $\mathcal{P}_{3 \times 3} \subseteq \mathcal{P}'_{X,Y} \subseteq \mathcal{P}_{X,Y}$ if and only if $\phi(p)$ is a member of the family of power functions in (11).
The proof of Proposition 1 is trivial because the proof of Lemma 5 directly applies in this case, noting that the existence of the constructed distributions, $\mathbf{P}_1$, $\mathbf{P}_2$, $\mathbf{Q}$, in the proof of Lemma 5 only requires the joint alphabet to be $3 \times 3$ or larger.
It may be interesting to note that Proposition 1 is a stronger statement than Theorem 2 in the sense that Theorem 2 may be considered a corollary of Proposition 1. This is due to the fact that, for two sub-classes of distributions $\mathcal{P}'_{X,Y} \subseteq \mathcal{P}''_{X,Y}$, a $\phi(p)$ that is independence-dependence preserving with respect to $\mathcal{P}''_{X,Y}$ is independence-dependence preserving with respect to $\mathcal{P}'_{X,Y}$.
Example 3 below illustrates the fact that, on some restricted class of probability distributions, an independence-dependence preserving deformation function, $\phi(p)$, need not be of the power form of (11).
Example 3.
Consider the collection of all distributions on a $2 \times 2$ joint alphabet, denoted $\mathcal{P}_{2 \times 2}$. The function
$\phi(x) = \exp\left(-(x - 1/4) + 2(x - 1/4)^2\right)$   (29)
is independence-dependence preserving but is not in the form of a power function. To see this, let it be first noted what being independence-dependence preserving entails on a $2 \times 2$ alphabet. Write the joint distribution as $\{p_{i,j}; i = 1, 2, j = 1, 2\}$ and the two marginal distributions as $\{p_{1,\cdot}, p_{2,\cdot}\} = \{p, 1 - p\}$ and $\{p_{\cdot,1}, p_{\cdot,2}\} = \{q, 1 - q\}$. When two underlying random elements $X$ and $Y$ on the $2 \times 2$ alphabet are independent, a qualified deformation function $\phi(x)$ must satisfy
$\frac{\phi(p_{1,1})}{\sum_{s = 1, 2;\, t = 1, 2} \phi(p_{s,t})} = \frac{\phi(p_{1,1}) + \phi(p_{1,2})}{\sum_{s = 1, 2;\, t = 1, 2} \phi(p_{s,t})} \times \frac{\phi(p_{1,1}) + \phi(p_{2,1})}{\sum_{s = 1, 2;\, t = 1, 2} \phi(p_{s,t})}$,
or, letting $c = \phi(pq) + \phi(p(1 - q)) + \phi(q(1 - p)) + \phi((1 - p)(1 - q))$,
$c\, \phi(pq) = \left(\phi(pq) + \phi(p(1 - q))\right) \times \left(\phi(pq) + \phi(q(1 - p))\right)$,
which, after a few algebraic steps, is reduced to
$\phi(pq) \times \phi((1 - p)(1 - q)) = \phi(p(1 - q))\, \phi(q(1 - p))$.   (30)
It is easily verified that (29) satisfies (30).
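The claim that (29) satisfies (30) can also be checked numerically. The sketch below is an illustration only, using the sign convention shown in (29) above and assuming NumPy; it evaluates both sides of (30) on random pairs $(p, q)$ and contrasts the result with a function, $\exp(x)$, that is neither a power function nor a solution of (30).

import numpy as np

def phi(x):
    # The deformation function of Equation (29)
    return np.exp(-(x - 0.25) + 2.0 * (x - 0.25) ** 2)

# Check identity (30): phi(pq) * phi((1-p)(1-q)) == phi(p(1-q)) * phi(q(1-p))
rng = np.random.default_rng(0)
p, q = rng.uniform(0.01, 0.99, size=(2, 1000))
lhs = phi(p * q) * phi((1 - p) * (1 - q))
rhs = phi(p * (1 - q)) * phi(q * (1 - p))
print(np.max(np.abs(lhs - rhs)))     # ~1e-16: (29) satisfies (30)

# A generic non-power choice, phi(x) = exp(x), fails (30):
print(np.max(np.abs(np.exp(p * q + (1 - p) * (1 - q)) - np.exp(p * (1 - q) + q * (1 - p)))))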
The independence-dependence preserving property of a chosen $\phi(p)$ may be important when dependence between two random elements is of importance in a study. However, when only a one-to-one escort deformation mapping $\Phi$ is desired, the deformation function $\phi(p)$ need not be a member of (11). The following proposition provides a sufficient condition.
Proposition 2.
If $\phi(p)$ is strictly increasing for $p \in [0, 1]$, then the $\phi$-induced mapping $\Phi: \mathcal{P} \to \mathcal{P}^*$ is injective.
Proof. 
For every strictly increasing $\phi(p)$ on $p \in [0, 1]$ for which $\mathbf{p}^*$ is well-defined for each and every $\mathbf{p} \in \mathcal{P}$, it suffices to show that, given any $\mathbf{p}^* \in \mathcal{P}^*$, $\Phi^{-1}(\mathbf{p}^*)$ is unique. Toward that end, suppose there are two distinct distributions, $\mathbf{p}_1 = \{p_{1,k}; k \ge 1\} \in \mathcal{P}$ and $\mathbf{p}_2 = \{p_{2,k}; k \ge 1\} \in \mathcal{P}$, such that $\mathbf{p}_1^* = \mathbf{p}_2^* = \mathbf{p}^*$. It follows that there exists an index $k$ such that $p_{1,k} \ne p_{2,k}$. Without loss of generality, let it be supposed that
$p_{1,k} < p_{2,k}$.   (31)
Then, it follows further that
$\frac{\phi(p_{1,k})}{\sum_{i \ge 1} \phi(p_{1,i})} = \frac{\phi(p_{2,k})}{\sum_{i \ge 1} \phi(p_{2,i})} = p_k^*, \quad \text{and hence} \quad \phi(p_{1,k}) = \phi(p_{2,k}) \frac{\sum_{i \ge 1} \phi(p_{1,i})}{\sum_{i \ge 1} \phi(p_{2,i})}$.
By (31) and the condition that $\phi(p)$ is strictly increasing on $[0, 1]$, it follows that $\phi(p_{1,k}) < \phi(p_{2,k})$, and hence
$r = \frac{\sum_{i \ge 1} \phi(p_{1,i})}{\sum_{i \ge 1} \phi(p_{2,i})} < 1$.
However, since $\phi(p_{1,k}) = \phi(p_{2,k})\, r$ for every $k \ge 1$, it follows that $\phi(p_{1,k}) < \phi(p_{2,k})$ for every $k \ge 1$. Since $\phi(p)$ is strictly increasing, it follows that $p_{1,k} < p_{2,k}$ for every $k \ge 1$, which contradicts the supposition that both $\mathbf{p}_1$ and $\mathbf{p}_2$ are probability distributions, that is, $\sum_{i \ge 1} p_{1,i} = \sum_{i \ge 1} p_{2,i} = 1$. The said contradiction implies that $\mathbf{p}_1 = \mathbf{p}_2$, and the proposition follows.    □
Example 4.
Consider a special family of $\phi(p)$, $\phi(p) = (1 + \lambda p)^{1/\lambda}$ for $p \in [0, 1]$, where $\lambda > 0$ is a parameter. The escort distribution based on this deformation function is one of several well-studied families of escort distributions known as the Tsallis distributions. Every such $\phi(p)$ is strictly increasing on $[0, 1]$ and therefore, by Proposition 2, every member of the family induces an injective mapping. However, by Theorem 2, the induced mapping is not independence-dependence preserving on a countable joint alphabet $\mathcal{X} \times \mathcal{Y}$.
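The following sketch illustrates Example 4 under stated assumptions (NumPy; $\lambda = 1$; arbitrary marginal distributions; hypothetical function names): the deformation $\phi(p) = (1 + \lambda p)^{1/\lambda}$ induces a well-defined, injective escort map, yet the escort of an independent joint distribution is no longer a product of its marginals, so the map does not preserve independence.

import numpy as np

def tsallis_escort(P, lam):
    """Escort via phi(p) = (1 + lam * p)^(1/lam), applied cellwise and normalized."""
    W = (1.0 + lam * np.asarray(P, dtype=float)) ** (1.0 / lam)
    return W / W.sum()

lam = 1.0
px, py = np.array([0.3, 0.7]), np.array([0.2, 0.8])
P = np.outer(px, py)                      # an independent joint distribution
P_star = tsallis_escort(P, lam)

# The escort of the independent joint is no longer a product of its marginals:
print(P_star)
print(np.outer(P_star.sum(axis=1), P_star.sum(axis=0)))   # differs from P_star

# The map is injective (Proposition 2): distinct p give distinct escorts.
print(tsallis_escort(np.array([0.5, 0.5]), lam), tsallis_escort(np.array([0.4, 0.6]), lam))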

4. Concluding Remarks

Two main results are established in this article. First, there exist many entropies other than the BGS entropy satisfying the weaker axiomatic conditions, more specifically $\{K_1, K_2, K_3, K_4'\}$ rather than $\{K_1, K_2, K_3, K_4\}$, yet retaining the key utility that the associated mutual information preserves independence-dependence on a countable joint alphabet as the BGS entropy does. Second, by way of escort transformations, the newly identified entropies are the only ones satisfying the weaker axioms on a general countable joint alphabet.
The significance of the established results comes into better focus in a broader perspective. Inspired by developments in modern data science, a shift is increasingly visible in the foundation of statistical inference, away from a real space, where random variables reside, toward a non-meterized and non-ordinal alphabet, where more general random elements reside. While statistical inferences based on random variables are theoretically well supported in the rich literature of probability and statistics, inferences on alphabets, mostly by way of various entropies and their estimation, are less systematically supported in theory. Without the familiar notions of neighborhood, real or complex moments, tails, etc., associated with random variables, probability and statistics based on random elements on alphabets need more attention to foster a clearer framework for the rigorous development of entropy-based statistical exercises, which may be more concisely termed entropic statistics. While a considerable volume of published work has been accumulated over several decades on entropy estimation, it is fair to say that the current research activities in the existing literature are sporadic in nature and the implied theoretical framework is porous. The said porosity permeates across the board, from basic axioms to a general definition of entropy, and from model interpretability to statistical inference, although some recent efforts to alleviate it have been made. See, for example, [13,14], where a general definition of entropy and several fundamental results are given.
The exploration of this article is on the axiomatic foundation of entropy. If the primary study interest lies with the independence-dependence between two sets of random elements on a joint countable alphabet, as is usually the case in practice, then the main results of this article suggest that there is flexibility in choosing an entropy from (8) to serve that interest. In addition, the uniqueness of (8) by way of escort distributions is of interest not only in its own theoretical right but also for the support it provides in practice. For example, in artificial neural networks with multi-layer fractal structures, the links between nodes in two layers are often modeled by power escort distributions propagating through the entire network. In such a context, it is of fundamental importance to know that the power escort distributions are the only type that preserves dependence and independence. If the independence-dependence preserving property is desired, then the power escort is appropriate; if not, then some other escort is needed.

Author Contributions

All authors contributed equally to this article. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Informed Consent Statement

Not applicable.

Data Availability Statement

No data were used in this research.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423; 623–656.
2. Khinchin, A.I. Mathematical Foundations of Information Theory; Dover Publications: New York, NY, USA, 1957.
3. Amigó, J.M.; Balogh, S.G.; Hernández, S. A Brief Review of Generalized Entropies. Entropy 2018, 20, 813.
4. Ilić, V.M.; Korbel, J.; Gupta, S.; Scarfone, A.M. An overview of generalized entropic forms. Europhys. Lett. 2021, 133, 50005.
5. Zhang, Z. Statistical Implications of Turing’s Formula; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2017.
6. Beck, C.; Schlögl, F. Thermodynamics of Chaotic Systems; Cambridge University Press: Cambridge, UK, 1993.
7. Tsallis, C. Nonadditive Entropy and Nonextensive Statistical Mechanics. An Overview after 20 Years. Braz. J. Phys. 2009, 39, 337–357.
8. Amari, S. Information Geometry and Its Applications; Springer: Tokyo, Japan, 2016.
9. Matsuzoe, H. A Sequence of Escort Distributions and Generalizations of Expectations on q-Exponential Family. Entropy 2017, 19, 7.
10. Ampilova, N.; Soloviev, I.; Sergeev, V. On using escort distributions in digital image analysis. J. Meas. Eng. 2012, 9, 58–70.
11. Zhang, Z. Generalized Mutual Information. Stats 2020, 3, 13.
12. Hewitt, E.; Stromberg, K. Real and Abstract Analysis: A Modern Treatment of the Theory of Functions of a Real Variable; Springer: Berlin/Heidelberg, Germany, 1965.
13. Zhang, Z. Entropy-Based Statistics and Their Applications. Entropy 2023, 25, 936.
14. Zhang, Z. Several Basic Elements of Entropic Statistics. Entropy 2023, 25, 1060.