Abstract
The Boltzmann–Gibbs–Shannon (BGS) entropy is the only entropy form satisfying four conditions known as Khinchin’s axioms. The uniqueness theorem of the BGS entropy, plus the fact that Shannon’s mutual information completely characterizes independence between the two underlying random elements, puts the BGS entropy in a special place in many fields of study. In this article, the fourth axiom is replaced by a slightly weakened condition: the associated mutual information must be zero if and only if the two underlying random elements are independent. Under the weaker fourth axiom, other forms of entropy are sought by way of escort transformations. Two main results are reported in this article. First, there are many entropies other than the BGS entropy satisfying the weaker condition, yet retaining all the desirable utilities of the BGS entropy. Second, by way of escort transformations, the newly identified entropies are the only ones satisfying the weaker axioms.
1. Introduction and Summary
Let $X$ be a random element in a countable alphabet $\mathcal{X} = \{\ell_k; k \geq 1\}$, where $\ell_k$, $k \geq 1$, are distinct letters or labels, with a probability distribution $\mathbf{p} = \{p_k; k \geq 1\} \in \mathcal{P}$, where $\mathcal{P}$ is the collection of all possible probability distributions on $\mathcal{X}$. Many random system properties of interest may be described by entropic quantities or entropies, that is, functions of $\mathbf{p}$ that are label-independent. For generality of discussion, letting $\mathbf{p}^{\downarrow}$ be a non-increasingly ordered $\mathbf{p}$, a function of $\mathbf{p}$ that satisfies $H(\mathbf{p}) = H(\mathbf{p}^{\downarrow})$ for all $\mathbf{p} \in \mathcal{P}$ is referred to as an entropy. An entropy may also be denoted as $H$ for notational simplicity. For example, the Boltzmann–Gibbs–Shannon entropy $H_{\mathrm{BGS}}(\mathbf{p}) = -\sum_{k \geq 1} p_k \ln p_k$, the Rényi entropy $H_{\alpha}(\mathbf{p}) = (1 - \alpha)^{-1} \ln\left(\sum_{k \geq 1} p_k^{\alpha}\right)$ for a constant $\alpha > 0$ with $\alpha \neq 1$, and the Tsallis entropy $H_{q}(\mathbf{p}) = (q - 1)^{-1}\left(1 - \sum_{k \geq 1} p_k^{q}\right)$ for any $q \neq 1$ are of high utility across many fields of study, such as statistical mechanics and information theory.
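As a computational companion to these definitions (a sketch, not part of the original development; function names are illustrative and the countable distribution is assumed truncated to finitely many positive masses), the three entropies may be evaluated as follows.

```python
import numpy as np

def bgs_entropy(p):
    """Boltzmann-Gibbs-Shannon entropy: -sum_k p_k ln p_k."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                        # convention: 0 ln 0 = 0
    return -np.sum(p * np.log(p))

def renyi_entropy(p, alpha):
    """Renyi entropy: (1 - alpha)^{-1} ln(sum_k p_k^alpha), alpha > 0, alpha != 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

def tsallis_entropy(p, q):
    """Tsallis entropy: (q - 1)^{-1} (1 - sum_k p_k^q), q != 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

p = [0.5, 0.25, 0.125, 0.125]
print(bgs_entropy(p), renyi_entropy(p, 2.0), tsallis_entropy(p, 2.0))
```

All three functions depend on $\mathbf{p}$ only through the multiset of its positive masses, so label-independence in the sense above holds by construction.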
Ever since the information-theoretic utility of $H_{\mathrm{BGS}}$ was unlocked in [1], a large volume of research has been amassed in relation to $H_{\mathrm{BGS}}$. A considerable amount of the research effort has been placed in framing $H_{\mathrm{BGS}}$ in more general forms. Many articles have been published under the theme of generalized entropy, but from different perspectives. One particular perspective is the axiomatic system studied by Khinchin in [2].
Let $(X, Y)$ be a pair of random elements on a joint countable alphabet, $\mathcal{X} \times \mathcal{Y} = \{(x_i, y_j); i \geq 1, j \geq 1\}$, where $x_i$ for $i \geq 1$ and $y_j$ for $j \geq 1$ are distinct labels, with a joint probability distribution, $\mathbf{p}_{X,Y} = \{p_{i,j}; i \geq 1, j \geq 1\}$, and the two marginal distributions are $\mathbf{p}_X = \{p_{i\cdot}; i \geq 1\}$ and $\mathbf{p}_Y = \{p_{\cdot j}; j \geq 1\}$, where $p_{i\cdot} = \sum_{j \geq 1} p_{i,j}$ and $p_{\cdot j} = \sum_{i \geq 1} p_{i,j}$, for $X$ and $Y$, respectively. Let $H$ be an entropy. The four axioms of Khinchin are given below.
- :
- (Continuity) is continuous with respect to all elements of .
- :
- (Maximality) Given , is maximized in at for .
- :
- (Expansibility) Letting , .
- :
- (Separability) For any pair of random elements on with a joint probability distribution ,where and is the entropy of the conditional distribution of Y given .
K4 is sometimes also known as Strong Additivity.
Khinchin famously proved the following fact in [2].
Fact 1.
(The uniqueness theorem of entropy). For any $\mathbf{p} \in \mathcal{P}$, if an entropy $H = H(\mathbf{p})$, such that $H(\mathbf{p}) < \infty$, satisfies all four axioms, K1–K4, then $H$ must be uniquely of the form $H(\mathbf{p}) = -\sum_{k \geq 1} p_k \ln p_k$ up to a multiplicative constant.
Let

$MI(X, Y) = H_{\mathrm{BGS}}(X) + H_{\mathrm{BGS}}(Y) - H_{\mathrm{BGS}}(X, Y),$

which is often referred to as Shannon’s mutual information. The following fact is due to Shannon.
Fact 2.
Let $\mathbf{p}_{X,Y}$ be a joint probability distribution of $(X, Y)$ satisfying $H_{\mathrm{BGS}}(X) < \infty$ and $H_{\mathrm{BGS}}(Y) < \infty$. Then, $X$ and $Y$ are independent if and only if $MI(X, Y) = 0$.
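As a numerical illustration of Fact 2 (a sketch under the assumption of a finite joint alphabet; names are illustrative), the mutual information of a product-form joint distribution vanishes, while a perturbation away from product form yields a strictly positive value.

```python
import numpy as np

def bgs(p):
    """Boltzmann-Gibbs-Shannon entropy of a finite distribution."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mi(pxy):
    """Shannon's mutual information: H(X) + H(Y) - H(X, Y)."""
    pxy = np.asarray(pxy, dtype=float)
    return bgs(pxy.sum(axis=1)) + bgs(pxy.sum(axis=0)) - bgs(pxy)

px, py = np.array([0.6, 0.4]), np.array([0.7, 0.2, 0.1])
indep = np.outer(px, py)     # p_ij = p_i. * p_.j: X and Y independent
dep = indep.copy()
dep[0, 0] += 0.05            # move mass between two cells within a column:
dep[1, 0] -= 0.05            # the joint is no longer of product form
print(np.isclose(mi(indep), 0.0))  # True
print(mi(dep) > 0)                 # True
```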
The fact that the independence between $X$ and $Y$ may be characterized by a single-valued index, $MI(X, Y)$, under a general joint distribution on $\mathcal{X} \times \mathcal{Y}$ puts $H_{\mathrm{BGS}}$ in a very important place in information theory. Furthermore, the uniqueness theorem of entropy adds a special aura to $H_{\mathrm{BGS}}$.
However, $H_{\mathrm{BGS}}$, which satisfies

$H_{\mathrm{BGS}}(X, Y) = H_{\mathrm{BGS}}(X) + H_{\mathrm{BGS}}(Y)$ (3)

under independence of $X$ and $Y$, is considered by many physicists to be overly rigid. In search of more general forms of entropy, Khinchin’s axiom K4 is weakened in various ways, and a large number of research articles have been published under the weaker conditions. Many of these articles may be found in two excellent review articles; see [3,4].
The fourth axiom, K4, may be weakened to different degrees across a spectrum. At one end of the spectrum, K4 is ignored and generalized entropy forms are sought only under K1–K3. Other versions of the weakened K4 are mostly given as more general forms of (3) under independence of $X$ and $Y$. For example, an entropy may be required to satisfy

$H(X, Y) = f(H(X), H(Y))$ (4)

for some symmetric function of two variables, $f(\cdot, \cdot)$, of which a special case is

$H(X, Y) = H(X) + H(Y) + (1 - q)H(X)H(Y),$ (5)

where $q$ is a real number. The condition in (5) is a centerpiece of non-extensive statistical mechanics. The Tsallis entropy satisfies (4) in general and (5) in specific. It is to be particularly noted that the conditions in (4) and (5) are necessary conditions of $X$ and $Y$ being independent, but are not sufficient conditions.
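The pseudo-additivity in (5) is straightforward to verify numerically; a minimal sketch, with an assumed parameter value $q = 1.5$ and an assumed independent joint distribution (both chosen only for illustration), follows.

```python
import numpy as np

def tsallis(p, q):
    """Tsallis entropy of a finite distribution: (q - 1)^{-1} (1 - sum_k p_k^q)."""
    p = np.asarray(p, dtype=float).ravel()
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

q = 1.5
px, py = np.array([0.6, 0.4]), np.array([0.7, 0.2, 0.1])
pxy = np.outer(px, py)  # independent joint distribution

lhs = tsallis(pxy, q)   # H(X, Y)
rhs = tsallis(px, q) + tsallis(py, q) + (1 - q) * tsallis(px, q) * tsallis(py, q)
print(np.isclose(lhs, rhs))  # True: (5) holds under independence
```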
Non-extensive physics aside, the rigidity of K4 has its remarkable utility, namely Fact 2: a single-valued index characterizes the stochastic association between two random elements on a non-meterized joint alphabet under general distributions. (Also see standardized mutual information in Chapter 5 of [5].)
It may be interesting to ask whether there exist other entropies, $H$, such that, in addition to satisfying K1–K3, they also satisfy

$H(X, Y) = H(X) + H(Y)$ if and only if $X$ and $Y$ are independent. (6)
Let it be noted that (6) is weaker than K4 in the sense that K4 implies (6). In the same sense, (6) is a stronger condition than

$H(X, Y) = H(X) + H(Y)$ under independence of $X$ and $Y$, (7)

since the additivity in (7) is a necessary condition of $X$ and $Y$ being independent, but not a sufficient condition. The condition in (7) is a special case of (4).
Example 1.
The Rényi entropy, $H_\alpha$, satisfies (7) but not (6).
Example 2.
The no-name entropy, for any admissible value of its parameter, satisfies (7) but not (6).
As it turns out, there are many entropies satisfying (6). Consider the following family of entropies,

$H_\lambda(\mathbf{p}) = -\sum_{k \geq 1} \left(\frac{p_k^\lambda}{\sum_{i \geq 1} p_i^\lambda}\right) \ln\left(\frac{p_k^\lambda}{\sum_{i \geq 1} p_i^\lambda}\right)$ (8)

for $\lambda > 1$, and its implied mutual information,

$MI_\lambda(X, Y) = \sum_{i \geq 1} \sum_{j \geq 1} p_{\lambda,i,j} \ln\left(\frac{p_{\lambda,i,j}}{p_{\lambda,i\cdot}\, p_{\lambda,\cdot j}}\right),$ (9)

where $p_{\lambda,i,j} = p_{i,j}^\lambda / \sum_{s \geq 1} \sum_{t \geq 1} p_{s,t}^\lambda$, and $p_{\lambda,i\cdot} = \sum_{j \geq 1} p_{\lambda,i,j}$ and $p_{\lambda,\cdot j} = \sum_{i \geq 1} p_{\lambda,i,j}$ are the corresponding marginals. Obviously the family of entropies in (8) also covers, at the boundary value $\lambda = 1$, the BGS entropy, which, however, is only finitely defined for some of the distributions in $\mathcal{P}$. A significant advantage of (8) is that every member is finitely defined for each and every $\mathbf{p} \in \mathcal{P}$. The first main result established in this article is the fact that $X$ and $Y$ are independent if and only if $MI_\lambda(X, Y) = 0$ for any $\lambda > 1$, which immediately implies that there are many entropies satisfying (6).
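For concreteness, (8) and (9) may be computed as below; a minimal Python sketch assuming a finite joint alphabet, with illustrative names, which also previews the first main result: $MI_\lambda$ vanishes for a product-form joint distribution and is positive for a dependent one.

```python
import numpy as np

def escort(p, lam):
    """Power escort of a finite distribution: p_k^lam / sum_i p_i^lam."""
    w = np.asarray(p, dtype=float) ** lam
    return w / w.sum()

def h_lam(p, lam):
    """The entropy family (8): BGS entropy of the power escort of p."""
    e = escort(p, lam)
    return -np.sum(e * np.log(e))

def mi_lam(pxy, lam):
    """The implied mutual information (9): Shannon MI of the escort joint."""
    pxy = np.asarray(pxy, dtype=float)
    e = escort(pxy.ravel(), lam).reshape(pxy.shape)
    ex, ey = e.sum(axis=1), e.sum(axis=0)
    return np.sum(e * np.log(e / np.outer(ex, ey)))

px, py = np.array([0.6, 0.4]), np.array([0.7, 0.3])
indep = np.outer(px, py)
dep = np.array([[0.50, 0.10],
                [0.10, 0.30]])
for lam in (1.5, 2.0, 3.0):
    print(lam,
          h_lam(indep.ravel(), lam),            # H_lambda(X, Y), finite for lambda > 1
          np.isclose(mi_lam(indep, lam), 0.0),  # True: independence gives MI_lambda = 0
          mi_lam(dep, lam) > 0)                 # True: this dependent joint has MI_lambda > 0
```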
A second question of interest is whether there exist other forms of entropy satisfying (6) beyond those in (8). The answer to this question is unknown in general. However, under certain restrictions on the functional forms of the deformation functions defining the escort transformations, uniqueness of (8) may be established. This, in fact, is the second main result of this article.
2. Main Results
The path leading to both main results of this article goes through a mapping, $\Phi: \mathcal{P} \rightarrow \mathcal{P}$, denoted $\Phi(\mathbf{p}) = \mathbf{p}_\phi$ and referred to as the escort distribution of $\mathbf{p}$ on the same alphabet $\mathcal{X}$. $\Phi$ is constructed in a special way, according to the concept of escort distributions introduced in [6]. For a given function $\phi$ on $(0, 1]$ and a distribution $\mathbf{p} = \{p_k; k \geq 1\} \in \mathcal{P}$, provided that $0 < \sum_{i \geq 1} \phi(p_i) < \infty$,

$\mathbf{p}_\phi = \left\{ p_{\phi,k} = \frac{\phi(p_k)}{\sum_{i \geq 1} \phi(p_i)};\ k \geq 1 \right\}$ (10)

is referred to as the escort distribution of $\mathbf{p}$ associated with the deformation function (also known as the escort function), $\phi$. The utility of escort distributions is discussed by many researchers, most notably [7] in the context of statistical mechanics and [8] regarding information geometry. Escort distributions, originally conceived as natural scanners of a single underpinning probability distribution, $\mathbf{p}$, in a multifractal structure, have been shown to be useful in a great variety of places and ways, ranging from information theory and coding theory to multifractal neural networks. For example, many interesting results and applications may be found in [9,10] and the references therein.
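In computational terms, the mapping in (10) is a single normalization step. The following minimal sketch (illustrative names, finite support assumed) accepts an arbitrary deformation function as a callable; the power case introduced next is one such choice.

```python
import numpy as np

def escort_distribution(p, phi):
    """Escort of p under a deformation function phi, as in (10):
    p_{phi,k} = phi(p_k) / sum_i phi(p_i), provided 0 < sum_i phi(p_i) < inf."""
    w = np.array([phi(pk) for pk in p], dtype=float)
    s = w.sum()
    if not 0 < s < np.inf:
        raise ValueError("escort undefined: sum of phi(p_i) must be positive and finite")
    return w / s

p = [0.5, 0.3, 0.2]
print(escort_distribution(p, lambda x: x ** 2))      # a power deformation
print(escort_distribution(p, lambda x: -np.log(x)))  # a non-power deformation, for contrast
```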
Consider a special family of power escort functions,

$\phi_\lambda(p) = p^\lambda, \quad \lambda > 0.$ (11)

When $\phi$ is a member of (11), (10) becomes

$\mathbf{p}_\lambda = \left\{ p_{\lambda,k} = \frac{p_k^\lambda}{\sum_{i \geq 1} p_i^\lambda};\ k \geq 1 \right\}.$ (12)

The Boltzmann–Gibbs–Shannon entropy of the escort distribution, $\mathbf{p}_\lambda$, becomes $H_\lambda$ of (8), and the Shannon mutual information of the escort joint distribution becomes the implied mutual information $MI_\lambda$ of (9).
2.1. Characterization of Independence
Theorem 1.
$X$ and $Y$ are independent if and only if $MI_\lambda(X, Y) = 0$, for any fixed $\lambda > 1$.
Let $(X, Y)$ be a pair of random elements on $\mathcal{X} \times \mathcal{Y}$, with a joint probability distribution, $\mathbf{p}_{X,Y} = \{p_{i,j};\ i \geq 1, j \geq 1\}$, and two marginal distributions, $\mathbf{p}_X = \{p_{i\cdot}\}$ and $\mathbf{p}_Y = \{p_{\cdot j}\}$. For a fixed $\lambda > 1$, consider another pair of random elements, $(X_\lambda, Y_\lambda)$, on the same alphabet $\mathcal{X} \times \mathcal{Y}$, but with an induced joint probability distribution, $\mathbf{p}_\lambda = \{p_{\lambda,i,j};\ i \geq 1, j \geq 1\}$, and two marginal distributions, $\mathbf{p}_{\lambda,X} = \{p_{\lambda,i\cdot}\}$ and $\mathbf{p}_{\lambda,Y} = \{p_{\lambda,\cdot j}\}$, where

$p_{\lambda,i,j} = \frac{p_{i,j}^\lambda}{\sum_{s \geq 1} \sum_{t \geq 1} p_{s,t}^\lambda}.$ (13)
Since (8) is the Boltzmann–Gibbs–Shannon entropy of the escort distribution, $\mathbf{p}_\lambda$, that is, $H_\lambda(\mathbf{p}) = H_{\mathrm{BGS}}(\mathbf{p}_\lambda)$, and $MI_\lambda(X, Y) = MI(X_\lambda, Y_\lambda)$, by Fact 2, Theorem 1 is an immediate consequence of the following lemma.
Lemma 1.
$X$ and $Y$ are independent if and only if $X_\lambda$ and $Y_\lambda$ are independent.
Proof.
For each $\lambda > 1$, if $p_{i,j} = p_{i\cdot}\, p_{\cdot j}$ for all pairs $(i, j)$, $i \geq 1$ and $j \geq 1$, then

$p_{\lambda,i,j} = \frac{(p_{i\cdot}\, p_{\cdot j})^\lambda}{\sum_{s,t} (p_{s\cdot}\, p_{\cdot t})^\lambda} = \left(\frac{p_{i\cdot}^\lambda}{\sum_{s \geq 1} p_{s\cdot}^\lambda}\right)\left(\frac{p_{\cdot j}^\lambda}{\sum_{t \geq 1} p_{\cdot t}^\lambda}\right).$ (14)

Therefore, the two factors in (14), being functions of $i$ only and of $j$ only, are respectively the marginal distributions of $X_\lambda$ and $Y_\lambda$, and hence $X_\lambda$ and $Y_\lambda$ are independent.

Conversely, if $X_\lambda$ and $Y_\lambda$ are independent, then $p_{\lambda,i,j} = p_{\lambda,i\cdot}\, p_{\lambda,\cdot j}$ holds for all pairs, which implies, for each pair $(i, j)$, $i \geq 1$ and $j \geq 1$,

$p_{i,j} = \left(p_{\lambda,i,j} \sum_{s,t} p_{s,t}^\lambda\right)^{1/\lambda} = \left(p_{\lambda,i\cdot}\right)^{1/\lambda} \left(p_{\lambda,\cdot j}\right)^{1/\lambda} \left(\sum_{s,t} p_{s,t}^\lambda\right)^{1/\lambda}.$ (15)

Noting that the third factor in the expression above does not depend on $i$ or $j$, the lemma immediately follows from the factorization theorem. □
Remark 1.
It is acknowledged that the proof of Lemma 1 above is inspired by a similar proof in [11], where a similar result with a more restrictive family of escort functions is established.
2.2. A Uniqueness Theorem
In Theorem 1, it is established that $X$ and $Y$ are independent if and only if $X_\lambda$ and $Y_\lambda$ are independent, where $(X, Y)$ and $(X_\lambda, Y_\lambda)$ are linked by a power escort transformation, $\phi_\lambda(p) = p^\lambda$, through the mapping, $\Phi$, between their respective joint distributions. Such an escort function may reasonably be referred to as an independence-dependence preserving deformation function. In concept, however, there may exist other such functions outside of the power family in (11). Theorem 2 below says otherwise.
Definition 1.
A deformation function $\phi$ on $(0, 1]$ is said to be independence-dependence preserving with respect to a subclass $\mathcal{P}_0 \subseteq \mathcal{P}$, if for each and every joint distribution $\mathbf{p}_{X,Y} \in \mathcal{P}_0$ and its associated escort distribution $\mathbf{p}_\phi$, $X$ and $Y$ are independent if and only if $X_\phi$ and $Y_\phi$ are independent. More specifically, a deformation function $\phi$ on $(0, 1]$ is said to be independence-dependence preserving if it is independence-dependence preserving with respect to $\mathcal{P}$.
Theorem 2.
A Lebesgue measurable and integrable deformation function, $\phi(p)$ for $p \in (0, 1]$, is independence-dependence preserving if and only if it is a member of the family of power functions in (11).
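The "only if" direction of Theorem 2 can be probed numerically; in the sketch below (the deformation functions and distributions are chosen only for illustration), a power deformation applied cell-wise to an independent joint distribution preserves independence, whereas the non-power deformation $\phi(p) = -\ln p$ produces a dependent escort pair.

```python
import numpy as np

def escort_joint(pxy, phi):
    """Escort of a joint distribution under a deformation phi, applied cell-wise."""
    w = np.vectorize(phi)(np.asarray(pxy, dtype=float))
    return w / w.sum()

def shannon_mi(pxy):
    """Shannon mutual information of a finite joint distribution (all cells > 0)."""
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    return np.sum(pxy * np.log(pxy / np.outer(px, py)))

indep = np.outer([0.6, 0.3, 0.1], [0.5, 0.3, 0.2])  # independent 3 x 3 joint

power = escort_joint(indep, lambda p: p ** 2.0)     # a member of (11)
nonpow = escort_joint(indep, lambda p: -np.log(p))  # not a member of (11)

print(np.isclose(shannon_mi(power), 0.0))  # True: independence is preserved
print(shannon_mi(nonpow) > 0)              # True: dependence is created
```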
Lemma 2.
Suppose $f$ is a Lebesgue measurable function on $\mathbb{R}$ such that $f(x + y) = f(x) + f(y)$ for all $x, y \in \mathbb{R}$. Then, $f(x) = cx$ for all $x \in \mathbb{R}$ and some constant $c$.
Lemma 2 above is due to [12]. In fact, it is also established in [12] that if the condition of Lebesgue measurability is not imposed, a nowhere-continuous $f$, satisfying $f(x + y) = f(x) + f(y)$ for all $x, y \in \mathbb{R}$, exists and is therefore not of the form $f(x) = cx$.
Lemma 3.
Suppose $g$ is a positive Lebesgue measurable function on $(0, \infty)$ such that $g(xy) = g(x)g(y)$ for all $x, y \in (0, \infty)$. Then, $g(x) = x^c$ for all $x \in (0, \infty)$ and some constant $c$.
Proof.
For any $x, y \in (0, \infty)$, consider the variable transformation $x = e^u$ and $y = e^v$, and hence $u = \ln x$ and $v = \ln y$. Let $f(u) = \ln g(e^u)$. It follows that, for all $u, v \in \mathbb{R}$,

$f(u + v) = \ln g(e^{u + v}) = \ln\left(g(e^u)\, g(e^v)\right) = \ln g(e^u) + \ln g(e^v) = f(u) + f(v).$

Since $g$ is measurable, $f$ is measurable, and therefore, by Lemma 2, $f(u) = cu$ for some constant $c$, or equivalently $g(x) = x^c$. □
Lemma 4.
If $\phi$ is a member of (11), then $\phi$ is independence-dependence preserving.
Proof.
Let $\mathbf{p}_{X,Y} = \{p_{i,j}\}$ be any joint distribution on $\mathcal{X} \times \mathcal{Y}$ and denote its marginal distributions as $\{p_{i\cdot}\}$ and $\{p_{\cdot j}\}$. Suppose $p_{i,j} = p_{i\cdot}\, p_{\cdot j}$ for every pair $(i, j)$. It follows that

$p_{\lambda,i,j} = \frac{(p_{i\cdot}\, p_{\cdot j})^\lambda}{\sum_{s,t} (p_{s\cdot}\, p_{\cdot t})^\lambda} = \left(\frac{p_{i\cdot}^\lambda}{\sum_{s \geq 1} p_{s\cdot}^\lambda}\right)\left(\frac{p_{\cdot j}^\lambda}{\sum_{t \geq 1} p_{\cdot t}^\lambda}\right) = p_{\lambda,i\cdot}\, p_{\lambda,\cdot j}.$ (16)

Conversely, suppose

$p_{\lambda,i,j} = p_{\lambda,i\cdot}\, p_{\lambda,\cdot j}$

holds for every pair $(i, j)$. Noting $p_{i,j} = \left(p_{\lambda,i,j} \sum_{s,t} p_{s,t}^\lambda\right)^{1/\lambda}$, it follows that $p_{i,j} = \left(p_{\lambda,i\cdot}\right)^{1/\lambda} \left(p_{\lambda,\cdot j}\right)^{1/\lambda} \left(\sum_{s,t} p_{s,t}^\lambda\right)^{1/\lambda}$ for every pair $(i, j)$, and hence, by the factorization theorem, $p_{i,j} = p_{i\cdot}\, p_{\cdot j}$ for every pair $(i, j)$. □
Lemma 5.
If is Lebesgue measurable and integrable on and is independence-dependence preserving on , then is a member of (11).
Proof.
Suppose $X$ and $Y$ are independent, that is, $p_{i,j} = p_{i\cdot}\, p_{\cdot j}$ for all $(i, j)$. For $\phi$ to preserve independence, it must satisfy, for all $(i, j)$,

$\frac{\phi(p_{i\cdot}\, p_{\cdot j})}{\sum_{s,t} \phi(p_{s\cdot}\, p_{\cdot t})} = \left(\frac{\sum_{t} \phi(p_{i\cdot}\, p_{\cdot t})}{\sum_{s,t} \phi(p_{s\cdot}\, p_{\cdot t})}\right) \left(\frac{\sum_{s} \phi(p_{s\cdot}\, p_{\cdot j})}{\sum_{s,t} \phi(p_{s\cdot}\, p_{\cdot t})}\right),$ (17)

or to satisfy, for any two pairs of indices $(i, j)$ and $(k, l)$,

$\phi(p_{i\cdot}\, p_{\cdot j})\, \phi(p_{k\cdot}\, p_{\cdot l}) = \phi(p_{i\cdot}\, p_{\cdot l})\, \phi(p_{k\cdot}\, p_{\cdot j}).$
More specifically let , then Equation (17) is reduced to
Noting for all pairs of , it follows from (18) that
The right-hand side of (19) is independent of k, and so is the left-hand side of (19). It follows that
regardless if or . It is to be noted that (20) implies
for any , , , and , but subject to the constraints and .
Now, it is desired to show that, without the constraints and , (21) still holds. Toward that end, the proof is given in two steps, (1) and (2) .
Step 1: Suppose . Let
Before moving on to Step 2, let it be noted that the constraint is not used in the above proof of Step 1. That is to say that (24) holds under the condition regardless of whether or .
Step 2: Suppose . Let Noting , evaluating (24) with in place of gives
Noting , evaluating (24) with in place of q gives
Combining (25) and (26) again gives (24), but for all without any constraints.
Next, it is to be established that $\phi$ is continuous on $(0, 1]$. Since $\phi$ is measurable and integrable on $(0, 1]$ by assumption, by (27), for a fixed ,
It follows that
Lemmas 4 and 5 immediately give Theorem 2.
3. Other Results
An independence-dependence preserving deformation function, $\phi$, with respect to a subspace, $\mathcal{P}_0 \subset \mathcal{P}$, need not necessarily be of the power form in (11). Let $\mathcal{P}_K$ be the collection of all distributions of $(X, Y)$ on a $K \times K$ joint alphabet, where $K$ is a finite integer.
Proposition 1.
A Lebesgue measurable and integrable deformation function $\phi$ is independence-dependence preserving on a subspace $\mathcal{P}_0$ such that $\mathcal{P}_K \subseteq \mathcal{P}_0$ for some $K \geq 3$, if and only if $\phi$ is a member of the power functions in (11).
The proof of Proposition 1 is trivial because the proof of Lemma 5 directly applies in this case, noting that the existence of the constructed distributions in the proof of Lemma 5 only requires the joint alphabet to be $3 \times 3$ or larger.
It may be interesting to note that Proposition 1 is a stronger statement than Theorem 2 in the sense that Theorem 2 may be considered a corollary of Proposition 1. This is due to the fact that, for two sub-classes of distributions $\mathcal{P}_1 \subseteq \mathcal{P}_2$, a deformation function that is independence-dependence preserving with respect to $\mathcal{P}_2$ is independence-dependence preserving with respect to $\mathcal{P}_1$.
Example 3 below illustrates the fact that, on some restricted class of probability distributions, an independence-dependence preserving deformation function, $\phi$, need not be of the power form in (11).
Example 3.
Consider the collection of all distributions on a $2 \times 2$ joint alphabet, denoted $\mathcal{P}_2$. The function given in (29) is independence-dependence preserving on $\mathcal{P}_2$ but is not in the form of a power function. To see this, let it first be noted what being independence-dependence preserving entails on a $2 \times 2$ alphabet. Write the joint distribution as $\{p_{i,j};\ i = 1, 2,\ j = 1, 2\}$ and the two marginal distributions as $\{p_{1\cdot}, p_{2\cdot}\}$ and $\{p_{\cdot 1}, p_{\cdot 2}\}$. When the two underlying random elements $X$ and $Y$ on the $2 \times 2$ alphabet are independent, a qualified deformation function $\phi$ must satisfy, for $i = 1, 2$ and $j = 1, 2$,

$\frac{\phi(p_{i\cdot}\, p_{\cdot j})}{\sum_{u=1}^{2} \sum_{v=1}^{2} \phi(p_{u\cdot}\, p_{\cdot v})} = \left(\frac{\sum_{v=1}^{2} \phi(p_{i\cdot}\, p_{\cdot v})}{\sum_{u,v} \phi(p_{u\cdot}\, p_{\cdot v})}\right) \left(\frac{\sum_{u=1}^{2} \phi(p_{u\cdot}\, p_{\cdot j})}{\sum_{u,v} \phi(p_{u\cdot}\, p_{\cdot v})}\right),$

or, letting $s = p_{1\cdot}$ and $t = p_{\cdot 1}$ (and hence $p_{2\cdot} = 1 - s$ and $p_{\cdot 2} = 1 - t$), the condition above, after a few algebraic steps, is reduced to

$\phi(st)\, \phi((1 - s)(1 - t)) = \phi(s(1 - t))\, \phi((1 - s)t) \quad \text{for all } s, t \in (0, 1).$ (30)
It is easily verified that (29) satisfies (30).
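Condition (30) is easy to probe numerically. The sketch below evaluates both sides of (30) on a grid of $(s, t)$ values for an assumed power member of (11), $\phi(p) = p^{1.5}$, which satisfies (30) identically, and for $\phi(p) = p(1 - p)$, a function chosen here purely for contrast (it is not the function in (29)), which violates (30).

```python
import numpy as np

def max_gap_30(phi, n=199):
    """Largest absolute violation of condition (30) over a grid of (s, t) in (0, 1)^2."""
    s = np.linspace(0.005, 0.995, n)
    S, T = np.meshgrid(s, s)
    lhs = phi(S * T) * phi((1 - S) * (1 - T))
    rhs = phi(S * (1 - T)) * phi((1 - S) * T)
    return np.max(np.abs(lhs - rhs))

print(max_gap_30(lambda p: p ** 1.5))     # ~0: every power function satisfies (30)
print(max_gap_30(lambda p: p * (1 - p)))  # clearly positive: (30) fails
```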
The independence-dependence preserving property of a chosen $\phi$ may be important when the dependence between two random elements is of importance in a study. However, when only a one-to-one escort deformation mapping is desired, the deformation function need not be a member of (11). The following proposition provides a sufficient condition.
Proposition 2.
If $\phi$ is strictly increasing for $p \in (0, 1]$, then the $\phi$-induced mapping $\Phi: \mathcal{P} \rightarrow \mathcal{P}$ is injective.
Proof.
For every strictly increasing $\phi$ on $(0, 1]$ for which $\Phi(\mathbf{p})$ is well-defined for each and every $\mathbf{p} \in \mathcal{P}$, it suffices to show that, given any $\mathbf{q} = \Phi(\mathbf{p}) \in \mathcal{P}$, $\mathbf{p}$ is unique. Toward that end, suppose there are two distinct distributions, $\mathbf{p}_1 = \{p_{1,k};\ k \geq 1\}$ and $\mathbf{p}_2 = \{p_{2,k};\ k \geq 1\}$, such that $\Phi(\mathbf{p}_1) = \Phi(\mathbf{p}_2)$. It follows that there exists an index $k_0$ such that $p_{1,k_0} \neq p_{2,k_0}$. Without loss of generality, let it be supposed that

$p_{1,k_0} > p_{2,k_0}.$ (31)

Then, it follows further that

$\frac{\phi(p_{1,k_0})}{\sum_{i \geq 1} \phi(p_{1,i})} = \frac{\phi(p_{2,k_0})}{\sum_{i \geq 1} \phi(p_{2,i})}.$

By (31) and the condition that $\phi$ is strictly increasing on $(0, 1]$, it follows that $\phi(p_{1,k_0}) > \phi(p_{2,k_0})$, and hence

$\sum_{i \geq 1} \phi(p_{1,i}) > \sum_{i \geq 1} \phi(p_{2,i}).$

However, since $\phi(p_{1,k}) / \sum_{i \geq 1} \phi(p_{1,i}) = \phi(p_{2,k}) / \sum_{i \geq 1} \phi(p_{2,i})$ for every $k \geq 1$, it follows that $\phi(p_{1,k}) > \phi(p_{2,k})$ for every $k \geq 1$. Since $\phi$ is strictly increasing, it follows that $p_{1,k} > p_{2,k}$ for every $k \geq 1$, which contradicts the supposition that both $\mathbf{p}_1$ and $\mathbf{p}_2$ are probability distributions, that is, $\sum_{k \geq 1} p_{1,k} = \sum_{k \geq 1} p_{2,k} = 1$. The said contradiction implies that $\mathbf{p}_1 = \mathbf{p}_2$, and the proposition follows. □
Example 4.
Consider a special family of deformation functions, $\phi_q$, indexed by a parameter $q$. The escort distributions based on this family of deformation functions belong to one of several well-studied families of escort distributions, known as the Tsallis distributions. Every such $\phi_q$ is strictly increasing on $(0, 1]$ and, therefore, by Proposition 2, every member of the family induces an injective mapping. However, by Theorem 2, the induced mapping is not independence-dependence preserving on a countable joint alphabet $\mathcal{X} \times \mathcal{Y}$.
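Proposition 2 and the role of strict monotonicity may be illustrated numerically; a sketch with two assumed deformation functions: the strictly increasing $\phi(p) = p^2$ maps the two distinct distributions below to distinct escorts, whereas the constant $\phi(p) \equiv 1$, which is not strictly increasing, collapses both to the uniform distribution and therefore does not induce an injective mapping.

```python
import numpy as np

def escort(p, phi):
    """Escort distribution under a deformation function phi, as in (10)."""
    w = np.array([phi(x) for x in p], dtype=float)
    return w / w.sum()

p1 = [0.5, 0.3, 0.2]
p2 = [0.4, 0.35, 0.25]

inc = lambda x: x ** 2    # strictly increasing on (0, 1]: injective by Proposition 2
const = lambda x: 1.0     # not strictly increasing: injectivity fails

print(escort(p1, inc), escort(p2, inc))      # two distinct escort distributions
print(escort(p1, const), escort(p2, const))  # both equal (1/3, 1/3, 1/3)
```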
4. Concluding Remarks
Two main results are established in this article. First, there exist many entropies other than the BGS entropy satisfying the weaker axiomatic conditions, more specifically (6) rather than K4, yet retaining the key utility that the associated mutual information preserves independence-dependence on a countable joint alphabet as the BGS entropy does. Second, by way of escort transformations, the newly identified entropies are the only ones satisfying the weaker axioms on a general countable joint alphabet.
The significance of the established results comes into better focus in a broader perspective. Inspired by the development of modern data science, a shift is increasingly visible in the foundation of statistical inference, away from a real space, where random variables reside, toward a non-meterized and non-ordinal alphabet, where more general random elements reside. While statistical inferences based on random variables are theoretically well supported by the rich literature of probability and statistics, inferences on alphabets, mostly by way of various entropies and their estimation, are less systematically supported in theory. Without the familiar notions of neighborhood, real or complex moments, tails, etc., associated with random variables, probability and statistics based on random elements on alphabets need more attention to foster a clearer framework for the rigorous development of entropy-based statistical exercises, which may be more concisely termed entropic statistics. While a considerable volume of published work has accumulated over several decades on entropy estimation, it is fair to say that the current research activities in the existing literature are sporadic in nature and the implied theoretical framework is porous. The said porosity permeates across the board, from basic axioms to a general definition of entropy, and from model interpretability to statistical inference, although some recent effort is observed to alleviate it. See, for example, [13,14], where a general definition of entropy and several fundamental results are given.
The exploration of this article is on the axiomatic foundation of entropy. If the primary study interest lies with the independence-dependence between two sets of random elements on a joint countable alphabet, as is usually the case in practice, then the main results of this article suggest that there is flexibility in choosing an entropy from (8) to serve the interest. In addition, the uniqueness of (8) by way of escort distributions is of interest not only in its own right theoretically but also as it provides support in practice. For example, in artificial neural networks with multi-layer fractal structures, the links between nodes in two layers are often modeled by power escort distributions propagating through the entire network. In such a context, it is of fundamental importance to know that the power escort distributions are the only type that would preserve dependence and independence. If the dependence-independence preserving property is desired, then the power escort is appropriate. If not, then some other escort is needed.
Author Contributions
All authors contributed equally to this article. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Informed Consent Statement
Not applicable.
Data Availability Statement
No data were used in this research.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423; 623–656. [Google Scholar] [CrossRef]
- Khinchin, A.I. Mathematical Foundations of Information Theory; Dover Publications: New York, NY, USA, 1957. [Google Scholar]
- Amigó, J.M.; Balogh, S.G.; Hernández, S. A Brief Review of Generalized Entropies. Entropy 2018, 20, 813. [Google Scholar] [CrossRef] [PubMed]
- Ilić, V.M.; Korbel, J.; Gupta, S.; Scarfone, A.M. An overview of generalized entropic forms. Europhys. Lett. 2021, 133, 50005. [Google Scholar] [CrossRef]
- Zhang, Z. Statistical Implications of Turing’s Formula; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2017. [Google Scholar]
- Beck, C.; Schlögl, F. Thermodynamics of Chaotic Systems; Cambridge University Press: Cambridge, UK, 1993. [Google Scholar]
- Tsallis, C. Nonadditive Entropy and Nonextensive Statistical Mechanics. An Overview after 20 Years. Braz. J. Phys. 2009, 39, 337–357. [Google Scholar] [CrossRef]
- Amari, S. Information Geometry and Its Applications; Springer: Tokyo, Japan, 2016. [Google Scholar]
- Matsuzoe, H. A Sequence of Escort Distributions and Generalizations of Expectations on q-Exponential Family. Entropy 2017, 19, 7. [Google Scholar] [CrossRef]
- Ampilova, N.; Soloviev, I.; Sergeev, V. On using escort distributions in digital image analysis. J. Meas. Eng. 2012, 9, 58–70. [Google Scholar] [CrossRef]
- Zhang, Z. Generalized Mutual Information. Stats 2020, 3, 13. [Google Scholar] [CrossRef]
- Hewitt, E.; Stromberg, K. Real and Abstract Analysis: A Modern Treatment of the Theory of Functions of a Real Variable; Springer: Berlin/Heidelberg, Germany, 1965. [Google Scholar]
- Zhang, Z. Entropy-Based Statistics and Their Applications. Entropy 2023, 25, 936. [Google Scholar] [CrossRef] [PubMed]
- Zhang, Z. Several Basic Elements of Entropic Statistics. Entropy 2023, 25, 1060. [Google Scholar] [CrossRef]