1. Introduction and Summary
Let X be a random element in a countable alphabet $\mathcal{X} = \{\ell_k; k \ge 1\}$, where $\ell_k$, $k \ge 1$, are distinct letters or labels, with a probability distribution $\mathbf{p} = \{p_k; k \ge 1\} \in \mathcal{P}$, where $\mathcal{P}$ is the collection of all possible probability distributions on $\mathcal{X}$. Many random system properties of interest may be described by entropic quantities or entropies, that is, functions of $\mathbf{p}$ that are label-independent. For generality of discussion, letting $\mathbf{p}^{\downarrow}$ be a non-increasingly ordered $\mathbf{p}$, a function of $\mathbf{p}$ that satisfies $H(\mathbf{p}) = H(\mathbf{p}^{\downarrow})$ for all $\mathbf{p} \in \mathcal{P}$ is referred to as an entropy. An entropy may also be denoted as $H(X)$ for notational simplicity. For example, the Boltzmann–Gibbs–Shannon (BGS) entropy $H(\mathbf{p}) = -\sum_{k \ge 1} p_k \ln p_k$, the Rényi entropy $R_{\alpha}(\mathbf{p}) = \ln\big(\sum_{k \ge 1} p_k^{\alpha}\big)/(1-\alpha)$ for $\alpha > 0$ and $\alpha \neq 1$, where $\alpha$ is a constant, and the Tsallis entropy $S_{q}(\mathbf{p}) = \big(1 - \sum_{k \ge 1} p_k^{q}\big)/(q-1)$ for any $q \neq 1$ are of high utility across many fields of study, such as statistical mechanics and information theory.
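As a quick numerical illustration of these three entropies, the following sketch (the function names and the sample distribution are illustrative and not part of the original text) evaluates them on a small finite distribution; both the Rényi and the Tsallis entropies recover the BGS entropy as their parameters tend to 1.

```python
import numpy as np

def bgs_entropy(p):
    """Boltzmann-Gibbs-Shannon entropy: -sum p_k ln p_k (terms with p_k = 0 are dropped)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha (alpha > 0, alpha != 1)."""
    p = np.asarray(p, dtype=float)
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

def tsallis_entropy(p, q):
    """Tsallis entropy of order q (q != 1)."""
    p = np.asarray(p, dtype=float)
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

p = [0.5, 0.25, 0.125, 0.125]
print(bgs_entropy(p))             # about 1.2130
print(renyi_entropy(p, 1.0001))   # close to the BGS value as alpha -> 1
print(tsallis_entropy(p, 1.0001)) # close to the BGS value as q -> 1
```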
Ever since the information-theoretic utility of the BGS entropy was unlocked in [1], a large volume of research has been amassed in relation to it. A considerable amount of the research effort has been placed on framing entropy in general forms. Many articles have been published under the theme of generalized entropy, but from different perspectives. One particular perspective is the axiomatic system studied by Khinchin in [2].
Let $(X, Y)$ be a pair of random elements on a joint countable alphabet, $\mathcal{X} \times \mathcal{Y} = \{(x_i, y_j); i \ge 1, j \ge 1\}$, where $x_i$ for $i \ge 1$ and $y_j$ for $j \ge 1$ are distinct labels, with a joint probability distribution, $\mathbf{p}_{X,Y} = \{p_{i,j}; i \ge 1, j \ge 1\}$, and the two marginal distributions are $\mathbf{p}_{X} = \{p_{i,\cdot}; i \ge 1\}$ and $\mathbf{p}_{Y} = \{p_{\cdot,j}; j \ge 1\}$, where $p_{i,\cdot} = \sum_{j \ge 1} p_{i,j}$ and $p_{\cdot,j} = \sum_{i \ge 1} p_{i,j}$ for X and Y, respectively. Let $H(\cdot)$ be an entropy. The four axioms of Khinchin are given below.
- K1 (Continuity): $H(\mathbf{p})$ is continuous with respect to all elements of $\mathbf{p}$.
- K2 (Maximality): Given a finite number of letters $K$, $H(\mathbf{p})$ is maximized in $\mathbf{p}$ at $p_k = 1/K$ for $k = 1, \dots, K$.
- K3 (Expansibility): Letting $\mathbf{p} = \{p_1, \dots, p_K\}$, $H(p_1, \dots, p_K, 0) = H(p_1, \dots, p_K)$.
- K4 (Separability): For any pair of random elements $(X, Y)$ on $\mathcal{X} \times \mathcal{Y}$ with a joint probability distribution $\mathbf{p}_{X,Y}$, $H(X, Y) = H(X) + H(Y|X)$, where $H(Y|X) = \sum_{i \ge 1} p_{i,\cdot} H(Y|X = x_i)$ and $H(Y|X = x_i)$ is the entropy of the conditional distribution of Y given $X = x_i$. K4 is sometimes also known as Strong Additivity.
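As a quick numerical check of K4 for the BGS entropy, the sketch below (the helper name and the example table are illustrative) verifies that the joint BGS entropy of a two-way probability table equals the marginal entropy of X plus the conditional entropy $H(Y|X)$, computed as the $p_{i,\cdot}$-weighted average of the entropies of the conditional rows.

```python
import numpy as np

def bgs(p):
    """BGS entropy -sum p ln p over the positive entries of p."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# An arbitrary 2 x 3 joint probability table for (X, Y).
joint = np.array([[0.30, 0.10, 0.10],
                  [0.05, 0.25, 0.20]])
px = joint.sum(axis=1)   # marginal distribution of X

# H(Y|X): the p_{i,.}-weighted average of the entropies of the conditional rows p(y | x_i).
h_y_given_x = sum(px[i] * bgs(joint[i] / px[i]) for i in range(len(px)))

# Strong additivity (K4) for the BGS entropy: H(X, Y) = H(X) + H(Y|X).
print(np.isclose(bgs(joint), bgs(px) + h_y_given_x))   # True
```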
Khinchin famously proved the following fact in [2].
Fact 1 (The uniqueness theorem of entropy). For any $\mathbf{p} \in \mathcal{P}$, if an entropy $H(\mathbf{p})$, such that $H(\mathbf{p}) < \infty$, satisfies all four axioms, K1–K4, then $H$ must be uniquely of the form of the BGS entropy, $-\sum_{k \ge 1} p_k \ln p_k$, up to a multiplicative constant.
Let
$MI(X, Y) = H(X) + H(Y) - H(X, Y)$,
which is often referred to as Shannon's mutual information. The following fact is due to Shannon.
Fact 2. Let $\mathbf{p}_{X,Y}$ be a joint probability distribution of $(X, Y)$ satisfying $H(X) < \infty$ and $H(Y) < \infty$. Then, X and Y are independent if and only if $MI(X, Y) = 0$.
The fact that the independence between X and Y may be characterized by a single-valued index under a general joint distribution on $\mathcal{X} \times \mathcal{Y}$ puts $MI(X, Y)$ in a very important place in information theory. Furthermore, the uniqueness theorem of entropy adds a special aura to the BGS entropy.
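The characterization in Fact 2 is easy to check numerically. The sketch below (the function names and the two example joint tables are illustrative, not from the original text) computes $MI(X, Y)$ from a joint probability table: it returns numerically zero for a product distribution and a strictly positive value for a non-product one.

```python
import numpy as np

def bgs(p):
    """BGS entropy -sum p ln p over the positive entries of p."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mutual_information(joint):
    """MI(X, Y) = H(X) + H(Y) - H(X, Y) for a joint probability table."""
    joint = np.asarray(joint, dtype=float)
    return bgs(joint.sum(axis=1)) + bgs(joint.sum(axis=0)) - bgs(joint)

px = np.array([0.6, 0.4])
py = np.array([0.7, 0.2, 0.1])
independent = np.outer(px, py)             # p_{i,j} = p_{i,.} p_{.,j}
dependent = np.array([[0.5, 0.1, 0.0],
                      [0.2, 0.1, 0.1]])    # not of product form

print(mutual_information(independent))     # 0 up to floating-point error
print(mutual_information(dependent))       # strictly positive
```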
However, the BGS entropy, which satisfies $H(X, Y) = H(X) + H(Y)$ under independence of X and Y, is considered by many physicists to be overly rigid. In search of more general forms of entropy, Khinchin's axiom K4 has been weakened in various ways, and a large number of research articles have been published under the weaker conditions. Many of these articles may be found in two excellent review articles; see [3,4].
The fourth axiom, K4, may be weakened to different degrees across a spectrum. At one end of the spectrum, K4 is ignored and generalized entropy forms are sought only under K1–K3. Other versions of the weakened K4 are mostly given in more general forms of (3), under independence of X and Y. For example, an entropy $H$ may be required to satisfy
$H(X, Y) = G(H(X), H(Y))$  (4)
for some symmetric function of two variables, $G(\cdot, \cdot)$, of which a special case is
$H(X, Y) = H(X) + H(Y) + \lambda H(X) H(Y)$,  (5)
where $\lambda$ is a real number. The condition in (5) is a centerpiece of non-extensive statistical mechanics. The Tsallis entropy satisfies (4) in general and (5) in particular. It is to be particularly noted that the conditions in (4) and (5) are necessary conditions for X and Y being independent, but are not sufficient conditions.
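Presumably, (5) specializes for the Tsallis entropy to the familiar non-extensive composition rule $S_q(X, Y) = S_q(X) + S_q(Y) + (1-q)\,S_q(X)\,S_q(Y)$ under independence, that is, (5) with $\lambda = 1-q$; the sketch below (the index $q = 1.7$ and the example marginals are illustrative) verifies this identity numerically for a product joint distribution.

```python
import numpy as np

def tsallis_entropy(p, q):
    """Tsallis entropy (1 - sum p^q) / (q - 1) for q != 1."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

q = 1.7
px = np.array([0.6, 0.3, 0.1])
py = np.array([0.5, 0.5])
joint = np.outer(px, py)     # X and Y independent

lhs = tsallis_entropy(joint, q)
rhs = (tsallis_entropy(px, q) + tsallis_entropy(py, q)
       + (1.0 - q) * tsallis_entropy(px, q) * tsallis_entropy(py, q))
print(np.isclose(lhs, rhs))  # True: pseudo-additivity under independence
```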
Non-extensive physics aside, the rigidity of K4 has remarkable utility, namely Fact 2: a single-valued index characterizes the stochastic association between two random elements on a non-meterized joint alphabet under general distributions (also see standardized mutual information in Chapter 5 of [5]).
It may be interesting to ask whether there exist other entropies, $H$, such that, in addition to satisfying K1–K3, $H$ also satisfies
X and Y are independent if and only if $H(X, Y) = H(X) + H(Y)$.  (6)
Let it be noted that the additivity condition
$H(X, Y) = H(X) + H(Y)$ under independence of X and Y  (7)
is weaker than K4 in the sense that K4 implies (7). In the same sense, (6) is a stronger condition than (7), since (7) is a necessary condition of X and Y being independent, but not a sufficient condition. The condition in (7) is a special case of (4).
Example 1. The Rényi entropy, $R_{\alpha}(\mathbf{p}) = \ln\big(\sum_{k \ge 1} p_k^{\alpha}\big)/(1-\alpha)$, satisfies (7) but not (6).
Example 2. The no-name entropy, for any , satisfies but not . By the way, it may be interesting to note that .
As it turns out, there are many entropies satisfying (6). Consider the family of entropies in (8), indexed by a parameter, and its implied mutual information. Obviously, the family of entropies in (8) also satisfies K1–K3, as does the BGS entropy, which, however, is only finitely defined for some of the distributions in $\mathcal{P}$. A significant advantage of (8) is that every member is finitely defined for each and every $\mathbf{p} \in \mathcal{P}$. The first main result established in this article is the fact that X and Y are independent if and only if the mutual information implied by (8) is zero, for any fixed value of the index parameter, which immediately implies that there are many entropies satisfying (6).
A second question of interest is whether there exist other forms of entropy satisfying (6) beyond those in (8). The answer to this question is unknown in general. However, under certain restrictions on the functional form of the entropy, the uniqueness of (8) may be established. This, in fact, is the second main result of this article.
In Section 2, the statements of the two main results are made more precise and are established in two separate subsections. Several related minor results are summarized in Section 3, and the article ends with concluding remarks in Section 4.
3. Other Results
An independence-dependence preserving deformation function, $\phi$, with respect to a subspace of $\mathcal{P}$, need not necessarily be of the power form in (11). Let $\mathcal{P}_{K}$ be the collection of all distributions of $(X, Y)$ on a finite joint alphabet, where the number of letters in each margin, $K$, is a finite integer.
Proposition 1. A Lebesgue measurable and integrable deformation function $\phi$ is independence-dependence preserving on a subspace such that it contains $\mathcal{P}_{K}$, for a sufficiently large finite $K$, if and only if $\phi$ is a member of the power functions in (11). The proof of Proposition 1 is trivial because the proof of Lemma 5 directly applies in this case, noting that the existence of the constructed distributions, in the proof of Lemma 5, only requires the joint alphabet to be of a certain finite size or larger.
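A small numerical illustration may help fix ideas (the deformation functions, the example table, and the helper names below are illustrative, not from the original text): applying the escort transformation $p_{i,j} \mapsto \phi(p_{i,j})/\sum_{s,t}\phi(p_{s,t})$ to a product joint distribution, a power function such as $\phi(t) = t^{1/2}$ returns another product distribution, whereas a non-power choice such as $\phi(t) = t + t^{2}$ generally does not.

```python
import numpy as np

def escort(joint, phi):
    """Apply the deformation phi entrywise to a joint table and renormalize."""
    q = phi(np.asarray(joint, dtype=float))
    return q / q.sum()

def is_product(joint, tol=1e-12):
    """Check whether a two-way probability table factorizes into its marginals."""
    joint = np.asarray(joint, dtype=float)
    outer = np.outer(joint.sum(axis=1), joint.sum(axis=0))
    return np.allclose(joint, outer, atol=tol)

px, py = np.array([0.7, 0.3]), np.array([0.9, 0.1])
joint = np.outer(px, py)                      # X and Y independent

power = escort(joint, lambda t: t ** 0.5)     # a power deformation
other = escort(joint, lambda t: t + t ** 2)   # a non-power deformation

print(is_product(joint))   # True
print(is_product(power))   # True: the power escort keeps independence
print(is_product(other))   # False: this non-power escort introduces dependence
```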
It may be interesting to note that Proposition 1 is a stronger statement than Theorem 2, in the sense that Theorem 2 may be considered a corollary of Proposition 1. This is due to the fact that, for two sub-classes of distributions, $\mathcal{P}_1 \subseteq \mathcal{P}_2$, a deformation function $\phi$ that is independence-dependence preserving with respect to $\mathcal{P}_2$ is also independence-dependence preserving with respect to $\mathcal{P}_1$.
Example 3 below illustrates the fact that, on some restricted class of probability distributions, an independence-dependence preserving deformation function, $\phi$, need not be of the power form of (11).
Example 3. Consider the collection of all distributions on a $2 \times 2$ joint alphabet, denoted $\mathcal{P}_{2 \times 2}$. The function in (29) is independence-dependence preserving on $\mathcal{P}_{2 \times 2}$ but is not in the form of a power function. To see this, let it first be noted what being independence-dependence preserving entails on a $2 \times 2$ alphabet. Write the joint distribution as $\{p_{i,j}; i = 1, 2, j = 1, 2\}$ and the two marginal distributions as $\{p_{1,\cdot}, p_{2,\cdot}\}$ and $\{p_{\cdot,1}, p_{\cdot,2}\}$. When the two underlying random elements X and Y on the $2 \times 2$ alphabet are independent, that is, $p_{i,j} = p_{i,\cdot}\, p_{\cdot,j}$ for all $i$ and $j$, a qualified deformation function must satisfy
$\phi(p_{1,\cdot}\, p_{\cdot,1})\, \phi(p_{2,\cdot}\, p_{\cdot,2}) = \phi(p_{1,\cdot}\, p_{\cdot,2})\, \phi(p_{2,\cdot}\, p_{\cdot,1})$,
which, after a change of variables and a few algebraic steps, is reduced to the functional equation in (30). It is easily verified that (29) satisfies (30).
The independence-dependence preserving property of a chosen $\phi$ may be important when dependence between two random elements is of importance in a study. However, when only a one-to-one escort deformation mapping $\Phi$ is desired, the deformation function $\phi$ need not be a member of (11). The following proposition provides a sufficient condition.
Proposition 2. If $\phi(t)$ is strictly increasing for $t \in (0, 1]$, then the $\phi$-induced mapping $\Phi$,
$\Phi(\mathbf{p}) = \big\{ \phi(p_k) \big/ \textstyle\sum_{j \ge 1} \phi(p_j);\ k \ge 1 \big\}$,  (31)
is injective.
Proof. For every strictly increasing $\phi$ on $(0, 1]$ for which $\Phi(\mathbf{p})$ is well-defined for each and every $\mathbf{p} \in \mathcal{P}$, it suffices to show that, given any $\mathbf{q} = \Phi(\mathbf{p})$, the pre-image $\mathbf{p}$ is unique. Toward that end, suppose there are two distinct distributions, $\mathbf{p} = \{p_k; k \ge 1\}$ and $\mathbf{p}' = \{p'_k; k \ge 1\}$, such that $\Phi(\mathbf{p}) = \Phi(\mathbf{p}')$. It follows that there exists an index $k_0$ such that $p_{k_0} \neq p'_{k_0}$. Without loss of generality, let it be supposed that
$p_{k_0} < p'_{k_0}$.
Then, it follows further that
$\phi(p_{k_0}) < \phi(p'_{k_0})$.
By (31) and the condition that $\phi$ is strictly increasing on $(0, 1]$, it follows that $\phi(p_{k_0})/\sum_{j \ge 1} \phi(p_j) = \phi(p'_{k_0})/\sum_{j \ge 1} \phi(p'_j)$, and hence
$\sum_{j \ge 1} \phi(p_j) < \sum_{j \ge 1} \phi(p'_j)$.
However, since $\phi(p_k)/\sum_{j \ge 1} \phi(p_j) = \phi(p'_k)/\sum_{j \ge 1} \phi(p'_j)$ for every $k \ge 1$, it follows that $\phi(p_k) < \phi(p'_k)$ for every $k \ge 1$. Since $\phi$ is strictly increasing, it follows that $p_k < p'_k$ for every $k \ge 1$, which contradicts the supposition that both $\mathbf{p}$ and $\mathbf{p}'$ are probability distributions, that is, $\sum_{k \ge 1} p_k = \sum_{k \ge 1} p'_k = 1$. The said contradiction implies that $\mathbf{p} = \mathbf{p}'$, and the proposition follows. □
Example 4. Consider a special family of deformation functions, $\phi(t) = t^{q}$ for $t \in (0, 1]$, where $q > 0$ is a parameter. The escort distribution based on this deformation function is one of several well-studied families of escort distributions, known as the Tsallis distributions. Every such $\phi$ is strictly increasing on $(0, 1]$ and, therefore, by Proposition 2, every member of the family induces an injective mapping. However, by Theorem 2, the induced mapping is not independence-dependence preserving on a countable joint alphabet $\mathcal{X} \times \mathcal{Y}$.
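For concreteness, a small sketch of the escort construction in Proposition 2 is given below for the power (Tsallis) case; the function names, the choice $q = 0.5$, and the example distribution are illustrative only. Because $\phi(t) = t^{q}$ is strictly increasing, the original distribution can be recovered from its escort by applying the inverse power and renormalizing, which illustrates the injectivity asserted by Proposition 2.

```python
import numpy as np

def escort(p, phi):
    """Phi(p): apply the deformation phi to each probability and renormalize."""
    w = phi(np.asarray(p, dtype=float))
    return w / w.sum()

q = 0.5                                   # illustrative Tsallis-type index
p = np.array([0.5, 0.3, 0.15, 0.05])

tsallis_escort = escort(p, lambda t: t ** q)
recovered = escort(tsallis_escort, lambda t: t ** (1.0 / q))  # invert the power, renormalize

print(tsallis_escort)
print(np.allclose(recovered, p))          # True: the mapping is invertible here, consistent with injectivity
```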
4. Concluding Remarks
Two main results are established in this article. First, there exist many entropies other than the BGS entropy satisfying the weaker axiomatic conditions, more specifically K1–K3 together with (6) rather than K1–K4, yet retaining the key utility that the associated mutual information preserves independence-dependence on a countable joint alphabet as the BGS entropy does. Second, by way of escort transformations, the newly identified entropies are the only ones satisfying the weaker axioms on a general countable joint alphabet.
The significance of the established results comes into better focus in a broader perspective. Inspired by developments in modern data science, a shift is increasingly visible in the foundation of statistical inference, away from a real space, where random variables reside, toward a non-meterized and non-ordinal alphabet, where more general random elements reside. While statistical inferences based on random variables are theoretically well supported in the rich literature of probability and statistics, inferences on alphabets, mostly by way of various entropies and their estimation, are less systematically supported in theory. Without the familiar notions of neighborhood, real or complex moments, tails, etc., associated with random variables, probability and statistics based on random elements on alphabets need more attention to foster a clearer framework for the rigorous development of entropy-based statistical exercises, which may be more concisely termed entropic statistics. While a considerable volume of published work has accumulated over several decades on entropy estimation, it is fair to say that the current research activities in the existing literature are sporadic in nature and the implied theoretical framework is porous. The said porosity permeates across the board, from basic axioms to a general definition of entropy, and from model interpretability to statistical inference, although some recent effort to alleviate it is observed. See, for example, [13,14], where a general definition of entropy and several fundamental results are given.
The exploration of this article is on the axiomatic foundation of entropy. If the primary study interest lies with the independence-dependence between two sets of random elements on a joint countable alphabet, as is usually the case in practice, then the main results of this article suggest that there is flexibility in choosing an entropy from (8) to serve that interest. In addition, the uniqueness of (8) by way of escort distributions is of interest not only in its own right theoretically but also for the support it provides in practice. For example, in artificial neural networks with multi-layer fractal structures, the links between nodes in two layers are often modeled by power escort distributions propagating through the entire network. In such a context, it is of fundamental importance to know that the power escort distributions are the only type that preserves dependence and independence. If the dependence-independence preserving property is desired, then the power escort is appropriate; if not, then some other escort is needed.