Article

Several Basic Elements of Entropic Statistics

Zhiyi Zhang
Department of Mathematics and Statistics, UNC Charlotte, Charlotte, NC 28223, USA
Entropy 2023, 25(7), 1060; https://doi.org/10.3390/e25071060
Submission received: 14 June 2023 / Revised: 11 July 2023 / Accepted: 12 July 2023 / Published: 13 July 2023
(This article belongs to the Special Issue Entropy-Based Statistics and Their Applications)

Abstract
Inspired by the development in modern data science, a shift is increasingly visible in the foundation of statistical inference, away from a real space, where random variables reside, toward a nonmetrized and nonordinal alphabet, where more general random elements reside. While statistical inferences based on random variables are theoretically well supported in the rich literature of probability and statistics, inferences on alphabets, mostly by way of various entropies and their estimation, are less systematically supported in theory. Without the familiar notions of neighborhood, real or complex moments, tails, et cetera, associated with random variables, probability and statistics based on random elements on alphabets need more attention to foster a sound framework for rigorous development of entropy-based statistical exercises. In this article, several basic elements of entropic statistics are introduced and discussed, including notions of general entropies, entropic sample spaces, entropic distributions, entropic statistics, entropic multinomial distributions, entropic moments, and entropic basis, among other entropic objects. In particular, an entropic-moment-generating function is defined and it is shown to uniquely characterize the underlying distribution in entropic perspective, and, hence, all entropies. An entropic version of the Glivenko–Cantelli convergence theorem is also established.

1. Introduction and Summary

Let $\mathscr{X} = \{\ell_k; k \ge 1\}$ be a countable alphabet and let $\mathbf{p} = \{p_k; k \ge 1\}$ be a probability distribution on $\mathscr{X}$. Let $\mathcal{P}$ be the collection of all probability distributions on $\mathscr{X}$. Let $\mathbf{p}^{\downarrow} = \{p_{(k)}; k \ge 1\}$ be the nonincreasingly rearranged $\mathbf{p}$, that is, $p_{(k)} \ge p_{(k+1)}$ for every $k \ge 1$. Let $\mathcal{P}^{\downarrow}$ be the collection of all possible $\mathbf{p}^{\downarrow}$. It follows that $\mathcal{P}^{\downarrow} \subset \mathcal{P}$ is an aggregated version of $\mathcal{P}$ in the sense that $\mathcal{P}$ is partitioned and represented by $\mathbf{p}^{\downarrow} \in \mathcal{P}^{\downarrow}$.
Across a wide spectrum of scientific investigation, a random system is often described as a probability distribution on a countable alphabet, $\{\mathscr{X}, \mathbf{p}\}$; however, many complex system properties of interest, such as those studied in information theory and statistical mechanics, are often described by functions of $\mathbf{p}$, for example, the Shannon entropy
$$H = -\sum_{k \ge 1} p_k \ln p_k$$
as in [1], the members of the Rényi entropy family
$$R_{\alpha} = \ln\Big(\sum_{k \ge 1} p_k^{\alpha}\Big)\Big/(1-\alpha)$$
where $\alpha \in (0,1) \cup (1,\infty)$, as in [2], and the members of the Tsallis entropy family
$$T_{\alpha} = \Big(1 - \sum_{k \ge 1} p_k^{\alpha}\Big)\Big/(\alpha - 1)$$
where $\alpha \in (-\infty,1) \cup (1,\infty)$, as in [3]. Other similar functions come under the names of diversity indices, for example, the Gini–Simpson index
$$\zeta = 1 - \sum_{k \ge 1} p_k^2$$
as in [4], the generalized Simpson's indices
$$\zeta_{u,v} = \sum_{k \ge 1} p_k^u (1 - p_k)^v$$
where $u \ge 1$ and $v \ge 0$ are integers, as described in [5], Hill's diversity numbers
$$H_{\alpha} = \Big(\sum_{k \ge 1} p_k^{\alpha}\Big)^{1/(1-\alpha)}$$
where $\alpha \in (0,1) \cup (1,\infty)$, as in [6], Emlen's index
$$D = \sum_{k \ge 1} p_k e^{-p_k},$$
as in [7], and the richness index
$$K = \sum_{k \ge 1} 1[p_k > 0],$$
where $1[\cdot]$ is the indicator function. While all the abovementioned functions each have their unique significance in their respective fields of study, they share one characteristic in common: they are all functions of $\mathbf{p}$.
The word entropy has ancient Greek roots, en and tropē, meaning "inward" and "change", respectively, or collectively "internal change". As such, it is a label-independent concept. For generality and conciseness of the presentation in this article, let the following definition be adopted.
Definition 1.
Let $f(\mathbf{p})$ be a function defined for every $\mathbf{p} \in \mathcal{P}$. The function $f(\mathbf{p})$ is referred to as an entropy if $f(\mathbf{p})$ depends on $\mathbf{p}$ only through $\mathbf{p}^{\downarrow}$, that is, $f(\mathbf{p}) = f(\mathbf{p}^{\downarrow})$.
By Definition 1, all entropies and diversity indices mentioned above are indeed entropies. In addition, $p_{(1)}$, or more generally $p_{(k)}$ for any positive integer $k$, is an entropy, and therefore $\mathbf{p}^{\downarrow}$ is an array of entropies. One important property to be noted about entropies is that $\mathbf{p}^{\downarrow}$ is independent of the labels of the alphabet, $\{\ell_k; k \ge 1\}$. Another fact to be noted is that all entropies are uniquely determined by $\mathbf{p}^{\downarrow}$. For clarity of terminology throughout this article, let it be noted that any properties of the underlying random system that are described by one or more entropies are referred to as entropic properties. Furthermore, $\mathbf{p}$ is referred to as the underlying probability distribution, or simply the distribution, of a random system, and $\mathbf{p}^{\downarrow}$ is referred to as the entropic distribution associated with $\mathbf{p}$. It is also to be noted that $\mathbf{p}^{\downarrow} = \{p_{(k)}; k \ge 1\}$ is not a probability distribution in the usual sense since it is not associated with any specific probability experiment. It is merely an array of nonincreasingly ordered positive parameters that sum to one.
Let $\{X_1, \dots, X_n\}$, drawn from $\mathscr{X}$ according to $\mathbf{p}$, be a random sample of size $n$. The sample may be summarized into $\mathbf{Y} = \{Y_k; k \ge 1\}$, where $Y_k$ is the observed frequency of letter $\ell_k$, or into $\hat{\mathbf{p}} = \{\hat{p}_k = Y_k/n; k \ge 1\}$. Let $\mathbf{Y}^{\downarrow} = \{Y_{(k)}; k \ge 1\}$ and $\hat{\mathbf{p}}^{\downarrow} = \{\hat{p}_{(k)} = Y_{(k)}/n; k \ge 1\}$ be the nonincreasingly rearranged $\mathbf{Y}$ and $\hat{\mathbf{p}}$, respectively, where $Y_{(k)} \ge Y_{(k+1)}$ and $\hat{p}_{(k)} \ge \hat{p}_{(k+1)}$ for every $k$. Under the assumption that the study interest of the underlying random system lies only with the properties described by indices of the form $f(\mathbf{p}^{\downarrow})$, that is, entropies by Definition 1, there are two conceptual perspectives on the associated statistical inference. The first is a framework of estimating $f(\mathbf{p})$ based on $\hat{\mathbf{p}}$, and the second is one of estimating $f(\mathbf{p}) = f(\mathbf{p}^{\downarrow})$ based on $\hat{\mathbf{p}}^{\downarrow}$. For lack of better terms, let the first framework be referred to as the classical statistics and the second framework as the entropic statistics. These two frameworks are not equivalent and, in particular, the entropic framework has its own special and useful implications.
The literature on statistical estimation of entropies, mostly in the specific form of the Shannon entropy, begins with the early works in [8,9,10] and expands in width and depth in works by, for example, [11,12,13]. Many other worthy references on entropy estimation may be found in the literature review in [14]. The general entropies of Definition 1, however, allow a discussion of the foundational elements of statistics in the entropic perspective, or entropic statistics, in a broader sense. This article focuses on three basic issues.
First, a notion of an entropic sample space is introduced in Section 2 below. An entropic sample space is an aggregated sample space that registers not a single data point but an ensemble of data points. It is a sample space for the entropic statistics, $\mathbf{Y}^{\downarrow}$ or $\hat{\mathbf{p}}^{\downarrow}$, and hence is label-independent. The said label-independence in turn allows an entropic sample space to accommodate statistical sampling into a population that is not necessarily prescribed, that is, the labels of the alphabet $\mathscr{X}$ need not be completely specified a priori. This property of an entropic sample space gives new meaning to statistical learning and lends foundational support for statistical exploration into an unknown, or partially known, universe.
Second, an entropic characteristic function, $\phi(t) = \sum_{k \ge 1} p_{(k)}^t$ for $t \ge 1$, is introduced. It is obvious that $\phi(t)$ is an entropy by Definition 1 and that it always exists. It is established in Section 3 that $\phi(t)$ in an arbitrarily small neighborhood of any interior point of $[1, \infty)$ uniquely determines $\mathbf{p}^{\downarrow} \in \mathcal{P}^{\downarrow}$ and vice versa. Therefore, it is immediately implied that any and all entropic properties of a random system, including statistical inferences, may be approached by way of $\phi(t)$.
Third, it is established in Section 4 that the entropic statistics converge almost surely and uniformly to the underlying entropic distribution, that is, $\hat{\mathbf{p}}^{\downarrow} \xrightarrow{a.s.} \mathbf{p}^{\downarrow}$ uniformly, for any $\mathbf{p} \in \mathcal{P}$. In light of the entropic sample space and an entropic characterization of the associated entropic sampling distribution, this Glivenko–Cantelli-like convergence theorem provides fundamental theoretical support for exercises in entropic statistics.
The article ends with an appendix where a lengthy proof is found.

2. Things Entropic

2.1. Sample Spaces in Different Resolutions

Consider the experiment of randomly drawing a marble from urn 1, which contains marbles of $K = 3$ known colors, red, white, and blue. In anticipating the outcome of the experiment, one may introduce an index $k$, $k = 1, 2, 3$, to label the possible outcomes by $\ell_1 = \text{red}$, $\ell_2 = \text{white}$, and $\ell_3 = \text{blue}$, and denote the corresponding proportions by $p_1$, $p_2$, and $p_3$. In this case, the sample space is
$$\Omega_1 = \{\ell_1, \ell_2, \ell_3\},$$
the event space is $\mathcal{B} = \{\emptyset, \{\ell_1\}, \{\ell_2\}, \{\ell_3\}, \{\ell_1,\ell_2\}, \{\ell_1,\ell_3\}, \{\ell_2,\ell_3\}, \{\ell_1,\ell_2,\ell_3\}\}$, and the point mass probability measure $\mu(\cdot)$ assigns $p_1$ to $\ell_1$, $p_2$ to $\ell_2$, and $p_3$ to $\ell_3$. Let $X$ denote the random outcome of the experiment. The following model of probability distribution,
$$\begin{array}{c|ccc} X & \ell_1 & \ell_2 & \ell_3 \\ \hline P(x) & p_1 & p_2 & p_3 \end{array}$$
or in a different form $\mathbf{p} = \{p_1, p_2, p_3\}$ on $\mathscr{X} = \Omega_1 = \{\ell_1, \ell_2, \ell_3\}$, is well defined with three parameters, $p_1$, $p_2$, and $p_3$, subject to the constraints $0 \le p_k \le 1$ for each $k$ and $\sum_{k=1}^{3} p_k = 1$. The result of drawing $n = 1$ marble from the urn may also be represented by a triplet of random variables $\mathbf{Y} = \{1[X = \ell_1], 1[X = \ell_2], 1[X = \ell_3]\}$. If $\mathbf{Y}$ is used to represent the outcome of the experiment, the sample space may be denoted as $\Omega_1 = \{\{1,0,0\}, \{0,1,0\}, \{0,0,1\}\}$ with corresponding probability distribution $P(\mathbf{Y} = \{1,0,0\}) = p_1$, $P(\mathbf{Y} = \{0,1,0\}) = p_2$, and $P(\mathbf{Y} = \{0,0,1\}) = p_3$. For clarity in terminology, $X$ is referred to as a random element while $\mathbf{Y}$ is a set of random variables. In general, random results of an experiment that are represented by numerical values are referred to as random variables, and those represented by non-numerical symbols are random elements.
For a given experiment, the sample space may be chosen at different levels of resolution depending on the experimenter’s interest in the study. Suppose the experimenter is to randomly draw n = 3 marbles from urn 1 with replacement in sequence, resulting in X = { X 1 , X 2 , X 3 } where X i , i = 1 , 2 , 3 , is the color of the ith marble drawn in the sequence. The sample space associated with X may be represented by
Ω s = { 1 , 1 , 1 } , { 2 , 2 , 2 } , { 3 , 3 , 3 } , { 1 , 1 , 2 } , { 2 , 1 , 1 } , { 1 , 2 , 1 } , { 1 , 1 , 3 } , { 3 , 1 , 1 } , { 1 , 3 , 1 } , { 2 , 2 , 1 } , { 1 , 2 , 2 } , { 2 , 1 , 2 } , { 2 , 2 , 3 } , { 3 , 2 , 2 } , { 2 , 3 , 2 } , { 3 , 3 , 1 } , { 1 , 3 , 3 } , { 3 , 1 , 3 } , { 3 , 3 , 2 } , { 2 , 3 , 3 } , { 3 , 2 , 3 } , { 1 , 2 , 3 } , { 1 , 3 , 2 } , { 2 , 1 , 3 } , { 2 , 3 , 1 } , { 3 , 1 , 2 } , { 3 , 2 , 1 } ,
where the subscript "s" stands for sequential. There are 27 distinct elements in (3). In this case, the sample space may also be expressed as $\Omega_s = \{1, 2, 3\}^3$. This sample space may be adopted if the order of the $n = 3$ observations is observable and is of interest.
Suppose in the above experiment the order of the observations is not observable or not of interest. Then the relevant information in $X = \{X_1, X_2, X_3\}$ may be represented in the form of $\mathbf{Y} = \{Y_1, Y_2, Y_3\}$, where $Y_k$, $k = 1, 2, 3$, is the number of times the $k$th color is observed in the sample. The sample space associated with $\mathbf{Y}$ is
Ω m = { 3 , 0 , 0 } , { 0 , 3 , 0 } , { 0 , 0 , 3 } , { 2 , 1 , 0 } , { 2 , 0 , 1 } , { 0 , 2 , 1 } , { 1 , 2 , 0 } , { 0 , 1 , 2 } , { 1 , 0 , 2 } , { 1 , 1 , 1 } ,
where the subscript “m” stands for multinomial. There are 10 distinct elements in (4). In fact, Y = { Y 1 , Y 2 , Y 3 } is the usual multinomial random vector with K = 3 categories and category probabilities p 1 , p 2 , and p 3 .
The two sample spaces, Ω s and Ω m , serve different statistical interests in various situations. Ω s is well defined if X = { X 1 , X 2 , X 3 } is observable. Ω m is well defined if X = { X 1 , X 2 , X 3 } is observable or only Y = { Y 1 , Y 2 , Y 3 } is observable. Noting that X = { X 1 , X 2 , X 3 } implies Y = { Y 1 , Y 2 , Y 3 } , a lower-resolution sample space may always be adopted if a higher-resolution sample space may, but not vice versa. For example, if the order of the draws is not observable, then only Ω m is appropriate since Y = { Y 1 , Y 2 , Y 3 } is not linked uniquely to the elements of Ω s .
Ω m is an aggregated form of Ω s and is hence of lower resolution; however, Ω m may be further reduced in resolution. Let
$$\mathbf{Y}^{\downarrow} = \{Y_{(1)}, Y_{(2)}, Y_{(3)}\},$$
where $Y_{(1)}, Y_{(2)}, Y_{(3)}$ are the nonincreasingly ordered observed frequencies of the three colors. The sample space associated with $\mathbf{Y}^{\downarrow}$ is
$$\Omega_e = \big\{\{3,0,0\}, \{2,1,0\}, \{1,1,1\}\big\},$$
where the subscript "e" stands for entropic. $\Omega_e$ is yet an aggregated form of $\Omega_m$ and hence of lower resolution still than that of $\Omega_m$. Noting that $\Omega_e$ is label-independent, it is an example of an entropic sample space.
It is easily verified that the probability distribution of $\mathbf{Y}^{\downarrow}$ may be expressed in terms of $\mathbf{p}^{\downarrow}$ as follows.
$$\begin{array}{c|ccc} \mathbf{Y}^{\downarrow} & \{3,0,0\} & \{2,1,0\} & \{1,1,1\} \\ \hline P(\mathbf{y}^{\downarrow}) & \sum_{k \ge 1} p_{(k)}^3 & 3\sum_{k \ge 1} p_{(k)}^2 (1 - p_{(k)}) & 6\, p_{(1)} p_{(2)} p_{(3)} \end{array}$$
Let it be noted that all the probabilities in (7) are label-independent, and therefore they are entropies by Definition 1.
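As an illustration of the three resolutions, the following Python sketch enumerates $\Omega_s$, $\Omega_m$, and $\Omega_e$ for the three-color urn with $n = 3$ and aggregates the sequence probabilities into the entropic distribution of (7). The color probabilities are hypothetical and the variable names are ours.

```python
from itertools import product
from collections import Counter
from math import prod

p = {1: 0.5, 2: 0.3, 3: 0.2}          # hypothetical color probabilities p1, p2, p3
letters, n = sorted(p), 3

# Sequential sample space: all ordered triples of colors (27 elements).
omega_s = list(product(letters, repeat=n))

# Multinomial sample space: frequency vectors (y1, y2, y3) (10 elements).
omega_m = sorted({tuple(Counter(x).get(k, 0) for k in letters) for x in omega_s})

# Entropic sample space: nonincreasingly sorted frequency vectors (3 elements).
omega_e = sorted({tuple(sorted(y, reverse=True)) for y in omega_m})

# Entropic distribution: aggregate sequence probabilities by sorted frequencies.
P_e = Counter()
for x in omega_s:
    c = Counter(x)
    y_sorted = tuple(sorted(c.values(), reverse=True)) + (0,) * (len(letters) - len(c))
    P_e[y_sorted] += prod(p[color] for color in x)

print(len(omega_s), len(omega_m), len(omega_e))   # 27 10 3
print(dict(P_e))   # matches sum p_k^3, 3*sum p_k^2(1-p_k), and 6*p1*p2*p3 in (7)
```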
In the case of sampling n = 3 marbles from urn 1 in sequence, a subscription to the entropic sample space, Ω e , is by choice since both Ω m and Ω e are available. There are situations when the subscription to an entropic sample space may be by necessity.
Consider the experiment of randomly drawing $n = 3$ marbles in sequence from urn 2, which contains marbles of $K = 3$ unknown but distinguishable colors. In this case, the sample spaces $\Omega_1$ of (1), $\Omega_s$ of (3), and $\Omega_m$ of (4) are not well defined due to the lack of knowledge of the color labels. However, the entropic sample space, $\Omega_e$, is available for subscription regardless of what the colors are, known or unknown, as long as they are distinguishable.
In general, consider drawing a random sample of size $n$ from $\mathscr{X} = \{\ell_k; k \ge 1\}$ under $\mathbf{p} = \{p_k; k \ge 1\}$ in sequence. The sequential sample space is of the form $\Omega_s = \mathscr{X}^n$. The aggregated sample space,
$$\Omega_m = \Big\{\{y_k; k \ge 1\}: y_k \ge 0 \ \text{for every}\ k \ge 1\ \text{and}\ \sum_{k \ge 1} y_k = n\Big\},$$
is that of the multinomial array, $\mathbf{Y} = \{Y_k; k \ge 1\}$, with probability mass function
$$P(\{y_k; k \ge 1\}) = \frac{n!}{\prod_{k \ge 1} y_k!} \prod_{k \ge 1} p_k^{y_k}$$
where $0 \le y_k \le n$ for every $k \ge 1$ and $\sum_{k \ge 1} y_k = n$. Moreover, $\Omega_m$ may be further aggregated into a sample space, $\Omega_e$, for $\mathbf{Y}^{\downarrow} = \{Y_{(k)}; k \ge 1\}$, that is,
$$\Omega_e = \Big\{\{y_{(k)}; k \ge 1\}: y_{(k)} \ge 0\ \text{and}\ y_{(k)} \ge y_{(k+1)}\ \text{for every}\ k \ge 1,\ \text{and}\ \sum_{k \ge 1} y_{(k)} = n\Big\}.$$
Let $\Omega_e$ of (10) be referred to as the entropic sample space. The associated probability distribution is
$$P(\{y_{(k)}; k \ge 1\}) = \sum\nolimits^{*} P(\{y_k; k \ge 1\})$$
where $\sum^{*}$ is the summation of (9) over all $\{y_k; k \ge 1\}$s in $\Omega_m$ sharing the same given $\{y_{(k)}; k \ge 1\}$.
Given a $\mathbf{y}^{\downarrow} = \{y_{(k)}; k \ge 1\}$, (11) is an entropy. This may be seen in two steps. First, let $\hat{K} = \sum_{k \ge 1} 1[y_{(k)} \ge 1]$ be the number of distinct letters of $\mathscr{X}$ represented in a sample of size $n$, and let $\mathbf{z} = \{z_1, \dots, z_{\hat{K}}\}$ be the set of the $\hat{K}$ positive integer values of $\mathbf{y}^{\downarrow}$. $\hat{K}$ is a positive finite integer. Let the cardinality of $\mathscr{X}$ be denoted as $K = \sum_{k \ge 1} 1[p_k > 0]$. $K \ge 1$ may be finite or countably infinite. Consider an array $\mathbf{a}(\mathbf{y}^{\downarrow}) = \{a_k(\mathbf{y}^{\downarrow}); k \ge 1\}$ of length $K$ whose entries are a particular allocation of the $\hat{K}$ values $z_j$, $j = 1, \dots, \hat{K}$, with the other $K - \hat{K}$ values of $\mathbf{a}(\mathbf{y}^{\downarrow})$ being zeros. Let $\mathcal{A}(\mathbf{y}^{\downarrow})$ be the complete collection of all such distinct $\mathbf{a}(\mathbf{y}^{\downarrow})$s. Then it is clear that $\mathbf{y}^{\downarrow}$ uniquely implies $\mathcal{A}(\mathbf{y}^{\downarrow})$.
Second, the probability in (11) may be re-expressed as
$$P(\{y_{(k)}; k \ge 1\}) = \frac{n!}{\prod_{k \ge 1} y_{(k)}!} \sum\nolimits^{**} \prod_{k \ge 1} p_{(k)}^{a_k(\mathbf{y}^{\downarrow})}$$
where $\sum^{**}$ is the summation over all $\mathbf{a}(\mathbf{y}^{\downarrow}) \in \mathcal{A}(\mathbf{y}^{\downarrow})$, given a $\mathbf{y}^{\downarrow} = \{y_{(k)}; k \ge 1\}$. Equation (12) implies that $P(\{y_{(k)}; k \ge 1\})$ is a function of $\mathbf{p}^{\downarrow}$ and hence an entropy. Let $P(\{y_{(k)}; k \ge 1\})$ of (11) or (12) be referred to as the entropic distribution associated with the entropic sample space, $\Omega_e$.
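The aggregation in (11) and its re-expression (12) can be mimicked numerically. The sketch below is written for a finite alphabet with all $p_k > 0$ (an assumption made only for simplicity; names are ours): it computes $P(\{y_{(k)}\})$ by summing $\prod_k p_k^{a_k}$ over all distinct allocations of the sorted frequencies, which is exactly the inner sum in (12).

```python
from itertools import permutations
from math import factorial, prod

def entropic_prob(y_sorted, p):
    """Entropic probability of the sorted frequency vector y_sorted under p
    (both of length K), following (12): the multinomial coefficient times the
    sum of prod_k p_k^{a_k} over all distinct allocations a of the frequencies."""
    n = sum(y_sorted)
    coef = factorial(n) / prod(factorial(y) for y in y_sorted)
    allocations = set(permutations(y_sorted))   # the collection A(y) of distinct allocations
    return coef * sum(prod(pk ** a for pk, a in zip(p, alloc)) for alloc in allocations)

p = [0.4, 0.3, 0.2, 0.1]                 # hypothetical finite-alphabet distribution
print(entropic_prob((2, 1, 0, 0), p))    # probability of observing sorted frequencies {2,1,0,0} in n = 3 draws
```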

2.2. Entropic Objects

Let the adjective “entropic” be used to describe objects that are label-independent. Several such objects are defined or summarized below.
  • A function $f$ with $f(\mathbf{p}) = f(\mathbf{p}^{\downarrow})$ for all $\mathbf{p} \in \mathcal{P}$ is an entropy.
  • The elements of $\mathbf{p}^{\downarrow} = \{p_{(k)}; k \ge 1\}$ are the entropic parameters, as compared to the elements of $\mathbf{p} = \{p_k; k \ge 1\}$, which are multinomial parameters.
  • The elements of $\mathbf{Y}^{\downarrow} = \{Y_{(k)}; k \ge 1\}$, or equivalently of $\hat{\mathbf{p}}^{\downarrow} = \{\hat{p}_{(k)}; k \ge 1\}$, are entropic statistics, as compared to the elements of $\mathbf{Y} = \{Y_k; k \ge 1\}$, or equivalently $\hat{\mathbf{p}} = \{\hat{p}_k; k \ge 1\}$, which are multinomial statistics.
  • Ω e of (10) is the entropic (multinomial) sample space, as compared to Ω m of (8), which is the multinomial sample space.
  • The distribution P ( { y ( k ) ; k 1 } ) of (11) or (12), is the entropic probability distribution, while P ( { y k ; k 1 } ) of (9) is the multinomial probability distribution.
  • Entropic statistics is the collection of statistical methodologies that help to make inference on the characteristics of a random system exclusively via entropies.
In addition, there are several other useful entropic objects. First, letting $\zeta_v = \sum_{k \ge 1} p_k (1 - p_k)^v$ for all non-negative integers $v \ge 0$, $\boldsymbol{\zeta} = \{\zeta_v; v \ge 0\}$ is referred to as the entropic basis. The name comes from the fact that, for any well-behaved function $h(p)$ for $p \in [0,1]$, an entropy of the form $H = \sum_{k \ge 1} p_k h(p_k)$ may be expressed as a linear combination $H = \sum_{v \ge 1} w(v)\, \zeta_v$. For example, the Shannon entropy, provided that it is finite, may be written as
$$H = -\sum_{k \ge 1} p_k \ln p_k = \sum_{v \ge 1} \frac{1}{v}\, \zeta_v.$$
The entropic basis is useful because it unfolds many entropies into simple and linearly additive forms.
Second, letting $\eta_u = \sum_{k \ge 1} p_k^u$ for all positive integers $u \ge 1$, $\boldsymbol{\eta} = \{\eta_u; u \ge 1\}$ is often referred to as the entropic moment. The elements of both $\boldsymbol{\zeta}$ and $\boldsymbol{\eta}$ have good estimators. A detailed discussion may be found in [14].
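A quick numerical check of the Shannon-entropy expansion over the entropic basis stated above, under the definitions $\zeta_v = \sum_k p_k(1-p_k)^v$ and $\eta_u = \sum_k p_k^u$; the helper names and the test distribution are ours.

```python
import numpy as np

def zeta(p, v):
    """Entropic-basis element zeta_v = sum_k p_k (1 - p_k)^v."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1.0 - p) ** v))

def eta(p, u):
    """Entropic moment eta_u = sum_k p_k^u."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p ** u))

p = np.array([0.4, 0.3, 0.2, 0.1])
shannon = float(-np.sum(p * np.log(p)))
# Partial sums of sum_{v>=1} zeta_v / v converge to the Shannon entropy.
approx = sum(zeta(p, v) / v for v in range(1, 2000))
print(shannon, approx)      # the two values agree to several decimal places
print(eta(p, 2))            # second entropic moment, sum_k p_k^2
```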
Definition 2.
Let $X$ be a random element on a countable alphabet $\mathscr{X} = \{\ell_k; k \ge 1\}$ with a corresponding probability distribution $\mathbf{p} \in \mathcal{P}$ and its associated entropic distribution $\mathbf{p}^{\downarrow} \in \mathcal{P}^{\downarrow}$. The function,
$$\phi(t) = \sum_{k \ge 1} p_k^t, \quad \text{for } t \ge 0,$$
is referred to as the entropic-moment-generating function of $X$, of $\mathbf{p}$, or of $\mathbf{p}^{\downarrow}$. The two complementary parts of its domain, $[1, \infty)$ and $[0, 1)$, are, respectively, referred to as the primary domain and the secondary domain of the entropic-moment-generating function.
Depending on context, $\phi(t)$ may be denoted as $\phi_X(t)$, $\phi_{\mathbf{p}}(t)$, or $\phi_{\mathbf{p}^{\downarrow}}(t)$ whenever appropriate. Obviously, $\phi(t)$ is uniformly bounded above by one for all $\mathbf{p} \in \mathcal{P}$ in the primary domain but is not necessarily finitely defined in the secondary domain. However, in the case of a finite alphabet, that is, $K = \sum_{k \ge 1} 1[p_k > 0] < \infty$, $\phi(t)$ is finitely defined for each and every $t \in \mathbb{R}$, in particular for $t \ge 0$. The characteristic utility of $\phi(t)$ is further explored in Section 3 below.
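The label-independence of $\phi(t)$ in its primary domain is easy to see numerically. In the sketch below (distributions chosen only for illustration), a relabeling of $\mathbf{p}$ leaves $\phi(t)$ unchanged for every $t \ge 1$, while a distribution with a different entropic distribution yields different values, in line with the characterization developed in Section 3.

```python
import numpy as np

def phi(p, t):
    """Entropic-moment-generating function phi(t) = sum_k p_k^t, t >= 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(np.sum(p ** t))

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]          # same entropic distribution (a relabeling of p)
r = [0.6, 0.3, 0.1]          # a genuinely different entropic distribution

ts = np.linspace(1.0, 4.0, 7)
print([phi(p, t) - phi(q, t) for t in ts])   # all zeros: phi depends on p only through p-down
print([phi(p, t) - phi(r, t) for t in ts])   # nonzero: different entropic distributions differ
```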

2.3. Examples of Entropic Statistics

Example 1.
Consider the Bernoulli experiment of tossing a coin, where $P(h) = p$ and $P(t) = 1 - p$. The question of whether the coin is fair may be formulated in the usual classical sense, that is, whether $p = 0.5$. The question may be approached by estimating $p$ based on a sample proportion, $\hat{p}$, if it is observable which trials lead to "h" and which lead to "t". The question may alternatively be formulated by an equivalent entropic statement, for example, whether $H = p(1-p) = 0.25$. More generally, if $K = \sum_{k \ge 1} 1[p_k > 0]$ is finite and known, then the uniformity of $\mathbf{p}$ on $\mathscr{X}$ may be formulated entropically by, for example, $H = \sum_{k \ge 1} p_k^2 = 1/K$, $H = \sum_{k \ge 1} p_k(1 - p_k) = (K-1)/K$, or $H = -\sum_{k \ge 1} p_k \ln p_k = \ln K$. The validity of these entropic statements may then be gauged statistically.
Example 2.
Consider a two-stage sampling scheme: a random sample of size $n$, $\{X_1, \dots, X_n\}$, and then a single extra observation $X_{n+1}$ are taken. The sample of size $n$ may be summarized into letter frequencies, $\mathbf{Y} = \{Y_k; k \ge 1\}$. Let $\pi_0 = \sum_{k \ge 1} p_k 1[Y_k = 0]$. Clearly, $\pi_0$ is label-independent and therefore an entropic random variable. Given the sample of size $n$, $\pi_0$ may be thought of as the probability that $X_{n+1}$ assumes a letter in $\mathscr{X}$ that is not represented in the sample of size $n$. In some contexts, $\pi_0$ may be thought of as the probability of new discovery. Let $N_1 = \sum_{k \ge 1} 1[Y_k = 1]$ and $T_n = N_1/n$. $T_n$ is commonly known as Turing's formula, introduced in [15], but credited largely to Alan Turing. It is to be noted that $N_1$ is label-independent and, therefore, so is $T_n$. $T_n$ is a good estimator of $\pi_0$, and a discussion of many of its statistical properties may be found in [14].
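A small simulation sketch of Example 2 follows; the distribution, sample size, and random seed are ours. It compares the unobservable quantity $\pi_0$ with Turing's formula $T_n = N_1/n$ on a single sample.

```python
import numpy as np

rng = np.random.default_rng(7)
p = np.array([0.3, 0.2, 0.1] + [0.4 / 50] * 50)   # hypothetical distribution with many rare letters
n = 200

sample = rng.choice(len(p), size=n, p=p)
counts = np.bincount(sample, minlength=len(p))

pi0 = p[counts == 0].sum()        # unobservable in practice: total probability of unseen letters
turing = (counts == 1).sum() / n  # Turing's formula T_n = N_1 / n

print(pi0, turing)                # the two values are typically close
```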
Example 3.
In developing a decision tree classifier, the data space is partitioned into an ensemble of small subspaces, in each of which a local classification rule is sought. The central spirit of every local classification may be described by a two-step scheme.
1. 
First, a random sample of size n, { X 1 , , X n } , is taken from X = { k ; k 1 } , under an unknown p = { p k ; k 1 } , which is summarized into Y = { Y k ; k 1 } .
2. 
The data-based local classification rule is as follows: the next observation, $X_{n+1}$, is predicted to be the letter that is observed most frequently in the sample of size $n$. For simplicity, let it be assumed that $p_{(1)} > p_{(2)}$ and that the letter with the sample maximum frequency is unique (if not, some randomization may be employed).
Obviously, the letter designated based on a sample is not necessarily the letter corresponding to the maximum of the $p_k$s. In such a setup, the performance of the tree classifier may be gauged by evaluating (calculating or estimating) the probability of the event that "the designated letter is the same letter of $\mathscr{X}$ with probability $p_{(1)}$", that is,
$$P\Big(\arg\max_{\ell_k;\, k \ge 1}\{p_k; k \ge 1\} = \arg\max_{\ell_k;\, k \ge 1}\{\hat{p}_k; k \ge 1\}\Big).$$
Note that the event in (14) is label-independent and hence the probability is an entropy, which may be estimated. The probability in (14) may reasonably be called the confidence level of the simple classifier.
For illustration purposes, consider the special case of a binary $\mathscr{X}$, with $n = 2m + 1$ for some positive integer $m$. For simplicity, $n$ is chosen to be odd here so that $Y_{(1)} > Y_{(2)}$ always holds true. Suppose that $p_1 = p_{(1)} > p_{(2)} = 1 - p_{(1)}$. The event that a classifier based on the sample of size $n$ correctly identifies the letter of maximum probability may be equivalently expressed as $Y_1 \ge m + 1$. The probability of such an event, (14), is
$$P\Big(\ell_1 = \arg\max_{\{\ell_1, \ell_2\}}\{\hat{p}_1, 1 - \hat{p}_1\}\Big) = P(Y_1 \ge m + 1) = \sum_{y \ge m+1} \frac{n!}{y!(n-y)!}\, p_{(1)}^y (1 - p_{(1)})^{n-y},$$
which is independent of the assumption that $p_1 > p_2 = 1 - p_1$ and is, therefore, an entropy. More specifically, (15) is computed for several combinations of $n$ and $\mathbf{p}^{\downarrow}$ and the resulting values are tabulated in Table 1. Table 1, and its likes, may be used in two different ways. First, given a fixed $\mathbf{p}^{\downarrow}$, it indicates how large a sample is needed to assure a reliability level of the classifier. On the other hand, at a given level of $n$ and a particular $\mathbf{p}^{\downarrow}$, the classifier may be evaluated by the probabilities in the table. In practice, $\mathbf{p}^{\downarrow}$ is unknown but may be estimated.
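The probabilities in Table 1 are straightforward binomial tail sums; the following sketch (function name ours) reproduces them from (15).

```python
from math import comb

def confidence(p1, n):
    """P(Y1 >= m+1) for n = 2m+1 Bernoulli(p1) trials: the probability that the
    sample-majority letter is the letter with the larger probability p_(1)."""
    m = (n - 1) // 2
    return sum(comb(n, y) * p1 ** y * (1 - p1) ** (n - y) for y in range(m + 1, n + 1))

for p1 in (1/2, 2/3, 3/4, 4/5, 5/6):
    print([round(confidence(p1, n), 4) for n in (3, 5, 7)])
# reproduces the rows of Table 1, e.g. 0.7407, 0.7901, 0.8267 for p = (2/3, 1/3)
```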

3. Entropic Characterization

Entropic statistics focuses on making inference via entropies; it is therefore of interest to find a function that characterizes $\mathbf{p}^{\downarrow} \in \mathcal{P}^{\downarrow}$. Since the function $\phi(t)$ in its primary domain and $\boldsymbol{\eta} = \{\eta_u; u \ge 1\}$, where $\eta_u = \sum_{k \ge 1} p_k^u$ and $u$ is an integer, imply each other (see Lemma 1 below), it follows immediately that (13) uniquely determines all entropies. However, the following theorem claims that the characteristic property of the entropic-moment-generating function, $\phi(t)$, remains intact in any arbitrarily small neighborhood of any $t \in (1, \infty)$.
Theorem 1.
Let $\mathbf{p} = \{p_k; k \ge 1\}$ and $\mathbf{q} = \{q_k; k \ge 1\}$ be two probability distributions on a same countable alphabet, $\mathscr{X} = \{\ell_k; k \ge 1\}$. Let $\mathbf{p}^{\downarrow} = \{p_{(k)}; k \ge 1\}$ and $\mathbf{q}^{\downarrow} = \{q_{(k)}; k \ge 1\}$ be the respective corresponding entropic distributions of $\mathbf{p}$ and $\mathbf{q}$. Then $\mathbf{p}^{\downarrow} = \mathbf{q}^{\downarrow}$ if and only if $\phi_{\mathbf{p}}(t) = \phi_{\mathbf{q}}(t)$ for all $t \in (a,b)$, where $(a,b)$ is an arbitrary interval such that $1 \le a < b < \infty$.
Lemma 1.
Let $\mathbf{p} = \{p_k; k \ge 1\}$ and $\mathbf{q} = \{q_k; k \ge 1\}$ be two probability distributions in $\mathcal{P}$ with two corresponding associated entropic distributions $\mathbf{p}^{\downarrow}$ and $\mathbf{q}^{\downarrow}$ in $\mathcal{P}^{\downarrow}$. Then $\mathbf{p}^{\downarrow} = \mathbf{q}^{\downarrow}$ if and only if $\sum_{k \ge 1} p_k^n = \sum_{k \ge 1} q_k^n$ for all positive integers $n \ge 1$.
A proof of Lemma 1 may be found on pages 50 and 51 in [14]. To prove Theorem 1, it suffices to show that $\phi(t)$ in an arbitrarily small neighborhood of any interior point of $[1, \infty)$ determines the function globally.
Proof of Theorem 1.
If $\mathbf{p}^{\downarrow} = \mathbf{q}^{\downarrow}$, then it immediately follows that $\phi_{\mathbf{p}}(t) = \phi_{\mathbf{q}}(t)$ for all $t \in [1, \infty)$ and, therefore, for $t \in (a,b)$ specifically. To prove the theorem, it suffices to show the converse.
Consider the series
$$f(z) = \sum_{k=1}^{\infty} p_k^z$$
where $z \in \mathbb{C}$ is a complex variable. Denote the real and the imaginary parts of a complex value $z$ by $\mathrm{Re}(z)$ and $\mathrm{Im}(z)$, respectively.
Let $D = \{z: \mathrm{Re}(z) > 1\}$ be the subset of $\mathbb{C}$ such that the real part of $z$ is greater than 1. For every $z \in D$, since $p_k^{\mathrm{Re}(z-1)} \le 1$ and $|p_k^{i\,\mathrm{Im}(z)}| = 1$ for every $k$, where $|z|$ denotes the modulus of $z$, it follows that
$$f(z) = \sum_{k=1}^{\infty} p_k\, p_k^{z-1} = \sum_{k=1}^{\infty} p_k\, p_k^{\mathrm{Re}(z-1)}\, p_k^{i\,\mathrm{Im}(z)}$$
and
$$|f(z)| \le \sum_{k=1}^{\infty} p_k.$$
Letting $\alpha_k = \ln(1/p_k)$,
$$f(z) = \sum_{k=1}^{\infty} e^{-\alpha_k z},$$
and the functions $e^{-\alpha_k z}$, $k \ge 1$, are analytic on $\mathbb{C}$.
Since the series in (17), for $z \in D$, is dominated by the convergent series $\sum_{k \ge 1} p_k$ as in (16), by the Weierstrass uniform convergence theorem, $f(z)$ is analytic on $D$. By a similar argument, $g(z) = \sum_{k=1}^{\infty} q_k^z$ is also analytic on $D$.
Assuming that $\phi_{\mathbf{p}}(t) = \phi_{\mathbf{q}}(t)$ for $t \in (a,b)$ where $1 \le a < b < \infty$, there exists a convergent sequence $\{z_n; n \ge 1\}$ in $(a,b)$ such that $\lim_{n \to \infty} z_n = z_0 \in (a,b)$. Noting that $(a,b) \subset D$ and $f(z_n) = \phi_{\mathbf{p}}(z_n) = \phi_{\mathbf{q}}(z_n) = g(z_n)$ for $n \ge 0$, by the identity theorem for analytic functions, $f(z) = g(z)$ for all $z \in D$. It follows that $\phi_{\mathbf{p}}(t) = f(t) = g(t) = \phi_{\mathbf{q}}(t)$ for all $t \in [1, \infty)$, specifically, $\sum_{k \ge 1} p_k^n = \sum_{k \ge 1} q_k^n$ for all $n \ge 1$. Finally, by Lemma 1, $\mathbf{p}^{\downarrow} = \mathbf{q}^{\downarrow}$. □
Theorem 1 immediately implies that a subfamily of the Rényi entropy $R_{\alpha}$ with $\alpha \in (a,b) \subset (1, \infty)$, a subfamily of the Tsallis entropy $T_{\alpha}$ with $\alpha \in (a,b) \subset (1, \infty)$, and a subfamily of Hill's diversity numbers $H_{\alpha}$ with $\alpha \in (a,b) \subset (1, \infty)$, each characterizes $\mathbf{p}^{\downarrow}$ and, hence, all entropies.
The characterization of $\mathbf{p}^{\downarrow}$ in Theorem 1 may be equivalently stated using only a countably infinite subset of $(a,b)$.
Corollary 1.
Let $\mathbf{p}$ and $\mathbf{q}$ be two probability distributions on a same countable alphabet, $\mathscr{X}$. Let $\mathbf{p}^{\downarrow}$ and $\mathbf{q}^{\downarrow}$ be the corresponding entropic distributions of $\mathbf{p}$ and $\mathbf{q}$, respectively. Then $\mathbf{p}^{\downarrow} = \mathbf{q}^{\downarrow}$ if and only if $\phi_{\mathbf{p}}(t) = \phi_{\mathbf{q}}(t)$ on any infinite sequence of distinct values, $\{t_n; n \ge 1\}$, such that $\lim_{n \to \infty} t_n = c \in (1, \infty)$.
Proof. 
Both $\phi_{\mathbf{p}}(t)$ and $\phi_{\mathbf{q}}(t)$ are analytic at $t = c$, and therefore $h(t) = \phi_{\mathbf{p}}(t) - \phi_{\mathbf{q}}(t)$ is analytic at $t = c$. Let it first be shown, by induction, that all derivatives of $h(t)$ at $t = c$ are zero, that is, $h^{(m)}(c) = 0$ for $m \ge 0$. Note first that $h(c) = h^{(0)}(c) = 0$ by the fact that both $\phi_{\mathbf{p}}(t)$ and $\phi_{\mathbf{q}}(t)$ are continuous and $\lim_{n \to \infty} \phi_{\mathbf{p}}(t_n) = \phi_{\mathbf{p}}(c) = \phi_{\mathbf{q}}(c) = \lim_{n \to \infty} \phi_{\mathbf{q}}(t_n)$. Suppose that $h^{(0)}(c) = h^{(1)}(c) = h^{(2)}(c) = \cdots = h^{(m)}(c) = 0$ but $h^{(m+1)}(c) \ne 0$. Then there exists an interval $(c - \varepsilon, c + \varepsilon)$ such that $h(t) \ne 0$ for $t \in (c - \varepsilon, c + \varepsilon)$ with $t \ne c$. However, there is at least one such $t_n \in (c - \varepsilon, c + \varepsilon)$, $t_n \ne c$, with $h(t_n) = 0$ by assumption. This is a contradiction, and therefore $h^{(m)}(c) = 0$ for all $m \ge 1$. Since $h(t)$ is analytic at $t = c$ with all derivatives vanishing at $c$, $h(t) = 0$ on an interval containing $c$, that is, $\phi_{\mathbf{p}}(t) = \phi_{\mathbf{q}}(t)$ on an interval $(a,b) \subset (1, \infty)$, and the corollary follows from Theorem 1. □
Corollary 2.
Let $\mathbf{p}$ and $\mathbf{q}$ be two probability distributions on a same countable alphabet, $\mathscr{X}$. Let $\mathbf{p}^{\downarrow}$ and $\mathbf{q}^{\downarrow}$ be the corresponding entropic distributions of $\mathbf{p}$ and $\mathbf{q}$, respectively. Then $\mathbf{p}^{\downarrow} = \mathbf{q}^{\downarrow}$ if and only if $\phi_{\mathbf{p}}(t) = \phi_{\mathbf{q}}(t)$ on any infinite sequence of distinct values, $\{t_n; n \ge 1\} \subset (a,b)$, where $1 \le a < b < \infty$.
Proof. 
Noting that the infinitely many $t_n$s are in a bounded interval, there exists an infinite subsequence of $\{t_n; n \ge 1\}$ that converges to a constant $c \in [a,b]$. The corollary follows from Corollary 1. □
Consider a pair of random elements, $(X, Y)$, on a countable joint alphabet, $\mathscr{X} \times \mathscr{Y} = \{(\ell_i, m_j); i \ge 1, j \ge 1\}$, with a corresponding joint probability distribution, $\mathbf{p}_{X,Y} = \{p_{i,j}; i \ge 1, j \ge 1\}$. Let $\mathbf{p}_X = \{p_{i,\cdot}; i \ge 1\}$ and $\mathbf{p}_Y = \{p_{\cdot,j}; j \ge 1\}$, where $p_{i,\cdot} = \sum_{j \ge 1} p_{i,j}$ and $p_{\cdot,j} = \sum_{i \ge 1} p_{i,j}$, be the two marginal probability distributions of $X$ and $Y$, respectively.
Corollary 3.
X and Y are independent if and only if
$$\phi_{X,Y}(t) = \phi_X(t) \times \phi_Y(t)$$
for all $t \in (a,b)$, where $a$ and $b$ are two arbitrary real numbers such that $1 \le a < b < \infty$.
Proof. 
If $X$ and $Y$ are independent, then (18) follows immediately. Conversely, suppose that (18) holds. Consider another pair of independent random elements, $(U, V)$, on the same countable joint alphabet $\mathscr{X} \times \mathscr{Y}$ and with marginal distributions identical to those of $(X, Y)$, that is, $\mathbf{p}_X$ and $\mathbf{p}_Y$. It then follows, by (18) and Theorem 1, that $\mathbf{p}_{U,V} = \mathbf{p}_{X,Y}$, which in turn implies that $X$ and $Y$ are independent. □
Corollary 3 provides a characterization of independence on a general countable joint alphabet, and its utility may be explored further.

4. A Basic Convergence Theorem

From an entropic perspective, the convergence of $\hat{\mathbf{p}}^{\downarrow}$ to $\mathbf{p}^{\downarrow}$, to be distinguished from that of $\hat{\mathbf{p}}$ to $\mathbf{p}$, is of fundamental interest.
For clarity of presentation in this section, let it be noted that, whenever necessary, the subindex $n$ may be added to $\mathbf{Y}$, $Y_k$, $\hat{\mathbf{p}}$, $\hat{p}_k$, $\hat{\mathbf{p}}^{\downarrow}$, and $\hat{p}_{(k)}$ to highlight the dynamic nature of these previously defined quantities as $n$ changes, that is, $\mathbf{Y} = \mathbf{Y}_n$, $Y_k = Y_{k,n}$, $\hat{\mathbf{p}} = \hat{\mathbf{p}}_n$, $\hat{p}_k = \hat{p}_{k,n}$, $\hat{\mathbf{p}}^{\downarrow} = \hat{\mathbf{p}}^{\downarrow}_n$, and $\hat{p}_{(k)} = \hat{p}_{(k),n}$, respectively.
The main result established in this section is the uniform almost-sure convergence of $\hat{\mathbf{p}}^{\downarrow}$ to $\mathbf{p}^{\downarrow}$, which is made more precise in Theorem 2 below.
Consider the experiment of repeatedly and independently drawing a letter from X under p , resulting in a sequence of randomly selected letters, ω = { x 1 , x 2 , } . Let the collection of all possible such sequences or paths be denoted Ω . A sample of size n is a partial sequence of the first n randomly selected letters in an ω , { x 1 , , x n } .
Let $\mathbf{p}^{\downarrow} = \{p_{(k)}; k \ge 1\}$ and $\hat{\mathbf{p}}^{\downarrow} = \{\hat{p}_{(k)}; k \ge 1\}$ be defined as above. It is to be specifically noted that the rearrangement of the observed relative frequencies, $\hat{\mathbf{p}}^{\downarrow}$, is performed based on the observed values of $\hat{p}_k$ for all $k \ge 1$ alone, with no regard to the arrangement of the probabilities, $\mathbf{p}^{\downarrow} = \{p_{(k)}; k \ge 1\}$. Consequently, the letter whose relative frequency is observed as $\hat{p}_{(k)}$ is not necessarily the same letter with which the probability $p_{(k)}$ is associated. This is, in fact, the essence of the entropic perspective.
Theorem 2.
For any $\mathbf{p} \in \mathcal{P}$, let $\mathbf{p}^{\downarrow}$, $\hat{\mathbf{p}}$ and $\hat{\mathbf{p}}^{\downarrow}$ be defined as above. Then
$$\max_{k \ge 1}\left|\hat{p}_{(k)} - p_{(k)}\right| \xrightarrow{a.s.} 0.$$
A proof of Theorem 2 requires Lemmas 2 and 3 below.
Lemma 2.
For any $\mathbf{p} \in \mathcal{P}$, let $\hat{\mathbf{p}}$ be as defined above. Then
$$\max_{k \ge 1}\left|\hat{p}_k - p_k\right| \xrightarrow{a.s.} 0.$$
Proof. 
For each $k$, by the strong law of large numbers, $\hat{p}_k \xrightarrow{a.s.} p_k$, or equivalently $|\hat{p}_k - p_k| \xrightarrow{a.s.} 0$. Let the collection of paths $\omega = \{x_1, x_2, \dots\}$ in $\Omega$ that satisfy $\lim_{n \to \infty}|\hat{p}_k - p_k| = 0$ be denoted as $\Omega_k \subset \Omega$. It follows that $P(\Omega_k) = 1$, that the complement of $\Omega_k$, $\Omega_k^c$, is of probability zero, that $\cup_{k \ge 1}\Omega_k^c$ is of probability zero, and that, letting $\Omega^* = \cap_{k \ge 1}\Omega_k$, $P(\Omega^*) = 1 - P(\cup_{k \ge 1}\Omega_k^c) = 1$.
For each and every path $\omega \in \Omega^*$ and every $k$, $\lim_{n \to \infty}|\hat{p}_k - p_k| = 0$. Noting the fact that $|\hat{p}_k - p_k| \le \hat{p}_k + p_k$ and, therefore, $\sum_{k \ge 1}|\hat{p}_k - p_k| \le \sum_{k \ge 1}(\hat{p}_k + p_k) = 2$, by the bounded convergence theorem,
$$\lim_{n \to \infty}\sum_{k \ge 1}|\hat{p}_k - p_k| = \sum_{k \ge 1}\lim_{n \to \infty}|\hat{p}_k - p_k| = 0,$$
that is, $\sum_{k \ge 1}|\hat{p}_k - p_k| \xrightarrow{a.s.} 0$. By (21), the lemma follows from the fact that
$$\max_{k \ge 1}|\hat{p}_k - p_k| \le \sum_{k \ge 1}|\hat{p}_k - p_k| \xrightarrow{a.s.} 0.$$
 □
Lemma 2 may be viewed as a version of the Glivenko–Cantelli theorem on countable alphabets with respect to observed data from a classical multinomial experiment. The uniformity of the convergence in (20) is of essential importance in the proof of Theorem 2, which is given below by way of Lemma 3.
Lemma 3.
For each $k \ge 1$,
$$\hat{p}_{(k)} - p_{(k)} \xrightarrow{a.s.} 0.$$
A proof of Lemma 3 is given in Appendix A. Let it be noted that $\Omega$ is the sample space of a perpetual multinomial iid sampling scheme on $\mathscr{X}$ under a probability distribution $\mathbf{p} \in \mathcal{P}$. Each path in $\Omega$ may be represented by $\{\hat{\mathbf{p}}_n; n \ge 1\}$ where $\hat{\mathbf{p}}_n = \{\hat{p}_{k,n}; k \ge 1\}$. For each such path $\{\hat{\mathbf{p}}_n; n \ge 1\} \in \Omega$, there exists a corresponding path $\{\hat{\mathbf{p}}^{\downarrow}_n; n \ge 1\}$, which is the rearranged $\{\hat{\mathbf{p}}_n; n \ge 1\}$ over all $k$ for every $n$. Let the total collection of all rearranged paths of $\Omega$ be denoted as $\Omega^{\downarrow}$, and let the collection of all rearranged paths of $\Omega^*$ be denoted as $\Omega^{*\downarrow}$. It follows that $P(\Omega^{*\downarrow}) = P(\Omega^*) = 1$. Lemma 3 states that, in each path of $\Omega^{*\downarrow}$, the $k$th component of $\hat{\mathbf{p}}^{\downarrow}_n$ converges to the $k$th component of $\mathbf{p}^{\downarrow}$, namely, $p_{(k)}$, for each $k$.
Proof of Theorem 2.
For any $\omega \in \Omega^{*\downarrow}$, note that $\sum_{k \ge 1}|\hat{p}_{(k)} - p_{(k)}| \le 2$; by the bounded convergence theorem and Lemma 3, $\lim_{n \to \infty}\max_{k \ge 1}|\hat{p}_{(k)} - p_{(k)}| \le \lim_{n \to \infty}\sum_{k \ge 1}|\hat{p}_{(k)} - p_{(k)}| = \sum_{k \ge 1}\lim_{n \to \infty}|\hat{p}_{(k)} - p_{(k)}| = 0$. The theorem follows from the fact that $P(\Omega^{*\downarrow}) = 1$. □
Theorem 2 may be viewed as a version of the Glivenko–Cantelli theorem on countable alphabets with respect to observed data from an entropic multinomial experiment. Theorem 2 immediately implies almost sure convergence for estimators of several key quantities in classification procedures.
Example 4.
$$\hat{p}_{(1)} \xrightarrow{a.s.} p_{(1)}.$$
Example 5.
Suppose that $p_{(1)} > p_{(2)}$, that is, there exists a unique letter in $\mathscr{X}$, denoted $\ell_0$, associated with probability $p_{(1)}$. Then the probability of a correct classification, that is, $\ell_0 = \arg\max_{\mathscr{X}}\{\hat{p}_k; k \ge 1\}$, converges almost surely to one. This is so because, for any path in $\Omega^*$ and any $\varepsilon < (p_{(1)} - p_{(2)})/2$, there exists an $N$ such that, for any $n > N$, $|\hat{p}_{(1)} - p_{(1)}| < \varepsilon$ and $|\hat{p}_{(1)} - p_{(k)}| > \varepsilon$ for all $k \ge 2$.
The results of Examples 4 and 5 lend fundamental support for classification algorithms based on maximum observed frequency, used widely in exercises of modern data science, for example, decision trees, as mentioned in Example 3.
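A simulation sketch consistent with Theorem 2 is given below; the distribution is chosen only for illustration. The sup-distance between the sorted sample relative frequencies and the sorted probabilities shrinks as $n$ grows, even though the sorting of $\hat{\mathbf{p}}$ is done with no regard to which letter is which.

```python
import numpy as np

rng = np.random.default_rng(0)
p = np.array([0.35, 0.25, 0.15, 0.10, 0.10, 0.05])
p_down = np.sort(p)[::-1]                 # the entropic distribution p_(1) >= p_(2) >= ...

for n in (100, 1000, 10000, 100000):
    sample = rng.choice(len(p), size=n, p=p)
    p_hat = np.bincount(sample, minlength=len(p)) / n
    p_hat_down = np.sort(p_hat)[::-1]     # sorted observed relative frequencies
    print(n, np.max(np.abs(p_hat_down - p_down)))   # sup-distance shrinks as n grows
```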
Many entropies of interest across a wide spectrum of studies are of the additive form, $H(\mathbf{p}) = \sum_{k \ge 1} g(p_{(k)}) h(p_{(k)})$, where $g(p) \ge 0$ and $h(p) \ge 0$ are functions of $p \in [0,1]$. The almost-sure convergence of Theorem 2 may be passed on to the plug-in estimators of some such entropies by way of a rather trivial statement in the proposition below.
Proposition 1.
Let $H(\mathbf{p}) = \sum_{k \ge 1} g(p_{(k)}) h(p_{(k)})$ where $g(p) \ge 0$ and $0 \le h(p) \le M$ for some $M > 0$ are continuous functions of $p \in I = [0,1]$. Suppose that $\mathbf{p} \in \mathcal{P}$ is such that
1. $\sum_{k \ge 1} g(p_{(k)}) < \infty$, and
2. $\sum_{k \ge 1} g(\hat{p}_{(k),n}) \xrightarrow{a.s.} \sum_{k \ge 1} g(p_{(k)})$.
Then $H(\hat{\mathbf{p}}) \xrightarrow{a.s.} H(\mathbf{p})$.
Proof. 
Noting that $H(\mathbf{p}) = \sum_{k \ge 1} g(p_{(k)}) h(p_{(k)}) \le M \sum_{k \ge 1} g(p_{(k)}) < \infty$, it follows by Conditions 1 and 2 that
$$|H(\hat{\mathbf{p}}) - H(\mathbf{p})| \le M \sum_{k \ge 1} g(\hat{p}_{(k)}) + M \sum_{k \ge 1} g(p_{(k)}) < \infty.$$
Let $\Omega^{**} \subset \Omega^{\downarrow}$ be the total collection of paths such that Condition 2 holds. For each path, $\{\hat{\mathbf{p}}^{\downarrow}_n; n \ge 1\} \in \Omega^{**}$, by (23), the proposition follows by the bounded convergence theorem and the fact that $P(\Omega^{**}) = 1$. □
Example 6.
Let $H(\mathbf{p}) = \sum_{k \ge 1} p_{(k)}^s (1 - p_{(k)})^t$ where $s \ge 1$ and $t \ge 0$ are two real constants. In the setup of Proposition 1, $h(p) = (1-p)^t \le 1$ on $I = [0,1]$, and $g(p) = p^s$ satisfies $\sum_{k \ge 1} p_{(k)}^s \le 1$ for all $\mathbf{p} \in \mathcal{P}$ without qualification, and therefore also for $\hat{\mathbf{p}} \in \mathcal{P}$, that is, $\sum_{k \ge 1}\hat{p}_{(k)}^s \le 1$, which implies, by the bounded convergence theorem, $\sum_{k \ge 1}\hat{p}_{(k)}^s \to \sum_{k \ge 1} p_{(k)}^s$ along each and every path in $\Omega^{*\downarrow}$. By Proposition 1, $\sum_{k \ge 1}\hat{p}_{(k)}^s(1 - \hat{p}_{(k)})^t \xrightarrow{a.s.} \sum_{k \ge 1} p_{(k)}^s(1 - p_{(k)})^t$. More specifically, when $s$ and $t$ take integer values $u \ge 1$ and $v \ge 0$, the plug-in estimator of the generalized Simpson's diversity index $H(\mathbf{p}) = \sum_{k \ge 1} p_{(k)}^u (1 - p_{(k)})^v$ (see [5,16]) converges almost surely.
Example 6 implies that the plug-in estimator of $H(\mathbf{p}) = \sum_{k \ge 1} p_{(k)}^s$, where $s \ge 1$, converges almost surely, which in turn implies that the plug-in estimators of the members of the Rényi entropy family and the Tsallis entropy family converge almost surely for all $\mathbf{p} \in \mathcal{P}$ without qualification when $\alpha > 1$. However, it is not known whether the plug-in estimators of the members of the families with $\alpha \in (0,1)$ converge almost surely when $\mathbf{p} \in \mathcal{P}$ without other qualification (also, see [17]).
Example 7.
The plug-in estimator of the Shannon entropy, $H(\mathbf{p}) = -\sum_{k \ge 1} p_{(k)} \ln p_{(k)}$, converges almost surely when $\mathbf{p}$ is such that $K = \sum_{k \ge 1} 1[p_{(k)} > 0] < \infty$. In this case, even though $-\ln p$ is not bounded above on $I = [0,1]$, $h(p) = -p^{\alpha} \ln p \le 1/(\alpha e)$ is, for any $\alpha \in (0,1)$. Writing $H(\mathbf{p}) = \sum_{k \ge 1} g(p_{(k)}) h(p_{(k)})$ where $g(p) = p^{1-\alpha}$ and $h(p) = -p^{\alpha} \ln p$, it suffices to show that $\sum_{k \ge 1}\hat{p}_{(k)}^{1-\alpha}$ converges almost surely. However, this is the case since, by Theorem 2, for every path $\{\hat{\mathbf{p}}^{\downarrow}_n; n \ge 1\} \in \Omega^{*\downarrow}$, $\sum_{k \ge 1}\hat{p}_{(k)}^{1-\alpha} \to \sum_{k \ge 1} p_{(k)}^{1-\alpha}$, due to the fact that $K < \infty$ and $P(\Omega^{*\downarrow}) = 1$.
It is not known whether the plug-in estimator of the Shannon entropy converges almost surely when $\mathbf{p} \in \mathcal{P}$ without further qualification.
The Shannon entropy has utilities across a wide spectrum of scientific investigations (see [18]). However, it is not finitely defined for all distributions in $\mathcal{P}$. A family of generalized Shannon entropies, for any $\mathbf{p} \in \mathcal{P}$, is proposed as
$$H_m(\mathbf{p}) = -\sum_{k \ge 1} \frac{p_{(k)}^m}{\sum_{j \ge 1} p_{(j)}^m} \ln\left(\frac{p_{(k)}^m}{\sum_{j \ge 1} p_{(j)}^m}\right)$$
in [19], where $m \ge 1$ is an integer. The Shannon entropy is the special family member corresponding to $m = 1$. It may be verified that each member of the family, except the Shannon entropy, is finitely defined for all $\mathbf{p} \in \mathcal{P}$ and offers all the important utilities that the Shannon entropy offers, including the fact that the mutual information derived based on each member with $m \ge 2$ is zero if and only if the two underlying random elements are independent.
Example 8.
The plug-in estimator of (24) converges almost surely for any $\mathbf{p} \in \mathcal{P}$ whenever $m \ge 2$. To see this, let it first be noted that the plug-in estimator of $-\sum_{k \ge 1} p_k^m \ln p_k$ converges almost surely. This fact follows from Proposition 1 with $g(p) = p$ and $h(p) = -p^{m-1} \ln p$, which is uniformly bounded above on $I = [0,1]$. The claimed almost-sure convergence then follows from the re-expression of (24) below,
$$H_m(\mathbf{p}) = \frac{1}{\sum_{j \ge 1} p_{(j)}^m}\left(-m \sum_{k \ge 1} p_{(k)}^m \ln p_{(k)} + \sum_{k \ge 1} p_{(k)}^m \ln\Big(\sum_{j \ge 1} p_{(j)}^m\Big)\right),$$
and the fact that the plug-in estimator of each of the four series involved converges almost surely.
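As a numerical companion to Example 8, the sketch below (helper name and test distribution ours) computes the plug-in estimator of $H_m$ for $m = 2$ and shows it approaching the population value as the sample size grows.

```python
import numpy as np

def generalized_shannon(p, m):
    """H_m(p) = -sum_k w_k ln w_k with w_k = p_k^m / sum_j p_j^m (Shannon entropy when m = 1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    w = p ** m / np.sum(p ** m)
    return float(-np.sum(w * np.log(w)))

rng = np.random.default_rng(1)
p = np.array([0.4, 0.3, 0.2, 0.1])
for n in (100, 10000, 1000000):
    p_hat = np.bincount(rng.choice(len(p), size=n, p=p), minlength=len(p)) / n
    print(n, generalized_shannon(p_hat, 2), generalized_shannon(p, 2))  # plug-in approaches the target value
```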

5. Conclusions and Discussion

This article introduces a perspective termed entropic statistics. One of the motivations of the perspective is to accommodate probability experiments on sample spaces which may include outcomes that are known to exist (and therefore are prescribed) and those whose existence is not known (and therefore not prescribable). Such a framework allows statistical exploration into a general population with possibly infinitely many previously unobserved and unknown outcomes, or new discoveries. The key concept fostering such a framework is label-independence, that is, all parameters and statistics do not depend on the labels of an alphabet as long as the labels are distinguishable. Consequently, in this article an array of label-independent objects are defined and termed entropic objects. In particular, a general entropy, entropic parameters, entropic statistics, entropic sample spaces, entropic probability distributions, and an entropic-moment-generating function are defined.
Based on the defined entropic objects, two basic theorems are established. Theorem 1 provides a characterization of the entropic probability distribution on the alphabet via the entropic-moment-generating function, and Theorem 2 establishes the almost-sure convergence of the entropic statistics to the entropic parameters and, hence, provides a foundational support to the entropic framework.
On the other hand, this article merely provides a few basic results in entropic statistics. On a broader spectrum, many other issues may be fruitfully considered on at least three fronts, namely, fundamental, probabilistic, and statistical. To begin with, the fundamental question of what constitutes entropy may be explored in many directions. One of the most cited sets of axioms is that discussed by Khinchin [20], under which the Shannon entropy is proved to be unique. However under slightly less restrictive axioms, many other entropies exist and enjoy almost all the desirable utilities of the Shannon entropy; for example, see [19]. The existing literature on generalization of entropy is extensive in physics and information theory; for example, see [21,22]. The collective effort to better understand what entropy is and how it may help to describe an underlying random system is ongoing. Further research in understanding generalized entropies and their implications could greatly enrich the framework of entropic statistics.
Entropy in general is often thought of as a summary of a profile state, however measured numerically, of inner energy or chaos within a random system. As such, it is independent of any labeling system, regardless of whether the state is observable or not. A key conceptual shift introduced in this article is from statistical inference on $\mathbf{p}$ (or a function of $\mathbf{p}$) based on the multinomial frequencies $\mathbf{Y}$ to that on $\mathbf{p}^{\downarrow}$ (or a function of $\mathbf{p}^{\downarrow}$) based on the entropic frequencies $\mathbf{Y}^{\downarrow}$. Such a framework shift, by necessity or by choice, triggers a long array of basic probability and statistics questions, under different degrees of model restriction, ranging from parametric forms of $p_k = p(k, \theta)$ for some parameter $\theta$ to the nonparametric form, $\{p_{(k)}; k \ge 1\}$. It may be interesting to note that even for the nonparametric form, there are several qualitatively different cases: that of a known $K = \sum_{k \ge 1} 1[p_{(k)} > 0] < \infty$, that of an unknown $K = \sum_{k \ge 1} 1[p_{(k)} > 0] < \infty$, and that of $K = \sum_{k \ge 1} 1[p_{(k)} > 0] = \infty$. Each of these model classes could imply a very different stochastic behavior of $\mathbf{Y}^{\downarrow}$ as the sample size $n$ increases. Even long before the notion of information entropy was coined by Shannon in [1], the behavior of $\mathbf{Y}^{\downarrow}$ had been discussed in the literature by, for example, Auerbach [23] and Zipf [24]. More recently, several articles [25,26] discussed domains of attraction in the total collection of all distributions on a countable alphabet by a tail index, $\tau_n = n \sum_{k \ge 1} p_{(k)}(1 - p_{(k)})^n$. Each domain characterizes the decay rate of the tail of the underlying entropic distribution and, in turn, dictates the rates of convergence of various statistical estimators of various entropies. Further advances on that front would enhance the understanding of the probabilistic behavior of the entropic statistics and, in turn, of the estimated entropies of interest.
In terms of statistical estimation, a large proportion of the existing literature mainly focuses on the Shannon entropy and variations of the plug-in estimators under various conditions, most of which are described and referenced in [14]. There are also non-plug-in estimators of different types, for example, the Bayes estimators [27,28,29], the hierarchical Bayes estimators [30], the James–Stein estimators [31], the coverage-adjusted estimators [32,33,34], and an unbiased estimator based on sequential data proposed by Montgomery-Smith and Schürmann. In general, the asymptotic distributions of the plug-in estimators and their variants seem to have been studied and described to some extent; for example, see [12,35,36,37,38]. However, it is fair to say that many, if not most, of the proposed estimators of various types have not yet been assigned asymptotic distributions. Any advances in that direction could much benefit applications of these estimators.
In short, the landscape of entropic statistics is quite porous in comparison to that of richly supported classical statistics. Many basic and important questions are yet to be answered, from the axiomatic foundation, to the definitions of basic elements, to the theoretical supporting architecture, and to the relevance in applications. However, the same said porosity also offers opportunities for interesting contemplation.

Funding

This research received no external funding.

Data Availability Statement

This research involved no data.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A

Proof of Lemma 3.
For clarity, the proof of (22) is given, respectively, in five progressively more general cases: (1) $p_1 = 1$; (2) $K = \sum_{k \ge 1} 1[p_k > 0] < \infty$ and all positive $p_k$'s are distinct; (3) $K < \infty$; (4) $K$ is infinite and all positive $p_k$'s are distinct; (5) $K$ is infinite.
For notational simplicity in all cases, let it be assumed without loss of generality that $\mathbf{p} = \{p_k; k \ge 1\}$ is nonincreasingly arranged to begin with, that is, $p_k \ge p_{k+1}$ for every $k$. With this assumption, the only rearranged object is $\hat{\mathbf{p}}^{\downarrow} = \{\hat{p}_{(k)}; k \ge 1\}$ with $\hat{p}_{(k)} \ge \hat{p}_{(k+1)}$ for every $k$.
In Case 1, the statement of (22) is trivial.
In Case 2, let $p_0 = 1$ and $p_{K+1} = 0$. It follows that
$$1 = p_0 > p_1 > p_2 > \cdots > p_{k-1} > p_k > p_{k+1} > \cdots > p_K > p_{K+1} = 0.$$
For each sequence $\omega \in \Omega^*$ as defined in the proof of Lemma 2, the uniformity of (20) implies that for any $\varepsilon > 0$, there exists an $N$ such that for all $n > N$, $-\varepsilon < \hat{p}_k - p_k < \varepsilon$ for all $k \ge 1$. Specifically, let
$$\varepsilon_0 = \min\{(p_k - p_{k+1})/2;\ 1 \le k \le K\} > 0.$$
There exists an $N$ such that for all $n > N$, $\max\{|\hat{p}_k - p_k|;\ 1 \le k \le K\} < \varepsilon_0$, which has the following two implications.
  • $\cap_{k=1}^{K}(p_k - \varepsilon_0, p_k + \varepsilon_0) = \emptyset$, that is, the intervals $p_k \pm \varepsilon_0$ are disjoint for all $k = 1, \dots, K$.
  • For every $k$, $k = 1, \dots, K$, $\hat{p}_k \in p_k \pm \varepsilon_0$ and it is the only observed relative frequency in $p_k \pm \varepsilon_0$.
Combining the above two implications, it follows that $\hat{p}_k = \hat{p}_{(k)}$, that is, $|\hat{p}_{(k)} - p_k| \to 0$, for every $k$, $k = 1, \dots, K$. Since $\Omega^*$ is of probability one, (22) is established.
In Case 3, it is allowed that several consecutive probabilities in $\mathbf{p} = \{p_k; 1 \le k \le K\}$, where $K = \sum_{k \ge 1} 1[p_k > 0] < \infty$, are identical. It follows that
$$1 = p_0 \ge p_1 \ge p_2 \ge \cdots \ge p_{k-1} \ge p_k \ge p_{k+1} \ge \cdots \ge p_K > p_{K+1} = 0.$$
Noting that $\mathbf{p} = \{p_k; k \ge 1\}$ is a finite sequence of runs of identical values, collecting the first value in each run and retaining its index value, a subset of $\{p_k; k \ge 1\}$ is obtained, namely, $\{p_{k_i}; i = 1, \dots, I\}$, where $I$ is the number of distinct values in $\mathbf{p}$. Let $r_i$ be the multiplicity of $p_{k_i}$ in $\mathbf{p}$, $i = 1, \dots, I$. It follows that
$$1 = p_0 = p_{k_0} > p_{k_1} > p_{k_2} > \cdots > p_{k_I} > p_{K+1} = 0.$$
For each sequence $\omega \in \Omega^*$ as defined in the proof of Lemma 2, the uniformity of (20) implies that for any $\varepsilon > 0$, there exists an $N$ such that for all $n > N$, $-\varepsilon < \hat{p}_k - p_k < \varepsilon$ for all $k$, $k = 1, \dots, K$. Specifically, let
$$\varepsilon_1 = \min\{(p_{k_i} - p_{k_{i+1}})/2;\ i = 0, \dots, I\} > 0$$
where $p_{k_0} = p_0 = 1$ and $p_{k_{I+1}} = p_{K+1} = 0$. There exists an $N_1$ such that for all $n > N_1$, $\max\{|\hat{p}_k - p_k|;\ k \ge 1\} < \varepsilon_1$, which has the following implications.
  • $\cap_{i=1}^{I}(p_{k_i} - \varepsilon_1, p_{k_i} + \varepsilon_1) = \emptyset$, that is, the intervals $p_{k_i} \pm \varepsilon_1$ are disjoint for all $i$, $i = 1, \dots, I$.
  • For every given $k$, and therefore an implied $i$, there are exactly $r_i$ relative frequencies among $\{\hat{p}_k; 1 \le k \le K\}$ found in $p_{k_i} \pm \varepsilon_1$.
It then follows that, for each given $k$,
$$\min\{\hat{p}_{(k_i + j)};\ j = 0, \dots, r_i - 1\} \le \hat{p}_{(k)} \le \max\{\hat{p}_{(k_i + j)};\ j = 0, \dots, r_i - 1\},$$
and hence $\hat{p}_{(k)} \to p_{(k)} = p_k$. Finally, (22) follows from the fact that $P(\Omega^*) = 1$.
In Case 4, $p_k > 0$ for all $k \ge 1$ and all probabilities are distinct. Letting $p_0 = 1$ and $p_{\infty} = 0$,
$$1 = p_0 > p_1 > p_2 > \cdots > p_{k-1} > p_k > p_{k+1} > \cdots > p_{\infty} = 0.$$
For every fixed $k'$ such that $p_{k'} \in (0,1)$, let $m \ge 1$ be an integer such that
$$1 - \sum_{k=1}^{m} p_k < p_{k'+1}, \ \text{and}$$
$$m \ge k' + 1.$$
Such an $m$ exists for any given $\mathbf{p}$ with an infinite $K$ and a fixed $k' \ge 1$.
For each sequence $\omega \in \Omega^*$, as defined in the proof of Lemma 2, the uniformity of (20) implies that for any $\varepsilon > 0$, there exists an $N$ such that for all $n > N$, $-\varepsilon < \hat{p}_k - p_k < \varepsilon$ for all $k$, $k \ge 1$. Specifically, let
$$\varepsilon_2 = \min\{(p_k - p_{k+1})/2;\ k = 0, \dots, m\} > 0.$$
There exists an $N_2$ such that for all $n > N_2$, $\max\{|\hat{p}_k - p_k|;\ k \ge 1\} < \varepsilon_2$, which implies the following.
  • The first $m$ probabilities of $\mathbf{p}$, $p_1, \dots, p_m$, are covered, respectively, by $m$ disjoint intervals, $p_k \pm \varepsilon_2$, $k = 1, \dots, m$.
  • The relative frequencies corresponding to $\{p_1, \dots, p_m\}$, namely, $\{\hat{p}_1, \dots, \hat{p}_m\}$, are also covered, respectively, by the same disjoint intervals, $p_k \pm \varepsilon_2$, $k = 1, \dots, m$.
On the other hand, noting the strict inequality in (A3) and the fact that $k'$ is a fixed integer, there exists a sufficiently small $\varepsilon_3$ such that
$$1 - \sum_{k=1}^{m} p_k + m\varepsilon_3 < p_{k'+1}$$
or, equivalently,
$$1 - \sum_{k=1}^{m}(p_k - \varepsilon_3) < p_{k'+1}.$$
Let $\varepsilon_4 = \min\{\varepsilon_2, \varepsilon_3\}$. By Lemma 2, there exists an $N_4$ such that for all $n > N_4$,
$$p_k - \varepsilon_4 < \hat{p}_k < p_k + \varepsilon_4,$$
for all $k$, $k = 1, \dots, m$, and that the updated (A5) and (A6) hold, namely,
$$1 - \sum_{k=1}^{m} p_k + m\varepsilon_4 < p_{k'+1}$$
or, equivalently,
$$1 - \sum_{k=1}^{m}(p_k - \varepsilon_4) < p_{k'+1}.$$
That is, in each of the disjoint intervals of (A7), there is at least one relative frequency. In particular, $\hat{p}_k$ is covered in $(p_k - \varepsilon_4, p_k + \varepsilon_4)$ for each $k$, $k = 1, \dots, k'$, noting $k' < m$ by (A4).
Next it is necessary to show that there may not be more than one relative frequency in $(p_k - \varepsilon_4, p_k + \varepsilon_4)$ for each $k$, $k = 1, \dots, k'$. Toward that end, consider the total mass of $100\%$ distributed among $\hat{p}_k$, $k \ge 1$, given $n$. From interval $(p_1 - \varepsilon_4, p_1 + \varepsilon_4)$ to interval $(p_m - \varepsilon_4, p_m + \varepsilon_4)$, the total collective mass covered is at least $\sum_{k=1}^{m}\hat{p}_k$; however, by (A7) and (A8),
$$\sum_{k=1}^{m}\hat{p}_k = \sum_{k=1}^{m}(\hat{p}_k + \varepsilon_4) - m\varepsilon_4 > \sum_{k=1}^{m} p_k - m\varepsilon_4 = \sum_{k=1}^{m}(p_k - \varepsilon_4) > 1 - p_{k'+1}$$
and the remainder of the mass is
$$1 - \sum_{k=1}^{m}\hat{p}_k < p_{k'+1} < p_{k'+1} + \varepsilon_4.$$
Regardless of whether the mass, $1 - \sum_{k=1}^{m}\hat{p}_k$, on the left side of (A9) is allocated to one or more than one letter other than those in $\{\ell_1, \dots, \ell_m\}$, the corresponding $\hat{p}_k$, $k \ge m+1$, could not possibly be sufficiently large to exceed $p_{k'+1} + \varepsilon_4$, nor, therefore, $p_{k'} - \varepsilon_4$. That implies that, along the path of that selected $\omega \in \Omega^*$, for any $n > N_4$, $\hat{p}_k$ and $\hat{p}_k$ alone is covered in $(p_k - \varepsilon_4, p_k + \varepsilon_4)$ for each $k$, $k = 1, \dots, k'$. This immediately implies that $\hat{p}_k = \hat{p}_{(k)}$ for all $k$, $k = 1, \dots, k'$, and in particular $\hat{p}_{k'} = \hat{p}_{(k')}$. $\hat{p}_{(k')} \to p_{k'}$ since $\hat{p}_{k'} \to p_{k'}$. Finally, (22) follows from the fact that $P(\Omega^*) = 1$.
In Case 5, $p_k > 0$ for all $k \ge 1$ but the probabilities in $\mathbf{p} = \{p_k; k \ge 1\}$ are allowed to have multiplicities. Letting $p_0 = 1$ and $p_{\infty} = 0$,
$$1 = p_0 > p_1 \ge p_2 \ge \cdots \ge p_k \ge \cdots > p_{\infty} = 0.$$
$\mathbf{p} = \{p_k; k \ge 1\}$ has a special pattern: its maximum value runs for $r_1$ times; then its second largest value runs for $r_2$ times, and so on and so forth. In general, its $i$th largest value runs for $r_i$ times, followed by a run of its $(i+1)$st largest value. Collect the first value in each run and record its index, $k_i$, $i \ge 1$, resulting in a strictly decreasing subsequence, $\{p_{k_i}; i \ge 1\}$. Letting $k_0 = 0$ and $k_{\infty} = \infty$,
$$1 = p_0 = p_{k_0} > p_{k_1} > p_{k_2} > \cdots > p_{k_i} > \cdots > p_{k_{\infty}} = p_{\infty} = 0.$$
Consequently, $\mathbf{p} = \{p_k; k \ge 1\}$ may be viewed as a sequence containing $p_{k_i}$ for $i \ge 1$ with $r_i - 1$ copies of $p_{k_i}$ between $p_{k_i}$ and $p_{k_{i+1}}$.
Given a value of $k$, say $k'$, there is an $i'$ such that $p_{k'} = p_{k_{i'}}$, and $k'$ must be one of the values from the list $\{k_{i'}, k_{i'}+1, \dots, k_{i'}+r_{i'}-1\}$, noting $p_{k_{i'}+r_{i'}} = p_{k_{i'+1}} < p_{k'}$. Let $m$ be such that
$$1 - \sum_{i=1}^{m} r_i p_{k_i} < p_{k_{i'+1}}, \ \text{and}$$
$$\sum_{i=1}^{m} r_i \ge k_{i'+1}.$$
Such an $m$ exists for any given $\mathbf{p}$ and a fixed $k' \ge 1$, which fixes an $i'$.
For each sequence $\omega \in \Omega^*$ as defined in the proof of Lemma 2, the uniformity of (20) implies that for any $\varepsilon > 0$, there exists an $N$ such that for all $n > N$, $-\varepsilon < \hat{p}_k - p_k < \varepsilon$ for all $k$, $k \ge 1$. Specifically, let
$$\varepsilon_5 = \min\{(p_{k_i} - p_{k_{i+1}})/2;\ i = 0, \dots, m\} > 0.$$
There exists an $N_5$ such that for all $n > N_5$, $\max\{|\hat{p}_k - p_k|;\ k \ge 1\} < \varepsilon_5$, which has the following two implications.
  • The first $\sum_{i=1}^{m} r_i$ probabilities of $\mathbf{p}$, $p_1, \dots, p_{\sum_{i=1}^{m} r_i}$, are covered, respectively, by the $m$ disjoint intervals, $p_{k_i} \pm \varepsilon_5$, $i = 1, \dots, m$.
  • The relative frequencies corresponding to $\{p_1, \dots, p_{\sum_{i=1}^{m} r_i}\}$, namely, $\{\hat{p}_1, \dots, \hat{p}_{\sum_{i=1}^{m} r_i}\}$, are also covered, respectively, by the same disjoint intervals, $p_{k_i} \pm \varepsilon_5$, $i = 1, \dots, m$.
On the other hand, noting the strict inequality in (A11) and the fact that $k'$ is a fixed integer, there exists a sufficiently small $\varepsilon_6$ such that
$$1 - \sum_{i=1}^{m} r_i p_{k_i} + \varepsilon_6 \sum_{i=1}^{m} r_i < p_{k_{i'+1}}$$
or, equivalently,
$$1 - \sum_{i=1}^{m} r_i (p_{k_i} - \varepsilon_6) < p_{k_{i'+1}}.$$
Let $\varepsilon_7 = \min\{\varepsilon_5, \varepsilon_6\}$. By Lemma 2, there exists an $N_7$ such that for all $n > N_7$, all relative frequencies sharing the same $p_{k_i}$, namely, $\hat{p}_{k_i}, \hat{p}_{k_i+1}, \dots, \hat{p}_{k_i+r_i-1}$, are found in
$$(p_{k_i} - \varepsilon_7, p_{k_i} + \varepsilon_7)$$
for all $i$, $i = 1, \dots, m$, and the updated (A13) and (A14) are
$$1 - \sum_{i=1}^{m} r_i p_{k_i} + \varepsilon_7 \sum_{i=1}^{m} r_i < p_{k_{i'+1}}$$
or, equivalently,
$$1 - \sum_{i=1}^{m} r_i (p_{k_i} - \varepsilon_7) < p_{k_{i'+1}}.$$
That is, in each of the disjoint intervals of (A15), there are at least $r_i$ relative frequencies. In particular, the $r_i$ relative frequencies, $\{\hat{p}_{k_i}, \hat{p}_{k_i+1}, \dots, \hat{p}_{k_i+r_i-1}\}$, are covered in $(p_{k_i} - \varepsilon_7, p_{k_i} + \varepsilon_7)$ for each $i$, $i = 1, \dots, i' \le m$, by (A12).
Next it is necessary to show that there may not be more than $r_i$ relative frequencies in $(p_{k_i} - \varepsilon_7, p_{k_i} + \varepsilon_7)$ for each $i$, $i = 1, \dots, i'$. Toward that end, consider the total mass of $100\%$ distributed among $\hat{p}_k$, $k \ge 1$, given $n$. From interval $(p_1 - \varepsilon_7, p_1 + \varepsilon_7)$ to interval $(p_{\sum_{i=1}^{m} r_i} - \varepsilon_7, p_{\sum_{i=1}^{m} r_i} + \varepsilon_7)$, the total collective mass covered is at least $\sum_{i=1}^{m} r_i \hat{p}_{k_i}$; however, by (A15) and (A16),
$$\sum_{i=1}^{m} r_i \hat{p}_{k_i} > \sum_{i=1}^{m} r_i (p_{k_i} - \varepsilon_7) > 1 - p_{k_{i'+1}}$$
and the remainder of the mass is
$$1 - \sum_{i=1}^{m} r_i \hat{p}_{k_i} < p_{k_{i'+1}} < p_{k_{i'+1}} + \varepsilon_7.$$
Regardless of whether the mass on the left side of (A17) is allocated to one or more than one letter other than those in $\{\ell_1, \dots, \ell_{\sum_{i=1}^{m} r_i}\}$, the corresponding $\hat{p}_k$, $k \ge \sum_{i=1}^{m} r_i + 1$, could not possibly be sufficiently large to exceed $p_{k_{i'+1}} + \varepsilon_7$, nor, therefore, $p_{k'} - \varepsilon_7$. That implies that, along the path of that selected $\omega \in \Omega^*$, for any $n > N_7$, $\{\hat{p}_{k_i}, \hat{p}_{k_i+1}, \dots, \hat{p}_{k_i+r_i-1}\}$ and only $\{\hat{p}_{k_i}, \hat{p}_{k_i+1}, \dots, \hat{p}_{k_i+r_i-1}\}$ are covered in $(p_{k_i} - \varepsilon_7, p_{k_i} + \varepsilon_7)$ for each $i$, $i = 1, \dots, i'$. This immediately implies that
  • $\{\hat{p}_{k_{i'}}, \hat{p}_{k_{i'}+1}, \dots, \hat{p}_{k_{i'}+r_{i'}-1}\} = \{\hat{p}_{(k_{i'})}, \hat{p}_{(k_{i'}+1)}, \dots, \hat{p}_{(k_{i'}+r_{i'}-1)}\}$ as sets, but not necessarily equal component-wise;
  • $|\hat{p}_{(k_{i'}+j)} - p_{k'}| < \varepsilon_7$ for all $j = 0, 1, \dots, r_{i'}-1$;
  • in particular, $|\hat{p}_{(k')} - p_{k'}| < \varepsilon_7$.
Finally, (22) follows from the fact that $P(\Omega^*) = 1$. □

References

  1. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423, 623–656.
  2. Rényi, A. On measures of information and entropy. In Proceedings of the Fourth Berkeley Symposium on Mathematics, Statistics and Probability, Berkeley, CA, USA, 20–30 June 1961; pp. 547–561.
  3. Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487.
  4. Simpson, E.H. Measurement of diversity. Nature 1949, 163, 688.
  5. Zhang, Z.; Zhou, J. Re-parameterization of multinomial distribution and diversity indices. J. Stat. Plan. Inference 2010, 140, 1731–1738.
  6. Hill, M.O. Diversity and evenness: A unifying notation and its consequences. Ecology 1973, 54, 427–432.
  7. Emlen, J.M. Ecology: An Evolutionary Approach; Addison-Wesley: Reading, MA, USA, 1973.
  8. Miller, G.A.; Madow, W.G. On the Maximum Likelihood Estimate of the Shannon-Weaver Measure of Information; Air Force Cambridge Research Center Technical Report AFCRC-TR-54-75; Operational Applications Laboratory, Air Force, Cambridge Research Center, Air Research and Development Command: New York, NY, USA, 1954.
  9. Miller, G.A. Note on the bias of information estimates. Inf. Theory Psychol. Probl. Methods 1955, 11-B, 95–100.
  10. Harris, B. The Statistical Estimation of Entropy in the Non-Parametric Case; Wisconsin University—Madison Mathematics Research Center: Madison, WI, USA, 1975.
  11. Antos, A.; Kontoyiannis, I. Convergence properties of functional estimates for discrete distributions. Random Struct. Algorithms 2001, 19, 163–193.
  12. Paninski, L. Estimation of entropy and mutual information. Neural Comput. 2003, 15, 1191–1253.
  13. Silva, J.F. Shannon entropy estimation in ∞-alphabets from convergence results: Studying plug-in estimators. Entropy 2018, 20, 397.
  14. Zhang, Z. Statistical Implications of Turing's Formula; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2017.
  15. Good, I.J. The population frequencies of species and estimation of population parameters. Biometrika 1953, 40, 237–264.
  16. Grabchak, M.; Marcon, G.; Lang, G.; Zhang, Z. The generalized Simpson's entropy is a measure of biodiversity. PLoS ONE 2017, 12, e0173305.
  17. Contreras-Reyes, J.E. Mutual information matrix based on Rényi entropy and application. Nonlinear Dyn. 2022, 110, 623–633.
  18. Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley & Son, Inc.: New York, NY, USA, 2006.
  19. Zhang, Z. Generalized Mutual Information. Stats 2020, 3, 158–165.
  20. Khinchin, A.I. Mathematical Foundations of Information Theory; Dover Publications: New York, NY, USA, 1957.
  21. Amigó, J.M.; Balogh, S.G.; Hernández, S. A Brief Review of Generalized Entropies. Entropy 2018, 20, 813.
  22. Ilić, V.M.; Korbel, J.; Gupta, G.; Scarfone, A.M. An overview of generalized entropic forms. Europhys. Lett. 2021, 133, 50005.
  23. Auerbach, F. Das Gesetz der Bevölkerungskonzentration. Petermann's Geogr. Mitteilungen 1913, 59, 74–76.
  24. Zipf, G.K. Selected Studies of the Principle of Relative Frequency in Language; Harvard University Press: Cambridge, MA, USA; London, UK, 1932.
  25. Zhang, Z. Domains of attraction on countable alphabets. Bernoulli 2018, 24, 873–894.
  26. Molchanov, S.; Zhang, Z.; Zheng, L. Entropic Moments and Domains of Attraction on Countable Alphabets. Math. Meth. Stat. 2018, 27, 60–70.
  27. Krichevsky, R.E.; Trofimov, V.K. The Performance of Universal Encoding. IEEE Trans. Inf. Theory 1981, 27, 199–207.
  28. Holste, D.; Große, I.; Herzel, H. Bayes' estimators of generalized entropies. J. Phys. A Math. Gen. 1998, 31, 2551–2566.
  29. Schürmann, T.; Grassberger, P. Entropy estimation of symbol sequences. Chaos 1996, 6, 414–427.
  30. Nemenman, I.; Shafee, F.; Bialek, W. Entropy and inference, revisited. In Advances in Neural Information Processing Systems; Dietterich, T.G., Becker, S., Ghahramani, Z., Eds.; MIT Press: Cambridge, MA, USA, 2002; Volume 14.
  31. Hausser, J.; Strimmer, K. Entropy inference and the James-Stein estimator, with application to nonlinear gene association networks. J. Mach. Learn. Res. 2009, 10, 1469–1484.
  32. Chao, A.; Shen, T.-J. Non-parametric estimation of Shannon's Index of diversity when there are unseen species in sample. Environ. Ecol. Stat. 2003, 10, 429–443.
  33. Vu, V.Q.; Yu, B.; Kass, R.E. Coverage-adjusted entropy estimation. Stat. Med. 2007, 26, 4039–4060.
  34. Zhang, Z. Entropy estimation in Turing's perspective. Neural Comput. 2012, 24, 1368–1389.
  35. Zhang, Z.; Zhang, X. A normal law for the plug-in estimator of entropy. IEEE Trans. Inf. Theory 2012, 58, 2745–2747.
  36. Zhang, Z. Asymptotic normality of an entropy estimator with exponentially decaying bias. IEEE Trans. Inf. Theory 2013, 59, 504–508.
  37. Chen, C.; Grabchak, M.; Stewart, A.; Zhang, J.; Zhang, Z. Normal Laws for Two Entropy Estimators on Infinite Alphabets. Entropy 2018, 20, 371.
  38. Grabchak, M.; Zhang, Z. Asymptotic Normality for Plug-in Estimators of Diversity Indices on Countable Alphabet. J. Nonparametric Stat. 2018, 30, 774–795.
Table 1. Confidence Levels of Simple Binary Classifier.
$\mathbf{p}^{\downarrow}$      n = 3     n = 5     n = 7
(1/2, 1/2)    0.5000    0.5000    0.5000
(2/3, 1/3)    0.7407    0.7901    0.8267
(3/4, 1/4)    0.8438    0.8965    0.9294
(4/5, 1/5)    0.8960    0.9421    0.9667
(5/6, 1/6)    0.9259    0.9645    0.9824
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
