Several Basic Elements of Entropic Statistics

Inspired by developments in modern data science, a shift is increasingly visible in the foundation of statistical inference, away from a real space, where random variables reside, toward a nonmetrized and nonordinal alphabet, where more general random elements reside. While statistical inferences based on random variables are theoretically well supported by the rich literature of probability and statistics, inferences on alphabets, mostly by way of various entropies and their estimation, are less systematically supported in theory. Without the familiar notions of neighborhood, real or complex moments, tails, et cetera, associated with random variables, probability and statistics based on random elements on alphabets need more attention to foster a sound framework for the rigorous development of entropy-based statistical exercises. In this article, several basic elements of entropic statistics are introduced and discussed, including notions of general entropies, entropic sample spaces, entropic distributions, entropic statistics, entropic multinomial distributions, entropic moments, and entropic basis, among other entropic objects. In particular, an entropic-moment-generating function is defined, and it is shown to uniquely characterize the underlying distribution from an entropic perspective and, hence, all entropies. An entropic version of the Glivenko-Cantelli convergence theorem is also established.


Introduction and Summary
Let X = {ℓ_k; k ≥ 1} be a countable alphabet and let p = {p_k; k ≥ 1} be a probability distribution on X. Let P be the collection of all probability distributions on X. Let p↓ = {p_(k); k ≥ 1} be the nonincreasingly rearranged p, that is, p_(k) ≥ p_(k+1) for every k ≥ 1. Let P↓ be the collection of all possible p↓. It follows that P↓ ⊂ P is an aggregated version of P in the sense that P is partitioned and each part is represented by a p↓ ∈ P↓. Across a wide spectrum of scientific investigation, a random system is often described as a probability distribution on a countable alphabet, {X, p}; however, many complex system properties of interest, such as those studied in information theory and statistical mechanics, are often described by functions of p↓, for example, the Shannon entropy H = −∑_{k≥1} p_k ln p_k as in [1], the members of the Rényi entropy family, H_α = (1 − α)^{−1} ln(∑_{k≥1} p_k^α), where α ∈ (0, 1) ∪ (1, ∞), as in [2], and the members of the Tsallis entropy family, S_α = (α − 1)^{−1}(1 − ∑_{k≥1} p_k^α), where α ∈ (−∞, 1) ∪ (1, ∞), as in [3]. Other similar functions come under the names of diversity indices, for example, the Gini-Simpson index, 1 − ∑_{k≥1} p_k², as in [4], the generalized Simpson's indices, ∑_{k≥1} p_k^u (1 − p_k)^v, where u ≥ 1 and v ≥ 0 are integers, as described in [5], Hill's diversity numbers, (∑_{k≥1} p_k^α)^{1/(1−α)}, where α ∈ (0, 1) ∪ (1, ∞), as in [6], Emlen's index, ∑_{k≥1} p_k e^{−p_k}, as in [7], and the richness index, K = ∑_{k≥1} 1[p_k > 0], where 1[·] is the indicator function. While each of the abovementioned functions has its unique significance in its respective field of study, they share one characteristic in common: they are all functions of p↓. The word entropy has ancient Greek roots, en and tropē, that is, inward and change, respectively, in English, or internal change collectively. As such, it is a label-independent concept. For generality and conciseness of the presentation in this article, let the following definition be adopted.
Definition 1. Let f (p) be a function defined for every p ∈ P. The function f (p) is referred to as an entropy if f (p) depends on p only through p ↓ , that is, f (p) = f (p ↓ ).
By Definition 1, all entropies and diversity indices mentioned above are indeed entropies. In addition, p_(1), or more generally p_(k) for any positive integer k, is an entropy, and therefore p↓ is an array of entropies. One important property to be noted about entropies is that p↓ is independent of the labels of the alphabet, {ℓ_k; k ≥ 1}. Another fact to be noted is that all entropies are uniquely determined by p↓. For clarity of terminology throughout this article, let it be noted that any properties of the underlying random system that are described by one or more entropies are referred to as entropic properties. Furthermore, p is referred to as the underlying probability distribution, or simply the distribution, of a random system, and p↓ is referred to as the entropic distribution associated with p. It is also to be noted that p↓ = {p_(k); k ≥ 1} is not a probability distribution in the usual sense since it is not associated with any specific probability experiment. It is merely an array of nonincreasingly ordered positive parameters that sum to one.
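Definition 1 can be checked numerically with a short sketch (Python is used here purely for illustration, and the distribution values are arbitrary): an entropy such as the Shannon entropy returns the same value under any relabeling of the alphabet, since it depends on p only through p↓.

```python
import math

def shannon_entropy(p):
    """Shannon entropy H = -sum_k p_k ln p_k, computed from a list of
    probabilities; letters with probability zero contribute nothing."""
    return -sum(pk * math.log(pk) for pk in p if pk > 0)

# The same multiset of probabilities under two different labelings:
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
# H depends on p only through its sorted rearrangement p-down, so the two agree.
assert abs(shannon_entropy(p) - shannon_entropy(q)) < 1e-12
```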
Let {X_1, · · · , X_n}, drawn from X according to p, be a random sample of size n. The sample may be summarized into Y = {Y_k; k ≥ 1}, where Y_k is the observed frequency of the letter ℓ_k, or into p̂ = {p̂_k = Y_k/n; k ≥ 1}. Let Y↓ = {Y_(k); k ≥ 1} and p̂↓ = {p̂_(k) = Y_(k)/n; k ≥ 1} be the nonincreasingly rearranged Y and p̂, respectively, where Y_(k) ≥ Y_(k+1) and p̂_(k) ≥ p̂_(k+1) for every k. Under the assumption that the study interest in the underlying random system lies only with the properties described by indices of the form f(p↓), that is, entropies by Definition 1, there are two conceptual perspectives associated with statistical inference. The first is a framework of estimating f(p) based on p̂, and the second is one of estimating f(p) = f(p↓) based on p̂↓. For lack of better terms, let the first framework be referred to as classical statistics and the second framework as entropic statistics. These two frameworks are not equivalent and, in particular, the entropic framework has its own special and useful implications.
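The reduction of a sample to the entropic statistics Y↓ and p̂↓ may be sketched as follows (Python, with hypothetical letters and a toy sample); note that the summary is label-independent.

```python
from collections import Counter

def entropic_summary(sample):
    """Reduce a sample to its entropic statistics: the nonincreasingly
    ordered frequencies Y-down and relative frequencies p-hat-down."""
    n = len(sample)
    y_down = sorted(Counter(sample).values(), reverse=True)
    return y_down, [y / n for y in y_down]

# Relabeling the letters changes Y and p-hat but not their sorted versions:
y1, phat1 = entropic_summary(["a", "b", "a", "c", "a"])
y2, phat2 = entropic_summary(["x", "y", "x", "z", "x"])
assert y1 == y2 == [3, 1, 1]
assert phat1 == phat2 == [0.6, 0.2, 0.2]
```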
The literature on statistical estimation of entropies, mostly in the specific form of the Shannon entropy, begins with the early works, as in [8][9][10], and expands in width and depth in works by, for example, [11][12][13]. Many other worthy references on entropy estimation may be found in the literature review in [14]. The general entropies of Definition 1, however, allow a discussion of the foundational elements of statistics in an entropic perspective, or entropic statistics, in a broader sense. This article focuses on three basic issues.
First, a notion of entropic sample space is introduced in Section 2 below. An entropic sample space is an aggregated sample space that registers not a single data point but an ensemble of data points. It is a sample space of the entropic statistics, Y↓ or p̂↓, and hence is label-independent. The said label-independence in turn allows an entropic sample space to accommodate statistical sampling from a population that is not necessarily prescribed, that is, the labels of the alphabet X need not be completely specified a priori. This property of an entropic sample space gives new meaning to statistical learning and lends foundational support for statistical exploration into an unknown, or partially known, universe.
Second, an entropic characteristic function, φ(t) = ∑_{k≥1} p_(k)^t for t ≥ 1, is introduced. It is obvious that φ(t) is an entropy by Definition 1 and that it always exists. It is established in Section 3 that φ(t) in an arbitrarily small neighborhood of any interior point of [1, ∞) uniquely determines p↓ ∈ P↓ and vice versa. Therefore, it immediately follows that any and all entropic properties of a random system, including statistical inferences, may be approached by way of φ(t).
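A minimal sketch of φ(t) (Python, with an arbitrary illustrative distribution):

```python
def phi(p, t):
    """Entropic characteristic function phi(t) = sum_k p_(k)^t for t >= 1.
    The sum is unchanged by sorting, so phi is an entropy by Definition 1."""
    return sum(pk ** t for pk in p)

p = [0.5, 0.3, 0.2]
assert abs(phi(p, 1) - 1.0) < 1e-12           # phi(1) = 1 for every p
assert abs(phi(p, 2) - 0.38) < 1e-12          # phi(2) = sum p_k^2, the Simpson index
assert phi(sorted(p, reverse=True), 2.5) == phi(p, 2.5)  # label-independent
```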
Third, it is established in Section 4 that the entropic statistics converge almost surely and uniformly to the underlying entropic distribution, that is, p̂↓ → p↓ almost surely and uniformly, for any p↓ ∈ P↓. In light of the entropic sample space and an entropic characterization of the associated entropic sampling distribution, the Glivenko-Cantelli-like convergence theorem provides fundamental theoretical support for exercises in entropic statistics.
The article ends with an appendix where a lengthy proof is found.

Entropic Sample Spaces
Consider the experiment of randomly drawing n = 1 marble from urn 1, which contains marbles of K = 3 known and distinguishable colors, so that the sample space is Ω_1 = {ℓ_1, ℓ_2, ℓ_3}, and the point mass probability measure µ(·) assigns p_1 to ℓ_1, p_2 to ℓ_2, and p_3 to ℓ_3. Let X denote the random outcome of the experiment. The following model of probability distribution, P(X = ℓ_k) = p_k for k = 1, 2, 3, (1) or in a different form p = {p_1, p_2, p_3} on X = Ω_1 = {ℓ_1, ℓ_2, ℓ_3}, is well defined with three parameters, p_1, p_2, and p_3, subject to the constraints 0 ≤ p_k ≤ 1 for each k and ∑_{k=1}^3 p_k = 1. The result of drawing n = 1 marble from the urn may also be represented by a triplet of random variables, Y = {Y_1, Y_2, Y_3}, where Y_k = 1[X = ℓ_k] for k = 1, 2, 3. (2) If Y is used to represent the outcome of the experiment, the sample space may be denoted as Ω_1 = {{1, 0, 0}, {0, 1, 0}, {0, 0, 1}} with corresponding probability distribution P(Y = {1, 0, 0}) = p_1, P(Y = {0, 1, 0}) = p_2, and P(Y = {0, 0, 1}) = p_3. For clarity in terminology, X is referred to as a random element, while Y is a set of random variables. In general, random results of an experiment that are represented by numerical values are referred to as random variables, and those represented by non-numerical symbols are random elements.
For a given experiment, the sample space may be chosen at different levels of resolution depending on the experimenter's interest in the study. Suppose the experimenter is to randomly draw n = 3 marbles from urn 1 with replacement in sequence, resulting in X = {X_1, X_2, X_3}, where X_i, i = 1, 2, 3, is the color of the ith marble drawn in the sequence. The sample space associated with X may be represented by Ω_s = {{x_1, x_2, x_3} : x_i ∈ {ℓ_1, ℓ_2, ℓ_3}, i = 1, 2, 3}, (3) where the subscript "s" stands for sequential. There are 27 distinct elements in (3). In this case, the sample space may also be expressed as Ω_s = {ℓ_1, ℓ_2, ℓ_3}^3. This sample space may be adopted if the order of the n = 3 observations is observable and of interest. Suppose instead that the order of the observations is not observable or not of interest. Then the relevant information in the sample may be summarized into Y = {Y_1, Y_2, Y_3}, where Y_k is the observed frequency of color ℓ_k, with sample space Ω_m = {{y_1, y_2, y_3} : 0 ≤ y_k ≤ 3 for each k and ∑_{k=1}^3 y_k = 3}, (4) where the subscript "m" stands for multinomial. There are 10 distinct elements in (4). In fact, Y = {Y_1, Y_2, Y_3} is the usual multinomial random vector with K = 3 categories and category probabilities p_1, p_2, and p_3. The two sample spaces, Ω_s and Ω_m, serve different statistical interests in various situations.
A lower-resolution sample space may always be adopted whenever a higher-resolution sample space may, but not vice versa. For example, if the order of the draws is not observable, then only Ω_m is appropriate since Ω_m is an aggregated form of Ω_s and is hence of lower resolution; however, Ω_m may be further reduced in resolution. Let Y↓ = {Y_(1), Y_(2), Y_(3)}, (5) where Y_(1), Y_(2), Y_(3) are the nonincreasingly ordered observed frequencies of the three colors. The sample space associated with Y↓ is Ω_e = {{y_(1), y_(2), y_(3)} : y_(1) ≥ y_(2) ≥ y_(3) ≥ 0 and ∑_{k=1}^3 y_(k) = 3}, (6) where the subscript "e" stands for entropic. Ω_e is in turn an aggregated form of Ω_m and hence of lower resolution still. Noting that Ω_e is label-independent, it is an example of an entropic sample space.
It is easily verified that the probability distribution of Y↓ may be expressed in terms of p↓ as follows:
P(Y↓ = {3, 0, 0}) = ∑_{k=1}^3 p_(k)^3, P(Y↓ = {2, 1, 0}) = 3 ∑_{i≠j} p_(i)^2 p_(j), P(Y↓ = {1, 1, 1}) = 6 p_(1) p_(2) p_(3). (7)
Let it be noted that all the probabilities in (7) are label-independent, and therefore they are entropies by Definition 1.
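The aggregation from the sequential sample space down to Ω_m and then to Ω_e, together with the accompanying probabilities in (7), can be verified by brute-force enumeration; the sketch below uses hypothetical values for p_1, p_2, and p_3.

```python
from collections import Counter
from itertools import product

# Hypothetical color probabilities for urn 1 (any values summing to one work).
p = {1: 0.5, 2: 0.3, 3: 0.2}

# Aggregate the 27 sequential outcomes down to frequency vectors (Omega_m),
# and those down to sorted frequency vectors (Omega_e), carrying the
# probabilities along at each reduction of resolution.
P_m, P_e = Counter(), Counter()
for seq in product(p, repeat=3):                       # a sequential outcome
    prob = 1.0
    for x in seq:
        prob *= p[x]
    y = tuple(Counter(seq).get(k, 0) for k in p)       # a point of Omega_m
    P_m[y] += prob
    P_e[tuple(sorted(y, reverse=True))] += prob        # a point of Omega_e

assert len(P_m) == 10                 # the 10 elements of Omega_m
assert len(P_e) == 3                  # Omega_e = {(3,0,0), (2,1,0), (1,1,1)}
assert abs(sum(P_e.values()) - 1.0) < 1e-12
```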
In the case of sampling n = 3 marbles from urn 1 in sequence, a subscription to the entropic sample space, Ω e , is by choice since both Ω m and Ω e are available. There are situations when the subscription to an entropic sample space may be by necessity.
Consider the experiment of randomly drawing n = 3 marbles in sequence from urn 2, which contains marbles of K = 3 unknown but distinguishable colors. In this case, the sample spaces, Ω_1 of (1) and Ω_m of (4), are not well defined due to the lack of knowledge of the color labels. However, the entropic sample space, Ω_e, is available for subscription regardless of what the colors are, known or unknown, as long as they are distinguishable.
In general, consider drawing a random sample of size n from X = {ℓ_k; k ≥ 1} under p = {p_k; k ≥ 1} in sequence. The sequential sample space is of the form Ω_s = X^n. The aggregated sample space,
Ω_m = {{y_k; k ≥ 1} : 0 ≤ y_k ≤ n for every k ≥ 1 and ∑_{k≥1} y_k = n}, (8)
is that of the multinomial array, Y = {Y_k; k ≥ 1}, with probability mass function
P({y_k; k ≥ 1}) = (n!/∏_{k≥1} y_k!) ∏_{k≥1} p_k^{y_k}, (9)
where 0 ≤ y_k ≤ n for every k ≥ 1 and ∑_{k≥1} y_k = n. Moreover, Ω_m may be further aggregated into a sample space, Ω_e, for Y↓ = {Y_(k); k ≥ 1}, that is, Ω_e = {{y_(k); k ≥ 1} : y_(k) ≥ 0 and y_(k) ≥ y_(k+1) for every k ≥ 1, and ∑_{k≥1} y_(k) = n}. (10) Let Ω_e of (10) be referred to as the entropic sample space. The associated probability distribution is
P({y_(k); k ≥ 1}) = ∑* P({y_k; k ≥ 1}), (11)
where ∑* is the summation of (9) over all {y_k; k ≥ 1}s in Ω_m sharing the same given {y_(k); k ≥ 1}. Given a y↓ = {y_(k); k ≥ 1}, (11) is an entropy. This may be seen in two steps. First, let K̂ = ∑_{k≥1} 1[y_(k) ≥ 1] be the number of distinct letters of X represented in a sample of size n, and let z = {z_1, · · · , z_K̂} be the set of K̂ positive integer values of y↓. K̂ is a positive finite integer. Let the cardinality of X be denoted as K = ∑_{k≥1} 1[p_k > 0]. K ≥ 1 may be finite or countably infinite. Consider an array a(y↓) = {a_k(y↓); k ≥ 1} of length K whose entries are a particular allocation of the K̂ values of z_j, j = 1, . . . , K̂, with the other K − K̂ values of a(y↓) being zeros. Let A(y↓) be the complete collection of all such distinct a(y↓)s. Then it is clear that y↓ uniquely implies A(y↓).

Entropic Objects
Let the adjective "entropic" be used to describe objects that are label-independent. Several such objects are defined or summarized below.
• The elements of p↓ = {p_(k); k ≥ 1} are the entropic parameters, as compared to the elements of p = {p_k; k ≥ 1}, which are multinomial parameters.
• The elements of Y↓ = {Y_(k); k ≥ 1} (or p̂↓ = {p̂_(k); k ≥ 1}) are the entropic statistics, as compared to the elements of Y = {Y_k; k ≥ 1} (or p̂ = {p̂_k; k ≥ 1}), which are multinomial statistics.
• Ω_e of (10) is the entropic (multinomial) sample space, as compared to Ω_m of (8), which is the multinomial sample space.
• The distribution P({y_(k); k ≥ 1}) of (11) or (12) is the entropic probability distribution, while P({y_k; k ≥ 1}) of (9) is the multinomial probability distribution.
• Entropic statistics is the collection of statistical methodologies that help to make inferences on the characteristics of a random system exclusively via entropies.
In addition, there are several other useful entropic objects. First, letting ζ_v = ∑_{k≥1} p_k (1 − p_k)^v for each integer v ≥ 1, the array ζ = {ζ_v; v ≥ 1} is referred to as the entropic basis. The name comes from the fact that, for any well-behaved function h(p) for p ∈ [0, 1], an entropy of the form H = ∑_{k≥1} p_k h(p_k) may be expressed as a linear combination H = ∑_{v≥1} w(v) ζ_v. For example, the Shannon entropy, provided that it is finite, may be written as H = ∑_{v≥1} ζ_v / v. The entropic basis is useful because it unfolds many entropies into simple and linearly additive forms.
Second, letting η_u = ∑_{k≥1} p_k^u for every positive integer u, the array η = {η_u; u ≥ 1} is often referred to as the entropic moments. The elements of both ζ and η have good estimators. A detailed discussion may be found in [14].
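Both arrays are straightforward to compute. The sketch below assumes, following [14], that the basis elements take the form ζ_v = ∑_{k≥1} p_k (1 − p_k)^v and that the Shannon weights are w(v) = 1/v; the illustrative distribution is arbitrary.

```python
import math

def zeta(p, v):
    """Entropic basis element; the form zeta_v = sum_k p_k (1 - p_k)^v
    is assumed here following [14]."""
    return sum(pk * (1 - pk) ** v for pk in p)

def eta(p, u):
    """Entropic moment eta_u = sum_k p_k^u."""
    return sum(pk ** u for pk in p)

p = [0.5, 0.3, 0.2]
# Shannon entropy unfolded over the basis: H = sum_{v>=1} zeta_v / v;
# the series is truncated since (1 - p_k)^v decays geometrically.
H = -sum(pk * math.log(pk) for pk in p)
H_series = sum(zeta(p, v) / v for v in range(1, 400))
assert abs(H - H_series) < 1e-9
assert abs(eta(p, 1) - 1.0) < 1e-12
```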

Definition 2.
Let X be a random element on a countable alphabet X = {ℓ_k; k ≥ 1} with a corresponding probability distribution p ∈ P and its associated entropic distribution p↓ ∈ P↓. The function φ(t) = ∑_{k≥1} p_(k)^t, t ∈ [0, ∞), (13) is referred to as the entropic-moment-generating function of X, of p, or of p↓. The two complementary parts of its domain, [1, ∞) and [0, 1), are, respectively, referred to as the primary domain and the secondary domain of the entropic-moment-generating function.
Depending on context, φ(t) may be denoted as φ_X(t), φ_p(t), or φ_{p↓}(t) whenever appropriate. Obviously, φ(t) is uniformly bounded above by one for all p ∈ P in the primary domain but is not necessarily finitely defined in the secondary domain. However, in the case of a finite alphabet, that is, K = ∑_{k≥1} 1[p_k > 0] < ∞, φ(t) is finitely defined for each and every t ∈ R, in particular for t ≥ 0. The characteristic utility of φ(t) is further explored in Section 3 below.

Examples of Entropic Statistics
Example 1. Consider the Bernoulli experiment of tossing a coin, where P(h) = p and P(t) = 1 − p. The question of whether the coin is fair may be formulated in the usual classical sense, that is, whether p = 0.5. The question may be approached by estimating p based on a sample proportion, p̂, if it is observable which trials lead to "h" and which lead to "t". The question may alternatively be formulated by an equivalent entropic statement, for example, whether H = p(1 − p) = 0.25. More generally, if K = ∑_{k≥1} 1[p_k > 0] is finite and known, then the uniformity of p on X may be formulated entropically by, for example, whether H = ∑_{k≥1} p_k² = 1/K. The validity of these entropic statements may then be gauged statistically.

Example 2.
Consider a two-stage sampling scheme: a random sample of size n, {X_1, · · · , X_n}, is taken first, and then a single extra observation, X_{n+1}, is taken. The sample of size n may be summarized into letter frequencies, Y = {Y_k; k ≥ 1}. Let π_0 = ∑_{k≥1} p_k 1[Y_k = 0]. Clearly, π_0 is label-independent and therefore an entropic random variable. Given the sample of size n, π_0 may be thought of as the probability that X_{n+1} assumes a letter in X that is not represented in the sample of size n. In some contexts, π_0 may be thought of as the probability of new discovery. Let N_1 = ∑_{k≥1} 1[Y_k = 1] and T_n = N_1/n. T_n is commonly known as Turing's formula, introduced in [15] but credited largely to Alan Turing. It is to be noted that N_1 is label-independent and, therefore, so is T_n. T_n is a good estimator of π_0, and a discussion of many of its statistical properties may be found in [14].
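Turing's formula is immediate to compute from the frequency counts; a short sketch with a toy sample:

```python
from collections import Counter

def turing_formula(sample):
    """Turing's formula T_n = N_1 / n, where N_1 is the number of letters
    appearing exactly once in the sample; it estimates pi_0, the
    probability that the next observation is a previously unseen letter."""
    n1 = sum(1 for y in Counter(sample).values() if y == 1)
    return n1 / len(sample)

sample = ["a", "a", "b", "c", "c", "d"]   # singletons: "b" and "d"
assert abs(turing_formula(sample) - 2 / 6) < 1e-12
```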
Example 3. In developing a decision tree classifier, the data space is partitioned into an ensemble of small subspaces, in each of which a local classification rule is sought. The central spirit of every local classification may be described by a two-step scheme.

1. First, a random sample of size n, {X_1, · · · , X_n}, is taken from X under p.
2. The data-based local classification rule is as follows: the next observation, X_{n+1}, is predicted to be the letter that is observed most frequently in the sample of size n.
For simplicity, let it be assumed that p_(1) > p_(2) and that the letter with the maximum sample frequency is unique (if not, some randomization may be employed).
Obviously, the designated letter based on a sample is not necessarily the letter corresponding to the maximum of the p_k s. In such a setup, the performance of the tree classifier may be gauged by evaluating (calculating or estimating) the probability of the event that the designated letter is the same letter of X with probability p_(1), that is, P(the letter with the maximum observed frequency is the letter with probability p_(1)). (14) Note that the event in (14) is label-independent and hence the probability is an entropy, which may be estimated. The probability in (14) may reasonably be called the confidence level of the simple classifier.
For illustration purposes, consider the special case of a binary X, with n = 2m + 1 for some positive integer m. For simplicity, n is chosen to be odd here so that Y_(1) > Y_(2) always holds true. Suppose that p_1 = p_(1) > p_(2) = 1 − p_(1). The event that a classifier based on the sample of size n correctly identifies the letter of maximum probability may be equivalently expressed as Y_1 ≥ m + 1.
The probability of such an event, (14), is
P(Y_1 ≥ m + 1) = ∑_{j=m+1}^{2m+1} C(2m+1, j) p_(1)^j (1 − p_(1))^{2m+1−j}, (15)
which depends on p only through p_(1) and is, therefore, an entropy. More specifically, (15) is computed for several combinations of n and p↓, and the resulting values are tabulated in Table 1. Table 1, and tables like it, may be used in two different ways. First, given a fixed p↓, such a table indicates how large a sample is needed to assure a desired reliability level of the classifier. On the other hand, at a given level of n and a particular p↓, the classifier may be evaluated by the probabilities in the table. In practice, p↓ is unknown but may be estimated.
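In the binary case, the probability that Y_1 ≥ m + 1 is a binomial tail and may be computed directly; a sketch with illustrative values of n and p_(1):

```python
from math import comb

def confidence(n, p1):
    """P(Y_1 >= m + 1) for a binary alphabet with n = 2m + 1 and
    p_(1) = p1 > 1/2: the probability that the majority letter in the
    sample is the letter of maximum probability."""
    m = (n - 1) // 2
    return sum(comb(n, j) * p1**j * (1 - p1) ** (n - j)
               for j in range(m + 1, n + 1))

assert abs(confidence(3, 0.7) - 0.784) < 1e-9    # 3(0.7^2)(0.3) + 0.7^3
assert confidence(11, 0.7) > confidence(3, 0.7)  # confidence grows with n
```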

Entropic Characterization
Entropic statistics focuses on making inferences via entropies; it is therefore of interest to find a function that characterizes p↓ ∈ P↓. Since the function φ(t) in its primary domain and η = {η_u; u ≥ 1}, where η_u = ∑_{k≥1} p_k^u for positive integers u, imply each other (see Lemma 1 below), it follows immediately that (13) uniquely determines all entropies. However, the following theorem claims that the characteristic property of the entropic-moment-generating function, φ(t), remains intact in any arbitrarily small neighborhood of any t ∈ (1, ∞).
Theorem 1. Let p and q be two probability distributions in P, and let p↓ and q↓ be the respective corresponding entropic distributions of p and q. Then p↓ = q↓ if and only if φ_p(t) = φ_q(t) for all t ∈ (a, b), where a and b are any two real numbers such that 1 ≤ a < b < ∞.
Lemma 1. Let p = {p_k; k ≥ 1} and q = {q_k; k ≥ 1} be two probability distributions in P with two corresponding associated entropic distributions p↓ and q↓ in P↓. Then p↓ = q↓ if and only if ∑_{k≥1} p_k^n = ∑_{k≥1} q_k^n for all positive integers n.
A proof of Lemma 1 may be found on pages 50 and 51 in [14]. To prove Theorem 1, it suffices to show that φ(t) in an arbitrarily small neighborhood of any interior point of [1, ∞) determines the function globally.
Proof of Theorem 1. If p ↓ = q ↓ , then it immediately follows that φ p (t) = φ q (t) for all t ∈ [1, ∞) and, therefore, for t ∈ (a, b) specifically. To prove the theorem, it suffices to show the converse.
Consider the series φ(z) = ∑_{k≥1} p_k^z, where z ∈ C is a complex variable. Denote the real and the imaginary parts of a complex value z by Re(z) and Im(z), respectively. Let D = {z : Re(z) > 1} be the subset of C such that the real part of z is greater than 1. For every z ∈ D, since p_k^{Re(z)−1} ≤ 1 and |p_k^{i Im(z)}| = 1 for every k, where | · | denotes the modulus, it follows that |φ(z)| ≤ ∑_{k≥1} |p_k^z| = ∑_{k≥1} p_k^{Re(z)} ≤ ∑_{k≥1} p_k = 1. Letting α_k = ln(1/p_k), p_k^z = e^{−α_k z}, and the functions e^{−α_k z}, k ≥ 1, are analytic on C.
The characterization of p↓ in Theorem 1 may be equivalently stated using only a countably infinite subset of (a, b).

Corollary 1.
Let p and q be two probability distributions on the same countable alphabet, X. Let p↓ and q↓ be the corresponding entropic distributions of p and q, respectively. Then p↓ = q↓ if and only if φ_p(t) = φ_q(t) on any infinite sequence of distinct values, {t_n; n ≥ 1}, such that lim_{n→∞} t_n = c ∈ (1, ∞).
Proof. Both φ_p(t) and φ_q(t) are analytic at t = c, and therefore h(t) = φ_p(t) − φ_q(t) is analytic at t = c. Let it first be shown that all derivatives of h(t) at t = c are zero, that is, h^(m)(c) = 0 for m ≥ 0. Note first that h(c) = h^(0)(c) = 0 by the fact that both φ_p(t) and φ_q(t) are continuous and φ_p(t_n) = φ_q(t_n) for all n. Suppose, for contradiction, that m ≥ 1 is the smallest integer such that h^(m)(c) ≠ 0. Then there exists an interval (c − ε, c + ε) such that h(t) ≠ 0 for every t ∈ (c − ε, c + ε) with t ≠ c. However, there is at least one t_n ≠ c in (c − ε, c + ε) such that h(t_n) = 0 by assumption. This is a contradiction, and therefore h^(m)(c) = 0 for all m ≥ 1. Since h(t) is analytic at t = c with all derivatives vanishing there, h(t) = 0 in a neighborhood of c, that is, φ_p(t) = φ_q(t) on an open interval in (1, ∞), and the corollary follows from Theorem 1.

Corollary 2.
Let p and q be two probability distributions on the same countable alphabet, X. Let p↓ and q↓ be the corresponding entropic distributions of p and q, respectively. Then p↓ = q↓ if and only if φ_p(t) = φ_q(t) on any infinite sequence of distinct values, {t_n; n ≥ 1} ⊂ (a, b), where 1 ≤ a < b < ∞.
Proof. Noting that the infinitely many t_n s lie in a bounded interval, there exists an infinite subsequence of {t_n; n ≥ 1} that converges to a constant c ∈ [a, b]. The corollary then follows from Corollary 1.
Consider a pair of random elements, (X, Y), on a countable joint alphabet, X × Y, with a joint probability distribution p_{X,Y} = {p_{i,j}; i ≥ 1, j ≥ 1}. Let p_X = {p_{i,·}; i ≥ 1} and p_Y = {p_{·,j}; j ≥ 1}, where p_{i,·} = ∑_{j≥1} p_{i,j} and p_{·,j} = ∑_{i≥1} p_{i,j}, be the two marginal probability distributions of X and Y, respectively.

Corollary 3. X and Y are independent if and only if
φ_{p_{X,Y}}(t) = φ_{p_X}(t) φ_{p_Y}(t) (18)
for all t ∈ (a, b), where a and b are two arbitrary real numbers such that 1 ≤ a < b < ∞.
Proof. If X and Y are independent, then (18) follows immediately. Conversely, suppose that (18) holds. Consider another pair of independent random elements, (U, V), on the same countable joint alphabet X × Y and with marginal distributions identical to those of (X, Y), that is, p_X and p_Y. It then follows, by (18) and Theorem 2, that p_{U,V} = p_{X,Y}, which in turn implies that X and Y are independent.
Corollary 3 provides a characterization of independence on a general countable joint alphabet, and its utility may be explored further.
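The "only if" direction of Corollary 3 amounts to an algebraic factorization of φ, which can be checked numerically; the sketch below assumes condition (18) is the factorization φ_{p_{X,Y}}(t) = φ_{p_X}(t) φ_{p_Y}(t) and uses hypothetical marginals.

```python
def phi(p, t):
    """phi(t) = sum of p^t taken over any collection of probabilities."""
    return sum(x ** t for x in p)

# Hypothetical marginals; the joint of an independent pair is the product.
p_x = [0.6, 0.4]
p_y = [0.5, 0.3, 0.2]
p_joint = [pi * pj for pi in p_x for pj in p_y]

# Under independence, sum_{i,j} (p_i p_j)^t = (sum_i p_i^t)(sum_j p_j^t).
for t in (1.5, 2.0, 3.0):
    assert abs(phi(p_joint, t) - phi(p_x, t) * phi(p_y, t)) < 1e-12
```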

A Basic Convergence Theorem
From an entropic perspective, the convergence of p̂↓ to p↓, to be distinguished from that of p̂ to p, is of fundamental interest.
For clarity of presentation in this section, let it be noted that, whenever necessary, the subindex n may be added to Y, Y_k, p̂, p̂_k, p̂↓, and p̂_(k) to highlight the dynamic nature of these previously defined quantities as n changes, that is, Y = Y_n, Y_k = Y_{k,n}, p̂ = p̂_n, p̂_k = p̂_{k,n}, p̂↓ = p̂↓,n, and p̂_(k) = p̂_(k),n, respectively.
The main result established in this section is the uniform almost-sure convergence of p̂↓ to p↓, which is made precise in Theorem 2 below.
Consider the experiment of repeatedly and independently drawing a letter from X under p, resulting in a sequence of randomly selected letters, ω = {x 1 , x 2 , · · · }. Let the collection of all possible such sequences or paths be denoted Ω. A sample of size n is a partial sequence of the first n randomly selected letters in an ω, {x 1 , · · · , x n }.
Let p↓ = {p_(k); k ≥ 1} and p̂↓ = {p̂_(k); k ≥ 1} be defined as above. It is to be specifically noted that the rearrangement of the observed relative frequencies, p̂↓, is performed solely on the basis of the observed values of p̂_k for all k ≥ 1, with no regard to the arrangement of the probabilities, p↓ = {p_(k); k ≥ 1}. Consequently, the letter at which the relative frequency p̂_(k) is observed is not necessarily the same letter with which the probability p_(k) is associated. This is, in fact, the essence of the entropic perspective.

Theorem 2.
For any p ∈ P, let p↓, p̂, and p̂↓ be defined as above. Then sup_{k≥1} |p̂_(k) − p_(k)| → 0 almost surely. (19)
A proof of Theorem 2 requires Lemmas 2 and 3 below.
Lemma 2. For any p ∈ P, sup_{k≥1} |p̂_k − p_k| → 0 almost surely. (20)
Proof. By the strong law of large numbers, there exists a collection of paths, Ω* ⊆ Ω, with P(Ω*) = 1 such that, for each and every path ω ∈ Ω* and every k, lim_{n→∞} |p̂_k − p_k| = 0. Note the fact that |p̂_k − p_k| ≤ p̂_k + p_k and, therefore, ∑_{k≥1} |p̂_k − p_k| ≤ ∑_{k≥1} (p̂_k + p_k) = 2; by the bounded convergence theorem, ∑_{k≥1} |p̂_k − p_k| → 0 almost surely. (21) By (21), the lemma follows from the fact that sup_{k≥1} |p̂_k − p_k| ≤ ∑_{k≥1} |p̂_k − p_k|.
Lemma 2 may be viewed as a version of the Glivenko-Cantelli theorem on countable alphabets with respect to observed data from a classical multinomial experiment. The uniformity of the convergence in (20) is of essential importance in the proof of Theorem 2, which is given below by way of Lemma 3.
Lemma 3. For every path in Ω*, lim_{n→∞} p̂_(k),n = p_(k) for every k ≥ 1.
A proof of Lemma 3 is given in Appendix A. Let it be noted that Ω is the sample space of a perpetual multinomial iid sampling scheme on X under a probability distribution p ∈ P. Each path in Ω may be represented by {p̂_n; n ≥ 1}, where p̂_n = {p̂_{k,n}; k ≥ 1}. For each such path {p̂_n; n ≥ 1} ∈ Ω, there exists a corresponding path {p̂↓,n; n ≥ 1}, which is {p̂_n; n ≥ 1} rearranged over all k for every n. Let the total collection of all rearranged paths of Ω be denoted as Ω↓, and let the collection of all rearranged paths of Ω* be denoted as Ω*↓. It follows that P(Ω*↓) = P(Ω*) = 1. Lemma 3 states that, in each path of Ω*↓, the kth component of p̂↓,n converges to the kth component of p↓, namely, p_(k), for each k.
Theorem 2 may be viewed as a version of the Glivenko-Cantelli theorem on countable alphabets with respect to observed data from an entropic multinomial experiment. Theorem 2 immediately implies almost sure convergence for estimators of several key quantities in classification procedures.
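The uniform convergence asserted by Theorem 2 is easy to observe in simulation; a sketch with a hypothetical four-letter distribution, in which the sorted sample frequencies are compared with the sorted probabilities with labels disregarded:

```python
import random
from collections import Counter

def sup_deviation(p_down, sample):
    """sup_k |p-hat_(k) - p_(k)|: the nonincreasingly sorted sample
    frequencies are compared with the sorted probabilities, labels ignored."""
    n = len(sample)
    phat_down = sorted((y / n for y in Counter(sample).values()), reverse=True)
    phat_down += [0.0] * (len(p_down) - len(phat_down))
    return max(abs(ph - pk) for ph, pk in zip(phat_down, p_down))

random.seed(0)
p_down = [0.4, 0.3, 0.2, 0.1]
sample = random.choices(range(4), weights=p_down, k=100_000)
# With n = 100,000 the uniform deviation is already very small.
assert sup_deviation(p_down, sample) < 0.01
```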
Example 5. Suppose that p_(1) > p_(2), that is, there exists a unique letter in X, denoted ℓ_0, associated with probability p_(1). Then the probability of a correct classification, that is, of the event that ℓ_0 is the letter attaining the maximum observed frequency, converges almost surely to one. This is so because, for any path in Ω*↓ and any ε < (p_(1) − p_(2))/2, there exists an N such that, for any n > N, |p̂_(1) − p_(1)| < ε and |p̂_(2) − p_(2)| < ε, which together imply that the letter attaining p̂_(1) is ℓ_0. The results of Examples 4 and 5 lend fundamental support for classification algorithms based on maximum observed frequency, used widely in exercises of modern data science, for example, decision trees, as mentioned in Example 3.
Many entropies of interest across a wide spectrum of studies are of the additive form H(p↓) = ∑_{k≥1} g(p_(k)) h(p_(k)), where g(p) ≥ 0 and h(p) ≥ 0 are functions of p ∈ I = [0, 1]. The almost-sure convergence of Theorem 2 may be passed on to the plug-in estimators of some such entropies by way of a rather trivial statement, Proposition 1, which requires that g and h be suitably bounded on I and that p ∈ P be such that H(p↓) is finitely defined.
Example 6 implies that the plug-in estimator of H(p↓) = ∑_{k≥1} p_(k)^s, where s ≥ 1, converges almost surely, which in turn implies that the plug-in estimators of members of the Rényi entropy family and the Tsallis entropy family converge almost surely for all p↓ ∈ P↓ without qualification when α ≥ 1. However, it is not known whether the plug-in estimators of the members of the families with α ∈ (0, 1) converge almost surely for p↓ ∈ P↓ without other qualification (also, see [17]).
It is not known whether the plug-in estimator of the Shannon entropy converges almost surely when p ↓ ∈ P ↓ without further qualification.
The Shannon entropy has utilities across a wide spectrum of scientific investigations (see [18]). However, it is not finitely defined for all distributions in P. A family of generalized Shannon entropies, for any p↓ ∈ P↓, is proposed in [19] as (24), where m ≥ 1 is an integer. The Shannon entropy is the special family member corresponding to m = 1. It may be verified that each member of the family, except the Shannon entropy itself, is finitely defined for all p ∈ P and offers all important utilities that the Shannon entropy offers, including the fact that the mutual information derived based on each member with m ≥ 2 is zero if and only if the two underlying random elements are independent.
Example 8. The plug-in estimator of (24) converges almost surely for any p↓ ∈ P↓ whenever m ≥ 2. To see this, let it first be noted that the plug-in estimator of −∑_{k≥1} p_k^m ln p_k converges almost surely. This fact follows from Proposition 1 with g(p) = p and h(p) = −p^{m−1} ln p, which is uniformly bounded above on I = [0, 1]. The claimed almost-sure convergence then follows from a re-expression of (24) as a combination of four such series, together with the fact that the plug-in estimator of each of the four series converges almost surely.

Conclusions and Discussion
This article introduces a perspective termed entropic statistics. One of the motivations of the perspective is to accommodate probability experiments on sample spaces which may include outcomes that are known to exist (and therefore are prescribed) and those whose existence is not known (and therefore not prescribable). Such a framework allows statistical exploration into a general population with possibly infinitely many previously unobserved and unknown outcomes, or new discoveries. The key concept fostering such a framework is label-independence, that is, all parameters and statistics do not depend on the labels of an alphabet as long as the letters are distinguishable. Consequently, an array of label-independent objects is defined in this article and termed entropic objects. In particular, general entropies, entropic parameters, entropic statistics, entropic sample spaces, entropic probability distributions, and an entropic-moment-generating function are defined.
Based on the defined entropic objects, two basic theorems are established. Theorem 1 provides a characterization of the entropic probability distribution on the alphabet via the entropic-moment-generating function, and Theorem 2 establishes the almost-sure convergence of the entropic statistics to the entropic parameters and, hence, provides foundational support to the entropic framework.
On the other hand, this article merely provides a few basic results in entropic statistics. On a broader spectrum, many other issues may be fruitfully considered on at least three fronts, namely, fundamental, probabilistic, and statistical. To begin with, the fundamental question of what constitutes entropy may be explored in many directions. One of the most cited sets of axioms is that discussed by Khinchin [20], under which the Shannon entropy is proved to be unique. However, under slightly less restrictive axioms, many other entropies exist and enjoy almost all the desirable utilities of the Shannon entropy; for example, see [19]. The existing literature on generalizations of entropy is extensive in physics and information theory; for example, see [21,22]. The collective effort to better understand what entropy is and how it may help to describe an underlying random system is ongoing. Further research in understanding generalized entropies and their implications could greatly enrich the framework of entropic statistics.
Entropy in general is often thought of as a summary, however it is measured numerically, of the profile state of inner energy or chaos within a random system. As such, it is independent of any labeling system, regardless of whether the state is observable. A key conceptual shift introduced in this article is from statistical inference on p (or a function of p) based on the multinomial frequencies Y to inference on p ↓ (or a function of p ↓ ) based on the entropic frequencies Y ↓ . Such a framework shift, by necessity or by choice, triggers a long array of basic probability and statistics questions under different degrees of model restriction, ranging from parametric forms, p k = p(k, θ) for some parameter θ, to the nonparametric form, {p (k) ; k ≥ 1}. It may be interesting to note that even the nonparametric form splits into several qualitatively different cases: that of a known K = ∑ k≥1 1 [p (k) >0] < ∞, that of an unknown K = ∑ k≥1 1 [p (k) >0] < ∞, and that of K = ∑ k≥1 1 [p (k) >0] = ∞. Each of these model classes could imply a very different stochastic behavior of Y ↓ as the sample size n increases. Long before the notion of information entropy was coined by Shannon in [1], the behavior of Y ↓ had been discussed in the literature by, for example, Auerbach [23] and Zipf [24]. More recently, several articles [25,26] discussed domains of attraction in the total collection of all distributions on a countable alphabet via a tail index, τ n = n ∑ k≥1 p (k) (1 − p (k) ) n . Each domain characterizes the decay rate of the tail of the underlying entropic distribution and, in turn, dictates the rates of convergence of various statistical estimators of various entropies. Further advances on that front would enhance the understanding of the probabilistic behavior of entropic statistics and, in turn, of the estimated entropies of interest.
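The tail index above, τ n = n ∑ k≥1 p (k) (1 − p (k) ) n , can be computed directly for any given entropic distribution. A minimal sketch for a geometric distribution, truncated for computation (the truncation level and names are choices of this illustration, not part of the cited results):

```python
def tail_index(p_sorted, n):
    """Tail index tau_n = n * sum_k p_(k) * (1 - p_(k))^n."""
    return n * sum(q * (1.0 - q) ** n for q in p_sorted)

# Geometric entropic distribution p_(k) = (1/2)^k, truncated so the
# omitted tail mass is negligible at these sample sizes.
geometric = [0.5 ** k for k in range(1, 80)]
for n in (10, 100, 1000, 10000):
    print(n, tail_index(geometric, n))
```

The printed values trace how τ n evolves with the sample size n for this particular tail decay rate.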
In terms of statistical estimation, a large proportion of the existing literature focuses on the Shannon entropy and variations of the plug-in estimators under various conditions, most of which are described and referenced in [14]. There are also nonplug-in estimators of different types, for example, the Bayes estimators [27][28][29], the hierarchical Bayes estimators [30], the James-Stein estimators [31], the coverage-adjusted estimators [32][33][34], and an unbiased estimator based on sequential data proposed by Montgomery-Smith and Schürmann. In general, the asymptotic distributions of the plug-in estimators and their variants have been studied and described to some extent; for example, see [12,[35][36][37][38]. However, it is fair to say that many, if not most, of the proposed estimators of various types have not yet been given asymptotic distributions. Any advances in that direction could much benefit applications of these estimators.
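For reference, the plug-in estimator discussed above simply evaluates the Shannon entropy at the observed relative frequencies. A minimal sketch (the sampling distribution is an arbitrary example chosen for illustration):

```python
import math
import random
from collections import Counter

def plugin_shannon(sample):
    """Plug-in estimator: the Shannon entropy of the relative frequencies."""
    n = len(sample)
    return -sum((c / n) * math.log(c / n) for c in Counter(sample).values())

p = [0.4, 0.3, 0.2, 0.1]
true_h = -sum(q * math.log(q) for q in p)
random.seed(0)
sample = random.choices(range(len(p)), weights=p, k=100_000)
# The plug-in estimate is consistent but biased downward in finite samples.
print(true_h, plugin_shannon(sample))
```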
In short, the landscape of entropic statistics is quite porous in comparison to that of the richly supported classical statistics. Many basic and important questions are yet to be answered, from the axiomatic foundation, to the definitions of basic elements, to the supporting theoretical architecture, to relevance in applications. However, that same porosity also offers opportunities for interesting contemplation.

Conflicts of Interest:
The author declares no conflict of interest.
For notational simplicity in all cases, let it be assumed without loss of generality that p = {p k ; k ≥ 1} is nonincreasingly arranged to begin with, that is, p k ≥ p k+1 for every k. With this assumption, the only rearranged object is p̂ ↓ = {p̂ (k) ; k ≥ 1} with p̂ (k) ≥ p̂ (k+1) for every k.
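The rearranged object p̂ ↓ — the relative frequencies arranged nonincreasingly — can be sketched in Python (the function name is mine):

```python
from collections import Counter

def empirical_sorted(sample):
    """Relative frequencies rearranged nonincreasingly, i.e., the
    empirical entropic distribution p-hat-downarrow."""
    n = len(sample)
    return sorted((c / n for c in Counter(sample).values()), reverse=True)

# Label-independent: samples differing only by a relabeling of the
# alphabet yield the same rearranged frequencies.
assert empirical_sorted(list("aabbc")) == empirical_sorted(list("xxyyz"))
```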
In Case 1, the statement of (22) is trivial. In Case 2, let p 0 = 1 and p K+1 = 0. It follows that For each sequence ω ∈ Ω * as defined in the proof of Lemma 2, the uniformity of (20) implies that for any ε > 0, there exists an N such that for all n > N, −ε < p̂ k − p k < ε for all k ≥ 1. Specifically, let There exists an N such that for all n > N, max{|p̂ k − p k |; 1 ≤ k ≤ K} < ε 0 , which has the following two implications.
In Case 3, it is allowed that several consecutive probabilities in p = {p k ; 1 ≤ k ≤ K}, where K = ∑ k≥1 1[p k > 0] < ∞, are identical. It follows that Noting that p = {p k ; k ≥ 1} is a finite sequence of runs of identical values, collecting the first value in each run and retaining its index value, a subset of {p k ; k ≥ 1} is obtained, namely, {p k i ; i = 1, . . . , I}, where I is the number of distinct values in p. Let r i be the multiplicity of p k i in p, i = 1, . . . , I. It follows that For each sequence ω ∈ Ω * as defined in the proof of Lemma 2, the uniformity of (20) implies that for any ε > 0, there exists an N such that for all n > N, −ε < p̂ k − p k < ε for all k, k = 1, . . . , K. Specifically, let where p k 0 = p 0 = 1. There exists an N 1 such that for all n > N 1 , max{|p̂ k − p k |; k ≥ 1} < ε 1 , which has the following implications. 1.
For every given k, and therefore an implied i, there are exactly r i relative frequencies among {p̂ k ; 1 ≤ k ≤ K} found in p k i ± ε 1 . It then follows that, for each given k, and hence p̂ (k) → p (k) = p k . Finally, (22) follows from the fact that P(Ω * ) = 1.
In Case 4, p k > 0 for all k ≥ 1 and all probabilities are distinct. Letting p 0 = 1 and For every fixed k′ such that p k′ ∈ (0, 1), let m ≥ 1 be an integer such that Such an m exists for any given p with an infinite K and a fixed k′ ≥ 1.
There exists an N 2 such that for all n > N 2 , max{|p̂ k − p k |; k ≥ 1} < ε 2 , which implies the following.
On the other hand, noting the strict inequality in (A3) and the fact that k′ is a fixed integer, there exists a sufficiently small ε 3 such that Let ε 4 = min{ε 2 , ε 3 }. By Lemma 2, there exists an N 4 such that for all n > N 4 , for all k, k = 1, · · · , m, and that the updated (A5) and (A6) hold, namely, or, equivalently, That is, in each of the disjoint intervals of (A7), there is at least one relative frequency. In particular, p̂ k is covered in (p k − ε 4 , p k + ε 4 ) for each k, k = 1, . . . , k′, k′ < m, by (A4).
Next it is necessary to show that there may not be more than one relative frequency in (p k − ε 4 , p k + ε 4 ) for each k, k = 1, . . . , k′. Toward that end, consider the total mass of 100% distributed among p̂ k , k ≥ 1, given n. From interval (p 1 − ε 4 , p 1 + ε 4 ) to interval (p m − ε 4 , p m + ε 4 ), the total collective mass covered is at least ∑ m k=1 p̂ k ; however, by (A7) and (A8), and the remainder of the mass is Regardless of whether the mass, 1 − ∑ m k=1 p̂ k , on the left side of (A9) is allocated to one letter or to more than one letter, other than those in {ℓ 1 , · · · , ℓ m }, the corresponding p̂ k , k ≥ m + 1, could not possibly be sufficiently large to exceed p k′+1 + ε 4 , nor, therefore, p k′ − ε 4 . That implies that, along the path of that selected ω ∈ Ω * , for any n > N 4 , p̂ k , and p̂ k alone, is covered in (p k − ε 4 , p k + ε 4 ) for each k, k = 1, . . . , k′. This immediately implies that p̂ k = p̂ (k) for all k, k = 1, . . . , k′, and in particular p̂ k′ = p̂ (k′) . Then p̂ (k′) → p k′ since p̂ k′ → p k′ . Finally, (22) follows from the fact that P(Ω * ) = 1.
(A10) p = {p k ; k ≥ 1} has a special pattern: its maximum value runs r 1 times; then its second largest value runs r 2 times, and so on. In general, its i th largest value runs r i times, followed by a run of its (i + 1) st largest value. Collect the first value in each run and record its index, k i , i ≥ 1, resulting in a strictly decreasing subsequence, {p k i ; i ≥ 1}. Letting k 0 = 0 and k ∞ = ∞, Consequently, p = {p k ; k ≥ 1} may be viewed as a sequence containing p k i for i ≥ 1, with r i − 1 copies of p k i between p k i and p k i+1 .
Given a value of k, say k′, there is an i′ such that p k′ = p k i′ , and k′ must be one of the values from the list {k i′ , k i′ + 1, · · · , k i′ + r i′ − 1}, noting p k i′ +r i′ = p k i′+1 < p k′ . Let m be such that 1 − ∑ m i=1 r i p k i < p k i′+1 , and (A11) Such an m exists for any given p and a fixed k′ ≥ 1, which fixes an i′. For each sequence ω ∈ Ω * as defined in the proof of Lemma 2, the uniformity of (20) implies that for any ε > 0, there exists an N such that for all n > N, −ε < p̂ k − p k < ε for all k, k ≥ 1. Specifically let ε 5 = min{(p k i − p k i+1 )/2; i = 0, . . . , m} > 0.
There exists an N 5 such that for all n > N 5 , max{|p̂ k − p k |; k ≥ 1} < ε 5 , which has the following two implications.

2.
The relative frequencies corresponding to {p 1 , · · · , p ∑ m i=1 r i }, namely, {p̂ 1 , · · · , p̂ ∑ m i=1 r i }, are also covered, respectively, by the same disjoint intervals, p k i ± ε 5 , i = 1, . . . , i′. On the other hand, noting the strict inequality in (A11) and the fact that k′ is a fixed integer, there exists a sufficiently small ε 6 such that Let ε 7 = min{ε 5 , ε 6 }. By Lemma 2, there exists an N 7 such that for all n > N 7 , all relative frequencies sharing the same p k i , namely, p̂ k i , p̂ k i +1 , · · · , p̂ k i +r i −1 , are found in (p k i − ε 7 , p k i + ε 7 ) (A15) for all i, i = 1, · · · , m, and the updated (A13) and (A14) are or, equivalently, That is, in each of the disjoint intervals of (A15), there are at least r i relative frequencies. In particular, the r i relative frequencies, {p̂ k i , p̂ k i +1 , · · · , p̂ k i +r i −1 }, are covered in (p k i − ε 7 , p k i + ε 7 ) for each i, i = 1, . . . , i′ ≤ m, by (A12).
Next it is necessary to show that there may not be more than r i relative frequencies in (p k i − ε 7 , p k i + ε 7 ) for each i, i = 1, . . . , i′. Toward that end, consider the total mass of 100% distributed among p̂ k , k ≥ 1, given n. From interval (p 1 − ε 7 , p 1 + ε 7 ) to interval (p ∑ m i=1 r i − ε 7 , p ∑ m i=1 r i + ε 7 ), the total collective mass covered is at least ∑ m i=1 r i p̂ k i ; however, by (A15) and (A16), and the remainder of the mass is 1 − ∑ m k=1 p̂ k < p̂ k i +1 < p k i +1 + ε 7 . (A17) Regardless of whether the mass on the left side of (A17) is allocated to one letter or to more than one letter, other than those in {ℓ 1 , · · · , ℓ ∑ m i=1 r i }, the corresponding p̂ k , k ≥ ∑ m i=1 r i + 1, could not possibly be sufficiently large to exceed p k i +1 + ε 7 , nor, therefore, p k′ − ε 7 . That implies that, along the path of that selected ω ∈ Ω * , for any n > N 7 , {p̂ k i , p̂ k i +1 , · · · , p̂ k i +r i −1 }, and only {p̂ k i , p̂ k i +1 , · · · , p̂ k i +r i −1 }, are covered in (p k i − ε 7 , p k i + ε 7 ) for each i, i = 1, . . . , i′. This immediately implies that 1. {p̂ k i , p̂ k i +1 , · · · , p̂ k i +r i −1 } = {p̂ (k i ) , p̂ (k i +1) , · · · , p̂ (k i +r i −1) }, but not necessarily equal component-wise; 2.