Entropic Statistics: Concept, Estimation, and Application in Machine Learning and Knowledge Extraction

The demands for machine learning and knowledge extraction methods have been booming due to the unprecedented surge in data volume and data quality. Nevertheless, challenges arise amid the emerging data complexity, as significant chunks of information and knowledge lie within the non-ordinal realm of data. To address these challenges, researchers have developed considerable machine learning and knowledge extraction methods for various domain-specific challenges. To characterize and extract information from non-ordinal data, all the developed methods point to the subject of information theory, established following Shannon's landmark paper in 1948. This article reviews recent developments in entropic statistics, including estimation of Shannon's entropy and its functionals (such as mutual information and Kullback-Leibler divergence), the concepts of entropic basis and generalized Shannon's entropy (and its functionals), and their estimation and potential applications in machine learning and knowledge extraction. With knowledge of recent developments in entropic statistics, researchers can customize existing machine learning and knowledge extraction methods for better performance or develop new approaches to address emerging domain-specific challenges.


Introduction to Entropic Statistics
Entropic statistics is a collection of statistical procedures that characterize information from non-ordinal spaces with Shannon's entropy and its generalized functionals. Such procedures include, but are not limited to, statistical methods involving Shannon's entropy (entropy) and Mutual Information (MI) [1], Kullback-Leibler divergence (KL) [2], the entropic basis and diversity indices [3,4], and Generalized Shannon's Entropy (GSE) and Generalized Mutual Information (GMI) [5]. The field of entropic statistics is at the intersection of information theory and statistics. Entropic statistics quantities are also referred to as information-theoretic quantities [6,7].
There are two general data types: ordinal and non-ordinal (nominal). Ordinal data are data with an inherent numerical scale. For example, {52 F, 50 F, 49 F, 53 F}, a set of daily high temperatures at Nuuk, Greenland, is ordinal. Ordinal data are generated from random variables (which map outcomes from a sample space to the real numbers). For ordinal data, classical concepts, such as moments (mean, variance, covariance, etc.) and characteristic functions, are powerful tools to induce various statistical methods, including but not limited to regression analysis [8] and analysis of variance (ANOVA) [9].
Non-ordinal data are data without an inherent numerical scale. For example, {androgen receptor, clock circadian regulator, epidermal growth factor, Werner syndrome RecQ helicase-like}, a subset of human gene names, is a set of data without an inherent numerical scale. Non-ordinal data are generated from random elements (which map outcomes from a sample space to an alphabet). Due to the absence of an inherent numerical scale, the concept of a random variable does not apply. Consequently, statistical concepts requiring an ordinal scale (e.g., mean, variance, covariance, and characteristic functions) no longer exist. For example, consider the mentioned data of human gene names: what is the mean or variance of the data? Such questions cannot be answered because the concepts of mean and variance do not exist. In practice, however, researchers need to measure the level of dependence in the non-ordinal joint space between gene types and genetic phenotypes to study gene functionality. One would use covariance and the methods it generates for ordinal data, but the concept of covariance does not exist in such a non-ordinal space. Furthermore, all well-established statistical methods that require an ordinal scale (e.g., regression and ANOVA) can no longer be directly applied.
Non-ordinal data have several variant names, such as categorical data, qualitative data, and nominal data. A common situation is a dataset mixed with ordinal and non-ordinal data. On such a dataset, a common practice is to introduce coded (dummy) variables [10]. However, introducing dummy variables is equivalent to separating the mixed dataset according to the classes of the non-ordinal variables to induce multiple purely ordinal subsets, and then utilizing ordinal methods (such as regression analysis) case by case on the induced subsets, as illustrated below. Unfortunately, this approach can be impractical because of the curse of dimensionality, particularly when there are too many categorical variables or when some categorical variable has too many categories (classes).
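As a minimal illustration of the dummy-variable practice and why it scales poorly, consider the following Python sketch; the data frame and column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "temperature": [52, 50, 49, 53],              # ordinal column
    "gene": ["AR", "CLOCK", "EGFR", "WRN"],       # non-ordinal column
})

# One dummy column is created per category. With k categorical variables of
# c categories each, the induced design has on the order of c^k cells to fit
# case by case, which is the curse of dimensionality mentioned above.
coded = pd.get_dummies(df, columns=["gene"])
print(coded.columns.tolist())
```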
With the challenges from non-ordinal data, entropic statistics methods focus on the underlying probability distribution instead of the associated labels. As a result, all entropic statistics quantities are location (permutation) invariant. The main strengths of entropic statistics lie within non-ordinal alphabets, or a mixed data space in which a significant bulk of information lies within the non-ordinal sub-space. For ordinal spaces, although ordinal variables can be binned into categorical variables, the strengths of entropic statistics are generally incapable of overcoming the loss of ordinal information during discretization. Therefore, ordinal statistical methods are preferred whenever they meet the needs. In summary, potential scenarios for entropic statistics are:

1. The data lie within a non-ordinal space.

2. The data are a mixture of ordinal and non-ordinal spaces, and the non-ordinal space is expected to carry a non-negligible bulk of information.

3. The data lie within an ordinal space, yet the performance of ordinal statistical methods fails to meet expectations.
The following notations are used throughout the article. They are listed here for convenience.

1. Let X and Y be two random elements on countable alphabets X = {l_r ; r ≥ 1} and Y, with cardinalities K_1 and K_2, respectively.

2. Let the Cartesian product X × Y be with a joint probability distribution p_XY = {p_{i,j}}.

3. Let the two marginal distributions be respectively denoted by p_X = {p_{i,·}} and p_Y = {p_{·,j}}.

4. For uni-variate situations, K stands for K_1, and p stands for p_X.

5. Let {X_1, X_2, ..., X_n} be an independent and identically distributed (i.i.d.) random sample of size n from X. Let C_r = ∑_{i=1}^{n} 1[X_i = l_r]; hence C_r is the count of occurrences of letter l_r in the sample. Let p̂ = {p̂_1, p̂_2, ...} = {C_1/n, C_2/n, ...}; p̂ is called the plug-in estimator of p. Similarly, one can construct the plug-in estimators for p_Y and p_XY and name them p̂_Y and p̂_XY, respectively.

6. For any two functions f and g taking values in (0, ∞) with lim_{n→∞} f(n) = lim_{n→∞} g(n) = 0, the notation f(n) = O(g(n)) means that f(n)/g(n) remains bounded as n → ∞.
Many concepts discussed in the following sections have continuous counterparts under the same concept name. The results reviewed in this article focus on non-ordinal data spaces; therefore, some notable results on ordinal spaces are not reviewed (for example, [11][12][13]). In Section 2, estimation of some classic entropic statistics quantities is discussed. Section 3 reviews estimation results and properties for some recently developed information-theoretic quantities. Entropic statistics' application potentials in machine learning (ML) and knowledge extraction are discussed in Section 4. Finally, some remarks are given in Section 5.

Classic Entropic Statistics Quantities and Estimation
This section reviews three classic entropic concepts and their estimation: Shannon's entropy (Section 2.1.1), mutual information (Section 2.1.2), and Kullback-Leibler divergence (Section 2.2). These three concepts are among the earliest entropic concepts and have been intensively studied over the past decades. An enormous number of statistical methods and computational algorithms have been designed based on these three concepts [14][15][16]. Nevertheless, most of those methods and algorithms use naive plug-in estimation, which could be improved for a smaller estimation bias and better performance. For this reason, this section reviews several notable estimation methods as a reference. Some asymptotic properties are also presented; they provide a theoretical guarantee for the corresponding estimators in statistical procedures such as hypothesis testing and confidence intervals.

Shannon's Entropy
Established by Shannon in his landmark paper [1], the concept of entropy is the first and still the most important building brick in characterizing information from non-ordinal spaces. Many of the established information-theoretic quantities are linear functions of entropy. Shannon's entropy, H, is defined as

H = −∑_{i≥1} p_i ln p_i.

Some remarkable properties of entropy are:

Property 1 (Shannon's Entropy).

1. H is a measurement of dispersion. It is always non-negative by definition.

2. H = 0 if and only if the probability of a letter l in X is 1; hence no dispersion.

3. For a finite alphabet with cardinality K, H is bounded from above by ln K, and the maximum is achieved when the distribution is uniform (p_i = 1/K, i = 1, 2, ..., K); hence maximum dispersion.

4. For a countably infinite alphabet, H may not exist (see Example 4 in Section 3).

Entropy Estimation-The Plug-in Estimator
Estimation of entropy has been a core research topic for decades. Due to the curse of high dimensionality and the discrete, non-ordinal nature of the data, entropy estimation is a technically difficult problem, and advances in this area have been slow to come. The plug-in estimator of entropy (also known as the empirical entropy estimator), Ĥ, defined as

Ĥ = −∑_{i≥1} p̂_i ln p̂_i,

is inarguably the most naive entropy estimator. Ĥ has been studied thoroughly in recent decades. Ref. [17] provided the asymptotic properties for Ĥ when K is finite, namely,

Theorem 1 (Asymptotic property of Ĥ when K is finite). Provided that σ² = ∑_{i=1}^{K} p_i ln² p_i − H² > 0, √n (Ĥ − H) converges in distribution to N(0, σ²).

Ref. [18] derived the bias of Ĥ for finite K, namely E(Ĥ) − H = −(K − 1)/(2n) + O(1/n²). Ref. [19] derived the asymptotic properties for Ĥ when K is countably infinite, namely,

Theorem 2 (Asymptotic property of Ĥ when K is countably infinite). For any nonuniform distribution, if there exists an integer-valued function K(n) satisfying certain rate conditions (stated in [19]) as n → ∞, then Ĥ is asymptotically normal.

As discussed in [19], the conditions with K(n) hold if p_i ∼ 1/(i² ln² i); the conditions do not hold if p_i ∼ 1/(i² ln i).

Entropy Estimation-The Miller-Madow and Jackknife Estimators
Ĥ_MM [20] and Ĥ_JK [21] are two notable entropy estimators with bias adjustments. Namely,

Ĥ_MM = Ĥ + (K̂ − 1)/(2n),

where K̂ is the observed sample cardinality. For finite K, the bias of Ĥ_MM is of order O(1/n²). Ĥ_JK is calculated in three steps:
1. for each i ∈ {1, 2, ..., n}, construct Ĥ^(i), which is a plug-in estimator based on the sub-sample of size n − 1 obtained by leaving the i-th observation out;
2. compute the average Ĥ^(·) = (1/n) ∑_{i=1}^{n} Ĥ^(i);
3. compute the jackknife estimator

Ĥ_JK = nĤ − (n − 1)Ĥ^(·).   (3)

Equivalently, (3) can be written as Ĥ_JK = Ĥ + (n − 1)(Ĥ − Ĥ^(·)). When K < ∞, it can be shown that the bias of Ĥ_JK is also of order O(1/n²). Asymptotic properties for Ĥ_MM and Ĥ_JK were derived in [22]. Ĥ_MM and Ĥ_JK reduce the rate of bias to a higher-order power decay. Ref. [23] proved that the convergence of Ĥ could be arbitrarily slow. Ref. [24] proved that for finite K, an unbiased estimator of entropy does not exist. As a result, it is only possible to reduce the bias to a smaller extent.
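For concreteness, below is a minimal Python sketch of the three estimators discussed so far, operating on a vector of category counts; the function names are illustrative, and the leave-one-out estimates are grouped by category for efficiency.

```python
import numpy as np

def entropy_plugin(counts):
    """Plug-in (empirical) entropy: H-hat = -sum_i p-hat_i ln p-hat_i."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

def entropy_miller_madow(counts):
    """Miller-Madow: add (K-hat - 1)/(2n), K-hat = observed sample cardinality."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    k_hat = np.count_nonzero(counts)
    return entropy_plugin(counts) + (k_hat - 1.0) / (2.0 * n)

def entropy_jackknife(counts):
    """Jackknife per (3): H_JK = n * H-hat - (n - 1) * mean leave-one-out entropy."""
    counts = np.asarray(counts, dtype=float)
    n = int(counts.sum())
    h_loo_mean = 0.0
    for r, c in enumerate(counts):
        if c == 0:
            continue
        loo = counts.copy()
        loo[r] -= 1                      # remove one observation of category r
        # c of the n leave-one-out samples produce this configuration
        h_loo_mean += (c / n) * entropy_plugin(loo)
    return n * entropy_plugin(counts) - (n - 1) * h_loo_mean

counts = np.array([10, 5, 3, 1, 1])      # hypothetical category counts
for f in (entropy_plugin, entropy_miller_madow, entropy_jackknife):
    print(f.__name__, f(counts))
```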

Entropy Estimation-The Z-Estimator
Recent studies on entropy estimation have reduced the bias to exponentially decaying. For example,

Ĥ_z = ∑_{v=1}^{n−1} (1/v) z_{1,v},

where z_{1,v} is the UMVUE of ζ_{1,v} introduced in Section 3.2, is the entropy estimator provided in [25] with an exponentially decaying bias (interested readers may refer to [26] for discussion on an entropy estimator that is algebraically equivalent to Ĥ_z). Ref. [27] derived the asymptotic properties for Ĥ_z under finite K, establishing its asymptotic normality. The following asymptotic properties for Ĥ_z when K is countably infinite were provided in [28].

Theorem 4 (Asymptotic property of Ĥ_z when K is countably infinite). For a nonuniform distribution {p_i; i ≥ 1} ∈ P satisfying ∑_i p_i ln² p_i < ∞, if there exists an integer-valued function K(n) satisfying certain rate conditions (stated in [28]) as n → ∞, then Ĥ_z is asymptotically normal.

The sufficient condition given in Theorem 4 for the normality of Ĥ_z is slightly more restrictive than that of the plug-in estimator Ĥ as stated in Theorem 2, and consequently supports a smaller class of distributions. The sufficient conditions of Theorem 4 still hold for p_i = C_λ i^{−λ} where λ > 2, but not for p_i = C/(i² ln² i), which satisfies the sufficient conditions of Theorem 2. However, it is discussed in [28] that simulation results indicate the asymptotic normality of Ĥ_z in Theorem 4 may still hold for p_i = C/(i² ln² i), i ≥ 1, though this case is not covered by the sufficient condition.
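A sketch of Ĥ_z in Python follows, written as the weighted sum of the entropic-basis estimators z_{1,v} of Section 3.2 with the product form computed recursively; this rearrangement is my own and should be checked against the expression in [25] before serious use.

```python
import numpy as np

def entropy_z(counts):
    """H_z = sum_{v=1}^{n-1} (1/v) z_{1,v}, where z_{1,v} is the UMVUE of
    zeta_{1,v} = sum_k p_k (1 - p_k)^v (see Section 3.2)."""
    counts = np.asarray(counts, dtype=float)
    counts = counts[counts > 0]
    n = int(counts.sum())
    term = counts / n                     # running per-category product
    h = 0.0
    for v in range(1, n):
        # z_{1,v} = sum_k (C_k/n) * prod_{j=1}^{v} (n - C_k - j + 1)/(n - j)
        term = term * (n - counts - v + 1) / (n - v)
        h += term.sum() / v
    return h

print(entropy_z(np.array([10, 5, 3, 1, 1])))
```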

Remarks
Another perspective on entropy estimation is to combine Ĥ_z and Ĥ_JK. Namely, one could use Ĥ_z in place of each Ĥ^(i) in (3). Interested readers may refer to [25], where a single-layer combination of Ĥ_z and Ĥ_JK was discussed. In addition, ref. [29] presented a non-parametric entropy estimator (Ĥ_chao) for samples with unseen species. Ĥ_chao has a smaller sample root mean squared error than Ĥ_MM and a smaller bias than Ĥ_JK, according to the simulation study. Unfortunately, the bias decaying rate for Ĥ_chao was not theoretically established. Based on the simulation study of Ĥ_chao, it seems that the bias decaying rate is O(1/n²), which is slower than that of Ĥ_z. Asymptotic properties of Ĥ_chao have not been developed in the literature.
There are several parametric entropy estimators for specific interests, for example, the Dirichlet prior Bayesian estimator of entropy [30,31] and the shrinkage estimator of entropy [32]. This review article focuses on results from non-parametric estimation methods. To conclude this section, a small-scale comparison between Ĥ and Ĥ_z from [33] is provided in Table 1.

Mutual Information
In the same paper defining Shannon's entropy, the concept of Mutual Information (MI) was also described [1]. Shannon's entropies for X, Y, and X × Y are defined as

H(X) = −∑_i p_{i,·} ln p_{i,·},  H(Y) = −∑_j p_{·,j} ln p_{·,j},  H(X, Y) = −∑_{i,j} p_{i,j} ln p_{i,j},

and MI between X and Y is defined as

MI(X, Y) = H(X) + H(Y) − H(X, Y).

Some notable properties of MI are:

Property 2 (Mutual Information).

1. MI is a measurement of dependence. It is always non-negative by definition.

2. MI = 0 if and only if X and Y are independent.

3. MI > 0 if and only if X and Y are dependent.

4. A non-zero MI does not always indicate the degree (level) of dependence.

5. MI may not exist when the cardinality of the joint space is countably infinite.

MI Estimation-The Plug-in Estimator and Z-Estimator
Since MI is a function of entropy, estimation of MI is essentially entropy estimation. Let {(X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)} be an i.i.d. random sample of size n from the joint alphabet X × Y. Based on the sample, plug-in estimators of the component entropies of MI can be obtained. Namely,

Ĥ(X) = −∑_i p̂_{i,·} ln p̂_{i,·},  Ĥ(Y) = −∑_j p̂_{·,j} ln p̂_{·,j},  Ĥ(X, Y) = −∑_{i,j} p̂_{i,j} ln p̂_{i,j},

where p̂_{i,·} is the plug-in estimator for p_{i,·}, p̂_{·,j} is the plug-in estimator for p_{·,j}, and p̂_{i,j} is the plug-in estimator for p_{i,j}. Then the plug-in estimator of mutual information between X and Y is defined as

M̂I = Ĥ(X) + Ĥ(Y) − Ĥ(X, Y).

With various entropy estimation methods, one could estimate MI by replacing Ĥ with a different entropy estimator. For example, using the entropy estimator with the fastest bias decaying rate, Ĥ_z, the resulting estimator (M̂I_z) also has a bias with an exponentially decaying rate [34]. The asymptotic properties for the MI estimators (M̂I and M̂I_z) shall be discussed under two situations: (1) MI = 0, and (2) MI > 0.
The first situation, MI = 0, is used for testing independence. For example, in feature selection, non-ordinal features irrelevant to the outcome shall be dropped, and a feature is irrelevant if it is independent of the outcome. Let A be the potentially irrelevant feature and B be the outcome; hence one must test H_0: MI(A, B) = 0 against H_a: MI(A, B) > 0. To test such a hypothesis, one needs the asymptotic properties of the MI estimators under the null hypothesis MI = 0, derived in [35]. Namely,

Theorem 5 (Asymptotic properties of M̂I and M̂I_z when MI = 0). Provided that MI = 0, 2n M̂I converges in distribution to a chi-squared distribution with (K_1 − 1)(K_2 − 1) degrees of freedom, where n is the sample size; the corresponding limiting behavior of M̂I_z is also characterized in [35].

For the second situation, MI > 0 (recall that MI > 0 if and only if the two marginals are dependent), the following asymptotic properties are due to [34]. Let

v̂ = (p̂_1, ..., p̂_{K_1 K_2 − 1})^τ = (p̂_{1,1}, p̂_{1,2}, ..., p̂_{1,K_2}, p̂_{2,1}, p̂_{2,2}, ..., p̂_{2,K_2}, ..., p̂_{K_1,1}, p̂_{K_1,2}, ..., p̂_{K_1,K_2−1})^τ

be the enumeration of the joint probability plug-in estimators (excluding the last one), and let Σ denote the asymptotic covariance matrix of √n (v̂ − v).

Theorem 6 (Asymptotic properties of M̂I and M̂I_z when MI > 0). Provided that MI > 0, √n (M̂I − MI) and √n (M̂I_z − MI) converge in distribution to a normal distribution whose variance is obtained from Σ by the delta method.

The following examples describe a proper use of MI and the properties in Theorems 5 and 6.
Example 1 (Genes TMEM30A and MTCH2; data and descriptions are in Example 1 of [34]). In the example, data were from two different genes in 191 patients. It was calculated in [34] that M̂I_z = 0.0552. The hypothesis test in Example 1 of [35] gave a p-value of 0.0567, which suggests MI = 0 at α = 0.05. However, one shall use the property in Theorem 6 to obtain a confidence interval of MI. One must not use the property in Theorem 5 for the purpose of a confidence interval in this situation (because the asymptotic distribution in Theorem 5 assumes a specific location for MI under the null hypothesis).
Example 2 (Gene probes ENAH_1 and ENAH_2; data and descriptions are in Example 2 of [34]). In the example, data were from different probes of the same gene on 191 patients. It was calculated in [34] that M̂I_z = 0.1157. The hypothesis test in Example 2 of [35] gave a p-value of 0.0012, which suggests MI > 0 at α = 0.05. Furthermore, one shall use the property in Theorem 6 to obtain a confidence interval of MI. One must not use the property in Theorem 5 for the purpose of a confidence interval in this situation.
Example 3 (Comparing the MI between Examples 1 and 2). From Examples 1 and 2, M̂I_z(TMEM30A, MTCH2) = 0.0552 and M̂I_z(ENAH_1, ENAH_2) = 0.1157. Although the second estimated value is higher, one cannot conclude that the level of dependence between ENAH_1 and ENAH_2 is higher than that between TMEM30A and MTCH2, due to the limitation described in the fourth property of Property 2. To compare levels of dependence, one shall refer to the standardized mutual information in Section 3.1.
Recall that MI is always non-negative. For the same reason, M̂I is always non-negative (note that M̂I can be viewed as the MI of the distribution p̂). Nevertheless, M̂I_z can be negative under some scenarios. A negative M̂I_z suggests that the level of dependence between the two random elements is extremely weak. If one uses the results from Theorem 5 to test H_0: MI = 0, a negative M̂I_z would lead to a fail-to-reject decision for most settings of α (the level of significance).
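The plug-in estimator M̂I and the independence test of Theorem 5 are straightforward to compute from a contingency table; the sketch below assumes the chi-squared limit with (K_1 − 1)(K_2 − 1) degrees of freedom as reconstructed above, and the table is hypothetical.

```python
import numpy as np
from scipy.stats import chi2

def entropy_plugin(p):
    """Plug-in entropy of a probability vector (zero cells dropped)."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def mi_plugin(table):
    """MI-hat = H-hat(X) + H-hat(Y) - H-hat(X, Y) from a contingency table."""
    p_xy = np.asarray(table, dtype=float) / np.sum(table)
    h_x = entropy_plugin(p_xy.sum(axis=1))    # marginal p-hat_{i,.}
    h_y = entropy_plugin(p_xy.sum(axis=0))    # marginal p-hat_{.,j}
    return h_x + h_y - entropy_plugin(p_xy.ravel())

table = np.array([[30, 10], [12, 25]])        # hypothetical 2x2 counts
n = table.sum()
stat = 2 * n * mi_plugin(table)               # test statistic under H0: MI = 0
df = (table.shape[0] - 1) * (table.shape[1] - 1)
print("MI-hat =", mi_plugin(table), "p-value =", chi2.sf(stat, df))
```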

Remarks
There is another line of research on multivariate information-theoretic methods: the Partial Information Decomposition (PID) framework [36][37][38]. PID may be viewed as a direct extension of MI to a measure of the information provided by two or more variables about a third. Interesting applications of PID include explaining representation learning in neural networks [39] and feature selection from dependent features [40]. PID aims to characterize redundancy with information decomposition. Another approach to characterizing redundancy is to utilize MI on a joint feature space [33]. Additional research to compare the two approaches is needed.

Kullback-Leibler Divergence
Kullback-Leibler divergence (KL) [2], also known as relative entropy, is a measure of the distance between two probability distributions and an important measure of information in information theory. The notations used to define KL and describe its properties differ slightly from other sections. Let P = {p_k; k = 1, ..., K} and Q = {q_k; k = 1, ..., K} be two discrete probability distributions on the same finite alphabet X = {l_k; k = 1, ..., K}, where K ≥ 2 is a finite integer. KL is defined to be

KL(P‖Q) = ∑_{k=1}^{K} p_k ln (p_k / q_k).

Note that many also use D as the notation for KL, namely, D(P‖Q). KL is not a metric since it does not satisfy the triangle inequality and is not symmetric. Some notable properties of KL are:

Property 3 (Kullback-Leibler Divergence).

1. KL is a measurement of non-metric distance between two distributions on the same alphabet (with the same discrete support). It is always non-negative because of Gibbs' inequality.

2. KL = 0 if and only if the two underlying distributions are the same, namely, p_k = q_k for each k = 1, ..., K.

3. KL > 0 if and only if the two underlying distributions are different, namely, p_k ≠ q_k for some k.
The use of KL has several variants, including but not limited to: (1) P and Q are unknown; (2) Q is known; (3) P and Q are continuous distributions. The second variant is an alternative to the Pearson goodness-of-fit test; interested readers may refer to [41] for more discussion. Although utilizing entropic statistics on continuous spaces is generally not recommended, interested readers may refer to [42,43] for discussions on the third variant.

KL Point Estimation-The Plug-in Estimator, Augmented Estimator, and Z-Estimator
Although KL is not exactly a function of entropy, it carries many similarities with entropy. For that reason, KL estimation is very similar to entropy estimation. For example, KL can be estimated from a plug-in perspective. Let p̂_k be the plug-in estimator of p_k and q̂_k be the plug-in estimator of q_k; then the KL plug-in estimator is

K̂L = ∑_{k=1}^{K} p̂_k ln (p̂_k / q̂_k).

Because K̂L could have an infinite bias [44], an augmented plug-in estimator of KL was presented in [44]:

K̂L* = ∑_{k=1}^{K} p̂_k ln (p̂_k / q̂*_k),

where q̂*_k = q̂_k + 1[q̂_k = 0]/m and m is the sample size of the sample from Q. The bias of K̂L* decays no faster than O(1/n), where n is the sample size of the sample from P [44].
Since K̂L could have an infinite bias, estimating KL in the perspective of Ĥ_MM or Ĥ_JK will not help reduce the bias to a finite extent. In the perspective of Ĥ_z, a KL estimator with exponentially decaying bias, K̂L_z, was offered in [44].
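A minimal sketch of the plug-in and augmented plug-in estimators follows; the augmentation q̂*_k = q̂_k + 1[q̂_k = 0]/m mirrors the reconstruction above and should be checked against [44].

```python
import numpy as np

def kl_plugin(counts_p, counts_q):
    """KL-hat = sum_k p-hat_k ln(p-hat_k / q-hat_k); returns inf whenever a
    category observed under P is unobserved under Q."""
    p = np.asarray(counts_p, dtype=float) / np.sum(counts_p)
    q = np.asarray(counts_q, dtype=float) / np.sum(counts_q)
    mask = p > 0
    with np.errstate(divide="ignore"):
        return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask]))))

def kl_augmented(counts_p, counts_q):
    """Augmented plug-in KL*: zero cells of q-hat are replaced by 1/m so the
    estimate stays finite, m being the sample size from Q."""
    m = np.sum(counts_q)
    p = np.asarray(counts_p, dtype=float) / np.sum(counts_p)
    q_star = np.maximum(np.asarray(counts_q, dtype=float), 1.0) / m
    mask = p > 0
    return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(q_star[mask]))))
```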

Symmetrized KL and Its Point Estimation
As mentioned in the first property of Property 3, KL is generally an asymmetric measurement. For certain interests that require a symmetric measurement, a symmetrized KL is defined to be

S = KL(P‖Q) + KL(Q‖P).

The symmetrized KL, S, as a function of KL, can be similarly estimated in the perspectives of K̂L, K̂L*, and K̂L_z. The respective estimators are

Ŝ = K̂L(P‖Q) + K̂L(Q‖P),  Ŝ* = K̂L*(P‖Q) + K̂L*(Q‖P),  Ŝ_z = K̂L_z(P‖Q) + K̂L_z(Q‖P),

where, in the direction from Q to P, the augmentation uses p̂*_k = p̂_k + 1[p̂_k = 0]/n (n is the sample size of the sample from P).
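Under the sum-form symmetrization sketched above (an assumption to verify against [44]), the estimators compose directly from the KL functions defined earlier:

```python
def skl_plugin(counts_p, counts_q):
    """S-hat = KL-hat(P||Q) + KL-hat(Q||P), reusing kl_plugin from above."""
    return kl_plugin(counts_p, counts_q) + kl_plugin(counts_q, counts_p)

def skl_augmented(counts_p, counts_q):
    """S*-hat, with the augmentation applied in each direction."""
    return kl_augmented(counts_p, counts_q) + kl_augmented(counts_q, counts_p)
```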

Asymptotic Properties for KL and Symmetrized KL Estimators
The asymptotic properties for K̂L, K̂L*, K̂L_z, Ŝ, Ŝ*, and Ŝ_z are all presented in [44]. All the asymptotic properties therein require P ≠ Q (namely, KL > 0). When P = Q, the asymptotic properties of the KL (or S) estimators are currently missing from the literature. The derivation of such asymptotic properties is not complicated, yet arguably unnecessary. The only purpose of such an asymptotic property under P = Q is to test H_0: P = Q against H_a: P ≠ Q. For such a purpose, the two-sample goodness-of-fit chi-squared test can be used (see p. 616 in [45]).

Recently Developed Entropic Statistics Quantities and Estimation
In this section, various recently developed entropic statistics quantities are introduced and discussed. Some quantities are quite new, with few estimation properties developed beyond plug-in estimation. Therefore, some of the following discussions focus on conceptual spirit and application potential.

Standardized Mutual Information
Mutual information between two random elements (on non-ordinal alphabets) is similar to the covariance between two random variables (on ordinal spaces) regarding properties and drawbacks. For example, the covariance does not provide general information on the degree of correlation, and the concept of the correlation coefficient was defined to fill the gap. Similarly, recalling the fourth property of MI, that MI generally does not provide information about the degree of dependence, standardized mutual information (SMI), κ, has been studied and defined in various ways, typically by normalizing MI by quantities such as H(X, Y), max{H(X), H(Y)}, or √(H(X)H(Y)), provided H(X, Y) < ∞. One of these quantities, κ_6, is also called the information gain ratio [46]. The benefits of SMI are supported by Theorem 7. Interested readers may refer to [47][48][49][50] for discussions on SMI. A detailed discussion of the estimation of various SMI may be found in [51].
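A small sketch of several commonly used normalizations follows; which normalization corresponds to each κ_1, ..., κ_6 in the literature should be checked against [47][48][49][50][51], so the dictionary keys below are descriptive rather than authoritative. The helper entropy_plugin is as in the earlier MI sketch.

```python
import numpy as np

def smi_variants(table):
    """Common standardized-MI normalizations of the plug-in MI-hat."""
    p_xy = np.asarray(table, dtype=float) / np.sum(table)
    h_x = entropy_plugin(p_xy.sum(axis=1))
    h_y = entropy_plugin(p_xy.sum(axis=0))
    h_xy = entropy_plugin(p_xy.ravel())
    mi = h_x + h_y - h_xy
    return {
        "MI / H(X,Y)": mi / h_xy,
        "MI / max{H(X), H(Y)}": mi / max(h_x, h_y),
        "MI / sqrt(H(X) H(Y))": mi / np.sqrt(h_x * h_y),
        "MI / H(X) (information-gain-ratio style)": mi / h_x,
    }
```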

Entropic Basis: A Generalization from Shannon's Entropy
Shannon's entropy and MI are powerful tools to quantify dispersion and dependence on non-ordinal spaces. More concepts and statistical tools are needed to characterize non-ordinal space information from different perspectives. Generalized Simpson's diversity indices were established in [52] and coined in [3].

Definition 1 (Generalized Simpson's Diversity Indices). For a given p = {p_k; k ≥ 1} and an integer pair (u, v), u ≥ 1 and v ≥ 0, let

ζ_{u,v} = ∑_{k≥1} p_k^u (1 − p_k)^v.

The family {ζ_{u,v}} is defined as the family of generalized Simpson's diversity indices.

Generalized Simpson's diversity indices are the foundation of the entropic basis and entropic moments. Interested readers may refer to [53] for discussions on entropic moments and a goodness-of-fit test under permutation based on entropic moments. In estimating ζ_{u,v}, the estimator

z_{u,v} = (1 / [n]_{u+v}) ∑_{k≥1} [C_k]_u [n − C_k]_v

was derived in [52], where n is the sample size, u and v are given constants, C_k = n p̂_k with p̂_k the sample proportion of the k-th letter (category), and [x]_a = x(x − 1)⋯(x − a + 1) denotes the falling factorial. z_{u,v} is a uniformly minimum-variance unbiased estimator (UMVUE) of ζ_{u,v} for any pair (u, v) of non-negative integers as long as u + v ≤ n, where n is the corresponding sample size. Based on ζ_{u,v}, ref. [3] defined the entropic basis.
Definition 2 (Entropic Basis). Given Definition 1, the entropic basis is the sub-family {ζ_{1,v}; v ≥ 0}.

All diversity indices can be represented as functions of members of the entropic basis [3] (most representations are due to Taylor's expansion). For example, the richness index satisfies K = ∑_{v≥0} ζ_{1,v}, and Shannon's entropy satisfies H = ∑_{v≥1} ζ_{1,v}/v.
Consider the richness index K as an example, and let K̂ (the observed number of categories) be the plug-in estimator of K. Meanwhile, ∑_{v=0}^{n−1} z_{1,v} (the estimator in the perspective of the entropic basis representation) is algebraically equivalent to K̂ [58]. Namely, K̂ = ∑_{v=0}^{n−1} z_{1,v}, so the bias of K̂ is −∑_{v=n}^{∞} ζ_{1,v}; an estimator K̂_entropic that appends an estimate of the tail ∑_{v=n}^{∞} ζ_{1,v} therefore has a smaller bias than K̂. Interested readers may refer to [58] for details on the estimation of ∑_{v=n}^{∞} ζ_{1,v}. Similar estimation could benefit the estimation of Rényi equivalent entropy, Emlen's index, and any other diversity indices or theoretical quantities containing the term ∑_{v=n}^{∞} ζ_{1,v} after Taylor's expansion.
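The entropic-basis estimators are simple to compute; the sketch below implements z_{1,v} via the falling-factorial form given above and numerically confirms the identity ∑_{v=0}^{n−1} z_{1,v} = K̂.

```python
import numpy as np

def z1v_all(counts):
    """z_{1,v} for v = 0, ..., n-1: UMVUEs of zeta_{1,v} = sum_k p_k (1-p_k)^v,
    computed recursively from
    z_{1,v} = sum_k (C_k/n) prod_{j=1}^{v} (n - C_k - j + 1)/(n - j)."""
    counts = np.asarray(counts, dtype=float)
    counts = counts[counts > 0]
    n = int(counts.sum())
    term = counts / n
    values = [term.sum()]                 # z_{1,0} = 1
    for v in range(1, n):
        term = term * (n - counts - v + 1) / (n - v)
        values.append(term.sum())
    return np.array(values)

counts = np.array([7, 4, 2, 1])
z = z1v_all(counts)
print(z.sum(), np.count_nonzero(counts))  # both equal K-hat = 4
```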

Generalized Shannon's Entropy and Generalized Mutual Information
Because of their advantages in characterizing information in non-ordinal spaces, Shannon's entropy and MI have become building blocks of information theory and essential ingredients of ML methods. Yet, they are only finitely defined for distributions with fast-decaying tails on a countable alphabet. The unboundedness of Shannon's entropy and MI over the general class of all distributions on an alphabet prevents their potential utility from being fully realized. Ref. [5] proposed GSE and GMI, which are finitely defined everywhere. To state the definitions of GSE and GMI, Definition 3 is stated first.

Definition 3 (Conditional Distribution of Total Collision (CDOTC)). Given X = {x_i; i ≥ 1} and p = {p_i}, consider the experiment of drawing an independently and identically distributed (i.i.d.) sample of size m (m ≥ 2). Let C_m denote the event that all observations of the sample take on the same letter in X, and let C_m be referred to as the event of a total collision. The conditional probability, given C_m, that the total collision occurs at the letter x_i is

p_{i,m} = p_i^m / ∑_{k≥1} p_k^m.

The idea of CDOTC is to adopt a special member of the family of escort distributions introduced in [59]. The utility of CDOTC is endorsed by Lemmas 1 and 2, which are proved in [5]. It is clear that p_m = {p_{i,m}; i ≥ 1} is a probability distribution induced from p = {p_k}. The m-th order GSE, H_m, is then defined based on the CDOTC p_m in the manner of Shannon's entropy, and the m-th order GMI is defined analogously through the CDOTC of the joint distribution (see Example 7). An example is provided to help understand Definition 3.

Example 7 (The second-order GMI). Let

p_{XY,2} = { p_{i,j}² / ∑_{s,t} p_{s,t}² ; i ≥ 1, j ≥ 1 }

be the second-order CDOTC of the joint distribution p_XY. Further, let p_{X*} and p_{Y*} be the two marginal distributions of p_{XY,2}. The second-order GMI, MI_2(X, Y), is then defined as

MI_2(X, Y) = H(X*) + H(Y*) − H(X*, Y*),

where H(X*), H(Y*), and H(X*, Y*) are Shannon's entropies based on p_{X*}, p_{Y*}, and p_{XY,2}, respectively.
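Anticipating the plug-in estimators in Definitions 6 and 7 below, the second-order CDOTC and the resulting plug-in MI_2 can be sketched as follows (entropy_plugin is as in the earlier MI sketch):

```python
import numpy as np

def cdotc(p, m=2):
    """Order-m conditional distribution of total collision: p_i^m / sum_k p_k^m."""
    pm = np.asarray(p, dtype=float) ** m
    return pm / pm.sum()

def gmi2_plugin(table):
    """Plug-in second-order GMI per Example 7: Shannon's entropies of the
    order-2 CDOTC of the joint plug-in distribution and of its marginals."""
    p_xy = np.asarray(table, dtype=float) / np.sum(table)
    p_xy2 = cdotc(p_xy.ravel(), 2).reshape(p_xy.shape)
    h_x = entropy_plugin(p_xy2.sum(axis=1))
    h_y = entropy_plugin(p_xy2.sum(axis=0))
    return h_x + h_y - entropy_plugin(p_xy2.ravel())
```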
GSE's and GMI's plug-in estimators are stated in Definitions 6 and 7.
Definition 6 (GSE's plug-in estimator). Let X_1, X_2, ..., X_n be i.i.d. random variables taking values in X = {x_i; i ≥ 1} with distribution p_X, and let p̂_X = {p̂_i} be the plug-in estimator of p_X. The plug-in estimator for the m-th order GSE is

Ĥ_m = −∑_{i≥1} p̂_{i,m} ln p̂_{i,m},  where p̂_{i,m} = p̂_i^m / ∑_{k≥1} p̂_k^m.

Definition 7 (GMI's plug-in estimator). Let (X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n) be i.i.d. random variables taking values in X × Y = {(x_i, y_j); i ≥ 1, j ≥ 1} with distribution p_XY = {p_{i,j}}. Let p̂_XY = {p̂_{i,j}} be the plug-in estimator of p_XY. The plug-in estimator for the m-th order GMI, M̂I_m, is obtained by replacing p_XY with p̂_XY in the definition of MI_m.

The following asymptotic properties for GSE's plug-in estimator are given in [60].

Theorem 8. Let p_X = {p_k; k ≥ 1} be a probability distribution on a countably infinite alphabet X. Then, without any further conditions, √n (Ĥ_m − H_m) converges in distribution to a centered normal distribution.

The properties in Theorems 8 and 9 allow interval estimation and hypothesis testing with Ĥ_m. The advantage of shifting the original distribution to an escort distribution is reflected in Theorem 8: the asymptotic normality requires no assumption on a countably infinite alphabet. Theorem 9 can be viewed as a special case of Theorem 8 under a finite situation, where the uniform distribution shall be excluded because a uniform distribution has no variation between different category probabilities and hence results in a zero GSE and a degenerate asymptotic distribution.
Nevertheless, suppose one is certain that the cardinality of the distribution is finite. In that case, one shall use Shannon's entropy instead of GSE, because Shannon's entropy always exists for a finite distribution and there are various well-studied estimation methods for it (whereas only plug-in estimation of GSE has been studied thus far).
Asymptotic properties for the GMI plug-in estimator have not been studied yet. Nonetheless, a test of independence with modified GMI [61] has been studied. The test does not require knowledge of the number of columns or rows of a contingency table; hence it yields an alternative to Pearson's chi-squared test of independence, particularly when a contingency table is large or sparse.

Application of Entropic Statistics in Machine Learning and Knowledge Extraction
Applications of entropic statistics in ML and knowledge extraction can be clustered into two directions. The first direction is to solve an existing question from a new perspective by creating a new information-theoretic quantity [61] or revisiting an existing information-theoretic quantity for additional insights [62]. The second direction is to use different estimation methods in existing methods to improve performance by reducing bias and/or variance [32]. Application potentials in the second direction are very promising because theoretical results from recently developed estimation methods suggest that the performance of many existing ML methods could be improved, yet not much research has been conducted in this direction. In this section, several established ML and knowledge extraction methods are discussed with their potential for improvement in the second direction.

An Entropy-Based Random Forest Model
Ref. [63] proposed an entropy-importance-based random forest model for power quality feature selection and disturbance classification. The method used a greedy search based on entropy and information gain for node segmentation. Nevertheless, only the plug-in estimation of entropy and information gain was considered. The method could be improved by replacing the plug-in estimation with smaller-bias estimation methods, such as Ĥ_z in [25], as sketched below. Further, one can also combine Ĥ_z with the jackknife procedure in (3) to obtain Ĥ_zJK, and use Ĥ_zJK in place of the adopted plug-in estimation. The benefit of using Ĥ_zJK is the potentially smaller bias and variance [25]. However, asymptotic properties for Ĥ_zJK are not yet developed. When asymptotic properties are desired (e.g., for confidence interval or hypothesis testing purposes), one shall consider estimators with established asymptotic properties (also called a theoretical guarantee), such as Ĥ, Ĥ_MM, Ĥ_JK, and Ĥ_z.
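The swap is localized: any split criterion that computes an information gain can accept a pluggable entropy estimator. The sketch below is a hypothetical illustration (not the implementation of [63]); it assumes the entropy_plugin and entropy_z functions from the earlier sketches are in scope.

```python
import numpy as np

def information_gain(y_parent, y_children, entropy_estimator):
    """Information gain of a candidate split: estimated entropy of the parent
    minus the sample-size-weighted estimated entropies of the children."""
    def counts(y):
        return np.unique(y, return_counts=True)[1]
    n = len(y_parent)
    gain = entropy_estimator(counts(y_parent))
    for child in y_children:
        gain -= (len(child) / n) * entropy_estimator(counts(child))
    return gain

y = np.array(["a"] * 12 + ["b"] * 8)          # hypothetical class labels
split = [y[:10], y[10:]]
print(information_gain(y, split, entropy_plugin))
print(information_gain(y, split, entropy_z))  # smaller-bias alternative
```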

Feature Selection Methods
In [16,64], various information-theoretic feature selection methods were reviewed and discussed. The two review articles did not mention that all the discussed methods adopt plug-in estimators for the corresponding information-theoretic quantities. Improving the performance with different estimation methods is possible, and investigation is needed. For example, some of the discussed methods are summarized in Table 2 with suggestions to utilize smaller bias and/or variance estimation methods.

Table 2. Selected information-theoretic feature selection methods reviewed in [16,64], with potential perspectives to improve performance using smaller bias and/or variance estimation methods. Each proposed criterion uses the same notation as its original form for readers' ease in tracing it back to the original article, except that some terms are equivalently re-denoted as M̂I or κ̂ to help readers follow the notations in the previous sections. To further clarify the notations, M̂I(X_1, X_2) and M̂I(X_1; X_2) are the same, and M̂I(X_1, X_2 | Y) is the plug-in estimator of conditional mutual information. By its definition, M̂I(X_1, X_2 | Y) = Ĥ(X_1, Y) + Ĥ(X_2, Y) − Ĥ(Y) − Ĥ(X_1, X_2, Y), where each Ĥ could be further replaced using Ĥ_z, with or without the jackknife procedure.

Method | Proposed Criterion (Score) | Different Estimation Method
MIM [65] | M̂I(X_k; Y) | Use M̂I_z with the jackknife procedure
... | ... | ...
CMIFS [78] | ... | Use M̂I_z with the jackknife procedure

A Keyword Extraction Method
Ref. [79] proposed a keyword extraction method with Rényi's entropy and used the plug-in estimator therein. Namely,

Ŝ_R(w, q) = (1/(1 − q)) log_2 ∑_{i=1}^{F_w} p̂_i^q.

Nevertheless, S_R(w, q) can be represented through the entropic basis; expanding the power sum via Taylor's expansion gives

S_R(w, q) = (1/(1 − q)) log_2 ∑_{v=0}^{∞} c_v ζ_{1,v},  where c_0 = 1 and c_v = ∏_{j=1}^{v} (j − q)/j,

and ζ_{1,v} has the UMVUE z_{1,v} for v = 0, 1, 2, ..., n − 1. For v ≥ n, ζ_{1,v} could be estimated based on regression analysis [58]. Hence, S_R(w, q) can be estimated as

Ŝ_Rz(w, q) = (1/(1 − q)) log_2 [ ∑_{v=0}^{n−1} c_v z_{1,v} + ∑_{v=n}^{∞} c_v z*_{1,v} ],

where the construction of z*_{1,v} needs investigation using regression analysis [58]. The resulting Ŝ_Rz(w, q) would have a smaller bias than Ŝ_R(w, q), helping to improve the established keyword extraction method. Note that if one wishes to use z_{1,v} up to v = n − 1 only, the resulting estimator becomes

Ŝ_Rz(w, q) = (1/(1 − q)) log_2 ∑_{v=0}^{n−1} c_v z_{1,v}.   (6)

The estimator in (6) is the same as Ĥ_r(n) in [3]; asymptotic properties for Ĥ_r(n) were provided therein (Corollary 3 in [3]) for interested readers.
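A sketch of the truncated estimator (6) follows, reusing z1v_all from the entropic-basis sketch; the coefficient recursion c_v = c_{v−1}(v − q)/v implements the expansion above and is my own rearrangement to verify against [3,79].

```python
import numpy as np

def renyi_plugin(counts, q):
    """Plug-in Renyi entropy: (1/(1-q)) log2 sum_i p-hat_i^q."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return np.log2(np.sum(p ** q)) / (1.0 - q)

def renyi_entropic_basis(counts, q):
    """Truncated entropic-basis estimator (6): (1/(1-q)) log2 sum_v c_v z_{1,v}."""
    z = z1v_all(counts)        # z_{1,v}, v = 0, ..., n-1 (earlier sketch)
    c, total = 1.0, 0.0
    for v, zv in enumerate(z):
        if v > 0:
            c *= (v - q) / v   # c_v = prod_{j<=v} (j - q)/j
        total += c * zv
    return np.log2(total) / (1.0 - q)

counts = np.array([9, 6, 3, 2])
print(renyi_plugin(counts, 0.5), renyi_entropic_basis(counts, 0.5))
```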

Conclusions
Entropic statistics is effective in characterizing information from non-ordinal spaces. Meanwhile, it is essential to realize that non-ordinal information is inherently difficult to identify due to its non-ordinal and permutation-invariant nature. This survey article aims to provide a comprehensive review of recent advances in entropic statistics, including the estimation of classic entropic concepts, recently developed entropic statistics quantities, and their application potentials in ML and knowledge extraction.

This article first introduces the concept of entropic statistics and emphasizes the challenges posed by non-ordinal data. It then reviews the estimation of classic entropic quantities. These classic entropic concepts, including Shannon's entropy, MI, and KL, are widely used in established machine learning and knowledge extraction methods. Most, if not all, of the established methods use plug-in estimation, which is computationally efficient yet carries a large bias. The surveyed estimation methods would help researchers potentially improve existing methods' performance by adopting a different estimation method or adding a theoretical guarantee to the existing methods. Recently developed entropic statistics concepts are also reviewed with their estimation and applications. These new concepts not only allow researchers to estimate existing quantities from a new perspective, but also support additional aspects of characterizing non-ordinal information. In particular, the generalized Simpson's diversity indices (with the induced entropic basis and entropic moments) have significant application and theoretical potential, either to customize existing ML and knowledge extraction methods or to establish new methods considering domain-specific challenges. Further, this article provides some examples of how to apply the surveyed results to existing methods, including a random forest model, fourteen feature selection methods, and a keyword extraction model.

It should be mentioned that the aim of the survey is not to claim the superiority of some estimation methods over others, but to provide a comprehensive list of recent advances in entropic statistics research. Specifically, although an estimator with a faster-decaying bias seems theoretically preferred, it has a longer calculation time even with the convenient R functions, particularly when multiple layers of jackknife (bootstrap) are involved. The preferred estimator varies case by case: some may prefer an estimator with a smaller bias, some may prefer one with a smaller variance, and some may need a trade-off between them. Furthermore, the article focuses on non-parametric estimation, while parametric estimation would perform better if the specified model fits the domain-specific reality. In summary, one should always investigate whether a new estimation method fits the needs.
Enormous additional work is still needed in entropic statistics. For example: (1) the asymptotic properties of many established estimators (such as Ĥ_chao and Ĥ_zJK) are unclear when the cardinality is infinite. (2) With the transition from the original distribution to an escort distribution, GSE and GMI fill the void left by Shannon's entropy and MI. However, only plug-in estimation of GSE and GMI has been studied; the biases of these plug-in estimators have not been studied, and additional estimation methods are undoubtedly needed. (3) Calculations for many entropic statistics are not yet supported in R, such as the entropic basis, GSE, and GMI. Furthermore, more work is needed to implement the new entropic statistics concepts in programming software other than R (some of the reviewed estimators are implemented in R and are listed in Appendix A as a reference), particularly in Python. With additional theoretical development and application support, entropic statistics methods would become a more efficient tool to characterize non-ordinal information and better serve the demands arising from emerging domain-specific challenges.
Funding: This research received no external funding.
Data Availability Statement: Not applicable.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

ANOVA   Analysis of Variance
CDOTC   Conditional Distribution of Total Collision
GMI     Generalized Mutual Information
GSE     Generalized Shannon's Entropy
i.i.d.  independent and identically distributed
KL      Kullback-Leibler divergence
MI      Mutual Information
ML      Machine Learning
PID     Partial Information Decomposition
SMI     Standardized Mutual Information
UMVUE   Uniformly Minimum-Variance Unbiased Estimator