Asymptotic Normality for Plug-In Estimators of Generalized Shannon’s Entropy

Shannon’s entropy is one of the building blocks of information theory and an essential aspect of Machine Learning (ML) methods (e.g., Random Forests). Yet, on a countable alphabet it is finitely defined only for distributions with fast-decaying tails. The unboundedness of Shannon’s entropy over the general class of all distributions on an alphabet prevents its potential utility from being fully realized. To fill this void in the foundation of information theory, Zhang (2020) proposed generalized Shannon’s entropy, which is finitely defined everywhere. The plug-in estimator, adopted in almost all entropy-based ML method packages, is one of the most popular approaches to estimating Shannon’s entropy. The asymptotic distribution of the plug-in estimator of Shannon’s entropy has been well studied in the existing literature. This paper studies the asymptotic properties of the plug-in estimator of generalized Shannon’s entropy on countable alphabets. The developed asymptotic properties require no assumptions on the original distribution. They allow for interval estimation and statistical tests based on generalized Shannon’s entropy.


Introduction
Shannon's entropy, introduced by Shannon (1948), is one of the building blocks of information theory and a key aspect of Machine Learning (ML) methods (e.g., Random Forests). It is one of the most popular measures on countable alphabets, particularly on non-ordinal spaces with categorical data. For example, in Li et al. (2017), all reviewed feature selection methods on non-ordinal spaces boil down to a function of Shannon's entropy. In addition, Shannon's entropy is one of the most important foundations for all tree-based ML algorithms, sometimes substitutable with the Gini impurity index (Banerjee et al., 2019; Mienye et al., 2019; Hssina et al., 2014). As one of the essential information-theoretic quantities, Shannon's entropy and its estimation have been widely studied over the past decades (Miller and Madow, 1954; Harris, 1975; Esty et al., 1983; Paninski, 2003; Zhang, 2012; Zhang and Zhang, 2012; Zhang, 2013).
Nevertheless, Shannon's entropy is finitely defined only for distributions with fast-decaying tails (Baccetti and Visser, 2013). In practice, one never knows whether the true distribution yields a finite Shannon's entropy. Furthermore, all existing results on Shannon's entropy require it to be finitely defined, which restricts the use of entropy-based methods. This is, in fact, a void in the foundation of all Shannon's-entropy-related results. To address this deficiency, Zhang (2020) proposed generalized Shannon's entropy (GSE) and showed that GSE enjoys all utilities of a finite Shannon's entropy. In addition, GSE is finitely defined for all distributions. Given the advantages of GSE and the deficiency of Shannon's entropy, the use of Shannon's entropy should eventually transition to GSE. To aid the transition, the estimation of GSE needs to be studied. In practice, the plug-in estimator, adopted in almost all entropy-based ML method packages, is one of the most popular approaches to estimating Shannon's entropy. For plug-in estimation of GSE, asymptotic properties are needed for statistical tests and confidence intervals. This paper studies the asymptotic properties of plug-in estimators of GSE.
The rest of this paper is organized as follows. Section 2 formally states the problem and gives our main results. In Section 3, we provide a small-scale simulation study. In Section 4, we discuss the potential of GSE. Proofs are postponed to Section 5.

Main Results
Let Z be a random element on a countable alphabet 𝒵 = {z_k; k ≥ 1} with an associated distribution p = {p_k; k ≥ 1}. Let the cardinality of the support of p be denoted K = Σ_{k≥1} 1[p_k > 0], where 1[·] is the indicator function; K may be finite or infinite. Let 𝒫 denote the family of all distributions on 𝒵. Shannon's entropy, H, is defined as

H(Z) = −Σ_{k≥1} p_k ln p_k. (1)

To state our main results, we need Definitions 1 and 2, given by Zhang (2020), and Definition 3.
Definition 1 (Conditional Distribution of Total Collision (CDOTC)). Given 𝒵 = {z_k; k ≥ 1} and p = {p_k}, consider the experiment of drawing an independent and identically distributed (iid) sample of size m (m ≥ 2). Let C_m denote the event that all observations of the sample take on a same letter in 𝒵, and let C_m be referred to as the event of total collision. The conditional probability, given C_m, that the total collision occurs at letter z_k is

p_{m,k} = p_k^m / Σ_{j≥1} p_j^m,

where m ≥ 2. p_m = {p_{m,k}} is defined as the m-th order CDOTC.
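Definition 1 is straightforward to compute in practice. Below is a minimal sketch (the function name `cdotc` is ours, not from the paper): raising p to the m-th power and renormalizing yields the m-th order CDOTC.

```python
import numpy as np

def cdotc(p, m):
    """Return the m-th order CDOTC of a distribution p.

    Given total collision among m iid draws, the collision occurs at
    letter z_k with probability p_k**m / sum_j p_j**m.
    """
    p = np.asarray(p, dtype=float)
    w = p ** m
    return w / w.sum()

p = np.array([0.4, 0.3, 0.2, 0.1])
p2 = cdotc(p, 2)  # [0.16, 0.09, 0.04, 0.01] / 0.30
```

Raising p to the power m concentrates mass on the more probable letters, so p_m always has a thinner tail than p; this is what makes GSE finite for every distribution.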
Definition 2 (Generalized Shannon's Entropy (GSE)). Given 𝒵 = {z_k; k ≥ 1} and p_m = {p_{m,k}}, generalized Shannon's entropy (GSE) is defined as

H_m(Z) = −Σ_{k≥1} p_{m,k} ln p_{m,k},

where p_{m,k} is defined in Definition 1, and m = 2, 3, . . . is the order of GSE. GSE with order m is referred to as the m-th order GSE.
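Definition 2 can be sketched in a few lines, assuming the CDOTC formula from Definition 1 (the function name `gse` is ours). A quick sanity check: for a uniform distribution on K letters the CDOTC is again uniform, so H_m = ln K for every order m, agreeing with Shannon's entropy.

```python
import numpy as np

def gse(p, m=2):
    """m-th order generalized Shannon entropy: -sum_k p_{m,k} * ln(p_{m,k})."""
    p = np.asarray(p, dtype=float)
    w = p ** m
    pm = w / w.sum()   # m-th order CDOTC
    pm = pm[pm > 0]    # drop zero-probability letters (0 ln 0 := 0)
    return -np.sum(pm * np.log(pm))

print(gse([0.25, 0.25, 0.25, 0.25], m=2))  # ln 4, about 1.3863
```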
It is clear that p_m is a probability distribution induced from p = {p_k}. To help understand Definitions 1 and 2, Examples 1 and 2 are provided as follows.
Definition 3 (Plug-in estimator of GSE). Let Z_1, Z_2, . . ., Z_n be independent and identically distributed (iid) random variables taking values in 𝒵 according to p. Let Y_k = Σ_{i=1}^n 1[Z_i = z_k] be the sample count of observations in category z_k, and let p̂_k = Y_k/n be the sample proportion. The plug-in estimator for the m-th order GSE, Ĥ_m(Z), is defined as

Ĥ_m(Z) = −Σ_{k≥1} p̂_{m,k} ln p̂_{m,k}, where p̂_{m,k} = p̂_k^m / Σ_{j≥1} p̂_j^m.

Our main results are stated in Theorem 1, Corollary 1, and Corollary 2.
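Definition 3 amounts to substituting the empirical proportions into the CDOTC before taking the entropy. A minimal sketch (the helper name `gse_plugin` is ours):

```python
import numpy as np

def gse_plugin(sample, m=2):
    """Plug-in estimator of the m-th order GSE from an iid sample."""
    _, counts = np.unique(np.asarray(sample), return_counts=True)
    phat = counts / counts.sum()   # sample proportions p-hat_k = Y_k / n
    w = phat ** m
    pm = w / w.sum()               # plug-in CDOTC p-hat_{m,k}
    return -np.sum(pm * np.log(pm))

rng = np.random.default_rng(0)
sample = rng.choice(["a", "b", "c"], size=2000, p=[0.5, 0.3, 0.2])
est = gse_plugin(sample, m=2)  # close to the true H_2 of this p (about 0.85)
```

Note that letters never observed in the sample contribute nothing to the estimator, exactly as with the classical plug-in estimator of Shannon's entropy.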
Theorem 1. Let p = {p_k} be a probability distribution on a countably infinite alphabet. Then, without any further conditions,

√n (Ĥ_m(Z) − H_m(Z)) →d N(0, σ_m²),

where

σ_m² = m² Σ_{k≥1} (p_{m,k} ln p_{m,k} + p_{m,k} H_m(Z))² / p_k. (2)

Corollary 1. Let p = {p_k} be a probability distribution on a countably infinite alphabet. Then, without any further conditions,

√n (Ĥ_m(Z) − H_m(Z)) / σ̂_m →d N(0, 1),

where σ̂_m² is the plug-in version of (2), with each p_k replaced by p̂_k.

Corollary 2. Let p = {p_k; k = 1, 2, . . ., K} be a non-uniform probability distribution on a finite alphabet. Then, without any further conditions, the conclusions of Theorem 1 and Corollary 1 hold. Corollary 2 is a special case of Theorem 1. All proofs are provided in Section 5.

Simulations
One of the main applications of our results is the ability to construct confidence intervals, and hence to test hypotheses. Specifically, Corollary 1 implies that an asymptotic (1 − α)100% confidence interval for H_m is given by

Ĥ_m(Z) ± z_{α/2} σ̂_m / √n, (3)

where σ̂_m is given by (2) and z_{α/2} is the number such that P(Z > z_{α/2}) = α/2 with Z ∼ N(0, 1). In this section, we give a small-scale simulation study to check the finite-sample performance of this confidence interval.
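Interval (3) can be sketched as follows. The variance expression below is our delta-method reconstruction of (2), not a verbatim transcription from the paper, and the helper names are ours.

```python
import numpy as np
from statistics import NormalDist

def gse_ci(sample, m=2, alpha=0.05):
    """Asymptotic (1 - alpha) confidence interval (3) for H_m."""
    _, counts = np.unique(np.asarray(sample), return_counts=True)
    n = counts.sum()
    phat = counts / n
    w = phat ** m
    pm = w / w.sum()
    h = -np.sum(pm * np.log(pm))                 # plug-in estimate H-hat_m
    g = (m / phat) * (pm * np.log(pm) + pm * h)  # per-letter gradient terms
    sigma = np.sqrt(np.sum(phat * g ** 2))       # plug-in sigma-hat_m
    z = NormalDist().inv_cdf(1 - alpha / 2)      # z_{alpha/2}
    half = z * sigma / np.sqrt(n)
    return h - half, h + half

rng = np.random.default_rng(1)
sample = rng.choice(4, size=1000, p=[0.4, 0.3, 0.2, 0.1])
lo, hi = gse_ci(sample, m=2)  # should bracket the true H_2 (about 1.08) most of the time
```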
We consider the Zeta distribution with s = 1.5, i.e., p_k = k^{−s}/ζ(s) for k ≥ 1, where ζ(s) = Σ_{k≥1} k^{−s} is the Riemann zeta function. We set s = 1.5 because such a Zeta distribution yields asymptotic normality for Ĥ_m but not for Ĥ (Zhang and Zhang, 2012). The simulations were performed as follows. For the given distribution, we obtained a random sample of size n and used it to evaluate a 95% confidence interval using (3). We then checked whether the true value of H_m was in the interval. This was repeated 5000 times, and the proportion of times the true value was in the interval was calculated. When the asymptotics work well, this proportion should be close to 0.95. We repeated this for sample sizes ranging from 10 to 1000 in increments of 10. The results for s = 1.5 with order m = 2 are given in Figure 1, and the results for s = 1.5 with order m = 3 are given in Figure 2.
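A condensed version of this experiment (200 replications at a single sample size instead of 5000 across a grid, with the Zeta support truncated at a large K for sampling, and with our delta-method reconstruction of σ̂_m) can be sketched as:

```python
import numpy as np
from statistics import NormalDist

def zeta_probs(s=1.5, K=100_000):
    """Zeta(s) probabilities p_k proportional to k**(-s), truncated at K for sampling."""
    k = np.arange(1, K + 1)
    p = k ** (-s)
    return p / p.sum()

def gse(p, m=2):
    w = p[p > 0] ** m
    pm = w / w.sum()
    return -np.sum(pm * np.log(pm))

def gse_ci_from_counts(counts, m=2, alpha=0.05):
    n = counts.sum()
    phat = counts[counts > 0] / n
    w = phat ** m
    pm = w / w.sum()
    h = -np.sum(pm * np.log(pm))
    g = (m / phat) * (pm * np.log(pm) + pm * h)
    sigma = np.sqrt(np.sum(phat * g ** 2))
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return h - z * sigma / np.sqrt(n), h + z * sigma / np.sqrt(n)

rng = np.random.default_rng(2)
p = zeta_probs()
true_h2 = gse(p, m=2)
reps, n, cover = 200, 1000, 0
for _ in range(reps):
    draws = rng.choice(p.size, size=n, p=p)
    counts = np.bincount(draws).astype(float)
    lo, hi = gse_ci_from_counts(counts, m=2)
    cover += (lo <= true_h2 <= hi)
coverage = cover / reps  # should be near the nominal 0.95
```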
The results suggest that convergence is fast, particularly for order m = 2. We conjecture that the slower convergence for larger m is caused by the fact that the probabilities in the corresponding CDOTC are smaller and hence require a larger sample size. Although GSE with order m ≥ 3 may shed light on more specific information, GSE with order m = 2 suffices and enjoys the asymptotic properties for any valid underlying probability distribution p.

Discussion
The proposed asymptotic properties in Corollaries 1 and 2 make interval estimation and statistical tests possible. Based on the simulation results, convergence is quite fast, particularly for order m = 2. Note that GSE with order m = 2 already enjoys all asymptotic properties without any assumption on the original distribution p.
We recommend using GSE with order m = 2 in place of Shannon's entropy in all entropy-based methods. The proposed asymptotic results also allow interval estimation and statistical tests for the modified entropy-based methods that replace Shannon's entropy with GSE. By replacing Shannon's entropy with GSE, one still enjoys all the benefits of Shannon's entropy, with fast convergence. Moreover, using GSE is risk-free compared to Shannon's entropy, because Shannon's entropy 1) does not exist for some thick-tailed distributions and 2) requires thinner-tailed distributions for some asymptotic properties.
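As an illustration of such a replacement, the information gain used in tree-based methods can be computed with second-order GSE as the impurity. This is a hypothetical sketch (the split and helper names are ours), not an excerpt from any ML package:

```python
import numpy as np

def gse_impurity(labels, m=2):
    """Second-order-GSE impurity of a label vector (drop-in for entropy)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    w = p ** m
    pm = w / w.sum()
    return -np.sum(pm * np.log(pm))

def gse_gain(labels, mask, m=2):
    """Impurity decrease from splitting `labels` by the boolean `mask`."""
    left, right = labels[mask], labels[~mask]
    n = labels.size
    child = (left.size / n) * gse_impurity(left, m) \
          + (right.size / n) * gse_impurity(right, m)
    return gse_impurity(labels, m) - child

y = np.array(["spam", "spam", "ham", "ham", "ham", "spam"])
split = np.array([True, True, False, False, False, True])  # perfectly separating
gain = gse_gain(y, split)  # pure children, so the gain equals the parent impurity (ln 2)
```

One caveat: non-negativity of the information gain for every split follows from the concavity of Shannon's entropy; whether the analogous property holds for GSE would need to be checked before relying on it.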
To further unlock the utility of GSE, future research is needed on Generalized Mutual Information (GMI), also proposed in Zhang (2020). The asymptotic properties in this article directly provide asymptotic normality for the plug-in estimator of GMI when the true underlying GMI is not 0. The asymptotic behavior of the plug-in estimator of GMI when the true underlying GMI is 0 remains an open question, which we will address in future work.

Proofs
The proofs require several lemmas. The first lemma is stated below.
Lemma 1 (Zhang and Zhang (2012) and Grabchak and Zhang (2018)). Assume that

σ² = Σ_{k≥1} p_k (ln p_k)² − H² satisfies 0 < σ² < ∞.

In this case,

√n (Ĥ − H) / σ̂ →d N(0, 1),

where Ĥ = −Σ_{k≥1} p̂_k ln p̂_k and σ̂² = Σ_{k≥1} p̂_k (ln p̂_k)² − Ĥ². Different proofs of Lemma 1 are provided in Zhang and Zhang (2012) and Grabchak and Zhang (2018).
The spirit of the proof of Theorem 1 is to regard the CDOTC as an original distribution and utilize the result of Lemma 1. Toward that end, several further lemmas are needed and stated below.
Lemma 2 (Equivalent conditions in Lemma 1). For any valid distribution p, let the corresponding CDOTC of order m be denoted p_m; then p_m satisfies the conditions of Lemma 1.

Lemma 3 (σ_m² in Theorem 1). In Theorem 1,

σ_m² = m² Σ_{k≥1} (p_{m,k} ln p_{m,k} + p_{m,k} H_m(Z))² / p_k.

Proof of Lemma 2. Note that for any p to be a valid distribution, the tail of p must be thinner than 1/(k ln k), because Σ_{k≥2} 1/(k ln k) diverges. Hence the tail of p_m is thinner than 1/(k² ln² k) for any m ≥ 2, by definition. It is shown in Example 3 of Zhang and Zhang (2012) that such a tail satisfies the mentioned conditions.
Proof of Lemma 3. Because of Lemma 2, σ_m² can be obtained under finite K and then letting K → ∞. For a finite K, it can be verified that, for i = 1, . . ., K − 1,

∂H_m(Z)/∂p_i = −(m/p_i)(p_{m,i} ln p_{m,i} + p_{m,i} H_m(Z)).

According to the first-order delta method,

σ_m² = Σ_{k≥1} p_k (∂H_m(Z)/∂p_k)² = m² Σ_{k≥1} (p_{m,k} ln p_{m,k} + p_{m,k} H_m(Z))² / p_k.

Proof of Theorem 1 and Corollary 1. With Lemmas 1, 2, 3, and 4 and Slutsky's theorem, Theorem 1 and Corollary 1 are proved.

Proof of Corollary 2. Corollary 2 is a direct result of Theorem 1, except under the uniform distribution, where ∇H_m = 0 for all m ≥ 2.
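The partial derivative used above can be sanity-checked numerically. H_m, viewed as a function of an unnormalized positive vector p, is scale-invariant, so its unconstrained gradient is well-defined and can be compared against central finite differences; the check below (ours, not part of the original proof) also verifies the Euler relation Σ_i p_i ∂H_m/∂p_i = 0 implied by scale invariance.

```python
import numpy as np

def gse_raw(p, m=2):
    """H_m as a function of a positive, not necessarily normalized, vector."""
    w = p ** m
    pm = w / w.sum()
    return -np.sum(pm * np.log(pm))

def grad_formula(p, m=2):
    """Reconstructed gradient: dH_m/dp_i = -(m/p_i)(p_{m,i} ln p_{m,i} + p_{m,i} H_m)."""
    w = p ** m
    pm = w / w.sum()
    h = -np.sum(pm * np.log(pm))
    return -(m / p) * (pm * np.log(pm) + pm * h)

p = np.array([0.4, 0.3, 0.2, 0.1])
eps = 1e-6
num = np.zeros_like(p)
for i in range(p.size):
    e = np.zeros_like(p)
    e[i] = eps
    num[i] = (gse_raw(p + e) - gse_raw(p - e)) / (2 * eps)
# num should match grad_formula(p) up to finite-difference error
```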

Figures

Figure 1: Effectiveness of the 95% confidence intervals as a function of sample size. Simulations from the Zeta distribution with s = 1.5 and GSE with order m = 2. The horizontal dashed line is at 0.95.

Figure 2: Effectiveness of the 95% confidence intervals as a function of sample size. Simulations from the Zeta distribution with s = 1.5 and GSE with order m = 3. The horizontal dashed line is at 0.95.