Normal Laws for Two Entropy Estimators on Infinite Alphabets

This paper offers sufficient conditions for the Miller–Madow estimator and the jackknife estimator of entropy to be asymptotically normal on countably infinite alphabets.


Introduction
Let $\mathcal{X} = \{\ell_k; k \ge 1\}$ be a finite or countably infinite alphabet, let $p = \{p_k; k \ge 1\}$ be a probability distribution on $\mathcal{X}$, and define $K = \sum_{k \ge 1} 1[p_k > 0]$, where $1[\cdot]$ is the indicator function, to be the effective cardinality of $\mathcal{X}$ under $p$. An important quantity associated with $p$ is entropy, which is defined by [1] as

$$H = -\sum_{k \ge 1} p_k \ln p_k.$$

Here and throughout, we adopt the convention that $0 \ln 0 = 0$. Many properties of entropy and related quantities are discussed in [2]. The problem of statistical estimation of entropy has a long history (see the survey paper [3] or the recent book [4]). It is well known that no unbiased estimators of entropy exist, and, for this reason, much energy has been focused on deriving estimators with relatively little bias (see [5] and the references therein for a discussion of some, but far from all, of these). Perhaps the most commonly used estimator is the plug-in. Its theoretical properties have been studied going back, at least, to [6], where conditions for consistency and asymptotic normality, in the case of finite alphabets, were derived. It would be almost fifty years before corresponding conditions for the countable case would appear in the literature. Specifically, consistency, both in terms of almost sure and $L^2$ convergence, was verified in [7]. Later, sufficient conditions for asymptotic normality were derived in two steps in [3,8].
Despite its simple form and nice theoretical properties, the plug-in suffers from large finite sample bias, which has led to the development of modifications that aim to reduce this bias. Two of the most popular are the Miller-Madow estimator of [6] and the jackknife estimator of [9]. The theoretical properties of these estimators have not been studied as extensively in the literature. In this paper, we give sufficient conditions for the asymptotic normality of these two estimators. This is important for deriving confidence intervals and hypothesis tests, and it immediately implies consistency (see e.g., [4]).
We begin by introducing some notation. We say that a distribution $p = \{p_k; k \ge 1\}$ is uniform if and only if its effective cardinality $K < \infty$ and for each $k = 1, 2, \dots$ either $p_k = 1/K$ or $p_k = 0$. We write $f \sim g$ to denote $\lim_{n\to\infty} f(n)/g(n) = 1$ and we write $f = O(g(n))$ to denote $\limsup_{n\to\infty} |f(n)/g(n)| < \infty$. Furthermore, we write $\xrightarrow{L}$ to denote convergence in law and $\xrightarrow{p}$ to denote convergence in probability. If $a$ and $b$ are real numbers, we write $a \vee b$ to denote the maximum of $a$ and $b$. When it is not specified, all limits are assumed to be taken as $n \to \infty$.

Entropy 2018, 20, 371; doi:10.3390/e20050371; www.mdpi.com/journal/entropy

Let $X_1, \dots, X_n$ be independent and identically distributed (iid) random variables on $\mathcal{X}$ under $p$. Let $\{Y_k; k \ge 1\}$ be the observed letter counts in the sample, i.e., $Y_k = \sum_{i=1}^{n} 1[X_i = \ell_k]$, and let $\hat p = \{\hat p_k; k \ge 1\}$, where $\hat p_k = Y_k/n$, be the corresponding relative frequencies. Perhaps the most intuitive estimator of $H$ is the plug-in, which is given by

$$\hat H = -\sum_{k \ge 1} \hat p_k \ln \hat p_k.$$

When the effective cardinality, $K$, is finite, [10] showed that the bias of $\hat H$ is

$$E(\hat H) - H = -\frac{K-1}{2n} + \frac{1}{12n^2}\Big(1 - \sum_{k:\, p_k > 0} \frac{1}{p_k}\Big) + O(n^{-3}). \qquad (2)$$

One of the simplest and earliest approaches aiming to reduce the bias of $\hat H$ is to estimate the first order term. Specifically, let $\hat m = \sum_{k \ge 1} 1[Y_k > 0]$ be the number of letters observed in the sample and consider an estimator of the form

$$\hat H_{MM} = \hat H + \frac{\hat m - 1}{2n}.$$

This estimator is often attributed to [6] and is known as the Miller-Madow estimator. Note that, for finite $K$,

$$E(\hat m) = K - \sum_{k:\, p_k > 0} (1 - p_k)^n.$$

Since $\sum_{k:\, p_k > 0} (1 - p_k)^n$ decays exponentially fast, it follows that, for finite $K$, the bias of $\hat H_{MM}$ is

$$E(\hat H_{MM}) - H = \frac{1}{12n^2}\Big(1 - \sum_{k:\, p_k > 0} \frac{1}{p_k}\Big) + O(n^{-3}).$$

Among the many estimators in the literature aimed at reducing bias in entropy estimation, the Miller-Madow estimator is one of the most commonly used. Its popularity is due to its simplicity, its intuitive appeal, and, more importantly, its good performance across a wide range of different distributions including those on countably infinite alphabets. See, for instance, the simulation study in [5].
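The two estimators introduced above can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard forms $\hat H = -\sum_k \hat p_k \ln \hat p_k$ and $\hat H_{MM} = \hat H + (\hat m - 1)/(2n)$; the function names and the toy sample are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def plugin_entropy(sample):
    """Plug-in estimate: -sum of p_hat_k * ln(p_hat_k) over the observed letters."""
    n = len(sample)
    counts = Counter(sample)  # Y_k for each observed letter
    return -sum((y / n) * math.log(y / n) for y in counts.values())


def miller_madow_entropy(sample):
    """Plug-in estimate plus the first-order correction (m_hat - 1) / (2n)."""
    n = len(sample)
    m_hat = len(set(sample))  # number of distinct letters observed
    return plugin_entropy(sample) + (m_hat - 1) / (2 * n)


# Hypothetical toy sample, used only for illustration.
toy_sample = ["a", "b", "a", "c", "a", "b"]
h_plugin = plugin_entropy(toy_sample)
h_mm = miller_madow_entropy(toy_sample)
```

Because $\hat m \ge 1$ for any nonempty sample, the Miller-Madow estimate is never smaller than the plug-in estimate.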
The jackknife entropy estimator is another commonly used estimator designed to reduce the bias of the plug-in. It is calculated in three steps:
1. for each $i \in \{1, 2, \dots, n\}$ construct $\hat H^{(i)}$, which is a plug-in estimator based on a sub-sample of size $n - 1$ obtained by leaving the $i$th observation out;
2. obtain $\hat H_{(i)} = n\hat H - (n-1)\hat H^{(i)}$ for $i = 1, \dots, n$; and then
3. compute the jackknife estimator

$$\hat H_{JK} = \frac{1}{n}\sum_{i=1}^{n} \hat H_{(i)}. \qquad (5)$$

Equivalently, (5) can be written as

$$\hat H_{JK} = n\hat H - \frac{n-1}{n}\sum_{i=1}^{n} \hat H^{(i)}.$$

The jackknife estimator of entropy was first described by [9]. From (2), it may be verified that, when $K < \infty$, the bias of $\hat H_{JK}$ is

$$E(\hat H_{JK}) - H = O(n^{-2}).$$

Both the Miller-Madow and the jackknife estimators are adjusted versions of the plug-in. When the effective cardinality is finite, i.e., $K < \infty$, the asymptotic normality of both can be easily verified. A question of theoretical interest is whether asymptotic normality still holds when the effective cardinality is countably infinite. In this paper, we give sufficient conditions for $\sqrt{n}(\hat H_{MM} - H)$ and $\sqrt{n}(\hat H_{JK} - H)$ to be asymptotically normal on countably infinite alphabets and provide several illustrative examples.

The rest of the paper is organized as follows. Our main results for both the Miller-Madow and the jackknife estimators are given in Section 2. A small simulation study is given in Section 3. This is followed by a brief discussion in Section 4. Proofs are postponed to Section 5.
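The three-step jackknife construction described above can be sketched directly in Python. This is a naive illustration under the assumption that the pseudo-values are $n\hat H - (n-1)\hat H^{(i)}$ with $\hat H^{(i)}$ the leave-one-out plug-in; the function names and toy sample are hypothetical, and the sample must contain at least two observations.

```python
import math
from collections import Counter


def plugin_entropy(sample):
    """Plug-in estimate: -sum of p_hat_k * ln(p_hat_k) over the observed letters."""
    n = len(sample)
    return -sum((y / n) * math.log(y / n) for y in Counter(sample).values())


def jackknife_entropy(sample):
    """Jackknife entropy estimate via the three leave-one-out steps (needs n >= 2)."""
    sample = list(sample)
    n = len(sample)
    h_full = plugin_entropy(sample)
    # Step 1: plug-in estimates with the i-th observation left out.
    leave_one_out = [plugin_entropy(sample[:i] + sample[i + 1:]) for i in range(n)]
    # Step 2: pseudo-values n * H_hat - (n - 1) * H_hat^(i).
    pseudo_values = [n * h_full - (n - 1) * h for h in leave_one_out]
    # Step 3: average the pseudo-values.
    return sum(pseudo_values) / n


# Hypothetical toy sample, used only for illustration.
toy_sample = ["a", "b", "a", "c", "a", "b"]
h_jk = jackknife_entropy(toy_sample)
```

This version costs $O(n^2)$ operations. Since a leave-one-out estimate depends only on which letter is removed, grouping the sub-samples by letter would reduce the work to one plug-in evaluation per observed letter.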

Main Results
We begin by recalling a sufficient condition due to [8] for the asymptotic normality of the plug-in estimator.
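Condition 1. The distribution $p = \{p_k; k \ge 1\}$ satisfies

$$\sum_{k \ge 1} p_k \ln^2 p_k < \infty \qquad (8)$$

and there exists an integer-valued function $K(n)$ such that, as $n \to \infty$,
1. $K(n) \to \infty$,
2. $K(n) = o(\sqrt{n})$, and
3. $\sqrt{n} \sum_{k \ge K(n)} p_k \ln p_k \to 0$.

Note that, by Jensen's inequality (see e.g., [2]), (8) implies that

$$H^2 = \Big(\sum_{k \ge 1} p_k \ln p_k\Big)^2 \le \sum_{k \ge 1} p_k \ln^2 p_k < \infty,$$

where equality holds, i.e., $H^2 = \sum_{k \ge 1} p_k \ln^2 p_k$, if and only if $p$ is a uniform distribution. Thus, when (8) holds, we have $H < \infty$. The following result is given in [8].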

Lemma 1.
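Let $p = \{p_k; k \ge 1\}$ be a distribution, which is not uniform, and set

$$\sigma^2 = \sum_{k \ge 1} p_k \ln^2 p_k - H^2 \quad \text{and} \quad \hat\sigma^2 = \sum_{k \ge 1} \hat p_k \ln^2 \hat p_k - \hat H^2. \qquad (9)$$

If $p$ satisfies Condition 1, then $\hat\sigma \xrightarrow{p} \sigma$,

$$\frac{\sqrt{n}(\hat H - H)}{\sigma} \xrightarrow{L} N(0,1), \quad \text{and} \quad \frac{\sqrt{n}(\hat H - H)}{\hat\sigma} \xrightarrow{L} N(0,1).$$

The following is useful for checking when Condition 1 holds.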

Lemma 2.
Let $p = \{p_k; k \ge 1\}$ and $p' = \{p'_k; k \ge 1\}$ be two distributions and assume that $p$ satisfies Condition 1. If there exists a $C > 0$ such that, for large enough $k$, $p'_k \le C p_k$, then $p'$ satisfies Condition 1 as well.
In [8], it is shown that Condition 1 holds for p = {p k ; k ≥ 1} with where C > 0 is a normalizing constant. It follows from Lemma 2 that any distribution with tails lighter than this satisfies Condition 1 as well.
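We are interested in finding conditions under which the result of Lemma 1 can be extended to bias-adjusted modifications of $\hat H$. Let $\hat H^*$ be any bias-adjusted estimator of the form

$$\hat H^* = \hat H + \hat B^*, \qquad (10)$$

where $\hat B^*$ is an estimate of the bias. Combining Lemma 1 with Slutsky's theorem immediately gives the following.

Theorem 1.
Let $p = \{p_k; k \ge 1\}$ be a distribution, which is not uniform, and let $\sigma^2$ and $\hat\sigma^2$ be as in (9). If Condition 1 holds and $\sqrt{n}\hat B^* \xrightarrow{p} 0$, then

$$\frac{\sqrt{n}(\hat H^* - H)}{\sigma} \xrightarrow{L} N(0,1) \quad \text{and} \quad \frac{\sqrt{n}(\hat H^* - H)}{\hat\sigma} \xrightarrow{L} N(0,1).$$

For the Miller-Madow estimator and the jackknife estimator, respectively, the bias correction term, $\hat B^*$, in (10) takes the form

$$\hat B_{MM} = \frac{\hat m - 1}{2n} \quad \text{and} \quad \hat B_{JK} = (n-1)\Big(\hat H - \frac{1}{n}\sum_{i=1}^{n} \hat H^{(i)}\Big),$$

where $\hat H^{(i)}$ is the plug-in estimator computed with the $i$th observation left out. Below, we give sufficient conditions for when $\sqrt{n}\hat B_{MM} \xrightarrow{p} 0$ and when $\sqrt{n}\hat B_{JK} \xrightarrow{p} 0$.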

Results for the Miller-Madow Estimator
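Condition 2. The distribution $p = \{p_k; k \ge 1\}$ satisfies, for sufficiently large $k$,

$$p_k \le \frac{1}{a(k)\, b(k)\, k^3}, \qquad (11)$$

where $a(k) > 0$ and $b(k) > 0$ are two sequences such that
1. $a(k) \to \infty$ as $k \to \infty$, and, furthermore, (a) the function $a(k)$ is eventually nondecreasing, and (b) there exists an $\varepsilon > 0$ such that (12) holds;
2. $\sum_{k \ge 1} \frac{1}{k\, b(k)} < \infty. \qquad (13)$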
Since this condition only requires that $p_k$, for sufficiently large $k$, is upper bounded in the appropriate way, we immediately get the following.

Lemma 3. Let $p = \{p_k; k \ge 1\}$ and $p' = \{p'_k; k \ge 1\}$ be two distributions and assume that $p$ satisfies Condition 2. If there exists a $C > 0$ such that, for large enough $k$, $p'_k \le C p_k$, then $p'$ satisfies Condition 2 as well.
We now give our main results for the Miller-Madow Estimator.
Theorem 2. Let $p = \{p_k; k \ge 1\}$ be a distribution, which is not uniform, and let $\sigma^2$ and $\hat\sigma^2$ be as in (9). If Condition 2 holds, then $\hat\sigma \xrightarrow{p} \sigma$,

$$\frac{\sqrt{n}(\hat H_{MM} - H)}{\sigma} \xrightarrow{L} N(0,1), \quad \text{and} \quad \frac{\sqrt{n}(\hat H_{MM} - H)}{\hat\sigma} \xrightarrow{L} N(0,1).$$

In the proof of the theorem, we will show that Condition 2 implies that Condition 1 holds. Condition 2 requires $p_k$ to decay slightly faster than $k^{-3}$ by two factors $1/a(k)$ and $1/b(k)$, where $a(k)$ and $b(k)$ satisfy (12) and (13), respectively. While the implication of (13) for $b(k)$ is clear, that of (12) for $a(k)$ is much less so. To better understand (12), we give an important situation where it holds. Consider the case $a(n) = \ln n$. In this case, (12) holds for any $\varepsilon \in (0, 0.5)$.

We now give a more general situation, which shows just how slowly $a(k)$ can grow. First, we recall the iterated logarithm function. Define $\ln^{(r)}(x)$, recursively for sufficiently large $x > 0$, by $\ln^{(0)}(x) = x$ and $\ln^{(r)}(x) = \ln(\ln^{(r-1)}(x))$ for $r \ge 1$. By induction, it can be shown that

$$\frac{d}{dx} \ln^{(r)}(x) = \Big(\prod_{i=0}^{r-1} \ln^{(i)}(x)\Big)^{-1}$$

for $r \ge 1$.
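As a quick numerical sanity check on the iterated logarithm, the sketch below compares a central finite difference of $\ln^{(r)}(x)$ with the closed form $1/\prod_{i=0}^{r-1} \ln^{(i)}(x)$ that the chain rule gives. The function names, the evaluation point, and the step size are illustrative assumptions.

```python
import math


def iterated_log(x, r):
    """ln^(r)(x): apply the natural logarithm r times, with ln^(0)(x) = x."""
    value = x
    for _ in range(r):
        value = math.log(value)  # raises ValueError if x is too small for this r
    return value


def iterated_log_derivative(x, r):
    """Closed form 1 / prod_{i=0}^{r-1} ln^(i)(x), valid for r >= 1."""
    product = 1.0
    for i in range(r):
        product *= iterated_log(x, i)
    return 1.0 / product


# Illustrative evaluation point and step size (assumptions, not from the paper).
x, r, h = 100.0, 3, 1e-4
finite_difference = (iterated_log(x + h, r) - iterated_log(x - h, r)) / (2 * h)
closed_form = iterated_log_derivative(x, r)
```

The two values should agree up to finite-difference error, which illustrates how slowly $\ln^{(r)}$ grows already for $r = 3$.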
We now give three examples.
We can consider a more general form, which allows for even heavier tails.
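Example 2. Let $r$ be an integer with $r \ge 2$ and let $p = \{p_k; k \ge 1\}$ be such that, for sufficiently large $k$,

$$p_k = \frac{C}{k^3 \big(\prod_{i=1}^{r-1} \ln^{(i)} k\big) \big(\ln^{(r)} k\big)^{2+\varepsilon}},$$

where $\varepsilon > 0$ and $C > 0$ are fixed constants. In this case, Condition 2 holds with $a(k) = \ln^{(r)} k$ and $b(k) = \big(\prod_{i=1}^{r-1} \ln^{(i)} k\big) \big(\ln^{(r)} k\big)^{1+\varepsilon} / C$ in (11). The fact that $b(k)$ satisfies (13) follows by the integral test for convergence.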
It follows from Lemma 3 that any distribution with tails lighter than those in this example must satisfy Condition 2. On the other hand, the tails cannot get too much heavier.

Example 3. Let $p = \{p_k; k \ge 1\}$ be such that $p_k = Ck^{-3}$, where $C > 0$ is a normalizing constant. In this case, Condition 2 does not hold. However, Condition 1 does hold.

Results for the Jackknife Estimator
For any distribution $p$, let $B_n = E(\hat H) - H$ be the bias of the plug-in based on a sample of size $n$.

Theorem 3. Let $p = \{p_k; k \ge 1\}$ be a distribution, which is not uniform, and let $\sigma^2$ and $\hat\sigma^2$ be as in (9). If Conditions 1 and 3 hold, then $\hat\sigma \xrightarrow{p} \sigma$,

$$\frac{\sqrt{n}(\hat H_{JK} - H)}{\sigma} \xrightarrow{L} N(0,1), \quad \text{and} \quad \frac{\sqrt{n}(\hat H_{JK} - H)}{\hat\sigma} \xrightarrow{L} N(0,1).$$

It is not clear to us whether Conditions 1 and 3 are equivalent or, if not, which is more stringent. For that reason, both conditions are imposed in the statement of Theorem 3. The proof of the theorem uses the following lemma, which gives some insight into $\hat B_{JK}$ and Condition 3.

Lemma 5.
For any probability distribution $p = \{p_k; k \ge 1\}$, we have $\hat B_{JK} \ge 0$ and

$$E(\hat B_{JK}) = (n-1)(B_n - B_{n-1}).$$

We now give a condition, which implies Condition 3 and tends to be easier to check.

Simulations
The main application of the asymptotic normality results given in this paper is the construction of asymptotic confidence intervals and hypothesis tests. For instance, if $p$ satisfies the assumptions of Theorem 2, then an asymptotic $(1-\alpha)100\%$ confidence interval for $H$ is given by

$$\Big(\hat H_{MM} - z_{\alpha/2}\frac{\hat\sigma}{\sqrt{n}},\ \hat H_{MM} + z_{\alpha/2}\frac{\hat\sigma}{\sqrt{n}}\Big),$$

where $z_{\alpha/2}$ is a number such that $P(Z > z_{\alpha/2}) = \alpha/2$ and $Z$ is a standard normal random variable. Similarly, if the assumptions of Theorem 3 are satisfied, then we can replace $\hat H_{MM}$ with $\hat H_{JK}$, and if the assumptions of Lemma 1 are satisfied, then we can replace $\hat H_{MM}$ with $\hat H$. In this section, we give a small-scale simulation study to evaluate the finite sample performance of these confidence intervals.
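A minimal sketch of such a confidence interval is given below. It assumes the interval has the usual form $\hat H_{MM} \pm z_{\alpha/2}\hat\sigma/\sqrt{n}$ and that $\hat\sigma^2$ in (9) is the plug-in variance $\sum_k \hat p_k \ln^2 \hat p_k - \hat H^2$; the function name and toy sample are hypothetical.

```python
import math
from collections import Counter
from statistics import NormalDist


def miller_madow_interval(sample, alpha=0.05):
    """Asymptotic (1 - alpha) interval for H centred at the Miller-Madow estimate."""
    n = len(sample)
    freqs = [y / n for y in Counter(sample).values()]
    h_plugin = -sum(q * math.log(q) for q in freqs)
    h_mm = h_plugin + (len(freqs) - 1) / (2 * n)
    # Assumed variance estimate: sum of p_hat * ln^2(p_hat) minus H_hat^2.
    sigma2_hat = sum(q * math.log(q) ** 2 for q in freqs) - h_plugin ** 2
    sigma_hat = math.sqrt(max(sigma2_hat, 0.0))  # guard against negative round-off
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    half_width = z * sigma_hat / math.sqrt(n)
    return h_mm - half_width, h_mm + half_width


# Hypothetical toy sample, used only for illustration.
toy_sample = ["a", "b", "a", "c", "a", "b"]
lower, upper = miller_madow_interval(toy_sample)
```

Replacing `h_mm` with a jackknife or plug-in estimate gives the analogous intervals for the other two estimators.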
For concreteness, we focus on the geometric distribution, which corresponds to

$$p_k = p(1-p)^{k-1}, \quad k = 1, 2, \dots,$$

where $p \in (0, 1)$ is a parameter. The true entropy of this distribution is given by $H = -p^{-1}\big(p \ln p + (1-p)\ln(1-p)\big)$. In this case, Conditions 1, 2, and 3 all hold. For our simulations, we took $p = 0.5$. The simulations were performed as follows. We began by simulating a random sample of size $n$ and used it to evaluate a 95% confidence interval for the given estimator. We then checked whether the true value of $H$ was in the interval. This was repeated 5000 times and the proportion of times when the true value was in the interval was calculated. This proportion should be close to 0.95 when the confidence interval works well. We repeated this for sample sizes ranging from 20 to 1000 in increments of 10. The results are given in Figure 1. We can see that the Miller-Madow and jackknife estimators consistently outperform the plug-in. It may be interesting to note that, although the proofs of Theorems 1-3 are based on showing that the bias correction term approaches zero, this does not mean that the bias correction term is not useful. On the contrary, bias correction improves the finite sample performance of the asymptotic confidence intervals.
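The simulation procedure described above can be sketched as follows. This is a scaled-down illustration and not the authors' code: it assumes the parametrization $p_k = p(1-p)^{k-1}$ on $k = 1, 2, \dots$, uses far fewer repetitions than the 5000 reported, and all function names, the seed, and the sample size are hypothetical choices.

```python
import math
import random
from collections import Counter
from statistics import NormalDist


def geometric_sample(n, p, rng):
    """Draw n iid values from p_k = p (1 - p)^(k - 1), k = 1, 2, ..., by inversion."""
    return [int(math.log(1.0 - rng.random()) / math.log(1.0 - p)) + 1 for _ in range(n)]


def true_entropy(p):
    """Entropy of the geometric distribution with parameter p."""
    return -(p * math.log(p) + (1 - p) * math.log(1 - p)) / p


def interval(sample, estimator, z):
    """Interval centred at the plug-in ('plugin') or Miller-Madow ('mm') estimate."""
    n = len(sample)
    freqs = [y / n for y in Counter(sample).values()]
    h_plugin = -sum(q * math.log(q) for q in freqs)
    sigma2_hat = sum(q * math.log(q) ** 2 for q in freqs) - h_plugin ** 2
    half = z * math.sqrt(max(sigma2_hat, 0.0)) / math.sqrt(n)
    centre = h_plugin + ((len(freqs) - 1) / (2 * n) if estimator == "mm" else 0.0)
    return centre - half, centre + half


def coverage(n, p=0.5, reps=500, estimator="mm", alpha=0.05, seed=0):
    """Proportion of simulated intervals that contain the true entropy."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    h = true_entropy(p)
    hits = 0
    for _ in range(reps):
        low, high = interval(geometric_sample(n, p, rng), estimator, z)
        hits += low <= h <= high
    return hits / reps


coverage_mm = coverage(200, estimator="mm")
coverage_plugin = coverage(200, estimator="plugin")
```

Using the same seed for both calls means the two estimators are compared on identical simulated samples.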

Discussion
In this paper, we gave sufficient conditions for the asymptotic normality of the Miller-Madow and the jackknife estimators of entropy. While our focus is on the case of countably infinite alphabets, our results are formulated and proved in the case where the effective cardinality $K$ may be finite or countably infinite. As such, they hold in the case of finite alphabets as well. In fact, for finite alphabets, Conditions 1-3 always hold and we have asymptotic normality so long as the underlying distribution is not uniform. The difficulty with the uniform distribution is that it is the unique distribution for which $\sigma^2$, as given by (9), is zero (see the discussion just below Condition 1). When the distribution is uniform, the asymptotic distribution is chi-squared with $(K - 1)$ degrees of freedom (see [6]).
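The degeneracy at the uniform distribution is easy to check numerically. The sketch below assumes that $\sigma^2$ in (9) is $\sum_k p_k \ln^2 p_k - H^2$, i.e., the variance of $-\ln p(X)$; the function name and the two example distributions are hypothetical.

```python
import math


def entropy_variance(p):
    """sigma^2 = sum of p_k * ln^2(p_k) minus H^2, the variance of -ln p(X)."""
    support = [q for q in p if q > 0]
    h = -sum(q * math.log(q) for q in support)
    return sum(q * math.log(q) ** 2 for q in support) - h ** 2


# Hypothetical examples: a uniform and a non-uniform distribution on four letters.
sigma2_uniform = entropy_variance([0.25, 0.25, 0.25, 0.25])
sigma2_skewed = entropy_variance([0.5, 0.25, 0.125, 0.125])
```

For the uniform case the value is zero up to floating-point error, while the non-uniform case gives a strictly positive variance.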
In general, we do not know if our conditions are necessary. However, they cover most distributions of interest. The only distributions that they preclude are ones with extremely heavy tails. Although, in complete generality, Conditions 1-3 may look complicated, they are easily checked in many situations. For instance, Condition 2 always holds when, for large enough $k$, $p_k \le Ck^{-3-\delta}$ for some $C, \delta > 0$, i.e., when If the alphabet $\mathcal{X} = \mathbb{N}$ is the set of natural numbers, then this is equivalent to the distribution $p$ having a finite variance. Similarly, Conditions 1 and 3 both hold when, for large enough $k$, $p_k \le Ck^{-2-\delta}$ for some $C, \delta > 0$, i.e., when If the alphabet $\mathcal{X} = \mathbb{N}$ is the set of natural numbers, then this is equivalent to the distribution $p$ having a finite mean.

Proofs
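Proof of Lemma 2. Without loss of generality, assume that $C > 1$ and thus that $\ln C > 0$. Let $f(x) = x \ln x$ for $x \in (0, 1)$. It is readily checked that $f$ is negative and decreasing for $x \in (0, e^{-1})$. Since $Cp_k \to 0$ as $k \to \infty$, it follows that $Cp_k < e^{-1}$ for large enough $k$. Now, let $K(n)$ be the sequence that works for $p$ in Condition 1. For large enough $n$,

$$0 \ge \sqrt{n}\sum_{k \ge K(n)} p'_k \ln p'_k \ge \sqrt{n}\sum_{k \ge K(n)} Cp_k \ln(Cp_k) \ge C\sqrt{n}\sum_{k \ge K(n)} p_k \ln p_k \to 0,$$

where the last inequality uses $\ln C > 0$. Similarly, the function $g(x) = x \ln^2 x$ for $x \in (0, 1)$ is positive and increasing for $x \in (0, e^{-2})$. Thus, there is an integer $M > 0$ such that if $k \ge M$, then $Cp_k < e^{-2}$ and

$$\sum_{k \ge M} p'_k \ln^2 p'_k \le \sum_{k \ge M} Cp_k \ln^2(Cp_k) \le C\sum_{k \ge M} p_k \ln^2 p_k < \infty,$$

as required.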
To prove Theorem 2, the following lemma is needed.

Lemma 6.
If Condition 2 holds, then there exists a K 1 > 0 such that for all k ≥ K 1 (14) Proof. Observing that e −x ≥ 1 − x holds for all real x and that lim x→0 (1 − e −x )/x = 1, we have e −2/(kb(k)) ≥ 1 − 2/(kb(k)), and hence This implies that there is a K 1 > 0 such that for all k ≥ K 1 (11) holds and It follows that, for such k, → 0. Fix ε 0 ∈ (0, ε). From (12) and the facts that a(k) is positive, eventually nondecreasing, and approaches infinity, it follows that Let K 2 be a positive integer such that, for all n ≥ K 2 , a(n) is nondecreasing, and let r n = √ n/(a(n)) ε 0 ∨ K 3 , where K 3 = K 1 ∨ K 2 and K 1 is as in Lemma 6. It follows that We have By (15), it follows that, for large enough n, (see [11] for a standard reference). To see that ln (r−1) (x) is slowly varying, note that ln is slowly varying and that compositions of slowly varying functions are slowly varying by Proposition 1.3.6 in [11].
Towards that end, first note that the inequality of (16) holds for y = 1. Now, let f (y) = (y − 1) ln y − (y − 1) ln(y − 1) and, therefore, letting s = 1 − 1/y, Since s − 1 ≥ ln s for all s > 0 (see e.g., 4.1.36 in [12]), f (y) ≥ 0 for all y, 1 < y ≤ n, which implies (16). For the second part, we use the first part to get where the last equality follows from the facts that for each i,Ĥ (i) is a plug-in estimator of H based on a sample of size (n − 1) and that E Ĥ (i) does not depend on i due to symmetry. From here, the result follows.
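Proof of Theorem 3. By Theorem 1, it suffices to show $\sqrt{n}\hat B_{JK} \xrightarrow{p} 0$. Note that, by Lemma 5,

$$E\big|\sqrt{n}\hat B_{JK}\big| = \sqrt{n}\,E(\hat B_{JK}) = \sqrt{n}(n-1)(B_n - B_{n-1}) \to 0,$$

where the convergence follows by Condition 3. From here, the result follows by Markov's inequality.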
To prove Proposition 1, we need several lemmas, which may be of independent interest.

Lemma 7. Let $S_n$ and $S_{n-1}$ be binomial random variables with parameters $(n, p)$ and $(n-1, p)$, respectively. If $n \ge 2$ and $p \in (0, 1)$, then

$$E(S_n \ln S_n) = E\big(np \ln(S_{n-1} + 1)\big).$$
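The identity in Lemma 7 can be checked numerically by summing over the binomial probability mass functions exactly. The sketch below is only a sanity check for particular values of $n$ and $p$, which are arbitrary illustrative choices, as are the function names.

```python
import math


def binomial_pmf(j, n, p):
    """P(S = j) for S binomial with parameters (n, p)."""
    return math.comb(n, j) * p ** j * (1 - p) ** (n - j)


def lhs(n, p):
    """E(S_n ln S_n), using the convention 0 ln 0 = 0 to skip j = 0."""
    return sum(binomial_pmf(j, n, p) * j * math.log(j) for j in range(1, n + 1))


def rhs(n, p):
    """E(n p ln(S_{n-1} + 1)), summing over j = 0, ..., n - 1."""
    return sum(binomial_pmf(j, n - 1, p) * n * p * math.log(j + 1) for j in range(n))


# Arbitrary illustrative parameters.
n, p = 10, 0.3
left, right = lhs(n, p), rhs(n, p)
```

The agreement reflects the elementary identity $j\binom{n}{j} = n\binom{n-1}{j-1}$, which shifts the sum over $S_n$ to a sum over $S_{n-1} + 1$.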