On Generalized Schürmann Entropy Estimators

We present a new class of estimators of Shannon entropy for severely undersampled discrete distributions. It is based on a generalization of an estimator proposed by T. Schürmann, which itself is a generalization of an estimator proposed by myself.For a special set of parameters, they are completely free of bias and have a finite variance, something which is widely believed to be impossible. We present also detailed numerical tests, where we compare them with other recent estimators and with exact results, and point out a clash with Bayesian estimators for mutual information.


Introduction
It is well known that estimating (Shannon) entropies from finite samples is not trivial. If one naively replaces the probability p i to be in "box" i by the observed frequency, p i ≈ n i /N, statistical fluctuations tend to make the distribution look less uniform, which leads to an underestimation of the entropy. There have been numerous proposals on how to estimate and eliminate this bias . Some make quite strong assumptions [5,7]; others use Bayesian methods [6,11,12,19,21,22]. As pointed out in [4,13,14,17], one can devise estimators with arbitrarily small bias (for sufficiently large N and fixed p i ), but these will then have very large statistical errors. As conjectured in [4,[13][14][15]17], the variance of any estimator whose bias vanishes will have a diverging variance.
Another widespread belief is that Bayesian entropy estimators cannot be outperformed by non-Bayesian ones for severely undersampled cases. The problem with Bayesian estimators is of course that they depend on a good choice of prior distributions, which is not always easy, and they tend to be slow. One positive feature of a non-Bayesian estimator proposed in [14] is that it is extremely fast since it works precisely like the 'naive' (or maximum-likelihood) estimator, except that the logarithms used there are replaced by a function G n defined on integers, which can be precomputed by means of a simple recursion. While the estimator of [14] seems in general to be a reasonable compromise between bias and variance, it was shown in [15] that it can be improved-as far as bias is concerned, at the cost of increased variance-by generalizing G n to a one-parameter family of functions G n (a).
In the present paper, we show that the Grassberger-Schürmann approach [14,15] can be further improved by using different functions G n (a i ) for each different realization i of the random variable. Indeed, the a i can be chosen such that the estimator is completely free of bias and yet has a finite variance-although, to be honest, the optimal parameters a i can be found only if the exact distribution is known (in which case also the entropy can be computed exactly). We show that-even if the exact, optimal a i is not known-the new estimator can reduce the bias very much, without inducing unduly large variances, provided the distribution is sufficiently much undersampled.
We test the proposed estimator numerically with simple examples, where we produce bias-free entropy estimates, e.g., from pairs of ternary variables, something which, to my knowledge, is not possible with any Bayesian method. We also use it for estimating mutual information (MI) in cases where one of the two variables is binary, and the other one can take very many values. In the limit of severe undersampling and of no obvious regularity in the true probabilities, MI cannot be estimated unambiguously. In that limit, the present algorithm seems to choose systematically a different outcome from Bayesian methods for reasons that are not yet clear.

Basic Formalism
In the following, we use the notation of [14]. As in this reference, we consider M > 1 "boxes" (states, possible experimental outcomes, etc.) and N > 1 points (samples, events, and particles) distributed randomly and independently into the boxes. We assume that each box has weight p i (i = 1, . . . M) with ∑ i p i = 1. Thus each box i will contain a random number n i of points, with E[n i ] = p i N. Their joint distribution is multinomial, while the marginal distribution in box i is binomial, Our aim is to estimate the entropy, with z i ≡ E[n i ] = p i N, from an observation of the numbers {n i } (in the following, all entropies are measured in "natural units", not in bits). The estimatorĤ(n 1 , . . . n M ) will of course have both statistical errors and a bias, i.e., if we repeat this experiment, the average ofĤ will, in general, not be equal to H, as will also be its variance Var [Ĥ]. Notice that for computing E[Ĥ], we need only Equation (2), not the full multinomial distribution of Equation (1). However, if we want to compute this variance, we additionally need the joint marginal distribution in two boxes, in order to compute the covariances between different boxes. Notice that these covariances were not taken into account in [13,17], whence the variance estimations in these papers are, at best, approximate. In the following, we are mostly interested in the case where we are close to the limit N → ∞, M → ∞, with M/N (the average number of points per box) being finite and small. In this limit, the variance will go to zero (because essentially one averages over many boxes), but the bias will remain finite. The binomial distribution, Equation (2), can be replaced then by a Poisson distribution However, as we shall see, it is in general not good advice to jump right to this limit, even if we are close to it. More generally, we shall therefore also be interested in the case of large but finite N, where also the variance is positive, and we will discuss the balance between demanding minimal bias versus minimal variance. Indeed it is well known that the naive (or 'maximum-likelihood') estimator, obtained by assuming z i = n i without fluctuations, is negatively biased, ∆Ĥ naive < 0. In order to estimate H, we have to estimate p i ln p i or equivalently z i ln z i for each i. Since the distribution of n i depends, according to Equation (2), on z i only, we can make the rather general ansatz [4,14] for the estimator with a yet unknown function φ(n). Notice thatĤ becomes with this ansatz a sum over strictly positive values of n i . Effectively this means that we have assumed that observing an outcome n i = 0 does not give any information: if n i = 0, we do not know whether this is because of statistical fluctuations or because p i = 0 for that particular i. The resulting entropy estimator is then [14] H with the overbar indicating an average over all boxes, Its bias is with being the expectation value for a typical box (in the following we shall suppress the box index i to simplify notation, wherever this makes sense). In the following, ψ(x) = d ln Γ(x)/dx is the digamma function, and is an exponential integral (Ref. [23], paragraph 5.1.4). It was shown in [14] that which simplifies in the Poisson limit (N → ∞, z fixed) to Equations (14) and (15) are the starting points of all further analysis. In [14], it was proposed to re-write Equation (15) as where The advantages are that G n can be evaluated very easily by recursion (here γ = 0.57721 . . . is the Euler-Mascheroni constant), , and neglecting the second term, zE 1 (2z) gives an excellent approximation unless z is exceedingly small, i.e., unless the numbers of points per box are very small such that the distribution is very severely undersampled. Thus the entropy estimator proposed in [14] was simplyĤ Furthermore, since zE 1 (2z) is strictly positive, neglecting it gives a negative bias inĤ G , and one can show rigorously that this bias is smaller than that of [1,3].

Schürmann and Generalized Schürmann Estimators
The easiest way to understand the Schürmann class of estimators [15] is to define, instead of G n , a one-parameter family of functions Notice that G n (1) = G n and G n (0) = ψ(n). Let us first discuss the somewhat easier Poissonian limit, where which gives Using-to achieve greater flexibility-different parameters a i for different boxes, and neglecting the second term in the last line of Equation (20), we obtain finally by using Equation (3) Indeed, the last term in Equation (20) can always be neglected for sufficiently large a because 0 < E 1 ((1 + a)z) < exp(−(1 + a)z)/(1 + a)z for any real a > −1.
Equation (22) might suggest that using larger a i would always give an improvement because bias is reduced, but this would not take into account that larger a i might lead to larger variances. However, the optimal choices of the parameters a i are not obvious. Indeed, in spite of the ease of derivations in the Poissonian limit, it is much better to avoid it and to use the exact binomial expression.
For the general binomial case, the algebra is a bit more involved. By somewhat tedious but straightforward algebra, one finds that One immediately checks that this reduces, in the limit (N → ∞, z fixed), to Equation (20).
On the other hand, by substituting in the integral, we obtain Finally, by combining with Equation (14), we find [15] and, using again Equation (3), with a correction term which is 1/N times a sum over the integrals in Equation (26). This correction term vanishes, if all integration ranges vanish. This happens when 1 − (1 + a i )z i /N = 0 for all i, or This is a remarkable result, as it shows that in principle, there exists always an estimator which has zero bias and yet finite variance. In [15], one single parameter a was used, which is why we call our method a generalized Schürmann estimator. When all box weights are small, p i 1 for all i, then these bias-optimal values a * i are very large. However, for two boxes with p 1 = p 2 = 1/2, e.g., the bias vanishes already for a 1 = a 2 = 1, i.e., for the estimator of Grassberger [14]! In order to test the latter, we drew 10 8 triplets of random bits (i.e., N = 3, p 0 = p 1 = 1/2), and estimatedĤ naive andĤ G for each triplet. From these, we computed averages and variances, with the resultsĤ naive = 0.68867(4) bits andĤ G = 0.99995(4) bits. We should stress that the latter requires the precise form of Equation (27) to be used, with ψ(N) neither replaced by ln N nor by G N .
Since there is no free lunch, there must of course be some problems in the limit when parameters a i are chosen to be nearly bias-optimal. One problem is that one cannot, in general, choose a i according to Equation (28), because the p i is unknown. In addition, it is in this limit (and more generally when a i >> 1) that variances blow up. In order to see this, we have to discuss in more detail the properties of the functions G n (a).
According to Equation (19), G n (a) is a sum of two terms, both of which can be computed, for all positive integer n, by recursion. The digamma function ψ(n) satisfies Let us denote the second term in Equation (19) as g n (a). It satisfies the recursion g 1 (a) = − ln(1 + a), g n+1 (a) = g n (a) − (−a) n /n.
Thus, while ψ(n) is monotonic and slowly increasing, g n (a) has alternating sign and increases, for a > 1, exponentially with n. As a consequence, also G n (a) is non-monotonic and diverges exponentially with n, whenever a > 1. Therefore an estimator such asĤ opt gets, unless all n i are very small, increasingly large contributions of alternating signs. As a result, the variances will blow up, unless one is very careful to keep a balance between bias and variance.
To illustrate this, we drew tuples of independent and identically distributed binary variables {s 1 , . . . s N } with p 0 = 3/4 and p 1 = 1/4. For a 0 , we chose a 0 = a * 0 = 1/3 because this should minimize the bias and should not create problems with the variance. We should expect such problems, however, if we would take a 1 = a * 1 = 3, although this would reduce the bias to zero. Indeed we found for N = 100 that the variance of the estimator exploded for all practical purposes as soon as a 1 > 1.4, while the results were optimal for 0.5 < a 1 ≤ 1 (bias and statistical error were both < 10 −5 for 10 8 tuples). On the other hand, for pairs (N = 2), we had to use much larger values of a 1 for optimality, and a 1 = 3 gave indeed the best results (see Figure 1). A similar plot for ternary variables is shown in Figure 2, where we see again that a-values near the bias-optimal ones gave estimates with zero almost zero bias and acceptable variance for the most undersampled case N = 2. Again, using the the exact bias-optimal values would have given unacceptably large variances for large N.
The message to be learned from this is that we should always keep all a i sufficiently small such that a n i i is not much larger than 1 for any of the observed values of n i .  (27). The parameter a 0 was kept fixed at its optimal value a * 0 = 0.6, while a 1 and a 2 varied in view of possible problems with the variances. More precisely, we used a 2 = 1 + 4(a 1 − 1), so that the plot ends at the bias-free value a * 2 = 7.0 and at a value of a 1 slightly smaller than a * 1 = 2.5. For each N and each value of a 1 , 10 8 tuples were drawn. The exact entropy is 1.29879 . . . bits, and is indicated by the horizontal straight line.

Estimating Mutual and Conditional Information
Finally, we apply our estimator to two problems of mutual information (MI) estimation discussed in [22] (actually, the problems were originally proposed by previous authors, but we shall compare our results mainly to those in [22]). In each of these problems, there are two discrete random variables: X has many (several thousand) possible values, while Y is binary. Moreover, the marginal distribution of Y is uniform, p(y = 0) = p(y = 1) = 1/2, while the X distributions are highly non-uniform. Finally-and that is crucial-the joint distributions show no obvious regularities.
The MI is estimated as I(X : Y) = H(Y) − H(Y|X). Since H(Y) = 1 bit, the problem essentially burns down to estimate the conditional probabilities p(y|x). The data are given in terms of a large number of independent and identically distributed sampled pairs (x, y) (250,000 pairs for problem I, called 'PYM' in the following, and 50,000 pairs for problem II, called 'spherical' in the following). The task is to draw random subsamples of size N, to estimate the MI from each subsample, and to calculate averages and statistical widths from these estimates.
Results are shown in Figure 3. For large N, our data agree perfectly with those in [22] and in the previous papers cited in [22]. However, while the MI estimates in these previous papers all increase with decreasing N, and those in [22] stay essential constant (as we would expect, since a good entropy estimator should not depend on N, and conditional entropies should decrease with N for not so good estimators), our estimated MI decreases to zero for small N. . Estimated mutual information (in bits) of N-tuples of independent and identically distributed random subsamples from two distributions given in [22]. The data for "PYM", originally due to [24], consist of 250,000 pairs (x, y) with binary y with p(y = 0) = p(y = 1) = 1/2, and x being uniformly distributed over 4096 values. Thus each x−value is realized ≈60 times, and we classify them into 5 classes depending on the associated y−values: (i) very heavily biased toward y = 1, (ii) moderately biased toward y = 1, (iii) y−neutral, (iv) moderately biased toward y = 0, and (v) heavily biased toward y = 0. When we estimated conditional entropies H(Y|X) for randomly drawn subsamples, we kept this classification and choose a y accordingly: For class (iii) we used a 0 = a 1 = 1, for class (ii) we used a 1 = 1, a 0 = 4, for class (i) we used a 1 = 1, a 0 = 7, for class (iv) we used a 1 = 4, a 0 = 1, and finally for class (v) we used a 1 = 7, a 0 = 1. The data for "spherical", originally due to [21], consist of 50,000 (x, y) pairs. Here, Y is again binary with p(y = 0) = p(y = 1) = 1/2, but X is highly non-uniformly distributed over ≈4000 values. Again we classified these values as y−neutral or heavily/moderately biased toward or against y = 0 and used this classification to choose values of a y accordingly.
This looks at first sight like a failure of our method, but it is not. As we said, the joint distributions show no regularities. For small N most values of X will show up at most once, and if we write the sequence of y−values in a typical tuple, it will look like a perfectly random binary string. The modeler knows that it actually is not random, because there are correlations between X and Y. However, no algorithm can know this, and any good algorithm should conclude that H(Y|X) = H(Y) = 1 bit. Why, then, was this not found in the previous analyses? In all these, Bayesian estimators were used. If the priors used in these estimators were chosen in view of the special structures in the data (which are, as we should stress again, not visible from the data, as long as these are severely undersampled), then the algorithms can, of course, make use of these structures and avoid the conclusion that H(Y|X) = 1 bit.

Conclusions
In conclusion, we gave an entropy estimator with zero bias and finite variance. It builds on an estimator by Schürmann [15], which itself is a generalization of [14]. It involves a real-valued parameter for each possible realization of the random variable, and bias is reduced to zero by choosing these parameters properly. However, this choice would require that we know already the distribution, which is of course not the case. Nevertheless we can reduce the bias very much for severely undersampled cases. In cases with moderate undersampling, choosing these zero-bias parameters would give very large variances and would thus be useless. Nevertheless, by judicious parameter choices, we can obtain extremely good entropy estimates. Finding good parameters is non-trivial, but is made less difficult by the fact that the method is very fast.
Finally, we pointed out that Bayesian methods, which have been very popular in this field, have the danger of choosing "too good" priors, i.e., choosing priors which are not justified by the data themselves and are thus misleading, although both the bias and the observed variances seem to be small. I thank Thomas Schürmann for the numerous discussions, and Damián Hernández for both discussions and for sending me the data for Figure 3.