Abstract
We present a new class of estimators of Shannon entropy for severely undersampled discrete distributions. It is based on a generalization of an estimator proposed by T. Schürmann, which itself is a generalization of an estimator proposed by myself. For a special set of parameters, the new estimators are completely free of bias and have a finite variance, something which is widely believed to be impossible. We also present detailed numerical tests, in which we compare them with other recent estimators and with exact results, and we point out a clash with Bayesian estimators for mutual information.
1. Introduction
It is well known that estimating (Shannon) entropies from finite samples is not trivial. If one naively replaces the probability $p_i$ to be in “box” i by the observed frequency $n_i/N$, statistical fluctuations tend to make the distribution look less uniform, which leads to an underestimation of the entropy. There have been numerous proposals on how to estimate and eliminate this bias [,,,,,,,,,,,,,,,,,,,,,]. Some make quite strong assumptions [,]; others use Bayesian methods [,,,,,]. As pointed out in [,,,], one can devise estimators with arbitrarily small bias (for sufficiently large N and a fixed number of boxes), but these will then have very large statistical errors. As conjectured in [,,,,], any estimator whose bias vanishes will have a diverging variance.
Another widespread belief is that Bayesian entropy estimators cannot be outperformed by non-Bayesian ones for severely undersampled cases. The problem with Bayesian estimators is of course that they depend on a good choice of prior distributions, which is not always easy, and they tend to be slow. One positive feature of the non-Bayesian estimator proposed in [] is that it is extremely fast, since it works precisely like the ‘naive’ (or maximum-likelihood) estimator, except that the logarithms used there are replaced by a function defined on the integers, which can be precomputed by means of a simple recursion. While the estimator of [] seems in general to be a reasonable compromise between bias and variance, it was shown in [] that it can be improved—as far as bias is concerned, at the cost of increased variance—by generalizing it to a one-parameter family of functions.
In the present paper, we show that the Grassberger–Schürmann approach [,] can be further improved by using a different function (i.e., a different parameter value) for each realization i of the random variable. Indeed, the parameters can be chosen such that the estimator is completely free of bias and yet has a finite variance—although, to be honest, the optimal parameters can be found only if the exact distribution is known (in which case the entropy could also be computed exactly). We show that—even if the exact, optimal parameters are not known—the new estimator can reduce the bias very much, without inducing unduly large variances, provided the distribution is sufficiently strongly undersampled.
We test the proposed estimator numerically with simple examples, where we produce bias-free entropy estimates, e.g., from pairs of ternary variables, something which, to my knowledge, is not possible with any Bayesian method. We also use it for estimating mutual information (MI) in cases where one of the two variables is binary, and the other one can take very many values. In the limit of severe undersampling and of no obvious regularity in the true probabilities, MI cannot be estimated unambiguously. In that limit, the present algorithm seems to choose systematically a different outcome from Bayesian methods for reasons that are not yet clear.
2. Basic Formalism
In the following, we use the notation of []. As in this reference, we consider $M$ “boxes” (states, possible experimental outcomes, etc.) and $N$ points (samples, events, and particles) distributed randomly and independently into the boxes. We assume that each box has weight $p_i$ ($i = 1, \ldots, M$) with $\sum_{i=1}^{M} p_i = 1$. Thus each box i will contain a random number $n_i$ of points, with $\sum_{i=1}^{M} n_i = N$. Their joint distribution is multinomial,
$$
P(n_1, \ldots, n_M) = \frac{N!}{n_1! \cdots n_M!}\; p_1^{n_1} \cdots p_M^{n_M}, \qquad (1)
$$
while the marginal distribution in box i is binomial,
$$
P_i(n) = \binom{N}{n}\, p_i^{\,n}\, (1 - p_i)^{N-n}. \qquad (2)
$$
Our aim is to estimate the entropy
$$
H = -\sum_{i=1}^{M} p_i \ln p_i \qquad (3)
$$
from an observation of the numbers $n_1, \ldots, n_M$ (in the following, all entropies are measured in “natural units”, i.e., nats, not in bits). The estimator $\hat H$ will of course have both statistical errors and a bias, i.e., if we repeat this experiment, the average of $\hat H$ will, in general, not be equal to H,
$$
\Delta H \equiv \langle \hat H \rangle - H \neq 0, \qquad (4)
$$
and its variance $\sigma^2(\hat H) \equiv \langle \hat H^2 \rangle - \langle \hat H \rangle^2$ will, in general, also be nonzero. Notice that for computing the bias, we need only Equation (2), not the full multinomial distribution of Equation (1). However, if we want to compute the variance, we additionally need the joint marginal distribution in two boxes,
$$
P_{ij}(n,m) = \frac{N!}{n!\, m!\, (N-n-m)!}\; p_i^{\,n}\, p_j^{\,m}\, (1 - p_i - p_j)^{N-n-m}, \qquad (5)
$$
in order to compute the covariances between different boxes. Notice that these covariances were not taken into account in [,], whence the variance estimations in these papers are, at best, approximate.
In the following, we are mostly interested in the case where we are close to the limit $N \to \infty$, $M \to \infty$, with $N/M$ (the average number of points per box) being finite and small. In this limit, the variance will go to zero (because essentially one averages over many boxes), but the bias will remain finite. The binomial distribution, Equation (2), can then be replaced by a Poisson distribution
$$
P_i(n) = e^{-z_i}\, \frac{z_i^{\,n}}{n!}, \qquad z_i = N p_i. \qquad (6)
$$
However, as we shall see, it is in general not advisable to jump right to this limit, even if we are close to it. More generally, we shall therefore also be interested in the case of large but finite N, where the variance is also positive, and we will discuss the balance between demanding minimal bias versus minimal variance.
Indeed, it is well known that the naive (or ‘maximum-likelihood’) estimator, obtained by assuming $p_i = n_i/N$ without fluctuations,
$$
\hat H_{\rm naive} = -\sum_{i=1}^{M} \frac{n_i}{N}\, \ln \frac{n_i}{N}, \qquad (7)
$$
is negatively biased, $\langle \hat H_{\rm naive} \rangle \le H$.
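To get a feeling for the size of this effect, the negative bias of the naive estimator can be checked with a few lines of Python; the distribution, sample size, and number of trials below are arbitrary illustrative choices, not taken from the tests reported later.

```python
import numpy as np

rng = np.random.default_rng(0)

def naive_entropy(counts):
    """Naive (maximum-likelihood) estimator, Equation (7), in nats."""
    N = counts.sum()
    f = counts[counts > 0] / N          # observed frequencies n_i / N
    return -np.sum(f * np.log(f))

# true distribution: M = 100 boxes with uniform weights p_i = 1/M
M, N, trials = 100, 50, 10_000
p = np.full(M, 1.0 / M)
H_exact = -np.sum(p * np.log(p))        # = ln M

estimates = [naive_entropy(rng.multinomial(N, p)) for _ in range(trials)]
print(f"exact H     = {H_exact:.4f} nats")
print(f"<H_naive>   = {np.mean(estimates):.4f} nats  (clearly below the exact value)")
print(f"std. dev.   = {np.std(estimates):.4f} nats")
```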
In order to estimate H, we have to estimate $p_i \ln p_i$, or equivalently $p_i \ln(1/p_i)$, for each i. Since the distribution of $n_i$ depends, according to Equation (2), on $p_i$ only, we can make the rather general ansatz [,] for the estimator
$$
\widehat{p_i \ln(1/p_i)} = \frac{n_i}{N}\, \phi(n_i), \qquad (8)
$$
with a yet unknown function $\phi(n)$. Notice that $\hat H$ becomes with this ansatz a sum over strictly positive values of $n_i$. Effectively, this means that we have assumed that observing an outcome $n_i = 0$ does not give any information: if $n_i = 0$, we do not know whether this is because of statistical fluctuations or because $p_i = 0$ for that particular i.
The resulting entropy estimator is then []
$$
\hat H_\phi = \sum_{i=1}^{M} \frac{n_i}{N}\, \phi(n_i) \equiv M\, \overline{\frac{n}{N}\phi(n)}, \qquad (9)
$$
with the overbar indicating an average over all boxes,
$$
\overline{f(n)} \equiv \frac{1}{M} \sum_{i=1}^{M} f(n_i). \qquad (10)
$$
Its bias is
$$
\Delta H_\phi = \sum_{i=1}^{M} \left[ \left\langle \frac{n_i}{N}\, \phi(n_i) \right\rangle + p_i \ln p_i \right], \qquad (11)
$$
with
$$
\left\langle \frac{n_i}{N}\, \phi(n_i) \right\rangle = \sum_{n=1}^{N} P_i(n)\, \frac{n}{N}\, \phi(n) \qquad (12)
$$
being the expectation value for a typical box (in the following, we shall suppress the box index i to simplify the notation wherever this makes sense).
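In code, the whole class of estimators defined by Equation (9) reduces to a one-line plug-in rule. The following sketch assumes nothing beyond that form; the function passed in for $\phi$ is a placeholder (the digamma-based choice in the example merely illustrates the interface and is not the optimized estimator discussed below).

```python
import numpy as np
from scipy.special import digamma

def entropy_phi(counts, phi):
    """Plug-in estimator of Equation (9): sum_i (n_i/N) * phi(n_i), in nats.

    `counts` are the observed box counts n_i; boxes with n_i = 0 are
    skipped, i.e., they are assumed to carry no information (see text).
    """
    counts = np.asarray(counts)
    N = counts.sum()
    n = counts[counts > 0]
    return np.sum(n / N * phi(n))

# example call with an illustrative, digamma-based phi
counts = np.array([3, 1, 0, 2, 0, 1, 1])
N = counts.sum()
print(entropy_phi(counts, lambda n: np.log(N) - digamma(n)))
```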
In the following, $\psi(x) = \mathrm{d}\ln\Gamma(x)/\mathrm{d}x$ is the digamma function, and
is an exponential integral (Ref. [], paragraph 5.1.4). It was shown in [] that
which simplifies in the Poisson limit ($N \to \infty$, $z = Np$ fixed) to
Equations (14) and (15) are the starting points of all further analysis. In [], it was proposed to re-write Equation (15) as
where
The advantages are that $G_n$ can be evaluated very easily by recursion (here $\gamma$ is the Euler–Mascheroni constant), and that neglecting the second term gives an excellent approximation unless z is exceedingly small, i.e., unless the number of points per box is so small that the distribution is very severely undersampled. Thus, the entropy estimator proposed in [] was simply
Furthermore, since the neglected term is strictly positive, neglecting it gives a negative bias in the resulting estimator, and one can show rigorously that this bias is smaller than that of [,].
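For concreteness, here is a sketch of this estimator. It assumes the recursion for $G_n$ given in Grassberger (2003), namely $G_1 = -\gamma - \ln 2$, $G_2 = G_1 + 2$, $G_{2n+1} = G_{2n}$, and $G_{2n+2} = G_{2n} + 2/(2n+1)$, together with the plug-in form $\hat H_G = \ln N - \frac{1}{N}\sum_i n_i G_{n_i}$ used there; this is equivalent to choosing $\phi(n) = \ln N - G_n$ in the sketch above.

```python
import numpy as np

EULER_GAMMA = 0.5772156649015329

def G_table(n_max):
    """Precompute G_1 ... G_{n_max} by the recursion of Grassberger (2003)."""
    G = np.empty(max(n_max, 2) + 1)
    G[1] = -EULER_GAMMA - np.log(2.0)
    G[2] = G[1] + 2.0
    for m in range(2, n_max, 2):        # even m: G_{m+1} = G_m, G_{m+2} = G_m + 2/(m+1)
        G[m + 1] = G[m]
        if m + 2 <= n_max:
            G[m + 2] = G[m] + 2.0 / (m + 1)
    return G

def entropy_G(counts):
    """Plug-in estimator with phi(n) = ln N - G_n, as proposed in Grassberger (2003); in nats."""
    counts = np.asarray(counts)
    N = counts.sum()
    n = counts[counts > 0]
    G = G_table(n.max())
    return np.log(N) - np.sum(n * G[n]) / N

counts = np.array([3, 1, 0, 2, 0, 1, 1])
print(entropy_G(counts))
```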
3. Schürmann and Generalized Schürmann Estimators
The easiest way to understand the Schürmann class of estimators [] is to define, instead of a single function, a one-parameter family of functions.
Notice that the two functions used in [,] (the plain digamma function and $G_n$) are recovered as special cases for particular values of the parameter a.
Let us first discuss the somewhat easier Poissonian limit, where
which gives
Using—to achieve greater flexibility—different parameters $a_i$ for different boxes, and neglecting the second term in the last line of Equation (20), we obtain finally, by using Equation (3),
Indeed, the last term in Equation (20) can always be neglected for sufficiently large a.
Equation (22) might suggest that using larger $a_i$ would always give an improvement because the bias is reduced, but this would not take into account that larger $a_i$ might also lead to larger variances. However, the optimal choices of the parameters are not obvious. Indeed, in spite of the ease of the derivations in the Poissonian limit, it is much better to avoid it and to use the exact binomial expression.
For the general binomial case, the calculation is a bit more involved. By somewhat tedious but straightforward algebra, one finds that
One immediately checks that this reduces, in the limit ($N \to \infty$, z fixed), to Equation (20). On the other hand, by substituting
in the integral, we obtain
Finally, by combining with Equation (14), we find []
and, using again Equation (3),
with a correction term proportional to a sum over the integrals in Equation (26). This correction term vanishes if all integration ranges vanish. This happens when, for all i,
This is a remarkable result, as it shows that, in principle, there always exists an estimator which has zero bias and yet finite variance. In [], one single parameter a was used for all boxes, which is why we call our method a generalized Schürmann estimator.
When all box weights are small, $p_i \ll 1$ for all i, these bias-optimal values are very large. For two boxes with $p_1 = p_2 = 1/2$, however, the bias vanishes already for the parameter value that reproduces the estimator of Grassberger []!
In order to test the latter, we drew triplets of random bits (i.e., $N = 3$, $M = 2$, $p_1 = p_2 = 1/2$) and estimated H and $H^2$ for each triplet. From these, we computed averages and variances, with results (quoted in bits) fully consistent with the absence of any bias. We should stress that this requires the precise form of Equation (27) to be used, without any further approximation.
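The test protocol itself is simple and can be sketched as follows; any estimator, e.g., the `entropy_G` routine sketched in Section 2, can be passed in, and it is used below only as a placeholder, not as the bias-optimal estimator of Equation (27).

```python
import numpy as np

rng = np.random.default_rng(1)

def bias_variance(estimate, p, N, trials=100_000):
    """Monte Carlo estimate of bias and variance of an entropy estimator.

    `estimate` maps a vector of counts to an entropy estimate (in nats),
    `p` is the true box distribution, `N` the sample size per experiment.
    """
    H_exact = -np.sum(p[p > 0] * np.log(p[p > 0]))
    vals = np.array([estimate(rng.multinomial(N, p)) for _ in range(trials)])
    return vals.mean() - H_exact, vals.var()

# e.g., triplets of fair random bits: M = 2, p = (1/2, 1/2), N = 3
p = np.array([0.5, 0.5])
bias, var = bias_variance(entropy_G, p, N=3)   # entropy_G as sketched in Section 2
print(f"bias = {bias / np.log(2):.4f} bits, variance = {var / np.log(2)**2:.4f} bits^2")
```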
Since there is no free lunch, there must of course be some problems when the parameters are chosen to be nearly bias-optimal. One problem is that one cannot, in general, choose them according to Equation (28), because the true distribution is unknown. In addition, it is in this limit (and, more generally, whenever the parameters become large) that variances blow up. In order to see this, we have to discuss in more detail the properties of the functions defined in Equation (19).
According to Equation (19), each of these functions is a sum of two terms, both of which can be computed, for all positive integers n, by recursion. The digamma function satisfies
$$
\psi(1) = -\gamma, \qquad \psi(n+1) = \psi(n) + \frac{1}{n}. \qquad (29)
$$
Let us denote the second term in Equation (19) by $J_a(n)$. It satisfies the recursion
Thus, while $\psi(n)$ is monotonic and slowly increasing, $J_a(n)$ has alternating sign and increases, for $a > 1$, exponentially with n. As a consequence, the whole function is also non-monotonic and diverges exponentially with n whenever $a > 1$. Therefore, an estimator built on such functions receives, unless all $n_i$ are very small, increasingly large contributions of alternating signs. As a result, the variances will blow up, unless one is very careful to keep a balance between bias and variance.
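To make this growth explicit, assume (following Schürmann's construction) that the second term is $J_a(n) = (-1)^n I_a(n)$ with $I_a(n) = \int_0^a t^{n-1}/(1+t)\,\mathrm{d}t$; this specific form is an assumption here, but it obeys the elementary recursion $I_a(1) = \ln(1+a)$, $I_a(n+1) = a^n/n - I_a(n)$, which already exhibits the alternating, roughly geometrically growing contributions for $a > 1$. A small numerical check:

```python
import numpy as np
from scipy.integrate import quad

def I_recursive(a, n_max):
    """I_a(n) = integral_0^a t^(n-1)/(1+t) dt via I_a(n+1) = a^n/n - I_a(n)."""
    I = np.empty(n_max + 1)
    I[1] = np.log1p(a)
    for n in range(1, n_max):
        I[n + 1] = a**n / n - I[n]
    return I[1:]

a, n_max = 2.0, 12
I_rec = I_recursive(a, n_max)
I_quad = [quad(lambda t, k=n: t**(k - 1) / (1 + t), 0.0, a)[0] for n in range(1, n_max + 1)]

for n, (r, q) in enumerate(zip(I_rec, I_quad), start=1):
    # with the sign factor (-1)^n, the contribution alternates and grows ~ a^n/n for a > 1
    print(f"n={n:2d}  I_a(n)={r:10.5f}  quad={q:10.5f}  (-1)^n*I_a(n)={(-1)**n * r:10.5f}")
```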
To illustrate this, we drew N-tuples of independent and identically distributed binary variables with probabilities p and 1 − p. For one of the two boxes, we chose its parameter so as to minimize the bias without creating problems with the variance. We should expect such problems, however, if we took the exactly bias-optimal values, although this would reduce the bias to zero. Indeed, we found that the variance of the estimator exploded for all practical purposes once N became large, while for suitably chosen parameters both bias and statistical error were very small. On the other hand, for pairs (N = 2), we had to use much larger values of a for optimality, and parameter values near the bias-optimal ones indeed gave the best results (see Figure 1). A similar plot for ternary variables is shown in Figure 2, where we see again that a-values near the bias-optimal ones gave estimates with almost zero bias and acceptable variance in the most undersampled case. Again, using the exact bias-optimal values would have given unacceptably large variances for large N.
Figure 1.
Estimated entropies (in bits) of N-tuples of independent and identically distributed random binary variables with probabilities p and 1 − p, using the optimized estimator defined in Equation (27). The parameter for one of the two boxes was kept fixed at its bias-optimal value, while the other is varied in view of possible problems with the variances and is plotted on the horizontal axis. For each N and each parameter value, a large number of tuples was drawn. The exact entropy is indicated by the horizontal straight line.
Figure 2.
Estimated entropies (in bits) of N-tuples of independent and identically distributed random ternary variables with fixed probabilities $p_1$, $p_2$, and $p_3$, using the optimized estimator defined in Equation (27). The parameter for one of the boxes was kept fixed at its bias-optimal value, while the other two were varied in view of possible problems with the variances, such that the plot ends at the bias-free values. For each N and each parameter value, a large number of tuples was drawn. The exact entropy is indicated by the horizontal straight line.
The message to be learned from this is that we should always keep all $a_i$ sufficiently small that $a_i^{n_i}$ is not much larger than 1 for any of the observed values of $n_i$.
4. Estimating Mutual and Conditional Information
Finally, we apply our estimator to two problems of mutual information (MI) estimation discussed in [] (actually, the problems were originally proposed by previous authors, but we shall compare our results mainly to those in []). In each of these problems, there are two discrete random variables: X has many (several thousand) possible values, while Y is binary. Moreover, the marginal distribution of Y is uniform, $p(y) = 1/2$ for both values of y, while the X distributions are highly non-uniform. Finally—and this is crucial—the joint distributions show no obvious regularities.
The MI is estimated as $I(X,Y) = H(Y) - H(Y|X)$. Since $H(Y) = 1$ bit, the problem essentially boils down to estimating the conditional probabilities $p(y|x)$. The data are given in terms of a large number of independent and identically distributed sampled pairs (250,000 pairs for problem I, called ‘PYM’ in the following, and 50,000 pairs for problem II, called ‘spherical’ in the following). The task is to draw random subsamples of size N, to estimate the MI from each subsample, and to calculate averages and statistical widths from these estimates.
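Schematically, the subsampling protocol can be organized as follows. The decomposition $H(Y|X) = \sum_x \hat p(x)\, H(Y|X{=}x)$ with the empirical $\hat p(x)$ is one natural bookkeeping choice (not necessarily the one used for Figure 3), and `entropy` stands for whichever single-variable estimator (e.g., one of those sketched above) is plugged in.

```python
import numpy as np

rng = np.random.default_rng(2)
LN2 = np.log(2.0)

def mi_subsample(x, y, N, entropy, trials=1000):
    """Estimate I(X;Y) = H(Y) - H(Y|X) (in bits) from random subsamples of size N.

    `x` is an integer array with many possible values, `y` an array of 0/1 labels,
    and `entropy` maps a count vector to an entropy estimate in nats.
    """
    estimates = []
    for _ in range(trials):
        idx = rng.choice(len(x), size=N, replace=False)
        xs, ys = x[idx], y[idx]
        # H(Y) from the marginal counts of the binary variable
        H_y = entropy(np.bincount(ys, minlength=2))
        # H(Y|X) = sum_x (n_x/N) * H(Y | X = x), estimated box by box
        H_y_given_x = 0.0
        for xv in np.unique(xs):
            sel = ys[xs == xv]
            H_y_given_x += len(sel) / N * entropy(np.bincount(sel, minlength=2))
        estimates.append((H_y - H_y_given_x) / LN2)
    estimates = np.array(estimates)
    return estimates.mean(), estimates.std()
```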
Results are shown in Figure 3. For large N, our data agree perfectly with those in [] and in the previous papers cited in []. However, while the MI estimates in these previous papers all increase with decreasing N, and those in [] stay essentially constant (as we would expect, since a good estimate should not depend on N, while estimators that increasingly underestimate the conditional entropy at small N make the MI appear to grow), our estimated MI decreases to zero for small N.
Figure 3.
Estimated mutual information (in bits) of random subsamples of size N drawn from the two data sets given in []. The data for “PYM”, originally due to [], consist of 250,000 pairs with binary y with $p(y) = 1/2$, and x being uniformly distributed over 4096 values. Thus each x value is realized ≈60 times, and we classify these values into 5 classes depending on the associated conditional probabilities: (i) very heavily biased toward one y value, (ii) moderately biased toward that value, (iii) neutral, (iv) moderately biased toward the other value, and (v) heavily biased toward the other value. When we estimated conditional entropies for randomly drawn subsamples, we kept this classification and chose the parameters accordingly, using a different value for each class. The data for “spherical”, originally due to [], consist of 50,000 pairs. Here, Y is again binary with $p(y) = 1/2$, but X is highly non-uniformly distributed over ≈4000 values. Again, we classified these values as neutral or heavily/moderately biased toward one of the two y values and used this classification to choose the values of the parameters accordingly.
This looks at first sight like a failure of our method, but it is not. As we said, the joint distributions show no regularities. For small N, most values of X will show up at most once, and if we write down the sequence of y values in a typical tuple, it will look like a perfectly random binary string. The modeler knows that it actually is not random, because there are correlations between X and Y. However, no algorithm can know this, and any good algorithm should conclude that the MI is compatible with zero. Why, then, was this not found in the previous analyses? In all of these, Bayesian estimators were used. If the priors used in these estimators were chosen in view of the special structures in the data (which are, as we should stress again, not visible from the data, as long as these are severely undersampled), then the algorithms can, of course, make use of these structures and avoid the conclusion that the MI is zero.
5. Conclusions
In conclusion, we gave an entropy estimator with zero bias and finite variance. It builds on an estimator by Schürmann [], which itself is a generalization of []. It involves a real-valued parameter for each possible realization of the random variable, and the bias is reduced to zero by choosing these parameters properly. However, this choice would require that we already know the distribution, which is of course not the case. Nevertheless, we can reduce the bias very much for severely undersampled cases. In cases with moderate undersampling, choosing these zero-bias parameters would give very large variances and would thus be useless. Even then, by judicious parameter choices, we can obtain extremely good entropy estimates. Finding good parameters is non-trivial, but this is made less difficult by the fact that the method is very fast.
Finally, we pointed out that Bayesian methods, which have been very popular in this field, carry the danger of “too good” priors, i.e., priors which are not justified by the data themselves and are thus misleading, even though both the bias and the observed variances seem to be small.
Acknowledgments
I thank Thomas Schürmann for numerous discussions, and Damián Hernández for both discussions and for sending me the data for Figure 3.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
Not applicable.
Conflicts of Interest
The author declares no conflict of interest.
References
- Miller, G. Note on the bias of information estimates. In Information Theory in Psychology II-B; Quastler, H., Ed.; Free Press: Glencoe, IL, USA, 1955; pp. 95–100.
- Harris, B. Colloquia Mathematica Societatis János Bolyai: Infinite and Finite Sets 1973, 175.
- Herzel, H. Complexity of symbol sequences. Syst. Anal. Mod. Sim. 1988, 5, 435.
- Grassberger, P. Finite sample corrections to entropy and dimension estimates. Phys. Lett. A 1988, 128, 369.
- Schmitt, A.O.; Herzel, H.; Ebeling, W. A new method to calculate higher-order entropies from finite samples. Europhys. Lett. 1993, 23, 303.
- Wolpert, D.H.; Wolf, D.R. Estimating functions of probability distributions from a finite set of samples. Phys. Rev. E 1995, 52, 6841.
- Pöschel, T.; Ebeling, W.; Rosé, H. Guessing probability distributions from small samples. J. Stat. Phys. 1995, 80, 1443.
- Panzeri, S.; Treves, A. Analytical estimates of limited sampling biases in different information measures. Netw. Comput. Neural Syst. 1996, 7, 87.
- Schürmann, T.; Grassberger, P. Entropy estimation of symbol sequences. Chaos 1996, 6, 414.
- Strong, S.; Koberle, R.; de Ruyter van Steveninck, R.; Bialek, W. Entropy and information in neural spike trains. Phys. Rev. Lett. 1998, 80, 197–200.
- Holste, D.; Grosse, I.; Herzel, H. Bayes’ estimators of generalized entropies. J. Phys. A 1998, 31, 2551.
- Nemenman, I.; Shafee, F.; Bialek, W. Entropy and inference, revisited. In Advances in Neural Information Processing Systems 14; Dietterich, T., Becker, S., Ghahramani, Z., Eds.; MIT Press: Cambridge, MA, USA, 2002.
- Paninski, L. Estimation of entropy and mutual information. Neural Comput. 2003, 15, 1191.
- Grassberger, P. Entropy estimates from insufficient samplings. arXiv 2003, arXiv:physics/0307138.
- Schürmann, T. Bias analysis in entropy estimation. J. Phys. A Math. Gen. 2004, 37, L295.
- Vu, V.Q.; Yu, B.; Kass, R.E. Coverage-adjusted entropy estimation. Stat. Med. 2007, 26, 4039.
- Bonachela, J.A.; Hinrichsen, H.; Muñoz, M.A. Entropy estimates of small data sets. J. Phys. A Math. Gen. 2008, 41, 202001.
- Hausser, J.; Strimmer, K. Entropy inference and the James–Stein estimator, with application to nonlinear gene association networks. J. Mach. Learn. Res. 2009, 10, 1469.
- Wolpert, D.H.; DeDeo, S. Estimating functions of distributions defined over spaces of unknown size. Entropy 2013, 15, 4668.
- Chao, A.; Wang, Y.; Jost, L. Entropy and the species accumulation curve: A novel entropy estimator via discovery rates of new species. Methods Ecol. Evol. 2013, 4, 1091.
- Archer, E.; Park, I.M.; Pillow, J.W. Bayesian entropy estimation for countable discrete distributions. J. Mach. Learn. Res. 2014, 15, 2833.
- Hernández, D.G.; Samengo, I. Estimating the Mutual Information between Two Discrete, Asymmetric Variables with Limited Samples. Entropy 2019, 21, 623.
- Abramowitz, M.; Stegun, I. (Eds.) Handbook of Mathematical Functions; Dover: New York, NY, USA, 1965.
- Shwartz-Ziv, R.; Tishby, N. Opening the black box of Deep Neural Networks via Information. arXiv 2017, arXiv:1703.00810.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).