We perform an asymptotic analysis of the NSB estimator of entropy of a discrete random variable. The analysis illuminates the dependence of the estimates on the number of coincidences in the sample and shows that the estimator has a well defined limit for a large cardinality of the studied variable. This allows estimation of entropy with no a priori assumptions about the cardinality. Software implementation of the algorithm is available.
entropy estimation; coincidences; bias-variance tradeoff; model selection
MSC 94A17, 62F12, 62F15
Estimation of functions of a discrete random variable with an unknown probability distribution is one of the simplest problems in statistics. However, the simplicity vanishes in an extremely undersampled regime, where K, the cardinality or the alphabet size of the variable, is much larger than N, the number of samples. In this case, the average number of samples per possible outcome, or bin, is less than one, and the relative uncertainty about the underlying probability distribution and its various statistics is large. To decrease the posterior error, one may turn to Bayesian statistics and bias the set of a priori admissible distributions. However, finding an optimal bias-variance tradeoff is not easy. For severely undersampled cases, controlling the variance often makes an estimator a function of the prior, rather than of the measured data.
This is often the case for inference of the Boltzmann-Shannon entropy, $H = -\sum_{i=1}^{K} q_i \ln q_i$ (here $q_i$ is the probability of an event i), an important characteristic of a discrete variable. In this paper, all logarithms are natural, and the unit of entropy is nat. Simple estimators of entropy have low variances but high biases that are difficult to calculate due to the divergence of the logarithm near zero . Developments driven in part by computational biology applications have solved this problem in the moderately undersampled regime, and [1,2,3,4,5,6,7,8,9]. Interestingly, they also resulted in the understanding that it is impossible to estimate entropy with zero bias uniformly over all distributions for a smaller N. However, Ma has argued  that, since coincidences in data start to occur at , it is possible to estimate entropies even in the deeply undersampled regime, at least for some classes of probability distributions, such as uniform ones. Similar arguments are well-known in the literature on estimation of population sizes from capture-recapture data (see, e.g.,  for recent developments). There it has been recognized that the population size (and the population entropy) can be estimated long before every possible individual outcome has been sampled with a high probability .
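To make the bias of simple estimators concrete, the following minimal Python sketch computes the plug-in (maximum likelihood) entropy estimate in nats from a histogram of counts. The function name `plugin_entropy` is a hypothetical choice made for this illustration only. Because the estimate can never exceed ln N, it necessarily falls below the entropy ln K of a uniform distribution whenever N is much smaller than K.

```python
import math

def plugin_entropy(counts):
    """Plug-in (maximum likelihood) entropy estimate in nats.

    counts: iterable of non-negative integer bin counts.
    Empty bins contribute nothing to the sum.
    """
    total = sum(counts)
    return -sum((n / total) * math.log(n / total) for n in counts if n > 0)

# 100 samples that all land in different bins of a K = 10**4 alphabet:
# the estimate equals ln(100), far below ln(10**4) for a uniform source.
estimate = plugin_entropy([1] * 100)
print(estimate, math.log(10**4))
```

The example uses a sample with no coincidences, which is the typical outcome when N is well below the square root of K; the plug-in estimate then reports only the sample size, not a property of the source.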
In 2002, Nemenman, Shafee and Bialek introduced a method for entropy estimation, hereafter called NSB . While the estimator has proven successful in the Ma square-root regime [14,15], a theoretical basis for the success has not been presented in the literature. Here we review the method and perform its asymptotic analysis. We verify the intuition that the estimator works in the Ma regime by counting coincidences. We point out that the method can be viewed as finding the number of yet unseen bins with nonzero probability given K, the maximum cardinality of the variable. While estimation of K by model selection techniques cannot work (see below), we show that the method has a non-trivial limit as . Thus one should be able to calculate entropies of discrete random variables even without knowing their cardinalities. Our analysis allows for an efficient numerical implementation of the NSB estimator, which we have made available from .
2. Summary of the NSB Method
We use Bayes' rule to express the posterior probability of a probability distribution of a discrete random variable with the help of its a priori probability, . Thus if i.i.d. samples from are observed in bin i, such that , then the posterior, , is
Following , we focus on the popular Dirichlet family of priors, indexed by a hyperparameter β:
Here the δ-function and enforce normalizations of and , respectively; and Γ stands for Euler’s Γ-function. These priors are common in statistics since they result in analytically tractable multinomial posteriors. For example, Wolpert and Wolf  calculated posterior averages, here denoted as , of many interesting quantities, including the distribution itself,
and the moments of its entropy, which we will not reprint here.
According to Equation (3), Dirichlet priors add extra β samples (pseudocounts) to each bin. Thus for , the data are unimportant, on average, and is dominated by almost uniform distributions, . Then the posterior mean of the entropy is strongly biased upward to its maximum possible value of . Similarly, for , distributions in the vicinity of the frequentist’s maximum likelihood estimate, , are important, and is biased downward .
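The pseudocount picture can be sketched in a few lines of Python. The sketch assumes that the posterior mean of the distribution takes the standard Dirichlet form $(n_i + \beta)/(N + K\beta)$, which is what adding β pseudocounts to each of the K bins amounts to; the function name `dirichlet_posterior_mean` is hypothetical.

```python
def dirichlet_posterior_mean(counts, beta):
    """Posterior mean of each bin probability under a symmetric Dirichlet prior.

    Assumed form: (n_i + beta) / (N + K * beta), i.e. beta pseudocounts per bin.
    counts must list all K bins, including the empty ones.
    """
    K = len(counts)
    N = sum(counts)
    return [(n + beta) / (N + K * beta) for n in counts]

counts = [3, 1, 0, 0]
print(dirichlet_posterior_mean(counts, 1e-6))  # close to the frequencies n_i / N
print(dirichlet_posterior_mean(counts, 1.0))   # one pseudocount per bin
print(dirichlet_posterior_mean(counts, 1e6))   # close to uniform, 1 / K
```

The three calls illustrate the two limits discussed above: a very small β reproduces the maximum likelihood frequencies, while a very large β washes out the data and returns an almost uniform distribution.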
 traced this problem to properties of the Dirichlet family. Its members encode reasonable a priori assumptions about , but not about . Instead, a priori assumptions about the entropy are strongly biased, as seen from the a priori moments:
Here are the polygamma functions. varies smoothly from 0 for , through 1 for , and to for . for almost all β , which is negligibly small for large K. Thus that is typical in has the entropy extremely close to some predetermined β-dependent value. This bias persists even when data are collected.
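The limiting values quoted above can be checked numerically. The sketch below assumes that the a priori mean entropy has the standard form $\xi(\beta) = \psi_0(K\beta + 1) - \psi_0(\beta + 1)$ used for Dirichlet priors; both this form and the helper names `digamma` and `xi` are assumptions made for illustration. The digamma function is implemented with the usual recurrence and asymptotic series so that only the standard library is needed.

```python
import math

def digamma(x):
    """Digamma function psi_0(x) for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:              # shift the argument upward: psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # psi(x) ~ ln x - 1/(2x) - 1/(12 x^2) + 1/(120 x^4) - 1/(252 x^6)
    result += math.log(x) - 0.5 * inv - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result

def xi(beta, K):
    """Assumed a priori mean entropy (in nats) of a Dirichlet prior with parameter beta."""
    return digamma(K * beta + 1.0) - digamma(beta + 1.0)

K = 10**6
for beta in (1e-12, 1.0 / K, 1.0, 1e6):
    print(beta, xi(beta, K), math.log(K))
```

Under this assumed form, the printed values move from nearly 0 for a vanishing β, through approximately 1 at β = 1/K, toward ln K for very large β.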
One should strive for the a priori distribution of entropy, , to be approximately uniform to have a chance for an unbiased estimator. NSB achieves the uniformity (but not necessarily zero bias) by noting that, following Equations (4) and (5), for large K, is almost a δ-function. Thus a prior that averages over all non-negative values of β (and, correspondingly, over all a ) may reduce the bias in the entropy estimation even for .  proposed the following infinite mixture of Dirichlet priors  for the averaging:
Here Z is again a normalizing coefficient, and ensures uniformity for ξ, rather than for β. A non-constant prior on β, , may be used if needed, but we will not focus on this term from now on. Such Dirichlet mixture results in , introducing biases in estimation of as a tradeoff for a possibly accurate estimation of H.
Inference with the prior, Equation (6), involves additional averaging over β (or, equivalently, ξ). The a posteriori moments of the entropy are
where the unnormalized posterior density is
Note that, for , . Thus if we choose , then the a priori assumptions about ξ are exactly uniform, as we had hoped to achieve. We note again that the uniformity of the prior is not equivalent to zero posterior bias.
An additional reason for the choice of averaging over the model families, as in Equation (6), is provided by the theory of Bayesian model selection [13,19,20,21,22,23]. Specifically, families of probabilistic models of data that incorporate more models (have larger volumes in the model space) usually have high explanatory powers and include some models that are very likely a posteriori. However, they also include many extremely unlikely models, and the posterior probability averaged over the entire family is low. Thus the competition between the “goodness of fit” and the volume of the model space (the Occam factor) often attributes much of the posterior weight to model families that are relatively simple, but explain the data well. In the case of the NSB prior, different values of β index different model families. For small β, the estimates in Equation (3) are closer to the frequentist’s maximum likelihood, explaining the data better. However, there is less smoothing, and the space of models is larger. Thus as argued in , one expects that the integrals in Equation (7) are dominated by some with a small posterior variance, and then .
In this work, we start by investigating whether a maximum of the integrand in Equation (7) indeed exists. We then study its properties. The results of the analysis lead to a deeper understanding of the NSB method.
3. Saddle Point Analysis
We calculate integrals in Equation (7) using the saddle point (a. k. a. Laplace) approximation. Since does not depend on N, for , only the Γ-terms in ρ define the saddle. We write
Differentiating, we obtain the following equation for the saddle point (or the maximum likelihood) value, :
where denotes the number of bins that have, at least, m counts. Note that .
If , and if there are many bins with multiple counts, i.e., , then the (unknown) is likely non-uniform. Thus the entropy is significantly smaller than its maximum possible value . Since for any , , a small entropy estimate is achievable only if as . Thus we will look for
where none of depends on K. Plugging Equation (12) into Equation (11), we get an equation for :
The leading terms in the expansion of are:
We have calculated additional higher order terms. However, when , as is common in applications, these terms are rarely needed.
We now solve Equation (13). For and , the r. h. s. of the equation is approximately . For , it is close to . Thus if , and the number of coincidences among the data, , is zero, then the l. h. s. majorizes the r. h. s., and Equation (13) has no solution. That is, there is no saddle point in the integrand. If there are coincidences, a unique solution exists, and means . Thus we search for of the form .
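The role of coincidences in the existence of the saddle can be illustrated numerically. The sketch below assumes that the Γ-terms of the posterior density are $\Gamma(K\beta)/\Gamma(N + K\beta) \prod_i \Gamma(n_i + \beta)/\Gamma(\beta)$, with the product over occupied bins, and it ignores the N-independent prior factor, as in the analysis above. The function names and the ternary search are illustrative choices, not the method used in the published software.

```python
import math

def coincidences(counts):
    """Number of coincidences: samples minus occupied bins."""
    return sum(counts) - sum(1 for n in counts if n > 0)

def log_rho(beta, counts, K):
    """Assumed log of the Gamma-function part of the posterior density of beta.

    counts may list occupied bins only; empty bins contribute nothing.
    """
    N = sum(counts)
    kappa = K * beta
    value = math.lgamma(kappa) - math.lgamma(N + kappa)
    for n in counts:
        if n > 0:
            value += math.lgamma(n + beta) - math.lgamma(beta)
    return value

def argmax_log_beta(counts, K, beta_lo=1e-8, beta_hi=1e2, iters=200):
    """Ternary search for the maximum over ln(beta); assumes a single interior peak."""
    lo, hi = math.log(beta_lo), math.log(beta_hi)
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_rho(math.exp(m1), counts, K) < log_rho(math.exp(m2), counts, K):
            lo = m1
        else:
            hi = m2
    return math.exp(0.5 * (lo + hi))

K = 1000
no_coincidence = [1] * 10            # 10 samples, 10 occupied bins
one_coincidence = [2] + [1] * 8      # 10 samples, 9 occupied bins
print(coincidences(no_coincidence), coincidences(one_coincidence))
print(K * argmax_log_beta(one_coincidence, K))   # location of the peak, as kappa = K * beta
```

With no coincidences, `log_rho` keeps increasing with β and has no interior maximum, whereas a single coincidence already produces a peak at a finite value of Kβ.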
It is useful to define:
where each of ’s scales as . Using properties of polygamma functions  and defining , we rewrite Equation (13) as
Combined with the previous observations, Equation (17) suggests that we look for of the form
where each of ’s is independent of δ and scales as .
Substituting Equation (18) into Equation (17), we find the series expansion self-consistent, and
Again, more terms have been calculated and are used in the software implementation of the estimator.
The obtained expressions present the saddle point value (or , or ) as a power series in and δ. To complete the evaluation of Equation (7), we now calculate the curvature at this saddle point:
Notice that the curvature does not scale as a power of N as was suggested in . The uncertainty in is determined to the first order only by coincidences. One can understand this by considering with for most of the bins. Then counts of are not informative for entropy estimation since they can correspond to massive bins, as well as to some random bins from the sea of the negligible ones. However, coinciding counts likely correspond to high-probability bins, which should influence the entropy estimation. Note also that, to the first order in , the exact positioning of coincidences does not matter: for a fixed Δ, a few coincidences in many bins or many coincidences in a single one produce the same saddle point and the same curvature around it. While this is an artifact of the specific choice of the prior , the similarity to Ma’s coincidence counting  is intriguing.
In summary, if the number of coincidences , then the saddle point analysis is self-consistent. A specific value is selected a posteriori, and the variance of the entropy is small.
The series expansions calculated above form the basis for a numerical implementation of the NSB algorithm. To calculate the posterior mean and variance of the NSB entropy estimator using Equation (7), three integrals must be evaluated numerically (the normalization, and the first and the second moments of H). When , the integrands are not peaked, and the integrals can be evaluated by simple Gaussian quadratures or other user-selected methods. If instead , the integrands will be peaked, strongly if . The location of the peak must then be identified before numerical integration is done. We proceed as follows:
1. The saddle point (the maximum of ) is found numerically by:
   (a) evaluating an approximation for using the first few terms of the series, Equation (18);
   (b) using the approximate value as a starting point for the Newton-Raphson iterative algorithm to solve for from Equation (13);
   (c) plugging the solution into the series expansion for the saddle , Equation (12);
   (d) and, finally, using the latter solution as a starting point for the Newton-Raphson search of a more accurate value of in Equation (11).
2. Each of the integrands in Equation (7) is divided by the value of at the saddle point, so that the maximum of the integrands is .
3. Curvature around the saddle point (and hence the posterior variance) is evaluated numerically.
4. The integrals are evaluated numerically over the range that spans a few standard deviations on both sides of the saddle point; the range is controlled by the user-specified desired accuracy.
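The rescaling and windowing steps of this procedure can be sketched generically. The Python example below is not the published Octave/Matlab or C++ code; it takes a hypothetical log-integrand `log_f`, a peak location, and a width as given inputs (in the procedure above these would come from the saddle point search and the curvature), divides the integrand by its peak value in log space, and integrates with the trapezoidal rule over a few standard deviations around the peak.

```python
import math

def peaked_average(log_f, g, x_peak, sigma, n_sigma=6.0, n_points=2001):
    """Ratio of integrals  int g(x) f(x) dx / int f(x) dx  for a sharply peaked f.

    log_f:   callable returning ln f(x); f itself may overflow a float.
    x_peak:  location of the maximum of f (assumed known).
    sigma:   width of the peak (assumed known from the curvature).
    """
    log_peak = log_f(x_peak)                 # rescale so the integrand peaks at 1
    a = x_peak - n_sigma * sigma
    b = x_peak + n_sigma * sigma
    h = (b - a) / (n_points - 1)
    num = 0.0
    den = 0.0
    for i in range(n_points):
        x = a + i * h
        weight = 0.5 if i in (0, n_points - 1) else 1.0   # trapezoidal end weights
        fx = math.exp(log_f(x) - log_peak)
        num += weight * g(x) * fx
        den += weight * fx
    return num / den                          # the step h cancels in the ratio

# Toy integrand: a Gaussian peak at x = 3 of width 0.1, scaled by exp(5000).
# Evaluating math.exp(5000.0) directly would overflow; the rescaled form does not.
log_f = lambda x: 5000.0 - 0.5 * ((x - 3.0) / 0.1) ** 2
mean = peaked_average(log_f, lambda x: x, 3.0, 0.1)
second = peaked_average(log_f, lambda x: x * x, 3.0, 0.1)
print(mean, second - mean ** 2)
```

Working with ratios of rescaled integrals means the overall normalization never has to be computed, which is why dividing by the peak value is harmless for the posterior moments.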
The above algorithm has been implemented in Octave/Matlab and C++. It is available from . The input to the routines is either the histogram of counts (Octave/Matlab and C++), or a series of samples (C++ only). The output of the routines is either the posterior mean and the standard deviation of the entropy, together with the position of the saddle point, or a variety of diagnostic information if the integration fails for any reason. The C++ version is implemented specifically to allow estimation of entropies on alphabets with arbitrarily large cardinalities. It is limited only by the ability of the data series to fit in the computer memory.
4. Choosing a Value for K?
We are interested in the regime , when the number of pseudocounts in occupied bins, , is negligible compared to their number in empty bins, . Then Equation (3) and Equation (8) show that selecting β (i.e., integrating over it) means balancing N, the number of actual counts versus , the number of pseudocounts, or, equivalently, the scaled number of unoccupied bins. K is often unknown in real-life applications, or the number of possible outcomes is a countable infinity. Estimation of K from data has proven to be a hard problem, only solved completely for uniform distributions [10,11]. One can consider varying K (instead of β) to find its maximum a posteriori value when performing Bayesian integration over κ.
To see that this will not work, we note that smaller K leads to a higher maximum likelihood since the total number of pseudocounts is less. Unfortunately, since there are fewer bins (degrees of freedom) available, smaller K also means smaller volume in the distribution space. Thus Bayesian averaging over K is trivial: the smallest possible number of bins (i.e., no empty bins) dominates. This can be seen from Equation (8): only the first ratio of Γ-functions in the posterior density depends on K, and it is maximized for . Thus straightforward selection of the value of K is not an option. However, the next section suggests a way around this hurdle.
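A quick numerical check of this argument is sketched below. It assumes that the K-dependent ratio of Γ-functions is $\Gamma(K\beta)/\Gamma(N + K\beta)$ and evaluates its logarithm at a fixed β for several candidate values of K; the function name is hypothetical.

```python
import math

def log_k_dependent_factor(K, beta, N):
    """ln[ Gamma(K * beta) / Gamma(N + K * beta) ], the assumed K-dependent factor."""
    return math.lgamma(K * beta) - math.lgamma(N + K * beta)

N = 10            # number of samples
occupied = 8      # number of occupied bins, the smallest admissible K
beta = 0.5
for K in (occupied, 2 * occupied, 10 * occupied):
    print(K, log_k_dependent_factor(K, beta, N))
```

For a fixed β, the factor decreases monotonically as K grows, so the smallest admissible K always wins, in line with the conclusion that direct selection of K is uninformative.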
5. Unknown or Infinite K
Often the true value of K is unknown because its simple estimate is intolerably large. For example, consider measuring entropy of ℓ-grams in printed English  using an alphabet with 29 characters: 26 different letters, one symbol for digits, one space, and one punctuation mark. Then for $\ell = 10$, a naive estimate of K is $29^{10} \approx 4.2 \times 10^{14}$. Only very few of all possible 10-grams are allowed by the grammar, but one does not know exactly how many. Thus one has to work in the space of full cardinality, which is ridiculously undersampled.
As shown in Section 3, NSB is well defined even for finite N and extremely large K, provided . Moreover, if , then the expressions simplify since only the first term in Equation (12) needs to be kept. Even more interestingly, for an increasing K and , becomes closer to a delta function since the a priori variance of entropy drops to zero as , Equation (5). Thus NSB becomes more “certain” as K increases. Correspondingly, a possible solution to the problem of unknown cardinality is to use an upper bound estimate for K. It is better to overestimate K than to underestimate it. Even can be used. Insensitivity of the method to the value of K was explored empirically in .
Which assumptions allow NSB to use a few data points to specify entropy of a variable with even an infinite cardinality? A typical distribution in the Dirichlet family has a specific rank ordered (Zipf) plot : the number of bins with the probability less than some q is given by an incomplete B-function, I,
where B is the usual complete B-function. NSB estimates the best value for β using bins with coincidences, the head of the rank ordered plot. But knowing β defines the tails, where no data has been observed yet, allowing entropy estimation. Thus NSB relies on the rank-ordered tail of the studied distribution to be not too far away from the form in Equation (23). If the Zipf plot of the studied distribution has a substantially longer tail, then one should not trust the results of the method. An empirical procedure for detecting this case has been suggested in [14,15].
With this warning in mind, we can analytically calculate the entropy estimate and its variance for a very large K. We want the results that hold even if the saddle point analysis, Section 3, fails when . Following Equation (12) and Equation (18), , but . The range of entropies is , so the prior on H produced by is (almost) uniform over a semi-infinite range and thus is ill-defined. Similarly, there is a problem normalizing . However, both problems are resolved by an appropriate limiting procedure, and we disregard them in what follows.
To perform the integrals in Equation (7), we point out that, for , , and , we have , and then . A similar relation holds for . That is, the posterior averages of the entropy and its square are almost indistinguishable from ξ and , their respective a priori averages. Since now we are interested in small Δ (otherwise we can use the saddle point analysis), we replace by in Equation (7). The error of this approximation is .
We transform the Lagrangian in Equation (10). First, we drop terms that do not depend on κ since they appear in the numerator and denominator of Equation (7) and thus cancel. Second, we expand around . This gives
We note that κ is large in the vicinity of the saddle if δ is small and N is large, cf. Equation (18). Thus by definition of ψ-functions, . Further, , and . Finally, since , where is the Euler’s constant, Equation (4) says that . Combining all of these expressions, we get
where ≈ means the precision of .
The integrals in these expressions are calculated by substituting and replacing the limits of integration by . This introduces errors of at the lower limit and at the upper limit. Both errors are within the precision of interest if there is, at least, one coincidence. Thus
Finally, substituting Equation (28) into Equation (26) and (27), we get
These equations are valid to zeroth order in and . They provide a simple, yet nontrivial, estimate of the entropy that can be used even if the cardinality of the variable is unknown. However, one always must analyze for a possible bias when using the estimator. Note that Equation (30) agrees with Equation (22) since, for large Δ, . The similarity between the coincidence counting in Equations (29) and (30) and in Ma’s analysis  is also clear.
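The large-K estimate is simple enough to evaluate by hand or in a few lines of code. The sketch below assumes that Equations (29) and (30) take the form $\langle H \rangle \approx (C_\gamma - \ln 2) + 2 \ln N - \psi_0(\Delta)$ and $(\delta H)^2 \approx \psi_1(\Delta)$, which is consistent with the large-Δ behavior noted above; these forms and all function names should be treated as assumptions of this illustration. Because Δ is an integer, the polygamma functions reduce to finite harmonic sums.

```python
import math

EULER_GAMMA = 0.5772156649015329

def psi0_int(m):
    """Digamma function at a positive integer m: -gamma + sum_{k<m} 1/k."""
    return -EULER_GAMMA + sum(1.0 / k for k in range(1, m))

def psi1_int(m):
    """Trigamma function at a positive integer m: pi^2/6 - sum_{k<m} 1/k^2."""
    return math.pi ** 2 / 6.0 - sum(1.0 / k ** 2 for k in range(1, m))

def asymptotic_entropy(n_samples, n_coincidences):
    """Assumed large-K estimate of the entropy (nats) and its standard deviation."""
    if n_coincidences < 1:
        raise ValueError("at least one coincidence is required")
    mean = (EULER_GAMMA - math.log(2.0)) + 2.0 * math.log(n_samples) - psi0_int(n_coincidences)
    return mean, math.sqrt(psi1_int(n_coincidences))

# Hypothetical data set: 1000 samples falling into 950 distinct bins.
mean, std = asymptotic_entropy(1000, 50)
print(mean, std)
```

Only N and Δ enter the estimate; neither K nor the detailed shape of the histogram is needed, which is what makes the expression usable when the cardinality is unknown.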
We have calculated various asymptotic properties of the NSB estimator for estimation of entropies of discrete random variables. First, the posterior expectations have been evaluated in terms of power series in and , but for the number of coincidences . Evaluation is done using the saddle point expansion. Convergence of the series depends on the number of coincidences rather than on the total number of samples. This elucidates the similarity to Ma’s argument  and verifies the intuition of [13,14] that counting coincidences is what makes the method work in the severely undersampled regime. We have then discussed the limit when , and the saddle point analysis is not applicable. Here we have shown that the estimator has a finite asymptote for the case of infinitely many bins, , or of an unknown number of bins. We obtained closed-form solutions for the estimate of the entropy and its variance in this regime. As for , to the first order, both depend on the number of coincidences rather than on the total number of samples.
The NSB estimator has been implemented in software, using the current asymptotic analysis as one of the steps in numerical evaluation of posterior integrals. Armed with empirical tests for the absence of bias in the estimator suggested in [14,15], the software brings us one step closer to a reliable, model-independent estimation of entropy of discrete probability distributions in the severely undersampled Ma regime. The method is proving to be particularly powerful in a variety of biological applications.
I thank Jonathan Miller, Chris Wiggins, and William Bialek for stimulating discussions and for encouragement to complete the manuscript. I also thank Christian Mendl for noticing mistakes in the earlier drafts of the manuscript. This work was done in part at the Kavli Institute for Theoretical Physics, supported by NSF Grant No. PHY99-07949, and finished at Emory University with the support of NIH/NCI grant No. 7R01CA132629-04 and HFSP grant No. RGY0084/2011.
Paninski, L. Estimation of entropy and mutual information. Neural Comp. 2003, 15, 1191–1253.
Panzeri, S.; Treves, A. Analytical estimates of limited sampling biases in different information measures. Netw. Comput. Neural Syst. 1996, 7, 87–107.
Strong, S.; Koberle, R.; de Ruyter van Steveninck, R.; Bialek, W. Entropy and information in neural spike trains. Phys. Rev. Lett. 1998, 80, 197–200.
Victor, J. Binless strategies for estimation of information from neural data. Phys. Rev. E 2002, 66, 051903.
Antos, A.; Kontoyiannis, I. Convergence properties of functional estimates for discrete distributions. Random Struct. Algorithm. 2002, 19, 163–193.
Batu, T.; Dasgupta, S.; Kumar, R.; Rubinfeld, R. The complexity of approximating the entropy. SIAM J. Comput. 2005, 35, 132–150.
Grassberger, P. Entropy estimates from insufficient samples. arXiv 2003, physics/0307138v2.
Kennel, M.; Shlens, J.; Abarbanel, H.; Chichilnisky, E. Estimating entropy rates with Bayesian confidence intervals. Neural Comp. 2005, 17, 1531–1576.
Ma, S. Calculation of entropy from data of motion. J. Stat. Phys. 1981, 26, 221–240.
Orlitsky, A.; Santhanam, N.; Vishwanathan, K. Population estimation with performance guarantees. In Proceedings of the IEEE International Symposium on Information Theory, Nice, France, 24–29 June 2007; pp. 2026–2030.
Nemenman, I.; Shafee, F.; Bialek, W. Entropy and inference, revisited. In Advances in Neural Information Processing Systems 14; Dietterich, T.G., Becker, S., Ghahramani, Z., Eds.; MIT Press: Cambridge, MA, USA, 2002.
Nemenman, I.; Bialek, W.; de Ruyter van Steveninck, R. Entropy and information in neural spike trains: Progress on the sampling problem. Phys. Rev. E 2004, 69, 056111.
Nemenman, I.; Lewen, G.; Bialek, W.; de Ruyter van Steveninck, R. Neural coding of natural stimuli: Information at sub-millisecond resolution. PLoS Comput. Biol. 2008, 4, e1000025.