Coincidences and Estimation of Entropies of Random Variables with Large Cardinalities

We perform an asymptotic analysis of the NSB estimator of entropy of a discrete random variable. The analysis illuminates the dependence of the estimates on the number of coincidences in the sample and shows that the estimator has a well defined limit for a large cardinality of the studied variable. This allows estimation of entropy with no a priori assumptions about the cardinality. Software implementation of the algorithm is available.


Introduction
Estimation of functions of a discrete random variable with an unknown probability distribution is one of the simplest problems in statistics.However, the simplicity vanishes in an extremely undersampled regime, where K, the cardinality or the alphabet size of the variable, is much larger than N , the number of the samples.In this case, the average number of samples per possible outcome, or bin, is less than one, and the relative uncertainty about the underlying probability distribution and its various statistics is large.To decrease the posterior error, one may turn to Bayesian statistics and bias the set of a priori admissible distributions.However, finding an optimal bias-variance tradeoff is not easy.For severely undersampled cases, controlling the variance often make an estimator a function of the prior, rather than of the measured data.This is often the case for inference of the Boltzmann-Shannon entropy, H = − K i=1 q i ln q i (here q i is probability of an event i), an important characteristics of a discrete variable.In this paper, all logarithms are natural, and the unit of entropy is nat.Simple estimators of entropy have low variances but high biases that are difficult to calculate due to the divergence of the logarithm near zero [1].Developments driven in part by computational biology applications have solved this problem in the moderately undersampled regime, N ∼ K and N ∼ e H [1][2][3][4][5][6][7][8][9]. Interestingly, they also resulted in the understanding that it is impossible to estimate entropy with zero bias uniformly over all distributions for a smaller N .However, Ma has argued [10] that, since coincidences in data start to occur at N ∼ √ e H , it is possible to estimate entropies even in the deeply undersampled regime, at least for some classes of probability distributions, such as uniform ones.Similar arguments are well-known in the literature on estimation of population sizes from capture-recapture data (see, e.g., [11] for recent developments).
There it has been recognized that the population size (and the population entropy) can be estimated long before every possible individual outcome has been sampled with a high probability [12].
In 2002, Nemenman, Shafee and Bialek introduced a method for entropy estimation, hereafter called NSB [13].While the estimator has proven successful in the Ma square-root regime [14,15], a theoretical basis for the success has not been presented in the literature.Here we review the method and perform its asymptotic analysis.We verify the intuition that the estimator works in the Ma regime by counting coincidences.We point out that the method can be viewed as finding the number of yet unseen bins with nonzero probability given K, the maximum cardinality of the variable.While estimation of K by model selection techniques cannot work (see below), we show that the method has a non-trivial limit as K → ∞.Thus one should be able to calculate entropies of discrete random variables even without knowing their cardinalities.Our analysis allows for an efficient numerical implementation of the NSB estimator, which we have made available from [16].

Summary of the NSB Method
We use Bayes rule to expresses the posterior probability of a probability distribution q ≡ {q i }, i = 1 . . .K, of a discrete random variable with a help of its a priori probability, P(q).Thus if n i i. i. d. samples from q are observed in bin i, such that K i=1 n i = N , then the posterior, P (q|n), is Following [13], we focus on the popular Dirichlet family of priors, indexed by a hyperparameter β: Here the δ-function and Z β enforce normalizations of q and P β (q), respectively; and Γ stands for Euler's Γ-function.These priors are common in statistics since they result in an analytically tractable, multinomial posteriors.For example, Wolpert and Wolf [17] calculated posterior averages, here denoted as . . .β , of many interesting quantities, including the distribution itself, and the moments of its entropy, which we will not reprint here.
According to Equation ( 3), Dirichlet priors add extra β samples (pseudocounts) to each bin.Thus for β N/K, the data are unimportant, on average, and P (q|n) is dominated by almost uniform distributions, q ≈ 1/K.Then the posterior mean of the entropy is strongly biased upward to its maximum possible value of H max = ln K. Similarly, for β N/K, distributions in the vicinity of the frequentist's maximum likelihood estimate, q = n/N , are important, and H β is biased downward [1].[13] traced this problem to properties of the Dirichlet family.Its members encode reasonable a priori assumptions about q, but not about H(q).Instead, a priori assumptions about the entropy are strongly biased, as seen from the a priori moments: Here ψ m (x) = (d/dx) m+1 ln Γ(x) are the polygamma functions.ξ(β) varies smoothly from 0 for β = 0, through 1 for β ≈ 1/K, and to ln , which is negligibly small for large K. Thus q that is typical in P β (q) has the entropy extremely close to some predetermined β-dependent value.This bias persists even when N ∼ K data are collected.
One should strive for the a priori distribution of entropy, P(H(q)), to be approximately uniform to have a chance for an unbiased estimator.NSB achieves the uniformity (but not necessarily zero bias) by noting that, following Equations ( 4) and ( 5), for large K, P β (H) is almost a δ-function.Thus a prior that averages over all non-negative values of β (and, correspondingly, over all a ξ ∈ [0; ln K]) may reduce the bias in the entropy estimation even for N K. [13] proposed the following infinite mixture of Dirichlet priors [18] for the averaging: Here Z is again a normalizing coefficient, and dξ/dβ ensures uniformity for ξ, rather than for β.A non-constant prior on β, P(β), may be used if needed, but we will not focus on this term from now on.Such Dirichlet mixture results in P(q) = const, introducing biases in estimation of q as a tradeoff for a possibly accurate estimation of H.
Inference with the prior, Equation ( 6), involves additional averaging over β (or, equivalently, ξ).The a posteriori moments of the entropy are where the unnormalized posterior density is Note that, for N = 0, ρ(ξ|0) = P (β(ξ)).Thus if we choose P (β(ξ)) = const, then the a priori assumptions about ξ are exactly uniform, as we had hoped to achieve.We note again that the uniformity of the prior is not equivalent to zero posterior bias.
An additional reason for the choice of averaging over the model families, as in Equation ( 6), is provided by the theory of Bayesian model selection [13,[19][20][21][22][23].Specifically, families of probabilistic models of data that incorporate more models (have larger volumes in the model space) usually have high explanatory powers and include some models that are very likely a posteriori.However, they also include many extremely unlikely models, and the posterior probability averaged over the entire family is low.Thus the competition between the "goodness of fit" and the volume of the model space (the Occam factor) often attributes much of the posterior weight to model families that are relatively simple, but explain the data well.In the case of the NSB prior, different values of β index different model families.For small β, the estimates in Equation ( 3) are closer to the frequentist's maximum likelihood, explaining the data better.However, there is less smoothing, and the space of models is larger.Thus as argued in [13], one expects that the integrals in Equation ( 7) are dominated by some β * with a small posterior variance, and then In this work, we start with investigating whether a maximum of the integrand in Equation ( 7), indeed, exists.We then study its properties.The results of the analysis leads to a deeper understanding of the NSB method.

Saddle Point Analysis
We calculate integrals in Equation ( 7) using the saddle point (a.k. a. Laplace) approximation.Since H m β does not depend on N , for N → ∞, only the Γ-terms in ρ define the saddle.We write Differentiating, we obtain the following equation for the saddle point (or the maximum likelihood) value, where K m denotes the number of bins that have, at least, m counts.Note that , and if there are many bins with multiple counts, i.e., N − K 1 1, then the (unknown) q is likely non-uniform.Thus the entropy is significantly smaller than its maximum possible value H max .Since for any β = O(1), H β ≈ H max [13], small entropy estimate is achievable only if β * → 0 as K → ∞.Thus we will look for where none of κ j depends on K. Plugging Equation (12) into Equation (11), we get an equation for κ 0 : The leading terms in the expansion of κ * are: We have calculated additional higher order terms.However, when K 1, as is common in applications, these terms are rarely needed.
It is useful to define: where each of f N 's scales as N 0 .Using properties of polygamma functions [24] and defining δ = ∆/N , we rewrite Equation ( 13) as Combined with the previous observations, Equation (17) suggests that we look for κ 0 of the form where each of b j 's is independent of δ and scales as N 0 .Substituting Equation (18) into Equation ( 17), we find the series expansion self-consistent, and Again, more terms have been calculated and are used in the software implementation of the estimator.The obtained expressions present the saddle point value β * (or κ * , or ξ * ) as a power series in 1/K and δ.To complete the evaluation of Equation ( 7), we now calculate the curvature at this saddle point: Notice that the curvature does not scale as a power of N as was suggested in [13].The uncertainty in ξ * is determined to the first order only by coincidences.One can understand this by considering K 1

Choosing a Value for K?
We are interested in the regime N K, when the number of pseudocounts in occupied bins, K 1 β, is negligible compared to their number in empty bins, (K − K 1 )β ≈ Kβ.Then Equations ( 3) and (8) show that selecting β (i.e., integrating over it) means balancing N , the number of actual counts versus κ = Kβ, the number of pseudocounts, or, equivalently, the scaled number of unoccupied bins.K is often unknown in real-life applications, or the number of possible outcomes is a countable infinity.Estimation of K from data has proven to be a hard problem, only solved completely for uniform distributions [10,11].One can consider varying K (instead of β) to find its maximum a posteriori value when performing Bayesian integration over κ.
To see that this will not work, we note that smaller K leads to a higher maximum likelihood since the total number of pseudocounts is less.Unfortunately, since there are fewer bins (degrees of freedom) available, smaller K also means smaller volume in the distribution space.Thus Bayesian averaging over K is trivial: the smallest possible number of bins (i.e., no empty bins) dominates.This can be seen from Equation ( 8): only the first ratio of Γ-functions in the posterior density depends on K, and it is maximized for K = K 1 .Thus straightforward selection of the value of K is not an option.However, the next section suggests a way around this hurdle.

Unknown or Infinite K
Often the true value of K is unknown because its simple estimate is intolerably large.For example, consider measuring entropy of -gramms in printed English [25] using an alphabet with 29 characters: 26 different letters, one symbol for digits, one space, and one punctuation mark.Then for = 10, a naive estimate of K is 29 10 ∼ 10 14 .Only very few of all possible 10-gramms are allowed by the grammar, but one does not know how many exactly.Thus one has to work in the space of full cardinality, which is ridiculously undersampled.
As shown in Section 3, NSB is well defined even for finite N and extremely large K, provided ∆ 1.Moreover, if K → ∞, then the expressions simplify since only the first term in Equation ( 12) needs to be kept.Even more interestingly, for an increasing K and β 1/K, P β (H) becomes closer to a delta function since the a priori variance of entropy drops to zero as 1/K, Equation ( 5).Thus NSB becomes more "certain" as K increases.Correspondingly, a possible solution to the problem of unknown cardinality is to use an upper bound estimate for K.It is better to overestimate K than to underestimate it.Even K → ∞ can be used.Insensitivity of the method to the value of K was explored empirically in [14].
Which assumptions allow NSB to use a few data points to specify entropy of a variable with even an infinite cardinality?A typical distribution in the Dirichlet family has a specific rank ordered (Zipf) plot [13]: the number of bins with the probability less than some q is given by an incomplete B-function, I, where B is the usual complete B-function.NSB estimates the best value for β using bins with coincidences, the head of the rank ordered plot.But knowing β defines the tails, where no data has been observed yet, allowing entropy estimation.Thus NSB relies on the rank-ordered tail of the studied distribution to be not too far away from the form in Equation (23).If the Zipf plot of the studied distribution has a substantially longer tail, then one should not trust the results of the method.An empirical procedure for detecting this case has been suggested in [14,15].With this warning in mind, we can analytically calculate the entropy estimate and its variance for a very large K.We want the results that hold even if the saddle point analysis, Section 3, fails when ∆ ∼ 1.Following Equations ( 12) and (18), The range of entropies is 0 ≤ H ≤ ln K → ∞, so the prior on H produced by P(q; β) is (almost) uniform over a semi-infinite range and thus is ill-defined.Similarly, there is a problem normalizing P β (q).However, both problems are resolved by an appropriate limiting procedure, and we disregard them in what follows.
We write: The integrals in these expressions are calculated by substituting exp(C γ −ξ) = τ and replacing the limits of integration 1/K exp(C γ ) ≤ τ ≤ exp(C γ ) by 0 ≤ τ ≤ ∞.This introduces errors of ∼ (1/K) ∆ at the lower limit and ∼ δ 2 exp(−1/δ 2 ) at the upper limit.Both errors are within the precision of interest O(1/K, 1/N ) if there is, at least, one coincidence.Thus Finally, substituting Equation (28) into Equation ( 26) and ( 27), we get These equations are valid to zeroth order in 1/K and 1/N .They provide a simple, yet nontrivial, estimate of the entropy that can be used even if the cardinality of the variable is unknown.However, one always must analyze for a possible bias when using the estimator.Note that Equation (30) agrees with Equation ( 22) since, for large ∆, ψ 1 (∆) ≈ 1/∆.The similarity between the coincidence counting in Equations ( 29) and (30) and in Ma's analysis [10] is also clear.

Conclusions
We have calculated various asymptotic properties of the NSB estimator for estimation of entropies of discrete random variables.First, the posterior expectations have been evaluated in terms of power series in 1/K and δ = ∆/N , but for the number of coincidences ∆ 1. Evaluation is done using the saddle point expansion.Convergence of the series depends on the number of coincidences rather than on the total number of samples.This elucidates the similarity to Ma's argument [10] and verifies the intuition of [13,14] that counting coincidence is what makes the method work in the severely undersampled regime.We have then discussed the limit when ∆ ∼ 1, and the saddle point analysis is not applicable.Here we have shown that the estimator has a finite asymptote for the case of infinitely many bins, K → ∞, or of an unknown number of bins.We obtained a closed form solutions for the estimate of the entropy and its variance in this regime.As for ∆ 1, to the first order, both depend on the number of coincidences rather than on the total number of samples.
The NSB estimator has been implemented in software, using the current asymptotic analysis as one of the steps in numerical evaluation of posterior integrals.Armed with empirical tests for the absence of bias in the estimator suggested in [14,15], the software brings us one step closer to a reliable, model independent estimation of entropy of discrete probability distributions in the severely undersampled Ma regime.The method is proving to be particularly powerful in a variety of biological applications.