# Coincidences and Estimation of Entropies of Random Variables with Large Cardinalities

## Abstract


**MSC:** 94A17; 62F12; 62F15

## 1. Introduction

## 2. Summary of the NSB Method

## 3. Saddle Point Analysis

#### Numerical Implementation

- The saddle point (the maximum of $\rho(\xi|\mathbf{n})$) is found numerically by:
  - (a) evaluating an approximation for $\kappa_0$ using the first few terms of the series, Equation (18);
  - (b) using the approximate value as a starting point for the Newton-Raphson iterative algorithm to solve for $\kappa_0$ from Equation (13);
  - (c) plugging the solution into the series expansion for the saddle $\kappa^*$, Equation (12);
  - (d) and, finally, using the latter solution as a starting point for the Newton-Raphson search for a more accurate value of $\kappa^*$ in Equation (11).
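The two Newton-Raphson refinements in steps (b) and (d) follow the same pattern: iterate from a good series-based starting point until the step size falls below tolerance. A minimal sketch is below; since Equations (11)–(13) and (18) are not reproduced here, the root equation and starting point are hypothetical stand-ins, not the paper's actual saddle-point conditions.

```python
import math

def newton_raphson(f, fprime, x0, tol=1e-12, max_iter=100):
    # Generic Newton-Raphson root finder, as used in steps (b) and (d):
    # iterate x <- x - f(x)/f'(x) from a good starting point until the
    # step size drops below the requested relative tolerance.
    x = x0
    for _ in range(max_iter):
        step = f(x) / fprime(x)
        x -= step
        if abs(step) < tol * max(1.0, abs(x)):
            return x
    raise RuntimeError("Newton-Raphson did not converge")

# Toy stand-in for the root equation: solve x = exp(-x), i.e.
# f(x) = x - exp(-x) = 0, starting from a crude approximation x0 = 0.5
# (playing the role of the truncated-series estimate in step (a)).
kappa0 = newton_raphson(lambda x: x - math.exp(-x),
                        lambda x: 1.0 + math.exp(-x),
                        x0=0.5)
```

The same helper is then reused in step (d), seeded with the series expansion from step (c), which is what makes the overall scheme robust: Newton-Raphson converges quadratically once started close enough to the root.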

- Each of the integrands in Equation (7) is divided by the value of $\rho(\xi|\mathbf{n})$ at the saddle point, so that the maximum of each integrand is $O(1)$.
- The curvature around the saddle point (and hence the posterior variance) is evaluated numerically.
- The integrals are evaluated numerically over a range spanning a few standard deviations on either side of the saddle point; the range is controlled by the user-specified desired accuracy.
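The normalized integration above can be sketched as follows, here with a simple trapezoidal rule. The function name and the Gaussian test density are illustrative, not the actual $\rho(\xi|\mathbf{n})$; working with the log of the integrand and subtracting its value at the saddle is what keeps the exponentials $O(1)$ and avoids overflow.

```python
import math

def integrate_around_saddle(log_rho, saddle, sigma, n_std=5, n_points=2001):
    # Trapezoidal integral of rho(x) / rho(saddle) over the window
    # [saddle - n_std*sigma, saddle + n_std*sigma].  Dividing by the value
    # at the saddle (in log space) keeps the integrand O(1).
    log_rho_max = log_rho(saddle)
    a = saddle - n_std * sigma
    h = 2.0 * n_std * sigma / (n_points - 1)
    total = 0.0
    for i in range(n_points):
        x = a + i * h
        w = 0.5 if i in (0, n_points - 1) else 1.0  # trapezoid endpoint weights
        total += w * math.exp(log_rho(x) - log_rho_max)
    return h * total  # integral of rho / rho(saddle)

# Sanity check with a Gaussian log-density, log rho(x) = -x^2/2:
# the normalized integral should be approximately sqrt(2*pi).
val = integrate_around_saddle(lambda x: -0.5 * x * x, saddle=0.0, sigma=1.0)
```

Widening `n_std` or refining `n_points` trades runtime for accuracy, which is the knob the user-specified accuracy setting controls in the description above.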

## 4. Choosing a Value for K?

## 5. Unknown or Infinite K

## 6. Conclusions

## Acknowledgments

## References


© 2011 by the author; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

## Share and Cite

**MDPI and ACS Style**

Nemenman, I. Coincidences and Estimation of Entropies of Random Variables with Large Cardinalities. *Entropy* **2011**, *13*, 2013-2023. https://doi.org/10.3390/e13122013
