We begin by reviewing the NSB estimator [17], a Bayes least squares (BLS) estimator for H under the generative model depicted in Figure 1. The Bayesian approach to entropy estimation involves formulating a prior over distributions π, and then turning the crank of Bayesian inference to infer H using the posterior over H induced by the posterior over π. The starting point for this approach is the symmetric Dirichlet prior with parameter α over a discrete distribution π:

$$p(\pi |\alpha )=\frac{\Gamma (K\alpha )}{\Gamma {(\alpha )}^{K}}\prod _{i=1}^{K}{\pi }_{i}^{\alpha -1},$$

where ${\pi}_{i}$ (the ith element of the vector π) gives the probability that a data point x falls in the ith bin, K denotes the number of bins in the distribution, and ${\sum}_{i=1}^{K}{\pi}_{i}=1$. The Dirichlet concentration parameter $\alpha >0$ controls the concentration or “roughness” of the prior, with small α giving spiky distributions (most probability mass concentrated in a few bins) and large α giving more uniform distributions.
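The effect of α on prior “roughness” can be illustrated with a brief simulation (a sketch using NumPy; the bin count K = 100 and the two α values are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 100  # number of bins (arbitrary choice for illustration)

def mean_max_mass(alpha, n_draws=2000):
    """Average probability mass in the single largest bin of a draw
    from a symmetric Dirichlet prior with concentration alpha."""
    draws = rng.dirichlet(np.full(K, alpha), size=n_draws)
    return draws.max(axis=1).mean()

spiky = mean_max_mass(0.01)    # small alpha: most mass piles into a few bins
uniform = mean_max_mass(10.0)  # large alpha: mass spread across all bins
print(f"largest-bin mass: alpha=0.01 -> {spiky:.2f}, alpha=10 -> {uniform:.2f}")
```

With small α, a single bin typically captures most of the probability mass, while with large α the largest bin holds barely more than the uniform share 1/K.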

**Figure 1.** Graphical models for entropy and mutual information of discrete data. Arrows indicate conditional dependencies between variables and gray “plates” indicate N independent draws of random variables. **Left:** Graphical model for entropy estimation [16,17]. The probability distribution over all variables factorizes as $p(\alpha ,\pi ,\mathbf{x},H)=p(\alpha )p(\pi |\alpha )p(\mathbf{x}|\pi )p(H|\pi )$, where $p(H|\pi )$ is simply a delta measure on $H(\pi )$. The hyper-prior $p(\alpha )$ specifies a set of “mixing weights” for Dirichlet distributions $p(\pi |\alpha )=\text{Dir}(\alpha )$ over discrete distributions π. Data $\mathbf{x}=\left\{{x}_{j}\right\}$ are drawn from the discrete distribution π. Bayesian inference for H entails integrating out α and π to obtain the posterior $p(H|\mathbf{x})$. **Right:** Graphical model for mutual information estimation, in which π is now a joint distribution that produces paired samples $\left\{({x}_{j},{y}_{j})\right\}$. The mutual information I is a deterministic function of the joint distribution π. The Bayesian estimate comes from the posterior $p(I|\mathbf{x})$, which requires integrating out π and α.


The likelihood (bottom arrow in Figure 1, left) is the conditional probability of the data **x** given π:

$$p(\mathbf{x}|\pi )=\frac{N!}{\prod _{i=1}^{K}{n}_{i}!}\prod _{i=1}^{K}{\pi }_{i}^{{n}_{i}},$$

where ${n}_{i}$ is the number of samples in **x** falling in the ith bin, and N is the total number of samples. Because the Dirichlet is conjugate to the multinomial, the posterior over π given α and **x** takes the form of a Dirichlet distribution:

$$p(\pi |\mathbf{x},\alpha )=\text{Dir}({n}_{1}+\alpha ,\dots ,{n}_{K}+\alpha ).$$

From this expression, the posterior mean of H can be computed analytically [17,21]:

$${\widehat{H}}_{Dir}(\alpha )={\psi}_{0}(N+K\alpha +1)-\sum _{i=1}^{K}\frac{{n}_{i}+\alpha }{N+K\alpha }\,{\psi}_{0}({n}_{i}+\alpha +1),$$

where ${\psi}_{n}$ is the polygamma function of nth order (${\psi}_{0}$ is the digamma function). For each α, ${\widehat{H}}_{Dir}(\alpha )$ is the posterior mean of a Bayesian entropy estimator with a $\text{Dir}(\alpha )$ prior. Nemenman and colleagues [17] observed that, unless $N\gg K$, the estimate ${\widehat{H}}_{Dir}$ is strongly determined by the Dirichlet parameter α. They therefore suggested placing a hyper-prior $p(\alpha )$ over the Dirichlet parameter, resulting in a mixture-of-Dirichlets prior:

$$p(\pi )=\int p(\pi |\alpha )\,p(\alpha )\,d\alpha .$$
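As a concrete sketch, the closed-form posterior mean ${\widehat{H}}_{Dir}(\alpha )$ can be implemented in a few lines (assuming SciPy is available; entropy is reported in nats):

```python
import numpy as np
from scipy.special import psi  # psi(x) is the digamma function psi_0

def h_dirichlet(counts, alpha):
    """Posterior mean of entropy (in nats) given bin counts n_i,
    under a symmetric Dirichlet prior with concentration alpha."""
    counts = np.asarray(counts, dtype=float)
    K, N = counts.size, counts.sum()
    A = N + K * alpha  # total posterior concentration
    return psi(A + 1) - np.sum((counts + alpha) / A * psi(counts + alpha + 1))

# With abundant, evenly split data the estimate approaches the plug-in
# entropy of a fair coin, log(2) ~ 0.693 nats:
print(h_dirichlet([5000, 5000], alpha=1.0))
```

With no data at all (all counts zero), the same expression reduces to the prior mean ${\psi}_{0}(K\alpha +1)-{\psi}_{0}(\alpha +1)$, which is the quantity the NSB hyper-prior is designed around.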

The NSB estimator is the posterior mean of $p(H|\mathbf{x})$ under this prior, which can in practice be computed by numerical integration over α with appropriate weighting of ${\widehat{H}}_{Dir}(\alpha )$:

$${\widehat{H}}_{NSB}=\int {\widehat{H}}_{Dir}(\alpha )\,p(\alpha |\mathbf{x})\,d\alpha .$$

By Bayes’ rule, we have $p(\alpha |\mathbf{x})\propto p(\mathbf{x}|\alpha )p(\alpha )$, where $p(\mathbf{x}|\alpha )$, the marginal probability of **x** given α, takes the form of a Polya distribution [22]:

$$p(\mathbf{x}|\alpha )=\int p(\mathbf{x}|\pi )p(\pi |\alpha )\,d\pi =\frac{N!}{\prod _{i=1}^{K}{n}_{i}!}\,\frac{\Gamma (K\alpha )}{\Gamma (N+K\alpha )}\prod _{i=1}^{K}\frac{\Gamma ({n}_{i}+\alpha )}{\Gamma (\alpha )}.$$
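Putting the pieces together, the NSB integral can be approximated on a grid over log α (a sketch assuming SciPy; the grid limits and size are arbitrary choices, and the α-independent multinomial coefficient is dropped since it cancels when the weights are normalized):

```python
import numpy as np
from scipy.special import gammaln, psi, polygamma

def log_evidence(counts, alpha):
    """log p(x|alpha) from the Polya (Dirichlet-multinomial) distribution,
    dropping the alpha-independent multinomial coefficient."""
    K, N = counts.size, counts.sum()
    return (gammaln(K * alpha) - gammaln(N + K * alpha)
            + np.sum(gammaln(counts + alpha)) - K * gammaln(alpha))

def nsb_entropy(counts, grid_size=400):
    """Posterior mean of H (in nats) under the NSB mixture-of-Dirichlets
    prior, computed by quadrature over log alpha."""
    counts = np.asarray(counts, dtype=float)
    K, N = counts.size, counts.sum()
    log_a = np.linspace(-6.0, 6.0, grid_size)
    alphas = np.exp(log_a)
    # NSB hyper-prior p(alpha) propto K*psi_1(K*alpha+1) - psi_1(alpha+1),
    # with a Jacobian factor alpha for the change of variables to log alpha
    log_prior = np.log(K * polygamma(1, K * alphas + 1)
                       - polygamma(1, alphas + 1)) + log_a
    logw = log_prior + np.array([log_evidence(counts, a) for a in alphas])
    w = np.exp(logw - logw.max())
    w /= w.sum()  # normalized posterior weights over the alpha grid
    h_dir = np.array([psi(N + K * a + 1)
                      - np.sum((counts + a) / (N + K * a) * psi(counts + a + 1))
                      for a in alphas])
    return float(np.sum(w * h_dir))
```

For example, for 100 samples spread evenly over 10 bins, the estimate lands close to the true entropy log 10 ≈ 2.303 nats.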

To obtain an uninformative prior on the entropy, [17] proposed the (hyper-)prior

$$p(\alpha )\propto \frac{d}{d\alpha }\left[{\psi}_{0}(K\alpha +1)-{\psi}_{0}(\alpha +1)\right]=K{\psi}_{1}(K\alpha +1)-{\psi}_{1}(\alpha +1),$$

the derivative with respect to α of the prior mean of the entropy (i.e., before any data have been observed, which depends only on the number of bins K). This prior may be computed numerically (from Equation (12)) using a fine discretization of α. In practice, we find that a representation of the prior in terms of log α is more tractable, since the derivative is extremely steep near zero; the prior on log α has a smoother, approximately bell-shaped form (see Figure 2A).
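The change of variables to log α is straightforward to check numerically (a sketch; the density over log α picks up a Jacobian factor of α, and K = 100 is an arbitrary example):

```python
import numpy as np
from scipy.special import polygamma

def nsb_hyperprior(alpha, K):
    """Unnormalized NSB hyper-prior: the derivative with respect to alpha
    of the prior mean entropy psi_0(K*alpha+1) - psi_0(alpha+1)."""
    return K * polygamma(1, K * alpha + 1) - polygamma(1, alpha + 1)

def nsb_hyperprior_log_scale(log_alpha, K):
    """Same density expressed over log alpha (Jacobian factor alpha)."""
    a = np.exp(log_alpha)
    return a * nsb_hyperprior(a, K)

# On a log-alpha grid the density is smooth and roughly bell-shaped,
# vanishing at both extremes, unlike the steep density in alpha itself.
grid = np.linspace(-8.0, 8.0, 801)
vals = nsb_hyperprior_log_scale(grid, K=100)
```

Evaluating `vals` shows an interior peak with the density falling off toward both ends of the grid, consistent with the bell shape in Figure 2A.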

**Figure 2.**
NSB priors used for entropy estimation, for three different values of alphabet size K. (**A**) The NSB hyper-prior on the Dirichlet parameter α on a log scale (Equation (12)). (**B**) Prior distributions on H implied by each of the three NSB hyper-priors in (**A**). Ideally, the implied prior over entropy should be as close to uniform as possible.