# Bayesian and Quasi-Bayesian Estimators for Mutual Information from Discrete Data


## Abstract


## 1. Introduction

## 2. Entropy and Mutual Information

## 3. Bayesian Entropy Estimation

**Figure 1.** Graphical models for entropy and mutual information of discrete data. Arrows indicate conditional dependencies between variables and gray “plates” indicate N independent draws of random variables.

**Left:** Graphical model for entropy estimation [16,17]. The probability distribution over all variables factorizes as $p(\alpha ,\pi ,\mathbf{x},H)=p(\alpha )p(\pi |\alpha )p(\mathbf{x}|\pi )p(H|\pi )$, where $p(H|\pi )$ is simply a delta measure on $H(\pi )$. The hyper-prior $p(\alpha )$ specifies a set of “mixing weights” for Dirichlet distributions $p(\pi |\alpha )=\text{Dir}(\alpha )$ over discrete distributions π. Data $\mathbf{x}=\left\{{x}_{j}\right\}$ are drawn from the discrete distribution π. Bayesian inference for H entails integrating out α and π to obtain the posterior $p(H|\mathbf{x})$.

**Right:** Graphical model for mutual information estimation, in which π is now a joint distribution that produces paired samples $\left\{({x}_{j},{y}_{j})\right\}$. The mutual information I is a deterministic function of the joint distribution π. The Bayesian estimate comes from the posterior $p(I|\mathbf{x})$, which requires integrating out π and α.

The likelihood of the data **x** given π is multinomial,

$p(\mathbf{x}|\pi )=\frac{N!}{\prod_{i}n_{i}!}\prod_{i}\pi_{i}^{n_{i}}$,

where $n_{i}$ denotes the number of samples in **x** falling in the ith bin, and $N=\sum_{i}n_{i}$ is the total number of samples. Because the Dirichlet is conjugate to the multinomial, the posterior over π given α and **x** takes the form of a Dirichlet distribution,

$p(\pi |\mathbf{x},\alpha )=\text{Dir}(n_{1}+\alpha ,\ldots ,n_{K}+\alpha )$,

and the evidence, the marginal probability of **x** given α, takes the form of a Polya distribution [22]:

$p(\mathbf{x}|\alpha )=\int p(\mathbf{x}|\pi )\,p(\pi |\alpha )\,d\pi =\frac{N!}{\prod_{i}n_{i}!}\;\frac{\Gamma (K\alpha )}{\Gamma (N+K\alpha )}\prod_{i}\frac{\Gamma (n_{i}+\alpha )}{\Gamma (\alpha )}$.
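As a concrete illustration (a minimal sketch, not the paper's code; the function names are ours), the conjugate update and the Polya evidence can be computed numerically in log space:

```python
import numpy as np
from scipy.special import gammaln

def dirichlet_posterior(counts, alpha):
    """Posterior parameters over pi: Dir(n_1 + alpha, ..., n_K + alpha)."""
    return np.asarray(counts, dtype=float) + alpha

def polya_log_evidence(counts, alpha):
    """Log marginal probability p(x | alpha) under a symmetric Dir(alpha) prior."""
    n = np.asarray(counts, dtype=float)
    N, K = n.sum(), n.size
    return (gammaln(N + 1) - gammaln(n + 1).sum()           # multinomial coefficient
            + gammaln(K * alpha) - gammaln(N + K * alpha)   # prior normalizer ratio
            + (gammaln(n + alpha) - gammaln(alpha)).sum())  # per-bin Gamma ratios

# The evidence is a proper distribution over count vectors: for N = 2, K = 2
# and alpha = 1, the outcomes (2,0), (1,1), (0,2) have probabilities summing to 1.
probs = [np.exp(polya_log_evidence(c, 1.0)) for c in ([2, 0], [1, 1], [0, 2])]
```

Working in log space avoids overflow in the Gamma functions for large counts.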

**Figure 2.** NSB priors used for entropy estimation, for three different values of alphabet size K. (**A**) The NSB hyper-prior on the Dirichlet parameter α on a log scale (Equation (12)). (**B**) Prior distributions on H implied by each of the three NSB hyper-priors in (**A**). Ideally, the implied prior over entropy should be as close to uniform as possible.
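Equation (12) is not reproduced in this excerpt, but the standard NSB construction [17] chooses the hyper-prior $p(\alpha )\propto \frac{d}{d\alpha }\mathbb{E}[H|\alpha ]$, so that the implied prior over entropy is approximately flat. A sketch of this construction (in nats; function names are ours):

```python
import numpy as np
from scipy.special import digamma, polygamma

def prior_mean_entropy(alpha, K):
    """E[H | alpha] under a symmetric Dir(alpha) prior on K bins (in nats)."""
    return digamma(K * alpha + 1.0) - digamma(alpha + 1.0)

def nsb_hyperprior_unnorm(alpha, K):
    """Unnormalized NSB hyper-prior: d/d(alpha) of E[H | alpha]."""
    return K * polygamma(1, K * alpha + 1.0) - polygamma(1, alpha + 1.0)

# E[H | alpha] sweeps monotonically from 0 (alpha -> 0) to log K (alpha -> inf),
# so weighting alpha by the derivative spreads prior mass evenly over H.
K = 10
alphas = np.logspace(-3, 3, 7)
means = prior_mean_entropy(alphas, K)
```

Because $\mathbb{E}[H|\alpha ]$ is strictly increasing in α, the unnormalized density is positive everywhere and integrates to $\log K$.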

#### 3.1. Quantifying Uncertainty

The posterior variance of H given **x** provides a natural measure of uncertainty; it can be computed by numerically integrating the variance of entropy across α. The raw second moment of the posterior under a fixed Dirichlet prior is given in closed form in [16,21].

#### 3.2. Efficient Computation

These integrals are weighted by the evidence, the marginal probability of **x** given α (Equation (11)), which has a closed (Polya) form.

## 4. Quasi-Bayesian Estimation of MI

#### 4.1. Bayesian Entropy Estimates Do Not Give Bayesian MI Estimates

**Proposition 1.**
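To make the quasi-Bayesian construction concrete: it plugs Bayesian posterior-mean entropy estimates into the identity $I(X;Y)=H(X)+H(Y)-H(X,Y)$. A minimal sketch using the fixed-α Dirichlet posterior mean of entropy from Wolpert and Wolf [21] (function names are ours; this fixed-α version stands in for the NSB entropy estimates used in the paper):

```python
import numpy as np
from scipy.special import digamma

def bayes_entropy_mean(counts, alpha):
    """Posterior mean E[H | x, alpha] under a symmetric Dir(alpha) prior (nats):
    psi(N + K*alpha + 1) - sum_i [(n_i + alpha)/(N + K*alpha)] psi(n_i + alpha + 1)."""
    n = np.asarray(counts, dtype=float).ravel()
    N, K = n.sum(), n.size
    A = K * alpha
    w = (n + alpha) / (N + A)
    return digamma(N + A + 1.0) - np.sum(w * digamma(n + alpha + 1.0))

def quasi_bayes_mi(joint_counts, alpha):
    """H_hat(X) + H_hat(Y) - H_hat(X,Y), each term a Bayesian posterior mean.
    Note: this combination is NOT itself the posterior mean of I."""
    nxy = np.asarray(joint_counts, dtype=float)
    return (bayes_entropy_mean(nxy.sum(axis=1), alpha)
            + bayes_entropy_mean(nxy.sum(axis=0), alpha)
            - bayes_entropy_mean(nxy, alpha))

# Independent uniform data: the quasi-Bayesian estimate is near zero.
mi_ind = quasi_bayes_mi(np.full((2, 2), 250), 1.0)
```

Each entropy term is the mean of a different posterior, so their combination need not equal the posterior mean of I under any single prior on the joint table.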

## 5. Fully Bayesian Estimation of MI

#### 5.1. Dirichlet Prior

Under a fixed $\text{Dir}(\alpha )$ prior on the joint distribution, the posterior mean of I given **x** and α (first given by Hutter in [18]) has a closed form in terms of digamma functions.
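A sketch of this closed-form posterior mean, implemented as we read Hutter's result [18] (function name ours; in nats):

```python
import numpy as np
from scipy.special import digamma

def hutter_mi_posterior_mean(joint_counts, alpha):
    """E[I | x, alpha] under a symmetric Dir(alpha) prior on the joint table (nats),
    following the digamma closed form of Hutter [18] as we read it."""
    n = np.asarray(joint_counts, dtype=float)
    Kx, Ky = n.shape
    N = n.sum()
    A = Kx * Ky * alpha               # total prior concentration
    nt = n + alpha                    # posterior cell pseudo-counts
    ri = n.sum(axis=1) + Ky * alpha   # posterior row pseudo-counts
    cj = n.sum(axis=0) + Kx * alpha   # posterior column pseudo-counts
    w = nt / (N + A)
    term = (digamma(nt + 1.0)
            - digamma(ri + 1.0)[:, None]
            - digamma(cj + 1.0)[None, :]
            + digamma(N + A + 1.0))
    return np.sum(w * term)

# Strongly dependent data: posterior mean approaches log 2 nats.
mi_dep = hutter_mi_posterior_mean(np.array([[500, 0], [0, 500]]), 1e-3)
# Independent data: posterior mean near zero.
mi_ind = hutter_mi_posterior_mean(np.full((2, 2), 250), 1.0)
```

The estimate is a single weighted sum over cells, so it costs the same as the plug-in estimator while accounting for the prior.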

**Figure 3.** The distribution of MI under a $\text{Dir}(\alpha )$ prior as a function of α for a 10 × 100 joint probability table. The distributions $p(I|\alpha )$ are tightly concentrated around 0 for very small and very large values of α. The mean MI for each distribution increases with α until a “cutoff” near 0.01, past which the mean decreases again with α. All curves are colored in a gradient from dark red (small α) to bright yellow (large α). (**A**) Distributions $p(I|\alpha )$ for $\alpha <0.01$. Notice that some distributions are bimodal, with peaks in MI around 0 and 1 bit. The peak around 0 appears because, for very small values of α, nearly all probability mass is concentrated on joint tables with all mass on a single entry. The peak around 1 bit arises because, as α increases from 0, tables with 2 nonzero entries become increasingly likely. (**B**) Distributions $p(I|\alpha )$ for $\alpha >0.01$. (**C**) The distributions in (**A**) and (**B**) plotted together to better illustrate their dependence on ${\log }_{10}(\alpha )$. The color bar underneath shows the color of each distribution appearing in (**A**) and (**B**). Note that no $\text{Dir}(\alpha )$ prior assigns significant probability mass to values of I near the maximal MI of ${\log }_{2}10\approx 3.3$ bits; the highest mean $\mathbb{E}[I|\alpha ]$, approximately 2.65 bits, occurs at the cutoff value $\alpha \approx 0.01$.

**Figure 4.** Prior mean $\mathbb{E}[I|\alpha ]$ (solid gray lines) and 80% quantiles (gray regions) of mutual information for tables of size 10 × 10, 10 × 100, and 10 × 10^{3}, as α varies. Quantiles are computed by sampling 5 × 10^{4} probability tables from a $\text{Dir}(\alpha )$ distribution for each value of α. For very large and very small α, $p(I|\alpha )$ is concentrated tightly around I = 0 (see Figure 3). For small α, the most probable tables under $\text{Dir}(\alpha )$ are those with all probability mass in a single bin; for very large α, the probability mass of $\text{Dir}(\alpha )$ concentrates on nearly uniform probability tables. Note that sampling fails for very small values of α due to numerical issues: for α ≈ 10^{−6}, nearly all sampled tables have only a single nonzero element, and the sample quantiles do not contain $\mathbb{E}[I|\alpha ]$.
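The sampling procedure behind Figure 4 can be mimicked in a few lines (a sketch with our own function names; for speed we use a 10 × 10 table and a modest number of samples):

```python
import numpy as np

def plugin_mi(p):
    """Exact (plug-in) mutual information, in nats, of a joint probability table p."""
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    mask = p > 0                      # convention: 0 * log 0 = 0
    return np.sum(p[mask] * np.log(p[mask] / (px @ py)[mask]))

def sample_mi(alpha, shape, n_tables, rng):
    """Draw joint tables from a symmetric Dir(alpha) and return their MI values."""
    Kx, Ky = shape
    tables = rng.dirichlet(np.full(Kx * Ky, alpha), size=n_tables)
    return np.array([plugin_mi(t.reshape(Kx, Ky)) for t in tables])

rng = np.random.default_rng(0)
# Large alpha: sampled tables are nearly uniform, so p(I | alpha) piles up near 0.
mi_large = sample_mi(100.0, (10, 10), 200, rng)
```

Quantiles of `mi_large` (via `np.quantile`) reproduce one column of the gray regions in the figure.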

#### 5.2. Mixture-of-Dirichlets (MOD) Prior

**Figure 5.** Illustration of Mixture-of-Dirichlets (MOD) priors and hyper-priors, for three settings of ${K}_{y}$ and ${K}_{x}$. (**A**) Hyper-priors over α for three different-sized joint distributions: $({K}_{x},{K}_{y})=(10,500)$ (dark), $({K}_{x},{K}_{y})=(50,100)$ (gray), and $({K}_{x},{K}_{y})=(50,500)$ (light gray). (**B**) Prior distributions over mutual information implied by each of the priors on α shown in (**A**). The prior on mutual information remains approximately flat for varying table sizes, but note that it does not assign much probability to the maximum possible mutual information, which is given by the right-most point on the abscissa in each graph.

## 6. Results

**Figure 6.** Performance of MI estimators for true distributions sampled from distributions related to the Dirichlet. Joint distributions have 10 × 100 bins. (**left column**) Estimated mutual information from 100 data samples, as a function of a parameter defining the true distribution. Error bars indicate the variability of the estimator over independent samples (± one standard deviation). Gray shading denotes the average 95% Bayesian credible interval. Insets show examples of true joint distributions, for visualization purposes. (**right column**) Convergence as a function of sample size. The true distribution is the one shown in the central panel of the corresponding figure on the left. Inset images show examples of the empirical distribution, calculated from data. (**top row**) True distributions sampled from a fixed Dirichlet Dir(α) prior, where α varies from 10^{−3} to 10^{2}. (**bottom row**) Each column (conditional) is an independent Dirichlet distribution with a fixed α.

**Figure 7.** Performance of MI estimators on distributions sampled from more structured distributions. Joint distributions have 10 × 100 bins. The format of the left and right columns is the same as in Figure 6. (**top row**) Dirichlet joint distribution with a base measure ${\mu }_{ij}$ chosen such that there is a diagonal strip of low concentration. The base measure is given by ${\mu }_{ij}\propto \frac{1}{{10}^{-6}+Q(i,j)}$, where $Q(x,y)$ is a 2D Gaussian probability density function with zero mean and covariance matrix $\left[\begin{array}{cc}0.08& 0\\ 0& 0.0003\end{array}\right]$. We normalized ${\mu }_{ij}$ to sum to one over the grid shown. (**bottom row**) Laplace conditional distributions with linearly-shifting means. Each conditional $p(Y=y|X=x)$ has the form ${e}^{-|y-10x|/\eta }$. These conditionals are shifted circularly with respect to one another, generating a diagonal structure.
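The bottom-row test distribution can be reconstructed as follows (a sketch based on our reading of the caption; we interpret "shifted circularly" as wrapping the distance |y − 10x| around the 100 y-bins, and the default η is our choice):

```python
import numpy as np

def laplace_shift_joint(Kx=10, Ky=100, eta=5.0):
    """Joint table with p(x) uniform and p(y|x) proportional to exp(-d(y, 10x)/eta),
    where d is circular distance on the Ky bins (our reading of the caption)."""
    x = np.arange(Kx)[:, None]
    y = np.arange(Ky)[None, :]
    d = np.abs(y - 10 * x)
    d = np.minimum(d, Ky - d)                 # wrap distances circularly
    cond = np.exp(-d / eta)
    cond /= cond.sum(axis=1, keepdims=True)   # normalize each conditional p(y|x)
    return cond / Kx                          # uniform marginal over x

joint = laplace_shift_joint()
```

Because each conditional is a shifted copy of the same kernel, the joint has the diagonal structure visible in the figure insets, and its exact MI can be computed directly from `joint`.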

**Figure 8.** Performance of MI estimators: failure mode for ${\widehat{I}}_{\text{MOD}}$. Joint distributions have 10 × 100 bins. The format of the left and right columns is the same as in Figure 6. (**top row**) True distributions are rotating, discretized Gaussians, where the rotation angle is varied from 0 to π. For cardinal orientations the distribution is independent and MI is 0; for diagonal orientations the MI is maximal. (**bottom row**) Each column (conditional) is an independent Laplace (double-exponential) distribution: $p(Y=j|X=i)={e}^{-|j-50|/\tau (i)}$. The width of the Laplace distribution is governed by $\tau (i)={e}^{-1.11i\zeta }$, where ζ is varied from 10^{−2} to 1.

**Figure 9.** Comparison of the Hutter MI estimator for different values of α. The four datasets shown in Figure 6 and Figure 7 are used, with the same sample size (N = 100) as in Figure 6. We compare the improper Haldane prior (α = 0), Perks prior ($\alpha =\frac{1}{{K}_{x}{K}_{y}}$), Jeffreys’ prior ($\alpha =\frac{1}{2}$), and uniform prior (α = 1) [19]. ${\widehat{I}}_{\text{NSB1}}$ is also shown for comparison.

## 7. Conclusions

## Acknowledgement

## A. Derivations

#### A.1. Mean of Mutual Information Under Dirichlet Distribution

## References

1. Schindler, K.H.; Palus, M.; Vejmelka, M.; Bhattacharya, J. Causality detection based on information-theoretic approaches in time series analysis. *Phys. Rep.* **2007**, *441*, 1–46.
2. Rényi, A. On measures of dependence. *Acta Math. Hung.* **1959**, *10*, 441–451.
3. Chow, C.; Liu, C. Approximating discrete probability distributions with dependence trees. *IEEE Trans. Inf. Theory* **1968**, *14*, 462–467.
4. Rieke, F.; Warland, D.; de Ruyter van Steveninck, R.; Bialek, W. *Spikes: Exploring the Neural Code*; MIT Press: Cambridge, MA, USA, 1996.
5. Ma, S. Calculation of entropy from data of motion. *J. Stat. Phys.* **1981**, *26*, 221–240.
6. Bialek, W.; Rieke, F.; de Ruyter van Steveninck, R.R.; Warland, D. Reading a neural code. *Science* **1991**, *252*, 1854–1857.
7. Strong, S.; Koberle, R.; de Ruyter van Steveninck, R.; Bialek, W. Entropy and information in neural spike trains. *Phys. Rev. Lett.* **1998**, *80*, 197–202.
8. Paninski, L. Estimation of entropy and mutual information. *Neural Comput.* **2003**, *15*, 1191–1253.
9. Barbieri, R.; Frank, L.; Nguyen, D.; Quirk, M.; Solo, V.; Wilson, M.; Brown, E. Dynamic analyses of information encoding in neural ensembles. *Neural Comput.* **2004**, *16*, 277–307.
10. Kennel, M.; Shlens, J.; Abarbanel, H.; Chichilnisky, E. Estimating entropy rates with Bayesian confidence intervals. *Neural Comput.* **2005**, *17*, 1531–1576.
11. Victor, J. Approaches to information-theoretic analysis of neural activity. *Biol. Theory* **2006**, *1*, 302–316.
12. Shlens, J.; Kennel, M.B.; Abarbanel, H.D.I.; Chichilnisky, E.J. Estimating information rates with confidence intervals in neural spike trains. *Neural Comput.* **2007**, *19*, 1683–1719.
13. Vu, V.Q.; Yu, B.; Kass, R.E. Coverage-adjusted entropy estimation. *Stat. Med.* **2007**, *26*, 4039–4060.
14. Montemurro, M.A.; Senatore, R.; Panzeri, S. Tight data-robust bounds to mutual information combining shuffling and model selection techniques. *Neural Comput.* **2007**, *19*, 2913–2957.
15. Vu, V.Q.; Yu, B.; Kass, R.E. Information in the nonstationary case. *Neural Comput.* **2009**, *21*, 688–703.
16. Archer, E.; Park, I.M.; Pillow, J. Bayesian estimation of discrete entropy with mixtures of stick-breaking priors. In *Advances in Neural Information Processing Systems 25*; Bartlett, P., Pereira, F., Burges, C., Bottou, L., Weinberger, K., Eds.; MIT Press: Cambridge, MA, USA, 2012; pp. 2024–2032.
17. Nemenman, I.; Shafee, F.; Bialek, W. Entropy and inference, revisited. In *Advances in Neural Information Processing Systems 14*; MIT Press: Cambridge, MA, USA, 2002; pp. 471–478.
18. Hutter, M. Distribution of mutual information. In *Advances in Neural Information Processing Systems 14*; MIT Press: Cambridge, MA, USA, 2002; pp. 399–406.
19. Hutter, M.; Zaffalon, M. Distribution of mutual information from complete and incomplete data. *Comput. Stat. Data Anal.* **2005**, *48*, 633–657.
20. Treves, A.; Panzeri, S. The upward bias in measures of information derived from limited data samples. *Neural Comput.* **1995**, *7*, 399–407.
21. Wolpert, D.; Wolf, D. Estimating functions of probability distributions from a finite set of samples. *Phys. Rev. E* **1995**, *52*, 6841–6854.
22. Minka, T. *Estimating a Dirichlet Distribution*; Technical Report; MIT: Cambridge, MA, USA, 2003.
23. Nemenman, I.; Lewen, G.D.; Bialek, W.; de Ruyter van Steveninck, R.R. Neural coding of natural stimuli: Information at sub-millisecond resolution. *PLoS Comput. Biol.* **2008**, *4*, e1000025.

© 2013 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

## Share and Cite

**MDPI and ACS Style**

Archer, E.; Park, I.M.; Pillow, J.W.
Bayesian and Quasi-Bayesian Estimators for Mutual Information from Discrete Data. *Entropy* **2013**, *15*, 1738-1755.
https://doi.org/10.3390/e15051738
