# Estimating the Mutual Information between Two Discrete, Asymmetric Variables with Limited Samples


## Abstract


## 1. Introduction

## 2. Bayesian Approaches to the Estimation of Entropies

## 3. A Prior Distribution for the Conditional Entropies

## 4. A Closer Look at the Case of a Symmetric and Binary $\mathit{Y}$-Variable

## 5. Testing the Estimator

(**a**) and the standard deviation (**b**). The small departures from the diagonal stem from the fact that the analytical average runs over all possible ${\mathbf{q}}_{x}$ and $\left\{{\mathbf{q}}_{y|x}\right\}$, even those that are highly improbable for a given set of multiplicities. The numerical average, instead, includes only the subset of the 13,500 explored cases that produced the tested multiplicity. All the depicted subsets contained many cases but still fell unavoidably short of the infinitely many cases covered by the theoretical result.

## 6. A Prior Distribution for the Large Entropy Variable

## 7. Discussion

## 8. Conclusions

## Supplementary Materials

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## Appendix A. Expected Variance for a Symmetric, Binary Y-Variable

## Appendix B. Bounding the Bias for Independent Variables

## References

- Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. **1948**, 27, 379–423.
- Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2012.
- Panzeri, S.; Treves, A. Analytical estimates of limited sampling biases in different information measures. Network Comput. Neural Syst. **1996**, 7, 87–107.
- Samengo, I. Estimating probabilities from experimental frequencies. Phys. Rev. E **2002**, 65, 046124.
- Paninski, L. Estimation of entropy and mutual information. Neural Comput. **2003**, 15, 1191–1253.
- Kraskov, A.; Stögbauer, H.; Grassberger, P. Estimating mutual information. Phys. Rev. E **2004**, 69, 066138.
- Montemurro, M.A.; Senatore, R.; Panzeri, S. Tight data-robust bounds to mutual information combining shuffling and model selection techniques. Neural Comput. **2007**, 19, 2913–2957.
- Archer, E.; Park, I.M.; Pillow, J.W. Bayesian and quasi-Bayesian estimators for mutual information from discrete data. Entropy **2013**, 15, 1738–1755.
- Kolchinsky, A.; Tracey, B.D. Estimating mixture entropy with pairwise distances. Entropy **2017**, 19, 361.
- Belghazi, I.; Rajeswar, S.; Baratin, A.; Hjelm, R.D.; Courville, A. MINE: Mutual information neural estimation. arXiv **2018**, arXiv:1801.04062.
- Safaai, H.; Onken, A.; Harvey, C.D.; Panzeri, S. Information estimation using nonparametric copulas. Phys. Rev. E **2018**, 98, 053302.
- Strong, S.P.; Koberle, R.; van Steveninck, R.R.D.R.; Bialek, W. Entropy and information in neural spike trains. Phys. Rev. Lett. **1998**, 80, 197.
- Nemenman, I.; Bialek, W.; van Steveninck, R.D.R. Entropy and information in neural spike trains: Progress on the sampling problem. Phys. Rev. E **2004**, 69, 056111.
- Archer, E.; Park, I.M.; Pillow, J.W. Bayesian entropy estimation for countable discrete distributions. J. Mach. Learn. Res. **2014**, 15, 2833–2868.
- Wolpert, D.H.; DeDeo, S. Estimating functions of distributions defined over spaces of unknown size. Entropy **2013**, 15, 4668–4699.
- Jaynes, E.T. Probability Theory: The Logic of Science; Cambridge University Press: Cambridge, UK, 2007.
- Wolpert, D.H.; Wolf, D.R. Estimating functions of probability distributions from a finite set of samples. Phys. Rev. E **1995**, 52, 6841.
- Ma, S.-K. Calculation of entropy from data of motion. J. Stat. Phys. **1981**, 26, 221–240.
- Nemenman, I. Coincidences and estimation of entropies of random variables with large cardinalities. Entropy **2011**, 13, 2013–2023.
- Antos, A.; Kontoyiannis, I. Convergence properties of functional estimates for discrete distributions. Random Struct. Algorithms **2001**, 19, 163–193.
- Grassberger, P. Entropy estimates from insufficient samplings. arXiv **2003**, arXiv:0307138.
- Schürmann, T. Bias analysis in entropy estimation. J. Phys. Math. Gen. **2004**, 37, L295–L300.
- Chao, A.; Wang, Y.; Jost, L. Entropy and the species accumulation curve: A novel entropy estimator via discovery rates of new species. Methods Ecol. Evol. **2013**, 4, 1091–1100.
- Kazhdan, M.; Funkhouser, T.; Rusinkiewicz, S. Rotation invariant spherical harmonic representation of 3D shape descriptors. Symp. Geom. Process. **2003**, 6, 156–164.
- Shwartz-Ziv, R.; Tishby, N. Opening the black box of deep neural networks via information. arXiv **2017**, arXiv:1703.00810.
- Kinney, J.B.; Atwal, G.S. Equitability, mutual information, and the maximal information coefficient. Proc. Natl. Acad. Sci. USA **2014**, 111, 3354–3359.
- Grassberger, P. Entropy estimates from insufficient samples. Archive **2001**, 412, 787.
- Barlow, R.J. Statistics: A Guide to the Use of Statistical Methods in the Physical Sciences; John Wiley & Sons: Hoboken, NJ, USA, 1993.
- Amari, S.I. Information geometry on hierarchy of probability distributions. IEEE Trans. Inf. Theory **2001**, 47, 1701–1711.
- Panzeri, S.; Schultz, S.R.; Treves, A.; Rolls, E.T. Correlations and the encoding of information in the nervous system. Proc. R. Soc. B Biol. Sci. **1999**, 226, 1001–1012.
- Panzeri, S.; Schultz, S.R. Temporal correlations and neural spike train entropy. Phys. Rev. Lett. **2001**, 86, 5823–5826.
- Panzeri, S.; Schultz, S.R. A unified approach to the study of temporal, correlational, and rate coding. Neural Comput. **2001**, 13, 1311–1349.
- Pola, G.; Hoffmann, K.P.; Panzeri, S. An exact method to quantify the information transmitted by different mechanisms of correlational coding. Network **2003**, 14, 35–60.
- Hernández, D.G.; Zanette, D.H.; Samengo, I. Information-theoretical analysis of the statistical dependencies between three variables: Applications to written language. Phys. Rev. E **2015**, 92, 022813.
- Williams, P.L.; Beer, R.D. Nonnegative decomposition of multivariate information. arXiv **2010**, arXiv:1004.2515.
- Harder, M.; Salge, C.; Polani, D. Bivariate measure of redundant information. Phys. Rev. E **2013**, 87, 012130.
- Timme, N.; Alford, W.; Flecker, B.; Beggs, J.M. Synergy, redundancy, and multivariate information measures: An experimentalist's perspective. J. Comput. Neurosci. **2013**, 36, 119–140.
- Griffith, V.; Koch, C. Quantifying synergistic mutual information. In Guided Self-Organization: Inception; Prokopenko, M., Ed.; Springer: Berlin/Heidelberg, Germany, 2014; pp. 159–190.
- Bertschinger, N.; Rauh, J.; Olbrich, E.; Jost, J.; Ay, N. Quantifying unique information. Entropy **2014**, 16, 2161–2183.
- Ince, R.A.A. Measuring multivariate redundant information with pointwise common change in surprisal. Entropy **2017**, 19, 318.
- Yu, S.; Sanchez Giraldo, L.G.; Jenssen, R.; Príncipe, J.C. Multivariate extension of matrix-based Rényi's α-order entropy functional. arXiv **2018**, arXiv:1808.07912.
- Tang, C.; Chehayeb, D.; Srivastava, K.; Nemenman, I.; Sober, S.J. Millisecond-scale motor encoding in a cortical vocal area. PLoS Biol. **2014**, 12, e1002018.
- Maidana Capitán, M.; Kropff, E.; Samengo, I. Information-theoretical analysis of the neural code in the rodent temporal lobe. Entropy **2018**, 20, 571.
- Butte, A.J.; Tamayo, P.; Slonim, D.; Golub, T.R.; Kohane, I.S. Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proc. Natl. Acad. Sci. USA **2000**, 97, 12182–12186.
- Tishby, N.; Pereira, F.C.; Bialek, W. The information bottleneck method. arXiv **2000**, arXiv:0004057.
- Still, S.; Bialek, W. How many clusters? An information-theoretic perspective. Neural Comput. **2004**, 16, 2483–2506.
- Fairhall, A.L.; Lewen, G.D.; Bialek, W.; van Steveninck, R.R.D.R. Efficiency and ambiguity in an adaptive neural code. Nature **2001**, 412, 787.

**Figure 1.** A scheme of our method to estimate the mutual information between two variables X and Y. (**a**) We collect a few samples of a variable x with a large number of effective states ${x}_{1},{x}_{2},\cdots$, each sample characterized by a binary variable y (the two values represented in white and gray). We consider different hypotheses about the strength with which the probability of each y-value varies with x. (**b**) One possibility is that the conditional probability of each of the two y-values hardly varies with x. This situation is modeled by assuming that the different ${q}_{y|x}$ are random variables governed by a Beta distribution with a large hyper-parameter ${\beta}_{1}$. (**c**) On the other hand, the conditional probability ${q}_{y|x}$ could vary strongly with x. This situation is modeled by a Beta distribution with a small hyper-parameter ${\beta}_{2}$. (**d**) As $\beta$ varies, so does the prior mutual information (Equation (18)). This prior is obtained by averaging all the $I\left(\left\{{q}_{1|x}\right\}\right)$ values obtained from the different possible sets of marginal distributions $\left\{{q}_{1|x}\right\}$ that can be generated when sampling the prior $p\left(\left\{{q}_{1|x}\right\}\right|\beta )$ of Equation (17). The shaded area around the solid line illustrates such fluctuations in $I\left(\left\{{q}_{1|x}\right\}\right)$ when ${k}_{x}=50$.
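The role of $\beta$ in panel (d) can be illustrated numerically: for a binary y and, for simplicity, a uniform ${q}_{x}$ over ${k}_{x}=50$ states, draw the conditionals ${q}_{1|x}\sim \mathrm{Beta}(\beta /2,\beta /2)$ and evaluate $I\left(\left\{{q}_{1|x}\right\}\right)=H(Y)-{\sum}_{x}{q}_{x}H(Y|x)$. This is a minimal sketch, not the paper's Equation (18) itself; the function names and the uniform-${q}_{x}$ simplification are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy_bits(p):
    """Shannon entropy (in bits) of a probability vector, ignoring zero entries."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def sampled_prior_information(beta, kx=50):
    """One draw of I({q_{1|x}}) = H(Y) - sum_x q_x H(Y|x), with uniform q_x
    and conditionals q_{1|x} ~ Beta(beta/2, beta/2)."""
    q1 = rng.beta(beta / 2, beta / 2, size=kx)   # q_{1|x} for each state x
    qy = q1.mean()                               # marginal q_1 under uniform q_x
    h_y = entropy_bits([qy, 1.0 - qy])
    h_y_given_x = np.mean([entropy_bits([q, 1.0 - q]) for q in q1])
    return h_y - h_y_given_x

# Large beta (as in panel b): conditionals cluster near 1/2 -> information near 0.
# Small beta (as in panel c): conditionals pile up near 0 and 1 -> information near 1 bit.
i_large_beta = np.mean([sampled_prior_information(100.0) for _ in range(20)])
i_small_beta = np.mean([sampled_prior_information(0.05) for _ in range(20)])
print(f"beta = 100 : I ~ {i_large_beta:.3f} bits")
print(f"beta = 0.05: I ~ {i_small_beta:.3f} bits")
```

Averaging many such draws over the prior would trace out the monotonically decreasing curve of panel (d), with the spread of single draws giving the shaded band.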

**Figure 2.** Comparison of the performance of four different estimators for ${I}_{XY}$: the plug-in estimator, the NSB estimator used in the limit of infinite states, the PYM estimator, and our estimator $\langle I|\mathbf{n}\rangle (\beta )$ (Equation (20)) calculated with the $\beta$ that maximizes the marginal likelihood $p\left(\mathbf{n}\right|\beta )$ (Equation (21)). The curves represent the average over 50 different data sets $\mathbf{n}$, with the standard deviation displayed as a colored area around the mean. (**a**) Estimates of mutual information as a function of the total number of samples N, when the values of ${q}_{1|x}$ are generated under the hypothesis of our method (Equation (17)). We sample the marginal probabilities ${q}_{x}\sim \mathrm{PY}(d=0.55,\alpha =50)$ once (as described in [14]), as well as the conditionals ${q}_{y|x}\sim \mathrm{Beta}(\beta /2,\beta /2)$ with $\beta =2.3$. The effective size of the system is $exp\left({H}_{XY}\right)\simeq 800$. The exact value of ${I}_{XY}$ is shown as a horizontal dashed line. (**b**) Estimates of mutual information for data sets where the conditional probabilities have spherical symmetry. X, a binary vector of dimension 12, corresponds to the presence of 12 delta functions equally spaced on a sphere (${q}_{x}={2}^{-12}$, for all x). We generate the conditional probabilities such that they are invariant under rotations of the sphere, namely ${q}_{y|x}={q}_{y\left|\mathrm{R}\right(x)}$, with R a rotation. To this aim, we set ${q}_{y|x}$ as a sigmoid function of a combination of frequency components (${\pi}_{0}-{\pi}_{1}-{\pi}_{2}$) of the spherical spectrum [24]. The effective size of the system is $exp\left({H}_{XY}\right)\simeq 5000$. (**c**) Estimates of mutual information for a conditional distribution far from our hypotheses. The x states are generated as Bernoulli ($p=0.05$) binary vectors of dimension $D=40$, while the conditional probabilities depend on the parity of the sum of the components of the vector. When the sum is even, we set ${q}_{y|x}=1/2$; when it is odd, ${q}_{y|x}$ is generated by sampling a mixture of two deltas of equal weight, ${q}_{y|x}\sim [\delta (q-{q}_{0})+\delta (q-1+{q}_{0})]/2$, with ${q}_{0}=0.1$. The resulting distribution of ${q}_{y|x}$-values contains three peaks and, therefore, cannot be described with a Dirichlet distribution. The effective size of the system is $exp\left({H}_{XY}\right)\simeq 4000$. (**d**) Bias in the estimation as a function of the value of mutual information. Settings remain the same as in (**a**), but fixing $N=500$ and varying $\beta \in (0.04,\phantom{\rule{0.166667em}{0ex}}14)$ in the conditional. (**e**) Bias in the estimation as a function of the value of mutual information. Settings as in (**b**), but fixing $N=2000$ and varying the gain of the sigmoid in the conditional. (**f**) Bias in the estimation as a function of the value of mutual information. Settings as in (**c**), but fixing $N=2000$ and varying ${q}_{0}\in (0.01,\phantom{\rule{0.166667em}{0ex}}0.4)$ in the conditional.
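The setup of panel (a) can be imitated with a short simulation: a truncated stick-breaking approximation to the Pitman–Yor marginal ${q}_{x}$, Beta-distributed conditionals, and a comparison of the exact information against the plug-in estimate, which is strongly biased upward at small N. This is our own sketch under stated assumptions (the truncation level, random seed, and helper names are ours; the NSB, PYM, and posterior-mean estimators of the figure are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(1)

def stick_breaking_py(d, alpha, n_sticks=2000):
    """Truncated stick-breaking approximation to a Pitman-Yor draw PY(d, alpha)."""
    i = np.arange(1, n_sticks + 1)
    v = rng.beta(1.0 - d, alpha + d * i)
    w = v * np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))
    return w / w.sum()

def h2(q):
    """Binary entropy in bits (clipped to avoid log(0))."""
    q = np.clip(q, 1e-15, 1.0 - 1e-15)
    return -(q * np.log2(q) + (1.0 - q) * np.log2(1.0 - q))

# Exact distribution: q_x ~ PY(0.55, 50), q_{1|x} ~ Beta(beta/2, beta/2), beta = 2.3.
qx = stick_breaking_py(d=0.55, alpha=50.0)
q1 = rng.beta(2.3 / 2, 2.3 / 2, size=qx.size)
i_exact = h2(float(qx @ q1)) - float(qx @ h2(q1))   # I = H(Y) - sum_x q_x H(Y|x)

# Draw N = 100 paired samples and form the plug-in (maximum-likelihood) estimate.
N = 100
xs = rng.choice(qx.size, size=N, p=qx)
ys = (rng.random(N) < q1[xs]).astype(int)
counts = np.zeros((qx.size, 2))
np.add.at(counts, (xs, ys), 1.0)
pxy = counts / N
px, py = pxy.sum(axis=1), pxy.sum(axis=0)
mask = pxy > 0
i_plug_in = float(np.sum(pxy[mask] * np.log2(pxy[mask] / np.outer(px, py)[mask])))

print(f"exact I_XY    : {i_exact:.3f} bits")
print(f"plug-in, N=100: {i_plug_in:.3f} bits (upward bias)")
```

In this severely under-sampled regime most x states appear at most once, so the plug-in conditional entropy collapses toward zero and the estimated information inflates toward $H(Y)$, which is the behavior the figure quantifies.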

**Figure 3.** Verification of the accuracy of the analytically predicted mean posterior information (Equation (20)) and variance (Equation (A4)) in the severely under-sampled regime. A collection of 13,500 distributions ${q}_{xy}$ is constructed by sampling ${q}_{x}\sim \mathrm{DP}\left(\alpha \right)$ and ${q}_{y|x}\sim \mathrm{Beta}(\beta /2,\beta /2)$, with $\alpha$ varying in the set $\{{e}^{4},{e}^{5},{e}^{6}\}$ and $log\beta$ from Equation (19). Each distribution ${q}_{xy}$ has an associated ${I}_{XY}\left({q}_{xy}\right)$. From each ${q}_{xy}$, we take five sets of just $N=40$ samples. (**a**) The values of $I\left({q}_{xy}\right)$ are grouped according to the multiplicities $\left\{{m}_{n{n}^{\prime}}\right\}$ produced by the samples, averaged together, and depicted as the y component of each data point. The x component is the analytical result of Equation (20), based on the sampled multiplicities. (**b**) Same analysis for the standard deviation of the information (the square root of the variance calculated in Equation (A4)).
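The grouping in panel (a) reduces a data set to its multiplicities ${m}_{n{n}^{\prime}}$: the number of distinct x states observed n times in total, ${n}^{\prime}$ of which came with $y=1$. A small sketch of this bookkeeping (the helper name and the toy data are ours):

```python
from collections import Counter

def multiplicities(xs, ys):
    """Multiplicities m_{n,n'}: how many distinct x states were sampled n times
    in total, with n' of those samples having y = 1."""
    per_state = {}
    for x, y in zip(xs, ys):
        n, n1 = per_state.get(x, (0, 0))
        per_state[x] = (n + 1, n1 + y)
    return Counter(per_state.values())

# Toy data: state 3 sampled twice (one y=1), state 7 three times (two y=1),
# state 9 once (y=0).
xs = [3, 3, 7, 7, 7, 9]
ys = [0, 1, 1, 1, 0, 0]
m = multiplicities(xs, ys)
print(dict(m))   # {(2, 1): 1, (3, 2): 1, (1, 0): 1}
```

Two data sets with the same multiplicities yield the same analytical estimate (Equation (20)), which is why the comparison in the figure is organized around them.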

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Hernández, D.G.; Samengo, I.
Estimating the Mutual Information between Two Discrete, Asymmetric Variables with Limited Samples. *Entropy* **2019**, *21*, 623.
https://doi.org/10.3390/e21060623
