# Minimum Mutual Information and Non-Gaussianity through the Maximum Entropy Method: Estimation from Finite Samples

## Abstract

… $\mathbf{T}_{cr}$, comprising $m_{cr}$ linear and/or nonlinear joint expectations, computed from samples of N iid outcomes. Marginals (and their entropy) are imposed by single morphisms of the original random variables. N-asymptotic formulas are given for the distribution of the estimation errors of the cross expectations, and for the bias, variance and distribution of the MinMI estimation. A growing $\mathbf{T}_{cr}$ leads to an increasing MinMI, converging eventually to the total MI. Under N-sized samples, the MinMI increment relative to two encapsulated sets $\mathbf{T}_{cr1}\subset \mathbf{T}_{cr2}$ (with numbers of constraints $m_{cr1}<m_{cr2}$) is the test-difference $\delta H={H}_{\mathrm{max}1,N}-{H}_{\mathrm{max}2,N}\ge 0$ between the two respective estimated MEs. Asymptotically, $\delta H$ follows a Chi-Squared distribution $\frac{1}{2N}{\chi}_{({m}_{cr2}-{m}_{cr1})}^{2}$ whose upper quantiles determine whether the constraints in $\mathbf{T}_{cr2}/\mathbf{T}_{cr1}$ explain significant extra MI. As an example, we set the marginals to be normally distributed (Gaussian) and build a sequence of MI bounds associated with successive nonlinear correlations due to joint non-Gaussianity. Since available sample sizes can be rather low in real-world situations, the relationship between MinMI bias, probability density over-fitting and outliers is highlighted for under-sampled data.

## 1. Introduction

#### 1.1. The State of the Art

… $T_{cr}$ of $m_{cr}$ empirical non-redundant cross constraints (e.g., a set of cross expectations between a stimulus X and a response Y, for example in a neural cell, the Earth’s climate or an ecosystem). The constrained MI, or Minimum Mutual Information (MinMI), between the RVs $X,Y$ is ${I}_{\mathrm{min}}(X,Y)=H(X)+H(Y)-{H}_{\mathrm{max}}(X,Y)=H(Y)-{H}_{\mathrm{max}}(Y|X)$, obtained by subtracting the maximum joint entropy (ME) ${H}_{\mathrm{max}}$, compatible with the imposed cross constraints, from the sum of the fixed marginal entropies. The solution comes from application of the MinMI principle [9,10]. The MinMI is a lower bound of the MI, depending on the marginal pdfs (e.g., Gaussians, Uniforms, Gammas) as well as on the particular form of the cross expectations in $T_{cr}$ (e.g., linear and nonlinear correlations). There are only a few cases of known closed formulas for the MinMI with $m_{cr}=1$: (a) Gaussian marginals and Pearson linear correlation [8,11,12] and (b) Uniform marginals and rank linear correlation [11]. The authors have presented in [12] (PP12 hereafter) a general formalism for computing, though not in explicit form, the MinMI in terms of multiple ($m_{cr}>1$) linear and nonlinear cross expectations included in $T_{cr}$. This set can consist of a natural population constraint (e.g., a specific neural behavior) or it can grow without limit through additional expectations computed within a sample, with the MinMI increasing and converging eventually to the total MI.

This paper is the natural follow-up of PP12 [12], now studying the statistics (mean or bias, variance and distribution) of the MinMI estimation errors: $\mathrm{\Delta}{I}_{\mathrm{min},N}=-\mathrm{\Delta}{H}_{\mathrm{max},N}\equiv -({H}_{\mathrm{max},N}-{H}_{\mathrm{max}})$, where ${H}_{\mathrm{max},N}$ is the ME estimate obtained from N-sized samples of iid outcomes. Those errors are roughly similar to the errors of generic MI and entropy estimators (see [13,14] for a thorough review and performance comparisons between MI estimators). Their mean (bias), variance and higher-order moments are written in terms of powers of ${N}^{-1}$, thus covering intermediate and asymptotic N ranges [15], with specific applications in neurophysiology [16,17,18]. Entropy estimators range from the histogram-based plug-in estimator [19], whose leading negative bias $-(m-1)/(2N)$ is removed by the Miller-Madow correction [20], where m is the number of univariate histogram bins, to much improved estimators (e.g., kernel density estimators, adaptive or non-adaptive grids, next nearest neighbors) and others specially designed for small samples [21,22].
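The plug-in estimator and its Miller-Madow correction mentioned above can be illustrated with a minimal sketch. It assumes equal-width bins, entropies in nats, and takes m as the number of occupied bins (a common variant); the function names and the `bin_width` parameter are hypothetical and not taken from the paper. The value returned is the entropy of the discretized variable, not a differential entropy.

```python
import math
from collections import Counter


def plugin_entropy(samples, bin_width):
    """Histogram plug-in entropy (nats) of the binned sample.

    Returns the entropy and the number of occupied bins.
    """
    n = len(samples)
    # Assign each sample to an equal-width bin indexed by floor(x / bin_width).
    counts = Counter(math.floor(x / bin_width) for x in samples)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy, len(counts)


def miller_madow_entropy(samples, bin_width):
    """Plug-in entropy plus the (m - 1) / (2N) Miller-Madow correction."""
    entropy, m = plugin_entropy(samples, bin_width)
    return entropy + (m - 1) / (2 * len(samples))
```

With four samples split evenly over two bins, the plug-in value is ln 2 and the correction adds (2 - 1)/(2 · 4) = 0.125, which shows how large the bias term can be relative to the entropy itself when N is small.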

#### 1.2. The Rationale of the Paper

… $H_o$ that ${H}_{ME,cr1}={H}_{ME,cr2}$, or that ${T}_{cr1},\;{T}_{cr2}$ are ME-congruent (see the definition in PP12 [12]), the difference ${H}_{ME,cr1,N}-{H}_{ME,cr2,N}$ works as a significance test of $H_o$. Those tests can be used: (1) for testing whether the MI is statistically significantly above zero, i.e., whether there is significant RV dependence, or (2) for testing the MI due to nonlinear correlations beyond the MI due to linear correlations. Another important case (verified here) is the test of the MI explained by joint non-Gaussianity beyond the MI explained by joint Gaussianity, in which a Gaussian morphism (i.e., a bijective, reversible transformation of a variable into another with a Gaussian pdf, without loss of generality) is applied to each single variable. According to the above result, the bias of ${H}_{ME,cr1,N}-{H}_{ME,cr2,N}$ under $H_o$ is $({m}_{cr2}-{m}_{cr1})/(2N)$, i.e., the number of cross constraints in the difference set ${T}_{cr2}/{T}_{cr1}$ divided by $2N$.
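As a rough illustration of how such a test could be applied, the sketch below evaluates the null bias, the null variance and an upper quantile of the asymptotic law $\frac{1}{2N}\chi^2_{m_{cr2}-m_{cr1}}$ stated in the abstract. The function names are hypothetical, and the quantile uses the Wilson-Hilferty approximation, which is an assumption of this sketch (it is only approximate, noticeably so for one or two degrees of freedom) and not the procedure of the paper.

```python
import math
from statistics import NormalDist


def chi2_quantile_wh(p, k):
    """Wilson-Hilferty approximation to the p-quantile of chi-squared, k dof."""
    z = NormalDist().inv_cdf(p)
    a = 2.0 / (9.0 * k)
    return k * (1.0 - a + z * math.sqrt(a)) ** 3


def delta_h_null_stats(n, m_cr1, m_cr2, alpha=0.05):
    """Bias, variance and (1 - alpha) threshold of delta-H under the null.

    Assumes delta-H ~ chi2_k / (2 n) with k = m_cr2 - m_cr1.
    """
    k = m_cr2 - m_cr1
    bias = k / (2.0 * n)            # mean of chi2_k is k
    variance = k / (2.0 * n ** 2)   # variance of chi2_k is 2k
    threshold = chi2_quantile_wh(1.0 - alpha, k) / (2.0 * n)
    return bias, variance, threshold


# Illustrative values only: 5 extra cross constraints, N = 400.
bias, variance, threshold = delta_h_null_stats(400, 1, 6)
```

An estimated difference above `threshold` would reject the null at level `alpha`. Where SciPy is available, `scipy.stats.chi2.ppf(1 - alpha, k)` gives the exact quantile and can replace the approximation.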

## 2. Minimum Mutual Information and Its Estimators

#### 2.1. Imposing Marginal PDFs

#### 2.2. Imposing Marginals through ME Constraints

#### 2.2.1. The Formalism

#### 2.2.2. A Theorem about the MinMI Covariance Matrix

**Theorem 1**:

#### 2.3. Gaussian and Non-Gaussian MI

#### 2.4. Estimators of the Minimum MI from Data and Their Errors

## 3. Errors of the Expectation’s Estimators

#### 3.1. Generic Properties

#### 3.2. The Effects of Morphisms and Bivariate Sampling

… ${N}^{-1}$-scaled expression for $\mathrm{var}({\theta}_{N,j})$, we will consider another type of deviations of ${T}_{j}$ consistent with (20).

**Theorem 2**:

#### 3.3. Errors of the Estimators of Polynomial Moments under Gaussian Distributions

…$^{k}$, $k=0,\ldots,11$. We have verified that the empirical variance $\mathrm{var}({E}_{N}(T))$ agrees very well with the theoretical value ${N}^{-1}{\mathrm{var}}_{N}(T|lms)$ for all N (not shown).
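The kind of Monte-Carlo check described above can be sketched for the simpler reference case of N iid trials, for which $\mathrm{var}({E}_{N}(T))={N}^{-1}\mathrm{var}(T)$. The sketch assumes independent standard Gaussian X and Y and the monomial T = XY, so that var(T) = 1; it does not impose marginals by morphisms, hence it does not reproduce the value $\mathrm{var}(T|lms)$. Names, seed and sample sizes are illustrative.

```python
import random
from statistics import fmean, pvariance


def mc_moment_variance(n, n_rea, seed=0):
    """Monte-Carlo variance of the sample mean of T = X*Y over n_rea samples.

    X and Y are independent standard Gaussians, drawn afresh in each
    realization of size n.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_rea):
        est = fmean(rng.gauss(0.0, 1.0) * rng.gauss(0.0, 1.0) for _ in range(n))
        estimates.append(est)
    return pvariance(estimates)


n = 200
# N times the Monte-Carlo variance should be close to var(XY) = 1.
scaled = n * mc_moment_variance(n, n_rea=2000)
```

Repeating the experiment with morphism-transformed samples in place of the raw draws would give the quantity compared against $\mathrm{var}(T|lms)$.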

**Figure 1.** Squared empirical bias ${\Vert b\Vert}^{2}$ (black lines) of N-based $T$-expectations as a function of N, empirical variances $\mathrm{var}({E}_{N}(T))$ (red lines), approximated variances ${N}^{-1}\mathrm{var}(T|lms)$ (blue lines) and variance for the case of N iid trials ${N}^{-1}\mathrm{var}(T)$ (green lines). $T$ stands for different bivariate monomials: ${X}^{4}{Y}^{2}$ (a), ${X}^{6}{Y}^{2}$ (b) and ${X}^{8}{Y}^{2}$ (c).

**Figure 2.** N times the Monte-Carlo variances $N\mathrm{var}({E}_{N}(T))$ (thick solid lines) and their theoretical analytical value $\mathrm{var}(T|lms)$ (thick dashed lines), both under imposed marginals (morphisms), and the analytical value $N\mathrm{var}({E}_{N}(T))=\mathrm{var}(T)$ for iid data (**thin solid lines**). $T$ stands for different bivariate monomials: $XY$ (**black curves**), ${X}^{2}Y$ (**red curves**). N = 200.
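One possible empirical realization of a Gaussian morphism, i.e., of imposing a standard Gaussian marginal on a sample, is a rank-based transformation. The sketch below is an assumption for illustration: the plotting position (rank + 0.5)/N and the function name are hypothetical choices, ties are not treated specially, and the paper may define the sample morphism differently.

```python
from statistics import NormalDist


def gaussian_morphism(values):
    """Map a sample to standard-Gaussian scores through its ranks.

    The k-th smallest value (k = 0..N-1) is sent to the Gaussian quantile
    at probability (k + 0.5) / N, so the ordering of the sample is kept.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    std_normal = NormalDist()
    scores = [0.0] * n
    for rank, i in enumerate(order):
        scores[i] = std_normal.inv_cdf((rank + 0.5) / n)
    return scores
```

Because every transformed sample takes the same set of N Gaussian quantiles, the marginal moments are fixed across realizations and only the cross moments fluctuate.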

#### 3.4. Statistical Modeling of Moment Estimation Errors

## 4. Modeling of MinMI Estimation Errors, Their Bias, Variance and Distribution

- The estimation of bias, variance, quantiles and distribution of estimators of the incremental MinMI ${I}_{j/p}$ issued from finite samples of N (iid) realizations of bivariate original variables $(\widehat{X},\widehat{Y})$ and then transformed into RVs $(X,Y)$.
- The distribution of estimators of ${I}_{j/p}$ under the null hypothesis $H_0$ that $(X,Y)$ follows the ME distribution constrained by a weaker constraint set $({T}_{p},{\theta}_{p})$ ($j>p$). These estimators work as a significance test for determining whether there is statistically significant MI beyond that explained by cross moments in $({T}_{p},{\theta}_{p})$.

#### 4.1. Bias, Variance, Quantiles and Distribution of MI Estimation Error

… ${N}^{-1}$-scaled, as generally deduced in [15]. Keeping the leading term of (34), and dealing with the trace, we get a given relative error ${r}_{I}=\mathrm{\Delta}{I}_{N,j}/{I}_{j}$ of the MinMI ${I}_{j/0}$ ($p=0$) when $N\ge E\left({({\lambda}_{cr,j}^{T}\,{T}_{cr,j}^{\prime})}^{2}\right)/{\left({I}_{j/0}\,{r}_{I}\right)}^{2}\approx O({m}_{cr,j})/{\left({I}_{j/0}\,{r}_{I}\right)}^{2}$. The term $O({m}_{cr,j})$ increases at a larger rate than ${I}_{j/0}$ as the boundary of the polytope of allowed expectations is approached.
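The sample-size condition above translates directly into a small helper. In this sketch `second_moment` is a hypothetical input standing for $E\left({({\lambda}_{cr,j}^{T}{T}_{cr,j}^{\prime})}^{2}\right)$, which would have to be evaluated from the ME solution; the function name and the example values are illustrative only.

```python
import math


def min_sample_size(second_moment, mi, rel_err):
    """Smallest integer N with N >= second_moment / (mi * rel_err)**2.

    second_moment: assumed value of E((lambda^T T')^2)
    mi:            MinMI value I_{j/0}
    rel_err:       target relative error r_I
    """
    return math.ceil(second_moment / (mi * rel_err) ** 2)
```

For instance, `min_sample_size(2.0, 0.5, 0.25)` returns 128; halving the target relative error multiplies the required N by four.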

#### 4.2. Significance Tests of MinMI Thresholds

… $H_0$, considering that the true PDF coincides with the ME-PDF constrained by $({T}_{p},{\theta}_{p})$. In particular, for $({T}_{p},{\theta}_{p})=({T}_{p=0},{\theta}_{p=0})=({T}_{ind},{\theta}_{ind})$, the null hypothesis states that $(X,Y)$ are statistically independent. Therefore, under $H_0$, the moment sets $({T}_{p},{\theta}_{p}),\,({T}_{j},{\theta}_{j})$ are ME-congruent and the moments of order $j\ge p$ remain well determined by expectations over the less restricted p-th ME-PDF, i.e., ${\theta}_{j}={E}_{{\rho}_{{T}_{p},{\theta}_{p}}^{*}}({T}_{j})\equiv {\theta}_{j\leftarrow p}$, where the subscript arrow $j\leftarrow p$ means that j-order statistics are obtained from the p-order ME-PDF. The same holds for the ME covariance matrices, i.e., ${C}_{*p}={C}_{p}$ and ${C}_{*j}={C}_{*j\leftarrow p}={C}_{j};\; j\ge p$. In these conditions, the matrix ${C}_{p}$ is simply a sub-matrix of ${C}_{j}$. The Lagrange multipliers are restricted to the p-order, i.e., ${\lambda}_{j}={\lambda}_{j\leftarrow p}=({\lambda}_{p},{\overrightarrow{0}}_{j/p});\; j\ge p$, where entries of higher order than p are set to zero, leading to ${v}_{j/p}=0$ in (9). Therefore, the incremental MinMI vanishes, i.e., $H({\theta}_{j})-H({\theta}_{p})={I}_{j/p}=0$, but the estimator ${I}_{N,j/p}$ is positive due to artificial MI generation stemming from sampling errors. Then, under $H_0$ and using (9), the MI estimation is provided by the following approximation:

… $H_0$; in other words, if ${I}_{N,j/p}$ is larger than an upper $1-\alpha$ quantile (e.g., $1-\alpha=95\%$) of $\delta {I}_{N,j/p}$, then $H_0$ is rejected with a significance level $\alpha$. Those quantiles determine the significant MI thresholds and can be computed empirically, as for the MinMI error (32), by a Monte-Carlo strategy. Another possibility is the fitting of the $\delta {I}_{N,j/p}$ distribution to a Gamma PDF with prescribed mean and variance (not done here). The bias and variance of $\delta {I}_{N,j/p}$ are straightforward, coming as:

… ${N}^{-2}$-scale for the variance is also present in other MI estimate errors under the hypothesis of variable independence [27]. Under Theorems 1 [11] and 2 [27], along with the null hypothesis, one gets ${C}_{N,cr,j|lms}\,{A}_{j\leftarrow p}={P}_{cr,j}-{P}_{cr,p}$, thus leading to a Chi-Squared distribution for $\delta {I}_{N,j/p}$:

… ${\chi}^{2}$ probability lookup tables. The bias and variance are, respectively:

#### 4.3. Significance Tests of the Gaussian and Non-Gaussian MI

… $N_b$ bins of an extended enough finite interval $[-L_i, L_i]$. In the corresponding experiments (and as in PP12), we have used the calibrated values $L_i=6$ and $N_b=80$. The algorithm used is explained in detail in Appendix 2 of PP12 [12], following an adapted bivariate version of that of [35]. The error $\delta H=\widehat{H}-H$ is of the order of round-off errors, only becoming comparable to the sampling ME errors at very high values of N.
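To give a feeling for why a grid with $L_i=6$ and $N_b=80$ can make the discretization error negligible, the following sketch integrates the entropy of a univariate standard Gaussian over $[-6, 6]$ with 80 bins and compares it with the exact value $\frac{1}{2}\ln(2\pi e)$. The midpoint rule and the one-dimensional setting are assumptions of this sketch; the bivariate algorithm actually used is the one described in PP12.

```python
import math


def gaussian_entropy_on_grid(half_width=6.0, n_bins=80):
    """Midpoint-rule estimate of -integral p ln p for a standard Gaussian.

    The integral is truncated to [-half_width, half_width] and evaluated
    at the centers of n_bins equal-width bins.
    """
    h = 2.0 * half_width / n_bins
    total = 0.0
    for k in range(n_bins):
        x = -half_width + (k + 0.5) * h
        p = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        total -= p * math.log(p) * h
    return total


exact = 0.5 * math.log(2.0 * math.pi * math.e)
grid_error = gaussian_entropy_on_grid() - exact
```

In this toy setting the grid error is far below the $O({N}^{-1})$ sampling errors for any realistic N, in line with the remark that $\delta H$ only matters at very high N.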

#### 4.3.1. Error and Significance Tests of the Gaussian MI

…$^{th}$ entry (row and column) of ${T}_{2}$, corresponding to the unique cross moment XY. The necessary 5×5 covariance matrix is ${C}_{*,2}={E}_{{\rho}_{T2,\theta 2}^{*}}[{T}_{2}{T}_{2}^{T}]-{\theta}_{2}{\theta}_{2}^{T}$, where the E operator is the expectation over the bivariate Gaussian ${\rho}_{T2,\theta 2}^{*}$. Then, we apply (9) for $j=2$, $p=0$, where $\mathrm{\Delta}{\theta}_{N,j}={(0,0,0,0,\mathrm{\Delta}{c}_{g,N})}^{T}$. The Gaussian MI error is written in different forms as:

#### 4.3.2. Error and Significance Tests of the Non-Gaussian MI

…$^{nd}$ order ME solutions.

… $N_{rea}$ (e.g., 5000) realizations, we compute the moments and solve the ME problem, gathering statistics afterwards. Alternatively, ME errors can be computed from the Taylor expansion (9) using the moment deviations over the ensemble.

… $H_0$ that the true PDF is bivariate Gaussian and is written as a particular case of (35). However, a simplification of the statistical test formula can be achieved by considering a null Gaussian correlation. This holds thanks to the non-Gaussian MI invariance under variable rotations (see PP12), in particular for uncorrelated standardized variables ${({X}_{r},{Y}_{r})}^{T}=A{(X,Y)}^{T}$, where A is the rotation matrix (e.g., ${X}_{r}=X,\;{Y}_{r}=(Y-{c}_{g}X){(1-{c}_{g}^{2})}^{-1/2}$, i.e., the residual of the linear prediction). Under $H_0$, the rotated variables are still bivariate Gaussian and therefore the non-Gaussianity significance test $\delta {I}_{N,ng,j}$ has the same distribution as that for ${c}_{g}=0$. The matrices ${C}_{N,cr,j|lms}$ and ${A}_{j\leftarrow 2}$ entering in Equation (35) are now evaluated for Gaussian isotropic conditions. For the sake of clarity, we represent them respectively by ${C}_{g,N,cr,j|lms}$ and ${A}_{g,j\leftarrow 2}={P}_{j}{({C}_{g,j})}^{-1}{P}_{j}-{P}_{2}{({C}_{g,2})}^{-1}{P}_{2}$, where the subscript g stands for evaluation at ${(X,Y)}^{T}\sim N(\overrightarrow{0},I)$. For high N, ${C}_{g,N,cr,j|lms}={C}_{g,j}$, i.e., the covariance matrix of cross j-th order moments for the isotropic Gaussian. Then we write:

#### 4.4. Validation of Significance Tests by Monte-Carlo Experiments

… ${2}^{1}\times 25,\ldots,{2}^{11}\times 25=51200$. Then, we have computed the 5,000 realizations for the independence test $\delta {I}_{N,g}$ as well as for the non-Gaussianity tests $\delta {I}_{N,ng,j}$ for j = 4, 6, 8. In order to minimize errors of type $\delta H$ (8) from the ME functional, we have retained only those Monte-Carlo realizations whose ME-PDF moments are within a relative square error of ${10}^{-5}$.
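A stripped-down version of such a Monte-Carlo validation, restricted to the independence test, can be sketched as follows. It assumes the standard closed form of the Gaussian MI, $-\frac{1}{2}\ln(1-c^{2})$, evaluated at the sample Pearson correlation of raw independent Gaussian draws (no morphism and no ME solver), and compares $2N\delta {I}_{N,g}$ with a ${\chi}_{1}^{2}$ variable, whose mean is 1 and whose 95% quantile is about 3.84. Names, seed and the number of realizations are illustrative.

```python
import math
import random
from statistics import fmean


def sample_correlation(xs, ys):
    """Pearson correlation of two equally sized samples."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def gaussian_mi(c):
    """MI (nats) of a bivariate Gaussian with correlation c."""
    return -0.5 * math.log(1.0 - c * c)


def mc_independence_test(n, n_rea, seed=0):
    """Mean and empirical 95% quantile of 2*n*MI under independence."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_rea):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ys = [rng.gauss(0.0, 1.0) for _ in range(n)]
        stats.append(2.0 * n * gaussian_mi(sample_correlation(xs, ys)))
    stats.sort()
    return fmean(stats), stats[int(0.95 * n_rea)]


mean_stat, q95 = mc_independence_test(n=50, n_rea=2000)
```

At N = 50 the empirical mean and quantile are expected to sit slightly above the asymptotic ${\chi}_{1}^{2}$ values because of higher-order terms in ${N}^{-1}$; the discrepancy should shrink as N grows.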

**Figure 3.** Test statistics: bias (black lines), standard deviation (red lines) and 95%-quantiles (green lines), provided by the Monte-Carlo approach (thick full lines), the semi-analytical approach (thin dashed lines) and the analytical approach (thick full lines). The tests are $\delta {I}_{N,g}$ (a); $\delta {I}_{N,ng,j=4}$ (b); $\delta {I}_{N,ng,j=6}$ (c) and $\delta {I}_{N,ng,j=8}$ (d).

**Figure 4.** Monte-Carlo empirical cumulative histogram (solid lines) and theoretical cumulative Chi-Squared fit (dashed lines) normalized by N: $2N\delta {I}_{N,g}$ (${\chi}_{1}^{2}$) for $N=50$ (**black curves**); $2N\delta {I}_{N,ng,j=4}$ (${\chi}_{5}^{2}$) for $N=400$ (**red curves**); $2N\delta {I}_{N,ng,6}$ (${\chi}_{14}^{2}$) f