Entropy 2013, 15(3), 721-752; doi:10.3390/e15030721

Article
Minimum Mutual Information and Non-Gaussianity through the Maximum Entropy Method: Estimation from Finite Samples
1 Instituto Dom Luiz (IDL), University of Lisbon (UL), Lisbon, P-1749-016, Portugal
2 Institute of Hydraulic Engineering and Water Resources Management, Vienna University of Technology, Vienna, A-1040, Austria
* Author to whom correspondence should be addressed.
Received: 8 November 2012; in revised form: 15 February 2013 / Accepted: 19 February 2013 / Published: 25 February 2013

Abstract

The Minimum Mutual Information (MinMI) principle provides the least committed, maximum-joint-entropy (ME) inferential law that is compatible with prescribed marginal distributions and empirical cross constraints. Here, we estimate MI bounds (the MinMI values) generated by constraining sets $T_{cr}$ comprising $m_{cr}$ linear and/or nonlinear joint expectations, computed from samples of $N$ iid outcomes. Marginals (and their entropies) are imposed by single morphisms of the original random variables. $N$-asymptotic formulas are given for the distribution of cross-expectation estimation errors, as well as for the MinMI estimation bias, its variance and its distribution. A growing $T_{cr}$ leads to an increasing MinMI, converging eventually to the total MI. Under $N$-sized samples, the MinMI increment relative to two encapsulated sets $T_{cr1} \subseteq T_{cr2}$ (with numbers of constraints $m_{cr1} < m_{cr2}$) is the test difference $\delta H = H_{\max 1,N} - H_{\max 2,N} \ge 0$ between the two respective estimated MEs. Asymptotically, $\delta H$ follows a scaled Chi-Squared distribution $\frac{1}{2N}\chi^2_{(m_{cr2}-m_{cr1})}$, whose upper quantiles determine whether the constraints in $T_{cr2}/T_{cr1}$ explain significant extra MI. As an example, we set the marginals to be normally distributed (Gaussian) and build a sequence of MI bounds associated with successive nonlinear correlations due to joint non-Gaussianity. Noting that in real-world situations the available sample sizes can be rather small, the relationship between MinMI bias, probability density over-fitting and outliers is highlighted for under-sampled data.
Keywords:
mutual information; non-Gaussianity; maximum entropy distributions; entropy bias; mutual information distribution; morphism
MSC2000 Codes:
62B10; 94A17

1. Introduction

1.1. The State of the Art

The seminal work of Shannon on Information Theory [1] gave rise to the concept of Mutual Information (MI) [2] as a measure of probabilistic dependence among random variables (RVs), with a broad range of applications, including neuroscience [3], communications and engineering [4], physics, statistics, economics [5], genetics [6], linguistics [7] and geosciences [8]. MI is the positive difference between two Shannon entropies of the RVs: the one assuming statistical independence $(H_{ind})$ and the other $(H_{dep})$ considering their true dependence.
This paper addresses the problem of estimating the MI conveyed by the least committed inferential law (say, the conditional probability density function (pdf) $\rho(Y|X)$ between RVs $Y, X$) which is compatible with prescribed marginal distributions and a set $T_{cr}$ of $m_{cr}$ empirical non-redundant cross constraints (e.g., a set of cross expectations between a stimulus X and a response Y, for example in a neural cell, the Earth's climate or an ecosystem). The constrained MI, or Minimum Mutual Information (MinMI), between the RVs $Y, X$ is $I_{\min}(X,Y) = H(X) + H(Y) - H_{\max}(X,Y) = H(Y) - H_{\max}(Y|X)$, obtained by subtracting from the sum of the fixed marginal entropies the maximum joint entropy (ME) $H_{\max}$ compatible with the imposed cross constraints. The solution comes from the application of the MinMI principle [9,10]. The MinMI is a MI lower bound depending on the marginal pdfs (e.g., Gaussian, Uniform, Gamma), as well as on the particular form of the cross expectations in $T_{cr}$ (e.g., linear and nonlinear correlations). Closed formulas for the MinMI are known only in a few cases with $m_{cr} = 1$: (a) Gaussian marginals and Pearson linear correlation [8,11,12] and (b) Uniform marginals and rank linear correlation [11]. In [12] (PP12 hereafter), the authors presented a general formalism for computing, though not in explicit form, the MinMI in terms of multiple ($m_{cr} > 1$) linear and nonlinear cross expectations included in $T_{cr}$. This set can consist of a natural population constraint (e.g., a specific neural behavior), or it can grow without limit through additional expectations computed within a sample, with the MinMI increasing and converging eventually to the total MI.
This paper is the natural follow-up of PP12 [12], studying now the statistics (mean or bias, variance and distribution) of the MinMI estimation errors: $\Delta I_{\min,N} = -\Delta H_{\max,N} \equiv -(H_{\max,N} - H_{\max})$, where $H_{\max,N}$ is the ME estimate issued from $N$-sized samples of iid outcomes. Those errors are roughly similar to the errors of generic MI and entropy estimators (see [13,14] for a thorough review and performance comparisons between MI estimators). Their mean (bias), variance and higher-order moments are written in terms of powers of $N^{-1}$, thus covering intermediate and asymptotic $N$ ranges [15], with specific applications in neurophysiology [16,17,18]. Entropy estimators range from the histogram-based plug-in estimator [19], with a negative bias corrected by the Miller-Madow term [20] equal to $-(m-1)/(2N)$, where $m$ is the number of univariate histogram bins, to much-improved estimators (e.g., kernel density estimators, adaptive or non-adaptive grids, nearest neighbors) and others specially designed for small samples [21,22].

1.2. The Rationale of the Paper

The well-posedness of a MinMI $I_{\min}(X,Y)$ compatible with available cross information requires knowledge of the marginal X and Y PDFs, $\rho_X$ and $\rho_Y$, either imposed or inferred from sufficiently long samples. For that purpose, we can change X and Y into the cumulated probabilities $u(x) = \int^{x} \rho_X(t)\,dt;\ v(y) = \int^{y} \rho_Y(t)\,dt$, which are uniform RVs on the interval [0,1] (i.e., copulas [23]), through appropriate smoothly growing (injective) morphisms (or anamorphoses), while leaving the MI invariant [2]. Then, the MI $I(X,Y)$ becomes the negative copula entropy [24,25], given by $I(X,Y) = \int_0^1 \int_0^1 c[u,v] \log(c[u,v])\,du\,dv$, where the copula density is $c[u,v] = \rho_{XY}(x,y)/[\rho_X(x)\rho_Y(y)]$.
The MinMI, subject to $m_{cr}$ constraints of the type $E[T_i(u,v)] = \theta_i;\ i = 1,\dots,m_{cr}$ in copula space, is readily obtained by variational analysis (as in the ME method [2]) as $c[u,v] = \exp[-1 + \lambda_u(u) + \lambda_v(v) + \sum_{i=1}^{m_{cr}} \lambda_i T_i(u,v)]$, where the Lagrange multipliers $\lambda_u(u), \lambda_v(v), \lambda_i$ correspond, respectively, to the preset (not subject to sampling) continuum of constraints $\int c[u,v]\,du = \int c[u,v]\,dv = 1$ and to the $m_{cr}$ expectations (subject to sampling error). The general solution is rather tricky, since all the values $\lambda_u(u), \lambda_v(v), \lambda_i$ are implicitly related. The constrained joint PDF and the inferential law are recovered from the constrained copula through the product $\rho_{XY}(x,y) = c[u,v]\,\rho_X(x)\,\rho_Y(y)$.
In PP12 [12], we generalized this problem to a less constrained MinMI version by changing the marginal RVs into ME-prescribed ones (the ME-morphisms, e.g., standard Gaussians) and imposing a finite set of marginal constraints instead of the full marginal PDFs. Under these conditions, the number of control Lagrange multipliers is finite, leaving open the possibility of using nonlinear minimization algorithms for the MinMI estimation, as already tested in [8]. The MinMI subject to a set $T_{cr}$ of $m_{cr}$ cross constraints is thus given by $H_{ind} - H_{ME,cr}$, where $H_{ME,cr}$ is the joint ME and $H_{ind}$ is the sum of the fixed (preset) single entropies. The MinMI estimator is written as $H_{ind} - H_{ME,cr,N}$, where $H_{ME,cr,N}$ is the ME constrained by the $m_{cr}$ sampling expectations obtained from $N$-sized samples. The MinMI estimation error is $H_{ME,cr} - H_{ME,cr,N}$. Therefore, as a generalization of the ME estimator bias [26], one verifies a positive MinMI bias equal to (larger/smaller than) $m_{cr}/(2N)$ when the true population PDF from which the tested sample is drawn follows (is more leptokurtic/platykurtic than) the ME-PDF. This result is supported by Monte-Carlo experiments.
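A quick Monte-Carlo sketch of the $m_{cr}/(2N)$ bias in the Gaussian case with $m_{cr} = 1$ (our simplified illustration, using the plain sample correlation rather than the full morphism machinery; all parameters are arbitrary):

```python
import math
import random

# Gaussian-marginal MinMI with one cross constraint (the correlation):
# the estimator -0.5*log(1 - c_hat^2) should be biased upward by roughly
# m_cr/(2N) = 1/(2N) when the population is the ME-PDF (bivariate Gaussian).
random.seed(42)
N, c, trials = 100, 0.5, 4000
true_minmi = -0.5 * math.log(1.0 - c * c)
mean_error = 0.0
for _ in range(trials):
    x = [random.gauss(0.0, 1.0) for _ in range(N)]
    y = [c * xi + math.sqrt(1.0 - c * c) * random.gauss(0.0, 1.0) for xi in x]
    mx, my = sum(x) / N, sum(y) / N
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    c_hat = sxy / math.sqrt(sxx * syy)
    mean_error += -0.5 * math.log(1.0 - c_hat * c_hat) - true_minmi
mean_error /= trials   # should be close to 1/(2N) = 0.005 here
```

The averaged error is small but systematically positive, of the order of $1/(2N)$.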
Moreover, we introduce here the positive incremental MinMI, given by the difference $H_{ME,cr1} - H_{ME,cr2}$ between two MEs forced by cross constraint sets $T_{cr1} \subseteq T_{cr2}$, which is interpreted as the MinMI coming from the difference set $T_{cr2}/T_{cr1}$. The corresponding estimator is $H_{ME,cr1,N} - H_{ME,cr2,N}$. Both the MinMI and the incremental MinMI estimators depend basically on the errors of the expectations estimated from finite $N$-sized samples.
In particular, under the null hypothesis $H_0$ that $H_{ME,cr1} = H_{ME,cr2}$, i.e., that $T_{cr1}, T_{cr2}$ are ME-congruent (see the definition in PP12 [12]), the difference $H_{ME,cr1,N} - H_{ME,cr2,N}$ works as a significance test of $H_0$. Such tests can be used: (1) for testing statistically significant MI above zero, i.e., significant RV dependence, or (2) for testing MI due to nonlinear correlations beyond the MI due to linear correlations. Another important case (verified here) is the test of MI explained by joint non-Gaussianity beyond the MI explained by joint Gaussianity, in which a Gaussian morphism (i.e., a bijective, reversible transformation of a variable into another with a Gaussian pdf, without loss of generality) is applied to the single variables. According to the above result, the bias of $H_{ME,cr1,N} - H_{ME,cr2,N}$ under $H_0$ is $(m_{cr2} - m_{cr1})/(2N)$, i.e., the number of cross constraints in the difference set $T_{cr2}/T_{cr1}$ divided by $2N$.
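Under $H_0$, the statistic $2N\,\delta H$ can be compared with upper Chi-Squared quantiles. A Monte-Carlo sketch for the simplest case, one extra constraint and independence as the null (our illustration, arbitrary parameters; the plain sample correlation is used for simplicity):

```python
import math
import random

# H0: X and Y independent; the extra constraint is the correlation (1 dof).
# Then 2N * deltaH = -N*log(1 - c_hat^2) is asymptotically chi-squared(1),
# so thresholding at its upper 5% quantile should reject ~5% of the time.
random.seed(2)
N, trials = 200, 4000
crit = 3.841   # upper 5% quantile of chi-squared with 1 dof
rejections = 0
for _ in range(trials):
    x = [random.gauss(0.0, 1.0) for _ in range(N)]
    y = [random.gauss(0.0, 1.0) for _ in range(N)]
    mx, my = sum(x) / N, sum(y) / N
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    c_hat = sxy / math.sqrt(sxx * syy)
    delta_h = -0.5 * math.log(1.0 - c_hat * c_hat)   # estimated MinMI increment
    if 2 * N * delta_h > crit:
        rejections += 1
rate = rejections / trials   # empirical size of the test, nominally 0.05
```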
We further provide asymptotic analytical $N$-scaled formulas for the variance and distribution of the MinMI estimation errors as functions of the statistics of the ME cross-constraint estimation errors. This is possible for $N$ high enough that the expectation errors are closely governed by a multivariate Gaussian distribution, uniquely determined by their bias and covariance matrix, thanks to the multivariate Central Limit Theorem. Since marginal morphisms are performed, the single variables are set to values from a look-up table of fixed quantiles (not subject to sampling) and, therefore, the estimator's squared bias decreases faster than the estimator's variance as $N \to \infty$.
The correct modeling of the covariances between sampling expectation errors under morphism is crucial for the correct computation of the MinMI error statistics. We have verified an overall reduction of the cross-expectation errors when compared to the case where they are issued from iid realizations (no morphism performed). For instance, the variance, denoted $\mathrm{var}(E_N(T))$, of the $N$-sized sampling mean $E_N(T)$ of a cross function $T(X,Y)$ is given by $N^{-1}\mathrm{var}_N(T^*)$, where $T^*$ is the residual of the best linear fit of $T$ using the conditional means $E(T|X), E(T|Y)$ as predictors. Asymptotically, $\mathrm{var}_N(T^*) \to \mathrm{var}(T^*)$, which is the variance of $T$ conditioned on the knowledge of the marginal PDFs, computed at the joint PDF of the population. These conditional variances are exactly those coming from the MinMI solution, allowing us to relate MinMI statistics with asymptotic no-replacement finite statistics under fixed marginals. The results are synthesized in the form of two theorems.
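This variance reduction can be checked numerically for $T = XY$ with Gaussian marginals, where $E(T|X) = c_g X^2$ and $E(T|Y) = c_g Y^2$ (a sketch under these assumptions; the residual variance $(1-c_g^2)^2/(1+c_g^2)$ matches the worked Gaussian example of Section 2.2.2):

```python
import math
import random

random.seed(3)
c = 0.5
n = 100_000
x = [random.gauss(0.0, 1.0) for _ in range(n)]
y = [c * xi + math.sqrt(1.0 - c * c) * random.gauss(0.0, 1.0) for xi in x]
t  = [xi * yi for xi, yi in zip(x, y)]   # cross function T = XY
p1 = [c * xi * xi for xi in x]           # conditional mean E(T | X) = c X^2
p2 = [c * yi * yi for yi in y]           # conditional mean E(T | Y) = c Y^2

def cov(u, v):
    mu, mv = sum(u) / n, sum(v) / n
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / n

# best linear fit of T on the two conditional means: 2x2 normal equations
a11, a12, a22 = cov(p1, p1), cov(p1, p2), cov(p2, p2)
b1, b2 = cov(t, p1), cov(t, p2)
det = a11 * a22 - a12 * a12
alpha1 = (b1 * a22 - b2 * a12) / det
alpha2 = (b2 * a11 - b1 * a12) / det
var_t = cov(t, t)                                 # -> 1 + c^2 = 1.25
var_resid = var_t - alpha1 * b1 - alpha2 * b2     # -> (1-c^2)^2/(1+c^2) = 0.45
```

The residual variance is well below the raw variance of $T$, which is the reduction inherited by $\mathrm{var}(E_N(T))$ under morphism.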
Regarding the conversion of expectation errors into ME and MinMI errors, we use a perturbative approach: a second-order Taylor expansion of the ME. This allows closed analytical formulas to be obtained for the MinMI variance and its distribution in a few cases (e.g., Chi-Squared distributions), in what we hereafter call the analytical approach. In order to confirm those formulas, expectation errors are generated as surrogates of the governing multivariate Gaussian PDF; they are then plugged into the Taylor expansion of the MinMI and, finally, statistics (bias, variances, quantiles) are estimated from a large ensemble (the semi-analytical approach). These statistics are compared with those obtained from a Monte-Carlo experiment where the MinMI is computed ab initio from the sampling expectations (the Monte-Carlo approach). The closeness of the results between the Monte-Carlo, semi-analytical and analytical approaches is assessed using several statistical tests of bivariate non-Gaussianity and RV independence. A similar exhaustive validation has already been performed for testing analytical formulas of the bias, variance, skewness and kurtosis of MI estimation errors [27].
In accordance with the above synthesis, the paper starts with this introduction, followed by the formulation of the MinMI and its estimators in Section 2. In Section 3, we present the modeling of the sample-mean errors that constrain entropy and the effect of morphisms on their statistics. Section 4 is devoted to the modeling of the errors of the MinMI, the incremental MinMI and the significance tests, followed by a practical case of MI estimation with under-sampled data (Section 5) and the discussion with conclusions in Section 6. An appendix with some proofs is also provided.

2. Minimum Mutual Information and Its Estimators

2.1. Imposing Marginal PDFs

Let us formulate the problem of finding the Minimum Mutual Information (MinMI) in the simplest framework of bivariate RVs $(X,Y)$ over the Cartesian product of support sets $S = S_X \otimes S_Y \subseteq \mathbb{R}^2$. The MinMI is constrained by the imposition of the marginal PDFs $\rho_X, \rho_Y$ and a set of cross expectations $\{T_{cr}, \theta_{cr} \equiv E(T_{cr})\}$, where $T_{cr}$ is a vector comprising $m_{cr}$ cross $X,Y$ functions and $\theta_{cr}$ is the vector of their expectations. In the space of imposed marginal PDFs, the MinMI comes uniquely as a function of $\theta_{cr}$: $I(\theta_{cr}, \rho_X, \rho_Y) = H_{\rho_X} + H_{\rho_Y} - H^*_{\rho_{X,Y}}(\theta_{cr}, \rho_X, \rho_Y)$, where $H_{\rho_X} = E[-\log(\rho_X)]$ and $H_{\rho_Y} = E[-\log(\rho_Y)]$ are the preset Shannon entropies of $X$ and $Y$, respectively, and $H^*_{\rho_{X,Y}}(\theta_{cr}, \rho_X, \rho_Y)$ is the ME subject to the joint constraints and marginal PDFs, with ME-PDF $\rho^*_{X,Y}$. That leads to the equivalence between the computations of the MinMI and of the ME [9]. In particular, if $\rho_X, \rho_Y$ are copula marginals (uniform PDFs in [0,1]), then $H_{\rho_X} = H_{\rho_Y} = 0$ and the MinMI is minus the copula entropy [24,25]. For instance, for standard Gaussians $X, Y$ and a given correlation $E(T_{cr} \equiv XY) = c_g$, the MinMI is $I(c_g) = -\frac{1}{2}\log(1 - c_g^2)$. Obviously, the more cross constraints are imposed, the larger the MinMI will be.
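The Gaussian-correlation case is the one closed form used repeatedly below; as a small sketch (our illustration):

```python
import math

def gaussian_minmi(c_g):
    """MinMI for standard Gaussian marginals constrained by E[XY] = c_g:
    I(c_g) = -0.5 * log(1 - c_g^2), in nats."""
    if not -1.0 < c_g < 1.0:
        raise ValueError("correlation must lie in (-1, 1)")
    return -0.5 * math.log(1.0 - c_g * c_g)
```

It vanishes at $c_g = 0$, grows monotonically with $|c_g|$ and diverges as $|c_g| \to 1$.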
The general solution is obtained through variational analysis, rather similar to that for the ME [28] but with a continuity of constraints (the marginal PDFs) and a finite set of expectations:
$I(\theta_{cr},\rho_X,\rho_Y) = H_{\rho_X} + H_{\rho_Y} - H^*_{\rho_{X,Y}}(\theta_{cr},\rho_X,\rho_Y);\qquad H^*_{\rho_{X,Y}}(\theta_{cr},\rho_X,\rho_Y) = L(\lambda_{cr})$
$\lambda_{cr} = \arg\min_{\eta_{cr}} \left[ L(\eta_{cr}) \equiv 1 + \int_{S_X} \log Z_X(x,\eta_{cr})\,\rho_X(x)\,dx + \int_{S_Y} \log Z_Y(y,\eta_{cr})\,\rho_Y(y)\,dy - \eta_{cr}^T \theta_{cr} \right]$ (1)
The MinMI-PDF $ρ X , Y * ( X , Y )$ and the partition functions $Z X , Z Y$ are
$\rho^*_{X,Y}(X,Y) = [Z_X(X,\lambda_{cr})\, Z_Y(Y,\lambda_{cr})]^{-1} \exp[-1 + \lambda_{cr}^T T_{cr}(X,Y)];$
$Z_X(X,\lambda_{cr}) \equiv \frac{1}{\rho_X(X)} \int_{S_Y} \exp[-1 + \lambda_{cr}^T T_{cr}(X,y)]\, Z_Y(y,\lambda_{cr})^{-1}\, dy;$
$Z_Y(Y,\lambda_{cr}) \equiv \frac{1}{\rho_Y(Y)} \int_{S_X} \exp[-1 + \lambda_{cr}^T T_{cr}(x,Y)]\, Z_X(x,\lambda_{cr})^{-1}\, dx$ (2)
The superscript $T$ stands for transpose, such that $\lambda_{cr}^T T_{cr}$ is the canonical inner product between the vectors $\lambda_{cr}$ and $T_{cr}$. The proof is given in Appendix 1. Any PDF $\rho_{XY}(X,Y)$ is a MinMI-PDF corresponding to the single constraint $T_{cr}(X,Y) = 1 + \log[\rho_{XY}(X,Y)/[\rho_X(X)\rho_Y(Y)]]$, leading to $\lambda = 1$, $Z_X(X,\lambda) = \rho_X(X)^{-1}$ and $Z_Y(Y,\lambda) = \rho_Y(Y)^{-1}$.
The minimization of $L(\eta)$ in (1) calls for an iterative strategy, as in [11], with successive adjustments of the implicitly linked partition functions.
The present paper deals with small changes of $I(\theta_{cr}, \rho_X, \rho_Y)$ coming from the estimation errors $\Delta\theta_{cr}$ of the cross expectations evaluated from finite samples. For the purpose of inferring the consequent MinMI error statistics (bias, variance, distribution), we will use the second-order Taylor expansion of $I(\theta_{cr}, \rho_X, \rho_Y)$ in terms of the variation $\Delta\theta_{cr}$:
$\Delta I(\theta_{cr},\rho_X,\rho_Y) \equiv I(\theta_{cr}+\Delta\theta_{cr},\rho_X,\rho_Y) - I(\theta_{cr},\rho_X,\rho_Y) = -\Delta H^*_{\rho_{X,Y}}(\theta_{cr},\rho_X,\rho_Y) = \lambda_{cr}^T \Delta\theta_{cr} + \tfrac{1}{2}\,\Delta\theta_{cr}^T\, C^{-1}_{cr,\rho_X,\rho_Y}\, \Delta\theta_{cr} + O(\|\Delta\theta_{cr}\|^3)$ (3)
where $C^{-1}_{cr,\rho_X,\rho_Y}$ is the inverse of the covariance matrix of the vector of constraining functions $T_{cr}$, conditioned on the knowledge of the marginal PDFs and evaluated at the MinMI-PDF $\rho^*_{X,Y}$, i.e.,
$C_{cr,\rho_X,\rho_Y} = E_{\rho^*_{X,Y}}\big[T^*_{cr} T^{*T}_{cr} \,\big|\, \rho_X,\rho_Y\big] = E_{\rho^*_{X,Y}}\big[T^*_{cr} T^{*T}_{cr} \,\big|\, E(T|X), E(T|Y)\big]$ (4)
where $E_{\rho^*_{X,Y}}$ is the expectation at $\rho^*_{X,Y}$. The perturbation $T^* = T_{cr} - E_{\rho^*_{X,Y}}(T_{cr}|\rho_X,\rho_Y)$ is the residual with respect to the conditional mean, obtained by methods of variational and functional analysis as the best linear fit:
$E_{\rho^*_{X,Y}}(T_{cr}|\rho_X,\rho_Y) = \theta_{cr} + \alpha_X\big[E_{\rho^*_{X,Y}}(T_{cr}|X) - \theta_{cr}\big] + \alpha_Y\big[E_{\rho^*_{X,Y}}(T_{cr}|Y) - \theta_{cr}\big]$ (5)
where $\alpha_X, \alpha_Y$ are vectors of coefficients minimizing the mean square deviations to each component of $T_{cr}$, using the X and Y conditional means of $T_{cr}$ as predictors. The proof is given in Appendix 1 as part of the proof of Theorem 1, presented in Section 2.2.

2.2.1. The Formalism

In PP12 [12], we addressed the MinMI problem (1,2) by considering that $\rho_X, \rho_Y$ are themselves ME-PDFs, forced by a finite set of marginal, independent constraints, $\{T_{ind} \equiv (T_X(X), T_Y(Y)),\ \theta_{ind} \equiv E(T_{ind}) \equiv (\theta_X, \theta_Y)\}$. For that purpose, we solve the ME problem [29] by imposing the constraint set $\{T, \theta\} = \{(T_{ind}, T_{cr}), (\theta_{ind}, \theta_{cr})\}$, thus leading to a weaker (i.e., smaller) MinMI solution than that obtained with the full imposition of the marginal PDFs. That is given by $I(\theta_{cr}, \theta_{ind}) = H(\theta_{ind}) - H(\theta) \le I(\theta_{cr}, \rho_X, \rho_Y)$, where $H(\theta)$ is the ME issued from the finite set of constraints (marginal and cross) and $H(\theta_{ind}) \equiv H_X + H_Y$ is the ME corresponding uniquely to the marginal constraints [30]. In particular, if the support sets are $S_X = S_Y = [0,1]$ and $\{T_{ind}, \theta_{ind}\} = \varnothing$ (no constraints on marginals), then the joint PDF of $(X,Y)$ is a copula [24], since its marginal PDFs are uniform in [0,1]. The cross part $T_{cr}$ includes only cross functions that are not redundantly expressed as sums of marginal functions in $T_{ind}$.
In practice, one can impose the marginal PDFs on a priori RVs $(\hat X, \hat Y)$ (data variables) through the ME-morphisms $(X = X(\hat X), Y = Y(\hat Y))$ (Equation 6 of PP12) (e.g., standard Gaussians), which are monotonically growing smooth homeomorphisms linking the data to the transformed $(X,Y)$ variables. Then, thanks to the invariance of MI under such morphisms [2], one can consistently define the MinMI between $(\hat X, \hat Y)$ as that obtained with $(X, Y)$.
The joint ME-PDF is written in terms of a vector $\lambda$ of Lagrange multipliers [28] as $\rho^*_{T,\theta}(X,Y) = Z(\lambda,T)^{-1}\exp[\lambda^T T(X,Y)]$, where $Z(\lambda,T) \equiv \iint_S \exp(\lambda^T T)\,dx\,dy$ is the partition function. The ME functional is $H(\theta) = \min_\eta(\log Z(\eta,T) - \theta^T\eta) = \log Z(\lambda,T) - \theta^T\lambda$, whose input is the vector $\theta$. The marginal PDFs are supposed to be the ME-PDFs $\rho^*_{T_X,\theta_X}(X);\ \rho^*_{T_Y,\theta_Y}(Y)$, verifying the marginal X and Y constraints, respectively, since the variables were built accordingly by ME-morphisms.
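The dual form $H(\theta) = \min_\eta(\log Z(\eta,T) - \theta^T\eta)$ is convex in $\eta$, with gradient $E_\eta[T] - \theta$, so a plain gradient descent suffices. A 1-D numerical sketch (our illustration; grid, step sizes and iteration count are arbitrary choices) that recovers the standard Gaussian, $\lambda = (0, -\tfrac{1}{2})$:

```python
import math

# 1-D sketch of the dual ME problem H(theta) = min_eta [log Z(eta) - theta^T eta].
# With T(x) = (x, x^2) and theta = (0, 1), the minimizer is the standard
# Gaussian, i.e., lambda = (0, -1/2), since rho*(x) ~ exp(l1*x + l2*x^2).
GRID = 2000
xs = [-8.0 + 16.0 * k / GRID for k in range(GRID + 1)]   # integration grid
dx = 16.0 / GRID
theta = (0.0, 1.0)

def moments(l1, l2):
    """Return log Z and E[x], E[x^2] under the density ~ exp(l1*x + l2*x^2)."""
    w = [math.exp(l1 * x + l2 * x * x) for x in xs]
    z = sum(w) * dx
    m1 = sum(wi * x for wi, x in zip(w, xs)) * dx / z
    m2 = sum(wi * x * x for wi, x in zip(w, xs)) * dx / z
    return math.log(z), m1, m2

l1, l2 = 0.0, -1.0                      # any l2 < 0 gives an integrable start
for _ in range(400):
    _, m1, m2 = moments(l1, l2)
    # the dual gradient is E_eta[T] - theta; take a plain descent step
    l1 -= 0.5 * (m1 - theta[0])
    l2 -= 0.2 * (m2 - theta[1])

log_z, m1, m2 = moments(l1, l2)
me_value = log_z - (theta[0] * l1 + theta[1] * l2)   # H = log Z - theta^T lambda
```

At convergence, `me_value` approaches $\frac{1}{2}\log(2\pi e)$, the entropy of $N(0,1)$.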
As more cross constraints are added to $\{T_{cr}, \theta_{cr}\}$, the MinMI $I(\theta_{cr}, \theta_{ind})$ increases, converging to the full MI $I(X,Y)$. Let us formalize that by supposing that the true joint PDF belongs to the ME-family characterized by an information moment superset $\{T_\infty, \theta_\infty\} \supseteq \{T, \theta\}$.
The true joint PDF is given by $\rho^*_{T_\infty,\theta_\infty}$, with Shannon entropy given by the ME $H(\theta_\infty)$. The encapsulated moment sets obey $\theta_{ind} \subseteq \theta \subseteq \theta_\infty$. Therefore, thanks to Lemma 1 of PP12, the monotonic property of MEs is obtained: $H(\theta_{ind}) \ge H(\theta) \ge H(\theta_\infty)$. This, according to Theorem 1 of PP12, allows for the decomposition of the MI $I(X,Y)$ into two positive terms, such that:
$I(X,Y) = H(\theta_{ind}) - H(\theta_\infty) = I_{\theta/\theta_{ind}}(X,Y) + I_{\theta_\infty/\theta}(X,Y) \ge 0;\quad I_{\theta/\theta_{ind}} \equiv H(\theta_{ind}) - H(\theta) \ge 0;\ I_{\theta_\infty/\theta} \equiv H(\theta) - H(\theta_\infty) \ge 0$ (6)
The term $I_{\theta/\theta_{ind}}$ is the MinMI associated with the finite set of cross moments $\theta_{cr}$, and the second term is the remaining MI. The decomposition (6) allows us to define a monotonic sequence of lower MI bounds converging to the total MI. That follows from the sequence of encapsulated moment sets $\{T_{ind} = T_0, \theta_{ind} = \theta_0\} \subseteq \{T_j, \theta_j\} \equiv \{(T_{ind,j}, T_{cr,j}), (\theta_{ind,j}, \theta_{cr,j})\} \subseteq \{T_{j+1}, \theta_{j+1}\} \subseteq \dots \subseteq \{T_\infty, \theta_\infty\},\ j \ge 1$ (e.g., the set of monomial bivariate moments of a certain total order $j$), whose ME-PDF approximates the true ME-PDF in the sense of the Kullback-Leibler divergence (KLD), i.e., $D_{KL}(\rho^*_{T_\infty,\theta_\infty}||\rho^*_{T_j,\theta_j}) = H(\theta_j) - H(\theta_\infty) \to 0$ as $j \to \infty$, with the MI given by the limit $I(X,Y) = H(\theta_{ind}) - \lim_{j\to\infty}[H(\theta_j)]$. The sets $\{T_0, \theta_0\}$ and $\{T_{ind,j}, \theta_{ind,j}\}$ are ME-congruent, i.e., their ME-PDFs are the same. The $j$-th set must include enough constraints so as to keep a finite joint ME issued from $\{T_j, \theta_j\}$ and to guarantee the convergence of the above KLD towards zero. Moreover, that also guarantees that the marginals of the joint ME-PDF converge to the preset marginal PDFs $\rho_X, \rho_Y$ in the KLD sense. Therefore, the MinMI $I(\theta_{cr,\infty}, \rho_X, \rho_Y) = I(X,Y) = H(\theta_{ind}) - \lim_{j\to\infty}[H(\theta_j)]$.
The addition of constraints leads to a decrease of the ME, which motivates the useful concept of incremental MinMI, presented next. The MI part that is explained by the cross terms in the set difference $T_j/T_p$ ($j > p \ge 0$, i.e., $T_p \subseteq T_j$) is the incremental MinMI:
$I_{j/p} \equiv H(\theta_p) - H(\theta_j) = D_{KL}(\rho^*_{T_j,\theta_j}||\rho^*_{T_p,\theta_p}) = I_{j/0} - I_{p/0} \ge 0$ (7)
Estimation errors of $I_{j/p}$ are affected by the vector of moment errors $\Delta\theta_j$ (of which $\Delta\theta_p$ is simply a projection). Since we preset the marginal PDFs, $\Delta\theta_j$ is restricted to the cross part, i.e., $\Delta\theta_j = \Delta\theta_{cr,j} = P_{cr,j}\Delta\theta_j$, where $P_{cr,j}$ is the diagonal projector operator over the cross expectations (cr and ind entries set to 1 and 0, respectively). Looking for the error statistics of $I_{j/p}$, we use the second-order Taylor expansion of the ME:
$-\Delta H = H(\theta) - H(\theta + \Delta\theta_{cr}) = (P_{cr}\lambda)^T \Delta\theta_{cr} + \tfrac{1}{2}\,\Delta\theta_{cr}^T (P_{cr} C_*^{-1} P_{cr})\, \Delta\theta_{cr} + O(\|\Delta\theta_{cr}\|^3)$ (8)
where, as usual, $\lambda$ (with dropped subscripts) is the whole vector of Lagrange multipliers, of dimension $\dim(\theta_{cr}) + \dim(\theta_{ind})$, and $C_*$ is the covariance matrix of the function vector $T$, both valid for the ME-PDF verifying the constraints $E_*(T) = \theta$. We note that $C_* = E_*[T' T'^T]$, where the star stands for evaluation over the ME-PDF and the prime denotes deviation from the mean $\theta$, i.e., $T' = T - \theta$. Therefore, by using (8), we express the variation of $I_{j/p}$ ($j > p$) due to variations $\Delta\theta_{cr,j}$ as:
$\Delta I_{j/p} = (v_{j/p})^T \Delta\theta_{cr,j} + \tfrac{1}{2}\,\Delta\theta_{cr,j}^T A_{j/p}\,\Delta\theta_{cr,j} + O(\|\Delta\theta_{cr,j}\|^3);\quad v_{j/p}^T \equiv P_{cr,j}\lambda_j - P_{cr,p}\lambda_p;\quad A_{j/p} \equiv P_{cr,j}\big((C_{*j})^{-1} - P_{cr,p}(C_{*p})^{-1}P_{cr,p}\big)P_{cr,j}$ (9)
where $\lambda_j, C_{*j}$ and $\lambda_p, C_{*p}$ are the whole vectors of Lagrange multipliers and the whole covariance matrices, valid for the ME-PDFs of orders $j$ and $p$, respectively. The matrix $A_{j/p}$ is thus built from the covariance matrices $C_{*j}$ and $C_{*p}$ of the two ME-PDFs.
When the ME-PDFs of orders $j$ and $p$ are the same (which is useful for testing whether the estimated $I_{j/p}$ from data is significantly different from zero), or $p = 0$ (in which case $P_{cr,p} = 0$), then $C_{*p}$ is a sub-matrix of $C_{*j}$. In that case, $A_{j/p}$ is positive semi-definite (PSD). This comes from the generic algebraic result stating that $A = C^{-1} - P C_P^{-1} P$ is PSD, where $C$ is PSD, $P$ is a diagonal projection matrix and $C_P = PCP$ is the projected $C$, with generalized inverse $C_P^{-1}$ such that $C_P C_P^{-1} = C_P^{-1} C_P = P$. $A$ is singular, with $\mathrm{Ker}(A) = \mathrm{Im}(CP)$. However, one can prove that for small deviations between the ME-PDFs of orders $j$ and $p$, the matrix $A_{j/p}$ is still PSD. For that, one can use the same perturbation approach of [26].

2.2.2. A Theorem about the MinMI Covariance Matrix

The matrix $P_{cr}C_*^{-1}P_{cr}$ in (8) has an inverse in the cross-expectation subspace, i.e., $(P_{cr}C_*^{-1}P_{cr})^{-1}(P_{cr}C_*^{-1}P_{cr}) = P_{cr}$. Taking the identity as the sum of complementary projector operators, $I = P_{cr} + P_{ind}$, both diagonal and self-adjoint, we have:
$(P_{cr}C_*^{-1}P_{cr})^{-1} = (P_{cr}C_*P_{cr}) - (P_{cr}C_*P_{ind})(P_{ind}C_*P_{ind})^{-1}(P_{ind}C_*P_{cr}) = E_*[T'_{cr}T'^T_{cr}] - E_*[T'_{cr}T'^T_{ind}]\,E_*[T'_{ind}T'^T_{ind}]^{-1}E_*[T'_{ind}T'^T_{cr}] = E_*[T'^{ind}_{cr}(T'^{ind}_{cr})^T]$ (10)
which is the covariance matrix of the residuals $T'^{ind}_{cr}$ of the best linear fit (in the mean-square-error sense) of $T_{cr}$, using the X and Y functions in $T_{ind}$ as predictors, i.e., $T'^{ind}_{cr} \equiv T'_{cr} - \alpha_{ind,cr}^T T'_{ind}$, where the matrix of coefficients is $\alpha_{ind,cr} = E_*[T'_{ind}T'^T_{ind}]^{-1}E_*[T'_{ind}T'^T_{cr}]$. The identity (10) is simply an application to the ME covariance matrix of a generic algebraic result on PSD matrices $C_*$ and projection operators $P_{cr}, P_{ind} = I - P_{cr}$.
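The smallest instance of identity (10), with one marginal and one cross function, can be checked directly (toy numbers, our illustration):

```python
# Toy check of the Schur-complement identity behind (10): with one "ind" and
# one "cr" function and C = [[a, b], [b, d]] (ind first, cr second), the cr
# entry of C^{-1} is a/(a*d - b*b); its inverse equals d - b*b/a, the residual
# variance of the best linear fit of T_cr on T_ind. Numbers are arbitrary.
a, b, d = 2.0, 0.7, 1.5          # var(T_ind), cov(T_ind, T_cr), var(T_cr)

det = a * d - b * b
cr_entry_of_inverse = a / det     # (P_cr C^{-1} P_cr) restricted to the cr entry
lhs = 1.0 / cr_entry_of_inverse   # (P_cr C^{-1} P_cr)^{-1}
rhs = d - b * b / a               # Schur complement = residual variance
assert abs(lhs - rhs) < 1e-12
```

Note that the residual variance `lhs` is below the raw variance `d`, the scalar version of the variance reduction discussed next.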
Therefore, the variances in $(P_{cr}C_*^{-1}P_{cr})^{-1}$ are smaller than those in $(P_{cr}C_*P_{cr})$. Moreover, the more marginal constraints are imposed (with increasing $j$), the smaller the variances from $(P_{cr}C_*^{-1}P_{cr})^{-1}$ will be, due to the increasing number of predictors, and the closer one gets to full knowledge of the marginal PDFs. Then, asymptotically, the residuals $T'^{ind}_{cr,j}$ at step $j$ must converge to the residuals $T^* = T_{cr} - E_{\rho^*_{X,Y}}(T_{cr}|\rho_X,\rho_Y)$ with respect to the mean (5) entering the covariance (4) of the MinMI. That leads us to the following theorem:
Theorem 1:
Let $\rho^*_{X,Y}$ be the MinMI-PDF issued from $\{T_{cr}, \theta_{cr}\}, \rho_X, \rho_Y$, being the same as the ME-PDF issued from $\{(T_{ind}, T_{cr}), (\theta_{ind}, \theta_{cr})\}$ for some set $\{T_{ind}, \theta_{ind}\}$. Then we have:
$\lambda_{cr} = P_{cr}\lambda;\qquad C_{cr,\rho_X,\rho_Y} = (P_{cr}C_*^{-1}P_{cr})^{-1} = E_{\rho^*_{X,Y}}\big[T^*_{cr}T^{*T}_{cr} \,\big|\, E(T|X), E(T|Y)\big]$ (11)
which states that the Lagrange multipliers of the MinMI-PDF are those of the ME-PDF for the cross constraints, and that the MinMI covariance matrix (4) is that of the residuals of the best fit of the cross constraints using their conditional means as predictors. The proof, as well as that of (3)–(5), is given in Appendix 1.
An illustrative example of Theorem 1 is given for the bivariate Gaussian $\rho^*_{XY}(X,Y) = (2\pi)^{-1} d_g^{1/2} \exp[-\tfrac{1}{2} d_g (X^2 - 2 c_g X Y + Y^2)]$ of correlation $c_g$, with $d_g \equiv (1 - c_g^2)^{-1}$. The marginals $\rho_X, \rho_Y$ are standard Gaussians. $\rho^*_{XY}(X,Y)$ is the MinMI-PDF constrained by the correlation, as well as the ME-PDF constrained by the moments of orders one and two: $\{T_{ind} = (X, X^2, Y, Y^2), \theta_{ind} = (0,1,0,1)\}$ and $\{T_{cr} = (XY), \theta_{cr} = (c_g)\}$. The vector of Lagrange multipliers is $\lambda = [0, -\tfrac{1}{2}d_g, 0, -\tfrac{1}{2}d_g, c_g d_g]^T$, while the covariance matrix and its inverse (lower triangular parts) are:
$C_* = \begin{pmatrix} 1 & * & * & * & * \\ 0 & 2 & * & * & * \\ c_g & 0 & 1 & * & * \\ 0 & 2c_g^2 & 0 & 2 & * \\ 0 & 2c_g & 0 & 2c_g & c_g^2 + 1 \end{pmatrix};\qquad C_*^{-1} = \begin{pmatrix} d_g & * & * & * & * \\ 0 & \tfrac{1}{2}d_g^2 & * & * & * \\ -c_g d_g & 0 & d_g & * & * \\ 0 & \tfrac{1}{2}c_g^2 d_g^2 & 0 & \tfrac{1}{2}d_g^2 & * \\ 0 & -c_g d_g^2 & 0 & -c_g d_g^2 & (1+c_g^2)d_g^2 \end{pmatrix}$ (12)
The redundant upper triangular part is indicated by stars. The MinMI is $I_g(c_g) = -\tfrac{1}{2}\log(1-c_g^2)$, with its derivatives entering the Taylor development (3) given by $\frac{\partial I_g}{\partial c_g} = c_g d_g = P_{cr}\lambda$, which is the fifth component of $\lambda$, and $\frac{\partial^2 I_g}{\partial c_g^2} = d_g^2(1+c_g^2) = C^{-1}_{cr,\rho_X,\rho_Y} = (P_{cr}C_*^{-1}P_{cr})$, i.e., the entry at the fifth line, fifth column of $C_*^{-1}$, as predicted by Theorem 1. By expressing $Y = c_g X + d_g^{-1/2}W_X$ and $X = c_g Y + d_g^{-1/2}W_Y$, with standard Gaussian noises $W_X, W_Y \sim N(0,1)$ and $\mathrm{cor}(X, W_X) = \mathrm{cor}(Y, W_Y) = 0$, one easily gets the conditional means of $T_{cr}$ as $E_{\rho^*_{X,Y}}(XY|X) = c_g X^2;\ E_{\rho^*_{X,Y}}(XY|Y) = c_g Y^2$, leading to the best linear fit with mean square error $C_{cr,\rho_X,\rho_Y} = d_g^{-2}(1+c_g^2)^{-1}$, confirming the second part of (11).
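These two derivative identities can be verified by finite differences (a quick numerical check, our illustration):

```python
import math

def I_g(c):
    """Gaussian MinMI as a function of the Gaussian correlation."""
    return -0.5 * math.log(1.0 - c * c)

c, h = 0.5, 1e-4
d_g = 1.0 / (1.0 - c * c)
# central finite differences of I_g at c
first = (I_g(c + h) - I_g(c - h)) / (2.0 * h)              # -> c_g * d_g
second = (I_g(c + h) - 2.0 * I_g(c) + I_g(c - h)) / h**2   # -> (1 + c_g^2) * d_g^2
```

Both agree with the fifth Lagrange multiplier $c_g d_g$ and with the $(5,5)$ entry $(1+c_g^2)d_g^2$ of the inverse covariance matrix.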

2.3. Gaussian and Non-Gaussian MI

There is a particular MI decomposition of the type (6,7), already studied in PP12 [12], in which both RVs X and Y are set to standard Gaussians $N(0,1)$ over the real support set $S_X = S_Y = \mathbb{R}$ by Gaussian morphism [31]. The isotropic bivariate standard Gaussian is constrained by the moment set $T_{ind} = T_0 = (X, X^2, Y, Y^2)^T$, with expectations vector $\theta_{ind} = \theta_0 = E(T_0) = (0,1,0,1)^T$. The sequence of MinMIs is obtained by considering the indexed moment set (Equation 14 of PP12 [12], changing the index p there into j here):
$T_j \equiv \{X^r Y^s : 1 \le r+s \le j,\ (r,s) \in \mathbb{N}_0^2\},\quad j \in \mathbb{N}$ (13)
comprising bivariate polynomials of total order up to $j$. Only even natural numbers $j$ provide integrable ME-PDFs over $\mathbb{R}$, thus excluding odd $j$ values from the sequence $\{T_0,\theta_0\}, \{T_2,\theta_2\}, \{T_4,\theta_4\} \dots \{T_\infty,\theta_\infty\}$ of {moments, expectations} set pairs. The independent parts of all sets are ME-congruent with $\{T_0,\theta_0\}$, i.e., they include the high-order univariate moment expectations of the standard Gaussian. The numbers of independent and cross moments of $T_j$ (13) are $2j$ and $j(j-1)/2$, respectively (e.g., (4,1), (8,6), (12,15) and (16,28) for j = 2, 4, 6, 8). Other, more efficient basis cross functions could be used, for example orthogonal polynomials. Using the notation of Section 2.2, the maximum entropy limit $H(\theta_\infty)$ of the sequence coincides with the true (X,Y) Shannon entropy. As presented in PP12, we define the positive Gaussian MI $I_g$, the non-Gaussian MI $I_{ng}$ and the non-Gaussian MI $I_{ng,j}$ of even order $j$, respectively, as:
$I_g = I_{2/0} = H(\theta_0) - H(\theta_2) = -\tfrac{1}{2}\log(1 - c_g^2) \equiv I_g(c_g);\quad I_{ng} = I_{\infty/2} = H(\theta_2) - H(\theta_\infty);\quad I_{ng,j} = I_{j/2} = H(\theta_2) - H(\theta_j)$ (14)
with the MI decomposed as $I(X,Y) = I_g + I_{ng} \ge I_g + I_{ng,j}$. The Gaussian MI depends on the Gaussian correlation $c_g$, i.e., the Pearson correlation between the Gaussianized variables $(X,Y)$. The non-Gaussian MI vanishes iff the joint PDF is Gaussian.

2.4. Estimators of the Minimum MI from Data and Their Errors

This section is devoted to the study of estimators (and their errors) of the incremental MI $I j / p ( j > p )$, (7) between a priori RVs $X ^ , Y ^$ or, equivalently, between their transformed RVs X,Y.
In practice, the incremental MI $I j / p , j > p$ is estimated by a two-step algorithm: first, the computation of expectations; then, the MEs and the partial MIs. The vector of expectations, $θ N , j$, is estimated from the N-sized bivariate series $( X l , Y l ) , l = 1 , ... , N$, obtained by morphism from the original N iid realizations of the a-priori RVs $( X ^ l , Y ^ l ) , l = 1 , ... , N$ (e.g. time-series, spatially distributed data), as the arithmetic average:
$E N ( T j ) ≡ θ N , j = N − 1 ∑ l = 1 N T j ( X l , Y l ) = θ j + Δ θ N , j$
where $E N$ stands for the expectation over the N realizations and $Δ θ N , j$ is the vector of moment estimation errors. The first-step error comes from the difference $H ( θ N , j ) − H ( θ j )$, due to marginal morphisms and finite bivariate sampling, i.e., the cross combinations of variable realizations. We will see that MI errors depend crucially on the moment estimation errors and their statistics.
Secondly, the true ME $H ( θ N , j )$ is estimated as the minimum $H ^ ( θ N , j )$ of a functional that is reached by nonlinear minimization techniques (e.g., gradient-descent), taking as inputs $θ N , j$ and a set of calibrated parameters. The second-step error comes from the difference $H ^ − H ≡ δ H$.
The estimator of $I j / p$ along with its error, decomposed into the first-step ($Δ I N , j / p , θ$) and second-step ($Δ I N , j / p , H$) contributions, is written as
$I N , j / p ≡ H ^ ( θ N , p ) − H ^ ( θ N , j ) = I j / p + Δ I N , j / p ; Δ I N , j / p = Δ I N , j / p , θ + Δ I N , j / p , H Δ I N , j / p , θ ≡ [ H ( θ j ) − H ( θ N , j ) ] − [ H ( θ p ) − H ( θ N , p ) ] ≡ − Δ H N , j + Δ H N , p Δ I N , j / p , H ≡ [ H ^ ( θ N , p ) − H ( θ N , p ) ] − [ H ^ ( θ N , j ) − H ( θ N , j ) ] ≡ ( δ H ) N , p − ( δ H ) N , j$
where $Δ I N , j / p , θ$ is the difference between entropy anomalies $Δ H$ due to input errors. The second-step error comes from the numerical implementation and round-off errors of the entropy functional, due to: (a) a coarse-grained representation of the continuous PDF; (b) the numerical approximation of the ME functional and its gradient; (c) the stopping criteria of the iterative gradient-descent technique. In this article we neglect the second-step error, thus approximating the MinMI error by $Δ I N , j / p ≈ Δ I N , j / p , θ$, which depends only on the sampling error of the cross expectations $Δ θ c r = Δ θ N , c r , j$.

3. Errors of the Expectation’s Estimators

3.1. Generic Properties

The distribution of the MinMI error and its statistics (bias, variance, quantiles) depends on the distribution of the moment-error vector $Δ θ N , c r , j$ entering in (9). Here, we present a generic statistical model of those errors, with emphasis on the influence of variable morphisms and bivariate sampling.
Let us assume the reasonable hypothesis that the discrete estimator $θ N , j$ (15) is a consistent estimator of the mean $θ j$, i.e., that the error $Δ θ N , j → 0 , N → ∞$ in probability, with both the bias and the covariance matrix converging to zero as the data size grows:
$b Δ θ N , j ≡ E ( Δ θ N , j ) → N → ∞ 0 ; M Δ θ N , j ≡ E [ ( Δ θ N , j ′ ) ( Δ θ N , j ′ ) T ] → N → ∞ 0 ; Δ θ N , j ′ = Δ θ N , j − b Δ θ N , j$
where the prime stands for the perturbation with respect to the mean. The exact form of the components of $b Δ θ N , j$ and $M Δ θ N , j$ is rather difficult to establish, because imposing the marginal distributions reduces the randomness to the covariate sampling. Estimator variances scale as $O ( 1 / N )$, though they are smaller than in the case of N iid outcomes. Moreover, we assume that the squared bias converges faster than the variances, which is supported by a few examples in the next section.

3.2. The Effects of Morphisms and Bivariate Sampling

Let us start with the effect of the morphisms transforming the original variables $( X ^ , Y ^ )$ into $( X , Y )$. That effect depends on the ranks of the variables within the available sample. Without loss of generality, let us sort $X ^$ in ascending order in the sample, i.e., the l-th value equals the ordered l-th value $X ^ l = X ^ ( l )$, l=1,…,N. The bivariate l-th realization is $( X ^ l , Y ^ l = Y ^ ( l ′ ( l ) ) )$, where $l ′ ( l ) : { 1 , ... , N } → { 1 , ... , N }$ is the random bivariate rank permutation determined by the particular sample (e.g., if the first value of $X ^$ comes with the third value of $Y ^$, then l’(l=1)=3, and so on). In particular, $l ′ ( l ) = l$ when the correlation equals one. The inverse of the function $l ′ ( l )$ is written $l ( l ′ )$. The probability p-values of $X ^ ( l ) , Y ^ ( l ′ )$, i.e., their marginal cumulative distribution function (CDF) values, are respectively $p X , l , p Y , l ′$, increasing as functions of $l , l ′$. Those p-values can only be inferred from the sample or prescribed from a-priori hypotheses. The sorted transformed RVs given by ME-morphisms are:
$X ( l ) = Φ M E , X − 1 ( p X , l ) ; Y ( l ′ ) = Φ M E , Y − 1 ( p Y , l ′ ) ; l , l ′ = 1 , ... N$
where $Φ M E , X , Φ M E , Y$ are the prescribed ME CDFs (e.g., Gaussian CDFs) of X and Y respectively. The morphism then relies upon the invertible transformations $X ^ ( l ) → X ( l ) ; Y ^ ( l ′ ) → Y ( l ′ )$. The bivariate transformed realizations $( X l , Y l = Y ( l ′ ( l ) ) ) , l = 1 , ... , N$ are then used to compute the expectations (Equation 15). Since the exact marginal distributions are not known, their cumulative probabilities must be prescribed, for example with regular steps $Δ p X , l = Δ p Y , l = 1 / N$, in which case $p X , l = p Y , l = l / ( N + 1 ) , l = 1 , .. , N$.
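A minimal sketch of this construction (the sample, seed and the chosen cross monomials are hypothetical illustrations): rank the raw data, assign $p X , l = l / ( N + 1 )$, map through the inverse Gaussian CDF, and evaluate cross expectations as in (15):

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
N = 2000
x_hat = rng.exponential(size=N)          # original variables with arbitrary marginals
y_hat = x_hat + rng.standard_normal(N)   # dependent partner variable

nd = NormalDist()
quantiles = np.array([nd.inv_cdf(l / (N + 1)) for l in range(1, N + 1)])

# morphism: replace each value by the Gaussian quantile of its rank
X = quantiles[np.argsort(np.argsort(x_hat))]
Y = quantiles[np.argsort(np.argsort(y_hat))]

# estimated cross expectations theta_{N,j} for a few monomials of T_j
theta_N = {(r, s): np.mean(X**r * Y**s) for (r, s) in [(1, 1), (2, 1), (1, 2), (2, 2)]}
```

By construction the independent moments are essentially fixed: the sample mean of X is zero and its second moment is close to 1, up to the quantile-discretization bias discussed below.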
In order to obtain the moments of $Δ θ N , j$, we rewrite it in a convenient form:
$Δ θ N , j ≡ θ N , j − θ j = ∑ l , l ′ = 1 N T j ( Φ M E , X − 1 ( p X , l ) , Φ M E , Y − 1 ( p Y , l ′ ) ) N − 1 δ l ′ ( l ) , l ′ − ∫ 0 1 ∫ 0 1 T j ( Φ M E , X − 1 ( u ) , Φ M E , Y − 1 ( v ) ) c [ u , v ] d u d v ≈ ∑ l , l ′ = 1 N T j ( X ( l ) , Y ( l ′ ) ) [ N − 1 δ l ′ ( l ) , l ′ / ( Δ p X , l Δ p Y , l ′ ) − c [ p X , l , p Y , l ′ ] ] Δ p X , l Δ p Y , l ′$
where $δ l ′ ( l ) , l ′ = δ l ( l ′ ) , l , ∀ l , l ′ ∈ { 1 , ... , N }$ is the Kronecker delta, $u = ∫ − ∞ X ρ T X , θ X * ( t ) d t ; v = ∫ − ∞ Y ρ T Y , θ Y * ( t ) d t$ are the marginal cumulated probabilities, corresponding respectively to probabilities $p X , l$ and $p Y , l ′$ in the sum (19) and $c [ u , v ]$ is the copula function [23] (ratio between the joint PDF and the product of marginal PDFs). By looking at (19), one sees that $N − 1 δ l ′ ( l ) , l ′ / ( Δ p X , l Δ p Y , l ′ )$ is an estimator of the copula $c [ p X , l , p Y , l ′ ]$. In particular, if X,Y are independent, then l and l’(l) are independent, $c [ p X , l , p Y , l ′ ] = 1$ and $E ( δ l ′ ( l ) , l ′ | l , l ′ ) = N − 1$ i.e. there is an average equipartition of the bivariate ranks.
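The equipartition statement can be checked with a toy Monte-Carlo experiment (N and the replicate count below are arbitrary choices): averaging the permutation indicator matrix $δ l ′ ( l ) , l ′$ over many random rank pairings should drive every entry towards $N − 1$, and the copula estimator towards 1:

```python
import numpy as np

rng = np.random.default_rng(0)
N, reps = 8, 20_000
acc = np.zeros((N, N))
for _ in range(reps):
    perm = rng.permutation(N)        # one random bivariate rank pairing l -> l'(l)
    acc[np.arange(N), perm] += 1.0
acc /= reps                          # Monte-Carlo estimate of E(delta_{l'(l), l'})
copula_est = N * acc                 # copula estimator: ~1 everywhere under independence
```

Each row of `acc` sums to one exactly (every permutation pairs each l with exactly one l’), and under independence all entries settle near 1/N.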
Equation (19) shows that moments of $Δ θ N , j$ depend on statistics of the error of the copula estimator, which can be very tricky due to the imposition of marginal PDFs by morphisms, presenting unusual effects with respect to classical results from samples of iid realizations [32].
For that purpose, let us denote the random perturbation $η l , l ′ ≡ δ l ′ ( l ) , l ′ − E [ δ l ′ ( l ) , l ′ ] = η l ′ , l , ∀ l , l ′$; then $E [ η l , l ′ ] = 0$, also satisfying the constraints $∑ l = 1 N δ l ′ ( l ) , l ′ = ∑ l ′ = 1 N δ l ( l ′ ) , l = 1$ or $∑ l = 1 N η l , l ′ = ∑ l ′ = 1 N η l , l ′ = 0$, as a consequence of the fact that $l ′ ( l )$ and $l ( l ′ )$ are index permutations of N values. Therefore, taking those constraints into account, $Δ θ N , j$ can be written in different forms in terms of perturbations:
$Δ θ N , j ′ = ∑ l , l ′ = 1 N T j , l , l ′ N − 1 η l , l ′ = ∑ l , l ′ = 1 N T j , l , l ′ ′ N − 1 δ l ′ ( l ) , l ′ = ∑ l , l ′ = 1 N T j , l , l ′ ′ X N − 1 δ l ′ ( l ) , l ′ = ∑ l , l ′ = 1 N T j , l , l ′ ′ Y N − 1 δ l ′ ( l ) , l ′ = ∑ l = 1 N T j , l , l ′ ( l ) ′ N − 1 = ∑ l = 1 N T j , l , l ′ ( l ) ′ X N − 1 = ∑ l = 1 N T j , l , l ′ ( l ) ′ Y N − 1$
where $T j , l , l ′ ≡ T j ( X ( l ) , Y ( l ′ ) )$ and its perturbation with respect to the global mean is $T j , l , l ′ ′ ≡ T j , l , l ′ − E ( θ N , j )$. The perturbation with respect to the X-conditional mean is $T j , l , l ′ ′ X ≡ T j , l , l ′ − E ( T j | X = X ( l ) )$, where $E ( T j | X = X ( l ) ) = ∑ l ′ = 1 N T j , l , l ′ E [ δ l ′ ( l ) , l ′ ]$. A similar definition holds for the Y-perturbation $T j , l , l ′ ′ Y ≡ T j , l , l ′ − E ( T j | Y = Y ( l ′ ) )$.
The estimators (15) of independent constraints (components of $T j$ depending on X alone or on Y alone) have a bias but vanishing variances (null components of $Δ θ N , j ′$), since the perturbations $T j ′ X$ or $T j ′ Y$ vanish: the local values of $T j$ coincide with one of the (X or Y)-conditional means. That bias reduces to a numerical integration error. For example, for expectations of X-dependent functions, the error reduces to the bias $Δ θ X , N , j = ∑ l = 1 N T X , j ( X ( l ) ) N − 1 − ∫ 0 1 T X , j ( Φ M E , X − 1 ( u ) ) d u$, of order $O ( N − 2 )$ as given by the trapezoidal integration rule for bounded $T X , j$ functions. The estimators of cross expectations have both bias and non-vanishing variances.
Now, our goal is to estimate the covariance matrix $M Δ θ N , j$ (17). As a consequence of the non-replacement of quantiles or rankings, the deviations $T j , l 1 , l ′ ( l 1 ) ′$ and $T j , l 2 , l ′ ( l 2 ) ′$ in (20) are not necessarily independent for $l 1 ≠ l 2$, as they would be if different realizations were independent, which would lead to $var ( θ N , j ) = N − 1 var ( T j )$. Statistics without replacement generally lead to a deflation of estimator variances as compared to those satisfying the hypothesis of independent realizations [33] or, in other words, $var ( θ N , j ) ≤ N − 1 var ( T j )$. Therefore, in order to get an N−1-scaled expression for $var ( θ N , j )$, we will consider another type of deviations of $T j$ consistent with (20).
We propose new deviations, denoted by $T j ′ l m s$, given by the linear combination of the global deviation $T j ′$ and the marginal deviations $T j ′ X , T j ′ Y$, with the respective coefficients summing to 1 and having the least mean square (lms). Those deviations are consistently given by:
$T j ′ l m s = ( 1 − α X − α Y ) T j ′ + α X T j ′ X + α Y T j ′ Y = T j ′ − α X [ E ( T j | X ) − E ( θ N , j ) ] − α Y [ E ( T j | Y ) − E ( θ N , j ) ]$
which are the residuals of the best linear fit of $T j$ using the conditional means $E ( T j | X )$ and $E ( T j | Y )$ as predictors and where the coefficients are those of the linear regression:
$[ α X α Y ] = [ var [ E ( T j | X ) ] cov [ E ( T j | X ) , E ( T j | Y ) ] cov [ E ( T j | X ) , E ( T j | Y ) ] var [ E ( T j | Y ) ] ] − 1 [ cov [ E ( T j | X ) , T j ] cov [ E ( T j | Y ) , T j ] ]$
Those deviations take into account the maximum implicit knowledge of marginal PDFs through their conditional means. Now we will use them for expressing the error moments.
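As a sketch of (21)–(22) in the simplest non-trivial case (our own toy example: independent standard Gaussians and the cross function T = X²Y², for which the conditional means are E(T|X) = X² and E(T|Y) = Y²), the coefficients can be obtained by solving the 2×2 normal equations built from sample covariances:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
X, Y = rng.standard_normal(n), rng.standard_normal(n)
T = X**2 * Y**2
f, g = X**2, Y**2            # conditional means E(T|X) and E(T|Y), analytic here

cov = np.cov(np.vstack([f, g, T]))
A = cov[:2, :2]              # [[var(f), cov(f,g)], [cov(f,g), var(g)]]
b = cov[:2, 2]               # [cov(f,T), cov(g,T)]
alpha_X, alpha_Y = np.linalg.solve(A, b)   # regression coefficients, Eq. (22)

# lms residuals: T minus its best linear fit on the two conditional means
T_lms = (T - T.mean()) - alpha_X * (f - f.mean()) - alpha_Y * (g - g.mean())
```

In this example both coefficients equal 1 and the residuals reduce to (X²−1)(Y²−1), with variance 4 instead of the iid value var(T) = 8, illustrating the variance deflation that the lms construction is designed to capture.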
The expression of the error covariances in $M Δ θ N , j$ relies upon the expansion (20), with perturbations written as functions of the mean values of products of the deltas $δ l ′ ( l ) , l ′$. These means depend on the true copula and are written as:
$E ( δ l ′ ( l 1 ) , l 1 ′ δ l ′ ( l 2 ) , l 2 ′ ) = { 0 if [ l 1 = l 2 , l 1 ′ ≠ l 2 ′ ] or [ l 1 ′ = l 2 ′ , l 1 ≠ l 2 ] ; E ( δ l ′ ( l 1 ) , l 1 ′ ) = N − 1 ( * ) if [ l 1 = l 2 , l 1 ′ = l 2 ′ ] ; N − 1 ( N − 1 ) − 1 ( * ) if [ l 1 ≠ l 2 , l 1 ′ ≠ l 2 ′ ] }$
where we have considered the fact that l’(l) and its inverse l(l’) are rank permutations (no duplication allowed). The values marked with an asterisk in (23) correspond to X,Y independent (l’(l) independent of l). Those moments are difficult to obtain in practice unless the variables are independent or the bivariate PDF is known a priori. In that case, a large ensemble of N-sized surrogate samples can be generated from these moments, from which empirical estimator covariances are computed.
Then, plugging (23) into the generic (α-th row, β-th column) entry of $M Δ θ N , j$, and denoting the α-th and β-th components of $T j$ by $T j , α$ and $T j , β$, with estimation errors $Δ θ N , j , α , Δ θ N , j β$, we get
$( M Δ θ N , j ) α , β = E ( Δ θ N , j , α ′ Δ θ N , j β ′ ) = ∑ l 1 , l 1 ′ , l 2 , l 2 ′ [ T j , α ′ ( X ( l 1 ) , Y ( l 1 ′ ) ) T j , β ′ ( X ( l 2 ) , Y ( l 2 ′ ) ) ] N − 2 E ( δ l ′ ( l 1 ) , l 1 ′ δ l ′ ( l 2 ) , l 2 ′ ) = N − 1 E ( E N ( T j , α ′ T j , β ′ ) ) + N − 2 ∑ l 1 ≠ l 2 E [ T j , α ′ ( X ( l 1 ) , Y ( l 1 ′ ( l 1 ) ) ) T j , β ′ ( X ( l 2 ) , Y ( l 2 ′ ( l 2 ) ) ) ]$
The first term of the rhs of (24) is $N − 1 E [ cov N ( T j , α , T j , β ) ]$, i.e., 1/N times the expectation of the covariance among the N realizations. That term converges asymptotically to $N − 1 cov ( T j , α , T j , β )$, i.e., the estimator’s covariance under the hypothesis of N iid realizations. However, when marginals are imposed through the morphism of variables, that hypothesis no longer holds because the covariance estimator is a statistic without replacement [33]: quantiles of X and Y are not repeated in the sample. Therefore, the additional term of (24) reduces the estimator’s variances with respect to the case of iid trials.
Looking for a correct representation of the cross estimators’ variances when marginals are imposed, we represent the $T j$ perturbations by $T j ′ l m s$ (21) (residuals of the best linear regression). Here we benefit from a generic property of least-squares regression residuals: they are uncorrelated with the predictors (here, the conditional means $E ( T j | X ) , E ( T j | Y )$). This means that $T j ′ l m s$ is represented in terms of noises that are uncorrelated with both X and Y. Consequently, different realizations of $T j ′ l m s$ are uncorrelated, which simplifies the expression of the covariance matrix. Therefore, using those lms perturbations, the generic matrix entry $( M Δ θ N , j ) α , β$ (24) is rewritten as
$( M Δ θ N , j ) α , β = N − 2 ∑ l 1 [ E [ T j , α ′ l m s ( X ( l 1 ) , Y ( l ′ ( l 1 ) ) ) T j , β ′ l m s ( X ( l 1 ) , Y ( l ′ ( l 1 ) ) ) ] ] + N − 2 ∑ l 1 , l 2 ≠ l 1 [ E [ T j , α ′ l m s ( X ( l 1 ) , Y ( l ′ ( l 1 ) ) ) T j , β ′ l m s ( X ( l 2 ) , Y ( l ′ ( l 2 ) ) ) ] ] = N − 1 E ( E N ( T j , α ′ l m s T j , β ′ l m s ) ) + O ( N − 2 )$
The $N − 1$-scaled term of (25) converges asymptotically (as $N → ∞$) to $N − 1 E ( T j , α ′ l m s T j , β ′ l m s )$, i.e., 1/N times the covariance between the residuals of the linear regression relying upon conditional means. This leads us to formulate the following theorem:
Theorem 2:
Let us suppose that the X and Y marginal PDFs are imposed by variable morphisms. Then, the covariance between the N-sample-based estimators $θ N , α$ and $θ N , β$ of the means of the cross functions $T α ( X , Y )$ and $T β ( X , Y )$ is given by
$cov ( θ N , α , θ N , β ) = N − 1 E ( E N ( T α ′ l m s T β ′ l m s ) ) → N → ∞ N − 1 E ( T α ′ l m s T β ′ l m s )$
where $T α ′ l m s = T α ′ − α X [ E ( T α | X ) − θ α ] − α Y [ E ( T α | Y ) − θ α ]$ is the residual of the best linear fit taking the conditional means as predictors, and $α X , α Y$ are the corresponding coefficients (idem for $T β ′ l m s$). The expectation is computed with the true PDF of the population. The proof is given in the derivation above.
An immediate corollary of this theorem applies when data are governed by a certain MinMI-PDF issued from ${ T c r , θ c r } , ρ X , ρ Y$. Under those conditions, $T α$ and $T β$ are themselves cross functions from the constraining set $T c r$ and the $cov ( θ N , α , θ N , β )$ are entries of $M Δ θ N$ (17). Then, if the true joint PDF is the MinMI-PDF issued from ${ T c r , θ c r } , ρ X , ρ Y$, we get:
$P c r M Δ θ N P c r = N − 1 C c r , ρ X , ρ Y$
where we use the covariance matrix introduced in (4). Under those conditions, one has the identity for the matrix product $( P c r M Δ θ N P c r ) C c r , ρ X , ρ Y − 1 = N − 1 P c r$, which will be crucial for the evaluation of the asymptotic MinMI estimation bias.

3.3. Errors of the Estimators of Polynomial Moments under Gaussian Distributions

In this section we assess the bias and the covariance of the estimators, and their expression (25), when the constraints are bivariate monomials (13) and Gaussian morphisms are performed as described in Section 2.3. For the purpose of discussing the statistical tests of non-Gaussianity presented in a later section, we restrict our study to N-sized samples of iid realizations of independent variables $X ^ , Y ^$ (taken, without loss of generality, as standard Gaussians). An empirical Monte-Carlo strategy is used: taking the standard Gaussian morphisms $X , Y$ of the N outcomes, one estimates the expectation of a vector of generic functions $T ( X , Y ) = X r Y s , r , s ∈ ℕ 0$ (13). The bias is $b = E ( E N ( T ) ) − E ( T ) = μ N , r μ N , s − μ r μ s$, determined by the fixed Gaussian centered moments $μ r ≡ E ( X r )$ and $μ N , r ≡ E N ( X r )$, $r ∈ ℕ 0$. The sample is centered and standardized such that $μ N , 1 = 0 ; μ N , 2 = 1$. The variance $var ( E N ( T ) )$ of $E N ( T )$ can be rigorously computed from the quadruple sum (25), using the N quantiles from the standard Gaussian and the delta expectations (23) for the case of X, Y independent of each other. However, the computation of that sum is very time-consuming for high N values. For that reason, we approximate it by a Monte-Carlo mean obtained with $N r e a = 5000$ independent realizations of the N-sized samples. The finite and asymptotic values of $N − 1 E ( var N ( T ) )$, valid for the case of N iid trials, are given by:
$N − 1 E ( var N ( T ) ) = N − 1 ( μ N , 2 r μ N , 2 s − ( μ N , r μ N , s ) 2 ) → N → ∞ N − 1 var ( T ) = N − 1 ( μ 2 r μ 2 s − ( μ r μ s ) 2 )$
whereas those (smaller than those of (28)) obtained from least mean squares (25) are:
$var ( E N ( T ) ) ≈ N − 1 E ( var N ( T | l m s ) ) = N − 1 var N ( T | l m s ) = N − 1 ( μ N , 2 r μ N , 2 s − μ N , 2 r ( μ N , s ) 2 − μ N , 2 s ( μ N , r ) 2 + ( μ N , s μ N , r ) 2 ) → N → ∞ N − 1 var ( T | l m s ) = N − 1 ( μ 2 r μ 2 s − μ 2 r ( μ s ) 2 − μ 2 s ( μ r ) 2 + ( μ s μ r ) 2 )$
Figure 1 compares the variance $var ( E N ( T ) )$ with the squared bias $‖ b ‖ 2$ of the estimator, both relevant to the bias of the MinMI estimation. The same figure compares the empirical variance $var ( E N ( T ) )$ with its approximation $N − 1 var ( T | l m s )$ and with the variance for the case of iid trials: $N − 1 var ( T )$. We use $T = X 4 Y 2 , X 6 Y 2 , X 8 Y 2$, respectively, in panels (a), (b), (c), sorted by growing total variance $var ( T )$, which is especially concentrated at the distribution tails. In all figures, N = 25·2^k, k = 0,…,11. We have verified that the empirical variance $var ( E N ( T ) )$ agrees very well with the theoretical value $N − 1 var N ( T | l m s )$ for all N (not shown).
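This comparison can be reproduced in miniature (N, the replicate count and the single monomial T = X⁴Y² are our own choices). For independent standard Gaussians, (28) gives var(T) = μ₈μ₄ − μ₄² = 306, while the lms value (29) factorizes as (μ₈ − μ₄²)(μ₄ − μ₂²) = 192; under the morphism, the Monte-Carlo variance should track the latter:

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(3)
N, reps = 400, 3000
nd = NormalDist()
q = np.array([nd.inv_cdf(l / (N + 1)) for l in range(1, N + 1)])
q = (q - q.mean()) / q.std()             # center/standardize: mu_{N,1}=0, mu_{N,2}=1

a = q**4                                 # X^4 on the fixed quantile grid
b = q**2                                 # Y^2 on the same grid
est = np.empty(reps)
for k in range(reps):
    perm = rng.permutation(N)            # random bivariate rank pairing (independence)
    est[k] = np.mean(a * b[perm])        # E_N(X^4 Y^2) under the morphism

mc = N * est.var()                       # N * var(E_N(T)) from Monte Carlo
lms = (np.mean(q**8) - np.mean(q**4)**2) * (np.mean(q**4) - 1.0)  # sample version of (29)
iid = np.mean(q**8) * np.mean(q**4) - np.mean(q**4)**2            # sample version of (28)
```

Because the quantile grid underestimates the high moments μ₈, μ₄, both predictions are evaluated with the sample moments; the Monte-Carlo value then agrees closely with the lms prediction and stays well below the iid one.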
At this point, some generic conclusions can be drawn. The estimator’s variance $var ( E N ( T ) )$ grows with $var ( T )$, dominating over the squared bias except for small N values and higher values of $var ( T )$. This will lead us to neglect the bias of the covariance estimators in the MinMI asymptotic statistics.
Figure 1. Squared empirical bias: $‖ b ‖ 2$ (black lines) of N-based $T$- expectations as function of N, empirical variances: $var ( E N ( T ) )$ (red lines), approximated variances: $N − 1 var ( T | l m s )$ (blue lines) and variance for the case of N iid trials: $N − 1 var ( T )$ (green lines). $T$ stands for different bivariate monomials: $X 4 Y 2$ (a), $X 6 Y 2$ (b) and $X 8 Y 2$ (c).
From Figure 1, we also note that the variance reduction coming from the morphisms of variables tends to decrease for higher N values, where the effect of sampling prevails with an $N − 1$ scaling of the estimator variance, closely approximated by the asymptotic lms variance $N − 1 var ( T | l m s )$. This can lead to a slight increase of $var ( E N ( T ) )$ for small N, followed by a decrease (e.g., $X 6 Y 2$), because $var N ( T | l m s )$ is small for lower values of N.
Moreover, thanks to the Central Limit Theorem (CLT), the distribution of the estimator errors tends towards Gaussianity with increasing N, with a slower convergence rate for higher $T$ variances. However, the Gaussian PDF limit has an infinite support, which must be truncated since the estimated moments $E N ( T )$ must lie within a kind of polytope with edges determined by Schwarz-like inequalities, as shown by PP12 [12] (e.g., $| E N ( X Y ) | ≤ 1$ and $| E ( X 2 Y ) | / [ 2 ( 1 − c g 2 ) ] 1 / 2 ≤ 1$), working as bounds for nonlinear correlations. Since the estimators have bounds, the estimation errors do so as well. This can be addressed by using the Fisher Z-transform arctanh(c) of a generic linear or nonlinear correlation c and projecting it over the real support (not done here).
Figure 2 illustrates Theorem 2 under different values of the correlation $c g ∈ [ 0 , 1 ]$. We consider variables $X , Y$ with a joint Gaussian PDF of correlation $c g ∈ [ 0 , 1 ]$ and marginal standard Gaussians. In Figure 2 we compare the empirical Monte-Carlo value of $N var ( E N ( T ) )$ (MC in the figure), within an ensemble of 5000 N-sized samples, with the theoretical values $var ( T | l m s )$ (case where the morphism is performed; AN in the figure) and $var ( T )$ (case of iid realizations; ANiid in the figure). We have used a sample of N = 200, which is supposed to be near the beginning of the asymptotic regime, and two cross functions: $T ( X , Y ) = X Y$ and $T ( X , Y ) = X 2 Y$. The aforementioned variances are $var ( X Y | l m s ) = ( 1 − c g 2 ) / ( 1 + c g 2 ) ; var ( X Y ) = c g 2 + 1$, while $var ( X 2 Y ) = 12 c g 2 + 3$ and $var ( X 2 Y | l m s )$ is the mean squared residual of the best linear fit using the predictors $E ( X 2 Y | X ) = c g X 3$ and $E ( X 2 Y | Y ) = c g 2 Y 3 + ( 1 − c g 2 ) Y$. For both functions, very good agreement is verified between the Monte-Carlo values and the theoretical ones, within 1–5% relative error. A generic result of Figure 2 is that, under the fixation (presetting) of the marginals, the sampling variability of the cross estimators falls to zero as the absolute value of the correlation tends to one.
Figure 2. N times Monte-Carlo variances: $N var ( E N ( T ) )$ thick solid lines) and its theoretical analytical value $var ( T | l m s )$ (thick dashed lines), both under imposed marginals (morphisms) and analytical value of $N var ( E N ( T ) ) = var ( T )$ for iid data (thin solid lines). $T$ means different bivariate monomials: $X Y$ (black curves), $X 2 Y$ (red curves). N = 200.

3.4. Statistical Modeling of Moment Estimation Errors

The above results give empirical support to Theorem 2 concerning the covariance of the estimation errors and the neglect of the estimation biases. Therefore, the part of the matrix $M Δ θ N , j$ (17) regarding the cross components is modeled as:
$M Δ θ N , c r , j ≈ N − 1 E ( E N ( T c r , j ′ l m s T c r , j ′ l m s ) ) ≡ N − 1 C N , c r , j | l m s$
with the approximation being valid within terms $o ( N − 1 )$. In practice, the matrix $E ( T c r , j ′ l m s T c r , j ′ l m s )$ requires the estimation of conditional means for each value of X and Y.
Now, we formulate the distribution of the moment estimation errors in the asymptotic regime of high enough N. Thanks to the multivariate Central Limit Theorem [34], one can suppose that the unbiased estimation error vector follows a multivariate Gaussian distribution, written as
$Δ θ N , c r , j ≈ ( M Δ θ N , c r , j ) 1 / 2 U j ≈ N − 1 / 2 ( C N , c r , j | l m s ) 1 / 2 U j ; U j ~ N ( 0 → c r , j , P c r , j )$
where $( C N , c r , j | l m s ) 1 / 2$ is the square root matrix of $C N , c r , j | l m s$ and $U j$ is a multivariate standard normal RV of dimension equal to $dim ( θ c r , j )$ with zero mean $0 → c r , j$ and covariance matrix $P c r , j$.
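A sketch of how (31) is used in practice (the 3×3 covariance matrix below is a hypothetical stand-in for $C N , c r , j | l m s$): form a symmetric square root and push standard Gaussian vectors through it:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 1000
# illustrative (hypothetical) covariance of three cross-moment estimation errors
C = np.array([[1.0, 0.3, 0.1],
              [0.3, 2.0, 0.2],
              [0.1, 0.2, 3.0]])

w, V = np.linalg.eigh(C)                  # C = V diag(w) V^T, w >= 0
C_half = V @ np.diag(np.sqrt(w)) @ V.T    # symmetric square root C^{1/2}

U = rng.standard_normal((3, 50_000))      # standard normal vectors U_j
dtheta = (C_half @ U) / np.sqrt(N)        # surrogate errors, Eq. (31)
emp_cov = np.cov(dtheta)                  # should recover C / N
```

These surrogate error vectors are the raw material for the Monte-Carlo quantiles of the MinMI error used in the next section.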

4. Modeling of MinMI Estimation Errors, Their Bias, Variance and Distribution

Taking into account the Gaussian approximations (31) for the estimation errors, their neglected bias, the $N − 1$-scaled covariance (30), and the second-order Taylor development of the MinMI (9), one can determine the approximate bias, variance and distribution of the MinMI estimators (15). In particular, we derive:
• The estimation of the bias, variance, quantiles and distribution of the estimators of the incremental MinMI $I j / p$, issued from finite samples of N (iid) realizations of the bivariate original variables $( X ^ , Y ^ )$, which are then transformed into the RVs $( X , Y )$.
• The distribution of estimators of $I j / p$ under the null hypothesis H0 that $( X , Y )$ follows the ME distribution constrained by a weaker constraint set $( T p , θ p )$ (j>p). These estimators work as a significance test for determining whether there is statistically significant MI beyond that explained by cross moments in $( T p , θ p )$.

4.1. Bias, Variance, Quantiles and Distribution of MI Estimation Error

Considering the moment error distribution (31) and plugging it into the development (9), the error of the MI estimator $I N , j / p$ is then distributed as:
$Δ I N , j / p , θ ≈ N − 1 / 2 [ v j / p T ( C N , c r , j | l m s ) 1 / 2 ] U j + 1 / 2 N − 1 U j T [ ( C N , c r , j | l m s ) 1 / 2 A j / p ( C N , c r , j | l m s ) 1 / 2 ] U j$
where the neglected terms are of order $O ( N − 3 / 2 )$. That is a second-order polynomial form of a multivariate standard Gaussian RV $U j ~ N ( 0 → j , P c r , j )$. There is no general analytical expression for the PDF inferred from (32), except in certain cases where $Δ I N , j / p$ is governed by a non-central Chi-squared distribution [36]. The quantiles determining the confidence intervals of $I N , j / p$ can easily be obtained by sorting Monte-Carlo surrogates (proxies) of (32) from a pseudo-random generator of a standard Gaussian. Analytical expressions of the distribution of MI estimates have been given from an MI Taylor expansion in terms of the anomalies of the estimated probabilities [27,37]. Here, we adopt a different approach by considering anomalies of the estimated expectations.
The bias of $I N , j / p$ or the expectation of $Δ I N , j / p , θ$ is derived from the mean of the quadratic form term in (32). Therefore, taking the invariance of the trace for the circular permutation of a matrix product, that bias is approximated by the asymptotic value:
$E ( Δ I N , j / p ) ≈ ( 1 / 2 ) N − 1 T r ( C N , c r , j | l m s A j / p ) = ( 1 / 2 ) N − 1 [ T r ( C N , c r , j | l m s P c r , j C * j − 1 P c r , j ) − T r ( C N , c r , p | l m s P c r , p C * p − 1 P c r , p ) ]$
This is the difference between the maximum entropy $N − 1$-scaled biases of orders j and p, subjected to the imposition of the marginal PDFs. We must remember that if p = 0, $P c r , p$ is zero. In this case the MinMI bias is simply minus the (negative) bias of the ME $H ( θ N , j )$, which is treated without the effect of variable morphism by [26]. When the data are governed by the MinMI-PDF of order j, the matrices $C N , c r , j | l m s$ and $P c r , j C * j − 1 P c r , j$ are inverses of each other, according to Theorems 1 and 2 (11,27), leading to $E ( Δ I N , j / 0 ) = ( 1 / 2 ) N − 1 T r ( C N , c r , j | l m s P c r , j C * j − 1 P c r , j ) = ( 1 / 2 ) N − 1 T r ( P c r , j )$, i.e., $1 / ( 2 N )$ times the number of cross constraints. However, as argued by [26], when the true data distribution is more leptokurtic than the MinMI-PDF, the bias can be larger than $( 1 / 2 ) N − 1 T r ( P c r , j )$.
By assuming the limit case of Gaussianity, the variance of $Δ I N , j / p$ comes as:
$var ( Δ I N , j / p ) ≈ N − 1 T r [ C N , c r , j | l m s ( v j / p v j / p T ) ] + ( 1 / 2 ) N − 2 T r [ ( C N , c r , j | l m s A j / p ) 2 ]$
The leading variance term is N−1-scaled, as generally deduced in [15]. Keeping the leading term of (34) and dealing with the trace, we get a given relative error $r I = Δ I N , j / I j$ of the MinMI $I j / 0$ (p=0) when $N ≥ E ( ( λ c r , j T T c r , j ′ ) 2 ) / ( I j / 0 r I ) 2 ≈ O ( m c r , j ) / ( I j / 0 r I ) 2$. The term $O ( m c r , j )$ increases at a larger rate than $I j / 0$ as the bound of the polytope of allowed expectations is approached.

4.2. Significance Tests of MinMI Thresholds

The estimators $I N , j / p$ allow for the elaboration of statistical significance tests verifying whether the empirical PDF differs considerably from a threshold ME-PDF or whether, on the contrary, the difference can be justified by sampling errors.
Let us suppose the null hypothesis H0 that the true PDF coincides with the ME-PDF constrained by $( T p , θ p )$. In particular, for $( T p , θ p ) = ( T p = 0 , θ p = 0 ) = ( T i n d , θ i n d )$, the null hypothesis states that $( X , Y )$ are statistically independent. Under H0, the moment sets $( T p , θ p ) , ( T j , θ j )$ are ME-congruent and the moments of order $j ≥ p$ remain well determined by expectations over the less restricted p-th ME-PDF, i.e., $θ j = E ρ T p , θ p * ( T j ) ≡ θ j ← p$, where the subscript arrow $j ← p$ means that the j-order statistics are obtained from the p-order ME-PDF. The same holds for the ME covariance matrices, i.e., $C * p = C p$ and $C * j = C * j ← p = C j ; j ≥ p$. Under these conditions, the matrix $C p$ is simply a sub-matrix of $C j$. The Lagrange multipliers are restricted to the p-order, i.e., $λ j = λ j ← p = ( λ p , 0 → j / p ) ; j ≥ p$, where entries of order higher than p are set to zero, leading to $v j / p = 0$ in (9). Therefore, the incremental MinMI vanishes, i.e., $H ( θ j ) − H ( θ p ) = I j / p = 0$, but the estimator $I N , j / p$ is positive due to artificial MI generated by sampling errors. Then, under H0, and using (9), the MI estimation is given by the following approximation:
$H ( θ N , p ) − H ( θ N , j ) | H 0 ≡ δ I N , j / p ≈ ( 1 / 2 ) N − 1 U j T [ ( C N , c r , j | l m s ) 1 / 2 A j ← p ( C N , c r , j | l m s ) 1 / 2 ] U j U j ~ N ( 0 → j , P c r , j ) ; A j ← p = P c r , j ( C j ) − 1 P c r , j − P c r , p ( C p ) − 1 P c r , p$
where $A j ← p$ is a positive semi-definite matrix. This works as a significance test for the rejection of H0: if $I N , j / p$ is larger than an upper 1−α quantile (e.g., 1−α = 95%) of $δ I N , j / p$, then H0 is rejected at significance level α. Those quantiles determine the significant MI thresholds and can be computed empirically, as for the MinMI error (32), by a Monte-Carlo strategy. Another possibility is fitting the $δ I N , j / p$ distribution to a Gamma PDF with prescribed mean and variance (not done here). The bias and variance of $δ I N , j / p$ are straightforward, coming as:
$E [ δ I N , j / p ] ≈ ( 1 / 2 ) N − 1 T r [ C N , c r , j | l m s A j ← p ] ; var [ δ I N , j / p ] ≈ ( 1 / 2 ) N − 2 T r [ ( C N , c r , j | l m s A j ← p ) 2 ]$
The $N^{-2}$ scaling of the variance is also present in other MI estimation errors under the hypothesis of variable independence [27]. Under Theorems 1 [11] and 2 [27], along with the null hypothesis, one gets $C_{N,cr,j|lms}\, A_{j\leftarrow p} = P_{cr,j} - P_{cr,p}$, leading to a Chi-Squared distribution for $\delta I_{N,j/p}$:
$$\delta I_{N,j/p} \sim \tfrac{1}{2} N^{-1} \chi^2_{n_d} \;; \qquad n_d = \mathrm{Tr}(P_{cr,j} - P_{cr,p}) \quad (37)$$
with $n_d$ degrees of freedom, i.e., the difference between the numbers of cross moments of orders j and p. From this, the upper quantiles required for statistical significance are easily obtained from $\chi^2$ probability lookup tables. The bias and variance are, respectively:
$$E[\delta I_{N,j/p}] \approx \tfrac{1}{2} N^{-1}\, \mathrm{Tr}(P_{cr,j} - P_{cr,p}) \;; \qquad \mathrm{var}[\delta I_{N,j/p}] \approx \tfrac{1}{2} N^{-2}\, \mathrm{Tr}(P_{cr,j} - P_{cr,p}) \quad (38)$$
From (38), in order to obtain a test with relative error $r_I = \Delta I_{\min}/I_{\min}$, one must choose $N \ge ((m_{cr2} - m_{cr1})/2)^{1/2}/(I_{\min}\, r_I)$.
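As an illustrative sketch (not part of the original analysis), the sample-size criterion above can be coded directly; the function name `required_sample_size` and the numerical values in the example are our own choices:

```python
import math

def required_sample_size(m_cr2, m_cr1, I_min, r_I):
    """Smallest N satisfying N >= sqrt((m_cr2 - m_cr1)/2) / (I_min * r_I),
    so that the incremental-MinMI test resolves I_min with relative error r_I."""
    return math.ceil(math.sqrt((m_cr2 - m_cr1) / 2.0) / (I_min * r_I))

# e.g., adding the 5 fourth-order cross moments (m_cr2 = 6) on top of the
# single Gaussian cross moment XY (m_cr1 = 1), targeting I_min = 0.05 nats
# at 10% relative error:
N_req = required_sample_size(6, 1, 0.05, 0.10)
```

Note how the required N grows only with the square root of the number of extra constraints, but inversely with both the target MI and the tolerated relative error.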

4.3. Significance Tests of the Gaussian and Non-Gaussian MI

In this section we particularize the theory of Sections 4.1 and 4.2 (Equations (35)–(38)) to the Gaussian and non-Gaussian MIs defined in Section 2.3. For this purpose, let us consider the moment sets (13) and the MI components $I_g$ and $I_{ng,j}$ (11). Their finite-sample estimators are:
$$I_{N,g} = H(\theta_0) - H(\theta_{N,2}) = I_g + \Delta I_{N,g} = I_{N,j=2/p=0}\;; \qquad \Delta I_{N,g} = I_g(c_g + \Delta c_{g,N}) - I_g(c_g) = -\Delta H(\theta_{N,2})$$
$$I_{N,ng,j} = H(\theta_{N,2}) - H(\theta_{N,j}) = I_{ng,j} + \Delta I_{N,ng,j} = I_{N,j/p=2}\;; \qquad \Delta I_{N,ng,j} = \Delta H(\theta_{N,2}) - \Delta H(\theta_{N,j}) \quad (39)$$
where $\Delta I_{N,g}, \Delta I_{N,ng,j}$ are MinMI errors, $\Delta c_{g,N}$ is the Gaussian correlation estimation error, $H(\theta_0) = 2H_g$ with $H_g \equiv \tfrac{1}{2}\log(2\pi e)$ being the entropy of the univariate standard Gaussian, and $\theta_{N,j} = \theta_j + \Delta\theta_{N,j}$, $j \ge 1$, are the expectations obtained from the N-sized Gaussianized standardized sample.
The maximum entropy estimator $\hat{H}$ (16), which approximates H, is computed numerically over Nb bins of a sufficiently wide finite interval [−Li, Li]. In the corresponding experiments (as in PP12), we used the calibrated values Li = 6 and Nb = 80. The algorithm is detailed in Appendix 2 of PP12 [12], following an adapted bivariate version of the algorithm of [35]. The error $\delta H = \hat{H} - H$ is of the order of round-off error, becoming comparable to the sampling ME errors only at very high values of N.
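The bin-based entropy evaluation underlying $\hat{H}$ can be sketched as follows for a known PDF, using the paper's calibrated grid (Li = 6, Nb = 80); this is a minimal midpoint-rule sketch, not the paper's full ME algorithm, and the function names are ours. We check it on the independent bivariate standard Gaussian, whose entropy is exactly $\log(2\pi e) = 2H_g$:

```python
import math

def entropy_on_grid(pdf, Li=6.0, Nb=80):
    """Midpoint-rule estimate of -integral of rho*log(rho) over [-Li, Li]^2."""
    dx = 2.0 * Li / Nb
    H = 0.0
    for i in range(Nb):
        x = -Li + (i + 0.5) * dx
        for j in range(Nb):
            y = -Li + (j + 0.5) * dx
            p = pdf(x, y)
            if p > 0.0:
                H -= p * math.log(p) * dx * dx
    return H

def gauss2d(x, y):
    # independent bivariate standard Gaussian density
    return math.exp(-0.5 * (x * x + y * y)) / (2.0 * math.pi)

H_num = entropy_on_grid(gauss2d)            # numerical estimate
H_exact = math.log(2.0 * math.pi * math.e)  # = 2 * Hg, about 2.8379 nats
```

For this smooth, rapidly decaying integrand the grid error is far below the sampling errors discussed in the text, consistent with the claim that $\delta H$ only matters at very high N.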

4.3.1. Error and Significance Tests of the Gaussian MI

The Gaussian MI error $\Delta I_{N,g}$ depends on the Gaussian correlation estimation error $\Delta c_{g,N} \equiv c_{g,N} - c_g$, where $c_{g,N} = E_N(XY)$ is inferred from the sample. Let us write (9) for $\Delta I_{N,g}$. The Gaussian bivariate ME-PDF, constrained by $(T_2 = (X, X^2, Y, Y^2, XY)^T,\ \theta_2 = (0,1,0,1,c_g)^T)$, is $\rho^*_{T_2,\theta_2}(X,Y) = [4\pi^2(1-c_g^2)]^{-1/2} \exp[-\tfrac{1}{2}(1-c_g^2)^{-1}(X^2 - 2c_g XY + Y^2)]$, leading to the vector of Lagrange multipliers $\lambda_2 = [0, -\tfrac{1}{2}(1-c_g^2)^{-1}, 0, -\tfrac{1}{2}(1-c_g^2)^{-1}, c_g(1-c_g^2)^{-1}]^T$. The projection operator $P_{cr,2}$ onto cross moments is the 5×5 matrix that extracts the 5th entry (row and column) of $T_2$, corresponding to the unique cross moment XY. The required 5×5 covariance matrix is $C_{*2} = E_{\rho^*_{T_2,\theta_2}}[T_2 T_2^T] - \theta_2\theta_2^T$, where the expectation is taken over the bivariate Gaussian $\rho^*_{T_2,\theta_2}$. Then we apply (9) for j = 2, p = 0, with $\Delta\theta_{N,j} = (0,0,0,0,\Delta c_{g,N})^T$. The Gaussian MI error can be written in equivalent forms as:
$$\Delta I_{N,g} \approx (P_{cr,2}\lambda_2)^T \Delta c_{g,N} + \tfrac{1}{2}\left(P_{cr,2} C_{*2}^{-1} P_{cr,2}\right)(\Delta c_{g,N})^2 = \frac{c_g}{1-c_g^2}\,\Delta c_{g,N} + \frac{1+c_g^2}{2(1-c_g^2)^2}\,(\Delta c_{g,N})^2 = \frac{\partial I_g}{\partial c_g}\,\Delta c_{g,N} + \frac{1}{2}\frac{\partial^2 I_g}{\partial c_g^2}\,(\Delta c_{g,N})^2 \quad (40)$$
Here, $P_{cr,2}\lambda_2$ is the fifth component of $\lambda_2$, corresponding to the first derivative of $I_g$ with respect to $c_g$, whereas $P_{cr,2} C_{*2}^{-1} P_{cr,2}$ is the (5,5) entry of $C_{*2}^{-1}$, corresponding to the second derivative of $I_g$. The bias and variance of $\Delta I_{N,g}$ depend on the distribution of the Gaussian correlation error $\Delta c_{g,N}$. According to the proposed modeling of moment estimation errors (Theorem 2 of Section 3.4), $\Delta c_{g,N}$ is asymptotically Gaussian with a negligible bias $E(\Delta c_{g,N}) \approx 0$ and a variance (under imposed marginals) given by:
$$\mathrm{var}(\Delta c_{g,N}) \approx N^{-1}\,\mathrm{var}\!\left(XY \mid E(XY|X), E(XY|Y)\right) = N^{-1}(1-c_g^2)^2/(1+c_g^2) \quad (41)$$
However, in order to keep the simulated $c_g = c_{g,N} - \Delta c_{g,N}$ within the interval $[-1, 1]$, one can use the more accurate Fisher Z-transform [38], such that $\Delta c_{g,N} = \tanh(\tanh^{-1}(c_g) + \Delta Z_N) - c_g$, where $\Delta Z_N$ has mean and variance of order $O(N^{-1})$.
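A minimal sketch of generating correlation-error surrogates via the Fisher Z-transform; the variance $1/N$ for $\Delta Z_N$, the ensemble size and the seed are illustrative choices consistent with the $O(N^{-1})$ statement above, not the paper's exact settings:

```python
import math
import random

def fisher_z_surrogates(c_g, N, n_surrogates=1000, seed=0):
    """Simulate Delta_c = tanh(atanh(c_g) + Delta_Z) - c_g with
    Delta_Z ~ N(0, 1/N); guarantees c_g + Delta_c stays inside (-1, 1)."""
    rng = random.Random(seed)
    z0 = math.atanh(c_g)
    sd = 1.0 / math.sqrt(N)
    return [math.tanh(z0 + rng.gauss(0.0, sd)) - c_g for _ in range(n_surrogates)]

deltas = fisher_z_surrogates(c_g=-0.30, N=100)
# every simulated correlation is a tanh output, hence inside (-1, 1)
ok = all(-1.0 < -0.30 + d < 1.0 for d in deltas)
```

The point of the transform is visible in the last line: however large the Gaussian perturbation $\Delta Z_N$, the back-transformed correlation can never leave the admissible interval.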
In order to test the null hypothesis that the pair $(X, Y)$ has a joint bivariate isotropic Gaussian distribution, we must compare the estimate $I_{N,g}$ with upper quantiles of the significance test $\delta I_{N,g}$, given by $\Delta I_{N,g}$ (40) with $c_g = 0$ and $\Delta c_{g,N} \sim \mathcal{N}(0, N^{-1})$. This is a Gaussian correlation significance test, Chi-squared distributed with:
$$\delta I_{N,g} = \tfrac{1}{2}(\Delta c_{g,N})^2 = \tfrac{1}{2} N^{-1} U^2 \sim \tfrac{1}{2} N^{-1}\chi^2_1\;; \quad U \sim \mathcal{N}(0,1)\;; \qquad E(\delta I_{N,g}) = \tfrac{1}{2} N^{-1}\;; \quad \mathrm{var}(\delta I_{N,g}) = \tfrac{1}{2} N^{-2} \quad (42)$$
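Under H0 the statistic $2N\,I_{N,g}$ is thus $\chi^2_1$-distributed, so the test reduces to comparing it with a tabulated quantile. A sketch of the resulting decision rule (the threshold 3.841 is the standard 95% $\chi^2_1$ quantile; the function names and the N = 100 example are ours):

```python
import math

CHI2_1_Q95 = 3.841  # upper 95% quantile of chi-squared with 1 dof

def gaussian_mi(c):
    """Gaussian MI in nats: I_g = -0.5 * log(1 - c^2)."""
    return -0.5 * math.log(1.0 - c * c)

def gaussian_mi_significant(c_N, N, q=CHI2_1_Q95):
    """Reject H0 (independence) when 2*N*I_{N,g} exceeds the chi2_1 quantile."""
    return 2.0 * N * gaussian_mi(c_N) > q

# With N ~ 100 effective samples, c = -0.30 is significant while c = -0.10 is not:
sig_strong = gaussian_mi_significant(-0.30, 100)
sig_weak = gaussian_mi_significant(-0.10, 100)
```

This mirrors the classical Pearson correlation test: for small $c$, $2N\,I_{N,g} \approx N c^2$, the square of the usual $\sqrt{N}\,c$ z-statistic.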

4.3.2. Error and Significance Tests of the Non-Gaussian MI

The estimation error $\Delta I_{N,ng,j}$ of the non-Gaussian MI defined in (39) can be written as a particular form of (9), for an even order $j \ge 4$ and p = 2, as a function of the vector $\Delta\theta_{N,j}$ of errors of the moment vector $T_j$ (13) under a chosen component indexation. The matrix $A_{j/p} = A_{j/p=2} \equiv P_{cr,j}(C_{*j})^{-1}P_{cr,j} - P_{cr,2}(C_{*2})^{-1}P_{cr,2}$ of (9) then comprises the inverses of the covariance matrices $C_{*j}$ and $C_{*2}$ of the j-th and 2nd order ME solutions, respectively.
Algebraic consistency sets the matrix $P_2(C_{*2})^{-1}P_2$ to the embedding of $(C_{*2})^{-1}$ into the j-th moment subspace. The vector $v_{j/p=2} \equiv P_{cr,j}\lambda_j - P_{cr,2}\lambda_2$ comprises the Lagrange multiplier vectors of the ME solutions of orders j and 2. Below, we perform a range of experiments validating the approximations of Section 4.2.
In order to compute the bias, variance, quantiles and confidence intervals of $I_{N,ng,j}$ from N-sized samples, there are two possible strategies: pure Monte-Carlo simulations, or the analytical and semi-analytical (analytical with moment-error surrogates) approaches explained in Section 1. In the pure Monte-Carlo approach, either a known bivariate PDF is assumed, or surrogates of the joint PDF are generated through multivariate bootstrapping techniques [39] preserving the copula structure. For each sample in an extended ensemble of Nrea (e.g., 5000) realizations, we compute moments and solve the ME problem, gathering statistics afterwards. Alternatively, ME errors can be computed from the Taylor expansion (9) applied to the moment deviations over the ensemble.
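A stripped-down version of the pure Monte-Carlo strategy for the simplest test: draw Nrea independent N-sized samples under H0 (independent standard Gaussians), compute the cross moment, and gather ensemble statistics. This sketch uses the quadratic approximation (40) with $c_g = 0$ instead of solving the full ME problem per realization; the ensemble size and seed are illustrative:

```python
import random

def monte_carlo_delta_I_g(N=100, n_rea=2000, seed=1):
    """Ensemble of delta_I_{N,g} = 0.5 * (Delta c_{g,N})^2 under H0
    (independent standard Gaussian X, Y), via the quadratic approximation."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_rea):
        # sample cross moment c_{g,N} = E_N(XY) from one N-sized sample
        c_N = sum(rng.gauss(0, 1) * rng.gauss(0, 1) for _ in range(N)) / N
        stats.append(0.5 * c_N * c_N)
    return stats

ensemble = monte_carlo_delta_I_g()
bias_mc = sum(ensemble) / len(ensemble)  # theory predicts 1/(2N) = 0.005
```

In the paper's full procedure, the line computing `c_N` is replaced by an ME solve constrained by the sampled moments; the ensemble bookkeeping is the same.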
In the analytical and semi-analytical approaches, the moment errors $\Delta\theta_{N,j}$ are assumed to follow a given parametric distribution, which can be multivariate Gaussian as in (31), based either on a prescribed bias-covariance model or on a more sophisticated approach accounting for the natural bounds of the simulated moments $\theta_{cr,j} = \theta_{N,cr,j} - \Delta\theta_{N,cr,j}$. MinMI statistics (bias, variance, quantiles) are then computed over ensembles of error surrogates.
The non-Gaussian MIs $I_{N,ng,j}$ (even $j \ge 4$) act as tests measuring significant statistical deviations from the null hypothesis of joint Gaussianity. These tests are given by Kullback-Leibler distances (7) and constitute an alternative to the use of algebraic deviations of moments from those of the bivariate Gaussian (e.g., bivariate cumulants) [40].
The non-Gaussianity test of order j is given by $\delta I_{N,ng,j} \equiv H(\theta_{N,2}) - H(\theta_{N,j})|_{H_0}$ under the null hypothesis H0 that the true PDF is bivariate Gaussian, and is written as a particular case of (35). However, the test formula can be simplified by considering a null Gaussian correlation. This holds thanks to the invariance of the non-Gaussian MI under variable rotations (see PP12), in particular for uncorrelated standardized variables $(X_r, Y_r)^T = A(X, Y)^T$, where A is the rotation matrix (e.g., $X_r = X$, $Y_r = (Y - c_g X)(1 - c_g^2)^{-1/2}$, i.e., the residual of the linear prediction). Under H0, the rotated variables are still bivariate Gaussian, and therefore the non-Gaussianity significance test $\delta I_{N,ng,j}$ has the same distribution as for $c_g = 0$. The matrices $C_{N,cr,j|lms}$ and $A_{j\leftarrow 2}$ entering Equation (35) are now evaluated under Gaussian isotropic conditions. For clarity, we denote them respectively by $C_{g,N,cr,j|lms}$ and $A_{g,j\leftarrow 2} = P_j(C_{g,j})^{-1}P_j - P_2(C_{g,2})^{-1}P_2$, where the subscript g stands for evaluation at $(X, Y)^T \sim \mathcal{N}(\vec{0}, I)$. For high N, $C_{g,N,cr,j|lms} \to C_{g,j}$, the covariance matrix of the cross j-th order moments for the isotropic Gaussian. Then we write:
$$\delta I_{N,ng,j} \approx \tfrac{1}{2} N^{-1}\, U_j^T \left[(C_{g,N,cr,j|lms})^{1/2}\, A_{g,j\leftarrow 2}\, (C_{g,N,cr,j|lms})^{1/2}\right] U_j \quad (43)$$
Let us specify the generic entries at row α, column β of these matrices, corresponding to monomials $X^{r_\alpha} Y^{s_\alpha}$ and $X^{r_\beta} Y^{s_\beta}$ of $T_j$, i.e., with $r_\alpha + s_\alpha,\ r_\beta + s_\beta \le j$. Then, using the notation of Section 3.3 for Gaussian standard moments, $\mu_r \equiv E(X^r)$, $\mu_{N,r} \equiv E_N(X^r)$, $r \in \mathbb{N}_0$, the components of $C_{g,j}$ become:
$$(C_{g,j})_{\alpha,\beta} = \mu_{r_\alpha + r_\beta}\,\mu_{s_\alpha + s_\beta} - \mu_{r_\alpha}\,\mu_{r_\beta}\,\mu_{s_\alpha}\,\mu_{s_\beta} \quad (44)$$
whereas the components of the lms covariances are:
$$(C_{g,N,cr,j|lms})_{\alpha,\beta} = \mu_{N,r_\alpha + r_\beta}\,\mu_{N,s_\alpha + s_\beta} - \mu_{N,s_\alpha + s_\beta}\,\mu_{N,r_\alpha}\,\mu_{N,r_\beta} - \mu_{N,r_\alpha + r_\beta}\,\mu_{N,s_\alpha}\,\mu_{N,s_\beta} + \mu_{N,r_\alpha}\,\mu_{N,r_\beta}\,\mu_{N,s_\alpha}\,\mu_{N,s_\beta} \quad (45)$$
The bias of the non-Gaussian MinMI and its asymptotic approximation (36) are given by:
$$E[\delta I_{N,ng,j}] \approx \tfrac{1}{2} N^{-1}\left[\mathrm{Tr}\!\left(C_{g,N,cr,j|lms}\, P_{cr,j}\, C_{g,j}^{-1}\right) - 1\right] = \tfrac{1}{2} N^{-1}\left(\mathrm{Tr}(P_{cr,j}) - 1\right) \quad (46)$$
Similarly and following (36), the variance becomes:
$$\mathrm{var}[\delta I_{N,ng,j}] \approx \tfrac{1}{2} N^{-2}\, \mathrm{Tr}\!\left[\left(C_{g,N,cr,j|lms}\, A_{g,j\leftarrow 2}\right)^2\right] = \tfrac{1}{2} N^{-2}\left(\mathrm{Tr}(P_{cr,j}) - 1\right) \quad (47)$$
and the distribution is reasonably approximated, following (37), by:
$$\delta I_{N,ng,j} \sim \tfrac{1}{2} N^{-1}\chi^2_{n_d}\;; \qquad n_d = \mathrm{Tr}(P_{cr,j}) - 1 = j(j-1)/2 - 1 \quad (48)$$
from which bounds of significance levels of non-Gaussianity can be computed through quantiles of the Chi-squared distribution.
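The degrees-of-freedom count and the resulting significance thresholds are easy to tabulate. A sketch using the standard Wilson-Hilferty approximation for the $\chi^2$ quantile (this approximation, and the function names, are our own choices, not part of the paper):

```python
import math

def n_dof_non_gaussian(j):
    """Degrees of freedom of the order-j non-Gaussianity test: j(j-1)/2 - 1."""
    return j * (j - 1) // 2 - 1

def chi2_q95(k):
    """Upper 95% chi-squared quantile via the Wilson-Hilferty approximation."""
    z95 = 1.6449  # standard normal 95% quantile
    return k * (1.0 - 2.0 / (9.0 * k) + z95 * math.sqrt(2.0 / (9.0 * k))) ** 3

def threshold(j, N):
    """Significance threshold on I_{N,ng,j}: (1/2N) * chi2 95% quantile."""
    return 0.5 * chi2_q95(n_dof_non_gaussian(j)) / N
```

For j = 4, 6, 8 this reproduces the degrees of freedom 5, 14 and 27 used in the Monte-Carlo validation of Section 4.4, and shows how the threshold grows with the number of cross constraints at fixed N.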

4.4. Validation of Significance Tests by Monte-Carlo Experiments

We have presented theoretical expressions for the bias, variance and distribution of both the Gaussian correlation test (42) and the ME non-Gaussianity test of order j (46)–(48). We now validate those expressions by comparing them with statistics from large Monte-Carlo ensembles of ME computations. For that purpose, we generated $N_{rea} = 5000$ independent synthetic datasets of N iid uncorrelated $(X, Y)$ pairs from a Gaussian random generator, with N taken from a doubling sequence: N = 25, $2^1 \cdot 25$, …, $2^{11} \cdot 25$ = 51200. We then computed the 5000 realizations of the independence test $\delta I_{N,g}$ as well as of the non-Gaussianity tests $\delta I_{N,ng,j}$ for j = 4, 6, 8. In order to minimize errors of type $\delta H$ (8) from the ME functional, we retained only those Monte-Carlo realizations whose ME-PDF moments are within a relative square error of $10^{-5}$.
We then collected and compared the estimates of bias, standard deviation and the 95% quantile provided by the three approaches: Monte-Carlo (extended ensemble of ME computations), semi-analytical (generation of Gaussian surrogates in the Taylor expansion of ME) and analytical (formulas based on Theorems 1 and 2). Figures 3a–d depict these statistics of the significance tests, respectively for $\delta I_{N,g}$ and $\delta I_{N,ng,j}$ (j = 4, 6, 8). The truth is taken to be the Monte-Carlo estimate.
As expected, the significance tests all scale as $N^{-1}O(1)$, and consequently their bias, standard deviation and quantiles are $N^{-1}O(1)$, as shown in Figure 3a–d by the estimates from the different approaches. MinMI biases and significance thresholds (the 95% quantiles) grow with the number of constraints, as in the sequence $I_{N,g}$, $I_{N,ng,j=4}$, $I_{N,ng,j=6}$, $I_{N,ng,j=8}$.
These estimators are thus progressively stronger evaluations of MI (or of the MI beyond that explained by Gaussianity), though they call for progressively higher significance thresholds. Therefore, especially for under-sampled data (small N) or very low MI (or non-Gaussian MI) values (weakly dependent variables or weak joint non-Gaussianity), there must be a tradeoff between N and the number of parameters of the MinMI estimator (here the number of cross constraints).
At this point, we discuss how well the analytical and semi-analytical estimates of the MinMI error statistics fit the Monte-Carlo (true) statistics. There are three crucial factors in our approximations: (1) the accuracy of the ME Taylor expansion, valid for small enough sampling errors (N large); (2) the convergence rate towards Gaussian statistics (from the CLT) for high N.
Figure 3. Test statistics: bias (black lines), standard deviation (red lines) and 95% quantiles (green lines), provided by the Monte-Carlo approach (thick full lines), the semi-analytical approach (thin dashed lines) and the analytical approach (thick full lines). The tests are $\delta I_{N,g}$ (a); $\delta I_{N,ng,j=4}$ (b); $\delta I_{N,ng,j=6}$ (c) and $\delta I_{N,ng,j=8}$ (d).
The analytical bias depends on factors 1 and 3, while the formulas for variance, distribution and quantiles depend on all the above factors, being valid only for N high enough. From Figure 3a–d, we see that the agreement between analytical and Monte-Carlo statistics is quite good for all tests (with a slight analytical underestimation), though only for large enough $N > N_{test}$, where $N_{test}$ depends on how late (in N) factors 1–3 jointly hold. We have $N_{test} \approx 50, 400, 1600, 3200$, respectively for $\delta I_{N,g}$, $\delta I_{N,ng,j=4}$, $\delta I_{N,ng,j=6}$, $\delta I_{N,ng,j=8}$, growing with the number of constraints. The exception occurs when N is so large that the errors $\delta H$ of the operational ME (typically round-off errors) become of the same order as the small test values $\delta I$, starting to influence the Monte-Carlo statistics.
In order to validate the analytical Chi-Squared distributions of the tests, Figure 4 presents the empirical cumulative histograms of $2N\delta I_{N,g}$, $2N\delta I_{N,ng,4}$, $2N\delta I_{N,ng,6}$ and $2N\delta I_{N,ng,8}$ for $N \approx N_{test}$, together with the corresponding theoretical cumulative Chi-Squared fits, respectively $\chi^2_1$, $\chi^2_5$, $\chi^2_{14}$ and $\chi^2_{27}$. The agreement is quite good, with a slight deficit in the theoretical number of degrees of freedom, possibly due to uncontrolled aspects (e.g., the numerical implementation of the ME algorithm and bound effects) introducing extra randomness. In fact, the theoretical prediction of the MinMI bias results from two matrices, theoretically equal, which are issued from extraordinarily complicated outputs (the MinMI covariance matrix and the covariance matrix of estimators under fixed marginals); the theoretical result depends on the matching of a huge number of algorithmic details. The results lend good support to the presented Theorems and to the hypotheses underlying the analytical and semi-analytical approaches. The slightly higher MinMI bias relative to the theoretical one is due to a small difference between the data PDF and the ME-PDF.
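The $\chi^2_1$ case of this comparison can be sketched without any ME machinery, since under H0 and the quadratic approximation (40) with $c_g = 0$ the statistic $2N\delta I_{N,g}$ reduces to $N\,c_{g,N}^2$; the ensemble size and seed below are illustrative:

```python
import math
import random

def chi2_1_cdf(x):
    """CDF of chi-squared with 1 dof: erf(sqrt(x/2))."""
    return math.erf(math.sqrt(0.5 * x))

def max_cdf_deviation(N=100, n_rea=5000, seed=2):
    """Kolmogorov-style distance between the empirical CDF of 2N*delta_I_{N,g}
    under H0 (independent standard Gaussians) and the chi2_1 CDF."""
    rng = random.Random(seed)
    vals = []
    for _ in range(n_rea):
        c_N = sum(rng.gauss(0, 1) * rng.gauss(0, 1) for _ in range(N)) / N
        vals.append(N * c_N * c_N)  # = 2N * (0.5 * c_N^2)
    vals.sort()
    return max(abs((i + 1) / n_rea - chi2_1_cdf(v)) for i, v in enumerate(vals))

dev = max_cdf_deviation()
```

The small residual distance combines Monte-Carlo sampling noise with the finite-N convergence of $\sqrt{N}\,c_{g,N}$ to a standard Gaussian, analogous to the slight fit deficits discussed for Figure 4.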
Figure 4. Monte-Carlo empirical cumulative histogram (solid lines) and theoretical cumulative Chi-Squared fit (dashed lines) normalized by N: $2 N δ I N , g$ ($χ 1 2$) for $N = 50$ (black curves); $2 N δ I N , n g , j = 4$ ($χ 5 2$) for $N = 400$ (red curves); $2 N δ I N , n g , 6$ ($χ 14 2$) for $N = 1600$ (green curves) and $2 N δ I N , n g , 8$ ($χ 27 2$) for $N = 3200$ (blue curves).

5. MI Estimation from Under-Sampled Data

In this section, we present a case of MinMI estimation from under-sampled data (small N), emphasizing the effect of MI bias and its relation to PDF over-fitting. For this purpose, we consider an example from meteorology, introduced previously by the authors [8], in which X and Y are the standard Gaussian morphisms ($X, Y \sim \mathcal{N}(0,1)$) of winter (December to February) monthly means, respectively of the North Atlantic Oscillation (NAO) index (X), a quite useful planetary-scale atmospheric index [41], and of the amount of rainfall in Greenland (Y). The paper [8] has shown the existence of statistically significant nonlinear correlations between X and Y, i.e., non-Gaussian MI. The data come from the NCEP/NCAR meteorological reanalysis [42] for the period 1951–2003, leading to time series of length 159, from which we have estimated a number N ~ 100 of iid data (temporal degrees of freedom) after discarding the effect of temporal auto-correlation.
Figure 5a–d present the scatter-plot of the $(X, Y)$ pairs along with the contours of the ME-PDF fits constrained by bivariate monomial expectations $T_j$ (13) of total order j = 2, 4, 6 and 8, respectively. There is pictorial evidence of PDF over-fitting in the cases with a high number of cross constraints (14 and 27 for j = 6, 8, respectively), shown in Figure 5c,d. In those cases, the bivariate outliers of the dataset, which lie in regions of very low probability, tend to give a polygonal character to the extreme PDF contours.
The MinMI values in nats are $I_{N,g}$ = 0.053 (0.048), $I_{N,ng,4}$ = 0.071 (0.041), $I_{N,ng,6}$ = 0.086 (~0) and $I_{N,ng,8}$ = 0.196 (~0), with unbiased values in parentheses and boldface marking cases where the null hypothesis H0 is rejected at the 5% significance level (values above the 95% error quantile). This means that the variables are significantly correlated, with an unbiased Gaussian correlation $c_g$ = −0.30 and a statistically significant, though small, non-Gaussian unbiased MI of order j = 4 of 0.041 nats, which is of the same order as the Gaussian MI. None of the remaining incremental MinMIs are significant, which corroborates the fact that the values of $I_{N,ng,6}$ and $I_{N,ng,8}$ are purely artificial.
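The asymptotic bias $n_d/(2N)$ gives a first-order debiasing of such raw estimates. The sketch below, with N = 100 effective samples, is only indicative: it reproduces the Gaussian case exactly (0.053 − 0.005 = 0.048), while for j = 4 the crude formula gives 0.046 rather than the reported 0.041, since the paper's corrections come from its full error-quantile procedure:

```python
def debias(I_N, n_d, N):
    """Subtract the asymptotic MinMI bias n_d / (2N) from the raw estimate."""
    return I_N - n_d / (2.0 * N)

N = 100
I_g_corrected = debias(0.053, 1, N)    # Gaussian MI: n_d = 1
I_ng4_corrected = debias(0.071, 5, N)  # order-4 non-Gaussian MI: n_d = 5 (crude)
```

The rapid growth of $n_d$ with j (5, 14, 27 for j = 4, 6, 8) is exactly why the raw j = 6, 8 values above are dominated by bias at N ~ 100.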
Figure 5. Scatter-plot of the Gaussianized variables X (in abscissas) Y (in ordinates) (see text for details) along with ME-PDF fitting constrained by monomial bivariate moments up to order j = 2 (a), j = 4 (b), j = 6 (c) and j = 8 (d). Contour levels are set to 0.0005, 0.005, 0.05, 0.5, and 5.
Figure 5. Scatter-plot of the Gaussianized variables X (in abscissas) Y (in ordinates) (see text for details) along with ME-PDF fitting constrained by monomial bivariate moments up to order j = 2 (a), j = 4 (b), j = 6 (c) and j = 8 (d). Contour levels are set to 0.0005, 0.005, 0.05, 0.5, and 5.

6. Discussion and Conclusions

This paper presents theoretical formulas for the statistics (bias, variance, distribution) of estimation errors of information-theoretic measures. This is quite relevant because finite samples can exhibit artificial statistical structure, leading to negatively biased estimates of entropy and positively biased estimates of mutual information. Using Monte-Carlo experiments, we empirically validate results on the asymptotic distribution of estimation errors of the minimum Mutual Information (MinMI) between two random variables X, Y.
The MinMI is the least committed MI compatible with prescribed marginal distributions of X and Y and a set $T_{cr}$ of $m_{cr}$ expectations of cross X,Y joint functions $T_{cr}(X, Y)$, filling a vector $\theta_{cr} = E(T_{cr})$; the MinMI is written in terms of Shannon entropies (H) as $I_{\min}(X, Y) = H(X) + H(Y) - H_{\max}(X, Y)$, where $H_{\max}$ is the maximum entropy (ME) constrained by the marginals and the cross mean constraints. The MinMI is a lower MI bound, converging to the total MI when the set $T_{cr}$ converges to the sufficient joint statistics. Sampling errors of $\theta_{cr}$ from N-sized samples, say $\Delta\theta_{N,cr} = \theta_{N,cr} - \theta_{cr}$, lead to MinMI errors. In order to compute the MinMI, the marginal PDFs of finite samples must be preset by morphisms, setting the single X and Y values to fixed quantiles. This reduces the sampling randomness to the covariate sampling, in the form of random permutations in the bivariate trials (X, Y). The estimator variance $\mathrm{var}(\Delta\theta_{N,cr})$ then scales as $N^{-1}$, being lower than the value $N^{-1}\mathrm{var}(T_{cr})$ valid for random iid marginal trials. In order to obtain a given MinMI relative error $r_I = \Delta I_{\min}/I_{\min}$, one must choose $N \ge E((\lambda_{cr}^T T'_{cr})^2)/(I_{\min} r_I)^2 \approx O(m_{cr})/(I_{\min} r_I)^2$, using the Lagrange multipliers associated with the cross moments and the perturbations $T'_{cr}$.
The detailed analysis of $\Delta\theta_{N,cr}$ has shown that $\mathrm{var}(\Delta\theta_{N,cr})$ under variable morphisms is given by $N^{-1}\mathrm{var}(T_{cr} \mid E(T_{cr}|X), E(T_{cr}|Y))$, the mean squared residual of the best linear fit of $T_{cr}$ using the conditional means $E(T_{cr}|X)$ and $E(T_{cr}|Y)$ as predictors. This is supported by several examples using a Monte-Carlo methodology. We have shown that $\mathrm{var}(\Delta\theta_{N,cr})$ is closely related to the maximum entropy solution constrained by $T_{cr}$ and the marginal distributions, i.e., the MinMI solution constrained by the cross constraints $\theta_{cr} = E(T_{cr})$.
The MinMI errors are readily obtained from the second-order Taylor expansion of the MinMI in terms of $\Delta\theta_{N,cr}$. Asymptotically, $\Delta\theta_{N,cr}$ is multivariate Gaussian thanks to the Central Limit Theorem. The MinMI bias is positive, given by the mean of a positive quadratic form of Gaussians. When the data come from the same distribution as the MinMI solution, the bias is simply $m_{cr}/(2N)$; however, it can increase or decrease when the data come from a more leptokurtic or platykurtic distribution, respectively. That expression for the bias follows from the fact that the Hessian matrix of the MinMI with respect to the vector of cross constraints θ is the inverse of the covariance matrix of the cross functions $T_{cr}$ conditioned on knowledge of the marginal PDFs, i.e., the matrix of mean squared residuals of the best linear fit of $T_{cr}$ using the predictors $E(T_{cr}|X)$, $E(T_{cr}|Y)$, evaluated at the MinMI-PDF.
We have further introduced the incremental MinMI, given by the difference $H_{\max 1} - H_{\max 2}$ between two MEs forced by cross constraint sets $T_{cr1} \subseteq T_{cr2}$. Under the null hypothesis $H_{\max 1} = H_{\max 2}$, the incremental MinMI provides a statistical test for the existence of statistically significant MI explained by the cross expectations in the set difference $T_{cr2}/T_{cr1}$. This test is distributed as $\tfrac{1}{2N}\chi^2_{(m_{cr2} - m_{cr1})}$, where $m_{cr2}, m_{cr1}$ are the numbers of cross constraints in $T_{cr2}$ and $T_{cr1}$, respectively. In order to obtain a test with relative error $r_I = \Delta I_{\min}/I_{\min}$, one must choose $N \ge ((m_{cr2} - m_{cr1})/2)^{1/2}/(I_{\min} r_I)$.
By setting X and Y to standard Gaussians through Gaussian morphisms and taking the single constraint product $T_{cr} = XY$, we have evaluated the MI portion explained by joint Gaussianity, the Gaussian MI. By adding further bivariate monomials as constraints, we define the non-Gaussian MI, attributed to joint non-Gaussianity. Under the null hypothesis of vanishing non-Gaussian MI, the corresponding statistic tests the existence of statistically significant MI explained by nonlinear correlations, beyond the scope of the Pearson correlation. This is an information-theoretic significance test of non-Gaussianity, complementing others based on multivariate cumulants.
Finally, we have evaluated the Gaussian and non-Gaussian MIs for real under-sampled data, illustrating the relationship between MI bias, probability density over-fitting and data outliers. Some questions remain for future work, namely the implementation of fast algorithms for computing the non-Gaussian MI and its generalization to more than two random variables.

Acknowledgments

This research was supported by the ERC advanced grant “Flood Change”, project No. 291152 and also the Projects PTDC/GEO-MT/3476/2012 and PEST-OE/CTE/LA0019/2011-FCT, funded by the Portuguese Foundation for Science and Technology (FCT). Thanks are due to three anonymous referees, to J. Macke and Susana Barbosa for some discussions and also our families for the omnipresent support.

References

1. Shannon, C.E. The mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef]
2. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons, Inc.: New York, NY, USA, 1991. [Google Scholar]
3. Averbeck, B.B.; Latham, P.E.; Pouget, A. Neural correlations, population coding and computation. Nat. Rev. Neurosci. 2006, 7, 358–366. [Google Scholar] [CrossRef] [PubMed]
4. Goldie, C.M.; Pinch, R.G.E. Communication Theory. In London Mathematical Society Student Texts (No. 20); Cambridge University Press: Cambridge, UK, 1991. [Google Scholar]
5. Sims, C.A. Rational Inattention: Beyond the Linear-Quadratic Case. Am. Econ. Rev. 2006, 96, 158–163. [Google Scholar] [CrossRef]
6. Sherwin, W.E. Entropy and Information Approaches to Genetic Diversity and its Expression: Genomic Geography. Entropy 2010, 12, 1765–1798. [Google Scholar] [CrossRef]
7. Pothos, E.M.; Juola, P. Characterizing linguistic structure with mutual information. Br. J. Psychol. 2007, 98, 291–304. [Google Scholar] [CrossRef] [PubMed]
8. Pires, C.A.; Perdigão, R.A.P. Non-Gaussianity and asymmetry of the winter monthly precipitation estimation from the NAO. Mon. Wea. Rev. 2007, 135, 430–448. [Google Scholar] [CrossRef]
9. Globerson, A.; Tishby, N. The minimum information principle for discriminative learning. In Proceedings of the 20th conference on Uncertainty in artificial intelligence, Banff, Canada, 7–11 July 2004; pp. 193–200.
10. Globerson, A.; Stark, E.; Vaadia, E.; Tishby, N. The minimum information principle and its application to neural code analysis. Proc. Natl. Acad. Sci. USA 2009, 106, 3490–3495. [Google Scholar] [CrossRef] [PubMed]
11. Foster, D.V.; Grassberger, P. Lower bounds on mutual information. Phys. Rev. E 2011, 83, 010101(R):1–010101(R):4. [Google Scholar] [CrossRef]
12. Pires, C.A.; Perdigão, R.A.P. Minimum Mutual Information and Non-Gaussianity Through the Maximum Entropy Method: Theory and Properties. Entropy 2012, 14, 1103–1126. [Google Scholar] [CrossRef]
13. Walters-Williams, J.; Li, Y. Estimation of mutual information: A survey. Lect. Notes Comput. Sci. 2009, 5589, 389–396. [Google Scholar]
14. Khan, S.; Bandyopadhyay, S.; Ganguly, A.R.; Saigal, S.; Erickson, D.J.; Protopopescu, V.; Ostrouchov, G. Relative performance of mutual information estimation methods for quantifying the dependence among short and noisy data. Phys. Rev. E 2007, 76, 026209:1–026209:15. [Google Scholar] [CrossRef]
15. Paninski, L. Estimation of entropy and mutual information. Neural Comput. 2003, 15, 1191–1254. [Google Scholar] [CrossRef]
16. Panzeri, S.; Treves, A. Analytical estimates of limited sampling biases in different information measures. Comp. Neur. Syst. 1996, 7, 87–107. [Google Scholar] [CrossRef]
17. Victor, J.D. Asymptotic Bias in Information Estimates and the Exponential (Bell) Polynomials. Neur. Comput. 2000, 12, 2797–2804. [Google Scholar] [CrossRef]
18. Panzeri, S.; Senatore, R.; Montemurro, M.A.; Petersen, R.S. Train Information Measures Correcting for the Sampling Bias Problem in Spike Information Measures. J. Neurophysiol. 2007, 98, 1064–1072. [Google Scholar] [CrossRef] [PubMed]
19. Strong, S.P.; Koberle, R.; de Ruyter van Steveninck, R.; Bialek, W. Entropy and information in neural spike trains. Phys. Rev. Lett. 1998, 86, 197–200. [Google Scholar] [CrossRef]
20. Miller, G. Note on the bias of information estimates. In Information Theory in Psycholog; Quastler, H., Ed.; II-B Free Press: Glencoe, IL, USA, 1955; pp. 95–100. [Google Scholar]
21. Grassberger, P. Entropy Estimates from Insufficient Samplings. 2008; arXiv:physics/0307138v2.pdf. [Google Scholar]
22. Bonachela, J.A.; Hinrichsen, H.; Muñoz, M.A. Entropy estimates of small data sets. J. Phys. A 2008, 41, 202001. [Google Scholar] [CrossRef]
23. Nelsen, R.B. An Introduction to Copulas; Springer: New York, NY, USA, 1999; ISBN 0-387-98623-5. [Google Scholar]
24. Calsaverini, R.S.; Vicente, R. An information-theoretic approach to statistical dependence: Copula information. Europhys. Lett. 2009, 88, 68003. [Google Scholar] [CrossRef]
25. Ma, J.; Sun, Z. Mutual information is copula entropy. 2008; arXiv:0808.0845v1. [Google Scholar]
26. Macke, J.H.; Murray, I.; Latham, P.E. How biased are maximum entropy models? Adv. Neur. Inf. Proc. Syst. 2011, 24, 2034–2042. [Google Scholar]
27. Hutter, M.; Zaffalon, M. Distribution of mutual information from complete and incomplete data. Comput. Stat. Data An. 2005, 48, 633–657. [Google Scholar] [CrossRef]
28. Jaynes, E.T. On the Rationale of Maximum-entropy methods. P. IEEE 1982, 70, 939–952. [Google Scholar] [CrossRef]
29. Shore, J.E.; Johnson, R.W. Axiomatic derivation of the principle of maximum entropy and the principle of the minimum cross-entropy. IEEE Trans. Inform. Theor. 1980, 26, 26–37. [Google Scholar] [CrossRef]
30. Ebrahimi, N.; Soofi, E.S.; Soyer, R. Information Measures in Perspective. Int. Stat. Rev. 2010, 78, 383–412. [Google Scholar] [CrossRef]
31. Wackernagel, H. Multivariate Geostatistics—An Introduction with Applications; Springer Verlag: Berlin, Germany, 1995. [Google Scholar]
32. Charpentier, A.; Fermanian, J.D. Copulas: From Theory to Application in Finance; Rank, J., Ed.; Risk Publications: London, UK, 2007; Section 2. [Google Scholar]
33. Tam, S.M. On Covariance in Finite Population Sampling. J. Roy. Stat. Soc. D-Sta. 1985, 34, 429–433. [Google Scholar] [CrossRef]
34. Van det Vaart, A.W. Asymptotic statistics. Cambridge University Press: New York, NY, USA, 1998; ISBN ISBN 978–0-521–49603–2, LCCN. V22 1998 QA276. V22. [Google Scholar]
35. Rockinger, M.; Jondeau, E. Entropy densities with an application to autoregressive conditional skewness and kurtosis. J. Econometrics 2002, 106, 119–142. [Google Scholar] [CrossRef]
36. Bates, D. Quadratic Forms of Random Variables. STAT 849 lectures. Available online: http://www.stat.wisc.edu/~st849–1/lectures/Ch02.pdf (accessed on 22 February 2013).
37. Goebel, B.; Dawy, Z.; Hagenauer, J.; Mueller, J.C. An approximation to the distribution of finite sample size mutual information estimates. 2005. In Proceedings of IEEE International Conference on Communications (ICC’ 05), Seoul, Korea, 16–20 May 2005; pp. 1102–1106.
38. Fisher, R.A. On the “probable error” of a coefficient of correlation deduced from a small sample. Metron 1921, 1, 3–32. [Google Scholar]
39. Zientek, L.R.; Thompson, B. Applying the bootstrap to the multivariate case: bootstrap component/factor analysis. Behav. Res. Methods 2007, 39, 318–325. [Google Scholar] [CrossRef] [PubMed]
40. Mardia, K.V. Algorithm AS 84: Measures of multivariate skewness and kurtosis. Appl. Stat. 1975, 24, 262–265. [Google Scholar] [CrossRef]
41. Hurrell, J.W.; Kushnir, Y.; Visbeck, M. The North Atlantic Oscillation. Science 2001, 291, 603–605. [Google Scholar]
42. The NCEP/NCAR Reanalysis Project. Available online: http://www.esrl.noaa.gov/psd/data/reanalysis/reanalysis.shtml/ (accessed on 22 February 2013).

Appendix 1

Proof of Equations 1 and 2
We seek a PDF $\rho_{XY}(X,Y)$ satisfying: (1) the discrete constraints $\iint_S \mathbf{T}_{cr}(X,Y)\,\rho_{XY}(X,Y)\,dXdY = \boldsymbol{\theta}_{cr}$, corresponding to the vector $\boldsymbol{\eta}_{cr}$ of Lagrange multipliers, and (2) the continuum of constraints $\iint_S \delta(X-u)\,\rho_{XY}(X,Y)\,dXdY = \rho_X(u)$ and $\iint_S \delta(Y-v)\,\rho_{XY}(X,Y)\,dXdY = \rho_Y(v)$, corresponding to the continuum of Lagrange multipliers $\lambda_X(u),\lambda_Y(v)$, $u \in S_X$, $v \in S_Y$, where the integrals of $\rho_X$ and $\rho_Y$ both equal one. The Lagrangian functional of the entropy is therefore
$L(\boldsymbol{\eta}_{cr},\lambda_X,\lambda_Y) = \iint_S \left[ -\log \rho_{XY}(X,Y) + \lambda_X(X) + \lambda_Y(Y) + \boldsymbol{\eta}_{cr}^T \mathbf{T}_{cr}(X,Y) \right] \rho_{XY}(X,Y)\,dXdY - \int_{S_X} \rho_X(X)\lambda_X(X)\,dX - \int_{S_Y} \rho_Y(Y)\lambda_Y(Y)\,dY - \boldsymbol{\eta}_{cr}^T\boldsymbol{\theta}_{cr}$
The maximum entropy is obtained by taking the differential $\delta L$ of $L$ in terms of $\delta\lambda_X(X), \delta\lambda_Y(Y), \delta\boldsymbol{\eta}_{cr}$ and setting the gradient components to zero, which yields the PDF $\rho_{XY}(X,Y) = \exp\left[-1 + \boldsymbol{\eta}_{cr}^T \mathbf{T}_{cr}(X,Y) + \lambda_X(X) + \lambda_Y(Y)\right]$. Introducing the partition functions $Z_X(X,\boldsymbol{\eta}_{cr}) \equiv \exp[-\lambda_X(X)]$ and $Z_Y(Y,\boldsymbol{\eta}_{cr}) \equiv \exp[-\lambda_Y(Y)]$ and imposing the marginal PDF constraints leads directly to the expressions (2), where the continuum of Lagrange multipliers depends implicitly on the discrete ones $\boldsymbol{\eta}_{cr}$. Substituting this back into $L$ yields the concave function $L(\boldsymbol{\eta}_{cr})$ of (1), with its global minimum at $\boldsymbol{\eta}_{cr} = \boldsymbol{\lambda}_{cr}$. At that minimum the MinMI-PDF (2) is $\rho_{XY}(X,Y) = \rho^*_{XY}(X,Y)$.
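On a discretized state space, the fixed-point structure just derived (a cross-constraint tilt $\exp[\boldsymbol{\eta}_{cr}^T \mathbf{T}_{cr}]$ rescaled by marginal factors) can be realized with an iterative-scaling inner loop for the marginal multipliers and a one-dimensional search for the discrete multiplier. The following sketch is illustrative only: the single constraint $T_{cr}=XY$, the grid and the helper name `minmi_pdf` are our assumptions, not the paper's construction.

```python
import numpy as np

def minmi_pdf(px, py, x, y, theta_xy, eta_lo=-10.0, eta_hi=10.0, tol=1e-10):
    """Discretized MinMI/ME joint PMF of the form a(x) * exp(eta*x*y) * b(y),
    matching fixed marginals px, py and the cross moment E[XY] = theta_xy.
    Inner loop: iterative (Sinkhorn-type) scaling enforcing the marginals,
    i.e., the continuum of Lagrange multipliers; outer loop: bisection on
    the single discrete multiplier eta (E[XY] is increasing in eta)."""
    def joint(eta):
        K = np.exp(eta * np.outer(x, y))     # tilt by the cross constraint
        a = np.ones_like(px)
        for _ in range(500):                 # enforce both marginals
            b = py / (K.T @ a)
            a = px / (K @ b)
        return a[:, None] * K * b[None, :]

    def cross_moment(eta):
        return np.sum(joint(eta) * np.outer(x, y))

    lo, hi = eta_lo, eta_hi
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cross_moment(mid) < theta_xy:
            lo = mid
        else:
            hi = mid
    return joint(0.5 * (lo + hi))
```

For instance, with symmetric marginals on $\{-1,0,1\}$ and a target cross moment $E[XY]=0.2$, the returned joint PMF reproduces the prescribed marginals and cross moment to numerical precision.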
Proof of Equations 3, 4, 5 and Theorem 1
At the ME-PDF solution, the functional $L$ of the MinMI solution is an implicit function of the constraining means $\boldsymbol{\theta}_{cr}$, and its differential satisfies $\delta L = \delta H = -\delta I$. Expanding it in terms of $\delta\lambda_X(X), \delta\lambda_Y(Y), \delta\boldsymbol{\lambda}_{cr}, \delta\boldsymbol{\theta}_{cr}$ and using $\int_{S_Y} \rho^*_{XY}(X,Y)\,dY = \rho_X(X)$, $\int_{S_X} \rho^*_{XY}(X,Y)\,dX = \rho_Y(Y)$ and $\iint_S \mathbf{T}_{cr}(X,Y)\,\rho^*_{XY}(X,Y)\,dXdY = \boldsymbol{\theta}_{cr}$, one gets $\delta I(\boldsymbol{\theta}_{cr}) = -\delta L = \boldsymbol{\lambda}_{cr}^T \delta\boldsymbol{\theta}_{cr}$, thus showing that the gradient of $I(\boldsymbol{\theta}_{cr})$ with respect to $\boldsymbol{\theta}_{cr}$ is $\boldsymbol{\lambda}_{cr}$.
Regarding the Hessian of $I(\boldsymbol{\theta}_{cr})$, we must differentiate $\boldsymbol{\lambda}_{cr}$ with respect to $\boldsymbol{\theta}_{cr}$, using the same technique as for ME problems with a finite number of constraints.
Therefore, as postulated in Section 2.2, let us consider a finite sequence of constraint sets $\{\mathbf{T}_j, \boldsymbol{\theta}_j\}$ whose ME-PDFs converge to the MinMI solution as $j \to \infty$. Then the differentials of the expectations $\delta\boldsymbol{\theta}_j$ and of the Lagrange multipliers $\delta\boldsymbol{\lambda}_j$ are related through $\delta\boldsymbol{\theta}_j = \mathbf{C}_{*j}\,\delta\boldsymbol{\lambda}_j$, where $\mathbf{C}_{*j}$ is the covariance matrix of the constraining functions $\mathbf{T}_j$ at the ME-PDF solution (denoted with *), i.e., $\mathbf{C}_{*j} = E_*(\mathbf{T}'_j \mathbf{T}'^T_j)$ with perturbations $\mathbf{T}'_j = \mathbf{T}_j - \boldsymbol{\theta}_j$. Inverting that relationship gives $\delta\boldsymbol{\lambda}_j = \mathbf{C}_{*j}^{-1}\,\delta\boldsymbol{\theta}_j$. In the MinMI case, the constraining functions have a discrete part ($\mathbf{T}_{cr}$) and a continuous part (the Dirac deltas), merged into the whole vector $\mathbf{T}_{cr,\rho} = (\mathbf{T}_{cr}(X,Y), \delta(X-u), \delta(Y-v))^T$, corresponding to the whole vector of expectations $\boldsymbol{\theta}_{cr,\rho} = (\boldsymbol{\theta}_{cr}, \rho_X(u), \rho_Y(v))^T$ and to the whole vector of Lagrange multipliers $\boldsymbol{\lambda}_{cr,\rho} = (\boldsymbol{\lambda}_{cr}, \lambda_X(u), \lambda_Y(v))^T$. As in the discrete case, the differentials are related by $\delta\boldsymbol{\theta}_{cr,\rho} = E_*(\mathbf{T}'_{cr,\rho}\mathbf{T}'^T_{cr,\rho})\,\delta\boldsymbol{\lambda}_{cr,\rho} = \mathbf{C}_{cr,\rho}\,\delta\boldsymbol{\lambda}_{cr,\rho}$, where the covariance matrix is now replaced by an operator (a continuous matrix) along $u$, $v$ and the discrete index of $\boldsymbol{\theta}_{cr}$. The multiplication of this continuous matrix by the continuous vector $\delta\boldsymbol{\lambda}_{cr,\rho}$ is the sum of an integral in $u$, an integral in $v$ and a discrete sum. The inverse relationship is $\delta\boldsymbol{\lambda}_{cr,\rho} = [\mathbf{C}_{cr,\rho}]^{-1}\,\delta\boldsymbol{\theta}_{cr,\rho}$, where $[\mathbf{C}_{cr,\rho}]^{-1}$ is the inverse operator of $\mathbf{C}_{cr,\rho}$, i.e., the product $[\mathbf{C}_{cr,\rho}]^{-1}\mathbf{C}_{cr,\rho} = \mathbf{C}_{cr,\rho}[\mathbf{C}_{cr,\rho}]^{-1}$ equals the identity operator. Therefore, fixing the marginal PDFs in the MinMI problem leaves variations on the cross expectations alone: $\delta\boldsymbol{\theta}_{cr,\rho} = \mathbf{P}_{cr}\,\delta\boldsymbol{\theta}_{cr,\rho} = \delta\boldsymbol{\theta}_{cr}$, where $\mathbf{P}_{cr}$ is the projection operator onto the discrete part.
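The key relation $\delta\boldsymbol{\theta}_j = \mathbf{C}_{*j}\,\delta\boldsymbol{\lambda}_j$ — the Jacobian of the constraint expectations with respect to the Lagrange multipliers equals the constraint covariance at the ME-PDF — can be checked numerically for a discrete exponential family. The state set, constraint functions and multiplier values below are illustrative assumptions:

```python
import numpy as np

def moments_and_cov(T, lam):
    """For the discrete exponential family p(s) ∝ exp(lam . T(s)) on a finite
    state set, return the expectations theta = E[T] and the covariance C*."""
    w = np.exp(T @ lam)            # unnormalized weights, one per state
    p = w / w.sum()                # ME-type PMF
    theta = T.T @ p                # constraint expectations
    Tc = T - theta                 # centered (perturbation) functions T'
    C = Tc.T @ (p[:, None] * Tc)   # covariance matrix C* = E[T' T'^T]
    return theta, C

rng = np.random.default_rng(0)
T = rng.normal(size=(50, 3))       # 50 states, 3 constraint functions
lam = np.array([0.3, -0.2, 0.1])
theta0, C = moments_and_cov(T, lam)

# Finite-difference Jacobian d theta / d lam: should match C.
eps = 1e-6
J = np.empty((3, 3))
for k in range(3):
    d = np.zeros(3)
    d[k] = eps
    J[:, k] = (moments_and_cov(T, lam + d)[0] - theta0) / eps
```

The agreement of `J` with `C` (to finite-difference accuracy) is exactly the statement $\delta\boldsymbol{\theta} = \mathbf{C}_*\,\delta\boldsymbol{\lambda}$.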
Therefore, since $\delta I = \boldsymbol{\lambda}_{cr}^T\,\delta\boldsymbol{\theta}_{cr}$, the second MI variation is $\delta^2 I = \frac{1}{2}\,\delta\boldsymbol{\theta}_{cr}^T\,\delta\boldsymbol{\lambda}_{cr} = \frac{1}{2}\,\delta\boldsymbol{\theta}_{cr}^T \left[ \mathbf{P}_{cr}^T [\mathbf{C}_{cr,\rho}]^{-1} \mathbf{P}_{cr} \right] \delta\boldsymbol{\theta}_{cr}$, yielding the matrix identity $\mathbf{P}_{cr}^T [\mathbf{C}_{cr,\rho}]^{-1} \mathbf{P}_{cr} = \mathbf{C}_{cr,\rho_X,\rho_Y}^{-1}$ appearing in (3). The discrete matrix $\mathbf{C}_{cr,\rho_X,\rho_Y}$ is positive definite and differs from $\mathbf{P}_{cr}^T [\mathbf{C}_{cr,\rho}] \mathbf{P}_{cr}$, which is simply the covariance matrix of the functions $\mathbf{T}_{cr}$ at the MinMI-PDF. Its direct computation is quite difficult in practice, involving the composition (continuous product) of the operators $[\mathbf{C}_{cr,\rho}]^{-1}$ and $\mathbf{P}_{cr}$.
Since the ME-PDF for $\{\mathbf{T}_j, \boldsymbol{\theta}_j\}$ converges to the MinMI-PDF, the same holds for the covariance matrix conditioned on the marginal PDFs. Therefore, one has Equation (10) at step $j$:
$\left( \mathbf{P}_{cr} \mathbf{C}_{*j}^{-1} \mathbf{P}_{cr} \right)^{-1} = \left( \mathbf{P}_{cr} \mathbf{C}_{*j} \mathbf{P}_{cr} \right) - \left( \mathbf{P}_{cr} \mathbf{C}_{*j} \mathbf{P}_{ind} \right) \left( \mathbf{P}_{ind} \mathbf{C}_{*j} \mathbf{P}_{ind} \right)^{-1} \left( \mathbf{P}_{ind} \mathbf{C}_{*j} \mathbf{P}_{cr} \right) = E_* \left[ \mathbf{T}'^{\,ind}_{cr,j} \, \mathbf{T}'^{\,ind\,T}_{cr,j} \right] \underset{j \to \infty}{\longrightarrow} \mathbf{C}_{cr,\rho_X,\rho_Y}$
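The first equality above is the standard block-inverse (Schur complement) identity for a partitioned covariance matrix: the inverse of the cross-block of $\mathbf{C}_{*j}^{-1}$ equals the cross-block of $\mathbf{C}_{*j}$ minus its regression on the marginal ("ind") block. A minimal numerical check, with an arbitrary positive definite matrix standing in for $\mathbf{C}_{*j}$:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(6, 6))
C = A @ A.T + 6 * np.eye(6)   # positive definite stand-in for C*j
cr = slice(0, 2)              # "cross" block (T_cr functions)
ind = slice(2, 6)             # "independent marginal" block (T_ind functions)

# Left side: invert C, restrict to the cr block, invert back.
lhs = np.linalg.inv(np.linalg.inv(C)[cr, cr])

# Right side: Schur complement of the ind block in C.
rhs = C[cr, cr] - C[cr, ind] @ np.linalg.inv(C[ind, ind]) @ C[ind, cr]

assert np.allclose(lhs, rhs)
```

The Schur complement `rhs` is precisely the covariance of the residuals of the cross functions after removing their best linear fit on the marginal functions, which is what the middle expression $E_*[\mathbf{T}'^{\,ind}_{cr,j}\mathbf{T}'^{\,ind\,T}_{cr,j}]$ states.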
The matrix $\mathbf{C}_{cr,\rho_X,\rho_Y}$ can thus be obtained as the limit of ME covariance matrices in which progressively more independent moments of the marginal variables $X$ and $Y$ are added as constraints. In the limit, the perturbations $\mathbf{T}'^{\,ind}_{cr,j} = \mathbf{T}'_{cr,j} - \boldsymbol{\alpha}_j^T \mathbf{T}'_{ind,j}$ must converge to the perturbations $\mathbf{T}_* = \mathbf{T}_{cr} - E^*_{\rho_{X,Y}}(\mathbf{T}_{cr}|\rho_X,\rho_Y)$ appearing in (4). They are the residuals of the best fit of $\mathbf{T}'_{cr}$ on marginal functions of $X$ and $Y$, i.e., $\mathbf{T}_*(X,Y) = \mathbf{T}'_{cr}(X,Y) - [\boldsymbol{\beta}_X(X) + \boldsymbol{\beta}_Y(Y)]$, where $\boldsymbol{\beta}_X(X) + \boldsymbol{\beta}_Y(Y)$ is a sum of marginal functions. The minimum of the total mean square of the residuals, $\iint_S \rho^*_{XY}(X,Y)\,\|\mathbf{T}_*\|^2\,dXdY = E_*(\|\mathbf{T}_*\|^2)$, is obtained through variational analysis, by taking small variations $\delta\boldsymbol{\beta}_X(X), \delta\boldsymbol{\beta}_Y(Y)$ and setting the gradients to zero. We get the solution
$\mathbf{T}_*(X,Y) = \mathbf{T}'_{cr}(X,Y) - \left[ \boldsymbol{\alpha}_X E(\mathbf{T}'_{cr}|X) + \boldsymbol{\alpha}_Y E(\mathbf{T}'_{cr}|Y) \right]$
where the fitting is done on conditional means and $\boldsymbol{\alpha}_X, \boldsymbol{\alpha}_Y$ are the best linear-fit coefficients for each function in $\mathbf{T}'_{cr}(X,Y)$. This completes the proof of (5) and Theorem 1. The Taylor expansion (3) follows by taking $\Delta I(\boldsymbol{\theta}_{cr},\rho_X,\rho_Y) = \delta I + \delta^2 I + O(\|\Delta\boldsymbol{\theta}_{cr}\|^3)$.
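This residual construction can be illustrated on a small discrete joint PMF (all numbers below are hypothetical, and the single constraint $T_{cr}=XY$ is our choice): fitting the centered constraint on the two conditional-mean regressors leaves a residual that is orthogonal, under the joint PMF, to both marginal regressors, which is the defining property of the best fit.

```python
import numpy as np

# Hypothetical joint PMF on a 3x2 grid.
x = np.array([-1.0, 0.0, 1.0])
y = np.array([-1.0, 1.0])
p = np.array([[0.10, 0.15],
              [0.20, 0.10],
              [0.15, 0.30]])        # sums to 1

T = np.outer(x, y)                  # one cross constraint: T(X,Y) = X*Y
Tc = T - np.sum(p * T)              # centered constraint T'

px = p.sum(axis=1)                  # marginal of X
py = p.sum(axis=0)                  # marginal of Y
Ex = (p * Tc).sum(axis=1) / px      # conditional mean E(T'|X=x)
Ey = (p * Tc).sum(axis=0) / py      # conditional mean E(T'|Y=y)

# Best fit a_X * E(T'|X) + a_Y * E(T'|Y) minimizing the PMF-weighted
# mean square residual (weighted normal equations).
Gx = np.broadcast_to(Ex[:, None], p.shape).ravel()
Gy = np.broadcast_to(Ey[None, :], p.shape).ravel()
G = np.stack([Gx, Gy], axis=1)
w = p.ravel()
coef = np.linalg.solve(G.T @ (w[:, None] * G), G.T @ (w * Tc.ravel()))
Tstar = Tc.ravel() - G @ coef       # the residual T*

# T* is orthogonal to both marginal regressors under the joint PMF.
assert abs(np.sum(w * Tstar * Gx)) < 1e-12
```

The orthogonality conditions are exactly the vanishing gradients of the variational problem above, and `coef` plays the role of $(\boldsymbol{\alpha}_X, \boldsymbol{\alpha}_Y)$.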