Article

Minimum Mutual Information and Non-Gaussianity through the Maximum Entropy Method: Estimation from Finite Samples

by Carlos A. L. Pires 1,* and Rui A. P. Perdigão 2

1 Instituto Dom Luiz (IDL), University of Lisbon (UL), Lisbon, P-1749-016, Portugal
2 Institute of Hydraulic Engineering and Water Resources Management, Vienna University of Technology, Vienna, A-1040, Austria
* Author to whom correspondence should be addressed.
Entropy 2013, 15(3), 721-752; https://doi.org/10.3390/e15030721
Submission received: 8 November 2012 / Revised: 15 February 2013 / Accepted: 19 February 2013 / Published: 25 February 2013
(This article belongs to the Special Issue Estimating Information-Theoretic Quantities from Data)

Abstract:
The Minimum Mutual Information (MinMI) Principle provides the least committed, maximum-joint-entropy (ME) inferential law that is compatible with prescribed marginal distributions and empirical cross constraints. Here, we estimate MI bounds (the MinMI values) generated by constraining sets $T_{cr}$ comprising $m_{cr}$ linear and/or nonlinear joint expectations, computed from samples of $N$ iid outcomes. Marginals (and their entropy) are imposed by single morphisms of the original random variables. $N$-asymptotic formulas are given both for the distribution of the cross expectations' estimation errors and for the MinMI estimation bias, its variance and its distribution. A growing $T_{cr}$ leads to an increasing MinMI, eventually converging to the total MI. Under $N$-sized samples, the MinMI increment relative to two encapsulated sets $T_{cr1} \subset T_{cr2}$ (with numbers of constraints $m_{cr1} < m_{cr2}$) is the test difference $\delta H = H_{\max}^{1,N} - H_{\max}^{2,N} \geq 0$ between the two respective estimated MEs. Asymptotically, $\delta H$ follows a scaled Chi-Squared distribution $\tfrac{1}{2N}\chi^2_{(m_{cr2}-m_{cr1})}$, whose upper quantiles determine whether the constraints in $T_{cr2}/T_{cr1}$ explain significant extra MI. As an example, we set the marginals to be normally distributed (Gaussian) and build a sequence of MI bounds associated with successive nonlinear correlations due to joint non-Gaussianity. Noting that in real-world situations the available sample sizes can be rather small, the relationship between MinMI bias, probability density over-fitting and outliers is highlighted for under-sampled data.

1. Introduction

1.1. The State of the Art

The seminal work of Shannon on Information Theory [1] gave rise to the concept of Mutual Information (MI) [2] as a measure of probabilistic dependence among random variables (RVs), with a broad range of applications, including neuroscience [3], communications and engineering [4], physics, statistics, economics [5], genetics [6], linguistics [7] and geosciences [8]. MI is the positive difference between two Shannon entropies of the RVs: the one assuming statistical independence ($H_{ind}$) and the other ($H_{dep}$) considering their true dependence.
This paper addresses the problem of estimating the MI conveyed by the least committed inferential law (say the conditional probability density function, pdf, $\rho(Y|X)$ between RVs $Y, X$) that is compatible with prescribed marginal distributions and a set $T_{cr}$ of $m_{cr}$ empirical non-redundant cross constraints (e.g., a set of cross expectations between a stimulus $X$ and a response $Y$, for example in a neural cell, the Earth's climate, or an ecosystem). The constrained MI, or Minimum Mutual Information (MinMI), between the RVs $Y, X$ is $I_{\min}(X,Y) = H(X) + H(Y) - H_{\max}(X,Y) = H(Y) - H_{\max}(Y|X)$, obtained by subtracting from the sum of the fixed marginal entropies the maximum joint entropy (ME) $H_{\max}$ compatible with the imposed cross constraints. The solution comes from application of the MinMI principle [9,10]. The MinMI is an MI lower bound depending on the marginal pdfs (e.g., Gaussians, Uniforms, Gammas), as well as on the particular form of the cross expectations in $T_{cr}$ (e.g., linear and nonlinear correlations). There are only a few cases with known closed formulas for the MinMI with $m_{cr}=1$: (a) Gaussian marginals and Pearson linear correlation [8,11,12] and (b) Uniform marginals and rank linear correlation [11]. The authors presented in [12] (PP12 hereafter) a general formalism for computing, though not in explicit form, the MinMI in terms of multiple ($m_{cr} > 1$) linear and nonlinear cross expectations included in $T_{cr}$. This set can consist of a natural population constraint (e.g., a specific neural behavior) or it can grow without limit through additional expectations computed within a sample, with the MinMI increasing and eventually converging to the total MI. This paper is the natural follow-up of PP12 [12], studying now the statistics (mean or bias, variance and distribution) of the MinMI estimation errors $\Delta I_{\min,N} = -\Delta H_{\max,N} \equiv -(H_{\max,N} - H_{\max})$, where $H_{\max,N}$ is the ME estimate issued from $N$-sized samples of iid outcomes. Those errors are roughly similar to the errors of generic MI and entropy estimators (see [13,14] for a thorough review and performance comparison of MI estimators). Their mean (bias), variance and higher-order moments are written in terms of powers of $N^{-1}$, thus covering intermediate and asymptotic $N$ ranges [15], with specific applications in neurophysiology [16,17,18]. Entropy estimators range from (a) the histogram-based plug-in estimator [19], with a negative bias, or the Miller-Madow correction [20], equal to $(m-1)/(2N)$, where $m$ is the number of univariate histogram bins, to (b) much improved estimators (e.g., kernel density estimators, adaptive or non-adaptive grids, nearest neighbors) and others specially designed for small samples [21,22].

1.2. The Rationale of the Paper

The well-posedness of a MinMI $I_{\min}(X,Y)$ compatible with available cross information requires knowledge of the marginal $X$ and $Y$ PDFs, $\rho_X$ and $\rho_Y$, either imposed or inferred from sufficiently long samples. For that purpose, we can change $X$ and $Y$ into the cumulated probabilities $u(x) = \int_{-\infty}^{x}\rho_X(t)\,dt$ and $v(y) = \int_{-\infty}^{y}\rho_Y(t)\,dt$, which are uniform RVs on the interval [0,1] (i.e., copulas [23]), through appropriate smoothly growing (injective) morphisms (or anamorphoses), while leaving the MI invariant [2]. Then the MI $I(X,Y)$ becomes the negative copula entropy [24,25], given by $I(X,Y) = \int_0^1\!\int_0^1 c[u,v]\log(c[u,v])\,du\,dv$, where the copula density is $c[u,v] = \rho_{XY}(x,y)/[\rho_X(x)\rho_Y(y)]$.
The MinMI, subject to $m_{cr}$ constraints of the type $E[T_i(u,v)] = \theta_i$, $i = 1,\ldots,m_{cr}$, in copula space, is readily obtained by variational analysis (as in the ME method [2]) as $c[u,v] = \exp\!\big[-1 + \lambda_u(u) + \lambda_v(v) + \sum_{i=1}^{m_{cr}}\lambda_i T_i(u,v)\big]$, where the Lagrange multipliers $\lambda_u(u), \lambda_v(v), \lambda_i$ correspond, respectively, to the preset (not subject to sampling) continuum of constraints $\int c[u,v]\,du = \int c[u,v]\,dv = 1$ and to the $m_{cr}$ expectations (subject to sampling error). The general solution is rather tricky since all the values $\lambda_u(u), \lambda_v(v), \lambda_i$ are implicitly related. The constrained joint PDF and the inferential law are recovered from the constrained copula through the product $\rho_{XY}(x,y) = c[u,v]\,\rho_X(x)\,\rho_Y(y)$.
In PP12 [12], we generalized this problem to a less constrained MinMI version by changing the marginal RVs into ME prescribed ones, the ME-morphisms (e.g., standard Gaussians), and imposing a finite set of marginal constraints instead of the full marginal PDFs. Under these conditions, the number of controlling Lagrange multipliers is finite, leaving the possibility of using nonlinear minimization algorithms for the MinMI estimation, as already tested in [8]. The MinMI subject to a set $T_{cr}$ of $m_{cr}$ cross constraints is thus given by $H_{ind} - H_{ME,cr}$, where $H_{ME,cr}$ is the joint ME and $H_{ind}$ is the sum of the fixed (preset) single entropies. The MinMI estimator is written as $H_{ind} - H_{ME,cr,N}$, where $H_{ME,cr,N}$ is the ME constrained by the $m_{cr}$ sampling expectations obtained from $N$-sized samples. The MinMI estimation error is $H_{ME,cr} - H_{ME,cr,N}$. Therefore, as a generalization of the ME estimator bias [26], one verifies a positive MinMI bias equal to (larger/smaller than) $m_{cr}/(2N)$ when the true population PDF generating the tested sample follows (is more leptokurtic/platykurtic than) the ME-PDF. This result is supported by Monte-Carlo experiments.
Moreover, we introduce here the positive incremental MinMI, given by the difference $H_{ME,cr1} - H_{ME,cr2}$ between two MEs forced by cross constraint sets $T_{cr1} \subset T_{cr2}$, which is interpreted as the MinMI coming from the difference set $T_{cr2}/T_{cr1}$. The corresponding estimator is $H_{ME,cr1,N} - H_{ME,cr2,N}$. Both the MinMI and the incremental MinMI estimators depend essentially on the errors of the expectations estimated from finite $N$-sized samples.
In particular, under the null hypothesis $H_0$ that $H_{ME,cr1} = H_{ME,cr2}$, i.e., that $T_{cr1}, T_{cr2}$ are ME-congruent (see definition in PP12 [12]), the difference $H_{ME,cr1,N} - H_{ME,cr2,N}$ works as a significance test of $H_0$. Such tests can be used (1) for testing statistically significant MI above zero, i.e., significant RV dependence, or (2) for testing MI due to nonlinear correlations beyond the MI due to linear correlations. Another important case (verified here) is the test of MI explained by joint non-Gaussianity beyond the MI explained by joint Gaussianity, in which a Gaussian morphism (i.e., a bijective, reversible transformation of a variable into another one with a Gaussian pdf, without loss of generality) is applied to the single variables. According to the above result, the bias of $H_{ME,cr1,N} - H_{ME,cr2,N}$ under $H_0$ is $(m_{cr2} - m_{cr1})/(2N)$, i.e., the number of cross constraints in the difference set $T_{cr2}/T_{cr1}$ divided by $2N$.
We further provide asymptotic analytical $N$-scaled formulas for the variance and distribution of MinMI estimation errors as functions of the statistics of the ME cross constraint estimation errors. This is possible for $N$ high enough that expectation errors are closely governed by a multivariate Gaussian distribution, uniquely determined by their bias and covariance matrix, thanks to the multivariate Central Limit Theorem. Since marginal morphisms are performed, the single variables are set to values from a look-up table of fixed quantiles (not subject to sampling) and therefore the estimator's squared bias decreases faster than the estimator's variance as $N \to \infty$.
The correct modeling of the covariances between sampling expectation errors under morphism is crucial for the correct computation of MinMI error statistics. We have verified an overall reduction of the cross expectation errors when compared to the case where they are issued from iid realizations (no morphism performed). For instance, the variance $\mathrm{var}(E_N(T))$ of the $N$-sized sampling mean $E_N(T)$ of a cross function $T(X,Y)$ is given by $N^{-1}\mathrm{var}_N(T^*)$, where $T^*$ is the residual of the best linear fit of $T$ using the conditional means $E(T|X), E(T|Y)$ as predictors. Asymptotically, $\mathrm{var}_N(T^*) \to \mathrm{var}(T^*)$, which is the variance of $T$ conditioned on the knowledge of the marginal PDFs, computed at the joint PDF of the population. These conditional variances are exactly those coming from the MinMI solution, allowing MinMI statistics to be related to asymptotic no-replacement finite statistics under fixed marginals. The results are synthesized in the form of two theorems.
Regarding the conversion of expectation errors into ME and MinMI errors, we have used a perturbative approach, namely a second-order Taylor expansion of the ME. This allows closed analytical formulas to be obtained for the MinMI variance and its distribution in a few cases (e.g., Chi-Squared distributions), in what we hereafter call the analytical approach. In order to confirm that, expectation errors are generated as surrogates of the governing multivariate Gaussian PDF; they are then plugged into the Taylor expansion of the MinMI and, finally, the statistics (bias, variances, quantiles) are estimated from a large ensemble (the semi-analytical approach). These statistics are compared with those obtained from a Monte-Carlo experiment where the MinMI is computed ab initio from the sampling expectations (the Monte-Carlo approach). The closeness of results between the Monte-Carlo, semi-analytical and analytical approaches is assessed using several statistical tests of bivariate non-Gaussianity and RV independence. A similar exhaustive validation has already been performed for testing analytical formulas of the bias, variance, skewness and kurtosis of MI estimation errors [27].
In accordance with the above synthesis, the paper starts with this introduction, followed by the formulation of the MinMI and its estimators in Section 2. In Section 3 we present the modeling of the sample mean errors that constrain entropy, and the effect of morphisms on their statistics. Section 4 is devoted to the modeling of the errors of the MinMI, the incremental MinMI and significance tests, followed by a practical case of MI estimation with under-sampled data (Section 5) and the discussion with conclusions in Section 6. An Appendix with some proofs is also provided.

2. Minimum Mutual Information and Its Estimators

2.1. Imposing Marginal PDFs

Let us formulate the problem of finding the minimum Mutual Information (MinMI) in the simplest framework of bivariate RVs $(X,Y)$ over the Cartesian product of support sets $S = S_X \otimes S_Y \subseteq \mathbb{R}^2$. The MinMI is constrained by the imposition of the marginal PDFs $\rho_X, \rho_Y$ and of a set of cross expectations $\{T_{cr}, \theta_{cr} \equiv E(T_{cr})\}$, where $T_{cr}$ is a vector comprising $m_{cr}$ cross $X,Y$ functions and $\theta_{cr}$ is the vector of their expectations. In the space of imposed marginal PDFs, the MinMI comes uniquely as a function of $\theta_{cr}$ as $I(\theta_{cr},\rho_X,\rho_Y) = H_{\rho_X} + H_{\rho_Y} - H_{\rho^*_{X,Y}}(\theta_{cr},\rho_X,\rho_Y)$, where $H_{\rho_X} = -E[\log(\rho_X)]$ and $H_{\rho_Y} = -E[\log(\rho_Y)]$ are the preset Shannon entropies of $X$ and $Y$, respectively, and $H_{\rho^*_{X,Y}}(\theta_{cr},\rho_X,\rho_Y)$ is the ME subject to the joint constraints and the marginal PDFs, where the ME-PDF is $\rho^*_{X,Y}$. This leads to the equivalence between the computation of the MinMI and of the ME [9]. In particular, if $\rho_X, \rho_Y$ are copula marginals (uniform PDFs in [0,1]), then $H_{\rho_X} = H_{\rho_Y} = 0$ and the MinMI reduces to the negative copula entropy [24,25]. For instance, for standard Gaussians $X, Y$ and a given correlation $E(T_{cr} \equiv XY) = c_g$, the MinMI is $I(c_g) = -\tfrac{1}{2}\log(1-c_g^2)$. Obviously, the more cross constraints are imposed, the larger the MinMI will be.
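As a quick numerical illustration of this closed form (the numerical values below are our own worked example, not taken from the paper):
$$I(c_g) = -\tfrac{1}{2}\log(1-c_g^2): \qquad I(0.5) = -\tfrac{1}{2}\log(0.75) \approx 0.144\ \text{nats}, \qquad I(0.9) = -\tfrac{1}{2}\log(0.19) \approx 0.830\ \text{nats}.$$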
The general solution is obtained through variational analysis, rather similar to that for the ME [28] but with a continuum of constraints (the marginal PDFs) and a finite set of expectations:
$$I(\theta_{cr},\rho_X,\rho_Y) = H_{\rho_X} + H_{\rho_Y} - H_{\rho^*_{X,Y}}(\theta_{cr},\rho_X,\rho_Y); \qquad H_{\rho^*_{X,Y}}(\theta_{cr},\rho_X,\rho_Y) = L(\lambda_{cr})$$
$$\lambda_{cr} = \arg\min_{\eta_{cr}}\Big[ L(\eta_{cr}) \equiv 1 + \int_{S_X}\log Z_X(x,\eta_{cr})\,\rho_X(x)\,dx + \int_{S_Y}\log Z_Y(y,\eta_{cr})\,\rho_Y(y)\,dy - \eta_{cr}^T\theta_{cr}\Big] \qquad (1)$$
The MinMI-PDF $\rho^*_{X,Y}(X,Y)$ and the partition functions $Z_X, Z_Y$ are
$$\rho^*_{X,Y}(X,Y) = \big[Z_X(X,\lambda_{cr})\,Z_Y(Y,\lambda_{cr})\big]^{-1}\exp\!\big[-1 + \lambda_{cr}^T T_{cr}(X,Y)\big];$$
$$Z_X(X,\lambda_{cr}) \equiv \frac{1}{\rho_X(X)}\int_{S_Y}\exp\!\big[-1 + \lambda_{cr}^T T_{cr}(X,y)\big]\,Z_Y(y,\lambda_{cr})^{-1}\,dy;$$
$$Z_Y(Y,\lambda_{cr}) \equiv \frac{1}{\rho_Y(Y)}\int_{S_X}\exp\!\big[-1 + \lambda_{cr}^T T_{cr}(x,Y)\big]\,Z_X(x,\lambda_{cr})^{-1}\,dx \qquad (2)$$
The superscript $T$ stands for transpose, such that $\lambda_{cr}^T T_{cr}$ is the canonical inner product between the vectors $\lambda_{cr}$ and $T_{cr}$. The proof is given in Appendix 1. Any PDF $\rho_{XY}(X,Y)$ is a MinMI-PDF corresponding to the single constraint $T_{cr}(X,Y) = 1 + \log\!\big[\rho_{XY}(X,Y)/(\rho_X(X)\rho_Y(Y))\big]$, leading to $\lambda = 1$, $Z_X(X,\lambda) = \rho_X(X)^{-1}$ and $Z_Y(Y,\lambda) = \rho_Y(Y)^{-1}$.
The minimization of $L(\eta)$ in (1) calls for the implementation of an iterative strategy as in [11], with successive adjustments of the implicitly linked partition functions.
The present paper deals with small changes of $I(\theta_{cr},\rho_X,\rho_Y)$ coming from estimation errors $\Delta\theta_{cr}$ of the cross expectations evaluated from finite samples. For the purpose of inferring the consequent MinMI error statistics (bias, variance, distribution), we will use the second-order Taylor expansion of $I(\theta_{cr},\rho_X,\rho_Y)$ in terms of the variation $\Delta\theta_{cr}$:
$$\Delta I(\theta_{cr},\rho_X,\rho_Y) \equiv I(\theta_{cr}+\Delta\theta_{cr},\rho_X,\rho_Y) - I(\theta_{cr},\rho_X,\rho_Y) = -\Delta H_{\rho^*_{X,Y}}(\theta_{cr},\rho_X,\rho_Y) = \lambda_{cr}^T\Delta\theta_{cr} + \tfrac{1}{2}\,\Delta\theta_{cr}^T\, C_{cr,\rho_X,\rho_Y}^{-1}\,\Delta\theta_{cr} + O(\|\Delta\theta_{cr}\|^3) \qquad (3)$$
where $C_{cr,\rho_X,\rho_Y}^{-1}$ is the inverse of the covariance matrix of the vector of constraining functions $T_{cr}$, conditioned on knowledge of the marginal PDFs and evaluated at the MinMI-PDF $\rho^*_{X,Y}$, i.e.,
$$C_{cr,\rho_X,\rho_Y} = E_{\rho^*_{X,Y}}\big[\,T_{cr}^*\,T_{cr}^{*T}\,\big|\,\rho_X,\rho_Y\,\big] = E_{\rho^*_{X,Y}}\big[\,T_{cr}^*\,T_{cr}^{*T}\,\big|\,E(T|X),\,E(T|Y)\,\big] \qquad (4)$$
where $E_{\rho^*_{X,Y}}$ is the expectation at $\rho^*_{X,Y}$. The perturbation $T^* = T - E_{\rho^*_{X,Y}}(T_{cr}|\rho_X,\rho_Y)$ is the residual with respect to the conditional mean, obtained by methods of variational and functional analysis as the best linear fit
$$E_{\rho^*_{X,Y}}(T_{cr}|\rho_X,\rho_Y) = \theta_{cr} + \alpha_X\big[E_{\rho^*_{X,Y}}(T_{cr}|X) - \theta_{cr}\big] + \alpha_Y\big[E_{\rho^*_{X,Y}}(T_{cr}|Y) - \theta_{cr}\big] \qquad (5)$$
where $\alpha_X, \alpha_Y$ are vectors of coefficients minimizing the mean square deviations of each component of $T_{cr}$, using the $X$- and $Y$-conditional means of $T_{cr}$ as predictors. The proof is given in Appendix 1 as part of the proof of Theorem 1, presented in Section 2.2.

2.2. Imposing Marginals through ME Constraints

2.2.1. The Formalism

In PP12 [12], we address the MinMI problem (1,2) by considering that $\rho_X, \rho_Y$ are themselves ME-PDFs forced by a finite set of marginal, independent constraints $\{T_{ind} \equiv (T_X(X), T_Y(Y)),\ \theta_{ind} \equiv E(T_{ind}) \equiv (\theta_X,\theta_Y)\}$. For that purpose we solve the ME problem [29] imposing the constraint set $\{T,\theta\} = \{(T_{ind},T_{cr}),(\theta_{ind},\theta_{cr})\}$, thus leading to a weaker (i.e., smaller) MinMI solution than that obtained with the full imposition of the marginal PDFs. That is given by $I(\theta_{cr},\theta_{ind}) = H(\theta_{ind}) - H(\theta) \leq I(\theta_{cr},\rho_X,\rho_Y)$, where $H(\theta)$ is the ME issued from the finite set of constraints (marginal and cross) and $H(\theta_{ind}) \equiv H_X + H_Y$ is the ME corresponding uniquely to the marginal constraints [30]. In particular, if the support sets are $S_X = S_Y = [0,1]$ and $\{T_{ind},\theta_{ind}\} = \emptyset$ (no constraints on the marginals), then the joint PDF of $(X,Y)$ is a copula [24], since its marginal PDFs are uniform in [0,1]. The cross part $T_{cr}$ includes only cross functions not redundantly expressed as sums of marginal functions in $T_{ind}$.
In practice one can impose the marginal PDFs from a priori RVs $(\hat X, \hat Y)$ (the data variables) through ME-morphisms $X = X(\hat X), Y = Y(\hat Y)$ (Equation 6 of PP12), (e.g., standard Gaussians), which are monotonically growing smooth homeomorphisms linking the data to the transformed $(X,Y)$ variables. Then, thanks to the invariance of MI under $X = X(\hat X), Y = Y(\hat Y)$ [2], one can consistently define the MinMI between $(\hat X,\hat Y)$ as that obtained with $(X,Y)$.
The joint ME-PDF is written in terms of a vector $\lambda$ of Lagrange multipliers [28] as $\rho^*_{T,\theta}(X,Y) = Z(\lambda,T)^{-1}\exp[\lambda^T T(X,Y)]$, where $Z(\lambda,T) \equiv \int_S \exp(\lambda^T T)\,dx\,dy$ is the partition function. The ME functional is $H(\theta) = \min_\eta\,(\log Z(\eta,T) - \theta^T\eta) = \log Z(\lambda,T) - \theta^T\lambda$, whose input is the vector $\theta$. The marginal PDFs are assumed to be the ME-PDFs $\rho^*_{T_X,\theta_X}(X)$ and $\rho^*_{T_Y,\theta_Y}(Y)$, verifying the marginal $X$ and $Y$ constraints respectively, since the variables were built accordingly by ME-morphisms.
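A minimal numerical sketch of how the ME functional $H(\theta) = \min_\eta(\log Z(\eta,T) - \theta^T\eta)$ can be evaluated by convex minimization of this dual. This is our own illustration, not the authors' implementation: the truncated grid, quadrature and BFGS optimizer are assumptions made here for simplicity.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

# Discretize a truncated support (assumed [-6, 6]^2, adequate for standard Gaussian marginals)
x = np.linspace(-6.0, 6.0, 201)
dx = x[1] - x[0]
X, Y = np.meshgrid(x, x, indexing="ij")

# Constraint functions T = (X, X^2, Y, Y^2, XY), flattened over the grid
T = np.stack([X, X**2, Y, Y**2, X * Y]).reshape(5, -1)

def dual(eta, theta):
    """Dual functional log Z(eta,T) - theta^T eta; its minimum over eta is the ME H(theta)."""
    logZ = logsumexp(eta @ T) + 2.0 * np.log(dx)   # log of the grid quadrature of exp(eta^T T)
    return logZ - theta @ eta

def max_entropy(theta):
    eta0 = np.array([0.0, -0.5, 0.0, -0.5, 0.0])   # start at the independent standard Gaussian
    res = minimize(dual, eta0, args=(theta,), method="BFGS")
    return res.fun, res.x                          # ME value H(theta) and multipliers lambda

c_g = 0.5
H_ind, _ = max_entropy(np.array([0.0, 1.0, 0.0, 1.0, 0.0]))
H_cr, lam = max_entropy(np.array([0.0, 1.0, 0.0, 1.0, c_g]))
print("MinMI:", H_ind - H_cr, " analytical:", -0.5 * np.log(1.0 - c_g**2))
```

For the Gaussian example of Section 2.2, the recovered difference $H(\theta_{ind}) - H(\theta)$ should approach the closed form $-\tfrac{1}{2}\log(1-c_g^2)$ up to quadrature error.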
As more cross constraints are added to $\{T_{cr},\theta_{cr}\}$, the MinMI $I(\theta_{cr},\theta_{ind})$ increases, converging to the full MI $I(X,Y)$. Let us formalize that by supposing that the true joint PDF belongs to the ME-family characterized by an information moment superset $\{T_\infty,\theta_\infty\} \supseteq \{T,\theta\}$.
The true joint PDF is given by $\rho^*_{T_\infty,\theta_\infty}$, with Shannon entropy given by the ME $H(\theta_\infty)$. The encapsulated moment sets obey $\theta_{ind} \subseteq \theta \subseteq \theta_\infty$. Therefore, thanks to Lemma 1 of PP12, the monotonic property of MEs holds: $H(\theta_{ind}) \geq H(\theta) \geq H(\theta_\infty)$. This, according to Theorem 1 of PP12, allows for the decomposition of the MI $I(X,Y)$ into two positive terms, such that:
$$I(X,Y) = H(\theta_{ind}) - H(\theta_\infty) = I_{\theta/\theta_{ind}}(X,Y) + I_{\theta_\infty/\theta}(X,Y) \geq 0$$
$$I_{\theta/\theta_{ind}} \equiv H(\theta_{ind}) - H(\theta) \geq 0; \qquad I_{\theta_\infty/\theta} \equiv H(\theta) - H(\theta_\infty) \geq 0 \qquad (6)$$
The term $I_{\theta/\theta_{ind}}$ is the MinMI associated with the finite set of cross moments $\theta_{cr}$, and the second term is the remaining MI. The decomposition (6) allows us to define a monotonic sequence of lower MI bounds converging to the total MI. That follows from the sequence of encapsulated moment sets $\{T_{ind}=T_0,\theta_{ind}=\theta_0\} \subseteq \{T_j,\theta_j\} \equiv \{(T_{ind,j},T_{cr,j}),(\theta_{ind,j},\theta_{cr,j})\} \subseteq \{T_{j+1},\theta_{j+1}\} \subseteq \ldots \subseteq \{T_\infty,\theta_\infty\}$, $j \geq 1$ (e.g., the set of monomial bivariate moments of a certain total order $j$), whose ME-PDF approximates the true ME-PDF in the sense of the Kullback-Leibler divergence (KLD), i.e., $D_{KL}(\rho^*_{T_\infty,\theta_\infty}\|\rho^*_{T_j,\theta_j}) = H(\theta_j) - H(\theta_\infty) \underset{j\to\infty}{\longrightarrow} 0$, with the MI given by the limit $I(X,Y) = H(\theta_{ind}) - \lim_{j\to\infty} H(\theta_j)$. The sets $\{T_0,\theta_0\}$ and $\{T_{ind,j},\theta_{ind,j}\}$ are ME-congruent, i.e., their ME-PDFs are the same. The $j$-th set must include enough constraints so as to keep the joint ME issued from $\{T_j,\theta_j\}$ finite and to guarantee the convergence of the above KLD towards zero. Moreover, that also guarantees that the marginals of the joint ME-PDF converge to the preset marginal PDFs $\rho_X, \rho_Y$ in the KLD sense. Therefore, the MinMI $I(\theta_{cr,\infty},\rho_X,\rho_Y) = I(X,Y) = H(\theta_{ind}) - \lim_{j\to\infty} H(\theta_j)$.
The addition of constraints leads to a decrease of the ME, giving rise to the useful concept of incremental MinMI, presented next. The MI part that is explained by the cross terms in the set difference $T_j/T_p$ ($j > p \geq 0$, i.e., $T_p \subseteq T_j$) is the incremental MinMI:
$$I_{j/p} \equiv H(\theta_p) - H(\theta_j) = D_{KL}\big(\rho^*_{T_j,\theta_j}\,\|\,\rho^*_{T_p,\theta_p}\big) = I_{j/0} - I_{p/0} \geq 0 \qquad (7)$$
Estimation errors of $I_{j/p}$ are driven by the vector of moment errors $\Delta\theta_j$ (of which $\Delta\theta_p$ is simply a projection). Since we preset the marginal PDFs, $\Delta\theta_j$ is restricted to the cross part, i.e., $\Delta\theta_j = \Delta\theta_{cr,j} = P_{cr,j}\,\Delta\theta_j$, where $P_{cr,j}$ is the diagonal projection operator onto the cross expectations (the cr and ind entries are set to 1 and 0, respectively). Looking for the error statistics of $I_{j/p}$, we use the second-order Taylor expression of the ME:
$$\Delta H = H(\theta) - H(\theta + \Delta\theta_{cr}) = (P_{cr}\lambda)^T\Delta\theta_{cr} + \tfrac{1}{2}\,\Delta\theta_{cr}^T\,\big(P_{cr}\,C_*^{-1}\,P_{cr}\big)\,\Delta\theta_{cr} + O(\|\Delta\theta_{cr}\|^3) \qquad (8)$$
where, as usual, $\lambda$ (with dropped subscripts) is the whole vector of Lagrange multipliers, of dimension $\dim(\theta_{cr}) + \dim(\theta_{ind})$, and $C_*$ is the covariance matrix of the function vector $T$, both valid for the ME-PDF verifying the constraints $E_*(T) = \theta$. We note that $C_* = E_*[T'T'^T]$, where the star stands for evaluation over the ME-PDF and the prime denotes deviation from the mean $\theta$, i.e., $T' = T - \theta$. Therefore, by using (8), we express the variation of $I_{j/p}$ $(j>p)$ due to variations $\Delta\theta_{cr,j}$ as:
$$\Delta I_{j/p} = (v_{j/p})^T\,\Delta\theta_{cr,j} + \tfrac{1}{2}\,\Delta\theta_{cr,j}^T\,A_{j/p}\,\Delta\theta_{cr,j} + O(\|\Delta\theta_{cr,j}\|^3)$$
$$v_{j/p} \equiv P_{cr,j}\lambda_j - P_{cr,p}\lambda_p; \qquad A_{j/p} \equiv P_{cr,j}\big(C_{*j}^{-1} - P_{cr,p}(C_{*p})^{-1}P_{cr,p}\big)P_{cr,j} \qquad (9)$$
where $\lambda_j, C_{*j}$ and $\lambda_p, C_{*p}$ are, respectively, the whole vectors of Lagrange multipliers and the whole covariance matrices valid for the ME-PDFs of orders $j$ and $p$; the matrix $A_{j/p}$ is built from $C_{*j}$ and $C_{*p}$.
When the ME-PDFs of orders $j$ and $p$ coincide (which is useful for testing whether the $I_{j/p}$ estimated from data is significantly different from zero), or when $p = 0$ (in which case $P_{cr,p} = 0$), then $C_{*p}$ is a sub-matrix of $C_{*j}$. In that case, $A_{j/p}$ is positive semi-definite (PSD). This comes from the generic algebraic result stating that $A = C^{-1} - P\,C_P^{-1}\,P$ is PSD, where $C$ is PSD, $P$ is a diagonal projection matrix, and $C_P = PCP$ is the projected $C$, with generalized inverse $C_P^{-1}$ such that $C_P C_P^{-1} = C_P^{-1} C_P = P$. $A$ is singular, with $\mathrm{Ker}(A) = \mathrm{Im}(CP)$. Moreover, one can prove that for small deviations between the ME-PDFs of orders $j$ and $p$, the matrix $A_{j/p}$ is still PSD. For that, one can use the same perturbative approach as in [26].

2.2.2. A Theorem about the MinMI Covariance Matrix

The matrix $P_{cr}C_*^{-1}P_{cr}$ in (8) has an inverse in the cross-expectation subspace, i.e., $(P_{cr}C_*^{-1}P_{cr})^{-1}(P_{cr}C_*^{-1}P_{cr}) = P_{cr}$. Writing the identity as the sum of the complementary projection operators $I = P_{cr} + P_{ind}$, both diagonal and self-adjoint, we have
$$(P_{cr}C_*^{-1}P_{cr})^{-1} = (P_{cr}C_*P_{cr}) - (P_{cr}C_*P_{ind})(P_{ind}C_*P_{ind})^{-1}(P_{ind}C_*P_{cr}) = E_*[T'_{cr}T_{cr}'^T] - E_*[T'_{cr}T_{ind}'^T]\,E_*[T'_{ind}T_{ind}'^T]^{-1}\,E_*[T'_{ind}T_{cr}'^T] = E_*\big[T'_{cr/ind}\,T_{cr/ind}'^T\big] \qquad (10)$$
which is the covariance matrix of the residuals $T'_{cr/ind}$ of the best linear fit (in the mean-square-error sense) of $T_{cr}$ using the $X$ and $Y$ functions in $T_{ind}$ as predictors, i.e., $T'_{cr/ind} \equiv T'_{cr} - \alpha_{ind,cr}^T T'_{ind}$, where the matrix of coefficients is $\alpha_{ind,cr} = E_*[T'_{ind}T_{ind}'^T]^{-1}E_*[T'_{ind}T_{cr}'^T]$. The identity (10) is simply an application to the ME covariance matrix of a generic algebraic result on PSD matrices $C_*$ and projection operators $P_{cr}, P_{ind} = I - P_{cr}$.
Therefore, the variances in $(P_{cr}C_*^{-1}P_{cr})^{-1}$ are smaller than those in $(P_{cr}C_*P_{cr})$. Moreover, the more marginal constraints are imposed (with increasing $j$), the smaller the variances in $(P_{cr}C_*^{-1}P_{cr})^{-1}$ will be, due to the increasing number of predictors, and the closer we will be to full knowledge of the marginal PDFs. Then, asymptotically, the residuals $T'_{cr,j/ind}$ at step $j$ must converge to the residuals $T^* = T - E_{\rho^*_{X,Y}}(T_{cr}|\rho_X,\rho_Y)$ with respect to the mean (5) entering the covariance (4) of the MinMI. This leads us to the Theorem:
Theorem 1:
Let $\rho^*_{X,Y}$ be the MinMI-PDF issued from $\{T_{cr},\theta_{cr}\}, \rho_X, \rho_Y$, coinciding with the ME-PDF issued from $\{(T_{ind},T_{cr}),(\theta_{ind},\theta_{cr})\}$ for some set $\{T_{ind},\theta_{ind}\}$. Then we have:
$$\lambda_{cr} = P_{cr}\lambda; \qquad C_{cr,\rho_X,\rho_Y} = (P_{cr}C_*^{-1}P_{cr})^{-1} = E_{\rho^*_{X,Y}}\big[\,T_{cr}^*T_{cr}^{*T}\,\big|\,E(T|X),\,E(T|Y)\,\big] \qquad (11)$$
which states that the Lagrange multipliers of the MinMI-PDF are those of the ME-PDF for the cross constraints, and that the MinMI covariance matrix (4) is that of the residuals of the best fit of the cross constraints using their conditional means as predictors. The proof, as well as that of (3–5), is given in Appendix 1.
An illustrative example of Theorem 1 is given for the bivariate Gaussian $\rho^*_{XY}(X,Y) = (2\pi)^{-1}d_g^{1/2}\exp\!\big[-\tfrac{1}{2}d_g(X^2 - 2c_gXY + Y^2)\big]$ of correlation $c_g$, with $d_g \equiv (1-c_g^2)^{-1}$. The marginals $\rho_X, \rho_Y$ are standard Gaussians. $\rho^*_{XY}(X,Y)$ is the MinMI-PDF constrained by the correlation, as well as the ME-PDF constrained by the moments of orders one and two: $\{T_{ind} = (X, X^2, Y, Y^2),\ \theta_{ind} = (0,1,0,1)\}$ and $\{T_{cr} = (XY),\ \theta_{cr} = (c_g)\}$. The vector of Lagrange multipliers is $\lambda = [\,0,\ -\tfrac{1}{2}d_g,\ 0,\ -\tfrac{1}{2}d_g,\ c_g d_g\,]^T$, while the covariance matrix and its inverse are:
$$C_* = \begin{bmatrix} 1 & 0 & c_g & 0 & 0\\ 0 & 2 & 0 & 2c_g^2 & 2c_g\\ c_g & 0 & 1 & 0 & 0\\ 0 & 2c_g^2 & 0 & 2 & 2c_g\\ 0 & 2c_g & 0 & 2c_g & 1+c_g^2 \end{bmatrix};\qquad C_*^{-1} = \begin{bmatrix} d_g & 0 & -c_g d_g & 0 & 0\\ 0 & \tfrac{1}{2}d_g^2 & 0 & \tfrac{1}{2}c_g^2 d_g^2 & -c_g d_g^2\\ -c_g d_g & 0 & d_g & 0 & 0\\ 0 & \tfrac{1}{2}c_g^2 d_g^2 & 0 & \tfrac{1}{2}d_g^2 & -c_g d_g^2\\ 0 & -c_g d_g^2 & 0 & -c_g d_g^2 & (1+c_g^2)d_g^2 \end{bmatrix} \qquad (12)$$
Both matrices are symmetric, the rows and columns being ordered as $(X, X^2, Y, Y^2, XY)$. The MinMI is $I_g(c_g) = -\tfrac{1}{2}\log(1-c_g^2)$, with the derivatives entering the Taylor development (3) given by $\partial I_g/\partial c_g = c_g d_g = P_{cr}\lambda$, which is the fifth component of $\lambda$, and $\partial^2 I_g/\partial c_g^2 = d_g^2(1+c_g^2) = C_{cr,\rho_X,\rho_Y}^{-1} = (P_{cr}C_*^{-1}P_{cr})_{5,5}$, i.e., the entry in the fifth row and fifth column of $C_*^{-1}$, as anticipated by Theorem 1. By expressing $Y = c_gX + d_g^{-1/2}W_X$ and $X = c_gY + d_g^{-1/2}W_Y$, with standard Gaussian noises $W_X, W_Y \sim N(0,1)$ and $\mathrm{cor}(X,W_X) = \mathrm{cor}(Y,W_Y) = 0$, one easily gets the conditional means of $T_{cr}$ as $E_{\rho^*_{X,Y}}(XY|X) = c_gX^2$ and $E_{\rho^*_{X,Y}}(XY|Y) = c_gY^2$, leading to the best linear fit with mean square error $C_{cr,\rho_X,\rho_Y} = d_g^{-2}(1+c_g^2)^{-1}$, confirming the second part of (11).
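A short numerical check of this example (our own illustration, not part of the paper; the value of $c_g$ and the Monte-Carlo regression below are assumptions made for the sketch). It verifies that $(C_*^{-1})_{5,5}$ equals the reciprocal of the residual variance of the best linear fit of $XY$ on its conditional means:

```python
import numpy as np

c = 0.6                                   # assumed example correlation c_g
d = 1.0 / (1.0 - c**2)

# Covariance matrix of T = (X, X^2, Y, Y^2, XY) under the bivariate Gaussian, Eq. (12)
C = np.array([[1, 0,      c, 0,      0       ],
              [0, 2,      0, 2*c**2, 2*c     ],
              [c, 0,      1, 0,      0       ],
              [0, 2*c**2, 0, 2,      2*c     ],
              [0, 2*c,    0, 2*c,    1 + c**2]], dtype=float)
inv55 = np.linalg.inv(C)[4, 4]
print(inv55, d**2 * (1 + c**2))           # both equal d_g^2 (1 + c_g^2)

# Monte-Carlo: residual variance of the regression of XY on E(XY|X)=cX^2 and E(XY|Y)=cY^2
rng = np.random.default_rng(0)
n = 10**6
X = rng.standard_normal(n)
Y = c * X + np.sqrt(1 - c**2) * rng.standard_normal(n)
T = X * Y
preds = np.column_stack([c * X**2, c * Y**2, np.ones(n)])
beta, *_ = np.linalg.lstsq(preds, T, rcond=None)
resid_var = np.var(T - preds @ beta)
print(resid_var, 1.0 / (d**2 * (1 + c**2)))   # ~ (1 - c^2)^2 / (1 + c^2)
```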

2.3. Gaussian and Non-Gaussian MI

There is a particular MI decomposition of the type (6,7), already studied in PP12 [12], in which both RVs $X$ and $Y$ are set to standard Gaussians $N(0,1)$ over the real support set $S_X = S_Y = \mathbb{R}$ by Gaussian morphism [31]. The isotropic bivariate standard Gaussian is constrained by the moment set $T_{ind} = T_0 = (X, X^2, Y, Y^2)^T$, with the expectation vector $\theta_{ind} = \theta_0 = E(T_0) = (0,1,0,1)^T$. The sequence of MinMIs is obtained by considering the indexed moment set (Equation 14 of PP12 [12], changing the index $p$ there into $j$ here):
$$T_j \equiv \{\,X^rY^s:\ 1 \leq r+s \leq j,\ (r,s)\in\mathbb{N}_0^2\,\},\qquad j\in\mathbb{N} \qquad (13)$$
comprising the bivariate monomials of total order up to $j$. Only even natural $j$ provide integrable ME-PDFs over $\mathbb{R}^2$, thus excluding odd $j$ values from the sequence $\{T_0,\theta_0\}, \{T_2,\theta_2\}, \{T_4,\theta_4\},\ldots,\{T_\infty,\theta_\infty\}$ of {moments, expectations} set pairs. The independent parts of all sets are ME-congruent with $\{T_0,\theta_0\}$, i.e., they include the high-order univariate moment expectations of the standard Gaussian. The numbers of independent and cross moments of $T_j$ (13) are $2j$ and $j(j-1)/2$, respectively (e.g., (4,1), (8,6), (12,15) and (16,28) for $j = 2, 4, 6, 8$), as the enumeration sketch after this subsection illustrates. Other, more efficient basis cross functions could be used, for example orthogonal polynomials. Using the notation of Section 2.2, the maximum entropy limit $H(\theta_\infty)$ of the sequence coincides with the true $(X,Y)$ Shannon entropy. As presented in PP12, we define the positive Gaussian MI $I_g$, the non-Gaussian MI $I_{ng}$ and the non-Gaussian MI $I_{ng,j}$ of even order $j$, respectively, as:
$$I_g = I_{2/0} = H(\theta_0) - H(\theta_2) = -\tfrac{1}{2}\log(1-c_g^2) \equiv I_g(c_g); \qquad I_{ng} = I_{\infty/2} = H(\theta_2) - H(\theta_\infty); \qquad I_{ng,j} = I_{j/p=2} = H(\theta_2) - H(\theta_j) \qquad (14)$$
with the MI decomposed as $I(X,Y) = I_g + I_{ng} \geq I_g + I_{ng,j}$. The Gaussian MI depends on the Gaussian correlation $c_g$, i.e., the Pearson correlation between the Gaussianized variables $(X,Y)$. The non-Gaussian MI vanishes iff the joint PDF is Gaussian.
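A small helper (illustrative only) that enumerates the monomial exponents in $T_j$ and reproduces the (independent, cross) constraint counts quoted above:

```python
from itertools import product

def monomials(j):
    """Exponent pairs (r, s) with 1 <= r+s <= j, split into independent and cross parts."""
    ind = [(r, 0) for r in range(1, j + 1)] + [(0, s) for s in range(1, j + 1)]
    cross = [(r, s) for r, s in product(range(1, j), repeat=2) if r + s <= j]
    return ind, cross

for j in (2, 4, 6, 8):
    ind, cross = monomials(j)
    print(j, len(ind), len(cross))   # -> (4, 1), (8, 6), (12, 15), (16, 28)
```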

2.4. Estimators of the Minimum MI from Data and Their Errors

This section is devoted to the study of the estimators (and their errors) of the incremental MI $I_{j/p}$ $(j > p)$ (7) between the a priori RVs $\hat X, \hat Y$ or, equivalently, between their transformed RVs $X, Y$.
In practice, the incremental MI $I_{j/p}$, $j > p$, is estimated by a two-step algorithm: first the computation of the expectations, then the MEs and the partial MIs. The vector of expectations $\theta_{N,j}$ is estimated from the $N$-sized bivariate series $(X_l, Y_l),\ l = 1,\ldots,N$, obtained by morphism from the original $N$ iid realizations of the a priori RVs $(\hat X_l, \hat Y_l),\ l = 1,\ldots,N$ (e.g., time series, spatially distributed data), as the arithmetic average:
$$E_N(T_j) \equiv \theta_{N,j} = N^{-1}\sum_{l=1}^{N}T_j(X_l, Y_l) = \theta_j + \Delta\theta_{N,j} \qquad (15)$$
where $E_N$ stands for the expectation over the $N$ realizations and $\Delta\theta_{N,j}$ is the vector of moment estimation errors. The first-step error comes from the difference $H(\theta_{N,j}) - H(\theta_j)$, due to the marginal morphisms and the finite bivariate sampling, i.e., the cross combinations of the variable realizations. We will see that MI errors depend crucially on the moment estimation errors and their statistics.
Secondly, the true ME $H(\theta_{N,j})$ is estimated as the minimum $\hat H(\theta_{N,j})$ of a functional that is reached by nonlinear minimization techniques (e.g., gradient descent), taking as inputs $\theta_{N,j}$ and a set of calibrated parameters. The second-step error comes from the difference $\hat H - H \equiv \delta H$.
The estimator of $I_{j/p}$, along with its error decomposed into the first-step ($\Delta I_{N,j/p,\theta}$) and second-step ($\Delta I_{N,j/p,H}$) contributions, is written as
$$I_{N,j/p} \equiv \hat H(\theta_{N,p}) - \hat H(\theta_{N,j}) = I_{j/p} + \Delta I_{N,j/p}; \qquad \Delta I_{N,j/p} = \Delta I_{N,j/p,\theta} + \Delta I_{N,j/p,H}$$
$$\Delta I_{N,j/p,\theta} \equiv \big[H(\theta_j) - H(\theta_{N,j})\big] - \big[H(\theta_p) - H(\theta_{N,p})\big] \equiv \Delta H_{N,j} - \Delta H_{N,p}$$
$$\Delta I_{N,j/p,H} \equiv \big[\hat H(\theta_{N,p}) - H(\theta_{N,p})\big] - \big[\hat H(\theta_{N,j}) - H(\theta_{N,j})\big] \equiv (\delta H)_{N,p} - (\delta H)_{N,j} \qquad (16)$$
where $\Delta I_{N,j/p,\theta}$ is the difference between the entropy anomalies $\Delta H$ due to input errors. The second-step error comes from the numerical implementation and round-off errors of the entropy functional, due to: (a) the coarse-grained representation of the continuous PDF; (b) the numerical approximation of the ME functional and its gradient; (c) the stopping criteria of the iterative gradient-descent technique. In this article we neglect the effect of the second-step error, thus approximating the MinMI error by $\Delta I_{N,j/p} \approx \Delta I_{N,j/p,\theta}$, which depends uniquely on the sampling errors $\Delta\theta_{cr} = \Delta\theta_{N,cr,j}$ of the cross expectations.

3. Errors of the Expectation’s Estimators

3.1. Generic Properties

The distribution of the MinMI error and its statistics (bias, variance, quantiles) depends on the distribution of the vector of moment errors $\Delta\theta_{N,cr,j}$ entering (9). Here we present a generic statistical modeling of those errors, with emphasis on the influence of the variable morphisms and of the bivariate sampling.
Let us assume the reasonable hypothesis that the discrete estimator $\theta_{N,j}$ (15) is a consistent estimator of the mean $\theta_j$, i.e., the error $\Delta\theta_{N,j} \to 0$ as $N\to\infty$ in probability, with both the bias and the covariance matrix converging to zero as the data size grows:
$$b_{\Delta\theta_{N,j}} \equiv E(\Delta\theta_{N,j}) \underset{N\to\infty}{\longrightarrow} 0; \qquad M_{\Delta\theta_{N,j}} \equiv E\big[(\Delta\theta'_{N,j})(\Delta\theta'_{N,j})^T\big] \underset{N\to\infty}{\longrightarrow} 0; \qquad \Delta\theta'_{N,j} = \Delta\theta_{N,j} - b_{\Delta\theta_{N,j}} \qquad (17)$$
where the prime stands for the perturbation with respect to the mean. The exact form of the components of $b_{\Delta\theta_{N,j}}$ and $M_{\Delta\theta_{N,j}}$ is rather difficult to establish as a consequence of imposing the marginal distributions, which reduces the randomness to the covariate sampling. The estimator variances scale as $O(1/N)$, though they are smaller than in the case of $N$ iid outcomes. Moreover, we assume that the convergence rate is higher (faster convergence) for the squared bias than for the variances, which is supported by a few examples in the next section.

3.2. The Effects of Morphisms and Bivariate Sampling

Let us start with the effect of the morphisms transforming the original variables $(\hat X, \hat Y)$ into their transformed counterparts $(X,Y)$. That depends on the rank of the variables within the available sample. Without loss of generality, let us sort $\hat X$ in ascending order in the sample, i.e., the $l$-th value equals the ordered $l$-th value, $\hat X_l = \hat X_{(l)}$, $l = 1,\ldots,N$. The bivariate $l$-th realization is $(\hat X_l, \hat Y_l = \hat Y_{(l'(l))})$, where $l'(l): \{1,\ldots,N\}\to\{1,\ldots,N\}$ is the random bivariate rank permutation depending on the particular sample (e.g., if the first value of $\hat X$ comes with the third value of $\hat Y$, then $l'(l=1)=3$, and so on). In particular, $l'(l) = l$ when the correlation equals one. The inverse of the function $l'(l)$ is written $l(l')$. The probability p-values of $\hat X_{(l)}, \hat Y_{(l')}$, i.e., their marginal cumulated probability functions (CDFs), are respectively $p_{X,l}, p_{Y,l'}$, growing as functions of $l, l'$. Those p-values can only be inferred from the sample or prescribed from a priori hypotheses. The sorted transformed RVs given by ME-morphisms are:
$$X_{(l)} = \Phi_{ME,X}^{-1}(p_{X,l}); \qquad Y_{(l')} = \Phi_{ME,Y}^{-1}(p_{Y,l'}); \qquad l, l' = 1,\ldots,N \qquad (18)$$
where $\Phi_{ME,X}, \Phi_{ME,Y}$ are the ME prescribed CDFs (e.g., Gaussian CDFs) of $X$ and $Y$, respectively. The morphism then relies upon the invertible transformations $\hat X_{(l)} \to X_{(l)}$ and $\hat Y_{(l')} \to Y_{(l')}$. The bivariate transformed realizations $(X_l, Y_l = Y_{(l'(l))}),\ l = 1,\ldots,N$ are then used to compute the expectations (Equation 15). Since the exact marginal distributions are not known, their cumulated probabilities must be prescribed, for example with regular steps $\Delta p_{X,l} = \Delta p_{Y,l'} = 1/N$, in which case $p_{X,l}, p_{Y,l'} = l/(N+1)$, $l = 1,\ldots,N$.
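A minimal sketch of this rank-based ME-morphism for Gaussian targets (our own illustration; the non-Gaussian a priori variables are invented for the example, while the plotting-position choice $p = l/(N+1)$ follows the text):

```python
import numpy as np
from scipy.stats import norm, rankdata

def gaussian_morphism(x_hat):
    """Map a sample to standard-Gaussian values through its ranks, p = rank / (N + 1)."""
    n = len(x_hat)
    p = rankdata(x_hat) / (n + 1.0)      # prescribed cumulated probabilities
    return norm.ppf(p)                   # Phi^{-1}(p): look-up table of fixed quantiles

rng = np.random.default_rng(1)
x_hat = rng.exponential(size=500)        # arbitrary non-Gaussian a priori variable
y_hat = np.log(x_hat) + 0.3 * rng.standard_normal(500)

X, Y = gaussian_morphism(x_hat), gaussian_morphism(y_hat)
c_g = np.mean(X * Y)                     # Gaussian correlation feeding I_g = -0.5*log(1 - c_g^2)
print(c_g, -0.5 * np.log(1 - c_g**2))
```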
In order to obtain the moments of $\Delta\theta_{N,j}$, we need to rewrite it in a convenient form:
$$\Delta\theta_{N,j} \equiv \theta_{N,j} - \theta_j = \sum_{l,l'=1}^{N} T_j\big(\Phi_{ME,X}^{-1}(p_{X,l}),\Phi_{ME,Y}^{-1}(p_{Y,l'})\big)\,N^{-1}\delta_{l'(l),l'} - \int_0^1\!\!\int_0^1 T_j\big(\Phi_{ME,X}^{-1}(u),\Phi_{ME,Y}^{-1}(v)\big)\,c[u,v]\,du\,dv$$
$$\approx \sum_{l,l'=1}^{N} T_j(X_{(l)}, Y_{(l')})\,\bigg[\frac{N^{-1}\delta_{l'(l),l'}}{\Delta p_{X,l}\,\Delta p_{Y,l'}} - c[p_{X,l},p_{Y,l'}]\bigg]\,\Delta p_{X,l}\,\Delta p_{Y,l'} \qquad (19)$$
where $\delta_{l'(l),l'} = \delta_{l(l'),l}$, $l,l' \in \{1,\ldots,N\}$, is the Kronecker delta, $u = \int_{-\infty}^{X}\rho^*_{T_X,\theta_X}(t)\,dt$ and $v = \int_{-\infty}^{Y}\rho^*_{T_Y,\theta_Y}(t)\,dt$ are the marginal cumulated probabilities, corresponding respectively to the probabilities $p_{X,l}$ and $p_{Y,l'}$ in the sum (19), and $c[u,v]$ is the copula density [23] (the ratio between the joint PDF and the product of the marginal PDFs). By looking at (19), one sees that $N^{-1}\delta_{l'(l),l'}/(\Delta p_{X,l}\Delta p_{Y,l'})$ is an estimator of the copula $c[p_{X,l},p_{Y,l'}]$. In particular, if $X, Y$ are independent, then $l$ and $l'(l)$ are independent, $c[p_{X,l},p_{Y,l'}] = 1$ and $E(\delta_{l'(l),l'}\,|\,l,l') = N^{-1}$, i.e., there is on average an equipartition of the bivariate ranks.
Equation (19) shows that the moments of $\Delta\theta_{N,j}$ depend on the statistics of the copula estimator error, which can be rather tricky due to the imposition of the marginal PDFs by morphisms, presenting unusual effects with respect to the classical results for samples of iid realizations [32].
To proceed, let us denote the random perturbation $\eta_{l,l'} \equiv \delta_{l'(l),l'} - E[\delta_{l'(l),l'}]$ for all $l, l'$; then $E[\eta_{l,l'}] = 0$, and the constraints $\sum_{l=1}^{N}\delta_{l'(l),l'} = \sum_{l'=1}^{N}\delta_{l'(l),l'} = 1$, or $\sum_{l=1}^{N}\eta_{l,l'} = \sum_{l'=1}^{N}\eta_{l,l'} = 0$, are satisfied as a consequence of $l'(l)$ and $l(l')$ being index permutations of $N$ values. Therefore, taking those constraints into account, $\Delta\theta'_{N,j}$ can be written in different forms in terms of perturbations:
$$\Delta\theta'_{N,j} = \sum_{l,l'=1}^{N} T_{j,l,l'}\,N^{-1}\eta_{l,l'} = \sum_{l,l'=1}^{N} T'_{j,l,l'}\,N^{-1}\delta_{l'(l),l'} = \sum_{l,l'=1}^{N} T'^{X}_{j,l,l'}\,N^{-1}\delta_{l'(l),l'} = \sum_{l,l'=1}^{N} T'^{Y}_{j,l,l'}\,N^{-1}\delta_{l'(l),l'} = \sum_{l=1}^{N} T'_{j,l,l'(l)}\,N^{-1} = \sum_{l=1}^{N} T'^{X}_{j,l,l'(l)}\,N^{-1} = \sum_{l=1}^{N} T'^{Y}_{j,l,l'(l)}\,N^{-1} \qquad (20)$$
where $T_{j,l,l'} \equiv T_j(X_{(l)}, Y_{(l')})$ and its perturbation with respect to the global mean is $T'_{j,l,l'} \equiv T_{j,l,l'} - E(\theta_{N,j})$. The perturbation with respect to the $X$-conditional mean is $T'^{X}_{j,l,l'} \equiv T_{j,l,l'} - E(T_j|X = X_{(l)})$, where $E(T_j|X = X_{(l)}) = \sum_{l'=1}^{N} T_{j,l,l'}\,E[\delta_{l'(l),l'}]$. A similar definition holds for the $Y$-perturbation $T'^{Y}_{j,l,l'} \equiv T_{j,l,l'} - E(T_j|Y = Y_{(l')})$.
The estimators (15) of the independent constraints (components of $T_j$ depending solely on $X$ or on $Y$) have a bias but vanishing variances (null components of $\Delta\theta'_{N,j}$), since the perturbations $T'^{X}_j$ or $T'^{Y}_j$ vanish because the local values of $T_j$ coincide with one of the ($X$- or $Y$-) conditional means. That bias reduces to a numerical integration error. For example, for the expectations of $X$-depending functions, the error reduces to the bias $\Delta\theta_{X,N,j} = \sum_{l=1}^{N} T_{X,j}(X_{(l)})\,N^{-1} - \int_0^1 T_{X,j}(\Phi_{ME,X}^{-1}(u))\,du$, of order $O(N^{-2})$ as given by the trapezoidal integration rule for bounded $T_{X,j}$ functions. The estimators of the cross expectations have both bias and non-vanishing variances.
Our goal now is to estimate the covariance matrix $M_{\Delta\theta_{N,j}}$ (17). As a consequence of the non-replacement of quantiles or rankings, the deviations $T'_{j,l_1,l'(l_1)}$ and $T'_{j,l_2,l'(l_2)}$ in (20) are not necessarily independent for $l_1 \neq l_2$, which would not occur if different realizations were independent, leading to $\mathrm{var}(\theta_{N,j}) = N^{-1}\mathrm{var}(T_j)$. Statistics without replacement generally lead to a deflation of the estimator variances compared to those obtained under the hypothesis of independent realizations [33] or, in other words, $\mathrm{var}(\theta_{N,j}) \leq N^{-1}\mathrm{var}(T_j)$. Therefore, in order to get an $N^{-1}$-scaled expression for $\mathrm{var}(\theta_{N,j})$, we will consider another type of deviations of $T_j$, consistent with (20).
We propose new deviations, denoted by $T'^{lms}_j$, given by a linear combination of the global deviation $T'_j$ and the marginal deviations $T'^{X}_j, T'^{Y}_j$, with the respective coefficients summing to 1 and yielding the least mean square (lms). Those deviations are consistently given by:
$$T'^{lms}_j = (1-\alpha_X-\alpha_Y)\,T'_j + \alpha_X T'^{X}_j + \alpha_Y T'^{Y}_j = T'_j - \alpha_X\big[E(T_j|X) - E(\theta_{N,j})\big] - \alpha_Y\big[E(T_j|Y) - E(\theta_{N,j})\big] \qquad (21)$$
which are the residuals of the best linear fit of $T'_j$ using the conditional means $E(T_j|X)$ and $E(T_j|Y)$ as predictors, with the coefficients being those of the linear regression:
$$\begin{bmatrix}\alpha_X\\ \alpha_Y\end{bmatrix} = \begin{bmatrix}\mathrm{var}[E(T_j|X)] & \mathrm{cov}[E(T_j|X),E(T_j|Y)]\\ \mathrm{cov}[E(T_j|X),E(T_j|Y)] & \mathrm{var}[E(T_j|Y)]\end{bmatrix}^{-1}\begin{bmatrix}\mathrm{cov}[E(T_j|X),T_j]\\ \mathrm{cov}[E(T_j|Y),T_j]\end{bmatrix} \qquad (22)$$
Those deviations take into account the maximum implicit knowledge of the marginal PDFs through their conditional means. We will now use them to express the error moments.
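A small sketch of the regression (21)–(22) (illustrative only; the conditional means are passed in explicitly, as assumed known analytically or pre-estimated, and the closing comparison uses the bivariate Gaussian example of Section 2.2):

```python
import numpy as np

def lms_residuals(T, condX, condY):
    """Residuals T'^lms of the best linear fit of T on its conditional means (Eqs. 21-22)."""
    Tc, px, py = T - T.mean(), condX - condX.mean(), condY - condY.mean()
    S = np.cov(np.vstack([px, py]))                   # 2x2 predictor covariance matrix
    b = np.array([np.cov(px, Tc)[0, 1], np.cov(py, Tc)[0, 1]])
    aX, aY = np.linalg.solve(S, b)                    # regression coefficients alpha_X, alpha_Y
    return Tc - aX * px - aY * py

# Example with T = XY under a bivariate Gaussian, where E(XY|X) = c X^2 and E(XY|Y) = c Y^2
rng = np.random.default_rng(2)
c = 0.7
X = rng.standard_normal(200_000)
Y = c * X + np.sqrt(1 - c**2) * rng.standard_normal(200_000)
res = lms_residuals(X * Y, c * X**2, c * Y**2)
print(np.var(res), (1 - c**2)**2 / (1 + c**2))        # var(T | lms) vs its value in this Gaussian case
```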
The expression of the error covariances in $M_{\Delta\theta_{N,j}}$ relies upon the expansion (20), with the perturbations written as functions of the mean values of products of the deltas $\delta_{l'(l),l'}$. These means depend on the true copula and are written as:
$$E\big(\delta_{l'(l_1),l'_1}\,\delta_{l'(l_2),l'_2}\big) = \begin{cases} 0, & \text{if } [\,l_1 = l_2,\ l'_1 \neq l'_2\,] \text{ or } [\,l'_1 = l'_2,\ l_1 \neq l_2\,] \\[2pt] E\big(\delta_{l'(l_1),l'_1}\big) = N^{-1}\ (*), & \text{if } [\,l_1 = l_2,\ l'_1 = l'_2\,] \\[2pt] N^{-1}(N-1)^{-1}\ (*), & \text{if } [\,l_1 \neq l_2,\ l'_1 \neq l'_2\,] \end{cases} \qquad (23)$$
where we have used the fact that $l'(l)$ and its inverse $l(l')$ are rank permutations (no duplication allowed). The values indicated with an asterisk in (23) correspond to $X, Y$ independent ($l'(l)$ independent of $l$). These moments are difficult to obtain in practice unless the variables are independent or the bivariate PDF is known a priori. From these moments, a large ensemble of $N$-sized surrogate samples can be generated, from which empirical estimator covariances are computed.
Then, by plugging (23) into the generic ($\alpha$-th row, $\beta$-th column) entry of $M_{\Delta\theta_{N,j}}$, and denoting the $\alpha$-th and $\beta$-th components of $T_j$ by $T_{j,\alpha}$ and $T_{j,\beta}$, with estimation errors $\Delta\theta_{N,j,\alpha}, \Delta\theta_{N,j,\beta}$, we get
$$(M_{\Delta\theta_{N,j}})_{\alpha,\beta} = E\big(\Delta\theta'_{N,j,\alpha}\,\Delta\theta'_{N,j,\beta}\big) = \sum_{l_1,l'_1,l_2,l'_2}\big[T'_{j,\alpha}(X_{(l_1)},Y_{(l'_1)})\,T'_{j,\beta}(X_{(l_2)},Y_{(l'_2)})\big]\,N^{-2}\,E\big(\delta_{l'(l_1),l'_1}\,\delta_{l'(l_2),l'_2}\big)$$
$$= N^{-1}E\big(E_N(T'_{j,\alpha}T'_{j,\beta})\big) + N^{-2}\sum_{l_1\neq l_2}E\big[T'_{j,\alpha}(X_{(l_1)},Y_{(l'(l_1))})\,T'_{j,\beta}(X_{(l_2)},Y_{(l'(l_2))})\big] \qquad (24)$$
The first term on the rhs of (24) is $N^{-1}E[\mathrm{cov}_N(T_{j,\alpha},T_{j,\beta})]$, i.e., $1/N$ times the expectation of the covariance among the $N$ realizations. That term converges asymptotically to $N^{-1}\mathrm{cov}(T_{j,\alpha},T_{j,\beta})$, i.e., the estimator covariance under the hypothesis of $N$ iid realizations. However, when the marginals are imposed or the morphism of the variables is performed, that hypothesis no longer holds, because the covariance estimator is a statistic without replacement [33], since the quantiles of $X$ and $Y$ are not repeated in the sample. Therefore, the additional term of (24) reduces the estimator variances with respect to the case of iid trials.
Looking for a correct representation of the cross estimator variances when the marginals are imposed, we represent the $T_j$ perturbations by $T'^{lms}_j$ (21), the residuals of the best linear regression. There, we benefit from a generic property of least-squares regression residuals: they are uncorrelated with the predictors (here the conditional means $E(T_j|X), E(T_j|Y)$). This means that $T'^{lms}_j$ is represented in terms of noises that are uncorrelated with both $X$ and $Y$. Consequently, different realizations of $T'^{lms}_j$ are uncorrelated, which simplifies the expression of the covariance matrix. Therefore, using those lms perturbations, the generic matrix entry $(M_{\Delta\theta_{N,j}})_{\alpha,\beta}$ (24) is rewritten as
$$(M_{\Delta\theta_{N,j}})_{\alpha,\beta} = N^{-2}\sum_{l_1}E\big[T'^{lms}_{j,\alpha}(X_{(l_1)},Y_{(l'(l_1))})\,T'^{lms}_{j,\beta}(X_{(l_1)},Y_{(l'(l_1))})\big] + N^{-2}\sum_{l_1,\,l_2\neq l_1}E\big[T'^{lms}_{j,\alpha}(X_{(l_1)},Y_{(l'(l_1))})\,T'^{lms}_{j,\beta}(X_{(l_2)},Y_{(l'(l_2))})\big] = N^{-1}E\big(E_N(T'^{lms}_{j,\alpha}T'^{lms}_{j,\beta})\big) + O(N^{-2}) \qquad (25)$$
The $N^{-1}$-scaled term of (25) converges asymptotically (as $N\to\infty$) to $N^{-1}E(T'^{lms}_{j,\alpha}T'^{lms}_{j,\beta})$, i.e., $1/N$ times the covariance between the residuals of the linear regression relying upon the conditional means. This leads us to formulate the Theorem:
Theorem 2:
Let us suppose that the $X$ and $Y$ marginal PDFs are imposed by variable morphisms. Then the covariance between the $N$-sized estimators $\theta_{N,\alpha}$ and $\theta_{N,\beta}$ of the means of the cross functions $T_\alpha(X,Y)$ and $T_\beta(X,Y)$ is given by
$$\mathrm{cov}(\theta_{N,\alpha},\theta_{N,\beta}) = N^{-1}E\big(E_N(T'^{lms}_{\alpha}T'^{lms}_{\beta})\big) \underset{N\to\infty}{\longrightarrow} N^{-1}E\big(T'^{lms}_{\alpha}T'^{lms}_{\beta}\big) \qquad (26)$$
where $T'^{lms}_\alpha = T'_\alpha - \alpha_X[E(T_\alpha|X) - \theta_\alpha] - \alpha_Y[E(T_\alpha|Y) - \theta_\alpha]$ is the residual of the best linear fit taking the conditional means as predictors, and $\alpha_X, \alpha_Y$ are the corresponding coefficients (and similarly for $T'^{lms}_\beta$). The expectation is computed with the true PDF of the population. The proof was given above in the text.
An immediate corollary of this Theorem applies in the case where the data are governed by a certain MinMI-PDF issued from $\{T_{cr},\theta_{cr}\}, \rho_X, \rho_Y$. Under those conditions, $T_\alpha$ and $T_\beta$ are themselves cross functions from the constraining set $T_{cr}$ and the $\mathrm{cov}(\theta_{N,\alpha},\theta_{N,\beta})$ are entries of $M_{\Delta\theta_N}$ (17). Then, if the true joint PDF is the MinMI-PDF issued from $\{T_{cr},\theta_{cr}\}, \rho_X, \rho_Y$, we get:
$$P_{cr}\,M_{\Delta\theta_N}\,P_{cr} = N^{-1}\,C_{cr,\rho_X,\rho_Y} \qquad (27)$$
where we use the covariance matrix introduced in (4). Under those conditions one has the identity for the matrix product $(P_{cr}M_{\Delta\theta_N}P_{cr})\,C_{cr,\rho_X,\rho_Y}^{-1} = N^{-1}P_{cr}$, which will be crucial for the evaluation of the asymptotic MinMI estimation bias.

3.3. Errors of the Estimators of Polynomial Moments under Gaussian Distributions

In this section we assess the bias and the covariance of the estimators, and their expression (25), when the constraints are bivariate monomials (13) and Gaussian morphisms are performed as described in Section 2.3. For the purpose of discussing the statistical tests of non-Gaussianity presented in a later section, we restrict our study to the case of $N$-sized samples of iid realizations of independent variables $\hat X, \hat Y$ (taken, without loss of generality, as standard Gaussians). An empirical Monte-Carlo strategy is used by taking the standard Gaussian morphisms $X, Y$ of the $N$ outcomes, from which one estimates the expectation of a vector of generic functions $T(X,Y) = X^rY^s$, $r,s \geq 0$ (13). The bias is $b = E(E_N(T)) - E(T) = \mu_{N,r}\mu_{N,s} - \mu_r\mu_s$, which is determined by the fixed Gaussian centered moments $\mu_r \equiv E(X^r)$ and $\mu_{N,r} \equiv E_N(X^r)$, $r \geq 0$. The sample is centered and standardized such that $\mu_{N,1} = 0$ and $\mu_{N,2} = 1$. The variance $\mathrm{var}(E_N(T))$ of $E_N(T)$ can be rigorously computed from the quadruple sum (24), using the $N$ quantiles of the standard Gaussian and the delta expectations (23) for the case of $X, Y$ independent of each other. However, the computation of that sum is very time-consuming for high $N$ values. For that reason, we approximate it by a Monte-Carlo mean obtained with $N_{rea} = 5000$ independent realizations of the $N$-sized samples. The finite and asymptotic values of $N^{-1}E(\mathrm{var}_N(T))$, valid for the case of $N$ iid trials, are given by:
$$N^{-1}E(\mathrm{var}_N(T)) = N^{-1}\big(\mu_{N,2r}\,\mu_{N,2s} - (\mu_{N,r}\,\mu_{N,s})^2\big) \underset{N\to\infty}{\longrightarrow} N^{-1}\mathrm{var}(T) = N^{-1}\big(\mu_{2r}\,\mu_{2s} - (\mu_r\,\mu_s)^2\big) \qquad (28)$$
whereas those obtained from the least mean squares (25), which are smaller than those of (28), are:
$$\mathrm{var}(E_N(T)) \approx N^{-1}E(\mathrm{var}_N(T|lms)) = N^{-1}\mathrm{var}_N(T|lms) = N^{-1}\big(\mu_{N,2r}\,\mu_{N,2s} - \mu_{N,2r}(\mu_{N,s})^2 - \mu_{N,2s}(\mu_{N,r})^2 + (\mu_{N,s}\,\mu_{N,r})^2\big)$$
$$\underset{N\to\infty}{\longrightarrow} N^{-1}\mathrm{var}(T|lms) = N^{-1}\big(\mu_{2r}\,\mu_{2s} - \mu_{2r}(\mu_s)^2 - \mu_{2s}(\mu_r)^2 + (\mu_s\,\mu_r)^2\big) \qquad (29)$$
Figure 1 compares the variance $\mathrm{var}(E_N(T))$ with the squared bias $b^2$ of the estimator, both relevant for the bias of the MinMI estimation. In the same figure, the empirical variance $\mathrm{var}(E_N(T))$ is also compared with its approximation $N^{-1}\mathrm{var}(T|lms)$ and with the variance for the case of iid trials, $N^{-1}\mathrm{var}(T)$. We use $T = X^4Y^2, X^6Y^2, X^8Y^2$, respectively in panels (a), (b), (c), sorted by growing total variance $\mathrm{var}(T)$, which is especially concentrated in the distribution tails. In all panels, $N = 25\cdot 2^k$, $k = 0,\ldots,11$. We have verified that the empirical variance $\mathrm{var}(E_N(T))$ agrees very well with the theoretical value $N^{-1}\mathrm{var}_N(T|lms)$ for all $N$ (not shown).
At this point, some generic conclusions can be drawn. The estimator variance $\mathrm{var}(E_N(T))$ grows with $\mathrm{var}(T)$ and dominates over the squared bias, except for small $N$ values and higher values of $\mathrm{var}(T)$. This will lead us to neglect the bias of the covariance estimators in the MinMI asymptotic statistics.
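A minimal Monte-Carlo sketch (our own illustration, not the authors' code) of the comparison behind Figure 1, for the monomial $T = X^4Y^2$ in the independent case; the sample size and ensemble size below are assumptions made for the example:

```python
import numpy as np
from scipy.stats import norm

def morph(z):
    """Rank-based standard-Gaussian morphism, p = rank / (N + 1)."""
    n = len(z)
    return norm.ppf((np.argsort(np.argsort(z)) + 1) / (n + 1.0))

rng = np.random.default_rng(3)
r, s, N, n_rea = 4, 2, 200, 5000
est = np.empty(n_rea)
for k in range(n_rea):
    X, Y = morph(rng.standard_normal(N)), morph(rng.standard_normal(N))   # independent case
    est[k] = np.mean(X**r * Y**s)

mu = lambda k: norm.moment(k)             # moments of N(0,1): mu(2)=1, mu(4)=3, mu(8)=105
var_iid = mu(2*r)*mu(2*s) - (mu(r)*mu(s))**2                                   # Eq. (28) limit
var_lms = var_iid - mu(2*r)*mu(s)**2 - mu(2*s)*mu(r)**2 + 2*(mu(r)*mu(s))**2   # Eq. (29) limit
print(np.var(est), var_lms / N, var_iid / N)    # empirical vs lms vs iid variances
```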
Figure 1. Squared empirical bias $b^2$ (black lines) of the $N$-based $T$-expectations as a function of $N$; empirical variances $\mathrm{var}(E_N(T))$ (red lines); approximated variances $N^{-1}\mathrm{var}(T|lms)$ (blue lines); and the variance for the case of $N$ iid trials, $N^{-1}\mathrm{var}(T)$ (green lines). $T$ stands for different bivariate monomials: $X^4Y^2$ (a), $X^6Y^2$ (b) and $X^8Y^2$ (c).
From Figure 1 we also note that the variance reduction coming from the morphism of the variables tends to decrease for higher $N$ values, where the effect of sampling prevails with an $N^{-1}$ scaling of the estimator variance, which is then closely approximated by the asymptotic lms variance $N^{-1}\mathrm{var}(T|lms)$. This can lead to a slight increase of $\mathrm{var}(E_N(T))$ for small $N$, followed by a decrease (e.g., $X^6Y^2$), due to the fact that $\mathrm{var}_N(T|lms)$ is small for lower values of $N$.
Moreover, thanks to the Central Limit Theorem (CLT), the distribution of the estimator errors tends towards Gaussianity with increasing $N$, with a slower convergence rate for higher $T$ variances. However, the Gaussian PDF limit has an infinite support, which must be truncated, since the estimated moments $E_N(T)$ must lie within a kind of polytope with edges determined by Schwarz-like inequalities, as shown in PP12 [12] (e.g., $|E_N(XY)| \leq 1$ and $|E(X^2Y)|/[2(1-c_g^2)]^{1/2} \leq 1$), working as bounds for the nonlinear correlations. Since the estimators have bounds, the estimation errors do so as well. This can be handled by using the Fisher Z-transform $\mathrm{arctanh}(c)$ of a generic linear or nonlinear correlation $c$ and projecting it onto the real support (not done here).
We now illustrate, in Figure 2, Theorem 2 under different values of the correlation $c_g \in [0,1]$. We consider variables $X, Y$ with a joint Gaussian PDF of correlation $c_g \in [0,1]$ and standard Gaussian marginals. In Figure 2 we compare the empirical Monte-Carlo value of $N\,\mathrm{var}(E_N(T))$ (MC in the figure), within an ensemble of 5000 $N$-sized samples, with the theoretical values $\mathrm{var}(T|lms)$ (case where the morphism is performed, AN in the figure) and $\mathrm{var}(T)$ (case of iid realizations, ANiid in the figure). We have used samples of $N = 200$, which is presumed to be near the beginning of the asymptotic regime, and two cross functions: $T(X,Y) = XY$ and $T(X,Y) = X^2Y$. The aforementioned variances are $\mathrm{var}(XY|lms) = (1-c_g^2)^2/(1+c_g^2)$ and $\mathrm{var}(XY) = 1+c_g^2$, while $\mathrm{var}(X^2Y) = 3 + 12c_g^2$ and $\mathrm{var}(X^2Y|lms)$ is the mean squared residual of the best linear fit using the predictors $E(X^2Y|X) = c_gX^3$ and $E(X^2Y|Y) = c_g^2Y^3 + (1-c_g^2)Y$. For both functions, a very good agreement is verified between the Monte-Carlo values and the theoretical ones, within 1–5% relative error. A generic result of Figure 2 is that, under the fixing (presetting) of the marginals, the sampling variability of the cross estimators falls to zero as the absolute value of the correlation tends to one.
Figure 2. $N$ times the Monte-Carlo variances, $N\,\mathrm{var}(E_N(T))$ (thick solid lines), and their theoretical analytical value $\mathrm{var}(T|lms)$ (thick dashed lines), both under imposed marginals (morphisms), together with the analytical value $N\,\mathrm{var}(E_N(T)) = \mathrm{var}(T)$ for iid data (thin solid lines). $T$ stands for different bivariate monomials: $XY$ (black curves), $X^2Y$ (red curves). $N = 200$.
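A compact sketch (again our own illustration, at a single assumed correlation value) reproducing the Figure 2 comparison for $T = XY$:

```python
import numpy as np
from scipy.stats import norm

def morph(z):
    n = len(z)
    return norm.ppf((np.argsort(np.argsort(z)) + 1) / (n + 1.0))

rng = np.random.default_rng(4)
c_g, N, n_rea = 0.6, 200, 5000
est = np.empty(n_rea)
for k in range(n_rea):
    x = rng.standard_normal(N)
    y = c_g * x + np.sqrt(1 - c_g**2) * rng.standard_normal(N)
    X, Y = morph(x), morph(y)                 # imposed standard Gaussian marginals
    est[k] = np.mean(X * Y)

print(N * np.var(est))                        # MC value of N var(E_N(XY))
print((1 - c_g**2)**2 / (1 + c_g**2))         # var(XY | lms), morphism case
print(1 + c_g**2)                             # var(XY), iid case
```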

3.4. Statistical Modeling of Moment Estimation Errors

The above qualitative results give empirical support to Theorem 2 concerning the covariance of the estimation errors and the neglect of the estimation biases. Therefore, the part of the matrix $M_{\Delta\theta_{N,j}}$ (17) regarding the cross components is modeled as:
$$M_{\Delta\theta_{N,cr,j}} \approx N^{-1}E\big(E_N(T'^{lms}_{cr,j}\,T'^{lms\,T}_{cr,j})\big) \equiv N^{-1}C_{N,cr,j|lms} \qquad (30)$$
with the approximation being valid up to terms of order $o(N^{-1})$. In practice, the matrix $E(T'^{lms}_{cr,j}T'^{lms\,T}_{cr,j})$ requires the estimation of the conditional means for each value of $X$ and $Y$.
We now formulate the distribution of the moment estimation errors in the asymptotic regime of sufficiently high $N$. Thanks to the multivariate Central Limit Theorem [34], one can assume that the unbiased estimation error vector follows a multivariate Gaussian distribution, which is written as
$$\Delta\theta_{N,cr,j} \approx \big(M_{\Delta\theta_{N,cr,j}}\big)^{1/2}\,U_j \approx N^{-1/2}\big(C_{N,cr,j|lms}\big)^{1/2}\,U_j; \qquad U_j \sim N(0_{cr,j},\,P_{cr,j}) \qquad (31)$$
where $(C_{N,cr,j|lms})^{1/2}$ is the square-root matrix of $C_{N,cr,j|lms}$ and $U_j$ is a multivariate standard normal RV of dimension $\dim(\theta_{cr,j})$, with zero mean $0_{cr,j}$ and covariance matrix $P_{cr,j}$.
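A minimal sketch (our own illustration; the toy covariance matrix is invented for the example) of how surrogate error vectors consistent with (31) can be drawn once an estimate of $C_{N,cr,j|lms}$ is available:

```python
import numpy as np

def surrogate_moment_errors(C_lms, N, n_surr, rng=None):
    """Draw surrogates of Delta theta_{N,cr} ~ N(0, C_lms / N), as in Eq. (31)."""
    rng = np.random.default_rng() if rng is None else rng
    L = np.linalg.cholesky(C_lms + 1e-12 * np.eye(len(C_lms)))   # a square-root matrix of C_lms
    U = rng.standard_normal((n_surr, len(C_lms)))                # standard normal U_j
    return (U @ L.T) / np.sqrt(N)

# Toy example with two cross constraints (values assumed for illustration only)
C_lms = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
dtheta = surrogate_moment_errors(C_lms, N=500, n_surr=10_000, rng=np.random.default_rng(5))
print(np.cov(dtheta.T) * 500)     # should recover C_lms approximately
```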

4. Modeling of MinMI Estimation Errors, Their Bias, Variance and Distribution

Taking into account the Gaussian approximation (31) for the estimation errors, their neglected bias, the $N^{-1}$-scaled covariance (30), and the second-order Taylor development of the MinMI (9), one can determine the approximate bias, variance and distribution of the MinMI estimators (15).
Two problems are then addressed:
  • The estimation of the bias, variance, quantiles and distribution of the estimators of the incremental MinMI $I_{j/p}$ issued from finite samples of $N$ (iid) realizations of the bivariate original variables $(\hat X, \hat Y)$, then transformed into the RVs $(X,Y)$;
  • The distribution of the estimators of $I_{j/p}$ under the null hypothesis $H_0$ that $(X,Y)$ follows the ME distribution constrained by the weaker constraint set $(T_p,\theta_p)$ $(j > p)$. These estimators work as a significance test for determining whether there is statistically significant MI beyond that explained by the cross moments in $(T_p,\theta_p)$.

4.1. Bias, Variance, Quantiles and Distribution of MI Estimation Error

Considering the moment error distribution (31) and plugging it into the development (9), the error of the MI estimator I N , j / p is then distributed as:
Δ I N , j / p , θ N 1 / 2 [ v j / p T ( C N , c r , j | l m s ) 1 / 2 ] U j     + 1 / 2 N 1 U j T [ ( C N , c r , j | l m s ) 1 / 2 A j / p ( C N , c r , j | l m s ) 1 / 2 ] U j    
where the neglected terms are of order $O(N^{-3/2})$. This is a second-order polynomial form of a multivariate standard Gaussian RV $U_j \sim N(0_j, P_{cr,j})$. There is no general analytical expression for the PDF inferred from (32), except in certain cases where $\Delta I_{N,j/p}$ is governed by a non-central Chi-squared distribution [36]. The quantiles determining the confidence intervals of $I_{N,j/p}$ can easily be obtained by sorting Monte-Carlo surrogates (proxies) of (32) generated from a pseudo-random generator of a standard Gaussian. Analytical expressions for the distribution of MI estimates have been given from a MI Taylor expansion in terms of anomalies of the estimated probabilities [27,37]. Here, we adopt a different approach by considering anomalies of the estimated expectations.
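The Monte-Carlo strategy just described can be sketched as follows, assuming the vector $v_{j/p}$, the matrix $A_{j/p}$ and the covariance $C_{N,cr,j|lms}$ have already been computed (the function and argument names are ours):

```python
# Minimal sketch: empirical quantiles of the MinMI error (32) from Gaussian surrogates.
import numpy as np

def minmi_error_quantiles(v_jp, A_jp, C_lms, N, probs=(0.025, 0.5, 0.975),
                          n_surr=5000, rng=np.random.default_rng(0)):
    w, V = np.linalg.eigh(C_lms)
    S = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T     # square root of C_lms
    Z = rng.standard_normal((n_surr, len(v_jp))) @ S          # C^{1/2} U surrogates
    lin = Z @ v_jp / np.sqrt(N)                               # first-order term of (32)
    quad = 0.5 * np.einsum('ni,ij,nj->n', Z, A_jp, Z) / N     # quadratic term of (32)
    return np.quantile(lin + quad, probs)
```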
The bias of $I_{N,j/p}$, i.e., the expectation of $\Delta I_{N,j/p,\theta}$, is derived from the mean of the quadratic-form term in (32). Therefore, using the invariance of the trace under circular permutation of a matrix product, the bias is approximated by the asymptotic value:
$$E\!\left(\Delta I_{N,j/p}\right) \approx \tfrac{1}{2}\, N^{-1}\, \mathrm{Tr}\!\left(C_{N,cr,j|lms}\, A_{j/p}\right) = \tfrac{1}{2}\, N^{-1}\left[\mathrm{Tr}\!\left(C_{N,cr,j|lms}\, P_{cr,j} C_{*j}^{-1} P_{cr,j}\right) - \mathrm{Tr}\!\left(C_{N,cr,p|lms}\, P_{cr,p} C_{*p}^{-1} P_{cr,p}\right)\right] \qquad (33)$$
This is the difference between the maximum-entropy $N^{-1}$-scaled biases of orders $j$ and $p$, subject to the imposition of the marginal PDFs. Recall that if $p = 0$, then $P_{cr,p}$ vanishes. In that case the MinMI bias is simply minus the (negative) bias of the ME $H(\theta_{N,j})$, which is treated without the effect of the variable morphism by [26]. When the data are governed by the MinMI-PDF of order $j$, the matrices $C_{N,cr,j|lms}$ and $P_{cr,j} C_{*j}^{-1} P_{cr,j}$ are the inverse of each other, according to Theorems 1 and 2 (11,27), leading to $E(\Delta I_{N,j/0}) \approx \tfrac{1}{2} N^{-1}\, \mathrm{Tr}\!\left(C_{N,cr,j|lms}\, P_{cr,j} C_{*j}^{-1} P_{cr,j}\right) = \tfrac{1}{2} N^{-1}\, \mathrm{Tr}(P_{cr,j})$, i.e., $1/(2N)$ times the number of cross constraints. However, as argued by [26], when the true data distribution is more leptokurtic than the MinMI-PDF, the bias can be larger than $\tfrac{1}{2} N^{-1}\, \mathrm{Tr}(P_{cr,j})$.
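As an order-of-magnitude illustration with assumed numbers (not taken from the experiments below): when the data are governed by the order-$j$ MinMI-PDF, with $m_{cr} = \mathrm{Tr}(P_{cr,j}) = 6$ cross constraints and $N = 1000$ outcomes, the asymptotic MinMI bias is $E(\Delta I_{N,j/0}) \approx 6/(2 \times 1000) = 0.003$ nats.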
Assuming the limiting Gaussian case, the variance of $\Delta I_{N,j/p}$ is given by:
$$\mathrm{var}\!\left(\Delta I_{N,j/p}\right) \approx N^{-1}\, \mathrm{Tr}\!\left[C_{N,cr,j|lms}\left(v_{j/p}\, v_{j/p}^T\right)\right] + \tfrac{1}{2}\, N^{-2}\, \mathrm{Tr}\!\left[\left(C_{N,cr,j|lms}\, A_{j/p}\right)^2\right] \qquad (34)$$
The leading variance term is $N^{-1}$-scaled, as generally deduced in [15]. Keeping the leading term of (34) and working out the trace, we obtain a given relative error $r_I = \Delta I_N/I_j$ of the MinMI $I_{j/0}$ ($p = 0$) when $N \gtrsim E\!\left[\left(\lambda_{cr,j}^T T_{cr,j}\right)^2\right]/\left(I_{j/0}\, r_I\right)^2 \sim O(m_{cr,j})/\left(I_{j/0}\, r_I\right)^2$. The term $O(m_{cr,j})$ increases at a faster rate than $I_{j/0}$ as the boundary of the polytope of allowed expectations is approached.

4.2. Significance Tests of MinMI Thresholds

The estimators $I_{N,j/p}$ allow for the construction of statistical significance tests in order to verify whether the empirical PDF differs considerably from a threshold ME-PDF or whether, on the contrary, the difference can be explained by sampling errors.
Let us consider the null hypothesis H0 stating that the true PDF coincides with the ME-PDF constrained by $(T_p, \theta_p)$. In particular, for $(T_p, \theta_p) = (T_{p=0}, \theta_{p=0}) = (T_{ind}, \theta_{ind})$, the null hypothesis states that $(X, Y)$ are statistically independent. Under H0, the moment sets $(T_p, \theta_p)$ and $(T_j, \theta_j)$ are ME-congruent and the moments of order $j \geq p$ remain well determined by expectations over the less restricted $p$-th ME-PDF, i.e., $\theta_j = E_{\rho^*_{T_p,\theta_p}}(T_j) \equiv \theta_{j \leftarrow p}$, where the subscript arrow $j \leftarrow p$ means that the $j$-order statistics are obtained from the $p$-order ME-PDF. The same holds for the ME covariance matrices, i.e., $C_{*p} = C_p$ and $C_{*j} = C_{*j \leftarrow p} = C_j$; $j \geq p$. Under these conditions, the matrix $C_p$ is simply a sub-matrix of $C_j$. The Lagrange multipliers are restricted to the $p$-order, i.e., $\lambda_j = \lambda_{j \leftarrow p} = (\lambda_p, 0_{j/p})$; $j \geq p$, where the entries of order higher than $p$ are set to zero, leading to $v_{j/p} = 0$ in (9). Therefore, the incremental MinMI vanishes, i.e., $H(\theta_j) - H(\theta_p) = -I_{j/p} = 0$, but the estimator of $I_{N,j/p}$ is positive due to artificial MI generated by the presence of sampling errors. Then, under H0 and using (9), the MI estimate is given by the following approximation:
$$\left.H(\theta_{N,p}) - H(\theta_{N,j})\right|_{H_0} \equiv \delta I_{N,j/p} \approx \tfrac{1}{2}\, N^{-1}\, U_j^T \left[\left(C_{N,cr,j|lms}\right)^{1/2} A_{j \leftarrow p} \left(C_{N,cr,j|lms}\right)^{1/2}\right] U_j\,;\quad U_j \sim N(0_j, P_{cr,j})\,;\quad A_{j \leftarrow p} = P_{cr,j}\, C_j^{-1} P_{cr,j} - P_{cr,p}\, C_p^{-1} P_{cr,p} \qquad (35)$$
where $A_{j \leftarrow p}$ is a positive semi-definite matrix. This provides a significance test for the rejection of H0: if $I_{N,j/p}$ is larger than an upper $1-\alpha$ quantile (e.g., $1-\alpha = 95\%$) of $\delta I_{N,j/p}$, then H0 is rejected at the significance level $\alpha$. Those quantiles determine the significant MI thresholds and can be computed empirically, as for the MinMI error (32), by a Monte-Carlo strategy. Another possibility is fitting the $\delta I_{N,j/p}$ distribution to a Gamma PDF with prescribed mean and variance (not done here). The bias and variance of $\delta I_{N,j/p}$ follow straightforwardly as:
$$E\!\left[\delta I_{N,j/p}\right] \approx \tfrac{1}{2}\, N^{-1}\, \mathrm{Tr}\!\left[C_{N,cr,j|lms}\, A_{j \leftarrow p}\right]\,;\qquad \mathrm{var}\!\left[\delta I_{N,j/p}\right] \approx \tfrac{1}{2}\, N^{-2}\, \mathrm{Tr}\!\left[\left(C_{N,cr,j|lms}\, A_{j \leftarrow p}\right)^2\right] \qquad (36)$$
The $N^{-2}$ scaling of the variance is also present in other MI estimation errors under the hypothesis of variable independence [27]. Under Theorems 1 (11) and 2 (27), along with the null hypothesis, one gets $C_{N,cr,j|lms}\, A_{j \leftarrow p} = P_{cr,j} - P_{cr,p}$, thus leading to a Chi-Squared distribution for $\delta I_{N,j/p}$:
$$\delta I_{N,j/p} \sim \tfrac{1}{2}\, N^{-1}\, \chi^2_{n_d}\,;\qquad n_d = \mathrm{Tr}\!\left(P_{cr,j} - P_{cr,p}\right) \qquad (37)$$
with n d degrees of freedom, i.e., the difference between the number of cross moments of order j and p. From that, the upper quantiles necessary for statistical significance are easily obtained from χ2 probability lookup tables. The bias and variance are, respectively:
$$E\!\left[\delta I_{N,j/p}\right] \approx \tfrac{1}{2}\, N^{-1}\, \mathrm{Tr}\!\left(P_{cr,j} - P_{cr,p}\right)\,;\qquad \mathrm{var}\!\left[\delta I_{N,j/p}\right] \approx \tfrac{1}{2}\, N^{-2}\, \mathrm{Tr}\!\left(P_{cr,j} - P_{cr,p}\right) \qquad (38)$$
By analyzing (38), in order to obtain a test with a relative error $r_I = \Delta I_{\min}/I_{\min}$, one must choose $N \gtrsim \left[(m_{cr2} - m_{cr1})/2\right]^{1/2}/\left(I_{\min}\, r_I\right)$.
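For illustration, with assumed numbers: to resolve a MinMI of $I_{\min} = 0.05$ nats within a relative error $r_I = 0.5$ when $m_{cr2} - m_{cr1} = 5$ extra cross constraints are tested, one needs roughly $N \gtrsim \sqrt{5/2}/(0.05 \times 0.5) \approx 63$ realizations.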

4.3. Significance Tests of the Gaussian and Non-Gaussian MI

In this section we particularize the theory presented in Sections 4.1 and 4.2 (Equations 35–38) to the case of the Gaussian and non-Gaussian MIs defined in Section 2.3. For this purpose, let us consider the moment sets (13) and the MI components $I_g$ and $I_{ng,j}$ (11). Their finite-sample estimators are:
$$\begin{aligned} I_{N,g} &= H(\theta_0) - H(\theta_{N,2}) = I_g + \Delta I_{N,g} = I_{N,j=2/p=0}\,; &\quad \Delta I_{N,g} &= I_g(c_g + \Delta c_{g,N}) - I_g(c_g) = -\Delta H(\theta_{N,2})\,;\\ I_{N,ng,j} &= H(\theta_{N,2}) - H(\theta_{N,j}) = I_{ng,j} + \Delta I_{N,ng,j} = I_{N,j/p=2}\,; &\quad \Delta I_{N,ng,j} &= \Delta H(\theta_{N,2}) - \Delta H(\theta_{N,j}) \end{aligned} \qquad (39)$$
where $\Delta I_{N,g}$ and $\Delta I_{N,ng,j}$ are MinMI errors, $\Delta c_{g,N}$ is the Gaussian correlation estimation error, $H(\theta_0) = 2 H_g$ with $H_g \equiv \tfrac{1}{2}\log(2\pi e)$ being the entropy of the univariate standard Gaussian, and $\theta_{N,j} = \theta_j + \Delta\theta_{N,j}$, $j \geq 1$, are the expectations obtained from the $N$-sized Gaussianized standardized sample.
The numerical implementation of the maximum entropy estimator $\hat H$ (16), approximating $H$, is computed over a number $N_b$ of bins of a sufficiently extended finite interval $[-L_i, L_i]$. In the corresponding experiments (and as in PP12), we have used the calibrated values $L_i = 6$ and $N_b = 80$. The algorithm used is explained in detail in Appendix 2 of PP12 [12], following an adapted bivariate version of that of [35]. The error $\delta H = \hat H - H$ is of the order of the round-off errors, only becoming comparable to the sampling ME errors at very high values of $N$.

4.3.1. Error and Significance Tests of the Gaussian MI

The Gaussian MI error $\Delta I_{N,g}$ depends on the Gaussian correlation estimation error $\Delta c_{g,N} \equiv c_{g,N} - c_g$, where $c_{g,N} = E_N(XY)$ is inferred from the sample. Let us write (9) for $\Delta I_{N,g}$. The Gaussian bivariate ME-PDF, constrained by $\left(T_2 = (X, X^2, Y, Y^2, XY)^T,\ \theta_2 = (0, 1, 0, 1, c_g)^T\right)$, is $\rho^*_{T_2,\theta_2}(X,Y) = \left[4\pi^2\left(1-c_g^2\right)\right]^{-1/2} \exp\!\left[-\tfrac{1}{2}\left(1-c_g^2\right)^{-1}\left(X^2 - 2 c_g X Y + Y^2\right)\right]$, leading to the vector of Lagrange multipliers $\lambda_2 = \left[0,\ -\tfrac{1}{2}\left(1-c_g^2\right)^{-1},\ 0,\ -\tfrac{1}{2}\left(1-c_g^2\right)^{-1},\ c_g\left(1-c_g^2\right)^{-1}\right]^T$. The projection operator $P_{cr,2}$ onto the cross moments is the $5\times 5$ matrix that extracts the 5th entry (row and column) of $T_2$, corresponding to the unique cross moment $XY$. The necessary $5\times 5$ covariance matrix is $C_{*2} = E_{\rho^*_{T_2,\theta_2}}\!\left[T_2 T_2^T\right] - \theta_2\theta_2^T$, where $E$ denotes the expectation over the bivariate Gaussian $\rho^*_{T_2,\theta_2}$. Then, we apply (9) for $j = 2$, $p = 0$ with $\Delta\theta_{N,j} = (0, 0, 0, 0, \Delta c_{g,N})^T$. The Gaussian MI error can be written in different forms as:
$$\Delta I_{N,g} \approx \left(P_{cr,2}\lambda_2\right)^T \Delta c_{g,N} + \tfrac{1}{2}\left(P_{cr,2}\, C_{*2}^{-1} P_{cr,2}\right)\left(\Delta c_{g,N}\right)^2 = \frac{c_g}{1-c_g^2}\,\Delta c_{g,N} + \frac{1+c_g^2}{2\left(1-c_g^2\right)^2}\left(\Delta c_{g,N}\right)^2 = \frac{\partial I_g}{\partial c_g}\,\Delta c_{g,N} + \frac{1}{2}\,\frac{\partial^2 I_g}{\partial c_g^2}\left(\Delta c_{g,N}\right)^2 \qquad (40)$$
Here, the term $P_{cr,2}\lambda_2$ is the fifth component of $\lambda_2$, corresponding to the first derivative of $I_g$ with respect to $c_g$, whereas the term $P_{cr,2}\, C_{*2}^{-1} P_{cr,2}$ is the entry of $C_{*2}^{-1}$ at row 5, column 5, corresponding to the second derivative of $I_g$. The bias and variance of $\Delta I_{N,g}$ depend on the distribution of the Gaussian correlation error $\Delta c_{g,N}$. According to the proposed modeling of the moment estimation errors (Theorem 2 of Section 3.4), $\Delta c_{g,N}$ is asymptotically Gaussian with a negligible bias, $E(\Delta c_{g,N}) \approx 0$, and a variance (under imposed marginals) given by:
$$\mathrm{var}\!\left(\Delta c_{g,N}\right) \approx N^{-1}\, \mathrm{var}\!\left(XY \mid E(XY|X),\, E(XY|Y)\right) = N^{-1}\,\frac{\left(1-c_g^2\right)^2}{1+c_g^2} \qquad (41)$$
However, in order to keep the simulated $c_g = c_{g,N} - \Delta c_{g,N}$ within the interval $[-1, 1]$, one can use the more precise Fisher Z-transform [38], such that $\Delta c_{g,N} = \tanh\!\left(\tanh^{-1}(c_g) + \Delta Z_N\right) - c_g$, where $\Delta Z_N$ has mean and variance of order $O(N^{-1})$.
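A minimal sketch of this device is given below; the $O(N^{-1})$ variance used for $\Delta Z_N$, namely $1/[N(1+c_g^2)]$, is our own delta-method reading of (41) and is stated here as an assumption rather than a formula quoted from the text.

```python
# Minimal sketch: correlation-error surrogates through the Fisher Z-transform, so that
# the simulated correlation c_g = c_gN - Delta_c_gN always remains inside (-1, 1).
import numpy as np

def correlation_error_surrogates(c_g, N, n_surr, rng=np.random.default_rng(0)):
    z = np.arctanh(c_g)                                              # Fisher Z of the correlation
    dz = rng.standard_normal(n_surr) / np.sqrt(N * (1.0 + c_g**2))   # assumed O(N^-1) variance
    return np.tanh(z + dz) - c_g                                     # surrogates of Delta_c_gN
```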
In order to test the null hypothesis that the variable pair ( X , Y ) has a joint bivariate isotropic Gaussian distribution, we must compare the estimated I N , g with upper quantiles of the significance test δ I N , g , given by Δ I N , g (40) with c g = 0 and Δ c g , N ~ N ( 0 , N 1 ) . This is a Gaussian correlation significance test that is Chi-squared distributed, with:
$$\delta I_{N,g} = \tfrac{1}{2}\left(\Delta c_{g,N}\right)^2 = \tfrac{1}{2}\, N^{-1} U^2 \sim \tfrac{1}{2}\, N^{-1}\chi^2_1\,;\quad U \sim N(0,1)\,;\qquad E\!\left(\delta I_{N,g}\right) = \tfrac{1}{2}\, N^{-1}\,;\quad \mathrm{var}\!\left(\delta I_{N,g}\right) = \tfrac{1}{2}\, N^{-2} \qquad (42)$$

4.3.2. Error and Significance Tests of the Non-Gaussian MI

The estimation error $\Delta I_{N,ng,j}$ of the non-Gaussian MI defined in (39) can be written as a particular form of (9), for an even order $j \geq 4$ and $p = 2$, as a function of the vector $\Delta\theta_{N,j}$ of moment errors of the moment vector $T_j$ (13) with a chosen component indexation. Therefore, the matrix $A_{j/p} = A_{j/p=2} = P_{cr,j}\, C_{*j}^{-1} P_{cr,j} - P_{cr,2}\, C_{*2}^{-1} P_{cr,2}$ of (9) comprises the inverses of the covariance matrices $C_{*j}$ and $C_{*2}$, respectively of the $j$-th and 2nd-order ME solutions.
Algebraic consistency sets the matrix $P_2\, C_{*2}^{-1} P_2$ to the embedding of $C_{*2}^{-1}$ into the $j$-th moment subspace. The vector $v_{j/p=2} = P_{cr,j}\lambda_j - P_{cr,2}\lambda_2$ comprises the Lagrange multiplier vectors of the ME solutions of orders $j$ and 2. We will then perform a range of experiments for the validation of the approximations of Section 4.2.
In order to compute the bias, variance, quantiles and confidence intervals of $I_{N,ng,j}$ from $N$-sized samples, there are two possible strategies: either pure Monte-Carlo simulations, or the analytical and semi-analytical (analytical with surrogates of the moment errors) approaches explained in Section 1. In the pure Monte-Carlo approach, either a known bivariate PDF is assumed or surrogates of the joint PDF are generated through multivariate bootstrapping techniques [39] preserving the copula structure. For each sample generated within an extended ensemble of $N_{rea}$ (e.g., 5,000) realizations, we compute the moments and solve the ME problem, gathering statistics afterwards. Alternatively, ME errors can be computed from the Taylor expansion (9) applied to the moment deviations over the ensemble.
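The pure Monte-Carlo strategy can be organized as in the schematic sketch below, where `sampler`, `cross_moments` and `solve_max_entropy` are hypothetical placeholders for, respectively, the data generator (or bootstrap), the evaluation of the cross expectations, and the bivariate ME solver of PP12:

```python
# Schematic sketch of the pure Monte-Carlo strategy (placeholders, not the paper's code).
import numpy as np

def minmi_mc_statistics(sampler, cross_moments, solve_max_entropy,
                        N, n_rea=5000, rng=np.random.default_rng(0)):
    """Gather ensemble statistics of the estimated maximum entropy H_max(theta_N)."""
    H_max = []
    for _ in range(n_rea):
        xy = sampler(N, rng)                       # one N-sized iid sample (or bootstrap surrogate)
        theta_N = cross_moments(xy)                # estimated cross expectations
        H_max.append(solve_max_entropy(theta_N))   # ME solution for the sampled constraints
    H_max = np.asarray(H_max)
    return {"mean": H_max.mean(), "std": H_max.std(ddof=1), "q95": np.quantile(H_max, 0.95)}
```

Subtracting the population ME from the ensemble mean then gives the Monte-Carlo estimate of the bias.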
In the analytical and semi-analytical approaches, the moment errors $\Delta\theta_{N,j}$ are assumed to follow a certain parametric distribution, which can be multivariate Gaussian as in (31), based on a given bias-covariance matrix model, or a more sophisticated one taking into account the natural bounds of the simulated moments $\theta_{cr,j} = \theta_{N,cr,j} - \Delta\theta_{N,cr,j}$. Then, MinMI statistics are computed from statistics (bias, variance, quantiles) over ensembles of error surrogates.
The non-Gaussian MIs $I_{N,ng,j}$ (even $j \geq 4$) work as tests measuring statistically significant deviations from the null hypothesis of joint Gaussianity. These statistical tests are given by Kullback-Leibler distances (7) and constitute an alternative to the use of algebraic deviations of moments from those of the bivariate Gaussian (e.g., bivariate cumulants) [40].
The non-Gaussianity test of order $j$ is given by $\delta I_{N,ng,j} \equiv \left.H(\theta_{N,2}) - H(\theta_{N,j})\right|_{H_0}$ under the null hypothesis H0 that the true PDF is bivariate Gaussian, and it is written as a particular case of (35). However, a simplification of the statistical test formula can be achieved by considering a null Gaussian correlation. This holds thanks to the invariance of the non-Gaussian MI under variable rotations (see PP12), in particular for the uncorrelated standardized variables $(X_r, Y_r)^T = A (X, Y)^T$, where $A$ is the rotation matrix (e.g., $X_r = X$, $Y_r = (Y - c_g X)\left(1-c_g^2\right)^{-1/2}$, i.e., the residual of the linear prediction). Under H0, the rotated variables are still bivariate Gaussian, and therefore the non-Gaussianity significance test $\delta I_{N,ng,j}$ has the same distribution as that for $c_g = 0$. The matrices $C_{N,cr,j|lms}$ and $A_{j \leftarrow 2}$ entering Equation (35) are now evaluated under Gaussian isotropic conditions. For the sake of clarity, we denote them respectively by $C_{g,N,cr,j|lms}$ and $A_{g,j \leftarrow 2} = P_j\, C_{g,j}^{-1} P_j - P_2\, C_{g,2}^{-1} P_2$, where the subscript $g$ stands for evaluation at $(X, Y)^T \sim N(0, I)$. For high $N$, $C_{g,N,cr,j|lms} \approx C_{g,j}$, i.e., the covariance matrix of the cross $j$-th order moments for the isotropic Gaussian. Then we write:
$$\delta I_{N,ng,j} \approx \tfrac{1}{2}\, N^{-1}\, U_j^T \left[\left(C_{g,N,cr,j|lms}\right)^{1/2} A_{g,j \leftarrow 2} \left(C_{g,N,cr,j|lms}\right)^{1/2}\right] U_j \qquad (43)$$
Let us specify the generic entries at row $\alpha$, column $\beta$ of those matrices, corresponding to the monomials $X^{r_\alpha} Y^{s_\alpha}$ and $X^{r_\beta} Y^{s_\beta}$ of $T_j$, i.e., with $r_\alpha + s_\alpha,\ r_\beta + s_\beta \leq j$. Then, using the notation introduced in Section 3.3 for the Gaussian standard moments, $\mu_r \equiv E(X^r)$, $\mu_{N,r} \equiv E_N(X^r)$, $r \geq 0$, the components of $C_{g,j}$ become:
$$\left(C_{g,j}\right)_{\alpha,\beta} = \mu_{r_\alpha + r_\beta}\,\mu_{s_\alpha + s_\beta} - \mu_{r_\alpha}\mu_{r_\beta}\,\mu_{s_\alpha}\mu_{s_\beta} \qquad (44)$$
whereas the components of the lms covariances are:
$$\left(C_{g,N,cr,j|lms}\right)_{\alpha,\beta} = \mu_{N,r_\alpha + r_\beta}\,\mu_{N,s_\alpha + s_\beta} - \mu_{N,s_\alpha + s_\beta}\,\mu_{N,r_\alpha}\mu_{N,r_\beta} - \mu_{N,r_\alpha + r_\beta}\,\mu_{N,s_\alpha}\mu_{N,s_\beta} + \mu_{N,r_\alpha}\mu_{N,r_\beta}\,\mu_{N,s_\alpha}\mu_{N,s_\beta} \qquad (45)$$
The bias of the non-Gaussian MinMI and its asymptotic approximation (36) are given by:
$$E\!\left[\delta I_{N,ng,j}\right] \approx \tfrac{1}{2}\, N^{-1}\left[\mathrm{Tr}\!\left(C_{g,N,cr,j|lms}\, P_{cr,j}\, C_{g,j}^{-1}\right) - 1\right] = \tfrac{1}{2}\, N^{-1}\left(\mathrm{Tr}\!\left(P_{cr,j}\right) - 1\right) \qquad (46)$$
Similarly and following (36), the variance becomes:
$$\mathrm{var}\!\left[\delta I_{N,ng,j}\right] \approx \tfrac{1}{2}\, N^{-2}\, \mathrm{Tr}\!\left[\left(C_{g,N,cr,j|lms}\, A_{g,j \leftarrow 2}\right)^2\right] = \tfrac{1}{2}\, N^{-2}\left(\mathrm{Tr}\!\left(P_{cr,j}\right) - 1\right) \qquad (47)$$
and, following (37), the distribution is reasonably approximated by:
$$\delta I_{N,ng,j} \sim \tfrac{1}{2}\, N^{-1}\, \chi^2_{n_d}\,;\qquad n_d = \mathrm{Tr}\!\left(P_{cr,j}\right) - 1 = j(j-1)/2 - 1 \qquad (48)$$
from which the significance thresholds for non-Gaussianity can be computed through quantiles of the Chi-squared distribution.
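For illustration, the significance thresholds implied by (48) can be tabulated directly from Chi-squared quantiles; the short sketch below (with an assumed sample size) reproduces the degrees of freedom $n_d = 5, 14, 27$ used later for $j = 4, 6, 8$:

```python
# Minimal sketch: 95% significance thresholds of the non-Gaussianity tests from (48).
from scipy.stats import chi2

N = 400                                    # assumed sample size, for illustration only
for j in (4, 6, 8):
    n_d = j * (j - 1) // 2 - 1             # 5, 14 and 27 degrees of freedom
    threshold = chi2.ppf(0.95, n_d) / (2.0 * N)
    print(f"j = {j}: n_d = {n_d}, 95% MinMI threshold = {threshold:.4f}")
```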

4.4. Validation of Significance Tests by Monte-Carlo Experiments

We have presented the theoretical expressions for the bias, variance and distribution, both for the Gaussian correlation test (42) and for the ME non-Gaussianity tests of order $j$ (46–48). Now we validate those expressions by comparing their results with statistics from large Monte-Carlo ensembles of ME computations. For that purpose, we have generated $N_{rea} = 5000$ independent synthetic datasets of $N$ iid uncorrelated $(X, Y)$ pairs from a Gaussian random generator. We have set $N$ from a doubling sequence: $N = 25,\ 2^1 \times 25, \dots,\ 2^{11} \times 25 = 51200$. Then, we have computed the 5,000 realizations of the independence test $\delta I_{N,g}$ as well as of the non-Gaussianity tests $\delta I_{N,ng,j}$ for $j = 4, 6, 8$. In order to minimize errors of the type $\delta H$ (8) from the ME functional, we have retained only those Monte-Carlo realizations whose ME-PDF moments lie within a relative square error of $10^{-5}$.
Subsequently, we have collected and compared the estimates of the bias, standard deviation and 95%-quantile, all provided by the three approaches: the Monte-Carlo approach (an extended ensemble of ME computations), the semi-analytical approach (generation of Gaussian surrogates in the Taylor expansion of the ME) and the analytical approach (analytical formulas based on Theorems 1 and 2). Figure 3a–d depict the above statistics of the significance tests, respectively for $\delta I_{N,g}$ and $\delta I_{N,ng,j}$ ($j = 4, 6, 8$). The truth is assumed to be given by the Monte-Carlo estimate.
As expected, the significance tests are all scaled by $N^{-1} O(1)$, and consequently their bias, standard deviation and quantiles are $N^{-1} O(1)$, as shown in Figure 3a–d by the estimates coming from the different approaches. The MinMI biases and significance thresholds (the 95% quantiles) grow with the number of constraints, as in the sequence $I_{N,g}$, $I_{N,ng,j=4}$, $I_{N,ng,j=6}$, $I_{N,ng,j=8}$.
These results mean that those estimators are progressively better (stronger) evaluations of MI (or the MI beyond that explained by Gaussianity), though they call for progressively higher significance thresholds. Therefore, especially in cases of under-sampled data (small N) or very low MI (or Non-Gaussian MI) values (weakly dependent variables or weak joint non-Gaussianity), there must be a tradeoff between N and the number of parameters of the MinMI estimator (here the number of cross constraints).
At this point, we discuss how the analytical and semi-analytical estimates of MinMI error statistics fit the Monte-Carlo (true) statistics. There are three crucial factors in our approximations: (1) The accuracy of the ME Taylor expansion, valid for small enough sampling errors (N large); (2) The convergence rate towards Gaussian statistics (from the CLT) for high N.
Figure 3. Test statistics: bias (black lines), standard deviation (red lines) and 95%-quantiles (green lines), provided by the Monte-Carlo approach (thick full lines), the semi-analytical approach (thin dashed lines) and the analytical approach (thick full lines). The tests are $\delta I_{N,g}$ (a); $\delta I_{N,ng,j=4}$ (b); $\delta I_{N,ng,j=6}$ (c) and $\delta I_{N,ng,j=8}$ (d).
The analytical bias depends on factors 1 and 3, while the formulas for the variance, distribution and quantiles depend on all of the above factors, being valid only for high enough $N$. From Figure 3a–d, we see that the agreement between the analytical and Monte-Carlo statistics is quite good for all tests (with a slight analytical underestimation), though only for large enough values $N > N_{test}$, where $N_{test}$ depends on how late (in $N$) the factors 1–3 hold together. We have $N_{test} \approx 50, 400, 1600, 3200$, respectively for $\delta I_{N,g}$, $\delta I_{N,ng,j=4}$, $\delta I_{N,ng,j=6}$ and $\delta I_{N,ng,j=8}$, growing with the number of constraints. The exception occurs when $N$ is so large that the errors $\delta H$ of the operational ME (typically round-off errors) become of the same order as the small test values $\delta I$, starting to influence the Monte-Carlo statistics.
In order to validate the analytical Chi-Squared distributions of the tests, we present in Figure 4 the empirical cumulative histograms of $2N\delta I_{N,g}$, $2N\delta I_{N,ng,4}$, $2N\delta I_{N,ng,6}$ and $2N\delta I_{N,ng,8}$ for $N = N_{test}$, together with the corresponding theoretical cumulative Chi-Squared fits, respectively $\chi^2_1$, $\chi^2_5$, $\chi^2_{14}$ and $\chi^2_{27}$. The agreement is shown to be quite good, with a slight deficit in the theoretical number of degrees of freedom, possibly due to uncontrolled aspects (e.g., the numerical implementation of the ME algorithm and bound effects) leading to extra randomness. In fact, the theoretical prediction of the MinMI bias results from two matrices, theoretically equal, which are issued from rather involved computations (the MinMI covariance matrix and the covariance matrix of the estimators under fixed marginals), so the theoretical result depends on the matching of a large number of algorithmic details. The results provide good support for the presented Theorems and for the hypotheses underlying the analytical and semi-analytical approaches. The slightly higher MinMI bias relative to the theoretical one is due to a small difference between the data PDF and the ME-PDF.
Figure 4. Monte-Carlo empirical cumulative histogram (solid lines) and theoretical cumulative Chi-Squared fit (dashed lines) normalized by N: 2 N δ I N , g ( χ 1 2 ) for N = 50 (black curves); 2 N δ I N , n g , j = 4 ( χ 5 2 ) for N = 400 (red curves); 2 N δ I N , n g , 6 ( χ 14 2 ) f