Article

Minimum Mutual Information and Non-Gaussianity through the Maximum Entropy Method: Estimation from Finite Samples

by Carlos A. L. Pires 1,* and Rui A. P. Perdigão 2

1 Instituto Dom Luiz (IDL), University of Lisbon (UL), Lisbon, P-1749-016, Portugal
2 Institute of Hydraulic Engineering and Water Resources Management, Vienna University of Technology, Vienna, A-1040, Austria
* Author to whom correspondence should be addressed.
Entropy 2013, 15(3), 721-752; https://doi.org/10.3390/e15030721
Submission received: 8 November 2012 / Revised: 15 February 2013 / Accepted: 19 February 2013 / Published: 25 February 2013
(This article belongs to the Special Issue Estimating Information-Theoretic Quantities from Data)

Abstract:
The Minimum Mutual Information (MinMI) Principle provides the least committed, maximum-joint-entropy (ME) inferential law that is compatible with prescribed marginal distributions and empirical cross constraints. Here, we estimate MI bounds (the MinMI values) generated by constraining sets $T_{cr}$ comprising $m_{cr}$ linear and/or nonlinear joint expectations, computed from samples of $N$ iid outcomes. Marginals (and their entropy) are imposed by single morphisms of the original random variables. $N$-asymptotic formulas are given both for the distribution of the cross expectations' estimation errors and for the MinMI estimation bias, its variance and its distribution. A growing $T_{cr}$ leads to an increasing MinMI, eventually converging to the total MI. Under $N$-sized samples, the MinMI increment relative to two encapsulated sets $T_{cr1} \subset T_{cr2}$ (with numbers of constraints $m_{cr1} < m_{cr2}$) is the test difference $\delta H = H_{\max}^{1,N} - H_{\max}^{2,N} \geq 0$ between the two respective estimated MEs. Asymptotically, $\delta H$ follows a scaled Chi-Squared distribution $\tfrac{1}{2N}\chi^2_{(m_{cr2}-m_{cr1})}$, whose upper quantiles determine whether the constraints in $T_{cr2}/T_{cr1}$ explain significant extra MI. As an example, we set the marginals to be normally distributed (Gaussian) and build a sequence of MI bounds associated with successive nonlinear correlations due to joint non-Gaussianity. Noting that in real-world situations the available sample sizes can be rather small, the relationship between MinMI bias, probability density over-fitting and outliers is highlighted for under-sampled data.

1. Introduction

1.1. The State of the Art

The seminal work of Shannon on Information Theory [1] gave rise to the concept of Mutual Information (MI) [2] as a measure of probabilistic dependence among random variables (RVs), with a broad range of applications, including neuroscience [3], communications and engineering [4], physics, statistics, economics [5], genetics [6], linguistics [7] and geosciences [8]. MI is the positive difference between two Shannon entropies of the RVs: the one assuming statistical independence ($H_{ind}$) and the other ($H_{dep}$) considering their true dependence.
This paper addresses the problem of estimating the MI conveyed by the least committed inferential law (say the conditional probability density function, pdf, $\rho(Y|X)$ between RVs $Y, X$) that is compatible with prescribed marginal distributions and a set $T_{cr}$ of $m_{cr}$ empirical non-redundant cross constraints (e.g., a set of cross expectations between a stimulus $X$ and a response $Y$, for example in a neural cell, the Earth's climate, or an ecosystem). The constrained MI, or Minimum Mutual Information (MinMI), between the RVs $Y, X$ is $I_{\min}(X,Y) = H(X) + H(Y) - H_{\max}(X,Y) = H(Y) - H_{\max}(Y|X)$, obtained by subtracting from the sum of the fixed marginal entropies the maximum joint entropy (ME) $H_{\max}$ compatible with the imposed cross constraints. The solution comes from application of the MinMI principle [9,10]. The MinMI is an MI lower bound depending on the marginal pdfs (e.g., Gaussians, Uniforms, Gammas), as well as on the particular form of the cross expectations in $T_{cr}$ (e.g., linear and nonlinear correlations). There are only a few cases with known closed formulas for the MinMI with $m_{cr}=1$: (a) Gaussian marginals and Pearson linear correlation [8,11,12] and (b) Uniform marginals and rank linear correlation [11]. The authors presented in [12] (PP12 hereafter) a general formalism for computing, though not in explicit form, the MinMI in terms of multiple ($m_{cr} > 1$) linear and nonlinear cross expectations included in $T_{cr}$. This set can consist of a natural population constraint (e.g., a specific neural behavior) or it can grow without limit through additional expectations computed within a sample, with the MinMI increasing and eventually converging to the total MI. This paper is the natural follow-up of PP12 [12], studying now the statistics (mean or bias, variance and distribution) of the MinMI estimation errors $\Delta I_{\min,N} = -\Delta H_{\max,N} \equiv -(H_{\max,N} - H_{\max})$, where $H_{\max,N}$ is the ME estimate issued from $N$-sized samples of iid outcomes. Those errors are roughly similar to the errors of generic MI and entropy estimators (see [13,14] for a thorough review and performance comparison of MI estimators). Their mean (bias), variance and higher-order moments are written in terms of powers of $N^{-1}$, thus covering intermediate and asymptotic $N$ ranges [15], with specific applications in neurophysiology [16,17,18]. Entropy estimators range from (a) the histogram-based plug-in estimator [19], with a negative bias, or the Miller-Madow correction [20], equal to $(m-1)/(2N)$, where $m$ is the number of univariate histogram bins, to (b) much improved estimators (e.g., kernel density estimators, adaptive or non-adaptive grids, nearest neighbors) and others specially designed for small samples [21,22].

1.2. The Rationale of the Paper

The well-posedness of a MinMI $I_{\min}(X,Y)$ compatible with available cross information requires knowledge of the marginal $X$ and $Y$ PDFs, $\rho_X$ and $\rho_Y$, either imposed or inferred from sufficiently long samples. For that purpose, we can change $X$ and $Y$ into the cumulated probabilities $u(x) = \int_{-\infty}^{x}\rho_X(t)\,dt$ and $v(y) = \int_{-\infty}^{y}\rho_Y(t)\,dt$, which are uniform RVs on the interval [0,1] (i.e., copulas [23]), through appropriate smoothly growing (injective) morphisms (or anamorphoses), while leaving the MI invariant [2]. Then the MI $I(X,Y)$ becomes the negative copula entropy [24,25], given by $I(X,Y) = \int_0^1\!\int_0^1 c[u,v]\log(c[u,v])\,du\,dv$, where the copula density is $c[u,v] = \rho_{XY}(x,y)/[\rho_X(x)\rho_Y(y)]$.
The MinMI, subject to $m_{cr}$ constraints of the type $E[T_i(u,v)] = \theta_i$, $i = 1,\ldots,m_{cr}$, in copula space, is readily obtained by variational analysis (as in the ME method [2]) as $c[u,v] = \exp\!\big[-1 + \lambda_u(u) + \lambda_v(v) + \sum_{i=1}^{m_{cr}}\lambda_i T_i(u,v)\big]$, where the Lagrange multipliers $\lambda_u(u), \lambda_v(v), \lambda_i$ correspond, respectively, to the preset (not subject to sampling) continuum of constraints $\int c[u,v]\,du = \int c[u,v]\,dv = 1$ and to the $m_{cr}$ expectations (subject to sampling error). The general solution is rather tricky since all the values $\lambda_u(u), \lambda_v(v), \lambda_i$ are implicitly related. The constrained joint PDF and the inferential law are recovered from the constrained copula through the product $\rho_{XY}(x,y) = c[u,v]\,\rho_X(x)\,\rho_Y(y)$.
In PP12 [12], we generalized this problem to a less constrained MinMI version by changing the marginal RVs into ME prescribed ones, the ME-morphisms (e.g., standard Gaussians), and imposing a finite set of marginal constraints instead of the full marginal PDFs. Under these conditions, the number of controlling Lagrange multipliers is finite, leaving the possibility of using nonlinear minimization algorithms for the MinMI estimation, as already tested in [8]. The MinMI subject to a set $T_{cr}$ of $m_{cr}$ cross constraints is thus given by $H_{ind} - H_{ME,cr}$, where $H_{ME,cr}$ is the joint ME and $H_{ind}$ is the sum of the fixed (preset) single entropies. The MinMI estimator is written as $H_{ind} - H_{ME,cr,N}$, where $H_{ME,cr,N}$ is the ME constrained by the $m_{cr}$ sampling expectations obtained from $N$-sized samples. The MinMI estimation error is $H_{ME,cr} - H_{ME,cr,N}$. Therefore, as a generalization of the ME estimator bias [26], one verifies a positive MinMI bias equal to (larger/smaller than) $m_{cr}/(2N)$ when the true population PDF generating the tested sample follows (is more leptokurtic/platykurtic than) the ME-PDF. This result is supported by Monte-Carlo experiments.
Moreover, we introduce here the positive incremental MinMI, given by the difference $H_{ME,cr1} - H_{ME,cr2}$ between two MEs forced by cross constraint sets $T_{cr1} \subset T_{cr2}$, which is interpreted as the MinMI coming from the difference set $T_{cr2}/T_{cr1}$. The corresponding estimator is $H_{ME,cr1,N} - H_{ME,cr2,N}$. Both the MinMI and the incremental MinMI estimators depend essentially on the errors of the expectations estimated from finite $N$-sized samples.
In particular, under the null hypothesis $H_0$ that $H_{ME,cr1} = H_{ME,cr2}$, i.e., that $T_{cr1}, T_{cr2}$ are ME-congruent (see definition in PP12 [12]), the difference $H_{ME,cr1,N} - H_{ME,cr2,N}$ works as a significance test of $H_0$. Such tests can be used (1) for testing statistically significant MI above zero, i.e., significant RV dependence, or (2) for testing MI due to nonlinear correlations beyond the MI due to linear correlations. Another important case (verified here) is the test of MI explained by joint non-Gaussianity beyond the MI explained by joint Gaussianity, in which a Gaussian morphism (i.e., a bijective, reversible transformation of a variable into another one with a Gaussian pdf, without loss of generality) is applied to the single variables. According to the above result, the bias of $H_{ME,cr1,N} - H_{ME,cr2,N}$ under $H_0$ is $(m_{cr2} - m_{cr1})/(2N)$, i.e., the number of cross constraints in the difference set $T_{cr2}/T_{cr1}$ divided by $2N$.
We further provide asymptotic analytical $N$-scaled formulas for the variance and distribution of MinMI estimation errors as functions of the statistics of the ME cross constraint estimation errors. This is possible for $N$ high enough that expectation errors are closely governed by a multivariate Gaussian distribution, uniquely determined by their bias and covariance matrix, thanks to the multivariate Central Limit Theorem. Since marginal morphisms are performed, the single variables are set to values from a look-up table of fixed quantiles (not subject to sampling) and therefore the estimator's squared bias decreases faster than the estimator's variance as $N \to \infty$.
The correct modeling of the covariances between sampling expectation errors under morphism is crucial for the correct computation of MinMI error statistics. We have verified an overall reduction of the cross expectation errors when compared to the case where they are issued from iid realizations (no morphism performed). For instance, the variance $\mathrm{var}(E_N(T))$ of the $N$-sized sampling mean $E_N(T)$ of a cross function $T(X,Y)$ is given by $N^{-1}\mathrm{var}_N(T^*)$, where $T^*$ is the residual of the best linear fit of $T$ using the conditional means $E(T|X), E(T|Y)$ as predictors. Asymptotically, $\mathrm{var}_N(T^*) \to \mathrm{var}(T^*)$, which is the variance of $T$ conditioned on the knowledge of the marginal PDFs, computed at the joint PDF of the population. These conditional variances are exactly those coming from the MinMI solution, allowing MinMI statistics to be related to asymptotic no-replacement finite statistics under fixed marginals. The results are synthesized in the form of two theorems.
Regarding the conversion of expectation errors into ME and MinMI errors, we have used a perturbative approach, namely a second-order Taylor expansion of the ME. This allows closed analytical formulas to be obtained for the MinMI variance and its distribution in a few cases (e.g., Chi-Squared distributions), in what we hereafter call the analytical approach. In order to confirm that, expectation errors are generated as surrogates of the governing multivariate Gaussian PDF; they are then plugged into the Taylor expansion of the MinMI and, finally, the statistics (bias, variances, quantiles) are estimated from a large ensemble (the semi-analytical approach). These statistics are compared with those obtained from a Monte-Carlo experiment where the MinMI is computed ab initio from the sampling expectations (the Monte-Carlo approach). The closeness of results between the Monte-Carlo, semi-analytical and analytical approaches is assessed using several statistical tests of bivariate non-Gaussianity and RV independence. A similar exhaustive validation has already been performed for testing analytical formulas of the bias, variance, skewness and kurtosis of MI estimation errors [27].
In accordance with the above synthesis, the paper starts with this introduction, followed by the formulation of the MinMI and its estimators in Section 2. In Section 3 we present the modeling of the sample mean errors that constrain entropy, and the effect of morphisms on their statistics. Section 4 is devoted to the modeling of the errors of the MinMI, the incremental MinMI and significance tests, followed by a practical case of MI estimation with under-sampled data (Section 5) and the discussion with conclusions in Section 6. An Appendix with some proofs is also provided.

2. Minimum Mutual Information and Its Estimators

2.1. Imposing Marginal PDFs

Let us formulate the problem of finding the minimum Mutual Information (MinMI) in the simplest framework of bivariate RVs $(X,Y)$ over the Cartesian product of support sets $S = S_X \otimes S_Y \subseteq \mathbb{R}^2$. The MinMI is constrained by the imposition of the marginal PDFs $\rho_X, \rho_Y$ and of a set of cross expectations $\{T_{cr}, \theta_{cr} \equiv E(T_{cr})\}$, where $T_{cr}$ is a vector comprising $m_{cr}$ cross $X,Y$ functions and $\theta_{cr}$ is the vector of their expectations. In the space of imposed marginal PDFs, the MinMI comes uniquely as a function of $\theta_{cr}$ as $I(\theta_{cr},\rho_X,\rho_Y) = H_{\rho_X} + H_{\rho_Y} - H_{\rho^*_{X,Y}}(\theta_{cr},\rho_X,\rho_Y)$, where $H_{\rho_X} = -E[\log(\rho_X)]$ and $H_{\rho_Y} = -E[\log(\rho_Y)]$ are the preset Shannon entropies of $X$ and $Y$, respectively, and $H_{\rho^*_{X,Y}}(\theta_{cr},\rho_X,\rho_Y)$ is the ME subject to the joint constraints and the marginal PDFs, where the ME-PDF is $\rho^*_{X,Y}$. This leads to the equivalence between the computation of the MinMI and of the ME [9]. In particular, if $\rho_X, \rho_Y$ are copula marginals (uniform PDFs in [0,1]), then $H_{\rho_X} = H_{\rho_Y} = 0$ and the MinMI reduces to the negative copula entropy [24,25]. For instance, for standard Gaussians $X, Y$ and a given correlation $E(T_{cr} \equiv XY) = c_g$, the MinMI is $I(c_g) = -\tfrac{1}{2}\log(1-c_g^2)$. Obviously, the more cross constraints are imposed, the larger the MinMI will be.
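As a quick numerical illustration of this closed form (the numerical values below are our own worked example, not taken from the paper):
$$I(c_g) = -\tfrac{1}{2}\log(1-c_g^2): \qquad I(0.5) = -\tfrac{1}{2}\log(0.75) \approx 0.144\ \text{nats}, \qquad I(0.9) = -\tfrac{1}{2}\log(0.19) \approx 0.830\ \text{nats}.$$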
The general solution is obtained through variational analysis, rather similar to that for the ME [28] but with a continuum of constraints (the marginal PDFs) and a finite set of expectations:
$$I(\theta_{cr},\rho_X,\rho_Y) = H_{\rho_X} + H_{\rho_Y} - H_{\rho^*_{X,Y}}(\theta_{cr},\rho_X,\rho_Y); \qquad H_{\rho^*_{X,Y}}(\theta_{cr},\rho_X,\rho_Y) = L(\lambda_{cr})$$
$$\lambda_{cr} = \arg\min_{\eta_{cr}}\Big[ L(\eta_{cr}) \equiv 1 + \int_{S_X}\log Z_X(x,\eta_{cr})\,\rho_X(x)\,dx + \int_{S_Y}\log Z_Y(y,\eta_{cr})\,\rho_Y(y)\,dy - \eta_{cr}^T\theta_{cr}\Big] \qquad (1)$$
The MinMI-PDF $\rho^*_{X,Y}(X,Y)$ and the partition functions $Z_X, Z_Y$ are
$$\rho^*_{X,Y}(X,Y) = \big[Z_X(X,\lambda_{cr})\,Z_Y(Y,\lambda_{cr})\big]^{-1}\exp\!\big[-1 + \lambda_{cr}^T T_{cr}(X,Y)\big];$$
$$Z_X(X,\lambda_{cr}) \equiv \frac{1}{\rho_X(X)}\int_{S_Y}\exp\!\big[-1 + \lambda_{cr}^T T_{cr}(X,y)\big]\,Z_Y(y,\lambda_{cr})^{-1}\,dy;$$
$$Z_Y(Y,\lambda_{cr}) \equiv \frac{1}{\rho_Y(Y)}\int_{S_X}\exp\!\big[-1 + \lambda_{cr}^T T_{cr}(x,Y)\big]\,Z_X(x,\lambda_{cr})^{-1}\,dx \qquad (2)$$
The superscript $T$ stands for transpose, such that $\lambda_{cr}^T T_{cr}$ is the canonical inner product between the vectors $\lambda_{cr}$ and $T_{cr}$. The proof is given in Appendix 1. Any PDF $\rho_{XY}(X,Y)$ is a MinMI-PDF corresponding to the single constraint $T_{cr}(X,Y) = 1 + \log\!\big[\rho_{XY}(X,Y)/(\rho_X(X)\rho_Y(Y))\big]$, leading to $\lambda = 1$, $Z_X(X,\lambda) = \rho_X(X)^{-1}$ and $Z_Y(Y,\lambda) = \rho_Y(Y)^{-1}$.
The minimization of $L(\eta)$ in (1) calls for the implementation of an iterative strategy as in [11], with successive adjustments of the implicitly linked partition functions.
The present paper deals with small changes of $I(\theta_{cr},\rho_X,\rho_Y)$ coming from estimation errors $\Delta\theta_{cr}$ of the cross expectations evaluated from finite samples. For the purpose of inferring the consequent MinMI error statistics (bias, variance, distribution), we will use the second-order Taylor expansion of $I(\theta_{cr},\rho_X,\rho_Y)$ in terms of the variation $\Delta\theta_{cr}$:
$$\Delta I(\theta_{cr},\rho_X,\rho_Y) \equiv I(\theta_{cr}+\Delta\theta_{cr},\rho_X,\rho_Y) - I(\theta_{cr},\rho_X,\rho_Y) = -\Delta H_{\rho^*_{X,Y}}(\theta_{cr},\rho_X,\rho_Y) = \lambda_{cr}^T\Delta\theta_{cr} + \tfrac{1}{2}\,\Delta\theta_{cr}^T\, C_{cr,\rho_X,\rho_Y}^{-1}\,\Delta\theta_{cr} + O(\|\Delta\theta_{cr}\|^3) \qquad (3)$$
where $C_{cr,\rho_X,\rho_Y}^{-1}$ is the inverse of the covariance matrix of the vector of constraining functions $T_{cr}$, conditioned on knowledge of the marginal PDFs and evaluated at the MinMI-PDF $\rho^*_{X,Y}$, i.e.,
$$C_{cr,\rho_X,\rho_Y} = E_{\rho^*_{X,Y}}\big[\,T_{cr}^*\,T_{cr}^{*T}\,\big|\,\rho_X,\rho_Y\,\big] = E_{\rho^*_{X,Y}}\big[\,T_{cr}^*\,T_{cr}^{*T}\,\big|\,E(T|X),\,E(T|Y)\,\big] \qquad (4)$$
where $E_{\rho^*_{X,Y}}$ is the expectation at $\rho^*_{X,Y}$. The perturbation $T^* = T - E_{\rho^*_{X,Y}}(T_{cr}|\rho_X,\rho_Y)$ is the residual with respect to the conditional mean, obtained by methods of variational and functional analysis as the best linear fit
$$E_{\rho^*_{X,Y}}(T_{cr}|\rho_X,\rho_Y) = \theta_{cr} + \alpha_X\big[E_{\rho^*_{X,Y}}(T_{cr}|X) - \theta_{cr}\big] + \alpha_Y\big[E_{\rho^*_{X,Y}}(T_{cr}|Y) - \theta_{cr}\big] \qquad (5)$$
where $\alpha_X, \alpha_Y$ are vectors of coefficients minimizing the mean square deviations of each component of $T_{cr}$, using the $X$- and $Y$-conditional means of $T_{cr}$ as predictors. The proof is given in Appendix 1 as part of the proof of Theorem 1, presented in Section 2.2.

2.2. Imposing Marginals through ME Constraints

2.2.1. The Formalism

In PP12 [12], we address the MinMI problem (1,2) by considering that $\rho_X, \rho_Y$ are themselves ME-PDFs forced by a finite set of marginal, independent constraints $\{T_{ind} \equiv (T_X(X), T_Y(Y)),\ \theta_{ind} \equiv E(T_{ind}) \equiv (\theta_X,\theta_Y)\}$. For that purpose we solve the ME problem [29] imposing the constraint set $\{T,\theta\} = \{(T_{ind},T_{cr}),(\theta_{ind},\theta_{cr})\}$, thus leading to a weaker (i.e., smaller) MinMI solution than that obtained with the full imposition of the marginal PDFs. That is given by $I(\theta_{cr},\theta_{ind}) = H(\theta_{ind}) - H(\theta) \leq I(\theta_{cr},\rho_X,\rho_Y)$, where $H(\theta)$ is the ME issued from the finite set of constraints (marginal and cross) and $H(\theta_{ind}) \equiv H_X + H_Y$ is the ME corresponding uniquely to the marginal constraints [30]. In particular, if the support sets are $S_X = S_Y = [0,1]$ and $\{T_{ind},\theta_{ind}\} = \emptyset$ (no constraints on the marginals), then the joint PDF of $(X,Y)$ is a copula [24], since its marginal PDFs are uniform in [0,1]. The cross part $T_{cr}$ includes only cross functions not redundantly expressed as sums of marginal functions in $T_{ind}$.
In practice one can impose the marginal PDFs from a priori RVs $(\hat X, \hat Y)$ (the data variables) through ME-morphisms $X = X(\hat X), Y = Y(\hat Y)$ (Equation 6 of PP12), (e.g., standard Gaussians), which are monotonically growing smooth homeomorphisms linking the data to the transformed $(X,Y)$ variables. Then, thanks to the invariance of MI under $X = X(\hat X), Y = Y(\hat Y)$ [2], one can consistently define the MinMI between $(\hat X,\hat Y)$ as that obtained with $(X,Y)$.
The joint ME-PDF is written in terms of a vector $\lambda$ of Lagrange multipliers [28] as $\rho^*_{T,\theta}(X,Y) = Z(\lambda,T)^{-1}\exp[\lambda^T T(X,Y)]$, where $Z(\lambda,T) \equiv \int_S \exp(\lambda^T T)\,dx\,dy$ is the partition function. The ME functional is $H(\theta) = \min_\eta\,(\log Z(\eta,T) - \theta^T\eta) = \log Z(\lambda,T) - \theta^T\lambda$, whose input is the vector $\theta$. The marginal PDFs are assumed to be the ME-PDFs $\rho^*_{T_X,\theta_X}(X)$ and $\rho^*_{T_Y,\theta_Y}(Y)$, verifying the marginal $X$ and $Y$ constraints respectively, since the variables were built accordingly by ME-morphisms.
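A minimal numerical sketch of how the ME functional $H(\theta) = \min_\eta(\log Z(\eta,T) - \theta^T\eta)$ can be evaluated by convex minimization of this dual. This is our own illustration, not the authors' implementation: the truncated grid, quadrature and BFGS optimizer are assumptions made here for simplicity.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

# Discretize a truncated support (assumed [-6, 6]^2, adequate for standard Gaussian marginals)
x = np.linspace(-6.0, 6.0, 201)
dx = x[1] - x[0]
X, Y = np.meshgrid(x, x, indexing="ij")

# Constraint functions T = (X, X^2, Y, Y^2, XY), flattened over the grid
T = np.stack([X, X**2, Y, Y**2, X * Y]).reshape(5, -1)

def dual(eta, theta):
    """Dual functional log Z(eta,T) - theta^T eta; its minimum over eta is the ME H(theta)."""
    logZ = logsumexp(eta @ T) + 2.0 * np.log(dx)   # log of the grid quadrature of exp(eta^T T)
    return logZ - theta @ eta

def max_entropy(theta):
    eta0 = np.array([0.0, -0.5, 0.0, -0.5, 0.0])   # start at the independent standard Gaussian
    res = minimize(dual, eta0, args=(theta,), method="BFGS")
    return res.fun, res.x                          # ME value H(theta) and multipliers lambda

c_g = 0.5
H_ind, _ = max_entropy(np.array([0.0, 1.0, 0.0, 1.0, 0.0]))
H_cr, lam = max_entropy(np.array([0.0, 1.0, 0.0, 1.0, c_g]))
print("MinMI:", H_ind - H_cr, " analytical:", -0.5 * np.log(1.0 - c_g**2))
```

For the Gaussian example of Section 2.2, the recovered difference $H(\theta_{ind}) - H(\theta)$ should approach the closed form $-\tfrac{1}{2}\log(1-c_g^2)$ up to quadrature error.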
As more cross constraints are added to $\{T_{cr},\theta_{cr}\}$, the MinMI $I(\theta_{cr},\theta_{ind})$ increases, converging to the full MI $I(X,Y)$. Let us formalize that by supposing that the true joint PDF belongs to the ME-family characterized by an information moment superset $\{T_\infty,\theta_\infty\} \supseteq \{T,\theta\}$.
The true joint PDF is given by $\rho^*_{T_\infty,\theta_\infty}$, with Shannon entropy given by the ME $H(\theta_\infty)$. The encapsulated moment sets obey $\theta_{ind} \subseteq \theta \subseteq \theta_\infty$. Therefore, thanks to Lemma 1 of PP12, the monotonic property of MEs holds: $H(\theta_{ind}) \geq H(\theta) \geq H(\theta_\infty)$. This, according to Theorem 1 of PP12, allows for the decomposition of the MI $I(X,Y)$ into two positive terms, such that:
$$I(X,Y) = H(\theta_{ind}) - H(\theta_\infty) = I_{\theta/\theta_{ind}}(X,Y) + I_{\theta_\infty/\theta}(X,Y) \geq 0$$
$$I_{\theta/\theta_{ind}} \equiv H(\theta_{ind}) - H(\theta) \geq 0; \qquad I_{\theta_\infty/\theta} \equiv H(\theta) - H(\theta_\infty) \geq 0 \qquad (6)$$
The term $I_{\theta/\theta_{ind}}$ is the MinMI associated with the finite set of cross moments $\theta_{cr}$, and the second term is the remaining MI. The decomposition (6) allows us to define a monotonic sequence of lower MI bounds converging to the total MI. That follows from the sequence of encapsulated moment sets $\{T_{ind}=T_0,\theta_{ind}=\theta_0\} \subseteq \{T_j,\theta_j\} \equiv \{(T_{ind,j},T_{cr,j}),(\theta_{ind,j},\theta_{cr,j})\} \subseteq \{T_{j+1},\theta_{j+1}\} \subseteq \ldots \subseteq \{T_\infty,\theta_\infty\}$, $j \geq 1$ (e.g., the set of monomial bivariate moments of a certain total order $j$), whose ME-PDF approximates the true ME-PDF in the sense of the Kullback-Leibler divergence (KLD), i.e., $D_{KL}(\rho^*_{T_\infty,\theta_\infty}\|\rho^*_{T_j,\theta_j}) = H(\theta_j) - H(\theta_\infty) \underset{j\to\infty}{\longrightarrow} 0$, with the MI given by the limit $I(X,Y) = H(\theta_{ind}) - \lim_{j\to\infty} H(\theta_j)$. The sets $\{T_0,\theta_0\}$ and $\{T_{ind,j},\theta_{ind,j}\}$ are ME-congruent, i.e., their ME-PDFs are the same. The $j$-th set must include enough constraints so as to keep the joint ME issued from $\{T_j,\theta_j\}$ finite and to guarantee the convergence of the above KLD towards zero. Moreover, that also guarantees that the marginals of the joint ME-PDF converge to the preset marginal PDFs $\rho_X, \rho_Y$ in the KLD sense. Therefore, the MinMI $I(\theta_{cr,\infty},\rho_X,\rho_Y) = I(X,Y) = H(\theta_{ind}) - \lim_{j\to\infty} H(\theta_j)$.
The addition of constraints leads to a decrease of the ME, giving rise to the useful concept of incremental MinMI, presented next. The MI part that is explained by the cross terms in the set difference $T_j/T_p$ ($j > p \geq 0$, i.e., $T_p \subseteq T_j$) is the incremental MinMI:
$$I_{j/p} \equiv H(\theta_p) - H(\theta_j) = D_{KL}\big(\rho^*_{T_j,\theta_j}\,\|\,\rho^*_{T_p,\theta_p}\big) = I_{j/0} - I_{p/0} \geq 0 \qquad (7)$$
Estimation errors of $I_{j/p}$ are driven by the vector of moment errors $\Delta\theta_j$ (of which $\Delta\theta_p$ is simply a projection). Since we preset the marginal PDFs, $\Delta\theta_j$ is restricted to the cross part, i.e., $\Delta\theta_j = \Delta\theta_{cr,j} = P_{cr,j}\,\Delta\theta_j$, where $P_{cr,j}$ is the diagonal projection operator onto the cross expectations (the cr and ind entries are set to 1 and 0, respectively). Looking for the error statistics of $I_{j/p}$, we use the second-order Taylor expression of the ME:
$$\Delta H = H(\theta) - H(\theta + \Delta\theta_{cr}) = (P_{cr}\lambda)^T\Delta\theta_{cr} + \tfrac{1}{2}\,\Delta\theta_{cr}^T\,\big(P_{cr}\,C_*^{-1}\,P_{cr}\big)\,\Delta\theta_{cr} + O(\|\Delta\theta_{cr}\|^3) \qquad (8)$$
where, as usual, $\lambda$ (with dropped subscripts) is the whole vector of Lagrange multipliers, of dimension $\dim(\theta_{cr}) + \dim(\theta_{ind})$, and $C_*$ is the covariance matrix of the function vector $T$, both valid for the ME-PDF verifying the constraints $E_*(T) = \theta$. We note that $C_* = E_*[T'T'^T]$, where the star stands for evaluation over the ME-PDF and the prime denotes deviation from the mean $\theta$, i.e., $T' = T - \theta$. Therefore, by using (8), we express the variation of $I_{j/p}$ $(j>p)$ due to variations $\Delta\theta_{cr,j}$ as:
$$\Delta I_{j/p} = (v_{j/p})^T\,\Delta\theta_{cr,j} + \tfrac{1}{2}\,\Delta\theta_{cr,j}^T\,A_{j/p}\,\Delta\theta_{cr,j} + O(\|\Delta\theta_{cr,j}\|^3)$$
$$v_{j/p} \equiv P_{cr,j}\lambda_j - P_{cr,p}\lambda_p; \qquad A_{j/p} \equiv P_{cr,j}\big(C_{*j}^{-1} - P_{cr,p}(C_{*p})^{-1}P_{cr,p}\big)P_{cr,j} \qquad (9)$$
where $\lambda_j, C_{*j}$ and $\lambda_p, C_{*p}$ are, respectively, the whole vectors of Lagrange multipliers and the whole covariance matrices valid for the ME-PDFs of orders $j$ and $p$; the matrix $A_{j/p}$ is built from $C_{*j}$ and $C_{*p}$.
When the ME-PDFs of orders $j$ and $p$ coincide (which is useful for testing whether the $I_{j/p}$ estimated from data is significantly different from zero), or when $p = 0$ (in which case $P_{cr,p} = 0$), then $C_{*p}$ is a sub-matrix of $C_{*j}$. In that case, $A_{j/p}$ is positive semi-definite (PSD). This comes from the generic algebraic result stating that $A = C^{-1} - P\,C_P^{-1}\,P$ is PSD, where $C$ is PSD, $P$ is a diagonal projection matrix, and $C_P = PCP$ is the projected $C$, with generalized inverse $C_P^{-1}$ such that $C_P C_P^{-1} = C_P^{-1} C_P = P$. $A$ is singular, with $\mathrm{Ker}(A) = \mathrm{Im}(CP)$. Moreover, one can prove that for small deviations between the ME-PDFs of orders $j$ and $p$, the matrix $A_{j/p}$ is still PSD. For that, one can use the same perturbative approach as in [26].

2.2.2. A Theorem about the MinMI Covariance Matrix

The matrix $P_{cr}C_*^{-1}P_{cr}$ in (8) has an inverse in the cross-expectation subspace, i.e., $(P_{cr}C_*^{-1}P_{cr})^{-1}(P_{cr}C_*^{-1}P_{cr}) = P_{cr}$. Writing the identity as the sum of the complementary projection operators $I = P_{cr} + P_{ind}$, both diagonal and self-adjoint, we have
$$(P_{cr}C_*^{-1}P_{cr})^{-1} = (P_{cr}C_*P_{cr}) - (P_{cr}C_*P_{ind})(P_{ind}C_*P_{ind})^{-1}(P_{ind}C_*P_{cr}) = E_*[T'_{cr}T_{cr}'^T] - E_*[T'_{cr}T_{ind}'^T]\,E_*[T'_{ind}T_{ind}'^T]^{-1}\,E_*[T'_{ind}T_{cr}'^T] = E_*\big[T'_{cr/ind}\,T_{cr/ind}'^T\big] \qquad (10)$$
which is the covariance matrix of the residuals $T'_{cr/ind}$ of the best linear fit (in the mean-square-error sense) of $T_{cr}$ using the $X$ and $Y$ functions in $T_{ind}$ as predictors, i.e., $T'_{cr/ind} \equiv T'_{cr} - \alpha_{ind,cr}^T T'_{ind}$, where the matrix of coefficients is $\alpha_{ind,cr} = E_*[T'_{ind}T_{ind}'^T]^{-1}E_*[T'_{ind}T_{cr}'^T]$. The identity (10) is simply an application to the ME covariance matrix of a generic algebraic result on PSD matrices $C_*$ and projection operators $P_{cr}, P_{ind} = I - P_{cr}$.
Therefore, the variances in $(P_{cr}C_*^{-1}P_{cr})^{-1}$ are smaller than those in $(P_{cr}C_*P_{cr})$. Moreover, the more marginal constraints are imposed (with increasing $j$), the smaller the variances in $(P_{cr}C_*^{-1}P_{cr})^{-1}$ will be, due to the increasing number of predictors, and the closer we will be to full knowledge of the marginal PDFs. Then, asymptotically, the residuals $T'_{cr,j/ind}$ at step $j$ must converge to the residuals $T^* = T - E_{\rho^*_{X,Y}}(T_{cr}|\rho_X,\rho_Y)$ with respect to the mean (5) entering the covariance (4) of the MinMI. This leads us to the Theorem:
Theorem 1:
Let $\rho^*_{X,Y}$ be the MinMI-PDF issued from $\{T_{cr},\theta_{cr}\}, \rho_X, \rho_Y$, coinciding with the ME-PDF issued from $\{(T_{ind},T_{cr}),(\theta_{ind},\theta_{cr})\}$ for some set $\{T_{ind},\theta_{ind}\}$. Then we have:
$$\lambda_{cr} = P_{cr}\lambda; \qquad C_{cr,\rho_X,\rho_Y} = (P_{cr}C_*^{-1}P_{cr})^{-1} = E_{\rho^*_{X,Y}}\big[\,T_{cr}^*T_{cr}^{*T}\,\big|\,E(T|X),\,E(T|Y)\,\big] \qquad (11)$$
which states that the Lagrange multipliers of the MinMI-PDF are those of the ME-PDF for the cross constraints, and that the MinMI covariance matrix (4) is that of the residuals of the best fit of the cross constraints using their conditional means as predictors. The proof, as well as that of (3–5), is given in Appendix 1.
An illustrative example of Theorem 1 is given for the bivariate Gaussian $\rho^*_{XY}(X,Y) = (2\pi)^{-1}d_g^{1/2}\exp\!\big[-\tfrac{1}{2}d_g(X^2 - 2c_gXY + Y^2)\big]$ of correlation $c_g$, with $d_g \equiv (1-c_g^2)^{-1}$. The marginals $\rho_X, \rho_Y$ are standard Gaussians. $\rho^*_{XY}(X,Y)$ is the MinMI-PDF constrained by the correlation, as well as the ME-PDF constrained by the moments of orders one and two: $\{T_{ind} = (X, X^2, Y, Y^2),\ \theta_{ind} = (0,1,0,1)\}$ and $\{T_{cr} = (XY),\ \theta_{cr} = (c_g)\}$. The vector of Lagrange multipliers is $\lambda = [\,0,\ -\tfrac{1}{2}d_g,\ 0,\ -\tfrac{1}{2}d_g,\ c_g d_g\,]^T$, while the covariance matrix and its inverse are:
$$C_* = \begin{bmatrix} 1 & 0 & c_g & 0 & 0\\ 0 & 2 & 0 & 2c_g^2 & 2c_g\\ c_g & 0 & 1 & 0 & 0\\ 0 & 2c_g^2 & 0 & 2 & 2c_g\\ 0 & 2c_g & 0 & 2c_g & 1+c_g^2 \end{bmatrix};\qquad C_*^{-1} = \begin{bmatrix} d_g & 0 & -c_g d_g & 0 & 0\\ 0 & \tfrac{1}{2}d_g^2 & 0 & \tfrac{1}{2}c_g^2 d_g^2 & -c_g d_g^2\\ -c_g d_g & 0 & d_g & 0 & 0\\ 0 & \tfrac{1}{2}c_g^2 d_g^2 & 0 & \tfrac{1}{2}d_g^2 & -c_g d_g^2\\ 0 & -c_g d_g^2 & 0 & -c_g d_g^2 & (1+c_g^2)d_g^2 \end{bmatrix} \qquad (12)$$
Both matrices are symmetric, the rows and columns being ordered as $(X, X^2, Y, Y^2, XY)$. The MinMI is $I_g(c_g) = -\tfrac{1}{2}\log(1-c_g^2)$, with the derivatives entering the Taylor development (3) given by $\partial I_g/\partial c_g = c_g d_g = P_{cr}\lambda$, which is the fifth component of $\lambda$, and $\partial^2 I_g/\partial c_g^2 = d_g^2(1+c_g^2) = C_{cr,\rho_X,\rho_Y}^{-1} = (P_{cr}C_*^{-1}P_{cr})_{5,5}$, i.e., the entry in the fifth row and fifth column of $C_*^{-1}$, as anticipated by Theorem 1. By expressing $Y = c_gX + d_g^{-1/2}W_X$ and $X = c_gY + d_g^{-1/2}W_Y$, with standard Gaussian noises $W_X, W_Y \sim N(0,1)$ and $\mathrm{cor}(X,W_X) = \mathrm{cor}(Y,W_Y) = 0$, one easily gets the conditional means of $T_{cr}$ as $E_{\rho^*_{X,Y}}(XY|X) = c_gX^2$ and $E_{\rho^*_{X,Y}}(XY|Y) = c_gY^2$, leading to the best linear fit with mean square error $C_{cr,\rho_X,\rho_Y} = d_g^{-2}(1+c_g^2)^{-1}$, confirming the second part of (11).
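A short numerical check of this example (our own illustration, not part of the paper; the value of $c_g$ and the Monte-Carlo regression below are assumptions made for the sketch). It verifies that $(C_*^{-1})_{5,5}$ equals the reciprocal of the residual variance of the best linear fit of $XY$ on its conditional means:

```python
import numpy as np

c = 0.6                                   # assumed example correlation c_g
d = 1.0 / (1.0 - c**2)

# Covariance matrix of T = (X, X^2, Y, Y^2, XY) under the bivariate Gaussian, Eq. (12)
C = np.array([[1, 0,      c, 0,      0       ],
              [0, 2,      0, 2*c**2, 2*c     ],
              [c, 0,      1, 0,      0       ],
              [0, 2*c**2, 0, 2,      2*c     ],
              [0, 2*c,    0, 2*c,    1 + c**2]], dtype=float)
inv55 = np.linalg.inv(C)[4, 4]
print(inv55, d**2 * (1 + c**2))           # both equal d_g^2 (1 + c_g^2)

# Monte-Carlo: residual variance of the regression of XY on E(XY|X)=cX^2 and E(XY|Y)=cY^2
rng = np.random.default_rng(0)
n = 10**6
X = rng.standard_normal(n)
Y = c * X + np.sqrt(1 - c**2) * rng.standard_normal(n)
T = X * Y
preds = np.column_stack([c * X**2, c * Y**2, np.ones(n)])
beta, *_ = np.linalg.lstsq(preds, T, rcond=None)
resid_var = np.var(T - preds @ beta)
print(resid_var, 1.0 / (d**2 * (1 + c**2)))   # ~ (1 - c^2)^2 / (1 + c^2)
```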

2.3. Gaussian and Non-Gaussian MI

There is a particular MI decomposition of the type (6,7), already studied in PP12 [12], in which both RVs $X$ and $Y$ are set to standard Gaussians $N(0,1)$ over the real support set $S_X = S_Y = \mathbb{R}$ by Gaussian morphism [31]. The isotropic bivariate standard Gaussian is constrained by the moment set $T_{ind} = T_0 = (X, X^2, Y, Y^2)^T$, with the expectation vector $\theta_{ind} = \theta_0 = E(T_0) = (0,1,0,1)^T$. The sequence of MinMIs is obtained by considering the indexed moment set (Equation 14 of PP12 [12], changing the index $p$ there into $j$ here):
$$T_j \equiv \{\,X^rY^s:\ 1 \leq r+s \leq j,\ (r,s)\in\mathbb{N}_0^2\,\},\qquad j\in\mathbb{N} \qquad (13)$$
comprising the bivariate monomials of total order up to $j$. Only even natural $j$ provide integrable ME-PDFs over $\mathbb{R}^2$, thus excluding odd $j$ values from the sequence $\{T_0,\theta_0\}, \{T_2,\theta_2\}, \{T_4,\theta_4\},\ldots,\{T_\infty,\theta_\infty\}$ of {moments, expectations} set pairs. The independent parts of all sets are ME-congruent with $\{T_0,\theta_0\}$, i.e., they include the high-order univariate moment expectations of the standard Gaussian. The numbers of independent and cross moments of $T_j$ (13) are $2j$ and $j(j-1)/2$, respectively (e.g., (4,1), (8,6), (12,15) and (16,28) for $j = 2, 4, 6, 8$), as the enumeration sketch after this subsection illustrates. Other, more efficient basis cross functions could be used, for example orthogonal polynomials. Using the notation of Section 2.2, the maximum entropy limit $H(\theta_\infty)$ of the sequence coincides with the true $(X,Y)$ Shannon entropy. As presented in PP12, we define the positive Gaussian MI $I_g$, the non-Gaussian MI $I_{ng}$ and the non-Gaussian MI $I_{ng,j}$ of even order $j$, respectively, as:
$$I_g = I_{2/0} = H(\theta_0) - H(\theta_2) = -\tfrac{1}{2}\log(1-c_g^2) \equiv I_g(c_g); \qquad I_{ng} = I_{\infty/2} = H(\theta_2) - H(\theta_\infty); \qquad I_{ng,j} = I_{j/p=2} = H(\theta_2) - H(\theta_j) \qquad (14)$$
with the MI decomposed as $I(X,Y) = I_g + I_{ng} \geq I_g + I_{ng,j}$. The Gaussian MI depends on the Gaussian correlation $c_g$, i.e., the Pearson correlation between the Gaussianized variables $(X,Y)$. The non-Gaussian MI vanishes iff the joint PDF is Gaussian.
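A small helper (illustrative only) that enumerates the monomial exponents in $T_j$ and reproduces the (independent, cross) constraint counts quoted above:

```python
from itertools import product

def monomials(j):
    """Exponent pairs (r, s) with 1 <= r+s <= j, split into independent and cross parts."""
    ind = [(r, 0) for r in range(1, j + 1)] + [(0, s) for s in range(1, j + 1)]
    cross = [(r, s) for r, s in product(range(1, j), repeat=2) if r + s <= j]
    return ind, cross

for j in (2, 4, 6, 8):
    ind, cross = monomials(j)
    print(j, len(ind), len(cross))   # -> (4, 1), (8, 6), (12, 15), (16, 28)
```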

2.4. Estimators of the Minimum MI from Data and Their Errors

This section is devoted to the study of the estimators (and their errors) of the incremental MI $I_{j/p}$ $(j > p)$ (7) between the a priori RVs $\hat X, \hat Y$ or, equivalently, between their transformed RVs $X, Y$.
In practice, the incremental MI $I_{j/p}$, $j > p$, is estimated by a two-step algorithm: first the computation of the expectations, then the MEs and the partial MIs. The vector of expectations $\theta_{N,j}$ is estimated from the $N$-sized bivariate series $(X_l, Y_l),\ l = 1,\ldots,N$, obtained by morphism from the original $N$ iid realizations of the a priori RVs $(\hat X_l, \hat Y_l),\ l = 1,\ldots,N$ (e.g., time series, spatially distributed data), as the arithmetic average:
$$E_N(T_j) \equiv \theta_{N,j} = N^{-1}\sum_{l=1}^{N}T_j(X_l, Y_l) = \theta_j + \Delta\theta_{N,j} \qquad (15)$$
where $E_N$ stands for the expectation over the $N$ realizations and $\Delta\theta_{N,j}$ is the vector of moment estimation errors. The first-step error comes from the difference $H(\theta_{N,j}) - H(\theta_j)$, due to the marginal morphisms and the finite bivariate sampling, i.e., the cross combinations of the variable realizations. We will see that MI errors depend crucially on the moment estimation errors and their statistics.
Secondly, the true ME $H(\theta_{N,j})$ is estimated as the minimum $\hat H(\theta_{N,j})$ of a functional that is reached by nonlinear minimization techniques (e.g., gradient descent), taking as inputs $\theta_{N,j}$ and a set of calibrated parameters. The second-step error comes from the difference $\hat H - H \equiv \delta H$.
The estimator of $I_{j/p}$, along with its error decomposed into the first-step ($\Delta I_{N,j/p,\theta}$) and second-step ($\Delta I_{N,j/p,H}$) contributions, is written as
$$I_{N,j/p} \equiv \hat H(\theta_{N,p}) - \hat H(\theta_{N,j}) = I_{j/p} + \Delta I_{N,j/p}; \qquad \Delta I_{N,j/p} = \Delta I_{N,j/p,\theta} + \Delta I_{N,j/p,H}$$
$$\Delta I_{N,j/p,\theta} \equiv \big[H(\theta_j) - H(\theta_{N,j})\big] - \big[H(\theta_p) - H(\theta_{N,p})\big] \equiv \Delta H_{N,j} - \Delta H_{N,p}$$
$$\Delta I_{N,j/p,H} \equiv \big[\hat H(\theta_{N,p}) - H(\theta_{N,p})\big] - \big[\hat H(\theta_{N,j}) - H(\theta_{N,j})\big] \equiv (\delta H)_{N,p} - (\delta H)_{N,j} \qquad (16)$$
where $\Delta I_{N,j/p,\theta}$ is the difference between the entropy anomalies $\Delta H$ due to input errors. The second-step error comes from the numerical implementation and round-off errors of the entropy functional, due to: (a) the coarse-grained representation of the continuous PDF; (b) the numerical approximation of the ME functional and its gradient; (c) the stopping criteria of the iterative gradient-descent technique. In this article we neglect the effect of the second-step error, thus approximating the MinMI error by $\Delta I_{N,j/p} \approx \Delta I_{N,j/p,\theta}$, which depends uniquely on the sampling errors $\Delta\theta_{cr} = \Delta\theta_{N,cr,j}$ of the cross expectations.

3. Errors of the Expectation’s Estimators

3.1. Generic Properties

The distribution of the MinMI error and its statistics (bias, variance, quantiles) depends on the distribution of the vector of moment errors $\Delta\theta_{N,cr,j}$ entering (9). Here we present a generic statistical modeling of those errors, with emphasis on the influence of the variable morphisms and of the bivariate sampling.
Let us assume the reasonable hypothesis that the discrete estimator $\theta_{N,j}$ (15) is a consistent estimator of the mean $\theta_j$, i.e., the error $\Delta\theta_{N,j} \to 0$ as $N\to\infty$ in probability, with both the bias and the covariance matrix converging to zero as the data size grows:
$$b_{\Delta\theta_{N,j}} \equiv E(\Delta\theta_{N,j}) \underset{N\to\infty}{\longrightarrow} 0; \qquad M_{\Delta\theta_{N,j}} \equiv E\big[(\Delta\theta'_{N,j})(\Delta\theta'_{N,j})^T\big] \underset{N\to\infty}{\longrightarrow} 0; \qquad \Delta\theta'_{N,j} = \Delta\theta_{N,j} - b_{\Delta\theta_{N,j}} \qquad (17)$$
where the prime stands for the perturbation with respect to the mean. The exact form of the components of $b_{\Delta\theta_{N,j}}$ and $M_{\Delta\theta_{N,j}}$ is rather difficult to establish as a consequence of imposing the marginal distributions, which reduces the randomness to the covariate sampling. The estimator variances scale as $O(1/N)$, though they are smaller than in the case of $N$ iid outcomes. Moreover, we assume that the convergence rate is higher (faster convergence) for the squared bias than for the variances, which is supported by a few examples in the next section.

3.2. The Effects of Morphisms and Bivariate Sampling

Let us start with the effect of the morphisms transforming the original variables $(\hat X, \hat Y)$ into their transformed counterparts $(X,Y)$. That depends on the rank of the variables within the available sample. Without loss of generality, let us sort $\hat X$ in ascending order in the sample, i.e., the $l$-th value equals the ordered $l$-th value, $\hat X_l = \hat X_{(l)}$, $l = 1,\ldots,N$. The bivariate $l$-th realization is $(\hat X_l, \hat Y_l = \hat Y_{(l'(l))})$, where $l'(l): \{1,\ldots,N\}\to\{1,\ldots,N\}$ is the random bivariate rank permutation depending on the particular sample (e.g., if the first value of $\hat X$ comes with the third value of $\hat Y$, then $l'(l=1)=3$, and so on). In particular, $l'(l) = l$ when the correlation equals one. The inverse of the function $l'(l)$ is written $l(l')$. The probability p-values of $\hat X_{(l)}, \hat Y_{(l')}$, i.e., their marginal cumulated probability functions (CDFs), are respectively $p_{X,l}, p_{Y,l'}$, growing as functions of $l, l'$. Those p-values can only be inferred from the sample or prescribed from a priori hypotheses. The sorted transformed RVs given by ME-morphisms are:
$$X_{(l)} = \Phi_{ME,X}^{-1}(p_{X,l}); \qquad Y_{(l')} = \Phi_{ME,Y}^{-1}(p_{Y,l'}); \qquad l, l' = 1,\ldots,N \qquad (18)$$
where $\Phi_{ME,X}, \Phi_{ME,Y}$ are the ME prescribed CDFs (e.g., Gaussian CDFs) of $X$ and $Y$, respectively. The morphism then relies upon the invertible transformations $\hat X_{(l)} \to X_{(l)}$ and $\hat Y_{(l')} \to Y_{(l')}$. The bivariate transformed realizations $(X_l, Y_l = Y_{(l'(l))}),\ l = 1,\ldots,N$ are then used to compute the expectations (Equation 15). Since the exact marginal distributions are not known, their cumulated probabilities must be prescribed, for example with regular steps $\Delta p_{X,l} = \Delta p_{Y,l'} = 1/N$, in which case $p_{X,l}, p_{Y,l'} = l/(N+1)$, $l = 1,\ldots,N$.
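A minimal sketch of this rank-based ME-morphism for Gaussian targets (our own illustration; the non-Gaussian a priori variables are invented for the example, while the plotting-position choice $p = l/(N+1)$ follows the text):

```python
import numpy as np
from scipy.stats import norm, rankdata

def gaussian_morphism(x_hat):
    """Map a sample to standard-Gaussian values through its ranks, p = rank / (N + 1)."""
    n = len(x_hat)
    p = rankdata(x_hat) / (n + 1.0)      # prescribed cumulated probabilities
    return norm.ppf(p)                   # Phi^{-1}(p): look-up table of fixed quantiles

rng = np.random.default_rng(1)
x_hat = rng.exponential(size=500)        # arbitrary non-Gaussian a priori variable
y_hat = np.log(x_hat) + 0.3 * rng.standard_normal(500)

X, Y = gaussian_morphism(x_hat), gaussian_morphism(y_hat)
c_g = np.mean(X * Y)                     # Gaussian correlation feeding I_g = -0.5*log(1 - c_g^2)
print(c_g, -0.5 * np.log(1 - c_g**2))
```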
In order to obtain the moments of $\Delta\theta_{N,j}$, we need to rewrite it in a convenient form:
$$\Delta\theta_{N,j} \equiv \theta_{N,j} - \theta_j = \sum_{l,l'=1}^{N} T_j\big(\Phi_{ME,X}^{-1}(p_{X,l}),\Phi_{ME,Y}^{-1}(p_{Y,l'})\big)\,N^{-1}\delta_{l'(l),l'} - \int_0^1\!\!\int_0^1 T_j\big(\Phi_{ME,X}^{-1}(u),\Phi_{ME,Y}^{-1}(v)\big)\,c[u,v]\,du\,dv$$
$$\approx \sum_{l,l'=1}^{N} T_j(X_{(l)}, Y_{(l')})\,\bigg[\frac{N^{-1}\delta_{l'(l),l'}}{\Delta p_{X,l}\,\Delta p_{Y,l'}} - c[p_{X,l},p_{Y,l'}]\bigg]\,\Delta p_{X,l}\,\Delta p_{Y,l'} \qquad (19)$$
where $\delta_{l'(l),l'} = \delta_{l(l'),l}$, $l,l' \in \{1,\ldots,N\}$, is the Kronecker delta, $u = \int_{-\infty}^{X}\rho^*_{T_X,\theta_X}(t)\,dt$ and $v = \int_{-\infty}^{Y}\rho^*_{T_Y,\theta_Y}(t)\,dt$ are the marginal cumulated probabilities, corresponding respectively to the probabilities $p_{X,l}$ and $p_{Y,l'}$ in the sum (19), and $c[u,v]$ is the copula density [23] (the ratio between the joint PDF and the product of the marginal PDFs). By looking at (19), one sees that $N^{-1}\delta_{l'(l),l'}/(\Delta p_{X,l}\Delta p_{Y,l'})$ is an estimator of the copula $c[p_{X,l},p_{Y,l'}]$. In particular, if $X, Y$ are independent, then $l$ and $l'(l)$ are independent, $c[p_{X,l},p_{Y,l'}] = 1$ and $E(\delta_{l'(l),l'}\,|\,l,l') = N^{-1}$, i.e., there is on average an equipartition of the bivariate ranks.
Equation (19) shows that the moments of $\Delta\theta_{N,j}$ depend on the statistics of the copula estimator error, which can be rather tricky due to the imposition of the marginal PDFs by morphisms, presenting unusual effects with respect to the classical results for samples of iid realizations [32].
To proceed, let us denote the random perturbation $\eta_{l,l'} \equiv \delta_{l'(l),l'} - E[\delta_{l'(l),l'}]$ for all $l, l'$; then $E[\eta_{l,l'}] = 0$, and the constraints $\sum_{l=1}^{N}\delta_{l'(l),l'} = \sum_{l'=1}^{N}\delta_{l'(l),l'} = 1$, or $\sum_{l=1}^{N}\eta_{l,l'} = \sum_{l'=1}^{N}\eta_{l,l'} = 0$, are satisfied as a consequence of $l'(l)$ and $l(l')$ being index permutations of $N$ values. Therefore, taking those constraints into account, $\Delta\theta'_{N,j}$ can be written in different forms in terms of perturbations:
$$\Delta\theta'_{N,j} = \sum_{l,l'=1}^{N} T_{j,l,l'}\,N^{-1}\eta_{l,l'} = \sum_{l,l'=1}^{N} T'_{j,l,l'}\,N^{-1}\delta_{l'(l),l'} = \sum_{l,l'=1}^{N} T'^{X}_{j,l,l'}\,N^{-1}\delta_{l'(l),l'} = \sum_{l,l'=1}^{N} T'^{Y}_{j,l,l'}\,N^{-1}\delta_{l'(l),l'} = \sum_{l=1}^{N} T'_{j,l,l'(l)}\,N^{-1} = \sum_{l=1}^{N} T'^{X}_{j,l,l'(l)}\,N^{-1} = \sum_{l=1}^{N} T'^{Y}_{j,l,l'(l)}\,N^{-1} \qquad (20)$$
where $T_{j,l,l'} \equiv T_j(X_{(l)}, Y_{(l')})$ and its perturbation with respect to the global mean is $T'_{j,l,l'} \equiv T_{j,l,l'} - E(\theta_{N,j})$. The perturbation with respect to the $X$-conditional mean is $T'^{X}_{j,l,l'} \equiv T_{j,l,l'} - E(T_j|X = X_{(l)})$, where $E(T_j|X = X_{(l)}) = \sum_{l'=1}^{N} T_{j,l,l'}\,E[\delta_{l'(l),l'}]$. A similar definition holds for the $Y$-perturbation $T'^{Y}_{j,l,l'} \equiv T_{j,l,l'} - E(T_j|Y = Y_{(l')})$.
The estimators (15) of the independent constraints (components of $T_j$ depending solely on $X$ or on $Y$) have a bias but vanishing variances (null components of $\Delta\theta'_{N,j}$), since the perturbations $T'^{X}_j$ or $T'^{Y}_j$ vanish because the local values of $T_j$ coincide with one of the ($X$- or $Y$-) conditional means. That bias reduces to a numerical integration error. For example, for the expectations of $X$-depending functions, the error reduces to the bias $\Delta\theta_{X,N,j} = \sum_{l=1}^{N} T_{X,j}(X_{(l)})\,N^{-1} - \int_0^1 T_{X,j}(\Phi_{ME,X}^{-1}(u))\,du$, of order $O(N^{-2})$ as given by the trapezoidal integration rule for bounded $T_{X,j}$ functions. The estimators of the cross expectations have both bias and non-vanishing variances.
Our goal now is to estimate the covariance matrix $M_{\Delta\theta_{N,j}}$ (17). As a consequence of the non-replacement of quantiles or rankings, the deviations $T'_{j,l_1,l'(l_1)}$ and $T'_{j,l_2,l'(l_2)}$ in (20) are not necessarily independent for $l_1 \neq l_2$, which would not occur if different realizations were independent, leading to $\mathrm{var}(\theta_{N,j}) = N^{-1}\mathrm{var}(T_j)$. Statistics without replacement generally lead to a deflation of the estimator variances compared to those obtained under the hypothesis of independent realizations [33] or, in other words, $\mathrm{var}(\theta_{N,j}) \leq N^{-1}\mathrm{var}(T_j)$. Therefore, in order to get an $N^{-1}$-scaled expression for $\mathrm{var}(\theta_{N,j})$, we will consider another type of deviations of $T_j$, consistent with (20).
We propose new deviations, denoted by $T'^{lms}_j$, given by a linear combination of the global deviation $T'_j$ and the marginal deviations $T'^{X}_j, T'^{Y}_j$, with the respective coefficients summing to 1 and yielding the least mean square (lms). Those deviations are consistently given by:
$$T'^{lms}_j = (1-\alpha_X-\alpha_Y)\,T'_j + \alpha_X T'^{X}_j + \alpha_Y T'^{Y}_j = T'_j - \alpha_X\big[E(T_j|X) - E(\theta_{N,j})\big] - \alpha_Y\big[E(T_j|Y) - E(\theta_{N,j})\big] \qquad (21)$$
which are the residuals of the best linear fit of $T'_j$ using the conditional means $E(T_j|X)$ and $E(T_j|Y)$ as predictors, with the coefficients being those of the linear regression:
$$\begin{bmatrix}\alpha_X\\ \alpha_Y\end{bmatrix} = \begin{bmatrix}\mathrm{var}[E(T_j|X)] & \mathrm{cov}[E(T_j|X),E(T_j|Y)]\\ \mathrm{cov}[E(T_j|X),E(T_j|Y)] & \mathrm{var}[E(T_j|Y)]\end{bmatrix}^{-1}\begin{bmatrix}\mathrm{cov}[E(T_j|X),T_j]\\ \mathrm{cov}[E(T_j|Y),T_j]\end{bmatrix} \qquad (22)$$
Those deviations take into account the maximum implicit knowledge of the marginal PDFs through their conditional means. We will now use them to express the error moments.
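A small sketch of the regression (21)–(22) (illustrative only; the conditional means are passed in explicitly, as assumed known analytically or pre-estimated, and the closing comparison uses the bivariate Gaussian example of Section 2.2):

```python
import numpy as np

def lms_residuals(T, condX, condY):
    """Residuals T'^lms of the best linear fit of T on its conditional means (Eqs. 21-22)."""
    Tc, px, py = T - T.mean(), condX - condX.mean(), condY - condY.mean()
    S = np.cov(np.vstack([px, py]))                   # 2x2 predictor covariance matrix
    b = np.array([np.cov(px, Tc)[0, 1], np.cov(py, Tc)[0, 1]])
    aX, aY = np.linalg.solve(S, b)                    # regression coefficients alpha_X, alpha_Y
    return Tc - aX * px - aY * py

# Example with T = XY under a bivariate Gaussian, where E(XY|X) = c X^2 and E(XY|Y) = c Y^2
rng = np.random.default_rng(2)
c = 0.7
X = rng.standard_normal(200_000)
Y = c * X + np.sqrt(1 - c**2) * rng.standard_normal(200_000)
res = lms_residuals(X * Y, c * X**2, c * Y**2)
print(np.var(res), (1 - c**2)**2 / (1 + c**2))        # var(T | lms) vs its value in this Gaussian case
```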
The expression of the error covariances in $M_{\Delta\theta_{N,j}}$ relies upon the expansion (20), with the perturbations written as functions of the mean values of products of the deltas $\delta_{l'(l),l'}$. These means depend on the true copula and are written as:
$$E\big(\delta_{l'(l_1),l'_1}\,\delta_{l'(l_2),l'_2}\big) = \begin{cases} 0, & \text{if } [\,l_1 = l_2,\ l'_1 \neq l'_2\,] \text{ or } [\,l'_1 = l'_2,\ l_1 \neq l_2\,] \\[2pt] E\big(\delta_{l'(l_1),l'_1}\big) = N^{-1}\ (*), & \text{if } [\,l_1 = l_2,\ l'_1 = l'_2\,] \\[2pt] N^{-1}(N-1)^{-1}\ (*), & \text{if } [\,l_1 \neq l_2,\ l'_1 \neq l'_2\,] \end{cases} \qquad (23)$$
where we have used the fact that $l'(l)$ and its inverse $l(l')$ are rank permutations (no duplication allowed). The values indicated with an asterisk in (23) correspond to $X, Y$ independent ($l'(l)$ independent of $l$). These moments are difficult to obtain in practice unless the variables are independent or the bivariate PDF is known a priori. From these moments, a large ensemble of $N$-sized surrogate samples can be generated, from which empirical estimator covariances are computed.
Then, by plugging (23) into the generic ($\alpha$-th row, $\beta$-th column) entry of $M_{\Delta\theta_{N,j}}$, and denoting the $\alpha$-th and $\beta$-th components of $T_j$ by $T_{j,\alpha}$ and $T_{j,\beta}$, with estimation errors $\Delta\theta_{N,j,\alpha}, \Delta\theta_{N,j,\beta}$, we get
$$(M_{\Delta\theta_{N,j}})_{\alpha,\beta} = E\big(\Delta\theta'_{N,j,\alpha}\,\Delta\theta'_{N,j,\beta}\big) = \sum_{l_1,l'_1,l_2,l'_2}\big[T'_{j,\alpha}(X_{(l_1)},Y_{(l'_1)})\,T'_{j,\beta}(X_{(l_2)},Y_{(l'_2)})\big]\,N^{-2}\,E\big(\delta_{l'(l_1),l'_1}\,\delta_{l'(l_2),l'_2}\big)$$
$$= N^{-1}E\big(E_N(T'_{j,\alpha}T'_{j,\beta})\big) + N^{-2}\sum_{l_1\neq l_2}E\big[T'_{j,\alpha}(X_{(l_1)},Y_{(l'(l_1))})\,T'_{j,\beta}(X_{(l_2)},Y_{(l'(l_2))})\big] \qquad (24)$$
The first term on the rhs of (24) is $N^{-1}E[\mathrm{cov}_N(T_{j,\alpha},T_{j,\beta})]$, i.e., $1/N$ times the expectation of the covariance among the $N$ realizations. That term converges asymptotically to $N^{-1}\mathrm{cov}(T_{j,\alpha},T_{j,\beta})$, i.e., the estimator covariance under the hypothesis of $N$ iid realizations. However, when the marginals are imposed or the morphism of the variables is performed, that hypothesis no longer holds, because the covariance estimator is a statistic without replacement [33], since the quantiles of $X$ and $Y$ are not repeated in the sample. Therefore, the additional term of (24) reduces the estimator variances with respect to the case of iid trials.
Looking for a correct representation of the cross estimator variances when the marginals are imposed, we represent the $T_j$ perturbations by $T'^{lms}_j$ (21), the residuals of the best linear regression. There, we benefit from a generic property of least-squares regression residuals: they are uncorrelated with the predictors (here the conditional means $E(T_j|X), E(T_j|Y)$). This means that $T'^{lms}_j$ is represented in terms of noises that are uncorrelated with both $X$ and $Y$. Consequently, different realizations of $T'^{lms}_j$ are uncorrelated, which simplifies the expression of the covariance matrix. Therefore, using those lms perturbations, the generic matrix entry $(M_{\Delta\theta_{N,j}})_{\alpha,\beta}$ (24) is rewritten as
$$(M_{\Delta\theta_{N,j}})_{\alpha,\beta} = N^{-2}\sum_{l_1}E\big[T'^{lms}_{j,\alpha}(X_{(l_1)},Y_{(l'(l_1))})\,T'^{lms}_{j,\beta}(X_{(l_1)},Y_{(l'(l_1))})\big] + N^{-2}\sum_{l_1,\,l_2\neq l_1}E\big[T'^{lms}_{j,\alpha}(X_{(l_1)},Y_{(l'(l_1))})\,T'^{lms}_{j,\beta}(X_{(l_2)},Y_{(l'(l_2))})\big] = N^{-1}E\big(E_N(T'^{lms}_{j,\alpha}T'^{lms}_{j,\beta})\big) + O(N^{-2}) \qquad (25)$$
The $N^{-1}$-scaled term of (25) converges asymptotically (as $N\to\infty$) to $N^{-1}E(T'^{lms}_{j,\alpha}T'^{lms}_{j,\beta})$, i.e., $1/N$ times the covariance between the residuals of the linear regression relying upon the conditional means. This leads us to formulate the Theorem:
Theorem 2:
Let us suppose that the $X$ and $Y$ marginal PDFs are imposed by variable morphisms. Then the covariance between the $N$-sized estimators $\theta_{N,\alpha}$ and $\theta_{N,\beta}$ of the means of the cross functions $T_\alpha(X,Y)$ and $T_\beta(X,Y)$ is given by
$$\mathrm{cov}(\theta_{N,\alpha},\theta_{N,\beta}) = N^{-1}E\big(E_N(T'^{lms}_{\alpha}T'^{lms}_{\beta})\big) \underset{N\to\infty}{\longrightarrow} N^{-1}E\big(T'^{lms}_{\alpha}T'^{lms}_{\beta}\big) \qquad (26)$$
where $T'^{lms}_\alpha = T'_\alpha - \alpha_X[E(T_\alpha|X) - \theta_\alpha] - \alpha_Y[E(T_\alpha|Y) - \theta_\alpha]$ is the residual of the best linear fit taking the conditional means as predictors, and $\alpha_X, \alpha_Y$ are the corresponding coefficients (and similarly for $T'^{lms}_\beta$). The expectation is computed with the true PDF of the population. The proof was given above in the text.
An immediate corollary of this Theorem applies in the case where the data are governed by a certain MinMI-PDF issued from $\{T_{cr},\theta_{cr}\}, \rho_X, \rho_Y$. Under those conditions, $T_\alpha$ and $T_\beta$ are themselves cross functions from the constraining set $T_{cr}$ and the $\mathrm{cov}(\theta_{N,\alpha},\theta_{N,\beta})$ are entries of $M_{\Delta\theta_N}$ (17). Then, if the true joint PDF is the MinMI-PDF issued from $\{T_{cr},\theta_{cr}\}, \rho_X, \rho_Y$, we get:
$$P_{cr}\,M_{\Delta\theta_N}\,P_{cr} = N^{-1}\,C_{cr,\rho_X,\rho_Y} \qquad (27)$$
where we use the covariance matrix introduced in (4). Under those conditions one has the identity for the matrix product $(P_{cr}M_{\Delta\theta_N}P_{cr})\,C_{cr,\rho_X,\rho_Y}^{-1} = N^{-1}P_{cr}$, which will be crucial for the evaluation of the asymptotic MinMI estimation bias.

3.3. Errors of the Estimators of Polynomial Moments under Gaussian Distributions

In this section we assess the bias and the covariance of the estimators, and their expression (25), when the constraints are bivariate monomials (13) and Gaussian morphisms are performed as described in Section 2.3. For the purpose of discussing the statistical tests of non-Gaussianity presented in a later section, we restrict our study to the case of $N$-sized samples of iid realizations of independent variables $\hat X, \hat Y$ (taken, without loss of generality, as standard Gaussians). An empirical Monte-Carlo strategy is used by taking the standard Gaussian morphisms $X, Y$ of the $N$ outcomes, from which one estimates the expectation of a vector of generic functions $T(X,Y) = X^rY^s$, $r,s \geq 0$ (13). The bias is $b = E(E_N(T)) - E(T) = \mu_{N,r}\mu_{N,s} - \mu_r\mu_s$, which is determined by the fixed Gaussian centered moments $\mu_r \equiv E(X^r)$ and $\mu_{N,r} \equiv E_N(X^r)$, $r \geq 0$. The sample is centered and standardized such that $\mu_{N,1} = 0$ and $\mu_{N,2} = 1$. The variance $\mathrm{var}(E_N(T))$ of $E_N(T)$ can be rigorously computed from the quadruple sum (24), using the $N$ quantiles of the standard Gaussian and the delta expectations (23) for the case of $X, Y$ independent of each other. However, the computation of that sum is very time-consuming for high $N$ values. For that reason, we approximate it by a Monte-Carlo mean obtained with $N_{rea} = 5000$ independent realizations of the $N$-sized samples. The finite and asymptotic values of $N^{-1}E(\mathrm{var}_N(T))$, valid for the case of $N$ iid trials, are given by:
$$N^{-1}E(\mathrm{var}_N(T)) = N^{-1}\big(\mu_{N,2r}\,\mu_{N,2s} - (\mu_{N,r}\,\mu_{N,s})^2\big) \underset{N\to\infty}{\longrightarrow} N^{-1}\mathrm{var}(T) = N^{-1}\big(\mu_{2r}\,\mu_{2s} - (\mu_r\,\mu_s)^2\big) \qquad (28)$$
whereas those obtained from the least mean squares (25), which are smaller than those of (28), are:
$$\mathrm{var}(E_N(T)) \approx N^{-1}E(\mathrm{var}_N(T|lms)) = N^{-1}\mathrm{var}_N(T|lms) = N^{-1}\big(\mu_{N,2r}\,\mu_{N,2s} - \mu_{N,2r}(\mu_{N,s})^2 - \mu_{N,2s}(\mu_{N,r})^2 + (\mu_{N,s}\,\mu_{N,r})^2\big)$$
$$\underset{N\to\infty}{\longrightarrow} N^{-1}\mathrm{var}(T|lms) = N^{-1}\big(\mu_{2r}\,\mu_{2s} - \mu_{2r}(\mu_s)^2 - \mu_{2s}(\mu_r)^2 + (\mu_s\,\mu_r)^2\big) \qquad (29)$$
Figure 1 compares the variance $\mathrm{var}(E_N(T))$ with the squared bias $b^2$ of the estimator, both relevant for the bias of the MinMI estimation. In the same figure, the empirical variance $\mathrm{var}(E_N(T))$ is also compared with its approximation $N^{-1}\mathrm{var}(T|lms)$ and with the variance for the case of iid trials, $N^{-1}\mathrm{var}(T)$. We use $T = X^4Y^2, X^6Y^2, X^8Y^2$, respectively in panels (a), (b), (c), sorted by growing total variance $\mathrm{var}(T)$, which is especially concentrated in the distribution tails. In all panels, $N = 25\cdot 2^k$, $k = 0,\ldots,11$. We have verified that the empirical variance $\mathrm{var}(E_N(T))$ agrees very well with the theoretical value $N^{-1}\mathrm{var}_N(T|lms)$ for all $N$ (not shown).
At this point, some generic conclusions can be drawn. The estimator variance $\mathrm{var}(E_N(T))$ grows with $\mathrm{var}(T)$ and dominates over the squared bias, except for small $N$ values and higher values of $\mathrm{var}(T)$. This will lead us to neglect the bias of the covariance estimators in the MinMI asymptotic statistics.
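A minimal Monte-Carlo sketch (our own illustration, not the authors' code) of the comparison behind Figure 1, for the monomial $T = X^4Y^2$ in the independent case; the sample size and ensemble size below are assumptions made for the example:

```python
import numpy as np
from scipy.stats import norm

def morph(z):
    """Rank-based standard-Gaussian morphism, p = rank / (N + 1)."""
    n = len(z)
    return norm.ppf((np.argsort(np.argsort(z)) + 1) / (n + 1.0))

rng = np.random.default_rng(3)
r, s, N, n_rea = 4, 2, 200, 5000
est = np.empty(n_rea)
for k in range(n_rea):
    X, Y = morph(rng.standard_normal(N)), morph(rng.standard_normal(N))   # independent case
    est[k] = np.mean(X**r * Y**s)

mu = lambda k: norm.moment(k)             # moments of N(0,1): mu(2)=1, mu(4)=3, mu(8)=105
var_iid = mu(2*r)*mu(2*s) - (mu(r)*mu(s))**2                                   # Eq. (28) limit
var_lms = var_iid - mu(2*r)*mu(s)**2 - mu(2*s)*mu(r)**2 + 2*(mu(r)*mu(s))**2   # Eq. (29) limit
print(np.var(est), var_lms / N, var_iid / N)    # empirical vs lms vs iid variances
```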
Figure 1. Squared empirical bias $b^2$ (black lines) of the $N$-based $T$-expectations as a function of $N$; empirical variances $\mathrm{var}(E_N(T))$ (red lines); approximated variances $N^{-1}\mathrm{var}(T|lms)$ (blue lines); and the variance for the case of $N$ iid trials, $N^{-1}\mathrm{var}(T)$ (green lines). $T$ stands for different bivariate monomials: $X^4Y^2$ (a), $X^6Y^2$ (b) and $X^8Y^2$ (c).
From Figure 1 we also note that the variance reduction coming from the morphism of the variables tends to decrease for higher $N$ values, where the effect of sampling prevails with an $N^{-1}$ scaling of the estimator variance, which is then closely approximated by the asymptotic lms variance $N^{-1}\mathrm{var}(T|lms)$. This can lead to a slight increase of $\mathrm{var}(E_N(T))$ for small $N$, followed by a decrease (e.g., $X^6Y^2$), due to the fact that $\mathrm{var}_N(T|lms)$ is small for lower values of $N$.
Moreover, thanks to the Central Limit Theorem (CLT), the distribution of the estimator errors tends towards Gaussianity with increasing $N$, with a slower convergence rate for higher $T$ variances. However, the Gaussian PDF limit has an infinite support, which must be truncated, since the estimated moments $E_N(T)$ must lie within a kind of polytope with edges determined by Schwarz-like inequalities, as shown in PP12 [12] (e.g., $|E_N(XY)| \leq 1$ and $|E(X^2Y)|/[2(1-c_g^2)]^{1/2} \leq 1$), working as bounds for the nonlinear correlations. Since the estimators have bounds, the estimation errors do so as well. This can be handled by using the Fisher Z-transform $\mathrm{arctanh}(c)$ of a generic linear or nonlinear correlation $c$ and projecting it onto the real support (not done here).
We now illustrate, in Figure 2, Theorem 2 under different values of the correlation $c_g \in [0,1]$. We consider variables $X, Y$ with a joint Gaussian PDF of correlation $c_g \in [0,1]$ and standard Gaussian marginals. In Figure 2 we compare the empirical Monte-Carlo value of $N\,\mathrm{var}(E_N(T))$ (MC in the figure), within an ensemble of 5000 $N$-sized samples, with the theoretical values $\mathrm{var}(T|lms)$ (case where the morphism is performed, AN in the figure) and $\mathrm{var}(T)$ (case of iid realizations, ANiid in the figure). We have used samples of $N = 200$, which is presumed to be near the beginning of the asymptotic regime, and two cross functions: $T(X,Y) = XY$ and $T(X,Y) = X^2Y$. The aforementioned variances are $\mathrm{var}(XY|lms) = (1-c_g^2)^2/(1+c_g^2)$ and $\mathrm{var}(XY) = 1+c_g^2$, while $\mathrm{var}(X^2Y) = 3 + 12c_g^2$ and $\mathrm{var}(X^2Y|lms)$ is the mean squared residual of the best linear fit using the predictors $E(X^2Y|X) = c_gX^3$ and $E(X^2Y|Y) = c_g^2Y^3 + (1-c_g^2)Y$. For both functions, a very good agreement is verified between the Monte-Carlo values and the theoretical ones, within 1–5% relative error. A generic result of Figure 2 is that, under the fixing (presetting) of the marginals, the sampling variability of the cross estimators falls to zero as the absolute value of the correlation tends to one.
Figure 2. $N$ times the Monte-Carlo variances, $N\,\mathrm{var}(E_N(T))$ (thick solid lines), and their theoretical analytical value $\mathrm{var}(T|lms)$ (thick dashed lines), both under imposed marginals (morphisms), together with the analytical value $N\,\mathrm{var}(E_N(T)) = \mathrm{var}(T)$ for iid data (thin solid lines). $T$ stands for different bivariate monomials: $XY$ (black curves), $X^2Y$ (red curves). $N = 200$.
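A compact sketch (again our own illustration, at a single assumed correlation value) reproducing the Figure 2 comparison for $T = XY$:

```python
import numpy as np
from scipy.stats import norm

def morph(z):
    n = len(z)
    return norm.ppf((np.argsort(np.argsort(z)) + 1) / (n + 1.0))

rng = np.random.default_rng(4)
c_g, N, n_rea = 0.6, 200, 5000
est = np.empty(n_rea)
for k in range(n_rea):
    x = rng.standard_normal(N)
    y = c_g * x + np.sqrt(1 - c_g**2) * rng.standard_normal(N)
    X, Y = morph(x), morph(y)                 # imposed standard Gaussian marginals
    est[k] = np.mean(X * Y)

print(N * np.var(est))                        # MC value of N var(E_N(XY))
print((1 - c_g**2)**2 / (1 + c_g**2))         # var(XY | lms), morphism case
print(1 + c_g**2)                             # var(XY), iid case
```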

3.4. Statistical Modeling of Moment Estimation Errors

The above qualitative results give empirical support to Theorem 2 concerning the covariance of the estimation errors and the neglect of the estimation biases. Therefore, the part of the matrix $M_{\Delta\theta_{N,j}}$ (17) regarding the cross components is modeled as:
$$M_{\Delta\theta_{N,cr,j}} \approx N^{-1}E\big(E_N(T'^{lms}_{cr,j}\,T'^{lms\,T}_{cr,j})\big) \equiv N^{-1}C_{N,cr,j|lms} \qquad (30)$$
with the approximation being valid up to terms of order $o(N^{-1})$. In practice, the matrix $E(T'^{lms}_{cr,j}T'^{lms\,T}_{cr,j})$ requires the estimation of the conditional means for each value of $X$ and $Y$.
We now formulate the distribution of the moment estimation errors in the asymptotic regime of sufficiently high $N$. Thanks to the multivariate Central Limit Theorem [34], one can assume that the unbiased estimation error vector follows a multivariate Gaussian distribution, which is written as
$$\Delta\theta_{N,cr,j} \approx \big(M_{\Delta\theta_{N,cr,j}}\big)^{1/2}\,U_j \approx N^{-1/2}\big(C_{N,cr,j|lms}\big)^{1/2}\,U_j; \qquad U_j \sim N(0_{cr,j},\,P_{cr,j}) \qquad (31)$$
where $(C_{N,cr,j|lms})^{1/2}$ is the square-root matrix of $C_{N,cr,j|lms}$ and $U_j$ is a multivariate standard normal RV of dimension $\dim(\theta_{cr,j})$, with zero mean $0_{cr,j}$ and covariance matrix $P_{cr,j}$.
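A minimal sketch (our own illustration; the toy covariance matrix is invented for the example) of how surrogate error vectors consistent with (31) can be drawn once an estimate of $C_{N,cr,j|lms}$ is available:

```python
import numpy as np

def surrogate_moment_errors(C_lms, N, n_surr, rng=None):
    """Draw surrogates of Delta theta_{N,cr} ~ N(0, C_lms / N), as in Eq. (31)."""
    rng = np.random.default_rng() if rng is None else rng
    L = np.linalg.cholesky(C_lms + 1e-12 * np.eye(len(C_lms)))   # a square-root matrix of C_lms
    U = rng.standard_normal((n_surr, len(C_lms)))                # standard normal U_j
    return (U @ L.T) / np.sqrt(N)

# Toy example with two cross constraints (values assumed for illustration only)
C_lms = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
dtheta = surrogate_moment_errors(C_lms, N=500, n_surr=10_000, rng=np.random.default_rng(5))
print(np.cov(dtheta.T) * 500)     # should recover C_lms approximately
```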

4. Modeling of MinMI Estimation Errors, Their Bias, Variance and Distribution

Taking into account the Gaussian approximation (31) for the estimation errors, their neglected bias, the $N^{-1}$-scaled covariance (30), and the second-order Taylor development of the MinMI (9), one can determine the approximate bias, variance and distribution of the MinMI estimators (15).
Two problems are then addressed:
  • The estimation of the bias, variance, quantiles and distribution of the estimators of the incremental MinMI $I_{j/p}$ issued from finite samples of $N$ (iid) realizations of the bivariate original variables $(\hat X, \hat Y)$, then transformed into the RVs $(X,Y)$;
  • The distribution of the estimators of $I_{j/p}$ under the null hypothesis $H_0$ that $(X,Y)$ follows the ME distribution constrained by the weaker constraint set $(T_p,\theta_p)$ $(j > p)$. These estimators work as a significance test for determining whether there is statistically significant MI beyond that explained by the cross moments in $(T_p,\theta_p)$.

4.1. Bias, Variance, Quantiles and Distribution of MI Estimation Error

Considering the moment error distribution (31) and plugging it into the development (9), the error of the MI estimator I N , j / p is then distributed as:
Δ I N , j / p , θ N 1 / 2 [ v j / p T ( C N , c r , j | l m s ) 1 / 2 ] U j     + 1 / 2 N 1 U j T [ ( C N , c r , j | l m s ) 1 / 2 A j / p ( C N , c r , j | l m s ) 1 / 2 ] U j    
where the neglected terms are of order $O(N^{-3/2})$. This is a second-order polynomial form of a multivariate standard Gaussian RV $U_j \sim N(0_j, P_{cr,j})$. There is no general analytical expression for the PDF inferred from (32), except in certain cases where $\Delta I_{N,j/p}$ is governed by a non-central Chi-squared distribution [36]. The quantiles determining the confidence intervals of $I_{N,j/p}$ can easily be obtained by sorting Monte-Carlo surrogates (proxies) of (32) generated from a pseudo-random generator of a standard Gaussian. Analytical expressions for the distribution of MI estimates have been given from a MI Taylor expansion in terms of anomalies of the estimated probabilities [27,37]. Here, we adopt a different approach by considering anomalies of the estimated expectations.
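The Monte-Carlo strategy just described can be sketched as follows, assuming the vector $v_{j/p}$, the matrix $A_{j/p}$ and the covariance $C_{N,cr,j|lms}$ have already been computed (the function and argument names are ours):

```python
# Minimal sketch: empirical quantiles of the MinMI error (32) from Gaussian surrogates.
import numpy as np

def minmi_error_quantiles(v_jp, A_jp, C_lms, N, probs=(0.025, 0.5, 0.975),
                          n_surr=5000, rng=np.random.default_rng(0)):
    w, V = np.linalg.eigh(C_lms)
    S = V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T     # square root of C_lms
    Z = rng.standard_normal((n_surr, len(v_jp))) @ S          # C^{1/2} U surrogates
    lin = Z @ v_jp / np.sqrt(N)                               # first-order term of (32)
    quad = 0.5 * np.einsum('ni,ij,nj->n', Z, A_jp, Z) / N     # quadratic term of (32)
    return np.quantile(lin + quad, probs)
```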
The bias of $I_{N,j/p}$, i.e., the expectation of $\Delta I_{N,j/p,\theta}$, is derived from the mean of the quadratic-form term in (32). Therefore, using the invariance of the trace under circular permutation of a matrix product, the bias is approximated by the asymptotic value:
$$E\!\left(\Delta I_{N,j/p}\right) \approx \tfrac{1}{2}\, N^{-1}\, \mathrm{Tr}\!\left(C_{N,cr,j|lms}\, A_{j/p}\right) = \tfrac{1}{2}\, N^{-1}\left[\mathrm{Tr}\!\left(C_{N,cr,j|lms}\, P_{cr,j} C_{*j}^{-1} P_{cr,j}\right) - \mathrm{Tr}\!\left(C_{N,cr,p|lms}\, P_{cr,p} C_{*p}^{-1} P_{cr,p}\right)\right] \qquad (33)$$
This is the difference between the maximum-entropy $N^{-1}$-scaled biases of orders $j$ and $p$, subject to the imposition of the marginal PDFs. Recall that if $p = 0$, then $P_{cr,p}$ vanishes. In that case the MinMI bias is simply minus the (negative) bias of the ME $H(\theta_{N,j})$, which is treated without the effect of the variable morphism by [26]. When the data are governed by the MinMI-PDF of order $j$, the matrices $C_{N,cr,j|lms}$ and $P_{cr,j} C_{*j}^{-1} P_{cr,j}$ are the inverse of each other, according to Theorems 1 and 2 (11,27), leading to $E(\Delta I_{N,j/0}) \approx \tfrac{1}{2} N^{-1}\, \mathrm{Tr}\!\left(C_{N,cr,j|lms}\, P_{cr,j} C_{*j}^{-1} P_{cr,j}\right) = \tfrac{1}{2} N^{-1}\, \mathrm{Tr}(P_{cr,j})$, i.e., $1/(2N)$ times the number of cross constraints. However, as argued by [26], when the true data distribution is more leptokurtic than the MinMI-PDF, the bias can be larger than $\tfrac{1}{2} N^{-1}\, \mathrm{Tr}(P_{cr,j})$.
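As an order-of-magnitude illustration with assumed numbers (not taken from the experiments below): when the data are governed by the order-$j$ MinMI-PDF, with $m_{cr} = \mathrm{Tr}(P_{cr,j}) = 6$ cross constraints and $N = 1000$ outcomes, the asymptotic MinMI bias is $E(\Delta I_{N,j/0}) \approx 6/(2 \times 1000) = 0.003$ nats.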
Assuming the limiting Gaussian case, the variance of $\Delta I_{N,j/p}$ is given by:
$$\mathrm{var}\!\left(\Delta I_{N,j/p}\right) \approx N^{-1}\, \mathrm{Tr}\!\left[C_{N,cr,j|lms}\left(v_{j/p}\, v_{j/p}^T\right)\right] + \tfrac{1}{2}\, N^{-2}\, \mathrm{Tr}\!\left[\left(C_{N,cr,j|lms}\, A_{j/p}\right)^2\right] \qquad (34)$$
The leading variance term is $N^{-1}$-scaled, as generally deduced in [15]. Keeping the leading term of (34) and working out the trace, we obtain a given relative error $r_I = \Delta I_N/I_j$ of the MinMI $I_{j/0}$ ($p = 0$) when $N \gtrsim E\!\left[\left(\lambda_{cr,j}^T T_{cr,j}\right)^2\right]/\left(I_{j/0}\, r_I\right)^2 \sim O(m_{cr,j})/\left(I_{j/0}\, r_I\right)^2$. The term $O(m_{cr,j})$ increases at a faster rate than $I_{j/0}$ as the boundary of the polytope of allowed expectations is approached.

4.2. Significance Tests of MinMI Thresholds

The estimators $I_{N,j/p}$ allow for the construction of statistical significance tests in order to verify whether the empirical PDF differs considerably from a threshold ME-PDF or whether, on the contrary, the difference can be explained by sampling errors.
Let us consider the null hypothesis H0 stating that the true PDF coincides with the ME-PDF constrained by $(T_p, \theta_p)$. In particular, for $(T_p, \theta_p) = (T_{p=0}, \theta_{p=0}) = (T_{ind}, \theta_{ind})$, the null hypothesis states that $(X, Y)$ are statistically independent. Under H0, the moment sets $(T_p, \theta_p)$ and $(T_j, \theta_j)$ are ME-congruent and the moments of order $j \geq p$ remain well determined by expectations over the less restricted $p$-th ME-PDF, i.e., $\theta_j = E_{\rho^*_{T_p,\theta_p}}(T_j) \equiv \theta_{j \leftarrow p}$, where the subscript arrow $j \leftarrow p$ means that the $j$-order statistics are obtained from the $p$-order ME-PDF. The same holds for the ME covariance matrices, i.e., $C_{*p} = C_p$ and $C_{*j} = C_{*j \leftarrow p} = C_j$; $j \geq p$. Under these conditions, the matrix $C_p$ is simply a sub-matrix of $C_j$. The Lagrange multipliers are restricted to the $p$-order, i.e., $\lambda_j = \lambda_{j \leftarrow p} = (\lambda_p, 0_{j/p})$; $j \geq p$, where the entries of order higher than $p$ are set to zero, leading to $v_{j/p} = 0$ in (9). Therefore, the incremental MinMI vanishes, i.e., $H(\theta_j) - H(\theta_p) = -I_{j/p} = 0$, but the estimator of $I_{N,j/p}$ is positive due to artificial MI generated by the presence of sampling errors. Then, under H0 and using (9), the MI estimate is given by the following approximation:
$$\left.H(\theta_{N,p}) - H(\theta_{N,j})\right|_{H_0} \equiv \delta I_{N,j/p} \approx \tfrac{1}{2}\, N^{-1}\, U_j^T \left[\left(C_{N,cr,j|lms}\right)^{1/2} A_{j \leftarrow p} \left(C_{N,cr,j|lms}\right)^{1/2}\right] U_j\,;\quad U_j \sim N(0_j, P_{cr,j})\,;\quad A_{j \leftarrow p} = P_{cr,j}\, C_j^{-1} P_{cr,j} - P_{cr,p}\, C_p^{-1} P_{cr,p} \qquad (35)$$
where $A_{j \leftarrow p}$ is a positive semi-definite matrix. This provides a significance test for the rejection of H0: if $I_{N,j/p}$ is larger than an upper $1-\alpha$ quantile (e.g., $1-\alpha = 95\%$) of $\delta I_{N,j/p}$, then H0 is rejected at the significance level $\alpha$. Those quantiles determine the significant MI thresholds and can be computed empirically, as for the MinMI error (32), by a Monte-Carlo strategy. Another possibility is fitting the $\delta I_{N,j/p}$ distribution to a Gamma PDF with prescribed mean and variance (not done here). The bias and variance of $\delta I_{N,j/p}$ follow straightforwardly as:
$$E\!\left[\delta I_{N,j/p}\right] \approx \tfrac{1}{2}\, N^{-1}\, \mathrm{Tr}\!\left[C_{N,cr,j|lms}\, A_{j \leftarrow p}\right]\,;\qquad \mathrm{var}\!\left[\delta I_{N,j/p}\right] \approx \tfrac{1}{2}\, N^{-2}\, \mathrm{Tr}\!\left[\left(C_{N,cr,j|lms}\, A_{j \leftarrow p}\right)^2\right] \qquad (36)$$
The $N^{-2}$ scaling of the variance is also present in other MI estimation errors under the hypothesis of variable independence [27]. Under Theorems 1 (11) and 2 (27), along with the null hypothesis, one gets $C_{N,cr,j|lms}\, A_{j \leftarrow p} = P_{cr,j} - P_{cr,p}$, thus leading to a Chi-Squared distribution for $\delta I_{N,j/p}$:
$$\delta I_{N,j/p} \sim \tfrac{1}{2}\, N^{-1}\, \chi^2_{n_d}\,;\qquad n_d = \mathrm{Tr}\!\left(P_{cr,j} - P_{cr,p}\right) \qquad (37)$$
with n d degrees of freedom, i.e., the difference between the number of cross moments of order j and p. From that, the upper quantiles necessary for statistical significance are easily obtained from χ2 probability lookup tables. The bias and variance are, respectively:
$$E\!\left[\delta I_{N,j/p}\right] \approx \tfrac{1}{2}\, N^{-1}\, \mathrm{Tr}\!\left(P_{cr,j} - P_{cr,p}\right)\,;\qquad \mathrm{var}\!\left[\delta I_{N,j/p}\right] \approx \tfrac{1}{2}\, N^{-2}\, \mathrm{Tr}\!\left(P_{cr,j} - P_{cr,p}\right) \qquad (38)$$
By analyzing (38), in order to obtain a test with a relative error $r_I = \Delta I_{\min}/I_{\min}$, one must choose $N \gtrsim \left[(m_{cr2} - m_{cr1})/2\right]^{1/2}/\left(I_{\min}\, r_I\right)$.
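For illustration, with assumed numbers: to resolve a MinMI of $I_{\min} = 0.05$ nats within a relative error $r_I = 0.5$ when $m_{cr2} - m_{cr1} = 5$ extra cross constraints are tested, one needs roughly $N \gtrsim \sqrt{5/2}/(0.05 \times 0.5) \approx 63$ realizations.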

4.3. Significance Tests of the Gaussian and Non-Gaussian MI

In this section we particularize the theory presented in Sections 4.1 and 4.2 (Equations 35–38) to the case of the Gaussian and non-Gaussian MIs defined in Section 2.3. For this purpose, let us consider the moment sets (13) and the MI components $I_g$ and $I_{ng,j}$ (11). Their finite-sample estimators are:
$$\begin{aligned} I_{N,g} &= H(\theta_0) - H(\theta_{N,2}) = I_g + \Delta I_{N,g} = I_{N,j=2/p=0}\,; &\quad \Delta I_{N,g} &= I_g(c_g + \Delta c_{g,N}) - I_g(c_g) = -\Delta H(\theta_{N,2})\,;\\ I_{N,ng,j} &= H(\theta_{N,2}) - H(\theta_{N,j}) = I_{ng,j} + \Delta I_{N,ng,j} = I_{N,j/p=2}\,; &\quad \Delta I_{N,ng,j} &= \Delta H(\theta_{N,2}) - \Delta H(\theta_{N,j}) \end{aligned} \qquad (39)$$
where $\Delta I_{N,g}$ and $\Delta I_{N,ng,j}$ are MinMI errors, $\Delta c_{g,N}$ is the Gaussian correlation estimation error, $H(\theta_0) = 2 H_g$ with $H_g \equiv \tfrac{1}{2}\log(2\pi e)$ being the entropy of the univariate standard Gaussian, and $\theta_{N,j} = \theta_j + \Delta\theta_{N,j}$, $j \geq 1$, are the expectations obtained from the $N$-sized Gaussianized standardized sample.
The numerical implementation of the maximum entropy estimator $\hat H$ (16), approximating $H$, is computed over a number $N_b$ of bins of a sufficiently extended finite interval $[-L_i, L_i]$. In the corresponding experiments (and as in PP12), we have used the calibrated values $L_i = 6$ and $N_b = 80$. The algorithm used is explained in detail in Appendix 2 of PP12 [12], following an adapted bivariate version of that of [35]. The error $\delta H = \hat H - H$ is of the order of the round-off errors, only becoming comparable to the sampling ME errors at very high values of $N$.

4.3.1. Error and Significance Tests of the Gaussian MI

The Gaussian MI error $\Delta I_{N,g}$ depends on the Gaussian correlation estimation error $\Delta c_{g,N} \equiv c_{g,N} - c_g$, where $c_{g,N} = E_N(XY)$ is inferred from the sample. Let us write (9) for $\Delta I_{N,g}$. The Gaussian bivariate ME-PDF, constrained by $\left(T_2 = (X, X^2, Y, Y^2, XY)^T,\ \theta_2 = (0, 1, 0, 1, c_g)^T\right)$, is $\rho^*_{T_2,\theta_2}(X,Y) = \left[4\pi^2\left(1-c_g^2\right)\right]^{-1/2} \exp\!\left[-\tfrac{1}{2}\left(1-c_g^2\right)^{-1}\left(X^2 - 2 c_g X Y + Y^2\right)\right]$, leading to the vector of Lagrange multipliers $\lambda_2 = \left[0,\ -\tfrac{1}{2}\left(1-c_g^2\right)^{-1},\ 0,\ -\tfrac{1}{2}\left(1-c_g^2\right)^{-1},\ c_g\left(1-c_g^2\right)^{-1}\right]^T$. The projection operator $P_{cr,2}$ onto the cross moments is the $5\times 5$ matrix that extracts the 5th entry (row and column) of $T_2$, corresponding to the unique cross moment $XY$. The necessary $5\times 5$ covariance matrix is $C_{*2} = E_{\rho^*_{T_2,\theta_2}}\!\left[T_2 T_2^T\right] - \theta_2\theta_2^T$, where $E$ denotes the expectation over the bivariate Gaussian $\rho^*_{T_2,\theta_2}$. Then, we apply (9) for $j = 2$, $p = 0$ with $\Delta\theta_{N,j} = (0, 0, 0, 0, \Delta c_{g,N})^T$. The Gaussian MI error can be written in different forms as:
$$\Delta I_{N,g} \approx \left(P_{cr,2}\lambda_2\right)^T \Delta c_{g,N} + \tfrac{1}{2}\left(P_{cr,2}\, C_{*2}^{-1} P_{cr,2}\right)\left(\Delta c_{g,N}\right)^2 = \frac{c_g}{1-c_g^2}\,\Delta c_{g,N} + \frac{1+c_g^2}{2\left(1-c_g^2\right)^2}\left(\Delta c_{g,N}\right)^2 = \frac{\partial I_g}{\partial c_g}\,\Delta c_{g,N} + \frac{1}{2}\,\frac{\partial^2 I_g}{\partial c_g^2}\left(\Delta c_{g,N}\right)^2 \qquad (40)$$
Here, the term $P_{cr,2}\lambda_2$ is the fifth component of $\lambda_2$, corresponding to the first derivative of $I_g$ with respect to $c_g$, whereas the term $P_{cr,2}\, C_{*2}^{-1} P_{cr,2}$ is the entry of $C_{*2}^{-1}$ at row 5, column 5, corresponding to the second derivative of $I_g$. The bias and variance of $\Delta I_{N,g}$ depend on the distribution of the Gaussian correlation error $\Delta c_{g,N}$. According to the proposed modeling of the moment estimation errors (Theorem 2 of Section 3.4), $\Delta c_{g,N}$ is asymptotically Gaussian with a negligible bias, $E(\Delta c_{g,N}) \approx 0$, and a variance (under imposed marginals) given by:
$$\mathrm{var}\!\left(\Delta c_{g,N}\right) \approx N^{-1}\, \mathrm{var}\!\left(XY \mid E(XY|X),\, E(XY|Y)\right) = N^{-1}\,\frac{\left(1-c_g^2\right)^2}{1+c_g^2} \qquad (41)$$
However, in order to keep the simulated $c_g = c_{g,N} - \Delta c_{g,N}$ within the interval $[-1, 1]$, one can use the more precise Fisher Z-transform [38], such that $\Delta c_{g,N} = \tanh\!\left(\tanh^{-1}(c_g) + \Delta Z_N\right) - c_g$, where $\Delta Z_N$ has mean and variance of order $O(N^{-1})$.
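A minimal sketch of this device is given below; the $O(N^{-1})$ variance used for $\Delta Z_N$, namely $1/[N(1+c_g^2)]$, is our own delta-method reading of (41) and is stated here as an assumption rather than a formula quoted from the text.

```python
# Minimal sketch: correlation-error surrogates through the Fisher Z-transform, so that
# the simulated correlation c_g = c_gN - Delta_c_gN always remains inside (-1, 1).
import numpy as np

def correlation_error_surrogates(c_g, N, n_surr, rng=np.random.default_rng(0)):
    z = np.arctanh(c_g)                                              # Fisher Z of the correlation
    dz = rng.standard_normal(n_surr) / np.sqrt(N * (1.0 + c_g**2))   # assumed O(N^-1) variance
    return np.tanh(z + dz) - c_g                                     # surrogates of Delta_c_gN
```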
In order to test the null hypothesis that the variable pair ( X , Y ) has a joint bivariate isotropic Gaussian distribution, we must compare the estimated I N , g with upper quantiles of the significance test δ I N , g , given by Δ I N , g (40) with c g = 0 and Δ c g , N ~ N ( 0 , N 1 ) . This is a Gaussian correlation significance test that is Chi-squared distributed, with:
$$\delta I_{N,g} = \tfrac{1}{2}\left(\Delta c_{g,N}\right)^2 = \tfrac{1}{2}\, N^{-1} U^2 \sim \tfrac{1}{2}\, N^{-1}\chi^2_1\,;\quad U \sim N(0,1)\,;\qquad E\!\left(\delta I_{N,g}\right) = \tfrac{1}{2}\, N^{-1}\,;\quad \mathrm{var}\!\left(\delta I_{N,g}\right) = \tfrac{1}{2}\, N^{-2} \qquad (42)$$

4.3.2. Error and Significance Tests of the Non-Gaussian MI

The estimation error $\Delta I_{N,ng,j}$ of the non-Gaussian MI defined in (39) can be written as a particular form of (9), for an even order $j \geq 4$ and $p = 2$, as a function of the vector $\Delta\theta_{N,j}$ of moment errors of the moment vector $T_j$ (13) with a chosen component indexation. Therefore, the matrix $A_{j/p} = A_{j/p=2} = P_{cr,j}\, C_{*j}^{-1} P_{cr,j} - P_{cr,2}\, C_{*2}^{-1} P_{cr,2}$ of (9) comprises the inverses of the covariance matrices $C_{*j}$ and $C_{*2}$, respectively of the $j$-th and 2nd-order ME solutions.
Algebraic consistency sets the matrix $P_2\, C_{*2}^{-1} P_2$ to the embedding of $C_{*2}^{-1}$ into the $j$-th moment subspace. The vector $v_{j/p=2} = P_{cr,j}\lambda_j - P_{cr,2}\lambda_2$ comprises the Lagrange multiplier vectors of the ME solutions of orders $j$ and 2. We will then perform a range of experiments for the validation of the approximations of Section 4.2.
In order to compute the bias, variance, quantiles and confidence intervals of $I_{N,ng,j}$ from $N$-sized samples, there are two possible strategies: either pure Monte-Carlo simulations, or the analytical and semi-analytical (analytical with surrogates of the moment errors) approaches explained in Section 1. In the pure Monte-Carlo approach, either a known bivariate PDF is assumed or surrogates of the joint PDF are generated through multivariate bootstrapping techniques [39] preserving the copula structure. For each sample generated within an extended ensemble of $N_{rea}$ (e.g., 5,000) realizations, we compute the moments and solve the ME problem, gathering statistics afterwards. Alternatively, ME errors can be computed from the Taylor expansion (9) applied to the moment deviations over the ensemble.
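The pure Monte-Carlo strategy can be organized as in the schematic sketch below, where `sampler`, `cross_moments` and `solve_max_entropy` are hypothetical placeholders for, respectively, the data generator (or bootstrap), the evaluation of the cross expectations, and the bivariate ME solver of PP12:

```python
# Schematic sketch of the pure Monte-Carlo strategy (placeholders, not the paper's code).
import numpy as np

def minmi_mc_statistics(sampler, cross_moments, solve_max_entropy,
                        N, n_rea=5000, rng=np.random.default_rng(0)):
    """Gather ensemble statistics of the estimated maximum entropy H_max(theta_N)."""
    H_max = []
    for _ in range(n_rea):
        xy = sampler(N, rng)                       # one N-sized iid sample (or bootstrap surrogate)
        theta_N = cross_moments(xy)                # estimated cross expectations
        H_max.append(solve_max_entropy(theta_N))   # ME solution for the sampled constraints
    H_max = np.asarray(H_max)
    return {"mean": H_max.mean(), "std": H_max.std(ddof=1), "q95": np.quantile(H_max, 0.95)}
```

Subtracting the population ME from the ensemble mean then gives the Monte-Carlo estimate of the bias.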
In the analytical and semi-analytical approaches, the moment errors $\Delta\theta_{N,j}$ are assumed to follow a certain parametric distribution, which can be multivariate Gaussian as in (31), based on a given bias-covariance matrix model, or a more sophisticated one taking into account the natural bounds of the simulated moments $\theta_{cr,j} = \theta_{N,cr,j} - \Delta\theta_{N,cr,j}$. Then, MinMI statistics are computed from statistics (bias, variance, quantiles) over ensembles of error surrogates.
The non-Gaussian MIs $I_{N,ng,j}$ (even $j \geq 4$) work as tests measuring statistically significant deviations from the null hypothesis of joint Gaussianity. These statistical tests are given by Kullback-Leibler distances (7) and constitute an alternative to the use of algebraic deviations of moments from those of the bivariate Gaussian (e.g., bivariate cumulants) [40].
The non-Gaussianity test of order $j$ is given by $\delta I_{N,ng,j} \equiv \left.H(\theta_{N,2}) - H(\theta_{N,j})\right|_{H_0}$ under the null hypothesis H0 that the true PDF is bivariate Gaussian, and it is written as a particular case of (35). However, a simplification of the statistical test formula can be achieved by considering a null Gaussian correlation. This holds thanks to the invariance of the non-Gaussian MI under variable rotations (see PP12), in particular for the uncorrelated standardized variables $(X_r, Y_r)^T = A (X, Y)^T$, where $A$ is the rotation matrix (e.g., $X_r = X$, $Y_r = (Y - c_g X)\left(1-c_g^2\right)^{-1/2}$, i.e., the residual of the linear prediction). Under H0, the rotated variables are still bivariate Gaussian, and therefore the non-Gaussianity significance test $\delta I_{N,ng,j}$ has the same distribution as that for $c_g = 0$. The matrices $C_{N,cr,j|lms}$ and $A_{j \leftarrow 2}$ entering Equation (35) are now evaluated under Gaussian isotropic conditions. For the sake of clarity, we denote them respectively by $C_{g,N,cr,j|lms}$ and $A_{g,j \leftarrow 2} = P_j\, C_{g,j}^{-1} P_j - P_2\, C_{g,2}^{-1} P_2$, where the subscript $g$ stands for evaluation at $(X, Y)^T \sim N(0, I)$. For high $N$, $C_{g,N,cr,j|lms} \approx C_{g,j}$, i.e., the covariance matrix of the cross $j$-th order moments for the isotropic Gaussian. Then we write:
$$\delta I_{N,ng,j} \approx \tfrac{1}{2}\, N^{-1}\, U_j^T \left[\left(C_{g,N,cr,j|lms}\right)^{1/2} A_{g,j \leftarrow 2} \left(C_{g,N,cr,j|lms}\right)^{1/2}\right] U_j \qquad (43)$$
Let us specify the generic entries at row $\alpha$, column $\beta$ of those matrices, corresponding to the monomials $X^{r_\alpha} Y^{s_\alpha}$ and $X^{r_\beta} Y^{s_\beta}$ of $T_j$, i.e., with $r_\alpha + s_\alpha,\ r_\beta + s_\beta \leq j$. Then, using the notation introduced in Section 3.3 for the Gaussian standard moments, $\mu_r \equiv E(X^r)$, $\mu_{N,r} \equiv E_N(X^r)$, $r \geq 0$, the components of $C_{g,j}$ become:
$$\left(C_{g,j}\right)_{\alpha,\beta} = \mu_{r_\alpha + r_\beta}\,\mu_{s_\alpha + s_\beta} - \mu_{r_\alpha}\mu_{r_\beta}\,\mu_{s_\alpha}\mu_{s_\beta} \qquad (44)$$
whereas the components of the lms covariances are:
$$\left(C_{g,N,cr,j|lms}\right)_{\alpha,\beta} = \mu_{N,r_\alpha + r_\beta}\,\mu_{N,s_\alpha + s_\beta} - \mu_{N,s_\alpha + s_\beta}\,\mu_{N,r_\alpha}\mu_{N,r_\beta} - \mu_{N,r_\alpha + r_\beta}\,\mu_{N,s_\alpha}\mu_{N,s_\beta} + \mu_{N,r_\alpha}\mu_{N,r_\beta}\,\mu_{N,s_\alpha}\mu_{N,s_\beta} \qquad (45)$$
The bias of the non-Gaussian MinMI and its asymptotic approximation (36) are given by:
$$E\!\left[\delta I_{N,ng,j}\right] \approx \tfrac{1}{2}\, N^{-1}\left[\mathrm{Tr}\!\left(C_{g,N,cr,j|lms}\, P_{cr,j}\, C_{g,j}^{-1}\right) - 1\right] = \tfrac{1}{2}\, N^{-1}\left(\mathrm{Tr}\!\left(P_{cr,j}\right) - 1\right) \qquad (46)$$
Similarly and following (36), the variance becomes:
$$\mathrm{var}\!\left[\delta I_{N,ng,j}\right] \approx \tfrac{1}{2}\, N^{-2}\, \mathrm{Tr}\!\left[\left(C_{g,N,cr,j|lms}\, A_{g,j \leftarrow 2}\right)^2\right] = \tfrac{1}{2}\, N^{-2}\left(\mathrm{Tr}\!\left(P_{cr,j}\right) - 1\right) \qquad (47)$$
and, following (37), the distribution is reasonably approximated by:
$$\delta I_{N,ng,j} \sim \tfrac{1}{2}\, N^{-1}\, \chi^2_{n_d}\,;\qquad n_d = \mathrm{Tr}\!\left(P_{cr,j}\right) - 1 = j(j-1)/2 - 1 \qquad (48)$$
from which the significance thresholds for non-Gaussianity can be computed through quantiles of the Chi-squared distribution.
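For illustration, the significance thresholds implied by (48) can be tabulated directly from Chi-squared quantiles; the short sketch below (with an assumed sample size) reproduces the degrees of freedom $n_d = 5, 14, 27$ used later for $j = 4, 6, 8$:

```python
# Minimal sketch: 95% significance thresholds of the non-Gaussianity tests from (48).
from scipy.stats import chi2

N = 400                                    # assumed sample size, for illustration only
for j in (4, 6, 8):
    n_d = j * (j - 1) // 2 - 1             # 5, 14 and 27 degrees of freedom
    threshold = chi2.ppf(0.95, n_d) / (2.0 * N)
    print(f"j = {j}: n_d = {n_d}, 95% MinMI threshold = {threshold:.4f}")
```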

4.4. Validation of Significance Tests by Monte-Carlo Experiments

We have presented the theoretical expressions for the bias, variance and distribution, both for the Gaussian correlation test (42) and for the ME non-Gaussianity tests of order $j$ (46–48). Now we validate those expressions by comparing their results with statistics from large Monte-Carlo ensembles of ME computations. For that purpose, we have generated $N_{rea} = 5000$ independent synthetic datasets of $N$ iid uncorrelated $(X, Y)$ pairs from a Gaussian random generator. We have set $N$ from a doubling sequence: $N = 25,\ 2^1 \times 25, \dots,\ 2^{11} \times 25 = 51200$. Then, we have computed the 5,000 realizations of the independence test $\delta I_{N,g}$ as well as of the non-Gaussianity tests $\delta I_{N,ng,j}$ for $j = 4, 6, 8$. In order to minimize errors of the type $\delta H$ (8) from the ME functional, we have retained only those Monte-Carlo realizations whose ME-PDF moments lie within a relative square error of $10^{-5}$.
Subsequently, we have collected and compared the estimates of the bias, standard deviation and 95%-quantile, all provided by the three approaches: the Monte-Carlo approach (an extended ensemble of ME computations), the semi-analytical approach (generation of Gaussian surrogates in the Taylor expansion of the ME) and the analytical approach (analytical formulas based on Theorems 1 and 2). Figure 3a–d depict the above statistics of the significance tests, respectively for $\delta I_{N,g}$ and $\delta I_{N,ng,j}$ ($j = 4, 6, 8$). The truth is assumed to be given by the Monte-Carlo estimate.
As expected, the significance tests are all scaled by $N^{-1} O(1)$, and consequently their bias, standard deviation and quantiles are $N^{-1} O(1)$, as shown in Figure 3a–d by the estimates coming from the different approaches. The MinMI biases and significance thresholds (the 95% quantiles) grow with the number of constraints, as in the sequence $I_{N,g}$, $I_{N,ng,j=4}$, $I_{N,ng,j=6}$, $I_{N,ng,j=8}$.
These results mean that those estimators are progressively better (stronger) evaluations of MI (or the MI beyond that explained by Gaussianity), though they call for progressively higher significance thresholds. Therefore, especially in cases of under-sampled data (small N) or very low MI (or Non-Gaussian MI) values (weakly dependent variables or weak joint non-Gaussianity), there must be a tradeoff between N and the number of parameters of the MinMI estimator (here the number of cross constraints).
At this point, we discuss how the analytical and semi-analytical estimates of MinMI error statistics fit the Monte-Carlo (true) statistics. There are three crucial factors in our approximations: (1) The accuracy of the ME Taylor expansion, valid for small enough sampling errors (N large); (2) The convergence rate towards Gaussian statistics (from the CLT) for high N.
Figure 3. Test statistics: bias (black lines), standard deviation (red lines) and 95%-quantiles (green lines), provided by the Monte-Carlo approach (thick full lines), the semi-analytical approach (thin dashed lines) and the analytical approach (thick full lines). The tests are $\delta I_{N,g}$ (a); $\delta I_{N,ng,j=4}$ (b); $\delta I_{N,ng,j=6}$ (c) and $\delta I_{N,ng,j=8}$ (d).
The analytical bias depends on factors 1 and 3, while the formulas for the variance, distribution and quantiles depend on all of the above factors, being valid only for high enough $N$. From Figure 3a–d, we see that the agreement between the analytical and Monte-Carlo statistics is quite good for all tests (with a slight analytical underestimation), though only for large enough values $N > N_{test}$, where $N_{test}$ depends on how late (in $N$) the factors 1–3 hold together. We have $N_{test} \approx 50, 400, 1600, 3200$, respectively for $\delta I_{N,g}$, $\delta I_{N,ng,j=4}$, $\delta I_{N,ng,j=6}$ and $\delta I_{N,ng,j=8}$, growing with the number of constraints. The exception occurs when $N$ is so large that the errors $\delta H$ of the operational ME (typically round-off errors) become of the same order as the small test values $\delta I$, starting to influence the Monte-Carlo statistics.
In order to validate the analytical Chi-Squared distributions of the tests, we present in Figure 4 the empirical cumulative histograms of $2N\delta I_{N,g}$, $2N\delta I_{N,ng,4}$, $2N\delta I_{N,ng,6}$ and $2N\delta I_{N,ng,8}$ for $N = N_{test}$, together with the corresponding theoretical cumulative Chi-Squared fits, respectively $\chi^2_1$, $\chi^2_5$, $\chi^2_{14}$ and $\chi^2_{27}$. The agreement is shown to be quite good, with a slight deficit in the theoretical number of degrees of freedom, possibly due to uncontrolled aspects (e.g., the numerical implementation of the ME algorithm and bound effects) leading to extra randomness. In fact, the theoretical prediction of the MinMI bias results from two matrices, theoretically equal, which are issued from rather involved computations (the MinMI covariance matrix and the covariance matrix of the estimators under fixed marginals), so the theoretical result depends on the matching of a large number of algorithmic details. The results provide good support for the presented Theorems and for the hypotheses underlying the analytical and semi-analytical approaches. The slightly higher MinMI bias relative to the theoretical one is due to a small difference between the data PDF and the ME-PDF.
Figure 4. Monte-Carlo empirical cumulative histogram (solid lines) and theoretical cumulative Chi-Squared fit (dashed lines) normalized by N: 2 N δ I N , g ( χ 1 2 ) for N = 50 (black curves); 2 N δ I N , n g , j = 4 ( χ 5 2 ) for N = 400 (red curves); 2 N δ I N , n g , 6 ( χ 14 2 ) f