Minimum Mutual Information and Non-Gaussianity Through the Maximum Entropy Method: Theory and Properties

The application of the Maximum Entropy (ME) principle leads to a minimum of the Mutual Information (MI), $I(X,Y)$, between random variables $X, Y$, which is compatible with prescribed joint expectations and given ME marginal distributions. A sequence of sets of joint constraints leads to a hierarchy of lower MI bounds increasingly approaching the true MI. In particular, using standard Gaussian marginal distributions, it allows for the MI decomposition into two positive terms: the Gaussian MI ($I_g$), depending upon the Gaussian correlation or the correlation between 'Gaussianized' variables, and a non-Gaussian MI ($I_{ng}$), coinciding with the joint negentropy and depending upon nonlinear correlations. Joint moments of a prescribed total order p are bounded within a compact set defined by Schwarz-like inequalities, where $I_{ng}$ grows from zero at the 'Gaussian manifold', where moments are those of Gaussian distributions, towards infinity at the set's boundary, where a deterministic relationship holds. Sources of joint non-Gaussianity have been systematized by estimating $I_{ng}$ between the input and output of a nonlinear synthetic channel contaminated by multiplicative and non-Gaussian additive noises for a full range of signal-to-noise (snr) variance ratios. We have studied the effect of varying snr on $I_g$ and $I_{ng}$ under several signal/noise scenarios.


Introduction
One of the most commonly used information-theoretic measures is the mutual information (MI) [1], measuring the total amount of probabilistic dependence among random variables (RVs); see [2] for a unifying perspective and axiomatic review. MI is a nonnegative quantity, vanishing iff the RVs are independent.
MI is an effective tool for multiple purposes, namely, among others: (a) blind signal separation [3] and Independent Component Analysis (ICA) [4], both of which look for transformed and/or lagged stochastic time series [5] which minimize MI; (b) predictability studies, Predictable Component Analysis [6] and Forecast Utility [7], all of which are focused on the analysis and decomposition of the MI between probabilistic forecasts and observed states.
Analytical expressions of MI are known only for a small number of parametric joint distributions [8,9]. Alternatively, it can be numerically estimated by different methods, such as maximum likelihood estimators, the Edgeworth expansion, Bayesian methods, equiprobable and equidistant histograms, kernel-based probability density functions (PDFs), and the K-nearest-neighbors technique; see [10,11] and references therein for a survey of estimation methods and scoring comparison studies.
In the bivariate case, treated here, the MI $I(X,Y)$ between the RVs $X, Y$ is the Kullback-Leibler (KL) divergence between the joint PDF and the product of the marginal PDFs. The goal is the determination of theoretical lower MI bounds under certain conditions or, in other words, of the minimum mutual information (MinMI) [12] between two RVs $X, Y$, consistent both with imposed marginal distributions and with cross-expectations assessing their linear and nonlinear covariability. Those lower bounds can be obtained thanks to the application of the Maximum Entropy (ME) method to distributions [13] and to the inequality:

$$ D(p_{ME} \| q) \le D(p \| q). $$

Here, $p_{ME}$ is a Maximum Entropy probability distribution (MEPD) with respect to q within a PDF class $\Omega$ verifying a given set of constraints [1]. Therefore, proper lower MI bounds are obtained by using the joint MEPD after transforming $X, Y$ into variables with imposed ME marginal distributions through the so-called ME-anamorphoses [14].
The joint ME probability distribution $p_{ME}^{XY}$ is derived from the minimum of a functional in terms of a certain number of Lagrange parameters. The properties of multivariate ME distributions have been studied for various ME constraints, namely: (a) imposed marginals and covariance matrix [15]; (b) generic joint moments [16]. Abramov [17-19] has developed efficient and stable numerical iterative algorithms for computing ME distributions forced by sets of polynomial expectations. Here we use a bivariate version of the algorithm of [20], already tested in [21]. By taking a sequence of encapsulated sets of joint ME constraints, we obtain an increasing hierarchy of lower MI bounds converging towards the total MI.
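To make the ME step concrete, here is a minimal sketch (not the bivariate algorithm of [20,21]; the grid size, the support half-width r, the constraint list and the target expectations are illustrative assumptions) of fitting a joint ME-PDF on a finite square support by minimizing the convex dual of the entropy functional:

```python
# Sketch: fit p(x, y) proportional to exp(-sum_k lam_k T_k(x, y)) on a grid by
# minimizing the convex dual L(lam) = log Z(lam) + sum_k lam_k theta_k.
import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

r, n = 5.0, 201                       # square support [-r, r]^2 and resolution
x = np.linspace(-r, r, n)
X, Y = np.meshgrid(x, x, indexing="ij")
log_dA = 2.0 * np.log(x[1] - x[0])    # log of one grid-cell area

# Constraints: standard Gaussian marginal moments plus the cross moment E[XY].
T = np.stack([X, X**2, Y, Y**2, X * Y])
theta = np.array([0.0, 1.0, 0.0, 1.0, 0.5])

def dual(lam):
    logp = -np.tensordot(lam, T, axes=1)     # unnormalized log-density
    logZ = logsumexp(logp) + log_dA          # log partition function
    return logZ + lam @ theta                # gradient vanishes when E[T_k] = theta_k

res = minimize(dual, np.zeros(len(theta)), method="BFGS")
logp = -np.tensordot(res.x, T, axes=1)
p = np.exp(logp - logsumexp(logp) - log_dA)  # normalized ME-PDF on the grid
print("max entropy:", -(p * np.log(p)).sum() * np.exp(log_dA))
```

The minimizer of the dual returns the Lagrange multipliers; the entropy of the fitted PDF is the maximum entropy compatible with the imposed expectations.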
We particularize this methodology to the case where $X, Y$ are standard Gaussians, issued from single homeomorphisms of an original pair of variables $\hat X, \hat Y$ by Gaussian anamorphosis [14]. Then we get the MI $I(\hat X, \hat Y) = I(X, Y)$, which is decomposed into two generic positive quantities: a Gaussian MI $I_g$ and a non-Gaussian MI $I_{ng}$ [21], the latter vanishing under bivariate Gaussianity. The Gaussian MI is given by $I_g(c_g) = -\frac{1}{2}\log(1 - c_g^2)$, where $c_g \equiv cor(X, Y)$ is the Pearson linear correlation between the 'Gaussianized' variables. As for the non-Gaussian MI term $I_{ng}$, it relies upon imposed nonlinear correlations between the single Gaussian variables. The MI reduces to $I_g(c_g)$ when only moments of order one and two are imposed as ME constraints. We will note that, for certain extreme non-Gaussian marginals, $I_g(cor(\hat X, \hat Y)) > I(\hat X, \hat Y)$, thus showing that $I_g(c)$ [22] is not a proper MI lower bound in general. The correlation $c_g$, hereafter called the Gaussian correlation [21], is a nonlinear concordance measure like the Spearman rank correlation [23] and the Kendall τ, which, by definition, are invariant under monotonically growing smooth marginal transformations. These measures have the good property of being expressible as functionals of the copula density function [24], uniquely dependent on the cross dependency between variables. The non-Gaussian MI $I_{ng}$ holds some interesting characteristics. It coincides with the joint negentropy (the deficit of entropy with respect to that of the Gaussian PDF with the same mean, variance and covariance) in the space of 'Gaussianized' variables, which is invariant under any orthogonal or oblique rotation of them. In particular, for uncorrelated rotated variables, it coincides with the 'compactness', which measures the concentration of the joint distribution around a lower-dimensional manifold [25] and which is given by $D(p_{XY} \| p_{SG})$, the KL divergence with respect to the spherical Gaussian $p_{SG}$ (the Gaussian with an isotropic covariance matrix with the same trace as that of $p_{XY}$, i.e., the same total variance).
We also show that $I_{ng}$ comprises a series of positive terms associated with a sequence of imposed monomial expectations of total even order p (2, 4, 6, 8, ...). The higher the number of independent constraints, the higher the order of the terms retained in that series and the more information is extracted from the joint PDF.
We have shown that the possible values of the cross-expectations lie within a bounded set obtained from Schwarz-like inequalities. We illustrate the range of $I_{ng}$ values within those sets as a function of third- and fourth-order cross moments. Near the set's boundary, where a deterministic relationship holds and the ME problem functional is ill-conditioned, $I_{ng}$ tends to infinity.
In order to better understand the possible sources of joint non-Gaussianity and of non-Gaussian MI, we have used the preceding method for computing $I_g$ and $I_{ng}$ between the input and the output of a nonlinear channel contaminated by multiplicative and non-Gaussian noise for a full range of the signal-to-noise (snr) variance ratio. We put in evidence that sources of non-Gaussian MI arise from the nonlinearity of the transfer function, from multiplicative noise and from additive non-Gaussian noise [26].
Many of the results of the paper are straightforwardly generalized to the multivariate case with three or more random variables.
The paper is then organized as follows: Section 2 formalizes the Minimum Mutual Information (MinMI) principle from maximum entropy distributions, while Section 3 particularizes that principle to the MI decomposition into Gaussian and non-Gaussian MI parts. Section 4 addresses the non-Gaussianity in a nonlinear non-Gaussian channel. The paper ends with conclusions and appendices with theoretical proofs and the numerical algorithm for solving the ME problem. This paper is followed by a companion one [27] on the estimation of non-Gaussian MI from finite samples with practical applications.

MI Estimation from Maximum Entropy PDFs
In this section we present the basis of the MI estimation between bivariate RVs, through the use of joint PDFs inferred by the maximum entropy (ME) method (ME-PDFs for short), in the space of marginals transformed (anamorphed) into specified ME distributions. We start with preliminary general concepts and definitions.

General Properties of Bivariate Mutual Information
Let $(X, Y)$ be continuous RVs with support given by the Cartesian product of the marginal supports. The MI can be written both in terms of a KL divergence and in terms of Shannon entropies:

$$ I(X, Y) = D(p_{XY} \| p_X p_Y) = H(X) + H(Y) - H(X, Y). $$

The MI equals zero iff the RVs X and Y are statistically independent, i.e., iff $p_{XY} = p_X p_Y$. Moreover, for transformed variables $\hat X = f(X)$ and $\hat Y = g(Y)$, one has $I(\hat X, \hat Y) \le I(X, Y)$, with equality occurring if both f and g are smooth homeomorphisms of the supports of X and Y, respectively.

Congruency between Information Moment Sets
Definition 1: Following the notation of [28], we define the moment class $\Omega_{T,\theta}$ of bivariate PDFs $\rho_{XY}$ of $(X, Y)$ as the set of PDFs satisfying $E_\rho[T_k(X, Y)] = \theta_k$, $k = 1, \dots, m$, where $E_\rho$ is the ρ-expectation operator and $T_1, \dots, T_m$ are integrable functions with respect to $\rho_{XY}$.

Lemma 1: Given two encapsulated information moment sets $(T, \theta)$ and $(T', \theta')$ with $T \subset T'$, the associated ME-PDFs, if they exist, satisfy the following conditions: any ME-PDF log-mean is unchanged when it is computed with respect to a more constraining ME-PDF. As a corollary, $H(\rho_{ME,T',\theta'}) \le H(\rho_{ME,T,\theta})$, i.e., the maximum entropy decreases with the increasing number of constraints. This can be understood bearing in mind that the entropy maximization is performed in a more reduced PDF class, since $\Omega_{T',\theta'} \subset \Omega_{T,\theta}$. For example, the constraint sets $T = (X, X^2)$ with $\theta = (0, 1)$ and $T' = (X, X^2, X^4)$ with $\theta' = (0, 1, 3)$ lead to the same ME-PDF, the standard Gaussian N(0,1).
Consequently, both information moment sets are ME-congruent but not congruent, since $\Omega_{T',\theta'} \subset \Omega_{T,\theta}$ strictly. This is because the Lagrange multiplier of the ME functional (see Appendix 1) corresponding to the fourth moment is set to zero without any constraining effect. Congruency implies ME-congruency, but not the converse.

MI Estimation from Maximum Entropy Anamorphoses
We are looking for a method of obtaining lower-bound MI estimates from ME-PDFs. For that purpose we decompose the information moment set $(T, \theta)$ into a marginal or independent part $(T^{ind}, \theta^{ind})$, collecting the constraints on X alone and on Y alone, and a cross part $(T^{cr}, \theta^{cr})$, collecting the joint constraints. The ME-PDF associated to the independent part alone factorizes, with its joint entropy being the sum of the marginal maximum entropies, as in the case of independent random variables [15]. The KL divergence between the ME-PDF associated to $(T, \theta)$ and that associated to $(T^{ind}, \theta^{ind})$ is a proper MI lower bound or, expressed in other terms, a constrained mutual information; we denote it by $I_{T,\theta}(X, Y)$. Its difference with respect to $I(X, Y)$ can be negative. However, positiveness is ensured when the marginal distributions are set to the ME-PDFs constrained by $(T^{ind}, \theta^{ind})$, which results in vanishing KL divergences with respect to the marginal PDFs in Equation (5). This procedure is quite general because $(X, Y)$ can appropriately be obtained through injective smooth maps $(f(\hat X), g(\hat Y))$ from an original pair of RVs $(\hat X, \hat Y)$, preserving the MI, i.e., $I(\hat X, \hat Y) = I(X, Y)$. Those maps are monotonically growing homeomorphisms, the hereby-called ME-anamorphoses, which are obtained by equaling the cumulative probability functions of the original variable $\hat X$ to those of the transformed variable $X$ (and equally for $\hat Y$ and $Y$). Therefore, in the space of ME-anamorphed variables, the ME-based MI (4) gives the minimum MI (MinMI) compatible with the cross moments $(T^{cr}, \theta^{cr})$.
Thanks to Lemma 1, a hierarchy of ME-based MI bounds is obtainable by considering successive supersets of the ME constraints on the ME-anamorphed variables, which is justified by the theorem below.

Theorem 1: Let $(X, Y)$ be a pair of single random variables (RVs) distributed as the ME-PDF associated to the independent constraints $(T^{ind}, \theta^{ind})$, and let $T^{cr}_j \subset T^{cr}_{j+1}$ be encapsulated sets of cross constraints. Then, the following inequalities between constrained mutual informations hold:

$$ I_j(X, Y) \le I_{j+1}(X, Y) \le I(X, Y). $$

The proof is given in Appendix 2. Theorem 1 justifies the possibility of building a monotonically growing sequence of lower bounds of $I(X, Y)$. In the sequence, the entropy associated to the independent constraint sets is always constant due to the ME-congruency, while the entropy of the joint ME-PDF decreases, thus allowing the MinMIs to grow.
Therefore, $I_j(X, Y)$ is the part of the MI due to the cross moments in $T^{cr}_j$, and the positive difference $I_{j+1}(X, Y) - I_j(X, Y)$ is the increment of MI due to the additional cross moments in $T^{cr}_{j+1} \setminus T^{cr}_j$, while the marginals are kept as preset ME-PDFs (e.g., Gaussian, Gamma, Weibull).

Gaussian Anamorphosis and Gaussian Correlation
In this section we explain how to implement the sequence of MI estimators detailed in Section 2.3 for the particular case where X and Y are standard Gaussian RVs. Our aim is to estimate the MI between two original variables $(\hat X, \hat Y)$ of null mean and unit variance with real support. Those variables are transformed through a homeomorphism, the Gaussian anamorphosis [14], into standard Gaussian RVs, respectively $X \sim N(0,1)$ and $Y \sim N(0,1)$, given by:

$$ X = \Phi^{-1}(F_{\hat X}(\hat X)), \qquad Y = \Phi^{-1}(F_{\hat Y}(\hat Y)), \qquad (8) $$

where Φ is the cumulative distribution function of the standard Gaussian and $F_{\hat X}, F_{\hat Y}$ are the marginal cumulative distribution functions. If $(\hat X, \hat Y)$ are marginally non-Gaussian, then the Gaussian anamorphoses are nonlinear transformations. In practice, Gaussian anamorphoses can be approximated empirically from finite data sets by equaling cumulated histograms. Moreover, for certain cases it is analytically possible to construct bivariate distributions with specific marginal distributions and knowledge of the joint cumulative distribution function [29].
In the case of Gaussian anamorphosis, the information moment set $(T^{ind}, \theta^{ind})$ of Theorem 1 includes the first and second independent moments of each variable: $T^{ind} = (X, X^2, Y, Y^2)$ with $\theta^{ind} = (0, 1, 0, 1)$. Then, following the proposed procedure of Section 2.3, we consider a sequence of cross-constraint sets for determining the hierarchy of lower MI bounds.
The most obvious cross moment to be considered is the XY expectation, equal to the Gaussian correlation $c_g \equiv E[XY] = cor(X, Y)$ (Equation (9)). The sign of the difference between $c_g$ and the Pearson correlation $c = cor(\hat X, \hat Y)$ roughly depends on the skewness, differing between a skewed $\hat X$ PDF and a symmetric $\hat X$ PDF (idem for $\hat Y$). Therefore, $c_g$ can result in an enhancement of the correlation c or in the opposite effect, as shown in [21] for the RV pair of meteorological variables ($\hat X$ = North Atlantic Oscillation index, $\hat Y$ = monthly precipitation). The Gaussian correlation is a concordance measure like the rank correlation and the Kendall τ, being thus invariant under monotonically growing smooth homeomorphisms of both $\hat X$ and $\hat Y$. Those measures are expressed as functionals of the bivariate copula density $c(u, v)$, which is uniquely dependent on the cumulated marginal probabilities $u = F_{\hat X}(\hat x)$, $v = F_{\hat Y}(\hat y)$ and equal to the ratio between the joint density and the product of the marginal densities, independently of the specific forms of the marginal PDFs [24]. In particular, the Gaussian correlation is given by:

$$ c_g = \int_0^1 \int_0^1 \Phi^{-1}(u)\, \Phi^{-1}(v)\, c(u, v)\, du\, dv. $$
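As an illustration, a minimal sketch (assuming an empirical rank-based Gaussian anamorphosis; the cubic relationship generating the toy sample is an arbitrary example, not one of the paper's cases) of estimating $c_g$ and $I_g$ from samples:

```python
# Gaussianize two samples via their empirical CDFs (Equation (8)) and compute
# the Gaussian correlation c_g and the Gaussian MI I_g = -0.5 log(1 - c_g^2).
import numpy as np
from scipy.stats import norm, rankdata

def gaussian_anamorphosis(v):
    """Map samples to standard Gaussian scores through their empirical CDF."""
    u = rankdata(v) / (len(v) + 1.0)   # empirical CDF values in (0, 1)
    return norm.ppf(u)                 # X = Phi^{-1}(F(v))

rng = np.random.default_rng(0)
x_hat = rng.standard_normal(100_000)
y_hat = x_hat**3 + 0.5 * rng.standard_normal(100_000)   # toy nonlinear pair

x, y = gaussian_anamorphosis(x_hat), gaussian_anamorphosis(y_hat)
c_g = np.corrcoef(x, y)[0, 1]          # Gaussian correlation
I_g = -0.5 * np.log(1.0 - c_g**2)      # Gaussian MI (nats)
print(c_g, I_g)
```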

Gaussian and Non-Gaussian MI
The purpose of this sub-section is to express which part of the MI comes from joint non-Gaussianity. If the 'Gaussianized' variables $(X, Y)$ are jointly non-Gaussian, then the original standardized variables $(\hat X, \hat Y)$, obtained from $(X, Y)$ by invertible smooth monotonic transformations, are jointly non-Gaussian as well. However, the converse is not true. The MI between the Gaussianized variables decomposes into two terms. The first term is the Gaussian MI [21], given by $I_g(\hat X, \hat Y) = I_g(X, Y) = -\frac{1}{2}\log(1 - c_g^2) \ge 0$, a function of the Gaussian correlation $c_g$ (see its graphic in Figure 1), and the second one is the non-Gaussian MI $I_{ng}(\hat X, \hat Y) = I_{ng}(X, Y)$, which is due to joint non-Gaussianity and nonlinear statistical relationships among the variables. The MI $I(\hat X, \hat Y)$ is related to the negentropy $J(\hat X, \hat Y)$, i.e., to the KL divergence between the PDF and the Gaussian PDF with the same moments of order one and two. Writing $(\hat X_r, \hat Y_r)^T = A (\hat X, \hat Y)^T$ for a pair of rotated standardized variables (A being an invertible 2 × 2 matrix), one has the following result, with proof in Appendix 2: the joint negentropy is invariant under such rotations (Equation (12)). A simple consequence is that in the space of uncorrelated variables (i.e., $I_g(cor(\hat X_r, \hat Y_r)) = 0$), the joint negentropy is the sum of the marginal negentropies with the MI, thus showing that there are intrinsic and joint sources of non-Gaussianity. One interesting corollary is derived from that.
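In compact form (a reconstruction consistent with the definitions above; the negentropy decomposition is a known identity, used, e.g., in the ICA literature):

$$ I(\hat X, \hat Y) = I_g + I_{ng}, \qquad I_{ng}(X, Y) = J(X, Y), $$

$$ J(\hat X, \hat Y) = J(\hat X) + J(\hat Y) + I(\hat X, \hat Y) - I_g\big(cor(\hat X, \hat Y)\big), $$

so that for 'Gaussianized' variables, whose marginal negentropies vanish, the joint negentropy reduces to the non-Gaussian MI.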

Corollary 1: For standard Gaussian variables $(X, Y)$ and standardized rotated ones $(X_r, Y_r)^T = A(X, Y)^T$, the joint negentropy is invariant: $J(X_r, Y_r) = J(X, Y) = I_{ng}(X, Y)$. For the proof it suffices to consider Gaussian variables in Equation (11): their self-negentropy vanishes by definition and the correlation term is the Gaussian MI. Negentropy thus has the property of being invariant under any orthogonal or oblique rotation of the Gaussianized variables $(X, Y)$. However, this invariance does not extend to $I_{ng}(X, Y)$. From (12), in particular when $(X_r, Y_r)$ are uncorrelated (e.g., standardized principal components of $(X, Y)$ or uncorrelated standardized linear regression residuals), the negentropy equals the KL divergence between the joint PDF and that of an isotropic Gaussian with the same total variance. That KL divergence is the compactness (level of concentration around a lower-dimensional manifold), as defined in [25]. This measure is invariant under orthogonal rotations. The last term of (12) vanishes, since the null correlation allows one to decompose $J(X_r, Y_r)$ into positive contributions: the self-negentropies of the uncorrelated variables $(X_r, Y_r)$ and their MI $I(X_r, Y_r)$. These variables can in turn be 'Gaussianized' and rotated, leading to a further decomposition of $I_{ng}(X_r, Y_r)$, until the possible 'emptying'/depletion of the initial joint non-Gaussianity into Gaussian MIs and univariate negentropies. The PDF of the new rotated variables will be closer to an isotropic spherical Gaussian PDF. Since it is algorithmically easier to compute univariate rather than multivariate entropies, the above method can be used for an efficient estimation of MIs.
The search for rotated variables maximizing the sum of the individual negentropies $J(X_r) + J(Y_r)$ in (12), with minimization of $I(X_r, Y_r)$ or of their statistical dependency, is the goal of Independent Component Analysis (ICA) [4].
A natural generalization of the MI decomposition is possible when $(X, Y)$ is obtained from a generic ME-anamorphosis, by decomposing the MI into a term associated to correlation, under the constraint that the marginals are set to given ME-PDFs (the equivalent of $I_g$), and a term not explained by correlation (the equivalent of $I_{ng}$). There is, however, no guarantee that this decomposition is unique, as in the Gaussian case, since there is no natural bivariate extension of prescribed univariate PDFs with a given correlation [30].
By looking again at Equation (12), we notice that when the original variables are correlated ($cor(\hat X, \hat Y) \ne 0$), the Pearson correlation can overstate the actual statistical dependence. For the four-point discrete distribution with support $(\pm 1, \pm 1)$, zero means, unit variances and Pearson correlation c (detailed in the Appendix), some lines of algebra give the MI:

$$ I(\hat X, \hat Y) = \frac{1+c}{2} \log(1+c) + \frac{1-c}{2} \log(1-c). $$

The MI is bounded for any value of c, tending to $\log 2$ as $|c| \to 1$; its graphic is included in Figure 1, showing that it lies below $I_g(c)$, which diverges as $|c| \to 1$. This behavior is also reproduced by finite discontinuous PDFs (e.g., replacing the Dirac-Deltas by cylinders of probability) as well as by continuous PDFs. We test it by approximating the discrete PDF by the weighted superposition of 4 spherical bivariate Gaussian PDFs, all of which have a sharp isotropic standard deviation σ = 0.001 and are centered at the 4 referred centroids (Figure 1).
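As a quick numerical check (a sketch; the closed-form expressions follow from the four-point example above and from the definition of $I_g$):

```python
# Compare the exact MI of the four-point distribution with I_g(c): the former
# stays below log 2 while the latter diverges as |c| -> 1, so I_g is not a
# lower bound of the MI here.
import numpy as np

def I_four_point(c):
    """Exact MI (nats) of the four-Dirac-Delta distribution with correlation c."""
    return 0.5 * (1 + c) * np.log1p(c) + 0.5 * (1 - c) * np.log1p(-c)

def I_g(c):
    return -0.5 * np.log1p(-c * c)

for c in (0.5, 0.9, 0.99):
    print(c, I_four_point(c), I_g(c))   # I_g exceeds the true MI
```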

The Sequence of Non-Gaussian MI Lower Bounds from Cross-Constraints
In order to build a monotonically increasing sequence of lower bounds for $I(X, Y)$, we have considered a sequence of encapsulated sets of functions whose moments constrain the ME-PDFs. Those functions consist of single (univariate) and combined (bivariate) monomials of the standard Gaussians X and Y. The numerical implementation of joint ME-PDFs constrained by polynomials in dimensions d = 2, 3 and 4 was studied by Abramov [17-19], with particular emphasis on the efficiency and convergence of iterative algorithms. Here, we use the algorithm proposed in [21] and explained in Appendix 1. Let us define the information moment set $T_p$ as the set of bivariate monomials with total order up to p, $T_p = \{X^i Y^j : 1 \le i + j \le p\}$, decomposed into marginal (independent) and cross monomials as $T_p = T_p^{ind} \cup T_p^{cr}$, with any pair of consecutive sets satisfying the premises of Theorem 1, i.e., all the independent moment sets are ME-congruent. This leads to the corresponding monotonically growing sequence of lower bounds of the MI, denoted $I_2(X, Y), I_4(X, Y), I_6(X, Y), \dots$, which increases with p and converges to $I(X, Y)$ as p → ∞ under quite general conditions [15]. The first term of this sequence is the Gaussian MI $I_g(c_g)$, a function of the Gaussian correlation; the difference between the subsequent terms and the first one leads to the non-Gaussian MI of order p, defined as $I_{ng,p} \equiv I_p(X, Y) - I_g(c_g)$.
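As a bookkeeping sketch (hypothetical helper names; the exponent pairs (i, j) stand for the monomials $X^i Y^j$), the constraint sets grow as follows:

```python
# Enumerate bivariate monomials X^i Y^j with 1 <= i+j <= p and split them into
# marginal (i = 0 or j = 0) and cross (i, j >= 1) subsets.
def monomials(p):
    return [(i, s - i) for s in range(1, p + 1) for i in range(s + 1)]

def split(exps):
    marginal = [(i, j) for i, j in exps if i == 0 or j == 0]
    cross = [(i, j) for i, j in exps if i > 0 and j > 0]
    return marginal, cross

for p in (2, 4, 6):
    m, c = split(monomials(p))
    print(p, len(m), len(c))   # numbers of independent and cross constraints
```

For p = 2 the single cross constraint is $E[XY] = c_g$, recovering the Gaussian MI; for p = 4 the five additional cross moments $(m_{1,2}, m_{2,1}, m_{1,3}, m_{3,1}, m_{2,2})$ enter.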
In the same manner as stated in (12), the lower bound $I_{ng,p}$ is also a lower bound for the joint negentropy, which is invariant under any affine transformation $(X, Y) \to A(X, Y)^T$. There is no analytical closed formula for the dependence of the non-Gaussian MI on the cross moments. However, under the scenario of low joint non-Gaussianity (small KL divergence to the joint Gaussian), the ME-PDF can be approximated by the Edgeworth expansion [31], based on orthogonal Hermite polynomials, and $I_{ng}$ can be approximated as a polynomial of the joint bivariate cumulants.

In particular, $I_{ng}(X, Y)$ is approximated, as in [13], by a sum of squares of joint cumulants of third and fourth order (Equation (16)), where $(X, Y)$ is assumed to be the arithmetic average of an equivalent number $n_{eq}$ of independent and identically distributed (iid) bivariate RVs. Therefore, from the multidimensional Central Limit Theorem [33], the larger $n_{eq}$ is, the closer the distribution is to joint Gaussianity, and the smaller the absolute values of the cumulants become.
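For orientation, the univariate counterpart of this sum-of-squares approximation is the standard Edgeworth-based negentropy estimate (a well-known small-non-Gaussianity result; the bivariate Equation (16) generalizes it to joint cumulants):

$$ J(X) \approx \frac{\kappa_3^2}{12} + \frac{\kappa_4^2}{48}, $$

where $\kappa_3$ and $\kappa_4$ are the third- and fourth-order cumulants of the standardized variable. Consistently with the $n_{eq}$ argument above, the cumulants of an average of $n_{eq}$ iid RVs scale as $n_{eq}^{-1/2}$ and $n_{eq}^{-1}$, making these squared terms small near Gaussianity.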

Non-Gaussian MI across the Polytope of Cross Moments
The $(X, Y)$ cross moments in the expectation vector $\theta_p$ (p even) are not completely free. Rather, they satisfy Schwarz-like inequalities defining a compact set $D_p$ within which the cross moments lie.
That set portrays all the possible non-Gaussian ME-PDFs with p-order independent moments equal to those of the standard Gaussian. In order to get a better feel for how $I_{ng}$ behaves, we have numerically evaluated $I_{ng,p}$ along the allowed set of cross moments. In order to determine that set, let us begin by invoking some generalities about polynomials. Any bivariate polynomial of total order p is expressed as a linear combination of linearly independent monomials from the basis $T'_p \equiv \{1\} \cup T_p$, obtained from $T_p$ by including unity. Then any squared polynomial $P = (a^T T'_{p/2})^2$, of total order up to p, is written as a quadratic form. By taking the expectation operator we have $E[P] = a^T C a \ge 0$ for every coefficient vector a, which implies the positive semi-definiteness of the matrix of moments $C \equiv E[T'_{p/2} T'^T_{p/2}]$, which is given in terms of the components of $\theta_p$.
When p = 4 and d = 2, the case of bivariate quartics, any PSD polynomial is a sum of squares (SOS) [34] and vice versa. However, for p ≥ 6 there are PSD non-SOS polynomials (e.g., those coming from the inequality between arithmetic and geometric means [35]). Therefore, a necessary and sufficient condition on the fourth-order moments is that $C$ be a PSD matrix. Let us study the conditions for that.

Using the monomial basis $T'_2 \equiv (1, X, Y, X^2, Y^2, XY)^T$, one has the 6 × 6 matrix $C_4 = E[T'_2 T'^T_2]$, written in simplified form in terms of the moments $m_{i,j} \equiv E[X^i Y^j]$ as:

$$ C_4 = \begin{pmatrix}
1 & 0 & 0 & 1 & 1 & c_g \\
0 & 1 & c_g & 0 & m_{1,2} & m_{2,1} \\
0 & c_g & 1 & m_{2,1} & 0 & m_{1,2} \\
1 & 0 & m_{2,1} & 3 & m_{2,2} & m_{3,1} \\
1 & m_{1,2} & 0 & m_{2,2} & 3 & m_{1,3} \\
c_g & m_{2,1} & m_{1,2} & m_{3,1} & m_{1,3} & m_{2,2}
\end{pmatrix}. $$

A necessary and sufficient condition for the positiveness of $C_4$ is given by the Sylvester criterion applied to its principal minors (see the Appendix). In Equation (22), the inequality for $d_4$ has a dual relationship ($d_{4,dual}$), its sign being reversed by swapping the two indices in $m_{i,j}$, whereas $d_5$ and $d_6$ are symmetric with respect to the indicial swap. The term $d_6$ of Equation (21) is a fourth-order polynomial of $c_g$. The inequalities for $d_4$ and $d_{4,dual}$ hold inside a compact domain denoted $D_4$, with the shape of what resembles a rounded polytope in the space of the cross moments $(m_{1,2}, m_{2,1}, m_{1,3}, m_{3,1}, m_{2,2})$, for each value of the Gaussian correlation $c_g$. The case of joint Gaussianity lies within the interior of $D_4$, corresponding to $m_{1,2} = m_{2,1} = 0$, $m_{1,3} = m_{3,1} = 3 c_g$ and $m_{2,2} = 2 c_g^2 + 1$, defining the hereafter-called one-dimensional 'Gaussian manifold'. In order to illustrate how non-Gaussianity depends on the moments of third and fourth order, we have computed the non-Gaussian MI of order 4 ($I_{ng,p=4}$) along a set of 2-dimensional cross-sections of $D_4$ crossing the Gaussian manifold and extending up to the boundary of $D_4$. For Gaussian moments, $I_{ng,4}$ vanishes, being approximated by the Edgeworth expansion (16) near the Gaussian manifold.
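A minimal numerical sketch of this criterion (an eigenvalue test replaces the determinant-based Sylvester conditions of the Appendix; the sample moment values below are arbitrary):

```python
# Build the 6x6 moment matrix C4 = E[T T^T] for T = (1, x, y, x^2, y^2, xy)
# with standard Gaussian marginals, and test whether a candidate set of cross
# moments lies inside D4 (i.e., C4 is positive semi-definite).
import numpy as np

def C4(cg, m12, m21, m13, m31, m22):
    return np.array([
        [1,   0,   0,   1,   1,   cg ],
        [0,   1,   cg,  0,   m12, m21],
        [0,   cg,  1,   m21, 0,   m12],
        [1,   0,   m21, 3,   m22, m31],
        [1,   m12, 0,   m22, 3,   m13],
        [cg,  m21, m12, m31, m13, m22],
    ], dtype=float)

def in_D4(*moments):
    return np.linalg.eigvalsh(C4(*moments)).min() >= -1e-12

# Gaussian manifold: m12 = m21 = 0, m13 = m31 = 3*cg, m22 = 2*cg**2 + 1.
cg = 0.5
print(in_D4(cg, 0.0, 0.0, 3 * cg, 3 * cg, 2 * cg**2 + 1))   # True: interior point
print(in_D4(cg, 2.0, 2.0, 3 * cg, 3 * cg, 2 * cg**2 + 1))   # False: outside D4
```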
In order to get a picture of $I_{ng,p=4}$, we have chosen six particular cross-sections of $D_4$, obtained by varying two moments and setting the remaining ones to their 'Gaussian values'. The six pairs of varying parameters are: A $(c_g, m_{2,1})$, B $(c_g, m_{3,1})$, C $(c_g, m_{2,2})$, D $(m_{2,1}, m_{3,1})$ at $c_g = 0$, E $(m_{2,1}, m_{2,2})$ at $c_g = 0$ and F $(m_{3,1}, m_{2,2})$ at $c_g = 0$, with the contours of the corresponding $I_{ng}$ field shown in Figure 2a-f. The fields are retrieved from a discrete mesh of 100 × 100 points in moment space. The Gaussian state lies at $(c_g, 0)$, $(c_g, 3c_g)$, $(c_g, 2c_g^2 + 1)$, $(0, 0)$, $(0, 1)$ and $(0, 1)$, respectively, for the cases A up to F. The moment domains are obtained by solving the inequalities for $d_4$, $d_5$ and $d_6$ and applying the restrictions imposed by the crossing of the Gaussian manifold (e.g., $m_{1,2} = 0$, $m_{1,3} = m_{3,1} = 3c_g$, $m_{2,2} = 2c_g^2 + 1$ for case A); the corresponding restrictions for the cases A-F follow accordingly. The analytical boundary of the allowed domains is emphasized with thick solid lines in Figure 2a-f. There are some common aspects among the figures. As expected, $I_{ng}$ vanishes at the Gaussian states, on the Gaussian manifold, marked with G in each panel. At the boundary of the moment domain, the moment matrix becomes singular, i.e., some squared polynomial has a vanishing expectation, leading to a Dirac-Delta-like ME-PDF concentrated along a one-dimensional curve. This in turn leads to $I_{ng} = \infty$, except possibly in a set of singular points of $D_4$ on which $I_{ng}$ is not well defined. In practice, infinity is not reached due to the stopping criteria for convergence of the iterative method used for obtaining the ME-PDF.
At states where $|c_g| = 1$, $I_g = \infty$ and $I_{ng}$ has a second-kind singularity, a discontinuity where the contours merge together without a well-defined limit for $I_{ng}$. In the neighborhood of the Gaussian state with $c_g = 0$ in Figure 2d-f, $I_{ng}$ is approximated by the quadratic form (16), as confirmed by the elliptic shape of the $I_{ng}$ contours. The value of $I_{ng}$ can surpass $I_g$, thus emphasizing the fact that in some cases much of the MI may come from nonlinear (X, Y) correlations.
The joint entropy is invariant under a mirror symmetry of one or both variables, $(X, Y) \to (\pm X, \pm Y)$, because the absolute value of the determinant of that transformation equals 1. As a consequence, the dependence of the Gaussian and non-Gaussian MI on the moments also reflects these intrinsic mirror symmetries; for instance, in Figure 2d, the symmetry $X \to -X$ leaves $m_{2,1}$ unchanged while reversing the sign of $m_{3,1}$, so the $I_{ng}$ field is symmetric under $m_{3,1} \to -m_{3,1}$. At the boundary of the moment domain the ME-PDF is singular, due to the above deterministic relationship. Therefore, the closer that boundary is, the closer the ME-PDF is to a deterministic relationship, the more ill-conditioned the ME problem is and the slower the numerical convergence of the optimization algorithm becomes.

The Effect of Noise and Nonlinearity on Non-Gaussian MI
The aim of this section is an exploratory analysis of the possible sources of non-Gaussianity in a bivariate statistical relationship. Towards that aim, we explore the qualitative behavior of $I_g$ and $I_{ng}$ between a standardized signal $\hat X$ (with null mean and unit variance) and an $\hat X$-dependent standardized response variable $\hat Y$ contaminated by noise. For this purpose, a full range of signal-to-noise variance ratios (snr) shall be considered, from pure signal to pure noise. The statistics are evaluated from one-million-long synthetic sets of iid realizations of $(\hat X, \hat Y)$, produced by a numerical Gaussian random generator. Many interpretations are possible for the output variable: (i) $\hat Y$ taken as the observable outcome emerging from a noisy transmission channel fed by $\hat X$; (ii) $\hat Y$ given by the direct or indirect observation, affected by measurement and representativeness errors, corresponding to a certain value $\hat X$ of the model state vector [37]; (iii) $\hat Y$ as the outcome of a stochastic or deterministic dynamical system [38].
In order to estimate $I(\hat X, \hat Y)$, the working variables $(\hat X, \hat Y)$ are transformed by anamorphosis into standard Gaussian variables $(X, Y)$.
We consider, without loss of generality, $\hat X = X$. The variable Y is given by the Gaussian anamorphosis $Y = \Phi^{-1}(F_{\hat Y}(\hat Y)) \sim N(0,1)$, as in Equation (8), with:

$$ \hat Y = \sqrt{s}\, F(X) + \sqrt{1-s}\; n(X, \mathbf W), \qquad (35) $$

where $F(X)$ is a purely deterministic transfer function and $n(X, \mathbf W)$ is a scalar noise uncorrelated with $F(X)$, depending in general on X (e.g., multiplicative noise) and on a vector $\mathbf W$ of independent Gaussians contaminating the signal. Both $F(X)$ and $n(X, \mathbf W)$ have unit variance, with $E[n(X, \mathbf W)] = 0$. The signal-to-noise variance ratio is $snr = s/(1-s)$, where $s \in [0, 1]$ is the signal weight. Then, the Gaussian MI $I_g$ is computed for each value of $s \in [0, 1]$ and compared among several scenarios of $F(X)$ and $n(X, \mathbf W)$. A similar comparison is done for the non-Gaussian MI, approximated here by $I_{ng,p=8}$. Six case studies have been considered (A, B, C, D, E and F); their signal and noise terms are summarized in Table 1, along with the colors with which they are represented in Figure 3 further below.
Table 1. Types of signal and noise in Equation (35) and corresponding colors used in Figure 3.

Case study | Signal $F(X)$ | Noise $n(X, \mathbf W)$ | Color in Figure 3
A | linear | additive Gaussian | black
B | linear | additive non-Gaussian (symmetric leptokurtic) | red
C | linear | multiplicative (depending linearly on X) | blue
D | nonlinear (cubic homeomorphism) | additive Gaussian | magenta
E | nonlinear (non-injective in [-1, 1]) | additive Gaussian | green
F | nonlinear (non-injective) | multiplicative and non-Gaussian additive |

Here $W_1, W_2, W_3$ are independent standard Gaussian noises. We begin with a reference case, A, which refers to Gaussian additive noise. Case B refers to a symmetric leptokurtic (i.e., with kurtosis larger than that of the Gaussian) non-Gaussian noise. In case C the multiplicative noise depends linearly on the signal X. In case D, the signal is a nonlinear cubic homeomorphism of the real domain. For case E, the signal is nonlinear and not injective in the interval [-1, 1], thus introducing ambiguity in the relationship $(X, Y)$. Finally, in case F all the factors (non-Gaussian noise, multiplicative noise and signal ambiguity) are pooled together. Figure 3a,b show the graphics of the total MI, estimated by $I_g + I_{ng,8}$, and of the Gaussian MI ($I_g$) for the six cases (A to F). The graphic of the non-Gaussian MI, as approximated by $I_{ng,8}$, is depicted in Figure 3c for five cases (B to F). In Figure 4, we show a 'stamp-format' collection of the contours of the ME-PDFs of polynomial order p = 8 for all cases (A to F) and extreme and intermediate values of the snr: s = 0.1, s = 0.5 and s = 0.9. This illustrates how the snr and the nature of both the transfer function and the noises influence the PDFs.
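A minimal sketch of this experimental setup (the cubic transfer function and the Laplace noise below are illustrative stand-ins, not the exact signal/noise laws of Table 1):

```python
# Simulate Y_hat = sqrt(s) F(X) + sqrt(1-s) n (Equation (35)), Gaussianize,
# and evaluate the Gaussian MI I_g across the signal weight s (snr = s/(1-s)).
import numpy as np
from scipy.stats import norm, rankdata

rng = np.random.default_rng(1)
N = 1_000_000
X = rng.standard_normal(N)

F = (X**3) / np.sqrt(15.0)     # cubic signal; Var(X^3) = 15 for X ~ N(0,1)
n = rng.laplace(0.0, 1.0 / np.sqrt(2.0), N)   # unit-variance leptokurtic noise

def gaussianize(v):
    return norm.ppf(rankdata(v) / (len(v) + 1.0))

for s in (0.1, 0.5, 0.9):
    y_hat = np.sqrt(s) * F + np.sqrt(1.0 - s) * n
    cg = np.corrcoef(gaussianize(X), gaussianize(y_hat))[0, 1]
    print(s, -0.5 * np.log(1.0 - cg**2))   # I_g grows with the signal weight
```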
For the Gaussian noise case (A), the non-Gaussian MI is theoretically zero, since the joint distribution of $(X, Y)$ is Gaussian. In all scenarios, both $I_g$ and the total MI $I_g + I_{ng}$ grow, as expected, with the snr. This is in accordance with de Bruijn's identity, stating the positiveness of the MI derivative with respect to the snr and established in the signal-processing literature for certain types of noise [39,40]. On the contrary, monotonic behavior as a function of the snr is not a universal characteristic of the non-Gaussian MI. By observing Figure 3a-c, the following qualitative results are worth mentioning. We begin by comparing the total MI in three cases (A, B and C), which share the same linear signal but feature noise of different kinds (Figure 3a). Both the red (B) and blue (C) lines lie above the black line (A) for each given s, thus indicating that the total MI is lowest when the noise is Gaussian. This means that the Gaussian noise is the most signal-degrading of all noises with the same variance [41]. The extra MI found in the B and C cases comes, respectively, from the Gaussian MI (see case B in Figure 3c) and from the non-Gaussian MI (see case C in Figure 3a), as is also apparent by looking at the ME-PDFs for cases B and C at s = 0.1 (Figure 4).
We now consider the cases A, D and E, all of which have Gaussian noise. Their differences lie in the signals, the one in A being linear and the ones in D and E being nonlinear. By comparing these cases it is seen that $I_g$ is highest for the linear signal, the black curve lying above the magenta (D) and green (E) curves for each s in Figure 3b. This indicates that the Gaussian MI, measuring the degree of signal linearity, is lower when the signal introduces nonlinearity (cases D and E) than when no nonlinearity is present (case A). It is worth noting that, while the signals in A and D are injective, the one in E is not, thus introducing ambiguity. This implies a loss of information in E, which is visible in the total MI depicted for each s in Figure 3a. In fact, there the green curve (case E) lies lower than the black (A) and magenta (D) curves for every s. The effect of nonlinearity is quite evident in the ME-PDFs, in particular for the high s value (Figure 4, cases D and E, s = 0.9).
We focus now on the non-Gaussian MI, depicted in Figure 3c for each s. The curve representing case B, with a linear signal and a state-independent noise, indicates that the non-Gaussian MI is null for both s = 0 and s = 1. The first zero of the non-Gaussian MI (at s = 0) is justified by the noise being state-independent, whereas the second zero (at s = 1) is due to the signal being linear, which means that all the MI resides in the Gaussian MI. The non-Gaussian MI is thus positive, with its maximum at intermediate values of s.
By looking at case C (multiplicative noise), it is seen that the non-Gaussian MI remains roughly unchanged for every s < 1. This holds even at s = 0 (pure noise), since the noise is state-dependent and thus some information is already present. At s = 1 the non-Gaussian MI is null due to the signal being linear (as in case B).
By observing the cases with Gaussian noise and nonlinear signals (D and E) in Figure 3c, it can be seen that their non-Gaussian MI grows with s (and thus with the relative weight of the signal), due to their signals being nonlinear. This gradual behavior is also reflected in the ME-PDFs (Figure 4, cases D and E along the s values). Finally, we consider the case in which the signal is nonlinear and the noise comprises a multiplicative and a non-Gaussian additive component (case F). As compared with E (which differs from it in that the noise is Gaussian), it can be seen that the non-Gaussian MI is always larger in F, independently of s. This is due to the fact that in F there is information even at s = 0, owing to the state-dependence of its noise. For all values of s, the ME-PDF exhibits quite a large deviation from Gaussianity.

Discussion and Conclusions
We have addressed the problem of finding the minimum mutual information (MinMI), or the least noncommittal MI, between d = 2 random variables, consistent with a set of marginal and joint expectations. The MinMI is a proper MI lower bound when the marginals are set to ME-PDFs through appropriate nonlinear single anamorphoses. Moreover, the MinMI increases as one increases the number of independent cross-constraints of the bivariate ME problem. Considering a sequence of moments, we have obtained a hierarchy of lower MI bounds approximating the total MI value. The method can easily be generalized to d > 2 variables with the necessary adaptations.
One straightforward application of that principle follows from the MI estimation for 'Gaussianized' variables with real support, where the marginals are rendered standard Gaussian N(0,1) by Gaussian anamorphosis. This allows for the MI decomposition into two positive contributions: a Gaussian term $I_g$, which depends uniquely on the Gaussian correlation $c_g$ (the Pearson correlation in the space of 'Gaussianized' variables), and a non-Gaussian term $I_{ng}$ depending on nonlinear correlations. This term is equal to the joint negentropy, which is invariant under any oblique or orthogonal rotation of the 'Gaussianized' variables and is related to the 'compactness' measure, i.e., the closeness of the PDF to a low-dimensional deterministic manifold. The Gaussian correlation is a 'concordance' measure, invariant under monotonically growing homeomorphisms of the marginals, and is consequently expressed as a functional of the copula density function, which depends exclusively on the marginal cumulated probabilities. In certain extreme cases, very far from Gaussianity, the Pearson correlation among non-Gaussian variables is not a proper measure of the mutual information; an example of that situation is given.
Cross moments under marginal standard Gaussians are bounded by Schwarz-like inequalities defining compact sets, the shape of which resembles a rounded polytope where the cross moments live. The allowed moment values portray all possible joint PDFs with Gaussian marginals. Inside that set lies the so-called one-dimensional Gaussian manifold, parametrized by $c_g$, where joint Gaussianity holds. There, $I_{ng}$ vanishes, growing towards infinity as the boundary is approached, where the variables satisfy a deterministic relationship and the ME problem is ill-conditioned. This behavior is illustrated in cross-sections of the polytope of cross moments of total order p = 4.
In order to systematize the possible sources of Gaussian and non-Gaussian MI, we have computed it in the context of nonlinear noisy channels. The MI has been computed between a Gaussian input and a panoply of (linear and/or nonlinear) outputs contaminated by different kinds of noise for a full range of the signal-to-noise variance ratio. Sources of non-Gaussian MI include: (a) the nonlinearity of the signal transfer function, (b) multiplicative noise and (c) non-Gaussian additive noise. This paper is followed by a companion one [27] on the estimation of non-Gaussian MI from finite samples with practical applications.

The sum of marginal negentropies is not necessarily a lower bound of the joint negentropy $J(\hat X, \hat Y)$, because in some cases $I(\hat X, \hat Y) - I_g(c)$ can be negative. This means that $I_g(c)$ is not generally a proper lower bound of the MI. An example of that is given by the following discrete distribution with support on four points: $(\hat X, \hat Y) \in \{(1,1), (1,-1), (-1,1), (-1,-1)\}$, with mass probabilities $P_{\hat X \hat Y}$, respectively, of (1 + c)/4, (1 - c)/4, (1 - c)/4 and (1 + c)/4. This 4-point distribution has zero means and unit variances for $\hat X$ and $\hat Y$ and Pearson correlation c. The PDF is made of four Dirac-Deltas, and the mutual information $I(\hat X, \hat Y)$ is easily computed as the 4-point discrete mean of $\log\big(P_{\hat X \hat Y} / (P_{\hat X} P_{\hat Y})\big)$.



5 d and 6 d, 5 d and 6 d
are symmetric with respect to indicial swap.The term d 6 of Equation (21) is a fourth-order polynomial of c g .The inequalities for 4 d , 4 dual d hold inside a compact domain denoted D 4 , with the shape of what resembles a rounded polytope in the space of cross moments ( 1,2 2,1 1,3 3,1 2,2 , , , , m m m m m ), for each value of the Gaussian correlation c g .The case of Gaussianity lies within the interior of D 4 , corresponding to 2 1,2 defining the hereafter called one-dimensional 'Gaussian manifold'.

Figure 2. Contours of $I_{ng,p=4}$ on six 2-dimensional cross-sections (a-f) of the moment domain $D_4$; $I_{ng}$ grows monotonically towards the boundary of the moment domains $D_4$, where the ME-PDF approaches a deterministic relationship.

Figure 3. Graphs depicting the total MI (a), the Gaussian MI (b) and the non-Gaussian MI of order 8 (c) for the six cases (A-F) of different signal-noise combinations, with the signal weight s in abscissas varying from 0 up to 1. See text and Table 1 for details about the cases and their color code.

Appendix (numerical algorithm). The minimum of the functional $L(\eta, T, \theta)$ lies at a value of η dependent on $(T, \theta)$; the corresponding minimum of L is the value of the maximum entropy plus the logarithm of the normalization partition function. Except when an analytical relationship $\lambda(T, \theta)$ exists, this function has to be estimated by iterative techniques of minimization of $L(\eta, T, \theta)$. The numerical algorithm consists of a bivariate version of that presented in [20]. In practice, we have solved the ME problem on a finite square support $[-r, r]^2$, with r large enough to prevent significant boundary effects on the ME-PDF, thus obtaining a good estimation of the asymptotic ME-PDF. The positiveness of $C_4$ is established by the application of the Sylvester criterion, stating that the determinants $d_1, d_2, d_3, d_4, d_5$ and $d_6$ of the six upper-left sub-matrices of $C_4$ are positive. From these, only those of orders 4, 5 and 6 lead to nontrivial relationships, obtained with the help of Mathematica® [36].