# Minimum Mutual Information and Non-Gaussianity Through the Maximum Entropy Method: Theory and Properties


## Abstract

The minimum mutual information (MI) compatible with prescribed moment constraints, obtained through the maximum entropy (ME) method, is decomposed into the sum of a Gaussian MI ($I_g$), depending upon the Gaussian correlation, i.e., the correlation between 'Gaussianized variables', and a non-Gaussian MI ($I_{ng}$), coinciding with the joint negentropy and depending upon nonlinear correlations. Joint moments of a prescribed total order p are bounded within a compact set defined by Schwarz-like inequalities, where $I_{ng}$ grows from zero at the 'Gaussian manifold', where the moments are those of Gaussian distributions, towards infinity at the set's boundary, where a deterministic relationship holds. Sources of joint non-Gaussianity are systematized by estimating $I_{ng}$ between the input and output of a nonlinear synthetic channel contaminated by multiplicative and non-Gaussian additive noise for a full range of signal-to-noise (snr) variance ratios. We study the effect of varying snr on $I_g$ and $I_{ng}$ under several signal/noise scenarios.

## 1. Introduction

The MI is decomposed into the sum of a Gaussian MI $I_g$ and a non-Gaussian MI $I_{ng}$ [21], the latter vanishing under bivariate Gaussianity. The Gaussian MI is given by ${I}_{g}({c}_{g})\equiv -\frac{1}{2}\mathrm{log}(1-{c}_{g}^{2})$, where ${c}_{g}\equiv cor(\widehat{X},\widehat{Y})$ is the Pearson linear correlation between the Gaussianized variables $\widehat{X},\widehat{Y}$. The non-Gaussian MI term $I_{ng}$ relies upon imposed nonlinear correlations between the Gaussianized variables. The MI reduces to ${I}_{g}({c}_{g})$ when only moments of order one and two are imposed as ME constraints. We will also show that, for certain extreme non-Gaussian marginals, ${I}_{g}(c=cor(X,Y))>I(X,Y)$, thus showing that ${I}_{g}(c)$ [22] is not a proper MI lower bound in general.
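As a concrete sketch of this first step, $c_g$ and $I_g(c_g)$ can be estimated from samples by an empirical Gaussian anamorphosis (rank-based Gaussianization); the helper names below and the rank-based estimator are illustrative choices, not the paper's implementation.

```python
import numpy as np
from scipy.stats import norm, rankdata

def gaussianize(x):
    """Empirical Gaussian anamorphosis: map ranks to standard-normal quantiles."""
    u = rankdata(x) / (len(x) + 1.0)   # empirical CDF values in (0, 1)
    return norm.ppf(u)                 # 'Gaussianized' variable, ~ N(0, 1)

def gaussian_mi(x, y):
    """I_g(c_g) = -1/2 log(1 - c_g^2), with c_g the correlation of the Gaussianized pair."""
    c_g = np.corrcoef(gaussianize(x), gaussianize(y))[0, 1]
    return -0.5 * np.log(1.0 - c_g**2), c_g

rng = np.random.default_rng(0)
n = 100_000
x = rng.standard_normal(n)
y = 0.6 * x + 0.8 * rng.standard_normal(n)   # bivariate Gaussian with cor = 0.6
ig, cg = gaussian_mi(np.exp(x), y)           # exp(.) is monotone, so c_g is unchanged
```

Because $c_g$ is computed from ranks, it is invariant under monotonically growing transformations of the marginals, which is why feeding exp(x) instead of x still recovers $c_g \approx 0.6$.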

The non-Gaussian MI $I_{ng}$ holds some interesting properties. It coincides with the joint negentropy (the deficit of entropy with respect to that of the Gaussian PDF with the same mean, variance and covariance) in the space of 'Gaussianized' variables, which is invariant under any orthogonal or oblique rotation of them. In particular, for uncorrelated rotated variables, it coincides with the 'compactness', which measures the concentration of the joint distribution around a lower-dimensional manifold [25] and is given by $D\left({p}_{XY}\left|\right|{p}_{SG}\right)$, i.e., the KL divergence with respect to the spherical Gaussian ${p}_{SG}$ (the Gaussian with an isotropic covariance matrix with the same trace as that of ${p}_{XY}$, i.e., the total variance).

$I_{ng}$ comprises a series of positive terms associated with a p-sequence of imposed monomial expectations of total even order p (2, 4, 6, 8, …). The higher the number of independent constraints, the higher the order of the terms retained in that series and the more information is extracted from the joint PDF.

We have evaluated $I_{ng}$ within those sets as a function of third- and fourth-order cross moments. Near the set's boundary, where a deterministic relationship holds and the ME problem functional is ill-conditioned, $I_{ng}$ tends to infinity.

We estimate $I_g$ and $I_{ng}$ between the input and the output of a nonlinear channel contaminated by multiplicative and non-Gaussian noise for a full range of the signal-to-noise (snr) variance ratio. We put in evidence that sources of non-Gaussian MI arise from the nonlinearity of the transfer function, from multiplicative noise and from additive non-Gaussian noise [26].

## 2. MI Estimation from Maximum Entropy PDFs

#### 2.1. General Properties of Bivariate Mutual Information

#### 2.2. Congruency between Information Moment Sets

**Definition 1:** Following the notation of [28], we define the moment class ${\mathrm{\Omega}}_{T,\mathrm{\theta}}$ of bivariate PDFs ${\rho}_{XY}$ of $(X,Y)$ as:
$${\mathrm{\Omega}}_{T,\mathrm{\theta}}=\left\{{\rho}_{XY}:{E}_{\rho}\left[T\right]=\mathrm{\theta}\right\}$$

**Lemma 1:** Given two nested information moment sets $\left(T,\mathrm{\theta}\right)\subseteq \left({T}_{1},{\mathrm{\theta}}_{1}\right)$, i.e., with ${T}_{1}$ including more constraining functions than $T$, the respective ME-PDFs ${\rho}_{T,\mathrm{\theta}}^{*},{\rho}_{{T}_{1},{\mathrm{\theta}}_{1}}^{*}$, if they exist, satisfy the following conditions:
$${E}_{\rho}\left[-\mathrm{log}{\rho}_{T,\mathrm{\theta}}^{*}\right]={E}_{{\rho}_{T1,\mathrm{\theta}1}^{*}}\left[-\mathrm{log}{\rho}_{T,\mathrm{\theta}}^{*}\right]={E}_{{\rho}_{T,\mathrm{\theta}}^{*}}\left[-\mathrm{log}{\rho}_{T,\mathrm{\theta}}^{*}\right]={H}_{{\rho}_{T,\mathrm{\theta}}^{*}}$$
$${E}_{\rho}\left[-\mathrm{log}{\rho}_{T1,\mathrm{\theta}1}^{*}\right]={E}_{{\rho}_{T1,\mathrm{\theta}1}^{*}}\left[-\mathrm{log}{\rho}_{T1,\mathrm{\theta}1}^{*}\right]={H}_{{\rho}_{T1,\mathrm{\theta}1}^{*}}$$

**Definition 2:** If two information moment sets $\left({T}_{1},{\mathrm{\theta}}_{1}\right)$, $\left({T}_{2},{\mathrm{\theta}}_{2}\right)$ are related by linear affine relationships, then the sets are referred to as "congruent", a property hereby denoted as $\left({T}_{1},{\mathrm{\theta}}_{1}\right)\underset{PDF}{\cong}\left({T}_{2},{\mathrm{\theta}}_{2}\right)$, and consequently both PDF sets are equal, i.e., ${\mathrm{\Omega}}_{{T}_{1},{\mathrm{\theta}}_{1}}={\mathrm{\Omega}}_{{T}_{2},{\mathrm{\theta}}_{2}}$ [15]. A stronger condition than 'congruency' is ME-congruency, denoted as $\left({T}_{1},{\mathrm{\theta}}_{1}\right)\underset{ME}{\cong}\left({T}_{2},{\mathrm{\theta}}_{2}\right)$, holding when the associated ME-PDFs are equal. For example, both the univariate constraint sets $\left({T}_{1}={X}^{2},{\mathrm{\theta}}_{1}=1\right)$ and $\left({T}_{2}={({X}^{2},{X}^{4})}^{T},{\mathrm{\theta}}_{2}={(1,3)}^{T}\right)$ for $X\in \mathbb{R}$ lead to the same ME-PDF, the standard Gaussian N(0,1). Consequently, both information moment sets are ME-congruent but not congruent, since ${\mathrm{\Omega}}_{{T}_{2},{\mathrm{\theta}}_{2}}\subset {\mathrm{\Omega}}_{{T}_{1},{\mathrm{\theta}}_{1}}$. This is because the Lagrange multiplier of the ME functional (see Appendix 1) corresponding to the fourth moment is set to zero without any constraining effect. Congruency implies ME-congruency but not the converse.
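The non-binding fourth-moment constraint in this example can be checked numerically: N(0,1) already satisfies $E[X^2]=1$ and $E[X^4]=3$, so adding $(X^4, 3)$ leaves the ME-PDF unchanged. A minimal quadrature check (grid choices are illustrative):

```python
import numpy as np

# Riemann sum over a wide grid; Gaussian tails beyond |x| = 10 are negligible.
x = np.linspace(-10.0, 10.0, 200_001)
dx = x[1] - x[0]
pdf = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)   # standard Gaussian N(0, 1)

m2 = np.sum(x**2 * pdf) * dx   # second moment, should be 1
m4 = np.sum(x**4 * pdf) * dx   # fourth moment, should be 3
```

Since both target moments are matched by the N(0,1) density itself, the fourth-order Lagrange multiplier can indeed be set to zero.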

#### 2.3. MI Estimation from Maximum Entropy Anamorphoses

**Theorem 1:** Let $(X,Y)$ be a pair of single random variables (RVs), distributed as the ME-PDF associated to the independent constraints $({T}_{ind}=({T}_{X},{T}_{Y}),{\mathrm{\theta}}_{ind}=({\mathrm{\theta}}_{X},{\mathrm{\theta}}_{Y}))$. Both variables can be obtained from a previous ME-anamorphosis. Let $({T}_{1},{\mathrm{\theta}}_{1})=({T}_{cr1}\cup {T}_{ind1},{\mathrm{\theta}}_{cr1}\cup {\mathrm{\theta}}_{ind1})$ be a subset of $({T}_{2},{\mathrm{\theta}}_{2})=({T}_{cr2}\cup {T}_{ind2},{\mathrm{\theta}}_{cr2}\cup {\mathrm{\theta}}_{ind2})$, i.e., ${T}_{cr1}\subseteq {T}_{cr2}$ and ${T}_{ind}\subseteq {T}_{ind1}\subseteq {T}_{ind2}$, such that all independent moment sets are ME-congruent (see Definition 2), i.e., $({T}_{ind},{\mathrm{\theta}}_{ind})\underset{ME}{\cong}({T}_{ind1},{\mathrm{\theta}}_{ind1})\underset{ME}{\cong}({T}_{ind2},{\mathrm{\theta}}_{ind2})$, i.e., such that the independent extra moments in $({T}_{2},{\mathrm{\theta}}_{2})$ do not further constrain the ME-PDF. Each marginal moment set is decomposed as $({T}_{ind\,j}=({T}_{Xj},{T}_{Yj}),{\mathrm{\theta}}_{ind\,j}=({\mathrm{\theta}}_{Xj},{\mathrm{\theta}}_{Yj}))$, $(j=1,2)$. For simplicity of notation, let us denote $I{\left(X,Y\right)}_{j}\equiv I\left(X;Y:({T}_{j},{\mathrm{\theta}}_{j}),({T}_{ind\,j},{\mathrm{\theta}}_{ind\,j})\right)$, $(j=1,2)$.
Then, the following inequalities between constrained mutual informations hold:
$$I{\left(X,Y\right)}_{j}=I\left(X,Y\right)-D\left({\rho}_{XY}\left|\right|{\rho}_{{T}_{j},{\mathrm{\theta}}_{j}}^{*}\right)\le I\left(X,Y\right)\hspace{0.5em}(j=1,2)$$
$$I{\left(X,Y\right)}_{2}-I{\left(X,Y\right)}_{1}=D\left({\rho}_{{T}_{2},{\mathrm{\theta}}_{2}}^{*}\left|\right|{\rho}_{{T}_{1},{\mathrm{\theta}}_{1}}^{*}\right)={H}_{{\rho}_{T1,\mathrm{\theta}1}^{*}}-{H}_{{\rho}_{T2,\mathrm{\theta}2}^{*}}\ge 0$$
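The second identity can be checked in closed form with Gaussian ME-PDFs: take $T_1$ as the unit marginal variances, whose ME-PDF is a pair of independent standard Gaussians (so $I(X,Y)_1=0$), and $T_2$ as $T_1$ plus the covariance $c$, whose ME-PDF is the correlated bivariate Gaussian with $I(X,Y)_2=-\frac{1}{2}\mathrm{log}(1-c^2)$. The entropy difference of the two ME-PDFs then reproduces exactly that MI gap (a sketch under these Gaussian assumptions):

```python
import math

c = 0.7  # covariance imposed only in the richer set T2

def h_gauss2(det_cov):
    """Differential entropy of a bivariate Gaussian: 1 + log(2*pi) + 0.5*log(det C)."""
    return 1.0 + math.log(2.0 * math.pi) + 0.5 * math.log(det_cov)

H1 = h_gauss2(1.0)           # ME-PDF of T1: independent N(0,1) x N(0,1), det C = 1
H2 = h_gauss2(1.0 - c * c)   # ME-PDF of T2: correlated Gaussian, det C = 1 - c^2

I1 = 0.0                             # independent Gaussians carry no MI
I2 = -0.5 * math.log(1.0 - c * c)    # Gaussian MI for correlation c

gap = (I2 - I1) - (H1 - H2)          # Theorem 1 predicts gap == 0 and H1 >= H2
```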

## 3. MI Decomposition under Gaussian Marginals

#### 3.1. Gaussian Anamorphosis and Gaussian Correlation

#### 3.2. Gaussian and Non-Gaussian MI

**Theorem 2:** Given ${({\widehat{X}}_{r},{\widehat{Y}}_{r})}^{T}=A{(\widehat{X},\widehat{Y})}^{T}$, a pair of rotated standardized variables (A being an invertible 2 × 2 matrix), one has the following result, with proof in Appendix 2:
$$\begin{array}{l}J(\widehat{X},\widehat{Y})=J(\widehat{X})+J(\widehat{Y})+I(\widehat{X},\widehat{Y})-{I}_{g}(cor(\widehat{X},\widehat{Y}))=\\ =J({\widehat{X}}_{r},{\widehat{Y}}_{r})=J({\widehat{X}}_{r})+J({\widehat{Y}}_{r})+I({\widehat{X}}_{r},{\widehat{Y}}_{r})-{I}_{g}(cor({\widehat{X}}_{r},{\widehat{Y}}_{r}))\end{array}$$

**Corollary 1:** For standard Gaussian variables $(X,Y)$ and standardized rotated ones ${({X}_{r},{Y}_{r})}^{T}=A{(X,Y)}^{T}$, we have:
$${I}_{ng}(X,Y)=J(X,Y)=J({X}_{r},{Y}_{r})=J({X}_{r})+J({Y}_{r})+I({X}_{r},{Y}_{r})-{I}_{g}(cor({X}_{r},{Y}_{r}))$$

The MI is thus decomposed into a term explained by correlation (the equivalent of $I_g$) and a term not explained by correlation (the equivalent of $I_{ng}$). There is, however, no guarantee that this decomposition is unique in the non-Gaussian case, since there is no natural bivariate extension of univariate prescribed PDFs with a given correlation [30].

**Figure 1.** Semi-logarithmic graphs of ${I}_{g}(c)$ (black thick line), ${I}_{4}(c)$ (grey thick line) and of the successive growing estimates of ${I}_{4}(c)$: ${I}_{g}$, ${I}_{g}+{I}_{ng,4}$, ${I}_{g}+{I}_{ng,6}$ and ${I}_{g}+{I}_{ng,8}$ (grey thin lines). See text for details.

#### 3.3. The Sequence of Non-Gaussian MI Lower Bounds from Cross-Constraints

For even p, the ME-PDF takes the form $C\,\mathrm{exp}({P}_{p}(X,Y))$, with ${P}_{p}$ a pth-order polynomial given by a linear combination of the monomials in ${T}_{p}$ with weights given by the Lagrange multipliers, and C a normalizing constant. That happens because $|{X}^{i}{Y}^{j}|\le O({X}^{p}+{Y}^{p}),\forall X,Y\in \mathbb{R},1\le i+j\le p$, thus allowing ${P}_{p}(X,Y)\to -\infty $ as $\left|X\right|,\left|Y\right|\to \infty $, leading to convergence of the ME integrals [18]. For odd p, those integrals diverge because the dominant monomials ${X}^{p},{Y}^{p}$ change sign along the real line, and therefore the ME-PDF is not well defined.

Near Gaussianity, $I_{ng}$ is approximated by a polynomial of the joint bivariate cumulants ${k}^{[i,j]}\equiv E[{X}_{r}^{i}{Y}_{r}^{j}]-{E}_{g}[{X}_{r}^{i}]{E}_{g}[{Y}_{r}^{j}]$, i + j > 2, of any pair of uncorrelated standardized variables ${({X}_{r},{Y}_{r})}^{T}=A{(X,Y)}^{T}$. Cross-cumulants are nonlinear correlations measuring joint non-Gaussianity [32], vanishing when $\left(X,Y\right)$ are jointly Gaussian. For example, $I{\left(X,Y\right)}_{ng,p=4}$ is approximated, as in [13], by a sum of squares of such cumulants.

The joint distribution may be regarded as resulting from a number $n_{eq}$ of independent and identically distributed (iid) bivariate RVs. Therefore, from the multidimensional Central Limit Theorem [33], the larger $n_{eq}$ is, the closer the distribution is to joint Gaussianity, and the smaller the absolute values of the cumulants become.
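The cross-cumulants $k^{[i,j]}$ above are easily estimated from samples; the sketch below (our illustration, with the Gaussian moments $E_g[X^k]$ computed as double factorials) shows them vanishing for an uncorrelated Gaussian pair but not for an uncorrelated, nonlinearly dependent pair. The exact weights of the sum of squares entering $I_{ng,p=4}$ are not reproduced here.

```python
import numpy as np

def gauss_moment(k):
    """E_g[X^k] for X ~ N(0,1): 0 for odd k, (k-1)!! for even k."""
    return 0.0 if k % 2 else float(np.prod(np.arange(1, k, 2)))

def cross_cumulant(xr, yr, i, j):
    """k^[i,j] = E[X^i Y^j] - E_g[X^i] E_g[Y^j] for an uncorrelated standardized pair."""
    return np.mean(xr**i * yr**j) - gauss_moment(i) * gauss_moment(j)

rng = np.random.default_rng(1)
n = 200_000
xg = rng.standard_normal(n)
yg = rng.standard_normal(n)            # jointly Gaussian, uncorrelated
k21_g = cross_cumulant(xg, yg, 2, 1)   # ~ 0
k22_g = cross_cumulant(xg, yg, 2, 2)   # ~ 0

yn = (xg**2 - 1.0) / np.sqrt(2.0)      # standardized, uncorrelated with xg, yet dependent
k21_n = cross_cumulant(xg, yn, 2, 1)   # E[X^2 (X^2-1)]/sqrt(2) = sqrt(2): purely nonlinear
```

The pair (xg, yn) has zero linear correlation but a cross-cumulant of order sqrt(2), the kind of nonlinear correlation that $I_{ng}$ captures and $I_g$ misses.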

#### 3.4. Non-Gaussian MI across the Polytope of Cross Moments

In order to assess how $I_{ng}$ behaves, we have numerically evaluated ${I}_{ng,p=4}$ along the allowed set of cross-moments.

The inequality for ${d}_{4}$ has a dual counterpart (${d}_{4,dual}$), its sign being reversed by swapping the two indices in ${m}_{i,j}$, whereas ${d}_{5}$ and ${d}_{6}$ are symmetric with respect to indicial swap. The term ${d}_{6}$ of Equation (21) is a fourth-order polynomial of ${c}_{g}$. The inequalities for ${d}_{4}$, ${d}_{4,dual}$, ${d}_{5}$ and ${d}_{6}$ hold inside a compact domain, denoted ${\mathbf{D}}_{4}$, with the shape of a rounded polytope in the space of cross moments (${m}_{1,2},{m}_{2,1},{m}_{1,3},{m}_{3,1},{m}_{2,2}$), for each value of the Gaussian correlation ${c}_{g}$. The case of Gaussianity lies within the interior of ${\mathbf{D}}_{4}$, corresponding to ${m}_{1,2}={m}_{2,1}=0$; ${m}_{1,3}={m}_{3,1}=3{c}_{g}$; ${m}_{2,2}=2{c}_{g}^{2}+1$, thus defining the hereafter-called one-dimensional 'Gaussian manifold'.
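These 'Gaussian values' follow from Isserlis' theorem for a jointly Gaussian standardized pair with correlation $c_g$: $E[X^2Y]=0$, $E[X^3Y]=3c_g$ and $E[X^2Y^2]=1+2c_g^2$. A Monte Carlo sanity check:

```python
import numpy as np

rng = np.random.default_rng(2)
c = 0.5                                 # Gaussian correlation c_g
n = 400_000
x = rng.standard_normal(n)
y = c * x + np.sqrt(1.0 - c * c) * rng.standard_normal(n)   # cor(x, y) = c

m21 = np.mean(x**2 * y)       # Gaussian value: 0
m31 = np.mean(x**3 * y)       # Gaussian value: 3c = 1.5
m22 = np.mean(x**2 * y**2)    # Gaussian value: 1 + 2c^2 = 1.5
```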

We have evaluated ${I}_{ng,4}$ along cross-sections of ${\mathbf{D}}_{4}$ crossing the Gaussian manifold and extending up to the boundary of ${\mathbf{D}}_{4}$. For Gaussian moments, ${I}_{ng,4}$ vanishes, being approximated by the Edgeworth expansion (16) near the Gaussian manifold.

We take bivariate cross-sections of ${\mathbf{D}}_{4}$ by varying two moments and setting the remaining ones to their 'Gaussian values'. The six pairs of varying parameters are: A (${c}_{g}$, ${m}_{2,1}$), B (${c}_{g}$, ${m}_{3,1}$), C (${c}_{g}$, ${m}_{2,2}$), D (${m}_{2,1}$, ${m}_{3,1}$) at ${c}_{g}=0$, E (${m}_{2,1}$, ${m}_{2,2}$) at ${c}_{g}=0$ and F (${m}_{3,1}$, ${m}_{2,2}$) at ${c}_{g}=0$, with the contours of the corresponding $I_{ng}$ field shown in Figure 2a–f. The fields are retrieved from a discrete 100 × 100 mesh in moment space. The Gaussian state lies at: (${c}_{g}$, 0), (${c}_{g}$, $3{c}_{g}$), (${c}_{g}$, $2{c}_{g}^{2}+1$), (0, 0), (0, 1) and (0, 1), respectively for cases A up to F. The moment domains are obtained by solving the inequalities for ${d}_{4}$, ${d}_{5}$ and ${d}_{6}$ and applying the restrictions imposed by the crossing of the Gaussian manifold (e.g., ${m}_{1,2}=0$, ${m}_{1,3}={m}_{3,1}=3{c}_{g}$, ${m}_{2,2}=2{c}_{g}^{2}+1$ for case A). We obtain the following restrictions for cases A–F:

$I_{ng}$ vanishes at the Gaussian states, the Gaussian manifold, marked with G in Figure 2. $I_{ng}$ grows monotonically towards the boundary of the moment domain ${\mathbf{D}}_{4}$. There, $\mathrm{det}({C}_{4})=0$, meaning that ${C}_{4}$ is singular and there is a vector $(b\ne 0)\in Ker({C}_{4})$. This holds if one gets the deterministic relationship ${Q}_{p/2}(x,y)={b}^{T}{T}_{2}^{1}=0$, leading to a Dirac-delta-like ME-PDF along a one-dimensional curve. This in turn leads to $I_{ng}=\infty$, except possibly at a set of singular points of ${\mathbf{D}}_{4}$ on which $I_{ng}$ is not well defined. In practice, infinity is not reached, due to the stopping criteria for convergence of the iterative method used to obtain the ME-PDF.

**Figure 2.** Field of the non-Gaussian MI ${I}_{ng,4}$ along 6 bivariate cross-sections of the set of allowed moments. Two varying moments are featured in each cross-section: A (${c}_{g}$, ${m}_{2,1}$), B (${c}_{g}$, ${m}_{3,1}$), C (${c}_{g}$, ${m}_{2,2}$), D (${m}_{2,1}$, ${m}_{3,1}$) at ${c}_{g}=0$, E (${m}_{2,1}$, ${m}_{2,2}$) at ${c}_{g}=0$ and F (${m}_{3,1}$, ${m}_{2,2}$) at ${c}_{g}=0$. The letter G indicates points or curves where Gaussianity holds.

For $|{c}_{g}| = 1$, ${I}_{g}=\infty$ and $I_{ng}$ has a discontinuity of the second kind, where the contours merge together without a well-defined limit for $I_{ng}$. In the neighborhood of the Gaussian state with ${c}_{g}=0$ in Figure 2d–f, $I_{ng}$ is approximated by the quadratic form (16), as confirmed by the elliptic shape of the $I_{ng}$ contours. The value of $I_{ng}$ can surpass $I_g$, thus emphasizing the fact that in some cases much of the MI may come from nonlinear (X,Y) correlations.

In Figure 2d, the symmetry $X\to -X$ leads to ${I}_{ng}({m}_{2,1}, {m}_{3,1}) = {I}_{ng}({m}_{2,1}, -{m}_{3,1})$, where the arguments are the varying moments, while the symmetry $Y\to -Y$ leads to ${I}_{ng}({m}_{2,1}, {m}_{3,1}) = {I}_{ng}(-{m}_{2,1}, -{m}_{3,1})$. The symmetries corresponding to the remaining figures are: ${I}_{ng}({c}_{g}, {m}_{2,1}) = {I}_{ng}(-{c}_{g}, {m}_{2,1}) = {I}_{ng}({c}_{g}, -{m}_{2,1})$; ${I}_{ng}({c}_{g}, {m}_{3,1}) = {I}_{ng}(-{c}_{g}, -{m}_{3,1})$; ${I}_{ng}({c}_{g}, {m}_{2,2}) = {I}_{ng}(-{c}_{g}, {m}_{2,2})$; ${I}_{ng}({m}_{2,1}, {m}_{2,2}) = {I}_{ng}(-{m}_{2,1}, {m}_{2,2})$; ${I}_{ng}({m}_{3,1}, {m}_{2,2}) = {I}_{ng}(-{m}_{3,1}, {m}_{2,2})$, respectively for Figure 2a–c, e and f.

## 4. The Effect of Noise and Nonlinearity on Non-Gaussian MI

In this section, we estimate $I_g$ and $I_{ng}$ between a standardized signal $\widehat{X}$ (with null mean and unit variance) and an $\widehat{X}$-dependent standardized response variable $\widehat{Y}$ contaminated by noise. For this purpose, a full range of signal-to-noise variance ratios (snr) shall be considered, from pure signal to pure noise. The statistics are evaluated from one-million-long iid synthetic $(\widehat{X},\widehat{Y})$ realizations produced by a numeric Gaussian random generator. Many interpretations are possible for the output variable: (i) $\widehat{Y}$ taken as the observable outcome emerging from a noisy transmission channel fed by $\widehat{X}$; (ii) the direct or indirect observation, affected by measurement and representativeness errors, corresponding to a certain value $\widehat{X}$ of the model state vector [37]; (iii) the outcome of a stochastic or deterministic dynamical system [38].

The Gaussian MI $I_g$ is computed for each value of $s\in[0,1]$ and compared among several scenarios of $F(X)$ and $n(X,W)$. A similar comparison is done for the non-Gaussian MI, approximated here by ${I}_{ng,p=8}$. Six case studies have been considered (A, B, C, D, E and F); their signal and noise terms are summarized in Table 1, along with the colors with which they are represented in Figure 3 further below.

**Table 1.** Types of signal and noise in Equation (35) and the corresponding colors used in Figure 3.

| Case study | $F(X)$ | $n(X,W)$ | Color |
|---|---|---|---|
| A—Gaussian noise (reference) | $X$ | $W$ | Black |
| B—Additive non-Gaussian independent noise | $X$ | ${W}^{3}/\sqrt{15}$ | Red |
| C—Multiplicative noise | $X$ | $W\,X$ | Blue |
| D—Smooth nonlinear homeomorphism | ${X}^{3}/\sqrt{15}$ | $W$ | Magenta |
| E—Smooth non-injective transformation | $({X}^{3}-X)/\sqrt{10}$ | $W$ | Green |
| F—Combined non-Gaussianity | $({X}^{3}-X)/\sqrt{10}$ | $X{W}_{1}/\sqrt{2}+{W}_{2}^{3}/\sqrt{30}$ | Cyan |
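The generation of one case can be sketched as follows for the reference case A; Equation (35) is not reproduced above, so the mixture $\widehat{Y}\propto s\,F(\widehat{X})+(1-s)\,n$, standardized afterwards, is our assumption for illustration. For case A the pair is jointly Gaussian, so the total MI reduces to $I_g$ and grows with the signal weight s:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000

def gaussian_mi_case_a(s):
    """Case A (linear signal, additive Gaussian noise); the s/(1-s) weighting
    standardized a posteriori is an assumed stand-in for Equation (35)."""
    x = rng.standard_normal(n)       # standardized signal
    w = rng.standard_normal(n)       # standard Gaussian noise
    y = s * x + (1.0 - s) * w
    y /= y.std()                     # standardize the response
    c = np.corrcoef(x, y)[0, 1]      # (x, y) jointly Gaussian: c_g = c
    return -0.5 * np.log(1.0 - c * c)

mi_low, mi_high = gaussian_mi_case_a(0.2), gaussian_mi_case_a(0.8)
# MI increases with s, consistent with the monotonicity discussed below for Gaussian noise
```

For the nonlinear cases (B–F), the same $c_g$ would be computed after first Gaussianizing $\widehat{Y}$, and the non-Gaussian part would require the ME estimate of $I_{ng,p}$.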

**Figure 3.** Graphs depicting the total MI (**a**), the Gaussian MI (**b**) and the non-Gaussian MI (**c**) of order 8 for the 6 cases (A–F) of different signal-noise combinations, with the signal weight s in the abscissas varying from 0 up to 1. See text and Table 1 for details about the cases and their color code.

Figure 3a,b show the graphs of the total MI $I_g + I_{ng,8}$ and of the Gaussian MI ($I_g$) for the six cases (A to F). The graph of the non-Gaussian MI, as approximated by $I_{ng,8}$, is depicted in Figure 3c for five cases (B to F). In Figure 4, we show a 'stamp-format' collection of the contours of the ME-PDFs of polynomial order p = 8 for all cases (A to F) and extreme and intermediate values of the snr: s = 0.1, s = 0.5 and s = 0.9. This illustrates how the snr and the nature of both the transfer function and the noises influence the PDFs.

Both the Gaussian MI $I_g$ and the total MI $I_g + I_{ng}$ grow, as expected, with the snr. This is in accordance with de Bruijn's identity, which states the positiveness of the MI derivative with respect to the snr and is established in the signal-processing literature for certain types of noise [39,40]. On the contrary, monotonic behavior as a function of the snr is not a universal characteristic of the non-Gaussian MI.

The Gaussian MI $I_g$ is highest for the linear signal, the black curve (A) lying above the magenta (D) and green (E) curves for each s in Figure 3b. This indicates that the Gaussian MI, measuring the degree of signal linearity, is lower when the signal introduces nonlinearity (cases D and E) than when no nonlinearity is present (case A).

**Figure 4.** Collection of stamp-format ME-PDFs for cases **A**–**F** (see text for details) and signal weight s = 0.1 (**a**), s = 0.5 (**b**) and s = 0.9 (**c**) over the $[-3, 3]^2$ support set.

## 7. Discussion and Conclusions

The MI has been decomposed into the sum of a Gaussian term $I_g$, which depends uniquely on the Gaussian correlation $c_g$ (the Pearson correlation in the space of 'Gaussianized' variables), and a non-Gaussian term $I_{ng}$ depending on nonlinear correlations. This term is equal to the joint negentropy, which is invariant under any oblique or orthogonal rotation of the 'Gaussianized' variables and is related to the 'compactness' measure, i.e., the closeness of the PDF to a low-dimensional manifold expressing a deterministic relationship. The Gaussian MI is also a 'concordance' measure, invariant under any monotonically growing homeomorphism of the marginals and consequently expressed as a functional of the copula density, which depends exclusively on the marginal cumulative probabilities. In certain extreme cases, very far from Gaussianity, the Pearson correlation among non-Gaussian variables is not a proper measure of the mutual information; an example of that situation is given.

The joint cross moments are confined to a compact set containing the 'Gaussian manifold', parameterized by $c_g$, where joint Gaussianity holds. There, $I_{ng}$ vanishes, growing towards infinity as the boundary is approached, where the variables satisfy a deterministic relationship and the ME problem is ill-conditioned. This behavior is illustrated in cross-sections of the polytope of cross moments of total order p = 4.


## References

1. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: New York, NY, USA, 1991.
2. Ebrahimi, N.; Soofi, E.S.; Soyer, R. Information measures in perspective. Int. Stat. Rev. **2010**, 78, 383–412.
3. Stögbauer, H.; Kraskov, A.; Astakhov, S.A.; Grassberger, P. Least-dependent-component analysis based on mutual information. Phys. Rev. E **2004**, 70, 066123:1–066123:17.
4. Hyvärinen, A.; Oja, E. Independent component analysis: Algorithms and applications. Neural Netw. **2000**, 13, 411–430.
5. Fraser, A.M.; Swinney, H.L. Independent coordinates for strange attractors from mutual information. Phys. Rev. A **1986**, 33, 1134–1140.
6. DelSole, T. Predictability and information theory. Part I: Measures of predictability. J. Atmos. Sci. **2004**, 61, 2425–2440.
7. Majda, A.; Kleeman, R.; Cai, D. A mathematical framework for quantifying predictability through relative entropy. Meth. Appl. Anal. **2002**, 9, 425–444.
8. Darbellay, G.A.; Vajda, I. Entropy expressions for multivariate continuous distributions. IEEE Trans. Inform. Theor. **2000**, 46, 709–712.
9. Nadarajah, S.; Zografos, K. Expressions for Rényi and Shannon entropies for bivariate distributions. Inform. Sci. **2005**, 170, 173–189.
10. Khan, S.; Bandyopadhyay, S.; Ganguly, A.R.; Saigal, S.; Erickson, D.J.; Protopopescu, V.; Ostrouchov, G. Relative performance of mutual information estimation methods for quantifying the dependence among short and noisy data. Phys. Rev. E **2007**, 76, 026209:1–026209:15.
11. Walters-Williams, J.; Li, Y. Estimation of mutual information: A survey. Lect. Notes Comput. Sci. **2009**, 5589, 389–396.
12. Globerson, A.; Tishby, N. The minimum information principle for discriminative learning. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence, Banff, Canada, 7–11 July 2004; pp. 193–200.
13. Jaynes, E.T. On the rationale of maximum-entropy methods. Proc. IEEE **1982**, 70, 939–952.
14. Wackernagel, H. Multivariate Geostatistics—An Introduction with Applications; Springer-Verlag: Berlin, Germany, 1995.
15. Shams, S.A. Convergent iterative procedure for constructing bivariate distributions. Comm. Stat. Theor. Meth. **2010**, 39, 1026–1037.
16. Ebrahimi, N.; Soofi, E.S.; Soyer, R. Multivariate maximum entropy identification, transformation, and dependence. J. Multivariate Anal. **2008**, 99, 1217–1231.
17. Abramov, R. An improved algorithm for the multidimensional moment-constrained maximum entropy problem. J. Comput. Phys. **2007**, 226, 621–644.
18. Abramov, R. The multidimensional moment-constrained maximum entropy problem: A BFGS algorithm with constraint scaling. J. Comput. Phys. **2009**, 228, 96–108.
19. Abramov, R. The multidimensional maximum entropy moment problem: A review on numerical methods. Commun. Math. Sci. **2010**, 8, 377–392.
20. Rockinger, M.; Jondeau, E. Entropy densities with an application to autoregressive conditional skewness and kurtosis. J. Econometrics **2002**, 106, 119–142.
21. Pires, C.A.; Perdigão, R.A.P. Non-Gaussianity and asymmetry of the winter monthly precipitation estimation from the NAO. Mon. Wea. Rev. **2007**, 135, 430–448.
22. Kraskov, A.; Stögbauer, H.; Grassberger, P. Estimating mutual information. Phys. Rev. E **2004**, 69, 066138:1–066138:16.
23. Myers, J.L.; Well, A.D. Research Design and Statistical Analysis, 2nd ed.; Lawrence Erlbaum Associates: Mahwah, NJ, USA, 2003.
24. Calsaverini, R.S.; Vicente, R. An information-theoretic approach to statistical dependence: Copula information. Europhys. Lett. **2009**, 88, 68003.
25. Monahan, A.H.; DelSole, T. Information theoretic measures of dependence, compactness, and non-Gaussianity for multivariate probability distributions. Nonlinear Proc. Geoph. **2009**, 16, 57–64.
26. Guo, D.; Shamai, S.; Verdú, S. Mutual information and minimum mean-square error in Gaussian channels. IEEE Trans. Inform. Theor. **2005**, 51, 1261–1283.
27. Pires, C.A.; Perdigão, R.A.P. Minimum mutual information and non-Gaussianity through the maximum entropy method: Estimation from finite samples. Entropy **2012**, submitted for publication.
28. Shore, J.E.; Johnson, R.W. Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy. IEEE Trans. Inform. Theor. **1980**, 26, 26–37.
29. Koehler, K.J.; Symmanowski, J.T. Constructing multivariate distributions with specific marginal distributions. J. Multivariate Anal. **1995**, 55, 261–282.
30. Cruz-Medina, I.R.; Osorio-Sánchez, M.; García-Páez, F. Generation of multivariate random variables with known marginal distribution and a specified correlation matrix. InterStat **2010**, 16, 19–29.
31. van Hulle, M.M. Edgeworth approximation of multivariate differential entropy. Neural Computat. **2005**, 17, 1903–1910.
32. Comon, P. Independent component analysis, a new concept? Signal Process. **1994**, 36, 287–314.
33. van der Vaart, A.W. Asymptotic Statistics; Cambridge University Press: New York, NY, USA, 1998.
34. Hilbert, D. Über die Darstellung definiter Formen als Summe von Formenquadraten. Math. Ann. **1888**, 32, 342–350.
35. Ahmadi, A.A.; Parrilo, P.A. A positive definite polynomial Hessian that does not factor. In Proceedings of the Joint 48th IEEE Conference on Decision and Control and 28th Chinese Control Conference, Shanghai, China, 15–18 December 2009; pp. 16–18.
36. Wolfram, S. The Mathematica Book, 3rd ed.; Cambridge University Press: Cambridge, UK, 1996.
37. Bocquet, M.; Pires, C.; Lin, W. Beyond Gaussian statistical modeling in geophysical data assimilation. Mon. Wea. Rev. **2010**, 138, 2997–3023.
38. Sura, P.; Newman, M.; Penland, C.; Sardeshmuck, P. Multiplicative noise and non-Gaussianity: A paradigm for atmospheric regimes? J. Atmos. Sci. **2005**, 62, 1391–1406.
39. Guo, D.; Shamai, S.; Verdú, S. Mutual information and minimum mean-square error in Gaussian channels. IEEE Trans. Inform. Theor. **2005**, 51, 1261–1283.
40. Rioul, O. A simple proof of the entropy-power inequality via properties of mutual information. arXiv **2007**, arXiv:cs/0701050v2.
41. Guo, D.; Wu, Y.; Shamai, S.; Verdú, S. Estimation in Gaussian noise: Properties of the minimum mean-square error. IEEE Trans. Inform. Theor. **2011**, 57, 2371–2385.
42. Gilbert, J.; Lemaréchal, C. Some numerical experiments with variable-storage quasi-Newton algorithms. Math. Program. **1989**, 45, 407–435.

## Appendix 1

Form and Numerical Estimation of ME-PDFs

The problem is first rescaled to a bounded square support by taking the appropriately scaled constraints. Then, we apply the scaling entropy relationship. The numerical integrations use $N_f$ weighting factors, each in the interval [−1, 1]. In order to get full resolution during the minimization, and to avoid 'Not-a-Number' (NaN) and infinity (INF) errors in computation, we subtract the polynomials in the arguments of the exponentials from the corresponding maximum in S. Finally, the functional $L$ is multiplied by a sufficiently high factor F in order to emphasize the gradient. After some preliminary experiments, we have set r = 6, $N_f$ = 80, F = 1000. Convergence is assumed to be reached when an accuracy of $10^{-6}$ is attained for the gradient of $L$. By setting the first guess of the Lagrange multipliers (FGLM) to zero, convergence is reached after about 60–400 iterations. For the optimization we have used the routine M1QN3 from INRIA [42], which implements the quasi-Newton BFGS algorithm. Convergence is slower under a higher condition number (CN) of the Hessian of $L$, with the convergence time growing, in general, with the proximity to the boundary of the domain of allowed moments, as well as with the total maximum order p of the constraining monomials. Convergence is faster when an FGLM closer to the exact solution is provided. This is possible in sequential ME problems with quite small successive constraint differences; there, one uses a continuation technique, setting the FGLM to the optimized Lagrange multipliers of the previous ME problem. This technique has been used in the computation of the graphics of Figure 3.
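A minimal one-dimensional analogue of this procedure: minimize the convex dual functional $\Gamma(\lambda)=\mathrm{log}\,Z(\lambda)+\lambda\cdot\theta$ with a quasi-Newton BFGS routine (scipy's `minimize`, standing in for M1QN3), under the constraints $E[X^2]=1$, $E[X^4]=3$ on a bounded support. The grid, support and tolerances are illustrative, not the paper's settings; note the same exponent-shifting trick used above to avoid overflow.

```python
import numpy as np
from scipy.optimize import minimize

x = np.linspace(-6.0, 6.0, 4001)
dx = x[1] - x[0]
T = np.vstack([x**2, x**4])        # constraining monomials T_k(x)
theta = np.array([1.0, 3.0])       # target moments (those of N(0, 1))

def dual(lam):
    """Convex dual of the ME problem: log Z(lambda) + lambda . theta,
    with the ME-PDF rho*(x) = exp(-lambda . T(x)) / Z(lambda)."""
    e = -(T.T @ lam)               # exponent at each grid node
    shift = e.max()                # subtract the maximum to avoid overflow
    z = np.sum(np.exp(e - shift)) * dx
    return np.log(z) + shift + lam @ theta

res = minimize(dual, x0=np.zeros(2), method="BFGS")
lam = res.x
pdf = np.exp(-(T.T @ lam)); pdf /= np.sum(pdf) * dx   # normalized ME-PDF

m2 = np.sum(x**2 * pdf) * dx   # recovered E[X^2], should be ~1
m4 = np.sum(x**4 * pdf) * dx   # recovered E[X^4], should be ~3
```

At the optimum the multipliers approach (1/2, 0), recovering the (truncated) standard Gaussian; the vanishing fourth-order multiplier mirrors the ME-congruency example of Definition 2, and warm-starting λ from a neighboring problem is the continuation (FGLM) technique described above.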

## Appendix 2

**Proof of Theorem 1:** Since X follows the ME-PDF generated by $({T}_{X},{\mathrm{\theta}}_{X})$ and ME-congruency holds, we have $D({\rho}_{X}||{\rho}_{{{\rm T}}_{X},{\mathrm{\theta}}_{X}}^{*})=D({\rho}_{X}||{\rho}_{{{\rm T}}_{X1},{\mathrm{\theta}}_{X1}}^{*})=D({\rho}_{X}||{\rho}_{{{\rm T}}_{X2},{\mathrm{\theta}}_{X2}}^{*})=0$, with similar identities for Y. Therefore, the first inequality of (7a) follows directly from (5). The second one is obtained by difference and by application of Lemma 1 to ${T}_{1}\subseteq {T}_{2}$ (q.e.d.).

**Proof of Theorem 2:** The first equality of (11) comes from Equation (1), written as $I(X,Y)=2{H}_{g}-{H}_{{\rho}_{XY}}$, and from the negentropy definition for the Gaussianized variables, $J(X,Y)={H}_{g}(X,Y)-{H}_{{\rho}_{XY}}=2{H}_{g}-{I}_{g}(X,Y)-{H}_{{\rho}_{XY}}$, where ${H}_{g}(X,Y)$ is the entropy of the Gaussian fit. From the entropy formula of transformed variables we get ${H}_{{\rho}_{{X}_{r}{Y}_{r}}}-{H}_{{\rho}_{XY}}={H}_{g}({X}_{r},{Y}_{r})-{H}_{g}(X,Y)=\mathrm{log}\left|\mathrm{det}A\right|$, leading to the negentropy equality $J(X,Y)=J({X}_{r},{Y}_{r})$. Finally, the last equation of (11) comes from the equality ${H}_{{\rho}_{{X}_{r}{Y}_{r}}}={H}_{{\rho}_{{X}_{r}}}+{H}_{{\rho}_{{Y}_{r}}}-I\left({X}_{r},{Y}_{r}\right)$ and from the definition of negentropy, i.e., $J({X}_{r})={H}_{g}-{H}_{{\rho}_{{X}_{r}}}$ (q.e.d.).

© 2012 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

Pires, C.A.L.; Perdigão, R.A.P. Minimum Mutual Information and Non-Gaussianity Through the Maximum Entropy Method: Theory and Properties. *Entropy* **2012**, *14*, 1103–1126. https://doi.org/10.3390/e14061103