
# Large-Sample Asymptotic Approximations for the Sampling and Posterior Distributions of Differential Entropy for Multivariate Normal Distributions

by 1,2,3,* and 1,2,3

1. Inserm, U678, Paris, F-75013, France
2. UPMC Univ Paris 06, UMR_S U678, Paris, F-75013, France
3. Inserm, Université de Montréal, and UPMC Univ Paris 06, LINeM, Montréal, QC, H3W 1W5, Canada

\* Author to whom correspondence should be addressed.
Entropy 2011, 13(4), 805-819; https://doi.org/10.3390/e13040805
Received: 15 February 2011 / Revised: 29 March 2011 / Accepted: 31 March 2011 / Published: 6 April 2011

## Abstract

In the present paper, we propose large-sample asymptotic approximations for the sampling and posterior distributions of differential entropy when the sample is composed of independent and identically distributed realizations of a multivariate normal distribution.

## 1. Introduction

Entropy has been an active topic of research for over 50 years and much has been published about this measure in various contexts. In statistics, recent developments have investigated how to estimate entropy from data, either in a parametric [1,2,3] or nonparametric framework [4,5], as well as the reliability and convergence properties of these estimators [6,7].
By contrast, relatively little is known about the statistical distribution of entropy, even in the simple case of a multivariate normal distribution. For instance, the differential entropy $H(X)$ of a D-dimensional random variable X that is normally distributed with mean μ and covariance matrix Σ is given by
$H(X) = h(\Sigma) = \frac{D}{2}\left[1 + \ln(2\pi)\right] + \frac{1}{2}\ln|\Sigma| \qquad (1)$
If $(x_n)_{n=1,\dots,N}$ are N independent and identically distributed realizations of X and S the corresponding sum of squares, then the sample differential entropy $h(S/N)$ is used as the so-called plug-in estimator for $H(X)$. However, $h(S/N)$ is also a random variable whose sampling distribution can be studied. Ahmed et al. provided the exact expression for the mean and variance of this variable [1]. Similarly, in a Bayesian framework, given $h(S/N)$, what are the probable values of $h(\Sigma)$? We are not aware of any study in this direction for multivariate normal distributions (but see, e.g., [8,9] for the posterior moments of entropy in the case of multinomial distributions). In the present paper, we provide an asymptotic approximation for both the sampling distribution of $h(S/N)$ and, in a Bayesian framework, the posterior distribution of $h(\Sigma)$ given $h(S/N)$. To this aim, we first calculate the moments of $|S|/|\nu\Sigma|$ under the same conditions as above. We then use this result to provide a closed-form expression for the cumulant-generating function of $U = -\ln(|S|/|\nu\Sigma|)$, from which we derive closed-form expressions for the cumulants, together with asymptotic expansions when $\nu \to \infty$. Using the characteristic function of U, we then provide an asymptotic normal approximation for the distribution of this variable. We finally apply these results to the sample and posterior entropy of multivariate normal distributions.
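The closed form of Equation (1) is easy to check numerically; a minimal sketch (the dimension and covariance below are arbitrary illustrative choices):

```python
import numpy as np
from scipy.stats import multivariate_normal

D = 3
rng = np.random.default_rng(0)
A = rng.standard_normal((D, D))
Sigma = A @ A.T + D * np.eye(D)  # an arbitrary positive definite covariance

def h(Sigma):
    """Differential entropy of N(mu, Sigma); it does not depend on mu."""
    D = Sigma.shape[0]
    return 0.5 * D * (1.0 + np.log(2.0 * np.pi)) + 0.5 * np.linalg.slogdet(Sigma)[1]

print(h(Sigma))
print(multivariate_normal(mean=np.zeros(D), cov=Sigma).entropy())  # same value
```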

## 2. General Result

Assume that S is distributed according to a Wishart distribution with $\nu \geq D$ degrees of freedom and scale matrix Σ, i.e., [10] (Chapter 7)
$p(S|\Sigma,\nu) = \frac{1}{Z_D(\nu)}\,|\Sigma|^{-\frac{\nu}{2}}\,|S|^{\frac{\nu-D-1}{2}}\exp\left[-\frac{1}{2}\operatorname{tr}\left(\Sigma^{-1}S\right)\right]$
where $Z_D(\nu)$ is the normalizing constant,
$Z_D(\nu) = 2^{\frac{\nu D}{2}}\,\pi^{\frac{D(D-1)}{4}}\prod_{d=1}^{D}\Gamma\left(\frac{\nu+1-d}{2}\right) \qquad (2)$
Direct calculation shows that we have, for $t \in \mathbb{R}$,
$\mathrm{E}\left[\left(\frac{|S|}{|\nu\Sigma|}\right)^{t}\,\middle|\,\Sigma,\nu\right] = \nu^{-Dt}\,\frac{Z_D(\nu+2t)}{Z_D(\nu)} \qquad (3)$
provided that the defining integral converges, i.e., $\nu + 2t \geq D$ or, equivalently, $t \geq (D-\nu)/2$.
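The moment identity lends itself to a Monte Carlo check; the following sketch (with arbitrary D, ν, t, and Σ) compares the empirical moment of $|S|/|\nu\Sigma|$ with the closed form:

```python
import numpy as np
from scipy.stats import wishart
from scipy.special import gammaln

D, nu, t = 3, 20, 1.5
rng = np.random.default_rng(1)
A = rng.standard_normal((D, D))
Sigma = A @ A.T + D * np.eye(D)

def log_moment(t, nu, D):
    # log of nu^(-D t) Z_D(nu + 2t) / Z_D(nu); the pi factor cancels in the ratio
    d = np.arange(1, D + 1)
    return (-D * t * np.log(nu) + t * D * np.log(2.0)
            + np.sum(gammaln((nu + 2 * t + 1 - d) / 2) - gammaln((nu + 1 - d) / 2)))

S = wishart(df=nu, scale=Sigma).rvs(size=50_000, random_state=rng)
ratio = np.exp(np.linalg.slogdet(S)[1] - np.linalg.slogdet(nu * Sigma)[1])
mc = np.mean(ratio ** t)
print(mc, np.exp(log_moment(t, nu, D)))  # closely matching values
```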

#### 2.1. Cumulant-Generating Function, Cumulants, and Central Moments of U

Cumulant-generating function. Let U be the variable defined in the introduction, i.e.,
$U = -\ln\frac{|S|}{|\nu\Sigma|} \qquad (4)$
and $g_U(t) = \ln \mathrm{E}\left[e^{tU}\right]$ its cumulant-generating function. $g_U(t)$ is the log of the quantity calculated in Equation (3), with t replaced by $-t$:
$g_U(t) = Dt\ln\nu + \ln Z_D(\nu-2t) - \ln Z_D(\nu)$
$\ln Z_D(\nu)$ and $\ln Z_D(\nu-2t)$ can be expressed using Equation (2), leading to
$g_U(t) = Dt\ln\frac{\nu}{2} + \sum_{d=1}^{D}\ln\Gamma\left(\frac{\nu-2t+1-d}{2}\right) - \sum_{d=1}^{D}\ln\Gamma\left(\frac{\nu+1-d}{2}\right)$
Cumulants. By construction, the nth cumulant of U is given by $\kappa_n = g_U^{(n)}(0)$. In the present case, $g_U^{(n)}(t)$ can be obtained by direct differentiation, yielding for the cumulants
$\kappa_1 = g_U'(0) = D\ln\frac{\nu}{2} - \sum_{d=1}^{D}\psi\left(\frac{\nu+1-d}{2}\right) \qquad (7)$
and
$\kappa_n = g_U^{(n)}(0) = (-1)^{n}\sum_{d=1}^{D}\psi^{(n-1)}\left(\frac{\nu+1-d}{2}\right) \qquad (8)$
for $n \geq 2$, where ψ is the digamma function, i.e., $\psi(t) = \mathrm{d}[\ln\Gamma(t)]/\mathrm{d}t$, and $\psi^{(n)}$ its nth derivative [11] (pp. 258–260). For any $n \geq 1$, $\kappa_n$ is strictly positive, is an increasing function of D and a decreasing function of ν, and tends to 0 as ν tends to infinity. For a proof of these properties, see the Appendix.
Central moments. Cumulants and central moments are related as follows: if we denote by μ, $\sigma^2$, $\gamma_1$, and $\gamma_2$ the mean, variance, skewness, and excess kurtosis of U, respectively, we have $\mu = \kappa_1$, $\sigma^2 = \kappa_2$, $\gamma_1 = \kappa_3/\kappa_2^{3/2}$, and $\gamma_2 = \kappa_4/\kappa_2^2$. Note that, by definition, μ is equal to the expression of Equation (7) and $\sigma^2$ to that of Equation (8) with $n = 2$.
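The exact mean and variance of U can be evaluated with standard special-function routines and compared with Monte Carlo draws; a sketch with arbitrary D and ν (Σ is set to the identity, since the distribution of U does not depend on it):

```python
import numpy as np
from scipy.stats import wishart
from scipy.special import psi, polygamma

D, nu = 3, 20
rng = np.random.default_rng(2)

d = np.arange(1, D + 1)
kappa1 = D * np.log(nu / 2) - np.sum(psi((nu + 1 - d) / 2))   # exact mean of U
kappa2 = np.sum(polygamma(1, (nu + 1 - d) / 2))               # exact variance of U

# Sigma = I: |nu Sigma| = nu^D, so U = -(ln|S| - D ln nu)
S = wishart(df=nu, scale=np.eye(D)).rvs(size=100_000, random_state=rng)
U = -(np.linalg.slogdet(S)[1] - D * np.log(nu))
print(kappa1, U.mean())
print(kappa2, U.var())
```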

#### 2.2. Asymptotic Expansion

When ν is large, ψ can be approximated using the following asymptotic expansion [11] (p. 260)
$\psi(z) = \ln z - \frac{1}{2z} - \frac{1}{12z^2} + O\left(\frac{1}{z^3}\right)$
where $O ( 1 / z n )$ refers to Landau notation and stands for any function $f ( z )$ for which there exists $z 0$ so that $z n f ( z )$ is bounded for $z ≥ z 0$. This leads to
$\begin{aligned} \psi\left(\frac{\nu+1-d}{2}\right) &= \ln\frac{\nu+1-d}{2} - \frac{1}{\nu+1-d} - \frac{1}{3(\nu+1-d)^2} + O\left(\frac{1}{\nu^3}\right)\\ &= \ln\frac{\nu}{2} + \ln\left(1+\frac{1-d}{\nu}\right) - \frac{1}{\nu}\left(1+\frac{1-d}{\nu}\right)^{-1} - \frac{1}{3\nu^2}\left(1+\frac{1-d}{\nu}\right)^{-2} + O\left(\frac{1}{\nu^3}\right)\\ &= \ln\frac{\nu}{2} + \frac{1-d}{\nu} - \frac{1}{2}\left(\frac{1-d}{\nu}\right)^2 - \frac{1}{\nu}\left(1-\frac{1-d}{\nu}\right) - \frac{1}{3\nu^2} + O\left(\frac{1}{\nu^3}\right)\\ &= \ln\frac{\nu}{2} - \frac{d}{\nu} + \frac{1-3d^2}{6\nu^2} + O\left(\frac{1}{\nu^3}\right) \end{aligned}$
Incorporating this expansion in Equation (7) yields for the first cumulant $κ 1$ or, equivalently, the mean μ
$\kappa_1 = \mu = \frac{D(D+1)}{2\nu} + \frac{2D^3+3D^2-D}{12\nu^2} + O\left(\frac{1}{\nu^3}\right)$
For the cumulants and central moments of order 2 and up, we use the following approximation of $ψ ( n )$ [11] (p. 260)
$\psi^{(n)}(z) = (-1)^{n-1}\left[\frac{(n-1)!}{z^n} + \frac{n!}{2z^{n+1}}\right] + O\left(\frac{1}{z^{n+2}}\right)$
Each term in the sum of Equation (8) can therefore be approximated as
$\begin{aligned} \psi^{(n-1)}\left(\frac{\nu+1-d}{2}\right) &= (-1)^{n-2}\left[\frac{2^{n-1}(n-2)!}{\nu^{n-1}\left(1+\frac{1-d}{\nu}\right)^{n-1}} + \frac{2^{n-1}(n-1)!}{\nu^{n}\left(1+\frac{1-d}{\nu}\right)^{n}}\right] + O\left(\frac{1}{\nu^{n+1}}\right)\\ &= (-1)^{n-2}\left[\frac{2^{n-1}(n-2)!}{\nu^{n-1}}\left(1-\frac{(n-1)(1-d)}{\nu}\right) + \frac{2^{n-1}(n-1)!}{\nu^{n}}\right] + O\left(\frac{1}{\nu^{n+1}}\right)\\ &= (-1)^{n-2}\left[\frac{2^{n-1}(n-2)!}{\nu^{n-1}} + \frac{2^{n-1}(n-1)!\,d}{\nu^{n}}\right] + O\left(\frac{1}{\nu^{n+1}}\right) \end{aligned}$
leading to an approximation of $κ n = g U ( n ) ( 0 )$ of the form
$\kappa_n = \frac{2^{n-1}D(n-2)!}{\nu^{n-1}} + \frac{2^{n-1}D(D+1)(n-1)!}{2\nu^{n}} + O\left(\frac{1}{\nu^{n+1}}\right)$
Taking n equal to 2, 3, and 4 respectively yields for the cumulants of order 2, 3, and 4
$\kappa_2 = \frac{2D}{\nu} + \frac{D(D+1)}{\nu^2} + O\left(\frac{1}{\nu^3}\right) \qquad (12)$
$\kappa_3 = \frac{4D}{\nu^2} + \frac{4D(D+1)}{\nu^3} + O\left(\frac{1}{\nu^4}\right) \qquad (13)$
$\kappa_4 = \frac{16D}{\nu^3} + \frac{24D(D+1)}{\nu^4} + O\left(\frac{1}{\nu^5}\right) \qquad (14)$
We can now provide asymptotic approximations for the corresponding central moments. The variance $\sigma^2 = \kappa_2$ is given by Equation (12). An approximation for the skewness $\gamma_1 = \kappa_3/\kappa_2^{3/2}$ can be obtained from Equations (12) and (13) as
$\gamma_1 = \frac{4D}{\nu^2}\left[1+\frac{D+1}{\nu}+O\left(\frac{1}{\nu^2}\right)\right]\left(\frac{2D}{\nu}\right)^{-\frac{3}{2}}\left[1+\frac{D+1}{2\nu}+O\left(\frac{1}{\nu^2}\right)\right]^{-\frac{3}{2}} = \sqrt{\frac{2}{D\nu}}\left[1+\frac{D+1}{4\nu}+O\left(\frac{1}{\nu^2}\right)\right]$
$\gamma_1$ being asymptotically positive, the distribution is skewed to the right. Finally, the approximation for $\gamma_2 = \kappa_4/\kappa_2^2$ can be expressed as
$\gamma_2 = \frac{16D}{\nu^3}\left[1+\frac{3(D+1)}{2\nu}+O\left(\frac{1}{\nu^2}\right)\right]\left(\frac{2D}{\nu}\right)^{-2}\left[1+\frac{D+1}{2\nu}+O\left(\frac{1}{\nu^2}\right)\right]^{-2} = \frac{4}{D\nu}\left[1+\frac{D+1}{2\nu}\right]+O\left(\frac{1}{\nu^3}\right)$
which is asymptotically positive, corresponding to a leptokurtic distribution.
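The quality of these expansions can be illustrated numerically; the following sketch (arbitrary D) compares the exact cumulants $\kappa_1$ and $\kappa_2$ with their second-order expansions as ν grows:

```python
import numpy as np
from scipy.special import psi, polygamma

D = 5
d = np.arange(1, D + 1)
for nu in [10, 50, 250]:
    k1_exact = D * np.log(nu / 2) - np.sum(psi((nu + 1 - d) / 2))
    k1_approx = D * (D + 1) / (2 * nu) + (2 * D**3 + 3 * D**2 - D) / (12 * nu**2)
    k2_exact = np.sum(polygamma(1, (nu + 1 - d) / 2))
    k2_approx = 2 * D / nu + D * (D + 1) / nu**2
    print(nu, abs(k1_exact - k1_approx), abs(k2_exact - k2_approx))
# Both errors are O(1/nu^3): they shrink roughly a thousandfold per tenfold nu
```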

#### 2.3. Asymptotic Distribution of U

We now use the previous results to prove that U is asymptotically normally distributed with mean $D(D+1)/(2\nu)$ and variance $2D/\nu$. To this aim, set
$V_\nu = \frac{U - \frac{a}{\nu}}{\sqrt{\frac{b}{\nu}}}$
with $a = D ( D + 1 ) / 2$ and $b = 2 D$. The logarithm of the characteristic function of $V ν$ reads
$\ln\phi_{V_\nu}(t) = \ln \mathrm{E}\left[\exp\left(it\,\frac{U-\frac{a}{\nu}}{\sqrt{\frac{b}{\nu}}}\right)\right] = -\frac{ita}{\sqrt{b\nu}} + \ln \mathrm{E}\left[\exp\left(it\sqrt{\frac{\nu}{b}}\,U\right)\right] = -\frac{ita}{\sqrt{b\nu}} + \ln\phi_U\left(t\sqrt{\frac{\nu}{b}}\right) = \ln\phi_U\left(t\sqrt{\frac{\nu}{b}}\right) + O\left(\frac{1}{\sqrt{\nu}}\right)$
where $\phi_U(t)$ is the characteristic function of U. We proved Equation (3) as an identity for real t; the expression remains valid, however, in the range where $Z_D$ is analytic, which allows analytic continuation to complex arguments. We can thus obtain an expression for $\phi_U\left(t\sqrt{\nu/b}\right)$ by replacing t with $-it\sqrt{\nu/b}$ in Equation (3), leading to
$\begin{aligned} \ln\phi_U\left(t\sqrt{\frac{\nu}{b}}\right) &= \ln\frac{Z_D\left(\nu-2it\sqrt{\frac{\nu}{b}}\right)}{Z_D(\nu)} + itD\sqrt{\frac{\nu}{b}}\ln\nu\\ &= \ln\frac{2^{\left(\nu-2it\sqrt{\frac{\nu}{b}}\right)\frac{D}{2}}\,\pi^{\frac{D(D-1)}{4}}\prod_{d=1}^{D}\Gamma\left(\frac{\nu-2it\sqrt{\frac{\nu}{b}}+1-d}{2}\right)}{2^{\frac{\nu D}{2}}\,\pi^{\frac{D(D-1)}{4}}\prod_{d=1}^{D}\Gamma\left(\frac{\nu+1-d}{2}\right)} + itD\sqrt{\frac{\nu}{b}}\ln\nu\\ &= itD\sqrt{\frac{\nu}{b}}\ln\frac{\nu}{2} + \sum_{d=1}^{D}\ln\frac{\Gamma\left(\frac{\nu-2it\sqrt{\frac{\nu}{b}}+1-d}{2}\right)}{\Gamma\left(\frac{\nu+1-d}{2}\right)} \qquad (16) \end{aligned}$
We then use Stirling’s approximation [11] (p. 257)
$\ln\Gamma(z) = \left(z-\frac{1}{2}\right)\ln z - z + \frac{1}{2}\ln 2\pi + O\left(\frac{1}{z}\right)$
to approximate each term of the sum in the second term of the right-hand side of Equation (16) when ν is large, yielding
$\begin{aligned} \ln\frac{\Gamma\left(\frac{\nu-2it\sqrt{\nu/b}+1-d}{2}\right)}{\Gamma\left(\frac{\nu+1-d}{2}\right)} &= \frac{\nu-2it\sqrt{\nu/b}-d}{2}\ln\frac{\nu-2it\sqrt{\nu/b}+1-d}{2} - \frac{\nu-2it\sqrt{\nu/b}+1-d}{2}\\ &\quad - \frac{\nu-d}{2}\ln\frac{\nu+1-d}{2} + \frac{\nu+1-d}{2} + O\left(\frac{1}{\nu}\right)\\ &= \frac{\nu-2it\sqrt{\nu/b}-d}{2}\left[\ln\frac{\nu}{2} + \ln\left(1-\frac{2it}{\sqrt{b\nu}}+\frac{1-d}{\nu}\right)\right] + it\sqrt{\frac{\nu}{b}}\\ &\quad - \frac{\nu-d}{2}\left[\ln\frac{\nu}{2} + \ln\left(1+\frac{1-d}{\nu}\right)\right] + O\left(\frac{1}{\nu}\right)\\ &= -it\sqrt{\frac{\nu}{b}}\ln\frac{\nu}{2} + it\sqrt{\frac{\nu}{b}} + \frac{\nu-2it\sqrt{\nu/b}-d}{2}\left[-\frac{2it}{\sqrt{b\nu}}+\frac{1-d}{\nu}+\frac{2t^2}{b\nu}+O\left(\frac{1}{\nu^{3/2}}\right)\right]\\ &\quad - \frac{\nu-d}{2}\left[\frac{1-d}{\nu}+O\left(\frac{1}{\nu^{3/2}}\right)\right] + O\left(\frac{1}{\nu}\right)\\ &= -it\sqrt{\frac{\nu}{b}}\ln\frac{\nu}{2} - \frac{t^2}{b} + O\left(\frac{1}{\sqrt{\nu}}\right) \end{aligned}$
We consequently have for the characteristic function of $V_\nu$
$\ln\phi_{V_\nu}(t) = \ln\phi_U\left(t\sqrt{\frac{\nu}{b}}\right) + O\left(\frac{1}{\sqrt{\nu}}\right) = -\frac{Dt^2}{b} + O\left(\frac{1}{\sqrt{\nu}}\right) = -\frac{t^2}{2} + O\left(\frac{1}{\sqrt{\nu}}\right)$
As ν tends towards infinity, $\phi_{V_\nu}(t)$ converges pointwise to $e^{-t^2/2}$, which is continuous at $t = 0$. According to Lévy's continuity theorem, $V_\nu$ therefore converges in distribution to the standard normal distribution,
$V_\nu = \frac{U - \frac{D(D+1)}{2\nu}}{\sqrt{\frac{2D}{\nu}}}\ \underset{\nu\to\infty}{\overset{\mathcal{D}}{\longrightarrow}}\ N(0,1)$
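A quick simulation illustrates the normal approximation; the values of D, ν, and the number of draws below are arbitrary:

```python
import numpy as np
from scipy.stats import wishart, kstest

D, nu = 3, 500
rng = np.random.default_rng(3)
S = wishart(df=nu, scale=np.eye(D)).rvs(size=20_000, random_state=rng)
U = -(np.linalg.slogdet(S)[1] - D * np.log(nu))
V = (U - D * (D + 1) / (2 * nu)) / np.sqrt(2 * D / nu)  # standardized as above
print(V.mean(), V.std())         # close to 0 and 1
print(kstest(V, "norm").pvalue)  # Kolmogorov-Smirnov test against N(0,1)
```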

## 3. Application to Differential Entropy

We can use the results of the previous section to obtain the exact and asymptotic cumulants of the sample and posterior entropy when the data are multivariate normal.

#### 3.1. Sampling Distribution

The differential entropy $H(X)$ of a D-dimensional random variable X that is normally distributed with (known) mean μ and (unknown) covariance matrix Σ is given by Equation (1). Let $(x_n)_{n=1,\dots,N}$ be N independent and identically distributed realizations of X. Set S the sum of squares, i.e.,
$S = \sum_{n=1}^{N} (x_n-\mu)(x_n-\mu)' \qquad (17)$
S follows a Wishart distribution with $\nu = N$ degrees of freedom and scale matrix Σ [12] (Th. 7.2.2). Define the sample differential entropy corresponding to the N realizations as $h(S/N)$. Using the fact that $|S/N|/|\Sigma| = |S|/|N\Sigma|$, we obtain that $h(S/N) - h(\Sigma) = -U/2$, where U was defined in Equation (4). The mean and variance of $h(S/N) = h(\Sigma) - U/2$ can therefore be expressed as functions of the corresponding central moments of U, i.e., $\mu = \kappa_1$ [Equations (7) and (9)] and $\sigma^2 = \kappa_2$ [Equations (8) and (12)], leading to the following closed form expressions and approximations
$\begin{aligned} \mathrm{E}[h(S/N)|N,\Sigma] &= h(\Sigma) - \frac{D}{2}\ln\frac{N}{2} + \frac{1}{2}\sum_{d=1}^{D}\psi\left(\frac{N+1-d}{2}\right) \qquad &(18)\\ &= h(\Sigma) - \frac{D(D+1)}{4N} - \frac{2D^3+3D^2-D}{24N^2} + O\left(\frac{1}{N^3}\right) \qquad &(19) \end{aligned}$
and
$\begin{aligned} \mathrm{Var}[h(S/N)|N,\Sigma] &= \frac{1}{4}\sum_{d=1}^{D}\psi'\left(\frac{N+1-d}{2}\right) \qquad &(20)\\ &= \frac{D}{2N} + \frac{D(D+1)}{4N^2} + O\left(\frac{1}{N^3}\right) \qquad &(21) \end{aligned}$
Furthermore, use of Section 2.3 shows that, given N and Σ, $h(S/N) - h(\Sigma)$ is asymptotically normally distributed with mean $-D(D+1)/(4N)$ and variance $D/(2N)$. If μ is unknown, we replace μ by the sample mean m in Equation (17). S is then still Wishart distributed with scale matrix Σ but $\nu = N-1$ degrees of freedom [12] (Cor. 7.2.2). The exact expectation and variance of $h[S/(N-1)]$ are therefore given by Equations (18) and (20), respectively, where N is replaced by $N-1$. Performing an asymptotic expansion of these expressions leads to
$\mathrm{E}\left[h\left(S/(N-1)\right)\,\middle|\,N,\Sigma\right] = h(\Sigma) - \frac{D(D+1)}{4N} - \frac{2D^3+9D^2+5D}{24N^2} + O\left(\frac{1}{N^3}\right)$
and
$\mathrm{Var}\left[h\left(S/(N-1)\right)\,\middle|\,N,\Sigma\right] = \frac{D}{2N} + \frac{D(D+3)}{4N^2} + O\left(\frac{1}{N^3}\right)$
Furthermore, since the first-order approximation is the same for $h[S/(N-1)]$ as for $h(S/N)$, both quantities have the same asymptotic distribution.
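The negative bias of the plug-in estimator can be observed directly by simulation; a sketch for the known-mean case (all numerical values are illustrative):

```python
import numpy as np

D, N, R = 4, 50, 20_000
rng = np.random.default_rng(4)
A = rng.standard_normal((D, D))
Sigma = A @ A.T + D * np.eye(D)
L = np.linalg.cholesky(Sigma)

def h(C):
    # differential entropy formula, applied elementwise to a stack of matrices
    return 0.5 * C.shape[-1] * (1 + np.log(2 * np.pi)) + 0.5 * np.linalg.slogdet(C)[1]

X = L @ rng.standard_normal((R, D, N))   # R samples of N draws from N(0, Sigma)
S = X @ X.transpose(0, 2, 1)             # sums of squares around the known mean 0
bias_mc = np.mean(h(S / N)) - h(Sigma)
print(bias_mc, -D * (D + 1) / (4 * N))   # Monte Carlo bias vs. first-order term
```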

#### 3.2. Posterior Distribution

With the same assumptions as above, and assuming a non-informative Jeffreys prior for Σ, i.e.,
$P(\Sigma) \propto |\Sigma|^{-\frac{D+1}{2}}$
the posterior distribution for Σ given the N realizations of X is inverse Wishart with $n = N-1$ degrees of freedom and scale matrix $S^{-1}$ [13]. This implies that $\Upsilon = \Sigma^{-1}$, the concentration matrix, is Wishart distributed with n degrees of freedom and scale matrix $S^{-1}$. The results of Section 3.1 therefore apply to $h(\Upsilon/n) - h(S^{-1})$. But, since for any matrix A, $\ln|A^{-1}| = \ln|A|^{-1} = -\ln|A|$, we have that $h(\Upsilon/n) - h(S^{-1})$ is equal to $h(S) - h(n\Sigma)$ or, equivalently, to $h(S/n) - h(\Sigma)$. As a consequence,
$\begin{aligned} \mathrm{E}[h(\Sigma)|N,S] &= h(S/n) + \frac{D}{2}\ln\frac{n}{2} - \frac{1}{2}\sum_{d=1}^{D}\psi\left(\frac{N-d}{2}\right) \qquad &(22)\\ &= h(S/n) + \frac{D(D+1)}{4N} + \frac{2D^3+9D^2+5D}{24N^2} + O\left(\frac{1}{N^3}\right) \qquad &(23) \end{aligned}$
and
$\mathrm{Var}[h(\Sigma)|N,S] = \frac{1}{4}\sum_{d=1}^{D}\psi'\left(\frac{N-d}{2}\right) = \frac{D}{2N} + \frac{D(D+3)}{4N^2} + O\left(\frac{1}{N^3}\right)$
Also, given S, $h(\Sigma)$ is asymptotically normally distributed with mean $h(S/n) + D(D+1)/(4N)$ and variance $D/(2N)$.
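Equations (22) and (23) suggest a simple bias-corrected point estimate of entropy, namely the posterior mean; a sketch of this estimator for the unknown-mean case (the helper function name and all numerical values are illustrative):

```python
import numpy as np
from scipy.special import psi

def posterior_mean_entropy(S, N):
    """Posterior mean of h(Sigma) under the Jeffreys prior, with n = N - 1."""
    D = S.shape[0]
    n = N - 1
    d = np.arange(1, D + 1)
    h_plugin = 0.5 * D * (1 + np.log(2 * np.pi)) + 0.5 * np.linalg.slogdet(S / n)[1]
    return h_plugin + 0.5 * D * np.log(n / 2) - 0.5 * np.sum(psi((N - d) / 2))

# Check that the plug-in bias is removed (D, N, R, and Sigma = I are arbitrary)
D, N, R = 3, 30, 5_000
rng = np.random.default_rng(5)
h_true = 0.5 * D * (1 + np.log(2 * np.pi))   # h(I), since |I| = 1
est = np.empty(R)
for r in range(R):
    x = rng.standard_normal((N, D))          # N draws from N(0, I)
    xc = x - x.mean(axis=0)
    S = xc.T @ xc                            # sum of squares around the sample mean
    est[r] = posterior_mean_entropy(S, N)
print(est.mean() - h_true)                   # close to 0
```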

## 4. Application to Mutual Information and Multiinformation

Similar results can also be derived for the first cumulant of mutual information and multiinformation, its generalization to more than two variables. The mutual information between two sets of variables $X_1$ (of dimension $D_1$) and $X_2$ (of dimension $D_2$) is defined as
$I(X_1,X_2) = H(X_1) + H(X_2) - H(X_1,X_2) \qquad (24)$
For multivariate normal variables, we have
$I(X_1,X_2) = i(\Sigma) = h(\Sigma_1) + h(\Sigma_2) - h(\Sigma)$
where $\Sigma_1$ and $\Sigma_2$ are the two diagonal blocks of Σ and where h was defined in Equation (1).

#### 4.1. Sampling Mean

Define the sample mutual information as $i ( S / N )$. Using Equation (24), direct calculation shows that we have
$\mathrm{E}[i(S/N)|N,\Sigma] = \mathrm{E}[h(S_1/N)|N,\Sigma] + \mathrm{E}[h(S_2/N)|N,\Sigma] - \mathrm{E}[h(S/N)|N,\Sigma]$
An asymptotic approximation for $E [ h ( S / N ) | N , Σ ]$ can be obtained by direct use of Equation (19). For $S 1$ and $S 2$, we proceed as follows. If S is Wishart distributed with N degrees of freedom and scale matrix Σ, then $S j$ ($j ∈ { 1 , 2 }$) is also Wishart distributed with N degrees of freedom and scale matrix $Σ j$ [12] (Th. 7.3.4). Equation (19) can therefore be applied to matrix $S j$ with the proper scale matrix, yielding
$\mathrm{E}[h(S_j/N)|N,\Sigma] = \mathrm{E}[h(S_j/N)|N,\Sigma_j] = h(\Sigma_j) - \frac{D_j(D_j+1)}{4N} - \frac{2D_j^3+3D_j^2-D_j}{24N^2} + O\left(\frac{1}{N^3}\right)$
$E [ i ( S / N ) | N , Σ ]$ consequently reads
$\mathrm{E}[i(S/N)|N,\Sigma] = i(\Sigma) + \frac{D_1D_2}{2N}\left[1 + \frac{D_1+D_2+1}{2N}\right] + O\left(\frac{1}{N^3}\right)$
A similar result can be obtained for the generalization of i to K sets of variables $X k$ (of size $D k$) as a measure called total correlation [14], multivariate constraint [15], δ [16], or multiinformation [17]. In that case, we have
$\mathrm{E}[i(S/N)|N,\Sigma] = i(\Sigma) + \sum_{i<j}\frac{D_iD_j}{2N} + \frac{1}{4N^2}\left[\sum_{i\neq j}D_i^2D_j + 2\sum_{i<j<k}D_iD_jD_k + \sum_{i<j}D_iD_j\right] + O\left(\frac{1}{N^3}\right)$
and, in the particular case where each $X k$ is one-dimensional (i.e., $D k = 1$),
$\mathrm{E}[i(S/N)|N,\Sigma] = i(\Sigma) + \frac{D(D-1)}{4N} + \frac{2D^3+3D^2-5D}{24N^2} + O\left(\frac{1}{N^3}\right)$
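The first-order positive bias $D_1D_2/(2N)$ can be checked by simulation under independence (Σ block diagonal, so that $i(\Sigma) = 0$); all values below are arbitrary:

```python
import numpy as np

D1, D2, N, R = 2, 3, 40, 20_000
D = D1 + D2
rng = np.random.default_rng(6)
ld = lambda M: np.linalg.slogdet(M)[1]

# Sigma = I is block diagonal, so the true mutual information i(Sigma) is 0;
# the entropy constants cancel in i, leaving only log-determinants
X = rng.standard_normal((R, D, N))
S = X @ X.transpose(0, 2, 1)
i_sample = 0.5 * (ld(S[:, :D1, :D1] / N) + ld(S[:, D1:, D1:] / N) - ld(S / N))
print(i_sample.mean(), D1 * D2 / (2 * N))  # positive bias vs. first-order term
```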

#### 4.2. Posterior Mean

A similar argument can be applied to the Bayesian posterior mean of i. Using Equation (24) again, we have
$\mathrm{E}[i(\Sigma)|N,S] = \mathrm{E}[h(\Sigma_1)|N,S] + \mathrm{E}[h(\Sigma_2)|N,S] - \mathrm{E}[h(\Sigma)|N,S]$
An asymptotic approximation for $E [ h ( Σ ) | N , S ]$ can be obtained by direct use of Equation (23). Now, if Σ is inverse Wishart distributed with n degrees of freedom and scale matrix S, then $Σ j$ ($j ∈ { 1 , 2 }$) is also inverse Wishart distributed with $n - D k$ ($k ∈ { 1 , 2 }$, $k ≠ j$) degrees of freedom and scale matrix $S j$ [18]. Application of Equation (23) with the proper degrees of freedom and scale matrix leads to
$\begin{aligned} \mathrm{E}[h(\Sigma_j)|N,S] &= h\left[S_j/(n-D_k)\right] + \frac{D_j(D_j+1)}{4(N-D_k)} + O\left(\frac{1}{N^2}\right)\\ &= h(S_j/n) - \frac{D_j}{2}\ln\left(1-\frac{D_k}{N}\right) + \frac{D_j(D_j+1)}{4N} + O\left(\frac{1}{N^2}\right)\\ &= h(S_j/n) + \frac{D_1D_2}{2N} + \frac{D_j(D_j+1)}{4N} + O\left(\frac{1}{N^2}\right) \end{aligned}$
where we only retained the expansion terms of order up to $1 / N$ for the sake of simplicity. $E [ i ( Σ ) | N , S ]$ consequently reads
$\mathrm{E}[i(\Sigma)|N,S] = i(S/n) + \frac{D_1D_2}{2N} + O\left(\frac{1}{N^2}\right)$
For posterior multiinformation, we have
$\mathrm{E}[i(\Sigma)|N,S] = i(S/n) + \sum_{i<j}\frac{D_iD_j}{2N} + O\left(\frac{1}{N^2}\right)$
and, in the particular case where each $X k$ is one-dimensional (i.e., $D k = 1$),
$\mathrm{E}[i(\Sigma)|N,S] = i(S/n) + \frac{D(D-1)}{4N} + O\left(\frac{1}{N^2}\right)$

## 5. Simulation Study

We conducted the following computations for $D \in \{2, 5, 10\}$. To assess the accuracy of the asymptotic expansion of the cumulants of sample entropy, we calculated the error made by the first and second central moments (i.e., the mean and variance of the distribution) compared to the exact values as a function of ν. For comparison, we computed the same quantities for 500 different homogeneous positive definite matrices Σ (i.e., with all off-diagonal elements equal to the same value ρ, generated uniformly); for each value of Σ and ν, we generated 1,000 samples from $S \sim \mathrm{Wishart}(\nu, \Sigma)$, computed the corresponding values of sample entropy, and approximated the moments by the corresponding sampling moments. The results are reported in Figure 1.
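The error computation for the approximations of the mean (circles and squares in Figure 1) can be sketched as follows, for one arbitrary value of D:

```python
import numpy as np
from scipy.special import psi

D = 5
d = np.arange(1, D + 1)
for nu in [10, 20, 40, 80]:
    # exact E[h(S/nu)] - h(Sigma) = -kappa_1 / 2 (known-mean case, nu = N)
    exact = -0.5 * (D * np.log(nu / 2) - np.sum(psi((nu + 1 - d) / 2)))
    first = -D * (D + 1) / (4 * nu)
    second = first - (2 * D**3 + 3 * D**2 - D) / (24 * nu**2)
    print(nu, abs(first - exact), abs(second - exact))
# The second-order error decays faster, as in Figure 1 (O(1/nu^3) vs. O(1/nu^2))
```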

## 6. Discussion

In this work, we calculated both the moments of $| S | / | ν Σ |$ and the cumulant-generating function of $U = - ln ( | S | / | ν Σ | )$ when S is Wishart distributed with ν degrees of freedom and scale matrix Σ. From there, we provided an asymptotic approximation of the first four central moments of U. We also proved that U is asymptotically normally distributed. We then demonstrated the quality of the normal approximation compared to simulations. We finally applied these results to the multivariate normal distribution to provide asymptotic approximations of the sample and posterior distributions of differential entropy, as well as an asymptotic approximation of the sample and posterior mean of multiinformation.
Interestingly, the moments of $|S|/|\nu\Sigma|$ and, as a consequence, the cumulant-generating function of U depend on the distribution of S only through the matrix dimension D and the degrees of freedom ν, not through Σ. This means that the exact distribution of U is also independent of that parameter and could be tabulated as a function of the two integer parameters.
As mentioned in the introduction, the sample differential entropy defined from Equation (1) is equal to the plug-in estimator for differential entropy. The present work quantifies, in the case of multivariate normal samples, the well-known negative bias of this estimator [7]. Equation (18) confirms that, to correct for this bias, one can use the uniformly minimum variance unbiased (UMVU) estimator [1].
The posterior derivation that we presented here is a particular case of the Bayesian posterior estimate obtained by [3] with, in our case, the prior distribution for Σ taken as Jeffreys prior (i.e., $q = -1$ and $B = 0$ in their notation). While the same analysis as in [3] could have been performed, it would essentially lead to the same result, since we only consider the asymptotic case, where the sample is large and the prior distribution is supposed to have very little influence, provided that it does not contradict the data. The present study also shows an interesting feature of Bayesian estimation with respect to the above-mentioned negative bias. As the sample differential entropy tends to underestimate $H(\Sigma)$ by $\mu/2$, taking the posterior mean as the Bayesian estimate of $H(\Sigma)$ corrects the negative bias by the opposite amount.
Figure 1. Error on the mean (top row) and variance (bottom row) of sample entropy for various values of D and ν when using the first-order approximation (circles), the second-order approximation (squares), or the sampling scheme (diamonds). The error was calculated as the absolute value of the difference between the approximation and the true value. For the sampling scheme are represented the median as well as the symmetrical 90% probability interval of the error. Scale on y axis is logarithmic.
We were also able to obtain an asymptotic approximation of the sampling and posterior expectations of mutual information and multiinformation. Contrary to the general argument developed by [7], we proved that, for multivariate normal distributions, the negative bias for differential entropy does entail a positive bias for mutual information. This result is in agreement with the fact that, under the null hypothesis of a block diagonal Σ, corresponding to $i(\Sigma) = 0$, $2\nu\,i(S/\nu)$ is asymptotically chi-square distributed with $\sum_{i<j}D_iD_j$ degrees of freedom and, hence, $\nu\,i(S/\nu)$ has an expectation equal to $\sum_{i<j}D_iD_j/2$ [19] (pp. 306–307). Surprisingly, and unlike what was said for entropy, the positive bias of the sample multiinformation is not corrected by the Bayesian approach. A naive correction subtracting the positive bias could lead to negative values, which is impossible by construction of multiinformation. Note that, using the present results alone, we were not able to obtain an asymptotic approximation for the variance of the same measures.
In the present paper, we used loose versions of the inequalities proposed in [20] to prove the monotonicity and sign of the cumulants of U (see Section 2.1 and Appendix). Note that, using the same inequalities, it seems that it would also be possible to obtain lower and upper bounds for these quantities, instead of asymptotic approximations. These bounds would be useful complements to the approximations provided in the present manuscript.

## Acknowledgements

The authors are grateful to Pierre Bellec for helpful discussions.

## References

1. Ahmed, N.A.; Gokhale, D.V. Entropy expressions and their estimators for multivariate distributions. IEEE Trans. Inform. Theory 1989, 35, 688–692. [Google Scholar] [CrossRef]
2. Misra, N.; Singh, H.; Demchuk, E. Estimation of the entropy of a multivariate normal distribution. J. Multivariate Anal. 2005, 92, 324–342. [Google Scholar] [CrossRef]
3. Gupta, M.; Srivastava, S. Parametric Bayesian estimation of differential entropy and relative entropy. Entropy 2010, 12, 818–843. [Google Scholar] [CrossRef]
4. Beirlant, J.; Dudewicz, E.J.; Györfi, L.; van der Meulen, E.C. Nonparametric entropy estimation: An overview. Int. J. Math. Statist. Sci. 1997, 6, 17–39. [Google Scholar]
5. Strong, S.P.; Koberle, R.; de Ruyter van Steveninck, R.R.; Bialek, W. Entropy and information in neural spike trains. Phys. Rev. Lett. 1998, 80, 197–200. [Google Scholar] [CrossRef]
6. Antos, A.; Kontoyiannis, I. Convergence properties of functional estimates for discrete distributions. Random Struct. Algor. 2001, 19, 163–193. [Google Scholar] [CrossRef]
7. Paninski, L. Estimation of entropy and mutual information. Neural Comput. 2003, 15, 1191–1253. [Google Scholar] [CrossRef]
8. Wolpert, D.H.; Wolf, D.R. Estimating functions of probability distributions from a finite set of samples. Phys. Rev. E 1995, 52, 6841–6854. [Google Scholar] [CrossRef]
9. Wolpert, D.H.; Wolf, D.R. Erratum: Estimating functions of probability distributions from a finite set of samples. Phys. Rev. E 1996, 54, 6973. [Google Scholar] [CrossRef]
10. Anderson, T.W. An Introduction to Multivariate Statistical Analysis; John Wiley and Sons: New York, NY, USA, 1958. [Google Scholar]
11. Abramowitz, M.; Stegun, I.A. Handbook of Mathematical Functions; Applied Mathematics Series 55; National Bureau of Standards: Washington, DC, USA, 1972. [Google Scholar]
12. Anderson, T.W. An Introduction to Multivariate Statistical Analysis, 3rd ed.; Series in Probability and Mathematical Statistics; John Wiley and Sons: New York, NY, USA, 2003. [Google Scholar]
13. Gelman, A.; Carlin, J.B.; Stern, H.S.; Rubin, D.B. Bayesian Data Analysis; Texts in Statistical Science; Chapman & Hall: London, UK, 1998. [Google Scholar]
14. Watanabe, S. Information theoretical analysis of multivariate correlation. IBM J. Res. Dev. 1960, 4, 66–82. [Google Scholar] [CrossRef]
15. Garner, W.R. Uncertainty and Structure as Psychological Concepts; John Wiley & Sons: New York, NY, USA, 1962. [Google Scholar]
16. Joe, H. Relative entropy measures of multivariate dependence. J. Am. Statist. Assoc. 1989, 84, 157–164. [Google Scholar] [CrossRef]
17. Studený, M.; Vejnarová, J. The multiinformation function as a tool for measuring stochastic dependence. In Proceedings of the NATO Advanced Study Institute on Learning in Graphical Models; Jordan, M.I., Ed.; MIT Press: Cambridge, MA, USA, 1998; pp. 261–298. [Google Scholar]
18. Press, S.J. Applied Multivariate Analysis. Using Bayesian and Frequentist Methods of Inference, 2nd ed.; Dover: Mineola, NY, USA, 2005. [Google Scholar]
19. Kullback, S. Information Theory and Statistics; Dover: Mineola, NY, USA, 1968. [Google Scholar]
20. Chen, C.P. Inequalities for the polygamma functions with application. Gener. Math. 2005, 13, 65–72. [Google Scholar]

## Appendix

Results Regarding the Cumulants
The proofs differ for $κ 1$ and $κ n$, $n ≥ 2$.

#### 1. Results for $κ 1$

For $\nu \geq D > 0$, set $f_D(\nu) = \kappa_1$ as defined in Equation (7).
Result 1: $f_D(\nu)$ is a decreasing function of ν. Differentiating $f_D(\nu)$ with respect to ν leads to
$f_D'(\nu) = \sum_{d=1}^{D}\left[\frac{1}{\nu} - \frac{1}{2}\psi'\left(\frac{\nu+1-d}{2}\right)\right] \qquad (25)$
We use the following inequality [20]
$\psi'(x) > \frac{1}{x} + \frac{1}{2x^2}$
This implies that
$\frac{1}{\nu} - \frac{1}{2}\psi'\left(\frac{\nu+1-d}{2}\right) < \frac{1}{\nu} - \frac{1}{\nu+1-d} - \frac{1}{(\nu+1-d)^2}$
For $1 ≤ d ≤ ν$, we have $1 / ν ≤ 1 / ( ν + 1 - d )$. Consequently, each term in the sum of Equation (25) is strictly negative, and so is $f D ′ ( ν )$. $f D ( ν )$ is therefore a strictly decreasing function of ν.
Result 2: $f_D(\nu)$ is an increasing function of D. We have
$f_{D+1}(\nu) = f_D(\nu) + \ln\frac{\nu}{2} - \psi\left(\frac{\nu-D}{2}\right)$
Using the following inequality [20]
$\psi(u) < \ln u - \frac{1}{2u} < \ln u$
we obtain, for any $1 \leq d \leq D+1$,
$\psi\left(\frac{\nu+1-d}{2}\right) < \ln\frac{\nu+1-d}{2}$
$\ln\frac{\nu}{2} - \psi\left(\frac{\nu+1-d}{2}\right) > -\ln\left(1+\frac{1-d}{\nu}\right)$
Since $\ln(1+x) \leq x$, we have
$\ln\frac{\nu}{2} - \psi\left(\frac{\nu+1-d}{2}\right) > \frac{d-1}{\nu}$
Taking $d = D+1$ yields $f_{D+1}(\nu) > f_D(\nu) + D/\nu$ and, therefore, $f_{D+1}(\nu) > f_D(\nu)$.
Result 3: $f_D(\nu)$ is positive. $f_D(\nu)$ is a sum of terms that are each strictly positive (cf. the previous paragraph); it is thus strictly positive.
Result 4: $f D ( ν )$ tends to infinity as D increases. From the proof of Result 2, we have
$f_D(\nu) > \sum_{d=1}^{D}\frac{d-1}{\nu} = \frac{D(D-1)}{2\nu}$
which tends to infinity when D tends to infinity.
Result 5: $f_D(\nu)$ tends to 0 as ν increases. We use the following inequality [20]
$\psi(x) > \ln x - \frac{1}{2x} - \frac{1}{12x^2}$
This implies that
$\psi\left(\frac{\nu+1-d}{2}\right) > \ln\frac{\nu+1-d}{2} - \frac{1}{\nu+1-d} - \frac{1}{3(\nu+1-d)^2}$
$\ln\frac{\nu}{2} - \psi\left(\frac{\nu+1-d}{2}\right) < -\ln\left(1+\frac{1-d}{\nu}\right) + \frac{1}{\nu+1-d} + \frac{1}{3(\nu+1-d)^2}$
Since $-\ln\left(1+\frac{1-d}{\nu}\right) = \ln\left(1+\frac{d-1}{\nu+1-d}\right) < \frac{d-1}{\nu+1-d}$, we have
$\ln\frac{\nu}{2} - \psi\left(\frac{\nu+1-d}{2}\right) < \frac{d}{\nu+1-d} + \frac{1}{3(\nu+1-d)^2} \leq \frac{d}{\nu-(D-1)} + \frac{1}{3[\nu-(D-1)]^2}$
Summing over d yields
$f_D(\nu) < \frac{D(D+1)}{2[\nu-(D-1)]} + \frac{D}{3[\nu-(D-1)]^2}$
The right-hand side tends to 0 when ν tends to infinity; since $f_D(\nu) > 0$ (Result 3), $f_D(\nu)$ tends to 0.
#### 2. Results for $κ n$, $n ≥ 2$
Define $f_D(\nu) = \kappa_n$ as in Equation (8). Since $(-1)^{n+1}\psi^{(n)}$ is completely monotonic, $\kappa_n$ is a decreasing function of ν. We also use the following inequality [20]
$\frac{(n-1)!}{x^n} < (-1)^{n+1}\psi^{(n)}(x) < \frac{(n-1)!}{x^n} + \frac{n!}{2x^{n+1}} + B_2\,\frac{\Gamma(n+2)}{2x^{n+2}}$
where $B_2 = 1/6$ is the second Bernoulli number. Applied with $n-1$ in place of n, the lower bound shows that $(-1)^{n}\psi^{(n-1)}(x)$ is strictly positive and, as a consequence, that $\kappa_n$ is an increasing function of D. The upper bound implies that $\kappa_n$ tends to 0 as ν tends to infinity.
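The polygamma inequalities of [20] used throughout the Appendix can be spot-checked numerically on a grid (the grid itself is an arbitrary choice):

```python
import math
import numpy as np
from scipy.special import polygamma

x = np.linspace(0.5, 50.0, 200)
# psi'(x) > 1/x + 1/(2 x^2)
assert np.all(polygamma(1, x) > 1 / x + 1 / (2 * x**2))
# (n-1)!/x^n < (-1)^(n+1) psi^(n)(x) < (n-1)!/x^n + n!/(2 x^(n+1)) + B_2 (n+1)!/(2 x^(n+2))
for n in [2, 3, 4]:
    lower = math.factorial(n - 1) / x**n
    upper = lower + math.factorial(n) / (2 * x**(n + 1)) \
        + (1 / 6) * math.factorial(n + 1) / (2 * x**(n + 2))
    val = (-1)**(n + 1) * polygamma(n, x)
    assert np.all(lower < val) and np.all(val < upper)
print("all inequalities hold on the test grid")
```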