Large-Sample Asymptotic Approximations for the Sampling and Posterior Distributions of Differential Entropy for Multivariate Normal Distributions

In the present paper, we propose large-sample asymptotic approximations for the sampling and posterior distributions of differential entropy when the sample is composed of independent and identically distributed realizations of a multivariate normal distribution.


Introduction
Entropy has been an active topic of research for over 50 years and much has been published about this measure in various contexts. In statistics, recent developments have investigated how to estimate entropy from data, either in a parametric [1][2][3] or nonparametric framework [4,5], as well as the reliability and convergence properties of these estimators [6,7].
By contrast, relatively little is known about the statistical distribution of entropy, even in the simple case of a multivariate normal distribution. For instance, the differential entropy H(X) of a D-dimensional random variable X that is normally distributed with mean μ and covariance matrix Σ is given by

H(X) = h(Σ) = (1/2) ln[(2πe)^D |Σ|].    (1)

If (x_n)_{n=1,...,N} are N independent and identically distributed realizations of X and S the corresponding sum of squares, then the sample differential entropy h(S/N) is used as the so-called plug-in estimator for H(X). However, h(S/N) is also a random variable whose sampling distribution can be studied. Ahmed et al. provided the exact expression for the mean and variance of this variable [1]. Similarly, in a Bayesian framework, given h(S/N), what are the probable values of h(Σ)? We are not aware of any study in this direction for multivariate normal distributions (but see, e.g., [8,9] for the posterior moments of entropy in the case of multinomial distributions). In the present paper, we provide an asymptotic approximation for both the sampling distribution of h(S/N) and, in a Bayesian framework, the posterior distribution of h(Σ) given h(S/N). To this aim, we first calculate the moments of |S|/|νΣ| when S is Wishart distributed with ν degrees of freedom and scale matrix Σ. We then use this result to provide a closed-form expression for the cumulant-generating function of U = −ln(|S|/|νΣ|), from which we derive closed-form expressions for the cumulants, together with asymptotic expansions when ν → ∞. Using the characteristic function of U, we then provide an asymptotic normal approximation for the distribution of this variable. We finally apply these results to the sample and posterior entropy of multivariate normal distributions.
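As an illustrative sketch (not part of the original text), the plug-in estimator and its negative bias can be checked numerically. The function name `h`, the dimensions, and the covariance matrix below are our own arbitrary choices; only NumPy is assumed.

```python
import numpy as np

def h(cov):
    """Differential entropy of a multivariate normal with covariance `cov`
    (Equation (1)): h(Sigma) = 0.5 * ln[(2*pi*e)^D * |Sigma|]."""
    D = cov.shape[0]
    return 0.5 * (D * np.log(2 * np.pi * np.e) + np.linalg.slogdet(cov)[1])

rng = np.random.default_rng(0)
D, N = 3, 50
Sigma = np.array([[1.0, 0.5, 0.2], [0.5, 1.0, 0.3], [0.2, 0.3, 1.0]])

# Plug-in estimate h(S/N) from N realizations (mean known and equal to 0),
# averaged over many replications to expose the negative bias.
reps = 2000
est = np.empty(reps)
for r in range(reps):
    x = rng.multivariate_normal(np.zeros(D), Sigma, size=N)
    S = x.T @ x                      # sum of squares
    est[r] = h(S / N)

print(h(Sigma), est.mean())          # the plug-in mean falls below h(Sigma)
```

With these values the average plug-in estimate sits below h(Σ) by roughly D(D + 1)/(4N), consistent with the approximations derived below.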

General Result
Assume that S is distributed according to a Wishart distribution with ν ≥ D degrees of freedom and scale matrix Σ, i.e., with density [10]

p(S) = |S|^{(ν−D−1)/2} exp[−tr(Σ^{−1}S)/2] / Z_D(ν),

where Z_D(ν) = 2^{νD/2} |Σ|^{ν/2} Γ_D(ν/2) is the normalizing constant and Γ_D the multivariate gamma function.    (2)

Direct calculation shows that we have, for t ∈ ℝ,

E[(|S|/|νΣ|)^t] = (2/ν)^{Dt} Γ_D(ν/2 + t) / Γ_D(ν/2),    (3)

provided that the corresponding Wishart integral sums to one, i.e., ν + 2t ≥ D or, equivalently, t ≥ (D − ν)/2.
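The moment formula above can be verified by Monte Carlo, as in the following sketch (our own, not from the paper); it assumes SciPy's `wishart` and `multigammaln`, whose conventions match the density used here. The numeric values are arbitrary.

```python
import numpy as np
from scipy.stats import wishart
from scipy.special import multigammaln

# Monte Carlo check of E[(|S|/|nu*Sigma|)^t] = (2/nu)^(D*t)
#   * Gamma_D(nu/2 + t) / Gamma_D(nu/2)   [Equation (3)],
# valid for nu + 2t >= D.
rng = np.random.default_rng(1)
D, nu, t = 2, 10, 1.5
Sigma = np.array([[2.0, 0.4], [0.4, 1.0]])

samples = wishart.rvs(df=nu, scale=Sigma, size=200_000, random_state=rng)
ratio = np.linalg.det(samples) / np.linalg.det(nu * Sigma)
mc = (ratio ** t).mean()

# Closed form, with Gamma_D evaluated through its logarithm.
exact = np.exp(D * t * np.log(2 / nu)
               + multigammaln(nu / 2 + t, D) - multigammaln(nu / 2, D))
print(mc, exact)   # the two agree up to Monte Carlo error
```

For these values the closed form evaluates to 0.99 exactly, since Γ_2(6.5)/Γ_2(5) = (5.5 · 4.5) · 5 = 123.75 and (2/10)^3 = 0.008.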

Cumulant-Generating Function, Cumulants, and Central Moments of U
Cumulant-generating function Let U be the function defined in the introduction, i.e.,

U = −ln(|S|/|νΣ|),    (4)

and g_U(t) = ln E[e^{tU}] its cumulant-generating function. g_U(t) is the log of the quantity calculated in Equation (3), evaluated at −t. ln Z_D(ν) and ln Z_D(ν − 2t) can be expressed using Equation (2), leading to

g_U(t) = Dt ln(ν/2) + ln Γ_D(ν/2 − t) − ln Γ_D(ν/2).

Cumulants By construction, the nth cumulant of U is given by κ_n = g_U^{(n)}(0). In the present case, g_U^{(n)}(t) can be obtained by direct differentiation, yielding for the cumulants

κ_1 = D ln(ν/2) − Σ_{i=1}^{D} ψ[(ν + 1 − i)/2]    (7)

and, for n ≥ 2,

κ_n = (−1)^n Σ_{i=1}^{D} ψ^{(n−1)}[(ν + 1 − i)/2],    (8)

where ψ is the digamma function, i.e., ψ(t) = d[ln Γ(t)]/dt, and ψ^{(n)} its nth derivative [11] (pp. 258-260). For any n ≥ 1, κ_n is always strictly positive. It is an increasing function of D and a decreasing function of ν, and it tends to 0 when ν tends to infinity. For a proof of these properties, see the Appendix.
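A short numerical sketch of Equations (7)-(8) follows (our own illustration; `cumulant` is a name we introduce, and SciPy's `digamma`/`polygamma` are assumed for ψ and its derivatives).

```python
import numpy as np
from scipy.special import digamma, polygamma

def cumulant(n, D, nu):
    """Exact n-th cumulant of U = -ln(|S|/|nu*Sigma|), Equations (7)-(8):
    kappa_1 = D*ln(nu/2) - sum_i psi((nu+1-i)/2),
    kappa_n = (-1)^n * sum_i psi^{(n-1)}((nu+1-i)/2) for n >= 2."""
    z = (nu + 1 - np.arange(1, D + 1)) / 2
    if n == 1:
        return D * np.log(nu / 2) - digamma(z).sum()
    return (-1) ** n * polygamma(n - 1, z).sum()

# All cumulants are strictly positive and decrease with nu:
for nu in (20, 200, 2000):
    print(nu, [cumulant(n, 5, nu) for n in (1, 2, 3, 4)])
```

The printed values illustrate the properties stated above: positivity for every order and decay toward 0 as ν grows.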
Central moments Cumulants and central moments are related as follows: if we denote by μ, σ², γ_1 and γ_2 the mean, variance, skewness and excess kurtosis of U, respectively, we have μ = κ_1, σ² = κ_2, γ_1 = κ_3/κ_2^{3/2}, and γ_2 = κ_4/κ_2². Note that, by definition, μ is equal to the expression of Equation (7) and σ² to that of Equation (8) with n = 2.

Asymptotic Expansion
When ν is large, ψ can be approximated using the following asymptotic expansion [11] (p. 260):

ψ(z) = ln z − 1/(2z) + O(1/z²),

where O(1/z^n) refers to Landau notation and stands for any function f(z) for which there exists z_0 such that z^n f(z) is bounded for z ≥ z_0. Incorporating this expansion in Equation (7) yields for the first cumulant κ_1 or, equivalently, the mean μ

κ_1 = D(D + 1)/(2ν) + O(1/ν²).    (9)

For the cumulants and central moments of order 2 and up, we use the following approximation of ψ^{(n)} [11] (p. 260):

ψ^{(n)}(z) = (−1)^{n−1} [(n − 1)!/z^n] [1 + O(1/z)].

Each term in the sum of Equation (8) can therefore be approximated accordingly. Taking n equal to 2, 3, and 4 respectively yields for the cumulants of order 2, 3, and 4

κ_2 = 2D/ν + O(1/ν²),    (12)
κ_3 = 4D/ν² + O(1/ν³),    (13)
κ_4 = 16D/ν³ + O(1/ν⁴).    (14)

We can now provide asymptotic approximations for the corresponding central moments. The variance σ² = κ_2 is given by Equation (12). An approximation for the skewness can be obtained from Equations (12) and (13) as

γ_1 = [2/(Dν)]^{1/2} [1 + O(1/ν)];

γ_1 being asymptotically positive, the distribution is skewed to the right. Finally, the approximation for γ_2 = κ_4/κ_2² can be expressed as

γ_2 = 4/(Dν) [1 + O(1/ν)],

which is asymptotically positive, corresponding to a leptokurtic distribution.
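The expansions can be compared with the exact cumulants numerically; the sketch below (ours, with arbitrary D and ν) contrasts Equations (7)-(8) with the leading terms of Equations (9) and (12).

```python
import numpy as np
from scipy.special import digamma, polygamma

# Compare the exact cumulants of U with their leading-order expansions:
# kappa_1 ~ D(D+1)/(2*nu)  and  kappa_2 ~ 2*D/nu.
D = 4
for nu in (50, 500, 5000):
    z = (nu + 1 - np.arange(1, D + 1)) / 2
    k1 = D * np.log(nu / 2) - digamma(z).sum()   # exact kappa_1
    k2 = polygamma(1, z).sum()                   # exact kappa_2
    print(nu, k1, D * (D + 1) / (2 * nu), k2, 2 * D / nu)
```

The relative error of both approximations shrinks roughly like 1/ν, as the O(·) terms above indicate.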

Asymptotic Distribution of U
We now use the previous results to prove that U is asymptotically normally distributed with mean D(D + 1)/(2ν) and variance 2D/ν. To this aim, set

V_ν = [U − D(D + 1)/(2ν)] / (2D/ν)^{1/2},

and let φ_U(t) = E[e^{itU}] denote the characteristic function of U. We proved Equation (3) as an analytic identity for real t in the admissible range. This expression remains valid, however, wherever Z_D(ν + 2t) is analytic. We can thus obtain an expression for φ_U(t) by formally substituting it for t. We then use Stirling's approximation [11] (p. 257) to approximate each term of the sum in the second term of the right-hand side of Equation (16) when ν is large, which yields the characteristic function of V_ν. As ν tends towards infinity, φ_{V_ν}(t) achieves pointwise convergence toward e^{−t²/2}, which is continuous at t = 0. According to Lévy's continuity theorem, V_ν therefore converges in distribution to the standard normal distribution.
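The normal limit can be illustrated by simulation; the following sketch (ours, with arbitrary D and ν) standardizes simulated values of U and checks that the first two moments are close to those of a standard normal. We take Σ = I, which is without loss of generality since the distribution of U does not depend on Σ.

```python
import numpy as np
from scipy.stats import wishart

# Monte Carlo illustration: V_nu = [U - D(D+1)/(2*nu)] / sqrt(2*D/nu)
# should be close to standard normal for large nu.
rng = np.random.default_rng(2)
D, nu = 3, 400
Sigma = np.eye(D)

S = wishart.rvs(df=nu, scale=Sigma, size=100_000, random_state=rng)
U = -(np.linalg.slogdet(S)[1] - D * np.log(nu))   # U = -ln(|S|/|nu*I|)
V = (U - D * (D + 1) / (2 * nu)) / np.sqrt(2 * D / nu)
print(V.mean(), V.std())    # both close to 0 and 1, respectively
```

The residual skewness of order [2/(Dν)]^{1/2} is visible in finer summaries (e.g., sample quantiles) but vanishes as ν grows.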

Application to Differential Entropy
We can use the results of the previous section to obtain the exact and asymptotic cumulants of the sample and posterior entropy when the data are multivariate normal.

Sampling Distribution
The differential entropy H(X) of a D-dimensional random variable X that is normally distributed with (known) mean μ and (unknown) covariance matrix Σ is given by Equation (1). Let (x_n)_{n=1,...,N} be N independent and identically distributed realizations of X. Set S the sum of squares, i.e.,

S = Σ_{n=1}^{N} (x_n − μ)(x_n − μ)^T.    (17)

S follows a Wishart distribution with ν = N degrees of freedom and scale matrix Σ [12] (Th. 7.2.2). Define the sample differential entropy corresponding to the N realizations as h(S/N). Using the fact that |S/N|/|Σ| = |S|/|NΣ|, we obtain that h(S/N) − h(Σ) = −U/2, where U was defined in Equation (4).
The mean and variance of h(S/N) = h(Σ) − U/2 can therefore be expressed as functions of the corresponding central moments of U, i.e., μ = κ_1 [Equations (7) and (9)] and σ² = κ_2 [Equations (8) and (12)], leading to the following closed-form expressions and approximations:

E[h(S/N)|N, Σ] = h(Σ) − κ_1/2    (18)
≈ h(Σ) − D(D + 1)/(4N)    (19)

and

Var[h(S/N)|N, Σ] = κ_2/4 ≈ D/(2N).    (20)

Furthermore, the results of Section 2.3 show that, given N and Σ, h(S/N) − h(Σ) is asymptotically normally distributed with mean −D(D + 1)/(4N) and variance D/(2N). If μ is unknown, we replace μ by the sample mean m in Equation (17). S is then still Wishart distributed with scale matrix Σ but ν = N − 1 degrees of freedom [12] (Cor. 7.2.2). The exact expectation and variance of h[S/(N − 1)] are therefore given by Equations (18) and (20), respectively, with N replaced by N − 1. Performing an asymptotic expansion of these expressions leads to the same leading-order terms as above. Furthermore, since the first-order approximation is the same for h[S/(N − 1)] and h(S/N), both quantities have the same asymptotic distribution.
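Equations (18)-(19) can be contrasted numerically without simulation; the sketch below (ours; the helper name is our own) computes the exact mean shift E[h(S/N)] − h(Σ) = −κ_1/2 and its first-order approximation −D(D + 1)/(4N).

```python
import numpy as np
from scipy.special import digamma

# Exact vs. first-order mean of sample entropy (Equations (18)-(19)):
# E[h(S/N)] - h(Sigma) = -kappa_1/2 ~ -D(D+1)/(4N).
def mean_sample_entropy_shift(D, N):
    z = (N + 1 - np.arange(1, D + 1)) / 2
    k1 = D * np.log(N / 2) - digamma(z).sum()   # exact kappa_1 with nu = N
    return -k1 / 2

for N in (20, 100, 1000):
    print(N, mean_sample_entropy_shift(5, N), -5 * 6 / (4 * N))
```

The shift is negative for every N, which is the negative bias of the plug-in estimator discussed later.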

Posterior Distribution
With the same assumptions as above, and assuming a non-informative Jeffreys prior for Σ, i.e., p(Σ) ∝ |Σ|^{−(D+1)/2}, the posterior distribution for Σ given the N realizations of X is inverse Wishart with n = N − 1 degrees of freedom and scale matrix S^{−1} [13]. This implies that Υ = Σ^{−1}, the concentration matrix, is Wishart distributed with n degrees of freedom and scale matrix S^{−1}. The results of Section 3.1 therefore apply to h(Υ/n) − h(S^{−1}). But, since for any invertible matrix A, ln|A^{−1}| = −ln|A|, we have h(Υ/n) − h(S^{−1}) = −[h(Σ) − h(S/n)], so that

E[h(Σ)|N, S] = h(S/n) + κ_1/2 ≈ h(S/n) + D(D + 1)/(4N)    (23)

and

Var[h(Σ)|N, S] = κ_2/4 ≈ D/(2N).

Also, h(Σ) − h(S/n) is asymptotically normally distributed with mean D(D + 1)/(4N) and variance D/(2N).
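A Monte Carlo sketch of the posterior mean follows (ours; S and the dimensions are arbitrary, and SciPy's Wishart sampler is used for Υ = Σ^{−1}).

```python
import numpy as np
from scipy.stats import wishart
from scipy.special import digamma

# Posterior-mean check: with Upsilon = Sigma^{-1} ~ Wishart(n, S^{-1}),
# E[h(Sigma)|S] = h(S/n) + kappa_1(n)/2 ~ h(S/n) + D(D+1)/(4n).
rng = np.random.default_rng(3)
D, n = 3, 200
S = np.array([[5.0, 1.0, 0.0], [1.0, 4.0, 0.5], [0.0, 0.5, 3.0]]) * n

Ups = wishart.rvs(df=n, scale=np.linalg.inv(S), size=50_000, random_state=rng)
# h(Sigma) = (D/2)(1 + ln 2*pi) - 0.5*ln|Upsilon|
h_Sigma = 0.5 * (D * np.log(2 * np.pi * np.e) - np.linalg.slogdet(Ups)[1])
h_Sn = 0.5 * (D * np.log(2 * np.pi * np.e) + np.linalg.slogdet(S / n)[1])

z = (n + 1 - np.arange(1, D + 1)) / 2
k1 = D * np.log(n / 2) - digamma(z).sum()        # exact kappa_1 with nu = n
print(h_Sigma.mean() - h_Sn, k1 / 2, D * (D + 1) / (4 * n))
```

The simulated posterior mean of h(Σ) exceeds the sample entropy h(S/n) by κ_1/2, i.e., by roughly D(D + 1)/(4n).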

Application to Mutual Information and Multiinformation
Similar results can also be derived for the first cumulant of mutual information and of multiinformation, its generalization to more than two variables. The mutual information between two sets of variables X_1 (of dimension D_1) and X_2 (of dimension D_2) is defined as the Kullback-Leibler divergence between their joint distribution and the product of the corresponding marginal distributions. For multivariate normal variables, we have

i(Σ) = h(Σ_1) + h(Σ_2) − h(Σ),    (24)

where Σ_1 and Σ_2 are the two block diagonal elements of Σ and where h was defined in Equation (1).
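Equation (24) can be sanity-checked in the simplest case (a sketch of our own): for two scalar variables with correlation ρ, the normal mutual information reduces to the classical −(1/2) ln(1 − ρ²).

```python
import numpy as np

# Mutual information of a multivariate normal (Equation (24)):
# i(Sigma) = h(Sigma_1) + h(Sigma_2) - h(Sigma).
def h(cov):
    cov = np.atleast_2d(cov)
    D = cov.shape[0]
    return 0.5 * (D * np.log(2 * np.pi * np.e) + np.linalg.slogdet(cov)[1])

rho = 0.6
Sigma = np.array([[1.0, rho], [rho, 1.0]])    # D1 = D2 = 1
i_Sigma = h(Sigma[:1, :1]) + h(Sigma[1:, 1:]) - h(Sigma)
print(i_Sigma, -0.5 * np.log(1 - rho ** 2))   # the two coincide
```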

Sampling Mean
Define the sample mutual information as i(S/N). Using Equation (24), direct calculation shows that E[i(S/N)|N, Σ] = E[h(S_1/N)|N, Σ] + E[h(S_2/N)|N, Σ] − E[h(S/N)|N, Σ]. An asymptotic approximation for E[h(S/N)|N, Σ] can be obtained by direct use of Equation (19). For S_1 and S_2, we proceed as follows. If S is Wishart distributed with N degrees of freedom and scale matrix Σ, then S_j (j ∈ {1, 2}) is also Wishart distributed with N degrees of freedom and scale matrix Σ_j [12] (Th. 7.3.4). Equation (19) can therefore be applied to matrix S_j with the proper scale matrix, yielding

E[i(S/N)|N, Σ] ≈ i(Σ) + D_1 D_2/(2N).

A similar result can be obtained for the generalization of i to K sets of variables X_k (of size D_k), a measure called total correlation [14], multivariate constraint [15], δ [16], or multiinformation [17]. In that case, the bias term becomes (D² − Σ_k D_k²)/(4N) and, in the particular case where each X_k is one-dimensional (i.e., D_k = 1), it reduces to K(K − 1)/(4N).
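The leading bias of sample mutual information can be checked by simulation; the sketch below (ours, with arbitrary dimensions and an arbitrary positive definite Σ) compares the Monte Carlo bias with D_1 D_2/(2N).

```python
import numpy as np
from scipy.stats import wishart

# Monte Carlo check of the leading-order bias of sample mutual information:
# E[i(S/N)] - i(Sigma) ~ D1*D2/(2N), obtained by combining Equation (19)
# applied to S, S1 and S2.
rng = np.random.default_rng(4)
D1, D2, N = 2, 3, 100
D = D1 + D2
A = rng.standard_normal((D, D))
Sigma = A @ A.T + D * np.eye(D)          # a generic positive definite matrix

def i_of(C):
    """Mutual information (Equation (24)) of a normal with covariance C;
    works on a single matrix or a stack of matrices."""
    return 0.5 * (np.linalg.slogdet(C[..., :D1, :D1])[1]
                  + np.linalg.slogdet(C[..., D1:, D1:])[1]
                  - np.linalg.slogdet(C)[1])

S = wishart.rvs(df=N, scale=Sigma, size=100_000, random_state=rng)
i_hat = i_of(S / N)                      # the 1/N factors cancel inside i_of
print(i_hat.mean() - i_of(Sigma), D1 * D2 / (2 * N))
```

The positive bias of sample mutual information mirrors the negative bias of sample entropy: the over-counted terms in h(S_1/N) + h(S_2/N) do not fully cancel the one in h(S/N).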

Posterior Mean
A similar argument can be applied to the Bayesian posterior mean of i. Using Equation (24) again, we can express E[i(Σ)|N, S] in terms of the posterior means of h(Σ), h(Σ_1), and h(Σ_2). An asymptotic approximation for E[h(Σ)|N, S] can be obtained by direct use of Equation (23). Now, if Σ is inverse Wishart distributed with n degrees of freedom and scale matrix S, then Σ_j (j ∈ {1, 2}) is also inverse Wishart distributed with n − D_k (k ∈ {1, 2}, k ≠ j) degrees of freedom and scale matrix S_j [18]. Application of Equation (23) with the proper degrees of freedom and scale matrix then yields E[i(Σ)|N, S], where we only retained the expansion terms of order up to 1/N for the sake of simplicity. A corresponding expression holds for posterior multiinformation, with a simplified form in the particular case where each X_k is one-dimensional (i.e., D_k = 1).

Simulation Study
We conducted the following computations for D ∈ {2, 5, 10}. To assess the accuracy of the asymptotic expansion of the cumulants of sample entropy, we calculated the error made by the first and second central moments (i.e., the mean and variance of the distribution) relative to the exact values as a function of ν. For comparison, we computed the same quantities for 500 different homogeneous positive definite matrices Σ (i.e., with all non-diagonal elements equal to the same value ρ, generated uniformly); for each value of Σ and ν, we generated 1,000 samples from S ∼ Wishart(ν, Σ), computed the corresponding values of sample entropy, and approximated the moments by the corresponding sampling moments. The results are reported in Figure 1.

Discussion
In this work, we calculated both the moments of |S|/|νΣ| and the cumulant-generating function of U = −ln(|S|/|νΣ|) when S is Wishart distributed with ν degrees of freedom and scale matrix Σ. From there, we provided an asymptotic approximation of the first four central moments of U. We also proved that U is asymptotically normally distributed, and demonstrated the quality of the normal approximation compared to simulations. We finally applied these results to the multivariate normal distribution to provide asymptotic approximations of the sample and posterior distributions of differential entropy, as well as asymptotic approximations of the sample and posterior mean of multiinformation.
Interestingly, the moments of |S|/|νΣ| and, as a consequence, the cumulant-generating function of U depend on the distribution of S only through the matrix dimension D and the degrees of freedom ν, but not through Σ. This means that the exact distribution of U is also independent of that parameter and could possibly be tabulated as a function of the two integer parameters.
As mentioned in the introduction, the sample differential entropy defined in Equation (1) is equal to the plug-in estimator for differential entropy. The present work quantifies, in the case of multivariate normal samples, the well-known negative bias of this estimator [7]. Obviously, Equation (18) confirms that, to correct for this bias, one must take the uniformly minimum variance unbiased (UMVU) estimator [1].
The posterior derivation that we presented here is a particular case of the Bayesian posterior estimate obtained by [3] with, in our case, the prior distribution for Σ taken as Jeffreys prior (i.e., q = −1 and B = 0 in their notation). While the same analysis as in [3] could have been performed, it would essentially lead to the same result, since we only consider the asymptotic case, where the sample is large and the prior distribution is supposed to have very little influence, provided that it does not contradict the data. The present study also shows an interesting feature of Bayesian estimation with respect to the above-mentioned negative bias. As the sample differential entropy tends to underestimate H(Σ) by κ_1/2 on average, if one takes the posterior mean as the Bayesian estimate of H(Σ), then the negative bias is corrected by the opposite amount.

Figure 1. Error on the mean (top row) and variance (bottom row) of sample entropy for various values of D and ν when using the first-order approximation (circles), the second-order approximation (squares), or the sampling scheme (diamonds). The error was calculated as the absolute value of the difference between the approximation and the true value. For the sampling scheme, the median and the symmetrical 90% probability interval of the error are represented. The y-axis scale is logarithmic.