Large-Sample Asymptotic Approximations for the Sampling and Posterior Distributions of Differential Entropy for Multivariate Normal Distributions

Marrelec, Guillaume; Benali, Habib

doi:10.3390/e13040805

by

Guillaume Marrelec

^1,2,3,* and

Habib Benali

^1,2,3

¹

Inserm, U678, Paris, F-75013, France

²

UPMC Univ Paris 06, UMR_S U678, Paris, F-75013, France

³

Inserm, Université de Montréal, and UPMC Univ Paris 06, LINeM, Montréal, QC, H3W 1W5, Canada

^*

Author to whom correspondence should be addressed.

Entropy 2011, 13(4), 805-819; https://doi.org/10.3390/e13040805

Submission received: 15 February 2011 / Revised: 29 March 2011 / Accepted: 31 March 2011 / Published: 6 April 2011

Abstract:

In the present paper, we propose a large-sample asymptotic approximation for the sampling and posterior distributions of differential entropy when the sample is composed of independent and identically distributed realizations of a multivariate normal distribution.

Keywords:

differential entropy; large sample; asymptotic approximation; multivariate normal distribution; sampling distribution; posterior distribution; mutual information; multiinformation; total correlation; multivariate constraint

1. Introduction

Entropy has been an active topic of research for over 50 years and much has been published about this measure in various contexts. In statistics, recent developments have investigated how to estimate entropy from data, either in a parametric [1,2,3] or nonparametric framework [4,5], as well as the reliability and convergence properties of these estimators [6,7].

By contrast, relatively little is known about the statistical distribution of entropy, even in the simple case of a multivariate normal distribution. For instance, the differential entropy

H (X)

of a D-dimensional random variable X that is normally distributed with mean μ and covariance matrix Σ is given by

H (X) = h (Σ) = \frac{D}{2} [1 + ln (2 π)] + \frac{1}{2} ln | Σ |

(1)
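As a concrete illustration of Equation (1), the following Python sketch evaluates h(Σ) for a small covariance matrix. The helper names `log_det_spd` and `gaussian_entropy` are hypothetical and do not come from the paper; the log-determinant is computed with a hand-written Cholesky factorization so that only the standard library is required.

```python
import math


def log_det_spd(matrix):
    """Log-determinant of a symmetric positive definite matrix via Cholesky."""
    dim = len(matrix)
    chol = [[0.0] * dim for _ in range(dim)]
    for i in range(dim):
        for j in range(i + 1):
            partial = sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= 0.0:
                    raise ValueError("matrix is not positive definite")
                chol[i][j] = math.sqrt(pivot)
            else:
                chol[i][j] = (matrix[i][j] - partial) / chol[j][j]
    # |A| is the squared product of the diagonal of the Cholesky factor.
    return 2.0 * sum(math.log(chol[i][i]) for i in range(dim))


def gaussian_entropy(cov):
    """Differential entropy h(Sigma) of Equation (1), in nats."""
    dim = len(cov)
    return 0.5 * dim * (1.0 + math.log(2.0 * math.pi)) + 0.5 * log_det_spd(cov)


identity_entropy = gaussian_entropy([[1.0, 0.0], [0.0, 1.0]])
correlated_entropy = gaussian_entropy([[2.0, 0.5], [0.5, 1.0]])
```

For the identity matrix in dimension 2 the log-determinant vanishes, so the result reduces to 1 + ln(2π).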

If

{(x_{n})}_{n = 1, \dots, N}

are N independent and identically distributed realizations of X and S is the corresponding sum of squares, then the sample differential entropy

h (S / N)

is used as the so-called plug-in estimator for

H (X)

. However,

h (S / N)

is also a random variable whose sampling distribution can be studied. Ahmed and Gokhale provided exact expressions for the mean and variance of this variable [1]. Similarly, in a Bayesian framework, given

h (S / N)

, what are the probable values of

h (Σ)

? We are not aware of any study in this direction for multivariate normal distributions (but see, e.g., [8,9] for the posterior moments of entropy in the case of multinomial distributions). In the present paper, we provide an asymptotic approximation for both the sampling distribution of

h (S / N)

and, in a Bayesian framework, the posterior distribution of

h (Σ)

given

h (S / N)

. To this aim, we first calculate the moments of

| S | / | ν Σ |

in the same condition as above. We then use this result to provide a closed form expression for the cumulant-generating function of

U = - ln (| S | / | ν Σ |)

, from which we derive closed form expressions for the cumulants, together with asymptotic expansions when

ν \to \infty

. Using the characteristic function of U, we then provide an asymptotic normal approximation for the distribution of this variable. We finally apply these results to the sample and posterior entropy of multivariate normal distributions.

2. General Result

Assume that S is distributed according to a Wishart distribution with

ν \geq D

degrees of freedom and scale matrix Σ, i.e., [10] (Chapter 7)

p (S | Σ, ν) = \frac{1}{Z_{D} (ν)} {| Σ |}^{- \frac{ν}{2}} {| S |}^{\frac{ν - D - 1}{2}} exp [- \frac{1}{2} tr (Σ^{- 1} S)]

where

Z_{D} (ν)

is the normalizing constant,

Z_{D} (ν) = 2^{\frac{ν D}{2}} π^{\frac{D (D - 1)}{4}} \prod_{d = 1}^{D} Γ (\frac{ν + 1 - d}{2})

(2)

Direct calculation shows that we have, for

t \in ℝ

,

\begin{matrix} E [{(\frac{| S |}{| ν Σ |})}^{t}] & = & \int {(\frac{| S |}{| ν Σ |})}^{t} \cdot \frac{1}{Z_{D} (ν)} {| Σ |}^{- \frac{ν}{2}} {| S |}^{\frac{ν - D - 1}{2}} exp [- \frac{1}{2} tr (Σ^{- 1} S)] d S \\ & = & \frac{Z_{D} (ν + 2 t)}{Z_{D} (ν)} ν^{- D t} \int \frac{1}{Z_{D} (ν + 2 t)} {| Σ |}^{- \frac{ν + 2 t}{2}} {| S |}^{\frac{(ν + 2 t) - D - 1}{2}} exp [- \frac{1}{2} tr (Σ^{- 1} S)] d S \\ & = & \frac{Z_{D} (ν + 2 t)}{Z_{D} (ν)} ν^{- D t} \end{matrix}

(3)

provided that the integral sums to one, i.e.,

ν + 2 t \geq D

or, equivalently,

t \geq (D - ν) / 2

.
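The moment formula of Equation (3) can be evaluated numerically from the logarithm of the normalizing constant of Equation (2). The sketch below is only an illustration; the function names `log_wishart_normalizer` and `determinant_ratio_moment` are hypothetical, and the validity check simply mirrors the condition ν + 2t ≥ D stated above.

```python
import math


def log_wishart_normalizer(dim, nu):
    """ln Z_D(nu), following Equation (2)."""
    log_z = 0.5 * nu * dim * math.log(2.0)
    log_z += 0.25 * dim * (dim - 1) * math.log(math.pi)
    log_z += sum(math.lgamma(0.5 * (nu + 1 - d)) for d in range(1, dim + 1))
    return log_z


def determinant_ratio_moment(dim, nu, t):
    """E[(|S| / |nu Sigma|)^t], following Equation (3)."""
    if nu + 2.0 * t < dim:
        raise ValueError("Equation (3) requires nu + 2 t >= D")
    log_moment = (
        log_wishart_normalizer(dim, nu + 2.0 * t)
        - log_wishart_normalizer(dim, nu)
        - dim * t * math.log(nu)
    )
    return math.exp(log_moment)


# For t = 1 the ratio of Gamma functions telescopes to prod_d (nu + 1 - d),
# so with D = 2 and nu = 10 the moment is 10 * 9 / 10**2 = 0.9.
first_moment = determinant_ratio_moment(2, 10, 1.0)
zeroth_moment = determinant_ratio_moment(3, 7, 0.0)
```

Working with logarithms of Gamma functions avoids overflow for large ν, which is the regime of interest in the rest of the paper.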

2.1. Cumulant-Generating Function, Cumulants, and Central Moments of U

Cumulant-generating function Let U be the function defined in the introduction, i.e.,

U = - ln \frac{| S |}{| ν Σ |}

(4)

and

g_{U} (t) = ln E [e^{t U}]

its cumulant-generating function.

g_{U} (t)

is the log of the quantity calculated in Equation (3)

g_{U} (t) = D t ln ν + ln Z_{D} (ν - 2 t) - ln Z_{D} (ν)

(5)

ln Z_{D} (ν)

and

ln Z_{D} (ν - 2 t)

can be expressed using Equation (2), leading to

g_{U} (t) = D t ln \frac{ν}{2} + \sum_{d = 1}^{D} ln Γ (\frac{ν - 2 t + 1 - d}{2}) - \sum_{d = 1}^{D} ln Γ (\frac{ν + 1 - d}{2})

(6)
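Equation (6) only involves log-Gamma functions, so the cumulant-generating function can be coded directly and its first derivatives at 0 approximated by finite differences. This is a rough numerical check rather than the closed-form route taken in the paper; the names `cgf_u` and `first_two_cumulants_numeric` and the step size are assumptions made for the example.

```python
import math


def cgf_u(t, dim, nu):
    """Cumulant-generating function g_U(t) of Equation (6)."""
    if nu - 2.0 * t + 1 - dim <= 0:
        raise ValueError("all Gamma arguments must be positive")
    value = dim * t * math.log(0.5 * nu)
    for d in range(1, dim + 1):
        value += math.lgamma(0.5 * (nu - 2.0 * t + 1 - d))
        value -= math.lgamma(0.5 * (nu + 1 - d))
    return value


def first_two_cumulants_numeric(dim, nu, step=1e-3):
    """Central finite-difference estimates of g_U'(0) and g_U''(0)."""
    g_plus = cgf_u(step, dim, nu)
    g_zero = cgf_u(0.0, dim, nu)
    g_minus = cgf_u(-step, dim, nu)
    kappa1 = (g_plus - g_minus) / (2.0 * step)
    kappa2 = (g_plus - 2.0 * g_zero + g_minus) / step ** 2
    return kappa1, kappa2


# With D = 1 and nu = 2, g_U(t) = ln Gamma(1 - t), whose first two
# derivatives at 0 are Euler's constant and pi^2 / 6.
kappa1_check, kappa2_check = first_two_cumulants_numeric(1, 2)
```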

Cumulants By construction, the nth cumulant of U is given by

κ_{n} = g_{U}^{(n)} (0)

. In the present case,

g_{U}^{(n)} (t)

can be obtained by direct derivation, yielding for the cumulants

κ_{1} = g_{U}^{'} (0) = D ln \frac{ν}{2} - \sum_{d = 1}^{D} ψ (\frac{ν + 1 - d}{2})

(7)

and

κ_{n} = g_{U}^{(n)} (0) = {(- 1)}^{n} \sum_{d = 1}^{D} ψ^{(n - 1)} (\frac{ν + 1 - d}{2})

(8)

for

n \geq 2

, where ψ is the digamma function, i.e.,

ψ (t) = d [ln Γ (t)] / d t

, and

ψ^{(n)}

its nth derivative [11] (pp. 258–260). For any

n \geq 1

,

κ_{n}

is always strictly positive. It is an increasing function of D and a decreasing function of ν. It tends to 0 when ν tends to infinity. For a proof of these properties, see the appendix.
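The exact expressions of Equations (7) and (8) can be evaluated with any implementation of the digamma and trigamma functions. Since the Python standard library does not provide them, the sketch below uses a simple recurrence followed by a truncated asymptotic series; this numerical scheme and the names `digamma`, `trigamma`, `kappa1_exact` and `kappa2_exact` are assumptions of the example, not part of the paper.

```python
import math


def digamma(x):
    """psi(x) for x > 0: shift x upward, then apply a truncated asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x  # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    result += math.log(x) - 0.5 * inv - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result


def trigamma(x):
    """psi'(x) for x > 0, computed with the same strategy."""
    result = 0.0
    while x < 6.0:
        result += 1.0 / (x * x)  # psi'(x) = psi'(x + 1) + 1/x^2
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # 1/x + 1/(2x^2) + 1/(6x^3) - 1/(30x^5) + 1/(42x^7)
    result += inv * (1.0 + 0.5 * inv + inv2 * (1.0 / 6 - inv2 * (1.0 / 30 - inv2 / 42)))
    return result


def kappa1_exact(dim, nu):
    """First cumulant of U, Equation (7)."""
    return dim * math.log(0.5 * nu) - sum(
        digamma(0.5 * (nu + 1 - d)) for d in range(1, dim + 1)
    )


def kappa2_exact(dim, nu):
    """Second cumulant of U, Equation (8) with n = 2."""
    return sum(trigamma(0.5 * (nu + 1 - d)) for d in range(1, dim + 1))
```

For D = 1 and ν = 2 these reduce to −ψ(1) and ψ'(1), that is, Euler's constant and π²/6, which gives a quick sanity check of the implementation.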

Central moments Cumulants and central moments are related as follows: If we denote by μ,

σ^{2}

, γ_{1} and

γ_{2}

the mean, variance, skewness and excess kurtosis of U, respectively, we have

μ = κ_{1}

,

σ^{2} = κ_{2}

,

γ_{1} = κ_{3} / κ_{2}^{3 / 2}

, and

γ_{2} = κ_{4} / κ_{2}^{2}

. Note that, by definition, μ is equal to the expression of Equation (7) and

σ^{2}

to that of Equation (8) with

n = 2

.

2.2. Asymptotic Expansion

When ν is large, ψ can be approximated using the following asymptotic expansion [11] (p. 260)

ψ (z) = ln z - \frac{1}{2 z} - \frac{1}{12 z^{2}} + O (\frac{1}{z^{3}})

where

O (1 / z^{n})

refers to Landau notation and stands for any function

f (z)

for which there exists

z_{0}

so that

z^{n} f (z)

is bounded for

z \geq z_{0}

. This leads to

\begin{matrix} ψ (\frac{ν + 1 - d}{2}) & = & ln (\frac{ν + 1 - d}{2}) - \frac{1}{ν + 1 - d} - \frac{1}{3 {(ν + 1 - d)}^{2}} + O (\frac{1}{ν^{3}}) \\ = & ln \frac{ν}{2} + ln (1 + \frac{1 - d}{ν}) - \frac{1}{ν (1 + \frac{1 - d}{ν})} - \frac{1}{3 ν^{2} {(1 + \frac{1 - d}{ν})}^{2}} + O (\frac{1}{ν^{3}}) \\ = & ln \frac{ν}{2} + [\frac{1 - d}{ν} - \frac{1}{2} {(\frac{1 - d}{ν})}^{2}] - \frac{1}{ν} (1 - \frac{1 - d}{ν}) - \frac{1}{3 ν^{2}} + O (\frac{1}{ν^{3}}) \\ = & ln \frac{ν}{2} - \frac{d}{ν} + \frac{1 - 3 d^{2}}{6 ν^{2}} + O (\frac{1}{ν^{3}}) \end{matrix}

Incorporating this expansion in Equation (7) yields for the first cumulant

κ_{1}

or, equivalently, the mean μ

κ_{1} = μ = \frac{D (D + 1)}{2 ν} + \frac{2 D^{3} + 3 D^{2} - D}{12 ν^{2}} + O (\frac{1}{ν^{3}})

(9)

For the cumulants and central moments of order 2 and up, we use the following approximation of

ψ^{(n)}

[11] (p. 260)

ψ^{(n)} (z) = {(- 1)}^{n - 1} [\frac{(n - 1)!}{z^{n}} + \frac{n!}{2 z^{n + 1}} + O (\frac{1}{z^{n + 2}})]

(10)

Each term in the sum of Equation (8) can therefore be approximated as

\begin{matrix} ψ^{(n - 1)} (\frac{ν + 1 - d}{2}) & = & {(- 1)}^{n - 2} [\frac{2^{n - 1} (n - 2)!}{ν^{n - 1} {(1 + \frac{1 - d}{ν})}^{n - 1}} + \frac{2^{n - 1} (n - 1)!}{ν^{n} {(1 + \frac{1 - d}{ν})}^{n}} + O (\frac{1}{ν^{n + 1}})] \\ = & {(- 1)}^{n - 2} [\frac{2^{n - 1} (n - 2)!}{ν^{n - 1}} (1 - \frac{(n - 1) (1 - d)}{ν}) + \frac{2^{n - 1} (n - 1)!}{ν^{n}} + O (\frac{1}{ν^{n + 1}})] \\ = & {(- 1)}^{n - 2} [\frac{2^{n - 1} (n - 2)!}{ν^{n - 1}} + \frac{2^{n - 1} (n - 1)! d}{ν^{n}} + O (\frac{1}{ν^{n + 1}})] \end{matrix}

leading to an approximation of

κ_{n} = g_{U}^{(n)} (0)

of the form

κ_{n} = \frac{2^{n - 1} D (n - 2)!}{ν^{n - 1}} + \frac{2^{n - 1} D (D + 1) (n - 1)!}{2 ν^{n}} + O (\frac{1}{ν^{n + 1}})

(11)

Taking n equal to 2, 3, and 4 respectively yields for the cumulants of order 2, 3, and 4

\begin{matrix} κ_{2} & = & \frac{2 D}{ν} + \frac{D (D + 1)}{ν^{2}} + O (\frac{1}{ν^{3}}) \end{matrix}

(12)

\begin{matrix} κ_{3} & = & \frac{4 D}{ν^{2}} + \frac{4 D (D + 1)}{ν^{3}} + O (\frac{1}{ν^{4}}) \end{matrix}

(13)

\begin{matrix} κ_{4} & = & \frac{16 D}{ν^{3}} + \frac{24 D (D + 1)}{ν^{4}} + O (\frac{1}{ν^{5}}) \end{matrix}

(14)
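The two-term approximation of Equation (11) is easy to tabulate. The following sketch (the function name `kappa_asymptotic` is hypothetical) evaluates it and reproduces the coefficients of Equations (12) to (14) for one arbitrary choice of D and ν.

```python
import math


def kappa_asymptotic(order, dim, nu):
    """Two-term large-nu approximation of kappa_n for n >= 2, Equation (11)."""
    if order < 2:
        raise ValueError("Equation (11) applies to cumulants of order 2 and up")
    power = 2 ** (order - 1)
    leading = power * dim * math.factorial(order - 2) / nu ** (order - 1)
    correction = power * dim * (dim + 1) * math.factorial(order - 1) / (2.0 * nu ** order)
    return leading + correction


# D = 3 and nu = 100:
#   kappa_2 ~ 2D/nu + D(D+1)/nu^2      = 0.06    + 0.0012
#   kappa_3 ~ 4D/nu^2 + 4D(D+1)/nu^3   = 0.0012  + 0.000048
#   kappa_4 ~ 16D/nu^3 + 24D(D+1)/nu^4 = 4.8e-5  + 2.88e-6
kappa2, kappa3, kappa4 = (kappa_asymptotic(n, 3, 100.0) for n in (2, 3, 4))
```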

We can now provide asymptotic approximations for the corresponding central moments. The variance

σ^{2} = κ_{2}

is given by Equation (12). Approximation for the skewness

γ_{1} = κ_{3} / κ_{2}^{3 / 2}

can be obtained from Equations (12) and (13) as

\begin{matrix} γ_{1} & = & \frac{4 D}{ν^{2}} [1 + \frac{D + 1}{ν} + O (\frac{1}{ν^{2}})] {(\frac{2 D}{ν})}^{- \frac{3}{2}} {[1 + \frac{D + 1}{2 ν} + O (\frac{1}{ν^{2}})]}^{- \frac{3}{2}} \\ = & \sqrt{\frac{2}{D ν}} [1 + \frac{D + 1}{4 ν} + O (\frac{1}{ν^{2}})] \end{matrix}

γ_{1}

being asymptotically positive, the distribution is skewed on the right. Finally, the approximation for

γ_{2} = κ_{4} / κ_{2}^{2}

can be expressed as

\begin{matrix} γ_{2} & = & \frac{16 D}{ν^{3}} [1 + \frac{3 (D + 1)}{2 ν} + O (\frac{1}{ν^{2}})] {(\frac{2 D}{ν})}^{- 2} {[1 + \frac{D + 1}{2 ν} + O (\frac{1}{ν^{2}})]}^{- 2} \\ = & \frac{4}{D ν} (1 + \frac{D + 1}{2 ν}) + O (\frac{1}{ν^{3}}) \end{matrix}

which is asymptotically positive, corresponding to a leptokurtic distribution.

2.3. Asymptotic Distribution of U

We now use the previous results to prove that U is asymptotically normally distributed with mean

D (D + 1) / 2 ν

and variance

2 D / ν

. To this aim, set

V_{ν} = \frac{U - \frac{a}{ν}}{\frac{b}{\sqrt{ν}}}

(15)

with

a = D (D + 1) / 2

and

b = \sqrt{2 D}

. The logarithm of the characteristic function of

V_{ν}

reads

\begin{matrix} ln ϕ_{V_{ν}} (t) & = & ln E \{exp [i t (\frac{U - \frac{a}{ν}}{\frac{b}{\sqrt{ν}}})]\} \\ = & - \frac{i t a}{b \sqrt{ν}} + ln E \{exp [(\frac{i t \sqrt{ν}}{b}) U]\} \\ = & - \frac{i t a}{b \sqrt{ν}} + ln ϕ_{U} (\frac{i t \sqrt{ν}}{b}) \\ = & ln ϕ_{U} (\frac{i t \sqrt{ν}}{b}) + O (\frac{1}{\sqrt{ν}}) \end{matrix}

where

ϕ_{U} (t)

is the characteristic function of U. We proved Equation (3) as an analytic identity for

t \in ℝ

. This expression will, however, be valid in the range where

Z_{D} (ν + 2 t)

is analytic. We can thus obtain an expression for

ϕ_{U} (i t \sqrt{ν} / b)

by replacing t by

- i t \sqrt{ν} / b

in Equation (3), leading to

\begin{matrix} ln ϕ_{U} (\frac{i t \sqrt{ν}}{b}) & = & ln [\frac{Z_{D} (ν - \frac{2 i t \sqrt{ν}}{b})}{Z_{D} (ν)}] + \frac{i t D \sqrt{ν} ln ν}{b} \\ & = & ln [\frac{2^{\frac{(ν - \frac{2 i t \sqrt{ν}}{b}) D}{2}} π^{\frac{D (D - 1)}{4}} \prod_{d = 1}^{D} Γ (\frac{ν - \frac{2 i t \sqrt{ν}}{b} + 1 - d}{2})}{2^{\frac{ν D}{2}} π^{\frac{D (D - 1)}{4}} \prod_{d = 1}^{D} Γ (\frac{ν + 1 - d}{2})}] + \frac{i t D \sqrt{ν} ln ν}{b} \\ & = & \frac{i t D \sqrt{ν}}{b} ln \frac{ν}{2} + \sum_{d = 1}^{D} ln [\frac{Γ (\frac{ν - \frac{2 i t \sqrt{ν}}{b} + 1 - d}{2})}{Γ (\frac{ν + 1 - d}{2})}] \end{matrix}

(16)

We then use Stirling’s approximation [11] (p. 257)

ln Γ (z) = (z - \frac{1}{2}) ln z - z + \frac{1}{2} ln 2 π + O (\frac{1}{z})

to approximate each term of the sum in the second term of the right-hand side of Equation (16) when ν is large, yielding

\begin{matrix} ln [\frac{Γ (\frac{ν - \frac{2 i t \sqrt{ν}}{b} + 1 - d}{2})}{Γ (\frac{ν + 1 - d}{2})}] & = & \frac{ν - \frac{2 i t \sqrt{ν}}{b} - d}{2} ln (\frac{ν - \frac{2 i t \sqrt{ν}}{b} + 1 - d}{2}) - \frac{ν - \frac{2 i t \sqrt{ν}}{b} + 1 - d}{2} \\ & & - \frac{ν - d}{2} ln (\frac{ν + 1 - d}{2}) + \frac{ν + 1 - d}{2} + O (\frac{1}{\sqrt{ν}}) \\ & = & \frac{ν - \frac{2 i t \sqrt{ν}}{b} - d}{2} [ln \frac{ν}{2} + ln (1 - \frac{2 i t}{b \sqrt{ν}} + \frac{1 - d}{ν})] + \frac{i t \sqrt{ν}}{b} \\ & & - \frac{ν - d}{2} [ln \frac{ν}{2} + ln (1 + \frac{1 - d}{ν})] + O (\frac{1}{\sqrt{ν}}) \\ & = & - \frac{i t \sqrt{ν}}{b} ln \frac{ν}{2} + \frac{i t \sqrt{ν}}{b} \\ & & + \frac{ν - 2 i t \frac{\sqrt{ν}}{b} - d}{2} [- \frac{2 i t}{b \sqrt{ν}} + \frac{1 - d}{ν} + \frac{2 t^{2}}{b^{2} ν} + O (\frac{1}{ν^{3 / 2}})] \\ & & - \frac{ν - d}{2} [\frac{1 - d}{ν} + O (\frac{1}{ν^{3 / 2}})] + O (\frac{1}{\sqrt{ν}}) \\ & = & - \frac{i t \sqrt{ν}}{b} ln \frac{ν}{2} - \frac{t^{2}}{b^{2}} + O (\frac{1}{\sqrt{ν}}) \end{matrix}

We consequently have for the characteristic function of

V_{ν}

\begin{matrix} ln ϕ_{V_{ν}} (t) & = & ln ϕ_{U} (i t \frac{\sqrt{ν}}{b}) + O (\frac{1}{\sqrt{ν}}) \\ = & - \frac{D t^{2}}{b^{2}} + O (\frac{1}{\sqrt{ν}}) \\ = & - \frac{t^{2}}{2} + O (\frac{1}{\sqrt{ν}}) \end{matrix}

As ν tends towards infinity,

ϕ_{V_{ν}} (t)

achieves pointwise convergence toward

e^{- t^{2} / 2}

, which is continuous at

t = 0

. According to Lévy’s continuity theorem,

V_{ν}

therefore converges in distribution to the standard normal distribution,

V_{ν} = \frac{U - \frac{D (D + 1)}{2 ν}}{\sqrt{\frac{2 D}{ν}}} \overset{ν \to \infty}{\sim} N (0, 1)
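The convergence of V_ν can be probed by simulation. The sketch below does not draw Wishart matrices explicitly; instead it relies on a standard property that is not used in the paper, namely that |S| / |Σ| is distributed as a product of D independent chi-square variables with ν, ν − 1, …, ν − D + 1 degrees of freedom (Bartlett decomposition). The function names, the seed and the sample sizes are arbitrary choices for the example.

```python
import math
import random


def sample_u(dim, nu, rng):
    """One draw of U = -ln(|S| / |nu Sigma|) using the chi-square product form."""
    log_det_ratio = 0.0
    for d in range(1, dim + 1):
        # chi-square with k degrees of freedom = Gamma(shape k / 2, scale 2)
        log_det_ratio += math.log(rng.gammavariate(0.5 * (nu + 1 - d), 2.0))
    return dim * math.log(nu) - log_det_ratio


def standardized_samples(dim, nu, count, seed=0):
    """Draws of V_nu as defined in Equation (15)."""
    rng = random.Random(seed)
    a = 0.5 * dim * (dim + 1)
    b = math.sqrt(2.0 * dim)
    scale = b / math.sqrt(nu)
    return [(sample_u(dim, nu, rng) - a / nu) / scale for _ in range(count)]


values = standardized_samples(dim=3, nu=1000, count=20000, seed=1)
sample_mean = sum(values) / len(values)
sample_var = sum((v - sample_mean) ** 2 for v in values) / (len(values) - 1)
```

With ν = 1000 the empirical mean and variance of the draws should be close to 0 and 1, up to Monte Carlo error.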

3. Application to Differential Entropy

We can use the results of the previous section to obtain the exact and asymptotic cumulants of the sample and posterior entropy when the data are multivariate normal.

3.1. Sampling Distribution

The differential entropy

H (X)

of a D-dimensional random variable X that is normally distributed with (known) mean μ and (unknown) covariance matrix Σ is given by Equation (1). Let

{(x_{n})}_{n = 1, \dots, N}

be N independent and identically distributed realizations of X. Let S be the sum of squares, i.e.,

S = \sum_{n = 1}^{N} (x_{n} - μ) {(x_{n} - μ)}^{T}

(17)

S follows a Wishart distribution with

ν = N

degrees of freedom and scale matrix Σ [12] (Th. 7.2.2). Define the sample differential entropy corresponding to the N realizations as

h (S / N)

. Using the fact that

| S / N | / | Σ | = | S | / | N Σ |

, we obtain that

h (S / N) - h (Σ) = - U / 2

, where U was defined in Equation (4). The mean and variance of

h (S / N) = h (Σ) - U / 2

can therefore be expressed as functions of the corresponding central moments of U, i.e.,

μ = κ_{1}

[Equations (7) and (9)] and

σ^{2} = κ_{2}

[Equations (8) and (12)], leading to the following closed form expressions and approximations

\begin{matrix} E [h (S / N) | N, Σ] & = & h (Σ) - \frac{μ}{2} & \\ & = & h (Σ) - \frac{D}{2} ln \frac{N}{2} + \frac{1}{2} \sum_{d = 1}^{D} ψ (\frac{N + 1 - d}{2}) & (18) \\ & = & h (Σ) - \frac{D (D + 1)}{4 N} - \frac{2 D^{3} + 3 D^{2} - D}{24 N^{2}} + O (\frac{1}{N^{3}}) & (19) \end{matrix}

and

\begin{matrix} Var [h (S / N) | N, Σ] & = & \frac{σ^{2}}{4} & \\ & = & \frac{1}{4} \sum_{d = 1}^{D} ψ^{'} (\frac{N + 1 - d}{2}) & (20) \\ & = & \frac{D}{2 N} + \frac{D (D + 1)}{4 N^{2}} + O (\frac{1}{N^{3}}) & (21) \end{matrix}

Furthermore, use of Section 2.3 shows that, given N and Σ,

h (S / N)

is asymptotically normally distributed with mean

h (Σ) - D (D + 1) / (4 N)

and variance

D / 2 N

. If μ is unknown, we replace μ by the sample mean m in Equation (17). S is then still Wishart distributed with scale matrix Σ but

ν = N - 1

degrees of freedom [12] (Cor. 7.2.2). The exact expectation and variance of

h [S / (N - 1)]

are therefore given by Equations (18) and (20), respectively where N is replaced by

N - 1

. Performing asymptotic expansion of this expression leads to

E \{h [S / (N - 1)] | N, Σ\} = h (Σ) - \frac{D (D + 1)}{4 N} - \frac{2 D^{3} + 9 D^{2} + 5 D}{24 N^{2}} + O (\frac{1}{N^{3}})

and

Var \{h [S / (N - 1)] | N, Σ\} = \frac{D}{2 N} + \frac{D (D + 3)}{4 N^{2}} + O (\frac{1}{N^{3}})

Furthermore, since the first-order approximation is the same for

h [S / (N - 1)]

and

h (S / N)

, both quantities have the same asymptotic distribution.
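The two second-order bias expansions of this section (known mean, Equation (19), and unknown mean) can be compared numerically. In the sketch below the function names are hypothetical; the check verifies that substituting N − 1 for N in the known-mean expansion agrees with the re-expanded unknown-mean formula up to terms of order 1/N³.

```python
def entropy_bias_known_mean(dim, n_samples):
    """Second-order approximation of E[h(S/N)] - h(Sigma), Equation (19)."""
    n = float(n_samples)
    return -dim * (dim + 1) / (4.0 * n) - (2 * dim ** 3 + 3 * dim ** 2 - dim) / (24.0 * n ** 2)


def entropy_bias_unknown_mean(dim, n_samples):
    """Second-order approximation of E{h[S/(N-1)]} - h(Sigma), expanded in 1/N."""
    n = float(n_samples)
    return -dim * (dim + 1) / (4.0 * n) - (2 * dim ** 3 + 9 * dim ** 2 + 5 * dim) / (24.0 * n ** 2)


# D = 3, N = 1000: the known-mean bias is -12/4000 - 78/(24 * 10**6).
bias_known = entropy_bias_known_mean(3, 1000)
# Using nu = N - 1 = 999 in Equation (19) should match the unknown-mean
# expansion at N = 1000 up to O(1/N^3).
gap = abs(entropy_bias_known_mean(3, 999) - entropy_bias_unknown_mean(3, 1000))
```

Both expressions share the same leading term, which is why the two estimators have the same asymptotic distribution.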

3.2. Posterior Distribution

With the same assumptions as above, and assuming a non-informative Jeffreys prior for Σ, i.e.,

P (Σ) \propto {| Σ |}^{- \frac{D + 1}{2}}

the posterior distribution for Σ given the N realizations of X is inverse Wishart with

n = N - 1

degrees of freedom and scale matrix

S^{- 1}

[13]. This implies that

Υ = Σ^{- 1}

, the concentration matrix, is Wishart distributed with n degrees of freedom and scale matrix

S^{- 1}

. Results of Section 3.1 therefore apply to

h (Υ / n) - h (S^{- 1})

. But, since for any matrix A,

ln | A^{- 1} | = ln {| A |}^{- 1} = - ln | A |

, we have that

h (Υ / n) - h (S^{- 1})

is equal to

h (S) - h (n Σ)

or, equivalently, to

h (S / n) - h (Σ)

. As a consequence,

\begin{matrix} E [h (Σ) | N, S] & = & h (S / n) + \frac{D}{2} ln \frac{n}{2} - \frac{1}{2} \sum_{d = 1}^{D} ψ (\frac{N - d}{2}) & (22) \\ & = & h (S / n) + \frac{D (D + 1)}{4 N} + \frac{2 D^{3} + 9 D^{2} + 5 D}{24 N^{2}} + O (\frac{1}{N^{3}}) & (23) \end{matrix}

and

\begin{matrix} Var [h (Σ) | N, S] & = & \frac{1}{4} \sum_{d = 1}^{D} ψ^{'} (\frac{N - d}{2}) \\ & = & \frac{D}{2 N} + \frac{D (D + 3)}{4 N^{2}} + O (\frac{1}{N^{3}}) \end{matrix}

Also,

h (Σ)

is asymptotically normally distributed with mean

h (S / n) + D (D + 1) / (4 N)

and variance

D / 2 N

.
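In practice, the posterior approximations above turn a plug-in entropy value into a posterior mean, a posterior variance and, through the asymptotic normality, an approximate credible interval. The sketch below assumes that `plugin_entropy` already holds h(S/n) with n = N − 1; the function name and the 1.96 quantile used for a 95% interval are choices made for the example.

```python
import math


def posterior_entropy_summary(plugin_entropy, dim, n_samples):
    """Second-order posterior mean and variance of h(Sigma), plus a normal 95% interval."""
    n = float(n_samples)
    mean = (
        plugin_entropy
        + dim * (dim + 1) / (4.0 * n)
        + (2 * dim ** 3 + 9 * dim ** 2 + 5 * dim) / (24.0 * n ** 2)
    )
    variance = dim / (2.0 * n) + dim * (dim + 3) / (4.0 * n ** 2)
    half_width = 1.96 * math.sqrt(variance)
    return mean, variance, (mean - half_width, mean + half_width)


# D = 2, N = 100, h(S/n) = 1.0 nat:
#   mean     = 1 + 6/400 + 62/240000
#   variance = 2/200 + 10/40000
post_mean, post_var, interval = posterior_entropy_summary(1.0, 2, 100)
```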

4. Application to Mutual Information and Multiinformation

Similar results can also be derived about the first cumulant of mutual information and multiinformation, its generalization to more than two variables. The mutual information between two sets of variables

X_{1}

(of dimension

D_{1}

) and

X_{2}

(of dimension

D_{2}

) is defined as

I (X_{1}, X_{2}) = H (X_{1}) + H (X_{2}) - H (X_{1}, X_{2})

For multivariate normal variables, we have

I (X_{1}, X_{2}) = i (Σ) = h (Σ_{1}) + h (Σ_{2}) - h (Σ)

(24)

where

Σ_{1}

and

Σ_{2}

are the two block diagonal elements of Σ and where h was defined in Equation (1).

4.1. Sampling Mean

Define the sample mutual information as

i (S / N)

. Using Equation (24), direct calculation shows that we have

E [i (S / N) | N, Σ] = E [h (S_{1} / N) | N, Σ] + E [h (S_{2} / N) | N, Σ] - E [h (S / N) | N, Σ]

An asymptotic approximation for

E [h (S / N) | N, Σ]

can be obtained by direct use of Equation (19). For

S_{1}

and

S_{2}

, we proceed as follows. If S is Wishart distributed with N degrees of freedom and scale matrix Σ, then

S_{j}

(

j \in {1, 2}

) is also Wishart distributed with N degrees of freedom and scale matrix

Σ_{j}

[12] (Th. 7.3.4). Equation (19) can therefore be applied to matrix

S_{j}

with the proper scale matrix, yielding

\begin{matrix} E [h (S_{j} / N) | N, Σ] & = & E [h (S_{j} / N) | N, Σ_{j}] \\ = & h (Σ_{j}) - \frac{D_{j} (D_{j} + 1)}{4 N} - \frac{2 D_{j}^{3} + 3 D_{j}^{2} - D_{j}}{24 N^{2}} + O (\frac{1}{N^{3}}) \end{matrix}

E [i (S / N) | N, Σ]

consequently reads

E [i (S / N) | N, Σ] = i (Σ) + \frac{D_{1} D_{2}}{2 N} [1 + \frac{D_{1} + D_{2} + 1}{2 N}] + O (\frac{1}{N^{3}})

A similar result can be obtained for the generalization of i to K sets of variables

X_{k}

(of size

D_{k}

) as a measure called total correlation [14], multivariate constraint [15], δ [16], or multiinformation [17]. In that case, we have

E [i (S / N) | N, Σ] = i (Σ) + \frac{\sum_{i < j} D_{i} D_{j}}{2 N} + \frac{\sum_{i < j} D_{i} D_{j} (D_{i} + D_{j} + 1) + 2 \sum_{i < j < k} D_{i} D_{j} D_{k}}{4 N^{2}} + O (\frac{1}{N^{3}})

and, in the particular case where each

X_{k}

is one-dimensional (i.e.,

D_{k} = 1

),

E [i (S / N) | N, Σ] = i (Σ) + \frac{D (D - 1)}{4 N} + \frac{2 D^{3} + 3 D^{2} - 5 D}{24 N^{2}} + O (\frac{1}{N^{3}})
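The bias coefficients of this section follow mechanically from Equation (19) applied to each block and to the full matrix. The sketch below performs that bookkeeping in exact rational arithmetic; the function names are hypothetical, and the two checks correspond to the two-block formula and to the one-dimensional special case given above.

```python
from fractions import Fraction


def entropy_bias_terms(dim):
    """Coefficients (c1, c2) with E[h(S/N)] - h(Sigma) ~ -c1/N - c2/N^2, Equation (19)."""
    c1 = Fraction(dim * (dim + 1), 4)
    c2 = Fraction(2 * dim ** 3 + 3 * dim ** 2 - dim, 24)
    return c1, c2


def multiinformation_bias_terms(block_dims):
    """Coefficients (b1, b2) with E[i(S/N)] - i(Sigma) ~ b1/N + b2/N^2."""
    c1_full, c2_full = entropy_bias_terms(sum(block_dims))
    b1 = c1_full - sum(entropy_bias_terms(d)[0] for d in block_dims)
    b2 = c2_full - sum(entropy_bias_terms(d)[1] for d in block_dims)
    return b1, b2


# Two blocks of sizes D1 = 2 and D2 = 3:
#   b1 = D1 D2 / 2 = 3 and b2 = D1 D2 (D1 + D2 + 1) / 4 = 9.
two_blocks = multiinformation_bias_terms([2, 3])
# Four one-dimensional variables (D = 4):
#   b1 = D (D - 1) / 4 = 3 and b2 = (2 D^3 + 3 D^2 - 5 D) / 24 = 13/2.
scalars = multiinformation_bias_terms([1, 1, 1, 1])
```

Both coefficients are positive in these examples, in line with the positive bias of the sample mutual information.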

4.2. Posterior Mean

A similar argument can be applied to the Bayesian posterior mean of i. Using Equation (24) again, we have

E [i (Σ) | N, S] = E [h (Σ_{1}) | N, S] + E [h (Σ_{2}) | N, S] - E [h (Σ) | N, S]

An asymptotic approximation for

E [h (Σ) | N, S]

can be obtained by direct use of Equation (23). Now, if Σ is inverse Wishart distributed with n degrees of freedom and scale matrix S, then

Σ_{j}

(

j \in {1, 2}

) is also inverse Wishart distributed with

n - D_{k}

(

k \in {1, 2}

,

k \neq j

) degrees of freedom and scale matrix

S_{j}

[18]. Application of Equation (23) with the proper degrees of freedom and scale matrix leads to

\begin{matrix} E [h (Σ_{j}) | N, S] & = & h [S_{j} / (n - D_{k})] + \frac{D_{j} (D_{j} + 1)}{4 (N - D_{k})} + O (\frac{1}{N^{2}}) \\ = & h (S_{j} / n) - \frac{D_{j}}{2} ln (1 - \frac{D_{k}}{N}) + \frac{D_{j} (D_{j} + 1)}{4 N} + O (\frac{1}{N^{2}}) \\ = & h [S_{j} / n] + \frac{D_{1} D_{2}}{2 N} + \frac{D_{j} (D_{j} + 1)}{4 N} + O (\frac{1}{N^{2}}) \end{matrix}

where we only retained the expansion terms of order up to

1 / N

for the sake of simplicity.

E [i (Σ) | N, S]

consequently reads

E [i (Σ) | N, S] = i (S / n) + \frac{D_{1} D_{2}}{2 N} + O (\frac{1}{N^{2}})

For posterior multiinformation, we have

E [i (Σ) | N, S] = i (S / n) + \frac{\sum_{i < j} D_{i} D_{j}}{2 N} + O (\frac{1}{N^{2}})

and, in the particular case where each

X_{k}

is one-dimensional (i.e.,

D_{k} = 1

),

E [i (Σ) | N, S] = i (S / n) + \frac{D (D - 1)}{4 N} + O (\frac{1}{N^{2}})

5. Simulation Study

We conducted the following computations for

D \in {2, 5, 10}

. To assess the accuracy of the asymptotic expansion of the cumulants of sample entropy, we calculated the error made by the asymptotic approximations of the first two moments (i.e., the mean and variance of the distribution) compared to the exact values, as a function of ν. By way of comparison, we computed the same quantities for 500 different homogeneous positive definite matrices Σ (i.e., with all off-diagonal elements equal to the same value ρ, generated uniformly); for each value of Σ and ν, we generated 1,000 samples from

S \sim Wishart (ν, Σ)

, computed the corresponding values of sample entropy, and approximated the moments by the corresponding sampling moments. The results are reported in Figure 1.
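A reduced version of this experiment can be sketched with the standard library alone. Note that this is not the authors' simulation code: rather than drawing Wishart matrices for many values of Σ, it uses the fact (not stated in the paper) that |S| / |Σ| is distributed as a product of independent chi-square variables, so that the error h(S/ν) − h(Σ) can be sampled without specifying Σ. Names, seed and sample size are arbitrary.

```python
import math
import random


def sample_entropy_error(dim, nu, rng):
    """One draw of h(S/nu) - h(Sigma) = (ln|S| - ln|Sigma| - D ln nu) / 2."""
    log_det_ratio = sum(
        math.log(rng.gammavariate(0.5 * (nu + 1 - d), 2.0)) for d in range(1, dim + 1)
    )
    return 0.5 * (log_det_ratio - dim * math.log(nu))


def monte_carlo_moments(dim, nu, draws=1000, seed=0):
    """Empirical mean and variance of the sample entropy error."""
    rng = random.Random(seed)
    errors = [sample_entropy_error(dim, nu, rng) for _ in range(draws)]
    mean = sum(errors) / draws
    variance = sum((e - mean) ** 2 for e in errors) / (draws - 1)
    return mean, variance


def second_order_moments(dim, nu):
    """Second-order approximations from Equations (19) and (21)."""
    mean = -dim * (dim + 1) / (4.0 * nu) - (2 * dim ** 3 + 3 * dim ** 2 - dim) / (24.0 * nu ** 2)
    variance = dim / (2.0 * nu) + dim * (dim + 1) / (4.0 * nu ** 2)
    return mean, variance


mc_mean, mc_var = monte_carlo_moments(dim=5, nu=50, draws=5000, seed=1)
approx_mean, approx_var = second_order_moments(dim=5, nu=50)
```

Because the sampled quantity does not depend on Σ, a single run per pair (D, ν) is enough for this kind of check.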

6. Discussion

In this work, we calculated both the moments of

| S | / | ν Σ |

and the cumulant-generating function of

U = - ln (| S | / | ν Σ |)

when S is Wishart distributed with ν degrees of freedom and scale matrix Σ. From there, we provided an asymptotic approximation of the first four central moments of U. We also proved that U is asymptotically normally distributed. We then demonstrated the quality of the normal approximation compared to simulations. We finally applied these results to the multivariate normal distribution to provide asymptotic approximations of the sample and posterior distributions of differential entropy, as well as an asymptotic approximation of the sample and posterior mean of multiinformation.

Interestingly, the moments of

| S | / | ν Σ |

and, as a consequence, the cumulant-generating function of U depend on the distribution of S only through the matrix dimension D and the degrees of freedom ν, but not through Σ. This means that the exact distribution of U is also independent of Σ and could be tabulated as a function of the two integer parameters D and ν.

As mentioned in the introduction, the sample differential entropy obtained from Equation (1) is the plug-in estimator of differential entropy. For multivariate normal samples, the present work quantifies the well-known negative bias of this estimator [7]. Equation (18) confirms that, to correct for this bias, one must take the uniformly minimum variance unbiased (UMVU) estimator [1].

The posterior derivation that we presented here is a particular case of the Bayesian posterior estimate obtained by [3] with, in our case, the prior distribution for Σ taken as Jeffreys prior (i.e.,

q = - 1

and

B = 0

with their notations). While the same analysis as in [3] could have been performed, it would essentially lead to the same result, since we only consider the asymptotic case, where the sample is large and the prior distribution is supposed to have very little influence—provided that it does not contradict the data. The present study also shows an interesting feature of Bayesian estimation with respect to the above-mentioned negative bias. As the sample differential entropy tends to underestimate

H (Σ)

by a factor of

μ / 2

, if one takes the posterior mean as the Bayesian estimate of

H (Σ)

, then the negative bias is corrected by the opposite factor.

Figure 1. Error on the mean (top row) and variance (bottom row) of sample entropy for various values of D and ν when using the first-order approximation (circles), the second-order approximation (squares), or the sampling scheme (diamonds). The error was calculated as the absolute value of the difference between the approximation and the true value. For the sampling scheme are represented the median as well as the symmetrical 90% probability interval of the error. Scale on y axis is logarithmic.

We were also able to obtain an asymptotic approximation of the sampling and posterior expectations of mutual information and multiinformation. Contrary to the general argument developed by [7], we proved that, for multivariate normal distributions, the negative bias for differential entropy does entail a positive bias for mutual information. This result is in agreement with the fact that, under the null hypothesis of Σ diagonal matrix, corresponding to

i (Σ) = 0

,

2 ν i (S / ν)

is asymptotically chi-square distributed with

\sum_{i < j} D_{i} D_{j}

degrees of freedom, so that

ν i (S / ν)

has an expectation asymptotically equal to

\sum_{i < j} D_{i} D_{j} / 2

[19] (pp. 306–307). Surprisingly, and unlike what was said for entropy, the positive bias of the sample multiinformation was not corrected by the Bayesian approach. A naive correction of minus the positive bias could lead to negative values, which is impossible by construction of multiinformation. Note that, using the present results alone, we were not able to obtain an asymptotic approximation for the variance of the same measures.

In the present paper, we used loose versions of the inequalities proposed in [20] to prove the monotonicity and sign of the cumulants of U (see Section 2.1 and Appendix). Note that, using the same inequalities, it seems that it would also be possible to obtain lower and upper bounds for these quantities, instead of asymptotic approximations. These bounds would be useful complements to the approximations provided in the present manuscript.

Acknowledgements

The authors are grateful to Pierre Bellec for helpful discussions.

References

1. Ahmed, N.A.; Gokhale, D.V. Entropy expressions and their estimators for multivariate distributions. IEEE Trans. Inform. Theory 1989, 35, 688–692.
2. Misra, N.; Singh, H.; Demchuk, E. Estimation of the entropy of a multivariate normal distribution. J. Multivariate Anal. 2005, 92, 324–342.
3. Gupta, M.; Srivastava, S. Parametric Bayesian estimation of differential entropy and relative entropy. Entropy 2010, 12, 818–843.
4. Beirlant, J.; Dudewicz, E.J.; Györfi, L.; van der Meulen, E.C. Nonparametric entropy estimation: An overview. Int. J. Math. Statist. Sci. 1997, 6, 17–39.
5. Strong, S.P.; Koberle, R.; de Ruyter van Steveninck, R.R.; Bialek, W. Entropy and information in neural spike trains. Phys. Rev. Lett. 1998, 80, 197–200.
6. Antos, A.; Kontoyiannis, I. Convergence properties of functional estimates for discrete distributions. Random Struct. Algor. 2001, 19, 163–193.
7. Paninski, L. Estimation of entropy and mutual information. Neural Comput. 2003, 15, 1191–1253.
8. Wolpert, D.H.; Wolf, D.R. Estimating functions of probability distributions from a finite set of samples. Phys. Rev. E 1995, 52, 6841–6854.
9. Wolpert, D.H.; Wolf, D.R. Erratum: Estimating functions of probability distributions from a finite set of samples. Phys. Rev. E 1996, 54, 6973.
10. Anderson, T.W. An Introduction to Multivariate Statistical Analysis; John Wiley and Sons: New York, NY, USA, 1958.
11. Abramowitz, M.; Stegun, I.A. Handbook of Mathematical Functions; Applied Mathematics Series 55; National Bureau of Standards: Washington, DC, USA, 1972.
12. Anderson, T.W. An Introduction to Multivariate Statistical Analysis, 3rd ed.; Series in Probability and Mathematical Statistics; John Wiley and Sons: New York, NY, USA, 2003.
13. Gelman, A.; Carlin, J.B.; Stern, H.S.; Rubin, D.B. Bayesian Data Analysis; Texts in Statistical Science; Chapman & Hall: London, UK, 1998.
14. Watanabe, S. Information theoretical analysis of multivariate correlation. IBM J. Res. Dev. 1960, 4, 66–82.
15. Garner, W.R. Uncertainty and Structure as Psychological Concepts; John Wiley & Sons: New York, NY, USA, 1962.
16. Joe, H. Relative entropy measures of multivariate dependence. J. Am. Statist. Assoc. 1989, 84, 157–164.
17. Studený, M.; Vejnarová, J. The multiinformation function as a tool for measuring stochastic dependence. In Proceedings of the NATO Advanced Study Institute on Learning in Graphical Models; Jordan, M.I., Ed.; MIT Press: Cambridge, MA, USA, 1998; pp. 261–298.
18. Press, S.J. Applied Multivariate Analysis. Using Bayesian and Frequentist Methods of Inference, 2nd ed.; Dover: Mineola, NY, USA, 2005.
19. Kullback, S. Information Theory and Statistics; Dover: Mineola, NY, USA, 1968.
20. Chen, C.P. Inequalities for the polygamma functions with application. Gener. Math. 2005, 13, 65–72.

Appendix

Results Regarding the Cumulants

The proofs differ for

κ_{1}

and

κ_{n}

,

n \geq 2

.

1. Results for $κ_{1}$

For

ν \geq D > 0

, set

f_{D} (ν) = κ_{1}

as defined in Equation (7).

Result 1:

f_{D} (ν)

is a decreasing function of ν. Derivation of

f_{D} (ν)

with respect to ν leads to

f_{D}^{'} (ν) = \sum_{d = 1}^{D} [\frac{1}{ν} - \frac{1}{2} ψ^{'} (\frac{ν + 1 - d}{2})]

(25)

We use the following inequality [20]

ψ^{'} (x) > \frac{1}{x} + \frac{1}{2 x^{2}}

This implies that

\frac{1}{ν} - \frac{1}{2} ψ^{'} (\frac{ν + 1 - d}{2}) < \frac{1}{ν} - \frac{1}{ν + 1 - d} - \frac{1}{{(ν + 1 - d)}^{2}}

For

1 \leq d \leq ν

, we have

1 / ν \leq 1 / (ν + 1 - d)

. Consequently, each term in the sum of Equation (25) is strictly negative, and so is

f_{D}^{'} (ν)

.

f_{D} (ν)

is therefore a strictly decreasing function of ν.

Result 2: $f_{D} (ν)$ is an increasing function of D. With the convention $f_{0} (ν) = 0$, we have, for $D \geq 1$,

f_{D} (ν) = f_{D - 1} (ν) + ln \frac{ν}{2} - ψ (\frac{ν + 1 - D}{2})

Using the following inequality [20]

ψ (u) < ln u - \frac{1}{2 u} < ln u

we obtain that

ψ (\frac{ν + 1 - D}{2}) < ln (\frac{ν + 1 - D}{2})

leading to

ln \frac{ν}{2} - ψ (\frac{ν + 1 - D}{2}) > - ln (1 + \frac{1 - D}{ν})

Since

ln (1 + x) \leq x

, we have

ln \frac{ν}{2} - ψ (\frac{ν + 1 - D}{2}) > \frac{D - 1}{ν}

and, therefore,

f_{D} (ν) > f_{D - 1} (ν)

.

Result 3: $f_{D} (ν)$ is positive.

f_{D} (ν)

is the sum of terms that are strictly positive (cf previous paragraph); it is thus strictly positive.

Result 4: $f_{D} (ν)$ tends to infinity as D increases. From the proof of Result 2, we have

f_{D} (ν) > \sum_{d = 1}^{D} \frac{d - 1}{ν} = \frac{D (D - 1)}{2 ν}

which tends to infinity when D tends to infinity.

Result 5: $f_{D} (ν)$ tends to 0 as ν increases. We use the following inequality [20]

ln x - \frac{1}{2 x} - \frac{1}{12 x^{2}} < ψ (x)

This implies that

ψ (\frac{ν + 1 - d}{2}) > ln (\frac{ν + 1 - d}{2}) - \frac{1}{ν + 1 - d} - \frac{1}{3 {(ν + 1 - d)}^{2}}

leading to

ln \frac{ν}{2} - ψ (\frac{ν + 1 - d}{2}) < - ln (1 + \frac{1 - d}{ν}) + \frac{1}{ν + 1 - d} + \frac{1}{3 {(ν + 1 - d)}^{2}}

Since

x - x^{2} / 2 < ln (1 + x)

, we have

ln \frac{ν}{2} - ψ (\frac{ν + 1 - d}{2}) < \frac{d - 1}{ν} + \frac{1}{ν + 1 - d} + \frac{1}{3 {(ν + 1 - d)}^{2}} < \frac{d - 1}{ν} + \frac{1}{ν - (D - 1)} + \frac{1}{3 {[ν - (D - 1)]}^{2}}

Summing over d yields

f_{D} (ν) < \frac{D (D - 1)}{2 ν} + \frac{D}{ν - (D - 1)} + \frac{D}{3 {[ν - (D - 1)]}^{2}}

which tends to 0 when ν increases.
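Results 1 to 5 can also be checked numerically. The sketch below assumes that $κ_{1}$ has the explicit form $\sum_{d = 1}^{D} [ln \frac{ν}{2} - ψ (\frac{ν + 1 - d}{2})]$, which is the form whose derivative is Equation (25); Equation (7) itself is not reproduced in this appendix. The helper names `digamma`, `f`, and `check_results` are illustrative only, and the digamma routine is a simple standard-library approximation (recurrence plus asymptotic series), not a reference implementation.

```python
import math


def digamma(x):
    """Approximate psi(x) for x > 0 (recurrence plus asymptotic series)."""
    if x <= 0:
        raise ValueError("x must be positive")
    shift = 0.0
    # psi(x) = psi(x + 1) - 1/x: move x upward until the series is accurate.
    while x < 6.0:
        shift -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # 1/(12x^2) - 1/(120x^4) + 1/(252x^6) - 1/(240x^8)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 * (1.0 / 252 - inv2 / 240)))
    return shift + math.log(x) - 0.5 / x - series


def f(D, nu):
    """Assumed form of kappa_1: sum over d of ln(nu/2) - psi((nu + 1 - d)/2)."""
    return sum(math.log(nu / 2) - digamma((nu + 1 - d) / 2) for d in range(1, D + 1))


def check_results(D=3, nus=range(3, 60)):
    """Spot-check Results 1 to 5 on a small grid (a sanity check, not a proof)."""
    values = [f(D, nu) for nu in nus]
    return {
        # Result 1: f_D decreases as nu increases.
        "decreasing_in_nu": all(a > b for a, b in zip(values, values[1:])),
        # Result 2: f_D increases with D at fixed nu.
        "increasing_in_D": all(f(k, 60) < f(k + 1, 60) for k in range(1, 10)),
        # Result 3: f_D is positive.
        "positive": all(v > 0 for v in values),
        # Lower bound used in Result 4: f_D > D (D - 1) / (2 nu).
        "above_lower_bound": all(
            v > D * (D - 1) / (2 * nu) for v, nu in zip(values, nus)
        ),
        # Result 5: f_D becomes small for large nu.
        "vanishes": f(D, 1e6) < 1e-5,
    }


if __name__ == "__main__":
    for name, ok in check_results().items():
        print(name, ok)
```

Such a grid check only illustrates the monotonicity and limit statements for a few values of D and ν; the inequalities from [20] remain what establishes them in general.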

2. Results for $κ_{n}$ , $n \geq 2$

Define $f_{D} (ν) = κ_{n}$ as in Equation (6). The function ${(- 1)}^{n + 1} ψ^{(n)}$ is completely monotonic. As a consequence, $κ_{n}$ is a decreasing function of ν. We also use the following inequality [20]

\frac{(n - 1)!}{x^{n}} < {(- 1)}^{n + 1} ψ^{(n)} (x) < \frac{(n - 1)!}{x^{n}} + \frac{n!}{2 x^{n + 1}} + \frac{B_{2} Γ (n + 2)}{2 x^{n + 2}}

This implies that ${(- 1)}^{n + 1} ψ^{(n)} (x)$ is strictly positive and, as a consequence, that $κ_{n}$ is an increasing function of D. It also implies that $κ_{n}$ tends to 0 as ν tends to infinity.
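The double inequality from [20] can be illustrated for $n = 1$, where ${(- 1)}^{n + 1} ψ^{(n)}$ is the trigamma function $ψ^{'}$ and the upper bound reduces to $\frac{1}{x} + \frac{1}{2 x^{2}} + \frac{1}{6 x^{3}}$ (using $B_{2} = 1 / 6$ and $Γ (3) = 2$). The sketch below is a numerical spot check only; `trigamma`, `bounds_n1`, and `check_bounds` are hypothetical helper names, and the trigamma routine is a standard-library approximation based on the recurrence $ψ^{'} (x) = ψ^{'} (x + 1) + 1 / x^{2}$ and the usual asymptotic series.

```python
import math


def trigamma(x):
    """Approximate psi'(x) for x > 0 (recurrence plus asymptotic series)."""
    if x <= 0:
        raise ValueError("x must be positive")
    total = 0.0
    # psi'(x) = psi'(x + 1) + 1/x**2: move x upward until the series is accurate.
    while x < 10.0:
        total += 1.0 / (x * x)
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # 1/x + 1/(2x^2) + 1/(6x^3) - 1/(30x^5) + 1/(42x^7) - 1/(30x^9)
    tail = inv * inv2 * (1.0 / 6 - inv2 * (1.0 / 30 - inv2 * (1.0 / 42 - inv2 / 30)))
    return total + inv + 0.5 * inv2 + tail


def bounds_n1(x):
    """Lower and upper bounds of the double inequality for n = 1."""
    b2 = 1.0 / 6  # second Bernoulli number
    lower = 1.0 / x
    upper = 1.0 / x + 1.0 / (2 * x**2) + b2 * math.gamma(3) / (2 * x**3)
    return lower, upper


def check_bounds(xs=(0.5, 1.0, 2.0, 5.0, 10.0, 50.0)):
    """Return True if lower < psi'(x) < upper at every grid point."""
    results = []
    for x in xs:
        lower, upper = bounds_n1(x)
        results.append(lower < trigamma(x) < upper)
    return all(results)


if __name__ == "__main__":
    print(check_bounds())
```

Both bounds behave like $1 / x$ for large x, which is the mechanism behind the statement that the corresponding cumulant vanishes as ν tends to infinity.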

© 2011 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

