Parametric Bayesian Estimation of Differential Entropy and Relative Entropy

Given iid samples drawn from a distribution with known parametric form, we propose the minimization of expected Bregman divergence to form Bayesian estimates of differential entropy and relative entropy, and derive such estimators for the uniform, Gaussian, Wishart, and inverse Wishart distributions. Additionally, formulas are given for a log gamma Bregman divergence and the differential entropy and relative entropy for the Wishart and inverse Wishart. The results, as always with Bayesian estimates, depend on the accuracy of the prior parameters, but example simulations show that the performance can be substantially improved compared to maximum likelihood or state-of-the-art nonparametric estimators.


Introduction
Estimating differential entropy and relative entropy is useful in many applications of coding, machine learning, signal processing, communications, chemistry, and physics. For example, relative entropy between maximum likelihood-fit Gaussians has been used for biometric identification [1], differential entropy estimates have been used for analyzing sensor locations [2], and mutual information estimates have been used in the study of EEG signals [3].
In this paper we present Bayesian estimates for differential entropy and relative entropy that are optimal in the sense of minimizing expected Bregman divergence between the estimate and the uncertain true distribution. We illustrate techniques that may be used for a wide range of parametric distributions, specifically deriving estimates for the uniform (a non-exponential example), Gaussian (perhaps the most popular distribution), and the Wishart and inverse Wishart (the most commonly used distributions for positive definite matrices).
Bayesian estimates for differential entropy and relative entropy have previously been derived for the Gaussian [4], but our estimates differ in that we take a distribution-based approach, and we use a prior that results in numerically stable estimates even when the number of samples is smaller than the dimension of the data. Performance of the presented estimates will depend on how well the user is able to choose the prior distribution's parameters, and we do not attempt a rigorous experimental study here. However, we do present simulated results for the uniform distribution (where no prior is needed) that show that our approach to forming these estimates can result in significant performance improvements over maximum likelihood estimates and over the state-of-the-art nearest-neighbor nonparametric estimates [5].
First we define notation that will be used throughout the paper. In Section II we review related work in estimating differential entropy and relative entropy. In Section III we show that the proposed Bayesian estimates are optimal in the sense of minimizing expected Bregman divergence loss. In the remaining sections, we present differential entropy and relative entropy estimates for the uniform, Gaussian, Wishart and inverse Wishart distributions given iid samples drawn from the underlying distributions.
All proofs and derivations are in the Appendix.

Notation and Background
If P and Q were the known parametric distributions of two random variables with respective densities p and q, then the differential entropy of P is

$$h(P) = -\int_x p(x) \ln p(x)\, dx$$

and the relative entropy between P and Q is

$$\mathrm{KL}(P\|Q) = \int_x p(x) \ln \frac{p(x)}{q(x)}\, dx.$$

For estimating differential entropy, we assume that one has drawn iid samples $\{x_1, x_2, \ldots, x_n\}$ from distribution P where $x_i \in \mathbb{R}^d$ is a $d \times 1$ vector, and the samples have mean $\bar{x}$ and scaled sample covariance $S = \sum_{j=1}^{n} (x_j - \bar{x})(x_j - \bar{x})^T$. The notation $x_j[i]$ will be used to refer to the value of the ith component of vector $x_j$.
In some places, we treat variables such as the covariance $\Sigma$ as random, and we consistently denote realizations of random variables with a tilde, e.g., $\tilde{\Sigma}$. Expectations are always taken with respect to the posterior distribution unless otherwise noted. The digamma function is denoted by $\psi(z) = \frac{d}{dz} \ln \Gamma(z)$, where $\Gamma$ is the standard gamma function; and $\Gamma_d$ denotes the standard multi-dimensional gamma function.
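Let W be distributed according to a Wishart distribution with scalar degree of freedom parameter $q \geq d$ and positive definite matrix parameter $\Sigma \in \mathbb{R}^{d \times d}$, so that

$$p(W = \tilde{W}) = \frac{|\tilde{W}|^{\frac{q-d-1}{2}} \exp\left(-\frac{1}{2}\mathrm{tr}(\tilde{W}\Sigma^{-1})\right)}{2^{\frac{qd}{2}}\, \Gamma_d\left(\frac{q}{2}\right) |\Sigma|^{\frac{q}{2}}}. \tag{1}$$

Let V be distributed according to an inverse Wishart distribution with scalar degree of freedom parameter $q \geq d$ and positive definite matrix parameter $\Sigma \in \mathbb{R}^{d \times d}$, so that

$$p(V = \tilde{V}) = \frac{|\Sigma|^{\frac{q}{2}} \exp\left(-\frac{1}{2}\mathrm{tr}(\tilde{V}^{-1}\Sigma)\right)}{2^{\frac{qd}{2}}\, \Gamma_d\left(\frac{q}{2}\right) |\tilde{V}|^{\frac{q+d+1}{2}}}. \tag{2}$$

Note that $V^{-1}$ is then distributed as a Wishart random matrix with parameters q and $\Sigma^{-1}$.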

Related Work
First we review related work in parametric differential entropy estimation, then in nonparametric differential entropy estimation, and then in estimating relative entropy.

Prior Work on Parametric Differential Entropy Estimation
A common approach to estimating differential entropy (and relative entropy) is to find the maximum likelihood estimate of the parameters and then substitute it into the differential entropy formula. For example, for the multivariate Gaussian distribution, the maximum likelihood differential entropy estimate of a d-dimensional random vector X drawn from the Gaussian $N(\mu, \Sigma)$ is

$$\hat{h}_{ML} = \frac{d}{2} + \frac{d \ln(2\pi)}{2} + \frac{\ln|\hat{\Sigma}_{ML}|}{2}, \tag{3}$$

where $\hat{\Sigma}_{ML} = S/n$. Similarly, if samples $\{x_i\}$ are drawn iid from a one-dimensional uniform distribution, the maximum likelihood differential entropy estimate is $\hat{h}_{ML} = \ln(\max_i x_i - \min_i x_i)$, which will always be an under-estimate of the true differential entropy.
Ahmed and Gokhale investigated uniformly minimum variance unbiased (UMVU) differential entropy estimators for parametric distributions [6]. In stating the UMVU differential entropy estimate for the Gaussian, however, they treated the random sample covariance of n iid Gaussian samples as if it were drawn from a Wishart with n degrees of freedom, when in fact it is drawn from a Wishart with n − 1 degrees of freedom, and thus the UMVU estimator they derived should be stated:

$$\hat{h}_{UMVU} = \frac{d}{2} + \frac{d \ln \pi}{2} + \frac{\ln|S|}{2} - \frac{1}{2}\sum_{i=1}^{d} \psi\left(\frac{n-i}{2}\right). \tag{4}$$

Bayesian differential entropy estimation was first proposed for the multivariate normal in 2005 by Misra et al. [4]. They formed an estimate of the multivariate normal differential entropy by substituting $\widehat{\ln|\Sigma|}$ for $\ln|\Sigma|$ in the differential entropy formula for the Gaussian, where their $\widehat{\ln|\Sigma|}$ is chosen to minimize the expected squared difference of the differential entropy estimate. They also considered different priors with support over the set of positive definite matrices, but found only modest improvement over (4), and thus they recommend using the computationally much simpler (4) for applications.
There are two practical problems with the previously proposed parametric differential entropy estimators. First, the estimates given by (3), (4), and the other estimators investigated by Misra et al. require calculating the determinant of $S$ or $S + \bar{x}\bar{x}^T$, which is problematic if n < d. Second, the estimate (4) uses the digamma function of n − d, which requires n > d samples so that the digamma has a positive argument. Thus, although the knowledge that one is estimating the differential entropy of a Gaussian should be of use, for the n ≤ d case one must currently turn to nonparametric differential entropy estimators.

Prior Work on Nonparametric Differential Entropy Estimation
Nonparametric differential entropy estimation up to 1997 has been thoroughly reviewed by Beirlant et al. [7], including density estimation approaches, sample-spacing approaches, and nearest-neighbor estimators. Recently, Nilsson and Kleijn showed that high-rate quantization approximations of Zador and Gray can be used to estimate Renyi entropy, and that the limiting case of Shannon entropy produces a nearest-neighbor estimate that depends on the number of quantization cells [8]. The special case that best validates the high-rate quantization assumptions is when the number of quantization cells is as large as possible, and they show that this special case produces the nearest-neighbor differential entropy estimator originally proposed by Kozachenko and Leonenko in 1987 [9]:

$$\hat{h}_{NN} = \frac{d}{n}\sum_{j=1}^{n} \ln \rho(j) + \ln(n-1) + \gamma + \ln V_d, \tag{6}$$

where $\rho(j) = \min_{k \neq j} \|x_j - x_k\|_2$ is the distance from $x_j$ to its nearest neighbor, $\gamma$ is the Euler-Mascheroni constant, and $V_d$ is the volume of the d-dimensional hypersphere with radius 1: $V_d = \pi^{d/2} / \Gamma(1 + d/2)$. Other variants of nearest-neighbor differential entropy estimators have also been proposed and analyzed [10,11]. A practical problem with the nearest-neighbor approach is that data samples are often quantized; for example, image pixel data are usually quantized to eight bits or ten bits. Thus, it can happen in practice that two samples $x_j$ and $x_k$ have the exact same measured value so that $\rho(j) = 0$ and the differential entropy estimate is ill-defined. Though there are various fixes, such as pre-dithering the quantized data, it is not clear what effect such fixes could have on the estimated differential entropy.
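As an illustration of the nearest-neighbor estimator discussed above, the following is a minimal Python sketch, assuming the Kozachenko-Leonenko form $\frac{d}{n}\sum_j \ln\rho(j) + \ln(n-1) + \gamma + \ln V_d$ with Euclidean nearest-neighbor distances. The function names `nn_entropy` and `log_unit_ball_volume` are hypothetical, and the brute-force neighbor search is chosen for clarity rather than speed.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def log_unit_ball_volume(d):
    """ln V_d, where V_d = pi^(d/2) / Gamma(1 + d/2)."""
    return 0.5 * d * math.log(math.pi) - math.lgamma(1.0 + 0.5 * d)


def nn_entropy(samples):
    """Nearest-neighbor entropy estimate (in nats) for a list of d-dim points."""
    n, d = len(samples), len(samples[0])
    log_rho_sum = 0.0
    for j, x_j in enumerate(samples):
        # rho(j): Euclidean distance from x_j to its nearest other sample
        rho = min(math.dist(x_j, x_k)
                  for k, x_k in enumerate(samples) if k != j)
        if rho == 0.0:
            # Quantized data can produce exact duplicates, so ln(0) would occur
            raise ValueError("duplicate samples: estimate is ill-defined")
        log_rho_sum += math.log(rho)
    return (d * log_rho_sum / n + math.log(n - 1)
            + EULER_GAMMA + log_unit_ball_volume(d))
```

The explicit check for a zero nearest-neighbor distance makes the quantization problem described above visible as an error instead of silently returning negative infinity.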
They relate a result of Beardwood-Halton-Hammersley on the limiting length of a minimum spanning graph to Renyi entropy, and form a Renyi entropy estimator based on the empirical length of a minimum spanning tree of data.Unfortunately, how to use this approach to estimate the special case of Shannon entropy remains an open question.
In other recent work on differential entropy estimation, Van Hulle took a semiparametric approach to nonparametric differential entropy estimation for a continuous density by using a 5th-order Edgeworth expansion about the maximum likelihood multivariate normal given the data samples drawn from a non-normal distribution [15].

Prior Work on Relative Entropy Estimation
There is relatively little work on estimating relative entropy for continuous distributions. Wang et al. explored a number of data-dependent partitioning approaches for relative entropy between any two absolutely continuous distributions [16]. Nguyen et al. took a variational approach to relative entropy estimation [17], which was reported to work better for some cases than the data-partitioning estimators.
In more recent work [5,18], Wang et al. proposed a nearest-neighbor estimator based on nearest-neighbor density estimation:

$$\widehat{\mathrm{KL}}_{NN} = \frac{d}{n_1}\sum_{j=1}^{n_1} \ln \frac{\nu(j)}{\rho(j)} + \ln \frac{n_2}{n_1 - 1}, \tag{7}$$

where $\nu(j) = \min_{k=1,\ldots,n_2} \|x_{1,j} - x_{2,k}\|_2$ and $\rho(j) = \min_{k=1,\ldots,n_1,\, k \neq j} \|x_{1,j} - x_{1,k}\|_2$. They showed that (7) significantly outperforms their best data-partitioning estimators [5,18]. Pérez-Cruz has contributed additional convergence analysis for these estimators [19]. In practice, like the nearest-neighbor entropy estimate, $\widehat{\mathrm{KL}}_{NN}$ may be ill-defined if samples are quantized. The nearest-neighbor relative entropy estimator can perform quite poorly for Gaussian distributed data given a reasonable number of finite samples, particularly in high dimensions. For example, consider the case of two high-dimensional Gaussians each with identity covariance and a finite iid sample of points from the two distributions. Their true relative entropy is a function of $\|\mu_1 - \mu_2\|_2$, whereas the nearest-neighbor estimated relative entropy is better approximated (though roughly so) as a function of $\ln \|\mu_1 - \mu_2\|_2$.
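The nearest-neighbor relative entropy estimator can be sketched in a few lines of Python. This is a minimal illustration, assuming the form $\frac{d}{n_1}\sum_j \ln\frac{\nu(j)}{\rho(j)} + \ln\frac{n_2}{n_1-1}$ with Euclidean distances; the function name `nn_relative_entropy` is hypothetical and the search is brute force.

```python
import math


def nn_relative_entropy(x1, x2):
    """Nearest-neighbor estimate of KL(P1 || P2) from samples x1 ~ P1, x2 ~ P2."""
    n1, n2, d = len(x1), len(x2), len(x1[0])
    total = 0.0
    for j, x_j in enumerate(x1):
        # nu(j): distance from x_j to its nearest neighbor in the second sample
        nu = min(math.dist(x_j, y) for y in x2)
        # rho(j): distance from x_j to its nearest other point in the first sample
        rho = min(math.dist(x_j, x_k) for k, x_k in enumerate(x1) if k != j)
        if nu == 0.0 or rho == 0.0:
            raise ValueError("coincident samples: estimate is ill-defined")
        total += math.log(nu / rho)
    return d * total / n1 + math.log(n2 / (n1 - 1))
```

Because each term depends on the data only through the logarithm of a distance ratio, moving the two samples far apart changes the estimate only logarithmically in their separation, which is the behavior noted above for well-separated Gaussians.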

Functional Estimates that Minimize Expected Bregman Loss
Here we propose to form estimators of functionals (such as differential entropy and relative entropy) that are optimal in the sense that they minimize the expected Bregman loss, and that are always computable (assuming an appropriate prior is used).
Consider samples $x_1, x_2, \ldots, x_n \in \mathbb{R}^d$ drawn iid from some unknown distribution A, where we model A as a random distribution drawn from a distribution over distributions $P_A$ that has density $p_A$. We use $\tilde{A}$ to denote a realization of the random distribution A.
The goal is to estimate some functional (such as differential entropy or relative entropy) $\xi$, where $\xi$ maps a distribution or set of distributions (e.g., relative entropy is a functional on pairs of distributions) to a real number, $\xi : \mathcal{A} \rightarrow \mathbb{R}$, where $\mathcal{A}$ is the Cartesian product of a finite number of distributions, $\mathcal{A} = A_1 \times A_2 \times \ldots \times A_M$, and we denote a realization of $\mathcal{A}$ as $\tilde{\mathcal{A}}$. For example, the functional relative entropy maps a pair of distributions $\mathcal{A} = A_1 \times A_2$ to a non-negative number.
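We are interested in the Bayesian estimate of $\xi$ that minimizes an expected loss $L : \mathbb{R} \times \mathbb{R} \rightarrow \mathbb{R}$ [20]:

$$\xi^* = \arg\min_{\hat{\xi} \in \mathbb{R}} E_A\left[L(\xi(A), \hat{\xi})\right]. \tag{8}$$

In this paper, we will focus on Bregman loss functions (Bregman divergences), which include squared error and relative entropy [21-24]. For any twice differentiable strictly convex function $\phi : \mathbb{R} \rightarrow \mathbb{R}$, the corresponding Bregman divergence is $d_\phi(z, \hat{z}) = \phi(z) - \phi(\hat{z}) - \phi'(\hat{z})(z - \hat{z})$ for $z, \hat{z} \in \mathbb{R}$. The following proposition will aid in solving (8):
Proposition 1. The expected functional $E_A[\xi(A)]$ minimizes the expected Bregman loss, that is, $E_A[\xi(A)] = \arg\min_{\hat{\xi} \in \mathbb{R}} E_A[d_\phi(\xi(A), \hat{\xi})]$, if $E_A[\xi(A)]$ exists and is finite.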
One can view this proposition as a special case of Theorem 1 of Banerjee et al. [22]; we provide a proof in the appendix for completeness.
In this paper we focus on estimating differential entropy and relative entropy, which by applying Proposition 1 we calculate respectively as $\hat{h} = E_A[h(A)]$ and $\widehat{\mathrm{KL}} = E_{A_1, A_2}[\mathrm{KL}(A_1 \| A_2)]$, assuming the expectations are finite.
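Proposition 1 can be checked numerically on a toy example. The sketch below assumes a discrete posterior over three values of the functional (the numbers are illustrative, not from the paper) and searches a grid for the minimizer of the expected Bregman loss under two choices of $\phi$: $\phi(z) = z^2$ (squared error) and $\phi(z) = z \ln z$. All function names are hypothetical.

```python
import math


def bregman(phi, dphi, z, z_hat):
    """Bregman divergence d_phi(z, z_hat) for convex phi with derivative dphi."""
    return phi(z) - phi(z_hat) - dphi(z_hat) * (z - z_hat)


def expected_loss(phi, dphi, values, weights, z_hat):
    """Expected Bregman loss of the guess z_hat under a discrete posterior."""
    return sum(w * bregman(phi, dphi, z, z_hat)
               for z, w in zip(values, weights))


def grid_minimizer(phi, dphi, values, weights, grid):
    """Grid point with the smallest expected Bregman loss."""
    return min(grid,
               key=lambda z_hat: expected_loss(phi, dphi, values, weights, z_hat))


# Toy posterior over three possible values of the functional (assumed numbers).
values = [0.5, 1.0, 4.0]
weights = [0.2, 0.5, 0.3]
posterior_mean = sum(w * z for z, w in zip(values, weights))

grid = [0.1 + 0.01 * k for k in range(500)]
best_squared = grid_minimizer(lambda z: z * z, lambda z: 2.0 * z,
                              values, weights, grid)
best_zlogz = grid_minimizer(lambda z: z * math.log(z),
                            lambda z: math.log(z) + 1.0,
                            values, weights, grid)
```

Both `best_squared` and `best_zlogz` land on the grid point at the posterior mean, even though the second divergence is asymmetric, which is the content of the proposition.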

Bayesian Differential Entropy Estimate of the Uniform Distribution
We present estimates of the differential entropy of an unknown uniform distribution over a hyperrectangular domain for two cases: first, that there is no prior knowledge about the uniform distribution; and second, that there is prior knowledge about the uniform given in the form of a Pareto prior.

No Prior Knowledge About the Uniform
Given n d-dimensional samples $\{x_1, x_2, \ldots, x_n\}$ drawn from a hyperrectangular d-dimensional uniform distribution, let $\Delta_i$ be the difference between the maximum and minimum sample in the ith dimension:

$$\Delta_i = \max_j x_j[i] - \min_j x_j[i].$$

Then because a hyperrectangular uniform is the product of independent marginal uniforms, its differential entropy is the sum of the marginal entropies. Given no prior knowledge about the uniform, we take the expectation with respect to the (normalized) likelihood, or equivalently using a non-informative flat prior. Then, the proposed differential entropy estimate is the sum over dimensions of the differential entropy estimate for each marginal uniform:

$$\hat{h} = \sum_{i=1}^{d} \left( \ln \Delta_i + \frac{1}{n} + \frac{1}{n-1} \right). \tag{9}$$

To illustrate the effectiveness of the proposed Bayesian estimates, we show example results from two representative experiments in Figure 1.
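The no-prior uniform estimate is simple enough to compute directly. The following Python sketch assumes the per-dimension form $\ln \Delta_i + \frac{1}{n} + \frac{1}{n-1}$ obtained in the appendix derivation, and compares it with the maximum likelihood estimate $\ln \Delta_i$ in a small simulation. The function names and the simulation settings (n = 5, uniform on [−5, 5], fixed seed) are illustrative assumptions.

```python
import math
import random


def uniform_entropy_ml(samples):
    """Maximum likelihood estimate: sum over dimensions of ln(sample range)."""
    d = len(samples[0])
    total = 0.0
    for i in range(d):
        column = [x[i] for x in samples]
        total += math.log(max(column) - min(column))
    return total


def uniform_entropy_bayes(samples):
    """ML estimate plus the per-dimension correction 1/n + 1/(n-1); needs n >= 2."""
    n, d = len(samples), len(samples[0])
    return uniform_entropy_ml(samples) + d * (1.0 / n + 1.0 / (n - 1))


def simulate(n=5, runs=20000, low=-5.0, high=5.0, seed=0):
    """Average both estimates over repeated draws from a 1-D uniform."""
    rng = random.Random(seed)
    ml_sum = bayes_sum = 0.0
    for _ in range(runs):
        samples = [[rng.uniform(low, high)] for _ in range(n)]
        ml_sum += uniform_entropy_ml(samples)
        bayes_sum += uniform_entropy_bayes(samples)
    # Returns (mean ML estimate, mean Bayesian estimate, true entropy)
    return ml_sum / runs, bayes_sum / runs, math.log(high - low)
```

Calling `simulate()` returns the two averages alongside the true value $\ln 10$; the maximum likelihood average falls below the true value because the sample range can never exceed the true support length, while the corrected estimate averages close to it.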

Pareto Prior Knowledge About the Uniform
We consider the case that one has prior knowledge about the random uniform distribution U, where that prior knowledge is formulated as an independent Pareto prior for each dimension such that the prior probability of the marginal ith-dimension uniform $\tilde{U}_\delta$ with support of length $\delta$ is:

$$p(\tilde{U}_\delta) = \begin{cases} \dfrac{\alpha_i\, \ell_i^{\alpha_i}}{\delta^{\alpha_i + 1}} & \text{for } \delta \geq \ell_i \\ 0 & \text{otherwise} \end{cases} \tag{10}$$

where $\alpha_i \in \mathbb{R}_+$ and $\ell_i \in \mathbb{R}_+$ are the Pareto distribution prior parameters for the ith dimension.
The parameter $\ell_i$ defines the minimum length one believes the uniform's support could be in the ith dimension, and the parameter $\alpha_i$ specifies the confidence that $\ell_i$ is the right length; a larger $\alpha_i$ means one is more confident that $\ell_i$ is the correct length.
Then the differential entropy estimate for the ith dimension's one-dimensional uniform is given piecewise in (11), for the cases $\ell_i \leq \Delta_i$ and $\ell_i > \Delta_i$ (derived in the appendix). Note that the two cases coincide for the boundary case that $\ell_i = \Delta_i$, so that this differential entropy estimate is a continuous function of $\Delta_i$. For the full d-dimensional uniform, the differential entropy estimate is the sum of the one-dimensional differential entropy estimates.

Gaussian Distribution
The Gaussian is a popular model and often justified by central limit theorem arguments and because it is the maximum entropy distribution given fixed mean and covariance. In this section we assume d-dimensional samples have been drawn iid from an unknown Gaussian N, which we model as a random Gaussian, and we take the prior to be an inverse Wishart distribution with scalar parameter $q \in \mathbb{R}$ and parameter matrix $B \in \mathbb{R}^{d \times d}$.
We use the Fisher information metric to define a measure over the Riemannian manifold formed by the set of Gaussian distributions [25][26][27].We found these choices for prior and measure worked well for estimating Gaussian distributions for Bayesian quadratic discriminant analysis [27].
The performance of the proposed Gaussian entropy and relative entropy estimators will depend strongly on the choice of the prior. Generally, prior knowledge or subjective guesses about the data are used to set the parameters of the prior. Another way to form a prior is to use a coarse estimate of the data; for example, in previous work we found that setting B equal to the identity matrix times the trace of the sample covariance worked well as a data-adaptive prior in the context of classification [27]. Since the trace times the identity is the extremal case of the maximum entropy Gaussian for a given trace, this specific approach is problematic as a coarse estimate for setting the prior for differential entropy estimation, but other coarse estimates based on a different statistic of the eigenvalues may work well.

Differential Entropy Estimate of the Gaussian Distribution
Assume n samples $\{x_1, x_2, \ldots, x_n\}$ have been drawn iid from an unknown d-dimensional normal distribution. Per Proposition 1, we estimate the differential entropy as $E_N[h(N)]$, where the expectation is taken with respect to the posterior distribution over N and the prior is taken to be inverse Wishart with matrix parameter $B \in \mathbb{R}^{d \times d}$ and scalar parameter $q \in \mathbb{R}$. See the appendix for full details and derivation. The resulting estimate is

$$\hat{h}_{Bayesian} = \frac{d}{2} + \frac{d \ln \pi}{2} + \frac{\ln|S + B|}{2} - \frac{1}{2}\sum_{i=1}^{d} \psi\left(\frac{n + q + 1 + i}{2}\right). \tag{12}$$

This estimate is well-defined for any number of samples n.
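To make the computation concrete, here is a minimal Python sketch of a Bayesian Gaussian entropy estimate. It assumes the form implied by the appendix derivation, namely the Gaussian entropy $\frac{d}{2} + \frac{d}{2}\ln(2\pi) + \frac{1}{2}\ln|\Sigma|$ with $\ln|\Sigma|$ replaced by its expectation under an inverse Wishart with parameters $(S + B,\ n + q + d + 1)$, which gives $\frac{d}{2} + \frac{d \ln \pi}{2} + \frac{1}{2}\ln|S+B| - \frac{1}{2}\sum_{i=1}^{d}\psi\left(\frac{n+q+1+i}{2}\right)$. The helper names are hypothetical, and the digamma and log-determinant routines are simple standard-library stand-ins for what a numerical library would provide.

```python
import math


def digamma(x):
    """Digamma psi(x) for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def log_det_spd(m):
    """ln|M| of a symmetric positive definite matrix via Cholesky factorization."""
    d = len(m)
    chol = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = m[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            chol[i][j] = math.sqrt(s) if i == j else s / chol[j][j]
    return 2.0 * sum(math.log(chol[i][i]) for i in range(d))


def gaussian_entropy_bayes(samples, B, q):
    """Entropy estimate under an inverse Wishart prior with parameters (B, q)."""
    n, d = len(samples), len(samples[0])
    mean = [sum(x[i] for x in samples) / n for i in range(d)]
    # S + B, with S = sum_j (x_j - mean)(x_j - mean)^T the scaled sample covariance
    m = [[B[a][b] + sum((x[a] - mean[a]) * (x[b] - mean[b]) for x in samples)
          for b in range(d)] for a in range(d)]
    psi_sum = sum(digamma(0.5 * (n + q + 1 + i)) for i in range(1, d + 1))
    return (0.5 * d + 0.5 * d * math.log(math.pi)
            + 0.5 * log_det_spd(m) - 0.5 * psi_sum)
```

For example, `gaussian_entropy_bayes([[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]], [[1.5, 0, 0], [0, 1.5, 0], [0, 0, 1.5]], 3)` returns a finite value with only n = 2 samples in d = 3 dimensions, because adding the positive definite B keeps S + B invertible even when S is rank deficient.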

Relative Entropy Estimate between Gaussian Distributions
Assume $n_1$ samples have been drawn iid from an unknown d-dimensional normal distribution $N_1$, and $n_2$ samples have been drawn iid from another d-dimensional normal distribution $N_2$, assumed independent from the first. Then following Proposition 1, we estimate the relative entropy as $E_{N_1, N_2}[\mathrm{KL}(N_1 \| N_2)]$, where $N_1$ and $N_2$ are independent random Gaussians, the expectation is taken with respect to their posterior distributions, and the prior distributions are taken to be inverse Wisharts with scalar parameters $q_1$ and $q_2$ and matrix parameters $B_1$ and $B_2$. See the appendix for full details and derivation. The resulting estimate is well-defined for any number of samples $n_1, n_2$. If the prior scalar parameters are taken to be the same, that is $q_1 = q_2$, then the digamma terms cancel.

Wishart and Inverse Wishart Distributions
The Wishart and inverse Wishart distributions are arguably the most popular distributions for modeling random positive definite matrices.Moreover, if a random variable has a Gaussian distribution, then its sample covariance is drawn from a Wishart distribution.The relative entropy between Wishart distributions may be a useful way to measure the dissimilarity between collections of covariance matrices or Gram (inner product) matrices.
We were unable to find expressions for differential entropy or relative entropy of the Wishart and inverse Wishart distributions, so we first present those, and then present Bayesian estimates of these quantities.

Wishart Differential Entropy and Relative Entropy
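The differential entropy of W is

$$h(W) = \ln \Gamma_d\left(\frac{q}{2}\right) + \frac{qd}{2} + \frac{d+1}{2}\ln|2\Sigma| - \frac{q-d-1}{2}\sum_{i=1}^{d} \psi\left(\frac{q-d+i}{2}\right). \tag{14}$$

The relative entropy between two Wishart distributions $p_1$ and $p_2$ with parameters $(q_1, \Sigma_1)$ and $(q_2, \Sigma_2)$ respectively is

$$\mathrm{KL}(p_1 \| p_2) = \ln \frac{\Gamma_d(q_2/2)}{\Gamma_d(q_1/2)} + \frac{q_1}{2}\mathrm{tr}(\Sigma_1 \Sigma_2^{-1}) - \frac{q_2}{2}\ln|\Sigma_1 \Sigma_2^{-1}| + \frac{q_1 - q_2}{2}\sum_{i=1}^{d} \psi\left(\frac{q_1-d+i}{2}\right) - \frac{q_1 d}{2}. \tag{15}$$

For the special case of $q_1 = q_2 = q$, we note that the relative entropy given in (15) is q/2 times Stein's loss function, which is itself a common Bregman divergence.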
For the special case of $\Sigma_1 = \Sigma_2$, we find that the relative entropy between two Wisharts can also be written in the form of a Bregman divergence [21] between $q_2$ and $q_1$. Consider the strictly convex function $\phi(q) = \ln \Gamma_d(q/2)$ for $q \in \mathbb{R}_+$, and let $\psi_d$ be the derivative of $\ln \Gamma_d$. Then (15) becomes

$$\mathrm{KL}(p_1 \| p_2) = \ln \frac{\Gamma_d(q_2/2)}{\Gamma_d(q_1/2)} - \frac{q_2 - q_1}{2}\, \psi_d\left(\frac{q_1}{2}\right) = \phi(q_2) - \phi(q_1) - (q_2 - q_1)\phi'(q_1) = d_\phi(q_2, q_1). \tag{16}$$

We term (16) the log-gamma Bregman divergence. We have not seen this divergence noted before, and hypothesize that this divergence may have physical or practical significance because of the widespread occurrence of the gamma function and its special properties [28].
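The log-gamma Bregman divergence is easy to evaluate numerically. The sketch below assumes $\phi(q) = \ln \Gamma_d(q/2)$, the standard product form $\Gamma_d(a) = \pi^{d(d-1)/4}\prod_{i=1}^{d}\Gamma\left(a + \frac{1-i}{2}\right)$, and $\psi_d(a) = \sum_{i=1}^{d}\psi\left(a + \frac{1-i}{2}\right)$ as the derivative of $\ln \Gamma_d$. The function names are hypothetical, and the digamma routine is a simple series stand-in for a library implementation.

```python
import math


def digamma(x):
    """Digamma psi(x) for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def log_multigamma(a, d):
    """ln Gamma_d(a) = d(d-1)/4 * ln(pi) + sum_i ln Gamma(a + (1 - i)/2)."""
    return (0.25 * d * (d - 1) * math.log(math.pi)
            + sum(math.lgamma(a + 0.5 * (1 - i)) for i in range(1, d + 1)))


def multidigamma(a, d):
    """Derivative of ln Gamma_d at a: sum_i psi(a + (1 - i)/2)."""
    return sum(digamma(a + 0.5 * (1 - i)) for i in range(1, d + 1))


def log_gamma_bregman(q2, q1, d):
    """d_phi(q2, q1) for phi(q) = ln Gamma_d(q/2); assumes q1, q2 > d - 1."""
    return (log_multigamma(0.5 * q2, d) - log_multigamma(0.5 * q1, d)
            - 0.5 * (q2 - q1) * multidigamma(0.5 * q1, d))
```

As a sanity check, for d = 1, $q_1 = 2$ and $q_2 = 4$ the divergence reduces to $\ln\Gamma(2) - \ln\Gamma(1) - \psi(1)$, which equals the Euler-Mascheroni constant, and the divergence is zero when $q_1 = q_2$.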

Inverse Wishart Differential Entropy and Relative Entropy
Let V be distributed according to an inverse Wishart distribution with scalar degree of freedom parameter q ≥ d and positive definite matrix parameter Σ ∈ R d×d as per (2).
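Then V has differential entropy

$$h(V) = \ln \Gamma_d\left(\frac{q}{2}\right) + \frac{qd}{2} + \frac{d+1}{2}\ln\left|\frac{\Sigma}{2}\right| - \frac{q+d+1}{2}\sum_{i=1}^{d} \psi\left(\frac{q-d+i}{2}\right).$$

The relative entropy between two inverse Wishart distributions with parameters $(\Sigma_1, q_1)$ and $(\Sigma_2, q_2)$ is

$$\mathrm{KL}(p_1 \| p_2) = \ln \frac{\Gamma_d(q_2/2)}{\Gamma_d(q_1/2)} + \frac{q_1}{2}\mathrm{tr}(\Sigma_1^{-1} \Sigma_2) - \frac{q_2}{2}\ln|\Sigma_1^{-1} \Sigma_2| + \frac{q_1 - q_2}{2}\sum_{i=1}^{d} \psi\left(\frac{q_1-d+i}{2}\right) - \frac{q_1 d}{2}.$$

One sees that the relative entropy between two inverse Wishart distributions is the same as the relative entropy between two Wishart distributions with inverse matrix parameters $\Sigma_1^{-1}$ and $\Sigma_2^{-1}$ respectively. Like the Wishart distribution relative entropy, the inverse Wishart distribution relative entropy has special cases that are the Stein loss and the log-gamma Bregman divergence.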

Bayesian Estimation of Wishart Differential Entropy
We present a Bayesian estimate of the differential entropy of a Wishart distribution p where we make the simplifying assumption that the scalar parameter q is known or estimated (for example, it is common to assume that q = d). We estimate the differential entropy as $E_\Sigma[h(p)]$, where the estimation is with respect to the uncertainty in the matrix parameter $\Sigma$. We assume the prior $p(\Sigma = \tilde{\Sigma})$ is inverse Wishart with scalar parameter r and parameter matrix U, which reduces to the non-informative prior when r and U are chosen to be zeros.
Then given sample $d \times d$ matrices $S_1, S_2, \ldots, S_n$ drawn iid from the Wishart W, the normalized posterior distribution $p(\tilde{\Sigma} \mid S_1, S_2, \ldots, S_n)$ is inverse Wishart with matrix parameter $\sum_{j=1}^{n} S_j + U$ and scalar parameter $nq + r$ (details in Appendix).
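Then our differential entropy estimate $E_\Sigma[h(W)]$, where the expectation is with respect to the posterior $p(\tilde{\Sigma} \mid \{S_j\})$, is:

$$E_\Sigma[h(W)] = \ln \Gamma_d\left(\frac{q}{2}\right) + \frac{qd}{2} + \frac{d+1}{2}\left( \ln\left|\sum_{j=1}^{n} S_j + U\right| - \sum_{i=1}^{d} \psi\left(\frac{nq + r - d + i}{2}\right) \right) - \frac{q-d-1}{2}\sum_{i=1}^{d} \psi\left(\frac{q-d+i}{2}\right).$$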

Bayesian Estimation of Relative Entropy between Two Wisharts
We present a Bayesian estimate of the relative entropy between two Wishart distributions $p_1$ and $p_2$ where we make the simplifying assumption that the respective scalar parameters $q_1, q_2$ are known or estimated (for example, that $q_1 = q_2 = d$), and then we estimate the relative entropy $\mathrm{KL}(p_1 \| p_2)$ where the estimation is with respect to the uncertainty in the respective matrix parameters $\Sigma_1, \Sigma_2$. To this end, we treat the unknown Wishart parameters $\Sigma_1, \Sigma_2$ as random, and compute the estimate $E_{\Sigma_1, \Sigma_2}[\mathrm{KL}(p_1 \| p_2)]$. For $\Sigma_1$ and $\Sigma_2$ we use independent inverse Wishart conjugate priors with respective scalar parameters $r_1, r_2$ and parameter matrices $U_1, U_2$, which reduce to non-informative priors when $r_1, r_2$ and $U_1, U_2$ are chosen to be zeros.
Then given $n_1$ sample $d \times d$ matrices $\{S_j\}$ drawn iid from the Wishart $p_1$, and $n_2$ sample $d \times d$ matrices $\{S_k\}$ drawn iid from the Wishart $p_2$, the normalized posterior distribution $p(\tilde{\Sigma}_1 \mid \{S_j\})$ is inverse Wishart with matrix parameter $\sum_{j=1}^{n_1} S_j + U_1$ and scalar parameter $n_1 q_1 + r_1$, and the normalized posterior distribution $p(\tilde{\Sigma}_2 \mid \{S_k\})$ is inverse Wishart with matrix parameter $\sum_{k=1}^{n_2} S_k + U_2$ and scalar parameter $n_2 q_2 + r_2$.
Then our relative entropy estimate E Σ 1 ,Σ 2 [KL(p 1 ||p 2 )] (where the expectation is with respect to the posterior distributions) is

Bayesian Estimation of Inverse Wishart Differential Entropy
Let $S_i$ denote the ith of n random iid draws from an unknown inverse Wishart distribution p with parameters $(\Sigma, q)$. Taking the prior $p(\tilde{\Sigma})$ to be a Wishart distribution with parameters r and U, our Bayesian estimate of the inverse Wishart differential entropy is

Bayesian Estimation of Relative Entropy between Two Inverse Wisharts
Given q 1 , q 2 , and assuming independent Wishart priors with respective scalar parameters r 1 , r 2 and parameter matrices U 1 , U 2 , and given n 1 sample d × d matrices {S j } drawn iid from the inverse Wishart p 1 , and n 2 sample d × d matrices {S k } drawn iid from the inverse Wishart p 2 , our Bayesian estimate of the relative entropy is

Discussion
We have presented Bayesian differential entropy and relative entropy estimates for four standard distributions, and in doing so illustrated techniques that could be used to derive such estimates for other distributions.For the uniform case with no prior, the given estimators perform significantly better than previous estimators, and this experimental evidence validates our approach.However given a prior over distributions, the performance will depend heavily on the accuracy of the prior, and a thorough experimental study would be useful to practitioners but was outside the scope of this investigation.
In practice, there may not be a priori information available to determine a prior, and an open question is how to design appropriate data-dependent priors for differential entropy estimation.For example, for Bayesian quadratic discriminant analysis classification [27], we have shown that setting the prior matrix parameter for the Gaussian to be a coarse estimate of the data's covariance (the identity times the trace of the sample covariance) works well.However, for differential entropy estimation the trace forms an extreme estimate, and is thus not (by itself) suitable for forming a data-dependent prior for this problem.
Another open area is forming estimators for more complicated parametric models, for example estimating the differential entropy and relative entropy of Gaussian mixture models.Estimating the differential entropy of Gaussian processes is also an important problem [29] that may be amenable to the present approach.
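Lastly, the new estimators have been motivated by their expected Bregman loss optimality and by the practical consideration of producing estimates even when there are fewer samples than dimensions, but there are a number of theoretical questions about these estimators that are open, such as domination.

A.1. Proof of Proposition 1
The proof is by contradiction. Let $\xi^* = E_A[\xi(A)]$, and assume the true minimizer of $E_A[d_\phi(\xi(A), \hat{\xi})]$ occurs instead at some other value s. Then a contradiction occurs:

$$E_A[d_\phi(\xi(A), s)] - E_A[d_\phi(\xi(A), \xi^*)] \overset{(a)}{=} \phi(\xi^*) - \phi(s) - \phi'(s)\left(E_A[\xi(A)] - s\right) + \phi'(\xi^*)\left(E_A[\xi(A)] - \xi^*\right) \overset{(b)}{=} \phi(\xi^*) - \phi(s) - \phi'(s)(\xi^* - s) \overset{(c)}{=} d_\phi(\xi^*, s) \overset{(d)}{\geq} 0,$$

where in (a) we expanded $d_\phi$ and simplified, in (b) we used the fact that $\xi^* = E_A[\xi(A)]$, in (c) we used the definition of the Bregman divergence, and in (d) we used the non-negativity of the Bregman divergence. Thus $\xi^* = E_A[\xi(A)]$ must be the minimizer.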

A.2. Derivation of Uniform Differential Entropy Estimate
In this section we will repeatedly use the integral (22). To estimate the differential entropy of a multidimensional uniform distribution one only needs to consider the differential entropy for a one-dimensional uniform, because a multidimensional uniform can be written as a product of independent univariate distributions, and thus the differential entropy of the multidimensional uniform is the sum of the univariate entropies.
Thus we model the n samples $\{x_1, x_2, \ldots, x_n\}$ as being drawn from a random one-dimensional uniform distribution U. Let M be the two-dimensional statistical manifold composed of uniform distributions $\{\tilde{U}_{a,b}\}$, where $\tilde{U}_{a,b}$ has support on [a, b] for b > a, with $a, b \in \mathbb{R}$. The measure should depend on the length $\delta = b - a$ of the uniform and be invariant to shifts in the support. To that end, we use the Fisher information metric [25,26] based on the length, $dM = |I(\delta)|^{1/2}\, d\delta = \frac{d\delta}{\delta}$, where I is the Fisher information, $I(\delta) = 1/\delta^2$. Using dM as a differential element and the normalized likelihood of the samples for $p(\tilde{U}_{a,b})$, the uniform differential entropy estimate is given by (23), with normalization factor $\gamma$. Canceling terms in (23) due to the normalization factor $\gamma$ yields the one-dimensional uniform differential entropy estimate $\ln(x_{max} - x_{min}) + \frac{1}{n-1} + \frac{1}{n}$. For the multidimensional uniform, one sums these marginal entropy terms over the dimensions, as given in (9).

A.3. Derivation of Uniform differential Entropy Given Pareto Prior
As explained for the no-prior derivation, we need only consider a one-dimensional uniform. Although the Pareto distribution is a conjugate prior for the uniform with respect to its length, one must be careful because the data restrict $b > x_{max}$ and $a < x_{min}$, and these restrictions are not taken into account if one integrates with respect to the variable $\delta$. Throughout this section we use various flavors of $\gamma$ to denote normalization constants, and $\Delta = x_{max} - x_{min}$. We consider two cases separately.
Case I: $\ell \leq \Delta$. After solving for the normalizer, the posterior (24) is nonzero for $a \leq x_{min}$, $b \geq x_{max}$ and zero otherwise. Using (22), it is straightforward to derive the differential entropy estimate given in the text.
Case II: $\ell > \Delta$. In this case, the posterior has an additional constraint compared to (24). The normalization constant can be solved for, and the differential entropy estimate is then expressed as a product of two factors, labeled (a), where in (a) we substituted in (25). In the second factor of (a) there are five terms. Combining the first and fourth terms with the first factor results in the first term of the estimate given in (11). Combining the second term with the first factor results in the second term of (11). Lastly, combining the third and fifth terms of (a) with the first factor results in the third term of (11).

A.4. Propositions Used in Remaining Derivations
The following identities and propositions will be used repeatedly in the derivations in the rest of the appendix.
Identity 1. This is a convenient re-statement of the fact that the normal distribution normalizes to one. For $x, \mu \in \mathbb{R}^d$ and positive definite $d \times d$ matrix $\Sigma$,
Identity 2. This is a convenient re-statement of the fact that the inverse Wishart distribution normalizes to one. For positive definite $\Sigma$:
Proposition 2. Corollary 7.3]) where $\chi^2$ denotes the chi-squared random variable.
Then the result is produced by taking the expected log and using the fact that $E[\ln \chi^2_q] = \ln 2 + \psi(q/2)$.
For $V \sim$ inverse Wishart$(S, q)$ and any positive definite matrix $A \in \mathbb{R}^{d \times d}$,
For $V \sim$ inverse Wishart$(S, q)$ and any positive definite matrix $A \in \mathbb{R}^{d \times d}$, $E[\mathrm{tr}(V^{-1}A)] = q\, \mathrm{tr}(S^{-1}A)$.
Proof: By definition, $V^{-1} \sim$ Wishart$(S^{-1}, q)$, and so one can apply Proposition 4 to yield $E[\mathrm{tr}(V^{-1}A)] = q\, \mathrm{tr}(S^{-1}A)$.

A.5. Derivation of Bayesian Gaussian Differential Entropy Estimate
We model the samples $\{x_i\}_{i=1}^{n}$ as being drawn from a random d-dimensional normal distribution N and assume an inverse Wishart prior distribution for N with parameters (B, q), which assigns a prior probability to each realization $\tilde{N}(\tilde{\mu}, \tilde{\Sigma})$ of the random normal N. The posterior is the likelihood times the prior, normalized; sweeping all the constant terms into a normalization term $\alpha$ gives the posterior (29). Note that this is a density on the statistical manifold of Gaussians, so we integrate with respect to the Fisher information measure $1/|\tilde{\Sigma}|^{\frac{d+2}{2}}$ [25,27] rather than the Lebesgue measure; the resulting normalization (30) follows from Identities 1 and 2 stated in the previous subsection. Combining (30) and (29) then gives the posterior (31).
Our differential entropy estimate is the integral $E_N[h(N)]$, which is an integral over the statistical manifold of Gaussians that we convert to an integral over covariance matrices by using the Fisher information metric $1/|\tilde{\Sigma}|^{(d+2)/2}$ [25,27]; this gives (32). We evaluate the third term of (32) in steps (33)-(35): (33) follows by Identity 1; (34) is half the expectation of $\ln|\Sigma|$ with respect to the inverse Wishart with parameters $(S + B,\ n + q + d + 1)$; and thus (35) follows from (34) by Proposition 3. Substituting this result into (32) gives the estimate stated in the text.

A.6. Derivation of Bayesian Gaussian Relative Entropy Estimate
Recall that the relative entropy between independent Gaussians $N_1(x; \mu_1, \Sigma_1)$ and $N_2(x; \mu_2, \Sigma_2)$ is

$$\mathrm{KL}(N_1 \| N_2) = \frac{1}{2}\left[ \mathrm{tr}(\Sigma_2^{-1}\Sigma_1) + (\mu_1 - \mu_2)^T \Sigma_2^{-1} (\mu_1 - \mu_2) - d + \ln\frac{|\Sigma_2|}{|\Sigma_1|} \right]. \tag{37}$$

Here we derive $E_{N_1, N_2}[\mathrm{KL}(N_1 \| N_2)]$. Analogous to the previous derivation of the Bayesian Gaussian entropy estimate, we form the posterior distributions using independent inverse Wishart priors with parameters $(B_1, q_1)$ and $(B_2, q_2)$. (Note that this is equivalent to having a non-informative prior on the mean parameter, and that a different prior on the mean would lead to a more regularized estimate.) We consider the four terms of the expectation of (37), that is, of $E_{N_1, N_2}[\mathrm{KL}(N_1 \| N_2)]$, in turn.
Similarly, we can write the second term as: (a) where (a) uses the formula for entropy given in (14), and (b) follows by applying Proposition 2 and 4 and then simplifying.
A.9. Derivation of Inverse Wishart Differential Entropy
Using the inverse Wishart density given in (2), the inverse Wishart differential entropy is:

Figure 1. Example comparison of differential entropy estimators. Left: For each of 10,000 runs of the simulation, n samples were drawn iid from a uniform distribution on [−5, 5]. The proposed estimate (9) is compared to the maximum likelihood estimate, and to the nearest-neighbor estimate given in (6). Right: For each of 10,000 runs of the simulation, n samples were drawn iid from a Gaussian distribution. For each of the 10,000 runs, a new Gaussian distribution with diagonal covariance was randomly generated by drawing each of the variances iid from a uniform on [0, 1]. The Bayesian estimator prior parameters were q = d and B = 0.5qI. The proposed estimate (12) is compared to the only feasible estimator for this range of n, the nearest-neighbor estimate given in (6).

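$$h(V) = -E[\ln p(V)] = \ln \Gamma_d\left(\frac{q}{2}\right) + \frac{qd}{2}\ln 2 - \frac{q}{2}\ln|S| + \frac{q+d+1}{2}E[\ln|V|] + \frac{1}{2}E[\mathrm{tr}(V^{-1}S)]$$

$$\overset{(a)}{=} \ln \Gamma_d\left(\frac{q}{2}\right) + \frac{qd}{2}\ln 2 - \frac{q}{2}\ln|S| + \frac{q+d+1}{2}\left( \ln|S| - d \ln 2 - \sum_{i=1}^{d} \psi\left(\frac{q-d+i}{2}\right) \right) + \frac{q}{2}\mathrm{tr}(S^{-1}S)$$

$$\overset{(b)}{=} \ln \Gamma_d\left(\frac{q}{2}\right) + \frac{qd}{2} + \frac{d+1}{2}\ln\left|\frac{S}{2}\right| - \frac{q+d+1}{2}\sum_{i=1}^{d} \psi\left(\frac{q-d+i}{2}\right),$$

where in (a) we applied Proposition 6 and Proposition 3, and in (b) used $\mathrm{tr}(S^{-1}S) = \mathrm{tr}(I) = d$ and simplified.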
Misra et al. show that (4) is dominated by a Stein-type estimator $\ln|S + n\bar{x}\bar{x}^T| - c_1$, where $c_1$ is a function of d and n [4]. In addition, they show that (4) is dominated by a Brewster-Zidek-type estimator $\ln|S + n\bar{x}\bar{x}^T| - c_2$, where $c_2$ is a function of $|S|$ and $\bar{x}\bar{x}^T$ that requires calculating the ratio of two definite integrals, stated in full in (4.3) of [4]. Misra et al. found that on simulated numerical experiments their Stein-type and Brewster-Zidek-type estimators achieved only roughly 6% improvement over (4).