Estimating Mixture Entropy with Pairwise Distances

Mixture distributions arise in many parametric and non-parametric settings -- for example, in Gaussian mixture models and in non-parametric estimation. It is often necessary to compute the entropy of a mixture, but, in most cases, this quantity has no closed-form expression, making some form of approximation necessary. We propose a family of estimators based on a pairwise distance function between mixture components, and show that this estimator class has many attractive properties. For many distributions of interest, the proposed estimators are efficient to compute, differentiable in the mixture parameters, and become exact when the mixture components are clustered. We prove this family includes lower and upper bounds on the mixture entropy. The Chernoff $\alpha$-divergence gives a lower bound when chosen as the distance function, with the Bhattacharyya distance providing the tightest lower bound for components that are symmetric and members of a location family. The Kullback-Leibler divergence gives an upper bound when used as the distance function. We provide closed-form expressions of these bounds for mixtures of Gaussians, and discuss their applications to the estimation of mutual information. Using numerical simulations, we then demonstrate that our bounds are significantly tighter than well-known existing bounds. This estimator class is very useful in optimization problems involving maximization/minimization of entropy and mutual information, such as MaxEnt and rate distortion problems.


Introduction
A mixture distribution is a probability distribution whose density function is a weighted sum of individual densities. Mixture distributions are a common choice for modeling probability distributions, in both parametric settings, for example, learning a mixture of Gaussians statistical model [1], and non-parametric settings, such as kernel density estimation.
It is often necessary to compute the differential entropy [2] of a random variable with a mixture distribution, which is a measure of the inherent uncertainty in the outcome of the random variable. Entropy estimation arises in image retrieval tasks [3], image alignment and error correction [4], speech recognition [5,6], analysis of debris spread in rocket launch failures [7], and many other settings. Entropy also arises in optimization contexts [4,8-10], where it is minimized or maximized under some constraints (e.g., MaxEnt problems). Finally, entropy also plays a central role in minimization or maximization of mutual information, such as in problems related to rate distortion [11].
Unfortunately, in most cases, the entropy of a mixture distribution has no known closed-form expression [12]. This is true even when the entropy of each component distribution does have a known closed-form expression. For instance, the entropy of a Gaussian has a well-known form, while the entropy of a mixture of Gaussians does not [13]. As a result, the problem of finding a tractable and accurate estimate for mixture entropy has been described as "a problem of considerable current interest and practical significance" [14].
One way to approximate mixture entropy is with Monte Carlo (MC) sampling. MC sampling provides an unbiased estimate of the entropy, and this estimate can become arbitrarily accurate by increasing the number of MC samples. Unfortunately, MC sampling is very computationally intensive, as, for each sample, the (log) probability of the sample location must be computed under every component in the mixture. MC sampling typically requires a large number of samples to estimate entropy, especially in high dimensions. Sampling is thus typically impractical, especially for optimization problems where, for every parameter change, a new entropy estimate is required. Alternatively, it is possible to approximate entropy using numerical integration, but this is also computationally expensive and limited to low-dimensional applications [15,16].
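To make the cost concrete, here is a minimal one-dimensional sketch of the MC estimator (function names are ours, not from the paper or any library): every sample requires evaluating the log-density under all components of the mixture.

```python
import math
import random

def gmm_logpdf(x, weights, means, sigmas):
    """Log-density of a one-dimensional Gaussian mixture at x."""
    total = sum(
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
        for w, m, s in zip(weights, means, sigmas)
    )
    return math.log(total)

def mc_entropy(weights, means, sigmas, n_samples=50_000, seed=0):
    """Unbiased Monte Carlo estimate of H(X) = -E[ln p_X(X)].

    Draw a component index according to the weights, sample from that
    component, and average -ln p_X(sample).  Note that every sample
    touches every component, which is what makes MC expensive.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        i = rng.choices(range(len(weights)), weights=weights)[0]
        x = rng.gauss(means[i], sigmas[i])
        total -= gmm_logpdf(x, weights, means, sigmas)
    return total / n_samples
```

The estimate converges at the usual 1/√(number of samples) MC rate, which is why a fresh estimate at every optimization step is impractical.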
Instead of Monte Carlo sampling or numerical integration, one may use an analytic estimator of mixture entropy. Analytic estimators have estimation bias but are much more computationally efficient. There are several existing analytic estimators of entropy, discussed in-depth below. To summarize, however, commonly-used estimators have significant drawbacks: they have large bias relative to the true entropy, and/or they are invariant to the amount of "overlap" between mixture components. For example, many estimators do not depend on the locations of the means in a Gaussian mixture model.
In this paper, we introduce a novel family of estimators for the mixture entropy. Each member of this family is defined via a pairwise-distance function between component densities. The estimators in this family have several attractive properties. They are computationally efficient, as long as the pairwise-distance function and the entropy of each component distribution are easy to compute. The estimation bias of any member of this family is bounded by a constant. The estimator is continuous and smooth and is therefore useful for optimization problems. In addition, we show that when the Chernoff α-divergence (i.e., a scaled Rényi divergence) is used as a pairwise-distance function, the corresponding estimator is a lower bound on the mixture entropy. Furthermore, among all the Chernoff α-divergences, the Bhattacharyya distance (α = 0.5) provides the tightest lower bound when the mixture components are symmetric and belong to a location family (such as a mixture of Gaussians with equal covariances). We also show that when the Kullback-Leibler (KL) divergence is used as a pairwise-distance function, the corresponding estimator is an upper bound on the mixture entropy. Finally, our family of estimators can compute the exact mixture entropy when the component distributions are grouped into well-separated clusters, a property not shared by other analytic estimators of entropy. In particular, the bounds mentioned above converge to the same value for well-separated clusters.
The paper is laid out as follows. We first review mixture distributions and entropy estimation in Section 2. We then present the class of pairwise distance estimators in Section 3, prove bounds on the error of any estimator in this class, and show distance functions that bound the entropy as discussed above. In Section 4, we consider the special case of mixtures of Gaussians, and give explicit expressions for lower and upper bounds on the mixture entropy. When all the Gaussian components have the same covariance matrix, we show that these bounds have particularly simple expressions. In Section 5, we consider the closely related problem of estimating the mutual information between two random variables, and show that our estimators can be directly used to estimate and bound the mutual information. For the Gaussian case, these can be used to bound the mutual information across a type of additive white Gaussian noise channel. Finally, in Section 6, we run numerical experiments and compare the performance of our lower and upper bounds relative to existing estimators. We consider both mixtures of Gaussians and mixtures of uniform distributions.

Background and Definitions
We consider the differential entropy of a continuous random variable X, defined as

H(X) := −∫ p_X(x) ln p_X(x) dx,

where X is distributed according to the mixture density

p_X(x) = ∑_{i=1}^N c_i p_i(x),

and where c_i indicates the weight of component i (c_i ≥ 0, ∑_i c_i = 1) and p_i the probability density of component i. We can treat the set of component weights as the probabilities of outcomes 1 . . . N of a discrete random variable C, where Pr(C = i) = c_i. Consider the mixed joint distribution of the discrete random variable C and the continuous random variable X, p_{X,C}(x, i) = p_i(x) c_i, and note the following identities for conditional and joint entropy [17],

H(X, C) = H(C) + H(X|C) = H(X) + H(C|X),

where we use H for discrete and differential entropy interchangeably. Here, the conditional entropies are defined as

H(X|C) := ∑_i c_i H(p_i)  and  H(C|X) := −∫ p_X(x) ∑_i Pr(C = i|X = x) ln Pr(C = i|X = x) dx.

Using elementary results from information theory [2], H(X) can be bounded from below by

H(X) ≥ H(X|C) = ∑_i c_i H(p_i), (1)

since conditioning can only decrease entropy. Similarly, H(X) can be bounded from above by

H(X) ≤ H(X, C) = H(X|C) + H(C) = ∑_i c_i H(p_i) − ∑_i c_i ln c_i, (2)

following from H(X) = H(X, C) − H(C|X) and the non-negativity of the conditional discrete entropy H(C|X). This upper bound on the mixture entropy was previously proposed by Huber et al. [18]. It is easy to see that the bound in Equation (1) is tight when all the components have the same distribution, since then H(p_X) = H(p_i) for all i. The bound in Equation (2) becomes tight when H(C|X) = 0, i.e., when any sample from p_X uniquely determines the component identity C. This occurs when the different mixture components have non-overlapping supports, p_i(x) > 0 ⟹ p_j(x) = 0 for all x and i ≠ j. More generally, the bound of Equation (2) becomes increasingly tight as the mixture distributions move farther apart from one another.
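The two overlap-independent bounds in Equations (1) and (2) are cheap to evaluate whenever the component entropies are known; a small sketch (function names are ours):

```python
import math

def mixture_entropy_bounds(weights, component_entropies):
    """Overlap-independent bounds on the mixture entropy H(X).

    Lower bound: H(X|C) = sum_i c_i H(p_i)   (conditioning reduces entropy).
    Upper bound: H(X,C) = H(X|C) + H(C)      (joint entropy).
    """
    h_x_given_c = sum(c * h for c, h in zip(weights, component_entropies))
    h_c = -sum(c * math.log(c) for c in weights if c > 0.0)
    return h_x_given_c, h_x_given_c + h_c

# Two equal-weight unit-variance Gaussians: the bounds are the same
# no matter how far apart the component means are.
h_unit = 0.5 * math.log(2 * math.pi * math.e)
lo, hi = mixture_entropy_bounds([0.5, 0.5], [h_unit, h_unit])
```

Note that the means never enter the computation, which is exactly the invariance-to-overlap weakness discussed next.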
In the case where the entropy of each component density, H(p_i) for i = 1 . . . N, has a simple closed-form expression, the bounds in Equations (1) and (2) can be easily computed. However, neither bound depends on the "overlap" between components. For instance, in a Gaussian mixture model, these bounds are invariant to changes in the component means. The bounds are thus unsuitable for many problems; for instance, in optimization, one typically tunes parameters to adjust component means, but the above entropy bounds remain the same regardless of mean location.
There are two other estimators of the mixture entropy that should be mentioned. The first estimator is based on kernel density estimation [16,19]. It estimates the entropy using the mixture probability of the component means, µ_i,

Ĥ_KDE(X) := −∑_i c_i ln ∑_j c_j p_j(µ_i). (3)

The second estimator is a lower bound that is derived using Jensen's inequality [2],

H(X) ≥ −∑_i c_i ln ∑_j c_j ∫ p_i(x) p_j(x) dx =: Ĥ_ELK(X). (4)

In the literature, the term ∫ p_i(x) p_j(x) dx has been referred to as the "Cross Information Potential" [20,21] and the "Expected Likelihood Kernel" [22,23] (ELK; we use this second acronym to label this estimator). When the component distributions are Gaussian, p_i := N(µ_i, Σ_i), the ELK has a simple closed-form expression,

Ĥ_ELK(X) = −∑_i c_i ln ∑_j c_j q_{j,i}(µ_i), (5)

where each q_{j,i} is a Gaussian defined as q_{j,i} := N(µ_j, Σ_i + Σ_j). This lower bound was previously proposed for Gaussian mixtures in [18] and in a more general context in [12]. Both Ĥ_KDE, Equation (3), and Ĥ_ELK, Equation (5), are computationally efficient, continuous and differentiable, and depend on component overlap, making them suitable for optimization. However, as will be shown via numerical experiments (Section 6), they exhibit significant underestimation bias. At the same time, we will show that for Gaussian mixtures with equal covariance, Ĥ_KDE is only an additive constant away from an estimator in our proposed class.
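For concreteness, the two existing estimators can be sketched for one-dimensional Gaussian mixtures as follows (function names are ours; the multivariate case replaces variances with covariance matrices):

```python
import math

def norm_pdf(x, mu, var):
    """Density of the 1-D Gaussian N(mu, var) at x."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def h_kde(weights, means, sigmas):
    """Kernel-density-style estimate: minus the weighted mean
    log mixture density evaluated at the component means."""
    return -sum(
        ci * math.log(sum(cj * norm_pdf(mi, mj, sj ** 2)
                          for cj, mj, sj in zip(weights, means, sigmas)))
        for ci, mi in zip(weights, means)
    )

def h_elk(weights, means, sigmas):
    """Expected Likelihood Kernel lower bound; for Gaussians the cross
    term (the integral of p_i p_j) is itself a Gaussian density
    with summed variances, evaluated at the difference of means."""
    return -sum(
        ci * math.log(sum(cj * norm_pdf(mi, mj, si ** 2 + sj ** 2)
                          for cj, mj, sj in zip(weights, means, sigmas)))
        for ci, mi, si in zip(weights, means, sigmas)
    )
```

Both run in O(N²) time per evaluation and are differentiable in the means and variances, which is what makes them attractive for optimization despite their bias.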

Overview
Let D(p_i ‖ p_j) be some (generalized) distance function between probability densities p_i and p_j. Formally, we assume that D is a premetric, meaning that it is non-negative and that D(p_i ‖ p_j) = 0 if p_i = p_j. We do not assume that D is symmetric, nor that it obeys the triangle inequality, nor that it is strictly greater than 0 when p_i ≠ p_j.
For any allowable distance function D, we propose the following entropy estimator:

Ĥ_D(X) := ∑_i c_i H(p_i) − ∑_i c_i ln ∑_j c_j exp(−D(p_i ‖ p_j)). (6)

This estimator can be efficiently computed if the entropy of each component and D(p_i ‖ p_j) for all i, j have simple closed-form expressions. There are many distribution-distance function pairs that satisfy these conditions (e.g., Kullback-Leibler divergence, Rényi divergences, Bregman divergences, f-divergences, etc., for Gaussian, uniform, exponential, etc.) [24-28].
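A direct transcription of Equation (6) (names are ours; `dist` is any caller-supplied premetric on component indices):

```python
import math

def pairwise_entropy_estimator(weights, component_entropies, dist):
    """Equation (6):
    H_D(X) = sum_i c_i H(p_i) - sum_i c_i ln sum_j c_j exp(-D(p_i || p_j)).

    Only the component entropies and the pairwise distances are needed,
    so the cost is O(N^2) distance evaluations.
    """
    n = len(weights)
    est = sum(c * h for c, h in zip(weights, component_entropies))
    for i in range(n):
        inner = sum(weights[j] * math.exp(-dist(i, j)) for j in range(n))
        est -= weights[i] * math.log(inner)
    return est
```

Passing `dist ≡ 0` recovers the conditional-entropy lower bound H(X|C), while a distance that is zero on the diagonal and effectively infinite elsewhere recovers the joint-entropy upper bound H(X, C), illustrating the sandwich discussed next.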
It is straightforward to show that, for any D, Ĥ_D falls between the bounds of Equations (1) and (2),

H(X|C) ≤ Ĥ_D(X) ≤ H(X, C). (7)

To do so, consider the "smallest" and "largest" allowable distance functions,

D_min(p_i ‖ p_j) := 0  and  D_max(p_i ‖ p_j) := 0 if p_i = p_j, ∞ otherwise. (8)

For any D and p_i, p_j, D_min(p_i ‖ p_j) ≤ D(p_i ‖ p_j) ≤ D_max(p_i ‖ p_j), and Ĥ_D increases monotonically as the pairwise distances increase. Plugging D_min into Equation (6) (noting that ∑_j c_j = 1) gives Ĥ_{D_min}(X) = H(X|C), while plugging D_max into Equation (6) gives Ĥ_{D_max}(X) ≤ ∑_i c_i H(p_i) − ∑_i c_i ln c_i = H(X, C). These two inequalities yield Equation (7). The true entropy, as shown in Section 2, also obeys H(X|C) ≤ H(X) ≤ H(X, C), so the bias of any estimator in this family is at most H(C) ≤ ln N. In the next two subsections, we improve upon the bounds suggested in Equations (1) and (2), by examining bounds induced by particular distance functions.

Lower Bound
For α ∈ [0, 1], define the Chernoff α-divergence between densities p and q as

C_α(p ‖ q) := −ln ∫ p(x)^{1−α} q(x)^α dx. (9)

We show that for any α ∈ [0, 1], Ĥ_{C_α}(X) is a lower bound on the entropy (for α ∉ [0, 1], C_α is not a valid distance function; see Appendix A). To do so, we make use of a derivation from [31], which applies Jensen's inequality in two separate steps, to obtain

H(X) ≥ ∑_i c_i H(p_i) − ∑_i c_i ln ∑_j c_j e^{−C_α(p_i ‖ p_j)} = Ĥ_{C_α}(X). (10)

Note that Jensen's inequality is used in the derivations of both this lower bound as well as the lower bound Ĥ_ELK in Equation (4). However, the inequality is applied differently in the two cases, and, as will be demonstrated in Section 6, the estimators have different performance.
We have shown that using C_α as a distance function gives a lower bound on the mixture entropy for any α ∈ [0, 1]. For a general mixture distribution, one could optimize over the value of α to find the tightest lower bound. However, we can show that the tightest bound is achieved for α = 0.5 in the special case when all of the mixture components p_i are symmetric and come from a location family, i.e., when each component can be written as p_i(x) = b(x − µ_i) for a single density b satisfying b(x) = b(−x). Examples of this situation include mixtures of Gaussians with the same covariance ("homoscedastic" mixtures), multivariate t-distributions with the same covariance, location-shifted bounded uniform distributions, most kernels used in kernel density estimation, etc. It does not apply to skewed distributions, such as the skew-normal distribution [12].
To show that α = 0.5 is optimal, first define the Chernoff α-coefficient as

c_α(p ‖ q) := ∫ p(x)^{1−α} q(x)^α dx,

so that C_α(p ‖ q) = −ln c_α(p ‖ q). We show that for any pair p_i, p_j of symmetric distributions from a location family, c_α(p_i ‖ p_j) is minimized by α = 0.5. This means that all pairwise distances C_α(p_i ‖ p_j) ≡ −ln c_α(p_i ‖ p_j) are maximized by α = 0.5, and, therefore, the entropy estimator Ĥ_{C_α} (Equation (6)) is maximized by α = 0.5.
First, write each component as p_i(x) = b(x − µ_i), and define the change of variables y := µ_i + µ_j − x. This allows us to write the Chernoff α-coefficient as

c_α(p_i ‖ p_j) = ∫ b(x − µ_i)^{1−α} b(x − µ_j)^α dx =(a) ∫ b(µ_j − y)^{1−α} b(µ_i − y)^α dy =(b) ∫ b(y − µ_j)^{1−α} b(y − µ_i)^α dy = c_{1−α}(p_i ‖ p_j),

where, in (a), we have substituted variables, and in (b) we used the assumption that b(x) = b(−x).
Since we have shown that c_α(p_i ‖ p_j) = c_{1−α}(p_i ‖ p_j), c_α is symmetric in α about α = 0.5. In Appendix A, we show that c_α(p ‖ q) is everywhere convex in α. Together, these imply that c_α(p_i ‖ p_j) must achieve its minimum value at α = 0.5. The Chernoff α-coefficient for α = 0.5 is known as the Bhattacharyya coefficient, with the corresponding Bhattacharyya distance [32] defined as

BD(p ‖ q) := −ln ∫ √(p(x) q(x)) dx = C_{0.5}(p ‖ q).
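The symmetry-plus-convexity argument is easy to check numerically in the simplest location-family case, two Gaussians with a shared variance, where the Chernoff α-coefficient has the closed form exp(−α(1−α)(µ_1 − µ_2)²/(2σ²)) (a sketch; names are ours):

```python
import math

def chernoff_coeff_shared_var(alpha, mu1, mu2, var):
    """Chernoff alpha-coefficient of two Gaussians with a shared variance:
    c_alpha = exp(-alpha * (1 - alpha) * (mu1 - mu2)**2 / (2 * var))."""
    return math.exp(-alpha * (1.0 - alpha) * (mu1 - mu2) ** 2 / (2.0 * var))

# c_alpha is symmetric about alpha = 0.5 and minimized there, so the
# distance C_alpha = -ln c_alpha is maximized at alpha = 0.5.
vals = [chernoff_coeff_shared_var(a / 10.0, 0.0, 3.0, 1.0) for a in range(1, 10)]
```

Here the curve in α is an explicit convex function with its minimum at 0.5, matching the general claim proved above.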
Since any Chernoff α-divergence gives a lower bound on the entropy, we write the particular case of the Bhattacharyya-distance lower bound as

H(X) ≥ ∑_i c_i H(p_i) − ∑_i c_i ln ∑_j c_j e^{−BD(p_i ‖ p_j)} = Ĥ_BD(X). (11)

Upper Bound
The Kullback-Leibler (KL) divergence [2] is defined as

KL(p ‖ q) := ∫ p(x) ln (p(x)/q(x)) dx.

Using the KL divergence as the pairwise distance provides an upper bound on the mixture entropy. We show this as follows:

H(X) = −∑_i c_i E_{p_i}[ln ∑_j c_j p_j(X)] ≤(a) −∑_i c_i ln ∑_j c_j e^{−H(p_i ‖ p_j)} = ∑_i c_i H(p_i) − ∑_i c_i ln ∑_j c_j e^{−KL(p_i ‖ p_j)},

where E_{p_i} indicates expectation when X is distributed according to p_i, H(· ‖ ·) indicates the cross-entropy function, and we employ the identity H(p_i ‖ p_j) = H(p_i) + KL(p_i ‖ p_j). The inequality in step (a) uses a variational lower bound on the expectation of a log-sum [5,33],

E_p[ln ∑_j c_j f_j(X)] ≥ ln ∑_j c_j e^{E_p[ln f_j(X)]}.

Combining yields the upper bound

H(X) ≤ ∑_i c_i H(p_i) − ∑_i c_i ln ∑_j c_j e^{−KL(p_i ‖ p_j)} = Ĥ_KL(X). (12)

Exact Estimation in the "Clustered" Case
In the previous sections, we derived lower and upper bounds on the mixture entropy, using estimators based on the Chernoff α-divergence and the KL divergence, respectively.
There are situations in which the lower and upper bounds become similar. Consider a pair of component distributions, p_i and p_j. By applying Jensen's inequality to Equation (9), we can derive the inequality C_α(p_i ‖ p_j) ≤ α KL(p_i ‖ p_j). There are two cases in which a pair of components contributes similarly to the lower and upper bounds. The first case is when C_α(p_i ‖ p_j) is very large, meaning that the KL is also very large. By Equation (6), distances enter into our estimators as exp(−D(p_i ‖ p_j)), and, in this case, exp(−KL(p_i ‖ p_j)) ≈ exp(−C_α(p_i ‖ p_j)) ≈ 0. In the second case, KL(p_i ‖ p_j) ≈ 0, meaning that C_α(p_i ‖ p_j) must also be near zero, and, in this case, exp(−KL(p_i ‖ p_j)) ≈ exp(−C_α(p_i ‖ p_j)) ≈ 1. Thus, the lower and upper bounds become similar when all pairs of components are either very close together or very far apart.
In this section, we analyze this special case. Specifically, we consider the situation when mixture components are "clustered", meaning that there is a grouping of component distributions such that distributions in the same group are approximately the same and distributions assigned to different groups are very different from one another. We show that in this case our lower and upper bounds become equal and our pairwise-distance estimate of the entropy is tight. Though this situation may seem like an edge case, clustered distributions do arise in mixture estimation, e.g., when there are repeated data points, or as solutions to information-theoretic optimization problems [11]. Note that the number of groups is arbitrary, and therefore this situation includes the extreme cases of a single group (all component distributions are nearly the same) as well as N different groups (all component distributions are very different).
Formally, let the function g(i) indicate the group of component i. We say that the components are "clustered" with respect to grouping g iff KL(p_i ‖ p_j) ≤ κ whenever g(i) = g(j), for some small κ, and BD(p_i ‖ p_j) ≥ β whenever g(i) ≠ g(j), for some large β. We use the notation p_G(k) := ∑_i δ_{g(i),k} c_i to indicate the sum of the weights of the components in group k, where δ_{ij} indicates the Kronecker delta function. For technical reasons, below we only consider C_α where α is strictly greater than 0.
For the upper bound Ĥ_KL, we use that KL(p_i ‖ p_j) ≤ κ for i and j in the same group, and otherwise exp(−KL(p_i ‖ p_j)) ≥ 0. This gives the bound

Ĥ_KL(X) ≤ ∑_i c_i H(p_i) − ∑_i c_i ln (p_G(g(i)) e^{−κ}) = H(X|C) + κ − ∑_i c_i ln p_G(g(i)).

For the lower bound Ĥ_BD, we use that exp(−BD(p_i ‖ p_j)) ≤ 1 within a group and exp(−BD(p_i ‖ p_j)) ≤ e^{−β} across groups, giving

Ĥ_BD(X) ≥ H(X|C) − ∑_i c_i ln (p_G(g(i)) + e^{−β}).

The difference between the bounds is bounded by

Ĥ_KL(X) − Ĥ_BD(X) ≤ κ + ∑_i c_i ln (1 + e^{−β}/p_G(g(i))) ≤ κ + |G| e^{−β},

where |G| is the number of groups. Thus, the difference decreases at least linearly in κ and exponentially in β. This shows that, in the clustered case, when κ ≈ 0 and β is very large, our lower and upper bounds become exact.
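The convergence of the two bounds in the clustered case is easy to verify numerically; the sketch below (one-dimensional Gaussians; names are ours) builds a mixture of two tight, well-separated clusters and checks that the BD-based and KL-based estimates nearly coincide:

```python
import math

def kl_gauss(m1, s1, m2, s2):
    """KL divergence between 1-D Gaussians N(m1, s1^2) and N(m2, s2^2)."""
    return math.log(s2 / s1) + (s1 ** 2 + (m1 - m2) ** 2) / (2.0 * s2 ** 2) - 0.5

def bd_gauss(m1, s1, m2, s2):
    """Bhattacharyya distance between 1-D Gaussians."""
    v = s1 ** 2 + s2 ** 2
    return (m1 - m2) ** 2 / (4.0 * v) + 0.5 * math.log(v / (2.0 * s1 * s2))

def pairwise_est(weights, means, sigmas, dist):
    """Pairwise-distance estimator for a 1-D Gaussian mixture."""
    est = sum(c * 0.5 * math.log(2 * math.pi * math.e * s ** 2)
              for c, s in zip(weights, sigmas))
    for ci, mi, si in zip(weights, means, sigmas):
        inner = sum(cj * math.exp(-dist(mi, si, mj, sj))
                    for cj, mj, sj in zip(weights, means, sigmas))
        est -= ci * math.log(inner)
    return est

# Two tight clusters: components within a cluster coincide (kappa = 0),
# and the clusters sit many standard deviations apart (large beta).
w = [0.25, 0.25, 0.25, 0.25]
mu = [0.0, 0.0, 50.0, 50.0]
sd = [1.0, 1.0, 1.0, 1.0]
lower = pairwise_est(w, mu, sd, bd_gauss)
upper = pairwise_est(w, mu, sd, kl_gauss)
```

Both estimates collapse onto the exact clustered entropy, H(X|C) plus the entropy of the two equally weighted groups, ln 2.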
It also shows that any distance measure bounded between BD and KL also gives an exact estimate of entropy in the clustered case. Furthermore, the idea behind this proof can be extended to estimators induced by other bounding distances, beyond BD and KL, so as to show that a particular estimator converges to an exact entropy estimate in the clustered case. Note, however, that, for some distribution-distance pairs, the components will never be considered as "clustered"; e.g., the α-Chernoff distance for α = 0 between any two Gaussians is 0, and so a Gaussian mixture distribution will never be considered clustered according to this distance.
Finally, in the perfectly clustered case, we can show that our lower bound, Ĥ_BD, is at least as good as the Expected Likelihood Kernel lower bound, Ĥ_ELK, as defined in Equation (4). See Appendix B for details.

Gaussian Mixtures
Gaussians are very frequently used as components in mixture distributions. Our family of estimators is well-suited to estimating the entropies of Gaussian mixtures, since the entropy of a d-dimensional Gaussian p_i = N(µ_i, Σ_i) has a simple closed-form expression,

H(p_i) = (1/2) ln det(2πe Σ_i), (13)

and because there are many distance functions between Gaussians with closed-form expressions (KL divergence, the Chernoff α-divergences [35], 2-Wasserstein distance [36,37], etc.). In this section, we consider Gaussian mixtures and state explicit expressions for the lower and upper bounds on the mixture entropy derived in the previous section. We also consider these bounds in the special case where all Gaussian components have the same covariance matrix (homoscedastic mixtures).
We first consider the lower bound, Ĥ_{C_α}, based on the Chernoff α-divergence distance function. For two multivariate Gaussians p_1 = N(µ_1, Σ_1) and p_2 = N(µ_2, Σ_2), this distance is given by [35]

C_α(p_1 ‖ p_2) = (α(1−α)/2) (µ_1 − µ_2)^T Σ_α^{−1} (µ_1 − µ_2) + (1/2) ln [ det Σ_α / ((det Σ_1)^α (det Σ_2)^{1−α}) ], (14)

where Σ_α := αΣ_1 + (1−α)Σ_2. (As a warning, note that most sources show erroneous expressions for the Chernoff and/or Rényi α-divergence between two multivariate Gaussians, including [27,29,38-40], and even a late draft of this manuscript.) For the upper bound Ĥ_KL, the KL divergence between two multivariate Gaussians is

KL(p_1 ‖ p_2) = (1/2) [ (µ_2 − µ_1)^T Σ_2^{−1} (µ_2 − µ_1) + tr(Σ_2^{−1} Σ_1) − d + ln (det Σ_2 / det Σ_1) ]. (15)

The appropriate lower and upper bounds are found by plugging Equations (14) and (15) into Equation (6).
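Given the warning above about erroneous published expressions, it is worth checking any closed form against direct numerical integration; a one-dimensional sketch (names are ours), under the convention C_α(p ‖ q) = −ln ∫ p(x)^{1−α} q(x)^α dx (conventions differ across sources, which is one source of the errors):

```python
import math

def chernoff_alpha_gauss(alpha, m1, v1, m2, v2):
    """Closed-form Chernoff alpha-divergence between 1-D Gaussians
    N(m1, v1) and N(m2, v2), i.e. -ln ∫ p1^(1-alpha) p2^alpha dx."""
    va = alpha * v1 + (1.0 - alpha) * v2
    return (alpha * (1.0 - alpha) / 2.0) * (m1 - m2) ** 2 / va \
        + 0.5 * math.log(va / (v1 ** alpha * v2 ** (1.0 - alpha)))

def chernoff_alpha_numeric(alpha, m1, v1, m2, v2, lo=-40.0, hi=40.0, n=100_000):
    """Midpoint-rule evaluation of the same integral, as a sanity check."""
    dx = (hi - lo) / n
    total = 0.0
    for k in range(n):
        x = lo + (k + 0.5) * dx
        p1 = math.exp(-0.5 * (x - m1) ** 2 / v1) / math.sqrt(2 * math.pi * v1)
        p2 = math.exp(-0.5 * (x - m2) ** 2 / v2) / math.sqrt(2 * math.pi * v2)
        total += p1 ** (1.0 - alpha) * p2 ** alpha * dx
    return -math.log(total)
```

At α = 0.5 the closed form reduces to the Bhattacharyya distance between the two Gaussians, as expected.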
These bounds have simple forms when all of the mixture components have equal covariance matrices; i.e., Σ_i = Σ for all i. First, define a transformation in which each Gaussian component p_j is mapped to a different Gaussian p̃_{j,α}, which has the same mean but whose covariance matrix is rescaled by 1/(α(1−α)),

p̃_{j,α} := N(µ_j, Σ/(α(1−α))).

Then, the lower bound of Equation (10) can be written as

Ĥ_{C_α}(X) = d/2 + (d/2) ln(α(1−α)) − ∑_i c_i ln ∑_j c_j p̃_{j,α}(µ_i).

This is derived by combining the expressions for C_α, Equation (14), the entropy of a Gaussian, Equation (13), and the Gaussian density function. For a homoscedastic mixture, the tightest lower bound among the Chernoff α-divergences is given by α = 0.5, corresponding to the Bhattacharyya distance,

Ĥ_BD(X) = d/2 − d ln 2 − ∑_i c_i ln ∑_j c_j p̃_j(µ_i),  where p̃_j := N(µ_j, 4Σ).

(This is derived above in Section 3.2.) For the upper bound, when all Gaussians have the same covariance matrix, we again combine the expressions for KL, Equation (15), the entropy of a Gaussian, Equation (13), and the Gaussian density function to give

Ĥ_KL(X) = d/2 − ∑_i c_i ln ∑_j c_j p_j(µ_i) = Ĥ_KDE(X) + d/2.

Note that this is exactly the expression for the kernel density estimator Ĥ_KDE (Equation (3)), plus a dimensional correction. Thus, surprisingly, Ĥ_KDE is a reasonable entropy estimator for homoscedastic Gaussian mixtures, since it is only an additive constant away from the KL-divergence-based estimator Ĥ_KL (which has various beneficial properties, as described above). This may explain why Ĥ_KDE has been used effectively in optimization contexts [4,8-10], where the additive constant is often irrelevant, despite lacking a principled justification in terms of being a bound on entropy.
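The constant-offset relationship between the KL-based estimator and the kernel density estimator for homoscedastic mixtures can be verified directly; a one-dimensional sketch (d = 1, so the offset is 1/2; names are ours):

```python
import math

def norm_pdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def h_kde(weights, means, var):
    """Kernel-density estimate: mixture log-density at the component means."""
    return -sum(ci * math.log(sum(cj * norm_pdf(mi, mj, var)
                                  for cj, mj in zip(weights, means)))
                for ci, mi in zip(weights, means))

def h_kl(weights, means, var):
    """Pairwise KL estimator for an equal-variance 1-D Gaussian mixture,
    where KL(p_i || p_j) = (mu_i - mu_j)^2 / (2 var)."""
    est = 0.5 * math.log(2 * math.pi * math.e * var)  # shared component entropy
    for ci, mi in zip(weights, means):
        inner = sum(cj * math.exp(-(mi - mj) ** 2 / (2.0 * var))
                    for cj, mj in zip(weights, means))
        est -= ci * math.log(inner)
    return est

w, mu, var = [0.2, 0.3, 0.5], [-1.0, 0.5, 2.0], 0.7
```

The two estimates differ by exactly d/2 for any choice of weights, means, and shared variance.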

Estimating Mutual Information
It is often of interest, for example in rate distortion and related problems [11], to calculate the mutual information across a communication channel,

I(U; X) = H(X) − H(X|U),

where U is the distribution of signals sent across the channel, and X is the distribution of messages received on the other end of the channel. As with mixture distributions, it is often easy to compute H(X|U), the entropy of the received signal given the sent signal (i.e., the distribution of noise on the channel). The marginal entropy of the received signals, H(X), on the other hand, is often difficult to compute.
In some cases, the distribution of U may be well approximated by a mixture model. In this case, we can estimate the entropy of the received signals, H(X), using our pairwise distance estimators, as discussed in Section 3. In particular, we have the lower bound

I(U; X) ≥ Ĥ_BD(X) − H(X|U),

where p_i is the density of component i of the received signal X. When each component of U is concentrated at a single signal value, H(X|U) = ∑_i c_i H(p_i), and the H(X|U) terms cancel in the expression, leaving I(U; X) ≥ −∑_i c_i ln ∑_j c_j e^{−BD(p_i ‖ p_j)}.
This also illuminates that the actual pairwise portion of the estimator, −∑_i c_i ln ∑_j c_j exp(−D(p_i ‖ p_j)), is a measure of the mutual information between the random variable specifying the component identity and the random variable distributed as the mixture of the component densities. If the components are identical, this mutual information is zero, since knowing the component identity tells one nothing about the outcome of X. On the other hand, when all of the components are very different from one another, knowing the component that generated X is very informative, giving the maximum amount of information, H(C).
As a practical example, consider a scenario in which U is a random variable representing the outside temperature on any particular day. This temperature is measured with a thermometer with Gaussian measurement noise (the "additive white Gaussian noise channel"). This gives our measurement distribution

X = U + ε,  where ε ~ N(0, Σ) is independent of U.

If the actual temperature U is (approximately or exactly) distributed as a mixture of M Gaussians, each one having mixture weight c_i, mean µ_i, and covariance matrix Σ_i, then X will also be distributed as a mixture of M Gaussians, each with weight c_i, mean µ_i, and covariance matrix Σ̃_i := Σ_i + Σ. We can then use our estimators to estimate the mutual information between the actual temperature, U, and the thermometer measurements, X.
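A sketch of the resulting mutual-information bounds for a one-dimensional version of this thermometer example (the function names and the specific parameter values are ours):

```python
import math

def kl_gauss(m1, v1, m2, v2):
    """KL divergence between 1-D Gaussians N(m1, v1) and N(m2, v2)."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def bd_gauss(m1, v1, m2, v2):
    """Bhattacharyya distance between 1-D Gaussians."""
    v = v1 + v2
    return (m1 - m2) ** 2 / (4.0 * v) + 0.5 * math.log(v / (2.0 * math.sqrt(v1 * v2)))

def mi_bounds(weights, means, comp_vars, noise_var):
    """Bounds on I(U; X) for X = U + eps, eps ~ N(0, noise_var), where U is
    a 1-D Gaussian mixture: X is a mixture with variances v_i + noise_var,
    and H(X|U) is the entropy of the noise."""
    vs = [v + noise_var for v in comp_vars]
    h_noise = 0.5 * math.log(2 * math.pi * math.e * noise_var)
    def estimate(dist):
        h = sum(c * 0.5 * math.log(2 * math.pi * math.e * v)
                for c, v in zip(weights, vs))
        for ci, mi, vi in zip(weights, means, vs):
            inner = sum(cj * math.exp(-dist(mi, vi, mj, vj))
                        for cj, mj, vj in zip(weights, means, vs))
            h -= ci * math.log(inner)
        return h
    return estimate(bd_gauss) - h_noise, estimate(kl_gauss) - h_noise
```

For two well-separated temperature regimes and small measurement noise, the lower and upper bounds essentially coincide, pinning down the channel's mutual information.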

Numerical Results
In this section, we run numerical experiments and compare estimators of mixture entropy under a variety of conditions. We consider two different types of mixtures, mixtures of Gaussians and mixtures of uniform distributions, for a variety of parameter values. We evaluate the following estimators:
1. The true entropy, H(X), as estimated by Monte Carlo sampling of the mixture model. Two thousand samples were used for each MC estimate for the mixtures of Gaussians, and 5000 samples were used for the mixtures of uniform distributions.
2. Our proposed upper bound, based on the KL divergence, Ĥ_KL (Equation (12)).
3. Our proposed lower bound, based on the Bhattacharyya distance, Ĥ_BD (Equation (11)).
4. The kernel density estimate based on the component means, Ĥ_KDE (Equation (3)).
5. The lower bound based on the "Expected Likelihood Kernel", Ĥ_ELK (Equation (4)).
6. The lower bound based on the conditional entropy, H(X|C) (Equation (1)).
7. The upper bound based on the joint entropy, H(X, C) (Equation (2)).
We show the values of estimators 1-5 as line plots, while the region between the conditional entropy (6) and joint entropy (7) is shown shaded in green. The code for these figures can be found at [41], and uses the Gonum numeric library [42].

Mixture of Gaussians
In the first experiment, we evaluate the estimators on a mixture of randomly placed Gaussians, and look at their behavior as the distance between the means of the Gaussians increases. The mixture is composed of 100 10-dimensional Gaussians, each Gaussian distributed as p_i = N(µ_i, I^{(10)}), where I^{(d)} indicates the d × d identity matrix. Means are sampled from µ_i ~ N(0, σ I^{(10)}). Figure 1A depicts the change in estimated entropy as the means grow farther apart, in particular as a function of ln(σ). We see that our proposed bounds are closer to the true entropy than the other estimators over the whole range of σ values, and in the extremes, our bounds approach the exact value of the true entropy. This is as expected, since as σ → 0 all of the Gaussian mixture components become identical, and as σ → ∞ all of the Gaussian components grow very far apart, approaching the case where each Gaussian is in its own "cluster". The ELK lower bound is a strictly worse estimate than Ĥ_BD in this experiment. As expected, the KDE estimator differs by exactly d/2 from the KL estimator.
In the second experiment, we evaluate the entropy estimators as the covariance matrices change from less to more similar. We again generate 100 10-dimensional Gaussians. Each Gaussian is distributed as p_i = N(µ_i, Σ_i), where now µ_i ~ N(0, I^{(10)}) and Σ_i ~ W(1/(10+n) I^{(10)}, n), where W(V, n) is a Wishart distribution with scale matrix V and n degrees of freedom. Figure 1B compares the estimators with the true entropy as a function of ln(n). When n is small, the Wishart distribution is broad and the covariance matrices differ significantly from one another, while as n → ∞, all the covariance matrices become close to the identity I^{(10)}. Thus, for small n, we essentially recover a "clustered" case, in which every component is in its own cluster and our lower and upper bounds give highly accurate estimates. For large n, we converge to the σ = 1 case of the first experiment.
In the third experiment, we again generate a mixture of 100 10-dimensional Gaussians. Now, however, the Gaussians are grouped into five "clusters", with each Gaussian component randomly assigned to one of the clusters. We use g(i) ∈ {1 . . . 5} to indicate the group of each Gaussian component i ∈ {1 . . . 100}, and each of the 100 Gaussians is distributed as p_i = N(µ̃_{g(i)}, I^{(10)}). The cluster centers µ̃_k for k ∈ {1 . . . 5} are drawn from N(0, σ I^{(10)}). The results are depicted in Figure 1C as a function of ln(σ). In the first experiment, we saw that the joint entropy H(X, C) became an increasingly better estimator as the Gaussians grew increasingly far apart. Here, however, we see that there is a significant difference between H(X, C) and the true entropy, even as the groups become increasingly separated. Our proposed bounds, on the other hand, provide accurate estimates of the entropy across the entire parameter sweep. As expected, they become exact in the limit when all clusters are at the same location, as well as when all clusters are very far apart from each other.

Finally, we evaluate the entropy estimators while changing the dimension of the Gaussian components. We again generate 100 Gaussian components, each distributed as p_i = N(µ_i, I^{(d)}), with µ_i ~ N(0, σ I^{(d)}). We vary the dimensionality d from 1 to 60. The results are shown in Figure 1D. First, we see that when d = 1, the KDE estimator and the KL-divergence based estimator give very similar predictions (differing only by 0.5), but as the dimension increases, the two estimates diverge at a rate of d/2. Similarly, Ĥ_ELK grows increasingly less accurate as the dimension increases. Our proposed lower and upper bounds provide good estimates of the mixture entropy across the whole sweep across dimensions.
As previously mentioned, our lower and upper bounds tend to perform best at the "extremes" and worse in the intermediate regimes. In particular, in Figure 1A,C,D, the distances between component means increase from left to right. On the left hand side of these figures, all of the component means are close and the component distributions overlap, as evidenced by the fact that the mixture entropy is ≈ H(X|C), i.e., I(X; C) ≈ 0. In this regime, there is essentially a single "cluster", and our bounds become tight (see Section 3.4). On the right hand side of these figures, the components' means are all far apart from each other, and the mixture entropy is ≈ H(X, C), i.e., I(X; C) ≈ H(C) (in Figure 1C, it is the five clusters that become far apart, and the mixture entropy is ≈ H(X|C) + ln 5). In this regime, where there are many well-separated clusters, our bounds again become tight. In between these two extremes, however, there is no clear clustering of the mixture components, and the entropy bounds are not as tight.
As noted in the previous paragraph, the extremes in three out of the four subfigures approach the perfectly clustered case. In this situation, we show in Appendix B that the BD-based estimator is a better bound on the true entropy than the Expected Likelihood Kernel estimator. We see confirmation of this in the experimental results, where Ĥ_ELK performs worse than the pairwise-distance based estimators.

Mixture of Uniforms
In the second set of experiments, we consider a mixture of uniform distributions. Unlike Gaussians, uniform distributions are bounded within a hyper-rectangle and do not have full support over the domain. In particular, a uniform distribution p = U(a, b) over d dimensions is defined as

p(x) = ∏_{i=1}^d 1{a_i ≤ x_i ≤ b_i} / (b_i − a_i),

where x, a, and b are d-dimensional vectors, and the subscript x_i refers to the value of x on dimension i. Note that when p_X is a mixture of uniforms, there can be significant regions where p_X(x) > 0, but p_i(x) = 0 for some i.
Here, we list the formulae for the pairwise distance measures between uniform distributions. In the following, we use V_i := ∫ 1{p_i(x) > 0} dx to indicate the "volume" of distribution p_i. Uniform components have a constant p(x) over their support, and so p_i(x) = 1/V_i for all x where p_i(x) > 0. Similarly, we use V_{i∩j} for the "volume of overlap" between p_i and p_j, i.e., the volume of the intersection of the supports of p_i and p_j, V_{i∩j} := ∫ 1{p_i(x) > 0} 1{p_j(x) > 0} dx. The distance measures between uniforms, along with the ELK cross term from Equation (4), are then

BD(p_i ‖ p_j) = −ln ( V_{i∩j} / √(V_i V_j) ), (16)

∫ p_i(x) p_j(x) dx = V_{i∩j} / (V_i V_j), (17)

and KL(p_i ‖ p_j) = ln(V_j / V_i) if the support of p_i is contained in the support of p_j, and ∞ otherwise. Like the Gaussian case, we run four different computational experiments and compare the mixture entropy estimates to the true entropy, as determined by Monte Carlo sampling.
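The volume computations are straightforward for axis-aligned boxes; a sketch (names are ours) matching the volume-based formulas above:

```python
import math

def box_volume(a, b):
    """Volume of the axis-aligned box with lower corner a and upper corner b."""
    v = 1.0
    for lo, hi in zip(a, b):
        v *= max(0.0, hi - lo)
    return v

def overlap_volume(a1, b1, a2, b2):
    """Volume of the intersection of two axis-aligned boxes (V_{i∩j})."""
    v = 1.0
    for lo1, hi1, lo2, hi2 in zip(a1, b1, a2, b2):
        v *= max(0.0, min(hi1, hi2) - max(lo1, lo2))
    return v

def bd_uniform(a1, b1, a2, b2):
    """Bhattacharyya distance between two uniform components."""
    vo = overlap_volume(a1, b1, a2, b2)
    if vo == 0.0:
        return math.inf
    return -math.log(vo / math.sqrt(box_volume(a1, b1) * box_volume(a2, b2)))

def kl_uniform(a1, b1, a2, b2):
    """KL divergence: finite only if support(p1) lies inside support(p2)."""
    inside = all(lo2 <= lo1 and hi1 <= hi2
                 for lo1, hi1, lo2, hi2 in zip(a1, b1, a2, b2))
    if not inside:
        return math.inf
    return math.log(box_volume(a2, b2) / box_volume(a1, b1))
```

The infinite-KL case for partially overlapping boxes is exactly what drives the Ĥ_KL behavior observed in the experiments below.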
In the first experiment, the mixture consists of 100 10-dimensional uniform components, with p_i = U(µ_i − 1^{(10)}, µ_i + 1^{(10)}), and µ_i ~ N(0, σ I^{(10)}), where 1^{(d)} refers to a d-dimensional vector of 1s. Figure 2A depicts the change in entropy as a function of ln(σ). For very small σ, the distributions are almost entirely overlapping, while for large σ they tend very far apart. As expected, the entropy increases with σ. Here, we see that the prediction of Ĥ_KL is identical to H(X, C), which arises because KL(p_i ‖ p_j) is infinite whenever the support of p_i is not entirely contained in the support of p_j. Uniform components with equal size and non-equal means must have some region of non-overlap, and so the KL is infinite between all pairs of distinct components; thus, KL is effectively D_max (Equation (8)). In contrast, we see that Ĥ_BD estimates the true entropy quite well. This example demonstrates that getting an accurate estimate of mixture entropy may require selecting a distance function that works well with the component distributions. Finally, it turns out that, for uniform components of equal size, Ĥ_ELK = Ĥ_BD. This can be seen by combining Equations (6) and (16), and comparing to Equation (17) (note that V_i = V_j when the components have equal size).
In the second experiment, we adjust the variance in the size of the uniform components. We again use 100 10-dimensional components, p_i = U(µ_i − γ_i 1^(10), µ_i + γ_i 1^(10)), where µ_i ∼ N(0, I^(10)) and γ_i ∼ Γ(1 + σ, 1 + σ), where Γ(α, β) is the Gamma distribution with shape parameter α and rate parameter β. Figure 2B shows the change in entropy estimates as a function of ln(σ). When σ is small, the sizes have significant spread, while as σ grows the distributions become close to equally sized. We again see that ĤBD is a good estimator of entropy, outperforming all of the other estimators. With unequal sizes, some supports may be entirely contained in others, so ĤKL will not necessarily equal H(X, C), though we find the two to be numerically quite close. In this experiment, we find that the lower and upper bounds specified by ĤBD and ĤKL provide a tight estimate of the true entropy.

In the third experiment, we again consider a clustered mixture, and evaluate the entropy estimators as these clusters grow apart. Here, there are 100 components with p_i = U(μ̃_g(i) − 1^(10), μ̃_g(i) + 1^(10)), where g(i) ∈ {1, . . . , 5} is the randomly assigned cluster identity of component i. The cluster centers μ̃_k for k ∈ {1, . . . , 5} are generated according to N(0, σI^(10)). Figure 2C shows the change in entropy as the cluster locations move apart. Note that, in this case, the upper bound ĤKL significantly outperforms H(X, C), unlike in the first and second experiments, because here components in the same cluster have perfect overlap. We again see that ĤBD provides a relatively accurate lower bound for the true entropy.
In the final experiment, the dimension of the components is varied. There are again 100 components, with p_i = U(µ_i − 1^(d), µ_i + 1^(d)) and µ_i ∼ N(0, σI^(d)). Figure 2D shows the change in entropy as the dimension increases from d = 1 to d = 16. Interestingly, in the low-dimensional case, H(X|C) is a very close estimate of the true entropy, while in the high-dimensional case the entropy becomes very close to H(X, C). This is because in higher dimensions there is more 'space' for the components to be far from each other. As in the first experiment, ĤKL is equal to H(X, C). We again observe that ĤBD provides a tight lower bound on the mixture entropy, regardless of dimension.
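Throughout these experiments, the ground-truth entropy is obtained by Monte Carlo sampling: draw a component, draw a point from it, and average −ln p_X(x). A self-contained one-dimensional sketch for an equally weighted mixture of equal-width uniforms (function name and defaults are ours):

```python
import math
import random

def mc_entropy_uniform_1d(mus, w=1.0, n_samples=50000, seed=0):
    """Monte Carlo estimate of H(X) = E[-ln p_X(x)] for an equally
    weighted 1-D mixture of U(mu_i - w, mu_i + w) components."""
    rng = random.Random(seed)
    n, vol = len(mus), 2.0 * w
    total = 0.0
    for _ in range(n_samples):
        mu = rng.choice(mus)                 # sample a component index...
        x = rng.uniform(mu - w, mu + w)      # ...then a point from that component
        # Mixture density: average of the component densities covering x.
        px = sum(1.0 for m in mus if abs(x - m) < w) / (n * vol)
        total += -math.log(px)
    return total / n_samples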

Discussion
We have presented a new class of estimators for the entropy of a mixture distribution. We have shown that any estimator in this class has a bounded estimation bias, and that this class includes useful lower and upper bounds on the entropy of a mixture. Finally, we showed that these bounds become exact when the mixture components are grouped into well-separated clusters.
Our derivation of the bounds makes use of some existing results [5,31]. However, to our knowledge, these results have not previously been used to estimate mixture entropies. Furthermore, they have not been compared numerically or analytically to better-known bounds.
We evaluated these estimators using numerical simulations of mixtures of Gaussians, as well as mixtures of bounded (hypercube) uniform distributions. Our results demonstrate that our estimators perform much better than existing well-known estimators.
This estimator class can be especially useful for optimization problems that involve minimization or maximization of entropy or mutual information. If the distance function used in the pairwise estimator is continuous and smooth in the parameters of the mixture components, then the entropy estimate is also continuous and smooth. This permits our estimators to be used within gradient-based optimization techniques, such as gradient descent, as is often done in machine learning problems.
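To illustrate this smoothness, the sketch below evaluates the KL-based upper bound ĤKL for an equally weighted one-dimensional Gaussian mixture with shared scale and differentiates it with respect to a component mean (here simply by central finite differences; in practice one would use automatic differentiation). Function names are ours:

```python
import math

def kl_gauss_1d(m1, s1, m2, s2):
    """Closed-form KL divergence between 1-D Gaussians N(m1, s1^2) and N(m2, s2^2)."""
    return math.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2.0 * s2**2) - 0.5

def h_kl_gauss_1d(mus, sigma=1.0):
    """Upper bound H_KL = H(X|C) - sum_i c_i ln sum_j c_j exp(-KL(p_i || p_j))
    for an equally weighted 1-D Gaussian mixture with shared scale sigma."""
    n = len(mus)
    h_given_c = 0.5 * math.log(2.0 * math.pi * math.e * sigma**2)
    total = sum(
        math.log(sum(math.exp(-kl_gauss_1d(mi, sigma, mj, sigma)) for mj in mus) / n)
        for mi in mus) / n
    return h_given_c - total

def grad_h_kl(mus, k, eps=1e-6):
    """Central finite-difference derivative of the bound w.r.t. the k-th mean."""
    up = list(mus); up[k] += eps
    dn = list(mus); dn[k] -= eps
    return (h_kl_gauss_1d(up) - h_kl_gauss_1d(dn)) / (2.0 * eps)
```

When all means coincide, the bound reduces exactly to the entropy of a single Gaussian; for two symmetric means, the gradients are equal and opposite, with each mean pushed away from the other, as expected when separating components increases the mixture entropy.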
In fact, we have used our upper bound to implement a non-parametric, nonlinear version of the "Information Bottleneck" [43]. Specifically, we minimized an upper bound on the mutual information between the input and hidden layer of a neural network [11]. We found that the optimal distributions were often clustered (Section 3.4). That work demonstrated practically the value of having an accurate, differentiable upper bound on mixture entropy that performs well in the clustered regime.
Note that we have not proved that the bounds derived here are the best possible. Identifying better bounds, or proving that our results are optimal within some class of bounds, remains for future work.
where p_k(x) is shorthand for the density of any component in cluster k (recall that all components in the same cluster have equal density). The corresponding upper bound on the mutual information is
MI(X; U) ≤ −∑_i c_i ln ∑_j c_j exp(−KL(p_i‖p_j)).

Figure 1. Entropy estimates for a mixture of 100 Gaussians. In each plot, the vertical axis shows the entropy of the distribution, and the horizontal axis varies a feature of the components: (A) the distance between means is increased; (B) the component covariances become more similar (at the right side of the plot, all Gaussians have covariance matrices approximately equal to the identity matrix); (C) the components are grouped into five "clusters", and the distance between the locations of the clusters is increased; (D) the dimension is increased.

Figure 2. Entropy estimates for a mixture of 100 uniform components. In each plot, the vertical axis shows the entropy of the distribution, and the horizontal axis varies a feature of the components: (A) the distance between means is increased; (B) the component sizes become more similar (at the right side of the plot, all components have approximately the same size); (C) the components are grouped into five "clusters", and the distance between these clusters is increased; (D) the dimension is increased.
ĤC_α = H(X|C) − ∑_i c_i ln ∑_j c_j exp(−C_α(p_i‖p_j))
= −∑_i c_i ∫ p_i(x) ln p_i(x) dx − ∑_i c_i ln ∑_j c_j δ_{g(i),g(j)}
= −∑_k p_G(k) ∫ p_k(x) ln p_k(x) dx − ∑_k p_G(k) ln p_G(k)
≥(a) −∑_k p_G(k) ln ∫ p_k(x)² dx − ∑_k p_G(k) ln p_G(k)
= −∑_k p_G(k) ln ( p_G(k) ∫ p_k(x)² dx )
= ĤELK,
where (a) uses Jensen's inequality.