Guaranteed bounds on the Kullback-Leibler divergence of univariate mixtures using piecewise log-sum-exp inequalities

The Kullback-Leibler divergence between two mixture models is a core primitive in many signal processing tasks. Since the Kullback-Leibler divergence of mixtures does not admit a closed-form formula, it is in practice either estimated using costly Monte-Carlo stochastic integration or approximated using various techniques. We present a fast and generic method that algorithmically builds closed-form lower and upper bounds on the entropy, the cross-entropy and the Kullback-Leibler divergence of mixtures. We illustrate the versatility of the method by reporting on our experiments approximating the Kullback-Leibler divergence between univariate exponential mixtures, Gaussian mixtures and Rayleigh mixtures.


Introduction
Mixture models are commonly used in signal processing. A typical scenario is to use mixture models [10,18,12] to smoothly model histograms. For example, Gaussian Mixture Models (GMMs) can be used to convert grey-valued images into binary images by building a GMM fitting the image intensity histogram and then choosing the threshold as the average of the Gaussian means [10] to binarize the image. Similarly, Rayleigh Mixture Models (RMMs) are often used in ultrasound imagery [18] to model histograms and to perform segmentation by classification. When using mixture models, a fundamental primitive is to define a proper statistical distance between them. The Kullback-Leibler divergence [5], also called relative entropy, is the most commonly used distance: Let m(x) = ∑_{i=1}^k w_i p_i(x) and m'(x) = ∑_{i=1}^{k'} w'_i p'_i(x) be two finite statistical mixtures of k and k' components, respectively. In statistics, the mixture components p_i(x) are often parametric: p_i(x) = p_i(x; θ_i), where θ_i is a parameter vector. For example, a mixture of Gaussians has each component parameterized by its mean µ_i and its covariance matrix Σ_i (so that θ_i = (µ_i, Σ_i)). Let X be the support of the component distributions, and denote by H×(m, m') = −∫_X m(x) log m'(x) dx the cross-entropy [5]. Then the Kullback-Leibler (KL) divergence between two continuous mixture densities m and m' is given by

KL(m : m') = ∫_X m(x) log (m(x)/m'(x)) dx = H×(m, m') − H(m),

with H(m) = H×(m, m) = −∫ m(x) log m(x) dx denoting the Shannon entropy [5]. The notation ":" is used instead of the usual comma to emphasize that the distance is not a metric distance: it is not symmetric (KL(m : m') ≠ KL(m' : m)), and it further does not satisfy the triangle inequality [5] (KL(m : m') + KL(m' : m'') ≥ KL(m : m'') may fail). Although the KL divergence is available in closed form for many distributions (in particular as Bregman divergences for exponential families [2]), it was proven that the Kullback-Leibler divergence between two (univariate) GMMs is not analytic [22].
Thus many approximation techniques have been designed to beat the computationally costly Monte-Carlo (MC) stochastic estimation

KL_s(m : m') = (1/s) ∑_{i=1}^s log (m(x_i)/m'(x_i)),   x_1, ..., x_s ∼ m(x)

(s independently and identically distributed samples). The MC estimator is asymptotically consistent, lim_{s→∞} KL_s(m : m') = KL(m : m'), so that the "true value" of the KL of mixtures is estimated in practice by taking a very large sample (say, s = 10^9). However, we point out that the MC estimator is a stochastic approximation, and therefore does not guarantee deterministic bounds. We refer to [11,23,8,17] for the current state-of-the-art approximation techniques and bounds on the KL of GMMs.
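As a concrete illustration of the MC estimator above, here is a minimal pure-Python sketch for univariate GMMs (all function names and the mixture representation are ours; in practice one would vectorize this with NumPy):

```python
import math
import random

def gmm_pdf(x, mix):
    """Density of a univariate Gaussian mixture mix = (weights, means, sigmas)."""
    weights, mus, sigmas = mix
    return sum(w * math.exp(-(x - mu) ** 2 / (2.0 * s * s)) / (s * math.sqrt(2.0 * math.pi))
               for w, mu, s in zip(weights, mus, sigmas))

def gmm_sample(mix, rng):
    """Draw one sample: pick a component by its weight, then sample that Gaussian."""
    weights, mus, sigmas = mix
    u, acc = rng.random(), 0.0
    for w, mu, s in zip(weights, mus, sigmas):
        acc += w
        if u <= acc:
            return rng.gauss(mu, s)
    return rng.gauss(mus[-1], sigmas[-1])  # guard against rounding of the weights

def kl_monte_carlo(m, mp, n=100000, seed=0):
    """Stochastic estimate KL_s(m : m') = (1/s) sum_i log(m(x_i)/m'(x_i)), x_i ~ m."""
    rng = random.Random(seed)
    return sum(math.log(gmm_pdf(x, m) / gmm_pdf(x, mp))
               for x in (gmm_sample(m, rng) for _ in range(n))) / n
```

The estimate fluctuates from run to run and gives no deterministic guarantee, which is precisely the shortcoming the bounds of this note address.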
In information geometry [1], a mixture family is the set of convex combinations m(x; η) = ∑_{i=1}^k η_i p_i(x) of linearly independent probability distributions p_1(x), ..., p_k(x). Let us emphasize that in information geometry [1] the "components" p_i(x) are not parametric. A mixture family induces a dually flat space in which the Kullback-Leibler divergence is equivalent to a Bregman divergence [2] defined on the η-parameters. However, the Bregman convex generator F(η) = ∫ m(x; η) log m(x; η) dx is not available in closed form for mixtures. The family of multinomial distributions is both a mixture family (with closed-form KL(m : m'), the discrete KL [5]) and an exponential family [1]. In this note, we present a simple and efficient method that algorithmically builds a closed-form formula guaranteeing both a lower bound and an upper bound on the KL divergence within an additive factor of log k + log k'. To illustrate our generic technique, we demonstrate it by finding bounds for the KL of Exponential Mixture Models (EMMs), GMMs and RMMs.
The paper is organized as follows: Section 2 describes the algorithmic construction of the formula using piecewise log-sum-exp inequalities. Section 3 reports on experiments on several mixture families. Finally, Section 4 concludes this work by discussing extensions to other statistical distances.

A generic algorithm
Since the cross-entropy of two mixtures

H×(m, m') = −∫_X (∑_{i=1}^k w_i p_i(x)) log (∑_{j=1}^{k'} w'_j p'_j(x)) dx

has a log-sum term, we shall use bounds on the log-sum-exp (lse) function [4,21] lse(x_1, ..., x_l) = log ∑_{i=1}^l e^{x_i}. We have:

max{x_i}_{i=1}^l < lse(x_1, ..., x_l) ≤ max{x_i}_{i=1}^l + log l.

The lhs inequality holds because ∑_{i=1}^l e^{x_i} > max_i e^{x_i} = e^{max_i x_i}, and the rhs inequality follows from the fact that ∑_{i=1}^l e^{x_i} ≤ l max_i e^{x_i}. The lse function is convex and enjoys the shift identity lse(x_1, ..., x_l) = c + lse(x_1 − c, ..., x_l − c) for any constant c. Similarly, we can also lower bound the lse function by log l + min{x_i}_{i=1}^l. (Note that we could write equivalently that for l positive numbers y_1, ..., y_l, max_i log y_i < log ∑_{i=1}^l y_i ≤ max_i log y_i + log l.) Therefore we shall bound the term ∫_X m(x) log(∑_{j=1}^{k'} w'_j p'_j(x)) dx using piecewise lse inequalities. Using the log-sum-exp inequalities, we get bounds L×(m : m') ≤ H×(m, m') ≤ U×(m : m') whose gap is the constant log k', both expressed through the term A(m : m') = −∫_X m(x) log max_j{w'_j p'_j(x)} dx. In order to calculate A(m : m'), let us compute the upper envelope E(x) = max{w'_j p'_j(x)}_{j=1}^{k'} of the k' real-valued weighted component functions defined on the support X. This upper envelope can be computed exactly using techniques of computational geometry [6,19], provided that we can calculate the roots of the equation w'_r p'_r(x) = w'_s p'_s(x) equalizing two weighted components. (Although this amounts to solving quadratic equations for Gaussian or Rayleigh distributions, the roots may not always be available in closed form, for example for Weibull distributions.) Let the upper envelope be combinatorially defined by m elementary pieces on support intervals I_r = (a_r, a_{r+1}) partitioning the support X = ∪_{r=1}^m I_r (with a_1 = min X and a_{m+1} = max X). Observe that on each interval I_r, the maximum max{w'_j p'_j(x)}_{j=1}^{k'} is attained by w'_{δ(r)} p'_{δ(r)}(x), where δ(r) indicates the dominating weighted component (the arg max of {w'_j p'_j(x)}_{j=1}^{k'} for any x ∈ I_r). To fix ideas, when the mixture components are univariate Gaussians, computing the upper envelope E(x) amounts equivalently to finding the lower envelope of k' parabolas (the negated log weighted densities), which has linear complexity and can be computed in O(k' log k')-time [7], or in output-sensitive time O(k' log m) [14].
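To illustrate the envelope computation, the following sketch finds the pairwise intersection points of weighted univariate Gaussian components by solving the quadratic equation obtained from log w_r p_r(x) = log w_s p_s(x), and partitions an interval by the dominating component. For simplicity it uses a naive all-pairs pass rather than the O(k' log k') lower-envelope algorithm of [7]; all names are ours:

```python
import math

def log_weighted_gauss(x, w, mu, sigma):
    """log(w * N(x; mu, sigma)) up to the common constant -log sqrt(2*pi)."""
    return math.log(w) - math.log(sigma) - (x - mu) ** 2 / (2.0 * sigma * sigma)

def intersections(c1, c2):
    """Roots of w1 p1(x) = w2 p2(x) for weighted Gaussian components c = (w, mu, sigma)."""
    (w1, m1, s1), (w2, m2, s2) = c1, c2
    a = 1.0 / (2.0 * s2 * s2) - 1.0 / (2.0 * s1 * s1)
    b = m1 / (s1 * s1) - m2 / (s2 * s2)
    c = m2 * m2 / (2.0 * s2 * s2) - m1 * m1 / (2.0 * s1 * s1) + math.log(w1 * s2 / (w2 * s1))
    if abs(a) < 1e-12:                 # (near-)equal variances: the equation is linear
        return [] if abs(b) < 1e-12 else [-c / b]
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return []
    r = math.sqrt(disc)
    return sorted([(-b - r) / (2.0 * a), (-b + r) / (2.0 * a)])

def upper_envelope(components, lo, hi):
    """Elementary intervals (a, b, arg-max component index) covering [lo, hi]."""
    cuts = {lo, hi}
    for i in range(len(components)):
        for j in range(i + 1, len(components)):
            cuts.update(x for x in intersections(components[i], components[j]) if lo < x < hi)
    pts = sorted(cuts)
    return [(a, b, max(range(len(components)),
                       key=lambda t: log_weighted_gauss(0.5 * (a + b), *components[t])))
            for a, b in zip(pts, pts[1:])]
```

For two equal-weight, unit-variance components centered at 0 and 4, the unique crossing is at x = 2, yielding two elementary intervals with dominating components 0 and 1.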
When the Gaussian mixture components all have the same weight and variance (e.g., kernel density estimators), computing the upper envelope amounts to finding a lower envelope of cones min_j |x − µ_j| (a Voronoi diagram in arbitrary dimension).
We need to calculate two types of definite integrals on those elementary intervals: (i) the probability mass of a component in an interval, ∫_a^b p(x) dx = Φ(b) − Φ(a), where Φ denotes the Cumulative Distribution Function (CDF), and (ii) the partial cross-entropy −∫_a^b p(x) log p'(x) dx [15]. Thus let us define M_i(a, b) = ∫_a^b p_i(x) dx = Φ_i(b) − Φ_i(a) and C_{i,j}(a, b) = −∫_a^b p_i(x) log p'_j(x) dx. Then we get the term A(m : m') as ∑_{r=1}^m ∑_{i=1}^k w_i (−M_i(a_r, a_{r+1}) log w'_{δ(r)} + C_{i,δ(r)}(a_r, a_{r+1})).
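For univariate Gaussian components, both integral types admit closed forms through the error function and truncated moments. The sketch below (our own naming; the truncated second moment follows from the identity d/dz(−zφ(z)) = (z² − 1)φ(z) for the standard normal density φ) computes the mass M_i(a, b) and the partial cross-entropy C_{i,j}(a, b):

```python
import math

def gauss_cdf(x, mu, sigma):
    """CDF Phi(x) of N(mu, sigma^2), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def std_pdf(z):
    """Standard normal density phi(z)."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def mass(a, b, mu, sigma):
    """M(a, b) = int_a^b p(x) dx = Phi(b) - Phi(a): probability mass on [a, b]."""
    return gauss_cdf(b, mu, sigma) - gauss_cdf(a, mu, sigma)

def partial_cross_entropy(a, b, p, q):
    """C(a, b) = -int_a^b p(x) log q(x) dx for Gaussians p = (mu1, s1), q = (mu2, s2),
    using the truncated zeroth/first/second moments of p on [a, b]."""
    (mu1, s1), (mu2, s2) = p, q
    al, be = (a - mu1) / s1, (b - mu1) / s1
    m0 = mass(a, b, mu1, s1)
    m1 = mu1 * m0 + s1 * (std_pdf(al) - std_pdf(be))
    m2 = (mu1 * mu1 + s1 * s1) * m0 + 2.0 * mu1 * s1 * (std_pdf(al) - std_pdf(be)) \
         - s1 * s1 * (be * std_pdf(be) - al * std_pdf(al))
    # -log q(x) = log(s2 * sqrt(2*pi)) + (x - mu2)^2 / (2 s2^2)
    quad = m2 - 2.0 * mu2 * m1 + mu2 * mu2 * m0
    return m0 * math.log(s2 * math.sqrt(2.0 * math.pi)) + quad / (2.0 * s2 * s2)
```

Over (effectively) the whole real line, C recovers the well-known Gaussian cross-entropy log(σ_2 √(2π)) + (σ_1² + (µ_1 − µ_2)²)/(2σ_2²), a useful sanity check.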
The size of the lower/upper bound formula depends on the complexity of the upper envelope and on the closed-form expressions of the integral terms C_{i,j}(a, b) and M_i(a, b). In general, when weighted component densities intersect pairwise in at most p points, the envelope complexity m is related to Davenport-Schinzel sequences [20]: it is quasi-linear for bounded p = O(1), see [20].
Note that in symbolic computing, the Risch semi-algorithm [3] solves the problem of computing indefinite integrals in terms of elementary functions, provided that there exists an oracle (hence the term semi-algorithm) for checking whether an expression is equivalent to zero (it is unknown whether an algorithm implementing this oracle exists).
We presented the technique by bounding the cross-entropy (and the entropy) to deliver lower/upper bounds on the KL divergence. When only the KL divergence needs to be bounded, we rather consider the ratio term m(x)/m'(x). This requires partitioning the support X into elementary intervals by overlaying the critical points of both the lower and upper envelopes of m(x) and m'(x). On a given elementary interval, since max(k min_i{w_i p_i(x)}, max_i{w_i p_i(x)}) ≤ m(x) ≤ k max_i{w_i p_i(x)} (and similarly for m'(x)), we then consider the corresponding piecewise inequalities on the log-ratio log(m(x)/m'(x)). We now need to compute definite integrals of the form ∫_a^b w_1 p(x; θ_1) log (w_2 p(x; θ_2)/(w_3 p(x; θ_3))) dx. (Thus for exponential families, the ratio of densities removes the auxiliary carrier measure term.) Next, we instantiate this method for the prominent cases of exponential mixture models, Gaussian mixture models and Rayleigh mixture models, often used to model intensity histograms in image [10] and ultrasound [18] processing, respectively.
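Before specializing, the end-to-end guarantee of this section can be sketched numerically. In the standalone example below (our own naming), the closed-form interval integrals are replaced by plain quadrature of the envelope term on a truncated support, so it illustrates the deterministic-bound property rather than the closed-form algorithm itself:

```python
import math

SQRT2PI = math.sqrt(2.0 * math.pi)

def log_comp(x, w, mu, sigma):
    """log of one weighted component, log(w * N(x; mu, sigma))."""
    return math.log(w) - math.log(sigma * SQRT2PI) - (x - mu) ** 2 / (2.0 * sigma * sigma)

def log_mix(x, mix):
    """log m(x) computed stably with the log-sum-exp trick."""
    ls = [log_comp(x, w, mu, s) for w, mu, s in zip(*mix)]
    top = max(ls)
    return top + math.log(sum(math.exp(l - top) for l in ls))

def log_env(x, mix):
    """log of the upper envelope max_j { w_j p_j(x) }."""
    return max(log_comp(x, w, mu, s) for w, mu, s in zip(*mix))

def simpson(f, lo, hi, n=4000):
    """Composite Simpson quadrature with n (even) subintervals."""
    h = (hi - lo) / n
    s = f(lo) + f(hi)
    for i in range(1, n):
        s += (4 if i % 2 else 2) * f(lo + i * h)
    return s * h / 3.0

def kl_bounds(m, mp, lo, hi):
    """Deterministic bounds on KL(m : m'); the gap is exactly log k + log k'.
    A(m : m') = -int m log max_j{w'_j p'_j} bounds H_x(m, m') within [A - log k', A],
    and A(m : m) bounds H(m) likewise, so the difference bounds the KL."""
    a_cross = simpson(lambda x: -math.exp(log_mix(x, m)) * log_env(x, mp), lo, hi)
    a_self = simpson(lambda x: -math.exp(log_mix(x, m)) * log_env(x, m), lo, hi)
    k, kp = len(m[0]), len(mp[0])
    return a_cross - a_self - math.log(kp), a_cross - a_self + math.log(k)
```

A reference quadrature of ∫ m log(m/m') always falls inside the returned interval, whose width is log k + log k' by construction.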

The case of Rayleigh mixture models
A Rayleigh distribution has density p(x; σ) = (x/σ²) e^{−x²/(2σ²)} defined on the support X = [0, ∞) for a scale parameter σ > 0. Any two weighted components w_1 p(x; σ_1) and w_2 p(x; σ_2) (with σ_1 ≠ σ_2) intersect at x_0 = 0 and can have at most one other intersection point, x_1 = √((2σ_1²σ_2²/(σ_1² − σ_2²)) log(w_2σ_1²/(w_1σ_2²))), when the square root is well defined. The probability mass M_i(a, b) follows from the Rayleigh CDF Φ(x; σ) = 1 − e^{−x²/(2σ²)}, but the last term of the partial cross-entropy C_{i,j}(a, b) does not have a simple closed form (it requires the exponential integral Ei); one needs a numerical integrator to compute it.
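The intersection point stated above can be checked numerically with a short sketch (our own naming; σ_1 ≠ σ_2 is assumed, as in the text):

```python
import math

def rayleigh_pdf(x, sigma):
    """Rayleigh density p(x; sigma) = (x / sigma^2) exp(-x^2 / (2 sigma^2)) for x >= 0."""
    return (x / (sigma * sigma)) * math.exp(-x * x / (2.0 * sigma * sigma))

def rayleigh_intersection(w1, s1, w2, s2):
    """Positive root of w1 p(x; s1) = w2 p(x; s2) for s1 != s2, or None when the
    components only meet at x0 = 0 (the square root is not well defined)."""
    t = (2.0 * s1 * s1 * s2 * s2 / (s1 * s1 - s2 * s2)) \
        * math.log(w2 * s1 * s1 / (w1 * s2 * s2))
    return math.sqrt(t) if t > 0.0 else None
```

Taking logarithms of the two weighted densities and cancelling the common factor x yields this closed form directly, matching the quadratic-equation remark of Section 2.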

Experiments
We perform an empirical study to verify our theoretical bounds. We build four mixture models {RMM_1, RMM_2, GMM_1, GMM_2} as the test subjects (see Fig. 2); the component type is implied by the model name. Our objective is to measure their KL divergences quickly and accurately. We compare the proposed bounds with Monte-Carlo estimation for sample sizes in the range {10, 10², 10³, 10⁴, 10⁵}. For each sample size, the Monte-Carlo estimation is repeated 100 times to gather statistics. Fig. 3 shows the estimation results of both methods. We can loosely consider the average Monte-Carlo output at the largest sample size (10⁵) as the underlying truth, which is clearly inside our bounds. This serves as an empirical justification of the correctness of the bounds. A key observation is that the bounds can be very tight, e.g., in the upper-left and lower-right sub-figures. This is because the gap between the lower and upper bounds is always guaranteed to be within log k + log k'. Because the KL divergence is an unbounded measure [5], two mixture models may in general have a large KL divergence, in which case our approximation gap is relatively very small. We also observed that the bounds in the lower-left sub-figure are not as tight as in the other cases: when the underlying KL divergence is small, the bounds are not as informative. Note that our bounds are deterministic and must contain the true value, whereas Monte-Carlo estimation gives no guarantee on where the true value lies. For example, in the lower-right sub-figure, Monte-Carlo estimation based on 10³ samples can go beyond our bounds, and therefore has a larger estimation error.

Concluding remarks
We have presented a fast, versatile method to compute bounds on the Kullback-Leibler divergence between mixtures by algorithmically building closed-form formulas. We reported on our experiments for Gaussian mixture models and Rayleigh mixture models. For univariate GMMs, we get a guaranteed bound on the KL divergence of two mixtures m and m' with k and k' components within an additive approximation factor of log k + log k' in O((k + k') log(k + k'))-time. Therefore, the larger the KL divergence, the better the bound when considering a multiplicative (1 + α)-approximation factor, since α = (log k + log k')/KL(m : m'). Our technique also yields bounds for the Jeffreys divergence (the symmetrized KL divergence J(m, m') = KL(m : m') + KL(m' : m)) and the Jensen-Shannon divergence [13] JS(m, m') = (1/2)(KL(m : (m + m')/2) + KL(m' : (m + m')/2)), since (m + m')/2 is a mixture model with k + k' components. One advantage of this statistical distance is that it is symmetric, always bounded by log 2, and its square root yields a metric distance [9]. The log-sum-exp inequalities may also be used to bound some Rényi divergences [16] when α is an integer, m(x) a mixture and p(x) a single (component) distribution. Getting fast guaranteed tight bounds on statistical distances between mixtures opens many avenues. For example, we may consider building hierarchical mixture models by iteratively merging two mixture components, choosing the pair of components so that the KL divergence between the full mixture and the simplified mixture is minimized. A Python code implementing the method for reproducible research is available online at https://www.lix.polytechnique.fr/~nielsen/KLGMM/