# Fast Approximations of the Jeffreys Divergence between Univariate Gaussian Mixtures via Mixture Conversions to Exponential-Polynomial Distributions

## Abstract


## 1. Introduction

#### 1.1. Statistical Mixtures and Statistical Divergences

#### 1.2. Jeffreys Divergence between Densities of an Exponential Family

#### 1.3. A Simple Approximation Heuristic

- Simplify GMMs ${m}_{i}$ into ${p}^{{\overline{\eta}}_{i}^{\mathrm{MLE}}}$, and approximately convert the ${\overline{\eta}}_{i}^{\mathrm{MLE}}$’s into ${\tilde{\theta}}_{i}^{\mathrm{MLE}}$’s. Then approximate the Jeffreys divergence as $${D}_{J}[{m}_{1},{m}_{2}]\simeq {\tilde{\mathrm{\Delta}}}_{J}^{\mathrm{MLE}}[{m}_{1},{m}_{2}]:={({\tilde{\theta}}_{2}^{\mathrm{MLE}}-{\tilde{\theta}}_{1}^{\mathrm{MLE}})}^{\top}({\overline{\eta}}_{2}^{\mathrm{MLE}}-{\overline{\eta}}_{1}^{\mathrm{MLE}}).$$
- Simplify GMMs ${m}_{i}$ into ${p}_{{\overline{\theta}}_{i}^{\mathrm{SME}}}$, and approximately convert the ${\overline{\theta}}_{i}^{\mathrm{SME}}$’s into ${\tilde{\eta}}_{i}^{\mathrm{SME}}$’s. Then approximate the Jeffreys divergence as $${D}_{J}[{m}_{1},{m}_{2}]\simeq {\tilde{\mathrm{\Delta}}}_{J}^{\mathrm{SME}}[{m}_{1},{m}_{2}]:={({\overline{\theta}}_{2}^{\mathrm{SME}}-{\overline{\theta}}_{1}^{\mathrm{SME}})}^{\top}({\tilde{\eta}}_{2}^{\mathrm{SME}}-{\tilde{\eta}}_{1}^{\mathrm{SME}}).$$
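Both heuristics share the same final step: an inner product between a natural-parameter difference and a moment-parameter difference. A minimal numerical sketch (illustrative code, not the paper's implementation; for single Gaussians viewed as order-2 PEDs the inner-product formula is exact):

```python
import numpy as np

def jeffreys_from_parameters(theta1, eta1, theta2, eta2):
    """Approximate D_J[m1, m2] by (theta2 - theta1)^T (eta2 - eta1), where
    theta_i are (approximate) natural parameters and eta_i are (approximate)
    moment parameters of the converted PEDs."""
    theta1, eta1 = np.asarray(theta1), np.asarray(eta1)
    theta2, eta2 = np.asarray(theta2), np.asarray(eta2)
    return float((theta2 - theta1) @ (eta2 - eta1))

# Sanity check with two univariate Gaussians seen as order-2 PEDs:
# N(mu, s2) has theta = (mu/s2, -1/(2 s2)) and eta = (mu, mu^2 + s2).
def gaussian_theta_eta(mu, s2):
    return np.array([mu / s2, -0.5 / s2]), np.array([mu, mu * mu + s2])

t1, e1 = gaussian_theta_eta(0.0, 1.0)
t2, e2 = gaussian_theta_eta(1.0, 1.0)
# For Gaussians the formula is exact: D_J[N(0,1), N(1,1)] = 1.
print(jeffreys_from_parameters(t1, e1, t2, e2))  # -> 1.0
```

For general GMMs the parameters above are replaced by the MLE/SME conversions of Section 2, and the formula becomes an approximation rather than an identity.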

#### 1.4. Contributions and Paper Outline

- We explain how to convert any continuous density $r\left(x\right)$ (including GMMs) into a polynomial exponential density in Section 2 using integral-based extensions of the Maximum Likelihood Estimator [22] (MLE estimates in the moment parameter space $H$, Theorem 1 and Corollary 1) and the Score Matching Estimator [27] (SME estimates in the natural parameter space $\Theta $, Theorem 3). We show a connection between the SME and the Moment Linear System Estimator [28] (MLSE).
- We show how to approximate the Jeffreys divergence between GMMs using a pair of natural/moment parameter PED conversions, and present experimental results in Section 4 that display a performance gain of several orders of magnitude when compared to the vanilla Monte Carlo estimator. We observe that the quality of the approximations depends on the number of modes of the GMMs [43]. However, calculating or counting the modes of a GMM is a difficult problem in its own right [43].

## 2. Converting Finite Mixtures to Exponential Family Densities

#### 2.1. Conversion Using the Moment Parameterization (MLE)

**Theorem 1.**

**Corollary 1.**

#### 2.2. Converting to a PEF Using the Natural Parameterization (SME)

#### Integral-Based Score Matching Estimator (SME)

- Generic solution: It can be shown that for exponential families [47], we obtain the following closed-form solution: $${\theta}_{\mathrm{SME}}\left(r\right)=-{\left({E}_{r}\left[A\left(x\right)\right]\right)}^{-1}\times \left({E}_{r}\left[b\left(x\right)\right]\right),$$ where $$A\left(x\right):={\left[{t}_{i}^{\prime}\left(x\right){t}_{j}^{\prime}\left(x\right)\right]}_{ij},\qquad b\left(x\right):={[{t}_{1}^{\prime\prime}\left(x\right)\phantom{\rule{4pt}{0ex}}\dots \phantom{\rule{4pt}{0ex}}{t}_{D}^{\prime\prime}\left(x\right)]}^{\top}.$$

**Theorem 2.**

- Solution instantiated for polynomial exponential families: For polynomial exponential families of order $D$, we have ${t}_{i}^{\prime}\left(x\right)=i{x}^{i-1}$ and ${t}_{i}^{\prime\prime}\left(x\right)=i(i-1){x}^{i-2}$, and therefore $${A}_{D}={E}_{r}\left[A\left(x\right)\right]={\left[ij\phantom{\rule{0.166667em}{0ex}}{\mu}_{i+j-2}\left(r\right)\right]}_{ij},\qquad {b}_{D}={E}_{r}\left[b\left(x\right)\right]={\left[j(j-1)\phantom{\rule{0.166667em}{0ex}}{\mu}_{j-2}\left(r\right)\right]}_{j}.$$ Thus, the integral-based SME of a density $r$ is $${\theta}_{\mathrm{SME}}\left(r\right)=-{\left({\left[ij{\mu}_{i+j-2}\left(r\right)\right]}_{ij}\right)}^{-1}\times {\left[j(j-1){\mu}_{j-2}\left(r\right)\right]}_{j}.$$ For example, matrix ${A}_{4}$ is $$\left[\begin{array}{cccc}{\mu}_{0}& 2{\mu}_{1}& 3{\mu}_{2}& 4{\mu}_{3}\\ 2{\mu}_{1}& 4{\mu}_{2}& 6{\mu}_{3}& 8{\mu}_{4}\\ 3{\mu}_{2}& 6{\mu}_{3}& 9{\mu}_{4}& 12{\mu}_{5}\\ 4{\mu}_{3}& 8{\mu}_{4}& 12{\mu}_{5}& 16{\mu}_{6}\end{array}\right].$$
- Faster PEF solutions using Hankel matrices: The method of Cobb et al. [28] (1983) anticipated the score matching method of Hyvärinen (2005). It can be derived from Stein’s lemma for exponential families [50]. The integral-based score matching method is consistent, i.e., if $r={p}_{\theta}$, then ${\overline{\theta}}_{\mathrm{SME}}=\theta $: the probabilistic proof for $r\left(x\right)={p}_{\theta}\left(x\right)$ is reported as Theorem 2 of [28]. The integral-based proof relies on the property that arbitrary-order partial mixed derivatives can be obtained from higher-order partial derivatives with respect to ${\theta}_{1}$ [29]: $${\partial}_{1}^{{i}_{1}}\dots {\partial}_{D}^{{i}_{D}}F\left(\theta \right)={\partial}_{1}^{{\sum}_{j=1}^{D}j{i}_{j}}F\left(\theta \right).$$ The complexity of the direct SME method is $O\left({D}^{3}\right)$, as it requires inverting the $D\times D$ matrix ${A}_{D}$. We show how to lower this complexity by reporting an equivalent method (originally presented in [28]) that relies on recurrence relations between the moments of ${p}_{\theta}\left(x\right)$ for PEDs. Recall that ${\mu}_{l}\left(r\right)$ denotes the $l$-th raw moment ${E}_{r}\left[{x}^{l}\right]$. Let ${A}^{\prime}={\left[{a}_{i+j-2}^{\prime}\right]}_{ij}$ denote the $D\times D$ symmetric matrix with ${a}_{i+j-2}^{\prime}\left(r\right)={\mu}_{i+j-2}\left(r\right)$ (with ${a}_{0}^{\prime}\left(r\right)={\mu}_{0}\left(r\right)=1$), and ${b}^{\prime}={\left[{b}_{i}^{\prime}\right]}_{i}$ the $D$-dimensional vector with ${b}_{i}^{\prime}\left(r\right)=(i+1){\mu}_{i}\left(r\right)$. We solve the system ${A}^{\prime}\beta ={b}^{\prime}$ to obtain $\beta ={{A}^{\prime}}^{-1}{b}^{\prime}$.
We then obtain the natural parameter ${\overline{\theta}}_{\mathrm{SME}}$ from the vector $\beta $ as $${\overline{\theta}}_{\mathrm{SME}}=\left[\begin{array}{c}-\frac{{\beta}_{1}}{2}\\ \vdots \\ -\frac{{\beta}_{i}}{i+1}\\ \vdots \\ -\frac{{\beta}_{D}}{D+1}\end{array}\right].$$ Now, if we inspect the matrix ${A}_{D}^{\prime}=\left[{\mu}_{i+j-2}\left(r\right)\right]$, we find that it is a Hankel matrix: a Hankel matrix has constant anti-diagonals and can be inverted in quadratic time [51,52] instead of the cubic time required for a general $D\times D$ matrix. (The inverse of a Hankel matrix is a Bezoutian matrix [53].) Moreover, a Hankel matrix can be stored using linear memory (store its $2D-1$ coefficients) instead of the quadratic memory of a regular matrix. For example, matrix ${A}_{4}^{\prime}$ is $${A}_{4}^{\prime}=\left[\begin{array}{cccc}{\mu}_{0}& {\mu}_{1}& {\mu}_{2}& {\mu}_{3}\\ {\mu}_{1}& {\mu}_{2}& {\mu}_{3}& {\mu}_{4}\\ {\mu}_{2}& {\mu}_{3}& {\mu}_{4}& {\mu}_{5}\\ {\mu}_{3}& {\mu}_{4}& {\mu}_{5}& {\mu}_{6}\end{array}\right],$$ and in general $${A}_{d}^{\prime}:={\left[{\mu}_{i+j-2}\right]}_{ij}=\left[\begin{array}{cccc}{\mu}_{0}& {\mu}_{1}& \dots & {\mu}_{d}\\ {\mu}_{1}& {\mu}_{2}& \dots & \vdots \\ \vdots & & \ddots & \vdots \\ {\mu}_{d}& \dots & \dots & {\mu}_{2d}\end{array}\right],\qquad {A}_{d}^{\prime}=:\mathrm{Hankel}({\mu}_{0},{\mu}_{1},\dots ,{\mu}_{2d}).$$ In statistics, those matrices ${A}_{d}^{\prime}$ are called moment matrices and are well-studied [54,55,56].
The variance $\mathrm{Var}\left[X\right]$ of a random variable $X$ can be expressed as the determinant of the order-2 moment matrix: $$\mathrm{Var}\left[X\right]=E\left[{(X-\mu )}^{2}\right]=E\left[{X}^{2}\right]-E{\left[X\right]}^{2}={\mu}_{2}-{\mu}_{1}^{2}=\mathrm{det}\left(\left[\begin{array}{cc}1& {\mu}_{1}\\ {\mu}_{1}& {\mu}_{2}\end{array}\right]\right)\ge 0.$$ This observation yields a generalization of the notion of variance to $d+1$ random variables via the order-$d$ moment matrix: ${X}_{1},\dots ,{X}_{d+1}{\sim}_{\mathrm{iid}}{F}_{X}\Rightarrow E\left[{\prod}_{j>i}{({X}_{i}-{X}_{j})}^{2}\right]=(d+1)!\phantom{\rule{0.166667em}{0ex}}\mathrm{det}\left({A}_{d}^{\prime}\right)\ge 0$. The variance itself can be expressed as $E\left[\frac{1}{2}{({X}_{1}-{X}_{2})}^{2}\right]$ for ${X}_{1},{X}_{2}{\sim}_{\mathrm{iid}}{F}_{X}$. See [57] (Chapter 5) for a detailed description related to U-statistics. For GMMs $r$, the raw moments ${\mu}_{l}\left(r\right)$ needed to build matrix ${A}_{D}$ can be calculated in closed form, as explained in Section 2.4.
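The direct SME can be computed from the raw moments alone. A minimal sketch (illustrative code, not the paper's implementation), which also shows that the $2D-1$ moments ${\mu}_{0},\dots ,{\mu}_{2D-2}$ fully determine the Hankel structure underlying ${A}_{D}$, and which checks consistency on a standard normal density:

```python
import numpy as np

def sme_from_moments(mu, D):
    """Direct integral-based SME for a polynomial exponential family of
    order D: theta = -[i*j*mu_{i+j-2}]^{-1} [j*(j-1)*mu_{j-2}].
    mu must contain the raw moments mu_0, ..., mu_{2D-2}."""
    mu = np.asarray(mu, dtype=float)
    idx = np.arange(1, D + 1)
    # A_D = outer(i, j) times a Hankel moment matrix [mu_{i+j-2}]_{ij}; the
    # Hankel part is determined by only 2D-1 coefficients (linear storage).
    H = mu[idx[:, None] + idx[None, :] - 2]
    A = np.outer(idx, idx) * H
    b = np.zeros(D)
    b[1:] = idx[1:] * (idx[1:] - 1) * mu[: D - 1]  # j(j-1) mu_{j-2}, j = 2..D
    return -np.linalg.solve(A, b)

# Consistency check: for r = N(0,1), with raw moments (1, 0, 1, 0, 3, 0, 15),
# the order-4 PED recovered is exp(-x^2/2)/Z, i.e., theta = (0, -1/2, 0, 0).
theta = sme_from_moments([1, 0, 1, 0, 3, 0, 15], D=4)
assert np.allclose(theta, [0.0, -0.5, 0.0, 0.0])
print("recovered theta:", theta)
```

This sketch uses a generic $O\left({D}^{3}\right)$ solve; the quadratic-time Hankel solvers [51,52] mentioned above would replace `np.linalg.solve` in a performance-oriented implementation.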

**Theorem 3** (Score matching GMM conversion).

#### 2.3. Converting Numerically Moment Parameters from/to Natural Parameters

#### 2.3.1. Converting Moment Parameters to Natural Parameters Using Maximum Entropy
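A standard way to carry out this conversion is Newton's method: the Jacobian of the moment map $\theta \mapsto \nabla F\left(\theta \right)$ is the covariance matrix of the sufficient statistics, which has the moment-matrix structure discussed in Section 2.2. The sketch below is an illustration in this spirit (with moments estimated by quadrature on a truncated uniform grid), not the paper's actual iterative linear system descent method:

```python
import numpy as np

def newton_maxent(eta, theta0, x, iters=30):
    """Solve grad F(theta) = eta for a PED p_theta(x) proportional to
    exp(sum_i theta_i x^i) via Newton iterations. The Jacobian of the moment
    map is the covariance of t(x) = (x, ..., x^D), estimated by quadrature
    on the uniform grid x."""
    theta = np.array(theta0, dtype=float)
    eta = np.asarray(eta, dtype=float)
    D = len(theta)
    dx = x[1] - x[0]
    t = np.vander(x, D + 1, increasing=True)[:, 1:]   # rows are t(x)
    for _ in range(iters):
        logq = t @ theta
        q = np.exp(logq - logq.max())                 # unnormalized density
        p = q / (q.sum() * dx)                        # normalized p_theta
        m = (p[:, None] * t).sum(axis=0) * dx         # moments E_theta[t(x)]
        second = (p[:, None, None] * t[:, :, None] * t[:, None, :]).sum(axis=0) * dx
        cov = second - np.outer(m, m)                 # Jacobian (moment-matrix structure)
        theta += np.linalg.solve(cov, eta - m)        # Newton update
    return theta

# Recover N(0,1) as an order-2 PED from its moment parameters eta = (0, 1):
x = np.linspace(-8.0, 8.0, 2001)
theta_hat = newton_maxent([0.0, 1.0], theta0=[0.0, -0.4], x=x)
print(theta_hat)  # close to (0, -1/2), i.e., p(x) proportional to exp(-x^2/2)
```

The truncated support $[-8,8]$ and the grid resolution are assumptions of this sketch; in practice the support must be chosen so that the PED's tails are negligible outside it.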

#### 2.3.2. Converting Natural Parameters to Moment Parameters

#### 2.4. Raw Non-Central Moments of Normal Distributions and GMMs
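The closed-form raw moments used throughout can be sketched as follows (an illustrative implementation, not the paper's code): the raw moment of a normal component follows from the binomial expansion of ${(m+\sigma Z)}^{l}$ with $Z$ standard normal (whose even moments are double factorials), and a GMM moment is the mixture-weighted sum of component moments.

```python
from math import comb

def normal_raw_moment(m, s, l):
    """Raw moment E[X^l] of N(m, s^2), via the binomial expansion of
    (m + s*Z)^l with E[Z^k] = (k-1)!! for even k and 0 for odd k."""
    total = 0.0
    for k in range(0, l + 1, 2):          # only even powers of Z contribute
        double_fact = 1.0
        for t in range(k - 1, 0, -2):     # (k-1)!! = (k-1)(k-3)...1
            double_fact *= t
        total += comb(l, k) * (s ** k) * (m ** (l - k)) * double_fact
    return total

def gmm_raw_moment(weights, means, stds, l):
    """Raw moment mu_l of a univariate GMM: weighted sum of component moments."""
    return sum(w * normal_raw_moment(m, s, l)
               for w, m, s in zip(weights, means, stds))

# Example: mu_2 of 0.5 N(0,1) + 0.5 N(1,1) is 0.5*1 + 0.5*(1+1) = 1.5.
print(gmm_raw_moment([0.5, 0.5], [0.0, 1.0], [1.0, 1.0], 2))  # -> 1.5
```

These moments are exactly the inputs required by the SME of Section 2.2 (matrix ${A}_{D}$ and vector ${b}_{D}$).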

## 3. Goodness-of-Fit between GMMs and PEDs: Higher Order Hyvärinen Divergences
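For reference, the classical (first-order) Hyvärinen divergence that this section builds on can be sketched numerically. The key property is that only derivatives of the log-density enter, so an unnormalized PED can be scored without computing its partition function. This is illustrative code under a quadrature assumption, not the paper's implementation of the higher-order divergences:

```python
import numpy as np

def hyvarinen_divergence(x, score_r, score_q, r):
    """Classical Hyvarinen divergence
    D_H[r:q] = (1/2) E_r[((log r)'(x) - (log q)'(x))^2],
    computed by quadrature on a uniform grid x; q may be left unnormalized
    since only the derivative of log q enters."""
    dx = x[1] - x[0]
    return 0.5 * np.sum(r * (score_r - score_q) ** 2) * dx

# Goodness-of-fit of an (unnormalized) order-2 PED q(x) = exp(t1*x + t2*x^2)
# against r = N(0,1): with (t1, t2) = (0, -1/2) the scores coincide and D_H = 0.
x = np.linspace(-8.0, 8.0, 4001)
r = np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
score_r = -x                     # (log r)'(x)
t1, t2 = 0.0, -0.5
score_q = t1 + 2.0 * t2 * x      # (log q)'(x), normalizer-free
print(hyvarinen_divergence(x, score_r, score_q, r))  # -> 0.0
```

A small value of such a divergence indicates the PED fits the GMM well, which is how the order $D$ is selected in Figure 5.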

**Theorem 4.**

**Proof.**

## 4. Experiments: Jeffreys Divergence between Mixtures
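The baseline against which the PED-based approximations are compared is the vanilla Monte Carlo estimator of the Jeffreys divergence. A minimal sketch with hypothetical GMM parameters (illustrative, not the paper's experimental code):

```python
import numpy as np

rng = np.random.default_rng(0)

def gmm_pdf(x, w, mu, sigma):
    """Density of a univariate GMM evaluated at the points x."""
    x = np.asarray(x)[:, None]
    comp = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return comp @ w

def gmm_sample(n, w, mu, sigma):
    """Ancestral sampling: pick a component, then sample from it."""
    idx = rng.choice(len(w), size=n, p=w)
    return rng.normal(mu[idx], sigma[idx])

def jeffreys_mc(n, p1, p2):
    """Vanilla Monte Carlo estimate of D_J[m1,m2] = KL(m1:m2) + KL(m2:m1)."""
    (w1, mu1, s1), (w2, mu2, s2) = p1, p2
    x1 = gmm_sample(n, w1, mu1, s1)
    x2 = gmm_sample(n, w2, mu2, s2)
    kl12 = np.mean(np.log(gmm_pdf(x1, w1, mu1, s1) / gmm_pdf(x1, w2, mu2, s2)))
    kl21 = np.mean(np.log(gmm_pdf(x2, w2, mu2, s2) / gmm_pdf(x2, w1, mu1, s1)))
    return kl12 + kl21

# Sanity check on single Gaussians: D_J[N(0,1), N(1,1)] = 1 exactly.
m1 = (np.array([1.0]), np.array([0.0]), np.array([1.0]))
m2 = (np.array([1.0]), np.array([1.0]), np.array([1.0]))
print(jeffreys_mc(200_000, m1, m2))  # close to 1
```

The PED-based formulas replace this sampling loop by a fixed number of moment computations and one linear solve, which is the source of the reported speed-ups.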

## 5. Conclusions and Perspectives

## Funding

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

1. Jeffreys, H. An invariant form for the prior probability in estimation problems. Proc. R. Soc. Lond. Ser. A Math. Phys. Sci. **1946**, 186, 453–461.
2. McLachlan, G.J.; Basford, K.E. Mixture Models: Inference and Applications to Clustering; M. Dekker: New York, NY, USA, 1988; Volume 38.
3. Pearson, K. Contributions to the mathematical theory of evolution. Philos. Trans. R. Soc. Lond. A **1894**, 185, 71–110.
4. Seabra, J.C.; Ciompi, F.; Pujol, O.; Mauri, J.; Radeva, P.; Sanches, J. Rayleigh mixture model for plaque characterization in intravascular ultrasound. IEEE Trans. Biomed. Eng. **2011**, 58, 1314–1324.
5. Kullback, S. Information Theory and Statistics; Courier Corporation: North Chelmsford, MA, USA, 1997.
6. Cover, T.M. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 1999.
7. Vitoratou, S.; Ntzoufras, I. Thermodynamic Bayesian model comparison. Stat. Comput. **2017**, 27, 1165–1180.
8. Kannappan, P.; Rathie, P. An axiomatic characterization of J-divergence. In Transactions of the Tenth Prague Conference on Information Theory, Statistical Decision Functions, Random Processes; Springer: Dordrecht, The Netherlands, 1988; pp. 29–36.
9. Burbea, J. J-Divergences and related concepts. Encycl. Stat. Sci. **2004**.
10. Tabibian, S.; Akbari, A.; Nasersharif, B. Speech enhancement using a wavelet thresholding method based on symmetric Kullback–Leibler divergence. Signal Process. **2015**, 106, 184–197.
11. Veldhuis, R. The centroid of the symmetrical Kullback-Leibler distance. IEEE Signal Process. Lett. **2002**, 9, 96–99.
12. Nielsen, F. Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight approximation for frequency histograms. IEEE Signal Process. Lett. **2013**, 20, 657–660.
13. Watanabe, S.; Yamazaki, K.; Aoyagi, M. Kullback information of normal mixture is not an analytic function. IEICE Tech. Rep. Neurocomput. **2004**, 104, 41–46.
14. Cui, S.; Datcu, M. Comparison of Kullback-Leibler divergence approximation methods between Gaussian mixture models for satellite image retrieval. In Proceedings of the 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS), Milan, Italy, 26–31 July 2015; pp. 3719–3722.
15. Cui, S. Comparison of approximation methods to Kullback–Leibler divergence between Gaussian mixture models for satellite image retrieval. Remote Sens. Lett. **2016**, 7, 651–660.
16. Sreekumar, S.; Zhang, Z.; Goldfeld, Z. Non-asymptotic performance guarantees for neural estimation of f-divergences. In Proceedings of the International Conference on Artificial Intelligence and Statistics (PMLR 2021), San Diego, CA, USA, 18–24 July 2021; pp. 3322–3330.
17. Durrieu, J.L.; Thiran, J.P.; Kelly, F. Lower and upper bounds for approximation of the Kullback-Leibler divergence between Gaussian mixture models. In Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012; pp. 4833–4836.
18. Nielsen, F.; Sun, K. Guaranteed bounds on information-theoretic measures of univariate mixtures using piecewise log-sum-exp inequalities. Entropy **2016**, 18, 442.
19. Jenssen, R.; Principe, J.C.; Erdogmus, D.; Eltoft, T. The Cauchy–Schwarz divergence and Parzen windowing: Connections to graph theory and Mercer kernels. J. Frankl. Inst. **2006**, 343, 614–629.
20. Liu, M.; Vemuri, B.C.; Amari, S.; Nielsen, F. Shape retrieval using hierarchical total Bregman soft clustering. IEEE Trans. Pattern Anal. Mach. Intell. **2012**, 34, 2407–2419.
21. Robert, C.; Casella, G. Monte Carlo Statistical Methods; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013.
22. Barndorff-Nielsen, O. Information and Exponential Families: In Statistical Theory; John Wiley & Sons: Hoboken, NJ, USA, 2014.
23. Azoury, K.S.; Warmuth, M.K. Relative loss bounds for on-line density estimation with the exponential family of distributions. Mach. Learn. **2001**, 43, 211–246.
24. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman divergences. J. Mach. Learn. Res. **2005**, 6, 1705–1749.
25. Bregman, L.M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. **1967**, 7, 200–217.
26. Nielsen, F.; Nock, R. Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory **2009**, 55, 2882–2904.
27. Hyvärinen, A. Estimation of non-normalized statistical models by score matching. J. Mach. Learn. Res. **2005**, 6, 695–709.
28. Cobb, L.; Koppstein, P.; Chen, N.H. Estimation and moment recursion relations for multimodal distributions of the exponential family. J. Am. Stat. Assoc. **1983**, 78, 124–130.
29. Hayakawa, J.; Takemura, A. Estimation of exponential-polynomial distribution by holonomic gradient descent. Commun. Stat.-Theory Methods **2016**, 45, 6860–6882.
30. Nielsen, F.; Nock, R. MaxEnt upper bounds for the differential entropy of univariate continuous distributions. IEEE Signal Process. Lett. **2017**, 24, 402–406.
31. Matz, A.W. Maximum likelihood parameter estimation for the quartic exponential distribution. Technometrics **1978**, 20, 475–484.
32. Barron, A.R.; Sheu, C.H. Approximation of density functions by sequences of exponential families. Ann. Stat. **1991**, 19, 1347–1369; Correction in **1991**, 19, 2284.
33. O’toole, A. A method of determining the constants in the bimodal fourth degree exponential function. Ann. Math. Stat. **1933**, 4, 79–93.
34. Aroian, L.A. The fourth degree exponential distribution function. Ann. Math. Stat. **1948**, 19, 589–592.
35. Zellner, A.; Highfield, R.A. Calculation of maximum entropy distributions and approximation of marginal posterior distributions. J. Econom. **1988**, 37, 195–209.
36. McCullagh, P. Exponential mixtures and quadratic exponential families. Biometrika **1994**, 81, 721–729.
37. Mead, L.R.; Papanicolaou, N. Maximum entropy in the problem of moments. J. Math. Phys. **1984**, 25, 2404–2417.
38. Armstrong, J.; Brigo, D. Stochastic filtering via L2 projection on mixture manifolds with computer algorithms and numerical examples. arXiv **2013**, arXiv:1303.6236.
39. Efron, B.; Hastie, T. Computer Age Statistical Inference; Cambridge University Press: Cambridge, UK, 2016; Volume 5.
40. Pinsker, M. Information and Information Stability of Random Variables and Processes (Translated and Annotated by Amiel Feinstein); Holden-Day Inc.: San Francisco, CA, USA, 1964.
41. Fedotov, A.A.; Harremoës, P.; Topsoe, F. Refinements of Pinsker’s inequality. IEEE Trans. Inf. Theory **2003**, 49, 1491–1498.
42. Amari, S. Information Geometry and Its Applications; Applied Mathematical Sciences; Springer: Berlin/Heidelberg, Germany, 2016.
43. Carreira-Perpinan, M.A. Mode-finding for mixtures of Gaussian distributions. IEEE Trans. Pattern Anal. Mach. Intell. **2000**, 22, 1318–1323.
44. Brown, L.D. Fundamentals of statistical exponential families with applications in statistical decision theory. Lect. Notes-Monogr. Ser. **1986**, 9, 1–279.
45. Pelletier, B. Informative barycentres in statistics. Ann. Inst. Stat. Math. **2005**, 57, 767–780.
46. Améndola, C.; Drton, M.; Sturmfels, B. Maximum likelihood estimates for Gaussian mixtures are transcendental. In Proceedings of the International Conference on Mathematical Aspects of Computer and Information Sciences, Berlin, Germany, 11–13 November 2015; pp. 579–590.
47. Hyvärinen, A. Some extensions of score matching. Comput. Stat. Data Anal. **2007**, 51, 2499–2512.
48. Otto, F.; Villani, C. Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. J. Funct. Anal. **2000**, 173, 361–400.
49. Toscani, G. Entropy production and the rate of convergence to equilibrium for the Fokker-Planck equation. Q. Appl. Math. **1999**, 57, 521–541.
50. Hudson, H.M. A natural identity for exponential families with applications in multiparameter estimation. Ann. Stat. **1978**, 6, 473–484.
51. Trench, W.F. An algorithm for the inversion of finite Hankel matrices. J. Soc. Ind. Appl. Math. **1965**, 13, 1102–1107.
52. Heinig, G.; Rost, K. Fast algorithms for Toeplitz and Hankel matrices. Linear Algebra Its Appl. **2011**, 435, 1–59.
53. Fuhrmann, P.A. Remarks on the inversion of Hankel matrices. Linear Algebra Its Appl. **1986**, 81, 89–104.
54. Lindsay, B.G. On the determinants of moment matrices. Ann. Stat. **1989**, 17, 711–721.
55. Lindsay, B.G. Moment matrices: Applications in mixtures. Ann. Stat. **1989**, 17, 722–740.
56. Provost, S.B.; Ha, H.T. On the inversion of certain moment matrices. Linear Algebra Its Appl. **2009**, 430, 2650–2658.
57. Serfling, R.J. Approximation Theorems of Mathematical Statistics; John Wiley & Sons: Hoboken, NJ, USA, 2009; Volume 162.
58. Mohammad-Djafari, A. A Matlab program to calculate the maximum entropy distributions. In Maximum Entropy and Bayesian Methods; Springer: Berlin/Heidelberg, Germany, 1992; pp. 221–233.
59. Karlin, S. Total Positivity; Stanford University Press: Redwood City, CA, USA, 1968; Volume 1.
60. von Neumann, J. Various techniques used in connection with random digits. In Monte Carlo Method; National Bureau of Standards Applied Mathematics Series; Householder, A.S., Forsythe, G.E., Germond, H.H., Eds.; US Government Printing Office: Washington, DC, USA, 1951; Volume 12, Chapter 13; pp. 36–38.
61. Flury, B.D. Acceptance-rejection sampling made easy. SIAM Rev. **1990**, 32, 474–476.
62. Rohde, D.; Corcoran, J. MCMC methods for univariate exponential family models with intractable normalization constants. In Proceedings of the 2014 IEEE Workshop on Statistical Signal Processing (SSP), Gold Coast, Australia, 29 June–2 July 2014; pp. 356–359.
63. Barr, D.R.; Sherrill, E.T. Mean and variance of truncated normal distributions. Am. Stat. **1999**, 53, 357–361.
64. Amendola, C.; Faugere, J.C.; Sturmfels, B. Moment varieties of Gaussian mixtures. J. Algebr. Stat. **2016**, 7, 14–28.
65. Fujisawa, H.; Eguchi, S. Robust parameter estimation with a small bias against heavy contamination. J. Multivar. Anal. **2008**, 99, 2053–2081.
66. Nielsen, F.; Nock, R. Patch matching with polynomial exponential families and projective divergences. In Proceedings of the International Conference on Similarity Search and Applications, Tokyo, Japan, 24–26 October 2016; pp. 109–116.
67. Yang, Y.; Martin, R.; Bondell, H. Variational approximations using Fisher divergence. arXiv **2019**, arXiv:1905.05284.
68. Kostrikov, I.; Fergus, R.; Tompson, J.; Nachum, O. Offline reinforcement learning with Fisher divergence critic regularization. In Proceedings of the International Conference on Machine Learning (PMLR 2021), online, 7–8 June 2021; pp. 5774–5783.
69. Elkhalil, K.; Hasan, A.; Ding, J.; Farsiu, S.; Tarokh, V. Fisher auto-encoders. In Proceedings of the International Conference on Artificial Intelligence and Statistics (PMLR 2021), San Diego, CA, USA, 13–15 April 2021; pp. 352–360.
70. Améndola, C.; Engström, A.; Haase, C. Maximum number of modes of Gaussian mixtures. Inf. Inference J. IMA **2020**, 9, 587–600.
71. Aprausheva, N.; Mollaverdi, N.; Sorokin, S. Bounds for the number of modes of the simplest Gaussian mixture. Pattern Recognit. Image Anal. **2006**, 16, 677–681.
72. Aprausheva, N.; Sorokin, S. Exact equation of the boundary of unimodal and bimodal domains of a two-component Gaussian mixture. Pattern Recognit. Image Anal. **2013**, 23, 341–347.
73. Xiao, Y.; Shah, M.; Francis, S.; Arnold, D.L.; Arbel, T.; Collins, D.L. Optimal Gaussian mixture models of tissue intensities in brain MRI of patients with multiple-sclerosis. In Proceedings of the International Workshop on Machine Learning in Medical Imaging, Beijing, China, 20 September 2010; pp. 165–173.
74. Bilik, I.; Khomchuk, P. Minimum divergence approaches for robust classification of ground moving targets. IEEE Trans. Aerosp. Electron. Syst. **2012**, 48, 581–603.
75. Alippi, C.; Boracchi, G.; Carrera, D.; Roveri, M. Change detection in multivariate datastreams: Likelihood and detectability loss. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI 2016), New York, NY, USA, 9–15 July 2016.
76. Eguchi, S.; Komori, O.; Kato, S. Projective power entropy and maximum Tsallis entropy distributions. Entropy **2011**, 13, 1746–1764.
77. Orjebin, E. A Recursive Formula for the Moments of a Truncated Univariate Normal Distribution. 2014; unpublished note.
78. Del Castillo, J. The singly truncated normal distribution: A non-steep exponential family. Ann. Inst. Stat. Math. **1994**, 46, 57–66.

**Figure 1.** Two examples illustrating the conversion of a GMM $m$ (black) of $k=2$ components (dashed black) into a pair of polynomial exponential densities of order $D=4$, $({p}_{{\overline{\theta}}_{\mathrm{SME}}},{p}^{{\overline{\eta}}_{\mathrm{MLE}}})$. PED ${p}_{{\overline{\theta}}_{\mathrm{SME}}}$ is displayed in green, and PED ${p}^{{\overline{\eta}}_{\mathrm{MLE}}}$ is displayed in blue. To display ${p}^{{\overline{\eta}}_{\mathrm{MLE}}}$, we first converted ${\overline{\eta}}_{\mathrm{MLE}}$ to ${\tilde{\overline{\theta}}}_{\mathrm{MLE}}$ using an iterative linear system descent method (ILSDM), and we numerically estimated the normalizing factors $Z\left({\overline{\theta}}_{\mathrm{SME}}\right)$ and $Z\left({\overline{\eta}}_{\mathrm{MLE}}\right)$ to display the normalized PEDs.

**Figure 2.** Two mixtures ${m}_{1}$ (black) and ${m}_{2}$ (red) of ${k}_{1}=10$ components and ${k}_{2}=11$ components (**left**), respectively. The unnormalized PEFs ${q}_{{\overline{\theta}}_{1}}={\tilde{p}}_{{\overline{\theta}}_{1}}$ (**middle**) and ${q}_{{\overline{\theta}}_{2}}={\tilde{p}}_{{\overline{\theta}}_{2}}$ (**right**) of order $D=8$. The Jeffreys divergence (about $0.2634$) is approximated using PEDs within $0.6\%$ of the Monte Carlo estimate, with a speed-up factor of about 3190. Notice that displaying ${p}_{{\overline{\theta}}_{1}}$ and ${p}_{{\overline{\theta}}_{2}}$ on the same PDF canvas as the mixtures would require calculating the partition functions $Z\left({\overline{\theta}}_{1}\right)$ and $Z\left({\overline{\theta}}_{2}\right)$ (which we do not do in this figure). The PEDs ${q}^{{\overline{\eta}}_{1}}$ and ${q}^{{\overline{\eta}}_{2}}$ of the pairs $({\overline{\theta}}_{1},{\overline{\eta}}_{1})$ and $({\overline{\theta}}_{2},{\overline{\eta}}_{2})$ parameterized in the moment space are not shown here.

**Figure 3.** The best simplification of a GMM $m\left(x\right)$ into a single normal component ${p}_{{\theta}^{*}}$ (${min}_{\theta \in \Theta}{D}_{\mathrm{KL}}[m:{p}_{\theta}]={min}_{\eta \in H}{D}_{\mathrm{KL}}[m:{p}^{\eta}]$) is geometrically interpreted as the unique m-projection of $m\left(x\right)$ onto the Gaussian family (an e-flat manifold): we have ${\eta}^{*}=\overline{\eta}={\sum}_{i=1}^{k}{w}_{i}{\eta}_{i}$.

**Figure 4.** Experiments approximating the Jeffreys divergence between two mixtures by considering pairs of PEDs. Notice that only the PEDs estimated using score matching in the natural parameter space are displayed.

**Figure 5.** Selecting the PED order $D$ by evaluating the order-2 Hyvärinen divergence values for $D\in \{4,8,10,12,14,16\}$. Here, the order $D=10$ (boxed) yields the lowest order-2 Hyvärinen divergence: the GMM is close to the PED.

**Figure 6.** Examples illustrating the limitations of converting GMMs (black) to PEDs (grey) using the integral-based score matching estimator: the case of GMMs with many modes.

**Figure 7.** Modeling the Old Faithful geyser dataset by a KDE (a GMM with $k=272$ components and uniform weights ${w}_{i}=\frac{1}{272}$): histogram (#bins = 25) (**left**), KDE with $\sigma =0.05$ (**middle**), and KDE with $\sigma =0.1$, which exhibits fewer spurious bumps (**right**).

**Figure 8.**Modeling the Old Faithful geyser by an exponential-polynomial distribution of order $D=10$.

**Figure 9.** GMM modes versus PED modes: (**left**) same number and locations of modes for the GMM and the PED; (**right**) four modes for the GMM but only two modes for the PED.

**Table 1.**Comparison of ${\tilde{\mathrm{\Delta}}}_{J}({m}_{1},{m}_{2})$ with ${\widehat{D}}_{J}({m}_{1},{m}_{2})$ for random GMMs.

| k | D | Average Error | Maximum Error | Speed-Up |
|---|---|---------------|---------------|----------|
| 2 | 4 | 0.1180799978221536 | 0.9491425404132259 | 2008.2323536011806 |
| 3 | 6 | 0.12533811294546526 | 1.9420608151988419 | 1010.4917042114389 |
| 4 | 8 | 0.10198448868508087 | 5.290871019594698 | 474.5135294829539 |
| 5 | 10 | 0.06336388579897352 | 3.8096955246161848 | 246.38780782640987 |
| 6 | 12 | 0.07145257192133717 | 1.0125283726458822 | 141.39097909641052 |
| 7 | 14 | 0.10538875853178625 | 0.8661463142793943 | 88.62985036546912 |
| 8 | 16 | 0.4150905507007969 | 0.4150905507007969 | 58.72277575395611 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Nielsen, F.
Fast Approximations of the Jeffreys Divergence between Univariate Gaussian Mixtures via Mixture Conversions to Exponential-Polynomial Distributions. *Entropy* **2021**, *23*, 1417.
https://doi.org/10.3390/e23111417
