Fast approximations of the Jeffreys divergence between univariate Gaussian mixture models via exponential polynomial densities

The Jeffreys divergence is a renowned symmetrization of the oriented Kullback-Leibler divergence broadly used in information sciences. Since the Jeffreys divergence between Gaussian mixture models is not available in closed form, various techniques with pros and cons have been proposed in the literature to either estimate, approximate, or lower and upper bound this divergence. In this paper, we propose a simple yet fast heuristic to approximate the Jeffreys divergence between two univariate Gaussian mixtures with an arbitrary number of components. Our heuristic relies on converting the mixtures into pairs of dually parameterized probability densities belonging to an exponential family. In particular, we consider the versatile polynomial exponential family densities, and design a divergence to measure in closed form the goodness of fit between a Gaussian mixture and its polynomial exponential density approximation. This goodness-of-fit divergence is a generalization of the Hyvärinen divergence used to estimate models with computationally intractable normalizers. It allows us to perform model selection by choosing the orders of the polynomial exponential densities used to approximate the mixtures. We demonstrate experimentally that our heuristic improves by several orders of magnitude the computational time of stochastic Monte Carlo estimations while approximating the Jeffreys divergence reasonably well, especially when the mixtures have a very small number of modes. Besides, our mixture-to-exponential family conversion techniques may prove useful in other settings.


Mixtures and statistical divergences
In this work, we consider the problem of approximating the Jeffreys divergence [33] between two univariate continuous mixture models¹ [44] $m(x)=\sum_{i=1}^{k} w_i p_i(x)$ and $m'(x)=\sum_{i=1}^{k'} w'_i p'_i(x)$, with continuous component distributions $p_i$'s and $p'_i$'s defined on a coinciding support $\mathcal{X}$. Although our method can be used for any continuous mixtures of exponential families like Rayleigh mixtures [70] (with restricted support $\mathcal{X}=\mathbb{R}_+$), we explain the method instantiated to its most prominent member: the Gaussian mixture models (GMMs). In the remainder, a GMM $m(x)$ with $k$ components is called a $k$-GMM. (¹Pearson [64] first considered a unimodal Gaussian mixture of two components in 1894 for modeling the distribution of the ratio of forehead breadth to body length of 1000 crabs.) The Kullback-Leibler divergence [37] $D_{KL}$ (KLD) between two probability density functions $m$ and $m'$ is $D_{KL}[m:m'] := \int_{\mathcal{X}} m(x)\log\frac{m(x)}{m'(x)}\,\mathrm{d}x$, and the Jeffreys divergence (JD) symmetrizes it: $D_J[m,m'] := D_{KL}[m:m'] + D_{KL}[m':m]$. The JD is a symmetric divergence: $D_J[m,m'] = D_J[m',m]$. In the literature, the Jeffreys divergence has also been called the J-divergence [35], the symmetric Kullback-Leibler divergence [72], and sometimes the symmetrical Kullback-Leibler divergence [75,49]. However, since there are many ways beyond the usual Jensen-Shannon divergence [39] and Jeffreys divergence to symmetrize the KLD [51], we shall refer to $D_J$ as the Jeffreys divergence in the remainder. In general, it is provably hard to calculate in closed form the integral of the KLD between two continuous mixtures: for example, the KLD between two GMMs has been shown to be non-analytic [76]. One notable exception to this hardness result is the closed-form formula (albeit a large one) reported for the KLD between Cauchy mixtures of two components [53], which is analytic.
Thus in practice, when calculating the JD between two GMMs, one can either (i) approximate [26,17], (ii) estimate [71], or (iii) bound [21,59] the KLD between mixtures. We refer the reader to the references contained in these recent papers for a more exhaustive bibliography. Another approach when dealing with mixtures consists in designing new types of divergences which admit closed-form expressions. See [34,47,52] for some examples.
The complexity of estimating stochastically the JD with $s$ Monte Carlo samples is $\tilde O((k_1+k_2)s)$ (with $s$ typically ranging from $10^4$ to $10^6$). Notice that the number of components can be very large (e.g., $k=O(n)$ for $n$ input data) when using Kernel Density Estimators [69] (KDEs). KDEs build mixtures by fitting a mixture component at each data location. Those mixtures have a large number of components and can exhibit many spurious modes.
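For concreteness, here is a minimal sketch (in Python, not from the paper) of this $\tilde O((k_1+k_2)s)$ stochastic baseline: the JD estimated by Monte Carlo sampling from each mixture. All names and the toy parameters are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def gmm_pdf(x, w, mu, sigma):
    """Density of a univariate GMM evaluated at the points x."""
    comps = stats.norm.pdf(np.asarray(x)[:, None], loc=mu, scale=sigma)  # shape (n, k)
    return comps @ w

def gmm_sample(n, w, mu, sigma):
    """Draw n i.i.d. samples from the GMM by ancestral sampling."""
    idx = rng.choice(len(w), size=n, p=w)
    return rng.normal(mu[idx], sigma[idx])

def jeffreys_mc(m1, m2, s=10**5):
    """D_J[m, m'] estimated as the sum of the two MC-estimated KLDs."""
    x1, x2 = gmm_sample(s, *m1), gmm_sample(s, *m2)
    kl_12 = np.mean(np.log(gmm_pdf(x1, *m1) / gmm_pdf(x1, *m2)))
    kl_21 = np.mean(np.log(gmm_pdf(x2, *m2) / gmm_pdf(x2, *m1)))
    return kl_12 + kl_21

m1 = (np.array([0.5, 0.5]), np.array([-1.0, 2.0]), np.array([0.5, 1.0]))
m2 = (np.array([0.3, 0.7]), np.array([0.0, 2.5]), np.array([0.8, 0.6]))
print(jeffreys_mc(m1, m2))
```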
In this work, we shall consider the approximation of the JD by converting mixtures into densities of polynomial exponential families (PEFs) [16,57], also called tilted families [22] (i.e., densities obtained by tilting a base distribution). An exponential family (EF) $E_t$ of order $D$ is defined as a family of probability density functions on the support $\mathcal{X}$:
$$p_\theta(x) = \exp\left(\sum_{i=1}^{D}\theta_i t_i(x) - F(\theta)\right),$$
where $F(\theta)$ is called the cumulant function which ensures the normalization of $p_\theta(x)$ to one:
$$F(\theta) = \log \int_{\mathcal{X}} \exp\left(\sum_{i=1}^{D}\theta_i t_i(x)\right)\mathrm{d}x.$$
Function $F(\theta)$ is also called the log-Laplace transform. Parameter $\theta\in\Theta\subset\mathbb{R}^D$ is called the natural parameter (and $\Theta$ is the natural parameter space), and the vector $t(x)=(t_1(x),\ldots,t_D(x))$ is called the sufficient statistics [9]. Polynomial exponential families $E_D$ are exponential families with sufficient statistics $t_i(x)=x^i$ for $i\in\{1,\ldots,D\}$. For example, the exponential distributions $\{p_\lambda(x)=\lambda\exp(-\lambda x)\}$ form a PEF with $D=1$, $t(x)=x$ and $\mathcal{X}=\mathbb{R}_+$, and the normal distributions form a PEF with $D=2$, $t(x)=[x\ x^2]$ and $\mathcal{X}=\mathbb{R}$, etc. In those low-order PEF cases, the respective cumulant functions are obtained in closed form. However, no closed-form formula is available for $F(\theta)$ of PEFs in general as soon as $D\geq 4$, and the cumulant function $F(\theta)$ is then said to be computationally intractable. Notice that for an even integer order $D$, the coefficient $\theta_D<0$ when $\mathcal{X}=\mathbb{R}$. PEFs are attractive because they can model multimodal distributions and require fewer parameters than GMMs: indeed, a univariate $k$-GMM $m(x)$ has at most $k$ modes and $k-1$ antimodes, and requires $3k-1$ parameters to specify $m(x)$, while the PDF of a PEF of order $D$ (called a Polynomial Exponential Density, PED) requires $D$ parameters, and has at most $\frac{D}{2}$ modes and $\frac{D}{2}-1$ antimodes (see below). The case of the quartic (polynomial) exponential densities $E_4$ ($D=4$) has been extensively investigated in [63,6,42,77,43,45]. Armstrong and Brigo [5] discussed order-6 PEDs, and Efron and Hastie reported an order-7 PEF (see Figure 5.7 of [22]).
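As a small illustration, an unnormalized PED $q_\theta(x)=\exp(\sum_i\theta_i x^i)$ can be evaluated without knowing $F(\theta)$; the snippet below is a sketch with illustrative names, not the paper's code.

```python
import numpy as np

def ped_unnormalized(x, theta):
    """q_theta(x) = exp(theta_1 x + ... + theta_D x^D); theta_D < 0 for even D on R."""
    powers = np.vander(np.atleast_1d(x), N=len(theta) + 1, increasing=True)[:, 1:]
    return np.exp(powers @ theta)

theta = np.array([0.0, 1.0, 0.0, -0.25])   # a quartic (D = 4) PED shape
print(ped_unnormalized(np.linspace(-3.0, 3.0, 5), theta))
```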
When the polynomial $P_\theta(x) := \sum_{i=1}^{D}\theta_i x^i$ is a monomial (i.e., a Monomial Exponential Family, MEF), the cumulant function is available in closed form, and this has been used to devise a sequence of maximum entropy upper bounds for GMMs [58]. Appendix A provides details on MEFs. Polynomial $P_\theta$ is called the shape of the PEF. An extremal value of a PED is found by solving for the roots of the derivative polynomial $P'_\theta(x)=0$: a stationary point $x$ is a mode of $p_\theta(x)$ iff $P''_\theta(x)<0$ and an antimode iff $P''_\theta(x)>0$. Therefore, to build a PED with modes/antimodes at given positions $\{x_i\}_i$, we may consider a shape polynomial whose derivative vanishes at those prescribed locations [68]. In view of deep learning, the PEFs can be interpreted as a simple class of energy-based models [38,27].
Thus let us write $q_\theta(x) := \exp\left(\sum_{i=1}^{D}\theta_i x^i\right)$ for the unnormalized density, where $Z(\theta)=\exp(F(\theta))$ is called the partition function in statistical physics (hence, $F(\theta)$ is also called the log-partition function since $F(\theta)=\log Z(\theta)$). We can define an equivalence class $\sim$ such that $p_1(x)\sim p_2(x)$ iff there exists $\lambda>0$ such that $p_1(x)=\lambda p_2(x)$. In the literature, $\tilde p_\theta(x)$ often denotes one representative of the equivalence class of $p_\theta(x)$, here the unnormalized density $q_\theta(x)$: $\tilde p_\theta(x)=q_\theta(x)$.
PEFs, like any other exponential family, admit a dual parameterization [9] $\eta=\eta(\theta):=E_{p_\theta}[t(x)]=\nabla F(\theta)$, called the moment parameterization (or mean parameterization). Let $H$ denote the moment parameter space (read $H$ as the Greek letter Eta). Let us use the subscript and superscript notations to emphasize the coordinate system used to index a PEF: $p_\theta(x)=p^\eta(x)$.
It is known that the KLD between any two densities of an exponential family amounts to a Bregman divergence for the conjugate generator $F^*$, or equivalently a reverse Bregman divergence induced by the cumulant $F$ of the EF [7,8]:
$$D_{KL}[p_{\theta_1}:p_{\theta_2}] = B_F(\theta_2:\theta_1) = B_{F^*}(\eta_1:\eta_2),$$
where the Bregman divergence for a strictly convex and smooth generator $F$ is defined by:
$$B_F(\theta_1:\theta_2) := F(\theta_1)-F(\theta_2)-(\theta_1-\theta_2)^\top \eta_2,$$
with $\eta_2 := \nabla F(\theta_2)$. Thus the JD between two PEDs of a PEF can be written as:
$$D_J[p_\theta, p_{\theta'}] = B_F(\theta:\theta') + B_F(\theta':\theta),$$
where $F^*$ denotes the Legendre-Fenchel convex conjugate: $F^*(\eta) := \sup_{\theta\in\Theta}\{\theta^\top\eta - F(\theta)\}$. It follows from the Legendre transform that we have $\eta=\nabla F(\theta)$ and $\theta=\nabla F^*(\eta)$. Using the mixed natural & moment parameterizations (with $\eta=\nabla F(\theta)$ and $\eta'=\nabla F(\theta')$), we get equivalently:
$$D_J[p_\theta, p_{\theta'}] = (\theta'-\theta)^\top(\eta'-\eta). \qquad (12)$$
Observe that the cumulant function $F(\theta)$ does not appear explicitly in Eq. 12 (but occurs implicitly in the moment parameters $\eta=\nabla F(\theta)$ or dually in the natural parameters $\theta=\nabla F^*(\eta)$). An alternative way to derive Eq. 12 is to consider the Legendre-Fenchel divergence $L_F$, which is equivalent to a Bregman divergence but uses mixed natural/moment parameterizations:
$$L_F(\theta_1:\eta_2) := F(\theta_1) + F^*(\eta_2) - \theta_1^\top \eta_2.$$
Figure 1: A GMM and its two PED conversions: PED $p_{\bar\theta_{SME}}$ is displayed in green and PED $p^{\bar\eta_{MLE}}$ in blue. To display $p^{\bar\eta_{MLE}}$, we first converted $\bar\eta_{MLE}$ to $\bar\theta_{MLE}$ using an iterative linear system descent method (ILSDM), and we numerically estimated the normalizing factors $Z(\bar\theta_{SME})$ and $Z(\bar\theta_{MLE})$ to display the normalized PEDs.

A simple heuristic
Our heuristic to approximate the JD between mixtures $m$ and $m'$ consists in converting those mixtures into pairs of PEDs of a PEF, as described in §2. To convert a mixture $m(x)$ into a pair $(p_{\bar\theta_1}, p^{\bar\eta_2})$ of dually parameterized densities (but not dual to each other, i.e., $\bar\eta_2\neq\nabla F(\bar\theta_1)$), we consider "integral extensions" of a statistical inference method $M_1$ which estimates in the natural parameter space $\Theta$ (e.g., the Score Matching method [31]) and of another method $M_2$ which estimates in the moment space $H$ (e.g., the Maximum Likelihood method [9]). Having converted $m$ and $m'$ into pairs $(q_{\bar\theta_{M_1}}, q^{\bar\eta_{M_2}})$ and $(q_{\bar\theta'_{M_1}}, q^{\bar\eta'_{M_2}})$, we approximate the JD between the mixtures $m$ and $m'$ by using the four parameters of the PEDs of a PEF.
Figure 2: The Jeffreys divergence $D_J(m,m')$ between two GMMs $m$ and $m'$ is approximated using pairs $(\bar\theta_{SME},\bar\eta_{MLE})$ and $(\bar\theta'_{SME},\bar\eta'_{MLE})$ of dually parameterized (unnormalized) polynomial exponential densities.
Let $\tilde\Delta_J$ denote the approximation formula obtained from the two pairs of PEDs by mimicking Eq. 12:
$$\tilde\Delta_J[m,m'] := \left(\bar\theta'_{SME}-\bar\theta_{SME}\right)^\top \left(\bar\eta'_{MLE}-\bar\eta_{MLE}\right).$$
Note that $\tilde\Delta_J$ is not a proper divergence as it may be negative since $\bar\eta_{MLE}\neq\nabla F(\bar\theta_{SME})$ (and it does not satisfy the law of the indiscernibles).
The approximation $\tilde\Delta_J$ is exact when $k_1=k_2=1$ with the component distributions belonging to the PEF. To quote George E. P. Box, "All models are wrong, but some are useful." Here, we precisely show that PEFs prove useful to approximate/estimate divergences between mixtures. Figure 2 schematically displays the approximation heuristic of the JD between mixtures using dually parameterized pairs of exponential family densities. We do not need to estimate the normalization constants of the densities to calculate $\tilde\Delta_J$. We show experimentally in §4 that the $\tilde\Delta_J$ heuristic yields approximations of the JD which are faster than MC estimations by several orders of magnitude while approximating the JD reasonably well when the mixtures have a small number of modes. For example, Figure 3 displays two mixtures with many components whose JD is well approximated; we report those mixtures and their PEF conversions in Appendix C.
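Assuming the mixtures have already been converted into the pairs $(\bar\theta_{SME},\bar\eta_{MLE})$ and $(\bar\theta'_{SME},\bar\eta'_{MLE})$ (see Section 2), the heuristic itself reduces to a single inner product; a minimal sketch:

```python
import numpy as np

def jeffreys_approx(theta1, eta1, theta2, eta2):
    """Delta_J = (theta2 - theta1) . (eta2 - eta1), in mixed parameterizations."""
    return float(np.dot(np.asarray(theta2) - theta1, np.asarray(eta2) - eta1))
```

Since $\tilde\Delta_J$ may be negative, a caller may wish to clamp it at zero; this safeguard is our suggestion rather than part of the paper's method.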

Contributions and paper outline
Our contributions are summarized as follows: • We explain how to convert any continuous PDF $r$ (including GMMs) into a PED in Section 2 using integral-based extensions of the Maximum Likelihood Estimator [9] (MLE, which estimates in the moment parameter space $H$; detailed in Theorem 1 and Corollary 1) and the Score Matching Estimator [31] (SME, which estimates in the natural parameter space $\Theta$). We show a connection of the SME with the Moment Linear System Estimator [16] (MLSE), which is related to Stein's lemma for exponential families [30].
Figure 3: Two mixtures $m_1$ (black) and $m_2$ (red) of $k_1=10$ components and $k_2=11$ components (left), respectively. The unnormalized PEDs $q_{\bar\theta_1}=\tilde p_{\bar\theta_1}$ (middle) and $q_{\bar\theta_2}=\tilde p_{\bar\theta_2}$ (right) of order 8. The Jeffreys divergence (about 0.2634) is approximated using PEDs within 0.6% of the Monte Carlo estimate, with a speed-up factor of about 3190. Notice that displaying $p_{\bar\theta_1}$ and $p_{\bar\theta_2}$ on the same PDF canvas as the mixtures would require calculating the partition functions $Z(\bar\theta_1)$ and $Z(\bar\theta_2)$ (which we do not do in this figure). The PEDs $q^{\bar\eta_1}$ and $q^{\bar\eta_2}$ of the pairs $(\bar\theta_1,\bar\eta_1)$ and $(\bar\theta_2,\bar\eta_2)$ parameterized in the moment space are not shown here.
• We report a closed-form formula to evaluate the goodness-of-fit of the PED to a GMM in §3 using an extension of the Hyvärinen divergence [1] (Theorem 2), and discuss the problem of model selection for choosing the order D of the PEF.
• We show how to approximate the Jeffreys divergence between GMMs using a pair of natural/moment parameter PED conversions, and present experimental results which display a gain of several orders of magnitude in performance when compared to the Monte Carlo estimator in §4. We observe that the quality of the approximations depends on the number of modes of the GMMs [13]. Calculating or counting the modes of a GMM is a challenging problem in its own right.
The paper is organized as follows: In Section 2, we show how to convert arbitrary PDFs into PEDs using the integral-based Maximum Likelihood Estimator and Score Matching Estimator, and describe a Maximum Entropy method to convert iteratively moment parameters to natural parameters in §2.3. Section 3 then shows how to calculate in closed form the order-2 Hyvärinen divergence between a GMM and a PED; we use this criterion to perform model selection. Section 4 presents our experiments, which demonstrate a gain of several orders of magnitude for GMMs with a small number of modes. Finally, we conclude in Section 5 by discussing potential generalizations to other kinds of univariate or multivariate mixtures which we plan to experiment with in future work.

Converting finite mixtures to exponential family densities
We report two generic methods to convert a mixture $m(x)$ into a density of an exponential family: the first method (MLE) in §2.1 proceeds using the mean parameterization $\eta$, while the second method (SME) in §2.2 uses the natural parameterization of the exponential family. We show how to instantiate these conversion methods for GMMs: this requires calculating in closed form the raw moments of GMMs, which is detailed in §2.4.

Conversion using the moment parameterization (MLE)
Let us recall that, in order to estimate the moment or mean parameter $\hat\eta_{MLE}$ of a density belonging to an exponential family from i.i.d. observations $x_1,\ldots,x_n$, the Maximum Likelihood Estimator (MLE) [12,9] yields the estimating equation:
$$\hat\eta_{MLE} = \frac{1}{n}\sum_{i=1}^{n} t(x_i). \qquad (21)$$
Eq. 21 is called the estimating equation. The MLE exists and is consistent and asymptotically normally distributed under mild conditions [9]. Furthermore, since the MLE satisfies the equivariance property [9], we have $\hat\theta_{MLE}=\nabla F^*(\hat\eta_{MLE})$, where $\nabla F^*$ denotes the gradient of the conjugate function $F^*(\eta)$ of the cumulant function $F(\theta)$ of the exponential family. By considering the empirical distribution $p_e(x) := \frac{1}{n}\sum_{i=1}^{n}\delta_{x_i}(x)$ (with $\delta_{x_i}$ denoting the Dirac distribution at location $x_i$), we can formulate the MLE problem as a minimum KLD problem between the empirical distribution and a density of the exponential family: $\min_\theta D_{KL}[p_e:p_\theta]$. Thus, to convert an arbitrary smooth density $r(x)$ into a density $p_\theta$ of an exponential family $E_t$, we ask to solve the minimization problem $\min_\theta D_{KL}[r:p_\theta]$. This conversion can be interpreted as an integral extension of the MLE (hence the bar notation in $\bar\eta_{MLE}$). Notice that the ordinary MLE is $\hat\eta_{MLE}=\bar\eta_{MLE}(p_e)$, obtained for the empirical distribution $r=p_e$.
Theorem 1 (Integral-based MLE) The best density $p_{\bar\eta}(x)$ minimizing the Kullback-Leibler divergence $D_{KL}[r:p_\theta]$ between a density $r$ and a density $p_\theta$ of an exponential family $E_t$ is given by the moment parameter $\bar\eta = E_r[t(x)]$. Notice that when $r=p_\theta$, we have $\bar\eta=E_{p_\theta}[t(x)]=\eta$, so that the method $\bar\eta_{MLE}$ is consistent (by analogy with the finite i.i.d. MLE case). This right-sided KLD minimization problem can be interpreted as an information projection [50].
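A sketch of Theorem 1 in code: for a generic density $r$, the integral-based MLE only requires the raw moments $E_r[x^i]$, computed here by numerical quadrature (for GMMs, §2.4 gives them in closed form). Names and integration bounds are illustrative.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def integral_mle(r, D, lo=-20.0, hi=20.0):
    """Moment parameter eta_bar_i = E_r[x^i], i = 1..D, of the best KLD fit (Theorem 1)."""
    return np.array([quad(lambda x, i=i: x**i * r(x), lo, hi)[0] for i in range(1, D + 1)])

# Example: a two-component GMM as the density r to be converted.
r = lambda x: 0.4 * norm.pdf(x, -1.0, 0.5) + 0.6 * norm.pdf(x, 2.0, 1.0)
print(integral_mle(r, D=4))
```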
As a corollary of Theorem 1, we deduce the following: Corollary 1 (Best KLD simplification of a mixture) The best KLD simplification of a homogeneous mixture $\sum_{i=1}^{k} w_i p_{\eta_i}$ of exponential family densities [44] into a single density of that family is $p_{\bar\eta}$ with
$$\bar\eta = \sum_{i=1}^{k} w_i\,\eta_i. \qquad (23)$$
Eq. 23 allows us to greatly simplify the proofs reported in [65,69] for mixture simplifications, which involved the explicit use of Pythagoras' theorem in the dually flat spaces of exponential families [1].

Integral-based Score Matching Estimator (SME)
To convert a density $r(x)$ to an exponential density with sufficient statistics $t(x)$, we can use the Score Matching Estimator [31,32] (SME). The score matching estimator minimizes the Hyvärinen divergence (Eq. 4 of [32]):
$$D_H[r:p_\theta] := \frac{1}{2}\int_{\mathcal{X}} r(x)\left(\frac{\mathrm{d}}{\mathrm{d}x}\log r(x)-\frac{\mathrm{d}}{\mathrm{d}x}\log p_\theta(x)\right)^2\mathrm{d}x.$$
Notice that in statistics, the score is defined by $\nabla_\theta\log p_\theta(x)$, but in score matching the "data score"³ is defined by $\nabla_x\log p_\theta(x)$. It can be shown that for exponential families [32], the SME amounts to solving a linear system, $\bar\theta_{SME}=A^{-1}b$, with $A := E_r[t'(x)\,t'(x)^\top]$ and $b := -E_r[t''(x)]$, where $t'(x)=[t'_1(x),\ldots,t'_D(x)]^\top$ is a $D$-dimensional column vector. For PEFs of order $D$, we have $t'_i(x)=i\,x^{i-1}$ and $t''_i(x)=i(i-1)\,x^{i-2}$. (³Hyvärinen reports an explanation of the naming "score" using a spurious location parameter. See [32].)
Therefore we have $[A(r)]_{ij}=ij\,\mu_{i+j-2}(r)$ and $[b(r)]_i=-i(i-1)\,\mu_{i-2}(r)$, where $\mu_l(r):=E_r[X^l]$ denotes the $l$-th raw moment of $X\sim r(x)$ (with the convention that $\mu_{-1}(r)=0$). For a PDF $r$, we have $\mu_0(r)=1$. Thus the integral-based SME of a density $r$ is:
$$\bar\theta_{SME} = A(r)^{-1}\,b(r).$$
This method of Cobb et al. [16] (1983) anticipated the score matching method of Hyvärinen (2005). It can be derived from Stein's lemma for exponential families (see Appendix B).
The integral-based score matching method is consistent, i.e., if $r=p_\theta$ then $\bar\theta_{SME}=\theta$. The probabilistic proof for $r(x)=p_e(x)$ is reported as Theorem 2 of [16]. The integral-based proof relies on the property that arbitrary-order mixed partial derivatives can be obtained from higher-order partial derivatives with respect to $\theta_1$ [28]. The complexity of the direct SME method is $O(D^3)$ as it requires inverting the matrix $A_D$. Next, we show how to lower this complexity by reporting an equivalent method presented in [16] which relies on recurrence relationships between the moments of $p_\theta(x)$. Recall that $\mu_l(r)$ denotes the $l$-th raw moment $E_r[X^l]$.
Let $A=[a_{i+j-2}]_{ij}$ denote the $D\times D$ symmetric matrix with $a_{i+j-2}(r)=\mu_{i+j-2}(r)$ (with $a_0(r)=\mu_0(r)=1$), and $b=[b_i]_i$ the $D$-dimensional vector with $b_i(r)=(i+1)\mu_i(r)$. We solve the system $A\beta=b$ to get $\beta=A^{-1}b$. We then get the natural parameter $\bar\theta_{SME}$ from the vector $\beta$ by a componentwise rescaling, since the coordinates of $\beta$ encode the coefficients of the derivative polynomial $P'_\theta(x)$ (see the Stein identities of Appendix B). Now, if we inspect the matrix $A_D=[\mu_{i+j-2}(r)]$, we find that it is a Hankel matrix: a Hankel matrix has constant antidiagonals and can be inverted in quadratic time [73,29] instead of cubic time for a general $D\times D$ matrix. Moreover, a Hankel matrix can be stored using linear memory ($2D-2$ coefficients beyond $\mu_0=1$) instead of the quadratic memory of regular matrices.
For example, the matrix $A_4$ is:
$$A_4 = \begin{bmatrix} 1 & \mu_1 & \mu_2 & \mu_3\\ \mu_1 & \mu_2 & \mu_3 & \mu_4\\ \mu_2 & \mu_3 & \mu_4 & \mu_5\\ \mu_3 & \mu_4 & \mu_5 & \mu_6 \end{bmatrix},$$
and requires only $6=2\times 4-2$ coefficients to be stored instead of $4\times 4=16$. In statistics, those matrices $A$ are called moment matrices and are well studied [41,40,67]. For GMMs $r$, the raw moments $\mu_l(r)$ needed to build the matrix $A_D$ can be calculated in closed form, as explained in §2.4.
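The following sketch implements the moment-matrix SME as derived from the Stein identities of Appendix B; note that our indexing and sign conventions ($\beta$ encodes the coefficients of $-P'_\theta$, hence $\theta_i=-\beta_{i-1}/i$) may differ from those of [16].

```python
import numpy as np
from scipy.linalg import hankel

def sme_from_moments(mu):
    """Integral-based SME from the raw moments mu = [mu_0, ..., mu_{2D-2}] of r.

    Solves the Hankel system A beta = b coming from the Stein identities
    j mu_{j-1} = sum_i beta_i mu_{i+j} (with mu_{-1} := 0), where the beta_i
    are the coefficients of -P'_theta(x); hence theta_i = -beta_{i-1} / i.
    """
    D = (len(mu) + 1) // 2
    A = hankel(mu[:D], mu[D - 1:])                     # A[j, i] = mu_{i+j}
    b = np.array([j * mu[j - 1] if j > 0 else 0.0 for j in range(D)])
    beta = np.linalg.solve(A, b)
    return -beta / np.arange(1, D + 1)

# Sanity check on N(0,1) (moments 1, 0, 1): expect theta = (0, -1/2).
print(sme_from_moments(np.array([1.0, 0.0, 1.0])))
```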

Converting moment parameters to natural parameters using MaxEnt
In §2.1, we showed how to use the integral extension of the MLE to convert $r$ into $p^\eta\in E_t$, a PED indexed by a moment parameter $\eta$. However, for PEFs with sufficient statistic vector $(x^1,\ldots,x^D)$, we have neither $F^*$ nor $\nabla F^*$ in closed form: thus, given the parameter $\bar\eta_{MLE}$, we cannot compute $\bar\theta_{MLE}=\nabla F^*(\bar\eta_{MLE})$, and we need to approximate $\bar\theta_{MLE}$. More generally, given the density $p^\eta$ parameterized by a moment parameter, we ask for the equivalent $p_\theta$ parameterized by the corresponding natural parameter $\theta=\theta(\eta)=\nabla F^*(\eta)$. Let us report the iterative approximation technique of [46] (extending the method described in [77]) based on solving a maximum entropy (MaxEnt) problem. This method will be useful when comparing the results of our heuristic with a direct (but very costly) method. The density $p_\theta$ of an exponential family can be characterized as a maximum entropy distribution given the moment constraints $E_{p_\theta}[t(x)]=\eta$: $\max_p h(p)$ subject to the $D+1$ moment constraints $\int t_i(x)p(x)\mathrm{d}x=\eta_i$ for $i\in\{0,\ldots,D\}$, where we add by convention $\eta_0=1$ and $t_0(x)=1$ (so that $\int p(x)\mathrm{d}x=1$). The solution of this MaxEnt problem [46] is $p(x)=p_\lambda(x)$, where $\lambda$ collects the $D+1$ Lagrangian parameters. Here, we adopt the following canonical parameterization of the densities of an exponential family:
$$p_\lambda(x) = \exp\left(-\sum_{i=0}^{D}\lambda_i x^i\right).$$
That is, $F(\theta)=\lambda_0$ and $\lambda_i=-\theta_i$ for $i\in\{1,\ldots,D\}$. Parameter $\lambda$ is a kind of augmented natural parameter which includes the log-normalizer as its first coefficient.
For a PEF of order $D$, the update at iterate $\lambda^{(t)}$ requires the partial derivatives $\frac{\partial\eta_i}{\partial\lambda_j}$, which are expressed with the raw moments $\mu_{i+j}(p_\lambda)$ of the current iterate. This yields a moment matrix $H_\lambda$ (a Hankel matrix) which can be inverted in quadratic time [29]. In our setting, the moment matrix is invertible because $|H|>0$, see [36]. Let $\bar\theta^{(T)}(\eta)$ denote the natural parameter of the PED retrieved from $\lambda^{(T)}$ after $T$ iterations. We have the following approximation of the JD:
$$D_J[m,m'] \approx \left(\bar\theta^{(T)}(\bar\eta')-\bar\theta^{(T)}(\bar\eta)\right)^\top\left(\bar\eta'-\bar\eta\right).$$
The method is costly because we need to numerically calculate the moments $\mu_{i+j-2}(p_\theta)$ and the $K_i$'s (e.g., with a univariate Simpson integrator). Another potential method consists in estimating these expectations using rejection sampling. We may also consider the holonomic gradient descent [28]. Thus the conversion $\eta\to\theta$ is costly. Our heuristic $\tilde\Delta_J$ bypasses this costly moment-to-natural parameter conversion by converting each mixture $m$ into a pair $(p_{\bar\theta_{SME}}, p^{\bar\eta_{MLE}})$ of PEDs parameterized in the natural and moment parameters (i.e., it loosely untangles these dual parameterizations).
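For completeness, here is a heavily simplified sketch of the $\eta\to\theta$ conversion, written as a damped Newton descent on the convex dual $G(\theta)=F(\theta)-\langle\theta,\eta\rangle$ with all integrals computed by quadrature on a truncated support. This is a stand-in for the method of [46] (not a reproduction of it), and the support bounds, damping, and iteration count are our assumptions.

```python
import numpy as np
from scipy.integrate import quad

def ped_moments(theta, kmax, lo, hi):
    """Raw moments mu_0..mu_kmax of the (normalized) PED p_theta on [lo, hi]."""
    coeffs = np.r_[theta[::-1], 0.0]                   # P_theta(x), highest degree first
    q = lambda x: np.exp(np.polyval(coeffs, x))
    raw = np.array([quad(lambda x, k=k: x**k * q(x), lo, hi)[0] for k in range(kmax + 1)])
    return raw / raw[0]

def theta_from_eta(eta, lo=-10.0, hi=10.0, iters=30, damping=0.5):
    """Damped Newton descent on the convex dual G(theta) = F(theta) - <theta, eta>."""
    D = len(eta)
    theta = np.zeros(D)
    theta[-1] = -1.0                                   # start from a normalizable PED
    idx = np.arange(1, D + 1)
    for _ in range(iters):
        mu = ped_moments(theta, 2 * D, lo, hi)
        grad = mu[idx] - eta                           # gradient: eta(theta) - eta
        hess = mu[idx[:, None] + idx[None, :]] - np.outer(mu[idx], mu[idx])  # Cov(t, t)
        theta = theta - damping * np.linalg.solve(hess, grad)
    return theta

# Sanity check: eta = (0, 1) are the moments of N(0,1), so theta ~ (0, -1/2).
print(theta_from_eta(np.array([0.0, 1.0])))
```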

Raw moments of normal distributions and GMMs
The $l$-th raw moment $E[Z^l]$ of a standard normal distribution $Z\sim N(0,1)$ is $0$ when $l$ is odd (since the density is an even function) and $(l-1)!!=2^{-\frac{l}{2}}\frac{l!}{(l/2)!}$ when $l$ is even, where $n!!=\sqrt{\frac{2^{n+1}}{\pi}}\,\Gamma\!\left(\frac{n}{2}+1\right)=\prod_{k=0}^{\lceil n/2\rceil-1}(n-2k)$ is the double factorial (with $(-1)!!=1$ by convention). Using the binomial theorem, we deduce that a normal distribution $X=\mu+\sigma Z$ has finite raw moments:
$$\mu_l(p_{\mu,\sigma}) = E[(\mu+\sigma Z)^l] = \sum_{j=0}^{l}\binom{l}{j}\mu^{l-j}\sigma^{j}\,E[Z^j].$$
By the linearity of the expectation $E[\cdot]$, we deduce the $l$-th raw moment of a GMM $m(x)=\sum_{i=1}^{k}w_i\,p_{\mu_i,\sigma_i}(x)$:
$$\mu_l(m) = \sum_{i=1}^{k} w_i\,\mu_l(p_{\mu_i,\sigma_i}).$$
Notice that by using [10], we can extend this formula to truncated normals and truncated GMMs. Thus computing the first $O(D)$ raw moments of a GMM with $k$ components can be done in $O(kD^2)$ using the Pascal triangle method for computing the binomial coefficients. In practice, to instantiate the generic raw moment formula for the first prescribed $D$ orders, we can also use the computer algebra system Maxima. In our implementation, we could use up to order $D=34$.
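The closed-form raw moments translate directly into code; the sketch below (with illustrative names) computes $\mu_l$ for a normal component via the binomial theorem and extends it to a GMM by linearity.

```python
from math import comb

def double_factorial(n):
    """n!! with the convention (-1)!! = 0!! = 1."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result

def normal_raw_moment(mu, sigma, l):
    """mu_l = E[(mu + sigma Z)^l], Z ~ N(0,1); odd Gaussian moments vanish."""
    return sum(comb(l, j) * mu**(l - j) * sigma**j * double_factorial(j - 1)
               for j in range(0, l + 1, 2))

def gmm_raw_moment(w, mus, sigmas, l):
    """mu_l(m) = sum_i w_i mu_l(p_{mu_i, sigma_i}) by linearity of expectation."""
    return sum(wi * normal_raw_moment(mi, si, l) for wi, mi, si in zip(w, mus, sigmas))

# Example: first four raw moments of a two-component GMM.
w, mus, sigmas = [0.4, 0.6], [-1.0, 2.0], [0.5, 1.0]
print([gmm_raw_moment(w, mus, sigmas, l) for l in range(1, 5)])
```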

Goodness-of-fit between GMMs and PEDs: Higher order Hyvärinen divergences
Once we have converted a GMM $m(x)$ into an unnormalized PED $q_{\theta_m}(x)=\tilde p_{\theta_m}(x)$, we would like to evaluate the quality of the conversion using a statistical divergence $D(m(x):q_{\theta_m}(x))$. This divergence shall allow us to perform model selection by choosing the order $D$ of the PEF so that $D(m(x):p_\theta(x))\leq\epsilon$ for $\theta\in\mathbb{R}^D$, where $\epsilon>0$ is a prescribed threshold. Since PEDs have computationally intractable normalization constants, we consider a projective divergence [1] $D(p:q)$ that satisfies $D(p:\lambda q)=D(p:q)$ for any $\lambda>0$. For example, we may consider the $\gamma$-divergence [25], which is a two-sided projective divergence, $D_\gamma(\lambda p:\lambda' q)=D_\gamma(p:q)$ for any $\lambda,\lambda'>0$, and converges to the KLD when $\gamma\to 0$. However, the $\gamma$-divergence between a mixture model and an unnormalized PED does not yield a closed-form formula (and the $\gamma$-divergence between two unnormalized PEDs is expressed using the log-normalizer function $F(\cdot)$, which is computationally intractable). In order to get a closed-form formula for a divergence between a mixture model and an unnormalized PED, we consider the order-$\alpha$ Hyvärinen divergence [1] (also called Fisher divergence [31]). The Hyvärinen divergence $D_{H,\alpha}$ is a right-sided projective divergence [60] with $D_{H,\alpha}(p:q)=D_{H,\alpha}(p:\lambda q)$ for any $\lambda>0$. Thus we have $D_{H,\alpha}(m:p_\theta)=D_{H,\alpha}(m:q_\theta)$ for an unnormalized PED $q_\theta$.
Theorem 2 The Hyvärinen divergence $D_{H,2}$ of order 2 between a GMM $m(x)$ and a PED $q_\theta(x)$ is available in closed form.
The closed-form expression follows by expanding $D_{H,2}(m:q_\theta)$ using $m'(x)$, the derivative of the mixture density $m(x)$, and integrating term by term against the Gaussian components: each term reduces to Gaussian integrals of polynomials, which are available in closed form via the log-normalizer of the Gaussian exponential family [1]. Thus the Hyvärinen divergence $D_{H,2}$ of order 2 between a GMM and a PED expressed in its natural parameter is available in closed form. For example, when $k=1$ (the mixture $m$ is a single Gaussian $p_{\mu_1,\sigma_1}$) and $p_\theta$ is a normal distribution (i.e., $D=2$, $q_\theta=p_{\mu_2,\sigma_2}$), the order-2 Hyvärinen divergence admits a simple closed-form expression.
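As a numerical stand-in for the closed-form formula of Theorem 2 (whose full expansion we omit), the sketch below evaluates the standard (order-1) Hyvärinen/Fisher divergence between a GMM and an unnormalized PED by quadrature; it illustrates the projective property that makes the intractable normalizer of $q_\theta$ irrelevant. Bounds and parameters are illustrative.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def gmm_score(x, w, mu, sigma):
    """d/dx log m(x) = m'(x) / m(x) for a univariate GMM."""
    pdf = [wi * norm.pdf(x, mi, si) for wi, mi, si in zip(w, mu, sigma)]
    dpdf = [p * (mi - x) / si**2 for p, (mi, si) in zip(pdf, zip(mu, sigma))]
    return sum(dpdf) / sum(pdf)

def hyvarinen_div(w, mu, sigma, theta, lo=-15.0, hi=15.0):
    """(1/2) E_m[(score of m - score of q_theta)^2]; unchanged if q is rescaled."""
    dP = np.polyder(np.r_[np.asarray(theta)[::-1], 0.0])   # coefficients of P'_theta
    def integrand(x):
        m = sum(wi * norm.pdf(x, mi, si) for wi, mi, si in zip(w, mu, sigma))
        return 0.5 * m * (gmm_score(x, w, mu, sigma) - np.polyval(dP, x))**2
    return quad(integrand, lo, hi, limit=200)[0]

# Example: fit quality of a quartic PED against a bimodal GMM.
print(hyvarinen_div([0.5, 0.5], [-1.5, 1.5], [0.7, 0.7], [0.0, 1.0, 0.0, -0.25]))
```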

Experiments: Jeffreys divergence between mixtures
Estimating $\hat D_J$ between $k$-GMMs with Monte Carlo sampling using $s$ samples requires $\tilde O(ks)$ operations. In comparison, approximating $D_J$ by $\tilde\Delta_J$ using order-$D$ PEDs requires $O(kD^2)$ operations to compute the raw moments and $O(D^2)$ operations to invert a Hankel moment matrix. Choosing $D=2k$, we get a deterministic $O(k^3)$ algorithm. Thus our heuristic is faster when $k^2\ll s$. To get quantitative results, we build random GMMs with $k$ components whose weights, means and standard deviations are drawn from uniform distributions $U_i$ on $[0,1)$; the weights are then normalized to sum up to one. For each value of $k$, we perform 1000 trials to gather statistics, and use $s=10^5$ samples for evaluating the Jeffreys divergence $\hat D_J$ by Monte Carlo sampling. We denote by $\mathrm{error} := \frac{|\hat D_J-\tilde\Delta_J|}{\hat D_J}$ the relative error of an experiment. Table 1 presents the results of the experiments for $D=2k$ (since there are at most $k$ modes for a $k$-GMM): we indicate the average error, the maximum error (the minimum error is very close to zero, of order $10^{-5}$), and the speed-up obtained by our heuristic $\tilde\Delta_J$.
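One plausible instantiation of this random GMM generator (the paper's exact ranges are only partially legible) is sketched below; the lower bound on $\sigma$ is our assumption.

```python
import numpy as np

rng = np.random.default_rng(42)

def random_gmm(k):
    """Random k-GMM: uniform weights renormalized to sum to one; uniform mu, sigma."""
    w = rng.uniform(0.0, 1.0, size=k)
    w /= w.sum()
    mu = rng.uniform(0.0, 1.0, size=k)
    sigma = 0.1 + rng.uniform(0.0, 1.0, size=k)   # offset keeps sigma > 0 (assumption)
    return w, mu, sigma

print(random_gmm(3))
```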
Notice that the quality of the approximations depends on the number of modes of the GMMs. Counting or locating the modes of a GMM is a difficult problem [14,2], even for simple cases [3,4]. Figure 4 displays several experiments converting mixtures into pairs of PEDs to approximate the Jeffreys divergence. Figure 5 illustrates the use of the order-2 Hyvärinen divergence $D_{H,2}$ to perform model selection of the order of a PED.
Finally, Figure 6 displays some limitations of the GMM-to-PED conversion when the GMMs have many modes. In that case, however, running the conversion from $\bar\eta_{MLE}$ to $\bar\theta^{(T)}(\bar\eta_{MLE})$ improves the results substantially, but requires much more computation.

Conclusion and perspectives
In this paper, we described several methods to convert a univariate Gaussian mixture model (1D GMM) of $k$ components into a polynomial exponential density (PED) of order $D$. Those conversion methods use the closed-form raw moment formula of GMMs detailed in §2.4. A $k$-mode [14] distribution $f(x)$ can be modeled either by a GMM $m(x)$ with $k$ components ($3k-1$ parameters) or, more compactly, by a PED of order $D=2k$ requiring $2k$ parameters. We have demonstrated that the Jeffreys divergence (JD) between two GMMs can be accurately approximated by converting the GMMs to PEDs using a pair of dually parameterized integral maximum likelihood/score matching estimators. Compared to the vanilla stochastic Monte Carlo method (which is non-deterministic), our deterministic heuristic is faster by several orders of magnitude while keeping a reasonable accuracy when the number of modes is small.
In order to select the order $D$ of the PED, we proposed to use the Hyvärinen divergence of order 2, which admits a closed-form formula between the mixture and the unnormalized PED. The proposed technique extends to other univariate mixtures of exponential families [26] (e.g., mixtures of Rayleigh distributions, mixtures of Gamma or Beta distributions, etc.).
Finally, we conclude by listing four topics for future research: • Although we have dealt with univariate GMMs, our technique may extend to multivariate GMMs using multivariate PEFs: for example, Hayakawa and Takemura [28] considered bivariate PEFs, and Pistone and Wynn studied multivariate cumulants and moments [66].
• Since PEDs and GMMs have different distribution tails, it may be better to proceed by truncating the support $\mathcal{X}$, say to $\mathcal{X}=[\min_i\{\mu_i-3\sigma_i\},\ \max_i\{\mu_i+3\sigma_i\}]$ for GMMs with normal components $(\mu_i,\sigma_i)$. The integral estimators remain in closed form since truncated Gaussian moments are available in closed form [62] (expressed using the cumulative distribution function $\Phi$ of the standard Gaussian). Notice that a truncated exponential family is an exponential family that may bear different properties [18] (e.g., non-steepness).
• Instead of the Jeffreys divergence, we may consider the Jensen-Shannon divergence [39] (JSD) as a symmetrization of the KLD: $D_{JS}[m,m'] := \frac{1}{2}\left(D_{KL}\!\left[m:\frac{m+m'}{2}\right]+D_{KL}\!\left[m':\frac{m+m'}{2}\right]\right)$. The JSD has two main advantages over the Jeffreys divergence: it is always upper bounded by $\log 2$, and its square root yields a metric distance [24]. In contrast, the KLD is unbounded, and its square root (or any other positive exponent of it) does not yield a metric [74]. The differential entropies of the PEDs obtained by converting the mixtures $m$, $m'$ and $\frac{m+m'}{2}$ can then be estimated using the projective power entropy [23].
• Yet another interesting topic is to convert a PED into a GMM with a prescribed number of components using a continuous extension of k-MLE [48] (see centroidal Voronoi tessellations [20]).

A Monomial exponential families
Consider the polynomial exponential density $p_{D,\theta}(x) := \exp\left(\theta x^D - F_D(\theta)\right)$ for an even integer $D$. The set of such densities forms a Monomial Exponential Family [58] (MEF) with $t(x)=x^D$: a univariate order-1 exponential family. MEFs are special PEFs (with $\theta_1=\ldots=\theta_{D-1}=0$ and $\theta_D=\theta$) which yield tractable information-theoretic quantities such as the KLD or the differential entropy. Indeed, the cumulant function $F_D(\theta)$ is available in closed form [58]:
$$F_D(\theta) = \log\left(\frac{2}{D}\,\Gamma\!\left(\frac{1}{D}\right)(-\theta)^{-\frac{1}{D}}\right), \quad \theta<0,$$
where $\Gamma(\cdot)$ denotes the gamma function. The natural parameter space is $\Theta=(-\infty,0)$. The moment parameter is $\eta=\nabla F_D(\theta)=-\frac{1}{D\theta}>0$, and the moment space is $H=(0,\infty)$. We have $\theta=\nabla F_D^*(\eta)=-\frac{1}{D\eta}<0$, and the convex conjugate is $F_D^*(\eta)=\theta\eta-F_D(\theta)=-\frac{1}{D}-F_D\!\left(-\frac{1}{D\eta}\right)$, so that the Fenchel-Young equality $F_D(\theta)+F_D^*(\eta)=\theta\eta$ holds. The differential entropy of an MEF density [56] is $h[p_{D,\theta}]=-F_D^*(\eta)$, and the Kullback-Leibler divergence between two MEF densities is a scaled Itakura-Saito divergence [55] $D_{IS}$:
$$D_{KL}[p_{D,\theta_1}:p_{D,\theta_2}] = \frac{1}{D}\,D_{IS}\!\left(\frac{\theta_2}{\theta_1}\right), \quad D_{IS}(u) := u-\log u-1.$$
Let us report the MEF for $D=2$: the zero-mean normal distributions [54] $N(0,\sigma^2)$, with $\theta=-\frac{1}{2\sigma^2}$. We have the KLD between two zero-mean normal distributions $p_{0,\sigma_1^2}$ and $p_{0,\sigma_2^2}$:
$$D_{KL}[p_{0,\sigma_1^2}:p_{0,\sigma_2^2}] = \frac{1}{2}\left(\frac{\sigma_1^2}{\sigma_2^2}-\log\frac{\sigma_1^2}{\sigma_2^2}-1\right).$$
This matches the usual KLD between two normal distributions, which can be interpreted as the sum of a squared Mahalanobis distance and half of an Itakura-Saito divergence (as noticed in [19]). When $D=1$, the MEF is not defined, but we can define Absolute Monomial Exponential Families [58] (AMEFs) with PDFs $p_\theta(x)=\exp(\theta|x|-F(\theta))$.
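The closed-form cumulant $F_D(\theta)$ above (our reconstruction) can be sanity-checked numerically:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gammaln

def F_closed(theta, D):
    """Reconstructed closed form: F_D(theta) = log((2/D) Gamma(1/D) (-theta)^(-1/D))."""
    return np.log(2.0 / D) + gammaln(1.0 / D) - np.log(-theta) / D

def F_numeric(theta, D):
    """F_D(theta) = log of the integral of exp(theta x^D) over the real line."""
    return np.log(quad(lambda x: np.exp(theta * x**D), -np.inf, np.inf)[0])

print(F_closed(-0.5, 4), F_numeric(-0.5, 4))   # the two values should agree
```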
B Stein's lemma for exponential families
Let $AC(\mathbb{R})$ denote the set of absolutely continuous functions on $\mathbb{R}$.
Theorem 3 Let $X$ be a continuous random variable with exponential density $p_\theta(x)=\exp\left(\sum_{i=1}^{D}\theta_i t_i(x)-F(\theta)+k(x)\right)$ with support $\mathcal{X}=\mathbb{R}$. For any $f\in AC(\mathbb{R})$ with $E[|f'(X)|]<\infty$, we have the following Stein identity:
$$E[f'(X)] + E\!\left[f(X)\left(\sum_{i=1}^{D}\theta_i t'_i(X)+k'(X)\right)\right] = 0.$$
Proof: Let us integrate by parts⁵ $E[f'(X)]=\int_{\mathbb{R}} f'(x)p_\theta(x)\mathrm{d}x$. Since $E[|f'(X)|]<\infty$, we necessarily have $\lim_{x\to\pm\infty} p_\theta(x)f(x)=0$, and therefore we get
$$E[f'(X)] = -\int_{\mathbb{R}} f(x)\,p'_\theta(x)\,\mathrm{d}x = -E\!\left[f(X)\left(\sum_{i=1}^{D}\theta_i t'_i(X)+k'(X)\right)\right].$$
Notice that if we define $h(x):=e^{k(x)}$ (i.e., $k(x)=\log h(x)$), then $p_\theta(x)=\exp\left(\sum_{i=1}^{D}\theta_i t_i(x)-F(\theta)\right)h(x)$, and we have $k'(x)=\frac{h'(x)}{h(x)}$, so that the Stein identity can be rewritten equivalently as:
$$E[f'(X)] + E\!\left[f(X)\left(\sum_{i=1}^{D}\theta_i t'_i(X)+\frac{h'(X)}{h(X)}\right)\right] = 0.$$
(⁵Recall that $(u(x)v(x))'=u'(x)v(x)+u(x)v'(x)$, so that $\int u'v = [uv]-\int uv'$.)
This proof extends the original proof of Hudson [30] (1978), who considered $D=1$ and $t(x)=x$.
Thus for a PED of order $D$ with $t_i(x)=x^i$ (and $k(x)=0$), we have:
$$E[f'(X)] + E\!\left[f(X)\,P'_\theta(X)\right] = 0, \quad P'_\theta(x)=\sum_{i=1}^{D} i\,\theta_i x^{i-1}. \qquad (46)$$
Let $\mu_i=E[X^i]$ denote the raw (non-central) moments. Using the linearity of the expectation in Eq. 46 with $f(x)=x^j$, we get the following identity for PEDs, valid for any integer $j\geq 0$ (with the convention $\mu_{-1}=0$):
$$j\,\mu_{j-1} + \sum_{i=1}^{D} i\,\theta_i\,\mu_{i+j-1} = 0.$$
After some rewriting, this equation corresponds to Theorem 1 of [16] for their type N distributions (PEDs). Based on these identities, we recover the method of Cobb et al. [16] (and hence the Score Matching Method for PEFs [32]).
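The Stein identity (Eq. 46) is easy to verify numerically for a quartic PED; a sketch with illustrative parameters:

```python
import numpy as np
from scipy.integrate import quad

theta = np.array([0.3, 0.5, 0.0, -0.25])        # quartic PED: q(x) = exp(P_theta(x))
P = np.r_[theta[::-1], 0.0]                      # polynomial coefficients, highest first
dP = np.polyder(P)                               # coefficients of P'_theta
q = lambda x: np.exp(np.polyval(P, x))
Z = quad(q, -np.inf, np.inf)[0]

for j in range(1, 5):                            # f(x) = x^j
    lhs = quad(lambda x: j * x**(j - 1) * q(x), -np.inf, np.inf)[0] / Z
    rhs = -quad(lambda x: x**j * np.polyval(dP, x) * q(x), -np.inf, np.inf)[0] / Z
    print(j, round(lhs, 6), round(rhs, 6))       # the two columns should match
```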

C An example of GMM conversion to PEDs of different orders
In the example below in Maxima, we display the conversion of two GMMs, $m_1$ with $k_1=10$ components and $m_2$ with $k_2=11$ components, into PEDs of order 8. The MC estimation of the JD with $s=10^6$ samples yields $0.2633\ldots$ while the approximation on the corresponding PEDs yields $0.2618\ldots$ (the relative error is $0.00585\ldots$, or about $0.585\ldots\%$).

D Acronyms and notations
$\bar\eta_{MLE}$ — Integral-based Maximum Likelihood Estimator
$\bar\theta_{SME}$ — Integral-based Score Matching Estimator
$\bar\theta^{(T)}$ — Approximation of $\nabla F^*(\bar\eta_{MLE})$ using $T$ iterations of [46]
$p_\theta$ — EF density parameterized using the natural parameter
$p^\eta$ — EF density parameterized using the moment parameter
$\Theta$ — Natural parameter space
$H$ — Moment parameter space (read $H$ as Greek Eta)
$F(\theta)$ — Cumulant function of an EF
$Z(\theta)$ — Partition function of an EF ($Z(\theta)=e^{F(\theta)}$)
$\tilde p_\theta = q_\theta$ — Unnormalized EF density parameterized using the natural parameter
$\mu_k(p)$ — Raw moment $E_p[X^k]$
$p_{\mu,\sigma}$ — PDF of a normal distribution with mean $\mu$ and standard deviation $\sigma$
$E_t$ — EF with sufficient statistics $t(x)$
$E_D$ — PEF of order $D$, $t(x)=(x,x^2,\ldots,x^D)$
$E_4$ — Quartic EF