On a Variational Definition for the Jensen-Shannon Symmetrization of Distances Based on the Information Radius

We generalize the Jensen-Shannon divergence and the Jensen-Shannon diversity index by considering a variational definition with respect to a generic mean, thereby extending the notion of Sibson’s information radius. The variational definition applies to any arbitrary distance and yields a new way to define a Jensen-Shannon symmetrization of distances. When the variational optimization is further constrained to belong to prescribed families of probability measures, we get relative Jensen-Shannon divergences and their equivalent Jensen-Shannon symmetrizations of distances that generalize the concept of information projections. Finally, we touch upon applications of these variational Jensen-Shannon divergences and diversity indices to clustering and quantization tasks of probability measures, including statistical mixtures.


Introduction: Background and motivations
Let (X, F, µ) denote a measure space [8] with sample space X, σ-algebra F on the set X, and positive measure µ on (X, F) (usually the Lebesgue measure or the counting measure). Denote by D = D(X) the set of densities with full support X (Radon-Nikodym derivatives of probability measures with respect to µ):

D(X) := { p : X → R : p(x) > 0 µ-a.e., ∫_X p(x) dµ(x) = 1 }.
The 2-point JSD of Eq. 2 can be extended to a weighted set of n densities P := {(w_1, p_1), ..., (w_n, p_n)} (with positive w_i's normalized so that Σ_{i=1}^n w_i = 1), thus providing a diversity index (i.e., an n-point JSD) for P:

JS(P) := Σ_{i=1}^n w_i D_KL[p_i : p̄],    (3)

where p̄ := Σ_{i=1}^n w_i p_i denotes the statistical mixture [36] of the densities of P. The KLD is also called the relative entropy since it can be expressed as the difference between the cross-entropy h[p : q] and the entropy h[p]:

D_KL[p : q] = h[p : q] − h[p],

with h[p : q] := −∫_X p(x) log q(x) dµ(x) and h[p] := h[p : p] = −∫_X p(x) log p(x) dµ(x). When µ is the Lebesgue measure, the Shannon entropy is also called the differential entropy [17]. It follows that the Jensen-Shannon diversity index of Eq. 3 can be rewritten as:

JS(P) = h[p̄] − Σ_{i=1}^n w_i h[p_i].    (7)

The JSD representation of Eq. 7 is a Jensen divergence [45] for the strictly convex negentropy F(p) = −h(p), since the entropy function h[·] is strictly concave, hence its name Jensen-Shannon divergence.
Since p_i(x)/p̄(x) ≤ p_i(x)/(w_i p_i(x)) = 1/w_i, it can be shown that the Jensen-Shannon diversity index is upper bounded by H(w) := −Σ_{i=1}^n w_i log w_i, the discrete Shannon entropy of the weight distribution. In particular, the JSD is bounded by log 2, although the KLD is unbounded and may even be equal to +∞ when the definite integral diverges (e.g., the KLD between the standard Cauchy distribution and the standard Gaussian distribution). Another nice property of the JSD is that its square root yields a metric distance [23,28]. This property further holds for the quantum JSD [58]. Recently, the JSD has gained interest in machine learning: see, for example, Generative Adversarial Networks [30] (GANs) in deep learning [29], where it was proven that minimizing the GAN objective function by adversarial training is equivalent to minimizing a JSD.
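As a quick numerical illustration of these bounds, here is a small Python sketch (our own illustration, not part of the original material) that computes the Jensen-Shannon diversity index of a weighted set of discrete densities and checks that it does not exceed H(w), nor log 2 in the 2-point case:

    import numpy as np

    def kl(p, q):
        # Discrete Kullback-Leibler divergence (natural logarithm).
        return np.sum(p * np.log(p / q))

    def js_diversity(weights, densities):
        # n-point Jensen-Shannon diversity index: sum_i w_i KL[p_i : mixture].
        weights = np.asarray(weights, dtype=float)
        densities = np.asarray(densities, dtype=float)
        mixture = weights @ densities
        return sum(w * kl(p, mixture) for w, p in zip(weights, densities))

    rng = np.random.default_rng(0)
    w = rng.dirichlet(np.ones(4))
    P = rng.dirichlet(np.ones(10), size=4)       # four random densities on a 10-letter alphabet
    H_w = -np.sum(w * np.log(w))                 # Shannon entropy of the weight distribution
    print(js_diversity(w, P), "<=", H_w)         # diversity index upper bounded by H(w)
    print(js_diversity([0.5, 0.5], P[:2]), "<=", np.log(2))   # 2-point JSD bounded by log 2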
The Jensen-Shannon principle of taking the average of the (Kullback-Leibler) divergences between the source parameters and the mid-parameter can be applied to other distances. For example, the Jensen-Bregman divergence is a Jensen-Shannon symmetrization of the Bregman divergence B_F [45]:

B^JS_F(θ_1, θ_2) := (1/2)( B_F(θ_1 : (θ_1+θ_2)/2) + B_F(θ_2 : (θ_1+θ_2)/2) ).

The Jensen-Bregman divergence B^JS_F can also be written as an equivalent Jensen divergence J_F:

J_F(θ_1 : θ_2) := (F(θ_1) + F(θ_2))/2 − F((θ_1+θ_2)/2),

where F is a strictly convex function ensuring J_F(θ_1 : θ_2) ≥ 0 with equality iff θ_1 = θ_2. Because of its use in various fields of information sciences [4], various generalizations of the JSD have been proposed: these generalizations are either based on Eq. 1 [43] or based on Eq. 7 [39,48,44]. For example, the (arithmetic) mixture p̄ = Σ_i w_i p_i in Eq. 1 was replaced by an abstract statistical mixture with respect to a generic mean M in [43] (e.g., the geometric mixture induced by the geometric mean), and the two KLDs defining the JSD in Eq. 1 were further averaged using another abstract mean N, thus yielding the following generic (M, N)-Jensen-Shannon divergence [43] ((M, N)-JSD):

JS^{M_α, N_β}[p : q] := N_β( D_KL[p : (pq)^M_α], D_KL[q : (pq)^M_α] ),    (10)

where (pq)^M_α denotes the statistical weighted mixture:

(pq)^M_α(x) := M_α(p(x), q(x)) / ∫_X M_α(p(t), q(t)) dµ(t).

Notice that when M = N = A (the arithmetic mean), Eq. 10 of the (A, A)-JSD reduces to the ordinary JSD of Eq. 1. When the means M and N are symmetric, the (M, N)-JSD is symmetric. In general, a weighted mean M_α(a, b) for any α ∈ [0, 1] shall satisfy the in-betweenness property:

min(a, b) ≤ M_α(a, b) ≤ max(a, b).

A weighted mean (also called a barycenter) can be built from a non-weighted mean M(a, b) (i.e., α = 1/2) using the dyadic expansion of the weight α, as detailed in [38]. The three Pythagorean means defined for positive scalars a > 0 and b > 0 are classic examples of means:

• the arithmetic mean A(a, b) = (a + b)/2,
• the geometric mean G(a, b) = √(ab), and
• the harmonic mean H(a, b) = 2ab/(a + b).

These Pythagorean means may be interpreted as special instances of another family of means: the power means P_α(a, b) := ((a^α + b^α)/2)^{1/α} defined for α ∈ R\{0} (also called Hölder means). The power means can be extended to the full range α ∈ R by using the property that lim_{α→0} P_α(a, b) = G(a, b). The power means are homogeneous means: P_α(λa, λb) = λ P_α(a, b) for any λ > 0. We refer to the handbook of means [14] for definitions and principles of other means beyond these power means.
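The following Python sketch (added for illustration; the function name is ours) evaluates the power means and checks the in-betweenness, homogeneity, and limit properties mentioned above:

    import numpy as np

    def power_mean(a, b, alpha):
        # Power (Hoelder) mean P_alpha(a, b); the limit alpha -> 0 is the geometric mean.
        if alpha == 0:
            return np.sqrt(a * b)
        return ((a**alpha + b**alpha) / 2.0) ** (1.0 / alpha)

    a, b = 2.0, 8.0
    print(power_mean(a, b, 1))      # arithmetic mean: 5.0
    print(power_mean(a, b, 0))      # geometric mean: 4.0
    print(power_mean(a, b, -1))     # harmonic mean: 3.2
    print(power_mean(a, b, 1e-8))   # close to the geometric mean (limit alpha -> 0)
    assert min(a, b) <= power_mean(a, b, 0.7) <= max(a, b)                   # in-betweenness
    assert np.isclose(power_mean(3*a, 3*b, 0.7), 3 * power_mean(a, b, 0.7))  # homogeneity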
Choosing the abstract mean M in accordance with the family of the densities allows one to obtain closed-form formulas for the (M, N)-JSDs, which otherwise rely on definite integral calculations. For example, the JSD between two Gaussian densities does not admit a closed-form formula because of the log-sum integral, but the (G, N)-JSD admits a closed-form formula when using geometric statistical mixtures (i.e., when M = G). As an application of these generalized JSDs, Deasy et al. [21] used the skewed geometric JSD (namely, the (G_α, A_{1−α})-JSD for α ∈ (0, 1)), which admits a closed-form formula between normal densities [43], and showed how regularizing an optimization task with this G-JSD divergence improves reconstruction and generation in Variational AutoEncoders (VAEs).
More generally, instead of using the KLD, one can also use any arbitrary distance D to define its JS-symmetrization as follows:

D^JS_{M,N}[p : q] := N( D[p : (pq)^M_{1/2}], D[q : (pq)^M_{1/2}] ).

These symmetrizations may further be skewed by using M_α and/or N_β for α ∈ (0, 1) and β ∈ (0, 1), yielding the definition [43]:

D^JS_{M_α, N_β}[p : q] := N_β( D[p : (pq)^M_α], D[q : (pq)^M_α] ).

In this work, we consider symmetrizing an arbitrary distance D (including the KLD), generalizing the Jensen-Shannon divergence by using a variational formula for the JSD. Namely, we observe that the Jensen-Shannon divergence can also be defined as the following minimization problem:

JS[p : q] = min_{c∈D} (1/2)( D_KL[p : c] + D_KL[q : c] ),

since the optimal density c is proven unique using the calculus of variations [54,2] and corresponds to the mid density (p+q)/2. The paper is organized as follows: In §2, we recall the rationale and definitions of the Rényi α-entropy and the Rényi α-divergence [52], and explain the information radius of Sibson [54], which includes as a special case the ordinary Jensen-Shannon divergence. It is noteworthy to point out that Sibson's work (1969) includes as a particular case a definition of the JSD, prior to the reference paper of Lin [34] (1991). In §3, we present the JS-symmetrization variational formula based on a generalization of the information radius with an abstract mean (Definition 6 and Definition 4). In §4, we constrain the mid density to belong to a class of (parametric) probability densities, like an exponential family [7], and obtain relative information radii generalizing the information radius and related to information projections. Definition 7 generalizes the (relative) normal information radius of Sibson [54], who considered the multivariate normal family. As an application of these relative JSDs, we consider clustering and quantization of probability densities in §4.2. Finally, we conclude by summarizing our contributions and discussing related works in §5.
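To make the variational formula concrete, here is a short Python sketch (our own illustration, not from the paper) that minimizes the average KLD to a candidate density over the probability simplex and recovers the mid density (p+q)/2 as the optimal c:

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))

    def kl(a, b):
        return np.sum(a * np.log(a / b))

    def objective(u):
        # Parameterize the candidate density c by softmax(u) to stay on the simplex.
        c = np.exp(u) / np.sum(np.exp(u))
        return 0.5 * (kl(p, c) + kl(q, c))

    res = minimize(objective, x0=np.zeros(5), method="Nelder-Mead",
                   options={"xatol": 1e-10, "fatol": 1e-12, "maxiter": 20000})
    c_star = np.exp(res.x) / np.sum(np.exp(res.x))
    print(np.round(c_star, 4))          # numerically close to the mid density
    print(np.round((p + q) / 2, 4))     # (p+q)/2: the unique minimizer
    print(res.fun, 0.5 * (kl(p, (p + q) / 2) + kl(q, (p + q) / 2)))  # both approximate JS[p:q]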

Rényi entropy and divergence, and Sibson information radius
Rényi [52] investigated a generalization of the four axioms of Fadeev [25] that yield the unique Shannon entropy [19]. In doing so, Rényi replaced the ordinary weighted arithmetic mean by a more general class of averaging schemes. Namely, Rényi considered the weighted quasi-arithmetic means [32]. A weighted quasi-arithmetic mean is induced by a strictly monotone and continuous function g as follows:

M_g(x_1, ..., x_n; w_1, ..., w_n) := g^{-1}( Σ_{i=1}^n w_i g(x_i) ),

where the x_i's and the w_i's are positive (the weights are normalized so that Σ_{i=1}^n w_i = 1). For example, the power means P_α(a, b) = ((a^α + b^α)/2)^{1/α} introduced earlier are quasi-arithmetic means for the generator g^P_α(u) := u^α: P_α(a, b) = M_{g^P_α}(a, b; 1/2, 1/2). Rényi proved that among the class of weighted quasi-arithmetic means, only the means induced by the generators g_α(u) := 2^{(α−1)u} for α > 0 and α ≠ 1 yield a proper generalization of Shannon entropy, called nowadays the Rényi α-entropy. The Rényi α-mean is

M^R_α(x_1, ..., x_n; w_1, ..., w_n) := g_α^{-1}( Σ_{i=1}^n w_i g_α(x_i) ) = (1/(α−1)) log_2( Σ_{i=1}^n w_i 2^{(α−1) x_i} ).

The Rényi α-means M^R_α are not power means: they are not homogeneous means. The Rényi α-entropy is defined by:

h_α[p] := (1/(1−α)) log( ∫_X p(x)^α dµ(x) ).

In the limit case α → 1, the Rényi α-entropy converges to the Shannon entropy: lim_{α→1} h_α[p] = h[p]. In the discrete case (i.e., counting measure µ on a finite alphabet X), we can further define h_0[p] = log |X| for α = 0 (also called max-entropy or Hartley entropy). The Rényi +∞-entropy h_{+∞}[p] = −log max_{x∈X} p(x) is also called the min-entropy since the sequence h_α is non-increasing with respect to increasing α.
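The sketch below (our own illustration, using natural logarithms) computes the discrete Rényi α-entropy and checks numerically that it tends to the Shannon entropy as α → 1 and that it is non-increasing in α:

    import numpy as np

    def renyi_entropy(p, alpha):
        # Discrete Renyi alpha-entropy (natural log); alpha = 1 is the Shannon limit.
        p = np.asarray(p, dtype=float)
        if np.isclose(alpha, 1.0):
            return -np.sum(p * np.log(p))
        return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

    p = np.array([0.5, 0.25, 0.125, 0.125])
    print(renyi_entropy(p, 0.999), renyi_entropy(p, 1.0), renyi_entropy(p, 1.001))
    alphas = [0.25, 0.5, 1.0, 2.0, 10.0, 100.0]
    values = [renyi_entropy(p, a) for a in alphas]
    assert all(v1 >= v2 - 1e-12 for v1, v2 in zip(values, values[1:]))  # non-increasing in alpha
    print(values[-1], -np.log(p.max()))   # large alpha approaches the min-entropy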
Similarly, Rényi obtained the α-divergences for α > 0 and α ≠ 1 (originally called the information gain of order α):

I_α[p : q] := (1/(α−1)) log( ∫_X p(x)^α q(x)^{1−α} dµ(x) ),

also denoted D^R_α[p : q]. Sibson [54] then defined the information radius R_α(P) of order α of a weighted set P = {(w_1, p_1), ..., (w_n, p_n)} of densities p_i's as the following minimization problem:

R_α(P) := min_{c∈D} R_α(P, c),

where

R_α(P, c) := M^R_α( I_α[p_1 : c], ..., I_α[p_n : c]; w_1, ..., w_n )

is the Rényi α-mean of the α-divergences of the p_i's to the center c. Expanding the Rényi mean, R_α(P, c) can be expressed using the function LSE(a_1, ..., a_n) := log( Σ_{i=1}^n e^{a_i} ), the log-sum-exp (convex) function [10,49]. Sibson [54] also considered the limit case α → ∞ when defining the information radius. Sibson reported the following theorem in his information radius study [54]:

Theorem 1 (Theorem 2.2 and Corollary 2.3 of [54]). The optimal density c*_α = arg min_{c∈D} R_α(P, c) is unique, and we have:

• for α ∈ (0, 1) ∪ (1, ∞): c*_α(x) = (Σ_{i=1}^n w_i p_i(x)^α)^{1/α} / Z_α(P), with Z_α(P) := ∫_X (Σ_{i=1}^n w_i p_i(t)^α)^{1/α} dµ(t), and R_α(P) = (α/(α−1)) log Z_α(P);
• for α = 1: c*_1 = p̄ = Σ_{i=1}^n w_i p_i, and R_1(P) is the Jensen-Shannon diversity index;
• for α = ∞: c*_∞(x) = max_i p_i(x) / ∫_X max_i p_i(t) dµ(t), and R_∞(P) = log( ∫_X max_i p_i(x) dµ(x) ).

Observe that R_∞(P) does not depend on the (positive) weights. The proof follows from the following decomposition of the information radius:

Proposition 1. We have: R_α(P, c) = R_α(P) + I_α[c*_α : c].

Notice that 2^{(α−1) I_α[p:q]} = ∫_X p(x)^α q(x)^{1−α} dµ(x) is the Bhattacharyya α-coefficient [45] (also called the Chernoff α-coefficient [40,41]):

ρ_α[p : q] := ∫_X p(x)^α q(x)^{1−α} dµ(x).

Thus we have R_α(P, c) = (1/(α−1)) log( Σ_{i=1}^n w_i ρ_α[p_i : c] ). Notice that c*_∞ is the upper envelope of the densities p_i(x)'s normalized to be a density. Provided that the densities p_i's intersect pairwise in at most s locations (i.e., |{x : p_i(x) = p_j(x)}| ≤ s for i ≠ j), we can compute efficiently this upper envelope using an output-sensitive algorithm [50] of computational geometry.
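Using the closed-form expressions recalled in Theorem 1, the following Python sketch (our own illustration in the discrete setting, with natural logarithms; the function names are ours) computes the information radius of order α of a weighted set of discrete densities and cross-checks it against a direct numerical minimization over the simplex:

    import numpy as np
    from scipy.optimize import minimize

    def renyi_div(p, q, alpha):
        # Renyi alpha-divergence between discrete densities (natural log).
        return np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1)

    def info_radius_closed_form(weights, densities, alpha):
        # Sibson's closed form: c*_alpha is the normalized alpha-power mixture.
        m_alpha = (weights[:, None] * densities**alpha).sum(axis=0) ** (1 / alpha)
        Z = m_alpha.sum()
        return (alpha / (alpha - 1)) * np.log(Z), m_alpha / Z

    def info_radius_objective(weights, densities, alpha, c):
        # Renyi-mean aggregation of the Renyi divergences to the candidate center c.
        divs = np.array([renyi_div(p, c, alpha) for p in densities])
        return np.log(np.sum(weights * np.exp((alpha - 1) * divs))) / (alpha - 1)

    rng = np.random.default_rng(2)
    w = rng.dirichlet(np.ones(3))
    P = rng.dirichlet(np.ones(6), size=3)
    alpha = 2.0
    R_closed, c_star = info_radius_closed_form(w, P, alpha)
    res = minimize(lambda u: info_radius_objective(w, P, alpha, np.exp(u) / np.exp(u).sum()),
                   x0=np.zeros(6), method="Nelder-Mead",
                   options={"maxiter": 50000, "fatol": 1e-13})
    print(R_closed, res.fun)   # the two values should approximately agree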
In general, the optimal density c*_α = arg min_{c∈D} R_α(P, c) yielding the information radius R_α(P) can be interpreted as a generalized centroid (extending the notion of Fréchet means [27]). When all the densities p_i's belong to a same exponential family [7] with cumulant function F (i.e., p_i = p_{θ_i}), let V := {(w_1, θ_1), ..., (w_n, θ_n)} be the parameter set corresponding to P. Define

R_F(V, θ) := Σ_{i=1}^n w_i B_F(θ_i : θ),

where B_F denotes the Bregman divergence induced by F. Then we have the equivalent decomposition of Proposition 1:

R_F(V, θ) = R_F(V, θ*) + B_F(θ* : θ),

with θ* = θ̄ := Σ_{i=1}^n w_i θ_i. (This decomposition is used to prove Proposition 1 of [6].) The quantity R_F(V) := R_F(V, θ*) was termed the Bregman information [6,47]; R_F(V) could also be called the Bregman information radius, in reference to Sibson. Sibson proved that the information radii of any order are all upper bounded (Theorem 2.8 and Theorem 2.9 of [54]) as follows: R_∞(P) ≤ log_2 n < ∞.
We interpret Sibson's upper bound as follows:

Proposition 2 (Information radius upper bound). The information radius of order α of a weighted set of distributions is upper bounded by the discrete Rényi entropy of order 1/α of the weight distribution:

R_α(P) ≤ h_{1/α}(w) := (1/(1 − 1/α)) log( Σ_{i=1}^n w_i^{1/α} ).
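A quick numerical check of this bound (our own sketch, reusing the closed form of Theorem 1 for discrete densities and natural logarithms):

    import numpy as np

    rng = np.random.default_rng(3)
    w = rng.dirichlet(np.ones(4))
    P = rng.dirichlet(np.ones(8), size=4)
    alpha = 3.0
    # Closed-form information radius (Theorem 1) and the Renyi entropy of order 1/alpha of w.
    Z = ((w[:, None] * P**alpha).sum(axis=0) ** (1 / alpha)).sum()
    R_alpha = (alpha / (alpha - 1)) * np.log(Z)
    h_weights = np.log(np.sum(w ** (1 / alpha))) / (1 - 1 / alpha)
    print(R_alpha, "<=", h_weights)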

JS-symmetrization based on generalized information radius
Let us give the following definitions generalizing the information radius (i.e., a Jensen-Shannon symmetrization of the distance when |P| = 2) and the ordinary Jensen-Shannon divergence: given a weighted mean M and a statistical distance D, define the generalized information radius and its corresponding generalized centroid by

R_{M,D}(P) := min_{c∈D} M( D[p_1 : c], ..., D[p_n : c]; w_1, ..., w_n ),
c_{M,D}(P) := arg min_{c∈D} M( D[p_1 : c], ..., D[p_n : c]; w_1, ..., w_n ).

When M = A, we recover the notion of Fréchet mean [27]. Notice that although the minimum R_{M,D}(P) is unique, there may potentially exist several generalized centroids c_{M,D}(P) depending on (M, D).
The generalized information radius can be interpreted as a diversity index or an n-point distance. When n = 2, we get the following (2-point) distances, which can be considered as generalizations of the Jensen-Shannon divergence, or Jensen-Shannon symmetrizations of D:

D^{vJS}_{M,D}[p : q] := min_{c∈D} M( D[p : c], D[q : c] ).
We recover Sibson's information radius R_α[p : q] induced by two densities p and q from Definition 4 as the M^R_α-vJS symmetrization of the Rényi divergence D^R_α. Notice that we may skew these generalized JSDs by taking a weighted mean M_β instead of M for β ∈ (0, 1), yielding the general definition:

D^{vJS}_{M_β, D}[p : q] := min_{c∈D} M_β( D[p : c], D[q : c] ).

Notice that this definition is implicit and can be made explicit when the centroid c*(p, q) is unique:

D^{vJS}_{M_β, D}[p : q] = M_β( D[p : c*(p, q)], D[q : c*(p, q)] ).

In particular, when D = D_KL, the KLD, we obtain generalized skewed Jensen-Shannon divergences:

Definition 6 (Skewed M_β-vJS divergence). Let M_β be a weighted mean for β ∈ (0, 1). Then the M_β-vJS divergence is defined by the variational formula:

D^{vJS}_{M_β}[p : q] := min_{c∈D} M_β( D_KL[p : c], D_KL[q : c] ).

Amari [2] obtained the (A, D_α)-information radius and its corresponding unique centroid for D_α, the α-divergence of information geometry [3].
It is interesting to study the link between the (M_β, D)-variational Jensen-Shannon symmetrization of D and the (M'_α, N'_β)-JS symmetrization of D of [43], and in particular the link between M_β, the mean used for averaging in the minimization, and M'_α, the mean used for generating abstract mixtures. More generally, Brekelmans et al. [12] considered the α-divergences extended to positive measures and proved that the optimal density c*_β minimizing the corresponding weighted average of divergences is a density of a likelihood ratio q-exponential family (the generalized centroid), and the corresponding information radius is the variational JS symmetrization. The q-divergence D_q between two densities of a q-exponential family amounts to a Bregman divergence [3]. Thus D^{vJS}_q for M = A is a generalized information radius which amounts to a Bregman information.
For the case α = ∞ in Sibson's information radius, we find that the information radius is related to the total variation:

Proposition 3 (Lemma 2.4 [54]). R_∞[p : q] = log( 1 + D_TV[p, q] ), where D_TV[p, q] := (1/2) ∫_X |p(x) − q(x)| dµ(x) denotes the total variation distance.

Notice that when M = M_g is a quasi-arithmetic mean, we may consider the divergence D_g[p : q] := g^{-1}(D[p : q]), so that for a strictly increasing generator g the centroid of the (M_g, D_g)-JS symmetrization coincides with the centroid of the ordinary arithmetic JS symmetrization of D:

arg min_{c∈D} M_g( D_g[p : c], D_g[q : c] ) = arg min_{c∈D} (1/2)( D[p : c] + D[q : c] ).
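The relation of Proposition 3 is easy to check numerically in the discrete case; the sketch below (ours) compares log ∫ max(p, q) dµ with log(1 + D_TV[p, q]):

    import numpy as np

    rng = np.random.default_rng(4)
    p = rng.dirichlet(np.ones(12))
    q = rng.dirichlet(np.ones(12))
    R_inf = np.log(np.maximum(p, q).sum())   # information radius of order infinity
    tv = 0.5 * np.abs(p - q).sum()           # total variation distance
    print(R_inf, np.log(1.0 + tv))           # the two quantities coincide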

Relative information radius and relative Jensen-Shannon divergences

Relative information radius
In this section, instead of considering the full space of densities D on (X, F, µ) for performing the optimization of the information radius, we consider a subfamily of (parametric) densities R ⊂ D. Then we define the R-relative Jensen-Shannon divergence (R-JSD for short) as

D^{vJS}_R[p : q] := min_{c∈R} (1/2)( D_KL[p : c] + D_KL[q : c] ),

and, more generally, the (M, R)-relative JS symmetrization of a distance D as D^{vJS}_{M,R}[p : q] := min_{c∈R} M( D[p : c], D[q : c] ). In particular, Sibson [54] considered the normal information radius, taking for R the family of multivariate normal densities {N(µ, Σ)}. We obtain relative Jensen-Shannon divergences when D = D_KL. Grosse et al. [31] considered geometric and moment average paths for annealing. They proved that when p_1 = p_{θ_1} and p_2 = p_{θ_2} belong to an exponential family [7] E_F with cumulant function F, the geometric path density minimizes the weighted reverse KLD within E_F,

p_{A_β(θ_1, θ_2)} = arg min_{c∈E_F} (1 − β) D*_KL[p_{θ_1} : c] + β D*_KL[p_{θ_2} : c],    (48)

and the moment-averaged density p_{η̄} with η̄ = A_β(η_1, η_2) minimizes the weighted forward KLD within E_F,

p_{η̄} = arg min_{c∈E_F} (1 − β) D_KL[p_{θ_1} : c] + β D_KL[p_{θ_2} : c]    (49)

(this is not an arithmetic mixture but an exponential family density whose moment parameter is a mixture of the moment parameters). The corresponding minima can be interpreted as a relative skewed Jensen-Shannon symmetrization of the reverse KLD D*_KL (Eq. 48) and a relative skewed Jensen-Shannon divergence (Eq. 49), where A_β(a, b) := (1 − β)a + βb is the weighted arithmetic mean for β ∈ (0, 1). Notice that when p = q, we have D^{vJS}_{M,R}[p : p] = min_{c∈R} D[p : c], which is the information projection [42] with respect to D of the density p onto the submanifold R. Thus when p ∉ R, we have D^{vJS}_{M,R}[p : p] > 0, i.e., the relative JSDs are not proper divergences since a proper divergence ensures that D[p : q] ≥ 0 with equality iff p = q.
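To illustrate the moment-average result recalled above (Eq. 49) in the simplest setting, here is a small Python sketch (our own, for the univariate Gaussian family) that minimizes the weighted forward KLD over Gaussian candidates and recovers the moment-matched Gaussian:

    import numpy as np
    from scipy.optimize import minimize

    def kl_gauss(m1, v1, m2, v2):
        # KL divergence between univariate normals N(m1, v1) and N(m2, v2).
        return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2)**2) / v2 - 1.0)

    m1, v1, m2, v2, beta = -1.0, 0.5, 2.0, 2.0, 0.3

    def objective(x):
        m, logv = x
        return ((1 - beta) * kl_gauss(m1, v1, m, np.exp(logv))
                + beta * kl_gauss(m2, v2, m, np.exp(logv)))

    res = minimize(objective, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
    m_star, v_star = res.x[0], np.exp(res.x[1])
    # Moment averaging: match E[x] and E[x^2] (the sufficient statistics of the Gaussian family).
    m_avg = (1 - beta) * m1 + beta * m2
    v_avg = (1 - beta) * (v1 + m1**2) + beta * (v2 + m2**2) - m_avg**2
    print((m_star, v_star), (m_avg, v_avg))   # the numerical minimizer matches the moment average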

Relative Jensen-Shannon divergences: Clustering and quantizing densities
Let D_KL[p : q_θ] be the KLD between an arbitrary density p and a density q_θ of an exponential family. The density q_θ is expressed as q_θ(x) = exp(θ^⊤ t(x) − F(θ)), where t(x) denotes the sufficient statistics and F(θ) the cumulant function [7]. Assume we know both (i) the moment E_p[t(x)] = m and (ii) the Shannon entropy h(p) of p. Then we can express the KLD in a "semi-closed" form:

Proposition 4. Let q_θ belong to an exponential family and p be a density with m = E_p[t(x)]. Then the Kullback-Leibler divergence is expressed as:

D_KL[p : q_θ] = F(θ) − m^⊤ θ − h(p).

Example 1. For example, when q_θ is a univariate Gaussian density N(µ, σ²) with t(x) = (x, x²), we have

D_KL[p : q_{µ,σ²}] = (1/2) log(2πσ²) + (E_p[x²] − 2µ E_p[x] + µ²)/(2σ²) − h(p).

The formula of Proposition 4 is said to be semi-closed because it relies on knowing the entropy h(p) and the moment m = E_p[t(x)], which may not be available in closed form. Nevertheless, since the term h(p) does not depend on θ, we can decide whether D_KL[p : q_{θ_1}] ≥ D_KL[p : q_{θ_2}] or not by checking whether F(θ_1) − F(θ_2) − m^⊤(θ_1 − θ_2) ≥ 0 or not. This shall be useful for clustering densities with respect to centroids in §4.2.
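As a sanity check (our own sketch, using a Gaussian p so that h(p) and the moments are known exactly), the semi-closed formula of Proposition 4 can be compared against the standard closed-form Gaussian KLD:

    import numpy as np

    def kl_gauss(m1, v1, m2, v2):
        # Closed-form KL between univariate normals N(m1, v1) and N(m2, v2).
        return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2)**2) / v2 - 1.0)

    # p = N(1, 2) plays the role of the arbitrary density; q = N(0.5, 1.5) is in the Gaussian family.
    mp, vp = 1.0, 2.0
    mq, vq = 0.5, 1.5
    h_p = 0.5 * np.log(2 * np.pi * np.e * vp)     # differential entropy of p
    Ex, Ex2 = mp, vp + mp**2                      # moments E_p[x], E_p[x^2]
    semi_closed = 0.5 * np.log(2 * np.pi * vq) + (Ex2 - 2 * mq * Ex + mq**2) / (2 * vq) - h_p
    print(semi_closed, kl_gauss(mp, vp, mq, vq))  # both values agree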
To find the best density q_θ approximating p by minimizing min_θ D_KL[p : q_θ], we solve ∇F(θ) = η = m, and therefore θ = ∇F*(m) = (∇F)^{-1}(m), where F*(η) = E_{q_η}[log q_η(x)] denotes the Legendre-Fenchel convex conjugate of F [7]. In particular, when p = Σ_i w_i p_{θ_i} is a mixture of densities of the exponential family, we have m = Σ_i w_i η_i thanks to the linearity of the expectation; setting the gradient with respect to θ to zero yields ∇F(θ) = η = Σ_i w_i η_i, so the best density of the exponential family simplifying p is q_{θ*} with θ* = ∇F*(Σ_i w_i η_i). This yields another proof without using the Pythagorean theorem [51,53].
In summary, let p = Σ_{i=1}^n w_i p_{θ_i} be a mixture with components belonging to an exponential family with cumulant function F. Then θ* = arg min_θ D_KL[p : q_θ] is ∇F*(Σ_{i=1}^n w_i η_i), where the η_i = ∇F(θ_i) are the moment parameters of the mixture components.
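For instance, for the univariate Gaussian family the moment parameters are η = (E[x], E[x²]), so the best single Gaussian simplifying a Gaussian mixture is obtained by averaging these moments. A minimal Python sketch of this moment matching (our own illustration, assuming the univariate Gaussian family; the function name is ours):

    import numpy as np

    def mixture_moment_matching(weights, means, variances):
        # Moment parameters of each Gaussian component: eta_i = (E[x], E[x^2]).
        etas = np.stack([means, variances + means**2], axis=1)
        eta_bar = np.average(etas, axis=0, weights=weights)   # eta* = sum_i w_i eta_i
        m_star = eta_bar[0]
        v_star = eta_bar[1] - eta_bar[0]**2                   # back to (mean, variance)
        return m_star, v_star

    w = np.array([0.4, 0.6])
    mu = np.array([-2.0, 1.0])
    var = np.array([1.0, 0.25])
    print(mixture_moment_matching(w, mu, var))   # KL-closest Gaussian to the mixture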

Consider the following two problems:
Problem 1 (Density clustering). Given a set of n weighted densities (w_1, p_1), ..., (w_n, p_n), partition them into k clusters C_1, ..., C_k in order to minimize the k-centroid objective function with respect to a statistical divergence D: Σ_{i=1}^n w_i min_{l∈{1,...,k}} D[p_i : c_l], where c_l denotes the centroid of cluster C_l for l ∈ {1, ..., k}.
For example, when all densities p i 's are isotropic Gaussians, we recover the k-means objective function [35].
Problem 2 (Mixture component quantization). Given a statistical mixture m(x) = Σ_{i=1}^n w_i p_i(x), quantize the mixture components into k densities q_1, ..., q_k in order to minimize Σ_{i=1}^n w_i min_{l∈{1,...,k}} D[p_i : q_l]. Notice that in Problem 1, the input densities p_i's may themselves be mixtures, i.e., p_i(x) = Σ_{j=1}^{n_i} w_{i,j} p_{i,j}(x). Using the relative information radius, we can cluster a set of distributions (potentially mixtures) into an exponential family mixture, or quantize an exponential family mixture. Indeed, we can implement an extension of k-means [35] with k centers q_{θ_1}, ..., q_{θ_k}: to assign a density p_i to a cluster C_j (with center q_{θ_j}), we only need to perform basic comparison tests D_KL[p_i : q_{θ_l}] ≥ D_KL[p_i : q_{θ_j}], which by Proposition 4 amount to checking the signs of F(θ_l) − F(θ_j) − m_i^⊤(θ_l − θ_j) with m_i = E_{p_i}[t(x)]. Provided the cumulant function F of the exponential family is available in closed form, we do not need formulas for the entropies h(p_i).
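Below is a compact Python sketch of this assignment step (our own illustration, assuming the univariate Gaussian family with natural parameters θ = (µ/σ², −1/(2σ²)) and cumulant F(θ) = −θ_1²/(4θ_2) + (1/2) log(−π/θ_2)): comparing F(θ_l) − m_i^⊤ θ_l across centers suffices, so the unknown entropies h(p_i) never enter.

    import numpy as np

    def natural_params(mu, var):
        # Natural parameters of N(mu, var): theta = (mu/var, -1/(2*var)).
        return np.array([mu / var, -1.0 / (2.0 * var)])

    def cumulant(theta):
        # Cumulant function F(theta) of the univariate Gaussian family.
        return -theta[0]**2 / (4.0 * theta[1]) + 0.5 * np.log(-np.pi / theta[1])

    def assign(moments, centers):
        # moments[i] = (E_p[x], E_p[x^2]); centers = list of natural parameters theta_l.
        # argmin_l D_KL[p_i : q_{theta_l}] = argmin_l F(theta_l) - moments[i] . theta_l
        scores = np.array([[cumulant(th) - m @ th for th in centers] for m in moments])
        return scores.argmin(axis=1)

    # Two cluster centers and three input densities summarized by their first two moments.
    centers = [natural_params(0.0, 1.0), natural_params(5.0, 2.0)]
    moments = np.array([[0.2, 1.1], [4.8, 25.0], [5.2, 29.0]])
    print(assign(moments, centers))   # e.g., [0 1 1]: entropies of the p_i's are never needed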

Conclusion
To summarize, the ordinary Jensen-Shannon divergence has been defined in three equivalent ways in the literature:

JS[p : q] := min_{c∈D} (1/2)( D_KL[p : c] + D_KL[q : c] ),    (58)
JS[p : q] = (1/2)( D_KL[p : (p+q)/2] + D_KL[q : (p+q)/2] ),    (59)
JS[p : q] = h[(p+q)/2] − (h[p] + h[q])/2.    (60)

The variational JSD of Eq. 58 was studied by Sibson [54] in 1969 within the wider scope of the information radius: Sibson relied on the Rényi α-divergences (relative Rényi α-entropies [24]) and recovered the ordinary Jensen-Shannon divergence as a particular case of the information radius when α = 1. The JSD of Eq. 59 was investigated by Lin [34] in 1991, together with its connection to the JSD of Eq. 60.
Generalizations of the JSD based on Eq. 59 were proposed in [43] using a generic mean instead of the arithmetic mean. A generalization of the bridge between the JSD of Eq. 59 and the JSD of Eq. 60 was studied using a skewing vector in [44]. The JSD is an f-divergence [18,44], but the Sibson-M Jensen-Shannon symmetrization of a distance does not belong to the class of f-divergences in general. The variational JSD definition of Eq. 58 is implicit, while the definitions of Eq. 59 and Eq. 60 are explicit because the unique optimal centroid c* = (p+q)/2 has been plugged into the objective function minimized in Eq. 58. We also considered relative variational JS symmetrizations when the centroid has to belong to a prescribed family of densities. For the case of an exponential family, we showed how to compute the relative centroid in closed form, extending the work of Sibson who considered the relative normal centroid in the relative normal information radius.
In a similar vein, Chen et al. [16] considered the following minimax symmetrization of the scalar Bregman divergence [11]:

B^minmax_f(p, q) := min_c max( B_f(p : c), B_f(q : c) ),

where B_f denotes the scalar Bregman divergence induced by a strictly convex and smooth function f:

B_f(a : b) := f(a) − f(b) − (a − b) f'(b).

They proved that B^minmax_f(p, q) yields a metric when 3(log f'')'' ≥ ((log f'')')², extended the definition to the vector case, and conjectured that the square-root metrization still holds in the multivariate case. In a sense, this definition geometrically highlights the notion of radius, since the minimax optimization amounts to finding a smallest enclosing ball [5] of the source distributions: the circumcenter, also called the Chebyshev center [15], then plays the role of mid-distribution instead of the centroid used for the information radius. Besides providing another viewpoint, variational definitions of divergences are proven useful in practice (e.g., for estimation). For example, a variational definition of the Rényi divergence generalizing the Donsker-Varadhan variational formula of the KLD is given in [9], where it is used to estimate Rényi divergences.