On a generalization of the Jensen-Shannon divergence

The Jensen-Shannon divergence is a renowned bounded symmetrization of the Kullback-Leibler divergence which does not require the probability densities to have matching supports. In this paper, we introduce a vector-skew generalization of the scalar $\alpha$-Jensen-Bregman divergences and derive from it the vector-skew $\alpha$-Jensen-Shannon divergences. We study the properties of these novel divergences and show how to build parametric families of symmetric Jensen-Shannon-type divergences. Finally, we report an iterative algorithm to numerically compute the Jensen-Shannon-type centroids for a set of probability densities belonging to a mixture family: this includes the case of the Jensen-Shannon centroid of a set of categorical distributions or normalized histograms.


Introduction
Let $(\mathcal{X}, \mathcal{F}, \mu)$ be a measure space [5] where $\mathcal{X}$ denotes the sample space, $\mathcal{F}$ the $\sigma$-algebra of measurable events, and $\mu$ a positive measure. For example, consider the measure space defined by the Lebesgue measure $\mu_L$ with the Borel $\sigma$-algebra $\mathcal{B}(\mathbb{R}^d)$ for $\mathcal{X} = \mathbb{R}^d$, or the measure space defined by the counting measure $\mu_c$ with the power set $\sigma$-algebra $2^{\mathcal{X}}$ on a finite alphabet $\mathcal{X}$. Denote by $L^1(\mathcal{X}, \mathcal{F}, \mu)$ the Lebesgue space of measurable functions, by $\mathcal{P}_1$ the subspace of positive integrable functions $f$ such that $\int_{\mathcal{X}} f(x)\,\mathrm{d}\mu(x) = 1$ and $f(x) > 0$ for all $x \in \mathcal{X}$, and by $\bar{\mathcal{P}}_1$ the subspace of non-negative integrable functions $f$ such that $\int_{\mathcal{X}} f(x)\,\mathrm{d}\mu(x) = 1$ and $f(x) \geq 0$ for all $x \in \mathcal{X}$.
The Kullback-Leibler Divergence (KLD) $\mathrm{KL} : \mathcal{P}_1 \times \mathcal{P}_1 \to [0, \infty]$ is an oriented statistical distance (commonly called the relative entropy in information theory [6]) defined between two densities $p$ and $q$ (i.e., the Radon-Nikodym densities of $\mu$-absolutely continuous probability measures $P$ and $Q$) by
$$\mathrm{KL}(p : q) := \int p \log \frac{p}{q}\, \mathrm{d}\mu.$$
Although $\mathrm{KL}(p : q) \geq 0$ with equality iff. $p = q$ $\mu$-a.e. (Gibbs' inequality [6]), the KLD may diverge to infinity depending on the underlying densities. Since the KLD is asymmetric, several symmetrizations [22] have been proposed in the literature, including the Jeffreys divergence [21] (JD):
$$J(p, q) := \mathrm{KL}(p : q) + \mathrm{KL}(q : p) = \int (p - q) \log \frac{p}{q}\, \mathrm{d}\mu = J(q, p),$$
and the Jensen-Shannon Divergence [19] (JSD):
$$\mathrm{JS}(p, q) := \frac{1}{2}\left(\mathrm{KL}\!\left(p : \frac{p+q}{2}\right) + \mathrm{KL}\!\left(q : \frac{p+q}{2}\right)\right) = \frac{1}{2}\int \left(p \log \frac{2p}{p+q} + q \log \frac{2q}{p+q}\right) \mathrm{d}\mu = \mathrm{JS}(q, p).$$
The Jensen-Shannon divergence can be interpreted as the total KL divergence to the average distribution $\frac{p+q}{2}$. A nice feature of the Jensen-Shannon divergence is that it can be applied to densities with arbitrary supports (i.e., $p, q \in \bar{\mathcal{P}}_1$ with the conventions that $0 \log 0 = 0$ and $\log \frac{0}{0} = 0$), and moreover the JSD is always upper bounded by $\log 2$. Let $\mathcal{X}_p = \mathrm{supp}(p)$ and $\mathcal{X}_q = \mathrm{supp}(q)$ denote the supports of the densities $p$ and $q$, respectively, where $\mathrm{supp}(p) := \{x \in \mathcal{X} : p(x) > 0\}$. The JSD saturates to $\log 2$ whenever the supports $\mathcal{X}_p$ and $\mathcal{X}_q$ are disjoint. The square root of the JSD is a metric [9] satisfying the triangle inequality, but the square root of the JD is not a metric (nor is any positive power of the Jeffreys divergence, see [14]).
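For concreteness, the following small numerical sketch (illustrative only: the toy histograms and helper names are ours) contrasts the unbounded KLD with the bounded JSD on two categorical distributions with disjoint supports:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence sum p log(p/q); returns inf if supp(p) is not included in supp(q)."""
    mask = p > 0
    if np.any(q[mask] == 0):
        return np.inf
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jeffreys(p, q):
    return kl(p, q) + kl(q, p)

def jensen_shannon(p, q):
    m = 0.5 * (p + q)            # average distribution (p+q)/2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.5, 0.5, 0.0, 0.0])   # supported on bins {0, 1}
q = np.array([0.0, 0.0, 0.5, 0.5])   # supported on bins {2, 3}
print(kl(p, q))              # inf: the KLD diverges on mismatched supports
print(jensen_shannon(p, q))  # log(2) ~ 0.6931: the JSD saturates but stays bounded
```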

In general, skewing divergences (e.g., using the skew divergence $K_\alpha$ instead of the KLD) has been shown experimentally to perform better in applications such as some natural language processing (NLP) tasks [18].
The $\alpha$-Jensen-Shannon divergences are Csiszár $f$-divergences [7,1,32]. An $f$-divergence is defined for a convex function $f$, strictly convex at 1 and satisfying $f(1) = 0$, as:
$$I_f(p : q) := \int p\, f\!\left(\frac{q}{p}\right) \mathrm{d}\mu \geq f(1) = 0.$$
We can always symmetrize $f$-divergences by taking the conjugate convex function $f^*(x) = x f\!\left(\frac{1}{x}\right)$ (related to the perspective function): $I_{f+f^*}(p, q)$ is a symmetric divergence. The $f$-divergences are convex statistical distances which are provably the only separable invariant divergences in information geometry [2], except for binary alphabets $\mathcal{X}$ (see [12]).
Jeffreys divergence is a f -divergence for the generator f (x) = (x − 1) log x, and the α-Jensen-Shannon divergences are f -divergences for the generator family The main contributions of this paper are summarized as follows: • First, we generalize the Jensen-Bregman divergence by skewing a weighted separable Jensen-Bregman divergence with a k-dimensional vector α ∈ [0, 1] k in §2. This yields a generalization of the symmetric skew α-Jensen-Shannon divergences to a vector-skew parameter. This extension retains the key properties to be upper bounded and to apply to densities with potentially different support. The proposed generalization allows one to grasp a better understanding of the "mechanism" of the Jensen-Shannon divergence itself too. We also show how to obtain directly the weighted vector-skew Jensen-Shannon divergence from the decomposition of the KLD as the difference of the cross-entropy minus the entropy (i.e., KLD as the relative entropy).
• Second, we show how to build families of symmetric Jensen-Shannon-type divergences which can be controlled by a vector of parameters in §2.3, generalizing the work of [20] from scalar skewing to vector skewing. This may prove useful in applications by providing additional tuning parameters (which can be set, for example, by using cross-validation techniques).
• Third, we consider the calculation of the Jensen-Shannon centroids in §3 for densities belonging to mixture families. Mixture families include the family of categorical distributions and the family of statistical mixtures sharing the same prescribed components. Mixture families are well-studied manifolds in information geometry [2]. We show how to compute the Jensen-Shannon centroid using a concave-convex numerical iterative optimization procedure [36]. Experimental results graphically compare the Jeffreys centroid with the Jensen-Shannon centroid for grey-valued image histograms.
2 Extending the Jensen-Shannon divergence

Vector-skew Jensen-Bregman divergences and Jensen diversities
Recall our notational shortcut: $(ab)_\alpha := (1 - \alpha) a + \alpha b$. For a $k$-dimensional vector $\alpha \in [0, 1]^k$, a weight vector $w$ belonging to the $(k-1)$-dimensional open simplex $\Delta_k$, and a scalar $\gamma \in (0, 1)$, let us define the following vector-skew $\alpha$-Jensen-Bregman divergence ($\alpha$-JBD) following [27]:
$$\mathrm{JB}_F^{\alpha, \gamma, w}(p : q) := \sum_{i=1}^k w_i\, B_F\big((pq)_{\alpha_i} : (pq)_\gamma\big) \geq 0,$$
where $B_F$ is the Bregman divergence [4] induced by a strictly convex and smooth generator $F$:
$$B_F(p : q) := F(p) - F(q) - \langle p - q, \nabla F(q) \rangle,$$
with $\langle \cdot, \cdot \rangle$ denoting the Euclidean inner product $\langle x, y \rangle = x^\top y$ (dot product). Expanding the Bregman divergence formulas in the expression of the $\alpha$-JBD, and using the fact that $\sum_i w_i (pq)_{\alpha_i} = (pq)_{\bar\alpha}$ with $\bar\alpha := \sum_i w_i \alpha_i$, we get the following expression:
$$\mathrm{JB}_F^{\alpha, \gamma, w}(p : q) = \sum_{i=1}^k w_i F\big((pq)_{\alpha_i}\big) - F\big((pq)_\gamma\big) - \big\langle (pq)_{\bar\alpha} - (pq)_\gamma, \nabla F\big((pq)_\gamma\big) \big\rangle.$$
The inner product term vanishes when $(pq)_{\bar\alpha} = (pq)_\gamma$. Thus when $\gamma = \bar\alpha$ (assuming at least two distinct components in $\alpha$ so that $\gamma \in (0, 1)$), we get the simplified formula for the vector-skew $\alpha$-JBD:
$$\mathrm{JB}_F^{\alpha, w}(p : q) = \sum_{i=1}^k w_i F\big((pq)_{\alpha_i}\big) - F\big((pq)_{\bar\alpha}\big).$$
This vector-skew Jensen-Bregman divergence is always finite and amounts to a Jensen diversity [25] $J_F$ induced by Jensen's inequality gap:
$$J_F(x_1, \ldots, x_k; w_1, \ldots, w_k) := \sum_{i=1}^k w_i F(x_i) - F\!\left(\sum_{i=1}^k w_i x_i\right) \geq 0.$$
The Jensen diversity is a quantity which arises naturally as a generalization of the cluster variance (i.e., the Bregman information) when clustering with Bregman divergences, see [4,25]. In general, a $k$-point measure is called a diversity measure (for $k > 2$) while a distance/divergence is a 2-point measure.
Conversely, in 1D, we may start from Jensen's inequality for a strictly convex function $F$:
$$\sum_{i=1}^k w_i F(x_i) \geq F\!\left(\sum_{i=1}^k w_i x_i\right),$$
and choose $x_i = (pq)_{\alpha_i}$ so that $\sum_i w_i x_i = (pq)_{\bar\alpha}$, i.e., $\gamma = \sum_i w_i \alpha_i = \bar\alpha$.
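The cancellation of the inner-product term when $\gamma = \bar\alpha$ can be checked numerically; the following sketch (using the illustrative 1D generator $F(x) = x \log x - x$ and arbitrary toy values of our own choosing) verifies that the vector-skew Jensen-Bregman divergence then coincides with the Jensen diversity:

```python
import numpy as np

F     = lambda x: x * np.log(x) - x                       # strictly convex generator on (0, inf)
gradF = lambda x: np.log(x)
B_F   = lambda a, b: F(a) - F(b) - gradF(b) * (a - b)     # Bregman divergence

def skew_jbd(p, q, alphas, w, gamma):
    """Vector-skew Jensen-Bregman divergence sum_i w_i B_F((pq)_{alpha_i} : (pq)_gamma)."""
    mix = lambda lam: (1 - lam) * p + lam * q
    return sum(wi * B_F(mix(ai), mix(gamma)) for ai, wi in zip(alphas, w))

p, q   = 2.0, 5.0
alphas = np.array([0.1, 0.4, 0.7])
w      = np.array([0.2, 0.5, 0.3])
abar   = float(np.dot(w, alphas))

mix = lambda lam: (1 - lam) * p + lam * q
lhs = skew_jbd(p, q, alphas, w, gamma=abar)
rhs = sum(wi * F(mix(ai)) for ai, wi in zip(alphas, w)) - F(mix(abar))   # Jensen diversity
print(np.isclose(lhs, rhs))   # True: the inner-product terms cancel when gamma = alpha_bar
```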

Vector-skew Jensen-Shannon divergences
Let $f(x) = x \log x - x$ be a strictly convex and smooth function on $(0, \infty)$. Then the Bregman divergence induced by this univariate generator is the extended scalar Kullback-Leibler divergence: $B_f(x : y) = x \log \frac{x}{y} + y - x$.
Definition 1 (Weighted vector-skew $(\alpha, w)$-Jensen-Shannon divergence). For a vector $\alpha \in [0, 1]^k$ and a unit positive weight vector $w \in \Delta_k$, the $(\alpha, w)$-Jensen-Shannon divergence between two densities $p, q \in \bar{\mathcal{P}}_1$ is defined by:
$$\mathrm{JS}_{\alpha, w}(p : q) := \sum_{i=1}^k w_i\, \mathrm{KL}\big((pq)_{\alpha_i} : (pq)_{\bar\alpha}\big), \qquad \bar\alpha := \sum_{i=1}^k w_i \alpha_i.$$
This definition generalizes the ordinary JSD; we recover the ordinary Jensen-Shannon divergence when $\alpha = (0, 1)$ and $w = \left(\frac{1}{2}, \frac{1}{2}\right)$ (so that $\bar\alpha = \frac{1}{2}$). Introducing the skewed $(\alpha, \beta)$-Kullback-Leibler divergence $\mathrm{KL}_{\alpha, \beta}(p : q) := \mathrm{KL}\big((pq)_\alpha : (pq)_\beta\big)$, we have the following identity:
$$\mathrm{JS}_{\alpha, w}(p : q) = \sum_{i=1}^k w_i\, \mathrm{KL}_{\alpha_i, \bar\alpha}(p : q),$$
i.e., the second skewing parameter is $\beta = \bar\alpha \mathbf{1}_k$, where $\mathbf{1}_k$ is a $k$-dimensional vector of ones. Next, we show that $\mathrm{KL}_{\alpha, \beta}$ (and $\mathrm{JS}_{\alpha, w}$) are separable convex divergences:
Theorem 1 (Separable convexity). The divergence $\mathrm{KL}_{\alpha, \beta}(p : q)$ is strictly separable convex for $\alpha \neq \beta$ and $x \in \mathcal{X}_p \cap \mathcal{X}_q$.
Proof. Writing the scalar extended KLD between the mixed scalars, $\mathrm{KL}_{\alpha, \beta}(x : y) = (xy)_\alpha \log \frac{(xy)_\alpha}{(xy)_\beta} + (xy)_\beta - (xy)_\alpha$, let us calculate its second partial derivative with respect to $x$, and show that it is strictly positive:
$$\frac{\partial^2}{\partial x^2} \mathrm{KL}_{\alpha, \beta}(x : y) = \frac{(\beta - \alpha)^2\, y^2}{(xy)_\alpha\, (xy)_\beta^2} > 0,$$
for $x, y > 0$ and $\alpha \neq \beta$. Thus $\mathrm{KL}_{\alpha, \beta}$ is strictly convex in its left argument. Similarly, since $\mathrm{KL}_{\alpha, \beta}(y : x) = \mathrm{KL}_{1-\alpha, 1-\beta}(x : y)$, we deduce that $\mathrm{KL}_{\alpha, \beta}$ is strictly convex in its right argument. Therefore the divergence $\mathrm{KL}_{\alpha, \beta}$ is separable convex.
It follows that the divergence $\mathrm{JS}_{\alpha, w}(p : q)$ is strictly separable convex since it is a convex combination of weighted $\mathrm{KL}_{\alpha_i, \bar\alpha}$ divergences.
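A quick finite-difference check of the second-derivative formula used in the proof (an illustrative sketch with arbitrary toy values of our own choosing):

```python
import numpy as np

def kl_ab(x, y, a, b):
    """Scalar extended KLD between the mixed values (xy)_a and (xy)_b."""
    xa = (1 - a) * x + a * y
    xb = (1 - b) * x + b * y
    return xa * np.log(xa / xb) + xb - xa

x, y, a, b, h = 1.3, 0.7, 0.2, 0.6, 1e-4
num = (kl_ab(x + h, y, a, b) - 2 * kl_ab(x, y, a, b) + kl_ab(x - h, y, a, b)) / h**2
xa, xb = (1 - a) * x + a * y, (1 - b) * x + b * y
closed = (b - a) ** 2 * y ** 2 / (xa * xb ** 2)
print(num, closed)   # both values agree and are strictly positive since alpha != beta
```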
Another way to derive the vector-skew JSD is to decompose the KLD as the difference of the cross-entropy $h^\times$ minus the entropy $h$ (i.e., the KLD is also called the relative entropy):
$$\mathrm{KL}(p : q) = h^\times(p : q) - h(p),$$
where $h^\times(p : q) := -\int p \log q\, \mathrm{d}\mu$ and $h(p) := h^\times(p : p)$ (self cross-entropy). Since the cross-entropy is affine in its first argument and $\sum_i w_i (pq)_{\alpha_i} = (pq)_{\bar\alpha}$, we have
$$\mathrm{JS}_{\alpha, w}(p : q) = \sum_{i=1}^k w_i \left( h^\times\big((pq)_{\alpha_i} : (pq)_{\bar\alpha}\big) - h\big((pq)_{\alpha_i}\big) \right).$$
Here, the "trick" is to choose $\gamma = \bar\alpha$ in order to "convert" the cross-entropy into an entropy:
$$\sum_{i=1}^k w_i\, h^\times\big((pq)_{\alpha_i} : (pq)_{\bar\alpha}\big) = h^\times\big((pq)_{\bar\alpha} : (pq)_{\bar\alpha}\big) = h\big((pq)_{\bar\alpha}\big),$$
so that
$$\mathrm{JS}_{\alpha, w}(p : q) = h\big((pq)_{\bar\alpha}\big) - \sum_{i=1}^k w_i\, h\big((pq)_{\alpha_i}\big).$$
Moreover, if we consider the cross-entropy/entropy extended to positive densities $\tilde p$ and $\tilde q$,
$$h^\times(\tilde p : \tilde q) := \int \left( \tilde p \log \frac{1}{\tilde q} + \tilde q \right) \mathrm{d}\mu, \qquad h(\tilde p) := h^\times(\tilde p : \tilde p),$$
we get the same expression:
$$\mathrm{JS}_{\alpha, w}(\tilde p : \tilde q) = h\big((\tilde p \tilde q)_{\bar\alpha}\big) - \sum_{i=1}^k w_i\, h\big((\tilde p \tilde q)_{\alpha_i}\big).$$
Next, we shall prove that our generalization of the skew Jensen-Shannon divergence to vector-skewing is always bounded. We first start with a lemma bounding the KLD between two mixtures sharing the same components:
Lemma 1 (KLD between two $w$-mixtures). For $\alpha \in [0, 1]$ and $\beta \in (0, 1)$, we have:
$$\mathrm{KL}\big((pq)_\alpha : (pq)_\beta\big) \leq \log \frac{1}{\min\{\beta, 1 - \beta\}}.$$
Proof. Let us form a partition of the sample space $\mathcal{X}$ into two dominance regions: $\mathcal{X}_{p \geq q} := \{x \in \mathcal{X} : p(x) \geq q(x)\}$ and $\mathcal{X}_{p < q} := \{x \in \mathcal{X} : p(x) < q(x)\}$. On $\mathcal{X}_{p \geq q}$ we have $(pq)_\alpha \leq p$ and $(pq)_\beta \geq (1 - \beta) p$, so that $\frac{(pq)_\alpha}{(pq)_\beta} \leq \frac{1}{1 - \beta}$; on $\mathcal{X}_{p < q}$ we have $(pq)_\alpha \leq q$ and $(pq)_\beta \geq \beta q$, so that $\frac{(pq)_\alpha}{(pq)_\beta} \leq \frac{1}{\beta}$. Since $\int (pq)_\alpha\, \mathrm{d}\mu = 1$, the bound follows. Notice that we allow $\alpha \in \{0, 1\}$ but not $\beta$ to take the extreme values (i.e., $\beta \in (0, 1)$).
By applying Lemma 1 with $\beta = \bar\alpha \in (0, 1)$, we conclude that the vector-skew Jensen-Shannon divergence is upper bounded:
$$\mathrm{JS}_{\alpha, w}(p : q) = \sum_{i=1}^k w_i\, \mathrm{KL}\big((pq)_{\alpha_i} : (pq)_{\bar\alpha}\big) \leq \log \frac{1}{\min\{\bar\alpha, 1 - \bar\alpha\}}.$$
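The following sketch (toy categorical distributions and names are ours) computes the vector-skew JSD of Definition 1 and checks this bound, even for densities with disjoint supports:

```python
import numpy as np

def kl(p, q):
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def vector_skew_js(p, q, alphas, w):
    """JS_{alpha,w}(p:q) = sum_i w_i KL((pq)_{alpha_i} : (pq)_{alpha_bar})."""
    abar = float(np.dot(w, alphas))
    mix = lambda lam: (1 - lam) * p + lam * q
    js = sum(wi * kl(mix(ai), mix(abar)) for ai, wi in zip(alphas, w))
    return js, abar

p = np.array([0.7, 0.3, 0.0])
q = np.array([0.0, 0.0, 1.0])           # support disjoint from p
alphas = np.array([0.0, 0.5, 1.0])
w      = np.array([0.3, 0.3, 0.4])

js, abar = vector_skew_js(p, q, alphas, w)
bound = np.log(1.0 / min(abar, 1.0 - abar))
print(js <= bound)   # True: the divergence stays below log(1/min(abar, 1-abar))
```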
• We can similarly carry on the construction of such symmetric JSDs by increasing the dimensionality of the skewing vector.
3 Jensen-Shannon centroids on mixture families

Mixture families and Jensen-Shannon divergences
Consider a mixture family in information geometry [2]. That is, let us give a prescribed set of $D + 1$ linearly independent probability densities $p_0(x), \ldots, p_D(x)$ defined on the sample space $\mathcal{X}$. A mixture family $\mathcal{M}$ of order $D$ consists of all strictly convex combinations of these component densities:
$$\mathcal{M} := \left\{ m_\theta(x) := \sum_{i=1}^D \theta_i p_i(x) + \left(1 - \sum_{i=1}^D \theta_i\right) p_0(x) \ : \ \theta_i > 0, \ \sum_{i=1}^D \theta_i < 1 \right\}.$$
For example, the family of categorical distributions (sometimes called "multinoulli" distributions) is a mixture family [2] which can also be interpreted as an exponential family. The KL divergence between two densities of a mixture family $\mathcal{M}$ amounts to a Bregman divergence for the Shannon negentropy generator $F(\theta) = -h(m_\theta)$ (see [29]):
$$\mathrm{KL}(m_{\theta_1} : m_{\theta_2}) = B_F(\theta_1 : \theta_2).$$
On a mixture manifold $\mathcal{M}$, the mixture density $(1 - \alpha) m_{\theta_1} + \alpha m_{\theta_2}$ of two mixtures $m_{\theta_1}$ and $m_{\theta_2}$ of $\mathcal{M}$ also belongs to $\mathcal{M}$:
$$(m_{\theta_1} m_{\theta_2})_\alpha = m_{(\theta_1 \theta_2)_\alpha},$$
where we extend the notation $(\theta_1 \theta_2)_\alpha := (1 - \alpha) \theta_1 + \alpha \theta_2$ to vectors $\theta_1$ and $\theta_2$: $(\theta_1 \theta_2)_\alpha^i = (\theta_1^i \theta_2^i)_\alpha$. Thus the vector-skew JSD amounts to a vector-skew Jensen diversity for the Shannon negentropy convex function $F(\theta) = -h(m_\theta)$:
$$\mathrm{JS}_{\alpha, w}(m_{\theta_1} : m_{\theta_2}) = \sum_{i=1}^k w_i F\big((\theta_1 \theta_2)_{\alpha_i}\big) - F\big((\theta_1 \theta_2)_{\bar\alpha}\big).$$
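As a sanity check, the sketch below (illustrative; the toy distributions and helper names are ours) verifies on categorical distributions that the ordinary JSD equals the Jensen diversity of the natural parameters for the Shannon negentropy generator:

```python
import numpy as np

def neg_entropy(theta):
    """F(theta) = -h(m_theta) for a categorical distribution parameterized by
    its first d-1 weights theta (the last weight is 1 - sum(theta))."""
    full = np.append(theta, 1.0 - np.sum(theta))
    return float(np.sum(full * np.log(full)))

def js_direct(p, q):
    m = 0.5 * (p + q)
    return 0.5 * np.sum(p * np.log(p / m)) + 0.5 * np.sum(q * np.log(q / m))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.6, 0.1, 0.3])
t1, t2 = p[:-1], q[:-1]                      # natural parameters (drop the last coordinate)
jensen = 0.5 * (neg_entropy(t1) + neg_entropy(t2)) - neg_entropy(0.5 * (t1 + t2))
print(np.isclose(js_direct(p, q), jensen))   # True
```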

Jensen-Shannon centroids
Given a set of $n$ mixture densities $m_{\theta_1}, \ldots, m_{\theta_n}$ of $\mathcal{M}$, we seek to calculate the vector-skew Jensen-Shannon centroid (or barycenter) by minimizing the following objective function (or loss function):
$$\min_{\theta}\ E(\theta) := \sum_{i=1}^n \omega_i\, \mathrm{JS}_{\alpha, w}(m_{\theta_i} : m_\theta),$$
where $\omega \in \Delta_n$ is the weight vector of the densities (uniform weights for the centroid and non-uniform weights for a barycenter). This definition of the Jensen-Shannon centroid is a generalization of the Fréchet mean [10] to non-metric spaces. Since the divergence $\mathrm{JS}_{\alpha, w}$ is strictly separable convex, it follows that the Jensen-Shannon-type centroids are unique when they exist. Plugging the Jensen diversity expression of $\mathrm{JS}_{\alpha, w}$ on the mixture manifold into this objective, the calculation of the Jensen-Shannon centroid amounts to minimizing:
$$E(\theta) = \sum_{i=1}^n \omega_i \left( \sum_{j=1}^k w_j F\big((\theta_i \theta)_{\alpha_j}\big) - F\big((\theta_i \theta)_{\bar\alpha}\big) \right).$$
Figure 1: The Convex ConCave Procedure iteratively updates the parameter $\theta$ by aligning the support hyperplanes at $\theta$. In the limit case of convergence to $\theta^*$, the support hyperplanes at $\theta^*$ are parallel to each other. CCCP finds a local minimum.
This optimization is a Difference of Convex (DC) programming problem for which we can use the ConCave-Convex Procedure [36,23] (CCCP). Indeed, let us define the following two convex functions:
$$B(\theta) := \sum_{j=1}^k w_j \sum_{i=1}^n \omega_i\, F\big((\theta_i \theta)_{\alpha_j}\big), \qquad A(\theta) := \sum_{i=1}^n \omega_i\, F\big((\theta_i \theta)_{\bar\alpha}\big).$$
Both functions $A(\theta)$ and $B(\theta)$ are convex since $F$ is convex. Then the minimization problem to solve can be rewritten as:
$$\min_\theta\ B(\theta) - A(\theta).$$
This is a DC programming optimization problem which can be solved iteratively by initializing $\theta$ to an arbitrary value $\theta^{(0)}$ (say, the centroid of the $\theta_i$'s), and then by updating the parameter at step $t$ using the CCCP [36] as follows:
$$\theta^{(t+1)} = (\nabla B)^{-1}\big(\nabla A(\theta^{(t)})\big).$$
For the ordinary Jensen-Shannon centroid (i.e., $\alpha = (0, 1)$ and $w = (\frac{1}{2}, \frac{1}{2})$), $B(\theta)$ equals $\frac{1}{2} F(\theta)$ up to an additive constant, so that $\nabla B$ is easily inverted via $(\nabla F)^{-1}$.
Compared to a gradient descent local optimization, there is no required step size (also called "learning" rate) in CCCP.
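To illustrate the mechanics of this parameter-free update, here is a toy sketch on a simple DC objective unrelated to the centroid problem (the functions are chosen purely for illustration):

```python
# Toy CCCP run: minimize B(theta) - A(theta) with B(theta) = theta^4 / 4 and A(theta) = theta^2.
# The update theta <- (grad B)^{-1}(grad A(theta)) needs no step size; starting from theta = 1,
# the iterates converge to sqrt(2), a local minimum where grad A = grad B.
def grad_A(t):      return 2.0 * t
def inv_grad_B(g):  return g ** (1.0 / 3.0)   # since (grad B)(t) = t^3

theta = 1.0
for _ in range(50):
    theta = inv_grad_B(grad_A(theta))
print(theta)   # ~ 1.41421
```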
The CCCP converges to a local optimum $\theta^*$ where the support hyperplanes of the function graphs of $A$ and $B$ at $\theta^*$ are parallel to each other, as depicted in Figure 1. The set of stationary points is $\{\theta : \nabla A(\theta) = \nabla B(\theta)\}$. In practice, the delicate step is to invert $\nabla B$. Next, we show how to implement this algorithm for the Jensen-Shannon centroid of a set of categorical distributions (i.e., normalized histograms with all non-empty bins).

Jensen-Shannon centroids of categorical distributions
To illustrate the method, let us consider the mixture family of categorical distributions [2] over $d = D + 1$ outcomes $\{x_0, x_1, \ldots, x_D\}$:
$$m_\theta(x) = \sum_{i=1}^D \theta_i\, \delta(x - x_i) + \left(1 - \sum_{i=1}^D \theta_i\right) \delta(x - x_0),$$
where $\delta(x)$ is the Dirac distribution (i.e., $\delta(x) = 1$ for $x = 0$ and $\delta(x) = 0$ for $x \neq 0$). The Shannon negentropy is
$$F(\theta) = -h(m_\theta) = \sum_{i=1}^D \theta_i \log \theta_i + \theta_0 \log \theta_0, \qquad \theta_0 := 1 - \sum_{i=1}^D \theta_i.$$
We have the partial derivatives
$$\eta_i = \frac{\partial F}{\partial \theta_i} = \log \frac{\theta_i}{\theta_0}, \qquad i \in \{1, \ldots, D\}.$$
Inverting the gradient $\nabla F$ requires solving the equation $\nabla F(\theta) = \eta$ so that we get $\theta = (\nabla F)^{-1}(\eta)$. We find that
$$\theta_i = \big((\nabla F)^{-1}(\eta)\big)_i = \frac{e^{\eta_i}}{1 + \sum_{j=1}^D e^{\eta_j}}.$$
We have $\mathrm{JS}(p_1, p_2) = J_F(\theta_1, \theta_2)$ for $p_1 = m_{\theta_1}$ and $p_2 = m_{\theta_2}$, where
$$J_F(\theta_1, \theta_2) := \frac{F(\theta_1) + F(\theta_2)}{2} - F\!\left(\frac{\theta_1 + \theta_2}{2}\right)$$
is the Jensen divergence [23]. Thus to compute the Jensen-Shannon centroid of a set of $n$ densities $p_1, \ldots, p_n$ of a mixture family (with $p_i = m_{\theta_i}$), we need to solve the following optimization problem for a density $p = m_\theta$:
$$\min_\theta\ E(\theta) := \frac{1}{n} \sum_{i=1}^n J_F(\theta_i, \theta).$$
The CCCP algorithm for the Jensen-Shannon centroid proceeds by initializing $\theta^{(0)} = \frac{1}{n} \sum_i \theta_i$ (center of mass of the natural parameters), and iteratively updates as follows:
$$\theta^{(t+1)} = (\nabla F)^{-1}\!\left(\frac{1}{n} \sum_{i=1}^n \nabla F\!\left(\frac{\theta_i + \theta^{(t)}}{2}\right)\right).$$
We iterate until the absolute difference $|E(\theta^{(t)}) - E(\theta^{(t+1)})|$ between two successive iterates $\theta^{(t)}$ and $\theta^{(t+1)}$ goes below a prescribed threshold value. The convergence of the CCCP algorithm is linear [16] to a local minimum which is a fixed point of the equation
$$\theta^* = (\nabla F)^{-1}\!\left(\frac{1}{n} \sum_{i=1}^n \nabla F\!\left(\frac{\theta_i + \theta^*}{2}\right)\right),$$
where the right-hand side is a vector generalization of the formula for quasi-arithmetic means [25,23] obtained for the generator $H = \nabla F$. Algorithm 1 summarizes the method for approximating the Jensen-Shannon centroid of a given set of categorical distributions (given a prescribed number of iterations). In the pseudo-code, we use the notation $^{(t+1)}\theta$ instead of $\theta^{(t+1)}$ in order to highlight the conversion procedures between the natural parameters and the mixture weight parameters by using superscript notations for the coordinates.
Input: a set $\{p_1, \ldots, p_n\}$ of $n$ categorical distributions belonging to the $(d-1)$-dimensional probability simplex $\Delta_{d-1}$; $T$: the number of CCCP iterations.
Output: An approximation $^{(T)}\hat{p}$ of the Jensen-Shannon centroid $\hat{p}$.
/* Convert the categorical distributions to their natural parameters by dropping the last coordinate */
$\theta_i := (p_i^1, \ldots, p_i^{d-1})$ for $i \in \{1, \ldots, n\}$;
$^{(0)}\theta := \frac{1}{n} \sum_{i=1}^n \theta_i$;
for $t = 0, \ldots, T-1$: $\ ^{(t+1)}\theta := (\nabla F)^{-1}\!\left(\frac{1}{n} \sum_{i=1}^n \nabla F\!\left(\frac{\theta_i + {}^{(t)}\theta}{2}\right)\right)$;
/* Convert the natural parameters back to a categorical distribution */
$^{(T)}\hat{p}^j := {}^{(T)}\theta^j$ for $j < d$, and $^{(T)}\hat{p}^d := 1 - \sum_{j=1}^{d-1} {}^{(T)}\theta^j$;
return $^{(T)}\hat{p}$
Algorithm 1: The CCCP algorithm for computing the Jensen-Shannon centroid of a set of categorical distributions.
Figure 2 displays the results of the calculations of the Jeffreys centroid [21] and the Jensen-Shannon centroid for two normalized histograms obtained from grey-valued images of Lena and Barbara. Figure 3 shows the Jeffreys centroid and the Jensen-Shannon centroid for the Barbara image and its negative. Figure 4 demonstrates that the Jensen-Shannon centroid is well-defined even if the input histograms do not have coinciding supports. Notice that on the parts of the support where only one distribution is defined, the JS centroid is a scaled copy of that defined distribution.
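A compact implementation sketch of Algorithm 1 (our own code following the outline above; the variable names, toy histograms, and fixed iteration count are illustrative):

```python
import numpy as np

def to_natural(p):                  # drop the last coordinate
    return p[:-1]

def to_categorical(theta):
    return np.append(theta, 1.0 - np.sum(theta))

def grad_F(theta):                  # eta_i = log(theta_i / theta_0)
    theta0 = 1.0 - np.sum(theta)
    return np.log(theta / theta0)

def inv_grad_F(eta):                # theta_i = exp(eta_i) / (1 + sum_j exp(eta_j))
    e = np.exp(eta)
    return e / (1.0 + np.sum(e))

def js_centroid(points, iters=100):
    """points: (n, d) array of categorical distributions (rows sum to 1, all entries > 0)."""
    thetas = np.array([to_natural(p) for p in points])
    theta = thetas.mean(axis=0)                       # initialize at the center of mass
    for _ in range(iters):
        eta = np.mean([grad_F(0.5 * (ti + theta)) for ti in thetas], axis=0)
        theta = inv_grad_F(eta)                       # CCCP update
    return to_categorical(theta)

hists = np.array([[0.2, 0.5, 0.3],
                  [0.6, 0.1, 0.3],
                  [0.3, 0.3, 0.4]])
print(js_centroid(hists))   # approximate Jensen-Shannon centroid (sums to 1)
```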
• Since the vector-skew Jensen-Shannon divergence formula also holds for positive (unnormalized) densities $\tilde p$ and $\tilde q$ (yielding $\mathrm{JS}_{\alpha, w}(\tilde p : \tilde q)$), we can relax the computation of the Jensen-Shannon centroid by considering $d$ independent 1D separable minimization problems (one per bin). We then normalize the positive JS centroid to get an approximation of the probability JS centroid. This approach was also considered when dealing with the Jeffreys centroid [21]. In 1D, we have $F(\theta) = \theta \log \theta - \theta$, $F'(\theta) = \log \theta$ and $(F')^{-1}(\eta) = e^\eta$, so the per-bin CCCP update reduces to a geometric mean of the midpoint densities, as sketched below.
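A sketch of this relaxed, bin-wise scheme (our own implementation, under the assumption that every bin is positive in at least one input histogram):

```python
import numpy as np

def relaxed_js_centroid(points, iters=100):
    """points: (n, d) array of non-negative histograms (supports need not coincide)."""
    c = points.mean(axis=0)                                        # positive initialization per bin
    for _ in range(iters):
        # per-bin CCCP update: c_j = exp(mean_i log((p_ij + c_j)/2)), a geometric mean of midpoints
        c = np.exp(np.mean(np.log(0.5 * (points + c)), axis=0))
    return c / np.sum(c)                                           # project back to the simplex

hists = np.array([[0.2, 0.5, 0.3],
                  [0.6, 0.1, 0.3]])
print(relaxed_js_centroid(hists))
```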
In general, calculating the negentropy for a mixture family with continuous densities sharing the same support is not tractable because of the log-sum term of the differential entropy. However, the following remark emphasizes an extension of the mixture family of categorical distributions:

Some remarks and properties
Consider a mixture family $m_\theta(x) = \sum_{i=1}^D \theta_i p_i(x) + \theta_0 p_0(x)$ (for a parameter $\theta$ belonging to the $D$-dimensional standard simplex) of $D + 1$ linearly independent probability densities $p_0(x), \ldots, p_D(x)$ defined respectively on pairwise disjoint supports $\mathcal{X}_0, \mathcal{X}_1, \ldots, \mathcal{X}_D$. Let $\theta_0 := 1 - \sum_{i=1}^D \theta_i$, so that $m_\theta(x) = \theta_i p_i(x)$ for all $x \in \mathcal{X}_i$, and let $\mathcal{X} = \cup_i \mathcal{X}_i$. Consider the Shannon negentropy $F(\theta) = -h(m_\theta)$ as a strictly convex function. Then we have
$$F(\theta) = -h(m_\theta) = \sum_{i=0}^D \theta_i \log \theta_i - \sum_{i=0}^D \theta_i\, h(p_i).$$
Note that the term $\sum_i \theta_i h(p_i)$ is affine in $\theta$, and Bregman divergences are defined up to affine terms, so that the Bregman generator $F$ is equivalent to the Bregman generator of the family of categorical distributions. This example generalizes the ordinary mixture family of categorical distributions where the $p_i$'s are distinct Dirac distributions. Note that when the supports of the component distributions are not pairwise disjoint, the (neg)entropy may not be analytic [31] (e.g., a mixture obtained by convexly weighting two prescribed distinct Gaussian distributions). This contrasts with the fact that the cumulant function of an exponential family is always real-analytic [34].
The entropy and cross-entropy between densities of a mixture family can be calculated in closed form.
Property 2. The entropy of a density belonging to a mixture family $\mathcal{M}$ is $h(m_\theta) = -F(\theta)$, and the cross-entropy between two mixture densities $m_{\theta_1}$ and $m_{\theta_2}$ is $h^\times(m_{\theta_1} : m_{\theta_2}) = -F(\theta_2) - \langle \theta_1 - \theta_2, \nabla F(\theta_2) \rangle$.
Proof. Let us write the KLD as the difference of the cross-entropy minus the entropy [6]: $\mathrm{KL}(m_{\theta_1} : m_{\theta_2}) = h^\times(m_{\theta_1} : m_{\theta_2}) - h(m_{\theta_1}) = B_F(\theta_1 : \theta_2)$. Following [26], we deduce that $h(m_\theta) = -F(\theta)$ and $h^\times(m_{\theta_1} : m_{\theta_2}) = -F(\theta_2) - \langle \theta_1 - \theta_2, \nabla F(\theta_2) \rangle$.
Thus we can compute numerically the Jensen-Shannon centroids (or barycenters) of a set of densities belonging to a mixture family. This includes the case of categorical distributions and the case of Gaussian Mixture Models (GMMs) with prescribed Gaussian components [29] (although in this case the negentropy needs to be stochastically approximated using Monte Carlo techniques [24]). When the densities do not belong to a mixture family (say, the Gaussian family, which is an exponential family [2]), we face the problem that the mixture of two densities does not belong to the family anymore. One way to tackle this problem is to project the mixture onto the Gaussian family. This corresponds to an m-projection (mixture projection) which can be interpreted as a Maximum Entropy projection of the mixture [33,2].
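The closed-form expressions of Property 2 can be checked numerically on categorical distributions; a small illustrative sketch (toy values and names are ours):

```python
import numpy as np

def F(theta):                               # Shannon negentropy in natural parameters
    full = np.append(theta, 1.0 - np.sum(theta))
    return float(np.sum(full * np.log(full)))

def grad_F(theta):
    return np.log(theta / (1.0 - np.sum(theta)))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.6, 0.1, 0.3])
t1, t2 = p[:-1], q[:-1]
entropy       = -np.sum(p * np.log(p))
cross_entropy = -np.sum(p * np.log(q))
print(np.isclose(entropy, -F(t1)))                                       # True
print(np.isclose(cross_entropy, -F(t2) - np.dot(t1 - t2, grad_F(t2))))   # True
```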
Notice that we can perform fast k-means clustering without centroid calculations by generalizing the k-means++ probabilistic initialization [3,30] to an arbitrary divergence as detailed in [28]. Finally, let us notice some decompositions of the Jensen-Shannon divergence and the skew Jensen divergences.
Remark 3. We have the following decomposition for the Jensen-Shannon divergence:
$$\mathrm{JS}(p_1, p_2) = h^\times_{\mathrm{JS}}(p_1 : p_2) - h_{\mathrm{JS}}(p_2),$$
where $h^\times_{\mathrm{JS}}(p_1 : p_2) := h\!\left(\frac{p_1 + p_2}{2}\right) - \frac{1}{2} h(p_1)$ and $h_{\mathrm{JS}}(p_2) := h^\times_{\mathrm{JS}}(p_2 : p_2) = h(p_2) - \frac{1}{2} h(p_2) = \frac{1}{2} h(p_2)$. This decomposition bears some similarity with the KLD decomposition viewed as the cross-entropy minus the entropy (with the cross-entropy always upper bounding the entropy).
Similarly, the $\alpha$-skew Jensen divergence $J_F^\alpha(\theta_1 : \theta_2) := (1 - \alpha) F(\theta_1) + \alpha F(\theta_2) - F\big((\theta_1 \theta_2)_\alpha\big)$ can be decomposed as the difference of the information $I_F^\alpha(\theta_1) = (1 - \alpha) F(\theta_1)$ minus the cross-information $C_F^\alpha(\theta_1 : \theta_2) := F\big((\theta_1 \theta_2)_\alpha\big) - \alpha F(\theta_2)$:
$$J_F^\alpha(\theta_1 : \theta_2) = I_F^\alpha(\theta_1) - C_F^\alpha(\theta_1 : \theta_2).$$
Notice that the information $I_F^\alpha(\theta_1)$ is the self cross-information: $I_F^\alpha(\theta_1) = C_F^\alpha(\theta_1 : \theta_1)$. Recall that the convex information plays the role of a negentropy since the entropy is concave. For the Jensen-Shannon divergence on the mixture family of categorical distributions, the convex generator $F(\theta) = -h(m_\theta) = \sum_{i=0}^D \theta_i \log \theta_i$ (with $\theta_0 = 1 - \sum_{i=1}^D \theta_i$) is the Shannon negentropy.

Conclusion and discussion
The Jensen-Shannon divergence [19] is a renowned symmetrization of the oriented Kullback-Leibler divergence that enjoys the following three essential properties: 1. it is always bounded, 2. it applies to densities with potentially different supports, and 3. it extends to unnormalized densities while retaining the same formula expression.
The JSD plays an important role in machine learning and in deep learning for studying Generative Adversarial Networks (GANs) [11]. Traditionally, the JSD has been skewed with a scalar parameter $\alpha \in (0, 1)$ [17,35]. In practice, it has been demonstrated experimentally that skewing divergences may significantly improve the performance of some tasks (e.g., [18,15]).
In this work, we showed how to vector-skew the JSD while preserving the above three properties. These new families of weighted vector-skew Jensen-Shannon divergences may allow one to fine-tune the dissimilarity in applications by replacing the skewing scalar parameter of the JSD by a vector parameter (informally, adding some "knobs" for tuning a divergence). We then considered computing the Jensen-Shannon centroids of a set of densities belonging to a mixture family [2] by using the concave-convex procedure (CCCP) [36].
This bi-vector-skew divergence unifies the Jeffreys divergence with the $\alpha$-skew Jensen-Shannon divergence: for example, setting $\alpha = (0, 1)$, $\beta = (1, 0)$ and $w = (1, 1)$ yields
$$\mathrm{KL}^{(1,1)}_{(0,1),(1,0)}(p : q) = \mathrm{KL}(p : q) + \mathrm{KL}(q : p) = J(p, q).$$
We have shown in this paper that interesting properties may occur when the skewing vector $\beta$ is purposely correlated with the skewing vector $\alpha$: namely, for the bi-vector-skew Bregman divergences with $\beta = (\bar\alpha, \ldots, \bar\alpha)$ and $\bar\alpha = \sum_i w_i \alpha_i$, we obtain an equivalent Jensen diversity for the Jensen-Bregman divergence, and as a byproduct a vector-skew generalization of the Jensen-Shannon divergence.