Article

Two Types of Geometric Jensen–Shannon Divergences

Frank Nielsen
Sony Computer Science Laboratories, Tokyo 141-0022, Japan
Entropy 2025, 27(9), 947; https://doi.org/10.3390/e27090947
Submission received: 8 August 2025 / Revised: 29 August 2025 / Accepted: 9 September 2025 / Published: 11 September 2025
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

The geometric Jensen–Shannon divergence (G-JSD) has gained popularity in machine learning and information sciences thanks to its closed-form expression between Gaussian distributions. In this work, we introduce an alternative definition of the geometric Jensen–Shannon divergence tailored to positive densities which does not normalize geometric mixtures. This novel divergence is termed the extended G-JSD, as it applies to the more general case of positive measures. We explicitly report the gap between the extended G-JSD and the G-JSD when considering probability densities, and show how to express the G-JSD and extended G-JSD using the Jeffreys divergence and the Bhattacharyya distance or Bhattacharyya coefficient. The extended G-JSD is proven to be an f-divergence, which is a separable divergence satisfying information monotonicity and invariance in information geometry. We derive a corresponding closed-form formula for the two types of G-JSDs when considering the case of multivariate Gaussian distributions that is often met in applications. We consider Monte Carlo stochastic estimations and approximations of the two types of G-JSD using the projective γ -divergences. Although the square root of the JSD yields a metric distance, we show that this is no longer the case for the two types of G-JSD. Finally, we explain how these two types of geometric JSDs can be interpreted as regularizations of the ordinary JSD.

1. Introduction

1.1. Kullback–Leibler and Jensen–Shannon Divergences

Let $(\mathcal{X}, \mathcal{E}, \mu)$ be a measure space with sample space $\mathcal{X}$, $\sigma$-algebra of events $\mathcal{E}$, and a prescribed positive measure $\mu$ on the measurable space $(\mathcal{X},\mathcal{E})$ (e.g., the counting measure or the Lebesgue measure). Let $\mathcal{M}_+(\mathcal{X}) = \{Q\}$ be the set of positive distributions $Q$ and $\mathcal{M}_+^1(\mathcal{X}) = \{P\}$ be the subset of probability measures $P$. We denote by $\mathcal{M}_\mu = \{\frac{\mathrm{d}Q}{\mathrm{d}\mu} : Q \in \mathcal{M}_+(\mathcal{X})\}$ and $\mathcal{M}_\mu^1 = \{\frac{\mathrm{d}P}{\mathrm{d}\mu} : P \in \mathcal{M}_+^1(\mathcal{X})\}$ the corresponding sets of Radon–Nikodym positive and probability densities, respectively.
Consider two probability measures $P_1$ and $P_2$ of $\mathcal{M}_+^1(\mathcal{X})$ with Radon–Nikodym densities $p_1 := \frac{\mathrm{d}P_1}{\mathrm{d}\mu} \in \mathcal{M}_\mu^1$ and $p_2 := \frac{\mathrm{d}P_2}{\mathrm{d}\mu} \in \mathcal{M}_\mu^1$ with respect to $\mu$, respectively. The deviation of $P_1$ from $P_2$ (also called distortion, dissimilarity, or deviance) is commonly measured in information theory [1] by the Kullback–Leibler divergence (KLD):
$\mathrm{KL}(p_1,p_2) := \int p_1 \log\frac{p_1}{p_2}\,\mathrm{d}\mu = E_{p_1}\!\left[\log\frac{p_1}{p_2}\right].$
Informally, the KLD quantifies the information lost when $p_2$ is used to approximate $p_1$ by measuring, on average, the surprise incurred when outcomes sampled from $p_1$ are assumed to emanate from $p_2$: Shannon entropy $H(p) = \int p\log\frac{1}{p}\,\mathrm{d}\mu$ is the expected surprise $H(p) = E_p[-\log p]$, where $-\log p(x)$ measures the surprise of the outcome $x$. Logarithms are taken to base 2 when information is measured in bits, and to base $e$ when it is measured in nats. Gibbs' inequality asserts that $\mathrm{KL}(P_1,P_2)\geq 0$ with equality if and only if $P_1 = P_2$ $\mu$-almost everywhere. Since $\mathrm{KL}(p_1,p_2)\neq \mathrm{KL}(p_2,p_1)$, various symmetrization schemes of the KLD have been proposed in the literature [1] (e.g., the Jeffreys divergence [1,2], the resistor average divergence [3] (harmonic KLD symmetrization), the Chernoff information [1], etc.)
An important symmetrization technique of the KLD is the Jensen–Shannon divergence [4,5] (JSD):
$\mathrm{JS}(p_1,p_2) := \frac{1}{2}\left(\mathrm{KL}(p_1,a) + \mathrm{KL}(p_2,a)\right),$
where $a = \frac{1}{2}p_1 + \frac{1}{2}p_2$ denotes the statistical mixture of $p_1$ and $p_2$. The JSD is guaranteed to be upper-bounded by $\log 2$ even when the supports of $p_1$ and $p_2$ differ, making it attractive in applications. Furthermore, its square root $\sqrt{\mathrm{JS}}$ yields a metric distance [6,7].
The JSD can be extended to a set of densities to measure the diversity of the set as an information radius [8]. In information theory, the JSD can also be interpreted as an information gain [6] since it can be equivalently written as
$\mathrm{JS}(p_1,p_2) = H\!\left(\frac{p_1+p_2}{2}\right) - \frac{H(p_1)+H(p_2)}{2},$
where $H(p) = -\int p\log p\,\mathrm{d}\mu$ is the Shannon entropy (Shannon entropy for discrete measures and differential entropy for continuous measures). The JSD has also been defined in the setting of quantum information [9], where it has also been proven that its square root yields a metric distance [10].
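To make the preceding definitions concrete, here is a minimal numerical sketch (not part of the original paper, assuming NumPy is available) of the discrete KLD and JSD of Equations (1) and (2); it illustrates the symmetry of the JSD and its $\log 2$ upper bound in nats.

```python
# Minimal sketch: discrete KLD and JSD in nats (natural logarithms).
import numpy as np

def kl(p, q):
    """Discrete Kullback-Leibler divergence KL(p, q) in nats (0 log 0 = 0)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    """Jensen-Shannon divergence: average KLD to the arithmetic mixture."""
    a = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * (kl(p, a) + kl(q, a))

p1, p2 = [0.55, 0.45], [0.002, 0.998]
print(js(p1, p2), js(p2, p1))      # symmetric
print(js(p1, p2) <= np.log(2))     # bounded by log 2 ~ 0.693 nats
```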
Remark 1. 
Both the KLD and the JSD belong to the family of f-divergences [11,12] defined for a convex generator f ( u ) (strictly convex at 1) by
$I_f(p_1,p_2) := \int p_1\, f\!\left(\frac{p_2}{p_1}\right)\mathrm{d}\mu.$
Indeed, we have KL ( p 1 , p 2 ) = I f KL ( p 1 , p 2 ) and JS ( p 1 , p 2 ) = I f JS ( p 1 , p 2 ) for the following generators:
$f_{\mathrm{KL}}(u) := -\log u, \qquad f_{\mathrm{JS}}(u) := \frac{1}{2}\left(u\log u - (1+u)\log\frac{1+u}{2}\right).$
The family of f-divergences consists of the invariant divergences of information geometry [13]. The f-divergences guarantee information monotonicity under coarse graining [13] (also called lumping in information theory [14]). Using Jensen's inequality, we get $I_f(p_1,p_2) \geq f(1)$.
Remark 2. 
The metrization of f-divergences was studied in [15]. Once a metric distance D ( p 1 , p 2 ) is given, we may use the following metric transform [16] to obtain another metric which is guaranteed to be bounded by 1:
$0 \leq d(p_1,p_2) = \frac{D(p_1,p_2)}{1 + D(p_1,p_2)} \leq 1.$

1.2. Jensen–Shannon Symmetrization of Dissimilarities with Generalized Mixtures

In [17], a generalization of the KLD Jensen–Shannon symmetrization scheme was studied for an arbitrary statistical dissimilarity $D(\cdot,\cdot)$ using an arbitrary weighted mean [18] $M_\alpha$. A generic weighted mean $M_\alpha(a,b) = M_{1-\alpha}(b,a)$ for $a,b \in \mathbb{R}_{>0}$ is a continuous symmetric monotone map $\alpha \in [0,1] \mapsto M_\alpha(a,b)$ such that $M_0(a,b) = b$ and $M_1(a,b) = a$. For example, the quasi-arithmetic means [18] are defined according to a monotone continuous function $\phi$ as follows:
$M^\phi_\alpha(a,b) := \phi^{-1}\!\left(\alpha\,\phi(a) + (1-\alpha)\,\phi(b)\right).$
When $\phi_p(u) = u^p$, we get the $p$-power mean $M^{\phi_p}_\alpha(a,b) = (\alpha a^p + (1-\alpha)b^p)^{\frac{1}{p}}$ for $p \in \mathbb{R}\setminus\{0\}$. We extend $\phi_p$ to $p=0$ by defining $\phi_0(u) = \log u$, and get $M^{\phi_0}_\alpha(a,b) = a^\alpha b^{1-\alpha}$, the weighted geometric mean $G_\alpha$.
Let us recall the generalization of the Jensen–Shannon symmetrization scheme of a dissimilarity measure presented in [17]:
Definition 1 
( ( α , β ) M-JS dissimilarity [17]). The Jensen–Shannon skew symmetrization of a statistical dissimilarity D ( · , · ) with respect to an arbitrary weighted bivariate mean M α ( · , · ) is given by
$D^{\mathrm{JS}}_{M_\alpha,\beta}(p_1,p_2) := \beta\, D\!\left(p_1, (p_1p_2)^{M_\alpha}\right) + (1-\beta)\, D\!\left(p_2, (p_1p_2)^{M_\alpha}\right), \quad (\alpha,\beta) \in (0,1)^2,$
where ( p 1 p 2 ) M α is the statistical normalized weighted M-mixture of p 1 and p 2 :
$(p_1p_2)^{M_\alpha}(x) := \frac{M_\alpha(p_1(x),p_2(x))}{\int M_\alpha(p_1(x),p_2(x))\,\mathrm{d}\mu(x)}.$
Remark 3. 
A more general definition is given in [17] by using another arbitrary weighted mean N β to average the two dissimilarities in Equation (3):
$D^{\mathrm{JS}}_{M_\alpha,N_\beta}(p_1,p_2) := N_\beta\!\left(D\!\left(p_1,(p_1p_2)^{M_\alpha}\right),\, D\!\left(p_2,(p_1p_2)^{M_\alpha}\right)\right), \quad (\alpha,\beta)\in(0,1)^2.$
When $N_\beta = A_\beta$, the weighted arithmetic mean with $A_\beta(a,b) = \beta a + (1-\beta)b$, Equation (5) amounts to Equation (3).
When $\alpha = \frac{1}{2}$, we write, for short, $(p_1p_2)^M$ instead of $(p_1p_2)^{M_{\frac{1}{2}}}$ in the remainder.
When $D = \mathrm{KL}$ and $M = N = A_{\frac{1}{2}}$, Equation (5) yields the Jensen–Shannon divergence of Equation (2): $\mathrm{JS}(p_1,p_2) = \mathrm{KL}^{\mathrm{JS}}_{A_{\frac{1}{2}},A_{\frac{1}{2}}}(p_1,p_2) = \mathrm{KL}^{\mathrm{JS}}_{A,A}(p_1,p_2)$.
Lower and upper bounds for the skewed α -Jensen–Shannon divergence were reported in [19].
The abstract mixture normalizer of ( p 1 p 2 ) M α shall be denoted by
$Z_{M_\alpha}(p_1,p_2) := \int M_\alpha(p_1(x),p_2(x))\,\mathrm{d}\mu(x),$
so that the normalized $M$-mixture is written as $(p_1p_2)^{M_\alpha}(x) = \frac{M_\alpha(p_1(x),p_2(x))}{Z_{M_\alpha}(p_1,p_2)}$. The normalizer $Z_{M_\alpha}(p_1,p_2)$ is always finite, and thus the weighted $M$-mixtures $(p_1p_2)^{M_\alpha}$ are well-defined:
Proposition 1. 
For any generic weighted mean $M_\alpha$, the normalizer of the weighted $M$-mixture is bounded by 2:
$0 \leq Z_{M_\alpha}(p_1,p_2) \leq 2.$
Proof. 
Since M α is a scalar weighted mean, it satisfies the following in-betweenness property:
$\min\{p_1(x),p_2(x)\} \leq M_\alpha(p_1(x),p_2(x)) \leq \max\{p_1(x),p_2(x)\}.$
Hence, by using the following two identities for a 0 and b 0 ,
$\min\{a,b\} = \frac{a+b}{2} - \frac{1}{2}|a-b|, \qquad \max\{a,b\} = \frac{a+b}{2} + \frac{1}{2}|a-b|,$
we get
$\int \min\{p_1(x),p_2(x)\}\,\mathrm{d}\mu(x) \leq \int M_\alpha(p_1(x),p_2(x))\,\mathrm{d}\mu(x) \leq \int \max\{p_1(x),p_2(x)\}\,\mathrm{d}\mu(x),$ i.e., $0 \leq 1 - \mathrm{TV}(p_1,p_2) \leq Z_{M_\alpha}(p_1,p_2) \leq 1 + \mathrm{TV}(p_1,p_2) \leq 2,$
where
$\mathrm{TV}(p_1,p_2) := \frac{1}{2}\int |p_1 - p_2|\,\mathrm{d}\mu,$
is the total variation distance, which is upper-bounded by 1. When the supports of the densities $p_1$ and $p_2$ intersect (i.e., non-singular probability measures $P_1$ and $P_2$), we have $Z_{M_\alpha}(p_1,p_2) > 0$, and therefore the weighted $M$-mixtures $(p_1p_2)^{M_\alpha}$ are well-defined. □
The generic Jensen–Shannon symmetrization of dissimilarities given in Definition 1 allows us to re-interpret some well-known statistical dissimilarities:
For example, the Chernoff information [1,20] is defined by
$C(p_1,p_2) := \max_{\alpha\in(0,1)} B_\alpha(p_1,p_2),$
where $B_\alpha(p_1,p_2)$ denotes the $\alpha$-skewed Bhattacharyya distance:
$B_\alpha(p_1,p_2) := -\log \int p_1^\alpha\, p_2^{1-\alpha}\,\mathrm{d}\mu.$
When $\alpha=\frac{1}{2}$, we write $B(p_1,p_2) = B_{\frac{1}{2}}(p_1,p_2)$, the Bhattacharyya distance. Notice that the Bhattacharyya distance is not a metric distance, as it violates the triangle inequality of metrics.
Using the framework of JS-symmetrization of dissimilarities, we can reinterpret the Chernoff information as
$C(p_1,p_2) = (\mathrm{KL}^*)^{\mathrm{JS}}_{G_{\alpha^*},A_{\frac{1}{2}}}(p_1,p_2),$
where α * is provably the unique optimal skewing factor in Equation (8), such that we have [20]:
$C(p_1,p_2) = \mathrm{KL}^*(p_1,(p_1p_2)^{G_{\alpha^*}}) = \mathrm{KL}^*(p_2,(p_1p_2)^{G_{\alpha^*}}) = \frac{1}{2}\left(\mathrm{KL}^*(p_1,(p_1p_2)^{G_{\alpha^*}}) + \mathrm{KL}^*(p_2,(p_1p_2)^{G_{\alpha^*}})\right),$
where KL * denotes the reverse KLD:
$\mathrm{KL}^*(p_1,p_2) := \mathrm{KL}(p_2,p_1).$
Note that the KLD is sometimes called the forward KLD (e.g., [21]), and we have KL * * ( p 1 , p 2 ) = KL ( p 1 , p 2 ) .
Although arithmetic mixtures are most often used in statistics, geometric mixtures are also encountered, for example in Bayesian statistics [22] or in Markov chain Monte Carlo annealing [23], just to give two examples. In information geometry, statistical power mixtures based on the homogeneous power means are used to perform stochastic integration of statistical models [24].
Proposition 2 (Bhattacharyya distance as G-JSD).
The Bhattacharyya distance [25] and the α-skewed Bhattacharyya distances can be interpreted as JS-symmetrizations of the reverse KLD with respect to the geometric mean G:
$B(p_1,p_2) := -\log\int\sqrt{p_1p_2}\,\mathrm{d}\mu = (\mathrm{KL}^*)^{\mathrm{JS}}_{G}(p_1,p_2), \qquad B_\alpha(p_1,p_2) := -\log\int p_1^\alpha\, p_2^{1-\alpha}\,\mathrm{d}\mu = (\mathrm{KL}^*)^{\mathrm{JS}}_{G_\alpha}(p_1,p_2).$
Proof. 
Let $m = (p_1p_2)^G = \frac{\sqrt{p_1p_2}}{Z_G(p_1,p_2)}$ denote the geometric mixture with normalizer $Z_G(p_1,p_2) = \int\sqrt{p_1p_2}\,\mathrm{d}\mu$. By definition of the JS-symmetrization of the reverse KLD, we have
$(\mathrm{KL}^*)^{\mathrm{JS}}_G(p_1,p_2) := \frac{1}{2}\left(\mathrm{KL}^*(p_1,(p_1p_2)^G) + \mathrm{KL}^*(p_2,(p_1p_2)^G)\right) = \frac{1}{2}\left(\mathrm{KL}((p_1p_2)^G,p_1) + \mathrm{KL}((p_1p_2)^G,p_2)\right) = \frac{1}{2}\int m\left(\log\frac{\sqrt{p_1p_2}}{Z_G(p_1,p_2)\,p_1} + \log\frac{\sqrt{p_1p_2}}{Z_G(p_1,p_2)\,p_2}\right)\mathrm{d}\mu = \frac{1}{2}\left(\frac{1}{2}\int m\log\left(\frac{p_2}{p_1}\,\frac{p_1}{p_2}\right)\mathrm{d}\mu - 2\log Z_G(p_1,p_2)\int m\,\mathrm{d}\mu\right) = -\log Z_G(p_1,p_2) =: B(p_1,p_2).$
The proof carries over similarly for the $\alpha$-skewed JS-symmetrization of the reverse KLD: we now let $m_\alpha = (p_1p_2)^{G_\alpha} = \frac{p_1^\alpha p_2^{1-\alpha}}{Z_{G_\alpha}(p_1,p_2)}$ be the $\alpha$-weighted geometric mixture with normalizer $Z_{G_\alpha}(p_1,p_2) = \int p_1^\alpha\, p_2^{1-\alpha}\,\mathrm{d}\mu$, written as $Z_{G_\alpha}$ for short below:
$(\mathrm{KL}^*)^{\mathrm{JS}}_{G_\alpha,\alpha}(p_1,p_2) := \alpha\,\mathrm{KL}^*(p_1,(p_1p_2)^{G_\alpha}) + (1-\alpha)\,\mathrm{KL}^*(p_2,(p_1p_2)^{G_\alpha}) = \alpha\,\mathrm{KL}(m_\alpha,p_1) + (1-\alpha)\,\mathrm{KL}(m_\alpha,p_2) = \int\left(\alpha\, m_\alpha\log\frac{p_1^\alpha p_2^{1-\alpha}}{Z_{G_\alpha}\,p_1} + (1-\alpha)\, m_\alpha\log\frac{p_1^\alpha p_2^{1-\alpha}}{Z_{G_\alpha}\,p_2}\right)\mathrm{d}\mu = -(\alpha + 1 - \alpha)\log Z_{G_\alpha}\int m_\alpha\,\mathrm{d}\mu + \int m_\alpha \log\left(\left(\frac{p_2}{p_1}\right)^{\alpha(1-\alpha)}\left(\frac{p_1}{p_2}\right)^{\alpha(1-\alpha)}\right)\mathrm{d}\mu = -\log Z_{G_\alpha}(p_1,p_2) =: B_\alpha(p_1,p_2).$ □
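The following small numerical sketch (an illustration added here, assuming NumPy) checks Proposition 2 on a pair of discrete distributions: the JS-symmetrization of the reverse KLD with respect to the geometric mean coincides with the Bhattacharyya distance $-\log\int\sqrt{p_1p_2}\,\mathrm{d}\mu$.

```python
# Numerical check of Proposition 2 on discrete distributions (illustrative sketch).
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

p1 = np.array([0.55, 0.45])
p2 = np.array([0.002, 0.998])

g = np.sqrt(p1 * p2)              # unnormalized geometric mixture
Z = g.sum()                       # normalizer Z_G(p1, p2)
m = g / Z                         # normalized geometric mixture

lhs = 0.5 * (kl(m, p1) + kl(m, p2))   # (KL*)^JS_G(p1, p2)
rhs = -np.log(Z)                      # Bhattacharyya distance B(p1, p2)
print(lhs, rhs)                       # both ~0.352
```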
Besides information theory [1], the JSD also plays an important role in machine learning [26,27,28]. However, one drawback that limits its use in practice is that the JSD between two Gaussian distributions (normal distributions) is not known in closed form, since no analytic formula is known for the differential entropy of a two-component Gaussian mixture [29]; thus, the JSD needs to be numerically approximated in practice by various methods.
To circumvent this problem, the geometric G-JSD was defined in [17] as follows:
Definition 2 
(G-JSD [17]). The geometric Jensen–Shannon divergence (G-JSD) between two probability densities, p 1 and p 2 , is defined by
$\mathrm{JS}_G(p_1,p_2) := \frac{1}{2}\left(\mathrm{KL}(p_1,(p_1p_2)^G) + \mathrm{KL}(p_2,(p_1p_2)^G)\right),$
where $(p_1p_2)^G(x) = \frac{\sqrt{p_1(x)p_2(x)}}{\int\sqrt{p_1(x)p_2(x)}\,\mathrm{d}\mu(x)}$ is the (normalized) geometric mixture of $p_1$ and $p_2$.
We have $\mathrm{JS}_G(p_1,p_2) = \mathrm{KL}^{\mathrm{JS}}_G(p_1,p_2)$. Since, by default, the $M$-mixture JS-symmetrization of a dissimilarity $D$ is performed on the right argument (i.e., $D^{\mathrm{JS}}_M$), we may also consider a dual JS-symmetrization by setting the $M$-mixtures on the left argument. We denote this left-mixture JS-symmetrization by $D^{\mathrm{JS}*}_M$. We have $D^{\mathrm{JS}*}_M(p_1,p_2) = (D^*)^{\mathrm{JS}}_M(p_1,p_2)$, i.e., the left-sided JS-symmetrization of $D$ amounts to a right-sided JS-symmetrization of the dual dissimilarity $D^*(p_1,p_2) := D(p_2,p_1)$.
Thus, a left-sided G-JSD divergence JS G * was also defined in [17]:
Definition 3. 
The left-sided geometric Jensen–Shannon divergence (G-JSD) between two probability densities p 1 and p 2 is defined by
$\mathrm{JS}^*_G(p_1,p_2) := \frac{1}{2}\left(\mathrm{KL}((p_1p_2)^G,p_1) + \mathrm{KL}((p_1p_2)^G,p_2)\right) = \frac{1}{2}\left(\mathrm{KL}^*(p_1,(p_1p_2)^G) + \mathrm{KL}^*(p_2,(p_1p_2)^G)\right),$
where $(p_1p_2)^G(x) = \frac{\sqrt{p_1(x)p_2(x)}}{\int\sqrt{p_1(x)p_2(x)}\,\mathrm{d}\mu(x)}$ is the (normalized) geometric mixture of $p_1$ and $p_2$.
In contrast to the JSD, which must be numerically approximated between Gaussians, the geometric Jensen–Shannon divergence (G-JSD) admits a closed-form expression between Gaussian distributions [17]. However, the G-JSD is no longer bounded. The G-JSD formula between Gaussian distributions has been used in several scenarios; see [30,31,32,33,34,35,36,37,38] for a few use cases.
Let us express the G-JSD divergence using other familiar divergences.
Proposition 3. 
We have the following expression of the geometric Jensen–Shannon divergence:
$\mathrm{JS}_G(p_1,p_2) = \frac{1}{4}J(p_1,p_2) - B(p_1,p_2),$
where $J(p_1,p_2) := \int (p_1-p_2)\log\frac{p_1}{p_2}\,\mathrm{d}\mu$ is the Jeffreys divergence [2], and
$B(p_1,p_2) = -\log\int\sqrt{p_1p_2}\,\mathrm{d}\mu = -\log Z_G(p_1,p_2),$
is the Bhattacharyya distance.
Proof. 
We have the following:
$\mathrm{JS}_G(p_1,p_2) := \frac{1}{2}\left(\mathrm{KL}(p_1,(p_1p_2)^G) + \mathrm{KL}(p_2,(p_1p_2)^G)\right) = \frac{1}{2}\int\left(p_1(x)\log\frac{p_1(x)\,Z_G(p_1,p_2)}{\sqrt{p_1(x)p_2(x)}} + p_2(x)\log\frac{p_2(x)\,Z_G(p_1,p_2)}{\sqrt{p_1(x)p_2(x)}}\right)\mathrm{d}\mu(x) = \frac{1}{2}\int\left(p_1(x)+p_2(x)\right)\log Z_G(p_1,p_2)\,\mathrm{d}\mu(x) + \frac{1}{2}\left(\frac{1}{2}\mathrm{KL}(p_1,p_2) + \frac{1}{2}\mathrm{KL}(p_2,p_1)\right) = \log Z_G(p_1,p_2) + \frac{1}{4}J(p_1,p_2) = \frac{1}{4}J(p_1,p_2) - B(p_1,p_2).$ □
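As a quick sanity check of Proposition 3, the following sketch (illustrative, assuming NumPy) computes the G-JSD of Definition 2 directly and via the identity $\frac{1}{4}J - B$ on discrete distributions; the two values agree.

```python
# Sanity check of Proposition 3 on discrete distributions (illustrative sketch).
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

p1 = np.array([0.55, 0.45])
p2 = np.array([0.002, 0.998])

g = np.sqrt(p1 * p2)
Z = g.sum()                          # Z_G(p1, p2) = Bhattacharyya coefficient
m = g / Z                            # normalized geometric mixture

js_g_def = 0.5 * (kl(p1, m) + kl(p2, m))   # Definition 2
J = kl(p1, p2) + kl(p2, p1)                # Jeffreys divergence
B = -np.log(Z)                             # Bhattacharyya distance
js_g_id = 0.25 * J - B                     # Proposition 3
print(js_g_def, js_g_id)                   # both ~0.5267
```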
Corollary 1 (G-JSD upper bound).
We have the upper bound $\mathrm{JS}_G(p,q) \leq \frac{1}{4}J(p,q)$.
Proof. 
Since $B(p_1,p_2)\geq 0$ and $\mathrm{JS}_G(p_1,p_2) = \frac{1}{4}J(p_1,p_2) - B(p_1,p_2)$, we have $\mathrm{JS}_G(p,q) \leq \frac{1}{4}J(p,q)$. □
Remark 4. 
Although the KLD and JSD are separable divergences (i.e., f-divergences expressed as integrals of scalar divergences), the M-JSD divergence is, in general, not separable, because it requires mixtures to be normalized inside the log terms. Notice that the Bhattacharyya distance is, similarly, not a separable divergence, but the Bhattacharyya similarity coefficient $\mathrm{BC}(p_1,p_2) = \exp(-B(p_1,p_2)) = \int\sqrt{p_1p_2}\,\mathrm{d}\mu$ is a separable "f-divergence"/f-coefficient for $f_{\mathrm{BC}}(u) = \sqrt{u}$ (here, a concave generator): $\mathrm{BC}(p_1,p_2) = I_{f_{\mathrm{BC}}}(p_1,p_2)$. Notice that $f_{\mathrm{BC}}(1) = 1$, and because of the concavity of $f_{\mathrm{BC}}$, we have $I_{f_{\mathrm{BC}}}(p_1,p_2) \leq f_{\mathrm{BC}}(1) = 1$ (hence, the term f-coefficient to reflect the notion of a similarity measure).

1.3. Paper Outline

The paper is organized as follows: We first give an alternative definition of the M-JSD in Section 2 (Definition 4) which extends to positive measures and does not require normalization of geometric mixtures. We call this new divergence the extended M-JSD, and we compare the two types of geometric JSDs when dealing with probability measures. In Section 4, we show that these normalized/extended M-JSD divergences can be interpreted as regularizations of the Jensen–Shannon divergence, and exhibit several bounds. We discuss Monte Carlo stochastic approximations and approximations using γ -divergences [39] in Section 5. For the case of geometric mixtures, although the G-JSD is not an f-divergence, we show that the extended G-JSD is an f-divergence (Proposition 5), and we express both the G-JSD and the extended G-JSD using both the Jeffreys divergence and the Bhattacharyya divergence or coefficient. We report a related closed-form formula for the G-JSD and extended G-JSD between two Gaussian distributions in Section 3. Finally, we summarize the main results in the concluding section, Section 6.
A list of notations is provided in Nomenclature.

2. A Novel Definition: The G-JSD, Extended to Positive Measures

2.1. Definition and Properties

We may consider the following two modifications of the G-JSD:
  • First, we replace the KLD with the extended KLD between positive densities $q_1 \in \mathcal{M}_\mu$ and $q_2 \in \mathcal{M}_\mu$ instead of normalized densities:
    $\mathrm{KL}^+(q_1,q_2) := \int\left(q_1\log\frac{q_1}{q_2} + q_2 - q_1\right)\mathrm{d}\mu,$
    (with KL + ( p 1 , p 2 ) = KL ( p 1 , p 2 ) );
  • Second, we consider unnormalized M-mixture densities:
    $(q_1q_2)^{\tilde{M}_\alpha}(x) := M_\alpha(q_1(x),q_2(x)),$
    where we use the M ˜ tilde notation to indicate that the M-mixture is not normalized, instead of normalized densities ( q 1 q 2 ) M α ( x ) .
The extended KLD can be interpreted as a pointwise integral of a scalar Bregman divergence obtained for the negative Shannon entropy generator [40]. This proves that $\mathrm{KL}^+(q_1,q_2)\geq 0$ with equality if and only if $q_1 = q_2$ $\mu$-almost everywhere. Notice that $\mathrm{KL}(q_1,q_2)$ may be negative when $q_1$ and/or $q_2$ are not normalized to probability densities, but we always have $\mathrm{KL}^+(q_1,q_2)\geq 0$.
The extended KLD is an extended f-divergence [41]: $\mathrm{KL}^+(q_1,q_2) = I^+_{f_{\mathrm{KL}^+}}(q_1,q_2)$ for $f_{\mathrm{KL}^+}(u) = -\log u + u - 1$, where $I^+_f(q_1,q_2)$ denotes the f-divergence extended to positive densities $q_1$ and $q_2$:
$I^+_f(q_1,q_2) = \int q_1\, f\!\left(\frac{q_2}{q_1}\right)\mathrm{d}\mu.$
Remark 5. 
As a side remark, it is preferable in practice to estimate the KLD between $p_1$ and $p_2$ by Monte Carlo methods using Equation (10) instead of Equation (1), in order to guarantee the non-negativity of the estimated KLD (Gibbs' inequality). Indeed, drawing $s$ samples $x_1,\ldots,x_s$ defines two unnormalized distributions $q_1(x) = \frac{1}{s}\sum_{i=1}^s p_1(x)\,\delta_{x_i}(x)$ and $q_2(x) = \frac{1}{s}\sum_{i=1}^s p_2(x)\,\delta_{x_i}(x)$, where
$\delta_{x_i}(x) = \begin{cases}1 & \text{if } x = x_i,\\ 0 & \text{otherwise}.\end{cases}$
Remark 6. 
For an arbitrary distortion measure D + ( q 1 , q 2 ) between positive measures q 1 and q 2 , we can build a corresponding projective divergence D ˜ ( q 1 , q 2 ) as follows:
$\tilde{D}(q_1,q_2) := D^+\!\left(\frac{q_1}{Z(q_1)}, \frac{q_2}{Z(q_2)}\right),$
where $Z(q) := \int q\,\mathrm{d}\mu$ is the normalization factor of the positive density $q$. The divergence $\tilde{D}$ is said to be projective because we have, for all $\lambda_1>0$, $\lambda_2>0$, the property that $\tilde{D}(\lambda_1 q_1, \lambda_2 q_2) = \tilde{D}(q_1,q_2) = D^+(p_1,p_2)$, where $p_i = \frac{q_i}{Z(q_i)}$ are the normalized densities. The projective Kullback–Leibler divergence $\widetilde{\mathrm{KL}}$ is thus another projective extension of the KLD to unnormalized densities which coincides with the KLD for probability densities. But the projective KLD is different from the extended KLD of Equation (10), and furthermore, we have $\widetilde{\mathrm{KL}}(q_1,q_2) = 0$ if and only if $q_1 = \lambda q_2$ $\mu$-almost everywhere for some $\lambda>0$.
Let us now define the Jensen–Shannon symmetrization of an extended statistical divergence D + with respect to an arbitrary weighted mean M α as follows:
Definition 4 (Extended M-JSD).
A Jensen–Shannon skew symmetrization of a statistical divergence D + ( · , · ) between two positive measures q 1 and q 2 with respect to a weighted mean M α is defined by
$D^{\mathrm{JS}+}_{\tilde{M}_\alpha,\beta}(q_1,q_2) := \beta\, D^+\!\left(q_1,(q_1q_2)^{\tilde{M}_\alpha}\right) + (1-\beta)\, D^+\!\left(q_2,(q_1q_2)^{\tilde{M}_\alpha}\right).$
When β = 1 2 , we write, for short, D M ˜ α JS + ( q 1 , q 2 ) , and furthermore, when α = 1 2 , we simplify the notation to D M ˜ JS + ( q 1 , q 2 ) .
When D + = KL + , we obtain the extended geometric Jensen–Shannon divergence, JS G ˜ + ( q 1 , q 2 ) = KL G ˜ JS + ( q 1 , q 2 ) :
Definition 5 (Extended G-JSD).
The extended geometric Jensen–Shannon divergence between two positive densities q 1 and q 2 is
$\mathrm{JS}^+_{\tilde{G}}(q_1,q_2) = \frac{1}{2}\left(\mathrm{KL}^+(q_1,(q_1q_2)^{\tilde{G}}) + \mathrm{KL}^+(q_2,(q_1q_2)^{\tilde{G}})\right).$
The extended G-JSD between two normalized densities p 1 and p 2 is thus
$\mathrm{JS}^+_{\tilde{G}}(p_1,p_2) = \frac{1}{2}\int\left(p_1\log\frac{p_1}{\sqrt{p_1p_2}} + p_2\log\frac{p_2}{\sqrt{p_1p_2}}\right)\mathrm{d}\mu + \int\sqrt{p_1p_2}\,\mathrm{d}\mu - 1$
$= \frac{1}{4}\int\left(p_1\log\frac{p_1}{p_2} + p_2\log\frac{p_2}{p_1}\right)\mathrm{d}\mu + Z_G(p_1,p_2) - 1,$
with $Z_G(p_1,p_2) = \exp(-B(p_1,p_2))$.
Thus, we get the following propositions:
Proposition 4. 
The extended geometric Jensen–Shannon divergence (G-JSD) can be expressed as follows:
$\mathrm{JS}^+_{\tilde{G}}(p_1,p_2) = \frac{1}{4}J(p_1,p_2) + \exp(-B(p_1,p_2)) - 1.$
Proof. 
We have
$\mathrm{JS}^+_{\tilde{G}}(p_1,p_2) = \frac{1}{2}\left(\mathrm{KL}^+(p_1,(p_1p_2)^{\tilde{G}}) + \mathrm{KL}^+(p_2,(p_1p_2)^{\tilde{G}})\right) = \frac{1}{2}\int\left(p_1\log\frac{p_1}{\sqrt{p_1p_2}} + p_2\log\frac{p_2}{\sqrt{p_1p_2}} + 2\sqrt{p_1p_2} - (p_1+p_2)\right)\mathrm{d}\mu = \frac{1}{4}\int(p_1-p_2)\log\frac{p_1}{p_2}\,\mathrm{d}\mu + \int\sqrt{p_1p_2}\,\mathrm{d}\mu - 1 = \frac{1}{4}J(p_1,p_2) + \exp(-B(p_1,p_2)) - 1.$ □
Thus, we can express the gap between JS G ˜ + ( p 1 , p 2 ) and JS G ( p 1 , p 2 ) :
$\Delta_G(p_1,p_2) = \mathrm{JS}^+_{\tilde{G}}(p_1,p_2) - \mathrm{JS}_G(p_1,p_2) = \exp(-B(p_1,p_2)) + B(p_1,p_2) - 1.$
Since $Z_G(p_1,p_2) = \exp(-B(p_1,p_2))$, we have:
$\Delta_G(p_1,p_2) = Z_G(p_1,p_2) - \log Z_G(p_1,p_2) - 1.$
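The next sketch (illustrative, assuming NumPy) evaluates Proposition 4 and the gap $\Delta_G = Z_G - \log Z_G - 1$ on a discrete example; note that the gap is non-negative in nats.

```python
# Illustrative sketch: extended G-JSD (Proposition 4) and its gap to the G-JSD, in nats.
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

p1 = np.array([0.55, 0.45])
p2 = np.array([0.002, 0.998])

Z_G = np.sum(np.sqrt(p1 * p2))        # Bhattacharyya coefficient BC(p1, p2)
B = -np.log(Z_G)                      # Bhattacharyya distance
J = kl(p1, p2) + kl(p2, p1)           # Jeffreys divergence

js_g = 0.25 * J - B                   # normalized G-JSD (Proposition 3)
js_g_ext = 0.25 * J + np.exp(-B) - 1  # extended G-JSD (Proposition 4)
gap = js_g_ext - js_g
print(gap, Z_G - np.log(Z_G) - 1)     # equal, and non-negative
```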
Proposition 5. 
The extended G-JSD is an f-divergence for the generator
$f_{\tilde{G}}(u) = \frac{1}{4}(u-1)\log u + \sqrt{u} - 1.$
That is, we have JS G ˜ + ( p 1 , p 2 ) = I f G ˜ ( p 1 , p 2 ) .
Proof. 
We proved that $\mathrm{JS}^+_{\tilde{G}}(p_1,p_2) = \frac{1}{4}J(p_1,p_2) + \mathrm{BC}(p_1,p_2) - 1$. The Jeffreys divergence is an f-divergence for the generator $f_J(u) = (u-1)\log u$, and the Bhattacharyya coefficient is an f-coefficient for $f_{\mathrm{BC}}(u) = \sqrt{u}$ (an "f-divergence" for a concave generator). Thus, we have
$f_{\tilde{G}}(u) = \frac{1}{4}(u-1)\log u + \sqrt{u} - 1,$
such that $\mathrm{JS}^+_{\tilde{G}}(p_1,p_2) = I_{f_{\tilde{G}}}(p_1,p_2)$. We check that $f_{\tilde{G}}(u)$ is convex, since $f''_{\tilde{G}}(u) = \frac{\sqrt{u}(u+1) - u}{4u^{\frac{5}{2}}}$ (and by the change of variable $t = \sqrt{u}$, the numerator $t(t^2 - t + 1)$ is shown to be positive, since the discriminant of $t^2 - t + 1$ is negative), and we have $f_{\tilde{G}}(1) = 0$. Thus, the extended G-JSD is a proper f-divergence. □
It follows that the extended G-JSD satisfies the information monotonicity of invariant divergences in information geometry [13].
By abuse of notation, we have
$\mathrm{KL}^+(q_1,q_2) := \mathrm{KL}(q_1,q_2) + \int(q_2 - q_1)\,\mathrm{d}\mu,$
although $q_1$ and $q_2$ may not be normalized in the KL term (which can then yield a potentially negative value). Letting $Z(q_i) := \int q_i\,\mathrm{d}\mu$ be the total mass of the positive density $q_i$, we have
$\mathrm{KL}^+(q_1,q_2) = \mathrm{KL}(q_1,q_2) + Z(q_2) - Z(q_1).$
Let $\tilde{m}_\alpha = M_\alpha(q_1,q_2)$ be the unnormalized $M$-mixture of the positive densities $q_1$ and $q_2$, and let $Z_{M_\alpha} = \int\tilde{m}_\alpha\,\mathrm{d}\mu$ be the normalization term so that we have $m_\alpha = \frac{\tilde{m}_\alpha}{Z_{M_\alpha}}$ and $\tilde{m}_\alpha = Z_{M_\alpha}\, m_\alpha$. When clear from context, we write $Z_\alpha$ instead of $Z_{M_\alpha}$.
We get, after elementary calculus, the following identity:
$\mathrm{JS}^+_{\tilde{M}_\alpha,\beta}(q_1,q_2) = \mathrm{JS}_{M_\alpha,\beta}(q_1,q_2) - (\beta Z(q_1) + (1-\beta)Z(q_2))\log Z_\alpha + Z_\alpha - (\beta Z(q_1) + (1-\beta)Z(q_2)).$
Therefore, the difference gap $\Delta_{M_\alpha,\beta}(p_1,p_2)$ (written for short as $\Delta(p_1,p_2)$) between the extended (unnormalized) $M$-JSD and the normalized $M$-JSD between two normalized densities $p_1$ and $p_2$ (i.e., with $Z_1 = Z(p_1) = 1$ and $Z_2 = Z(p_2) = 1$) is
$\Delta(p_1,p_2) := \mathrm{JS}^+_{\tilde{M}_\alpha,\beta}(p_1,p_2) - \mathrm{JS}_{M_\alpha,\beta}(p_1,p_2) = Z_\alpha - \log(Z_\alpha) - 1.$
Proposition 6 (Extended/normalized M-JSD Gap).
The following identity holds:
$\mathrm{JS}^+_{\tilde{M}_\alpha,\beta}(p_1,p_2) = \mathrm{JS}_{M_\alpha,\beta}(p_1,p_2) + Z_\alpha - \log(Z_\alpha) - 1.$
Thus, $\mathrm{JS}^+_{\tilde{M}_\alpha,\beta}(p_1,p_2) \geq \mathrm{JS}_{M_\alpha,\beta}(p_1,p_2)$ when $\Delta(p_1,p_2)\geq 0$, and $\mathrm{JS}^+_{\tilde{M}_\alpha,\beta}(p_1,p_2) \leq \mathrm{JS}_{M_\alpha,\beta}(p_1,p_2)$ when $\Delta(p_1,p_2)\leq 0$.
When we consider the weighted arithmetic mean $A_\alpha$, we always have $Z_\alpha = 1$ for $\alpha\in(0,1)$, and thus the two definitions (Definition 1 and Definition 4) of the A-JSD coincide (i.e., $Z^A_\alpha - \log(Z^A_\alpha) - 1 = 0$):
$\mathrm{JS}_A(p_1,p_2) = \mathrm{JS}^+_{\tilde{A}}(p_1,p_2).$
However, when the weighted mean M α differs from the weighted arithmetic mean (i.e., M α A α ), the two definitions of the M-JSD JS M and extended M-JSD JS M ˜ differ by the gap expressed in Equation (17).
Remark 7. 
When information is measured in bits, logarithms are taken to base 2, and when information is measured in nats, base $e$ is considered. Thus, we shall generally consider the gap $\Delta_b = Z_\alpha - \log_b(Z_\alpha) - 1$, where $b$ denotes the base of the logarithm. When $b = e$, we have $\Delta_e \geq 0$ for all $Z_\alpha > 0$. When $b = 2$, we have $\Delta_2 = Z_\alpha - \log_2(Z_\alpha) - 1 \geq 0$ when $0 < Z_\alpha \leq 1$ or $Z_\alpha \geq 2$. But since $Z_\alpha \leq 2$ (see Equation (7)), the condition simplifies to $\Delta_2 \geq 0$ if and only if $Z_\alpha \leq 1$.
Remark 8. 
Although $\sqrt{\mathrm{JS}}$ is a metric distance [5], $\sqrt{\mathrm{JS}_G}$ is not a metric distance, as the triangle inequality is not satisfied. It suffices to report a counterexample of the triangle inequality for a triple of points $p_1$, $p_2$, and $p_3$: Consider $p_1 = (0.55, 0.45)$, $p_2 = (0.002, 0.998)$, and $p_3 = (0.045, 0.955)$. Then we have $\mathrm{JS}_G(p_1,p_2) \approx 1.0263227$, $\mathrm{JS}_G(p_1,p_3) \approx 0.63852342$, and $\mathrm{JS}_G(p_3,p_2) \approx 0.19794622$. The triangle inequality fails with an error of
$\mathrm{JS}_G(p_1,p_2) - \left(\mathrm{JS}_G(p_1,p_3) + \mathrm{JS}_G(p_3,p_2)\right) \approx 0.1898531.$
Similarly, the triangle inequality also fails for the extended G-JSD: we have $\mathrm{JS}^+_G(p_1,p_2) \approx 1.0788275$, $\mathrm{JS}^+_G(p_1,p_3) \approx 0.6691922$, and $\mathrm{JS}^+_G(p_3,p_2) \approx 0.1984633$, with a triangle inequality defect value of
$\mathrm{JS}^+_G(p_1,p_2) - \left(\mathrm{JS}^+_G(p_1,p_3) + \mathrm{JS}^+_G(p_3,p_2)\right) \approx 0.2111719.$
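The counterexample of Remark 8 can be checked numerically with the following sketch (assuming NumPy). The absolute values depend on the scaling and logarithm conventions used to report the figures above, but the violation of the triangle inequality for the square root of the G-JSD is unaffected by such constant rescalings.

```python
# Sketch: check the triangle inequality for sqrt(JS_G) on the triple of Remark 8.
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def js_g(p, q):
    g = np.sqrt(p * q)
    m = g / g.sum()                   # normalized geometric mixture
    return 0.5 * (kl(p, m) + kl(q, m))

p1 = np.array([0.55, 0.45])
p2 = np.array([0.002, 0.998])
p3 = np.array([0.045, 0.955])

d12 = np.sqrt(js_g(p1, p2))
d13 = np.sqrt(js_g(p1, p3))
d32 = np.sqrt(js_g(p3, p2))
print(d12, d13 + d32, d12 > d13 + d32)   # True: triangle inequality fails
```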

2.2. Power JSDs and (Extended) Min-JSD and Max-JSD

Let $P_{\gamma,\alpha}(a,b) := (\alpha a^\gamma + (1-\alpha)b^\gamma)^{\frac{1}{\gamma}}$ be the $\gamma$-power mean for $\gamma\neq 0$ (with $A_\alpha = P_{1,\alpha}$). Further define $P_{0,\alpha}(a,b) = G_\alpha(a,b)$, so that $P_{\gamma,\alpha}$ defines the weighted power means for $\gamma\in\mathbb{R}$ and $\alpha\in(0,1)$ in the remainder. Since $P_{\gamma,\alpha}(a,b) \leq P_{\gamma',\alpha}(a,b)$ when $\gamma\leq\gamma'$ for any $a,b>0$, we have
$Z^{P_\gamma}_\alpha(p_1,p_2) = \int P_{\gamma,\alpha}(p_1(x),p_2(x))\,\mathrm{d}\mu \;\leq\; Z^{P_{\gamma'}}_\alpha(p_1,p_2) = \int P_{\gamma',\alpha}(p_1(x),p_2(x))\,\mathrm{d}\mu.$
Let $P_\gamma(a,b) = P_{\gamma,\frac{1}{2}}(a,b)$. We have $\lim_{\gamma\to-\infty} P_\gamma(a,b) = \min(a,b)$ and $\lim_{\gamma\to+\infty} P_\gamma(a,b) = \max(a,b)$. Thus, we can define both the (extended) min-JSD and the (extended) max-JSD. Using the fact that $\min(a,b) = \frac{a+b}{2} - \frac{1}{2}|a-b|$ and $\max(a,b) = \frac{a+b}{2} + \frac{1}{2}|a-b|$, we obtain the extremal mixture normalization terms as follows:
$Z_{\min}(p_1,p_2) = \int\min(p_1,p_2)\,\mathrm{d}\mu = 1 - \mathrm{TV}(p_1,p_2),$
$Z_{\max}(p_1,p_2) = \int\max(p_1,p_2)\,\mathrm{d}\mu = 1 + \mathrm{TV}(p_1,p_2),$
where $\mathrm{TV}(p_1,p_2) = \frac{1}{2}\int|p_1-p_2|\,\mathrm{d}\mu$ is the total variation distance.
Proposition 7 (max-JSD).
The following upper bound holds for max-JSD:
$0 \leq \mathrm{JS}^+_{\widetilde{\max}}(p_1,p_2) \leq \mathrm{TV}(p_1,p_2).$
Furthermore, the following identity relates the two types of max-JSDs:
$\mathrm{JS}^+_{\widetilde{\max}}(p_1,p_2) = \mathrm{JS}_{\max}(p_1,p_2) + \mathrm{TV}(p_1,p_2) - \log\left(1 + \mathrm{TV}(p_1,p_2)\right).$
Proof. 
We have
$\mathrm{JS}^+_{\widetilde{\max}}(p_1,p_2) := \frac{1}{2}\int\left(p_1\log\frac{p_1}{\max(p_1,p_2)} + p_2\log\frac{p_2}{\max(p_1,p_2)} + 2\max(p_1,p_2) - (p_1+p_2)\right)\mathrm{d}\mu.$
Since both $\log\frac{p_1}{\max(p_1,p_2)}\leq 0$ and $\log\frac{p_2}{\max(p_1,p_2)}\leq 0$, and $\max(a,b) = \frac{a+b}{2} + \frac{1}{2}|b-a|$, we have
$\mathrm{JS}^+_{\widetilde{\max}}(p_1,p_2) \leq \int\left(\frac{p_1+p_2}{2} + \frac{1}{2}|p_2-p_1| - \frac{p_1+p_2}{2}\right)\mathrm{d}\mu.$
That is, $\mathrm{JS}^+_{\widetilde{\max}}(p_1,p_2) \leq \mathrm{TV}(p_1,p_2)$.
We characterize the gap as follows:
$\Delta_{\max}(p_1,p_2) = Z_{\max}(p_1,p_2) - \log Z_{\max}(p_1,p_2) - 1 = \mathrm{TV}(p_1,p_2) - \log(1+\mathrm{TV}(p_1,p_2)) \geq 0,$
since $0\leq\mathrm{TV}\leq 1$. Thus, $\mathrm{JS}^+_{\widetilde{\max}}(p_1,p_2) \geq \mathrm{JS}_{\max}(p_1,p_2)$. □
Proposition 8 (min-JSD).
We have the following lower bound on the extended min-JSD:
$\mathrm{JS}^+_{\widetilde{\min}}(p_1,p_2) \geq \frac{1}{4}J(p_1,p_2) - \mathrm{TV}(p_1,p_2),$
where $J(p_1,p_2) := \mathrm{KL}(p_1,p_2) + \mathrm{KL}(p_2,p_1) = \int(p_1-p_2)\log\frac{p_1}{p_2}\,\mathrm{d}\mu$ is the Jeffreys divergence [2], and
$\mathrm{JS}^+_{\widetilde{\min}}(p_1,p_2) = \mathrm{JS}_{\min}(p_1,p_2) - \mathrm{TV}(p_1,p_2) - \log(1-\mathrm{TV}(p_1,p_2)).$
Proof. 
We have $Z_{\min}(p_1,p_2) = \int\min\{p_1,p_2\}\,\mathrm{d}\mu = 1 - \mathrm{TV}(p_1,p_2) \leq 1$ and
$\Delta_{\min}(p_1,p_2) = Z_{\min}(p_1,p_2) - \log Z_{\min}(p_1,p_2) - 1 = -\mathrm{TV}(p_1,p_2) - \log(1-\mathrm{TV}(p_1,p_2)) \geq 0,$
since $-x - \log(1-x) \geq 0$ for $0\leq x < 1$. Note that the gap can be arbitrarily large when $\mathrm{TV}(p_1,p_2)\to 1$.
Thus, we have $\mathrm{JS}^+_{\widetilde{\min}}(p_1,p_2) \geq \mathrm{JS}_{\min}(p_1,p_2)$.
To get the lower bound, we use the fact that $\min(p_1,p_2) \leq \sqrt{p_1p_2}$. Indeed, we have
$\mathrm{JS}^+_{\widetilde{\min}}(p_1,p_2) = \frac{1}{2}\int\left(p_1\log\frac{p_1}{\min(p_1,p_2)} + p_2\log\frac{p_2}{\min(p_1,p_2)} + 2\min(p_1,p_2) - (p_1+p_2)\right)\mathrm{d}\mu \geq \frac{1}{2}\int\left(\frac{1}{2}p_1\log\frac{p_1}{p_2} + \frac{1}{2}p_2\log\frac{p_2}{p_1} + 2\min(p_1,p_2) - (p_1+p_2)\right)\mathrm{d}\mu = \frac{1}{4}J(p_1,p_2) - \mathrm{TV}(p_1,p_2).$ □
Remark 9. 
Let us report the total variation distance between two univariate Gaussian distributions $p_{\mu_1,\sigma_1}$ and $p_{\mu_2,\sigma_2}$ in closed form using the error function [42] $\mathrm{erf}(x) = \frac{1}{\sqrt{\pi}}\int_{-x}^{x} e^{-t^2}\,\mathrm{d}t$.
  • When σ 1 = σ 2 = σ , we have
    $\mathrm{TV}(p_1,p_2) = \left|\Phi(x^*;\mu_2,\sigma) - \Phi(x^*;\mu_1,\sigma)\right|,$
    where $\Phi(x;\mu,\sigma) = \frac{1}{2}\left(1 + \mathrm{erf}\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right)$ is the cumulative distribution function, and
    $x^* = \frac{\mu_1^2 - \mu_2^2}{2(\mu_1-\mu_2)}.$
  • When $\sigma_1\neq\sigma_2$, we let $x_1 = \frac{-b-\sqrt{\Delta}}{2a}$ and $x_2 = \frac{-b+\sqrt{\Delta}}{2a}$, where $\Delta = b^2 - 4ac \geq 0$ and
    $a = \frac{1}{\sigma_1^2} - \frac{1}{\sigma_2^2},$
    $b = 2\left(\frac{\mu_2}{\sigma_2^2} - \frac{\mu_1}{\sigma_1^2}\right),$
    $c = \frac{\mu_1^2}{\sigma_1^2} - \frac{\mu_2^2}{\sigma_2^2} - 2\log\frac{\sigma_2}{\sigma_1}.$
    The total variation is given by
    $\mathrm{TV}(p_1,p_2) = \frac{1}{2}\left(\left|\mathrm{erf}\left(\frac{x_1-\mu_1}{\sigma_1\sqrt{2}}\right) - \mathrm{erf}\left(\frac{x_1-\mu_2}{\sigma_2\sqrt{2}}\right)\right| + \left|\mathrm{erf}\left(\frac{x_2-\mu_1}{\sigma_1\sqrt{2}}\right) - \mathrm{erf}\left(\frac{x_2-\mu_2}{\sigma_2\sqrt{2}}\right)\right|\right).$
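A possible implementation of the closed form of Remark 9 is sketched below (assuming NumPy and SciPy; the helper name tv_gaussians is ours). It uses the equivalent CDF form $\Phi = \frac{1}{2}(1+\mathrm{erf}(\cdot))$ of the erf expressions above and cross-checks the result by numerical integration.

```python
# Sketch of Remark 9: total variation between two univariate Gaussians.
import numpy as np
from scipy.special import erf
from scipy.stats import norm

def tv_gaussians(mu1, s1, mu2, s2):
    """Total variation between N(mu1, s1^2) and N(mu2, s2^2)."""
    Phi = lambda x, mu, s: 0.5 * (1.0 + erf((x - mu) / (s * np.sqrt(2.0))))
    if np.isclose(s1, s2):
        xs = 0.5 * (mu1 + mu2)                       # unique crossing point x*
        return abs(Phi(xs, mu2, s1) - Phi(xs, mu1, s1))
    # two crossing points: roots of a x^2 + b x + c = 0
    a = 1.0 / s1**2 - 1.0 / s2**2
    b = 2.0 * (mu2 / s2**2 - mu1 / s1**2)
    c = mu1**2 / s1**2 - mu2**2 / s2**2 - 2.0 * np.log(s2 / s1)
    d = np.sqrt(b**2 - 4.0 * a * c)
    x1, x2 = (-b - d) / (2.0 * a), (-b + d) / (2.0 * a)
    return (abs(Phi(x1, mu1, s1) - Phi(x1, mu2, s2))
            + abs(Phi(x2, mu1, s1) - Phi(x2, mu2, s2)))

# cross-check against a direct Riemann sum of (1/2) * int |p1 - p2|
xs = np.linspace(-30.0, 30.0, 600001)
num = 0.5 * np.sum(np.abs(norm.pdf(xs, 0.0, 1.0) - norm.pdf(xs, 1.0, 2.0))) * (xs[1] - xs[0])
print(tv_gaussians(0.0, 1.0, 1.0, 2.0), num)
```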
Next, we shall consider the important case of p 1 and p 2 belonging to the family of multivariate normal distributions, commonly called Gaussian distributions.

3. Geometric JSDs Between Gaussian Distributions

3.1. Exponential Families

The formula for the G-JSD between two Gaussian distributions was reported in [17] using the more general framework of exponential families. An exponential family [43] is a family of probability measures { P λ } with Radon–Nikodym densities p λ with respect to μ expressed canonically as
$p_\lambda(x) := \exp\left(\langle\theta(\lambda), t(x)\rangle - F(\theta) + k(x)\right) = \frac{1}{Z(\theta)}\exp\left(\langle\theta(\lambda), t(x)\rangle + k(x)\right),$
where $\theta(\lambda)$ is the natural parameter, $t(x)$ the sufficient statistic, $k(x)$ an auxiliary carrier term with respect to $\mu$, and $F(\theta)$ the cumulant function. The partition function $Z(\theta)$ is the normalizing denominator: $Z(\theta) = \exp(F(\theta))$. The cumulant function $F(\theta) = \log Z(\theta)$ is strictly convex and analytic [43], and the partition function $Z(\theta) = \exp(F(\theta))$ is strictly log-convex (and hence also strictly convex).
We consider the exponential family of multivariate Gaussian distributions
$\mathcal{N} = \{N(\mu,\Sigma) : \mu\in\mathbb{R}^d,\ \Sigma\in\mathrm{PD}(d)\},$
where PD ( d ) denotes the set of symmetric positive–definite matrices of size d × d . Let λ : = ( λ v , λ M ) = ( μ , Σ ) denote the compound (vector, matrix) parameter of a Gaussian. The d-variate Gaussian density is given by
$p_\lambda(x;\lambda) := \frac{1}{(2\pi)^{\frac{d}{2}}\sqrt{|\lambda_M|}}\exp\left(-\frac{1}{2}(x-\lambda_v)^\top\lambda_M^{-1}(x-\lambda_v)\right),$
where | · | denotes the matrix determinant. The natural parameters θ are expressed using both a vector parameter θ v and a matrix parameter θ M in a compound parameter θ = ( θ v , θ M ) . By defining the following compound inner product on a compound (vector, matrix) parameter
$\langle\theta,\theta'\rangle := \theta_v^\top\theta'_v + \mathrm{tr}\left(\theta_M^\top\theta'_M\right),$
where tr ( · ) denotes the matrix trace, we rewrite the Gaussian density of Equation (29) in the canonical form of an exponential family:
$p_\theta(x;\theta) := \exp\left(\langle t(x),\theta\rangle - F_\theta(\theta)\right) = p_\lambda(x),$
where θ = θ ( λ ) with
$\theta = (\theta_v,\theta_M) = \left(\Sigma^{-1}\mu,\, -\frac{1}{2}\Sigma^{-1}\right) = \theta(\lambda) = \left(\lambda_M^{-1}\lambda_v,\, -\frac{1}{2}\lambda_M^{-1}\right),$
is the compound vector-matrix natural parameter and
$t(x) = \left(x,\, xx^\top\right),$
is the compound vector-matrix sufficient statistic. There is no auxiliary carrier term (i.e., k ( x ) = 0 ). The function F θ is given by
$F_\theta(\theta) := \frac{1}{2}\left(d\log\pi - \log|{-\theta_M}| + \frac{1}{2}\theta_v^\top(-\theta_M)^{-1}\theta_v\right).$
Remark 10. 
Beware that when the cumulant function is expressed using the ordinary parameter λ = ( μ , Σ ) , the cumulant function F θ ( θ ( λ ) ) is no longer convex:
$F_\lambda(\lambda) = \frac{1}{2}\left(\lambda_v^\top\lambda_M^{-1}\lambda_v + \log|\lambda_M| + d\log 2\pi\right)$
$= \frac{1}{2}\left(\mu^\top\Sigma^{-1}\mu + \log|\Sigma| + d\log 2\pi\right).$
We convert between the ordinary parameterization λ = ( μ , Σ ) and the natural parameterization θ using these formulas:
$\theta = (\theta_v,\theta_M):\quad \theta_v(\lambda) = \lambda_M^{-1}\lambda_v = \Sigma^{-1}\mu,\quad \theta_M(\lambda) = -\frac{1}{2}\lambda_M^{-1} = -\frac{1}{2}\Sigma^{-1};\qquad \lambda = (\lambda_v,\lambda_M):\quad \lambda_v(\theta) = -\frac{1}{2}\theta_M^{-1}\theta_v = \mu,\quad \lambda_M(\theta) = -\frac{1}{2}\theta_M^{-1} = \Sigma.$
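The conversion formulas above translate directly into code; the following sketch (illustrative, assuming NumPy; function names are ours) round-trips between the ordinary and natural parameterizations of a multivariate Gaussian.

```python
# Sketch: ordinary (mu, Sigma) <-> natural (theta_v, theta_M) Gaussian parameters.
import numpy as np

def to_natural(mu, Sigma):
    """(mu, Sigma) -> (theta_v, theta_M) = (Sigma^{-1} mu, -Sigma^{-1}/2)."""
    P = np.linalg.inv(Sigma)
    return P @ mu, -0.5 * P

def to_ordinary(theta_v, theta_M):
    """(theta_v, theta_M) -> (mu, Sigma) = (-theta_M^{-1} theta_v / 2, -theta_M^{-1}/2)."""
    Sigma = -0.5 * np.linalg.inv(theta_M)
    return Sigma @ theta_v, Sigma

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
th_v, th_M = to_natural(mu, Sigma)
mu_back, Sigma_back = to_ordinary(th_v, th_M)
print(np.allclose(mu, mu_back), np.allclose(Sigma, Sigma_back))   # True True
```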
The normalized geometric mixture $p_{\theta_1}^\alpha\, p_{\theta_2}^{1-\alpha}$ of two densities of an exponential family is a density $p_{\alpha\theta_1+(1-\alpha)\theta_2}$ of the exponential family with normalizer $Z_\alpha(\theta_1,\theta_2) = \exp(-J_{F,\alpha}(\theta_1,\theta_2))$, where $J_{F,\alpha}(\theta_1,\theta_2)$ denotes the skew Jensen divergence [44,45]:
$J_{F,\alpha}(\theta_1,\theta_2) := \alpha F(\theta_1) + (1-\alpha)F(\theta_2) - F(\alpha\theta_1 + (1-\alpha)\theta_2).$
Therefore, the difference gap of Equation (17) between the G-JSD and the extended G-JSD between exponential family densities is given by
$\Delta(\theta_1,\theta_2) = \exp(-J_{F,\alpha}(\theta_1,\theta_2)) + J_{F,\alpha}(\theta_1,\theta_2) - 1$
$= Z_\alpha(\theta_1,\theta_2) - \log Z_\alpha(\theta_1,\theta_2) - 1$
$= Z_\alpha(\theta_1,\theta_2) + \alpha F(\theta_1) + (1-\alpha)F(\theta_2) - F(\alpha\theta_1+(1-\alpha)\theta_2) - 1.$
Since $Z_\alpha = \exp(-J_{F,\alpha}(\theta_1,\theta_2)) \leq 1$, the gap $\Delta$ is non-negative, and we have
$\mathrm{JS}^+_{\tilde{G}_\alpha,\beta}(p_{\mu_1,\Sigma_1}, p_{\mu_2,\Sigma_2}) \geq \mathrm{JS}_{G_\alpha,\beta}(p_{\mu_1,\Sigma_1}, p_{\mu_2,\Sigma_2}).$
Corollary 2. 
When $p_1 = p_{\theta_1}$ and $p_2 = p_{\theta_2}$ belong to the same exponential family with cumulant function $F(\theta)$, we have
$\mathrm{JS}_G(p_{\theta_1},p_{\theta_2}) = \frac{1}{4}(\theta_2-\theta_1)^\top\left(\nabla F(\theta_2) - \nabla F(\theta_1)\right) - \left(\frac{F(\theta_1)+F(\theta_2)}{2} - F\!\left(\frac{\theta_1+\theta_2}{2}\right)\right),$
since $J(p_{\theta_1},p_{\theta_2}) = \langle\theta_2-\theta_1,\, \nabla F(\theta_2)-\nabla F(\theta_1)\rangle$ amounts to a symmetrized Bregman divergence.
Proof. 
We have $J(p_{\theta_1},p_{\theta_2}) = (\theta_2-\theta_1)^\top(\nabla F(\theta_2)-\nabla F(\theta_1))$ and $B(p_{\theta_1},p_{\theta_2}) = J_F(\theta_1,\theta_2)$. □
The extended geometric Jensen–Shannon divergence and geometric Jensen–Shannon divergence between two densities of an exponential family are given by
$\mathrm{JS}_G(p_{\theta_1},p_{\theta_2}) = \frac{1}{4}(\theta_2-\theta_1)^\top(\nabla F(\theta_2)-\nabla F(\theta_1)) - \left(\frac{F(\theta_1)+F(\theta_2)}{2} - F\!\left(\frac{\theta_1+\theta_2}{2}\right)\right),$
$\mathrm{JS}^+_{\tilde{G}}(p_{\theta_1},p_{\theta_2}) = \frac{1}{4}\langle\theta_2-\theta_1,\, \nabla F(\theta_2)-\nabla F(\theta_1)\rangle + \exp(-J_F(\theta_1,\theta_2)) - 1,$
$\mathrm{JS}^*_G(p_{\theta_1},p_{\theta_2}) = J_F(\theta_1,\theta_2).$
Remark 11. 
Given two densities $p_1$ and $p_2$, the family $\mathcal{G}$ of geometric mixtures $\{(p_1p_2)^{G_\alpha}\propto p_1^\alpha p_2^{1-\alpha} : \alpha\in(0,1)\}$ forms a 1D exponential family that has been termed the likelihood ratio exponential family [46] (LREF). The cumulant function of this LREF is $F(\alpha) = B_\alpha(p_1,p_2)$. Hence, $\mathcal{G}$ has also been called a Bhattacharyya arc or Hellinger arc in the literature [47]. However, notice that $\mathrm{KL}(p_i : (p_1p_2)^{G_\alpha})$ does not necessarily amount to a Bregman divergence, because neither $p_1$ nor $p_2$ belongs to $\mathcal{G}$.

3.2. Closed-Form Formula for Gaussian Distributions

Let us report the corresponding closed-form formula for d-variate Gaussian distributions.
When $\alpha = \frac{1}{2}$, we proved that $\mathrm{JS}_G(p_1,p_2) = \frac{1}{4}J(p_1,p_2) - B(p_1,p_2)$ and $\mathrm{JS}^+_{\tilde{G}}(p_1,p_2) = \frac{1}{4}J(p_1,p_2) + \exp(-B(p_1,p_2)) - 1$, where $\mathrm{BC}(p_1,p_2) = \exp(-B(p_1,p_2))$. Thus, for the case of balanced geometric mixtures, we only need the closed forms of the Jeffreys and Bhattacharyya distances:
$J(p_{\mu_1,\Sigma_1}, p_{\mu_2,\Sigma_2}) = \frac{1}{2}\left(\mathrm{tr}\left(\Sigma_1\Sigma_2^{-1} + \Sigma_2\Sigma_1^{-1}\right) + (\mu_1-\mu_2)^\top(\Sigma_1^{-1}+\Sigma_2^{-1})(\mu_1-\mu_2) - 2d\right),$
$B(p_{\mu_1,\Sigma_1}, p_{\mu_2,\Sigma_2}) = \frac{1}{8}(\mu_1-\mu_2)^\top\bar{\Sigma}^{-1}(\mu_1-\mu_2) + \frac{1}{2}\log\frac{\det\bar{\Sigma}}{\sqrt{\det\Sigma_1\,\det\Sigma_2}},$
where $\bar{\Sigma} = \frac{1}{2}(\Sigma_1+\Sigma_2)$.
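These closed forms give, for the balanced case $\alpha=\frac{1}{2}$, a direct implementation of both G-JSDs between multivariate Gaussians; the following sketch (illustrative, assuming NumPy; function names are ours) evaluates $\mathrm{JS}_G = \frac{1}{4}J - B$ and $\mathrm{JS}^+_{\tilde{G}} = \frac{1}{4}J + \exp(-B) - 1$.

```python
# Sketch: closed-form balanced G-JSD and extended G-JSD between multivariate Gaussians.
import numpy as np

def jeffreys_gauss(mu1, S1, mu2, S2):
    d = len(mu1)
    S1i, S2i = np.linalg.inv(S1), np.linalg.inv(S2)
    dm = mu1 - mu2
    return 0.5 * (np.trace(S1 @ S2i + S2 @ S1i) + dm @ (S1i + S2i) @ dm - 2 * d)

def bhattacharyya_gauss(mu1, S1, mu2, S2):
    Sb = 0.5 * (S1 + S2)
    dm = mu1 - mu2
    return (0.125 * dm @ np.linalg.solve(Sb, dm)
            + 0.5 * np.log(np.linalg.det(Sb)
                           / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2))))

def g_jsd_gauss(mu1, S1, mu2, S2, extended=False):
    J = jeffreys_gauss(mu1, S1, mu2, S2)
    B = bhattacharyya_gauss(mu1, S1, mu2, S2)
    return 0.25 * J + (np.exp(-B) - 1.0 if extended else -B)

mu1, S1 = np.zeros(2), np.eye(2)
mu2, S2 = np.array([1.0, 0.0]), np.diag([2.0, 0.5])
print(g_jsd_gauss(mu1, S1, mu2, S2), g_jsd_gauss(mu1, S1, mu2, S2, extended=True))
```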
Otherwise, for the arbitrary weighted geometric mixture G α , define ( θ 1 θ 2 ) α = α θ 1 + ( 1 α ) θ 2 , the weighted linear interpolation of the natural parameters θ 1 and θ 2 .
Corollary 3. 
The skew G-Jensen–Shannon divergence $\mathrm{JS}_{G_\alpha}$ and the dual skew G-Jensen–Shannon divergence $\mathrm{JS}^*_{G_\alpha}$ between two $d$-variate Gaussian distributions $N(\mu_1,\Sigma_1)$ and $N(\mu_2,\Sigma_2)$ are
$\mathrm{JS}_{G_\alpha}(p_{(\mu_1,\Sigma_1)}, p_{(\mu_2,\Sigma_2)}) = \alpha\,\mathrm{KL}(p_{(\mu_1,\Sigma_1)}, p_{(\mu_\alpha,\Sigma_\alpha)}) + (1-\alpha)\,\mathrm{KL}(p_{(\mu_2,\Sigma_2)}, p_{(\mu_\alpha,\Sigma_\alpha)}) = \alpha\,B_F((\theta_1\theta_2)_\alpha, \theta_1) + (1-\alpha)\,B_F((\theta_1\theta_2)_\alpha, \theta_2) = \frac{1}{2}\left(\mathrm{tr}\left(\Sigma_\alpha^{-1}(\alpha\Sigma_1 + (1-\alpha)\Sigma_2)\right) + \log\frac{|\Sigma_\alpha|}{|\Sigma_1|^\alpha|\Sigma_2|^{1-\alpha}} + \alpha(\mu_\alpha-\mu_1)^\top\Sigma_\alpha^{-1}(\mu_\alpha-\mu_1) + (1-\alpha)(\mu_\alpha-\mu_2)^\top\Sigma_\alpha^{-1}(\mu_\alpha-\mu_2) - d\right),$
$\mathrm{JS}^*_{G_\alpha}(p_{(\mu_1,\Sigma_1)}, p_{(\mu_2,\Sigma_2)}) = \alpha\,\mathrm{KL}(p_{(\mu_\alpha,\Sigma_\alpha)}, p_{(\mu_1,\Sigma_1)}) + (1-\alpha)\,\mathrm{KL}(p_{(\mu_\alpha,\Sigma_\alpha)}, p_{(\mu_2,\Sigma_2)}) = \alpha\,B_F(\theta_1, (\theta_1\theta_2)_\alpha) + (1-\alpha)\,B_F(\theta_2, (\theta_1\theta_2)_\alpha) = J_{F,\alpha}(\theta_1,\theta_2) =: B_\alpha(p_{(\mu_1,\Sigma_1)}, p_{(\mu_2,\Sigma_2)}) = \frac{1}{2}\left(\alpha\,\mu_1^\top\Sigma_1^{-1}\mu_1 + (1-\alpha)\,\mu_2^\top\Sigma_2^{-1}\mu_2 - \mu_\alpha^\top\Sigma_\alpha^{-1}\mu_\alpha + \log\frac{|\Sigma_1|^\alpha|\Sigma_2|^{1-\alpha}}{|\Sigma_\alpha|}\right),$
with $F(\mu,\Sigma) = \frac{1}{2}\left(\mu^\top\Sigma^{-1}\mu + \log|\Sigma| + d\log 2\pi\right)$, $F(\theta_v,\theta_M) = \frac{1}{2}\left(d\log\pi - \log|{-\theta_M}| + \frac{1}{2}\theta_v^\top(-\theta_M)^{-1}\theta_v\right)$, and $\Delta(\theta_1,\theta_2) = \exp(-J_{F,\alpha}(\theta_1,\theta_2)) + J_{F,\alpha}(\theta_1,\theta_2) - 1,$
where Σ α is the matrix harmonic barycenter:
$\Sigma_\alpha = \left(\alpha\Sigma_1^{-1} + (1-\alpha)\Sigma_2^{-1}\right)^{-1},$
and
$\mu_\alpha = \Sigma_\alpha\left(\alpha\Sigma_1^{-1}\mu_1 + (1-\alpha)\Sigma_2^{-1}\mu_2\right).$

4. The Extended and Normalized G-JSDs as Regularizations of the Ordinary JSD

The M-Jensen–Shannon divergence JS M ( p , q ) can be interpreted as a regularization of the ordinary JSD:
Proposition 9 (JSD regularization).
For any arbitrary mean M, the following identity holds:
$\mathrm{JS}_M(p_1,p_2) = \mathrm{JS}(p_1,p_2) + \mathrm{KL}\!\left(\frac{p_1+p_2}{2},\, (p_1p_2)^M\right).$
Notice that $(p_1p_2)^A = \frac{p_1+p_2}{2}$.
Proof. 
We have
$\mathrm{JS}_M(p_1,p_2) := \frac{1}{2}\left(\mathrm{KL}(p_1,(p_1p_2)^M) + \mathrm{KL}(p_2,(p_1p_2)^M)\right) = \frac{1}{2}\int\left(p_1\log\frac{p_1}{(p_1p_2)^A}\frac{(p_1p_2)^A}{(p_1p_2)^M} + p_2\log\frac{p_2}{(p_1p_2)^A}\frac{(p_1p_2)^A}{(p_1p_2)^M}\right)\mathrm{d}\mu = \frac{1}{2}\int\left(p_1\log\frac{p_1}{(p_1p_2)^A} + p_2\log\frac{p_2}{(p_1p_2)^A}\right)\mathrm{d}\mu + \frac{1}{2}\int(p_1+p_2)\log\frac{(p_1p_2)^A}{(p_1p_2)^M}\,\mathrm{d}\mu = \mathrm{JS}(p_1,p_2) + \int(p_1p_2)^A\log\frac{(p_1p_2)^A}{(p_1p_2)^M}\,\mathrm{d}\mu = \mathrm{JS}(p_1,p_2) + \mathrm{KL}((p_1p_2)^A,\, (p_1p_2)^M).$ □
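Proposition 9 can be verified numerically; the sketch below (illustrative, assuming NumPy) checks, for the geometric mean, that $\mathrm{JS}_G$ equals the ordinary JSD plus the regularizer $\mathrm{KL}\big(\frac{p_1+p_2}{2},(p_1p_2)^G\big)$.

```python
# Numerical sketch of Proposition 9 (discrete case, M = geometric mean).
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

p1 = np.array([0.55, 0.45])
p2 = np.array([0.002, 0.998])

a = 0.5 * (p1 + p2)                      # arithmetic mixture (p1 p2)^A
g = np.sqrt(p1 * p2); g = g / g.sum()    # normalized geometric mixture (p1 p2)^G

js   = 0.5 * (kl(p1, a) + kl(p2, a))     # ordinary JSD
js_g = 0.5 * (kl(p1, g) + kl(p2, g))     # G-JSD
print(js_g, js + kl(a, g))               # identical values
```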
Remark 12. 
One way to symmetrize the KLD is to consider two distinct symmetric means M 1 ( a , b ) = M 1 ( b , a ) and M 2 ( a , b ) = M 2 ( b , a ) and define
$\mathrm{KL}_{M_1,M_2}(p_1,p_2) = \mathrm{KL}((p_1p_2)^{M_1},\, (p_1p_2)^{M_2}) = \mathrm{KL}_{M_1,M_2}(p_2,p_1).$
We notice that KL A , G is not a metric distance by reporting a triple of points ( p 1 , p 2 , p 3 ) that fails the triangle inequality. Consider p 1 = ( 0.55 , 0.45 ) , p 2 = ( 0.002 , 0.998 ) , and p 3 = ( 0.045 , 0.955 ) . We have KL M 1 , M 2 ( p 1 , p 2 ) = 0.5374165 , KL M 1 , M 2 ( p 1 , p 3 ) = 0.1759400 , and KL M 1 , M 2 ( p 3 , p 2 ) = 0.08485931 . The triangle inequality defect is
$\mathrm{KL}_{M_1,M_2}(p_1,p_2) - \left(\mathrm{KL}_{M_1,M_2}(p_1,p_3) + \mathrm{KL}_{M_1,M_2}(p_3,p_2)\right) = 0.2766171.$
We can also similarly symmetrize the extended KLD as follows:
$\mathrm{KL}^+_{\tilde{M}_1,\tilde{M}_2}(q_1,q_2) = \mathrm{KL}^+((q_1q_2)^{\tilde{M}_1},\, (q_1q_2)^{\tilde{M}_2}) = \mathrm{KL}^+_{\tilde{M}_1,\tilde{M}_2}(q_2,q_1).$
In particular, when $M_1 = A$ and $M_2 = G$, we get the $\mathrm{KL}_{A,G}$ divergence:
$\mathrm{KL}_{A,G}(p_1,p_2) = \int\frac{p_1+p_2}{2}\log\frac{p_1+p_2}{2\sqrt{p_1p_2}}\,\mathrm{d}\mu + \log Z_G(p_1,p_2),$
which is related to Taneja T-divergence [48]:
$T(p_1,p_2) = \int\frac{p_1+p_2}{2}\log\frac{p_1+p_2}{2\sqrt{p_1p_2}}\,\mathrm{d}\mu.$
The T-divergence is an f-divergence [11,12] obtained for the generator $f_T(u) = \frac{1+u}{2}\log\frac{1+u}{2\sqrt{u}}$.
Corollary 4 (JSD lower bound on M-JSD).
We have $\mathrm{JS}_M(p,q) \geq \mathrm{JS}(p,q)$.
Proof. 
Since $\mathrm{JS}_M(p,q) = \mathrm{JS}(p,q) + \mathrm{KL}\!\left(\frac{p+q}{2},(pq)^M\right)$ and $\mathrm{KL}\geq 0$ by Gibbs' inequality, we have $\mathrm{JS}_M(p,q)\geq\mathrm{JS}(p,q)$. □
Since the extended M-JSD satisfies $\mathrm{JS}^+_{\tilde{M}_\alpha,\beta}(p_1,p_2) = \mathrm{JS}_{M_\alpha,\beta}(p_1,p_2) + Z_\alpha - \log(Z_\alpha) - 1$, the extended M-JSD $\mathrm{JS}^+_{\tilde{M}_\alpha,\beta}$ can also be interpreted as another regularization of the Jensen–Shannon divergence when dealing with probability densities:
$\mathrm{JS}^+_{\tilde{M}_\alpha,\beta}(p_1,p_2) = \mathrm{JS}(p_1,p_2) + \mathrm{KL}\!\left(\frac{p_1+p_2}{2},\, (p_1p_2)^M\right) + Z_{M_\alpha}(p_1,p_2) - \log(Z_{M_\alpha}(p_1,p_2)) - 1.$
It is well known that the JSD can be rewritten as a diversity index [4] using concave entropy:
$\mathrm{JS}(p_1,p_2) = H\!\left(\frac{p_1+p_2}{2}\right) - \frac{H(p_1)+H(p_2)}{2}.$
We generalize this decomposition as the difference of a cross-entropy term minus entropies, as follows:
Proposition 10 (M-JSD cross-entropy decomposition).
We have
$\mathrm{JS}_M(p_1,p_2) = H^\times\!\left((p_1p_2)^A,\, (p_1p_2)^M\right) - \frac{H(p_1)+H(p_2)}{2}.$
Proof. 
From Proposition 9, we have
$\mathrm{JS}_M(p_1,p_2) = \mathrm{JS}(p_1,p_2) + \mathrm{KL}\!\left(\frac{p_1+p_2}{2},\, (p_1p_2)^M\right).$
Furthermore, $\mathrm{KL}(p_1,p_2) = H^\times(p_1,p_2) - H(p_1)$, where $H^\times(p_1,p_2) = -\int p_1\log p_2\,\mathrm{d}\mu$ is the cross-entropy and $H(p) = -\int p\log p\,\mathrm{d}\mu = H^\times(p,p)$ is the entropy. Plugging Equation (46) into Equation (43), we get
$\mathrm{JS}_M(p_1,p_2) = H\!\left(\frac{p_1+p_2}{2}\right) - \frac{H(p_1)+H(p_2)}{2} + H^\times\!\left(\frac{p_1+p_2}{2},\, (p_1p_2)^M\right) - H\!\left(\frac{p_1+p_2}{2}\right) = H^\times\!\left(\frac{p_1+p_2}{2},\, (p_1p_2)^M\right) - \frac{H(p_1)+H(p_2)}{2}.$
Note that when $M = A$, the arithmetic mean, we have $H^\times\!\left(\frac{p_1+p_2}{2},(p_1p_2)^M\right) = H\!\left(\frac{p_1+p_2}{2}\right)$ and we recover the fact that $\mathrm{JS}_M(p_1,p_2) = \mathrm{JS}(p_1,p_2)$. □

5. Estimation and Approximation of the Extended and Normalized M-JSDs

Let us recall the two definitions of the extended M-JSD and the normalized M-JSD (for the case of α = β = 1 2 ) between two normalized densities p 1 and p 2 :
$\mathrm{JS}_M(p_1,p_2) = \frac{1}{2}\left(\mathrm{KL}(p_1,(p_1p_2)^M) + \mathrm{KL}(p_2,(p_1p_2)^M)\right), \qquad \mathrm{JS}^+_M(p_1,p_2) = \frac{1}{2}\left(\mathrm{KL}^+(p_1,(p_1p_2)^{\tilde{M}}) + \mathrm{KL}^+(p_2,(p_1p_2)^{\tilde{M}})\right),$
where $(p_1p_2)^M(x) = \frac{M(p_1(x),p_2(x))}{Z_M(p_1,p_2)}$ (with $Z_M(p_1,p_2) = \int M(p_1(x),p_2(x))\,\mathrm{d}\mu(x)$) and $(p_1p_2)^{\tilde{M}}(x) = M(p_1(x),p_2(x))$.
In practice, one needs to estimate the extended and normalized G-JSDs when they do not admit a closed-form formula.

5.1. Monte Carlo Estimators

To estimate JS M ( p 1 , p 2 ) , we can use Monte Carlo samplings to estimate both KLD integrals and mixture normalizers Z M ; for example, the normalizer Z M ( p 1 , p 2 ) is estimated by
$\hat{Z}_M(p_1,p_2) = \frac{1}{s}\sum_{i=1}^{s}\frac{1}{r(x_i)}\,M(p_1(x_i),p_2(x_i)),$
where $r(x)$ is the proposal distribution, which can be chosen according to the mean $M$ and the types of probability distributions $p_1$ and $p_2$, and $x_1,\ldots,x_s$ are $s$ samples drawn independently and identically (i.i.d.) from $r(x)$. However, since $(p_1p_2)^M(x)$ is now estimated as $(p_1p_2)^{\hat{M}}(x)$, it is no longer a normalized $M$-mixture, and thus we consider estimating
$\mathrm{JS}^+_{\hat{M}}(p_1,p_2) = \frac{1}{2}\left(\mathrm{KL}^+(p_1,(p_1p_2)^{\hat{M}}) + \mathrm{KL}^+(p_2,(p_1p_2)^{\hat{M}})\right)$
to ensure the non-negativity of the divergence $\mathrm{JS}_{\hat{M}}$.
Let us consider the estimation of the term
$\mathrm{KL}^+\!\left(p_1,(p_1p_2)^{\tilde{M}}\right) = \int\left(p_1\log\frac{p_1}{M(p_1,p_2)} + M(p_1,p_2) - p_1\right)\mathrm{d}\mu.$
By choosing the proposal distribution $r(x) = p_1(x)$, we have $\mathrm{KL}^+(p_1,(p_1p_2)^{\tilde{M}}) \approx \widehat{\mathrm{KL}}{}^+(p_1,(p_1p_2)^{\tilde{M}})$ (for large enough $s$), where
$\widehat{\mathrm{KL}}{}^+\!\left(p_1,(p_1p_2)^{\tilde{M}}\right) = \frac{1}{s}\sum_{i=1}^{s}\left(\log\frac{p_1(x_i)}{M(p_1(x_i),p_2(x_i))} + \frac{M(p_1(x_i),p_2(x_i))}{p_1(x_i)} - 1\right).$
Monte Carlo (MC) stochastic integration [49] is a well-studied topic in statistics, with many results available regarding the consistency and variance of MC estimators.
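The following sketch (illustrative, assuming NumPy and SciPy) implements the Monte Carlo estimator above with proposal $r = p_1$ for two univariate Gaussians, and compares it against the exact value $\frac{1}{2}\mathrm{KL}(p_1,p_2) + \mathrm{BC}(p_1,p_2) - 1$ obtained from the standard univariate Gaussian KLD and Bhattacharyya formulas (these closed forms are standard facts, not taken from this paper).

```python
# Sketch: Monte Carlo estimate of KL+(p1, (p1 p2)^G~) with proposal r = p1.
import numpy as np
from scipy.stats import norm

mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0
p1, p2 = norm(mu1, s1), norm(mu2, s2)

rng = np.random.default_rng(0)
s = 100_000
x = p1.rvs(size=s, random_state=rng)       # samples from the proposal r = p1
f1, f2 = p1.pdf(x), p2.pdf(x)
mix = np.sqrt(f1 * f2)                     # unnormalized geometric mixture (G~)
kl_plus_hat = np.mean(np.log(f1 / mix) + mix / f1 - 1.0)

# exact value: KL+(p1, sqrt(p1 p2)) = KL(p1, p2)/2 + BC(p1, p2) - 1
kl12 = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5
B = 0.25 * (mu1 - mu2)**2 / (s1**2 + s2**2) + 0.5 * np.log((s1**2 + s2**2) / (2 * s1 * s2))
print(kl_plus_hat, 0.5 * kl12 + np.exp(-B) - 1.0)   # estimator vs exact value
```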
Note that even if we have a generic formula for the G-JSD between two densities of an exponential family given by Corollary 2, the cumulant function F ( θ ) may not be in closed form [50,51]. This is the case when the sufficient statistic vector of the exponential family is t ( x ) = ( x , x 2 , , x m ) (for m 5 ), yielding the polynomial exponential family (also called exponential–polynomial family [51]).

5.2. Approximations via γ -Divergences

One way to circumvent the lack of computational tractable density normalizers is to consider the family of γ -divergences [39] instead of the KLD:
$\tilde{D}_\gamma(q_1,q_2) = \frac{1}{\gamma(1+\gamma)}\log I_\gamma(q_1,q_1) - \frac{1}{\gamma}\log I_\gamma(q_1,q_2) + \frac{1}{1+\gamma}\log I_\gamma(q_2,q_2), \quad \gamma > 0,$
where
$I_\gamma(q_1,q_2) = \int q_1(x)\, q_2^\gamma(x)\,\mathrm{d}\mu(x).$
The γ -divergences are projective divergences, i.e., they enjoy the property that
$\tilde{D}_\gamma(\lambda_1 q_1, \lambda_2 q_2) = \tilde{D}_\gamma(q_1,q_2), \quad \forall\,\lambda_1>0,\ \lambda_2>0.$
Furthermore, we have $\lim_{\gamma\to 0}\tilde{D}_\gamma(p_1,p_2) = \mathrm{KL}(p_1,p_2)$. (Note that the KLD is not projective.)
Let us define the projective M-JSD:
$\mathrm{JS}_{\tilde{M},\gamma}(p_1,p_2) = \frac{1}{2}\left(\tilde{D}_\gamma(p_1,(p_1p_2)^{\tilde{M}}) + \tilde{D}_\gamma(p_2,(p_1p_2)^{\tilde{M}})\right).$
We have, for $\gamma = \epsilon$ a small enough value (e.g., $\epsilon \approx 10^{-3}$), $\mathrm{JS}_M(p_1,p_2) \approx \mathrm{JS}_{\tilde{M},\gamma}(p_1,p_2)$, since
$\mathrm{KL}(p_1,(p_1p_2)^M) \underset{\gamma=\epsilon}{\approx} \tilde{D}_\gamma(p_1,(p_1p_2)^{\tilde{M}}).$
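A discrete illustration of this approximation is sketched below (assuming NumPy; function names are ours): for a small $\gamma$, the projective $\gamma$-divergence evaluated against the unnormalized geometric mixture approaches the normalized G-JSD.

```python
# Sketch: projective gamma-divergence approximation of the G-JSD (no mixture normalization).
import numpy as np

def I_gamma(q1, q2, g):
    return np.sum(q1 * q2**g)

def D_gamma(q1, q2, g):
    return (np.log(I_gamma(q1, q1, g)) / (g * (1 + g))
            - np.log(I_gamma(q1, q2, g)) / g
            + np.log(I_gamma(q2, q2, g)) / (1 + g))

def js_proj_gamma(p1, p2, g=1e-3):
    mix = np.sqrt(p1 * p2)                     # unnormalized geometric mixture
    return 0.5 * (D_gamma(p1, mix, g) + D_gamma(p2, mix, g))

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

p1 = np.array([0.55, 0.45])
p2 = np.array([0.002, 0.998])
m = np.sqrt(p1 * p2); m = m / m.sum()
js_g_exact = 0.5 * (kl(p1, m) + kl(p2, m))
print(js_proj_gamma(p1, p2), js_g_exact)       # close for small gamma
```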
In particular, for exponential family densities $p_{\theta_1}(x) = \frac{q_{\theta_1}(x)}{Z(\theta_1)}$ and $p_{\theta_2}(x) = \frac{q_{\theta_2}(x)}{Z(\theta_2)}$, we have
$I_\gamma(p_{\theta_1},p_{\theta_2}) = \exp\left(F(\theta_1 + \gamma\theta_2) - F(\theta_1) - \gamma F(\theta_2)\right),$
provided that θ 1 + γ θ 2 belongs to the natural parameter space (otherwise, the integral I γ diverges).
Even when F ( θ ) is not known in closed form, we may estimate the γ -divergence by estimating the I γ integrals as follows:
$\hat{I}_\gamma(q_{\theta_1},q_{\theta_2}) \approx \frac{1}{s}\sum_{i=1}^{s} q_2^\gamma(x_i),$
where $x_1,\ldots,x_s$ are sampled i.i.d. from $p_1(x)$. For example, we may use Monte Carlo importance sampling methods [52] or exponential family Langevin dynamics [53] to sample densities of exponential families with computationally intractable normalizers (e.g., polynomial exponential families).

6. Summary and Concluding Remarks

In this paper, we first recalled the Jensen–Shannon symmetrization (JS-symmetrization) scheme of [17] for an arbitrary statistical dissimilarity D ( · , · ) using an arbitrary weighted scalar mean M α as follows:
$D^{\mathrm{JS}}_{M_\alpha,\beta}(p_1,p_2) := \beta\, D\!\left(p_1, (p_1p_2)^{M_\alpha}\right) + (1-\beta)\, D\!\left(p_2, (p_1p_2)^{M_\alpha}\right), \quad (\alpha,\beta)\in(0,1)^2.$
In particular, we showed that the skewed Bhattacharyya distance and the Chernoff information can both be interpreted as JS-symmetrizations of the reverse Kullback–Leibler divergence.
Then we defined two types of geometric Jensen–Shannon divergences between probability densities. The first type, $\mathrm{JS}_M$, requires normalization of the $M$-mixtures and relies on the Kullback–Leibler divergence: $\mathrm{JS}_M = \mathrm{KL}^{\mathrm{JS}}_{M_{\frac{1}{2}},\frac{1}{2}}$. The second type, $\mathrm{JS}^+_{\tilde{M}}$, does not normalize the $M$-mixtures and uses the extended Kullback–Leibler divergence $\mathrm{KL}^+$ to take into account unnormalized mixtures: $\mathrm{JS}^+_{\tilde{M}} = \mathrm{KL}^{\mathrm{JS}+}_{\tilde{M}_{\frac{1}{2}},\frac{1}{2}}$. When $M$ is the arithmetic mean $A$, both M-JSD types coincide with the ordinary Jensen–Shannon divergence of Equation (2).
We have shown that both M-JSD types can be interpreted as regularized Jensen–Shannon divergences JS with additive terms. Namely, we have
$\mathrm{JS}_M(p_1,p_2) = \mathrm{JS}(p_1,p_2) + \mathrm{KL}((p_1p_2)^A,\, (p_1p_2)^M),$
$\mathrm{JS}^+_{\tilde{M}}(p_1,p_2) = \mathrm{JS}_M(p_1,p_2) + Z_M(p_1,p_2) - \log Z_M(p_1,p_2) - 1 = \mathrm{JS}(p_1,p_2) + \mathrm{KL}((p_1p_2)^A,\, (p_1p_2)^M) + Z_M(p_1,p_2) - \log Z_M(p_1,p_2) - 1,$
where $Z_M(p_1,p_2) = \int M(p_1,p_2)\,\mathrm{d}\mu$ is the $M$-mixture normalizer. The gap between these two types of M-JSD is
$\Delta_M(p_1,p_2) = \mathrm{JS}^+_{\tilde{M}}(p_1,p_2) - \mathrm{JS}_M(p_1,p_2) = Z_M(p_1,p_2) - \log Z_M(p_1,p_2) - 1.$
When taking the geometric mean M = G , we showed that both G-JSD types can be expressed using the Jeffreys divergence and the Bhattacharyya divergence (or Bhattacharyya coefficient):
$\mathrm{JS}_G(p_1,p_2) = \frac{1}{4}J(p_1,p_2) - B(p_1,p_2), \qquad \mathrm{JS}^+_{\tilde{G}}(p_1,p_2) = \frac{1}{4}J(p_1,p_2) + \exp(-B(p_1,p_2)) - 1 = \frac{1}{4}J(p_1,p_2) + \mathrm{BC}(p_1,p_2) - 1.$
Thus, the gap between these two types of G-JSD is
$\Delta_G(p_1,p_2) := \mathrm{JS}^+_{\tilde{G}}(p_1,p_2) - \mathrm{JS}_G(p_1,p_2) = \mathrm{BC}(p_1,p_2) + B(p_1,p_2) - 1 = Z_G(p_1,p_2) - \log Z_G(p_1,p_2) - 1,$
since $Z_G(p_1,p_2) = \int\sqrt{p_1p_2}\,\mathrm{d}\mu = \mathrm{BC}(p_1,p_2)$.
Although the square root of the Jensen–Shannon divergence yields a metric distance, this is no longer the case for the geometric-JSD and the extended geometric-JSD: we reported counterexamples in Remark 8. Moreover, we have shown that the KL symmetrization KL ( ( p 1 p 2 ) A , ( p 1 p 2 ) G ) is not a metric distance (Remark 12).
We discussed the merit of the extended G-JSD, which does not require normalization of the geometric mixture, in Section 5, and we showed how to approximate the G-JSD using the projective $\gamma$-divergences [39] for $\gamma = \epsilon$, a small enough value (e.g., $\gamma = \epsilon = 10^{-3}$). From the viewpoint of information geometry, the extended G-JSD has been shown to be an f-divergence [13] (a separable divergence), while the G-JSD is not separable in general because of the normalization of mixtures (with the exception of the ordinary JSD, which is an f-divergence because arithmetic mixtures do not require normalization).
We studied power JSDs by considering the power means and studied the $\gamma\to\pm\infty$ limits, the extended max-JSD and the min-JSD: We proved that the extended max-JSD is upper-bounded by the total variation distance $\mathrm{TV}(p_1,p_2) = \frac{1}{2}\int|p_1-p_2|\,\mathrm{d}\mu$:
$0 \leq \mathrm{JS}^+_{\widetilde{\max}}(p_1,p_2) \leq \mathrm{TV}(p_1,p_2),$
and that the extended min-JSD is lower-bounded as follows:
$\mathrm{JS}^+_{\widetilde{\min}}(p_1,p_2) \geq \frac{1}{4}J(p_1,p_2) - \mathrm{TV}(p_1,p_2),$
where J denotes the Jeffreys divergence: J ( p 1 , p 2 ) = KL ( p 1 , p 2 ) + KL ( p 2 , p 1 ) .
The advantage of using the extended G-JSD is that we do not need to normalize geometric mixtures, while this novel divergence is proven to be an f-divergence [13] and retains the property that it amounts to a regularization of the ordinary Jensen–Shannon divergence by an extra additive gap term.
Finally, we expressed JS G (Equation (41)) and JS G ˜ + (Equation (41)) for exponential families, characterized the gap between these two types of divergences as a function of the cumulant and partition functions, and reported a corresponding explicit formula for the multivariate Gaussian (exponential) family. The G-JSD between Gaussian distributions has already been used successfully in many applications [30,32,33,34,35,36,37,38].

Funding

This research received no external funding.

Acknowledgments

The Author would like to thank the two Reviewers for their insightful, detailed, and constructive comments and feedback.

Conflicts of Interest

Author Frank Nielsen was employed by the company Sony Computer Science Laboratories. The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Nomenclature

Means:
M α ( a , b ) weighted scalar mean
M α ϕ ( a , b ) weighted quasi-arithmetic scalar mean for generator ϕ ( u )
A ( a , b ) arithmetic mean
A α ( a , b ) weighted arithmetic mean
G α ( a , b ) weighted geometric mean
G ( a , b ) geometric mean
P γ ( a , b ) power mean with P 0 = G and P 1 = A
P γ , α ( a , b ) weighted power mean
Densities on measure space ( X , E , μ ) :
p , p 1 , p 2 , normalized density
q , q 1 , q 2 , unnormalized density
Z ( q ) density normalizer p = q Z ( q )
Z M ( p 1 , p 2 ) normalizer of M-mixture ( α = 1 2 )
Z ^ M ( p 1 , p 2 ) Monte Carlo estimator of Z M ( p 1 , p 2 )
Z M , α ( p 1 , p 2 ) normalizer of weighted M-mixture
( p 1 p 2 ) M M-mixture
( p 1 p 2 ) M , α weighted M-mixture
Dissimilarities, divergences, and distances:
KL ( p 1 , p 2 ) Kullback–Leibler divergence (KLD)
KL + ( q 1 , q 2 ) extended Kullback–Leibler divergence
KL * ( p 1 , p 2 ) reverse Kullback–Leibler divergence
H × ( p 1 , p 2 ) cross-entropy
H ( p ) Shannon discrete or differential entropy
J ( p 1 , p 2 ) Jeffreys divergence
TV ( p 1 , p 2 ) total variation distance
B ( p 1 , p 2 ) Bhattacharyya “distance” (not metric)
B α ( p 1 , p 2 ) α -skewed Bhattacharyya “distance”
C ( p 1 , p 2 ) Chernoff information or Chernoff distance
T ( p 1 , p 2 ) Taneja T-divergence
I f ( p 1 , p 2 ) Ali–Silvey–Csiszár f-divergence
D ( p 1 , p 2 ) arbitrary dissimilarity measure
D * ( p 1 , p 2 ) reverse dissimilarity measure
D + ( q 1 , q 2 ) extended dissimilarity measure
D ˜ ( q 1 , q 2 ) projective dissimilarity measure
D ˜ γ ( q 1 , q 2 ) γ -divergence
D ^ + ( q 1 , q 2 ) Monte Carlo estimation of dissimilarity D +
Jensen–Shannon divergences and generalizations:
JS ( p 1 , p 2 ) Jensen–Shannon divergence (JSD)
JS α , β ( p 1 , p 2 ) β -weighted α -skewed mixture JSD
JS M ( p 1 , p 2 ) M-JSD for M-mixtures
JS G ( p 1 , p 2 ) geometric JSD
JS G ˜ ( p 1 , p 2 ) extended geometric JSD
JS G * ( p 1 , p 2 ) left-sided geometric JSD (right-sided for KL * )
JS + min ˜ ( p 1 , p 2 ) min-JSD
JS + max ˜ ( p 1 , p 2 ) max-JSD
Δ M ( p 1 , p 2 ) gap between extended and normalized M-JSDs

References

  1. Cover, T.M. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 1999. [Google Scholar]
  2. Jeffreys, H. The Theory of Probability; OuP Oxford: Oxford, UK, 1998. [Google Scholar]
  3. Johnson, D.H.; Sinanovic, S. Symmetrizing the Kullback-Leibler distance. IEEE Trans. Inf. Theory 2001, 1, 1–10. [Google Scholar]
  4. Lin, J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 1991, 37, 145–151. [Google Scholar] [CrossRef]
  5. Fuglede, B.; Topsoe, F. Jensen-Shannon divergence and Hilbert space embedding. In Proceedings of the International Symposium on Information Theory (ISIT), Chicago, IL, USA, 27 June–2 July 2004; p. 31. [Google Scholar]
  6. Endres, D.M.; Schindelin, J.E. A new metric for probability distributions. IEEE Trans. Inf. Theory 2003, 49, 1858–1860. [Google Scholar] [CrossRef]
  7. Okamura, K. Metrization of powers of the Jensen-Shannon divergence. arXiv 2023, arXiv:2302.10070. [Google Scholar] [CrossRef]
  8. Sibson, R. Information radius. Z. für Wahrscheinlichkeitstheorie und Verwandte Gebiete 1969, 14, 149–160. [Google Scholar] [CrossRef]
  9. Briët, J.; Harremoës, P. Properties of classical and quantum Jensen-Shannon divergence. Phys. Rev. A 2009, 79, 052311. [Google Scholar] [CrossRef]
  10. Virosztek, D. The metric property of the quantum Jensen-Shannon divergence. Adv. Math. 2021, 380, 107595. [Google Scholar] [CrossRef]
  11. Ali, S.M.; Silvey, S.D. A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Ser. Methodol. 1966, 28, 131–142. [Google Scholar] [CrossRef]
  12. Csiszár, I. Information-type measures of difference of probability distributions and indirect observation. Stud. Sci. Math. Hung. 1967, 2, 229–318. [Google Scholar]
  13. Amari, S.i. Information Geometry and Its Applications; Applied Mathematical Sciences; Springer: Tokyo, Japan, 2016. [Google Scholar]
  14. Csiszár, I.; Shields, P.C. Information Theory and Statistics: A Tutorial. (Foundations and Trends® in Communications and Information Theory); Now Publishers Inc.: Hanover, MA, USA, 2004; Volume 1, pp. 417–528. [Google Scholar]
  15. Osterreicher, F.; Vajda, I. A new class of metric divergences on probability spaces and its applicability in statistics. Ann. Inst. Stat. Math. 2003, 55, 639–653. [Google Scholar] [CrossRef]
  16. Schoenberg, I.J. Metric spaces and completely monotone functions. Ann. Math. 1938, 39, 811–841. [Google Scholar] [CrossRef]
  17. Nielsen, F. On the Jensen–Shannon symmetrization of distances relying on abstract means. Entropy 2019, 21, 485. [Google Scholar] [CrossRef]
  18. Bullen, P.S. Handbook of Means and Their Inequalities; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013; Volume 560. [Google Scholar]
  19. Yamano, T. Some bounds for skewed α-Jensen-Shannon divergence. Results Appl. Math. 2019, 3, 100064. [Google Scholar] [CrossRef]
  20. Nielsen, F. Revisiting Chernoff information with likelihood ratio exponential families. Entropy 2022, 24, 1400. [Google Scholar] [CrossRef]
  21. Jerfel, G.; Wang, S.; Wong-Fannjiang, C.; Heller, K.A.; Ma, Y.; Jordan, M.I. Variational refinement for importance sampling using the forward Kullback-Leibler divergence. In Proceedings of the Uncertainty in Artificial Intelligence, PMLR, Online, 27–30 July 2021; pp. 1819–1829. [Google Scholar]
  22. Asadi, M.; Ebrahimi, N.; Kharazmi, O.; Soofi, E.S. Mixture models, Bayes Fisher information, and divergence measures. IEEE Trans. Inf. Theory 2018, 65, 2316–2321. [Google Scholar] [CrossRef]
  23. Grosse, R.B.; Maddison, C.J.; Salakhutdinov, R.R. Annealing between distributions by averaging moments. Adv. Neural Inf. Process. Syst. 2013, 26, 1–12. [Google Scholar]
  24. Amari, S.I. Integration of stochastic models by minimizing α-divergence. Neural Comput. 2007, 19, 2780–2796. [Google Scholar] [CrossRef]
  25. Bhattacharyya, A. On a measure of divergence between two multinomial populations. Sankhyā Indian J. Stat. 1946, 7, 401–406. [Google Scholar]
  26. Melville, P.; Yang, S.M.; Saar-Tsechansky, M.; Mooney, R. Active learning for probability estimation using Jensen-Shannon divergence. In Proceedings of the European Conference on Machine Learning, Porto, Portugal, 3–7 October 2005; pp. 268–279. [Google Scholar]
  27. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27, 2672–2680. [Google Scholar]
  28. Sutter, T.; Daunhawer, I.; Vogt, J. Multimodal generative learning utilizing Jensen-Shannon-divergence. Adv. Neural Inf. Process. Syst. 2020, 33, 6100–6110. [Google Scholar]
  29. Michalowicz, J.V.; Nichols, J.M.; Bucholtz, F. Calculation of differential entropy for a mixed Gaussian distribution. Entropy 2008, 10, 200–206. [Google Scholar] [CrossRef]
  30. Deasy, J.; Simidjievski, N.; Liò, P. Constraining variational inference with geometric Jensen-Shannon divergence. Adv. Neural Inf. Process. Syst. 2020, 33, 10647–10658. [Google Scholar]
  31. Deasy, J.; McIver, T.A.; Simidjievski, N.; Lio, P. α-VAEs: Optimising variational inference by learning data-dependent divergence skew. In Proceedings of the ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, Virtual, 23 July 2021. [Google Scholar]
  32. Kumari, J.; Deepak, G.; Santhanavijayan, A. RDS: Related document search for economics data using ontologies and hybrid semantics. In Proceedings of the International Conference on Data Analytics and Insights, Kolkata, India, 11–13 May 2023; pp. 691–702. [Google Scholar]
  33. Ni, S.; Lin, C.; Wang, H.; Li, Y.; Liao, Y.; Li, N. Learning geometric Jensen-Shannon divergence for tiny object detection in remote sensing images. Front. Neurorobot. 2023, 17, 1273251. [Google Scholar] [CrossRef]
  34. Sachdeva, R.; Gakhar, R.; Awasthi, S.; Singh, K.; Pandey, A.; Parihar, A.S. Uncertainty and Noise Aware Decision Making for Autonomous Vehicles—A Bayesian Approach. IEEE Trans. Veh. Technol. 2024, 74, 378–389. [Google Scholar] [CrossRef]
  35. Wang, J.; Massiceti, D.; Hu, X.; Pavlovic, V.; Lukasiewicz, T. NP-SemiSeg: When neural processes meet semi-supervised semantic segmentation. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 36138–36156. [Google Scholar]
  36. Serra, G.; Stavrou, P.A.; Kountouris, M. On the computation of the Gaussian rate–distortion–perception function. IEEE J. Sel. Areas Inf. Theory 2024, 5, 314–330. [Google Scholar] [CrossRef]
  37. Thiagarajan, P.; Ghosh, S. Jensen–Shannon divergence based novel loss functions for Bayesian neural networks. Neurocomputing 2025, 618, 129115. [Google Scholar] [CrossRef]
  38. Hanselmann, N.; Doll, S.; Cordts, M.; Lensch, H.P.; Geiger, A. EMPERROR: A Flexible Generative Perception Error Model for Probing Self-Driving Planners. IEEE Robot. Autom. Lett. 2025, 10, 5807–5814. [Google Scholar] [CrossRef]
  39. Fujisawa, H.; Eguchi, S. Robust parameter estimation with a small bias against heavy contamination. J. Multivar. Anal. 2008, 99, 2053–2081. [Google Scholar] [CrossRef]
  40. Jones, L.K.; Byrne, C.L. General entropy criteria for inverse problems, with applications to data compression, pattern classification, and cluster analysis. IEEE Trans. Inf. Theory 2002, 36, 23–30. [Google Scholar] [CrossRef]
  41. Nishimura, T.; Komaki, F. The information geometric structure of generalized empirical likelihood estimators. Commun. Stat. Methods 2008, 37, 1867–1879. [Google Scholar] [CrossRef]
  42. Nielsen, F. Generalized Bhattacharyya and Chernoff upper bounds on Bayes error using quasi-arithmetic means. Pattern Recognit. Lett. 2014, 42, 25–34. [Google Scholar] [CrossRef][Green Version]
  43. Barndorff-Nielsen, O. Information and Exponential Families: In Statistical Theory; John Wiley & Sons: Hoboken, NJ, USA, 2014. [Google Scholar]
  44. Kailath, T. The Divergence and Bhattacharyya Distance Measures in Signal Selection. IEEE Trans. Commun. Technol. 1967, 15, 52–60. [Google Scholar] [CrossRef]
  45. Nielsen, F.; Boltz, S. The Burbea-Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory 2011, 57, 5455–5466. [Google Scholar] [CrossRef]
  46. Grünwald, P.D. The Minimum Description Length Principle; MIT Press: Cambridge, MA USA, 2007. [Google Scholar]
  47. Cena, A.; Pistone, G. Exponential statistical manifold. Ann. Inst. Stat. Math. 2007, 59, 27–56. [Google Scholar] [CrossRef]
  48. Taneja, I.J. New developments in generalized information measures. In Advances in Imaging and Electron Physics; Elsevier: Amsterdam, The Netherlands, 1995; Volume 91, pp. 37–135. [Google Scholar]
  49. Rubinstein, R.Y.; Kroese, D.P. Simulation and the Monte Carlo Method; John Wiley & Sons: Hoboken, NJ, USA, 2016. [Google Scholar]
  50. Cobb, L.; Koppstein, P.; Chen, N.H. Estimation and moment recursion relations for multimodal distributions of the exponential family. J. Am. Stat. Assoc. 1983, 78, 124–130. [Google Scholar] [CrossRef]
  51. Hayakawa, J.; Takemura, A. Estimation of exponential-polynomial distribution by holonomic gradient descent. Commun. Stat.-Theory Methods 2016, 45, 6860–6882. [Google Scholar] [CrossRef]
  52. Kloek, T.; Van Dijk, H.K. Bayesian estimates of equation system parameters: An application of integration by Monte Carlo. Econom. J. Econom. Soc. 1978, 46, 1–19. [Google Scholar] [CrossRef]
  53. Banerjee, A.; Chen, T.; Li, X.; Zhou, Y. Stability based generalization bounds for exponential family Langevin dynamics. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MA, USA, 17–23 July 2022; pp. 1412–1449. [Google Scholar]