Article

Two Types of Geometric Jensen–Shannon Divergences

Frank Nielsen
Sony Computer Science Laboratories, Tokyo 141-0022, Japan
Entropy 2025, 27(9), 947; https://doi.org/10.3390/e27090947
Submission received: 8 August 2025 / Revised: 29 August 2025 / Accepted: 9 September 2025 / Published: 11 September 2025
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

The geometric Jensen–Shannon divergence (G-JSD) has gained popularity in machine learning and information sciences thanks to its closed-form expression between Gaussian distributions. In this work, we introduce an alternative definition of the geometric Jensen–Shannon divergence tailored to positive densities which does not normalize geometric mixtures. This novel divergence is termed the extended G-JSD, as it applies to the more general case of positive measures. We explicitly report the gap between the extended G-JSD and the G-JSD when considering probability densities, and show how to express the G-JSD and extended G-JSD using the Jeffreys divergence and the Bhattacharyya distance or Bhattacharyya coefficient. The extended G-JSD is proven to be an f-divergence, which is a separable divergence satisfying information monotonicity and invariance in information geometry. We derive a corresponding closed-form formula for the two types of G-JSDs when considering the case of multivariate Gaussian distributions that is often met in applications. We consider Monte Carlo stochastic estimations and approximations of the two types of G-JSD using the projective γ -divergences. Although the square root of the JSD yields a metric distance, we show that this is no longer the case for the two types of G-JSD. Finally, we explain how these two types of geometric JSDs can be interpreted as regularizations of the ordinary JSD.

1. Introduction

1.1. Kullback–Leibler and Jensen–Shannon Divergences

Let $(\mathcal{X}, \mathcal{E}, \mu)$ be a measure space with sample space $\mathcal{X}$, $\sigma$-algebra of events $\mathcal{E}$, and a prescribed positive measure $\mu$ on the measurable space $(\mathcal{X},\mathcal{E})$ (e.g., the counting measure or the Lebesgue measure). Let $\mathcal{M}_+(\mathcal{X}) = \{Q\}$ be the set of positive distributions $Q$ and $\mathcal{M}_+^1(\mathcal{X}) = \{P\}$ be the subset of probability measures $P$. We denote by $\mathcal{M}_\mu = \{\frac{\mathrm{d}Q}{\mathrm{d}\mu} : Q \in \mathcal{M}_+(\mathcal{X})\}$ and $\mathcal{M}_\mu^1 = \{\frac{\mathrm{d}P}{\mathrm{d}\mu} : P \in \mathcal{M}_+^1(\mathcal{X})\}$ the corresponding sets of Radon–Nikodym positive and probability densities, respectively.
Consider two probability measures $P_1$ and $P_2$ of $\mathcal{M}_+^1(\mathcal{X})$ with Radon–Nikodym densities $p_1 := \frac{\mathrm{d}P_1}{\mathrm{d}\mu} \in \mathcal{M}_\mu^1$ and $p_2 := \frac{\mathrm{d}P_2}{\mathrm{d}\mu} \in \mathcal{M}_\mu^1$ with respect to $\mu$, respectively. The deviation of $P_1$ from $P_2$ (also called distortion, dissimilarity, or deviance) is commonly measured in information theory [1] by the Kullback–Leibler divergence (KLD):
$\mathrm{KL}(p_1,p_2) := \int p_1 \log\frac{p_1}{p_2}\,\mathrm{d}\mu = E_{p_1}\!\left[\log\frac{p_1}{p_2}\right].$
Informally, the KLD quantifies the information lost when $p_2$ is used to approximate $p_1$ by measuring, on average, the surprise incurred when outcomes sampled from $p_1$ are assumed to emanate from $p_2$: Shannon entropy $H(p) = \int p\log\frac{1}{p}\,\mathrm{d}\mu$ is the expected surprise $H(p) = E_p[-\log p]$, where $-\log p(x)$ measures the surprise of the outcome $x$. Logarithms are taken to base 2 when information is measured in bits, and to base $e$ when it is measured in nats. Gibbs' inequality asserts that $\mathrm{KL}(P_1,P_2)\geq 0$ with equality if and only if $P_1 = P_2$ $\mu$-almost everywhere. Since $\mathrm{KL}(p_1,p_2)\neq \mathrm{KL}(p_2,p_1)$, various symmetrization schemes of the KLD have been proposed in the literature [1] (e.g., the Jeffreys divergence [1,2], the resistor average divergence [3] (harmonic KLD symmetrization), the Chernoff information [1], etc.)
An important symmetrization technique of the KLD is the Jensen–Shannon divergence [4,5] (JSD):
$\mathrm{JS}(p_1,p_2) := \frac{1}{2}\left(\mathrm{KL}(p_1,a) + \mathrm{KL}(p_2,a)\right),$
where $a = \frac{1}{2}p_1 + \frac{1}{2}p_2$ denotes the statistical mixture of $p_1$ and $p_2$. The JSD is guaranteed to be upper-bounded by $\log 2$ even when the supports of $p_1$ and $p_2$ differ, making it attractive in applications. Furthermore, its square root $\sqrt{\mathrm{JS}}$ yields a metric distance [6,7].
The JSD can be extended to a set of densities to measure the diversity of the set as an information radius [8]. In information theory, the JSD can also be interpreted as an information gain [6] since it can be equivalently written as
$\mathrm{JS}(p_1,p_2) = H\!\left(\frac{p_1+p_2}{2}\right) - \frac{H(p_1)+H(p_2)}{2},$
where $H(p) = -\int p\log p\,\mathrm{d}\mu$ is the Shannon entropy (Shannon entropy for discrete measures and differential entropy for continuous measures). The JSD has also been defined in the setting of quantum information [9], where it has also been proven that its square root yields a metric distance [10].
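To make the preceding definitions concrete, here is a minimal numerical sketch (not part of the original paper, assuming NumPy is available) of the discrete KLD and JSD of Equations (1) and (2); it illustrates the symmetry of the JSD and its $\log 2$ upper bound in nats.

```python
# Minimal sketch: discrete KLD and JSD in nats (natural logarithms).
import numpy as np

def kl(p, q):
    """Discrete Kullback-Leibler divergence KL(p, q) in nats (0 log 0 = 0)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def js(p, q):
    """Jensen-Shannon divergence: average KLD to the arithmetic mixture."""
    a = 0.5 * (np.asarray(p, float) + np.asarray(q, float))
    return 0.5 * (kl(p, a) + kl(q, a))

p1, p2 = [0.55, 0.45], [0.002, 0.998]
print(js(p1, p2), js(p2, p1))      # symmetric
print(js(p1, p2) <= np.log(2))     # bounded by log 2 ~ 0.693 nats
```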
Remark 1. 
Both the KLD and the JSD belong to the family of f-divergences [11,12] defined for a convex generator f ( u ) (strictly convex at 1) by
$I_f(p_1,p_2) := \int p_1\, f\!\left(\frac{p_2}{p_1}\right)\mathrm{d}\mu.$
Indeed, we have KL ( p 1 , p 2 ) = I f KL ( p 1 , p 2 ) and JS ( p 1 , p 2 ) = I f JS ( p 1 , p 2 ) for the following generators:
$f_{\mathrm{KL}}(u) := -\log u, \qquad f_{\mathrm{JS}}(u) := \frac{1}{2}\left(u\log u - (1+u)\log\frac{1+u}{2}\right).$
The family of f-divergences consists of the invariant divergences of information geometry [13]. The f-divergences guarantee information monotonicity under coarse graining [13] (also called lumping in information theory [14]). Using Jensen's inequality, we get $I_f(p_1,p_2) \geq f(1)$.
Remark 2. 
The metrization of f-divergences was studied in [15]. Once a metric distance D ( p 1 , p 2 ) is given, we may use the following metric transform [16] to obtain another metric which is guaranteed to be bounded by 1:
$0 \leq d(p_1,p_2) = \frac{D(p_1,p_2)}{1 + D(p_1,p_2)} \leq 1.$

1.2. Jensen–Shannon Symmetrization of Dissimilarities with Generalized Mixtures

In [17], a generalization of the KLD Jensen–Shannon symmetrization scheme was studied for an arbitrary statistical dissimilarity $D(\cdot,\cdot)$ using an arbitrary weighted mean [18] $M_\alpha$. A generic weighted mean $M_\alpha(a,b) = M_{1-\alpha}(b,a)$ for $a,b \in \mathbb{R}_{>0}$ is a continuous symmetric monotone map $\alpha \in [0,1] \mapsto M_\alpha(a,b)$ such that $M_0(a,b) = b$ and $M_1(a,b) = a$. For example, the quasi-arithmetic means [18] are defined according to a monotone continuous function $\phi$ as follows:
$M^\phi_\alpha(a,b) := \phi^{-1}\!\left(\alpha\,\phi(a) + (1-\alpha)\,\phi(b)\right).$
When $\phi_p(u) = u^p$, we get the $p$-power mean $M^{\phi_p}_\alpha(a,b) = (\alpha a^p + (1-\alpha)b^p)^{\frac{1}{p}}$ for $p \in \mathbb{R}\setminus\{0\}$. We extend $\phi_p$ to $p=0$ by defining $\phi_0(u) = \log u$, and get $M^{\phi_0}_\alpha(a,b) = a^\alpha b^{1-\alpha}$, the weighted geometric mean $G_\alpha$.
Let us recall the generalization of the Jensen–Shannon symmetrization scheme of a dissimilarity measure presented in [17]:
Definition 1 
( ( α , β ) M-JS dissimilarity [17]). The Jensen–Shannon skew symmetrization of a statistical dissimilarity D ( · , · ) with respect to an arbitrary weighted bivariate mean M α ( · , · ) is given by
$D^{\mathrm{JS}}_{M_\alpha,\beta}(p_1,p_2) := \beta\, D\!\left(p_1, (p_1p_2)^{M_\alpha}\right) + (1-\beta)\, D\!\left(p_2, (p_1p_2)^{M_\alpha}\right), \quad (\alpha,\beta) \in (0,1)^2,$
where ( p 1 p 2 ) M α is the statistical normalized weighted M-mixture of p 1 and p 2 :
$(p_1p_2)^{M_\alpha}(x) := \frac{M_\alpha(p_1(x),p_2(x))}{\int M_\alpha(p_1(x),p_2(x))\,\mathrm{d}\mu(x)}.$
Remark 3. 
A more general definition is given in [17] by using another arbitrary weighted mean N β to average the two dissimilarities in Equation (3):
$D^{\mathrm{JS}}_{M_\alpha,N_\beta}(p_1,p_2) := N_\beta\!\left(D\!\left(p_1,(p_1p_2)^{M_\alpha}\right),\, D\!\left(p_2,(p_1p_2)^{M_\alpha}\right)\right), \quad (\alpha,\beta)\in(0,1)^2.$
When $N_\beta = A_\beta$, the weighted arithmetic mean with $A_\beta(a,b) = \beta a + (1-\beta)b$, Equation (5) amounts to Equation (3).
When $\alpha = \frac{1}{2}$, we write, for short, $(p_1p_2)^M$ instead of $(p_1p_2)^{M_{\frac{1}{2}}}$ in the remainder.
When $D = \mathrm{KL}$ and $M = N = A_{\frac{1}{2}}$, Equation (5) yields the Jensen–Shannon divergence of Equation (2): $\mathrm{JS}(p_1,p_2) = \mathrm{KL}^{\mathrm{JS}}_{A_{\frac{1}{2}},A_{\frac{1}{2}}}(p_1,p_2) = \mathrm{KL}^{\mathrm{JS}}_{A,A}(p_1,p_2)$.
Lower and upper bounds for the skewed α -Jensen–Shannon divergence were reported in [19].
The abstract mixture normalizer of ( p 1 p 2 ) M α shall be denoted by
$Z_{M_\alpha}(p_1,p_2) := \int M_\alpha(p_1(x),p_2(x))\,\mathrm{d}\mu(x),$
so that the normalized $M$-mixture is written as $(p_1p_2)^{M_\alpha}(x) = \frac{M_\alpha(p_1(x),p_2(x))}{Z_{M_\alpha}(p_1,p_2)}$. The normalizer $Z_{M_\alpha}(p_1,p_2)$ is always finite, and thus the weighted $M$-mixtures $(p_1p_2)^{M_\alpha}$ are well-defined:
Proposition 1. 
For any generic weighted mean $M_\alpha$, the normalizer of the weighted $M$-mixture is bounded by 2:
$0 \leq Z_{M_\alpha}(p_1,p_2) \leq 2.$
Proof. 
Since M α is a scalar weighted mean, it satisfies the following in-betweenness property:
$\min\{p_1(x),p_2(x)\} \leq M_\alpha(p_1(x),p_2(x)) \leq \max\{p_1(x),p_2(x)\}.$
Hence, by using the following two identities for a 0 and b 0 ,
$\min\{a,b\} = \frac{a+b}{2} - \frac{1}{2}|a-b|, \qquad \max\{a,b\} = \frac{a+b}{2} + \frac{1}{2}|a-b|,$
we get
$\int \min\{p_1(x),p_2(x)\}\,\mathrm{d}\mu(x) \leq \int M_\alpha(p_1(x),p_2(x))\,\mathrm{d}\mu(x) \leq \int \max\{p_1(x),p_2(x)\}\,\mathrm{d}\mu(x),$ i.e., $0 \leq 1 - \mathrm{TV}(p_1,p_2) \leq Z_{M_\alpha}(p_1,p_2) \leq 1 + \mathrm{TV}(p_1,p_2) \leq 2,$
where
$\mathrm{TV}(p_1,p_2) := \frac{1}{2}\int |p_1 - p_2|\,\mathrm{d}\mu,$
is the total variation distance, which is upper-bounded by 1. When the supports of the densities $p_1$ and $p_2$ intersect (i.e., non-singular probability measures $P_1$ and $P_2$), we have $Z_{M_\alpha}(p_1,p_2) > 0$, and therefore the weighted $M$-mixtures $(p_1p_2)^{M_\alpha}$ are well-defined. □
The generic Jensen–Shannon symmetrization of dissimilarities given in Definition 1 allows us to re-interpret some well-known statistical dissimilarities:
For example, the Chernoff information [1,20] is defined by
$C(p_1,p_2) := \max_{\alpha\in(0,1)} B_\alpha(p_1,p_2),$
where $B_\alpha(p_1,p_2)$ denotes the $\alpha$-skewed Bhattacharyya distance:
$B_\alpha(p_1,p_2) := -\log \int p_1^\alpha\, p_2^{1-\alpha}\,\mathrm{d}\mu.$
When $\alpha=\frac{1}{2}$, we write $B(p_1,p_2) = B_{\frac{1}{2}}(p_1,p_2)$, the Bhattacharyya distance. Notice that the Bhattacharyya distance is not a metric distance, as it violates the triangle inequality of metrics.
Using the framework of JS-symmetrization of dissimilarities, we can reinterpret the Chernoff information as
$C(p_1,p_2) = (\mathrm{KL}^*)^{\mathrm{JS}}_{G_{\alpha^*},A_{\frac{1}{2}}}(p_1,p_2),$
where α * is provably the unique optimal skewing factor in Equation (8), such that we have [20]:
$C(p_1,p_2) = \mathrm{KL}^*(p_1,(p_1p_2)^{G_{\alpha^*}}) = \mathrm{KL}^*(p_2,(p_1p_2)^{G_{\alpha^*}}) = \frac{1}{2}\left(\mathrm{KL}^*(p_1,(p_1p_2)^{G_{\alpha^*}}) + \mathrm{KL}^*(p_2,(p_1p_2)^{G_{\alpha^*}})\right),$
where KL * denotes the reverse KLD:
$\mathrm{KL}^*(p_1,p_2) := \mathrm{KL}(p_2,p_1).$
Note that the KLD is sometimes called the forward KLD (e.g., [21]), and we have KL * * ( p 1 , p 2 ) = KL ( p 1 , p 2 ) .
Although arithmetic mixtures are most often used in statistics, geometric mixtures are also encountered, for example in Bayesian statistics [22] or in Markov chain Monte Carlo annealing [23], just to give two examples. In information geometry, statistical power mixtures based on the homogeneous power means are used to perform stochastic integration of statistical models [24].
Proposition 2 (Bhattacharyya distance as G-JSD).
The Bhattacharyya distance [25] and the α-skewed Bhattacharyya distances can be interpreted as JS-symmetrizations of the reverse KLD with respect to the geometric mean G:
$B(p_1,p_2) := -\log\int\sqrt{p_1p_2}\,\mathrm{d}\mu = (\mathrm{KL}^*)^{\mathrm{JS}}_{G}(p_1,p_2), \qquad B_\alpha(p_1,p_2) := -\log\int p_1^\alpha\, p_2^{1-\alpha}\,\mathrm{d}\mu = (\mathrm{KL}^*)^{\mathrm{JS}}_{G_\alpha}(p_1,p_2).$
Proof. 
Let $m = (p_1p_2)^G = \frac{\sqrt{p_1p_2}}{Z_G(p_1,p_2)}$ denote the geometric mixture with normalizer $Z_G(p_1,p_2) = \int\sqrt{p_1p_2}\,\mathrm{d}\mu$. By definition of the JS-symmetrization of the reverse KLD, we have
$(\mathrm{KL}^*)^{\mathrm{JS}}_G(p_1,p_2) := \frac{1}{2}\left(\mathrm{KL}^*(p_1,(p_1p_2)^G) + \mathrm{KL}^*(p_2,(p_1p_2)^G)\right) = \frac{1}{2}\left(\mathrm{KL}((p_1p_2)^G,p_1) + \mathrm{KL}((p_1p_2)^G,p_2)\right) = \frac{1}{2}\int m\left(\log\frac{\sqrt{p_1p_2}}{Z_G(p_1,p_2)\,p_1} + \log\frac{\sqrt{p_1p_2}}{Z_G(p_1,p_2)\,p_2}\right)\mathrm{d}\mu = \frac{1}{2}\left(\frac{1}{2}\int m\log\left(\frac{p_2}{p_1}\,\frac{p_1}{p_2}\right)\mathrm{d}\mu - 2\log Z_G(p_1,p_2)\int m\,\mathrm{d}\mu\right) = -\log Z_G(p_1,p_2) =: B(p_1,p_2).$
The proof carries over similarly for the $\alpha$-skewed JS-symmetrization of the reverse KLD: we now let $m_\alpha = (p_1p_2)^{G_\alpha} = \frac{p_1^\alpha p_2^{1-\alpha}}{Z_{G_\alpha}(p_1,p_2)}$ be the $\alpha$-weighted geometric mixture with normalizer $Z_{G_\alpha}(p_1,p_2) = \int p_1^\alpha\, p_2^{1-\alpha}\,\mathrm{d}\mu$, written as $Z_{G_\alpha}$ for short below:
$(\mathrm{KL}^*)^{\mathrm{JS}}_{G_\alpha,\alpha}(p_1,p_2) := \alpha\,\mathrm{KL}^*(p_1,(p_1p_2)^{G_\alpha}) + (1-\alpha)\,\mathrm{KL}^*(p_2,(p_1p_2)^{G_\alpha}) = \alpha\,\mathrm{KL}(m_\alpha,p_1) + (1-\alpha)\,\mathrm{KL}(m_\alpha,p_2) = \int\left(\alpha\, m_\alpha\log\frac{p_1^\alpha p_2^{1-\alpha}}{Z_{G_\alpha}\,p_1} + (1-\alpha)\, m_\alpha\log\frac{p_1^\alpha p_2^{1-\alpha}}{Z_{G_\alpha}\,p_2}\right)\mathrm{d}\mu = -(\alpha + 1 - \alpha)\log Z_{G_\alpha}\int m_\alpha\,\mathrm{d}\mu + \int m_\alpha \log\left(\left(\frac{p_2}{p_1}\right)^{\alpha(1-\alpha)}\left(\frac{p_1}{p_2}\right)^{\alpha(1-\alpha)}\right)\mathrm{d}\mu = -\log Z_{G_\alpha}(p_1,p_2) =: B_\alpha(p_1,p_2).$ □
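The following small numerical sketch (an illustration added here, assuming NumPy) checks Proposition 2 on a pair of discrete distributions: the JS-symmetrization of the reverse KLD with respect to the geometric mean coincides with the Bhattacharyya distance $-\log\int\sqrt{p_1p_2}\,\mathrm{d}\mu$.

```python
# Numerical check of Proposition 2 on discrete distributions (illustrative sketch).
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

p1 = np.array([0.55, 0.45])
p2 = np.array([0.002, 0.998])

g = np.sqrt(p1 * p2)              # unnormalized geometric mixture
Z = g.sum()                       # normalizer Z_G(p1, p2)
m = g / Z                         # normalized geometric mixture

lhs = 0.5 * (kl(m, p1) + kl(m, p2))   # (KL*)^JS_G(p1, p2)
rhs = -np.log(Z)                      # Bhattacharyya distance B(p1, p2)
print(lhs, rhs)                       # both ~0.352
```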
Besides information theory [1], the JSD also plays an important role in machine learning [26,27,28]. However, one drawback that limits its use in practice is that the JSD between two Gaussian distributions (normal distributions) is not known in closed form, since no analytic formula is known for the differential entropy of a two-component Gaussian mixture [29]; thus, the JSD needs to be numerically approximated in practice by various methods.
To circumvent this problem, the geometric G-JSD was defined in [17] as follows:
Definition 2 
(G-JSD [17]). The geometric Jensen–Shannon divergence (G-JSD) between two probability densities, p 1 and p 2 , is defined by
$\mathrm{JS}_G(p_1,p_2) := \frac{1}{2}\left(\mathrm{KL}(p_1,(p_1p_2)^G) + \mathrm{KL}(p_2,(p_1p_2)^G)\right),$
where $(p_1p_2)^G(x) = \frac{\sqrt{p_1(x)p_2(x)}}{\int\sqrt{p_1(x)p_2(x)}\,\mathrm{d}\mu(x)}$ is the (normalized) geometric mixture of $p_1$ and $p_2$.
We have $\mathrm{JS}_G(p_1,p_2) = \mathrm{KL}^{\mathrm{JS}}_G(p_1,p_2)$. Since, by default, the $M$-mixture JS-symmetrization of a dissimilarity $D$ is performed on the right argument (i.e., $D^{\mathrm{JS}}_M$), we may also consider a dual JS-symmetrization by setting the $M$-mixtures on the left argument. We denote this left-mixture JS-symmetrization by $D^{\mathrm{JS}*}_M$. We have $D^{\mathrm{JS}*}_M(p_1,p_2) = (D^*)^{\mathrm{JS}}_M(p_1,p_2)$, i.e., the left-sided JS-symmetrization of $D$ amounts to a right-sided JS-symmetrization of the dual dissimilarity $D^*(p_1,p_2) := D(p_2,p_1)$.
Thus, a left-sided G-JSD divergence JS G * was also defined in [17]:
Definition 3. 
The left-sided geometric Jensen–Shannon divergence (G-JSD) between two probability densities p 1 and p 2 is defined by
$\mathrm{JS}^*_G(p_1,p_2) := \frac{1}{2}\left(\mathrm{KL}((p_1p_2)^G,p_1) + \mathrm{KL}((p_1p_2)^G,p_2)\right) = \frac{1}{2}\left(\mathrm{KL}^*(p_1,(p_1p_2)^G) + \mathrm{KL}^*(p_2,(p_1p_2)^G)\right),$
where $(p_1p_2)^G(x) = \frac{\sqrt{p_1(x)p_2(x)}}{\int\sqrt{p_1(x)p_2(x)}\,\mathrm{d}\mu(x)}$ is the (normalized) geometric mixture of $p_1$ and $p_2$.
In contrast to the JSD, which must be numerically approximated between Gaussians, the geometric Jensen–Shannon divergence (G-JSD) admits a closed-form expression between Gaussian distributions [17]. However, the G-JSD is no longer bounded. The G-JSD formula between Gaussian distributions has been used in several scenarios; see [30,31,32,33,34,35,36,37,38] for a few use cases.
Let us express the G-JSD divergence using other familiar divergences.
Proposition 3. 
We have the following expression of the geometric Jensen–Shannon divergence:
$\mathrm{JS}_G(p_1,p_2) = \frac{1}{4}J(p_1,p_2) - B(p_1,p_2),$
where $J(p_1,p_2) := \int (p_1-p_2)\log\frac{p_1}{p_2}\,\mathrm{d}\mu$ is the Jeffreys divergence [2], and
$B(p_1,p_2) = -\log\int\sqrt{p_1p_2}\,\mathrm{d}\mu = -\log Z_G(p_1,p_2),$
is the Bhattacharyya distance.
Proof. 
We have the following:
$\mathrm{JS}_G(p_1,p_2) := \frac{1}{2}\left(\mathrm{KL}(p_1,(p_1p_2)^G) + \mathrm{KL}(p_2,(p_1p_2)^G)\right) = \frac{1}{2}\int\left(p_1(x)\log\frac{p_1(x)\,Z_G(p_1,p_2)}{\sqrt{p_1(x)p_2(x)}} + p_2(x)\log\frac{p_2(x)\,Z_G(p_1,p_2)}{\sqrt{p_1(x)p_2(x)}}\right)\mathrm{d}\mu(x) = \frac{1}{2}\int\left(p_1(x)+p_2(x)\right)\log Z_G(p_1,p_2)\,\mathrm{d}\mu(x) + \frac{1}{2}\left(\frac{1}{2}\mathrm{KL}(p_1,p_2) + \frac{1}{2}\mathrm{KL}(p_2,p_1)\right) = \log Z_G(p_1,p_2) + \frac{1}{4}J(p_1,p_2) = \frac{1}{4}J(p_1,p_2) - B(p_1,p_2).$ □
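As a quick sanity check of Proposition 3, the following sketch (illustrative, assuming NumPy) computes the G-JSD of Definition 2 directly and via the identity $\frac{1}{4}J - B$ on discrete distributions; the two values agree.

```python
# Sanity check of Proposition 3 on discrete distributions (illustrative sketch).
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

p1 = np.array([0.55, 0.45])
p2 = np.array([0.002, 0.998])

g = np.sqrt(p1 * p2)
Z = g.sum()                          # Z_G(p1, p2) = Bhattacharyya coefficient
m = g / Z                            # normalized geometric mixture

js_g_def = 0.5 * (kl(p1, m) + kl(p2, m))   # Definition 2
J = kl(p1, p2) + kl(p2, p1)                # Jeffreys divergence
B = -np.log(Z)                             # Bhattacharyya distance
js_g_id = 0.25 * J - B                     # Proposition 3
print(js_g_def, js_g_id)                   # both ~0.5267
```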
Corollary 1 (G-JSD upper bound).
We have the upper bound $\mathrm{JS}_G(p,q) \leq \frac{1}{4}J(p,q)$.
Proof. 
Since $B(p_1,p_2)\geq 0$ and $\mathrm{JS}_G(p_1,p_2) = \frac{1}{4}J(p_1,p_2) - B(p_1,p_2)$, we have $\mathrm{JS}_G(p,q) \leq \frac{1}{4}J(p,q)$. □
Remark 4. 
Although the KLD and JSD are separable divergences (i.e., f-divergences expressed as integrals of scalar divergences), the M-JSD divergence is, in general, not separable, because it requires mixtures to be normalized inside the log terms. Notice that the Bhattacharyya distance is, similarly, not a separable divergence, but the Bhattacharyya similarity coefficient $\mathrm{BC}(p_1,p_2) = \exp(-B(p_1,p_2)) = \int\sqrt{p_1p_2}\,\mathrm{d}\mu$ is a separable "f-divergence"/f-coefficient for $f_{\mathrm{BC}}(u) = \sqrt{u}$ (here, a concave generator): $\mathrm{BC}(p_1,p_2) = I_{f_{\mathrm{BC}}}(p_1,p_2)$. Notice that $f_{\mathrm{BC}}(1) = 1$, and because of the concavity of $f_{\mathrm{BC}}$, we have $I_{f_{\mathrm{BC}}}(p_1,p_2) \leq f_{\mathrm{BC}}(1) = 1$ (hence, the term f-coefficient to reflect the notion of a similarity measure).

1.3. Paper Outline

The paper is organized as follows: We first give an alternative definition of the M-JSD in Section 2 (Definition 4) which extends to positive measures and does not require normalization of geometric mixtures. We call this new divergence the extended M-JSD, and we compare the two types of geometric JSDs when dealing with probability measures. In Section 4, we show that these normalized/extended M-JSD divergences can be interpreted as regularizations of the Jensen–Shannon divergence, and exhibit several bounds. We discuss Monte Carlo stochastic approximations and approximations using γ -divergences [39] in Section 5. For the case of geometric mixtures, although the G-JSD is not an f-divergence, we show that the extended G-JSD is an f-divergence (Proposition 5), and we express both the G-JSD and the extended G-JSD using both the Jeffreys divergence and the Bhattacharyya divergence or coefficient. We report a related closed-form formula for the G-JSD and extended G-JSD between two Gaussian distributions in Section 3. Finally, we summarize the main results in the concluding section, Section 6.
A list of notations is provided in Nomenclature.

2. A Novel Definition: The G-JSD, Extended to Positive Measures

2.1. Definition and Properties

We may consider the following two modifications of the G-JSD:
  • First, we replace the KLD with the extended KLD between positive densities $q_1 \in \mathcal{M}_\mu$ and $q_2 \in \mathcal{M}_\mu$ instead of normalized densities:
    $\mathrm{KL}^+(q_1,q_2) := \int\left(q_1\log\frac{q_1}{q_2} + q_2 - q_1\right)\mathrm{d}\mu,$
    (with KL + ( p 1 , p 2 ) = KL ( p 1 , p 2 ) );
  • Second, we consider unnormalized M-mixture densities:
    $(q_1q_2)^{\tilde{M}_\alpha}(x) := M_\alpha(q_1(x),q_2(x)),$
    where we use the M ˜ tilde notation to indicate that the M-mixture is not normalized, instead of normalized densities ( q 1 q 2 ) M α ( x ) .
The extended KLD can be interpreted as a pointwise integral of a scalar Bregman divergence obtained for the negative Shannon entropy generator [40]. This proves that $\mathrm{KL}^+(q_1,q_2)\geq 0$ with equality if and only if $q_1 = q_2$ $\mu$-almost everywhere. Notice that $\mathrm{KL}(q_1,q_2)$ may be negative when $q_1$ and/or $q_2$ are not normalized to probability densities, but we always have $\mathrm{KL}^+(q_1,q_2)\geq 0$.
The extended KLD is an extended f-divergence [41]: $\mathrm{KL}^+(q_1,q_2) = I^+_{f_{\mathrm{KL}^+}}(q_1,q_2)$ for $f_{\mathrm{KL}^+}(u) = -\log u + u - 1$, where $I^+_f(q_1,q_2)$ denotes the f-divergence extended to positive densities $q_1$ and $q_2$:
$I^+_f(q_1,q_2) = \int q_1\, f\!\left(\frac{q_2}{q_1}\right)\mathrm{d}\mu.$
Remark 5. 
As a side remark, it is preferable in practice to estimate the KLD between $p_1$ and $p_2$ by Monte Carlo methods using Equation (10) instead of Equation (1), in order to guarantee the non-negativity of the estimated KLD (Gibbs' inequality). Indeed, drawing $s$ samples $x_1,\ldots,x_s$ defines two unnormalized distributions $q_1(x) = \frac{1}{s}\sum_{i=1}^s p_1(x)\,\delta_{x_i}(x)$ and $q_2(x) = \frac{1}{s}\sum_{i=1}^s p_2(x)\,\delta_{x_i}(x)$, where
$\delta_{x_i}(x) = \begin{cases}1 & \text{if } x = x_i,\\ 0 & \text{otherwise}.\end{cases}$
Remark 6. 
For an arbitrary distortion measure D + ( q 1 , q 2 ) between positive measures q 1 and q 2 , we can build a corresponding projective divergence D ˜ ( q 1 , q 2 ) as follows:
$\tilde{D}(q_1,q_2) := D^+\!\left(\frac{q_1}{Z(q_1)}, \frac{q_2}{Z(q_2)}\right),$
where $Z(q) := \int q\,\mathrm{d}\mu$ is the normalization factor of the positive density $q$. The divergence $\tilde{D}$ is said to be projective because we have, for all $\lambda_1>0$, $\lambda_2>0$, the property that $\tilde{D}(\lambda_1 q_1, \lambda_2 q_2) = \tilde{D}(q_1,q_2) = D^+(p_1,p_2)$, where $p_i = \frac{q_i}{Z(q_i)}$ are the normalized densities. The projective Kullback–Leibler divergence $\widetilde{\mathrm{KL}}$ is thus another projective extension of the KLD to unnormalized densities which coincides with the KLD for probability densities. But the projective KLD is different from the extended KLD of Equation (10), and furthermore, we have $\widetilde{\mathrm{KL}}(q_1,q_2) = 0$ if and only if $q_1 = \lambda q_2$ $\mu$-almost everywhere for some $\lambda>0$.
Let us now define the Jensen–Shannon symmetrization of an extended statistical divergence D + with respect to an arbitrary weighted mean M α as follows:
Definition 4 (Extended M-JSD).
A Jensen–Shannon skew symmetrization of a statistical divergence D + ( · , · ) between two positive measures q 1 and q 2 with respect to a weighted mean M α is defined by
$D^{\mathrm{JS}+}_{\tilde{M}_\alpha,\beta}(q_1,q_2) := \beta\, D^+\!\left(q_1,(q_1q_2)^{\tilde{M}_\alpha}\right) + (1-\beta)\, D^+\!\left(q_2,(q_1q_2)^{\tilde{M}_\alpha}\right).$
When β = 1 2 , we write, for short, D M ˜ α JS + ( q 1 , q 2 ) , and furthermore, when α = 1 2 , we simplify the notation to D M ˜ JS + ( q 1 , q 2 ) .
When D + = KL + , we obtain the extended geometric Jensen–Shannon divergence, JS G ˜ + ( q 1 , q 2 ) = KL G ˜ JS + ( q 1 , q 2 ) :
Definition 5 (Extended G-JSD).
The extended geometric Jensen–Shannon divergence between two positive densities q 1 and q 2 is
$\mathrm{JS}^+_{\tilde{G}}(q_1,q_2) = \frac{1}{2}\left(\mathrm{KL}^+(q_1,(q_1q_2)^{\tilde{G}}) + \mathrm{KL}^+(q_2,(q_1q_2)^{\tilde{G}})\right).$
The extended G-JSD between two normalized densities p 1 and p 2 is thus
$\mathrm{JS}^+_{\tilde{G}}(p_1,p_2) = \frac{1}{2}\int\left(p_1\log\frac{p_1}{\sqrt{p_1p_2}} + p_2\log\frac{p_2}{\sqrt{p_1p_2}}\right)\mathrm{d}\mu + \int\sqrt{p_1p_2}\,\mathrm{d}\mu - 1$
$= \frac{1}{4}\int\left(p_1\log\frac{p_1}{p_2} + p_2\log\frac{p_2}{p_1}\right)\mathrm{d}\mu + Z_G(p_1,p_2) - 1,$
with $Z_G(p_1,p_2) = \exp(-B(p_1,p_2))$.
Thus, we get the following propositions:
Proposition 4. 
The extended geometric Jensen–Shannon divergence (G-JSD) can be expressed as follows:
$\mathrm{JS}^+_{\tilde{G}}(p_1,p_2) = \frac{1}{4}J(p_1,p_2) + \exp(-B(p_1,p_2)) - 1.$
Proof. 
We have
$\mathrm{JS}^+_{\tilde{G}}(p_1,p_2) = \frac{1}{2}\left(\mathrm{KL}^+(p_1,(p_1p_2)^{\tilde{G}}) + \mathrm{KL}^+(p_2,(p_1p_2)^{\tilde{G}})\right) = \frac{1}{2}\int\left(p_1\log\frac{p_1}{\sqrt{p_1p_2}} + p_2\log\frac{p_2}{\sqrt{p_1p_2}} + 2\sqrt{p_1p_2} - (p_1+p_2)\right)\mathrm{d}\mu = \frac{1}{4}\int(p_1-p_2)\log\frac{p_1}{p_2}\,\mathrm{d}\mu + \int\sqrt{p_1p_2}\,\mathrm{d}\mu - 1 = \frac{1}{4}J(p_1,p_2) + \exp(-B(p_1,p_2)) - 1.$ □
Thus, we can express the gap between JS G ˜ + ( p 1 , p 2 ) and JS G ( p 1 , p 2 ) :
$\Delta_G(p_1,p_2) = \mathrm{JS}^+_{\tilde{G}}(p_1,p_2) - \mathrm{JS}_G(p_1,p_2) = \exp(-B(p_1,p_2)) + B(p_1,p_2) - 1.$
Since $Z_G(p_1,p_2) = \exp(-B(p_1,p_2))$, we have:
$\Delta_G(p_1,p_2) = Z_G(p_1,p_2) - \log Z_G(p_1,p_2) - 1.$
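The next sketch (illustrative, assuming NumPy) evaluates Proposition 4 and the gap $\Delta_G = Z_G - \log Z_G - 1$ on a discrete example; note that the gap is non-negative in nats.

```python
# Illustrative sketch: extended G-JSD (Proposition 4) and its gap to the G-JSD, in nats.
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

p1 = np.array([0.55, 0.45])
p2 = np.array([0.002, 0.998])

Z_G = np.sum(np.sqrt(p1 * p2))        # Bhattacharyya coefficient BC(p1, p2)
B = -np.log(Z_G)                      # Bhattacharyya distance
J = kl(p1, p2) + kl(p2, p1)           # Jeffreys divergence

js_g = 0.25 * J - B                   # normalized G-JSD (Proposition 3)
js_g_ext = 0.25 * J + np.exp(-B) - 1  # extended G-JSD (Proposition 4)
gap = js_g_ext - js_g
print(gap, Z_G - np.log(Z_G) - 1)     # equal, and non-negative
```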
Proposition 5. 
The extended G-JSD is an f-divergence for the generator
$f_{\tilde{G}}(u) = \frac{1}{4}(u-1)\log u + \sqrt{u} - 1.$
That is, we have JS G ˜ + ( p 1 , p 2 ) = I f G ˜ ( p 1 , p 2 ) .
Proof. 
We proved that $\mathrm{JS}^+_{\tilde{G}}(p_1,p_2) = \frac{1}{4}J(p_1,p_2) + \mathrm{BC}(p_1,p_2) - 1$. The Jeffreys divergence is an f-divergence for the generator $f_J(u) = (u-1)\log u$, and the Bhattacharyya coefficient is an f-coefficient for $f_{\mathrm{BC}}(u) = \sqrt{u}$ (an "f-divergence" for a concave generator). Thus, we have
$f_{\tilde{G}}(u) = \frac{1}{4}(u-1)\log u + \sqrt{u} - 1,$
such that $\mathrm{JS}^+_{\tilde{G}}(p_1,p_2) = I_{f_{\tilde{G}}}(p_1,p_2)$. We check that $f_{\tilde{G}}(u)$ is convex, since $f''_{\tilde{G}}(u) = \frac{\sqrt{u}(u+1) - u}{4u^{\frac{5}{2}}}$ (and by the change of variable $t = \sqrt{u}$, the numerator $t(t^2 - t + 1)$ is shown to be positive, since the discriminant of $t^2 - t + 1$ is negative), and we have $f_{\tilde{G}}(1) = 0$. Thus, the extended G-JSD is a proper f-divergence. □
It follows that the extended G-JSD satisfies the information monotonicity of invariant divergences in information geometry [13].
By abuse of notation, we have
$\mathrm{KL}^+(q_1,q_2) := \mathrm{KL}(q_1,q_2) + \int(q_2 - q_1)\,\mathrm{d}\mu,$
although $q_1$ and $q_2$ may not be normalized in the KL term (which can then yield a potentially negative value). Letting $Z(q_i) := \int q_i\,\mathrm{d}\mu$ be the total mass of the positive density $q_i$, we have
$\mathrm{KL}^+(q_1,q_2) = \mathrm{KL}(q_1,q_2) + Z(q_2) - Z(q_1).$
Let $\tilde{m}_\alpha = M_\alpha(q_1,q_2)$ be the unnormalized $M$-mixture of the positive densities $q_1$ and $q_2$, and let $Z_{M_\alpha} = \int\tilde{m}_\alpha\,\mathrm{d}\mu$ be the normalization term so that we have $m_\alpha = \frac{\tilde{m}_\alpha}{Z_{M_\alpha}}$ and $\tilde{m}_\alpha = Z_{M_\alpha}\, m_\alpha$. When clear from context, we write $Z_\alpha$ instead of $Z_{M_\alpha}$.
We get, after elementary calculus, the following identity:
$\mathrm{JS}^+_{\tilde{M}_\alpha,\beta}(q_1,q_2) = \mathrm{JS}_{M_\alpha,\beta}(q_1,q_2) - (\beta Z(q_1) + (1-\beta)Z(q_2))\log Z_\alpha + Z_\alpha - (\beta Z(q_1) + (1-\beta)Z(q_2)).$
Therefore, the difference gap $\Delta_{M_\alpha,\beta}(p_1,p_2)$ (written for short as $\Delta(p_1,p_2)$) between the extended (unnormalized) $M$-JSD and the normalized $M$-JSD between two normalized densities $p_1$ and $p_2$ (i.e., with $Z_1 = Z(p_1) = 1$ and $Z_2 = Z(p_2) = 1$) is
$\Delta(p_1,p_2) := \mathrm{JS}^+_{\tilde{M}_\alpha,\beta}(p_1,p_2) - \mathrm{JS}_{M_\alpha,\beta}(p_1,p_2) = Z_\alpha - \log(Z_\alpha) - 1.$
Proposition 6 (Extended/normalized M-JSD Gap).
The following identity holds:
$\mathrm{JS}^+_{\tilde{M}_\alpha,\beta}(p_1,p_2) = \mathrm{JS}_{M_\alpha,\beta}(p_1,p_2) + Z_\alpha - \log(Z_\alpha) - 1.$
Thus, $\mathrm{JS}^+_{\tilde{M}_\alpha,\beta}(p_1,p_2) \geq \mathrm{JS}_{M_\alpha,\beta}(p_1,p_2)$ when $\Delta(p_1,p_2)\geq 0$, and $\mathrm{JS}^+_{\tilde{M}_\alpha,\beta}(p_1,p_2) \leq \mathrm{JS}_{M_\alpha,\beta}(p_1,p_2)$ when $\Delta(p_1,p_2)\leq 0$.
When we consider the weighted arithmetic mean $A_\alpha$, we always have $Z_\alpha = 1$ for $\alpha\in(0,1)$, and thus the two definitions (Definition 1 and Definition 4) of the A-JSD coincide (i.e., $Z^A_\alpha - \log(Z^A_\alpha) - 1 = 0$):
$\mathrm{JS}_A(p_1,p_2) = \mathrm{JS}^+_{\tilde{A}}(p_1,p_2).$
However, when the weighted mean M α differs from the weighted arithmetic mean (i.e., M α A α ), the two definitions of the M-JSD JS M and extended M-JSD JS M ˜ differ by the gap expressed in Equation (17).
Remark 7. 
When information is measured in bits, logarithms are taken to base 2, and when information is measured in nats, base $e$ is considered. Thus, we shall generally consider the gap $\Delta_b = Z_\alpha - \log_b(Z_\alpha) - 1$, where $b$ denotes the base of the logarithm. When $b = e$, we have $\Delta_e \geq 0$ for all $Z_\alpha > 0$. When $b = 2$, we have $\Delta_2 = Z_\alpha - \log_2(Z_\alpha) - 1 \geq 0$ when $0 < Z_\alpha \leq 1$ or $Z_\alpha \geq 2$. But since $Z_\alpha \leq 2$ (see Equation (7)), the condition simplifies to $\Delta_2 \geq 0$ if and only if $Z_\alpha \leq 1$.
Remark 8. 
Although $\sqrt{\mathrm{JS}}$ is a metric distance [5], $\sqrt{\mathrm{JS}_G}$ is not a metric distance, as the triangle inequality is not satisfied. It suffices to report a counterexample of the triangle inequality for a triple of points $p_1$, $p_2$, and $p_3$: Consider $p_1 = (0.55, 0.45)$, $p_2 = (0.002, 0.998)$, and $p_3 = (0.045, 0.955)$. Then we have $\mathrm{JS}_G(p_1,p_2) \approx 1.0263227$, $\mathrm{JS}_G(p_1,p_3) \approx 0.63852342$, and $\mathrm{JS}_G(p_3,p_2) \approx 0.19794622$. The triangle inequality fails with an error of
$\mathrm{JS}_G(p_1,p_2) - \left(\mathrm{JS}_G(p_1,p_3) + \mathrm{JS}_G(p_3,p_2)\right) \approx 0.1898531.$
Similarly, the triangle inequality also fails for the extended G-JSD: we have $\mathrm{JS}^+_G(p_1,p_2) \approx 1.0788275$, $\mathrm{JS}^+_G(p_1,p_3) \approx 0.6691922$, and $\mathrm{JS}^+_G(p_3,p_2) \approx 0.1984633$, with a triangle inequality defect value of
$\mathrm{JS}^+_G(p_1,p_2) - \left(\mathrm{JS}^+_G(p_1,p_3) + \mathrm{JS}^+_G(p_3,p_2)\right) \approx 0.2111719.$
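The counterexample of Remark 8 can be checked numerically with the following sketch (assuming NumPy). The absolute values depend on the scaling and logarithm conventions used to report the figures above, but the violation of the triangle inequality for the square root of the G-JSD is unaffected by such constant rescalings.

```python
# Sketch: check the triangle inequality for sqrt(JS_G) on the triple of Remark 8.
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def js_g(p, q):
    g = np.sqrt(p * q)
    m = g / g.sum()                   # normalized geometric mixture
    return 0.5 * (kl(p, m) + kl(q, m))

p1 = np.array([0.55, 0.45])
p2 = np.array([0.002, 0.998])
p3 = np.array([0.045, 0.955])

d12 = np.sqrt(js_g(p1, p2))
d13 = np.sqrt(js_g(p1, p3))
d32 = np.sqrt(js_g(p3, p2))
print(d12, d13 + d32, d12 > d13 + d32)   # True: triangle inequality fails
```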

2.2. Power JSDs and (Extended) Min-JSD and Max-JSD

Let $P_{\gamma,\alpha}(a,b) := (\alpha a^\gamma + (1-\alpha)b^\gamma)^{\frac{1}{\gamma}}$ be the $\gamma$-power mean for $\gamma\neq 0$ (with $A_\alpha = P_{1,\alpha}$). Further define $P_{0,\alpha}(a,b) = G_\alpha(a,b)$, so that $P_{\gamma,\alpha}$ defines the weighted power means for $\gamma\in\mathbb{R}$ and $\alpha\in(0,1)$ in the remainder. Since $P_{\gamma,\alpha}(a,b) \leq P_{\gamma',\alpha}(a,b)$ when $\gamma\leq\gamma'$ for any $a,b>0$, we have
$Z^{P_\gamma}_\alpha(p_1,p_2) = \int P_{\gamma,\alpha}(p_1(x),p_2(x))\,\mathrm{d}\mu \;\leq\; Z^{P_{\gamma'}}_\alpha(p_1,p_2) = \int P_{\gamma',\alpha}(p_1(x),p_2(x))\,\mathrm{d}\mu.$
Let $P_\gamma(a,b) = P_{\gamma,\frac{1}{2}}(a,b)$. We have $\lim_{\gamma\to-\infty} P_\gamma(a,b) = \min(a,b)$ and $\lim_{\gamma\to+\infty} P_\gamma(a,b) = \max(a,b)$. Thus, we can define both the (extended) min-JSD and the (extended) max-JSD. Using the fact that $\min(a,b) = \frac{a+b}{2} - \frac{1}{2}|a-b|$ and $\max(a,b) = \frac{a+b}{2} + \frac{1}{2}|a-b|$, we obtain the extremal mixture normalization terms as follows:
$Z_{\min}(p_1,p_2) = \int\min(p_1,p_2)\,\mathrm{d}\mu = 1 - \mathrm{TV}(p_1,p_2),$
$Z_{\max}(p_1,p_2) = \int\max(p_1,p_2)\,\mathrm{d}\mu = 1 + \mathrm{TV}(p_1,p_2),$
where $\mathrm{TV}(p_1,p_2) = \frac{1}{2}\int|p_1-p_2|\,\mathrm{d}\mu$ is the total variation distance.
Proposition 7 (max-JSD).
The following upper bound holds for max-JSD:
$0 \leq \mathrm{JS}^+_{\widetilde{\max}}(p_1,p_2) \leq \mathrm{TV}(p_1,p_2).$
Furthermore, the following identity relates the two types of max-JSDs:
$\mathrm{JS}^+_{\widetilde{\max}}(p_1,p_2) = \mathrm{JS}_{\max}(p_1,p_2) + \mathrm{TV}(p_1,p_2) - \log\left(1 + \mathrm{TV}(p_1,p_2)\right).$
Proof. 
We have
$\mathrm{JS}^+_{\widetilde{\max}}(p_1,p_2) := \frac{1}{2}\int\left(p_1\log\frac{p_1}{\max(p_1,p_2)} + p_2\log\frac{p_2}{\max(p_1,p_2)} + 2\max(p_1,p_2) - (p_1+p_2)\right)\mathrm{d}\mu.$
Since both $\log\frac{p_1}{\max(p_1,p_2)}\leq 0$ and $\log\frac{p_2}{\max(p_1,p_2)}\leq 0$, and $\max(a,b) = \frac{a+b}{2} + \frac{1}{2}|b-a|$, we have
$\mathrm{JS}^+_{\widetilde{\max}}(p_1,p_2) \leq \int\left(\frac{p_1+p_2}{2} + \frac{1}{2}|p_2-p_1| - \frac{p_1+p_2}{2}\right)\mathrm{d}\mu.$
That is, $\mathrm{JS}^+_{\widetilde{\max}}(p_1,p_2) \leq \mathrm{TV}(p_1,p_2)$.
We characterize the gap as follows:
$\Delta_{\max}(p_1,p_2) = Z_{\max}(p_1,p_2) - \log Z_{\max}(p_1,p_2) - 1 = \mathrm{TV}(p_1,p_2) - \log(1+\mathrm{TV}(p_1,p_2)) \geq 0,$
since $0\leq\mathrm{TV}\leq 1$. Thus, $\mathrm{JS}^+_{\widetilde{\max}}(p_1,p_2) \geq \mathrm{JS}_{\max}(p_1,p_2)$. □
Proposition 8 (min-JSD).
We have the following lower bound on the extended min-JSD:
$\mathrm{JS}^+_{\widetilde{\min}}(p_1,p_2) \geq \frac{1}{4}J(p_1,p_2) - \mathrm{TV}(p_1,p_2),$
where $J(p_1,p_2) := \mathrm{KL}(p_1,p_2) + \mathrm{KL}(p_2,p_1) = \int(p_1-p_2)\log\frac{p_1}{p_2}\,\mathrm{d}\mu$ is the Jeffreys divergence [2], and
$\mathrm{JS}^+_{\widetilde{\min}}(p_1,p_2) = \mathrm{JS}_{\min}(p_1,p_2) - \mathrm{TV}(p_1,p_2) - \log(1-\mathrm{TV}(p_1,p_2)).$
Proof. 
We have $Z_{\min}(p_1,p_2) = \int\min\{p_1,p_2\}\,\mathrm{d}\mu = 1 - \mathrm{TV}(p_1,p_2) \leq 1$ and
$\Delta_{\min}(p_1,p_2) = Z_{\min}(p_1,p_2) - \log Z_{\min}(p_1,p_2) - 1 = -\mathrm{TV}(p_1,p_2) - \log(1-\mathrm{TV}(p_1,p_2)) \geq 0,$
since $-x - \log(1-x) \geq 0$ for $0\leq x < 1$. Note that the gap can be arbitrarily large when $\mathrm{TV}(p_1,p_2)\to 1$.
Thus, we have $\mathrm{JS}^+_{\widetilde{\min}}(p_1,p_2) \geq \mathrm{JS}_{\min}(p_1,p_2)$.
To get the lower bound, we use the fact that $\min(p_1,p_2) \leq \sqrt{p_1p_2}$. Indeed, we have
$\mathrm{JS}^+_{\widetilde{\min}}(p_1,p_2) = \frac{1}{2}\int\left(p_1\log\frac{p_1}{\min(p_1,p_2)} + p_2\log\frac{p_2}{\min(p_1,p_2)} + 2\min(p_1,p_2) - (p_1+p_2)\right)\mathrm{d}\mu \geq \frac{1}{2}\int\left(\frac{1}{2}p_1\log\frac{p_1}{p_2} + \frac{1}{2}p_2\log\frac{p_2}{p_1} + 2\min(p_1,p_2) - (p_1+p_2)\right)\mathrm{d}\mu = \frac{1}{4}J(p_1,p_2) - \mathrm{TV}(p_1,p_2).$ □
Remark 9. 
Let us report the total variation distance between two univariate Gaussian distributions $p_{\mu_1,\sigma_1}$ and $p_{\mu_2,\sigma_2}$ in closed form using the error function [42] $\mathrm{erf}(x) = \frac{1}{\sqrt{\pi}}\int_{-x}^{x} e^{-t^2}\,\mathrm{d}t$.
  • When σ 1 = σ 2 = σ , we have
    $\mathrm{TV}(p_1,p_2) = \left|\Phi(x^*;\mu_2,\sigma) - \Phi(x^*;\mu_1,\sigma)\right|,$
    where $\Phi(x;\mu,\sigma) = \frac{1}{2}\left(1 + \mathrm{erf}\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right)$ is the cumulative distribution function, and
    $x^* = \frac{\mu_1^2 - \mu_2^2}{2(\mu_1-\mu_2)}.$
  • When $\sigma_1\neq\sigma_2$, we let $x_1 = \frac{-b-\sqrt{\Delta}}{2a}$ and $x_2 = \frac{-b+\sqrt{\Delta}}{2a}$, where $\Delta = b^2 - 4ac \geq 0$ and
    $a = \frac{1}{\sigma_1^2} - \frac{1}{\sigma_2^2},$
    $b = 2\left(\frac{\mu_2}{\sigma_2^2} - \frac{\mu_1}{\sigma_1^2}\right),$
    $c = \frac{\mu_1^2}{\sigma_1^2} - \frac{\mu_2^2}{\sigma_2^2} - 2\log\frac{\sigma_2}{\sigma_1}.$
    The total variation is given by
    $\mathrm{TV}(p_1,p_2) = \frac{1}{2}\left(\left|\mathrm{erf}\left(\frac{x_1-\mu_1}{\sigma_1\sqrt{2}}\right) - \mathrm{erf}\left(\frac{x_1-\mu_2}{\sigma_2\sqrt{2}}\right)\right| + \left|\mathrm{erf}\left(\frac{x_2-\mu_1}{\sigma_1\sqrt{2}}\right) - \mathrm{erf}\left(\frac{x_2-\mu_2}{\sigma_2\sqrt{2}}\right)\right|\right).$
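A possible implementation of the closed form of Remark 9 is sketched below (assuming NumPy and SciPy; the helper name tv_gaussians is ours). It uses the equivalent CDF form $\Phi = \frac{1}{2}(1+\mathrm{erf}(\cdot))$ of the erf expressions above and cross-checks the result by numerical integration.

```python
# Sketch of Remark 9: total variation between two univariate Gaussians.
import numpy as np
from scipy.special import erf
from scipy.stats import norm

def tv_gaussians(mu1, s1, mu2, s2):
    """Total variation between N(mu1, s1^2) and N(mu2, s2^2)."""
    Phi = lambda x, mu, s: 0.5 * (1.0 + erf((x - mu) / (s * np.sqrt(2.0))))
    if np.isclose(s1, s2):
        xs = 0.5 * (mu1 + mu2)                       # unique crossing point x*
        return abs(Phi(xs, mu2, s1) - Phi(xs, mu1, s1))
    # two crossing points: roots of a x^2 + b x + c = 0
    a = 1.0 / s1**2 - 1.0 / s2**2
    b = 2.0 * (mu2 / s2**2 - mu1 / s1**2)
    c = mu1**2 / s1**2 - mu2**2 / s2**2 - 2.0 * np.log(s2 / s1)
    d = np.sqrt(b**2 - 4.0 * a * c)
    x1, x2 = (-b - d) / (2.0 * a), (-b + d) / (2.0 * a)
    return (abs(Phi(x1, mu1, s1) - Phi(x1, mu2, s2))
            + abs(Phi(x2, mu1, s1) - Phi(x2, mu2, s2)))

# cross-check against a direct Riemann sum of (1/2) * int |p1 - p2|
xs = np.linspace(-30.0, 30.0, 600001)
num = 0.5 * np.sum(np.abs(norm.pdf(xs, 0.0, 1.0) - norm.pdf(xs, 1.0, 2.0))) * (xs[1] - xs[0])
print(tv_gaussians(0.0, 1.0, 1.0, 2.0), num)
```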
Next, we shall consider the important case of p 1 and p 2 belonging to the family of multivariate normal distributions, commonly called Gaussian distributions.

3. Geometric JSDs Between Gaussian Distributions

3.1. Exponential Families

The formula for the G-JSD between two Gaussian distributions was reported in [17] using the more general framework of exponential families. An exponential family [43] is a family of probability measures { P λ } with Radon–Nikodym densities p λ with respect to μ expressed canonically as
$p_\lambda(x) := \exp\left(\langle\theta(\lambda), t(x)\rangle - F(\theta) + k(x)\right) = \frac{1}{Z(\theta)}\exp\left(\langle\theta(\lambda), t(x)\rangle + k(x)\right),$
where $\theta(\lambda)$ is the natural parameter, $t(x)$ the sufficient statistic, $k(x)$ an auxiliary carrier term with respect to $\mu$, and $F(\theta)$ the cumulant function. The partition function $Z(\theta)$ is the normalizing denominator: $Z(\theta) = \exp(F(\theta))$. The cumulant function $F(\theta) = \log Z(\theta)$ is strictly convex and analytic [43], and the partition function $Z(\theta) = \exp(F(\theta))$ is strictly log-convex (and hence also strictly convex).
We consider the exponential family of multivariate Gaussian distributions
$\mathcal{N} = \{N(\mu,\Sigma) : \mu\in\mathbb{R}^d,\ \Sigma\in\mathrm{PD}(d)\},$
where PD ( d ) denotes the set of symmetric positive–definite matrices of size d × d . Let λ : = ( λ v , λ M ) = ( μ , Σ ) denote the compound (vector, matrix) parameter of a Gaussian. The d-variate Gaussian density is given by
$p_\lambda(x;\lambda) := \frac{1}{(2\pi)^{\frac{d}{2}}\sqrt{|\lambda_M|}}\exp\left(-\frac{1}{2}(x-\lambda_v)^\top\lambda_M^{-1}(x-\lambda_v)\right),$
where | · | denotes the matrix determinant. The natural parameters θ are expressed using both a vector parameter θ v and a matrix parameter θ M in a compound parameter θ = ( θ v , θ M ) . By defining the following compound inner product on a compound (vector, matrix) parameter
$\langle\theta,\theta'\rangle := \theta_v^\top\theta'_v + \mathrm{tr}\left(\theta_M^\top\theta'_M\right),$
where tr ( · ) denotes the matrix trace, we rewrite the Gaussian density of Equation (29) in the canonical form of an exponential family:
$p_\theta(x;\theta) := \exp\left(\langle t(x),\theta\rangle - F_\theta(\theta)\right) = p_\lambda(x),$
where θ = θ ( λ ) with
$\theta = (\theta_v,\theta_M) = \left(\Sigma^{-1}\mu,\, -\frac{1}{2}\Sigma^{-1}\right) = \theta(\lambda) = \left(\lambda_M^{-1}\lambda_v,\, -\frac{1}{2}\lambda_M^{-1}\right),$
is the compound vector-matrix natural parameter and
$t(x) = \left(x,\, xx^\top\right),$
is the compound vector-matrix sufficient statistic. There is no auxiliary carrier term (i.e., k ( x ) = 0 ). The function F θ is given by
$F_\theta(\theta) := \frac{1}{2}\left(d\log\pi - \log|{-\theta_M}| + \frac{1}{2}\theta_v^\top(-\theta_M)^{-1}\theta_v\right).$
Remark 10. 
Beware that when the cumulant function is expressed using the ordinary parameter λ = ( μ , Σ ) , the cumulant function F θ ( θ ( λ ) ) is no longer convex:
$F_\lambda(\lambda) = \frac{1}{2}\left(\lambda_v^\top\lambda_M^{-1}\lambda_v + \log|\lambda_M| + d\log 2\pi\right)$
$= \frac{1}{2}\left(\mu^\top\Sigma^{-1}\mu + \log|\Sigma| + d\log 2\pi\right).$
We convert between the ordinary parameterization λ = ( μ , Σ ) and the natural parameterization θ using these formulas:
$\theta = (\theta_v,\theta_M):\quad \theta_v(\lambda) = \lambda_M^{-1}\lambda_v = \Sigma^{-1}\mu,\quad \theta_M(\lambda) = -\frac{1}{2}\lambda_M^{-1} = -\frac{1}{2}\Sigma^{-1};\qquad \lambda = (\lambda_v,\lambda_M):\quad \lambda_v(\theta) = -\frac{1}{2}\theta_M^{-1}\theta_v = \mu,\quad \lambda_M(\theta) = -\frac{1}{2}\theta_M^{-1} = \Sigma.$
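The conversion formulas above translate directly into code; the following sketch (illustrative, assuming NumPy; function names are ours) round-trips between the ordinary and natural parameterizations of a multivariate Gaussian.

```python
# Sketch: ordinary (mu, Sigma) <-> natural (theta_v, theta_M) Gaussian parameters.
import numpy as np

def to_natural(mu, Sigma):
    """(mu, Sigma) -> (theta_v, theta_M) = (Sigma^{-1} mu, -Sigma^{-1}/2)."""
    P = np.linalg.inv(Sigma)
    return P @ mu, -0.5 * P

def to_ordinary(theta_v, theta_M):
    """(theta_v, theta_M) -> (mu, Sigma) = (-theta_M^{-1} theta_v / 2, -theta_M^{-1}/2)."""
    Sigma = -0.5 * np.linalg.inv(theta_M)
    return Sigma @ theta_v, Sigma

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
th_v, th_M = to_natural(mu, Sigma)
mu_back, Sigma_back = to_ordinary(th_v, th_M)
print(np.allclose(mu, mu_back), np.allclose(Sigma, Sigma_back))   # True True
```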
The normalized geometric mixture $p_{\theta_1}^\alpha\, p_{\theta_2}^{1-\alpha}$ of two densities of an exponential family is a density $p_{\alpha\theta_1+(1-\alpha)\theta_2}$ of the exponential family with normalizer $Z_\alpha(\theta_1,\theta_2) = \exp(-J_{F,\alpha}(\theta_1,\theta_2))$, where $J_{F,\alpha}(\theta_1,\theta_2)$ denotes the skew Jensen divergence [44,45]:
$J_{F,\alpha}(\theta_1,\theta_2) := \alpha F(\theta_1) + (1-\alpha)F(\theta_2) - F(\alpha\theta_1 + (1-\alpha)\theta_2).$
Therefore, the difference gap of Equation (17) between the G-JSD and the extended G-JSD between exponential family densities is given by
$\Delta(\theta_1,\theta_2) = \exp(-J_{F,\alpha}(\theta_1,\theta_2)) + J_{F,\alpha}(\theta_1,\theta_2) - 1$
$= Z_\alpha(\theta_1,\theta_2) - \log Z_\alpha(\theta_1,\theta_2) - 1$
$= Z_\alpha(\theta_1,\theta_2) + \alpha F(\theta_1) + (1-\alpha)F(\theta_2) - F(\alpha\theta_1+(1-\alpha)\theta_2) - 1.$
Since $Z_\alpha = \exp(-J_{F,\alpha}(\theta_1,\theta_2)) \leq 1$, the gap $\Delta$ is non-negative, and we have
$\mathrm{JS}^+_{\tilde{G}_\alpha,\beta}(p_{\mu_1,\Sigma_1}, p_{\mu_2,\Sigma_2}) \geq \mathrm{JS}_{G_\alpha,\beta}(p_{\mu_1,\Sigma_1}, p_{\mu_2,\Sigma_2}).$
Corollary 2. 
When $p_1 = p_{\theta_1}$ and $p_2 = p_{\theta_2}$ belong to the same exponential family with cumulant function $F(\theta)$, we have
$\mathrm{JS}_G(p_{\theta_1},p_{\theta_2}) = \frac{1}{4}(\theta_2-\theta_1)^\top\left(\nabla F(\theta_2) - \nabla F(\theta_1)\right) - \left(\frac{F(\theta_1)+F(\theta_2)}{2} - F\!\left(\frac{\theta_1+\theta_2}{2}\right)\right),$
since $J(p_{\theta_1},p_{\theta_2}) = \langle\theta_2-\theta_1,\, \nabla F(\theta_2)-\nabla F(\theta_1)\rangle$ amounts to a symmetrized Bregman divergence.
Proof. 
We have $J(p_{\theta_1},p_{\theta_2}) = (\theta_2-\theta_1)^\top(\nabla F(\theta_2)-\nabla F(\theta_1))$ and $B(p_{\theta_1},p_{\theta_2}) = J_F(\theta_1,\theta_2)$. □
The extended geometric Jensen–Shannon divergence and geometric Jensen–Shannon divergence between two densities of an exponential family are given by
$\mathrm{JS}_G(p_{\theta_1},p_{\theta_2}) = \frac{1}{4}(\theta_2-\theta_1)^\top(\nabla F(\theta_2)-\nabla F(\theta_1)) - \left(\frac{F(\theta_1)+F(\theta_2)}{2} - F\!\left(\frac{\theta_1+\theta_2}{2}\right)\right),$
$\mathrm{JS}^+_{\tilde{G}}(p_{\theta_1},p_{\theta_2}) = \frac{1}{4}\langle\theta_2-\theta_1,\, \nabla F(\theta_2)-\nabla F(\theta_1)\rangle + \exp(-J_F(\theta_1,\theta_2)) - 1,$
$\mathrm{JS}^*_G(p_{\theta_1},p_{\theta_2}) = J_F(\theta_1,\theta_2).$
Remark 11. 
Given two densities $p_1$ and $p_2$, the family $\mathcal{G}$ of geometric mixtures $\{(p_1p_2)^{G_\alpha}\propto p_1^\alpha p_2^{1-\alpha} : \alpha\in(0,1)\}$ forms a 1D exponential family that has been termed the likelihood ratio exponential family [46] (LREF). The cumulant function of this LREF is $F(\alpha) = B_\alpha(p_1,p_2)$. Hence, $\mathcal{G}$ has also been called a Bhattacharyya arc or Hellinger arc in the literature [47]. However, notice that $\mathrm{KL}(p_i : (p_1p_2)^{G_\alpha})$ does not necessarily amount to a Bregman divergence, because neither $p_1$ nor $p_2$ belongs to $\mathcal{G}$.

3.2. Closed-Form Formula for Gaussian Distributions

Let us report the corresponding closed-form formula for d-variate Gaussian distributions.
When $\alpha = \frac{1}{2}$, we proved that $\mathrm{JS}_G(p_1,p_2) = \frac{1}{4}J(p_1,p_2) - B(p_1,p_2)$ and $\mathrm{JS}^+_{\tilde{G}}(p_1,p_2) = \frac{1}{4}J(p_1,p_2) + \exp(-B(p_1,p_2)) - 1$, where $\mathrm{BC}(p_1,p_2) = \exp(-B(p_1,p_2))$. Thus, for the case of balanced geometric mixtures, we only need the closed forms of the Jeffreys and Bhattacharyya distances:
$J(p_{\mu_1,\Sigma_1}, p_{\mu_2,\Sigma_2}) = \frac{1}{2}\left(\mathrm{tr}\left(\Sigma_1\Sigma_2^{-1} + \Sigma_2\Sigma_1^{-1}\right) + (\mu_1-\mu_2)^\top(\Sigma_1^{-1}+\Sigma_2^{-1})(\mu_1-\mu_2) - 2d\right),$
$B(p_{\mu_1,\Sigma_1}, p_{\mu_2,\Sigma_2}) = \frac{1}{8}(\mu_1-\mu_2)^\top\bar{\Sigma}^{-1}(\mu_1-\mu_2) + \frac{1}{2}\log\frac{\det\bar{\Sigma}}{\sqrt{\det\Sigma_1\,\det\Sigma_2}},$
where $\bar{\Sigma} = \frac{1}{2}(\Sigma_1+\Sigma_2)$.
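These closed forms give, for the balanced case $\alpha=\frac{1}{2}$, a direct implementation of both G-JSDs between multivariate Gaussians; the following sketch (illustrative, assuming NumPy; function names are ours) evaluates $\mathrm{JS}_G = \frac{1}{4}J - B$ and $\mathrm{JS}^+_{\tilde{G}} = \frac{1}{4}J + \exp(-B) - 1$.

```python
# Sketch: closed-form balanced G-JSD and extended G-JSD between multivariate Gaussians.
import numpy as np

def jeffreys_gauss(mu1, S1, mu2, S2):
    d = len(mu1)
    S1i, S2i = np.linalg.inv(S1), np.linalg.inv(S2)
    dm = mu1 - mu2
    return 0.5 * (np.trace(S1 @ S2i + S2 @ S1i) + dm @ (S1i + S2i) @ dm - 2 * d)

def bhattacharyya_gauss(mu1, S1, mu2, S2):
    Sb = 0.5 * (S1 + S2)
    dm = mu1 - mu2
    return (0.125 * dm @ np.linalg.solve(Sb, dm)
            + 0.5 * np.log(np.linalg.det(Sb)
                           / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2))))

def g_jsd_gauss(mu1, S1, mu2, S2, extended=False):
    J = jeffreys_gauss(mu1, S1, mu2, S2)
    B = bhattacharyya_gauss(mu1, S1, mu2, S2)
    return 0.25 * J + (np.exp(-B) - 1.0 if extended else -B)

mu1, S1 = np.zeros(2), np.eye(2)
mu2, S2 = np.array([1.0, 0.0]), np.diag([2.0, 0.5])
print(g_jsd_gauss(mu1, S1, mu2, S2), g_jsd_gauss(mu1, S1, mu2, S2, extended=True))
```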
Otherwise, for the arbitrary weighted geometric mixture G α , define ( θ 1 θ 2 ) α = α θ 1 + ( 1 α ) θ 2 , the weighted linear interpolation of the natural parameters θ 1 and θ 2 .
Corollary 3. 
The skew G-Jensen–Shannon divergence $\mathrm{JS}_{G_\alpha}$ and the dual skew G-Jensen–Shannon divergence $\mathrm{JS}^*_{G_\alpha}$ between two $d$-variate Gaussian distributions $N(\mu_1,\Sigma_1)$ and $N(\mu_2,\Sigma_2)$ are
$\mathrm{JS}_{G_\alpha}(p_{(\mu_1,\Sigma_1)}, p_{(\mu_2,\Sigma_2)}) = \alpha\,\mathrm{KL}(p_{(\mu_1,\Sigma_1)}, p_{(\mu_\alpha,\Sigma_\alpha)}) + (1-\alpha)\,\mathrm{KL}(p_{(\mu_2,\Sigma_2)}, p_{(\mu_\alpha,\Sigma_\alpha)}) = \alpha\,B_F((\theta_1\theta_2)_\alpha, \theta_1) + (1-\alpha)\,B_F((\theta_1\theta_2)_\alpha, \theta_2) = \frac{1}{2}\left(\mathrm{tr}\left(\Sigma_\alpha^{-1}(\alpha\Sigma_1 + (1-\alpha)\Sigma_2)\right) + \log\frac{|\Sigma_\alpha|}{|\Sigma_1|^\alpha|\Sigma_2|^{1-\alpha}} + \alpha(\mu_\alpha-\mu_1)^\top\Sigma_\alpha^{-1}(\mu_\alpha-\mu_1) + (1-\alpha)(\mu_\alpha-\mu_2)^\top\Sigma_\alpha^{-1}(\mu_\alpha-\mu_2) - d\right),$
$\mathrm{JS}^*_{G_\alpha}(p_{(\mu_1,\Sigma_1)}, p_{(\mu_2,\Sigma_2)}) = \alpha\,\mathrm{KL}(p_{(\mu_\alpha,\Sigma_\alpha)}, p_{(\mu_1,\Sigma_1)}) + (1-\alpha)\,\mathrm{KL}(p_{(\mu_\alpha,\Sigma_\alpha)}, p_{(\mu_2,\Sigma_2)}) = \alpha\,B_F(\theta_1, (\theta_1\theta_2)_\alpha) + (1-\alpha)\,B_F(\theta_2, (\theta_1\theta_2)_\alpha) = J_{F,\alpha}(\theta_1,\theta_2) =: B_\alpha(p_{(\mu_1,\Sigma_1)}, p_{(\mu_2,\Sigma_2)}) = \frac{1}{2}\left(\alpha\,\mu_1^\top\Sigma_1^{-1}\mu_1 + (1-\alpha)\,\mu_2^\top\Sigma_2^{-1}\mu_2 - \mu_\alpha^\top\Sigma_\alpha^{-1}\mu_\alpha + \log\frac{|\Sigma_1|^\alpha|\Sigma_2|^{1-\alpha}}{|\Sigma_\alpha|}\right),$
with $F(\mu,\Sigma) = \frac{1}{2}\left(\mu^\top\Sigma^{-1}\mu + \log|\Sigma| + d\log 2\pi\right)$, $F(\theta_v,\theta_M) = \frac{1}{2}\left(d\log\pi - \log|{-\theta_M}| + \frac{1}{2}\theta_v^\top(-\theta_M)^{-1}\theta_v\right)$, and $\Delta(\theta_1,\theta_2) = \exp(-J_{F,\alpha}(\theta_1,\theta_2)) + J_{F,\alpha}(\theta_1,\theta_2) - 1,$
where Σ α is the matrix harmonic barycenter:
$\Sigma_\alpha = \left(\alpha\Sigma_1^{-1} + (1-\alpha)\Sigma_2^{-1}\right)^{-1},$
and
$\mu_\alpha = \Sigma_\alpha\left(\alpha\Sigma_1^{-1}\mu_1 + (1-\alpha)\Sigma_2^{-1}\mu_2\right).$

4. The Extended and Normalized G-JSDs as Regularizations of the Ordinary JSD

The M-Jensen–Shannon divergence JS M ( p , q ) can be interpreted as a regularization of the ordinary JSD:
Proposition 9 (JSD regularization).
For any arbitrary mean M, the following identity holds:
$\mathrm{JS}_M(p_1,p_2) = \mathrm{JS}(p_1,p_2) + \mathrm{KL}\!\left(\frac{p_1+p_2}{2},\, (p_1p_2)^M\right).$
Notice that $(p_1p_2)^A = \frac{p_1+p_2}{2}$.
Proof. 
We have
$\mathrm{JS}_M(p_1,p_2) := \frac{1}{2}\left(\mathrm{KL}(p_1,(p_1p_2)^M) + \mathrm{KL}(p_2,(p_1p_2)^M)\right) = \frac{1}{2}\int\left(p_1\log\frac{p_1}{(p_1p_2)^A}\frac{(p_1p_2)^A}{(p_1p_2)^M} + p_2\log\frac{p_2}{(p_1p_2)^A}\frac{(p_1p_2)^A}{(p_1p_2)^M}\right)\mathrm{d}\mu = \frac{1}{2}\int\left(p_1\log\frac{p_1}{(p_1p_2)^A} + p_2\log\frac{p_2}{(p_1p_2)^A}\right)\mathrm{d}\mu + \frac{1}{2}\int(p_1+p_2)\log\frac{(p_1p_2)^A}{(p_1p_2)^M}\,\mathrm{d}\mu = \mathrm{JS}(p_1,p_2) + \int(p_1p_2)^A\log\frac{(p_1p_2)^A}{(p_1p_2)^M}\,\mathrm{d}\mu = \mathrm{JS}(p_1,p_2) + \mathrm{KL}((p_1p_2)^A,\, (p_1p_2)^M).$ □
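Proposition 9 can be verified numerically; the sketch below (illustrative, assuming NumPy) checks, for the geometric mean, that $\mathrm{JS}_G$ equals the ordinary JSD plus the regularizer $\mathrm{KL}\big(\frac{p_1+p_2}{2},(p_1p_2)^G\big)$.

```python
# Numerical sketch of Proposition 9 (discrete case, M = geometric mean).
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

p1 = np.array([0.55, 0.45])
p2 = np.array([0.002, 0.998])

a = 0.5 * (p1 + p2)                      # arithmetic mixture (p1 p2)^A
g = np.sqrt(p1 * p2); g = g / g.sum()    # normalized geometric mixture (p1 p2)^G

js   = 0.5 * (kl(p1, a) + kl(p2, a))     # ordinary JSD
js_g = 0.5 * (kl(p1, g) + kl(p2, g))     # G-JSD
print(js_g, js + kl(a, g))               # identical values
```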
Remark 12. 
One way to symmetrize the KLD is to consider two distinct symmetric means M 1 ( a , b ) = M 1 ( b , a ) and M 2 ( a , b ) = M 2 ( b , a ) and define
$\mathrm{KL}_{M_1,M_2}(p_1,p_2) = \mathrm{KL}((p_1p_2)^{M_1},\, (p_1p_2)^{M_2}) = \mathrm{KL}_{M_1,M_2}(p_2,p_1).$
We notice that KL A , G is not a metric distance by reporting a triple of points ( p 1 , p 2 , p 3 ) that fails the triangle inequality. Consider p 1 = ( 0.55 , 0.45 ) , p 2 = ( 0.002 , 0.998 ) , and p 3 = ( 0.045 , 0.955 ) . We have KL M 1 , M 2 ( p 1 , p 2 ) = 0.5374165 , KL M 1 , M 2 ( p 1 , p 3 ) = 0.1759400 , and KL M 1 , M 2 ( p 3 , p 2 ) = 0.08485931 . The triangle inequality defect is
$\mathrm{KL}_{M_1,M_2}(p_1,p_2) - \left(\mathrm{KL}_{M_1,M_2}(p_1,p_3) + \mathrm{KL}_{M_1,M_2}(p_3,p_2)\right) = 0.2766171.$
We can also similarly symmetrize the extended KLD as follows:
$\mathrm{KL}^+_{\tilde{M}_1,\tilde{M}_2}(q_1,q_2) = \mathrm{KL}^+((q_1q_2)^{\tilde{M}_1},\, (q_1q_2)^{\tilde{M}_2}) = \mathrm{KL}^+_{\tilde{M}_1,\tilde{M}_2}(q_2,q_1).$
In particular, when $M_1 = A$ and $M_2 = G$, we get the $\mathrm{KL}_{A,G}$ divergence:
$\mathrm{KL}_{A,G}(p_1,p_2) = \int\frac{p_1+p_2}{2}\log\frac{p_1+p_2}{2\sqrt{p_1p_2}}\,\mathrm{d}\mu + \log Z_G(p_1,p_2),$
which is related to Taneja T-divergence [48]:
$T(p_1,p_2) = \int\frac{p_1+p_2}{2}\log\frac{p_1+p_2}{2\sqrt{p_1p_2}}\,\mathrm{d}\mu.$
The T-divergence is an f-divergence [11,12] obtained for the generator $f_T(u) = \frac{1+u}{2}\log\frac{1+u}{2\sqrt{u}}$.
Corollary 4 (JSD lower bound on M-JSD).
We have $\mathrm{JS}_M(p,q) \geq \mathrm{JS}(p,q)$.
Proof. 
Since $\mathrm{JS}_M(p,q) = \mathrm{JS}(p,q) + \mathrm{KL}\!\left(\frac{p+q}{2},(pq)^M\right)$ and $\mathrm{KL}\geq 0$ by Gibbs' inequality, we have $\mathrm{JS}_M(p,q)\geq\mathrm{JS}(p,q)$. □
Since the extended M-JSD satisfies $\mathrm{JS}^+_{\tilde{M}_\alpha,\beta}(p_1,p_2) = \mathrm{JS}_{M_\alpha,\beta}(p_1,p_2) + Z_\alpha - \log(Z_\alpha) - 1$, the extended M-JSD $\mathrm{JS}^+_{\tilde{M}_\alpha,\beta}$ can also be interpreted as another regularization of the Jensen–Shannon divergence when dealing with probability densities:
$\mathrm{JS}^+_{\tilde{M}_\alpha,\beta}(p_1,p_2) = \mathrm{JS}(p_1,p_2) + \mathrm{KL}\!\left(\frac{p_1+p_2}{2},\, (p_1p_2)^M\right) + Z_{M_\alpha}(p_1,p_2) - \log(Z_{M_\alpha}(p_1,p_2)) - 1.$
It is well known that the JSD can be rewritten as a diversity index [4] using concave entropy:
$\mathrm{JS}(p_1,p_2) = H\!\left(\frac{p_1+p_2}{2}\right) - \frac{H(p_1)+H(p_2)}{2}.$
We generalize this decomposition as the difference of a cross-entropy term minus entropies, as follows:
Proposition 10 (M-JSD cross-entropy decomposition).
We have
$\mathrm{JS}_M(p_1,p_2) = H^\times\!\left((p_1p_2)^A,\, (p_1p_2)^M\right) - \frac{H(p_1)+H(p_2)}{2}.$
Proof. 
From Proposition 9, we have
$\mathrm{JS}_M(p_1,p_2) = \mathrm{JS}(p_1,p_2) + \mathrm{KL}\!\left(\frac{p_1+p_2}{2},\, (p_1p_2)^M\right).$
Furthermore, $\mathrm{KL}(p_1,p_2) = H^\times(p_1,p_2) - H(p_1)$, where $H^\times(p_1,p_2) = -\int p_1\log p_2\,\mathrm{d}\mu$ is the cross-entropy and $H(p) = -\int p\log p\,\mathrm{d}\mu = H^\times(p,p)$ is the entropy. Plugging Equation (46) into Equation (43), we get
$\mathrm{JS}_M(p_1,p_2) = H\!\left(\frac{p_1+p_2}{2}\right) - \frac{H(p_1)+H(p_2)}{2} + H^\times\!\left(\frac{p_1+p_2}{2},\, (p_1p_2)^M\right) - H\!\left(\frac{p_1+p_2}{2}\right) = H^\times\!\left(\frac{p_1+p_2}{2},\, (p_1p_2)^M\right) - \frac{H(p_1)+H(p_2)}{2}.$
Note that when $M = A$, the arithmetic mean, we have $H^\times\!\left(\frac{p_1+p_2}{2},(p_1p_2)^M\right) = H\!\left(\frac{p_1+p_2}{2}\right)$ and we recover the fact that $\mathrm{JS}_M(p_1,p_2) = \mathrm{JS}(p_1,p_2)$. □

5. Estimation and Approximation of the Extended and Normalized M-JSDs

Let us recall the two definitions of the extended M-JSD and the normalized M-JSD (for the case of α = β = 1 2 ) between two normalized densities p 1 and p 2 :
$\mathrm{JS}_M(p_1,p_2) = \frac{1}{2}\left(\mathrm{KL}(p_1,(p_1p_2)^M) + \mathrm{KL}(p_2,(p_1p_2)^M)\right), \qquad \mathrm{JS}^+_M(p_1,p_2) = \frac{1}{2}\left(\mathrm{KL}^+(p_1,(p_1p_2)^{\tilde{M}}) + \mathrm{KL}^+(p_2,(p_1p_2)^{\tilde{M}})\right),$
where $(p_1p_2)^M(x) = \frac{M(p_1(x),p_2(x))}{Z_M(p_1,p_2)}$ (with $Z_M(p_1,p_2) = \int M(p_1(x),p_2(x))\,\mathrm{d}\mu(x)$) and $(p_1p_2)^{\tilde{M}}(x) = M(p_1(x),p_2(x))$.
In practice, one needs to estimate the extended and normalized G-JSDs when they do not admit a closed-form formula.

5.1. Monte Carlo Estimators

To estimate JS M ( p 1 , p 2 ) , we can use Monte Carlo samplings to estimate both KLD integrals and mixture normalizers Z M ; for example, the normalizer Z M ( p 1 , p 2 ) is estimated by
$\hat{Z}_M(p_1,p_2) = \frac{1}{s}\sum_{i=1}^{s}\frac{1}{r(x_i)}\,M(p_1(x_i),p_2(x_i)),$
where $r(x)$ is the proposal distribution, which can be chosen according to the mean $M$ and the types of probability distributions $p_1$ and $p_2$, and $x_1,\ldots,x_s$ are $s$ samples drawn independently and identically (i.i.d.) from $r(x)$. However, since $(p_1p_2)^M(x)$ is now estimated as $(p_1p_2)^{\hat{M}}(x)$, it is no longer a normalized $M$-mixture, and thus we consider estimating
$\mathrm{JS}^+_{\hat{M}}(p_1,p_2) = \frac{1}{2}\left(\mathrm{KL}^+(p_1,(p_1p_2)^{\hat{M}}) + \mathrm{KL}^+(p_2,(p_1p_2)^{\hat{M}})\right)$
to ensure the non-negativity of the divergence $\mathrm{JS}_{\hat{M}}$.
Let us consider the estimation of the term
$\mathrm{KL}^+\!\left(p_1,(p_1p_2)^{\tilde{M}}\right) = \int\left(p_1\log\frac{p_1}{M(p_1,p_2)} + M(p_1,p_2) - p_1\right)\mathrm{d}\mu.$
By choosing the proposal distribution $r(x) = p_1(x)$, we have $\mathrm{KL}^+(p_1,(p_1p_2)^{\tilde{M}}) \approx \widehat{\mathrm{KL}}{}^+(p_1,(p_1p_2)^{\tilde{M}})$ (for large enough $s$), where
$\widehat{\mathrm{KL}}{}^+\!\left(p_1,(p_1p_2)^{\tilde{M}}\right) = \frac{1}{s}\sum_{i=1}^{s}\left(\log\frac{p_1(x_i)}{M(p_1(x_i),p_2(x_i))} + \frac{M(p_1(x_i),p_2(x_i))}{p_1(x_i)} - 1\right).$
Monte Carlo (MC) stochastic integration [49] is a well-studied topic in statistics, with many results available regarding the consistency and variance of MC estimators.
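The following sketch (illustrative, assuming NumPy and SciPy) implements the Monte Carlo estimator above with proposal $r = p_1$ for two univariate Gaussians, and compares it against the exact value $\frac{1}{2}\mathrm{KL}(p_1,p_2) + \mathrm{BC}(p_1,p_2) - 1$ obtained from the standard univariate Gaussian KLD and Bhattacharyya formulas (these closed forms are standard facts, not taken from this paper).

```python
# Sketch: Monte Carlo estimate of KL+(p1, (p1 p2)^G~) with proposal r = p1.
import numpy as np
from scipy.stats import norm

mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0
p1, p2 = norm(mu1, s1), norm(mu2, s2)

rng = np.random.default_rng(0)
s = 100_000
x = p1.rvs(size=s, random_state=rng)       # samples from the proposal r = p1
f1, f2 = p1.pdf(x), p2.pdf(x)
mix = np.sqrt(f1 * f2)                     # unnormalized geometric mixture (G~)
kl_plus_hat = np.mean(np.log(f1 / mix) + mix / f1 - 1.0)

# exact value: KL+(p1, sqrt(p1 p2)) = KL(p1, p2)/2 + BC(p1, p2) - 1
kl12 = np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5
B = 0.25 * (mu1 - mu2)**2 / (s1**2 + s2**2) + 0.5 * np.log((s1**2 + s2**2) / (2 * s1 * s2))
print(kl_plus_hat, 0.5 * kl12 + np.exp(-B) - 1.0)   # estimator vs exact value
```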
Note that even if we have a generic formula for the G-JSD between two densities of an exponential family given by Corollary 2, the cumulant function F ( θ ) may not be in closed form [50,51]. This is the case when the sufficient statistic vector of the exponential family is t ( x ) = ( x , x 2 , , x m ) (for m 5 ), yielding the polynomial exponential family (also called exponential–polynomial family [51]).

5.2. Approximations via γ -Divergences

One way to circumvent the lack of computational tractable density normalizers is to consider the family of γ -divergences [39] instead of the KLD:
$\tilde{D}_\gamma(q_1,q_2) = \frac{1}{\gamma(1+\gamma)}\log I_\gamma(q_1,q_1) - \frac{1}{\gamma}\log I_\gamma(q_1,q_2) + \frac{1}{1+\gamma}\log I_\gamma(q_2,q_2), \quad \gamma > 0,$
where
$I_\gamma(q_1,q_2) = \int q_1(x)\, q_2^\gamma(x)\,\mathrm{d}\mu(x).$
The γ -divergences are projective divergences, i.e., they enjoy the property that
$\tilde{D}_\gamma(\lambda_1 q_1, \lambda_2 q_2) = \tilde{D}_\gamma(q_1,q_2), \quad \forall\,\lambda_1>0,\ \lambda_2>0.$
Furthermore, we have $\lim_{\gamma\to 0}\tilde{D}_\gamma(p_1,p_2) = \mathrm{KL}(p_1,p_2)$. (Note that the KLD is not projective.)
Let us define the projective M-JSD:
$\mathrm{JS}_{\tilde{M},\gamma}(p_1,p_2) = \frac{1}{2}\left(\tilde{D}_\gamma(p_1,(p_1p_2)^{\tilde{M}}) + \tilde{D}_\gamma(p_2,(p_1p_2)^{\tilde{M}})\right).$
We have, for $\gamma = \epsilon$ a small enough value (e.g., $\epsilon \approx 10^{-3}$), $\mathrm{JS}_M(p_1,p_2) \approx \mathrm{JS}_{\tilde{M},\gamma}(p_1,p_2)$, since
$\mathrm{KL}(p_1,(p_1p_2)^M) \underset{\gamma=\epsilon}{\approx} \tilde{D}_\gamma(p_1,(p_1p_2)^{\tilde{M}}).$
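A discrete illustration of this approximation is sketched below (assuming NumPy; function names are ours): for a small $\gamma$, the projective $\gamma$-divergence evaluated against the unnormalized geometric mixture approaches the normalized G-JSD.

```python
# Sketch: projective gamma-divergence approximation of the G-JSD (no mixture normalization).
import numpy as np

def I_gamma(q1, q2, g):
    return np.sum(q1 * q2**g)

def D_gamma(q1, q2, g):
    return (np.log(I_gamma(q1, q1, g)) / (g * (1 + g))
            - np.log(I_gamma(q1, q2, g)) / g
            + np.log(I_gamma(q2, q2, g)) / (1 + g))

def js_proj_gamma(p1, p2, g=1e-3):
    mix = np.sqrt(p1 * p2)                     # unnormalized geometric mixture
    return 0.5 * (D_gamma(p1, mix, g) + D_gamma(p2, mix, g))

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

p1 = np.array([0.55, 0.45])
p2 = np.array([0.002, 0.998])
m = np.sqrt(p1 * p2); m = m / m.sum()
js_g_exact = 0.5 * (kl(p1, m) + kl(p2, m))
print(js_proj_gamma(p1, p2), js_g_exact)       # close for small gamma
```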
In particular, for exponential family densities $p_{\theta_1}(x) = \frac{q_{\theta_1}(x)}{Z(\theta_1)}$ and $p_{\theta_2}(x) = \frac{q_{\theta_2}(x)}{Z(\theta_2)}$, we have
$I_\gamma(p_{\theta_1},p_{\theta_2}) = \exp\left(F(\theta_1 + \gamma\theta_2) - F(\theta_1) - \gamma F(\theta_2)\right),$
provided that θ 1 + γ θ 2 belongs to the natural parameter space (otherwise, the integral I γ diverges).
Even when F ( θ ) is not known in closed form, we may estimate the γ -divergence by estimating the I γ integrals as follows:
$\hat{I}_\gamma(q_{\theta_1},q_{\theta_2}) \approx \frac{1}{s}\sum_{i=1}^{s} q_2^\gamma(x_i),$
where $x_1,\ldots,x_s$ are sampled i.i.d. from $p_1(x)$. For example, we may use Monte Carlo importance sampling methods [52] or exponential family Langevin dynamics [53] to sample densities of exponential families with computationally intractable normalizers (e.g., polynomial exponential families).

6. Summary and Concluding Remarks

In this paper, we first recalled the Jensen–Shannon symmetrization (JS-symmetrization) scheme of [17] for an arbitrary statistical dissimilarity D ( · , · ) using an arbitrary weighted scalar mean M α as follows:
$D^{\mathrm{JS}}_{M_\alpha,\beta}(p_1,p_2) := \beta\, D\!\left(p_1, (p_1p_2)^{M_\alpha}\right) + (1-\beta)\, D\!\left(p_2, (p_1p_2)^{M_\alpha}\right), \quad (\alpha,\beta)\in(0,1)^2.$
In particular, we showed that the skewed Bhattacharyya distance and the Chernoff information can both be interpreted as JS-symmetrizations of the reverse Kullback–Leibler divergence.
Then we defined two types of geometric Jensen–Shannon divergences between probability densities. The first type, $\mathrm{JS}_M$, requires normalization of the $M$-mixtures and relies on the Kullback–Leibler divergence: $\mathrm{JS}_M = \mathrm{KL}^{\mathrm{JS}}_{M_{\frac{1}{2}},\frac{1}{2}}$. The second type, $\mathrm{JS}^+_{\tilde{M}}$, does not normalize the $M$-mixtures and uses the extended Kullback–Leibler divergence $\mathrm{KL}^+$ to take into account unnormalized mixtures: $\mathrm{JS}^+_{\tilde{M}} = \mathrm{KL}^{\mathrm{JS}+}_{\tilde{M}_{\frac{1}{2}},\frac{1}{2}}$. When $M$ is the arithmetic mean $A$, both M-JSD types coincide with the ordinary Jensen–Shannon divergence of Equation (2).
We have shown that both M-JSD types can be interpreted as regularized Jensen–Shannon divergences JS with additive terms. Namely, we have
$\mathrm{JS}_M(p_1,p_2) = \mathrm{JS}(p_1,p_2) + \mathrm{KL}((p_1p_2)^A,\, (p_1p_2)^M),$
$\mathrm{JS}^+_{\tilde{M}}(p_1,p_2) = \mathrm{JS}_M(p_1,p_2) + Z_M(p_1,p_2) - \log Z_M(p_1,p_2) - 1 = \mathrm{JS}(p_1,p_2) + \mathrm{KL}((p_1p_2)^A,\, (p_1p_2)^M) + Z_M(p_1,p_2) - \log Z_M(p_1,p_2) - 1,$
where $Z_M(p_1,p_2) = \int M(p_1,p_2)\,\mathrm{d}\mu$ is the $M$-mixture normalizer. The gap between these two types of M-JSD is
$\Delta_M(p_1,p_2) = \mathrm{JS}^+_{\tilde{M}}(p_1,p_2) - \mathrm{JS}_M(p_1,p_2) = Z_M(p_1,p_2) - \log Z_M(p_1,p_2) - 1.$
When taking the geometric mean M = G , we showed that both G-JSD types can be expressed using the Jeffreys divergence and the Bhattacharyya divergence (or Bhattacharyya coefficient):
$\mathrm{JS}_G(p_1,p_2) = \frac{1}{4}J(p_1,p_2) - B(p_1,p_2), \qquad \mathrm{JS}^+_{\tilde{G}}(p_1,p_2) = \frac{1}{4}J(p_1,p_2) + \exp(-B(p_1,p_2)) - 1 = \frac{1}{4}J(p_1,p_2) + \mathrm{BC}(p_1,p_2) - 1.$
Thus, the gap between these two types of G-JSD is
$\Delta_G(p_1,p_2) := \mathrm{JS}^+_{\tilde{G}}(p_1,p_2) - \mathrm{JS}_G(p_1,p_2) = \mathrm{BC}(p_1,p_2) + B(p_1,p_2) - 1 = Z_G(p_1,p_2) - \log Z_G(p_1,p_2) - 1,$
since $Z_G(p_1,p_2) = \int\sqrt{p_1p_2}\,\mathrm{d}\mu = \mathrm{BC}(p_1,p_2)$.
Although the square root of the Jensen–Shannon divergence yields a metric distance, this is no longer the case for the geometric-JSD and the extended geometric-JSD: we reported counterexamples in Remark 8. Moreover, we have shown that the KL symmetrization KL ( ( p 1 p 2 ) A , ( p 1 p 2 ) G ) is not a metric distance (Remark 12).
We discussed the merit of the extended G-JSD, which does not require normalization of the geometric mixture, in Section 5, and we showed how to approximate the G-JSD using the projective $\gamma$-divergences [39] for $\gamma = \epsilon$, a small enough value (e.g., $\gamma = \epsilon = 10^{-3}$). From the viewpoint of information geometry, the extended G-JSD has been shown to be an f-divergence [13] (a separable divergence), while the G-JSD is not separable in general because of the normalization of mixtures (with the exception of the ordinary JSD, which is an f-divergence because arithmetic mixtures do not require normalization).
We studied power JSDs by considering the power means and studied the $\gamma\to\pm\infty$ limits, the extended max-JSD and the min-JSD: We proved that the extended max-JSD is upper-bounded by the total variation distance $\mathrm{TV}(p_1,p_2) = \frac{1}{2}\int|p_1-p_2|\,\mathrm{d}\mu$:
$0 \leq \mathrm{JS}^+_{\widetilde{\max}}(p_1,p_2) \leq \mathrm{TV}(p_1,p_2),$
and that the extended min-JSD is lower-bounded as follows:
$\mathrm{JS}^+_{\widetilde{\min}}(p_1,p_2) \geq \frac{1}{4}J(p_1,p_2) - \mathrm{TV}(p_1,p_2),$
where J denotes the Jeffreys divergence: J ( p 1 , p 2 ) = KL ( p 1 , p 2 ) + KL ( p 2 , p 1 ) .
The advantage of using the extended G-JSD is that we do not need to normalize geometric mixtures, while this novel divergence is proven to be an f-divergence [13] and retains the property that it amounts to a regularization of the ordinary Jensen–Shannon divergence by an extra additive gap term.
Finally, we expressed JS G (Equation (41)) and JS G ˜ + (Equation (41)) for exponential families, characterized the gap between these two types of divergences as a function of the cumulant and partition functions, and reported a corresponding explicit formula for the multivariate Gaussian (exponential) family. The G-JSD between Gaussian distributions has already been used successfully in many applications [30,32,33,34,35,36,37,38].

Funding

This research received no external funding.

Acknowledgments

The Author would like to thank the two Reviewers for their insightful, detailed, and constructive comments and feedback.

Conflicts of Interest

Author Frank Nielsen was employed by the company Sony Computer Science Laboratories. The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Nomenclature

Means:
M α ( a , b ) weighted scalar mean
M α ϕ ( a , b ) weighted quasi-arithmetic scalar mean for generator ϕ ( u )
A ( a , b ) arithmetic mean
A α ( a , b ) weighted arithmetic mean
G α ( a , b ) weighted geometric mean
G ( a , b ) geometric mean
P γ ( a , b ) power mean with P 0 = G and P 1 = A
P γ , α ( a , b ) weighted power mean
Densities on measure space ( X , E , μ ) :
p , p 1 , p 2 , normalized density
q , q 1 , q 2 , unnormalized density
Z ( q ) density normalizer p = q Z ( q )
Z M ( p 1 , p 2 ) normalizer of M-mixture ( α = 1 2 )
Z ^ M ( p 1 , p 2 ) Monte Carlo estimator of Z M ( p 1 , p 2 )
Z M , α ( p 1 , p 2 ) normalizer of weighted M-mixture
( p 1 p 2 ) M M-mixture
( p 1 p 2 ) M , α weighted M-mixture
Dissimilarities, divergences, and distances:
KL ( p 1 , p 2 ) Kullback–Leibler divergence (KLD)
KL + ( q 1 , q 2 ) extended Kullback–Leibler divergence
KL * ( p 1 , p 2 ) reverse Kullback–Leibler divergence
H × ( p 1 , p 2 ) cross-entropy
H ( p ) Shannon discrete or differential entropy
J ( p 1 , p 2 ) Jeffreys divergence
TV ( p 1 , p 2 ) total variation distance
B ( p 1 , p 2 ) Bhattacharyya “distance” (not metric)
B α ( p 1 , p 2 ) α -skewed Bhattacharyya “distance”
C ( p 1 , p 2 ) Chernoff information or Chernoff distance
T ( p 1 , p 2 ) Taneja T-divergence
I f ( p 1 , p 2 ) Ali–Silvey–Csiszár f-divergence
D ( p 1 , p 2 ) arbitrary dissimilarity measure
D * ( p 1 , p 2 ) reverse dissimilarity measure
D + ( q 1 , q 2 ) extended dissimilarity measure
D ˜ ( q 1 , q 2 ) projective dissimilarity measure
D ˜ γ ( q 1 , q 2 ) γ -divergence
D ^ + ( q 1 , q 2 ) Monte Carlo estimation of dissimilarity D +
Jensen–Shannon divergences and generalizations:
JS ( p 1 , p 2 ) Jensen–Shannon divergence (JSD)
JS α , β ( p 1 , p 2 ) β -weighted α -skewed mixture JSD
JS M ( p 1 , p 2 ) M-JSD for M-mixtures
JS G ( p 1 , p 2 ) geometric JSD
JS G ˜ ( p 1 , p 2 ) extended geometric JSD
JS G * ( p 1 , p 2 ) left-sided geometric JSD (right-sided for KL * )
JS + min ˜ ( p 1 , p 2 ) min-JSD
JS + max ˜ ( p 1 , p 2 ) max-JSD
Δ M ( p 1 , p 2 ) gap between extended and normalized M-JSDs

References

  1. Cover, T.M. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 1999. [Google Scholar]
  2. Jeffreys, H. The Theory of Probability; OuP Oxford: Oxford, UK, 1998. [Google Scholar]
  3. Johnson, D.H.; Sinanovic, S. Symmetrizing the Kullback-Leibler distance. IEEE Trans. Inf. Theory 2001, 1, 1–10. [Google Scholar]
  4. Lin, J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 1991, 37, 145–151. [Google Scholar] [CrossRef]
  5. Fuglede, B.; Topsoe, F. Jensen-Shannon divergence and Hilbert space embedding. In Proceedings of the International Symposium on Information Theory (ISIT), Chicago, IL, USA, 27 June–2 July 2004; p. 31. [Google Scholar]
  6. Endres, D.M.; Schindelin, J.E. A new metric for probability distributions. IEEE Trans. Inf. Theory 2003, 49, 1858–1860. [Google Scholar] [CrossRef]
  7. Okamura, K. Metrization of powers of the Jensen-Shannon divergence. arXiv 2023, arXiv:2302.10070. [Google Scholar] [CrossRef]
  8. Sibson, R. Information radius. Z. für Wahrscheinlichkeitstheorie und Verwandte Gebiete 1969, 14, 149–160. [Google Scholar] [CrossRef]
  9. Briët, J.; Harremoës, P. Properties of classical and quantum Jensen-Shannon divergence. Phys. Rev. A 2009, 79, 052311. [Google Scholar] [CrossRef]
  10. Virosztek, D. The metric property of the quantum Jensen-Shannon divergence. Adv. Math. 2021, 380, 107595. [Google Scholar] [CrossRef]
  11. Ali, S.M.; Silvey, S.D. A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Ser. Methodol. 1966, 28, 131–142. [Google Scholar] [CrossRef]
  12. Csiszár, I. Information-type measures of difference of probability distributions and indirect observation. Stud. Sci. Math. Hung. 1967, 2, 229–318. [Google Scholar]
  13. Amari, S.i. Information Geometry and Its Applications; Applied Mathematical Sciences; Springer: Tokyo, Japan, 2016. [Google Scholar]
  14. Csiszár, I.; Shields, P.C. Information Theory and Statistics: A Tutorial. (Foundations and Trends® in Communications and Information Theory); Now Publishers Inc.: Hanover, MA, USA, 2004; Volume 1, pp. 417–528. [Google Scholar]
  15. Osterreicher, F.; Vajda, I. A new class of metric divergences on probability spaces and its applicability in statistics. Ann. Inst. Stat. Math. 2003, 55, 639–653. [Google Scholar] [CrossRef]
  16. Schoenberg, I.J. Metric spaces and completely monotone functions. Ann. Math. 1938, 39, 811–841. [Google Scholar] [CrossRef]
  17. Nielsen, F. On the Jensen–Shannon symmetrization of distances relying on abstract means. Entropy 2019, 21, 485. [Google Scholar] [CrossRef]
  18. Bullen, P.S. Handbook of Means and Their Inequalities; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013; Volume 560. [Google Scholar]
  19. Yamano, T. Some bounds for skewed α-Jensen-Shannon divergence. Results Appl. Math. 2019, 3, 100064. [Google Scholar] [CrossRef]
  20. Nielsen, F. Revisiting Chernoff information with likelihood ratio exponential families. Entropy 2022, 24, 1400. [Google Scholar] [CrossRef]
  21. Jerfel, G.; Wang, S.; Wong-Fannjiang, C.; Heller, K.A.; Ma, Y.; Jordan, M.I. Variational refinement for importance sampling using the forward Kullback-Leibler divergence. In Proceedings of the Uncertainty in Artificial Intelligence, PMLR, Online, 27–30 July 2021; pp. 1819–1829. [Google Scholar]
  22. Asadi, M.; Ebrahimi, N.; Kharazmi, O.; Soofi, E.S. Mixture models, Bayes Fisher information, and divergence measures. IEEE Trans. Inf. Theory 2018, 65, 2316–2321. [Google Scholar] [CrossRef]
  23. Grosse, R.B.; Maddison, C.J.; Salakhutdinov, R.R. Annealing between distributions by averaging moments. Adv. Neural Inf. Process. Syst. 2013, 26, 1–12. [Google Scholar]
  24. Amari, S.I. Integration of stochastic models by minimizing α-divergence. Neural Comput. 2007, 19, 2780–2796. [Google Scholar] [CrossRef]
  25. Bhattacharyya, A. On a measure of divergence between two multinomial populations. Sankhyā Indian J. Stat. 1946, 7, 401–406. [Google Scholar]
  26. Melville, P.; Yang, S.M.; Saar-Tsechansky, M.; Mooney, R. Active learning for probability estimation using Jensen-Shannon divergence. In Proceedings of the European Conference on Machine Learning, Porto, Portugal, 3–7 October 2005; pp. 268–279. [Google Scholar]
  27. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Adv. Neural Inf. Process. Syst. 2014, 27, 2672–2680. [Google Scholar]
  28. Sutter, T.; Daunhawer, I.; Vogt, J. Multimodal generative learning utilizing Jensen-Shannon-divergence. Adv. Neural Inf. Process. Syst. 2020, 33, 6100–6110. [Google Scholar]
  29. Michalowicz, J.V.; Nichols, J.M.; Bucholtz, F. Calculation of differential entropy for a mixed Gaussian distribution. Entropy 2008, 10, 200–206. [Google Scholar] [CrossRef]
  30. Deasy, J.; Simidjievski, N.; Liò, P. Constraining variational inference with geometric Jensen-Shannon divergence. Adv. Neural Inf. Process. Syst. 2020, 33, 10647–10658. [Google Scholar]
  31. Deasy, J.; McIver, T.A.; Simidjievski, N.; Lio, P. α-VAEs: Optimising variational inference by learning data-dependent divergence skew. In Proceedings of the ICML Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models, Virtual, 23 July 2021. [Google Scholar]
  32. Kumari, J.; Deepak, G.; Santhanavijayan, A. RDS: Related document search for economics data using ontologies and hybrid semantics. In Proceedings of the International Conference on Data Analytics and Insights, Kolkata, India, 11–13 May 2023; pp. 691–702. [Google Scholar]
  33. Ni, S.; Lin, C.; Wang, H.; Li, Y.; Liao, Y.; Li, N. Learning geometric Jensen-Shannon divergence for tiny object detection in remote sensing images. Front. Neurorobot. 2023, 17, 1273251. [Google Scholar] [CrossRef]
  34. Sachdeva, R.; Gakhar, R.; Awasthi, S.; Singh, K.; Pandey, A.; Parihar, A.S. Uncertainty and Noise Aware Decision Making for Autonomous Vehicles—A Bayesian Approach. IEEE Trans. Veh. Technol. 2024, 74, 378–389. [Google Scholar] [CrossRef]
  35. Wang, J.; Massiceti, D.; Hu, X.; Pavlovic, V.; Lukasiewicz, T. NP-SemiSeg: When neural processes meet semi-supervised semantic segmentation. In Proceedings of the International Conference on Machine Learning, PMLR, Honolulu, HI, USA, 23–29 July 2023; pp. 36138–36156. [Google Scholar]
  36. Serra, G.; Stavrou, P.A.; Kountouris, M. On the computation of the Gaussian rate–distortion–perception function. IEEE J. Sel. Areas Inf. Theory 2024, 5, 314–330. [Google Scholar] [CrossRef]
  37. Thiagarajan, P.; Ghosh, S. Jensen–Shannon divergence based novel loss functions for Bayesian neural networks. Neurocomputing 2025, 618, 129115. [Google Scholar] [CrossRef]
  38. Hanselmann, N.; Doll, S.; Cordts, M.; Lensch, H.P.; Geiger, A. EMPERROR: A Flexible Generative Perception Error Model for Probing Self-Driving Planners. IEEE Robot. Autom. Lett. 2025, 10, 5807–5814. [Google Scholar] [CrossRef]
  39. Fujisawa, H.; Eguchi, S. Robust parameter estimation with a small bias against heavy contamination. J. Multivar. Anal. 2008, 99, 2053–2081. [Google Scholar] [CrossRef]
  40. Jones, L.K.; Byrne, C.L. General entropy criteria for inverse problems, with applications to data compression, pattern classification, and cluster analysis. IEEE Trans. Inf. Theory 2002, 36, 23–30. [Google Scholar] [CrossRef]
  41. Nishimura, T.; Komaki, F. The information geometric structure of generalized empirical likelihood estimators. Commun. Stat. Methods 2008, 37, 1867–1879. [Google Scholar] [CrossRef]
  42. Nielsen, F. Generalized Bhattacharyya and Chernoff upper bounds on Bayes error using quasi-arithmetic means. Pattern Recognit. Lett. 2014, 42, 25–34. [Google Scholar] [CrossRef][Green Version]
  43. Barndorff-Nielsen, O. Information and Exponential Families: In Statistical Theory; John Wiley & Sons: Hoboken, NJ, USA, 2014. [Google Scholar]
  44. Kailath, T. The Divergence and Bhattacharyya Distance Measures in Signal Selection. IEEE Trans. Commun. Technol. 1967, 15, 52–60. [Google Scholar] [CrossRef]
  45. Nielsen, F.; Boltz, S. The Burbea-Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory 2011, 57, 5455–5466. [Google Scholar] [CrossRef]
  46. Grünwald, P.D. The Minimum Description Length Principle; MIT Press: Cambridge, MA USA, 2007. [Google Scholar]
  47. Cena, A.; Pistone, G. Exponential statistical manifold. Ann. Inst. Stat. Math. 2007, 59, 27–56. [Google Scholar] [CrossRef]
  48. Taneja, I.J. New developments in generalized information measures. In Advances in Imaging and Electron Physics; Elsevier: Amsterdam, The Netherlands, 1995; Volume 91, pp. 37–135. [Google Scholar]
  49. Rubinstein, R.Y.; Kroese, D.P. Simulation and the Monte Carlo Method; John Wiley & Sons: Hoboken, NJ, USA, 2016. [Google Scholar]
  50. Cobb, L.; Koppstein, P.; Chen, N.H. Estimation and moment recursion relations for multimodal distributions of the exponential family. J. Am. Stat. Assoc. 1983, 78, 124–130. [Google Scholar] [CrossRef]
  51. Hayakawa, J.; Takemura, A. Estimation of exponential-polynomial distribution by holonomic gradient descent. Commun. Stat.-Theory Methods 2016, 45, 6860–6882. [Google Scholar] [CrossRef]
  52. Kloek, T.; Van Dijk, H.K. Bayesian estimates of equation system parameters: An application of integration by Monte Carlo. Econom. J. Econom. Soc. 1978, 46, 1–19. [Google Scholar] [CrossRef]
  53. Banerjee, A.; Chen, T.; Li, X.; Zhou, Y. Stability based generalization bounds for exponential family Langevin dynamics. In Proceedings of the International Conference on Machine Learning, PMLR, Baltimore, MA, USA, 17–23 July 2022; pp. 1412–1449. [Google Scholar]