On the Jensen–Shannon Symmetrization of Distances Relying on Abstract Means
Abstract
1. Introduction and Motivations
1.1. Kullback–Leibler Divergence and Its Symmetrizations
1.2. Statistical Distances and Parameter Divergences
1.3. J-Symmetrization and JS-Symmetrization of Distances
1.4. Contributions and Paper Outline
2. Jensen–Shannon Divergence in Mixture and Exponential Families
3. Generalized Jensen–Shannon Divergences
Definitions
- the arithmetic mean $A_\alpha(x,y) = (1-\alpha) x + \alpha y$,
- the geometric mean $G_\alpha(x,y) = x^{1-\alpha} y^{\alpha}$, and
- the harmonic mean $H_\alpha(x,y) = \frac{x y}{(1-\alpha) y + \alpha x}$.
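For concreteness, here is a minimal Maxima sketch (in the style of Appendix B; the function names `A`, `G`, `H` are illustrative, not the paper's code) implementing the three weighted means and checking the classical inequality $A \geq G \geq H$ at $\alpha = \frac{1}{2}$:

```maxima
/* weighted arithmetic, geometric, and harmonic means, alpha in [0,1] */
A(x, y, alpha) := (1-alpha)*x + alpha*y;
G(x, y, alpha) := x^(1-alpha) * y^alpha;
H(x, y, alpha) := (x*y)/((1-alpha)*y + alpha*x);

/* sanity check at alpha = 1/2 with x = 1, y = 4: A >= G >= H */
[A(1, 4, 1/2), G(1, 4, 1/2), H(1, 4, 1/2)];  /* => [5/2, 2, 8/5] */
```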
4. Some Closed-Form Formulas for the M-Jensen–Shannon Divergences
- the geometric G-Jensen–Shannon divergence for exponential families (Section 4.1), and
- the harmonic H-Jensen–Shannon divergence for the family of Cauchy scale distributions (Section 4.2).
4.1. The Geometric G-Jensen–Shannon Divergence
4.1.1. Case Study: The Multivariate Gaussian Family
4.1.2. Applications to k-Means Clustering
4.2. The Harmonic Jensen–Shannon Divergence (H-JSD)
4.3. The M-Jensen–Shannon Matrix Distances
5. Conclusions and Perspectives
Funding
Conflicts of Interest
Appendix A. Summary of Distances and Their Notations
Name | Definition
---|---
Weighted mean | $M_\alpha(x,y)$, $\alpha \in [0,1]$
Arithmetic mean | $A_\alpha(x,y) = (1-\alpha) x + \alpha y$
Geometric mean | $G_\alpha(x,y) = x^{1-\alpha} y^{\alpha}$
Harmonic mean | $H_\alpha(x,y) = \frac{x y}{(1-\alpha) y + \alpha x}$
Power mean | $P^p_\alpha(x,y) = \left((1-\alpha) x^p + \alpha y^p\right)^{\frac{1}{p}}$, $p \neq 0$
Quasi-arithmetic mean | $M^f_\alpha(x,y) = f^{-1}\left((1-\alpha) f(x) + \alpha f(y)\right)$, $f$ strictly monotone
M-mixture | $(pq)^{M}_\alpha(x) = \frac{M_\alpha(p(x), q(x))}{Z^{M}_\alpha(p,q)}$ with $Z^{M}_\alpha(p,q) = \int M_\alpha(p(x), q(x))\, d\mu(x)$
Statistical distance | $D(p:q)$
Dual/reverse distance | $D^*(p:q) = D(q:p)$
Kullback-Leibler divergence | $\mathrm{KL}(p:q) = \int p(x) \log \frac{p(x)}{q(x)}\, d\mu(x)$
reverse Kullback-Leibler divergence | $\mathrm{KL}^*(p:q) = \mathrm{KL}(q:p)$
Jeffreys divergence | $J(p;q) = \mathrm{KL}(p:q) + \mathrm{KL}(q:p)$
Resistor divergence | $\frac{1}{R(p;q)} = \frac{1}{\mathrm{KL}(p:q)} + \frac{1}{\mathrm{KL}(q:p)}$
skew K-divergence | $K_\alpha(p:q) = \mathrm{KL}\left(p : (1-\alpha) p + \alpha q\right)$
Jensen-Shannon divergence | $\mathrm{JS}(p;q) = \frac{1}{2} \mathrm{KL}\left(p : \frac{p+q}{2}\right) + \frac{1}{2} \mathrm{KL}\left(q : \frac{p+q}{2}\right)$
skew Bhattacharyya divergence | $B_\alpha(p:q) = -\log \int p(x)^{1-\alpha} q(x)^{\alpha}\, d\mu(x)$
Hellinger distance | $D_H(p,q) = \sqrt{\frac{1}{2} \int \left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2 d\mu(x)}$
α-divergences | $I_\alpha(p:q) = \frac{1}{\alpha (1-\alpha)} \left(1 - \int p(x)^{1-\alpha} q(x)^{\alpha}\, d\mu(x)\right)$, $\alpha \neq 0, 1$
Mahalanobis distance | $\Delta_Q(\theta_1, \theta_2) = \sqrt{(\theta_1 - \theta_2)^\top Q\, (\theta_1 - \theta_2)}$ for a positive-definite matrix $Q \succ 0$
f-divergence | $I_f(p:q) = \int p(x)\, f\!\left(\frac{q(x)}{p(x)}\right) d\mu(x)$, with $f$ strictly convex at 1
reverse f-divergence | $I_f^*(p:q) = I_f(q:p) = I_{f^\diamond}(p:q)$ for $f^\diamond(u) = u f\!\left(\frac{1}{u}\right)$
J-symmetrized f-divergence | $I_f^{J_\alpha}(p:q) = (1-\alpha) I_f(p:q) + \alpha I_f(q:p)$
JS-symmetrized f-divergence | $I_f^{JS_\alpha}(p:q) = (1-\alpha) I_f\left(p:(pq)_\alpha\right) + \alpha I_f\left(q:(pq)_\alpha\right)$ for $(pq)_\alpha = (1-\alpha) p + \alpha q$
Parameter distance | $P(\theta_1:\theta_2)$
Bregman divergence | $B_F(\theta_1:\theta_2) = F(\theta_1) - F(\theta_2) - (\theta_1 - \theta_2)^\top \nabla F(\theta_2)$
skew Jeffreys-Bregman divergence | $J_{B_F}^{\alpha}(\theta_1:\theta_2) = (1-\alpha) B_F(\theta_1:\theta_2) + \alpha B_F(\theta_2:\theta_1)$
skew Jensen divergence | $J_{F,\alpha}(\theta_1:\theta_2) = (1-\alpha) F(\theta_1) + \alpha F(\theta_2) - F\left((1-\alpha)\theta_1 + \alpha \theta_2\right)$
Jensen-Bregman divergence | $JB_F(\theta_1;\theta_2) = \frac{1}{2} B_F\left(\theta_1:\frac{\theta_1+\theta_2}{2}\right) + \frac{1}{2} B_F\left(\theta_2:\frac{\theta_1+\theta_2}{2}\right) = J_{F,\frac{1}{2}}(\theta_1;\theta_2)$
Generalized Jensen-Shannon divergences | (the four skew constructions below)
skew J-symmetrization | $J_D^{\alpha}(p:q) = (1-\alpha) D(p:q) + \alpha D(q:p)$
skew JS-symmetrization | $JS_D^{\alpha}(p:q) = (1-\alpha) D\left(p:(pq)_\alpha\right) + \alpha D\left(q:(pq)_\alpha\right)$
skew M-Jensen-Shannon divergence | $JS^{M_\alpha}(p:q) = (1-\alpha) \mathrm{KL}\left(p:(pq)^{M}_\alpha\right) + \alpha\, \mathrm{KL}\left(q:(pq)^{M}_\alpha\right)$
skew M-JS-symmetrization | $JS_D^{M_\alpha}(p:q) = (1-\alpha) D\left(p:(pq)^{M}_\alpha\right) + \alpha D\left(q:(pq)^{M}_\alpha\right)$
N-Jeffreys divergence | $J^{N}(p;q) = N\left(\mathrm{KL}(p:q), \mathrm{KL}(q:p)\right)$
N-J D divergence | $J_D^{N}(p;q) = N\left(D(p:q), D(q:p)\right)$
skew (M,N)-D JS divergence | $JS_D^{M_\alpha, N_\beta}(p:q) = N_\beta\left(D\left(p:(pq)^{M}_\alpha\right), D\left(q:(pq)^{M}_\alpha\right)\right)$
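As a consistency check on the generalized definitions above (a short derivation added here for clarity, not part of the original appendix), instantiating the skew M-JS-symmetrization with $D = \mathrm{KL}$, the arithmetic mean $M = A$, and $\alpha = \frac{1}{2}$ recovers the classical Jensen-Shannon divergence, since the arithmetic mixture is already normalized ($Z^{A}_\alpha = 1$):

$$JS_{\mathrm{KL}}^{A_{1/2}}(p:q) = \frac{1}{2}\, \mathrm{KL}\!\left(p : \frac{p+q}{2}\right) + \frac{1}{2}\, \mathrm{KL}\!\left(q : \frac{p+q}{2}\right) = \mathrm{JS}(p;q).$$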
Appendix B. Symbolic Calculations in Maxima
```maxima
assume(gamma > 0);
Cauchy(x, gamma) := gamma/(%pi*(x**2 + gamma**2));
assume(alpha > 0);
assume(alpha < 1);
h(x, y, alpha) := (x*y)/((1-alpha)*y + alpha*x);
assume(gamma1 > 0);
assume(gamma2 > 0);
m(x, alpha) := ratsimp(h(Cauchy(x, gamma1), Cauchy(x, gamma2), alpha));
/* calculate Z */
integrate(m(x, alpha), x, -inf, inf);
```
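As an extra numeric sanity check (not part of the original script), the symbolic normalizer returned by the `integrate()` call above can be compared against adaptive quadrature for concrete parameter values; `quad_qagi` is Maxima's standard QUADPACK routine for integrals over infinite intervals:

```maxima
/* Numeric check of Z for gamma1 = 1, gamma2 = 2, alpha = 1/3:
   quad_qagi returns [value, abs. error, #evaluations, error code]. */
quad_qagi(subst([gamma1 = 1, gamma2 = 2], m(x, 1/3)), x, minf, inf);
```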
Summary of parametric families for which the M-Jensen–Shannon divergence admits a closed form:

Mean M | Parametric Family
---|---
arithmetic A | mixture family
geometric G | exponential family
harmonic H | Cauchy scale family
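For intuition on why the geometric row yields closed forms (a standard derivation sketched here under the natural-exponential-family assumption $p_\theta(x) = \exp\left(\theta^\top t(x) - F(\theta)\right)$), the weighted geometric mixture of two members stays in the family, and its normalizer is an exponentiated negative skew Jensen divergence:

$$(p_{\theta_1} p_{\theta_2})^{G}_\alpha = p_{(1-\alpha)\theta_1 + \alpha \theta_2}, \qquad Z^{G}_\alpha(p_{\theta_1}, p_{\theta_2}) = \int p_{\theta_1}^{1-\alpha} p_{\theta_2}^{\alpha}\, d\mu = \exp\left(-J_{F,\alpha}(\theta_1:\theta_2)\right),$$

with $J_{F,\alpha}$ the skew Jensen divergence of Appendix A.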
© 2019 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).