Article

Generalizing the Alpha-Divergences and the Oriented Kullback–Leibler Divergences with Quasi-Arithmetic Means

Sony Computer Science Laboratories, Tokyo 141-0022, Japan
Algorithms 2022, 15(11), 435; https://doi.org/10.3390/a15110435
Submission received: 13 October 2022 / Revised: 10 November 2022 / Accepted: 16 November 2022 / Published: 17 November 2022
(This article belongs to the Special Issue Machine Learning for Pattern Recognition)

Abstract:
The family of α -divergences including the oriented forward and reverse Kullback–Leibler divergences is often used in signal processing, pattern recognition, and machine learning, among others. Choosing a suitable α -divergence can either be done beforehand according to some prior knowledge of the application domains or directly learned from data sets. In this work, we generalize the α -divergences using a pair of strictly comparable weighted means. Our generalization allows us to obtain in the limit case α → 1 the 1-divergence, which provides a generalization of the forward Kullback–Leibler divergence, and in the limit case α → 0 , the 0-divergence, which corresponds to a generalization of the reverse Kullback–Leibler divergence. We then analyze the condition for a pair of weighted quasi-arithmetic means to be strictly comparable and describe the family of quasi-arithmetic α -divergences including its subfamily of power homogeneous α -divergences. In particular, we study the generalized quasi-arithmetic 1-divergences and 0-divergences and show that these counterpart generalizations of the oriented Kullback–Leibler divergences can be rewritten as equivalent conformal Bregman divergences using strictly monotone embeddings. Finally, we discuss the applications of these novel divergences to k-means clustering by studying the robustness property of the centroids.

1. Introduction

1.1. Statistical Divergences and α -Divergences

Consider a measurable space [1] ( X , F ) where F denotes a finite σ -algebra and X the sample space, and let μ denote a positive measure on ( X , F ) , usually chosen as the Lebesgue measure or the counting measure. The notion of statistical dissimilarities [2,3,4] D ( P : Q ) between two distributions P and Q is at the core of many algorithms in signal processing, pattern recognition, information fusion, data analysis, and machine learning, among others. A dissimilarity may be oriented, i.e., asymmetric: D ( P : Q ) ≠ D ( Q : P ) , where the colon mark “:” between the arguments of the dissimilarity indicates this potential asymmetry, by analogy with the asymmetric division operation. When the arbitrary probability measures P and Q are dominated by a measure μ (e.g., one can always choose μ = (P + Q)/2 ), we consider their Radon–Nikodym (RN) densities p_μ = dP/dμ and q_μ = dQ/dμ with respect to μ , and define D ( P : Q ) as D_μ ( p_μ : q_μ ) . A good dissimilarity measure shall not depend on the chosen dominating measure so that we can write D ( P : Q ) = D_μ ( p_μ : q_μ ) [5]. When those statistical dissimilarities are smooth, they are called divergences [6] in information geometry, as they induce a dualistic geometric structure [7].
The most renowned statistical divergence rooted in information theory [8] is the Kullback–Leibler divergence (KLD, also called relative entropy):
$$\mathrm{KL}_\mu(p_\mu:q_\mu) := \int_{\mathcal{X}} p_\mu(x)\,\log\frac{p_\mu(x)}{q_\mu(x)}\,\mathrm{d}\mu(x).$$
Since the KLD is independent of the reference measure μ , i.e., KL_μ ( p_μ : q_μ ) = KL_ν ( p_ν : q_ν ) for p_μ = dP/dμ and q_μ = dQ/dμ , and p_ν = dP/dν and q_ν = dQ/dν the RN derivatives with respect to another positive measure ν , we write concisely in the remainder:
$$\mathrm{KL}(p:q) = \int p\,\log\frac{p}{q}\,\mathrm{d}\mu,$$
instead of KL μ ( p μ : q μ ) .
The KLD belongs to a parametric family of α -divergences [9], I_α ( p : q ) for α ∈ ℝ:
$$I_\alpha(p:q) := \begin{cases} \frac{1}{\alpha(1-\alpha)}\left(1 - \int p^\alpha q^{1-\alpha}\,\mathrm{d}\mu\right), & \alpha \in \mathbb{R}\setminus\{0,1\}\\ I_1(p:q) = \mathrm{KL}(p:q), & \alpha=1\\ I_0(p:q) = \mathrm{KL}(q:p), & \alpha=0 \end{cases}$$
The α -divergences extended to positive densities [10] (not necessarily normalized densities) play a central role in information geometry [6]:
$$I_\alpha^+(p:q) := \begin{cases} \frac{1}{\alpha(1-\alpha)}\int \left(\alpha p + (1-\alpha) q - p^\alpha q^{1-\alpha}\right)\mathrm{d}\mu, & \alpha \in \mathbb{R}\setminus\{0,1\}\\ I_1^+(p:q) = \mathrm{KL}^+(p:q), & \alpha=1\\ I_0^+(p:q) = \mathrm{KL}^+(q:p), & \alpha=0, \end{cases}$$
where KL + denotes the Kullback–Leibler divergence extended to positive measures:
$$\mathrm{KL}^+(p:q) := \int \left(p\,\log\frac{p}{q} + q - p\right)\mathrm{d}\mu.$$
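To make these definitions concrete, here is a minimal numerical sketch (not taken from the article): it evaluates the extended α -divergence I_α^+ and the extended Kullback–Leibler divergence KL^+ on finite positive arrays, replacing the integral over μ by a plain sum (counting measure), and illustrates that I_α^+ approaches the two oriented KLDs as α tends to 1 and to 0.

```python
# Minimal numerical sketch (illustration only): the extended alpha-divergence I_alpha^+
# and the extended KL divergence KL^+ on positive arrays, with the integral over mu
# replaced by a sum (counting measure).
import numpy as np

def extended_alpha_divergence(p, q, alpha):
    """I_alpha^+(p:q) for positive arrays p, q and alpha not in {0, 1}."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    integrand = alpha * p + (1 - alpha) * q - p**alpha * q**(1 - alpha)
    return integrand.sum() / (alpha * (1 - alpha))

def extended_kl(p, q):
    """KL^+(p:q) = sum of p*log(p/q) + q - p over the support."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q) + q - p)

if __name__ == "__main__":
    p = np.array([0.2, 0.5, 0.3])
    q = np.array([0.3, 0.4, 0.3])
    # As alpha -> 1, I_alpha^+ tends to KL^+(p:q); as alpha -> 0, to KL^+(q:p).
    print(extended_alpha_divergence(p, q, 0.999), extended_kl(p, q))
    print(extended_alpha_divergence(p, q, 0.001), extended_kl(q, p))
```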
The α -divergences are asymmetric for α ≠ 1/2 (i.e., I_α ( p : q ) ≠ I_α ( q : p ) for α ≠ 1/2 ) but exhibit the following reference duality [11]:
$$I_\alpha(q:p) = I_{1-\alpha}(p:q) =: I_\alpha^*(p:q),$$
where we denoted by D^* ( p : q ) := D ( q : p ) the reverse divergence for an arbitrary divergence D ( p : q ) (e.g., I_α^* ( p : q ) := I_α ( q : p ) = I_{1−α} ( p : q ) ). The α -divergences have been extensively used in many applications [12], and the parameter α may not be necessarily fixed beforehand but can also be learned from data sets in applications [13,14]. When α = 1/2 , the α -divergence is symmetric and called the squared Hellinger divergence [15]:
$$I_{\frac{1}{2}}(p:q) := 4\left(1 - \int \sqrt{p\,q}\,\mathrm{d}\mu\right) = 2\int \left(\sqrt{p} - \sqrt{q}\right)^2 \mathrm{d}\mu.$$
The α -divergences belong to the family of Ali–Silvey–Csiszár f-divergences [16,17], which are defined for a convex function f ( u ) satisfying f ( 1 ) = 0 and strictly convex at 1:
$$I_f(p:q) := \int p\, f\!\left(\frac{q}{p}\right)\mathrm{d}\mu.$$
We have
$$I_\alpha(p:q) = I_{f_\alpha}(p:q),$$
with the following class of f-generators:
$$f_\alpha(u) := \begin{cases} \frac{1}{\alpha(1-\alpha)}\left(\alpha + (1-\alpha)u - u^{1-\alpha}\right), & \alpha \in \mathbb{R}\setminus\{0,1\}\\ u - 1 - \log u, & \alpha=1\\ 1 - u + u\log u, & \alpha=0 \end{cases}$$
In information geometry, α -divergences and more generally f-divergences are called invariant divergences [6], since they are provably the only statistical divergences which are invariant under invertible smooth transformations of the sample space. That is, let Y = m ( X ) be a smooth invertible transformation of the sample space, and denote by p_Y ( y ) and p′_Y ( y ) the densities with respect to y corresponding to p_X ( x ) and p′_X ( x ) , respectively. Then, we have I_f ( p_X : p′_X ) = I_f ( p_Y : p′_Y ) [18]. The dualistic information-geometric structures induced by these invariant f-divergences between densities of a same parametric family { p_θ ( x ) : θ ∈ Θ } of statistical models yield the Fisher information metric and the dual ± α -connections for α = 3 + 2 f‴(1)/f″(1) , see [6] for details. It is customary to rewrite the α -divergences in information geometry using the rescaled parameter α_A = 1 − 2α (i.e., α = (1 − α_A)/2 ). Thus, the extended α_A -divergence in information geometry is defined as follows:
$$\hat{I}_{\alpha_A}^+(p:q) = \begin{cases} \frac{4}{1-\alpha_A^2}\int\left(\frac{1-\alpha_A}{2}\,p + \frac{1+\alpha_A}{2}\,q - p^{\frac{1-\alpha_A}{2}} q^{\frac{1+\alpha_A}{2}}\right)\mathrm{d}\mu, & \alpha_A \in \mathbb{R}\setminus\{-1,1\}\\ \hat{I}_{-1}(p:q) = \mathrm{KL}^+(p:q), & \alpha_A=-1\\ \hat{I}_{1}(p:q) = \mathrm{KL}^+(q:p), & \alpha_A=1, \end{cases}$$
and the reference duality is expressed by Î_{α_A}^+ ( q : p ) = Î_{−α_A}^+ ( p : q ) .
A statistical divergence D ( · : · ) when evaluated on densities belonging to a given parametric family P = { p θ : θ Θ } of densities is equivalent to a corresponding contrast function D P [7]:
D P ( θ 1 : θ 2 ) : = D ( p θ 1 : p θ 2 ) .
Remark 1.
Although quite confusing, those contrast functions [7] have also been called divergences in the literature [6]. Any smooth parameter divergence D ( θ 1 : θ 2 ) (contrast function [7]) induces a dualistic structure in information geometry [6]. For example, the KLD on the family Δ of probability mass functions defined on a finite alphabet X is equivalent to a Bregman divergence, and thus induces a dually flat space [6]. More generally, the α A -divergences on the probability simplex Δ induce the α A -geometry in information geometry [6].
We refer the reader to [3] for a richly annotated bibliography of many common statistical divergences investigated in signal processing and statistics. Building and studying novel statistical/parameter divergences from first principles is an active research area. For example, Li [19,20] recently introduced some new divergence functionals based on the framework of transport information geometry [21], which considers information entropy functionals in Wasserstein spaces. Li defined (i) the transport information Hessian distances [20] between univariate densities supported on a compact, which are symmetric distances satisfying the triangle inequality, and obtained the counterpart of the Hellinger distance on the L 2 -Wasserstein space by choosing the Shannon information entropy, and (ii) asymmetric transport Bregman divergences (including the transport Kullback–Leibler divergence) between densities defined on a multivariate compact smooth support in [19].
The α -divergences are widely used in information sciences, see [22,23,24,25,26,27] just to cite a few applications. The singly parametric α -divergences have also been generalized to biparametric families of divergences such as the ( α , β ) -divergences [6] or the α β -divergences [28].
In this work, based on the observation that the term α p + ( 1 − α ) q − p^α q^{1−α} in the extended I_α^+ ( p : q ) divergence for α ∈ ( 0 , 1 ) of Equation (4) is a difference between a weighted arithmetic mean A_{1−α} ( p , q ) := α p + ( 1 − α ) q and a weighted geometric mean G_{1−α} ( p , q ) := p^α q^{1−α} , we investigate a generalization of α -divergences with respect to a generic pair of strictly comparable weighted means [29]. In particular, we consider the class of quasi-arithmetic weighted means [30], analyze the condition for two quasi-arithmetic means to be strictly comparable, and report their induced α -divergences with limit KL type divergences when α → 1 and α → 0 .

1.2. Divergences and Decomposable Divergences

A statistical divergence D ( p : q ) shall satisfy the following two basic axioms:
D1 (Non-negativity). D ( p : q ) ≥ 0 for all densities p and q,
D2 (Identity of indiscernibles). D ( p : q ) = 0 if and only if p = q μ -almost everywhere.
These axioms are a subset of the metric axioms, since we do not consider the symmetry axiom nor the triangle inequality axiom of metric distances. See [31,32] for some common examples of probability metrics (e.g., total variation distance or Wasserstein metrics).
A divergence D ( p : q ) is said decomposable [6] when it can be written as a definite integral of a scalar divergence d ( · , · ) :
$$D(p:q) = \int d(p(x):q(x))\,\mathrm{d}\mu(x),$$
or D ( p : q ) = ∫ d ( p : q ) dμ for short, where d ( a : b ) is a scalar divergence between a > 0 and b > 0 (hence a one-dimensional parameter divergence).
The α -divergences are decomposable divergences since we have
$$I_\alpha^+(p:q) = \int i_\alpha(p(x):q(x))\,\mathrm{d}\mu$$
with the following scalar α -divergence:
$$i_\alpha(a:b) := \begin{cases} \frac{1}{\alpha(1-\alpha)}\left(\alpha a + (1-\alpha) b - a^\alpha b^{1-\alpha}\right), & \alpha \in \mathbb{R}\setminus\{0,1\}\\ i_1(a:b) = a\log\frac{a}{b} + b - a, & \alpha=1\\ i_0(a:b) = i_1(b:a), & \alpha=0 \end{cases}$$

1.3. Contributions and Paper Outline

The outline of the paper and its main contributions are summarized as follows:
We first define for two families of strictly comparable means (Definition 1) their generic induced α -divergences in Section 2 (Definition 2). Then, Section 2.2 reports a closed-form formula (Theorem 3) for the quasi-arithmetic α -divergences induced by two strictly comparable quasi-arithmetic means with monotonically increasing generators f and g such that f ∘ g^{−1} is strictly convex and differentiable (Theorem 1). In Section 2.3, we study the divergences I_0^+ and I_1^+ obtained in the limit cases when α → 0 and α → 1 , respectively (Theorem 2). We obtain generalized counterparts of the Kullback–Leibler divergence when α → 1 and generalized counterparts of the reverse Kullback–Leibler divergence when α → 0 . Moreover, these generalized KLDs can be rewritten as generalized cross-entropies minus entropies. In Section 2.4, we show how to express these generalized I_1 -divergences and I_0 -divergences as conformal Bregman representational divergences, and briefly explain their induced conformally flat statistical manifolds (Theorem 4). Section 3 introduces the subfamily of bipower homogeneous α -divergences (Definition 2) which belong to the family of Ali–Silvey–Csiszár f-divergences [16,17]. In Section 4, we consider k-means clustering [33] and k-means++ seeding [34] for the generic class of extended α -divergences: we first study the robustness of quasi-arithmetic means in Section 4.1 and then the robustness of the new class of generalized Kullback–Leibler centroids in Section 4.2. Finally, Section 5 summarizes the results obtained in this work and discusses perspectives for future research.

2. The α -Divergences Induced by a Pair of Strictly Comparable Weighted Means

2.1. The ( M , N ) α -Divergences

The point of departure for generalizing the α -divergences is to rewrite Equation (4) for α ∈ ℝ ∖ { 0 , 1 } as
$$I_\alpha^+(p:q) = \frac{1}{\alpha(1-\alpha)}\int\left(A_{1-\alpha}(p,q) - G_{1-\alpha}(p,q)\right)\mathrm{d}\mu,$$
where A_λ and G_λ for λ ∈ ( 0 , 1 ) stand for the weighted arithmetic mean and the weighted geometric mean, respectively:
$$A_\lambda(x,y) = (1-\lambda)x + \lambda y, \qquad G_\lambda(x,y) = x^{1-\lambda}\, y^{\lambda}.$$
For a weighted mean M_λ ( a , b ) , we choose the (geometric) convention M_0 ( x , y ) = x and M_1 ( x , y ) = y so that { M_λ ( x , y ) }_{λ ∈ [ 0 , 1 ]} smoothly interpolates between x ( λ = 0 ) and y ( λ = 1 ). For the converse convention, we simply swap the role of the weights, i.e., use M_{1−λ} ( a , b ) in place of M_λ ( a , b ) , and get the conventional definition of I_α^+ ( p : q ) = (1/(α(1−α))) ∫ ( A_α ( p , q ) − G_α ( p , q ) ) dμ .
In general, a mean M ( x , y ) aggregates two values x and y of an interval I R to produce an intermediate quantity which satisfies the innerness property [35,36]:
$$\min\{x,y\} \leq M(x,y) \leq \max\{x,y\}, \qquad \forall x,y \in I.$$
This in-between property of means (Equation (17)) was postulated by Cauchy [37] in 1821. A mean is said strict if the inequalities of Equation (17) are strict whenever x y . A mean M is said reflexive iff M ( x , x ) = x for all x I . The reflexive property of means was postulated by Chisini [38] in 1929.
In the remainder, we consider I = ( 0 , ) . By using the unique dyadic representation of any real λ ( 0 , 1 ) (i.e., λ = i = 1 d i 2 i with d i { 0 , 1 } the binary digit expansion of λ ), one can build a weighted mean M λ from any given mean M; see [29] for such a construction.
In the remainder, we drop the “+” notation to emphasize that the divergences are defined between positive measures. By analogy to the α -divergences, let us define the (decomposable) ( M , N ) α -divergences between two positive densities p and q for a pair of weighted means M 1 α and N 1 α for α ( 0 , 1 ) as
$$I_\alpha^{M,N}(p:q) := \frac{1}{\alpha(1-\alpha)}\int\left(M_{1-\alpha}(p,q) - N_{1-\alpha}(p,q)\right)\mathrm{d}\mu.$$
The ordinary α -divergences for α ( 0 , 1 ) are recovered as the ( A , G )   α -divergences:
I α A , G ( p : q ) = 1 α ( 1 α ) A 1 α ( p , q ) G 1 α ( p , q ) d μ ,
= I 1 α ( p : q ) = I α ( q : p ) = I α * ( p : q ) .
In order to define generalized α -divergences satisfying axioms D1 and D2 of proper divergences, we need to characterize the class of acceptable means. We give a definition strengthening the notion of comparable means in [29]:
Definition 1
(Strictly comparable weighted means). A pair ( M , N ) of means is said to be strictly comparable whenever M_λ ( x , y ) ≥ N_λ ( x , y ) for all x , y ∈ ( 0 , ∞ ) with equality if and only if x = y , and for all λ ∈ ( 0 , 1 ) .
Example 1.
For example, the inequality of the arithmetic and geometric means states that A ( x , y ) ≥ G ( x , y ) , which implies that the means A and G are comparable, denoted by A ≥ G . Furthermore, the weighted arithmetic and geometric means are distinct whenever x ≠ y . Indeed, consider the equation ( 1 − α ) x + α y = x^{1−α} y^{α} for x , y > 0 and x ≠ y . By taking the logarithm on both sides, we get
$$\log\left((1-\alpha)x + \alpha y\right) = (1-\alpha)\log x + \alpha \log y.$$
Since the logarithm is a strictly concave function, the only solution is x = y . Thus, ( A , G ) is a pair of strictly comparable weighted means.
For a weighted mean M, we also consider the mean obtained by reversing the weights, i.e., using M_{1−λ} ( x , y ) in place of M_λ ( x , y ) . We are ready to state the definition of generalized α -divergences:
Definition 2
( ( M , N ) α -divergences). The ( M , N ) α-divergences I α M , N ( p : q ) between two positive densities p and q for α ( 0 , 1 ) is defined for a pair of strictly comparable weighted means M α and N α with M α N α by:
I α M , N ( p : q ) : = 1 α ( 1 α ) M 1 α ( p , q ) N 1 α ( p , q ) d μ , α ( 0 , 1 )
= 1 α ( 1 α ) M α ( p , q ) N α ( p , q ) d μ , α ( 0 , 1 ) .
Using α = 1 α A 2 , we can rewrite this α -divergence as
I ^ α A M , N ( p : q ) : = 4 1 α A 2 M 1 + α A 2 ( p , q ) N 1 + α A 2 ( p , q ) d μ , α A ( 1 , 1 )
= 4 1 α A 2 M 1 α A 2 ( p , q ) N 1 α A 2 ( p , q ) d μ , α A ( 1 , 1 ) .
It is important to check the conditions on the weighted means M_α and N_α which ensure the law of the indiscernibles of a divergence D ( p : q ) , namely, D ( p : q ) = 0 iff p = q almost μ -everywhere. This condition rewrites as ∫ M_α ( p , q ) dμ = ∫ N_α ( p , q ) dμ if and only if p ( x ) = q ( x ) μ -almost everywhere. A sufficient condition is to ensure that M_α ( x , y ) > N_α ( x , y ) for x ≠ y . In particular, this condition holds if the weighted means M_α and N_α are strictly comparable weighted means.
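The following sketch illustrates this generic construction in the discrete case, under the convention M_0 ( x , y ) = x and M_1 ( x , y ) = y stated above; the function names and the summation in place of the integral are illustrative assumptions, and the two weighted means passed in must be strictly comparable for the output to be a proper divergence.

```python
# A generic sketch (positive arrays, counting measure) of the (M, N) alpha-divergence
# built from two user-supplied weighted means.
import numpy as np

def mn_alpha_divergence(p, q, alpha, weighted_M, weighted_N):
    """1/(alpha(1-alpha)) * sum(M_{1-alpha}(p,q) - N_{1-alpha}(p,q)).

    weighted_M(lam, x, y) and weighted_N(lam, x, y) should be strictly comparable
    weighted means (M >= N with equality iff x == y) so that the value is a
    proper divergence for alpha in (0, 1)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    lam = 1.0 - alpha
    gap = weighted_M(lam, p, q) - weighted_N(lam, p, q)
    return gap.sum() / (alpha * (1.0 - alpha))

# Weighted arithmetic and geometric means with the convention M_0(x,y)=x, M_1(x,y)=y.
arithmetic = lambda lam, x, y: (1 - lam) * x + lam * y
geometric  = lambda lam, x, y: x**(1 - lam) * y**lam

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.4, 0.3])
print(mn_alpha_divergence(p, q, 0.3, arithmetic, geometric))   # an (A, G) alpha-divergence
print(mn_alpha_divergence(p, p, 0.3, arithmetic, geometric))   # 0: identity of indiscernibles
```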
Instead of taking the difference M 1 α ( x : y ) N 1 α ( x : y ) between two weighted means, we may also measure the gap logarithmically, and thus define the family of log M N α -divergences as follows:
Definition 3
( log M N α -divergence). The log M N α-divergences L α M , N ( p : q ) between two positive densities p and q for α ( 0 , 1 ) is defined for a pair of strictly comparable weighted means M α and N α with M α N α by:
L α M , N ( p : q ) : = log M 1 α ( p , q ) N 1 α ( p , q ) d μ ,
= log N 1 α ( p , q ) M 1 α ( p , q ) d μ .
Note that this definition is different from the skewed Bhattacharyya type distance [39,40], which rather measures
B α M , N ( p : q ) : = log M 1 α ( p , q ) d μ N 1 α ( p , q ) d μ ,
= log N 1 α ( p , q ) d μ M 1 α ( p , q ) d μ .
The ordinary α -skewed Bhattacharyya distance [39] is recovered when N α = G α (weighted geometric mean) and M α = A α the arithmetic mean since A 1 α ( p , q ) d μ = 1 . The Bhattacharyya type divergences B α M , N were introduced in [41] in order to upper bound the probability of error in Bayesian hypothesis testing.
A weighted mean M α is said symmetric if and only if M α ( x , y ) = M 1 α ( y , x ) . When both the weighted means M and N are symmetric, we have the following reference duality [11]:
I α M , N ( p : q ) = I 1 α M , N ( q : p ) .
We consider symmetric weighted means in the remainder.
In the limit cases of α 0 or α 1 , we define the 0-divergence I 0 M , N ( p : q ) and the 1-divergence I 1 M , N ( p : q ) , respectively, by
I 0 M , N ( p : q ) = lim α 0 I α M , N ( p : q ) ,
I 1 M , N ( p : q ) = lim α 1 I α M , N ( p : q ) = I 0 M , N ( q : p ) ,
provided that those limits exist.
Notice that the ordinary α -divergences are defined for any α R but our generic quasi-arithmetic α -divergences are defined in general on ( 0 , 1 ) . However, when the weighted means M α and N α admit weighted extrapolations (e.g., the arithmetic mean A α or the geometric mean G α ) the quasi-arithmetic α -divergences can be extended to R \ { 0 , 1 } . Furthermore, when the limits of quasi-arithmetic α -divergences exist for α { 0 , 1 } , the quasi-arithmetic α -divergences may be defined on the full range of α R . To demonstrate the restricted range ( 0 , 1 ) , consider the weighted harmonic mean for x , y > 0 with x y :
$$H_\lambda(x,y) = \frac{1}{(1-\lambda)\frac{1}{x} + \lambda\frac{1}{y}} = \frac{xy}{\lambda x + (1-\lambda)y} = \frac{xy}{y + \lambda(x-y)}.$$
Clearly, the denominator may become zero when λ = y/(y − x) and even possibly negative. Thus, to avoid this issue, we restrict the range of α to ( 0 , 1 ) for defining quasi-arithmetic α -divergences.

2.2. The Quasi-Arithmetic α -Divergences

A quasi-arithmetic mean (QAM) is defined for a continuous and strictly monotonic function f : I ⊂ ℝ₊ → J ⊂ ℝ₊ as:
$$M_f(x,y) := f^{-1}\!\left(\frac{f(x)+f(y)}{2}\right).$$
Function f is called the generator of the quasi-arithmetic mean. These strict and reflexive quasi-arithmetic means are also called Kolmogorov means [30], Nagumo means [42], de Finetti means [43], or quasi-linear means [44] in the literature. These means are called quasi-arithmetic means because they can be interpreted as arithmetic means on the arguments f ( x ) and f ( y ) :
$$f(M_f(x,y)) = \frac{f(x)+f(y)}{2} = A(f(x), f(y)).$$
QAMs are strict, reflexive, and symmetric means.
Without loss of generality, we may assume strictly increasing functions f instead of monotonic functions since M_f = M_{−f} . Indeed, M_{−f} ( x , y ) = ( −f )^{−1} ( −f ( M_f ( x , y ) ) ) and ( ( −f )^{−1} ∘ ( −f ) ) ( u ) = u , the identity function. Notice that the composition f_1 ∘ f_2 of two strictly monotonic increasing functions f_1 and f_2 is a strictly monotonic increasing function. Furthermore, we consider I = J = ( 0 , ∞ ) in the remainder since we apply these means on positive densities. Two quasi-arithmetic means M_f and M_g coincide if and only if f ( u ) = a g ( u ) + b for some a > 0 and b ∈ ℝ , see [44]. The quasi-arithmetic means were considered in the axiomatization of the entropies by Rényi to define the α -entropies (see Equation (2.11) of [45]).
By choosing f_A ( u ) = u , f_G ( u ) = log u , or f_H ( u ) = 1/u , we obtain the Pythagorean arithmetic A, geometric G, and harmonic H means, respectively:
  • the arithmetic mean (A): A ( x , y ) = (x + y)/2 = M_{f_A} ( x , y ) ,
  • the geometric mean (G): G ( x , y ) = √(xy) = M_{f_G} ( x , y ) , and
  • the harmonic mean (H): H ( x , y ) = 2/(1/x + 1/y) = 2xy/(x + y) = M_{f_H} ( x , y ) .
More generally, choosing f P r ( u ) = u r , we obtain the parametric family of power means also called Hölder means [46] or binary means [47]:
$$P_r(x,y) = \left(\frac{x^r + y^r}{2}\right)^{\frac{1}{r}} = M_{f_{P_r}}(x,y), \qquad r \in \mathbb{R}\setminus\{0\}.$$
In order to get a smooth family of power means, we define the geometric mean as the limit case of r → 0 :
$$P_0(x,y) = \lim_{r\to 0} P_r(x,y) = G(x,y) = \sqrt{xy}.$$
A mean M is positively homogeneous if and only if M ( t a , t b ) = t M ( a , b ) for any t > 0 . It is known that the only positively homogeneous quasi-arithmetic means coincide exactly with the family of power means [44]. The weighted QAMs are given by
$$M_\alpha^f(p,q) = f^{-1}\!\left((1-\alpha)f(p) + \alpha f(q)\right)$$
$$= f^{-1}\!\left(f(p) + \alpha\left(f(q) - f(p)\right)\right) = M_{1-\alpha}^f(q,p).$$
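As a quick illustration (a sketch only; the generator/inverse pairs below correspond to the classical generators listed above), the weighted quasi-arithmetic mean can be evaluated for several generators at once:

```python
# A small sketch of weighted quasi-arithmetic means
# M_alpha^f(p,q) = f^{-1}((1-alpha) f(p) + alpha f(q)) for a few classical generators.
import numpy as np

def weighted_qam(f, f_inv, alpha, p, q):
    """Weighted QAM induced by a strictly monotone generator f with inverse f_inv."""
    return f_inv((1 - alpha) * f(p) + alpha * f(q))

generators = {
    "arithmetic": (lambda u: u,        lambda v: v),             # f_A(u) = u
    "geometric":  (np.log,             np.exp),                  # f_G(u) = log u
    "harmonic":   (lambda u: 1.0 / u,  lambda v: 1.0 / v),       # f_H(u) = 1/u
    "power r=3":  (lambda u: u**3,     lambda v: v**(1.0 / 3)),  # f_{P_3}(u) = u^3
}

p, q, alpha = 2.0, 8.0, 0.5
for name, (f, f_inv) in generators.items():
    m = weighted_qam(f, f_inv, alpha, p, q)
    print(f"{name:10s} M_1/2(2, 8) = {m:.4f}")   # A = 5, G = 4, H = 3.2, P_3 ~ 6.38
```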
Let us remark that QAMs were generalized to complex-valued generators in [48] and to probability measures defined on a compact support in [49].
Notice that there exist other positively homogeneous means which are not quasi-arithmetic means. For example, the logarithmic mean [50,51] L ( x , y ) for x > 0 and y > 0 :
$$L(x,y) = \frac{y - x}{\log y - \log x}$$
is an example of a homogeneous mean (i.e., L ( t x , t y ) = t L ( x , y ) for any t > 0 ) that is not a QAM. Besides the family of QAMs, there exist many other families of means [35]. For example, let us mention the Lagrangian means [52], which intersect with the QAMs only for the arithmetic mean, or a generalization of the QAMs called the Bajraktarević means [53].
Let us now strengthen a recent theorem (Theorem 1 of [54], 2010):
Theorem 1
(Strictly comparable weighted QAMs). The pair ( M f , M g ) of quasi-arithmetic means obtained for two strictly increasing generators f and g is strictly comparable provided that function f g 1 is strictly convex, where ∘ denotes the function composition.
Proof. 
Since f ∘ g^{−1} is strictly convex, it is convex, and therefore it follows from Theorem 1 of [54] that M_α^f ≥ M_α^g for all α ∈ [ 0 , 1 ] . Thus, the very nice property of QAMs is that M_f ≥ M_g implies that M_α^f ≥ M_α^g for any α ∈ [ 0 , 1 ] . Now, let us consider the equation M_α^f ( p , q ) = M_α^g ( p , q ) for p ≠ q :
$$f^{-1}\!\left((1-\alpha)f(p) + \alpha f(q)\right) = g^{-1}\!\left((1-\alpha)g(p) + \alpha g(q)\right).$$
Since f ∘ g^{−1} is assumed strictly convex, and g is strictly increasing, we have g ( p ) ≠ g ( q ) for p ≠ q , and we reach the following contradiction:
$$(1-\alpha)f(p) + \alpha f(q) = (f\circ g^{-1})\left((1-\alpha)g(p) + \alpha g(q)\right)$$
$$< (1-\alpha)(f\circ g^{-1})(g(p)) + \alpha (f\circ g^{-1})(g(q))$$
$$= (1-\alpha)f(p) + \alpha f(q).$$
Thus, M_α^f ( p , q ) ≠ M_α^g ( p , q ) for p ≠ q , and M_α^f ( p , q ) = M_α^g ( p , q ) for p = q . □
Thus, we can define the quasi-arithmetic α -divergences as follows:
Definition 4
(Quasi-arithmetic α -divergences). The ( f , g ) α-divergences I α f , g ( p : q ) : = I α M f , M g ( p : q ) between two positive densities p and q for α ( 0 , 1 ) are defined for two strictly increasing and differentiable functions f and g such that f g 1 is strictly convex by:
$$I_\alpha^{f,g}(p:q) := \frac{1}{\alpha(1-\alpha)}\int\left(M_{1-\alpha}^f(p,q) - M_{1-\alpha}^g(p,q)\right)\mathrm{d}\mu,$$
where M λ f and M λ g are the weighted quasi-arithmetic means induced by f and g, respectively.
We have the following corollary:
Corollary 1
(Proper quasi-arithmetic α -divergences). Let ( M f , M g ) be a pair of quasi-arithmetic means with f g 1 strictly convex, then the ( M f , M g ) α-divergences are proper divergences for α ( 0 , 1 ) .
Proof. 
Consider p and q with p ( x ) ≠ q ( x ) μ -almost everywhere. Since f ∘ g^{−1} is strictly convex, we have M_f ( x , y ) − M_g ( x , y ) ≥ 0 with strict inequality when x ≠ y . Thus, ∫ M_f ( p , q ) dμ − ∫ M_g ( p , q ) dμ > 0 and I_α^{f,g} ( p : q ) > 0 . Therefore the quasi-arithmetic α -divergences I_α^{f,g} satisfy the law of the indiscernibles for α ∈ ( 0 , 1 ) . □
Note that the ( A , G ) α -divergences (i.e., the ordinary α -divergences) are proper divergences satisfying both the properties D1 and D2 because f_A ( u ) = u and f_G ( u ) = log u , and hence ( f_A ∘ f_G^{−1} ) ( u ) = exp ( u ) is strictly convex on ( 0 , ∞ ) .
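A quick numerical check of Theorem 1 (illustration only, using a pointwise comparison): when f ∘ g^{−1} is strictly convex, the weighted QAM induced by f dominates the one induced by g, with equality only on the diagonal. Here f ( u ) = u and g ( u ) = log u, so f ∘ g^{−1} = exp is strictly convex.

```python
# Numerical illustration of Theorem 1: M_alpha^f >= M_alpha^g when f o g^{-1} is
# strictly convex (here f = identity, g = log, so f o g^{-1} = exp).
import numpy as np

def qam(f, f_inv, alpha, x, y):
    return f_inv((1 - alpha) * f(x) + alpha * f(y))

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 10.0, size=10_000)
y = rng.uniform(0.1, 10.0, size=10_000)
alpha = rng.uniform(0.0, 1.0, size=10_000)

m_f = qam(lambda u: u, lambda v: v, alpha, x, y)   # weighted arithmetic mean
m_g = qam(np.log, np.exp, alpha, x, y)             # weighted geometric mean
print("min gap:", (m_f - m_g).min())               # >= 0 up to floating-point rounding
print("gap at x == y:",
      qam(lambda u: u, lambda v: v, 0.3, 2.0, 2.0)
      - qam(np.log, np.exp, 0.3, 2.0, 2.0))        # ~0: equality on the diagonal
```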
Let us denote by I α f , g ( p : q ) : = I α M f , M g ( p : q ) the quasi-arithmetic α -divergences. Since the QAMs are symmetric means, we have I α f , g ( p : q ) = I 1 α f , g ( q : p ) .
Remark 2.
Let us notice that Zhang [55] in their study of divergences under monotone embeddings also defined the following family of related divergences (Equation (71) of [55]):
I ^ α A f , g ( p : q ) = 4 1 α A 2 M 1 + α A 2 f ( p , q ) M 1 + α A 2 g ( p , q ) d μ .
However, Zhang did not study the limit case divergences I ^ α A f , g ( p : q ) when α A ± 1 .

2.3. Limit Cases of 1-Divergences and 0-Divergences

We seek a closed-form formula of the limit divergence lim α 0 I α f , g ( p : q ) when α 0 .
Lemma 1.
A first-order Taylor approximation of the quasi-arithmetic mean [56] M_α^f for a C¹ strictly increasing generator f when α → 0 yields
$$M_\alpha^f(p,q) = p + \alpha\,\frac{f(q)-f(p)}{f'(p)} + o\!\left(\alpha\left(f(q)-f(p)\right)\right).$$
Proof. 
By taking the first-order Taylor expansion of f^{−1} ( x ) at x_0 (i.e., Taylor polynomial of order 1), we get:
$$f^{-1}(x) = f^{-1}(x_0) + (x - x_0)\,(f^{-1})'(x_0) + o(x - x_0).$$
Using the property of the derivative of an inverse function
$$(f^{-1})'(x) = \frac{1}{f'(f^{-1}(x))},$$
it follows that the first-order Taylor expansion of f^{−1} ( x ) is:
$$f^{-1}(x) = f^{-1}(x_0) + \frac{x - x_0}{f'(f^{-1}(x_0))} + o(x - x_0).$$
Plugging x_0 = f ( p ) and x = f ( p ) + α ( f ( q ) − f ( p ) ) , we get a first-order approximation of the weighted quasi-arithmetic mean M_α^f when α → 0 :
$$M_\alpha^f(p,q) = p + \alpha\,\frac{f(q)-f(p)}{f'(p)} + o\!\left(\alpha\left(f(q)-f(p)\right)\right).$$
Let us introduce the following bivariate function:
$$E_f(p,q) := \frac{f(q)-f(p)}{f'(p)}.$$
Remark 3.
Notice that E_{−f} ( p , q ) = E_f ( p , q ) matches the fact that M_α^{−f} ( p , q ) = M_α^f ( p , q ) . That is, we may either consider a strictly increasing differentiable generator f, or equivalently a strictly decreasing differentiable generator −f .
Thus, we obtain closed-form formulas for the I 1 -divergence and I 0 -divergence:
Theorem 2
(Quasi-arithmetic I 1 -divergence and reverse I 0 -divergence). The quasi-arithmetic I 1 -divergence induced by two strictly increasing and differentiable functions f and g such that f g 1 is strictly convex is
$$I_1^{f,g}(p:q) := \lim_{\alpha\to 1} I_\alpha^{f,g}(p:q) = \int \left(E_f(p,q) - E_g(p,q)\right)\mathrm{d}\mu \geq 0$$
$$= \int \left(\frac{f(q)-f(p)}{f'(p)} - \frac{g(q)-g(p)}{g'(p)}\right)\mathrm{d}\mu.$$
Furthermore, we have I 0 f , g ( p : q ) = I 1 f , g ( q : p ) = ( I 1 f , g ) * ( p : q ) , the reverse divergence.
Proof. 
Let us prove that I_1^{f,g} is a proper divergence satisfying axioms D1 and D2. Note that a sufficient condition for I_1^{f,g} ( p : q ) ≥ 0 is to check that
$$E_f(p,q) \geq E_g(p,q),$$
$$\frac{f(q)-f(p)}{f'(p)} \geq \frac{g(q)-g(p)}{g'(p)}.$$
If p = q μ -almost everywhere then clearly I_1^{f,g} ( p : q ) = 0 . Consider p ≠ q (i.e., at some observation x: p ( x ) ≠ q ( x ) ).
We use the following property of a strictly convex and differentiable function h for x < y (sometimes called the chordal slope lemma, see [29]):
$$h'(x) \leq \frac{h(y)-h(x)}{y-x} \leq h'(y).$$
We consider h ( x ) = ( f ∘ g^{−1} ) ( x ) so that h′ ( x ) = f′ ( g^{−1} ( x ) ) / g′ ( g^{−1} ( x ) ) . There are two cases to consider:
  • p < q and therefore g ( p ) < g ( q ) . Let y = g ( q ) and x = g ( p ) in Equation (57). We have h′ ( x ) = f′ ( p ) / g′ ( p ) and h′ ( y ) = f′ ( q ) / g′ ( q ) , and the double inequality of Equation (57) becomes
    $$\frac{f'(p)}{g'(p)} \leq \frac{f(q)-f(p)}{g(q)-g(p)} \leq \frac{f'(q)}{g'(q)}.$$
    Since g ( q ) − g ( p ) > 0 , g′ ( p ) > 0 , and f′ ( p ) > 0 , we get
    $$\frac{g(q)-g(p)}{g'(p)} \leq \frac{f(q)-f(p)}{f'(p)}.$$
  • q < p and therefore g ( p ) > g ( q ) . Then, the double inequality of Equation (57) becomes
    $$\frac{f'(q)}{g'(q)} \leq \frac{f(q)-f(p)}{g(q)-g(p)} \leq \frac{f'(p)}{g'(p)}.$$
    That is,
    $$\frac{f(q)-f(p)}{f'(p)} \geq \frac{g(q)-g(p)}{g'(p)},$$
    since g ( q ) − g ( p ) < 0 .
Thus, in both cases, we checked that E_f ( p ( x ) , q ( x ) ) ≥ E_g ( p ( x ) , q ( x ) ) . Therefore, I_1^{f,g} ( p : q ) ≥ 0 , and since the QAMs are distinct, I_1^{f,g} ( p : q ) = 0 iff p ( x ) = q ( x ) μ -a.e. □
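The closed form of Theorem 2 can be sanity-checked numerically (a sketch under the counting-measure convention) against the α → 1 limit of I_α^{f,g}, here with the generators f ( u ) = u² and g ( u ) = log u, for which f ∘ g^{−1} ( v ) = exp ( 2v ) is strictly convex:

```python
# Sanity check (illustration only): closed-form quasi-arithmetic 1-divergence versus
# the alpha -> 1 limit of I_alpha^{f,g}, with f(u) = u^2 and g(u) = log u.
import numpy as np

f, f_inv, f_prime = lambda u: u**2, np.sqrt, lambda u: 2 * u
g, g_inv, g_prime = np.log, np.exp, lambda u: 1.0 / u

def I_alpha(p, q, alpha):
    m_f = f_inv(alpha * f(p) + (1 - alpha) * f(q))   # M_{1-alpha}^f(p, q)
    m_g = g_inv(alpha * g(p) + (1 - alpha) * g(q))   # M_{1-alpha}^g(p, q)
    return np.sum(m_f - m_g) / (alpha * (1 - alpha))

def I_one(p, q):
    """Closed form: sum of (f(q)-f(p))/f'(p) - (g(q)-g(p))/g'(p)."""
    return np.sum((f(q) - f(p)) / f_prime(p) - (g(q) - g(p)) / g_prime(p))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.4, 0.3])
print(I_alpha(p, q, 1 - 1e-4))   # numerically close to the closed form below
print(I_one(p, q))
```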
We can interpret the I 1 divergences as generalized KL divergences and define generalized notions of cross-entropies and entropies. Since the KL divergence can be written as the cross-entropy minus the entropy, we can also decompose the I 1 divergences as follows:
$$I_1^{f,g}(p:q) = \int\left(\frac{f(q)}{f'(p)} - \frac{g(q)}{g'(p)}\right)\mathrm{d}\mu - \int\left(\frac{f(p)}{f'(p)} - \frac{g(p)}{g'(p)}\right)\mathrm{d}\mu$$
$$= h_\times^{f,g}(p:q) - h^{f,g}(p),$$
where h_×^{f,g} ( p : q ) denotes the ( f , g ) -cross-entropy (for a constant c ∈ ℝ ):
$$h_\times^{f,g}(p:q) = \int\left(\frac{f(q)}{f'(p)} - \frac{g(q)}{g'(p)}\right)\mathrm{d}\mu + c,$$
and h^{f,g} ( p ) stands for the ( f , g ) -entropy (self cross-entropy):
$$h^{f,g}(p) = h_\times^{f,g}(p:p) = \int\left(\frac{f(p)}{f'(p)} - \frac{g(p)}{g'(p)}\right)\mathrm{d}\mu + c.$$
Notice that we recover the Shannon entropy for f ( x ) = x and g ( x ) = log x with ( f ∘ g^{−1} ) ( x ) = exp ( x ) (strictly convex) and c = −1 to annihilate the ∫ p dμ = 1 term:
$$h^{\mathrm{id},\log}(p) = \int\left(p - p\log p\right)\mathrm{d}\mu - 1 = -\int p\log p\,\mathrm{d}\mu.$$
We define the generalized ( f , g ) -Kullback–Leibler divergence or generalized ( f , g ) -relative entropies:
KL f , g ( p : q ) : = h × f , g ( p : q ) h f , g ( p ) .
When f = f_A and g = f_G , we resolve the constant to c = 0 , and recover the ordinary Shannon cross-entropy and entropy:
$$h_\times^{f_A,f_G}(p:q) = \int\left(q - p\log q\right)\mathrm{d}\mu = h_\times(p:q),$$
$$h^{f_A,f_G}(p) = h_\times^{f_A,f_G}(p:p) = \int\left(p - p\log p\right)\mathrm{d}\mu = h(p),$$
and we have the ( f_A , f_G ) -Kullback–Leibler divergence that is the extended Kullback–Leibler divergence:
$$\mathrm{KL}^{f_A,f_G}(p:q) = \mathrm{KL}^+(p:q) = h_\times(p:q) - h(p) = \int\left(p\log\frac{p}{q} + q - p\right)\mathrm{d}\mu.$$
Thus, we have the ( f , g ) -cross-entropy and ( f , g ) -entropy expressed as
$$h_\times^{f,g}(p:q) = \int\left(\frac{f(q)}{f'(p)} - \frac{g(q)}{g'(p)}\right)\mathrm{d}\mu,$$
$$h^{f,g}(p) = \int\left(\frac{f(p)}{f'(p)} - \frac{g(p)}{g'(p)}\right)\mathrm{d}\mu.$$
In general, we can define the ( f , g ) -Jeffreys divergence as:
J f , g ( p : q ) = KL f , g ( p : q ) + KL f , g ( q : p ) .
Thus, we define the quasi-arithmetic mean α -divergences as follows:
Theorem 3
(Quasi-arithmetic α -divergences). Let f and g be two continuous, strictly increasing, and differentiable functions on ( 0 , ∞ ) such that f ∘ g^{−1} is strictly convex. Then, the quasi-arithmetic α-divergences induced by ( f , g ) for α ∈ [ 0 , 1 ] are
$$I_\alpha^{f,g}(p:q) = \begin{cases} \frac{1}{\alpha(1-\alpha)}\int\left(M_{1-\alpha}^f(p,q) - M_{1-\alpha}^g(p,q)\right)\mathrm{d}\mu, & \alpha \in \mathbb{R}\setminus\{0,1\},\\ I_1^{f,g}(p:q) = \int\left(\frac{f(q)-f(p)}{f'(p)} - \frac{g(q)-g(p)}{g'(p)}\right)\mathrm{d}\mu, & \alpha=1,\\ I_0^{f,g}(p:q) = \int\left(\frac{f(p)-f(q)}{f'(q)} - \frac{g(p)-g(q)}{g'(q)}\right)\mathrm{d}\mu, & \alpha=0. \end{cases}$$
When f ( u ) = f A ( u ) = u ( M f = A ) and g ( u ) = f G ( u ) = log u ( M g = G ), we get
$$I_1^{A,G}(p:q) = \int\left(q - p - p\log\frac{q}{p}\right)\mathrm{d}\mu = \mathrm{KL}^+(p:q) = I_1(p:q),$$
the Kullback–Leibler divergence (KLD) extended to positive densities, and I 0 = KL + * the reverse extended KLD.
Let M denote the class of strictly increasing and differentiable real-valued univariate functions. An interesting question is to study the class of pairs of functions ( f , g ) M × M such that I 1 f , g ( p : q ) = KL ( p : q ) . This involves solving integral-based functional equations [57].
We can rewrite the α -divergence I α f , g ( p : q ) for α ( 0 , 1 ) as
$$I_\alpha^{f,g}(p:q) = \frac{1}{\alpha(1-\alpha)}\left(S_{1-\alpha}^f(p,q) - S_{1-\alpha}^g(p,q)\right),$$
where
$$S_\lambda^h(p,q) := \int M_\lambda^h(p,q)\,\mathrm{d}\mu.$$
Zhang [11] (pp. 188–189) considered the ( A , M ρ ) α A -divergences:
$$D_\alpha^\rho(p:q) := \frac{4}{1-\alpha^2}\int\left(\frac{1-\alpha}{2}\,p + \frac{1+\alpha}{2}\,q - \rho^{-1}\!\left(\frac{1-\alpha}{2}\rho(p) + \frac{1+\alpha}{2}\rho(q)\right)\right)\mathrm{d}\mu.$$
Zhang obtained for D_{±1}^ρ ( p : q ) the following formula:
$$D_1^\rho(p:q) = \int\left(p - q - (\rho^{-1})'(\rho(q))\left(\rho(p) - \rho(q)\right)\right)\mathrm{d}\mu = D_{-1}^\rho(q:p),$$
which is in accordance with our generic formula of Equation (53) since ( ρ^{−1} )′ ( x ) = 1/ρ′ ( ρ^{−1} ( x ) ) . Notice that A_α ≥ P_α^r for r ≤ 1 ; the arithmetic weighted mean dominates the weighted power means P^r when r ≤ 1 .
Furthermore, by imposing the homogeneity condition I α A , M ρ ( t p : t q ) = t I α A , M ρ ( p : q ) for t > 0 , Zhang [11] obtained the class of ( α A , β A ) -divergences for ( α A , β A ) [ 1 , 1 ] 2 :
D α A , β A ( p : q ) : = 4 1 α A 2 2 1 + β A 1 α A 2 p + 1 + α A 2 q 1 α A 2 p 1 β A 2 + 1 + α A 2 q 1 β A 2 2 1 β A d μ .

2.4. Generalized KL Divergences as Conformal Bregman Divergences on Monotone Embeddings

Let us rewrite the generalized KLDs I 1 f , g as a conformal Bregman representational divergence [58,59,60] as follows:
Theorem 4.
The generalized KLDs I 1 f , g divergences are conformal Bregman representational divergences
$$I_1^{f,g}(p:q) = \int \frac{1}{f'(p)}\,B_F\!\left(g(q) : g(p)\right)\mathrm{d}\mu,$$
with F = f ∘ g^{−1} a strictly convex and differentiable Bregman convex generator defining the scalar Bregman divergence [61] B_F :
$$B_F(a:b) = F(a) - F(b) - (a-b)\,F'(b).$$
Proof. 
For the Bregman strictly convex and differentiable generator F = f ∘ g^{−1} , we expand the following conformal divergence
$$\frac{1}{f'(p)}B_F(g(q):g(p)) = \frac{1}{f'(p)}\left(F(g(q)) - F(g(p)) - (g(q)-g(p))\,F'(g(p))\right)$$
$$= \frac{1}{f'(p)}\left(f(q) - f(p) - (g(q)-g(p))\,\frac{f'(p)}{g'(p)}\right),$$
since ( g^{−1} ∘ g ) ( x ) = x and F′ ( g ( x ) ) = f′ ( x ) / g′ ( x ) . It follows that
$$\frac{1}{f'(p)}B_F(g(q):g(p)) = \frac{f(q)-f(p)}{f'(p)} - \frac{g(q)-g(p)}{g'(p)}$$
$$= E_f(p,q) - E_g(p,q),$$
the integrand of I_1^{f,g} ( p : q ) . Hence, we easily check that I_1^{f,g} ( p : q ) = ∫ (1/f′ ( p )) B_F ( g ( q ) : g ( p ) ) dμ ≥ 0 since f′ ( p ) > 0 and B_F ≥ 0 . □
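Theorem 4 can be verified pointwise with a few lines of code (illustration only): the integrand of I_1^{f,g} coincides with the conformally scaled Bregman divergence (1/f′(p)) B_F ( g(q) : g(p) ) for F = f ∘ g^{−1}, shown here for f ( u ) = u and g ( u ) = log u (so F = exp):

```python
# Pointwise numerical check of the conformal Bregman rewriting of I_1^{f,g}
# with f = identity and g = log, so F = f o g^{-1} = exp.
import numpy as np

f, f_prime = lambda u: u, lambda u: np.ones_like(np.asarray(u, float))
g, g_prime = np.log, lambda u: 1.0 / u
F, F_prime = np.exp, np.exp

def bregman(F, F_prime, a, b):
    return F(a) - F(b) - (a - b) * F_prime(b)

def i1_integrand(p, q):
    return (f(q) - f(p)) / f_prime(p) - (g(q) - g(p)) / g_prime(p)

p, q = np.array([0.2, 0.5, 0.3]), np.array([0.3, 0.4, 0.3])
lhs = i1_integrand(p, q)
rhs = bregman(F, F_prime, g(q), g(p)) / f_prime(p)
print(np.allclose(lhs, rhs))   # True: both equal q - p - p*log(q/p) in this case
```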
In general, for a functional generator f and a strictly monotonic representational function r (also called monotone embedding [62] in information geometry), we can define the representational Bregman divergence [63] B f r 1 ( r ( p ) : r ( q ) ) provided that F = f r 1 is a Bregman generator (i.e., strictly convex and differentiable).
The Itakura–Saito divergence [64] (IS) between two densities p and q is defined by:
$$D_{\mathrm{IS}}(p:q) = \int\left(\frac{p}{q} - \log\frac{p}{q} - 1\right)\mathrm{d}\mu$$
$$= \int D_{\mathrm{IS}}(p(x):q(x))\,\mathrm{d}\mu(x),$$
where D_IS ( x : y ) = x/y − log (x/y) − 1 is the scalar IS divergence. This divergence was originally designed in sound processing for measuring the discrepancy between two speech power spectra. Observe that the IS divergence is invariant by rescaling: D_IS ( t p : t q ) = D_IS ( p : q ) for any t > 0 . The IS divergence is a Bregman divergence [61] obtained for the Burg information generator (i.e., negative Burg entropy): F_Burg ( u ) = −log u with F′_Burg ( u ) = −1/u . It follows that we have
I 1 f ( p : q ) = p B f ( q : p ) d μ ,
The Itakura–Saito divergence may further be extended to a family of α -Itakura–Saito divergences (see [6], Equation (10.45) of Theorem 10.1):
$$D_{\mathrm{IS},\alpha}(p:q) = \begin{cases} \frac{1}{\alpha^2}\int\left(\left(\frac{p}{q}\right)^\alpha - \alpha\log\frac{p}{q} - 1\right)\mathrm{d}\mu, & \alpha \neq 0\\ \frac{1}{2}\int\left(\log q - \log p\right)^2\mathrm{d}\mu, & \alpha = 0. \end{cases}$$
In [56], a generalization of the Bregman divergences was obtained using the comparative convexity induced by two abstract means M and N to define ( M , N ) -Bregman divergences as limit of scaled ( M , N ) -Jensen divergences. The skew ( M , N ) -Jensen divergences are defined for α ( 0 , 1 ) by:
$$J_{F,\alpha}^{M,N}(p:q) = \frac{1}{\alpha(1-\alpha)}\left(N_\alpha(F(p), F(q)) - F(M_\alpha(p,q))\right),$$
where M_α and N_α are weighted means that should be regular [56] (i.e., homogeneous, symmetric, continuous, and increasing in each variable). Then, we can define the ( M , N ) -Bregman divergence as
$$B_F^{M,N}(p:q) = \lim_{\alpha\to 1} J_{F,\alpha}^{M,N}(p:q)$$
$$= \lim_{\alpha\to 1} \frac{1}{\alpha(1-\alpha)}\left(N_\alpha(F(p), F(q)) - F(M_\alpha(p,q))\right).$$
The formula obtained in [56] for the quasi-arithmetic means M_f and M_g and a functional generator F that is ( M_f , M_g ) -convex is:
$$B_F^{f,g}(p:q) = \frac{g(F(p)) - g(F(q))}{g'(F(q))} - \frac{f(p)-f(q)}{f'(q)}\,F'(q)$$
$$= \frac{1}{g'(F(q))}\,B_{g\circ F\circ f^{-1}}\!\left(f(p) : f(q)\right) \geq 0.$$
This is a conformal divergence [58] that can be written using the E_f terms as:
$$B_F^{f,g}(p:q) = E_g(F(q), F(p)) - E_f(q,p)\,F'(q).$$
A function F is ( M f , M g ) -convex iff g F f 1 is (ordinary) convex [56].
The information geometry induced by a Bregman divergence (or equivalently by its convex generator) is a dually flat space [6]. The dualistic structure induced by a conformal Bregman representational divergence is related to conformal flattening [59,60]. The notion of conformal structures was first introduced in information geometry by Okamoto et al. [65].
Following the work of Ohara [59,60,66], the Kurose geometric divergence ρ ( p , r ) [67] (a contrast function in affine differential geometry) induced by a pair ( L , M ) of strictly monotone smooth functions between two distributions p and r of the d-dimensional probability simplex Δ d is defined by (Equation (28) in [59]):
ρ ( p : r ) = 1 Λ ( r ) i = 1 d + 1 L ( p i ) L ( r i ) L ( r i ) = 1 Λ ( r ) i = 1 d + 1 E L ( r i , p i ) ,
where Λ ( r ) = i = 1 d + 1 1 L ( p i ) p i . Affine immersions [67] can be interpreted as special embeddings.
Let ρ be a divergence (contrast function) and ( ρ g , ρ , ρ * ) be the induced statistical manifold structure with
ρ g i j ( p ) : = ( i ) p ( j ) p ρ ( p , q ) | q = p ,
Γ i j , k ( p ) : = ( i ) p ( j ) p ( k ) q ρ ( p , q ) | q = p ,
Γ i j , k * ( p ) : = ( i ) p ( j ) q ( k ) q ρ ( p , q ) | q = p ,
where ( i ) s denotes the tangent vector at s of a vector field i .
Consider a conformal divergence ρ κ ( p : q ) = κ ( q ) ρ ( p : q ) for a positive function κ ( q ) > 0 , called the conformal factor. Then, the induced statistical manifold [6,7] ( ρ κ g , ρ κ , ρ κ * ) is 1-conformally equivalent to ( ρ g , ρ , ρ * ) and we have
ρ κ g = κ ρ g ,
ρ g ( ρ κ X Y , Z ) = ρ g ( ρ X Y , Z ) d ( log κ ) ( Z ) ρ g ( X , Y ) .
The dual affine connections ρ κ * and ρ * are projectively equivalent [67] (and ρ * is said 1 -conformally flat).
Conformal flattening [59,60] consists of choosing the conformal factor κ such that ( ρ κ g , ρ κ , ρ κ ) becomes a dually flat space [6] equipped with a canonical Bregman divergence.
Therefore, it follows that the statistical manifold induced by the 1-divergence I_1^{f,g} is a representational 1-conformally flat statistical manifold. Figure 1 gives an overview of the interplay of divergences with information-geometric structures. The logarithmic divergence [68] L_{G,α} is defined for α > 0 and an α -exponentially concave generator G by:
$$L_{G,\alpha}(\theta_1:\theta_2) = \frac{1}{\alpha}\log\left(1 + \alpha\,\nabla G(\theta_2)^\top(\theta_1-\theta_2)\right) + G(\theta_2) - G(\theta_1).$$
When α → 0 , we have L_{G,α} ( θ_1 : θ_2 ) → B_{−G} ( θ_1 : θ_2 ) , where B_F is the Bregman divergence [61] induced by a strictly convex and smooth function F:
$$B_F(\theta_1:\theta_2) = F(\theta_1) - F(\theta_2) - (\theta_1-\theta_2)^\top\nabla F(\theta_2).$$

3. The Subfamily of Homogeneous ( r , s ) -Power α -Divergences for r > s

In particular, we can define the ( r , s ) -power α -divergences from two power means P_r = M_{pow_r} and P_s = M_{pow_s} with r > s (and P_r ≥ P_s ) with the family of generators pow_l ( u ) = u^l . Indeed, we check that f_{r/s} ( u ) := ( pow_r ∘ pow_s^{−1} ) ( u ) = u^{r/s} is strictly convex on ( 0 , ∞ ) since f″_{r/s} ( u ) = (r/s) ((r/s) − 1) u^{(r/s)−2} > 0 for r > s . Thus, P_r and P_s are two QAMs which are both comparable and distinct. Table 1 lists the expressions of E_r ( p , q ) := E_{pow_r} ( p , q ) obtained from the power mean generators pow_r ( u ) = u^r .
We conclude with the definition of the ( r , s ) -power α -divergences:
Corollary 2
(power α -divergences). Given r > s , the α-power divergences are defined for r > s and r , s ≠ 0 by
$$I_\alpha^{r,s}(p:q) = \begin{cases} \frac{1}{\alpha(1-\alpha)}\int\left(\left(\alpha p^r + (1-\alpha)q^r\right)^{\frac{1}{r}} - \left(\alpha p^s + (1-\alpha)q^s\right)^{\frac{1}{s}}\right)\mathrm{d}\mu, & \alpha \in \mathbb{R}\setminus\{0,1\},\\ I_1^{r,s}(p:q) = \int\left(\frac{q^r-p^r}{r\,p^{r-1}} - \frac{q^s-p^s}{s\,p^{s-1}}\right)\mathrm{d}\mu, & \alpha=1,\\ I_0^{r,s}(p:q) = I_1^{r,s}(q:p), & \alpha=0. \end{cases}$$
When r = 0 , we get the following power α -divergences for s < 0 :
$$I_\alpha^{0,s}(p:q) = \begin{cases} \frac{1}{\alpha(1-\alpha)}\int\left(p^\alpha q^{1-\alpha} - \left(\alpha p^s + (1-\alpha)q^s\right)^{\frac{1}{s}}\right)\mathrm{d}\mu, & \alpha \in \mathbb{R}\setminus\{0,1\},\\ I_1^{0,s}(p:q) = \int\left(p\log\frac{q}{p} - \frac{q^s-p^s}{s\,p^{s-1}}\right)\mathrm{d}\mu, & \alpha=1,\\ I_0^{0,s}(p:q) = I_1^{0,s}(q:p), & \alpha=0. \end{cases}$$
When s = 0 , we get the following power α -divergences for r > 0 :
$$I_\alpha^{r,0}(p:q) = \begin{cases} \frac{1}{\alpha(1-\alpha)}\int\left(\left(\alpha p^r + (1-\alpha)q^r\right)^{\frac{1}{r}} - p^\alpha q^{1-\alpha}\right)\mathrm{d}\mu, & \alpha \in \mathbb{R}\setminus\{0,1\},\\ I_1^{r,0}(p:q) = \int\left(\frac{q^r-p^r}{r\,p^{r-1}} - p\log\frac{q}{p}\right)\mathrm{d}\mu, & \alpha=1,\\ I_0^{r,0}(p:q) = I_1^{r,0}(q:p), & \alpha=0. \end{cases}$$
In particular, we get the following family of ( A , H ) α -divergences
$$I_\alpha^{A,H}(p:q) = I_\alpha^{1,-1}(p:q) = \begin{cases} \frac{1}{\alpha(1-\alpha)}\int\left(\alpha p + (1-\alpha)q - \frac{pq}{\alpha q + (1-\alpha)p}\right)\mathrm{d}\mu, & \alpha \in \mathbb{R}\setminus\{0,1\},\\ I_1^{1,-1}(p:q) = \int\left(q - 2p + \frac{p^2}{q}\right)\mathrm{d}\mu, & \alpha=1,\\ I_0^{1,-1}(p:q) = I_1^{1,-1}(q:p), & \alpha=0, \end{cases}$$
and the family of ( G , H ) α -divergences:
$$I_\alpha^{G,H}(p:q) = I_\alpha^{0,-1}(p:q) = \begin{cases} \frac{1}{\alpha(1-\alpha)}\int\left(p^\alpha q^{1-\alpha} - \frac{pq}{\alpha q + (1-\alpha)p}\right)\mathrm{d}\mu, & \alpha \in \mathbb{R}\setminus\{0,1\},\\ I_1^{0,-1}(p:q) = \int\left(p\log\frac{q}{p} - p + \frac{p^2}{q}\right)\mathrm{d}\mu, & \alpha=1,\\ I_0^{0,-1}(p:q) = I_1^{0,-1}(q:p), & \alpha=0. \end{cases}$$
The ( r , s ) -power α -divergences for r , s ≠ 0 yield homogeneous divergences: I_α^{r,s} ( t p : t q ) = t I_α^{r,s} ( p : q ) for any t > 0 because the power means are homogeneous: P_α^r ( t x , t y ) = t P_α^r ( x , y ) = t x P_α^r ( 1 , y/x ) . Thus, the I_α^{r,s} -divergences are Csiszár f-divergences [17]
$$I_\alpha^{r,s}(p:q) = \int p(x)\, f_{r,s}\!\left(\frac{q(x)}{p(x)}\right)\mathrm{d}\mu$$
for the generator
$$f_{r,s}(u) = \frac{1}{\alpha(1-\alpha)}\left(P_\alpha^r(1,u) - P_\alpha^s(1,u)\right).$$
Thus, the family of ( r , s ) -power α -divergences are homogeneous divergences:
I α r , s ( t p : t q ) = t I α r , s ( p : q ) , t > 0 .
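The ( r , s ) -power α -divergences of Corollary 2, including their α → 1 and α → 0 limit cases, can be implemented compactly; the sketch below (counting measure, positive arrays, helper names are illustrative) handles r = 0 or s = 0 through the geometric-mean and p log ( q / p ) limits:

```python
# Sketch of the (r, s)-power alpha-divergences with their limit cases (r > s).
import numpy as np

def _E(r, p, q):
    """E_{pow_r}(p,q) = (q^r - p^r)/(r p^{r-1}); the r = 0 (log) limit gives p*log(q/p)."""
    if r == 0:
        return p * np.log(q / p)
    return (q**r - p**r) / (r * p**(r - 1))

def _weighted_power_mean(r, alpha, p, q):
    """M_{1-alpha}^{pow_r}(p,q) = (alpha p^r + (1-alpha) q^r)^(1/r); geometric mean if r = 0."""
    if r == 0:
        return p**alpha * q**(1 - alpha)
    return (alpha * p**r + (1 - alpha) * q**r)**(1.0 / r)

def power_alpha_divergence(p, q, alpha, r, s):
    assert r > s, "requires r > s so that the two power means are strictly comparable"
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.isclose(alpha, 1.0):
        return np.sum(_E(r, p, q) - _E(s, p, q))   # I_1^{r,s}
    if np.isclose(alpha, 0.0):
        return np.sum(_E(r, q, p) - _E(s, q, p))   # I_0^{r,s} = I_1^{r,s}(q:p)
    gap = _weighted_power_mean(r, alpha, p, q) - _weighted_power_mean(s, alpha, p, q)
    return np.sum(gap) / (alpha * (1 - alpha))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.4, 0.3])
print(power_alpha_divergence(p, q, 0.3, 1, -1))   # an (A, H) alpha-divergence value
print(power_alpha_divergence(p, q, 1.0, 1,  0))   # extended KL(p:q), since (r, s) = (1, 0)
```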

4. Applications to Center-Based Clustering

Clustering is a class of unsupervised learning algorithms which partitions a given d-dimensional point set P = { p_1 , … , p_n } into clusters such that data points falling into a same cluster tend to be more similar to each other than to data points belonging to different clusters. The celebrated k-means clustering [69] is a center-based method for clustering P into k clusters C_1 , … , C_k (with P = ∪_{i=1}^k C_i ), by minimizing the following k-means objective function
$$L(P, C) = \frac{1}{n}\sum_{i=1}^n \min_{j\in\{1,\ldots,k\}} \|p_i - c_j\|^2,$$
where the c j ’s denote the cluster representatives. Let C = { c 1 , , c k } denote the set of cluster centers. The cluster C j is defined as the points of P closer to cluster representative c j than any other c i for i j :
$$C_j = \left\{p \in P \;:\; \|p - c_j\|^2 \leq \|p - c_l\|^2,\ \forall l \in \{1,\ldots,k\}\right\}.$$
When k = 1 , it can be shown that the centroid of the point set P is the unique best cluster representative:
$$\arg\min_{c_1} L(P, \{c_1\}): \quad c_1 = \frac{1}{n}\sum_{i=1}^n p_i.$$
When d > 1 and k > 1 , finding a best partition P = ∪_{j=1}^k C_j which minimizes the objective function of Equation (107) is NP-hard [70]. When d = 1 , k-means clustering can be solved exactly using dynamic programming [71] in cubic O ( n³ ) time.
The k-means objective function can be generalized to any arbitrary (potentially asymmetric) divergence D ( · : · ) by considering the following objective function:
$$L_D(P, C) := \frac{1}{n}\sum_{i=1}^n \min_{j\in\{1,\ldots,k\}} D(p_i : c_j).$$
Thus, when D ( p : q ) = ‖ p − q ‖² , one recovers the ordinary k-means clustering [69]. When D ( p : q ) = B_F ( p : q ) is chosen as a Bregman divergence, one gets the right-sided Bregman k-means clustering [72], since the cluster centers appear as the right-sided arguments of D in Equation (108). When F ( x ) = ‖ x ‖²/2 , Bregman k-means clustering (i.e., D ( p : q ) = B_F ( p : q ) in Equation (108)) amounts to the ordinary k-means clustering. The right-sided Bregman centroid for k = 1 coincides with the center of mass and is independent of the Bregman generator F:
$$\arg\min_{c_1} L_{B_F}(P, \{c_1\}): \quad c_1 = \frac{1}{n}\sum_{i=1}^n p_i.$$
The left-sided Bregman k-means clustering is obtained by considering the right-sided Bregman centroid for the reverse Bregman divergence ( B F ) * ( p : q ) = B F ( q : p ) , and the left-sided Bregman centroid [73] can be expressed as a multivariate generalization of the quasi-arithmetic mean:
$$c_1 = (\nabla F)^{-1}\left(\frac{1}{n}\sum_{i=1}^n \nabla F(p_i)\right).$$
In order to study the robustness of k-means clustering with respect to our novel family of divergences I α f , g , we first study the robustness of the left-sided Bregman centroids to outliers.

4.1. Robustness of the Left-Sided Bregman Centroids

Consider two d-dimensional points p = ( p_1 , … , p_d ) and p′ = ( p′_1 , … , p′_d ) of a domain Θ ⊂ ℝ^d . The centroid of p and p′ with respect to any arbitrary divergence D ( · : · ) is by definition the minimizer of
$$L_D(c) = \frac{1}{2}D(p : c) + \frac{1}{2}D(p' : c),$$
provided that the minimizer arg min_{c ∈ Θ} L_D ( c ) is unique. Assume a separable Bregman divergence induced by the generator F ( p ) = ∑_{i=1}^d F ( p_i ) . The left-sided Bregman centroid [73] of p and p′ is given by the following separable quasi-arithmetic centroid:
$$c = (c_1, \ldots, c_d),$$
with
$$c_i = M_f(p_i, p'_i) = f^{-1}\!\left(\frac{f(p_i) + f(p'_i)}{2}\right),$$
where f ( x ) = F′ ( x ) denotes the derivative of the Bregman generator F ( x ) .
Now, fix p (say, p = ( 1 , … , 1 ) ∈ Θ ), and let the coordinates p′_i of p′ all tend to infinity: That is, point p′ plays the role of an outlier data point. We use the general framework of influence functions [74] in statistics to study the robustness of divergence-based centroids. Consider the r-power mean, a quasi-arithmetic mean induced by pow_r ( x ) = x^r for r ≠ 0 and by extension pow_0 ( x ) = log x when r = 0 (geometric mean).
When r < 0 , we check that
$$\lim_{p'_i \to +\infty} M_{\mathrm{pow}_r}(p_i, p'_i) = \lim_{p'_i \to +\infty} \left(\frac{1 + (p'_i)^r}{2}\right)^{\frac{1}{r}}$$
$$= \left(\frac{1}{2}\right)^{\frac{1}{r}} < \infty.$$
That is, the r-power mean is robust to an outlier data point when r < 0 (see Figure 2). Note that if instead of considering the centroid, we consider the barycenter with w denoting the weight of point p and 1 − w denoting the weight of the outlier p′ for w ∈ ( 0 , 1 ) , then the power r-mean falls in a square box of side w^{1/r} when r < 0 .
On the contrary, when r > 0 or r = 0 , we have lim_{p′_i → +∞} M_{pow_r} ( p_i , p′_i ) = ∞ , and the r-power mean diverges to infinity.
Thus, when r < 0 , the quasi-arithmetic centroid of p = ( 1 , … , 1 ) and p′ is contained in a bounding box of length ( 1/2 )^{1/r} with left corner ( 1 , … , 1 ) , and the left-sided Bregman power centroid minimizing
$$\frac{1}{2}B_F(c : p) + \frac{1}{2}B_F(c : p')$$
is robust to the outlier p′ .
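The robustness discussion above is easy to reproduce numerically (illustration only): the coordinate-wise r-power mean of p = ( 1 , … , 1 ) and an outlier p′ stays bounded when r < 0 and diverges when r ≥ 0.

```python
# Illustration: r-power mean of 1 and an increasingly large outlier coordinate.
import numpy as np

def power_mean(r, x, y):
    if r == 0:
        return np.sqrt(x * y)                     # geometric mean as the r -> 0 limit
    return ((x**r + y**r) / 2.0)**(1.0 / r)

outlier_coord = 10.0**np.arange(1, 7)             # outlier coordinates 10, 100, ..., 10^6
for r in (-1.0, 0.0, 1.0):
    means = power_mean(r, 1.0, outlier_coord)
    print(f"r = {r:+.0f}: {np.round(means, 3)}")
# r = -1 (harmonic): values approach (1/2)^(1/r) = 2, a bounded centroid coordinate;
# r = 0 (geometric) and r = +1 (arithmetic): values grow without bound.
```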
To contrast with this result, notice that the right-sided Bregman centroid [72] is always the center of mass (arithmetic mean), and therefore not robust to outliers as a single outlier data point may potentially drag the centroid to infinity.
Example 2.
Since M_f = M_{−f} for any smooth strictly monotone function f, we deduce that the quasi-arithmetic left-sided Bregman centroid induced by F ( x ) = −log x with f ( x ) = F′ ( x ) = −x^{−1} = −1/x for x > 0 is the harmonic mean, which is robust to outliers. The corresponding Bregman divergence is the Itakura–Saito divergence [72].
Notice that it is enough to consider without loss of generality two points p and p : Indeed, the case of the quasi-arithmetic mean of P = { p 1 , , p n } and p can be rewritten as an equivalent weighted quasi-arithmetic mean of two points p ¯ = M f ( p 1 , , p n ) with weight w = n n + 1 and p of weight 1 n + 1 using the replacement property of quasi-arithmetic means:
M f ( p 1 , , p k , p k + 1 , , p n ) = M f ( p ¯ , , p ¯ , p k + 1 , p n )
where p ¯ = M f ( p 1 , , p k ) .

4.2. Robustness of Generalized Kullback–Leibler Centroids

The fact that the generalized KLDs are conformal representational Bregman divergences can be used to design efficient algorithms in computational geometry [60]. For example, let us consider the centroid (or barycenter) of a finite set of weighted probability measures P_1 , … , P_n ≪ μ (with RN derivatives p_1 , … , p_n ) defined as the minimizer of
$$\min_{c} \sum_{i=1}^n w_i\, I_1^{f,g}(p_i : c),$$
where the w_i ’s are positive weights summing up to one ( ∑_{i=1}^n w_i = 1 ). The divergences I_1^{f,g} ( p_i : c ) are separable. Thus, consider, without loss of generality, the scalar generalized KLDs so that we have
$$I_1^{f,g}(p:q) = \frac{1}{f'(p)}\,B_F(g(q) : g(p)),$$
where p and q are scalars.
Since the Bregman centroid is unique and always coincides with the center of mass [72]
$$c^* = \arg\min_{c} \sum_{i=1}^n w_i\, B_F(p_i : c) = \sum_{i=1}^n w_i\, p_i,$$
for positive weights w i ’s summing up to one, we deduce that the right-sided generalized KLD centroid
$$\arg\min_{c} \frac{1}{n}\sum_{i=1}^n I_1^{f,g}(p_i : c) = \arg\min_{c} \frac{1}{n}\sum_{i=1}^n \frac{1}{f'(p_i)}\,B_F(g(c) : g(p_i))$$
amounts to a left-sided Bregman centroid with un-normalized positive weights W_i = 1/f′ ( p_i ) for the scalar Bregman generator F ( x ) = f ( g^{−1} ( x ) ) with F′ ( x ) = f′ ( g^{−1} ( x ) ) / g′ ( g^{−1} ( x ) ) . Therefore, the right-sided generalized KLD centroid c^* is calculated for normalized weights w_i = W_i / ∑_{j=1}^n W_j as:
$$g(c^*) = (F')^{-1}\left(\sum_{i=1}^n w_i\, F'(g(p_i))\right)$$
$$= (F')^{-1}\left(\sum_{i=1}^n \frac{\frac{1}{f'(p_i)}}{\sum_{j=1}^n \frac{1}{f'(p_j)}}\,\frac{f'(p_i)}{g'(p_i)}\right)$$
$$= (F')^{-1}\left(\frac{\sum_{i=1}^n \frac{1}{g'(p_i)}}{\sum_{j=1}^n \frac{1}{f'(p_j)}}\right), \qquad \text{i.e., } c^* = g^{-1}\!\left((F')^{-1}\left(\frac{\sum_{i=1}^n \frac{1}{g'(p_i)}}{\sum_{j=1}^n \frac{1}{f'(p_j)}}\right)\right).$$
Thus, we obtain a closed-form formula when ( F′ )^{−1} is computationally tractable. For example, consider the ( r , s ) -power KLD (with r > s ). We have f′ ( x ) = r x^{r−1} , g′ ( x ) = s x^{s−1} , F ( x ) = x^{r/s} , F′ ( x ) = (r/s) x^{(r−s)/s} and therefore, we get ( F′ )^{−1} ( x ) = ( s x / r )^{s/(r−s)} . Thus, we get a closed-form formula for the right-sided ( r , s ) -power Kullback–Leibler centroid using Equation (113).
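The following sketch (with hypothetical helper names such as F_prime_inv and g_inv, and uniform weights) implements this closed-form right-sided ( r , s ) -power Kullback–Leibler centroid and checks it against a brute-force grid minimization of the objective; the final pullback through g^{−1} returns the centroid on the density scale.

```python
# Sketch: closed-form right-sided (r, s)-power KL centroid versus a grid search.
import numpy as np

r, s = 2.0, 1.0                                    # generators f(x) = x^r, g(x) = x^s, r > s
f_prime = lambda x: r * x**(r - 1)
g_prime = lambda x: s * x**(s - 1)
F_prime_inv = lambda x: (s * x / r)**(s / (r - s)) # inverse of F'(x) = (r/s) x^{(r-s)/s}
g_inv = lambda x: x**(1.0 / s)

def kld_rs(p, c):                                  # scalar I_1^{r,s}(p : c)
    return (c**r - p**r) / (r * p**(r - 1)) - (c**s - p**s) / (s * p**(s - 1))

p = np.array([0.5, 1.0, 2.0, 4.0])
w = np.full(4, 0.25)                               # uniform weights

# Closed form (uniform weights cancel): g(c*) = F'^{-1}( sum 1/g'(p_i) / sum 1/f'(p_j) ).
eta = F_prime_inv(np.sum(1.0 / g_prime(p)) / np.sum(1.0 / f_prime(p)))
c_closed = g_inv(eta)

# Brute-force check on a fine grid of candidate centroids.
grid = np.linspace(0.1, 5.0, 50_001)
objective = np.sum(w * kld_rs(p[None, :], grid[:, None]), axis=1)
print(c_closed, grid[np.argmin(objective)])        # the two values should agree closely
```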
Overall, we can design a k-means-type algorithm with respect to our generalized KLDs following [72]. Moreover, we can initialize probabilistically k-means with a fast k-means++ seeding [34] described in Algorithm 1. The performance of the k-means++ seeding (i.e., the ratio L D ( P , C ) min C L D ( P , C ) ) is O ( log k ) when D ( p : q ) = p q 2 , and the analysis has been extended to arbitrary divergences in [75]. The merit of using the k-means++ seeding is that we do not need to iteratively update the cluster representatives using Lloyd’s heuristic [69] and we can thus bypass the calculations of centroids and merely choose the cluster representatives from the source data points P as described in Algorithm 1.
Algorithm 1: Generic seeding of k-means with divergence-based k-means++.
input: A finite set P = { p 1 , , p n } of n points, the number of cluster
   representatives k 1 , and an arbitrary divergence D ( · : · )
Output: Set of initial cluster centers C = { c 1 , , c k }
Choose c_1 ← p_i with uniform probability and set C = { c_1 }
(The seeding loop of Algorithm 1 is displayed as an image in the original article: the remaining k − 1 centers are chosen iteratively from P, each with probability proportional to its divergence D to the closest already-chosen center, and added to C.)
return C
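A possible realization of Algorithm 1 in code is sketched below; since the loop body is displayed as an image in the original article, the sketch follows the standard k-means++ rule with the squared Euclidean distance replaced by an arbitrary divergence D (in the spirit of [34,75]), and should be read as an assumption rather than the article's exact pseudocode.

```python
# Sketch of divergence-based k-means++ seeding.
import numpy as np

def divergence_kmeanspp_seeding(points, k, divergence, rng=None):
    """Pick k initial centers from `points` (an n x d array); `divergence(p, c)` is a
    non-negative dissimilarity between two d-dimensional points."""
    rng = np.random.default_rng(rng)
    n = len(points)
    centers = [points[rng.integers(n)]]            # first center: uniform pick
    for _ in range(1, k):
        # Divergence from each point to its closest already-chosen center.
        d_min = np.array([min(divergence(p, c) for c in centers) for p in points])
        probs = d_min / d_min.sum()                # sample proportionally to the divergence
        centers.append(points[rng.choice(n, p=probs)])
    return np.array(centers)

# Example with the separable extended KL divergence on positive vectors.
def extended_kl(p, c):
    return float(np.sum(p * np.log(p / c) + c - p))

rng = np.random.default_rng(1)
data = rng.uniform(0.1, 1.0, size=(200, 3))
print(divergence_kmeanspp_seeding(data, k=4, divergence=extended_kl, rng=0))
```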
The advantage of using a conformal Bregman divergence such as a total Bregman divergence [33] or I 1 f , g is to potentially ensure robustness to outliers (e.g., see Theorem III.2 of [33]). Robustness property of these novel I 1 f , g divergences can also be studied for statistical inference tasks based on minimum divergence methods [4,76].

5. Conclusions and Discussion

For two comparable strict means [35] M ( p , q ) ≥ N ( p , q ) (with equality holding if and only if p = q ), one can define their ( M , N ) -divergence as
$$I^{M,N}(p:q) := 4\int\left(M(p,q) - N(p,q)\right)\mathrm{d}\mu.$$
When the property of strictly comparable means extends to their induced weighted means M_α ( p , q ) and N_α ( p , q ) (i.e., M_α ( p , q ) ≥ N_α ( p , q ) ), one can further define the family of ( M , N ) α -divergences for α ∈ ( 0 , 1 ) :
$$I_\alpha^{M,N}(p:q) := \frac{1}{\alpha(1-\alpha)}\int\left(M_{1-\alpha}(p,q) - N_{1-\alpha}(p,q)\right)\mathrm{d}\mu,$$
so that I M , N ( p : q ) = I 1 2 M , N ( p : q ) . When the weighted means are symmetric, the reference duality holds (i.e., I α M , N ( q : p ) = I 1 α M , N ( p : q ) ), and we can define the ( M , N ) -equivalent of the Kullback–Leibler divergence, i.e., the ( M , N ) 1-divergence, as the limit case (when it exists): I 1 M , N ( p : q ) = lim α 1 I α M , N ( p : q ) . Similarly, the ( M , N ) -equivalent of the reverse Kullback–Leibler divergence is obtained as I 0 M , N ( p : q ) = lim α 0 I α M , N ( p : q ) .
We proved that the quasi-arithmetic weighted means [30] M_α^f and M_α^g were strictly comparable whenever f ∘ g^{−1} was strictly convex. In the limit cases of α → 0 and α → 1 , we reported a closed-form formula for the equivalent of the forward and the reverse Kullback–Leibler divergences. We reported closed-form formulas for the quasi-arithmetic α -divergences I_α^{f,g} ( p : q ) := I_α^{M_f,M_g} ( p : q ) for α ∈ [ 0 , 1 ] (Theorem 3) and for the subfamily of homogeneous ( r , s ) -power α -divergences I_α^{r,s} ( p : q ) := I_α^{M_{pow_r},M_{pow_s}} ( p : q ) induced by power means (Corollary 2). The ordinary ( A , G ) α -divergences [12], the ( A , H ) α -divergences, and the ( G , H ) α -divergences are examples of ( r , s ) -power α -divergences obtained for ( r , s ) = ( 1 , 0 ) , ( r , s ) = ( 1 , −1 ) and ( r , s ) = ( 0 , −1 ) , respectively.
Generalized α -divergences may prove useful in reporting a closed-form formula between densities of a parametric family { p_θ } . For example, consider the ordinary α -divergences between two scale Cauchy densities p_1 ( x ) = (1/π) s_1/(x² + s_1²) and p_2 ( x ) = (1/π) s_2/(x² + s_2²) ; there is no obvious closed-form for the ordinary α -divergences, but we can report a closed-form for the ( A , H ) α -divergences following the calculus reported in [41]:
$$I_\alpha^{A,H}(p_1:p_2) = \frac{1}{\alpha(1-\alpha)}\left(1 - \int H_{1-\alpha}(p_1(x), p_2(x))\,\mathrm{d}\mu(x)\right)$$
= 1 α ( 1 α ) 1 s 1 s 2 ( α s 1 + ( 1 α ) s 2 ) s 1 α ,
with s α = α s 1 s 2 2 + ( 1 α ) s 2 s 1 2 α s 1 + ( 1 α ) s 2 . For probability distributions p θ 1 and p θ 2 belonging to the same exponential family [77] with cumulant function F, the ordinary α -divergences admit the following closed-form solution:
$$I_\alpha(p_{\theta_1}:p_{\theta_2}) = \begin{cases} \frac{1}{\alpha(1-\alpha)}\left(1 - \exp\left(F(\alpha\theta_1 + (1-\alpha)\theta_2) - \left(\alpha F(\theta_1) + (1-\alpha)F(\theta_2)\right)\right)\right), & \alpha \in (0,1)\\ I_1(p_{\theta_1}:p_{\theta_2}) = \mathrm{KL}(p_{\theta_1}:p_{\theta_2}) = B_F(\theta_2:\theta_1), & \alpha=1\\ I_0(p_{\theta_1}:p_{\theta_2}) = \mathrm{KL}(p_{\theta_2}:p_{\theta_1}) = B_F(\theta_1:\theta_2), & \alpha=0 \end{cases}$$
where B_F is the Bregman divergence: B_F ( θ_2 : θ_1 ) = F ( θ_2 ) − F ( θ_1 ) − ( θ_2 − θ_1 )^⊤ ∇F ( θ_1 ) .
Instead of considering ordinary α -divergences in applications, one may consider the ( r , s ) -power α -divergences, and tune the three scalar parameters ( r , s , α ) according to the various tasks (say, by cross-validation in supervised machine learning tasks, see [13]). For the limit cases of α → 0 or of α → 1 , we further proved that the limit KL type divergences amounted to conformal Bregman divergences on strictly monotone embeddings and explained the connection of conformal divergences with conformal flattening [60], which allows one to build fast algorithms for centroid-based k-means clustering [72], Voronoi diagrams, and proximity data-structures [60,63]. One idea left for future work is to study the properties of these new ( M , N ) α -divergences for statistical inference [2,4,76].

Funding

This research received no external funding.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Keener, R.W. Theoretical Statistics: Topics for a Core Course; Springer: Berlin/Heidelberg, Germany, 2011. [Google Scholar]
  2. Basu, A.; Shioya, H.; Park, C. Statistical Inference: The Minimum Distance Approach; CRC Press: Boca Raton, FL, USA, 2011. [Google Scholar]
  3. Basseville, M. Divergence measures for statistical data processing — An annotated bibliography. Signal Process. 2013, 93, 621–633. [Google Scholar] [CrossRef]
  4. Pardo, L. Statistical Inference Based on Divergence Measures; CRC Press: Boca Raton, FL, USA, 2018. [Google Scholar]
  5. Oller, J.M. Some geometrical aspects of data analysis and statistics. In Statistical Data Analysis and Inference; Elsevier: Amsterdam, The Netherlands, 1989; pp. 41–58. [Google Scholar]
  6. Amari, S. Information Geometry and Its Applications; Applied Mathematical Sciences; Springer: Tokyo, Japan, 2016. [Google Scholar]
  7. Eguchi, S. Geometry of minimum contrast. Hiroshima Math. J. 1992, 22, 631–647. [Google Scholar] [CrossRef]
  8. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2012. [Google Scholar]
  9. Cichocki, A.; Amari, S.i. Families of alpha-beta-and gamma-divergences: Flexible and robust measures of similarities. Entropy 2010, 12, 1532–1568. [Google Scholar] [CrossRef] [Green Version]
  10. Amari, S.i. α-Divergence is Unique, belonging to Both f-divergence and Bregman Divergence Classes. IEEE Trans. Inf. Theory 2009, 55, 4925–4931. [Google Scholar] [CrossRef]
  11. Zhang, J. Divergence function, duality, and convex analysis. Neural Comput. 2004, 16, 159–195. [Google Scholar] [CrossRef]
  12. Hero, A.O.; Ma, B.; Michel, O.; Gorman, J. Alpha-Divergence for Classification, Indexing and Retrieval; Technical Report CSPL-328; Communication and Signal Processing Laboratory, University of Michigan: Ann Arbor, MI, USA, 2001. [Google Scholar]
  13. Dikmen, O.; Yang, Z.; Oja, E. Learning the information divergence. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 1442–1454. [Google Scholar] [CrossRef] [Green Version]
  14. Liu, W.; Yuan, K.; Ye, D. On α-divergence based nonnegative matrix factorization for clustering cancer gene expression data. Artif. Intell. Med. 2008, 44, 1–5. [Google Scholar] [CrossRef]
  15. Hellinger, E. Neue Begründung der Theorie Quadratischer Formen von Unendlichvielen Veränderlichen. J. Für Die Reine Und Angew. Math. 1909, 1909, 210–271. [Google Scholar] [CrossRef]
  16. Ali, S.M.; Silvey, S.D. A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Ser. B 1966, 28, 131–142. [Google Scholar] [CrossRef]
  17. Csiszár, I. Information-type measures of difference of probability distributions and indirect observation. Stud. Sci. Math. Hung. 1967, 2, 229–318. [Google Scholar]
  18. Qiao, Y.; Minematsu, N. A study on invariance of f-divergence and its application to speech recognition. IEEE Trans. Signal Process. 2010, 58, 3884–3890. [Google Scholar] [CrossRef]
  19. Li, W. Transport information Bregman divergences. Inf. Geom. 2021, 4, 435–470. [Google Scholar] [CrossRef]
  20. Li, W. Transport information Hessian distances. In Proceedings of the International Conference on Geometric Science of Information (GSI), Paris, France, 21–23 July 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 808–817. [Google Scholar]
  21. Li, W. Transport information geometry: Riemannian calculus on probability simplex. Inf. Geom. 2022, 5, 161–207. [Google Scholar] [CrossRef]
  22. Amari, S.i. Integration of stochastic models by minimizing α-divergence. Neural Comput. 2007, 19, 2780–2796. [Google Scholar] [CrossRef] [PubMed]
  23. Cichocki, A.; Lee, H.; Kim, Y.D.; Choi, S. Non-negative matrix factorization with α-divergence. Pattern Recognit. Lett. 2008, 29, 1433–1440. [Google Scholar] [CrossRef]
  24. Wada, J.; Kamahara, Y. Studying malapportionment using α-divergence. Math. Soc. Sci. 2018, 93, 77–89. [Google Scholar] [CrossRef]
  25. Maruyama, Y.; Matsuda, T.; Ohnishi, T. Harmonic Bayesian prediction under α-divergence. IEEE Trans. Inf. Theory 2019, 65, 5352–5366. [Google Scholar] [CrossRef]
  26. Iqbal, A.; Seghouane, A.K. An α-Divergence-Based Approach for Robust Dictionary Learning. IEEE Trans. Image Process. 2019, 28, 5729–5739. [Google Scholar] [CrossRef]
  27. Ahrari, V.; Habibirad, A.; Baratpour, S. Exponentiality test based on alpha-divergence and gamma-divergence. Commun. Stat.-Simul. Comput. 2019, 48, 1138–1152. [Google Scholar] [CrossRef]
  28. Sarmiento, A.; Fondón, I.; Durán-Díaz, I.; Cruces, S. Centroid-based clustering with αβ-divergences. Entropy 2019, 21, 196. [Google Scholar] [CrossRef] [Green Version]
  29. Niculescu, C.P.; Persson, L.E. Convex Functions and Their Applications: A Contemporary Approach, 1st ed.; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2006. [Google Scholar]
  30. Kolmogorov, A.N. Sur la notion de moyenne. Acad. Naz. Lincei Mem. Cl. Sci. His. Mat. Natur. Sez. 1930, 12, 388–391. [Google Scholar]
  31. Gibbs, A.L.; Su, F.E. On choosing and bounding probability metrics. Int. Stat. Rev. 2002, 70, 419–435. [Google Scholar] [CrossRef]
  32. Rachev, S.T.; Klebanov, L.B.; Stoyanov, S.V.; Fabozzi, F. The Methods of Distances in the Theory of Probability and Statistics; Springer: Berlin/Heidelberg, Germany, 2013; Volume 10. [Google Scholar]
  33. Vemuri, B.C.; Liu, M.; Amari, S.I.; Nielsen, F. Total Bregman divergence and its applications to DTI analysis. IEEE Trans. Med Imaging 2010, 30, 475–483. [Google Scholar] [CrossRef] [Green Version]
  34. Arthur, D.; Vassilvitskii, S. k-means++: The advantages of careful seeding. In Proceedings of the SODA ’07: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, USA, 7–9 January 2007; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2007; pp. 1027–1035. [Google Scholar]
  35. Bullen, P.S.; Mitrinovic, D.S.; Vasic, M. Means and Their Inequalities; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013; Volume 31. [Google Scholar]
  36. Toader, G.; Costin, I. Means in Mathematical Analysis: Bivariate Means; Academic Press: Cambridge, MA, USA, 2017. [Google Scholar]
  37. Cauchy, A.L.B. Cours d’analyse de l’École Royale Polytechnique; Debure frères: Paris, France, 1821. [Google Scholar]
  38. Chisini, O. Sul concetto di media. Period. Di Mat. 1929, 4, 106–116. [Google Scholar]
  39. Bhattacharyya, A. On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc. 1943, 35, 99–109. [Google Scholar]
  40. Nielsen, F.; Boltz, S. The Burbea-Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory 2011, 57, 5455–5466. [Google Scholar] [CrossRef] [Green Version]
  41. Nielsen, F. Generalized Bhattacharyya and Chernoff upper bounds on Bayes error using quasi-arithmetic means. Pattern Recognit. Lett. 2014, 42, 25–34. [Google Scholar] [CrossRef] [Green Version]
  42. Nagumo, M. Über eine klasse der mittelwerte. Jpn. J. Math. Trans. Abstr. 1930, 7, 71–79. [Google Scholar] [CrossRef] [Green Version]
  43. De Finetti, B. Sul concetto di media. Ist. Ital. Degli Attuari 1931, 3, 369–396. [Google Scholar]
  44. Hardy, G.; Littlewood, J.; Pólya, G. Inequalities; Cambridge Mathematical Library, Cambridge University Press: Cambridge, UK, 1988. [Google Scholar]
  45. Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 20 June–30 July 1960; The Regents of the University of California: Oakland, CA, USA, 1961; Volume 1. Contributions to the Theory of Statistics. [Google Scholar]
  46. Hölder, O.L. Über einen Mittelwertssatz. Nachr. Akad. Wiss. Göttingen Math.-Phys. Kl. 1889, 44, 38–47. [Google Scholar]
  47. Bhatia, R. The Riemannian mean of positive matrices. In Matrix Information Geometry; Springer: Berlin/Heidelberg, Germany, 2013; pp. 35–51. [Google Scholar]
  48. Akaoka, Y.; Okamura, K.; Otobe, Y. Bahadur efficiency of the maximum likelihood estimator and one-step estimator for quasi-arithmetic means of the Cauchy distribution. Ann. Inst. Stat. Math. 2022, 74, 1–29. [Google Scholar] [CrossRef]
  49. Kim, S. The quasi-arithmetic means and Cartan barycenters of compactly supported measures. Forum Math. Gruyter 2018, 30, 753–765. [Google Scholar] [CrossRef]
  50. Carlson, B.C. The logarithmic mean. Am. Math. Mon. 1972, 79, 615–618. [Google Scholar] [CrossRef]
  51. Stolarsky, K.B. Generalizations of the logarithmic mean. Math. Mag. 1975, 48, 87–92. [Google Scholar] [CrossRef]
  52. Jarczyk, J. When Lagrangean and quasi-arithmetic means coincide. J. Inequal. Pure Appl. Math. 2007, 8, 71. [Google Scholar]
  53. Páles, Z.; Zakaria, A. On the Equality of Bajraktarević Means to Quasi-Arithmetic Means. Results Math. 2020, 75, 19. [Google Scholar] [CrossRef] [Green Version]
  54. Maksa, G.; Páles, Z. Remarks on the comparison of weighted quasi-arithmetic means. Colloq. Math. 2010, 120, 77–84. [Google Scholar] [CrossRef]
  55. Zhang, J. Nonparametric information geometry: From divergence function to referential-representational biduality on statistical manifolds. Entropy 2013, 15, 5384–5418. [Google Scholar] [CrossRef] [Green Version]
  56. Nielsen, F.; Nock, R. Generalizing Skew Jensen Divergences and Bregman Divergences with Comparative Convexity. IEEE Signal Process. Lett. 2017, 24, 1123–1127. [Google Scholar] [CrossRef]
  57. Kuczma, M. An Introduction to the Theory of Functional Equations and Inequalities: Cauchy’s Equation and Jensen’s Inequality; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2009. [Google Scholar]
  58. Nock, R.; Nielsen, F.; Amari, S.i. On conformal divergences and their population minimizers. IEEE Trans. Inf. Theory 2015, 62, 527–538. [Google Scholar] [CrossRef] [Green Version]
  59. Ohara, A. Conformal flattening for deformed information geometries on the probability simplex. Entropy 2018, 20, 186. [Google Scholar] [CrossRef] [Green Version]
  60. Ohara, A. Conformal Flattening on the Probability Simplex and Its Applications to Voronoi Partitions and Centroids. In Geometric Structures of Information; Springer: Berlin/Heidelberg, Germany, 2019; pp. 51–68. [Google Scholar]
  61. Bregman, L.M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 1967, 7, 200–217. [Google Scholar] [CrossRef]
  62. Zhang, J. On monotone embedding in information geometry. Entropy 2015, 17, 4485–4499. [Google Scholar] [CrossRef] [Green Version]
  63. Nielsen, F.; Nock, R. The dual Voronoi diagrams with respect to representational Bregman divergences. In Proceedings of the Sixth International Symposium on Voronoi Diagrams (ISVD), Copenhagen, Denmark, 23–26 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 71–78. [Google Scholar]
  64. Itakura, F.; Saito, S. Analysis synthesis telephony based on the maximum likelihood method. In Proceedings of the 6th International Congress on Acoustics, Tokyo, Japan, 21–28 August 1968; pp. 280–292. [Google Scholar]
  65. Okamoto, I.; Amari, S.I.; Takeuchi, K. Asymptotic theory of sequential estimation: Differential geometrical approach. Ann. Stat. 1991, 19, 961–981. [Google Scholar] [CrossRef]
  66. Ohara, A.; Matsuzoe, H.; Amari, S.I. Conformal geometry of escort probability and its applications. Mod. Phys. Lett. B 2012, 26, 1250063. [Google Scholar] [CrossRef]
  67. Kurose, T. On the divergences of 1-conformally flat statistical manifolds. Tohoku Math. J. Second Ser. 1994, 46, 427–433. [Google Scholar] [CrossRef]
  68. Pal, S.; Wong, T.K.L. The geometry of relative arbitrage. Math. Financ. Econ. 2016, 10, 263–293. [Google Scholar] [CrossRef]
  69. Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137. [Google Scholar] [CrossRef] [Green Version]
  70. Mahajan, M.; Nimbhorkar, P.; Varadarajan, K. The planar k-means problem is NP-hard. Theor. Comput. Sci. 2012, 442, 13–21. [Google Scholar] [CrossRef] [Green Version]
  71. Wang, H.; Song, M. Ckmeans.1d.dp: Optimal k-means clustering in one dimension by dynamic programming. R J. 2011, 3, 29. [Google Scholar] [CrossRef] [Green Version]
  72. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J.; Lafferty, J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749. [Google Scholar]
  73. Nielsen, F.; Nock, R. Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory 2009, 55, 2882–2904. [Google Scholar] [CrossRef] [Green Version]
  74. Ronchetti, E.M.; Huber, P.J. Robust Statistics; John Wiley & Sons: Hoboken, NJ, USA, 2009. [Google Scholar]
  75. Nielsen, F.; Nock, R. Total Jensen divergences: Definition, properties and clustering. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, 19–24 April 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 2016–2020. [Google Scholar]
  76. Eguchi, S.; Komori, O. Minimum Divergence Methods in Statistical Machine Learning; Springer: Berlin/Heidelberg, Germany, 2022. [Google Scholar]
  77. Kailath, T. The divergence and Bhattacharyya distance measures in signal selection. IEEE Trans. Commun. Technol. 1967, 15, 52–60. [Google Scholar] [CrossRef]
Figure 1. Interplay of divergences and their information-geometric structures: Bregman divergences are canonical divergences of dually flat structures, and the $\alpha$-logarithmic divergences are canonical divergences of 1-conformally flat statistical manifolds. When $\alpha \to 0$, the logarithmic divergence $L_{F,\alpha}$ tends to the Bregman divergence $B_F$.
Figure 2. Illustration of the robustness property of the $r$-power mean $M_{\mathrm{pow}}^r(p, p')$ when $r < 0$ for two points: a prescribed point $p = (1,1)$ and an outlier point $p' = (t,t)$. When $t \to +\infty$, the $r$-power mean of $p$ and $p'$ for $r < 0$ (e.g., the coordinatewise harmonic mean when $r = -1$) remains contained inside the box anchored at $p$ of side length $\left(\tfrac{1}{2}\right)^{1/r}$. The $r$-power mean can be interpreted as a left-sided Bregman centroid for the generator $F'(x) = x^r$, i.e., $F(x) = \frac{1}{r+1}\, x^{r+1}$ when $r \neq -1$ and $F(x) = \log x$ when $r = -1$.
Table 1. Expressions of the terms $E_r$ for the family of power means $P_r$, $r \in \mathbb{R}$.
| Power Mean | $E_r(p,q)$ |
|---|---|
| $P_r$ ($r \in \mathbb{R} \setminus \{0\}$) | $\frac{q^r - p^r}{r\, p^{r-1}}$ |
| $Q$ ($r = 2$) | $\frac{q^2 - p^2}{2p}$ |
| $A$ ($r = 1$) | $q - p$ |
| $G$ ($r = 0$) | $p \log \frac{q}{p}$ |
| $H$ ($r = -1$) | $-p^2\left(\frac{1}{q} - \frac{1}{p}\right) = p - \frac{p^2}{q}$ |
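The special cases listed in Table 1 can be recovered from the general expression $E_r(p,q) = \frac{q^r - p^r}{r\,p^{r-1}}$ in the first row; a short symbolic sketch in Python (sympy) illustrating the substitutions and the $r \to 0$ limit:

```python
import sympy as sp

p, q, r = sp.symbols('p q r', positive=True)
E = (q**r - p**r) / (r * p**(r - 1))      # general term E_r(p, q) from the first row of Table 1

print(sp.simplify(E.subs(r, 2)))          # quadratic mean Q: equals (q^2 - p^2)/(2p)
print(sp.simplify(E.subs(r, 1)))          # arithmetic mean A: equals q - p
print(sp.simplify(sp.limit(E, r, 0)))     # geometric mean G: equals p*log(q/p)
print(sp.simplify(E.subs(r, -1)))         # harmonic mean H: equals p - p^2/q
```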
