Article

Generalizing the Alpha-Divergences and the Oriented Kullback–Leibler Divergences with Quasi-Arithmetic Means

Sony Computer Science Laboratories, Tokyo 141-0022, Japan
Algorithms 2022, 15(11), 435; https://doi.org/10.3390/a15110435
Submission received: 13 October 2022 / Revised: 10 November 2022 / Accepted: 16 November 2022 / Published: 17 November 2022
(This article belongs to the Special Issue Machine Learning for Pattern Recognition)

Abstract:
The family of α -divergences including the oriented forward and reverse Kullback–Leibler divergences is often used in signal processing, pattern recognition, and machine learning, among others. Choosing a suitable α -divergence can either be done beforehand according to some prior knowledge of the application domains or directly learned from data sets. In this work, we generalize the α -divergences using a pair of strictly comparable weighted means. Our generalization allows us to obtain in the limit case α → 1 the 1-divergence, which provides a generalization of the forward Kullback–Leibler divergence, and in the limit case α → 0 , the 0-divergence, which corresponds to a generalization of the reverse Kullback–Leibler divergence. We then analyze the condition for a pair of weighted quasi-arithmetic means to be strictly comparable and describe the family of quasi-arithmetic α -divergences including its subfamily of power homogeneous α -divergences. In particular, we study the generalized quasi-arithmetic 1-divergences and 0-divergences and show that these counterpart generalizations of the oriented Kullback–Leibler divergences can be rewritten as equivalent conformal Bregman divergences using strictly monotone embeddings. Finally, we discuss the applications of these novel divergences to k-means clustering by studying the robustness property of the centroids.

1. Introduction

1.1. Statistical Divergences and α -Divergences

Consider a measurable space [1] ( X , F ) where F denotes a finite σ -algebra and X the sample space, and let μ denote a positive measure on ( X , F ) , usually chosen as the Lebesgue measure or the counting measure. The notion of statistical dissimilarities [2,3,4] D ( P : Q ) between two distributions P and Q is at the core of many algorithms in signal processing, pattern recognition, information fusion, data analysis, and machine learning, among others. A dissimilarity may be oriented, i.e., asymmetric: D ( P : Q ) ≠ D ( Q : P ) , where the colon mark “:” between the arguments of the dissimilarity indicates this potential asymmetry, by analogy with the asymmetric division operation. When the arbitrary probability measures P and Q are dominated by a measure μ (e.g., one can always choose μ = (P + Q)/2 ), we consider their Radon–Nikodym (RN) densities p_μ = dP/dμ and q_μ = dQ/dμ with respect to μ , and define D ( P : Q ) as D_μ ( p_μ : q_μ ) . A good dissimilarity measure shall not depend on the chosen dominating measure so that we can write D ( P : Q ) = D_μ ( p_μ : q_μ ) [5]. When those statistical dissimilarities are smooth, they are called divergences [6] in information geometry, as they induce a dualistic geometric structure [7].
The most renowned statistical divergence rooted in information theory [8] is the Kullback–Leibler divergence (KLD, also called relative entropy):
$$\mathrm{KL}_\mu(p_\mu:q_\mu) := \int_{\mathcal{X}} p_\mu(x)\,\log\frac{p_\mu(x)}{q_\mu(x)}\,\mathrm{d}\mu(x).$$
Since the KLD is independent of the reference measure μ , i.e., KL_μ ( p_μ : q_μ ) = KL_ν ( p_ν : q_ν ) for p_μ = dP/dμ and q_μ = dQ/dμ , and p_ν = dP/dν and q_ν = dQ/dν the RN derivatives with respect to another positive measure ν , we write concisely in the remainder:
$$\mathrm{KL}(p:q) = \int p\,\log\frac{p}{q}\,\mathrm{d}\mu,$$
instead of KL μ ( p μ : q μ ) .
The KLD belongs to a parametric family of α -divergences [9], I_α ( p : q ) for α ∈ ℝ:
$$I_\alpha(p:q) := \begin{cases} \frac{1}{\alpha(1-\alpha)}\left(1 - \int p^\alpha q^{1-\alpha}\,\mathrm{d}\mu\right), & \alpha \in \mathbb{R}\setminus\{0,1\}\\ I_1(p:q) = \mathrm{KL}(p:q), & \alpha=1\\ I_0(p:q) = \mathrm{KL}(q:p), & \alpha=0 \end{cases}$$
The α -divergences extended to positive densities [10] (not necessarily normalized densities) play a central role in information geometry [6]:
$$I_\alpha^+(p:q) := \begin{cases} \frac{1}{\alpha(1-\alpha)}\int \left(\alpha p + (1-\alpha) q - p^\alpha q^{1-\alpha}\right)\mathrm{d}\mu, & \alpha \in \mathbb{R}\setminus\{0,1\}\\ I_1^+(p:q) = \mathrm{KL}^+(p:q), & \alpha=1\\ I_0^+(p:q) = \mathrm{KL}^+(q:p), & \alpha=0, \end{cases}$$
where KL + denotes the Kullback–Leibler divergence extended to positive measures:
$$\mathrm{KL}^+(p:q) := \int \left(p\,\log\frac{p}{q} + q - p\right)\mathrm{d}\mu.$$
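To make these definitions concrete, here is a minimal numerical sketch (not taken from the article): it evaluates the extended α -divergence I_α^+ and the extended Kullback–Leibler divergence KL^+ on finite positive arrays, replacing the integral over μ by a plain sum (counting measure), and illustrates that I_α^+ approaches the two oriented KLDs as α tends to 1 and to 0.

```python
# Minimal numerical sketch (illustration only): the extended alpha-divergence I_alpha^+
# and the extended KL divergence KL^+ on positive arrays, with the integral over mu
# replaced by a sum (counting measure).
import numpy as np

def extended_alpha_divergence(p, q, alpha):
    """I_alpha^+(p:q) for positive arrays p, q and alpha not in {0, 1}."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    integrand = alpha * p + (1 - alpha) * q - p**alpha * q**(1 - alpha)
    return integrand.sum() / (alpha * (1 - alpha))

def extended_kl(p, q):
    """KL^+(p:q) = sum of p*log(p/q) + q - p over the support."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q) + q - p)

if __name__ == "__main__":
    p = np.array([0.2, 0.5, 0.3])
    q = np.array([0.3, 0.4, 0.3])
    # As alpha -> 1, I_alpha^+ tends to KL^+(p:q); as alpha -> 0, to KL^+(q:p).
    print(extended_alpha_divergence(p, q, 0.999), extended_kl(p, q))
    print(extended_alpha_divergence(p, q, 0.001), extended_kl(q, p))
```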
The α -divergences are asymmetric for α ≠ 1/2 (i.e., I_α ( p : q ) ≠ I_α ( q : p ) for α ≠ 1/2 ) but exhibit the following reference duality [11]:
$$I_\alpha(q:p) = I_{1-\alpha}(p:q) =: I_\alpha^*(p:q),$$
where we denoted by D^* ( p : q ) := D ( q : p ) the reverse divergence for an arbitrary divergence D ( p : q ) (e.g., I_α^* ( p : q ) := I_α ( q : p ) = I_{1−α} ( p : q ) ). The α -divergences have been extensively used in many applications [12], and the parameter α may not be necessarily fixed beforehand but can also be learned from data sets in applications [13,14]. When α = 1/2 , the α -divergence is symmetric and called the squared Hellinger divergence [15]:
$$I_{\frac{1}{2}}(p:q) := 4\left(1 - \int \sqrt{p\,q}\,\mathrm{d}\mu\right) = 2\int \left(\sqrt{p} - \sqrt{q}\right)^2 \mathrm{d}\mu.$$
The α -divergences belong to the family of Ali–Silvey–Csiszár f-divergences [16,17], which are defined for a convex function f ( u ) satisfying f ( 1 ) = 0 and strictly convex at 1:
$$I_f(p:q) := \int p\, f\!\left(\frac{q}{p}\right)\mathrm{d}\mu.$$
We have
$$I_\alpha(p:q) = I_{f_\alpha}(p:q),$$
with the following class of f-generators:
$$f_\alpha(u) := \begin{cases} \frac{1}{\alpha(1-\alpha)}\left(\alpha + (1-\alpha)u - u^{1-\alpha}\right), & \alpha \in \mathbb{R}\setminus\{0,1\}\\ u - 1 - \log u, & \alpha=1\\ 1 - u + u\log u, & \alpha=0 \end{cases}$$
In information geometry, α -divergences and more generally f-divergences are called invariant divergences [6], since they are provably the only statistical divergences which are invariant under invertible smooth transformations of the sample space. That is, let Y = m ( X ) be a smooth invertible transformation of the sample space, and denote by p_Y ( y ) and p′_Y ( y ) the densities with respect to y corresponding to p_X ( x ) and p′_X ( x ) , respectively. Then, we have I_f ( p_X : p′_X ) = I_f ( p_Y : p′_Y ) [18]. The dualistic information-geometric structures induced by these invariant f-divergences between densities of a same parametric family { p_θ ( x ) : θ ∈ Θ } of statistical models yield the Fisher information metric and the dual ± α -connections for α = 3 + 2 f‴(1)/f″(1) , see [6] for details. It is customary to rewrite the α -divergences in information geometry using the rescaled parameter α_A = 1 − 2α (i.e., α = (1 − α_A)/2 ). Thus, the extended α_A -divergence in information geometry is defined as follows:
$$\hat{I}_{\alpha_A}^+(p:q) = \begin{cases} \frac{4}{1-\alpha_A^2}\int\left(\frac{1-\alpha_A}{2}\,p + \frac{1+\alpha_A}{2}\,q - p^{\frac{1-\alpha_A}{2}} q^{\frac{1+\alpha_A}{2}}\right)\mathrm{d}\mu, & \alpha_A \in \mathbb{R}\setminus\{-1,1\}\\ \hat{I}_{-1}(p:q) = \mathrm{KL}^+(p:q), & \alpha_A=-1\\ \hat{I}_{1}(p:q) = \mathrm{KL}^+(q:p), & \alpha_A=1, \end{cases}$$
and the reference duality is expressed by Î_{α_A}^+ ( q : p ) = Î_{−α_A}^+ ( p : q ) .
A statistical divergence D ( · : · ) when evaluated on densities belonging to a given parametric family P = { p θ : θ Θ } of densities is equivalent to a corresponding contrast function D P [7]:
D P ( θ 1 : θ 2 ) : = D ( p θ 1 : p θ 2 ) .
Remark 1.
Although quite confusing, those contrast functions [7] have also been called divergences in the literature [6]. Any smooth parameter divergence D ( θ 1 : θ 2 ) (contrast function [7]) induces a dualistic structure in information geometry [6]. For example, the KLD on the family Δ of probability mass functions defined on a finite alphabet X is equivalent to a Bregman divergence, and thus induces a dually flat space [6]. More generally, the α A -divergences on the probability simplex Δ induce the α A -geometry in information geometry [6].
We refer the reader to [3] for a richly annotated bibliography of many common statistical divergences investigated in signal processing and statistics. Building and studying novel statistical/parameter divergences from first principles is an active research area. For example, Li [19,20] recently introduced some new divergence functionals based on the framework of transport information geometry [21], which considers information entropy functionals in Wasserstein spaces. Li defined (i) the transport information Hessian distances [20] between univariate densities supported on a compact, which are symmetric distances satisfying the triangle inequality, and obtained the counterpart of the Hellinger distance on the L 2 -Wasserstein space by choosing the Shannon information entropy, and (ii) asymmetric transport Bregman divergences (including the transport Kullback–Leibler divergence) between densities defined on a multivariate compact smooth support in [19].
The α -divergences are widely used in information sciences, see [22,23,24,25,26,27] just to cite a few applications. The singly parametric α -divergences have also been generalized to biparametric families of divergences such as the ( α , β ) -divergences [6] or the α β -divergences [28].
In this work, based on the observation that the term α p + ( 1 − α ) q − p^α q^{1−α} in the extended I_α^+ ( p : q ) divergence for α ∈ ( 0 , 1 ) of Equation (4) is a difference between a weighted arithmetic mean A_{1−α} ( p , q ) := α p + ( 1 − α ) q and a weighted geometric mean G_{1−α} ( p , q ) := p^α q^{1−α} , we investigate a generalization of α -divergences with respect to a generic pair of strictly comparable weighted means [29]. In particular, we consider the class of quasi-arithmetic weighted means [30], analyze the condition for two quasi-arithmetic means to be strictly comparable, and report their induced α -divergences with limit KL type divergences when α → 1 and α → 0 .

1.2. Divergences and Decomposable Divergences

A statistical divergence D ( p : q ) shall satisfy the following two basic axioms:
D1 (Non-negativity). D ( p : q ) ≥ 0 for all densities p and q,
D2 (Identity of indiscernibles). D ( p : q ) = 0 if and only if p = q μ -almost everywhere.
These axioms are a subset of the metric axioms, since we do not consider the symmetry axiom nor the triangle inequality axiom of metric distances. See [31,32] for some common examples of probability metrics (e.g., total variation distance or Wasserstein metrics).
A divergence D ( p : q ) is said decomposable [6] when it can be written as a definite integral of a scalar divergence d ( · , · ) :
$$D(p:q) = \int d(p(x):q(x))\,\mathrm{d}\mu(x),$$
or D ( p : q ) = ∫ d ( p : q ) dμ for short, where d ( a : b ) is a scalar divergence between a > 0 and b > 0 (hence a one-dimensional parameter divergence).
The α -divergences are decomposable divergences since we have
$$I_\alpha^+(p:q) = \int i_\alpha(p(x):q(x))\,\mathrm{d}\mu$$
with the following scalar α -divergence:
$$i_\alpha(a:b) := \begin{cases} \frac{1}{\alpha(1-\alpha)}\left(\alpha a + (1-\alpha) b - a^\alpha b^{1-\alpha}\right), & \alpha \in \mathbb{R}\setminus\{0,1\}\\ i_1(a:b) = a\log\frac{a}{b} + b - a, & \alpha=1\\ i_0(a:b) = i_1(b:a), & \alpha=0 \end{cases}$$

1.3. Contributions and Paper Outline

The outline of the paper and its main contributions are summarized as follows:
We first define for two families of strictly comparable means (Definition 1) their generic induced α -divergences in Section 2 (Definition 2). Then, Section 2.2 reports a closed-form formula (Theorem 3) for the quasi-arithmetic α -divergences induced by two strictly comparable quasi-arithmetic means with monotonically increasing generators f and g such that f ∘ g^{−1} is strictly convex and differentiable (Theorem 1). In Section 2.3, we study the divergences I_0^+ and I_1^+ obtained in the limit cases when α → 0 and α → 1 , respectively (Theorem 2). We obtain generalized counterparts of the Kullback–Leibler divergence when α → 1 and generalized counterparts of the reverse Kullback–Leibler divergence when α → 0 . Moreover, these generalized KLDs can be rewritten as generalized cross-entropies minus entropies. In Section 2.4, we show how to express these generalized I_1 -divergences and I_0 -divergences as conformal Bregman representational divergences, and briefly explain their induced conformally flat statistical manifolds (Theorem 4). Section 3 introduces the subfamily of bipower homogeneous α -divergences (Definition 2) which belong to the family of Ali–Silvey–Csiszár f-divergences [16,17]. In Section 4, we consider k-means clustering [33] and k-means++ seeding [34] for the generic class of extended α -divergences: we first study the robustness of quasi-arithmetic means in Section 4.1 and then the robustness of the new class of generalized Kullback–Leibler centroids in Section 4.2. Finally, Section 5 summarizes the results obtained in this work and discusses perspectives for future research.

2. The α -Divergences Induced by a Pair of Strictly Comparable Weighted Means

2.1. The ( M , N ) α -Divergences

The point of departure for generalizing the α -divergences is to rewrite Equation (4) for α ∈ ℝ ∖ { 0 , 1 } as
$$I_\alpha^+(p:q) = \frac{1}{\alpha(1-\alpha)}\int\left(A_{1-\alpha}(p,q) - G_{1-\alpha}(p,q)\right)\mathrm{d}\mu,$$
where A_λ and G_λ for λ ∈ ( 0 , 1 ) stand for the weighted arithmetic mean and the weighted geometric mean, respectively:
$$A_\lambda(x,y) = (1-\lambda)x + \lambda y, \qquad G_\lambda(x,y) = x^{1-\lambda}\, y^{\lambda}.$$
For a weighted mean M_λ ( a , b ) , we choose the (geometric) convention M_0 ( x , y ) = x and M_1 ( x , y ) = y so that { M_λ ( x , y ) }_{λ ∈ [ 0 , 1 ]} smoothly interpolates between x ( λ = 0 ) and y ( λ = 1 ). For the converse convention, we simply swap the role of the weights, i.e., use M_{1−λ} ( a , b ) in place of M_λ ( a , b ) , and get the conventional definition of I_α^+ ( p : q ) = (1/(α(1−α))) ∫ ( A_α ( p , q ) − G_α ( p , q ) ) dμ .
In general, a mean M ( x , y ) aggregates two values x and y of an interval I R to produce an intermediate quantity which satisfies the innerness property [35,36]:
$$\min\{x,y\} \leq M(x,y) \leq \max\{x,y\}, \qquad \forall x,y \in I.$$
This in-between property of means (Equation (17)) was postulated by Cauchy [37] in 1821. A mean is said strict if the inequalities of Equation (17) are strict whenever x y . A mean M is said reflexive iff M ( x , x ) = x for all x I . The reflexive property of means was postulated by Chisini [38] in 1929.
In the remainder, we consider I = ( 0 , ) . By using the unique dyadic representation of any real λ ( 0 , 1 ) (i.e., λ = i = 1 d i 2 i with d i { 0 , 1 } the binary digit expansion of λ ), one can build a weighted mean M λ from any given mean M; see [29] for such a construction.
In the remainder, we drop the “+” notation to emphasize that the divergences are defined between positive measures. By analogy to the α -divergences, let us define the (decomposable) ( M , N ) α -divergences between two positive densities p and q for a pair of weighted means M 1 α and N 1 α for α ( 0 , 1 ) as
$$I_\alpha^{M,N}(p:q) := \frac{1}{\alpha(1-\alpha)}\int\left(M_{1-\alpha}(p,q) - N_{1-\alpha}(p,q)\right)\mathrm{d}\mu.$$
The ordinary α -divergences for α ( 0 , 1 ) are recovered as the ( A , G )   α -divergences:
I α A , G ( p : q ) = 1 α ( 1 α ) A 1 α ( p , q ) G 1 α ( p , q ) d μ ,
= I 1 α ( p : q ) = I α ( q : p ) = I α * ( p : q ) .
In order to define generalized α -divergences satisfying axioms D1 and D2 of proper divergences, we need to characterize the class of acceptable means. We give a definition strengthening the notion of comparable means in [29]:
Definition 1
(Strictly comparable weighted means). A pair ( M , N ) of means is said to be strictly comparable whenever M_λ ( x , y ) ≥ N_λ ( x , y ) for all x , y ∈ ( 0 , ∞ ) with equality if and only if x = y , and for all λ ∈ ( 0 , 1 ) .
Example 1.
For example, the inequality of the arithmetic and geometric means states that A ( x , y ) ≥ G ( x , y ) , which implies that the means A and G are comparable, denoted by A ≥ G . Furthermore, the weighted arithmetic and geometric means are distinct whenever x ≠ y . Indeed, consider the equation ( 1 − α ) x + α y = x^{1−α} y^{α} for x , y > 0 and x ≠ y . By taking the logarithm on both sides, we get
$$\log\left((1-\alpha)x + \alpha y\right) = (1-\alpha)\log x + \alpha \log y.$$
Since the logarithm is a strictly concave function, the only solution is x = y . Thus, ( A , G ) is a pair of strictly comparable weighted means.
For a weighted mean M, we also consider the mean obtained by reversing the weights, i.e., using M_{1−λ} ( x , y ) in place of M_λ ( x , y ) . We are ready to state the definition of generalized α -divergences:
Definition 2
( ( M , N ) α -divergences). The ( M , N ) α-divergences I α M , N ( p : q ) between two positive densities p and q for α ( 0 , 1 ) is defined for a pair of strictly comparable weighted means M α and N α with M α N α by:
I α M , N ( p : q ) : = 1 α ( 1 α ) M 1 α ( p , q ) N 1 α ( p , q ) d μ , α ( 0 , 1 )
= 1 α ( 1 α ) M α ( p , q ) N α ( p , q ) d μ , α ( 0 , 1 ) .
Using α = 1 α A 2 , we can rewrite this α -divergence as
I ^ α A M , N ( p : q ) : = 4 1 α A 2 M 1 + α A 2 ( p , q ) N 1 + α A 2 ( p , q ) d μ , α A ( 1 , 1 )
= 4 1 α A 2 M 1 α A 2 ( p , q ) N 1 α A 2 ( p , q ) d μ , α A ( 1 , 1 ) .
It is important to check the conditions on the weighted means M_α and N_α which ensure the law of the indiscernibles of a divergence D ( p : q ) , namely, D ( p : q ) = 0 iff p = q almost μ -everywhere. This condition rewrites as ∫ M_α ( p , q ) dμ = ∫ N_α ( p , q ) dμ if and only if p ( x ) = q ( x ) μ -almost everywhere. A sufficient condition is to ensure that M_α ( x , y ) > N_α ( x , y ) for x ≠ y . In particular, this condition holds if the weighted means M_α and N_α are strictly comparable weighted means.
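The following sketch illustrates this generic construction in the discrete case, under the convention M_0 ( x , y ) = x and M_1 ( x , y ) = y stated above; the function names and the summation in place of the integral are illustrative assumptions, and the two weighted means passed in must be strictly comparable for the output to be a proper divergence.

```python
# A generic sketch (positive arrays, counting measure) of the (M, N) alpha-divergence
# built from two user-supplied weighted means.
import numpy as np

def mn_alpha_divergence(p, q, alpha, weighted_M, weighted_N):
    """1/(alpha(1-alpha)) * sum(M_{1-alpha}(p,q) - N_{1-alpha}(p,q)).

    weighted_M(lam, x, y) and weighted_N(lam, x, y) should be strictly comparable
    weighted means (M >= N with equality iff x == y) so that the value is a
    proper divergence for alpha in (0, 1)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    lam = 1.0 - alpha
    gap = weighted_M(lam, p, q) - weighted_N(lam, p, q)
    return gap.sum() / (alpha * (1.0 - alpha))

# Weighted arithmetic and geometric means with the convention M_0(x,y)=x, M_1(x,y)=y.
arithmetic = lambda lam, x, y: (1 - lam) * x + lam * y
geometric  = lambda lam, x, y: x**(1 - lam) * y**lam

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.4, 0.3])
print(mn_alpha_divergence(p, q, 0.3, arithmetic, geometric))   # an (A, G) alpha-divergence
print(mn_alpha_divergence(p, p, 0.3, arithmetic, geometric))   # 0: identity of indiscernibles
```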
Instead of taking the difference M 1 α ( x : y ) N 1 α ( x : y ) between two weighted means, we may also measure the gap logarithmically, and thus define the family of log M N α -divergences as follows:
Definition 3
( log M N α -divergence). The log M N α-divergences L α M , N ( p : q ) between two positive densities p and q for α ( 0 , 1 ) is defined for a pair of strictly comparable weighted means M α and N α with M α N α by:
L α M , N ( p : q ) : = log M 1 α ( p , q ) N 1 α ( p , q ) d μ ,
= log N 1 α ( p , q ) M 1 α ( p , q ) d μ .
Note that this definition is different from the skewed Bhattacharyya type distance [39,40], which rather measures
B α M , N ( p : q ) : = log M 1 α ( p , q ) d μ N 1 α ( p , q ) d μ ,
= log N 1 α ( p , q ) d μ M 1 α ( p , q ) d μ .
The ordinary α -skewed Bhattacharyya distance [39] is recovered when N α = G α (weighted geometric mean) and M α = A α the arithmetic mean since A 1 α ( p , q ) d μ = 1 . The Bhattacharyya type divergences B α M , N were introduced in [41] in order to upper bound the probability of error in Bayesian hypothesis testing.
A weighted mean M α is said symmetric if and only if M α ( x , y ) = M 1 α ( y , x ) . When both the weighted means M and N are symmetric, we have the following reference duality [11]:
I α M , N ( p : q ) = I 1 α M , N ( q : p ) .
We consider symmetric weighted means in the remainder.
In the limit cases of α 0 or α 1 , we define the 0-divergence I 0 M , N ( p : q ) and the 1-divergence I 1 M , N ( p : q ) , respectively, by
I 0 M , N ( p : q ) = lim α 0 I α M , N ( p : q ) ,
I 1 M , N ( p : q ) = lim α 1 I α M , N ( p : q ) = I 0 M , N ( q : p ) ,
provided that those limits exist.
Notice that the ordinary α -divergences are defined for any α R but our generic quasi-arithmetic α -divergences are defined in general on ( 0 , 1 ) . However, when the weighted means M α and N α admit weighted extrapolations (e.g., the arithmetic mean A α or the geometric mean G α ) the quasi-arithmetic α -divergences can be extended to R \ { 0 , 1 } . Furthermore, when the limits of quasi-arithmetic α -divergences exist for α { 0 , 1 } , the quasi-arithmetic α -divergences may be defined on the full range of α R . To demonstrate the restricted range ( 0 , 1 ) , consider the weighted harmonic mean for x , y > 0 with x y :
$$H_\lambda(x,y) = \frac{1}{(1-\lambda)\frac{1}{x} + \lambda\frac{1}{y}} = \frac{xy}{\lambda x + (1-\lambda)y} = \frac{xy}{y + \lambda(x-y)}.$$
Clearly, the denominator may become zero when λ = y/(y − x) and even possibly negative. Thus, to avoid this issue, we restrict the range of α to ( 0 , 1 ) for defining quasi-arithmetic α -divergences.

2.2. The Quasi-Arithmetic α -Divergences

A quasi-arithmetic mean (QAM) is defined for a continuous and strictly monotonic function f : I ⊂ ℝ₊ → J ⊂ ℝ₊ as:
$$M_f(x,y) := f^{-1}\!\left(\frac{f(x)+f(y)}{2}\right).$$
Function f is called the generator of the quasi-arithmetic mean. These strict and reflexive quasi-arithmetic means are also called Kolmogorov means [30], Nagumo means [42], de Finetti means [43], or quasi-linear means [44] in the literature. These means are called quasi-arithmetic means because they can be interpreted as arithmetic means on the arguments f ( x ) and f ( y ) :
$$f(M_f(x,y)) = \frac{f(x)+f(y)}{2} = A(f(x), f(y)).$$
QAMs are strict, reflexive, and symmetric means.
Without loss of generality, we may assume strictly increasing functions f instead of monotonic functions since M_f = M_{−f} . Indeed, M_{−f} ( x , y ) = ( −f )^{−1} ( −f ( M_f ( x , y ) ) ) and ( ( −f )^{−1} ∘ ( −f ) ) ( u ) = u , the identity function. Notice that the composition f_1 ∘ f_2 of two strictly monotonic increasing functions f_1 and f_2 is a strictly monotonic increasing function. Furthermore, we consider I = J = ( 0 , ∞ ) in the remainder since we apply these means on positive densities. Two quasi-arithmetic means M_f and M_g coincide if and only if f ( u ) = a g ( u ) + b for some a > 0 and b ∈ ℝ , see [44]. The quasi-arithmetic means were considered in the axiomatization of the entropies by Rényi to define the α -entropies (see Equation (2.11) of [45]).
By choosing f_A ( u ) = u , f_G ( u ) = log u , or f_H ( u ) = 1/u , we obtain the Pythagorean arithmetic A, geometric G, and harmonic H means, respectively:
  • the arithmetic mean (A): A ( x , y ) = (x + y)/2 = M_{f_A} ( x , y ) ,
  • the geometric mean (G): G ( x , y ) = √(xy) = M_{f_G} ( x , y ) , and
  • the harmonic mean (H): H ( x , y ) = 2/(1/x + 1/y) = 2xy/(x + y) = M_{f_H} ( x , y ) .
More generally, choosing f P r ( u ) = u r , we obtain the parametric family of power means also called Hölder means [46] or binary means [47]:
$$P_r(x,y) = \left(\frac{x^r + y^r}{2}\right)^{\frac{1}{r}} = M_{f_{P_r}}(x,y), \qquad r \in \mathbb{R}\setminus\{0\}.$$
In order to get a smooth family of power means, we define the geometric mean as the limit case of r → 0 :
$$P_0(x,y) = \lim_{r\to 0} P_r(x,y) = G(x,y) = \sqrt{xy}.$$
A mean M is positively homogeneous if and only if M ( t a , t b ) = t M ( a , b ) for any t > 0 . It is known that the only positively homogeneous quasi-arithmetic means coincide exactly with the family of power means [44]. The weighted QAMs are given by
$$M_\alpha^f(p,q) = f^{-1}\!\left((1-\alpha)f(p) + \alpha f(q)\right)$$
$$= f^{-1}\!\left(f(p) + \alpha\left(f(q) - f(p)\right)\right) = M_{1-\alpha}^f(q,p).$$
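As a quick illustration (a sketch only; the generator/inverse pairs below correspond to the classical generators listed above), the weighted quasi-arithmetic mean can be evaluated for several generators at once:

```python
# A small sketch of weighted quasi-arithmetic means
# M_alpha^f(p,q) = f^{-1}((1-alpha) f(p) + alpha f(q)) for a few classical generators.
import numpy as np

def weighted_qam(f, f_inv, alpha, p, q):
    """Weighted QAM induced by a strictly monotone generator f with inverse f_inv."""
    return f_inv((1 - alpha) * f(p) + alpha * f(q))

generators = {
    "arithmetic": (lambda u: u,        lambda v: v),             # f_A(u) = u
    "geometric":  (np.log,             np.exp),                  # f_G(u) = log u
    "harmonic":   (lambda u: 1.0 / u,  lambda v: 1.0 / v),       # f_H(u) = 1/u
    "power r=3":  (lambda u: u**3,     lambda v: v**(1.0 / 3)),  # f_{P_3}(u) = u^3
}

p, q, alpha = 2.0, 8.0, 0.5
for name, (f, f_inv) in generators.items():
    m = weighted_qam(f, f_inv, alpha, p, q)
    print(f"{name:10s} M_1/2(2, 8) = {m:.4f}")   # A = 5, G = 4, H = 3.2, P_3 ~ 6.38
```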
Let us remark that QAMs were generalized to complex-valued generators in [48] and to probability measures defined on a compact support in [49].
Notice that there exist other positively homogeneous means which are not quasi-arithmetic means. For example, the logarithmic mean [50,51] L ( x , y ) for x > 0 and y > 0 :
$$L(x,y) = \frac{y - x}{\log y - \log x}$$
is an example of a homogeneous mean (i.e., L ( t x , t y ) = t L ( x , y ) for any t > 0 ) that is not a QAM. Besides the family of QAMs, there exist many other families of means [35]. For example, let us mention the Lagrangian means [52], which intersect with the QAMs only for the arithmetic mean, or a generalization of the QAMs called the Bajraktarević means [53].
Let us now strengthen a recent theorem (Theorem 1 of [54], 2010):
Theorem 1
(Strictly comparable weighted QAMs). The pair ( M f , M g ) of quasi-arithmetic means obtained for two strictly increasing generators f and g is strictly comparable provided that function f g 1 is strictly convex, where ∘ denotes the function composition.
Proof. 
Since f ∘ g^{−1} is strictly convex, it is convex, and therefore it follows from Theorem 1 of [54] that M_α^f ≥ M_α^g for all α ∈ [ 0 , 1 ] . Thus, the very nice property of QAMs is that M_f ≥ M_g implies that M_α^f ≥ M_α^g for any α ∈ [ 0 , 1 ] . Now, let us consider the equation M_α^f ( p , q ) = M_α^g ( p , q ) for p ≠ q :
$$f^{-1}\!\left((1-\alpha)f(p) + \alpha f(q)\right) = g^{-1}\!\left((1-\alpha)g(p) + \alpha g(q)\right).$$
Since f ∘ g^{−1} is assumed strictly convex, and g is strictly increasing, we have g ( p ) ≠ g ( q ) for p ≠ q , and we reach the following contradiction:
$$(1-\alpha)f(p) + \alpha f(q) = (f\circ g^{-1})\left((1-\alpha)g(p) + \alpha g(q)\right)$$
$$< (1-\alpha)(f\circ g^{-1})(g(p)) + \alpha (f\circ g^{-1})(g(q))$$
$$= (1-\alpha)f(p) + \alpha f(q).$$
Thus, M_α^f ( p , q ) ≠ M_α^g ( p , q ) for p ≠ q , and M_α^f ( p , q ) = M_α^g ( p , q ) for p = q . □
Thus, we can define the quasi-arithmetic α -divergences as follows:
Definition 4
(Quasi-arithmetic α -divergences). The ( f , g ) α-divergences I α f , g ( p : q ) : = I α M f , M g ( p : q ) between two positive densities p and q for α ( 0 , 1 ) are defined for two strictly increasing and differentiable functions f and g such that f g 1 is strictly convex by:
$$I_\alpha^{f,g}(p:q) := \frac{1}{\alpha(1-\alpha)}\int\left(M_{1-\alpha}^f(p,q) - M_{1-\alpha}^g(p,q)\right)\mathrm{d}\mu,$$
where M λ f and M λ g are the weighted quasi-arithmetic means induced by f and g, respectively.
We have the following corollary:
Corollary 1
(Proper quasi-arithmetic α -divergences). Let ( M f , M g ) be a pair of quasi-arithmetic means with f g 1 strictly convex, then the ( M f , M g ) α-divergences are proper divergences for α ( 0 , 1 ) .
Proof. 
Consider p and q with p ( x ) ≠ q ( x ) μ -almost everywhere. Since f ∘ g^{−1} is strictly convex, we have M_f ( x , y ) − M_g ( x , y ) ≥ 0 with strict inequality when x ≠ y . Thus, ∫ M_f ( p , q ) dμ − ∫ M_g ( p , q ) dμ > 0 and I_α^{f,g} ( p : q ) > 0 . Therefore the quasi-arithmetic α -divergences I_α^{f,g} satisfy the law of the indiscernibles for α ∈ ( 0 , 1 ) . □
Note that the ( A , G ) α -divergences (i.e., the ordinary α -divergences) are proper divergences satisfying both the properties D1 and D2 because f_A ( u ) = u and f_G ( u ) = log u , and hence ( f_A ∘ f_G^{−1} ) ( u ) = exp ( u ) is strictly convex on ( 0 , ∞ ) .
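A quick numerical check of Theorem 1 (illustration only, using a pointwise comparison): when f ∘ g^{−1} is strictly convex, the weighted QAM induced by f dominates the one induced by g, with equality only on the diagonal. Here f ( u ) = u and g ( u ) = log u, so f ∘ g^{−1} = exp is strictly convex.

```python
# Numerical illustration of Theorem 1: M_alpha^f >= M_alpha^g when f o g^{-1} is
# strictly convex (here f = identity, g = log, so f o g^{-1} = exp).
import numpy as np

def qam(f, f_inv, alpha, x, y):
    return f_inv((1 - alpha) * f(x) + alpha * f(y))

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 10.0, size=10_000)
y = rng.uniform(0.1, 10.0, size=10_000)
alpha = rng.uniform(0.0, 1.0, size=10_000)

m_f = qam(lambda u: u, lambda v: v, alpha, x, y)   # weighted arithmetic mean
m_g = qam(np.log, np.exp, alpha, x, y)             # weighted geometric mean
print("min gap:", (m_f - m_g).min())               # >= 0 up to floating-point rounding
print("gap at x == y:",
      qam(lambda u: u, lambda v: v, 0.3, 2.0, 2.0)
      - qam(np.log, np.exp, 0.3, 2.0, 2.0))        # ~0: equality on the diagonal
```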
Let us denote by I α f , g ( p : q ) : = I α M f , M g ( p : q ) the quasi-arithmetic α -divergences. Since the QAMs are symmetric means, we have I α f , g ( p : q ) = I 1 α f , g ( q : p ) .
Remark 2.
Let us notice that Zhang [55] in their study of divergences under monotone embeddings also defined the following family of related divergences (Equation (71) of [55]):
I ^ α A f , g ( p : q ) = 4 1 α A 2 M 1 + α A 2 f ( p , q ) M 1 + α A 2 g ( p , q ) d μ .
However, Zhang did not study the limit case divergences I ^ α A f , g ( p : q ) when α A ± 1 .

2.3. Limit Cases of 1-Divergences and 0-Divergences

We seek a closed-form formula of the limit divergence lim α 0 I α f , g ( p : q ) when α 0 .
Lemma 1.
A first-order Taylor approximation of the quasi-arithmetic mean [56] M_α^f for a C¹ strictly increasing generator f when α → 0 yields
$$M_\alpha^f(p,q) = p + \alpha\,\frac{f(q)-f(p)}{f'(p)} + o\!\left(\alpha\left(f(q)-f(p)\right)\right).$$
Proof. 
By taking the first-order Taylor expansion of f^{−1} ( x ) at x_0 (i.e., Taylor polynomial of order 1), we get:
$$f^{-1}(x) = f^{-1}(x_0) + (x - x_0)\,(f^{-1})'(x_0) + o(x - x_0).$$
Using the property of the derivative of an inverse function
$$(f^{-1})'(x) = \frac{1}{f'(f^{-1}(x))},$$
it follows that the first-order Taylor expansion of f^{−1} ( x ) is:
$$f^{-1}(x) = f^{-1}(x_0) + \frac{x - x_0}{f'(f^{-1}(x_0))} + o(x - x_0).$$
Plugging x_0 = f ( p ) and x = f ( p ) + α ( f ( q ) − f ( p ) ) , we get a first-order approximation of the weighted quasi-arithmetic mean M_α^f when α → 0 :
$$M_\alpha^f(p,q) = p + \alpha\,\frac{f(q)-f(p)}{f'(p)} + o\!\left(\alpha\left(f(q)-f(p)\right)\right).$$
Let us introduce the following bivariate function:
$$E_f(p,q) := \frac{f(q)-f(p)}{f'(p)}.$$
Remark 3.
Notice that E_{−f} ( p , q ) = E_f ( p , q ) matches the fact that M_α^{−f} ( p , q ) = M_α^f ( p , q ) . That is, we may either consider a strictly increasing differentiable generator f, or equivalently a strictly decreasing differentiable generator −f .
Thus, we obtain closed-form formulas for the I 1 -divergence and I 0 -divergence:
Theorem 2
(Quasi-arithmetic I 1 -divergence and reverse I 0 -divergence). The quasi-arithmetic I 1 -divergence induced by two strictly increasing and differentiable functions f and g such that f g 1 is strictly convex is
$$I_1^{f,g}(p:q) := \lim_{\alpha\to 1} I_\alpha^{f,g}(p:q) = \int \left(E_f(p,q) - E_g(p,q)\right)\mathrm{d}\mu \geq 0$$
$$= \int \left(\frac{f(q)-f(p)}{f'(p)} - \frac{g(q)-g(p)}{g'(p)}\right)\mathrm{d}\mu.$$
Furthermore, we have I 0 f , g ( p : q ) = I 1 f , g ( q : p ) = ( I 1 f , g ) * ( p : q ) , the reverse divergence.
Proof. 
Let us prove that I_1^{f,g} is a proper divergence satisfying axioms D1 and D2. Note that a sufficient condition for I_1^{f,g} ( p : q ) ≥ 0 is to check that
$$E_f(p,q) \geq E_g(p,q),$$
$$\frac{f(q)-f(p)}{f'(p)} \geq \frac{g(q)-g(p)}{g'(p)}.$$
If p = q μ -almost everywhere then clearly I_1^{f,g} ( p : q ) = 0 . Consider p ≠ q (i.e., at some observation x: p ( x ) ≠ q ( x ) ).
We use the following property of a strictly convex and differentiable function h for x < y (sometimes called the chordal slope lemma, see [29]):
$$h'(x) \leq \frac{h(y)-h(x)}{y-x} \leq h'(y).$$
We consider h ( x ) = ( f ∘ g^{−1} ) ( x ) so that h′ ( x ) = f′ ( g^{−1} ( x ) ) / g′ ( g^{−1} ( x ) ) . There are two cases to consider:
  • p < q and therefore g ( p ) < g ( q ) . Let y = g ( q ) and x = g ( p ) in Equation (57). We have h′ ( x ) = f′ ( p ) / g′ ( p ) and h′ ( y ) = f′ ( q ) / g′ ( q ) , and the double inequality of Equation (57) becomes
    $$\frac{f'(p)}{g'(p)} \leq \frac{f(q)-f(p)}{g(q)-g(p)} \leq \frac{f'(q)}{g'(q)}.$$
    Since g ( q ) − g ( p ) > 0 , g′ ( p ) > 0 , and f′ ( p ) > 0 , we get
    $$\frac{g(q)-g(p)}{g'(p)} \leq \frac{f(q)-f(p)}{f'(p)}.$$
  • q < p and therefore g ( p ) > g ( q ) . Then, the double inequality of Equation (57) becomes
    $$\frac{f'(q)}{g'(q)} \leq \frac{f(q)-f(p)}{g(q)-g(p)} \leq \frac{f'(p)}{g'(p)}.$$
    That is,
    $$\frac{f(q)-f(p)}{f'(p)} \geq \frac{g(q)-g(p)}{g'(p)},$$
    since g ( q ) − g ( p ) < 0 .
Thus, in both cases, we checked that E_f ( p ( x ) , q ( x ) ) ≥ E_g ( p ( x ) , q ( x ) ) . Therefore, I_1^{f,g} ( p : q ) ≥ 0 , and since the QAMs are distinct, I_1^{f,g} ( p : q ) = 0 iff p ( x ) = q ( x ) μ -a.e. □
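The closed form of Theorem 2 can be sanity-checked numerically (a sketch under the counting-measure convention) against the α → 1 limit of I_α^{f,g}, here with the generators f ( u ) = u² and g ( u ) = log u, for which f ∘ g^{−1} ( v ) = exp ( 2v ) is strictly convex:

```python
# Sanity check (illustration only): closed-form quasi-arithmetic 1-divergence versus
# the alpha -> 1 limit of I_alpha^{f,g}, with f(u) = u^2 and g(u) = log u.
import numpy as np

f, f_inv, f_prime = lambda u: u**2, np.sqrt, lambda u: 2 * u
g, g_inv, g_prime = np.log, np.exp, lambda u: 1.0 / u

def I_alpha(p, q, alpha):
    m_f = f_inv(alpha * f(p) + (1 - alpha) * f(q))   # M_{1-alpha}^f(p, q)
    m_g = g_inv(alpha * g(p) + (1 - alpha) * g(q))   # M_{1-alpha}^g(p, q)
    return np.sum(m_f - m_g) / (alpha * (1 - alpha))

def I_one(p, q):
    """Closed form: sum of (f(q)-f(p))/f'(p) - (g(q)-g(p))/g'(p)."""
    return np.sum((f(q) - f(p)) / f_prime(p) - (g(q) - g(p)) / g_prime(p))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.4, 0.3])
print(I_alpha(p, q, 1 - 1e-4))   # numerically close to the closed form below
print(I_one(p, q))
```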
We can interpret the I 1 divergences as generalized KL divergences and define generalized notions of cross-entropies and entropies. Since the KL divergence can be written as the cross-entropy minus the entropy, we can also decompose the I 1 divergences as follows:
$$I_1^{f,g}(p:q) = \int\left(\frac{f(q)}{f'(p)} - \frac{g(q)}{g'(p)}\right)\mathrm{d}\mu - \int\left(\frac{f(p)}{f'(p)} - \frac{g(p)}{g'(p)}\right)\mathrm{d}\mu$$
$$= h_\times^{f,g}(p:q) - h^{f,g}(p),$$
where h_×^{f,g} ( p : q ) denotes the ( f , g ) -cross-entropy (for a constant c ∈ ℝ ):
$$h_\times^{f,g}(p:q) = \int\left(\frac{f(q)}{f'(p)} - \frac{g(q)}{g'(p)}\right)\mathrm{d}\mu + c,$$
and h^{f,g} ( p ) stands for the ( f , g ) -entropy (self cross-entropy):
$$h^{f,g}(p) = h_\times^{f,g}(p:p) = \int\left(\frac{f(p)}{f'(p)} - \frac{g(p)}{g'(p)}\right)\mathrm{d}\mu + c.$$
Notice that we recover the Shannon entropy for f ( x ) = x and g ( x ) = log x with ( f ∘ g^{−1} ) ( x ) = exp ( x ) (strictly convex) and c = −1 to annihilate the ∫ p dμ = 1 term:
$$h^{\mathrm{id},\log}(p) = \int\left(p - p\log p\right)\mathrm{d}\mu - 1 = -\int p\log p\,\mathrm{d}\mu.$$
We define the generalized ( f , g ) -Kullback–Leibler divergence or generalized ( f , g ) -relative entropies:
KL f , g ( p : q ) : = h × f , g ( p : q ) h f , g ( p ) .
When f = f_A and g = f_G , we resolve the constant to c = 0 , and recover the ordinary Shannon cross-entropy and entropy:
$$h_\times^{f_A,f_G}(p:q) = \int\left(q - p\log q\right)\mathrm{d}\mu = h_\times(p:q),$$
$$h^{f_A,f_G}(p) = h_\times^{f_A,f_G}(p:p) = \int\left(p - p\log p\right)\mathrm{d}\mu = h(p),$$
and we have the ( f_A , f_G ) -Kullback–Leibler divergence that is the extended Kullback–Leibler divergence:
$$\mathrm{KL}^{f_A,f_G}(p:q) = \mathrm{KL}^+(p:q) = h_\times(p:q) - h(p) = \int\left(p\log\frac{p}{q} + q - p\right)\mathrm{d}\mu.$$
Thus, we have the ( f , g ) -cross-entropy and ( f , g ) -entropy expressed as
$$h_\times^{f,g}(p:q) = \int\left(\frac{f(q)}{f'(p)} - \frac{g(q)}{g'(p)}\right)\mathrm{d}\mu,$$
$$h^{f,g}(p) = \int\left(\frac{f(p)}{f'(p)} - \frac{g(p)}{g'(p)}\right)\mathrm{d}\mu.$$
In general, we can define the ( f , g ) -Jeffreys divergence as:
J f , g ( p : q ) = KL f , g ( p : q ) + KL f , g ( q : p ) .
Thus, we define the quasi-arithmetic mean α -divergences as follows:
Theorem 3
(Quasi-arithmetic α -divergences). Let f and g be two continuous, strictly increasing, and differentiable functions on ( 0 , ∞ ) such that f ∘ g^{−1} is strictly convex. Then, the quasi-arithmetic α-divergences induced by ( f , g ) for α ∈ [ 0 , 1 ] are
$$I_\alpha^{f,g}(p:q) = \begin{cases} \frac{1}{\alpha(1-\alpha)}\int\left(M_{1-\alpha}^f(p,q) - M_{1-\alpha}^g(p,q)\right)\mathrm{d}\mu, & \alpha \in \mathbb{R}\setminus\{0,1\},\\ I_1^{f,g}(p:q) = \int\left(\frac{f(q)-f(p)}{f'(p)} - \frac{g(q)-g(p)}{g'(p)}\right)\mathrm{d}\mu, & \alpha=1,\\ I_0^{f,g}(p:q) = \int\left(\frac{f(p)-f(q)}{f'(q)} - \frac{g(p)-g(q)}{g'(q)}\right)\mathrm{d}\mu, & \alpha=0. \end{cases}$$
When f ( u ) = f A ( u ) = u ( M f = A ) and g ( u ) = f G ( u ) = log u ( M g = G ), we get
$$I_1^{A,G}(p:q) = \int\left(q - p - p\log\frac{q}{p}\right)\mathrm{d}\mu = \mathrm{KL}^+(p:q) = I_1(p:q),$$
the Kullback–Leibler divergence (KLD) extended to positive densities, and I 0 = KL + * the reverse extended KLD.
Let M denote the class of strictly increasing and differentiable real-valued univariate functions. An interesting question is to study the class of pairs of functions ( f , g ) M × M such that I 1 f , g ( p : q ) = KL ( p : q ) . This involves solving integral-based functional equations [57].
We can rewrite the α -divergence I α f , g ( p : q ) for α ( 0 , 1 ) as
$$I_\alpha^{f,g}(p:q) = \frac{1}{\alpha(1-\alpha)}\left(S_{1-\alpha}^f(p,q) - S_{1-\alpha}^g(p,q)\right),$$
where
$$S_\lambda^h(p,q) := \int M_\lambda^h(p,q)\,\mathrm{d}\mu.$$
Zhang [11] (pp. 188–189) considered the ( A , M ρ ) α A -divergences:
$$D_\alpha^\rho(p:q) := \frac{4}{1-\alpha^2}\int\left(\frac{1-\alpha}{2}\,p + \frac{1+\alpha}{2}\,q - \rho^{-1}\!\left(\frac{1-\alpha}{2}\rho(p) + \frac{1+\alpha}{2}\rho(q)\right)\right)\mathrm{d}\mu.$$
Zhang obtained for D_{±1}^ρ ( p : q ) the following formula:
$$D_1^\rho(p:q) = \int\left(p - q - (\rho^{-1})'(\rho(q))\left(\rho(p) - \rho(q)\right)\right)\mathrm{d}\mu = D_{-1}^\rho(q:p),$$
which is in accordance with our generic formula of Equation (53) since ( ρ^{−1} )′ ( x ) = 1/ρ′ ( ρ^{−1} ( x ) ) . Notice that A_α ≥ P_α^r for r ≤ 1 ; the arithmetic weighted mean dominates the weighted power means P^r when r ≤ 1 .
Furthermore, by imposing the homogeneity condition I α A , M ρ ( t p : t q ) = t I α A , M ρ ( p : q ) for t > 0 , Zhang [11] obtained the class of ( α A , β A ) -divergences for ( α A , β A ) [ 1 , 1 ] 2 :
D α A , β A ( p : q ) : = 4 1 α A 2 2 1 + β A 1 α A 2 p + 1 + α A 2 q 1 α A 2 p 1 β A 2 + 1 + α A 2 q 1 β A 2 2 1 β A d μ .

2.4. Generalized KL Divergences as Conformal Bregman Divergences on Monotone Embeddings

Let us rewrite the generalized KLDs I 1 f , g as a conformal Bregman representational divergence [58,59,60] as follows:
Theorem 4.
The generalized KLDs I 1 f , g divergences are conformal Bregman representational divergences
$$I_1^{f,g}(p:q) = \int \frac{1}{f'(p)}\,B_F\!\left(g(q) : g(p)\right)\mathrm{d}\mu,$$
with F = f ∘ g^{−1} a strictly convex and differentiable Bregman convex generator defining the scalar Bregman divergence [61] B_F :
$$B_F(a:b) = F(a) - F(b) - (a-b)\,F'(b).$$
Proof. 
For the Bregman strictly convex and differentiable generator F = f ∘ g^{−1} , we expand the following conformal divergence
$$\frac{1}{f'(p)}B_F(g(q):g(p)) = \frac{1}{f'(p)}\left(F(g(q)) - F(g(p)) - (g(q)-g(p))\,F'(g(p))\right)$$
$$= \frac{1}{f'(p)}\left(f(q) - f(p) - (g(q)-g(p))\,\frac{f'(p)}{g'(p)}\right),$$
since ( g^{−1} ∘ g ) ( x ) = x and F′ ( g ( x ) ) = f′ ( x ) / g′ ( x ) . It follows that
$$\frac{1}{f'(p)}B_F(g(q):g(p)) = \frac{f(q)-f(p)}{f'(p)} - \frac{g(q)-g(p)}{g'(p)}$$
$$= E_f(p,q) - E_g(p,q),$$
the integrand of I_1^{f,g} ( p : q ) . Hence, we easily check that I_1^{f,g} ( p : q ) = ∫ (1/f′ ( p )) B_F ( g ( q ) : g ( p ) ) dμ ≥ 0 since f′ ( p ) > 0 and B_F ≥ 0 . □
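Theorem 4 can be verified pointwise with a few lines of code (illustration only): the integrand of I_1^{f,g} coincides with the conformally scaled Bregman divergence (1/f′(p)) B_F ( g(q) : g(p) ) for F = f ∘ g^{−1}, shown here for f ( u ) = u and g ( u ) = log u (so F = exp):

```python
# Pointwise numerical check of the conformal Bregman rewriting of I_1^{f,g}
# with f = identity and g = log, so F = f o g^{-1} = exp.
import numpy as np

f, f_prime = lambda u: u, lambda u: np.ones_like(np.asarray(u, float))
g, g_prime = np.log, lambda u: 1.0 / u
F, F_prime = np.exp, np.exp

def bregman(F, F_prime, a, b):
    return F(a) - F(b) - (a - b) * F_prime(b)

def i1_integrand(p, q):
    return (f(q) - f(p)) / f_prime(p) - (g(q) - g(p)) / g_prime(p)

p, q = np.array([0.2, 0.5, 0.3]), np.array([0.3, 0.4, 0.3])
lhs = i1_integrand(p, q)
rhs = bregman(F, F_prime, g(q), g(p)) / f_prime(p)
print(np.allclose(lhs, rhs))   # True: both equal q - p - p*log(q/p) in this case
```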
In general, for a functional generator f and a strictly monotonic representational function r (also called monotone embedding [62] in information geometry), we can define the representational Bregman divergence [63] B f r 1 ( r ( p ) : r ( q ) ) provided that F = f r 1 is a Bregman generator (i.e., strictly convex and differentiable).
The Itakura–Saito divergence [64] (IS) between two densities p and q is defined by:
$$D_{\mathrm{IS}}(p:q) = \int\left(\frac{p}{q} - \log\frac{p}{q} - 1\right)\mathrm{d}\mu$$
$$= \int D_{\mathrm{IS}}(p(x):q(x))\,\mathrm{d}\mu(x),$$
where D_IS ( x : y ) = x/y − log (x/y) − 1 is the scalar IS divergence. This divergence was originally designed in sound processing for measuring the discrepancy between two speech power spectra. Observe that the IS divergence is invariant by rescaling: D_IS ( t p : t q ) = D_IS ( p : q ) for any t > 0 . The IS divergence is a Bregman divergence [61] obtained for the Burg information generator (i.e., negative Burg entropy): F_Burg ( u ) = −log u with F′_Burg ( u ) = −1/u . It follows that we have
I 1 f ( p : q ) = p B f ( q : p ) d μ ,
The Itakura–Saito divergence may further be extended to a family of α -Itakura–Saito divergences (see [6], Equation (10.45) of Theorem 10.1):
$$D_{\mathrm{IS},\alpha}(p:q) = \begin{cases} \frac{1}{\alpha^2}\int\left(\left(\frac{p}{q}\right)^\alpha - \alpha\log\frac{p}{q} - 1\right)\mathrm{d}\mu, & \alpha \neq 0\\ \frac{1}{2}\int\left(\log q - \log p\right)^2\mathrm{d}\mu, & \alpha = 0. \end{cases}$$
In [56], a generalization of the Bregman divergences was obtained using the comparative convexity induced by two abstract means M and N to define ( M , N ) -Bregman divergences as limit of scaled ( M , N ) -Jensen divergences. The skew ( M , N ) -Jensen divergences are defined for α ( 0 , 1 ) by:
$$J_{F,\alpha}^{M,N}(p:q) = \frac{1}{\alpha(1-\alpha)}\left(N_\alpha(F(p), F(q)) - F(M_\alpha(p,q))\right),$$
where M_α and N_α are weighted means that should be regular [56] (i.e., homogeneous, symmetric, continuous, and increasing in each variable). Then, we can define the ( M , N ) -Bregman divergence as
$$B_F^{M,N}(p:q) = \lim_{\alpha\to 1} J_{F,\alpha}^{M,N}(p:q)$$
$$= \lim_{\alpha\to 1} \frac{1}{\alpha(1-\alpha)}\left(N_\alpha(F(p), F(q)) - F(M_\alpha(p,q))\right).$$
The formula obtained in [56] for the quasi-arithmetic means M_f and M_g and a functional generator F that is ( M_f , M_g ) -convex is:
$$B_F^{f,g}(p:q) = \frac{g(F(p)) - g(F(q))}{g'(F(q))} - \frac{f(p)-f(q)}{f'(q)}\,F'(q)$$
$$= \frac{1}{g'(F(q))}\,B_{g\circ F\circ f^{-1}}\!\left(f(p) : f(q)\right) \geq 0.$$
This is a conformal divergence [58] that can be written using the E_f terms as:
$$B_F^{f,g}(p:q) = E_g(F(q), F(p)) - E_f(q,p)\,F'(q).$$
A function F is ( M f , M g ) -convex iff g F f 1 is (ordinary) convex [56].
The information geometry induced by a Bregman divergence (or equivalently by its convex generator) is a dually flat space [6]. The dualistic structure induced by a conformal Bregman representational divergence is related to conformal flattening [59,60]. The notion of conformal structures was first introduced in information geometry by Okamoto et al. [65].
Following the work of Ohara [59,60,66], the Kurose geometric divergence ρ ( p , r ) [67] (a contrast function in affine differential geometry) induced by a pair ( L , M ) of strictly monotone smooth functions between two distributions p and r of the d-dimensional probability simplex Δ d is defined by (Equation (28) in [59]):
ρ ( p : r ) = 1 Λ ( r ) i = 1 d + 1 L ( p i ) L ( r i ) L ( r i ) = 1 Λ ( r ) i = 1 d + 1 E L ( r i , p i ) ,
where Λ ( r ) = i = 1 d + 1 1 L ( p i ) p i . Affine immersions [67] can be interpreted as special embeddings.
Let ρ be a divergence (contrast function) and ( ρ g , ρ , ρ * ) be the induced statistical manifold structure with
ρ g i j ( p ) : = ( i ) p ( j ) p ρ ( p , q ) | q = p ,
Γ i j , k ( p ) : = ( i ) p ( j ) p ( k ) q ρ ( p , q ) | q = p ,
Γ i j , k * ( p ) : = ( i ) p ( j ) q ( k ) q ρ ( p , q ) | q = p ,
where ( i ) s denotes the tangent vector at s of a vector field i .
Consider a conformal divergence ρ κ ( p : q ) = κ ( q ) ρ ( p : q ) for a positive function κ ( q ) > 0 , called the conformal factor. Then, the induced statistical manifold [6,7] ( ρ κ g , ρ κ , ρ κ * ) is 1-conformally equivalent to ( ρ g , ρ , ρ * ) and we have
ρ κ g = κ ρ g ,
ρ g ( ρ κ X Y , Z ) = ρ g ( ρ X Y , Z ) d ( log κ ) ( Z ) ρ g ( X , Y ) .
The dual affine connections ρ κ * and ρ * are projectively equivalent [67] (and ρ * is said 1 -conformally flat).
Conformal flattening [59,60] consists of choosing the conformal factor κ such that ( ρ κ g , ρ κ , ρ κ ) becomes a dually flat space [6] equipped with a canonical Bregman divergence.
Therefore, it follows that the statistical manifold induced by the 1-divergence I_1^{f,g} is a representational 1-conformally flat statistical manifold. Figure 1 gives an overview of the interplay of divergences with information-geometric structures. The logarithmic divergence [68] L_{G,α} is defined for α > 0 and an α -exponentially concave generator G by:
$$L_{G,\alpha}(\theta_1:\theta_2) = \frac{1}{\alpha}\log\left(1 + \alpha\,\nabla G(\theta_2)^\top(\theta_1-\theta_2)\right) + G(\theta_2) - G(\theta_1).$$
When α → 0 , we have L_{G,α} ( θ_1 : θ_2 ) → B_{−G} ( θ_1 : θ_2 ) , where B_F is the Bregman divergence [61] induced by a strictly convex and smooth function F:
$$B_F(\theta_1:\theta_2) = F(\theta_1) - F(\theta_2) - (\theta_1-\theta_2)^\top\nabla F(\theta_2).$$

3. The Subfamily of Homogeneous ( r , s ) -Power α -Divergences for r > s

In particular, we can define the ( r , s ) -power α -divergences from two power means P_r = M_{pow_r} and P_s = M_{pow_s} with r > s (and P_r ≥ P_s ) with the family of generators pow_l ( u ) = u^l . Indeed, we check that f_{r/s} ( u ) := ( pow_r ∘ pow_s^{−1} ) ( u ) = u^{r/s} is strictly convex on ( 0 , ∞ ) since f″_{r/s} ( u ) = (r/s) ((r/s) − 1) u^{(r/s)−2} > 0 for r > s . Thus, P_r and P_s are two QAMs which are both comparable and distinct. Table 1 lists the expressions of E_r ( p , q ) := E_{pow_r} ( p , q ) obtained from the power mean generators pow_r ( u ) = u^r .
We conclude with the definition of the ( r , s ) -power α -divergences:
Corollary 2
(power α -divergences). Given r > s , the α-power divergences are defined for r > s and r , s ≠ 0 by
$$I_\alpha^{r,s}(p:q) = \begin{cases} \frac{1}{\alpha(1-\alpha)}\int\left(\left(\alpha p^r + (1-\alpha)q^r\right)^{\frac{1}{r}} - \left(\alpha p^s + (1-\alpha)q^s\right)^{\frac{1}{s}}\right)\mathrm{d}\mu, & \alpha \in \mathbb{R}\setminus\{0,1\},\\ I_1^{r,s}(p:q) = \int\left(\frac{q^r-p^r}{r\,p^{r-1}} - \frac{q^s-p^s}{s\,p^{s-1}}\right)\mathrm{d}\mu, & \alpha=1,\\ I_0^{r,s}(p:q) = I_1^{r,s}(q:p), & \alpha=0. \end{cases}$$
When r = 0 , we get the following power α -divergences for s < 0 :
$$I_\alpha^{0,s}(p:q) = \begin{cases} \frac{1}{\alpha(1-\alpha)}\int\left(p^\alpha q^{1-\alpha} - \left(\alpha p^s + (1-\alpha)q^s\right)^{\frac{1}{s}}\right)\mathrm{d}\mu, & \alpha \in \mathbb{R}\setminus\{0,1\},\\ I_1^{0,s}(p:q) = \int\left(p\log\frac{q}{p} - \frac{q^s-p^s}{s\,p^{s-1}}\right)\mathrm{d}\mu, & \alpha=1,\\ I_0^{0,s}(p:q) = I_1^{0,s}(q:p), & \alpha=0. \end{cases}$$
When s = 0 , we get the following power α -divergences for r > 0 :
$$I_\alpha^{r,0}(p:q) = \begin{cases} \frac{1}{\alpha(1-\alpha)}\int\left(\left(\alpha p^r + (1-\alpha)q^r\right)^{\frac{1}{r}} - p^\alpha q^{1-\alpha}\right)\mathrm{d}\mu, & \alpha \in \mathbb{R}\setminus\{0,1\},\\ I_1^{r,0}(p:q) = \int\left(\frac{q^r-p^r}{r\,p^{r-1}} - p\log\frac{q}{p}\right)\mathrm{d}\mu, & \alpha=1,\\ I_0^{r,0}(p:q) = I_1^{r,0}(q:p), & \alpha=0. \end{cases}$$
In particular, we get the following family of ( A , H ) α -divergences
$$I_\alpha^{A,H}(p:q) = I_\alpha^{1,-1}(p:q) = \begin{cases} \frac{1}{\alpha(1-\alpha)}\int\left(\alpha p + (1-\alpha)q - \frac{pq}{\alpha q + (1-\alpha)p}\right)\mathrm{d}\mu, & \alpha \in \mathbb{R}\setminus\{0,1\},\\ I_1^{1,-1}(p:q) = \int\left(q - 2p + \frac{p^2}{q}\right)\mathrm{d}\mu, & \alpha=1,\\ I_0^{1,-1}(p:q) = I_1^{1,-1}(q:p), & \alpha=0, \end{cases}$$
and the family of ( G , H ) α -divergences:
$$I_\alpha^{G,H}(p:q) = I_\alpha^{0,-1}(p:q) = \begin{cases} \frac{1}{\alpha(1-\alpha)}\int\left(p^\alpha q^{1-\alpha} - \frac{pq}{\alpha q + (1-\alpha)p}\right)\mathrm{d}\mu, & \alpha \in \mathbb{R}\setminus\{0,1\},\\ I_1^{0,-1}(p:q) = \int\left(p\log\frac{q}{p} - p + \frac{p^2}{q}\right)\mathrm{d}\mu, & \alpha=1,\\ I_0^{0,-1}(p:q) = I_1^{0,-1}(q:p), & \alpha=0. \end{cases}$$
The ( r , s ) -power α -divergences for r , s ≠ 0 yield homogeneous divergences: I_α^{r,s} ( t p : t q ) = t I_α^{r,s} ( p : q ) for any t > 0 because the power means are homogeneous: P_α^r ( t x , t y ) = t P_α^r ( x , y ) = t x P_α^r ( 1 , y/x ) . Thus, the I_α^{r,s} -divergences are Csiszár f-divergences [17]
$$I_\alpha^{r,s}(p:q) = \int p(x)\, f_{r,s}\!\left(\frac{q(x)}{p(x)}\right)\mathrm{d}\mu$$
for the generator
$$f_{r,s}(u) = \frac{1}{\alpha(1-\alpha)}\left(P_\alpha^r(1,u) - P_\alpha^s(1,u)\right).$$
Thus, the family of ( r , s ) -power α -divergences are homogeneous divergences:
I α r , s ( t p : t q ) = t I α r , s ( p : q ) , t > 0 .
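The ( r , s ) -power α -divergences of Corollary 2, including their α → 1 and α → 0 limit cases, can be implemented compactly; the sketch below (counting measure, positive arrays, helper names are illustrative) handles r = 0 or s = 0 through the geometric-mean and p log ( q / p ) limits:

```python
# Sketch of the (r, s)-power alpha-divergences with their limit cases (r > s).
import numpy as np

def _E(r, p, q):
    """E_{pow_r}(p,q) = (q^r - p^r)/(r p^{r-1}); the r = 0 (log) limit gives p*log(q/p)."""
    if r == 0:
        return p * np.log(q / p)
    return (q**r - p**r) / (r * p**(r - 1))

def _weighted_power_mean(r, alpha, p, q):
    """M_{1-alpha}^{pow_r}(p,q) = (alpha p^r + (1-alpha) q^r)^(1/r); geometric mean if r = 0."""
    if r == 0:
        return p**alpha * q**(1 - alpha)
    return (alpha * p**r + (1 - alpha) * q**r)**(1.0 / r)

def power_alpha_divergence(p, q, alpha, r, s):
    assert r > s, "requires r > s so that the two power means are strictly comparable"
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.isclose(alpha, 1.0):
        return np.sum(_E(r, p, q) - _E(s, p, q))   # I_1^{r,s}
    if np.isclose(alpha, 0.0):
        return np.sum(_E(r, q, p) - _E(s, q, p))   # I_0^{r,s} = I_1^{r,s}(q:p)
    gap = _weighted_power_mean(r, alpha, p, q) - _weighted_power_mean(s, alpha, p, q)
    return np.sum(gap) / (alpha * (1 - alpha))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.4, 0.3])
print(power_alpha_divergence(p, q, 0.3, 1, -1))   # an (A, H) alpha-divergence value
print(power_alpha_divergence(p, q, 1.0, 1,  0))   # extended KL(p:q), since (r, s) = (1, 0)
```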

4. Applications to Center-Based Clustering

Clustering is a class of unsupervised learning algorithms which partitions a given d-dimensional point set P = { p_1 , … , p_n } into clusters such that data points falling into a same cluster tend to be more similar to each other than to data points belonging to different clusters. The celebrated k-means clustering [69] is a center-based method for clustering P into k clusters C_1 , … , C_k (with P = ∪_{i=1}^k C_i ), by minimizing the following k-means objective function
$$L(P, C) = \frac{1}{n}\sum_{i=1}^n \min_{j\in\{1,\ldots,k\}} \|p_i - c_j\|^2,$$
where the c j ’s denote the cluster representatives. Let C = { c 1 , , c k } denote the set of cluster centers. The cluster C j is defined as the points of P closer to cluster representative c j than any other c i for i j :
$$C_j = \left\{p \in P \;:\; \|p - c_j\|^2 \leq \|p - c_l\|^2,\ \forall l \in \{1,\ldots,k\}\right\}.$$
When k = 1 , it can be shown that the centroid of the point set P is the unique best cluster representative:
$$\arg\min_{c_1} L(P, \{c_1\}): \quad c_1 = \frac{1}{n}\sum_{i=1}^n p_i.$$
When d > 1 and k > 1 , finding a best partition P = ∪_{j=1}^k C_j which minimizes the objective function of Equation (107) is NP-hard [70]. When d = 1 , k-means clustering can be solved exactly using dynamic programming [71] in cubic O ( n³ ) time.
The k-means objective function can be generalized to any arbitrary (potentially asymmetric) divergence D ( · : · ) by considering the following objective function:
$$L_D(P, C) := \frac{1}{n}\sum_{i=1}^n \min_{j\in\{1,\ldots,k\}} D(p_i : c_j).$$
Thus, when D ( p : q ) = ‖ p − q ‖² , one recovers the ordinary k-means clustering [69]. When D ( p : q ) = B_F ( p : q ) is chosen as a Bregman divergence, one gets the right-sided Bregman k-means clustering [72], since the cluster centers appear as the right-sided arguments of D in Equation (108). When F ( x ) = ‖ x ‖²/2 , Bregman k-means clustering (i.e., D ( p : q ) = B_F ( p : q ) in Equation (108)) amounts to the ordinary k-means clustering. The right-sided Bregman centroid for k = 1 coincides with the center of mass and is independent of the Bregman generator F:
$$\arg\min_{c_1} L_{B_F}(P, \{c_1\}): \quad c_1 = \frac{1}{n}\sum_{i=1}^n p_i.$$
The left-sided Bregman k-means clustering is obtained by considering the right-sided Bregman centroid for the reverse Bregman divergence ( B F ) * ( p : q ) = B F ( q : p ) , and the left-sided Bregman centroid [73] can be expressed as a multivariate generalization of the quasi-arithmetic mean:
$$c_1 = (\nabla F)^{-1}\left(\frac{1}{n}\sum_{i=1}^n \nabla F(p_i)\right).$$
In order to study the robustness of k-means clustering with respect to our novel family of divergences I α f , g , we first study the robustness of the left-sided Bregman centroids to outliers.

4.1. Robustness of the Left-Sided Bregman Centroids

Consider two d-dimensional points p = ( p_1 , … , p_d ) and p′ = ( p′_1 , … , p′_d ) of a domain Θ ⊂ ℝ^d . The centroid of p and p′ with respect to any arbitrary divergence D ( · : · ) is by definition the minimizer of
$$L_D(c) = \frac{1}{2}D(p : c) + \frac{1}{2}D(p' : c),$$
provided that the minimizer arg min_{c ∈ Θ} L_D ( c ) is unique. Assume a separable Bregman divergence induced by the generator F ( p ) = ∑_{i=1}^d F ( p_i ) . The left-sided Bregman centroid [73] of p and p′ is given by the following separable quasi-arithmetic centroid:
$$c = (c_1, \ldots, c_d),$$
with
$$c_i = M_f(p_i, p'_i) = f^{-1}\!\left(\frac{f(p_i) + f(p'_i)}{2}\right),$$
where f ( x ) = F′ ( x ) denotes the derivative of the Bregman generator F ( x ) .
Now, fix p (say, p = ( 1 , … , 1 ) ∈ Θ ), and let the coordinates p′_i of p′ all tend to infinity: That is, point p′ plays the role of an outlier data point. We use the general framework of influence functions [74] in statistics to study the robustness of divergence-based centroids. Consider the r-power mean, a quasi-arithmetic mean induced by pow_r ( x ) = x^r for r ≠ 0 and by extension pow_0 ( x ) = log x when r = 0 (geometric mean).
When r < 0 , we check that
$$\lim_{p'_i \to +\infty} M_{\mathrm{pow}_r}(p_i, p'_i) = \lim_{p'_i \to +\infty} \left(\frac{1 + (p'_i)^r}{2}\right)^{\frac{1}{r}}$$
$$= \left(\frac{1}{2}\right)^{\frac{1}{r}} < \infty.$$
That is, the r-power mean is robust to an outlier data point when r < 0 (see Figure 2). Note that if instead of considering the centroid, we consider the barycenter with w denoting the weight of point p and 1 − w denoting the weight of the outlier p′ for w ∈ ( 0 , 1 ) , then the power r-mean falls in a square box of side w^{1/r} when r < 0 .
On the contrary, when r > 0 or r = 0 , we have lim_{p′_i → +∞} M_{pow_r} ( p_i , p′_i ) = ∞ , and the r-power mean diverges to infinity.
Thus, when r < 0 , the quasi-arithmetic centroid of p = ( 1 , … , 1 ) and p′ is contained in a bounding box of length ( 1/2 )^{1/r} with left corner ( 1 , … , 1 ) , and the left-sided Bregman power centroid minimizing
$$\frac{1}{2}B_F(c : p) + \frac{1}{2}B_F(c : p')$$
is robust to the outlier p′ .
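The robustness discussion above is easy to reproduce numerically (illustration only): the coordinate-wise r-power mean of p = ( 1 , … , 1 ) and an outlier p′ stays bounded when r < 0 and diverges when r ≥ 0.

```python
# Illustration: r-power mean of 1 and an increasingly large outlier coordinate.
import numpy as np

def power_mean(r, x, y):
    if r == 0:
        return np.sqrt(x * y)                     # geometric mean as the r -> 0 limit
    return ((x**r + y**r) / 2.0)**(1.0 / r)

outlier_coord = 10.0**np.arange(1, 7)             # outlier coordinates 10, 100, ..., 10^6
for r in (-1.0, 0.0, 1.0):
    means = power_mean(r, 1.0, outlier_coord)
    print(f"r = {r:+.0f}: {np.round(means, 3)}")
# r = -1 (harmonic): values approach (1/2)^(1/r) = 2, a bounded centroid coordinate;
# r = 0 (geometric) and r = +1 (arithmetic): values grow without bound.
```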
To contrast with this result, notice that the right-sided Bregman centroid [72] is always the center of mass (arithmetic mean), and therefore not robust to outliers as a single outlier data point may potentially drag the centroid to infinity.
Example 2.
Since M_f = M_{−f} for any smooth strictly monotone function f, we deduce that the quasi-arithmetic left-sided Bregman centroid induced by F ( x ) = −log x with f ( x ) = F′ ( x ) = −x^{−1} = −1/x for x > 0 is the harmonic mean, which is robust to outliers. The corresponding Bregman divergence is the Itakura–Saito divergence [72].
Notice that it is enough to consider without loss of generality two points p and p : Indeed, the case of the quasi-arithmetic mean of P = { p 1 , , p n } and p can be rewritten as an equivalent weighted quasi-arithmetic mean of two points p ¯ = M f ( p 1 , , p n ) with weight w = n n + 1 and p of weight 1 n + 1 using the replacement property of quasi-arithmetic means:
M f ( p 1 , , p k , p k + 1 , , p n ) = M f ( p ¯ , , p ¯ , p k + 1 , p n )
where p ¯ = M f ( p 1 , , p k ) .

4.2. Robustness of Generalized Kullback–Leibler Centroids

The fact that the generalized KLDs are conformal representational Bregman divergences can be used to design efficient algorithms in computational geometry [60]. For example, let us consider the centroid (or barycenter) of a finite set of weighted probability measures P_1 , … , P_n ≪ μ (with RN derivatives p_1 , … , p_n ) defined as the minimizer of
$$\min_{c} \sum_{i=1}^n w_i\, I_1^{f,g}(p_i : c),$$
where the w_i ’s are positive weights summing up to one ( ∑_{i=1}^n w_i = 1 ). The divergences I_1^{f,g} ( p_i : c ) are separable. Thus, consider, without loss of generality, the scalar generalized KLDs so that we have
$$I_1^{f,g}(p:q) = \frac{1}{f'(p)}\,B_F(g(q) : g(p)),$$
where p and q are scalars.
Since the Bregman centroid is unique and always coincides with the center of mass [72]
$$c^* = \arg\min_{c} \sum_{i=1}^n w_i\, B_F(p_i : c) = \sum_{i=1}^n w_i\, p_i,$$
for positive weights w i ’s summing up to one, we deduce that the right-sided generalized KLD centroid
$$\arg\min_{c} \frac{1}{n}\sum_{i=1}^n I_1^{f,g}(p_i : c) = \arg\min_{c} \frac{1}{n}\sum_{i=1}^n \frac{1}{f'(p_i)}\,B_F(g(c) : g(p_i))$$
amounts to a left-sided Bregman centroid with un-normalized positive weights W_i = 1/f′ ( p_i ) for the scalar Bregman generator F ( x ) = f ( g^{−1} ( x ) ) with F′ ( x ) = f′ ( g^{−1} ( x ) ) / g′ ( g^{−1} ( x ) ) . Therefore, the right-sided generalized KLD centroid c^* is calculated for normalized weights w_i = W_i / ∑_{j=1}^n W_j as:
$$g(c^*) = (F')^{-1}\left(\sum_{i=1}^n w_i\, F'(g(p_i))\right)$$
$$= (F')^{-1}\left(\sum_{i=1}^n \frac{\frac{1}{f'(p_i)}}{\sum_{j=1}^n \frac{1}{f'(p_j)}}\,\frac{f'(p_i)}{g'(p_i)}\right)$$
$$= (F')^{-1}\left(\frac{\sum_{i=1}^n \frac{1}{g'(p_i)}}{\sum_{j=1}^n \frac{1}{f'(p_j)}}\right), \qquad \text{i.e., } c^* = g^{-1}\!\left((F')^{-1}\left(\frac{\sum_{i=1}^n \frac{1}{g'(p_i)}}{\sum_{j=1}^n \frac{1}{f'(p_j)}}\right)\right).$$
Thus, we obtain a closed-form formula when ( F′ )^{−1} is computationally tractable. For example, consider the ( r , s ) -power KLD (with r > s ). We have f′ ( x ) = r x^{r−1} , g′ ( x ) = s x^{s−1} , F ( x ) = x^{r/s} , F′ ( x ) = (r/s) x^{(r−s)/s} and therefore, we get ( F′ )^{−1} ( x ) = ( s x / r )^{s/(r−s)} . Thus, we get a closed-form formula for the right-sided ( r , s ) -power Kullback–Leibler centroid using Equation (113).
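The following sketch (with hypothetical helper names such as F_prime_inv and g_inv, and uniform weights) implements this closed-form right-sided ( r , s ) -power Kullback–Leibler centroid and checks it against a brute-force grid minimization of the objective; the final pullback through g^{−1} returns the centroid on the density scale.

```python
# Sketch: closed-form right-sided (r, s)-power KL centroid versus a grid search.
import numpy as np

r, s = 2.0, 1.0                                    # generators f(x) = x^r, g(x) = x^s, r > s
f_prime = lambda x: r * x**(r - 1)
g_prime = lambda x: s * x**(s - 1)
F_prime_inv = lambda x: (s * x / r)**(s / (r - s)) # inverse of F'(x) = (r/s) x^{(r-s)/s}
g_inv = lambda x: x**(1.0 / s)

def kld_rs(p, c):                                  # scalar I_1^{r,s}(p : c)
    return (c**r - p**r) / (r * p**(r - 1)) - (c**s - p**s) / (s * p**(s - 1))

p = np.array([0.5, 1.0, 2.0, 4.0])
w = np.full(4, 0.25)                               # uniform weights

# Closed form (uniform weights cancel): g(c*) = F'^{-1}( sum 1/g'(p_i) / sum 1/f'(p_j) ).
eta = F_prime_inv(np.sum(1.0 / g_prime(p)) / np.sum(1.0 / f_prime(p)))
c_closed = g_inv(eta)

# Brute-force check on a fine grid of candidate centroids.
grid = np.linspace(0.1, 5.0, 50_001)
objective = np.sum(w * kld_rs(p[None, :], grid[:, None]), axis=1)
print(c_closed, grid[np.argmin(objective)])        # the two values should agree closely
```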
Overall, we can design a k-means-type algorithm with respect to our generalized KLDs following [72]. Moreover, we can initialize probabilistically k-means with a fast k-means++ seeding [34] described in Algorithm 1. The performance of the k-means++ seeding (i.e., the ratio L D ( P , C ) min C L D ( P , C ) ) is O ( log k ) when D ( p : q ) = p q 2 , and the analysis has been extended to arbitrary divergences in [75]. The merit of using the k-means++ seeding is that we do not need to iteratively update the cluster representatives using Lloyd’s heuristic [69] and we can thus bypass the calculations of centroids and merely choose the cluster representatives from the source data points P as described in Algorithm 1.
Algorithm 1: Generic seeding of k-means with divergence-based k-means++.
input: A finite set P = { p 1 , , p n } of n points, the number of cluster
   representatives k 1 , and an arbitrary divergence D ( · : · )
Output: Set of initial cluster centers C = { c 1 , , c k }
Choose c_1 ← p_i with uniform probability and set C = { c_1 }
(The seeding loop of Algorithm 1 is displayed as an image in the original article: the remaining k − 1 centers are chosen iteratively from P, each with probability proportional to its divergence D to the closest already-chosen center, and added to C.)
return C
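A possible realization of Algorithm 1 in code is sketched below; since the loop body is displayed as an image in the original article, the sketch follows the standard k-means++ rule with the squared Euclidean distance replaced by an arbitrary divergence D (in the spirit of [34,75]), and should be read as an assumption rather than the article's exact pseudocode.

```python
# Sketch of divergence-based k-means++ seeding.
import numpy as np

def divergence_kmeanspp_seeding(points, k, divergence, rng=None):
    """Pick k initial centers from `points` (an n x d array); `divergence(p, c)` is a
    non-negative dissimilarity between two d-dimensional points."""
    rng = np.random.default_rng(rng)
    n = len(points)
    centers = [points[rng.integers(n)]]            # first center: uniform pick
    for _ in range(1, k):
        # Divergence from each point to its closest already-chosen center.
        d_min = np.array([min(divergence(p, c) for c in centers) for p in points])
        probs = d_min / d_min.sum()                # sample proportionally to the divergence
        centers.append(points[rng.choice(n, p=probs)])
    return np.array(centers)

# Example with the separable extended KL divergence on positive vectors.
def extended_kl(p, c):
    return float(np.sum(p * np.log(p / c) + c - p))

rng = np.random.default_rng(1)
data = rng.uniform(0.1, 1.0, size=(200, 3))
print(divergence_kmeanspp_seeding(data, k=4, divergence=extended_kl, rng=0))
```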
The advantage of using a conformal Bregman divergence such as a total Bregman divergence [33] or I 1 f , g is to potentially ensure robustness to outliers (e.g., see Theorem III.2 of [33]). Robustness property of these novel I 1 f , g divergences can also be studied for statistical inference tasks based on minimum divergence methods [4,76].

5. Conclusions and Discussion

For two comparable strict means [35] M ( p , q ) ≥ N ( p , q ) (with equality holding if and only if p = q ), one can define their ( M , N ) -divergence as
$$I^{M,N}(p:q) := 4\int\left(M(p,q) - N(p,q)\right)\mathrm{d}\mu.$$
When the property of strictly comparable means extends to their induced weighted means M_α ( p , q ) and N_α ( p , q ) (i.e., M_α ( p , q ) ≥ N_α ( p , q ) ), one can further define the family of ( M , N ) α -divergences for α ∈ ( 0 , 1 ) :
$$I_\alpha^{M,N}(p:q) := \frac{1}{\alpha(1-\alpha)}\int\left(M_{1-\alpha}(p,q) - N_{1-\alpha}(p,q)\right)\mathrm{d}\mu,$$
so that I M , N ( p : q ) = I 1 2 M , N ( p : q ) . When the weighted means are symmetric, the reference duality holds (i.e., I α M , N ( q : p ) = I 1 α M , N ( p : q ) ), and we can define the ( M , N ) -equivalent of the Kullback–Leibler divergence, i.e., the ( M , N ) 1-divergence, as the limit case (when it exists): I 1 M , N ( p : q ) = lim α 1 I α M , N ( p : q ) . Similarly, the ( M , N ) -equivalent of the reverse Kullback–Leibler divergence is obtained as I 0 M , N ( p : q ) = lim α 0 I α M , N ( p : q ) .
We proved that the quasi-arithmetic weighted means [30] M_α^f and M_α^g were strictly comparable whenever f ∘ g^{−1} was strictly convex. In the limit cases of α → 0 and α → 1 , we reported a closed-form formula for the equivalent of the forward and the reverse Kullback–Leibler divergences. We reported closed-form formulas for the quasi-arithmetic α -divergences I_α^{f,g} ( p : q ) := I_α^{M_f,M_g} ( p : q ) for α ∈ [ 0 , 1 ] (Theorem 3) and for the subfamily of homogeneous ( r , s ) -power α -divergences I_α^{r,s} ( p : q ) := I_α^{M_{pow_r},M_{pow_s}} ( p : q ) induced by power means (Corollary 2). The ordinary ( A , G ) α -divergences [12], the ( A , H ) α -divergences, and the ( G , H ) α -divergences are examples of ( r , s ) -power α -divergences obtained for ( r , s ) = ( 1 , 0 ) , ( r , s ) = ( 1 , −1 ) and ( r , s ) = ( 0 , −1 ) , respectively.
Generalized α -divergences may prove useful in reporting a closed-form formula between densities of a parametric family { p_θ } . For example, consider the ordinary α -divergences between two scale Cauchy densities p_1 ( x ) = (1/π) s_1/(x² + s_1²) and p_2 ( x ) = (1/π) s_2/(x² + s_2²) ; there is no obvious closed-form for the ordinary α -divergences, but we can report a closed-form for the ( A , H ) α -divergences following the calculus reported in [41]:
$$I_\alpha^{A,H}(p_1:p_2) = \frac{1}{\alpha(1-\alpha)}\left(1 - \int H_{1-\alpha}(p_1(x), p_2(x))\,\mathrm{d}\mu(x)\right)$$
= 1 α ( 1 α ) 1 s 1 s 2 ( α s 1 + ( 1 α ) s 2 ) s 1 α ,
with s α = α s 1 s 2 2 + ( 1 α ) s 2 s 1 2 α s 1 + ( 1 α ) s 2 . For probability distributions p θ 1 and p θ 2 belonging to the same exponential family [77] with cumulant function F, the ordinary α -divergences admit the following closed-form solution:
$$I_\alpha(p_{\theta_1}:p_{\theta_2}) = \begin{cases} \frac{1}{\alpha(1-\alpha)}\left(1 - \exp\left(F(\alpha\theta_1 + (1-\alpha)\theta_2) - \left(\alpha F(\theta_1) + (1-\alpha)F(\theta_2)\right)\right)\right), & \alpha \in (0,1)\\ I_1(p_{\theta_1}:p_{\theta_2}) = \mathrm{KL}(p_{\theta_1}:p_{\theta_2}) = B_F(\theta_2:\theta_1), & \alpha=1\\ I_0(p_{\theta_1}:p_{\theta_2}) = \mathrm{KL}(p_{\theta_2}:p_{\theta_1}) = B_F(\theta_1:\theta_2), & \alpha=0 \end{cases}$$
where B_F is the Bregman divergence: B_F ( θ_2 : θ_1 ) = F ( θ_2 ) − F ( θ_1 ) − ( θ_2 − θ_1 )^⊤ ∇F ( θ_1 ) .
Instead of considering ordinary α -divergences in applications, one may consider the ( r , s ) -power α -divergences, and tune the three scalar parameters ( r , s , α ) according to the various tasks (say, by cross-validation in supervised machine learning tasks, see [13]). For the limit cases of α → 0 or of α → 1 , we further proved that the limit KL type divergences amounted to conformal Bregman divergences on strictly monotone embeddings and explained the connection of conformal divergences with conformal flattening [60], which allows one to build fast algorithms for centroid-based k-means clustering [72], Voronoi diagrams, and proximity data-structures [60,63]. One idea left for future work is to study the properties of these new ( M , N ) α -divergences for statistical inference [2,4,76].

Funding

This research received no external funding.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Keener, R.W. Theoretical Statistics: Topics for a Core Course; Springer: Berlin/Heidelberg, Germany, 2011. [Google Scholar]
  2. Basu, A.; Shioya, H.; Park, C. Statistical Inference: The Minimum Distance Approach; CRC Press: Boca Raton, FL, USA, 2011. [Google Scholar]
  3. Basseville, M. Divergence measures for statistical data processing — An annotated bibliography. Signal Process. 2013, 93, 621–633. [Google Scholar] [CrossRef]
  4. Pardo, L. Statistical Inference Based on Divergence Measures; CRC Press: Boca Raton, FL, USA, 2018. [Google Scholar]
  5. Oller, J.M. Some geometrical aspects of data analysis and statistics. In Statistical Data Analysis and Inference; Elsevier: Amsterdam, The Netherlands, 1989; pp. 41–58. [Google Scholar]
  6. Amari, S. Information Geometry and Its Applications; Applied Mathematical Sciences; Springer: Tokyo, Japan, 2016. [Google Scholar]
  7. Eguchi, S. Geometry of minimum contrast. Hiroshima Math. J. 1992, 22, 631–647. [Google Scholar] [CrossRef]
  8. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2012. [Google Scholar]
  9. Cichocki, A.; Amari, S.i. Families of alpha-beta-and gamma-divergences: Flexible and robust measures of similarities. Entropy 2010, 12, 1532–1568. [Google Scholar] [CrossRef] [Green Version]
  10. Amari, S.i. α-Divergence is Unique, belonging to Both f-divergence and Bregman Divergence Classes. IEEE Trans. Inf. Theory 2009, 55, 4925–4931. [Google Scholar] [CrossRef]
  11. Zhang, J. Divergence function, duality, and convex analysis. Neural Comput. 2004, 16, 159–195. [Google Scholar] [CrossRef]
  12. Hero, A.O.; Ma, B.; Michel, O.; Gorman, J. Alpha-Divergence for Classification, Indexing and Retrieval; Technical Report CSPL-328; Communication and Signal Processing Laboratory, University of Michigan: Ann Arbor, MI, USA, 2001. [Google Scholar]
  13. Dikmen, O.; Yang, Z.; Oja, E. Learning the information divergence. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 1442–1454. [Google Scholar] [CrossRef] [Green Version]
  14. Liu, W.; Yuan, K.; Ye, D. On α-divergence based nonnegative matrix factorization for clustering cancer gene expression data. Artif. Intell. Med. 2008, 44, 1–5. [Google Scholar] [CrossRef]
  15. Hellinger, E. Neue Begründung der Theorie Quadratischer Formen von Unendlichvielen Veränderlichen. J. Für Die Reine Und Angew. Math. 1909, 1909, 210–271. [Google Scholar] [CrossRef]
  16. Ali, S.M.; Silvey, S.D. A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Ser. B 1966, 28, 131–142. [Google Scholar] [CrossRef]
  17. Csiszár, I. Information-type measures of difference of probability distributions and indirect observation. Stud. Sci. Math. Hung. 1967, 2, 229–318. [Google Scholar]
  18. Qiao, Y.; Minematsu, N. A study on invariance of f-divergence and its application to speech recognition. IEEE Trans. Signal Process. 2010, 58, 3884–3890. [Google Scholar] [CrossRef]
  19. Li, W. Transport information Bregman divergences. Inf. Geom. 2021, 4, 435–470. [Google Scholar] [CrossRef]
  20. Li, W. Transport information Hessian distances. In Proceedings of the International Conference on Geometric Science of Information (GSI), Paris, France, 21–23 July 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 808–817. [Google Scholar]
  21. Li, W. Transport information geometry: Riemannian calculus on probability simplex. Inf. Geom. 2022, 5, 161–207. [Google Scholar] [CrossRef]
  22. Amari, S.i. Integration of stochastic models by minimizing α-divergence. Neural Comput. 2007, 19, 2780–2796. [Google Scholar] [CrossRef] [PubMed]
  23. Cichocki, A.; Lee, H.; Kim, Y.D.; Choi, S. Non-negative matrix factorization with α-divergence. Pattern Recognit. Lett. 2008, 29, 1433–1440. [Google Scholar] [CrossRef]
  24. Wada, J.; Kamahara, Y. Studying malapportionment using α-divergence. Math. Soc. Sci. 2018, 93, 77–89. [Google Scholar] [CrossRef]
  25. Maruyama, Y.; Matsuda, T.; Ohnishi, T. Harmonic Bayesian prediction under α-divergence. IEEE Trans. Inf. Theory 2019, 65, 5352–5366. [Google Scholar] [CrossRef]
  26. Iqbal, A.; Seghouane, A.K. An α-Divergence-Based Approach for Robust Dictionary Learning. IEEE Trans. Image Process. 2019, 28, 5729–5739. [Google Scholar] [CrossRef]
  27. Ahrari, V.; Habibirad, A.; Baratpour, S. Exponentiality test based on alpha-divergence and gamma-divergence. Commun. Stat.-Simul. Comput. 2019, 48, 1138–1152. [Google Scholar] [CrossRef]
  28. Sarmiento, A.; Fondón, I.; Durán-Díaz, I.; Cruces, S. Centroid-based clustering with αβ-divergences. Entropy 2019, 21, 196. [Google Scholar] [CrossRef] [Green Version]
  29. Niculescu, C.P.; Persson, L.E. Convex Functions and Their Applications: A Contemporary Approach, 1st ed.; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2006. [Google Scholar]
  30. Kolmogorov, A.N. Sur la notion de moyenne. Acad. Naz. Lincei Mem. Cl. Sci. His. Mat. Natur. Sez. 1930, 12, 388–391. [Google Scholar]
  31. Gibbs, A.L.; Su, F.E. On choosing and bounding probability metrics. Int. Stat. Rev. 2002, 70, 419–435. [Google Scholar] [CrossRef]
  32. Rachev, S.T.; Klebanov, L.B.; Stoyanov, S.V.; Fabozzi, F. The Methods of Distances in the Theory of Probability and Statistics; Springer: Berlin/Heidelberg, Germany, 2013; Volume 10. [Google Scholar]
  33. Vemuri, B.C.; Liu, M.; Amari, S.I.; Nielsen, F. Total Bregman divergence and its applications to DTI analysis. IEEE Trans. Med Imaging 2010, 30, 475–483. [Google Scholar] [CrossRef] [Green Version]
  34. Arthur, D.; Vassilvitskii, S. k-means++: The advantages of careful seeding. In Proceedings of the SODA ’07: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, USA, 7–9 January 2007; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2007; pp. 1027–1035. [Google Scholar]
  35. Bullen, P.S.; Mitrinovic, D.S.; Vasic, M. Means and Their Inequalities; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013; Volume 31. [Google Scholar]
  36. Toader, G.; Costin, I. Means in Mathematical Analysis: Bivariate Means; Academic Press: Cambridge, MA, USA, 2017. [Google Scholar]
  37. Cauchy, A.L.B. Cours d’analyse de l’École Royale Polytechnique; Debure frères: Paris, France, 1821. [Google Scholar]
  38. Chisini, O. Sul concetto di media. Period. Di Mat. 1929, 4, 106–116. [Google Scholar]
  39. Bhattacharyya, A. On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc. 1943, 35, 99–109. [Google Scholar]
  40. Nielsen, F.; Boltz, S. The Burbea-Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory 2011, 57, 5455–5466. [Google Scholar] [CrossRef] [Green Version]
  41. Nielsen, F. Generalized Bhattacharyya and Chernoff upper bounds on Bayes error using quasi-arithmetic means. Pattern Recognit. Lett. 2014, 42, 25–34. [Google Scholar] [CrossRef] [Green Version]
  42. Nagumo, M. Über eine klasse der mittelwerte. Jpn. J. Math. Trans. Abstr. 1930, 7, 71–79. [Google Scholar] [CrossRef] [Green Version]
  43. De Finetti, B. Sul concetto di media. Ist. Ital. Degli Attuari 1931, 3, 369–396. [Google Scholar]
  44. Hardy, G.; Littlewood, J.; Pólya, G. Inequalities; Cambridge Mathematical Library, Cambridge University Press: Cambridge, UK, 1988. [Google Scholar]
  45. Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 20 June–30 July 1960; The Regents of the University of California: Oakland, CA, USA, 1961; Volume 1. Contributions to the Theory of Statistics. [Google Scholar]
  46. Hölder, O.L. Über einen Mittelwertssatz. Nachr. Akad. Wiss. Göttingen Math.-Phys. Kl. 1889, 44, 38–47. [Google Scholar]
  47. Bhatia, R. The Riemannian mean of positive matrices. In Matrix Information Geometry; Springer: Berlin/Heidelberg, Germany, 2013; pp. 35–51. [Google Scholar]
  48. Akaoka, Y.; Okamura, K.; Otobe, Y. Bahadur efficiency of the maximum likelihood estimator and one-step estimator for quasi-arithmetic means of the Cauchy distribution. Ann. Inst. Stat. Math. 2022, 74, 1–29. [Google Scholar] [CrossRef]
  49. Kim, S. The quasi-arithmetic means and Cartan barycenters of compactly supported measures. Forum Math. Gruyter 2018, 30, 753–765. [Google Scholar] [CrossRef]
  50. Carlson, B.C. The logarithmic mean. Am. Math. Mon. 1972, 79, 615–618. [Google Scholar] [CrossRef]
  51. Stolarsky, K.B. Generalizations of the logarithmic mean. Math. Mag. 1975, 48, 87–92. [Google Scholar] [CrossRef]
  52. Jarczyk, J. When Lagrangean and quasi-arithmetic means coincide. J. Inequal. Pure Appl. Math. 2007, 8, 71. [Google Scholar]
  53. Páles, Z.; Zakaria, A. On the Equality of Bajraktarević Means to Quasi-Arithmetic Means. Results Math. 2020, 75, 19. [Google Scholar] [CrossRef] [Green Version]
  54. Maksa, G.; Páles, Z. Remarks on the comparison of weighted quasi-arithmetic means. Colloq. Math. 2010, 120, 77–84. [Google Scholar] [CrossRef]
  55. Zhang, J. Nonparametric information geometry: From divergence function to referential-representational biduality on statistical manifolds. Entropy 2013, 15, 5384–5418. [Google Scholar] [CrossRef] [Green Version]
  56. Nielsen, F.; Nock, R. Generalizing Skew Jensen Divergences and Bregman Divergences with Comparative Convexity. IEEE Signal Process. Lett. 2017, 24, 1123–1127. [Google Scholar] [CrossRef]
  57. Kuczma, M. An Introduction to the Theory of Functional Equations and Inequalities: Cauchy’s Equation and Jensen’s Inequality; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2009. [Google Scholar]
  58. Nock, R.; Nielsen, F.; Amari, S.i. On conformal divergences and their population minimizers. IEEE Trans. Inf. Theory 2015, 62, 527–538. [Google Scholar] [CrossRef] [Green Version]
  59. Ohara, A. Conformal flattening for deformed information geometries on the probability simplex. Entropy 2018, 20, 186. [Google Scholar] [CrossRef] [Green Version]
  60. Ohara, A. Conformal Flattening on the Probability Simplex and Its Applications to Voronoi Partitions and Centroids. In Geometric Structures of Information; Springer: Berlin/Heidelberg, Germany, 2019; pp. 51–68. [Google Scholar]
  61. Bregman, L.M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 1967, 7, 200–217. [Google Scholar] [CrossRef]
  62. Zhang, J. On monotone embedding in information geometry. Entropy 2015, 17, 4485–4499. [Google Scholar] [CrossRef] [Green Version]
  63. Nielsen, F.; Nock, R. The dual Voronoi diagrams with respect to representational Bregman divergences. In Proceedings of the Sixth International Symposium on Voronoi Diagrams (ISVD), Copenhagen, Denmark, 23–26 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 71–78. [Google Scholar]
  64. Itakura, F.; Saito, S. Analysis synthesis telephony based on the maximum likelihood method. In Proceedings of the 6th International Congress on Acoustics, Tokyo, Japan, 21–28 August 1968; pp. 280–292. [Google Scholar]
  65. Okamoto, I.; Amari, S.I.; Takeuchi, K. Asymptotic theory of sequential estimation: Differential geometrical approach. Ann. Stat. 1991, 19, 961–981. [Google Scholar] [CrossRef]
  66. Ohara, A.; Matsuzoe, H.; Amari, S.I. Conformal geometry of escort probability and its applications. Mod. Phys. Lett. B 2012, 26, 1250063. [Google Scholar] [CrossRef]
  67. Kurose, T. On the divergences of 1-conformally flat statistical manifolds. Tohoku Math. J. Second Ser. 1994, 46, 427–433. [Google Scholar] [CrossRef]
  68. Pal, S.; Wong, T.K.L. The geometry of relative arbitrage. Math. Financ. Econ. 2016, 10, 263–293. [Google Scholar] [CrossRef]
  69. Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137. [Google Scholar] [CrossRef] [Green Version]
  70. Mahajan, M.; Nimbhorkar, P.; Varadarajan, K. The planar k-means problem is NP-hard. Theor. Comput. Sci. 2012, 442, 13–21. [Google Scholar] [CrossRef] [Green Version]
  71. Wang, H.; Song, M. Ckmeans.1d.dp: Optimal k-means clustering in one dimension by dynamic programming. R J. 2011, 3, 29. [Google Scholar] [CrossRef] [Green Version]
  72. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J.; Lafferty, J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749. [Google Scholar]
  73. Nielsen, F.; Nock, R. Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory 2009, 55, 2882–2904. [Google Scholar] [CrossRef] [Green Version]
  74. Ronchetti, E.M.; Huber, P.J. Robust Statistics; John Wiley & Sons: Hoboken, NJ, USA, 2009. [Google Scholar]
  75. Nielsen, F.; Nock, R. Total Jensen divergences: Definition, properties and clustering. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, 19–24 April 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 2016–2020. [Google Scholar]
  76. Eguchi, S.; Komori, O. Minimum Divergence Methods in Statistical Machine Learning; Springer: Berlin/Heidelberg, Germany, 2022. [Google Scholar]
  77. Kailath, T. The divergence and Bhattacharyya distance measures in signal selection. IEEE Trans. Commun. Technol. 1967, 15, 52–60. [Google Scholar] [CrossRef]
Figure 1. Interplay of divergences and their information-geometric structures: Bregman divergences are canonical divergences of dually flat structures, and the $\alpha$-logarithmic divergences are canonical divergences of 1-conformally flat statistical manifolds. When $\alpha \to 0$, the logarithmic divergence $L_{F,\alpha}$ tends to the Bregman divergence $B_F$.
Figure 2. Illustration of the robustness property of the $r$-power mean $M_{\mathrm{pow}}^r(p, p')$ when $r < 0$ for two points: a prescribed point $p = (1,1)$ and an outlier point $p' = (t,t)$. When $t \to +\infty$, the $r$-power mean of $p$ and $p'$ for $r < 0$ (e.g., the coordinatewise harmonic mean when $r = -1$) remains contained inside the box anchored at $p$ of side length $\left(\tfrac{1}{2}\right)^{1/r}$. The $r$-power mean can be interpreted as a left-sided Bregman centroid for the generator $F'(x) = x^r$, i.e., $F(x) = \frac{1}{r+1}\, x^{r+1}$ when $r \neq -1$ and $F(x) = \log x$ when $r = -1$.
Table 1. Expressions of the terms $E_r$ for the family of power means $P_r$, $r \in \mathbb{R}$.
| Power Mean | $E_r(p,q)$ |
|---|---|
| $P_r$ ($r \in \mathbb{R} \setminus \{0\}$) | $\frac{q^r - p^r}{r\, p^{r-1}}$ |
| $Q$ ($r = 2$) | $\frac{q^2 - p^2}{2p}$ |
| $A$ ($r = 1$) | $q - p$ |
| $G$ ($r = 0$) | $p \log \frac{q}{p}$ |
| $H$ ($r = -1$) | $-p^2\left(\frac{1}{q} - \frac{1}{p}\right) = p - \frac{p^2}{q}$ |
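The special cases listed in Table 1 can be recovered from the general expression $E_r(p,q) = \frac{q^r - p^r}{r\,p^{r-1}}$ in the first row; a short symbolic sketch in Python (sympy) illustrating the substitutions and the $r \to 0$ limit:

```python
import sympy as sp

p, q, r = sp.symbols('p q r', positive=True)
E = (q**r - p**r) / (r * p**(r - 1))      # general term E_r(p, q) from the first row of Table 1

print(sp.simplify(E.subs(r, 2)))          # quadratic mean Q: equals (q^2 - p^2)/(2p)
print(sp.simplify(E.subs(r, 1)))          # arithmetic mean A: equals q - p
print(sp.simplify(sp.limit(E, r, 0)))     # geometric mean G: equals p*log(q/p)
print(sp.simplify(E.subs(r, -1)))         # harmonic mean H: equals p - p^2/q
```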
