The $\alpha$-divergences associated with a pair of strictly comparable quasi-arithmetic means

We generalize the family of $\alpha$-divergences using a pair of strictly comparable weighted means. In particular, we obtain the $1$-divergence in the limit case $\alpha\rightarrow 1$ (a generalization of the Kullback-Leibler divergence) and the $0$-divergence in the limit case $\alpha\rightarrow 0$ (a generalization of the reverse Kullback-Leibler divergence). We state the condition for a pair of quasi-arithmetic means to be strictly comparable, and report the formula for the quasi-arithmetic $\alpha$-divergences and the subfamily of bipower homogeneous $\alpha$-divergences which belong to Csiszár's $f$-divergences. Finally, we show that these generalized quasi-arithmetic $1$-divergences and $0$-divergences can be decomposed as the sum of generalized cross-entropies minus entropies, and rewritten as conformal Bregman divergences using monotone embeddings.


1 Introduction

1.1 Statistical divergences
Consider a measurable space $(\mathcal{X},\mathcal{F})$ (where $\mathcal{F}$ denotes the $\sigma$-algebra and $\mathcal{X}$ the sample space) equipped with a positive measure $\mu$ (e.g., the Lebesgue measure or the counting measure). The notion of statistical dissimilarity [4] $D[P:Q]=D_\mu[p_\mu:q_\mu]$ between two arbitrary probability measures with Radon-Nikodym (RN) densities $p_\mu=\frac{\mathrm{d}P}{\mathrm{d}\mu}$ and $q_\mu=\frac{\mathrm{d}Q}{\mathrm{d}\mu}$ with respect to $\mu$ is at the core of many algorithms in signal processing, information theory, information fusion, and machine learning, among others. When those statistical dissimilarities are smooth, they are called divergences [2] in the literature. The most renowned statistical divergence rooted in information theory [9] is the Kullback-Leibler divergence (KLD):
$$\mathrm{KL}_\mu[p_\mu:q_\mu] := \int_{\mathcal{X}} p_\mu(x)\log\frac{p_\mu(x)}{q_\mu(x)}\,\mathrm{d}\mu(x).$$
Since the KLD is independent of the reference measure $\mu$, i.e., $\mathrm{KL}_\mu[p_\mu:q_\mu]=\mathrm{KL}_\nu[p_\nu:q_\nu]$ for $p_\mu=\frac{\mathrm{d}P}{\mathrm{d}\mu}$ and $q_\mu=\frac{\mathrm{d}Q}{\mathrm{d}\mu}$, and $p_\nu=\frac{\mathrm{d}P}{\mathrm{d}\nu}$ and $q_\nu=\frac{\mathrm{d}Q}{\mathrm{d}\nu}$ the RN derivatives with respect to another positive measure $\nu$, we write concisely $\mathrm{KL}[p:q]$ in the remainder instead of $\mathrm{KL}_\mu[p_\mu:q_\mu]$.
The KLD belongs to a parametric family of $\alpha$-divergences [7] $I_\alpha[p:q]$ for $\alpha\in\mathbb{R}$. The $\alpha$-divergences extended to positive densities (not necessarily normalized) play a central role in information geometry [2]:
$$I_\alpha[p:q] := \begin{cases}\frac{1}{\alpha(1-\alpha)}\int \left(\alpha p+(1-\alpha)q-p^\alpha q^{1-\alpha}\right)\mathrm{d}\mu, & \alpha\in\mathbb{R}\backslash\{0,1\},\\ I_1[p:q]=\mathrm{KL}_e[p:q], & \alpha=1,\\ I_0[p:q]=\mathrm{KL}_e[q:p], & \alpha=0,\end{cases}$$
where $\mathrm{KL}_e$ denotes the extended Kullback-Leibler divergence:
$$\mathrm{KL}_e[p:q] := \int \left(p\log\frac{p}{q}+q-p\right)\mathrm{d}\mu.$$
The $\alpha$-divergences are asymmetric for $\alpha\neq\frac{1}{2}$ (i.e., $I_\alpha[p:q]\neq I_\alpha[q:p]$ for $\alpha\neq\frac{1}{2}$) but exhibit the following reference duality [39]: $I_\alpha[q:p]=I_{1-\alpha}[p:q]$, where we denote by $D^*[p:q]:=D[q:p]$ the reverse divergence of an arbitrary divergence $D$ (e.g., $I_\alpha^*[p:q]=I_\alpha[q:p]=I_{1-\alpha}[p:q]$). The $\alpha$-divergences belong to the family of Csiszár's $f$-divergences [10], which are defined for a convex function $f$ satisfying $f(1)=0$ by
$$I_f[p:q]:=\int p(x)\,f\!\left(\frac{q(x)}{p(x)}\right)\mathrm{d}\mu(x).$$
We have $I_\alpha=I_{f_\alpha}$ with $f_\alpha(u)=\frac{\alpha+(1-\alpha)u-u^{1-\alpha}}{\alpha(1-\alpha)}$. In information geometry, $\alpha$-divergences (and more generally $f$-divergences) are invariant divergences [2], and it is customary to rewrite the $\alpha$-divergences using $\alpha_A=1-2\alpha$; the reference duality is then expressed by $\hat{I}_{\alpha_A}[q:p]=\hat{I}_{-\alpha_A}[p:q]$. A statistical divergence $D[\cdot:\cdot]$, when evaluated on densities belonging to a given parametric family $\mathcal{P}=\{p_\theta : \theta\in\Theta\}$ of densities, is equivalent to a corresponding contrast function [12]: $D(\theta_1:\theta_2):=D[p_{\theta_1}:p_{\theta_2}]$. Although quite confusing, those contrast functions have recently also been called divergences in the literature [2]. Thus, to disambiguate whether a divergence is a statistical divergence or a parameter divergence (i.e., a contrast function), we choose to use brackets to encapsulate the arguments of statistical divergences and parentheses to encapsulate the parameter arguments of divergences which are contrast functions.
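As a concrete illustration, here is a minimal Python sketch (the function name and the finite-support setting are my own choices; densities are represented as positive arrays on a common finite grid so that the counting measure plays the role of $\mu$) evaluating the extended $\alpha$-divergence and checking the limit case and the reference duality numerically:

```python
import numpy as np

def alpha_divergence(p, q, alpha):
    """Extended alpha-divergence I_alpha[p:q] between positive arrays p and q
    (finite support, counting measure), with the limit cases alpha = 1 and
    alpha = 0 given by the extended KL divergence and its reverse."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if alpha == 1:
        return np.sum(p * np.log(p / q) + q - p)   # KL_e[p:q]
    if alpha == 0:
        return np.sum(q * np.log(q / p) + p - q)   # KL_e[q:p]
    integrand = alpha * p + (1 - alpha) * q - p ** alpha * q ** (1 - alpha)
    return np.sum(integrand) / (alpha * (1 - alpha))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.3, 0.4])
print(alpha_divergence(p, q, 1 - 1e-7), alpha_divergence(p, q, 1))  # limit alpha -> 1
print(alpha_divergence(q, p, 0.3), alpha_divergence(p, q, 0.7))     # reference duality
```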
A smooth divergence $D(\theta_1:\theta_2)$ induces a dualistic structure in information geometry [2]. For example, the KLD on the family $\Delta$ of probability mass functions on a finite alphabet $\mathcal{X}$, whose equivalent contrast function is a Bregman divergence, induces a dually flat space [2]. More generally, the $\alpha_A$-divergences on the probability simplex $\Delta$ induce the $\alpha_A$-geometry.

1.2 Divergences and decomposable divergences
A statistical divergence $D$ shall satisfy the following two axioms:

D1. Non-negativity: $D[p:q]\geq 0$ for all densities $p$ and $q$;

D2. Identity of indiscernibles: $D[p:q]=0$ if and only if $p=q$ $\mu$-almost everywhere.
These axioms form a subset of the metric axioms since we consider neither the symmetry axiom nor the triangle inequality of metric distances. See [14] for some common examples of probability metrics (e.g., the total variation or the Wasserstein metrics).
A divergence $D[p:q]$ is said to be decomposable [2] when it can be written as an integral of a scalar divergence $d(\cdot:\cdot)$:
$$D[p:q]=\int_{\mathcal{X}} d(p(x):q(x))\,\mathrm{d}\mu(x),$$
or $D[p:q]=\int d(p:q)\mathrm{d}\mu$ for short. The $\alpha$-divergences are decomposable divergences since we have $I_\alpha[p:q]=\int i_\alpha(p:q)\mathrm{d}\mu$ with the following scalar $\alpha$-divergence:
$$i_\alpha(p:q):=\frac{1}{\alpha(1-\alpha)}\left(\alpha p+(1-\alpha)q-p^\alpha q^{1-\alpha}\right).$$

1.3 Contributions and paper outline
The outline of the paper and the contributions are summarized as follows: We define the generic $\alpha$-divergences in §2 for a pair of strictly comparable weighted means (Definition 2). Then §2.2 reports a closed-form formula (Theorem 3) for the quasi-arithmetic $\alpha$-divergences induced by two strictly comparable quasi-arithmetic means with monotonically increasing generators $f$ and $g$ such that $f\circ g^{-1}$ is strictly convex and differentiable. In §2.3, we study the divergences $I_0$ and $I_1$ obtained in the limit cases $\alpha\rightarrow 0$ and $\alpha\rightarrow 1$, respectively. We obtain generalized Kullback-Leibler divergences when $\alpha\rightarrow 1$ and generalized reverse Kullback-Leibler divergences when $\alpha\rightarrow 0$, which can be decomposed as generalized cross-entropies minus entropies. In §2.4, we show how to express these generalized $I_1$-divergences and $I_0$-divergences as conformal Bregman representational divergences, and briefly explain their induced conformally flat statistical manifolds. Section 3 makes explicit the subfamily of bipower homogeneous $\alpha$-divergences which belong to the family of Csiszár $f$-divergences [10]. Finally, Section 4 summarizes the work and presents several opportunities for future research directions.
2 Generalized $\alpha$-divergences

2.1 The $(M,N)$ $\alpha$-divergences

In general, a mean $M(x,y)$ aggregates two values $x$ and $y$ of an interval $I$ to produce an intermediate quantity which satisfies the innerness property [5]:
$$\min(x,y)\leq M(x,y)\leq\max(x,y).$$
A mean is said to be strict if these inequalities are strict whenever $x\neq y$. A mean $M$ is said to be reflexive iff $M(x,x)=x$ for all $x\in I$. In the remainder, we consider $I=(0,\infty)$. By using the unique dyadic representation of any real $\lambda\in(0,1)$ (i.e., $\lambda=\sum_{i=1}^\infty \frac{d_i}{2^i}$ with $d_i\in\{0,1\}$ a binary digit), one can build a weighted mean $M_\lambda$ from any given mean $M$; see [25] for such a construction.
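To give a flavor of the dyadic construction, here is a minimal Python sketch (my own recursive formulation, not necessarily the construction of [25]; the function name and the truncation depth are assumptions) that builds an approximate weighted mean $M_\lambda$ from an unweighted mean $M$ by recursing on the binary digits of $\lambda$:

```python
def weighted_mean(M, x, y, lam, depth=50):
    """Approximate the weighted mean M_lam(x, y) built from an unweighted
    (strict, reflexive) mean M via the dyadic expansion of lam in (0, 1),
    truncated after `depth` binary digits."""
    if depth == 0:
        return M(x, y)            # remaining weight approximated by 1/2
    if lam <= 0.5:                # first binary digit of lam is 0
        return weighted_mean(M, x, M(x, y), 2 * lam, depth - 1)
    return weighted_mean(M, M(x, y), y, 2 * lam - 1, depth - 1)

arithmetic = lambda x, y: 0.5 * (x + y)
geometric = lambda x, y: (x * y) ** 0.5
# Recovers (1 - lam) * x + lam * y and x**(1 - lam) * y**lam, respectively:
print(weighted_mean(arithmetic, 1.0, 4.0, 0.25))   # ~1.75
print(weighted_mean(geometric, 1.0, 4.0, 0.25))    # ~4**0.25 = 1.414...
```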
By analogy with the $\alpha$-divergences, let us define the (decomposable) $(M,N)$ $\alpha$-divergences for a pair of weighted means $M_{1-\alpha}$ and $N_{1-\alpha}$ for $\alpha\in(0,1)$ as
$$I_\alpha^{M,N}[p:q]:=\frac{1}{\alpha(1-\alpha)}\int \left(M_{1-\alpha}(p(x),q(x))-N_{1-\alpha}(p(x),q(x))\right)\mathrm{d}\mu(x).$$
The ordinary $\alpha$-divergences for $\alpha\in(0,1)$ are recovered as the $(A,G)$ $\alpha$-divergences: $I_\alpha[p:q]=I_\alpha^{A,G}[p:q]$. In order to define generalized $\alpha$-divergences satisfying the axioms D1 and D2 of proper divergences, we need to characterize the class of acceptable means. We give a definition strengthening the notion of comparable means in [25]:

Definition 1 (Strictly comparable weighted means). A pair $(M,N)$ of means is said to be strictly comparable whenever $M_\lambda(x,y)\leq N_\lambda(x,y)$ for all $x,y\in(0,\infty)$ and all $\lambda\in(0,1)$, with equality if and only if $x=y$.
For example, the inequality of the arithmetic and geometric means states that $A(x,y)\geq G(x,y)$: means $A$ and $G$ are comparable, denoted by $A\geq G$. Furthermore, the weighted arithmetic and geometric means are distinct whenever $x\neq y$. Indeed, consider the equation $(1-\alpha)x+\alpha y=x^{1-\alpha}y^\alpha$ for $x,y>0$ and $x\neq y$. By taking the logarithm on both sides, we get
$$\log\left((1-\alpha)x+\alpha y\right)=(1-\alpha)\log x+\alpha\log y.$$
Since the logarithm is a strictly concave function, the only solution is $x=y$. Thus $(A,G)$ is a pair of strictly comparable weighted means. For a weighted mean $M$, define $\bar{M}_\alpha(x,y):=M_{1-\alpha}(x,y)$. We are ready to state the definition of generalized $\alpha$-divergences:

Definition 2 ($(M,N)$ $\alpha$-divergences). The $(M,N)$ $\alpha$-divergence $I_\alpha^{M,N}[p:q]$ between two positive densities $p$ and $q$ for $\alpha\in(0,1)$ is defined for a pair of strictly comparable weighted means $\bar{M}_\alpha$ and $\bar{N}_\alpha$ with $\bar{M}_\alpha\geq\bar{N}_\alpha$ by
$$I_\alpha^{M,N}[p:q]:=\frac{1}{\alpha(1-\alpha)}\int\left(\bar{M}_\alpha(p(x),q(x))-\bar{N}_\alpha(p(x),q(x))\right)\mathrm{d}\mu(x).$$
Using $\alpha=\frac{1-\alpha_A}{2}$, we can rewrite this divergence as
$$I_{\alpha_A}^{M,N}[p:q]=\frac{4}{1-\alpha_A^2}\int\left(\bar{M}_{\frac{1-\alpha_A}{2}}(p(x),q(x))-\bar{N}_{\frac{1-\alpha_A}{2}}(p(x),q(x))\right)\mathrm{d}\mu(x).$$
A weighted mean $M_\alpha$ is said to be symmetric iff $M_\alpha(x,y)=M_{1-\alpha}(y,x)$. When both weighted means $M$ and $N$ are symmetric, we have the following reference duality [39]: $I_\alpha^{M,N}[q:p]=I_{1-\alpha}^{M,N}[p:q]$. We consider symmetric means in the remainder.
In the limit cases $\alpha\rightarrow 0$ and $\alpha\rightarrow 1$, we define the $0$-divergence $I_0^{M,N}[p:q]:=\lim_{\alpha\rightarrow 0} I_\alpha^{M,N}[p:q]$ and the $1$-divergence $I_1^{M,N}[p:q]:=\lim_{\alpha\rightarrow 1} I_\alpha^{M,N}[p:q]$, respectively, provided that those limits exist.
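To make Definition 2 concrete, here is a hedged Python sketch (the helper names and the mean signatures are my own conventions; densities are positive arrays on a common finite support, so the counting measure plays the role of $\mu$) that evaluates $I_\alpha^{M,N}$ for $\alpha\in(0,1)$ and checks that the $(A,G)$ pair recovers the ordinary $\alpha$-divergence:

```python
import numpy as np

def weighted_A(x, y, lam):
    """Weighted arithmetic mean A_lam(x, y) = (1 - lam) x + lam y."""
    return (1 - lam) * x + lam * y

def weighted_G(x, y, lam):
    """Weighted geometric mean G_lam(x, y) = x^(1 - lam) y^lam."""
    return x ** (1 - lam) * y ** lam

def mn_alpha_divergence(M, N, p, q, alpha):
    """(M, N) alpha-divergence of Definition 2 for alpha in (0, 1): the sum of
    (M_{1-alpha}(p, q) - N_{1-alpha}(p, q)) scaled by 1 / (alpha (1 - alpha))."""
    lam = 1 - alpha
    return np.sum(M(p, q, lam) - N(p, q, lam)) / (alpha * (1 - alpha))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.3, 0.4])
alpha = 0.4
# The (A, G) alpha-divergence coincides with the ordinary alpha-divergence:
ordinary = np.sum(alpha * p + (1 - alpha) * q
                  - p ** alpha * q ** (1 - alpha)) / (alpha * (1 - alpha))
print(mn_alpha_divergence(weighted_A, weighted_G, p, q, alpha), ordinary)
```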

2.2 The quasi-arithmetic $\alpha$-divergences
A quasi-arithmetic mean (QAM) is defined for a continuous and strictly monotonic function $f:I\rightarrow J$ by
$$M_f(x,y):=f^{-1}\left(\frac{f(x)+f(y)}{2}\right).$$
Function $f$ is called the generator of the quasi-arithmetic mean. These strict and reflexive quasi-arithmetic means are also called Kolmogorov means [19], Nagumo means [23], de Finetti means [11], or quasi-linear means [15] in the literature. These means are called quasi-arithmetic means because they can be interpreted as arithmetic means on the arguments $f(x)$ and $f(y)$:
$$f(M_f(x,y))=\frac{f(x)+f(y)}{2}=A(f(x),f(y)).$$
QAMs are strict, reflexive and symmetric means. Without loss of generality, we assume strictly increasing functions $f$ instead of strictly monotonic functions since $M_{-f}=M_f$. Notice that the composition $f_1\circ f_2$ of two strictly monotonically increasing functions $f_1$ and $f_2$ is a strictly monotonically increasing function. Furthermore, we consider $I=J=(0,\infty)$ in the remainder since we apply these means to positive densities. Two quasi-arithmetic means $M_f$ and $M_g$ coincide if and only if $f(u)=a\,g(u)+b$ for some $a>0$ and $b\in\mathbb{R}$; see [15]. The quasi-arithmetic means were considered in the axiomatization of the entropies by Rényi to define the $\alpha$-entropies (see Eq. 2.11 of [36]).
By choosing $f_A(u)=u$, $f_G(u)=\log u$, or $f_H(u)=\frac{1}{u}$, we obtain the Pythagorean arithmetic $A$, geometric $G$, and harmonic $H$ means, respectively:
$$A(x,y)=\frac{x+y}{2},\quad G(x,y)=\sqrt{xy},\quad H(x,y)=\frac{2xy}{x+y}.$$
More generally, choosing $f_{P_r}(u)=u^r$, we obtain the parametric family of power means (also called Hölder means [16]):
$$P_r(x,y)=\left(\frac{x^r+y^r}{2}\right)^{\frac{1}{r}},\quad r\neq 0.$$
In order to get a smooth family of power means, we define the geometric mean in the limit case $r\rightarrow 0$: $P_0(x,y):=\lim_{r\rightarrow 0}P_r(x,y)=G(x,y)$. It is known that the positively homogeneous quasi-arithmetic means, i.e., $M_f(\lambda a,\lambda b)=\lambda M_f(a,b)$ for $\lambda>0$, coincide exactly with the family of power means. The weighted QAMs are given by
$$M_\lambda^f(x,y):=f^{-1}\left((1-\lambda)f(x)+\lambda f(y)\right).$$
The logarithmic mean $L(x,y)$ for $x>0$ and $y>0$,
$$L(x,y):=\begin{cases}\frac{y-x}{\log y-\log x}, & x\neq y,\\ x, & x=y,\end{cases}$$
is an example of a homogeneous mean (i.e., $L(\lambda x,\lambda y)=\lambda L(x,y)$ for any $\lambda>0$) that is not a QAM. Besides the family of QAMs, there exist many other families of means [5]: for example, let us mention the Lagrangean means [18], which intersect the QAMs only at the arithmetic mean, or a generalization of the QAMs called the Bajraktarević means [35].
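The following Python sketch (function names are my own) implements the weighted QAM $M_\lambda^f$ from a generator $f$ and its inverse, and instantiates the Pythagorean and power means; note that a decreasing generator such as $f_H$ yields the same mean as its increasing negation since $M_{-f}=M_f$:

```python
import numpy as np

def make_weighted_qam(f, finv):
    """Weighted quasi-arithmetic mean M^f_lam(x, y) = f^{-1}((1-lam) f(x) + lam f(y))."""
    def mean(x, y, lam=0.5):
        return finv((1 - lam) * f(x) + lam * f(y))
    return mean

A = make_weighted_qam(lambda u: u, lambda v: v)              # f_A(u) = u
G = make_weighted_qam(np.log, np.exp)                        # f_G(u) = log u
H = make_weighted_qam(lambda u: 1.0 / u, lambda v: 1.0 / v)  # f_H(u) = 1/u

def P(r):
    """Hoelder power mean P_r; the limit r -> 0 is the geometric mean."""
    return G if r == 0 else make_weighted_qam(lambda u: u ** r,
                                              lambda v: v ** (1.0 / r))

x, y = 1.0, 4.0
print(A(x, y), G(x, y), H(x, y))  # 2.5, 2.0, 1.6 (A >= G >= H)
print(P(2)(x, y), P(-1)(x, y))    # quadratic mean ~2.915, harmonic mean 1.6
```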
Let us strengthen a recent theorem of [21] (Theorem 1, 2010):

Theorem 1 (Strictly comparable weighted QAMs). The pair $(M_f,M_g)$ of quasi-arithmetic means obtained for two strictly increasing generators $f$ and $g$ is strictly comparable provided that $f\circ g^{-1}$ is strictly convex.
Proof. Since $f\circ g^{-1}$ is strictly convex, it is convex, and therefore it follows from Theorem 1 of [21] that $M_\alpha^f\geq M_\alpha^g$ for all $\alpha\in[0,1]$. Thus a very nice property of QAMs is that $M_f\geq M_g$ implies $M_\alpha^f\geq M_\alpha^g$ for any $\alpha\in[0,1]$. Now, let us consider the equation $M_\alpha^f(p,q)=M_\alpha^g(p,q)$ for $p\neq q$. Applying $f$ on both sides and letting $F:=f\circ g^{-1}$, this equation becomes
$$(1-\alpha)F(g(p))+\alpha F(g(q))=F\left((1-\alpha)g(p)+\alpha g(q)\right).$$
Since $f\circ g^{-1}$ is assumed strictly convex and $g$ is strictly increasing, we have $g(p)\neq g(q)$ for $p\neq q$, and we reach the following contradiction: strict convexity of $F$ implies
$$F\left((1-\alpha)g(p)+\alpha g(q)\right)<(1-\alpha)F(g(p))+\alpha F(g(q)).$$
Thus $M_\alpha^f(p,q)\neq M_\alpha^g(p,q)$ for $p\neq q$, and $M_\alpha^f(p,q)=M_\alpha^g(p,q)$ for $p=q$.
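As a quick numerical illustration of Theorem 1 (a minimal sketch; the generator pair $f(u)=u^2$, $g(u)=u$ and the test values are my own choices), one can check the strict dominance of the weighted quadratic mean over the weighted arithmetic mean, with equality only at equal arguments:

```python
import numpy as np

# f(u) = u^2 and g(u) = u on (0, inf): (f o g^{-1})(u) = u^2 is strictly
# convex, so M^f_alpha > M^g_alpha whenever x != y (Theorem 1).
Mf = lambda x, y, a: np.sqrt((1 - a) * x ** 2 + a * y ** 2)  # quadratic mean
Mg = lambda x, y, a: (1 - a) * x + a * y                     # arithmetic mean

for a in (0.1, 0.5, 0.9):
    print(Mf(1.0, 4.0, a) - Mg(1.0, 4.0, a) > 0.0,             # strict gap, x != y
          np.isclose(Mf(2.0, 2.0, a), Mg(2.0, 2.0, a)))        # equality, x == y
```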
Note that the $(A,G)$ $\alpha$-divergences (i.e., the ordinary $\alpha$-divergences) are proper divergences satisfying both properties D1 and D2 because $f_A(u)=u$ and $f_G(u)=\log u$, and hence $(f_A\circ f_G^{-1})(u)=\exp(u)$ is strictly convex. Let us denote by $I_\alpha^{f,g}[p:q]:=I_\alpha^{M_f,M_g}[p:q]$ the quasi-arithmetic $\alpha$-divergences. Since the QAMs are symmetric means, we have $I_\alpha^{f,g}[p:q]=I_{1-\alpha}^{f,g}[q:p]$.

2.3 Limit cases of the 1-divergences and 0-divergences
We seek closed-form formulas for the limit divergences $\lim_{\alpha\rightarrow 0} I_\alpha^{f,g}[p:q]$ and $\lim_{\alpha\rightarrow 1} I_\alpha^{f,g}[p:q]$.
Lemma 1. A first-order Taylor approximation of the weighted quasi-arithmetic mean [30] $M_\alpha^f$ for a $C^1$ strictly increasing generator $f$ when $\alpha\simeq 0$ yields
$$M_\alpha^f(x,y)\simeq x+\alpha\,\frac{f(y)-f(x)}{f'(x)}.$$

Proof. By taking the first-order Taylor expansion of $f^{-1}(u)$ at $u_0$ (i.e., the Taylor polynomial of order 1), we get
$$f^{-1}(u)\simeq f^{-1}(u_0)+(u-u_0)\left(f^{-1}\right)'(u_0).$$
Using the property of the derivative of an inverse function,
$$\left(f^{-1}\right)'(u)=\frac{1}{f'(f^{-1}(u))},$$
it follows that the first-order Taylor expansion of $f^{-1}(u)$ is
$$f^{-1}(u)\simeq f^{-1}(u_0)+(u-u_0)\,\frac{1}{f'(f^{-1}(u_0))}.$$
Plugging $u_0=f(x)$ and $u=(1-\alpha)f(x)+\alpha f(y)=f(x)+\alpha(f(y)-f(x))$, we get the first-order approximation of the weighted quasi-arithmetic mean $M_\alpha^f$ when $\alpha\rightarrow 0$:
$$M_\alpha^f(x,y)\simeq x+\alpha\,\frac{f(y)-f(x)}{f'(x)}.$$

Let us introduce the following bivariate function:
$$E_f(x,y):=\frac{f(y)-f(x)}{f'(x)}.$$
Thus we obtain closed-form formulas for the $I_1$-divergence and the $I_0$-divergence:

Theorem 2 (Quasi-arithmetic $I_1$-divergence and $I_0$-divergence). The quasi-arithmetic $I_1$-divergence induced by two strictly increasing and differentiable functions $f$ and $g$ such that $f\circ g^{-1}$ is strictly convex is
$$I_1^{f,g}[p:q]=\int\left(E_f(p,q)-E_g(p,q)\right)\mathrm{d}\mu=\int\left(\frac{f(q)-f(p)}{f'(p)}-\frac{g(q)-g(p)}{g'(p)}\right)\mathrm{d}\mu.$$
We have $I_0^{f,g}[p:q]=I_1^{f,g}[q:p]$.

Proof. Let us prove that $I_1^{f,g}$ is a proper divergence satisfying axioms D1 and D2. Note that a sufficient condition for $I_1^{f,g}[p:q]\geq 0$ is to check that $E_f(p(x),q(x))\geq E_g(p(x),q(x))$ pointwise. If $p=q$ $\mu$-a.e., then clearly $I_1^{f,g}[p:q]=0$. Consider $p\neq q$ (i.e., at some observation $x$: $p(x)\neq q(x)$).
We shall use the following property of a strictly convex and differentiable function $h$ for $x<y$ (sometimes called the chordal slope lemma, see [25]):
$$h'(x)<\frac{h(y)-h(x)}{y-x}<h'(y).$$
We consider $h=f\circ g^{-1}$. There are two cases to consider:

• $p<q$, and therefore $g(p)<g(q)$. Let $y=g(q)$ and $x=g(p)$ in the chordal slope inequality. We have $h'(x)=\frac{f'(p)}{g'(p)}$ and $h'(y)=\frac{f'(q)}{g'(q)}$, and the double inequality becomes
$$\frac{f'(p)}{g'(p)}<\frac{f(q)-f(p)}{g(q)-g(p)}<\frac{f'(q)}{g'(q)}.$$
Since $g(q)-g(p)>0$, $g'(p)>0$ and $f'(p)>0$, multiplying the left-hand inequality by $\frac{g(q)-g(p)}{f'(p)}>0$ yields $E_g(p,q)<E_f(p,q)$.

• $q<p$, and therefore $g(p)>g(q)$. Then the double inequality becomes
$$\frac{f'(q)}{g'(q)}<\frac{f(p)-f(q)}{g(p)-g(q)}<\frac{f'(p)}{g'(p)},$$
and multiplying the right-hand inequality by $\frac{g(p)-g(q)}{f'(p)}>0$ again yields $E_g(p,q)<E_f(p,q)$.

Thus in both cases we checked that $E_f(p(x),q(x))\geq E_g(p(x),q(x))$, with strict inequality when $p(x)\neq q(x)$. Therefore $I_1^{f,g}[p:q]\geq 0$, and since the QAMs are distinct, $I_1^{f,g}[p:q]=0$ iff $p(x)=q(x)$ $\mu$-a.e.
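As a quick numerical sanity check of Lemma 1 and Theorem 2 (a sketch under the assumption of densities on a finite support; the generator pair $(f,g)=(f_A,f_G)$ is chosen so that $I_1^{f,g}=\mathrm{KL}_e$, and the helper names are mine), one can verify that the scaled gap of weighted QAMs tends to $\int(E_f-E_g)\,\mathrm{d}\mu$ as $\alpha\rightarrow 1$:

```python
import numpy as np

def E(f, fprime, x, y):
    """E_f(x, y) = (f(y) - f(x)) / f'(x), the first-order term of Lemma 1."""
    return (f(y) - f(x)) / fprime(x)

# Generators f = f_A (identity) and g = f_G (log), so that I^{f,g}_1 = KL_e.
f, fprime, finv = (lambda u: u), (lambda u: np.ones_like(u)), (lambda v: v)
g, gprime, ginv = np.log, (lambda u: 1.0 / u), np.exp

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.3, 0.4])

def qa_alpha_divergence(alpha):
    """Quasi-arithmetic (f, g) alpha-divergence on a finite support."""
    lam = 1 - alpha
    Mf = finv((1 - lam) * f(p) + lam * f(q))   # weighted QAM M^f_{1-alpha}
    Mg = ginv((1 - lam) * g(p) + lam * g(q))   # weighted QAM M^g_{1-alpha}
    return np.sum(Mf - Mg) / (alpha * (1 - alpha))

closed_form = np.sum(E(f, fprime, p, q) - E(g, gprime, p, q))  # Theorem 2
print(qa_alpha_divergence(1 - 1e-7), closed_form)  # both equal KL_e[p:q] here
```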
We can interpret the $I_1$-divergences as generalized KL divergences, and define generalized notions of cross-entropies and entropies. Since the KL divergence can be written as the cross-entropy minus the entropy, we can also decompose the $I_1$-divergences as follows:
$$I_1^{f,g}[p:q]=h_\times^{f,g}(p:q)-h^{f,g}(p),$$
where $h_\times^{f,g}(p:q)$ denotes the $(f,g)$-cross-entropy (for a constant $c\in\mathbb{R}$),
$$h_\times^{f,g}(p:q):=\int\left(\frac{f(q)}{f'(p)}-\frac{g(q)}{g'(p)}\right)\mathrm{d}\mu+c,$$
and $h^{f,g}(p)$ stands for the $(f,g)$-entropy (self cross-entropy):
$$h^{f,g}(p):=h_\times^{f,g}(p:p)=\int\left(\frac{f(p)}{f'(p)}-\frac{g(p)}{g'(p)}\right)\mathrm{d}\mu+c.$$
We define the generalized $(f,g)$-Kullback-Leibler divergence $\mathrm{KL}^{f,g}[p:q]:=I_1^{f,g}[p:q]$. When $f=f_A$ and $g=f_G$, we resolve the constant to $c=0$, and recover the extended Shannon cross-entropy and entropy:
$$h_\times^{f_A,f_G}(p:q)=\int\left(q-p\log q\right)\mathrm{d}\mu,\quad h^{f_A,f_G}(p)=\int\left(p-p\log p\right)\mathrm{d}\mu,$$
and the $(f_A,f_G)$-Kullback-Leibler divergence is the extended Kullback-Leibler divergence: $\mathrm{KL}^{f_A,f_G}[p:q]=\mathrm{KL}_e[p:q]$. In general, we can define the $(f,g)$-Jeffreys divergence as the symmetrization
$$J^{f,g}[p;q]:=I_1^{f,g}[p:q]+I_1^{f,g}[q:p].$$
Thus we define the quasi-arithmetic $\alpha$-divergences as follows:

Theorem 3 (Quasi-arithmetic $\alpha$-divergences). Let $f$ and $g$ be two strictly increasing, continuous and differentiable functions on $(0,\infty)$ such that $f\circ g^{-1}$ is strictly convex. Then the quasi-arithmetic $\alpha$-divergence induced by $(f,g)$ for $\alpha\in[0,1]$ is
$$I_\alpha^{f,g}[p:q]=\begin{cases}\frac{1}{\alpha(1-\alpha)}\int\left(f^{-1}\left(\alpha f(p)+(1-\alpha)f(q)\right)-g^{-1}\left(\alpha g(p)+(1-\alpha)g(q)\right)\right)\mathrm{d}\mu, & \alpha\in(0,1),\\ \int\left(\frac{f(q)-f(p)}{f'(p)}-\frac{g(q)-g(p)}{g'(p)}\right)\mathrm{d}\mu, & \alpha=1,\\ \int\left(\frac{f(p)-f(q)}{f'(q)}-\frac{g(p)-g(q)}{g'(q)}\right)\mathrm{d}\mu, & \alpha=0.\end{cases}$$

Proof. For the strictly convex and differentiable Bregman generator $F=f\circ g^{-1}$, we expand the following conformal divergence:
$$\frac{1}{f'(q)}B_F(g(p):g(q))=\frac{1}{f'(q)}\left(F(g(p))-F(g(q))-(g(p)-g(q))F'(g(q))\right)=\frac{f(p)-f(q)}{f'(q)}-\frac{g(p)-g(q)}{g'(q)}.$$
Hence, we easily check that $I_0^{f,g}[p:q]=\int\frac{1}{f'(q)}B_F(g(p):g(q))\mathrm{d}\mu$ and $I_1^{f,g}[p:q]=\int\frac{1}{f'(p)}B_F(g(q):g(p))\mathrm{d}\mu$. In general, for a functional generator $f$ and a strictly monotonic representational function $r$, we can define the representational Bregman divergence [29] $B_{f\circ r^{-1}}(r(p):r(q))$ provided that $F=f\circ r^{-1}$ is a Bregman generator (i.e., strictly convex and differentiable).
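As a numerical sanity check of the conformal Bregman rewriting in the proof above (a sketch assuming densities on a finite support; the pair $(f,g)=(f_A,f_G)$ with $F=\exp$ is chosen for concreteness, and the function names are mine), one can verify that $I_1^{f,g}[p:q]=\sum \frac{1}{f'(p)}B_F(g(q):g(p))$:

```python
import numpy as np

# For (f, g) = (identity, log): F = f o g^{-1} = exp, and I^{f,g}_1 = KL_e.
F, Fprime = np.exp, np.exp                  # Bregman generator F = exp
g = np.log                                  # representational function
fprime = lambda u: np.ones_like(u)          # f = identity

def bregman_F(x, y):
    """B_F(x : y) = F(x) - F(y) - (x - y) F'(y)."""
    return F(x) - F(y) - (x - y) * Fprime(y)

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.3, 0.4])

lhs = np.sum(p * np.log(p / q) + q - p)          # I_1 = KL_e[p:q] for this pair
rhs = np.sum(bregman_F(g(q), g(p)) / fprime(p))  # conformal Bregman form
print(lhs, rhs)  # identical values
```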
2.4 Conformal Bregman representational divergences

In [30], a generalization of the Bregman divergences was obtained using the comparative convexity induced by two abstract means $M$ and $N$ to define $(M,N)$-Bregman divergences as limits of scaled $(M,N)$-Jensen divergences. The skew $(M,N)$-Jensen divergences are defined for $\alpha\in(0,1)$ by
$$J_{F,\alpha}^{M,N}(x:y):=\frac{1}{\alpha(1-\alpha)}\left(N_\alpha(F(x),F(y))-F(M_\alpha(x,y))\right),$$
where $M_\alpha$ and $N_\alpha$ are weighted means that should be regular [30] (i.e., homogeneous, symmetric, continuous and increasing in each variable). Then we can define the $(M,N)$-Bregman divergence as
$$B_F^{M,N}(x:y):=\lim_{\alpha\rightarrow 1}J_{F,\alpha}^{M,N}(x:y).$$
The formula obtained in [30] for the quasi-arithmetic means $M_f$ and $M_g$ and a functional generator $F$ that is $(M_f,M_g)$-convex is a conformal divergence [32] that can be written using the $E$-terms as
$$B_F^{M_f,M_g}(x:y)=E_g(F(y),F(x))-F'(y)\,E_f(y,x),$$
see [30].
The information geometry induced by a Bregman divergence (or equivalently by its convex generator) is a dually flat space [2,27]. The dualistic structure induced by a conformal Bregman representational divergence is related to conformal flattening [33,34].
Therefore it follows that the statistical manifolds induced by the $1$-divergences $I_1^{f,g}$ are representational $1$-conformally flat statistical manifolds.
3 The subfamily of homogeneous $(r,s)$-power $\alpha$-divergences

In particular, we can define the $(r,s)$-power $\alpha$-divergences from two power means $P_r=M_{\mathrm{pow}_r}$ and $P_s=M_{\mathrm{pow}_s}$ with $r>s$ (and $P_r\geq P_s$) using the family of generators $\mathrm{pow}_l(u)=u^l$. Indeed, we check that $f_{rs}(u):=\mathrm{pow}_r\circ\mathrm{pow}_s^{-1}(u)=u^{\frac{r}{s}}$ is strictly convex on $(0,\infty)$ since $f_{rs}''(u)=\frac{r}{s}\left(\frac{r}{s}-1\right)u^{\frac{r}{s}-2}>0$ for $r>s$. Thus $P_r$ and $P_s$ are two QAMs which are both comparable and distinct. Table 1 lists the expressions of $E_r(p,q):=E_{\mathrm{pow}_r}(p,q)$ obtained from the power mean generators $\mathrm{pow}_r(u)=u^r$.
We conclude with the definition of the $(r,s)$-power $\alpha$-divergences:

Corollary 1 (Power $\alpha$-divergences). Given $r>s$ with $r,s\neq 0$, the $(r,s)$-power $\alpha$-divergences are defined by
$$I_\alpha^{r,s}[p:q]=\begin{cases}\frac{1}{\alpha(1-\alpha)}\int\left(\left(\alpha p^r+(1-\alpha)q^r\right)^{\frac{1}{r}}-\left(\alpha p^s+(1-\alpha)q^s\right)^{\frac{1}{s}}\right)\mathrm{d}\mu, & \alpha\in\mathbb{R}\backslash\{0,1\},\\ \int\left(\frac{q^r-p^r}{rp^{r-1}}-\frac{q^s-p^s}{sp^{s-1}}\right)\mathrm{d}\mu, & \alpha=1,\\ I_1^{r,s}[q:p], & \alpha=0.\end{cases}$$

Applications of the $(f,g)$ $\alpha$-divergences to center-based clustering [31] and to generalized $\alpha$-divergences on positive-definite matrix spaces [2] shall also be considered in future work. The quasi-arithmetic weighted means are convex if and only if the generators are differentiable with positive first derivatives and the corresponding functions $-E_f$ are convex (Theorem 4 of [6]; i.e., the convexity of the quasi-arithmetic weighted means does not depend on the weights). For example, when both quasi-arithmetic means are convex means, the quasi-arithmetic $\alpha$-divergence is a difference of two convex mean functions, and the $k$-means centroid computation amounts to solving a Difference of Convex (DC) program, which can be solved by the smooth DC algorithm (DCA), also called the Convex-ConCave Procedure [28]. Similarly, when $\alpha\in\{0,1\}$, we get a DC program since $I_\alpha^{f,g}$ writes as a difference of convex functions.
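As a closing illustration of Corollary 1 (a hedged sketch; the function name, the finite-support setting, and the restriction to $\alpha\in(0,1)$ plus the two closed-form limits are my own choices), the $(r,s)$-power $\alpha$-divergences can be checked numerically:

```python
import numpy as np

def power_alpha_divergence(p, q, r, s, alpha):
    """(r, s)-power alpha-divergence of Corollary 1 (r > s, r and s nonzero),
    restricted here to alpha in (0, 1) plus the closed-form limit cases."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if alpha == 1:
        return np.sum((q ** r - p ** r) / (r * p ** (r - 1))
                      - (q ** s - p ** s) / (s * p ** (s - 1)))
    if alpha == 0:
        return power_alpha_divergence(q, p, r, s, 1)
    Pr = (alpha * p ** r + (1 - alpha) * q ** r) ** (1.0 / r)  # weighted power mean P_r
    Ps = (alpha * p ** s + (1 - alpha) * q ** s) ** (1.0 / s)  # weighted power mean P_s
    return np.sum(Pr - Ps) / (alpha * (1 - alpha))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.3, 0.4])
print(power_alpha_divergence(p, q, 2, 1, 1 - 1e-7))  # tends to the alpha = 1 case
print(power_alpha_divergence(p, q, 2, 1, 1))
```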