Abstract
The family of α-divergences, which includes the oriented forward and reverse Kullback–Leibler divergences, is often used in signal processing, pattern recognition, and machine learning, among other fields. A suitable α-divergence can either be chosen beforehand according to prior knowledge of the application domain or learned directly from data sets. In this work, we generalize the α-divergences using a pair of strictly comparable weighted means. Our generalization yields in the limit case α → 1 the 1-divergence, which provides a generalization of the forward Kullback–Leibler divergence, and in the limit case α → 0 the 0-divergence, which corresponds to a generalization of the reverse Kullback–Leibler divergence. We then analyze the condition for a pair of weighted quasi-arithmetic means to be strictly comparable and describe the family of quasi-arithmetic α-divergences, including its subfamily of power homogeneous α-divergences. In particular, we study the generalized quasi-arithmetic 1-divergences and 0-divergences and show that these counterpart generalizations of the oriented Kullback–Leibler divergences can be rewritten as equivalent conformal Bregman divergences using strictly monotone embeddings. Finally, we discuss the application of these novel divergences to k-means clustering by studying the robustness property of the centroids.
1. Introduction
1.1. Statistical Divergences and α-Divergences
Consider a measurable space (𝒳, 𝓕) [1], where 𝓕 denotes a σ-algebra and 𝒳 the sample space, and let μ denote a positive measure on (𝒳, 𝓕), usually chosen as the Lebesgue measure or the counting measure. The notion of statistical dissimilarities [2,3,4] between two distributions P and Q is at the core of many algorithms in signal processing, pattern recognition, information fusion, data analysis, and machine learning, among others. A dissimilarity D may be oriented, i.e., asymmetric: D(P : Q) ≠ D(Q : P), where the colon mark ":" between the arguments emphasizes this asymmetry, by analogy with the non-commutativity of the division operation. When the probability measures P and Q are dominated by a measure μ (e.g., one can always choose μ = (P + Q)/2), we consider their Radon–Nikodym (RN) densities p = dP/dμ and q = dQ/dμ with respect to μ, and define D(P : Q) as D(p : q). A good dissimilarity measure shall be invariant to the chosen dominating measure so that we can write D(P : Q) = D(p : q) [5]. When those statistical dissimilarities are smooth, they are called divergences [6] in information geometry, as they induce a dualistic geometric structure [7].
The most renowned statistical divergence rooted in information theory [8] is the Kullback–Leibler divergence (KLD, also called relative entropy):

  KL_μ(p : q) = ∫ p(x) log( p(x)/q(x) ) dμ(x).

Since the KLD is independent of the reference measure μ, i.e., KL_μ(p : q) = KL_ν(p′ : q′) for p = dP/dμ, q = dQ/dμ and p′ = dP/dν, q′ = dQ/dν the RN derivatives with respect to another dominating positive measure ν, we write concisely KL(p : q) in the remainder instead of KL_μ(p : q).
The KLD belongs to a parametric family of α-divergences [9] defined for α ∈ ℝ∖{0, 1} by:

  I_α(p : q) = (1/(α(1−α))) ( 1 − ∫ p(x)^α q(x)^{1−α} dμ(x) ),

with I_1(p : q) := lim_{α→1} I_α(p : q) = KL(p : q) and I_0(p : q) := lim_{α→0} I_α(p : q) = KL(q : p). The α-divergences extended to positive densities [10] (not necessarily normalized densities) play a central role in information geometry [6]:

  D_α(p : q) = (1/(α(1−α))) ∫ ( α p(x) + (1−α) q(x) − p(x)^α q(x)^{1−α} ) dμ(x),   α ∈ ℝ∖{0, 1},

with D_1(p : q) = KL⁺(p : q) and D_0(p : q) = KL⁺(q : p), where KL⁺ denotes the Kullback–Leibler divergence extended to positive measures:

  KL⁺(p : q) = ∫ ( p(x) log( p(x)/q(x) ) + q(x) − p(x) ) dμ(x).
The α-divergences are asymmetric for α ≠ 1/2 (i.e., D_α(p : q) ≠ D_α(q : p) in general for α ≠ 1/2) but exhibit the following reference duality [11]:

  D_α(q : p) = D_{1−α}(p : q),

where we denote by D*(p : q) := D(q : p) the reverse divergence of an arbitrary divergence D (e.g., KL*(p : q) := KL(q : p) is the reverse Kullback–Leibler divergence). The α-divergences have been used extensively in many applications [12], and the parameter α may not necessarily be fixed beforehand but can also be learned from data sets in applications [13,14]. When α = 1/2, the α-divergence is symmetric and called the squared Hellinger divergence [15]:

  D_{1/2}(p : q) = 2 ∫ ( √p(x) − √q(x) )² dμ(x).
The α-divergences belong to the family of Ali–Silvey–Csiszár f-divergences [16,17], which are defined for a convex function f satisfying f(1) = 0 and strictly convex at 1:

  I_f(p : q) = ∫ p(x) f( q(x)/p(x) ) dμ(x).

We have D_α(p : q) = I_{f_α}(p : q) with the following class of f-generators:

  f_α(u) = ( α + (1−α) u − u^{1−α} ) / ( α(1−α) ),   α ∈ ℝ∖{0, 1}.
In information geometry, α-divergences and more generally f-divergences are called invariant divergences [6], since they are provably the only statistical divergences which are invariant under invertible smooth transformations of the sample space. That is, let y = t(x) be a smooth invertible transformation and let 𝒴 = t(𝒳) denote the transformed sample space. Denote by p_Y and q_Y the densities with respect to y corresponding to p and q, respectively. Then, we have I_f(p : q) = I_f(p_Y : q_Y) [18]. The dualistic information-geometric structures induced by these invariant f-divergences between densities of a same parametric family of statistical models yield the Fisher information metric and the dual ±α-connections, see [6] for details. It is customary to rewrite the α-divergences in information geometry using the rescaled parameter α′ = 1 − 2α (i.e., α = (1−α′)/2). Thus, the extended α-divergence in information geometry is defined as follows:

  D^{(α′)}(p : q) = (4/(1−α′²)) ∫ ( ((1−α′)/2) p(x) + ((1+α′)/2) q(x) − p(x)^{(1−α′)/2} q(x)^{(1+α′)/2} ) dμ(x),

and the reference duality is expressed by D^{(−α′)}(p : q) = D^{(α′)}(q : p).
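As a concrete numerical illustration of the definitions above, the following Python sketch (ours, not part of the original article; it assumes the integrand convention αp + (1−α)q − p^α q^{1−α} for the extended α-divergence) evaluates the extended α-divergence between two positive arrays and checks that the limits α → 1 and α → 0 recover the forward and reverse extended KLDs.

import numpy as np

def extended_alpha_divergence(p, q, alpha):
    """D_alpha(p:q) = 1/(alpha*(1-alpha)) * sum(alpha*p + (1-alpha)*q - p**alpha * q**(1-alpha))."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(alpha * p + (1 - alpha) * q - p**alpha * q**(1 - alpha)) / (alpha * (1 - alpha))

def extended_kl(p, q):
    """KL+(p:q) = sum(p*log(p/q) + q - p), the KLD extended to positive arrays."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q) + q - p)

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.4, 0.3])
for alpha in (0.999, 0.5, 0.001):
    print(alpha, extended_alpha_divergence(p, q, alpha))
print("KL+(p:q) =", extended_kl(p, q))   # ~ D_alpha for alpha close to 1
print("KL+(q:p) =", extended_kl(q, p))   # ~ D_alpha for alpha close to 0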
A statistical divergence D, when evaluated on densities belonging to a given parametric family of densities 𝒫 = { p_θ : θ ∈ Θ }, is equivalent to a corresponding contrast function [7]:

  D_𝒫(θ₁ : θ₂) := D(p_{θ₁} : p_{θ₂}).
Remark 1.
Although quite confusing, those contrast functions [7] have also been called divergences in the literature [6]. Any smooth parameter divergence (contrast function [7]) induces a dualistic structure in information geometry [6]. For example, the KLD on the family of probability mass functions defined on a finite alphabet is equivalent to a Bregman divergence, and thus induces a dually flat space [6]. More generally, the α-divergences on the probability simplex induce the α-geometry in information geometry [6].
We refer the reader to [3] for a richly annotated bibliography of many common statistical divergences investigated in signal processing and statistics. Building and studying novel statistical/parameter divergences from first principles is an active research area. For example, Li [19,20] recently introduced some new divergence functionals based on the framework of transport information geometry [21], which considers information entropy functionals in Wasserstein spaces. Li defined (i) the transport information Hessian distances [20] between univariate densities supported on a compact interval, which are symmetric distances satisfying the triangle inequality, and obtained the counterpart of the Hellinger distance on the L²-Wasserstein space by choosing the Shannon information entropy, and (ii) asymmetric transport Bregman divergences (including the transport Kullback–Leibler divergence) between densities defined on a multivariate compact smooth support [19].
The α-divergences are widely used in information sciences; see [22,23,24,25,26,27] just to cite a few applications. The singly parametric α-divergences have also been generalized to biparametric families of divergences such as the (α, β)-divergences [6] or the αβ-divergences [28].
In this work, based on the observation that the integrand of the extended α-divergence of Equation (4) for α ∈ (0, 1) is the gap between a weighted arithmetic mean A_α(p, q) and a weighted geometric mean G_α(p, q), we investigate a generalization of the α-divergences with respect to a generic pair (M_α, N_α) of strictly comparable weighted means [29]. In particular, we consider the class of quasi-arithmetic weighted means [30], analyze the condition for two quasi-arithmetic means to be strictly comparable, and report their induced α-divergences with limit KL-type divergences when α → 1 and α → 0.
1.2. Divergences and Decomposable Divergences
A statistical divergence D shall satisfy the following two basic axioms:
D1 (Non-negativity). D(p : q) ≥ 0 for all densities p and q;
D2 (Identity of indiscernibles). D(p : q) = 0 if and only if p(x) = q(x) μ-almost everywhere.
These axioms are a subset of the metric axioms, since we do not consider the symmetry axiom nor the triangle inequality axiom of metric distances. See [31,32] for some common examples of probability metrics (e.g., total variation distance or Wasserstein metrics).
A divergence D is said to be decomposable [6] when it can be written as a definite integral of a scalar divergence d:

  D(p : q) = ∫ d( p(x) : q(x) ) dμ(x),

or D(p : q) = ∫ d(p : q) dμ for short, where d(a : b) is a scalar divergence between a = p(x) and b = q(x) (hence a one-dimensional parameter divergence).
The α-divergences are decomposable divergences since we have

  D_α(p : q) = ∫ d_α( p(x) : q(x) ) dμ(x)

with the following scalar α-divergence:

  d_α(a : b) = (1/(α(1−α))) ( α a + (1−α) b − a^α b^{1−α} ).
1.3. Contributions and Paper Outline
The outline of the paper and its main contributions are summarized as follows:
We first define for two families of strictly comparable weighted means (Definition 1) their generic induced (M, N) α-divergences in Section 2 (Definition 2). Then, Section 2.2 reports a closed-form formula (Theorem 3) for the quasi-arithmetic α-divergences induced by two strictly comparable quasi-arithmetic means with monotonically increasing generators f and g such that f∘g⁻¹ is strictly convex and differentiable (Theorem 1). In Section 2.3, we study the divergences D_1 and D_0 obtained in the limit cases α → 1 and α → 0, respectively (Theorem 2). We obtain generalized counterparts of the Kullback–Leibler divergence when α → 1 and generalized counterparts of the reverse Kullback–Leibler divergence when α → 0. Moreover, these generalized KLDs can be rewritten as generalized cross-entropies minus entropies. In Section 2.4, we show how to express these generalized 1-divergences and 0-divergences as conformal Bregman representational divergences, and briefly explain their induced conformally flat statistical manifolds (Theorem 4). Section 3 introduces the subfamily of bipower homogeneous α-divergences, which belong to the family of Ali–Silvey–Csiszár f-divergences [16,17]. In Section 4, we consider k-means clustering [33] and k-means++ seeding [34] for the generic class of extended α-divergences: we first study the robustness of quasi-arithmetic means in Section 4.1 and then the robustness of the newly introduced class of generalized Kullback–Leibler centroids in Section 4.2. Finally, Section 5 summarizes the results obtained in this work and discusses perspectives for future research.
2. The α-Divergences Induced by a Pair of Strictly Comparable Weighted Means
2.1. The (M, N) α-Divergences
The point of departure for generalizing the α-divergences is to rewrite the integrand of Equation (4) for α ∈ (0, 1) as

  α p + (1−α) q − p^α q^{1−α} = A_α(p, q) − G_α(p, q),

where A_α(x, y) and G_α(x, y) for α ∈ (0, 1) stand for the weighted arithmetic mean and the weighted geometric mean, respectively:

  A_α(x, y) = α x + (1−α) y,   G_α(x, y) = x^α y^{1−α}.

For a weighted mean M_α, we choose the (geometric) convention M_1(x, y) = x and M_0(x, y) = y so that M_α smoothly interpolates between x (α = 1) and y (α = 0). For the converse convention, we simply define M′_α(x, y) := M_{1−α}(x, y) and get the conventional definition of weighted means.
In general, a mean M(x, y) aggregates two values x and y of an interval I ⊂ ℝ to produce an intermediate quantity which satisfies the innerness property [35,36]:

  min(x, y) ≤ M(x, y) ≤ max(x, y).

This in-betweenness property of means (Equation (17)) was postulated by Cauchy [37] in 1821. A mean is said to be strict if the inequalities of Equation (17) are strict whenever x ≠ y. A mean M is said to be reflexive iff M(x, x) = x for all x. The reflexivity property of means was postulated by Chisini [38] in 1929.
In the remainder, we consider I = (0, ∞). By using the unique dyadic representation of any real α ∈ (0, 1) (i.e., α = Σ_{i≥1} d_i 2^{−i} with the binary digit expansion d_i ∈ {0, 1} of α), one can build a weighted mean M_α from any given mean M; see [29] for such a construction.
In the remainder, we drop the "+" superscript notation used for divergences extended to positive measures, keeping in mind that the divergences are defined between positive densities. By analogy to the α-divergences, let us define the (decomposable) (M, N) α-divergences between two positive densities p and q for a pair of weighted means M_α and N_α and for α ∈ (0, 1) as

  D_α^{M,N}(p : q) := (1/(α(1−α))) ∫ ( M_α(p(x), q(x)) − N_α(p(x), q(x)) ) dμ(x).

The ordinary α-divergences are recovered as the (A, G) α-divergences:

  D_α(p : q) = D_α^{A,G}(p : q).
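The following small Python check (ours; it assumes the weighting convention A_α(x, y) = αx + (1−α)y and G_α(x, y) = x^α y^{1−α}) verifies pointwise that the extended α-divergence integrand is exactly the nonnegative gap A_α − G_α, which is the observation driving the generalization.

import numpy as np

def A(x, y, alpha):  # weighted arithmetic mean, geometric convention A_1 = x, A_0 = y
    return alpha * x + (1 - alpha) * y

def G(x, y, alpha):  # weighted geometric mean, G_1 = x, G_0 = y
    return x**alpha * y**(1 - alpha)

rng = np.random.default_rng(0)
x, y = rng.uniform(0.1, 5.0, 1000), rng.uniform(0.1, 5.0, 1000)
for alpha in (0.25, 0.5, 0.75):
    gap = A(x, y, alpha) - G(x, y, alpha)
    integrand = alpha * x + (1 - alpha) * y - x**alpha * y**(1 - alpha)
    assert np.allclose(gap, integrand) and np.all(gap >= 0)  # weighted AM-GM: A_alpha >= G_alpha
print("A_alpha - G_alpha matches the alpha-divergence integrand and is nonnegative.")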
In order to define generalized α-divergences satisfying axioms D1 and D2 of proper divergences, we need to characterize the class of acceptable means. We give a definition strengthening the notion of comparable means in [29]:
Definition 1
(Strictly comparable weighted means). A pair (M, N) of means is said to be strictly comparable whenever M_α(x, y) ≥ N_α(x, y) for all x, y > 0 with equality if and only if x = y, and for all α ∈ (0, 1).
Example 1.
For example, the inequality of the arithmetic and geometric means states that A(x, y) ≥ G(x, y), which implies that the means A and G are comparable, denoted by A ≥ G. Furthermore, the weighted arithmetic and geometric means are distinct whenever x ≠ y. Indeed, consider the equation A_α(x, y) = G_α(x, y) for α ∈ (0, 1) and x ≠ y. By taking the logarithm on both sides, we get

  log( α x + (1−α) y ) = α log x + (1−α) log y.

Since the logarithm is a strictly concave function, the only solution is x = y. Thus, (A, G) is a pair of strictly comparable weighted means.
For a weighted mean M_α, define I_α^M(p : q) := ∫ M_α(p(x), q(x)) dμ(x). We are ready to state the definition of generalized α-divergences:
Definition 2
((M, N) α-divergences). The (M, N) α-divergence between two positive densities p and q for α ∈ (0, 1) is defined for a pair of strictly comparable weighted means M_α and N_α with M_α ≥ N_α by:

  D_α^{M,N}(p : q) := (1/(α(1−α))) ∫ ( M_α(p(x), q(x)) − N_α(p(x), q(x)) ) dμ(x).

Using I_α^M, we can rewrite this (M, N) α-divergence as

  D_α^{M,N}(p : q) = (1/(α(1−α))) ( I_α^M(p : q) − I_α^N(p : q) ).

It is important to check the conditions on the weighted means M_α and N_α which ensure the law of the indiscernibles of a divergence D_α^{M,N}, namely, D_α^{M,N}(p : q) = 0 iff p = q μ-almost everywhere. This condition rewrites as M_α(p(x), q(x)) = N_α(p(x), q(x)) if and only if p(x) = q(x) μ-almost everywhere. A sufficient condition is to ensure that M_α(x, y) > N_α(x, y) whenever x ≠ y. In particular, this condition holds if M_α and N_α are strictly comparable weighted means.
Instead of taking the difference between two weighted means, we may also measure the gap logarithmically, and thus define the following family of log (M, N) α-divergences:
Definition 3
(Log (M, N) α-divergence). The log (M, N) α-divergence between two positive densities p and q for α ∈ (0, 1) is defined for a pair of strictly comparable weighted means M_α and N_α with M_α ≥ N_α by:

  L_α^{M,N}(p : q) := (1/(α(1−α))) ∫ log( M_α(p(x), q(x)) / N_α(p(x), q(x)) ) dμ(x).

Note that this definition is different from the skewed Bhattacharyya-type distance [39,40], which rather measures the logarithm of the ratio of the integrals:

  B_α^{M,N}(p : q) := (1/(α(1−α))) log( I_α^M(p : q) / I_α^N(p : q) ).

The ordinary α-skewed Bhattacharyya distance [39] is recovered when N_α = G_α (the weighted geometric mean) and M_α = A_α (the weighted arithmetic mean) since I_α^A(p : q) = ∫ ( α p + (1−α) q ) dμ = 1 for normalized densities. The Bhattacharyya-type divergences were introduced in [41] in order to upper bound the probability of error in Bayesian hypothesis testing.
A weighted mean M_α is said to be symmetric if and only if M_α(x, y) = M_{1−α}(y, x). When both the weighted means M and N are symmetric, we have the following reference duality [11]:

  D_α^{M,N}(q : p) = D_{1−α}^{M,N}(p : q).

We consider symmetric weighted means in the remainder.
In the limit cases α → 0 and α → 1, we define the 0-divergence D_0^{M,N} and the 1-divergence D_1^{M,N}, respectively, by

  D_0^{M,N}(p : q) := lim_{α→0} D_α^{M,N}(p : q),   D_1^{M,N}(p : q) := lim_{α→1} D_α^{M,N}(p : q),

provided that those limits exist.
Notice that the ordinary α-divergences are defined for any α ∈ ℝ, but our generic quasi-arithmetic α-divergences are defined in general on (0, 1). However, when the weighted means M_α and N_α admit weighted extrapolations for α ∉ [0, 1] (e.g., the arithmetic mean A_α or the geometric mean G_α), the quasi-arithmetic α-divergences can be extended beyond the unit interval. Furthermore, when the limits of the quasi-arithmetic α-divergences exist for α ∈ {0, 1}, the quasi-arithmetic α-divergences may be defined on the full range α ∈ ℝ. To demonstrate the restricted range α ∈ (0, 1), consider the weighted harmonic mean H_α(x, y) for x, y > 0 with α ∈ ℝ:

  H_α(x, y) = 1 / ( α/x + (1−α)/y ) = x y / ( α y + (1−α) x ).

Clearly, the denominator may become zero when α ∉ [0, 1] and even possibly negative. Thus, to avoid this issue, we restrict the range of α to (0, 1) for defining the quasi-arithmetic α-divergences.
2.2. The Quasi-Arithmetic α-Divergences
A quasi-arithmetic mean (QAM) is defined for a continuous and strictly monotonic function f as:

  M^f(x, y) := f⁻¹( ( f(x) + f(y) ) / 2 ).

Function f is called the generator of the quasi-arithmetic mean. These strict and reflexive quasi-arithmetic means are also called Kolmogorov means [30], Nagumo means [42], de Finetti means [43], or quasi-linear means [44] in the literature. These means are called quasi-arithmetic means because they can be interpreted as arithmetic means on the representations f(x) and f(y) of the arguments:

  f( M^f(x, y) ) = ( f(x) + f(y) ) / 2 = A( f(x), f(y) ).

QAMs are strict, reflexive, and symmetric means.
Without loss of generality, we may assume strictly increasing functions f instead of strictly monotonic functions since M^{−f} = M^f. Indeed, (−f)⁻¹(u) = f⁻¹(−u), so that M^{−f}(x, y) = f⁻¹( ( f(x) + f(y) ) / 2 ) = M^f(x, y). Notice that the composition of two strictly monotonically increasing functions is a strictly monotonically increasing function. Furthermore, we consider generators defined on (0, ∞) in the remainder since we apply these means to positive densities. Two quasi-arithmetic means M^f and M^g coincide if and only if g(u) = a f(u) + b for some a ≠ 0 and b ∈ ℝ, see [44]. The quasi-arithmetic means were considered in the axiomatization of the entropies by Rényi to define the α-entropies (see Equation (2.11) of [45]).
By choosing f(u) = u, f(u) = log u, or f(u) = 1/u, we obtain the Pythagorean arithmetic A, geometric G, and harmonic H means, respectively:
- the arithmetic mean (A): A(x, y) = (x + y)/2,
- the geometric mean (G): G(x, y) = √(x y), and
- the harmonic mean (H): H(x, y) = 2 x y / (x + y).
More generally, choosing f_r(u) = u^r for r ≠ 0, we obtain the parametric family of power means, also called Hölder means [46] or binary means [47]:

  P_r(x, y) = ( ( x^r + y^r ) / 2 )^{1/r},   r ≠ 0.

In order to get a smooth family of power means, we define the geometric mean as the limit case r → 0 of the power means P_r:

  P_0(x, y) := lim_{r→0} P_r(x, y) = √(x y) = G(x, y).

A mean M is positively homogeneous if and only if M(λx, λy) = λ M(x, y) for any λ > 0. It is known that the only positively homogeneous quasi-arithmetic means coincide exactly with the family of power means [44]. The weighted QAMs are given by

  M_α^f(x, y) = f⁻¹( α f(x) + (1−α) f(y) ),   α ∈ [0, 1].
Let us remark that QAMs were generalized to complex-valued generators in [48] and to probability measures defined on a compact support in [49].
Notice that there exist other positively homogeneous means which are not quasi-arithmetic means. For example, the logarithmic mean [50,51] for x > 0 and y > 0,

  L(x, y) = ( y − x ) / ( log y − log x ) for x ≠ y, with L(x, x) = x,

is an example of a homogeneous mean (i.e., L(λx, λy) = λ L(x, y) for any λ > 0) that is not a QAM. Besides the family of QAMs, there exist many other families of means [35]. For example, let us mention the Lagrangian means [52], which intersect with the QAMs only at the arithmetic mean, or a generalization of the QAMs called the Bajraktarević means [53].
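The following Python sketch (ours; generator and function names are illustrative) implements weighted quasi-arithmetic means and weighted power means, and checks the positive homogeneity of the power means mentioned above.

import numpy as np

def qam(x, y, alpha, f, finv):
    """Weighted quasi-arithmetic mean M_alpha^f(x, y) = f^{-1}(alpha*f(x) + (1-alpha)*f(y))."""
    return finv(alpha * f(x) + (1 - alpha) * f(y))

def power_mean(x, y, alpha, r):
    """Weighted power mean P_{r,alpha}; the r = 0 case is the weighted geometric mean limit."""
    if r == 0:
        return x**alpha * y**(1 - alpha)
    return (alpha * x**r + (1 - alpha) * y**r) ** (1.0 / r)

x, y, alpha = 2.0, 8.0, 0.3
print(qam(x, y, alpha, np.log, np.exp))   # geometric mean via the generator f = log
print(power_mean(x, y, alpha, 0))         # same value via the power-mean limit
print(power_mean(x, y, alpha, 1))         # weighted arithmetic mean
print(power_mean(x, y, alpha, -1))        # weighted harmonic mean
lam = 10.0                                # homogeneity: P_r(lam*x, lam*y) = lam * P_r(x, y)
print(np.isclose(power_mean(lam * x, lam * y, alpha, -1), lam * power_mean(x, y, alpha, -1)))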
Let us now strengthen a recent theorem (Theorem 1 of [54], 2010):
Theorem 1
(Strictly comparable weighted QAMs). The pair (M^f, M^g) of quasi-arithmetic means obtained for two strictly increasing generators f and g is strictly comparable provided that the function f∘g⁻¹ is strictly convex, where ∘ denotes the function composition.
Proof.
Since f∘g⁻¹ is strictly convex, it is convex, and therefore it follows from Theorem 1 of [54] that M_α^f(x, y) ≥ M_α^g(x, y) for all x, y > 0 and α ∈ [0, 1]. Thus, the very nice property of QAMs is that M^f ≥ M^g implies that M_α^f ≥ M_α^g for any α ∈ [0, 1]. Now, let us consider the equation M_α^f(x, y) = M_α^g(x, y) for x ≠ y and α ∈ (0, 1):

  f⁻¹( α f(x) + (1−α) f(y) ) = g⁻¹( α g(x) + (1−α) g(y) ).

Since f∘g⁻¹ is assumed strictly convex, and g is strictly increasing, we have (f∘g⁻¹)( α g(x) + (1−α) g(y) ) < α f(x) + (1−α) f(y) for x ≠ y, and we reach the following contradiction:

  α f(x) + (1−α) f(y) = f( M_α^g(x, y) ) = (f∘g⁻¹)( α g(x) + (1−α) g(y) ) < α f(x) + (1−α) f(y).

Thus, M_α^f(x, y) > M_α^g(x, y) for x ≠ y and α ∈ (0, 1), and the pair (M^f, M^g) is strictly comparable. □
Thus, we can define the quasi-arithmetic α-divergences as follows:
Definition 4
(Quasi-arithmetic α-divergences). The quasi-arithmetic α-divergences between two positive densities p and q for α ∈ (0, 1) are defined for two strictly increasing and differentiable functions f and g such that f∘g⁻¹ is strictly convex by:

  D_α^{f,g}(p : q) := (1/(α(1−α))) ∫ ( M_α^f(p(x), q(x)) − M_α^g(p(x), q(x)) ) dμ(x),

where M_α^f and M_α^g are the weighted quasi-arithmetic means induced by f and g, respectively.
We have the following corollary:
Corollary 1
(Proper quasi-arithmetic α-divergences). Let (M^f, M^g) be a pair of quasi-arithmetic means with f∘g⁻¹ strictly convex. Then, the α-divergences D_α^{f,g} are proper divergences for α ∈ (0, 1).
Proof.
Consider p and q which differ on a subset of positive μ-measure. Since f∘g⁻¹ is strictly convex, we have M_α^f(p(x), q(x)) ≥ M_α^g(p(x), q(x)) with strict inequality when p(x) ≠ q(x). Thus, D_α^{f,g}(p : q) ≥ 0, and D_α^{f,g}(p : q) > 0 whenever p ≠ q on a subset of positive μ-measure. Therefore, the quasi-arithmetic α-divergences satisfy the law of the indiscernibles for α ∈ (0, 1). □
Note that the (A, G) α-divergences (i.e., the ordinary α-divergences) are proper divergences satisfying both the properties D1 and D2 because f(u) = u and g(u) = log u, and hence (f∘g⁻¹)(v) = exp(v) is strictly convex on ℝ.
Let us denote by D_α^{f,g} the quasi-arithmetic α-divergences. Since the QAMs are symmetric means, we have the reference duality D_α^{f,g}(q : p) = D_{1−α}^{f,g}(p : q).
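To make Definition 4 concrete, here is a short Python sketch (ours; it assumes the convention M_α^f(x, y) = f⁻¹(αf(x) + (1−α)f(y)) and the dominance condition that f∘g⁻¹ be strictly convex) evaluating quasi-arithmetic α-divergences between discrete positive arrays for two generator pairs.

import numpy as np

def qam(x, y, alpha, f, finv):
    return finv(alpha * f(x) + (1 - alpha) * f(y))

def qa_alpha_divergence(p, q, alpha, f, finv, g, ginv):
    """D_alpha^{f,g}(p:q) = (1/(alpha*(1-alpha))) * sum(M_alpha^f(p,q) - M_alpha^g(p,q))."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    gap = qam(p, q, alpha, f, finv) - qam(p, q, alpha, g, ginv)
    return np.sum(gap) / (alpha * (1 - alpha))

p, q = np.array([0.2, 0.5, 0.3]), np.array([0.3, 0.4, 0.3])
ident = (lambda t: t, lambda t: t)
# (f, g) = (id, log): the ordinary (A, G) alpha-divergence
print(qa_alpha_divergence(p, q, 0.5, *ident, np.log, np.exp))
# (f, g) = (id, sqrt): arithmetic vs. power mean P_{1/2}; (f o g^{-1})(u) = u^2 is strictly convex
print(qa_alpha_divergence(p, q, 0.5, *ident, np.sqrt, np.square))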
Remark 2.
Let us notice that Zhang [55], in their study of divergences under monotone embeddings, also defined a family of related divergences (Equation (71) of [55]). However, Zhang did not study the limit case divergences obtained when α → 0 or α → 1.
2.3. Limit Cases of 1-Divergences and 0-Divergences
We seek a closed-form formula for the limit divergence D_1^{f,g}(p : q) := lim_{α→1} D_α^{f,g}(p : q).
Lemma 1.
A first-order Taylor approximation of the weighted quasi-arithmetic mean [56] for a strictly increasing and differentiable generator f when α → 1 yields

  M_α^f(x, y) = x + (1−α) ( f(y) − f(x) ) / f′(x) + o(1−α).
Proof.
By taking the first-order Taylor expansion of f⁻¹ at f(x) (i.e., the Taylor polynomial of order 1), we get:

  f⁻¹(u) = f⁻¹(f(x)) + ( u − f(x) ) (f⁻¹)′(f(x)) + o( u − f(x) ).

Using the property of the derivative of an inverse function,

  (f⁻¹)′(u) = 1 / f′( f⁻¹(u) ),

it follows that the first-order Taylor expansion of f⁻¹ is:

  f⁻¹(u) = x + ( u − f(x) ) / f′(x) + o( u − f(x) ).

Plugging u = α f(x) + (1−α) f(y), so that u − f(x) = (1−α)( f(y) − f(x) ), we get a first-order approximation of the weighted quasi-arithmetic mean when α → 1:

  M_α^f(x, y) = x + (1−α) ( f(y) − f(x) ) / f′(x) + o(1−α).
□
Let us introduce the following bivariate function:

  E_f(x, y) := ( f(y) − f(x) ) / f′(x).
Remark 3.
Notice that E_{−f} = E_f, which matches the fact that M^{−f} = M^f. That is, we may either consider a strictly increasing differentiable generator f, or equivalently a strictly decreasing differentiable generator −f.
Thus, we obtain closed-form formulas for the 1-divergence and the 0-divergence:
Theorem 2
(Quasi-arithmetic 1-divergence and reverse 1-divergence). The quasi-arithmetic 1-divergence induced by two strictly increasing and differentiable functions f and g such that f∘g⁻¹ is strictly convex is

  D_1^{f,g}(p : q) = ∫ ( E_f(p(x), q(x)) − E_g(p(x), q(x)) ) dμ(x) = ∫ ( ( f(q) − f(p) )/f′(p) − ( g(q) − g(p) )/g′(p) ) dμ.

Furthermore, we have D_0^{f,g}(p : q) = D_1^{f,g}(q : p), the reverse divergence.
Proof.
Let us prove that D_1^{f,g} is a proper divergence satisfying axioms D1 and D2. Note that a sufficient condition for D_1^{f,g}(p : q) ≥ 0 is to check that

  E_f(x, y) ≥ E_g(x, y) for all x, y > 0, with equality if and only if x = y.

If p = q μ-almost everywhere, then clearly D_1^{f,g}(p : q) = 0. Consider p ≠ q (i.e., at some observation x: p(x) ≠ q(x)).
We use the following property of a strictly convex and differentiable function h for a < b < c (sometimes called the chordal slope lemma, see [29]):

  ( h(b) − h(a) ) / (b − a) < ( h(c) − h(a) ) / (c − a) < ( h(c) − h(b) ) / (c − b).

We consider h = f∘g⁻¹ so that f = h∘g. There are two cases to consider:
- Case p(x) < q(x), and therefore u := g(p(x)) < v := g(q(x)). Applying Equation (57) with b → a = u and c = v yields the gradient inequality h(v) − h(u) > h′(u)(v − u). Since f′(p(x)) = h′(u) g′(p(x)) with h′(u) > 0 and g′(p(x)) > 0, we get
  E_f(p(x), q(x)) = ( h(v) − h(u) ) / ( h′(u) g′(p(x)) ) > ( v − u ) / g′(p(x)) = E_g(p(x), q(x)).
- Case p(x) > q(x), and therefore u = g(p(x)) > v = g(q(x)). Then the double inequality of Equation (57) applied to v < u yields h(u) − h(v) < h′(u)(u − v), that is, h(v) − h(u) > h′(u)(v − u) again, and we conclude as above since h′(u) g′(p(x)) > 0.
Thus, in both cases, we checked that E_f(p(x), q(x)) > E_g(p(x), q(x)) whenever p(x) ≠ q(x). Therefore, D_1^{f,g}(p : q) ≥ 0, and since the QAMs are distinct, D_1^{f,g}(p : q) = 0 iff p = q μ-a.e. □
We can interpret the divergences D_1^{f,g} as generalized KL divergences and define generalized notions of cross-entropies and entropies. Since the KL divergence can be written as the cross-entropy minus the entropy, we can also decompose the divergences D_1^{f,g} as follows:

  D_1^{f,g}(p : q) = h^{f,g}(p : q) − h^{f,g}(p),

where h^{f,g}(p : q) denotes the (f, g)-cross-entropy (for a constant c):

  h^{f,g}(p : q) := ∫ ( f(q(x))/f′(p(x)) − g(q(x))/g′(p(x)) ) dμ(x) + c,

and h^{f,g}(p) := h^{f,g}(p : p) stands for the (f, g)-entropy (self cross-entropy):

  h^{f,g}(p) := ∫ ( f(p(x))/f′(p(x)) − g(p(x))/g′(p(x)) ) dμ(x) + c.

Notice that we recover the Shannon entropy for f(u) = u and g(u) = log u with f∘g⁻¹ = exp (strictly convex) and the constant c = −∫ p(x) dμ(x) chosen to annihilate the term ∫ p(x) dμ(x):

  h^{id,log}(p) = ∫ ( p(x) − p(x) log p(x) ) dμ(x) + c = −∫ p(x) log p(x) dμ(x).

We define the generalized (f, g)-Kullback–Leibler divergences or generalized (f, g)-relative entropies:

  KL^{f,g}(p : q) := D_1^{f,g}(p : q) = h^{f,g}(p : q) − h^{f,g}(p).

When f(u) = u and g(u) = log u, we resolve the constant to c = −∫ p(x) dμ(x), and recover the ordinary Shannon cross-entropy and entropy:

  h^{id,log}(p : q) = −∫ p(x) log q(x) dμ(x) + ∫ ( q(x) − p(x) ) dμ(x),   h^{id,log}(p) = −∫ p(x) log p(x) dμ(x),

and we have the (id, log)-Kullback–Leibler divergence that is the extended Kullback–Leibler divergence:

  KL^{id,log}(p : q) = KL⁺(p : q) = ∫ ( p(x) log( p(x)/q(x) ) + q(x) − p(x) ) dμ(x).

Thus, the (f, g)-cross-entropy and the (f, g)-entropy generalize the extended Shannon cross-entropy and entropy, respectively, up to an additive constant that cancels in the generalized relative entropy.
In general, we can define the (f, g)-Jeffreys divergence as the symmetrization:

  J^{f,g}(p ; q) := KL^{f,g}(p : q) + KL^{f,g}(q : p).
Thus, we summarize the quasi-arithmetic α-divergences as follows:
Theorem 3
(Quasi-arithmetic α-divergences). Let f and g be two continuous, strictly increasing, and differentiable functions on (0, ∞) such that f∘g⁻¹ is strictly convex. Then, the quasi-arithmetic α-divergence induced by (f, g) for α ∈ (0, 1) is

  D_α^{f,g}(p : q) = (1/(α(1−α))) ∫ ( f⁻¹( α f(p) + (1−α) f(q) ) − g⁻¹( α g(p) + (1−α) g(q) ) ) dμ,

and in the limit cases α → 1 and α → 0, we get

  D_1^{f,g}(p : q) = ∫ ( E_f(p, q) − E_g(p, q) ) dμ,   D_0^{f,g}(p : q) = D_1^{f,g}(q : p).

When f(u) = u (arithmetic mean) and g(u) = log u (geometric mean), these limit cases yield the Kullback–Leibler divergence (KLD) extended to positive densities and the reverse extended KLD, respectively.
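As a numerical sanity check of the limit case (ours, under the same weighting convention M_α^f(x, y) = f⁻¹(αf(x) + (1−α)f(y)) as above), the quasi-arithmetic α-divergence evaluated at α close to 1 can be compared against the closed-form generalized KLD of Theorem 2:

import numpy as np

p, q = np.array([0.2, 0.5, 0.3]), np.array([0.3, 0.4, 0.3])

# Illustrative generators: f(u) = u^2 and g(u) = log(u); (f o g^{-1})(u) = exp(2u) is strictly convex.
f, finv, fprime = np.square, np.sqrt, lambda u: 2 * u
g, ginv, gprime = np.log, np.exp, lambda u: 1 / u

def qa_alpha_div(alpha):
    gap = finv(alpha * f(p) + (1 - alpha) * f(q)) - ginv(alpha * g(p) + (1 - alpha) * g(q))
    return np.sum(gap) / (alpha * (1 - alpha))

closed_form = np.sum((f(q) - f(p)) / fprime(p) - (g(q) - g(p)) / gprime(p))
print(qa_alpha_div(1 - 1e-5))  # approaches the closed-form value below
print(closed_form)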
Let 𝓕 denote the class of strictly increasing and differentiable real-valued univariate functions. An interesting question is to characterize the class of pairs of functions (f, g) ∈ 𝓕 × 𝓕 yielding the same quasi-arithmetic α-divergences. This involves solving integral-based functional equations [57].
We can rewrite the α-divergence D_α^{f,g} for α ∈ (0, 1) as

  D_α^{f,g}(p : q) = (1/(α(1−α))) ( I_α^f(p : q) − I_α^g(p : q) ),

where

  I_α^f(p : q) := ∫ M_α^f( p(x), q(x) ) dμ(x).
Zhang [11] (pp. 188–189) considered α-divergences induced by pairs of monotone embeddings (ρ, τ), and obtained for the pair inducing the weighted arithmetic and power means a formula which is in accordance with our generic formula of Equation (53). Notice that A_α(x, y) ≥ P_{r,α}(x, y) for r ≤ 1: the weighted arithmetic mean dominates the weighted power means when r ≤ 1. Furthermore, by imposing a homogeneity condition, Zhang [11] obtained a biparametric class of homogeneous divergences.
2.4. Generalized KL Divergences as Conformal Bregman Divergences on Monotone Embeddings
Let us rewrite the generalized KLDs KL^{f,g} as conformal Bregman representational divergences [58,59,60] as follows:
Theorem 4.
The generalized KLD divergences KL^{f,g} = D_1^{f,g} are conformal Bregman representational divergences

  KL^{f,g}(p : q) = ∫ ( 1/f′(p(x)) ) B_F( g(q(x)) : g(p(x)) ) dμ(x),

with the strictly convex and differentiable Bregman generator F := f∘g⁻¹ defining the scalar Bregman divergence [61] B_F(a : b) := F(a) − F(b) − (a − b) F′(b).
Proof.
For the Bregman strictly convex and differentiable generator F = f∘g⁻¹, we expand the following conformal divergence

  ( 1/f′(p) ) B_F( g(q) : g(p) ) = ( 1/f′(p) ) ( F(g(q)) − F(g(p)) − ( g(q) − g(p) ) F′(g(p)) ),

since F(g(q)) = f(q) and F(g(p)) = f(p). It follows that

  ( 1/f′(p) ) B_F( g(q) : g(p) ) = ( f(q) − f(p) )/f′(p) − ( g(q) − g(p) ) F′(g(p))/f′(p).

Hence, we easily check that this expression equals E_f(p, q) − E_g(p, q) since F′(g(p)) = f′(p)/g′(p) and therefore F′(g(p))/f′(p) = 1/g′(p). □
In general, for a functional generator f and a strictly monotonic representational function r (also called a monotone embedding [62] in information geometry), we can define the representational Bregman divergence [63] B_{f∘r⁻¹}( r(p) : r(q) ) provided that f∘r⁻¹ is a Bregman generator (i.e., strictly convex and differentiable).
The Itakura–Saito divergence [64] (IS) between two densities p and q is defined by:

  IS(p : q) := ∫ ( p(x)/q(x) − log( p(x)/q(x) ) − 1 ) dμ(x) = ∫ d_IS( p(x) : q(x) ) dμ(x),

where d_IS(a : b) := a/b − log(a/b) − 1 is the scalar IS divergence. This divergence was originally designed in sound processing for measuring the discrepancy between two speech power spectra. Observe that the IS divergence is invariant under rescaling: IS(λp : λq) = IS(p : q) for any λ > 0. The IS divergence is a Bregman divergence [61] obtained for the Burg information generator (i.e., the negative Burg entropy): F(u) = −log u with u > 0. It follows that we have

  IS(p : q) = ∫ B_F( p(x) : q(x) ) dμ(x).
The Itakura–Saito divergence may further be extended to a family of α-Itakura–Saito divergences (see Equation (10.45) of Theorem 10.1 in [6]).
In [56], a generalization of the Bregman divergences was obtained using the comparative convexity induced by two abstract means M and N to define (M, N)-Bregman divergences as limits of scaled (M, N)-Jensen divergences. The skew (M, N)-Jensen divergences are defined for α ∈ (0, 1) and a functional generator F that is (M, N)-convex (i.e., F(M_α(x, y)) ≤ N_α(F(x), F(y))) by:

  J_{F,α}^{M,N}(p : q) := N_α( F(p), F(q) ) − F( M_α(p, q) ),

where M_α and N_α are weighted means that should be regular [56] (i.e., homogeneous, symmetric, continuous, and increasing in each variable). Then, we can define the (M, N)-Bregman divergence as a limit of appropriately scaled skew (M, N)-Jensen divergences (see [56] for details).
The formula obtained in [56] for the quasi-arithmetic means M^ρ and M^τ and a functional generator F that is (M^ρ, M^τ)-convex turns out to be a conformal divergence [58] that can be expressed using the E terms introduced above.
A function F is (M^ρ, M^τ)-convex iff τ∘F∘ρ⁻¹ is (ordinarily) convex [56].
The information geometry induced by a Bregman divergence (or equivalently by its convex generator) is a dually flat space [6]. The dualistic structure induced by a conformal Bregman representational divergence is related to conformal flattening [59,60]. The notion of conformal structures was first introduced in information geometry by Okamoto et al. [65].
Following the work of Ohara [59,60,66], the Kurose geometric divergence [67] (a contrast function in affine differential geometry) between two distributions p and r of the d-dimensional probability simplex can be induced by a pair of strictly monotone smooth functions (Equation (28) in [59]). Affine immersions [67] can be interpreted as special embeddings.
Let D be a divergence (contrast function) and (g^D, ∇^D, ∇^{D*}) be the induced statistical manifold structure [7], where the metric g^D and the pair of dual affine connections (∇^D, ∇^{D*}) are obtained by Eguchi's construction from the mixed second- and third-order directional derivatives of D evaluated on the diagonal, the derivatives being taken along tangent vectors at s of vector fields on the manifold.
Consider a conformal divergence D̃(p : q) := k(q) D(p : q) for a positive function k(·), called the conformal factor. Then, the induced statistical manifold [6,7] (g̃, ∇̃, ∇̃*) is 1-conformally equivalent to (g^D, ∇^D, ∇^{D*}), and the metrics are conformally related, g̃ = k g^D. The dual affine connections are projectively equivalent [67] (and the structure is said to be 1-conformally flat).
Conformal flattening [59,60] consists of choosing the conformal factor k such that the transformed structure (g̃, ∇̃, ∇̃*) becomes a dually flat space [6] equipped with a canonical Bregman divergence.
Therefore, it follows that the statistical manifold induced by the 1-divergence D_1^{f,g} is a representational 1-conformally flat statistical manifold. Figure 1 gives an overview of the interplay of divergences with information-geometric structures. The logarithmic divergence [68] is defined for λ > 0 and a λ-exponentially concave generator G by:

  L_{G,λ}(θ₁ : θ₂) := (1/λ) log( 1 + λ ∇G(θ₂)ᵀ(θ₁ − θ₂) ) − G(θ₁) + G(θ₂).
Figure 1.
Interplay of divergences and their information-geometric structures: Bregman divergences are the canonical divergences of dually flat structures, and the λ-logarithmic divergences are the canonical divergences of 1-conformally flat statistical manifolds. When λ → 0, the logarithmic divergence L_{G,λ} tends to the Bregman divergence B_{−G}.
When λ → 0, we have L_{G,λ}(θ₁ : θ₂) → B_{−G}(θ₁ : θ₂), where B_F denotes the Bregman divergence [61] induced by a strictly convex and smooth function F:

  B_F(θ₁ : θ₂) := F(θ₁) − F(θ₂) − (θ₁ − θ₂)ᵀ ∇F(θ₂).
3. The Subfamily of Homogeneous (r, s)-Power α-Divergences for r > s
In particular, we can define the (r, s)-power α-divergences from two weighted power means P_{r,α} and P_{s,α} with r > s by using the family of increasing generators f_r(u) = (u^r − 1)/r for r ≠ 0 (with f_0(u) := log u). Indeed, we check that f_r∘f_s⁻¹ is strictly convex since (f_r∘f_s⁻¹)″(v) = (r − s)(1 + s v)^{r/s − 2} > 0 for r > s. Thus, P_{r,α} and P_{s,α} are two QAMs which are both comparable and distinct. Table 1 lists the expressions of the terms E_{f_r} obtained from the power mean generators f_r.
Table 1.
Expressions of the terms E_{f_r}(x, y) for the family of power means P_r, r ∈ ℝ.
We conclude with the definition of the (r, s)-power α-divergences:
Corollary 2
((r, s)-power α-divergences). Given r > s, the (r, s)-power α-divergences between two positive densities p and q are defined for α ∈ (0, 1) by

  D_α^{r,s}(p : q) := (1/(α(1−α))) ∫ ( P_{r,α}(p(x), q(x)) − P_{s,α}(p(x), q(x)) ) dμ(x),

where P_{r,α}(x, y) := ( α x^r + (1−α) y^r )^{1/r} denotes the weighted power mean (with P_{0,α}(x, y) := x^α y^{1−α}).
When α → 1, we get the following power 1-divergences for r > s:

  D_1^{r,s}(p : q) = ∫ ( E_{f_r}(p(x), q(x)) − E_{f_s}(p(x), q(x)) ) dμ(x).

When α → 0, we get the following power 0-divergences for r > s:

  D_0^{r,s}(p : q) = D_1^{r,s}(q : p).

In particular, we get a family of generalized forward Kullback–Leibler divergences (the power 1-divergences) and the corresponding family of generalized reverse Kullback–Leibler divergences (the power 0-divergences).
The (r, s)-power α-divergences for α ∈ (0, 1) yield homogeneous divergences: D_α^{r,s}(λp : λq) = λ D_α^{r,s}(p : q) for any λ > 0, because the power means are homogeneous: P_{r,α}(λx, λy) = λ P_{r,α}(x, y). Thus, the (r, s)-power α-divergences are Csiszár f-divergences [17]

  D_α^{r,s}(p : q) = ∫ p(x) f_α^{r,s}( q(x)/p(x) ) dμ(x)

for the generator

  f_α^{r,s}(u) = (1/(α(1−α))) ( ( α + (1−α) u^r )^{1/r} − ( α + (1−α) u^s )^{1/s} ).

Thus, the family of (r, s)-power α-divergences are homogeneous divergences: D_α^{r,s}(λp : λq) = λ D_α^{r,s}(p : q) for λ > 0.
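The homogeneity property can be checked numerically; the following Python sketch (ours; it assumes the weighted power mean P_{r,α}(x, y) = (αx^r + (1−α)y^r)^{1/r} with the geometric mean as the r = 0 limit) builds the (r, s)-power α-divergence and verifies D_α^{r,s}(λp : λq) = λ D_α^{r,s}(p : q).

import numpy as np

def power_mean(x, y, alpha, r):
    if r == 0:
        return x**alpha * y**(1 - alpha)
    return (alpha * x**r + (1 - alpha) * y**r) ** (1.0 / r)

def power_alpha_div(p, q, alpha, r, s):
    """(r,s)-power alpha-divergence: scaled gap between two weighted power means (r > s)."""
    gap = power_mean(p, q, alpha, r) - power_mean(p, q, alpha, s)
    return np.sum(gap) / (alpha * (1 - alpha))

p, q = np.array([0.2, 0.5, 0.3]), np.array([0.3, 0.4, 0.3])
alpha, r, s, lam = 0.3, 2.0, -1.0, 7.0
d1 = power_alpha_div(p, q, alpha, r, s)
d2 = power_alpha_div(lam * p, lam * q, alpha, r, s)
print(d1, d2, np.isclose(d2, lam * d1))  # homogeneity of degree 1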
4. Applications to Center-Based Clustering
Clustering is a class of unsupervised learning algorithms which partition a given d-dimensional point set X = {x₁, …, x_n} into k clusters such that data points falling into a same cluster tend to be more similar to each other than to data points belonging to different clusters. The celebrated k-means clustering [69] is a center-based method for clustering X into k clusters (with k ≤ n) by minimizing the following k-means objective function:

  E(C) = Σ_{i=1}^n min_{j ∈ {1,…,k}} ‖ x_i − c_j ‖²,

where the c_j's denote the cluster representatives. Let C = {c₁, …, c_k} denote the set of cluster centers. The cluster 𝒞_j is defined as the set of points of X closer to the cluster representative c_j than to any other c_l for l ≠ j:

  𝒞_j = { x ∈ X : ‖ x − c_j ‖ ≤ ‖ x − c_l ‖, ∀ l ≠ j }.

When k = 1, it can be shown that the centroid (center of mass) of the point set is the unique best cluster representative:

  argmin_c Σ_{i=1}^n ‖ x_i − c ‖² = (1/n) Σ_{i=1}^n x_i.

When k > 1 and d > 1, finding a best partition which minimizes the objective function of Equation (107) is NP-hard [70]. When d = 1, k-means clustering can be solved efficiently using dynamic programming [71] in subcubic time.
The k-means objective function can be generalized to any arbitrary (potentially asymmetric) divergence D by considering the following objective function:

  E_D(C) = Σ_{i=1}^n min_{j ∈ {1,…,k}} D(x_i : c_j).

Thus, when D(x : c) = ‖x − c‖², one recovers the ordinary k-means clustering [69]. When D is chosen as a Bregman divergence B_F, one gets the right-sided Bregman k-means clustering [72], as the cluster centers appear on the right-sided arguments of D in Equation (108). When F(θ) = ‖θ‖², Bregman k-means clustering (i.e., D = B_F in Equation (108)) amounts to the ordinary k-means clustering. The right-sided Bregman centroid of a cluster 𝒞 coincides with the center of mass and is independent of the Bregman generator F:

  argmin_c Σ_{x ∈ 𝒞} B_F(x : c) = (1/|𝒞|) Σ_{x ∈ 𝒞} x.

The left-sided Bregman k-means clustering is obtained by considering the right-sided Bregman centroid for the reverse Bregman divergence B_F*(x : c) := B_F(c : x), and the left-sided Bregman centroid [73] can be expressed as a multivariate generalization of the quasi-arithmetic mean:

  argmin_c Σ_{x ∈ 𝒞} B_F(c : x) = (∇F)⁻¹( (1/|𝒞|) Σ_{x ∈ 𝒞} ∇F(x) ).
In order to study the robustness of k-means clustering with respect to our novel family of divergences D_α^{f,g}, we first study the robustness of the left-sided Bregman centroids to outliers.
4.1. Robustness of the Left-Sided Bregman Centroids
Consider two d-dimensional points p and q of a domain Θ ⊂ ℝ^d. The centroid of p and q with respect to any arbitrary divergence D is by definition the minimizer of

  (1/2) ( D(p : c) + D(q : c) ),

provided that the minimizer is unique. Assume a separable Bregman divergence induced by the generator F(x) = Σ_{j=1}^d F_j(x_j). The left-sided Bregman centroid [73] of p and q is given by the following separable quasi-arithmetic centroid:

  c = (c₁, …, c_d),

with

  c_j = (F_j′)⁻¹( ( F_j′(p_j) + F_j′(q_j) ) / 2 ),

where F_j′ denotes the derivative of the scalar Bregman generator F_j.
Now, fix p, and let all the coordinates of the second point q tend to infinity: that is, point q plays the role of an outlier data point. We use the general framework of influence functions [74] in statistics to study the robustness of divergence-based centroids. Consider the r-power mean, the quasi-arithmetic mean induced by f_r(u) = u^r for r ≠ 0, and by extension the geometric mean when r = 0.
When r < 0, we check that, coordinatewise,

  lim_{q→∞} P_r(p, q) = ( p^r / 2 )^{1/r} = 2^{−1/r} p < ∞.

That is, the r-power mean is robust to an outlier data point when r < 0 (see Figure 2). Note that if instead of considering the centroid, we consider the barycenter with w ∈ (0, 1) denoting the weight of point p and 1 − w the weight of the outlier q, then the weighted r-power mean stays within a bounded box anchored at p when r < 0.
Figure 2.
Illustration of the robustness property of the r-power mean when r < 0 for two points: a prescribed point p and an outlier point q. When r < 0, the r-power mean of p and q (e.g., the coordinatewise harmonic mean when r = −1) is contained inside a box anchored at p of bounded side length. The r-power mean can be interpreted as a left-sided Bregman centroid for a corresponding separable Bregman generator (e.g., the Itakura–Saito generator when r = −1).
On the contrary, when r > 0 or r = 0, we have lim_{q→∞} P_r(p, q) = ∞, and the r-power mean diverges to infinity with the outlier.
Thus, when r < 0, the quasi-arithmetic centroid of p and q is contained in a bounding box anchored at p of bounded side length, and the left-sided Bregman power centroid minimizing

  B_{F_r}(c : p) + B_{F_r}(c : q)

is robust to the outlier q.
To contrast with this result, notice that the right-sided Bregman centroid [72] is always the center of mass (arithmetic mean), and therefore not robust to outliers, as a single outlier data point may potentially drag the centroid to infinity.
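The contrast between the two sided centroids can be illustrated numerically; in the Python sketch below (ours), the coordinatewise harmonic mean (r = −1, the left-sided Itakura–Saito centroid) of a fixed point p and a diverging outlier q stays bounded, while the arithmetic mean (the right-sided Bregman centroid) is dragged away with the outlier.

import numpy as np

def power_mean(x, y, r):
    if r == 0:
        return np.sqrt(x * y)
    return (0.5 * x**r + 0.5 * y**r) ** (1.0 / r)

p = np.array([1.0, 2.0])
for scale in (1e1, 1e3, 1e6):
    q = p * scale  # outlier moving away to infinity
    print(scale, power_mean(p, q, -1), power_mean(p, q, 1))
# The harmonic mean tends to 2*p (bounded); the arithmetic mean diverges with the outlier.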
Example 2.
Since M^f = M^{−f} for any smooth strictly increasing function f, we deduce that the quasi-arithmetic left-sided Bregman centroid induced by the Burg generator F(u) = −log u, with F′(u) = −1/u and quasi-arithmetic generator f(u) = 1/u, is the harmonic mean, which is robust to outliers. The corresponding Bregman divergence is the Itakura–Saito divergence [72].
Notice that it is enough to consider, without loss of generality, two points p and q: indeed, the case of the quasi-arithmetic mean of n points p₁, …, p_n and an outlier q can be rewritten as an equivalent weighted quasi-arithmetic mean of two points, with p̄ of weight n/(n+1) and q of weight 1/(n+1), using the replacement property of quasi-arithmetic means:

  M^f(p₁, …, p_n, q) = M^f(p̄, …, p̄, q),

where p̄ := M^f(p₁, …, p_n).
4.2. Robustness of Generalized Kullback–Leibler Centroids
The fact that the generalized KLDs are conformal representational Bregman divergences can be used to design efficient algorithms in computational geometry [60]. For example, let us consider the centroid (or barycenter) of a finite set of n weighted probability measures P₁, …, P_n (with RN derivatives p₁, …, p_n) defined as the minimizer of

  Σ_{i=1}^n w_i KL^{f,g}(p_i : c),

where the w_i's are positive weights summing up to one (Σ_i w_i = 1). The divergences KL^{f,g} are separable. Thus, consider, without loss of generality, the scalar generalized KLDs so that we have

  KL^{f,g}(p : q) = ( f(q) − f(p) )/f′(p) − ( g(q) − g(p) )/g′(p),
where p and q are scalars.
Since the left-sided Bregman centroid is unique and admits a closed form (while the right-sided Bregman centroid always coincides with the center of mass [72]),

  argmin_η Σ_i v_i B_F(η : η_i) = (F′)⁻¹( Σ_i ṽ_i F′(η_i) ),   ṽ_i := v_i / Σ_j v_j,

for positive weights v_i, we deduce that the right-sided generalized KLD centroid

  argmin_c Σ_i w_i KL^{f,g}(p_i : c)

amounts to a left-sided Bregman centroid with un-normalized positive weights v_i = w_i / f′(p_i) for the scalar Bregman generator F = f∘g⁻¹ with η_i = g(p_i). Therefore, the right-sided generalized KLD centroid is calculated for the normalized weights ṽ_i = v_i / Σ_j v_j as:

  c = g⁻¹( (F′)⁻¹( Σ_i ṽ_i F′( g(p_i) ) ) ).

Thus, we obtain a closed-form formula whenever (F′)⁻¹ and g⁻¹ are computationally tractable. For example, for the pair of generators (f_r, log), the right-sided power Kullback–Leibler centroid is obtained in closed form as a weighted power mean of the p_i's.
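As a cross-check of the centroid computation (ours; the generator pair f(u) = u², g(u) = log u is an illustrative choice with f∘g⁻¹ strictly convex, and SciPy is assumed available), the right-sided generalized KLD centroid of scalar points can also be obtained by direct numerical minimization:

import numpy as np
from scipy.optimize import minimize_scalar

f, fprime = np.square, lambda u: 2 * u      # f(u) = u^2
g, gprime = np.log, lambda u: 1 / u         # g(u) = log u

def gen_kld(p, c):
    """Scalar generalized KLD: (f(c)-f(p))/f'(p) - (g(c)-g(p))/g'(p)."""
    return (f(c) - f(p)) / fprime(p) - (g(c) - g(p)) / gprime(p)

points = np.array([0.5, 1.0, 4.0])
weights = np.array([0.2, 0.3, 0.5])

objective = lambda c: float(np.sum(weights * gen_kld(points, c)))
res = minimize_scalar(objective, bounds=(1e-6, 10.0), method="bounded")
print("right-sided generalized KLD centroid ~", res.x)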
Overall, we can design a k-means-type algorithm with respect to our generalized KLDs following [72]. Moreover, we can initialize probabilistically the k-means with the fast k-means++ seeding [34] described in Algorithm 1. The expected performance of the k-means++ seeding (i.e., the ratio of the seeding objective value to the optimal objective value) is O(log k) when D(x : c) = ‖x − c‖², and the analysis has been extended to arbitrary divergences in [75]. The merit of using the k-means++ seeding is that we do not need to iteratively update the cluster representatives using Lloyd's heuristic [69]: we can bypass the calculation of centroids and merely choose the cluster representatives among the source data points, as described in Algorithm 1.
Algorithm 1: Generic seeding of k-means with divergence-based k-means++.
Input: A finite set X = {x₁, …, x_n} of n points, the number of cluster representatives k ≥ 1, and an arbitrary divergence D.
Output: A set C of k initial cluster centers.
1. Choose the first center c₁ uniformly at random in X, and set C ← {c₁}.
2. While |C| < k: choose a new center c ← x_i with probability proportional to D(x_i : C) := min_{c′ ∈ C} D(x_i : c′), and set C ← C ∪ {c}.
3. Return C.
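A minimal Python sketch of the generic divergence-based k-means++ seeding of Algorithm 1 (ours; the sampling is proportional to the divergence from each point to its closest chosen center, one standard way to generalize the squared-distance sampling of [34]):

import numpy as np

def kmeanspp_seeding(X, k, divergence, rng=None):
    rng = np.random.default_rng(rng)
    n = len(X)
    centers = [X[rng.integers(n)]]                 # first center: uniform pick
    for _ in range(1, k):
        dists = np.array([min(divergence(x, c) for c in centers) for x in X])
        probs = dists / dists.sum()                # D-proportional sampling
        centers.append(X[rng.choice(n, p=probs)])
    return np.array(centers)

# Example with the scalar extended KLD as the divergence (positive scalar data).
ext_kl = lambda a, b: a * np.log(a / b) + b - a
X = np.concatenate([np.random.default_rng(1).uniform(0.5, 1.0, 50),
                    np.random.default_rng(2).uniform(5.0, 6.0, 50)])
print(kmeanspp_seeding(X, 2, ext_kl, rng=0))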
The advantage of using a conformal Bregman divergence such as the total Bregman divergence [33] or KL^{f,g} is to potentially ensure robustness to outliers (e.g., see Theorem III.2 of [33]). The robustness properties of these novel divergences can also be studied for statistical inference tasks based on minimum divergence methods [4,76].
5. Conclusions and Discussion
For two comparable strict means M and N [35] with M(x, y) ≥ N(x, y) (with equality holding if and only if x = y), one can define their (M, N)-divergence as

  D^{M,N}(p : q) := ∫ ( M(p(x), q(x)) − N(p(x), q(x)) ) dμ(x).

When the property of strictly comparable means extends to their induced weighted means M_α and N_α (i.e., M_α(x, y) ≥ N_α(x, y) with equality iff x = y, for all α ∈ (0, 1)), one can further define the family of (M, N) α-divergences for α ∈ (0, 1):

  D_α^{M,N}(p : q) := (1/(α(1−α))) ∫ ( M_α(p(x), q(x)) − N_α(p(x), q(x)) ) dμ(x),

so that D_{1/2}^{M,N}(p : q) = 4 D^{M,N}(p : q) for symmetric weighted means. When the weighted means are symmetric, the reference duality holds (i.e., D_α^{M,N}(q : p) = D_{1−α}^{M,N}(p : q)), and we can define the (M, N)-equivalent of the Kullback–Leibler divergence, i.e., the 1-divergence, as the limit case (when it exists): D_1^{M,N}(p : q) := lim_{α→1} D_α^{M,N}(p : q). Similarly, the (M, N)-equivalent of the reverse Kullback–Leibler divergence is obtained as D_0^{M,N}(p : q) := lim_{α→0} D_α^{M,N}(p : q).
We proved that the weighted quasi-arithmetic means [30] M_α^f and M_α^g are strictly comparable whenever f∘g⁻¹ is strictly convex. In the limit cases α → 1 and α → 0, we reported closed-form formulas for the equivalents of the forward and the reverse Kullback–Leibler divergences. We reported closed-form formulas for the quasi-arithmetic α-divergences for α ∈ (0, 1) (Theorem 3) and for the subfamily of homogeneous (r, s)-power α-divergences induced by power means (Corollary 2). The ordinary α-divergences [12] are recovered as the (1, 0)-power α-divergences, and the extended forward and reverse Kullback–Leibler divergences are recovered as their limit cases α → 1 and α → 0, respectively.
Generalized α-divergences may prove useful in reporting closed-form formulas between densities of a parametric family {p_θ : θ ∈ Θ}. For example, consider the ordinary α-divergences between two scale Cauchy densities: there is no obvious closed form for the ordinary α-divergences, but a closed form can be reported for suitably chosen quasi-arithmetic α-divergences following the calculus reported in [41]. For probability distributions p_{θ₁} and p_{θ₂} belonging to the same exponential family [77] with cumulant function F, the ordinary α-divergences admit the following closed-form solution:

  D_α(p_{θ₁} : p_{θ₂}) = (1/(α(1−α))) ( 1 − exp( −J_{F,α}(θ₁ : θ₂) ) ),   J_{F,α}(θ₁ : θ₂) := α F(θ₁) + (1−α) F(θ₂) − F( α θ₁ + (1−α) θ₂ ),

where J_{F,α} denotes the skew Jensen divergence induced by F, which in the limit cases yields the Kullback–Leibler divergence expressed as a reverse-order Bregman divergence: KL(p_{θ₁} : p_{θ₂}) = B_F(θ₂ : θ₁), with B_F(θ : θ′) := F(θ) − F(θ′) − (θ − θ′)ᵀ ∇F(θ′).
Instead of considering the ordinary α-divergences in applications, one may consider the (r, s)-power α-divergences, and tune the three scalar parameters (r, s, α) according to the various tasks (say, by cross-validation in supervised machine learning tasks, see [13]). For the limit cases α → 1 and α → 0, we further proved that the limit KL-type divergences amounted to conformal Bregman divergences on strictly monotone embeddings, and explained the connection of conformal divergences with conformal flattening [60], which allows one to build fast algorithms for centroid-based k-means clustering [72], Voronoi diagrams, and proximity data structures [60,63]. Some directions left for future research are to study the properties of these new α-divergences for statistical inference [2,4,76].
Funding
This research received no external funding.
Conflicts of Interest
The author declares no conflict of interest.
References
- Keener, R.W. Theoretical Statistics: Topics for a Core Course; Springer: Berlin/Heidelberg, Germany, 2011. [Google Scholar]
- Basu, A.; Shioya, H.; Park, C. Statistical Inference: The Minimum Distance Approach; CRC Press: Boca Raton, FL, USA, 2011. [Google Scholar]
- Basseville, M. Divergence measures for statistical data processing — An annotated bibliography. Signal Process. 2013, 93, 621–633. [Google Scholar] [CrossRef]
- Pardo, L. Statistical Inference Based on Divergence Measures; CRC Press: Boca Raton, FL, USA, 2018. [Google Scholar]
- Oller, J.M. Some geometrical aspects of data analysis and statistics. In Statistical Data Analysis and Inference; Elsevier: Amsterdam, The Netherlands, 1989; pp. 41–58. [Google Scholar]
- Amari, S. Information Geometry and Its Applications; Applied Mathematical Sciences; Springer: Tokyo, Japan, 2016. [Google Scholar]
- Eguchi, S. Geometry of minimum contrast. Hiroshima Math. J. 1992, 22, 631–647. [Google Scholar] [CrossRef]
- Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2012. [Google Scholar]
- Cichocki, A.; Amari, S.i. Families of alpha-beta-and gamma-divergences: Flexible and robust measures of similarities. Entropy 2010, 12, 1532–1568. [Google Scholar] [CrossRef]
- Amari, S.i. α-Divergence is Unique, belonging to Both f-divergence and Bregman Divergence Classes. IEEE Trans. Inf. Theory 2009, 55, 4925–4931. [Google Scholar] [CrossRef]
- Zhang, J. Divergence function, duality, and convex analysis. Neural Comput. 2004, 16, 159–195. [Google Scholar] [CrossRef]
- Hero, A.O.; Ma, B.; Michel, O.; Gorman, J. Alpha-Divergence for Classification, Indexing and Retrieval; Technical Report CSPL-328; Communication and Signal Processing Laboratory, University of Michigan: Ann Arbor, MI, USA, 2001. [Google Scholar]
- Dikmen, O.; Yang, Z.; Oja, E. Learning the information divergence. IEEE Trans. Pattern Anal. Mach. Intell. 2014, 37, 1442–1454. [Google Scholar] [CrossRef]
- Liu, W.; Yuan, K.; Ye, D. On α-divergence based nonnegative matrix factorization for clustering cancer gene expression data. Artif. Intell. Med. 2008, 44, 1–5. [Google Scholar] [CrossRef]
- Hellinger, E. Neue Begründung der Theorie Quadratischer Formen von Unendlichvielen Veränderlichen. J. Für Die Reine Und Angew. Math. 1909, 1909, 210–271. [Google Scholar] [CrossRef]
- Ali, S.M.; Silvey, S.D. A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Ser. B 1966, 28, 131–142. [Google Scholar] [CrossRef]
- Csiszár, I. Information-type measures of difference of probability distributions and indirect observation. Stud. Sci. Math. Hung. 1967, 2, 229–318. [Google Scholar]
- Qiao, Y.; Minematsu, N. A study on invariance of f-divergence and its application to speech recognition. IEEE Trans. Signal Process. 2010, 58, 3884–3890. [Google Scholar] [CrossRef]
- Li, W. Transport information Bregman divergences. Inf. Geom. 2021, 4, 435–470. [Google Scholar] [CrossRef]
- Li, W. Transport information Hessian distances. In Proceedings of the International Conference on Geometric Science of Information (GSI), Paris, France, 21–23 July 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 808–817. [Google Scholar]
- Li, W. Transport information geometry: Riemannian calculus on probability simplex. Inf. Geom. 2022, 5, 161–207. [Google Scholar] [CrossRef]
- Amari, S.i. Integration of stochastic models by minimizing α-divergence. Neural Comput. 2007, 19, 2780–2796. [Google Scholar] [CrossRef] [PubMed]
- Cichocki, A.; Lee, H.; Kim, Y.D.; Choi, S. Non-negative matrix factorization with α-divergence. Pattern Recognit. Lett. 2008, 29, 1433–1440. [Google Scholar] [CrossRef]
- Wada, J.; Kamahara, Y. Studying malapportionment using α-divergence. Math. Soc. Sci. 2018, 93, 77–89. [Google Scholar] [CrossRef]
- Maruyama, Y.; Matsuda, T.; Ohnishi, T. Harmonic Bayesian prediction under α-divergence. IEEE Trans. Inf. Theory 2019, 65, 5352–5366. [Google Scholar] [CrossRef]
- Iqbal, A.; Seghouane, A.K. An α-Divergence-Based Approach for Robust Dictionary Learning. IEEE Trans. Image Process. 2019, 28, 5729–5739. [Google Scholar] [CrossRef]
- Ahrari, V.; Habibirad, A.; Baratpour, S. Exponentiality test based on alpha-divergence and gamma-divergence. Commun. Stat.-Simul. Comput. 2019, 48, 1138–1152. [Google Scholar] [CrossRef]
- Sarmiento, A.; Fondón, I.; Durán-Díaz, I.; Cruces, S. Centroid-based clustering with αβ-divergences. Entropy 2019, 21, 196. [Google Scholar] [CrossRef]
- Niculescu, C.P.; Persson, L.E. Convex Functions and Their Applications: A Contemporary Approach, 1st ed.; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2006. [Google Scholar]
- Kolmogorov, A.N. Sur la notion de moyenne. Acad. Naz. Lincei Mem. Cl. Sci. His. Mat. Natur. Sez. 1930, 12, 388–391. [Google Scholar]
- Gibbs, A.L.; Su, F.E. On choosing and bounding probability metrics. Int. Stat. Rev. 2002, 70, 419–435. [Google Scholar] [CrossRef]
- Rachev, S.T.; Klebanov, L.B.; Stoyanov, S.V.; Fabozzi, F. The Methods of Distances in the Theory of Probability and Statistics; Springer: Berlin/Heidelberg, Germany, 2013; Volume 10. [Google Scholar]
- Vemuri, B.C.; Liu, M.; Amari, S.I.; Nielsen, F. Total Bregman divergence and its applications to DTI analysis. IEEE Trans. Med Imaging 2010, 30, 475–483. [Google Scholar] [CrossRef]
- Arthur, D.; Vassilvitskii, S. k-means++: The advantages of careful seeding. In Proceedings of the SODA ’07: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA, USA, 7–9 January 2007; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2007; pp. 1027–1035. [Google Scholar]
- Bullen, P.S.; Mitrinovic, D.S.; Vasic, M. Means and Their Inequalities; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013; Volume 31. [Google Scholar]
- Toader, G.; Costin, I. Means in Mathematical Analysis: Bivariate Means; Academic Press: Cambridge, MA, USA, 2017. [Google Scholar]
- Cauchy, A.L.B. Cours d’analyse de l’École Royale Polytechnique; Debure frères: Paris, France, 1821. [Google Scholar]
- Chisini, O. Sul concetto di media. Period. Di Mat. 1929, 4, 106–116. [Google Scholar]
- Bhattacharyya, A. On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc. 1943, 35, 99–109. [Google Scholar]
- Nielsen, F.; Boltz, S. The Burbea-Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory 2011, 57, 5455–5466. [Google Scholar] [CrossRef]
- Nielsen, F. Generalized Bhattacharyya and Chernoff upper bounds on Bayes error using quasi-arithmetic means. Pattern Recognit. Lett. 2014, 42, 25–34. [Google Scholar] [CrossRef][Green Version]
- Nagumo, M. Über eine klasse der mittelwerte. Jpn. J. Math. Trans. Abstr. 1930, 7, 71–79. [Google Scholar] [CrossRef]
- De Finetti, B. Sul concetto di media. Ist. Ital. Degli Attuari 1931, 3, 369–396. [Google Scholar]
- Hardy, G.; Littlewood, J.; Pólya, G. Inequalities; Cambridge Mathematical Library, Cambridge University Press: Cambridge, UK, 1988. [Google Scholar]
- Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 20 June–30 July 1960; The Regents of the University of California: Oakland, CA, USA, 1961; Volume 1. Contributions to the Theory of Statistics. [Google Scholar]
- Holder, O.L. Über einen Mittelwertssatz. Nachr. Akad. Wiss. Gottingen Math.-Phys. Kl. 1889, 44, 38–47. [Google Scholar]
- Bhatia, R. The Riemannian mean of positive matrices. In Matrix Information Geometry; Springer: Berlin/Heidelberg, Germany, 2013; pp. 35–51. [Google Scholar]
- Akaoka, Y.; Okamura, K.; Otobe, Y. Bahadur efficiency of the maximum likelihood estimator and one-step estimator for quasi-arithmetic means of the Cauchy distribution. Ann. Inst. Stat. Math. 2022, 74, 1–29. [Google Scholar] [CrossRef]
- Kim, S. The quasi-arithmetic means and Cartan barycenters of compactly supported measures. Forum Math. Gruyter 2018, 30, 753–765. [Google Scholar] [CrossRef]
- Carlson, B.C. The logarithmic mean. Am. Math. Mon. 1972, 79, 615–618. [Google Scholar] [CrossRef]
- Stolarsky, K.B. Generalizations of the logarithmic mean. Math. Mag. 1975, 48, 87–92. [Google Scholar] [CrossRef]
- Jarczyk, J. When Lagrangean and quasi-arithmetic means coincide. J. Inequal. Pure Appl. Math. 2007, 8, 71. [Google Scholar]
- Páles, Z.; Zakaria, A. On the Equality of Bajraktarević Means to Quasi-Arithmetic Means. Results Math. 2020, 75, 19. [Google Scholar] [CrossRef]
- Maksa, G.; Páles, Z. Remarks on the comparison of weighted quasi-arithmetic means. Colloq. Math. 2010, 120, 77–84. [Google Scholar] [CrossRef]
- Zhang, J. Nonparametric information geometry: From divergence function to referential-representational biduality on statistical manifolds. Entropy 2013, 15, 5384–5418. [Google Scholar] [CrossRef]
- Nielsen, F.; Nock, R. Generalizing Skew Jensen Divergences and Bregman Divergences with Comparative Convexity. IEEE Signal Process. Lett. 2017, 24, 1123–1127. [Google Scholar] [CrossRef]
- Kuczma, M. An Introduction to the Theory of Functional Equations and Inequalities: Cauchy’s Equation and Jensen’s Inequality; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2009. [Google Scholar]
- Nock, R.; Nielsen, F.; Amari, S.i. On conformal divergences and their population minimizers. IEEE Trans. Inf. Theory 2015, 62, 527–538. [Google Scholar] [CrossRef][Green Version]
- Ohara, A. Conformal flattening for deformed information geometries on the probability simplex. Entropy 2018, 20, 186. [Google Scholar] [CrossRef]
- Ohara, A. Conformal Flattening on the Probability Simplex and Its Applications to Voronoi Partitions and Centroids. In Geometric Structures of Information; Springer: Berlin/Heidelberg, Germany, 2019; pp. 51–68. [Google Scholar]
- Bregman, L.M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 1967, 7, 200–217. [Google Scholar] [CrossRef]
- Zhang, J. On monotone embedding in information geometry. Entropy 2015, 17, 4485–4499. [Google Scholar] [CrossRef]
- Nielsen, F.; Nock, R. The dual Voronoi diagrams with respect to representational Bregman divergences. In Proceedings of the Sixth International Symposium on Voronoi Diagrams (ISVD), Copenhagen, Denmark, 23–26 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 71–78. [Google Scholar]
- Itakura, F.; Saito, S. Analysis synthesis telephony based on the maximum likelihood method. In Proceedings of the 6th International Congress on Acoustics, Tokyo, Japan, 21–28 August 1968; pp. 280–292. [Google Scholar]
- Okamoto, I.; Amari, S.I.; Takeuchi, K. Asymptotic theory of sequential estimation: Differential geometrical approach. Ann. Stat. 1991, 19, 961–981. [Google Scholar] [CrossRef]
- Ohara, A.; Matsuzoe, H.; Amari, S.I. Conformal geometry of escort probability and its applications. Mod. Phys. Lett. B 2012, 26, 1250063. [Google Scholar] [CrossRef]
- Kurose, T. On the divergences of 1-conformally flat statistical manifolds. Tohoku Math. J. Second Ser. 1994, 46, 427–433. [Google Scholar] [CrossRef]
- Pal, S.; Wong, T.K.L. The geometry of relative arbitrage. Math. Financ. Econ. 2016, 10, 263–293. [Google Scholar] [CrossRef]
- Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137. [Google Scholar] [CrossRef]
- Mahajan, M.; Nimbhorkar, P.; Varadarajan, K. The planar k-means problem is NP-hard. Theor. Comput. Sci. 2012, 442, 13–21. [Google Scholar] [CrossRef]
- Wang, H.; Song, M. Ckmeans.1d.dp: Optimal k-means clustering in one dimension by dynamic programming. R J. 2011, 3, 29. [Google Scholar] [CrossRef]
- Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J.; Lafferty, J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749. [Google Scholar]
- Nielsen, F.; Nock, R. Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory 2009, 55, 2882–2904. [Google Scholar] [CrossRef]
- Ronchetti, E.M.; Huber, P.J. Robust Statistics; John Wiley & Sons: Hoboken, NJ, USA, 2009. [Google Scholar]
- Nielsen, F.; Nock, R. Total Jensen divergences: Definition, properties and clustering. In Proceedings of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, Australia, 19–24 April 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 2016–2020. [Google Scholar]
- Eguchi, S.; Komori, O. Minimum Divergence Methods in Statistical Machine Learning; Springer: Berlin/Heidelberg, Germany, 2022. [Google Scholar]
- Kailath, T. The divergence and Bhattacharyya distance measures in signal selection. IEEE Trans. Commun. Technol. 1967, 15, 52–60. [Google Scholar] [CrossRef]
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
