On Clustering Histograms with k-Means by Using Mixed α-Divergences

Nielsen, Frank; Nock, Richard; Amari, Shun-ichi

doi:10.3390/e16063273


by Frank Nielsen ¹,²,*, Richard Nock ³ and Shun-ichi Amari ⁴

¹ Sony Computer Science Laboratories, Inc, Tokyo 141-0022, Japan

² Polytechnique, 91128 Palaiseau Cedex, France

³ NICTA and The Australian National University, Locked Bag 9013, Alexandria NSW 1435, Australia

⁴ RIKEN Brain Science Institute, 2-1 Hirosawa Wako City, Saitama 351-0198, Japan

* Author to whom correspondence should be addressed.

Entropy 2014, 16(6), 3273-3301; https://doi.org/10.3390/e16063273

Submission received: 15 May 2014 / Revised: 10 June 2014 / Accepted: 13 June 2014 / Published: 17 June 2014

(This article belongs to the Special Issue Information Geometry)


Abstract

Clustering sets of histograms has become popular thanks to the success of the generic method of bag-of-X used in text categorization and in visual categorization applications. In this paper, we investigate the use of a parametric family of distortion measures, called the α-divergences, for clustering histograms. Since it usually makes sense to deal with symmetric divergences in information retrieval systems, we symmetrize the α-divergences using the concept of mixed divergences. First, we present a novel extension of k-means clustering to mixed divergences. Second, we extend the k-means++ seeding to mixed α-divergences and report a guaranteed probabilistic bound. Finally, we describe a soft clustering technique for mixed α-divergences.

Keywords:

bag-of-X; α-divergence; Jeffreys divergence; centroid; k-means clustering; k-means seeding


1. Introduction: Motivation and Background

1.1. Clustering Histograms in the Bag-of-Word Modeling Paradigm

A common task of information retrieval (IR) systems is to classify documents into categories. Given a training set of documents labeled with categories, one asks to classify new incoming documents. Text categorisation [1,2] proceeds by first defining a dictionary of words from a corpus. It then models each document by a word count yielding a word distribution histogram per document (see the University of California, Irvine, UCI, machine learning repository for such data-sets [3]). The importance of the words in the dictionary can be weighted by the term frequency-inverse document frequency [2] (tf-idf), which takes into account both the frequency of a word in a given document and its frequency across all documents: namely, the tf-idf weight for a given word in a given document is the frequency of that word in the document times the logarithm of the ratio of the number of documents to the document frequency of the word [2]. Defining a proper distance between histograms allows one to:

Classify a new on-line document: We first calculate its word distribution histogram signature and seek the labeled document with the most similar histogram to deduce its category tag.
Find the initial set of categories: We cluster all document histograms and assign a category per cluster.

This text classification method based on the bag-of-words (BoWs) representation has also been instrumental in computer vision for efficient object categorization [4] and recognition in natural images [5]. This paradigm is called bag-of-features [6] (BoFs) in the general case. It first requires one to create a dictionary of “visual words” by quantizing keypoints (e.g., affine invariant descriptors of image patches) of the training database. Quantization is performed using the k-means [7–9] algorithm that partitions n data points $X = \{x_{1}, \dots, x_{n}\}$ into k pairwise disjoint clusters C₁,…, C_k, where each data element belongs to the closest cluster center (i.e., the cluster prototype). From a given initialization, batched k-means first assigns data points to their closest centers and then updates the cluster centers, and it reiterates this process until it converges to a local minimum (not necessarily the global minimum) after a provably finite number of steps. Csurka et al. [4] used the squared Euclidean distance for building the visual vocabulary. Depending on the chosen features, other distances have proven useful. For example, the symmetrized Kullback–Leibler (KL) divergence was shown to perform experimentally better than the Euclidean or squared Euclidean distances for compressed histogram of gradient descriptors [10] (CHoGs), even though it is not a metric distance, since it fails to satisfy the triangle inequality. To summarize, k-means histogram clustering with respect to the symmetrized KL (called Jeffreys divergence J) can be used to quantize both visual words and document categories. Nowadays, the seminal bag-of-words method has been generalized fruitfully to various settings using the generic bag-of-X paradigm, like the bag-of-textons [6], the bag-of-readers [11], etc. Bag-of-X represents each data item (e.g., document, image, etc.) as a histogram of codeword counts.
Furthermore, the semantic space [12] paradigm has been recently explored to overcome two drawbacks of the bag-of-X paradigms: the high-dimensionality of the histograms (number of bins) and difficult human interpretation of the codewords due to the lack of semantic information. In semantic space, modeling relies on semantic multinomials that are discrete frequency histograms; see [12].

In summary, clustering histograms with respect to symmetric distances (like the symmetrized KL divergence) is playing an increasing role. It turns out that the symmetrized KL divergence belongs to a 1-parameter family of divergences, called symmetrized α-divergences, or Jeffreys α-divergence [13].

1.2. Contributions

Since divergences D(p : q) are usually asymmetric distortion measures between two objects p and q, one often has to consider two kinds of centroids obtained by carrying out the minimization either on the left argument or on the right argument of the divergence; see [14]. In theory, it is enough to consider only one type of centroid, say the right centroid, since the left centroid with respect to a divergence D(p : q) is equivalent to the right centroid with respect to the mirror divergence D′(p : q) = D(q : p). In this paper, we consider mixed divergences [15] that allow one to handle in a unified way the arithmetic symmetrization $S (p, q) = \frac{1}{2} (D (p : q) + D (q : p))$ of a given divergence D(p : q) together with both sided divergences: D(p : q) and its mirror divergence D′(p : q). The mixed α-divergence is the mixed divergence obtained for the α-divergence. We term α-clustering the clustering with respect to α-divergences and mixed α-clustering the clustering w.r.t. mixed α-divergences [16]. Our main contributions are to extend the celebrated batched k-means [7–9] algorithm to mixed divergences by associating two dual centroids per cluster and to generalize the probabilistically guaranteed good seeding of k-means++ [17] to mixed α-divergences. The mixed α-seedings provide guaranteed probabilistic clustering bounds by picking seeds from the data and do not require explicitly computing centroids. This yields a fast clustering technique in practice, even when cluster centers are not available in closed form. We also consider clustering histograms by explicitly building the symmetrized α-centroids and end up with a variational k-means when the centroids are not available in closed form. Finally, we investigate soft mixed α-clustering and discuss topics related to α-clustering. Note that clustering with respect to non-symmetrized α-divergences has been recently investigated independently in [18] and proven useful in several applications.

1.3. Outline of the Paper

The paper is organized as follows: Section 2 introduces the notion of mixed divergences, presents an extension of k-means to mixed divergences and recalls some properties of α-divergences. Section 3 describes the α-seeding techniques and reports a probabilistically-guaranteed bound on the clustering quality. Section 4 investigates the various sided/symmetrized/mixed calculations of the α-centroids. Section 5 presents the soft α-clustering with respect to α-mixed divergences. Finally, Section 6 summarises the contributions, discusses related topics and hints at further perspectives. The paper is followed by two appendices. Appendix B studies several properties of α-divergences that are used to derive the guaranteed probabilistic performance of the α-seeding. Appendix C proves that α-sided centroids are quasi-arithmetic means for the power generator functions.

2. Mixed Centroid-Based k-Means Clustering

2.1. Divergences, Centroids and k-Means

Consider a set $ℋ$ of n histograms h₁,…, h_n, each with d bins, with all positive real-valued bins: $h_{j}^{i} > 0$ , ∀1 ≤ i ≤ d, 1 ≤ j ≤ n. A histogram h is called a frequency histogram when its bins sum up to one: $w (h) = w_{h} = \sum_{i} h^{i} = 1$ . Otherwise, it is called a positive histogram, which can always be normalized to a frequency histogram:

\tilde{h} ≐ \frac{h}{w (h)} .

(1)

The frequency histograms belong to the (d-1)-dimensional open probability simplex Δ_d:

Δ_{d} ≐ \{(x^{1}, \dots, x^{d}) \in ℝ^{d} \mid \forall i, x^{i} > 0, \text{ and } \sum_{i = 1}^{d} x^{i} = 1\} .

(2)

That is, although frequency histograms have d bins, the constraint that the bin values sum up to one leaves d-1 degrees of freedom. In probability theory, frequency and counting histograms model either discrete multinomial probabilities or discrete positive measures (also called positive arrays [19]).

The celebrated k-means clustering [8,9] is one of the most famous clustering techniques and has been generalized in many ways [20,21]. In information geometry [22], a divergence D(p : q) is a smooth C³ differentiable dissimilarity measure that is not necessarily symmetric (D(p : q) ≠ D(q : p), hence the notation “:” instead of the classical “,” reserved for metric distances), but is non-negative and satisfies the separability property: D(p : q) = 0 iff p = q. More precisely, let $\partial_{i} D (x : y) = \frac{\partial}{\partial x^{i}} D (x : y)$ and $\partial_{, i} D (x : y) = \frac{\partial}{\partial y^{i}} D (x : y)$ . Then, we require $\partial_{i} D (x : x) = \partial_{, i} D (x : x) = 0$ and $- \partial_{i} \partial_{, j} D (x : y)$ to be positive definite for defining a divergence. For a distance function D(· : ·), we denote by $D (x : ℋ)$ the weighted average distance of x to a set of weighted histograms:

D (x : ℋ) ≐ \sum_{j = 1}^{n} w_{j} D (x : h_{j}) .

(3)

An important class of divergences on frequency histograms is the f-divergences [23–25] defined for a convex generator f (with f(1) = f′(1) = 0 and f″(1) = 1):

I_{f} (p : q) ≐ \sum_{i = 1}^{d} q^{i} f (\frac{p^{i}}{q^{i}}) .

Those divergences preserve information monotonicity [19] under any arbitrary transition probability (Markov morphisms). f-divergences can be extended to positive arrays [19].

The k-means algorithm on a set of weighted histograms can be tailored to any divergence as follows: First, we initialize the k cluster centers C = {c₁,…, c_k} (say, by picking up randomly arbitrary distinct seeds). Then, we iteratively repeat until convergence the following two steps:

Assignment: Assign each histogram h_j to its closest cluster center:

$l (h_{j}) ≐ \arg {min}_{l = 1}^{k} D (h_{j} : c_{l}) .$

This yields a partition of the histogram set $ℋ = \cup_{l = 1}^{k} A_{l}$ , where A_l denotes the set of histograms of the l-th cluster: $A_{l} = \{h_{j} \mid l (h_{j}) = l\}$ .
Center relocation: Update the cluster centers by taking their centroids:

$c_{l} ≐ \arg min_{x} \sum_{h_{j} \in A_{l}} w_{j} D (h_{j} : x) .$

Throughout this paper, centroid shall be understood in the broader sense of a barycenter when weights are non-uniform.
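The two steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names (`divergence_kmeans`, `extended_kl`, `arithmetic_centroid`) are hypothetical, and the example instantiates the generic scheme with the extended Kullback–Leibler divergence KL(h : c), for which the right-sided centroid is the weighted arithmetic mean.

```python
import math
import random

def extended_kl(p, q):
    """Extended Kullback-Leibler divergence KL(p : q) between positive histograms."""
    return sum(a * math.log(a / b) + b - a for a, b in zip(p, q))

def arithmetic_centroid(hists, weights):
    """Minimizer over x of sum_j w_j KL(h_j : x): the weighted arithmetic mean."""
    total = sum(weights)
    d = len(hists[0])
    return [sum(w * h[i] for h, w in zip(hists, weights)) / total for i in range(d)]

def divergence_kmeans(hists, weights, k, divergence, centroid, max_iter=100, seed=0):
    """Lloyd-type k-means for an arbitrary divergence D(h : c) (hypothetical helper)."""
    rng = random.Random(seed)
    centers = [list(h) for h in rng.sample(hists, k)]  # k distinct random seeds
    labels = None
    for _ in range(max_iter):
        # Assignment: each histogram goes to its closest center.
        new_labels = [min(range(k), key=lambda l: divergence(h, centers[l]))
                      for h in hists]
        if new_labels == labels:  # partition unchanged: converged
            break
        labels = new_labels
        # Center relocation: recompute the centroid of each non-empty cluster.
        for l in range(k):
            members = [j for j, lab in enumerate(labels) if lab == l]
            if members:
                centers[l] = centroid([hists[j] for j in members],
                                      [weights[j] for j in members])
    return centers, labels
```

Passing the divergence and the centroid rule as callables keeps the loop generic: swapping in another divergence only requires supplying its matching centroid function.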

2.2. Mixed Divergences and Mixed k-Means Clustering

Since divergences are potentially asymmetric, we can define two-sided k-means or always consider a right-sided k-means, but then define another sided divergence D′(p : q) = D(q : p). We can also consider the symmetrized k-means with respect to the symmetrized divergence: S(p, q) = D(p : q) + D(q : p). Eventually, we may skew the symmetrization with a parameter λ ∈ [0, 1]: S_λ(p, q) = λD(p : q) + (1 − λ)D(q : p) (and consider other averaging schemes instead of the arithmetic mean).

In order to handle those sided and symmetrized k-means under the same framework, let us introduce the notion of mixed divergences [15] as follows:

Definition 1 (Mixed divergence).

M_{λ} (p : q : r) ≐ λ D (p : q) + (1 - λ) D (q : r),

(4)

for λ ∈ [0, 1].

A mixed divergence includes the sided divergences for λ ∈ {0, 1} and the symmetrized (arithmetic mean) divergence for $λ = \frac{1}{2}$ .

We generalize k-means clustering to mixed k-means clustering [15] by considering two centers per cluster (for the special cases of λ = 0, 1, it is enough to consider only one). Algorithm 1 sketches the generic mixed k-means algorithm. Note that a simple initialization consists of choosing randomly the k distinct seeds from the dataset with l_i = r_i.

Notice that the mixed k-means clustering is different from the k-means clustering with respect to the symmetrized divergences S_λ that considers only one centroid per cluster.

2.3. Sided, Symmetrized and Mixed α-Divergences

For α ≠ ±1, we define the family of α-divergences [26] on positive arrays [27] as:

\begin{array}{l} D_{α} (p : q) ≐ \sum_{i = 1}^{d} \frac{4}{1 - α^{2}} (\frac{1 - α}{2} p^{i} + \frac{1 + α}{2} q^{i} - {(p^{i})}^{\frac{1 - α}{2}} {(q^{i})}^{\frac{1 + α}{2}}), \\ = D_{- α} (q : p), α \in ℝ \setminus \{- 1, 1\}, \end{array}

(5)

with the limit cases D₋₁(p : q) = KL(p : q) and D₁(p : q) = KL(q : p), where KL is the extended Kullback–Leibler divergence:

KL (p : q) ≐ \sum_{i = 1}^{d} p^{i} \log \frac{p^{i}}{q^{i}} + q^{i} - p^{i} .

(6)

Divergence D₀ is the squared Hellinger symmetric distance (scaled by a multiplicative factor of four) extended to positive arrays:

D_{0} (p : q) = 2 \int {(\sqrt{p (x)} - \sqrt{q (x)})}^{2} d x = 4 H^{2} (p, q),

(7)

with the Hellinger distance:

H (p, q) = \sqrt{\frac{1}{2} \int {(\sqrt{p (x)} - \sqrt{q (x)})}^{2} d x} .

(8)

Note that α-divergences are defined for the full range of α values: α ∈ ℝ. Observe that α-divergences of Equation (5) are homogeneous of degree one: D_α(λp : λq) = λD_α(p : q) for λ > 0.
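As a numerical companion to Equations (5)–(8), the following Python sketch evaluates D_α on positive histograms, with the two KL limit cases handled explicitly. The function name `alpha_divergence` is an illustrative choice, not taken from the paper.

```python
import math

def alpha_divergence(p, q, alpha):
    """D_alpha(p : q) of Equation (5) for positive histograms p and q.

    The limit cases follow the text: D_{-1}(p : q) = KL(p : q) and
    D_{1}(p : q) = KL(q : p), with KL the extended Kullback-Leibler divergence.
    """
    if alpha == -1:
        return sum(a * math.log(a / b) + b - a for a, b in zip(p, q))
    if alpha == 1:
        return sum(b * math.log(b / a) + a - b for a, b in zip(p, q))
    u, v = (1 - alpha) / 2, (1 + alpha) / 2
    return 4 / (1 - alpha ** 2) * sum(
        u * a + v * b - a ** u * b ** v for a, b in zip(p, q))
```

For instance, `alpha_divergence(p, q, 0)` returns twice the sum of squared differences of the square-rooted bins, and scaling both arguments by a positive constant scales the result by the same constant, in line with the homogeneity property stated above.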

When histograms p and q are both frequency histograms, we have:

\begin{array}{l} D_{α} (\tilde{p} : \tilde{q}) = \frac{4}{1 - α^{2}} (1 - \sum_{i = 1}^{d} {({\tilde{p}}^{i})}^{\frac{1 - α}{2}} {({\tilde{q}}^{i})}^{\frac{1 + α}{2}}), \\ = D_{- α} (\tilde{q} : \tilde{p}), α \in ℝ \setminus \{- 1, 1\}, \end{array}

(9)

and the extended Kullback–Leibler divergence reduces to the traditional Kullback–Leibler divergence: $KL (\tilde{p} : \tilde{q}) = \sum_{i = 1}^{d} {\tilde{p}}^{i} \log \frac{{\tilde{p}}^{i}}{{\tilde{q}}^{i}}$ .

The Kullback–Leibler divergence between frequency histograms $\tilde{p}$ and $\tilde{q}$ (α = ±1) is interpreted as the cross-entropy minus the Shannon entropy:

KL (\tilde{p} : \tilde{q}) ≐ H^{\times} (\tilde{p} : \tilde{q}) - H (\tilde{p}) .

Often, $\tilde{p}$ denotes the true model (hidden by nature), and $\tilde{q}$ is the estimated model from observations. However, in information retrieval, both $\tilde{p}$ and $\tilde{q}$ play the same symmetrical role, and we prefer to deal with a symmetric divergence.

The Pearson and Neyman χ² distances are obtained for α = −3 and α = 3, respectively:

D_{3} (\tilde{p} : \tilde{q}) = \frac{1}{2} \sum_{i} \frac{{({\tilde{q}}^{i} - {\tilde{p}}^{i})}^{2}}{{\tilde{p}}^{i}},

(10)

D_{- 3} (\tilde{p} : \tilde{q}) = \frac{1}{2} \sum_{i} \frac{{({\tilde{q}}^{i} - {\tilde{p}}^{i})}^{2}}{{\tilde{q}}^{i}} .

(11)

The α-divergences belong to the class of Csiszár f-divergences with the following generator:

f (t) = \begin{cases} \frac{4}{1 - α^{2}} (1 - t^{(1 + α) / 2}), & \text{if } α \neq \pm 1, \\ t \ln t, & \text{if } α = 1, \\ - \ln t, & \text{if } α = - 1. \end{cases}

(12)

Remark 1. Historically, the α-divergences have been introduced by Chernoff [28,29] in the context of hypothesis testing. In Bayesian binary hypothesis testing, we are asked to decide whether an observation belongs to one class or the other class, based on prior w₁ and w₂ and class-conditional probabilities p₁ and p₂. The average expected error of the best decision maximum a posteriori (MAP) rule is called the probability of error, denoted by P_e. When prior probabilities are identical $(w_{1} = w_{2} = \frac{1}{2})$ , we have $P_{e} (p_{1}, p_{2}) = \frac{1}{2} \int min (p_{1} (x), p_{2} (x)) d x$ . Let S(p, q) = ∫ min(p(x), q(x))dx denote the intersection similarity measure, with 0 < S ≤ 1 (generalizing the histogram intersection distance often used in computer vision [30]). S is bounded by the α-Chernoff affinity coefficient:

S (p, q) \leq C_{β} (p, q) = \int p^{β} (x) q^{1 - β} (x) d x,

for all β ∈ [0, 1]. We can convert the affinity coefficient 0 < C_β ≤ 1 into a divergence D_β by simply taking D_β = 1 − C_β. Since the absolute scale of a divergence does not matter, we can rescale the divergence appropriately. One convenient rescaling multiplies by $\frac{1}{β (1 - β)}$ , giving $D_{β} = \frac{1}{β (1 - β)} (1 - C_{β})$ . With this scaling, the parameterized divergence tends to the fundamental Kullback–Leibler divergence in the limits β → 0 and β → 1. Last, choosing $β = \frac{1 - α}{2}$ yields the well-known expression of the α-divergences.

Interestingly, the α-divergences can be interpreted as a generalized α-Kullback–Leibler divergence [26] with deformed logarithms.

Next, we introduce the mixed α-divergence of a histogram x to two histograms p and q as follows:

Definition 2 (Mixed α-divergence). The mixed α-divergence of a histogram x to two histograms p and q is defined by:

\begin{array}{l} M_{λ, α} (p : x : q) = λ D_{α} (p : x) + (1 - λ) D_{α} (x : q), \\ = λ D_{- α} (x : p) + (1 - λ) D_{- α} (q : x), \\ = M_{1 - λ, - α} (q : x : p), \end{array}

(13)

The α-Jeffreys symmetrized divergence is obtained for $λ = \frac{1}{2}$ :

S_{α} (p, q) = M_{\frac{1}{2}, α} (q : p : q) = M_{\frac{1}{2}, α} (p : q : p) .

The skew symmetrized α-divergence is defined by:

S_{λ, α} (p : q) = λ D_{α} (p : q) + (1 - λ) D_{α} (q : p) .
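The identities of Equation (13) are easy to check numerically. The sketch below is illustrative only (the function names are hypothetical, and it is restricted to α ≠ ±1 for brevity); it implements the mixed α-divergence and the α-Jeffreys symmetrization obtained from it.

```python
def alpha_divergence(p, q, alpha):
    """D_alpha(p : q) on positive histograms; this sketch assumes alpha != +-1."""
    u, v = (1 - alpha) / 2, (1 + alpha) / 2
    return 4 / (1 - alpha ** 2) * sum(
        u * a + v * b - a ** u * b ** v for a, b in zip(p, q))

def mixed_alpha_divergence(p, x, q, lam, alpha):
    """M_{lam,alpha}(p : x : q) = lam D_alpha(p : x) + (1 - lam) D_alpha(x : q)."""
    return (lam * alpha_divergence(p, x, alpha)
            + (1 - lam) * alpha_divergence(x, q, alpha))

def symmetrized_alpha_divergence(p, q, alpha):
    """S_alpha(p, q) = M_{1/2,alpha}(q : p : q)."""
    return mixed_alpha_divergence(q, p, q, 0.5, alpha)
```

Evaluating `mixed_alpha_divergence(p, x, q, lam, alpha)` and `mixed_alpha_divergence(q, x, p, 1 - lam, -alpha)` on the same inputs gives the same value up to round-off, which is the last equality of Equation (13).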

2.4. Notations and Hard/Soft Clusterings

Throughout the paper, superscript index i denotes the histogram bin numbers and subscript index j the histogram numbers. Index l is used to iterate on the clusters. The left-sided, right-sided and symmetrized histogram positive and frequency α-centroids are denoted by l_α, r_α, s_α and ${\tilde{l}}_{α}, {\tilde{r}}_{α}, {\tilde{s}}_{α}$ , respectively.

In this paper, we investigate the following kinds of clusterings for sets of histograms:

Hard clustering. Each histogram belongs to exactly one cluster:

k-means with respect to mixed divergences M_λ,α.
k-means with respect to symmetrized divergences S_λ,α.
Randomized seeding for mixed/symmetrized k-means by extending k-means++ with guaranteed probabilistic bounds for α-divergences.

Soft clustering. Each histogram belongs to all clusters according to some weight distribution: the soft mixed α-clustering.

3. Coupled k-Means++ α-Seeding

It is well-known that the Lloyd k-means clustering algorithm monotonically decreases the loss function and stops after a finite number of iterations at a local optimum. Optimizing globally the k-means loss is NP-hard [17] when d > 1 and k > 1. In practice, the performance of the k-means algorithm heavily relies on the initialization. A breakthrough was obtained by the k-means++ seeding [17], which guarantees in expectation a good starting partition. We extend this scheme to the coupled α-clustering. However, although k-means++ has proven popular and is often used in practice with very good results, it has been recently pointed out that “worst case” configurations exist, even in small dimensions, on which the algorithm cannot significantly beat its expected approximability with high probability [31]. Still, the expected approximability ratio, roughly in O(log(k)), is very good, as long as the number of clusters is not too large.

Algorithm 2 provides our adaptation of k-means++ seeding [15,17]. It works for all three of our sided/symmetrized and mixed clustering settings:

Pick λ = 1 for the left-sided centroid initialization,
Pick λ = 0 for the right-sided centroid initialization (a left-sided initialization for −α),
Pick an arbitrary λ for the λ-J_α (skew Jeffreys) centroids or mixed λ centroids. Indeed, the initialization is the same (see the MAS procedure in Algorithm 2).
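Since the listing of Algorithm 2 is not reproduced in this text, the following Python sketch should be read only as a k-means++-style illustration under explicit assumptions, not as the MAS procedure itself. It assumes that the first seed is drawn uniformly and that each further seed h is drawn with probability proportional to the smallest mixed divergence M_{λ,α}(c : h : c) to an already chosen seed c (each seed serving as both left and right center). The function names are hypothetical, uniform histogram weights are assumed, and the input is assumed to contain at least k distinct histograms.

```python
import random

def alpha_divergence(p, q, alpha):
    """D_alpha(p : q) on positive histograms; this sketch assumes alpha != +-1."""
    u, v = (1 - alpha) / 2, (1 + alpha) / 2
    return 4 / (1 - alpha ** 2) * sum(
        u * a + v * b - a ** u * b ** v for a, b in zip(p, q))

def mixed_alpha_divergence(p, x, q, lam, alpha):
    """M_{lam,alpha}(p : x : q) = lam D_alpha(p : x) + (1 - lam) D_alpha(x : q)."""
    return (lam * alpha_divergence(p, x, alpha)
            + (1 - lam) * alpha_divergence(x, q, alpha))

def mixed_alpha_seeding(hists, k, lam, alpha, seed=0):
    """k-means++-style seeding (assumed sampling rule, see the lead-in)."""
    rng = random.Random(seed)
    chosen = [rng.randrange(len(hists))]  # first seed: uniform draw
    while len(chosen) < k:
        costs = []
        for j, h in enumerate(hists):
            if j in chosen:
                costs.append(0.0)  # an already chosen seed is never redrawn
            else:
                c = min(mixed_alpha_divergence(hists[i], h, hists[i], lam, alpha)
                        for i in chosen)
                costs.append(max(c, 0.0))  # guard against round-off
        # Draw the next seed with probability proportional to its cost.
        chosen.append(rng.choices(range(len(hists)), weights=costs, k=1)[0])
    return [hists[j] for j in chosen]
```

Only divergence evaluations between data points are needed here, which reflects the point made in the text that seeding does not require computing any centroid.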

Our proof follows and generalizes the proof described for the case of mixed Bregman seeding [15] (Lemma 2). In fact, our proof is more precise, as it quantifies the expected potential with respect to the optimum only, whereas in [15], the optimal potential is averaged with a dual optimal potential, which depends on the optimal centers, but may be larger than the optimum sought.

Theorem 1. Let C_λ,α denote for short the cost function related to the clustering type chosen (left-, right-, skew Jeffreys or mixed) in MAS and $C_{λ, α}^{o p t}$ denote the optimal related clustering in k clusters, for λ ∈ [0, 1] and α ∈ (−1, 1). Then, on average, with respect to distribution (14), the initial clustering of MAS satisfies:

(15)

Here $f (λ) = max {\frac{1 - λ}{λ}, \frac{λ}{1 - λ}}$ , g(k) = 2(2 + log k), $z (α) = {(\frac{1 + | α |}{1 - | α |})}^{\frac{8 | α |^{2}}{{(1 - | α |)}^{2}}}$ , $h (α) = {max}_{i} p_{i}^{| α |} / {min}_{i} p_{i}^{| α |}$ ; the min is defined on strictly positive coordinates, and π denotes distribution of Algorithm 2.

Remark 2. The bound is particularly good when λ is close to 1/2, and in particular for the α-Jeffreys clustering, as in these cases, the only additional penalty compared to the Euclidean case [17] is h²(α), a penalty that relies on an optimal triangle inequality for α-divergences that we provide in Lemma 8 below.

Remark 3. This guaranteed initialization is particularly useful for α-Jeffreys clustering, as there is no closed form solution for the centroids (except when α = ±1, see [32]).

Algorithm 3 presents the general hard mixed k-means clustering, which can be adapted also to left- (λ = 1) and right- (λ = 0) α-clustering.

For skew Jeffreys centers, since the centroids are not available in closed form [32], we adopt a variational approach of k-means by updating iteratively the centroid in each cluster (thus improving the overall loss function without computing the optimal centroids that would eventually require infinitely many iterations).

4. Sided, Symmetrized and Mixed α-Centroids

The k-means clustering requires assigning data elements to their closest cluster center and then updating those cluster centers by taking their centroids. This section investigates the centroid computations for the sided, symmetrized and mixed α-divergences.

Note that the mixed α-seeding presented in Section 3 does not require computing centroids and, yet, guarantees probabilistically a good clustering partition.

Since mixed α-divergences are f-divergences, we start with the generic f-centroids.

4.1. Csiszár f-Centroids

The centroids induced by f-divergences of a set of positive measures (that relaxes the normalisation constraint) have been studied by Ben-Tal et al. [33]. Those entropic centroids are shown to be unique, since f-divergences are convex statistical distances in both arguments. Let E_f denote the energy to minimize when considering f-divergences:

E_{f} ≐ \min_{x \in X} I_{f} (ℋ : x) = \min_{x \in X} \sum_{j = 1}^{n} w_{j} I_{f} (h_{j} : x),

(16)

= \min_{x \in X} \sum_{j = 1}^{n} w_{j} \sum_{i = 1}^{d} h_{j}^{i} f (\frac{x^{i}}{h_{j}^{i}}) .

(17)

When the domain is the open probability simplex $X = Δ_{d}$ , we get a constrained optimisation problem to solve. We transform this constrained minimisation problem (i.e., x ∈ Δ_d) into an equivalent unconstrained minimisation problem by using the Lagrange multiplier, γ:

\min_{x \in ℝ^{d}} \sum_{j = 1}^{n} w_{j} I_{f} (h_{j} : x) + γ (\sum_{i = 1}^{d} x^{i} - 1) .

(18)

Taking the derivatives according to xⁱ, we get:

\forall i \in \{1, \dots, d\}, \sum_{j = 1}^{n} w_{j} f^{'} (\frac{x^{i}}{h_{j}^{i}}) + γ = 0.

(19)

We now consider this equation for α-divergences and symmetrized α-divergences, both f-divergences.

4.2. Sided Positive and Frequency α-Centroids

The positive sided α-centroids for a set of weighted histograms were reported in [34] using representational Bregman divergences. We summarise the results in the following theorem:

Theorem 2 (Sided positive α-centroids [34]). The left-sided l_α and right-sided r_α positive weighted α-centroid coordinates of a set of n positive histograms h₁,…, h_n are weighted α-means:

r_{α}^{i} = f_{α}^{- 1} (\sum_{j = 1}^{n} w_{j} f_{α} (h_{j}^{i})), l_{α}^{i} = r_{- α}^{i}

with $f_{α} (x) = \begin{cases} x^{\frac{1 - α}{2}} & α \neq 1, \\ \log x & α = 1. \end{cases}$

Furthermore, the frequency-sided α-centroids are simply the normalized-sided α-centroids.

Theorem 3 (Sided frequency α-centroids [16]). The coordinates of the sided frequency α-centroids of a set of n weighted frequency histograms are the normalised weighted α-means.

Table 1 summarizes the results concerning the sided positive and frequency α-centroids.
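Theorems 2 and 3 translate directly into code. The sketch below is a minimal illustration with hypothetical function names; it assumes the weights sum to one and computes the sided centroids as weighted quasi-arithmetic means, then normalizes them for the frequency case.

```python
import math

def right_alpha_centroid(hists, weights, alpha):
    """r_alpha: coordinate-wise weighted alpha-mean for f_alpha(x) = x^((1-alpha)/2),
    with f_alpha(x) = log x at alpha = 1. Weights are assumed to sum to one."""
    d = len(hists[0])
    if alpha == 1:  # geometric mean
        return [math.exp(sum(w * math.log(h[i]) for h, w in zip(hists, weights)))
                for i in range(d)]
    e = (1 - alpha) / 2
    return [sum(w * h[i] ** e for h, w in zip(hists, weights)) ** (1 / e)
            for i in range(d)]

def left_alpha_centroid(hists, weights, alpha):
    """l_alpha = r_{-alpha} (Theorem 2)."""
    return right_alpha_centroid(hists, weights, -alpha)

def normalize(h):
    """Frequency histogram obtained by dividing a positive histogram by its mass."""
    total = sum(h)
    return [x / total for x in h]
```

For example, with α = 0 the right-sided centroid of two equally weighted histograms squares the average of the square-rooted bins, and applying `normalize` to any sided positive centroid gives the corresponding frequency centroid of Theorem 3.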

4.3. Mixed α-Centroids

The mixed α-centroids for a set of n weighted histograms are defined as the minimizers of:

\sum_{j} w_{j} M_{λ, α} (l : h_{j} : r) .

(20)

We state the theorem generalizing [15]:

Theorem 4. The two mixed α-centroids are the left-sided and right-sided α-centroids.
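Combining Theorem 4 with the assignment rule of mixed k-means gives the following Python sketch. It is an illustration under stated assumptions rather than a transcription of the paper's algorithm listings: histograms are uniformly weighted, α ≠ ±1, seeds are drawn uniformly at random with identical left and right centers, and all function names are hypothetical.

```python
import random

def alpha_divergence(p, q, alpha):
    """D_alpha(p : q) on positive histograms; this sketch assumes alpha != +-1."""
    u, v = (1 - alpha) / 2, (1 + alpha) / 2
    return 4 / (1 - alpha ** 2) * sum(
        u * a + v * b - a ** u * b ** v for a, b in zip(p, q))

def mixed(l, h, r, lam, alpha):
    """M_{lam,alpha}(l : h : r) = lam D_alpha(l : h) + (1 - lam) D_alpha(h : r)."""
    return lam * alpha_divergence(l, h, alpha) + (1 - lam) * alpha_divergence(h, r, alpha)

def alpha_mean(hists, alpha):
    """Uniformly weighted alpha-mean: the right-sided alpha-centroid r_alpha."""
    e, n, d = (1 - alpha) / 2, len(hists), len(hists[0])
    return [(sum(h[i] ** e for h in hists) / n) ** (1 / e) for i in range(d)]

def mixed_alpha_kmeans(hists, k, lam, alpha, max_iter=50, seed=0):
    rng = random.Random(seed)
    init = rng.sample(hists, k)
    left = [list(c) for c in init]   # initialization with l_i = r_i
    right = [list(c) for c in init]
    labels = None
    for _ in range(max_iter):
        # Assignment with respect to the mixed divergence to the two centers.
        new = [min(range(k), key=lambda c: mixed(left[c], h, right[c], lam, alpha))
               for h in hists]
        if new == labels:
            break
        labels = new
        for c in range(k):
            members = [h for h, lab in zip(hists, labels) if lab == c]
            if members:
                left[c] = alpha_mean(members, -alpha)   # l_alpha = r_{-alpha}
                right[c] = alpha_mean(members, alpha)   # r_alpha
    return left, right, labels
```

Each cluster keeps two prototypes, and both are updated in closed form, so no iterative centroid solver is needed in this mixed setting.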

Figure 1 depicts some clustering result with our α-clustering software. We remark that the clusters found are all approximately subclusters of the “distinct” clusters that appear on the figure. When those distinct clusters are actually the optimal clusters—which is likely to be the case when they are separated by large minimal distance to other clusters—this is clearly a desirable qualitative property as long as the number of experimental clusters is not too large compared to the number of optimal clusters. We remark also that in the experiment displayed, there is no closed form solution for the cluster centers.

4.4. Symmetrized Jeffreys-Type α-Centroids

The Kullback–Leibler divergence can be symmetrized in various ways: Jeffreys divergence, Jensen–Shannon divergence and Chernoff information, just to mention a few. Here, we consider the following symmetrization of α-divergences extending Jeffreys J-divergence:

\begin{array}{l} S_{α} (p, q) = \frac{1}{2} (D_{α} (p : q) + D_{α} (q : p)) = S_{- α} (p, q), \\ = M_{\frac{1}{2}, α} (p : q : p) . \end{array}

(21)

For α = ±1, we get half of Jeffreys divergence:

S_{\pm 1} (p, q) = \frac{1}{2} \sum_{i = 1}^{d} (p^{i} - q^{i}) \log \frac{p^{i}}{q^{i}}

In particular, when p and q are frequency histograms, we have for α ≠ ±1:

J_{α} (\tilde{p} : \tilde{q}) = \frac{8}{1 - α^{2}} (1 - \sum_{i = 1}^{d} H_{\frac{1 - α}{2}} ({\tilde{p}}^{i}, {\tilde{q}}^{i})),

(22)

where $H_{\frac{1 - α}{2}} (a, b)$ is a symmetric Heinz mean [35,36]:

H_{β} (a, b) = \frac{a^{β} b^{1 - β} + a^{1 - β} b^{β}}{2} .

Heinz means interpolate the arithmetic and geometric means and satisfy, for β ∈ [0, 1], the inequality:

\sqrt{a b} = H_{\frac{1}{2}} (a, b) \leq H_{β} (a, b) \leq H_{0} (a, b) = \frac{a + b}{2} .

(Another interesting property of Heinz means is the integral representation of the logarithmic mean: $L (x, y) = \frac{x - y}{\log x - \log y} = \int_{0}^{1} H_{β} (x, y) d β$ . This allows one to prove easily that $\sqrt{x y} \leq L (x, y) \leq \frac{x + y}{2}$ .)
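The integral representation of the logarithmic mean can be checked numerically. The following sketch uses a plain midpoint rule; the function names and the number of quadrature steps are illustrative choices.

```python
import math

def heinz_mean(a, b, beta):
    """H_beta(a, b) = (a^beta b^(1-beta) + a^(1-beta) b^beta) / 2."""
    return (a ** beta * b ** (1 - beta) + a ** (1 - beta) * b ** beta) / 2

def logarithmic_mean(x, y):
    """L(x, y) = (x - y) / (log x - log y) for distinct positive x and y."""
    return (x - y) / (math.log(x) - math.log(y))

def integrated_heinz_mean(x, y, steps=10000):
    """Midpoint-rule approximation of the integral of H_beta(x, y) over [0, 1]."""
    return sum(heinz_mean(x, y, (s + 0.5) / steps) for s in range(steps)) / steps
```

With x = 2 and y = 5, the quadrature value agrees with `logarithmic_mean(2, 5)` to several digits, and both lie between the geometric mean and the arithmetic mean.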

The J_α-divergence is a Csiszár f-divergence [24,25].

Observe that it is enough to consider α ∈ [0, ∞) and that the symmetrized α-divergence for positive and frequency histograms coincide only for α = ±1.

For α = ±1, S_α(p, q) tends to the Jeffreys divergence:

J (p, q) = KL (p, q) + KL (q, p) = \sum_{i = 1}^{d} (p^{i} - q^{i}) (\log p^{i} - \log q^{i}) .

(23)

The Jeffreys divergence writes mathematically the same for frequency histograms:

J (\tilde{p}, \tilde{q}) = KL (\tilde{p}, \tilde{q}) + KL (\tilde{q}, \tilde{p}) = \sum_{i = 1}^{d} ({\tilde{p}}^{i} - {\tilde{q}}^{i}) (\log {\tilde{p}}^{i} - \log {\tilde{q}}^{i}) .

(24)

We state the results reported in [32]:

Theorem 5 (Jeffreys positive centroid [32]). The Jeffreys positive centroid c = (c¹,…, c^d) of a set {h₁,…, h_n} of n weighted positive histograms with d bins can be calculated component-wise exactly using the Lambert W analytic function:

c^{i} = \frac{a^{i}}{W (\frac{a^{i}}{g^{i}} e)},

where $a^{i} = \sum_{j = 1}^{n} π_{j} h_{j}^{i}$ denotes the coordinate-wise arithmetic weighted means and $g^{i} = \prod_{j = 1}^{n} {(h_{j}^{i})}^{π_{j}}$ the coordinate-wise geometric weighted means.

The Lambert analytic function W [37] (positive branch) is defined by $W (x) e^{W (x)} = x$ for x ≥ 0.
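The closed form of Theorem 5 can be evaluated with a few lines of Python. In this sketch, the Lambert W function is computed by a simple Newton iteration so that the code depends only on the standard library; the function names, the stopping tolerance and the starting point are illustrative choices, and the weights are assumed to sum to one.

```python
import math

def lambert_w(x, tol=1e-14, max_iter=100):
    """Positive branch W(x) for x >= 0, solving w * exp(w) = x by Newton's method."""
    w = math.log1p(x)  # starting point at or above the root when x >= 0
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1))
        w -= step
        if abs(step) <= tol * (1 + abs(w)):
            break
    return w

def jeffreys_positive_centroid(hists, weights):
    """Coordinate-wise c^i = a^i / W(e * a^i / g^i), with a^i and g^i the weighted
    arithmetic and geometric means of the i-th bins (weights sum to one)."""
    d = len(hists[0])
    centroid = []
    for i in range(d):
        a = sum(w * h[i] for h, w in zip(hists, weights))
        g = math.exp(sum(w * math.log(h[i]) for h, w in zip(hists, weights)))
        centroid.append(a / lambert_w(math.e * a / g))
    return centroid
```

A quick sanity check is that a set made of a single histogram is its own centroid: then a^i = g^i, and W(e) = 1.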

Theorem 6 (Jeffreys frequency centroid [32]). Let $\tilde{c}$ denote the Jeffreys frequency centroid and ${\tilde{c}}^{'} = \frac{c}{w_{c}}$ the normalised Jeffreys positive centroid. Then, the approximation factor $α_{{\tilde{c}}^{'}} = \frac{S_{1} ({\tilde{c}}^{'}, \tilde{ℋ})}{S_{1} (\tilde{c}, \tilde{ℋ})}$ is such that $1 \leq α_{{\tilde{c}}^{'}} \leq \frac{1}{w_{c}}$ (with w_c ≤ 1)

Therefore, we shall consider α ≠ ±1 in the remainder.

We state the following lemma generalizing the former results in [38] that were tailored to the symmetrized Kullback–Leibler divergence or the symmetrized Bregman divergence [14]:

Lemma 1 (Reduction property). Computing the symmetrized J_α-centroid of a set of n weighted histograms amounts to computing the symmetrized α-centroid for the weighted α-mean and −α-mean:

\min_{x} J_{α} (x, ℋ) = \min_{x} (D_{α} (x : r_{α}) + D_{α} (l_{α} : x)) .

Proof. The minimization problem ${min}_{x} S_{α} (x, ℋ) = {min}_{x} \sum_{j = 1}^{n} w_{j} S_{α} (x, h_{j})$ reduces to the following minimization:

min \sum_{i = 1}^{d} x^{i} - {(x^{i})}^{\frac{1 + α}{2}} {\bar{h}}_{α}^{i} - {(x^{i})}^{\frac{1 - α}{2}} {\bar{h}}_{- α}^{i} .

(25)

This is equivalent to minimizing:

Note that for α = ±1, the lemma states that the minimization problem is equivalent to minimizing KL(a : x) + KL(x : g) with respect to x, where a = l₁ and g = r₁ denote the arithmetic and geometric means, respectively. □

The lemma states that the optimization problem with n weighted histograms is equivalent to the optimization with only two equally weighted histograms.

Computing the positive symmetrized α-centroid is equivalent to computing a symmetrized representational Bregman centroid [14,34].

The frequency symmetrized α-centroid is the solution of the following minimization problem:

\min_{\tilde{x} \in Δ_{d}} \sum_{j} w_{j} S_{α} (\tilde{x}, {\tilde{h}}_{j}) .

Instead of seeking $\tilde{x}$ in the probability simplex, we can optimize on the unconstrained domain $ℝ^{d - 1}$ by using a reparameterization. Indeed, frequency histograms belong to the exponential families [39] (multinomials).

Exponential families also include many other continuous distributions, like the Gaussian, Beta or Dirichlet distributions. It turns out the α-divergences can be computed in closed-form for members of the same exponential family:

Lemma 2. The α-divergence for distributions belonging to the same exponential families amounts to computing a divergence on the corresponding natural parameters:

A_{α} (p : q) = \frac{4}{1 - α^{2}} (1 - e^{- J_{F}^{(\frac{1 - α}{2})} (θ_{p} : θ_{q})}),

where $J_{F}^{(β)} (θ_{1} : θ_{2}) = β F (θ_{1}) + (1 - β) F (θ_{2}) - F (β θ_{1} + (1 - β) θ_{2})$ is a skewed Jensen divergence defined for the log-normaliser F of the family.

The proof follows from the fact that $\int p^{α} (x) q^{1 - α} (x) d x = e^{- J_{F}^{(α)} (θ_{p} : θ_{q})}$ ; see [40].

First, we convert a frequency histogram $\tilde{h}$ to its natural parameter θ with $θ^{i} = \log \frac{{\tilde{h}}^{i}}{{\tilde{h}}^{d}}$ ; see [39].

The log-normaliser is a non-separable convex function $F (θ) = \log (1 + \sum_{i = 1}^{d - 1} e^{θ^{i}})$ . To convert back a multinomial to a frequency histogram with d bins, we first set ${\tilde{h}}^{d} = \frac{1}{1 + \sum_{l = 1}^{d - 1} e^{θ^{l}}}$ and then retrieve the other bin values as ${\tilde{h}}^{i} = {\tilde{h}}^{d} e^{θ^{i}}$ .
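The conversions above and the closed form of Lemma 2 can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: all function names are hypothetical, bins are assumed strictly positive, and α ≠ ±1. The direct formula used for comparison follows from the integral identity quoted in the proof, with the integral replaced by a sum over bins.

```python
import math


def to_natural(freq):
    """theta^i = log(h^i / h^d) for i = 1..d-1; the last bin is the reference."""
    return [math.log(h / freq[-1]) for h in freq[:-1]]


def log_normalizer(theta):
    """F(theta) = log(1 + sum_i exp(theta^i))."""
    return math.log(1.0 + sum(math.exp(t) for t in theta))


def to_frequency(theta):
    """Inverse map: recover the d bins from the d-1 natural parameters."""
    last = 1.0 / (1.0 + sum(math.exp(t) for t in theta))
    return [last * math.exp(t) for t in theta] + [last]


def skew_jensen(theta1, theta2, beta):
    """J_F^(beta)(theta1 : theta2) for the multinomial log-normalizer."""
    mix = [beta * a + (1.0 - beta) * b for a, b in zip(theta1, theta2)]
    return (beta * log_normalizer(theta1)
            + (1.0 - beta) * log_normalizer(theta2)
            - log_normalizer(mix))


def alpha_divergence_natural(p, q, alpha):
    """Lemma 2: alpha-divergence computed on the natural parameters."""
    j = skew_jensen(to_natural(p), to_natural(q), (1.0 - alpha) / 2.0)
    return 4.0 / (1.0 - alpha ** 2) * (1.0 - math.exp(-j))


def alpha_divergence_direct(p, q, alpha):
    """Same quantity computed directly on the frequency histograms."""
    s = sum(a ** ((1.0 - alpha) / 2.0) * b ** ((1.0 + alpha) / 2.0)
            for a, b in zip(p, q))
    return 4.0 / (1.0 - alpha ** 2) * (1.0 - s)
```

Both routes should agree up to floating-point error on any pair of frequency histograms, which gives a quick sanity check when implementing the reparameterization.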

The centroids with respect to skewed Jensen divergences have been investigated in [13,40].

Remark 4. Note that for the special case of α = 0 (squared Hellinger centroid), the sided and symmetrized centroids coincide. In that case, the coordinates $s_{0}^{i}$ of the squared Hellinger centroid are:

s_{0}^{i} = {(\sum_{j = 1}^{n} w_{j} \sqrt{h_{j}^{i}})}^{2}, 1 \leq i \leq d .
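The formula of Remark 4 translates directly into code. The sketch below is a hypothetical helper (the name `hellinger_centroid` is not from the paper) and assumes positive histograms given as equal-length lists, with weights summing to one.

```python
import math


def hellinger_centroid(hists, weights):
    """Coordinates s_0^i = (sum_j w_j * sqrt(h_j^i))^2 of the squared Hellinger centroid."""
    d = len(hists[0])
    return [sum(w * math.sqrt(h[i]) for h, w in zip(hists, weights)) ** 2
            for i in range(d)]


# Two positive histograms with uniform weights.
centroid = hellinger_centroid([[1.0, 4.0], [9.0, 16.0]], [0.5, 0.5])
```

For frequency histograms, the caption of Table 1 indicates that frequency centroids are normalized positive centroids, so the output would presumably be rescaled to sum to one in that setting.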

Remark 5. The symmetrized positive α-centroids can be solved in special cases (α = ±3, α = ±1 corresponding to the symmetrized χ² and Jeffreys positive centroids). For frequency centroids, when dealing with binary histograms (d = 2), we have only one degree of freedom and can solve the binary frequency centroids. Binary histograms (and mixtures thereof) are used in computer vision and pattern recognition [41].

Remark 6. Since α-divergences are Csiszár f-divergences and f-divergences can always be symmetrized by taking generator $s (t) = f (t) + t f (\frac{1}{t})$ , we deduce that symmetrized α-divergences S_α are f-divergences for the generator:

f (t) = - \log ((1 - α) + α t) - t \log ((1 - α) + \frac{α}{t}) .

(26)

Hence, S_α-divergences are convex in both arguments, and the s_α-centroids are unique.

5. Soft Mixed α-Clustering

Algorithm 4 reports the general clustering with soft membership, which can be adapted to left (λ_init = 1), right (λ_init = 0) or mixed clustering. We have not considered a weighted histogram set in order not to overload the notation and because the extension is straightforward.

Again, for skew Jeffreys centers, we shall adopt a variational approach. Notice that the soft clustering approach learns all parameters, including λ (if not constrained to zero or one) and α ∈ ℝ. This is not the case for Matsuyama’s α-expectation maximization (EM) algorithm [42] in which α is fixed beforehand (and, thus, not learned).

Assuming we model the prior for histograms by:

\begin{array}{l} p_{λ, α, j} (h_{i}) \propto \\ λ \exp (- D_{α} (l_{j} : h_{i})) + (1 - λ) \exp (- D_{α} (h_{i} : r_{j})), \end{array}

(27)

the negative log-likelihood involves the α-dependent quantity:

(28)

because of the concavity of the logarithm function. Therefore, the maximization step for α involves finding:

\arg max_{α} \sum_{j = 1}^{k} \sum_{i = 1}^{m} M_{λ, α} (l_{j} : h_{i} : r_{j}) p (j | h_{i}) .

(29)

No closed-form solution is known, so we compute the gradient update in Algorithm 4 with:

\begin{array}{l} \frac{\partial M_{λ, α} (l_{j} : h_{i} : r_{j})}{\partial α} = \\ λ \frac{\partial D_{α} (l_{j} : h_{i})}{\partial α} + (1 - λ) \frac{\partial D_{α} (h_{i} : r_{j})}{\partial α}, \end{array}

(30)

(31)

The update in λ is easier as:

\frac{\partial M_{λ, α} (l_{j} : h_{i} : r_{j})}{\partial λ} = D_{α} (l_{j} : h_{i}) - D_{α} (h_{i} : r_{j}) .

(32)

Maximizing the likelihood in λ would imply choosing λ ∈ {0, 1} (a hard choice for left/right centers), yet we prefer the soft update for the parameter, as for α.
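The two partial derivatives can be sketched as follows. The code assumes the mixed divergence M_{λ,α}(l : h : r) = λ D_α(l : h) + (1 − λ) D_α(h : r), which is the form implied by Equations (30) and (32), and it assumes the positive-array α-divergence whose centroid objective is derived in Appendix C (α ≠ ±1). The derivative in α is approximated here by a central finite difference, as a stand-in for the analytic expression; all function names are illustrative.

```python
def alpha_divergence(p, q, alpha):
    """D_alpha(p : q) for positive arrays, alpha != +-1 (assumed closed form)."""
    e_p, e_q = (1.0 - alpha) / 2.0, (1.0 + alpha) / 2.0
    return 4.0 / (1.0 - alpha ** 2) * sum(
        e_p * a + e_q * b - a ** e_p * b ** e_q for a, b in zip(p, q))


def mixed_divergence(l, h, r, lam, alpha):
    """M_{lambda,alpha}(l : h : r) = lam * D(l : h) + (1 - lam) * D(h : r)."""
    return (lam * alpha_divergence(l, h, alpha)
            + (1.0 - lam) * alpha_divergence(h, r, alpha))


def d_mixed_d_lambda(l, h, r, alpha):
    """Equation (32): the derivative in lambda does not depend on lambda."""
    return alpha_divergence(l, h, alpha) - alpha_divergence(h, r, alpha)


def d_mixed_d_alpha(l, h, r, lam, alpha, eps=1e-6):
    """Central finite difference in alpha (numerical stand-in, not the analytic form)."""
    return (mixed_divergence(l, h, r, lam, alpha + eps)
            - mixed_divergence(l, h, r, lam, alpha - eps)) / (2.0 * eps)
```

Because M is linear in λ, its derivative has a constant sign for fixed centers, which is why an exact maximization pushes λ to an endpoint of [0, 1] while a small gradient step keeps it soft.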

6. Conclusions

The family of α-divergences plays a fundamental role in information geometry: These statistical distortion measures are the canonical divergences of dual spaces on probability distribution manifolds with constant curvature $κ = \frac{1 - α^{2}}{4}$ and the canonical divergences of dually flat manifolds for positive distribution manifolds [19].

In this work, we have presented three techniques for clustering (positive or frequency) histograms using k-means:

(1): Sided left or right α-centroid k-means,
(2): Symmetrized Jeffreys-type α-centroid (variational) k-means, and
(3): Coupled k-means with respect to mixed α-divergences relying on dual α-centroids.

Sided and mixed dual centroids are always available in closed-forms and are therefore highly attractive from the standpoint of implementation. Symmetrized Jeffreys centroids are in general not available in closed-form and require one to implement a variational k-means by updating incrementally the cluster centroids in order to monotonically decrease the k-means loss function. From the clustering standpoint, this appears not to be a problem when guaranteed expected approximations to the optimal clustering are enough.

Indeed, we also presented and analyzed an extension of k-means++ [17] for seeding those k-means algorithms. The mixed α-seeding initializations do not require one to calculate centroids and behave like a discrete k-means by picking up the seeds among the data. We reported guaranteed probabilistic clustering bounds. Thus, it yields a fast hard/soft data partitioning technique with respect to mixed or symmetrized α-divergences. Recently, the advantage of clustering using α-divergences by tuning α in applications has been demonstrated in [18]. We thus expect the computationally fast mixed α-seeding with guaranteed performance to be useful in a growing number of applications.
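A rough sketch of such a seeding is given below. It follows the k-means++ pattern of [17], with the squared Euclidean distance replaced by the mixed α-divergence of each histogram to its closest seed couple (c, c). The exact sampling distribution of Algorithm 2 is not reproduced in this section, so the weighting used here is an assumption, as are the function names, the uniform data weights, the restriction α ≠ ±1 and the fixed default random seed.

```python
import random


def alpha_divergence(p, q, alpha):
    """D_alpha(p : q) for positive arrays, alpha != +-1 (assumed closed form)."""
    e_p, e_q = (1.0 - alpha) / 2.0, (1.0 + alpha) / 2.0
    return 4.0 / (1.0 - alpha ** 2) * sum(
        e_p * a + e_q * b - a ** e_p * b ** e_q for a, b in zip(p, q))


def mixed_divergence(l, h, r, lam, alpha):
    """lam * D_alpha(l : h) + (1 - lam) * D_alpha(h : r)."""
    return (lam * alpha_divergence(l, h, alpha)
            + (1.0 - lam) * alpha_divergence(h, r, alpha))


def mixed_alpha_seeding(hists, k, lam, alpha, rng=None):
    """Pick k seeds among the data; each seed c acts as the couple (c, c)."""
    rng = rng or random.Random(0)   # fixed seed only to make the sketch reproducible
    seeds = [rng.randrange(len(hists))]
    while len(seeds) < k:
        # Cost of each histogram with respect to its closest seed couple,
        # clamped at zero to absorb floating-point round-off.
        costs = [max(0.0, min(mixed_divergence(hists[s], h, hists[s], lam, alpha)
                              for s in seeds))
                 for h in hists]
        if sum(costs) <= 0.0:       # every histogram coincides with a seed
            break
        # Sample the next seed with probability proportional to its cost.
        seeds.append(rng.choices(range(len(hists)), weights=costs, k=1)[0])
    return [hists[s] for s in seeds]
```

No centroid is ever computed: the routine only evaluates divergences between data points, which is what makes the seeding cheap compared with a full k-means iteration.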

Acknowledgments

NICTA is funded by the Australian Government as represented by the Department of Broadband, Communication and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.

Author Contributions

All authors contributed equally to the design of the research. The research was carried out by all authors. Frank Nielsen and Richard Nock wrote the paper. Frank Nielsen implemented the algorithms and performed experiments. All authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix

A. Proof Sketch of Theorem 1

We give here the key results allowing one to obtain the proof of the Theorem, following the proof scheme of [15]. To keep the notation light, weights are considered uniform. The extension to non-uniform weights is immediate, as it boils down to duplicating histograms in the histogram set and does not change the approximation result.

Let $A \subseteq ℋ$ be an arbitrary cluster of C_opt. Let us define $U_{A}$ and $π_{A}$ as the uniform and biased distributions conditioned to $A$ . The key to the proof is to relate the expected potential of $A$ under $U_{A}$ and $π_{A}$ to its contribution to the optimal potential.

Lemma 3. Let $A \subseteq ℋ$ be an arbitrary cluster of C_opt. Then:

where $U_{A}$ is the uniform distribution over $A$ .

Proof. α-coordinates have the property that for any subset $A \subseteq ℋ$ , $(1 / | A |) \sum_{p \in A} u_{α} (p) = u_{α} (r_{α, A})$ . Hence, we have:

(33)

Because D_α(p : q) = D_{−α}(q : p) and l_α = r_{−α}, we obtain:

(34)

It comes now from (33) and (34) that:

(35)

This gives the left-hand side equality of the Lemma. The right-hand side follows from the fact that $E_{c \sim U_{A}} [M_{λ, - α} (A, c)] = M_{opt, 1 - λ, α} (A) + M_{opt, 1 - λ, - α} (A)$ . □

Instead of $M_{opt, λ, α} (A) + M_{opt, λ, - α} (A)$ , we want a term depending solely on $M_{opt, λ, α} (A)$ as it is the “true” optimum. We now give two lemmata that shall be useful in obtaining this upper bound. The first is of independent interest, as it shows that any α-divergence is a scaled, squared Hellinger distance between geometric means of points.

Lemma 4. For any p, q and α ≠ 1, there exists r ∈ [p, q], such that (1 − α)²D_α(p : q) = D₀(p^{1−α}r^{α} : q^{1−α}r^{α}).

Proof. By the definition of Bregman divergences, for any x, y, there exists some z ∈ [x, y], such that:

and since u_α is continuous and strictly increasing, for any p, q, there exists some r ∈ [p, q], such that:

Lemma 5. Let discrete random variable x take non-negative values x₁, x₂,…, x_m with uniform probabilities. Then, for any β > −1, we have var(x^{1+β}/u^β) ≤ var(x), with $u ≐ {(1 + β)}^{β} {max}_{i} x_{i}$ .

Proof. First, ∀β > −1, remark that for any x, function f(x) = x(u^β − x^β) is increasing for x ≤ u/(1 + β)^β. Hence, assuming that the x_i's are put in non-increasing order without loss of generality, we have f(x_i) ≥ f(x_j), and so, $x_{i} (u^{β} - x_{i}^{β}) \geq x_{j} (u^{β} - x_{j}^{β})$ , ∀i ≥ j, as long as x_i ≤ u/(1 + β)^β. Choosing u = x₁(1 + β)^β yields, after reordering and putting the exponent, ${(x_{i}^{1 + β} - x_{j}^{1 + β})}^{2} \leq {(x_{i} u^{β} - x_{j} u^{β})}^{2}$ . Hence:

Dividing the leftmost and rightmost terms by u^{2β} and using the fact that var(λx) = λ²var(x) yields the statement of the Lemma. □

We are now ready to upper bound $M_{opt, λ, - α} (A)$ as a function of $M_{opt, λ, α} (A)$ .

Lemma 6. For any cluster $A$ of C_opt,

where z(α), f(λ) and h(α) are defined in Theorem 1.

Proof. The case λ ≠ 0, 1 is fast, as we have by definition:

Suppose now that λ = 0 and α ≥ 0. Because $M_{opt, 0, - α} (A) = \sum_{p \in A} D_{- α} (p : r_{- α, A}) = \sum_{p \in A} D_{α} (l_{α, A} : p) = M_{opt, 1, α} (A)$ , what we wish to do is upper bound $\sum_{p \in A} D_{α} (l_{α, A} : p) = M_{opt, 1, α} (A)$ as a function of $\sum_{p \in A} D_{α} (p : r_{α, A}) = M_{opt, 0, α} (A)$ . We use Lemmata 4 and 5 in the following derivations, using r(p) to refer to the r in Lemma 4, assuming α ≥ 0. We also note ${var}_{A} (f (p))$ as the variance, under the uniform distribution over $A$ , of discrete random variable f(p), for $p \in A$ . We have:

(36)

We have used the expression of left centroid $l_{α, A}^{1 + α}$ to simplify the expressions. Now, picking $x_{i} = p_{i}^{\frac{1 - α}{2}}$ , β = 2α/(1 − α) and $u = {(\frac{1 + α}{1 - α})}^{\frac{2 α}{1 - α}} {max}_{A} p^{\frac{1 - α}{2}}$ in Lemma 5 yields:

(37)

Plugging this in (36) yields:

(38)

(39)

Here, (38) follows backwards the path of derivations that led to (36). The cases λ = 1 or α < 0 are obtained using the same chains of derivations and complete the proof of Lemma 6. □

Lemma 6 can be directly used to refine the bound of Lemma 3 in the uniform distribution. We give the Lemma for the biased distribution, directly integrating the refinement of the bound.

Lemma 7. Let $A$ be an arbitrary cluster of C_opt and C an arbitrary clustering. If we add a random couple (c, c) to C, chosen from $A$ with π as in Algorithm 2, then:

(40)

where f(λ) and h(α) are defined in Theorem 1.

Proof. The proof essentially follows the proof of Lemma 3 in [15]. To complete it, we need a triangle inequality involving α-divergences. We give it here.

Lemma 8. For any p, q, r and α, we have:

(41)

(where the min is over strictly positive values)

Remark: take α = 0; we find the triangle inequality for the squared Hellinger distance.

Proof. Using the proof of Lemma 2 in [15] for Bregman divergence $D_{φ_{α}}$ , we get:

\begin{array}{l} \sqrt{D_{φ_{α}} (x : z)} \\ \leq ρ (α) (\sqrt{D_{φ_{α}} (x : y)} + \sqrt{D_{φ_{α}} (y : z)}), \end{array}

(42)

where:

ρ (α) = max_{u, υ} \frac{{(1 + \frac{1 - α}{2} u)}^{\frac{2 α}{1 - α}}}{{(1 + \frac{1 - α}{2} υ)}^{\frac{2 α}{1 - α}}} .

(43)

Taking x = u_α(p), y = u_α(q), z = u_α(r) yields $ρ (α) = {max}_{s, t \in \{p_{i}, q_{i}, r_{i}\}} {(s / t)}^{| α |}$ and the statement of Lemma 8. □

The rest of the proof of Lemma 7 follows the proof of Lemma 3 in [15]. □

We now have all of the ingredients for our proof, and it remains to use Lemma 4 in [15] to complete the proof of Theorem 1.

B. Properties of α-Divergences

For positive arrays p and q, the α-divergence D_α(p : q) can be defined as an equivalent representational Bregman divergence [19,34] $B_{φ_{α}} (u_{α} (p) : u_{α} (q))$ over the (u_α, υ_α)-structure [43] with:

φ_{α} (x) ≐ \frac{2}{1 + α} {(1 + \frac{1 - α}{2} x)}^{\frac{2}{1 - α}},

(44)

u_{α} (p) ≐ \frac{2}{1 - α} (p^{\frac{1 - α}{2}} - 1),

(45)

υ_{α} (p) ≐ \frac{2}{1 + α} p^{\frac{1 + α}{2}},

(46)

where we assume that α ≠ ±1. Otherwise, for α = ±1, we compute D_α(p : q) by taking the sided Kullback–Leibler divergence extended to positive arrays.
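The representational Bregman form can be checked numerically. The sketch below evaluates the Bregman divergence of φ_α on the u_α-coordinates and compares it with a direct evaluation of D_α on positive arrays. It relies on two assumptions beyond Equations (44)–(46): the derivative of φ_α is obtained by differentiating Equation (44), and the direct closed form of D_α is the one whose centroid objective is derived in Appendix C. Function names are illustrative and α ≠ ±1.

```python
def phi(x, alpha):
    """Equation (44)."""
    return 2.0 / (1.0 + alpha) * (1.0 + (1.0 - alpha) / 2.0 * x) ** (2.0 / (1.0 - alpha))


def phi_prime(x, alpha):
    """Derivative of Equation (44) with respect to x."""
    return (2.0 / (1.0 + alpha)
            * (1.0 + (1.0 - alpha) / 2.0 * x) ** ((1.0 + alpha) / (1.0 - alpha)))


def u(p, alpha):
    """Equation (45): the u_alpha-representation of a positive scalar."""
    return 2.0 / (1.0 - alpha) * (p ** ((1.0 - alpha) / 2.0) - 1.0)


def representational_bregman(p, q, alpha):
    """Sum over bins of the Bregman divergence of phi_alpha at (u_alpha(p), u_alpha(q))."""
    total = 0.0
    for a, b in zip(p, q):
        ua, ub = u(a, alpha), u(b, alpha)
        total += phi(ua, alpha) - phi(ub, alpha) - (ua - ub) * phi_prime(ub, alpha)
    return total


def alpha_divergence(p, q, alpha):
    """Direct evaluation of D_alpha(p : q) on positive arrays (assumed closed form)."""
    e_p, e_q = (1.0 - alpha) / 2.0, (1.0 + alpha) / 2.0
    return 4.0 / (1.0 - alpha ** 2) * sum(
        e_p * a + e_q * b - a ** e_p * b ** e_q for a, b in zip(p, q))
```

Note that `phi_prime(u(q, alpha), alpha)` reduces to υ_α(q) of Equation (46), which is why the pair (u_α, υ_α) appears as a dual coordinate structure.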

In the proof of Theorem 1, we have used two properties of α-divergences of independent interest:

any α-divergence can be explained as a scaled squared Hellinger distance between geometric means of its arguments and a point that belongs to their segment (Lemma 4);
any α-divergence satisfies a generalized triangle inequality (Lemma 8). Notice that this Lemma is optimal in the sense that for α = 0, it is possible to recover the triangle inequality of the Hellinger distance.

The following lemma shows how to bound the mixed divergence as a function of an α-divergence.

Lemma 9. For any positive arrays l, h, r and α ≠ ±1, define $η ≐ λ (1 - α) / (1 - α (2 λ - 1)) \in [0, 1]$ , with $g_{η}^{i} ≐ {(l^{i})}^{η} {(r^{i})}^{1 - η}$ and a_η with $a_{η}^{i} ≐ η l^{i} + (1 - η) r^{i}$ . Then, we have:

Proof. For all index i, we have:

(47)

(48)

The arithmetic-geometric-harmonic (AGH) inequality implies:

It follows that (48) yields:

(49)

(50)

out of which we get the statement of the Lemma. □

C. Sided α-Centroids

For the sake of completeness, we prove the following theorem:

Theorem 7 (Sided positive α-centroids [34]). The left-sided l_α and right-sided r_α positive weighted α-centroid coordinates of a set of n positive histograms h₁,…, h_n are weighted α-means:

r_{α}^{i} = f_{α}^{- 1} (\sum_{j = 1}^{n} w_{j} f_{α} (h_{j}^{i})), l_{α}^{i} = r_{- α}^{i}

with:

f_{α} (x) = \begin{cases} x^{\frac{1 - α}{2}} & α \neq \pm 1, \\ \log x & α = 1. \end{cases}

Proof. We distinguish three cases: α ≠ ±1, α = −1 and α = 1.

First, consider the general case α ≠ ±1. We have to minimize:

Removing all additive terms independent of xⁱ and the overall constant multiplicative factor $\frac{4}{1 - α^{2}} \neq 0$ , we get the following equivalent minimisation problem:

{R^{'}}_{α} (x, ℋ) = \sum_{i = 1}^{d} \left( \frac{1 + α}{2} x^{i} - {(x^{i})}^{\frac{1 + α}{2}} \underbrace{\left(\sum_{j = 1}^{n} w_{j} {(h_{j}^{i})}^{\frac{1 - α}{2}}\right)}_{{\bar{h}}_{α}^{i}} \right),

(51)

where ${\bar{h}}_{α}^{i}$ denotes the following aggregation term:

{\bar{h}}_{α}^{i} = \sum_{j = 1}^{n} w_{j} {(h_{j}^{i})}^{\frac{1 - α}{2}} .

Setting coordinate-wise the derivative of Equation (51) to zero (i.e., $\nabla_{x} {R^{'}}_{α} (x, ℋ) = 0$ ), we get:

\frac{1 + α}{2} - \frac{1 + α}{2} {(x^{i})}^{\frac{α - 1}{2}} {\bar{h}}_{α}^{i} = 0

Thus, we find that the coordinates of the right-sided α-centroids are:

c_{α}^{i} = {({\bar{h}}_{α}^{i})}^{\frac{2}{1 - α}} = {(\sum_{j = 1}^{n} w_{j} {(h_{j}^{i})}^{\frac{1 - α}{2}})}^{\frac{2}{1 - α}} = {\hat{h}}_{α}^{i} .

We recognise the expression of a quasi-arithmetic mean for the strictly monotone generator f_α(x):

r_{α}^{i} = f_{α}^{- 1} (\sum_{j = 1}^{n} w_{j} f_{α} (h_{j}^{i})),

(52)

with:

\begin{matrix} f_{α} (x) = x^{\frac{1 - α}{2}}, & f_{α}^{- 1} (x) = x^{\frac{2}{1 - α}}, α \neq \pm 1. \end{matrix}

Therefore, we conclude that the coordinates of the positive α-centroid are the weighted α-means of the histogram coordinates (for α ≠ ±1). Quasi-arithmetic means are also called in the literature quasi-linear means or f-means.

When α = −1, we search for the right-sided extended Kullback–Leibler divergence centroid by minimising:

R_{- 1} (x; \tilde{ℋ}) = \sum_{j = 1}^{n} w_{j} \sum_{i = 1}^{d} \left( h_{j}^{i} \log \frac{h_{j}^{i}}{x^{i}} + x^{i} - h_{j}^{i} \right) .

It is equivalent to minimizing:

{R^{'}}_{- 1} (x; \tilde{ℋ}) = \sum_{i = 1}^{d} \left( x^{i} - \underbrace{\left(\sum_{j = 1}^{n} w_{j} h_{j}^{i}\right)}_{a^{i}} \log x^{i} \right),

where a denotes the arithmetic mean. Solving coordinate-wise, $c^{i} = a^{i} = \sum_{j = 1}^{n} w_{j} h_{j}^{i}$ .

When α = 1, the right-sided reverse extended KL centroid is a left-sided extended KL centroid. The minimisation problem is:

R_{1} (x; \tilde{ℋ}) = \sum_{j = 1}^{n} w_{j} \sum_{i = 1}^{d} \left( x^{i} \log \frac{x^{i}}{h_{j}^{i}} + h_{j}^{i} - x^{i} \right) .

Since ∑_jw_j = 1, we solve coordinate-wise and find log x = ∑_jw_j log h_j. That is, $r_{1}^{i}$ is the geometric mean:

r_{1}^{i} = \prod_{j = 1}^{n} {(h_{j}^{i})}^{w_{j}} .

Both the arithmetic mean and the geometric mean are power means in the limit case (and hence quasi-arithmetic means). Thus,

r_{α}^{i} = f_{α}^{- 1} (\sum_{j = 1}^{n} w_{j} f_{α} (h_{j}^{i})),

(53)

with:

f_{α} (x) = \begin{cases} x^{\frac{1 - α}{2}} & α \neq \pm 1, \\ \log x & α = 1. \end{cases}
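Theorem 7 can be sketched in Python as a weighted quasi-arithmetic mean per bin. The helper names below are hypothetical, the histograms are assumed to be equal-length lists of strictly positive values, and the weights are assumed to sum to one.

```python
import math


def right_centroid(hists, weights, alpha):
    """Right-sided positive alpha-centroid: weighted f_alpha-mean of each bin."""
    d = len(hists[0])
    if alpha == 1:
        # f_alpha = log: weighted geometric mean.
        return [math.exp(sum(w * math.log(h[i]) for h, w in zip(hists, weights)))
                for i in range(d)]
    e = (1.0 - alpha) / 2.0
    # f_alpha(x) = x^e, with inverse x^(1/e); alpha = -1 gives the arithmetic mean.
    return [sum(w * h[i] ** e for h, w in zip(hists, weights)) ** (1.0 / e)
            for i in range(d)]


def left_centroid(hists, weights, alpha):
    """Left-sided centroid, using l_alpha = r_{-alpha}."""
    return right_centroid(hists, weights, -alpha)
```

For example, with the two histograms (1, 4) and (9, 16) and uniform weights, α = −1 returns the arithmetic mean (5, 10), while α = 1 returns the geometric mean, approximately (3, 8).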

References

1. Baker, L.D.; McCallum, A.K. Distributional clustering of words for text classification, Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, 24–28 August 1998; ACM: New York, NY, USA, 1998; pp. 96–103.
2. Bigi, B. Using Kullback–Leibler distance for text categorization, Proceedings of the 25th European conference on IR research (ECIR), Pisa, Italy, 14–16 April 2003; Springer-Verlag: Berlin/Heidelberg, Germany, 2003; ECIR’03. pp. 305–319.
3. Bag of Words Data Set, Available online: http://archive.ics.uci.edu/ml/datasets/Bag+of+Words (accessed 17 June 2014).
4. Csurka, G.; Bray, C.; Dance, C.; Fan, L. Visual Categorization with Bags of Keypoints, Workshop on Statistical Learning in Computer Vision (ECCV); Xerox Research Centre Europe: Meylan, France, 2004; pp. 1–22.
5. Jégou, H.; Douze, M.; Schmid, C. Improving Bag-of-Features for Large Scale Image Search. Int. J. Comput. Vis 2010, 87, 316–336.
6. Yu, Z.; Li, A.; Au, O.; Xu, C. Bag of textons for image segmentation via soft clustering and convex shift, Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 16–21 June 2012; pp. 781–788.
7. Steinhaus, H. Sur la division des corps matériels en parties. Bull. Acad. Polon. Sci 1956, 1, 801–804. (in French).
8. Lloyd, S.P. Least Squares Quantization in PCM; Technical Report RR-5497; Bell Laboratories: Murray Hill, NJ, USA, 1957.
9. Lloyd, S.P. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137.
10. Chandrasekhar, V.; Takacs, G.; Chen, D.M.; Tsai, S.S.; Reznik, Y.A.; Grzeszczuk, R.; Girod, B. Compressed histogram of gradients: A low-bitrate descriptor. Int. J. Comput. Vis 2012, 96, 384–399.
11. Nock, R.; Nielsen, F.; Briys, E. Non-linear book manifolds: Learning from associations the dynamic geometry of digital libraries, Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries, New York, NY, USA; 2013; pp. 313–322.
12. Kwitt, R.; Vasconcelos, N.; Rasiwasia, N.; Uhl, A.; Davis, B.C.; Häfner, M.; Wrba, F. Endoscopic image analysis in semantic space. Med. Image Anal 2012, 16, 1415–1422.
13. Nielsen, F. A family of statistical symmetric divergences based on Jensen’s inequality 2010, arXiv, 1009.4004.
14. Nielsen, F.; Nock, R. Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory 2009, 55, 2882–2904.
15. Nock, R.; Luosto, P.; Kivinen, J. Mixed Bregman clustering with approximation guarantees, Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, Antwerp, Belgium, 15–19 September 2008; Springer-Verlag: Berlin/Heidelberg, Germany, 2008; pp. 154–169.
16. Amari, S. Integration of Stochastic Models by Minimizing α-Divergence. Neural Comput 2007, 19, 2780–2796.
17. Arthur, D.; Vassilvitskii, S. k-means++: The advantages of careful seeding, Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), New Orleans, LA, USA, 7–9 January 2007; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2007; pp. 1027–1035.
18. Olszewski, D.; Ster, B. Asymmetric clustering using the alpha-beta divergence. Pattern Recognit 2014, 47, 2031–2041.
19. Amari, S. Alpha-divergence is unique, belonging to both f-divergence and Bregman divergence classes. IEEE Trans. Inf. Theory 2009, 55, 4925–4931.
20. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman divergences. J. Mach. Learn. Res 2005, 6, 1705–1749.
21. Teboulle, M. A unified continuous optimization framework for center-based clustering methods. J. Mach. Learn. Res 2007, 8, 65–102.
22. Amari, S.; Nagaoka, H. Methods of Information Geometry; Oxford University Press: Oxford, UK, 2000.
23. Morimoto, T. Markov Processes and the H-theorem. J. Phys. Soc. Jpn 1963, 18, 328–331.
24. Ali, S.M.; Silvey, S.D. A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Ser. B 1966, 28, 131–142.
25. Csiszár, I. Information-type measures of difference of probability distributions and indirect observation. Studi. Sci. Math. Hung 1967, 2, 229–318.
26. Cichocki, A.; Cruces, S.; Amari, S. Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy 2011, 13, 134–170.
27. Zhu, H.; Rohwer, R. Measurements of generalisation based on information geometry. In Mathematics of Neural Networks; Operations Research/Computer Science Interfaces Series; Ellacott, S., Mason, J., Anderson, I., Eds.; Springer: New York, NY, USA, 1997; Volume 8, pp. 394–398.
28. Chernoff, H. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Stat 1952, 23, 493–507.
29. Nielsen, F. An information-geometric characterization of Chernoff information. IEEE Signal Process. Lett 2013, 20, 269–272.
30. Wu, J.; Rehg, J. Beyond the euclidean distance: creating effective visual codebooks using the histogram intersection kernel, Proceedings of 2009 IEEE 12th International Conference on Computer Vision, Kyoto, Japan, 29 September–2 October 2009; pp. 630–637.
31. Bhattacharya, A.; Jaiswal, R.; Ailon, N. A tight lower bound instance for k-means++ in constant dimension. In Theory and Applications of Models of Computation; Lecture Notes in Computer Science; Gopal, T., Agrawal, M., Li, A., Cooper, S., Eds.; Springer International Publishing: New York, NY, USA, 2014; Volume 8402, pp. 7–22.
32. Nielsen, F. Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight approximation for frequency histograms. IEEE Signal Process. Lett 2013, 20, 657–660.
33. Ben-Tal, A.; Charnes, A.; Teboulle, M. Entropic means. J. Math. Anal. Appl 1989, 139, 537–551.
34. Nielsen, F.; Nock, R. The dual Voronoi diagrams with respect to representational Bregman divergences, Proceedings of International Symposium on Voronoi Diagrams (ISVD), Copenhagen, Denmark, 23–26 June 2009; pp. 71–78.
35. Heinz, E. Beiträge zur Störungstheorie der Spektralzerlegung. Math. Ann. 1951, 123, 415–438. (in German).
36. Besenyei, A. On the invariance equation for Heinz means. Math. Inequal. Appl 2012, 15, 973–979.
37. Barry, D.A.; Culligan-Hensley, P.J.; Barry, S.J. Real values of the W-function. ACM Trans. Math. Softw 1995, 21, 161–171.
38. Veldhuis, R.N.J. The centroid of the symmetrical Kullback–Leibler distance. IEEE Signal Process. Lett 2002, 9, 96–99.
39. Nielsen, F.; Garcia, V. Statistical exponential families: A digest with flash cards, 2009. arXiv.org: 0911.4863.
40. Nielsen, F.; Boltz, S. The Burbea-Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory 2011, 57, 5455–5466.
41. Romberg, S.; Lienhart, R. Bundle min-hashing for logo recognition, Proceedings of the 3rd ACM Conference on International Conference on Multimedia Retrieval, Dallas, TX, USA, 16–19 April 2013; ACM: New York, NY, USA, 2013; pp. 113–120.
42. Matsuyama, Y. The alpha-EM algorithm: Surrogate likelihood maximization using alpha-logarithmic information measures. IEEE Trans. Inf. Theory 2003, 49, 692–706.
43. Amari, S.I. New developments of information geometry (26): Information geometry of convex programming and game theory. In Mathematical Sciences (suurikagaku); Number 605; The Science Company: Denver, CO, USA, 2013; pp. 65–74. (In Japanese)

Figure 1. Snapshot of the α-clustering software. Here, n = 800 frequency histograms of three bins with k = 8, α = 0.7 and $λ = \frac{1}{2}$ .

Table 1. Positive and frequency α-centroids: the frequency α-centroids are normalized positive α-centroids, where w(h) denotes the cumulative sum of the histogram bins. The arithmetic mean is obtained for r₋₁ = l₁ and the geometric mean for r₁ = l₋₁.

	Positive centroid	Frequency centroid

Algorithm 1. Mixed divergence-based k-means clustering.

Algorithm 2. Mixed α-seeding; $MAS (ℋ, k, λ, α)$ .

Algorithm 3. Mixed α-hard clustering; $MAhC (ℋ, k, λ, α)$ .

Algorithm 4. Mixed α-soft clustering; $MAsC (ℋ, k, λ, α)$ .

© 2014 by the authors; licensee MDPI, Basel, Switzerland This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

