This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).
Clustering sets of histograms has become popular thanks to the success of the generic bag-of-X method used in text categorization and in visual categorization applications. In this paper, we investigate the use of a parametric family of distortion measures, called the α-divergences, for clustering histograms.
A common task of information retrieval (IR) systems is to classify documents into categories. Given a training set of documents labeled with categories, one seeks to classify new incoming documents. Text categorization [
Classify a new online document: We first compute its word-distribution histogram signature and seek the labeled document whose histogram is most similar, in order to deduce its category tag.
Find the initial set of categories: we cluster all document histograms and assign a category per cluster.
This text classification method based on the bag-of-words (BoW) representation has also been instrumental in computer vision for efficient object categorization [
In summary, clustering histograms with respect to symmetric distances (like the symmetrized KL divergence) is playing an increasing role. It turns out that the symmetrized KL divergence belongs to a one-parameter family of divergences, called the symmetrized α-divergences.
Since divergences
The paper is organized as follows: Section 2 introduces the notion of mixed divergences, presents an extension of
Consider a set
The frequency histograms belong to the (d − 1)-dimensional open probability simplex.
That is, although frequency histograms have d bins, the normalization constraint leaves only d − 1 degrees of freedom.
The celebrated
An important class of divergences on frequency histograms is the class of f-divergences.
Those divergences preserve information monotonicity [
The
Assignment: Assign each histogram
Center relocation: Update the cluster centers by taking their centroids:
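As an illustration of these two alternating steps, here is a minimal, self-contained sketch (our own code and function names, not the paper's implementation) that runs Lloyd-type k-means on positive histograms with the extended Kullback–Leibler divergence KL⁺(p : c) = ∑ p log(p/c) − p + c as the distortion; for this right-sided distortion the relocation step is exact, since the arithmetic mean minimizes ∑_j KL⁺(p_j : c):

```python
import numpy as np

def ext_kl(p, c):
    """Extended KL divergence KL+(p : c) = sum p log(p/c) - p + c for positive arrays."""
    return np.sum(p * np.log(p / c) - p + c, axis=-1)

def kmeans_kl(H, k, iters=20, init=None, seed=0):
    """Lloyd k-means on positive histograms H (n x d) under KL+(p : center)."""
    rng = np.random.default_rng(seed)
    centers = (H[rng.choice(len(H), size=k, replace=False)]
               if init is None else np.array(init, float))
    for _ in range(iters):
        # Assignment: each histogram goes to its closest center.
        labels = np.argmin([ext_kl(H, c) for c in centers], axis=0)
        # Relocation: the arithmetic mean is the exact right-sided KL+ centroid.
        centers = np.stack([H[labels == i].mean(axis=0) if np.any(labels == i)
                            else centers[i] for i in range(k)])
    return labels, centers
```

For other (sided or mixed) α-distortions, only the two commented lines change: the distortion in the assignment and the centroid formula in the relocation.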
Since divergences are potentially asymmetric, we can define two-sided
In order to handle those sided and symmetrized
A mixed divergence includes the sided divergences for λ ∈ {0, 1} and the symmetrized (arithmetic mean) divergence for
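In code, the mixed divergence of a histogram q to a pair of centers (p, r) is simply a convex combination of two sided divergences; the sketch below (our own naming) uses the KL divergence as the base divergence D:

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence between frequency histograms."""
    return float(np.sum(p * np.log(p / q)))

def mixed_divergence(p, q, r, lam, D=kl):
    """Mixed divergence M_lam(p : q : r) = lam * D(p : q) + (1 - lam) * D(q : r).

    lam = 1 recovers the left-sided divergence D(p : q),
    lam = 0 the right-sided divergence D(q : r), and
    lam = 1/2 with p = r the symmetrized (arithmetic mean) divergence."""
    return lam * D(p, q) + (1.0 - lam) * D(q, r)
```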
We generalize
Notice that the mixed
For α ≠ ±1, we define the family of α-divergences on positive arrays as:

D_α(p : q) = (4/(1 − α²)) ∑_i ( ((1 − α)/2) p_i + ((1 + α)/2) q_i − p_i^{(1−α)/2} q_i^{(1+α)/2} ),

with the limit cases D_{−1}(p : q) = KL⁺(p : q) and D_{1}(p : q) = KL⁺(q : p), where KL⁺ denotes the extended Kullback–Leibler divergence. Divergence D_0 coincides, up to a constant factor, with the squared Hellinger distance:

D_0(p : q) = 2 ∑_i (√p_i − √q_i)².
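As a numerical sanity check, the sketch below implements one common parameterization of the extended α-divergence on positive arrays, D_α(p : q) = (4/(1 − α²)) ∑ ( ((1−α)/2) p + ((1+α)/2) q − p^{(1−α)/2} q^{(1+α)/2} ), whose limits at α = −1 and α = +1 are the two dual extended KL divergences; sign conventions differ across references, so treat the mapping of signs as an assumption:

```python
import numpy as np

def alpha_div(p, q, alpha, eps=1e-9):
    """Extended alpha-divergence for positive arrays; the closed form holds for
    alpha != +/-1, and the limits are taken as the dual extended KL divergences."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if abs(alpha + 1.0) < eps:      # limit alpha -> -1: extended KL(p : q)
        return float(np.sum(p * np.log(p / q) - p + q))
    if abs(alpha - 1.0) < eps:      # limit alpha -> +1: extended KL(q : p)
        return float(np.sum(q * np.log(q / p) - q + p))
    a, b = (1.0 - alpha) / 2.0, (1.0 + alpha) / 2.0
    return float(4.0 / (1.0 - alpha ** 2) * np.sum(a * p + b * q - p ** a * q ** b))
```

At α = 0 this reduces to 2 ∑ (√p − √q)², i.e., twice the squared Hellinger distance, which the test below checks.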
Note that
When histograms p and q are both frequency histograms (∑_i p_i = ∑_i q_i = 1), the correction term vanishes, and the extended Kullback–Leibler divergence reduces to the traditional Kullback–Leibler divergence:

KL(p : q) = ∑_i p_i log(p_i/q_i).
The Kullback–Leibler divergence between frequency histograms
Often,
The Pearson and Neyman χ^{2} distances are obtained for α = ∓3, respectively.
The
Interestingly, the
Next, we introduce the mixed
The
The skew symmetrized
Throughout the paper, superscript index
In this paper, we investigate the following kinds of clusterings for sets of histograms:
Randomized seeding for mixed/symmetrized
It is well-known that the Lloyd k-means heuristic only guarantees convergence to a local optimum, so the quality of the initialization (seeding) is crucial.
Pick λ = 1 for the left-sided centroid initialization,
Pick λ = 0 for the right-sided centroid initialization (a left-sided initialization for −α),
with arbitrary λ, for the λ-mixed centroid initialization.
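The randomized seeding above can be sketched as a k-means++-style initialization, where each new seed is drawn with probability proportional to the distortion to the closest seed chosen so far; this is our illustrative rendition (our own function names), with a λ-mixed symmetrized KL distortion plugged in:

```python
import numpy as np

def kl(p, q):
    """Row-wise KL divergence (broadcasts over a stack of histograms)."""
    return np.sum(p * np.log(p / q), axis=-1)

def mixed(c, H, lam):
    """lam-mixed KL distortion of every histogram in H to the center c."""
    return lam * kl(c, H) + (1.0 - lam) * kl(H, c)

def mixed_seeding(H, k, lam=0.5, seed=0):
    """k-means++-style seeding: sample each new seed proportionally to the
    mixed distortion to its closest already-chosen seed."""
    rng = np.random.default_rng(seed)
    centers = [H[rng.integers(len(H))]]
    for _ in range(k - 1):
        d = np.min([mixed(c, H, lam) for c in centers], axis=0)
        probs = d / d.sum()
        centers.append(H[rng.choice(len(H), p=probs)])
    return np.stack(centers)
```

Picking λ = 1 or λ = 0 specializes this to purely left- or right-sided seeding, matching the cases listed above.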
Our proof follows and generalizes the proof described for the case of mixed Bregman seeding [
Here
For skew Jeffreys centers, since the centroids are not available in closed form [
The
Note that the mixed
Since mixed
The centroids induced by
When the domain is the open probability simplex
Taking the derivatives with respect to
We now consider this equation for
The positive sided
with
Furthermore, the frequency-sided
The mixed
We state the theorem generalizing [
The Kullback–Leibler divergence can be symmetrized in various ways: Jeffreys divergence, Jensen–Shannon divergence and Chernoff information, just to mention a few. Here, we consider the following symmetrization of
For
In particular, when
where
Heinz means interpolate between the arithmetic and geometric means and satisfy the inequality:

√(ab) ≤ H_β(a, b) ≤ (a + b)/2.

(Another interesting property of Heinz means is the integral representation of the logarithmic mean: ∫₀¹ H_β(a, b) dβ = (a − b)/(log a − log b).)
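Both facts are easy to verify numerically. The sketch below (our own code) evaluates the Heinz mean H_β(a, b) = (a^β b^{1−β} + a^{1−β} b^β)/2 on a fine grid of β and approximates its integral over [0, 1], which should match the logarithmic mean (a − b)/(log a − log b):

```python
import numpy as np

def heinz(a, b, beta):
    """Heinz mean H_beta(a, b): beta = 1/2 gives the geometric mean,
    beta = 0 or 1 the arithmetic mean."""
    return 0.5 * (a ** beta * b ** (1 - beta) + a ** (1 - beta) * b ** beta)

a, b = 4.0, 9.0
betas = np.linspace(0.0, 1.0, 100001)
vals = heinz(a, b, betas)
# Trapezoidal approximation of the integral of H_beta over [0, 1].
integral = np.sum((vals[:-1] + vals[1:]) / 2) * (betas[1] - betas[0])
log_mean = (a - b) / (np.log(a) - np.log(b))
```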
The
Observe that it is enough to consider
For
The Jeffreys divergence takes mathematically the same form for frequency histograms:
We state the results reported in [
The Lambert W analytic function satisfies W(x) e^{W(x)} = x on its principal branch.
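The principal branch of Lambert W admits a simple Newton iteration. To illustrate why it appears in this context, the sketch below also forms the coordinate-wise expression c = a / W(e·a/g), with a and g the arithmetic and geometric means, which satisfies the first-order optimality condition of the positive Jeffreys centroid when the extended KL is taken as KL⁺(p : c) = ∑ p log(p/c) − p + c; the exact formula and convention are our assumption on the paper's notation:

```python
import numpy as np

def lambert_w(x, iters=50):
    """Principal branch of the Lambert W function (W(x) exp(W(x)) = x, x >= 0),
    computed by Newton's method."""
    x = np.asarray(x, float)
    w = np.where(x < np.e, x / np.e, np.log(np.maximum(x, 1e-300)))  # initial guess
    for _ in range(iters):
        ew = np.exp(w)
        w = w - (w * ew - x) / (ew * (w + 1.0))
    return w

def jeffreys_positive_centroid(H):
    """Coordinate-wise minimizer of the average symmetrized extended KL
    (Jeffreys) divergence: c = a / W(e * a / g), with a and g the
    arithmetic and geometric means of the rows of H."""
    a = H.mean(axis=0)
    g = np.exp(np.log(H).mean(axis=0))
    return a / lambert_w(np.e * a / g)
```

The optimality condition being checked is 1 − a/c + log(c/g) = 0 in each coordinate, obtained by differentiating the average of KL⁺(p_j : c) + KL⁺(c : p_j).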
Therefore, we shall consider α ≠ ±1 in the remainder.
We state the following lemma generalizing the former results in [
This is equivalent to minimizing:
Note that
The lemma states that the optimization problem with
The positive symmetrized
The frequency symmetrized
Instead of seeking
Exponential families also include many other continuous distributions, like the Gaussian, Beta or Dirichlet distributions. It turns out the
The proof follows from the fact that
First, we convert a frequency histogram
The log-normalizer is a non-separable convex function
The centroids with respect to skewed Jensen divergences have been investigated in [
Again, for skew Jeffreys centers, we shall adopt a variational approach. Notice that the soft clustering approach learns all parameters, including λ (if not constrained to zero or one) and
Assuming we model the prior for histograms by:
the negative log-likelihood involves the
because of the concavity of the logarithm function. Therefore, the maximization step for
No closed-form solution is known, so we compute the gradient update in
The update in λ is easier:
Maximizing the likelihood in λ would imply choosing λ ∈ {0, 1} (a hard choice of left or right centers), yet we prefer the soft update for this parameter, as for
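The soft alternative alluded to above can be sketched as an EM-style responsibility computation: each histogram receives weights over the k centers proportional to a decreasing function of its distortion, instead of a hard argmin assignment. This is only our illustrative rendition of such a soft scheme, not the paper's exact update:

```python
import numpy as np

def soft_assignments(distortions, beta=1.0):
    """Soft (EM-style) responsibilities from an (n, k) matrix of distortions:
    r_{jc} proportional to exp(-beta * distortion_{jc}), rows summing to one.
    Large beta recovers hard (argmin) assignments."""
    # Subtract the row-wise minimum for numerical stability before exponentiating.
    z = -beta * (distortions - distortions.min(axis=1, keepdims=True))
    w = np.exp(z)
    return w / w.sum(axis=1, keepdims=True)
```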
The family of
In this work, we have presented three techniques for clustering (positive or frequency) histograms using mixed α-divergences:
Sided left or right
Symmetrized Jeffreys-type
Coupled
Sided and mixed dual centroids are always available in closed form and are therefore highly attractive from the standpoint of implementation. Symmetrized Jeffreys centroids are in general not available in closed form and require one to implement a variational optimization scheme.
Indeed, we also presented and analyzed an extension of
NICTA is funded by the Australian Government as represented by the Department of Broadband, Communication and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.
All authors contributed equally to the design of the research. The research was carried out by all authors. Frank Nielsen and Richard Nock wrote the paper. Frank Nielsen implemented the algorithms and performed experiments. All authors have read and approved the final manuscript.
The authors declare no conflict of interest.
We give here the key results allowing one to obtain the proof of the Theorem, following the proof scheme of [
Let
where
Because
It comes now from
This gives the left-hand side equality of the lemma. The right-hand side follows from the fact that
Instead of
and since
Dividing by
We are now ready to upper bound
Suppose now that λ = 0 and
We have used the expression of left centroid
Plugging this in
Here,
Lemma 6 can be directly used to refine the bound of Lemma 3 for the uniform distribution. We give the lemma for the biased distribution, directly integrating the refinement of the bound.
Remark: take
where:
Taking
The rest of the proof of Lemma 7 follows the proof of Lemma 3 in [
We now have all the ingredients for our proof; it remains to use Lemma 4 in [
For positive arrays
where we assume that α ≠ ±1. Otherwise, for
In the proof of Theorem 1, we have used two properties of
any
any
The following lemma shows how to bound the mixed divergence as a function of an
The arithmetic-geometric-harmonic (AGH) inequality implies:
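The AGH inequality itself (harmonic ≤ geometric ≤ arithmetic for positive numbers) is easy to confirm numerically; the snippet below is a quick check, not part of the proof:

```python
import numpy as np

# Sample positive numbers and compute the three classical means.
x = np.random.default_rng(0).uniform(0.1, 10.0, size=1000)
arithmetic = x.mean()
geometric = np.exp(np.log(x).mean())
harmonic = 1.0 / np.mean(1.0 / x)
```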
It follows that
out of which we get the statement of the Lemma. □
For the sake of completeness, we prove the following theorem:
First, consider the general case
Removing all additive terms independent of
where
Setting the coordinate-wise derivative to zero of
Thus, we find that the coordinates of the right-sided
We recognize the expression of a quasi-arithmetic mean for the strictly monotone generator
with:
Therefore, we conclude that the coordinates of the positive
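To make the quasi-arithmetic form concrete, the sketch below (our own notation) takes the generator f_α(x) = x^{(1−α)/2}, giving the coordinate-wise power mean c_i = (∑_j w_j p_{j,i}^{(1−α)/2})^{2/(1−α)}; the exact generator convention is an assumption on the paper's notation. With this choice, α = −1 recovers the arithmetic mean, α = 3 the harmonic mean, and α → 1 the geometric mean:

```python
import numpy as np

def right_sided_alpha_centroid(H, w, alpha):
    """Quasi-arithmetic (power) mean centroid of the rows of H with weights w:
    coordinate-wise power mean of exponent (1 - alpha)/2."""
    w = np.asarray(w, float) / np.sum(w)
    e = (1.0 - alpha) / 2.0
    if abs(e) < 1e-12:                      # alpha -> 1 limit: geometric mean
        return np.exp(w @ np.log(H))
    return (w @ H ** e) ** (1.0 / e)
```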
When
It is equivalent to minimizing:
where
When
Since ∑
Both the arithmetic mean and the geometric mean are power means (the geometric mean arising as a limit case) and, hence, quasi-arithmetic means. Thus,
with:
Figure: snapshot of the histogram clustering software.
Table: positive and frequency centroids (positive centroid vs. frequency centroid).
Algorithm: mixed divergence-based k-means clustering.
Algorithms: mixed α-clustering procedures.