1. Introduction
Many machine learning algorithms for classification and clustering employ a variety of dissimilarity measures. Information theory, convex analysis, and information geometry play key roles in the formulation of such divergences [1–25].
The most popular are the squared Euclidean distance and the Kullback–Leibler divergence. Recently, generalized divergences such as the Csiszár–Morimoto f-divergence and the Bregman divergence have become attractive alternatives for advanced machine learning algorithms [26–34]. In this paper, we discuss robust parameterized subclasses of the Csiszár–Morimoto and Bregman divergences, namely the Alpha- and Beta-divergences, which may provide improved accuracy and solutions that are more robust with respect to outliers and additive noise. Moreover, we provide links to a new class of robust Gamma-divergences [35] and extend this class to the so-called Alpha-Gamma divergences.
Divergences are considered here as (dis)similarity measures. Generally speaking, they measure a quasi-distance or directed difference between two probability distributions $P$ and $Q$, which can also be expressed for unconstrained nonnegative multi-way arrays and patterns.
In this paper we assume that $P$ and $Q$ are positive measures (densities) that are not necessarily normalized but are finite. In the special case of normalized densities, we explicitly refer to them as probability densities. Unless stated otherwise, we assume that these measures are continuous. An information divergence is a measure of distance between two probability curves. In this paper, we discuss only one-dimensional probability curves (represented by nonnegative signals or time series). Generalization to two- or multi-dimensional variables is straightforward. One density, $Q$, is usually known and fixed, and the other, $P$, is learned or adjusted to achieve the best similarity, in some sense, to $Q$. For example, a discrete density $Q$ corresponds to the observed data and the vector $P$ to the estimated or expected data, which are subject to constraints imposed on the assumed models. For the Non-negative Matrix Factorization (NMF) problem, $Q$ corresponds to the data matrix $\mathbf{Y}$ and $P$ corresponds to the estimated matrix $\hat{\mathbf{Y}} = \mathbf{A}\mathbf{X}$ (or vice versa) [30].
The distance between two densities is called a metric if the following conditions hold:
1. $D(P\,\|\,Q) \geq 0$ with equality if and only if $P = Q$ (nonnegativity and positive definiteness),
2. $D(P\,\|\,Q) = D(Q\,\|\,P)$ (symmetry),
3. $D(P\,\|\,Z) \leq D(P\,\|\,Q) + D(Q\,\|\,Z)$ (subadditivity/triangle inequality).
Distances which satisfy only Condition 1 are not metrics and are referred to as (asymmetric) divergences.
In many applications, such as image analysis, pattern recognition and statistical machine learning, we use information-theoretic divergences rather than squared Euclidean or $\ell_p$-norm distances [28]. Several information divergences, such as the Kullback–Leibler, Hellinger and Jensen–Shannon divergences, are central to estimating the similarity between distributions and have a long history in information geometry.
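The concept of a divergence is not restricted to Euclidean spaces but can be extended to abstract spaces with the help of the Radon–Nikodym derivative (see, for example, [36]). Let $(\mathcal{X}, \mathcal{A}, \mu)$ be a measure space, where $\mu$ is a finite or a $\sigma$-finite measure on $(\mathcal{X}, \mathcal{A})$, and let us assume that $P$ and $Q$ are two (probability) measures on $(\mathcal{X}, \mathcal{A})$ such that $P \ll \mu$ and $Q \ll \mu$, i.e., both are absolutely continuous with respect to the measure $\mu$ (e.g., $\mu = P + Q$), and that $p = dP/d\mu$ and $q = dQ/d\mu$ are the (densities) Radon–Nikodym derivatives of $P$ and $Q$ with respect to $\mu$. Using this notation, the fundamental Kullback–Leibler (KL) divergence between two probability distributions can be written as
$$D_{KL}(P\,\|\,Q) = \int_{\mathcal{X}} p(x) \log\frac{p(x)}{q(x)}\, d\mu(x), \qquad (1)$$
which is related to the Shannon entropy
$$H_S(P) = -\int_{\mathcal{X}} p(x) \log p(x)\, d\mu(x)$$
via
$$D_{KL}(P\,\|\,Q) = V_S(P, Q) - H_S(P), \qquad (2)$$
where
$$V_S(P, Q) = -\int_{\mathcal{X}} p(x) \log q(x)\, d\mu(x)$$
is Shannon's cross entropy, provided that the integrals exist. (In measure-theoretic terms, the integral exists if the measure induced by $P$ is absolutely continuous with respect to that induced by $Q$.) Here and throughout the paper we assume that all integrals exist.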
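For discrete probability vectors the integrals become sums, and the relation between the KL divergence, the cross entropy and the Shannon entropy can be checked numerically. The following is a minimal sketch, assuming natural logarithms and strictly positive entries in `q`; the function names and the two example vectors are illustrative only and do not come from the paper.

```python
import math

def shannon_entropy(p):
    """H_S(P) = -sum_i p_i log p_i, using natural log and 0 log 0 := 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """V_S(P, Q) = -sum_i p_i log q_i; needs q_i > 0 wherever p_i > 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i p_i log(p_i / q_i) for probability vectors."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two hypothetical probability vectors on three outcomes.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)

# D_KL(P || Q) should equal V_S(P, Q) - H_S(P) up to rounding error.
identity_gap = kl_pq - (cross_entropy(p, q) - shannon_entropy(p))
```

In this sketch `kl_pq` and `kl_qp` differ, which illustrates why the KL divergence satisfies only the first of the metric conditions and is therefore an asymmetric divergence.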
The Kullback–Leibler divergence has been generalized by using a family of functions called generalized logarithm functions or $\alpha$-logarithms,
$$\log_{\alpha}(x) = \frac{x^{1-\alpha} - 1}{1-\alpha} \quad (x > 0), \qquad (3)$$
which is a power function of $x$ with power $1-\alpha$, and is the natural logarithm function in the limit $\alpha \to 1$. Often, the power function (3) allows us to generate divergences that are more robust with respect to outliers and consequently give better or more flexible performance (see, for example, [37]).
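The limiting behaviour of the generalized logarithm can be illustrated numerically. The sketch below assumes the form $\log_{\alpha}(x) = (x^{1-\alpha} - 1)/(1-\alpha)$ for $x > 0$, with the natural logarithm taken at $\alpha = 1$; the function name `log_alpha` and the tolerance used to detect $\alpha = 1$ are illustrative choices.

```python
import math

def log_alpha(x, alpha):
    """Assumed alpha-logarithm: (x**(1 - alpha) - 1) / (1 - alpha), x > 0.

    At alpha = 1 the expression is singular and is replaced by its limit,
    the natural logarithm.
    """
    if x <= 0:
        raise ValueError("x must be positive")
    if abs(alpha - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - alpha) - 1.0) / (1.0 - alpha)

# Values approach log(2) as alpha approaches 1 from either side.
near_limit = [log_alpha(2.0, a) for a in (0.9, 0.99, 0.999, 1.0, 1.001)]
```

For $\alpha = 0$ the function reduces to $x - 1$, so the family interpolates between a linear function and the natural logarithm.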
By using this type of extension, we derive and review three series of divergences, the Alpha-, Beta- and Gamma-divergences, all of which are generalizations of the KL divergence. Moreover, we show their relation to the Tsallis entropy and the Rényi entropy (see Appendix A and Appendix B). It will also be shown how the Alpha-divergences are derived from the Csiszár–Morimoto f-divergence and the Beta-divergence from the Bregman divergence by using the power functions.
Similarly to the work of Zhang [22–24] and Hein and Bousquet [21], one of our motivations is to show the close links and relations among a wide class of divergences and to provide an elegant way to handle known, intermediate and new divergences in the same framework. However, our similarity measures are different from those proposed in [21], and our approach and results are quite different from those presented in [22].
It should be mentioned that there have been previous attempts at unifying divergence functions (especially those related to the Alpha-divergence), starting from the work by Zhu and Rohwer [10,11], Amari and Nagaoka [3], Taneja [38,39], Zhang [22] and Gorban and Judge [40]. In particular, Zhang in [22] and in subsequent works [23,24] investigated the deep link between information geometry and various divergence functions, mostly related to the Alpha-divergence, through a unified approach based on convex functions. However, the previous works have not explicitly considered the links and relationships among all three fundamental classes of (Alpha-, Beta- and Gamma-) divergences. Moreover, some of their basic properties are reviewed and extended here.
The scope of the results presented in this paper is vast, since the class of generalized (flexible) divergence functions includes a large number of useful loss functions, among them those based on the relative entropies, the generalized Kullback–Leibler or I-divergence, the Hellinger distance, the Jensen–Shannon divergence, the J-divergence, the Pearson and Neyman Chi-square divergences, the Triangular Discrimination and the Arithmetic-Geometric divergence. Moreover, we show that some new divergences can be generated. In particular, we generate a new family of Alpha-Gamma divergences and Itakura–Saito like distances with the scale-invariance property, which belong to a wider class of Beta-divergences. Generally, these new scale-invariant divergences extend the families of Beta- and Gamma-divergences. The divergence functions discussed in this paper are flexible because they allow us to generate a large number of well-known and often used particular divergences (for specific values of the tuning parameters). Moreover, by adjusting adaptive tuning parameters, we can optimize cost functions for learning algorithms and estimate desired parameters of a model in the presence of noise and outliers. In other words, the divergences discussed in this paper, especially the Beta- and Gamma-divergences, are robust with respect to outliers for some values of the tuning parameters.
One important feature of the considered family of divergences is that they can give some guidance for the selection, and even the development, of new divergence measures if necessary, and allow these divergences to be unified under the same framework using the Csiszár–Morimoto and Bregman divergences and their fundamental properties. Moreover, these families of divergences are generally defined on unnormalized finite measures (not necessarily normalized probabilities). This allows patterns of different size to be weighted differently, e.g., images with different sizes or documents of different length. Such measures also play an important role in the areas of neural computation, pattern recognition, learning, estimation, inference, and optimization. We have already successfully applied a subset of such divergences as cost functions (possibly with additional constraints and regularization terms) to derive novel multiplicative and additive projected gradient algorithms for nonnegative matrix and tensor factorizations [30,31].
The divergences are closely related to the invariant geometrical properties of the manifold of probability distributions [5–7].
3. Family of Beta-Divergences
The basic Beta-divergence was introduced by Basu et al. [67] and Minami and Eguchi [15], and many researchers have investigated its applications, including [8,13,30–34,37,68–72] and references therein. The main motivation for investigating the Beta-divergence, at least from the practical point of view, is to develop learning algorithms for clustering, feature extraction, classification and blind source separation that are highly robust with respect to outliers. So far, the Beta-divergence has been successfully applied to robust PCA (Principal Component Analysis) and clustering [71], robust ICA (Independent Component Analysis) [15,68,69], and robust NMF/NTF [30,70,73–76].
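First, let us define the basic asymmetric Beta-divergence between two unnormalized density functions by
$$D_B^{(\beta)}(P\,\|\,Q) = \int \left( p(x)\,\frac{p^{\beta}(x) - q^{\beta}(x)}{\beta} - \frac{p^{\beta+1}(x) - q^{\beta+1}(x)}{\beta+1} \right) d\mu(x), \qquad (46)$$
where $\beta$ is a real number and, for $\beta \in \{0, -1\}$, is defined by continuity (see below for more explanation).
For discrete probability measures with mass functions $\mathbf{P} = [p_1, \ldots, p_n]$ and $\mathbf{Q} = [q_1, \ldots, q_n]$, the discrete Beta-divergence is defined as
$$D_B^{(\beta)}(\mathbf{P}\,\|\,\mathbf{Q}) = \sum_{i=1}^{n} \left( p_i\,\frac{p_i^{\beta} - q_i^{\beta}}{\beta} - \frac{p_i^{\beta+1} - q_i^{\beta+1}}{\beta+1} \right).$$
The Beta-divergence can be expressed via a generalized KL divergence as follows
$$D_B^{(\beta)}(P\,\|\,Q) = -\frac{1}{\beta+1} \int \left( p^{\beta+1} \log_{1/(1+\beta)}\!\left(\frac{q^{\beta+1}}{p^{\beta+1}}\right) - q^{\beta+1} + p^{\beta+1} \right) d\mu,$$
where $\log_{\alpha}(x) = (x^{1-\alpha} - 1)/(1-\alpha)$ is the $\alpha$-logarithm (3).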
The above representation of the Beta-divergence indicates why it is robust to outliers for some values of the tuning parameter $\beta$ and why it is therefore often better suited than other divergences to specific applications. For example, in sound processing, the speech power spectra can be modeled by exponential family densities, for which the Beta-divergence with $\beta = -1$ is none other than the Itakura–Saito distance (also called the Itakura–Saito divergence, Itakura–Saito distortion measure or Burg cross entropy) [12,13,30,76–79]. In fact, the Beta-divergence has to be defined in the limiting case $\beta \to -1$ as the Itakura–Saito distance:
$$D_{IS}(P\,\|\,Q) = \lim_{\beta \to -1} D_B^{(\beta)}(P\,\|\,Q) = \int \left( \log\frac{q}{p} + \frac{p}{q} - 1 \right) d\mu.$$
The Itakura–Saito distance was derived from the maximum likelihood (ML) estimation of speech spectra [77]. It was used as a measure of the distortion or goodness of fit between two spectra and is often used as a standard measure in the speech processing community because of the good perceptual properties of the reconstructed signals, which follow from its scale invariance. Owing to scale invariance, low-energy components of $p$ have the same relative importance as high-energy ones. This is especially important when the coefficients of $p$ have a large dynamic range, such as in short-term audio spectra [30,76,79].
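It is also interesting to note that, for $\beta = 1$, we obtain the standard squared Euclidean ($L_2$-norm) distance, $D_B^{(1)}(P\,\|\,Q) = \frac{1}{2}\int (p - q)^2\, d\mu$, while for the singular case $\beta = 0$, we obtain the KL I-divergence:
$$D_B^{(0)}(P\,\|\,Q) = \lim_{\beta \to 0} D_B^{(\beta)}(P\,\|\,Q) = \int \left( p \log\frac{p}{q} - p + q \right) d\mu.$$
Note that we used here the following formulas:
$$\lim_{\beta \to 0} \frac{p^{\beta} - q^{\beta}}{\beta} = \log\frac{p}{q}$$
and
$$\lim_{\beta \to 0} \frac{p^{\beta} - 1}{\beta} = \log p.$$
Hence, the Beta-divergence can be represented in a more explicit form:
$$D_B^{(\beta)}(P\,\|\,Q) = \begin{cases} \dfrac{1}{\beta(\beta+1)} \displaystyle\int \left( p^{\beta+1} + \beta\, q^{\beta+1} - (\beta+1)\, p\, q^{\beta} \right) d\mu, & \beta \neq 0, -1, \\[2mm] \displaystyle\int \left( p \log\frac{p}{q} - p + q \right) d\mu, & \beta = 0, \\[2mm] \displaystyle\int \left( \log\frac{q}{p} + \frac{p}{q} - 1 \right) d\mu, & \beta = -1. \end{cases}$$
We have shown that the basic Beta-divergence smoothly connects the Itakura–Saito distance and the squared Euclidean $L_2$-norm distance and passes through the KL I-divergence $D_{KL}(P\,\|\,Q)$. Such a parameterized connection is impossible in the family of the Alpha-divergences.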
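The smooth connection between the three special cases can be checked numerically for discrete, unnormalized data. The sketch below assumes the discrete form $\sum_i [\,p_i (p_i^{\beta} - q_i^{\beta})/\beta - (p_i^{\beta+1} - q_i^{\beta+1})/(\beta+1)\,]$ with the limits at $\beta = 0$ and $\beta = -1$ taken explicitly; the function name and the sample vectors are illustrative.

```python
import math

def beta_divergence(p, q, beta):
    """Discrete Beta-divergence between positive (unnormalized) vectors."""
    if beta == 0:       # KL I-divergence (limit beta -> 0)
        return sum(pi * math.log(pi / qi) - pi + qi for pi, qi in zip(p, q))
    if beta == -1:      # Itakura-Saito distance (limit beta -> -1)
        return sum(math.log(qi / pi) + pi / qi - 1.0 for pi, qi in zip(p, q))
    return sum(
        pi * (pi ** beta - qi ** beta) / beta
        - (pi ** (beta + 1) - qi ** (beta + 1)) / (beta + 1)
        for pi, qi in zip(p, q)
    )

# Hypothetical positive data; neither vector needs to sum to one.
p = [1.0, 2.0, 3.0]
q = [1.5, 1.5, 2.5]

# beta = 1 should give half the squared Euclidean distance.
half_sq_euclid = 0.5 * sum((pi - qi) ** 2 for pi, qi in zip(p, q))

# Scanning beta from -1 to 1 passes through the Itakura-Saito, KL and
# squared Euclidean cases without jumps at the singular values.
scan = [(b, beta_divergence(p, q, b)) for b in (-1, -0.5, 0, 0.5, 1)]

# The Itakura-Saito case (beta = -1) is unchanged by a common rescaling.
is_original = beta_divergence(p, q, -1)
is_rescaled = beta_divergence([10 * x for x in p], [10 * x for x in q], -1)
```

Evaluating the general branch at values of $\beta$ very close to $0$ or $-1$ gives results close to the explicit limit branches, which is one way to confirm that the definition by continuity is consistent.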
The choice of the tuning parameter $\beta$ depends on the statistical distribution of the data sets. For example, the optimal choice of the parameter $\beta$ for the normal distribution is $\beta = 1$, for the gamma distribution it is $\beta = -1$, for the Poisson distribution $\beta = 0$, and for the compound Poisson distribution $\beta \in (-1, 0)$ [15,31–34,68,69].
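It is important to note that the Beta-divergence can be derived from the Bregman divergence. The Bregman divergence is a pseudo-distance for measuring the discrepancy between two values of density functions $p(x)$ and $q(x)$ [9,16,80]:
$$d_{\Phi}(p\,\|\,q) = \Phi(p) - \Phi(q) - (p - q)\,\Phi'(q),$$
where $\Phi(t)$ is a strictly convex real-valued function and $\Phi'(q)$ is its derivative with respect to $q$. The total discrepancy between two functions $p(x)$ and $q(x)$ is given by
$$D_{\Phi}(P\,\|\,Q) = \int \left( \Phi(p) - \Phi(q) - (p - q)\,\Phi'(q) \right) d\mu,$$
and it corresponds to the $\Phi$-entropy of a continuous probability measure $p(x) \geq 0$ defined by
$$H_{\Phi}(P) = -\int \Phi(p)\, d\mu.$$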
Remark: The concepts of divergence and entropy are closely related. Let $Q_0$ be a uniform distribution for which $q_0(x) = \text{const}$. (When $\mathcal{X}$ is an infinite space $\mathbb{R}^n$, this might not be a probability distribution but is a measure.) Then, $H(P) = -D(P\,\|\,Q_0) + \text{const}$ is regarded as the related entropy. This is the negative of the divergence from $P$ to the uniform distribution. On the other hand, given a concave entropy $H(P)$, we can define the related divergence as the Bregman divergence derived from the convex function $\Phi(P) = -H(P)$.
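If $x$ takes discrete values on a certain space, the separable Bregman divergence is defined as $D_{\Phi}(\mathbf{P}\,\|\,\mathbf{Q}) = \sum_{i=1}^{n} d_{\Phi}(p_i\,\|\,q_i) = \sum_{i=1}^{n} \left( \Phi(p_i) - \Phi(q_i) - (p_i - q_i)\,\Phi'(q_i) \right)$, where $\Phi'(q)$ denotes the derivative with respect to $q$. In the general (nonseparable) case for two vectors $\mathbf{P}$ and $\mathbf{Q}$, the Bregman divergence is defined as $D_{\Phi}(\mathbf{P}\,\|\,\mathbf{Q}) = \Phi(\mathbf{P}) - \Phi(\mathbf{Q}) - (\mathbf{P} - \mathbf{Q})^{T} \nabla\Phi(\mathbf{Q})$, where $\nabla\Phi(\mathbf{Q})$ is the gradient of $\Phi$ evaluated at $\mathbf{Q}$.
Note that $D_{\Phi}(\mathbf{P}\,\|\,\mathbf{Q})$ equals the tail of the first-order Taylor expansion of $\Phi(\mathbf{P})$ at $\mathbf{Q}$. Bregman divergences include many prominent dissimilarity measures such as the squared Euclidean distance, the Mahalanobis distance, the generalized Kullback–Leibler divergence and the Itakura–Saito distance.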
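It is easy to check that the Beta-divergence can be generated from the Bregman divergence using the following strictly convex continuous function [36,81]:
$$\Phi(t) = \begin{cases} \dfrac{1}{\beta(\beta+1)}\left( t^{\beta+1} - (\beta+1)\,t + \beta \right), & \beta \neq 0, -1, \\[2mm] t \log t - t + 1, & \beta = 0, \\[2mm] t - \log t - 1, & \beta = -1. \end{cases}$$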
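The claim that a single convex generator reproduces the Beta-divergence can be verified numerically in the separable discrete case. The sketch below assumes the generator $\Phi(t) = (t^{\beta+1} - (\beta+1)t + \beta)/(\beta(\beta+1))$ with the limits $t\log t - t + 1$ at $\beta = 0$ and $t - \log t - 1$ at $\beta = -1$; the function names and data are illustrative.

```python
import math

def phi_beta(t, beta):
    """Assumed convex generator of the Beta-divergence."""
    if beta == 0:
        return t * math.log(t) - t + 1.0
    if beta == -1:
        return t - math.log(t) - 1.0
    return (t ** (beta + 1) - (beta + 1) * t + beta) / (beta * (beta + 1))

def dphi_beta(t, beta):
    """Derivative of phi_beta with respect to t."""
    if beta == 0:
        return math.log(t)
    return (t ** beta - 1.0) / beta   # also valid at beta = -1: 1 - 1/t

def bregman_beta(p, q, beta):
    """Separable Bregman divergence sum_i phi(p_i) - phi(q_i) - (p_i - q_i) phi'(q_i)."""
    return sum(
        phi_beta(pi, beta) - phi_beta(qi, beta) - (pi - qi) * dphi_beta(qi, beta)
        for pi, qi in zip(p, q)
    )

# Hypothetical positive, unnormalized data.
p = [1.0, 2.0, 3.0]
q = [1.5, 1.5, 2.5]

# Direct formulas for the three classical special cases.
half_sq_euclid = 0.5 * sum((pi - qi) ** 2 for pi, qi in zip(p, q))
i_divergence = sum(pi * math.log(pi / qi) - pi + qi for pi, qi in zip(p, q))
itakura_saito = sum(math.log(qi / pi) + pi / qi - 1.0 for pi, qi in zip(p, q))
```

The linear and constant terms in the generator cancel inside the Bregman construction, so they affect only the normalization $\Phi(1) = 0$ and not the resulting divergence.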
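It is also interesting to note that the same generating function $f(u) = \Phi(t)|_{t=u}$ (with $\alpha = \beta + 1$) can be used to generate the Alpha-divergence using the Csiszár–Morimoto $f$-divergence $D_f(P\,\|\,Q) = \int q\, f(p/q)\, d\mu$.
Furthermore, the Beta-divergence can be generated by a generalized $f$-divergence:
$$D_B^{(\beta)}(P\,\|\,Q) = \int q^{\beta+1} f\!\left(\frac{p}{q}\right) d\mu,$$
where $f(u) = \Phi(t)|_{t=u}$ with $u = p/q$, and $f$ is a convex generating function with $f(1) = 0$.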
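The links between the Bregman and Beta-divergences are important, since many well-known fundamental properties of the Bregman divergence are also valid for the Beta-divergence [28,82]:
Convexity: The Bregman divergence $D_{\Phi}(P\,\|\,Q)$ is always convex in the first argument $P$, but often not in the second argument $Q$.
Nonnegativity: The Bregman divergence is nonnegative, $D_{\Phi}(P\,\|\,Q) \geq 0$, with zero if and only if $P = Q$.
Linearity: Any positive linear combination of Bregman divergences is also a Bregman divergence, i.e.,
$$c_1 D_{\Phi_1}(P\,\|\,Q) + c_2 D_{\Phi_2}(P\,\|\,Q) = D_{c_1\Phi_1 + c_2\Phi_2}(P\,\|\,Q),$$
where $c_1, c_2$ are positive constants and $\Phi_1, \Phi_2$ are strictly convex functions.
Invariance: The functional Bregman divergence is invariant under affine transforms $\Gamma(Q) = \Phi(Q) + \int a(x)\, q(x)\, dx + c$ for positive measures $P$ and $Q$, that is, with respect to linear and arbitrary constant terms [28,82], i.e.,
$$D_{\Gamma}(P\,\|\,Q) = D_{\Phi}(P\,\|\,Q).$$
The three-point property generalizes the “Law of Cosines”:
$$D_{\Phi}(P\,\|\,Q) = D_{\Phi}(P\,\|\,Z) + D_{\Phi}(Z\,\|\,Q) - \int (p - z)\left( \Phi'(q) - \Phi'(z) \right) d\mu.$$
Generalized Pythagoras Theorem:
$$D_{\Phi}(P\,\|\,Q) \geq D_{\Phi}(P\,\|\,P_{\Omega}(Q)) + D_{\Phi}(P_{\Omega}(Q)\,\|\,Q),$$
where $P_{\Omega}(Q) = \arg\min_{\omega \in \Omega} D_{\Phi}(\omega\,\|\,Q)$ is the Bregman projection onto the convex set $\Omega$ and $P \in \Omega$. When $\Omega$ is an affine set, it holds with equality. This is proved to be the generalized Pythagorean relation in terms of information geometry.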
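For the Beta-divergence (46), the first- and second-order Fréchet derivatives with respect to $Q$ are given by [28,76]
$$\frac{\delta D_B^{(\beta)}}{\delta q} = q^{\beta-1}(q - p), \qquad \frac{\delta^2 D_B^{(\beta)}}{\delta q^2} = q^{\beta-2}\left( \beta q - (\beta - 1) p \right).$$
Hence, we conclude that the Beta-divergence has a single global minimum equal to zero for $P = Q$ and increases with $|p - q|$. Moreover, the Beta-divergence is strictly convex for $q(x) > 0$ only for $\beta \in [0, 1]$. For $\beta = -1$ (Itakura–Saito distance), it is convex if $q^{-3}(2p - q) \geq 0$, i.e., if $q/p \leq 2$ [78].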
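The derivative expressions and the resulting convexity region can be checked pointwise with finite differences. The sketch below assumes the pointwise integrand $p(p^{\beta} - q^{\beta})/\beta - (p^{\beta+1} - q^{\beta+1})/(\beta+1)$ and the derivative forms $q^{\beta-1}(q - p)$ and $q^{\beta-2}(\beta q - (\beta-1)p)$; the function names, the step size and the test point are illustrative.

```python
def beta_term(p, q, beta):
    """Pointwise Beta-divergence integrand, for beta not in {0, -1}."""
    return (p * (p ** beta - q ** beta) / beta
            - (p ** (beta + 1) - q ** (beta + 1)) / (beta + 1))

def first_derivative(p, q, beta):
    """Assumed analytic derivative with respect to q."""
    return q ** (beta - 1) * (q - p)

def second_derivative(p, q, beta):
    """Assumed analytic second derivative with respect to q."""
    return q ** (beta - 2) * (beta * q - (beta - 1) * p)

def central_differences(p, q, beta, h=1e-4):
    """Numerical first and second derivatives of beta_term in q."""
    f = lambda x: beta_term(p, x, beta)
    d1 = (f(q + h) - f(q - h)) / (2 * h)
    d2 = (f(q + h) - 2 * f(q) + f(q - h)) / (h * h)
    return d1, d2

# Hypothetical test point.
p0, q0, beta0 = 2.0, 3.0, 0.5
num_d1, num_d2 = central_differences(p0, q0, beta0)

# Sign of the curvature on a small grid of (p, q) pairs.
grid = [0.5, 1.0, 2.0, 5.0]
convex_on_grid = {
    b: all(second_derivative(p, q, b) > 0 for p in grid for q in grid)
    for b in (-1, 0, 0.5, 1, 2)
}
```

On this grid the curvature is positive everywhere for $\beta \in \{0, 0.5, 1\}$, while for $\beta = -1$ it turns negative once $q > 2p$, in line with the convexity condition stated for the Itakura–Saito distance.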
3.1. Generation of Family of Beta-divergences Directly from Family of Alpha-Divergences
It should be noted that the original works [15,67–69] considered only the Beta-divergence function for $\beta \geq 0$. Moreover, they did not consider the whole range of non-positive values of the parameter $\beta$, especially $\beta = -1$, for which we have the important Itakura–Saito distance. Furthermore, similarly to the Alpha-divergences, there exists an associated family of Beta-divergences and, as special cases, a family of generalized Itakura–Saito like distances. A fundamental question arises: how can we generate a whole family of Beta-divergences, or what are the relationships or correspondences between the Alpha- and Beta-divergences? In fact, on the basis of our considerations above, it is easy to find that the complete set of Beta-divergences can be obtained from the Alpha-divergences and, conversely, the Alpha-divergences can be obtained directly from the Beta-divergences.
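In order to obtain a Beta-divergence from the corresponding (associated) Alpha-divergence, we need to apply the following nonlinear transformations:
$$p \to p^{\beta+1}, \qquad q \to q^{\beta+1}, \qquad \alpha = \frac{1}{1+\beta}.$$
For example, using these transformations (substitutions) for the basic asymmetric Alpha-divergence (4) and assuming that $\alpha = (\beta+1)^{-1}$, we obtain the following divergence
$$D_A^{(\beta)}(P\,\|\,Q) = (\beta+1)^2 \int \left( \frac{p^{\beta+1}}{\beta(\beta+1)} - \frac{p\, q^{\beta}}{\beta} + \frac{q^{\beta+1}}{\beta+1} \right) d\mu.$$
Observe that, by ignoring the scaling factor $(\beta+1)^2$, we obtain the basic asymmetric Beta-divergence defined by Equation (46).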
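The correspondence between the two families can be confirmed numerically. The sketch below assumes the basic asymmetric Alpha-divergence in the form $\frac{1}{\alpha(\alpha-1)}\sum_i (p_i^{\alpha} q_i^{1-\alpha} - \alpha p_i + (\alpha-1) q_i)$, the substitution $p \to p^{\beta+1}$, $q \to q^{\beta+1}$ with $\alpha = 1/(1+\beta)$, and a scaling factor of $(\beta+1)^2$; all function names and data are illustrative.

```python
def alpha_divergence(p, q, alpha):
    """Assumed basic asymmetric Alpha-divergence, alpha not in {0, 1}."""
    total = sum(
        pi ** alpha * qi ** (1.0 - alpha) - alpha * pi + (alpha - 1.0) * qi
        for pi, qi in zip(p, q)
    )
    return total / (alpha * (alpha - 1.0))

def beta_divergence(p, q, beta):
    """Basic asymmetric Beta-divergence, beta not in {0, -1}."""
    return sum(
        pi * (pi ** beta - qi ** beta) / beta
        - (pi ** (beta + 1) - qi ** (beta + 1)) / (beta + 1)
        for pi, qi in zip(p, q)
    )

def alpha_via_transform(p, q, beta):
    """Alpha-divergence evaluated on the transformed measures."""
    alpha = 1.0 / (1.0 + beta)
    p_t = [pi ** (beta + 1) for pi in p]
    q_t = [qi ** (beta + 1) for qi in q]
    return alpha_divergence(p_t, q_t, alpha)

# Hypothetical positive, unnormalized data.
p = [1.0, 2.0, 3.0]
q = [1.5, 1.5, 2.5]

# Difference between the transformed Alpha-divergence and the scaled
# Beta-divergence for several values of beta (expected to be ~0).
gaps = {
    b: alpha_via_transform(p, q, b) - (1.0 + b) ** 2 * beta_divergence(p, q, b)
    for b in (0.5, 2.0, -0.5, -2.0)
}
```

The check covers both positive and negative values of $\beta$, including values below $-1$, where the transformation inverts the data before the Alpha-divergence is applied.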
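In fact, the same link exists between the whole family of Alpha-divergences and the family of Beta-divergences (see Table 2).
For example, we can derive a symmetric Beta-divergence from the symmetric Alpha-divergence (Type-1) (38):
$$D_{BS1}^{(\beta)}(P\,\|\,Q) = D_B^{(\beta)}(P\,\|\,Q) + D_B^{(\beta)}(Q\,\|\,P) = \frac{1}{\beta}\int (p - q)\left( p^{\beta} - q^{\beta} \right) d\mu.$$
It is interesting to note that, in special cases, we obtain:
Symmetric KL or J-divergence [65]:
$$\lim_{\beta \to 0} D_{BS1}^{(\beta)}(P\,\|\,Q) = \int (p - q) \log\frac{p}{q}\, d\mu,$$
and symmetric Chi-square divergence [54]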
Analogously, from the symmetric Alpha-divergence (Type-2), we obtain
Table 2.
Family of Alpha-divergences and corresponding Beta-divergences. We applied the following transformations: $p \to p^{\beta+1}$, $q \to q^{\beta+1}$, $\alpha = 1/(1+\beta)$. Note that $D_A^{(1)}(P\,\|\,Q) = D_B^{(0)}(P\,\|\,Q)$ and that they represent the extended family of KL divergences. Furthermore, Beta-divergences for $\beta = -1$ describe the family of generalized (extended) Itakura–Saito like distances.
Alpha-divergence | Beta-divergence |
| |
| |
| |
| |
| |
| |
It should be noted that in special cases, we obtain:
The Arithmetic-Geometric divergence [38,39]:
and a symmetrized Itakura–Saito distance (also called the COSH distance) [12,13]:
4. Family of Gamma-Divergences Generated from Beta- and Alpha-Divergences
A basic asymmetric Gamma-divergence has been proposed very recently by Fujisawa and Eguchi [35] as a similarity measure that is very robust with respect to outliers:
$$D_G^{(\gamma)}(P\,\|\,Q) = \frac{1}{\gamma(\gamma+1)} \log \left[ \frac{\left( \int p^{\gamma+1}\, d\mu \right) \left( \int q^{\gamma+1}\, d\mu \right)^{\gamma}}{\left( \int p\, q^{\gamma}\, d\mu \right)^{\gamma+1}} \right]. \qquad (72)$$
The Gamma-divergence employs a nonlinear transformation (log) of cumulative patterns, and the terms $p$, $q$ are not separable. The main motivation for employing the Gamma-divergence is that it allows “super” robust estimation of some parameters in the presence of outliers. In fact, the authors demonstrated that the bias caused by outliers can become sufficiently small even in the case of very heavy contamination and that some contamination can be naturally and automatically neglected [35,37].
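In this paper, we show that the whole family of Gamma-divergences can be generated directly from Alpha- and also Beta-divergences. In order to obtain a robust Gamma-divergence from an Alpha- or Beta-divergence, we use the following transformations (see also Table 3):
$$c_0 \int p^{c_1} q^{c_2}\, d\mu \;\to\; \log \left( \int p^{c_1} q^{c_2}\, d\mu \right)^{c_0}, \qquad (73)$$
where $c_0$ and $c_1, c_2$ are real constants and $\gamma = \alpha$.
Applying the above transformation to all monomials of the basic Alpha-divergence (4), we obtain a new divergence, referred to here as the Alpha-Gamma-divergence:
$$D_{AG}^{(\gamma)}(P\,\|\,Q) = \frac{1}{\gamma(\gamma-1)} \log \left[ \frac{\int p^{\gamma} q^{1-\gamma}\, d\mu}{\left( \int p\, d\mu \right)^{\gamma} \left( \int q\, d\mu \right)^{1-\gamma}} \right].$$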
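The asymmetric Alpha-Gamma-divergence has the following important properties:
$D_{AG}^{(\gamma)}(P\,\|\,Q) \geq 0$. The equality holds if and only if $P = cQ$ for a positive constant $c$.
It is scale invariant for any value of $\gamma$, that is, $D_{AG}^{(\gamma)}(P\,\|\,Q) = D_{AG}^{(\gamma)}(c_1 P\,\|\,c_2 Q)$, for arbitrary positive scaling constants $c_1, c_2$.
The Alpha-Gamma divergence is equivalent to the normalized Alpha-Rényi divergence (25), i.e.,
$$D_{AG}^{(\gamma)}(P\,\|\,Q) = \frac{1}{\gamma(\gamma-1)} \log \int \bar{p}^{\gamma}\, \bar{q}^{1-\gamma}\, d\mu$$
for $\alpha = \gamma$ and normalized densities $\bar{p} = p / \int p\, d\mu$ and $\bar{q} = q / \int q\, d\mu$.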
It can be expressed via generalized weighted mean:
where the weighted mean is defined as
.
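As $\gamma \to 1$, the Alpha-Gamma-divergence becomes the Kullback–Leibler divergence:
$$\lim_{\gamma \to 1} D_{AG}^{(\gamma)}(P\,\|\,Q) = D_{KL}(\bar{P}\,\|\,\bar{Q}) = \int \bar{p} \log\frac{\bar{p}}{\bar{q}}\, d\mu,$$
where $\bar{p} = p / \int p\, d\mu$ and $\bar{q} = q / \int q\, d\mu$ are the normalized densities.
For $\gamma \to 0$, the Alpha-Gamma-divergence can be expressed by the reverse Kullback–Leibler divergence:
$$\lim_{\gamma \to 0} D_{AG}^{(\gamma)}(P\,\|\,Q) = D_{KL}(\bar{Q}\,\|\,\bar{P}) = \int \bar{q} \log\frac{\bar{q}}{\bar{p}}\, d\mu.$$
In a similar way, we can generate the whole family of Alpha-Gamma-divergences from the family of Alpha-divergences, which are summarized in Table 3.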
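The scale invariance and the two KL limits of the Alpha-Gamma-divergence can be illustrated for discrete data. The sketch below assumes the form $\frac{1}{\gamma(\gamma-1)}\log\big[\sum_i p_i^{\gamma} q_i^{1-\gamma} / ((\sum_i p_i)^{\gamma} (\sum_i q_i)^{1-\gamma})\big]$; the function names, data and scaling constants are illustrative.

```python
import math

def alpha_gamma_divergence(p, q, gamma):
    """Assumed Alpha-Gamma-divergence for positive vectors, gamma not in {0, 1}."""
    num = sum(pi ** gamma * qi ** (1.0 - gamma) for pi, qi in zip(p, q))
    den = sum(p) ** gamma * sum(q) ** (1.0 - gamma)
    return math.log(num / den) / (gamma * (gamma - 1.0))

def normalized_kl(p, q):
    """KL divergence between the normalized versions of p and q."""
    sp, sq = sum(p), sum(q)
    return sum((pi / sp) * math.log((pi / sp) / (qi / sq)) for pi, qi in zip(p, q))

# Hypothetical positive, unnormalized data.
p = [1.0, 2.0, 3.0]
q = [1.5, 1.5, 2.5]

base = alpha_gamma_divergence(p, q, 0.5)

# Independent rescaling of both arguments leaves the value unchanged.
rescaled = alpha_gamma_divergence([3.0 * x for x in p], [0.2 * x for x in q], 0.5)

# The divergence vanishes when one argument is a multiple of the other.
proportional = alpha_gamma_divergence([2.0 * x for x in q], q, 0.5)

# Values near the singular points approximate the KL and reverse KL cases.
near_one = alpha_gamma_divergence(p, q, 1.0 + 1e-6)
near_zero = alpha_gamma_divergence(p, q, 1e-6)
```

Because both arguments are effectively normalized inside the logarithm, the divergence compares only the shapes of $p$ and $q$, which is the property exploited when patterns of different total mass are to be compared.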
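It is interesting to note that using the above transformations (73) with $\gamma = \beta$, we can generate another family of Gamma-divergences, referred to as Beta-Gamma divergences.
In particular, using the nonlinear transformations (73) for the basic asymmetric Beta-divergence (46), we obtain the Gamma-divergence (72) [35], referred to here as a Beta-Gamma-divergence ($D_{BG}^{(\gamma)}(P\,\|\,Q) = D_G^{(\gamma)}(P\,\|\,Q)$):
$$D_{BG}^{(\gamma)}(P\,\|\,Q) = \frac{1}{\gamma(\gamma+1)} \left[ \log \int p^{\gamma+1}\, d\mu + \gamma \log \int q^{\gamma+1}\, d\mu - (\gamma+1) \log \int p\, q^{\gamma}\, d\mu \right]$$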
where
Analogously, for discrete densities we can express the Beta-Gamma-divergence via generalized power means, also known as Hölder means, as follows
Hence,
and finally
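where the (generalized) power mean of order $\gamma$ is defined as
$$M_{\gamma}\{p_i\} = \left( \frac{1}{n} \sum_{i=1}^{n} p_i^{\gamma} \right)^{1/\gamma}.$$
In the special cases, we obtain the standard harmonic mean ($\gamma = -1$), geometric mean ($\gamma = 0$), arithmetic mean ($\gamma = 1$), and root mean square ($\gamma = 2$), with the following relations:
$$M_{-\infty}\{p_i\} \leq M_{-1}\{p_i\} \leq M_{0}\{p_i\} \leq M_{1}\{p_i\} \leq M_{2}\{p_i\} \leq M_{\infty}\{p_i\},$$
with $M_0\{p_i\} = \lim_{\gamma \to 0} M_{\gamma}\{p_i\} = \left( \prod_{i=1}^{n} p_i \right)^{1/n}$, $M_{-\infty}\{p_i\} = \min_i\{p_i\}$ and $M_{\infty}\{p_i\} = \max_i\{p_i\}$.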
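The ordering of the power means can be illustrated with a few lines of code. The sketch below assumes the definition $M_{\gamma}\{p_i\} = (\frac{1}{n}\sum_i p_i^{\gamma})^{1/\gamma}$ for positive data, with the geometric mean, minimum and maximum used for $\gamma = 0$, $-\infty$ and $+\infty$; the function name and sample values are illustrative.

```python
import math

def power_mean(x, gamma):
    """Generalized (Hoelder) power mean of positive numbers."""
    n = len(x)
    if gamma == 0:                      # limit gamma -> 0: geometric mean
        return math.exp(sum(math.log(xi) for xi in x) / n)
    if gamma == math.inf:               # limit gamma -> +inf: maximum
        return max(x)
    if gamma == -math.inf:              # limit gamma -> -inf: minimum
        return min(x)
    return (sum(xi ** gamma for xi in x) / n) ** (1.0 / gamma)

# Hypothetical positive sample.
x = [1.0, 2.0, 4.0]

orders = [-math.inf, -1, 0, 1, 2, math.inf]
means = [power_mean(x, g) for g in orders]
# means holds, in order: minimum, harmonic, geometric, arithmetic,
# root mean square and maximum of the sample.
```

For this sample the list `means` is nondecreasing, which matches the chain of inequalities between the minimum, harmonic, geometric, arithmetic and root-mean-square means and the maximum.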
Table 3.
Family of Alpha-divergences and corresponding robust Alpha-Gamma-divergences;
. For
, we obtained a generalized robust KL divergences. Note that Gamma divergences are expressed compactly via generalized power means. (see also
Table 4 for more direct representations).
Alpha-divergence | Robust Alpha-Gamma-divergence |
| |
| |
| |
| |
| |
| |
Table 4.
Basic Alpha- and Beta-divergences and directly generated corresponding Gamma-divergences (see also Table 3 for how the Gamma-divergences can be expressed by power means).
Divergence or | Gamma-divergence and |
| |
| |
| |
| |
| |
| |
| |
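The asymmetric Beta-Gamma-divergence has the following properties [30,35]:
$D_{BG}^{(\gamma)}(P\,\|\,Q) \geq 0$. The equality holds if and only if $P = cQ$ for a positive constant $c$.
It is scale invariant, that is, $D_{BG}^{(\gamma)}(P\,\|\,Q) = D_{BG}^{(\gamma)}(c_1 P\,\|\,c_2 Q)$, for arbitrary positive scaling constants $c_1, c_2$.
As $\gamma \to 0$, the Gamma-divergence becomes the Kullback–Leibler divergence:
$$\lim_{\gamma \to 0} D_{BG}^{(\gamma)}(P\,\|\,Q) = D_{KL}(\bar{P}\,\|\,\bar{Q}) = \int \bar{p} \log\frac{\bar{p}}{\bar{q}}\, d\mu,$$
where $\bar{p} = p / \int p\, d\mu$ and $\bar{q} = q / \int q\, d\mu$.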
For
, the Gamma-divergence can be expressed as follows
For the discrete Gamma divergence we have the corresponding formula
Similarly to the Alpha and Beta-divergences, we can also define the symmetric Beta-Gamma-divergence as
The symmetric Gamma-divergence has similar properties to the asymmetric Gamma-divergence:
. The equality holds if and only if for a positive constant c, in particular, .
It is scale invariant, that is,
for arbitrary positive scaling constants
.
For
, it is reduced to a special form of the symmetric Kullback–Leibler divergence (also called the J-divergence)
where
and
.
For
, we obtain a simple divergence expressed by weighted arithmetic means
where weight function
is such that
.
For the discrete Beta-Gamma divergence (or simply the Gamma divergence), we obtain divergence
It is interesting to note that for
the discrete symmetric Gamma-divergence can be expressed by expectation functions
, where
and
.
For $\gamma = 1$, the asymmetric Gamma-divergence (equal to a symmetric Gamma-divergence) reduces to the Cauchy–Schwarz divergence, introduced by Principe [83]:
$$D_G^{(1)}(P\,\|\,Q) = \frac{1}{2} \log \frac{\left( \int p^2\, d\mu \right) \left( \int q^2\, d\mu \right)}{\left( \int p\, q\, d\mu \right)^2} = D_{CS}(P\,\|\,Q).$$
It should be noted that the basic asymmetric Beta-Gamma divergence (derived from the Beta-divergence) is exactly equivalent to the Gamma-divergence defined in [35], while the Alpha-Gamma divergences (derived from the family of Alpha-divergences) have different expressions but are similar in terms of properties.
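The main properties of the Gamma-divergence can be checked for discrete data. The sketch below assumes the form $\frac{1}{\gamma(\gamma+1)}\big[\log\sum_i p_i^{\gamma+1} + \gamma\log\sum_i q_i^{\gamma+1} - (\gamma+1)\log\sum_i p_i q_i^{\gamma}\big]$ and the Cauchy–Schwarz divergence $-\log\big(\sum_i p_i q_i / \sqrt{\sum_i p_i^2 \sum_i q_i^2}\big)$; the function names, data and scaling constants are illustrative.

```python
import math

def gamma_divergence(p, q, gamma):
    """Assumed discrete Gamma-divergence, gamma not in {0, -1}."""
    a = sum(pi ** (gamma + 1.0) for pi in p)
    b = sum(qi ** (gamma + 1.0) for qi in q)
    c = sum(pi * qi ** gamma for pi, qi in zip(p, q))
    return ((math.log(a) + gamma * math.log(b) - (gamma + 1.0) * math.log(c))
            / (gamma * (gamma + 1.0)))

def cauchy_schwarz_divergence(p, q):
    """Cauchy-Schwarz divergence: -log of the cosine between p and q."""
    pq = sum(pi * qi for pi, qi in zip(p, q))
    pp = sum(pi * pi for pi in p)
    qq = sum(qi * qi for qi in q)
    return -math.log(pq / math.sqrt(pp * qq))

# Hypothetical positive, unnormalized data.
p = [1.0, 2.0, 3.0]
q = [1.5, 1.5, 2.5]

base = gamma_divergence(p, q, 0.5)

# Independent rescaling of either argument leaves the value unchanged.
rescaled = gamma_divergence([3.0 * x for x in p], [0.2 * x for x in q], 0.5)

# The divergence vanishes when p is a positive multiple of q.
proportional = gamma_divergence([2.0 * x for x in q], q, 0.5)

# gamma = 1 should coincide with the Cauchy-Schwarz divergence.
at_one = gamma_divergence(p, q, 1.0)
cs = cauchy_schwarz_divergence(p, q)
```

Since the value at $\gamma = 1$ depends only on the angle between $p$ and $q$, it is symmetric in its arguments, which is consistent with the asymmetric and symmetric Gamma-divergences coinciding in that case.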