Families of Alpha- Beta- and Gamma- Divergences: Flexible and Robust Measures of Similarities

Cichocki, Andrzej; Amari, Shun-ichi

doi:10.3390/e12061532


by Andrzej Cichocki ¹,²,* and Shun-ichi Amari ³

¹ Riken Brain Science Institute, Laboratory for Advanced Brain Signal Processing, Wako-shi, Japan

² Systems Research Institute, Polish Academy of Science, Poland

³ Riken Brain Science Institute, Laboratory for Mathematical Neuroscience, Wako-shi, Japan

* Author to whom correspondence should be addressed.

Entropy 2010, 12(6), 1532-1568; https://doi.org/10.3390/e12061532

Submission received: 26 April 2010 / Accepted: 1 June 2010 / Published: 14 June 2010


Abstract


In this paper, we extend and overview wide families of Alpha-, Beta- and Gamma-divergences and discuss their fundamental properties. In the literature, usually only a single asymmetric (Alpha, Beta or Gamma) divergence is considered. We show in this paper that there exist families of such divergences with the same consistent properties. Moreover, we establish links and correspondences among these divergences by applying suitable nonlinear transformations. For example, we can generate the Beta-divergences directly from Alpha-divergences and vice versa. Furthermore, we show that a new wide class of Gamma-divergences can be generated not only from the family of Beta-divergences but also from a family of Alpha-divergences. The paper bridges these divergences and also shows their links to Tsallis and Rényi entropies. Most of these divergences have a natural information-theoretic interpretation.


1. Introduction

Many machine learning algorithms for classification and clustering employ a variety of dissimilarity measures. Information theory, convex analysis, and information geometry play key roles in the formulation of such divergences [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25].

The most popular and often used are the squared Euclidean distance and the Kullback–Leibler divergence. Recently, generalized divergences such as the Csiszár–Morimoto f-divergence and the Bregman divergence have become attractive alternatives for advanced machine learning algorithms [26,27,28,29,30,31,32,33,34]. In this paper, we discuss a robust parameterized subclass of the Csiszár–Morimoto and the Bregman divergences: the Alpha- and Beta-divergences, which may provide improved accuracy and solutions that are more robust with respect to outliers and additive noise. Moreover, we provide links to a new class of robust Gamma-divergences [35] and extend this class to so-called Alpha-Gamma divergences.

Divergences are considered here as (dis)similarity measures. Generally speaking, they measure a quasi-distance or directed difference between two probability distributions $P$ and $Q$, which can also be expressed for unconstrained nonnegative multi-way arrays and patterns.

In this paper we assume that $P$ and $Q$ are positive measures (densities) that are not necessarily normalized but are finite. In the special case of normalized densities, we explicitly refer to these as probability densities. Unless stated otherwise, we assume that these measures are continuous. An information divergence is a measure of distance between two probability curves. In this paper, we discuss only one-dimensional probability curves (represented by nonnegative signals or time series). Generalization to two- or multidimensional variables is straightforward. One density, $Q(x)$, is usually known and fixed, and the other, $P(x)$, is learned or adjusted to achieve the best similarity, in some sense, to $Q(x)$. For example, a discrete density $Q$ corresponds to the observed data and the vector $P$ to the estimated or expected data, which are subject to constraints imposed on the assumed models. For the Non-negative Matrix Factorization (NMF) problem, $Q$ corresponds to the data matrix $Y$ and $P$ corresponds to the estimated matrix $\hat{Y} = A X$ (or vice versa) [30].

The distance between two densities is called a metric if the following conditions hold:

1. $D(P \| Q) \geq 0$, with equality if and only if $P = Q$ (nonnegativity and positive definiteness),
2. $D(P \| Q) = D(Q \| P)$ (symmetry),
3. $D(P \| Z) \leq D(P \| Q) + D(Q \| Z)$ (subadditivity/triangle inequality).

Distances that satisfy only Condition 1 are not metrics and are referred to as (asymmetric) divergences.

In many applications, such as image analysis, pattern recognition and statistical machine learning, we use information-theoretic divergences rather than squared Euclidean or $l_{p}$-norm distances [28]. Several information divergences, such as the Kullback–Leibler, Hellinger and Jensen–Shannon divergences, are central to estimating similarity between distributions and have a long history in information geometry.

The concept of a divergence is not restricted to Euclidean spaces but can be extended to abstract spaces with the help of the Radon–Nikodym derivative (see for example [36]). Let $(X, A, μ)$ be a measure space, where μ is a finite or a σ-finite measure on $(X, A)$, and let us assume that $P$ and $Q$ are two (probability) measures on $(X, A)$ such that $P \ll μ$ and $Q \ll μ$, i.e., they are absolutely continuous with respect to the measure μ (e.g., $μ = P + Q$), and that $p = \frac{dP}{dμ}$ and $q = \frac{dQ}{dμ}$ are the Radon–Nikodym derivatives (densities) of $P$ and $Q$ with respect to μ. Using this notation, the fundamental Kullback–Leibler (KL) divergence between two probability distributions can be written as

D_{K L} (P | | Q) = \int_{X} p (x) log (\frac{p (x)}{q (x)}) d μ (x)

(1)

which is related to the Shannon entropy

H_{S} (P) = - \int_{X} p (x) log (p (x)) d μ (x)

(2)

via

D_{K L} (P | | Q) = V_{S} (P, Q) - H_{S} (P)

where

V_{S} (P, Q) = - \int_{X} p (x) log (q (x)) d μ (x)

is Shannon's cross entropy, provided that the integrals exist. (In measure-theoretic terms, the integral exists if the measure induced by $P$ is absolutely continuous with respect to that induced by $Q$.) Here and throughout the paper we assume that all integrals exist.
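For discrete distributions (taking μ to be the counting measure), the decomposition of the KL divergence into cross entropy minus entropy can be checked numerically. The following is a minimal sketch; the function names and the example vectors are illustrative assumptions and do not come from the paper.

```python
import math

def shannon_entropy(p):
    """H_S(P) = -sum_i p_i log p_i, using the convention 0 log 0 = 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """V_S(P, Q) = -sum_i p_i log q_i (needs q_i > 0 wherever p_i > 0)."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i p_i log(p_i / q_i), as in (1) for discrete P, Q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical normalized example distributions.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

kl = kl_divergence(p, q)
decomposed = cross_entropy(p, q) - shannon_entropy(p)
```

Up to floating-point rounding, `kl` and `decomposed` agree, and swapping the arguments of `kl_divergence` generally gives a different value, which illustrates the asymmetry of the divergence.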

The Kullback–Leibler divergence has been generalized by using a family of functions called generalized logarithm functions or α-logarithm

{log}_{α} (x) = \frac{1}{1 - α} (x^{1 - α} - 1) (for x > 0)

(3)

which is a power function of x with power $1 - α$, and which reduces to the natural logarithm in the limit $α \to 1$. Often, the power function (3) allows us to generate divergences that are more robust with respect to outliers and consequently give better or more flexible performance (see, for example, [37]).
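The behaviour of the α-logarithm (3) near $α = 1$ can be illustrated with a few lines of code. This is only a sketch; the function name `log_alpha` is an assumption made here for illustration.

```python
import math

def log_alpha(x, alpha):
    """Generalized logarithm (3); returns the natural logarithm at alpha = 1."""
    if x <= 0:
        raise ValueError("log_alpha is defined only for x > 0")
    if alpha == 1:
        return math.log(x)
    return (x ** (1.0 - alpha) - 1.0) / (1.0 - alpha)

# Close to alpha = 1 the power function approaches ln(x).
near_one = log_alpha(2.0, 1.0 + 1e-6)
exact = math.log(2.0)
```

For $α = 0$ the function is simply $x - 1$, and for every α it vanishes at $x = 1$, just as the ordinary logarithm does.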

By using this type of extension, we derive and review three series of divergences, the Alpha-, Beta- and Gamma-divergences, all of which are generalizations of the KL divergence. Moreover, we show their relation to the Tsallis entropy and the Rényi entropy (see Appendix A and Appendix B). It will also be shown how the Alpha-divergences are derived from the Csiszár–Morimoto f-divergence and the Beta-divergence from the Bregman divergence by using the power functions.

Similarly to the work of Zhang [22,23,24] and Hein and Bousquet [21], one of our motivations is to show the close links and relations among a wide class of divergences and to provide an elegant way to handle the known divergences, intermediate ones, and new ones in the same framework. However, our similarity measures are different from those proposed in [21], and our approach and results are quite different from those presented in [22].

It should be mentioned that there have been previous attempts at unifying divergence functions (especially those related to the Alpha-divergence), starting from the work of Zhu and Rohwer [10,11], Amari and Nagaoka [3], Taneja [38,39], Zhang [22], and Gorban and Judge [40]. In particular, Zhang in [22] and in subsequent works [23,24] investigated the deep link between information geometry and various divergence functions, mostly related to the Alpha-divergence, through a unified approach based on convex functions. However, the previous works have not explicitly considered links and relationships among all three fundamental classes of (Alpha-, Beta-, Gamma-) divergences. Moreover, some of their basic properties are reviewed and extended here.

The scope of the results presented in this paper is vast, since the class of generalized (flexible) divergence functions includes a large number of useful loss functions, among them those based on the relative entropies, the generalized Kullback–Leibler or I-divergence, the Hellinger distance, the Jensen–Shannon divergence, the J-divergence, the Pearson and Neyman Chi-square divergences, the Triangular Discrimination and the Arithmetic-Geometric divergence. Moreover, we show that some new divergences can be generated. In particular, we generate a new family of Alpha-Gamma divergences and Itakura–Saito-like distances with the scale-invariance property, which belong to a wider class of Beta-divergences. Generally, these new scale-invariant divergences extend the families of Beta- and Gamma-divergences. The divergence functions discussed in this paper are flexible because they allow us to generate a large number of well-known and often used particular divergences (for specific values of the tuning parameters). Moreover, by adjusting adaptive tuning parameters, we can optimize cost functions for learning algorithms and estimate desired parameters of a model in the presence of noise and outliers. In other words, the divergences discussed in this paper, especially the Beta- and Gamma-divergences, are robust with respect to outliers for some values of the tuning parameters.

One of the important features of the considered family of divergences is that it can give some guidance for the selection, and even the development, of new divergence measures if necessary, and that it allows us to unify these divergences under the same framework using the Csiszár–Morimoto and Bregman divergences and their fundamental properties. Moreover, these families of divergences are generally defined on unnormalized finite measures (not necessarily normalized probabilities). This allows patterns of different sizes to be weighted differently, e.g., images of different sizes or documents of different lengths. Such measures also play an important role in the areas of neural computation, pattern recognition, learning, estimation, inference, and optimization. We have already successfully applied a subset of such divergences as cost functions (possibly with additional constraints and regularization terms) to derive novel multiplicative and additive projected gradient algorithms for nonnegative matrix and tensor factorizations [30,31].

The divergences are closely related to the invariant geometrical properties of the manifold of probability distributions [5,6,7].

2. Family of Alpha-Divergences

The Alpha-divergences can be derived from the Csiszár–Morimoto f-divergence and, as shown recently by Amari, also from the Bregman divergence using some tricks [3,6] (see also Appendix A). The Alpha-divergence was proposed by Chernoff [41] and has been extensively investigated and extended by Amari [1,3,5,6] and other researchers. For some modifications and/or extensions see, for example, the works of Liese and Vajda [36], Minka [42], Taneja [38,39,43], Cressie–Read [44], Zhu–Rohwer [10,11] and Zhang [22,23,24,25]. One of our motivations for investigating and exploring the family of Alpha-divergences is to develop flexible and efficient learning algorithms for nonnegative matrix and tensor factorizations which unify and extend existing algorithms, including EMML (Expectation Maximization Maximum Likelihood) and ISRA (Image Space Reconstruction Algorithm) (see [30,45] and references therein).

2.1. Asymmetric Alpha-Divergences

The basic asymmetric Alpha-divergence can be defined as [3]:

D_{A}^{(α)} (P | | Q) = \frac{1}{α (α - 1)} \int (p^{α} (x) q^{1 - α} (x) - α p (x) + (α - 1) q (x)) d μ (x), α \in R ∖ {0, 1}

(4)

where $p(x)$ and $q(x)$ do not need to be normalized.

The Alpha-divergence can be expressed via a generalized KL divergence as follows

D_{A}^{(α)} (P | | Q) = D_{G K L}^{(α)} (P | | Q) = - \frac{1}{α} \int (p {log}_{α} (\frac{q}{p}) + p - q) d μ (x), α \in R ∖ {0, 1}

(5)

For discrete probability measures with mass functions $P = [p_{1}, p_{2}, \dots, p_{n}]$ and $Q = [q_{1}, q_{2}, \dots, q_{n}]$, the discrete Alpha-divergence is formulated as a separable divergence:

D_{A}^{(α)} (P | | Q) = \sum_{i = 1}^{n} d_{A}^{(α)} (p_{i} | | q_{i}) = \frac{1}{α (α - 1)} \sum_{i = 1}^{n} (p_{i}^{α} q_{i}^{1 - α} - α p_{i} + (α - 1) q_{i}), α \in R ∖ {0, 1}

(6)

where $d_{A}^{(α)}(p_{i} \| q_{i}) = (p_{i}^{α} q_{i}^{1 - α} - α p_{i} + (α - 1) q_{i}) / (α (α - 1))$.

Note that this form of the Alpha-divergence differs slightly from the loss function given in [1], because the latter was defined only for probability distributions, i.e., under the assumptions that $\int p(x) dμ(x) = 1$ and $\int q(x) dμ(x) = 1$. It was extended by Zhu and Rohwer in [10,11] (see also Amari and Nagaoka [3]) to positive measures by incorporating additional terms. These terms are needed to allow de-normalized densities (positive measures), in the same way as for the generalized Kullback–Leibler divergence (I-divergence).

Extending the Kullback–Leibler divergence into the family of Alpha-divergences is a crucial step towards a unified view of a wide class of divergence functions, since it demonstrates that the nonnegativity of Alpha-divergences may be viewed as arising not only from Jensen's inequality but also from the arithmetic-geometric inequality. As has been pointed out by Jun Zhang, this can be fully exploited as a consequence of the inequality of convex functions or, more generally, by his convex-based approach [22].

For normalized densities $\bar{p}(x) = p(x) / \int p(x) dμ(x)$ and $\bar{q}(x) = q(x) / \int q(x) dμ(x)$, the Alpha-divergence simplifies to [1,44]:

D_{A}^{(α)} (\bar{P} | | \bar{Q}) = \frac{1}{α (α - 1)} (\int {\bar{p}}^{α} (x) {\bar{q}}^{1 - α} (x) d μ (x) - 1), α \in R ∖ {0, 1}

(7)

and is related to the Tsallis divergence and the Tsallis entropy [46] (see Appendix A):

H_{T}^{(α)} (\bar{P}) = \frac{1}{1 - α} (\int {\bar{p}}^{α} (x) d μ (x) - 1) = - \int {\bar{p}}^{α} (x) {log}_{α} \bar{p} (x) d μ (x)

(8)

provided integrals on the right exist.

In fact, the Tsallis entropy was first defined by Havrda and Charvat in 1967 [47], was almost forgotten, and was rediscovered by Tsallis in 1988 [46] in a different context (see Appendix A).

Various authors use the parameter α in different ways. For example, using the Amari notation $α_{A}$ with $α = (1 - α_{A}) / 2$, the Alpha-divergence takes the following form [1,2,3,10,11,22,23,24,25,44,48,49]:

$\tilde{D}_{A}^{(α_{A})}(P \| Q) = \frac{4}{1 - α_{A}^{2}} \int \left(\frac{1 - α_{A}}{2} p + \frac{1 + α_{A}}{2} q - p^{\frac{1 - α_{A}}{2}} q^{\frac{1 + α_{A}}{2}}\right) dμ(x), \quad α_{A} \in \mathbb{R} \setminus \{\pm 1\}$

(9)

When α takes values from 0 to 1, $α_{A}$ takes values from 1 down to $-1$. A duality exists between $α_{A}$ and $-α_{A}$, in the sense that $\tilde{D}_{A}^{(α_{A})}(P \| Q) = \tilde{D}_{A}^{(-α_{A})}(Q \| P)$.
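The correspondence between the two parameterizations can be checked numerically for discrete positive measures. The sketch below assumes the counting measure and uses hypothetical function names and example vectors; it compares definition (4) with the Amari form (9) and verifies the duality between $α_{A}$ and $-α_{A}$.

```python
def alpha_divergence(p, q, alpha):
    """Discrete form of (4); alpha must not be 0 or 1."""
    c = 1.0 / (alpha * (alpha - 1.0))
    return c * sum(pi ** alpha * qi ** (1.0 - alpha) - alpha * pi + (alpha - 1.0) * qi
                   for pi, qi in zip(p, q))

def alpha_divergence_amari(p, q, alpha_a):
    """Discrete form of (9); alpha_a must not be +1 or -1."""
    a = (1.0 - alpha_a) / 2.0   # weight and exponent attached to p
    b = (1.0 + alpha_a) / 2.0   # weight and exponent attached to q
    c = 4.0 / (1.0 - alpha_a ** 2)
    return c * sum(a * pi + b * qi - pi ** a * qi ** b for pi, qi in zip(p, q))

# Hypothetical positive measures (deliberately not normalized).
p = [1.0, 2.0, 0.5]
q = [1.5, 1.0, 0.7]

alpha_a = 0.4
alpha = (1.0 - alpha_a) / 2.0   # corresponding parameter in (4)
```

With these definitions, `alpha_divergence_amari(p, q, alpha_a)` and `alpha_divergence(p, q, alpha)` coincide up to rounding, and exchanging the arguments while flipping the sign of `alpha_a` leaves the value unchanged.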

In the special cases $α = 2, 0.5, -1$, we obtain from (4) the well-known Pearson Chi-square, Hellinger and inverse Pearson (also called Neyman Chi-square) distances, given respectively by

\begin{matrix} D_{A}^{(2)} (P | | Q) & = & D_{P} (P | | Q) = \frac{1}{2} \int \frac{{(p (x) - q (x))}^{2}}{q (x)} d μ (x) \\ D_{A}^{(1 / 2)} (P | | Q) & = & 2 D_{H} (P | | Q) = 2 \int {(\sqrt{p (x)} - \sqrt{q (x)})}^{2} d μ (x) \\ D_{A}^{(- 1)} (P | | Q) & = & D_{N} (P | | Q) = \frac{1}{2} \int \frac{{(p (x) - q (x))}^{2}}{p (x)} d μ (x) \end{matrix}

(10)

For the singular values $α = 1$ and $α = 0$, the Alpha-divergences (4) have to be defined as limiting cases for $α \to 1$ and $α \to 0$, respectively. When this limit is evaluated (using L'Hôpital's rule) for $α \to 1$, we obtain the Kullback–Leibler divergence:

\begin{matrix} D_{K L} (P | | Q) & = lim_{α \to 1} D_{A}^{(α)} (P | | Q) = lim_{α \to 1} \int \frac{p^{α} q^{1 - α} log p - p^{α} q^{1 - α} log q - p + q}{2 α - 1} d μ (x) \\ = \int (p log (\frac{p}{q}) - p + q) d μ (x) \end{matrix}

(11)

with the conventions $0 / 0 = 0$, $0 \log(0) = 0$ and $p / 0 = \infty$ for $p > 0$.

From the inequality $p \log p \geq p - 1$, it follows that the KL I-divergence is nonnegative and achieves zero if and only if $P = Q$.

Similarly, for $α \to 0$, we obtain the reverse Kullback–Leibler divergence:

D_{K L} (Q | | P) = lim_{α \to 0} D_{A}^{(α)} (P | | Q) = \int (q log (\frac{q}{p}) - q + p) d μ (x)

(12)

Hence, the Alpha-divergence can be evaluated in a more explicit form as

D_{A}^{(α)} (P | | Q) = \{\begin{matrix} \frac{1}{α (α - 1)} \int (q (x) [{(\frac{p (x)}{q (x)})}^{α} - 1] - α [p (x) - q (x)]) d μ (x), α \neq 0, 1 \\ \int (q (x) log \frac{q (x)}{p (x)} + p (x) - q (x)) d μ (x), α = 0 \\ \int (p (x) log \frac{p (x)}{q (x)} - p (x) + q (x)) d μ (x), α = 1 \end{matrix}

(13)
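The piecewise form (13) translates directly into code for discrete positive measures. The following sketch (function name and test vectors are assumptions for illustration) implements the three branches and compares selected values of α against the Pearson, Hellinger and Neyman expressions in (10).

```python
import math

def alpha_divergence(p, q, alpha):
    """Discrete Alpha-divergence (13) for positive, possibly unnormalized p and q."""
    if alpha == 0:
        # Reverse generalized KL divergence, the alpha -> 0 limit (12).
        return sum(qi * math.log(qi / pi) + pi - qi for pi, qi in zip(p, q))
    if alpha == 1:
        # Generalized KL (I-) divergence, the alpha -> 1 limit (11).
        return sum(pi * math.log(pi / qi) - pi + qi for pi, qi in zip(p, q))
    c = 1.0 / (alpha * (alpha - 1.0))
    return c * sum(qi * ((pi / qi) ** alpha - 1.0) - alpha * (pi - qi)
                   for pi, qi in zip(p, q))

# Hypothetical positive measures (not normalized).
p = [1.0, 2.0, 0.5]
q = [1.5, 1.0, 0.7]

pearson = 0.5 * sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))
hellinger = sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))
neyman = 0.5 * sum((pi - qi) ** 2 / pi for pi, qi in zip(p, q))
```

Evaluating `alpha_divergence` at α = 2, 0.5 and −1 reproduces `pearson`, `2 * hellinger` and `neyman`, and values of α very close to 0 or 1 are close to the logarithmic branches, which illustrates the continuity in α.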

In fact, the Alpha-divergence smoothly connects the I-divergence $D_{KL}(P \| Q)$ with the reverse I-divergence $D_{KL}(Q \| P)$ and passes through the Hellinger distance [50]. Moreover, it also smoothly connects the Pearson Chi-square and Neyman Chi-square divergences and passes through the I-divergences [10,11].

The Alpha-divergence is a special case of the Csiszár–Morimoto f-divergence [17,19] (proposed later by Ali and Silvey [20]). This class of divergences was also independently defined by Morimoto [51]. The Csiszár–Morimoto f-divergence is associated with any function $f(u)$ that is convex over $(0, \infty)$ and satisfies $f(1) = 0$:

D_{f} (P | | Q) = \int q (x) f (\frac{p (x)}{q (x)}) d μ (x)

(14)

We define $0 f(0 / 0) = 0$ and $0 f(a / 0) = \lim_{t \to 0} t f(a / t) = a \lim_{u \to \infty} f(u) / u$. Indeed, choosing $f(u) = (u^{α} - α u + α - 1) / (α^{2} - α)$ yields formula (4).
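The claim that this choice of $f$ reproduces (4) can be verified numerically in the discrete case. The sketch below is only illustrative: the helper names are assumptions, and the measures are taken to be strictly positive so that no limiting conventions are needed.

```python
def f_divergence(p, q, f):
    """Discrete Csiszar-Morimoto f-divergence (14): sum_i q_i f(p_i / q_i)."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def alpha_generator(alpha):
    """Convex generator f(u) associated with (4); alpha must not be 0 or 1."""
    def f(u):
        return (u ** alpha - alpha * u + alpha - 1.0) / (alpha * alpha - alpha)
    return f

def alpha_divergence_direct(p, q, alpha):
    """Definition (4) written out for discrete positive measures."""
    c = 1.0 / (alpha * (alpha - 1.0))
    return c * sum(pi ** alpha * qi ** (1.0 - alpha) - alpha * pi + (alpha - 1.0) * qi
                   for pi, qi in zip(p, q))

# Hypothetical positive measures (not normalized).
p = [1.0, 2.0, 0.5]
q = [1.5, 1.0, 0.7]
```

For any admissible α, `f_divergence(p, q, alpha_generator(alpha))` agrees with `alpha_divergence_direct(p, q, alpha)` up to rounding, and the generator satisfies $f(1) = 0$.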

The Csiszár–Morimoto f-divergence has many beautiful properties [6,17,18,20,22,36,52,53]:

Nonnegativity: The Csiszár–Morimoto f-divergence is always nonnegative, and equal to zero if and only if the probability densities $p (x)$ and $q (x)$ coincide. This follows immediately from Jensen's inequality (for normalized densities):

$D_{f} (P | | Q) = \int q f (\frac{p}{q}) d μ (x) \geq f (\int q (\frac{p}{q}) d μ (x)) = f (1) = 0 .$

(15)
Generalized entropy: It corresponds to a generalized f-entropy of the form:

$H_{f} (P) = - \int f (p (x)) d μ (x)$

(16)

for which the Shannon entropy is a special case for $f (p) = p log (p)$ . Note that $H_{f}$ is concave while f is convex.
Convexity: For any $0 \leq λ \leq 1$

$D_{f} (λ P_{1} + (1 - λ) P_{2} | | λ Q_{1} + (1 - λ) Q_{2}) \leq λ D_{f} (P_{1} | | Q_{1}) + (1 - λ) D_{f} (P_{2} | | Q_{2}) .$

(17)
Scaling: For any positive constant $c > 0$ we have

$c D_{f} (P | | Q) = D_{c f} (P | | Q)$

(18)
Invariance: The f-divergence is invariant to bijective transformations [6,20]. This means that, when $x$ is transformed to $y$ bijectively by

$y = k (x)$

(19)

probability distribution $P (x)$ changes to $\tilde{P} (y)$ . However,

$D_{f} (P | | Q) = D_{f} (\tilde{P} | | \tilde{Q})$

(20)

Additionally,

$D_{f} (P | | Q) = D_{\tilde{f}} (P | | Q)$

(21)

for $\tilde{f} (u) = f (u) - c (u - 1)$ for an arbitrary constant c, and

$D_{f} (P | | Q) = D_{f^{*}} (Q | | P)$

(22)

where $f^{*} (u) = u f (1 / u)$ is called a conjugate function.
Symmetricity: For an arbitrary Csiszár–Morimoto f-divergence, it is possible to construct a symmetric divergence for $f_{s y m} (u) = f (u) + f^{*} (u)$ .
Boundedness: The Csiszár–Morimoto f-divergence for positive measures (densities) is bounded (if the limit exists and is finite) [28,36]

$0 \leq D_{f} (P | | Q) \leq lim_{u \to 0^{+}} \{f (u) + u f (\frac{1}{u})\}$

(23)

Furthermore [54],

$0 \leq D_{f} (P | | Q) \leq \int (p - q) f^{'} (\frac{p}{q}) d μ (x)$

(24)
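Properties (21) and (22) and the symmetrization rule lend themselves to a quick numerical check. The sketch below uses the KL generator $f(u) = u \log u$ and normalized discrete distributions, for which the linear term $c (u - 1)$ integrates to zero; all names and example values are assumptions made for illustration.

```python
import math

def f_divergence(p, q, f):
    """Discrete f-divergence (14): sum_i q_i f(p_i / q_i)."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def conjugate(f):
    """Conjugate generator f*(u) = u f(1/u), as used in (22)."""
    return lambda u: u * f(1.0 / u)

def f_kl(u):
    """Generator of the KL divergence for normalized densities."""
    return u * math.log(u)

# Hypothetical normalized distributions.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]

c = 3.0
f_shifted = lambda u: f_kl(u) - c * (u - 1.0)    # generator from (21)
f_sym = lambda u: f_kl(u) + conjugate(f_kl)(u)   # symmetrized generator

d = f_divergence(p, q, f_kl)
```

Here `d` equals both `f_divergence(q, p, conjugate(f_kl))` and `f_divergence(p, q, f_shifted)` up to rounding, while `f_sym` yields a divergence that does not change when `p` and `q` are exchanged.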

Using fundamental properties of the Csiszár–Morimoto f-divergence we can establish basic properties of the Alpha-divergences [3,6,22,40,42].

The Alpha-divergence (4) has the following basic properties:

Convexity: $D_{A}^{(α)} (P | | Q)$ is convex with respect to both $P$ and $Q$ .
Strict Positivity: $D_{A}^{(α)} (P | | Q) \geq 0$ and $D_{A}^{(α)} (P | | Q) = 0$ if and only if $P = Q$ .
Continuity: The Alpha-divergence is a continuous function of the real variable α over the whole range, including the singularities.
Duality: $D_{A}^{(α)} (P | | Q) = D_{A}^{(1 - α)} (Q | | P)$ .
Exclusive/Inclusive Properties: [42]
- For $α \to - \infty$ , the estimation of $q (x)$ that approximates $p (x)$ is exclusive, that is, $q (x) \leq p (x)$ for all $x$ . This means that the minimization of $D_{A}^{(α)} (P | | Q)$ with respect to $q (x)$ will force $q (x)$ to be an exclusive approximation, i.e., the mass of $q (x)$ will lie within $p (x)$ (see details and graphical illustrations in [42]).
- For $α \to \infty$ , the estimation of $q (x)$ that approximates $p (x)$ is inclusive, that is $q (x) \geq p (x)$ for all $x$ . In other words, the mass of $q (x)$ includes all the mass of $p (x)$ .
Zero-forcing and zero-avoiding properties: [42]
Here, we treat the case where $p (x)$ and $q (x)$ are not necessarily mutually absolutely continuous. In such a case the divergence may diverge to ∞. However, the following two properties hold:
- For $α \leq 0$ the estimation of $q (x)$ that approximates $p (x)$ is zero-forcing (coercive), that is, $p (x) = 0$ forces $q (x) = 0$ .
- For $α \geq 1$ the estimation of $q (x)$ that approximates $p (x)$ is zero-avoiding, that is, $p (x) > 0$ implies $q (x) > 0$ .

One of the most important properties of the Alpha-divergence is that it is a convex function with respect to $P$ and $Q$ and has a unique minimum for $P = Q$ (see e.g. [22]).

2.2. Alpha-Rényi Divergence

It is interesting to note that the Alpha-divergence is closely related to the Rényi divergence. We define an Alpha-Rényi divergence as

$D_{AR}^{(α)}(P \| Q) = \frac{1}{α (α - 1)} \log\left(1 + α (α - 1) D_{A}^{(α)}(P \| Q)\right) = \frac{1}{α (α - 1)} \log\left(\int (p^{α} q^{1 - α} - α p + (α - 1) q) dμ(x) + 1\right), \quad α \in \mathbb{R} \setminus \{0, 1\}$

(25)

For $α = 0$ and $α = 1$ the Alpha-Rényi divergence simplifies to the Kullback–Leibler divergences:

$D_{KL}(P \| Q) = \lim_{α \to 1} D_{AR}^{(α)}(P \| Q) = \lim_{α \to 1} \frac{1}{2 α - 1} \frac{\int (p^{α} q^{1 - α} (\log p - \log q) - p + q) dμ(x)}{\int (p^{α} q^{1 - α} - α p + (α - 1) q) dμ(x) + 1} = \int \left(p \log \frac{p}{q} - p + q\right) dμ(x)$

(26)

and

\begin{matrix} D_{K L} (Q | | P) = lim_{α \to 0} D_{A R}^{(α)} (P | | Q) = \int (q log (\frac{q}{p}) - q + p) d μ (x) \end{matrix}

(27)

We define the Alpha-Rényi divergence for normalized probability densities as follows:

D_{A R}^{(α)} (\bar{P} | | \bar{Q}) = \frac{1}{α (α - 1)} log (\int {\bar{p}}^{α} {\bar{q}}^{1 - α} d μ (x)) = \frac{1}{α - 1} log {(\int \bar{q} {(\frac{\bar{p}}{\bar{q}})}^{α} d μ (x))}^{\frac{1}{α}}

(28)

which corresponds to the Rényi entropy [55,56,57] (see Appendix B)

H_{R}^{(α)} (\bar{P}) = - \frac{1}{α - 1} log (\int {\bar{p}}^{α} (x) d μ (x))

(29)

Note that we use a different scaling factor than in the original Rényi divergence [56,57]. In general, the Alpha-Rényi divergence makes sense only for (normalized) probability densities (since otherwise the function (25) can be complex-valued for some positive non-normalized measures).
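For normalized discrete distributions, the relation between the Alpha-divergence (7) and the Alpha-Rényi divergence (28) reduces to taking a logarithm of the same sum. The sketch below illustrates this under the assumption of strictly positive, normalized probability vectors; the function names and test data are hypothetical.

```python
import math

def alpha_divergence_normalized(p, q, alpha):
    """Alpha-divergence (7) for normalized p, q; alpha must not be 0 or 1."""
    s = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))
    return (s - 1.0) / (alpha * (alpha - 1.0))

def alpha_renyi_divergence(p, q, alpha):
    """Alpha-Renyi divergence (28) for normalized p, q; alpha must not be 0 or 1."""
    s = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))
    return math.log(s) / (alpha * (alpha - 1.0))

def via_alpha_divergence(p, q, alpha):
    """First line of (25): log(1 + alpha (alpha - 1) D_A) / (alpha (alpha - 1))."""
    c = alpha * (alpha - 1.0)
    return math.log(1.0 + c * alpha_divergence_normalized(p, q, alpha)) / c

# Hypothetical normalized distributions.
p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

The two routes `alpha_renyi_divergence` and `via_alpha_divergence` agree up to rounding, and for α close to 1 the value approaches `kl`.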

Furthermore, the Alpha-divergence is convex with respect to positive densities p and q for any parameter $α \in \mathbb{R}$, while the Alpha-Rényi divergence (28) is jointly convex in p and q for $α \in [0, 1]$ and is not convex for $α > 1$. Convexity implies that $D_{AR}^{(α)}$ is increasing in α when p and q are fixed. Actually, the Rényi divergence is increasing in α on the whole set $(0, \infty)$ [58,59,60].

2.3. Extended Family of Alpha-Divergences

There are several ways to extend the asymmetric Alpha-divergence. For example, instead of q, we can take the average of q and p, under the assumption that if p and q are similar, they should be “close” to their average, so we can define the modified Alpha-divergence as

D_{A m 1}^{(α)} (P | | \tilde{Q}) = \frac{1}{α (α - 1)} \int (\tilde{q} [{(\frac{p}{\tilde{q}})}^{α} - 1] - α (p - \tilde{q})) d μ (x)

(30)

whereas an adjoint Alpha-divergence is given by

D_{A m 2}^{(α)} (\tilde{Q} | | P) = \frac{1}{α (α - 1)} \int (p [{(\frac{\tilde{q}}{p})}^{α} - 1] + α (p - \tilde{q})) d μ (x)

(31)

where $\tilde{Q} = (P + Q) / 2$ and $\tilde{q} = (p + q) / 2$.

For the singular values $α = 1$ and $α = 0$, the Alpha-divergences (30) can be evaluated as

lim_{α \to 0} D_{A m 1}^{(α)} (P | | \tilde{Q}) = \int (\tilde{q} log (\frac{\tilde{q}}{p}) + p - \tilde{q}) d μ (x) = \int (\frac{p + q}{2} log (\frac{p + q}{2 p}) + \frac{p - q}{2}) d μ (x)

and

lim_{α \to 1} D_{A m 1}^{(α)} (P | | \tilde{Q}) = \int (p log (\frac{p}{\tilde{q}}) - p + \tilde{q}) d μ (x) = \int (p log (\frac{2 p}{p + q}) + \frac{q - p}{2}) d μ (x)

As examples, we consider the following prominent cases for (31):

1. Triangular Discrimination (TD) [61]

D_{A m 2}^{(- 1)} (\tilde{Q} | | P) = \frac{1}{4} D_{T D} (P | | Q) = \frac{1}{4} \int \frac{{(p - q)}^{2}}{p + q} d μ (x)

(32)

2. Relative Jensen–Shannon divergence [62,63,64]

$\lim_{α \to 0} D_{Am2}^{(α)}(\tilde{Q} \| P) = D_{RJS}(P \| Q) = \int \left(p \log \frac{2 p}{p + q} + \frac{q - p}{2}\right) dμ(x)$

(33)

3. Relative Arithmetic-Geometric divergence proposed by Taneja [38,39,43]

$\lim_{α \to 1} D_{Am2}^{(α)}(\tilde{Q} \| P) = \frac{1}{2} D_{RAG}(P \| Q) = \frac{1}{2} \int \left((p + q) \log \frac{p + q}{2 p} + p - q\right) dμ(x)$

(34)

4. Neyman Chi-square divergence

D_{A m 2}^{(2)} (\tilde{Q} | | P) = \frac{1}{8} D_{χ^{2}} (P | | Q) = \frac{1}{8} \int \frac{{(p - q)}^{2}}{p} d μ (x)

(35)

The asymmetric Alpha-divergences can be expressed formally as the Csiszár–Morimoto f-divergence, as shown in Table 1 [30].

Table 1. Asymmetric Alpha-divergences and associated convex Csiszár-Morimoto functions [30].

Divergence $D_{A}^{(α)} (P \| Q) = \int q f^{(α)} (\frac{p}{q}) d μ (x)$	Csiszár function $f^{(α)} (u), u = p / q$
$\{\begin{matrix} \frac{1}{α (α - 1)} \int (q [{(\frac{p}{q})}^{α} - 1] - α (p - q)) d μ (x), \\ \int (q log \frac{q}{p} + p - q) d μ (x), \\ \int (p log \frac{p}{q} - p + q) d μ (x) . \end{matrix}$	$\{\begin{matrix} \frac{u^{α} - 1 - α (u - 1)}{α (α - 1)}, α \neq 0, 1 \\ u - 1 - log u, α = 0, \\ 1 - u + u log u, α = 1 . \end{matrix}$
$\{\begin{matrix} \frac{1}{α (α - 1)} \int (p [{(\frac{q}{p})}^{α} - 1] + α (p - q)) d μ (x), \\ \int (p log \frac{p}{q} - p + q) d μ (x), \\ \int (q log \frac{q}{p} + p - q) d μ (x) . \end{matrix}$	$\{\begin{matrix} \frac{u^{1 - α} + (α - 1) u - α}{α (α - 1)}, α \neq 0, 1, \\ 1 - u + u log u, α = 0 \\ u - 1 - log u, α = 1 . \end{matrix}$
$\{\begin{matrix} \frac{1}{α (α - 1)} \int (q {(\frac{p + q}{2 q})}^{α} - q - α \frac{p - q}{2}) d μ (x), \\ \int (q log (\frac{2 q}{p + q}) + \frac{p - q}{2}) d μ (x), \\ \int (\frac{p + q}{2} log (\frac{p + q}{2 q}) - \frac{p - q}{2}) d μ (x) . \end{matrix}$	$\{\begin{matrix} \frac{{(\frac{u + 1}{2})}^{α} - 1 - α (\frac{u - 1}{2})}{α (α - 1)}, α \neq 0, 1, \\ \frac{u - 1}{2} + log (\frac{2}{u + 1}), α = 0, \\ \frac{1 - u}{2} + \frac{u + 1}{2} log (\frac{u + 1}{2}), α = 1 . \end{matrix}$
$\{\begin{matrix} \frac{1}{α (α - 1)} \int (p {(\frac{p + q}{2 p})}^{α} - p + α \frac{p - q}{2}) d μ (x), \\ \int (p log (\frac{2 p}{p + q}) - \frac{p - q}{2}) d μ (x), \\ \int (\frac{p + q}{2} log (\frac{p + q}{2 p}) + \frac{p - q}{2}) d μ (x) . \end{matrix}$	$\{\begin{matrix} \frac{u {(\frac{u + 1}{2 u})}^{α} - u - α (\frac{1 - u}{2})}{α (α - 1)}, α \neq 0, 1, \\ \frac{1 - u}{2} - u log (\frac{u + 1}{2 u}), α = 0, \\ \frac{u - 1}{2} + (\frac{u + 1}{2}) log (\frac{u + 1}{2 u}), α = 1 . \end{matrix}$
$\{\begin{matrix} \frac{1}{α - 1} \int (p - q) [{(\frac{p + q}{2 q})}^{α - 1} - 1] d μ (x), \\ \int (p - q) log (\frac{p + q}{2 q}) d μ (x) . \end{matrix}$	$\{\begin{matrix} \frac{(u - 1) [{(\frac{u + 1}{2})}^{α - 1} - 1]}{α - 1}, α \neq 1, \\ (u - 1) log (\frac{u + 1}{2}), α = 1 . \end{matrix}$
$\{\begin{matrix} \frac{1}{α - 1} \int (q - p) [{(\frac{p + q}{2 p})}^{α - 1} - 1] d μ (x), \\ \int (q - p) log (\frac{p + q}{2 p}) d μ (x) . \end{matrix}$	$\{\begin{matrix} \frac{(1 - u) [{(\frac{u + 1}{2 u})}^{α - 1} - 1]}{α - 1}, α \neq 1, \\ (1 - u) log (\frac{u + 1}{2 u}), α = 1 . \end{matrix}$

2.4. Symmetrized Alpha-Divergences

The basic Alpha-divergence is asymmetric, that is, in general $D_{A}^{(α)}(P \| Q) \neq D_{A}^{(α)}(Q \| P)$.

Generally, there are two ways to symmetrize divergences: Type-1

D_{S 1} (P | | Q) = \frac{1}{2} [D_{A} (P | | Q) + D_{A} (Q | | P)]

(36)

and Type-2

D_{S 2} (P | | Q) = \frac{1}{2} [D_{A} (P | | \frac{P + Q}{2}) + D_{A} (Q | | \frac{P + Q}{2})]

(37)

The symmetric Alpha-divergence (Type-1) can be defined as follows (we omit the scaling factor $1 / 2$ for simplicity):

D_{A S 1}^{(α)} (P | | Q) = D_{A}^{(α)} (P | | Q) + D_{A}^{(α)} (Q | | P) = \int \frac{(p^{α} - q^{α}) (p^{1 - α} - q^{1 - α})}{α (1 - α)} d μ (x)

(38)

As special cases, we obtain several well-known symmetric divergences:

1. Symmetric Chi-Squared divergence [54]

D_{A S 1}^{(- 1)} (P | | Q) = D_{A S 1}^{(2)} (P | | Q) = \frac{1}{2} D_{χ^{2}} (P | | Q) = \frac{1}{2} \int \frac{{(p - q)}^{2} (p + q)}{p q} d μ (x)

(39)

2. Symmetrized KL divergence, called also J-divergence corresponding to Jeffreys entropy maximization [65,66]

lim_{α \to 0} D_{A S 1}^{(α)} (P | | Q) = lim_{α \to 1} D_{A S 1}^{(α)} (P | | Q) = D_{J} (P | | Q) = \int (p - q) log (\frac{p}{q}) d μ (x)

(40)
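The closed form in (38) and its special cases (39) and (40) can be checked numerically for discrete positive measures. The following sketch uses hypothetical function names and test vectors.

```python
import math

def alpha_divergence(p, q, alpha):
    """Discrete form of (4); alpha must not be 0 or 1."""
    c = 1.0 / (alpha * (alpha - 1.0))
    return c * sum(pi ** alpha * qi ** (1.0 - alpha) - alpha * pi + (alpha - 1.0) * qi
                   for pi, qi in zip(p, q))

def symmetric_alpha_type1(p, q, alpha):
    """Closed form on the right-hand side of (38); alpha must not be 0 or 1."""
    total = sum((pi ** alpha - qi ** alpha) * (pi ** (1.0 - alpha) - qi ** (1.0 - alpha))
                for pi, qi in zip(p, q))
    return total / (alpha * (1.0 - alpha))

def j_divergence(p, q):
    """J-divergence (40): sum_i (p_i - q_i) log(p_i / q_i)."""
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical positive measures (not normalized).
p = [1.0, 2.0, 0.5]
q = [1.5, 1.0, 0.7]

sym_chi2 = sum((pi - qi) ** 2 * (pi + qi) / (pi * qi) for pi, qi in zip(p, q))
```

The closed form matches the sum of the two asymmetric divergences, gives `0.5 * sym_chi2` at α = 2 and α = −1, and approaches `j_divergence(p, q)` as α tends to 0.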

An alternative wide class of symmetric divergences can be described by the following symmetric Alpha-divergence (Type-2):

$D_{AS2}^{(α)}(P \| Q) = D_{A}^{(α)}\left(\frac{P + Q}{2} \,\Big\|\, P\right) + D_{A}^{(α)}\left(\frac{P + Q}{2} \,\Big\|\, Q\right) = \frac{1}{α (α - 1)} \int \left((p^{1 - α} + q^{1 - α}) \left(\frac{p + q}{2}\right)^{α} - (p + q)\right) dμ(x)$

The above measure admits the following prominent cases

Triangular Discrimination [30,38]

$D_{A S 2}^{(- 1)} (P | | Q) = \frac{1}{2} D_{T D} (P | | Q) = \frac{1}{2} \int \frac{{(p - q)}^{2}}{p + q} d μ (x)$

(41)
Symmetric Jensen–Shannon divergence [62,64]

$lim_{α \to 0} D_{A S 2}^{(α)} (P | | Q) = D_{J S} (P | | Q) = \int (p log (\frac{2 p}{p + q}) + q log (\frac{2 q}{p + q})) d μ (x)$

(42)

It is worth mentioning, that the Jensen–Shannon divergence is a symmetrized and smoothed variant of the Kullback–Leibler divergence, i.e., it can be interpreted as the average of the Kullback–Leibler divergences to the average distribution. For the normalized probability densities the Jensen–Shannon divergence is related to the Shannon entropy in the following sense:

$D_{J S} = H_{S} ((P + Q) / 2) - (H_{S} (P) + H_{S} (Q)) / 2$

(43)

where $H_{S} (P) = - \int p (x) log p (x) d μ (x)$
Arithmetic-Geometric divergence [39]

$lim_{α \to 1} D_{A S 2}^{(α)} (P | | Q) = \int (p + q) log (\frac{p + q}{2 \sqrt{p q}}) d μ (x)$

(44)
Symmetric Chi-square divergence [54]

$D_{A S 2}^{(2)} (P | | Q) = \frac{1}{8} D_{χ^{2}} (P | | Q) = \frac{1}{8} \int \frac{{(p - q)}^{2} (p + q)}{p q} d μ (x)$

(45)
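The closed-form expression for $D_{AS2}^{(α)}$ and the cases (41), (42), (44) and (45) can be compared numerically. The sketch below implements only the closed form (the integrand with $(p^{1 - α} + q^{1 - α}) ((p + q)/2)^{α}$) for discrete positive measures; names and test data are assumptions for illustration.

```python
import math

def symmetric_alpha_type2(p, q, alpha):
    """Closed form of the Type-2 symmetric Alpha-divergence; alpha not in {0, 1}."""
    c = 1.0 / (alpha * (alpha - 1.0))
    return c * sum((pi ** (1.0 - alpha) + qi ** (1.0 - alpha)) * ((pi + qi) / 2.0) ** alpha
                   - (pi + qi) for pi, qi in zip(p, q))

def triangular_discrimination(p, q):
    return sum((pi - qi) ** 2 / (pi + qi) for pi, qi in zip(p, q))

def jensen_shannon(p, q):
    """Unscaled Jensen-Shannon form of (42)."""
    return sum(pi * math.log(2.0 * pi / (pi + qi)) + qi * math.log(2.0 * qi / (pi + qi))
               for pi, qi in zip(p, q))

def arithmetic_geometric(p, q):
    """Arithmetic-Geometric divergence (44)."""
    return sum((pi + qi) * math.log((pi + qi) / (2.0 * math.sqrt(pi * qi)))
               for pi, qi in zip(p, q))

# Hypothetical positive measures (not normalized).
p = [1.0, 2.0, 0.5]
q = [1.5, 1.0, 0.7]
sym_chi2 = sum((pi - qi) ** 2 * (pi + qi) / (pi * qi) for pi, qi in zip(p, q))
```

At α = −1 and α = 2 the closed form reproduces `0.5 * triangular_discrimination(p, q)` and `sym_chi2 / 8`, and near α = 0 and α = 1 it approaches the Jensen–Shannon and Arithmetic-Geometric expressions, respectively.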

The above Alpha-divergence is symmetric in its arguments $P$ and $Q$, and it is well-defined even if $P$ and $Q$ are not absolutely continuous with respect to each other. For example, in the discrete case $D_{AS2}^{(α)}(P \| Q)$ is well-defined even if, for some indices, $p_{i}$ vanishes while $q_{i}$ does not, or $q_{i}$ vanishes while $p_{i}$ does not [54]. It is also lower- and upper-bounded; for example, the Jensen–Shannon divergence is bounded between 0 and 2 [36].

3. Family of Beta-Divergences

The basic Beta-divergence was introduced by Basu et al. [67] and Minami and Eguchi [15], and many researchers have investigated its applications, including [8,13,30,31,32,33,34,37,68,69,70,71,72] and references therein. The main motivation for investigating the Beta-divergence, at least from a practical point of view, is to develop learning algorithms for clustering, feature extraction, classification and blind source separation that are highly robust with respect to outliers. Until now, the Beta-divergence has been successfully applied to robust PCA (Principal Component Analysis) and clustering [71], robust ICA (Independent Component Analysis) [15,68,69], and robust NMF/NTF [30,70,73,74,75,76].

First, let us define the basic asymmetric Beta-divergence between two unnormalized density functions by

D_{B}^{(β)} (P | | Q) = \int (p (x) \frac{p^{β - 1} (x) - q^{β - 1} (x)}{β - 1} - \frac{p^{β} (x) - q^{β} (x)}{β}) d μ (x), β \in R ∖ {0, 1}

(46)

where β is a real number and, for $β = 0$ and $β = 1$, the divergence is defined by continuity (see below for more explanation).

For discrete probability measures with mass functions $P = [p_{1}, p_{2}, \dots, p_{n}]$ and $Q = [q_{1}, q_{2}, \dots, q_{n}]$, the discrete Beta-divergence is defined as

D_{B}^{(β)} (P | | Q) = \sum_{i = 1}^{n} d_{B}^{(β)} (p_{i} | | q_{i}) = \sum_{i = 1}^{n} (p_{i} \frac{p_{i}^{β - 1} - q_{i}^{β - 1}}{β - 1} - \frac{p_{i}^{β} - q_{i}^{β}}{β}) β \in R ∖ {0, 1}

(47)

The Beta-divergence can be expressed via a generalized KL divergence as follows

D_{B}^{(β)} (P | | Q) = D_{G K L}^{(β^{- 1})} (P^{β} | | Q^{β}) = - \frac{1}{β} \int (p^{β} {log}_{(\frac{1}{β})} (\frac{q^{β}}{p^{β}}) + p^{β} - q^{β}) d μ (x) β \in R ∖ {0, 1}

(48)

The above representation of the Beta-divergence indicates why it is robust to outliers for some values of the tuning parameter β and why, therefore, it is often better suited than other divergences to some specific applications. For example, in sound processing, the speech power spectra can be modeled by exponential family densities for which, at $β = 0$, the Beta-divergence is none other than the Itakura–Saito distance (also called the Itakura–Saito divergence, the Itakura–Saito distortion measure or the Burg cross entropy) [12,13,30,76,77,78,79]. In fact, the Beta-divergence has to be defined in the limiting case $β \to 0$ as the Itakura–Saito distance:

D_{I S} (P | | Q) = lim_{β \to 0} D_{B}^{(β)} (P | | Q) = \int (log \frac{q}{p} + \frac{p}{q} - 1) d μ (x)

(49)

The Itakura–Saito distance was derived from the maximum likelihood (ML) estimation of speech spectra [77]. It was used as a measure of the distortion or goodness of fit between two spectra and is often used as a standard measure in the speech processing community due to the good perceptual properties of the reconstructed signals, since it is scale invariant. Due to scale invariance, low energy components of p have the same relative importance as high energy ones. This is especially important in scenarios in which the coefficients of p have a large dynamic range, such as in short-term audio spectra [30,76,79].

It is also interesting to note that, for $β = 2$, we obtain the standard squared Euclidean ($L_{2}$-norm) distance, while for the singular case $β = 1$, we obtain the KL I-divergence:

D_{K L} (P | | Q) = lim_{β \to 1} D_{B}^{(β)} (P | | Q) = \int (p log \frac{p}{q} - p + q) d μ (x)

(50)

Note that we used here the following formulas: ${lim}_{β \to 0} \frac{p^{β} - q^{β}}{β} = log (p / q)$ and ${lim}_{β \to 0} \frac{p^{β} - 1}{β} = log p$.

Hence, the Beta-divergence can be represented in a more explicit form:

D_{B}^{(β)} (P | | Q) = \{\begin{matrix} \frac{1}{β (β - 1)} \int (p^{β} (x) + (β - 1) q^{β} (x) - β p (x) q^{β - 1} (x)) d μ (x), β \neq 0, 1 \\ \int (p (x) log (\frac{p (x)}{q (x)}) - p (x) + q (x)) d μ (x), β = 1 \\ \int (log (\frac{q (x)}{p (x)}) + \frac{p (x)}{q (x)} - 1) d μ (x), β = 0 \end{matrix}

(51)
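As an illustrative sketch (not part of the original derivation), the explicit form (51) can be evaluated for discrete positive sequences in a few lines of Python. The function name `beta_divergence` and the test vectors are assumptions made for this example only. Under the normalization of (51), $β = 2$ gives one half of the squared Euclidean distance, and values of β close to 0 or 1 approach the Itakura–Saito and KL I-divergence cases, respectively.

```python
import math

def beta_divergence(p, q, beta):
    """Discrete Beta-divergence of Eq. (51) for positive sequences p and q."""
    total = 0.0
    for pi, qi in zip(p, q):
        if beta == 0:      # Itakura-Saito distance
            total += math.log(qi / pi) + pi / qi - 1.0
        elif beta == 1:    # KL I-divergence
            total += pi * math.log(pi / qi) - pi + qi
        else:              # generic case, beta not in {0, 1}
            total += (pi**beta + (beta - 1) * qi**beta
                      - beta * pi * qi**(beta - 1)) / (beta * (beta - 1))
    return total

# Hypothetical example data (any positive values would do).
p = [0.2, 0.5, 0.3]
q = [0.3, 0.3, 0.4]

half_sq_euclid = 0.5 * sum((a - b) ** 2 for a, b in zip(p, q))
print("beta = 2      :", beta_divergence(p, q, 2), "vs", half_sq_euclid)
print("beta -> 0     :", beta_divergence(p, q, 1e-6), "vs", beta_divergence(p, q, 0))
print("beta -> 1     :", beta_divergence(p, q, 1 + 1e-6), "vs", beta_divergence(p, q, 1))
```

The two limiting branches are coded explicitly because the generic expression divides by $β (β - 1)$ and cannot be evaluated at $β = 0$ or $β = 1$.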

We have shown that the basic Beta-divergence smoothly connects the Itakura–Saito distance and the squared Euclidean $L_{2}$-norm distance and passes through the KL I-divergence $D_{K L} (P | | Q)$. Such a parameterized connection is impossible in the family of the Alpha-divergences.

The choice of the tuning parameter β depends on the statistical distribution of the data sets. For example, the optimal choice of the parameter β for the normal distribution is $β = 2$, for the gamma distribution it is $β = 0$, for the Poisson distribution $β = 1$, and for the compound Poisson distribution $β \in (1, 2)$ [15,31,32,33,34,68,69].

It is important to note that the Beta-divergence can be derived from the Bregman divergence. The Bregman divergence is a pseudo-distance for measuring discrepancy between two values of density functions $p (x)$ and $q (x)$ [9,16,80]:

d_{Φ} (p | | q) = Φ (p) - Φ (q) - (p - q) Φ^{'} (q)

(52)

where $Φ (t)$ is a strictly convex real-valued function and $Φ^{'} (q)$ is its derivative with respect to q. The total discrepancy between two functions $p (x)$ and $q (x)$ is given by

D_{Φ} (P | | Q) = \int [Φ (p (x)) - Φ (q (x)) - (p (x) - q (x)) Φ^{'} (q (x))] d μ (x)

(53)

and it corresponds to the Φ-entropy of a continuous probability measure $p (x) \geq 0$ defined by

H_{Φ} (P) = - \int Φ (p (x)) d μ (x)

(54)

Remark: The concepts of divergence and entropy are closely related. Let $Q_{0}$ be a uniform distribution for which

$Q_{0} (x) = const$

(55)

(When $x$ ranges over an infinite space $R^{n}$, this might not be a probability distribution but is a measure.) Then,

$H (P) = - D (P | | Q_{0}) + const$

(56)

is regarded as the related entropy. This is the negative of the divergence from $P$ to the uniform distribution. On the other hand, given a concave entropy $H (P)$, we can define the related divergence as the Bregman divergence derived from the convex function $Φ (P) = - H (P)$.

If $x$ takes discrete values on a certain space, the separable Bregman divergence is defined as $D_{Φ} (P | | Q) = \sum_{i = 1}^{n} d_{Φ} (p_{i} | | q_{i}) = \sum_{i = 1}^{n} [Φ (p_{i}) - Φ (q_{i}) - (p_{i} - q_{i}) Φ^{'} (q_{i})]$, where $Φ^{'} (q)$ denotes the derivative with respect to q. In the general (nonseparable) case for two vectors $P$ and $Q$, the Bregman divergence is defined as $D_{Φ} (P | | Q) = Φ (P) - Φ (Q) - {(P - Q)}^{T} \nabla Φ (Q)$, where $\nabla Φ (Q)$ is the gradient of Φ evaluated at $Q$.

Note that $D_{Φ} (P | | Q)$ equals the tail of the first-order Taylor expansion of $Φ (P)$ at $Q$. Bregman divergences include many prominent dissimilarity measures like the squared Euclidean distance, the Mahalanobis distance, the generalized Kullback–Leibler divergence and the Itakura–Saito distance.

It is easy to check that the Beta-divergence can be generated from the Bregman divergence using the following strictly convex continuous function [36,81]

Φ (t) = \{\begin{matrix} \frac{1}{β (β - 1)} (t^{β} - β t + β - 1), β \neq 0, 1 \\ t log (t) - t + 1, β = 1 \\ t - log (t) - 1, β = 0 \end{matrix}

(57)
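The claim that the generating function (57) reproduces the Beta-divergence can be checked numerically. The following is a minimal sketch for the discrete separable case; the function names (`phi`, `dphi`, `bregman`, `beta_direct`) and the sample vectors are hypothetical choices for this illustration, and the derivative of Φ was worked out by hand from (57).

```python
import math

def phi(t, beta):
    """Generating function of Eq. (57)."""
    if beta == 0:
        return t - math.log(t) - 1.0
    if beta == 1:
        return t * math.log(t) - t + 1.0
    return (t**beta - beta * t + beta - 1.0) / (beta * (beta - 1.0))

def dphi(t, beta):
    """Derivative of phi with respect to t."""
    if beta == 0:
        return 1.0 - 1.0 / t
    if beta == 1:
        return math.log(t)
    return (t**(beta - 1.0) - 1.0) / (beta - 1.0)

def bregman(p, q, beta):
    """Separable Bregman divergence: sum of phi(p) - phi(q) - (p - q) phi'(q)."""
    return sum(phi(a, beta) - phi(b, beta) - (a - b) * dphi(b, beta)
               for a, b in zip(p, q))

def beta_direct(p, q, beta):
    """Generic case of Eq. (51), valid for beta not in {0, 1}."""
    return sum((a**beta + (beta - 1) * b**beta - beta * a * b**(beta - 1))
               / (beta * (beta - 1)) for a, b in zip(p, q))

# Hypothetical example data.
p = [0.2, 0.5, 0.3]
q = [0.3, 0.3, 0.4]
for beta in (-1.0, 0.5, 2.0, 3.0):
    print(beta, bregman(p, q, beta), beta_direct(p, q, beta))
```

The linear and constant terms of Φ cancel inside the Bregman construction, which is why they do not appear in (51).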

It is also interesting to note that the same generating function $f (u) = Φ {(t)}_{| t = u}$ (with $α = β$) can be used to generate the Alpha-divergence using the Csiszár–Morimoto f-divergence $D_{f} (P | | Q) = \int q f (p / q) d μ (x)$.

Furthermore, the Beta-divergence can be generated by a generalized f-divergence:

{\tilde{D}}_{f} (P | | Q) = \int q^{β} f (p / q) d μ (x) = \int q^{β} \tilde{f} (\frac{p^{β}}{q^{β}}) d μ (x)

(58)

where $f (u) = Φ {(t)}_{| t = u}$ with $u = p / q$ and

\tilde{f} (\tilde{u}) = \frac{1}{1 - β} ({\tilde{u}}^{\frac{1}{β}} - \frac{1}{β} \tilde{u} + \frac{1}{β} - 1)

(59)

is a convex generating function with $\tilde{u} = p^{β} / q^{β}$.

The links between the Bregman and Beta-divergences are important, since many well-known fundamental properties of the Bregman divergence are also valid for the Beta-divergence [28,82]:

Convexity: The Bregman divergence $D_{Φ} (P | | Q)$ is always convex in the first argument $P$ , but is often not in the second argument $Q$ .
Nonnegativity: The Bregman divergence is nonnegative, $D_{Φ} (P | | Q) \geq 0$ , with equality if and only if $P = Q$ .
Linearity: Any positive linear combination of Bregman divergences is also a Bregman divergence, i.e.,

$D_{c_{1} Φ_{1} + c_{2} Φ_{2}} (P | | Q) = c_{1} D_{Φ_{1}} (P | | Q) + c_{2} D_{Φ_{2}} (P | | Q)$

(60)

where $c_{1}, c_{2}$ are positive constants and $Φ_{1}, Φ_{2}$ are strictly convex functions.
Invariance: For positive measures $P$ and $Q$ , the functional Bregman divergence is invariant under affine transforms $Γ (Q) = Φ (Q) + \int a (x - x^{'}) q (x^{'}) d x^{'} + c$ , that is, under the addition of linear and arbitrary constant terms [28,82], i.e.,

$D_{Γ} (P | | Q) = D_{Φ} (P | | Q)$

(61)
The three-point property generalizes the “Law of Cosines”:

$D_{Φ} (P | | Q) = D_{Φ} (P | | Z) + D_{Φ} (Z | | Q) - {(P - Z)}^{T} (\frac{δ}{δ Q} Φ (Q) - \frac{δ}{δ Z} Φ (Z))$

(62)
Generalized Pythagoras Theorem:

$D_{Φ} (P | | Q) \geq D_{Φ} (P | | P_{Ω} (Q)) + D_{Φ} (P_{Ω} (Q) | | Q)$

(63)

where $P_{Ω} (Q) = \underset{ω \in Ω}{arg min} D_{Φ} (ω | | Q)$ is the Bregman projection onto the convex set Ω and $P \in Ω$ . When Ω is an affine set then it holds with equality. This is proved to be the generalized Pythagorean relation in terms of information geometry.

For the Beta-divergence (46), the first- and second-order Fréchet derivatives with respect to $Q$ are given by [28,76]

\frac{δ D_{B}^{(β)}}{δ q} = q^{β - 2} (q - p), \frac{δ^{2} D_{B}^{(β)}}{δ q^{2}} = q^{β - 3} ((β - 1) q - (β - 2) p)

(64)

Hence, we conclude that the Beta-divergence has a single global minimum equal to zero for $P = Q$ and increases with $| p - q |$. Moreover, the Beta-divergence is strictly convex for $q (x) > 0$ only for $β \in [1, 2]$. For $β = 0$ (Itakura–Saito distance), it is convex if $\int q^{- 3} (2 p - q) d μ (x) \geq 0$, i.e., if $q / p \leq 2$ [78].
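The derivatives in (64) can be sanity-checked pointwise with central finite differences. The sketch below is only an illustration under assumed sample values; `d_beta`, `grad_q`, `hess_q` and `central_diff` are hypothetical helper names, and the pointwise integrand of (51) is used in place of the full integral.

```python
def d_beta(p, q, beta):
    """Pointwise Beta-divergence integrand, generic case of Eq. (51)."""
    return (p**beta + (beta - 1) * q**beta
            - beta * p * q**(beta - 1)) / (beta * (beta - 1))

def grad_q(p, q, beta):
    """First derivative with respect to q, as stated in Eq. (64)."""
    return q**(beta - 2) * (q - p)

def hess_q(p, q, beta):
    """Second derivative with respect to q, as stated in Eq. (64)."""
    return q**(beta - 3) * ((beta - 1) * q - (beta - 2) * p)

def central_diff(f, x, h=1e-5):
    """Second-order accurate numerical derivative of f at x."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Hypothetical sample point and parameter.
p0, q0, beta0 = 0.7, 0.4, 1.5
num_grad = central_diff(lambda x: d_beta(p0, x, beta0), q0)
num_hess = central_diff(lambda x: grad_q(p0, x, beta0), q0)
print(num_grad, grad_q(p0, q0, beta0))
print(num_hess, hess_q(p0, q0, beta0))

# Outside [1, 2] the second derivative can change sign, e.g. for beta = 3:
print(hess_q(1.0, 0.4, 3.0))
```

For $β \in [1, 2]$ both terms of the second derivative are nonnegative whenever $p, q > 0$, which is the pointwise reason behind the convexity statement above.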

3.1. Generation of Family of Beta-divergences Directly from Family of Alpha-Divergences

It should be noted that the original works [15,67,68,69] considered the Beta-divergence function only for $β \geq 1$. Moreover, they did not consider the whole range of non-positive values of the parameter β, especially $β = 0$, for which we have the important Itakura–Saito distance. Furthermore, similarly to the Alpha-divergences, there exists an associated family of Beta-divergences and, as special cases, a family of generalized Itakura–Saito like distances. A fundamental question arises: how can a whole family of Beta-divergences be generated, or what are the relationships or correspondences between the Alpha- and Beta-divergences? In fact, on the basis of our considerations above, it is easy to find that the complete set of Beta-divergences can be obtained from the Alpha-divergences and, conversely, the Alpha-divergences can be obtained directly from the Beta-divergences.

In order to obtain a Beta-divergence from the corresponding (associated) Alpha-divergence, we need to apply the following nonlinear transformations:

p \to p^{β}, q \to q^{β} and α = β^{- 1}

(65)

For example, using these transformations (substitutions) for the basic asymmetric Alpha-divergence (4) and assuming that $α = β^{- 1}$, we obtain the following divergence

D_{A}^{(β)} (P | | Q) = β^{2} \int (\frac{- β p q^{β - 1} + p^{β} + (β - 1) q^{β}}{β (β - 1)}) d μ (x)

(66)

Observe that, by ignoring the scaling factor $β^{2}$, we obtain the basic asymmetric Beta-divergence defined by Equation (46).
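The substitution (65) and the resulting factor $β^{2}$ in (66) can be verified numerically for discrete positive sequences. This is a hedged sketch: the function names and data are hypothetical, and the Alpha-divergence is taken in its generic form $\frac{1}{α (α - 1)} \sum_{i} (p_{i}^{α} q_{i}^{1 - α} - α p_{i} + (α - 1) q_{i})$, as listed in Table 5.

```python
def alpha_divergence(p, q, alpha):
    """Basic asymmetric Alpha-divergence, generic case (alpha not in {0, 1})."""
    return sum(a**alpha * b**(1 - alpha) - alpha * a + (alpha - 1) * b
               for a, b in zip(p, q)) / (alpha * (alpha - 1))

def beta_divergence(p, q, beta):
    """Basic asymmetric Beta-divergence, generic case (beta not in {0, 1})."""
    return sum(a**beta + (beta - 1) * b**beta - beta * a * b**(beta - 1)
               for a, b in zip(p, q)) / (beta * (beta - 1))

def alpha_after_substitution(p, q, beta):
    """Apply p -> p^beta, q -> q^beta, alpha = 1/beta to the Alpha-divergence."""
    p_beta = [a**beta for a in p]
    q_beta = [b**beta for b in q]
    return alpha_divergence(p_beta, q_beta, 1.0 / beta)

# Hypothetical example data.
p = [0.2, 0.5, 0.3]
q = [0.3, 0.3, 0.4]
for beta in (-1.0, 0.5, 2.0, 3.0):
    lhs = alpha_after_substitution(p, q, beta)
    rhs = beta**2 * beta_divergence(p, q, beta)
    print(beta, lhs, rhs)
```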

In fact, there exists the same link between the whole family of Alpha-divergences and the family of Beta-divergences (see Table 2).

For example, we can derive a symmetric Beta-divergence from the symmetric Alpha-divergence (Type-1) (38):

\begin{matrix} D_{B S 1}^{(β)} (P | | Q) & = & D_{B}^{(β)} (P | | Q) + D_{B}^{(β)} (Q | | P) \\ & = & \frac{1}{β - 1} \int ((p - q) (p^{β - 1} - q^{β - 1})) d μ (x) \end{matrix}

It is interesting to note that, in special cases, we obtain:

Symmetric KL or J-divergence [65]:

D_{B S 1}^{(1)} = lim_{β \to 1} D_{B S 1}^{(β)} = \int (p - q) log (\frac{p}{q}) d μ (x)

(67)

and symmetric Chi-square divergence [54]

D_{B S 1}^{(0)} = lim_{β \to 0} D_{B S 1}^{(β)} = \int \frac{{(p - q)}^{2}}{p q} d μ (x)

(68)

Analogously, from the symmetric Alpha-divergence (Type-2), we obtain

D_{B S 2}^{(β)} (P | | Q) = \frac{1}{β - 1} \int (p^{β} + q^{β} - (p^{β - 1} + q^{β - 1}) {(\frac{p^{β} + q^{β}}{2})}^{\frac{1}{β}}) d μ (x)

(69)

**Table 2.** Family of Alpha-divergences and corresponding Beta-divergences. We applied the following transformations: $p \to p^{β}, q \to q^{β}, α = 1 / β$ . Note that $D_{A}^{(1)} (P | | Q) = D_{B}^{(1)} (P | | Q)$ and that, for $α = β = 1$ , they represent the extended family of KL divergences. Furthermore, the Beta-divergences for $β = 0$ describe the family of generalized (extended) Itakura–Saito like distances.
Alpha-divergence $D_{A}^{(α)} (P \| \| Q)$	Beta-divergence $D_{B}^{(β)} (P \| \| Q)$
$\{\begin{matrix} \frac{1}{α (α - 1)} \int (p^{α} q^{1 - α} - α p + (α - 1) q) d μ (x), \\ \int (q log (\frac{q}{p}) + p - q) d μ (x), α = 0 \\ \int (p log (\frac{p}{q}) - p + q) d μ (x), α = 1 \end{matrix}$	$\{\begin{matrix} \frac{1}{β (β - 1)} \int (p^{β} + (β - 1) q^{β} - β p q^{β - 1}) d μ (x), \\ \int (log (\frac{q}{p}) + \frac{p}{q} - 1) d μ (x), β = 0 \\ \int (p log (\frac{p}{q}) - p + q) d μ (x), β = 1 \end{matrix}$
$\{\begin{matrix} \frac{\int ({(\frac{p + q}{2})}^{α} q^{1 - α} - \frac{α}{2} p + (\frac{α}{2} - 1) q) d μ (x)}{α (α - 1)}, \\ \int (q log (\frac{2 q}{p + q}) + \frac{p - q}{2}) d μ (x), α = 0 \\ \int (\frac{p + q}{2} log (\frac{p + q}{2 q}) - \frac{p - q}{2}) d μ (x), α = 1 \end{matrix}$	$\{\begin{matrix} \frac{\int (p^{β} + (2 β - 1) q^{β} - 2 β q^{β - 1} {(\frac{p^{β} + q^{β}}{2})}^{\frac{1}{β}}) d μ (x)}{β (β - 1)}, \\ \int (log (\frac{q}{p}) + 2 (\sqrt{\frac{p}{q}} - 1)) d μ (x), β = 0 \\ \int (\frac{p + q}{2} log (\frac{p + q}{2 q}) - \frac{p - q}{2}) d μ (x), β = 1 \end{matrix}$
$\{\begin{matrix} \frac{1}{α - 1} \int (p - q) [{(\frac{p + q}{2 q})}^{α - 1} - 1] d μ (x), \\ \int \frac{{(p - q)}^{2}}{p + q} d μ (x), α = 0 \\ \int (p - q) log (\frac{p + q}{2 q}) d μ (x), α = 1 \end{matrix}$	$\{\begin{matrix} \frac{1}{β (β - 1)} \int (p^{β} - q^{β}) (1 - \frac{2^{\frac{β - 1}{β}} q^{β - 1}}{{(p^{β} + q^{β})}^{\frac{β - 1}{β}}}) d μ (x), \\ \int (\sqrt{\frac{p}{q}} - 1) log (\frac{p}{q}) d μ (x), β = 0 \\ \int (p - q) log (\frac{p + q}{2 q}) d μ (x), β = 1 \end{matrix}$
$\{\begin{matrix} \frac{1}{α (α - 1)} \int (p^{α} q^{1 - α} + p^{1 - α} q^{α} - p - q) d μ (x), \\ \int (p - q) log (\frac{p}{q}) d μ (x), α = 0, 1 \end{matrix}$	$\{\begin{matrix} \frac{1}{β - 1} \int (p^{β} + q^{β} - p q^{β - 1} - p^{β - 1} q) d μ (x), \\ \int \frac{{(p - q)}^{2}}{p q} d μ (x), β = 0 \\ \int (p - q) log (\frac{p}{q}) d μ (x), β = 1 \end{matrix}$
$\{\begin{matrix} \frac{1}{1 - α} \int (\frac{p + q}{2} - {(\frac{p^{α} + q^{α}}{2})}^{\frac{1}{α}}) d μ (x), \\ \int (\frac{\sqrt{p} - \sqrt{q})^{2}}{2}) d μ (x), α = 0 \\ H_{S} (\frac{P + Q}{2}) - \frac{H_{S} (P) + H_{S} (Q)}{2}, α = 1 \end{matrix}$	$\{\begin{matrix} \frac{1}{β (β - 1)} \int (\frac{p^{β} + q^{β}}{2} - {(\frac{p + q}{2})}^{β}) d μ (x), \\ \int log (\frac{p + q}{2 \sqrt{p q}}) d μ (x), β = 0 \\ H_{S} (\frac{P + Q}{2}) - \frac{H_{S} (P) + H_{S} (Q)}{2}, β = 1 \end{matrix}$
$\{\begin{matrix} \frac{\int ((p^{1 - α} + q^{1 - α}) {(\frac{p + q}{2})}^{α} - p - q) d μ (x)}{α (α - 1)}, \\ \int (p log (\frac{2 p}{p + q}) + q log (\frac{2 q}{p + q})) d μ (x), α = 0 \\ \int (p + q) log (\frac{p + q}{2 \sqrt{p q}}) d μ (x), α = 1 \end{matrix}$	$\{\begin{matrix} \frac{1}{β - 1} \int (p^{β} + q^{β} - (p^{β - 1} + q^{β - 1}) {(\frac{p^{β} + q^{β}}{2})}^{\frac{1}{β}}) d μ (x), \\ \int (\sqrt{\frac{p}{q}} + \sqrt{\frac{q}{p}} - 2) d μ (x), β = 0 \\ \int (p + q) log (\frac{p + q}{2 \sqrt{p q}}) d μ (x), β = 1 \end{matrix}$

It should be noted that in special cases, we obtain:

The Arithmetic-Geometric divergence [38,39]:

D_{B S 2}^{(1)} = lim_{β \to 1} D_{B S 2}^{(β)} = \int (p + q) log (\frac{p + q}{2 \sqrt{p q}}) d μ (x),

(70)

and a symmetrized Itakura–Saito distance (called also the COSH distance) [12,13]:

D_{B S 2}^{(0)} = lim_{β \to 0} D_{B S 2}^{(β)} = \int (\sqrt{\frac{p}{q}} + \sqrt{\frac{q}{p}} - 2) d μ (x) = \int \frac{{(\sqrt{p} - \sqrt{q})}^{2}}{\sqrt{p q}} d μ (x)

(71)

4. Family of Gamma-Divergences Generated from Beta- and Alpha-Divergences

A basic asymmetric Gamma-divergence has been proposed very recently by Fujisawa and Eguchi [35] as a very robust similarity measure with respect to outliers:

D_{G}^{(γ)} (P | | Q) = \frac{1}{γ (γ - 1)} log (\frac{(\int p^{γ} (x) d μ (x)) {(\int q^{γ} (x) d μ (x))}^{γ - 1}}{{(\int p (x) q^{γ - 1} (x) d μ (x))}^{γ}})

(72)

The Gamma-divergence employs the nonlinear transformation (log) for cumulative patterns, and the terms $p, q$ are not separable. The main motivation for employing the Gamma-divergence is that it allows “super” robust estimation of some parameters in the presence of outliers. In fact, the authors demonstrated that the bias caused by outliers can become sufficiently small even in the case of very heavy contamination and that some contamination can be naturally and automatically neglected [35,37].

In this paper, we show that we can formulate the whole family of Gamma-divergences generated directly from Alpha- and also Beta-divergences. In order to obtain a robust Gamma-divergence from an Alpha- or Beta-divergence, we use the following transformations (see also Table 3):

c_{0} \int p^{c_{1}} (x) q^{c_{2}} (x) d μ (x) \to log {(\int p^{c_{1}} (x) q^{c_{2}} (x) d μ (x))}^{c_{0}}

(73)

where $c_{0}, c_{1}$ and $c_{2}$ are real constants and $γ = α$.

Applying the above transformation to all monomials of the basic Alpha-divergence (4), we obtain a new divergence, referred to here as the Alpha-Gamma-divergence:

\begin{matrix} D_{A G}^{(γ)} (P | | Q) & = & log {(\int p^{γ} (x) q^{1 - γ} (x) d μ (x))}^{\frac{1}{γ (γ - 1)}} - log {(\int p (x) d μ (x))}^{\frac{1}{γ - 1}} + log {(\int q (x) d μ (x))}^{\frac{1}{γ}} \\ & = & \frac{1}{γ (γ - 1)} log (\frac{\int p^{γ} (x) q^{1 - γ} (x) d μ (x)}{{(\int p (x) d μ (x))}^{γ} {(\int q (x) d μ (x))}^{1 - γ}}) \end{matrix}

The asymmetric Alpha-Gamma-divergence has the following important properties:

$D_{A G}^{(γ)} (P | | Q) \geq 0$ . The equality holds if and only if $P = c Q$ for a positive constant c.
It is scale invariant for any value of γ, that is, $D_{A G}^{(γ)} (P | | Q) = D_{A G}^{(γ)} (c_{1} P | | c_{2} Q)$ , for arbitrary positive scaling constants $c_{1}, c_{2}$ .
The Alpha-Gamma divergence is equivalent to the normalized Alpha-Rényi divergence (25), i.e.,

$\begin{matrix} D_{A G}^{(γ)} (P | | Q) & = & \frac{1}{γ (γ - 1)} log (\frac{\int p^{γ} (x) q^{1 - γ} (x) d μ (x)}{{(\int p (x) d μ (x))}^{γ} {(\int q (x) d μ (x))}^{1 - γ}}) \\ = & \frac{1}{γ (γ - 1)} log (\int {\bar{p}}^{γ} (x) {\bar{q}}^{1 - γ} (x) d μ (x)) \\ = & \frac{1}{γ - 1} log {(\int \bar{q} (x) {(\frac{\bar{p} (x)}{\bar{q} (x)})}^{γ} d μ (x))}^{\frac{1}{γ}} = D_{A R}^{(γ)} (\bar{P} | | \bar{Q}) \end{matrix}$

for $α = γ$ and normalized densities $\bar{p} (x) = p (x) / \int p (x) d μ (x)$ and $\bar{q} (x) = q (x) / \int q (x) d μ (x)$ .
It can be expressed via generalized weighted mean:

$D_{A G}^{(γ)} (P | | Q) = \frac{1}{γ - 1} log ({\bar{M}}_{γ} \{\bar{q}; \frac{\bar{p}}{\bar{q}}\})$

(74)

where the weighted mean is defined as ${\bar{M}}_{γ} \{\bar{q}; \frac{\bar{p}}{\bar{q}}\} = {(\int \bar{q} (x) {(\frac{\bar{p} (x)}{\bar{q} (x)})}^{γ} d μ (x))}^{\frac{1}{γ}}$ .
As $γ \to 1$ , the Alpha-Gamma-divergence becomes the Kullback–Leibler divergence:

$lim_{γ \to 1} D_{A G}^{(γ)} (P | | Q) = D_{K L} (\bar{P} | | \bar{Q}) = \int \bar{p} (x) log \frac{\bar{p} (x)}{\bar{q} (x)} d μ (x)$

(75)
For $γ \to 0$ , the Alpha-Gamma-divergence can be expressed by the reverse Kullback–Leibler divergence:

$lim_{γ \to 0} D_{A G}^{(γ)} (P | | Q) = D_{K L} (\bar{Q} | | \bar{P}) = \int \bar{q} (x) log \frac{\bar{q} (x)}{\bar{p} (x)} d μ (x)$

(76)
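The nonnegativity, scale invariance and normalized (Rényi-type) form listed above can be illustrated for discrete positive sequences. The sketch below is an assumption-laden example rather than a reference implementation: `alpha_gamma`, `normalized` and `renyi_type_form` are hypothetical names, and only the generic case $γ \neq 0, 1$ is covered.

```python
import math

def alpha_gamma(p, q, gamma):
    """Discrete Alpha-Gamma-divergence, generic case (gamma not in {0, 1})."""
    cross = sum(a**gamma * b**(1 - gamma) for a, b in zip(p, q))
    norm = sum(p)**gamma * sum(q)**(1 - gamma)
    return math.log(cross / norm) / (gamma * (gamma - 1))

def normalized(v):
    """Rescale a positive sequence so that it sums to one."""
    s = sum(v)
    return [x / s for x in v]

def renyi_type_form(p, q, gamma):
    """Same quantity written with the normalized densities p_bar and q_bar."""
    pb, qb = normalized(p), normalized(q)
    return math.log(sum(a**gamma * b**(1 - gamma)
                        for a, b in zip(pb, qb))) / (gamma * (gamma - 1))

# Hypothetical unnormalized example data.
p = [2.0, 5.0, 3.0]
q = [3.0, 3.0, 4.0]
for gamma in (-1.0, 0.5, 2.0):
    base = alpha_gamma(p, q, gamma)
    scaled = alpha_gamma([3.0 * a for a in p], [0.5 * b for b in q], gamma)
    print(gamma, base, scaled, renyi_type_form(p, q, gamma))
```

Independent rescaling of the two arguments leaves the value unchanged because each scaling constant cancels between the numerator and the normalizing denominator.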

In a similar way, we can generate the whole family of Alpha-Gamma-divergences from the family of Alpha-divergences, which are summarized in Table 3.

It is interesting to note that, using the above transformations (73) with $γ = β$, we can generate another family of Gamma-divergences, referred to as Beta-Gamma-divergences.

In particular, using the nonlinear transformations (73) for the basic asymmetric Beta-divergence (46), we obtain the Gamma-divergence (72) [35], referred to here as the Beta-Gamma-divergence ($D_{G}^{(γ)} (P | | Q) = D_{B G}^{(γ)} (P | | Q)$):

\begin{matrix} D_{B G}^{(γ)} (P | | Q) & = & \frac{1}{γ (γ - 1)} [log (\int p^{γ} d μ (x)) + (γ - 1) log (\int q^{γ} d μ (x)) - γ log (\int p q^{γ - 1} d μ (x))] \\ & = & log (\frac{{(\int p q^{γ - 1} d μ (x))}^{\frac{1}{1 - γ}}}{{(\int p^{γ} d μ (x))}^{\frac{1}{γ (1 - γ)}} {(\int q^{γ} d μ (x))}^{\frac{- 1}{γ}}}) \\ & = & \frac{1}{1 - γ} log (\int {\tilde{q}}^{γ} (x) (\frac{\tilde{p} (x)}{\tilde{q} (x)}) d μ (x)) \end{matrix}

where

\tilde{p} (x) = \frac{p (x)}{{(\int p^{γ} (x) d μ (x))}^{\frac{1}{γ}}}, \tilde{q} (x) = \frac{q (x)}{{(\int q^{γ} (x) d μ (x))}^{\frac{1}{γ}}}

(77)

Analogously, for discrete densities we can express the Beta-Gamma-divergence via generalized power means (also known as Hölder means) as follows

D_{B G}^{(γ)} (P | | Q) = - log \frac{{(\sum_{i = 1}^{n} p_{i} q_{i}^{γ - 1})}^{\frac{1}{γ - 1}}}{{(\sum_{i = 1}^{n} p_{i}^{γ})}^{\frac{1}{γ (γ - 1)}} {(\sum_{i = 1}^{n} q_{i}^{γ})}^{\frac{1}{γ}}}

(78)

Hence,

D_{B G}^{(γ)} (P | | Q) = - log \frac{{(\sum_{i = 1}^{n} \frac{p_{i}}{q_{i}} q_{i}^{γ})}^{\frac{1}{γ - 1}}}{{[{(\sum_{i = 1}^{n} p_{i}^{γ})}^{\frac{1}{γ}}]}^{\frac{1}{γ - 1}} {(\sum_{i = 1}^{n} q_{i}^{γ})}^{\frac{1}{γ}}} = - log {(\frac{\frac{1}{n} \sum_{i = 1}^{n} \frac{p_{i}}{q_{i}} q_{i}^{γ}}{{(\frac{1}{n} \sum_{i = 1}^{n} p_{i}^{γ})}^{\frac{1}{γ}} {[{(\frac{1}{n} \sum_{i = 1}^{n} q_{i}^{γ})}^{\frac{1}{γ}}]}^{γ - 1}})}^{\frac{1}{γ - 1}}

and finally

D_{B G}^{(γ)} (P | | Q) = \frac{1}{1 - γ} log \frac{\frac{1}{n} \sum_{i = 1}^{n} \frac{p_{i}}{q_{i}} q_{i}^{γ}}{M_{γ} \{p_{i}\} {[M_{γ} \{q_{i}\}]}^{γ - 1}}

(79)

where the (generalized) power mean of order γ is defined as

M_{γ} \{p_{i}\} = {(\frac{1}{n} \sum_{i = 1}^{n} p_{i}^{γ})}^{\frac{1}{γ}}

(80)

In the special cases, we obtain the standard harmonic mean ($γ = - 1$), geometric mean ($γ = 0$), arithmetic mean ($γ = 1$), and root mean square ($γ = 2$), with the following relations:

M_{- \infty} \{p_{i}\} \leq M_{- 1} \{p_{i}\} \leq M_{0} \{p_{i}\} \leq M_{1} \{p_{i}\} \leq M_{2} \{p_{i}\} \leq M_{\infty} \{p_{i}\},

(81)

with $M_{0} \{p_{i}\} = {lim}_{γ \to 0} M_{γ} \{p_{i}\} = {(\prod_{i = 1}^{n} p_{i})}^{1 / n}$, $M_{- \infty} \{p_{i}\} = {min}_{i} \{p_{i}\}$ and $M_{\infty} \{p_{i}\} = {max}_{i} \{p_{i}\}$.
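The power-mean representation (79) and the ordering (81) can be checked with a short sketch for discrete positive sequences. The names `power_mean`, `beta_gamma` and `beta_gamma_means` and the sample data are assumptions of this example; `beta_gamma` evaluates the logarithmic form of the Beta-Gamma-divergence directly, while `beta_gamma_means` follows (79).

```python
import math

def power_mean(v, gamma):
    """Generalized power mean M_gamma of Eq. (80); gamma = 0 is the geometric mean."""
    n = len(v)
    if gamma == 0:
        return math.exp(sum(math.log(x) for x in v) / n)
    return (sum(x**gamma for x in v) / n) ** (1.0 / gamma)

def beta_gamma(p, q, gamma):
    """Discrete Beta-Gamma-divergence in its logarithmic form (gamma not in {0, 1})."""
    sp = sum(a**gamma for a in p)
    sq = sum(b**gamma for b in q)
    spq = sum(a * b**(gamma - 1) for a, b in zip(p, q))
    return (math.log(sp) + (gamma - 1) * math.log(sq)
            - gamma * math.log(spq)) / (gamma * (gamma - 1))

def beta_gamma_means(p, q, gamma):
    """The same divergence written with power means, following Eq. (79)."""
    n = len(p)
    cross = sum(a * b**(gamma - 1) for a, b in zip(p, q)) / n
    denom = power_mean(p, gamma) * power_mean(q, gamma) ** (gamma - 1)
    return math.log(cross / denom) / (1 - gamma)

# Hypothetical example data.
p = [0.2, 0.5, 0.3]
q = [0.3, 0.3, 0.4]
means = [min(p)] + [power_mean(p, g) for g in (-1, 0, 1, 2)] + [max(p)]
print(means)  # expected to be nondecreasing, as in (81)
for gamma in (0.5, 2.0, 3.0):
    print(gamma, beta_gamma(p, q, gamma), beta_gamma_means(p, q, gamma))
```

The factors of $1 / n$ introduced by the means cancel exactly, so both forms return the same value.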

**Table 3.** Family of Alpha-divergences and corresponding robust Alpha-Gamma-divergences; $\bar{p} (x) = p (x) / (\int p (x) d μ (x)), \bar{q} (x) = q (x) / (\int q (x) d μ (x))$ . For $γ = 0, 1$ , we obtain generalized robust KL divergences. Note that Gamma-divergences are expressed compactly via generalized power means (see also Table 4 for more direct representations).
Alpha-divergence $D_{A}^{(α)} (P \| \| Q)$	Robust Alpha-Gamma-divergence $D_{A G}^{(γ)} (c P \| \| c Q)$
$\{\begin{matrix} \frac{1}{α (α - 1)} \int (p^{α} q^{1 - α} - α p + (α - 1) q) d μ (x), \\ \int (q log (\frac{q}{p}) + p - q) d μ (x), α = 0 \\ \int (p log (\frac{p}{q}) - p + q) d μ (x), α = 1 \end{matrix}$	$\{\begin{matrix} \frac{1}{γ - 1} log {(\int \bar{q} {(\frac{\bar{p}}{\bar{q}})}^{γ} d μ (x))}^{\frac{1}{γ}}, \\ \int \bar{q} log (\frac{\bar{q}}{\bar{p}}) d μ (x), γ = 0 \\ \int \bar{p} log (\frac{\bar{p}}{\bar{q}}) d μ (x), γ = 1 \end{matrix}$
$\{\begin{matrix} \frac{1}{α (α - 1)} \int (p^{1 - α} q^{α} + (α - 1) p - α q) d μ (x), \\ \int (p log (\frac{p}{q}) - p + q) d μ (x), α = 0 \\ \int (q log (\frac{q}{p}) + p - q) d μ (x), α = 1 \end{matrix}$	$\{\begin{matrix} \frac{1}{γ - 1} log {(\int \bar{p} {(\frac{\bar{q}}{\bar{p}})}^{γ} d μ (x))}^{\frac{1}{γ}}, \\ \int \bar{p} log (\frac{\bar{p}}{\bar{q}}) d μ (x), γ = 0 \\ \int \bar{q} log (\frac{\bar{q}}{\bar{p}}) d μ (x), γ = 1 \end{matrix}$
$\{\begin{matrix} \frac{\int ({(\frac{p + q}{2})}^{α} q^{1 - α} - \frac{α}{2} p - (1 - \frac{α}{2}) q) d μ (x)}{α (α - 1)}, \\ \int (q log (\frac{2 q}{p + q}) + \frac{p - q}{2}) d μ (x), α = 0 \\ \int (\frac{p + q}{2} log (\frac{p + q}{2 q}) - \frac{p - q}{2}) d μ (x), α = 1 \end{matrix}$	$\{\begin{matrix} \frac{1}{γ - 1} log [{(\frac{\int q d μ (x)}{\int p d μ (x)})}^{\frac{1}{2}} {(\int \bar{q} {(\frac{p + q}{2 q})}^{γ} d μ (x))}^{\frac{1}{γ}}] \\ \int ((\bar{q} log (\frac{2 q}{p + q})) d μ (x), γ = 0 \\ \int (\frac{\bar{p} + \bar{q}}{2} log (\frac{p + q}{2 q})) d μ (x), γ = 1 \end{matrix}$
$\{\begin{matrix} \frac{1}{α - 1} \int [{(\frac{p + q}{2 q})}^{α - 1} - 1] (p - q) d μ (x), \\ \int (p - q) log (\frac{p + q}{2 q}) d μ (x), α = 1 \end{matrix}$	$\{\begin{matrix} log {(\frac{\int (\bar{p} {(\frac{p + q}{2 q})}^{γ - 1}) d μ (x)}{\int (\bar{q} {(\frac{p + q}{2 q})}^{γ - 1}) d μ (x)})}^{\frac{1}{γ - 1}} \\ \int (\bar{p} - \bar{q}) log (\frac{p + q}{2 q}) d μ (x), γ = 1 \end{matrix}$
$\{\begin{matrix} \frac{1}{α (α - 1)} \int (p^{α} q^{1 - α} + p^{1 - α} q^{α} - p - q) d μ (x), \\ \int (p - q) log (\frac{p}{q}) d μ (x), α = 0, 1 \end{matrix}$	$\{\begin{matrix} \frac{1}{γ - 1} log {[\int \bar{q} {(\frac{p}{q})}^{γ} d μ (x) \int \bar{p} {(\frac{q}{p})}^{γ} d μ (x)]}^{\frac{1}{γ}} \\ \int (\bar{p} - \bar{q}) log (\frac{p}{q}) d μ (x), γ = 0, 1 \end{matrix}$
$\{\begin{matrix} \frac{\int ((p^{1 - α} + q^{1 - α}) {(\frac{p + q}{2})}^{α} - p - q) d μ (x)}{α (α - 1)}, \\ \int (p log (\frac{2 p}{p + q}) + q log (\frac{2 q}{p + q})) d μ (x), α = 0 \\ \int (p + q) log (\frac{p + q}{2 \sqrt{p q}}) d μ (x), α = 1 \end{matrix}$	$\{\begin{matrix} \frac{1}{γ - 1} log {[\int \bar{p} {(\frac{p + q}{2 p})}^{γ} d μ (x) \int \bar{q} {(\frac{p + q}{2 q})}^{γ} d μ (x)]}^{\frac{1}{γ}} \\ \int (\bar{p} log (\frac{2 p}{p + q}) + \bar{q} log (\frac{2 q}{p + q})) d μ (x), γ = 0 \\ \int (\bar{p} + \bar{q}) log (\frac{p + q}{2 \sqrt{p q}}) d μ (x), γ = 1 . \end{matrix}$

**Table 4.** Basic Alpha- and Beta-divergences and directly generated corresponding Gamma-divergences (see also Table 3 how the Gamma-divergences can be expressed by power means).
Divergence $D_{A}^{(α)} (P \| \| Q)$ or $D_{B}^{(β)} (P \| \| Q)$	Gamma-divergence $D_{A G}^{(γ)} (c P \| \| c Q)$ and $D_{B G}^{(γ)} (c P \| \| c Q)$
$\frac{1}{α (1 - α)} \int (α p + (1 - α) q - p^{α} q^{1 - α}) d μ (x)$	$- log \frac{{(\int p^{γ} q^{1 - γ} d μ (x))}^{1 / (γ (1 - γ))}}{{(\int p d μ (x))}^{1 / (1 - γ)} {(\int q d μ (x))}^{1 / γ}}$
$\frac{1}{β (β - 1)} \int (p^{β} + (β - 1) q^{β} - β p q^{β - 1}) d μ (x)$	$- log \frac{{(\int p q^{γ - 1} d μ (x))}^{1 / (γ - 1)}}{{(\int p^{γ} d μ (x))}^{1 / (γ (γ - 1))} {(\int q^{γ} d μ (x))}^{1 / γ}}$
$\frac{1}{α (1 - α)} \int (p + q - p^{α} q^{1 - α} - p^{1 - α} q^{α}) d μ (x)$	$\frac{- 1}{γ (1 - γ)} log \frac{(\int p^{γ} q^{1 - γ} d μ (x)) (\int p^{1 - γ} q^{γ} d μ (x))}{\int p d μ (x) \int q d μ (x)}$
$\frac{1}{β - 1} \int (p^{β} + q^{β} - p q^{β - 1} - p^{β - 1} q) d μ (x)$	$\frac{- 1}{γ - 1} log \frac{(\int p q^{γ - 1} d μ (x)) (\int p^{γ - 1} q d μ (x))}{(\int p^{γ} d μ (x)) (\int q^{γ} d μ (x))}$
$\frac{\int (\frac{α}{2} p + (1 - \frac{α}{2}) q - {(\frac{p + q}{2})}^{α} q^{1 - α}) d μ (x)}{α (1 - α)}$	$\frac{- 1}{γ (1 - γ)} log \frac{\int {(\frac{p + q}{2})}^{γ} q^{1 - γ} d μ (x)}{{(\int p d μ (x))}^{γ / 2} {(\int q d μ (x))}^{1 - γ / 2}}$
$\frac{1}{1 - α} \int (p - q) (1 - {(\frac{p + q}{2 q})}^{α - 1}) d μ (x)$	$\frac{1}{1 - γ} log \frac{\int p d μ (x) \int q {(\frac{p + q}{2 q})}^{γ - 1} d μ (x)}{\int q d μ (x) \int p {(\frac{p + q}{2 q})}^{γ - 1} d μ (x)}$
$\frac{\int (p + q - (p^{1 - α} + q^{1 - α}) {(\frac{p + q}{2})}^{α}) d μ (x)}{α (1 - α)}$	$log {(\frac{\int p^{1 - γ} {(\frac{p + q}{2})}^{γ} d μ (x) \int q^{1 - γ} {(\frac{p + q}{2})}^{γ} d μ (x)}{\int p d μ (x) \int q d μ (x)})}^{\frac{1}{γ (γ - 1)}}$

The asymmetric Beta-Gamma-divergence has the following properties [30,35]:

$D_{B G}^{(γ)} (P | | Q) \geq 0$ . The equality holds if and only if $P = c Q$ for a positive constant c.
It is scale invariant, that is, $D_{B G}^{(γ)} (P | | Q) = D_{B G}^{(γ)} (c_{1} P | | c_{2} Q)$ , for arbitrary positive scaling constants $c_{1}, c_{2}$ .
As $γ \to 1$ , the Gamma-divergence becomes the Kullback–Leibler divergence:

$lim_{γ \to 1} D_{B G}^{(γ)} (P | | Q) = D_{K L} (\bar{P} | | \bar{Q}) = \int \bar{p} log \frac{\bar{p}}{\bar{q}} d μ (x)$

(82)

where $\bar{p} = p / \int p d μ (x)$ and $\bar{q} = q / \int q d μ (x)$ .
For $γ \to 0$ , the Gamma-divergence can be expressed as follows

$lim_{γ \to 0} D_{B G}^{(γ)} (P | | Q) = \int log \frac{q (x)}{p (x)} d μ (x) + log (\int \frac{p (x)}{q (x)} d μ (x))$

(83)

For the discrete Gamma divergence we have the corresponding formula

$D_{B G}^{(0)} (P | | Q) = \frac{1}{n} \sum_{i = 1}^{n} (log \frac{q_{i}}{p_{i}}) + log (\sum_{i = 1}^{n} \frac{p_{i}}{q_{i}}) - log (n) = log \frac{\frac{1}{n} \sum_{i = 1}^{n} \frac{p_{i}}{q_{i}}}{{(\prod_{i = 1}^{n} \frac{p_{i}}{q_{i}})}^{1 / n}}$

(84)
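The limiting cases (82) and (84) and the scale invariance of the Beta-Gamma-divergence can be illustrated numerically. This is a minimal sketch with hypothetical function names and data; the limits are approximated by evaluating the generic formula at γ slightly away from 1 and 0.

```python
import math

def beta_gamma(p, q, gamma):
    """Discrete Beta-Gamma-divergence, generic case (gamma not in {0, 1})."""
    sp = sum(a**gamma for a in p)
    sq = sum(b**gamma for b in q)
    spq = sum(a * b**(gamma - 1) for a, b in zip(p, q))
    return (math.log(sp) + (gamma - 1) * math.log(sq)
            - gamma * math.log(spq)) / (gamma * (gamma - 1))

def kl_normalized(p, q):
    """KL divergence between the normalized sequences, as in Eq. (82)."""
    sp, sq = sum(p), sum(q)
    return sum((a / sp) * math.log((a / sp) / (b / sq)) for a, b in zip(p, q))

def beta_gamma_zero(p, q):
    """Discrete gamma = 0 case of Eq. (84): log of arithmetic over geometric mean."""
    n = len(p)
    ratios = [a / b for a, b in zip(p, q)]
    arithmetic = sum(ratios) / n
    geometric = math.exp(sum(math.log(r) for r in ratios) / n)
    return math.log(arithmetic / geometric)

# Hypothetical unnormalized example data.
p = [2.0, 5.0, 3.0]
q = [3.0, 3.0, 4.0]
print(beta_gamma(p, q, 1 + 1e-6), kl_normalized(p, q))
print(beta_gamma(p, q, 1e-6), beta_gamma_zero(p, q))
print(beta_gamma(p, q, 2.0), beta_gamma([7.0 * a for a in p], [0.1 * b for b in q], 2.0))
```

In the $γ = 0$ case the divergence is the logarithm of the ratio between the arithmetic and geometric means of $p_{i} / q_{i}$, so it vanishes exactly when all ratios are equal.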

Similarly to the Alpha and Beta-divergences, we can also define the symmetric Beta-Gamma-divergence as

D_{B G S}^{(γ)} (P | | Q) = D_{B G}^{(γ)} (P | | Q) + D_{B G}^{(γ)} (Q | | P) = \frac{1}{1 - γ} log [\frac{(\int p^{γ - 1} q d μ (x)) (\int p q^{γ - 1} d μ (x))}{(\int p^{γ} d μ (x)) (\int q^{γ} d μ (x))}]

The symmetric Gamma-divergence has similar properties to the asymmetric Gamma-divergence:

$D_{B G S}^{(γ)} (P | | Q) \geq 0$ . The equality holds if and only if $P = c Q$ for a positive constant c, in particular, when $p_{i} = q_{i}, \forall i$ .
It is scale invariant, that is,

$D_{B G S}^{(γ)} (P | | Q) = D_{B G S}^{(γ)} (c_{1} P | | c_{2} Q)$

(85)

for arbitrary positive scaling constants $c_{1}, c_{2}$ .
For $γ \to 1$ , it is reduced to a special form of the symmetric Kullback–Leibler divergence (also called the J-divergence)

$lim_{γ \to 1} D_{B G S}^{(γ)} (P | | Q) = \int (\bar{p} - \bar{q}) log \frac{\bar{p}}{\bar{q}} d μ (x)$

(86)

where $\bar{p} = p / \int p d μ (x)$ and $\bar{q} = q / \int q d μ (x)$ .
For $γ = 0$ , we obtain a simple divergence expressed by weighted arithmetic means

$D_{B G S}^{(0)} (P | | Q) = log (\int w \frac{p (x)}{q (x)} d μ (x) \int w \frac{q (x)}{p (x)} d μ (x))$

(87)

where weight function $w > 0$ is such that $\int w d μ (x) = 1$ .
For the discrete Beta-Gamma-divergence (or simply the Gamma-divergence), we obtain the divergence

$\begin{matrix} D_{B G S}^{(0)} (P | | Q) & = & log (\sum_{i = 1}^{n} \frac{p_{i}}{q_{i}} \sum_{i = 1}^{n} \frac{q_{i}}{p_{i}}) - log (n^{2}) \\ & = & log ((\frac{1}{n} \sum_{i = 1}^{n} \frac{p_{i}}{q_{i}}) (\frac{1}{n} \sum_{i = 1}^{n} \frac{q_{i}}{p_{i}})) \\ & = & log (M_{1} \{\frac{p_{i}}{q_{i}}\} M_{1} \{\frac{q_{i}}{p_{i}}\}) \end{matrix}$

It is interesting to note that for $n \to \infty$ the discrete symmetric Gamma-divergence can be expressed by expectation functions $D_{B G S}^{(0)} (P | | Q) = log (E \{u\} E \{u^{- 1}\})$ , where $u_{i} = p_{i} / q_{i}$ and $u_{i}^{- 1} = q_{i} / p_{i}$ .
For $γ = 2$ , the asymmetric Gamma-divergence (which is symmetric in this case) reduces to the Cauchy–Schwarz divergence, introduced by Principe [83]

$\begin{matrix} D_{B G}^{(2)} (P | | Q) & = & D_{B G}^{(2)} (Q | | P) = \frac{1}{2} D_{B G S}^{(2)} (P | | Q) = \frac{1}{2} D_{C S} (P | | Q) \\ & = & - log \frac{\int p (x) q (x) d μ (x)}{{(\int p^{2} (x) d μ (x))}^{1 / 2} {(\int q^{2} (x) d μ (x))}^{1 / 2}} \end{matrix}$
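The $γ = 2$ case can be illustrated for discrete sequences, where the expression becomes the negative logarithm of the cosine of the angle between the two vectors. The sketch below uses hypothetical names (`beta_gamma`, `neg_log_cosine`) and data, and checks that the general formula at $γ = 2$ is symmetric and agrees with this form.

```python
import math

def beta_gamma(p, q, gamma):
    """Discrete Beta-Gamma-divergence, generic case (gamma not in {0, 1})."""
    sp = sum(a**gamma for a in p)
    sq = sum(b**gamma for b in q)
    spq = sum(a * b**(gamma - 1) for a, b in zip(p, q))
    return (math.log(sp) + (gamma - 1) * math.log(sq)
            - gamma * math.log(spq)) / (gamma * (gamma - 1))

def neg_log_cosine(p, q):
    """-log( <p, q> / (||p|| ||q||) ), the discrete form of the gamma = 2 case."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return -math.log(dot / (norm_p * norm_q))

# Hypothetical example data.
p = [0.2, 0.5, 0.3]
q = [0.3, 0.3, 0.4]
print(beta_gamma(p, q, 2.0), beta_gamma(q, p, 2.0), neg_log_cosine(p, q))
```

Nonnegativity in this case is the Cauchy–Schwarz inequality itself, with equality when the two vectors are proportional.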

It should be noted that the basic asymmetric Beta-Gamma divergence (derived from the Beta-divergence) is exactly equivalent to the Gamma divergence defined in [35], while Alpha-Gamma divergences (derived from the family of Alpha-divergences) have different expressions but they are similar in terms of properties.

5. Relationships for Asymmetric Divergences and their Unified Representation

The fundamental relationships and analogies for the generalized divergences discussed in this paper are summarized in Table 5.

The basic difference among the Alpha-, Beta- and Gamma-divergences lies in their scaling properties:

D_{A}^{(α)} (c P | | c Q) = c D_{A}^{(α)} (P | | Q)

(88)

D_{B}^{(β)} (c P | | c Q) = c^{β} D_{B}^{(β)} (P | | Q)

(89)

**Table 5.** Fundamental asymmetric divergences and their relationships. We can obtain the Beta-divergence from the Alpha-divergence via the transformations $p \to p^{β}$, $q \to q^{β}$ and $α = β^{- 1}$, and the Alpha-divergence from the Beta-divergence via the dual transformations $p \to p^{α}$, $q \to q^{α}$ and $β = α^{- 1}$. Furthermore, we can generate the Gamma-divergences from the Alpha- and Beta-divergences via the transformations $c_{0} \int p^{c_{1}} (x) q^{c_{2}} (x) d μ (x) \to c_{0} \log \int p^{c_{1}} (x) q^{c_{2}} (x) d μ (x)$. Moreover, we can generate the Beta-Gamma divergences from the Alpha-Gamma divergence via the transformations $p \to p^{γ_{new}}$, $q \to q^{γ_{new}}$ and $γ = γ_{new}^{- 1}$.

| Divergence name | Formula |
| --- | --- |
| Alpha | $D_{A}^{(α)} (P ∥ Q) = \frac{\int (p^{α} q^{1 - α} - α p + (α - 1) q) d μ (x)}{α (α - 1)}$ |
| Beta | $D_{B}^{(β)} (P ∥ Q) = \frac{\int (p^{β} + (β - 1) q^{β} - β p q^{β - 1}) d μ (x)}{β (β - 1)}$ |
| Gamma | $D_{AG}^{(γ)} (P ∥ Q) = \frac{\log (\int p^{γ} q^{1 - γ} d μ (x)) - γ \log (\int p d μ (x)) + (γ - 1) \log (\int q d μ (x))}{γ (γ - 1)}$ |
| | $= \frac{1}{γ - 1} \log {(\int \bar{q} {(\frac{\bar{p}}{\bar{q}})}^{γ} d μ (x))}^{\frac{1}{γ}}$ |
| | $D_{BG}^{(γ)} (P ∥ Q) = \frac{\log (\int p^{γ} d μ (x)) + (γ - 1) \log (\int q^{γ} d μ (x)) - γ \log (\int p q^{γ - 1} d μ (x))}{γ (γ - 1)}$ |
| | $= \frac{1}{1 - γ} \log (\int {\tilde{q}}^{γ} (\frac{\tilde{p}}{\tilde{q}}) d μ (x))$ |
| Alpha-Rényi | $D_{AR}^{(α)} (P ∥ Q) = \frac{\log (\int (p^{α} q^{1 - α} - α p + (α - 1) q) d μ (x) + 1)}{α (α - 1)}$ |
| Bregman | $D_{Φ} (P ∥ Q) = \int (Φ (p) - Φ (q) - \frac{δ Φ}{δ q} (p - q)) d μ (x)$ (see Equation (57)) |
| Csiszár-Morimoto | $D_{f} (P ∥ Q) = \int q f (\frac{p}{q}) d μ (x)$ (see Equation (57) for $f (u) = Φ {(t)}_{\vert t = u}$) |

D_{G}^{(γ)} (c P | | c Q) = D_{G}^{(γ)} (P | | Q)

(90)

for any $c > 0$.

Note that the values of all Alpha-divergences are proportional to the scaling factor c, independently of the parameter α, while for the Beta-divergences the scaling factor $c^{β}$ depends exponentially on the parameter β. Only in the special case $β = 0$ (Itakura–Saito distance) are the Beta-divergences invariant to scaling, whereas the Gamma-divergences are scale invariant for any value of the parameter γ.
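The three scaling laws (88)–(90) can be checked numerically for discrete measures. The sketch below is only an illustration: it uses the Alpha, Beta and Beta-Gamma formulas from Table 5 with sums in place of integrals, and the helper names, sample vectors and parameter values are assumptions chosen for the example.

```python
import math

def alpha_div(p, q, a):
    """Discrete Alpha-divergence, a not in {0, 1}."""
    total = sum(pi ** a * qi ** (1 - a) - a * pi + (a - 1) * qi
                for pi, qi in zip(p, q))
    return total / (a * (a - 1))

def beta_div(p, q, b):
    """Discrete Beta-divergence, b not in {0, 1}."""
    total = sum(pi ** b + (b - 1) * qi ** b - b * pi * qi ** (b - 1)
                for pi, qi in zip(p, q))
    return total / (b * (b - 1))

def gamma_div(p, q, g):
    """Discrete Beta-Gamma divergence, g not in {0, 1}."""
    s_p = sum(pi ** g for pi in p)
    s_q = sum(qi ** g for qi in q)
    s_pq = sum(pi * qi ** (g - 1) for pi, qi in zip(p, q))
    return (math.log(s_p) + (g - 1) * math.log(s_q)
            - g * math.log(s_pq)) / (g * (g - 1))

# Hypothetical unnormalized positive measures and a common scale factor.
p = [1.0, 2.0, 0.5]
q = [0.8, 1.5, 1.2]
c = 3.0
cp = [c * x for x in p]
cq = [c * x for x in q]

alpha_ratio = alpha_div(cp, cq, 0.5) / alpha_div(p, q, 0.5)  # close to c
beta_ratio = beta_div(cp, cq, 1.5) / beta_div(p, q, 1.5)     # close to c ** 1.5
gamma_diff = gamma_div(cp, cq, 1.5) - gamma_div(p, q, 1.5)   # close to 0
```

The singular cases (for example $β = 0$) require the limiting forms of the divergences and are deliberately left out of this sketch.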

On the basis of the analysis and discussion in the previous sections, we conclude that a wide class of asymmetric divergences can be described by the following generalized function:

D_{A C}^{(α, β, r)} (P | | Q) = \frac{1}{α (β - 1) λ r} [{({\tilde{D}}_{A C}^{(α, β)} + 1)}^{r} - 1]

(91)

where $r > 0$ and

{\tilde{D}}_{A C}^{(α, β)} (P ∥ Q) = \int (α p^{λ} + (β - 1) q^{λ} - λ p^{α} q^{β - 1}) d μ (x)

(92)

with $α \neq 0$, $β \neq 1$ and $λ = α + β - 1 \neq 0$.

Note that the function defined in this way has a structure similar to that of the generalized logarithm (3). Moreover, it should be noted that the above function is a divergence only for those sets of parameters for which ${\tilde{D}}_{A C}^{(α, β)} (P ∥ Q) \geq 0$, with equality if and only if $P = Q$.

In the special cases, we obtain the following divergences:

For $r = λ = 1$ and $α + β = 2$ , we obtain the Alpha-divergence (4).
For $r = 1$ and $α = 1$ $(β = λ)$ , we obtain the Beta-divergence (46).
For $r \to 0$ and $α + β = 2$ $(λ = 1)$ , we obtain the Alpha-Rényi divergence (25), which for normalized probability densities $\bar{p} = p / (\int p d μ (x))$ and $\bar{q} = q / (\int q d μ (x))$ reduces to the Alpha-Gamma divergence.
Finally, for $r \to 0$ and $α = 1$ $(β = λ = γ)$ and the normalized densities $\tilde{p} = p / {(\int p^{γ} d μ (x))}^{1 / γ}$ and $\tilde{q} = q / {(\int q^{γ} d μ (x))}^{1 / γ}$ , we obtain the Gamma divergence (referred to in this paper as the Beta-Gamma divergence) (72), proposed recently by Fujisawa and Eguchi [35].
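The first two special cases can be verified numerically from (91) and (92). The following is a minimal sketch for discrete measures, with integrals replaced by sums; the helper names and sample values are assumptions made for illustration, and the small-$r$ comparison only illustrates that the bracket in (91) divided by $r$ approaches a logarithm.

```python
import math

def d_ac_tilde(p, q, a, b):
    """Equation (92) for discrete measures, with lam = a + b - 1."""
    lam = a + b - 1
    return sum(a * pi ** lam + (b - 1) * qi ** lam - lam * pi ** a * qi ** (b - 1)
               for pi, qi in zip(p, q))

def d_ac(p, q, a, b, r):
    """Equation (91); requires a != 0, b != 1, a + b != 1 and r > 0."""
    lam = a + b - 1
    return ((d_ac_tilde(p, q, a, b) + 1) ** r - 1) / (a * (b - 1) * lam * r)

def alpha_div(p, q, a):
    total = sum(pi ** a * qi ** (1 - a) - a * pi + (a - 1) * qi
                for pi, qi in zip(p, q))
    return total / (a * (a - 1))

def beta_div(p, q, b):
    total = sum(pi ** b + (b - 1) * qi ** b - b * pi * qi ** (b - 1)
                for pi, qi in zip(p, q))
    return total / (b * (b - 1))

# Hypothetical positive measures used only for illustration.
p = [1.0, 2.0, 0.5]
q = [0.8, 1.5, 1.2]

# r = 1 with alpha + beta = 2 gives the Alpha-divergence.
from_ac_alpha = d_ac(p, q, 0.5, 1.5, 1.0)
direct_alpha = alpha_div(p, q, 0.5)

# r = 1 with alpha = 1 gives the Beta-divergence.
from_ac_beta = d_ac(p, q, 1.0, 2.0, 1.0)
direct_beta = beta_div(p, q, 2.0)

# For small r, ((x + 1)**r - 1) / r is close to log(1 + x).
small_r = d_ac(p, q, 0.5, 1.5, 1e-6)
log_form = math.log1p(d_ac_tilde(p, q, 0.5, 1.5)) / (0.5 * (1.5 - 1) * 1.0)
```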

The function (91) has many interesting general properties; however, their study is beyond the scope of this paper.

Duality

Note that we can easily establish the following dualities for the Alpha and Alpha-Gamma divergences:

D_{A}^{(α)} (P | | Q) = c^{- 1} D_{A}^{(1 - α)} (c Q | | c P), D_{A G}^{(γ)} (P | | Q) = D_{A G}^{(1 - γ)} (c_{1} Q | | c_{2} P)

(93)
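Both dualities in (93) can be illustrated numerically for discrete measures. The sketch below assumes the Alpha and Alpha-Gamma formulas of Table 5 with sums in place of integrals; the helper names, sample vectors and scale factors are illustrative assumptions.

```python
import math

def alpha_div(p, q, a):
    """Discrete Alpha-divergence, a not in {0, 1}."""
    total = sum(pi ** a * qi ** (1 - a) - a * pi + (a - 1) * qi
                for pi, qi in zip(p, q))
    return total / (a * (a - 1))

def alpha_gamma_div(p, q, g):
    """Discrete Alpha-Gamma divergence, g not in {0, 1}."""
    s_pq = sum(pi ** g * qi ** (1 - g) for pi, qi in zip(p, q))
    return (math.log(s_pq) - g * math.log(sum(p))
            + (g - 1) * math.log(sum(q))) / (g * (g - 1))

def scaled(v, c):
    """Multiply every entry of v by the positive constant c."""
    return [c * x for x in v]

# Hypothetical positive measures and scale factors.
p = [1.0, 2.0, 0.5]
q = [0.8, 1.5, 1.2]
a = 0.3
c, c1, c2 = 2.5, 0.7, 4.0

# D_A^(a)(P||Q) = c^{-1} D_A^(1-a)(cQ||cP)
lhs_alpha = alpha_div(p, q, a)
rhs_alpha = alpha_div(scaled(q, c), scaled(p, c), 1 - a) / c

# D_AG^(a)(P||Q) = D_AG^(1-a)(c1 Q || c2 P)
lhs_ag = alpha_gamma_div(p, q, a)
rhs_ag = alpha_gamma_div(scaled(q, c1), scaled(p, c2), 1 - a)
```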

However, in order to establish a duality for the Beta-divergence, we need to represent it in a slightly more general form, as a two-parameter divergence:

D_{A B}^{(β_{1}, β_{2})} (P | | Q) = \frac{1}{β_{1} β_{2} (β_{1} + β_{2})} \int (β_{1} p^{β_{1} + β_{2}} + β_{2} q^{β_{1} + β_{2}} - (β_{1} + β_{2}) p^{β_{1}} q^{β_{2}}) d μ (x)

(94)

In this case, we can express a duality of the Beta-divergence as

D_{B}^{(β)} (P | | Q) = D_{A B}^{(1, β - 1)} (P | | Q) = c^{- β} D_{A B}^{(β - 1, 1)} (c Q | | c P)

(95)

or more generally

D_{A B}^{(β_{1}, β_{2})} (P | | Q) = c^{- (β_{1} + β_{2})} D_{A B}^{(β_{2}, β_{1})} (c Q | | c P)

(96)

Analogously, we define the family of asymmetric Gamma-divergences in a more general form, as a two-parameter divergence:

\begin{matrix} D_{A B G}^{(γ_{1}, γ_{2})} (P ∥ Q) = log \frac{{(\int p^{γ_{1} + γ_{2}} d μ (x))}^{\frac{1}{γ_{2} (γ_{1} + γ_{2})}} {(\int q^{γ_{1} + γ_{2}} d μ (x))}^{\frac{1}{γ_{1} (γ_{1} + γ_{2})}}}{{(\int p^{γ_{1}} q^{γ_{2}} d μ (x))}^{\frac{1}{γ_{1} γ_{2}}}} \end{matrix}

(97)

Hence, we can express the duality property of the Beta-Gamma (basic Gamma) divergence as follows

D_{B G}^{(γ)} (P | | Q) = D_{A B G}^{(1, γ - 1)} (P | | Q) = D_{A B G}^{(γ - 1, 1)} (c_{1} Q | | c_{2} P)

(98)

or more generally

D_{A B G}^{(γ_{1}, γ_{2})} (P | | Q) = D_{A B G}^{(γ_{2}, γ_{1})} (c_{1} Q | | c_{2} P)

(99)
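The two-parameter form (97) and the dualities (98) and (99) can be checked in a few lines for discrete measures. This is a sketch under the assumption that integrals are replaced by sums; the function names, vectors, parameters and scale factors are chosen only for illustration.

```python
import math

def abg_div(p, q, g1, g2):
    """Two-parameter Gamma divergence of Equation (97), written as a sum of logs."""
    lam = g1 + g2
    s_p = sum(pi ** lam for pi in p)
    s_q = sum(qi ** lam for qi in q)
    s_pq = sum(pi ** g1 * qi ** g2 for pi, qi in zip(p, q))
    return (math.log(s_p) / (g2 * lam) + math.log(s_q) / (g1 * lam)
            - math.log(s_pq) / (g1 * g2))

def beta_gamma_div(p, q, g):
    """Basic (Beta-)Gamma divergence from Table 5, g not in {0, 1}."""
    s_p = sum(pi ** g for pi in p)
    s_q = sum(qi ** g for qi in q)
    s_pq = sum(pi * qi ** (g - 1) for pi, qi in zip(p, q))
    return (math.log(s_p) + (g - 1) * math.log(s_q)
            - g * math.log(s_pq)) / (g * (g - 1))

# Hypothetical positive measures and independent scale factors.
p = [1.0, 2.0, 0.5]
q = [0.8, 1.5, 1.2]
c1, c2 = 0.7, 4.0
q_scaled = [c1 * x for x in q]
p_scaled = [c2 * x for x in p]

# (98): D_BG^(g) = D_ABG^(1, g-1) = D_ABG^(g-1, 1)(c1 Q || c2 P)
g = 1.5
basic = beta_gamma_div(p, q, g)
as_abg = abg_div(p, q, 1.0, g - 1)
dual_basic = abg_div(q_scaled, p_scaled, g - 1, 1.0)

# (99): general duality with swapped parameters and arguments
general = abg_div(p, q, 0.7, 1.4)
dual_general = abg_div(q_scaled, p_scaled, 1.4, 0.7)
```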

5.1. Conclusions and Discussion

The main goal of this paper was to establish and widen bridges between several classes of divergences. The main results are summarized in Table 2, Table 3, Table 4 and Table 5. We have extended and unified several important classes of divergences and provided links and correspondences among them. The parametric families of the generalized divergences presented in this paper, especially the Alpha-, Alpha-Rényi and Beta-divergences, enable us to consider smooth connections of various pairs of well-known and frequently used fundamental divergences under one “umbrella”. For example, the family of Beta-divergences smoothly connects the squared Euclidean distance ($L_{2}$-norm) with a generalized extended version of the Itakura–Saito like distances, passing through the generalized KL I-divergence. In fact, we have shown how to generate the whole family of extended robust Itakura–Saito like distances from the Beta-divergences for $β = 0$, which are scale invariant and can be used in many potential applications, especially in speech processing.

Furthermore, we have shown that two different families of Gamma-divergences can be generated, one from the Beta-divergences and the other from the Alpha-divergences. The Gamma-divergences are characterized by global scale invariance (independent of the value of the parameter γ) and high robustness. We have reviewed and extended fundamental properties of the Gamma-divergences. Moreover, our approach allows us to generate a wide class of new divergences, especially Gamma-divergences and a family of extended Itakura–Saito like distances. Special emphasis is given to divergences for the singular values $0, 1$ of the tuning parameters.

These families have many desirable properties, such as flexibility and robustness to outliers, and they involve only a single tuning parameter (α, β, or γ). Insights from information geometry may further elucidate the fundamental structures of such divergences and their geometry [1,3,7].

In comparison with previous closely related works, we explicitly investigated three wide classes of divergences (Alpha-, Beta- and Gamma-divergences) and discussed their properties and links to Csiszár–Morimoto and Bregman divergences. Some of the properties provided, especially for the Gamma-divergences, are new.

The properties of divergences discussed in this paper allowed us to develop a wide family of robust algorithms for Independent Component Analysis (ICA) and Nonnegative Matrix Factorization (NMF), presented in separate recent works [30,31,32,33,34,68,69,70,71]. They have great potential in many applications, especially in blind separation of statistically dependent sparse and/or smooth sources and in feature extraction and classification problems [30].

One important and still open problem (not discussed in this paper) is how to set the tuning parameters $α, β$ and γ depending on the distributions of the available data sets and on noise or outliers. Some recent works address this problem [69,71,84]. Usually, there is a trade-off between robustness and efficiency [35,37,71]. For example, if the tuning parameter γ is large, then the robustness of the Gamma-divergence is strong but the efficiency may be lower; conversely, if the tuning parameter has a small positive value, then the robustness is weaker but the efficiency is higher [35]. The relation between the efficiency and the tuning parameters was discussed by several authors (see Jones et al. [85], Basu et al. [67], and Fujisawa and Eguchi [35]).

References

1. Amari, S. Differential-Geometrical Methods in Statistics; Springer Verlag: Berlin, Germany, 1985.
2. Amari, S. Dualistic geometry of the manifold of higher-order neurons. Neural Network. 1991, 4, 443–451.
3. Amari, S.; Nagaoka, H. Methods of Information Geometry; Oxford University Press: New York, NY, USA, 2000.
4. Amari, S. Integration of stochastic models by minimizing alpha-divergence. Neural Comput. 2007, 19, 2780–2796.
5. Amari, S. Information geometry and its applications: Convex function and dually flat manifold. In Emerging Trends in Visual Computing; Nielsen, F., Ed.; Springer: New York, NY, USA; pp. 75–102.
6. Amari, S. Alpha-divergence is unique, belonging to both f-divergence and Bregman divergence classes. IEEE Trans. Inf. Theory 2009, 55, 4925–4931.
7. Amari, S.; Cichocki, A. Information geometry of divergence functions. Bull. Pol. Acad. Sci. 2010. (in print).
8. Murata, N.; Takenouchi, T.; Kanamori, T.; Eguchi, S. Information geometry of U-Boost and Bregman divergence. Neural Comput. 2004, 16, 1437–1481.
9. Fujimoto, Y.; Murata, N. A modified EM Algorithm for mixture models based on Bregman divergence. Ann. Inst. Stat. Math. 2007, 59, 57–75.
10. Zhu, H.; Rohwer, R. Bayesian Invariant measurements of generalization. Neural Process. Lett. 1995, 2, 28–31.
11. Zhu, H.; Rohwer, R. Measurements of generalisation based on information geometry. In Mathematics of Neural Networks: Model Algorithms and Applications; Ellacott, S.W., Mason, J.C., Anderson, I.J., Eds.; Kluwer: Norwell, MA, USA, 1997; pp. 394–398.
12. Nielsen, F.; Nock, R. Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory 2009, 56, 2882–2903.
13. Boissonnat, J.D.; Nielsen, F.; Nock, R. Bregman Voronoi diagrams. Discrete and Computational Geometry (Springer) 2010. (in print).
14. Yamano, T. A generalization of the Kullback-Leibler divergence and its properties. J. Math. Phys. 2009, 50, 85–95.
15. Minami, M.; Eguchi, S. Robust blind source separation by Beta-divergence. Neural Comput. 2002, 14, 1859–1886.
16. Bregman, L. The relaxation method of finding a common point of convex sets and its application to the solution of problems in convex programming. Comp. Math. Phys., USSR 1967, 7, 200–217.
17. Csiszár, I. Eine Informations Theoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Magyar Tud. Akad. Mat. Kutató Int. Közl. 1963, 8, 85–108.
18. Csiszár, I. Axiomatic characterizations of information measures. Entropy 2008, 10, 261–273.
19. Csiszár, I. Information measures: A critical survey. In Transactions of the 7th Prague Conference, Prague, Czech Republic, 18–23 August 1974; Reidel: Dordrecht, Netherlands, 1977; pp. 83–86.
20. Ali, M.; Silvey, S. A general class of coefficients of divergence of one distribution from another. J. Royal Stat. Soc. 1966, Ser B, 131–142.
21. Hein, M.; Bousquet, O. Hilbertian metrics and positive definite kernels on probability measures. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, Barbados, 6–8 January 2005; Ghahramani, Z., Cowell, R., Eds.; AISTATS. 2005; 10, pp. 136–143.
22. Zhang, J. Divergence function, duality, and convex analysis. Neural Comput. 2004, 16, 159–195.
23. Zhang, J. Referential duality and representational duality on statistical manifolds. In Proceedings of the Second International Symposium on Information Geometry and its Applications, University of Tokyo, Tokyo, Japan, 12–16 December 2005; University of Tokyo: Tokyo, Japan, 2006; pp. 58–67.
24. Zhang, J. A note on curvature of a-connections of a statistical manifold. Ann. Inst. Stat. Math. 2007, 59, 161–170.
25. Zhang, J.; Matsuzoe, H. Dualistic differential geometry associated with a convex function. In Springer Series of Advances in Mechanics and Mathematics; 2008; Springer: New York, NY, USA; pp. 58–67.
26. Lafferty, J. Additive models, boosting, and inference for generalized divergences. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, Santa Cruz, CA, USA, 7–9 July 1999; ACM: New York, NY, USA, 1999.
27. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749.
28. Villmann, T.; Haase, S. Divergence based vector quantization using Fréchet derivatives. Neural Comput. 2010. (submitted for publication).
29. Villmann, T.; Haase, S.; Schleif, F.M.; Hammer, B. Divergence based online learning in vector quantization. In Proceedings of the International Conference on Artificial Intelligence and Soft Computing (ICAISC2010), LNAI, Zakopane, Poland, 13–17 June 2010.
30. Cichocki, A.; Zdunek, R.; Phan, A.H.; Amari, S. Nonnegative Matrix and Tensor Factorizations; John Wiley & Sons Ltd: Chichester, UK, 2009.
31. Cichocki, A.; Zdunek, R.; Amari, S. Csiszár’s divergences for nonnegative matrix factorization: Family of new algorithms. Springer, LNCS-3889 2006, 3889, 32–39.
32. Cichocki, A.; Amari, S.; Zdunek, R.; Kompass, R.; Hori, G.; He, Z. Extended SMART algorithms for Nonnegative Matrix Factorization. Springer, LNAI-4029 2006, 4029, 548–562.
33. Cichocki, A.; Zdunek, R.; Choi, S.; Plemmons, R.; Amari, S. Nonnegative tensor factorization using Alpha and Beta divergences. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Toulouse, France, May 2007; Volume III, pp. 1393–1396.
34. Cichocki, A.; Zdunek, R.; Choi, S.; Plemmons, R.; Amari, S.I. Novel multi-layer nonnegative tensor factorization with sparsity constraints. Springer, LNCS-4432 2007, 4432, 271–280.
35. Fujisawa, H.; Eguchi, S. Robust parameter estimation with a small bias against heavy contamination. J. Multivar. Anal. 2008, 99, 2053–2081.
36. Liese, F.; Vajda, I. Convex Statistical Distances. Teubner-Texte zur Mathematik Teubner Texts in Mathematics 1987, 95, 1–85.
37. Eguchi, S.; Kato, S. Entropy and divergence associated with power function and the statistical application. Entropy 2010, 12, 262–274.
38. Taneja, I. On generalized entropies with applications. In Lectures in Applied Mathematics and Informatics; Ricciardi, L., Ed.; Manchester University Press: Manchester, UK, 1990; pp. 107–169.
39. Taneja, I. New developments in generalized information measures. In Advances in Imaging and Electron Physics; Hawkes, P., Ed.; Elsevier: Amsterdam, Netherlands, 1995; Volume 91, pp. 37–135.
40. Gorban, A.N.; Gorban, P.A.; Judge, G. Entropy: The Markov ordering approach. Entropy 2010, 12, 1145–1193.
41. Chernoff, H. A measure of asymptotic efficiency for tests of a hypothesis based on a sum of observations. Ann. Math. Statist. 1952, 23, 493–507.
42. Minka, T. Divergence measures and message passing. Microsoft Research Technical Report (MSR-TR-2005) 2005.
43. Taneja, I. On measures of information and inaccuracy. J. Statist. Phys. 1976, 14, 203–270.
44. Cressie, N.; Read, T. Goodness-of-Fit Statistics for Discrete Multivariate Data; Springer: New York, NY, USA, 1988.
45. Cichocki, A.; Lee, H.; Kim, Y.D.; Choi, S. Nonnegative matrix factorization with Alpha-divergence. Pattern. Recognit. Lett. 2008, 29, 1433–1440.
46. Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Statist. Phys. 1988, 52, 479–487.
47. Havrda, J.; Charvát, F. Quantification method of classification processes: Concept of structural a-entropy. Kybernetika 1967, 3, 30–35.
48. Cressie, N.; Read, T. Multinomial goodness-of-fit tests. J. R. Stat. Soc. Ser. B 1984, 46, 440–464.
49. Vajda, I. Theory of Statistical Inference and Information; Kluwer Academic Press: Amsterdam, Netherlands, 1989.
50. Hellinger, E. Neue Begründung der Theorie Quadratischen Formen von unendlichen vielen Veränderlichen. J. Reine Ang. Math. 1909, 136, 210–271.
51. Morimoto, T. Markov processes and the H-theorem. J. Phys. Soc. Jap. 1963, 12, 328–331.
52. Österreicher, F. Csiszár’s f-divergences-basic properties. Technical report; In Research Report Collection; Victoria University: Melbourne, Australia, 2002.
53. Harremoës, P.; Vajda, I. Joint range of f-divergences. In Accepted for presentation at ISIT 2010, Austin, TX, USA, 13–18 June 2010.
54. Dragomir, S. Inequalities for Csiszár f-Divergence in Information Theory; Victoria University: Melbourne, Australia, 2000; (edited monograph).
55. Rényi, A. On the foundation of information theory. Rev. Inst. Int. Stat. 1965, 33, 1–4.
56. Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 20 June–30 July 1960; University of California Press: Berkeley, CA, USA; Volume 1, pp. 547–561.
57. Rényi, A. Probability Theory; North-Holland: Amsterdam, The Netherlands, 1970.
58. Harremoës, P. Interpretations of Rényi entropies and divergences. Physica A 2006, 365, 57–62.
59. Harremoës, P. Joint range of Rényi entropies. Kybernetika 2009, 45, 901–911.
60. Hero, A.; Ma, B.; Michel, O.; Gorman, J. Applications of entropic spanning graphs. IEEE Signal Process. Mag. 2002, 19, 85–95.
61. Topsoe, F. Some inequalities for information divergence and related measures of discrimination. IEEE Trans. Inf. Theory 2000, 46, 1602–1609.
62. Burbea, J.; Rao, C. Entropy differential metric, distance and divergence measures in probability spaces: A unified approach. J. Multi. Analysis 1982, 12, 575–596.
63. Burbea, J.; Rao, C. On the convexity of some divergence measures based on entropy functions. IEEE Trans. Inf. Theory 1982, IT-28, 489–495.
64. Sibson, R. Information radius. Probability Theory and Related Fields 1969, 14, 149–160.
65. Jeffreys, H. An invariant form for the prior probability in estimation problems. Proc. Roy. Soc. Lon., Ser. A 1946, 186, 453–461.
66. Kullback, S.; Leibler, R. On information and sufficiency. Ann. Math. Statist. 1951, 22, 79–86.
67. Basu, A.; Harris, I.R.; Hjort, N.; Jones, M. Robust and efficient estimation by minimising a density power divergence. Biometrika 1998, 85, 549–559.
68. Mollah, M.; Minami, M.; Eguchi, S. Exploring latent structure of mixture ICA models by the minimum Beta-divergence method. Neural Comput. 2006, 16, 166–190.
69. Mollah, M.; Eguchi, S.; Minami, M. Robust prewhitening for ICA by minimizing Beta-divergence and its application to FastICA. Neural Process. Lett. 2007, 25, 91–110.
70. Kompass, R. A Generalized divergence measure for Nonnegative Matrix Factorization. Neural Comput. 2006, 19, 780–791.
71. Mollah, M.; Sultana, N.; Minami, M.; Eguchi, S. Robust extraction of local structures by the minimum of Beta-divergence method. Neural Netw. 2010, 23, 226–238.
72. Nielsen, F.; Nock, R. The dual Voronoi diagrams with respect to representational Bregman divergences. In Proceedings of the International Symposium on Voronoi Diagrams (ISVD), Copenhagen, Denmark, 23–26 June 2009.
73. Cichocki, A.; Phan, A. Fast local algorithms for large scale nonnegative matrix and tensor factorizations. IEICE (invited paper) 2009, E92-A (3), 708–721.
74. Cichocki, A.; Phan, A.; Caiafa, C. Flexible HALS algorithms for sparse non-negative matrix/tensor factorization. In Proceedings of the 18th IEEE workshops on Machine Learning for Signal Processing, Cancun, Mexico, 16–19 October 2008.
75. Dhillon, I.; Sra, S. Generalized Nonnegative Matrix Approximations with Bregman Divergences. In Neural Information Processing Systems; MIT Press: Vancouver, Canada, 2005; pp. 283–290.
76. Févotte, C.; Bertin, N.; Durrieu, J.L. Nonnegative matrix factorization with the Itakura-Saito divergence with application to music analysis. Neural Comput. 2009, 21, 793–830.
77. Itakura, F.; Saito, F. Analysis synthesis telephony based on the maximum likelihood method. In Proceedings of the 6th International Congress on Acoustics, Tokyo, Japan, 1968; pp. 17–20.
78. Eggermont, P.; LaRiccia, V. On EM-like algorithms for minimum distance estimation. Technical report; In Mathematical Sciences; University of Delaware: Newark, DE, USA, 1998.
79. Févotte, C.; Cemgil, A.T. Nonnegative matrix factorizations as probabilistic inference in composite models. In Proceedings of the 17th European Signal Processing Conference (EUSIPCO-09), Glasgow, Scotland, UK, 24–28 August 2009.
80. Banerjee, A.; Dhillon, I.; Ghosh, J.; Merugu, S.; Modha, D. A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. In Proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, 22–25 August 2004; ACM Press: New York, NY, USA, 2004; pp. 509–514.
81. Lafferty, J. Additive models, boosting, and inference for generalized divergences. In Proceedings of the 12th Annual Conference on Computational Learning Theory, Santa Cruz, CA, USA, 7–9 July 1999; ACM Press: New York, USA, 1999; pp. 125–133.
82. Frigyik, B.A.; Srivastava, S.; Gupta, M. Functional Bregman divergence and Bayesian estimation of distributions. IEEE Trans. Inf. Theory 2008, 54, 5130–5139.
83. Principe, J. Information Theoretic Learning: Renyi’s Entropy and Kernel Perspectives; Springer: Berlin, Germany, 2010.
84. Choi, H.; Choi, S.; Katake, A.; Choe, Y. Learning alpha-integration with partially-labeled data. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-2010), Dallas, TX, USA, 14–19 March 2010.
85. Jones, M.; Hjort, N.; Harris, I.R.; Basu, A. A comparison of related density-based minimum divergence estimators. Biometrika 1998, 85, 865–873.

Appendix A. Divergences Derived from Tsallis α-entropy

In this appendix, we present links between the Tsallis and Rényi entropies and the Alpha-, Beta- and Gamma-divergences. For simplicity, we first assume that $\int \bar{p} (x) d μ (x) = 1$ and $\int \bar{q} (x) d μ (x) = 1$, but these constraints can be relaxed.

The Tsallis entropy is defined by using the power function $h_{α} (\bar{p}) = \int {\bar{p}}^{α} d μ (x)$ as

H_{T}^{(α)} (\bar{p}) = \frac{1}{1 - α} \{h_{α} (\bar{p}) - 1\}

(100)

The Rényi entropy is also written as

H_{R}^{(α)} (\bar{p}) = \frac{1}{1 - α} log \{h_{α} (\bar{p})\} = \frac{1}{1 - α} log (\int {\bar{p}}^{α} (x) d μ (x))

(101)

It is shown that the Tsallis entropy is related (corresponds) to the Alpha- and Beta-divergences, while the Rényi entropy is related to the Gamma-divergences.

The Rényi and Tsallis entropy measures are two different generalizations of the standard Boltzmann–Gibbs entropy (or Shannon’s information). In the limit as $α \to 1$ we have $p^{α - 1} = \exp ((α - 1) \log p) \approx 1 + (α - 1) \log p$, hence both reduce to the Shannon entropy:

H_{S} (\bar{p}) = - \int \bar{p} (x) log (\bar{p} (x)) d μ (x)

(102)

The Tsallis entropy can also be considered as an α-deformation of the Shannon’s entropy by noting that

H_{T}^{(α)} (\bar{p}) = - \int {\bar{p}}^{α} (x) {log}_{α} (\bar{p} (x)) d μ (x) = \frac{1}{1 - α} (\int {\bar{p}}^{α} (x) d μ (x) - 1)

(103)

and by using the generalized logarithm function defined by

{log}_{α} x = \frac{1}{1 - α} (x^{1 - α} - 1)

(104)

The Tsallis relative entropy or the Tsallis divergence is defined similarly by

\begin{matrix} D_{T}^{(α)} (\bar{p} ∥ \bar{q}) & = & - \int \bar{p} {log}_{α} (\frac{\bar{q}}{\bar{p}}) d μ (x) = \frac{1}{1 - α} (1 - \int {\bar{p}}^{α} {\bar{q}}^{1 - α} d μ (x)) \end{matrix}

(105)

This is a rescaled version of the Alpha-divergence,

D_{T}^{(α)} (\bar{p} ∥ \bar{q}) = α D_{A}^{(α)} (\bar{p} ∥ \bar{q})

(106)
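Relation (106) is easy to confirm numerically for normalized discrete distributions. The sketch below assumes sums in place of integrals; the function names and the two example distributions are illustrative assumptions.

```python
def tsallis_div(p, q, a):
    """Tsallis relative entropy of Equation (105) for normalized p and q, a != 1."""
    return (1.0 - sum(pi ** a * qi ** (1 - a) for pi, qi in zip(p, q))) / (1.0 - a)

def alpha_div(p, q, a):
    """Discrete Alpha-divergence, a not in {0, 1}."""
    total = sum(pi ** a * qi ** (1 - a) - a * pi + (a - 1) * qi
                for pi, qi in zip(p, q))
    return total / (a * (a - 1))

# Hypothetical normalized distributions (each sums to one).
p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]
a = 0.7

d_tsallis = tsallis_div(p, q, a)
d_alpha_rescaled = a * alpha_div(p, q, a)   # right-hand side of (106)
```

The agreement relies on the normalization of both arguments: for unnormalized measures the linear terms of the Alpha-divergence no longer cancel and the two quantities differ.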

It should be noted that the Alpha-divergence is more general, since it is defined for any positive arrays $P$ and $Q$ and also for any value of α, including the singular values $α = 0$ and $α = 1$.

Since the Tsallis Alpha-divergence is an f-divergence, it is possible to extend it to the applicable general cases. This is given in Appendix B, where ${log}_{α} (x)$ is replaced by another power function.

Another way of deriving a divergence is to use the convex generating functions. This has been deeply investigated by Zhang [22].

Since

Φ_{α} (\bar{p}) = - h_{α} (\bar{p})

(107)

is a convex function, we derive a Bregman divergence associated with the Tsallis entropy,

{\tilde{D}}_{T}^{(α)} (\bar{p} ∥ \bar{q}) = \int \{- {\bar{p}}^{α} + {\bar{q}}^{α} + α {\bar{q}}^{α - 1} (\bar{p} - \bar{q})\} d μ (x)

(108)

This is a rescaled and simplified version of the Beta-divergence with $α = β$,

{\tilde{D}}_{T}^{(α)} (\bar{p} ∥ \bar{q}) = α (1 - α) D_{B}^{(β)} (\bar{p} ∥ \bar{q})

(109)
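Relation (109) can be checked numerically by building the Bregman divergence directly from the generating function $Φ_{α} = - h_{α}$, with the derivative of $Φ_{α}$ evaluated at the second argument as in the general definition (115). The sketch below is for discrete distributions; the helper names and sample values are assumptions for illustration.

```python
def tsallis_bregman(p, q, a):
    """Bregman divergence generated by Phi(p) = -sum(p_i^a), for 0 < a < 1.

    Each term is Phi(p_i) - Phi(q_i) - Phi'(q_i) (p_i - q_i),
    with Phi'(q_i) = -a * q_i^(a - 1).
    """
    return sum(-pi ** a + qi ** a + a * qi ** (a - 1) * (pi - qi)
               for pi, qi in zip(p, q))

def beta_div(p, q, b):
    """Discrete Beta-divergence, b not in {0, 1}."""
    total = sum(pi ** b + (b - 1) * qi ** b - b * pi * qi ** (b - 1)
                for pi, qi in zip(p, q))
    return total / (b * (b - 1))

# Hypothetical normalized distributions and parameter value.
p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]
a = 0.6

d_bregman = tsallis_bregman(p, q, a)
d_beta_rescaled = a * (1 - a) * beta_div(p, q, a)   # right-hand side of (109)
d_self = tsallis_bregman(p, p, a)                    # zero when both arguments coincide
```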

Again, note that the Beta-divergence is more general than the Tsallis divergence, since it is defined for any positive arrays and any β, including the singular values $β = 0$ and $β = 1$. We need to extend the definition of $h_{α} (p)$ to obtain the whole range of the Beta-divergences.

Appendix B. Entropies and Divergences

An entropy $H (\bar{p})$ is maximized at the uniform distribution. Therefore, when a divergence function is given, we can define an entropy $H (\bar{p})$ by

H (\bar{p}) = - D (\bar{p} ∥ 1) + c

(110)

for some constant c and normalized probability measure $\bar{p}$.

In many cases, $Ψ (\bar{p}) = D (\bar{p} ∥ 1)$ is a convex function of $\bar{p}$. For example, from the KL-divergence, we have

H_{K L} (\bar{p}) = - \int \bar{p} (x) log \bar{p} (x) d μ (x)

(111)

which is the Shannon entropy (neglecting a constant).

We define a power function of probabilities by

h_{α} (\bar{p}) = \int {\bar{p}}^{α} (x) d μ (x)

(112)

where $0 < α < 1$.

From the Alpha-divergence $D_{A}^{(α)}$ and the Beta-divergence $D_{B}^{(β)}$ we have the entropies

H (\bar{p}) = c h_{α} (\bar{p}) + d

(113)

where c and d are constants and $α = β$.

Similarly, from the Gamma-divergence $D_{G}^{(γ)}$, we have

H (\bar{p}) = c log (h_{α} (\bar{p})) + d

(114)

with $α = γ$.

This implies that various divergence functions can generate a similar family of entropy functions. On the other hand, there are a number of ways to generate divergences from entropy functions. A typical way is to use the Bregman divergence derived from a convex generating function $Φ (\bar{p}) = - H (\bar{p})$,

D_{Φ} (\bar{p} | | \bar{q}) = \int (Φ (\bar{p}) - Φ (\bar{q}) - \frac{δ Φ (\bar{q})}{δ \bar{q}} (\bar{p} - \bar{q})) d μ (x)

(115)

Appendix C. Tsallis and Rényi Entropies

It is worth mentioning that the Rényi and Tsallis α-entropies are monotonically related through

H_{R}^{(α)} (\bar{p}) = \frac{1}{1 - α} log (1 + (1 - α) H_{T}^{(α)} (\bar{p}))

(116)

or using the α-logarithm

H_{T}^{(α)} (\bar{p}) = {log}_{α} e^{H_{R}^{(α)} (\bar{p})}

(117)
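The two relations (116) and (117) follow from the fact that both entropies are functions of the same quantity $h_{α} (\bar{p})$. A minimal numerical sketch for a discrete distribution is given below; the function names and the example distribution are assumptions for illustration.

```python
import math

def h_alpha(p, a):
    """Power function h_alpha(p) = sum_i p_i^a."""
    return sum(pi ** a for pi in p)

def tsallis_entropy(p, a):
    """Equation (100)."""
    return (h_alpha(p, a) - 1.0) / (1.0 - a)

def renyi_entropy(p, a):
    """Equation (101)."""
    return math.log(h_alpha(p, a)) / (1.0 - a)

def log_alpha(x, a):
    """Generalized logarithm of Equation (104)."""
    return (x ** (1.0 - a) - 1.0) / (1.0 - a)

# Hypothetical normalized distribution and parameter value.
p = [0.2, 0.5, 0.3]
a = 0.5

h_t = tsallis_entropy(p, a)
h_r = renyi_entropy(p, a)

renyi_from_tsallis = math.log(1.0 + (1.0 - a) * h_t) / (1.0 - a)   # Equation (116)
tsallis_from_renyi = log_alpha(math.exp(h_r), a)                    # Equation (117)
```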

The function $Ψ_{R}^{(α)} = - H_{R}^{(α)}$ is a convex function of $\bar{p}$, and hence we have a corresponding divergence.

The Bregman divergence can be defined in a more general form as shown below.

If $Φ (\bar{p}) : R^{n} \to R$ is a strictly convex and $C^{1}$ (i.e., first-order differentiable) function, the corresponding Bregman distance (divergence) for discrete positive measures is defined by

D_{Φ} (\bar{p} ∥ \bar{q}) = Φ (\bar{p}) - Φ (\bar{q}) - {(\bar{p} - \bar{q})}^{T} \nabla Φ (\bar{q})

(118)

where $\nabla Φ (\bar{q})$ represents the gradient of Φ evaluated at $\bar{q}$.

The Bregman divergence derived from the Rényi entropy for discrete probability densities is given by

D_{R}^{(α)} (\bar{p} ∥ \bar{q}) = \frac{1}{1 - α} log \frac{h_{α} (\bar{q})}{h_{α} (\bar{p})} - \frac{α}{1 - α} (1 - \frac{1}{h_{α} (\bar{q})} \sum_{i} {\bar{p}}_{i} {\bar{q}}_{i}^{α - 1})

(119)
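A Bregman divergence of this type can also be evaluated directly from the general definition (118), without a closed form. The sketch below does this for the generating function $Φ = - H_{R}^{(α)}$ with $0 < α < 1$, using its analytic gradient; the function names and example distributions are assumptions made for illustration only.

```python
import math

def h_alpha(p, a):
    """Power function h_alpha(p) = sum_i p_i^a."""
    return sum(pi ** a for pi in p)

def phi(p, a):
    """Generating function Phi(p) = -H_R(p) = -log(h_alpha(p)) / (1 - a)."""
    return -math.log(h_alpha(p, a)) / (1.0 - a)

def grad_phi(q, a):
    """Gradient of Phi at q: component i is -a * q_i^(a - 1) / ((1 - a) * h_alpha(q))."""
    h = h_alpha(q, a)
    return [-a * qi ** (a - 1) / ((1.0 - a) * h) for qi in q]

def renyi_bregman(p, q, a):
    """Equation (118): Phi(p) - Phi(q) - (p - q)^T grad Phi(q)."""
    g = grad_phi(q, a)
    linear = sum((pi - qi) * gi for pi, qi, gi in zip(p, q, g))
    return phi(p, a) - phi(q, a) - linear

# Hypothetical normalized distributions and parameter value.
p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]
a = 0.5

d_pq = renyi_bregman(p, q, a)
d_qp = renyi_bregman(q, p, a)
d_pp = renyi_bregman(p, p, a)   # zero when both arguments coincide
```

In this example both `d_pq` and `d_qp` are positive but not equal, which illustrates that the resulting divergence is asymmetric.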

The Tsallis entropy is closely related to the α-exponential family of probability distributions. A family $S = \{\bar{p} (x, θ)\}$ of probability distributions, parameterized by a vector parameter $θ$, is called an α-exponential family when it is written as

{log}_{α} \bar{p} (x, θ) = \sum_{i} θ_{i} x_{i} - Ψ (θ)

(120)

where $Ψ (θ)$ is a convex function. This is the same as the α-family defined in [6]. The set $S = \{p\}$ of discrete probability distributions is an α-exponential family for any α. To show this, we introduce random variables $x = (x_{i})$,

x_{i} = δ_{i}

(121)

where $δ_{i}$ is 1 when the outcome is i, which occurs with probability ${\bar{p}}_{i}$.

Then, the probability is written as

{log}_{α} \bar{p} (x, θ) = \sum_{i} θ_{i} δ_{i} - Ψ (θ)

(122)

where

θ_{i} = {log}_{α} {\bar{p}}_{i}

(123)

Since $\sum {\bar{p}}_{i} = 1$ holds, or equivalently

\sum δ_{i} = 1

(124)

the vector $θ$ has only $n - 1$ free variables, say $(θ_{1}, \dots, θ_{n - 1})$. In this case $θ_{n}$ is a function of the other variables, and we get

Ψ (θ) = - {log}_{α} θ_{n}

(125)

We can derive a divergence from this convex function Ψ (θ). However, it is more intuitive to use the dual of Ψ (θ), which is given by

φ (η) = \frac{1}{1 - α} (\frac{1}{h_{α} (\bar{p})} - 1)

(126)

where η is the dual variable of θ, obtained by the Legendre transformation

η = \nabla_{θ} Ψ (θ)

(127)

The derived divergence is

D_{exp}^{(α)} (\bar{p} ∥ \bar{q}) = \frac{1}{h_{α} (\bar{p})} D_{T}^{(α)} (\bar{p} ∥ \bar{q})

(128)

This is a conformal transformation of the Alpha-divergence [7]. A conformal transformation, also called a conformal map or angle-preserving transformation, is a transformation that preserves the local angles of intersection between any two lines or curves.

© 2010 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

