Article

Families of Alpha-, Beta- and Gamma-Divergences: Flexible and Robust Measures of Similarities

by
Andrzej Cichocki
1,2,* and
Shun-ichi Amari
3
1
Riken Brain Science Institute, Laboratory for Advanced Brain Signal Processing, Wako-shi, Japan
2
Systems Research Institute, Polish Academy of Science, Poland
3
Riken Brain Science Institute, Laboratory for Mathematical Neuroscience, Wako-shi, Japan
*
Author to whom correspondence should be addressed.
Entropy 2010, 12(6), 1532-1568; https://doi.org/10.3390/e12061532
Submission received: 26 April 2010 / Accepted: 1 June 2010 / Published: 14 June 2010

Abstract: In this paper, we extend and overview wide families of Alpha-, Beta- and Gamma-divergences and discuss their fundamental properties. In the literature, usually only a single asymmetric (Alpha, Beta or Gamma) divergence is considered. We show in this paper that there exist families of such divergences with the same consistent properties. Moreover, we establish links and correspondences among these divergences by applying suitable nonlinear transformations. For example, we can generate the Beta-divergences directly from the Alpha-divergences and vice versa. Furthermore, we show that a new wide class of Gamma-divergences can be generated not only from the family of Beta-divergences but also from the family of Alpha-divergences. The paper bridges these divergences and also shows their links to the Tsallis and Rényi entropies. Most of these divergences have a natural information-theoretic interpretation.

1. Introduction

Many machine learning algorithms for classification and clustering employ a variety of dissimilarity measures. Information theory, convex analysis, and information geometry play key roles in the formulation of such divergences [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25].
The most popular and frequently used are the squared Euclidean distance and the Kullback–Leibler divergence. Recently, generalized divergences such as the Csiszár–Morimoto f-divergence and the Bregman divergence have become attractive alternatives for advanced machine learning algorithms [26,27,28,29,30,31,32,33,34]. In this paper, we discuss a robust parameterized subclass of the Csiszár–Morimoto and Bregman divergences: the Alpha- and Beta-divergences, which may provide solutions that are more robust with respect to outliers and additive noise, as well as improved accuracy. Moreover, we provide links to the new class of robust Gamma-divergences [35] and extend this class to the so-called Alpha-Gamma divergences.
Divergences are considered here as (dis)similarity measures. Generally speaking, they measure a quasi-distance or directed difference between two probability distributions P and Q, and they can also be defined for unconstrained nonnegative multi-way arrays and patterns.
In this paper we assume that P and Q are positive measures (densities), not necessarily normalized, but finite. In the special case of normalized densities, we explicitly refer to them as probability densities. Unless stated otherwise, we assume that these measures are continuous. An information divergence is a measure of distance between two probability curves. In this paper, we discuss only one-dimensional probability curves (represented by nonnegative signals or time series); the generalization to two- or multi-dimensional variables is straightforward. One density Q(x) is usually known and fixed, while the other, P(x), is learned or adjusted to achieve the best similarity, in some sense, to Q(x). For example, a discrete density Q corresponds to the observed data and P to the estimated or expected data, which are subject to constraints imposed by the assumed model. For the Non-negative Matrix Factorization (NMF) problem, Q corresponds to the data matrix Y and P corresponds to the estimated matrix $\hat{Y} = AX$ (or vice versa) [30].
The distance between two densities is called a metric if the following conditions hold:
  • $D(P\|Q) \ge 0$ with equality if and only if $P = Q$ (nonnegativity and positive definiteness),
  • $D(P\|Q) = D(Q\|P)$ (symmetry),
  • $D(P\|Z) \le D(P\|Q) + D(Q\|Z)$ (subadditivity/triangle inequality).
Distances which satisfy only Condition 1 are not metrics and are referred to as (asymmetric) divergences.
In many applications, such as image analysis, pattern recognition and statistical machine learning, we use information-theoretic divergences rather than the squared Euclidean or $\ell_p$-norm distances [28]. Several information divergences, such as the Kullback–Leibler, Hellinger and Jensen–Shannon divergences, are central to estimating similarity between distributions and have a long history in information geometry.
The concept of a divergence is not restricted to Euclidean spaces but can be extended to abstract spaces with the help of the Radon–Nikodym derivative (see for example [36]). Let $(X, \mathcal{A}, \mu)$ be a measure space, where μ is a finite or σ-finite measure on $(X, \mathcal{A})$, and assume that P and Q are two (probability) measures on $(X, \mathcal{A})$ such that $P \ll \mu$ and $Q \ll \mu$, i.e., they are absolutely continuous with respect to the measure μ (e.g., $\mu = P + Q$), with densities (Radon–Nikodym derivatives) $p = \frac{dP}{d\mu}$ and $q = \frac{dQ}{d\mu}$. Using this notation, the fundamental Kullback–Leibler (KL) divergence between two probability distributions can be written as
$$D_{KL}(P\|Q) = \int_X p(x)\,\log\frac{p(x)}{q(x)}\,d\mu(x)$$
which is related to the Shannon entropy
$$H_S(P) = -\int_X p(x)\,\log p(x)\,d\mu(x)$$
via
$$D_{KL}(P\|Q) = V_S(P,Q) - H_S(P)$$
where
$$V_S(P,Q) = -\int_X p(x)\,\log q(x)\,d\mu(x)$$
is Shannon's cross-entropy, provided that the integrals exist. (In measure-theoretic terms, the integral exists if the measure induced by P is absolutely continuous with respect to that induced by Q.) Here and throughout the paper we assume that all integrals exist.
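As a minimal numerical illustration of the relation $D_{KL}(P\|Q) = V_S(P,Q) - H_S(P)$ in the discrete case, the following Python/NumPy sketch may be helpful; the function names are illustrative and not taken from the paper.

```python
# Minimal numerical check of D_KL(P||Q) = V_S(P,Q) - H_S(P) for discrete
# probability vectors; kl_divergence, shannon_entropy and cross_entropy are
# illustrative names, not code from the paper.
import numpy as np

def kl_divergence(p, q):
    """Discrete Kullback-Leibler divergence: sum_i p_i * log(p_i / q_i)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(p * np.log(p / q))

def shannon_entropy(p):
    """Shannon entropy H_S(P) = -sum_i p_i * log(p_i)."""
    p = np.asarray(p, float)
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    """Shannon cross-entropy V_S(P,Q) = -sum_i p_i * log(q_i)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return -np.sum(p * np.log(q))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.4, 0.3])
assert np.isclose(kl_divergence(p, q), cross_entropy(p, q) - shannon_entropy(p))
```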
The Kullback–Leibler divergence has been generalized by using a family of functions called generalized logarithm functions or α-logarithm
$$\log_\alpha(x) = \frac{1}{1-\alpha}\left(x^{1-\alpha} - 1\right) \qquad (\text{for } x > 0)$$
which is a power function of x with power $1-\alpha$ and reduces to the natural logarithm in the limit $\alpha \to 1$. Often, the power function (3) allows us to generate divergences that are more robust with respect to outliers and, consequently, give better or more flexible performance (see, for example, [37]).
By using this type of extension, we derive and review three series of divergences: the Alpha-, Beta- and Gamma-divergences, all of which are generalizations of the KL-divergence. Moreover, we show their relation to the Tsallis entropy and the Rényi entropy (see Appendix A and Appendix B). It will also be shown how the Alpha-divergences are derived from the Csiszár–Morimoto f-divergence and the Beta-divergences from the Bregman divergence by using power functions.
Similarly to the work of Zhang [22,23,24] and Hein and Bousquet [21], one of our motivations is to show the close links and relations among a wide class of divergences and to provide an elegant way to handle known, intermediate and new divergences within the same framework. However, our similarity measures are different from those proposed in [21], and our approach and results are quite different from those presented in [22].
It should be mentioned that there have been previous attempts at unifying divergence functions (especially those related to the Alpha-divergence), starting from the work of Zhu and Rohwer [10,11], Amari and Nagaoka [3], Taneja [38,39], Zhang [22] and Gorban and Judge [40]. In particular, Zhang, in [22] and in subsequent works [23,24], investigated the deep link between information geometry and various divergence functions, mostly related to the Alpha-divergence, through a unified approach based on convex functions. However, the previous works have not explicitly considered the links and relationships among ALL three fundamental classes of (Alpha-, Beta-, Gamma-) divergences. Moreover, some of their basic properties are reviewed and extended here.
The scope of the results presented in this paper is vast, since the class of generalized (flexible) divergence functions includes a large number of useful loss functions, containing those based on the relative entropies, the generalized Kullback–Leibler or I-divergence, the Hellinger distance, the Jensen–Shannon divergence, the J-divergence, the Pearson and Neyman Chi-square divergences, the Triangular Discrimination and the Arithmetic-Geometric divergence. Moreover, we show that some new divergences can be generated. In particular, we generate a new family of Alpha-Gamma divergences and Itakura–Saito-like distances with the scale-invariance property, which belong to a wider class of Beta-divergences. Generally, these new scale-invariant divergences extend the families of Beta- and Gamma-divergences. The divergence functions discussed in this paper are flexible because they allow us to generate a large number of well-known and frequently used particular divergences (for specific values of the tuning parameters). Moreover, by adjusting adaptive tuning parameters, we can optimize cost functions for learning algorithms and estimate the desired parameters of a model in the presence of noise and outliers. In other words, the divergences discussed in this paper, especially the Beta- and Gamma-divergences, are robust with respect to outliers for some values of the tuning parameters.
One of the important features of the considered families of divergences is that they can give some guidance for the selection, and even the development, of new divergence measures if necessary, and they allow us to unify these divergences under the same framework using the Csiszár–Morimoto and Bregman divergences and their fundamental properties. Moreover, these families of divergences are generally defined on unnormalized finite measures (not necessarily normalized probabilities). This allows patterns of different size, e.g., images with different sizes or documents of different length, to be weighted differently. Such measures also play an important role in the areas of neural computation, pattern recognition, learning, estimation, inference, and optimization. We have already successfully applied a subset of such divergences as cost functions (possibly with additional constraints and regularization terms) to derive novel multiplicative and additive projected gradient algorithms for nonnegative matrix and tensor factorizations [30,31].
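A minimal sketch of the generalized (alpha-) logarithm (3) in Python/NumPy, with the natural-logarithm limit at α = 1, may clarify its behaviour; log_alpha is an illustrative name.

```python
# Generalized alpha-logarithm (3); reduces to the natural logarithm at alpha = 1.
import numpy as np

def log_alpha(x, alpha):
    """log_alpha(x) = (x**(1-alpha) - 1) / (1 - alpha); natural log for alpha = 1."""
    x = np.asarray(x, float)
    if np.isclose(alpha, 1.0):
        return np.log(x)
    return (x**(1.0 - alpha) - 1.0) / (1.0 - alpha)

x = np.linspace(0.5, 2.0, 4)
print(log_alpha(x, 0.999))   # numerically close to np.log(x)
print(log_alpha(x, 2.0))     # equals 1 - 1/x
```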
The divergences are closely related to the invariant geometrical properties of the manifold of probability distributions [5,6,7].

2. Family of Alpha-Divergences

The Alpha-divergences can be derived from the Csiszár–Morimoto f-divergence and, as shown recently by Amari using some tricks, also from the Bregman divergence [3,6] (see also Appendix A). The Alpha-divergence was proposed by Chernoff [41] and has been extensively investigated and extended by Amari [1,3,5,6] and other researchers. For some modifications and/or extensions see, for example, the works of Liese and Vajda [36], Minka [42], Taneja [38,39,43], Cressie–Read [44], Zhu–Rohwer [10,11] and Zhang [22,23,24,25]. One of our motivations to investigate and explore the family of Alpha-divergences is to develop flexible and efficient learning algorithms for nonnegative matrix and tensor factorizations, which unify and extend existing algorithms, including EMML (Expectation Maximization Maximum Likelihood) and ISRA (Image Space Reconstruction Algorithm) (see [30,45] and references therein).

2.1. Asymmetric Alpha-Divergences

The basic asymmetric Alpha-divergence can be defined as [3]:
$$D_A^{(\alpha)}(P\|Q) = \frac{1}{\alpha(\alpha-1)}\int \left[\,p^\alpha(x)\,q^{1-\alpha}(x) - \alpha\,p(x) + (\alpha-1)\,q(x)\,\right] d\mu(x), \qquad \alpha \in \mathbb{R}\setminus\{0,1\}$$
where p ( x ) and q ( x ) do not need to be normalized.
The Alpha-divergence can be expressed via a generalized KL divergence as follows
$$D_A^{(\alpha)}(P\|Q) = D_{GKL}^{(\alpha)}(P\|Q) = \frac{1}{\alpha}\int \left[\,-p\,\log_\alpha\!\frac{q}{p} - p + q\,\right] d\mu(x), \qquad \alpha \in \mathbb{R}\setminus\{0,1\}$$
For discrete probability measures with mass functions P = [ p 1 , p 2 , , p n ] and Q = [ q 1 , q 2 , , q n ] , the discrete Alpha-divergence is formulated as a separable divergence:
$$D_A^{(\alpha)}(P\|Q) = \sum_{i=1}^n d_A^{(\alpha)}(p_i\|q_i) = \frac{1}{\alpha(\alpha-1)}\sum_{i=1}^n \left[\,p_i^\alpha\,q_i^{1-\alpha} - \alpha\,p_i + (\alpha-1)\,q_i\,\right], \qquad \alpha \in \mathbb{R}\setminus\{0,1\}$$
where $d_A^{(\alpha)}(p_i\|q_i) = \left[p_i^\alpha q_i^{1-\alpha} - \alpha p_i + (\alpha-1)q_i\right]/\left(\alpha(\alpha-1)\right)$.
Note that this form of the Alpha-divergence differs slightly from the loss function given in [1], because the latter was defined only for probability distributions, i.e., under the assumptions that $\int p(x)\,d\mu(x) = 1$ and $\int q(x)\,d\mu(x) = 1$. It was extended by Zhu and Rohwer in [10,11] (see also Amari and Nagaoka [3]) to positive measures by incorporating additional terms. These terms are needed to allow de-normalized densities (positive measures), in the same way as for the generalized Kullback–Leibler divergence (I-divergence).
Extending the Kullback–Leibler divergence to the family of Alpha-divergences is a crucial step towards a unified view of a wide class of divergence functions, since it demonstrates that the nonnegativity of the Alpha-divergences may be viewed as arising not only from Jensen's inequality but also from the arithmetic-geometric inequality. As has been pointed out by Jun Zhang, this can be fully exploited as a consequence of the inequality of convex functions or, more generally, by his convexity-based approach [22].
For normalized densities $\bar{p}(x) = p(x)/\int p(x)\,d\mu(x)$ and $\bar{q}(x) = q(x)/\int q(x)\,d\mu(x)$, the Alpha-divergence simplifies to [1,44]:
$$D_A^{(\alpha)}(\bar{P}\|\bar{Q}) = \frac{1}{\alpha(\alpha-1)}\left[\int \bar{p}^{\,\alpha}(x)\,\bar{q}^{\,1-\alpha}(x)\,d\mu(x) - 1\right], \qquad \alpha \in \mathbb{R}\setminus\{0,1\}$$
and is related to the Tsallis divergence and the Tsallis entropy [46] (see Appendix A):
$$H_T^{(\alpha)}(\bar{P}) = \frac{1}{1-\alpha}\left[\int \bar{p}^{\,\alpha}(x)\,d\mu(x) - 1\right] = -\int \bar{p}^{\,\alpha}(x)\,\log_\alpha \bar{p}(x)\,d\mu(x)$$
provided integrals on the right exist.
In fact, the Tsallis entropy was first defined by Havrda and Charvát in 1967 [47], almost forgotten, and then rediscovered by Tsallis in 1988 [46] in a different context (see Appendix A).
Various authors use the parameter α in different ways. For example, using the Amari notation $\alpha_A$, with $\alpha = (1-\alpha_A)/2$, the Alpha-divergence takes the following form [1,2,3,10,11,22,23,24,25,44,48,49]:
$$\tilde{D}_A^{(\alpha_A)}(P\|Q) = \frac{4}{1-\alpha_A^2}\int\left[\frac{1-\alpha_A}{2}\,p + \frac{1+\alpha_A}{2}\,q - p^{\frac{1-\alpha_A}{2}}\,q^{\frac{1+\alpha_A}{2}}\right]d\mu(x), \qquad \alpha_A \in \mathbb{R}\setminus\{\pm 1\}$$
When α takes values from 0 to 1, $\alpha_A$ takes values from 1 to $-1$. A duality exists between $\alpha_A$ and $-\alpha_A$, in the sense that $\tilde{D}_A^{(\alpha_A)}(P\|Q) = \tilde{D}_A^{(-\alpha_A)}(Q\|P)$.
In the special cases $\alpha = 2$, $0.5$, $-1$, we obtain from (4) the well-known Pearson Chi-square, Hellinger and inverse Pearson (also called Neyman Chi-square) distances, given respectively by
$$D_A^{(2)}(P\|Q) = D_P(P\|Q) = \frac{1}{2}\int \frac{\left(p(x)-q(x)\right)^2}{q(x)}\,d\mu(x)$$
$$D_A^{(1/2)}(P\|Q) = 2\,D_H(P\|Q) = 2\int \left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2 d\mu(x)$$
$$D_A^{(-1)}(P\|Q) = D_N(P\|Q) = \frac{1}{2}\int \frac{\left(p(x)-q(x)\right)^2}{p(x)}\,d\mu(x)$$
For the singular values α = 1 and α = 0, the Alpha-divergence (4) has to be defined as a limiting case, for α → 1 and α → 0, respectively. When this limit is evaluated (using L'Hôpital's rule) for α → 1, we obtain the Kullback–Leibler divergence:
$$D_{KL}(P\|Q) = \lim_{\alpha\to 1} D_A^{(\alpha)}(P\|Q) = \lim_{\alpha\to 1}\int \frac{p^\alpha q^{1-\alpha}\log p - p^\alpha q^{1-\alpha}\log q - p + q}{2\alpha - 1}\,d\mu(x) = \int\left(p\,\log\frac{p}{q} - p + q\right)d\mu(x)$$
with the conventions $0/0 = 0$, $0\log(0) = 0$ and $p/0 = \infty$ for $p > 0$.
From the inequality $p\log p \ge p - 1$, it follows that the KL I-divergence is nonnegative and achieves zero if and only if $P = Q$.
Similarly, for α → 0, we obtain the reverse Kullback–Leibler divergence:
$$D_{KL}(Q\|P) = \lim_{\alpha\to 0} D_A^{(\alpha)}(P\|Q) = \int\left(q\,\log\frac{q}{p} - q + p\right)d\mu(x)$$
Hence, the Alpha-divergence can be evaluated in a more explicit form as
$$D_A^{(\alpha)}(P\|Q) = \begin{cases} \displaystyle \frac{1}{\alpha(\alpha-1)}\int \left[\,q(x)\!\left(\left(\frac{p(x)}{q(x)}\right)^{\!\alpha} - 1\right) - \alpha\left(p(x) - q(x)\right)\right] d\mu(x), & \alpha \neq 0,1 \\[3mm] \displaystyle \int \left(q(x)\log\frac{q(x)}{p(x)} + p(x) - q(x)\right) d\mu(x), & \alpha = 0 \\[3mm] \displaystyle \int \left(p(x)\log\frac{p(x)}{q(x)} - p(x) + q(x)\right) d\mu(x), & \alpha = 1 \end{cases}$$
In fact, the Alpha-divergence smoothly connects the I-divergence D K L ( P | | Q ) with the reverse I-divergence D K L ( Q | | P ) and passes through the Hellinger distance [50]. Moreover, it also smoothly connects the Pearson Chi-square and Neyman Chi-square divergences and passes through the I-divergences [10,11].
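The explicit form above translates directly into a short discrete implementation. The following Python/NumPy sketch handles the singular cases α = 0, 1 by their limits and checks the duality property mentioned later; alpha_divergence is an illustrative name, not code from the paper.

```python
# Discrete Alpha-divergence for positive (not necessarily normalized) vectors,
# with the singular cases alpha = 0, 1 handled by their KL limits.
import numpy as np

def alpha_divergence(p, q, alpha):
    """D_A^(alpha)(P||Q) for positive vectors p, q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if abs(alpha - 1.0) < 1e-12:        # KL I-divergence
        return np.sum(p * np.log(p / q) - p + q)
    if abs(alpha) < 1e-12:              # reverse KL I-divergence
        return np.sum(q * np.log(q / p) - q + p)
    return np.sum(p**alpha * q**(1 - alpha) - alpha * p + (alpha - 1) * q) / (alpha * (alpha - 1))

p = np.array([0.4, 1.2, 0.7])
q = np.array([0.5, 1.0, 0.9])
# duality: D_A^(alpha)(P||Q) = D_A^(1-alpha)(Q||P)
assert np.isclose(alpha_divergence(p, q, 0.3), alpha_divergence(q, p, 0.7))
# the general formula smoothly approaches the alpha = 1 (KL) limit
assert np.isclose(alpha_divergence(p, q, 1 + 1e-6), alpha_divergence(p, q, 1.0), atol=1e-4)
```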
The Alpha-divergence is a special case of the Csiszár–Morimoto f-divergence [17,19] (proposed later also by Ali and Silvey [20]). This class of divergences was also independently defined by Morimoto [51]. The Csiszár–Morimoto f-divergence is associated with any function f(u) that is convex over $(0,\infty)$ and satisfies $f(1) = 0$:
$$D_f(P\|Q) = \int q(x)\, f\!\left(\frac{p(x)}{q(x)}\right) d\mu(x)$$
We define $0\,f(0/0) = 0$ and $0\,f(a/0) = \lim_{t\to 0} t\,f(a/t) = a\lim_{u\to\infty} f(u)/u$. Indeed, taking $f(u) = \left(u^\alpha - \alpha u + \alpha - 1\right)/\left(\alpha^2 - \alpha\right)$ yields formula (4).
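As a small hedged sketch (illustrative function names, Python/NumPy), the generic f-divergence with the generating function above reproduces the Alpha-divergence numerically:

```python
# Csiszar-Morimoto f-divergence (discrete) and the convex generating function
# f(u) = (u**a - a*u + a - 1) / (a**2 - a) that yields the Alpha-divergence (4).
import numpy as np

def f_divergence(p, q, f):
    """D_f(P||Q) = sum_i q_i * f(p_i / q_i) for a convex f with f(1) = 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(q * f(p / q))

def f_alpha(a):
    return lambda u: (u**a - a * u + a - 1) / (a**2 - a)

p = np.array([0.4, 1.2, 0.7])
q = np.array([0.5, 1.0, 0.9])
a = 1.5
d_f = f_divergence(p, q, f_alpha(a))
d_a = np.sum(p**a * q**(1 - a) - a * p + (a - 1) * q) / (a * (a - 1))
assert np.isclose(d_f, d_a)
```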
The Csiszár–Morimoto f-divergence has many beautiful properties [6,17,18,20,22,36,52,53]:
  • Nonnegativity: The Csiszár–Morimoto f-divergence is always nonnegative, and equal to zero if and only if the probability densities p(x) and q(x) coincide. This follows immediately from Jensen's inequality (for normalized densities):
    $$D_f(P\|Q) = \int q\, f\!\left(\frac{p}{q}\right) d\mu(x) \;\ge\; f\!\left(\int q\,\frac{p}{q}\,d\mu(x)\right) = f(1) = 0.$$
  • Generalized entropy: It corresponds to a generalized f-entropy of the form
    $$H_f(P) = -\int f(p(x))\,d\mu(x)$$
    for which the Shannon entropy is a special case, obtained for $f(p) = p\log p$. Note that $H_f$ is concave while f is convex.
  • Convexity: For any $0 \le \lambda \le 1$,
    $$D_f\!\left(\lambda P_1 + (1-\lambda) P_2\,\|\,\lambda Q_1 + (1-\lambda) Q_2\right) \;\le\; \lambda\, D_f(P_1\|Q_1) + (1-\lambda)\, D_f(P_2\|Q_2).$$
  • Scaling: For any positive constant $c > 0$ we have
    $$c\,D_f(P\|Q) = D_{cf}(P\|Q)$$
  • Invariance: The f-divergence is invariant to bijective transformations [6,20]. This means that, when x is transformed to y bijectively by
    y = k ( x )
    the probability distribution P(x) changes to $\tilde{P}(y)$. However,
    $$D_f(P\|Q) = D_f(\tilde{P}\|\tilde{Q})$$
    Additionally,
    $$D_f(P\|Q) = D_{\tilde{f}}(P\|Q)$$
    for $\tilde{f}(u) = f(u) - c\,(u-1)$ with an arbitrary constant c, and
    $$D_f(P\|Q) = D_{f^*}(Q\|P)$$
    where $f^*(u) = u\,f(1/u)$ is called the conjugate function.
  • Symmetricity: For an arbitrary Csiszár–Morimoto f-divergence, it is possible to construct a symmetric divergence using $f_{sym}(u) = f(u) + f^*(u)$.
  • Boundedness: The Csiszár–Morimoto f-divergence for positive measures (densities) is bounded (if the limit exists and is finite) [28,36]:
    $$0 \le D_f(P\|Q) \le \lim_{u\to 0^+}\left[f(u) + u\,f\!\left(\frac{1}{u}\right)\right]$$
    Furthermore [54],
    $$0 \le D_f(P\|Q) \le \int (p-q)\,f'\!\left(\frac{p}{q}\right) d\mu(x)$$
Using fundamental properties of the Csiszár–Morimoto f-divergence we can establish basic properties of the Alpha-divergences [3,6,22,40,42].
The Alpha-divergence (4) has the following basic properties:
  • Convexity: $D_A^{(\alpha)}(P\|Q)$ is convex with respect to both P and Q.
  • Strict Positivity: $D_A^{(\alpha)}(P\|Q) \ge 0$, and $D_A^{(\alpha)}(P\|Q) = 0$ if and only if $P = Q$.
  • Continuity: The Alpha-divergence is a continuous function of the real variable α over the whole range, including the singularities.
  • Duality: $D_A^{(\alpha)}(P\|Q) = D_A^{(1-\alpha)}(Q\|P)$.
  • Exclusive/Inclusive Properties [42]:
    • For $\alpha \to -\infty$, the estimation of q(x) that approximates p(x) is exclusive, that is, $q(x) \le p(x)$ for all x. This means that the minimization of $D_A^{(\alpha)}(P\|Q)$ with respect to q(x) will force q(x) to be an exclusive approximation, i.e., the mass of q(x) will lie within that of p(x) (see details and graphical illustrations in [42]).
    • For $\alpha \to +\infty$, the estimation of q(x) that approximates p(x) is inclusive, that is, $q(x) \ge p(x)$ for all x. In other words, the mass of q(x) includes all the mass of p(x).
  • Zero-forcing and zero-avoiding properties [42]:
    Here, we treat the case where p(x) and q(x) are not necessarily mutually absolutely continuous. In such a case the divergence may diverge to $+\infty$. However, the following two properties hold:
    • For $\alpha \le 0$, the estimation of q(x) that approximates p(x) is zero-forcing (coercive), that is, $p(x) = 0$ forces $q(x) = 0$.
    • For $\alpha \ge 1$, the estimation of q(x) that approximates p(x) is zero-avoiding, that is, $p(x) > 0$ implies $q(x) > 0$.
One of the most important properties of the Alpha-divergence is that it is a convex function with respect to P and Q and has a unique minimum for P = Q (see e.g. [22]).

2.2. Alpha-Rényi Divergence

It is interesting to note that the Alpha-divergence is closely related to the Rényi divergence. We define an Alpha-Rényi divergence as
$$D_{AR}^{(\alpha)}(P\|Q) = \frac{1}{\alpha(\alpha-1)}\log\left[1 + \alpha(\alpha-1)\,D_A^{(\alpha)}(P\|Q)\right] = \frac{1}{\alpha(\alpha-1)}\log\left[\int\left(p^\alpha q^{1-\alpha} - \alpha\,p + (\alpha-1)\,q\right)d\mu(x) + 1\right], \qquad \alpha\in\mathbb{R}\setminus\{0,1\}$$
For α → 0 and α → 1 the Alpha-Rényi divergence simplifies to the Kullback–Leibler divergences:
$$D_{KL}(P\|Q) = \lim_{\alpha\to 1} D_{AR}^{(\alpha)}(P\|Q) = \lim_{\alpha\to 1}\frac{1}{2\alpha-1}\;\frac{\int\left(p^\alpha q^{1-\alpha}(\log p - \log q) - p + q\right)d\mu(x)}{\int\left(p^\alpha q^{1-\alpha} - \alpha\,p + (\alpha-1)\,q\right)d\mu(x) + 1} = \int\left(p\,\log\frac{p}{q} - p + q\right)d\mu(x)$$
and
$$D_{KL}(Q\|P) = \lim_{\alpha\to 0} D_{AR}^{(\alpha)}(P\|Q) = \int\left(q\,\log\frac{q}{p} - q + p\right)d\mu(x)$$
We define the Alpha-Rényi divergence for normalized probability densities as follows:
$$D_{AR}^{(\alpha)}(\bar{P}\|\bar{Q}) = \frac{1}{\alpha(\alpha-1)}\log\int \bar{p}^{\,\alpha}\,\bar{q}^{\,1-\alpha}\,d\mu(x) = \frac{1}{\alpha-1}\log\left[\int \bar{q}\left(\frac{\bar{p}}{\bar{q}}\right)^{\!\alpha} d\mu(x)\right]^{1/\alpha}$$
which corresponds to the Rényi entropy [55,56,57] (see Appendix B)
$$H_R^{(\alpha)}(\bar{P}) = \frac{1}{1-\alpha}\log\int \bar{p}^{\,\alpha}(x)\,d\mu(x)$$
Note that we use a different scaling factor than in the original Rényi divergence [56,57]. In general, the Alpha-Rényi divergence makes sense only for (normalized) probability densities, since otherwise the function (25) can be complex-valued for some positive non-normalized measures.
Furthermore, the Alpha-divergence is convex with respect to positive densities p and q for any parameter $\alpha\in\mathbb{R}$, while the Alpha-Rényi divergence (28) is jointly convex in p and q for $\alpha\in[0,1]$ and is not convex for α > 1. Convexity implies that $D_{AR}^{(\alpha)}$ is increasing in α when p and q are fixed. Actually, the Rényi divergence is increasing in α on the whole set $(0,\infty)$ [58,59,60].

2.3. Extended Family of Alpha-Divergences

There are several ways to extend the asymmetric Alpha-divergence. For example, instead of q we can take the average of q and p, under the assumption that if p and q are similar, they should be "close" to their average, so we can define the modified Alpha-divergence as
$$D_{Am1}^{(\alpha)}(P\|\tilde{Q}) = \frac{1}{\alpha(\alpha-1)}\int\left[\tilde{q}\left(\left(\frac{p}{\tilde{q}}\right)^{\!\alpha} - 1\right) - \alpha\,(p - \tilde{q})\right]d\mu(x)$$
whereas an adjoint Alpha-divergence is given by
$$D_{Am2}^{(\alpha)}(\tilde{Q}\|P) = \frac{1}{\alpha(\alpha-1)}\int\left[p\left(\left(\frac{\tilde{q}}{p}\right)^{\!\alpha} - 1\right) + \alpha\,(p - \tilde{q})\right]d\mu(x)$$
where $\tilde{Q} = (P+Q)/2$ and $\tilde{q} = (p+q)/2$.
For the singular values α = 1 and α = 0, the Alpha-divergences (30) can be evaluated as
$$\lim_{\alpha\to 0} D_{Am1}^{(\alpha)}(P\|\tilde{Q}) = \int\left(\tilde{q}\,\log\frac{\tilde{q}}{p} + p - \tilde{q}\right)d\mu(x) = \int\left(\frac{p+q}{2}\,\log\frac{p+q}{2p} + \frac{p-q}{2}\right)d\mu(x)$$
and
$$\lim_{\alpha\to 1} D_{Am1}^{(\alpha)}(P\|\tilde{Q}) = \int\left(p\,\log\frac{p}{\tilde{q}} - p + \tilde{q}\right)d\mu(x) = \int\left(p\,\log\frac{2p}{p+q} + \frac{q-p}{2}\right)d\mu(x)$$
As examples, we consider the following prominent cases for (31):
1. Triangular Discrimination (TD) [61]
$$D_{Am2}^{(-1)}(\tilde{Q}\|P) = \frac{1}{4}\,D_{TD}(P\|Q) = \frac{1}{4}\int\frac{(p-q)^2}{p+q}\,d\mu(x)$$
2. Relative Jensen–Shannon divergence [62,63,64]
$$\lim_{\alpha\to 0} D_{Am2}^{(\alpha)}(\tilde{Q}\|P) = D_{RJS}(P\|Q) = \int\left(p\,\log\frac{2p}{p+q} + \frac{q-p}{2}\right)d\mu(x)$$
3. Relative Arithmetic-Geometric divergence proposed by Taneja [38,39,43]
$$\lim_{\alpha\to 1} D_{Am2}^{(\alpha)}(\tilde{Q}\|P) = \frac{1}{2}\,D_{RAG}(P\|Q) = \frac{1}{2}\int\left[(p+q)\,\log\frac{p+q}{2p} + p - q\right]d\mu(x)$$
4. Neyman Chi-square divergence
$$D_{Am2}^{(2)}(\tilde{Q}\|P) = \frac{1}{8}\,D_{\chi^2}(P\|Q) = \frac{1}{8}\int\frac{(p-q)^2}{p}\,d\mu(x)$$
The asymmetric Alpha-divergences can be expressed formally as the Csiszár–Morimoto f-divergence, as shown in Table 1 [30].
Table 1. Asymmetric Alpha-divergences and associated convex Csiszár-Morimoto functions [30].
Divergence D A ( α ) ( P | | Q ) = q f ( α ) p q d μ ( x ) Csiszár function f ( α ) ( u ) , u = p / q
1 α ( α 1 ) q p q α 1 α ( p q ) d μ ( x ) , q log q p + p q d μ ( x ) , p log p q p + q d μ ( x ) . u α 1 α ( u 1 ) α ( α 1 ) , α 0 , 1 u 1 log u , α = 0 , 1 u + u log u , α = 1 .
1 α ( α 1 ) p q p α 1 + α ( p q ) d μ ( x ) , p log p q p + q d μ ( x ) , q log q p + p q d μ ( x ) . u 1 α + ( α 1 ) u α α ( α 1 ) , α 0 , 1 , 1 u + u log u , α = 0 u 1 log u , α = 1 .
1 α ( α 1 ) q p + q 2 q α q α p q 2 d μ ( x ) , q log 2 q p + q + p q 2 d μ ( x ) , p + q 2 log p + q 2 q p q 2 d μ ( x ) . u + 1 2 α 1 α u 1 2 α ( α 1 ) , α 0 , 1 , u 1 2 + log 2 u + 1 , α = 0 , 1 u 2 + u + 1 2 log u + 1 2 , α = 1 .
1 α ( α 1 ) p p + q 2 p α p + α p q 2 d μ ( x ) , p log 2 p p + q p q 2 d μ ( x ) , p + q 2 log p + q 2 p + p q 2 d μ ( x ) . u u + 1 2 u α u α 1 u 2 α ( α 1 ) , α 0 , 1 , 1 u 2 u log u + 1 2 u , α = 0 , u 1 2 + u + 1 2 log u + 1 2 u , α = 1 .
1 α 1 ( p q ) p + q 2 q α 1 1 d μ ( x ) , ( p q ) log p + q 2 q d μ ( x ) . ( u 1 ) u + 1 2 α 1 1 α 1 , α 1 , ( u 1 ) log u + 1 2 , α = 1 .
1 α 1 ( q p ) p + q 2 p α 1 1 d μ ( x ) , ( q p ) log p + q 2 p d μ ( x ) . ( 1 u ) u + 1 2 u α 1 1 α 1 , α 1 , ( 1 u ) log u + 1 2 u , α = 1 .

2.4. Symmetrized Alpha-Divergences

The basic Alpha-divergence is asymmetric, that is, $D_A^{(\alpha)}(P\|Q) \neq D_A^{(\alpha)}(Q\|P)$.
Generally, there are two ways to symmetrize divergences: Type-1
$$D_{S1}(P\|Q) = \frac{1}{2}\left[D_A(P\|Q) + D_A(Q\|P)\right]$$
and Type-2
$$D_{S2}(P\|Q) = \frac{1}{2}\left[D_A\!\left(P\,\Big\|\,\frac{P+Q}{2}\right) + D_A\!\left(Q\,\Big\|\,\frac{P+Q}{2}\right)\right]$$
The symmetric Alpha-divergence (Type-1) can be defined as (we omit the scaling factor 1/2 for simplicity)
$$D_{AS1}^{(\alpha)}(P\|Q) = D_A^{(\alpha)}(P\|Q) + D_A^{(\alpha)}(Q\|P) = \int \frac{\left(p^\alpha - q^\alpha\right)\left(p^{1-\alpha} - q^{1-\alpha}\right)}{\alpha\,(1-\alpha)}\,d\mu(x)$$
As special cases, we obtain several well-known symmetric divergences:
1. Symmetric Chi-Squared divergence [54]
$$D_{AS1}^{(-1)}(P\|Q) = D_{AS1}^{(2)}(P\|Q) = \frac{1}{2}\,D_{\chi^2}(P\|Q) = \frac{1}{2}\int\frac{(p-q)^2\,(p+q)}{p\,q}\,d\mu(x)$$
2. Symmetrized KL divergence, also called the J-divergence, corresponding to Jeffreys entropy maximization [65,66]
$$\lim_{\alpha\to 0} D_{AS1}^{(\alpha)}(P\|Q) = \lim_{\alpha\to 1} D_{AS1}^{(\alpha)}(P\|Q) = D_J(P\|Q) = \int (p-q)\,\log\frac{p}{q}\,d\mu(x)$$
An alternative wide class of symmetric divergences can be described by the following symmetric Alpha-divergence (Type-2):
$$D_{AS2}^{(\alpha)}(P\|Q) = D_A^{(\alpha)}\!\left(P\,\Big\|\,\frac{P+Q}{2}\right) + D_A^{(\alpha)}\!\left(Q\,\Big\|\,\frac{P+Q}{2}\right) = \frac{1}{\alpha(\alpha-1)}\int\left[\left(p^{1-\alpha} + q^{1-\alpha}\right)\left(\frac{p+q}{2}\right)^{\!\alpha} - (p+q)\right]d\mu(x)$$
The above measure admits the following prominent cases
  • Triangular Discrimination [30,38]
    $$D_{AS2}^{(-1)}(P\|Q) = \frac{1}{2}\,D_{TD}(P\|Q) = \frac{1}{2}\int\frac{(p-q)^2}{p+q}\,d\mu(x)$$
  • Symmetric Jensen–Shannon divergence [62,64]
    $$\lim_{\alpha\to 0} D_{AS2}^{(\alpha)}(P\|Q) = D_{JS}(P\|Q) = \int\left(p\,\log\frac{2p}{p+q} + q\,\log\frac{2q}{p+q}\right)d\mu(x)$$
    It is worth mentioning that the Jensen–Shannon divergence is a symmetrized and smoothed variant of the Kullback–Leibler divergence, i.e., it can be interpreted as the average of the Kullback–Leibler divergences to the average distribution. For normalized probability densities, the Jensen–Shannon divergence is related to the Shannon entropy in the following sense:
    $$D_{JS} = H_S\!\left(\frac{P+Q}{2}\right) - \frac{H_S(P) + H_S(Q)}{2}$$
    where $H_S(P) = -\int p(x)\,\log p(x)\,d\mu(x)$.
  • Arithmetic-Geometric divergence [39]
    $$\lim_{\alpha\to 1} D_{AS2}^{(\alpha)}(P\|Q) = \int (p+q)\,\log\frac{p+q}{2\sqrt{pq}}\,d\mu(x)$$
  • Symmetric Chi-square divergence [54]
    $$D_{AS2}^{(2)}(P\|Q) = \frac{1}{8}\,D_{\chi^2}(P\|Q) = \frac{1}{8}\int\frac{(p-q)^2\,(p+q)}{p\,q}\,d\mu(x)$$
The above Alpha-divergence is symmetric in its arguments P and Q, and it is well-defined even if P and Q are not absolutely continuous. For example, in the discrete case $D_{AS2}^{(\alpha)}(P\|Q)$ is well-defined even if, for some indices, $p_i$ vanishes without $q_i$ vanishing, or $q_i$ vanishes without $p_i$ vanishing [54]. It is also lower- and upper-bounded; for example, the Jensen–Shannon divergence is bounded between 0 and 2 [36].
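A small discrete sketch (illustrative names, Python/NumPy) of the Type-2 symmetrized Alpha-divergence given above, checking numerically that it approaches the Jensen–Shannon divergence as α → 0:

```python
# Type-2 symmetrized Alpha-divergence (discrete) and its alpha -> 0
# Jensen-Shannon limit; alpha_sym2 and jensen_shannon are illustrative names.
import numpy as np

def alpha_sym2(p, q, alpha):
    """D_AS2^(alpha): ((p^(1-a) + q^(1-a)) * ((p+q)/2)^a - (p+q)) / (a*(a-1))."""
    m = (p + q) / 2.0
    return np.sum((p**(1 - alpha) + q**(1 - alpha)) * m**alpha - (p + q)) / (alpha * (alpha - 1))

def jensen_shannon(p, q):
    m = (p + q) / 2.0
    return np.sum(p * np.log(p / m) + q * np.log(q / m))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.4, 0.3])
assert np.isclose(alpha_sym2(p, q, 1e-6), jensen_shannon(p, q), atol=1e-4)
```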

3. Family of Beta-Divergences

The basic Beta-divergence was introduced by Basu et al. [67] and Minami and Eguchi [15], and many researchers have investigated its applications, including [8,13,30,31,32,33,34,37,68,69,70,71,72] and references therein. The main motivation for investigating the Beta-divergence, at least from a practical point of view, is to develop learning algorithms for clustering, feature extraction, classification and blind source separation that are highly robust with respect to outliers. Until now, the Beta-divergence has been successfully applied to robust PCA (Principal Component Analysis) and clustering [71], robust ICA (Independent Component Analysis) [15,68,69], and robust NMF/NTF [30,70,73,74,75,76].
First, let us define the basic asymmetric Beta-divergence between two unnormalized density functions by
$$D_B^{(\beta)}(P\|Q) = \int\left[p(x)\,\frac{p^{\beta-1}(x) - q^{\beta-1}(x)}{\beta-1} - \frac{p^\beta(x) - q^\beta(x)}{\beta}\right]d\mu(x), \qquad \beta\in\mathbb{R}\setminus\{0,1\}$$
where β is a real number and, for β = 0, 1, the divergence is defined by continuity (see below for more explanation).
For discrete probability measures with mass functions P = [ p 1 , p 2 , , p n ] and Q = [ q 1 , q 2 , , q n ] the discrete Beta-divergence is defined as
$$D_B^{(\beta)}(P\|Q) = \sum_{i=1}^n d_B^{(\beta)}(p_i\|q_i) = \sum_{i=1}^n\left[p_i\,\frac{p_i^{\beta-1} - q_i^{\beta-1}}{\beta-1} - \frac{p_i^\beta - q_i^\beta}{\beta}\right], \qquad \beta\in\mathbb{R}\setminus\{0,1\}$$
The Beta-divergence can be expressed via a generalized KL divergence as follows
$$D_B^{(\beta)}(P\|Q) = \frac{1}{\beta^2}\,D_{GKL}^{(1/\beta)}\!\left(P^\beta\|Q^\beta\right) = \frac{1}{\beta}\int\left[\,-p^\beta\,\log_{(1/\beta)}\!\frac{q^\beta}{p^\beta} - p^\beta + q^\beta\,\right]d\mu(x), \qquad \beta\in\mathbb{R}\setminus\{0,1\}$$
The above representation of the Beta-divergence indicates why it is robust to outliers for some values of the tuning parameter β and, therefore, often better suited than other divergences for specific applications. For example, in sound processing, speech power spectra can be modeled by exponential family densities for which, with β = 0, the Beta-divergence reduces to the Itakura–Saito distance (also called the Itakura–Saito divergence, Itakura–Saito distortion measure, or Burg cross-entropy) [12,13,30,76,77,78,79]. In fact, the Beta-divergence has to be defined in the limiting case β → 0 as the Itakura–Saito distance:
$$D_{IS}(P\|Q) = \lim_{\beta\to 0} D_B^{(\beta)}(P\|Q) = \int\left[\log\frac{q}{p} + \frac{p}{q} - 1\right]d\mu(x)$$
The Itakura–Saito distance was derived from the maximum likelihood (ML) estimation of speech spectra [77]. It is used as a measure of the distortion or goodness of fit between two spectra and is a standard measure in the speech processing community due to the good perceptual properties of the reconstructed signals, since it is scale-invariant. Due to this scale invariance, low-energy components of p have the same relative importance as high-energy ones. This is especially important in scenarios in which the coefficients of p have a large dynamic range, such as in short-term audio spectra [30,76,79].
It is also interesting to note that, for β = 2 , we obtain the standard squared Euclidean ( L 2 -norm) distance, while for the singular case β = 1 , we obtain the KL I-divergence:
$$D_{KL}(P\|Q) = \lim_{\beta\to 1} D_B^{(\beta)}(P\|Q) = \int\left(p\,\log\frac{p}{q} - p + q\right)d\mu(x)$$
Note that we used here the following formulas: $\lim_{\beta\to 0}\frac{p^\beta - q^\beta}{\beta} = \log(p/q)$ and $\lim_{\beta\to 0}\frac{p^\beta - 1}{\beta} = \log p$.
Hence, the Beta-divergence can be represented in a more explicit form:
$$D_B^{(\beta)}(P\|Q) = \begin{cases}\displaystyle \frac{1}{\beta(\beta-1)}\int\left[p^\beta(x) + (\beta-1)\,q^\beta(x) - \beta\,p(x)\,q^{\beta-1}(x)\right]d\mu(x), & \beta\neq 0,1\\[3mm] \displaystyle\int\left[p(x)\,\log\frac{p(x)}{q(x)} - p(x) + q(x)\right]d\mu(x), & \beta = 1\\[3mm] \displaystyle\int\left[\log\frac{q(x)}{p(x)} + \frac{p(x)}{q(x)} - 1\right]d\mu(x), & \beta = 0\end{cases}$$
We have shown that the basic Beta-divergence smoothly connects the Itakura–Saito distance and the squared Euclidean L 2 -norm distance and passes through the KL I-divergence D K L ( P | | Q ) . Such a parameterized connection is impossible in the family of the Alpha-divergences.
The choice of the tuning parameter β depends on the statistical distribution of the data. For example, the optimal choice of β for the normal distribution is β = 2, for the gamma distribution β = 0, for the Poisson distribution β = 1, and for the compound Poisson distribution β ∈ (1,2) [15,31,32,33,34,68,69].
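The explicit form above can be sketched directly in code. The following Python/NumPy snippet (illustrative names, not code from the paper) handles the β = 0 and β = 1 limits and checks two of the special cases discussed in the text:

```python
# Discrete Beta-divergence with the singular cases beta = 0 (Itakura-Saito)
# and beta = 1 (KL I-divergence) handled by their limits.
import numpy as np

def beta_divergence(p, q, beta):
    """D_B^(beta)(P||Q) for positive vectors p, q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if abs(beta - 1.0) < 1e-12:         # KL I-divergence
        return np.sum(p * np.log(p / q) - p + q)
    if abs(beta) < 1e-12:               # Itakura-Saito distance
        return np.sum(np.log(q / p) + p / q - 1)
    return np.sum(p**beta + (beta - 1) * q**beta - beta * p * q**(beta - 1)) / (beta * (beta - 1))

p = np.array([0.4, 1.2, 0.7])
q = np.array([0.5, 1.0, 0.9])
# beta = 2 gives (half) the squared Euclidean distance
assert np.isclose(beta_divergence(p, q, 2.0), 0.5 * np.sum((p - q)**2))
# the Itakura-Saito case (beta = 0) is scale-invariant
assert np.isclose(beta_divergence(p, q, 0.0), beta_divergence(3 * p, 3 * q, 0.0))
```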
It is important to note that the Beta divergence can be derived from the Bregman divergence. The Bregman divergence is a pseudo-distance for measuring discrepancy between two values of density functions p ( x ) and q ( x ) [9,16,80]:
$$d_\Phi(p\|q) = \Phi(p) - \Phi(q) - (p-q)\,\Phi'(q)$$
where Φ(t) is a strictly convex real-valued function and Φ'(q) is its derivative evaluated at q. The total discrepancy between two functions p(x) and q(x) is given by
$$D_\Phi(P\|Q) = \int\left[\Phi(p(x)) - \Phi(q(x)) - \left(p(x)-q(x)\right)\Phi'(q(x))\right]d\mu(x)$$
and it corresponds to the Φ-entropy of a continuous probability measure $p(x)\ge 0$ defined by
$$H_\Phi(P) = -\int \Phi(p(x))\,d\mu(x)$$
Remark: The concepts of divergence and entropy are closely related. Let $Q_0$ be a uniform distribution for which
$$Q_0(x) = \mathrm{const}$$
(when x ranges over an infinite space such as $\mathbb{R}^n$, this might not be a probability distribution, but it is still a measure). Then,
$$H(P) = -D(P\|Q_0) + \mathrm{const}$$
is regarded as the related entropy. This is the negative of the divergence from P to the uniform distribution. On the other hand, given a concave entropy H(P), we can define the related divergence as the Bregman divergence derived from the convex function $\Phi(P) = -H(P)$.
If x takes discrete values on a certain space, the separable Bregman divergence is defined as $D_\Phi(P\|Q) = \sum_{i=1}^n d_\Phi(p_i\|q_i) = \sum_{i=1}^n\left[\Phi(p_i) - \Phi(q_i) - (p_i - q_i)\,\Phi'(q_i)\right]$, where Φ'(q) denotes the derivative with respect to q. In the general (nonseparable) case, for two vectors P and Q, the Bregman divergence is defined as $D_\Phi(P\|Q) = \Phi(P) - \Phi(Q) - (P-Q)^T\nabla\Phi(Q)$, where ∇Φ(Q) is the gradient of Φ evaluated at Q.
Note that D Φ ( P | | Q ) equals the tail of the first-order Taylor expansion of Φ ( P ) at Q . Bregman divergences include many prominent dissimilarity measures like the squared Euclidean distance, the Mahalanobis distance, the generalized Kullback–Leibler divergence and the Itakura–Saito distance.
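As a hedged sketch of the separable Bregman divergence (Python/NumPy, illustrative names), the following snippet checks that suitable choices of Φ recover the squared Euclidean distance, the generalized KL divergence, and the Itakura–Saito distance mentioned above:

```python
# Separable (discrete) Bregman divergence for a given convex function phi
# and its derivative dphi.
import numpy as np

def bregman_divergence(p, q, phi, dphi):
    """D_Phi(P||Q) = sum_i [phi(p_i) - phi(q_i) - (p_i - q_i) * phi'(q_i)]."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(phi(p) - phi(q) - (p - q) * dphi(q))

p = np.array([0.4, 1.2, 0.7])
q = np.array([0.5, 1.0, 0.9])

# phi(t) = t^2 gives the squared Euclidean distance
assert np.isclose(bregman_divergence(p, q, lambda t: t**2, lambda t: 2*t), np.sum((p - q)**2))
# phi(t) = t*log(t) - t + 1 gives the generalized KL (I-) divergence
d_kl = np.sum(p * np.log(p / q) - p + q)
assert np.isclose(bregman_divergence(p, q, lambda t: t*np.log(t) - t + 1, lambda t: np.log(t)), d_kl)
# phi(t) = t - log(t) - 1 gives the Itakura-Saito distance
d_is = np.sum(np.log(q / p) + p / q - 1)
assert np.isclose(bregman_divergence(p, q, lambda t: t - np.log(t) - 1, lambda t: 1 - 1/t), d_is)
```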
It is easy to check that the Beta-divergence can be generated from the Bregman divergence using the following strictly convex continuous function [36,81]
$$\Phi(t) = \begin{cases} \dfrac{1}{\beta(\beta-1)}\left(t^\beta - \beta\,t + \beta - 1\right), & \beta \neq 0,1 \\[2mm] t\log t - t + 1, & \beta = 1 \\[2mm] t - \log t - 1, & \beta = 0 \end{cases}$$
It is also interesting to note that the same generating function $f(u) = \Phi(t)|_{t=u}$ (with α = β) can be used to generate the Alpha-divergence via the Csiszár–Morimoto f-divergence $D_f(P\|Q) = \int q\,f(p/q)\,d\mu(x)$.
Furthermore, the Beta-divergence can be generated by a generalized f-divergence:
$$\tilde{D}_f(P\|Q) = \int q^\beta\, f(p/q)\,d\mu(x) = \int q^\beta\, \tilde{f}\!\left(\frac{p^\beta}{q^\beta}\right) d\mu(x)$$
where $f(u) = \Phi(t)|_{t=u}$ with $u = p/q$, and
$$\tilde{f}(\tilde{u}) = \frac{1}{1-\beta}\left(\tilde{u}^{1/\beta} - \frac{1}{\beta}\,\tilde{u} + \frac{1}{\beta} - 1\right)$$
is a convex generating function with $\tilde{u} = p^\beta/q^\beta$.
The links between the Bregman and Beta-divergences are important, since many well-known fundamental properties of the Bregman divergence are also valid for the Beta-divergence [28,82]:
  • Convexity: The Bregman divergence $D_\Phi(P\|Q)$ is always convex in the first argument P, but often not in the second argument Q.
  • Nonnegativity: The Bregman divergence is nonnegative, $D_\Phi(P\|Q) \ge 0$, with equality to zero if and only if $P = Q$.
  • Linearity: Any positive linear combination of Bregman divergences is also a Bregman divergence, i.e.,
    $$D_{c_1\Phi_1 + c_2\Phi_2}(P\|Q) = c_1\,D_{\Phi_1}(P\|Q) + c_2\,D_{\Phi_2}(P\|Q)$$
    where c 1 , c 2 are positive constants and Φ 1 , Φ 2 are strictly convex functions.
  • Invariance: The functional Bregman divergence for positive measures P and Q is invariant under affine transformations $\Gamma(Q) = \Phi(Q) + \int a(x)\,q(x)\,d\mu(x) + c$, i.e., with respect to added linear and arbitrary constant terms [28,82]:
    $$D_\Gamma(P\|Q) = D_\Phi(P\|Q)$$
  • The three-point property generalizes the “Law of Cosines”:
    $$D_\Phi(P\|Q) = D_\Phi(P\|Z) + D_\Phi(Z\|Q) - (P-Z)^T\left(\frac{\delta}{\delta Q}\Phi(Q) - \frac{\delta}{\delta Z}\Phi(Z)\right)$$
  • Generalized Pythagoras Theorem:
    $$D_\Phi(P\|Q) \ge D_\Phi\!\left(P\,\|\,P_\Omega(Q)\right) + D_\Phi\!\left(P_\Omega(Q)\,\|\,Q\right)$$
    where $P_\Omega(Q) = \arg\min_{\omega\in\Omega} D_\Phi(\omega\|Q)$ is the Bregman projection onto the convex set Ω and $P\in\Omega$. When Ω is an affine set, the relation holds with equality. This is the generalized Pythagorean relation in terms of information geometry.
For the Beta-divergence (46), the first- and second-order Fréchet derivatives with respect to Q are given by [28,76]
$$\frac{\delta D_B^{(\beta)}}{\delta q} = q^{\beta-2}\,(q - p), \qquad \frac{\delta^2 D_B^{(\beta)}}{\delta q^2} = q^{\beta-3}\left[(\beta-1)\,q - (\beta-2)\,p\right]$$
Hence, we conclude that the Beta-divergence has a single global minimum, equal to zero, for P = Q, and that it increases with |p − q|. Moreover, the Beta-divergence is strictly convex in $q(x) > 0$ only for $\beta\in[1,2]$. For β = 0 (the Itakura–Saito distance), it is convex if $\int q^{-3}\,(2p - q)\,d\mu(x) \ge 0$, i.e., if $q/p \le 2$ [78].

3.1. Generation of Family of Beta-divergences Directly from Family of Alpha-Divergences

It should be noted that in the original works [15,67,68,69] only the Beta-divergence for β ≥ 1 was considered. Moreover, they did not consider the whole range of non-positive values of the parameter β, especially β = 0, for which we have the important Itakura–Saito distance. Furthermore, similarly to the Alpha-divergences, there exists an associated family of Beta-divergences and, as special cases, a family of generalized Itakura–Saito-like distances. The fundamental question arises: how can we generate a whole family of Beta-divergences, i.e., what are the relationships or correspondences between the Alpha- and Beta-divergences? In fact, on the basis of our considerations above, it is easy to find that the complete set of Beta-divergences can be obtained from the Alpha-divergences and, conversely, the Alpha-divergences can be obtained directly from the Beta-divergences.
In order to obtain a Beta-divergence from the corresponding (associated) Alpha-divergence, we need to apply the following nonlinear transformations:
$$p \;\rightarrow\; p^\beta, \qquad q \;\rightarrow\; q^\beta \qquad \text{and} \qquad \alpha = \beta^{-1}$$
For example, using these transformations (substitutions) in the basic asymmetric Alpha-divergence (4) and assuming that $\alpha = \beta^{-1}$, we obtain the following divergence
$$D_A^{(\beta)}(P\|Q) = \beta^2\int\frac{p^\beta - \beta\,p\,q^{\beta-1} + (\beta-1)\,q^\beta}{\beta\,(\beta-1)}\,d\mu(x)$$
Observe that, by ignoring the scaling factor β², we obtain the basic asymmetric Beta-divergence defined by Equation (46).
In fact, there exists the same link between the whole family of Alpha-divergences and the family of Beta-divergences (see Table 2).
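This correspondence is easy to verify numerically. The following Python/NumPy sketch (illustrative names) checks that applying the substitutions $p\to p^\beta$, $q\to q^\beta$, $\alpha = 1/\beta$ to the basic Alpha-divergence yields β² times the basic Beta-divergence:

```python
# Numerical check of the Alpha -> Beta correspondence:
#   D_A^(1/beta)(P^beta || Q^beta) = beta^2 * D_B^(beta)(P || Q)
import numpy as np

def alpha_div(p, q, a):
    return np.sum(p**a * q**(1 - a) - a * p + (a - 1) * q) / (a * (a - 1))

def beta_div(p, q, b):
    return np.sum(p**b + (b - 1) * q**b - b * p * q**(b - 1)) / (b * (b - 1))

p = np.array([0.4, 1.2, 0.7])
q = np.array([0.5, 1.0, 0.9])
beta = 1.7
assert np.isclose(alpha_div(p**beta, q**beta, 1.0 / beta), beta**2 * beta_div(p, q, beta))
```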
For example, we can derive a symmetric Beta-divergence from the symmetric Alpha-divergence (Type-1) (38):
$$D_{BS1}^{(\beta)}(P\|Q) = D_B^{(\beta)}(P\|Q) + D_B^{(\beta)}(Q\|P) = \frac{1}{\beta-1}\int (p-q)\left(p^{\beta-1} - q^{\beta-1}\right)d\mu(x)$$
It is interesting to note that, in special cases, we obtain:
Symmetric KL or J-divergence [65]:
$$D_{BS1}^{(1)} = \lim_{\beta\to 1} D_{BS1}^{(\beta)} = \int (p-q)\,\log\frac{p}{q}\,d\mu(x)$$
and symmetric Chi-square divergence [54]
$$D_{BS1}^{(0)} = \lim_{\beta\to 0} D_{BS1}^{(\beta)} = \int \frac{(p-q)^2}{p\,q}\,d\mu(x)$$
Analogously, from the symmetric Alpha-divergence (Type-2), we obtain
$$D_{BS2}^{(\beta)}(P\|Q) = \frac{1}{\beta-1}\int\left[p^\beta + q^\beta - \left(p^{\beta-1} + q^{\beta-1}\right)\left(\frac{p^\beta+q^\beta}{2}\right)^{\!1/\beta}\right]d\mu(x)$$
Table 2. Family of Alpha-divergences and corresponding Beta-divergences. We applied the following transformations: $p\to p^\beta$, $q\to q^\beta$, $\alpha = 1/\beta$. Note that $D_A^{(1)}(P\|Q) = D_B^{(1)}(P\|Q)$; for α = β = 1 they represent the extended family of KL divergences. Furthermore, the Beta-divergences for β = 0 describe the family of generalized (extended) Itakura–Saito-like distances.
Alpha-divergence D A ( α ) ( P | | Q ) Beta-divergence D B ( β ) ( P | | Q )
1 α ( α 1 ) p α q 1 α α p + ( α 1 ) q d μ ( x ) , q log q p + p q d μ ( x ) , α = 0 p log p q p + q d μ ( x ) , α = 1 1 β ( β 1 ) p β + ( β 1 ) q β β p q β 1 d μ ( x ) , log q p + p q 1 d μ ( x ) , β = 0 p log p q p + q d μ ( x ) , β = 1
( p + q 2 ) α q 1 α α 2 p + ( α 2 1 ) q d μ ( x ) α ( α 1 ) , q log 2 q p + q + p q 2 d μ ( x ) , α = 0 p + q 2 log p + q 2 q p q 2 d μ ( x ) , α = 1 p β + ( 2 β 1 ) q β 2 β q β 1 p β + q β 2 1 β d μ ( x ) β ( β 1 ) , log ( q p ) + 2 ( p q 1 ) d μ ( x ) , β = 0 p + q 2 log p 1 + q 2 q p q 2 d μ ( x ) , β = 1
1 α 1 ( p q ) p + q 2 q α 1 1 d μ ( x ) , ( p q ) 2 p + q d μ ( x ) , α = 0 ( p q ) log p + q 2 q d μ ( x ) , α = 1 1 β ( β 1 ) ( p β q β ) 1 2 β 1 β q β 1 p β + q β β 1 β d μ ( x ) , p q 1 log p q d μ ( x ) , β = 0 ( p q ) log p + q 2 q d μ ( x ) , β = 1
1 α ( α 1 ) p α q 1 α + p 1 α q α p q d μ ( x ) , ( p q ) log p q d μ ( x ) , α = 0 , 1 1 β 1 p β + q β p q β 1 p β 1 q d μ ( x ) , ( p q ) 2 p q d μ ( x ) , β = 0 ( p q ) log p q d μ ( x ) , β = 1
1 1 α p + q 2 p α + q α 2 1 α d μ ( x ) , p q ) 2 2 d μ ( x ) , α = 0 H S P + Q 2 H S ( P ) + H S ( Q ) 2 , α = 1 1 β ( β 1 ) p β + q β 2 p + q 2 β d μ ( x ) , log p + q 2 p q d μ ( x ) , β = 0 H S P + Q 2 H S ( P ) + H S ( Q ) 2 , β = 1
( p 1 α + q 1 α ) p + q 2 α p q d μ ( x ) α ( α 1 ) , p log 2 p p + q + q log 2 q p + q d μ ( x ) , α = 0 ( p + q ) log p + q 2 p q d μ ( x ) , α = 1 1 β 1 p β + q β ( p β 1 + q β 1 ) p β + q β 2 1 β d μ ( x ) , p q + q p 2 d μ ( x ) , β = 0 ( p + q ) log p + q 2 p q d μ ( x ) , β = 1
It should be noted that in special cases, we obtain:
The Arithmetic-Geometric divergence [38,39]:
$$D_{BS2}^{(1)} = \lim_{\beta\to 1} D_{BS2}^{(\beta)} = \int (p+q)\,\log\frac{p+q}{2\sqrt{pq}}\,d\mu(x),$$
and a symmetrized Itakura–Saito distance (also called the COSH distance) [12,13]:
$$D_{BS2}^{(0)} = \lim_{\beta\to 0} D_{BS2}^{(\beta)} = \int\left(\sqrt{\frac{p}{q}} + \sqrt{\frac{q}{p}} - 2\right)d\mu(x) = \int\frac{\left(\sqrt{p} - \sqrt{q}\right)^2}{\sqrt{p\,q}}\,d\mu(x)$$

4. Family of Gamma-Divergences Generated from Beta- and Alpha-Divergences

A basic asymmetric Gamma-divergence has been proposed very recently by Fujisawa and Eguchi [35] as a similarity measure that is very robust with respect to outliers:
$$D_G^{(\gamma)}(P\|Q) = \frac{1}{\gamma(\gamma-1)}\log\left[\frac{\left(\int p^\gamma(x)\,d\mu(x)\right)\left(\int q^\gamma(x)\,d\mu(x)\right)^{\gamma-1}}{\left(\int p(x)\,q^{\gamma-1}(x)\,d\mu(x)\right)^{\gamma}}\right]$$
The Gamma-divergence employs a nonlinear (log) transformation of cumulative patterns, and the terms p and q are not separable. The main motivation for employing the Gamma-divergence is that it allows "super"-robust estimation of some parameters in the presence of outliers. In fact, the authors demonstrated that the bias caused by outliers can become sufficiently small even in the case of very heavy contamination, and that some contamination can be naturally and automatically neglected [35,37].
In this paper, we show that we can formulate the whole family of Gamma-divergences generated directly from Alpha- and also Beta-divergences. In order to obtain a robust Gamma-divergence from an Alpha- or Beta-divergence, we use the following transformations (see also Table 3):
$$c_0\int p^{c_1}(x)\,q^{c_2}(x)\,d\mu(x) \;\;\longrightarrow\;\; \log\left[\int p^{c_1}(x)\,q^{c_2}(x)\,d\mu(x)\right]^{c_0}$$
where $c_0$, $c_1$ and $c_2$ are real constants and γ = α.
Applying the above transformation to all monomials to the basic Alpha-divergence (4), we obtain a new divergence referred to as here the Alpha-Gamma-divergence:
$$D_{AG}^{(\gamma)}(P\|Q) = \frac{\log\int p^\gamma(x)\,q^{1-\gamma}(x)\,d\mu(x)}{\gamma(\gamma-1)} - \frac{\log\int p(x)\,d\mu(x)}{\gamma-1} + \frac{\log\int q(x)\,d\mu(x)}{\gamma} = \frac{1}{\gamma(\gamma-1)}\log\frac{\int p^\gamma(x)\,q^{1-\gamma}(x)\,d\mu(x)}{\left(\int p(x)\,d\mu(x)\right)^{\gamma}\left(\int q(x)\,d\mu(x)\right)^{1-\gamma}}$$
The asymmetric Alpha-Gamma-divergence has the following important properties:
  • $D_{AG}^{(\gamma)}(P\|Q) \ge 0$. The equality holds if and only if $P = cQ$ for a positive constant c.
  • It is scale-invariant for any value of γ, that is, $D_{AG}^{(\gamma)}(P\|Q) = D_{AG}^{(\gamma)}(c_1 P\|c_2 Q)$, for arbitrary positive scaling constants $c_1$, $c_2$.
  • The Alpha-Gamma divergence is equivalent to the normalized Alpha-Rényi divergence (25), i.e.,
    $$D_{AG}^{(\gamma)}(P\|Q) = \frac{1}{\gamma(\gamma-1)}\log\frac{\int p^\gamma(x)\,q^{1-\gamma}(x)\,d\mu(x)}{\left(\int p(x)\,d\mu(x)\right)^{\gamma}\left(\int q(x)\,d\mu(x)\right)^{1-\gamma}} = \frac{1}{\gamma(\gamma-1)}\log\int \bar{p}^{\,\gamma}(x)\,\bar{q}^{\,1-\gamma}(x)\,d\mu(x) = \frac{1}{\gamma-1}\log\left[\int \bar{q}(x)\left(\frac{\bar{p}(x)}{\bar{q}(x)}\right)^{\!\gamma} d\mu(x)\right]^{1/\gamma} = D_{AR}^{(\gamma)}(\bar{P}\|\bar{Q})$$
    for α = γ and the normalized densities $\bar{p}(x) = p(x)/\int p(x)\,d\mu(x)$ and $\bar{q}(x) = q(x)/\int q(x)\,d\mu(x)$.
  • It can be expressed via generalized weighted mean:
    $$D_{AG}^{(\gamma)}(P\|Q) = \frac{1}{\gamma-1}\log \bar{M}_\gamma\!\left\{\bar{q};\frac{\bar{p}}{\bar{q}}\right\}$$
    where the weighted mean is defined as $\bar{M}_\gamma\!\left\{\bar{q};\frac{\bar{p}}{\bar{q}}\right\} = \left[\int \bar{q}(x)\left(\frac{\bar{p}(x)}{\bar{q}(x)}\right)^{\!\gamma} d\mu(x)\right]^{1/\gamma}$.
  • As γ → 1, the Alpha-Gamma-divergence becomes the Kullback–Leibler divergence:
    $$\lim_{\gamma\to 1} D_{AG}^{(\gamma)}(P\|Q) = D_{KL}(\bar{P}\|\bar{Q}) = \int \bar{p}(x)\,\log\frac{\bar{p}(x)}{\bar{q}(x)}\,d\mu(x)$$
  • For γ → 0, the Alpha-Gamma-divergence reduces to the reverse Kullback–Leibler divergence:
    $$\lim_{\gamma\to 0} D_{AG}^{(\gamma)}(P\|Q) = D_{KL}(\bar{Q}\|\bar{P}) = \int \bar{q}(x)\,\log\frac{\bar{q}(x)}{\bar{p}(x)}\,d\mu(x)$$
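A discrete numerical sketch of the Alpha-Gamma-divergence and two of the properties listed above (scale invariance and the γ → 1 KL limit), with illustrative Python/NumPy names:

```python
# Discrete Alpha-Gamma-divergence and checks of scale invariance and the
# gamma -> 1 Kullback-Leibler limit.
import numpy as np

def alpha_gamma_divergence(p, q, g):
    """D_AG^(gamma) = log( sum p^g q^(1-g) / ((sum p)^g (sum q)^(1-g)) ) / (g*(g-1))."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    num = np.sum(p**g * q**(1 - g))
    return np.log(num / (np.sum(p)**g * np.sum(q)**(1 - g))) / (g * (g - 1))

p = np.array([0.4, 1.2, 0.7])
q = np.array([0.5, 1.0, 0.9])
# scale invariance with respect to both arguments
assert np.isclose(alpha_gamma_divergence(p, q, 0.6), alpha_gamma_divergence(2.5 * p, 0.3 * q, 0.6))
# gamma -> 1 recovers the KL divergence of the normalized densities
pn, qn = p / p.sum(), q / q.sum()
kl = np.sum(pn * np.log(pn / qn))
assert np.isclose(alpha_gamma_divergence(p, q, 1 - 1e-6), kl, atol=1e-4)
```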
In a similar way, we can generate the whole family of Alpha-Gamma-divergences from the family of Alpha-divergences, which are summarized in Table 3.
It is interesting to note that, using the above transformations (73) with γ = β, we can generate another family of Gamma-divergences, referred to as Beta-Gamma divergences.
In particular, using the nonlinear transformations (73) for the basic asymmetric Beta-divergence (46), we obtain the Gamma-divergence (72) [35], referred to here as a Beta-Gamma-divergence ($D_G^{(\gamma)}(P\|Q) = D_{BG}^{(\gamma)}(P\|Q)$):
$$D_{BG}^{(\gamma)}(P\|Q) = \frac{1}{\gamma(\gamma-1)}\left[\log\!\int p^\gamma\,d\mu(x) + (\gamma-1)\log\!\int q^\gamma\,d\mu(x) - \gamma\log\!\int p\,q^{\gamma-1}\,d\mu(x)\right] = \log\frac{\left(\int p\,q^{\gamma-1}\,d\mu(x)\right)^{\frac{1}{1-\gamma}}\left(\int q^\gamma\,d\mu(x)\right)^{\frac{1}{\gamma}}}{\left(\int p^\gamma\,d\mu(x)\right)^{\frac{1}{\gamma(1-\gamma)}}} = \frac{1}{1-\gamma}\log\int \tilde{q}^{\,\gamma}(x)\,\frac{\tilde{p}(x)}{\tilde{q}(x)}\,d\mu(x)$$
where
$$\tilde{p}(x) = \frac{p(x)}{\left(\int p^\gamma(x)\,d\mu(x)\right)^{1/\gamma}}, \qquad \tilde{q}(x) = \frac{q(x)}{\left(\int q^\gamma(x)\,d\mu(x)\right)^{1/\gamma}}$$
Analogously, for discrete densities we can express the Beta-Gamma-divergence via generalized power means (also known as Hölder means) as follows
$$D_{BG}^{(\gamma)}(P\|Q) = \log\frac{\left(\sum_{i=1}^n p_i\,q_i^{\gamma-1}\right)^{\frac{1}{1-\gamma}}\left(\sum_{i=1}^n q_i^\gamma\right)^{\frac{1}{\gamma}}}{\left(\sum_{i=1}^n p_i^\gamma\right)^{\frac{1}{\gamma(1-\gamma)}}}$$
Hence,
$$D_{BG}^{(\gamma)}(P\|Q) = \log\frac{\left(\sum_{i=1}^n \frac{p_i}{q_i}\,q_i^{\gamma}\right)^{\frac{1}{1-\gamma}}\left(\sum_{i=1}^n q_i^\gamma\right)^{\frac{1}{\gamma}}}{\left(\sum_{i=1}^n p_i^{\gamma}\right)^{\frac{1}{\gamma(1-\gamma)}}} = \frac{1}{1-\gamma}\log\frac{\frac{1}{n}\sum_{i=1}^n \frac{p_i}{q_i}\,q_i^{\gamma}}{\left(\frac{1}{n}\sum_{i=1}^n p_i^\gamma\right)^{\frac{1}{\gamma}}\left(\frac{1}{n}\sum_{i=1}^n q_i^\gamma\right)^{\frac{\gamma-1}{\gamma}}}$$
and finally
$$D_{BG}^{(\gamma)}(P\|Q) = \frac{1}{1-\gamma}\log\frac{\frac{1}{n}\sum_{i=1}^n \frac{p_i}{q_i}\,q_i^{\gamma}}{M_\gamma\{p_i\}\,\big(M_\gamma\{q_i\}\big)^{\gamma-1}}$$
where the (generalized) power mean of the order-γ is defined as
$$M_\gamma\{p_i\} = \left(\frac{1}{n}\sum_{i=1}^n p_i^{\gamma}\right)^{\!1/\gamma}$$
In special cases we obtain the standard harmonic mean (γ = −1), geometric mean (γ → 0), arithmetic mean (γ = 1), and quadratic mean (γ = 2), with the following relations:
$$M_{-\infty}\{p_i\} \le M_{-1}\{p_i\} \le M_0\{p_i\} \le M_1\{p_i\} \le M_2\{p_i\} \le M_{\infty}\{p_i\},$$
with $M_0\{p_i\} = \lim_{\gamma\to 0} M_\gamma\{p_i\} = \left(\prod_{i=1}^n p_i\right)^{1/n}$, $M_{-\infty}\{p_i\} = \min_i\{p_i\}$ and $M_{\infty}\{p_i\} = \max_i\{p_i\}$.
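The power-mean representation above translates directly into code. The following Python/NumPy sketch (illustrative names) implements the power mean and the discrete Beta-Gamma-divergence expressed through it, and checks its scale invariance and its vanishing for proportional densities:

```python
# Power (Hoelder) mean and the discrete Beta-Gamma-divergence expressed via it.
import numpy as np

def power_mean(x, g):
    """M_gamma{x_i} = (mean(x_i^gamma))^(1/gamma); geometric mean for gamma -> 0."""
    x = np.asarray(x, float)
    if abs(g) < 1e-12:
        return np.exp(np.mean(np.log(x)))
    return np.mean(x**g)**(1.0 / g)

def beta_gamma_divergence(p, q, g):
    """D_BG^(gamma) = log( mean((p/q)*q^g) / (M_g{p} * M_g{q}^(g-1)) ) / (1 - g)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    num = np.mean((p / q) * q**g)
    return np.log(num / (power_mean(p, g) * power_mean(q, g)**(g - 1))) / (1.0 - g)

p = np.array([0.4, 1.2, 0.7])
q = np.array([0.5, 1.0, 0.9])
# scale invariance: D_BG(c1*P || c2*Q) = D_BG(P || Q)
assert np.isclose(beta_gamma_divergence(p, q, 1.5), beta_gamma_divergence(2.0 * p, 0.5 * q, 1.5))
# the divergence vanishes when P is proportional to Q
assert np.isclose(beta_gamma_divergence(3.0 * q, q, 1.5), 0.0)
```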
Table 3. Family of Alpha-divergences and corresponding robust Alpha-Gamma-divergences; $\bar{p}(x) = p(x)/\int p(x)\,d\mu(x)$, $\bar{q}(x) = q(x)/\int q(x)\,d\mu(x)$. For γ = 0, 1 we obtain generalized robust KL divergences. Note that the Gamma-divergences are expressed compactly via generalized power means (see also Table 4 for more direct representations).
Alpha-divergence D A ( α ) ( P | | Q ) Robust Alpha-Gamma-divergence D A G ( γ ) ( c P | | c Q )
1 α ( α 1 ) p α q 1 α α p + ( α 1 ) q d μ ( x ) , q log q p + p q d μ ( x ) , α = 0 p log p q p + q d μ ( x ) , α = 1 1 γ 1 log q ¯ p ¯ q ¯ γ d μ ( x ) 1 γ , q ¯ log q ¯ p ¯ d μ ( x ) , γ = 0 p ¯ log p ¯ q ¯ d μ ( x ) , γ = 1
1 α ( α 1 ) p 1 α q α + ( α 1 ) p α q d μ ( x ) , p log p q p + q d μ ( x ) , α = 0 q log q p + p q d μ ( x ) , α = 1 1 γ 1 log p ¯ q ¯ p ¯ γ d μ ( x ) 1 γ , p ¯ log p ¯ q ¯ d μ ( x ) , γ = 0 q ¯ log q ¯ p ¯ d μ ( x ) , γ = 1
p + q 2 α q 1 α α 2 p ( 1 α 2 ) q d μ ( x ) α ( α 1 ) , q log 2 q p + q + p q 2 d μ ( x ) , α = 0 p + q 2 log p + q 2 q p q 2 d μ ( x ) , α = 1 1 γ 1 log q d μ ( x ) p d μ ( x ) 1 2 q ¯ p + q 2 q γ d μ ( x ) 1 γ ( q ¯ log 2 q p + q d μ ( x ) , γ = 0 p ¯ + q ¯ 2 log p + q 2 q d μ ( x ) , γ = 1
1 α 1 p + q 2 q α 1 1 ( p q ) d μ ( x ) , ( p q ) log p + q 2 q d μ ( x ) , α = 1 log ( p ¯ p + q 2 q γ 1 ) d μ ( x ) ( q ¯ p + q 2 q γ 1 ) d μ ( x ) 1 γ 1 ( p ¯ q ¯ ) log p + q 2 q d μ ( x ) , γ = 1
1 α ( α 1 ) p α q 1 α + p 1 α q α p q d μ ( x ) , ( p q ) log p q d μ ( x ) , α = 0 , 1 1 γ 1 log q ¯ p q γ d μ ( x ) p ¯ q p γ d μ ( x ) 1 γ ( p ¯ q ¯ ) log p q d μ ( x ) , γ = 0 , 1
( p 1 α + q 1 α ) p + q 2 α p q d μ ( x ) α ( α 1 ) , p log 2 p p + q + q log 2 q p + q d μ ( x ) , α = 0 ( p + q ) log p + q 2 p q d μ ( x ) , α = 1 1 γ 1 log p ¯ p + q 2 p γ d μ ( x ) q ¯ p + q 2 q γ d μ ( x ) 1 γ p ¯ log 2 p p + q + q ¯ log 2 q p + q d μ ( x ) , γ = 0 ( p ¯ + q ¯ ) log p + q 2 p q d μ ( x ) , γ = 1 .
Table 4. Basic Alpha- and Beta-divergences and the directly generated corresponding Gamma-divergences (see also Table 3 for how the Gamma-divergences can be expressed by power means).
Divergence D A ( α ) ( P | | Q ) or D B ( β ) ( P | | Q ) Gamma-divergence D A G ( γ ) ( c P | | c Q ) and D B G ( γ ) ( c P | | c Q )
1 α ( 1 α ) α p + ( 1 α ) q p α q 1 α d μ ( x ) log p γ q 1 γ d μ ( x ) 1 / ( γ ( 1 γ ) ) p d μ ( x ) 1 / ( 1 γ ) q d μ ( x ) 1 / γ
1 β ( β 1 ) p β + ( β 1 ) q β β p q β 1 d μ ( x ) log p q γ 1 d μ ( x ) 1 / ( γ 1 ) p γ d μ ( x ) 1 / ( γ ( γ 1 ) ) q γ d μ ( x ) 1 / γ
1 α ( 1 α ) p + q p α q 1 α p 1 α q α d μ ( x ) 1 γ ( 1 γ ) log p γ q 1 γ d μ ( x ) p 1 γ q γ d μ ( x ) p d μ ( x ) q d μ ( x )
1 β 1 p β + q β p q β 1 p β 1 q d μ ( x ) 1 γ 1 log p q γ 1 d μ ( x ) p γ 1 q d μ ( x ) p γ d μ ( x ) q γ d μ ( x )
α 2 p + ( 1 α 2 ) q p + q 2 α q α 1 d μ ( x ) α ( 1 α ) 1 γ ( 1 γ ) log p + q 2 γ 1 q γ 1 d μ ( x ) p d μ ( x ) γ / 2 q d μ ( x ) 1 γ / 2
1 1 α ( p q ) 1 p + q 2 q α 1 d μ ( x ) 1 1 γ log p d μ ( x ) q p + q 2 q γ 1 d μ ( x ) q d μ ( x ) p p + q 2 q γ 1 d μ ( x )
p + q ( p 1 α + q 1 α ) p + q 2 α d μ ( x ) α ( 1 α ) log p 1 γ p + q 2 γ d μ ( x ) q 1 γ p + q 2 γ d μ ( x ) p d μ ( x ) q d μ ( x ) 1 γ ( γ 1 )
The asymmetric Beta-Gamma-divergence has the following properties [30,35]:
  • $D_{BG}^{(\gamma)}(P\|Q) \ge 0$. The equality holds if and only if $P = cQ$ for a positive constant c.
  • It is scale-invariant, that is, $D_{BG}^{(\gamma)}(P\|Q) = D_{BG}^{(\gamma)}(c_1 P\|c_2 Q)$, for arbitrary positive scaling constants $c_1$, $c_2$.
  • As γ → 1, the Gamma-divergence becomes the Kullback–Leibler divergence:
    $$\lim_{\gamma\to 1} D_{BG}^{(\gamma)}(P\|Q) = D_{KL}(\bar{P}\|\bar{Q}) = \int \bar{p}\,\log\frac{\bar{p}}{\bar{q}}\,d\mu(x)$$
    where $\bar{p} = p/\int p\,d\mu(x)$ and $\bar{q} = q/\int q\,d\mu(x)$.
  • For γ → 0, the Gamma-divergence can be expressed as follows
    $$\lim_{\gamma\to 0} D_{BG}^{(\gamma)}(P\|Q) = \int\log\frac{q(x)}{p(x)}\,d\mu(x) + \log\int\frac{p(x)}{q(x)}\,d\mu(x)$$
    For the discrete Gamma-divergence we have the corresponding formula
    $$D_{BG}^{(0)}(P\|Q) = \frac{1}{n}\sum_{i=1}^n\log\frac{q_i}{p_i} + \log\sum_{i=1}^n\frac{p_i}{q_i} - \log n = \log\frac{\frac{1}{n}\sum_{i=1}^n\frac{p_i}{q_i}}{\left(\prod_{i=1}^n\frac{p_i}{q_i}\right)^{1/n}}$$
Similarly to the Alpha- and Beta-divergences, we can also define a symmetric Beta-Gamma-divergence as
$$D_{BGS}^{(\gamma)}(P\|Q) = D_{BG}^{(\gamma)}(P\|Q) + D_{BG}^{(\gamma)}(Q\|P) = \frac{1}{1-\gamma}\log\frac{\left(\int p^{\gamma-1}\,q\,d\mu(x)\right)\left(\int p\,q^{\gamma-1}\,d\mu(x)\right)}{\left(\int p^\gamma\,d\mu(x)\right)\left(\int q^\gamma\,d\mu(x)\right)}$$
The symmetric Gamma-divergence has similar properties to the asymmetric Gamma-divergence:
  • $D_{BGS}^{(\gamma)}(P\|Q) \ge 0$. The equality holds if and only if $P = cQ$ for a positive constant c; in particular, $p_i = c\,q_i$ for all i in the discrete case.
  • It is scale-invariant, that is,
    $$D_{BGS}^{(\gamma)}(P\|Q) = D_{BGS}^{(\gamma)}(c_1 P\|c_2 Q)$$
    for arbitrary positive scaling constants $c_1$, $c_2$.
  • For γ → 1, it reduces to a special form of the symmetric Kullback–Leibler divergence (also called the J-divergence)
    $$\lim_{\gamma\to 1} D_{BGS}^{(\gamma)}(P\|Q) = \int (\bar{p} - \bar{q})\,\log\frac{\bar{p}}{\bar{q}}\,d\mu(x)$$
    where $\bar{p} = p/\int p\,d\mu(x)$ and $\bar{q} = q/\int q\,d\mu(x)$.
  • For γ = 0, we obtain a simple divergence expressed by weighted arithmetic means
    $$D_{BGS}^{(0)}(P\|Q) = \log\left[\left(\int w\,\frac{p(x)}{q(x)}\,d\mu(x)\right)\left(\int w\,\frac{q(x)}{p(x)}\,d\mu(x)\right)\right]$$
    where the weight function w > 0 is such that $\int w\,d\mu(x) = 1$.
    For the discrete Beta-Gamma divergence (or simply the Gamma-divergence), we obtain
    $$D_{BGS}^{(0)}(P\|Q) = \log\left(\sum_{i=1}^n\frac{p_i}{q_i}\right) + \log\left(\sum_{i=1}^n\frac{q_i}{p_i}\right) - 2\log n = \log\left[\left(\frac{1}{n}\sum_{i=1}^n\frac{p_i}{q_i}\right)\left(\frac{1}{n}\sum_{i=1}^n\frac{q_i}{p_i}\right)\right] = \log\left[M_1\!\left\{\frac{p_i}{q_i}\right\} M_1\!\left\{\frac{q_i}{p_i}\right\}\right]$$