Log-Determinant Divergences Revisited: Alpha--Beta and Gamma Log-Det Divergences

In this paper, we review and extend a family of log-det divergences for symmetric positive definite (SPD) matrices and discuss their fundamental properties. We show how the parameterized Alpha-Beta (AB) and Gamma log-det divergences generate many well-known divergences, for example, Stein's loss, the S-divergence (also called the Jensen-Bregman LogDet (JBLD) divergence), the LogDet Zero (Bhattacharyya) divergence, and the Affine Invariant Riemannian Metric (AIRM), as well as some new divergences. Moreover, we establish links and correspondences among many log-det divergences and display them on the alpha-beta plane for various sets of parameters. Furthermore, this paper bridges these divergences and also shows their links to divergences of multivariate and multiway Gaussian distributions. Closed-form formulas are derived for Gamma divergences of two multivariate Gaussian densities, including as special cases the Kullback-Leibler, Bhattacharyya, Rényi and Cauchy-Schwarz divergences. Symmetrized versions of the log-det divergences are also discussed and reviewed. A class of divergences is extended to multiway divergences for separable covariance (precision) matrices.

The dissimilarity between two SPD matrices is called a metric if the following conditions hold: (1) nonnegativity, D(P ‖ Q) ≥ 0, with D(P ‖ Q) = 0 if and only if P = Q; (2) symmetry, D(P ‖ Q) = D(Q ‖ P); (3) the triangle inequality, D(P ‖ Q) ≤ D(P ‖ Z) + D(Z ‖ Q) for any SPD matrix Z. Dissimilarities which only satisfy condition (1) are not metrics and are referred to as (asymmetric) divergences.

Basic Alpha-Beta Log-Determinant Divergence
For symmetric positive definite matrices P ∈ R^{n×n} and Q ∈ R^{n×n} (both of the same size n × n), let us define the following function, which will be considered as a new dissimilarity measure referred to briefly as the AB log-det divergence:

D_AB^{(α,β)}(P ‖ Q) = (1/(αβ)) log det[ (α (PQ^{-1})^β + β (PQ^{-1})^{-α}) / (α + β) ],    (15)

for α ≠ 0, β ≠ 0, α + β ≠ 0. This is not a symmetric divergence with respect to P and Q, except for the case α = β.
It is interesting to observe that this divergence has some correspondences and relationships to the Alpha, Beta and AB divergences discussed in our previous papers, and especially to the Gamma divergences [10], [9], [12]; see also [21]. Furthermore, the divergence defined above is different from, but related to, the AB divergence for SPD matrices, defined for α ≠ 0, β ≠ 0, α + β ≠ 0, which will be investigated in detail in a separate paper (see also [1], [10]). It should be noted that D_AB^{(α,β)}(P ‖ Q), defined in (16), can be evaluated without the need to explicitly invert the SPD matrices: it can be evaluated easily by computing the (positive) eigenvalues of the matrix PQ^{-1} or of its inverse. Since both matrices P and Q (and their inverses) are SPD, their eigenvalues are positive. It can be shown that, although the matrix PQ^{-1} is in general not symmetric, its eigenvalues are the same as those of the SPD matrix Q^{-1/2}PQ^{-1/2}, so its eigenvalues are always positive.
Taking into account the eigenvalue decomposition

(PQ^{-1})^β = V Λ^β V^{-1}

(where V is a nonsingular matrix, while Λ^β = diag{λ_1^β, λ_2^β, . . . , λ_n^β} is the diagonal matrix with the positive eigenvalues λ_i > 0, i = 1, 2, . . . , n, of PQ^{-1}), we can write, after simple algebraic manipulations,

D_AB^{(α,β)}(P ‖ Q) = (1/(αβ)) Σ_{i=1}^{n} log[ (α λ_i^β + β λ_i^{-α}) / (α + β) ].    (21)

It is easy to check that D_AB^{(α,β)}(P ‖ Q) = 0 if P = Q. We will show later that this function is nonnegative for any SPD matrices if the alpha and beta parameters are both positive or both negative.
For the singular values α = 0 and/or β = 0 (and also α = −β), the AB log-det divergence (15) has to be defined as a limiting case, respectively for α → 0 and/or β → 0. In other words, to avoid indeterminacy or singularity for specific values of the parameters, the AB log-det divergence can be reformulated (extended) by continuity, applying L'Hôpital's rule to cover also the singular values of α and β. Using L'Hôpital's rule, we find that the AB log-det divergence can be expressed in the explicit form

D_AB^{(α,β)}(P ‖ Q) =
  (1/(αβ)) Σ_i log[ (α λ_i^β + β λ_i^{-α}) / (α + β) ]   for α ≠ 0, β ≠ 0, α + β ≠ 0,
  (1/α²) Σ_i [ log λ_i^α − log(1 + log λ_i^α) ]          for α = −β ≠ 0,
  (1/β²) Σ_i [ λ_i^β − log λ_i^β − 1 ]                    for α = 0, β ≠ 0,
  (1/α²) Σ_i [ λ_i^{−α} + log λ_i^α − 1 ]                 for α ≠ 0, β = 0,
  (1/2) Σ_i log² λ_i                                       for α = β = 0,

where the λ_i (i = 1, 2, . . . , n) are the eigenvalues of the matrix PQ^{-1} (or of its transpose), i.e., the generalized eigenvalues computed from P v_i = λ_i Q v_i, where the v_i (i = 1, 2, . . . , n) are the corresponding generalized eigenvectors. We can prove the following theorem (see Appendix).

Theorem 1. The function D_AB^{(α,β)}(P ‖ Q) defined in (15) is nonnegative for any SPD matrices with arbitrary positive eigenvalues for the sets of parameters α ≥ 0 and β ≥ 0, or α < 0 and simultaneously β < 0, and equals zero if and only if P = Q.
In other words, if the parameters α and β have the same sign, the AB log-det divergence is nonnegative independently of the distribution of the eigenvalues of PQ^{-1}, and reaches zero if and only if all the eigenvalues are equal to one.
However, if the eigenvalues are sufficiently close to one, the AB log-det divergence is also positive for opposite signs of the α and β parameters. The conditions for positivity are formulated in the following theorem.

Theorem 2. The function D_AB^{(α,β)}(P ‖ Q) expressed by Eq. (22) is nonnegative for the set of parameters α > 0 and β < 0, or α < 0 and β > 0, if all the eigenvalues of the matrix PQ^{-1} satisfy the conditions

λ_i > |β/α|^{1/(α+β)}  ∀i, for α > 0 and β < 0,    (24)
λ_i < |β/α|^{1/(α+β)}  ∀i, for α < 0 and β > 0.    (25)

When any of the eigenvalues does not satisfy these bounds, the value of the divergence is (by definition) set to infinity. Moreover, in the limit α → −β the bounds simplify to λ_i > e^{−1/α} for α > 0 and λ_i < e^{−1/α} for α < 0, whereas in the limit α → 0 or β → 0 the bounds disappear.
The complete picture of the bounds for different values of α and β is shown in Fig. 1. In general, the AB log-det divergence is not a metric distance, since the triangle inequality may not be satisfied for some values of the parameters. However, in the special case α = β we can optionally define a metric distance as the square root of the AB log-det divergence,

d_AB^{(α)}(P, Q) = [ D_AB^{(α,α)}(P ‖ Q) ]^{1/2},

because D_AB^{(α,α)}(P ‖ Q) is symmetric with respect to P and Q. As we will show later, such measures lead to many important divergences and metric distances, like the LogDet Zero divergence, the AIRM, and the square root of Stein's loss. Moreover, we can generate new divergences, e.g., a generalization of Stein's loss, the Beta log-det divergence, or a generalized Hilbert metric.
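To make the explicit piecewise definition and the admissibility bounds above concrete, the following sketch evaluates the AB log-det divergence numerically from the generalized eigenvalues of the pencil (P, Q). It is a minimal illustration, not the authors' reference implementation; the function name ab_logdet_divergence and the tolerance eps are our own choices, and the branches follow the limiting formulas quoted above.

```python
import numpy as np
from scipy.linalg import eigh

def ab_logdet_divergence(P, Q, alpha, beta, eps=1e-12):
    """AB log-det divergence from the generalized eigenvalues of P v = lambda Q v (a sketch)."""
    lam = eigh(P, Q, eigvals_only=True)   # eigenvalues of P Q^{-1}; positive for SPD P, Q
    a, b = float(alpha), float(beta)
    if abs(a) < eps and abs(b) < eps:     # alpha = beta = 0: half of the squared AIRM
        return 0.5 * np.sum(np.log(lam) ** 2)
    if abs(a) < eps:                      # alpha = 0, beta != 0
        return np.sum(lam ** b - b * np.log(lam) - 1.0) / b ** 2
    if abs(b) < eps:                      # beta = 0, alpha != 0
        return np.sum(lam ** (-a) + a * np.log(lam) - 1.0) / a ** 2
    if abs(a + b) < eps:                  # alpha = -beta != 0
        arg = 1.0 + a * np.log(lam)
        return np.inf if np.any(arg <= 0) else np.sum(a * np.log(lam) - np.log(arg)) / a ** 2
    arg = (a * lam ** b + b * lam ** (-a)) / (a + b)
    # outside the admissible bounds (opposite signs of alpha and beta) the divergence is set to infinity
    return np.inf if np.any(arg <= 0) else np.sum(np.log(arg)) / (a * b)
```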
From the divergence D_AB^{(α,β)}(P ‖ Q), a Riemannian metric and a pair of dually coupled affine connections are introduced in the manifold of positive definite matrices. Let dP be a small deviation of P, which belongs to the tangent space of the manifold at P. By calculating D_AB^{(α,β)}(P + dP ‖ P) and neglecting higher-order terms, we have

ds² = D_AB^{(α,β)}(P + dP ‖ P) ≈ (1/2) tr(P^{-1} dP P^{-1} dP).

This gives a Riemannian metric which is common to all (α, β). Therefore, the Riemannian metric is the same for all AB log-det divergences, although the dual affine connections depend on α and β. The Riemannian metric coincides with the Fisher information matrix of the manifold of multivariate Gaussian distributions with zero mean and covariance matrix P.
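As a quick sanity check of the quadratic expansion above, one can verify numerically that for a small symmetric perturbation dP the divergence is approximately (1/2) tr[(P^{-1} dP)²] for several choices of (α, β); this sketch reuses the hypothetical ab_logdet_divergence helper defined earlier.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
P = A @ A.T + 4 * np.eye(4)                             # a well-conditioned SPD matrix
E = rng.standard_normal((4, 4)); E = 1e-4 * (E + E.T)   # small symmetric perturbation dP

Pinv_dP = np.linalg.solve(P, E)
quad = 0.5 * np.trace(Pinv_dP @ Pinv_dP)                # (1/2) tr[(P^{-1} dP)^2]

for alpha, beta in [(0.5, 0.5), (1.0, 0.0), (0.0, 1.0), (1.0, -1.0), (0.0, 0.0)]:
    d = ab_logdet_divergence(P + E, P, alpha, beta)
    print(alpha, beta, d / quad)                        # each ratio should be close to 1
```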
It is interesting to note that the Riemannian (geodesic) distance is obtained from (15) for α = β = 0:

D_AB^{(0,0)}(P ‖ Q) = (1/2) Σ_{i=1}^{n} log² λ_i = (1/2) ||Log(Q^{-1/2} P Q^{-1/2})||_F²,

where the λ_i are the eigenvalues of the matrix PQ^{-1}; its square root is, up to a constant factor, the Affine Invariant Riemannian Metric (AIRM). The AIRM enjoys several important and useful theoretical properties, and is probably one of the most widely used (dis)similarity measures for SPD (covariance) matrices [13], [14].

[Figure 3: Graphical illustration of the fundamental non-symmetric AB log-det divergences. On the α-β plane, important divergences are indicated by points and lines, in particular Stein's loss and its generalization, the AIRM (Riemannian) distance, the S-divergence (also called the Jensen-Bregman LogDet (JBLD) divergence), and the Alpha log-det divergence.]

For α = β = 0.5 (and also for α = β = −0.5), we obtain the recently defined and deeply analyzed S-divergence, also called the JBLD (Jensen-Bregman LogDet) divergence [16], [4], [13], [14]:

D_AB^{(0.5,0.5)}(P ‖ Q) = 4 [ log det((P + Q)/2) − (1/2) log det(PQ) ] = 4 D_S(P ‖ Q).

The S-divergence is not a metric distance. In order to obtain a metric we take its square root, which yields the LogDet Zero divergence, also sometimes called the Bhattacharyya distance [18], [17], [5]:

D_Bh(P, Q) = [ log det((P + Q)/2) − (1/2) log det(PQ) ]^{1/2}.

Moreover, for α ≠ 0, β = 0 and for α = 0, β ≠ 0, we obtain divergences which can be considered as generalizations of Stein's loss (also called the Burg matrix divergence or simply the LogDet divergence):

D_AB^{(α,0)}(P ‖ Q) = (1/α²) [ tr((QP^{-1})^α) − log det((QP^{-1})^α) − n ],    (33)
D_AB^{(0,β)}(P ‖ Q) = (1/β²) [ tr((PQ^{-1})^β) − log det((PQ^{-1})^β) − n ].    (34)

The divergences (33) and (34) simplify to the standard Stein's loss for α = 1 and β = 1, respectively. One important potential application of the AB log-det divergence is to generate efficient conditionally positive definite kernels, which can find wide applications in classification and clustering. It seems that for a specific set of parameters the AB log-det divergences admit a Hilbert space embedding in the form of a Radial Basis Function (RBF) kernel [22]. More specifically, the AB log-det kernel can be defined as

k(P, Q) = exp( −γ D_AB^{(α,β)}(P ‖ Q) ),

where γ > 0 and α, β > 0 or α, β < 0, which for some selected values of the parameter γ is positive definite. However, the topic of kernel properties and their applications is outside the scope of this review paper.
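The special cases listed above are easy to evaluate directly from the eigenvalues λ_i of PQ^{-1}; the snippet below is a sketch (the names and scale conventions are ours, following the formulas quoted in the text), including the tentative RBF-type kernel built on the AB log-det divergence.

```python
import numpy as np
from scipy.linalg import eigh

def spd_special_divergences(P, Q):
    """Selected special cases of the AB log-det divergence for SPD matrices (a sketch)."""
    lam = eigh(P, Q, eigvals_only=True)
    airm2 = np.sum(np.log(lam) ** 2)                                    # squared AIRM; D_AB^{(0,0)} = airm2 / 2
    s_div = np.sum(np.log((np.sqrt(lam) + 1.0 / np.sqrt(lam)) / 2.0))   # S-divergence (JBLD)
    logdet_zero = np.sqrt(s_div)                                        # LogDet Zero (Bhattacharyya) distance
    stein = np.sum(lam - np.log(lam) - 1.0)                             # Stein's loss (Burg matrix divergence)
    return {"AIRM^2": airm2, "S-divergence": s_div, "LogDetZero": logdet_zero, "Stein": stein}

def ab_logdet_kernel(P, Q, alpha, beta, gamma=1.0):
    """Tentative RBF-type kernel exp(-gamma * D_AB); positive definite only for selected parameters."""
    return np.exp(-gamma * ab_logdet_divergence(P, Q, alpha, beta))
```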

Special Cases of the AB Log-Det Divergence
We shall now illustrate that a suitable choice of the (α, β) parameters reduces the AB log-det divergence to some known divergences, including the Alpha- and Beta-log-det divergences [18], [23], [17], [9]. When α + β = 1, the AB log-det divergence reduces to the Alpha-log-det divergence [18]

D_A^{(α)}(P ‖ Q) = (1/(α(1 − α))) log [ det(αP + (1 − α)Q) / (det(P)^α det(Q)^{1−α}) ],    (36)

while for α = 1 and β ≥ 0 it reduces to the Beta-log-det divergence

D_B^{(β)}(P ‖ Q) = (1/β) Σ_i log[ (λ_i^β + β λ_i^{-1}) / (1 + β) ].

It is interesting to note that the Beta log-det divergence for β → ∞ leads to a new divergence (robust with respect to noise) expressed as

lim_{β→∞} D_B^{(β)}(P ‖ Q) = Σ_i max(0, log λ_i).

Assuming that the set Ω = {i : λ_i > 1} gathers the indices of those eigenvalues greater than one, we can express this divergence more formally as Σ_{i∈Ω} log λ_i. The Alpha-log-det divergence gives the standard Stein's losses (Burg matrix divergences) for α = 1 and α = 0, and the Beta-log-det divergence also gives Stein's loss for β = 0. Another important class of divergences is the Power log-det divergence, for any α = β ∈ R:

D_AB^{(α,α)}(P ‖ Q) = (1/α²) Σ_i log[ (λ_i^α + λ_i^{-α}) / 2 ].
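As a small numerical illustration of the α + β = 1 case, the following sketch compares the general eigenvalue-based evaluation with the determinant form of the Alpha log-det divergence as reconstructed in (36); the helper random_spd is introduced here only for the example.

```python
import numpy as np

def random_spd(n, rng):
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)      # well-conditioned SPD test matrix

rng = np.random.default_rng(1)
P, Q = random_spd(5, rng), random_spd(5, rng)
alpha = 0.3                              # so that alpha + beta = 1 with beta = 1 - alpha

logdet = lambda M: np.linalg.slogdet(M)[1]
d_alpha = (logdet(alpha * P + (1 - alpha) * Q)
           - alpha * logdet(P) - (1 - alpha) * logdet(Q)) / (alpha * (1 - alpha))

print(d_alpha, ab_logdet_divergence(P, Q, alpha, 1 - alpha))   # the two values should agree
```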

Fundamental Properties of the AB Log-Det Divergence
The AB log-det divergence has several important and useful theoretical properties, which hold for any SPD matrices P and Q. In particular, it depends on P and Q only through the eigenvalues of PQ^{-1}, i.e., D_AB^{(α,β)}(P ‖ Q) = D_AB^{(α,β)}(Λ ‖ I), where Λ = diag{λ_1, λ_2, . . . , λ_n}.
Proof: From the definition of the divergence it is evident that D_AB^{(α,β)}(P ‖ Q) is a function of the eigenvalues λ_i of PQ^{-1} only, and hence D_AB^{(α,β)}(P ‖ Q) = D_AB^{(α,β)}(Λ ‖ I).

Scaling invariance
D_AB^{(α,β)}(cP ‖ cQ) = D_AB^{(α,β)}(P ‖ Q) for any c > 0, or more generally D_AB^{(α,β)}(CPC^T ‖ CQC^T) = D_AB^{(α,β)}(P ‖ Q) for any nonsingular matrix C ∈ R^{n×n}.
Proof: The eigenvalues of (cP)(cQ)^{-1} = PQ^{-1}, and of (CPC^T)(CQC^T)^{-1} = C(PQ^{-1})C^{-1}, coincide with those of PQ^{-1}, so the divergence is unchanged.
6. For given parameters α, β and a non-zero scaling scalar ω ≠ 0,
D_AB^{(ωα,ωβ)}(P ‖ Q) = (1/ω²) D_AB^{(α,β)}( (Q^{-1/2} P Q^{-1/2})^ω ‖ I ).
Proof: From the definition of the divergence we can write the rescaled-parameter form in terms of the eigenvalues λ_i^ω. Hence, we obtain an important inequality for |ω| ≤ 1.

Dual-invariance under inversion (for ω = −1):
D_AB^{(α,β)}(P^{-1} ‖ Q^{-1}) = D_AB^{(−α,−β)}(P ‖ Q).
8. Dual symmetry:
D_AB^{(α,β)}(P ‖ Q) = D_AB^{(β,α)}(Q ‖ P).
9. Affine invariance (invariance under linear transformations):
D_AB^{(α,β)}(APA^T ‖ AQA^T) = D_AB^{(α,β)}(P ‖ Q)
for any nonsingular matrix A ∈ R^{n×n}.
Proof: The matrix (APA^T)(AQA^T)^{-1} = A(PQ^{-1})A^{-1} is similar to PQ^{-1}, so both matrices have the same eigenvalues.
10. Scaling invariance under the Kronecker product:
D_AB^{(α,β)}(P ⊗ B ‖ Q ⊗ B) = m D_AB^{(α,β)}(P ‖ Q)
for any SPD matrix B ∈ R^{m×m}.
Proof: The eigenvalues of (P ⊗ B)(Q ⊗ B)^{-1} = (PQ^{-1}) ⊗ I_m are the eigenvalues of PQ^{-1}, each repeated m times.
11. Triangle inequality (metric distance condition): the square root of D_AB^{(α,α)}(P ‖ Q) is a metric for any α ∈ R.
Proof: On the one hand, for α ≠ 0, we can prove the metric condition with the help of the Bhattacharyya distance. By defining ω = 2α ≠ 0 and using property 6, the metric condition can be easily verified; for instance, in order to check the triangle inequality we can compare the corresponding Bhattacharyya distances. On the other hand, D_AB^{(α,α)}(P ‖ Q) for α → 0 converges to the Riemannian metric, which concludes the proof of the metric condition of D_AB^{(α,α)}(P ‖ Q) for any α ∈ R.
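The invariance properties above are easy to check numerically; the following sketch (reusing the hypothetical helpers from the previous examples) verifies scaling invariance, affine invariance, and dual-invariance under inversion on random SPD matrices.

```python
import numpy as np

rng = np.random.default_rng(2)
P, Q = random_spd(4, rng), random_spd(4, rng)
A = rng.standard_normal((4, 4))                     # almost surely nonsingular

d0 = ab_logdet_divergence(P, Q, 0.7, 0.2)
d1 = ab_logdet_divergence(3.5 * P, 3.5 * Q, 0.7, 0.2)                       # scaling invariance
d2 = ab_logdet_divergence(A @ P @ A.T, A @ Q @ A.T, 0.7, 0.2)               # affine invariance
d3 = ab_logdet_divergence(np.linalg.inv(P), np.linalg.inv(Q), -0.7, -0.2)   # dual-invariance (omega = -1)
print(d0, d1, d2, d3)                               # all four values should coincide up to rounding
```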

Symmetrized AB Log-Det Divergences
The basic AB log-det divergence is asymmetric, that is, D_AB^{(α,β)}(P ‖ Q) ≠ D_AB^{(α,β)}(Q ‖ P), except in the special case α = β.

One important special case is the KLDM (KL Divergence Metric), called also the symmetric Stein's loss or Jeffreys KL divergence:

D_KLDM(P ‖ Q) = (1/2) tr( PQ^{-1} + QP^{-1} − 2I )

(generalized KLDM or Jeffreys KL divergence).

[Figure 4: Graphical illustration of the fundamental symmetric AB log-det divergences. On the alpha-beta plane, particular divergences are indicated by points as special important cases, especially the Jeffreys KL divergence (also called KLDM or symmetric Stein's loss) and its generalization, the S-divergence (JBLD divergence), and the Power log-det divergence.]

Modifications and Generalizations of AB Log-Det Divergences, Gamma Matrix Divergences
The divergence (15) discussed in previous sections can be extended or modified in several ways.
First of all, we can define an alternative AB log-det divergence as follows:

D̃_AB^{(α,β)}(P ‖ Q) = (1/(αβ)) log [ det( (αP + βQ)/(α + β) ) / ( det(P)^{α/(α+β)} det(Q)^{β/(α+β)} ) ],    (83)

for α ≠ 0, β ≠ 0, α + β ≠ 0, in particular for α > 0, β > 0. It can be shown that for α + β = 1 it coincides with the AB log-det divergence (15), i.e., both reduce to the Alpha log-det divergence. However, they are not equivalent in more general cases. In fact, it is easy to show that the divergence (83) can be expressed as a scaled and transformed Alpha log-det divergence of the form (see (36))

D̃_AB^{(α,β)}(P ‖ Q) = (1/(α + β)²) D_A^{(α/(α+β))}(P ‖ Q),

so (83) is less general than (15), since it does not cover the Power and Beta log-det divergences.
It is interesting to note that the positive eigenvalues of the matrix PQ^{-1} play a similar role to the ratios (p_i/q_i) and (q_i/p_i) used in the wide class of standard discrete divergences (see, for example, [10], [9]), so we can apply such divergences to formulate modified log-det divergences as functions of the eigenvalues λ_i.
For example, for the Itakura-Saito distance, defined as D_IS(p ‖ q) = Σ_i [ p_i/q_i − log(p_i/q_i) − 1 ], replacing the ratios p_i/q_i → λ_i and q_i/p_i → λ_i^{-1} we obtain a log-det divergence for SPD matrices,

D(P ‖ Q) = Σ_i ( λ_i − log λ_i − 1 ) = tr(PQ^{-1}) − log det(PQ^{-1}) − n,

which is consistent with our previous considerations (see (36) and (38)).
As another example, let us consider the discrete Gamma divergence defined as [9], [10]

D_γ^{(α,β)}(p ‖ q) = (1/(αβ)) log [ (Σ_i p_i^{α+β})^{α/(α+β)} (Σ_i q_i^{α+β})^{β/(α+β)} / Σ_i p_i^α q_i^β ],

for α ≠ 0, β ≠ 0, α + β ≠ 0, which for α = 1 and β → −1 simplifies to the following form [9]

lim_{β→−1} D_γ^{(1,β)}(p ‖ q) = log [ M_1{p_i/q_i} / M_0{p_i/q_i} ].

Hence, by substituting p_i/q_i → λ_i, we can derive a new Gamma matrix divergence for SPD matrices:

D_γ^{(1,−1)}(P ‖ Q) = log [ M_1{λ_i} / M_0{λ_i} ],    (90)

where M_1 denotes the arithmetic mean and M_0 the geometric mean. It is interesting to note that (90) can be expressed equivalently as

D_γ^{(1,−1)}(P ‖ Q) = log [ tr(PQ^{-1}) / n ] − (1/n) log det(PQ^{-1}).

Similarly, using the symmetric Gamma divergence defined in [9], [10] for α ≠ 0, β ≠ 0, α + β ≠ 0, with α = 1 and β → −1, we obtain a new symmetric Gamma matrix divergence (by substituting the ratios p_i/q_i by λ_i) as follows:

D_γs^{(1,−1)}(P ‖ Q) = log [ M_1{λ_i} / M_{−1}{λ_i} ],    (94)

where M_{−1}{λ_i} denotes the harmonic mean. Note that this divergence can be expressed compactly as

D_γs^{(1,−1)}(P ‖ Q) = log [ (1^T u)(1^T u^{-1}) / n² ],

where u = [λ_1, . . . , λ_n]^T and u^{-1} = [λ_1^{-1}, . . . , λ_n^{-1}]^T. The basic (power) means can be defined as

M_γ{λ_i} = ( (1/n) Σ_i λ_i^γ )^{1/γ} for γ ≠ 0,   with M_0{λ_i} = (Π_i λ_i)^{1/n},

with the following relationships between them:

M_{−∞} ≤ M_{−1} ≤ M_0 ≤ M_1 ≤ M_2 ≤ M_{∞},

where the equalities hold only if all the λ_i have the same value. By increasing the value of γ, we put more emphasis on large relative errors, that is, on the eigenvalues λ_i which deviate more from one. Depending on the value of γ, we obtain as particular cases: the minimum of the eigenvalues (for γ → −∞), their harmonic mean (γ = −1), the geometric mean (γ = 0), the arithmetic mean (γ = 1), the quadratic mean (γ = 2) and the maximum of the eigenvalues (γ → ∞). Exploiting the above inequalities for the means, the divergences (90) and (94) can be heuristically generalized (defined) as follows:

D_γ^{(γ_1,γ_2)}(P ‖ Q) = log [ M_{γ_2}{λ_i} / M_{γ_1}{λ_i} ],    (98)

with γ_2 > γ_1.
The new divergence (98) is quite general and flexible, and in the extreme case γ_1 → −∞, γ_2 → ∞ it takes the form

D_H(P ‖ Q) = log ( λ_max / λ_min ),    (99)

which is in fact the well-known Hilbert projective metric [4], [26]. The Hilbert projective metric is extremely simple and is suitable for big data, because it requires computing only two (the minimum and maximum) eigenvalues of the matrix PQ^{-1}.
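A compact numerical sketch of the power-mean construction: the generalized Gamma matrix divergence (98), as reconstructed above, is the log-ratio of two power means of the eigenvalues of PQ^{-1}, and the Hilbert projective metric appears as the limiting case γ_1 → −∞, γ_2 → ∞ (function names are ours).

```python
import numpy as np
from scipy.linalg import eigh

def power_mean(x, gamma):
    """Power mean M_gamma of positive numbers; gamma = 0 is the geometric mean, +/-inf the max/min."""
    x = np.asarray(x, dtype=float)
    if gamma == 0:
        return np.exp(np.mean(np.log(x)))
    if np.isinf(gamma):
        return x.max() if gamma > 0 else x.min()
    return np.mean(x ** gamma) ** (1.0 / gamma)

def gamma_matrix_divergence(P, Q, gamma1, gamma2):
    """Generalized Gamma matrix divergence (98), reconstructed form: log(M_{gamma2} / M_{gamma1})."""
    lam = eigh(P, Q, eigvals_only=True)            # eigenvalues of P Q^{-1}
    return np.log(power_mean(lam, gamma2) / power_mean(lam, gamma1))

def hilbert_metric(P, Q):
    """Hilbert projective metric: only the extreme eigenvalues of P Q^{-1} are needed."""
    lam = eigh(P, Q, eigvals_only=True)
    return np.log(lam.max() / lam.min())
```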
The Hilbert projective metric enjoys the following important properties [4,27]

Invariance to scaling
D_H(c_1 P ‖ c_2 Q) = D_H(P ‖ Q) for any c_1, c_2 > 0.

Invariance under congruence transformation
D_H(APA^T ‖ AQA^T) = D_H(P ‖ Q) for any invertible matrix A.

The AB Log-Det Divergence for Noisy and Ill-Conditioned Covariance Matrices
In real-world signal processing and machine learning applications, the sampled SPD matrices can be strongly corrupted by noise and extremely ill-conditioned. In such cases, the eigenvalues of the generalized eigenvalue decomposition (GEVD) problem Pv_i = λ_i Qv_i can be divided into a signal subspace and a noise subspace. The signal subspace is usually represented by the largest eigenvalues (and corresponding eigenvectors) and the noise subspace by the smallest eigenvalues (and corresponding eigenvectors), which should be rejected. In other words, in the evaluation of log-det divergences we should take into account only those eigenvalues which represent the signal subspace. The simplest approach is to find the truncated dominant eigenvalues by applying a suitable threshold τ > 0, that is, an index r ≤ n for which λ_{r+1} ≤ τ, and to perform the summation, e.g., in Eq. (21), from 1 to r (instead of from 1 to n) [22]. The threshold parameter τ can be selected via cross-validation. Recent studies suggest that the true signal-subspace covariance matrices can be better represented by shrinking the eigenvalues. For example, a popular and relatively simple method is to apply a thresholding and shrinkage rule to all eigenvalues [29], where any eigenvalue smaller than the specified threshold is set to zero and the remaining eigenvalues are shrunk. Note that the smaller eigenvalues are shrunk more than the largest ones. For γ = 1, we obtain standard soft thresholding and for γ → ∞ standard hard thresholding [30]. We can estimate the optimal threshold τ > 0 and the parameter γ > 0 using cross-validation. However, a more practical and efficient method is to apply the Generalized Stein Unbiased Risk Estimate (GSURE) method, even if the variance of the noise is unknown (for details, please see [29] and the references therein).
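The following sketch implements one common thresholding-and-shrinkage rule that matches the verbal description above (soft thresholding for γ = 1, hard thresholding for γ → ∞); it is an assumption for illustration only and is not necessarily the exact rule of [29].

```python
import numpy as np

def shrink_eigenvalues(lam, tau, gamma=1.0):
    """Threshold-and-shrink positive eigenvalues: values below tau are set to zero,
    the remaining ones are shrunk (gamma = 1: soft thresholding; gamma -> inf: hard)."""
    lam = np.asarray(lam, dtype=float)
    return lam * np.maximum(0.0, 1.0 - (tau / lam) ** gamma)
```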
In this paper, we have proposed an alternative approach in which the bias generated by noise is reduced by a suitable choice of the parameters α and β [10]. In other words, instead of the eigenvalues λ_i of the matrix PQ^{-1} or of its inverse, we can use regularized or shrunk eigenvalues λ̃_i [29], [30], [31], computed, for example, on the basis of formula (21); they play a similar role to the ratios (p_i/q_i) (with p_i ≥ q_i) used in the standard discrete divergences [10], [9]. So, for example, the new Gamma divergence (98) can be formulated in an even more general form as

D_γ^{(γ_1,γ_2)}(P ‖ Q) = log [ M_{γ_2}{λ̃_i} / M_{γ_1}{λ̃_i} ],

with γ_2 > γ_1, where the λ̃_i denote regularized or optimally shrunk eigenvalues.

Divergences for Multivariate Gaussian Densities -Differential Relative Entropies for Multivariate Normal Distributions
The objective of this section is to show links or relationships between the family of continuous Gamma divergences and the AB log-det divergences for multivariate Gaussian densities. Consider two multivariate Gaussian (normal) distributions

p(x) = N(x; μ_1, Σ_1),    q(x) = N(x; μ_2, Σ_2),    (114)-(115)

where μ_1 ∈ R^n and μ_2 ∈ R^n are the mean vectors and P = Σ_1 ∈ R^{n×n} and Q = Σ_2 ∈ R^{n×n} are the covariance matrices of p(x) and q(x), respectively. Let us consider the Gamma divergence for these distributions,

D_γ^{(α,β)}(p ‖ q) = (1/(αβ)) log [ ( ∫ p^{α+β}(x) dx )^{α/(α+β)} ( ∫ q^{α+β}(x) dx )^{β/(α+β)} / ∫ p^α(x) q^β(x) dx ],    (116)

for α ≠ 0, β ≠ 0, α + β ≠ 0, which generalizes a family of Gamma divergences [10], [9].

Theorem 3. The Gamma divergence (116) for the multivariate Gaussian densities (114) and (115) can be expressed in closed form, (117), for α > 0 and β > 0. The proof of the theorem is provided in Appendix 10.5.

The formula (117) consists of two terms: the first term is expressed via the AB log-det divergence of the form given by (83), which measures the dissimilarity between the two covariance (or precision) matrices and is independent of the mean vectors, while the second term is a quadratic form expressed via the Mahalanobis distance, which represents the distance between the means (weighted by the covariance matrices) of the multivariate Gaussian distributions and is zero if the mean vectors are equal.
1. For α = 1 and β → 0, the Gamma divergence reduces to the Kullback-Leibler (KL) divergence D_KL(p ‖ q).
2. For α = β = 0.5, we have the Bhattacharyya distance [33],
d_Bh(p ‖ q) = −4 log ∫ p^{1/2}(x) q^{1/2}(x) dx.
3. For α + β = 1 and 0 < α < 1, we obtain the closed-form expression for the Rényi divergence [34].
4. For α = β = 1, the Gamma divergence reduces to the Cauchy-Schwarz divergence,
D_CS(p ‖ q) = −log [ ∫ p(x) q(x) dx / ( ( ∫ p²(x) dx )^{1/2} ( ∫ q²(x) dx )^{1/2} ) ].
A similar formula can be derived for the symmetric Gamma divergence of two multivariate Gaussians. Furthermore, similar formulas can probably be derived for Elliptical Gamma distributions (EGD) [35], which offer more flexible modeling than the standard multivariate Gaussian distributions.
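For reference, the classical closed forms for two of the special cases above (the KL divergence and the Bhattacharyya distance between two Gaussians) can be coded directly; these are the standard textbook formulas, shown here as a sketch rather than the paper's own derivation of (117).

```python
import numpy as np

def kl_gauss(mu1, S1, mu2, S2):
    """Closed-form KL divergence between N(mu1, S1) and N(mu2, S2)."""
    n = len(mu1)
    dm = mu2 - mu1
    S2_inv = np.linalg.inv(S2)
    return 0.5 * (np.trace(S2_inv @ S1) + dm @ S2_inv @ dm - n
                  + np.linalg.slogdet(S2)[1] - np.linalg.slogdet(S1)[1])

def bhattacharyya_gauss(mu1, S1, mu2, S2):
    """Classical Bhattacharyya distance -log int sqrt(p q) dx between two Gaussians;
    the Gamma divergence at alpha = beta = 0.5 is 4 times this quantity."""
    S = 0.5 * (S1 + S2)
    dm = mu1 - mu2
    return (0.125 * dm @ np.linalg.solve(S, dm)
            + 0.5 * np.linalg.slogdet(S)[1]
            - 0.25 * (np.linalg.slogdet(S1)[1] + np.linalg.slogdet(S2)[1]))
```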

Multiway divergences for Multivariate Normal Distributions with Separable Covariance Matrices
Recently, there has been growing interest in the analysis of tensors or multiway arrays [36][37][38][39]. For multiway arrays we often use multilinear (also called array or tensor) normal distributions, which correspond to the multivariate normal (Gaussian) distributions (114)-(115) with common mean (μ_1 = μ_2) and separable (Kronecker-structured) covariance matrices expressed as

P̄ = P_1 ⊗ P_2 ⊗ · · · ⊗ P_K,    Q̄ = Q_1 ⊗ Q_2 ⊗ · · · ⊗ Q_K,

where P_k ∈ R^{n_k×n_k} and Q_k ∈ R^{n_k×n_k} for k = 1, 2, . . . , K are SPD matrices, usually normalized so that det P_k = det Q_k = 1 for each k [39], and N = Π_{k=1}^{K} n_k. A main advantage of the separable Kronecker model is a significant reduction in the number of variance-covariance parameters [36]. Usually, such separable covariance matrices are sparse and very large-scale. The challenge is to design, for big data, efficient and relatively simple dissimilarity measures between two zero-mean multivariate (or multilinear) normal distributions (114)-(115). It seems that the Hilbert projective metric, due to its unique properties, is a good candidate, since for separable Kronecker-structured covariances it can be expressed in the very simple form

D_H(P̄ ‖ Q̄) = Σ_{k=1}^{K} log ( λ_max^{(k)} / λ_min^{(k)} ),

where λ_max^{(k)} and λ_min^{(k)} are the (possibly shrunk) maximum and minimum eigenvalues of the (relatively small) matrices P_k Q_k^{-1}, for k = 1, 2, . . . , K. We refer to this divergence as the multiway Hilbert metric; it has many attractive properties, especially invariance under multilinear transformations.
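The multiway Hilbert metric above only needs the extreme generalized eigenvalues of each (small) factor pair, which the following sketch illustrates (the naming is ours).

```python
import numpy as np
from scipy.linalg import eigh

def multiway_hilbert_metric(P_factors, Q_factors):
    """Multiway Hilbert metric for Kronecker-structured covariances
    P = P_1 x ... x P_K and Q = Q_1 x ... x Q_K (a sketch)."""
    total = 0.0
    for Pk, Qk in zip(P_factors, Q_factors):
        lam = eigh(Pk, Qk, eigvals_only=True)    # generalized eigenvalues of the k-th factor pair
        total += np.log(lam.max() / lam.min())
    return total
```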
Using fundamental properties of divergences and SPD matrices, we can derive other multiway log-det divergences. For example, we can obtain the multiway Stein's loss as

D_St(P̄ ‖ Q̄) = tr(P̄ Q̄^{-1}) − log det(P̄ Q̄^{-1}) − N = Π_{k=1}^{K} tr(P_k Q_k^{-1}) − Σ_{k=1}^{K} (N/n_k) log det(P_k Q_k^{-1}) − N.

Note that under the constraints det P_k = det Q_k = 1 it simplifies to

D_St(P̄ ‖ Q̄) = Π_{k=1}^{K} tr(P_k Q_k^{-1}) − N,

which is different from the multiway Stein's loss proposed recently by Gerard and Hoff [39]. Similarly, we can derive or define a multiway Riemannian metric (under the constraints det P_k = det Q_k = 1 for each k = 1, 2, . . . , K) as

D_R(P̄ ‖ Q̄) = [ Σ_{k=1}^{K} (N/n_k) || Log( Q_k^{-1/2} P_k Q_k^{-1/2} ) ||_F² ]^{1/2}.

Remark: the above multiway divergences were derived using the following properties: if {λ_i} and {θ_j} are the eigenvalues, with corresponding eigenvectors {v_i} and {u_j}, of the SPD matrices A and B, respectively, then A ⊗ B has eigenvalues {λ_i θ_j} with corresponding eigenvectors {v_i ⊗ u_j}, and

P̄ Q̄^{-1} = (P_1 Q_1^{-1}) ⊗ (P_2 Q_2^{-1}) ⊗ · · · ⊗ (P_K Q_K^{-1}).

Other possible extensions of the AB and Gamma matrix divergences to separable multiway divergences for multilinear normal distributions, under some normalization or constraint conditions, will be discussed in our future publication.

Conclusions
In this paper, we presented novel (dis)similarity measures, the Alpha-Beta and Gamma log-det divergences (and/or their square roots), that smoothly connect or unify a wide class of existing divergences for symmetric positive definite matrices. We derived numerous results that uncovered or unified theoretical properties and qualitative similarities between well-known divergences and also new divergences. The scope of the results presented in this paper is vast, since the parameterized Alpha-Beta and Gamma log-det divergence functions include several efficient and useful divergences, including those based on the relative entropies, the Riemannian metric (AIRM), the S-divergence, the generalized Jeffreys KL (KLDM), Stein's loss and the Hilbert projective metric. Links and relationships among the various divergences were also established. Furthermore, we proposed several multiway divergences for tensor (array) normal distributions.

Remark: The function (15) is only well defined in the first and third quadrants of the (α, β)-plane. Outside these regions, when the parameters α and β have opposite signs (i.e., α > 0 and β < 0, or vice versa α < 0 and β > 0), the divergence can be complex-valued. This undesired behavior can be avoided with the help of a truncation operator that prevents the arguments of the logarithms from being negative. The new definition of the AB log-det divergence, for α ≠ 0, β ≠ 0, α + β ≠ 0,
is compatible with the previous one on the first and third quadrants of the (α, β)-plane, while it is also well defined on the second and fourth quadrants, except for the special cases α = 0, β = 0, or α + β = 0, where the formula is indeterminate. Enforcing continuity, we can define the AB log-det divergence explicitly on the entire (α, β)-plane.

10.2 Domain of the eigenvalues for which D_AB^{(α,β)}(P ‖ Q) is finite
In this section, we assume that the λ_i, the eigenvalues of PQ^{-1}, satisfy 0 ≤ λ_i ≤ ∞ for all i = 1, . . . , n. We will determine the bounds on the eigenvalues of PQ^{-1} that prevent the AB log-det divergence from being infinite. For this purpose, let us recall that

D_AB^{(α,β)}(P ‖ Q) = (1/(αβ)) Σ_i log[ (α λ_i^β + β λ_i^{-α}) / (α + β) ].

For the divergence to be finite, the arguments of the logarithms in the previous expression should all be positive. This happens under the condition

α λ_i^{α+β} + β > 0,  ∀i,

which is always true when α, β > 0 or when α, β < 0. On the contrary, when sign(αβ) = −1, we have the following two cases. On the one hand, for α > 0, we can solve initially for λ_i^{α+β} and later for λ_i to obtain

λ_i > |β/α|^{1/(α+β)}  ∀i, for α > 0 and β < 0.    (137)

On the other hand, for α < 0, solving again for λ_i we obtain

λ_i < |β/α|^{1/(α+β)}  ∀i, for α < 0 and β > 0.    (141)

Moreover, in the limit α → −β, these bounds simplify to λ_i > e^{−1/α} (for α > 0) and λ_i < e^{−1/α} (for α < 0).

To prove Theorem 1, it suffices to show that D_AB^{(α,β)}(λ_i ‖ 1) is nonnegative with a minimum at λ_i = 1. For this purpose, we first observe that the only critical point of the criterion is obtained for λ_i = 1. This can be seen by equating to zero the derivative of the criterion, ∂D_AB^{(α,β)}(λ_i ‖ 1)/∂λ_i, and solving for λ_i. Next, one shows that the sign of the derivative only changes at the critical point λ_i = 1, arriving at the desired result.

Finally, in the proof of Theorem 3 the Gaussian integrals can be evaluated in closed form; in particular,

∫ q^{α+β}(x) dx = (2π)^{n(1−α−β)/2} (α + β)^{−n/2} det(Σ_2)^{(1−α−β)/2},

and similarly for ∫ p^{α+β}(x) dx, which concludes the proof of Theorem 3.