Review

Log-Determinant Divergences Revisited: Alpha-Beta and Gamma Log-Det Divergences

1
Laboratory for Advanced Brain Signal Processing, Brain Science Institute, RIKEN, 2-1 Hirosawa, Wako, 351-0198 Saitama, Japan
2
Systems Research Institute, Intelligent Systems Laboratory, Newelska 6, 01-447 Warsaw, Poland
3
Dpto de Teoría de la Señal y Comunicaciones, University of Seville, Camino de los Descubrimientos s/n, 41092 Seville, Spain
4
Laboratory for Mathematical Neuroscience, RIKEN BSI, Wako, 351-0198 Saitama, Japan
*
Authors to whom correspondence should be addressed.
Entropy 2015, 17(5), 2988-3034; https://doi.org/10.3390/e17052988
Submission received: 19 December 2014 / Revised: 18 March 2015 / Accepted: 5 May 2015 / Published: 8 May 2015
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract:
This work reviews and extends a family of log-determinant (log-det) divergences for symmetric positive definite (SPD) matrices and discusses their fundamental properties. We show how to use parameterized Alpha-Beta (AB) and Gamma log-det divergences to generate many well-known divergences; in particular, we consider the Stein’s loss, the S-divergence, also called Jensen-Bregman LogDet (JBLD) divergence, Logdet Zero (Bhattacharyya) divergence, Affine Invariant Riemannian Metric (AIRM), and other divergences. Moreover, we establish links and correspondences between log-det divergences and visualise them on an alpha-beta plane for various sets of parameters. We use this unifying framework to interpret and extend existing similarity measures for semidefinite covariance matrices in finite-dimensional Reproducing Kernel Hilbert Spaces (RKHS). This paper also shows how the Alpha-Beta family of log-det divergences relates to the divergences of multivariate and multilinear normal distributions. Closed form formulas are derived for Gamma divergences of two multivariate Gaussian densities; the special cases of the Kullback-Leibler, Bhattacharyya, Rényi, and Cauchy-Schwartz divergences are discussed. Symmetrized versions of log-det divergences are also considered and briefly reviewed. Finally, a class of divergences is extended to multiway divergences for separable covariance (or precision) matrices.

1. Introduction

Divergences or (dis)similarity measures between symmetric positive definite (SPD) matrices underpin many applications, including: Diffusion Tensor Imaging (DTI) segmentation, classification, clustering, pattern recognition, model selection, statistical inference, and data processing problems [1–3]. Furthermore, there is a close connection between divergence and the notions of entropy, information geometry, and statistical mean [2,4–7], while matrix divergences are closely related to the invariant geometrical properties of the manifold of probability distributions [4,8–10]. A wide class of parameterized divergences for positive measures is already well understood, and a unification and generalization of their properties can be found in [11–13].
The class of SPD matrices, especially covariance matrices, plays a key role in many areas of statistics, signal/image processing, DTI, pattern recognition, and the biological and social sciences [14–16]. For example, medical data produced by diffusion tensor magnetic resonance imaging (DTI-MRI) represent the covariance in a Brownian motion model of water diffusion. The diffusion tensors can be represented as SPD matrices, which are used to track the diffusion of water molecules in the human brain, with applications such as the diagnosis of mental disorders [14]. In array processing, covariance matrices capture both the variance and correlation of multidimensional data; this data is often used to estimate (dis)similarity measures, i.e., divergences. All of this has led to increasing interest in divergences for SPD (covariance) matrices [1,5,6,14,17–20].
The main aim of this paper is to review and extend log-determinant (log-det) divergences and to establish a link between log-det divergences and standard divergences, especially the Alpha, Beta, and Gamma divergences. Several forms of the log-det divergence exist in the literature, including the log-determinant α divergence, the Riemannian metric, Stein's loss, the S-divergence, also called the Jensen-Bregman LogDet (JBLD) divergence, and the symmetrized Kullback-Leibler Density Metric (KLDM) or Jeffreys KL divergence [5,6,14,17–20]. Despite their numerous applications, common theoretical properties and the relationships between these divergences have not been established. To this end, we propose and parameterize a wide class of log-det divergences that provide robust solutions and/or improved accuracy for noisy data. We next review fundamental properties and provide relationships among the members of this class. The advantages of selected log-det divergences are also discussed; in particular, their efficiency, simplicity of calculation, and resilience to noise and outliers [14]. The log-det divergences between two SPD matrices have also been shown to be robust to biases in composition, which can cause problems for other similarity measures.
The divergences discussed in this paper are flexible enough to facilitate the generation of several established divergences (for specific values of the tuning parameters). In addition, by adjusting the adaptive tuning parameters, we optimize the cost functions of learning algorithms and estimate desired model parameters in the presence of noise and outliers. In other words, the divergences discussed in this paper are robust with respect to outliers and noise if the tuning parameters, α, β, and γ, are chosen properly.

1.1. Preliminaries

We adopt the following notation: SPD matrices will be denoted by P ∈ ℝn×n and Q ∈ ℝn×n, and have positive eigenvalues λi (sorted in descending order); by log(P), det(P) = |P|, tr(P) we denote the logarithm, determinant, and trace of P, respectively.
For any real parameter α ∈ ℝ and for a positive definite matrix P, the matrix Pα is defined using symmetric eigenvalue decomposition as follows:
$$ P^{\alpha} = (V \Lambda V^T)^{\alpha} = V \Lambda^{\alpha} V^T, $$
where Λ is a diagonal matrix of the eigenvalues of P, and V ∈ ℝn×n is the orthogonal matrix of the corresponding eigenvectors. Similarly, we define
$$ \log(P^{\alpha}) = \log\big((V \Lambda V^T)^{\alpha}\big) = V \log(\Lambda^{\alpha})\, V^T, $$
where log(Λ) is a diagonal matrix of logarithms of the eigenvalues of P. The basic operations for positive definite matrices are provided in Appendix A.
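For concreteness, these two eigendecomposition-based definitions can be sketched in a few lines of NumPy; the helper names below are ours and not part of the original text.

```python
import numpy as np

def spd_power(P, alpha):
    """P**alpha via the symmetric eigendecomposition P = V diag(lam) V^T."""
    lam, V = np.linalg.eigh(P)            # lam > 0 for an SPD matrix
    return V @ np.diag(lam**alpha) @ V.T

def spd_log(P):
    """Matrix logarithm log(P) = V diag(log(lam)) V^T."""
    lam, V = np.linalg.eigh(P)
    return V @ np.diag(np.log(lam)) @ V.T
```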
The dissimilarity between two SPD matrices is called a metric if the following conditions hold:
  • D(P‖Q) ≥ 0, where the equality holds if and only if P = Q (nonnegativity and positive definiteness).
  • D(P‖Q) = D(Q‖P) (symmetry).
  • D(P‖Z) ≤ D(P‖Q) + D(Q‖Z) (subadditivity/triangle inequality).
Dissimilarities that only satisfy condition (1) are not metrics and are referred to as (asymmetric) divergences.

2. Basic Alpha-Beta Log-Determinant Divergence

For SPD matrices P ∈ ℝn×n and Q ∈ ℝn×n, consider a new dissimilarity measure, namely, the AB log-det divergence, given by
$$ D_{AB}^{(\alpha,\beta)}(P \,\|\, Q) = \frac{1}{\alpha\beta} \log\det\frac{\alpha (P Q^{-1})^{\beta} + \beta (P Q^{-1})^{-\alpha}}{\alpha+\beta} \qquad \text{for } \alpha \neq 0,\ \beta \neq 0,\ \alpha+\beta \neq 0, $$
where the values of the parameters α and β can be chosen so as to guarantee the non-negativity of the divergence and it vanishes to zero if and only if P = Q (this issue is addressed later by Theorems 1 and 2). Observe that this is not a symmetric divergence with respect to P and Q, except when α = β. Note that using the identity log det(P) = tr log(P), the divergence in (3) can be expressed as
$$ D_{AB}^{(\alpha,\beta)}(P \,\|\, Q) = \frac{1}{\alpha\beta} \operatorname{tr}\,\log\!\left(\frac{\alpha (P Q^{-1})^{\beta} + \beta (P Q^{-1})^{-\alpha}}{\alpha+\beta}\right) \qquad \text{for } \alpha \neq 0,\ \beta \neq 0,\ \alpha+\beta \neq 0. $$
This divergence is related to the Alpha, Beta, and AB divergences discussed in our previous work, especially Gamma divergences [1113,21]. Furthermore, the divergence in (4) is related to the AB divergence for SPD matrices [1,12], which is defined by
$$ \bar{D}_{AB}^{(\alpha,\beta)}(P \,\|\, Q) = \frac{1}{\alpha\beta} \operatorname{tr}\!\left(\frac{\alpha}{\alpha+\beta} P^{\alpha+\beta} + \frac{\beta}{\alpha+\beta} Q^{\alpha+\beta} - P^{\alpha} Q^{\beta}\right) \qquad \text{for } \alpha \neq 0,\ \beta \neq 0,\ \alpha+\beta \neq 0. $$
Note that α and β are chosen so that D A B ( α , β ) ( P Q ) is nonnegative and equal to zero if P = Q. Moreover, such divergence functions can be evaluated without computing the inverses of the SPD matrices; instead, they can be evaluated easily by computing (positive) eigenvalues of the matrix PQ−1 or its inverse. Since both matrices P and Q (and their inverses) are SPD matrices, their eigenvalues are positive. In general, it can be shown that even though PQ−1 is nonsymmetric, its eigenvalues are the same as those of the SPD matrix Q−1/2PQ−1/2; hence, its eigenvalues are always positive.
Next, consider the eigenvalue decomposition:
$$ (P Q^{-1})^{\beta} = V \Lambda^{\beta} V^{-1}, $$
where V is a nonsingular matrix and $\Lambda^{\beta} = \operatorname{diag}\{\lambda_1^{\beta}, \lambda_2^{\beta}, \ldots, \lambda_n^{\beta}\}$ is the diagonal matrix with the positive eigenvalues λi > 0, i = 1, 2, …, n, of PQ−1. Then, we can write
$$ D_{AB}^{(\alpha,\beta)}(P \,\|\, Q) = \frac{1}{\alpha\beta} \log\det\frac{\alpha V \Lambda^{\beta} V^{-1} + \beta V \Lambda^{-\alpha} V^{-1}}{\alpha+\beta} = \frac{1}{\alpha\beta} \log\!\left[\det V \,\det\!\left(\frac{\alpha \Lambda^{\beta} + \beta \Lambda^{-\alpha}}{\alpha+\beta}\right) \det V^{-1}\right] = \frac{1}{\alpha\beta} \log\det\frac{\alpha \Lambda^{\beta} + \beta \Lambda^{-\alpha}}{\alpha+\beta}, $$
which allows us to use simple algebraic manipulations to obtain
$$ D_{AB}^{(\alpha,\beta)}(P \,\|\, Q) = \frac{1}{\alpha\beta} \log\prod_{i=1}^{n} \frac{\alpha \lambda_i^{\beta} + \beta \lambda_i^{-\alpha}}{\alpha+\beta} = \frac{1}{\alpha\beta} \sum_{i=1}^{n} \log\frac{\alpha \lambda_i^{\beta} + \beta \lambda_i^{-\alpha}}{\alpha+\beta}, \qquad \alpha, \beta, \alpha+\beta \neq 0. $$
It is straightforward to verify that $D_{AB}^{(\alpha,\beta)}(P \,\|\, Q) = 0$ if P = Q. We will show later that this function is nonnegative for any SPD matrices whenever α and β are both positive or both negative.
For the singular values α = 0 and/or β = 0 (also α = −β), the AB log-det divergence in (3) is defined as a limit for α → 0 and/or β → 0. In other words, to avoid indeterminacy or singularity for specific parameter values, the AB log-det divergence can be reformulated or extended by continuity and by applying L’Hôpital’s formula to cover the singular values of α and β. Using L’Hôpital’s rule, the AB log-det divergence can be defined explicitly by
$$
D_{AB}^{(\alpha,\beta)}(P \,\|\, Q) =
\begin{cases}
\dfrac{1}{\alpha\beta} \log\det\dfrac{\alpha (P Q^{-1})^{\beta} + \beta (Q P^{-1})^{\alpha}}{\alpha+\beta} & \text{for } \alpha, \beta \neq 0,\ \alpha+\beta \neq 0, \\[2mm]
\dfrac{1}{\alpha^2}\left[\operatorname{tr}\!\big( (Q P^{-1})^{\alpha} - I \big) - \alpha \log\det(Q P^{-1})\right] & \text{for } \alpha \neq 0,\ \beta = 0, \\[2mm]
\dfrac{1}{\beta^2}\left[\operatorname{tr}\!\big( (P Q^{-1})^{\beta} - I \big) - \beta \log\det(P Q^{-1})\right] & \text{for } \alpha = 0,\ \beta \neq 0, \\[2mm]
\dfrac{1}{\alpha^2} \log\dfrac{\det\big((P Q^{-1})^{\alpha}\big)}{\det\!\big( I + \log(P Q^{-1})^{\alpha} \big)} & \text{for } \alpha = -\beta \neq 0, \\[2mm]
\dfrac{1}{2} \operatorname{tr}\,\log^2(P Q^{-1}) = \dfrac{1}{2}\, \big\| \log(Q^{-1/2} P Q^{-1/2}) \big\|_F^2 & \text{for } \alpha, \beta = 0.
\end{cases}
$$
Equivalently, using standard matrix manipulations, the above formula can be expressed in terms of the eigenvalues of PQ−1, i.e., the generalized eigenvalues computed from λiQvi = Pvi (where vi (i = 1, 2, , n) are corresponding generalized eigenvectors) as follows:
$$
D_{AB}^{(\alpha,\beta)}(P \,\|\, Q) =
\begin{cases}
\dfrac{1}{\alpha\beta} \displaystyle\sum_{i=1}^{n} \log\frac{\alpha \lambda_i^{\beta} + \beta \lambda_i^{-\alpha}}{\alpha+\beta} & \text{for } \alpha, \beta \neq 0,\ \alpha+\beta \neq 0, \\[2mm]
\dfrac{1}{\alpha^2}\left[\displaystyle\sum_{i=1}^{n} \big( \lambda_i^{-\alpha} - \log \lambda_i^{-\alpha} \big) - n\right] & \text{for } \alpha \neq 0,\ \beta = 0, \\[2mm]
\dfrac{1}{\beta^2}\left[\displaystyle\sum_{i=1}^{n} \big( \lambda_i^{\beta} - \log \lambda_i^{\beta} \big) - n\right] & \text{for } \alpha = 0,\ \beta \neq 0, \\[2mm]
\dfrac{1}{\alpha^2} \displaystyle\sum_{i=1}^{n} \log\frac{\lambda_i^{\alpha}}{1 + \log \lambda_i^{\alpha}} & \text{for } \alpha = -\beta \neq 0, \\[2mm]
\dfrac{1}{2} \displaystyle\sum_{i=1}^{n} \log^2(\lambda_i) & \text{for } \alpha, \beta = 0.
\end{cases}
$$
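As a complement to the closed-form expressions above, the following sketch evaluates the AB log-det divergence from the generalized eigenvalues of the pencil (P, Q), covering the generic branch and the limiting cases of (9); it is an illustrative implementation with our own function names, not code from the paper.

```python
import numpy as np
from scipy.linalg import eigh

def ab_logdet_divergence(P, Q, alpha, beta):
    """AB log-det divergence computed from the generalized eigenvalues of P v = lam Q v."""
    lam = eigh(P, Q, eigvals_only=True)
    n = len(lam)
    if alpha != 0 and beta != 0 and alpha + beta != 0:
        args = (alpha * lam**beta + beta * lam**(-alpha)) / (alpha + beta)
        if np.any(args <= 0):              # eigenvalues outside the bounds of Theorem 2
            return np.inf
        return np.sum(np.log(args)) / (alpha * beta)
    if alpha == 0 and beta == 0:           # squared Riemannian (AIRM) distance divided by two
        return 0.5 * np.sum(np.log(lam)**2)
    if beta == 0:                          # generalized Stein's loss (alpha != 0)
        return (np.sum(lam**(-alpha) + np.log(lam**alpha)) - n) / alpha**2
    if alpha == 0:                         # dual generalized Stein's loss (beta != 0)
        return (np.sum(lam**beta - np.log(lam**beta)) - n) / beta**2
    # remaining case: alpha = -beta != 0
    denom = 1.0 + np.log(lam**alpha)
    if np.any(denom <= 0):
        return np.inf
    return np.sum(np.log(lam**alpha / denom)) / alpha**2
```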
Theorem 1. The function D A B ( α , β ) ( P Q ) 0 given by (3) is nonnegative for any SPD matrices with arbitrary positive eigenvalues if α ≥ 0 and β ≥ 0 or if α < 0 and β < 0. It is equal to zero if and only if P = Q.
Equivalently, if the values of α and β have the same sign, the AB log-det divergence is positive independent of the distribution of the eigenvalues of PQ−1 and goes to zero if and only if all the eigenvalues are equal to one. However, if the eigenvalues are sufficiently close to one, the AB log-det divergence is also positive for different signs of α and β. The conditions for positive definiteness are given by the following theorem.
Theorem 2. The function D A B ( α , β ) ( P Q ) given by (9) is nonnegative if α > 0 and β < 0 or if α < 0 and β > 0 and if all the eigenvalues of PQ−1 satisfy the following conditions:
$$ \lambda_i > \left|\frac{\beta}{\alpha}\right|^{\frac{1}{\alpha+\beta}} \quad \forall i, \qquad \text{for } \alpha > 0 \text{ and } \beta < 0, $$
and
$$ \lambda_i < \left|\frac{\beta}{\alpha}\right|^{\frac{1}{\alpha+\beta}} \quad \forall i, \qquad \text{for } \alpha < 0 \text{ and } \beta > 0. $$
If any of the eigenvalues violate these bounds, the value of the divergence, by definition, is infinite. Moreover, when α → −β these bounds simplify to
$$ \lambda_i > e^{-1/\alpha} \quad \forall i, \qquad \alpha = -\beta > 0, $$
$$ \lambda_i < e^{-1/\alpha} \quad \forall i, \qquad \alpha = -\beta < 0. $$
In the limit, when α → 0 or β → 0, the bounds disappear. A visual presentation of these bounds for different values of α and β is shown in Figure 1.
Additionally, D A B ( α , β ) ( P Q ) = 0 only if λi = 1 for all i = 1, …, n, i.e., when P = Q.
The proofs of these theorems are given in Appendices B, C and D.
Figure 2 illustrates the typical shapes of the AB log-det divergence for different values of the eigenvalues for various choices of α and β.
In general, the AB log-det divergence is not a metric distance, since the triangle inequality is not satisfied for all parameter values. However, in the special case α = β, a metric distance can be defined as the square root of the AB log-det divergence:
$$ d_{AB}^{(\alpha,\alpha)}(P \,\|\, Q) = \sqrt{D_{AB}^{(\alpha,\alpha)}(P \,\|\, Q)}. $$
This follows from the fact that D A B ( α , α ) ( P Q ) is symmetric with respect to P and Q.
Later, we will show that measures defined in this manner lead to many important and well-known divergences and metric distances such as the Logdet Zero divergence, Affine Invariant Riemannian metric (AIRM), and square root of Stein’s loss [5,6]. Moreover, new divergences can be generated; specifically, generalized Stein’s loss, the Beta-log-det divergence, and extended Hilbert metrics.
From the divergence D A B ( α , α ) ( P Q ), a Riemannian metric and a pair of dually coupled affine connections are introduced in the manifold of positive definite matrices. Let dP be a small deviation of P, which belongs to the tangent space of the manifold at P. Calculating D A B ( α , β ) ( P + d P P ) and neglecting higher-order terms yields (see Appendix E)
$$ D_{AB}^{(\alpha,\beta)}(P + dP \,\|\, P) = \frac{1}{2} \operatorname{tr}\!\left[ dP\, P^{-1}\, dP\, P^{-1} \right]. $$
This gives a Riemannian metric that is common for all (α, β). Therefore, the Riemannian metric is the same for all AB log-det divergences, although the dual affine connections depend on α and β. The Riemannian metric is also the same as the Fisher information matrix of the manifold of multivariate Gaussian distributions of mean zero and covariance matrix P.
Interestingly, note that the Riemannian metric or geodesic distance is obtained from (3) for α = β = 0:
$$ d_R(P \,\|\, Q) = \sqrt{2\, D_{AB}^{(0,0)}(P \,\|\, Q)} = \sqrt{\operatorname{tr}\,\log^2(P Q^{-1})} $$
$$ = \big\| \log(Q^{-1/2} P Q^{-1/2}) \big\|_F $$
$$ = \sqrt{\sum_{i=1}^{n} \log^2(\lambda_i)}, $$
where λi are the eigenvalues of PQ−1.
This is also known as the AIRM. AIRM takes advantage of several important and useful theoretical properties and is probably one of the most widely used (dis)similarity measure for SPD (covariance) matrices [14,15].
For α = β = 0.5 (and for α = β = −0.5), the recently defined and deeply analyzed S-divergence (JBLD) [6,14,15,17] is obtained:
$$ D_S(P \,\|\, Q) = D_{AB}^{(0.5,0.5)}(P \,\|\, Q) = 4 \log\det\!\left( \tfrac{1}{2}\!\left[ (P Q^{-1})^{1/2} + (P Q^{-1})^{-1/2} \right] \right) = 4 \log \frac{\det(P)^{1/2}\,\det\!\left( \frac{(P Q^{-1})^{1/2} + (P Q^{-1})^{-1/2}}{2} \right)\,\det(Q)^{1/2}}{\det(P)^{1/2}\,\det(Q)^{1/2}} = 4 \log \frac{\det\frac{1}{2}(P+Q)}{\sqrt{\det(P)\,\det(Q)}} = 4 \left( \log\det\!\left(\tfrac{P+Q}{2}\right) - \tfrac{1}{2}\log\det(PQ) \right) = 4 \sum_{i=1}^{n} \log\frac{\lambda_i + 1}{2\sqrt{\lambda_i}}. $$
The S-divergence is not a metric distance. To make it a metric, we take its square root and obtain the LogDet Zero divergence, or Bhattacharyya distance [5,7,18]:
$$ d_{Bh}(P \,\|\, Q) = \sqrt{D_{AB}^{(0.5,0.5)}(P \,\|\, Q)} = 2 \sqrt{\log\det\!\left(\tfrac{P+Q}{2}\right) - \tfrac{1}{2}\log\det(PQ)} = 2 \sqrt{\log\frac{\det\frac{1}{2}(P+Q)}{\sqrt{\det(P)\,\det(Q)}}}. $$
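Both special cases are straightforward to compute; the short sketch below (our own helper names) writes the S-divergence as in (20) and the LogDet Zero (Bhattacharyya) distance as its square root.

```python
import numpy as np
from scipy.linalg import eigh

def airm(P, Q):
    """Affine Invariant Riemannian Metric, from the generalized eigenvalues of (P, Q)."""
    lam = eigh(P, Q, eigvals_only=True)
    return np.sqrt(np.sum(np.log(lam)**2))

def s_divergence(P, Q):
    """S-divergence (JBLD) as in (20): 4*[log det((P+Q)/2) - 0.5*log det(PQ)]."""
    _, ld_mid = np.linalg.slogdet(0.5 * (P + Q))
    _, ld_P = np.linalg.slogdet(P)
    _, ld_Q = np.linalg.slogdet(Q)
    return 4.0 * (ld_mid - 0.5 * (ld_P + ld_Q))

def bhattacharyya_distance(P, Q):
    """LogDet Zero (Bhattacharyya) distance: the square root of the S-divergence."""
    return np.sqrt(s_divergence(P, Q))
```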
Moreover, for α = 0, β ≠ 0 and α ≠ 0, β = 0, we obtain divergences which are generalizations of Stein’s loss (called also Burg matrix divergence or simply LogDet divergence):
$$ D_{AB}^{(0,\beta)}(P \,\|\, Q) = \frac{1}{\beta^2}\left[ \operatorname{tr}\!\big( (P Q^{-1})^{\beta} - I \big) - \beta \log\det(P Q^{-1}) \right], \qquad \beta \neq 0, $$
$$ D_{AB}^{(\alpha,0)}(P \,\|\, Q) = \frac{1}{\alpha^2}\left[ \operatorname{tr}\!\big( (Q P^{-1})^{\alpha} - I \big) - \alpha \log\det(Q P^{-1}) \right], \qquad \alpha \neq 0. $$
The divergences in (22) and (23) simplify, respectively, to the standard Stein’s loss if β = 1 and to its dual loss if α = 1.

3. Special Cases of the AB Log-Det Divergence

We now illustrate how suitable choices of the (α, β) parameters simplify the AB log-det divergence into other known divergences, such as the Alpha- and Beta-log-det divergences [5,11,18,23] (see Figure 3 and Table 1).
When α + β = 1, the AB log-det divergence reduces to the Alpha-log-det divergence [5]:
$$
D_{AB}^{(\alpha,1-\alpha)}(P \,\|\, Q) = D_{A}^{(\alpha)}(P \,\|\, Q) =
\begin{cases}
\dfrac{1}{\alpha(1-\alpha)} \log\det\!\left[ \alpha (P Q^{-1})^{1-\alpha} + (1-\alpha)(Q P^{-1})^{\alpha} \right] \\
\quad = \dfrac{1}{\alpha(1-\alpha)} \log\dfrac{\det\!\big( \alpha P + (1-\alpha) Q \big)}{\det\!\big( P^{\alpha} Q^{1-\alpha} \big)}
= \dfrac{1}{\alpha(1-\alpha)} \displaystyle\sum_{i=1}^{n} \log\frac{\alpha(\lambda_i - 1) + 1}{\lambda_i^{\alpha}} & \text{for } 0 < \alpha < 1, \\[2mm]
\operatorname{tr}(Q P^{-1}) - \log\det(Q P^{-1}) - n = \displaystyle\sum_{i=1}^{n} \big( \lambda_i^{-1} + \log(\lambda_i) \big) - n & \text{for } \alpha = 1, \\[2mm]
\operatorname{tr}(P Q^{-1}) - \log\det(P Q^{-1}) - n = \displaystyle\sum_{i=1}^{n} \big( \lambda_i - \log(\lambda_i) \big) - n & \text{for } \alpha = 0.
\end{cases}
$$
On the other hand, when α = 1 and β ≥ 0, the AB log-det divergence reduces to the Beta-log-det divergence:
$$
D_{AB}^{(1,\beta)}(P \,\|\, Q) = D_{B}^{(\beta)}(P \,\|\, Q) =
\begin{cases}
\dfrac{1}{\beta} \log\det\dfrac{(P Q^{-1})^{\beta} + \beta\, (Q P^{-1})}{1+\beta}
= \dfrac{1}{\beta} \displaystyle\sum_{i=1}^{n} \log\frac{\lambda_i^{\beta} + \beta \lambda_i^{-1}}{1+\beta} & \text{for } \beta > 0, \\[2mm]
\operatorname{tr}(Q P^{-1} - I) - \log\det(Q P^{-1}) = \displaystyle\sum_{i=1}^{n} \big( \lambda_i^{-1} + \log(\lambda_i) \big) - n & \text{for } \beta = 0, \\[2mm]
\log\dfrac{\det(P Q^{-1})}{\det\!\big( I + \log(P Q^{-1}) \big)} = \displaystyle\sum_{i=1}^{n} \log\frac{\lambda_i}{1 + \log(\lambda_i)} & \text{for } \beta = -1,\ \lambda_i > e^{-1}\ \forall i.
\end{cases}
$$
Note that $\det\!\big( I + \log(P Q^{-1}) \big) = \prod_{i=1}^{n} \big[ 1 + \log(\lambda_i) \big]$, so the Beta-log-det divergence is well defined for β = −1 provided that all the eigenvalues satisfy λi > e−1 ≈ 0.367 (e ≈ 2.72).
It is interesting to note that the Beta-log-det divergence for β → ∞ leads to a new divergence that is robust with respect to noise. This new divergence is given by
$$ \lim_{\beta \to \infty} D_{B}^{(\beta)}(P \,\|\, Q) = D_{B}^{(\infty)}(P \,\|\, Q) = \log\Big( \prod_{i=1}^{k} \lambda_i \Big), $$
where λ1 ≥ ⋯ ≥ λk denote the eigenvalues larger than one (if any).
This can be easily shown by applying the L’Hôpital’s formula. Assuming that the set Ω = {i : λi > 1} gathers the indices of those eigenvalues greater than one, we can more formally express this divergence as
$$
D_{B}^{(\infty)}(P \,\|\, Q) =
\begin{cases}
\displaystyle\sum_{i \in \Omega} \log \lambda_i & \text{for } \Omega \neq \emptyset, \\[2mm]
0 & \text{for } \Omega = \emptyset.
\end{cases}
$$
The Alpha-log-det divergence gives the standard Stein’s losses (Burg matrix divergences) for α = 1 and α = 0, and the Beta-log-det divergence is equivalent to Stein’s loss for β = 0.
Another important class of divergences is Power log-det divergences for any α = β ∈ ℝ:
$$
D_{AB}^{(\alpha,\alpha)}(P \,\|\, Q) = D_{P}^{(\alpha)}(P \,\|\, Q) =
\begin{cases}
\dfrac{1}{\alpha^2} \log\det\dfrac{(P Q^{-1})^{\alpha} + (P Q^{-1})^{-\alpha}}{2}
= \dfrac{1}{\alpha^2} \displaystyle\sum_{i=1}^{n} \log\frac{\lambda_i^{\alpha} + \lambda_i^{-\alpha}}{2} & \text{for } \alpha \neq 0, \\[2mm]
\dfrac{1}{2} \operatorname{tr}\,\log^2(P Q^{-1}) = \dfrac{1}{2} \displaystyle\sum_{i=1}^{n} \log^2(\lambda_i) & \text{for } \alpha \to 0.
\end{cases}
$$

4. Properties of the AB Log-Det Divergence

The AB log-det divergence has several important and useful theoretical properties for SPD matrices.
  • Nonnegativity:
    $$ D_{AB}^{(\alpha,\beta)}(P \,\|\, Q) \geq 0, \qquad \forall \alpha, \beta. $$
  • Identity of indiscernibles (see Theorems 1 and 2):
    $$ D_{AB}^{(\alpha,\beta)}(P \,\|\, Q) = 0 \quad \text{if and only if} \quad P = Q. $$
  • Continuity and smoothness of $D_{AB}^{(\alpha,\beta)}(P \,\|\, Q)$ as a function of α ∈ ℝ and β ∈ ℝ, including the singular cases when α = 0 or β = 0, and when α = −β (see Figure 2).
  • The divergence can be expressed in terms of the diagonal matrix Λ = diag{λ1, λ2, …, λn} of the eigenvalues of PQ−1, in the form
    $$ D_{AB}^{(\alpha,\beta)}(P \,\|\, Q) = D_{AB}^{(\alpha,\beta)}(\Lambda \,\|\, I). $$
  • Scaling invariance:
    $$ D_{AB}^{(\alpha,\beta)}(c P \,\|\, c Q) = D_{AB}^{(\alpha,\beta)}(P \,\|\, Q), $$
    for any c > 0.
  • Relative invariance under scale transformations: for given α and β and any nonzero scaling factor ω ≠ 0,
    $$ D_{AB}^{(\omega\alpha,\omega\beta)}(P \,\|\, Q) = \frac{1}{\omega^2}\, D_{AB}^{(\alpha,\beta)}\!\big( (Q^{-1/2} P Q^{-1/2})^{\omega} \,\|\, I \big). $$
  • Dual-invariance under inversion (for ω = −1):
    $$ D_{AB}^{(\alpha,\beta)}(P^{-1} \,\|\, Q^{-1}) = D_{AB}^{(-\alpha,-\beta)}(P \,\|\, Q). $$
  • Dual symmetry:
    $$ D_{AB}^{(\alpha,\beta)}(P \,\|\, Q) = D_{AB}^{(\beta,\alpha)}(Q \,\|\, P). $$
  • Affine invariance (invariance under congruence transformations):
    $$ D_{AB}^{(\alpha,\beta)}(A P A^{T} \,\|\, A Q A^{T}) = D_{AB}^{(\alpha,\beta)}(P \,\|\, Q), $$
    for any nonsingular matrix A ∈ ℝn×n.
  • Divergence lower-bound:
    $$ D_{AB}^{(\alpha,\beta)}(X^{T} P X \,\|\, X^{T} Q X) \leq D_{AB}^{(\alpha,\beta)}(P \,\|\, Q), $$
    for any full-column-rank matrix X ∈ ℝn×m with m ≤ n.
  • Scaling invariance under the Kronecker product:
    $$ D_{AB}^{(\alpha,\beta)}(Z \otimes P \,\|\, Z \otimes Q) = n\, D_{AB}^{(\alpha,\beta)}(P \,\|\, Q), $$
    for any symmetric and positive definite matrix Z of rank n.
  • Double-Sided Orthogonal Procrustes property. Consider an orthogonal matrix Ω ∈ O(n) and two symmetric positive definite matrices P and Q, with respective eigenvalue matrices ΛP and ΛQ whose elements are sorted in descending order. The AB log-det divergence between $\Omega^{T} P \Omega$ and Q is globally minimized when their eigenspaces are aligned, i.e.,
    $$ \min_{\Omega \in O(n)} D_{AB}^{(\alpha,\beta)}(\Omega^{T} P \Omega \,\|\, Q) = D_{AB}^{(\alpha,\beta)}(\Lambda_P \,\|\, \Lambda_Q). $$
  • Triangle Inequality (Metric Distance Condition), for α = β ∈ ℝ. The previous property implies the validity of the triangle inequality for arbitrary positive definite matrices, i.e.,
    $$ \sqrt{D_{AB}^{(\alpha,\alpha)}(P \,\|\, Q)} \leq \sqrt{D_{AB}^{(\alpha,\alpha)}(P \,\|\, Z)} + \sqrt{D_{AB}^{(\alpha,\alpha)}(Z \,\|\, Q)}. $$
    The proof of this property exploits the metric characterization of the square root of the S-divergence, first proposed by S. Sra in [6,17] for arbitrary SPD matrices.
Several of these properties have been already proved for the specific cases of α and β that lead to the S-divergence (α, β = 1/2) [6], the Alpha log-det divergence (0 ≤ α ≤ 1, β = 1 − α) [5] and the Riemannian metric (α, β = 0) [28, Chapter 6]. We refer the reader to Appendix F for their proofs when α, β ∈ ℝ.
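Several of the invariance properties above are easy to verify numerically. The snippet below is an illustrative sanity check on random SPD matrices, using only the generic branch of (8); it confirms scaling invariance, dual symmetry, dual-invariance under inversion, and affine invariance.

```python
import numpy as np
from numpy.linalg import inv
from scipy.linalg import eigh

def d_ab(P, Q, a, b):
    """Generic branch of (8): a, b and a + b all nonzero."""
    lam = eigh(P, Q, eigvals_only=True)
    return np.sum(np.log((a * lam**b + b * lam**(-a)) / (a + b))) / (a * b)

rng = np.random.default_rng(0)
def random_spd(n):
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

P, Q, A = random_spd(4), random_spd(4), rng.standard_normal((4, 4))
a, b = 0.7, 0.3
assert np.isclose(d_ab(P, Q, a, b), d_ab(3 * P, 3 * Q, a, b))                  # scaling invariance
assert np.isclose(d_ab(P, Q, a, b), d_ab(Q, P, b, a))                          # dual symmetry
assert np.isclose(d_ab(inv(P), inv(Q), a, b), d_ab(P, Q, -a, -b))              # dual-invariance under inversion
assert np.isclose(d_ab(P, Q, a, b), d_ab(A @ P @ A.T, A @ Q @ A.T, a, b))      # affine invariance
```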

5. Symmetrized AB Log-Det Divergences

The basic AB log-det divergence is asymmetric; that is, $D_{AB}^{(\alpha,\beta)}(P \,\|\, Q) \neq D_{AB}^{(\alpha,\beta)}(Q \,\|\, P)$, except in the special case α = β.
In general, there are several ways to symmetrize a divergence; for example, Type-1,
$$ D_{ABS1}^{(\alpha,\beta)}(P \,\|\, Q) = \frac{1}{2}\left[ D_{AB}^{(\alpha,\beta)}(P \,\|\, Q) + D_{AB}^{(\alpha,\beta)}(Q \,\|\, P) \right], $$
and Type-2, based on the Jensen-Shannon symmetrization (which is too complex for log-det divergences),
$$ D_{ABS2}^{(\alpha,\beta)}(P \,\|\, Q) = \frac{1}{2}\left[ D_{AB}^{(\alpha,\beta)}\!\left(P \,\Big\|\, \frac{P+Q}{2}\right) + D_{AB}^{(\alpha,\beta)}\!\left(Q \,\Big\|\, \frac{P+Q}{2}\right) \right]. $$
The Type-1 symmetric AB log-det divergence is defined as
$$
D_{ABS1}^{(\alpha,\beta)}(P \,\|\, Q) =
\begin{cases}
\dfrac{1}{2\alpha\beta} \log\det\!\left[ I + \dfrac{\alpha\beta}{(\alpha+\beta)^2}\Big( (P Q^{-1})^{\alpha+\beta} + (Q P^{-1})^{\alpha+\beta} - 2 I \Big) \right] & \text{for } \alpha\beta > 0, \\[2mm]
\dfrac{1}{2\alpha^2} \operatorname{tr}\!\Big( (P Q^{-1})^{\alpha} + (Q P^{-1})^{\alpha} - 2 I \Big) & \text{for } \alpha \neq 0,\ \beta = 0, \\[2mm]
\dfrac{1}{2\beta^2} \operatorname{tr}\!\Big( (P Q^{-1})^{\beta} + (Q P^{-1})^{\beta} - 2 I \Big) & \text{for } \alpha = 0,\ \beta \neq 0, \\[2mm]
\dfrac{1}{2\alpha^2} \operatorname{tr}\,\log\Big( I - \log^2(P Q^{-1})^{\alpha} \Big)^{-1} & \text{for } \alpha = -\beta \neq 0, \\[2mm]
\dfrac{1}{2} \operatorname{tr}\,\log^2(P Q^{-1}) = \dfrac{1}{2} \big\| \log(Q^{-1/2} P Q^{-1/2}) \big\|_F^2 & \text{for } \alpha, \beta = 0.
\end{cases}
$$
Equivalently, this can be expressed by the eigenvalues of PQ−1 in the form
$$
D_{ABS1}^{(\alpha,\beta)}(P \,\|\, Q) =
\begin{cases}
\dfrac{1}{2\alpha\beta} \displaystyle\sum_{i=1}^{n} \log\left( 1 + \dfrac{\alpha\beta}{(\alpha+\beta)^2}\Big( \lambda_i^{\frac{\alpha+\beta}{2}} - \lambda_i^{-\frac{\alpha+\beta}{2}} \Big)^{2} \right) & \text{for } \alpha\beta > 0, \\[2mm]
\dfrac{1}{2\alpha^2} \displaystyle\sum_{i=1}^{n} \big( \lambda_i^{\alpha} + \lambda_i^{-\alpha} - 2 \big) = \dfrac{1}{2\alpha^2} \displaystyle\sum_{i=1}^{n} \Big( \lambda_i^{\frac{\alpha}{2}} - \lambda_i^{-\frac{\alpha}{2}} \Big)^{2} & \text{for } \alpha \neq 0,\ \beta = 0, \\[2mm]
\dfrac{1}{2\beta^2} \displaystyle\sum_{i=1}^{n} \big( \lambda_i^{\beta} + \lambda_i^{-\beta} - 2 \big) = \dfrac{1}{2\beta^2} \displaystyle\sum_{i=1}^{n} \Big( \lambda_i^{\frac{\beta}{2}} - \lambda_i^{-\frac{\beta}{2}} \Big)^{2} & \text{for } \alpha = 0,\ \beta \neq 0, \\[2mm]
\dfrac{1}{2\alpha^2} \displaystyle\sum_{i=1}^{n} \log\frac{1}{1 - \log^2(\lambda_i^{\alpha})} & \text{for } \alpha = -\beta \neq 0, \\[2mm]
\dfrac{1}{2} \displaystyle\sum_{i=1}^{n} \log^2(\lambda_i) & \text{for } \alpha, \beta = 0.
\end{cases}
$$
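The generic branch (αβ > 0) of the symmetrized divergence is equally easy to evaluate from the eigenvalues; the illustrative sketch below (our own names) also confirms numerically that it coincides with the average of the two asymmetric divergences, as required by the Type-1 definition.

```python
import numpy as np
from scipy.linalg import eigh

def d_ab(P, Q, a, b):
    """Asymmetric AB log-det divergence, generic branch of (8)."""
    lam = eigh(P, Q, eigvals_only=True)
    return np.sum(np.log((a * lam**b + b * lam**(-a)) / (a + b))) / (a * b)

def d_abs1(P, Q, a, b):
    """Type-1 symmetric AB log-det divergence, branch a*b > 0 of the eigenvalue formula above."""
    lam = eigh(P, Q, eigvals_only=True)
    s = 0.5 * (a + b)
    return np.sum(np.log1p(a * b / (a + b)**2 * (lam**s - lam**(-s))**2)) / (2 * a * b)

rng = np.random.default_rng(0)
A, B = rng.standard_normal((5, 5)), rng.standard_normal((5, 5))
P, Q = A @ A.T + 5 * np.eye(5), B @ B.T + 5 * np.eye(5)
a, b = 0.8, 0.4
assert np.isclose(d_abs1(P, Q, a, b), 0.5 * (d_ab(P, Q, a, b) + d_ab(Q, P, a, b)))
```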
We consider several well-known symmetric log-det divergences (see Figure 4); in particular, we consider the following:
  • For α = β = ±0.5, we obtain the S-divergence or JBLD divergence (20).
  • For α = β = 0, we obtain the square of the AIRM (Riemannian metric) (19).
  • For α = 0 and β = ±1 or for β = 0 and α = ±1, we obtain the KLDM (symmetrized KL Density Metric), also known as the symmetric Stein’s loss or Jeffreys KL divergence [3]:
    $$ D_{J}(P \,\|\, Q) = \frac{1}{2} \operatorname{tr}\!\left( P Q^{-1} + Q P^{-1} - 2 I \right) = \frac{1}{2} \sum_{i=1}^{n} \left( \sqrt{\lambda_i} - \frac{1}{\sqrt{\lambda_i}} \right)^{2}. $$
One important potential application of the AB log-det divergence is to generate conditionally positive definite kernels, which are widely applied to classification and clustering. For a specific set of parameters, the AB log-det divergence gives rise to a Hilbert space embedding in the form of a Radial Basis Function (RBF) kernel [22]; more specifically, the AB log-det kernel is defined by
$$ K_{AB}^{(\alpha,\beta)}(P, Q) = \exp\!\left( -\gamma\, D_{ABS1}^{(\alpha,\beta)}(P \,\|\, Q) \right) = \left( \det\!\left[ I + \frac{\alpha\beta}{(\alpha+\beta)^2}\Big( (P Q^{-1})^{\alpha+\beta} + (Q P^{-1})^{\alpha+\beta} - 2 I \Big) \right] \right)^{-\frac{\gamma}{2\alpha\beta}} $$
for some selected values of γ > 0 and α, β > 0 or α, β < 0 that can make the kernel positive definite.
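A minimal sketch of such a kernel construction is given below, assuming a list of SPD matrices and the generic branch (αβ > 0) of the symmetric divergence; as noted above, positive definiteness of the resulting Gram matrix is only guaranteed for suitably chosen γ, α and β, and the function names are ours.

```python
import numpy as np
from scipy.linalg import eigh

def d_abs1(P, Q, a, b):
    """Type-1 symmetric AB log-det divergence, generic branch a*b > 0."""
    lam = eigh(P, Q, eigvals_only=True)
    s = 0.5 * (a + b)
    return np.sum(np.log1p(a * b / (a + b)**2 * (lam**s - lam**(-s))**2)) / (2 * a * b)

def ab_logdet_kernel(mats, gamma=1.0, a=0.5, b=0.5):
    """Gram matrix K[i, j] = exp(-gamma * D_ABS1(mats[i] || mats[j]))."""
    m = len(mats)
    K = np.empty((m, m))
    for i in range(m):
        for j in range(i, m):
            K[i, j] = K[j, i] = np.exp(-gamma * d_abs1(mats[i], mats[j], a, b))
    return K
```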

6. Similarity Measures for Semidefinite Covariance Matrices in Reproducing Kernel Hilbert Spaces

There are many practical applications for which the underlying covariance matrices are symmetric but only positive semidefinite, i.e., their columns do not span the whole space. For instance, in classification problems, assume two classes and a set of observation vectors {x1, , xT} and {y1, , yT} in ℝm for each class, then we may wish to find a principled way to evaluate the ensemble similarity of the data from their sample similarity. The problem of the modeling of similarity between two ensembles was studied by Zhou and Chellappa in [32]. For this purpose, they proposed several probabilistic divergence measures between positive semidefinite covariance matrices in a Reproducing kernel Hilbert space (RKHS) of finite dimensionality. Their strategy was later extended for image classification problems [33] and formalized for the Log-Hilbert-Schmidt metric between infinite-dimensional RKHS covariance operators [34].
In this section, we propose the unifying framework of the AB log-det divergences to reinterpret and extend the similarity measures obtained in [32,33] for semidefinite covariance matrices in the finite-dimensional RKHS.
We shall assume that the nonlinear functions Φx : ℝm → ℝn and Φy : ℝm → ℝn (where n > m) respectively map the data from each of the classes into their higher dimensional feature spaces. We implicitly define the feature matrices as
$$ \Phi_x = [\Phi_x(\mathbf{x}_1), \ldots, \Phi_x(\mathbf{x}_T)], \qquad \Phi_y = [\Phi_y(\mathbf{y}_1), \ldots, \Phi_y(\mathbf{y}_T)], $$
and the sample covariance matrices of the observations in the feature space as $C_x = \Phi_x J \Phi_x^T / T \in \mathbb{R}^{n \times n}$ and $C_y = \Phi_y J \Phi_y^T / T \in \mathbb{R}^{n \times n}$, where $J = I_T - \frac{1}{T} \mathbf{1}\mathbf{1}^T$ denotes the T × T centering matrix.
In practice, it is common to consider low-rank approximations of the sample covariance matrices. For a given basis $V_x = (\mathbf{v}_1, \ldots, \mathbf{v}_r) \in \mathbb{R}^{T \times r}$ of the principal subspace of $J \Phi_x^T \Phi_x J$, we can define the projection matrix $\Pi_x = V_x V_x^T$ and redefine the covariance matrices as
$$ C_x = \frac{1}{T} \Phi_x V_x V_x^T \Phi_x^T \qquad \text{and} \qquad C_y = \frac{1}{T} \Phi_y V_y V_y^T \Phi_y^T. $$
Assuming the Gaussianity of the data in the feature space, the mean vector and covariance matrix are sufficient statistics and a natural measure of dissimilarity between Φx and Φy should be a function of the first and second order statistics of the features. Furthermore, in most practical problems the mean value should be ignored due to robustness considerations, and then the comparison reduces to the evaluation of a suitable dissimilarity measure between Cx and Cy.
The dimensionality n of the feature space is typically much larger than r, so the rank of the covariance matrices in (48) will be r ≪ n and, therefore, both matrices are positive semidefinite. The AB log-det divergence is infinite when the range spaces of the covariance matrices Cx and Cy differ. This property is useful in applications which require an automatic constraint in the range of the estimates [22], but it will prohibit the practical use of the comparison when the ranges of the covariance matrices differ. The next subsections present two different strategies to address this challenging problem.

6.1. Measuring the Dissimilarity with a Divergence Lower-Bound

One possible strategy is to use dissimilarity measures which ignore the contribution to the divergence caused by the rank deficiency of the covariance matrices. This is useful when performing one comparison of the covariances matrices after applying a congruence transformation that aligns their range spaces, and can be implemented by retaining only the finite and non-zero eigenvalues of the matrix pencil (Cx, Cy).
Let Ir denote the identity matrix of size r and (·)+ the Moore-Penrose pseudoinverse operator. Consider the eigenvalue decomposition of the symmetric matrix
$$ (C_y^{+})^{\frac{1}{2}}\, C_x\, (C_y^{+})^{\frac{1}{2}} = U \Lambda U^T, $$
where U is a semi-orthogonal matrix for which the columns are the eigenvectors associated with the positive eigenvalues of the matrix pencil and
$$ \Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_r) \equiv \operatorname{diag}\,\mathrm{Eig}^{+}\!\left\{ (C_y^{+})^{\frac{1}{2}} C_x (C_y^{+})^{\frac{1}{2}} \right\} $$
is a diagonal matrix with the eigenvalues sorted in a descending order.
Note that the tall matrix $W = (C_y^{+})^{\frac{1}{2}} U \in \mathbb{R}^{n \times r}$ diagonalizes the covariance matrices of the two classes,
$$ W^T C_x W = \Lambda, $$
$$ W^T C_y W = I_r, $$
and compresses them onto a common range space. The compression automatically discards the singular and infinite eigenvalues of the matrix pencil (Cx, Cy), while it retains the finite and positive eigenvalues. In this way, the following dissimilarity measures can be obtained:
$$ L_{AB}^{(\alpha,\beta)}(C_x, C_y) \triangleq D_{AB}^{(\alpha,\beta)}\!\left( W^T C_x W \,\|\, W^T C_y W \right) = D_{AB}^{(\alpha,\beta)}(\Lambda \,\|\, I_r), $$
$$ L_{ABS1}^{(\alpha,\beta)}(C_x, C_y) \triangleq D_{ABS1}^{(\alpha,\beta)}\!\left( W^T C_x W \,\|\, W^T C_y W \right) = D_{ABS1}^{(\alpha,\beta)}(\Lambda \,\|\, I_r). $$
Note, however, that these measures should not be understood as a strict comparison of the original covariance matrices, but rather as an indirect comparison through their respective compressed versions WTCxW and WTCyW.
With the help of the kernel trick, the next lemma shows that the evaluation of the dissimilarity measures L A B ( α , β ) ( C x , C y ) and L A B S 1 ( α , β ) ( C x , C y ), does not require the explicit computation of the covariance matrices or of the feature vectors.
Lemma 1. Given the Gram matrix or kernel matrix of the input vectors
$$ \begin{pmatrix} K_{xx} & K_{xy} \\ K_{yx} & K_{yy} \end{pmatrix} = \begin{pmatrix} \Phi_x^T \Phi_x & \Phi_x^T \Phi_y \\ \Phi_y^T \Phi_x & \Phi_y^T \Phi_y \end{pmatrix} $$
and the matrices Vx and Vy which respectively span the principal subspaces of Kxx and Kyy, the positive and finite eigenvalues of the matrix pencil can be expressed by
$$ \Lambda = \operatorname{diag}\,\mathrm{Eig}^{+}\!\left\{ \left( V_x^T K_{xy} K_{yy}^{-1} V_y \right) \left( V_x^T K_{xy} K_{yy}^{-1} V_y \right)^T \right\}. $$
Proof. The proof of the lemma relies on the property that for any pair of m × n matrices A and B, the non-zero eigenvalues of ABT and of BTA are the same (see [30, pag. 11]). Then, there is an equality between the following matrices of positive eigenvalues
$$ \Lambda = \operatorname{diag}\,\mathrm{Eig}^{+}\!\left\{ (C_y^{+})^{\frac{1}{2}} C_x (C_y^{+})^{\frac{1}{2}} \right\} = \operatorname{diag}\,\mathrm{Eig}^{+}\!\left\{ C_x C_y^{+} \right\}. $$
Taking into account the structure of the covariance matrices in (48), such eigenvalues can be explicitly obtained in terms of the kernel matrices
E i g + { C x C y + } = E i g + { ( Φ x V x V x T Φ x T ) ( ( Φ y + ) T V y V y T Φ y T ) }
= E i g + { ( V x T Φ x T ( Φ x T ) + V y V y T Φ y + ) ( Φ x V x ) }
= E i g + { ( V x T K x y K y y 1 V y ) ( V x T K x y K y y 1 V y ) T } .
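A sketch of the resulting kernel-trick computation is given below, under the shapes stated in the text (K_xy, K_yy ∈ ℝT×T and V_x, V_y ∈ ℝT×r); the helper names are ours, and the final step evaluates the generic branch of (8) on the recovered eigenvalues.

```python
import numpy as np
from scipy.linalg import eigvalsh, solve

def pencil_eigs_from_kernels(K_xy, K_yy, V_x, V_y, tol=1e-10):
    """Positive, finite eigenvalues of the pencil (C_x, C_y) via Lemma 1, without feature vectors."""
    M = V_x.T @ K_xy @ solve(K_yy, V_y)    # M = V_x^T K_xy K_yy^{-1} V_y
    lam = eigvalsh(M @ M.T)
    return lam[lam > tol]                  # keep the strictly positive eigenvalues

def l_ab(K_xy, K_yy, V_x, V_y, a, b):
    """L_AB^{(a,b)}(C_x, C_y) = D_AB^{(a,b)}(Lambda || I_r), generic branch of (8)."""
    lam = pencil_eigs_from_kernels(K_xy, K_yy, V_x, V_y)
    return np.sum(np.log((a * lam**b + b * lam**(-a)) / (a + b))) / (a * b)
```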

6.2. Similarity Measures Between Regularized Covariance Descriptors

Several authors consider a completely different strategy, which consists in the regularization of the original covariance matrices [32–34]. In this way, the null eigenvalues of the covariances Cx and Cy are replaced by a small positive constant ρ > 0, to obtain the "regularized" positive definite matrices $\tilde{C}_x$ and $\tilde{C}_y$, respectively. The modification can be illustrated by comparing the eigendecompositions
$$ C_x = \big( U_x \,|\, U_x^{\perp} \big) \begin{pmatrix} \Lambda_x & 0 \\ 0 & 0 \end{pmatrix} \big( U_x \,|\, U_x^{\perp} \big)^T = U_x \Lambda_x U_x^T, $$
$$ \tilde{C}_x = \big( U_x \,|\, U_x^{\perp} \big) \begin{pmatrix} \Lambda_x & 0 \\ 0 & \rho I_{n-r} \end{pmatrix} \big( U_x \,|\, U_x^{\perp} \big)^T = C_x + \rho\, U_x^{\perp} (U_x^{\perp})^T. $$
Then, the dissimilarity measure of the data in the feature space can be obtained just by measuring a divergence between the SPD matrices C ˜ x and C ˜ y. Again, the idea is to compute the value of the divergence without requiring the evaluation of the feature vectors but by using the available kernels.
Using the properties of the trace and the determinants, a practical formula for the log-det Alpha-divergence has been obtained in [32,33] for 0 < α < 1. The resulting expression
D A B ( α , 1 α ) ( C ˜ x C ˜ y ) = 1 α ( 1 α ) log det ( I 2 r + ρ 1 H ) 1 ( 1 α ) log det ( ρ 1 Λ x ) 1 α log det ( ρ 1 Λ y )
is a function of the principal eigenvalues of the kernels
$$ \Lambda_x = V_x^T K_{xx} V_x, \qquad \Lambda_y = V_y^T K_{yy} V_y, $$
and the matrix
H = ( ( α ) 1 2 W x 0 0 ( 1 α ) 1 2 W y ) ( K x x K x y K y x K y y ) ( ( α ) 1 2 W x 0 0 ( 1 α ) 1 2 W y ) T .
where
$$ W_x = V_x \big( I_r - \rho \Lambda_x^{-1} \big)^{\frac{1}{2}} \qquad \text{and} \qquad W_y = V_y \big( I_r - \rho \Lambda_y^{-1} \big)^{\frac{1}{2}}. $$
The evaluation of the divergence outside the interval 0 < α < 1, or when β ≠ 1 − α, is not covered by this formula and, in general, requires knowledge of the eigenvalues of the matrix C ˜ y 1 2 C ˜ x C ˜ y 1 2. However, different analyses are necessary depending on the dimension of the intersection of the range space of both covariance matrices Cx and Cy. In the following, we study the two more general scenarios.
Case (A) The range spaces of Cx and Cy are the same.
In this case $U_y^{\perp}(U_y^{\perp})^T = U_x^{\perp}(U_x^{\perp})^T$ and the eigenvalues of the matrix
$$ \tilde{C}_x \tilde{C}_y^{-1} = \left( C_x + \rho\, U_x^{\perp} (U_x^{\perp})^T \right) \left( C_y^{+} + \rho^{-1} U_x^{\perp} (U_x^{\perp})^T \right) $$
$$ = C_x C_y^{+} + U_x^{\perp} (U_x^{\perp})^T $$
coincide with the nonzero eigenvalues of $C_x C_y^{+}$, except for (n − r) additional eigenvalues which are equal to 1. Then, using the equivalence between (57) and (60), the divergence reduces to the following form
$$ D_{AB}^{(\alpha,\beta)}(\tilde{C}_x \,\|\, \tilde{C}_y) = L_{AB}^{(\alpha,\beta)}(C_x, C_y) $$
$$ = D_{AB}^{(\alpha,\beta)}\!\left( \left( V_x^T K_{xy} K_{yy}^{-1} V_y \right)\left( V_x^T K_{xy} K_{yy}^{-1} V_y \right)^T \,\Big\|\, I_r \right). $$
Case (B) The range spaces of Cx and Cy are disjoint.
In practice, for n ≫ r this is the most probable scenario. In such a case, the r largest eigenvalues of the matrix $\tilde{C}_x \tilde{C}_y^{-1}$ diverge as ρ tends to zero. Hence, we cannot bound these eigenvalues from above and, for this reason, it makes no sense to study the case sign(α) ≠ sign(β), so in this section we assume that sign(α) = sign(β).
Theorem 3. When range spaces of Cx and Cy are disjoint and for a sufficiently small value of ρ > 0, the AB log-det divergence is closely approximated by the formula
$$ D_{AB}^{(\alpha,\beta)}(\tilde{C}_x \,\|\, \tilde{C}_y) \approx D_{AB}^{(\alpha,\beta)}\!\left( C_{x|y}^{(\rho)} \,\|\, \rho I_r \right) + D_{AB}^{(\beta,\alpha)}\!\left( C_{y|x}^{(\rho)} \,\|\, \rho I_r \right), $$
where C x | y ( ρ ) (and respectively C y | x ( ρ ) by interchanging x and y) denotes the matrix
C x | y ( ρ ) = Λ x ρ I r ρ 2 Λ y 1 W x T K x y W y Λ y 1 W y T K y x W x .
The proof of the theorem is presented in Appendix G. The eigenvalues of the matrices $C_{x|y}^{(\rho)}$ and $C_{y|x}^{(\rho)}$ estimate the r largest eigenvalues of $\tilde{C}_y^{-\frac{1}{2}} \tilde{C}_x \tilde{C}_y^{-\frac{1}{2}}$ and of its inverse $\tilde{C}_x^{-\frac{1}{2}} \tilde{C}_y \tilde{C}_x^{-\frac{1}{2}}$, respectively. The relative error in the estimation of these eigenvalues is of order O(ρ), i.e., it gradually improves as ρ tends to zero. The approximation is asymptotically exact, and $C_{x|y}^{(\rho)}$ and $C_{y|x}^{(\rho)}$ converge respectively to the conditional covariance matrices
$$ C_{x|y} = \lim_{\rho \to 0} C_{x|y}^{(\rho)} = V_x^T K_{xx} V_x - \left( V_x^T K_{xy} V_y \right) \left( V_y^T K_{yy} V_y \right)^{-1} \left( V_x^T K_{xy} V_y \right)^T, $$
$$ C_{y|x} = \lim_{\rho \to 0} C_{y|x}^{(\rho)} = V_y^T K_{yy} V_y - \left( V_y^T K_{yx} V_x \right) \left( V_x^T K_{xx} V_x \right)^{-1} \left( V_y^T K_{yx} V_x \right)^T, $$
while ρ I converges to the zero matrix.
In the limit, the value of the divergence is not very useful because
$$ \lim_{\rho \to 0} D_{AB}^{(\alpha,\beta)}(\tilde{C}_x \,\|\, \tilde{C}_y) = \infty, $$
though there are some practical ways to circumvent this limitation. For example, when α = 0 or β = 0, the divergence can be scaled by a suitable power of ρ to make it finite (see Section 3.3.1 in [32]). The scaled form of the divergence between the regularized covariance matrices is
$$ \mathrm{SD}_{AB}^{(\alpha,\beta)}(\tilde{C}_x \,\|\, \tilde{C}_y) \triangleq \lim_{\rho \to 0}\ \rho^{\max\{\alpha,\beta\}}\, D_{AB}^{(\alpha,\beta)}(\tilde{C}_x \,\|\, \tilde{C}_y). $$
Examples of scaled divergences are the following versions of Stein’s losses
$$ \mathrm{SD}_{AB}^{(0,\beta)}(\tilde{C}_x \,\|\, \tilde{C}_y) = \lim_{\rho \to 0} \rho^{\beta}\, D_{AB}^{(0,\beta)}(\tilde{C}_x \,\|\, \tilde{C}_y) = \frac{1}{\beta^2} \operatorname{tr}\!\left( (C_{x|y})^{\beta} \right) \geq 0, \qquad \beta > 0, $$
$$ \mathrm{SD}_{AB}^{(\alpha,0)}(\tilde{C}_x \,\|\, \tilde{C}_y) = \lim_{\rho \to 0} \rho^{\alpha}\, D_{AB}^{(\alpha,0)}(\tilde{C}_x \,\|\, \tilde{C}_y) = \frac{1}{\alpha^2} \operatorname{tr}\!\left( (C_{y|x})^{\alpha} \right) \geq 0, \qquad \alpha > 0, $$
as well as the Jeffrey’s KL family of symmetric divergences (cf. Equation (23) in [33])
$$ \mathrm{SD}_{ABS1}^{(\alpha,0)}(\tilde{C}_x \,\|\, \tilde{C}_y) = \lim_{\rho \to 0} \rho^{\alpha}\, D_{ABS1}^{(\alpha,0)}(\tilde{C}_x \,\|\, \tilde{C}_y) = \frac{1}{2\alpha^2} \left( \operatorname{tr}\!\left( (C_{x|y})^{\alpha} \right) + \operatorname{tr}\!\left( (C_{y|x})^{\alpha} \right) \right), \qquad \alpha > 0. $$
In other cases, when the scaling is not sufficient to obtain a finite and practical dissimilarity measure, an affine transformation may be used. The idea is to identify the divergent part of D A B ( α , β ) ( C ˜ x C ˜ y ) as ρ → 0 and use its value as a reference for the evaluation the dissimilarity. For α, β ≥ 0, the relative AB log-det dissimilarity measure is the limiting value of the affine transformation
RD A B ( α , β ) ( C ˜ x C ˜ y ) lim ρ 0 min { α , β } ( D A B S 1 ( α , 0 ) ( C ˜ x C ˜ y ) r α β log α β ρ ( α + β ) ( α + β ) 2 ) , α , β > 0 .
After its extension by continuity (including as special cases α = 0 or β = 0), the function
RD A B ( α , β ) ( C ˜ x C ˜ y ) = { log det ( C x | y ) + α β log det ( C y | x ) β > α 0 log det ( C x | y ) + log det ( C y | x ) α = β 0 log det ( C y | x ) + β α log det ( C x | y ) α > β 0
provides simple formulas to measure the relative dissimilarity between symmetric positive semidefinite matrices Cx and Cy. However, it should be taken into account that, as a consequence of its relative character, this function is not bounded below and can achieve negative values.

7. Modifications and Generalizations of AB Log-Det Divergences and Gamma Matrix Divergences

The divergence (3) discussed in the previous sections can be extended and modified in several ways. It is interesting to note that the positive eigenvalues of PQ−1 play a similar role as the ratios (pi/qi) and (qi/pi) when used in the wide class of standard discrete divergences, see for example, [11,12]; hence, we can apply such divergences to formulate a modified log-det divergence as a function of the eigenvalues λi.
For example, consider the Itakura-Saito distance defined by
$$ D_{IS}(\mathbf{p} \,\|\, \mathbf{q}) = \sum_{i} \left( \frac{p_i}{q_i} + \log\frac{q_i}{p_i} - 1 \right). $$
It is worth noting that we can generate a large class of divergences or cost functions using Csiszár f-functions [13,24,25]. By replacing the ratios pi/qi with λi (and qi/pi with λi−1), we obtain the log-det divergence for SPD matrices:
$$ D_{IS}(P \,\|\, Q) = \sum_{i=1}^{n} \big( \lambda_i - \log(\lambda_i) \big) - n, $$
which is consistent with (24) and (26).
As another example, consider the discrete Gamma divergence [11,12] defined by
$$ D_{AC}^{(\alpha,\beta)}(\mathbf{p} \,\|\, \mathbf{q}) = \frac{1}{\beta(\alpha+\beta)} \log\Big( \sum_i p_i^{\alpha+\beta} \Big) + \frac{1}{\alpha(\alpha+\beta)} \log\Big( \sum_i q_i^{\alpha+\beta} \Big) - \frac{1}{\alpha\beta} \log\Big( \sum_i p_i^{\alpha} q_i^{\beta} \Big) = \frac{1}{\alpha\beta(\alpha+\beta)} \log\frac{\Big( \sum_i p_i^{\alpha+\beta} \Big)^{\alpha} \Big( \sum_i q_i^{\alpha+\beta} \Big)^{\beta}}{\Big( \sum_i p_i^{\alpha} q_i^{\beta} \Big)^{\alpha+\beta}}, \qquad \text{for } \alpha \neq 0,\ \beta \neq 0,\ \alpha+\beta \neq 0, $$
which when α = 1 and β → −1, simplifies to the following form [11]:
$$ \lim_{\beta \to -1} D_{AC}^{(1,\beta)}(\mathbf{p} \,\|\, \mathbf{q}) = \frac{1}{n} \sum_{i=1}^{n} \log\frac{q_i}{p_i} + \log\Big( \sum_{i=1}^{n} \frac{p_i}{q_i} \Big) - \log(n) = \log\frac{\frac{1}{n}\sum_{i=1}^{n} \frac{p_i}{q_i}}{\Big( \prod_{i=1}^{n} \frac{p_i}{q_i} \Big)^{1/n}}. $$
Hence, by substituting pi/qi with λi, we derive a new Gamma matrix divergence for SPD matrices:
$$ D_{CCA}^{(1,0)}(P \,\|\, Q) = D_{AC}^{(1,-1)}(P \,\|\, Q) = \frac{1}{n} \sum_{i=1}^{n} \log\lambda_i^{-1} + \log\Big( \sum_{i=1}^{n} \lambda_i \Big) - \log(n) = \log\frac{\frac{1}{n}\sum_{i=1}^{n} \lambda_i}{\Big( \prod_{i=1}^{n} \lambda_i \Big)^{1/n}} = \log\frac{M_1\{\lambda_i\}}{M_0\{\lambda_i\}}, $$
where M1 denotes the arithmetic mean, and M0 denotes the geometric mean.
Interestingly, (86) can be expressed equivalently as
$$ D_{CCA}^{(1,0)}(P \,\|\, Q) = \log\big( \operatorname{tr}(P Q^{-1}) \big) - \frac{1}{n} \log\det(P Q^{-1}) - \log(n). $$
Similarly, using the symmetric Gamma divergence defined in [11,12],
$$ D_{ACS}^{(\alpha,\beta)}(\mathbf{p} \,\|\, \mathbf{q}) = \frac{1}{\alpha\beta} \log\frac{\Big( \sum_i p_i^{\alpha+\beta} \Big) \Big( \sum_i q_i^{\alpha+\beta} \Big)}{\Big( \sum_i p_i^{\alpha} q_i^{\beta} \Big)\Big( \sum_i p_i^{\beta} q_i^{\alpha} \Big)}, \qquad \text{for } \alpha \neq 0,\ \beta \neq 0,\ \alpha+\beta \neq 0, $$
for α = 1 and β → −1 and by substituting the ratios pi/qi with λi, we obtain a new Gamma matrix divergence as follows:
$$ D_{ACS}^{(1,-1)}(P \,\|\, Q) = \log\Big( \Big( \sum_{i=1}^{n} \lambda_i \Big)\Big( \sum_{i=1}^{n} \lambda_i^{-1} \Big) \Big) - 2\log(n) = \log\Big( \Big( \frac{1}{n}\sum_{i=1}^{n} \lambda_i \Big)\Big( \frac{1}{n}\sum_{i=1}^{n} \lambda_i^{-1} \Big) \Big) = \log\big( M_1\{\lambda_i\}\, M_1\{\lambda_i^{-1}\} \big) $$
$$ = \log\frac{M_1\{\lambda_i\}}{M_{-1}\{\lambda_i\}}, $$
where M−1{λi} denotes the harmonic mean.
Note that for n → ∞, this formulated divergence can be expressed compactly as
$$ D_{ACS}^{(1,-1)}(P \,\|\, Q) = \log\big( E\{u\}\, E\{u^{-1}\} \big), $$
where ui = {λi} and u i 1 = { λ i 1 }.
The basic means are defined as follows:
$$
M_{\gamma}(\boldsymbol{\lambda}) =
\begin{cases}
M_{-\infty} = \min\{\lambda_1, \ldots, \lambda_n\}, & \gamma \to -\infty, \\[1mm]
M_{-1} = n \Big( \sum_{i=1}^{n} \frac{1}{\lambda_i} \Big)^{-1}, & \gamma = -1, \\[1mm]
M_{0} = \Big( \prod_{i=1}^{n} \lambda_i \Big)^{1/n}, & \gamma = 0, \\[1mm]
M_{1} = \frac{1}{n} \sum_{i=1}^{n} \lambda_i, & \gamma = 1, \\[1mm]
M_{2} = \Big( \frac{1}{n} \sum_{i=1}^{n} \lambda_i^{2} \Big)^{1/2}, & \gamma = 2, \\[1mm]
M_{\infty} = \max\{\lambda_1, \ldots, \lambda_n\}, & \gamma \to \infty,
\end{cases}
$$
with
$$ M_{-\infty} \leq M_{-1} \leq M_{0} \leq M_{1} \leq M_{2} \leq M_{\infty}, $$
where equality holds only if all λi are equal. By increasing the value of γ, more emphasis is put on large relative errors, i.e., on the λi whose values are far from one. Depending on the value of γ, we obtain the minimum entry of the vector λ (for γ → −∞), its harmonic mean (γ = −1), the geometric mean (γ = 0), the arithmetic mean (γ = 1), the quadratic mean (γ = 2), and the maximum entry of the vector (γ → ∞).
Exploiting the above inequalities for the means, the divergences in (86) and (90) can be heuristically generalized (defined) as follows:
$$ D_{CCA}^{(\gamma_2, \gamma_1)}(P \,\|\, Q) = \log\frac{M_{\gamma_2}\{\lambda_i\}}{M_{\gamma_1}\{\lambda_i\}}, $$
for γ2 > γ1.
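A small sketch of the power means and of the resulting divergence (94) is given below; the names are illustrative.

```python
import numpy as np
from scipy.linalg import eigh

def power_mean(lam, gamma):
    """Power mean M_gamma of a vector of positive values."""
    lam = np.asarray(lam, dtype=float)
    if np.isinf(gamma):
        return lam.max() if gamma > 0 else lam.min()
    if gamma == 0:
        return np.exp(np.mean(np.log(lam)))          # geometric mean
    return (np.mean(lam**gamma))**(1.0 / gamma)      # harmonic (-1), arithmetic (1), quadratic (2), ...

def d_cca(P, Q, gamma2, gamma1):
    """Generalized Gamma matrix divergence (94): log of a ratio of power means."""
    lam = eigh(P, Q, eigvals_only=True)              # eigenvalues of P Q^{-1}
    return np.log(power_mean(lam, gamma2) / power_mean(lam, gamma1))
```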
The new divergence in (94) is quite general and flexible, and in extreme cases, it takes the following form:
$$ D_{CCA}^{(\infty, -\infty)}(P \,\|\, Q) = d_{H}(P \,\|\, Q) = \log\frac{M_{\infty}\{\lambda_i\}}{M_{-\infty}\{\lambda_i\}} = \log\frac{\lambda_{\max}}{\lambda_{\min}}, $$
which is, in fact, a well-known Hilbert projective metric [6,26].
The Hilbert projective metric is extremely simple and suitable for big data because it requires only two (minimum and maximum) eigenvalue computations of the matrix PQ−1.
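A minimal sketch of the Hilbert projective metric is shown below; only the two extreme generalized eigenvalues of the pencil (P, Q) are needed, and for very large matrices they could be obtained with iterative eigensolvers instead of the full decomposition used here for simplicity.

```python
import numpy as np
from scipy.linalg import eigh

def hilbert_metric(P, Q):
    """Hilbert projective metric: log of the ratio of extreme eigenvalues of P Q^{-1}."""
    lam = eigh(P, Q, eigvals_only=True)    # generalized eigenvalues, ascending order
    return np.log(lam[-1] / lam[0])
```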
The Hilbert projective metric satisfies the following important properties [6,27]:
  • Nonnegativity, dH(P‖Q) ≥ 0, and definiteness: dH(P‖Q) = 0 if and only if there exists a c > 0 such that Q = cP.
  • Invariance to scaling:
    $$ d_{H}(c_1 P \,\|\, c_2 Q) = d_{H}(P \,\|\, Q), $$
    for any c1, c2 > 0.
  • Symmetry:
    $$ d_{H}(P \,\|\, Q) = d_{H}(Q \,\|\, P). $$
  • Invariance under inversion:
    $$ d_{H}(P \,\|\, Q) = d_{H}(P^{-1} \,\|\, Q^{-1}). $$
  • Invariance under congruence transformations:
    $$ d_{H}(A P A^{T} \,\|\, A Q A^{T}) = d_{H}(P \,\|\, Q), $$
    for any invertible matrix A.
  • Invariance under geodesic (Riemannian) transformations (by taking A = P−1/2 in (99)):
    $$ d_{H}(I \,\|\, P^{-1/2} Q P^{-1/2}) = d_{H}(P \,\|\, Q). $$
  • Separability of the divergence for the Kronecker product of SPD matrices:
    $$ d_{H}(P_1 \otimes P_2 \,\|\, Q_1 \otimes Q_2) = d_{H}(P_1 \,\|\, Q_1) + d_{H}(P_2 \,\|\, Q_2). $$
  • Scaling of powers of SPD matrices:
    $$ d_{H}(P^{\omega} \,\|\, Q^{\omega}) = |\omega|\, d_{H}(P \,\|\, Q), $$
    for any ω ≠ 0. Hence, for 0 < |ω1| ≤ 1 ≤ |ω2| we have
    $$ d_{H}(P^{\omega_1} \,\|\, Q^{\omega_1}) \leq d_{H}(P \,\|\, Q) \leq d_{H}(P^{\omega_2} \,\|\, Q^{\omega_2}). $$
  • Scaling under the weighted geometric mean:
    $$ d_{H}(P \#_s Q \,\|\, P \#_u Q) = |s - u|\, d_{H}(P \,\|\, Q), $$
    for any u, s ≠ 0, where
    $$ P \#_u Q = P^{1/2} \big( P^{-1/2} Q P^{-1/2} \big)^{u} P^{1/2}. $$
  • Triangle inequality: $ d_{H}(P \,\|\, Q) \leq d_{H}(P \,\|\, Z) + d_{H}(Z \,\|\, Q) $.
These properties can easily be derived and verified. For example, property (9) can easily be derived as follows [6,27]:
$$ d_{H}(P \#_s Q \,\|\, P \#_u Q) = d_{H}\!\left( P^{1/2} (P^{-1/2} Q P^{-1/2})^{s} P^{1/2} \,\big\|\, P^{1/2} (P^{-1/2} Q P^{-1/2})^{u} P^{1/2} \right) = d_{H}\!\left( (P^{-1/2} Q P^{-1/2})^{s} \,\big\|\, (P^{-1/2} Q P^{-1/2})^{u} \right) = d_{H}\!\left( (P^{-1/2} Q P^{-1/2})^{s-u} \,\big\|\, I \right) = |s - u|\, d_{H}(P \,\|\, Q). $$
In Table 2, we summarize and compare some fundamental properties of three important metric distances: the Hilbert projective metric, Riemannian metric, and LogDet Zero (Bhattacharyya) distance. Since some of these properties are new, we refer to [6,27,28].

7.1. The AB Log-Det Divergence for Noisy and Ill-Conditioned Covariance Matrices

In real-world signal processing and machine learning applications, the SPD sampled matrices can be strongly corrupted by noise and extremely ill conditioned. In such cases, the eigenvalues of the generalized eigenvalue (GEVD) problem Pvi = λiQvi can be divided into a signal subspace and noise subspace. The signal subspace is usually represented by the largest eigenvalues (and corresponding eigenvectors), and the noise subspace is usually represented by the smallest eigenvalues (and corresponding eigenvectors), which should be rejected; in other words, in the evaluation of log-det divergences, only the eigenvalues that represent the signal subspace should be taken into account. The simplest approach is to find the truncated dominant eigenvalues by applying the suitable threshold τ > 0; equivalently, find an index r ≤ n for which λr+1 ≤ τ and perform a summation. For example, truncation reduces the summation in (8) from 1 to r (instead of 1 to n) [22]. The threshold parameter τ can be selected via cross validation.
Recent studies suggest that the real signal subspace covariance matrices can be better represented by truncating the eigenvalues. A popular and relatively simple method applies a thresholding and shrinkage rule to the eigenvalues [35]:
$$ \tilde{\lambda}_i = \lambda_i\, \max\left\{ \left( 1 - \frac{\tau^{\gamma}}{\lambda_i^{\gamma}} \right),\ 0 \right\}, $$
where any eigenvalue smaller than the specific threshold is set to zero, and the remaining eigenvalues are shrunk. Note that the smallest eigenvalues are shrunk more than the largest one. For γ = 1, we obtain a standard soft thresholding, and for γ → ∞ a standard hard thresholding is obtained [36]. The optimal threshold τ > 0 can be estimated along with the parameter γ > 0 using cross validation. However, a more practical and efficient method is to apply the Generalized Stein Unbiased Risk Estimate (GSURE) method even if the variance of the noise is unknown (for more details, we refer to [35] and the references therein).
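A one-line implementation of this thresholding-and-shrinkage rule is sketched below (parameter names are ours); γ = 1 reproduces soft thresholding and large γ approaches hard thresholding.

```python
import numpy as np

def shrink_eigenvalues(lam, tau, gamma=1.0):
    """Threshold-and-shrink rule: eigenvalues below tau are zeroed, the rest are shrunk."""
    lam = np.asarray(lam, dtype=float)
    return lam * np.maximum(1.0 - (tau / lam)**gamma, 0.0)
```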
In this paper, we propose an alternative approach in which the bias generated by noise is reduced by suitable choices of α and β [12]. Instead of using the eigenvalues λi of PQ−1 or its inverse, we use regularized or shrinked eigenvalues [3537]. For example, in light of (8), we can use the following shrinked eigenvalues:
$$ \tilde{\lambda}_i = \left( \frac{\alpha \lambda_i^{\beta} + \beta \lambda_i^{-\alpha}}{\alpha+\beta} \right)^{\frac{1}{\alpha\beta}} \geq 1, \qquad \text{for } \alpha, \beta \neq 0, \quad \alpha, \beta > 0 \ \text{or}\ \alpha, \beta < 0, $$
which play a similar role as the ratios (pi/qi) (pi ≥ qi), which are used in the standard discrete divergences [11,12]. It should be noted that equalities λ ˜ i = 1, ∀i hold only if all λi of PQ−1 are equal to one, which occurs only if P = Q. For example, the new Gamma divergence in (94) can be formulated even more generally as
$$ D_{CCA}^{(\gamma_2, \gamma_1)}(P \,\|\, Q) = \log\frac{M_{\gamma_2}\{\tilde{\lambda}_i\}}{M_{\gamma_1}\{\tilde{\lambda}_i\}}, $$
where γ2 > γ1, and λ ˜ i are the regularized or optimally shrinked eigenvalues.

8. Divergences of Multivariate Gaussian Densities and Differential Relative Entropies of Multivariate Normal Distributions

In this section, we show the links or relationships between a family of continuous Gamma divergences and AB log-det divergences for multivariate Gaussian densities.
Consider the two multivariate Gaussian (normal) distributions:
$$ p(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^n \det P}} \exp\!\left( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_1)^T P^{-1} (\mathbf{x} - \boldsymbol{\mu}_1) \right), $$
$$ q(\mathbf{x}) = \frac{1}{\sqrt{(2\pi)^n \det Q}} \exp\!\left( -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu}_2)^T Q^{-1} (\mathbf{x} - \boldsymbol{\mu}_2) \right), \qquad \mathbf{x} \in \mathbb{R}^n, $$
where µ1 ∈ ℝn and µ2 ∈ ℝn are mean vectors, and P = Σ1 ∈ ℝn×n and Q = Σ2 ∈ ℝn×n are the covariance matrices of p(x) and q(x), respectively.
Furthermore, consider the Gamma divergence for these distributions:
$$ D_{AC}^{(\alpha,\beta)}\big( p(\mathbf{x}) \,\|\, q(\mathbf{x}) \big) = \frac{1}{\alpha\beta(\alpha+\beta)} \log \frac{\left( \int_{\Omega} p^{\alpha+\beta}(\mathbf{x})\, d\mathbf{x} \right)^{\alpha} \left( \int_{\Omega} q^{\alpha+\beta}(\mathbf{x})\, d\mathbf{x} \right)^{\beta}}{\left( \int_{\Omega} p^{\alpha}(\mathbf{x})\, q^{\beta}(\mathbf{x})\, d\mathbf{x} \right)^{\alpha+\beta}} \qquad \text{for } \alpha \neq 0,\ \beta \neq 0,\ \alpha+\beta \neq 0, $$
which generalizes a family of Gamma divergences [11,12].
Theorem 4. The Gamma divergence in (112) for multivariate Gaussian densities (110) and (111) can be expressed in closed form as follows:
$$ D_{AC}^{(\alpha,\beta)}\big( p(\mathbf{x}) \,\|\, q(\mathbf{x}) \big) = \frac{1}{2} D_{AB}^{(\beta,\alpha)}\!\left( \big( Q^{-1/2} P Q^{-1/2} \big)^{\frac{1}{\alpha+\beta}} \,\Big\|\, I \right) + \frac{1}{2} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T (\alpha Q + \beta P)^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) $$
$$ = \frac{1}{2\alpha\beta} \log \frac{\det\!\left( \frac{\alpha}{\alpha+\beta} Q + \frac{\beta}{\alpha+\beta} P \right)}{\det(Q)^{\frac{\alpha}{\alpha+\beta}} \det(P)^{\frac{\beta}{\alpha+\beta}}} + \frac{1}{2(\alpha+\beta)} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T \left( \frac{\alpha}{\alpha+\beta} Q + \frac{\beta}{\alpha+\beta} P \right)^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2), $$
for α > 0 and β > 0.
The proof is provided in Appendix H. Note that for α + β = 1, the first term in the right-hand-side of (113) also simplifies as
$$ \frac{1}{2} D_{AB}^{(\beta,\alpha)}\!\left( \big( Q^{-1/2} P Q^{-1/2} \big)^{\frac{1}{\alpha+\beta}} \,\Big\|\, I \right) \Big|_{\beta = 1-\alpha} = \frac{1}{2} D_{AB}^{(1-\alpha,\alpha)}(P \,\|\, Q) = \frac{1}{2} D_{A}^{(1-\alpha)}(P \,\|\, Q). $$
Observe that Formula (113) consists of two terms: the first term is expressed via the AB log-det divergence, which measures the similarity between two covariance or precision matrices and is independent from the mean vectors, while the second term is a quadratic form expressed by the Mahalanobis distance, which represents the distance between the means (weighted by the covariance matrices) of multivariate Gaussian distributions. Note that the second term is zero when the mean values µ1 and µ2 coincide.
Theorem 4 is a generalization of the following well-known results:
  • For α = 1 and β → 0, the Kullback-Leibler divergence can be expressed as [5,38]
    $$ \lim_{\beta \to 0} D_{AC}^{(1,\beta)}\big( p(\mathbf{x}) \,\|\, q(\mathbf{x}) \big) = D_{KL}\big( p(\mathbf{x}) \,\|\, q(\mathbf{x}) \big) = \int_{\Omega} p(\mathbf{x}) \log\frac{p(\mathbf{x})}{q(\mathbf{x})}\, d\mathbf{x} = \frac{1}{2}\left( \operatorname{tr}(P Q^{-1}) - \log\det(P Q^{-1}) - n \right) + \frac{1}{2} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T Q^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2), $$
    where the last term represents the Mahalanobis distance, which becomes zero for zero-mean distributions µ1 = µ2 = 0.
  • For α = β = 0.5 we have the Bhattacharyya distance [5,39]
    $$ D_{AC}^{(0.5,0.5)}\big( p(\mathbf{x}) \,\|\, q(\mathbf{x}) \big) = \frac{1}{2} d_{Bh}^{2}\big( p(\mathbf{x}) \,\|\, q(\mathbf{x}) \big) = -4 \log \int_{\Omega} \sqrt{p(\mathbf{x})\, q(\mathbf{x})}\, d\mathbf{x} = 2 \log\frac{\det\frac{P+Q}{2}}{\sqrt{\det P\, \det Q}} + \frac{1}{2} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T \left[ \frac{P+Q}{2} \right]^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2), $$
  • For α + β = 1 and 0 < α < 1, the closed form expression for the Rényi divergence is obtained [5,32,40]:
    $$ D_{A}( p \,\|\, q ) = -\frac{1}{\alpha(1-\alpha)} \log \int_{\Omega} p^{\alpha}(\mathbf{x})\, q^{1-\alpha}(\mathbf{x})\, d\mathbf{x} = \frac{1}{2\alpha(1-\alpha)} \log\frac{\det\!\big( \alpha Q + (1-\alpha) P \big)}{\det\!\big( Q^{\alpha} P^{1-\alpha} \big)} + \frac{1}{2} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T \big[ \alpha Q + (1-\alpha) P \big]^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2). $$
  • For α = β = 1, the Gamma-divergences reduce to the Cauchy-Schwartz divergence:
    $$ D_{CS}\big( p(\mathbf{x}) \,\|\, q(\mathbf{x}) \big) = -\log\frac{\int p(\mathbf{x})\, q(\mathbf{x})\, d\mu(\mathbf{x})}{\left( \int p^{2}(\mathbf{x})\, d\mu(\mathbf{x}) \right)^{1/2} \left( \int q^{2}(\mathbf{x})\, d\mu(\mathbf{x}) \right)^{1/2}} = \frac{1}{2} \log\frac{\det\frac{P+Q}{2}}{\sqrt{\det Q\, \det P}} + \frac{1}{4} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)^T \left( \frac{P+Q}{2} \right)^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2). $$
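The closed form (113) and its Kullback-Leibler limit quoted in the first bullet above can be cross-checked numerically; the following illustrative sketch implements both and verifies that, for α = 1 and a very small β, the Gamma divergence approaches the KL divergence.

```python
import numpy as np

def gamma_div_gauss(mu1, P, mu2, Q, a, b):
    """Closed form (113) of the Gamma divergence between N(mu1, P) and N(mu2, Q), a, b > 0."""
    s = a + b
    M = (a / s) * Q + (b / s) * P
    _, ld_M = np.linalg.slogdet(M)
    _, ld_P = np.linalg.slogdet(P)
    _, ld_Q = np.linalg.slogdet(Q)
    dmu = mu1 - mu2
    term1 = (ld_M - (a / s) * ld_Q - (b / s) * ld_P) / (2 * a * b)
    term2 = dmu @ np.linalg.solve(M, dmu) / (2 * s)
    return term1 + term2

def kl_gauss(mu1, P, mu2, Q):
    """Kullback-Leibler divergence D_KL(N(mu1, P) || N(mu2, Q)) in closed form."""
    n = P.shape[0]
    PQinv = np.linalg.solve(Q.T, P.T).T            # P Q^{-1}
    _, ld = np.linalg.slogdet(PQinv)
    dmu = mu1 - mu2
    return 0.5 * (np.trace(PQinv) - ld - n) + 0.5 * dmu @ np.linalg.solve(Q, dmu)

rng = np.random.default_rng(1)
A, B = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
P, Q = A @ A.T + 3 * np.eye(3), B @ B.T + 3 * np.eye(3)
mu1, mu2 = rng.standard_normal(3), rng.standard_normal(3)
assert np.isclose(gamma_div_gauss(mu1, P, mu2, Q, 1.0, 1e-6), kl_gauss(mu1, P, mu2, Q), rtol=1e-3)
```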
Similar formulas can be derived for the symmetric Gamma divergence for two multivariate Gaussian distributions. Furthermore, analogous expressions can be derived for Elliptical Gamma distributions (EGD) [41], which facilitate more flexible modeling than standard multivariate Gaussian distributions.

8.1. Multiway Divergences for Multivariate Normal Distributions with Separable Covariance Matrices

Recently, there has been growing interest in the analysis of tensors or multiway arrays [4245]. One of the most important applications of multiway tensor analysis and multilinear distributions, is magnetic resonance imaging (MRI) (we refer to [46] and the references therein). For multiway arrays, we often use multilinear (array or tensor) normal distributions that correspond to the multivariate normal (Gaussian) distributions in (110) and (111) with common means µ1 = µ2 and separable (Kronecker structured) covariance matrices:
$$ \bar{P} = \sigma_P^2 \left( P_1 \otimes P_2 \otimes \cdots \otimes P_K \right) \in \mathbb{R}^{N \times N}, $$
$$ \bar{Q} = \sigma_Q^2 \left( Q_1 \otimes Q_2 \otimes \cdots \otimes Q_K \right) \in \mathbb{R}^{N \times N}, $$
where $P_k \in \mathbb{R}^{n_k \times n_k}$ and $Q_k \in \mathbb{R}^{n_k \times n_k}$ for k = 1, 2, …, K are SPD matrices, usually normalized so that det Pk = det Qk = 1 for each k, and $N = \prod_{k=1}^{K} n_k$ [45].
One of the main advantages of the separable Kronecker model is the significant reduction in the number of variance-covariance parameters [42]. Usually, such separable covariance matrices are sparse and very large-scale. The challenge is to design an efficient and relatively simple dissimilarity measure for big data between two zero-mean multivariate (or multilinear) normal distributions ((110) and (111)). Because of its unique properties, the Hilbert projective metric is a good candidate; in particular, for separable Kronecker structured covariances, it can be expressed very simply as
$$ D_{H}(\bar{P} \,\|\, \bar{Q}) = \sum_{k=1}^{K} D_{H}(P_k \,\|\, Q_k) = \sum_{k=1}^{K} \log\frac{\tilde{\lambda}_{\max}^{(k)}}{\tilde{\lambda}_{\min}^{(k)}} = \log \prod_{k=1}^{K} \left( \frac{\tilde{\lambda}_{\max}^{(k)}}{\tilde{\lambda}_{\min}^{(k)}} \right), $$
where λ ˜ m a x ( k ) and λ ˜ m i n ( k ) are the (shrinked) maximum and minimum eigenvalues of the (relatively small) matrices P k Q k 1 for k = 1, 2, , K, respectively. We refer to this divergence as the multiway Hilbert metric. This metric has many attractive properties, especially invariance under multilinear transformations.
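The separability of the multiway Hilbert metric is easy to confirm numerically; the sketch below (two factors, illustrative names) re-implements the Hilbert metric of Section 7 and checks that the scale factors σP², σQ² drop out and that the metric decomposes over the Kronecker factors.

```python
import numpy as np
from scipy.linalg import eigh

def hilbert_metric(P, Q):
    lam = eigh(P, Q, eigvals_only=True)
    return np.log(lam[-1] / lam[0])

rng = np.random.default_rng(2)
def random_spd(n):
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

P1, Q1, P2, Q2 = random_spd(3), random_spd(3), random_spd(4), random_spd(4)
P_bar = 1.7 * np.kron(P1, P2)      # sigma_P^2 = 1.7
Q_bar = 0.4 * np.kron(Q1, Q2)      # sigma_Q^2 = 0.4
assert np.isclose(hilbert_metric(P_bar, Q_bar),
                  hilbert_metric(P1, Q1) + hilbert_metric(P2, Q2))
```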
Using the fundamental properties of divergence and SPD matrices, we derive other multiway log-det divergences. For example, the multiway Stein’s loss can be obtained:
$$ D_{MSL}(\bar{P}, \bar{Q}) = 2 D_{KL}\big( p(\mathbf{x}) \,\|\, q(\mathbf{x}) \big) = D_{AB}^{(0,1)}(\bar{P} \,\|\, \bar{Q}) = \operatorname{tr}(\bar{P} \bar{Q}^{-1}) - \log\det(\bar{P} \bar{Q}^{-1}) - N $$
$$ = \frac{\sigma_P^2}{\sigma_Q^2} \left( \prod_{k=1}^{K} \operatorname{tr}(P_k Q_k^{-1}) \right) - \sum_{k=1}^{K} \frac{N}{n_k} \log\det(P_k Q_k^{-1}) - N \log\!\left( \frac{\sigma_P^2}{\sigma_Q^2} \right) - N. $$
Note that under the constraint that det Pk = det Qk = 1, this simplifies to
$$ D_{MSL}(\bar{P} \,\|\, \bar{Q}) = \operatorname{tr}(\bar{P} \bar{Q}^{-1}) - \log\det(\bar{P} \bar{Q}^{-1}) - N = \frac{\sigma_P^2}{\sigma_Q^2} \left( \prod_{k=1}^{K} \operatorname{tr}(P_k Q_k^{-1}) \right) - N \log\!\left( \frac{\sigma_P^2}{\sigma_Q^2} \right) - N, $$
which is different from the multiway Stein’s loss recently proposed by Gerard and Hoff [45].
Similarly, if det Pk = det Qk = 1 for each k = 1, 2, , K, we can derive the multiway Riemannian metric as follows:
$$ D_{R}^{2}(\bar{P} \,\|\, \bar{Q}) = N \log^{2}\!\frac{\sigma_P^2}{\sigma_Q^2} + \sum_{k=1}^{K} \frac{N}{n_k} D_{R}^{2}(P_k \,\|\, Q_k). $$
The above multiway divergences are derived using the following properties:
$$ \bar{P} \bar{Q}^{-1} = \frac{\sigma_P^2}{\sigma_Q^2} \left( P_1 \otimes P_2 \otimes \cdots \otimes P_K \right) \left( Q_1^{-1} \otimes Q_2^{-1} \otimes \cdots \otimes Q_K^{-1} \right) = \frac{\sigma_P^2}{\sigma_Q^2}\, P_1 Q_1^{-1} \otimes P_2 Q_2^{-1} \otimes \cdots \otimes P_K Q_K^{-1}, $$
$$ \operatorname{tr}(\bar{P} \bar{Q}^{-1}) = \frac{\sigma_P^2}{\sigma_Q^2}\, \operatorname{tr}\!\left( P_1 Q_1^{-1} \otimes \cdots \otimes P_K Q_K^{-1} \right) = \frac{\sigma_P^2}{\sigma_Q^2} \prod_{k=1}^{K} \operatorname{tr}(P_k Q_k^{-1}), $$
$$ \det(\bar{P} \bar{Q}^{-1}) = \left( \frac{\sigma_P^2}{\sigma_Q^2} \right)^{N} \det\!\left( P_1 Q_1^{-1} \otimes \cdots \otimes P_K Q_K^{-1} \right) = \left( \frac{\sigma_P^2}{\sigma_Q^2} \right)^{N} \prod_{k=1}^{K} \left( \det(P_k Q_k^{-1}) \right)^{N/n_k}, $$
and the basic property: If the eigenvalues {λi} and {θj} are eigenvalues with corresponding eigenvectors {vi} and {uj} for SPD matrices A and B, respectively, then AB has eigenvalues {λiθj} with corresponding eigenvectors {viuj}.
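These Kronecker-product identities can be checked numerically in a few lines; the snippet below is illustrative, with K = 2 factors of sizes n1 = 2 and n2 = 3 (so N = 6).

```python
import numpy as np

rng = np.random.default_rng(3)
def random_spd(n):
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

P1, Q1 = random_spd(2), random_spd(2)
P2, Q2 = random_spd(3), random_spd(3)
M = np.kron(P1, P2) @ np.linalg.inv(np.kron(Q1, Q2))     # (P1 x P2)(Q1 x Q2)^{-1}
M1, M2 = P1 @ np.linalg.inv(Q1), P2 @ np.linalg.inv(Q2)
N, n1, n2 = 6, 2, 3
# trace factorizes into a product of per-factor traces
assert np.isclose(np.trace(M), np.trace(M1) * np.trace(M2))
# determinant factorizes with exponents N / n_k
assert np.isclose(np.linalg.det(M), np.linalg.det(M1)**(N // n1) * np.linalg.det(M2)**(N // n2))
```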
Other possible extensions of the AB and Gamma matrix divergences to separable multiway divergences for multilinear normal distributions under additional constraints and normalization conditions will be discussed in future works.

9. Conclusions

In this paper, we presented novel (dis)similarity measures; in particular, we considered the Alpha-Beta and Gamma log-det divergences (and/or their square-roots) that smoothly connect or unify a wide class of existing divergences for SPD matrices. We derived numerous results that uncovered or unified theoretic properties and qualitative similarities between well-known divergences and new divergences. The scope of the results presented in this paper is vast, especially since the parameterized Alpha-Beta and Gamma log-det divergence functions include several efficient and useful divergences, including those based on relative entropies, the Riemannian metric (AIRM), S-divergence, generalized Jeffreys KL (KLDM), Stein’s loss, and Hilbert projective metric. Various links and relationships between divergences were also established. Furthermore, we proposed several multiway log-det divergences for tensor (array) normal distributions.

Acknowledgments

Part of this work was supported by the Spanish Government under MICINN projects TEC2014-53103, TEC2011-23559, and by the Regional Government of Andalusia under Grant TIC-7869.

Appendices

A. Basic operations for positive definite matrices

Functions of positive definite matrices frequently appear in many research areas, for an introduction we refer the reader to Chapter 11 in [31]. Consider a positive definite matrix P of rank n with eigendecomposition VΛVT. The matrix function f(P) is defined as
$$ f(P) = V f(\Lambda)\, V^T, $$
where $f(\Lambda) \triangleq \operatorname{diag}\big( f(\lambda_1), \ldots, f(\lambda_n) \big)$. With the help of this definition, the following list of well-known properties can be easily obtained:
$$ \log(\det P) = \operatorname{tr}\,\log(P), $$
$$ (\det P)^{\alpha} = \det(P^{\alpha}), $$
$$ (\det P)^{\alpha} = \det\big( (V \Lambda V^T)^{\alpha} \big) = \det(V)\, \det(\Lambda^{\alpha})\, \det(V^T) = \prod_{i=1}^{n} \lambda_i^{\alpha}, $$
$$ \operatorname{tr}(P^{\alpha}) = \operatorname{tr}\big( (V \Lambda V^T)^{\alpha} \big) = \operatorname{tr}(V V^T \Lambda^{\alpha}) = \sum_{i=1}^{n} \lambda_i^{\alpha}, $$
$$ P^{\alpha+\beta} = P^{\alpha} P^{\beta}, $$
$$ (P^{\alpha})^{\beta} = P^{\alpha\beta}, $$
$$ P^{0} = I, $$
$$ (\det P)^{\alpha+\beta} = \det(P^{\alpha})\, \det(P^{\beta}), $$
$$ \det\big( (P Q^{-1})^{\alpha} \big) = \big[ \det(P)\, \det(Q^{-1}) \big]^{\alpha} = \frac{\det(P^{\alpha})}{\det(Q^{\alpha})}, $$
$$ \frac{\partial}{\partial \alpha} (P^{\alpha}) = P^{\alpha} \log(P), $$
$$ \frac{\partial}{\partial \alpha} \log\big[ \det(P(\alpha)) \big] = \operatorname{tr}\!\left( P^{-1}(\alpha)\, \frac{\partial P(\alpha)}{\partial \alpha} \right), $$
$$ \log\big( \det(P \otimes Q) \big) = n \log(\det P) + n \log(\det Q), $$
$$ \operatorname{tr}(P) \geq \log\det(P) + n. $$

B. Extension of $D_{AB}^{(\alpha,\beta)}(P \,\|\, Q)$ for (α, β) ∈ ℝ²

Remark 1. Equation (3) is only well defined in the first and third quadrants of the (α, β)-plane. Outside these regions, where α and β have opposite signs (i.e., α > 0 and β < 0 or α < 0 and β > 0), the divergence can be complex valued.
This undesirable behavior can be avoided with the help of the truncation operator
$$ [x]_{+} = \begin{cases} x, & x \geq 0, \\ 0, & x < 0, \end{cases} $$
which prevents the arguments of the logarithms from being negative.
The new definition of the AB log-det divergence is
$$ D_{AB}^{(\alpha,\beta)}(P \,\|\, Q) = \frac{1}{\alpha\beta} \log\left[ \det\frac{\alpha (P Q^{-1})^{\beta} + \beta (P Q^{-1})^{-\alpha}}{\alpha+\beta} \right]_{+} \qquad \text{for } \alpha \neq 0,\ \beta \neq 0,\ \alpha+\beta \neq 0, $$
which is compatible with the previous definition in the first and third quadrants of the (α, β)-plane. It is also well defined in the second and fourth quadrants except for the special cases when α = 0, β = 0, and α + β = 0, which is where the formula is undefined. By enforcing continuity, we can explicitly define the AB log-det divergence on the entire (α, β)-plane as follows:
$$
D_{AB}^{(\alpha,\beta)}(P \,\|\, Q) =
\begin{cases}
\dfrac{1}{\alpha\beta} \log\left[ \det\dfrac{\alpha (P Q^{-1})^{\beta} + \beta (Q P^{-1})^{\alpha}}{\alpha+\beta} \right]_{+} & \text{for } \alpha, \beta \neq 0,\ \alpha+\beta \neq 0, \\[2mm]
\dfrac{1}{\alpha^2}\left[ \operatorname{tr}\!\big( (Q P^{-1})^{\alpha} - I \big) - \alpha \log\det(Q P^{-1}) \right] & \text{for } \alpha \neq 0,\ \beta = 0, \\[2mm]
\dfrac{1}{\beta^2}\left[ \operatorname{tr}\!\big( (P Q^{-1})^{\beta} - I \big) - \beta \log\det(P Q^{-1}) \right] & \text{for } \alpha = 0,\ \beta \neq 0, \\[2mm]
\dfrac{1}{\alpha^2} \log\left( \left[ \det\!\left( (P Q^{-1})^{-\alpha} \big( I + \log(P Q^{-1})^{\alpha} \big) \right) \right]_{+} \right)^{-1} & \text{for } \alpha = -\beta \neq 0, \\[2mm]
\dfrac{1}{2} \operatorname{tr}\,\log^2(P Q^{-1}) = \dfrac{1}{2} \big\| \log(Q^{-1/2} P Q^{-1/2}) \big\|_F^2 & \text{for } \alpha, \beta = 0.
\end{cases}
$$

C. Eigenvalues Domain for Finite $D_{AB}^{(\alpha,\beta)}(P \,\|\, Q)$

In this section, we assume that λi, an eigenvalue of PQ−1, satisfies 0 ≤ λi ≤ ∞ for all i = 1, , n. We will determine the bounds of the eigenvalues of PQ−1 that prevent the AB log-det divergence from being infinite. First, recall that
$$ D_{AB}^{(\alpha,\beta)}(P\,\|\,Q) = \frac{1}{\alpha\beta} \sum_{i=1}^{n} \log \left[ \frac{\alpha \lambda_i^{\beta} + \beta \lambda_i^{-\alpha}}{\alpha + \beta} \right]_{+}, \qquad \alpha, \beta, \alpha + \beta \ne 0. $$
For the divergence to be finite, the arguments of the logarithms in the previous expression must be positive. This happens when
$$ \frac{\alpha \lambda_i^{\beta} + \beta \lambda_i^{-\alpha}}{\alpha + \beta} > 0 \qquad \forall i, $$
which is always true when α, β > 0 or when α, β < 0. On the contrary, when sign(αβ) = −1 we have two cases. In the first case, when α > 0 and β < 0, multiplying the inequality by $\lambda_i^{\alpha} > 0$ and dividing by α > 0 allows us to solve first for $\lambda_i^{\alpha+\beta}$ and then for $\lambda_i$,
$$ \frac{\lambda_i^{\alpha+\beta}}{\alpha+\beta} > \frac{-\beta}{\alpha(\alpha+\beta)} = \left|\frac{\beta}{\alpha}\right| \frac{1}{\alpha+\beta} \;\Longrightarrow\; \lambda_i > \left|\frac{\beta}{\alpha}\right|^{\frac{1}{\alpha+\beta}} \quad \forall i, \qquad \text{for } \alpha > 0 \text{ and } \beta < 0, $$
where the conclusion holds regardless of the sign of α + β. In the second case, when α < 0 and β > 0, dividing by α < 0 reverses the inequality and the same steps give
$$ \frac{\lambda_i^{\alpha+\beta}}{\alpha+\beta} < \frac{-\beta}{\alpha(\alpha+\beta)} = \left|\frac{\beta}{\alpha}\right| \frac{1}{\alpha+\beta} \;\Longrightarrow\; \lambda_i < \left|\frac{\beta}{\alpha}\right|^{\frac{1}{\alpha+\beta}} \quad \forall i, \qquad \text{for } \alpha < 0 \text{ and } \beta > 0. $$
In summary, when sign(αβ) = −1 the eigenvalues must satisfy the lower bound $\lambda_i > |\beta/\alpha|^{1/(\alpha+\beta)}$ for α > 0, β < 0, and the upper bound $\lambda_i < |\beta/\alpha|^{1/(\alpha+\beta)}$ for α < 0, β > 0.
In the limit, when α → −β ≠ 0, these bounds simplify to
$$ \lim_{\alpha \to -\beta} \left|\frac{\beta}{\alpha}\right|^{\frac{1}{\alpha+\beta}} = e^{-1/\alpha} = e^{1/\beta}, \qquad \beta \ne 0. $$
On the other hand, when α → 0 or when β → 0, the bounds disappear: the lower bounds converge to 0, while the upper bounds converge to ∞, leading to the trivial inequalities 0 < λi < ∞.
This concludes the determination of the domain of the eigenvalues that results in a finite divergence. Outside this domain, we expect $D_{AB}^{(\alpha,\beta)}(P\,\|\,Q) = \infty$. A complete picture of the bounds for different values of α and β is shown in Figure 1.
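A small numerical check of these bounds (ours; the specific values of α, β and the grid are arbitrary choices of this example) can be obtained by scanning the argument of the logarithm over a grid of eigenvalues.

```python
import numpy as np

# For alpha > 0 > beta, the argument (alpha*lam**beta + beta*lam**(-alpha)) / (alpha + beta)
# is positive exactly when lam > |beta/alpha|**(1/(alpha+beta)).
alpha, beta = 1.5, -0.5
bound = abs(beta / alpha) ** (1.0 / (alpha + beta))
lams = np.linspace(1e-3, 5.0, 100000)
arg = (alpha * lams**beta + beta * lams**(-alpha)) / (alpha + beta)
print(bound)                    # theoretical lower bound on lambda (= 1/3 here)
print(lams[arg > 0].min())      # smallest lambda on the grid with a positive argument
```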

D. Proof of the Nonnegativity of D A B ( α , β ) ( P Q )

The AB log-det divergence is separable; it is the sum of the individual divergences of the eigenvalues from unity, i.e.,
$$ D_{AB}^{(\alpha,\beta)}(P\,\|\,Q) = \sum_{i=1}^{n} D_{AB}^{(\alpha,\beta)}(\lambda_i\,\|\,1), $$
where
$$ D_{AB}^{(\alpha,\beta)}(\lambda_i\,\|\,1) = \frac{1}{\alpha\beta} \log \left[ \frac{\alpha \lambda_i^{\beta} + \beta \lambda_i^{-\alpha}}{\alpha + \beta} \right]_{+}, \qquad \alpha, \beta, \alpha + \beta \ne 0. $$
We prove the nonnegativity of D A B ( α , β ) ( P Q ) by showing that the divergence of each of the eigenvalues D A B ( α , β ) ( λ i 1 ) is nonnegative and minimal at λi = 1.
First, note that the only critical point of the criterion is obtained when λi = 1. This can be shown by setting the derivative of the criterion equal to zero, i.e.,
$$ \frac{\partial D_{AB}^{(\alpha,\beta)}(\lambda_i\,\|\,1)}{\partial \lambda_i} = \frac{\lambda_i^{\alpha+\beta} - 1}{\alpha \lambda_i^{\alpha+\beta+1} + \beta \lambda_i} = 0, $$
and solving for λi.
Next, we show that the sign of the derivative only changes at the critical point λi = 1. If we rewrite
$$ \frac{\partial D_{AB}^{(\alpha,\beta)}(\lambda_i\,\|\,1)}{\partial \lambda_i} = \left( \frac{\lambda_i^{\alpha+\beta} - 1}{\alpha+\beta} \right) \left( \lambda_i\, \frac{\alpha \lambda_i^{\alpha+\beta} + \beta}{\alpha+\beta} \right)^{-1}, $$
and observe that the condition for the divergence to be finite enforces $\frac{\alpha \lambda_i^{\alpha+\beta} + \beta}{\alpha+\beta} > 0$, then it follows that
$$ \mathrm{sign}\left\{ \frac{\partial D_{AB}^{(\alpha,\beta)}(\lambda_i\,\|\,1)}{\partial \lambda_i} \right\} = \mathrm{sign}\left\{ \frac{\lambda_i^{\alpha+\beta} - 1}{\alpha+\beta} \right\} = \begin{cases} -1 & \text{for } \lambda_i < 1, \\ 0 & \text{for } \lambda_i = 1, \\ +1 & \text{for } \lambda_i > 1. \end{cases} $$
Since the derivative is strictly negative for λi < 1 and strictly positive for λi > 1, the critical point at λi = 1 is the global minimum of $D_{AB}^{(\alpha,\beta)}(\lambda_i\,\|\,1)$. From this result, the nonnegativity of the divergence, $D_{AB}^{(\alpha,\beta)}(P\,\|\,Q) \ge 0$, easily follows. Moreover, $D_{AB}^{(\alpha,\beta)}(P\,\|\,Q) = 0$ only when λi = 1 for i = 1, …, n, which concludes the proof of Theorems 1 and 2.
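The minimality at λ = 1 is easy to verify numerically; the short sketch below (ours; the parameter pairs and grid are arbitrary choices kept inside the finite-divergence domain) evaluates the scalar divergence on a grid and reports its minimizer.

```python
import numpy as np

def d_scalar(lam, a, b):
    # scalar AB log-det divergence D_AB^(a,b)(lambda || 1), assuming a, b, a+b != 0
    return np.log((a * lam**b + b * lam**(-a)) / (a + b)) / (a * b)

lam = np.linspace(0.5, 2.0, 100001)
for a, b in [(1.0, 1.0), (2.0, -0.7), (-1.0, -2.0)]:
    vals = d_scalar(lam, a, b)
    k = np.argmin(vals)
    print((a, b), lam[k], vals[k])   # minimizer ~ 1.0, minimum ~ 0.0 in every case
```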

E. Derivation of the Riemannian Metric

We calculate D A B ( α , β ) ( P + d P P ) using the Taylor expansion when dP is small, i.e.,
$$ (P + dP)\, P^{-1} = I + dZ, $$
where
$$ dZ = dP\, P^{-1}, \qquad \alpha \big[ (P + dP) P^{-1} \big]^{\beta} = \alpha I + \alpha\beta\, dZ + \frac{\alpha\beta(\beta-1)}{2}\, dZ\, dZ + O\big( \|dZ\|^{3} \big). $$
Similar calculations hold for $\beta \big[ (P + dP) P^{-1} \big]^{-\alpha}$, and
$$ \alpha \big[ (P + dP) P^{-1} \big]^{\beta} + \beta \big[ (P + dP) P^{-1} \big]^{-\alpha} = (\alpha+\beta) \Big( I + \frac{\alpha\beta}{2}\, dZ\, dZ \Big), $$
where the first-order term in dZ disappears and the higher-order terms are neglected. Since
$$ \det \Big( I + \frac{\alpha\beta}{2}\, dZ\, dZ \Big) = 1 + \frac{\alpha\beta}{2}\, \mathrm{tr}( dZ\, dZ ), $$
by taking its logarithm, we have
$$ D_{AB}^{(\alpha,\beta)}(P + dP\,\|\,P) = \frac{1}{2}\, \mathrm{tr}\big( dP\, P^{-1}\, dP\, P^{-1} \big), $$
for any α and β.
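This independence of (α, β) is easy to confirm numerically. The sketch below (ours; matrices, perturbation scale and parameter pairs are arbitrary choices of this example) compares the divergence of a slightly perturbed matrix with the quadratic form of the Riemannian metric.

```python
import numpy as np
from scipy.linalg import eigh

# Check: for a small perturbation dP, D_AB^(alpha,beta)(P + dP || P) approaches
# 0.5 * tr(dP P^{-1} dP P^{-1}), independently of the chosen (alpha, beta).
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)); P = A @ A.T + 4.0 * np.eye(4)
B = rng.standard_normal((4, 4)); dP = 1e-3 * (B + B.T)

Pinv = np.linalg.inv(P)
quad = 0.5 * np.trace(dP @ Pinv @ dP @ Pinv)
lam = eigh(P + dP, P, eigvals_only=True)         # generalized eigenvalues, all close to 1

for a, b in [(1.0, 1.0), (0.5, 0.5), (2.0, -0.7), (-1.0, 3.0)]:
    div = np.sum(np.log((a * lam**b + b * lam**(-a)) / (a + b))) / (a * b)
    print((a, b), div / quad)                    # every ratio should be close to 1
```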

F. Proof of the Properties of the AB Log-Det Divergence

Next we provide proofs of the properties of the AB log-det divergence. Proofs are omitted only for those properties that can be readily verified from the definition of the divergence. A brief numerical sanity check of several of these properties is also sketched at the end of this appendix.
  • Nonnegativity; given by
    $$ D_{AB}^{(\alpha,\beta)}(P\,\|\,Q) \ge 0, \qquad \forall \alpha, \beta. $$
    The proof of this property is presented in Appendix D.
  • Identity of indiscernibles; given by
    $$ D_{AB}^{(\alpha,\beta)}(P\,\|\,Q) = 0 \quad \text{if and only if} \quad P = Q. $$
    See Appendix D for its proof.
  • Continuity and smoothness of D A B ( α , β ) ( P Q ) as a function of α ∈ ℝ and β ∈ ℝ, including the singular cases when α = 0 or β = 0, and when α = −β (see Figure 2).
  • The divergence can be explicitly expressed in terms of Λ = diag1, λ2, …, λn}, the diagonal matrix with the eigenvalues of Q−1 P; in the form
    D A B ( α , β ) ( P Q ) = D A B ( α , β ) ( Λ I ) .
    Proof. From the definition of divergence and taking into account the eigenvalue decomposition PQ−1 = VΛ V−1, we can write
    $$ D_{AB}^{(\alpha,\beta)}(P\,\|\,Q) = \frac{1}{\alpha\beta} \log \det \frac{\alpha\, V \Lambda^{\beta} V^{-1} + \beta\, V \Lambda^{-\alpha} V^{-1}}{\alpha+\beta} = \frac{1}{\alpha\beta} \log \Big[ \det V \; \det \frac{\alpha \Lambda^{\beta} + \beta \Lambda^{-\alpha}}{\alpha+\beta} \; \det V^{-1} \Big] = \frac{1}{\alpha\beta} \log \det \frac{\alpha \Lambda^{\beta} + \beta \Lambda^{-\alpha}}{\alpha+\beta} $$
    $$ = D_{AB}^{(\alpha,\beta)}(\Lambda\,\|\,I). $$
  • Scaling invariance; given by
    D A B ( α , β ) ( c P c Q ) = D A B ( α , β ) ( P Q ) ,
    for any c > 0.
  • For given α and β and a nonzero scaling factor ω ≠ 0, we have
    $$ D_{AB}^{(\omega\alpha,\,\omega\beta)}(P\,\|\,Q) = \frac{1}{\omega^{2}}\, D_{AB}^{(\alpha,\beta)}\big( (Q^{-1/2} P Q^{-1/2})^{\omega}\,\|\,I \big). $$
    Proof. From the definition of divergence, we write
    $$ D_{AB}^{(\omega\alpha,\,\omega\beta)}(P\,\|\,Q) = \frac{1}{(\omega\alpha)(\omega\beta)} \log \det \frac{\omega\alpha\, \Lambda^{\omega\beta} + \omega\beta\, \Lambda^{-\omega\alpha}}{\omega\alpha + \omega\beta} $$
    $$ = \frac{1}{\omega^{2}}\, \frac{1}{\alpha\beta} \log \det \frac{\alpha\, (\Lambda^{\omega})^{\beta} + \beta\, (\Lambda^{\omega})^{-\alpha}}{\alpha+\beta} $$
    $$ = \frac{1}{\omega^{2}}\, D_{AB}^{(\alpha,\beta)}\big( (Q^{-1/2} P Q^{-1/2})^{\omega}\,\|\,I \big). $$
    Hence, since ω² ≤ 1 when |ω| ≤ 1, the additional inequality
    $$ D_{AB}^{(\alpha,\beta)}\big( (Q^{-1/2} P Q^{-1/2})^{\omega}\,\|\,I \big) \le D_{AB}^{(\omega\alpha,\,\omega\beta)}(P\,\|\,Q) $$
    is obtained for |ω| ≤ 1.
  • Dual-invariance under inversion (for ω = −1); given by
    $$ D_{AB}^{(\alpha,\beta)}(P\,\|\,Q) = D_{AB}^{(-\alpha,\,-\beta)}\big( P^{-1}\,\|\,Q^{-1} \big). $$
  • Dual symmetry; given by
    D A B ( α , β ) ( P Q ) = D A B ( β , α ) ( Q P ) .
  • Affine invariance (invariance under congruence transformations); given by
    $$ D_{AB}^{(\alpha,\beta)}\big( A P A^{T}\,\|\,A Q A^{T} \big) = D_{AB}^{(\alpha,\beta)}(P\,\|\,Q), $$
    for any nonsingular matrix $A \in \mathbb{R}^{n \times n}$.
    Proof.
    $$ D_{AB}^{(\alpha,\beta)}\big( A P A^{T}\,\|\,A Q A^{T} \big) = \frac{1}{\alpha\beta} \log \det \frac{\alpha \big( (A P A^{T})(A Q A^{T})^{-1} \big)^{\beta} + \beta \big( (A P A^{T})(A Q A^{T})^{-1} \big)^{-\alpha}}{\alpha+\beta} = \frac{1}{\alpha\beta} \log \det \frac{\alpha \big( A (P Q^{-1}) A^{-1} \big)^{\beta} + \beta \big( A (P Q^{-1}) A^{-1} \big)^{-\alpha}}{\alpha+\beta} = \frac{1}{\alpha\beta} \log \Big[ \det (A V)\; \det \frac{\alpha \Lambda^{\beta} + \beta \Lambda^{-\alpha}}{\alpha+\beta}\; \det (A V)^{-1} \Big] = \frac{1}{\alpha\beta} \log \det \frac{\alpha \Lambda^{\beta} + \beta \Lambda^{-\alpha}}{\alpha+\beta} = D_{AB}^{(\alpha,\beta)}(P\,\|\,Q). $$
  • Divergence lower-bound; given by
    $$ D_{AB}^{(\alpha,\beta)}\big( X^{T} P X\,\|\,X^{T} Q X \big) \le D_{AB}^{(\alpha,\beta)}(P\,\|\,Q), $$
    for any full-column-rank matrix $X \in \mathbb{R}^{n \times m}$ with $m \le n$.
    This result has been already proved for some special cases of α and β, especially these that lead to the S-divergence and the Riemannian metric [6]. Next, we present a different argument to prove it for any α, β ∈ ℝ.
    Proof. As already discussed, the divergence D A B ( α , β ) ( P Q ) depends on the generalized eigenvalues of the matrix pencil (P, Q), which have been denoted by λi, i = 1, …, n. Similarly, the presumed lower-bound D A B ( α , β ) ( X T PX X T QX ) is determined by µi, i = 1, …, m, the eigenvalues of the matrix pencil (XT PX, XT QX). Assuming that both sets of eigenvalues are arranged in decreasing order, the Cauchy interlacing inequalities [29] provide the following upper and lower-bounds for µj in terms of the eigenvalues of the first matrix pencil,
    $$ \lambda_{j} \;\ge\; \mu_{j} \;\ge\; \lambda_{n-m+j}. $$
    We classify the eigenvalues µj into three sets $S_{\mu}^{-}$, $S_{\mu}^{0}$ and $S_{\mu}^{+}$, according to the sign of (µj − 1). By the affine invariance we can write
    $$ D_{AB}^{(\alpha,\beta)}\big( X^{T} P X\,\|\,X^{T} Q X \big) = D_{AB}^{(\alpha,\beta)}\Big( (X^{T} Q X)^{-1/2}\, (X^{T} P X)\, (X^{T} Q X)^{-1/2} \,\Big\|\, I \Big) $$
    $$ = \sum_{\mu_j \in S_{\mu}^{-}} D_{AB}^{(\alpha,\beta)}(\mu_j\,\|\,1) + \sum_{\mu_j \in S_{\mu}^{+}} D_{AB}^{(\alpha,\beta)}(\mu_j\,\|\,1), $$
    where the eigenvalues µj S μ 0 have been excluded since for them D A B ( α , β ) ( μ j 1 ) = 0.
    With the help of (178), each eigenvalue µj ∈ $S_{\mu}^{-}$ (smaller than one) is mapped one-to-one to its lower bound $\lambda_{n-m+j}$, which we collect in the set $S_{\lambda}^{-}$, while each µj ∈ $S_{\mu}^{+}$ (greater than one) is mapped to its upper bound $\lambda_{j}$, which we collect in $S_{\lambda}^{+}$. It is shown in Appendix D that the scalar divergence $D_{AB}^{(\alpha,\beta)}(\lambda\,\|\,1)$ is strictly decreasing for λ < 1, zero for λ = 1 and strictly increasing for λ > 1. This allows one to upper-bound (180) as follows:
    $$ \sum_{\mu_j \in S_{\mu}^{-}} D_{AB}^{(\alpha,\beta)}(\mu_j\,\|\,1) + \sum_{\mu_j \in S_{\mu}^{+}} D_{AB}^{(\alpha,\beta)}(\mu_j\,\|\,1) \;\le\; \sum_{\lambda_j \in S_{\lambda}^{-}} D_{AB}^{(\alpha,\beta)}(\lambda_j\,\|\,1) + \sum_{\lambda_j \in S_{\lambda}^{+}} D_{AB}^{(\alpha,\beta)}(\lambda_j\,\|\,1) \;\le\; \sum_{j=1}^{n} D_{AB}^{(\alpha,\beta)}(\lambda_j\,\|\,1) $$
    $$ = D_{AB}^{(\alpha,\beta)}(P\,\|\,Q), $$
    obtaining the desired property.
  • Scaling invariance under the Kronecker product; given by
    $$ D_{AB}^{(\alpha,\beta)}\big( Z \otimes P\,\|\,Z \otimes Q \big) = n\, D_{AB}^{(\alpha,\beta)}(P\,\|\,Q), $$
    for any symmetric and positive definite matrix Z of rank n.
    Proof. This property was obtained in [6] for the S-divergence and the Riemannian metric. With the help of the properties of the Kronecker product of matrices, the desired equality is obtained:
    $$ D_{AB}^{(\alpha,\beta)}\big( Z \otimes P\,\|\,Z \otimes Q \big) = \frac{1}{\alpha\beta} \log \det \left[ \frac{\alpha \big( (Z \otimes P)(Z \otimes Q)^{-1} \big)^{\beta} + \beta \big( (Z \otimes Q)(Z \otimes P)^{-1} \big)^{\alpha}}{\alpha+\beta} \right] = \frac{1}{\alpha\beta} \log \det \left[ \frac{\alpha \big( I \otimes P Q^{-1} \big)^{\beta} + \beta \big( I \otimes Q P^{-1} \big)^{\alpha}}{\alpha+\beta} \right] $$
    $$ = \frac{1}{\alpha\beta} \log \det \left[ I \otimes \frac{\alpha (P Q^{-1})^{\beta} + \beta (Q P^{-1})^{\alpha}}{\alpha+\beta} \right] $$
    $$ = \frac{1}{\alpha\beta} \log \det \left[ \frac{\alpha (P Q^{-1})^{\beta} + \beta (Q P^{-1})^{\alpha}}{\alpha+\beta} \right]^{n} $$
    $$ = n\, D_{AB}^{(\alpha,\beta)}(P\,\|\,Q). $$
  • Double-Sided Orthogonal Procrustes property. Consider an orthogonal matrix $\Omega \in O(n)$ and two symmetric positive definite matrices P and Q, with respective eigenvalue matrices $\Lambda_P$ and $\Lambda_Q$ whose elements are sorted in descending order. The AB log-det divergence between $\Omega^{T} P \Omega$ and Q is globally minimized when their eigenspaces are aligned, i.e.,
    $$ \min_{\Omega \in O(n)} D_{AB}^{(\alpha,\beta)}\big( \Omega^{T} P \Omega\,\|\,Q \big) = D_{AB}^{(\alpha,\beta)}\big( \Lambda_P\,\|\,\Lambda_Q \big). $$
    Proof. Let Λ denote the matrix of eigenvalues of $\Omega^{T} P \Omega\, Q^{-1}$ with its elements sorted in descending order. We start by showing that, for ∆ = log Λ, the function $D_{AB}^{(\alpha,\beta)}(\exp \Delta\,\|\,I)$ is convex. Its Hessian matrix is diagonal with positive diagonal elements,
    $$ \frac{\partial^{2} D_{AB}^{(\alpha,\beta)}\big( e^{\Delta_{ii}}\,\|\,1 \big)}{\partial \Delta_{ii}^{2}} > 0, $$
    where
    $$ \frac{\partial^{2} D_{AB}^{(\alpha,\beta)}\big( e^{\Delta_{ii}}\,\|\,1 \big)}{\partial \Delta_{ii}^{2}} = \begin{cases} \left( \dfrac{\beta}{\alpha+\beta}\, e^{-\frac{\alpha+\beta}{2} \Delta_{ii}} + \dfrac{\alpha}{\alpha+\beta}\, e^{\frac{\alpha+\beta}{2} \Delta_{ii}} \right)^{-2} & \text{for } \alpha, \beta, \alpha+\beta \ne 0, \\[2mm] e^{\beta \Delta_{ii}} & \text{for } \alpha = 0, \\[2mm] (1 + \alpha \Delta_{ii})^{-2} & \text{for } \alpha + \beta = 0, \\[2mm] e^{-\alpha \Delta_{ii}} & \text{for } \beta = 0. \end{cases} $$
    Since $f(e^{\Delta_{ii}}) = D_{AB}^{(\alpha,\beta)}(e^{\Delta_{ii}}\,\|\,1)$ is strictly convex and non-negative, we are in the conditions of Corollary 6.15 in [47]. This result states that, for two symmetric positive definite matrices A and B whose vectors of eigenvalues are respectively denoted by $\lambda_{A}$ (sorted in descending order) and $\lambda_{B}$ (sorted in ascending order), the function $f(\lambda_{A} \circ \lambda_{B})$, with $\circ$ the elementwise product, is submajorized by $f(\lambda_{AB})$. By choosing $A = \Omega^{T} P \Omega$, $B = Q^{-1}$, and applying the corollary, we obtain
    $$ D_{AB}^{(\alpha,\beta)}\big( \Lambda_P\,\|\,\Lambda_Q \big) = D_{AB}^{(\alpha,\beta)}\big( \Lambda_P \Lambda_Q^{-1}\,\|\,I \big) \le D_{AB}^{(\alpha,\beta)}\big( \Lambda\,\|\,I \big) = D_{AB}^{(\alpha,\beta)}\big( \Omega^{T} P \Omega\,\|\,Q \big), $$
    where equality is reached only when the eigendecompositions of the matrices $\Omega^{T} P \Omega = V \Lambda_{P} V^{T}$ and $Q = V \Lambda_{Q} V^{T}$ share the same matrix of eigenvectors V.
  • Triangle Inequality – Metric Distance Condition, for α = β ∈ ℝ: the square root of the divergence satisfies
    $$ \sqrt{D_{AB}^{(\alpha,\alpha)}(P\,\|\,Q)} \;\le\; \sqrt{D_{AB}^{(\alpha,\alpha)}(P\,\|\,Z)} + \sqrt{D_{AB}^{(\alpha,\alpha)}(Z\,\|\,Q)}. $$
    Proof. The proof of this property exploits the recent result that the square root of the S-divergence
    $$ d_{Bh}(P\,\|\,Q) = \sqrt{D_{S}(P\,\|\,Q)} = 2 \sqrt{ \log \frac{\det \tfrac{1}{2}(P + Q)}{\sqrt{\det(P)\,\det(Q)}} } $$
    is a metric [17]. Given three arbitrary symmetric positive definite matrices P, Q, Z with common dimensions, consider the following eigenvalue decompositions
    $$ Q^{-\frac{1}{2}} P\, Q^{-\frac{1}{2}} = V_{1} \Lambda_{1} V_{1}^{T}, $$
    $$ Q^{-\frac{1}{2}} Z\, Q^{-\frac{1}{2}} = V_{2} \Lambda_{2} V_{2}^{T}, $$
    and assume that the diagonal matrices Λ1 and Λ2 have the eigenvalues sorted in a descending order.
    For a given value of α in the divergence, we define ω = 2α ≠ 0 and use properties 6 and 9 (see Equations (168) and (175)) to obtain the equivalence
    $$ \sqrt{D_{AB}^{(\alpha,\alpha)}(P\,\|\,Q)} = \sqrt{D_{AB}^{(0.5\,\omega,\,0.5\,\omega)}(P\,\|\,Q)} = \frac{1}{|\omega|} \sqrt{D_{AB}^{(0.5,\,0.5)}\big( (Q^{-1/2} P Q^{-1/2})^{\omega}\,\|\,I \big)} = \frac{1}{2|\alpha|} \sqrt{D_{AB}^{(0.5,\,0.5)}\big( \Lambda_{1}^{2\alpha}\,\|\,I \big)} = \frac{1}{2|\alpha|}\, d_{Bh}\big( \Lambda_{1}^{2\alpha}\,\|\,I \big). $$
    Since $d_{Bh}$, the square root of the S-divergence, satisfies the triangle inequality for diagonal matrices [5,6,17],
    $$ d_{Bh}\big( \Lambda_{1}^{2\alpha}\,\|\,I \big) \le d_{Bh}\big( \Lambda_{1}^{2\alpha}\,\|\,\Lambda_{2}^{2\alpha} \big) + d_{Bh}\big( \Lambda_{2}^{2\alpha}\,\|\,I \big), $$
    from (196), this implies that
    $$ \sqrt{D_{AB}^{(\alpha,\alpha)}(P\,\|\,Q)} \le \sqrt{D_{AB}^{(\alpha,\alpha)}(\Lambda_{1}\,\|\,\Lambda_{2})} + \sqrt{D_{AB}^{(\alpha,\alpha)}(\Lambda_{2}\,\|\,I)}. $$
    In similarity with the proof of the metric condition for the S-divergence [6], we can use property 12 to bound from above the first term on the right-hand side by
    $$ \sqrt{D_{AB}^{(\alpha,\alpha)}(\Lambda_{1}\,\|\,\Lambda_{2})} \le \sqrt{D_{AB}^{(\alpha,\alpha)}\big( V_{1} \Lambda_{1} V_{1}^{T}\,\|\,V_{2} \Lambda_{2} V_{2}^{T} \big)} = \sqrt{D_{AB}^{(\alpha,\alpha)}\big( Q^{-\frac{1}{2}} P Q^{-\frac{1}{2}}\,\|\,Q^{-\frac{1}{2}} Z Q^{-\frac{1}{2}} \big)} = \sqrt{D_{AB}^{(\alpha,\alpha)}(P\,\|\,Z)}, $$
    whereas the second term satisfies
    $$ \sqrt{D_{AB}^{(\alpha,\alpha)}(\Lambda_{2}\,\|\,I)} = \sqrt{D_{AB}^{(\alpha,\alpha)}\big( V_{2} \Lambda_{2} V_{2}^{T}\,\|\,I \big)} = \sqrt{D_{AB}^{(\alpha,\alpha)}\big( Q^{-\frac{1}{2}} Z Q^{-\frac{1}{2}}\,\|\,I \big)} = \sqrt{D_{AB}^{(\alpha,\alpha)}(Z\,\|\,Q)}. $$
    After bounding the right-hand side of (198) with the help of (199) and (200), the square root of the divergence satisfies the desired triangle inequality (192) for α ≠ 0.
    On the other hand, as α → 0, $\sqrt{D_{AB}^{(\alpha,\alpha)}(P\,\|\,Q)}$ converges to a multiple of the Riemannian metric,
    $$ \sqrt{D_{AB}^{(0,0)}(P\,\|\,Q)} = \lim_{\alpha \to 0} \sqrt{D_{AB}^{(\alpha,\alpha)}(P\,\|\,Q)} = \frac{1}{\sqrt{2}}\, \big\| \log\big( Q^{-1/2} P Q^{-1/2} \big) \big\|_{F} = \frac{1}{\sqrt{2}}\, d_{R}(P\,\|\,Q), $$
    which is itself a metric distance. This concludes the proof of the metric condition of $\sqrt{D_{AB}^{(\alpha,\alpha)}(P\,\|\,Q)}$ for any α ∈ ℝ.
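As anticipated at the beginning of this appendix, a few of the above properties can be checked numerically with the short sketch below (ours; matrices, parameters and the helper names d_ab and spd are assumptions of this example, and only the parameter range α, β, α + β ≠ 0 is covered).

```python
import numpy as np
from scipy.linalg import eigh

def d_ab(P, Q, a, b):
    # AB log-det divergence for a, b, a+b != 0, via generalized eigenvalues of (P, Q)
    lam = eigh(P, Q, eigvals_only=True)
    return np.sum(np.log((a * lam**b + b * lam**(-a)) / (a + b))) / (a * b)

rng = np.random.default_rng(2)
def spd(n):
    A = rng.standard_normal((n, n))
    return A @ A.T + n * np.eye(n)

P, Q, Z = spd(4), spd(4), spd(4)
A = rng.standard_normal((4, 4))                   # nonsingular with probability 1
a, b = 0.8, 1.7

print(d_ab(P, Q, a, b), d_ab(2.5 * P, 2.5 * Q, a, b))          # scaling invariance
print(d_ab(P, Q, a, b), d_ab(A @ P @ A.T, A @ Q @ A.T, a, b))  # affine invariance
print(d_ab(P, Q, a, b), d_ab(Q, P, b, a))                      # dual symmetry

# triangle inequality of sqrt(D) for alpha = beta
alpha = 0.5
lhs = np.sqrt(d_ab(P, Q, alpha, alpha))
rhs = np.sqrt(d_ab(P, Z, alpha, alpha)) + np.sqrt(d_ab(Z, Q, alpha, alpha))
print(lhs <= rhs + 1e-12)                                      # expected: True
```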

G. Proof of Theorem 3

This theorem assumes that the range spaces of the symmetric positive semidefinite matrices Cx and Cy are disjoint, in the sense that they only intersect at the origin, which is the most probable situation when n ≥ 2r (where n is the size of the matrices and r is their common rank). For ρ > 0, the regularized versions C̃x and C̃y of these matrices are full rank.
Let Λ ˜ = diag ( λ ˜ 1 , , λ ˜ n ) denote the diagonal matrix representing the n eigenvalues of the matrix pencil ( C ˜ x , C ˜ y ). The AB log-det divergence between the regularized matrices is equal to the divergence between Λ ˜ and the identity matrix of size n, i.e.,
D A B ( α , β ) ( C ˜ x C ˜ y ) = D A B ( α , β ) ( C ˜ y 1 2 C ˜ x C ˜ y 1 2 I n ) = D A B ( α , β ) ( Λ ˜ I n ) .
The positive eigenvalues of the matrix pencil satisfy
$$ \tilde{\Lambda} \equiv \mathrm{diag}\, \mathrm{Eig}_{+}\Big\{ \tilde{C}_{y}^{-\frac{1}{2}}\, \tilde{C}_{x}\, \tilde{C}_{y}^{-\frac{1}{2}} \Big\} = \mathrm{diag}\, \mathrm{Eig}_{+}\big\{ \tilde{C}_{x}\, \tilde{C}_{y}^{-1} \big\}, $$
therefore, the divergence can be directly estimated from the eigenvalues of $\tilde{C}_{x} \tilde{C}_{y}^{-1}$. In order to simplify this matrix product, we first express $\tilde{C}_{x}$ and $\tilde{C}_{y}^{-1}$ in terms of the auxiliary matrices
$$ T_{x} = U_{x} \big( \Lambda_{x} - \rho I_{r} \big)^{\frac{1}{2}} \qquad \text{and} \qquad T_{y} = U_{y} \big( \Lambda_{y} - \rho I_{r} \big)^{\frac{1}{2}}. $$
In this way, they are written as a scaled version of the identity matrix plus a symmetric term:
$$ \tilde{C}_{x} = C_{x} + \rho\, U_{x}^{\perp} (U_{x}^{\perp})^{T} = U_{x} \Lambda_{x} U_{x}^{T} + \rho \big( I_{n} - U_{x} U_{x}^{T} \big) = \rho I_{n} + U_{x} \big( \Lambda_{x} - \rho I_{r} \big) U_{x}^{T} = \rho I_{n} + T_{x} T_{x}^{T}, $$
and
$$ \tilde{C}_{y}^{-1} = C_{y}^{+} + \rho^{-1}\, U_{y}^{\perp} (U_{y}^{\perp})^{T} = U_{y} \Lambda_{y}^{-1} U_{y}^{T} + \rho^{-1} \big( I_{n} - U_{y} U_{y}^{T} \big) = \rho^{-1} I_{n} - \rho^{-1} U_{y} \big( \Lambda_{y} - \rho I_{r} \big) \Lambda_{y}^{-1} U_{y}^{T} = \rho^{-1} I_{n} - \rho^{-1} T_{y} \Lambda_{y}^{-1} T_{y}^{T}. $$
Next, using (207) and (208), we expand the product
$$ \tilde{C}_{x} \tilde{C}_{y}^{-1} = I_{n} + \rho^{-1} T_{x} T_{x}^{T} \big( I_{n} - T_{y} \Lambda_{y}^{-1} T_{y}^{T} \big) + R, $$
where the exact residual is $R = -T_{y} \Lambda_{y}^{-1} T_{y}^{T} = -U_{y} \big( I_{r} - \rho \Lambda_{y}^{-1} \big) U_{y}^{T}$. Replacing the eigenvectors $U_{y}$ of this residual by $U_{x}$ (a perturbation whose norm is of order $O(\rho^{0})$) yields the estimate
$$ R \approx -U_{x} \big( I_{r} - \rho \Lambda_{y}^{-1} \big) U_{x}^{T} \equiv \hat{R}. $$
Hence, it is not difficult to see that the estimated residual is equal to
$$ \hat{R} = -T_{x} \big( I_{r} - \rho \Lambda_{y}^{-1} \big) T_{x}^{+}. $$
After substituting (211) in (209) and collecting common terms, we obtain the expansion
$$ \tilde{C}_{x} \tilde{C}_{y}^{-1} = \underbrace{ I_{n} + T_{x} \Big( \rho^{-1} T_{x}^{T} - \rho^{-1} T_{x}^{T} T_{y} \Lambda_{y}^{-1} T_{y}^{T} - \big( I_{r} - \rho \Lambda_{y}^{-1} \big) T_{x}^{+} \Big) }_{\widehat{\tilde{C}_{x} \tilde{C}_{y}^{-1}}} \; + \; O(\rho^{0}). $$
Let Eig1{·} denote the arrangement of the ordered eigenvalues of the matrix argument after excluding those that are equal to 1. For convenience, we reformulate the property proved in [30] that for any pair of matrices A, B ∈ ℝm×n, the non-zero eigenvalues of ABT and of BTA are the same, into the following proposition.
Proposition 1. For any pair of m × n matrices A and B, the eigenvalues of the matrices Im + ABT and In + BTA, which are not equal to 1, coincide.
$$ \mathrm{Eig}_{1}\big\{ I_{m} + A B^{T} \big\} = \mathrm{Eig}_{1}\big\{ I_{n} + B^{T} A \big\}. $$
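A quick numerical illustration of Proposition 1 (ours; the sizes m = 6, n = 3 and the random matrices are arbitrary choices of this example):

```python
import numpy as np

# The eigenvalues of I_m + A B^T and I_n + B^T A that differ from 1 coincide.
rng = np.random.default_rng(3)
A = rng.standard_normal((6, 3))
B = rng.standard_normal((6, 3))

e_big = np.linalg.eigvals(np.eye(6) + A @ B.T)
e_small = np.linalg.eigvals(np.eye(3) + B.T @ A)

keep = lambda e: np.sort_complex(e[np.abs(e - 1.0) > 1e-9])   # drop eigenvalues equal to 1
print(keep(e_big))
print(keep(e_small))    # the two lists agree
```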
Since range spaces of Cx and of Cy only intersect at the origin, the approximation matrix C ˜ x C ˜ y 1 ^ has r dominant eigenvalues of order O(ρ−1) and (n − r) remaining eigenvalues equal to 1. Using Proposition 1, these r dominant eigenvalues are given by
$$ \mathrm{Eig}_{1}\Big\{ \widehat{\tilde{C}_{x} \tilde{C}_{y}^{-1}} \Big\} = \mathrm{Eig}_{1}\Big\{ I_{r} + \Big( \rho^{-1} T_{x}^{T} - \rho^{-1} T_{x}^{T} T_{y} \Lambda_{y}^{-1} T_{y}^{T} - \big( I_{r} - \rho \Lambda_{y}^{-1} \big) T_{x}^{+} \Big) T_{x} \Big\} = \mathrm{Eig}_{1}\Big\{ \rho^{-1} T_{x}^{T} T_{x} - \rho^{-1} T_{x}^{T} T_{y} \Lambda_{y}^{-1} T_{y}^{T} T_{x} + \rho \Lambda_{y}^{-1} \Big\}. $$
Let $\tilde{\Lambda}_{max}$ and $\tilde{\Lambda}_{min}$, respectively, denote the diagonal submatrices of $\tilde{\Lambda}$ with the r largest and the r smallest eigenvalues. From the definitions in (66) and (206), one can recognize that $T_{x}^{T} T_{x} = \Lambda_{x} - \rho I_{r}$, while $T_{x}^{T} T_{y} = W_{x}^{T} K_{xy} W_{y}$, and substituting them in (214) we obtain the estimate of the r largest eigenvalues
$$ \hat{\Lambda}_{max} = \mathrm{diag}\, \mathrm{Eig}_{1}\Big\{ \widehat{\tilde{C}_{x} \tilde{C}_{y}^{-1}} \Big\} $$
$$ = \mathrm{diag}\, \mathrm{Eig}_{1}\Big\{ \rho^{-1} \Lambda_{x} - I_{r} + \rho \Lambda_{y}^{-1} - \rho^{-1} W_{x}^{T} K_{xy} W_{y} \Lambda_{y}^{-1} W_{y}^{T} K_{yx} W_{x} \Big\} \;\equiv\; \rho^{-1} C_{x/y}(\rho). $$
The relative error between these eigenvalues and the r largest eigenvalues of C ˜ x C ˜ y 1 is of order O(ρ). This is a consequence of the fact that these eigenvalues are O(ρ−1), while the Frobenius norm of the error matrix is O(ρ0). Then, the relative error between the dominant eigenvalues of the two matrices can be bounded above by
$$ \left( \frac{\sum_{i=1}^{r} \big( \tilde{\lambda}_{i} - \hat{\lambda}_{i} \big)^{2}}{\sum_{i=1}^{r} \tilde{\lambda}_{i}^{2}} \right)^{\frac{1}{2}} \;\lesssim\; \frac{ \big\| \tilde{C}_{x} \tilde{C}_{y}^{-1} - \widehat{\tilde{C}_{x} \tilde{C}_{y}^{-1}} \big\|_{F} }{ \big( \sum_{i=1}^{r} \hat{\lambda}_{i}^{2} \big)^{\frac{1}{2}} + O(\rho^{0}) } \;\sim\; \frac{O(\rho^{0})}{O(\rho^{-1})} \;\sim\; O(\rho). $$
On the other hand, the r smallest eigenvalues of $\tilde{\Lambda}$ are the reciprocals of the r dominant eigenvalues of the inverse matrix $\big( \tilde{C}_{y}^{-\frac{1}{2}} \tilde{C}_{x} \tilde{C}_{y}^{-\frac{1}{2}} \big)^{-1}$, so we can estimate them using essentially the same procedure:
$$ \hat{\Lambda}_{min}^{-1} = \mathrm{diag}\, \mathrm{Eig}_{1}\Big\{ \widehat{\tilde{C}_{y} \tilde{C}_{x}^{-1}} \Big\} $$
$$ = \mathrm{diag}\, \mathrm{Eig}_{1}\big\{ \rho^{-1} C_{y/x}(\rho) \big\}. $$
For a sufficiently small value of ρ > 0, the dominant contribution to the AB log-det divergence comes from the r largest and the r smallest eigenvalues of the matrix pencil $(\tilde{C}_{x}, \tilde{C}_{y})$, so we obtain the desired approximation
$$ D_{AB}^{(\alpha,\beta)}\big( \tilde{\Lambda}\,\|\,I_{n} \big) \approx D_{AB}^{(\alpha,\beta)}\big( \tilde{\Lambda}_{max}\,\|\,I_{r} \big) + D_{AB}^{(\alpha,\beta)}\big( \tilde{\Lambda}_{min}\,\|\,I_{r} \big) $$
$$ = D_{AB}^{(\alpha,\beta)}\big( \rho \tilde{\Lambda}_{max}\,\|\,\rho I_{r} \big) + D_{AB}^{(\beta,\alpha)}\big( \rho \tilde{\Lambda}_{min}^{-1}\,\|\,\rho I_{r} \big) $$
$$ \approx D_{AB}^{(\alpha,\beta)}\big( \rho \hat{\Lambda}_{max}\,\|\,\rho I_{r} \big) + D_{AB}^{(\beta,\alpha)}\big( \rho \hat{\Lambda}_{min}^{-1}\,\|\,\rho I_{r} \big) $$
$$ = D_{AB}^{(\alpha,\beta)}\big( C_{x/y}(\rho)\,\|\,\rho I_{r} \big) + D_{AB}^{(\beta,\alpha)}\big( C_{y/x}(\rho)\,\|\,\rho I_{r} \big). $$
Moreover, as ρ → 0, the relative error of this approximation also tends to zero.

H. Gamma Divergence for Multivariate Gaussian Densities

Recall that, for a given quadratic function $f(x) = -c + b^{T} x - \frac{1}{2} x^{T} A x$, where A is an SPD matrix, the integral of exp{f(x)} with respect to x is given by
$$ \int_{\Omega} e^{-\frac{1}{2} x^{T} A x + b^{T} x - c}\, dx = (2\pi)^{\frac{N}{2}}\, \det(A)^{-\frac{1}{2}}\, e^{\frac{1}{2} b^{T} A^{-1} b - c}. $$
This formula is obtained by evaluating the integral as follows:
$$ \int_{\Omega} e^{-\frac{1}{2} x^{T} A x + b^{T} x - c}\, dx = e^{\frac{1}{2} b^{T} A^{-1} b - c} \int_{\Omega} e^{-\frac{1}{2} x^{T} A x + b^{T} x - \frac{1}{2} b^{T} A^{-1} b}\, dx $$
$$ = e^{\frac{1}{2} b^{T} A^{-1} b - c} \int_{\Omega} e^{-\frac{1}{2} (x - A^{-1} b)^{T} A\, (x - A^{-1} b)}\, dx $$
$$ = e^{\frac{1}{2} b^{T} A^{-1} b - c}\, (2\pi)^{\frac{N}{2}}\, \det(A)^{-\frac{1}{2}}, $$
assuming that A is an SPD matrix, which assures the convergence of the integral and the validity of (224).
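The formula is easy to verify numerically in low dimension; the sketch below (ours; A, b, c and the integration limits are arbitrary choices of this example) compares a two-dimensional quadrature with the closed form (224).

```python
import numpy as np
from scipy.integrate import dblquad

# Two-dimensional check of the Gaussian integral formula (A symmetric positive definite).
A = np.array([[2.0, 0.3],
              [0.3, 1.0]])
b = np.array([0.5, -0.2])
c = 0.1
N = 2

def integrand(y, x):
    v = np.array([x, y])
    return np.exp(-0.5 * v @ A @ v + b @ v - c)

numeric, _ = dblquad(integrand, -15, 15, lambda x: -15, lambda x: 15)
closed = (2 * np.pi) ** (N / 2) * np.linalg.det(A) ** (-0.5) \
         * np.exp(0.5 * b @ np.linalg.solve(A, b) - c)
print(numeric, closed)    # the two values agree to quadrature accuracy
```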
The Gamma divergence involves the integral of a product of powers of the densities. In the multivariate Gaussian case, this product simplifies as
$$ p^{\alpha}(x)\, q^{\beta}(x) = (2\pi)^{-\frac{N}{2}(\alpha+\beta)}\, \det(P)^{-\frac{\alpha}{2}}\, \det(Q)^{-\frac{\beta}{2}}\, \exp\Big\{ -\frac{\alpha}{2} (x - \mu_{1})^{T} P^{-1} (x - \mu_{1}) - \frac{\beta}{2} (x - \mu_{2})^{T} Q^{-1} (x - \mu_{2}) \Big\} $$
$$ = d\, \exp\Big\{ -c + b^{T} x - \frac{1}{2} x^{T} A x \Big\}, $$
where
$$ A = \alpha P^{-1} + \beta Q^{-1}, $$
$$ b = \big( \mu_{1}^{T} \alpha P^{-1} + \mu_{2}^{T} \beta Q^{-1} \big)^{T}, $$
$$ c = \tfrac{1}{2}\, \mu_{1}^{T} \big( \alpha P^{-1} \big) \mu_{1} + \tfrac{1}{2}\, \mu_{2}^{T} \big( \beta Q^{-1} \big) \mu_{2}, $$
$$ d = (2\pi)^{-\frac{N}{2}(\alpha+\beta)}\, \det(P)^{-\frac{\alpha}{2}}\, \det(Q)^{-\frac{\beta}{2}}. $$
Integrating this product with the help of (224), we obtain
$$ \int_{\Omega} p^{\alpha}(x)\, q^{\beta}(x)\, dx = d\, (2\pi)^{\frac{N}{2}}\, \det(A)^{-\frac{1}{2}}\, e^{\frac{1}{2} b^{T} A^{-1} b - c} $$
$$ = (2\pi)^{\frac{N}{2}(1-(\alpha+\beta))}\, \det(P)^{-\frac{\alpha}{2}}\, \det(Q)^{-\frac{\beta}{2}}\, \det\big( \alpha P^{-1} + \beta Q^{-1} \big)^{-\frac{1}{2}} \times e^{\frac{1}{2} \left( \mu_{1}^{T} \alpha P^{-1} + \mu_{2}^{T} \beta Q^{-1} \right) \left( \alpha P^{-1} + \beta Q^{-1} \right)^{-1} \left( \mu_{1}^{T} \alpha P^{-1} + \mu_{2}^{T} \beta Q^{-1} \right)^{T}} \times e^{-\frac{1}{2} \mu_{1}^{T} \left( \alpha P^{-1} \right) \mu_{1} - \frac{1}{2} \mu_{2}^{T} \left( \beta Q^{-1} \right) \mu_{2}}, $$
provided that αP−1 + βQ−1 is positive definite.
Rearranging the expression in terms of µ1 and µ2 yields
$$ \int_{\Omega} p^{\alpha}(x)\, q^{\beta}(x)\, dx = (2\pi)^{\frac{N}{2}(1-(\alpha+\beta))}\, \det(P)^{-\frac{\alpha}{2}}\, \det(Q)^{-\frac{\beta}{2}}\, \det\big( \alpha P^{-1} + \beta Q^{-1} \big)^{-\frac{1}{2}} \times e^{\frac{1}{2} \mu_{1}^{T} \left[ \alpha P^{-1} \left( \alpha P^{-1} + \beta Q^{-1} \right)^{-1} \alpha P^{-1} - \alpha P^{-1} \right] \mu_{1}} \times e^{\frac{1}{2} \mu_{2}^{T} \left[ \beta Q^{-1} \left( \alpha P^{-1} + \beta Q^{-1} \right)^{-1} \beta Q^{-1} - \beta Q^{-1} \right] \mu_{2}} \times e^{\mu_{1}^{T} \alpha P^{-1} \left( \alpha P^{-1} + \beta Q^{-1} \right)^{-1} \beta Q^{-1} \mu_{2}}. $$
With the help of the Woodbury matrix identity, we simplify
$$ e^{\frac{1}{2} \mu_{1}^{T} \left[ \alpha P^{-1} \left( \alpha P^{-1} + \beta Q^{-1} \right)^{-1} \alpha P^{-1} - \alpha P^{-1} \right] \mu_{1}} = e^{-\frac{1}{2} \mu_{1}^{T} \left( \alpha^{-1} P + \beta^{-1} Q \right)^{-1} \mu_{1}}, $$
$$ e^{\frac{1}{2} \mu_{2}^{T} \left[ \beta Q^{-1} \left( \alpha P^{-1} + \beta Q^{-1} \right)^{-1} \beta Q^{-1} - \beta Q^{-1} \right] \mu_{2}} = e^{-\frac{1}{2} \mu_{2}^{T} \left( \alpha^{-1} P + \beta^{-1} Q \right)^{-1} \mu_{2}}, $$
$$ e^{\mu_{1}^{T} \alpha P^{-1} \left( \alpha P^{-1} + \beta Q^{-1} \right)^{-1} \beta Q^{-1} \mu_{2}} = e^{\mu_{1}^{T} \left( \alpha^{-1} P + \beta^{-1} Q \right)^{-1} \mu_{2}}, $$
and hence arrive at the desired result:
$$ \int_{\Omega} p^{\alpha}(x)\, q^{\beta}(x)\, dx = (2\pi)^{\frac{N}{2}(1-(\alpha+\beta))}\, \det(P)^{-\frac{\alpha}{2}}\, \det(Q)^{-\frac{\beta}{2}}\, (\alpha+\beta)^{-\frac{N}{2}} \times \det\Big( \frac{\alpha}{\alpha+\beta} P^{-1} + \frac{\beta}{\alpha+\beta} Q^{-1} \Big)^{-\frac{1}{2}} \times e^{-\frac{\alpha\beta}{2(\alpha+\beta)} \left( \mu_{1} - \mu_{2} \right)^{T} \left( \frac{\beta}{\alpha+\beta} P + \frac{\alpha}{\alpha+\beta} Q \right)^{-1} \left( \mu_{1} - \mu_{2} \right)}. $$
This formula can easily be particularized to evaluate the integrals $\int_{\Omega} p^{\alpha+\beta}(x)\,dx$ and $\int_{\Omega} q^{\alpha+\beta}(x)\,dx$. Setting Q = P and µ2 = µ1, we obtain
$$ \int_{\Omega} p^{\alpha+\beta}(x)\, dx = \int_{\Omega} p^{\alpha}(x)\, p^{\beta}(x)\, dx = (2\pi)^{\frac{N}{2}(1-(\alpha+\beta))}\, \det(P)^{-\frac{\alpha}{2}}\, \det(P)^{-\frac{\beta}{2}}\, \det\big( \alpha P^{-1} + \beta P^{-1} \big)^{-\frac{1}{2}} = (2\pi)^{\frac{N}{2}(1-(\alpha+\beta))}\, (\alpha+\beta)^{-\frac{N}{2}}\, \det(P)^{\frac{1-(\alpha+\beta)}{2}} $$
and, analogously,
$$ \int_{\Omega} q^{\alpha+\beta}(x)\, dx = (2\pi)^{\frac{N}{2}(1-(\alpha+\beta))}\, (\alpha+\beta)^{-\frac{N}{2}}\, \det(Q)^{\frac{1-(\alpha+\beta)}{2}}. $$
By substituting these integrals into the definition of the Gamma divergence and simplifying, we obtain a generalized closed form formula:
$$ D_{AC}^{(\alpha,\beta)}\big( p(x)\,\|\,q(x) \big) = \frac{1}{\alpha\beta} \log \frac{ \big( \int_{\Omega} p^{\alpha+\beta}(x)\,dx \big)^{\frac{\alpha}{\alpha+\beta}} \big( \int_{\Omega} q^{\alpha+\beta}(x)\,dx \big)^{\frac{\beta}{\alpha+\beta}} }{ \int_{\Omega} p^{\alpha}(x)\, q^{\beta}(x)\, dx } = \frac{1}{2\alpha\beta} \log \frac{ \det\big( \frac{\alpha}{\alpha+\beta} Q + \frac{\beta}{\alpha+\beta} P \big) }{ \det(Q)^{\frac{\alpha}{\alpha+\beta}}\, \det(P)^{\frac{\beta}{\alpha+\beta}} } + \frac{1}{2(\alpha+\beta)} \left( \mu_{1} - \mu_{2} \right)^{T} \Big( \frac{\alpha}{\alpha+\beta} Q + \frac{\beta}{\alpha+\beta} P \Big)^{-1} \left( \mu_{1} - \mu_{2} \right), $$
which concludes the proof of Theorem 4.
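The closed form is straightforward to evaluate in practice. The sketch below (ours; the function name, the use of slogdet for numerical stability, and the test matrices are assumptions of this example) follows the reconstructed formula above and checks that the divergence vanishes for identical Gaussians.

```python
import numpy as np

def gamma_div_gauss(mu1, P, mu2, Q, alpha, beta):
    """Closed-form Gamma divergence between N(mu1, P) and N(mu2, Q),
    following the final formula above (alpha, beta and alpha+beta nonzero)."""
    s = alpha + beta
    M = (alpha / s) * Q + (beta / s) * P
    _, logdetM = np.linalg.slogdet(M)
    _, logdetP = np.linalg.slogdet(P)
    _, logdetQ = np.linalg.slogdet(Q)
    cov_term = (logdetM - (alpha / s) * logdetQ - (beta / s) * logdetP) / (2 * alpha * beta)
    d = np.asarray(mu1) - np.asarray(mu2)
    mean_term = d @ np.linalg.solve(M, d) / (2 * s)
    return cov_term + mean_term

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 3)); P = A @ A.T + np.eye(3)
B = rng.standard_normal((3, 3)); Q = B @ B.T + np.eye(3)
mu1, mu2 = np.zeros(3), np.array([1.0, -0.5, 0.2])
print(gamma_div_gauss(mu1, P, mu1, P, 0.7, 0.3))   # identical densities -> 0
print(gamma_div_gauss(mu1, P, mu2, Q, 0.7, 0.3))   # positive for distinct densities
```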

Author Contributions

The first two authors contributed equally to this work. Andrzej Cichocki coordinated this study and wrote most of Sections 1–3 and 7–8. Sergio Cruces wrote most of Sections 4–6 and provided most of the final rigorous proofs presented in the Appendices. Shun-ichi Amari proved the fundamental property (16) that the Riemannian metric is the same for all AB log-det divergences and critically revised the paper by providing inspiring comments. All authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Amari, S. Information geometry of positive measures and positive-definite matrices: Decomposable dually flat structure. Entropy 2014, 16, 2131–2145. [Google Scholar]
  2. Basseville, M. Divergence measures for statistical data processing—An annotated bibliography. Signal Process 2013, 93, 621–633. [Google Scholar]
  3. Moakher, M.; Batchelor, P.G. Symmetric Positive—Definite Matrices: From Geometry to Applications and Visualization. In Chapter 17 in the Book: Visualization and Processing of Tensor Fields; Weickert, J., Hagen, H., Eds.; Springer: Berlin/Heidelberg, Germany, 2006; pp. 285–298. [Google Scholar]
  4. Amari, S. Information geometry and its applications: Convex function and dually flat manifold. In Emerging Trends in Visual Computing; Nielsen, F., Ed.; Springer: Berlin/Heidelberg, Germany, 2009; pp. 75–102. [Google Scholar]
  5. Chebbi, Z.; Moakher, M. Means of Hermitian positive-definite matrices based on the log-determinant α-divergence function. Linear Algebra Appl 2012, 436, 1872–1889. [Google Scholar]
  6. Sra, S. Positive definite matrices and the S-divergence 2013. arXiv:1110.1773.
  7. Nielsen, F.; Bhatia, R. Matrix Information Geometry; Springer: Berlin/Heidelberg, Germany, 2013. [Google Scholar]
  8. Amari, S. Alpha-divergence is unique, belonging to both f-divergence and Bregman divergence classes. IEEE Trans. Inf. Theory 2009, 55, 4925–4931. [Google Scholar]
  9. Zhang, J. Divergence function, duality, and convex analysis. Neural Comput 2004, 16, 159–195. [Google Scholar]
  10. Amari, S.; Cichocki, A. Information geometry of divergence functions. Bull. Polish Acad. Sci 2010, 58, 183–195. [Google Scholar]
  11. Cichocki, A.; Amari, S. Families of Alpha- Beta- and Gamma- divergences: Flexible and robust measures of similarities. Entropy 2010, 12, 1532–1568. [Google Scholar]
  12. Cichocki, A.; Cruces, S.; Amari, S. Generalized alpha-beta divergences and their application to robust nonnegative matrix factorization. Entropy 2011, 13, 134–170. [Google Scholar]
  13. Cichocki, A.; Zdunek, R.; Phan, A.-H.; Amari, S. Nonnegative Matrix and Tensor Factorizations; John Wiley & Sons Ltd: Chichester, UK, 2009. [Google Scholar]
  14. Cherian, A.; Sra, S.; Banerjee, A.; Papanikolopoulos, N. Jensen-Bregman logdet divergence with application to efficient similarity search for covariance matrices. IEEE Trans. Pattern Anal. Mach. Intell 2013, 35, 2161–2174. [Google Scholar]
  15. Cherian, A.; Sra, S. Riemannian sparse coding for positive definite matrices. Proceedings of the Computer Vision—ECCV 2014—13th European Conference, Zurich, Switzerland, September 6–12 2014; 8691, pp. 299–314.
  16. Olszewski, D.; Ster, B. Asymmetric clustering using the alpha-beta divergence. Pattern Recognit 2014, 47, 2031–2041. [Google Scholar]
  17. Sra, S. A new metric on the manifold of kernel matrices with application to matrix geometric mean. Proceedings of the 26th Annual Conference on Neural Information Processing Systems 2012, Lake Tahoe, Nevada, USA, 3–6 December 2012; pp. 144–152.
  18. Nielsen, F.; Liu, M.; Vemuri, B. Jensen divergence-based means of SPD Matrices. In Matrix Information Geometry; Springer: Berlin/Heidelberg, Germany, 2013; pp. 111–122. [Google Scholar]
  19. Hsieh, C.; Sustik, M.A.; Dhillon, I.; Ravikumar, P.; Poldrack, R. BIG & QUIC: Sparse inverse covariance estimation for a million variables. Proceedings of the 27th Annual Conference on Neural Information Processing Systems 2013, Lake Tahoe, Nevada, USA, 5–8 December 2013; pp. 3165–3173.
  20. Nielsen, F.; Nock, R. A closed-form expression for the Sharma-Mittal entropy of exponential families. CoRR. 2011. arXiv:1112.4221v1 [cs.IT]. Available online: http://arxiv.org/abs/1112.4221 accessed on 4 May 2015.
  21. Fujisawa, H.; Eguchi, S. Robust parameter estimation with a small bias against heavy contamination. Multivar. Anal 2008, 99, 2053–2081. [Google Scholar]
  22. Kulis, B.; Sustik, M.; Dhillon, I. Learning low-rank kernel matrices. Proceedings of the Twenty-third International Conference on Machine Learning (ICML06), Pittsburgh, PA, USA, 25–29 July 2006; pp. 505–512.
  23. Cherian, A.; Sra, S.; Banerjee, A.; Papanikolopoulos, N. Efficient similarity search for covariance matrices via the jensen-bregman logdet divergence. Proceedings of the IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, 6–13 November 2011; pp. 2399–2406.
  24. Österreicher, F. Csiszár’s f-divergences-basic properties. RGMIA Res. Rep. Collect. 2002. Available online: http://rgmia.vu.edu.au/monographs/csiszar.htm accessed on 6 May 2015.
  25. Cichocki, A.; Zdunek, R.; Amari, S. Csiszár’s divergences for nonnegative matrix factorization: Family of new algorithms. Independent Component Analysis and Blind Signal Separation, Proceedings of 6th International Conference on Independent Component Analysis and Blind Signal Separation (ICA 2006), Charleston, SC, USA, 5–8 March 2006; 3889, pp. 32–39.
  26. Reeb, D.; Kastoryano, M.J.; Wolf, M.M. Hilbert’s projective metric in quantum information theory. J. Math. Phys 2011, 52, 082201. [Google Scholar]
  27. Kim, S.; Kim, S.; Lee, H. Factorizations of invertible density matrices. Linear Algebra Appl 2014, 463, 190–204. [Google Scholar]
  28. Bhatia, R. Positive Definite Matrices; Princeton University Press: Princeton, NJ, USA, 2009. [Google Scholar]
  29. Li, R.-C. Rayleigh Quotient Based Optimization Methods For Eigenvalue Problems. In Summary of Lectures Delivered at Gene Golub SIAM Summer School 2013; Fudan University: Shanghai, China, 2013. [Google Scholar]
  30. De Moor, B.L.R. On the Structure and Geometry of the Product Singular Value Decomposition; Numerical Analysis Project NA-89-06; Stanford University: Stanford, CA, USA, 1989; pp. 1–52. [Google Scholar]
  31. Golub, G.H.; van Loan, C.F. Matrix Computations, 3rd ed; Johns Hopkins University Press: Baltimore, MD, USA, 1996; pp. 555–571. [Google Scholar]
  32. Zhou, S.K.; Chellappa, R. From Sample Similarity to Ensemble Similarity: Probabilistic Distance Measures in Reproducing Kernel Hilbert Space. IEEE Trans. Pattern Anal. Mach. Intell 2006, 28, 917–929. [Google Scholar]
  33. Harandi, M.; Salzmann, M.; Porikli, F. Bregman Divergences for Infinite Dimensional Covariance Matrices. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 1003–1010.
  34. Minh, H.Q.; Biagio, M.S.; Murino, V. Log-Hilbert-Schmidt metric between positive definite operators on Hilbert spaces. Adv. Neural Inf. Process. Syst 2014, 27, 388–396. [Google Scholar]
  35. Josse, J.; Sardy, S. Adaptive Shrinkage of singular values 2013. arXiv:1310.6602.
  36. Donoho, D.L.; Gavish, M.; Johnstone, I.M. Optimal Shrinkage of Eigenvalues in the Spiked Covariance Model 2013. arXiv:1311.0851.
  37. Gavish, M.; Donoho, D. Optimal shrinkage of singular values 2014. arXiv:1405.7511.
  38. Davis, J.; Dhillon, I. Differential entropic clustering of multivariate gaussians. Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 4–7 December 2006; pp. 337–344.
  39. Abou-Moustafa, K.; Ferrie, F. Modified divergences for Gaussian densities. Proceedings of the Structural, Syntactic, and Statistical Pattern Recognition, Hiroshima, Japan, 7–9 November 2012; pp. 426–436.
  40. Burbea, J.; Rao, C. Entropy differential metric, distance and divergence measures in probability spaces: A unified approach. J. Multi. Anal 1982, 12, 575–596. [Google Scholar]
  41. Hosseini, R.; Sra, S.; Theis, L.; Bethge, M. Statistical inference with the Elliptical Gamma Distribution 2014. arXiv:1410.4812.
  42. Manceur, A.; Dutilleul, P. Maximum likelihood estimation for the tensor normal distribution: Algorithm, minimum sample size, and empirical bias and dispersion. J. Comput. Appl. Math 2013, 239, 37–49. [Google Scholar]
  43. Akdemir, D.; Gupta, A. Array variate random variables with multiway Kronecker delta covariance matrix structure. J. Algebr. Stat 2011, 2, 98–112. [Google Scholar]
  44. Hoff, P.D. Separable covariance arrays via the Tucker product, with applications to multivariate relational data. Bayesian Anal 2011, 6, 179–196. [Google Scholar]
  45. Gerard, D.; Hoff, P. Equivariant minimax dominators of the MLE in the array normal model 2014. arXiv:1408.0424.
  46. Ohlson, M.; Ahmad, M.; von Rosen, D. The Multilinear Normal Distribution: Introduction and Some Basic Properties. J. Multivar. Anal 2013, 113, 37–47. [Google Scholar]
  47. Ando, T. Majorization, doubly stochastic matrices, and comparison of eigenvalues. Linear Algebra Appl 1989, 118, 163–248. [Google Scholar]
Figure 1. Shaded-contour plots of the bounds of λi that prevent $D_{AB}^{(\alpha,\beta)}(P\,\|\,Q)$ from diverging to ∞. The positive lower bounds are shown in the lower-right quadrant of (a). The finite upper bounds are shown in the upper-left quadrant of (b).
Figure 2. Two-dimensional plots of the AB log-det divergence for different eigenvalues: (a) λ = 0.4, (b) λ = 2.5, (c) λ1 = 0.4, λ2 = 2.5, (d) 10 eigenvalues uniformly randomly distributed in the range [0.5, 2].
Figure 3. Links between the fundamental, nonsymmetric AB log-det divergences. On the (α, β)-plane, important divergences are indicated by points and lines, especially Stein's loss and its generalization, the AIRM (Riemannian) distance, the S-divergence (JBLD), the Alpha log-det divergence $D_{A}^{(\alpha)}$, and the Beta log-det divergence $D_{B}^{(\beta)}$.
Figure 4. Links between the fundamental symmetric AB log-det divergences. On the (α, β)-plane, the special cases of particular divergences are indicated by points: the Jeffreys KL divergence (KLDM, or symmetric Stein's loss) and its generalization, the S-divergence (JBLD), and the Power log-det divergence.
Table 1. Fundamental Log-det Divergences and Distances
Geodesic Distance (AIRM) (α = β = 0):
$$ \tfrac{1}{2}\, d_{R}^{2}(P\,\|\,Q) = \tfrac{1}{2}\, \mathrm{tr}\, \log^{2}(P Q^{-1}) = \tfrac{1}{2} \sum_{i=1}^{n} \log^{2} \lambda_{i} $$
S-divergence (Squared Bhattacharyya Distance) (α = β = 0.5):
$$ D_{S}(P\,\|\,Q) = d_{Bh}^{2}(P\,\|\,Q) = 4 \log \frac{ \det \tfrac{1}{2}(P + Q) }{ (\det P Q)^{\frac{1}{2}} } = 4 \sum_{i=1}^{n} \log \frac{ \lambda_{i} + 1 }{ 2 \sqrt{\lambda_{i}} } $$
Power divergence (α = β ≠ 0):
$$ \frac{1}{\alpha^{2}} \log \det \frac{ (P Q^{-1})^{\alpha} + (P Q^{-1})^{-\alpha} }{2} = \frac{1}{\alpha^{2}} \sum_{i=1}^{n} \log \frac{ \lambda_{i}^{\alpha} + \lambda_{i}^{-\alpha} }{2} $$
Generalized Burg divergence (Stein's Loss) (α = 0, β ≠ 0):
$$ \frac{1}{\beta^{2}} \Big[ \mathrm{tr}\big( (P Q^{-1})^{\beta} - I \big) - \log \det (P Q^{-1})^{\beta} \Big] = \frac{1}{\beta^{2}} \Big( \sum_{i=1}^{n} \big( \lambda_{i}^{\beta} - \log \lambda_{i}^{\beta} \big) - n \Big) $$
Generalized Itakura-Saito log-det divergence (α = −β ≠ 0):
$$ \frac{1}{\alpha^{2}} \log \frac{ \det (P Q^{-1})^{\alpha} }{ \det \big( I + \log (P Q^{-1})^{\alpha} \big) } = \frac{1}{\alpha^{2}} \sum_{i=1}^{n} \log \frac{ \lambda_{i}^{\alpha} }{ 1 + \log \lambda_{i}^{\alpha} } $$
Alpha log-det divergence (0 < α < 1, β = 1 − α):
$$ D_{A}^{(\alpha)}(P\,\|\,Q) = \frac{1}{\alpha(1-\alpha)} \log \frac{ \det \big( \alpha P + (1-\alpha) Q \big) }{ \det \big( P^{\alpha} Q^{1-\alpha} \big) } = \frac{1}{\alpha(1-\alpha)} \sum_{i=1}^{n} \log \frac{ \alpha (\lambda_{i} - 1) + 1 }{ \lambda_{i}^{\alpha} } $$
Beta log-det divergence (α = 1, β ≥ 0):
$$ D_{B}^{(\beta)}(P\,\|\,Q) = \frac{1}{\beta} \log \det \frac{ (P Q^{-1})^{\beta} + \beta (P Q^{-1})^{-1} }{ 1 + \beta } = \frac{1}{\beta} \sum_{i=1}^{n} \log \frac{ \lambda_{i}^{\beta} + \beta \lambda_{i}^{-1} }{ 1 + \beta } $$
$$ D_{B}^{(\infty)}(P\,\|\,Q) = \sum_{i \in \Omega} \log \lambda_{i}, \qquad \Omega = \{ i : \lambda_{i} > 1 \} $$
Symmetric Jeffreys KL divergence (α = 1, β = 0):
$$ D_{J}(P\,\|\,Q) = \tfrac{1}{2}\, \mathrm{tr}\big( P Q^{-1} + Q P^{-1} - 2 I \big) = \tfrac{1}{2} \sum_{i=1}^{n} \big( \sqrt{\lambda_{i}} - 1/\sqrt{\lambda_{i}} \big)^{2} $$
Generalized Hilbert metrics:
$$ D_{CCA}^{(\gamma_{2}, \gamma_{1})}(P\,\|\,Q) = \log \frac{ M_{\gamma_{2}}\{ \lambda_{i} \} }{ M_{\gamma_{1}}\{ \lambda_{i} \} }, \qquad d_{H}(P\,\|\,Q) = \log \frac{ M_{\infty}\{ \lambda_{i} \} }{ M_{-\infty}\{ \lambda_{i} \} } = \log \frac{ \lambda_{max} }{ \lambda_{min} } $$
Table 2. Comparison of the fundamental properties of three basic metric distances: the Riemannian (geodesic) metric (19), the LogDet Zero (Bhattacharyya) divergence (21), and the Hilbert projective metric (95). Matrices P, Q, P1, P2, Q1, Q2, Z ∈ ℝn×n are SPD matrices, A ∈ ℝn×n is nonsingular, and the matrix X ∈ ℝn×r with r < n has full column rank. The scalars satisfy the following conditions: c > 0; c1, c2 > 0; 0 < ω ≤ 1; s, u ≠ 0; ψ = |s − u|. The geometric means are defined by P#uQ = P1/2(P−1/2QP−1/2)u P1/2 and P#Q = P#1/2Q = P1/2(P−1/2QP−1/2)1/2 P1/2. The Hadamard product of P and Q is denoted by P ⊙ Q (cf. with [6]).
Riemannian (geodesic) metric | LogDet Zero (Bhattacharyya) divergence | Hilbert projective metric
$d_R(P\|Q) = \|\log(Q^{-1/2} P Q^{-1/2})\|_F$ | $d_{Bh}(P\|Q) = 2\sqrt{\log\frac{\det\frac{1}{2}(P+Q)}{\sqrt{\det(P)\det(Q)}}}$ | $d_H(P\|Q) = \log\frac{\lambda_{max}\{P Q^{-1}\}}{\lambda_{min}\{P Q^{-1}\}}$
$d_R(P\|Q) = d_R(Q\|P)$ | $d_{Bh}(P\|Q) = d_{Bh}(Q\|P)$ | $d_H(P\|Q) = d_H(Q\|P)$
$d_R(cP\|cQ) = d_R(P\|Q)$ | $d_{Bh}(cP\|cQ) = d_{Bh}(P\|Q)$ | $d_H(c_1 P\|c_2 Q) = d_H(P\|Q)$
$d_R(A P A^T\|A Q A^T) = d_R(P\|Q)$ | $d_{Bh}(A P A^T\|A Q A^T) = d_{Bh}(P\|Q)$ | $d_H(A P A^T\|A Q A^T) = d_H(P\|Q)$
$d_R(P^{-1}\|Q^{-1}) = d_R(P\|Q)$ | $d_{Bh}(P^{-1}\|Q^{-1}) = d_{Bh}(P\|Q)$ | $d_H(P^{-1}\|Q^{-1}) = d_H(P\|Q)$
$d_R(P^{\omega}\|Q^{\omega}) \le \omega\, d_R(P\|Q)$ | $d_{Bh}(P^{\omega}\|Q^{\omega}) \le \sqrt{\omega}\, d_{Bh}(P\|Q)$ | $d_H(P^{\omega}\|Q^{\omega}) \le \omega\, d_H(P\|Q)$
$d_R(P\|P\#_{\omega}Q) = \omega\, d_R(P\|Q)$ | $d_{Bh}(P\|P\#_{\omega}Q) \le \sqrt{\omega}\, d_{Bh}(P\|Q)$ | $d_H(P\|P\#_{\omega}Q) = \omega\, d_H(P\|Q)$
$d_R(Z\#_{\omega}P\|Z\#_{\omega}Q) \le \omega\, d_R(P\|Q)$ | $d_{Bh}(Z\#_{\omega}P\|Z\#_{\omega}Q) \le \sqrt{\omega}\, d_{Bh}(P\|Q)$ | $d_H(Z\#_{\omega}P\|Z\#_{\omega}Q) \le \omega\, d_H(P\|Q)$
$d_R(P\#_{s}Q\|P\#_{u}Q) = \psi\, d_R(P\|Q)$ | $d_{Bh}(P\#_{s}Q\|P\#_{u}Q) \le \sqrt{\psi}\, d_{Bh}(P\|Q)$ | $d_H(P\#_{s}Q\|P\#_{u}Q) = \psi\, d_H(P\|Q)$
$d_R(P\|P\#Q) = d_R(Q\|P\#Q)$ | $d_{Bh}(P\|P\#Q) = d_{Bh}(Q\|P\#Q)$ | $d_H(P\|P\#Q) = d_H(Q\|P\#Q)$
$d_R(X^T P X\|X^T Q X) \le d_R(P\|Q)$ | $d_{Bh}(X^T P X\|X^T Q X) \le d_{Bh}(P\|Q)$ | $d_H(X^T P X\|X^T Q X) \le d_H(P\|Q)$
$d_R(Z \otimes P\|Z \otimes Q) = \sqrt{n}\, d_R(P\|Q)$ | $d_{Bh}(Z \otimes P\|Z \otimes Q) = \sqrt{n}\, d_{Bh}(P\|Q)$ | $d_H(Z \otimes P\|Z \otimes Q) = d_H(P\|Q)$
$d_R^2(P_1 \otimes P_2\|Q_1 \otimes Q_2) = n\, d_R^2(P_1\|Q_1) + n\, d_R^2(P_2\|Q_2) + 2 \log\det(P_1 Q_1^{-1}) \log\det(P_2 Q_2^{-1})$ | $d_{Bh}(P_1 \otimes P_2\|Q_1 \otimes Q_2) \ge d_{Bh}(P_1 \otimes P_2\|Q_1 \otimes Q_2)$ | $d_H(P_1 \otimes P_2\|Q_1 \otimes Q_2) = d_H(P_1\|Q_1) + d_H(P_2\|Q_2)$
