Next Article in Journal
Frequency Seismic Response for EEWS Testing on Uniaxial Shaking Table
Next Article in Special Issue
Geometric Structures Induced by Deformations of the Legendre Transform
Previous Article in Journal
A Lossless-Recovery Secret Distribution Scheme Based on QR Codes
Previous Article in Special Issue
A Dually Flat Embedding of Spacetime
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

A Simple Approximation Method for the Fisher–Rao Distance between Multivariate Normal Distributions

Sony Computer Science Laboratories, Tokyo 141-0022, Japan
Entropy 2023, 25(4), 654; https://doi.org/10.3390/e25040654
Submission received: 27 February 2023 / Revised: 6 April 2023 / Accepted: 12 April 2023 / Published: 13 April 2023
(This article belongs to the Special Issue Information Geometry and Its Applications)

Abstract

:
We present a simple method to approximate the Fisher–Rao distance between multivariate normal distributions based on discretizing curves joining normal distributions and approximating the Fisher–Rao distances between successive nearby normal distributions on the curves by the square roots of their Jeffreys divergences. We consider experimentally the linear interpolation curves in the ordinary, natural, and expectation parameterizations of the normal distributions, and compare these curves with a curve derived from the Calvo and Oller’s isometric embedding of the Fisher–Rao d-variate normal manifold into the cone of ( d + 1 ) × ( d + 1 ) symmetric positive–definite matrices. We report on our experiments and assess the quality of our approximation technique by comparing the numerical approximations with both lower and upper bounds. Finally, we present several information–geometric properties of Calvo and Oller’s isometric embedding.

Graphical Abstract

1. Introduction

1.1. The Fisher–Rao Normal Manifold

Let Sym ( d ) be the set of d × d symmetric matrices with real entries and P ( d ) Sym ( d ) denote the set of symmetric positive–definite d × d matrices that forms a convex regular cone. Let us denote by N ( d ) = { N ( μ , Σ ) : ( μ , Σ ) Λ ( d ) = R d × P ( d ) } the set of d-variate normal distributions, MultiVariate Normals or MVNs for short, also called Gaussian distributions. A MVN distribution N ( μ , Σ ) has probability density function (pdf) on the support R d :
p λ = ( μ , Σ ) ( x ) = ( 2 π ) d 2 | Σ | 1 2 exp 1 2 ( x μ ) Σ 1 ( x μ ) , x R d ,
where | M | = det ( M ) denotes the determinant of matrix M.
The statistical model N ( d ) is of dimension m = dim ( Λ ( d ) ) = d + d ( d + 1 ) 2 = d ( d + 3 ) 2 since it is identifiable, i.e., there is a one-to-one correspondence λ p λ ( x ) between λ Λ ( d ) and N ( μ , Σ ) N ( d ) . The statistical model N ( d ) is said to be regular since the second-order derivatives 2 p λ λ i λ j and third-order derivatives 3 p λ λ i λ j λ k are smooth functions (defining the metric and cubic tensors in information geometry [1]), and the set of first-order partial derivatives p λ λ 1 , , p λ λ 1 are linearly independent.
Let Cov ( X ) denote the covariance of X (variance when X is scalar). A matrix M is a semi-positive–definite if and only if x 0 , x M x 0 . The Fisher information matrix [1,2] (FIM) is the following symmetric semi-positive–definite matrix:
I ( λ ) = Cov [ log p λ ( x ) ] 0 .
For regular statistical models { p λ } , the FIM is positive–definite: I ( λ ) 0 , i.e., x 0 , x I ( λ ) x > 0 . M 1 M 2 denotes Löwner partial ordering, i.e., the fact that M 1 M 2 is positive–definite.
The FIM is covariant under the reparameterization of the statistical model [2]. That is, let θ ( λ ) be a new parameterization of the MVNs. Then we have:
I θ ( λ ) = λ θ × I λ ( λ ( θ ) ) × λ θ .
For example, we may parameterize univariate normal distributions by λ = ( μ , σ 2 ) or θ = ( μ , σ ) . We obtain the following Fisher information matrices for these parameterizations:
I λ ( λ ( μ , σ ) ) = 1 σ 2 0 0 1 2 σ 4 and I θ ( θ ( μ , σ ) ) = 1 σ 2 0 0 1 2 σ 2 .
In higher dimensions, parameterization λ = ( μ , σ 2 ) corresponds to the parameterization ( μ , Σ ) while parameterization θ = ( μ , L ) where Σ = L L is the unique Cholesky decomposition with L GL ( d ) , the group of invertible d × d matrices. Another useful parameterization for optimization is the log–Cholesky parameterization [3] ( η = ( μ , log σ 2 ) R 2 for univariate normal distributions) which ensures that a gradient descent always stays in the domain. The Fisher information matrix with respect to the log–Cholesky parameterization is I η ( η ( μ , σ ) ) = 1 σ 2 0 0 2 with η ( μ , σ ) R 2 .
Since the statistical model N ( d ) is identifiable and regular, the Fisher information matrix can be written equivalently as follows [2,4]:
I ( μ , Σ ) = Cov [ log p ( μ , Σ ) ] = E p ( μ , Σ ) log p ( μ , Σ ) log p ( μ , Σ ) ,
                                                                      = E p ( μ , Σ ) 2 log p ( μ , Σ ) .
For multivariate distributions parameterized by a m-dimensional vector (with m = d ( d + 3 ) 2 )
θ = ( θ 1 , , θ d , θ d + 1 , , θ m ) R m ,
with μ = ( θ 1 , , θ d ) and Σ ( θ ) = vech ( θ d + 1 , , θ m ) (inverse half-vectorization of matrices [5]), we have [6,7,8,9]:
I ( θ ) = [ I i j ( θ ) ] , with I i j ( θ ) = μ θ i Σ 1 μ θ j + 1 2 tr Σ 1 μ θ i Σ 1 μ θ j .
By equipping the regular statistical model N ( d ) with the Fisher information metric
g N Fisher ( μ , Σ ) = Cov [ log p ( μ , Σ ) ( x ) ]
we obtain a Riemannian manifold M = M N called the Fisher–Rao Gaussian or normal manifold [6,7]. The tangent space T N M is identified with the product space R d × Sym ( d ) . Let { μ , Σ } be a natural vector basis in T N M , and denote by [ v ] and [ V ] the vector components in that natural basis. We have
g ( μ , Σ ) Fisher ( ( v 1 , V 1 ) , ( v 2 , V 2 ) ) = ( v 1 , V 1 ) , ( v 2 , V 2 ) ( μ , Σ ) , = [ v 1 ] Σ 1 [ v 2 ] + 1 2 tr Σ 1 [ V 1 ] Σ 1 [ V 2 ] .
The induced Riemannian geodesic distance ρ N ( · , · ) is called the Rao distance [10] or the Fisher–Rao distance [11,12]:
ρ N ( N ( λ 1 ) , N ( λ 2 ) ) = inf c ( t ) c ( 0 ) = p λ 1 c ( 1 ) = p λ 2 Length ( c ) ,
where the Riemannian length of any smooth curve c ( t ) M is defined by
Length ( c ) = 0 1 c ˙ ( t ) , c ˙ ( t ) c ( t ) d t = 0 1 d s N ( t ) d t = 0 1 c ˙ ( t ) c ( t ) d t ,
where ˙ = d d t denotes the derivative with respect to parameter t, d s N ( t ) is the Riemannian length element of ( M , g N Fisher ) and · c ( t ) = · , · c ( t ) . We also write ρ N ( p λ 1 , p λ 2 ) for ρ N ( N ( λ 1 ) , N ( λ 2 ) ) .
The minimizing curve c ( t ) = γ N ( p λ 1 , p λ 2 ; t ) of Equation (3) is called the Fisher–Rao geodesic. The Fisher–Rao geodesic is also an autoparallel curve [2] with respect to the Levi–Civita connection N Fisher induced by the Fisher metric g N Fisher .
Remark 1. 
If we consider the Riemannian manifold ( M , β g ) for β > 0 instead of ( M , g ) then the length element d s is scaled by β : d s β g = β d s g . It follows that the length of a curve c becomes
Length β g ( c ) = β Length g ( c ) .
However, the geodesics joining any two points p 1 and p 2 of M are the same: γ β g ( p 1 , p 2 ; t ) = γ g ( p 1 , p 2 ; t ) (with γ g ( p 1 , p 2 ; 0 ) = p 1 and γ g ( p 1 , p 2 ; 1 ) = p 2 ).
Historically, Hotelling [13] first used this Fisher Riemannian geodesic distance in the late 1920s. From the viewpoint of information geometry [1], the Fisher metric is the unique Markov invariant metric up to rescaling [14,15,16]. The counterpart to the Fisher metric on the compact manifold has been reported in [17], proving its uniqueness under the action of the diffeomorphism group. The Fisher–Rao distance has been used to design statistical hypothesis testing [18,19,20,21], to measure the distance between the prior and posterior distributions in Bayesian statistics [22], in clustering [23,24], in signal processing [25,26,27,28], and in deep learning [29], just to mention a few.
The squared line element induced by the Fisher metric of the multivariate normal family [6,7] is
d s N 2 ( μ , Σ ) = d μ d Σ I ( μ , Σ ) d μ d Σ , = d μ Σ 1 d μ + 1 2 tr Σ 1 d Σ 2 .
There are many ways to calculate the FIM/length element for multivariate normal distributions [7,9]. Let us give a simple approach based on the fact that the family N ( d ) of normal distributions forms a regular exponential family [30]:
N ( d ) = p θ ( λ ) = exp θ v ( μ ) , x + θ M ( Σ ) , x x F N ( θ v , θ M ) ,
with θ ( λ ) = ( θ v = ( Σ 1 μ , θ M = 1 2 Σ 1 ) the natural parameters and log-partition function (also called cumulant function)
F N ( θ ) = 1 2 d log π log | θ M | + 1 2 θ v θ M 1 θ v .
The vector inner product is v 1 , v 2 = v 1 v 2 , and the matrix inner product is M 1 , M 2 = tr ( M 1 M 2 ) . The exponential family is said to be regular when the natural parameter space is open. Using Equation (2), it follows that the MVN FIM is I θ ( θ ) = E [ 2 log p θ ] = 2 F ( θ ) . This proves that the FIM is well-defined, i.e., ( I θ ( θ ) ) i j < . As an exponential family [1], we also have I θ ( θ ) = E [ t ( x ) ] , where t ( x ) = ( x , x x ) is the sufficient statistic. Thus, the Fisher metric is a Hessian metric [31]. Let F N ( θ v , θ M ) = F v ( θ v ) + F M ( θ M ) with F v ( θ v ) = 1 2 d log π + 1 2 θ v θ M 1 θ v and F M ( θ M ) = 1 2 log | θ M | . We obtain the following block-diagonal expression of the FIM:
I ( θ ( λ ) ) = 2 F N ( θ ( μ , Σ ) ) = Σ 1 0 0 1 2 θ M 2 log | 1 2 Σ 1 | .
Therefore d s N 2 ( μ , Σ ) = d s v 2 + d s M 2 with d s v 2 ( μ ) = d μ Σ 1 d μ and d s M 2 ( Σ ) = 1 2 tr Σ 1 d Σ 2 . Let us note in passing that θ M 2 log | θ M | is a fourth order tensor [4].
The family N ( d ) can also be considered to be an elliptical family [32], thus highlighting the affine-invariance property of the Fisher information metric. That is, the Fisher metric is invariant with respect to affine transformations [33]: Let ( a , A ) be an element of the affine group Aff ( d ) with a R d and A GL ( d ) . The group identity element of Aff ( d ) is e = ( 0 , I ) and the group operation is ( a 1 , A 1 ) . ( a 2 , A 2 ) = ( a 1 + A 1 a 2 , A 1 A 2 ) with inverse ( a , A ) 1 = ( A 1 a , A 1 ) ). Then we have
Property 1 
(Fisher–Rao affine invariance). For all A GL ( d ) , a R d , we have
ρ N ( N ( A μ 1 + a , A Σ 1 A ) , N ( A μ 2 + a , A Σ 2 A ) ) = ρ N ( N ( μ 1 , Σ 1 ) , N ( μ 2 , Σ 2 ) ) .
This can be proven by checking that d s N ( μ , Σ ) = d s N ( μ , Σ ) where μ = A μ + a and Σ = A Σ 2 A . It follows that we can reduce the calculation of the Fisher–Rao distance to a canonical case where one argument is N std = N ( 0 , I ) , the standard d-variate distribution:
ρ N ( N ( μ 1 , Σ 1 ) , N ( μ 2 , Σ 2 ) ) = ρ N N std , N Σ 1 1 2 ( μ 2 μ 1 ) , Σ 1 1 2 Σ 2 Σ 1 1 2 , = ρ N N Σ 2 1 2 ( μ 1 μ 2 ) , Σ 2 1 2 Σ 1 Σ 2 1 2 , N std ,
where Σ p is the fractional matrix power which can be calculated from the Singular Value Decomposition O D O of Σ (where O is an orthogonal matrix and D = diag ( λ 1 , , λ d ) a diagonal matrix): Σ p = O D p O with D p = diag ( λ 1 p , , λ d p ) .
The family of normal elliptical distributions can be obtained from the standard normal distribution by the action of the affine group [12,32] Aff ( d ) :
N ( μ , Σ ) = ( μ , Σ 1 2 ) . N std = N ( ( μ , Σ 1 2 ) . ( 0 , I ) ) .

1.2. Fisher–Rao Distance between Normal Distributions: Some Subfamilies with Closed-Form Formula

In general, the Fisher–Rao distance ρ N ( N 1 , N 2 ) between two multivariate normal distributions N 1 and N 2 is not known in closed form [34,35,36,37], and several lower and upper bounds [38], and numerical techniques such as the geodesic shooting [39,40,41] have been investigated. See [42] for a recent review. Unfortunately, the geodesic shooting (GS) approach is time-consuming and numerically unstable for large Fisher–Rao distances [21,42]. In 3D Diffusion Tensor Imaging (DTI), 3 × 3 covariance matrices Σ i , j , k are stored a 3D grid locations μ i , j , k thus generating 3D MVNs N i , j , k = N ( μ i , j , k , Σ i , j , k ) with means μ i , j , k regularly spaced to each others. The Fisher–Rao distances can be calculated between an MVN N i , j , k and another MVN N i , j , k in a neighborhood of N i , j , k (using 6- or 26-neighborhood) using geodesic shooting. For larger Fisher–Rao distances between non-neighbors MVNs, we can use the shortest path distance using Dijkstra’s algorithm [43] on the graph induced by the MVNs with edges between adjacent MVNs weighted by their Fisher–Rao distances.
The two main difficulties with calculating the Fisher–Rao distance are
  • to know explicitly the expression of the Riemannian Fisher–Rao geodesic γ N FR ( p λ 1 , p λ 2 ; t ) and
  • to integrate, in closed form, the length element d s N along this Riemannian geodesic.
Please note that the Fisher–Rao geodesics [1] γ N FR ( p λ 1 , p λ 2 ; t ) are parameterized by constant speed (i.e., μ ˙ ( t ) = μ ˙ ( 0 ) and Σ ˙ ( t ) = Σ ˙ ( 0 ) ), or equivalently parametrized using the arc length:
ρ N γ N FR ( p λ 1 , p λ 2 ; s ) , γ N FR ( p λ 1 , p λ 2 ; t ) = | s t | ρ N ( p λ 1 , p λ 2 ) , s , t [ 0 , 1 ] .
However, in several special cases, the Fisher–Rao distance between normal distributions belonging to restricted subsets of N is known.
Three such prominent cases are (see [42] for other cases)
  • when the normal distributions are univariate ( d = 1 ),
  • when we consider the set N μ = { N ( μ , Σ ) : Σ P ( d ) } M N of normal distributions sharing the same mean μ (with the embedded submanifold S μ M ), and
  • when we consider the set N Σ = { N ( μ , Σ ) : Σ P ( d ) } N of normal distributions sharing the same covariance matrix Σ (with the corresponding embedded submanifold S Σ M ).
Let us report the formula of the Fisher–Rao distance in these three cases:
  • In the univariate case N ( 1 ) , the Fisher–Rao distance between N 1 = N ( μ 1 , σ 1 2 ) and N 2 = N ( μ 2 , σ 2 2 ) can be derived from the hyperbolic distance [44] expressed in the Poincaré upper space since we have
    d s N 2 = g ( μ , σ ) ( d μ , d σ ) = d μ 2 + 2 d σ 2 σ 2 = 2 d μ 2 2 + d σ 2 σ 2 = 2 d x 2 + d y 2 y 2 = d s Poincaré 2 ,
    where x = μ 2 and y = σ . It follows that
    ρ N ( N 1 , N 2 ) = 2 ρ Poincaré ( ( x 1 , y 1 ) , ( x 2 , y 2 ) ) = 2 ρ Poincaré μ 1 2 , σ 1 , μ 2 2 , σ 2 .
    Thus, we have the following expression for the Fisher–Rao distance between univariate normal distributions:
    ρ N ( N 1 , N 2 ) = 2 log 1 + Δ ( μ 1 , σ 1 ; μ 2 , σ 2 ) 1 Δ ( μ 1 , σ 1 ; μ 2 , σ 2 ) ,
    with
    Δ ( a , b ; c , d ) = ( c a ) 2 + 2 ( d b ) 2 ( c a ) 2 + 2 ( d + b ) 2 , ( a , b , c , d ) R 4 \ { 0 } .
    In particular, we have
    Δ ( a , b ; a , d ) = d b d + b when a = c (same mean),
    Δ ( a , b ; c , b ) = 1 1 + 8 b 2 ( c a ) 2 when b = d (same variance),
    Δ ( 0 , 1 ; c , d ) = c 2 + 2 ( d 1 ) 2 c 2 + 2 ( d + 1 ) 2 when a = 0 and b = 1 (standard normal).
    In 1D, the affine-invariance property (Property 1) extends to function Δ as follows:
    Δ ( μ 1 , σ 1 ; μ 2 , σ 2 ) = Δ 0 , 1 ; μ 2 μ 1 σ 1 , σ 2 σ 1 = Δ μ 1 μ 2 σ 2 , σ 1 σ 2 ; 0 , 1 .
    Using one of the many identities between inverse hyperbolic functions (e.g., arctanh, arccosh, arcsinh), we can obtain an equivalent formula for Equation (7). For example, since arctanh ( u ) : = 1 2 log 1 + u 1 u for 0 < u < 1 , we have equivalently:
    ρ N ( N 1 , N 2 ) = 2 2 arctanh ( Δ ( μ 1 , σ 1 ; μ 2 , σ 2 ) ) .
    The Fisher–Rao geodesics are semi-ellipses with centers located on the x-axis. See Appendix A.1 for the parametric equations of Fisher–Rao geodesics between univariate normal distributions. Figure 1 displays four univariate normal distributions with their pairwise geodesics and Fisher–Rao distances.
    Using the identity arctanh u 2 1 u 2 + 1 = arccosh 1 + u 2 2 u with arccosh ( x ) : = log ( x + x 2 1 ) , we also have
    ρ N ( N 1 , N 2 ) = 2 2 arccosh 1 ( 1 Δ ( μ 1 , σ 1 ; μ 2 , σ 2 ) ) ( 1 + Δ ( μ 1 , σ 1 ; μ 2 , σ 2 ) ) ,
    Since the inverse hyperbolic cosecant (CSC) function is defined by arccsch ( u ) : = arccosh ( 1 / u ) , we further obtain
    ρ N ( N 1 , N 2 ) = 2 2 arccsch ( 1 Δ ( μ 1 , σ 1 ; μ 2 , σ 2 ) ) ( 1 + Δ ( μ 1 , σ 1 ; μ 2 , σ 2 ) ) ,
    We can also write
    ρ N ( N 1 , N 2 ) = 2 arccosh 1 + ( μ 2 μ 1 ) 2 + 2 ( σ 2 σ 1 ) 2 4 σ 1 σ 2
    Thus, using the many-conversions formula between inverse hyperbolic functions, we obtain many equivalent different formulas of the Fisher–Rao distance, which are used in the literature.
  • In the second case, the Fisher–Rao distance between N 1 = N ( μ , Σ 1 ) and N 2 = N ( μ , Σ 2 ) has been reported in [6,7,45,46,47]:
    ρ N μ ( N 1 , N 2 ) = 1 2 i = 1 d log 2 λ i ( Σ 1 1 Σ 2 ) ,
    = ρ N μ ( Σ 1 , Σ 2 ) ,
    where λ i ( M ) denotes the i-th generalized largest eigenvalue of matrix M, where the generalized eigenvalues are solutions of the equation | Σ 1 λ Σ 2 | = 0 . Let us notice that ρ N μ ( ( μ , Σ 1 ) , ( μ , Σ 2 ) ) = ρ N μ ( ( μ , Σ 1 1 ) , ( μ , Σ 2 1 ) ) since λ i ( Σ 2 1 Σ 1 ) = 1 λ i ( Σ 1 1 Σ 2 ) and log 2 λ i ( Σ 2 1 Σ 1 ) = ( log λ i ( Σ 1 1 Σ 2 ) ) 2 = log 2 λ i ( Σ 1 1 Σ 2 ) . Matrix Σ 1 1 Σ 2 may not be SPD and thus the λ i ’s are generalized eigenvalues. We may consider instead the SPD matrix Σ 1 1 2 Σ 2 Σ 1 1 2 which is SPD and such that λ i ( Σ 1 1 Σ 2 ) = λ i ( Σ 1 1 2 Σ 2 Σ 1 1 2 ) . The Fisher–Rao distance of Equation (11) can be equivalently written [48] as
    ρ N μ ( N 1 , N 2 ) = 1 2 Log Σ 1 1 2 Σ 2 Σ 1 1 2 F ,
    where Log ( M ) is the matrix logarithm (unique when M is SPD) and M F = i , j M i , j 2 = tr ( M M ) is the matrix Fröbenius norm. This metric distance between SPD matrices although first studied by Siegel [45] in 1964 was rediscovered and analyzed recently in [49] (2003). Let ρ SPD ( P 1 , P 2 ) = i = 1 d log 2 λ i ( P 1 1 P 2 ) so that ρ N μ ( N ( μ , P 1 ) , N ( μ , P 2 ) ) = 1 2 ρ SPD ( P 1 , P 2 ) .
    The Riemannian SPD distance ρ SPD enjoys the following well-known invariance properties:
    Invariance by congruence transformation:
    X GL ( d ) , ρ SPD ( X P 1 X , X P 2 X ) = ρ SPD ( P 1 , P 2 ) ,
    Invariance by inversion:
    P 1 , P 2 P ( d ) , ρ ( P 1 1 , P 2 1 ) = ρ SPD ( P 1 , P 2 ) .
    Let P 1 = L 1 L 1 be the Cholesky decomposition (unique when P 1 0 ). Then apply the congruence invariance for X = L 1 1 :
    ρ SPD ( P 1 , P 2 ) = ρ SPD ( L 1 1 P 1 ( L 1 1 ) , L 1 1 P 2 ( L 1 1 ) ) = ρ SPD ( I , L 1 1 P 2 ( L 1 1 ) ) .
    We can also consider the factorization P 1 = S 1 S 1 where S 1 = P 1 1 2 is the unique symmetric square root matrix [50]. Then we have
    ρ SPD ( P 1 , P 2 ) = ρ SPD ( S 1 1 P 1 ( S 1 1 ) , S 1 1 P 2 ( S 1 1 ) ) = ρ SPD ( I , S 1 1 P 2 ( S 1 1 ) ) .
  • The Fisher–Rao distance between N 1 = N ( μ 1 , Σ ) and N 2 = N ( μ 2 , Σ ) has been reported in closed form [42] (Proposition 3). The method is described with full details in Appendix B. We present a simpler scheme based on the inverse Σ 1 2 of the symmetric square root factorization [50] of Σ = Σ 1 2 Σ 1 2 (ith ( Σ 1 2 ) = Σ 1 2 ). Let us use the affine-invariance property of the Fisher–Rao distance under the affine transformation Σ 1 2 and then apply affine invariance under translation as follows:
    ρ N ( N ( μ 1 , Σ ) , N ( μ 2 , Σ ) ) = ρ N ( N ( Σ 1 2 μ 1 , Σ 1 2 Σ Σ 1 2 ) , N ( Σ 1 2 μ 2 , Σ 1 2 Σ Σ 1 2 ) ) , = ρ N ( N ( 0 , I ) , N ( Σ 1 2 ( μ 2 μ 1 ) , I ) ) , = ρ N ( N ( 0 , 1 ) , N ( Σ 1 2 ( μ 2 μ 1 ) 2 , 1 ) ) .
    The right-hand side Fisher–Rao distance is computed from Equation (7) and justified by the method [42] (Proposition 3) described in Appendix B using a rotation matrix R with R R = I so that
    ρ N ( N ( 0 , I ) , N ( Σ 1 2 ( μ 2 μ 1 ) , I ) ) = ρ N ( N ( 0 , I ) , N ( R Σ 1 2 ( μ 2 μ 1 ) , R I R ) ) , = ρ N ( N ( 0 , I ) , Σ 1 2 ( μ 2 μ 1 ) 2 , I ) ) .
    Then we apply the formula of Equation (23) of [42]. Section 1.5 shall report a simpler closed-form formula by proving that the Fisher–Rao distance between N ( μ 1 , Σ ) and N ( μ 2 , Σ ) is a scalar function of their Mahalanobis distance [51] using the algebraic method of maximal invariants [52].

1.3. Fisher–Rao Distance: Totally versus Non-Totally Geodesic Submanifolds

Consider N = { N ( λ ) : λ Λ } N a statistical submodel of the MVN statistical model N . Using the Fisher information matrix I λ ( λ ) , we obtain the intrinsic Fisher–Rao manifold M = M N . We may also consider M to be an embedded submanifold of M . Let us write S = S N M the embedded submanifold.
A totally geodesic submanifold S M is such that the geodesics γ M ( N 1 , N 2 ; t ) fully stay in M for any pair of points N 1 , N 2 N . For example, the submanifold M μ = { N ( μ , Σ ) : Σ P ( d ) } M of MVNs with fixed mean μ is a totally geodesic submanifold [53] of M but the submanifold M Σ = { N ( μ , Σ ) : μ R d } M of MVNs sharing the same covariance matrix Σ is not totally geodesic. When an embedded submanifold S M is totally geodesic, we always have ρ M ( N 1 , N 2 ) = ρ S ( N 1 , N 2 ) . Thus, we have ρ N ( N ( μ , Σ 1 ) , N ( μ , Σ 2 ) ) = ρ SPD ( Σ 1 , Σ 2 ) . However, when an embedded submanifold S M is not totally geodesic, we have ρ M ( N 1 , N 2 ) ρ S ( N 1 , N 2 ) because the Riemannian geodesic length in S is necessarily longer or equal than the Riemannian geodesic length in M . The merit to consider submanifolds is to be able to calculate in closed form the Fisher–Rao distance which may then provide an upper bound on the Fisher–Rao distance for the full statistical model. For example, consider N 1 = N ( μ 1 , Σ ) and N 2 = N ( μ 2 , Σ ) in M Σ , a non-totally geodesic submanifold. The Rao distance between N 1 and N 2 in M is upper bounded by the Riemannian distance in M Σ (with line element d s Σ 2 = d μ Σ 1 d μ ) which corresponds to the Mahalanobis distance [10,51] Δ Σ ( μ 1 , μ 2 ) :
ρ M μ ( N 1 , N 2 ) Δ Σ ( μ 1 , μ 2 ) : = ( μ 2 μ 1 ) Σ 1 ( μ 2 μ 1 ) .
The Mahalanobis distance can be interpreted as the Euclidean distance D E ( p , q ) = Δ I ( p , q ) = ( p q ) ( p q ) (where I denotes the identity matrix) after an affine transformation: Let Σ = L L = U U be the Cholesky decomposition of Σ 0 with L a lower triangular matrix or U = L an upper triangular matrix. Then we have
Δ Σ ( μ 1 , μ 2 ) = ( μ 2 μ 1 ) ( L ) 1 L 1 ( μ 2 μ 1 ) , = Σ 1 2 ( μ 2 μ 1 ) 2 , = Δ I ( L 1 μ 1 , L 1 μ 2 ) = D E ( L 1 μ 1 , L 1 μ 2 ) ,
where · 2 denotes the vector 2 -norm.
The Rao distance ρ Σ of Equation (A1) between two MVNs with fixed covariance matrix emanates from the property that the submanifold M [ v ] , Σ = { N ( a v , Σ ) : a R } is totally geodesic [54].
Let us emphasize that for a submanifold S M to be totally geodesic or not depend on the underlying metric in M . The same subset N N with N equipped with two different metrics g 1 and g 2 can be totally geodesic regarding g 1 and non-totally geodesic regarding g 2 . See Remark 3 for such an example.
In general, using the triangle inequality of the Riemannian metric distance ρ N , we can upper bound ρ N ( N 1 , N 2 ) with N 1 = ( μ 1 , Σ 1 ) and N 1 = ( μ 2 , Σ 2 ) as follows:
ρ N ( N 1 , N 2 ) ρ M μ 1 ( N 1 , N 12 ) + ρ M Σ 2 ( N 12 , N 2 ) , ρ M Σ 1 ( N 1 , N 21 ) + ρ M μ 2 ( N 21 , N 2 ) ,
where N 12 = ( μ 1 , Σ 2 ) and N 21 = N ( μ 2 , Σ 1 ) . See Figure 2 for an illustration of the Fisher–Rao geodesic triangle N 1 , N 2 , N 12 . Furthermore, since ρ N Σ 1 ( N 1 , N 21 ) Δ Σ 1 ( μ 1 , μ 2 ) and ρ N Σ 2 ( N 12 , N 2 ) Δ Σ 2 ( μ 1 , μ 2 ) , we obtain the following upper bound on the Rao distance between MVNs:
ρ N ( N 1 , N 2 ) ρ P ( Σ 1 , Σ 2 ) + min { Δ Σ 1 ( μ 1 , μ 2 ) , Δ Σ 2 ( μ 1 , μ 2 ) } .
See also [55].
In general, the difficulty with calculating the Fisher–Rao distance comes from the fact that
  • we do not know the Fisher–Rao geodesics with boundary value conditions (BVP) in closed form but the geodesics with initial value conditions [48] (IVP) are known explicitly using the natural parameters ( Σ 1 μ , Σ 1 ) of MVNs,
  • we must integrate the line element d s N along the geodesic.
As we shall see in Section 3.1, the above first problem is much harder to solve than the second problem which can be easily approximated by discretizing the curve. The lack of a closed-form formula and fast and good approximations for ρ N between MVNs is a current limiting factor for its use in applications. Indeed, many applications (e.g., [56,57]) consider the restricted case of the Rao distance between zero-centered MVNs which have closed form (distance of Equation (11) in the SPD cone). The SPD cone is a symmetric Hadamard manifold, and its isometries have been fully studied and classified in [58] (Section 4). The Fisher–Rao geometry of zero-centered generalized MVNs was recently studied in [59].

1.4. Contributions and Paper Outline

The main contribution of this paper is to propose an approximation of ρ N based on Calvo and Oller’s embedding [19] (C&O for short) and report its experimental performance. First, we concisely recall C&O’s family of embeddings f β of N ( d ) as submanifolds N ¯ β of P ( d + 1 ) in Section 2. Next, we present our approximation technique in Section 3 which differs from the usual geodesic shooting approach [39], and report experimental results. Finally, we study some information–geometric properties [1] of the isometric embedding in Section 5 such as the fact that it preserves mixture geodesics (embedded C&O submanifold is autoparallel with respect to the mixture affine connection) but not exponential geodesics. Moreover, we prove that the Fisher–Rao distance between multivariate normal distributions sharing the same covariance matrix is a scalar function of their Mahalanobis distance in Section 1.5 using the framework of Eaton [52] of maximal invariants.

1.5. A Closed-Form Formula for the Fisher–Rao Distance between Normal Distributions Sharing the Same Covariance Matrix

Consider the Fisher–Rao distance between N 1 = ( μ 1 , Σ ) and N 1 = ( μ 2 , Σ ) for a fixed covariance matrix Σ and the translation action a . μ : = μ + a of the translation group R d (a subgroup of the affine group). Both the Fisher–Rao distance and the Mahalanobis distance are invariant under translations:
ρ N ( ( μ 1 + a , Σ ) , ( μ 2 + a , Σ ) ) = ρ N ( ( μ 1 , Σ ) , ( μ 2 , Σ ) ) , Δ Σ ( μ 1 + a , μ 2 + a ) = Δ Σ ( μ 1 , μ 2 ) .
To prove that ρ N ( ( μ 1 , Σ ) , ( μ 2 , Σ ) ) = h FR ( Δ Σ ( μ 1 , μ 2 ) ) for a scalar function h FR ( · ) , we shall prove that the Mahalanobis distance is a maximal invariant, and use the framework of maximal invariants of Eaton [52] (Chapter 2) who proved that any other invariant function is necessarily a function of a maximal invariant, i.e., a function of the Mahalanobis distance in our case.
The Mahalanobis distance is a maximal invariant because we can write Δ Σ ( μ 1 , μ 2 ) = Δ 1 ( 0 , Δ Σ ( μ 1 , μ 2 ) ) and when Δ Σ ( μ 1 , μ 2 ) = Δ Σ ( μ 1 , μ 2 ) in 1D there exists a R such that ( μ 1 + a , μ 2 + a ) = ( μ 1 , μ 2 ) . We must prove equivalently that when | m 1 m 2 | = | m 1 m 2 | that there exists a R such that ( m 1 + a , m 2 + a ) = ( m 1 , m 2 ) . Assume without loss of generality that m 1 m 2 . When m 1 m 2 = m 1 m 2 , there exists a = m 1 m 1 so that m 1 = a . m 1 = m 1 + a and m 2 = a . m 2 = m 2 + a with m 1 m 2 = m 1 m 2 . Thus, using Eaton’s theorem [52], there exists a scalar function h FR such that ρ N ( ( μ 1 , Σ ) , ( μ 2 , Σ ) ) = h FR ( Δ Σ ( μ 1 , μ 2 ) ) .
To find explicitly the scalar function h FR ( · ) , let us consider the univariate case of normal distributions for which the Fisher–Rao distance is given in closed form in Equation (7). In that case, the univariate Mahalanobis distance is Δ σ 2 ( μ 1 , μ 2 ) = ( μ 2 μ 1 ) ( σ 2 ) 1 ( μ 2 μ 1 ) = | μ 2 μ 1 | σ and we can write formula of Equation (7) as h FR ( Δ σ 2 ( μ 1 , μ 2 ) ) with
h FR ( u ) = 2 log 8 + u 2 + u 8 + u 2 u ,
  = 2 arccosh 1 + 1 4 u 2 ,
using the identities
log ( x ) = arccosh 1 + x 2 2 x = arctanh x 2 1 1 + x 2 , x > 1 ,
where arctanh ( u ) = 1 2 log 1 + u 1 u .
Proposition 1. 
The Fisher–Rao distance ρ N ( ( μ 1 , Σ ) , ( μ 2 , Σ ) ) between two MVNs with same covariance matrix is
ρ N ( ( μ 1 , Σ ) , ( μ 2 , Σ ) ) = ρ N ( ( 0 , 1 ) , ( Δ Σ ( μ 1 , μ 2 ) , 1 ) ) ,
                                                                                                                              = 2 log 8 + Δ Σ 2 ( μ 1 , μ 2 ) + Δ Σ ( μ 1 , μ 2 ) 8 + Δ Σ 2 ( μ 1 , μ 2 ) Δ Σ ( μ 1 , μ 2 ) ,
                                                                                  = 2 arccosh 1 + 1 4 Δ Σ 2 ( μ 1 , μ 2 ) ,
where Δ Σ ( μ 1 , μ 2 ) = ( μ 2 μ 1 ) Σ 1 ( μ 2 μ 1 ) is the Mahalanobis distance.
Indeed, notice that the d-variate Mahalanobis distance Δ Σ ( μ 1 , μ 2 ) can be interpreted as a univariate Mahalanobis distance between the standard normal distribution N ( 0 , 1 ) and N ( Δ Σ ( μ 1 , μ 2 ) , 1 ) :
Δ Σ ( μ 1 , μ 2 ) = Δ 1 ( 0 , Δ Σ ( μ 1 , μ 2 ) ) .
Thus, we have ρ N ( ( μ 1 , Σ ) , ( μ 2 , Σ ) ) = ρ N ( ( 0 , 1 ) , ( Δ Σ ( μ 1 , μ 2 ) , 1 ) ) , where the right-hand-side term is the univariate Fisher–Rao distance of Equation (7). Let us notice that the square length element on M Σ is d s 2 = d μ Σ 1 d μ = Δ Σ 2 ( μ , μ + d μ ) . This result can be extended to elliptical distributions [12] (Theorem 1).
Let us corroborate this result by checking the formula of Equation (1) with two examples in the literature: In [38] (Figure 4), we Fisher–Rao distance between N 1 = ( 0 , I ) and N 2 = 1 2 1 2 , I is studied. We find ρ N ( N 1 , N 2 ) = 0.69994085 in accordance with their result shown in Figure 4. The second example is Example 1 of [42] (p. 11) with N 1 = 1 0 , Σ and N 2 = 6 3 , Σ for Σ = 1.1 0.9 0.9 1.1 . Formula of Equation (18) yields the Fisher–Rao distance 5.006483034546878 in accordance with [42] which reports 5.00648 .
Similarly, the statistical Ali–Silvey–Csiszár f-divergences [60,61]
I f [ p ( μ 1 , Σ ) : p ( μ 2 , Σ ) ] = R d p ( μ 1 , Σ ) ( x ) f p ( μ 2 , Σ ) p ( μ 1 , Σ ) d x ,
between two MVNs sharing the same covariance matrix are increasing functions of the Mahalanobis distance because the f-divergences between two MVNs sharing the same covariance matrix are invariant under the action of the translation group [62]. Thus, we have I f [ p ( μ 1 , Σ ) : p ( μ 2 , Σ ) ] = h f ( Δ Σ ( μ 1 , μ 2 ) ) . Since Δ Σ ( μ 1 , μ 2 ) = Δ 1 ( 0 , Δ Σ ( μ 1 , μ 2 ) ) , we thus have
I f [ p ( μ 1 , Σ ) : p ( μ 2 , Σ ) ] = h f ( Δ 1 ( 0 , Δ Σ ( μ 1 , μ 2 ) ) = I f [ p ( 0 , 1 ) : p ( Δ Σ ( μ 1 , μ 2 ) , 1 ) ] ,
where the right-hand side f-divergence is between univariate normal distributions. See Table 2 of [62] for some explicit functions h f .

2. Calvo and Oller’s Family of Diffeomorphic Embeddings

Calvo and Oller [19,32] noticed that we can embed the space of normal distributions in P ( d + 1 ) by using the following mapping:
f β ( N ) = f β ( μ , Σ ) = Σ + β μ μ β μ β μ β P ( d + 1 ) ,
where β R > 0 and N = N ( μ , Σ ) . Notice that since the dimension of P ( d + 1 ) is ( d + 1 ) ( d + 2 ) 2 , we only use ( d + 1 ) ( d + 2 ) 2 d ( d + 3 ) 2 = 1 extra dimension for embedding N ( d ) into P ( d + 1 ) . By foliating P = R > 0 × P c where P c = { P P : | P | = c } denotes the subsets of P with determinant c, we obtain the following Riemannian Calvo and Oller metric on the SPD cone:
d s CO 2 = 1 2 tr f 1 ( μ , Σ ) d f ( μ , Σ ) 2 , = 1 2 d β β 2 + β d μ Σ 1 d μ + 1 2 tr Σ 1 d Σ 2 .
Let
N ¯ β ( d ) = P ¯ = f β ( μ , Σ ) : ( μ , Σ ) N ( d ) = R d × P ( d )
denote the submanifold of P ( d + 1 ) of codimension 1, and N ¯ = N ¯ 1 (i.e., β = 1 ). The family of mappings f β provides diffeomorphisms between N ( d ) and N ¯ β ( d ) . Let f β 1 ( P ¯ ) = ( μ P ¯ , Σ P ¯ ) denote the inverse mapping for P ¯ N ¯ β ( d ) , and let f = f 1 (i.e., β = 1 ):
f ( N ) = f ( μ , Σ ) = Σ + μ μ μ μ 1 .
By equipping the cone P ( d + 1 ) by the trace metric [63,64] (also called the affine invariant Riemannian metric, AIRM) scaled by 1 2 :
g P trace ( P 1 , P 2 ) : = tr ( P 1 P 1 P 1 P 2 )
(yielding the squared line element d s P 2 = 1 2 tr ( ( P d P ) 2 ) ), Calvo and Oller [19] proved that N ¯ ( d ) is isometric to N ( d ) (i.e., the Riemannian metric of P ( d + 1 ) restricted to N ( d ) coincides with the Riemannian metric of N ( d ) induced by f) but N ¯ ( d ) is not totally geodesic (i.e., the geodesics γ P ( P ¯ 1 , P ¯ 2 ; t ) for P ¯ 1 = f ( N 1 ) , P ¯ 2 = f ( N 2 ) N ¯ ( d ) leaves the embedded normal submanifold N ¯ ( d ) ) . Please note that g P trace can be interpreted as the Fisher metric for the family N 0 of 0-centered normal distributions. Thus, we have ( N ( d ) , g Fisher ) ( P ( d + 1 ) , g trace ) , and the following diagram between parameter spaces and corresponding distributions:
N ( d ) N 0 ( d + 1 ) Λ ( d ) P ( d + 1 )
Remark 2. 
The trace metric was first studied by Siegel [45,65] using the wider scope of complex symmetric matrices with positive–definite imaginary parts generalizing the Poincaré upper half-plane (see Appendix D).
We omit to specify the dimensions and write for short N , N ¯ , and P when clear from the context. Thus, C&O proposed to use the embedding f = f 1 to give a lower bound ρ CO of the Fisher–Rao distance ρ N between normals:
LC CO : ρ N ( N 1 , N 2 ) ρ CO ( f ( μ 1 , Σ 1 ) P ¯ 1 , f ( μ 2 , Σ 2 ) P ¯ 2 ) = 1 2 i = 1 d + 1 log 2 λ i ( P ¯ 1 1 P ¯ 2 ) .
We let ρ CO ( N 1 , N 2 ) = ρ CO ( f ( N 1 ) , f ( N 2 ) ) . The ρ CO distance is invariant under affine transformations such as the Fisher–Rao distance of Property 1:
Property 2 
(affine invariance of C&O distance [19]). For all A GL ( d ) , a R d , we have ρ CO ( ( A μ 1 + a , A Σ 1 A ) , ( A μ 2 + a , A Σ 2 A ) ) = ρ CO ( N ( μ 1 , Σ 1 ) , N ( μ 2 , Σ 2 ) ) .
When Σ 1 = Σ 2 = Σ , we have | P ¯ 1 | = | P ¯ 2 | = | Σ | . Since the Riemannian geodesics γ P ( P 1 , P 2 ; t ) in the SPD cone are given by γ P ( P 1 , P 2 ; t ) = P 1 1 2 ( P 1 1 2 P 2 P 1 1 2 ) t P 1 1 2 [66] (also written γ SPD ( P 1 , P 2 ; t ) ), we have | γ P ( P 1 , P 2 ; t ) | = | Σ | . Although the submanifold P c = { P P : | P | = c } is totally geodesic with respect to the trace metric, it is not totally geodesic with respect to 1 2 tr ( ( P ¯ d P ¯ ) 2 ) . Thus, although γ P ( P 1 , P 2 ) N ¯ , it does not correspond to the embedded MVN geodesics with respect to the Fisher metric. The C&O distance between two MVNs N ( μ 1 , Σ ) and N ( μ 2 , Σ ) sharing the same covariance matrix [19] is
ρ CO ( N ( μ 1 , Σ ) , N ( μ 2 , Σ ) ) = arccosh 1 + 1 2 Δ Σ 2 ( μ 1 , μ 2 ) ,
where arccosh ( x ) : = log ( x + x 2 1 ) for x 1 and Δ Σ ( μ 1 , μ 2 ) is the Mahalanobis distance between N ( μ 1 , Σ ) and N ( μ 2 , Σ ) . In that case, we thus have ρ CO ( N ( μ 1 , Σ ) , N ( μ 2 , Σ ) ) = h CO ( Δ Σ ( μ 1 , μ 2 ) ) where h CO ( u ) = arccosh 1 + 1 2 u 2 is a strictly monotone increasing function. Let us note in passing that in [19] (Corollary, page 230) there is a confusing or typographic error since the distance is reported as arccosh 1 + 1 2 d M ( μ 1 , μ 2 ) where d M denotes “Mahalanobis distance” [51]. Therefore, either d M = Δ Σ 2 , Mahalanobis D 2 -distance, or there is a missing square in the equation of the Corollary page 230. To obtain a flavor of how good is the approximation of the C&O distance, we may consider the same covariance case where we have both closed-form solutions for ρ N (Equation (20)) and ρ CO (Equation (23)). Figure 3 plots the two functions h CO and h FR (with h CO ( u ) h FR ( u ) u for u [ 0 , ) ).
Let us remark that similarly all f-divergences between N 1 = ( μ 1 , Σ ) and N 2 = ( μ 2 , Σ ) are scalar functions of their Mahalanobis distance Δ Σ ( μ 1 , μ 2 ) too, see [62].
The C&O distance ρ CO is a metric distance that has been used in many applications ranging from computer vision [57,67,68,69] to signal/sensor processing, statistics [70,71], machine learning [29,72,73,74,75,76] and analogical reasoning [77].
Remark 3. 
In a second paper, Calvo and Oller [32] noticed that we can embed normal distributions in P ( d + 1 ) by the following more general mapping (Lemma 3.1 [32]):
g α , β , γ ( μ , Σ ) = | Σ | α Σ + β γ 2 μ μ β γ μ β γ μ β P ( d + 1 ) ,
where α R , β R > 0 and γ R . It is show in [32] that the induced length element is
d s α , β , γ 2 = 1 2 α ( ( d + 1 ) + 2 α ) tr 2 ( Σ 1 d Σ ) + tr ( ( Σ 1 d Σ ) 2 ) + 2 β γ 2 d μ Σ 1 d μ + 2 α tr ( Σ 1 d Σ ) d β β + d β β 2 .
When γ = β = 1 , we have
d s α 2 = 1 2 α ( ( d + 1 ) + 2 α ) tr 2 ( Σ 1 d Σ ) + tr ( ( Σ 1 d Σ ) 2 ) + 2 β γ 2 d μ Σ 1 d μ .
Thus, to cancel the term tr 2 ( Σ 1 d Σ ) , we may either choose α = 0 or α = 2 1 + d .
In some applications [78], the embedding
g 1 d + 1 , 1 , 1 ( μ , Σ ) = | Σ | 1 d + 1 Σ + μ μ μ μ 1 : = f ^ ( μ , Σ ) ,
is used to ensure that g 1 d + 1 , 1 , 1 ( μ , Σ ) = 1 . That is normal distributions are embedded diffeomorphically into the submanifold of positive–definite matrices with a unit determinant (also called SSPD, acronym of Special SPD). In [32], C&O showed that there exists a second isometric embedding of the Fisher–Rao Gaussian manifold N ( d ) into a submanifold of the cone P ( d + 1 ) : f SSPD ( μ , Σ ) = | Σ | 2 d + 1 Σ + μ μ μ μ 1 . Let P ^ = f SSPD ( μ , Σ ) . This mapping can be understood as taking the elliptic isometry P | P | 2 d + 1 P of P P ( d + 1 ) [64] since | Σ | = | P ¯ ( μ , Σ ) | (see proof in Proposition 3). It follows that
ρ CO ( N 1 , N 2 ) = ρ P ( P ¯ 1 , P ¯ 2 ) = ρ P ( P ^ 1 , P ^ 2 ) ρ N ( N 1 , N 2 ) .
Similarly, we could have mapped P P 1 to obtain another isometric embedding. See the four types of elliptic isometric of the SPD cone described in [64]. Finally, let us remark that the SSPD submanifold is totally geodesic with respect to the trace metric but not with respect to the C&O metric.
Interestingly, Calvo and Oller [48] (p. 131) proved that ( ( μ ¯ 1 , , μ ¯ d ) , diag ( σ ¯ 1 2 , , σ ¯ d 2 ) ) is a maximal invariant for the action of the affine group Aff ( d ) , where μ ¯ = Q 1 ( μ 2 μ 1 ) and Σ 2 Σ 1 1 = Q diag ( σ ¯ 1 2 , , σ ¯ d 2 ) Q 1 (in [48], the authors considered Σ 1 Σ 1 2 ). Thus, we consider the following dissimilarity
D CO ( N ( μ 1 , Σ 1 ) , N ( μ 2 , Σ 2 ) ) = 2 i = 1 d log 2 1 + Δ ( 0 , 1 ; μ ¯ i , σ ¯ i ) 1 Δ ( 0 , 1 ; μ ¯ i , σ ¯ i ) .
Dissimilarity D CO is symmetric (i.e., D CO ( N 1 , N 2 ) = D CO ( N 2 , N 1 ) ) and D CO ( N 1 , N 2 ) = 0 if and only if N 1 = N 2 . Please note that when d = 1 , D CO is different from the Fisher–Rao distance of Equation (7).

3. Approximating the Fisher–Rao Distance

3.1. Approximating Length of Curves

Recall that the Fisher–Rao’s distance [79] is the Riemannian geodesic distance
ρ N ( N ( λ 1 ) , N ( λ 2 ) ) = inf c ( t ) c ( 0 ) = p λ 1 c ( 1 ) = p λ 2 Length ( c ) ,
where
Length ( c ) = 0 1 c ˙ ( t ) , c ˙ ( t ) c ( t ) d s N ( t ) d t .
We can approximate the Rao distance ρ N ( N 1 , N 2 ) by discretizing regularly any smooth curve c ( t ) joining N 1 = c ( 0 ) to N 2 = c ( 1 ) (Figure 4):
ρ N ( N 1 , N 2 ) 1 T i = 1 T 1 ρ N c i T , c i + 1 T ,
with equality holding iff c ( t ) = γ N ( N 1 , N 2 ; t ) is the Riemannian geodesic defined by the Levi–Civita metric connection induced by the Fisher information metric.
When the number of discretization steps T is sufficiently large, the normal distributions c i T and c i + 1 T are close to each other, and we can approximate ρ N c i T , c i + 1 T by D J c i T , c i + 1 T , where D J [ N 1 , N 2 ] = D KL [ N 1 , N 2 ] + D KL [ N 2 , N 1 ] is Jeffreys divergence, and D KL is the Kullback–Leibler divergence:
D KL [ p ( μ 1 , Σ 1 ) : p ( μ 2 , Σ 2 ) ] = 1 2 tr ( Σ 2 1 Σ 1 ) + Δ μ Σ 2 1 Δ μ d + log | Σ 2 | | Σ 1 | .
Thus, the costly determinant computations cancel each other in Jeffreys divergence (i.e., log | Σ 2 | | Σ 1 | + log | Σ 1 | | Σ 2 | = 0 ) and we have:
D J [ p ( μ 1 , Σ 1 ) : p ( μ 2 , Σ 2 ) ] = tr Σ 2 1 Σ 1 + Σ 1 1 Σ 2 2 I + Δ μ Σ 1 1 + Σ 2 1 2 Δ μ .
Figure 4 summarizes our method to approximate the Fisher–Rao geodesic distance.
In general, it holds that
I f [ p : q ] f ( 1 ) 2 d s Fisher 2 ,
between infinitesimally close distributions p and q ( d s 2 I f [ p : q ] f ( 1 ) ), where I f [ · : · ] denotes a f-divergence [1]. The Jeffreys divergence is a f-divergence obtained for f J ( u ) = log u + u log u with f J ( 1 ) = 2 . It is thus interesting to find low computational cost f-divergences between multivariate normal distributions to approximate the infinitesimal length element d s . Please note that f-divergences between MVNs are also invariant under the action of the affine group [62]. Thus, for infinitesimally close distributions p and q, this informally explains that d s Fisher is invariant under the action of the affine group (see Proposition 1).
Although the definite integral of the length element along the Fisher–Rao geodesic γ N FR is not known in closed form (i.e., Fisher–Rao distance), the integral of the squared length element along the mixture geodesic γ N m ( N 1 , N 2 ) and exponential geodesic γ N e ( N 1 , N 2 ) coincide with Jeffreys divergence D J [ N 1 , N 2 ] between N 1 and N 2 [1]:
Property 3 
([1]). We have
D J [ p λ 1 , p λ 2 ] = 0 1 d s N 2 ( γ N m ( p λ 1 , p λ 2 ; t ) ) d t = 0 1 d s N 2 ( γ N e ( p λ 1 , p λ 2 ; t ) ) d t .
Proof. 
Let us report a proof of this remarkable fact in the general setting of Bregman manifolds. Indeed, since
D J [ p λ 1 , p λ 2 ] = D KL [ p λ 1 : p λ 2 ] + D KL [ p λ 2 : p λ 1 ] ,
and D KL [ p λ 1 : p λ 2 ] = B F ( θ ( λ 2 ) : θ ( λ 1 ) ) , where B F denotes the Bregman divergence induced by the cumulant function of the multivariate normals and θ ( λ ) is the natural parameter corresponding to λ , we have
D J [ p λ 1 , p λ 2 ] = B F ( θ 1 : θ 2 ) + B F ( θ 2 : θ 1 ) , = S F ( θ 1 ; θ 2 ) = ( θ 2 θ 1 ) ( η 2 η 1 ) = S F * ( η 1 ; η 2 ) ,
where η = F ( θ ) and θ = F * ( η ) denote the dual parameterizations obtained by the Legendre–Fenchel convex conjugate F * ( η ) of F ( θ ) . Moreover, we have F * ( η ) = h ( p μ , Σ ) [1], i.e., the convex conjugate function is Shannon negentropy.
Then we conclude using the fact that S F ( θ 1 ; θ 2 ) = 0 1 d s 2 ( γ ( t ) ) d t = 0 1 d s 2 ( γ * ( t ) ) d t , i.e., the symmetrized Bregman divergence amounts to integral energies on dual geodesics on a Bregman manifold. The proof of this general property is reported in Appendix E. □
It follows the following upper bound on the Fisher–Rao distance:
Property 4 
(Fisher–Rao upper bound). The Fisher–Rao distance between normal distributions is upper bounded by the square root of the Jeffreys divergence: ρ N ( N 1 , N 2 ) D J ( N 1 , N 2 ) .
Proof. 
Consider the Cauchy–Schwarz inequality for positive functions f ( t ) and g ( t ) : 0 1 f ( t ) g ( t ) d t ( 0 1 f ( t ) 2 d t ) ( 0 1 g ( t ) 2 d t ) ), and let f ( t ) = d s N ( γ N c ( p λ 1 , p λ 2 ; t ) and g ( t ) = 1 . Then we obtain:
0 1 d s N ( γ N c ( p λ 1 , p λ 2 ; t ) d t 2 0 1 d s N 2 ( γ N c ( p λ 1 , p λ 2 ; t ) d t 0 1 1 2 d t = 1 .
Furthermore, since by definition of γ N FR , we have
0 1 d s N ( γ N c ( p λ 1 , p λ 2 ; t ) d t 0 1 d s N ( γ N FR ( p λ 1 , p λ 2 ; t ) d t = : ρ N ( N 1 , N 2 ) .
It follows for c = γ N e (i.e., e-geodesic) using Property 3 that we have:
ρ N ( N 1 , N 2 ) 2 0 1 d s N 2 ( γ N e ( p λ 1 , p λ 2 ; t ) d t = D J ( N 1 , N 2 ) .
Thus, we conclude that ρ N ( N 1 , N 2 ) D J ( N 1 , N 2 ) .
Please note that in Riemannian geometry, a curve γ minimizes the energy E ( γ ) = 0 1 γ ˙ ( t ) 2 d t if it minimizes the length L ( γ ) = 0 1 γ ˙ ( t ) d t and γ ˙ ( t ) is constant. Using Cauchy-Schwartz inequality, we can show that L ( γ ) E ( γ ) . □
This upper bound is tight at infinitesimal scale (i.e., when N 2 = N 1 + d N ) since ρ N ( N 1 , N 2 ) d s N ( N 1 ) 2 I f [ N 1 : N 2 ] f ( 1 ) and the f-divergence in right-hand side of the identity can be chosen as Jeffreys divergence. To appreciate the quality of the square root of Jeffreys divergence upper bound of Property 4, consider the case where N 1 , N 2 M Σ . In that case, we have ρ N ( N ( μ 1 , Σ ) , N ( μ 2 , Σ ) ) = 2 arccosh ( 1 + 1 4 Δ Σ 2 ( μ 1 , μ 2 ) ) and D J [ N ( μ 1 , Σ ) , N ( μ 2 , Σ ) ] = Δ Σ ( μ 1 , μ 2 ) (since D KL [ N ( μ 1 , Σ ) , N ( μ 2 , Σ ) ] = 1 2 Δ Σ 2 ( μ 1 , μ 2 ) ). The upper bound can thus be checked since we have 2 arccosh ( 1 + 1 4 x 2 ) x for x 0 . The plots of Figure 5 shows visually the quality of the D J upper bound.
For any smooth curve c ( t ) , we can thus approximate ρ N for large T by
ρ ˜ N c ( N 1 , N 2 ) : = 1 T i = 1 T 1 D J c i T , c i + 1 T .
For example, we may consider the following curves on M N which admit closed-form parameterizations in t [ 0 , 1 ] :
  • linear interpolation (LERP, Linear intERPolation) c λ ( t ) = t ( μ 1 , Σ 1 ) + ( 1 t ) ( μ 2 , Σ 2 ) between ( μ 1 , Σ 1 ) and ( μ 2 , Σ 2 ) ,
  • the mixture geodesic [80] c m ( t ) = γ N m ( N 1 , N 2 ; t ) = ( μ t m , Σ t m ) with μ t m = μ ¯ t and Σ t m = Σ ¯ t + t μ 1 μ 1 + ( 1 t ) μ 2 μ 2 μ ¯ t μ ¯ t where μ ¯ t = t μ 1 + ( 1 t ) μ 2 and Σ ¯ t = t Σ 1 + ( 1 t ) Σ 2 ,
  • the exponential geodesic [80] c e ( t ) = γ N e ( N 1 , N 2 ; t ) = ( μ t e , Σ t e ) with μ t e = Σ ¯ t H ( t Σ 1 1 μ 1 + ( 1 t ) Σ 2 1 μ 2 ) and Σ t e = Σ ¯ t H where Σ ¯ t H = ( t Σ 1 1 + ( 1 t ) Σ 2 1 ) 1 is the matrix harmonic mean,
  • the curve c e m ( t ) = 1 2 γ N e ( N 1 , N 2 ; t ) + γ N m ( N 1 , N 2 ; t ) which is obtained by averaging the mixture geodesic with the exponential geodesic.
Figure 6 visualizes the exponential and mixture geodesics between two bivariate normal distributions.
Let us denote by ρ ˜ N λ = ρ ˜ N c λ , ρ ˜ N m = ρ ˜ N c m , ρ ˜ N e = ρ ˜ N c e and ρ ˜ N e m = ρ ˜ N c e m the approximations obtained by these curves following from Equation (27). When T is sufficiently large, the approximated distances ρ ˜ x are close to the length of curve x, and we may thus consider a set of several curves { c i } i I and report the smallest Fisher–Rao distance approximations obtained among these curves: ρ N ( N 1 , N 2 ) min i I ρ ˜ N c i ( N 1 , N 2 ) .
Please note that we consider the regular spacing for approximating a curve length and do not optimize the position of the sample points on the curve. Indeed, as T , the curve length approximation tends to the Riemannian curve length. In other words, we can measure approximately finely the length of any curve available with closed-form reparameterization by increasing T. Thus, the key question of our method is how to best approximate the Fisher–Rao geodesic by a curve that can be parametrized by a closed-form formula and is close enough to the Fisher–Rao geodesic.
Next, we introduce our approximation curve c CO ( t ) derived from Calvo and Oller isometric mapping f which experimentally behaves better when normals are not too far from each other.

3.2. A Curve Derived from Calvo and Oller’s Embedding

This approximation consists of leveraging the closed-form expression of the SPD geodesics [63,66]:
γ P ( P , Q ; t ) = P 1 2 P 1 2 Q 1 2 P 1 2 t P 1 2 , t [ 0 , 1 ]
to approximate the Fisher–Rao normal geodesic γ N Fisher ( N 1 , N 2 ; t ) as follows: Let P ¯ 1 = f ( N 1 ) , P ¯ 2 = f ( N 2 ) N ¯ , and consider the smooth curve
c ¯ CO ( P ¯ 1 , P ¯ 2 ; t ) = proj N ¯ γ P ( P ¯ 1 , P ¯ 2 ; t ) ,
where proj N ¯ ( P ) denotes the orthogonal projection of P P ( d + 1 ) onto N ¯ (Figure 7). Thus, curve c CO ( t ) ( t [ 0 , 1 ] ) is then defined by taking the inverse mapping f 1 ( c ¯ CO ) (Figure 8):
c CO ( t ) = f 1 proj N ¯ γ P ( P ¯ 1 , P ¯ 2 ; t ) .
Please note that the matrix power P t can be computed as P t = U diag ( λ 1 t , , λ d t ) V where P = U diag ( λ 1 t , , λ d t ) V is the eigenvalue decomposition of P.
Let us now explain how to project P = [ P i , j ] P ( d + 1 ) onto N ¯ based on the analysis of the Appendix of [19] (p. 239):
Proposition 2 
(Projection of an SPD matrix onto the embedded normal submanifold N ¯ ). Let β = P d + 1 , d + 1 and write P = Σ + β μ μ β μ β μ β . Then the orthogonal projection at P P onto N ¯ is:
P ¯ : = proj N ¯ ( P ) = Σ + μ μ μ μ 1 ,
and the SPD distance between P and P ¯ is
ρ P ( P , P ¯ ) = 1 2 | log β | .
Notice that the projection of P is easily computed since β = P d + 1 , d + 1 .
proj N ¯ Σ + β μ μ β μ β μ β = Σ + μ μ μ μ 1
Remark 4. 
In Diffusion Tensor Imaging [39] (DTI), the Fisher–Rao distance can be used to evaluate the distance between three-dimensional normal distributions with means located at a 3D grid position. We may consider 3 × 3 × 3 1 = 26 neighbor graphs induced by the grid, and for each normal N of the grid, calculate the approximations of the Fisher–Rao distance of N with its neighbors N as depicted in Figure 9. Then the distance between two tensors N 1 and N 2 of the 3D grid is calculated as the shortest path on the weighted graph using Dijkstra’s algorithm [39].
Please note that the Fisher–Rao projection of N 1 = ( μ 1 , Σ 1 ) onto a submanifold M μ 2 with fixed mean μ 2 was recently reported in closed form in [72] (Equation (21)):
N * = N μ 2 , Σ 1 + 1 2 ( μ 2 μ 1 ) ( μ 2 μ 1 ) ,
with
ρ N ( N 1 , N * ) = 1 2 arccosh d + ( μ 2 μ 1 ) Σ 1 1 ( μ 2 μ 1 ) ,
and the Fisher–Rao projection of N 1 = ( μ 1 , Σ 1 ) onto submanifold M Σ 2 is the “vertical projection” N * = ( μ 1 , Σ 2 ) (Figure 10) with
ρ N ( N 1 , N * ) = ρ N μ ( Σ 1 , Σ 2 ) .
We can upper bound the Fisher–Rao distance ρ N ( ( μ 1 , Σ 1 ) , ( μ 2 , Σ 2 ) ) by projecting Σ 1 onto M μ 2 and projecting Σ 2 onto M μ 1 . Let Σ 12 M μ 2 and Σ 21 M μ 1 denote those Fisher–Rao orthogonal projections. Using the triangular inequality property of the Fisher–Rao distance, we obtain the following upper bounds:
ρ N ( ( μ 1 , Σ 1 ) , ( μ 2 , Σ 2 ) ) ρ N ( ( μ 1 , Σ 1 ) , ( μ 2 , Σ 12 ) ) + ρ N ( μ 2 , Σ 12 , ( μ 2 , Σ 2 ) ) ,
                                                                                  ρ N ( ( μ 2 , Σ 2 ) , ( μ 1 , Σ 21 ) ) + ρ N ( ( μ 1 , Σ 21 ) , ( μ 1 , Σ 1 ) ) .
See Figure 11 for an illustration.
Let c ¯ CO ( t ) = S ¯ t and c CO ( t ) = f 1 ( c CO ( t ) ) = : G t . The following proposition shows that we have D J [ S ¯ t , S ¯ t + 1 ] = D J [ G t , G t + 1 ] .
Proposition 3. 
The Kullback–Leibler divergence between p μ 1 , Σ 1 and p μ 2 , Σ 2 amounts to the KLD between q P ¯ 1 = p 0 , f ( μ 1 , Σ 1 ) and q P ¯ 2 = p 0 , f ( μ 2 , Σ 2 ) where P ¯ i = f ( μ i , Σ i ) :
D KL [ p μ 1 , Σ 1 : p μ 2 , Σ 2 ] = D KL [ q P ¯ 1 : q P ¯ 2 ] .
The KLD between two centered ( d + 1 ) -variate normals q P 1 = p 0 , P 1 and q P 2 = p 0 , P 2 is
D KL [ q P 1 : q P 2 ] = 1 2 tr ( P 2 1 P 1 ) d 1 + log | P 2 | | P 1 | .
This divergence can be interpreted as the matrix version of the Itakura–Saito divergence [81]. The SPD cone equipped with 1 2 of the trace metric can be interpreted as Fisher–Rao centered normal manifolds: ( N μ , g N μ Fisher ) = ( P , 1 2 g trace ) .
Since the determinant of a block matrix is
A B C D = A B D 1 C ,
we obtain with D = 1 : | f ( μ , Σ ) | = | Σ + μ μ μ μ | = | Σ | .
Let P ¯ 1 = f ( μ 1 , Σ 1 ) and P ¯ 2 = f ( μ 2 , Σ 2 ) . Checking D KL [ p μ 1 , Σ 1 : p μ 2 , Σ 2 ] = D KL [ q P ¯ 1 : q P ¯ 2 ] where q P ¯ = p 0 , P ¯ amounts to verify that
tr ( P ¯ 2 1 P ¯ 1 ) = 1 + tr ( Σ 2 1 Σ 1 + Δ μ Σ 2 1 Δ μ ) .
Indeed, using the inverse matrix
f ( μ , Σ ) 1 = Σ 1 Σ 1 μ μ Σ 1 1 + μ Σ 1 μ ,
we have
tr ( P ¯ 2 1 P ¯ 1 ) = tr Σ 2 1 Σ 2 1 μ 2 μ 2 Σ 2 1 1 + μ 2 Σ 2 1 μ 2 Σ 1 + μ 1 μ 1 μ 1 μ 1 1 , = 1 + tr ( Σ 2 1 Σ 1 + Δ μ Σ 2 1 Δ μ ) .
Thus, even if the dimension of the sample spaces of p μ , Σ and q P ¯ = f ( μ , Σ ) differs by one, we obtain the same KLD by Calvo and Oller’s isometric mapping f.
This property holds for the KLD/Jeffreys divergence D J but not for all f-divergences [1] I f in general (e.g., it fails for the Hellinger divergence).
Figure 12 shows the various geodesics and curves used to approximate the Fisher–Rao distance with the Fisher metric shown using Tissot indicatrices.
Please note that the introduction of parameter β is related to the foliation of the SPD cone P by { f β ( N ) : β > 0 } : P ( d + 1 ) = R > 0 × f β ( N ) . See Figure 7. Thus, we may define how good the projected C&O curve is to the Fisher–Rao geodesic by measuring the average distance between points on γ P ( P ¯ 1 , P ¯ 2 ; t ) and their projections γ P ( P ¯ 1 , P ¯ 2 ; t ) ¯ onto N ¯ :
δ CO ( N 1 , N 2 ) = δ CO ( P ¯ 1 , P ¯ 2 ) = 0 1 ρ P ( γ P ( P ¯ 1 , P ¯ 2 ; t ) , γ P ( P ¯ 1 , P ¯ 2 ; t ) ¯ ) d t .
In practice, we evaluate this integral at the sampling points S t :
δ CO ( P 1 , P 2 ) δ T CO ( P 1 , P 2 ) : = 1 T i = 1 T ρ P ( S t , S ¯ t ) ,
where S t = γ P ( P ¯ 1 , P ¯ 2 ; t ) and S ¯ t = γ P ( P ¯ 1 , P ¯ 2 ; t ) . We checked experimentally (see Section 3.3) that for close by normals N 1 and N 1 , we have δ CO ( N ¯ 1 , N ¯ 2 ) small, and that when N 1 becomes further separated from N 2 , the average projection error δ CO ( N ¯ 1 , N ¯ 2 ) increases. Thus, δ T CO ( P 1 , P 2 ) is a good measure of the precision of our Fisher–Rao distance approximation.
Lemma 1. 
We have ρ N ¯ ( S ¯ t , S ¯ t + 1 ) ρ P ( S ¯ t , S t ) + ρ P ( S t , S t + 1 ) + ρ P ( S t + 1 , S ¯ t + 1 ) .
Proof. 
The proof consists of applying twice the triangle inequality of metric distance ρ P :
ρ N ¯ ( S ¯ t , S ¯ t + 1 ) ρ P ( S ¯ t , S t + 1 ) + ρ P ( S t + 1 , S ¯ t + 1 ) , ρ P ( S ¯ t , S t ) + ρ P ( S t , S t + 1 ) + ρ P ( S t + 1 , S ¯ t + 1 ) .
See Figure 13 where the left-hand-side geodesic length is shown in blue and the right-hand-side upper bound is visualized in red. □
Property 5. 
We have ρ N ( N 1 , N 2 ) ρ N CO ( N 1 , N 2 ) ρ N ( N 1 , N 2 ) + 2 δ T CO ( P ¯ 1 , P ¯ 2 ) .
Proof. 
At infinitesimal scale when S t + 1 S t , using Lemma 1 and ρ P ( S t + 1 , S ¯ t + 1 ) ρ P ( S ¯ t , S t ) we have
d s N ( S ¯ t ) d s P ( S t ) + 2 ρ P ( S t , S ¯ t ) .
Taking the integral along the curve c CO ( t ) = γ CO ( P ¯ 1 , P ¯ 2 ; t ) ¯ , we obtain
ρ N CO ( N 1 , N 2 ) ρ P ( P ¯ 1 , P ¯ 2 ) + 2 δ T CO ( P ¯ 1 , P ¯ 2 )
Since ρ P ( P ¯ 1 , P ¯ 2 ) ρ N ( N 1 , N 2 ) , we have
ρ N ( N 1 , N 2 ) ρ N CO ( N 1 , N 2 ) ρ N ( N 1 , N 2 ) + 2 δ T CO ( P ¯ 1 , P ¯ 2 ) .
Notice that i = 0 T 1 ρ P ( S t , S t + 1 ) = ρ P ( P ¯ 1 , P ¯ 2 ) .
Example 1. 
Let us consider Example 1 of [42] (p. 11):
N 1 = 1 0 , Σ , N 2 = 6 3 , Σ , Σ = 1.1 0.9 0.9 1.1 .
The Fisher–Rao distance is evaluated numerically in [42] as 5.00648 . We have the lower bound ρ N CO ( N 1 , N 2 ) = 4.20447 , and the Mahalanobis distance 8.06226 upper bounds the Fisher–Rao distance (not totally geodesic submanifold N Σ ). Our projected C&O curve discretized with T = 1000 yields an approximation ρ ˜ N CO ( N 1 , N 2 ) = 5.31667 . The average projection distance ρ P ( S t , S ¯ t ) is δ T CO ( N 1 , N 2 ) = 0.61791 , and the maximum projected distance is 1.00685 . We check that
5.00648 ρ N ( N 1 , N 2 ) ρ ˜ N CO ( N 1 , N 2 ) 5.31667 ρ N ( N 1 , N 2 ) + 2 δ T CO ( P ¯ 1 , P ¯ 2 ) 5.44028 .
The Killing distance [82] obtained for κ Killing = 2 is ρ Killing ( N 1 , N 2 ) 6.82028 (see Appendix C). Notice that geodesic shooting is time-consuming compared to our approximation technique.

3.3. Some Experiments

The KLD D KL and Jeffreys divergence D J , the Fisher–Rao distance ρ N and the Calvo and Oller distance ρ CO are all invariant under the congruence action of the affine group Aff ( d ) = R d GL ( d ) with the group operation
( a 1 , A 1 ) ( a 2 , A 2 ) = ( a 1 + A 1 a 2 , A 1 A 2 ) .
Let ( A , a ) Aff ( d ) , and define the action on the normal space N as follows:
( A , a ) . N ( μ , Σ ) = N ( A μ + a , A Σ A ) .
Then we have:
ρ N ( ( A , a ) . N 1 , ( A , a ) . N 2 ) = ρ N ( N 1 , N 2 ) , ρ CO ( ( A , a ) . N 1 , ( A , a ) . N 2 ) = ρ CO ( N 1 , N 2 ) , D KL [ ( A , a ) . N 1 : ( A , a ) . N 2 ] = D KL [ N 1 : N 2 ] .
This invariance extends to our approximations ρ ˜ N c (see Equation (27)).
Since we have
ρ ˜ N c ( N 1 , N 2 ) ρ N ( N 1 , N 2 ) ρ CO ( N 1 , N 2 ) ,
the ratio κ c = ρ ˜ N c ρ CO κ = ρ ˜ N c ρ N gives an upper bound on the approximation factor of ρ ˜ N c compared to the true Fisher–Rao distance ρ N :
κ c ρ N ( N 1 , N 2 ) κ ρ N ( N 1 , N 2 ) ρ ˜ N c ( N 1 , N 2 ) ρ N ( N 1 , N 2 ) ρ CO ( N 1 , N 2 ) .
Let us now report some numerical experiments of our approximated Fisher–Rao distances ρ ˜ N x with x { l , m , e , em , CO } . Although that dissimilarity ρ ˜ N is positive–definite, it does not satisfy the triangular inequality of metric distances (e.g., Riemannian distances ρ N and ρ CO ).
First, we draw multivariate normals by sampling means μ Unif ( 0 , 1 ) and sample covariance matrices Σ as follows: We draw a lower triangular matrix L with entries L i j iid sampled from Unif ( 0 , 1 ) , and take Σ = L L . We use T = 1000 samples on curves and repeat the experiment 1000 times to gather average statistics on κ c ’s of curves. Results are summarized in Table 1.
For that scenario that the C&O curve (either c ¯ CO N ¯ or c CO N ) performs best compared to the linear interpolation curves with respect to source parameter (l), mixture geodesic (m), exponential geodesic (e), or exponential-mixture mid-curve ( em ). Let us point out that we sample γ P ( P ¯ 1 , P ¯ 2 ; i T ) for i { 0 , , T } .
Strapasson, Porto, and Costa [38] (SPC) reported the following upper bound on the Fisher–Rao distance between multivariate normals:
ρ_CO(N_1, N_2) ≤ ρ_N(N_1, N_2) ≤ U_SPC(N_1, N_2),
with
U_SPC(N_1, N_2) = √2 √( ∑_{i=1}^{d} log² [ ( √((1 + D_ii)² + μ′_i²) + √((1 − D_ii)² + μ′_i²) ) / ( √((1 + D_ii)² + μ′_i²) − √((1 − D_ii)² + μ′_i²) ) ] ),
where Σ′ = Σ_1^{−1/2} Σ_2 Σ_1^{−1/2}, Σ′ = Ω D Ω⊤ is its eigendecomposition, and μ′ = Ω⊤ Σ_1^{−1/2}(μ_2 − μ_1). This upper bound performs better when the normals are well-separated, and worse than the D_J upper bound when the normals are close to each other.
Let us compare ρ_CO(N_1, N_2), the approximation ρ̃_N^CO(N_1, N_2), and the upper bound U_SPC(N_1, N_2) by averaging over 1000 trials with N_1 and N_2 chosen randomly as before and T = 1000. We have ρ_CO(N_1, N_2) ≤ ρ_N(N_1, N_2) ≤ ρ̃_N^CO(N_1, N_2) ≤ U_SPC(N_1, N_2). Table 2 shows that our Fisher–Rao approximation is close to the lower bound (and hence to the underlying true Fisher–Rao distance), and that the upper bound is about twice the lower bound for that particular scenario.
Second, since the distances are invariant under the action of the affine group, we can set without loss of generality N_1 = N(0, I) (the standard normal distribution) and let N_2 have covariance matrix Σ_2 = diag(u_1, …, u_d), where u_i ∼ Unif(0, a). As the normals N_1 and N_2 separate from each other, we notice experimentally that the performance of the c_CO curve degrades in this second experiment with a = 5 (see Table 3): indeed, the mixture geodesic works experimentally better than the C&O curve when d ≥ 11.
Figure 14 displays the various curves considered for approximating the Fisher–Rao distance between bivariate normal distributions: for a curve c(t), we visualize its corresponding bivariate normal distributions (μ_c(t), Σ_c(t)) at several increment steps t ∈ [0, 1] by plotting the ellipse
E_c(t) = { μ_c(t) + L_c(t) x : x = (cos θ, sin θ)⊤, θ ∈ [0, 2π) },
where Σ_c(t) = L_c(t) L_c(t)⊤.
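A small sketch of this ellipse tracing, using a Cholesky factor for L_c(t) (any square root of Σ_c(t) would do):

```python
import numpy as np

def ellipse_boundary(mu, Sigma, n=64):
    # Boundary points mu + L x with x = (cos theta, sin theta)^T, Sigma = L L^T.
    L = np.linalg.cholesky(Sigma)
    theta = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    return (mu[:, None] + L @ np.vstack([np.cos(theta), np.sin(theta)])).T
```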
Example 2. 
Let us report some numerical results for bivariate normals with T = 1000 :
  • We use the following example of Han and Park [39] (Equation (26)):
    N_1 = N( (0, 0)⊤, [ 1, 0 ; 0, 0.1 ] ), N_2 = N( (1, 1)⊤, [ 0.1, 0 ; 0, 1 ] ).
    Their geodesic shooting algorithm [39] evaluates the Fisher–Rao distance to ρ_N(N_1, N_2) ≈ 3.1329 (precision 10^{−5}).
    We obtain:
    Calvo and Oller lower bound: ρ_CO(N_1, N_2) ≈ 3.0470,
    upper bound using Equation (15): ≈ 7.92179,
    SPC upper bound (Equation (35)): U_SPC(N_1, N_2) ≈ 5.4302,
    D_J upper bound: U_J(N_1, N_2) ≈ 4.3704,
    ρ̃_N^λ(N_1, N_2) ≈ 3.4496,
    ρ̃_N^m(N_1, N_2) ≈ 3.5775,
    ρ̃_N^e(N_1, N_2) ≈ 3.7314,
    ρ̃_N^em(N_1, N_2) ≈ 3.1672,
    ρ̃_N^CO(N_1, N_2) ≈ 3.1391.
    In that setting, the D_J upper bound is better than the upper bound of Equation (35), and the projected Calvo and Oller geodesic yields the best approximation of the Fisher–Rao distance (Figure 15), with an absolute error of 0.0062 (about 0.2% relative error). When T = 10, we have ρ̃_N^CO(N_1, N_2) ≈ 3.1530; when T = 100, we obtain ρ̃_N^CO(N_1, N_2) ≈ 3.1136; and when T = 500, we obtain ρ̃_N^CO(N_1, N_2) ≈ 3.1362 (which is better than the approximation obtained for T = 1000). Figure 16 shows the fluctuations of the approximation of the Fisher–Rao distance by the projected C&O curve when T ranges from 3 to 100.
  • Bivariate normal N_1 = N(0, I) and bivariate normal N_2 = N(μ_2, Σ_2) with μ_2 = (1, 0)⊤ and Σ_2 = [ 1, 1 ; 1, 2 ]. We obtain:
    Calvo and Oller lower bound: 1.4498,
    upper bound of Equation (35): 2.6072,
    D_J upper bound: 1.5811,
    ρ̃_N^λ ≈ 1.5068,
    ρ̃_N^m ≈ 1.5320,
    ρ̃_N^e ≈ 1.5456,
    ρ̃_N^em ≈ 1.4681,
    ρ̃_N^CO ≈ 1.4673.
  • Bivariate normal N_1 = N(0, I) and bivariate normal N_2 = N(μ_2, Σ_2) with μ_2 = (5, 0)⊤ and Σ_2 = [ 1, 1 ; 1, 2 ]. We obtain:
    Calvo and Oller lower bound: 3.6852,
    upper bound of Equation (35): 6.0392,
    D_J upper bound: 6.2048,
    ρ̃_N^λ ≈ 5.7319,
    ρ̃_N^m ≈ 4.4039,
    ρ̃_N^e ≈ 5.9205,
    ρ̃_N^em ≈ 4.2901,
    ρ̃_N^CO ≈ 4.3786.
See Supplementary Materials for further experiments.

4. Approximating the Smallest Enclosing Fisher–Rao Ball of MVNs

We may use the closed-form distance ρ_CO(N, N′) between MVNs N and N′ to compute an approximation (of the center) of the smallest enclosing Fisher–Rao ball B* = ball(C*, r*) of a set G = {N_1 = (μ_1, Σ_1), …, N_n = (μ_n, Σ_n)} of n d-variate normal distributions:
C* = arg min_{C ∈ N} max_{i ∈ {1, …, n}} ρ_N(C, N_i),
where ball(C, r) = {N ∈ N : ρ_N(C, N) ≤ r}.
The method proceeds as follows:
  • First, we convert the MVN set G into the equivalent set of (d+1) × (d+1) SPD matrices Ḡ = {P̄_i = f(N_i)} using the C&O embedding. We relax the problem of approximating the circumcenter C* of the smallest enclosing Fisher–Rao ball to
    P* = arg min_{P ∈ P(d+1)} max_{i ∈ {1, …, n}} ρ_P(P, P̄_i).
  • Second, we approximate the center of the smallest enclosing Riemannian ball of Ḡ using the iterative smallest enclosing Riemannian ball algorithm of [66] with, say, T = 1000 iterations. Let P_T ∈ P(d+1) denote this approximated center: P_T = RieSEB_SPD(Ḡ, T).
  • Finally, we project back P_T onto N̄: P̄_T = proj_N̄(P_T). We return P̄_T as the approximation of C*.
Algorithm RieSEB_SPD({P_1, …, P_n}, T) [66] proceeds for a set of SPD matrices {P_1, …, P_n} as follows:
  • Let C_1 ← P_1
  • For t = 1 to T
    Compute the index of the SPD matrix which is farthest from the current circumcenter C_t:
    f_t = arg max_{i ∈ {1, …, n}} ρ_SPD(C_t, P_i)
    Update the circumcenter by walking along the geodesic linking C_t to P_{f_t}:
    C_{t+1} = γ_SPD(C_t, P_{f_t}; 1/(t+1)) = C_t^{1/2} ( C_t^{−1/2} P_{f_t} C_t^{−1/2} )^{1/(t+1)} C_t^{1/2}
  • Return C_T
The convergence of the algorithm RieSEB_SPD follows from the fact that the SPD trace manifold is a Hadamard manifold (of non-positive sectional curvature). See [66] for the proof of convergence.
The SPD distance ρ_P(P_T, P̄_T) between the circumcenter and its projection indicates the quality of the approximation. Figure 17 shows the result of implementing this heuristic.
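A compact sketch of RieSEB_SPD under the same conventions as the earlier sketches (function names ours; the farthest-point query is a linear scan over the input matrices):

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def rho_spd(P, Q):
    # Riemannian trace distance on the SPD cone (1/2-convention, as before).
    lam = np.linalg.eigvals(np.linalg.solve(P, Q)).real
    return np.sqrt(0.5 * np.sum(np.log(lam) ** 2))

def rieseb_spd(Ps, T=1000):
    # Iterative 1-center heuristic [66]: move a 1/(t+1) fraction along the
    # SPD geodesic towards the current farthest matrix.
    C = Ps[0]
    for t in range(1, T + 1):
        far = max(Ps, key=lambda P: rho_spd(C, P))
        Ch = fractional_matrix_power(C, 0.5)
        Cih = fractional_matrix_power(C, -0.5)
        C = Ch @ fractional_matrix_power(Cih @ far @ Cih, 1.0 / (t + 1)) @ Ch
    return C
```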
Let us notice that when all MVNs share the same covariance matrix Σ, we have from Equation (18) or Equation (23) that ρ_N(N(μ_1, Σ), N(μ_2, Σ)) and ρ_CO(N(μ_1, Σ), N(μ_2, Σ)) are strictly increasing functions of their Mahalanobis distance. Using the Cholesky decomposition Σ^{−1} = LL⊤, we deduce that the smallest Fisher–Rao enclosing ball coincides with the smallest Calvo and Oller enclosing ball, and the circumcenter of that ball can be found as an ordinary Euclidean circumcenter [83] (Figure 17b). Please note that in 1D, we can find the exact smallest enclosing Fisher–Rao ball as an equivalent smallest enclosing ball in hyperbolic geometry.
Furthermore, we may extend the computation of the approximated circumcenter to k-center clustering [84] of n multivariate normal distributions. Since the circumcenters of the clusters are approximated and not exact, we straightforwardly extend the variational approach of k-means described in [85] to k-center clustering. An application of k-center clustering of MVNs is the simplification of Gaussian mixture models [42] (GMMs).
Similarly, we can consider other Riemannian distances with closed-form formulas between MVNs such as the Killing distance in the symmetric space [82] (see Appendix C) or the Siegel-based distance proposed in Appendix D.

5. Some Information–Geometric Properties of the C&O Embedding

In information geometry [1], the manifold N admits a dual structure denoted by the quadruple
(N, g_N^Fisher, ∇_N^e, ∇_N^m),
when equipped with the exponential connection ∇_N^e and the mixture connection ∇_N^m. The connections ∇_N^e and ∇_N^m are said to be dual since (∇_N^e + ∇_N^m)/2 = ∇̄_N, the Levi–Civita connection induced by g_N^Fisher. Furthermore, by viewing N as an exponential family {p_θ} with natural parameter θ = (θ_v, θ_M) (using the sufficient statistics [80] (x, −xx⊤)), and taking the convex log-normalizer function F_N(θ) of the normals, we can build a dually flat space [1] where the canonical divergence amounts to a Bregman divergence which coincides with the reverse Kullback–Leibler divergence [30,86] (KLD). The Legendre duality
F*(η) = ⟨θ, η⟩ − F(θ), η = ∇F(θ)
(with ⟨(v_1, M_1), (v_2, M_2)⟩ = v_1⊤v_2 + tr(M_1⊤M_2)) yields θ = (θ_v, θ_M) = ( Σ^{−1}μ, ½Σ^{−1} ),
F_N(θ) = ½ ( d log π − log |θ_M| + ½ θ_v⊤ θ_M^{−1} θ_v ),
η = (η_v, η_M) = ∇F_N(θ) = ( ½ θ_M^{−1} θ_v, −½ θ_M^{−1} − ¼ (θ_M^{−1} θ_v)(θ_M^{−1} θ_v)⊤ ) = ( μ, −(Σ + μμ⊤) ),
F_N^*(η) = −½ ( log(1 + η_v⊤ η_M^{−1} η_v) + log |−η_M| + d log(2πe) ),
and we have
B_{F_N}(θ_1 : θ_2) = D_KL^*(p_{λ_1} : p_{λ_2}) = D_KL(p_{λ_2} : p_{λ_1}) = B_{F_N^*}(η_2 : η_1),
where D KL * [ p : q ] = D KL [ q : p ] is the reverse KLD.
In a dually flat space, we can express the canonical divergence as a Fenchel–Young divergence using the mixed coordinate systems: B_{F_N}(θ_1 : θ_2) = Y_{F_N}(θ_1 : η_2) where η_i = ∇F_N(θ_i) and
Y_{F_N}(θ_1 : η_2) := F_N(θ_1) + F_N^*(η_2) − ⟨θ_1, η_2⟩.
The moment η-parameterization of a normal is (η = μ, H = −(Σ + μμ⊤)), with its reciprocal function (λ = η, Λ = −H − ηη⊤).
Let F_P(P) = F_N(0, P), θ̄ = ½ P̄^{−1}, and η̄ = ∇F_P(θ̄). Then we have the following proposition, which proves that the Fenchel–Young divergences in N and N̄ (as a submanifold of P) coincide:
Proposition 4. 
We have
D_KL[p_{μ_1, Σ_1} : p_{μ_2, Σ_2}] = B_{F_N}(θ_2 : θ_1) = Y_{F_N}(θ_2 : η_1) = Y_{F_P}(θ̄_2 : η̄_1) = B_{F_P}(θ̄_2 : θ̄_1) = D_KL[p_{0, P̄_1 = f(μ_1, Σ_1)} : p_{0, P̄_2 = f(μ_2, Σ_2)}].
Consider now the e -geodesics and m -geodesics on N (linear interpolation with respect to natural and dual moment parameterizations, respectively): γ N e ( N 1 , N 2 ; t ) = ( μ t e , Σ t e ) and γ N m ( N 1 , N 2 ; t ) = ( μ t m , Σ t m ) .
Proposition 5 
(Mixture geodesics preserved). The mixture geodesics are preserved by the embedding f: f ( γ N m ( N 1 , N 2 ; t ) ) = γ P m ( f ( N 1 ) , f ( N 2 ) ; t ) . The exponential geodesics are preserved for the subspace of N with fixed mean μ: N μ .
Proof. 
For the m-geodesics, let us check that
f(μ_t^m, Σ_t^m) = [ Σ_t^m + μ_t^m (μ_t^m)⊤, μ_t^m ; (μ_t^m)⊤, 1 ] = t f(μ_1, Σ_1) + (1−t) f(μ_2, Σ_2) = t P̄_1 + (1−t) P̄_2,
since Σ_t^m + μ_t^m (μ_t^m)⊤ = t (Σ_1 + μ_1 μ_1⊤) + (1−t) (Σ_2 + μ_2 μ_2⊤) by linearity of the mixture geodesic in the moment parameterization. Thus, we have f(γ_N^m(N_1, N_2; t)) = γ_P^m(P̄_1, P̄_2; t). □
Therefore, all algorithms on N which only require m-geodesics or m-projections [1] (obtained by minimizing the right-hand side argument of the KLD) can be implemented by algorithms on P. See, for example, the minimum enclosing ball approximation algorithm called BBC in [87]. Notice that the N̄_μ (fixed-mean normal submanifolds) preserve both mixture and exponential geodesics: the submanifolds N̄_μ are said to be doubly autoparallel [88].
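The preservation of mixture geodesics is easy to check numerically; the following self-contained sketch does so for a random pair of bivariate normals (helper names ours):

```python
import numpy as np

def co_embed(mu, Sigma):
    # Calvo & Oller embedding f(mu, Sigma) into the (d+1) x (d+1) SPD cone.
    d = len(mu)
    P = np.empty((d + 1, d + 1))
    P[:d, :d] = Sigma + np.outer(mu, mu)
    P[:d, d] = P[d, :d] = mu
    P[d, d] = 1.0
    return P

def mixture_point(mu1, S1, mu2, S2, t):
    # Linear interpolation of the moments (mu, Sigma + mu mu^T), matching the
    # t/(1-t) convention used in the proof above.
    m = t * mu1 + (1.0 - t) * mu2
    M = t * (S1 + np.outer(mu1, mu1)) + (1.0 - t) * (S2 + np.outer(mu2, mu2))
    return m, M - np.outer(m, m)

# Numerical check that f(gamma_m(t)) = t f(N1) + (1-t) f(N2):
rng = np.random.default_rng(1)
mu1, mu2 = rng.normal(size=2), rng.normal(size=2)
A, B = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
S1, S2 = A @ A.T + np.eye(2), B @ B.T + np.eye(2)
t = 0.3
m, S = mixture_point(mu1, S1, mu2, S2, t)
assert np.allclose(co_embed(m, S),
                   t * co_embed(mu1, S1) + (1.0 - t) * co_embed(mu2, S2))
```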
Remark 5. 
In [2] (p. 355), exercises 13.8 and 13.9 ask to prove the equivalence of the following statements for S a submanifold of M :
  • S is an exponential family ⇔ S is ∇^{+1}-autoparallel in M (exercise 13.8),
  • S is a mixture family ⇔ S is ∇^{−1}-autoparallel in M (exercise 13.9).
Let P̄ = [ Σ + μμ⊤, μ ; μ⊤, 1 ] (with |P̄| = |Σ|), P̄^{−1} = [ Σ^{−1}, −Σ^{−1}μ ; −μ⊤Σ^{−1}, 1 + μ⊤Σ^{−1}μ ], and y = (x⊤, 1)⊤. Then we have
q_{P̄}(y) = (2π)^{−(d+1)/2} |P̄|^{−1/2} exp( −½ y⊤ P̄^{−1} y ) = (2π)^{−(d+1)/2} |Σ|^{−1/2} exp( −½ [x⊤ 1] [ Σ^{−1}, −Σ^{−1}μ ; −μ⊤Σ^{−1}, 1 + μ⊤Σ^{−1}μ ] [x⊤ 1]⊤ ).
Thus, N̄ = {q_{P̄}(x, 1)} is an exponential family. Therefore, we deduce that N̄ is ∇^e-autoparallel in P. However, N̄ is not a mixture family, and thus N̄ is not ∇^m-autoparallel in P.

6. Conclusions and Discussion

In general, the Fisher–Rao distance between multivariate normals (MVNs) is not known in closed form. In practice, the Fisher–Rao distance is usually approximated by costly geodesic shooting techniques [39,40,41], which require time-consuming computations of the Riemannian exponential map and are nevertheless limited to normals within a short range of each other. In this work, we considered a simple alternative approach for approximating the Fisher–Rao distance by approximating the Riemannian lengths of curves which admit closed-form parameterizations. In particular, we considered the mixed exponential-mixture curve and the projected symmetric positive–definite matrix geodesic obtained from Calvo and Oller's isometric submanifold embedding into the SPD cone [19]. We summarize our method to approximate ρ_N(N_1, N_2) between N_1 = N(μ_1, Σ_1) and N_2 = N(μ_2, Σ_2) as follows:
ρ̃_T^CO(N_1, N_2) := ∑_{t=0}^{T−1} √( D_J[S̄_t, S̄_{t+1}] ),
where
S̄_t = proj_N̄(S_t), with proj_N̄( [ Σ + βμμ⊤, βμ ; βμ⊤, β ] ) = [ Σ + μμ⊤, μ ; μ⊤, 1 ],
and
S_t = P̄_1^{1/2} ( P̄_1^{−1/2} P̄_2 P̄_1^{−1/2} )^{t/T} P̄_1^{1/2},
with
P̄_1 = f(N_1) = [ Σ_1 + μ_1μ_1⊤, μ_1 ; μ_1⊤, 1 ], P̄_2 = f(N_2) = [ Σ_2 + μ_2μ_2⊤, μ_2 ; μ_2⊤, 1 ].
We proved the following sandwich bounds for our approximation:
ρ_N(N_1, N_2) ≤ ρ̃_T^CO(N_1, N_2) ≤ ρ_N(N_1, N_2) + 2 δ_T^CO(P̄_1, P̄_2),
where
δ_T^CO(P̄_1, P̄_2) := (1/T) ∑_{t=1}^{T} ρ_P(S_t, S̄_t).
Notice that we may equivalently calculate D_J[S̄_t, S̄_{t+1}] as D_J[G_t, G_{t+1}], where G_t = f^{−1}(S̄_t) = N(m_t, C_t) for t ∈ {0, …, T} (see Proposition 3).
We also reported a fast way to upper bound the Fisher–Rao distance by the square root of Jeffreys' divergence, ρ_N(N_1, N_2) ≤ √(D_J[N_1, N_2]), which is tight at infinitesimal scale. In practice, this upper bound beats the upper bound of [38] when the normal distributions are not too far from each other. Finally, we showed that Calvo and Oller's SPD submanifold embedding [19] is not only isometric, but also preserves the Kullback–Leibler divergence, the Fenchel–Young divergence, and the mixture geodesics. Our approximation technique extends to elliptical distributions, which generalize multivariate normal distributions [32,55]. Moreover, we obtained a closed form for the Fisher–Rao distance between normals sharing the same covariance matrix using the technique of maximal invariants under the action of the affine group in Section 1.5. We may also consider other distances, different from the Fisher–Rao distance, which admit closed-form formulas: for example, the Calvo and Oller metric distance [19] (a lower bound on the Fisher–Rao distance) or the metric distance proposed in [82] (see Appendix C), whose geodesics enjoy the asymptotic property of the Fisher–Rao geodesics [89]. The C&O distance is very well-suited for short Fisher–Rao distances, while the symmetric space distance is well-tailored for large Fisher–Rao distances. The calculations of these closed-form distances rely on generalized eigenvalues. We also propose an embedding of normals into the Siegel upper space in Appendix D. To conclude, let us propose yet another alternative distance, the Hilbert projective distance on the SPD cone [90], which only requires calculating the minimal and maximal eigenvalues (say, using the power iteration method [91]):
ρ_Hilbert(P_1, P_2) = log( λ_max(P_1^{−1}P_2) / λ_min(P_1^{−1}P_2) ).
The dissimilarity is said to be projective on the SPD cone because ρ_Hilbert(P_1, P_2) = 0 if and only if P_1 = λP_2 for some λ > 0. However, let us notice that it yields a proper metric distance on N̄:
ρ_Hilbert(N_1, N_2) := ρ_Hilbert(P̄_1, P̄_2),
since P̄_1 = λP̄_2 holds if and only if λ = 1 because the matrix entries (P̄_1)_{d+1,d+1} = (P̄_2)_{d+1,d+1} = 1; that is, P̄_1 = P̄_2, implying N_1 = N_2 since f is an (isometric) diffeomorphism.
Notice that since λ_max(P) = 1/λ_min(P^{−1}) and λ_min(P) = 1/λ_max(P^{−1}), and since λ_max(P_1P_2) ≤ λ_max(P_1) λ_max(P_2) and λ_min(P_1P_2) ≥ λ_min(P_1) λ_min(P_2), we have the following upper bound on the Hilbert distance: ρ_Hilbert(P_1, P_2) ≤ log( λ_max(P_1)/λ_min(P_1) ) + log( λ_max(P_2)/λ_min(P_2) ).
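A few-line sketch of the Hilbert projective distance; applying it to P̄_1 = f(N_1) and P̄_2 = f(N_2) yields the metric distance on N̄ discussed above:

```python
import numpy as np

def hilbert_spd(P1, P2):
    # Log of the ratio of the extreme eigenvalues of P1^{-1} P2.
    lam = np.linalg.eigvals(np.linalg.solve(P1, P2)).real
    return np.log(lam.max() / lam.min())
```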

Supplementary Materials

The following supporting information can be downloaded at: https://franknielsen.github.io/FisherRaoMVN.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Acknowledgments

I warmly thank Frédéric Barbaresco (Thales) and Mohammad Emtiyaz Khan (Riken AIP) for fruitful discussions about this work.

Conflicts of Interest

The author declares no conflict of interest.

Abbreviations

Entities
N ( μ , Σ ) d-variate normal distribution (mean μ , covariance matrix Σ )
p ( μ , Σ ) ( x ) Probability density function of N ( μ , Σ )
q Σ ( y ) = p ( 0 , Σ ) ( y ) Probability density function of N ( 0 , Σ )
P = ( P i j ) Positive–definite matrix with matrix entries P i j
Mappings
P̄ = f_1(N): Calvo and Oller mapping [19] (1990)
P̂ = f_{1/(d+1), 1}(N) = f̂(N): Calvo and Oller mapping [32] (2002) or [82]
Groups
GL ( d ) Group of linear transformations (invertible d × d matrices)
SL ( d ) Special linear group ( d × d matrices with unit determinant)
Aff ( d ) Affine group of dimension d
Sets
N Set of multivariate normal distributions N ( μ , Σ ) (MVNs)
Sym ( d ) Set of symmetric d × d real matrices
P Symmetric positive–definite matrix cone (SPD matrix cone)
P c Set of SPD matrices with fixed determinant c ( P = R > 0 × P c )
SSPD, P 1 Set of SPD matrices with unit determinant
Λ Parameter space of N ( μ , Σ ) : R d × P ( d )
N 0 , P Set of zero-centered normal distributions N ( 0 , Σ )
N Σ Set of normal distributions N ( μ , Σ ) with fixed Σ
N μ Set of normal distributions N ( μ , Σ ) with fixed μ
N ¯ Set of SPD matrices f ( N )
Riemannian length elements
MVN Fisher: ds²_{Fisher,N} = dμ⊤Σ^{−1}dμ + ½ tr( (Σ^{−1}dΣ)² )
0-MVN Fisher: ds²_{Fisher,N_0} = ½ tr( (Σ^{−1}dΣ)² )
SPD trace: ds²_{β,trace} = β tr( (P^{−1}dP)² ) (when β = ½, ds_trace = ds_{Fisher,N_0})
SPD Calvo and Oller metric: ds²_CO = ½ dβ²/β² + β dμ⊤Σ^{−1}dμ + ½ tr( (Σ^{−1}dΣ)² )
(with ds_CO = ds_P(f(μ, Σ)))
when β = 1, ds_CO = ds_{Fisher,N} in N̄
SPD symmetric space: ds²_SS = ½ dμ⊤Σ^{−1}dμ + tr( (Σ^{−1}dΣ)² ) − ½ tr²( Σ^{−1}dΣ )
Siegel upper space: ds²_SH(Z) = 2 tr( Y^{−1}dZ Y^{−1}dZ̄ ) (ds_SH(iY) = 2 ds_{Fisher,N_0})
Manifolds and submanifolds
M (= M_N): Manifold of multivariate normal distributions
T p M Tangent space at p M
S μ M Submanifold of MVNs with μ prescribed
S Σ M Submanifold of MVNs with Σ prescribed
M Σ manifold of N Σ (non-embedded in M )
M μ manifold of N μ (non-embedded in M )
S [ v ] , Σ Submanifold of MVN set { N ( λ v , Σ ) : λ > 0 }
where v is an eigenvector of Σ
P manifold of symmetric positive–definite matrices
Distances
ρ N ( N 1 , N 2 ) Fisher–Rao distance between normal distributions N 1 and N 2
ρ SPD ( P 1 , P 2 ) Riemannian SPD distance between P 1 and P 2
ρ CO ( N 1 , N 2 ) Calvo and Oller distance from embedding N to P ¯ = f ( N )
ρ SS ( N 1 , N 2 ) Symmetric space distance from embedding N to P ^ = f ^ ( N )
ρ Hilbert ( N 1 , N 2 ) Hilbert distance ρ Hilbert ( P ¯ 1 , P ¯ 2 )
D KL ( N 1 , N 2 ) Kullback–Leibler divergence between MVNs N 1 and N 2
D J ( N 1 , N 2 ) Jeffreys divergence between MVNs N 1 and N 2
D CO ( N 1 , N 2 ) Calvo and Oller dissimilarity measure of Equation (26)
Geodesics and curves
γ N FR ( N 1 , N 2 ; t ) Fisher–Rao geodesic between MVNs N 1 and N 2
γ P FR ( P 1 , P 2 ; t ) Fisher–Rao geodesic between SPD P 1 and P 2
γ N e ( N 1 , N 2 ; t ) exponential geodesic between MVNs N 1 and N 2
γ N m ( N 1 , N 2 ; t ) mixture geodesic between MVNs N 1 and N 2
γ N CO ( N 1 , N 2 ; t ) projection curve (not geodesic) of γ P ( P ¯ 1 , P ¯ 2 ; t ) onto N ¯
Metrics and connections
g N Fisher Fisher information metric of MVNs
g P trace metric
g P Fisher information metric of centered MVNs
g Killing Killing metric studied in [82]
N Fisher Levi–Civita metric connection
N e exponential connection
N m mixture connection

Appendix A. Geodesics on the Fisher–Rao Normal Manifold

Appendix A.1. Parametric Equations of the Fisher–Rao Geodesics between Univariate Normal Distributions

The Fisher–Rao geodesics γ_N^FR(N_1, N_2) on the Fisher–Rao univariate normal manifold are either vertical line segments (when μ_1 = μ_2) or semi-circles with origin on the x-axis, with the x-axis stretched by √2 [92] (Figure A1):
γ_N^FR(μ_1, σ_1; μ_2, σ_2) = { (μ, (1−t)σ_1 + tσ_2), t ∈ [0, 1], if μ_1 = μ_2 = μ ; (√2 (c + r cos t), r sin t), t ∈ [min{θ_1, θ_2}, max{θ_1, θ_2}], if μ_1 ≠ μ_2,
where
c = ( ½(μ_2² − μ_1²) + σ_2² − σ_1² ) / ( √2 (μ_2 − μ_1) ), r = √( (μ_i/√2 − c)² + σ_i² ), i ∈ {1, 2},
and
θ_i = arctan( σ_i / (μ_i/√2 − c) ), i ∈ {1, 2},
provided that θ_i ≥ 0 for i ∈ {1, 2} (otherwise, we let θ_i ← θ_i + π).
Figure A1. Visualizing some Fisher–Rao geodesics of univariate normal distributions on the stretched Poincaré upper plane (semi-circles with origin on the x-axis, with the x-axis stretched by √2). Full geodesics are plotted with a thin gray style and geodesic arcs are plotted with a thick black style.
It is remarkable that the Fisher–Rao distance between univariate normal distributions is available in closed form: by contrast, the Euclidean length (with respect to the Euclidean metric) of semi-ellipse curves (perimeters) is not known in closed form but can be expressed using the so-called complete elliptic integral of the second kind [93].

Appendix A.2. Geodesics with Initial Values on the Multivariate Fisher–Rao Normal Manifold

The geodesic equations are given by
μ̈ − Σ̇ Σ^{−1} μ̇ = 0, Σ̈ + μ̇ μ̇⊤ − Σ̇ Σ^{−1} Σ̇ = 0.
We concisely report the parametric geodesics using another variant of the natural parameters of the normal distributions (slightly differing from the θ -coordinate system since natural parameters can be chosen up to a fixed affine transformation by changing accordingly the sufficient statistics by the inverse affine transformation) viewed as an exponential family:
ξ = Σ 1 μ , Ξ = Σ 1 .
In general, the geodesics with boundary values γ N Fisher ( N 1 , N 2 ; t ) are not known in closed form. However, Calvo and Oller [48] (Theorem 3.1 and Corollary 1) reported the explicit equations of the geodesics when the initial values are given, i.e., γ N Fisher ( N 0 , v 0 ; t ) where v 0 = γ ˙ N Fisher ( N 0 , v 0 ; 0 ) = ( ξ ˙ ( 0 ) , Ξ ˙ ( 0 ) ) is in T N 0 M and γ N Fisher ( N 0 , v 0 ; 0 ) = N 0 .
Let
B = Ξ(0)^{−1/2} Ξ̇(0) Ξ(0)^{−1/2}, a = Ξ(0)^{−1/2} ξ̇(0) + B Ξ(0)^{−1/2} ξ(0), G = ( B² + 2aa⊤ )^{1/2},
and let G⁻ be the Moore–Penrose generalized inverse matrix of G: G⁻ = (G⊤G)^{−1}G⊤ or G⁻ = G⊤(GG⊤)^{−1}. The Moore–Penrose pseudo-inverse matrix can be replaced by any other pseudo-inverse matrix G⁻ [48].
Then we have (ξ(t), Ξ(t)) = γ_N^Fisher(N_0, v_0; t) with
R(t) = Cosh(½ G t) − B G⁻ Sinh(½ G t),
Ξ(t) = Ξ(0)^{1/2} R(t) R(t)⊤ Ξ(0)^{1/2},
ξ(t) = 2 Ξ(0)^{1/2} R(t) Sinh(½ G t) G⁻ a + Ξ(t) Ξ(0)^{−1} ξ(0),
where the Cosh and Sinh functions of a matrix M are defined by the following absolutely convergent series [48] (Equation (9), p. 122):
Sinh(M) = M + ∑_{i=1}^{∞} M^{2i+1}/(2i+1)!, Cosh(M) = I + ∑_{i=1}^{∞} M^{2i}/(2i)!,
and they satisfy the identity Cosh²(M) − Sinh²(M) = I. The matrix Cosh and Sinh functions can be calculated from the eigendecomposition M = O diag(λ_1, …, λ_d) O⊤ as follows:
Sinh(M) = O diag(sinh(λ_1), …, sinh(λ_d)) O⊤, with sinh(u) = (e^u − e^{−u})/2 = ∑_{i=0}^{∞} u^{2i+1}/(2i+1)!,
Cosh(M) = O diag(cosh(λ_1), …, cosh(λ_d)) O⊤, with cosh(u) = (e^u + e^{−u})/2 = ∑_{i=0}^{∞} u^{2i}/(2i)!.
When we restrict the manifold to the totally geodesic submanifold M_μ ≅ {P ≻ 0}, the geodesic equation becomes P̈ − Ṗ P^{−1} Ṗ = 0, and the geodesic with initial values P(0) = P and Ṗ(0) = S ∈ Sym(d) is:
P(t) = P^{1/2} exp( t P^{−1/2} S P^{−1/2} ) P^{1/2}.
The geodesic with boundary values P(0) = P_1 and P(1) = P_2 is
P(t) = P_1^{1/2} exp( t Log( P_1^{−1/2} P_2 P_1^{−1/2} ) ) P_1^{1/2}.
Furthermore, we can convert a geodesic with boundary values γ_P(P_1, P_2; t) into an equivalent geodesic with initial values γ_P(P, S; t) by letting
S = P_1^{1/2} Log( P_1^{−1/2} P_2 P_1^{−1/2} ) P_1^{1/2}.
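A sketch of both parameterizations, with a numerical consistency check (SciPy's fractional_matrix_power, expm, and logm handle the matrix functions; names are ours):

```python
import numpy as np
from scipy.linalg import expm, logm, fractional_matrix_power

def spd_geodesic(P1, P2, t):
    # Boundary-value geodesic: P(t) = P1^{1/2} (P1^{-1/2} P2 P1^{-1/2})^t P1^{1/2}.
    Ph = fractional_matrix_power(P1, 0.5)
    Pih = fractional_matrix_power(P1, -0.5)
    return Ph @ fractional_matrix_power(Pih @ P2 @ Pih, t) @ Ph

def initial_velocity(P1, P2):
    # S = P1^{1/2} Log(P1^{-1/2} P2 P1^{-1/2}) P1^{1/2}, the velocity at t = 0.
    Ph = fractional_matrix_power(P1, 0.5)
    Pih = fractional_matrix_power(P1, -0.5)
    return Ph @ logm(Pih @ P2 @ Pih) @ Ph

# Consistency check: the initial-value form reproduces the boundary-value form.
rng = np.random.default_rng(2)
A, B = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
P1, P2 = A @ A.T + np.eye(3), B @ B.T + np.eye(3)
S = initial_velocity(P1, P2)
Ph = fractional_matrix_power(P1, 0.5)
Pih = fractional_matrix_power(P1, -0.5)
t = 0.7
assert np.allclose(spd_geodesic(P1, P2, t), Ph @ expm(t * Pih @ S @ Pih) @ Ph)
```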

Appendix B. Fisher–Rao Distance between Normal Distributions Sharing the Same Covariance Matrix

The Rao distance between N_1 = N(μ_1, Σ) and N_2 = N(μ_2, Σ) has been reported in closed form in [42] (Proposition 3). We shall explain the geometric method in full as follows: let (e_1, …, e_d) be the standard frame of R^d (ordered basis); the e_i's are the unit vectors of the axes x_i. Let P be an orthogonal matrix such that P(μ_2 − μ_1) = ‖μ_2 − μ_1‖_2 e_1 (i.e., matrix P aligns the vector μ_2 − μ_1 with the first axis x_1). Let Δ_12 = ‖μ_2 − μ_1‖_2 be the Euclidean distance between μ_1 and μ_2. Furthermore, factorize the matrix PΣP⊤ using the LDL⊤ decomposition (a variant of the Cholesky decomposition) as PΣP⊤ = LDL⊤, where L is a lower triangular matrix with all diagonal entries equal to one (a lower unitriangular matrix of unit determinant) and D is a diagonal matrix. Let σ_12² = D_11. Then we have [42]:
ρ_Σ(μ_1, μ_2) = ρ_N(N(μ_1, Σ), N(μ_2, Σ)) = ρ_N(N(0, σ_12²), N(Δ_12, σ_12²)).
Please note that the right-hand side term is the Fisher–Rao distance between univariate normal distributions of Equation (7).
To find the matrix P, we proceed as follows: let u = (μ_2 − μ_1)/‖μ_2 − μ_1‖_2 be the normalized vector to align with axis x_1, and let v = u − e_1. Consider the Householder reflection matrix [94] M = I − 2 vv⊤/‖v‖_2², where vv⊤ is an outer-product matrix. Since Householder reflection matrices have determinant −1, we let P be a copy of M with the last row multiplied by −1 so that we obtain det(P) = 1. By construction, we have Pu = e_1, i.e., P(μ_2 − μ_1) = ‖μ_2 − μ_1‖_2 e_1. We then use the affine-invariance property of the Fisher–Rao distance as follows:
ρ_N(N(μ_1, Σ), N(μ_2, Σ)) = ρ_N(N(0, Σ), N(μ_2 − μ_1, Σ))
= ρ_N(N(0, PΣP⊤), N(P(μ_2 − μ_1), PΣP⊤))
= ρ_N(N(0, PΣP⊤), N(Δ_12 e_1, PΣP⊤))
= ρ_N(N(0, LDL⊤), N(Δ_12 e_1, LDL⊤))
= ρ_N(N(0, D), N(Δ_12 e_1, D)).
The last row follows from the fact that L^{−⊤} e_1 = e_1 since L^{−⊤} is an upper unitriangular matrix, and L⊤ L^{−⊤} = (L^{−1} L)⊤ = I. The right-hand side Fisher–Rao distance is then computed from Equation (7).
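A sketch of this reduction for d ≥ 2; the univariate closed form below is our reading of Equation (7) through the √2-stretched Poincaré upper half-plane:

```python
import numpy as np

def rao_univariate(mu1, s1, mu2, s2):
    # Closed-form Fisher-Rao distance between N(mu1, s1^2) and N(mu2, s2^2),
    # via the hyperbolic distance in the sqrt(2)-stretched upper half-plane.
    a = np.hypot((mu1 - mu2) / np.sqrt(2.0), s1 + s2)
    b = np.hypot((mu1 - mu2) / np.sqrt(2.0), s1 - s2)
    return np.sqrt(2.0) * np.log((a + b) / (a - b))

def rao_shared_covariance(mu1, mu2, Sigma):
    # Householder alignment of mu2 - mu1 with e1, then LDL^T reduction.
    d = len(mu1)
    delta = np.linalg.norm(mu2 - mu1)
    u = (mu2 - mu1) / delta
    e1 = np.zeros(d); e1[0] = 1.0
    v = u - e1
    if np.allclose(v, 0.0):
        P = np.eye(d)
    else:
        P = np.eye(d) - 2.0 * np.outer(v, v) / (v @ v)  # reflection: P u = e1
        P[-1] *= -1.0            # flip last row so that det(P) = +1 (d >= 2)
    L = np.linalg.cholesky(P @ Sigma @ P.T)
    s12 = L[0, 0]                # equals sqrt(D_11) of the LDL^T factorization
    return rao_univariate(0.0, s12, delta, s12)
```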

Appendix C. Embedding the Set of Multivariate Normal Distributions in a Riemannian Symmetric Space

The multivariate normal manifold N(d) can also be embedded into the SPD cone P(d+1) as a Riemannian symmetric space [82,89] by f_SSPD: P̂ = {f_SSPD(N) ∈ P(d+1) : N ∈ N(d)}. We have P̂ ≅ SL(d+1)/SO(d+1) [82,95,96] (and the textbook [97], Part II, Chapter 10), and the symmetric space SL(d+1)/SO(d+1) can be equipped with the Killing Riemannian metric instead of the Fisher information metric:
g_Killing(v_1, v_2) = κ_Killing ( μ̇_1⊤ Σ^{−1} μ̇_2 + ½ tr( Σ^{−1} Σ̇_1 Σ^{−1} Σ̇_2 ) − 1/(2(d+1)) tr( Σ^{−1} Σ̇_1 ) tr( Σ^{−1} Σ̇_2 ) ),
where v_i = (μ̇_i, Σ̇_i) denote tangent vectors at N = (μ, Σ) and κ_Killing > 0 is a predetermined constant (e.g., 1). The length element of the Killing metric is
ds²_SS = κ_Killing ( ½ dμ⊤ Σ^{−1} dμ + tr( (Σ^{−1}dΣ)² ) − ½ tr²( Σ^{−1}dΣ ) ).
When we consider N Σ , we may choose κ Killing = 2 so that the Killing metric coincides with the Fisher information metric. The induced Killing distance [82] is available in closed form:
ρ_Killing(N_1, N_2) = √( κ_Killing ∑_{i=1}^{d+1} log² λ_i( L̂_1^{−1} P̂_2 L̂_1^{−⊤} ) ),
where L̂_1 is the unique lower triangular matrix obtained from the Cholesky decomposition of P̂_1 = f_SSPD(N_1) = L̂_1 L̂_1⊤. Please note that L̂_1^{−1} P̂_2 L̂_1^{−⊤} ∈ P(d+1) and |L̂_1| = 1, i.e., L̂_1 ∈ SL(d+1). When N_1 = (μ_1, Σ) and N_2 = (μ_2, Σ) (N_1, N_2 ∈ N_Σ), we have [82]
ρ_Killing(N_1, N_2) = √(2 κ_Killing) arccosh( 1 + ½ Δ_Σ²(μ_1, μ_2) ),
where Δ_Σ² is the squared Mahalanobis distance. Thus, ρ_Killing(N_1, N_2) = h_Killing(Δ_Σ(μ_1, μ_2)) with h_Killing(u) = √(2 κ_Killing) arccosh(1 + ½u²).
When N_1 = (μ, Σ_1) and N_2 = (μ, Σ_2) (N_1, N_2 ∈ N_μ), we have [82]:
ρ_Killing(N_1, N_2) = √( κ_Killing [ ∑_{i=1}^{d} log² λ_i( L_1^{−1} P_2 L_1^{−⊤} ) − (1/(d+1)) ( ∑_{i=1}^{d} log λ_i( L_1^{−1} P_2 L_1^{−⊤} ) )² ] ).
See Example 1. Let us emphasize that the Killing distance is not the Fisher–Rao distance but is available in closed form as an alternative metric distance between MVNs.
A Fisher geodesic defect measure of a curve c is defined in [89] by
δ(c) = lim_{s→∞} (1/s) ∫_0^s ‖ ∇^{g_Fisher}_{ċ} ċ ‖_{c(t)}^{Fisher} dt,
where ∇^{g_Fisher} denotes the Levi–Civita connection induced by the Fisher metric. When δ(c) = 0, the curve is said to be an asymptotic Fisher geodesic. It is proven that Killing geodesics emanating from (μ, Σ) are asymptotic Fisher geodesics when the initial condition ċ(0) is orthogonal to N_μ.

Appendix D. Embedding the Set of Multivariate Normal Distributions in the Siegel Upper Space

The Siegel upper space is the space of symmetric complex matrices Z = X + iY = Z⊤ with positive–definite imaginary part Y ≻ 0 [45,65] (so-called Riemann matrices [98]):
SH(d) := { Z = X + iY : X ∈ Sym(d), Y ∈ P(d) },
where Sym ( d ) is the space of symmetric real d × d matrices. SH ( 1 ) corresponds to the Poincaré upper plane. See Figure A2 for an illustration.
The Siegel infinitesimal square line element is
ds²_SH(Z) = 2 tr( Y^{−1} dZ Y^{−1} dZ̄ ).
When X = 0 and Z = iY, we have dZ = i dY and dZ̄ = −i dY, and it follows that
ds²_SH(iY) = 2 tr( (Y^{−1} dY)² ).
That is, four times the squared length element of the Fisher metric of centered normal distributions: ds²_{N_0} = ½ tr( (P^{−1} dP)² ).
The Siegel distance [45] between Z_1 and Z_2 ∈ SH(d) is
ρ_SH(Z_1, Z_2) = √( ∑_{i=1}^{d} log²( (1 + r_i)/(1 − r_i) ) ),
where
r_i = √( λ_i( R(Z_1, Z_2) ) ),
with R(Z_1, Z_2) denoting the matrix generalization of the cross-ratio:
R(Z_1, Z_2) := (Z_1 − Z_2)(Z_1 − Z̄_2)^{−1}(Z̄_1 − Z̄_2)(Z̄_1 − Z_2)^{−1},
and λ_i(M) denoting the i-th largest (real) eigenvalue of the (complex) matrix M. (In practice, we must numerically round off the tiny imaginary parts to obtain proper real eigenvalues [65].) The Siegel upper half space is a homogeneous space where the Lie group SU(d, d)/S(U(d) × U(d)) acts transitively on it.
We can embed a multivariate normal distribution N = ( μ , Σ ) into SH ( d ) as follows:
N(μ, Σ) ⟼ Z(N) := μμ⊤ + iΣ,
and consider the Siegel distance on the embedded normal distributions as another potential metric distance between multivariate normal distributions:
ρ SH ( N 1 , N 2 ) = ρ SH ( Z ( N 1 ) , Z ( N 2 ) ) .
Notice that the real matrix parts of the Z(N)'s are all of rank one by construction.
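A short sketch of the Siegel distance and of the proposed embedding (complex NumPy arithmetic; tiny imaginary/negative numerical residues are clipped):

```python
import numpy as np

def siegel_distance(Z1, Z2):
    # Distance via the matrix cross-ratio R(Z1, Z2); r_i = sqrt(eigenvalues of R).
    R = (Z1 - Z2) @ np.linalg.inv(Z1 - Z2.conj()) \
        @ (Z1.conj() - Z2.conj()) @ np.linalg.inv(Z1.conj() - Z2)
    r = np.sqrt(np.clip(np.linalg.eigvals(R).real, 0.0, None))  # drop tiny residues
    return np.sqrt(np.sum(np.log((1.0 + r) / (1.0 - r)) ** 2))

def siegel_embed(mu, Sigma):
    # The embedding N(mu, Sigma) -> Z(N) = mu mu^T + i Sigma proposed above.
    return np.outer(mu, mu) + 1j * Sigma
```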
Figure A2. Siegel upper space generalizes the Poincaré hyperbolic upper plane.

Appendix E. The Symmetrized Bregman Divergence Expressed as Integral Energies on Dual Geodesics

Let S_F(θ_1; θ_2) = B_F(θ_1 : θ_2) + B_F(θ_2 : θ_1) be a symmetrized Bregman divergence. Let ds² = dθ⊤ ∇²F(θ) dθ denote the squared length element on the Bregman manifold, and denote by γ(t) and γ*(t) the dual geodesics connecting θ_1 to θ_2. We can express S_F(θ_1; θ_2) as integral energies on the dual geodesics:
Property A1. 
We have S_F(θ_1; θ_2) = ∫_0^1 ds²(γ(t)) dt = ∫_0^1 ds²(γ*(t)) dt.
Proof. 
The proof that the symmetrized Bregman divergence amounts to these energy integrals is based on first-order and second-order directional derivatives. The first-order directional derivative ∇_u F(θ) with respect to a vector u is defined by
∇_u F(θ) = lim_{t→0} ( F(θ + tu) − F(θ) )/t = u⊤ ∇F(θ).
The second-order directional derivative ∇²_{u,v} F(θ) is
∇²_{u,v} F(θ) = ∇_u ∇_v F(θ) = lim_{t→0} ( ∇_v F(θ + tu) − ∇_v F(θ) )/t = u⊤ ∇²F(θ) v.
Now consider the squared length element ds²(γ(t)) on the primal geodesic γ(t) expressed using the primal coordinate system θ: ds²(γ(t)) = dθ(t)⊤ ∇²F(θ(t)) dθ(t) with θ(γ(t)) = θ_1 + t(θ_2 − θ_1) and dθ(t) = θ_2 − θ_1. Let us express ds²(γ(t)) using the second-order directional derivative:
ds²(γ(t)) = ∇²_{θ_2−θ_1, θ_2−θ_1} F(θ(t)).
Thus, we have ∫_0^1 ds²(γ(t)) dt = [∇_{θ_2−θ_1} F(θ(t))]_0^1, where the first-order directional derivative is ∇_{θ_2−θ_1} F(θ(t)) = (θ_2 − θ_1)⊤ ∇F(θ(t)). Therefore, we obtain ∫_0^1 ds²(γ(t)) dt = (θ_2 − θ_1)⊤ (∇F(θ_2) − ∇F(θ_1)) = S_F(θ_1; θ_2).
Similarly, we express the squared length element ds²(γ*(t)) using the dual coordinate system η as the second-order directional derivative of F*(η(t)) with η(γ*(t)) = η_1 + t(η_2 − η_1):
ds²(γ*(t)) = ∇²_{η_2−η_1, η_2−η_1} F*(η(t)).
Therefore, we have ∫_0^1 ds²(γ*(t)) dt = [∇_{η_2−η_1} F*(η(t))]_0^1 = S_{F*}(η_1; η_2). Since S_{F*}(η_1; η_2) = S_F(θ_1; θ_2), we conclude that
S_F(θ_1; θ_2) = ∫_0^1 ds²(γ(t)) dt = ∫_0^1 ds²(γ*(t)) dt.
Please note that in 1D, both pregeodesics γ(t) and γ*(t) coincide. We have ds²(t) = (θ_2 − θ_1)² f″(θ(t)) = (η_2 − η_1)² f*″(η(t)), so that we check that S_F(θ_1; θ_2) = ∫_0^1 ds²(γ(t)) dt = (θ_2 − θ_1)[f′(θ(t))]_0^1 = (η_2 − η_1)[f*′(η(t))]_0^1 = (η_2 − η_1)(θ_2 − θ_1). □
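As a sanity check of Property A1, the following sketch verifies the identity numerically in 1D for a generator of our own choosing (F(θ) = −log θ, not an example from the paper):

```python
import numpy as np

# F(theta) = -log(theta) on theta > 0 (the Burg/Itakura-Saito generator).
fprime = lambda th: -1.0 / th          # F'(theta) = eta
fsecond = lambda th: 1.0 / th ** 2     # F''(theta), the 1D metric

theta1, theta2 = 0.5, 3.0
SF = (theta2 - theta1) * (fprime(theta2) - fprime(theta1))  # (dtheta)(deta)

# Midpoint-rule quadrature of the energy along the primal geodesic:
t = (np.arange(100000) + 0.5) / 100000
energy = np.mean((theta2 - theta1) ** 2 * fsecond(theta1 + t * (theta2 - theta1)))
assert abs(SF - energy) < 1e-6          # both equal 25/6 here
```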

References

  1. Amari, S.I. Information Geometry and Its Applications; Applied Mathematical Sciences; Springer: Tokyo, Japan, 2016.
  2. Calin, O.; Udrişte, C. Geometric Modeling in Probability and Statistics; Springer: Berlin/Heidelberg, Germany, 2014; Volume 121.
  3. Lin, Z. Riemannian geometry of symmetric positive definite matrices via Cholesky decomposition. SIAM J. Matrix Anal. Appl. 2019, 40, 1353–1370.
  4. Soen, A.; Sun, K. On the variance of the Fisher information for deep learning. Adv. Neural Inf. Process. Syst. 2021, 34, 5708–5719.
  5. Barachant, A.; Bonnet, S.; Congedo, M.; Jutten, C. Classification of covariance matrices using a Riemannian-based kernel for BCI applications. Neurocomputing 2013, 112, 172–178.
  6. Skovgaard, L.T. A Riemannian Geometry of the Multivariate Normal Model; Technical Report 81/3; Statistical Research Unit, Danish Medical Research Council, Danish Social Science Research Council: Copenhagen, Denmark, 1981.
  7. Skovgaard, L.T. A Riemannian geometry of the multivariate normal model. Scand. J. Stat. 1984, 11, 211–223.
  8. Malagò, L.; Pistone, G. Information geometry of the Gaussian distribution in view of stochastic optimization. In Proceedings of the ACM Conference on Foundations of Genetic Algorithms XIII, Aberystwyth, UK, 17–22 January 2015; pp. 150–162.
  9. Herntier, T.; Peter, A.M. Transversality Conditions for Geodesics on the Statistical Manifold of Multivariate Gaussian Distributions. Entropy 2022, 24, 1698.
  10. Atkinson, C.; Mitchell, A.F. Rao's distance measure. SankhyĀ Indian J. Stat. Ser. 1981, 43, 345–365.
  11. Radhakrishna Rao, C. Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 1945, 37, 81–91.
  12. Chen, X.; Zhou, J.; Hu, S. Upper bounds for Rao distance on the manifold of multivariate elliptical distributions. Automatica 2021, 129, 109604.
  13. Hotelling, H. Spaces of statistical parameters. Bull. Am. Math. Soc. 1930, 36, 191.
  14. Cencov, N.N. Statistical Decision Rules and Optimal Inference; American Mathematical Soc.: Providence, RI, USA, 2000; Volume 53.
  15. Bauer, M.; Bruveris, M.; Michor, P.W. Uniqueness of the Fisher–Rao metric on the space of smooth densities. Bull. Lond. Math. Soc. 2016, 48, 499–506.
  16. Fujiwara, A. Hommage to Chentsov's theorem. Inf. Geom. 2022, 1–20.
  17. Bruveris, M.; Michor, P.W. Geometry of the Fisher–Rao metric on the space of smooth densities on a compact manifold. Math. Nachrichten 2019, 292, 511–523.
  18. Burbea, J.; Oller i Sala, J.M. On Rao Distance Asymptotic Distribution; Technical Report Mathematics Preprint Series No. 67; Universitat de Barcelona: Barcelona, Spain, 1989.
  19. Calvo, M.; Oller, J.M. A distance between multivariate normal distributions based in an embedding into the Siegel group. J. Multivar. Anal. 1990, 35, 223–242.
  20. Rios, M.; Villarroya, A.; Oller, J.M. Rao distance between multivariate linear normal models and their application to the classification of response curves. Comput. Stat. Data Anal. 1992, 13, 431–445.
  21. Park, P.S.; Kshirsagar, A.M. Distances between normal populations when covariance matrices are unequal. Commun. Stat. Theory Methods 1994, 23, 3549–3556.
  22. Gruber, M.H. Some applications of the Rao distance to shrinkage estimators. Commun. Stat. Methods 2008, 37, 180–193.
  23. Strapasson, J.E.; Pinele, J.; Costa, S.I. Clustering using the Fisher-Rao distance. In Proceedings of the 2016 IEEE Sensor Array and Multichannel Signal Processing Workshop (SAM), Rio de Janeiro, Brazil, 10–13 July 2016; pp. 1–5.
  24. Le Brigant, A.; Puechmorel, S. Quantization and clustering on Riemannian manifolds with an application to air traffic analysis. J. Multivar. Anal. 2019, 173, 685–703.
  25. Said, S.; Bombrun, L.; Berthoumieu, Y. Texture classification using Rao's distance on the space of covariance matrices. In Proceedings of the Geometric Science of Information: Second International Conference, GSI 2015, Proceedings 2, Palaiseau, France, 28–30 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 371–378.
  26. Legrand, L.; Grivel, E. Evaluating dissimilarities between two moving-average models: A comparative study between Jeffrey's divergence and Rao distance. In Proceedings of the 2016 24th European Signal Processing Conference (EUSIPCO), Budapest, Hungary, 8 August–2 September 2016; pp. 205–209.
  27. Halder, A.; Georgiou, T.T. Gradient flows in filtering and Fisher-Rao geometry. In Proceedings of the 2018 Annual American Control Conference (ACC), Milwaukee, WI, USA, 27–29 June 2018; pp. 4281–4286.
  28. Collas, A.; Breloy, A.; Ren, C.; Ginolhac, G.; Ovarlez, J.P. Riemannian optimization for non-centered mixture of scaled Gaussian distributions. arXiv 2022, arXiv:2209.03315.
  29. Liang, T.; Poggio, T.; Rakhlin, A.; Stokes, J. Fisher-Rao metric, geometry, and complexity of neural networks. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, PMLR, Naha, Japan, 16–18 April 2019; pp. 888–896.
  30. Yoshizawa, S.; Tanabe, K. Dual differential geometry associated with the Kullback-Leibler information on the Gaussian distributions and its 2-parameter deformations. SUT J. Math. 1999, 35, 113–137.
  31. Shima, H. The Geometry of Hessian Structures; World Scientific: Singapore, 2007.
  32. Calvo, M.; Oller, J.M. A distance between elliptical distributions based in an embedding into the Siegel group. J. Comput. Appl. Math. 2002, 145, 319–334.
  33. Burbea, J. Informative Geometry of Probability Spaces; Technical Report; Pittsburgh Univ. PA Center for Multivariate Analysis: Pittsburgh, PA, USA, 1984.
  34. Eriksen, P.S. Geodesics Connected with the Fischer Metric on the Multivariate Normal Manifold; Institute of Electronic Systems, Aalborg University Centre: Aalborg, Denmark, 1986.
  35. Berkane, M.; Oden, K.; Bentler, P.M. Geodesic estimation in elliptical distributions. J. Multivar. Anal. 1997, 63, 35–46.
  36. Imai, T.; Takaesu, A.; Wakayama, M. Remarks on Geodesics for Multivariate Normal Models; Technical Report; Faculty of Mathematics, Kyushu University: Fukuoka, Japan, 2011.
  37. Inoue, H. Group theoretical study on geodesics for the elliptical models. In Proceedings of the Geometric Science of Information: Second International Conference, GSI 2015, Proceedings 2, Palaiseau, France, 28–30 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 605–614.
  38. Strapasson, J.E.; Porto, J.P.; Costa, S.I. On bounds for the Fisher-Rao distance between multivariate normal distributions. AIP Conf. Proc. 2015, 1641, 313–320.
  39. Han, M.; Park, F.C. DTI segmentation and fiber tracking using metrics on multivariate normal distributions. J. Math. Imaging Vis. 2014, 49, 317–334.
  40. Pilté, M.; Barbaresco, F. Tracking quality monitoring based on information geometry and geodesic shooting. In Proceedings of the 2016 17th International Radar Symposium (IRS), Krakow, Poland, 10–12 May 2016; pp. 1–6.
  41. Barbaresco, F. Souriau exponential map algorithm for machine learning on matrix Lie groups. In Proceedings of the Geometric Science of Information: 4th International Conference, GSI 2019, Proceedings 4, Toulouse, France, 27–29 August 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 85–95.
  42. Pinele, J.; Strapasson, J.E.; Costa, S.I. The Fisher–Rao distance between multivariate normal distributions: Special cases, bounds and applications. Entropy 2020, 22, 404.
  43. Dijkstra, E.W. A note on two problems in connexion with graphs. In Edsger Wybe Dijkstra: His Life, Work, and Legacy; Association for Computing Machinery: New York, NY, USA, 2022; pp. 287–290.
  44. Anderson, J.W. Hyperbolic Geometry; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2006.
  45. Siegel, C.L. Symplectic Geometry; First Printed in 1964; Elsevier: Amsterdam, The Netherlands, 2014.
  46. James, A.T. The variance information manifold and the functions on it. In Multivariate Analysis–III; Elsevier: Amsterdam, The Netherlands, 1973; pp. 157–169.
  47. Wells, J.; Cook, M.; Pine, K.; Robinson, B.D. Fisher-Rao distance on the covariance cone. arXiv 2020, arXiv:2010.15861.
  48. Calvo, M.; Oller, J.M. An explicit solution of information geodesic equations for the multivariate normal model. Stat. Risk Model. 1991, 9, 119–138.
  49. Förstner, W.; Moonen, B. A metric for covariance matrices. In Geodesy-the Challenge of the 3rd Millennium; Springer: Berlin/Heidelberg, Germany, 2003; pp. 299–309.
  50. Dolcetti, A.; Pertici, D. Real square roots of matrices: Differential properties in semi-simple, symmetric and orthogonal cases. arXiv 2020, arXiv:2010.15609.
  51. Mahalanobis, P.C. On the generalised distance in statistics. In Proceedings of the National Institute of Science of India; Springer: New Delhi, India, 1936; Volume 12, pp. 49–55.
  52. Eaton, M.L. Group Invariance Applications in Statistics; Institute of Mathematical Statistics: Beachwood, OH, USA, 1989.
  53. Godinho, L.; Natário, J. An introduction to Riemannian geometry: With Applications to Mechanics and Relativity. In Universitext; Springer International Publishing: Cham, Switzerland, 2014.
  54. Strapasson, J.E.; Pinele, J.; Costa, S.I. A totally geodesic submanifold of the multivariate normal distributions and bounds for the Fisher-Rao distance. In Proceedings of the IEEE Information Theory Workshop (ITW), Cambridge, UK, 1–11 September 2016; pp. 61–65.
  55. Chen, X.; Zhou, J. Multisensor Estimation Fusion on Statistical Manifold. Entropy 2022, 24, 1802.
  56. Cherian, A.; Sra, S. Riemannian dictionary learning and sparse coding for positive definite matrices. IEEE Trans. Neural Netw. Learn. Syst. 2016, 28, 2859–2871.
  57. Nguyen, X.S. Geomnet: A neural network based on Riemannian geometries of SPD matrix space and Cholesky space for 3d skeleton-based interaction recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 13379–13389.
  58. Dolcetti, A.; Pertici, D. Differential properties of spaces of symmetric real matrices. arXiv 2018, arXiv:1807.01113.
  59. Verdoolaege, G.; Scheunders, P. On the geometry of multivariate generalized Gaussian models. J. Math. Imaging Vis. 2012, 43, 180–193.
  60. Ali, S.M.; Silvey, S.D. A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Ser. B 1966, 28, 131–142.
  61. Csiszár, I. Information-type measures of difference of probability distributions and indirect observation. Stud. Sci. Math. Hung. 1967, 2, 229–318.
  62. Nielsen, F.; Okamura, K. A note on the f-divergences between multivariate location-scale families with either prescribed scale matrices or location parameters. arXiv 2022, arXiv:2204.10952.
  63. Moakher, M.; Zéraï, M. The Riemannian geometry of the space of positive-definite matrices and its application to the regularization of positive-definite matrix-valued data. J. Math. Imaging Vis. 2011, 40, 171–187.
  64. Dolcetti, A.; Pertici, D. Elliptic isometries of the manifold of positive definite real matrices with the trace metric. Rend. Circ. Mat. Palermo Ser. 2 2021, 70, 575–592.
  65. Nielsen, F. The Siegel–Klein Disk: Hilbert Geometry of the Siegel Disk Domain. Entropy 2020, 22, 1019.
  66. Arnaudon, M.; Nielsen, F. On approximating the Riemannian 1-center. Comput. Geom. 2013, 46, 93–104.
  67. Ceolin, S.R.; Hancock, E.R. Computing gender difference using Fisher-Rao metric from facial surface normals. In Proceedings of the 25th SIBGRAPI Conference on Graphics, Patterns and Images, Ouro Preto, Brazil, 22–25 August 2012; pp. 336–343.
  68. Wang, Q.; Li, P.; Zhang, L. G2DeNet: Global Gaussian distribution embedding network and its application to visual recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2730–2739.
  69. Miyamoto, H.K.; Meneghetti, F.C.; Costa, S.I. The Fisher–Rao loss for learning under label noise. Inf. Geom. 2022, 1–20.
  70. Kurtek, S.; Bharath, K. Bayesian sensitivity analysis with the Fisher–Rao metric. Biometrika 2015, 102, 601–616.
  71. Marti, G.; Andler, S.; Nielsen, F.; Donnat, P. Optimal transport vs. Fisher-Rao distance between copulas for clustering multivariate time series. In Proceedings of the 2016 IEEE Statistical Signal Processing Workshop (SSP), Palma de Mallorca, Spain, 26–29 June 2016; pp. 1–5.
  72. Tang, M.; Rong, Y.; Zhou, J.; Li, X.R. Information geometric approach to multisensor estimation fusion. IEEE Trans. Signal Process. 2018, 67, 279–292.
  73. Wang, W.; Wang, R.; Huang, Z.; Shan, S.; Chen, X. Discriminant analysis on Riemannian manifold of Gaussian distributions for face recognition with image sets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2048–2057.
  74. Li, P.; Wang, Q.; Zeng, H.; Zhang, L. Local log-Euclidean multivariate Gaussian descriptor and its application to image classification. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 803–817.
  75. Picot, M.; Messina, F.; Boudiaf, M.; Labeau, F.; Ayed, I.B.; Piantanida, P. Adversarial robustness via Fisher-Rao regularization. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 2698–2710.
  76. Collas, A.; Bouchard, F.; Ginolhac, G.; Breloy, A.; Ren, C.; Ovarlez, J.P. On the Use of Geodesic Triangles between Gaussian Distributions for Classification Problems. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 5697–5701.
  77. Murena, P.A.; Cornuéjols, A.; Dessalles, J.L. Opening the parallelogram: Considerations on non-Euclidean analogies. In Proceedings of the Case-Based Reasoning Research and Development: 26th International Conference, ICCBR 2018, Proceedings 26, Stockholm, Sweden, 9–12 July 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 597–611.
  78. Popović, B.; Janev, M.; Krstanović, L.; Simić, N.; Delić, V. Measure of Similarity between GMMs Based on Geometry-Aware Dimensionality Reduction. Mathematics 2022, 11, 175.
  79. Micchelli, C.A.; Noakes, L. Rao distances. J. Multivar. Anal. 2005, 92, 97–115.
  80. Nielsen, F. On the Jensen–Shannon symmetrization of distances relying on abstract means. Entropy 2019, 21, 485.
  81. Davis, J.; Dhillon, I. Differential entropic clustering of multivariate Gaussians. Adv. Neural Inf. Process. Syst. 2006, 19, 337–344.
  82. Lovrić, M.; Min-Oo, M.; Ruh, E.A. Multivariate normal distributions parametrized as a Riemannian symmetric space. J. Multivar. Anal. 2000, 74, 36–48.
  83. Welzl, E. Smallest enclosing disks (balls and ellipsoids). In Proceedings of the New Results and New Trends in Computer Science; Springer: Berlin/Heidelberg, Germany, 2005; pp. 359–370.
  84. Gonzalez, T.F. Clustering to minimize the maximum intercluster distance. Theor. Comput. Sci. 1985, 38, 293–306.
  85. Acharyya, S.; Banerjee, A.; Boley, D. Bregman divergences and triangle inequality. In Proceedings of the 2013 SIAM International Conference on Data Mining, SIAM, Austin, TX, USA, 2–4 May 2013; pp. 476–484.
  86. Ohara, A.; Suda, N.; Amari, S.i. Dualistic differential geometry of positive definite matrices and its applications to related problems. Linear Algebra Appl. 1996, 247, 31–53.
  87. Nock, R.; Nielsen, F. Fitting the smallest enclosing Bregman ball. In Proceedings of the Machine Learning: ECML 2005: 16th European Conference on Machine Learning, Proceedings 16, Porto, Portugal, 3–7 October 2005; Springer: Berlin/Heidelberg, Germany, 2005; pp. 649–656.
  88. Ohara, A. Doubly autoparallel structure on positive definite matrices and its applications. In Proceedings of the International Conference on Geometric Science of Information, Toulouse, France, 27–29 August 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 251–260.
  89. Globke, W.; Quiroga-Barranco, R. Information geometry and asymptotic geodesics on the space of normal distributions. Inf. Geom. 2021, 4, 131–153.
  90. Nielsen, F.; Sun, K. Clustering in Hilbert's projective geometry: The case studies of the probability simplex and the elliptope of correlation matrices. In Geometric Structures of Information; Springer: Berlin/Heidelberg, Germany, 2019; pp. 297–331.
  91. Journée, M.; Nesterov, Y.; Richtárik, P.; Sepulchre, R. Generalized power method for sparse principal component analysis. J. Mach. Learn. Res. 2010, 11, 517–553.
  92. Verdoolaege, G. A new robust regression method based on minimization of geodesic distances on a probabilistic manifold: Application to power laws. Entropy 2015, 17, 4602–4626.
  93. Chandrupatla, T.R.; Osler, T.J. The perimeter of an ellipse. Math. Sci. 2010, 35.
  94. Householder, A.S. Unitary triangularization of a nonsymmetric matrix. J. ACM 1958, 5, 339–342.
  95. Fernandes, M.A.; San Martin, L.A. Fisher information and α-connections for a class of transformational models. Differ. Geom. Appl. 2000, 12, 165–184.
  96. Fernandes, M.A.; San Martin, L.A. Geometric proprieties of invariant connections on SL(n,R)/SO(n). J. Geom. Phys. 2003, 47, 369–377.
  97. Bridson, M.R.; Haefliger, A. Metric Spaces of Non-Positive Curvature; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013; Volume 319.
  98. Frauendiener, J.; Jaber, C.; Klein, C. Efficient computation of multidimensional theta functions. J. Geom. Phys. 2019, 141, 147–158.
Figure 1. Four univariate normal distributions N_1 = N(0, 1), N_2 = N(3, 1), N_3 = N(2, 2.5), and N_4 = N(0, 2), with their pairwise full geodesics in gray and the geodesic arcs linking them in red. The Fisher–Rao distances are ρ_N(N_1, N_2) = 2.6124, ρ_N(N_3, N_4) = 0.9317, ρ_N(N_1, N_4) = 0.9803, ρ_N(N_2, N_3) = 1.4225, ρ_N(N_2, N_4) = 2.1362, and ρ_N(N_1, N_3) = 1.7334. The ellipses are Tissot indicatrices, which visualize the metric tensor g_N^Fisher at grid positions.
Figure 2. The submanifolds N Σ are not totally geodesic (i.e., ρ N ( N 1 , N 2 ) is upper bounded by their Mahalanobis distance) but the submanifolds N μ are totally geodesic. Using the triangle inequality of the Riemannian metric distance ρ N , we can upper bound ρ N ( N 1 , N 2 ) .
Figure 3. Quality of the C&O lower bound compared to the exact Fisher–Rao distance in the case of N_1, N_2 ∈ M_Σ (MVNs sharing the same covariance matrix Σ). We have ρ_CO ≤ ρ_N ≤ Δ_Σ.
Figure 4. Approximating the Fisher–Rao geodesic distance ρ N ( N 1 , N 2 ) : The Fisher–Rao geodesic γ N FR is not known in closed form. We consider a tractable curve c ( t ) , discretize c ( t ) at T + 1 points c ( i T ) with c ( 0 ) = N 1 and c ( 1 ) = N 2 , and approximate ρ N c i T , c i + 1 T by D J c i T , c i + 1 T , considering that different tractable curves c ( t ) yield different approximations.
Figure 5. Quality of the D J upper bound on the Fisher–Rao distance ρ N when normal distributions have the same covariance matrix.
Figure 6. Visualizing the exponential and mixture geodesics between two bivariate normal distributions.
Figure 7. Projecting an SPD matrix P P onto N ¯ = f ( N ) : γ P ( P , P ¯ ) is orthogonal to N ¯ with respect to the trace metric.
Figure 8. Illustration of the approximation of the Fisher–Rao distance between two multivariate normals N_1 and N_2 (red geodesic γ_N(N_1, N_2)) by discretizing the curve c̄_CO ⊂ N̄, or equivalently the curve c_CO ⊂ N.
Figure 8. Illustration of the approximation of the Fisher–Rao distance between two multivariate normals N 1 and N 2 (red geodesic length γ N ( N 1 , N 2 ) by discretizing curve c ¯ CO N ¯ or equivalently curve c CO N .
Entropy 25 00654 g008
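A sketch of the projected C&O curve, reusing co_embed and rho_spd from the sketch after Figure 3: follow the SPD geodesic between the embedded matrices and pull each point back onto N̄. The paper derives the exact trace-metric geodesic projection; as an illustrative stand-in we use the simple normalization P ↦ P / P_{d+1,d+1}, which always lands on N̄ (a Schur-complement argument shows the recovered Σ is positive–definite) but is not claimed here to be the orthogonal projection:

```python
import numpy as np
from scipy.linalg import fractional_matrix_power as mpow

def spd_geodesic(P, Q, t):
    """Affine-invariant SPD geodesic P^(1/2) (P^(-1/2) Q P^(-1/2))^t P^(1/2)."""
    Ph, Pmh = mpow(P, 0.5), mpow(P, -0.5)
    return Ph @ mpow(Pmh @ Q @ Pmh, t) @ Ph

def pull_back(Pbar):
    """Map a (d+1)x(d+1) SPD matrix to N(mu, Sigma) by normalizing the
    bottom-right entry to 1 (a stand-in for the exact geodesic projection)."""
    P = Pbar / Pbar[-1, -1]
    mu = P[:-1, -1]
    return mu, P[:-1, :-1] - np.outer(mu, mu)

def c_CO(mu1, S1, mu2, S2, t):
    """Point at parameter t on the (approximately) projected C&O curve."""
    return pull_back(spd_geodesic(co_embed(mu1, S1), co_embed(mu2, S2), t))
```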
Figure 9. Diffusion tensor imaging (DTI) on a 2D grid: (a) Ellipsoids shown at the 8 × 8 grid locations with C&O curves in green, and (b) some interpolated ellipsoids are further shown along the C&O curves.
Figure 10. Examples of the projection of N(μ, Σ) onto the submanifolds M_{μ0} and M_{Σ0}. Tissot indicatrices are rendered in green at the projected normal distributions (μ0, Σ + ½(μ0 − μ)(μ0 − μ)ᵀ) and (μ, Σ0), respectively.
Figure 11. Upper bounding the Fisher–Rao distance ρ_N((μ1, Σ1), (μ2, Σ2)) (red points) using projections (green points) onto submanifolds with fixed means.
Figure 12. Geodesics and curves used to approximate the Fisher–Rao distance, with the Fisher metric shown using Tissot's indicatrices: exponential geodesic (red), mixture geodesic (blue), mid-exponential-mixture curve (purple), projected C&O curve (green), and target Fisher–Rao geodesic (black). (Visualization in the parameter space of normal distributions.)
Figure 13. Bounding ρ_N(S̄_t, S̄_{t+1}) using the triangle inequality of ρ_P in the SPD cone P(d + 1).
Figure 14. Visualizing at discrete positions (10 increment steps between 0 and 1) some curves used to approximate the Fisher–Rao distance between two bivariate normal distributions: (a) exponential geodesic c_e = γ_N^e (red); (b) mixture geodesic c_m = γ_N^m (blue); (c) mid-mixture-exponential curve c_em (purple); (d) projected Calvo and Oller curve c_CO (green); (e) ordinary linear interpolation c_λ in λ (yellow); and (f) all curves superposed.
Figure 15. Comparison of our approximation curves with the Fisher–Rao geodesic (f) obtained by geodesic shooting (Figure 5 of [39]): exponential (a) and mixture (b) geodesics, the mid-exponential-mixture curve (c), the projected C&O curve (d), and all curves superposed (e). Beware that the color coding of (a–e) is unrelated to that of (f), and the scales used to depict the ellipsoids differ.
Figure 16. Approximation of the Fisher–Rao distance obtained using the projected C&O curve when T ranges from 3 to 100 [39].
Figure 17. Approximation of the smallest enclosing Riemannian ball of a set of n bivariate normals N_i = N(μ_i, Σ_i) with respect to the C&O distance ρ_CO (the approximate circumcenter C̄_T is depicted as a red ellipse): (a) n = 8 with different covariance matrices; (b) n = 8 with identical covariance matrices, which amounts to the smallest enclosing ball of the set of n points {μ_i}; (c) n = 2 displays the midpoint of the C&O geodesic, visualized as an equivalent bivariate normal distribution in the sample space.
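One standard way to obtain such an approximate circumcenter C̄_T is a Riemannian geodesic-walk heuristic in the spirit of Badoiu and Clarkson; we offer it as a plausible sketch (reusing rho_spd and spd_geodesic from the sketches above), not as the paper's exact procedure:

```python
def approx_circumcenter(points, T):
    """Approximate minimax center of embedded SPD matrices under rho_spd:
    at step t, move from the current center toward the farthest point
    by arc-length fraction 1/(t + 1) along the SPD geodesic."""
    C = points[0]
    for t in range(1, T):
        farthest = max(points, key=lambda P: rho_spd(C, P))
        C = spd_geodesic(C, farthest, 1.0 / (t + 1))
    return C
```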
Table 1. First set of experiments demonstrates the advantage of the c_CO(t) curve.
| d | κ_CO   | κ_λ    | κ_e     | κ_m    | κ_em    |
|---|--------|--------|---------|--------|---------|
| 1 | 1.0025 | 1.0414 | 1.1521  | 1.0236 | 1.0154  |
| 2 | 1.0167 | 1.0841 | 1.1923  | 1.0631 | 1.0416  |
| 3 | 1.0182 | 1.8997 | 2.6072  | 1.9965 | 1.07988 |
| 4 | 1.0207 | 2.0793 | 1.8080  | 2.1687 | 1.1873  |
| 5 | 1.0324 | 4.1207 | 12.3804 | 5.6170 | 4.2349  |
Table 2. Comparing our Fisher–Rao approximation with the Calvo and Oller lower bound and the upper bound of [38].
| d | ρ_CO(N1, N2) | ρ̃_{c_CO}(N1, N2) | U(N1, N2) |
|---|--------------|-------------------|-----------|
| 1 | 1.7563       | 1.8020            | 3.1654    |
| 2 | 3.2213       | 3.3194            | 6.012     |
| 3 | 4.6022       | 4.7642            | 8.7204    |
| 4 | 5.9517       | 6.1927            | 11.3990   |
| 5 | 7.156        | 7.3866            | 13.8774   |
Table 3. Second set of experiments shows the limitations of the c_CO(t) curve.
| d  | κ_CO   | κ_λ    | κ_e    | κ_m    |
|----|--------|--------|--------|--------|
| 1  | 1.0569 | 1.1405 | 1.139  | 1.0734 |
| 5  | 1.1599 | 1.4696 | 1.5201 | 1.1819 |
| 10 | 1.2180 | 1.6963 | 1.7887 | 1.2184 |
| 11 | 1.2260 | 1.7333 | 1.8285 | 1.2235 |
| 12 | 1.2301 | 1.7568 | 1.8539 | 1.2282 |
| 15 | 1.2484 | 1.8403 | 1.9557 | 1.2367 |
| 20 | 1.2707 | 1.9519 | 2.0851 | 1.2466 |