Next Article in Journal
Frequency Seismic Response for EEWS Testing on Uniaxial Shaking Table
Next Article in Special Issue
Geometric Structures Induced by Deformations of the Legendre Transform
Previous Article in Journal
A Lossless-Recovery Secret Distribution Scheme Based on QR Codes
Previous Article in Special Issue
A Dually Flat Embedding of Spacetime
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

A Simple Approximation Method for the Fisher–Rao Distance between Multivariate Normal Distributions

Sony Computer Science Laboratories, Tokyo 141-0022, Japan
Entropy 2023, 25(4), 654; https://doi.org/10.3390/e25040654
Submission received: 27 February 2023 / Revised: 6 April 2023 / Accepted: 12 April 2023 / Published: 13 April 2023
(This article belongs to the Special Issue Information Geometry and Its Applications)

Abstract

:
We present a simple method to approximate the Fisher–Rao distance between multivariate normal distributions based on discretizing curves joining normal distributions and approximating the Fisher–Rao distances between successive nearby normal distributions on the curves by the square roots of their Jeffreys divergences. We consider experimentally the linear interpolation curves in the ordinary, natural, and expectation parameterizations of the normal distributions, and compare these curves with a curve derived from the Calvo and Oller’s isometric embedding of the Fisher–Rao d-variate normal manifold into the cone of ( d + 1 ) × ( d + 1 ) symmetric positive–definite matrices. We report on our experiments and assess the quality of our approximation technique by comparing the numerical approximations with both lower and upper bounds. Finally, we present several information–geometric properties of Calvo and Oller’s isometric embedding.

Graphical Abstract

1. Introduction

1.1. The Fisher–Rao Normal Manifold

Let Sym ( d ) be the set of d × d symmetric matrices with real entries and P ( d ) Sym ( d ) denote the set of symmetric positive–definite d × d matrices that forms a convex regular cone. Let us denote by N ( d ) = { N ( μ , Σ ) : ( μ , Σ ) Λ ( d ) = R d × P ( d ) } the set of d-variate normal distributions, MultiVariate Normals or MVNs for short, also called Gaussian distributions. A MVN distribution N ( μ , Σ ) has probability density function (pdf) on the support R d :
p λ = ( μ , Σ ) ( x ) = ( 2 π ) d 2 | Σ | 1 2 exp 1 2 ( x μ ) Σ 1 ( x μ ) , x R d ,
where | M | = det ( M ) denotes the determinant of matrix M.
The statistical model N ( d ) is of dimension m = dim ( Λ ( d ) ) = d + d ( d + 1 ) 2 = d ( d + 3 ) 2 since it is identifiable, i.e., there is a one-to-one correspondence λ p λ ( x ) between λ Λ ( d ) and N ( μ , Σ ) N ( d ) . The statistical model N ( d ) is said to be regular since the second-order derivatives 2 p λ λ i λ j and third-order derivatives 3 p λ λ i λ j λ k are smooth functions (defining the metric and cubic tensors in information geometry [1]), and the set of first-order partial derivatives p λ λ 1 , , p λ λ 1 are linearly independent.
Let Cov ( X ) denote the covariance of X (variance when X is scalar). A matrix M is a semi-positive–definite if and only if x 0 , x M x 0 . The Fisher information matrix [1,2] (FIM) is the following symmetric semi-positive–definite matrix:
I ( λ ) = Cov [ log p λ ( x ) ] 0 .
For regular statistical models { p λ } , the FIM is positive–definite: I ( λ ) 0 , i.e., x 0 , x I ( λ ) x > 0 . M 1 M 2 denotes Löwner partial ordering, i.e., the fact that M 1 M 2 is positive–definite.
The FIM is covariant under the reparameterization of the statistical model [2]. That is, let θ ( λ ) be a new parameterization of the MVNs. Then we have:
I θ ( λ ) = λ θ × I λ ( λ ( θ ) ) × λ θ .
For example, we may parameterize univariate normal distributions by λ = ( μ , σ 2 ) or θ = ( μ , σ ) . We obtain the following Fisher information matrices for these parameterizations:
I λ ( λ ( μ , σ ) ) = 1 σ 2 0 0 1 2 σ 4 and I θ ( θ ( μ , σ ) ) = 1 σ 2 0 0 1 2 σ 2 .
In higher dimensions, parameterization λ = ( μ , σ 2 ) corresponds to the parameterization ( μ , Σ ) while parameterization θ = ( μ , L ) where Σ = L L is the unique Cholesky decomposition with L GL ( d ) , the group of invertible d × d matrices. Another useful parameterization for optimization is the log–Cholesky parameterization [3] ( η = ( μ , log σ 2 ) R 2 for univariate normal distributions) which ensures that a gradient descent always stays in the domain. The Fisher information matrix with respect to the log–Cholesky parameterization is I η ( η ( μ , σ ) ) = 1 σ 2 0 0 2 with η ( μ , σ ) R 2 .
Since the statistical model N ( d ) is identifiable and regular, the Fisher information matrix can be written equivalently as follows [2,4]:
I ( μ , Σ ) = Cov [ log p ( μ , Σ ) ] = E p ( μ , Σ ) log p ( μ , Σ ) log p ( μ , Σ ) ,
                                                                      = E p ( μ , Σ ) 2 log p ( μ , Σ ) .
For multivariate distributions parameterized by a m-dimensional vector (with m = d ( d + 3 ) 2 )
θ = ( θ 1 , , θ d , θ d + 1 , , θ m ) R m ,
with μ = ( θ 1 , , θ d ) and Σ ( θ ) = vech ( θ d + 1 , , θ m ) (inverse half-vectorization of matrices [5]), we have [6,7,8,9]:
I ( θ ) = [ I i j ( θ ) ] , with I i j ( θ ) = μ θ i Σ 1 μ θ j + 1 2 tr Σ 1 μ θ i Σ 1 μ θ j .
By equipping the regular statistical model N ( d ) with the Fisher information metric
g N Fisher ( μ , Σ ) = Cov [ log p ( μ , Σ ) ( x ) ]
we obtain a Riemannian manifold M = M N called the Fisher–Rao Gaussian or normal manifold [6,7]. The tangent space T N M is identified with the product space R d × Sym ( d ) . Let { μ , Σ } be a natural vector basis in T N M , and denote by [ v ] and [ V ] the vector components in that natural basis. We have
g ( μ , Σ ) Fisher ( ( v 1 , V 1 ) , ( v 2 , V 2 ) ) = ( v 1 , V 1 ) , ( v 2 , V 2 ) ( μ , Σ ) , = [ v 1 ] Σ 1 [ v 2 ] + 1 2 tr Σ 1 [ V 1 ] Σ 1 [ V 2 ] .
The induced Riemannian geodesic distance ρ N ( · , · ) is called the Rao distance [10] or the Fisher–Rao distance [11,12]:
ρ N ( N ( λ 1 ) , N ( λ 2 ) ) = inf c ( t ) c ( 0 ) = p λ 1 c ( 1 ) = p λ 2 Length ( c ) ,
where the Riemannian length of any smooth curve c ( t ) M is defined by
Length ( c ) = 0 1 c ˙ ( t ) , c ˙ ( t ) c ( t ) d t = 0 1 d s N ( t ) d t = 0 1 c ˙ ( t ) c ( t ) d t ,
where ˙ = d d t denotes the derivative with respect to parameter t, d s N ( t ) is the Riemannian length element of ( M , g N Fisher ) and · c ( t ) = · , · c ( t ) . We also write ρ N ( p λ 1 , p λ 2 ) for ρ N ( N ( λ 1 ) , N ( λ 2 ) ) .
The minimizing curve c ( t ) = γ N ( p λ 1 , p λ 2 ; t ) of Equation (3) is called the Fisher–Rao geodesic. The Fisher–Rao geodesic is also an autoparallel curve [2] with respect to the Levi–Civita connection N Fisher induced by the Fisher metric g N Fisher .
Remark 1. 
If we consider the Riemannian manifold ( M , β g ) for β > 0 instead of ( M , g ) then the length element d s is scaled by β : d s β g = β d s g . It follows that the length of a curve c becomes
Length β g ( c ) = β Length g ( c ) .
However, the geodesics joining any two points p 1 and p 2 of M are the same: γ β g ( p 1 , p 2 ; t ) = γ g ( p 1 , p 2 ; t ) (with γ g ( p 1 , p 2 ; 0 ) = p 1 and γ g ( p 1 , p 2 ; 1 ) = p 2 ).
Historically, Hotelling [13] first used this Fisher Riemannian geodesic distance in the late 1920s. From the viewpoint of information geometry [1], the Fisher metric is the unique Markov invariant metric up to rescaling [14,15,16]. The counterpart to the Fisher metric on the compact manifold has been reported in [17], proving its uniqueness under the action of the diffeomorphism group. The Fisher–Rao distance has been used to design statistical hypothesis testing [18,19,20,21], to measure the distance between the prior and posterior distributions in Bayesian statistics [22], in clustering [23,24], in signal processing [25,26,27,28], and in deep learning [29], just to mention a few.
The squared line element induced by the Fisher metric of the multivariate normal family [6,7] is
d s N 2 ( μ , Σ ) = d μ d Σ I ( μ , Σ ) d μ d Σ , = d μ Σ 1 d μ + 1 2 tr Σ 1 d Σ 2 .
There are many ways to calculate the FIM/length element for multivariate normal distributions [7,9]. Let us give a simple approach based on the fact that the family N ( d ) of normal distributions forms a regular exponential family [30]:
N ( d ) = p θ ( λ ) = exp θ v ( μ ) , x + θ M ( Σ ) , x x F N ( θ v , θ M ) ,
with θ ( λ ) = ( θ v = ( Σ 1 μ , θ M = 1 2 Σ 1 ) the natural parameters and log-partition function (also called cumulant function)
F N ( θ ) = 1 2 d log π log | θ M | + 1 2 θ v θ M 1 θ v .
The vector inner product is v 1 , v 2 = v 1 v 2 , and the matrix inner product is M 1 , M 2 = tr ( M 1 M 2 ) . The exponential family is said to be regular when the natural parameter space is open. Using Equation (2), it follows that the MVN FIM is I θ ( θ ) = E [ 2 log p θ ] = 2 F ( θ ) . This proves that the FIM is well-defined, i.e., ( I θ ( θ ) ) i j < . As an exponential family [1], we also have I θ ( θ ) = E [ t ( x ) ] , where t ( x ) = ( x , x x ) is the sufficient statistic. Thus, the Fisher metric is a Hessian metric [31]. Let F N ( θ v , θ M ) = F v ( θ v ) + F M ( θ M ) with F v ( θ v ) = 1 2 d log π + 1 2 θ v θ M 1 θ v and F M ( θ M ) = 1 2 log | θ M | . We obtain the following block-diagonal expression of the FIM:
I ( θ ( λ ) ) = 2 F N ( θ ( μ , Σ ) ) = Σ 1 0 0 1 2 θ M 2 log | 1 2 Σ 1 | .
Therefore d s N 2 ( μ , Σ ) = d s v 2 + d s M 2 with d s v 2 ( μ ) = d μ Σ 1 d μ and d s M 2 ( Σ ) = 1 2 tr Σ 1 d Σ 2 . Let us note in passing that θ M 2 log | θ M | is a fourth order tensor [4].
The family N ( d ) can also be considered to be an elliptical family [32], thus highlighting the affine-invariance property of the Fisher information metric. That is, the Fisher metric is invariant with respect to affine transformations [33]: Let ( a , A ) be an element of the affine group Aff ( d ) with a R d and A GL ( d ) . The group identity element of Aff ( d ) is e = ( 0 , I ) and the group operation is ( a 1 , A 1 ) . ( a 2 , A 2 ) = ( a 1 + A 1 a 2 , A 1 A 2 ) with inverse ( a , A ) 1 = ( A 1 a , A 1 ) ). Then we have
Property 1 
(Fisher–Rao affine invariance). For all A GL ( d ) , a R d , we have
ρ N ( N ( A μ 1 + a , A Σ 1 A ) , N ( A μ 2 + a , A Σ 2 A ) ) = ρ N ( N ( μ 1 , Σ 1 ) , N ( μ 2 , Σ 2 ) ) .
This can be proven by checking that d s N ( μ , Σ ) = d s N ( μ , Σ ) where μ = A μ + a and Σ = A Σ 2 A . It follows that we can reduce the calculation of the Fisher–Rao distance to a canonical case where one argument is N std = N ( 0 , I ) , the standard d-variate distribution:
ρ N ( N ( μ 1 , Σ 1 ) , N ( μ 2 , Σ 2 ) ) = ρ N N std , N Σ 1 1 2 ( μ 2 μ 1 ) , Σ 1 1 2 Σ 2 Σ 1 1 2 , = ρ N N Σ 2 1 2 ( μ 1 μ 2 ) , Σ 2 1 2 Σ 1 Σ 2 1 2 , N std ,
where Σ p is the fractional matrix power which can be calculated from the Singular Value Decomposition O D O of Σ (where O is an orthogonal matrix and D = diag ( λ 1 , , λ d ) a diagonal matrix): Σ p = O D p O with D p = diag ( λ 1 p , , λ d p ) .
The family of normal elliptical distributions can be obtained from the standard normal distribution by the action of the affine group [12,32] Aff ( d ) :
N ( μ , Σ ) = ( μ , Σ 1 2 ) . N std = N ( ( μ , Σ 1 2 ) . ( 0 , I ) ) .

1.2. Fisher–Rao Distance between Normal Distributions: Some Subfamilies with Closed-Form Formula

In general, the Fisher–Rao distance ρ N ( N 1 , N 2 ) between two multivariate normal distributions N 1 and N 2 is not known in closed form [34,35,36,37], and several lower and upper bounds [38], and numerical techniques such as the geodesic shooting [39,40,41] have been investigated. See [42] for a recent review. Unfortunately, the geodesic shooting (GS) approach is time-consuming and numerically unstable for large Fisher–Rao distances [21,42]. In 3D Diffusion Tensor Imaging (DTI), 3 × 3 covariance matrices Σ i , j , k are stored a 3D grid locations μ i , j , k thus generating 3D MVNs N i , j , k = N ( μ i , j , k , Σ i , j , k ) with means μ i , j , k regularly spaced to each others. The Fisher–Rao distances can be calculated between an MVN N i , j , k and another MVN N i , j , k in a neighborhood of N i , j , k (using 6- or 26-neighborhood) using geodesic shooting. For larger Fisher–Rao distances between non-neighbors MVNs, we can use the shortest path distance using Dijkstra’s algorithm [43] on the graph induced by the MVNs with edges between adjacent MVNs weighted by their Fisher–Rao distances.
The two main difficulties with calculating the Fisher–Rao distance are
  • to know explicitly the expression of the Riemannian Fisher–Rao geodesic γ N FR ( p λ 1 , p λ 2 ; t ) and
  • to integrate, in closed form, the length element d s N along this Riemannian geodesic.
Please note that the Fisher–Rao geodesics [1] γ N FR ( p λ 1 , p λ 2 ; t ) are parameterized by constant speed (i.e., μ ˙ ( t ) = μ ˙ ( 0 ) and Σ ˙ ( t ) = Σ ˙ ( 0 ) ), or equivalently parametrized using the arc length:
ρ N γ N FR ( p λ 1 , p λ 2 ; s ) , γ N FR ( p λ 1 , p λ 2 ; t ) = | s t | ρ N ( p λ 1 , p λ 2 ) , s , t [ 0 , 1 ] .
However, in several special cases, the Fisher–Rao distance between normal distributions belonging to restricted subsets of N is known.
Three such prominent cases are (see [42] for other cases)
  • when the normal distributions are univariate ( d = 1 ),
  • when we consider the set N μ = { N ( μ , Σ ) : Σ P ( d ) } M N of normal distributions sharing the same mean μ (with the embedded submanifold S μ M ), and
  • when we consider the set N Σ = { N ( μ , Σ ) : Σ P ( d ) } N of normal distributions sharing the same covariance matrix Σ (with the corresponding embedded submanifold S Σ M ).
Let us report the formula of the Fisher–Rao distance in these three cases:
  • In the univariate case N ( 1 ) , the Fisher–Rao distance between N 1 = N ( μ 1 , σ 1 2 ) and N 2 = N ( μ 2 , σ 2 2 ) can be derived from the hyperbolic distance [44] expressed in the Poincaré upper space since we have
    d s N 2 = g ( μ , σ ) ( d μ , d σ ) = d μ 2 + 2 d σ 2 σ 2 = 2 d μ 2 2 + d σ 2 σ 2 = 2 d x 2 + d y 2 y 2 = d s Poincaré 2 ,
    where x = μ 2 and y = σ . It follows that
    ρ N ( N 1 , N 2 ) = 2 ρ Poincaré ( ( x 1 , y 1 ) , ( x 2 , y 2 ) ) = 2 ρ Poincaré μ 1 2 , σ 1 , μ 2 2 , σ 2 .
    Thus, we have the following expression for the Fisher–Rao distance between univariate normal distributions:
    ρ N ( N 1 , N 2 ) = 2 log 1 + Δ ( μ 1 , σ 1 ; μ 2 , σ 2 ) 1 Δ ( μ 1 , σ 1 ; μ 2 , σ 2 ) ,
    with
    Δ ( a , b ; c , d ) = ( c a ) 2 + 2 ( d b ) 2 ( c a ) 2 + 2 ( d + b ) 2 , ( a , b , c , d ) R 4 \ { 0 } .
    In particular, we have
    Δ ( a , b ; a , d ) = d b d + b when a = c (same mean),
    Δ ( a , b ; c , b ) = 1 1 + 8 b 2 ( c a ) 2 when b = d (same variance),
    Δ ( 0 , 1 ; c , d ) = c 2 + 2 ( d 1 ) 2 c 2 + 2 ( d + 1 ) 2 when a = 0 and b = 1 (standard normal).
    In 1D, the affine-invariance property (Property 1) extends to function Δ as follows:
    Δ ( μ 1 , σ 1 ; μ 2 , σ 2 ) = Δ 0 , 1 ; μ 2 μ 1 σ 1 , σ 2 σ 1 = Δ μ 1 μ 2 σ 2 , σ 1 σ 2 ; 0 , 1 .
    Using one of the many identities between inverse hyperbolic functions (e.g., arctanh, arccosh, arcsinh), we can obtain an equivalent formula for Equation (7). For example, since arctanh ( u ) : = 1 2 log 1 + u 1 u for 0 < u < 1 , we have equivalently:
    ρ N ( N 1 , N 2 ) = 2 2 arctanh ( Δ ( μ 1 , σ 1 ; μ 2 , σ 2 ) ) .
    The Fisher–Rao geodesics are semi-ellipses with centers located on the x-axis. See Appendix A.1 for the parametric equations of Fisher–Rao geodesics between univariate normal distributions. Figure 1 displays four univariate normal distributions with their pairwise geodesics and Fisher–Rao distances.
    Using the identity arctanh u 2 1 u 2 + 1 = arccosh 1 + u 2 2 u with arccosh ( x ) : = log ( x + x 2 1 ) , we also have
    ρ N ( N 1 , N 2 ) = 2 2 arccosh 1 ( 1 Δ ( μ 1 , σ 1 ; μ 2 , σ 2 ) ) ( 1 + Δ ( μ 1 , σ 1 ; μ 2 , σ 2 ) ) ,
    Since the inverse hyperbolic cosecant (CSC) function is defined by arccsch ( u ) : = arccosh ( 1 / u ) , we further obtain
    ρ N ( N 1 , N 2 ) = 2 2 arccsch ( 1 Δ ( μ 1 , σ 1 ; μ 2 , σ 2 ) ) ( 1 + Δ ( μ 1 , σ 1 ; μ 2 , σ 2 ) ) ,
    We can also write
    ρ N ( N 1 , N 2 ) = 2 arccosh 1 + ( μ 2 μ 1 ) 2 + 2 ( σ 2 σ 1 ) 2 4 σ 1 σ 2
    Thus, using the many-conversions formula between inverse hyperbolic functions, we obtain many equivalent different formulas of the Fisher–Rao distance, which are used in the literature.
  • In the second case, the Fisher–Rao distance between N 1 = N ( μ , Σ 1 ) and N 2 = N ( μ , Σ 2 ) has been reported in [6,7,45,46,47]:
    ρ N μ ( N 1 , N 2 ) = 1 2 i = 1 d log 2 λ i ( Σ 1 1 Σ 2 ) ,
    = ρ N μ ( Σ 1 , Σ 2 ) ,
    where λ i ( M ) denotes the i-th generalized largest eigenvalue of matrix M, where the generalized eigenvalues are solutions of the equation | Σ 1 λ Σ 2 | = 0 . Let us notice that ρ N μ ( ( μ , Σ 1 ) , ( μ , Σ 2 ) ) = ρ N μ ( ( μ , Σ 1 1 ) , ( μ , Σ 2 1 ) ) since λ i ( Σ 2 1 Σ 1 ) = 1 λ i ( Σ 1 1 Σ 2 ) and log 2 λ i ( Σ 2 1 Σ 1 ) = ( log λ i ( Σ 1 1 Σ 2 ) ) 2 = log 2 λ i ( Σ 1 1 Σ 2 ) . Matrix Σ 1 1 Σ 2 may not be SPD and thus the λ i ’s are generalized eigenvalues. We may consider instead the SPD matrix Σ 1 1 2 Σ 2 Σ 1 1 2 which is SPD and such that λ i ( Σ 1 1 Σ 2 ) = λ i ( Σ 1 1 2 Σ 2 Σ 1 1 2 ) . The Fisher–Rao distance of Equation (11) can be equivalently written [48] as
    ρ N μ ( N 1 , N 2 ) = 1 2 Log Σ 1 1 2 Σ 2 Σ 1 1 2 F ,
    where Log ( M ) is the matrix logarithm (unique when M is SPD) and M F = i , j M i , j 2 = tr ( M M ) is the matrix Fröbenius norm. This metric distance between SPD matrices although first studied by Siegel [45] in 1964 was rediscovered and analyzed recently in [49] (2003). Let ρ SPD ( P 1 , P 2 ) = i = 1 d log 2 λ i ( P 1 1 P 2 ) so that ρ N μ ( N ( μ , P 1 ) , N ( μ , P 2 ) ) = 1 2 ρ SPD ( P 1 , P 2 ) .
    The Riemannian SPD distance ρ SPD enjoys the following well-known invariance properties:
    Invariance by congruence transformation:
    X GL ( d ) , ρ SPD ( X P 1 X , X P 2 X ) = ρ SPD ( P 1 , P 2 ) ,
    Invariance by inversion:
    P 1 , P 2 P ( d ) , ρ ( P 1 1 , P 2 1 ) = ρ SPD ( P 1 , P 2 ) .
    Let P 1 = L 1 L 1 be the Cholesky decomposition (unique when P 1 0 ). Then apply the congruence invariance for X = L 1 1 :
    ρ SPD ( P 1 , P 2 ) = ρ SPD ( L 1 1 P 1 ( L 1 1 ) , L 1 1 P 2 ( L 1 1 ) ) = ρ SPD ( I , L 1 1 P 2 ( L 1 1 ) ) .
    We can also consider the factorization P 1 = S 1 S 1 where S 1 = P 1 1 2 is the unique symmetric square root matrix [50]. Then we have
    ρ SPD ( P 1 , P 2 ) = ρ SPD ( S 1 1 P 1 ( S 1 1 ) , S 1 1 P 2 ( S 1 1 ) ) = ρ SPD ( I , S 1 1 P 2 ( S 1 1 ) ) .
  • The Fisher–Rao distance between N 1 = N ( μ 1 , Σ ) and N 2 = N ( μ 2 , Σ ) has been reported in closed form [42] (Proposition 3). The method is described with full details in Appendix B. We present a simpler scheme based on the inverse Σ 1 2 of the symmetric square root factorization [50] of Σ = Σ 1 2 Σ 1 2 (ith ( Σ 1 2 ) = Σ 1 2 ). Let us use the affine-invariance property of the Fisher–Rao distance under the affine transformation Σ 1 2 and then apply affine invariance under translation as follows:
    ρ N ( N ( μ 1 , Σ ) , N ( μ 2 , Σ ) ) = ρ N ( N ( Σ 1 2 μ 1 , Σ 1 2 Σ Σ 1 2 ) , N ( Σ 1 2 μ 2 , Σ 1 2 Σ Σ 1 2 ) ) , = ρ N ( N ( 0 , I ) , N ( Σ 1 2 ( μ 2 μ 1 ) , I ) ) , = ρ N ( N ( 0 , 1 ) , N ( Σ 1 2 ( μ 2 μ 1 ) 2 , 1 ) ) .
    The right-hand side Fisher–Rao distance is computed from Equation (7) and justified by the method [42] (Proposition 3) described in Appendix B using a rotation matrix R with R R = I so that
    ρ N ( N ( 0 , I ) , N ( Σ 1 2 ( μ 2 μ 1 ) , I ) ) = ρ N ( N ( 0 , I ) , N ( R Σ 1 2 ( μ 2 μ 1 ) , R I R ) ) , = ρ N ( N ( 0 , I ) , Σ 1 2 ( μ 2 μ 1 ) 2 , I ) ) .
    Then we apply the formula of Equation (23) of [42]. Section 1.5 shall report a simpler closed-form formula by proving that the Fisher–Rao distance between N ( μ 1 , Σ ) and N ( μ 2 , Σ ) is a scalar function of their Mahalanobis distance [51] using the algebraic method of maximal invariants [52].

1.3. Fisher–Rao Distance: Totally versus Non-Totally Geodesic Submanifolds

Consider N = { N ( λ ) : λ Λ } N a statistical submodel of the MVN statistical model N . Using the Fisher information matrix I λ ( λ ) , we obtain the intrinsic Fisher–Rao manifold M = M N . We may also consider M to be an embedded submanifold of M . Let us write S = S N M the embedded submanifold.
A totally geodesic submanifold S M is such that the geodesics γ M ( N 1 , N 2 ; t ) fully stay in M for any pair of points N 1 , N 2 N . For example, the submanifold M μ = { N ( μ , Σ ) : Σ P ( d ) } M of MVNs with fixed mean μ is a totally geodesic submanifold [53] of M but the submanifold M Σ = { N ( μ , Σ ) : μ R d } M of MVNs sharing the same covariance matrix Σ is not totally geodesic. When an embedded submanifold S M is totally geodesic, we always have ρ M ( N 1 , N 2 ) = ρ S ( N 1 , N 2 ) . Thus, we have ρ N ( N ( μ , Σ 1 ) , N ( μ , Σ 2 ) ) = ρ SPD ( Σ 1 , Σ 2 ) . However, when an embedded submanifold S M is not totally geodesic, we have ρ M ( N 1 , N 2 ) ρ S ( N 1 , N 2 ) because the Riemannian geodesic length in S is necessarily longer or equal than the Riemannian geodesic length in M . The merit to consider submanifolds is to be able to calculate in closed form the Fisher–Rao distance which may then provide an upper bound on the Fisher–Rao distance for the full statistical model. For example, consider N 1 = N ( μ 1 , Σ ) and N 2 = N ( μ 2 , Σ ) in M Σ , a non-totally geodesic submanifold. The Rao distance between N 1 and N 2 in M is upper bounded by the Riemannian distance in M Σ (with line element d s Σ 2 = d μ Σ 1 d μ ) which corresponds to the Mahalanobis distance [10,51] Δ Σ ( μ 1 , μ 2 ) :
ρ M μ ( N 1 , N 2 ) Δ Σ ( μ 1 , μ 2 ) : = ( μ 2 μ 1 ) Σ 1 ( μ 2 μ 1 ) .
The Mahalanobis distance can be interpreted as the Euclidean distance D E ( p , q ) = Δ I ( p , q ) = ( p q ) ( p q ) (where I denotes the identity matrix) after an affine transformation: Let Σ = L L = U U be the Cholesky decomposition of Σ 0 with L a lower triangular matrix or U = L an upper triangular matrix. Then we have
Δ Σ ( μ 1 , μ 2 ) = ( μ 2 μ 1 ) ( L ) 1 L 1 ( μ 2 μ 1 ) , = Σ 1 2 ( μ 2 μ 1 ) 2 , = Δ I ( L 1 μ 1 , L 1 μ 2 ) = D E ( L 1 μ 1 , L 1 μ 2 ) ,
where · 2 denotes the vector 2 -norm.
The Rao distance ρ Σ of Equation (A1) between two MVNs with fixed covariance matrix emanates from the property that the submanifold M [ v ] , Σ = { N ( a v , Σ ) : a R } is totally geodesic [54].
Let us emphasize that for a submanifold S M to be totally geodesic or not depend on the underlying metric in M . The same subset N N with N equipped with two different metrics g 1 and g 2 can be totally geodesic regarding g 1 and non-totally geodesic regarding g 2 . See Remark 3 for such an example.
In general, using the triangle inequality of the Riemannian metric distance ρ N , we can upper bound ρ N ( N 1 , N 2 ) with N 1 = ( μ 1 , Σ 1 ) and N 1 = ( μ 2 , Σ 2 ) as follows:
ρ N ( N 1 , N 2 ) ρ M μ 1 ( N 1 , N 12 ) + ρ M Σ 2 ( N 12 , N 2 ) , ρ M Σ 1 ( N 1 , N 21 ) + ρ M μ 2 ( N 21 , N 2 ) ,
where N 12 = ( μ 1 , Σ 2 ) and N 21 = N ( μ 2 , Σ 1 ) . See Figure 2 for an illustration of the Fisher–Rao geodesic triangle N 1 , N 2 , N 12 . Furthermore, since ρ N Σ 1 ( N 1 , N 21 ) Δ Σ 1 ( μ 1 , μ 2 ) and ρ N Σ 2 ( N 12 , N 2 ) Δ Σ 2 ( μ 1 , μ 2 ) , we obtain the following upper bound on the Rao distance between MVNs:
ρ N ( N 1 , N 2 ) ρ P ( Σ 1 , Σ 2 ) + min { Δ Σ 1 ( μ 1 , μ 2 ) , Δ Σ 2 ( μ 1 , μ 2 ) } .
See also [55].
In general, the difficulty with calculating the Fisher–Rao distance comes from the fact that
  • we do not know the Fisher–Rao geodesics with boundary value conditions (BVP) in closed form but the geodesics with initial value conditions [48] (IVP) are known explicitly using the natural parameters ( Σ 1 μ , Σ 1 ) of MVNs,
  • we must integrate the line element d s N along the geodesic.
As we shall see in Section 3.1, the above first problem is much harder to solve than the second problem which can be easily approximated by discretizing the curve. The lack of a closed-form formula and fast and good approximations for ρ N between MVNs is a current limiting factor for its use in applications. Indeed, many applications (e.g., [56,57]) consider the restricted case of the Rao distance between zero-centered MVNs which have closed form (distance of Equation (11) in the SPD cone). The SPD cone is a symmetric Hadamard manifold, and its isometries have been fully studied and classified in [58] (Section 4). The Fisher–Rao geometry of zero-centered generalized MVNs was recently studied in [59].

1.4. Contributions and Paper Outline

The main contribution of this paper is to propose an approximation of ρ N based on Calvo and Oller’s embedding [19] (C&O for short) and report its experimental performance. First, we concisely recall C&O’s family of embeddings f β of N ( d ) as submanifolds N ¯ β of P ( d + 1 ) in Section 2. Next, we present our approximation technique in Section 3 which differs from the usual geodesic shooting approach [39], and report experimental results. Finally, we study some information–geometric properties [1] of the isometric embedding in Section 5 such as the fact that it preserves mixture geodesics (embedded C&O submanifold is autoparallel with respect to the mixture affine connection) but not exponential geodesics. Moreover, we prove that the Fisher–Rao distance between multivariate normal distributions sharing the same covariance matrix is a scalar function of their Mahalanobis distance in Section 1.5 using the framework of Eaton [52] of maximal invariants.

1.5. A Closed-Form Formula for the Fisher–Rao Distance between Normal Distributions Sharing the Same Covariance Matrix

Consider the Fisher–Rao distance between N 1 = ( μ 1 , Σ ) and N 1 = ( μ 2 , Σ ) for a fixed covariance matrix Σ and the translation action a . μ : = μ + a of the translation group R d (a subgroup of the affine group). Both the Fisher–Rao distance and the Mahalanobis distance are invariant under translations:
ρ N ( ( μ 1 + a , Σ ) , ( μ 2 + a , Σ ) ) = ρ N ( ( μ 1 , Σ ) , ( μ 2 , Σ ) ) , Δ Σ ( μ 1 + a , μ 2 + a ) = Δ Σ ( μ 1 , μ 2 ) .
To prove that ρ N ( ( μ 1 , Σ ) , ( μ 2 , Σ ) ) = h FR ( Δ Σ ( μ 1 , μ 2 ) ) for a scalar function h FR ( · ) , we shall prove that the Mahalanobis distance is a maximal invariant, and use the framework of maximal invariants of Eaton [52] (Chapter 2) who proved that any other invariant function is necessarily a function of a maximal invariant, i.e., a function of the Mahalanobis distance in our case.
The Mahalanobis distance is a maximal invariant because we can write Δ Σ ( μ 1 , μ 2 ) = Δ 1 ( 0 , Δ Σ ( μ 1 , μ 2 ) ) and when Δ Σ ( μ 1 , μ 2 ) = Δ Σ ( μ 1 , μ 2 ) in 1D there exists a R such that ( μ 1 + a , μ 2 + a ) = ( μ 1 , μ 2 ) . We must prove equivalently that when | m 1 m 2 | = | m 1 m 2 | that there exists a R such that ( m 1 + a , m 2 + a ) = ( m 1 , m 2 ) . Assume without loss of generality that m 1 m 2 . When m 1 m 2 = m 1 m 2 , there exists a = m 1 m 1 so that m 1 = a . m 1 = m 1 + a and m 2 = a . m 2 = m 2 + a with m 1 m 2 = m 1 m 2 . Thus, using Eaton’s theorem [52], there exists a scalar function h FR such that ρ N ( ( μ 1 , Σ ) , ( μ 2 , Σ ) ) = h FR ( Δ Σ ( μ 1 , μ 2 ) ) .
To find explicitly the scalar function h FR ( · ) , let us consider the univariate case of normal distributions for which the Fisher–Rao distance is given in closed form in Equation (7). In that case, the univariate Mahalanobis distance is Δ σ 2 ( μ 1 , μ 2 ) = ( μ 2 μ 1 ) ( σ 2 ) 1 ( μ 2 μ 1 ) = | μ 2 μ 1 | σ and we can write formula of Equation (7) as h FR ( Δ σ 2 ( μ 1 , μ 2 ) ) with
h FR ( u ) = 2 log 8 + u 2 + u 8 + u 2 u ,
  = 2 arccosh 1 + 1 4 u 2 ,
using the identities
log ( x ) = arccosh 1 + x 2 2 x = arctanh x 2 1 1 + x 2 , x > 1 ,
where arctanh ( u ) = 1 2 log 1 + u 1 u .
Proposition 1. 
The Fisher–Rao distance ρ N ( ( μ 1 , Σ ) , ( μ 2 , Σ ) ) between two MVNs with same covariance matrix is
ρ N ( ( μ 1 , Σ ) , ( μ 2 , Σ ) ) = ρ N ( ( 0 , 1 ) , ( Δ Σ ( μ 1 , μ 2 ) , 1 ) ) ,
                                                                                                                              = 2 log 8 + Δ Σ 2 ( μ 1 , μ 2 ) + Δ Σ ( μ 1 , μ 2 ) 8 + Δ Σ 2 ( μ 1 , μ 2 ) Δ Σ ( μ 1 , μ 2 ) ,
                                                                                  = 2 arccosh 1 + 1 4 Δ Σ 2 ( μ 1 , μ 2 ) ,
where Δ Σ ( μ 1 , μ 2 ) = ( μ 2 μ 1 ) Σ 1 ( μ 2 μ 1 ) is the Mahalanobis distance.
Indeed, notice that the d-variate Mahalanobis distance Δ Σ ( μ 1 , μ 2 ) can be interpreted as a univariate Mahalanobis distance between the standard normal distribution N ( 0 , 1 ) and N ( Δ Σ ( μ 1 , μ 2 ) , 1 ) :
Δ Σ ( μ 1 , μ 2 ) = Δ 1 ( 0 , Δ Σ ( μ 1 , μ 2 ) ) .
Thus, we have ρ N ( ( μ 1 , Σ ) , ( μ 2 , Σ ) ) = ρ N ( ( 0 , 1 ) , ( Δ Σ ( μ 1 , μ 2 ) , 1 ) ) , where the right-hand-side term is the univariate Fisher–Rao distance of Equation (7). Let us notice that the square length element on M Σ is d s 2 = d μ Σ 1 d μ = Δ Σ 2 ( μ , μ + d μ ) . This result can be extended to elliptical distributions [12] (Theorem 1).
Let us corroborate this result by checking the formula of Equation (1) with two examples in the literature: In [38] (Figure 4), we Fisher–Rao distance between N 1 = ( 0 , I ) and N 2 = 1 2 1 2 , I is studied. We find ρ N ( N 1 , N 2 ) = 0.69994085 in accordance with their result shown in Figure 4. The second example is Example 1 of [42] (p. 11) with N 1 = 1 0 , Σ and N 2 = 6 3 , Σ for Σ = 1.1 0.9 0.9 1.1 . Formula of Equation (18) yields the Fisher–Rao distance 5.006483034546878 in accordance with [42] which reports 5.00648 .
Similarly, the statistical Ali–Silvey–Csiszár f-divergences [60,61]
I f [ p ( μ 1 , Σ ) : p ( μ 2 , Σ ) ] = R d p ( μ 1 , Σ ) ( x ) f p ( μ 2 , Σ ) p ( μ 1 , Σ ) d x ,
between two MVNs sharing the same covariance matrix are increasing functions of the Mahalanobis distance because the f-divergences between two MVNs sharing the same covariance matrix are invariant under the action of the translation group [62]. Thus, we have I f [ p ( μ 1 , Σ ) : p ( μ 2 , Σ ) ] = h f ( Δ Σ ( μ 1 , μ 2 ) ) . Since Δ Σ ( μ 1 , μ 2 ) = Δ 1 ( 0 , Δ Σ ( μ 1 , μ 2 ) ) , we thus have
I f [ p ( μ 1 , Σ ) : p ( μ 2 , Σ ) ] = h f ( Δ 1 ( 0 , Δ Σ ( μ 1 , μ 2 ) ) = I f [ p ( 0 , 1 ) : p ( Δ Σ ( μ 1 , μ 2 ) , 1 ) ] ,
where the right-hand side f-divergence is between univariate normal distributions. See Table 2 of [62] for some explicit functions h f .

2. Calvo and Oller’s Family of Diffeomorphic Embeddings

Calvo and Oller [19,32] noticed that we can embed the space of normal distributions in P ( d + 1 ) by using the following mapping:
f β ( N ) = f β ( μ , Σ ) = Σ + β μ μ β μ β μ β P ( d + 1 ) ,
where β R > 0 and N = N ( μ , Σ ) . Notice that since the dimension of P ( d + 1 ) is ( d + 1 ) ( d + 2 ) 2 , we only use ( d + 1 ) ( d + 2 ) 2 d ( d + 3 ) 2 = 1 extra dimension for embedding N ( d ) into P ( d + 1 ) . By foliating P = R > 0 × P c where P c = { P P : | P | = c } denotes the subsets of P with determinant c, we obtain the following Riemannian Calvo and Oller metric on the SPD cone:
d s CO 2 = 1 2 tr f 1 ( μ , Σ ) d f ( μ , Σ ) 2 , = 1 2 d β β 2 + β d μ Σ 1 d μ + 1 2 tr Σ 1 d Σ 2 .
Let
N ¯ β ( d ) = P ¯ = f β ( μ , Σ ) : ( μ , Σ ) N ( d ) = R d × P ( d )
denote the submanifold of P ( d + 1 ) of codimension 1, and N ¯ = N ¯ 1 (i.e., β = 1 ). The family of mappings f β provides diffeomorphisms between N ( d ) and N ¯ β ( d ) . Let f β 1 ( P ¯ ) = ( μ P ¯ , Σ P ¯ ) denote the inverse mapping for P ¯ N ¯ β ( d ) , and let f = f 1 (i.e., β = 1 ):
f ( N ) = f ( μ , Σ ) = Σ + μ μ μ μ 1 .
By equipping the cone P ( d + 1 ) by the trace metric [63,64] (also called the affine invariant Riemannian metric, AIRM) scaled by 1 2 :
g P trace ( P 1 , P 2 ) : = tr ( P 1 P 1 P 1 P 2 )
(yielding the squared line element d s P 2 = 1 2 tr ( ( P d P ) 2 ) ), Calvo and Oller [19] proved that N ¯ ( d ) is isometric to N ( d ) (i.e., the Riemannian metric of P ( d + 1 ) restricted to N ( d ) coincides with the Riemannian metric of N ( d ) induced by f) but N ¯ ( d ) is not totally geodesic (i.e., the geodesics γ P ( P ¯ 1 , P ¯ 2 ; t ) for P ¯ 1 = f ( N 1 ) , P ¯ 2 = f ( N 2 ) N ¯ ( d ) leaves the embedded normal submanifold N ¯ ( d ) ) . Please note that g P trace can be interpreted as the Fisher metric for the family N 0 of 0-centered normal distributions. Thus, we have ( N ( d ) , g Fisher ) ( P ( d + 1 ) , g trace ) , and the following diagram between parameter spaces and corresponding distributions:
N ( d ) N 0 ( d + 1 ) Λ ( d ) P ( d + 1 )
Remark 2. 
The trace metric was first studied by Siegel [45,65] using the wider scope of complex symmetric matrices with positive–definite imaginary parts generalizing the Poincaré upper half-plane (see Appendix D).
We omit to specify the dimensions and write for short N , N ¯ , and P when clear from the context. Thus, C&O proposed to use the embedding f = f 1 to give a lower bound ρ CO of the Fisher–Rao distance ρ N between normals:
LC CO : ρ N ( N 1 , N 2 ) ρ CO ( f ( μ 1 , Σ 1 ) P ¯ 1 , f ( μ 2 , Σ 2 ) P ¯ 2 ) = 1 2 i = 1 d + 1 log 2 λ i ( P ¯ 1 1 P ¯ 2 ) .
We let ρ CO ( N 1 , N 2 ) = ρ CO ( f ( N 1 ) , f ( N 2 ) ) . The ρ CO distance is invariant under affine transformations such as the Fisher–Rao distance of Property 1:
Property 2 
(affine invariance of C&O distance [19]). For all A GL ( d ) , a R d , we have ρ CO ( ( A μ 1 + a , A Σ 1 A ) , ( A μ 2 + a , A Σ 2 A ) ) = ρ CO ( N ( μ 1 , Σ 1 ) , N ( μ 2 , Σ 2 ) ) .
When Σ 1 = Σ 2 = Σ , we have | P ¯ 1 | = | P ¯ 2 | = | Σ | . Since the Riemannian geodesics γ P ( P 1 , P 2 ; t ) in the SPD cone are given by γ P ( P 1 , P 2 ; t ) = P 1 1 2 ( P 1 1 2 P 2 P 1 1 2 ) t P 1 1 2 [66] (also written γ SPD ( P 1 , P 2 ; t ) ), we have | γ P ( P 1 , P 2 ; t ) | = | Σ | . Although the submanifold P c = { P P : | P | = c } is totally geodesic with respect to the trace metric, it is not totally geodesic with respect to 1 2 tr ( ( P ¯ d P ¯ ) 2 ) . Thus, although γ P ( P 1 , P 2 ) N ¯ , it does not correspond to the embedded MVN geodesics with respect to the Fisher metric. The C&O distance between two MVNs N ( μ 1 , Σ ) and N ( μ 2 , Σ ) sharing the same covariance matrix [19] is
ρ CO ( N ( μ 1 , Σ ) , N ( μ 2 , Σ ) ) = arccosh 1 + 1 2 Δ Σ 2 ( μ 1 , μ 2 ) ,
where arccosh ( x ) : = log ( x + x 2 1 ) for x 1 and Δ Σ ( μ 1 , μ 2 ) is the Mahalanobis distance between N ( μ 1 , Σ ) and N ( μ 2 , Σ ) . In that case, we thus have ρ CO ( N ( μ 1 , Σ ) , N ( μ 2 , Σ ) ) = h CO ( Δ Σ ( μ 1 , μ 2 ) ) where h CO ( u ) = arccosh 1 + 1 2 u 2 is a strictly monotone increasing function. Let us note in passing that in [19] (Corollary, page 230) there is a confusing or typographic error since the distance is reported as arccosh 1 + 1 2 d M ( μ 1 , μ 2 ) where d M denotes “Mahalanobis distance” [51]. Therefore, either d M = Δ Σ 2 , Mahalanobis D 2 -distance, or there is a missing square in the equation of the Corollary page 230. To obtain a flavor of how good is the approximation of the C&O distance, we may consider the same covariance case where we have both closed-form solutions for ρ N (Equation (20)) and ρ CO (Equation (23)). Figure 3 plots the two functions h CO and h FR (with h CO ( u ) h FR ( u ) u for u [ 0 , ) ).
Let us remark that similarly all f-divergences between N 1 = ( μ 1 , Σ ) and N 2 = ( μ 2 , Σ ) are scalar functions of their Mahalanobis distance Δ Σ ( μ 1 , μ 2 ) too, see [62].
The C&O distance ρ CO is a metric distance that has been used in many applications ranging from computer vision [57,67,68,69] to signal/sensor processing, statistics [70,71], machine learning [29,72,73,74,75,76] and analogical reasoning [77].
Remark 3. 
In a second paper, Calvo and Oller [32] noticed that we can embed normal distributions in P ( d + 1 ) by the following more general mapping (Lemma 3.1 [32]):
g α , β , γ ( μ , Σ ) = | Σ | α Σ + β γ 2 μ μ β γ μ β γ μ β P ( d + 1 ) ,
where α R , β R > 0 and γ R . It is show in [32] that the induced length element is
d s α , β , γ 2 = 1 2 α ( ( d + 1 ) + 2 α ) tr 2 ( Σ 1 d Σ ) + tr ( ( Σ 1 d Σ ) 2 ) + 2 β γ 2 d μ Σ 1 d μ + 2 α tr ( Σ 1 d Σ ) d β β + d β β 2 .
When γ = β = 1 , we have
d s α 2 = 1 2 α ( ( d + 1 ) + 2 α ) tr 2 ( Σ 1 d Σ ) + tr ( ( Σ 1 d Σ ) 2 ) + 2 β γ 2 d μ Σ 1 d μ .
Thus, to cancel the term tr 2 ( Σ 1 d Σ ) , we may either choose α = 0 or α = 2 1 + d .
In some applications [78], the embedding
g 1 d + 1 , 1 , 1 ( μ , Σ ) = | Σ | 1 d + 1 Σ + μ μ μ μ 1 : = f ^ ( μ , Σ ) ,
is used to ensure that g 1 d + 1 , 1 , 1 ( μ , Σ ) = 1 . That is normal distributions are embedded diffeomorphically into the submanifold of positive–definite matrices with a unit determinant (also called SSPD, acronym of Special SPD). In [32], C&O showed that there exists a second isometric embedding of the Fisher–Rao Gaussian manifold N ( d ) into a submanifold of the cone P ( d + 1 ) : f SSPD ( μ , Σ ) = | Σ | 2 d + 1 Σ + μ μ μ μ 1 . Let P ^ = f SSPD ( μ , Σ ) . This mapping can be understood as taking the elliptic isometry P | P | 2 d + 1 P of P P ( d + 1 ) [64] since | Σ | = | P ¯ ( μ , Σ ) | (see proof in Proposition 3). It follows that
ρ CO ( N 1 , N 2 ) = ρ P ( P ¯ 1 , P ¯ 2 ) = ρ P ( P ^ 1 , P ^ 2 ) ρ N ( N 1 , N 2 ) .
Similarly, we could have mapped P P 1 to obtain another isometric embedding. See the four types of elliptic isometric of the SPD cone described in [64]. Finally, let us remark that the SSPD submanifold is totally geodesic with respect to the trace metric but not with respect to the C&O metric.
Interestingly, Calvo and Oller [48] (p. 131) proved that ( ( μ ¯ 1 , , μ ¯ d ) , diag ( σ ¯ 1 2 , , σ ¯ d 2 ) ) is a maximal invariant for the action of the affine group Aff ( d ) , where μ ¯ = Q 1 ( μ 2 μ 1 ) and Σ 2 Σ 1 1 = Q diag ( σ ¯ 1 2 , , σ ¯ d 2 ) Q 1 (in [48], the authors considered Σ 1 Σ 1 2 ). Thus, we consider the following dissimilarity
D CO ( N ( μ 1 , Σ 1 ) , N ( μ 2 , Σ 2 ) ) = 2 i = 1 d log 2 1 + Δ ( 0 , 1 ; μ ¯ i , σ ¯ i ) 1 Δ ( 0 , 1 ; μ ¯ i , σ ¯ i ) .
Dissimilarity D CO is symmetric (i.e., D CO ( N 1 , N 2 ) = D CO ( N 2 , N 1 ) ) and D CO ( N 1 , N 2 ) = 0 if and only if N 1 = N 2 . Please note that when d = 1 , D CO is different from the Fisher–Rao distance of Equation (7).

3. Approximating the Fisher–Rao Distance

3.1. Approximating Length of Curves

Recall that the Fisher–Rao’s distance [79] is the Riemannian geodesic distance
ρ N ( N ( λ 1 ) , N ( λ 2 ) ) = inf c ( t ) c ( 0 ) = p λ 1 c ( 1 ) = p λ 2 Length ( c ) ,
where
Length ( c ) = 0 1 c ˙ ( t ) , c ˙ ( t ) c ( t ) d s N ( t ) d t .
We can approximate the Rao distance ρ N ( N 1 , N 2 ) by discretizing regularly any smooth curve c ( t ) joining N 1 = c ( 0 ) to N 2 = c ( 1 ) (Figure 4):
ρ N ( N 1 , N 2 ) 1 T i = 1 T 1 ρ N c i T , c i + 1 T ,
with equality holding iff c ( t ) = γ N ( N 1 , N 2 ; t ) is the Riemannian geodesic defined by the Levi–Civita metric connection induced by the Fisher information metric.
When the number of discretization steps T is sufficiently large, the normal distributions c i T and c i + 1 T are close to each other, and we can approximate ρ N c i T , c i + 1 T by D J c i T , c i + 1 T , where D J [ N 1 , N 2 ] = D KL [ N 1 , N 2 ] + D KL [ N 2 , N 1 ] is Jeffreys divergence, and D KL is the Kullback–Leibler divergence:
D KL [ p ( μ 1 , Σ 1 ) : p ( μ 2 , Σ 2 ) ] = 1 2 tr ( Σ 2 1 Σ 1 ) + Δ μ Σ 2 1 Δ μ d + log | Σ 2 | | Σ 1 | .
Thus, the costly determinant computations cancel each other in Jeffreys divergence (i.e., log | Σ 2 | | Σ 1 | + log | Σ 1 | | Σ 2 | = 0 ) and we have:
D J [ p ( μ 1 , Σ 1 ) : p ( μ 2 , Σ 2 ) ] = tr Σ 2 1 Σ 1 + Σ 1 1 Σ 2 2 I + Δ μ Σ 1 1 + Σ 2 1 2 Δ μ .
Figure 4 summarizes our method to approximate the Fisher–Rao geodesic distance.
In general, it holds that
I f [ p : q ] f ( 1 ) 2 d s Fisher 2 ,
between infinitesimally close distributions p and q ( d s 2 I f [ p : q ] f ( 1 ) ), where I f [ · : · ] denotes a f-divergence [1]. The Jeffreys divergence is a f-divergence obtained for f J ( u ) = log u + u log u with f J ( 1 ) = 2 . It is thus interesting to find low computational cost f-divergences between multivariate normal distributions to approximate the infinitesimal length element d s . Please note that f-divergences between MVNs are also invariant under the action of the affine group [62]. Thus, for infinitesimally close distributions p and q, this informally explains that d s Fisher is invariant under the action of the affine group (see Proposition 1).
Although the definite integral of the length element along the Fisher–Rao geodesic γ N FR is not known in closed form (i.e., Fisher–Rao distance), the integral of the squared length element along the mixture geodesic γ N m ( N 1 , N 2 ) and exponential geodesic γ N e ( N 1 , N 2 ) coincide with Jeffreys divergence D J [ N 1 , N 2 ] between N 1 and N 2 [1]:
Property 3 
([1]). We have
D J [ p λ 1 , p λ 2 ] = 0 1 d s N 2 ( γ N m ( p λ 1 , p λ 2 ; t ) ) d t = 0 1 d s N 2 ( γ N e ( p λ 1 , p λ 2 ; t ) ) d t .
Proof. 
Let us report a proof of this remarkable fact in the general setting of Bregman manifolds. Indeed, since
D J [ p λ 1 , p λ 2 ] = D KL [ p λ 1 : p λ 2 ] + D KL [ p λ 2 : p λ 1 ] ,
and D KL [ p λ 1 : p λ 2 ] = B F ( θ ( λ 2 ) : θ ( λ 1 ) ) , where B F denotes the Bregman divergence induced by the cumulant function of the multivariate normals and θ ( λ ) is the natural parameter corresponding to λ , we have
D J [ p λ 1 , p λ 2 ] = B F ( θ 1 : θ 2 ) + B F ( θ 2 : θ 1 ) , = S F ( θ 1 ; θ 2 ) = ( θ 2 θ 1 ) ( η 2 η 1 ) = S F * ( η 1 ; η 2 ) ,
where η = F ( θ ) and θ = F * ( η ) denote the dual parameterizations obtained by the Legendre–Fenchel convex conjugate F * ( η ) of F ( θ ) . Moreover, we have F * ( η ) = h ( p μ , Σ ) [1], i.e., the convex conjugate function is Shannon negentropy.
Then we conclude using the fact that S F ( θ 1 ; θ 2 ) = 0 1 d s 2 ( γ ( t ) ) d t = 0 1 d s 2 ( γ * ( t ) ) d t , i.e., the symmetrized Bregman divergence amounts to integral energies on dual geodesics on a Bregman manifold. The proof of this general property is reported in Appendix E. □
It follows the following upper bound on the Fisher–Rao distance:
Property 4 
(Fisher–Rao upper bound). The Fisher–Rao distance between normal distributions is upper bounded by the square root of the Jeffreys divergence: ρ N ( N 1 , N 2 ) D J ( N 1 , N 2 ) .
Proof. 
Consider the Cauchy–Schwarz inequality for positive functions f ( t ) and g ( t ) : 0 1 f ( t ) g ( t ) d t ( 0 1 f ( t ) 2 d t ) ( 0 1 g ( t ) 2 d t ) ), and let f ( t ) = d s N ( γ N c ( p λ 1 , p λ 2 ; t ) and g ( t ) = 1 . Then we obtain:
0 1 d s N ( γ N c ( p λ 1 , p λ 2 ; t ) d t 2 0 1 d s N 2 ( γ N c ( p λ 1 , p λ 2 ; t ) d t 0 1 1 2 d t = 1 .
Furthermore, since by definition of γ N FR , we have
0 1 d s N ( γ N c ( p λ 1 , p λ 2 ; t ) d t 0 1 d s N ( γ N FR ( p λ 1 , p λ 2 ; t ) d t = : ρ N ( N 1 , N 2 ) .
It follows for c = γ N e (i.e., e-geodesic) using Property 3 that we have:
ρ N ( N 1 , N 2 ) 2 0 1 d s N 2 ( γ N e ( p λ 1 , p λ 2 ; t ) d t = D J ( N 1 , N 2 ) .
Thus, we conclude that ρ N ( N 1 , N 2 ) D J ( N 1 , N 2 ) .
Please note that in Riemannian geometry, a curve γ minimizes the energy E ( γ ) = 0 1 γ ˙ ( t ) 2 d t if it minimizes the length L ( γ ) = 0 1 γ ˙ ( t ) d t and γ ˙ ( t ) is constant. Using Cauchy-Schwartz inequality, we can show that L ( γ ) E ( γ ) . □
This upper bound is tight at infinitesimal scale (i.e., when N 2 = N 1 + d N ) since ρ N ( N 1 , N 2 ) d s N ( N 1 ) 2 I f [ N 1 : N 2 ] f ( 1 ) and the f-divergence in right-hand side of the identity can be chosen as Jeffreys divergence. To appreciate the quality of the square root of Jeffreys divergence upper bound of Property 4, consider the case where N 1 , N 2 M Σ . In that case, we have ρ N ( N ( μ 1 , Σ ) , N ( μ 2 , Σ ) ) = 2 arccosh ( 1 + 1 4 Δ Σ 2 ( μ 1 , μ 2 ) ) and D J [ N ( μ 1 , Σ ) , N ( μ 2 , Σ ) ] = Δ Σ ( μ 1 , μ 2 ) (since D KL [ N ( μ 1 , Σ ) , N ( μ 2 , Σ ) ] = 1 2 Δ Σ 2 ( μ 1 , μ 2 ) ). The upper bound can thus be checked since we have 2 arccosh ( 1 + 1 4 x 2 ) x for x 0 . The plots of Figure 5 shows visually the quality of the D J upper bound.
For any smooth curve c ( t ) , we can thus approximate ρ N for large T by
ρ ˜ N c ( N 1 , N 2 ) : = 1 T i = 1 T 1 D J c i T , c i + 1 T .
For example, we may consider the following curves on M N which admit closed-form parameterizations in t [ 0 , 1 ] :
  • linear interpolation (LERP, Linear intERPolation) c λ ( t ) = t ( μ 1 , Σ 1 ) + ( 1 t ) ( μ 2 , Σ 2 ) between ( μ 1 , Σ 1 ) and ( μ 2 , Σ 2 ) ,
  • the mixture geodesic [80] c m ( t ) = γ N m ( N 1 , N 2 ; t ) = ( μ t m , Σ t m ) with μ t m = μ ¯ t and Σ t m = Σ ¯ t + t μ 1 μ 1 + ( 1 t ) μ 2 μ 2 μ ¯ t μ ¯ t where μ ¯ t = t μ 1 + ( 1 t ) μ 2 and Σ ¯ t = t Σ 1 + ( 1 t ) Σ 2 ,
  • the exponential geodesic [80] c e ( t ) = γ N e ( N 1 , N 2 ; t ) = ( μ t e , Σ t e ) with μ t e = Σ ¯ t H ( t Σ 1 1 μ 1 + ( 1 t ) Σ 2 1 μ 2 ) and Σ t e = Σ ¯ t H where Σ ¯ t H = ( t Σ 1 1 + ( 1 t ) Σ 2 1 ) 1 is the matrix harmonic mean,
  • the curve c e m ( t ) = 1 2 γ N e ( N 1 , N 2 ; t ) + γ N m ( N 1 , N 2 ; t ) which is obtained by averaging the mixture geodesic with the exponential geodesic.
Figure 6 visualizes the exponential and mixture geodesics between two bivariate normal distributions.
Let us denote by ρ ˜ N λ = ρ ˜ N c λ , ρ ˜ N m = ρ ˜ N c m , ρ ˜ N e = ρ ˜ N c e and ρ ˜ N e m = ρ ˜ N c e m the approximations obtained by these curves following from Equation (27). When T is sufficiently large, the approximated distances ρ ˜ x are close to the length of curve x, and we may thus consider a set of several curves { c i } i I and report the smallest Fisher–Rao distance approximations obtained among these curves: ρ N ( N 1 , N 2 ) min i I ρ ˜ N c i ( N 1 , N 2 ) .
Please note that we consider the regular spacing for approximating a curve length and do not optimize the position of the sample points on the curve. Indeed, as T , the curve length approximation tends to the Riemannian curve length. In other words, we can measure approximately finely the length of any curve available with closed-form reparameterization by increasing T. Thus, the key question of our method is how to best approximate the Fisher–Rao geodesic by a curve that can be parametrized by a closed-form formula and is close enough to the Fisher–Rao geodesic.
Next, we introduce our approximation curve c CO ( t ) derived from Calvo and Oller isometric mapping f which experimentally behaves better when normals are not too far from each other.

3.2. A Curve Derived from Calvo and Oller’s Embedding

This approximation consists of leveraging the closed-form expression of the SPD geodesics [63,66]:
γ P ( P , Q ; t ) = P 1 2 P 1 2 Q 1 2 P 1 2 t P 1 2 , t [ 0 , 1 ]
to approximate the Fisher–Rao normal geodesic γ N Fisher ( N 1 , N 2 ; t ) as follows: Let P ¯ 1 = f ( N 1 ) , P ¯ 2 = f ( N 2 ) N ¯ , and consider the smooth curve
c ¯ CO ( P ¯ 1 , P ¯ 2 ; t ) = proj N ¯ γ P ( P ¯ 1 , P ¯ 2 ; t ) ,
where proj N ¯ ( P ) denotes the orthogonal projection of P P ( d + 1 ) onto N ¯ (Figure 7). Thus, curve c CO ( t ) ( t [ 0 , 1 ] ) is then defined by taking the inverse mapping f 1 ( c ¯ CO ) (Figure 8):
c CO ( t ) = f 1 proj N ¯ γ P ( P ¯ 1 , P ¯ 2 ; t ) .
Please note that the matrix power P t can be computed as P t = U diag ( λ 1 t , , λ d t ) V where P = U diag ( λ 1 t , , λ d t ) V is the eigenvalue decomposition of P.
Let us now explain how to project P = [ P i , j ] P ( d + 1 ) onto N ¯ based on the analysis of the Appendix of [19] (p. 239):
Proposition 2 
(Projection of an SPD matrix onto the embedded normal submanifold N ¯ ). Let β = P d + 1 , d + 1 and write P = Σ + β μ μ β μ β μ β . Then the orthogonal projection at P P onto N ¯ is:
P ¯ : = proj N ¯ ( P ) = Σ + μ μ μ μ 1 ,
and the SPD distance between P and P ¯ is
ρ P ( P , P ¯ ) = 1 2 | log β | .
Notice that the projection of P is easily computed since β = P d + 1 , d + 1 .
proj N ¯ Σ + β μ μ β μ β μ β = Σ + μ μ μ μ 1
Remark 4. 
In Diffusion Tensor Imaging [39] (DTI), the Fisher–Rao distance can be used to evaluate the distance between three-dimensional normal distributions with means located at a 3D grid position. We may consider 3 × 3 × 3 1 = 26 neighbor graphs induced by the grid, and for each normal N of the grid, calculate the approximations of the Fisher–Rao distance of N with its neighbors N as depicted in Figure 9. Then the distance between two tensors N 1 and N 2 of the 3D grid is calculated as the shortest path on the weighted graph using Dijkstra’s algorithm [39].
Please note that the Fisher–Rao projection of N 1 = ( μ 1 , Σ 1 ) onto a submanifold M μ 2 with fixed mean μ 2 was recently reported in closed form in [72] (Equation (21)):
N * = N μ 2 , Σ 1 + 1 2 ( μ 2 μ 1 ) ( μ 2 μ 1 ) ,
with
ρ N ( N 1 , N * ) = 1 2 arccosh d + ( μ 2 μ 1 ) Σ 1 1 ( μ 2 μ 1 ) ,
and the Fisher–Rao projection of N 1 = ( μ 1 , Σ 1 ) onto submanifold M Σ 2 is the “vertical projection” N * = ( μ 1 , Σ 2 ) (Figure 10) with
ρ N ( N 1 , N * ) = ρ N μ ( Σ 1 , Σ 2 ) .
We can upper bound the Fisher–Rao distance ρ N ( ( μ 1 , Σ 1 ) , ( μ 2 , Σ 2 ) ) by projecting Σ 1 onto M μ 2 and projecting Σ 2 onto M μ 1 . Let Σ 12 M μ 2 and Σ 21 M μ 1 denote those Fisher–Rao orthogonal projections. Using the triangular inequality property of the Fisher–Rao distance, we obtain the following upper bounds:
ρ N ( ( μ 1 , Σ 1 ) , ( μ 2 , Σ 2 ) ) ρ N ( ( μ 1 , Σ 1 ) , ( μ 2 , Σ 12 ) ) + ρ N ( μ 2 , Σ 12 , ( μ 2 , Σ 2 ) ) ,
                                                                                  ρ N ( ( μ 2 , Σ 2 ) , ( μ 1 , Σ 21 ) ) + ρ N ( ( μ 1 , Σ 21 ) , ( μ 1 , Σ 1 ) ) .
See Figure 11 for an illustration.
Let c ¯ CO ( t ) = S ¯ t and c CO ( t ) = f 1 ( c CO ( t ) ) = : G t . The following proposition shows that we have D J [ S ¯ t , S ¯ t + 1 ] = D J [ G t , G t + 1 ] .
Proposition 3. 
The Kullback–Leibler divergence between p μ 1 , Σ 1 and p μ 2 , Σ 2 amounts to the KLD between q P ¯ 1 = p 0 , f ( μ 1 , Σ 1 ) and q P ¯ 2 = p 0 , f ( μ 2 , Σ 2 ) where P ¯ i = f ( μ i , Σ i ) :
D KL [ p μ 1 , Σ 1 : p μ 2 , Σ 2 ] = D KL [ q P ¯ 1 : q P ¯ 2 ] .
The KLD between two centered ( d + 1 ) -variate normals q P 1 = p 0 , P 1 and q P 2 = p 0 , P 2 is
D KL [ q P 1 : q P 2 ] = 1 2 tr ( P 2 1 P 1 ) d 1 + log | P 2 | | P 1 | .
This divergence can be interpreted as the matrix version of the Itakura–Saito divergence [81]. The SPD cone equipped with 1 2 of the trace metric can be interpreted as Fisher–Rao centered normal manifolds: ( N μ , g N μ Fisher ) = ( P , 1 2 g trace ) .
Since the determinant of a block matrix is
A B C D = A B D 1 C ,
we obtain with D = 1 : | f ( μ , Σ ) | = | Σ + μ μ μ μ | = | Σ | .
Let P ¯ 1 = f ( μ 1 , Σ 1 ) and P ¯ 2 = f ( μ 2 , Σ 2 ) . Checking D KL [ p μ 1 , Σ 1 : p μ 2 , Σ 2 ] = D KL [ q P ¯ 1 : q P ¯ 2 ] where q P ¯ = p 0 , P ¯ amounts to verify that
tr ( P ¯ 2 1 P ¯ 1 ) = 1 + tr ( Σ 2 1 Σ 1 + Δ μ Σ 2 1 Δ μ ) .
Indeed, using the inverse matrix
f ( μ , Σ ) 1 = Σ 1 Σ 1 μ μ Σ 1 1 + μ Σ 1 μ ,
we have
tr ( P ¯ 2 1 P ¯ 1 ) = tr Σ 2 1 Σ 2 1 μ 2 μ 2 Σ 2 1 1 + μ 2 Σ 2 1 μ 2 Σ 1 + μ 1 μ 1 μ 1 μ 1 1 , = 1 + tr ( Σ 2 1 Σ 1 + Δ μ Σ 2 1 Δ μ ) .
Thus, even if the dimension of the sample spaces of p μ , Σ and q P ¯ = f ( μ , Σ ) differs by one, we obtain the same KLD by Calvo and Oller’s isometric mapping f.
This property holds for the KLD/Jeffreys divergence D J but not for all f-divergences [1] I f in general (e.g., it fails for the Hellinger divergence).
Figure 12 shows the various geodesics and curves used to approximate the Fisher–Rao distance with the Fisher metric shown using Tissot indicatrices.
Please note that the introduction of parameter β is related to the foliation of the SPD cone P by { f β ( N ) : β > 0 } : P ( d + 1 ) = R > 0 × f β ( N ) . See Figure 7. Thus, we may define how good the projected C&O curve is to the Fisher–Rao geodesic by measuring the average distance between points on γ P ( P ¯ 1 , P ¯ 2 ; t ) and their projections γ P ( P ¯ 1 , P ¯ 2 ; t ) ¯ onto N ¯ :
δ CO ( N 1 , N 2 ) = δ CO ( P ¯ 1 , P ¯ 2 ) = 0 1 ρ P ( γ P ( P ¯ 1 , P ¯ 2 ; t ) , γ P ( P ¯ 1 , P ¯ 2 ; t ) ¯ ) d t .
In practice, we evaluate this integral at the sampling points S t :
δ CO ( P 1 , P 2 ) δ T CO ( P 1 , P 2 ) : = 1 T i = 1 T ρ P ( S t , S ¯ t ) ,
where S t = γ P ( P ¯ 1 , P ¯ 2 ; t ) and S ¯ t = γ P ( P ¯ 1 , P ¯ 2 ; t ) . We checked experimentally (see Section 3.3) that for close by normals N 1 and N 1 , we have δ CO ( N ¯ 1 , N ¯ 2 ) small, and that when N 1 becomes further separated from N 2 , the average projection error δ CO ( N ¯ 1 , N ¯ 2 ) increases. Thus, δ T CO ( P 1 , P 2 ) is a good measure of the precision of our Fisher–Rao distance approximation.
Lemma 1. 
We have ρ N ¯ ( S ¯ t , S ¯ t + 1 ) ρ P ( S ¯ t , S t ) + ρ P ( S t , S t + 1 ) + ρ P ( S t + 1 , S ¯ t + 1 ) .
Proof. 
The proof consists of applying twice the triangle inequality of metric distance ρ P :
ρ N ¯ ( S ¯ t , S ¯ t + 1 ) ρ P ( S ¯ t , S t + 1 ) + ρ P ( S t + 1 , S ¯ t + 1 ) , ρ P ( S ¯ t , S t ) + ρ P ( S t , S t + 1 ) + ρ P ( S t + 1 , S ¯ t + 1 ) .
See Figure 13 where the left-hand-side geodesic length is shown in blue and the right-hand-side upper bound is visualized in red. □
Property 5. 
We have ρ N ( N 1 , N 2 ) ρ N CO ( N 1 , N 2 ) ρ N ( N 1 , N 2 ) + 2 δ T CO ( P ¯ 1 , P ¯ 2 ) .
Proof. 
At infinitesimal scale when S t + 1 S t , using Lemma 1 and ρ P ( S t + 1 , S ¯ t + 1 ) ρ P ( S ¯ t , S t ) we have
d s N ( S ¯ t ) d s P ( S t ) + 2 ρ P ( S t , S ¯ t ) .
Taking the integral along the curve c CO ( t ) = γ CO ( P ¯ 1 , P ¯ 2 ; t ) ¯ , we obtain
ρ N CO ( N 1 , N 2 ) ρ P ( P ¯ 1 , P ¯ 2 ) + 2 δ T CO ( P ¯ 1 , P ¯ 2 )
Since ρ P ( P ¯ 1 , P ¯ 2 ) ρ N ( N 1 , N 2 ) , we have
ρ N ( N 1 , N 2 ) ρ N CO ( N 1 , N 2 ) ρ N ( N 1 , N 2 ) + 2 δ T CO ( P ¯ 1 , P ¯ 2 ) .
Notice that i = 0 T 1 ρ P ( S t , S t + 1 ) = ρ P ( P ¯ 1 , P ¯ 2 ) .
Example 1. 
Let us consider Example 1 of [42] (p. 11):
N 1 = 1 0 , Σ , N 2 = 6 3 , Σ , Σ = 1.1 0.9 0.9 1.1 .
The Fisher–Rao distance is evaluated numerically in [42] as 5.00648 . We have the lower bound ρ N CO ( N 1 , N 2 ) = 4.20447 , and the Mahalanobis distance 8.06226 upper bounds the Fisher–Rao distance (not totally geodesic submanifold N Σ ). Our projected C&O curve discretized with T = 1000 yields an approximation ρ ˜ N CO ( N 1 , N 2 ) = 5.31667 . The average projection distance ρ P ( S t , S ¯ t ) is δ T CO ( N 1 , N 2 ) = 0.61791 , and the maximum projected distance is 1.00685 . We check that
5.00648 ρ N ( N 1 , N 2 ) ρ ˜ N CO ( N 1 , N 2 ) 5.31667 ρ N ( N 1 , N 2 ) + 2 δ T CO ( P ¯ 1 , P ¯ 2 ) 5.44028 .
The Killing distance [82] obtained for κ Killing = 2 is ρ Killing ( N 1 , N 2 ) 6.82028 (see Appendix C). Notice that geodesic shooting is time-consuming compared to our approximation technique.

3.3. Some Experiments

The KLD D KL and Jeffreys divergence D J , the Fisher–Rao distance ρ N and the Calvo and Oller distance ρ CO are all invariant under the congruence action of the affine group Aff ( d ) = R d GL ( d ) with the group operation
( a 1 , A 1 ) ( a 2 , A 2 ) = ( a 1 + A 1 a 2 , A 1 A 2 ) .
Let ( A , a ) Aff ( d ) , and define the action on the normal space N as follows:
( A , a ) . N ( μ , Σ ) = N ( A μ + a , A Σ A ) .
Then we have:
ρ N ( ( A , a ) . N 1 , ( A , a ) . N 2 ) = ρ N ( N 1 , N 2 ) , ρ CO ( ( A , a ) . N 1 , ( A , a ) . N 2 ) = ρ CO ( N 1 , N 2 ) , D KL [ ( A , a ) . N 1 : ( A , a ) . N 2 ] = D KL [ N 1 : N 2 ] .
This invariance extends to our approximations ρ ˜ N c (see Equation (27)).
Since we have
ρ ˜ N c ( N 1 , N 2 ) ρ N ( N 1 , N 2 ) ρ CO ( N 1 , N 2 ) ,
the ratio κ c = ρ ˜ N c ρ CO κ = ρ ˜ N c ρ N gives an upper bound on the approximation factor of ρ ˜ N c compared to the true Fisher–Rao distance ρ N :
κ c ρ N ( N 1 , N 2 ) κ ρ N ( N 1 , N 2 ) ρ ˜ N c ( N 1 , N 2 ) ρ N ( N 1 , N 2 ) ρ CO ( N 1 , N 2 ) .
Let us now report some numerical experiments of our approximated Fisher–Rao distances ρ ˜ N x with x { l , m , e , em , CO } . Although that dissimilarity ρ ˜ N is positive–definite, it does not satisfy the triangular inequality of metric distances (e.g., Riemannian distances ρ N and ρ CO ).
First, we draw multivariate normals by sampling means μ Unif ( 0 , 1 ) and sample covariance matrices Σ as follows: We draw a lower triangular matrix L with entries L i j iid sampled from Unif ( 0 , 1 ) , and take Σ = L L . We use T = 1000 samples on curves and repeat the experiment 1000 times to gather average statistics on κ c ’s of curves. Results are summarized in Table 1.
For that scenario that the C&O curve (either c ¯ CO N ¯ or c CO N ) performs best compared to the linear interpolation curves with respect to source parameter (l), mixture geodesic (m), exponential geodesic (e), or exponential-mixture mid-curve ( em ). Let us point out that we sample γ P ( P ¯ 1 , P ¯ 2 ; i T ) for i { 0 , , T } .
Strapasson, Porto, and Costa [38] (SPC) reported the following upper bound on the Fisher–Rao distance between multivariate normals:
ρ_CO(N_1, N_2) ≤ ρ_N(N_1, N_2) ≤ U_SPC(N_1, N_2),
with
U_SPC(N_1, N_2) = √2 √( ∑_{i=1}^{d} log² [ ( √((1 + D_ii)² + μ′_i²) + √((1 − D_ii)² + μ′_i²) ) / ( √((1 + D_ii)² + μ′_i²) − √((1 − D_ii)² + μ′_i²) ) ] ),
where Σ′ = Σ_1^{−1/2} Σ_2 Σ_1^{−1/2}, Σ′ = Ω D Ω⊤ is its eigendecomposition, and μ′ = Ω⊤ Σ_1^{−1/2}(μ_2 − μ_1). This upper bound performs better when the normals are well-separated, and worse than the D_J upper bound when the normals are close to each other.
Let us compare ρ_CO(N_1, N_2), the approximation ρ̃_N^CO(N_1, N_2), and the upper bound U_SPC(N_1, N_2) by averaging over 1000 trials with N_1 and N_2 chosen randomly as before and T = 1000. We have ρ_CO(N_1, N_2) ≤ ρ_N(N_1, N_2) ≤ ρ̃_N^CO(N_1, N_2) ≤ U_SPC(N_1, N_2). Table 2 shows that our Fisher–Rao approximation is close to the lower bound (and hence to the underlying true Fisher–Rao distance), and that the upper bound is about twice the lower bound for that particular scenario.
Second, since the distances are invariant under the action of the affine group, we can set without loss of generality N_1 = N(0, I) (the standard normal distribution) and let N_2 have covariance matrix Σ_2 = diag(u_1, …, u_d), where u_i ∼ Unif(0, a). As the normals N_1 and N_2 separate from each other, we notice experimentally that the performance of the c_CO curve degrades in this second experiment with a = 5 (see Table 3): indeed, the mixture geodesic works experimentally better than the C&O curve when d ≥ 11.
Figure 14 displays the various curves considered for approximating the Fisher–Rao distance between bivariate normal distributions: for a curve c(t), we visualize its corresponding bivariate normal distributions (μ_c(t), Σ_c(t)) at several increment steps t ∈ [0, 1] by plotting the ellipse
E_c(t) = { μ_c(t) + L_c(t) x : x = (cos θ, sin θ)⊤, θ ∈ [0, 2π) },
where Σ_c(t) = L_c(t) L_c(t)⊤.
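A small sketch of this ellipse tracing, using a Cholesky factor for L_c(t) (any square root of Σ_c(t) would do):

```python
import numpy as np

def ellipse_boundary(mu, Sigma, n=64):
    # Boundary points mu + L x with x = (cos theta, sin theta)^T, Sigma = L L^T.
    L = np.linalg.cholesky(Sigma)
    theta = np.linspace(0.0, 2.0 * np.pi, n, endpoint=False)
    return (mu[:, None] + L @ np.vstack([np.cos(theta), np.sin(theta)])).T
```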
Example 2. 
Let us report some numerical results for bivariate normals with T = 1000 :
  • We use the following example of Han and Park [39] (Equation (26)):
    N_1 = N( (0, 0)⊤, [ 1, 0 ; 0, 0.1 ] ), N_2 = N( (1, 1)⊤, [ 0.1, 0 ; 0, 1 ] ).
    Their geodesic shooting algorithm [39] evaluates the Fisher–Rao distance to ρ_N(N_1, N_2) ≈ 3.1329 (precision 10^{−5}).
    We obtain:
    Calvo and Oller lower bound: ρ_CO(N_1, N_2) ≈ 3.0470,
    upper bound using Equation (15): ≈ 7.92179,
    SPC upper bound (Equation (35)): U_SPC(N_1, N_2) ≈ 5.4302,
    D_J upper bound: U_J(N_1, N_2) ≈ 4.3704,
    ρ̃_N^λ(N_1, N_2) ≈ 3.4496,
    ρ̃_N^m(N_1, N_2) ≈ 3.5775,
    ρ̃_N^e(N_1, N_2) ≈ 3.7314,
    ρ̃_N^em(N_1, N_2) ≈ 3.1672,
    ρ̃_N^CO(N_1, N_2) ≈ 3.1391.
    In that setting, the D_J upper bound is better than the upper bound of Equation (35), and the projected Calvo and Oller geodesic yields the best approximation of the Fisher–Rao distance (Figure 15), with an absolute error of 0.0062 (about 0.2% relative error). When T = 10, we have ρ̃_N^CO(N_1, N_2) ≈ 3.1530; when T = 100, we obtain ρ̃_N^CO(N_1, N_2) ≈ 3.1136; and when T = 500, we obtain ρ̃_N^CO(N_1, N_2) ≈ 3.1362 (which is better than the approximation obtained for T = 1000). Figure 16 shows the fluctuations of the approximation of the Fisher–Rao distance by the projected C&O curve when T ranges from 3 to 100.
  • Bivariate normal N_1 = N(0, I) and bivariate normal N_2 = N(μ_2, Σ_2) with μ_2 = (1, 0)⊤ and Σ_2 = [ 1, 1 ; 1, 2 ]. We obtain:
    Calvo and Oller lower bound: 1.4498,
    upper bound of Equation (35): 2.6072,
    D_J upper bound: 1.5811,
    ρ̃_N^λ ≈ 1.5068,
    ρ̃_N^m ≈ 1.5320,
    ρ̃_N^e ≈ 1.5456,
    ρ̃_N^em ≈ 1.4681,
    ρ̃_N^CO ≈ 1.4673.
  • Bivariate normal N_1 = N(0, I) and bivariate normal N_2 = N(μ_2, Σ_2) with μ_2 = (5, 0)⊤ and Σ_2 = [ 1, 1 ; 1, 2 ]. We obtain:
    Calvo and Oller lower bound: 3.6852,
    upper bound of Equation (35): 6.0392,
    D_J upper bound: 6.2048,
    ρ̃_N^λ ≈ 5.7319,
    ρ̃_N^m ≈ 4.4039,
    ρ̃_N^e ≈ 5.9205,
    ρ̃_N^em ≈ 4.2901,
    ρ̃_N^CO ≈ 4.3786.
See Supplementary Materials for further experiments.

4. Approximating the Smallest Enclosing Fisher–Rao Ball of MVNs

We may use the closed-form distance ρ_CO(N, N′) between MVNs N and N′ to compute an approximation (of the center) of the smallest enclosing Fisher–Rao ball B* = ball(C*, r*) of a set G = {N_1 = (μ_1, Σ_1), …, N_n = (μ_n, Σ_n)} of n d-variate normal distributions:
C* = arg min_{C ∈ N} max_{i ∈ {1, …, n}} ρ_N(C, N_i),
where ball(C, r) = {N ∈ N : ρ_N(C, N) ≤ r}.
The method proceeds as follows:
  • First, we convert the MVN set G into the equivalent set of (d+1) × (d+1) SPD matrices Ḡ = {P̄_i = f(N_i)} using the C&O embedding. We relax the problem of approximating the circumcenter C* of the smallest enclosing Fisher–Rao ball to
    P* = arg min_{P ∈ P(d+1)} max_{i ∈ {1, …, n}} ρ_P(P, P̄_i).
  • Second, we approximate the center of the smallest enclosing Riemannian ball of Ḡ using the iterative smallest enclosing Riemannian ball algorithm of [66] with, say, T = 1000 iterations. Let P_T ∈ P(d+1) denote this approximated center: P_T = RieSEB_SPD(Ḡ, T).
  • Finally, we project back P_T onto N̄: P̄_T = proj_N̄(P_T). We return P̄_T as the approximation of C*.
Algorithm RieSEB_SPD({P_1, …, P_n}, T) [66] proceeds for a set of SPD matrices {P_1, …, P_n} as follows:
  • Let C_1 ← P_1
  • For t = 1 to T
    Compute the index of the SPD matrix which is farthest from the current circumcenter C_t:
    f_t = arg max_{i ∈ {1, …, n}} ρ_SPD(C_t, P_i)
    Update the circumcenter by walking along the geodesic linking C_t to P_{f_t}:
    C_{t+1} = γ_SPD(C_t, P_{f_t}; 1/(t+1)) = C_t^{1/2} ( C_t^{−1/2} P_{f_t} C_t^{−1/2} )^{1/(t+1)} C_t^{1/2}
  • Return C_T
The convergence of the algorithm RieSEB_SPD follows from the fact that the SPD trace manifold is a Hadamard manifold (of non-positive sectional curvature). See [66] for the proof of convergence.
The SPD distance ρ_P(P_T, P̄_T) between the circumcenter and its projection indicates the quality of the approximation. Figure 17 shows the result of implementing this heuristic.
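A compact sketch of RieSEB_SPD under the same conventions as the earlier sketches (function names ours; the farthest-point query is a linear scan over the input matrices):

```python
import numpy as np
from scipy.linalg import fractional_matrix_power

def rho_spd(P, Q):
    # Riemannian trace distance on the SPD cone (1/2-convention, as before).
    lam = np.linalg.eigvals(np.linalg.solve(P, Q)).real
    return np.sqrt(0.5 * np.sum(np.log(lam) ** 2))

def rieseb_spd(Ps, T=1000):
    # Iterative 1-center heuristic [66]: move a 1/(t+1) fraction along the
    # SPD geodesic towards the current farthest matrix.
    C = Ps[0]
    for t in range(1, T + 1):
        far = max(Ps, key=lambda P: rho_spd(C, P))
        Ch = fractional_matrix_power(C, 0.5)
        Cih = fractional_matrix_power(C, -0.5)
        C = Ch @ fractional_matrix_power(Cih @ far @ Cih, 1.0 / (t + 1)) @ Ch
    return C
```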
Let us notice that when all MVNs share the same covariance matrix Σ, we have from Equation (18) or Equation (23) that ρ_N(N(μ_1, Σ), N(μ_2, Σ)) and ρ_CO(N(μ_1, Σ), N(μ_2, Σ)) are strictly increasing functions of their Mahalanobis distance. Using the Cholesky decomposition Σ^{−1} = LL⊤, we deduce that the smallest Fisher–Rao enclosing ball coincides with the smallest Calvo and Oller enclosing ball, and the circumcenter of that ball can be found as an ordinary Euclidean circumcenter [83] (Figure 17b). Please note that in 1D, we can find the exact smallest enclosing Fisher–Rao ball as an equivalent smallest enclosing ball in hyperbolic geometry.
Furthermore, we may extend the computation of the approximated circumcenter to k-center clustering [84] of n multivariate normal distributions. Since the circumcenters of the clusters are approximated and not exact, we straightforwardly extend the variational approach of k-means described in [85] to k-center clustering. An application of k-center clustering of MVNs is the simplification of Gaussian mixture models [42] (GMMs).
Similarly, we can consider other Riemannian distances with closed-form formulas between MVNs such as the Killing distance in the symmetric space [82] (see Appendix C) or the Siegel-based distance proposed in Appendix D.

5. Some Information–Geometric Properties of the C&O Embedding

In information geometry [1], the manifold N admits a dual structure denoted by the quadruple
(N, g_N^Fisher, ∇_N^e, ∇_N^m),
when equipped with the exponential connection ∇_N^e and the mixture connection ∇_N^m. The connections ∇_N^e and ∇_N^m are said to be dual since (∇_N^e + ∇_N^m)/2 = ∇̄_N, the Levi–Civita connection induced by g_N^Fisher. Furthermore, by viewing N as an exponential family {p_θ} with natural parameter θ = (θ_v, θ_M) (using the sufficient statistics [80] (x, −xx⊤)), and taking the convex log-normalizer function F_N(θ) of the normals, we can build a dually flat space [1] where the canonical divergence amounts to a Bregman divergence which coincides with the reverse Kullback–Leibler divergence [30,86] (KLD). The Legendre duality
F*(η) = ⟨θ, η⟩ − F(θ), η = ∇F(θ)
(with ⟨(v_1, M_1), (v_2, M_2)⟩ = v_1⊤v_2 + tr(M_1⊤M_2)) yields θ = (θ_v, θ_M) = ( Σ^{−1}μ, ½Σ^{−1} ),
F_N(θ) = ½ ( d log π − log |θ_M| + ½ θ_v⊤ θ_M^{−1} θ_v ),
η = (η_v, η_M) = ∇F_N(θ) = ( ½ θ_M^{−1} θ_v, −½ θ_M^{−1} − ¼ (θ_M^{−1} θ_v)(θ_M^{−1} θ_v)⊤ ) = ( μ, −(Σ + μμ⊤) ),
F_N^*(η) = −½ ( log(1 + η_v⊤ η_M^{−1} η_v) + log |−η_M| + d log(2πe) ),
and we have
B_{F_N}(θ_1 : θ_2) = D_KL^*(p_{λ_1} : p_{λ_2}) = D_KL(p_{λ_2} : p_{λ_1}) = B_{F_N^*}(η_2 : η_1),
where D KL * [ p : q ] = D KL [ q : p ] is the reverse KLD.
In a dually flat space, we can express the canonical divergence as a Fenchel–Young divergence using the mixed coordinate systems: B_{F_N}(θ_1 : θ_2) = Y_{F_N}(θ_1 : η_2) where η_i = ∇F_N(θ_i) and
Y_{F_N}(θ_1 : η_2) := F_N(θ_1) + F_N^*(η_2) − ⟨θ_1, η_2⟩.
The moment η-parameterization of a normal is (η = μ, H = −(Σ + μμ⊤)), with its reciprocal function (λ = η, Λ = −H − ηη⊤).
Let F_P(P) = F_N(0, P), θ̄ = ½ P̄^{−1}, and η̄ = ∇F_P(θ̄). Then we have the following proposition, which proves that the Fenchel–Young divergences in N and N̄ (as a submanifold of P) coincide:
Proposition 4. 
We have
D_KL[p_{μ_1, Σ_1} : p_{μ_2, Σ_2}] = B_{F_N}(θ_2 : θ_1) = Y_{F_N}(θ_2 : η_1) = Y_{F_P}(θ̄_2 : η̄_1) = B_{F_P}(θ̄_2 : θ̄_1) = D_KL[p_{0, P̄_1 = f(μ_1, Σ_1)} : p_{0, P̄_2 = f(μ_2, Σ_2)}].
Consider now the e -geodesics and m -geodesics on N (linear interpolation with respect to natural and dual moment parameterizations, respectively): γ N e ( N 1 , N 2 ; t ) = ( μ t e , Σ t e ) and γ N m ( N 1 , N 2 ; t ) = ( μ t m , Σ t m ) .
Proposition 5 
(Mixture geodesics preserved). The mixture geodesics are preserved by the embedding f: f ( γ N m ( N 1 , N 2 ; t ) ) = γ P m ( f ( N 1 ) , f ( N 2 ) ; t ) . The exponential geodesics are preserved for the subspace of N with fixed mean μ: N μ .
Proof. 
For the m-geodesics, let us check that
f(μ_t^m, Σ_t^m) = [ Σ_t^m + μ_t^m (μ_t^m)⊤, μ_t^m ; (μ_t^m)⊤, 1 ] = t f(μ_1, Σ_1) + (1−t) f(μ_2, Σ_2) = t P̄_1 + (1−t) P̄_2,
since Σ_t^m + μ_t^m (μ_t^m)⊤ = t (Σ_1 + μ_1 μ_1⊤) + (1−t) (Σ_2 + μ_2 μ_2⊤) by linearity of the mixture geodesic in the moment parameterization. Thus, we have f(γ_N^m(N_1, N_2; t)) = γ_P^m(P̄_1, P̄_2; t). □
Therefore, all algorithms on N which only require m-geodesics or m-projections [1] (obtained by minimizing the right-hand side argument of the KLD) can be implemented by algorithms on P. See, for example, the minimum enclosing ball approximation algorithm called BBC in [87]. Notice that the N̄_μ (fixed-mean normal submanifolds) preserve both mixture and exponential geodesics: the submanifolds N̄_μ are said to be doubly autoparallel [88].
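The preservation of mixture geodesics is easy to check numerically; the following self-contained sketch does so for a random pair of bivariate normals (helper names ours):

```python
import numpy as np

def co_embed(mu, Sigma):
    # Calvo & Oller embedding f(mu, Sigma) into the (d+1) x (d+1) SPD cone.
    d = len(mu)
    P = np.empty((d + 1, d + 1))
    P[:d, :d] = Sigma + np.outer(mu, mu)
    P[:d, d] = P[d, :d] = mu
    P[d, d] = 1.0
    return P

def mixture_point(mu1, S1, mu2, S2, t):
    # Linear interpolation of the moments (mu, Sigma + mu mu^T), matching the
    # t/(1-t) convention used in the proof above.
    m = t * mu1 + (1.0 - t) * mu2
    M = t * (S1 + np.outer(mu1, mu1)) + (1.0 - t) * (S2 + np.outer(mu2, mu2))
    return m, M - np.outer(m, m)

# Numerical check that f(gamma_m(t)) = t f(N1) + (1-t) f(N2):
rng = np.random.default_rng(1)
mu1, mu2 = rng.normal(size=2), rng.normal(size=2)
A, B = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
S1, S2 = A @ A.T + np.eye(2), B @ B.T + np.eye(2)
t = 0.3
m, S = mixture_point(mu1, S1, mu2, S2, t)
assert np.allclose(co_embed(m, S),
                   t * co_embed(mu1, S1) + (1.0 - t) * co_embed(mu2, S2))
```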
Remark 5. 
In [2] (p. 355), exercises 13.8 and 13.9 ask to prove the equivalence of the following statements for S a submanifold of M :
  • S is an exponential family ⇔ S is ∇^{+1}-autoparallel in M (exercise 13.8),
  • S is a mixture family ⇔ S is ∇^{−1}-autoparallel in M (exercise 13.9).
Let P̄ = [ Σ + μμ⊤, μ ; μ⊤, 1 ] (with |P̄| = |Σ|), P̄^{−1} = [ Σ^{−1}, −Σ^{−1}μ ; −μ⊤Σ^{−1}, 1 + μ⊤Σ^{−1}μ ], and y = (x⊤, 1)⊤. Then we have
q_{P̄}(y) = (2π)^{−(d+1)/2} |P̄|^{−1/2} exp( −½ y⊤ P̄^{−1} y ) = (2π)^{−(d+1)/2} |Σ|^{−1/2} exp( −½ [x⊤ 1] [ Σ^{−1}, −Σ^{−1}μ ; −μ⊤Σ^{−1}, 1 + μ⊤Σ^{−1}μ ] [x⊤ 1]⊤ ).
Thus, N̄ = {q_{P̄}(x, 1)} is an exponential family. Therefore, we deduce that N̄ is ∇^e-autoparallel in P. However, N̄ is not a mixture family, and thus N̄ is not ∇^m-autoparallel in P.

6. Conclusions and Discussion

In general, the Fisher–Rao distance between multivariate normals (MVNs) is not known in closed form. In practice, the Fisher–Rao distance is usually approximated by costly geodesic shooting techniques [39,40,41], which require time-consuming computations of the Riemannian exponential map and are nevertheless limited to normals within a short range of each other. In this work, we considered a simple alternative approach for approximating the Fisher–Rao distance by approximating the Riemannian lengths of curves which admit closed-form parameterizations. In particular, we considered the mixed exponential-mixture curve and the projected symmetric positive–definite matrix geodesic obtained from Calvo and Oller's isometric submanifold embedding into the SPD cone [19]. We summarize our method to approximate ρ_N(N_1, N_2) between N_1 = N(μ_1, Σ_1) and N_2 = N(μ_2, Σ_2) as follows:
ρ̃_T^CO(N_1, N_2) := ∑_{t=0}^{T−1} √( D_J[S̄_t, S̄_{t+1}] ),
where
S̄_t = proj_N̄(S_t), with proj_N̄( [ Σ + βμμ⊤, βμ ; βμ⊤, β ] ) = [ Σ + μμ⊤, μ ; μ⊤, 1 ],
and
S_t = P̄_1^{1/2} ( P̄_1^{−1/2} P̄_2 P̄_1^{−1/2} )^{t/T} P̄_1^{1/2},
with
P̄_1 = f(N_1) = [ Σ_1 + μ_1μ_1⊤, μ_1 ; μ_1⊤, 1 ], P̄_2 = f(N_2) = [ Σ_2 + μ_2μ_2⊤, μ_2 ; μ_2⊤, 1 ].
We proved the following sandwich bounds for our approximation:
ρ_N(N_1, N_2) ≤ ρ̃_T^CO(N_1, N_2) ≤ ρ_N(N_1, N_2) + 2 δ_T^CO(P̄_1, P̄_2),
where
δ_T^CO(P̄_1, P̄_2) := (1/T) ∑_{t=1}^{T} ρ_P(S_t, S̄_t).
Notice that we may equivalently calculate D_J[S̄_t, S̄_{t+1}] as D_J[G_t, G_{t+1}], where G_t = f^{−1}(S̄_t) = N(m_t, C_t) for t ∈ {0, …, T} (see Proposition 3).
We also reported a fast way to upper bound the Fisher–Rao distance by the square root of Jeffreys' divergence, ρ_N(N_1, N_2) ≤ √(D_J[N_1, N_2]), which is tight at infinitesimal scale. In practice, this upper bound beats the upper bound of [38] when the normal distributions are not too far from each other. Finally, we showed that Calvo and Oller's SPD submanifold embedding [19] is not only isometric, but also preserves the Kullback–Leibler divergence, the Fenchel–Young divergence, and the mixture geodesics. Our approximation technique extends to elliptical distributions, which generalize multivariate normal distributions [32,55]. Moreover, we obtained a closed form for the Fisher–Rao distance between normals sharing the same covariance matrix using the technique of maximal invariants under the action of the affine group in Section 1.5. We may also consider other distances, different from the Fisher–Rao distance, which admit closed-form formulas: for example, the Calvo and Oller metric distance [19] (a lower bound on the Fisher–Rao distance) or the metric distance proposed in [82] (see Appendix C), whose geodesics enjoy the asymptotic property of the Fisher–Rao geodesics [89]. The C&O distance is very well-suited for short Fisher–Rao distances, while the symmetric space distance is well-tailored for large Fisher–Rao distances. The calculations of these closed-form distances rely on generalized eigenvalues. We also propose an embedding of normals into the Siegel upper space in Appendix D. To conclude, let us propose yet another alternative distance, the Hilbert projective distance on the SPD cone [90], which only requires calculating the minimal and maximal eigenvalues (say, using the power iteration method [91]):
ρ_Hilbert(P_1, P_2) = log( λ_max(P_1^{−1}P_2) / λ_min(P_1^{−1}P_2) ).
The dissimilarity is said to be projective on the SPD cone because ρ_Hilbert(P_1, P_2) = 0 if and only if P_1 = λP_2 for some λ > 0. However, let us notice that it yields a proper metric distance on N̄:
ρ_Hilbert(N_1, N_2) := ρ_Hilbert(P̄_1, P̄_2),
since P̄_1 = λP̄_2 holds if and only if λ = 1 because the matrix entries (P̄_1)_{d+1,d+1} = (P̄_2)_{d+1,d+1} = 1; that is, P̄_1 = P̄_2, implying N_1 = N_2 since f is an (isometric) diffeomorphism.
Notice that since λ_max(P) = 1/λ_min(P^{−1}) and λ_min(P) = 1/λ_max(P^{−1}), and since λ_max(P_1P_2) ≤ λ_max(P_1) λ_max(P_2) and λ_min(P_1P_2) ≥ λ_min(P_1) λ_min(P_2), we have the following upper bound on the Hilbert distance: ρ_Hilbert(P_1, P_2) ≤ log( λ_max(P_1)/λ_min(P_1) ) + log( λ_max(P_2)/λ_min(P_2) ).
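A few-line sketch of the Hilbert projective distance; applying it to P̄_1 = f(N_1) and P̄_2 = f(N_2) yields the metric distance on N̄ discussed above:

```python
import numpy as np

def hilbert_spd(P1, P2):
    # Log of the ratio of the extreme eigenvalues of P1^{-1} P2.
    lam = np.linalg.eigvals(np.linalg.solve(P1, P2)).real
    return np.log(lam.max() / lam.min())
```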

Supplementary Materials

The following supporting information can be downloaded at: https://franknielsen.github.io/FisherRaoMVN.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Acknowledgments

I warmly thank Frédéric Barbaresco (Thales) and Mohammad Emtiyaz Khan (Riken AIP) for fruitful discussions about this work.

Conflicts of Interest

The author declares no conflict of interest.

Abbreviations

Entities
N ( μ , Σ ) d-variate normal distribution (mean μ , covariance matrix Σ )
p ( μ , Σ ) ( x ) Probability density function of N ( μ , Σ )
q Σ ( y ) = p ( 0 , Σ ) ( y ) Probability density function of N ( 0 , Σ )
P = ( P i j ) Positive–definite matrix with matrix entries P i j
Mappings
P̄ = f_1(N): Calvo and Oller mapping [19] (1990)
P̂ = f_{1/(d+1), 1}(N) = f̂(N): Calvo and Oller mapping [32] (2002) or [82]
Groups
GL ( d ) Group of linear transformations (invertible d × d matrices)
SL ( d ) Special linear group ( d × d matrices with unit determinant)
Aff ( d ) Affine group of dimension d
Sets
N Set of multivariate normal distributions N ( μ , Σ ) (MVNs)
Sym ( d ) Set of symmetric d × d real matrices
P Symmetric positive–definite matrix cone (SPD matrix cone)
P c Set of SPD matrices with fixed determinant c ( P = R > 0 × P c )
SSPD, P 1 Set of SPD matrices with unit determinant
Λ Parameter space of N ( μ , Σ ) : R d × P ( d )
N 0 , P Set of zero-centered normal distributions N ( 0 , Σ )
N Σ Set of normal distributions N ( μ , Σ ) with fixed Σ
N μ Set of normal distributions N ( μ , Σ ) with fixed μ
N ¯ Set of SPD matrices f ( N )
Riemannian length elements
MVN Fisher: ds²_{Fisher,N} = dμ⊤Σ^{−1}dμ + ½ tr( (Σ^{−1}dΣ)² )
0-MVN Fisher: ds²_{Fisher,N_0} = ½ tr( (Σ^{−1}dΣ)² )
SPD trace: ds²_{β,trace} = β tr( (P^{−1}dP)² ) (when β = ½, ds_trace = ds_{Fisher,N_0})
SPD Calvo and Oller metric: ds²_CO = ½ dβ²/β² + β dμ⊤Σ^{−1}dμ + ½ tr( (Σ^{−1}dΣ)² )
(with ds_CO = ds_P(f(μ, Σ)))
when β = 1, ds_CO = ds_{Fisher,N} in N̄
SPD symmetric space: ds²_SS = ½ dμ⊤Σ^{−1}dμ + tr( (Σ^{−1}dΣ)² ) − ½ tr²( Σ^{−1}dΣ )
Siegel upper space: ds²_SH(Z) = 2 tr( Y^{−1}dZ Y^{−1}dZ̄ ) (ds_SH(iY) = 2 ds_{Fisher,N_0})
Manifolds and submanifolds
M (= M_N): Manifold of multivariate normal distributions
T p M Tangent space at p M
S μ M Submanifold of MVNs with μ prescribed
S Σ M Submanifold of MVNs with Σ prescribed
M Σ manifold of N Σ (non-embedded in M )
M μ manifold of N μ (non-embedded in M )
S [ v ] , Σ Submanifold of MVN set { N ( λ v , Σ ) : λ > 0 }
where v is an eigenvector of Σ
P manifold of symmetric positive–definite matrices
Distances
ρ N ( N 1 , N 2 ) Fisher–Rao distance between normal distributions N 1 and N 2
ρ SPD ( P 1 , P 2 ) Riemannian SPD distance between P 1 and P 2
ρ CO ( N 1 , N 2 ) Calvo and Oller distance from embedding N to P ¯ = f ( N )
ρ SS ( N 1 , N 2 ) Symmetric space distance from embedding N to P ^ = f ^ ( N )
ρ Hilbert ( N 1 , N 2 ) Hilbert distance ρ Hilbert ( P ¯ 1 , P ¯ 2 )
D KL ( N 1 , N 2 ) Kullback–Leibler divergence between MVNs N 1 and N 2
D J ( N 1 , N 2 ) Jeffreys divergence between MVNs N 1 and N 2
D CO ( N 1 , N 2 ) Calvo and Oller dissimilarity measure of Equation (26)
Geodesics and curves
γ N FR ( N 1 , N 2 ; t ) Fisher–Rao geodesic between MVNs N 1 and N 2
γ P FR ( P 1 , P 2 ; t ) Fisher–Rao geodesic between SPD P 1 and P 2
γ N e ( N 1 , N 2 ; t ) exponential geodesic between MVNs N 1 and N 2
γ N m ( N 1 , N 2 ; t ) mixture geodesic between MVNs N 1 and N 2
γ N CO ( N 1 , N 2 ; t ) projection curve (not geodesic) of γ P ( P ¯ 1 , P ¯ 2 ; t ) onto N ¯
Metrics and connections
g N Fisher Fisher information metric of MVNs
g P trace metric
g P Fisher information metric of centered MVNs
g Killing Killing metric studied in [82]
N Fisher Levi–Civita metric connection
N e exponential connection
N m mixture connection

Appendix A. Geodesics on the Fisher–Rao Normal Manifold

Appendix A.1. Parametric Equations of the Fisher–Rao Geodesics between Univariate Normal Distributions

The Fisher–Rao geodesics γ_N^FR(N_1, N_2) on the Fisher–Rao univariate normal manifold are either vertical line segments (when μ_1 = μ_2) or semi-circles with origin on the x-axis, with the x-axis stretched by √2 [92] (Figure A1):
γ_N^FR(μ_1, σ_1; μ_2, σ_2) = { (μ, (1−t)σ_1 + tσ_2), t ∈ [0, 1], if μ_1 = μ_2 = μ ; (√2 (c + r cos t), r sin t), t ∈ [min{θ_1, θ_2}, max{θ_1, θ_2}], if μ_1 ≠ μ_2,
where
c = ( ½(μ_2² − μ_1²) + σ_2² − σ_1² ) / ( √2 (μ_2 − μ_1) ), r = √( (μ_i/√2 − c)² + σ_i² ), i ∈ {1, 2},
and
θ_i = arctan( σ_i / (μ_i/√2 − c) ), i ∈ {1, 2},
provided that θ_i ≥ 0 for i ∈ {1, 2} (otherwise, we let θ_i ← θ_i + π).
Figure A1. Visualizing some Fisher–Rao geodesics of univariate normal distributions on the stretched Poincaré upper plane (semi-circles with origin on the x-axis, with the x-axis stretched by √2). Full geodesics are plotted with a thin gray style and geodesic arcs are plotted with a thick black style.
It is remarkable that the Fisher–Rao distance between univariate normal distributions is available in closed form: by contrast, the Euclidean length (with respect to the Euclidean metric) of semi-ellipse curves (perimeters) is not known in closed form but can be expressed using the so-called complete elliptic integral of the second kind [93].

Appendix A.2. Geodesics with Initial Values on the Multivariate Fisher–Rao Normal Manifold

The geodesic equations are given by
μ̈ − Σ̇ Σ^{−1} μ̇ = 0, Σ̈ + μ̇ μ̇⊤ − Σ̇ Σ^{−1} Σ̇ = 0.
We concisely report the parametric geodesics using another variant of the natural parameters of the normal distributions (slightly differing from the θ -coordinate system since natural parameters can be chosen up to a fixed affine transformation by changing accordingly the sufficient statistics by the inverse affine transformation) viewed as an exponential family:
ξ = Σ 1 μ , Ξ = Σ 1 .
In general, the geodesics with boundary values γ N Fisher ( N 1 , N 2 ; t ) are not known in closed form. However, Calvo and Oller [48] (Theorem 3.1 and Corollary 1) reported the explicit equations of the geodesics when the initial values are given, i.e., γ N Fisher ( N 0 , v 0 ; t ) where v 0 = γ ˙ N Fisher ( N 0 , v 0 ; 0 ) = ( ξ ˙ ( 0 ) , Ξ ˙ ( 0 ) ) is in T N 0 M and γ N Fisher ( N 0 , v 0 ; 0 ) = N 0 .
Let
B = Ξ(0)^{−1/2} Ξ̇(0) Ξ(0)^{−1/2}, a = Ξ(0)^{−1/2} ξ̇(0) + B Ξ(0)^{−1/2} ξ(0), G = ( B² + 2aa⊤ )^{1/2},
and let G⁻ be the Moore–Penrose generalized inverse matrix of G: G⁻ = (G⊤G)^{−1}G⊤ or G⁻ = G⊤(GG⊤)^{−1}. The Moore–Penrose pseudo-inverse matrix can be replaced by any other pseudo-inverse matrix G⁻ [48].
Then we have (ξ(t), Ξ(t)) = γ_N^Fisher(N_0, v_0; t) with
R(t) = Cosh(½ G t) − B G⁻ Sinh(½ G t),
Ξ(t) = Ξ(0)^{1/2} R(t) R(t)⊤ Ξ(0)^{1/2},
ξ(t) = 2 Ξ(0)^{1/2} R(t) Sinh(½ G t) G⁻ a + Ξ(t) Ξ(0)^{−1} ξ(0),
where the Cosh and Sinh functions of a matrix M are defined by the following absolutely convergent series [48] (Equation (9), p. 122):
Sinh(M) = M + ∑_{i=1}^{∞} M^{2i+1}/(2i+1)!, Cosh(M) = I + ∑_{i=1}^{∞} M^{2i}/(2i)!,
and they satisfy the identity Cosh²(M) − Sinh²(M) = I. The matrix Cosh and Sinh functions can be calculated from the eigendecomposition M = O diag(λ_1, …, λ_d) O⊤ as follows:
Sinh(M) = O diag(sinh(λ_1), …, sinh(λ_d)) O⊤, with sinh(u) = (e^u − e^{−u})/2 = ∑_{i=0}^{∞} u^{2i+1}/(2i+1)!,
Cosh(M) = O diag(cosh(λ_1), …, cosh(λ_d)) O⊤, with cosh(u) = (e^u + e^{−u})/2 = ∑_{i=0}^{∞} u^{2i}/(2i)!.
When we restrict the manifold to the totally geodesic submanifold M_μ ≅ {P ≻ 0}, the geodesic equation becomes P̈ − Ṗ P^{−1} Ṗ = 0, and the geodesic with initial values P(0) = P and Ṗ(0) = S ∈ Sym(d) is:
P(t) = P^{1/2} exp( t P^{−1/2} S P^{−1/2} ) P^{1/2}.
The geodesic with boundary values P(0) = P_1 and P(1) = P_2 is
P(t) = P_1^{1/2} exp( t Log( P_1^{−1/2} P_2 P_1^{−1/2} ) ) P_1^{1/2}.
Furthermore, we can convert a geodesic with boundary values γ_P(P_1, P_2; t) into an equivalent geodesic with initial values γ_P(P, S; t) by letting
S = P_1^{1/2} Log( P_1^{−1/2} P_2 P_1^{−1/2} ) P_1^{1/2}.
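A sketch of both parameterizations, with a numerical consistency check (SciPy's fractional_matrix_power, expm, and logm handle the matrix functions; names are ours):

```python
import numpy as np
from scipy.linalg import expm, logm, fractional_matrix_power

def spd_geodesic(P1, P2, t):
    # Boundary-value geodesic: P(t) = P1^{1/2} (P1^{-1/2} P2 P1^{-1/2})^t P1^{1/2}.
    Ph = fractional_matrix_power(P1, 0.5)
    Pih = fractional_matrix_power(P1, -0.5)
    return Ph @ fractional_matrix_power(Pih @ P2 @ Pih, t) @ Ph

def initial_velocity(P1, P2):
    # S = P1^{1/2} Log(P1^{-1/2} P2 P1^{-1/2}) P1^{1/2}, the velocity at t = 0.
    Ph = fractional_matrix_power(P1, 0.5)
    Pih = fractional_matrix_power(P1, -0.5)
    return Ph @ logm(Pih @ P2 @ Pih) @ Ph

# Consistency check: the initial-value form reproduces the boundary-value form.
rng = np.random.default_rng(2)
A, B = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
P1, P2 = A @ A.T + np.eye(3), B @ B.T + np.eye(3)
S = initial_velocity(P1, P2)
Ph = fractional_matrix_power(P1, 0.5)
Pih = fractional_matrix_power(P1, -0.5)
t = 0.7
assert np.allclose(spd_geodesic(P1, P2, t), Ph @ expm(t * Pih @ S @ Pih) @ Ph)
```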

Appendix B. Fisher–Rao Distance between Normal Distributions Sharing the Same Covariance Matrix

The Rao distance between N_1 = N(μ_1, Σ) and N_2 = N(μ_2, Σ) has been reported in closed form in [42] (Proposition 3). We shall explain the geometric method in full as follows: let (e_1, …, e_d) be the standard frame of R^d (ordered basis); the e_i's are the unit vectors of the axes x_i. Let P be an orthogonal matrix such that P(μ_2 − μ_1) = ‖μ_2 − μ_1‖_2 e_1 (i.e., matrix P aligns the vector μ_2 − μ_1 with the first axis x_1). Let Δ_12 = ‖μ_2 − μ_1‖_2 be the Euclidean distance between μ_1 and μ_2. Furthermore, factorize the matrix PΣP⊤ using the LDL⊤ decomposition (a variant of the Cholesky decomposition) as PΣP⊤ = LDL⊤, where L is a lower triangular matrix with all diagonal entries equal to one (a lower unitriangular matrix of unit determinant) and D is a diagonal matrix. Let σ_12² = D_11. Then we have [42]:
ρ_Σ(μ_1, μ_2) = ρ_N(N(μ_1, Σ), N(μ_2, Σ)) = ρ_N(N(0, σ_12²), N(Δ_12, σ_12²)).
Please note that the right-hand side term is the Fisher–Rao distance between univariate normal distributions of Equation (7).
To find the matrix P, we proceed as follows: let u = (μ_2 − μ_1)/‖μ_2 − μ_1‖_2 be the normalized vector to align with axis x_1, and let v = u − e_1. Consider the Householder reflection matrix [94] M = I − 2 vv⊤/‖v‖_2², where vv⊤ is an outer-product matrix. Since Householder reflection matrices have determinant −1, we let P be a copy of M with the last row multiplied by −1 so that we obtain det(P) = 1. By construction, we have Pu = e_1, i.e., P(μ_2 − μ_1) = ‖μ_2 − μ_1‖_2 e_1. We then use the affine-invariance property of the Fisher–Rao distance as follows:
ρ_N(N(μ_1, Σ), N(μ_2, Σ)) = ρ_N(N(0, Σ), N(μ_2 − μ_1, Σ))
= ρ_N(N(0, PΣP⊤), N(P(μ_2 − μ_1), PΣP⊤))
= ρ_N(N(0, PΣP⊤), N(Δ_12 e_1, PΣP⊤))
= ρ_N(N(0, LDL⊤), N(Δ_12 e_1, LDL⊤))
= ρ_N(N(0, D), N(Δ_12 e_1, D)).
The last row follows from the fact that L^{−⊤} e_1 = e_1 since L^{−⊤} is an upper unitriangular matrix, and L⊤ L^{−⊤} = (L^{−1} L)⊤ = I. The right-hand side Fisher–Rao distance is then computed from Equation (7).
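A sketch of this reduction for d ≥ 2; the univariate closed form below is our reading of Equation (7) through the √2-stretched Poincaré upper half-plane:

```python
import numpy as np

def rao_univariate(mu1, s1, mu2, s2):
    # Closed-form Fisher-Rao distance between N(mu1, s1^2) and N(mu2, s2^2),
    # via the hyperbolic distance in the sqrt(2)-stretched upper half-plane.
    a = np.hypot((mu1 - mu2) / np.sqrt(2.0), s1 + s2)
    b = np.hypot((mu1 - mu2) / np.sqrt(2.0), s1 - s2)
    return np.sqrt(2.0) * np.log((a + b) / (a - b))

def rao_shared_covariance(mu1, mu2, Sigma):
    # Householder alignment of mu2 - mu1 with e1, then LDL^T reduction.
    d = len(mu1)
    delta = np.linalg.norm(mu2 - mu1)
    u = (mu2 - mu1) / delta
    e1 = np.zeros(d); e1[0] = 1.0
    v = u - e1
    if np.allclose(v, 0.0):
        P = np.eye(d)
    else:
        P = np.eye(d) - 2.0 * np.outer(v, v) / (v @ v)  # reflection: P u = e1
        P[-1] *= -1.0            # flip last row so that det(P) = +1 (d >= 2)
    L = np.linalg.cholesky(P @ Sigma @ P.T)
    s12 = L[0, 0]                # equals sqrt(D_11) of the LDL^T factorization
    return rao_univariate(0.0, s12, delta, s12)
```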

Appendix C. Embedding the Set of Multivariate Normal Distributions in a Riemannian Symmetric Space

The multivariate normal manifold N(d) can also be embedded into the SPD cone P(d+1) as a Riemannian symmetric space [82,89] by f_SSPD: P̂ = {f_SSPD(N) ∈ P(d+1) : N ∈ N(d)}. We have P̂ ≅ SL(d+1)/SO(d+1) [82,95,96] (and the textbook [97], Part II, Chapter 10), and the symmetric space SL(d+1)/SO(d+1) can be equipped with the Killing Riemannian metric instead of the Fisher information metric:
g_Killing(v_1, v_2) = κ_Killing ( μ̇_1⊤ Σ^{−1} μ̇_2 + ½ tr( Σ^{−1} Σ̇_1 Σ^{−1} Σ̇_2 ) − 1/(2(d+1)) tr( Σ^{−1} Σ̇_1 ) tr( Σ^{−1} Σ̇_2 ) ),
where v_i = (μ̇_i, Σ̇_i) denote tangent vectors at N = (μ, Σ) and κ_Killing > 0 is a predetermined constant (e.g., 1). The length element of the Killing metric is
ds²_SS = κ_Killing ( ½ dμ⊤ Σ^{−1} dμ + tr( (Σ^{−1}dΣ)² ) − ½ tr²( Σ^{−1}dΣ ) ).
When we consider N Σ , we may choose κ Killing = 2 so that the Killing metric coincides with the Fisher information metric. The induced Killing distance [82] is available in closed form:
ρ_Killing(N_1, N_2) = √( κ_Killing ∑_{i=1}^{d+1} log² λ_i( L̂_1^{−1} P̂_2 L̂_1^{−⊤} ) ),
where L̂_1 is the unique lower triangular matrix obtained from the Cholesky decomposition of P̂_1 = f_SSPD(N_1) = L̂_1 L̂_1⊤. Please note that L̂_1^{−1} P̂_2 L̂_1^{−⊤} ∈ P(d+1) and |L̂_1| = 1, i.e., L̂_1 ∈ SL(d+1). When N_1 = (μ_1, Σ) and N_2 = (μ_2, Σ) (N_1, N_2 ∈ N_Σ), we have [82]
ρ_Killing(N_1, N_2) = √(2 κ_Killing) arccosh( 1 + ½ Δ_Σ²(μ_1, μ_2) ),
where Δ_Σ² is the squared Mahalanobis distance. Thus, ρ_Killing(N_1, N_2) = h_Killing(Δ_Σ(μ_1, μ_2)) with h_Killing(u) = √(2 κ_Killing) arccosh(1 + ½u²).
When N_1 = (μ, Σ_1) and N_2 = (μ, Σ_2) (N_1, N_2 ∈ N_μ), we have [82]:
ρ_Killing(N_1, N_2) = √( κ_Killing [ ∑_{i=1}^{d} log² λ_i( L_1^{−1} P_2 L_1^{−⊤} ) − (1/(d+1)) ( ∑_{i=1}^{d} log λ_i( L_1^{−1} P_2 L_1^{−⊤} ) )² ] ).
See Example 1. Let us emphasize that the Killing distance is not the Fisher–Rao distance but is available in closed form as an alternative metric distance between MVNs.
A Fisher geodesic defect measure of a curve c is defined in [89] by
δ(c) = lim_{s→∞} (1/s) ∫_0^s ‖ ∇^{g_Fisher}_{ċ} ċ ‖_{c(t)}^{Fisher} dt,
where ∇^{g_Fisher} denotes the Levi–Civita connection induced by the Fisher metric. When δ(c) = 0, the curve is said to be an asymptotic Fisher geodesic. It is proven that Killing geodesics emanating from (μ, Σ) are asymptotic Fisher geodesics when the initial condition ċ(0) is orthogonal to N_μ.

Appendix D. Embedding the Set of Multivariate Normal Distributions in the Siegel Upper Space

The Siegel upper space is the space of symmetric complex matrices Z = X + iY = Z⊤ with positive–definite imaginary part Y ≻ 0 [45,65] (so-called Riemann matrices [98]):
SH(d) := { Z = X + iY : X ∈ Sym(d), Y ∈ P(d) },
where Sym ( d ) is the space of symmetric real d × d matrices. SH ( 1 ) corresponds to the Poincaré upper plane. See Figure A2 for an illustration.
The Siegel infinitesimal square line element is
ds²_SH(Z) = 2 tr( Y^{−1} dZ Y^{−1} dZ̄ ).
When X = 0 and Z = iY, we have dZ = i dY and dZ̄ = −i dY, and it follows that
ds²_SH(iY) = 2 tr( (Y^{−1} dY)² ).
That is, four times the squared length element of the Fisher metric of centered normal distributions: ds²_{N_0} = ½ tr( (P^{−1} dP)² ).
The Siegel distance [45] between Z_1 and Z_2 ∈ SH(d) is
ρ_SH(Z_1, Z_2) = √( ∑_{i=1}^{d} log²( (1 + r_i)/(1 − r_i) ) ),
where
r_i = √( λ_i( R(Z_1, Z_2) ) ),
with R(Z_1, Z_2) denoting the matrix generalization of the cross-ratio:
R(Z_1, Z_2) := (Z_1 − Z_2)(Z_1 − Z̄_2)^{−1}(Z̄_1 − Z̄_2)(Z̄_1 − Z_2)^{−1},
and λ_i(M) denoting the i-th largest (real) eigenvalue of the (complex) matrix M. (In practice, we must numerically round off the tiny imaginary parts to obtain proper real eigenvalues [65].) The Siegel upper half space is a homogeneous space where the Lie group SU(d, d)/S(U(d) × U(d)) acts transitively on it.
We can embed a multivariate normal distribution N = ( μ , Σ ) into SH ( d ) as follows:
N(μ, Σ) ⟼ Z(N) := μμ⊤ + iΣ,
and consider the Siegel distance on the embedded normal distributions as another potential metric distance between multivariate normal distributions:
ρ SH ( N 1 , N 2 ) = ρ SH ( Z ( N 1 ) , Z ( N 2 ) ) .
Notice that the real matrix parts of the Z(N)'s are all of rank one by construction.
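A short sketch of the Siegel distance and of the proposed embedding (complex NumPy arithmetic; tiny imaginary/negative numerical residues are clipped):

```python
import numpy as np

def siegel_distance(Z1, Z2):
    # Distance via the matrix cross-ratio R(Z1, Z2); r_i = sqrt(eigenvalues of R).
    R = (Z1 - Z2) @ np.linalg.inv(Z1 - Z2.conj()) \
        @ (Z1.conj() - Z2.conj()) @ np.linalg.inv(Z1.conj() - Z2)
    r = np.sqrt(np.clip(np.linalg.eigvals(R).real, 0.0, None))  # drop tiny residues
    return np.sqrt(np.sum(np.log((1.0 + r) / (1.0 - r)) ** 2))

def siegel_embed(mu, Sigma):
    # The embedding N(mu, Sigma) -> Z(N) = mu mu^T + i Sigma proposed above.
    return np.outer(mu, mu) + 1j * Sigma
```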
Figure A2. Siegel upper space generalizes the Poincaré hyperbolic upper plane.

Appendix E. The Symmetrized Bregman Divergence Expressed as Integral Energies on Dual Geodesics

Let S_F(θ_1; θ_2) = B_F(θ_1 : θ_2) + B_F(θ_2 : θ_1) be a symmetrized Bregman divergence. Let ds² = dθ⊤ ∇²F(θ) dθ denote the squared length element on the Bregman manifold, and denote by γ(t) and γ*(t) the dual geodesics connecting θ_1 to θ_2. We can express S_F(θ_1; θ_2) as integral energies on the dual geodesics:
Property A1. 
We have S_F(θ_1; θ_2) = ∫_0^1 ds²(γ(t)) dt = ∫_0^1 ds²(γ*(t)) dt.
Proof. 
The proof that the symmetrized Bregman divergence amounts to these energy integrals is based on first-order and second-order directional derivatives. The first-order directional derivative ∇_u F(θ) with respect to a vector u is defined by
∇_u F(θ) = lim_{t→0} ( F(θ + tu) − F(θ) )/t = u⊤ ∇F(θ).
The second-order directional derivative ∇²_{u,v} F(θ) is
∇²_{u,v} F(θ) = ∇_u ∇_v F(θ) = lim_{t→0} ( ∇_v F(θ + tu) − ∇_v F(θ) )/t = u⊤ ∇²F(θ) v.
Now consider the squared length element ds²(γ(t)) on the primal geodesic γ(t) expressed using the primal coordinate system θ: ds²(γ(t)) = dθ(t)⊤ ∇²F(θ(t)) dθ(t) with θ(γ(t)) = θ_1 + t(θ_2 − θ_1) and dθ(t) = θ_2 − θ_1. Let us express ds²(γ(t)) using the second-order directional derivative:
ds²(γ(t)) = ∇²_{θ_2−θ_1, θ_2−θ_1} F(θ(t)).
Thus, we have ∫_0^1 ds²(γ(t)) dt = [∇_{θ_2−θ_1} F(θ(t))]_0^1, where the first-order directional derivative is ∇_{θ_2−θ_1} F(θ(t)) = (θ_2 − θ_1)⊤ ∇F(θ(t)). Therefore, we obtain ∫_0^1 ds²(γ(t)) dt = (θ_2 − θ_1)⊤ (∇F(θ_2) − ∇F(θ_1)) = S_F(θ_1; θ_2).
Similarly, we express the squared length element ds²(γ*(t)) using the dual coordinate system η as the second-order directional derivative of F*(η(t)) with η(γ*(t)) = η_1 + t(η_2 − η_1):
ds²(γ*(t)) = ∇²_{η_2−η_1, η_2−η_1} F*(η(t)).
Therefore, we have ∫_0^1 ds²(γ*(t)) dt = [∇_{η_2−η_1} F*(η(t))]_0^1 = S_{F*}(η_1; η_2). Since S_{F*}(η_1; η_2) = S_F(θ_1; θ_2), we conclude that
S_F(θ_1; θ_2) = ∫_0^1 ds²(γ(t)) dt = ∫_0^1 ds²(γ*(t)) dt.
Please note that in 1D, both pregeodesics γ(t) and γ*(t) coincide. We have ds²(t) = (θ_2 − θ_1)² f″(θ(t)) = (η_2 − η_1)² f*″(η(t)), so that we check that S_F(θ_1; θ_2) = ∫_0^1 ds²(γ(t)) dt = (θ_2 − θ_1)[f′(θ(t))]_0^1 = (η_2 − η_1)[f*′(η(t))]_0^1 = (η_2 − η_1)(θ_2 − θ_1). □
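As a sanity check of Property A1, the following sketch verifies the identity numerically in 1D for a generator of our own choosing (F(θ) = −log θ, not an example from the paper):

```python
import numpy as np

# F(theta) = -log(theta) on theta > 0 (the Burg/Itakura-Saito generator).
fprime = lambda th: -1.0 / th          # F'(theta) = eta
fsecond = lambda th: 1.0 / th ** 2     # F''(theta), the 1D metric

theta1, theta2 = 0.5, 3.0
SF = (theta2 - theta1) * (fprime(theta2) - fprime(theta1))  # (dtheta)(deta)

# Midpoint-rule quadrature of the energy along the primal geodesic:
t = (np.arange(100000) + 0.5) / 100000
energy = np.mean((theta2 - theta1) ** 2 * fsecond(theta1 + t * (theta2 - theta1)))
assert abs(SF - energy) < 1e-6          # both equal 25/6 here
```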

References

  1. Amari, S.I. Information Geometry and Its Applications; Applied Mathematical Sciences; Springer: Tokyo, Japan, 2016.
  2. Calin, O.; Udrişte, C. Geometric Modeling in Probability and Statistics; Springer: Berlin/Heidelberg, Germany, 2014; Volume 121.
  3. Lin, Z. Riemannian geometry of symmetric positive definite matrices via Cholesky decomposition. SIAM J. Matrix Anal. Appl. 2019, 40, 1353–1370.
  4. Soen, A.; Sun, K. On the variance of the Fisher information for deep learning. Adv. Neural Inf. Process. Syst. 2021, 34, 5708–5719.
  5. Barachant, A.; Bonnet, S.; Congedo, M.; Jutten, C. Classification of covariance matrices using a Riemannian-based kernel for BCI applications. Neurocomputing 2013, 112, 172–178.
  6. Skovgaard, L.T. A Riemannian Geometry of the Multivariate Normal Model; Technical Report 81/3; Statistical Research Unit, Danish Medical Research Council, Danish Social Science Research Council: Copenhagen, Denmark, 1981.
  7. Skovgaard, L.T. A Riemannian geometry of the multivariate normal model. Scand. J. Stat. 1984, 11, 211–223.
  8. Malagò, L.; Pistone, G. Information geometry of the Gaussian distribution in view of stochastic optimization. In Proceedings of the ACM Conference on Foundations of Genetic Algorithms XIII, Aberystwyth, UK, 17–22 January 2015; pp. 150–162.
  9. Herntier, T.; Peter, A.M. Transversality Conditions for Geodesics on the Statistical Manifold of Multivariate Gaussian Distributions. Entropy 2022, 24, 1698.
  10. Atkinson, C.; Mitchell, A.F. Rao's distance measure. SankhyĀ Indian J. Stat. Ser. 1981, 43, 345–365.
  11. Radhakrishna Rao, C. Information and accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 1945, 37, 81–91.
  12. Chen, X.; Zhou, J.; Hu, S. Upper bounds for Rao distance on the manifold of multivariate elliptical distributions. Automatica 2021, 129, 109604.
  13. Hotelling, H. Spaces of statistical parameters. Bull. Am. Math. Soc. 1930, 36, 191.
  14. Cencov, N.N. Statistical Decision Rules and Optimal Inference; American Mathematical Soc.: Providence, RI, USA, 2000; Volume 53.
  15. Bauer, M.; Bruveris, M.; Michor, P.W. Uniqueness of the Fisher–Rao metric on the space of smooth densities. Bull. Lond. Math. Soc. 2016, 48, 499–506.
  16. Fujiwara, A. Hommage to Chentsov's theorem. Inf. Geom. 2022, 1–20.
  17. Bruveris, M.; Michor, P.W. Geometry of the Fisher–Rao metric on the space of smooth densities on a compact manifold. Math. Nachrichten 2019, 292, 511–523.
  18. Burbea, J.; Oller i Sala, J.M. On Rao Distance Asymptotic Distribution; Technical Report Mathematics Preprint Series No. 67; Universitat de Barcelona: Barcelona, Spain, 1989.
  19. Calvo, M.; Oller, J.M. A distance between multivariate normal distributions based in an embedding into the Siegel group. J. Multivar. Anal. 1990, 35, 223–242.
  20. Rios, M.; Villarroya, A.; Oller, J.M. Rao distance between multivariate linear normal models and their application to the classification of response curves. Comput. Stat. Data Anal. 1992, 13, 431–445.
  21. Park, P.S.; Kshirsagar, A.M. Distances between normal populations when covariance matrices are unequal. Commun. Stat. Theory Methods 1994, 23, 3549–3556.
  22. Gruber, M.H. Some applications of the Rao distance to shrinkage estimators. Commun. Stat. Methods 2008, 37, 180–193.
  23. Strapasson, J.E.; Pinele, J.; Costa, S.I. Clustering using the Fisher-Rao distance. In Proceedings of the 2016 IEEE Sensor Array and Multichannel Signal Processing Workshop (SAM), Rio de Janeiro, Brazil, 10–13 July 2016; pp. 1–5.
  24. Le Brigant, A.; Puechmorel, S. Quantization and clustering on Riemannian manifolds with an application to air traffic analysis. J. Multivar. Anal. 2019, 173, 685–703.
  25. Said, S.; Bombrun, L.; Berthoumieu, Y. Texture classification using Rao's distance on the space of covariance matrices. In Proceedings of the Geometric Science of Information: Second International Conference, GSI 2015, Proceedings 2, Palaiseau, France, 28–30 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 371–378.
  26. Legrand, L.; Grivel, E. Evaluating dissimilarities between two moving-average models: A comparative study between Jeffrey's divergence and Rao distance. In Proceedings of the 2016 24th European Signal Processing Conference (EUSIPCO), Budapest, Hungary, 8 August–2 September 2016; pp. 205–209.
  27. Halder, A.; Georgiou, T.T. Gradient flows in filtering and Fisher-Rao geometry. In Proceedings of the 2018 Annual American Control Conference (ACC), Milwaukee, WI, USA, 27–29 June 2018; pp. 4281–4286.
  28. Collas, A.; Breloy, A.; Ren, C.; Ginolhac, G.; Ovarlez, J.P. Riemannian optimization for non-centered mixture of scaled Gaussian distributions. arXiv 2022, arXiv:2209.03315.
  29. Liang, T.; Poggio, T.; Rakhlin, A.; Stokes, J. Fisher-Rao metric, geometry, and complexity of neural networks. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, PMLR, Naha, Japan, 16–18 April 2019; pp. 888–896.
  30. Yoshizawa, S.; Tanabe, K. Dual differential geometry associated with the Kullback-Leibler information on the Gaussian distributions and its 2-parameter deformations. SUT J. Math. 1999, 35, 113–137.
  31. Shima, H. The Geometry of Hessian Structures; World Scientific: Singapore, 2007.
  32. Calvo, M.; Oller, J.M. A distance between elliptical distributions based in an embedding into the Siegel group. J. Comput. Appl. Math. 2002, 145, 319–334.
  33. Burbea, J. Informative Geometry of Probability Spaces; Technical Report; Pittsburgh Univ. PA Center for Multivariate Analysis: Pittsburgh, PA, USA, 1984.
  34. Eriksen, P.S. Geodesics Connected with the Fischer Metric on the Multivariate Normal Manifold; Institute of Electronic Systems, Aalborg University Centre: Aalborg, Denmark, 1986.
  35. Berkane, M.; Oden, K.; Bentler, P.M. Geodesic estimation in elliptical distributions. J. Multivar. Anal. 1997, 63, 35–46.
  36. Imai, T.; Takaesu, A.; Wakayama, M. Remarks on Geodesics for Multivariate Normal Models; Technical Report; Faculty of Mathematics, Kyushu University: Fukuoka, Japan, 2011.
  37. Inoue, H. Group theoretical study on geodesics for the elliptical models. In Proceedings of the Geometric Science of Information: Second International Conference, GSI 2015, Proceedings 2, Palaiseau, France, 28–30 October 2015; Springer: Berlin/Heidelberg, Germany, 2015; pp. 605–614.
  38. Strapasson, J.E.; Porto, J.P.; Costa, S.I. On bounds for the Fisher-Rao distance between multivariate normal distributions. AIP Conf. Proc. 2015, 1641, 313–320.
  39. Han, M.; Park, F.C. DTI segmentation and fiber tracking using metrics on multivariate normal distributions. J. Math. Imaging Vis. 2014, 49, 317–334.
  40. Pilté, M.; Barbaresco, F. Tracking quality monitoring based on information geometry and geodesic shooting. In Proceedings of the 2016 17th International Radar Symposium (IRS), Krakow, Poland, 10–12 May 2016; pp. 1–6.
  41. Barbaresco, F. Souriau exponential map algorithm for machine learning on matrix Lie groups. In Proceedings of the Geometric Science of Information: 4th International Conference, GSI 2019, Proceedings 4, Toulouse, France, 27–29 August 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 85–95.
  42. Pinele, J.; Strapasson, J.E.; Costa, S.I. The Fisher–Rao distance between multivariate normal distributions: Special cases, bounds and applications. Entropy 2020, 22, 404.
  43. Dijkstra, E.W. A note on two problems in connexion with graphs. In Edsger Wybe Dijkstra: His Life, Work, and Legacy; Association for Computing Machinery: New York, NY, USA, 2022; pp. 287–290.
  44. Anderson, J.W. Hyperbolic Geometry; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2006.
  45. Siegel, C.L. Symplectic Geometry; First Printed in 1964; Elsevier: Amsterdam, The Netherlands, 2014.
  46. James, A.T. The variance information manifold and the functions on it. In Multivariate Analysis–III; Elsevier: Amsterdam, The Netherlands, 1973; pp. 157–169.
  47. Wells, J.; Cook, M.; Pine, K.; Robinson, B.D. Fisher-Rao distance on the covariance cone. arXiv 2020, arXiv:2010.15861.
  48. Calvo, M.; Oller, J.M. An explicit solution of information geodesic equations for the multivariate normal model. Stat. Risk Model. 1991, 9, 119–138.
  49. Förstner, W.; Moonen, B. A metric for covariance matrices. In Geodesy-the Challenge of the 3rd Millennium; Springer: Berlin/Heidelberg, Germany, 2003; pp. 299–309.
  50. Dolcetti, A.; Pertici, D. Real square roots of matrices: Differential properties in semi-simple, symmetric and orthogonal cases. arXiv 2020, arXiv:2010.15609.
  51. Mahalanobis, P.C. On the generalised distance in statistics. In Proceedings of the National Institute of Science of India; Springer: New Delhi, India, 1936; Volume 12, pp. 49–55.
  52. Eaton, M.L. Group Invariance Applications in Statistics; Institute of Mathematical Statistics: Beachwood, OH, USA, 1989.
  53. Godinho, L.; Natário, J. An introduction to Riemannian geometry: With Applications to Mechanics and Relativity. In Universitext; Springer International Publishing: Cham, Switzerland, 2014.
  54. Strapasson, J.E.; Pinele, J.; Costa, S.I. A totally geodesic submanifold of the multivariate normal distributions and bounds for the Fisher-Rao distance. In Proceedings of the IEEE Information Theory Workshop (ITW), Cambridge, UK, 1–11 September 2016; pp. 61–65.
  55. Chen, X.; Zhou, J. Multisensor Estimation Fusion on Statistical Manifold. Entropy 2022, 24, 1802.
  56. Cherian, A.; Sra, S. Riemannian dictionary learning and sparse coding for positive definite matrices. IEEE Trans. Neural Netw. Learn. Syst. 2016, 28, 2859–2871.
  57. Nguyen, X.S. Geomnet: A neural network based on Riemannian geometries of SPD matrix space and Cholesky space for 3d skeleton-based interaction recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 13379–13389.
  58. Dolcetti, A.; Pertici, D. Differential properties of spaces of symmetric real matrices. arXiv 2018, arXiv:1807.01113.
  59. Verdoolaege, G.; Scheunders, P. On the geometry of multivariate generalized Gaussian models. J. Math. Imaging Vis. 2012, 43, 180–193.
  60. Ali, S.M.; Silvey, S.D. A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. Ser. B 1966, 28, 131–142.
  61. Csiszár, I. Information-type measures of difference of probability distributions and indirect observation. Stud. Sci. Math. Hung. 1967, 2, 229–318.
  62. Nielsen, F.; Okamura, K. A note on the f-divergences between multivariate location-scale families with either prescribed scale matrices or location parameters. arXiv 2022, arXiv:2204.10952.
  63. Moakher, M.; Zéraï, M. The Riemannian geometry of the space of positive-definite matrices and its application to the regularization of positive-definite matrix-valued data. J. Math. Imaging Vis. 2011, 40, 171–187.
  64. Dolcetti, A.; Pertici, D. Elliptic isometries of the manifold of positive definite real matrices with the trace metric. Rend. Circ. Mat. Palermo Ser. 2 2021, 70, 575–592.
  65. Nielsen, F. The Siegel–Klein Disk: Hilbert Geometry of the Siegel Disk Domain. Entropy 2020, 22, 1019.
  66. Arnaudon, M.; Nielsen, F. On approximating the Riemannian 1-center. Comput. Geom. 2013, 46, 93–104.
  67. Ceolin, S.R.; Hancock, E.R. Computing gender difference using Fisher-Rao metric from facial surface normals. In Proceedings of the 25th SIBGRAPI Conference on Graphics, Patterns and Images, Ouro Preto, Brazil, 22–25 August 2012; pp. 336–343.
  68. Wang, Q.; Li, P.; Zhang, L. G2DeNet: Global Gaussian distribution embedding network and its application to visual recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2730–2739.
  69. Miyamoto, H.K.; Meneghetti, F.C.; Costa, S.I. The Fisher–Rao loss for learning under label noise. Inf. Geom. 2022, 1–20.
  70. Kurtek, S.; Bharath, K. Bayesian sensitivity analysis with the Fisher–Rao metric. Biometrika 2015, 102, 601–616.
  71. Marti, G.; Andler, S.; Nielsen, F.; Donnat, P. Optimal transport vs. Fisher-Rao distance between copulas for clustering multivariate time series. In Proceedings of the 2016 IEEE Statistical Signal Processing Workshop (SSP), Palma de Mallorca, Spain, 26–29 June 2016; pp. 1–5.
  72. Tang, M.; Rong, Y.; Zhou, J.; Li, X.R. Information geometric approach to multisensor estimation fusion. IEEE Trans. Signal Process. 2018, 67, 279–292.
  73. Wang, W.; Wang, R.; Huang, Z.; Shan, S.; Chen, X. Discriminant analysis on Riemannian manifold of Gaussian distributions for face recognition with image sets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2048–2057.
  74. Li, P.; Wang, Q.; Zeng, H.; Zhang, L. Local log-Euclidean multivariate Gaussian descriptor and its application to image classification. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 803–817.
  75. Picot, M.; Messina, F.; Boudiaf, M.; Labeau, F.; Ayed, I.B.; Piantanida, P. Adversarial robustness via Fisher-Rao regularization. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 2698–2710.
  76. Collas, A.; Bouchard, F.; Ginolhac, G.; Breloy, A.; Ren, C.; Ovarlez, J.P. On the Use of Geodesic Triangles between Gaussian Distributions for Classification Problems. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 5697–5701.
  77. Murena, P.A.; Cornuéjols, A.; Dessalles, J.L. Opening the parallelogram: Considerations on non-Euclidean analogies. In Proceedings of the Case-Based Reasoning Research and Development: 26th International Conference, ICCBR 2018, Proceedings 26, Stockholm, Sweden, 9–12 July 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 597–611.
  78. Popović, B.; Janev, M.; Krstanović, L.; Simić, N.; Delić, V. Measure of Similarity between GMMs Based on Geometry-Aware Dimensionality Reduction. Mathematics 2022, 11, 175.
  79. Micchelli, C.A.; Noakes, L. Rao distances. J. Multivar. Anal. 2005, 92, 97–115.
  80. Nielsen, F. On the Jensen–Shannon symmetrization of distances relying on abstract means. Entropy 2019, 21, 485.
  81. Davis, J.; Dhillon, I. Differential entropic clustering of multivariate Gaussians. Adv. Neural Inf. Process. Syst. 2006, 19, 337–344.
  82. Lovrić, M.; Min-Oo, M.; Ruh, E.A. Multivariate normal distributions parametrized as a Riemannian symmetric space. J. Multivar. Anal. 2000, 74, 36–48.
  83. Welzl, E. Smallest enclosing disks (balls and ellipsoids). In Proceedings of the New Results and New Trends in Computer Science; Springer: Berlin/Heidelberg, Germany, 2005; pp. 359–370.
  84. Gonzalez, T.F. Clustering to minimize the maximum intercluster distance. Theor. Comput. Sci. 1985, 38, 293–306.
  85. Acharyya, S.; Banerjee, A.; Boley, D. Bregman divergences and triangle inequality. In Proceedings of the 2013 SIAM International Conference on Data Mining, SIAM, Austin, TX, USA, 2–4 May 2013; pp. 476–484.
  86. Ohara, A.; Suda, N.; Amari, S.i. Dualistic differential geometry of positive definite matrices and its applications to related problems. Linear Algebra Appl. 1996, 247, 31–53.
  87. Nock, R.; Nielsen, F. Fitting the smallest enclosing Bregman ball. In Proceedings of the Machine Learning: ECML 2005: 16th European Conference on Machine Learning, Proceedings 16, Porto, Portugal, 3–7 October 2005; Springer: Berlin/Heidelberg, Germany, 2005; pp. 649–656.
  88. Ohara, A. Doubly autoparallel structure on positive definite matrices and its applications. In Proceedings of the International Conference on Geometric Science of Information, Toulouse, France, 27–29 August 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 251–260.
  89. Globke, W.; Quiroga-Barranco, R. Information geometry and asymptotic geodesics on the space of normal distributions. Inf. Geom. 2021, 4, 131–153.
  90. Nielsen, F.; Sun, K. Clustering in Hilbert's projective geometry: The case studies of the probability simplex and the elliptope of correlation matrices. In Geometric Structures of Information; Springer: Berlin/Heidelberg, Germany, 2019; pp. 297–331.
  91. Journée, M.; Nesterov, Y.; Richtárik, P.; Sepulchre, R. Generalized power method for sparse principal component analysis. J. Mach. Learn. Res. 2010, 11, 517–553.
  92. Verdoolaege, G. A new robust regression method based on minimization of geodesic distances on a probabilistic manifold: Application to power laws. Entropy 2015, 17, 4602–4626.
  93. Chandrupatla, T.R.; Osler, T.J. The perimeter of an ellipse. Math. Sci. 2010, 35.
  94. Householder, A.S. Unitary triangularization of a nonsymmetric matrix. J. ACM 1958, 5, 339–342.
  95. Fernandes, M.A.; San Martin, L.A. Fisher information and α-connections for a class of transformational models. Differ. Geom. Appl. 2000, 12, 165–184.
  96. Fernandes, M.A.; San Martin, L.A. Geometric proprieties of invariant connections on SL(n,R)/SO(n). J. Geom. Phys. 2003, 47, 369–377.
  97. Bridson, M.R.; Haefliger, A. Metric Spaces of Non-Positive Curvature; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013; Volume 319.
  98. Frauendiener, J.; Jaber, C.; Klein, C. Efficient computation of multidimensional theta functions. J. Geom. Phys. 2019, 141, 147–158.
Figure 1. Four univariate normal distributions N_1 = N(0, 1), N_2 = N(3, 1), N_3 = N(2, 2.5), and N_4 = N(0, 2), with their pairwise full geodesics in gray and the geodesic arcs linking them in red. The Fisher–Rao distances are ρ_N(N_1, N_2) = 2.6124, ρ_N(N_3, N_4) = 0.9317, ρ_N(N_1, N_4) = 0.9803, ρ_N(N_2, N_3) = 1.4225, ρ_N(N_2, N_4) = 2.1362, and ρ_N(N_1, N_3) = 1.7334. The ellipses are Tissot indicatrices, which visualize the metric tensor g_N^Fisher at grid positions.
Figure 2. The submanifolds N Σ are not totally geodesic (i.e., ρ N ( N 1 , N 2 ) is upper bounded by their Mahalanobis distance) but the submanifolds N μ are totally geodesic. Using the triangle inequality of the Riemannian metric distance ρ N , we can upper bound ρ N ( N 1 , N 2 ) .
Figure 3. Quality of the C&O lower bound compared to the exact Fisher–Rao distance in the case of N_1, N_2 ∈ M_Σ (MVNs sharing the same covariance matrix Σ). We have ρ_CO ≤ ρ_N ≤ Δ_Σ.
Figure 4. Approximating the Fisher–Rao geodesic distance ρ N ( N 1 , N 2 ) : The Fisher–Rao geodesic γ N FR is not known in closed form. We consider a tractable curve c ( t ) , discretize c ( t ) at T + 1 points c ( i T ) with c ( 0 ) = N 1 and c ( 1 ) = N 2 , and approximate ρ N c i T , c i + 1 T by D J c i T , c i + 1 T , considering that different tractable curves c ( t ) yield different approximations.
Figure 5. Quality of the D J upper bound on the Fisher–Rao distance ρ N when normal distributions have the same covariance matrix.
Figure 6. Visualizing the exponential and mixture geodesics between two bivariate normal distributions.
Figure 7. Projecting an SPD matrix P P onto N ¯ = f ( N ) : γ P ( P , P ¯ ) is orthogonal to N ¯ with respect to the trace metric.
Figure 8. Illustration of the approximation of the Fisher–Rao distance between two multivariate normals N_1 and N_2 (red geodesic γ_N(N_1, N_2)) by discretizing the curve c̄_CO ⊂ N̄, or equivalently the curve c_CO ⊂ N.
Figure 8. Illustration of the approximation of the Fisher–Rao distance between two multivariate normals N 1 and N 2 (red geodesic length γ N ( N 1 , N 2 ) by discretizing curve c ¯ CO N ¯ or equivalently curve c CO N .
Entropy 25 00654 g008
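A sketch of the projected C&O curve, reusing co_embed and rho_spd from the sketch after Figure 3: follow the SPD geodesic between the embedded matrices and pull each point back onto N̄. The paper derives the exact trace-metric geodesic projection; as an illustrative stand-in we use the simple normalization P ↦ P / P_{d+1,d+1}, which always lands on N̄ (a Schur-complement argument shows the recovered Σ is positive–definite) but is not claimed here to be the orthogonal projection:

```python
import numpy as np
from scipy.linalg import fractional_matrix_power as mpow

def spd_geodesic(P, Q, t):
    """Affine-invariant SPD geodesic P^(1/2) (P^(-1/2) Q P^(-1/2))^t P^(1/2)."""
    Ph, Pmh = mpow(P, 0.5), mpow(P, -0.5)
    return Ph @ mpow(Pmh @ Q @ Pmh, t) @ Ph

def pull_back(Pbar):
    """Map a (d+1)x(d+1) SPD matrix to N(mu, Sigma) by normalizing the
    bottom-right entry to 1 (a stand-in for the exact geodesic projection)."""
    P = Pbar / Pbar[-1, -1]
    mu = P[:-1, -1]
    return mu, P[:-1, :-1] - np.outer(mu, mu)

def c_CO(mu1, S1, mu2, S2, t):
    """Point at parameter t on the (approximately) projected C&O curve."""
    return pull_back(spd_geodesic(co_embed(mu1, S1), co_embed(mu2, S2), t))
```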
Figure 9. Diffusion tensor imaging (DTI) on a 2D grid: (a) Ellipsoids shown at the 8 × 8 grid locations with C&O curves in green, and (b) some interpolated ellipsoids are further shown along the C&O curves.
Figure 10. Examples of the projection of N(μ, Σ) onto the submanifolds M_{μ0} and M_{Σ0}. Tissot indicatrices are rendered in green at the projected normal distributions (μ0, Σ + ½(μ0 − μ)(μ0 − μ)ᵀ) and (μ, Σ0), respectively.
Figure 11. Upper bounding the Fisher–Rao distance ρ_N((μ1, Σ1), (μ2, Σ2)) (red points) using projections (green points) onto submanifolds with fixed means.
Figure 12. Geodesics and curves used to approximate the Fisher–Rao distance, with the Fisher metric shown using Tissot's indicatrices: exponential geodesic (red), mixture geodesic (blue), mid-exponential-mixture curve (purple), projected C&O curve (green), and target Fisher–Rao geodesic (black). (Visualization in the parameter space of normal distributions.)
Figure 13. Bounding ρ_N(S̄_t, S̄_{t+1}) using the triangle inequality of ρ_P in the SPD cone P(d + 1).
Figure 14. Visualizing at discrete positions (10 increment steps between 0 and 1) some curves used to approximate the Fisher–Rao distance between two bivariate normal distributions: (a) exponential geodesic c_e = γ_N^e (red); (b) mixture geodesic c_m = γ_N^m (blue); (c) mid-mixture-exponential curve c_em (purple); (d) projected Calvo and Oller curve c_CO (green); (e) ordinary linear interpolation c_λ in λ (yellow); and (f) all curves superposed.
Figure 15. Comparison of our approximation curves with the Fisher–Rao geodesic (f) obtained by geodesic shooting (Figure 5 of [39]): exponential (a) and mixture (b) geodesics, the mid-exponential-mixture curve (c), the projected C&O curve (d), and all curves superposed (e). Beware that the color coding of (a–e) is unrelated to that of (f), and the scales used to depict the ellipsoids differ.
Figure 16. Approximation of the Fisher–Rao distance obtained using the projected C&O curve when T ranges from 3 to 100 [39].
Figure 17. Approximation of the smallest enclosing Riemannian ball of a set of n bivariate normals N_i = N(μ_i, Σ_i) with respect to the C&O distance ρ_CO (the approximate circumcenter C̄_T is depicted as a red ellipse): (a) n = 8 with different covariance matrices; (b) n = 8 with identical covariance matrices, which amounts to the smallest enclosing ball of the set of n points {μ_i}; (c) n = 2 displays the midpoint of the C&O geodesic, visualized as an equivalent bivariate normal distribution in the sample space.
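One standard way to obtain such an approximate circumcenter C̄_T is a Riemannian geodesic-walk heuristic in the spirit of Badoiu and Clarkson; we offer it as a plausible sketch (reusing rho_spd and spd_geodesic from the sketches above), not as the paper's exact procedure:

```python
def approx_circumcenter(points, T):
    """Approximate minimax center of embedded SPD matrices under rho_spd:
    at step t, move from the current center toward the farthest point
    by arc-length fraction 1/(t + 1) along the SPD geodesic."""
    C = points[0]
    for t in range(1, T):
        farthest = max(points, key=lambda P: rho_spd(C, P))
        C = spd_geodesic(C, farthest, 1.0 / (t + 1))
    return C
```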
Table 1. First set of experiments demonstrates the advantage of the c_CO(t) curve.
| d | κ_CO   | κ_λ    | κ_e     | κ_m    | κ_em    |
|---|--------|--------|---------|--------|---------|
| 1 | 1.0025 | 1.0414 | 1.1521  | 1.0236 | 1.0154  |
| 2 | 1.0167 | 1.0841 | 1.1923  | 1.0631 | 1.0416  |
| 3 | 1.0182 | 1.8997 | 2.6072  | 1.9965 | 1.07988 |
| 4 | 1.0207 | 2.0793 | 1.8080  | 2.1687 | 1.1873  |
| 5 | 1.0324 | 4.1207 | 12.3804 | 5.6170 | 4.2349  |
Table 2. Comparing our Fisher–Rao approximation with the Calvo and Oller lower bound and the upper bound of [38].
| d | ρ_CO(N1, N2) | ρ̃_{c_CO}(N1, N2) | U(N1, N2) |
|---|--------------|-------------------|-----------|
| 1 | 1.7563       | 1.8020            | 3.1654    |
| 2 | 3.2213       | 3.3194            | 6.012     |
| 3 | 4.6022       | 4.7642            | 8.7204    |
| 4 | 5.9517       | 6.1927            | 11.3990   |
| 5 | 7.156        | 7.3866            | 13.8774   |
Table 3. Second set of experiments shows the limitations of the c_CO(t) curve.
| d  | κ_CO   | κ_λ    | κ_e    | κ_m    |
|----|--------|--------|--------|--------|
| 1  | 1.0569 | 1.1405 | 1.139  | 1.0734 |
| 5  | 1.1599 | 1.4696 | 1.5201 | 1.1819 |
| 10 | 1.2180 | 1.6963 | 1.7887 | 1.2184 |
| 11 | 1.2260 | 1.7333 | 1.8285 | 1.2235 |
| 12 | 1.2301 | 1.7568 | 1.8539 | 1.2282 |
| 15 | 1.2484 | 1.8403 | 1.9557 | 1.2367 |
| 20 | 1.2707 | 1.9519 | 2.0851 | 1.2466 |