The Fisher–Rao Distance between Multivariate Normal Distributions: Special Cases, Bounds and Applications

The Fisher–Rao distance is a measure of dissimilarity between probability distributions which, under certain regularity conditions of the statistical model, is, up to a scaling factor, the unique Riemannian metric invariant under Markov morphisms. It is related to the Shannon entropy and has been used to enlarge the perspective of analysis in a wide variety of domains such as image processing, radar systems, and morphological classification. Here, we approach this metric in the statistical model of multivariate normal probability distributions, for which no explicit expression is known in general, by gathering known results (closed forms for submanifolds and bounds) and by deriving expressions for the distance between distributions with the same covariance matrix and between distributions with mirrored covariance matrices. An application of the Fisher–Rao distance to the simplification of Gaussian mixtures using the hierarchical clustering algorithm is also presented.


Introduction
A proper measure of dissimilarity between probability distributions is needed in many problems and applications. The Fisher-Rao distance is a very special metric for statistical models of probability distributions: it is invariant under reparametrization of the sample space and covariant under reparametrization of the parameter space [1]. Moreover, the Fisher-Rao metric is preserved under Markov morphisms, and under certain conditions it is, up to a scaling factor, the unique Riemannian metric with this property [2,3]. Markov morphisms are associated with the notion of statistical sufficiency, which expresses the criterion of passing from one statistical model to another with no loss of information [4-6]. It is therefore natural to require the invariance of the geometric structures of statistical models under Markov morphisms. Between finite sample size simplex models $S^{k-1} = \{p \in \mathbb{R}^k;\ p_i \geq 0 \text{ and } \sum_{i=1}^{k} p_i = 1\}$, a Markov morphism is a linear map $T_Q(x) = xQ$, where $Q \in \mathbb{R}^{n \times l}$, with $n \leq l$, is a matrix with non-negative entries such that every row sums to 1 and every column has precisely one non-zero element. The mapping $T_Q$ corresponds to a probabilistic refining of the event space $\{1, \ldots, n\} \to \{1, \ldots, l\}$, where the refinement $i \to j$ occurs with probability $Q_{ij}$ [7]. Chentsov [8,9] proved the uniqueness-invariance property of the Fisher-Rao metric under Markov morphisms for finite sample spaces. The extension of this result to more general statistical models requires careful formulations of statistical sufficiency and Markov morphisms.
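To make the definition of $T_Q$ concrete, the following Python sketch (illustrative only; the matrix Q is an arbitrary example we chose, not one taken from the references) builds a Markov morphism between simplices and checks the stated conditions.

```python
# Illustrative sketch: a Markov morphism T_Q(x) = x Q refining n = 2 events
# into l = 3 events. Q is non-negative, each row sums to 1, and each column
# has exactly one non-zero entry, as required in the text.
import numpy as np

Q = np.array([[0.3, 0.7, 0.0],    # event 1 splits into refined events 1 and 2
              [0.0, 0.0, 1.0]])   # event 2 maps entirely to refined event 3

p = np.array([0.4, 0.6])          # a point of the simplex S^1
q = p @ Q                         # refined distribution on {1, 2, 3}

assert np.allclose(Q.sum(axis=1), 1.0)                            # rows sum to 1
assert all((Q[:, j] > 0).sum() == 1 for j in range(Q.shape[1]))   # one entry per column
assert np.isclose(q.sum(), 1.0)                                   # T_Q preserves the simplex
print(q)                          # [0.12 0.28 0.6 ]
```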

The Fisher-Rao Distance in the Multivariate Normal Distribution Space: Special Submanifolds and Bounds
In this section, as in [38], we summarize previous results on the Fisher-Rao distance in the space of multivariate normal distributions, including closed forms for this distance restricted to submanifolds and general bounds. Given a statistical model $S = \{p_\theta = p(x;\theta);\ \theta = (\theta_1, \theta_2, \ldots, \theta_k) \in \Theta \subset \mathbb{R}^k\}$, a natural Riemannian structure [21] is provided by the Fisher information matrix $G(\theta) = [g_{ij}(\theta)]$,

$$g_{ij}(\theta) := \int \frac{\partial}{\partial \theta_i} \log p(x;\theta)\, \frac{\partial}{\partial \theta_j} \log p(x;\theta)\, p(x;\theta)\, dx = E_\theta\!\left[\frac{\partial}{\partial \theta_i} \log p(x;\theta)\, \frac{\partial}{\partial \theta_j} \log p(x;\theta)\right], \qquad (2)$$

where $E_\theta$ is the expected value with respect to the distribution $p_\theta$. This matrix can also be viewed as the Hessian matrix of the Shannon entropy (a concave function) [39],

$$H(p) = -\int p(x;\theta) \log p(x;\theta)\, dx, \qquad (3)$$

and is used to establish connections between inequalities in information theory and geometrical inequalities. The Fisher-Rao distance $d_F(\cdot,\cdot)$ between two distributions $p_{\theta_1}$ and $p_{\theta_2}$ in $S$, identified with their parameters $\theta_1$ and $\theta_2$, is given by the shortest length of a curve $\gamma(t)$ in the parameter space $\Theta$ connecting these distributions,

$$d_F(p_{\theta_1}, p_{\theta_2}) = \min_{\gamma} \int_0^1 \sqrt{\dot\gamma(t)^t\, G(\gamma(t))\, \dot\gamma(t)}\; dt, \qquad \gamma(0) = \theta_1,\ \gamma(1) = \theta_2.$$

Note that this is in fact a metric, since for any $\theta_1$, $\theta_2$, and $\theta_3$ in $\Theta$, we have: (i) $d_F(\theta_1, \theta_2) \geq 0$, and $d_F(\theta_1, \theta_2) = 0$ if, and only if, $\theta_1 = \theta_2$; (ii) $d_F(\theta_1, \theta_2) = d_F(\theta_2, \theta_1)$; and (iii) $d_F(\theta_1, \theta_3) \leq d_F(\theta_1, \theta_2) + d_F(\theta_2, \theta_3)$. A curve that attains the shortest length is called a geodesic and is given by the solutions of the differential equations

$$\ddot\theta_m(t) + \sum_{i,j} \Gamma^m_{ij}\, \dot\theta_i(t)\, \dot\theta_j(t) = 0, \qquad \Gamma^m_{ij} = \frac{1}{2} \sum_{l} g^{ml} \left( \frac{\partial g_{jl}}{\partial \theta_i} + \frac{\partial g_{il}}{\partial \theta_j} - \frac{\partial g_{ij}}{\partial \theta_l} \right),$$

where $\Gamma^m_{ij}$ are the Christoffel symbols and $[g^{ij}]$ is the inverse of the Fisher information matrix. We consider here the space of multivariate normal distributions, given by:

$$p(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^n \det \Sigma}} \exp\!\left( -\frac{1}{2} (x-\mu)^t \Sigma^{-1} (x-\mu) \right),$$

where $x^t = (x_1, \ldots, x_n) \in \mathbb{R}^n$ is the variable vector, $\mu^t = (\mu_1, \ldots, \mu_n) \in \mathbb{R}^n$ is the mean vector, and $\Sigma$ is the covariance matrix in $P_n(\mathbb{R})$, the space of order $n$ positive definite symmetric matrices.
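As a quick numerical illustration of the integral definition (2) — a sketch we added, not the paper's code — the following approximates the Fisher information of a univariate normal $N(\mu, \sigma^2)$ by quadrature and compares it with the known closed form $\mathrm{diag}(1/\sigma^2, 2/\sigma^2)$.

```python
# Numerical check of Equation (2) for a univariate normal N(mu, sigma^2).
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu, sigma = 1.0, 2.0

def score(x):
    # Gradient of log p(x; mu, sigma) with respect to (mu, sigma).
    return np.array([(x - mu) / sigma**2,
                     (x - mu) ** 2 / sigma**3 - 1.0 / sigma])

def g(i, j):
    f = lambda x: score(x)[i] * score(x)[j] * norm.pdf(x, mu, sigma)
    val, _ = quad(f, mu - 12 * sigma, mu + 12 * sigma)
    return val

G = np.array([[g(i, j) for j in range(2)] for i in range(2)])
print(G)  # approx [[1/sigma^2, 0], [0, 2/sigma^2]] = [[0.25, 0], [0, 0.5]]
```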
In this case, the model $S = M = \{p_\theta;\ \theta = (\mu, \Sigma) \in \mathbb{R}^n \times P_n(\mathbb{R})\}$ is a statistical manifold of dimension $k = n + \frac{n(n+1)}{2}$. Considering a parametrization $(\mu, \Sigma) = \varphi(\theta_1, \ldots, \theta_k)$ of the model $M$, the Fisher information matrix is given by [40]

$$g_{ij}(\theta) = \frac{\partial \mu^t}{\partial \theta_i}\, \Sigma^{-1}\, \frac{\partial \mu}{\partial \theta_j} + \frac{1}{2} \operatorname{tr}\!\left( \Sigma^{-1} \frac{\partial \Sigma}{\partial \theta_i}\, \Sigma^{-1} \frac{\partial \Sigma}{\partial \theta_j} \right).$$

The metric provided by this matrix is invariant with respect to affine transformations. In other words, for any $(c, Q) \in \mathbb{R}^n \times GL_n(\mathbb{R})$, where $GL_n(\mathbb{R})$ is the group of non-singular $n$-square matrices, the mapping

$$\psi_{(c,Q)}(\mu, \Sigma) = (Q\mu + c,\ Q \Sigma Q^t) \qquad (8)$$

is an isometry of $M$ [16]. Consequently, the Fisher-Rao distance between $\theta_1 = (\mu_1, \Sigma_1)$ and $\theta_2 = (\mu_2, \Sigma_2)$ satisfies

$$d_F\big( (\mu_1, \Sigma_1),\ (\mu_2, \Sigma_2) \big) = d_F\big( (Q\mu_1 + c,\ Q\Sigma_1 Q^t),\ (Q\mu_2 + c,\ Q\Sigma_2 Q^t) \big) \qquad (9)$$

for any $(c, Q) \in \mathbb{R}^n \times GL_n(\mathbb{R})$. In particular, for $Q = \Sigma_1^{-1/2}$ and $c = -\Sigma_1^{-1/2}\mu_1$, the Fisher-Rao distance admits the form

$$d_F\big( (\mu_1, \Sigma_1),\ (\mu_2, \Sigma_2) \big) = d_F\big( \theta_0,\ (\Sigma_1^{-1/2}(\mu_2 - \mu_1),\ \Sigma_1^{-1/2}\Sigma_2\Sigma_1^{-1/2}) \big), \qquad (10)$$

where $\theta_0 = (0, I_n)$, $I_n$ is the $n$-order identity matrix, and $0 \in \mathbb{R}^n$ is the null vector. The geodesic equations in $M$ can be expressed as [17]

$$\ddot\mu = \dot\Sigma\, \Sigma^{-1} \dot\mu, \qquad \ddot\Sigma = \dot\Sigma\, \Sigma^{-1} \dot\Sigma - \dot\mu\, \dot\mu^t,$$

and can be partially integrated [27]: in the variables $(\delta(t), \Delta(t)) = (\Sigma^{-1}(t)\mu(t), \Sigma^{-1}(t))$, the first integrals are determined by a constant vector $x \in \mathbb{R}^n$ and a constant symmetric matrix $B$. The initial conditions for this problem can be taken as $(\delta(0), \Delta(0)) = (0, I_n)$ and $(\dot\delta(0), \dot\Delta(0)) = (x, -B)$. Eriksen [27] and Calvo and Oller [28], in independent works, solved this initial value problem. An explicit solution for the geodesic curve in $M$ is given in [28] (Equation (15)) in terms of $I_n$, the matrix $G$ with $G^2 = B^2 + 2xx^t$, and the generalized inverse $G^-$ of $G$, that is, $GG^-G = G$. Since a geodesic curve has constant velocity at every point, given $(x, B)$ in the tangent space of $M$, the Fisher-Rao distance between $(0, I_n)$ and $(\delta(1), \Delta(1))$ is

$$d_F\big( (0, I_n),\ (\delta(1), \Delta(1)) \big) = \sqrt{ |x|^2 + \frac{1}{2} \operatorname{tr}(B^2) }, \qquad (16)$$

where $|\cdot|$ is the standard Euclidean norm. Note that the above expression provides the Fisher-Rao distance between two distributions only if we can determine the initial value problem from the boundary conditions, which is usually very difficult.
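The reduction (10) is straightforward to implement; the following sketch (our own, with arbitrary example parameters) maps a pair $(\theta_1, \theta_2)$ to $(\theta_0, (\mu, \Sigma))$.

```python
# Sketch of the reduction (10): the isometry with Q = Sigma_1^{-1/2} and
# c = -Sigma_1^{-1/2} mu_1 sends theta_1 to theta_0 = (0, I_n), so any pair
# can be brought to the form (theta_0, (mu, Sigma)).
import numpy as np

def inv_sqrt(S):
    # Inverse square root of a symmetric positive definite matrix.
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

def reduce_pair(mu1, S1, mu2, S2):
    Q = inv_sqrt(S1)
    return Q @ (mu2 - mu1), Q @ S2 @ Q.T   # image of theta_2 under psi_(c,Q)

mu1, S1 = np.array([1.0, 0.0]), np.array([[2.0, 0.5], [0.5, 1.0]])
mu2, S2 = np.array([0.0, 3.0]), np.array([[1.0, -0.2], [-0.2, 4.0]])
mu, S = reduce_pair(mu1, S1, mu2, S2)
print(mu)
print(S)
```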
Han and Park in [31] presented a numerical shooting method for computing the minimum geodesic distance between two normal distributions, through parallel transport of a vector field defined along the geodesic curve given in Equation (15).
A closed form for the Fisher-Rao distance between two normal distributions in M is still an important open question. Next, we present closed forms for this distance in some submanifolds of M.

Closed Forms for the Fisher-Rao Distance in Submanifolds of M
In this subsection, we consider submanifolds $M^* \subset M$ with the distance induced by the Fisher-Rao metric of $M$. It is important to remark that, in general, given two distributions $\theta_1$ and $\theta_2$ in $M^*$, the distance $d_{M^*}(\theta_1, \theta_2)$ restricted to the submanifold $M^*$ is greater than or equal to the distance $d_M(\theta_1, \theta_2)$. This is due to the fact that, to compute $d_{M^*}$, we take the minimum length over restricted curves, namely those contained in the submanifold $M^*$. We say that $M^*$ is totally geodesic if, and only if, $d_{M^*}(\theta_1, \theta_2) = d_M(\theta_1, \theta_2)$ for any $\theta_1, \theta_2 \in M^*$, which means that the geodesic in $M$ connecting $\theta_1$ and $\theta_2$ is contained in $M^*$.
2.1.1. The Submanifold M_Σ of Distributions with a Common Covariance Matrix

In the $n$-dimensional submanifold composed of multivariate normal distributions with a common covariance matrix, $M_\Sigma = \{p_\theta;\ \theta = (\mu, \Sigma_0),\ \Sigma_0 \in P_n(\mathbb{R}) \text{ fixed}\}$, the Fisher-Rao distance between two distributions $\theta_1 = (\mu_1, \Sigma_0)$ and $\theta_2 = (\mu_2, \Sigma_0)$ is [18]:

$$d_\Sigma(\theta_1, \theta_2) = \sqrt{ (\mu_2 - \mu_1)^t\, \Sigma_0^{-1}\, (\mu_2 - \mu_1) }. \qquad (17)$$

This is the Mahalanobis distance [11], which equals the Euclidean distance between the images of $\mu_1$ and $\mu_2$ under the transformation $\mu \mapsto P^{-1}\mu$, where $\Sigma_0 = PP^t$ is the Cholesky decomposition [18]. The Mahalanobis distance was one of the first dissimilarity measures between datasets exhibiting correlation. Note that this submanifold is not totally geodesic, as can be seen even in the space of univariate normal distributions [22] and in Example 1 in the next section.
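In code, Equation (17) can be evaluated stably through the Cholesky factor; a minimal sketch (ours, not from [18]):

```python
# Equation (17): the Fisher-Rao distance restricted to M_Sigma equals the
# Mahalanobis distance, computable from the Cholesky factor Sigma_0 = P P^t.
import numpy as np

def mahalanobis(mu1, mu2, Sigma0):
    P = np.linalg.cholesky(Sigma0)        # Sigma0 = P P^t, P lower triangular
    z = np.linalg.solve(P, mu2 - mu1)     # z = P^{-1} (mu2 - mu1)
    return np.linalg.norm(z)              # Euclidean norm of the transformed vector

Sigma0 = np.array([[2.0, 0.8], [0.8, 1.0]])
print(mahalanobis(np.array([0.0, 0.0]), np.array([1.0, 2.0]), Sigma0))
```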
A geodesic curve $\gamma_\Sigma(t)$ in $M_\Sigma$ connecting $\theta_1$ and $\theta_2$ is given by the linear path $\gamma_\Sigma(t) = (\mu_1 + t(\mu_2 - \mu_1), \Sigma_0)$, since the induced metric on $M_\Sigma$ is a (flat) Euclidean metric in the transformed coordinates.

2.1.2. The Submanifold M_µ of Distributions with a Common Mean Vector

Let $M_\mu = \{p_\theta;\ \theta = (\mu_0, \Sigma),\ \mu_0 \in \mathbb{R}^n \text{ fixed},\ \Sigma \in P_n(\mathbb{R})\}$ be the submanifold composed of distributions that have the same mean vector $\mu_0$. The Fisher-Rao distance in $M_\mu$ was studied by several authors in different contexts [16,18,30,41]; for $\theta_1 = (\mu_0, \Sigma_1)$ and $\theta_2 = (\mu_0, \Sigma_2)$, it is given by:

$$d_\mu(\theta_1, \theta_2) = \sqrt{ \frac{1}{2} \sum_{i=1}^{n} \ln^2 \lambda_i },$$

where $0 < \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n$ are the eigenvalues of $\Sigma_1^{-1/2} \Sigma_2 \Sigma_1^{-1/2}$. An expression for the geodesic curve connecting these two distributions is given in [30].

2.1.3. The Submanifold M_D of Distributions with Diagonal Covariance Matrices

Let $M_D = \{p_\theta;\ \theta = (\mu, \Sigma),\ \Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_n^2),\ \sigma_i > 0,\ i = 1, \ldots, n\}$ be the submanifold of $M$ composed of distributions with a diagonal covariance matrix. If we consider the parameter $\theta = (\mu_1, \sigma_1, \mu_2, \sigma_2, \ldots, \mu_n, \sigma_n)$, it can be shown [22] that the metric in the parameter space of $M_D$ is the product of $n$ univariate metrics, so that

$$d_D(\theta_1, \theta_2) = \sqrt{ \sum_{i=1}^{n} \left[ d_{F^*}\big( (\mu_{1i}, \sigma_{1i}),\ (\mu_{2i}, \sigma_{2i}) \big) \right]^2 }, \qquad (21)$$

where $d_{F^*}$ is the Fisher-Rao distance in the univariate case, given by [22]:

$$d_{F^*}\big( (\mu_1, \sigma_1), (\mu_2, \sigma_2) \big) = \sqrt{2}\, \ln \frac{ \left| \left( \tfrac{\mu_1}{\sqrt{2}}, \sigma_1 \right) - \left( \tfrac{\mu_2}{\sqrt{2}}, -\sigma_2 \right) \right| + \left| \left( \tfrac{\mu_1}{\sqrt{2}}, \sigma_1 \right) - \left( \tfrac{\mu_2}{\sqrt{2}}, \sigma_2 \right) \right| }{ \left| \left( \tfrac{\mu_1}{\sqrt{2}}, \sigma_1 \right) - \left( \tfrac{\mu_2}{\sqrt{2}}, -\sigma_2 \right) \right| - \left| \left( \tfrac{\mu_1}{\sqrt{2}}, \sigma_1 \right) - \left( \tfrac{\mu_2}{\sqrt{2}}, \sigma_2 \right) \right| }. \qquad (22)$$

In this space, a curve $\gamma_D(t) = (\gamma_1(t), \ldots, \gamma_n(t))$ is a geodesic if, and only if, each $\gamma_i(t)$ is a geodesic curve in the univariate case, for all $i = 1, \ldots, n$. The geodesic curves in the space of univariate normal distributions (the upper half-plane $\mathbb{R} \times \mathbb{R}^+$) are half vertical lines and half-ellipses centered at $\sigma = 0$ with eccentricity $\tfrac{1}{\sqrt{2}}$ [22]. It is important to note that $M_D \subset M$ is not totally geodesic. The submanifold of $M_D$ composed only of normal distributions whose covariance matrices are multiples of the identity (round normals) is totally geodesic [22]. In fact, this submanifold of round normals is also contained in the totally geodesic submanifold described next.
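The two closed forms above translate directly into code; the sketch below (ours; the example inputs are arbitrary) evaluates $d_\mu$ via the eigenvalues of $\Sigma_1^{-1/2}\Sigma_2\Sigma_1^{-1/2}$ and $d_{F^*}$ via Equation (22).

```python
# d_mu from the eigenvalues of Sigma_1^{-1/2} Sigma_2 Sigma_1^{-1/2}, and the
# univariate Fisher-Rao distance d_F* on the half-plane (mu/sqrt(2), sigma).
import numpy as np

def d_common_mean(S1, S2):
    w, V = np.linalg.eigh(S1)
    S1_ih = V @ np.diag(w ** -0.5) @ V.T          # Sigma_1^{-1/2}
    lam = np.linalg.eigvalsh(S1_ih @ S2 @ S1_ih)
    return np.sqrt(0.5 * np.sum(np.log(lam) ** 2))

def d_univariate(mu1, s1, mu2, s2):
    # Equation (22): distances to the mirrored and original points in the plane.
    a = np.hypot((mu1 - mu2) / np.sqrt(2), s1 + s2)
    b = np.hypot((mu1 - mu2) / np.sqrt(2), s1 - s2)
    return np.sqrt(2) * np.log((a + b) / (a - b))

print(d_common_mean(np.eye(2), np.diag([2.0, 0.5])))   # ln(2) ~ 0.693
print(d_univariate(0.0, 1.0, 1.0, 2.0))
```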
2.1.4. The Submanifold M_Dµ Where Σ Is Diagonal and µ Is an Eigenvector of Σ

Let $M_{D\mu}$ be the $(n+1)$-dimensional submanifold composed of distributions with mean vector $\mu = \mu_1 e_i$ for some $e_i \in \{e_1, \ldots, e_n\}$ (the canonical basis of $\mathbb{R}^n$) and diagonal covariance matrix $\Sigma$; without loss of generality, we shall assume $e_i = e_1$. For $\theta_1 = (\mu_1 e_1, \mathrm{diag}(\sigma_{11}^2, \ldots, \sigma_{1n}^2))$ and $\theta_2 = (\mu_2 e_1, \mathrm{diag}(\sigma_{21}^2, \ldots, \sigma_{2n}^2))$, an analytic expression for the distance in $M_{D\mu}$ is:

$$d_{D\mu}(\theta_1, \theta_2) = \sqrt{ \left[ d_{F^*}\big( (\mu_1, \sigma_{11}),\ (\mu_2, \sigma_{21}) \big) \right]^2 + \sum_{i=2}^{n} 2 \ln^2\!\left( \frac{\sigma_{2i}}{\sigma_{1i}} \right) }.$$

We proved in [42] that this submanifold is totally geodesic.
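A sketch of this expression follows (note that the displayed formula above is our reconstruction of the lost equation, obtained from the product structure of $M_{D\mu}$: the $(\mu_1, \sigma_1)$-block contributes a univariate Fisher-Rao term and each remaining axis a flat factor $\sqrt{2}\,\ln(\sigma_{2i}/\sigma_{1i})$).

```python
# Reconstruction-based sketch of d_Dmu; theta_i = (m_i * e_1, diag(sig_i^2)),
# with sig1, sig2 vectors of standard deviations.
import numpy as np

def d_univariate(mu1, s1, mu2, s2):
    a = np.hypot((mu1 - mu2) / np.sqrt(2), s1 + s2)
    b = np.hypot((mu1 - mu2) / np.sqrt(2), s1 - s2)
    return np.sqrt(2) * np.log((a + b) / (a - b))

def d_Dmu(m1, sig1, m2, sig2):
    head = d_univariate(m1, sig1[0], m2, sig2[0]) ** 2   # (mu, sigma_1) block
    tail = 2.0 * np.sum(np.log(sig2[1:] / sig1[1:]) ** 2)  # remaining axes
    return np.sqrt(head + tail)

print(d_Dmu(0.0, np.array([1.0, 1.0]), 3.0, np.array([2.0, 0.5])))
```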

Bounds for the Fisher-Rao Distance in M
As mentioned, a closed form for the Fisher-Rao distance between two general normal distributions is not known. In this subsection, we present some bounds for this distance.

A Lower Bound
Calvo and Oller [43] derived a lower bound for the Fisher-Rao distance through an isometric embedding of the manifold $M$ into the manifold of positive definite matrices.

Proposition 1 ([43]). Given $\theta_1 = (\mu_1, \Sigma_1)$ and $\theta_2 = (\mu_2, \Sigma_2)$, let

$$S_i = \begin{pmatrix} \Sigma_i + \mu_i \mu_i^t & \mu_i \\ \mu_i^t & 1 \end{pmatrix}, \qquad i = 1, 2.$$

A lower bound for the distance between $\theta_1$ and $\theta_2$ is:

$$LB(\theta_1, \theta_2) = \sqrt{ \frac{1}{2} \sum_{i=1}^{n+1} \ln^2 \lambda_i },$$

where $\lambda_i$ are the eigenvalues of $S_1^{-1} S_2$. We note that this bound satisfies the distance properties in $M$. In [44], through a similar approach, a lower bound for the Fisher-Rao distance was obtained in the more general space of elliptical distributions; restricted to normal distributions, it coincides with the bound above.
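A minimal implementation of Proposition 1 (our sketch; it assumes the $\tfrac{1}{2}$-factor normalization used throughout this section):

```python
# Calvo-Oller lower bound: embed each (mu, Sigma) as the (n+1) x (n+1)
# positive definite matrix S_i and measure the affine-invariant distance
# between the embedded points.
import numpy as np
from scipy.linalg import eigvalsh

def embed(mu, Sigma):
    n = len(mu)
    S = np.empty((n + 1, n + 1))
    S[:n, :n] = Sigma + np.outer(mu, mu)
    S[:n, n] = S[n, :n] = mu
    S[n, n] = 1.0
    return S

def lower_bound(mu1, S1, mu2, S2):
    # Generalized symmetric eigenproblem: eigenvalues of embed1^{-1} embed2.
    lam = eigvalsh(embed(mu2, S2), embed(mu1, S1))
    return np.sqrt(0.5 * np.sum(np.log(lam) ** 2))

print(lower_bound(np.zeros(2), np.eye(2),
                  np.array([1.0, 0.0]), np.diag([2.0, 0.5])))
```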

The Upper Bound UB1
In [45], we proposed an upper bound based on the isometry (8) of the manifold $M$ and on the distance (21) in the non-totally geodesic submanifold $M_D$, as follows.

Proposition 2 ([45]). The Fisher-Rao distance between two multivariate normal distributions $\theta_1 = (\mu_1, \Sigma_1)$ and $\theta_2 = (\mu_2, \Sigma_2)$ is upper bounded by

$$UB_1(\theta_1, \theta_2) = \sqrt{ \sum_{i=1}^{n} \left[ d_{F^*}\big( (0, 1),\ ((Q^t \mu)_i, \sqrt{\lambda_i}) \big) \right]^2 },$$

where $\mu = \Sigma_1^{-1/2}(\mu_2 - \mu_1)$, $\lambda_i$ are the diagonal terms of the matrix $\Lambda$ given by the eigenvalues of $A = \Sigma_1^{-1/2} \Sigma_2 \Sigma_1^{-1/2} = Q \Lambda Q^t$, $Q$ is the orthogonal matrix whose columns are the eigenvectors of $A$, and $d_{F^*}$ is the Fisher-Rao distance between univariate normal distributions given in Equation (22).
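The construction of UB1 can be sketched as follows (our code; the reduction and rotation steps follow the text, while the displayed formula above is our reconstruction of the lost equation from that construction):

```python
# UB1: reduce (theta_1, theta_2) to ((0, I), (mu, A)) via the isometry (10),
# diagonalize A = Q Lambda Q^t, and measure the path length inside M_D.
import numpy as np

def d_univariate(mu1, s1, mu2, s2):
    a = np.hypot((mu1 - mu2) / np.sqrt(2), s1 + s2)
    b = np.hypot((mu1 - mu2) / np.sqrt(2), s1 - s2)
    return np.sqrt(2) * np.log((a + b) / (a - b))

def UB1(mu1, S1, mu2, S2):
    w, V = np.linalg.eigh(S1)
    S1_ih = V @ np.diag(w ** -0.5) @ V.T          # Sigma_1^{-1/2}
    A = S1_ih @ S2 @ S1_ih
    lam, Q = np.linalg.eigh(A)                    # A = Q Lambda Q^t
    m = Q.T @ (S1_ih @ (mu2 - mu1))               # rotated mean difference
    return np.sqrt(sum(d_univariate(0.0, 1.0, m[i], np.sqrt(lam[i])) ** 2
                       for i in range(len(m))))

print(UB1(np.zeros(2), np.eye(2),
          np.array([1.0, 1.0]), np.array([[2.0, 0.5], [0.5, 1.0]])))
```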

The Upper Bounds UB2 and UB3
Considering the Fisher-Rao distance in the totally geodesic submanifold $M_{D\mu}$ and the triangle inequality, we proposed two further upper bounds, UB2 and UB3, in [42].
Upper and lower bounds have been used to estimate the Fisher-Rao distance in applications such as [37].

Comparisons of the Bounds
In this section, as in [42], we illustrate comparisons between the bounds presented above. We consider the model of bivariate normal distributions ($n = 2$) and the pair of distributions $\theta_0 = (0, I_2)$ and $\hat\theta = (\hat\mu, \hat\Sigma)$, where $\hat\Sigma$ is obtained by rotating a diagonal matrix with eigenvalues $\lambda_1$ and $\lambda_2$ by an angle $\alpha$, and $\hat\mu$ is parameterized by a scalar $\mu$ (the explicit parametrization is given in [42]). From (10), we can see that there always exists an isometry taking any pair of bivariate normal distributions into a pair of this form.
In Figure 1, we fix the eigenvalues $\lambda_1 = 2$, $\lambda_2 = 0.5$ and $\mu = 1$, and let $\alpha$ vary from zero to $\frac{\pi}{2}$. We note that the upper bound UB1 stays very close to the lower bound LB and to the numerical solution GS; the other upper bounds are larger than UB1. In Figure 2, we consider $\mu = 10$ and the same eigenvalues. Now, the best performance is achieved by the bounds UB2 and UB3, which behave similarly. In Figure 3a, we again keep the eigenvalues, fix the rotation angle $\alpha = \frac{\pi}{4}$, and let $\mu$ vary from zero to 10. The performances of UB2 and UB3 are similar and better than that of UB1 for larger values of $\mu$.
We may also consider the upper bound obtained by taking, for each pair of distributions, the minimum of the upper bounds above.

Fisher-Rao Distance Between Special Distributions
In this section, we describe the Fisher-Rao distance in the full space M between special kinds of distributions.

The Fisher-Rao Distance Between Distributions with Common Covariance Matrices
The Fisher-Rao distance between distributions with common covariance matrices given in Section 2.1.1 was restricted to the non-totally geodesic submanifold $M_\Sigma$. We show next that, using the isometry given in (8) and the distance in the submanifold $M_{D\mu}$, it is possible to find a closed form for the distance between two distributions with the same covariance matrix in the full manifold $M$.

Proposition 3. Given two distributions $\theta_1 = (\mu_1, \Sigma)$ and $\theta_2 = (\mu_2, \Sigma)$ in $M$, let $P$ be an orthogonal matrix such that $P(\mu_2 - \mu_1) = |\mu_2 - \mu_1| e_1$, and consider the decomposition

$$P \Sigma P^t = U D U^t, \qquad (35)$$

where $U$ is an upper triangular matrix with all diagonal entries equal to one and $D$ is a diagonal matrix. The Fisher-Rao distance between $\theta_1$ and $\theta_2$ is given by:

$$d_F(\theta_1, \theta_2) = d_{F^*}\big( (0, \sqrt{d_{11}}),\ (|\mu_2 - \mu_1|, \sqrt{d_{11}}) \big),$$

where $d_{11}$ is the first diagonal entry of $D$.

Proof. By considering the isometries $\psi = \psi_{(-P\mu_1, P)}$ and $\hat\psi = \psi_{(0, U^{-1})}$ and the decomposition given by Equation (35), it follows from Equation (9) that:

$$d_F(\theta_1, \theta_2) = d_F\big( (0, D),\ (|\mu_2 - \mu_1| e_1, D) \big).$$

Since the distributions $(0, D)$ and $(|\mu_2 - \mu_1| e_1, D)$ belong to the submanifold $M_{D\mu}$, we conclude the result from the closed form of Section 2.1.4.

Figure 4a illustrates the normal distributions along the geodesic curve connecting $\theta_1$ and $\theta_2$ in $M$, and Figure 4b illustrates the geodesic in the submanifold $M_\Sigma$. We observe that in $M$, the shape of the ellipses (contour curves) changes along the path. Furthermore, the Fisher-Rao distance between $\theta_1$ and $\theta_2$ is $d_F(\theta_1, \theta_2) = 5.00648$, which is less than the Mahalanobis distance given in Equation (17), $d_\Sigma(\theta_1, \theta_2) = 8.06226$, as expected, since the submanifold $M_\Sigma$ is not totally geodesic.
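A hedged sketch of Proposition 3 follows (the closed form above is our reconstruction of the lost display; the Householder reflection and the flipped-Cholesky route to the $UDU^t$ factorization are implementation choices of ours):

```python
# Proposition 3 (reconstructed): d_F(theta_1, theta_2) =
# d_F*((0, sqrt(d11)), (|mu_2 - mu_1|, sqrt(d11))), with d11 = D[0, 0]
# in P Sigma P^t = U D U^t.
import numpy as np

def d_univariate(mu1, s1, mu2, s2):
    # Univariate Fisher-Rao distance, Equation (22).
    a = np.hypot((mu1 - mu2) / np.sqrt(2), s1 + s2)
    b = np.hypot((mu1 - mu2) / np.sqrt(2), s1 - s2)
    return np.sqrt(2) * np.log((a + b) / (a - b))

def udu(S):
    # S = U D U^t with U unit upper triangular, via a flipped Cholesky factor.
    J = np.flip(np.eye(len(S)), axis=0)
    L = np.linalg.cholesky(J @ S @ J)      # J S J = L L^t
    U = J @ (L / np.diag(L)) @ J           # unit upper triangular
    return U, np.flip(np.diag(L) ** 2)     # diagonal entries of D

def d_common_covariance(mu1, mu2, Sigma):
    delta = mu2 - mu1
    r = np.linalg.norm(delta)
    e1 = np.eye(len(delta))[0]
    v = delta - r * e1                     # Householder: P delta = r e1
    P = np.eye(len(delta)) if np.allclose(v, 0) \
        else np.eye(len(delta)) - 2 * np.outer(v, v) / (v @ v)
    _, d = udu(P @ Sigma @ P.T)
    return d_univariate(0.0, np.sqrt(d[0]), r, np.sqrt(d[0]))

Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
print(d_common_covariance(np.array([0.0, 0.0]), np.array([4.0, 4.0]), Sigma))
```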

The Fisher-Rao Distance between Mirrored Distributions
We consider here two mirrored normal distributions; that is, without loss of generality (up to rotation), the line connecting $\mu_1$ and $\mu_2$ is parallel to the $e_1$-axis, and the covariance matrices satisfy $\Sigma_2 = M_1 \Sigma_1 M_1$, with $M_1 = \mathrm{diag}(-1, 1, \ldots, 1)$. This condition also implies that both matrices have the same eigenvalues. For bivariate normal distributions, we then have:

$$\Sigma_1 = \begin{pmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{pmatrix}, \qquad \Sigma_2 = \begin{pmatrix} \sigma_{11} & -\sigma_{12} \\ -\sigma_{12} & \sigma_{22} \end{pmatrix};$$

see Figure 5. After several experiments using the geodesic shooting algorithm for such $\theta_1$ and $\theta_2$, we observed that the geodesic curve connecting these distributions ($\gamma(t) = (\mu(t), \Sigma(t))$, with $\gamma(-1) = \theta_1$ and $\gamma(1) = \theta_2$) satisfies, at $t = 0$,

$$\gamma(0) \approx \theta_{1/2} = \big( \mu_{1/2},\ \mathrm{diag}(d_{11}, d_{22}) \big) \qquad (41)$$

and

$$\gamma'(0) \approx \hat\theta_{1/2} = \left( \begin{pmatrix} \eta \\ 0 \end{pmatrix},\ \hat\Sigma_{1/2} \right), \qquad (42)$$

where $\eta$, $d_{11}$, and $d_{22}$ are real values and $\hat\Sigma_{1/2}$ is a symmetric matrix with a null diagonal; see Figure 6. The focus here is the "shape" of these expressions: at $t = 0$, the distribution $\gamma(0)$ has a diagonal covariance matrix, while the tangent vector $\gamma'(0)$ is composed of a mean vector whose second entry is zero and a symmetric covariance component with a null diagonal. This observation inspired us to obtain an explicit expression for the geodesic connecting two mirrored distributions. Starting again with the bivariate case, we will prove that we in fact have equality in Expressions (41) and (42).
Imposing Equations (41) and (42) as equalities leads to a non-linear system (56) with five equations and five variables ($d_{11}$, $d_{22}$, $\eta$, $x$, $b$), which can be solved by an iterative method. With the solution of this system, we can determine the geodesic curve connecting the distributions $\theta_1$ and $\theta_2$. Moreover, by Equation (16), the Fisher-Rao distance (57) is obtained directly from the solution, as twice the constant geodesic speed determined by $x$ and $b$. We also remark that the curve of the means $\delta(t)$ (and therefore $\mu(t)$) satisfies the equation of a hyperbola. Summarizing the above discussion, we have:

Proposition 4.
(i) Expression (57) provides a closed form for the Fisher-Rao distance between two mirrored bivariate normal distributions, based on the solutions of the non-linear system (56).
(ii) The plane curve described by the coordinates of the mean vector along the geodesic connecting two such distributions is a hyperbola.

Table 2 shows a time comparison between the numerical method proposed here and geodesic shooting for obtaining the Fisher-Rao distance. The distributions used in this experiment were mirrored pairs indexed by different values of $\mu$. Since the method proposed here reduces the computation of the Fisher-Rao distance to a non-linear system, it is faster than the geodesic shooting algorithm. Furthermore, we remark that for $\mu \geq 7$, geodesic shooting requires additional adaptation to converge.
The result extends to mirrored distributions in arbitrary dimension (Proposition 5). Let $\gamma(t) = (\mu(t), \Sigma(t))$, $-1 \leq t \leq 1$, be the geodesic curve in $M$ connecting $\theta_1$ and $\theta_2$, with $\Sigma_2 = M_1 \Sigma_1 M_1$. The proof is similar to the bivariate case, by considering $\gamma(0) = (\mu_{1/2}, \Sigma_{1/2})$. Table 3 collects the results of Section 2.1 and the new results of this section.

Table 3. Closed forms for the Fisher-Rao distance in submanifolds of M and the distance in M between pairs of special distributions.

Distance in Non-Totally Geodesic Submanifolds
• $M_\Sigma$ (common covariance matrix $\Sigma_0$): $d_\Sigma(\theta_1, \theta_2) = \sqrt{(\mu_2 - \mu_1)^t \Sigma_0^{-1} (\mu_2 - \mu_1)}$
• $M_\mu$ (common mean vector): $d_\mu(\theta_1, \theta_2) = \sqrt{\tfrac{1}{2} \sum_{i=1}^{n} \ln^2 \lambda_i}$, where $\lambda_i$ are the eigenvalues of $\Sigma_1^{-1/2} \Sigma_2 \Sigma_1^{-1/2}$
• $M_D$ (diagonal covariance matrices): $d_D(\theta_1, \theta_2) = \sqrt{\sum_{i=1}^{n} \left[d_{F^*}\big((\mu_{1i}, \sigma_{1i}), (\mu_{2i}, \sigma_{2i})\big)\right]^2}$

Distance in Totally Geodesic Submanifolds
• $M_{D\mu}$ ($\Sigma$ diagonal, $\mu$ an eigenvector of $\Sigma$): $d_{D\mu}$ as given in Section 2.1.4

Distance in M between Special Pairs of Distributions
• Common covariance matrix: closed form of Proposition 3
• Mirrored covariance matrices: Equation (57), via the non-linear system (56)

Hierarchical Clustering for Diagonal Gaussian Mixture Simplification
A parameterized Gaussian mixture model $f$ is a weighted sum of $m$ multivariate normal distributions, that is,

$$f(x) = \sum_{i=1}^{m} w_i\, p_i(x),$$

where $p_i$, $i = 1, \ldots, m$, are normal distributions and $w_i$, $i = 1, \ldots, m$, are the mixture weights, with $w_i > 0$ and $\sum_{i=1}^{m} w_i = 1$. In this paper, we call a diagonal Gaussian mixture model (DGMM) a mixture composed only of distributions with diagonal covariance matrices.
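For concreteness, a minimal DGMM density evaluator (our sketch; the parameter values are arbitrary):

```python
# Evaluate f(x) = sum_i w_i p_i(x) for a DGMM with diagonal covariances.
import numpy as np

def dgmm_pdf(x, weights, means, variances):
    # means, variances: arrays of shape (m, n); weights: shape (m,), sums to 1.
    x = np.asarray(x)
    log_comp = -0.5 * (np.sum((x - means) ** 2 / variances, axis=1)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))
    return float(np.sum(weights * np.exp(log_comp)))

w = np.array([0.6, 0.4])
mu = np.array([[0.0, 0.0], [2.0, 1.0]])
var = np.array([[1.0, 0.5], [0.3, 0.8]])
print(dgmm_pdf([1.0, 0.5], w, mu, var))
```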
Gaussian mixture models (GMMs) are used to model datasets in image processing, signal processing, and density estimation problems [46-48]. In many applications involving mixture models, the computational requirements are very high due to the large number of mixture components. This can be handled by reducing the number of components of the mixture: given a mixture $f$ of $m$ components, we want to find a mixture $g$ of $l$ components, $1 \leq l < m$, such that $g$ is a good approximation of $f$ with respect to a similarity measure [49]. Gaussian mixture simplification has been considered in statistical inference [50] and in decoding low-density lattice codes [51].
In [49], a hierarchical clustering algorithm was proposed to simplify exponential family mixture models based on Bregman divergences. This section describes an agglomerative hierarchical clustering method based on the Fisher-Rao distance (21) in the submanifold $M_D$ to simplify DGMMs, and we present an application to image segmentation, complementing what was developed in [52]. We start by introducing the concept of the centroid for a set of distributions in $M_D$.

Centroids in the Submanifold M_D
In [53], Galperin described centroids in the two-dimensional Minkowski (hyperboloid) model, which can also be translated to the Klein disk and Poincaré half-plane models. Given a set of points $q_i = (x_{q_i}, y_{q_i}, z_{q_i})$ in the Minkowski model, with associated weights $u_i$, the centroid is computed as the weighted sum of the points, normalized to lie back on the hyperboloid:

$$c = \frac{\sum_i u_i\, q_i}{\sqrt{ \left\langle \sum_i u_i q_i,\ \sum_i u_i q_i \right\rangle_M }}, \qquad (66)$$

where $\langle \cdot, \cdot \rangle_M$ denotes the Minkowski inner product. To calculate the centroid $c$ of a subset of weighted distributions $C = \{(w_i, \theta_i)\}$, $\theta_i = (\mu_i, \sigma_i)$, the isometries presented in [25] and the relation, given in [22], between the mean × standard deviation half-plane of parameters of univariate normal distributions and the Poincaré half-plane are used. Since the metric in $M_D$ is a product metric, the centroid of a subset of distributions in $M_D$ is computed coordinate-wise: $c = (c_1, \ldots, c_n)$, where $c_j$, $j = 1, \ldots, n$, is the centroid of $C_j = \{(w_i, (\mu_{ji}, \sigma_{ji}))\}$ given in Equation (66).
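A sketch of this centroid computation for one univariate factor (our code; the half-plane ↔ hyperboloid maps below are a standard convention we chose, and the $\sqrt{2}$ scaling of the mean follows Equation (22)):

```python
# Weighted centroid in the hyperbolic half-plane via the Minkowski model:
# lift points to the hyperboloid z0^2 - z1^2 - z2^2 = 1, take the weighted
# sum, and normalize by the Minkowski norm, as in Galperin's construction.
import numpy as np

def to_hyperboloid(x, y):
    return np.array([(x**2 + y**2 + 1) / (2 * y),
                     (x**2 + y**2 - 1) / (2 * y),
                     x / y])

def from_hyperboloid(q):
    return q[2] / (q[0] - q[1]), 1.0 / (q[0] - q[1])

def centroid_half_plane(points, weights):
    s = sum(u * to_hyperboloid(x, y) for u, (x, y) in zip(weights, points))
    norm = np.sqrt(s[0]**2 - s[1]**2 - s[2]**2)    # Minkowski norm
    return from_hyperboloid(s / norm)

# Univariate normals (mu, sigma), mapped to half-plane coords (mu/sqrt(2), sigma):
pts = [(0.0 / np.sqrt(2), 1.0), (3.0 / np.sqrt(2), 2.0)]
x, y = centroid_half_plane(pts, [0.5, 0.5])
print("centroid (mu, sigma):", (x * np.sqrt(2), y))
```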
In order to apply the hierarchical clustering algorithm, we need to consider the distance between two subsets $A$ and $B$. The three most common such distances are called linkage criteria and are given by [54]:
• Single linkage: $D(A, B) = \min_{a \in A,\, b \in B} d_D(a, b)$;
• Complete linkage: $D(A, B) = \max_{a \in A,\, b \in B} d_D(a, b)$ (68);
• Group average linkage: $D(A, B) = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d_D(a, b)$;
where $d_D$ is the distance (21) in the submanifold $M_D$ and $|X|$ is the number of elements of a set $X$.
A summary of the hierarchical clustering algorithm (Algorithm 1) [49], using one of these linkage distances $D$, is given next, with a code sketch after the listing.

Algorithm 1: Hierarchical Clustering Algorithm
1: Form $m$ singleton clusters $C_j = \{(w_j, \theta_j)\}$.
2: Find the two closest clusters, $C_i$ and $C_j$, with respect to the distance $D$, and merge them into a single cluster $C_i \cup C_j$.
3: Compute the distances between the new cluster and each of the old clusters.
4: Repeat Steps 2 and 3 until all items are clustered into a single cluster of size $m$.
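A compact sketch of Algorithm 1 for univariate components, using $d_{F^*}$ and complete linkage (our code; a naive $O(m^3)$ implementation sufficient for small mixtures, stopped at $l$ clusters rather than run to a single cluster):

```python
# Agglomerative clustering of univariate normal components (mu, sigma)
# under the Fisher-Rao distance with complete linkage.
import numpy as np

def d_univariate(mu1, s1, mu2, s2):
    a = np.hypot((mu1 - mu2) / np.sqrt(2), s1 + s2)
    b = np.hypot((mu1 - mu2) / np.sqrt(2), s1 - s2)
    return np.sqrt(2) * np.log((a + b) / (a - b))

def complete_linkage(A, B):
    return max(d_univariate(*a, *b) for a in A for b in B)

def hierarchical_clustering(components, l):
    clusters = [[c] for c in components]            # step 1: singletons
    while len(clusters) > l:                        # steps 2-4
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: complete_linkage(clusters[ij[0]],
                                                          clusters[ij[1]]))
        clusters[i] += clusters.pop(j)              # merge the two closest
    return clusters

comps = [(0.0, 1.0), (0.2, 1.1), (5.0, 0.5), (5.1, 0.6), (9.0, 2.0)]
print(hierarchical_clustering(comps, 2))
```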
The simplified DGMM $g = \sum_{j=1}^{l} \beta_j g_j$ of $l$ components is built from the $l$ subsets $C_1, \ldots, C_l$ remaining after iteration $m - l$ of the hierarchical clustering algorithm. In this work, we choose the parameters of $g_j$ in two ways: as the centroid in the submanifold $M_D$ (Fisher-Rao hierarchical clustering) and as the Bregman left-sided centroid [49] (Bregman-Fisher-Rao hierarchical clustering) of the subset $C_j$, with weights $\beta_j = \sum_{(w_i, \theta_i) \in C_j} w_i$. As remarked in [49], the hierarchical clustering algorithm also yields a method to learn the optimal number of components in the simplified mixture $g$: $g$ must be as compact as possible while reaching a minimum prescribed quality $d_{KL}(f \| g) \leq \tau$, where $d_{KL}(f \| g)$ is the Kullback-Leibler divergence.
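Since $d_{KL}(f \| g)$ between mixtures has no closed form, it can be estimated by Monte Carlo, as done in the experiments of the next subsection; a univariate sketch (ours):

```python
# Monte Carlo estimate of d_KL(f||g) between two univariate Gaussian mixtures:
# sample from f and average log(f/g).
import numpy as np

rng = np.random.default_rng(0)

def sample_mixture(weights, means, sigmas, size):
    idx = rng.choice(len(weights), p=weights, size=size)
    return rng.normal(means[idx], sigmas[idx])

def mixture_pdf(x, weights, means, sigmas):
    comp = (np.exp(-0.5 * ((x[:, None] - means) / sigmas) ** 2)
            / (sigmas * np.sqrt(2 * np.pi)))
    return comp @ weights

def kl_monte_carlo(f, g, size=5000):
    x = sample_mixture(*f, size)
    return float(np.mean(np.log(mixture_pdf(x, *f) / mixture_pdf(x, *g))))

f = (np.array([0.5, 0.5]), np.array([0.0, 4.0]), np.array([1.0, 1.0]))
g = (np.array([1.0]), np.array([2.0]), np.array([2.3]))
print(kl_monte_carlo(f, g))
```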

Experiments in Image Segmentation
We apply the Fisher-Rao and the Bregman-Fisher-Rao hierarchical clusterings to simplify a mixture of exponential families in the context of clustering-based image segmentation, as was done in [49] for the Bregman hierarchical clustering. Given an input color image $I$, we adapt the Bregman soft clustering algorithm to generate a DGMM $f$ of 32 components, which models the image pixels. We point out that the restriction considered in this paper (only DGMMs) is also adopted in many applications due to its much lower computational cost. We consider here a pixel $\rho = (\rho_R, \rho_G, \rho_B)$ as a point in $\mathbb{R}^3$, where $\rho_R$, $\rho_G$, and $\rho_B$ are the RGB color coordinates. For image segmentation, we say that the image pixel $\rho$ belongs to the class $C_j$ when

$$\beta_j\, g_j(\rho) \geq \beta_i\, g_i(\rho), \qquad \text{for all } i = 1, \ldots, l.$$

Thus, the segmented image is obtained by replacing the color value of the pixel $\rho$ by the mean $\mu_j$ of the Gaussian $g_j$.
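A sketch of this classification rule (our code; the $\beta_j g_j(\rho)$ maximization above is our reconstruction of the lost display, and all numbers are arbitrary):

```python
# Assign a pixel rho to the class whose weighted component density
# beta_j g_j(rho) is maximal, then recolor it with that component's mean.
import numpy as np

def classify_pixel(rho, betas, means, variances):
    # Diagonal Gaussian log-densities in RGB space, one per component.
    log_dens = -0.5 * (np.sum((rho - means) ** 2 / variances, axis=1)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))
    j = int(np.argmax(np.log(betas) + log_dens))
    return j, means[j]                    # class index and replacement color

betas = np.array([0.7, 0.3])
means = np.array([[120.0, 60.0, 30.0], [30.0, 90.0, 200.0]])
variances = np.array([[400.0, 300.0, 250.0], [500.0, 350.0, 450.0]])
print(classify_pixel(np.array([100.0, 70.0, 40.0]), betas, means, variances))
```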
Using the Fisher-Rao and the Bregman-Fisher-Rao hierarchical clusterings, we simplify the mixture $f$ into mixtures $g$ of $l$ components with $l \in \{2, 4, 8, 16\}$. Each mixture gives one image segmentation. The linkage criterion used here was the complete linkage (68), which presented the best results in our simulations. Figure 8 shows the segmentations of the Baboon, Lena, and Clown input images given by the Bregman-Fisher-Rao hierarchical clustering. The number of colors in each image equals the number of components in the simplified mixture $g$. The quality of the segmentation was analyzed as a function of $l$ through the Kullback-Leibler divergence estimated by the Monte Carlo method, since there is no closed form for this measure (five thousand points were randomly drawn to estimate $d_{KL}(f \| g)$). Figures 9-11 show the evolution of the simplification quality as a function of the number of components $l$ for the Baboon, Lena, and Clown images, using the Bregman, the Fisher-Rao, and the Bregman-Fisher-Rao hierarchical clustering algorithms. We observed that the image quality increased ($d_{KL}(f \| g)$ decreased) with $l$, as expected, and the behavior was similar for all the clustering algorithms. In general, the Bregman-Fisher-Rao hierarchical clustering presented the best results. Considering the constraint $\tau = 0.2$, the learning process provided, for the Bregman-Fisher-Rao hierarchical clustering, mixtures of 19, 21, and 21 components as optimal simplifications for the Baboon, Lena, and Clown images, respectively.

Concluding Remarks
The Fisher-Rao distance was approached here in the space of multivariate normal distributions. Initially, as in [38], we summarized known closed forms for this distance in submanifolds of this model and some bounds for the general case. A closed form for the Fisher-Rao distance between distributions with the same covariance matrix was obtained in Proposition 3, and we also derived a non-linear system characterizing the distance between two distributions with mirrored covariance matrices (Propositions 4 and 5). Perspectives for future research on this topic include deriving new bounds for the Fisher-Rao distance in the general case by using these special distributions, characterizing the distances between other types of distributions as solutions of non-linear systems, and extending the closed forms and bounds presented here to the space of elliptical distributions. Finally, we extended the analysis of the Bregman-Fisher-Rao hierarchical clustering algorithm for simplifying Gaussian mixtures in the context of clustering-based image segmentation given in [52], with comparative results that encourage the use of the Fisher-Rao distance in other clustering or classification algorithms.