Article

Fast Proxy Centers for the Jeffreys Centroid: The Jeffreys–Fisher–Rao Center and the Gauss–Bregman Inductive Center

Sony Computer Science Laboratories, Tokyo 141-0022, Japan
Entropy 2024, 26(12), 1008; https://doi.org/10.3390/e26121008
Submission received: 18 October 2024 / Revised: 13 November 2024 / Accepted: 21 November 2024 / Published: 22 November 2024
(This article belongs to the Special Issue Information Theory in Emerging Machine Learning Techniques)

Abstract

The symmetric Kullback–Leibler centroid, also called the Jeffreys centroid, of a set of mutually absolutely continuous probability distributions on a measure space provides a notion of centrality which has proven useful in many tasks, including information retrieval, information fusion, and clustering. However, the Jeffreys centroid is not available in closed form for sets of categorical or multivariate normal distributions, two widely used statistical models, and thus needs to be approximated numerically in practice. In this paper, we first propose the new Jeffreys–Fisher–Rao center, defined as the Fisher–Rao midpoint of the sided Kullback–Leibler centroids, as a plug-in replacement of the Jeffreys centroid. This Jeffreys–Fisher–Rao center admits a generic formula for uni-parameter exponential family distributions and a closed-form formula for categorical and multivariate normal distributions; it matches exactly the Jeffreys centroid for same-mean normal distributions and is observed experimentally to be close to the Jeffreys centroid in practice. Second, we define a new type of inductive center generalizing the principle of the Gauss arithmetic–geometric double sequence mean for pairs of densities of any given exponential family. This new Gauss–Bregman center is shown experimentally to approximate the Jeffreys centroid very well and is suggested as a replacement for the Jeffreys centroid when the Jeffreys–Fisher–Rao center is not available in closed form. Furthermore, this inductive center always converges and matches the Jeffreys centroid for sets of same-mean normal distributions. We report on our experiments, which first demonstrate how well the closed-form formula of the Jeffreys–Fisher–Rao center for categorical distributions approximates the costly numerical Jeffreys centroid, which relies on the Lambert W function, and second show the fast convergence of the Gauss–Bregman double sequences, which can closely approximate the Jeffreys centroid when truncated to the first few iterations. Finally, we conclude this work by reinterpreting these fast proxy Jeffreys–Fisher–Rao and Gauss–Bregman centers of Jeffreys centroids through the lens of dually flat spaces in information geometry.

1. Introduction

Let $(\mathcal{X},\mathcal{F})$ be a measurable space with sample space $\mathcal{X}$ and $\sigma$-algebra of events $\mathcal{F}$, and $\mu$ a positive measure. We consider a finite set $\{P_1,\ldots,P_n\}$ of $n$ probability distributions all dominated by $\mu$ and weighted by a vector $w$ belonging to the open standard simplex $\Delta_n = \{x = (x_1,\ldots,x_n) : x_1 > 0, \ldots, x_n > 0, \sum_{i=1}^n x_i = 1\} \subset \mathbb{R}^n$. Let $\mathcal{P} = \{p_1,\ldots,p_n\}$ be the Radon–Nikodym densities of $P_1,\ldots,P_n$ with respect to $\mu$, i.e., $p_i = \frac{\mathrm{d}P_i}{\mathrm{d}\mu}$.
The Kullback–Leibler divergence (KLD) between two densities $p(x)$ and $q(x)$ is defined by $D_{\mathrm{KL}}(p:q) = \int p(x)\log\frac{p(x)}{q(x)}\,\mathrm{d}\mu(x)$. The KLD is asymmetric: $D_{\mathrm{KL}}(p:q) \neq D_{\mathrm{KL}}(q:p)$. We use the argument delimiter ':' as a notation to indicate this asymmetry. The Jeffreys divergence [1] symmetrizes the KLD as follows:
$D_J(p,q) = D_{\mathrm{KL}}(p:q) + D_{\mathrm{KL}}(q:p) = \int_{\mathcal{X}} (p(x) - q(x)) \log\frac{p(x)}{q(x)}\,\mathrm{d}\mu(x).$
In general, the D-barycenter C D of P with respect to a statistical dissimilarity measure D ( · : · ) yields a notion of centrality C R defined by the following optimization problem:
$c_R = \arg\min_{p} \sum_{i=1}^n w_i\, D(p_i : p).$
Here, the upper case letter 'R' indicates that the optimization defining the D-barycenter is carried out on the right argument. When $w = (\frac{1}{n},\ldots,\frac{1}{n})$ is the uniform weight vector, the D-barycenter is called the D-centroid. We shall loosely call centroids barycenters in the remainder even when the weight vector is not uniform. Centroids with respect to information-theoretic measures have been studied in the literature.
Let us mention some examples of centroids: the entropic centroids [2] (i.e., Bregman centroids and f-divergence centroids), the Burbea–Rao and Bhattacharyya centroids [3], the $\alpha$-centroids with respect to $\alpha$-divergences [4], the Jensen–Shannon centroids [5], etc.
The $D_J$-centroid is also called the symmetric Kullback–Leibler (SKL) divergence centroid [6] in the literature. However, since there are many possible symmetrizations of the KLD [7], like the Jensen–Shannon divergence [8] or the resistor KLD [9], we prefer to use the term Jeffreys centroid instead of SKL centroid to avoid any possible ambiguity on the underlying divergence. Notice that the square root of the Jensen–Shannon divergence is a metric distance [10,11], whereas no power $D_J^\alpha$ of the Jeffreys divergence $D_J$ with $\alpha > 0$ yields a metric distance [12].
This paper considers the Jeffreys centroids of a finite weighted set of densities P = { p θ 1 , , p θ n } belonging to some prescribed exponential family [13] E :
$c = \arg\min_{p} \sum_{i=1}^n w_i\, D_J(p_{\theta_i}, p).$
In particular, we are interested in computing the Jeffreys centroids for sets of categorical distributions or sets of multivariate normal distributions [14].
In general, centroids are used in k-means [15,16]-type clustering, hierarchical clustering (e.g., Ward criterion [17]), and information fusion tasks [18] (related to distributed model estimation [19]), among others. See Figure 1. The choice of the dissimilarity measure depends on the application at hand [20]. Clustering with respect to the Jeffreys divergence/Jeffreys centroid has proven useful in many scenarios: for example, it was shown to perform experimentally better than the Euclidean or squared Euclidean distances for compressed histograms of gradient descriptors [21] or in fuzzy clustering [22]. The Jeffreys divergence has also been used in image processing [23], including image segmentation [24], speech processing [25], and computer vision [26], to name a few. In particular, finding weighted means of centered 3D normal distributions plays an important role in diffusion tensor imaging (DTI) for smoothing and filtering DT images [27], which consist of sets of normal distributions centered at 3D grid locations.
In general, the Jeffreys centroid is not known in closed form for exponential families [28], such as the family of categorical distributions or the family of normal distributions often met in applications, and thus needs to be numerically approximated in practice. The main contribution of this paper is to present and study two proxy centers as drop-in replacements of the Jeffreys centroid in applications, and to report a generic structural formula for exponential families together with explicit closed-form formulas for the families of categorical and multivariate normal distributions. Namely, we define the Jeffreys–Fisher–Rao (JFR) center (Definition 2) and the Gauss–Bregman (GB) inductive center (Definition 3) in Section 2.
This paper is organized as follows: By interpreting in two different ways the closed-form formula of the Jeffreys centroid for the particular case of sets of centered multivariate normal distributions [29] (proof reported in Appendix B), we define the Gauss–Bregman (GB) centers and the Jeffreys–Fisher–Rao (JFR) centers for sets of densities belonging to an exponential family in Section 2. The Jeffreys centroid coincides with both the Gauss–Bregman inductive center and the Jeffreys–Fisher–Rao center for centered multivariate normal distributions, but the three centers differ from each other in general. In Section 2.4 and Section 3, we study the Gauss–Bregman inductive center [30] induced by the cumulant function of an exponential family and prove the convergence of the generalized Gauss double sequences under a separability condition (Theorem 3). This Gauss–Bregman center can be easily approximated by limiting the number of iterations of the double sequence inducing it. In Section 4, we report the generic formula for Jeffreys–Fisher–Rao centers for sets of uni-order exponential families [13] and explicitly give the closed-form formulas for the categorical family and the multivariate normal family. A comparison of those proxy centers with the numerical Jeffreys centroids is experimentally studied and visually illustrated with some examples. Thus, we propose to use in applications (e.g., clustering) either the fast Jeffreys–Fisher–Rao center, when a closed-form formula is available for the family of distributions at hand, or the Gauss–Bregman center approximation with a prescribed number of iterations, as a drop-in replacement of the numerical Jeffreys centroid while keeping the Jeffreys divergence. Experiments with the JFR and GB centers are reported for the Jeffreys centroid of categorical distributions in Section 5. Finally, we conclude this paper in Section 6 with a discussion and a generalization of our results to the more general setting of dually flat spaces of information geometry [14].
The core of this paper is followed by an Appendix section as follows: In Appendix A, we explicitly give the algorithm outlined in [31] for numerically computing the Jeffreys centroid of sets of categorical distributions. In Appendix B, we report a proof of the closed-form formula of the Jeffreys centroid for centered normal distributions [29] that motivated this paper. In Appendix C, we explain how to compute in practice the closed-form formula for the Fisher–Rao geodesic midpoint between two multivariate normal distributions [32].

2. Proxy Centers for Jeffreys Centroids

2.1. Background on Jeffreys Centroids

A density $p_\theta$ belonging to an exponential family [13] $\mathcal{E}$ can be expressed canonically as $p_\theta(x) = \exp(\langle\theta, t(x)\rangle - F(\theta))$, where $t(x)$ is a sufficient statistic vector, $F(\theta) = \log\int\exp(\langle\theta, t(x)\rangle)\,\mathrm{d}\mu(x)$ is the log-normalizer, and $\theta$ is the natural parameter belonging to the natural parameter space $\Theta$. We consider minimal regular exponential families [13] like the discrete family of categorical distributions (i.e., $\mu$ is the counting measure) or the continuous family of multivariate normal distributions (i.e., $\mu$ is the Lebesgue measure).
The Jeffreys centroid of categorical distributions was first studied by Veldhuis [6], who designed a numerical two-nested loops Newton-like algorithm [6]. A random variable X following a categorical distribution Cat ( p ) for a parameter p Δ d in sample space X = { ω 1 , , ω d } is such that Pr ( X = ω i ) = p i . Categorical distributions are often used in image processing to statistically model normalized histograms with non-empty bins. The exact characterization of the Jeffreys centroid was given in [31].
We summarize the results regarding the categorical Jeffreys centroid [31] in the following theorem:
Theorem 1
(Categorical Jeffreys centroid [31]). The Jeffreys centroid of a set of $n$ categorical distributions parameterized by $\mathcal{P} = \{p_1,\ldots,p_n\} \subset \Delta_d$, arranged in a matrix $P = [p_{i,j}] \in \mathbb{R}^{n\times d}$ and weighted by a vector $w = (w_1,\ldots,w_n) \in \Delta_n$, is $c(\lambda) = (c_1(\lambda),\ldots,c_d(\lambda))$ with
$c_j(\lambda) = \frac{a_j}{W_0\!\left(\frac{a_j}{g_j}\, e^{1+\lambda}\right)}, \qquad j \in \{1,\ldots,d\},$
where $a_j = \sum_{i=1}^n w_i p_{i,j}$ and $g_j = \frac{\prod_{i=1}^n p_{i,j}^{w_i}}{\sum_{j'=1}^d \prod_{i=1}^n p_{i,j'}^{w_i}}$ are the $j$-th components of the weighted arithmetic and normalized geometric means, respectively; $W_0$ is the principal branch of the Lambert W function [33]; and $\lambda \geq 0$ is the unique real value such that $\lambda = D_{\mathrm{KL}}(c(\lambda) : g)$.
Furthermore, a simple bisection search is reported in [31] §III.B that we convert into Algorithm A1 in Appendix A, which allows one to numerically approximate the Jeffreys centroid to arbitrary fine precision.

2.2. Jeffreys Centroids on Exponential Family Densities: Symmetrized Bregman Centroids

The Jeffreys divergence between two densities of an exponential family $\mathcal{E} = \{p_\theta(x) = \exp(\langle t(x),\theta\rangle - F(\theta)) : \theta\in\Theta\}$ with cumulant function $F(\theta)$ amounts to a symmetrized Bregman divergence [28] (SBD):
$D_J(p_{\theta_1}, p_{\theta_2}) = S_F(\theta_1,\theta_2) := \langle\theta_1-\theta_2,\ \nabla F(\theta_1)-\nabla F(\theta_2)\rangle.$
Using convex duality, we have $S_F(\theta_1,\theta_2) = S_{F^*}(\eta_1,\eta_2)$, where $\eta = \nabla F(\theta)$ and $F^*(\eta) = \langle\eta, (\nabla F)^{-1}(\eta)\rangle - F((\nabla F)^{-1}(\eta))$ is the Legendre–Fenchel convex conjugate. Thus, the Jeffreys barycenter of $\mathcal{P} = \{p_{\theta_1},\ldots,p_{\theta_n}\}$ amounts to either a symmetrized Bregman barycenter on the natural parameters $\mathcal{P}_\theta = \{\theta_1,\ldots,\theta_n\}$ with respect to $S_F$ or a symmetrized Bregman barycenter on the dual moment parameters $\mathcal{P}_\eta = \{\eta_1,\ldots,\eta_n\}$ with respect to $S_{F^*}$.
It was shown in [28] that the symmetrized Bregman barycenter θ S of n weighted points amounts to the following minimization problem involving only the sided Bregman centroids:
$\theta_S := \arg\min_{\theta\in\Theta} \sum_i w_i\, S_F(\theta,\theta_i) \equiv \arg\min_{\theta\in\Theta} B_F(\bar\theta:\theta) + B_F(\theta:\underline\theta),$
where $\bar\theta = \sum_i w_i\theta_i$ (right Bregman centroid) and $\underline\theta = (\nabla F)^{-1}\!\left(\sum_i w_i\nabla F(\theta_i)\right)$ (left Bregman centroid). Those $\bar\theta$ and $\underline\theta$ centers are centroids [28] with respect to the Bregman divergence $B_F(\theta_1:\theta_2) = F(\theta_1) - F(\theta_2) - \langle\theta_1-\theta_2, \nabla F(\theta_2)\rangle$ and the reverse Bregman divergence $B_F^*(\theta_1:\theta_2) := B_F(\theta_2:\theta_1)$:
$\bar\theta = \arg\min_\theta \sum_i w_i\, B_F(\theta_i:\theta), \qquad \underline\theta = \arg\min_\theta \sum_i w_i\, B_F(\theta:\theta_i) = \arg\min_\theta \sum_i w_i\, B_F^*(\theta_i:\theta).$
In general, when $H : \mathbb{R}^m \to \mathbb{R}$ is a strictly convex differentiable real-valued function of Legendre type [34], the gradient $\nabla H$ is globally invertible (in general, the inverse function theorem only guarantees a local inverse), and we can define a quasi-arithmetic center of a point set $\mathcal{P} = \{\theta_1,\ldots,\theta_n\}$ weighted by $w$ as follows:
Definition 1
(Quasi-arithmetic center). Let H = F be the gradient of a strictly convex or concave differentiable real-valued function F of Legendre type. The quasi-arithmetic center c H ( θ 1 , , θ n ; w ) is defined by
$c_H(\theta_1,\ldots,\theta_n; w) = H^{-1}\!\left(\sum_{i=1}^n w_i\, H(\theta_i)\right).$
This definition generalizes the scalar quasi-arithmetic means [35] for univariate functions $h$ that are continuous and strictly monotone. Quasi-arithmetic means (QAMs) are also called f-means or Kolmogorov–Nagumo means. Let $m_F(\theta_1,\theta_2) = c_{\nabla F}(\theta_1,\theta_2; \tfrac{1}{2},\tfrac{1}{2})$. Notice that $A(\theta_1,\theta_2) = \nabla F^*(m_{F^*}(\eta_1,\eta_2))$ and $A(\eta_1,\eta_2) = \nabla F(m_F(\theta_1,\theta_2))$. That is, the arithmetic mean in a primal representation amounts to a QAM in the dual representation.
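For concreteness, here is a minimal sketch of Definition 1 (NumPy; the function and variable names are illustrative, not taken from any library), instantiated with $H = \nabla F$ for the centered-normal generator $F(\theta) = -\frac{1}{2}\log\det(\theta)$ used later in the paper:

```python
import numpy as np

def quasi_arithmetic_center(thetas, weights, H, H_inv):
    """Weighted quasi-arithmetic center c_H = H^{-1}( sum_i w_i H(theta_i) ), with H = grad F."""
    return H_inv(sum(w * H(theta) for w, theta in zip(weights, thetas)))

# Example: H(theta) = -1/2 theta^{-1} and H^{-1}(eta) = -1/2 eta^{-1} (centered-normal generator),
# so c_H is the weighted matrix harmonic mean (sum_i w_i theta_i^{-1})^{-1}.
H = lambda t: -0.5 * np.linalg.inv(t)
H_inv = lambda e: -0.5 * np.linalg.inv(e)
thetas = [np.array([[2.0, 0.3], [0.3, 1.0]]), np.array([[1.0, -0.2], [-0.2, 3.0]])]
c = quasi_arithmetic_center(thetas, [0.5, 0.5], H, H_inv)
```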
Thus, we can solve for θ S by setting the gradient of L ( θ ) = B F ( θ ¯ : θ ) + B F ( θ : θ ̲ ) to zero. In general, no closed-form formula is known for the symmetrized Bregman centroids, and a numerical approximation method was reported in [28]. To circumvent the lack of a closed-form formula of symmetrized Bregman centroids for clustering, Nock et al. [36] proposed a mixed Bregman clustering where each cluster has two representative dual Bregman centroids θ ¯ = i w i θ i (right Bregman centroid) and θ ̲ = ( F ) 1 ( i w i F ( θ i ) ) (left Bregman centroid), and the dissimilarity measure is a mixed Bregman divergence defined by
$\Delta_F(\theta_1 : \theta : \theta_2) := \tfrac{1}{2} B_F(\theta_1 : \theta) + \tfrac{1}{2} B_F(\theta : \theta_2).$
Notice that minimizing Equation (3) amounts to minimizing the mixed Bregman divergence:
min θ Δ F ( θ ¯ : θ : θ ̲ ) .
By using the dual parameterization $\eta = \nabla F(\theta)$ (with dual domain $H = \{\nabla F(\theta) : \theta\in\Theta\}$) and the dual Bregman divergence $B_{F^*}(\eta_1:\eta_2) = F^*(\eta_1) - F^*(\eta_2) - \langle\eta_1-\eta_2, \nabla F^*(\eta_2)\rangle = B_F(\theta_2:\theta_1)$, we have
$\theta_S := \arg\min_{\theta\in\Theta}\sum_i w_i\, S_F(\theta,\theta_i), \qquad \eta_S := \arg\min_{\eta\in H}\sum_i w_i\, S_{F^*}(\eta,\eta_i) = \arg\min_{\eta\in H} B_{F^*}(\eta:\nabla F(\bar\theta)) + B_{F^*}(\nabla F(\underline\theta):\eta).$
Since $\nabla F(\bar\theta) = (\nabla F^*)^{-1}\!\left(\sum_i w_i\nabla F^*(\eta_i)\right) = \underline\eta$ and $\nabla F(\underline\theta) = \nabla F\!\left((\nabla F)^{-1}\!\left(\sum_i w_i\eta_i\right)\right) = \bar\eta$, we obtain the dual equivalent optimization problems:
$\theta_S = \nabla F^*(\eta_S) = \arg\min_{\theta\in\Theta} B_F(\bar\theta:\theta) + B_F(\theta:\underline\theta),$
or
$\eta_S = \nabla F(\theta_S) = \arg\min_{\eta\in H} B_{F^*}(\bar\eta:\eta) + B_{F^*}(\eta:\underline\eta).$
However, a remarkable special case is the family of multivariate normal distributions centered at the origin, for which the Jeffreys centroid was reported in closed form in [29]. Let $\mathcal{N}_0 = \{p_\Sigma : \Sigma\in\mathrm{Sym}^{++}(\mathbb{R},d)\}$ be the exponential family with sufficient statistic $t(x) = -\frac{1}{2} x x^\top$, natural parameter $\theta = \Sigma^{-1}$ (the precision matrix), where the covariance matrix $\Sigma$ belongs to the cone $\mathrm{Sym}^{++}(\mathbb{R},d)$ of symmetric positive-definite matrices, inner product $\langle X, Y\rangle = \mathrm{tr}(XY)$, and $F(\theta) = -\frac{1}{2}\log\det(\theta)$. In that case, the Jeffreys divergence amounts to a symmetrized Bregman log-det (ld) divergence between the corresponding natural parameters:
$D_J(p_\Sigma, p_{\Sigma'}) = \tfrac{1}{2}\,\mathrm{tr}\!\left((\Sigma^{-1}-\Sigma'^{-1})(\Sigma'-\Sigma)\right) =: \tfrac{1}{2}\, S_{\mathrm{ld}}(\Sigma^{-1},\Sigma'^{-1}).$
Using the standard covariance matrix parameterization $\Sigma$, we can further express the Jeffreys divergence between two multivariate normal distributions $p_\Sigma$ and $p_{\Sigma'}$ as
$D_J(p_\Sigma, p_{\Sigma'}) = \tfrac{1}{2}\sum_{i=1}^d \left(\sqrt{\lambda_i} - \tfrac{1}{\sqrt{\lambda_i}}\right)^2,$
where the $\lambda_i$'s are the eigenvalues of $\Sigma^{-1}\Sigma'$. The symmetrized log-det divergence $S_{\mathrm{ld}}$ is also called the symmetrized Stein loss [37,38]. When $d = 1$, this divergence is the symmetrized Itakura–Saito divergence, also called the COSH distance [28]. The Jeffreys centroid can be characterized using the Fisher–Rao geometry [39] of $\mathcal{N}_0$ as the Fisher–Rao geodesic midpoint of the sided Kullback–Leibler centroids as follows:
Theorem 2
([29]). The Jeffreys centroid $C$ of a set of $n$ centered multivariate normal distributions $\mathcal{P} = \{p_{\Sigma_1},\ldots,p_{\Sigma_n}\}$ weighted by $w\in\Delta_n$ amounts to the symmetrized log-det Bregman centroid for the corresponding weighted set of positive-definite precision matrices $\mathcal{P}_\theta = \{P_1 = \Sigma_1^{-1},\ldots,P_n = \Sigma_n^{-1}\}$. The symmetrized log-det Bregman barycenter $C$ is the Riemannian geodesic midpoint $A\# H$ of the arithmetic barycenter $A = \sum_{i=1}^n w_i P_i$ and the harmonic barycenter $H = \left(\sum_{i=1}^n w_i P_i^{-1}\right)^{-1}$, where $X\# Y := X^{\frac{1}{2}}\left(X^{-\frac{1}{2}} Y X^{-\frac{1}{2}}\right)^{\frac{1}{2}} X^{\frac{1}{2}}$ is the matrix geometric mean [40] $G(X,Y) = X\# Y$:
$C = \left(\sum_{i=1}^n w_i P_i\right)\#\left(\sum_{i=1}^n w_i P_i^{-1}\right)^{-1}.$
Since the proof of this result mentioned in [29] was omitted in [29], we report a proof involving matrix analysis in full detail in Appendix B.
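Before doing so, we note that Theorem 2 is straightforward to implement; the following is a minimal NumPy/SciPy sketch (illustrative function names, not the paper's code), with the inputs given as precision matrices:

```python
import numpy as np
from scipy.linalg import sqrtm, inv

def matrix_geometric_mean(X, Y):
    """Matrix geometric mean X # Y = X^{1/2} (X^{-1/2} Y X^{-1/2})^{1/2} X^{1/2}."""
    Xh = sqrtm(X)
    Xmh = inv(Xh)
    return Xh @ sqrtm(Xmh @ Y @ Xmh) @ Xh

def jeffreys_centroid_centered_normals(precisions, weights):
    """Closed-form Jeffreys centroid (returned as a precision matrix) of centered normals (Theorem 2)."""
    A = sum(w * P for w, P in zip(weights, precisions))            # arithmetic barycenter
    H = inv(sum(w * inv(P) for w, P in zip(weights, precisions)))  # harmonic barycenter
    return matrix_geometric_mean(A, H)
```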
Next, we shall define two types of centers for sets of densities of a prescribed exponential family based on two different interpretations of Equation (5). We call them centers and not centroids because those points are defined by a generic structural formula instead of as solutions of the minimization problems of average divergences of Equation (1).

2.3. The Jeffreys–Fisher–Rao Center

Since an exponential family $\mathcal{E} = \{p_\theta(x)\}$ induces the Riemannian manifold $(M, g)$ with the Fisher metric $g$ expressed in the $\theta$-parameterization by the Fisher information matrix $\nabla^2 F(\theta)$ and Fisher–Rao geodesics $\gamma(p,q,t)$ defined with respect to the Levi-Civita connection $\bar\nabla$ (induced by $g$), we shall define the Jeffreys–Fisher–Rao center on $M$ using the Fisher–Rao geodesics as follows:
Definition 2
(Jeffreys–Fisher–Rao (JFR) center). The Jeffreys–Fisher–Rao center $\theta_{\mathrm{JFR}}$ of a set $\{p_{\theta_1},\ldots,p_{\theta_n}\}$ of densities weighted by $w\in\Delta_n$ is defined as the Fisher–Rao midpoint of the sided Kullback–Leibler centroids $\bar\theta = \sum_i w_i\theta_i$ and $\underline\theta = (\nabla F)^{-1}\!\left(\sum_i w_i\nabla F(\theta_i)\right)$:
$\theta_{\mathrm{JFR}} = \bar\theta\#\underline\theta,$
where $p\# q = \gamma\!\left(p, q, \tfrac{1}{2}\right)$.
Equation (6) is a generalization of Equation (5); therefore, the JFR center matches the Jeffreys centroid for same-mean multivariate normal distributions (Theorem 2).
Let $\mathcal{P}_\theta = \{\theta_1,\ldots,\theta_n\}$ and $\mathcal{P}_\eta = \{\eta_1,\ldots,\eta_n\}$, where $\eta = \nabla F(\theta)$ and $\theta = \nabla F^*(\eta)$. Denote by $\mathrm{JFR}_F(\mathcal{P}_\theta; w)$ the $\theta$-coordinates $\theta_{\mathrm{JFR}}$ of the JFR center. Then, $\mathrm{JFR}_{F^*}(\mathcal{P}_\eta; w) = \nabla F(\theta_{\mathrm{JFR}}) =: \eta_{\mathrm{JFR}}$.

2.4. Gauss–Bregman Inductive Center

Another remarkable property of the Jeffreys centroid for a set $\{p_{\mu,\Sigma_1},\ldots,p_{\mu,\Sigma_n}\}$ of same-mean multivariate normal distributions weighted by $w\in\Delta_n$, with arithmetic and harmonic means $A = \sum_{i=1}^n w_i\Sigma_i^{-1}$ and $H = \left(\sum_{i=1}^n w_i\Sigma_i\right)^{-1}$ on the precision matrices $\Sigma_1^{-1},\ldots,\Sigma_n^{-1}$, respectively, is the following invariance of the Jeffreys centroid (see Lemma 17.4.4 of [29]):
$G(A,H) = G\!\left(\frac{A+H}{2},\ 2\left(A^{-1}+H^{-1}\right)^{-1}\right).$
Nakamura [41] defined the following double sequence scheme converging to the matrix geometry mean G ( P , Q ) for any two symmetric positive-definite matrices P and Q:
$P_{t+1} = A(P_t,Q_t) := \frac{P_t+Q_t}{2}, \qquad Q_{t+1} = H(P_t,Q_t) := 2\left(P_t^{-1}+Q_t^{-1}\right)^{-1},$
initialized with $P_0 = P$ and $Q_0 = Q$. We have $\lim_{t\to\infty} P_t = \lim_{t\to\infty} Q_t = P\# Q = G(P,Q)$. Let $P_\infty = \lim_{t\to\infty} P_t$ and $Q_\infty = \lim_{t\to\infty} Q_t$. That is, the geometric matrix mean can be obtained as the limit of a double sequence of means. We can thus approximate $G(P,Q)$ by stopping the double sequence after $T$ iterations to obtain
$G^{(T)}(P,Q) = A(P_T,Q_T) \approx G(P,Q).$
Notice that we can recover those iterations from the invariance property of Equation (7): Indeed, we have
$G(P_0,Q_0) = G(A(P_0,Q_0) =: P_1,\ H(P_0,Q_0) =: Q_1) = G(A(P_1,Q_1) =: P_2,\ H(P_1,Q_1) =: Q_2) = \cdots,$
and $\|P_t - Q_t\| = \sqrt{\mathrm{tr}((P_t-Q_t)(P_t-Q_t))}$ decreases [41] as the number of iterations $t$ increases. Thus, by induction, $G(P_0,Q_0) = G(P_\infty, Q_\infty)$ with $P_\infty = Q_\infty$. Since $G(X,X) = X$ (means are reflexive), it follows that $G(P_0,Q_0) = P_\infty = Q_\infty$. It is proved in [41] that the convergence rate of the double sequence is quadratic. This type of mean has been called an inductive mean [30,42] (or compound mean [43]) and originated from the Gauss arithmetic–geometric mean [44].
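As a sanity check of this double sequence, here is a short NumPy sketch (an illustration under our own naming, not code from [41]) that can be compared numerically against the closed-form matrix geometric mean:

```python
import numpy as np

def arithmetic_harmonic_matrix_mean(P, Q, iters=20):
    """Nakamura's double sequence P_{t+1} = (P_t + Q_t)/2, Q_{t+1} = 2 (P_t^{-1} + Q_t^{-1})^{-1};
    both sequences converge quadratically to the matrix geometric mean P # Q."""
    for _ in range(iters):
        P, Q = (P + Q) / 2, 2 * np.linalg.inv(np.linalg.inv(P) + np.linalg.inv(Q))
    return (P + Q) / 2
```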
Our second interpretation of the geometric matrix mean of Equation (5) is to consider it as an inductive mean [30] and to generalize this double sequence process to pairs/sets of densities of an exponential family as follows:
Definition 3
(Gauss–Bregman ( A , F ) center). Let P = { p θ 1 , , p θ n } be a set of n distributions of an exponential family with the cumulant function F ( θ ) weighted by a vector w Δ n . Then, the Gauss–Bregman inductive center θ GB is defined as the limit of the double sequence:
$\bar\theta_{t+1} = A(\bar\theta_t,\underline\theta_t) := \frac{\bar\theta_t+\underline\theta_t}{2}, \qquad \underline\theta_{t+1} = m_F(\bar\theta_t,\underline\theta_t) := (\nabla F)^{-1}\!\left(\frac{\nabla F(\bar\theta_t)+\nabla F(\underline\theta_t)}{2}\right),$
initialized with $\bar\theta_0 = \bar\theta = \sum_{i=1}^n w_i\theta_i$ (right Bregman centroid) and $\underline\theta_0 = \underline\theta = (\nabla F)^{-1}\!\left(\sum_{i=1}^n w_i\nabla F(\theta_i)\right)$ (left Bregman centroid). That is, we have
$\theta_{\mathrm{GB}} = \lim_{t\to\infty}\bar\theta_t = \lim_{t\to\infty}\underline\theta_t.$
Let $\theta_{\mathrm{GB}} = \mathrm{GB}_F(\bar\theta,\underline\theta)$. Then, we have $\eta_{\mathrm{GB}} = \mathrm{GB}_{F^*}(\bar\eta,\underline\eta) = \nabla F(\theta_{\mathrm{GB}})$. The Gauss–Bregman center $c_{\mathrm{GB}}$ has $\theta$-coordinates $\theta_{\mathrm{GB}}$ and $\eta$-coordinates $\eta_{\mathrm{GB}}$.
Algorithm 1 describes the approximation of the Gauss–Bregman inductive center by stopping the double sequence when the iterated centers are close enough to each other. We shall prove the matching convergence of those $\bar\theta_t$ and $\underline\theta_t$ sequences under separability conditions in Section 3.
Algorithm 1: Gauss–Bregman inductive center.
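Since Algorithm 1 is reproduced in the published article only as a pseudocode image, we give here a minimal sketch of the double sequence it describes, assuming NumPy arrays and user-supplied maps grad_F and grad_F_inv (all names are illustrative):

```python
import numpy as np

def gauss_bregman_center(thetas, weights, grad_F, grad_F_inv, tol=1e-12, max_iter=100):
    """Approximate the Gauss-Bregman inductive center of Definition 3 by iterating the
    (A, m_F) double sequence started from the right/left Bregman centroids."""
    t_bar = sum(w * t for w, t in zip(weights, thetas))                      # theta-bar (arithmetic)
    t_und = grad_F_inv(sum(w * grad_F(t) for w, t in zip(weights, thetas)))  # theta-underline
    for _ in range(max_iter):
        if np.linalg.norm(t_bar - t_und) <= tol:
            break
        t_bar, t_und = (t_bar + t_und) / 2, grad_F_inv((grad_F(t_bar) + grad_F(t_und)) / 2)
    return (t_bar + t_und) / 2
```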
For example, the Gauss–Bregman center of two categorical distributions $p = (p_1,\ldots,p_d)$ and $p' = (p'_1,\ldots,p'_d)$ on a sample space $\mathcal{X}$ of $d$ elements is obtained for the cumulant function $F(\theta) = \log\!\left(1+\sum_{i=1}^{d-1} e^{\theta_i}\right)$ with gradient $\nabla F(\theta) = \eta = \left(\frac{e^{\theta_i}}{1+\sum_{j=1}^{d-1} e^{\theta_j}}\right)_i$, where $\theta = \left(\theta_1 = \log\frac{p_1}{p_d},\ldots,\theta_{d-1} = \log\frac{p_{d-1}}{p_d}\right)$ is the natural parameter. The reciprocal gradient is $(\nabla F)^{-1}(\eta) = \left(\log\frac{\eta_i}{1-\sum_{j=1}^{d-1}\eta_j}\right)_i$.
We may also compute the Gauss–Bregman center of two categorical distributions Cat ( p ) and Cat ( p ) using iterations of arithmetic means a t and geometric normalized means g t :
$a^i_{t+1} = A(a^i_t, g^i_t) := \frac{a^i_t+g^i_t}{2}, \quad u^i_{t+1} = \sqrt{a^i_t\, g^i_t}, \quad g^i_{t+1} = \frac{u^i_{t+1}}{\sum_{j=1}^d u^j_{t+1}}, \qquad i\in\{1,\ldots,d\},$
where the $u_t$'s are unnormalized geometric means and the $g_t$'s are normalized geometric means. We initialize the sequence with $a_0 = p$ and $g_0 = p'$, and the Gauss–Bregman center is obtained in the limit $m^{\mathrm{GB}}_{\mathrm{Cat}}(p,p') = \lim_{t\to\infty} a_t = \lim_{t\to\infty} g_t$. See Algorithm 2.
The Jeffreys centroid of a set of centered multivariate normal distributions is the Gauss–Bregman center obtained for the generator $F(\theta) = -\frac{1}{2}\log\det(\theta)$, the cumulant function of the exponential family of centered normal distributions.
Algorithm 2: Gauss–Bregman inductive center for sets of categorical distributions.
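Algorithm 2 is likewise shown only as an image; the following sketch implements the arithmetic/normalized-geometric double sequence written above (NumPy; illustrative names):

```python
import numpy as np

def gauss_bregman_center_categorical(p, q, tol=1e-8, max_iter=100):
    """Gauss-Bregman center of two categorical distributions p and q (points of the simplex):
    iterate arithmetic means and normalized geometric means until the total variation
    distance between the two sequences drops below tol."""
    a, g = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    for _ in range(max_iter):
        if 0.5 * np.abs(a - g).sum() <= tol:
            break
        u = np.sqrt(a * g)               # unnormalized geometric mean
        a, g = (a + g) / 2, u / u.sum()  # arithmetic mean and normalized geometric mean
    return (a + g) / 2
```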
Figure 2 displays the arithmetic, normalized geometric, numerical Jeffreys, Jeffreys–Fisher–Rao, and Gauss–Bregman centroids/centers for a set of 32 trinomial distributions. We may consider normalized intensity histograms of images (modeled as multinomials with one trial) quantized with $d = 256$ bins; that is, a normalized histogram with $d$ bins is interpreted as a point in $\Delta_d$ and visualized as a polyline with $d-1$ line segments. Figure 3 (left) displays the various centroids and centers obtained for an input set consisting of two histograms (from the commonly used Barbara and Lena images, also used in [31]). Notice that the JFR center (purple) and GB center (yellow) are close to the numerical Jeffreys centroid (green). We also provide a close-up window in Figure 3 (right).
Notice that we can experimentally check the quality of the approximation of the Gauss–Bregman center to the Jeffreys centroid by considering the symmetrized Bregman centroid energy
$E_F(\theta) := \langle\theta-\bar\theta, \nabla F(\theta)\rangle - \langle\theta, \nabla F(\underline\theta)\rangle,$
which coincides up to an additive constant with $\sum_i w_i\, S_F(\theta,\theta_i)$, and checking that its gradient approximately vanishes at the candidate center, i.e., that for all $i\in\{1,\ldots,d\}$,
$\partial_i F(\theta) - \partial_i F(\underline\theta) + \sum_{j=1}^d (\theta_j-\bar\theta_j)\,\partial_i\partial_j F(\theta) \approx 0,$
where $\partial_l := \frac{\partial}{\partial\theta_l}$.
Next, we study these two new types of centers and how well they approximate the Jeffreys centroid.

3. Gauss–Bregman Inductive Centers: Convergence Analysis and Properties

Let $F(\theta)$ be a strictly convex and differentiable real-valued function of Legendre type [45] defined on an open parameter space $\Theta$. Then, the gradient map $\theta\mapsto\eta(\theta) = \nabla F(\theta)$ is a bijection with inverse $\eta\mapsto\theta(\eta) = \nabla F^*(\eta) = (\nabla F)^{-1}(\eta)$, where $F^*(\eta) = \langle\eta, (\nabla F)^{-1}(\eta)\rangle - F((\nabla F)^{-1}(\eta))$ is the Legendre–Fenchel convex conjugate. For example, we may consider the cumulant functions of regular exponential families.
We define the Gauss–Bregman center $\theta_{\mathrm{GB}}$ of a set $\{\theta_1,\ldots,\theta_n\}$ weighted by $w\in\Delta_n$ as the common limit of the sequences $(\bar\theta_t)_t$ and $(\underline\theta_t)_t$ defined by
$\bar\theta_{t+1} = A(\bar\theta_t,\underline\theta_t) := \frac{\bar\theta_t+\underline\theta_t}{2},$
$\underline\theta_{t+1} = m_F(\bar\theta_t,\underline\theta_t) := (\nabla F)^{-1}\!\left(\frac{\nabla F(\bar\theta_t)+\nabla F(\underline\theta_t)}{2}\right),$
initialized with $\bar\theta_0 = \bar\theta = \sum_{i=1}^n w_i\theta_i$ and $\underline\theta_0 = \underline\theta = (\nabla F)^{-1}\!\left(\sum_{i=1}^n w_i\nabla F(\theta_i)\right)$. That is, we have
$\theta_{\mathrm{GB}} = \lim_{t\to\infty}\bar\theta_t = \lim_{t\to\infty}\underline\theta_t.$
Such a center has been called an inductive mean by Sturm [30]. See [42] for an overview of inductive means. Figure 4 geometrically illustrates the double sequence iterations converging to the Gauss–Bregman mean.
Theorem 3.
The Gauss–Bregman ( A , F ) center with respect to a Legendre type function F ( θ ) is well defined (i.e., the double sequence converges) for separable Bregman generators.
Proof. 
We need to prove the convergence of { θ ¯ t } and { θ ̲ t } to the same finite limit. When F ( θ ) is univariate, the convergence of the inductive centers was reported in [43]. We need to prove that the double iterations of Equations (12) and (13) converge.
Let us consider the following cases:
  • When the dimension is one, the quasi-arithmetic mean m f for f, a strictly convex and differentiable function, lies between the minimal and maximal argument (i.e., this is the definition of a strict mean):
    $\min\{\theta_1,\theta_2\} \leq m_f(\theta_1,\theta_2) \leq \max\{\theta_1,\theta_2\}.$
    Thus, we have
    $|\bar\theta_{t+1} - \underline\theta_{t+1}| \leq \tfrac{1}{2}\,|\bar\theta_t - \underline\theta_t|,$
    and it follows that $|\bar\theta_{t+1} - \underline\theta_{t+1}| \leq \frac{1}{2^t}\,|\bar\theta_0 - \underline\theta_0|$. Thus, the scalar $(A,f)$-means converge at least linearly (the rate is in fact quadratic [43]). See Figure 5.
  • When F ( θ ) is multivariate and separable, i.e.,  F ( θ ) = i = 1 d f i ( θ i ) where θ = ( θ 1 , , θ d ) are the components of θ R d and the f i s are scalar strictly convex and differentiable functions, we can apply case 1 dimension-wise to obtain the quadratic convergence.
  • Otherwise, we consider the multivariate quasi-arithmetic center $m_F(\theta,\theta')$ with the uniform weight vector $w = (\frac{1}{2},\frac{1}{2})$. One problem we face is that the quasi-arithmetic center $m_F(\theta,\theta')$ for $\theta\neq\theta'$ may lie outside the open bounding box of $\mathbb{R}^d$ with diagonal corners
    $\theta_m = (\min\{\theta_1,\theta'_1\},\ldots,\min\{\theta_d,\theta'_d\}), \qquad \theta_M = (\max\{\theta_1,\theta'_1\},\ldots,\max\{\theta_d,\theta'_d\}).$
    Indeed, in the 2D case, we may consider $\theta = (x,y)$ and $\theta' = (x,y')$ sharing their first coordinate: the open bounding box is then empty, and the midpoint $m_F(\theta,\theta')$ may lie outside this box. Yet, we are interested in the convergence rate when $\theta'\to\theta$.
    In general, we shall measure the difference between two iterations by the squared norm distance induced by the inner product:
    $\|A(\theta,\theta') - m_F(\theta,\theta')\|^2 = \langle A(\theta,\theta') - m_F(\theta,\theta'),\ A(\theta,\theta') - m_F(\theta,\theta')\rangle.$
 □
Let $m_F^{\mathrm{GB}}(\theta_1,\theta_2)$ denote the Gauss–Bregman center of $\theta_1$ and $\theta_2$, $A(\theta_1,\theta_2) = \frac{\theta_1+\theta_2}{2}$ the arithmetic mean, and $m_F(\theta_1,\theta_2) = (\nabla F)^{-1}\!\left(\frac{\nabla F(\theta_1)+\nabla F(\theta_2)}{2}\right)$ the quasi-arithmetic center.
By construction, the Gauss–Bregman center enjoys the following invariance property generalizing Lemma 17.4.4 of [29] in the case of the log det generator:
Property 1.
We have $m_F^{\mathrm{GB}}(\theta_1,\theta_2) = m_F^{\mathrm{GB}}\big(A(\theta_1,\theta_2),\ m_F(\theta_1,\theta_2)\big)$.
Proof. 
Similar to the cascaded equalities of Equation (8), we have
$m_F^{\mathrm{GB}}(\theta_1,\theta_2) = m_F^{\mathrm{GB}}\big(A(\theta_1,\theta_2) =: \theta_1^{(1)},\ m_F(\theta_1,\theta_2) =: \theta_2^{(1)}\big) = \cdots$
In the limit $t\to\infty$, we have $m_F^{\mathrm{GB}}(\theta_1,\theta_2) = m_F^{\mathrm{GB}}(\theta_1^{(\infty)},\theta_2^{(\infty)}) = m_F^{\mathrm{GB}}(\theta_1^{(1)},\theta_2^{(1)}) = \cdots$ Since $\theta_1^{(\infty)} = \theta_2^{(\infty)}$, we obtain the desired invariance property:
$m_F^{\mathrm{GB}}(\theta_1,\theta_2) = m_F^{\mathrm{GB}}\big(A(\theta_1,\theta_2),\ m_F(\theta_1,\theta_2)\big).$
 □
Note that when $F(\theta)$ is univariate, the Gauss–Bregman mean $m_F^{\mathrm{GB}}(\theta_1,\theta_2)$ converges at a quadratic rate [43]. In particular, when $F(\theta) = -\log\theta$ (Burg negentropy), we have $F'(\theta) = -\frac{1}{\theta}$ ($m_F$ is the harmonic mean), and the Gauss–Bregman mean is the arithmetic–harmonic mean (AHM), which converges to the geometric mean, a simple closed-form formula. Notice that the geometric mean $g = \sqrt{xy}$ of two scalars $x > 0$ and $y > 0$ can be expressed using the arithmetic mean $a = \frac{x+y}{2}$ and the harmonic mean $h = \frac{2xy}{x+y}$: $g = \sqrt{ah}$. But when $F(\theta) = \theta\log\theta - \theta$ (Shannon negentropy), the Gauss–Bregman mean $m_F^{\mathrm{GB}}(\theta_1,\theta_2)$ coincides with the Gauss arithmetic–geometric mean [44] (AGM) since $F'(\theta) = \log\theta$ and $m_F(\theta_1,\theta_2) = \sqrt{\theta_1\theta_2}$, the geometric mean. There is no closed-form formula for the AGM in terms of elementary functions: this induced mean is related to the complete elliptic integral of the first kind $K(\cdot)$:
$\mathrm{AGM}(x,y) = \frac{\pi}{4}\,\frac{x+y}{K\!\left(\frac{x-y}{x+y}\right)},$
where $K(u) = \int_0^{\pi/2}\frac{\mathrm{d}\theta}{\sqrt{1-u^2\sin^2\theta}}$ is the complete elliptic integral of the first kind. Thus, it is difficult, in general, to report a closed-form formula for the inductive Gauss–Bregman means, even for univariate generators $F(\theta)$.
The Jeffreys centroid of $x > 0$ and $y > 0$ with respect to the scalar Jeffreys divergence $D_J(p,q) = (p-q)\log\frac{p}{q}$ admits a closed-form solution [31]:
$c = \frac{a}{W_0\!\left(\frac{a}{g}\, e\right)},$
where $a = \frac{x+y}{2}$, $g = \sqrt{xy}$, and $W_0$ is the principal branch of the Lambert W function [33]. This example shows that the Gauss–Bregman center does not coincide with the Jeffreys centroid in general (e.g., compare Equation (15) with Equation (16)).
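A quick numerical illustration of this difference (a sketch using SciPy's lambertw; the input values are arbitrary):

```python
import numpy as np
from scipy.special import lambertw

def scalar_jeffreys_centroid(x, y):
    """Closed-form scalar Jeffreys centroid c = a / W_0((a/g) e) of x, y > 0."""
    a, g = (x + y) / 2, np.sqrt(x * y)
    return a / lambertw(a * np.e / g, 0).real

def agm(x, y, iters=30):
    """Gauss arithmetic-geometric mean, i.e., the Gauss-Bregman mean for F(t) = t log t - t."""
    for _ in range(iters):
        x, y = (x + y) / 2, np.sqrt(x * y)
    return x

print(scalar_jeffreys_centroid(1.0, 8.0), agm(1.0, 8.0))  # close, but not equal
```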

4. Jeffreys–Fisher–Rao Centers: Generic Structural Formula and Some Closed-Form Formula

4.1. Jeffreys–Fisher–Rao Center for Uni-Parametric Statistical Models

Consider a set $\mathcal{P} = \{p_{\theta_1},\ldots,p_{\theta_n}\}$ of $n$ parametric distributions where $\theta\in\Theta\subset\mathbb{R}$ is a scalar parameter. Let $w = (w_1,\ldots,w_n)\in\Delta_n$ be a weight vector on $\mathcal{P}$ such that the weight of $p_{\theta_i}$ is $w_i$. The distributions $p_\theta$ need not belong to an exponential family (e.g., the Cauchy scale family). The Fisher–Rao geometry [46,47] of the parametric family of distributions $\mathcal{F} = \{p_\theta : \theta\in\Theta\}$ (the statistical model) can be modeled as a Riemannian manifold with the Fisher metric $g(\theta) = I(\theta)$ defined by the Fisher information $I(\theta) = E_\theta\!\left[\left(\frac{\mathrm{d}}{\mathrm{d}\theta}\log p_\theta(x)\right)^2\right] = -E_\theta\!\left[\frac{\mathrm{d}^2}{\mathrm{d}\theta^2}\log p_\theta(x)\right]$. When $\mathcal{F}$ is an exponential family with cumulant function $F(\theta)$, we have $I(\theta) = F''(\theta)$.
The underlying geometry of $(\mathcal{F}, g(\theta) = I(\theta))$ is Euclidean after a change of variable, since we can factorize the metric tensor as follows:
$g(\theta) = \sqrt{I(\theta)}\times\underbrace{1}_{=\, g_{\mathrm{Euclidean}}}\times\sqrt{I(\theta)}.$
Thus, the Riemannian Fisher–Rao distance is the Euclidean distance expressed in the $h(\theta)$-coordinate system with $h(\theta) = \int_{\theta_0}^\theta\sqrt{I(u)}\,\mathrm{d}u$, and we have the Fisher–Rao distance given by
$\rho(p_{\theta_1}, p_{\theta_2}) = |h(\theta_1) - h(\theta_2)|.$
When $\mathcal{F}$ is an exponential family with cumulant function $f(\theta)$, we have $I(u) = f''(u)$.
We summarize the result on the JFR center in the following theorem:
Theorem 4
(Jeffreys–Fisher–Rao center in uni-order exponential families). The Jeffreys–Fisher–Rao center $\theta_{\mathrm{JFR}}$ of $n$ densities $p_{\theta_1},\ldots,p_{\theta_n}$ of an exponential family of order one with log-normalizer $f(\theta)$, natural parameter space $\Theta$, and weight vector $w\in\Delta_n$ is
$\theta_{\mathrm{JFR}} = m_h(\bar\theta,\underline\theta),$
where $m_h(\bar\theta,\underline\theta) = h^{-1}\!\left(\frac{h(\bar\theta)+h(\underline\theta)}{2}\right)$ is the quasi-arithmetic mean [35] of the dual right and left KL centroids $\bar\theta = \sum_{i=1}^n w_i\theta_i$ and $\underline\theta = (f')^{-1}\!\left(\sum_{i=1}^n w_i f'(\theta_i)\right)$ with respect to the scalar monotone function $h(\theta) = \int_{\theta_0}^\theta\sqrt{f''(u)}\,\mathrm{d}u$ for any $\theta_0\in\Theta$.
Proof. 
Since the Fisher information is $I(\theta) = f''(\theta)$, we have $h(\theta) = \int_{\theta_0}^\theta\sqrt{f''(u)}\,\mathrm{d}u$. The Riemannian center of mass [48] minimizes
$\theta_S = \arg\min_\theta \sum_{i=1}^n w_i\,\rho^2(\theta_i,\theta).$
But in the $h$-parameterization, the Riemannian centroid amounts to a Euclidean center of mass/centroid in the $h$-Cartesian coordinate system:
$h(\theta_S) = \sum_{i=1}^n w_i\, h(\theta_i).$
Therefore, we have $\theta_S = h^{-1}\!\left(\sum_i w_i h(\theta_i)\right) =: m_h(\theta_1,\ldots,\theta_n; w_1,\ldots,w_n)$, a weighted quasi-arithmetic mean. Since the Jeffreys centroid amounts to a symmetrized Bregman centroid of the left and right Bregman centroids [28], $\underline\theta = (f')^{-1}\!\left(\sum_i w_i f'(\theta_i)\right)$ and $\bar\theta = \sum_i w_i\theta_i$, it follows that the Jeffreys–Fisher–Rao center is $\theta_{\mathrm{JFR}} = m_h(\bar\theta,\underline\theta)$. □
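As a worked illustration of Theorem 4 (our own example, assuming the Poisson family, which is not treated in the paper): for Poisson distributions with natural parameter $\theta = \log\lambda$ and cumulant function $f(\theta) = e^\theta$, we have $f''(\theta) = e^\theta$, hence $h(\theta) = 2e^{\theta/2}$ (up to an irrelevant additive constant) and $h^{-1}(y) = 2\log(y/2)$:

```python
import numpy as np

def jfr_center_poisson(thetas, weights):
    """JFR center of weighted Poisson distributions, natural parameters theta_i = log(rate_i).
    Uses f(theta) = exp(theta), h(theta) = 2 exp(theta/2), h^{-1}(y) = 2 log(y/2) (Theorem 4)."""
    thetas, weights = np.asarray(thetas, float), np.asarray(weights, float)
    t_bar = np.dot(weights, thetas)                  # right KL centroid: arithmetic mean
    t_und = np.log(np.dot(weights, np.exp(thetas)))  # left KL centroid: (f')^{-1}(sum w_i f'(theta_i))
    h, h_inv = lambda t: 2 * np.exp(t / 2), lambda y: 2 * np.log(y / 2)
    return h_inv((h(t_bar) + h(t_und)) / 2)          # quasi-arithmetic midpoint m_h
```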

4.2. Jeffreys–Fisher–Rao Center for Categorical Distributions

Recall from Theorem 1 that the Jeffreys centroid $c = (c_1,\ldots,c_j,\ldots,c_d)$ of a set of $n$ categorical distributions with parameters arranged in the matrix $[p_{i,j}]$ is given by
$c_j(\lambda) = \frac{a_j}{W_0\!\left(\frac{a_j}{g_j}\, e^{1+\lambda}\right)}, \qquad j\in\{1,\ldots,d\},$
where $a_j = \sum_{i=1}^n w_i p_{i,j}$ and $g_j = \frac{\prod_{i=1}^n p_{i,j}^{w_i}}{\sum_{j'=1}^d\prod_{i=1}^n p_{i,j'}^{w_i}}$ are the components of the weighted arithmetic and normalized geometric means, respectively, and $W_0$ is the principal branch of the Lambert W function [33]. The optimal $\lambda \geq 0$ is unique and satisfies $\lambda = D_{\mathrm{KL}}(c(\lambda):g)$.
Let c ( λ ) = ( c 1 ( λ ) , , c d ( λ ) ) . Let L J ( p ) denote the Jeffreys loss function to minimize to find the optimal Jeffreys centroid:
$L_J(p) = \sum_{i=1}^n w_i\, D_J(p_i, p).$
We say that p is a ( 1 + ϵ ) approximation of the exact Jeffreys centroid c when we have
$L_J(c) \leq L_J(p) \leq (1+\epsilon)\, L_J(c).$
It was shown in [31] that $\tilde c = c(0)$, called the unnormalized Jeffreys center, yields an approximation factor of the Jeffreys centroid $c$ that can be expressed in terms of $s(\lambda) = \sum_j c_j(\lambda)$.
Since the Fisher–Rao geodesic midpoints on the categorical Fisher–Rao manifold are known in closed form [49], we give the mathematical expression of the JFR center as follows:
Theorem 5
(JFR center of categorical distributions). Let $\mathcal{P}_w = \{p_1,\ldots,p_n\}$ be a set of $n$ probability mass functions weighted by $w\in\Delta_n$, with $p_i = (p_{i,1},\ldots,p_{i,d})\in\Delta_d$ for $i\in\{1,\ldots,n\}$. Then, the JFR center $c$ is unique and given by the following formula:
$c_j = \frac{\left(\sqrt{a_j}+\sqrt{g_j}\right)^2}{2\left(1+\sum_{l=1}^d\sqrt{a_l\, g_l}\right)}, \qquad j\in\{1,\ldots,d\},$
where $a = (a_1,\ldots,a_d) = \sum_{i=1}^n w_i p_i$ is the weighted arithmetic mean and $g = (g_1,\ldots,g_d)$ is the normalized weighted geometric mean with components $g_j = \frac{\prod_{i=1}^n p_{i,j}^{w_i}}{\sum_{j'=1}^d\prod_{i=1}^n p_{i,j'}^{w_i}}$ for $j\in\{1,\ldots,d\}$.
Notice that the JFR center differs from the Jeffreys centroid, which requires the use of the Lambert W function [33]. However, we noticed that for practical applications, the JFR centroid approximates the Jeffreys centroid well and is much faster to compute (see the experiments in Section 5).
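A minimal sketch of the closed-form JFR center of Theorem 5 (NumPy; illustrative names), which only requires the weighted arithmetic mean a and the normalized weighted geometric mean g:

```python
import numpy as np

def jfr_center_categorical(P, w):
    """JFR center of categorical distributions given as the rows of P (n x d), weighted by w."""
    P, w = np.asarray(P, float), np.asarray(w, float)
    a = w @ P                                    # weighted arithmetic mean
    g = np.prod(P ** w[:, None], axis=0)
    g = g / g.sum()                              # normalized weighted geometric mean
    c = (np.sqrt(a) + np.sqrt(g)) ** 2
    return c / (2 * (1 + np.sqrt(a * g).sum()))  # Fisher-Rao (spherical) midpoint of a and g
```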

4.3. Jeffreys–Fisher–Rao Center for Multivariate Normal Distributions

Let $\mathcal{P} = \{p_{\mu_1,\Sigma_1},\ldots,p_{\mu_n,\Sigma_n}\}$ be a set of $n$ probability density functions (PDFs) of $d$-variate normal distributions weighted by $w\in\Delta_n$, where the PDF of a multivariate normal distribution of mean $\mu$ and covariance matrix $\Sigma$ is given by
$p_{\mu,\Sigma}(x) = \frac{1}{(2\pi)^{\frac{d}{2}}\sqrt{\det(\Sigma)}}\exp\!\left(-\frac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)\right).$
Let $\lambda_i = (\mu_i,\Sigma_i)$ be the ordinary parameterization of the normal distribution $p_{\mu_i,\Sigma_i}$. The family $\mathcal{F} = \{p_{\mu,\Sigma}(x) : \mu\in\mathbb{R}^d, \Sigma\in\mathrm{Sym}^{++}(\mathbb{R},d)\}$ of multivariate normal distributions forms an exponential family with the dual natural $\theta$- and moment $\eta$-parameterizations [7] given by
$\theta(\lambda) = (\theta_v,\theta_M) = \left(\Sigma^{-1}\mu,\ \tfrac{1}{2}\Sigma^{-1}\right), \qquad \eta(\lambda) = \left(\mu,\ -(\Sigma+\mu\mu^\top)\right),$
when choosing the sufficient statistic $t(x) = (x, -x x^\top)$. The Jeffreys divergence between two $d$-variate normal distributions $N(\mu_1,\Sigma_1)$ and $N(\mu_2,\Sigma_2)$ is given by the formula
$D_J(p_{\mu_1,\Sigma_1}, p_{\mu_2,\Sigma_2}) = (\mu_2-\mu_1)^\top\frac{\Sigma_1^{-1}+\Sigma_2^{-1}}{2}(\mu_2-\mu_1) + \mathrm{tr}\!\left(\frac{\Sigma_1^{-1}\Sigma_2+\Sigma_2^{-1}\Sigma_1}{2}\right) - d.$
The left and right Kullback–Leibler barycenters amount to the corresponding right and left Bregman barycenters [28] induced by the cumulant function
$F(\theta) = F(\theta_v,\theta_M) = \frac{1}{2}\left(d\log\pi - \log\det(\theta_M) + \frac{1}{2}\theta_v^\top\theta_M^{-1}\theta_v\right),$
and the gradient of $F(\theta)$ defines the dual moment parameter with
$\eta(\theta) = \nabla F(\theta) = \left(\frac{1}{2}\theta_M^{-1}\theta_v,\ -\frac{1}{2}\theta_M^{-1} - \frac{1}{4}(\theta_M^{-1}\theta_v)(\theta_M^{-1}\theta_v)^\top\right).$
The reciprocal gradient is given by
$\theta(\eta) = \theta(\eta_v,\eta_M) = (\nabla F)^{-1}(\eta) = \left(\theta_v = -(\eta_M+\eta_v\eta_v^\top)^{-1}\eta_v,\ \theta_M = -\tfrac{1}{2}(\eta_M+\eta_v\eta_v^\top)^{-1}\right).$
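A sketch of these conversions in NumPy (illustrative names; the pair $(\theta_v,\theta_M)$ is handled as a tuple here), which is the main ingredient needed to run the Gauss–Bregman double sequence on multivariate normals:

```python
import numpy as np

def theta_of_lambda(mu, Sigma):
    """(mu, Sigma) -> natural parameters (theta_v, theta_M) = (Sigma^{-1} mu, 1/2 Sigma^{-1})."""
    P = np.linalg.inv(Sigma)
    return P @ mu, 0.5 * P

def grad_F(theta_v, theta_M):
    """eta = (1/2 theta_M^{-1} theta_v, -1/2 theta_M^{-1} - 1/4 (theta_M^{-1} theta_v)(theta_M^{-1} theta_v)^T)."""
    m = np.linalg.solve(theta_M, theta_v)
    return 0.5 * m, -0.5 * np.linalg.inv(theta_M) - 0.25 * np.outer(m, m)

def grad_F_inv(eta_v, eta_M):
    """theta = (-(eta_M + eta_v eta_v^T)^{-1} eta_v, -1/2 (eta_M + eta_v eta_v^T)^{-1})."""
    S = np.linalg.inv(eta_M + np.outer(eta_v, eta_v))
    return -S @ eta_v, -0.5 * S
```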
The Gauss–Bregman center is an $(A, m_F)$-inductive center, which can be approximated by carrying out a prescribed number $T$ of iterations of the Gauss–Bregman double sequence.
Although the Rao distance between two d-variate normal distributions is not available in closed form when d > 1 [50,51], the Jeffreys–Fisher–Rao center can be computed in closed form. Indeed, the sided Kullback–Leibler centroids of multivariate normal distributions amount to reverse-sided Bregman centroids [28], and the Fisher–Rao geodesic midpoint between two multivariate normal distributions was recently reported in [32]. Appendix C concisely describes the method of Kobayashi [32], which allows one to obtain the Fisher–Rao midpoints of multivariate normal distributions. An implementation of that algorithm is available in the Python software library pyBregMan [52].
Thus, the Jeffreys–Fisher–Rao center is available in closed form:
Theorem 6
(JFR center of MVNs). The Jeffreys–Fisher–Rao center of a finite set of weighted multivariate normal distributions is available in closed form.
Note that the Fisher–Rao distance between normal distributions is invariant under the action of the positive affine group [50], as are the Jeffreys centroid, the JFR center, and the GB center. Figure 6 shows several examples of the JFR and GB centers of two univariate normal distributions. We can observe that those centers are close to each other although they are distinct when the normal distributions do not share the same means and covariance matrices.
Figure 7 shows the various centroids/centers between two bivariate normal distributions displayed as ellipsoids centered as their means. Observe that the inductive Gauss–Bregman center is visually closer than the Jeffreys–Fisher–Rao center to the Jeffreys centroid.
Figure 8 displays the various centroids and centers for pairs of bivariate normal distributions centered at the same mean. Figure 9 shows the centroids and centers for pairs of bivariate normal distributions with the same covariance matrix.
Remark 1.
In general, an exponential family may be characterized equivalently by two convex functions: (1) its log-normalizer $F(\theta)$ or (2) its partition function $Z(\theta) = \exp(F(\theta))$, which is log-convex and hence also convex [53]. It has been shown that the Bregman divergence $B_Z$ for $Z(\theta) = \det(\theta)^{-\frac{1}{2}}$ (convex) corresponds to the reverse extended Kullback–Leibler divergence between unnormalized PDFs of normal distributions:
$B_Z(\theta_1:\theta_2) = D_{\mathrm{KL}}^+(\tilde p_{\lambda(\theta_2)} : \tilde p_{\lambda(\theta_1)}),$
where $\tilde p_{\mu,\Sigma} = \exp\!\left(-\frac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)\right)$ and the extended KLD between two positive measures is given by
$D_{\mathrm{KL}}^+(m_1:m_2) = \int\left(m_1(x)\log\frac{m_1(x)}{m_2(x)} + m_2(x) - m_1(x)\right)\mathrm{d}\mu(x).$
Remark 2.
We may further define yet another center for multivariate normal distributions by considering the Fisher–Rao isometric embedding of the Fisher–Rao $d$-variate normal manifold $M = \{p_{\mu,\Sigma}\}$ into the Fisher–Rao $(d+1)$-variate centered manifold $N_0^+ = \{q_P(y) = p_{0,P}(y) : P\in\mathrm{Sym}^{++}(\mathbb{R},d+1)\}$ using the Calvo and Oller mapping [50]:
$f(\mu,\Sigma) := \begin{pmatrix}\Sigma+\mu\mu^\top & \mu\\ \mu^\top & 1\end{pmatrix}.$
Let $\bar M = \{f(p) : p\in M\}$ denote the embedded submanifold of codimension one in $N_0^+$. The Calvo–Oller center is then defined by taking the Fisher–Rao midpoint $q_{\mathrm{CO}}$ of $q_{P_1}$ and $q_{P_2}$, projecting $q_{\mathrm{CO}}$ onto $\bar M$ to obtain $q_{\mathrm{CO}}^\perp$, and converting $q_{\mathrm{CO}}^\perp$ into $p_{\mathrm{CO}}\in M$ using the inverse mapping $f^{-1}$ [51].
The Fisher orthogonal projection of a $(d+1)\times(d+1)$ matrix $P\in N_0^+$ onto the submanifold $\bar M$ is performed as follows: Let $\beta = P_{d+1,d+1}$ and write $P = \begin{pmatrix}\Sigma+\beta\mu\mu^\top & \beta\mu\\ \beta\mu^\top & \beta\end{pmatrix}$. Then, the orthogonal projection of $P$ onto $\bar M$ is $\begin{pmatrix}\Sigma+\mu\mu^\top & \mu\\ \mu^\top & 1\end{pmatrix}$. See [51] for details of the Calvo and Oller embedding/projection method.

5. Experiments

We run all experiments on a Dell Inspiron 5502 (Intel Core i7-1165G7 @ 2.80 GHz) using compiled Java programs. For each experiment, we consider a set of $n = 2$ uniformly sampled random histograms with $d$ bins (i.e., points in $\Delta_d$) and calculate the numerical Jeffreys centroid, which requires the time-consuming Lambert W function, the GB center, and the JFR center. For each prescribed value of $d$, we run 10,000 experiments to collect various statistics like the average and maximum approximation factors and running times. The approximations of the JFR and GB methods are calculated either as the approximation of the Jeffreys information (Equation (18)) or as the approximation of the centers with respect to the numerical Jeffreys centroids measured using the total variation distance. Table 1 is a verbatim export of our experimental results as we range the dimension of the histograms from $d = 2$ to $d = 256$, doubling the dimension at each round. The inductive GB center is stopped when the total variation $\frac{1}{2}\|a_t - g_t\|_1 \leq 10^{-8}$.
We observe that the JFR center is faster to compute than the GB center but the GB center is of higher quality (i.e., a better approximation with a lower ϵ ) than the JFR center to approximate the numerical Jeffreys centroid.
Another test consists of choosing $d = 3$ and the following two 3D normalized histograms: $(\frac{1}{3},\frac{1}{3},\frac{1}{3})$ and $(1-\alpha, \alpha/2, \alpha/2)$ for $\alpha\in\{10^{-1}, 10^{-2},\ldots,10^{-7},10^{-8}\}$. Table 2 reports the experiments. The objective is to find a setting where both the JFR and GB centers are distinguished from the Jeffreys centroid. We see that as we decrease $\alpha$, the approximation factor $\epsilon$ gets worse for both the JFR center and the GB center. The JFR center is often faster to compute than the GB inductive center, but the approximation of the GB center is better than the JFR approximation.
Finally, we implemented the Gauss–Bregman and Jeffreys–Fisher–Rao centers and Jeffreys centroid using multi-precision arithmetic. We report the following experiments using 200-digit precision arithmetic for the following input of two normalized histograms: p = ( 0.1 , 0.9 ) and q = ( 0.8 , 0.2 ) . We report the various first 17-digit mantissas obtained with the corresponding Jeffreys information:
  • Jeffreys center: ( 0.42490383904214813 , 0.575096160957851866 )
    Jeffreys information: 1.2490723231955352 .
  • Gauss–Bregman center: (0.42490383904276856, 0.575096160957231439)
    Jeffreys information: 1.2490723231955353 .
  • Jeffreys–Fisher–Rao center: (0.42490390202906282, 0.575096097970937175)
    Jeffreys information: 1.2490723232068266 .
The total variation distance between the Jeffreys centroid and the Gauss–Bregman center is $6.204271148284087422350000686372\times 10^{-13}$.
The total variation distance between the Jeffreys centroid and the Jeffreys–Fisher–Rao center is $6.298691469047911984039363762611\times 10^{-8}$.
The total variation distance between the Gauss–Bregman center and the Jeffreys–Fisher–Rao center is $6.298629426336429143165140262604\times 10^{-8}$.
Although all those points are close to each other, they are all distinct points (note that using the limited precision of the IEEE 754 floating point standard may yield a misleading interpretation of experiments).

6. Conclusions and Discussion

In this work, we considered the Jeffreys centroid of a finite weighted set of densities of a given exponential family E = { p θ ( x ) : θ Θ } . This Jeffreys centroid amounts to a symmetrized Bregman centroid on the corresponding weighted set of natural parameters of the densities [28]. In general, the Jeffreys centroids do not admit closed-form formulas [28,31] except for sets of same-mean normal distributions [29] (see Appendix B).
In this paper, we interpreted the closed-form formula for same-mean multivariate normal distributions in two different ways:
  • First, as the Fisher–Rao geodesic midpoint of the sided Kullback–Leibler centroids. This interpretation lets us relax the midpoint definition to arbitrary exponential families to define the Jeffreys–Fisher–Rao center (the JFR center of Definition 2);
  • Second, as an inductive $(A, m_F)$ center using a multivariate Gauss-type double sequence, which converges to the Gauss–Bregman center (the GB center of Definition 3). The latter definition yields an extension of Nakamura's arithmetic–harmonic $(A,H)$ mean [41] to an arbitrary $(A, m_F)$ mean, for which we proved convergence under a separability condition in Theorem 3. A convergence proof remains to be established in the general case, although we observed convergence in practice when $F(\theta)$ is the cumulant function of the categorical or multivariate normal families.
In general, the Jeffreys, JFR, and GB centers differ from each other (e.g., in the case of categorical distributions). But for sets of same-mean normal distributions, they all remarkably coincide: this was the point of departure of this research. We reported generic or closed-form formulas for the JFR centers of (a) uni-order parametric families in Section 4.1 (Theorem 4), (b) categorical families in Section 4.2 (Theorem 5), and (c) multivariate normal families in Section 4.3 (Theorem 6). Table 3 summarizes the new results obtained in this paper and gives references to prior work. Notice that in practice, we approximate the Gauss–Bregman center by prescribing a number of iterations $T\in\mathbb{N}$ for the Gauss–Bregman double sequence to obtain $m_{\mathrm{GB}}^{(T)}$. Prescribing the number of GB iterations $T$ allows us to tune the time complexity of computing $m_{\mathrm{GB}}^{(T)}$ while adjusting the quality of the approximation of the Jeffreys centroid.
In applications requiring the Jeffreys centroid, we thus propose to either use the fast Jeffreys–Fisher–Rao center when a closed-form formula is available for the family of distributions at hand or use the Gauss–Bregman center approximation with a prescribed number of iterations as a drop-in replacement of the numerical Jeffreys centroids while keeping the Jeffreys divergence (the centers we defined are not centroids as we do not exhibit distances from which they are population minimizers).
More generally, let us rephrase the results in a purely geometric setting using the framework of information geometry [14]: let $P_1,\ldots,P_n$ be a set of $n$ points weighted by a vector $w\in\Delta_n$ on an $m$-dimensional dually flat space $(M, g, \nabla, \nabla^*)$ with the $\nabla$-affine coordinate system $\theta(\cdot)$ and the dual $\nabla^*$-affine coordinate system $\eta(\cdot)$, where $\nabla$ and $\nabla^*$ are two torsion-free dual affine connections. The Riemannian metric $g$ is a Hessian metric [54], which may be expressed in the $\theta$-coordinate system as $g(\theta) = \nabla^2 F(\theta)$ or in the dual coordinate system as $g(\eta) = \nabla^2 F^*(\eta)$, where $F(\theta)$ and $F^*(\eta)$ are dual convex potential functions related by the Legendre–Fenchel transform [14,54]. Let $\eta_i = \nabla F(\theta_i)$ and $\theta_i = \nabla F^*(\eta_i)$ be the coordinates of point $P_i$ in the $\eta$- and $\theta$-coordinate systems, respectively. An arbitrary point $P$ can be either referenced in the $\theta$-coordinate system ($P = P_\theta$) or in the $\eta$-coordinate system ($P = P_\eta$). Then, the Jeffreys–Fisher–Rao center is defined as the midpoint with respect to the Levi-Civita connection $\bar\nabla = \frac{\nabla+\nabla^*}{2} = \nabla^g$ of $g$:
$C_{\mathrm{JFR}} := \gamma_{\bar\nabla}\!\left(C_{\bar\theta}, C_{\underline\theta}, \tfrac{1}{2}\right) =: C_{\bar\theta}\# C_{\underline\theta}.$
The point $C_{\bar\theta}$ is the centroid with respect to the canonical flat divergence $D(P:Q) = F(\theta(P)) + F^*(\eta(Q)) - \sum_{i=1}^m\theta^i(P)\,\eta_i(Q)$, and the point $C_{\underline\theta}$ is the centroid with respect to the dual canonical flat divergence $D^*(P:Q) := D(Q:P)$. The canonical divergence is expressed using the mixed coordinates $\theta/\eta$ but can also be expressed using the $\theta$-coordinates as an equivalent Bregman divergence $D(P:Q) = B_F(\theta(P):\theta(Q))$ or as a reverse dual Bregman divergence $D(P:Q) = B_{F^*}(\eta(Q):\eta(P))$. This JFR center $C_{\mathrm{JFR}}$ approximates the symmetrized centroid with respect to the canonical symmetrized divergence $S(P,Q) = D(P:Q) + D(Q:P)$ (i.e., the Jeffreys divergence when written using the $\theta$-coordinate system). This symmetrized divergence $S(P,Q)$ can be interpreted as the energy of the Riemannian length element $\mathrm{d}s$ along the primal geodesic $\gamma(t)$ and the dual geodesic $\gamma^*(t)$ (with $\gamma(0) = \gamma^*(0) = P$ and $\gamma(1) = \gamma^*(1) = Q$), see [14]: $S(P,Q) = \int_0^1\mathrm{d}s^2(\gamma(t))\,\mathrm{d}t = \int_0^1\mathrm{d}s^2(\gamma^*(t))\,\mathrm{d}t$. The Riemannian distance $\rho(P,Q)$ corresponds to the Riemannian length element integrated along the Riemannian geodesic $\bar\gamma(t)$ induced by the Levi-Civita connection $\bar\nabla = \frac{\nabla+\nabla^*}{2}$: $\rho(P,Q) = \int_0^1\mathrm{d}s(\bar\gamma(t))\,\mathrm{d}t$.
The inductive Gauss–Bregman center $C_{\mathrm{GB}}$ is obtained as the limit of a sequence of iteratively taking the $\nabla$-midpoints and $\nabla^*$-midpoints with respect to the $\nabla$ and $\nabla^*$ connections. Those midpoints correspond to the right and left centroids $C_{t+1}$ and $C_{t+1}^*$ with respect to $D(\cdot:\cdot)$:
$C_{t+1} = \gamma_{\nabla}\!\left(C_t, C_t^*, \tfrac{1}{2}\right), \qquad C_{t+1}^* = \gamma_{\nabla^*}\!\left(C_t, C_t^*, \tfrac{1}{2}\right),$
initialized with $\theta(C_0) = \sum_{i=1}^n w_i\,\theta(P_i)$ and $\eta(C_0^*) = \sum_{i=1}^n w_i\,\eta(P_i)$. We have $C_0 = \arg\min_{C\in M}\sum_i w_i\, D(P_i:C)$ and $C_0^* = \arg\min_{C\in M}\sum_i w_i\, D^*(P_i:C)$. Figure 10 geometrically illustrates the double sequence of iteratively taking dual geodesic midpoints to converge toward the Gauss–Bregman center $C_{\mathrm{GB}}$. Thus, the GB double sequence can be interpreted as a geometric optimization technique. Figure 11 illustrates the JFR and GB centers on a dually flat space. Notice that $C_{\mathrm{JFR}}$ has coordinates $\mathrm{JFR}_F(\mathcal{P}_\theta; w)$ in the $\theta$-chart and coordinates $\mathrm{JFR}_{F^*}(\mathcal{P}_\eta; w)$ in the $\eta$-chart. Similarly, $C_{\mathrm{GB}}$ has coordinates $\mathrm{GB}_F(\bar\theta,\underline\theta)$ in the $\theta$-chart and coordinates $\mathrm{GB}_{F^*}(\bar\eta,\underline\eta)$ in the $\eta$-chart.
As a final remark, let us emphasize that choosing a proper mean or center depends on the application at hand [55,56]. For example, in Bayesian hypothesis testing, the Chernoff mean [57] is used to upper bound Bayes’ error and has been widely used in information fusion [18] for its empirical robustness [58] in practice. Jeffreys centroid has been successfully used in information retrieval tasks [6].

Funding

This research received no external funding.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The author Frank Nielsen is employed by Sony Computer Science Laboratories Inc. The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest, and declares no conflicts of interest.

Appendix A. Numerical Jeffreys Centroids for Categorical Distributions

Algorithm A1 implements the method described in [31] for numerically finely approximating the Jeffreys centroid of a weighted set of categorical distributions.
Algorithm A1: Numerical approximation of the SKL/Jeffreys centroid for categorical distributions.
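Algorithm A1 is reproduced here only as an image; the following is a hedged sketch of the bisection of [31] §III.B (SciPy; illustrative names; we use the simplex normalization $\sum_j c_j(\lambda) = 1$ as the root-finding criterion, and the bracketing and tolerance choices are ours):

```python
import numpy as np
from scipy.special import lambertw

def jeffreys_centroid_categorical(P, w, tol=1e-12):
    """Numerical Jeffreys centroid of categorical distributions (rows of P, weights w):
    bisect on lambda >= 0 so that the Lambert W formula c(lambda) sums to one."""
    P, w = np.asarray(P, float), np.asarray(w, float)
    a = w @ P                                  # weighted arithmetic mean
    g = np.prod(P ** w[:, None], axis=0)
    g = g / g.sum()                            # normalized weighted geometric mean

    def c(lam):
        return a / lambertw(a / g * np.exp(1.0 + lam), 0).real

    lo, hi = 0.0, 1.0
    while c(hi).sum() > 1.0:                   # c(lambda).sum() decreases with lambda
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if c(mid).sum() > 1.0 else (lo, mid)
    return c(0.5 * (lo + hi))
```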

Appendix B. Closed-Form Formula for the Symmetrized Log Det Centroids

Consider a set $\mathcal{P} = \{P_1,\ldots,P_n\}$ of $n$ symmetric positive-definite matrices of the $d$-dimensional SPD cone $\mathrm{Sym}^{++}(d,\mathbb{R})$ weighted by a vector $w = (w_1,\ldots,w_n)\in\Delta_n$ such that $P_i$ has weight $w_i$ for $i\in\{1,\ldots,n\}$. The log det divergence [59] is a Bregman divergence induced by the strictly convex and differentiable generator $F_{\mathrm{ld}}(X) = -\log\det(X)$ on $\mathrm{Sym}^{++}(d,\mathbb{R})$ equipped with the inner product $\langle X,Y\rangle = \mathrm{tr}(XY)$ for $X,Y\in\mathrm{Sym}(d,\mathbb{R})$:
$D_{\mathrm{ld}}(X:Y) = B_{F_{\mathrm{ld}}}(X:Y) = F_{\mathrm{ld}}(X) - F_{\mathrm{ld}}(Y) - \langle X-Y, \nabla F_{\mathrm{ld}}(Y)\rangle.$
Since $\nabla F_{\mathrm{ld}}(X) = -\nabla\log\det(X) = -\frac{\nabla\det(X)}{\det(X)} = -(X^{-1})^\top$ (hence $\nabla F_{\mathrm{ld}}(X) = -X^{-1}$ for symmetric matrices), it follows that the log det divergence is
$D_{\mathrm{ld}}(X:Y) = \log\det(YX^{-1}) + \mathrm{tr}\!\left((X-Y)Y^{-1}\right) = \mathrm{tr}(XY^{-1}) - \log\det(XY^{-1}) - d,$
using the properties that $\det(X)\det(Y) = \det(XY)$, $\det(X^{-1}) = \frac{1}{\det(X)}$, and $\mathrm{tr}(I) = d$, where $I$ denotes the $d\times d$ identity matrix. When $d = 1$, we recover the Itakura–Saito divergence [45] obtained for $F_{\mathrm{IS}}(x) = -\log x$ (Burg negative entropy) with $F_{\mathrm{IS}}'(x) = -\frac{1}{x}$:
$D_{\mathrm{IS}}(x:y) = B_{F_{\mathrm{IS}}}(x:y) = \frac{x}{y} - \log\frac{x}{y} - 1, \qquad x, y > 0.$
The log det divergence is known in statistics as Stein’s loss [37,38] and has been used for estimating covariance matrices. The log det divergence S ld satisfies the following invariance properties:
  • Inversion invariance: S ld ( X 1 , Y 1 ) = S ld ( X , Y ) ;
  • Congruence invariance: for any invertible matrix $A\in\mathrm{GL}(d)$, we have $S_{\mathrm{ld}}(AXA^\top, AYA^\top) = S_{\mathrm{ld}}(X,Y)$.
The Jeffreys’ symmetrized log det divergence (SLD) is thus
$S_{\mathrm{ld}}(X,Y) = D_{\mathrm{ld}}(X:Y) + D_{\mathrm{ld}}(Y:X) = \mathrm{tr}\!\left((Y^{-1}-X^{-1})(X-Y)\right)$
$= \mathrm{tr}\!\left(X^{-1}Y + Y^{-1}X - 2I\right).$
When $d = 1$, the SLD corresponds to the COSH distance [60] (COSine Hyperbolic distance, the symmetrized Itakura–Saito divergence):
$D_{\mathrm{COSH}}(x:y) = (y-x)\left(\frac{1}{x}-\frac{1}{y}\right) = \frac{x}{y} + \frac{y}{x} - 2.$
Consider a family $\mathcal{N}_\mu = \{p_{\mu,\Sigma_1},\ldots,p_{\mu,\Sigma_n}\}$ of $n$ multivariate normal distributions centered at the same mean $\mu\in\mathbb{R}^d$ with covariance matrices $\Sigma_1,\ldots,\Sigma_n$. The set of same-mean normal distributions forms an exponential family with natural parameter $\theta = \Sigma^{-1}$ (the precision matrix) corresponding to the sufficient statistic $t(x) = -\frac{1}{2}(x-\mu)(x-\mu)^\top$, and log-normalizer $F(\theta) = -\frac{1}{2}\log\det(\theta)$. Thus, the Kullback–Leibler divergence between $p_{\mu,\Sigma_i}$ and $p_{\mu,\Sigma_j}$ corresponds to a log det divergence [16]:
$D_{\mathrm{KL}}[p_{\mu,\Sigma_i} : p_{\mu,\Sigma_j}] = B_F(\theta_j:\theta_i) = \tfrac{1}{2}\, D_{\mathrm{ld}}(\Sigma_j^{-1}:\Sigma_i^{-1}),$
and therefore the Jeffreys divergence $D_J[p_{\mu,\Sigma_i}, p_{\mu,\Sigma_j}]$ corresponds to the matrix COSH/symmetrized log-det divergence:
$D_J[p_{\mu,\Sigma_i}, p_{\mu,\Sigma_j}] = \tfrac{1}{2}\, S_{\mathrm{ld}}(\Sigma_i^{-1},\Sigma_j^{-1}) = \tfrac{1}{2}\,\mathrm{tr}\!\left((\Sigma_i^{-1}-\Sigma_j^{-1})(\Sigma_j-\Sigma_i)\right).$
The left KL centroid corresponds to a right Bregman centroid on the natural parameters (the center of mass of the natural parameters, $\sum_{i=1}^n w_i\Sigma_i^{-1}$), which corresponds to a weighted matrix harmonic mean on the covariance matrices:
$C_L^{\mathrm{KL}} = C_R^{B_F} = \left(\sum_{i=1}^n w_i\Sigma_i^{-1}\right)^{-1}.$
The right KL centroid is a left Bregman centroid (i.e., a quasi-arithmetic mean for $h(X) = X^{-1}$ with $h^{-1}(Y) = Y^{-1}$), whose precision matrix is the inverse of the weighted arithmetic mean of the covariance matrices, i.e., whose covariance matrix is that weighted arithmetic mean:
$C_R^{\mathrm{KL}} = C_L^{B_F} = \sum_{i=1}^n w_i\Sigma_i.$
We state the remarkable case of the closed-form formula for the symmetrized Bregman logdet centroid:
Proposition A1
([29]). The symmetrized log det centroid of a set $\mathcal{P}_w = \{(w_i,P_i)\}$ of $n$ weighted positive-definite matrices is $A\# H$, where $A = \sum_i w_i P_i$ and $H = \left(\sum_i w_i P_i^{-1}\right)^{-1}$ are the weighted arithmetic and harmonic means and $A\# B$ is the matrix geometric mean.
Since the proof was only briefly sketched in [29], we report a full-length proof for the sake of completeness:
Proof. 
We have
$\min_X\ \sum_i w_i\, S_{\mathrm{ld}}(X, P_i) \equiv \min_X\ \mathrm{tr}\!\left(X^{-1}A + H^{-1}X\right).$
Setting the gradient of the right-hand-side term to zero using matrix calculus [61] yields
X tr X 1 A + H 1 X = tr X ( X 1 A + H 1 X ) = 0 .
Using the matrix calculus property $\mathrm{d}(X^{-1}) = -X^{-1}(\mathrm{d}X)X^{-1}$ applied to $\mathrm{tr}(X^{-1}A)$, together with $\nabla_X\,\mathrm{tr}(H^{-1}X) = H^{-1}$, we obtain
$$X^{-1}AX^{-1} - H^{-1} = 0.$$
That is, we need to solve the following Riccati equation:
$$X^{-1}AX^{-1} = H^{-1}.$$
The well-known Riccati equation $XA^{-1}X = B$ has solution [40] $X = A \# B$, and therefore we obtain
$$X^{-1} = A^{-1} \# H^{-1}.$$
Finally, we use the invariance property of the geometric mean under matrix inversion, $(A^{-1} \# H^{-1})^{-1} = A \# H$, to obtain the result $C_{S_{\mathrm{ld}}} = A \# H$. □
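The proposition can be checked numerically. The sketch below is a toy verification written for this exposition (geomean and sld are our helper names, not the paper's code): it draws random weighted SPD matrices, forms $A \# H$, and verifies both the Riccati stationarity condition used in the proof and that the weighted SLD objective at $A \# H$ is no larger than at the sided means $A$ and $H$.

```python
import numpy as np
from scipy.linalg import sqrtm

def geomean(X, Y):
    """Matrix geometric mean X # Y = X^{1/2} (X^{-1/2} Y X^{-1/2})^{1/2} X^{1/2}."""
    Xh = sqrtm(X)
    Xmh = np.linalg.inv(Xh)
    return np.real(Xh @ sqrtm(Xmh @ Y @ Xmh) @ Xh)

def sld(X, Y):
    """Symmetrized log det divergence tr(X^{-1} Y + Y^{-1} X - 2 I)."""
    d = X.shape[0]
    return np.trace(np.linalg.inv(X) @ Y + np.linalg.inv(Y) @ X) - 2 * d

rng = np.random.default_rng(2)
n, d = 4, 3
mats = []
for _ in range(n):
    B = rng.standard_normal((d, d))
    mats.append(B @ B.T + d * np.eye(d))          # random well-conditioned SPD matrices
w = rng.dirichlet(np.ones(n))                      # random weight vector on the simplex

A = sum(wi * Pi for wi, Pi in zip(w, mats))                                 # arithmetic mean
H = np.linalg.inv(sum(wi * np.linalg.inv(Pi) for wi, Pi in zip(w, mats)))   # harmonic mean
X = geomean(A, H)                                                           # candidate centroid A # H

# Stationarity: X solves the Riccati equation X^{-1} A X^{-1} = H^{-1}.
Xi = np.linalg.inv(X)
print(np.allclose(Xi @ A @ Xi, np.linalg.inv(H)))

# The (convex) weighted SLD objective at A # H is no larger than at the sided means.
obj = lambda C: sum(wi * sld(C, Pi) for wi, Pi in zip(w, mats))
print(obj(X) <= min(obj(A), obj(H)))
```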
The Riemannian Hessian metric $g(\theta)$ induced by $F(\theta) = -\frac{1}{2}\log\det(\theta)$ is
$$g_\theta(S_1, S_2) = \mathrm{tr}\big(\theta^{-1} S_1\, \theta^{-1} S_2\big),$$
where S 1 and S 2 are two symmetric matrices of the tangent space T θ at θ . The metric tensor g is commonly called the trace metric or Affine-Invariant Riemannian Metric (AIRM) [62].
It follows that the Riemannian geodesic midpoint is the matrix geometric mean [63] given by
$$X \# Y = X^{\frac{1}{2}}\big(X^{-\frac{1}{2}}\, Y\, X^{-\frac{1}{2}}\big)^{\frac{1}{2}} X^{\frac{1}{2}}.$$
We have ρ ( X , X # Y ) = ρ ( X # Y , Y ) , where ρ ( · , · ) denotes the geodesic length distance on the Riemannian manifold. The geodesic length is given by the following formula [64,65]:
$$\rho(P_1, P_2) = \Big\|\log\big(P_1^{-\frac{1}{2}} P_2 P_1^{-\frac{1}{2}}\big)\Big\|_F = \sqrt{\sum_{i=1}^d \log^2 \lambda_i\big(P_1^{-\frac{1}{2}} P_2 P_1^{-\frac{1}{2}}\big)},$$
where the $\lambda_i(X)$'s denote the eigenvalues of $X$ (equivalently, the generalized eigenvalues of the matrix pencil $(P_2, P_1)$).
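A short SciPy sketch (the helper names airm_dist and geomean are ours, not the paper's implementation) checks numerically that the matrix geometric mean $X \# Y$ is the Riemannian midpoint for the trace metric, and that the distance above can be read off the generalized eigenvalues of the pencil $(P_2, P_1)$.

```python
import numpy as np
from scipy.linalg import sqrtm, logm, eigh

def geomean(X, Y):
    """Matrix geometric mean X # Y = X^{1/2} (X^{-1/2} Y X^{-1/2})^{1/2} X^{1/2}."""
    Xh = sqrtm(X)
    Xmh = np.linalg.inv(Xh)
    return np.real(Xh @ sqrtm(Xmh @ Y @ Xmh) @ Xh)

def airm_dist(P1, P2):
    """Trace-metric (AIRM) geodesic distance: Frobenius norm of log(P1^{-1/2} P2 P1^{-1/2})."""
    P1mh = np.linalg.inv(sqrtm(P1))
    return np.linalg.norm(logm(P1mh @ P2 @ P1mh), ord='fro')

rng = np.random.default_rng(3)
B1 = rng.standard_normal((3, 3)); X = B1 @ B1.T + 3 * np.eye(3)
B2 = rng.standard_normal((3, 3)); Y = B2 @ B2.T + 3 * np.eye(3)
M = geomean(X, Y)

# X # Y is the Riemannian midpoint: equidistant from X and Y, at half the total distance.
print(np.isclose(airm_dist(X, M), airm_dist(M, Y)))
print(np.isclose(airm_dist(X, M), 0.5 * airm_dist(X, Y)))

# The distance can also be computed from the generalized eigenvalues of the pencil (Y, X).
lam = eigh(Y, X, eigvals_only=True)
print(np.isclose(airm_dist(X, Y), np.sqrt(np.sum(np.log(lam) ** 2))))
```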
We state the theorem geometrically characterizing the Jeffreys centroid of a weighted set of centered multivariate normal distributions.
Theorem A1
(The Jeffreys centroid of n weighted centered multivariate normal distributions). The Jeffreys centroid $C_S$ of a weighted set $\{p_{\mu,\Sigma_i}\}$ of centered normal distributions $N(\mu, \Sigma_i)$ with weight vector $w \in \Delta_n$ corresponds to the midpoint of the Fisher–Rao geodesic linking the left and right SKL centroids:
$$C_S = \Big(\sum_{i=1}^n w_i \Sigma_i\Big) \# \Big(\sum_{i=1}^n w_i \Sigma_i^{-1}\Big)^{-1},$$
where X # Y is the geometric matrix mean:
$$X \# Y = X^{\frac{1}{2}}\big(X^{-\frac{1}{2}}\, Y\, X^{-\frac{1}{2}}\big)^{\frac{1}{2}} X^{\frac{1}{2}}.$$
This result first appeared in [29] (Lemma 17.4.3, item 3) and also appeared in an indirect but more general form in [66] (Theorem 5.3). Indeed, in [66], the authors define the regularized symmetric log det divergence as follows:
$$S_{\mathrm{ld}}^{\epsilon}(X, Y) = \mathrm{tr}\Big((X - Y)\big((Y + \epsilon I)^{-1} - (X + \epsilon I)^{-1}\big)\Big), \qquad \epsilon > 0.$$
This extended definition of the symmetrized log det divergence allows one to consider degenerate positive semi-definite matrices.

Appendix C. Fisher–Rao Midpoint for Multivariate Normal Distributions

The expression of the Fisher–Rao geodesics passing through two given d-variate normal distributions (MVNs) was elucidated in [32]. We give below the method for finding the Fisher–Rao MVN midpoints, without the underlying geometric explanation that relies on a Riemannian submersion in dimension $2d+1$ [32]. The Python software library pyBregMan [52] provides an implementation of these Fisher–Rao MVN midpoints.
Fisher–Rao geodesic midpoint $N = N(\mu, \Sigma)$ of $N_0 = N(\mu_0, \Sigma_0)$ and $N_1 = N(\mu_1, \Sigma_1)$:
  • For $i \in \{0, 1\}$, let $G_i = M_i D_i M_i^\top$, where
$$D_i = \begin{pmatrix} \Sigma_i^{-1} & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & \Sigma_i \end{pmatrix}, \qquad M_i = \begin{pmatrix} I_d & 0 & 0 \\ \mu_i^\top & 1 & 0 \\ 0 & \mu_i & I_d \end{pmatrix},$$
    where $I_d$ denotes the identity matrix of shape $d \times d$. That is, the matrices $G_0, G_1 \in \mathrm{Sym}^{++}(2d+1, \mathbb{R})$ can be expressed by block Cholesky-style factorizations.
  • Consider the Riemannian geodesic midpoint $G$ in $\mathrm{Sym}^{++}(2d+1, \mathbb{R})$ with respect to the trace metric:
$$G = G_0^{\frac{1}{2}}\big(G_0^{-\frac{1}{2}}\, G_1\, G_0^{-\frac{1}{2}}\big)^{\frac{1}{2}} G_0^{\frac{1}{2}}.$$
    In order to compute a matrix power $G^p$ for $p \in \mathbb{R}$, we first calculate the Singular Value Decomposition (SVD) of $G$ (which, for a symmetric positive-definite matrix, coincides with its eigendecomposition): $G = O L O^\top$, where $O$ is an orthogonal matrix and $L = \mathrm{diag}(\lambda_1, \ldots, \lambda_{2d+1})$ is a diagonal matrix. We then obtain the matrix power as $G^p = O L^p O^\top$ with $L^p = \mathrm{diag}(\lambda_1^p, \ldots, \lambda_{2d+1}^p)$.
  • Retrieve $N = N(\mu, \Sigma)$ from the matrix $G$:
$$\Sigma = [G]_{1:d,\,1:d}^{-1}, \qquad \mu = \Sigma\,[G]_{1:d,\,d+1},$$
    where $[G]_{1:d,\,1:d}$ denotes the block matrix with rows and columns ranging from one to $d$ extracted from the $(2d+1) \times (2d+1)$ matrix $G$, and $[G]_{1:d,\,d+1}$ is similarly the vector of $\mathbb{R}^d$ formed by the first $d$ entries of column $d+1$ of $G$.
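A minimal NumPy sketch of the three-step procedure above is given below (the helper names embed, spd_power, and fisher_rao_midpoint are ours; the pyBregMan library [52] provides a reference implementation). Matrix powers are computed via a symmetric eigendecomposition, which coincides with the SVD for SPD matrices.

```python
import numpy as np

def spd_power(G, p):
    """Matrix power G^p of a symmetric positive-definite matrix via its eigendecomposition."""
    lam, O = np.linalg.eigh(G)
    return (O * lam ** p) @ O.T

def embed(mu, Sigma):
    """Map N(mu, Sigma) to the (2d+1) x (2d+1) SPD matrix G = M D M^T of the procedure above."""
    d = mu.shape[0]
    D = np.zeros((2 * d + 1, 2 * d + 1))
    D[:d, :d] = np.linalg.inv(Sigma)   # Sigma^{-1} block
    D[d, d] = 1.0
    D[d + 1:, d + 1:] = Sigma          # Sigma block
    M = np.eye(2 * d + 1)
    M[d, :d] = mu                      # row mu^T
    M[d + 1:, d] = mu                  # column mu
    return M @ D @ M.T

def fisher_rao_midpoint(mu0, S0, mu1, S1):
    """Fisher-Rao geodesic midpoint of N(mu0, S0) and N(mu1, S1), following the three steps above."""
    d = mu0.shape[0]
    G0, G1 = embed(mu0, S0), embed(mu1, S1)
    G0h, G0mh = spd_power(G0, 0.5), spd_power(G0, -0.5)
    G = G0h @ spd_power(G0mh @ G1 @ G0mh, 0.5) @ G0h   # trace-metric midpoint in Sym++(2d+1)
    Sigma = np.linalg.inv(G[:d, :d])
    mu = Sigma @ G[:d, d]
    return mu, Sigma

mu0, S0 = np.zeros(2), np.array([[1.0, 0.2], [0.2, 0.5]])
mu1, S1 = np.array([1.0, -1.0]), np.array([[2.0, -0.3], [-0.3, 1.5]])
mu, Sigma = fisher_rao_midpoint(mu0, S0, mu1, S1)
print(mu)
print(Sigma)
```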

References

  1. Jeffreys, H. The Theory of Probability; OUP Oxford: Oxford, UK, 1998. [Google Scholar]
  2. Ben-Tal, A.; Charnes, A.; Teboulle, M. Entropic means. J. Math. Anal. Appl. 1989, 139, 537–551. [Google Scholar] [CrossRef]
  3. Nielsen, F.; Boltz, S. The Burbea-Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory 2011, 57, 5455–5466. [Google Scholar] [CrossRef]
  4. Amari, S.I. Integration of stochastic models by minimizing α-divergence. Neural Comput. 2007, 19, 2780–2796. [Google Scholar] [CrossRef]
  5. Nielsen, F. On a generalization of the Jensen–Shannon divergence and the Jensen–Shannon centroid. Entropy 2020, 22, 221. [Google Scholar] [CrossRef]
  6. Veldhuis, R. The centroid of the symmetrical Kullback-Leibler distance. IEEE Signal Process. Lett. 2002, 9, 96–99. [Google Scholar] [CrossRef]
  7. Nielsen, F. On the Jensen–Shannon symmetrization of distances relying on abstract means. Entropy 2019, 21, 485. [Google Scholar] [CrossRef]
  8. Lin, J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 1991, 37, 145–151. [Google Scholar] [CrossRef]
  9. Johnson, D.H.; Sinanovic, S. Symmetrizing the Kullback-Leibler distance. IEEE Trans. Inf. Theory 2001, 1, 1–10. [Google Scholar]
  10. Fuglede, B.; Topsoe, F. Jensen-Shannon divergence and Hilbert space embedding. In Proceedings of the International Symposium on Information Theory (ISIT), Chicago, IL, USA, 27 June–2 July 2004; IEEE: Piscataway, NJ, USA, 2004; p. 31. [Google Scholar]
  11. Sra, S. Metrics induced by Jensen-Shannon and related divergences on positive definite matrices. Linear Algebra Its Appl. 2021, 616, 125–138. [Google Scholar] [CrossRef]
  12. Vajda, I. On metric divergences of probability measures. Kybernetika 2009, 45, 885–900. [Google Scholar]
  13. Barndorff-Nielsen, O. Information and Exponential Families: In Statistical Theory; John Wiley & Sons: New York, NY, USA, 2014. [Google Scholar]
  14. Amari, S.I. Information Geometry and Its Applications; Applied Mathematical Sciences; Springer: Berlin/Heidelberg, Germany, 2016. [Google Scholar]
  15. Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137. [Google Scholar] [CrossRef]
  16. Davis, J.; Dhillon, I. Differential entropic clustering of multivariate gaussians. Adv. Neural Inf. Process. Syst. 2006, 19, 337–344. [Google Scholar]
  17. Murtagh, F.; Legendre, P. Ward’s hierarchical agglomerative clustering method: Which algorithms implement Ward’s criterion? J. Classif. 2014, 31, 274–295. [Google Scholar] [CrossRef]
  18. Julier, S.; Uhlmann, J.K. General decentralized data fusion with covariance intersection. In Handbook of Multisensor Data Fusion; CRC Press: Boca Raton, FL, USA, 2017; pp. 339–364. [Google Scholar]
  19. Liu, Q.; Ihler, A.T. Distributed estimation, information loss and exponential families. Adv. Neural Inf. Process. Syst. 2014, 27, 1–9. [Google Scholar]
  20. Basseville, M. Divergence measures for statistical data processing: An annotated bibliography. Signal Process. 2013, 93, 621–633. [Google Scholar] [CrossRef]
  21. Chandrasekhar, V.; Takacs, G.; Chen, D.M.; Tsai, S.S.; Reznik, Y.; Grzeszczuk, R.; Girod, B. Compressed histogram of gradients: A low-bitrate descriptor. Int. J. Comput. Vis. 2012, 96, 384–399. [Google Scholar] [CrossRef]
  22. Seal, A.; Karlekar, A.; Krejcar, O.; Gonzalo-Martin, C. Fuzzy c-means clustering using Jeffreys-divergence based similarity measure. Appl. Soft Comput. 2020, 88, 106016. [Google Scholar] [CrossRef]
  23. Vasconcelos, N. On the efficient evaluation of probabilistic similarity functions for image retrieval. IEEE Trans. Inf. Theory 2004, 50, 1482–1496. [Google Scholar] [CrossRef]
  24. Ge, P.; Chen, Y.; Wang, G.; Weng, G. An active contour model driven by adaptive local pre-fitting energy function based on Jeffreys divergence for image segmentation. Expert Syst. Appl. 2022, 210, 118493. [Google Scholar] [CrossRef]
  25. Tabibian, S.; Akbari, A.; Nasersharif, B. Speech enhancement using a wavelet thresholding method based on symmetric Kullback–Leibler divergence. Signal Process. 2015, 106, 184–197. [Google Scholar] [CrossRef]
  26. Zhao, Q.; Zhou, G.; Zhang, L.; Cichocki, A. Tensor-variate Gaussian processes regression and its application to video surveillance. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 5–9 May 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 1265–1269. [Google Scholar]
  27. Welk, M.; Feddern, C.; Burgeth, B.; Weickert, J. Tensor median filtering and M-smoothing. In Visualization and Processing of Tensor Fields; Springer: Berlin/Heidelberg, Germany, 2006; pp. 345–356. [Google Scholar]
  28. Nielsen, F.; Nock, R. Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory 2009, 55, 2882–2904. [Google Scholar] [CrossRef]
  29. Moakher, M.; Batchelor, P.G. Symmetric positive-definite matrices: From geometry to applications and visualization. In Visualization and Processing of Tensor Fields; Springer: Berlin/Heidelberg, Germany, 2006; pp. 285–298. [Google Scholar]
  30. Sturm, K.T. Probability measures on metric spaces of nonpositive. Heat Kernels Anal. Manifolds Graphs Metr. Spaces 2003, 338, 357. [Google Scholar]
  31. Nielsen, F. Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight approximation for frequency histograms. IEEE Signal Process. Lett. 2013, 20, 657–660. [Google Scholar] [CrossRef]
  32. Kobayashi, S. Geodesics of multivariate normal distributions and a Toda lattice type Lax pair. Phys. Scr. 2023, 98, 115241. [Google Scholar] [CrossRef]
  33. Corless, R.M.; Gonnet, G.H.; Hare, D.E.; Jeffrey, D.J.; Knuth, D.E. On the Lambert W function. Adv. Comput. Math. 1996, 5, 329–359. [Google Scholar] [CrossRef]
  34. Rockafellar, R.T. Conjugates and Legendre transforms of convex functions. Can. J. Math. 1967, 19, 200–205. [Google Scholar] [CrossRef]
  35. Bullen, P.S.; Bullen, P. Quasi-arithmetic means. In Handbook of Means and Their Inequalities; Springer: Berlin/Heidelberg, Germany, 2003; pp. 266–320. [Google Scholar]
  36. Nock, R.; Luosto, P.; Kivinen, J. Mixed Bregman clustering with approximation guarantees. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Antwerp, Belgium, 15–19 September 2008; Springer: Berlin/Heidelberg, Germany, 2008; pp. 154–169. [Google Scholar]
  37. James, W.; Stein, C. Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Los Angeles, CA, USA, 20 June–30 July 1960; Volume 1, pp. 361–379. [Google Scholar]
  38. Salehian, H.; Cheng, G.; Vemuri, B.C.; Ho, J. Recursive estimation of the Stein center of SPD matrices and its applications. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 1793–1800. [Google Scholar]
  39. Skovgaard, L.T. A Riemannian geometry of the multivariate normal model. Scand. J. Stat. 1984, 11, 211–223. [Google Scholar]
  40. Bhatia, R. The Riemannian mean of positive matrices. In Matrix Information Geometry; Springer: Berlin/Heidelberg, Germany, 2012; pp. 35–51. [Google Scholar]
  41. Nakamura, Y. Algorithms associated with arithmetic, geometric and harmonic means and integrable systems. J. Comput. Appl. Math. 2001, 131, 161–174. [Google Scholar] [CrossRef]
  42. Nielsen, F. What is… an Inductive Mean? Not. Am. Math. Soc. 2023, 70, 1851–1855. [Google Scholar] [CrossRef]
  43. Lehmer, D.H. On the compounding of certain means. J. Math. Anal. Appl. 1971, 36, 183–200. [Google Scholar] [CrossRef]
  44. Almkvist, G.; Berndt, B. Gauss, Landen, Ramanujan, the arithmetic-geometric mean, ellipses, π, and the Ladies Diary. Am. Math. Mon. 1988, 95, 585–608. [Google Scholar]
  45. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J.; Lafferty, J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749. [Google Scholar]
  46. Miyamoto, H.K.; Meneghetti, F.C.; Pinele, J.; Costa, S.I. On closed-form expressions for the Fisher–Rao distance. Inf. Geom. 2024, 1–44. [Google Scholar] [CrossRef]
  47. Nielsen, F. Approximation and bounding techniques for the Fisher-Rao distances between parametric statistical models; Handbook of Statistics; Elsevier: Amsterdam, The Netherlands, 2024. [Google Scholar] [CrossRef]
  48. Karcher, H. Riemannian center of mass and mollifier smoothing. Commun. Pure Appl. Math. 1977, 30, 509–541. [Google Scholar] [CrossRef]
  49. Čencov, N.N. Algebraic foundation of mathematical statistics. Stat. A J. Theor. Appl. Stat. 1978, 9, 267–276. [Google Scholar]
  50. Calvo, M.; Oller, J.M. A distance between multivariate normal distributions based in an embedding into the Siegel group. J. Multivar. Anal. 1990, 35, 223–242. [Google Scholar] [CrossRef]
  51. Nielsen, F. A simple approximation method for the Fisher–Rao distance between multivariate normal distributions. Entropy 2023, 25, 654. [Google Scholar] [CrossRef]
  52. Nielsen, F.; Soen, A. pyBregMan: A Python library for Bregman Manifolds. arXiv 2024, arXiv:2408.04175. [Google Scholar]
  53. Nielsen, F. Divergences Induced by the Cumulant and Partition Functions of Exponential Families and Their Deformations Induced by Comparative Convexity. Entropy 2024, 26, 193. [Google Scholar] [CrossRef]
  54. Shima, H. The Geometry of Hessian Structures; World Scientific: Singapore, 2007. [Google Scholar]
  55. De Carvalho, M. Mean, what do you Mean? Am. Stat. 2016, 70, 270–274. [Google Scholar] [CrossRef]
  56. Bullen, P.S. Handbook of Means and Their Inequalities; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013; Volume 560. [Google Scholar]
  57. Chernoff, H. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Stat. 1952, 23, 493–507. [Google Scholar] [CrossRef]
  58. Nielsen, F. Revisiting Chernoff information with likelihood ratio exponential families. Entropy 2022, 24, 1400. [Google Scholar] [CrossRef] [PubMed]
  59. Kulis, B.; Sustik, M.; Dhillon, I. Learning low-rank kernel matrices. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 505–512. [Google Scholar]
  60. Gray, A.; Markel, J. Distance measures for speech processing. IEEE Trans. Acoust. Speech, Signal Process. 1976, 24, 380–391. [Google Scholar] [CrossRef]
  61. Petersen, K.B.; Pedersen, M.S. The matrix cookbook. Tech. Univ. Den. 2008, 7, 510. [Google Scholar]
  62. Thanwerdas, Y.; Pennec, X. O(n)-invariant Riemannian metrics on SPD matrices. Linear Algebra Its Appl. 2023, 661, 163–201. [Google Scholar] [CrossRef]
  63. Bhatia, R.; Holbrook, J. Riemannian geometry and matrix geometric means. Linear Algebra Its Appl. 2006, 413, 594–618. [Google Scholar] [CrossRef]
  64. Siegel, C. Symplectic geometry. Am. J. Math. 1964, 65, 1–86. [Google Scholar] [CrossRef]
  65. James, A.T. The variance information manifold and the functions on it. In Multivariate Analysis–III; Elsevier: Amsterdam, The Netherlands, 1973; pp. 157–169. [Google Scholar]
  66. Kim, S.; Lawson, J.; Lim, Y. The matrix geometric mean of parameterized, weighted arithmetic and harmonic means. Linear Algebra Its Appl. 2011, 435, 2114–2131. [Google Scholar] [CrossRef]
Figure 1. Application of centroids and centers in signal processing. (Left): information fusion and mixture model simplification: a 2D Gaussian mixture model (GMM) is simplified into a single bivariate normal distribution. (Right): distributed estimation: a data set is split among $p$ processes $P_i$, which first estimate the statistical model parameters $\hat{\theta}_i$; the process models are then aggregated to yield a single consolidated model $\hat{\theta}$.
Figure 2. Visualizing the arithmetic, normalized geometric, numerical Jeffreys, Jeffreys–Fisher–Rao, and Gauss–Bregman centroids/centers in red, blue, green, purple, and yellow, respectively. (Left): the input set consists of n = 2 trinomial distributions (black) with parameters chosen randomly. (Right): the input set consists of n = 32 trinomial distributions (black) with parameters (1/2, 1/2) and (0.99, 0.005, 0.005). The numerical Jeffreys centroid (green) is time consuming to calculate using the Lambert W function. However, the Jeffreys centroid can be well approximated by either the Jeffreys–Fisher–Rao center (purple) or the inductive Gauss–Bregman center (yellow). The point centers are visualized with different radii in order to distinguish them easily.
Figure 3. (Left): Displaying the arithmetic, normalized geometric, numerical Jeffreys, Jeffreys–Fisher–Rao, and Gauss–Bregman centroids/centers in red, blue, green, purple, and yellow, respectively. The input set consists of two normalized histograms with d = 256 bins plotted as polylines with 255 line segments (black). Observe that the Jeffreys–Fisher–Rao center (purple) and the Gauss–Bregman center (yellow) approximate well the Jeffreys centroid (green), which is more computationally expensive to calculate. (Right): close-up window on the first (left) bins of the normalized histograms.
Figure 4. Geometric illustration of the double sequence inducing a Gauss–Bregman center in the limit.
Figure 5. Illustration of the double sequence convergence for scalar Gauss–Bregman ( A , m F ) mean.
Figure 6. Visualization of the Jeffreys–Fisher–Rao center and the Gauss–Bregman center of two univariate normal distributions (black circles). The exponential and mixture geodesics are shown in red and blue, respectively, with their corresponding midpoints. The Jeffreys–Fisher–Rao center is the Fisher–Rao midpoint (green) lying on the Fisher–Rao geodesic (purple). The inductive Gauss–Bregman center is displayed in yellow at double size in order to ease its comparison with the Jeffreys–Fisher–Rao center.
Figure 7. Centroids and centers between a pair of bivariate normal distributions (black). Each normal distribution $N(\mu, \Sigma)$ (parameterized by a 5D parameter $\theta$) is displayed as a 2D ellipsoid $E(\mu, \Sigma) = \{x : (x - \mu)^\top \Sigma^{-1} (x - \mu) = l\}$ for a prescribed level $l > 0$ in the sample space $\mathbb{R}^2$. Blue, red, purple, yellow, and green ellipsoids correspond to the m-geodesic midpoint, e-geodesic midpoint, Jeffreys–Fisher–Rao midpoint, Gauss–Bregman inductive mean, and numerical Jeffreys centroid (symmetrized Bregman centroid), respectively.
Figure 8. Centroids and centers between a pair of bivariate centered normal distributions (black). Each normal distribution $N(\mu, \Sigma)$ with a prescribed mean $\mu$ (parameterized by a 3D parameter $\theta$) is displayed as a 2D ellipsoid. The red and blue ellipsoids correspond to the e-geodesic and m-geodesic midpoints, respectively. The green ellipsoid is the exact Jeffreys centroid, which coincides with the inductive Gauss–Bregman center (yellow) and the Jeffreys–Fisher–Rao center (purple); these three green, yellow, and purple matching ellipsoids are thus rendered superposed in an overall shade of brown.
Figure 9. Centroids and centers between a pair of bivariate normal distributions with a prescribed covariance matrix $\Sigma$ (black). Each normal distribution $N(\mu, \Sigma)$ (parameterized by a 2D parameter $\theta$) is displayed as a 2D ellipsoid. The red and blue ellipsoids correspond to the e-geodesic and m-geodesic midpoints, respectively. The inductive Gauss–Bregman center (yellow) and the Jeffreys–Fisher–Rao center (purple) do not coincide.
Figure 10. Illustration on a dually flat space of the double sequence inducing a Gauss–Bregman center in the limit.
Figure 11. Illustration of the Jeffreys–Fisher–Rao and Gauss–Bregman centers in a dually flat space. $\gamma$ denotes the Riemannian geodesic.
Table 1. Experiments for JFR and GB centers approximating the numerical Jeffreys centroid.
| dim. | JFR Avg Info ε | JFR Max Info ε | JFR Avg TV | JFR Max TV | JFR Avg Time | JFR ×Speed | GB Avg Info ε | GB Max Info ε | GB Avg TV | GB Max TV | GB Avg Time | GB ×Speed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| d = 2 | 5.662 × 10⁻⁶ | 6.386 × 10⁻³ | 8.735 × 10⁻⁵ | 5.005 × 10⁻² | 1.614 × 10⁻⁷ | 82.541 | 1.507 × 10⁻⁴ | 9.745 × 10⁻² | 6.304 × 10⁻⁴ | 5.005 × 10⁻² | 5.072 × 10⁻⁷ | 26.258 |
| d = 4 | 1.283 × 10⁻⁵ | 5.294 × 10⁻³ | 1.690 × 10⁻⁴ | 3.969 × 10⁻² | 1.418 × 10⁻⁷ | 182.309 | 4.696 × 10⁻⁴ | 7.695 × 10⁻² | 1.431 × 10⁻³ | 3.969 × 10⁻² | 1.623 × 10⁻⁷ | 159.304 |
| d = 8 | 2.766 × 10⁻⁵ | 6.970 × 10⁻³ | 2.210 × 10⁻⁴ | 3.470 × 10⁻² | 1.772 × 10⁻⁷ | 292.125 | 1.011 × 10⁻³ | 9.677 × 10⁻² | 2.033 × 10⁻³ | 3.470 × 10⁻² | 1.955 × 10⁻⁷ | 264.680 |
| d = 16 | 3.531 × 10⁻⁵ | 8.544 × 10⁻³ | 2.325 × 10⁻⁴ | 2.450 × 10⁻² | 6.318 × 10⁻⁷ | 224.370 | 1.388 × 10⁻³ | 9.231 × 10⁻² | 2.275 × 10⁻³ | 2.450 × 10⁻² | 7.208 × 10⁻⁷ | 196.660 |
| d = 32 | 4.123 × 10⁻⁵ | 5.242 × 10⁻³ | 2.457 × 10⁻⁴ | 1.230 × 10⁻² | 4.811 × 10⁻⁷ | 462.754 | 1.674 × 10⁻³ | 5.398 × 10⁻² | 2.449 × 10⁻³ | 1.230 × 10⁻² | 5.457 × 10⁻⁷ | 408.007 |
| d = 64 | 4.747 × 10⁻⁵ | 3.437 × 10⁻³ | 2.486 × 10⁻⁴ | 9.756 × 10⁻³ | 9.789 × 10⁻⁷ | 578.354 | 1.863 × 10⁻³ | 3.685 × 10⁻² | 2.498 × 10⁻³ | 9.756 × 10⁻³ | 1.160 × 10⁻⁶ | 488.246 |
| d = 128 | 5.020 × 10⁻⁵ | 2.540 × 10⁻³ | 2.491 × 10⁻⁴ | 6.580 × 10⁻³ | 5.874 × 10⁻⁶ | 477.412 | 1.937 × 10⁻³ | 2.374 × 10⁻² | 2.522 × 10⁻³ | 6.580 × 10⁻³ | 6.605 × 10⁻⁶ | 424.609 |
| d = 256 | 4.735 × 10⁻⁵ | 1.410 × 10⁻³ | 2.476 × 10⁻⁴ | 4.855 × 10⁻³ | 9.349 × 10⁻⁶ | 528.452 | 1.914 × 10⁻³ | 1.521 × 10⁻² | 2.529 × 10⁻³ | 4.855 × 10⁻³ | 1.110 × 10⁻⁵ | 445.304 |
Table 2. Experiments for JFR and GB centers approximating the numerical Jeffreys centroid for the following setting of two normalized histograms of 3 bins: (1/3, 1/3, 1/3) and (1 − α, α/2, α/2).
| α | JFR Info. ε | JFR TV ε | JFR Avg Time | JFR ×Speed | GB Info. ε | GB TV ε | GB Avg Time | GB ×Speed |
|---|---|---|---|---|---|---|---|---|
| 1.000 × 10⁻¹ | 6.882 × 10⁻⁹ | 2.495 × 10⁻⁵ | 1.767 × 10⁻⁷ | 125.960 | 1.338 × 10⁻⁶ | 3.480 × 10⁻⁴ | 2.334 × 10⁻⁷ | 95.356 |
| 1.000 × 10⁻² | 2.607 × 10⁻⁵ | 1.722 × 10⁻³ | 1.371 × 10⁻⁷ | 167.932 | 1.061 × 10⁻³ | 1.108 × 10⁻² | 1.565 × 10⁻⁷ | 147.104 |
| 1.000 × 10⁻³ | 6.262 × 10⁻⁴ | 7.530 × 10⁻³ | 1.033 × 10⁻⁷ | 218.450 | 1.272 × 10⁻² | 3.534 × 10⁻² | 1.208 × 10⁻⁷ | 186.698 |
| 1.000 × 10⁻⁴ | 3.632 × 10⁻³ | 1.570 × 10⁻² | 1.171 × 10⁻⁷ | 193.345 | 4.580 × 10⁻² | 6.065 × 10⁻² | 1.367 × 10⁻⁷ | 165.571 |
| 1.000 × 10⁻⁵ | 1.121 × 10⁻² | 2.419 × 10⁻² | 1.546 × 10⁻⁷ | 150.807 | 7.322 × 10⁻³ | 1.929 × 10⁻² | 2.834 × 10⁻⁷ | 82.261 |
| 1.000 × 10⁻⁶ | 2.457 × 10⁻² | 3.204 × 10⁻² | 1.619 × 10⁻⁷ | 141.896 | 1.655 × 10⁻² | 2.579 × 10⁻² | 2.512 × 10⁻⁷ | 91.467 |
| 1.000 × 10⁻⁷ | 4.375 × 10⁻² | 3.897 × 10⁻² | 1.357 × 10⁻⁷ | 170.065 | 3.065 × 10⁻² | 3.183 × 10⁻² | 2.131 × 10⁻⁷ | 108.314 |
| 1.000 × 10⁻⁸ | 6.806 × 10⁻² | 4.492 × 10⁻² | 1.315 × 10⁻⁷ | 173.698 | 4.948 × 10⁻² | 3.725 × 10⁻² | 2.017 × 10⁻⁷ | 113.292 |
| 1.000 × 10⁻⁹ | 9.651 × 10⁻² | 4.999 × 10⁻² | 1.125 × 10⁻⁷ | 208.627 | 7.240 × 10⁻² | 4.199 × 10⁻² | 1.590 × 10⁻⁷ | 147.610 |
| 1.000 × 10⁻¹⁰ | 1.281 × 10⁻¹ | 5.428 × 10⁻² | 8.366 × 10⁻⁸ | 242.967 | 9.862 × 10⁻² | 4.610 × 10⁻² | 1.111 × 10⁻⁷ | 183.000 |
| 1.000 × 10⁻¹¹ | 1.621 × 10⁻¹ | 5.792 × 10⁻² | 1.066 × 10⁻⁷ | 215.817 | 1.274 × 10⁻¹ | 4.963 × 10⁻² | 1.325 × 10⁻⁷ | 173.614 |
| 1.000 × 10⁻¹² | 1.979 × 10⁻¹ | 6.100 × 10⁻² | 1.028 × 10⁻⁷ | 229.484 | 1.580 × 10⁻¹ | 5.266 × 10⁻² | 1.329 × 10⁻⁷ | 177.581 |
| 1.000 × 10⁻¹³ | 2.348 × 10⁻¹ | 6.363 × 10⁻² | 9.541 × 10⁻⁸ | 244.587 | 1.901 × 10⁻¹ | 5.526 × 10⁻² | 1.255 × 10⁻⁷ | 185.940 |
| 1.000 × 10⁻¹⁴ | 2.727 × 10⁻¹ | 6.589 × 10⁻² | 1.062 × 10⁻⁷ | 219.787 | 2.231 × 10⁻¹ | 5.750 × 10⁻² | 1.361 × 10⁻⁷ | 171.456 |
| 1.000 × 10⁻¹⁵ | 3.112 × 10⁻¹ | 6.784 × 10⁻² | 9.043 × 10⁻⁸ | 248.688 | 2.570 × 10⁻¹ | 5.943 × 10⁻² | 1.322 × 10⁻⁷ | 170.122 |
| 1.000 × 10⁻¹⁶ | 3.483 × 10⁻¹ | 6.946 × 10⁻² | 8.857 × 10⁻⁸ | 267.219 | 2.897 × 10⁻¹ | 6.105 × 10⁻² | 1.438 × 10⁻⁷ | 164.535 |
Table 3. Summary of the results: Δ indicates a generic formula, √ a closed-form formula, and × no known formula.
| Family | Jeffreys | Jeffreys–Fisher–Rao | Gauss–Bregman |
|---|---|---|---|
| Exponential family | Equation (2) | Definition 2 | Definition 3 |
| One-dimensional exponential family | × | Δ (Theorem 4) | × [43] |
| Categorical family | Δ [31] | √ (Theorem 5) | × (Theorem 3) |
| Normal family | × [28] | √ (Theorem 6) | × (Theorem 3) |
| Centered normal family | √ [29] | √ [29] | √ [41] |