Article

Fast Proxy Centers for the Jeffreys Centroid: The Jeffreys–Fisher–Rao Center and the Gauss–Bregman Inductive Center

Sony Computer Science Laboratories, Tokyo 141-0022, Japan
Entropy 2024, 26(12), 1008; https://doi.org/10.3390/e26121008
Submission received: 18 October 2024 / Revised: 13 November 2024 / Accepted: 21 November 2024 / Published: 22 November 2024
(This article belongs to the Special Issue Information Theory in Emerging Machine Learning Techniques)

Abstract

The symmetric Kullback–Leibler centroid, also called the Jeffreys centroid, of a set of mutually absolutely continuous probability distributions on a measure space provides a notion of centrality which has proven useful in many tasks, including information retrieval, information fusion, and clustering. However, the Jeffreys centroid is not available in closed form for sets of categorical or multivariate normal distributions, two widely used statistical models, and thus needs to be approximated numerically in practice. In this paper, we first propose the new Jeffreys–Fisher–Rao center, defined as the Fisher–Rao midpoint of the sided Kullback–Leibler centroids, as a plug-in replacement of the Jeffreys centroid. This Jeffreys–Fisher–Rao center admits a generic formula for uni-parameter exponential family distributions and a closed-form formula for categorical and multivariate normal distributions; it matches exactly the Jeffreys centroid for same-mean normal distributions and is observed experimentally to be close to the Jeffreys centroid in practice. Second, we define a new type of inductive center generalizing the principle of the Gauss arithmetic–geometric double sequence mean for pairs of densities of any given exponential family. This new Gauss–Bregman center is shown experimentally to approximate the Jeffreys centroid very well and is suggested as a replacement for the Jeffreys centroid when the Jeffreys–Fisher–Rao center is not available in closed form. Furthermore, this inductive center always converges and matches the Jeffreys centroid for sets of same-mean normal distributions. We report on our experiments, which first demonstrate how well the closed-form formula of the Jeffreys–Fisher–Rao center for categorical distributions approximates the costly numerical Jeffreys centroid, which relies on the Lambert W function, and second show the fast convergence of the Gauss–Bregman double sequences, which can closely approximate the Jeffreys centroid when truncated to the first few iterations. Finally, we conclude this work by reinterpreting these fast proxy Jeffreys–Fisher–Rao and Gauss–Bregman centers of Jeffreys centroids through the lens of dually flat spaces in information geometry.

1. Introduction

Let $(\mathcal{X},\mathcal{F})$ be a measurable space with sample space $\mathcal{X}$ and $\sigma$-algebra of events $\mathcal{F}$, and $\mu$ a positive measure. We consider a finite set $\{P_1,\ldots,P_n\}$ of $n$ probability distributions all dominated by $\mu$ and weighted by a vector $w$ belonging to the open standard simplex $\Delta_n = \{x = (x_1,\ldots,x_n) : x_1 > 0, \ldots, x_n > 0, \sum_{i=1}^n x_i = 1\} \subset \mathbb{R}^n$. Let $\mathcal{P} = \{p_1,\ldots,p_n\}$ be the Radon–Nikodym densities of $P_1,\ldots,P_n$ with respect to $\mu$, i.e., $p_i = \frac{\mathrm{d}P_i}{\mathrm{d}\mu}$.
The Kullback–Leibler divergence (KLD) between two densities $p(x)$ and $q(x)$ is defined by $D_{\mathrm{KL}}(p:q) = \int p(x)\log\frac{p(x)}{q(x)}\,\mathrm{d}\mu(x)$. The KLD is asymmetric: $D_{\mathrm{KL}}(p:q) \neq D_{\mathrm{KL}}(q:p)$. We use the argument delimiter ':' as a notation to indicate this asymmetry. The Jeffreys divergence [1] symmetrizes the KLD as follows:
$D_J(p,q) = D_{\mathrm{KL}}(p:q) + D_{\mathrm{KL}}(q:p) = \int_{\mathcal{X}} (p(x) - q(x)) \log\frac{p(x)}{q(x)}\,\mathrm{d}\mu(x).$
In general, the D-barycenter C D of P with respect to a statistical dissimilarity measure D ( · : · ) yields a notion of centrality C R defined by the following optimization problem:
$c_R = \arg\min_{p} \sum_{i=1}^n w_i\, D(p_i : p).$
Here, the upper case letter 'R' indicates that the optimization defining the D-barycenter is carried out on the right argument. When $w = (\frac{1}{n},\ldots,\frac{1}{n})$ is the uniform weight vector, the D-barycenter is called the D-centroid. We shall loosely call centroids barycenters in the remainder even when the weight vector is not uniform. Centroids with respect to information-theoretic measures have been studied in the literature.
Let us mention some examples of centroids: the entropic centroids [2] (i.e., Bregman centroids and f-divergence centroids), the Burbea–Rao and Bhattacharyya centroids [3], the $\alpha$-centroids with respect to $\alpha$-divergences [4], the Jensen–Shannon centroids [5], etc.
The $D_J$-centroid is also called the symmetric Kullback–Leibler (SKL) divergence centroid [6] in the literature. However, since there are many possible symmetrizations of the KLD [7], like the Jensen–Shannon divergence [8] or the resistor KLD [9], we prefer to use the term Jeffreys centroid instead of SKL centroid to avoid any possible ambiguity on the underlying divergence. Notice that the square root of the Jensen–Shannon divergence is a metric distance [10,11], whereas no power $D_J^\alpha$ of the Jeffreys divergence $D_J$ with $\alpha > 0$ yields a metric distance [12].
This paper considers the Jeffreys centroids of a finite weighted set of densities P = { p θ 1 , , p θ n } belonging to some prescribed exponential family [13] E :
$c = \arg\min_{p} \sum_{i=1}^n w_i\, D_J(p_{\theta_i}, p).$
In particular, we are interested in computing the Jeffreys centroids for sets of categorical distributions or sets of multivariate normal distributions [14].
In general, centroids are used in k-means [15,16]-type clustering, hierarchical clustering (e.g., Ward criterion [17]), and information fusion tasks [18] (related to distributed model estimation [19]), among others. See Figure 1. The choice of the dissimilarity measure depends on the application at hand [20]. Clustering with respect to the Jeffreys divergence/Jeffreys centroid has proven useful in many scenarios: for example, it was shown to perform experimentally better than the Euclidean or squared Euclidean distances for compressed histograms of gradient descriptors [21] or in fuzzy clustering [22]. The Jeffreys divergence has also been used in image processing [23], including image segmentation [24], speech processing [25], and computer vision [26], to name a few. In particular, finding weighted means of centered 3D normal distributions plays an important role in diffusion tensor imaging (DTI) for smoothing and filtering DT images [27], which consist of sets of normal distributions centered at 3D grid locations.
In general, the Jeffreys centroid is not known in closed form for exponential families [28], such as the family of categorical distributions or the family of normal distributions often met in applications, and thus needs to be numerically approximated in practice. The main contribution of this paper is to present and study two proxy centers as drop-in replacements of the Jeffreys centroid in applications, and to report a generic structural formula for exponential families together with explicit closed-form formulas for the families of categorical and multivariate normal distributions. Namely, we define the Jeffreys–Fisher–Rao (JFR) center (Definition 2) and the Gauss–Bregman (GB) inductive center (Definition 3) in Section 2.
This paper is organized as follows: By interpreting in two different ways the closed-form formula of the Jeffreys centroid for the particular case of sets of centered multivariate normal distributions [29] (proof reported in Appendix B), we define the Gauss–Bregman (GB) centers and the Jeffreys–Fisher–Rao (JFR) centers for sets of densities belonging to an exponential family in Section 2. The Jeffreys centroid coincides with both the Gauss–Bregman inductive center and the Jeffreys–Fisher–Rao center for centered multivariate normal distributions, but the three centers differ from each other in general. In Section 2.4 and Section 3, we study the Gauss–Bregman inductive center [30] induced by the cumulant function of an exponential family and prove the convergence of the generalized Gauss double sequences under a separability condition (Theorem 3). This Gauss–Bregman center can be easily approximated by limiting the number of iterations of the double sequence inducing it. In Section 4, we report the generic formula for Jeffreys–Fisher–Rao centers for sets of uni-order exponential families [13] and explicitly give the closed-form formulas for the categorical family and the multivariate normal family. A comparison of those proxy centers with the numerical Jeffreys centroids is experimentally studied and visually illustrated with some examples. Thus, we propose to use in applications (e.g., clustering) either the fast Jeffreys–Fisher–Rao center, when a closed-form formula is available for the family of distributions at hand, or the Gauss–Bregman center approximation with a prescribed number of iterations, as a drop-in replacement of the numerical Jeffreys centroid while keeping the Jeffreys divergence. Experiments with the JFR and GB centers are reported for the Jeffreys centroid of categorical distributions in Section 5. Finally, we conclude this paper in Section 6 with a discussion and a generalization of our results to the more general setting of dually flat spaces of information geometry [14].
The core of this paper is followed by an Appendix section as follows: In Appendix A, we explicitly give the algorithm outlined in [31] for numerically computing the Jeffreys centroid of sets of categorical distributions. In Appendix B, we report a proof of the closed-form formula of the Jeffreys centroid for centered normal distributions [29] that motivated this paper. In Appendix C, we explain how to compute in practice the closed-form formula for the Fisher–Rao geodesic midpoint between two multivariate normal distributions [32].

2. Proxy Centers for Jeffreys Centroids

2.1. Background on Jeffreys Centroids

A density $p_\theta$ belonging to an exponential family [13] $\mathcal{E}$ can be expressed canonically as $p_\theta(x) = \exp(\langle\theta, t(x)\rangle - F(\theta))$, where $t(x)$ is a sufficient statistic vector, $F(\theta) = \log\int\exp(\langle\theta, t(x)\rangle)\,\mathrm{d}\mu(x)$ is the log-normalizer, and $\theta$ is the natural parameter belonging to the natural parameter space $\Theta$. We consider minimal regular exponential families [13] like the discrete family of categorical distributions (i.e., $\mu$ is the counting measure) or the continuous family of multivariate normal distributions (i.e., $\mu$ is the Lebesgue measure).
The Jeffreys centroid of categorical distributions was first studied by Veldhuis [6], who designed a numerical two-nested loops Newton-like algorithm [6]. A random variable X following a categorical distribution Cat ( p ) for a parameter p Δ d in sample space X = { ω 1 , , ω d } is such that Pr ( X = ω i ) = p i . Categorical distributions are often used in image processing to statistically model normalized histograms with non-empty bins. The exact characterization of the Jeffreys centroid was given in [31].
We summarize the results regarding the categorical Jeffreys centroid [31] in the following theorem:
Theorem 1
(Categorical Jeffreys centroid [31]). The Jeffreys centroid of a set of $n$ categorical distributions parameterized by $\mathcal{P} = \{p_1,\ldots,p_n\} \subset \Delta_d$, arranged in a matrix $P = [p_{i,j}] \in \mathbb{R}^{n\times d}$ and weighted by a vector $w = (w_1,\ldots,w_n) \in \Delta_n$, is $c(\lambda) = (c_1(\lambda),\ldots,c_d(\lambda))$ with
$c_j(\lambda) = \frac{a_j}{W_0\!\left(\frac{a_j}{g_j}\, e^{1+\lambda}\right)}, \qquad j \in \{1,\ldots,d\},$
where $a_j = \sum_{i=1}^n w_i p_{i,j}$ and $g_j = \frac{\prod_{i=1}^n p_{i,j}^{w_i}}{\sum_{j'=1}^d \prod_{i=1}^n p_{i,j'}^{w_i}}$ are the $j$-th components of the weighted arithmetic and normalized geometric means, respectively; $W_0$ is the principal branch of the Lambert W function [33]; and $\lambda \geq 0$ is the unique real value such that $\lambda = D_{\mathrm{KL}}(c(\lambda) : g)$.
Furthermore, a simple bisection search is reported in [31] §III.B that we convert into Algorithm A1 in Appendix A, which allows one to numerically approximate the Jeffreys centroid to arbitrary fine precision.

2.2. Jeffreys Centroids on Exponential Family Densities: Symmetrized Bregman Centroids

The Jeffreys divergence between two densities of an exponential family $\mathcal{E} = \{p_\theta(x) = \exp(\langle t(x),\theta\rangle - F(\theta)) : \theta\in\Theta\}$ with cumulant function $F(\theta)$ amounts to a symmetrized Bregman divergence [28] (SBD):
$D_J(p_{\theta_1}, p_{\theta_2}) = S_F(\theta_1,\theta_2) := \langle\theta_1-\theta_2,\ \nabla F(\theta_1)-\nabla F(\theta_2)\rangle.$
Using convex duality, we have $S_F(\theta_1,\theta_2) = S_{F^*}(\eta_1,\eta_2)$, where $\eta = \nabla F(\theta)$ and $F^*(\eta) = \langle\eta, (\nabla F)^{-1}(\eta)\rangle - F((\nabla F)^{-1}(\eta))$ is the Legendre–Fenchel convex conjugate. Thus, the Jeffreys barycenter of $\mathcal{P} = \{p_{\theta_1},\ldots,p_{\theta_n}\}$ amounts to either a symmetrized Bregman barycenter on the natural parameters $\mathcal{P}_\theta = \{\theta_1,\ldots,\theta_n\}$ with respect to $S_F$ or a symmetrized Bregman barycenter on the dual moment parameters $\mathcal{P}_\eta = \{\eta_1,\ldots,\eta_n\}$ with respect to $S_{F^*}$.
It was shown in [28] that the symmetrized Bregman barycenter θ S of n weighted points amounts to the following minimization problem involving only the sided Bregman centroids:
$\theta_S := \arg\min_{\theta\in\Theta} \sum_i w_i\, S_F(\theta,\theta_i) \equiv \arg\min_{\theta\in\Theta} B_F(\bar\theta:\theta) + B_F(\theta:\underline\theta),$
where $\bar\theta = \sum_i w_i\theta_i$ (right Bregman centroid) and $\underline\theta = (\nabla F)^{-1}\!\left(\sum_i w_i\nabla F(\theta_i)\right)$ (left Bregman centroid). Those $\bar\theta$ and $\underline\theta$ centers are centroids [28] with respect to the Bregman divergence $B_F(\theta_1:\theta_2) = F(\theta_1) - F(\theta_2) - \langle\theta_1-\theta_2, \nabla F(\theta_2)\rangle$ and the reverse Bregman divergence $B_F^*(\theta_1:\theta_2) := B_F(\theta_2:\theta_1)$:
$\bar\theta = \arg\min_\theta \sum_i w_i\, B_F(\theta_i:\theta), \qquad \underline\theta = \arg\min_\theta \sum_i w_i\, B_F(\theta:\theta_i) = \arg\min_\theta \sum_i w_i\, B_F^*(\theta_i:\theta).$
In general, when $H : \mathbb{R}^m \to \mathbb{R}$ is a strictly convex differentiable real-valued function of Legendre type [34], the gradient $\nabla H$ is globally invertible (in general, the inverse function theorem only guarantees a local inverse), and we can define a quasi-arithmetic center of a point set $\mathcal{P} = \{\theta_1,\ldots,\theta_n\}$ weighted by $w$ as follows:
Definition 1
(Quasi-arithmetic center). Let H = F be the gradient of a strictly convex or concave differentiable real-valued function F of Legendre type. The quasi-arithmetic center c H ( θ 1 , , θ n ; w ) is defined by
$c_H(\theta_1,\ldots,\theta_n; w) = H^{-1}\!\left(\sum_{i=1}^n w_i\, H(\theta_i)\right).$
This definition generalizes the scalar quasi-arithmetic means [35] for univariate functions $h$ that are continuous and strictly monotone. Quasi-arithmetic means (QAMs) are also called f-means or Kolmogorov–Nagumo means. Let $m_F(\theta_1,\theta_2) = c_{\nabla F}(\theta_1,\theta_2; \tfrac{1}{2},\tfrac{1}{2})$. Notice that $A(\theta_1,\theta_2) = \nabla F^*(m_{F^*}(\eta_1,\eta_2))$ and $A(\eta_1,\eta_2) = \nabla F(m_F(\theta_1,\theta_2))$. That is, the arithmetic mean in a primal representation amounts to a QAM in the dual representation.
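For concreteness, here is a minimal sketch of Definition 1 (NumPy; the function and variable names are illustrative, not taken from any library), instantiated with $H = \nabla F$ for the centered-normal generator $F(\theta) = -\frac{1}{2}\log\det(\theta)$ used later in the paper:

```python
import numpy as np

def quasi_arithmetic_center(thetas, weights, H, H_inv):
    """Weighted quasi-arithmetic center c_H = H^{-1}( sum_i w_i H(theta_i) ), with H = grad F."""
    return H_inv(sum(w * H(theta) for w, theta in zip(weights, thetas)))

# Example: H(theta) = -1/2 theta^{-1} and H^{-1}(eta) = -1/2 eta^{-1} (centered-normal generator),
# so c_H is the weighted matrix harmonic mean (sum_i w_i theta_i^{-1})^{-1}.
H = lambda t: -0.5 * np.linalg.inv(t)
H_inv = lambda e: -0.5 * np.linalg.inv(e)
thetas = [np.array([[2.0, 0.3], [0.3, 1.0]]), np.array([[1.0, -0.2], [-0.2, 3.0]])]
c = quasi_arithmetic_center(thetas, [0.5, 0.5], H, H_inv)
```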
Thus, we can solve for θ S by setting the gradient of L ( θ ) = B F ( θ ¯ : θ ) + B F ( θ : θ ̲ ) to zero. In general, no closed-form formula is known for the symmetrized Bregman centroids, and a numerical approximation method was reported in [28]. To circumvent the lack of a closed-form formula of symmetrized Bregman centroids for clustering, Nock et al. [36] proposed a mixed Bregman clustering where each cluster has two representative dual Bregman centroids θ ¯ = i w i θ i (right Bregman centroid) and θ ̲ = ( F ) 1 ( i w i F ( θ i ) ) (left Bregman centroid), and the dissimilarity measure is a mixed Bregman divergence defined by
$\Delta_F(\theta_1 : \theta : \theta_2) := \tfrac{1}{2} B_F(\theta_1 : \theta) + \tfrac{1}{2} B_F(\theta : \theta_2).$
Notice that minimizing Equation (3) amounts to minimizing the mixed Bregman divergence:
min θ Δ F ( θ ¯ : θ : θ ̲ ) .
By using the dual parameterization $\eta = \nabla F(\theta)$ (with dual domain $H = \{\nabla F(\theta) : \theta\in\Theta\}$) and the dual Bregman divergence $B_{F^*}(\eta_1:\eta_2) = F^*(\eta_1) - F^*(\eta_2) - \langle\eta_1-\eta_2, \nabla F^*(\eta_2)\rangle = B_F(\theta_2:\theta_1)$, we have
$\theta_S := \arg\min_{\theta\in\Theta}\sum_i w_i\, S_F(\theta,\theta_i), \qquad \eta_S := \arg\min_{\eta\in H}\sum_i w_i\, S_{F^*}(\eta,\eta_i) = \arg\min_{\eta\in H} B_{F^*}(\eta:\nabla F(\bar\theta)) + B_{F^*}(\nabla F(\underline\theta):\eta).$
Since $\nabla F(\bar\theta) = (\nabla F^*)^{-1}\!\left(\sum_i w_i\nabla F^*(\eta_i)\right) = \underline\eta$ and $\nabla F(\underline\theta) = \nabla F\!\left((\nabla F)^{-1}\!\left(\sum_i w_i\eta_i\right)\right) = \bar\eta$, we obtain the dual equivalent optimization problems:
$\theta_S = \nabla F^*(\eta_S) = \arg\min_{\theta\in\Theta} B_F(\bar\theta:\theta) + B_F(\theta:\underline\theta),$
or
$\eta_S = \nabla F(\theta_S) = \arg\min_{\eta\in H} B_{F^*}(\bar\eta:\eta) + B_{F^*}(\eta:\underline\eta).$
However, a remarkable special case is the family of multivariate normal distributions centered at the origin, for which the Jeffreys centroid was reported in closed form in [29]. Let $\mathcal{N}_0 = \{p_\Sigma : \Sigma\in\mathrm{Sym}^{++}(\mathbb{R},d)\}$ be the exponential family with sufficient statistic $t(x) = -\frac{1}{2} x x^\top$, natural parameter $\theta = \Sigma^{-1}$ (the precision matrix), where the covariance matrix $\Sigma$ belongs to the cone $\mathrm{Sym}^{++}(\mathbb{R},d)$ of symmetric positive-definite matrices, inner product $\langle X, Y\rangle = \mathrm{tr}(XY)$, and $F(\theta) = -\frac{1}{2}\log\det(\theta)$. In that case, the Jeffreys divergence amounts to a symmetrized Bregman log-det (ld) divergence between the corresponding natural parameters:
$D_J(p_\Sigma, p_{\Sigma'}) = \tfrac{1}{2}\,\mathrm{tr}\!\left((\Sigma^{-1}-\Sigma'^{-1})(\Sigma'-\Sigma)\right) =: \tfrac{1}{2}\, S_{\mathrm{ld}}(\Sigma^{-1},\Sigma'^{-1}).$
Using the standard covariance matrix parameterization $\Sigma$, we can further express the Jeffreys divergence between two multivariate normal distributions $p_\Sigma$ and $p_{\Sigma'}$ as
$D_J(p_\Sigma, p_{\Sigma'}) = \tfrac{1}{2}\sum_{i=1}^d \left(\sqrt{\lambda_i} - \tfrac{1}{\sqrt{\lambda_i}}\right)^2,$
where the $\lambda_i$'s are the eigenvalues of $\Sigma^{-1}\Sigma'$. The symmetrized log-det divergence $S_{\mathrm{ld}}$ is also called the symmetrized Stein loss [37,38]. When $d = 1$, this divergence is the symmetrized Itakura–Saito divergence, also called the COSH distance [28]. The Jeffreys centroid can be characterized using the Fisher–Rao geometry [39] of $\mathcal{N}_0$ as the Fisher–Rao geodesic midpoint of the sided Kullback–Leibler centroids as follows:
Theorem 2
([29]). The Jeffreys centroid $C$ of a set of $n$ centered multivariate normal distributions $\mathcal{P} = \{p_{\Sigma_1},\ldots,p_{\Sigma_n}\}$ weighted by $w\in\Delta_n$ amounts to the symmetrized log-det Bregman centroid for the corresponding weighted set of positive-definite precision matrices $\mathcal{P}_\theta = \{P_1 = \Sigma_1^{-1},\ldots,P_n = \Sigma_n^{-1}\}$. The symmetrized log-det Bregman barycenter $C$ is the Riemannian geodesic midpoint $A\# H$ of the arithmetic barycenter $A = \sum_{i=1}^n w_i P_i$ and the harmonic barycenter $H = \left(\sum_{i=1}^n w_i P_i^{-1}\right)^{-1}$, where $X\# Y := X^{\frac{1}{2}}\left(X^{-\frac{1}{2}} Y X^{-\frac{1}{2}}\right)^{\frac{1}{2}} X^{\frac{1}{2}}$ is the matrix geometric mean [40] $G(X,Y) = X\# Y$:
$C = \left(\sum_{i=1}^n w_i P_i\right)\#\left(\sum_{i=1}^n w_i P_i^{-1}\right)^{-1}.$
Since the proof of this result mentioned in [29] was omitted in [29], we report a proof involving matrix analysis in full detail in Appendix B.
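Before doing so, we note that Theorem 2 is straightforward to implement; the following is a minimal NumPy/SciPy sketch (illustrative function names, not the paper's code), with the inputs given as precision matrices:

```python
import numpy as np
from scipy.linalg import sqrtm, inv

def matrix_geometric_mean(X, Y):
    """Matrix geometric mean X # Y = X^{1/2} (X^{-1/2} Y X^{-1/2})^{1/2} X^{1/2}."""
    Xh = sqrtm(X)
    Xmh = inv(Xh)
    return Xh @ sqrtm(Xmh @ Y @ Xmh) @ Xh

def jeffreys_centroid_centered_normals(precisions, weights):
    """Closed-form Jeffreys centroid (returned as a precision matrix) of centered normals (Theorem 2)."""
    A = sum(w * P for w, P in zip(weights, precisions))            # arithmetic barycenter
    H = inv(sum(w * inv(P) for w, P in zip(weights, precisions)))  # harmonic barycenter
    return matrix_geometric_mean(A, H)
```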
Next, we shall define two types of centers for sets of densities of a prescribed exponential family based on two different interpretations of Equation (5). We call them centers and not centroids because those points are defined by a generic structural formula instead of as solutions of the minimization problems of average divergences of Equation (1).

2.3. The Jeffreys–Fisher–Rao Center

Since an exponential family $\mathcal{E} = \{p_\theta(x)\}$ induces the Riemannian manifold $(M, g)$ with the Fisher metric $g$ expressed in the $\theta$-parameterization by the Fisher information matrix $\nabla^2 F(\theta)$ and Fisher–Rao geodesics $\gamma(p,q,t)$ defined with respect to the Levi-Civita connection $\bar\nabla$ (induced by $g$), we shall define the Jeffreys–Fisher–Rao center on $M$ using the Fisher–Rao geodesics as follows:
Definition 2
(Jeffreys–Fisher–Rao (JFR) center). The Jeffreys–Fisher–Rao center $\theta_{\mathrm{JFR}}$ of a set $\{p_{\theta_1},\ldots,p_{\theta_n}\}$ of densities weighted by $w\in\Delta_n$ is defined as the Fisher–Rao midpoint of the sided Kullback–Leibler centroids $\bar\theta = \sum_i w_i\theta_i$ and $\underline\theta = (\nabla F)^{-1}\!\left(\sum_i w_i\nabla F(\theta_i)\right)$:
$\theta_{\mathrm{JFR}} = \bar\theta\#\underline\theta,$
where $p\# q = \gamma\!\left(p, q, \tfrac{1}{2}\right)$.
Equation (6) is a generalization of Equation (5); therefore, the JFR center matches the Jeffreys centroid for same-mean multivariate normal distributions (Theorem 2).
Let $\mathcal{P}_\theta = \{\theta_1,\ldots,\theta_n\}$ and $\mathcal{P}_\eta = \{\eta_1,\ldots,\eta_n\}$, where $\eta = \nabla F(\theta)$ and $\theta = \nabla F^*(\eta)$. Denote by $\mathrm{JFR}_F(\mathcal{P}_\theta; w)$ the $\theta$-coordinates $\theta_{\mathrm{JFR}}$ of the JFR center. Then, $\mathrm{JFR}_{F^*}(\mathcal{P}_\eta; w) = \nabla F(\theta_{\mathrm{JFR}}) =: \eta_{\mathrm{JFR}}$.

2.4. Gauss–Bregman Inductive Center

Another remarkable property of the Jeffreys centroid for a set $\{p_{\mu,\Sigma_1},\ldots,p_{\mu,\Sigma_n}\}$ of same-mean multivariate normal distributions weighted by $w\in\Delta_n$, with arithmetic and harmonic means $A = \sum_{i=1}^n w_i\Sigma_i^{-1}$ and $H = \left(\sum_{i=1}^n w_i\Sigma_i\right)^{-1}$ on the precision matrices $\Sigma_1^{-1},\ldots,\Sigma_n^{-1}$, respectively, is the following invariance of the Jeffreys centroid (see Lemma 17.4.4 of [29]):
$G(A,H) = G\!\left(\frac{A+H}{2},\ 2\left(A^{-1}+H^{-1}\right)^{-1}\right).$
Nakamura [41] defined the following double sequence scheme converging to the matrix geometry mean G ( P , Q ) for any two symmetric positive-definite matrices P and Q:
$P_{t+1} = A(P_t,Q_t) := \frac{P_t+Q_t}{2}, \qquad Q_{t+1} = H(P_t,Q_t) := 2\left(P_t^{-1}+Q_t^{-1}\right)^{-1},$
initialized with $P_0 = P$ and $Q_0 = Q$. We have $\lim_{t\to\infty} P_t = \lim_{t\to\infty} Q_t = P\# Q = G(P,Q)$. Let $P_\infty = \lim_{t\to\infty} P_t$ and $Q_\infty = \lim_{t\to\infty} Q_t$. That is, the geometric matrix mean can be obtained as the limit of a double sequence of means. We can thus approximate $G(P,Q)$ by stopping the double sequence after $T$ iterations to obtain
$G^{(T)}(P,Q) = A(P_T,Q_T) \approx G(P,Q).$
Notice that we can recover those iterations from the invariance property of Equation (7): Indeed, we have
$G(P_0,Q_0) = G(A(P_0,Q_0) =: P_1,\ H(P_0,Q_0) =: Q_1) = G(A(P_1,Q_1) =: P_2,\ H(P_1,Q_1) =: Q_2) = \cdots,$
and $\|P_t - Q_t\| = \sqrt{\mathrm{tr}((P_t-Q_t)(P_t-Q_t))}$ decreases [41] as the number of iterations $t$ increases. Thus, by induction, $G(P_0,Q_0) = G(P_\infty, Q_\infty)$ with $P_\infty = Q_\infty$. Since $G(X,X) = X$ (means are reflexive), it follows that $G(P_0,Q_0) = P_\infty = Q_\infty$. It is proved in [41] that the convergence rate of the double sequence is quadratic. This type of mean has been called an inductive mean [30,42] (or compound mean [43]) and originated from the Gauss arithmetic–geometric mean [44].
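As a sanity check of this double sequence, here is a short NumPy sketch (an illustration under our own naming, not code from [41]) that can be compared numerically against the closed-form matrix geometric mean:

```python
import numpy as np

def arithmetic_harmonic_matrix_mean(P, Q, iters=20):
    """Nakamura's double sequence P_{t+1} = (P_t + Q_t)/2, Q_{t+1} = 2 (P_t^{-1} + Q_t^{-1})^{-1};
    both sequences converge quadratically to the matrix geometric mean P # Q."""
    for _ in range(iters):
        P, Q = (P + Q) / 2, 2 * np.linalg.inv(np.linalg.inv(P) + np.linalg.inv(Q))
    return (P + Q) / 2
```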
Our second interpretation of the geometric matrix mean of Equation (5) is to consider it as an inductive mean [30] and to generalize this double sequence process to pairs/sets of densities of an exponential family as follows:
Definition 3
(Gauss–Bregman ( A , F ) center). Let P = { p θ 1 , , p θ n } be a set of n distributions of an exponential family with the cumulant function F ( θ ) weighted by a vector w Δ n . Then, the Gauss–Bregman inductive center θ GB is defined as the limit of the double sequence:
$\bar\theta_{t+1} = A(\bar\theta_t,\underline\theta_t) := \frac{\bar\theta_t+\underline\theta_t}{2}, \qquad \underline\theta_{t+1} = m_F(\bar\theta_t,\underline\theta_t) := (\nabla F)^{-1}\!\left(\frac{\nabla F(\bar\theta_t)+\nabla F(\underline\theta_t)}{2}\right),$
initialized with $\bar\theta_0 = \bar\theta = \sum_{i=1}^n w_i\theta_i$ (right Bregman centroid) and $\underline\theta_0 = \underline\theta = (\nabla F)^{-1}\!\left(\sum_{i=1}^n w_i\nabla F(\theta_i)\right)$ (left Bregman centroid). That is, we have
$\theta_{\mathrm{GB}} = \lim_{t\to\infty}\bar\theta_t = \lim_{t\to\infty}\underline\theta_t.$
Let $\theta_{\mathrm{GB}} = \mathrm{GB}_F(\bar\theta,\underline\theta)$. Then, we have $\eta_{\mathrm{GB}} = \mathrm{GB}_{F^*}(\bar\eta,\underline\eta) = \nabla F(\theta_{\mathrm{GB}})$. The Gauss–Bregman center $c_{\mathrm{GB}}$ has $\theta$-coordinates $\theta_{\mathrm{GB}}$ and $\eta$-coordinates $\eta_{\mathrm{GB}}$.
Algorithm 1 describes the approximation of the Gauss–Bregman inductive center by stopping the double sequence when the iterated centers are close enough to each other. We shall prove the matching convergence of those $\bar\theta_t$ and $\underline\theta_t$ sequences under separability conditions in Section 3.
Algorithm 1: Gauss–Bregman inductive center.
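Since Algorithm 1 is reproduced in the published article only as a pseudocode image, we give here a minimal sketch of the double sequence it describes, assuming NumPy arrays and user-supplied maps grad_F and grad_F_inv (all names are illustrative):

```python
import numpy as np

def gauss_bregman_center(thetas, weights, grad_F, grad_F_inv, tol=1e-12, max_iter=100):
    """Approximate the Gauss-Bregman inductive center of Definition 3 by iterating the
    (A, m_F) double sequence started from the right/left Bregman centroids."""
    t_bar = sum(w * t for w, t in zip(weights, thetas))                      # theta-bar (arithmetic)
    t_und = grad_F_inv(sum(w * grad_F(t) for w, t in zip(weights, thetas)))  # theta-underline
    for _ in range(max_iter):
        if np.linalg.norm(t_bar - t_und) <= tol:
            break
        t_bar, t_und = (t_bar + t_und) / 2, grad_F_inv((grad_F(t_bar) + grad_F(t_und)) / 2)
    return (t_bar + t_und) / 2
```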
For example, the Gauss–Bregman center of two categorical distributions $p = (p_1,\ldots,p_d)$ and $p' = (p'_1,\ldots,p'_d)$ on a sample space $\mathcal{X}$ of $d$ elements is obtained for the cumulant function $F(\theta) = \log\!\left(1+\sum_{i=1}^{d-1} e^{\theta_i}\right)$ with gradient $\nabla F(\theta) = \eta = \left(\frac{e^{\theta_i}}{1+\sum_{j=1}^{d-1} e^{\theta_j}}\right)_i$, where $\theta = \left(\theta_1 = \log\frac{p_1}{p_d},\ldots,\theta_{d-1} = \log\frac{p_{d-1}}{p_d}\right)$ is the natural parameter. The reciprocal gradient is $(\nabla F)^{-1}(\eta) = \left(\log\frac{\eta_i}{1-\sum_{j=1}^{d-1}\eta_j}\right)_i$.
We may also compute the Gauss–Bregman center of two categorical distributions Cat ( p ) and Cat ( p ) using iterations of arithmetic means a t and geometric normalized means g t :
$a^i_{t+1} = A(a^i_t, g^i_t) := \frac{a^i_t+g^i_t}{2}, \quad u^i_{t+1} = \sqrt{a^i_t\, g^i_t}, \quad g^i_{t+1} = \frac{u^i_{t+1}}{\sum_{j=1}^d u^j_{t+1}}, \qquad i\in\{1,\ldots,d\},$
where the $u_t$'s are unnormalized geometric means and the $g_t$'s are normalized geometric means. We initialize the sequence with $a_0 = p$ and $g_0 = p'$, and the Gauss–Bregman center is obtained in the limit $m^{\mathrm{GB}}_{\mathrm{Cat}}(p,p') = \lim_{t\to\infty} a_t = \lim_{t\to\infty} g_t$. See Algorithm 2.
The Jeffreys centroid of a set of centered multivariate normal distributions is the Gauss–Bregman center obtained for the generator $F(\theta) = -\frac{1}{2}\log\det(\theta)$, the cumulant function of the exponential family of centered normal distributions.
Algorithm 2: Gauss–Bregman inductive center for sets of categorical distributions.
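Algorithm 2 is likewise shown only as an image; the following sketch implements the arithmetic/normalized-geometric double sequence written above (NumPy; illustrative names):

```python
import numpy as np

def gauss_bregman_center_categorical(p, q, tol=1e-8, max_iter=100):
    """Gauss-Bregman center of two categorical distributions p and q (points of the simplex):
    iterate arithmetic means and normalized geometric means until the total variation
    distance between the two sequences drops below tol."""
    a, g = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    for _ in range(max_iter):
        if 0.5 * np.abs(a - g).sum() <= tol:
            break
        u = np.sqrt(a * g)               # unnormalized geometric mean
        a, g = (a + g) / 2, u / u.sum()  # arithmetic mean and normalized geometric mean
    return (a + g) / 2
```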
Figure 2 displays the arithmetic, normalized geometric, numerical Jeffreys, Jeffreys–Fisher–Rao, and Gauss–Bregman centroids/centers for a set of 32 trinomial distributions. We may consider normalized intensity histograms of images (modeled as multinomials with one trial) quantized with $d = 256$ bins; that is, a normalized histogram with $d$ bins is interpreted as a point in $\Delta_d$ and visualized as a polyline with $d-1$ line segments. Figure 3 (left) displays the various centroids and centers obtained for an input set consisting of two histograms (from the commonly used Barbara and Lena images, also used in [31]). Notice that the JFR center (purple) and GB center (yellow) are close to the numerical Jeffreys centroid (green). We also provide a close-up window in Figure 3 (right).
Notice that we can experimentally check the quality of the approximation of the Gauss–Bregman center to the Jeffreys centroid by considering the symmetrized Bregman centroid energy
$E_F(\theta) := \langle\theta-\bar\theta, \nabla F(\theta)\rangle - \langle\theta, \nabla F(\underline\theta)\rangle,$
which coincides up to an additive constant with $\sum_i w_i\, S_F(\theta,\theta_i)$, and checking that its gradient approximately vanishes at the candidate center, i.e., that for all $i\in\{1,\ldots,d\}$,
$\partial_i F(\theta) - \partial_i F(\underline\theta) + \sum_{j=1}^d (\theta_j-\bar\theta_j)\,\partial_i\partial_j F(\theta) \approx 0,$
where $\partial_l := \frac{\partial}{\partial\theta_l}$.
Next, we study these two new types of centers and how well they approximate the Jeffreys centroid.

3. Gauss–Bregman Inductive Centers: Convergence Analysis and Properties

Let $F(\theta)$ be a strictly convex and differentiable real-valued function of Legendre type [45] defined on an open parameter space $\Theta$. Then, the gradient map $\theta\mapsto\eta(\theta) = \nabla F(\theta)$ is a bijection with inverse $\eta\mapsto\theta(\eta) = \nabla F^*(\eta) = (\nabla F)^{-1}(\eta)$, where $F^*(\eta) = \langle\eta, (\nabla F)^{-1}(\eta)\rangle - F((\nabla F)^{-1}(\eta))$ is the Legendre–Fenchel convex conjugate. For example, we may consider the cumulant functions of regular exponential families.
We define the Gauss–Bregman center $\theta_{\mathrm{GB}}$ of a set $\{\theta_1,\ldots,\theta_n\}$ weighted by $w\in\Delta_n$ as the common limit of the sequences $(\bar\theta_t)_t$ and $(\underline\theta_t)_t$ defined by
$\bar\theta_{t+1} = A(\bar\theta_t,\underline\theta_t) := \frac{\bar\theta_t+\underline\theta_t}{2},$
$\underline\theta_{t+1} = m_F(\bar\theta_t,\underline\theta_t) := (\nabla F)^{-1}\!\left(\frac{\nabla F(\bar\theta_t)+\nabla F(\underline\theta_t)}{2}\right),$
initialized with $\bar\theta_0 = \bar\theta = \sum_{i=1}^n w_i\theta_i$ and $\underline\theta_0 = \underline\theta = (\nabla F)^{-1}\!\left(\sum_{i=1}^n w_i\nabla F(\theta_i)\right)$. That is, we have
$\theta_{\mathrm{GB}} = \lim_{t\to\infty}\bar\theta_t = \lim_{t\to\infty}\underline\theta_t.$
Such a center has been called an inductive mean by Sturm [30]. See [42] for an overview of inductive means. Figure 4 geometrically illustrates the double sequence iterations converging to the Gauss–Bregman mean.
Theorem 3.
The Gauss–Bregman ( A , F ) center with respect to a Legendre type function F ( θ ) is well defined (i.e., the double sequence converges) for separable Bregman generators.
Proof. 
We need to prove the convergence of { θ ¯ t } and { θ ̲ t } to the same finite limit. When F ( θ ) is univariate, the convergence of the inductive centers was reported in [43]. We need to prove that the double iterations of Equations (12) and (13) converge.
Let us consider the following cases:
  • When the dimension is one, the quasi-arithmetic mean m f for f, a strictly convex and differentiable function, lies between the minimal and maximal argument (i.e., this is the definition of a strict mean):
    $\min\{\theta_1,\theta_2\} \leq m_f(\theta_1,\theta_2) \leq \max\{\theta_1,\theta_2\}.$
    Thus, we have
    $|\bar\theta_{t+1} - \underline\theta_{t+1}| \leq \tfrac{1}{2}\,|\bar\theta_t - \underline\theta_t|,$
    and it follows that $|\bar\theta_{t+1} - \underline\theta_{t+1}| \leq \frac{1}{2^t}\,|\bar\theta_0 - \underline\theta_0|$. Thus, the scalar $(A,f)$-means converge at least linearly (the rate is in fact quadratic [43]). See Figure 5.
  • When F ( θ ) is multivariate and separable, i.e.,  F ( θ ) = i = 1 d f i ( θ i ) where θ = ( θ 1 , , θ d ) are the components of θ R d and the f i s are scalar strictly convex and differentiable functions, we can apply case 1 dimension-wise to obtain the quadratic convergence.
  • Otherwise, we consider the multivariate quasi-arithmetic center $m_F(\theta,\theta')$ with the uniform weight vector $w = (\frac{1}{2},\frac{1}{2})$. One problem we face is that the quasi-arithmetic center $m_F(\theta,\theta')$ for $\theta\neq\theta'$ may lie outside the open bounding box of $\mathbb{R}^d$ with diagonal corners
    $\theta_m = (\min\{\theta_1,\theta'_1\},\ldots,\min\{\theta_d,\theta'_d\}), \qquad \theta_M = (\max\{\theta_1,\theta'_1\},\ldots,\max\{\theta_d,\theta'_d\}).$
    Indeed, in the 2D case, we may consider $\theta = (x,y)$ and $\theta' = (x,y')$ sharing their first coordinate: the open bounding box is then empty, and the midpoint $m_F(\theta,\theta')$ may lie outside this box. Yet, we are interested in the convergence rate when $\theta'\to\theta$.
    In general, we shall measure the difference between two iterations by the squared norm distance induced by the inner product:
    $\|A(\theta,\theta') - m_F(\theta,\theta')\|^2 = \langle A(\theta,\theta') - m_F(\theta,\theta'),\ A(\theta,\theta') - m_F(\theta,\theta')\rangle.$
 □
Let $m_F^{\mathrm{GB}}(\theta_1,\theta_2)$ denote the Gauss–Bregman center of $\theta_1$ and $\theta_2$, $A(\theta_1,\theta_2) = \frac{\theta_1+\theta_2}{2}$ the arithmetic mean, and $m_F(\theta_1,\theta_2) = (\nabla F)^{-1}\!\left(\frac{\nabla F(\theta_1)+\nabla F(\theta_2)}{2}\right)$ the quasi-arithmetic center.
By construction, the Gauss–Bregman center enjoys the following invariance property generalizing Lemma 17.4.4 of [29] in the case of the log det generator:
Property 1.
We have $m_F^{\mathrm{GB}}(\theta_1,\theta_2) = m_F^{\mathrm{GB}}\big(A(\theta_1,\theta_2),\ m_F(\theta_1,\theta_2)\big)$.
Proof. 
Similar to the cascaded equalities of Equation (8), we have
$m_F^{\mathrm{GB}}(\theta_1,\theta_2) = m_F^{\mathrm{GB}}\big(A(\theta_1,\theta_2) =: \theta_1^{(1)},\ m_F(\theta_1,\theta_2) =: \theta_2^{(1)}\big) = \cdots$
In the limit $t\to\infty$, we have $m_F^{\mathrm{GB}}(\theta_1,\theta_2) = m_F^{\mathrm{GB}}(\theta_1^{(\infty)},\theta_2^{(\infty)}) = m_F^{\mathrm{GB}}(\theta_1^{(1)},\theta_2^{(1)}) = \cdots$ Since $\theta_1^{(\infty)} = \theta_2^{(\infty)}$, we obtain the desired invariance property:
$m_F^{\mathrm{GB}}(\theta_1,\theta_2) = m_F^{\mathrm{GB}}\big(A(\theta_1,\theta_2),\ m_F(\theta_1,\theta_2)\big).$
 □
Note that when $F(\theta)$ is univariate, the Gauss–Bregman mean $m_F^{\mathrm{GB}}(\theta_1,\theta_2)$ converges at a quadratic rate [43]. In particular, when $F(\theta) = -\log\theta$ (Burg negentropy), we have $F'(\theta) = -\frac{1}{\theta}$ ($m_F$ is the harmonic mean), and the Gauss–Bregman mean is the arithmetic–harmonic mean (AHM), which converges to the geometric mean, a simple closed-form formula. Notice that the geometric mean $g = \sqrt{xy}$ of two scalars $x > 0$ and $y > 0$ can be expressed using the arithmetic mean $a = \frac{x+y}{2}$ and the harmonic mean $h = \frac{2xy}{x+y}$: $g = \sqrt{ah}$. But when $F(\theta) = \theta\log\theta - \theta$ (Shannon negentropy), the Gauss–Bregman mean $m_F^{\mathrm{GB}}(\theta_1,\theta_2)$ coincides with the Gauss arithmetic–geometric mean [44] (AGM) since $F'(\theta) = \log\theta$ and $m_F(\theta_1,\theta_2) = \sqrt{\theta_1\theta_2}$, the geometric mean. There is no closed-form formula for the AGM in terms of elementary functions: this induced mean is related to the complete elliptic integral of the first kind $K(\cdot)$:
$\mathrm{AGM}(x,y) = \frac{\pi}{4}\,\frac{x+y}{K\!\left(\frac{x-y}{x+y}\right)},$
where $K(u) = \int_0^{\pi/2}\frac{\mathrm{d}\theta}{\sqrt{1-u^2\sin^2\theta}}$ is the complete elliptic integral of the first kind. Thus, it is difficult, in general, to report a closed-form formula for the inductive Gauss–Bregman means, even for univariate generators $F(\theta)$.
The Jeffreys centroid of $x > 0$ and $y > 0$ with respect to the scalar Jeffreys divergence $D_J(p,q) = (p-q)\log\frac{p}{q}$ admits a closed-form solution [31]:
$c = \frac{a}{W_0\!\left(\frac{a}{g}\, e\right)},$
where $a = \frac{x+y}{2}$, $g = \sqrt{xy}$, and $W_0$ is the principal branch of the Lambert W function [33]. This example shows that the Gauss–Bregman center does not coincide with the Jeffreys centroid in general (e.g., compare Equation (15) with Equation (16)).
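A quick numerical illustration of this difference (a sketch using SciPy's lambertw; the input values are arbitrary):

```python
import numpy as np
from scipy.special import lambertw

def scalar_jeffreys_centroid(x, y):
    """Closed-form scalar Jeffreys centroid c = a / W_0((a/g) e) of x, y > 0."""
    a, g = (x + y) / 2, np.sqrt(x * y)
    return a / lambertw(a * np.e / g, 0).real

def agm(x, y, iters=30):
    """Gauss arithmetic-geometric mean, i.e., the Gauss-Bregman mean for F(t) = t log t - t."""
    for _ in range(iters):
        x, y = (x + y) / 2, np.sqrt(x * y)
    return x

print(scalar_jeffreys_centroid(1.0, 8.0), agm(1.0, 8.0))  # close, but not equal
```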

4. Jeffreys–Fisher–Rao Centers: Generic Structural Formula and Some Closed-Form Formula

4.1. Jeffreys–Fisher–Rao Center for Uni-Parametric Statistical Models

Consider a set $\mathcal{P} = \{p_{\theta_1},\ldots,p_{\theta_n}\}$ of $n$ parametric distributions where $\theta\in\Theta\subset\mathbb{R}$ is a scalar parameter. Let $w = (w_1,\ldots,w_n)\in\Delta_n$ be a weight vector on $\mathcal{P}$ such that the weight of $p_{\theta_i}$ is $w_i$. The distributions $p_\theta$ need not belong to an exponential family (e.g., the Cauchy scale family). The Fisher–Rao geometry [46,47] of the parametric family of distributions $\mathcal{F} = \{p_\theta : \theta\in\Theta\}$ (the statistical model) can be modeled as a Riemannian manifold with the Fisher metric $g(\theta) = I(\theta)$ defined by the Fisher information $I(\theta) = E_\theta\!\left[\left(\frac{\mathrm{d}}{\mathrm{d}\theta}\log p_\theta(x)\right)^2\right] = -E_\theta\!\left[\frac{\mathrm{d}^2}{\mathrm{d}\theta^2}\log p_\theta(x)\right]$. When $\mathcal{F}$ is an exponential family with cumulant function $F(\theta)$, we have $I(\theta) = F''(\theta)$.
The underlying geometry of $(\mathcal{F}, g(\theta) = I(\theta))$ is Euclidean after a change of variable, since we can factorize the metric tensor as follows:
$g(\theta) = \sqrt{I(\theta)}\times\underbrace{1}_{=\, g_{\mathrm{Euclidean}}}\times\sqrt{I(\theta)}.$
Thus, the Riemannian Fisher–Rao distance is the Euclidean distance expressed in the $h(\theta)$-coordinate system with $h(\theta) = \int_{\theta_0}^\theta\sqrt{I(u)}\,\mathrm{d}u$, and we have the Fisher–Rao distance given by
$\rho(p_{\theta_1}, p_{\theta_2}) = |h(\theta_1) - h(\theta_2)|.$
When $\mathcal{F}$ is an exponential family with cumulant function $f(\theta)$, we have $I(u) = f''(u)$.
We summarize the result on the JFR center in the following theorem:
Theorem 4
(Jeffreys–Fisher–Rao center in uni-order exponential families). The Jeffreys–Fisher–Rao center $\theta_{\mathrm{JFR}}$ of $n$ densities $p_{\theta_1},\ldots,p_{\theta_n}$ of an exponential family of order one with log-normalizer $f(\theta)$, natural parameter space $\Theta$, and weight vector $w\in\Delta_n$ is
$\theta_{\mathrm{JFR}} = m_h(\bar\theta,\underline\theta),$
where $m_h(\bar\theta,\underline\theta) = h^{-1}\!\left(\frac{h(\bar\theta)+h(\underline\theta)}{2}\right)$ is the quasi-arithmetic mean [35] of the dual right and left KL centroids $\bar\theta = \sum_{i=1}^n w_i\theta_i$ and $\underline\theta = (f')^{-1}\!\left(\sum_{i=1}^n w_i f'(\theta_i)\right)$ with respect to the scalar monotone function $h(\theta) = \int_{\theta_0}^\theta\sqrt{f''(u)}\,\mathrm{d}u$ for any $\theta_0\in\Theta$.
Proof. 
Since the Fisher information is $I(\theta) = f''(\theta)$, we have $h(\theta) = \int_{\theta_0}^\theta\sqrt{f''(u)}\,\mathrm{d}u$. The Riemannian center of mass [48] minimizes
$\theta_S = \arg\min_\theta \sum_{i=1}^n w_i\,\rho^2(\theta_i,\theta).$
But in the $h$-parameterization, the Riemannian centroid amounts to a Euclidean center of mass/centroid in the $h$-Cartesian coordinate system:
$h(\theta_S) = \sum_{i=1}^n w_i\, h(\theta_i).$
Therefore, we have $\theta_S = h^{-1}\!\left(\sum_i w_i h(\theta_i)\right) =: m_h(\theta_1,\ldots,\theta_n; w_1,\ldots,w_n)$, a weighted quasi-arithmetic mean. Since the Jeffreys centroid amounts to a symmetrized Bregman centroid of the left and right Bregman centroids [28], $\underline\theta = (f')^{-1}\!\left(\sum_i w_i f'(\theta_i)\right)$ and $\bar\theta = \sum_i w_i\theta_i$, it follows that the Jeffreys–Fisher–Rao center is $\theta_{\mathrm{JFR}} = m_h(\bar\theta,\underline\theta)$. □
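As a worked illustration of Theorem 4 (our own example, assuming the Poisson family, which is not treated in the paper): for Poisson distributions with natural parameter $\theta = \log\lambda$ and cumulant function $f(\theta) = e^\theta$, we have $f''(\theta) = e^\theta$, hence $h(\theta) = 2e^{\theta/2}$ (up to an irrelevant additive constant) and $h^{-1}(y) = 2\log(y/2)$:

```python
import numpy as np

def jfr_center_poisson(thetas, weights):
    """JFR center of weighted Poisson distributions, natural parameters theta_i = log(rate_i).
    Uses f(theta) = exp(theta), h(theta) = 2 exp(theta/2), h^{-1}(y) = 2 log(y/2) (Theorem 4)."""
    thetas, weights = np.asarray(thetas, float), np.asarray(weights, float)
    t_bar = np.dot(weights, thetas)                  # right KL centroid: arithmetic mean
    t_und = np.log(np.dot(weights, np.exp(thetas)))  # left KL centroid: (f')^{-1}(sum w_i f'(theta_i))
    h, h_inv = lambda t: 2 * np.exp(t / 2), lambda y: 2 * np.log(y / 2)
    return h_inv((h(t_bar) + h(t_und)) / 2)          # quasi-arithmetic midpoint m_h
```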

4.2. Jeffreys–Fisher–Rao Center for Categorical Distributions

Recall from Theorem 1 that the Jeffreys centroid $c = (c_1,\ldots,c_j,\ldots,c_d)$ of a set of $n$ categorical distributions with parameters arranged in the matrix $[p_{i,j}]$ is given by
$c_j(\lambda) = \frac{a_j}{W_0\!\left(\frac{a_j}{g_j}\, e^{1+\lambda}\right)}, \qquad j\in\{1,\ldots,d\},$
where $a_j = \sum_{i=1}^n w_i p_{i,j}$ and $g_j = \frac{\prod_{i=1}^n p_{i,j}^{w_i}}{\sum_{j'=1}^d\prod_{i=1}^n p_{i,j'}^{w_i}}$ are the components of the weighted arithmetic and normalized geometric means, respectively, and $W_0$ is the principal branch of the Lambert W function [33]. The optimal $\lambda \geq 0$ is unique and satisfies $\lambda = D_{\mathrm{KL}}(c(\lambda):g)$.
Let c ( λ ) = ( c 1 ( λ ) , , c d ( λ ) ) . Let L J ( p ) denote the Jeffreys loss function to minimize to find the optimal Jeffreys centroid:
$L_J(p) = \sum_{i=1}^n w_i\, D_J(p_i, p).$
We say that p is a ( 1 + ϵ ) approximation of the exact Jeffreys centroid c when we have
$L_J(c) \leq L_J(p) \leq (1+\epsilon)\, L_J(c).$
It was shown in [31] that $\tilde c = c(0)$, called the unnormalized Jeffreys center, yields an approximation factor of the Jeffreys centroid $c$ that can be expressed in terms of $s(\lambda) = \sum_j c_j(\lambda)$.
Since the Fisher–Rao geodesic midpoints on the categorical Fisher–Rao manifold are known in closed form [49], we give the mathematical expression of the JFR center as follows:
Theorem 5
(JFR center of categorical distributions). Let $\mathcal{P}_w = \{p_1,\ldots,p_n\}$ be a set of $n$ probability mass functions weighted by $w\in\Delta_n$, with $p_i = (p_{i,1},\ldots,p_{i,d})\in\Delta_d$ for $i\in\{1,\ldots,n\}$. Then, the JFR center $c$ is unique and given by the following formula:
$c_j = \frac{\left(\sqrt{a_j}+\sqrt{g_j}\right)^2}{2\left(1+\sum_{l=1}^d\sqrt{a_l\, g_l}\right)}, \qquad j\in\{1,\ldots,d\},$
where $a = (a_1,\ldots,a_d) = \sum_{i=1}^n w_i p_i$ is the weighted arithmetic mean and $g = (g_1,\ldots,g_d)$ is the normalized weighted geometric mean with components $g_j = \frac{\prod_{i=1}^n p_{i,j}^{w_i}}{\sum_{j'=1}^d\prod_{i=1}^n p_{i,j'}^{w_i}}$ for $j\in\{1,\ldots,d\}$.
Notice that the JFR center differs from the Jeffreys centroid, which requires the use of the Lambert W function [33]. However, we noticed that for practical applications, the JFR centroid approximates the Jeffreys centroid well and is much faster to compute (see the experiments in Section 5).
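A minimal sketch of the closed-form JFR center of Theorem 5 (NumPy; illustrative names), which only requires the weighted arithmetic mean a and the normalized weighted geometric mean g:

```python
import numpy as np

def jfr_center_categorical(P, w):
    """JFR center of categorical distributions given as the rows of P (n x d), weighted by w."""
    P, w = np.asarray(P, float), np.asarray(w, float)
    a = w @ P                                    # weighted arithmetic mean
    g = np.prod(P ** w[:, None], axis=0)
    g = g / g.sum()                              # normalized weighted geometric mean
    c = (np.sqrt(a) + np.sqrt(g)) ** 2
    return c / (2 * (1 + np.sqrt(a * g).sum()))  # Fisher-Rao (spherical) midpoint of a and g
```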

4.3. Jeffreys–Fisher–Rao Center for Multivariate Normal Distributions

Let $\mathcal{P} = \{p_{\mu_1,\Sigma_1},\ldots,p_{\mu_n,\Sigma_n}\}$ be a set of $n$ probability density functions (PDFs) of $d$-variate normal distributions weighted by $w\in\Delta_n$, where the PDF of a multivariate normal distribution of mean $\mu$ and covariance matrix $\Sigma$ is given by
$p_{\mu,\Sigma}(x) = \frac{1}{(2\pi)^{\frac{d}{2}}\sqrt{\det(\Sigma)}}\exp\!\left(-\frac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)\right).$
Let $\lambda_i = (\mu_i,\Sigma_i)$ be the ordinary parameterization of the normal distribution $p_{\mu_i,\Sigma_i}$. The family $\mathcal{F} = \{p_{\mu,\Sigma}(x) : \mu\in\mathbb{R}^d, \Sigma\in\mathrm{Sym}^{++}(\mathbb{R},d)\}$ of multivariate normal distributions forms an exponential family with the dual natural $\theta$- and moment $\eta$-parameterizations [7] given by
$\theta(\lambda) = (\theta_v,\theta_M) = \left(\Sigma^{-1}\mu,\ \tfrac{1}{2}\Sigma^{-1}\right), \qquad \eta(\lambda) = \left(\mu,\ -(\Sigma+\mu\mu^\top)\right),$
when choosing the sufficient statistic $t(x) = (x, -x x^\top)$. The Jeffreys divergence between two $d$-variate normal distributions $N(\mu_1,\Sigma_1)$ and $N(\mu_2,\Sigma_2)$ is given by the formula
$D_J(p_{\mu_1,\Sigma_1}, p_{\mu_2,\Sigma_2}) = (\mu_2-\mu_1)^\top\frac{\Sigma_1^{-1}+\Sigma_2^{-1}}{2}(\mu_2-\mu_1) + \mathrm{tr}\!\left(\frac{\Sigma_1^{-1}\Sigma_2+\Sigma_2^{-1}\Sigma_1}{2}\right) - d.$
The left and right Kullback–Leibler barycenters amount to the corresponding right and left Bregman barycenters [28] induced by the cumulant function
$F(\theta) = F(\theta_v,\theta_M) = \frac{1}{2}\left(d\log\pi - \log\det(\theta_M) + \frac{1}{2}\theta_v^\top\theta_M^{-1}\theta_v\right),$
and the gradient of $F(\theta)$ defines the dual moment parameter with
$\eta(\theta) = \nabla F(\theta) = \left(\frac{1}{2}\theta_M^{-1}\theta_v,\ -\frac{1}{2}\theta_M^{-1} - \frac{1}{4}(\theta_M^{-1}\theta_v)(\theta_M^{-1}\theta_v)^\top\right).$
The reciprocal gradient is given by
$\theta(\eta) = \theta(\eta_v,\eta_M) = (\nabla F)^{-1}(\eta) = \left(\theta_v = -(\eta_M+\eta_v\eta_v^\top)^{-1}\eta_v,\ \theta_M = -\tfrac{1}{2}(\eta_M+\eta_v\eta_v^\top)^{-1}\right).$
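A sketch of these conversions in NumPy (illustrative names; the pair $(\theta_v,\theta_M)$ is handled as a tuple here), which is the main ingredient needed to run the Gauss–Bregman double sequence on multivariate normals:

```python
import numpy as np

def theta_of_lambda(mu, Sigma):
    """(mu, Sigma) -> natural parameters (theta_v, theta_M) = (Sigma^{-1} mu, 1/2 Sigma^{-1})."""
    P = np.linalg.inv(Sigma)
    return P @ mu, 0.5 * P

def grad_F(theta_v, theta_M):
    """eta = (1/2 theta_M^{-1} theta_v, -1/2 theta_M^{-1} - 1/4 (theta_M^{-1} theta_v)(theta_M^{-1} theta_v)^T)."""
    m = np.linalg.solve(theta_M, theta_v)
    return 0.5 * m, -0.5 * np.linalg.inv(theta_M) - 0.25 * np.outer(m, m)

def grad_F_inv(eta_v, eta_M):
    """theta = (-(eta_M + eta_v eta_v^T)^{-1} eta_v, -1/2 (eta_M + eta_v eta_v^T)^{-1})."""
    S = np.linalg.inv(eta_M + np.outer(eta_v, eta_v))
    return -S @ eta_v, -0.5 * S
```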
The Gauss–Bregman center is an $(A, m_F)$-inductive center, which can be approximated by carrying out a prescribed number $T$ of iterations of the Gauss–Bregman double sequence.
Although the Rao distance between two d-variate normal distributions is not available in closed form when d > 1 [50,51], the Jeffreys–Fisher–Rao center can be computed in closed form. Indeed, the sided Kullback–Leibler centroids of multivariate normal distributions amount to reverse-sided Bregman centroids [28], and the Fisher–Rao geodesic midpoint between two multivariate normal distributions was recently reported in [32]. Appendix C concisely describes the method of Kobayashi [32], which allows one to obtain the Fisher–Rao midpoints of multivariate normal distributions. An implementation of that algorithm is available in the Python software library pyBregMan [52].
Thus, the Jeffreys–Fisher–Rao center is available in closed form:
Theorem 6
(JFR center of MVNs). The Jeffreys–Fisher–Rao center of a finite set of weighted multivariate normal distributions is available in closed form.
Note that the Fisher–Rao distance between normal distributions is invariant under the action of the positive affine group [50], as are the Jeffreys centroid, the JFR center, and the GB center. Figure 6 shows several examples of the JFR and GB centers of two univariate normal distributions. We can observe that those centers are close to each other although they are distinct when the normal distributions do not share the same means and covariance matrices.
Figure 7 shows the various centroids/centers between two bivariate normal distributions displayed as ellipsoids centered as their means. Observe that the inductive Gauss–Bregman center is visually closer than the Jeffreys–Fisher–Rao center to the Jeffreys centroid.
Figure 8 displays the various centroids and centers for pairs of bivariate normal distributions centered at the same mean. Figure 9 shows the centroids and centers for pairs of bivariate normal distributions with the same covariance matrix.
Remark 1.
In general, an exponential family may be characterized equivalently by two convex functions: (1) its log-normalizer $F(\theta)$ or (2) its partition function $Z(\theta) = \exp(F(\theta))$, which is log-convex and hence also convex [53]. It has been shown that the Bregman divergence $B_Z$ for $Z(\theta) = \det(\theta)^{-\frac{1}{2}}$ (convex) corresponds to the reverse extended Kullback–Leibler divergence between unnormalized PDFs of normal distributions:
$B_Z(\theta_1:\theta_2) = D_{\mathrm{KL}}^+(\tilde p_{\lambda(\theta_2)} : \tilde p_{\lambda(\theta_1)}),$
where $\tilde p_{\mu,\Sigma} = \exp\!\left(-\frac{1}{2}(x-\mu)^\top\Sigma^{-1}(x-\mu)\right)$ and the extended KLD between two positive measures is given by
$D_{\mathrm{KL}}^+(m_1:m_2) = \int\left(m_1(x)\log\frac{m_1(x)}{m_2(x)} + m_2(x) - m_1(x)\right)\mathrm{d}\mu(x).$
Remark 2.
We may further define yet another center for multivariate normal distributions by considering the Fisher–Rao isometric embedding of the Fisher–Rao $d$-variate normal manifold $M = \{p_{\mu,\Sigma}\}$ into the Fisher–Rao $(d+1)$-variate centered manifold $N_0^+ = \{q_P(y) = p_{0,P}(y) : P\in\mathrm{Sym}^{++}(\mathbb{R},d+1)\}$ using the Calvo and Oller mapping [50]:
$f(\mu,\Sigma) := \begin{pmatrix}\Sigma+\mu\mu^\top & \mu\\ \mu^\top & 1\end{pmatrix}.$
Let $\bar M = \{f(p) : p\in M\}$ denote the embedded submanifold of codimension one in $N_0^+$. The Calvo–Oller center is then defined by taking the Fisher–Rao midpoint $q_{\mathrm{CO}}$ of $q_{P_1}$ and $q_{P_2}$, projecting $q_{\mathrm{CO}}$ onto $\bar M$ to obtain $q_{\mathrm{CO}}^\perp$, and converting $q_{\mathrm{CO}}^\perp$ into $p_{\mathrm{CO}}\in M$ using the inverse mapping $f^{-1}$ [51].
The Fisher orthogonal projection of a $(d+1)\times(d+1)$ matrix $P\in N_0^+$ onto the submanifold $\bar M$ is performed as follows: Let $\beta = P_{d+1,d+1}$ and write $P = \begin{pmatrix}\Sigma+\beta\mu\mu^\top & \beta\mu\\ \beta\mu^\top & \beta\end{pmatrix}$. Then, the orthogonal projection of $P$ onto $\bar M$ is $\begin{pmatrix}\Sigma+\mu\mu^\top & \mu\\ \mu^\top & 1\end{pmatrix}$. See [51] for details of the Calvo and Oller embedding/projection method.

5. Experiments

We run all experiments on a Dell Inspiron 5502 (Intel Core i7-1165G7 @ 2.80 GHz) using compiled Java programs. For each experiment, we consider a set of $n = 2$ uniformly sampled random histograms with $d$ bins (i.e., points in $\Delta_d$) and calculate the numerical Jeffreys centroid, which requires the time-consuming Lambert W function, the GB center, and the JFR center. For each prescribed value of $d$, we run 10,000 experiments to collect various statistics like the average and maximum approximation factors and running times. The approximations of the JFR and GB methods are calculated either as the approximation of the Jeffreys information (Equation (18)) or as the approximation of the centers with respect to the numerical Jeffreys centroids measured using the total variation distance. Table 1 is a verbatim export of our experimental results as we range the dimension of the histograms from $d = 2$ to $d = 256$, doubling the dimension at each round. The inductive GB center is stopped when the total variation $\frac{1}{2}\|a_t - g_t\|_1 \leq 10^{-8}$.
We observe that the JFR center is faster to compute than the GB center but the GB center is of higher quality (i.e., a better approximation with a lower ϵ ) than the JFR center to approximate the numerical Jeffreys centroid.
Another test consists of choosing $d = 3$ and the following two 3D normalized histograms: $(\frac{1}{3},\frac{1}{3},\frac{1}{3})$ and $(1-\alpha, \alpha/2, \alpha/2)$ for $\alpha\in\{10^{-1}, 10^{-2},\ldots,10^{-7},10^{-8}\}$. Table 2 reports the experiments. The objective is to find a setting where both the JFR and GB centers are distinguished from the Jeffreys centroid. We see that as we decrease $\alpha$, the approximation factor $\epsilon$ gets worse for both the JFR center and the GB center. The JFR center is often faster to compute than the GB inductive center, but the approximation of the GB center is better than the JFR approximation.
Finally, we implemented the Gauss–Bregman and Jeffreys–Fisher–Rao centers and Jeffreys centroid using multi-precision arithmetic. We report the following experiments using 200-digit precision arithmetic for the following input of two normalized histograms: p = ( 0.1 , 0.9 ) and q = ( 0.8 , 0.2 ) . We report the various first 17-digit mantissas obtained with the corresponding Jeffreys information:
  • Jeffreys center: ( 0.42490383904214813 , 0.575096160957851866 )
    Jeffreys information: 1.2490723231955352 .
  • Gauss–Bregman center: (0.42490383904276856, 0.575096160957231439)
    Jeffreys information: 1.2490723231955353 .
  • Jeffreys–Fisher–Rao center: (0.42490390202906282, 0.575096097970937175)
    Jeffreys information: 1.2490723232068266 .
The total variation distance between the Jeffreys centroid and the Gauss–Bregman center is $6.204271148284087422350000686372\times 10^{-13}$.
The total variation distance between the Jeffreys centroid and the Jeffreys–Fisher–Rao center is $6.298691469047911984039363762611\times 10^{-8}$.
The total variation distance between the Gauss–Bregman center and the Jeffreys–Fisher–Rao center is $6.298629426336429143165140262604\times 10^{-8}$.
Although all those points are close to each other, they are all distinct points (note that using the limited precision of the IEEE 754 floating point standard may yield a misleading interpretation of experiments).

6. Conclusions and Discussion

In this work, we considered the Jeffreys centroid of a finite weighted set of densities of a given exponential family E = { p θ ( x ) : θ Θ } . This Jeffreys centroid amounts to a symmetrized Bregman centroid on the corresponding weighted set of natural parameters of the densities [28]. In general, the Jeffreys centroids do not admit closed-form formulas [28,31] except for sets of same-mean normal distributions [29] (see Appendix B).
In this paper, we interpreted the closed-form formula for same-mean multivariate normal distributions in two different ways:
  • First, as the Fisher–Rao geodesic midpoint of the sided Kullback–Leibler centroids. This interpretation lets us relax the midpoint definition to arbitrary exponential families to define the Jeffreys–Fisher–Rao center (the JFR center of Definition 2);
  • Second, as an inductive $(A, m_F)$ center using a multivariate Gauss-type double sequence, which converges to the Gauss–Bregman center (the GB center of Definition 3). The latter definition yields an extension of Nakamura's arithmetic–harmonic $(A,H)$ mean [41] to an arbitrary $(A, m_F)$ mean, for which we proved convergence under a separability condition in Theorem 3. A convergence proof remains to be established in the general case, although we observed convergence in practice when $F(\theta)$ is the cumulant function of the categorical or multivariate normal families.
In general, the Jeffreys, JFR, and GB centers differ from each other (e.g., in the case of categorical distributions). But for sets of same-mean normal distributions, they all remarkably coincide: this was the point of departure of this research. We reported generic or closed-form formulas for the JFR centers of (a) uni-order parametric families in Section 4.1 (Theorem 4), (b) categorical families in Section 4.2 (Theorem 5), and (c) multivariate normal families in Section 4.3 (Theorem 6). Table 3 summarizes the new results obtained in this paper and gives references to prior work. Notice that in practice, we approximate the Gauss–Bregman center by prescribing a number of iterations $T\in\mathbb{N}$ for the Gauss–Bregman double sequence to obtain $m_{\mathrm{GB}}^{(T)}$. Prescribing the number of GB iterations $T$ allows us to tune the time complexity of computing $m_{\mathrm{GB}}^{(T)}$ while adjusting the quality of the approximation of the Jeffreys centroid.
In applications requiring the Jeffreys centroid, we thus propose to either use the fast Jeffreys–Fisher–Rao center when a closed-form formula is available for the family of distributions at hand or use the Gauss–Bregman center approximation with a prescribed number of iterations as a drop-in replacement of the numerical Jeffreys centroids while keeping the Jeffreys divergence (the centers we defined are not centroids as we do not exhibit distances from which they are population minimizers).
More generally, let us rephrase the results in a purely geometric setting using the framework of information geometry [14]: let $P_1,\ldots,P_n$ be a set of $n$ points weighted by a vector $w\in\Delta_n$ on an $m$-dimensional dually flat space $(M, g, \nabla, \nabla^*)$ with the $\nabla$-affine coordinate system $\theta(\cdot)$ and the dual $\nabla^*$-affine coordinate system $\eta(\cdot)$, where $\nabla$ and $\nabla^*$ are two torsion-free dual affine connections. The Riemannian metric $g$ is a Hessian metric [54], which may be expressed in the $\theta$-coordinate system as $g(\theta) = \nabla^2 F(\theta)$ or in the dual coordinate system as $g(\eta) = \nabla^2 F^*(\eta)$, where $F(\theta)$ and $F^*(\eta)$ are dual convex potential functions related by the Legendre–Fenchel transform [14,54]. Let $\eta_i = \nabla F(\theta_i)$ and $\theta_i = \nabla F^*(\eta_i)$ be the coordinates of point $P_i$ in the $\eta$- and $\theta$-coordinate systems, respectively. An arbitrary point $P$ can be either referenced in the $\theta$-coordinate system ($P = P_\theta$) or in the $\eta$-coordinate system ($P = P_\eta$). Then, the Jeffreys–Fisher–Rao center is defined as the midpoint with respect to the Levi-Civita connection $\bar\nabla = \frac{\nabla+\nabla^*}{2} = \nabla^g$ of $g$:
$C_{\mathrm{JFR}} := \gamma_{\bar\nabla}\!\left(C_{\bar\theta}, C_{\underline\theta}, \tfrac{1}{2}\right) =: C_{\bar\theta}\# C_{\underline\theta}.$
The point $C_{\bar\theta}$ is the centroid with respect to the canonical flat divergence $D(P:Q) = F(\theta(P)) + F^*(\eta(Q)) - \sum_{i=1}^m\theta^i(P)\,\eta_i(Q)$, and the point $C_{\underline\theta}$ is the centroid with respect to the dual canonical flat divergence $D^*(P:Q) := D(Q:P)$. The canonical divergence is expressed using the mixed coordinates $\theta/\eta$ but can also be expressed using the $\theta$-coordinates as an equivalent Bregman divergence $D(P:Q) = B_F(\theta(P):\theta(Q))$ or as a reverse dual Bregman divergence $D(P:Q) = B_{F^*}(\eta(Q):\eta(P))$. This JFR center $C_{\mathrm{JFR}}$ approximates the symmetrized centroid with respect to the canonical symmetrized divergence $S(P,Q) = D(P:Q) + D(Q:P)$ (i.e., the Jeffreys divergence when written using the $\theta$-coordinate system). This symmetrized divergence $S(P,Q)$ can be interpreted as the energy of the Riemannian length element $\mathrm{d}s$ along the primal geodesic $\gamma(t)$ and the dual geodesic $\gamma^*(t)$ (with $\gamma(0) = \gamma^*(0) = P$ and $\gamma(1) = \gamma^*(1) = Q$), see [14]: $S(P,Q) = \int_0^1\mathrm{d}s^2(\gamma(t))\,\mathrm{d}t = \int_0^1\mathrm{d}s^2(\gamma^*(t))\,\mathrm{d}t$. The Riemannian distance $\rho(P,Q)$ corresponds to the Riemannian length element integrated along the Riemannian geodesic $\bar\gamma(t)$ induced by the Levi-Civita connection $\bar\nabla = \frac{\nabla+\nabla^*}{2}$: $\rho(P,Q) = \int_0^1\mathrm{d}s(\bar\gamma(t))\,\mathrm{d}t$.
The inductive Gauss–Bregman center $C_{\mathrm{GB}}$ is obtained as the limit of a sequence of iteratively taking the $\nabla$-midpoints and $\nabla^*$-midpoints with respect to the $\nabla$ and $\nabla^*$ connections. Those midpoints correspond to the right and left centroids $C_{t+1}$ and $C_{t+1}^*$ with respect to $D(\cdot:\cdot)$:
$C_{t+1} = \gamma_{\nabla}\!\left(C_t, C_t^*, \tfrac{1}{2}\right), \qquad C_{t+1}^* = \gamma_{\nabla^*}\!\left(C_t, C_t^*, \tfrac{1}{2}\right),$
initialized with $\theta(C_0) = \sum_{i=1}^n w_i\,\theta(P_i)$ and $\eta(C_0^*) = \sum_{i=1}^n w_i\,\eta(P_i)$. We have $C_0 = \arg\min_{C\in M}\sum_i w_i\, D(P_i:C)$ and $C_0^* = \arg\min_{C\in M}\sum_i w_i\, D^*(P_i:C)$. Figure 10 geometrically illustrates the double sequence of iteratively taking dual geodesic midpoints to converge toward the Gauss–Bregman center $C_{\mathrm{GB}}$. Thus, the GB double sequence can be interpreted as a geometric optimization technique. Figure 11 illustrates the JFR and GB centers on a dually flat space. Notice that $C_{\mathrm{JFR}}$ has coordinates $\mathrm{JFR}_F(\mathcal{P}_\theta; w)$ in the $\theta$-chart and coordinates $\mathrm{JFR}_{F^*}(\mathcal{P}_\eta; w)$ in the $\eta$-chart. Similarly, $C_{\mathrm{GB}}$ has coordinates $\mathrm{GB}_F(\bar\theta,\underline\theta)$ in the $\theta$-chart and coordinates $\mathrm{GB}_{F^*}(\bar\eta,\underline\eta)$ in the $\eta$-chart.
As a final remark, let us emphasize that choosing a proper mean or center depends on the application at hand [55,56]. For example, in Bayesian hypothesis testing, the Chernoff mean [57] is used to upper bound Bayes’ error and has been widely used in information fusion [18] for its empirical robustness [58] in practice. Jeffreys centroid has been successfully used in information retrieval tasks [6].

Funding

This research received no external funding.

Informed Consent Statement

Not applicable.

Data Availability Statement

No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest

The author Frank Nielsen is employed by Sony Computer Science Laboratories Inc. The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest, and declares no conflicts of interest.

Appendix A. Numerical Jeffreys Centroids for Categorical Distributions

Algorithm A1 implements the method described in [31] for numerically finely approximating the Jeffreys centroid of a weighted set of categorical distributions.
Algorithm A1: Numerical approximation of the SKL/Jeffreys centroid for categorical distributions.
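Algorithm A1 is reproduced here only as an image; the following is a hedged sketch of the bisection of [31] §III.B (SciPy; illustrative names; we use the simplex normalization $\sum_j c_j(\lambda) = 1$ as the root-finding criterion, and the bracketing and tolerance choices are ours):

```python
import numpy as np
from scipy.special import lambertw

def jeffreys_centroid_categorical(P, w, tol=1e-12):
    """Numerical Jeffreys centroid of categorical distributions (rows of P, weights w):
    bisect on lambda >= 0 so that the Lambert W formula c(lambda) sums to one."""
    P, w = np.asarray(P, float), np.asarray(w, float)
    a = w @ P                                  # weighted arithmetic mean
    g = np.prod(P ** w[:, None], axis=0)
    g = g / g.sum()                            # normalized weighted geometric mean

    def c(lam):
        return a / lambertw(a / g * np.exp(1.0 + lam), 0).real

    lo, hi = 0.0, 1.0
    while c(hi).sum() > 1.0:                   # c(lambda).sum() decreases with lambda
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if c(mid).sum() > 1.0 else (lo, mid)
    return c(0.5 * (lo + hi))
```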

Appendix B. Closed-Form Formula for the Symmetrized Log Det Centroids

Consider a set $\mathcal{P} = \{P_1,\ldots,P_n\}$ of $n$ symmetric positive-definite matrices of the $d$-dimensional SPD cone $\mathrm{Sym}^{++}(d,\mathbb{R})$ weighted by a vector $w = (w_1,\ldots,w_n)\in\Delta_n$ such that $P_i$ has weight $w_i$ for $i\in\{1,\ldots,n\}$. The log det divergence [59] is a Bregman divergence induced by the strictly convex and differentiable generator $F_{\mathrm{ld}}(X) = -\log\det(X)$ on $\mathrm{Sym}^{++}(d,\mathbb{R})$ equipped with the inner product $\langle X,Y\rangle = \mathrm{tr}(XY)$ for $X,Y\in\mathrm{Sym}(d,\mathbb{R})$:
$D_{\mathrm{ld}}(X:Y) = B_{F_{\mathrm{ld}}}(X:Y) = F_{\mathrm{ld}}(X) - F_{\mathrm{ld}}(Y) - \langle X-Y, \nabla F_{\mathrm{ld}}(Y)\rangle.$
Since $\nabla F_{\mathrm{ld}}(X) = -\nabla\log\det(X) = -\frac{\nabla\det(X)}{\det(X)} = -(X^{-1})^\top$ (hence $\nabla F_{\mathrm{ld}}(X) = -X^{-1}$ for symmetric matrices), it follows that the log det divergence is
$D_{\mathrm{ld}}(X:Y) = \log\det(YX^{-1}) + \mathrm{tr}\!\left((X-Y)Y^{-1}\right) = \mathrm{tr}(XY^{-1}) - \log\det(XY^{-1}) - d,$
using the properties that $\det(X)\det(Y) = \det(XY)$, $\det(X^{-1}) = \frac{1}{\det(X)}$, and $\mathrm{tr}(I) = d$, where $I$ denotes the $d\times d$ identity matrix. When $d = 1$, we recover the Itakura–Saito divergence [45] obtained for $F_{\mathrm{IS}}(x) = -\log x$ (Burg negative entropy) with $F_{\mathrm{IS}}'(x) = -\frac{1}{x}$:
$D_{\mathrm{IS}}(x:y) = B_{F_{\mathrm{IS}}}(x:y) = \frac{x}{y} - \log\frac{x}{y} - 1, \qquad x, y > 0.$
The log det divergence is known in statistics as Stein’s loss [37,38] and has been used for estimating covariance matrices. The log det divergence S ld satisfies the following invariance properties:
  • Inversion invariance: S ld ( X 1 , Y 1 ) = S ld ( X , Y ) ;
  • Congruence invariance: for any invertible matrix $A\in\mathrm{GL}(d)$, we have $S_{\mathrm{ld}}(AXA^\top, AYA^\top) = S_{\mathrm{ld}}(X,Y)$.
The Jeffreys’ symmetrized log det divergence (SLD) is thus
$S_{\mathrm{ld}}(X,Y) = D_{\mathrm{ld}}(X:Y) + D_{\mathrm{ld}}(Y:X) = \mathrm{tr}\!\left((Y^{-1}-X^{-1})(X-Y)\right)$
$= \mathrm{tr}\!\left(X^{-1}Y + Y^{-1}X - 2I\right).$
When $d = 1$, the SLD corresponds to the COSH distance [60] (COSine Hyperbolic distance, the symmetrized Itakura–Saito divergence):
$D_{\mathrm{COSH}}(x:y) = (y-x)\left(\frac{1}{x}-\frac{1}{y}\right) = \frac{x}{y} + \frac{y}{x} - 2.$
Consider a family $\mathcal{N}_\mu = \{p_{\mu,\Sigma_1},\ldots,p_{\mu,\Sigma_n}\}$ of $n$ multivariate normal distributions centered at the same mean $\mu\in\mathbb{R}^d$ with covariance matrices $\Sigma_1,\ldots,\Sigma_n$. The set of same-mean normal distributions forms an exponential family with natural parameter $\theta = \Sigma^{-1}$ (the precision matrix) corresponding to the sufficient statistic $t(x) = -\frac{1}{2}(x-\mu)(x-\mu)^\top$, and log-normalizer $F(\theta) = -\frac{1}{2}\log\det(\theta)$. Thus, the Kullback–Leibler divergence between $p_{\mu,\Sigma_i}$ and $p_{\mu,\Sigma_j}$ corresponds to a log det divergence [16]:
$D_{\mathrm{KL}}[p_{\mu,\Sigma_i} : p_{\mu,\Sigma_j}] = B_F(\theta_j:\theta_i) = \tfrac{1}{2}\, D_{\mathrm{ld}}(\Sigma_j^{-1}:\Sigma_i^{-1}),$
and therefore the Jeffreys divergence $D_J[p_{\mu,\Sigma_i}, p_{\mu,\Sigma_j}]$ corresponds to the matrix COSH/symmetrized log-det divergence:
$D_J[p_{\mu,\Sigma_i}, p_{\mu,\Sigma_j}] = \tfrac{1}{2}\, S_{\mathrm{ld}}(\Sigma_i^{-1},\Sigma_j^{-1}) = \tfrac{1}{2}\,\mathrm{tr}\!\left((\Sigma_i^{-1}-\Sigma_j^{-1})(\Sigma_j-\Sigma_i)\right).$
The left KL centroid corresponds to a right Bregman centroid on the natural parameters (the center of mass of the natural parameters, $\sum_{i=1}^n w_i\Sigma_i^{-1}$), which corresponds to a weighted matrix harmonic mean on the covariance matrices:
$C_L^{\mathrm{KL}} = C_R^{B_F} = \left(\sum_{i=1}^n w_i\Sigma_i^{-1}\right)^{-1}.$
The right KL centroid is a left Bregman centroid (i.e., a quasi-arithmetic mean for $h(X) = X^{-1}$ with $h^{-1}(Y) = Y^{-1}$), whose precision matrix is the inverse of the weighted arithmetic mean of the covariance matrices, i.e., whose covariance matrix is that weighted arithmetic mean:
$C_R^{\mathrm{KL}} = C_L^{B_F} = \sum_{i=1}^n w_i\Sigma_i.$
We state the remarkable case of the closed-form formula for the symmetrized Bregman logdet centroid:
Proposition A1
([29]). The symmetrized log det centroid of a set $\mathcal{P}_w = \{(w_i,P_i)\}$ of $n$ weighted positive-definite matrices is $A\# H$, where $A = \sum_i w_i P_i$ and $H = \left(\sum_i w_i P_i^{-1}\right)^{-1}$ are the weighted arithmetic and harmonic means and $A\# B$ is the matrix geometric mean.
Since the proof was only briefly sketched in [29], we report a full-length proof for the sake of completeness:
Proof. 
We have
$\min_X\ \sum_i w_i\, S_{\mathrm{ld}}(X, P_i) \equiv \min_X\ \mathrm{tr}\!\left(X^{-1}A + H^{-1}X\right).$
Setting the gradient of the right-hand-side term to zero using matrix calculus [61] yields
X tr X 1 A + H 1 X = tr X ( X 1 A + H 1 X ) = 0 .
Using the matrix calculus property $\mathrm{d}(X^{-1}) = -X^{-1}(\mathrm{d}X)X^{-1}$ applied to $\mathrm{tr}(X^{-1}A)$, together with $\nabla_X\,\mathrm{tr}(H^{-1}X) = H^{-1}$, we obtain
$$X^{-1}AX^{-1} - H^{-1} = 0.$$
That is, we need to solve the following Riccati equation:
$$X^{-1}AX^{-1} = H^{-1}.$$
The well-known Riccati equation $XA^{-1}X = B$ has solution [40] $X = A \# B$, and therefore we obtain
$$X^{-1} = A^{-1} \# H^{-1}.$$
Finally, we use the invariance property of the geometric mean under matrix inversion, $(A^{-1} \# H^{-1})^{-1} = A \# H$, to obtain the result $C_{S_{\mathrm{ld}}} = A \# H$. □
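The proposition can be checked numerically. The sketch below is a toy verification written for this exposition (geomean and sld are our helper names, not the paper's code): it draws random weighted SPD matrices, forms $A \# H$, and verifies both the Riccati stationarity condition used in the proof and that the weighted SLD objective at $A \# H$ is no larger than at the sided means $A$ and $H$.

```python
import numpy as np
from scipy.linalg import sqrtm

def geomean(X, Y):
    """Matrix geometric mean X # Y = X^{1/2} (X^{-1/2} Y X^{-1/2})^{1/2} X^{1/2}."""
    Xh = sqrtm(X)
    Xmh = np.linalg.inv(Xh)
    return np.real(Xh @ sqrtm(Xmh @ Y @ Xmh) @ Xh)

def sld(X, Y):
    """Symmetrized log det divergence tr(X^{-1} Y + Y^{-1} X - 2 I)."""
    d = X.shape[0]
    return np.trace(np.linalg.inv(X) @ Y + np.linalg.inv(Y) @ X) - 2 * d

rng = np.random.default_rng(2)
n, d = 4, 3
mats = []
for _ in range(n):
    B = rng.standard_normal((d, d))
    mats.append(B @ B.T + d * np.eye(d))          # random well-conditioned SPD matrices
w = rng.dirichlet(np.ones(n))                      # random weight vector on the simplex

A = sum(wi * Pi for wi, Pi in zip(w, mats))                                 # arithmetic mean
H = np.linalg.inv(sum(wi * np.linalg.inv(Pi) for wi, Pi in zip(w, mats)))   # harmonic mean
X = geomean(A, H)                                                           # candidate centroid A # H

# Stationarity: X solves the Riccati equation X^{-1} A X^{-1} = H^{-1}.
Xi = np.linalg.inv(X)
print(np.allclose(Xi @ A @ Xi, np.linalg.inv(H)))

# The (convex) weighted SLD objective at A # H is no larger than at the sided means.
obj = lambda C: sum(wi * sld(C, Pi) for wi, Pi in zip(w, mats))
print(obj(X) <= min(obj(A), obj(H)))
```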
The Riemannian Hessian metric $g(\theta)$ induced by $F(\theta) = -\frac{1}{2}\log\det(\theta)$ is
$$g_\theta(S_1, S_2) = \mathrm{tr}\big(\theta^{-1} S_1\, \theta^{-1} S_2\big),$$
where S 1 and S 2 are two symmetric matrices of the tangent space T θ at θ . The metric tensor g is commonly called the trace metric or Affine-Invariant Riemannian Metric (AIRM) [62].
It follows that the Riemannian geodesic midpoint is the matrix geometric mean [63] given by
$$X \# Y = X^{\frac{1}{2}}\big(X^{-\frac{1}{2}}\, Y\, X^{-\frac{1}{2}}\big)^{\frac{1}{2}} X^{\frac{1}{2}}.$$
We have ρ ( X , X # Y ) = ρ ( X # Y , Y ) , where ρ ( · , · ) denotes the geodesic length distance on the Riemannian manifold. The geodesic length is given by the following formula [64,65]:
$$\rho(P_1, P_2) = \Big\|\log\big(P_1^{-\frac{1}{2}} P_2 P_1^{-\frac{1}{2}}\big)\Big\|_F = \sqrt{\sum_{i=1}^d \log^2 \lambda_i\big(P_1^{-\frac{1}{2}} P_2 P_1^{-\frac{1}{2}}\big)},$$
where the $\lambda_i(X)$'s denote the eigenvalues of $X$ (equivalently, the generalized eigenvalues of the matrix pencil $(P_2, P_1)$).
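A short SciPy sketch (the helper names airm_dist and geomean are ours, not the paper's implementation) checks numerically that the matrix geometric mean $X \# Y$ is the Riemannian midpoint for the trace metric, and that the distance above can be read off the generalized eigenvalues of the pencil $(P_2, P_1)$.

```python
import numpy as np
from scipy.linalg import sqrtm, logm, eigh

def geomean(X, Y):
    """Matrix geometric mean X # Y = X^{1/2} (X^{-1/2} Y X^{-1/2})^{1/2} X^{1/2}."""
    Xh = sqrtm(X)
    Xmh = np.linalg.inv(Xh)
    return np.real(Xh @ sqrtm(Xmh @ Y @ Xmh) @ Xh)

def airm_dist(P1, P2):
    """Trace-metric (AIRM) geodesic distance: Frobenius norm of log(P1^{-1/2} P2 P1^{-1/2})."""
    P1mh = np.linalg.inv(sqrtm(P1))
    return np.linalg.norm(logm(P1mh @ P2 @ P1mh), ord='fro')

rng = np.random.default_rng(3)
B1 = rng.standard_normal((3, 3)); X = B1 @ B1.T + 3 * np.eye(3)
B2 = rng.standard_normal((3, 3)); Y = B2 @ B2.T + 3 * np.eye(3)
M = geomean(X, Y)

# X # Y is the Riemannian midpoint: equidistant from X and Y, at half the total distance.
print(np.isclose(airm_dist(X, M), airm_dist(M, Y)))
print(np.isclose(airm_dist(X, M), 0.5 * airm_dist(X, Y)))

# The distance can also be computed from the generalized eigenvalues of the pencil (Y, X).
lam = eigh(Y, X, eigvals_only=True)
print(np.isclose(airm_dist(X, Y), np.sqrt(np.sum(np.log(lam) ** 2))))
```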
We state the theorem geometrically characterizing the Jeffreys centroid of a weighted set of centered multivariate normal distributions.
Theorem A1
(The Jeffreys centroid of n weighted centered multivariate normal distributions). The Jeffreys centroid $C_S$ of a weighted set $\{p_{\mu,\Sigma_i}\}$ of centered normal distributions $N(\mu, \Sigma_i)$ with weight vector $w \in \Delta_n$ corresponds to the midpoint of the Fisher–Rao geodesic linking the left and right SKL centroids:
$$C_S = \Big(\sum_{i=1}^n w_i \Sigma_i\Big) \# \Big(\sum_{i=1}^n w_i \Sigma_i^{-1}\Big)^{-1},$$
where X # Y is the geometric matrix mean:
$$X \# Y = X^{\frac{1}{2}}\big(X^{-\frac{1}{2}}\, Y\, X^{-\frac{1}{2}}\big)^{\frac{1}{2}} X^{\frac{1}{2}}.$$
This result first appeared in [29] (Lemma 17.4.3, item 3) and also appeared in an indirect but more general form in [66] (Theorem 5.3). Indeed, in [66], the authors define the regularized symmetric log det divergence as follows:
$$S_{\mathrm{ld}}^{\epsilon}(X, Y) = \mathrm{tr}\Big((X - Y)\big((Y + \epsilon I)^{-1} - (X + \epsilon I)^{-1}\big)\Big), \qquad \epsilon > 0.$$
This extended definition of the symmetrized log det divergence allows one to consider degenerate positive semi-definite matrices.

Appendix C. Fisher–Rao Midpoint for Multivariate Normal Distributions

The expression of the Fisher–Rao geodesics passing through two given d-variate normal distributions (MVNs) was elucidated in [32]. We give below the method for finding the Fisher–Rao MVN midpoints, without the underlying geometric explanation that relies on a Riemannian submersion in dimension $2d+1$ [32]. The Python software library pyBregMan [52] provides an implementation of these Fisher–Rao MVN midpoints.
Fisher–Rao geodesic midpoint $N = N(\mu, \Sigma)$ of $N_0 = N(\mu_0, \Sigma_0)$ and $N_1 = N(\mu_1, \Sigma_1)$:
  • For $i \in \{0, 1\}$, let $G_i = M_i D_i M_i^\top$, where
$$D_i = \begin{pmatrix} \Sigma_i^{-1} & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & \Sigma_i \end{pmatrix}, \qquad M_i = \begin{pmatrix} I_d & 0 & 0 \\ \mu_i^\top & 1 & 0 \\ 0 & \mu_i & I_d \end{pmatrix},$$
    where $I_d$ denotes the identity matrix of shape $d \times d$. That is, the matrices $G_0, G_1 \in \mathrm{Sym}^{++}(2d+1, \mathbb{R})$ can be expressed by block Cholesky-style factorizations.
  • Consider the Riemannian geodesic midpoint $G$ in $\mathrm{Sym}^{++}(2d+1, \mathbb{R})$ with respect to the trace metric:
$$G = G_0^{\frac{1}{2}}\big(G_0^{-\frac{1}{2}}\, G_1\, G_0^{-\frac{1}{2}}\big)^{\frac{1}{2}} G_0^{\frac{1}{2}}.$$
    In order to compute a matrix power $G^p$ for $p \in \mathbb{R}$, we first calculate the Singular Value Decomposition (SVD) of $G$ (which, for a symmetric positive-definite matrix, coincides with its eigendecomposition): $G = O L O^\top$, where $O$ is an orthogonal matrix and $L = \mathrm{diag}(\lambda_1, \ldots, \lambda_{2d+1})$ is a diagonal matrix. We then obtain the matrix power as $G^p = O L^p O^\top$ with $L^p = \mathrm{diag}(\lambda_1^p, \ldots, \lambda_{2d+1}^p)$.
  • Retrieve $N = N(\mu, \Sigma)$ from the matrix $G$:
$$\Sigma = [G]_{1:d,\,1:d}^{-1}, \qquad \mu = \Sigma\,[G]_{1:d,\,d+1},$$
    where $[G]_{1:d,\,1:d}$ denotes the block matrix with rows and columns ranging from one to $d$ extracted from the $(2d+1) \times (2d+1)$ matrix $G$, and $[G]_{1:d,\,d+1}$ is similarly the vector of $\mathbb{R}^d$ formed by the first $d$ entries of column $d+1$ of $G$.
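A minimal NumPy sketch of the three-step procedure above is given below (the helper names embed, spd_power, and fisher_rao_midpoint are ours; the pyBregMan library [52] provides a reference implementation). Matrix powers are computed via a symmetric eigendecomposition, which coincides with the SVD for SPD matrices.

```python
import numpy as np

def spd_power(G, p):
    """Matrix power G^p of a symmetric positive-definite matrix via its eigendecomposition."""
    lam, O = np.linalg.eigh(G)
    return (O * lam ** p) @ O.T

def embed(mu, Sigma):
    """Map N(mu, Sigma) to the (2d+1) x (2d+1) SPD matrix G = M D M^T of the procedure above."""
    d = mu.shape[0]
    D = np.zeros((2 * d + 1, 2 * d + 1))
    D[:d, :d] = np.linalg.inv(Sigma)   # Sigma^{-1} block
    D[d, d] = 1.0
    D[d + 1:, d + 1:] = Sigma          # Sigma block
    M = np.eye(2 * d + 1)
    M[d, :d] = mu                      # row mu^T
    M[d + 1:, d] = mu                  # column mu
    return M @ D @ M.T

def fisher_rao_midpoint(mu0, S0, mu1, S1):
    """Fisher-Rao geodesic midpoint of N(mu0, S0) and N(mu1, S1), following the three steps above."""
    d = mu0.shape[0]
    G0, G1 = embed(mu0, S0), embed(mu1, S1)
    G0h, G0mh = spd_power(G0, 0.5), spd_power(G0, -0.5)
    G = G0h @ spd_power(G0mh @ G1 @ G0mh, 0.5) @ G0h   # trace-metric midpoint in Sym++(2d+1)
    Sigma = np.linalg.inv(G[:d, :d])
    mu = Sigma @ G[:d, d]
    return mu, Sigma

mu0, S0 = np.zeros(2), np.array([[1.0, 0.2], [0.2, 0.5]])
mu1, S1 = np.array([1.0, -1.0]), np.array([[2.0, -0.3], [-0.3, 1.5]])
mu, Sigma = fisher_rao_midpoint(mu0, S0, mu1, S1)
print(mu)
print(Sigma)
```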

References

  1. Jeffreys, H. The Theory of Probability; OUP Oxford: Oxford, UK, 1998. [Google Scholar]
  2. Ben-Tal, A.; Charnes, A.; Teboulle, M. Entropic means. J. Math. Anal. Appl. 1989, 139, 537–551. [Google Scholar] [CrossRef]
  3. Nielsen, F.; Boltz, S. The Burbea-Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory 2011, 57, 5455–5466. [Google Scholar] [CrossRef]
  4. Amari, S.I. Integration of stochastic models by minimizing α-divergence. Neural Comput. 2007, 19, 2780–2796. [Google Scholar] [CrossRef]
  5. Nielsen, F. On a generalization of the Jensen–Shannon divergence and the Jensen–Shannon centroid. Entropy 2020, 22, 221. [Google Scholar] [CrossRef]
  6. Veldhuis, R. The centroid of the symmetrical Kullback-Leibler distance. IEEE Signal Process. Lett. 2002, 9, 96–99. [Google Scholar] [CrossRef]
  7. Nielsen, F. On the Jensen–Shannon symmetrization of distances relying on abstract means. Entropy 2019, 21, 485. [Google Scholar] [CrossRef]
  8. Lin, J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 1991, 37, 145–151. [Google Scholar] [CrossRef]
  9. Johnson, D.H.; Sinanovic, S. Symmetrizing the Kullback-Leibler distance. IEEE Trans. Inf. Theory 2001, 1, 1–10. [Google Scholar]
  10. Fuglede, B.; Topsoe, F. Jensen-Shannon divergence and Hilbert space embedding. In Proceedings of the International Symposium on Information Theory (ISIT), Chicago, IL, USA, 27 June–2 July 2004; IEEE: Piscataway, NJ, USA, 2004; p. 31. [Google Scholar]
  11. Sra, S. Metrics induced by Jensen-Shannon and related divergences on positive definite matrices. Linear Algebra Its Appl. 2021, 616, 125–138. [Google Scholar] [CrossRef]
  12. Vajda, I. On metric divergences of probability measures. Kybernetika 2009, 45, 885–900. [Google Scholar]
  13. Barndorff-Nielsen, O. Information and Exponential Families: In Statistical Theory; John Wiley & Sons: New York, NY, USA, 2014. [Google Scholar]
  14. Amari, S.I. Information Geometry and Its Applications; Applied Mathematical Sciences; Springer: Berlin/Heidelberg, Germany, 2016. [Google Scholar]
  15. Lloyd, S. Least squares quantization in PCM. IEEE Trans. Inf. Theory 1982, 28, 129–137. [Google Scholar] [CrossRef]
  16. Davis, J.; Dhillon, I. Differential entropic clustering of multivariate gaussians. Adv. Neural Inf. Process. Syst. 2006, 19, 337–344. [Google Scholar]
  17. Murtagh, F.; Legendre, P. Ward’s hierarchical agglomerative clustering method: Which algorithms implement Ward’s criterion? J. Classif. 2014, 31, 274–295. [Google Scholar] [CrossRef]
  18. Julier, S.; Uhlmann, J.K. General decentralized data fusion with covariance intersection. In Handbook of Multisensor Data Fusion; CRC Press: Boca Raton, FL, USA, 2017; pp. 339–364. [Google Scholar]
  19. Liu, Q.; Ihler, A.T. Distributed estimation, information loss and exponential families. Adv. Neural Inf. Process. Syst. 2014, 27, 1–9. [Google Scholar]
  20. Basseville, M. Divergence measures for statistical data processing: An annotated bibliography. Signal Process. 2013, 93, 621–633. [Google Scholar] [CrossRef]
  21. Chandrasekhar, V.; Takacs, G.; Chen, D.M.; Tsai, S.S.; Reznik, Y.; Grzeszczuk, R.; Girod, B. Compressed histogram of gradients: A low-bitrate descriptor. Int. J. Comput. Vis. 2012, 96, 384–399. [Google Scholar] [CrossRef]
  22. Seal, A.; Karlekar, A.; Krejcar, O.; Gonzalo-Martin, C. Fuzzy c-means clustering using Jeffreys-divergence based similarity measure. Appl. Soft Comput. 2020, 88, 106016. [Google Scholar] [CrossRef]
  23. Vasconcelos, N. On the efficient evaluation of probabilistic similarity functions for image retrieval. IEEE Trans. Inf. Theory 2004, 50, 1482–1496. [Google Scholar] [CrossRef]
  24. Ge, P.; Chen, Y.; Wang, G.; Weng, G. An active contour model driven by adaptive local pre-fitting energy function based on Jeffreys divergence for image segmentation. Expert Syst. Appl. 2022, 210, 118493. [Google Scholar] [CrossRef]
  25. Tabibian, S.; Akbari, A.; Nasersharif, B. Speech enhancement using a wavelet thresholding method based on symmetric Kullback–Leibler divergence. Signal Process. 2015, 106, 184–197. [Google Scholar] [CrossRef]
  26. Zhao, Q.; Zhou, G.; Zhang, L.; Cichocki, A. Tensor-variate Gaussian processes regression and its application to video surveillance. In Proceedings of the 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 5–9 May 2014; IEEE: Piscataway, NJ, USA, 2014; pp. 1265–1269. [Google Scholar]
  27. Welk, M.; Feddern, C.; Burgeth, B.; Weickert, J. Tensor median filtering and M-smoothing. In Visualization and Processing of Tensor Fields; Springer: Berlin/Heidelberg, Germany, 2006; pp. 345–356. [Google Scholar]
  28. Nielsen, F.; Nock, R. Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory 2009, 55, 2882–2904. [Google Scholar] [CrossRef]
  29. Moakher, M.; Batchelor, P.G. Symmetric positive-definite matrices: From geometry to applications and visualization. In Visualization and Processing of Tensor Fields; Springer: Berlin/Heidelberg, Germany, 2006; pp. 285–298. [Google Scholar]
  30. Sturm, K.T. Probability measures on metric spaces of nonpositive. Heat Kernels Anal. Manifolds Graphs Metr. Spaces 2003, 338, 357. [Google Scholar]
  31. Nielsen, F. Jeffreys centroids: A closed-form expression for positive histograms and a guaranteed tight approximation for frequency histograms. IEEE Signal Process. Lett. 2013, 20, 657–660. [Google Scholar] [CrossRef]
  32. Kobayashi, S. Geodesics of multivariate normal distributions and a Toda lattice type Lax pair. Phys. Scr. 2023, 98, 115241. [Google Scholar] [CrossRef]
  33. Corless, R.M.; Gonnet, G.H.; Hare, D.E.; Jeffrey, D.J.; Knuth, D.E. On the Lambert W function. Adv. Comput. Math. 1996, 5, 329–359. [Google Scholar] [CrossRef]
  34. Rockafellar, R.T. Conjugates and Legendre transforms of convex functions. Can. J. Math. 1967, 19, 200–205. [Google Scholar] [CrossRef]
  35. Bullen, P.S.; Bullen, P. Quasi-arithmetic means. In Handbook of Means and Their Inequalities; Springer: Berlin/Heidelberg, Germany, 2003; pp. 266–320. [Google Scholar]
  36. Nock, R.; Luosto, P.; Kivinen, J. Mixed Bregman clustering with approximation guarantees. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Antwerp, Belgium, 15–19 September 2008; Springer: Berlin/Heidelberg, Germany, 2008; pp. 154–169. [Google Scholar]
  37. James, W.; Stein, C. Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Los Angeles, CA, USA, 20 June–30 July 1960; Volume 1, pp. 361–379. [Google Scholar]
  38. Salehian, H.; Cheng, G.; Vemuri, B.C.; Ho, J. Recursive estimation of the Stein center of SPD matrices and its applications. In Proceedings of the IEEE International Conference on Computer Vision, Sydney, Australia, 1–8 December 2013; pp. 1793–1800. [Google Scholar]
  39. Skovgaard, L.T. A Riemannian geometry of the multivariate normal model. Scand. J. Stat. 1984, 11, 211–223. [Google Scholar]
  40. Bhatia, R. The Riemannian mean of positive matrices. In Matrix Information Geometry; Springer: Berlin/Heidelberg, Germany, 2012; pp. 35–51. [Google Scholar]
  41. Nakamura, Y. Algorithms associated with arithmetic, geometric and harmonic means and integrable systems. J. Comput. Appl. Math. 2001, 131, 161–174. [Google Scholar] [CrossRef]
  42. Nielsen, F. What is… an Inductive Mean? Not. Am. Math. Soc. 2023, 70, 1851–1855. [Google Scholar] [CrossRef]
  43. Lehmer, D.H. On the compounding of certain means. J. Math. Anal. Appl. 1971, 36, 183–200. [Google Scholar] [CrossRef]
  44. Almkvist, G.; Berndt, B. Gauss, Landen, Ramanujan, the arithmetic-geometric mean, ellipses, π, and the Ladies Diary. Am. Math. Mon. 1988, 95, 585–608. [Google Scholar]
  45. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J.; Lafferty, J. Clustering with Bregman divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749. [Google Scholar]
  46. Miyamoto, H.K.; Meneghetti, F.C.; Pinele, J.; Costa, S.I. On closed-form expressions for the Fisher–Rao distance. Inf. Geom. 2024, 1–44. [Google Scholar] [CrossRef]
  47. Nielsen, F. Approximation and bounding techniques for the Fisher-Rao distances between parametric statistical models; Handbook of Statistics; Elsevier: Amsterdam, The Netherlands, 2024. [Google Scholar] [CrossRef]
  48. Karcher, H. Riemannian center of mass and mollifier smoothing. Commun. Pure Appl. Math. 1977, 30, 509–541. [Google Scholar] [CrossRef]
  49. Čencov, N.N. Algebraic foundation of mathematical statistics. Stat. A J. Theor. Appl. Stat. 1978, 9, 267–276. [Google Scholar]
  50. Calvo, M.; Oller, J.M. A distance between multivariate normal distributions based in an embedding into the Siegel group. J. Multivar. Anal. 1990, 35, 223–242. [Google Scholar] [CrossRef]
  51. Nielsen, F. A simple approximation method for the Fisher–Rao distance between multivariate normal distributions. Entropy 2023, 25, 654. [Google Scholar] [CrossRef]
  52. Nielsen, F.; Soen, A. pyBregMan: A Python library for Bregman Manifolds. arXiv 2024, arXiv:2408.04175. [Google Scholar]
  53. Nielsen, F. Divergences Induced by the Cumulant and Partition Functions of Exponential Families and Their Deformations Induced by Comparative Convexity. Entropy 2024, 26, 193. [Google Scholar] [CrossRef]
  54. Shima, H. The Geometry of Hessian Structures; World Scientific: Singapore, 2007. [Google Scholar]
  55. De Carvalho, M. Mean, what do you Mean? Am. Stat. 2016, 70, 270–274. [Google Scholar] [CrossRef]
  56. Bullen, P.S. Handbook of Means and Their Inequalities; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013; Volume 560. [Google Scholar]
  57. Chernoff, H. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Stat. 1952, 23, 493–507. [Google Scholar] [CrossRef]
  58. Nielsen, F. Revisiting Chernoff information with likelihood ratio exponential families. Entropy 2022, 24, 1400. [Google Scholar] [CrossRef] [PubMed]
  59. Kulis, B.; Sustik, M.; Dhillon, I. Learning low-rank kernel matrices. In Proceedings of the 23rd International Conference on Machine Learning, Pittsburgh, PA, USA, 25–29 June 2006; pp. 505–512. [Google Scholar]
  60. Gray, A.; Markel, J. Distance measures for speech processing. IEEE Trans. Acoust. Speech, Signal Process. 1976, 24, 380–391. [Google Scholar] [CrossRef]
  61. Petersen, K.B.; Pedersen, M.S. The matrix cookbook. Tech. Univ. Den. 2008, 7, 510. [Google Scholar]
  62. Thanwerdas, Y.; Pennec, X. O(n)-invariant Riemannian metrics on SPD matrices. Linear Algebra Its Appl. 2023, 661, 163–201. [Google Scholar] [CrossRef]
  63. Bhatia, R.; Holbrook, J. Riemannian geometry and matrix geometric means. Linear Algebra Its Appl. 2006, 413, 594–618. [Google Scholar] [CrossRef]
  64. Siegel, C. Symplectic geometry. Am. J. Math. 1964, 65, 1–86. [Google Scholar] [CrossRef]
  65. James, A.T. The variance information manifold and the functions on it. In Multivariate Analysis–III; Elsevier: Amsterdam, The Netherlands, 1973; pp. 157–169. [Google Scholar]
  66. Kim, S.; Lawson, J.; Lim, Y. The matrix geometric mean of parameterized, weighted arithmetic and harmonic means. Linear Algebra Its Appl. 2011, 435, 2114–2131. [Google Scholar] [CrossRef]
Figure 1. Application of centroids and centers in signal processing. (Left): information fusion and mixture model simplification: a 2D Gaussian mixture model (GMM) is simplified into a single bivariate normal distribution. (Right): distributed estimation: a data set is split among $p$ processes $P_i$, which first estimate the statistical model parameters $\hat{\theta}_i$; the process models are then aggregated to yield a single consolidated model $\hat{\theta}$.
Figure 2. Visualizing the arithmetic, normalized geometric, numerical Jeffreys, Jeffreys–Fisher–Rao, and Gauss–Bregman centroids/centers in red, blue, green, purple, and yellow, respectively. (Left): the input set consists of n = 2 trinomial distributions (black) with parameters chosen randomly. (Right): the input set consists of n = 32 trinomial distributions (black) with parameters (1/2, 1/2) and (0.99, 0.005, 0.005). The numerical Jeffreys centroid (green) is time consuming to calculate using the Lambert W function. However, the Jeffreys centroid can be well approximated by either the Jeffreys–Fisher–Rao center (purple) or the inductive Gauss–Bregman center (yellow). The point centers are visualized with different radii in order to distinguish them easily.
Figure 3. (Left): Displaying the arithmetic, normalized geometric, numerical Jeffreys, Jeffreys–Fisher–Rao, and Gauss–Bregman centroids/centers in red, blue, green, purple, and yellow, respectively. The input set consists of two normalized histograms with d = 256 bins plotted as polylines with 255 line segments (black). Observe that the Jeffreys–Fisher–Rao center (purple) and the Gauss–Bregman center (yellow) approximate well the Jeffreys centroid (green), which is more computationally expensive to calculate. (Right): close-up window on the first (left) bins of the normalized histograms.
Figure 4. Geometric illustration of the double sequence inducing a Gauss–Bregman center in the limit.
Figure 5. Illustration of the double sequence convergence for scalar Gauss–Bregman ( A , m F ) mean.
Figure 6. Visualization of the Jeffreys–Fisher–Rao center and the Gauss–Bregman center of two univariate normal distributions (black circles). The exponential and mixture geodesics are shown in red and blue, respectively, with their corresponding midpoints. The Jeffreys–Fisher–Rao center is the Fisher–Rao midpoint (green) lying on the Fisher–Rao geodesic (purple). The inductive Gauss–Bregman center is displayed in yellow at double size in order to ease its comparison with the Jeffreys–Fisher–Rao center.
Figure 7. Centroids and centers between a pair of bivariate normal distributions (black). Each normal distribution $N(\mu, \Sigma)$ (parameterized by a 5D parameter $\theta$) is displayed as a 2D ellipsoid $E(\mu, \Sigma) = \{x : (x - \mu)^\top \Sigma^{-1} (x - \mu) = l\}$ for a prescribed level $l > 0$ in the sample space $\mathbb{R}^2$. Blue, red, purple, yellow, and green ellipsoids correspond to the m-geodesic midpoint, e-geodesic midpoint, Jeffreys–Fisher–Rao midpoint, Gauss–Bregman inductive mean, and numerical Jeffreys centroid (symmetrized Bregman centroid), respectively.
Figure 8. Centroids and centers between a pair of bivariate centered normal distributions (black). Each normal distribution $N(\mu, \Sigma)$ with a prescribed mean $\mu$ (parameterized by a 3D parameter $\theta$) is displayed as a 2D ellipsoid. The red and blue ellipsoids correspond to the e-geodesic and m-geodesic midpoints, respectively. The green ellipsoid is the exact Jeffreys centroid, which coincides with the inductive Gauss–Bregman center (yellow) and the Jeffreys–Fisher–Rao center (purple); these three green, yellow, and purple matching ellipsoids are thus rendered superposed in an overall shade of brown.
Figure 9. Centroids and centers between a pair of bivariate normal distributions with a prescribed covariance matrix $\Sigma$ (black). Each normal distribution $N(\mu, \Sigma)$ (parameterized by a 2D parameter $\theta$) is displayed as a 2D ellipsoid. The red and blue ellipsoids correspond to the e-geodesic and m-geodesic midpoints, respectively. The inductive Gauss–Bregman center (yellow) and the Jeffreys–Fisher–Rao center (purple) do not coincide.
Figure 10. Illustration on a dually flat space of the double sequence inducing a Gauss–Bregman center in the limit.
Figure 11. Illustration of the Jeffreys–Fisher–Rao and Gauss–Bregman centers in a dually flat space. $\gamma$ denotes the Riemannian geodesic.
Table 1. Experiments for JFR and GB centers approximating the numerical Jeffreys centroid.
| dim. | JFR Avg Info ε | JFR Max Info ε | JFR Avg TV | JFR Max TV | JFR Avg Time | JFR ×Speed | GB Avg Info ε | GB Max Info ε | GB Avg TV | GB Max TV | GB Avg Time | GB ×Speed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| d = 2 | 5.662 × 10⁻⁶ | 6.386 × 10⁻³ | 8.735 × 10⁻⁵ | 5.005 × 10⁻² | 1.614 × 10⁻⁷ | 82.541 | 1.507 × 10⁻⁴ | 9.745 × 10⁻² | 6.304 × 10⁻⁴ | 5.005 × 10⁻² | 5.072 × 10⁻⁷ | 26.258 |
| d = 4 | 1.283 × 10⁻⁵ | 5.294 × 10⁻³ | 1.690 × 10⁻⁴ | 3.969 × 10⁻² | 1.418 × 10⁻⁷ | 182.309 | 4.696 × 10⁻⁴ | 7.695 × 10⁻² | 1.431 × 10⁻³ | 3.969 × 10⁻² | 1.623 × 10⁻⁷ | 159.304 |
| d = 8 | 2.766 × 10⁻⁵ | 6.970 × 10⁻³ | 2.210 × 10⁻⁴ | 3.470 × 10⁻² | 1.772 × 10⁻⁷ | 292.125 | 1.011 × 10⁻³ | 9.677 × 10⁻² | 2.033 × 10⁻³ | 3.470 × 10⁻² | 1.955 × 10⁻⁷ | 264.680 |
| d = 16 | 3.531 × 10⁻⁵ | 8.544 × 10⁻³ | 2.325 × 10⁻⁴ | 2.450 × 10⁻² | 6.318 × 10⁻⁷ | 224.370 | 1.388 × 10⁻³ | 9.231 × 10⁻² | 2.275 × 10⁻³ | 2.450 × 10⁻² | 7.208 × 10⁻⁷ | 196.660 |
| d = 32 | 4.123 × 10⁻⁵ | 5.242 × 10⁻³ | 2.457 × 10⁻⁴ | 1.230 × 10⁻² | 4.811 × 10⁻⁷ | 462.754 | 1.674 × 10⁻³ | 5.398 × 10⁻² | 2.449 × 10⁻³ | 1.230 × 10⁻² | 5.457 × 10⁻⁷ | 408.007 |
| d = 64 | 4.747 × 10⁻⁵ | 3.437 × 10⁻³ | 2.486 × 10⁻⁴ | 9.756 × 10⁻³ | 9.789 × 10⁻⁷ | 578.354 | 1.863 × 10⁻³ | 3.685 × 10⁻² | 2.498 × 10⁻³ | 9.756 × 10⁻³ | 1.160 × 10⁻⁶ | 488.246 |
| d = 128 | 5.020 × 10⁻⁵ | 2.540 × 10⁻³ | 2.491 × 10⁻⁴ | 6.580 × 10⁻³ | 5.874 × 10⁻⁶ | 477.412 | 1.937 × 10⁻³ | 2.374 × 10⁻² | 2.522 × 10⁻³ | 6.580 × 10⁻³ | 6.605 × 10⁻⁶ | 424.609 |
| d = 256 | 4.735 × 10⁻⁵ | 1.410 × 10⁻³ | 2.476 × 10⁻⁴ | 4.855 × 10⁻³ | 9.349 × 10⁻⁶ | 528.452 | 1.914 × 10⁻³ | 1.521 × 10⁻² | 2.529 × 10⁻³ | 4.855 × 10⁻³ | 1.110 × 10⁻⁵ | 445.304 |
Table 2. Experiments for JFR and GB centers approximating the numerical Jeffreys centroid for the following setting of two normalized histograms of 3 bins: (1/3, 1/3, 1/3) and (1 − α, α/2, α/2).
| α | JFR Info. ε | JFR TV ε | JFR Avg Time | JFR ×Speed | GB Info. ε | GB TV ε | GB Avg Time | GB ×Speed |
|---|---|---|---|---|---|---|---|---|
| 1.000 × 10⁻¹ | 6.882 × 10⁻⁹ | 2.495 × 10⁻⁵ | 1.767 × 10⁻⁷ | 125.960 | 1.338 × 10⁻⁶ | 3.480 × 10⁻⁴ | 2.334 × 10⁻⁷ | 95.356 |
| 1.000 × 10⁻² | 2.607 × 10⁻⁵ | 1.722 × 10⁻³ | 1.371 × 10⁻⁷ | 167.932 | 1.061 × 10⁻³ | 1.108 × 10⁻² | 1.565 × 10⁻⁷ | 147.104 |
| 1.000 × 10⁻³ | 6.262 × 10⁻⁴ | 7.530 × 10⁻³ | 1.033 × 10⁻⁷ | 218.450 | 1.272 × 10⁻² | 3.534 × 10⁻² | 1.208 × 10⁻⁷ | 186.698 |
| 1.000 × 10⁻⁴ | 3.632 × 10⁻³ | 1.570 × 10⁻² | 1.171 × 10⁻⁷ | 193.345 | 4.580 × 10⁻² | 6.065 × 10⁻² | 1.367 × 10⁻⁷ | 165.571 |
| 1.000 × 10⁻⁵ | 1.121 × 10⁻² | 2.419 × 10⁻² | 1.546 × 10⁻⁷ | 150.807 | 7.322 × 10⁻³ | 1.929 × 10⁻² | 2.834 × 10⁻⁷ | 82.261 |
| 1.000 × 10⁻⁶ | 2.457 × 10⁻² | 3.204 × 10⁻² | 1.619 × 10⁻⁷ | 141.896 | 1.655 × 10⁻² | 2.579 × 10⁻² | 2.512 × 10⁻⁷ | 91.467 |
| 1.000 × 10⁻⁷ | 4.375 × 10⁻² | 3.897 × 10⁻² | 1.357 × 10⁻⁷ | 170.065 | 3.065 × 10⁻² | 3.183 × 10⁻² | 2.131 × 10⁻⁷ | 108.314 |
| 1.000 × 10⁻⁸ | 6.806 × 10⁻² | 4.492 × 10⁻² | 1.315 × 10⁻⁷ | 173.698 | 4.948 × 10⁻² | 3.725 × 10⁻² | 2.017 × 10⁻⁷ | 113.292 |
| 1.000 × 10⁻⁹ | 9.651 × 10⁻² | 4.999 × 10⁻² | 1.125 × 10⁻⁷ | 208.627 | 7.240 × 10⁻² | 4.199 × 10⁻² | 1.590 × 10⁻⁷ | 147.610 |
| 1.000 × 10⁻¹⁰ | 1.281 × 10⁻¹ | 5.428 × 10⁻² | 8.366 × 10⁻⁸ | 242.967 | 9.862 × 10⁻² | 4.610 × 10⁻² | 1.111 × 10⁻⁷ | 183.000 |
| 1.000 × 10⁻¹¹ | 1.621 × 10⁻¹ | 5.792 × 10⁻² | 1.066 × 10⁻⁷ | 215.817 | 1.274 × 10⁻¹ | 4.963 × 10⁻² | 1.325 × 10⁻⁷ | 173.614 |
| 1.000 × 10⁻¹² | 1.979 × 10⁻¹ | 6.100 × 10⁻² | 1.028 × 10⁻⁷ | 229.484 | 1.580 × 10⁻¹ | 5.266 × 10⁻² | 1.329 × 10⁻⁷ | 177.581 |
| 1.000 × 10⁻¹³ | 2.348 × 10⁻¹ | 6.363 × 10⁻² | 9.541 × 10⁻⁸ | 244.587 | 1.901 × 10⁻¹ | 5.526 × 10⁻² | 1.255 × 10⁻⁷ | 185.940 |
| 1.000 × 10⁻¹⁴ | 2.727 × 10⁻¹ | 6.589 × 10⁻² | 1.062 × 10⁻⁷ | 219.787 | 2.231 × 10⁻¹ | 5.750 × 10⁻² | 1.361 × 10⁻⁷ | 171.456 |
| 1.000 × 10⁻¹⁵ | 3.112 × 10⁻¹ | 6.784 × 10⁻² | 9.043 × 10⁻⁸ | 248.688 | 2.570 × 10⁻¹ | 5.943 × 10⁻² | 1.322 × 10⁻⁷ | 170.122 |
| 1.000 × 10⁻¹⁶ | 3.483 × 10⁻¹ | 6.946 × 10⁻² | 8.857 × 10⁻⁸ | 267.219 | 2.897 × 10⁻¹ | 6.105 × 10⁻² | 1.438 × 10⁻⁷ | 164.535 |
Table 3. Summary of the results: Δ indicates a generic formula, √ a closed-form formula, and × no known formula.
| Family | Jeffreys | Jeffreys–Fisher–Rao | Gauss–Bregman |
|---|---|---|---|
| Exponential family | Equation (2) | Definition 2 | Definition 3 |
| One-dimensional exponential family | × | Δ (Theorem 4) | × [43] |
| Categorical family | Δ [31] | √ (Theorem 5) | × (Theorem 3) |
| Normal family | × [28] | √ (Theorem 6) | × (Theorem 3) |
| Centered normal family | √ [29] | √ [29] | √ [41] |