A Novel Approach to Canonical Divergences within Information Geometry

A divergence function defines a Riemannian metric g and dually coupled affine connections ∇ and ∇∗ with respect to it on a manifold M. When M is dually flat, that is, flat with respect to both ∇ and ∇∗, a canonical divergence is known, which is uniquely determined by (M, g, ∇, ∇∗). We propose a natural definition of a canonical divergence for a general, not necessarily flat, M by using the geodesic integration of the inverse exponential map. The new definition of a canonical divergence reduces to the known canonical divergence in the case of dual flatness. Finally, we show that the integrability of the inverse exponential map implies the geodesic projection property.


Introduction: Divergence and Dual Geometry
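A divergence function $D(p \,\|\, q)$ is a differentiable real-valued function of two points p and q in a manifold M. It satisfies the non-negativity condition

$$D(p \,\|\, q) \ge 0 \qquad (1)$$

with equality if and only if p = q. Thus, it is a distance-like function, but it does not necessarily share all properties of a distance. For instance, it can be asymmetric in p and q. When a coordinate system $\xi : p \mapsto \xi_p = (\xi_p^1, \dots, \xi_p^n) \in \mathbb{R}^n$ is given in M, we pose one condition: for two nearby points $\xi_p$ and $\xi_q = \xi_p + \Delta\xi$, D is expanded as

$$D(\xi_p \,\|\, \xi_p + \Delta\xi) = \frac{1}{2}\, {}^{D}g_{ij}(p)\, \Delta\xi^i \Delta\xi^j + O(|\Delta\xi|^3) \qquad (2)$$

and $({}^{D}g_{ij}(p))_{ij}$ is a positive definite matrix. Here, the Einstein summation convention is used, which means that summation is taken with respect to any index that appears twice in a term, once as a lower and once as an upper index. Throughout the paper, we apply this convention or explicitly use the summation sign. The coefficients ${}^{D}g_{ij}$ in Equation (2) define a Riemannian metric ${}^{D}g$. Furthermore, the divergence function D also allows us to define a pair of dual affine connections [1]. In order to be more explicit, we consider coordinates $\xi_p = (\xi_p^1, \dots, \xi_p^n)$ of p and coordinates $\xi_q = (\xi_q^1, \dots, \xi_q^n)$ of q and introduce the following simplified notations of differentiation:

$$\partial_i := \frac{\partial}{\partial \xi_p^i}, \qquad \partial'_i := \frac{\partial}{\partial \xi_q^i} \qquad (3)$$

With $D(\xi_p \,\|\, \xi_q) = D(p \,\|\, q)$, the coefficients of the Riemannian metric can be written as

$${}^{D}g_{ij}(p) = \partial_i \partial_j D(\xi_p \,\|\, \xi_q)\big|_{\xi_p = \xi_q} = \partial'_i \partial'_j D(\xi_p \,\|\, \xi_q)\big|_{\xi_p = \xi_q} = -\partial_i \partial'_j D(\xi_p \,\|\, \xi_q)\big|_{\xi_p = \xi_q} \qquad (4)$$

Furthermore, the coefficients

$${}^{D}\Gamma_{ijk}(p) = -\partial_i \partial_j \partial'_k D(\xi_p \,\|\, \xi_q)\big|_{\xi_p = \xi_q} \qquad (5)$$

$${}^{D}\Gamma^*_{ijk}(p) = -\partial'_i \partial'_j \partial_k D(\xi_p \,\|\, \xi_q)\big|_{\xi_p = \xi_q} \qquad (6)$$

define a pair of dual affine connections ${}^{D}\nabla$ and ${}^{D}\nabla^*$ [1]. The duality of the connections holds with respect to the Riemannian metric ${}^{D}g$ in terms of the following condition:

$$X \langle Y, Z \rangle = \langle {}^{D}\nabla_X Y, Z \rangle + \langle Y, {}^{D}\nabla^*_X Z \rangle \qquad (7)$$

for all vector fields X, Y and Z, where the brackets $\langle \cdot, \cdot \rangle$ denote the inner product with respect to ${}^{D}g$ [2].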
The inverse problem is to find a divergence D which generates a given geometrical structure (M, g, ∇, ∇*). Matumoto [3] showed that a divergence exists for any such manifold. However, it is not unique, and there are infinitely many divergences that give the same geometrical structure. When a manifold is dually flat, a canonical divergence was introduced by Amari and Nagaoka [2], which is a Bregman divergence. Extensions of the canonical divergence within conformal geometry have been studied by Kurose [4] and Matsuzoe [5]. The canonical divergence has nice properties such as the generalized Pythagorean theorem and the geodesic projection theorem. It is an important problem to define a canonical divergence in the general case. The present paper gives an answer to this problem by using the inverse exponential map. We already used the inverse exponential map in our previous work [6], where we studied a different divergence function. We could show that it recovers the metric g in the sense of Equation (4) and has some consistency with the dual connections ∇ and ∇*. However, it turns out that it does not reduce to the well-established canonical divergence in the dually flat case. The divergence introduced in the present article not only recovers the original geometry directly in terms of Equations (4)-(6), it also coincides with the original canonical divergence in the dually flat case.

A New Approach to the General Inverse Problem
We begin with a motivation in terms of a simple example where the manifold is $\mathbb{R}^n$ equipped with the standard Euclidean metric and connection (here, the Levi-Civita connection). Let p be a fixed point in $\mathbb{R}^n$, and consider the vector field pointing to p, that is

$$q \mapsto p - q \qquad (8)$$

Obviously, the vector field Equation (8) can be seen as the negative gradient of the squared distance

$$D_p(q) := \frac{1}{2}\, \| p - q \|^2$$

as potential function, that is

$$p - q = -\operatorname{grad}_q D_p \qquad (9)$$

Here, the gradient $\operatorname{grad}_q$ is taken with respect to the canonical inner product on $\mathbb{R}^n$. We shall now generalize the relation Equation (9) between the squared distance $D_p$ and the difference of two points p and q to the more general setting of a differentiable manifold M. Given a fixed point p ∈ M, we want to define a vector field $q \mapsto X(q, p)$, at least in a neighbourhood of p, that corresponds to the difference vector field Equation (8). Obviously, the problem is that the difference p − q is not naturally defined for a general manifold M. We need an affine connection ∇ in order to have a notion of a difference. Given such a connection ∇, for each point q ∈ M and each direction $X \in T_q M$ we consider the geodesic $\gamma_{q,X}(t)$ with the initial point q and the initial velocity X, that is $\gamma_{q,X}(0) = q$ and $\dot\gamma_{q,X}(0) = X$. If $\gamma_{q,X}(t)$ is defined for all 0 ≤ t ≤ 1, the endpoint $p = \gamma_{q,X}(1)$ is interpreted as the result of a translation of the point q along a straight line in the direction of the vector X. This straightness is expressed in terms of the local coordinates $\xi(t) := (\xi^1(t), \dots, \xi^n(t)) := \xi(\gamma_{q,X}(t))$ of the geodesic $\gamma_{q,X}$ by the following set of differential equations:

$$\ddot\xi^k(t) + \Gamma^k_{ij}(\xi(t))\, \dot\xi^i(t)\, \dot\xi^j(t) = 0, \qquad k = 1, \dots, n \qquad (10)$$

The translation of points along geodesics defines a map, the so-called exponential map:

$$\exp_q : U_q \to M, \qquad X \mapsto \gamma_{q,X}(1) \qquad (11)$$

where $U_q \subseteq T_q M$ denotes the set of tangent vectors X for which the domain of $\gamma_{q,X}$ contains the unit interval [0, 1]. Given two points p and q, one can interpret any X with $\exp_q(X) = p$ as a difference vector X that translates q to p.

Entropy 2015, 17, 8111-8129
Throughout this paper we assume the existence and uniqueness of such a difference vector, denoted by X(q, p) (see Figure 1).

Figure 1. Illustration of (A) the difference vector p − q in $\mathbb{R}^n$ pointing from q to p; and (B) the difference vector $X(q, p) = \dot\gamma_{q,p}(0)$ as the inverse of the exponential map in q.
This is a strong assumption, which is, however, always locally satisfied. On the one hand, we are mainly interested in local properties. On the other hand, although quite restrictive in general, this property will be satisfied in our information-geometric context, where g is given by the Fisher metric and ∇ is given by the m- and e-connections and their convex combinations, the α-connections.
If we attach to each point q ∈ M the difference vector X(q, p), we obtain a vector field that corresponds to the vector field Equation (8) in $\mathbb{R}^n$. In order to interpret this vector field as a negative gradient field of a (squared) distance function, and thereby generalize Equation (9), we need a Riemannian metric g on M. Given such a metric, we assume integrability of X and ∇, respectively, in the sense that for all p there exists a function $D_p$ satisfying

$$X(q, p) = -\operatorname{grad}_q D_p \qquad (12)$$

Here, the Riemannian gradient is taken with respect to g, which is defined by the property that the total differential $d_q D_p$ can be expressed as an inner product:

$$d_q D_p(Y) = g_q(\operatorname{grad}_q D_p, Y) \qquad \text{for all } Y \in T_q M$$

Obviously, if there are functions $D_p$ satisfying the condition of Equation (12), then they are unique up to a constant that can vary with p, and we can therefore assume $D_p(p) = 0$. Throughout the paper we will also use the standard notation $D(p \,\|\, q) = D_p(q)$ of a divergence as a function D of two arguments. In order to recover D from Equation (12), we consider any curve γ : [0, 1] → M that connects q with p, that is γ(0) = q and γ(1) = p. We take the inner product of the curve velocity $\dot\gamma(t)$ with the inverse of the exponential map $X(\gamma(t), p)$ in γ(t) and integrate this along the curve:

$$\int_0^1 \langle X(\gamma(t), p), \dot\gamma(t) \rangle \, dt = -\int_0^1 \langle \operatorname{grad}_{\gamma(t)} D_p, \dot\gamma(t) \rangle \, dt = -\int_0^1 \frac{d}{dt} D_p(\gamma(t)) \, dt = D_p(q) - D_p(p) = D(p \,\|\, q) \qquad (13)$$

In particular, we can apply this derivation to the geodesic connecting q and p even when the integrability of X is not guaranteed and obtain the definition of a general canonical divergence, discussed in more detail in Section 5. Before we treat the general definition of a canonical divergence, however, we discuss important special cases of divergences within the cone of positive measures and the simplex of probability measures included in it. In particular, we verify that the well-known relative entropy (KL-divergence) and the α-entropy (α-divergence) can be derived in terms of Equation (13).
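The recovery of a divergence by integrating the inverse exponential map along a curve can be checked numerically in the Euclidean motivating example, where X(q, p) = p − q and the expected result is half the squared distance. The following is a minimal sketch under that assumption; the function names and the two test curves are hypothetical choices made for illustration and do not come from the paper.

```python
import math


def inverse_exp_euclid(q, p):
    """Difference vector X(q, p) = p - q, which translates q to p in R^n."""
    return [pi - qi for pi, qi in zip(p, q)]


def divergence_along_curve(curve, p, steps=1000):
    """Approximate int_0^1 <X(gamma(t), p), gamma'(t)> dt for a curve gamma
    with gamma(0) = q and gamma(1) = p, evaluating X at chord midpoints."""
    total = 0.0
    for k in range(steps):
        a = curve(k / steps)
        b = curve((k + 1) / steps)
        midpoint = [(ai + bi) / 2 for ai, bi in zip(a, b)]
        increment = [bi - ai for ai, bi in zip(a, b)]  # approximates gamma'(t) dt
        x = inverse_exp_euclid(midpoint, p)
        total += sum(xi * di for xi, di in zip(x, increment))
    return total


# Illustrative endpoints (assumed values, not from the paper).
q, p = [0.0, 0.0], [3.0, 4.0]


def straight(t):
    """Straight segment from q to p."""
    return [qi + t * (pi - qi) for qi, pi in zip(q, p)]


def bent(t):
    """Same endpoints as `straight`, with a sideways bump in the first coordinate."""
    s = straight(t)
    return [s[0] + math.sin(math.pi * t), s[1]]


half_squared_distance = 0.5 * sum((pi - qi) ** 2 for pi, qi in zip(p, q))
along_straight = divergence_along_curve(straight, p)
along_bent = divergence_along_curve(bent, p)
```

Because the Euclidean difference field is a gradient field, both curves should give the same value as `half_squared_distance`, which illustrates why any connecting curve can be used when X is integrable.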

The Fisher Metric and Its Gradients
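We represent measures on the set {1, ..., n} as elements of $\mathbb{R}^n$. In this representation, the Dirac measures $\delta^i$, i = 1, ..., n, form the canonical basis of $\mathbb{R}^n$. We consider the n-dimensional cone of positive measures on the set {1, ..., n}, defined by

$$M_n := \Big\{ p = \sum_{i=1}^n p_i\, \delta^i \;:\; p_i > 0 \text{ for all } i \Big\}$$

and the corresponding (n − 1)-dimensional simplex of normalized measures (probability measures)

$$S_{n-1} := \Big\{ p \in M_n \;:\; \sum_{i=1}^n p_i = 1 \Big\}$$

There is a natural Riemannian metric on $M_n$, called the Fisher metric:

$$g_p(X, Y) = \sum_{i=1}^n \frac{1}{p_i}\, X_i Y_i$$

In theoretical biology, the Fisher metric is also known as the Shahshahani metric (see [7], Equation (7.48)). Given a point $p \in S_{n-1}$ and a vector $X \in T_p M_n$, its projection onto $T_p S_{n-1}$ with respect to $g_p$ is given by

$$\Pi_p(X) = \sum_{i=1}^n \Big( X_i - p_i \sum_{j=1}^n X_j \Big) \delta^i \qquad (14)$$

and the corresponding projection onto the orthogonal complement of $T_p S_{n-1}$ is given by

$$\Pi_p^{\perp}(X) = \Big( \sum_{j=1}^n X_j \Big)\, p$$

For a function $V : M_n \to \mathbb{R}$, this metric implies the Riemannian gradient

$$\operatorname{grad}_p V = \sum_{i=1}^n p_i\, \frac{\partial V}{\partial p_i}(p)\, \delta^i \qquad (16)$$

A vector field $f(p) = \sum_{i=1}^n p_i f_i(p)\, \delta^i$ is the gradient of a function V if and only if it satisfies

$$\frac{\partial f_i}{\partial p_j} = \frac{\partial f_j}{\partial p_i}$$

for all i, j. If we consider a function that is defined on $S_{n-1}$, for instance the restriction of $V : M_n \to \mathbb{R}$ to $S_{n-1}$, then the vector Equation (16), evaluated in $p \in S_{n-1}$, will not necessarily be an element of $T_p S_{n-1}$. Therefore, in order to evaluate the gradient on $S_{n-1}$, we have to project the vector Equation (16) onto $T_p S_{n-1}$ with respect to the metric g by using Equation (14). This leads to the following gradient formula for functions on $S_{n-1}$:

$$\operatorname{grad}_p V = \sum_{i=1}^n p_i \Big( \frac{\partial V}{\partial p_i}(p) - \sum_{j=1}^n p_j\, \frac{\partial V}{\partial p_j}(p) \Big) \delta^i \qquad (19)$$

This gives rise to consider general vector fields of the form

$$f(p) = \sum_{i=1}^n p_i \Big( f_i(p) - \sum_{j=1}^n p_j f_j(p) \Big) \delta^i$$

Such a vector field is integrable, in the sense that it is the gradient Equation (19) of a potential function V, if and only if the following condition holds for all i, j, k (see [7], Equation (19.23)):

$$\frac{\partial f_i}{\partial p_j} + \frac{\partial f_j}{\partial p_k} + \frac{\partial f_k}{\partial p_i} = \frac{\partial f_i}{\partial p_k} + \frac{\partial f_k}{\partial p_j} + \frac{\partial f_j}{\partial p_i}$$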
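As a small illustration of the Fisher metric, its gradient, and the projection onto the tangent space of the simplex, the following sketch represents measures and tangent vectors as plain Python lists. The function names are hypothetical, and the formulas assume the Fisher metric g_p(X, Y) = Σ_i X_i Y_i / p_i on the cone of positive measures, with the projection of X given by X_i − p_i Σ_j X_j.

```python
def fisher_inner(p, x, y):
    """Fisher inner product g_p(X, Y) = sum_i X_i * Y_i / p_i."""
    return sum(xi * yi / pi for pi, xi, yi in zip(p, x, y))


def project_to_simplex_tangent(p, x):
    """Fisher-orthogonal projection of X onto the tangent space of the simplex
    at a probability vector p: X_i - p_i * sum_j X_j."""
    s = sum(x)
    return [xi - pi * s for pi, xi in zip(p, x)]


def fisher_gradient(p, partials):
    """Riemannian gradient on the cone: component i is p_i * dV/dp_i."""
    return [pi * di for pi, di in zip(p, partials)]


def simplex_gradient(p, partials):
    """Gradient on the simplex: the cone gradient followed by the projection."""
    return project_to_simplex_tangent(p, fisher_gradient(p, partials))


# Illustrative values (assumed, not from the paper).
p = [0.2, 0.3, 0.5]
x = [1.0, -2.0, 0.5]
tangent_part = project_to_simplex_tangent(p, x)
normal_part = [xi - ti for xi, ti in zip(x, tangent_part)]
```

The tangent part sums to zero and is Fisher-orthogonal to the normal part, which is a multiple of p itself.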

The Mixture and the Exponential Connections
After having introduced the Fisher metric and corresponding gradient fields, we now define natural notions of straight lines on $M_n$ and $S_{n-1}$, respectively, induced by corresponding affine connections. Let us first introduce the straight lines of the so-called mixture connection $\nabla^{(m)}$ on $M_n$. Given a point $p \in M_n$ and a direction $X \in T_p M_n$, the most natural way to define a straight line that starts in p and has velocity X is given by the so-called m-geodesic

$$\gamma^{(m)}_{p,X}(t) = p + t X \qquad (22)$$

We obtain the exponential map for t = 1, which is, in this simple example, the translation:

$$\exp^{(m)}_p(X) = p + X$$

The inverse, therefore, maps a point q to the difference vector that translates p into q:

$$X^{(m)}(p, q) = q - p$$

With this difference as X in Equation (22), we obtain the geodesic that connects p with q:

$$\gamma^{(m)}_{p,q}(t) = p + t\,(q - p) \qquad (23)$$

If we choose a point $p \in S_{n-1}$ and $X \in T_p S_{n-1}$, or two points $p, q \in S_{n-1}$, respectively, then the corresponding geodesics Equations (22) and (23) will stay in $S_{n-1}$. Therefore, the restriction of the exponential map to $T_p S_{n-1}$ and its inverse are trivial:

$$\overline{\exp}^{(m)}_p(X) = p + X, \qquad \bar X^{(m)}(p, q) = q - p$$

where we use a bar over symbols in order to denote the restriction of corresponding objects to $S_{n-1}$. Now let us come to the notion of an e-geodesic and the exponential map of the so-called e-connection $\nabla^{(e)}$. Given a point $p \in M_n$ and a direction $X \in T_p M_n$, we consider the geodesic

$$\gamma^{(e)}_{p,X}(t) = \sum_{i=1}^n p_i \exp\Big( t\, \frac{X_i}{p_i} \Big) \delta^i \qquad (24)$$

(The "exp" on the right-hand side of Equation (24) denotes the standard real-valued natural exponential function.) The exponential map of the e-connection is given for t = 1:

$$\exp^{(e)}_p(X) = \sum_{i=1}^n p_i \exp\Big( \frac{X_i}{p_i} \Big) \delta^i$$

with the inverse

$$X^{(e)}(p, q) = \sum_{i=1}^n p_i \ln\frac{q_i}{p_i}\, \delta^i$$

This implies that the e-geodesic connecting p with q is given by

$$\gamma^{(e)}_{p,q}(t) = \sum_{i=1}^n p_i \Big( \frac{q_i}{p_i} \Big)^{t} \delta^i \qquad (25)$$

Clearly, if we start in a point $p \in S_{n-1}$ and go along the e-geodesic Equation (24) in a direction X that is tangential to $S_{n-1}$, we will not stay in $S_{n-1}$. Analogously, if we connect a point $p \in S_{n-1}$ with a point $q \in S_{n-1}$ in terms of the e-geodesic Equation (25), then the intermediate points will in general not be in the set $S_{n-1}$. It turns out that, in order to obtain the right exponential map of the e-connection defined on $S_{n-1}$, we have to normalize the geodesic, which leads to:

$$\bar\gamma^{(e)}_{p,X}(t) = \sum_{i=1}^n \frac{p_i \exp(t\, X_i / p_i)}{\sum_{j=1}^n p_j \exp(t\, X_j / p_j)}\, \delta^i, \qquad \overline{\exp}^{(e)}_p(X) = \bar\gamma^{(e)}_{p,X}(1), \qquad \bar X^{(e)}(p, q) = \sum_{i=1}^n p_i \Big( \ln\frac{q_i}{p_i} - \sum_{j=1}^n p_j \ln\frac{q_j}{p_j} \Big) \delta^i$$
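The exponential maps of the m- and e-connections and their inverses are simple enough to sketch directly. The snippet below assumes the formulas exp_p(X) = p + X for the m-connection and exp_p(X)_i = p_i exp(X_i / p_i) for the e-connection on the cone, together with the normalized version on the simplex; all function names are hypothetical and chosen for illustration only.

```python
import math


def exp_m(p, x):
    """m-exponential map on the cone: translation p + X."""
    return [pi + xi for pi, xi in zip(p, x)]


def inv_exp_m(p, q):
    """Inverse m-exponential map: the difference vector q - p."""
    return [qi - pi for pi, qi in zip(p, q)]


def exp_e(p, x):
    """e-exponential map on the cone: p_i * exp(X_i / p_i)."""
    return [pi * math.exp(xi / pi) for pi, xi in zip(p, x)]


def inv_exp_e(p, q):
    """Inverse e-exponential map on the cone: p_i * ln(q_i / p_i)."""
    return [pi * math.log(qi / pi) for pi, qi in zip(p, q)]


def exp_e_simplex(p, x):
    """e-exponential map on the simplex: the cone version, normalized."""
    w = exp_e(p, x)
    z = sum(w)
    return [wi / z for wi in w]


def inv_exp_e_simplex(p, q):
    """Inverse e-exponential map on the simplex; the result sums to zero
    when p is a probability vector."""
    c = sum(pj * math.log(qj / pj) for pj, qj in zip(p, q))
    return [pi * (math.log(qi / pi) - c) for pi, qi in zip(p, q)]


# Illustrative probability vectors (assumed values).
p = [0.2, 0.3, 0.5]
q = [0.1, 0.6, 0.3]
x_bar = inv_exp_e_simplex(p, q)
q_again = exp_e_simplex(p, x_bar)
```

Applying `exp_e_simplex` to the output of `inv_exp_e_simplex` should return q up to rounding, and `x_bar` should be tangent to the simplex.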

The α-Connections
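Given α ∈ [−1, 1], we define the following convex combination of the mixture connection $\nabla^{(m)}$ and the exponential connection $\nabla^{(e)}$ on $M_n$:

$$\nabla^{(\alpha)} := \frac{1-\alpha}{2}\, \nabla^{(m)} + \frac{1+\alpha}{2}\, \nabla^{(e)} \qquad (26)$$

The differential equation for the α-geodesic with initial point $p \in M_n$ and initial velocity $X \in T_p M_n$ is given by

$$\ddot\gamma_i - \frac{1+\alpha}{2}\, \frac{\dot\gamma_i^2}{\gamma_i} = 0, \qquad \gamma(0) = p, \quad \dot\gamma(0) = X \qquad (27)$$

One can easily verify that Equation (27) is solved by the following curve:

$$\gamma^{(\alpha)}_{p,X}(t) = \sum_{i=1}^n p_i \Big( 1 + t\, \frac{1-\alpha}{2}\, \frac{X_i}{p_i} \Big)^{\frac{2}{1-\alpha}} \delta^i \qquad (28)$$

By setting t = 1, we can define the corresponding α-exponential map:

$$\exp^{(\alpha)}_p(X) = \sum_{i=1}^n p_i \Big( 1 + \frac{1-\alpha}{2}\, \frac{X_i}{p_i} \Big)^{\frac{2}{1-\alpha}} \delta^i \qquad (29)$$

with the inverse

$$X^{(\alpha)}(p, q) = \frac{2}{1-\alpha} \sum_{i=1}^n p_i \Big( \Big( \frac{q_i}{p_i} \Big)^{\frac{1-\alpha}{2}} - 1 \Big) \delta^i \qquad (30)$$

Finally, the α-geodesic with initial point p and endpoint q is given by

$$\gamma^{(\alpha)}_{p,q}(t) = \sum_{i=1}^n \Big( (1-t)\, p_i^{\frac{1-\alpha}{2}} + t\, q_i^{\frac{1-\alpha}{2}} \Big)^{\frac{2}{1-\alpha}} \delta^i \qquad (31)$$

The α-connection $\bar\nabla^{(\alpha)}$ on $S_{n-1}$ is defined as the projection of $\nabla^{(\alpha)}$ with respect to the Fisher metric g. The corresponding geodesic equation is a modification of Equation (27):

$$\ddot\gamma_i - \frac{1+\alpha}{2} \Big( \frac{\dot\gamma_i^2}{\gamma_i} - \gamma_i \sum_{j=1}^n \frac{\dot\gamma_j^2}{\gamma_j} \Big) = 0 \qquad (32)$$

It is reasonable to make a solution ansatz by normalization of the unconstrained geodesics Equations (28) and (31). However, it turns out that, in order to solve the geodesic Equation (32), both normalized curves have to be reparametrized. More precisely, it has been shown in [8] (Theorems 14.1 and 15.1) that, with appropriate reparametrizations $\tau_{p,X}$ and $\tau_{p,q}$, we have the following form of the α-geodesic in the simplex $S_{n-1}$:

$$\bar\gamma^{(\alpha)}_{p,X}(t) = \frac{\gamma^{(\alpha)}_{p,X}(\tau_{p,X}(t))}{\sum_{j=1}^n \gamma^{(\alpha)}_{p,X,j}(\tau_{p,X}(t))} \qquad (33)$$

and

$$\bar\gamma^{(\alpha)}_{p,q}(t) = \frac{\gamma^{(\alpha)}_{p,q}(\tau_{p,q}(t))}{\sum_{j=1}^n \gamma^{(\alpha)}_{p,q,j}(\tau_{p,q}(t))} \qquad (34)$$

Here, the conditions $\bar\gamma_{p,X}(0) = p$, $\dot{\bar\gamma}_{p,X}(0) = \dot\tau_{p,X}(0)\, X = X$, and $\bar\gamma_{p,q}(0) = p$, $\bar\gamma_{p,q}(1) = q$ imply $\tau_{p,X}(0) = 0$, $\dot\tau_{p,X}(0) = 1$, and $\tau_{p,q}(0) = 0$, $\tau_{p,q}(1) = 1$. Now let us couple X and q by assuming $\bar\gamma_{p,X}(1) = q$. Together with the condition $\sum_{i=1}^n X_i = 0$, this implies Furthermore, if the initial and endpoints of the two curves are identical, then $\bar\gamma_{p,X}(t) = \bar\gamma_{p,q}(t)$ for all t. In particular, A comparison of the Equations (35) and (36) yields τp,q (0) n ∑ j=1 p j q j p j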

The Relative Entropy (KL-Divergence)
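Now we apply the ansatz of Equation (12) in order to define divergence functions for the m- and e-connections on the cone $M_n$ of positive measures. The inverse maps of the corresponding exponential maps are given by

$$X^{(m)}(q, p) = \sum_{i=1}^n q_i \Big( \frac{p_i}{q_i} - 1 \Big) \delta^i, \qquad X^{(e)}(q, p) = \sum_{i=1}^n q_i \ln\frac{p_i}{q_i}\, \delta^i \qquad (37)$$

We can easily verify that the corresponding vector fields $q \mapsto X^{(m)}(q, p)$ and $q \mapsto X^{(e)}(q, p)$ are gradient fields: the functions $f_i(q) := p_i / q_i$ and $g_i(q) := \ln (p_i / q_i)$ trivially satisfy the integrability condition

$$\frac{\partial f_i}{\partial q_j} = \frac{\partial f_j}{\partial q_i}, \qquad \frac{\partial g_i}{\partial q_j} = \frac{\partial g_j}{\partial q_i}$$

for all i, j. Therefore, for both connections, there are canonical divergence functions which solve the corresponding Equation (12). We derive the canonical divergence of the m-connection first, which we denote by $D^{(m)}$. We consider two positive measures p and q and a curve $\gamma : [0, 1] \to M_n$ connecting q with p, that is γ(0) = q and γ(1) = p. This implies

$$\big\langle X^{(m)}(\gamma(t), p), \dot\gamma(t) \big\rangle = \sum_{i=1}^n \Big( \frac{p_i}{\gamma_i(t)} - 1 \Big) \dot\gamma_i(t) = \frac{d}{dt} \sum_{i=1}^n \big( p_i \ln \gamma_i(t) - \gamma_i(t) \big)$$

and

$$D^{(m)}(p \,\|\, q) = \int_0^1 \big\langle X^{(m)}(\gamma(t), p), \dot\gamma(t) \big\rangle \, dt = \sum_{i=1}^n \Big( q_i - p_i + p_i \ln\frac{p_i}{q_i} \Big)$$

With the same calculation for the e-connection, we obtain the corresponding canonical divergence, which we denote by $D^{(e)}$. Again, we consider a curve γ connecting q with p. This implies

$$\big\langle X^{(e)}(\gamma(t), p), \dot\gamma(t) \big\rangle = \sum_{i=1}^n \ln\frac{p_i}{\gamma_i(t)}\, \dot\gamma_i(t) = \frac{d}{dt} \sum_{i=1}^n \Big( \gamma_i(t) + \gamma_i(t) \ln\frac{p_i}{\gamma_i(t)} \Big)$$

and

$$D^{(e)}(p \,\|\, q) = \int_0^1 \big\langle X^{(e)}(\gamma(t), p), \dot\gamma(t) \big\rangle \, dt = \sum_{i=1}^n \Big( p_i - q_i + q_i \ln\frac{q_i}{p_i} \Big)$$

These calculations give rise to the following definition:

Definition 1. The function $D : M_n \times M_n \to \mathbb{R}$ defined by

$$D(p \,\|\, q) := \sum_{i=1}^n \Big( q_i - p_i + p_i \ln\frac{p_i}{q_i} \Big) \qquad (41)$$

is called the relative entropy or Kullback-Leibler divergence. Its restriction to the set of probability distributions is given by

$$D(p \,\|\, q) = \sum_{i=1}^n p_i \ln\frac{p_i}{q_i} \qquad (42)$$

Proposition 1. The following holds:

$$X^{(m)}(q, p) = -\operatorname{grad}_q D(p \,\|\, \cdot), \qquad X^{(e)}(q, p) = -\operatorname{grad}_q D(\cdot \,\|\, p) \qquad (43)$$

Furthermore, D is the only function on $M_n \times M_n$ that satisfies the conditions Equation (43) and $D(p \,\|\, p) = 0$ for all p.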
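Proof. We first compute the partial derivatives

$$\frac{\partial D(p \,\|\, \cdot)}{\partial q_i}(q) = 1 - \frac{p_i}{q_i}, \qquad \frac{\partial D(\cdot \,\|\, p)}{\partial q_i}(q) = \ln\frac{q_i}{p_i}$$

With the Formula (16), we obtain

$$\operatorname{grad}_q D(p \,\|\, \cdot) = \sum_{i=1}^n q_i \Big( 1 - \frac{p_i}{q_i} \Big) \delta^i, \qquad \operatorname{grad}_q D(\cdot \,\|\, p) = \sum_{i=1}^n q_i \ln\frac{q_i}{p_i}\, \delta^i$$

A comparison with Equation (37) verifies Equation (43), which uniquely characterizes $D(p \,\|\, \cdot)$ as well as $D(\cdot \,\|\, p)$, up to a constant depending on p. With the additional assumption $D(p \,\|\, p) = 0$ for all p, this constant is fixed.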
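The gradient relations of Proposition 1 can also be checked numerically. The sketch below assumes the relative entropy on positive measures in the form D(p ‖ q) = Σ_i (q_i − p_i + p_i ln(p_i / q_i)) and the Fisher gradient with components q_i ∂/∂q_i; the partial derivatives are approximated by central differences. Function names and sample values are hypothetical.

```python
import math


def kl_positive(p, q):
    """Relative entropy on positive measures:
    sum_i (q_i - p_i + p_i * ln(p_i / q_i))."""
    return sum(qi - pi + pi * math.log(pi / qi) for pi, qi in zip(p, q))


def neg_fisher_gradient(potential, q, h=1e-6):
    """Negative Fisher gradient of `potential` at q: component i is
    -q_i * d potential / d q_i, with the partial derivative taken by
    central differences of step h."""
    result = []
    for i in range(len(q)):
        up = list(q)
        up[i] += h
        down = list(q)
        down[i] -= h
        partial = (potential(up) - potential(down)) / (2 * h)
        result.append(-q[i] * partial)
    return result


# Illustrative positive measures (assumed values, not normalized on purpose).
p = [0.5, 1.5, 2.0]
q = [1.0, 1.0, 3.0]

# Compare with the inverse m-exponential map p - q.
x_m = neg_fisher_gradient(lambda r: kl_positive(p, r), q)
# Compare with the inverse e-exponential map q_i * ln(p_i / q_i).
x_e = neg_fisher_gradient(lambda r: kl_positive(r, p), q)
```

Up to finite-difference error, `x_m` should agree with p − q and `x_e` with the vector of components q_i ln(p_i / q_i).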
One can now ask whether the restriction Equation (42) of the Kullback-Leibler divergence to the manifold $S_{n-1}$ is the right divergence function, in the sense that Equation (43) also holds for the exponential maps of the restricted m- and e-connections. It is easy to verify that this is indeed the case. Let us elaborate on the geometric reason for this. We consider a general Riemannian manifold M and a submanifold N in it. Given an affine connection ∇ on M, we can define its restriction $\bar\nabla$ to N. More precisely, denoting the projection of a vector Z in $T_p M$ onto $T_p N$ by $\Pi_p(Z)$, we define $\bar\nabla_X Y|_p := \Pi_p(\nabla_X Y|_p)$, where X and Y are vector fields on N. Furthermore, we denote the exponential map of $\bar\nabla$ by $\overline{\exp}_p$ and its inverse by $\bar X(p, q)$. Now, given p ∈ N, we consider a function $D_p$ on M which satisfies Equation (12). With the restriction $\bar D_p$ of $D_p$ to the submanifold N, this directly implies

$$\Pi_q(X(q, p)) = -\operatorname{grad}_q \bar D_p$$

However, in order to have $\bar X(q, p) = -\operatorname{grad}_q \bar D_p$, which corresponds to Equation (12) on the submanifold N, the following equality is required:

$$\bar X(q, p) = \Pi_q(X(q, p)) \qquad (44)$$

This condition is satisfied for the m- and e-connections on $M_n$ and its submanifold $S_{n-1}$, which implies the following proposition.
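Proposition 2. The following holds:

$$\bar X^{(m)}(q, p) = -\operatorname{grad}_q D(p \,\|\, \cdot), \qquad \bar X^{(e)}(q, p) = -\operatorname{grad}_q D(\cdot \,\|\, p) \qquad (45)$$

where D is given by Equation (42) in Definition 1 and the gradients are taken on $S_{n-1}$. Furthermore, D is the only function on $S_{n-1} \times S_{n-1}$ that satisfies the conditions (45) and $D(p \,\|\, p) = 0$ for all p.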
The objects and derivations of this section represent a special case of a general dually flat manifold M, which will be studied in Section 5.

The α-Divergence
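We now extend the method of Section 4.1 to the α-connections, leading to a generalization of the relative entropy, the so-called α-divergence. From the definition of the α-exponential map on the manifold $M_n$ of positive measures, given in Equation (29), we obtain the inverse

$$X^{(\alpha)}(q, p) = \frac{2}{1-\alpha} \sum_{i=1}^n q_i \Big( \Big( \frac{p_i}{q_i} \Big)^{\frac{1-\alpha}{2}} - 1 \Big) \delta^i \qquad (46)$$

In order to derive the canonical divergence $D^{(\alpha)}$ of the α-connection, which is integrable, we consider two points p and q and a curve $\gamma : [0, 1] \to M_n$ connecting q with p. We obtain

$$\big\langle X^{(\alpha)}(\gamma(t), p), \dot\gamma(t) \big\rangle = \frac{2}{1-\alpha} \sum_{i=1}^n \Big( p_i^{\frac{1-\alpha}{2}}\, \gamma_i(t)^{-\frac{1-\alpha}{2}} - 1 \Big) \dot\gamma_i(t)$$

and

$$D^{(\alpha)}(p \,\|\, q) = \int_0^1 \big\langle X^{(\alpha)}(\gamma(t), p), \dot\gamma(t) \big\rangle \, dt = \frac{2}{1+\alpha} \sum_{i=1}^n p_i + \frac{2}{1-\alpha} \sum_{i=1}^n q_i - \frac{4}{1-\alpha^2} \sum_{i=1}^n p_i^{\frac{1-\alpha}{2}}\, q_i^{\frac{1+\alpha}{2}}$$

Obviously, we have

$$D^{(-\alpha)}(p \,\|\, q) = D^{(\alpha)}(q \,\|\, p) \qquad (48)$$

These calculations give rise to the following definition:

Definition 2. The function $D^{(\alpha)} : M_n \times M_n \to \mathbb{R}$ defined by

$$D^{(\alpha)}(p \,\|\, q) := \frac{2}{1+\alpha} \sum_{i=1}^n p_i + \frac{2}{1-\alpha} \sum_{i=1}^n q_i - \frac{4}{1-\alpha^2} \sum_{i=1}^n p_i^{\frac{1-\alpha}{2}}\, q_i^{\frac{1+\alpha}{2}}$$

is called the α-divergence. Its restriction to probability measures is given as

$$D^{(\alpha)}(p \,\|\, q) = \frac{4}{1-\alpha^2} \Big( 1 - \sum_{i=1}^n p_i^{\frac{1-\alpha}{2}}\, q_i^{\frac{1+\alpha}{2}} \Big)$$

Proposition 3. The following holds:

$$X^{(\alpha)}(q, p) = -\operatorname{grad}_q D^{(\alpha)}(p \,\|\, \cdot) \qquad (50)$$

Furthermore, $D^{(\alpha)}$ is the only function on $M_n \times M_n$ that satisfies the condition (50) and $D^{(\alpha)}(p \,\|\, p) = 0$ for all p.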
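Proof. We compute the partial derivative

$$\frac{\partial D^{(\alpha)}(p \,\|\, \cdot)}{\partial q_i}(q) = \frac{2}{1-\alpha} \Big( 1 - \Big( \frac{p_i}{q_i} \Big)^{\frac{1-\alpha}{2}} \Big)$$

With the Formula (16), we obtain

$$\operatorname{grad}_q D^{(\alpha)}(p \,\|\, \cdot) = \frac{2}{1-\alpha} \sum_{i=1}^n q_i \Big( 1 - \Big( \frac{p_i}{q_i} \Big)^{\frac{1-\alpha}{2}} \Big) \delta^i$$

A comparison with Equation (46) verifies Equation (50), which uniquely characterizes $D^{(\alpha)}(p \,\|\, \cdot)$, up to a constant depending on p. With the additional assumption $D^{(\alpha)}(p \,\|\, p) = 0$ for all p, this constant is fixed.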
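A short numerical sketch may help to see how the α-divergence fits together with the inverse α-exponential map. It assumes the α-divergence on positive measures in the form 2/(1+α) Σ p_i + 2/(1−α) Σ q_i − 4/(1−α²) Σ p_i^((1−α)/2) q_i^((1+α)/2), valid for −1 < α < 1, and the inverse map with components 2/(1−α) q_i ((p_i/q_i)^((1−α)/2) − 1). Function names and sample values are hypothetical.

```python
import math


def alpha_divergence(p, q, alpha):
    """alpha-divergence on positive measures for -1 < alpha < 1."""
    a = (1 - alpha) / 2
    b = (1 + alpha) / 2
    return sum(pi / b + qi / a - pi ** a * qi ** b / (a * b)
               for pi, qi in zip(p, q))


def inverse_exp_alpha(q, p, alpha):
    """Inverse alpha-exponential map X(q, p) on the cone, for alpha != 1."""
    a = (1 - alpha) / 2
    return [qi * ((pi / qi) ** a - 1) / a for pi, qi in zip(p, q)]


def hellinger_squared(p, q):
    """Sum of squared differences of the square roots of the components."""
    return sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))


# Illustrative positive measures (assumed values).
p = [0.5, 1.5, 2.0]
q = [1.0, 1.0, 3.0]

d_plus = alpha_divergence(p, q, 0.3)
d_dual = alpha_divergence(q, p, -0.3)   # should agree with d_plus
d_zero = alpha_divergence(p, q, 0.0)    # compare with 2 * hellinger_squared(p, q)
```

For α close to −1 the value of `alpha_divergence(p, q, alpha)` should approach the relative entropy of the previous section, which gives a quick consistency check of the sign and exponent conventions assumed here.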
In what follows, we use the notation $D^{(\alpha)}$ also for α ∈ {−1, 1} by setting $D^{(-1)}(p \,\|\, q) := D(p \,\|\, q)$ and $D^{(1)}(p \,\|\, q) := D(q \,\|\, p)$, where D is the relative entropy defined by Equation (41). This is consistent with the definition of the α-connections, given by Equation (26), where we have the m-connection for α = −1 and the e-connection for α = 1. Note that $D^{(0)}$ is closely related to the Hellinger distance

$$d_H(p, q) := \sqrt{ \sum_{i=1}^n \big( \sqrt{p_i} - \sqrt{q_i} \big)^2 }$$

More precisely, we have

$$D^{(0)}(p \,\|\, q) = 2\, d_H(p, q)^2 \qquad (51)$$

In fact, the derivation of $D^{(\alpha)}$ was based on the idea to associate a distance-like function to the α-connections through the general Equation (12). However, it turns out that, although being naturally motivated, the functions $D^{(\alpha)}$ do not share all properties of the square of a distance, except for α = 0. The symmetry is obviously not satisfied. On the other hand, we have $D^{(\alpha)}(p \,\|\, q) \ge 0$, and $D^{(\alpha)}(p \,\|\, q) = 0$ if and only if p = q.
We now ask whether the restriction of $D^{(\alpha)}$, which is defined for positive measures, to the simplex $S_{n-1}$ of probability distributions is the canonical divergence for the α-connections on $S_{n-1}$. We have seen that this is the case for the m- and e-connections, that is for α ∈ {−1, +1}. However, for general α, the situation is more complicated. From Equation (36) we obtain

$$\bar X^{(\alpha)}(q, p) = \dot\tau_{q,p}(0)\, \Pi_q\big( X^{(\alpha)}(q, p) \big)$$

This equality deviates from the condition of Equation (44) by the factor $\dot\tau_{q,p}(0)$, which proves that the restriction of the α-divergence to $S_{n-1}$ does not coincide with the canonical α-divergence on the simplex. As an example, we consider the case α = 0, where the α-connection is the Levi-Civita connection of the Fisher metric. As we will see in the next section, the canonical divergence in that case equals $\frac{1}{2}\, d_F(p, q)^2$, where $d_F$ denotes the distance with respect to the Fisher metric (see Equation (62)). Obviously, this divergence is different from the divergence $D^{(0)}$, given by Equation (51), which is based on the distance in the ambient space $M_n$, the Hellinger distance.
Proposition 4. The divergence of Definition 3 is given by

$$D(p \,\|\, q) = \int_0^1 t\, \| \dot\gamma_{p,q}(t) \|^2 \, dt \qquad (61)$$

where $\gamma_{p,q}$ denotes the geodesic from p to q.
Remark 1. In the special case where M is self-dual, ∇ = ∇* is the Levi-Civita connection with respect to g. In that case, the velocity field $\dot\gamma_{p,q}$ is parallel along the geodesic $\gamma_{p,q}$, and therefore

$$\| \dot\gamma_{p,q}(t) \| = \| \dot\gamma_{p,q}(0) \| = \| X(p, q) \| = d(p, q)$$

where d(p, q) denotes the Riemannian distance between p and q. This implies that the canonical divergence corresponds to the energy of the geodesic $\gamma_{p,q}$, that is

$$D(p \,\|\, q) = \frac{1}{2} \int_0^1 \| \dot\gamma_{p,q}(t) \|^2 \, dt = \frac{1}{2}\, d(p, q)^2 \qquad (62)$$

In the general case, where ∇ is not necessarily the Levi-Civita connection, we obtain the energy of the geodesic $\gamma_{p,q}$ as the symmetrized version of the canonical divergence:

$$\frac{1}{2} \int_0^1 \| \dot\gamma_{p,q}(t) \|^2 \, dt = \frac{1}{2} \big( D(p \,\|\, q) + D(q \,\|\, p) \big)$$

Remark 2. Let us compare the canonical divergence D of the affine connection ∇ with the canonical divergence D* of its dual connection ∇*, both defined by Equation (55) or equivalently by Equation (61). In the special case of the α-connection $\nabla = \nabla^{(\alpha)}$, we have $D^*(p \,\|\, q) = D(q \,\|\, p)$ (see Equation (48)). In Section 5.3, we will prove that this kind of symmetry holds in the general case of a dually flat manifold. However, our canonical divergence does not necessarily have this property when the space is not dually flat. This is contrary to most other approaches, where the symmetry is considered to be a natural property of any divergence. In order to have that property also in our setting, we can consider the mean canonical divergence

$$D^{\nabla}_{mcd}(p \,\|\, q) := \frac{1}{2} \big( D(p \,\|\, q) + D^*(q \,\|\, p) \big)$$

which obviously satisfies

$$D^{\nabla^*}_{mcd}(p \,\|\, q) = D^{\nabla}_{mcd}(q \,\|\, p)$$

As we will prove in the next section, the canonical divergence D induces the metric g and the connections ∇ and ∇*. The same holds for the mean canonical divergence $D^{\nabla}_{mcd}$. However, if ∇ is integrable, then it is not generally true that $X(q, p) = -\operatorname{grad}_q D^{\nabla}_{mcd}(p \,\|\, \cdot)$, which is inconsistent with the main motivation of our canonical divergence (see Equation (12)).
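The representation of the canonical divergence as the integral of t‖γ̇(t)‖² along the geodesic from p to q can be tried out on the m-connection of the cone of positive measures, where the geodesic is the straight segment p + t(q − p) and the norm is taken with respect to the Fisher metric. The sketch below is a numerical illustration under these assumptions, using a midpoint rule; the function names are hypothetical, and the comparison with the relative entropy Σ_i (q_i − p_i + p_i ln(p_i / q_i)) is the expected outcome for this particular connection.

```python
import math


def canonical_divergence_m(p, q, steps=20000):
    """Midpoint-rule approximation of int_0^1 t * ||gamma'(t)||^2 dt for the
    m-geodesic gamma(t) = p + t * (q - p) and the Fisher norm
    ||v||^2 = sum_i v_i^2 / gamma_i(t)."""
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        speed_squared = sum((qi - pi) ** 2 / (pi + t * (qi - pi))
                            for pi, qi in zip(p, q))
        total += t * speed_squared * h
    return total


def geodesic_energy_m(p, q, steps=20000):
    """Midpoint-rule approximation of (1/2) * int_0^1 ||gamma'(t)||^2 dt
    along the same m-geodesic."""
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        total += sum((qi - pi) ** 2 / (pi + t * (qi - pi))
                     for pi, qi in zip(p, q)) * h
    return 0.5 * total


def kl_positive(p, q):
    """Relative entropy on positive measures, used here as the reference value."""
    return sum(qi - pi + pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Illustrative positive measures (assumed values).
p = [0.5, 1.5, 2.0]
q = [1.0, 1.0, 3.0]
d_pq = canonical_divergence_m(p, q)
d_qp = canonical_divergence_m(q, p)
energy = geodesic_energy_m(p, q)
```

In this example `d_pq` should agree with `kl_positive(p, q)` up to quadrature error, and `energy` should agree with the symmetrized value `(d_pq + d_qp) / 2`, in line with Remark 1.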

Main Consistency Result
where the indices of $\Lambda_{ijk}$ are symmetrized because of the multiplication by $z^i z^j z^k$. This gives Equation (70).
of which the indexed quantities of the right-hand side need to be symmetrized with respect to i, j.
By evaluating $\partial_i \partial_j D$ at $\xi_p = \xi_q$, i.e., z = 0, we have

$$\partial_i \partial_j D \,\big|_{\xi_p = \xi_q} = g_{ij}$$

proving that the Riemannian metric derived from D is the same as the original one. We further differentiate Equation (82) with respect to $\xi_q$ and evaluate it at $\xi_p = \xi_q$. This yields

$$\partial_i \partial_j \partial'_k D \,\big|_{\xi_p = \xi_q} = -\Gamma_{ijk}$$

Hence, the affine connection ${}^{D}\nabla$ derived from D is exactly the same as the original affine connection ∇.

Remark 3. In the special case ∇ = ∇*, the canonical divergence is given by half of the squared norm of the inverse exponential map (see Equation (62)):

$$D(p \,\|\, q) = \frac{1}{2}\, \| X(p, q) \|^2 \qquad (86)$$

The right-hand side of Equation (86) defines a divergence for a general connection, which coincides with the canonical divergence in the self-dual case. We have studied this divergence in our previous work [6]. We have shown that this divergence recovers g in terms of Equation (66). However, it fails to recover ∇ and ∇* in terms of Equations (67) and (68) directly. In order to overcome this shortcoming, we considered the α-connection $\nabla^{(\alpha)} = \frac{1-\alpha}{2}\, \nabla + \frac{1+\alpha}{2}\, \nabla^*$ and the corresponding inverse exponential map $X^{(\alpha)}$, which imply the following version of Equation (86):

$$D^{(\alpha)}(p \,\|\, q) := \frac{1}{2}\, \| X^{(\alpha)}(p, q) \|^2 \qquad (87)$$

($D^{(\alpha)}$ does not denote the α-divergence here.) We have shown in [6] that for $\alpha = -\frac{1}{3}$ the divergence $D^{(\alpha)}$, referred to as the standard divergence, induces the original quantities g, ∇, and ∇*. It turns out, however, that this first attempt to define a canonical divergence has serious limitations. For instance, it does not reduce to the known canonical divergence in the dually flat case. This important property is satisfied by the canonical divergence of Definition 3, which we are going to prove in the next section.

Canonical Divergence in a Dually Flat Manifold
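When a manifold M is dually flat, it has an affine coordinate system $\theta = (\theta^1, \dots, \theta^n)$ and a potential function ψ(θ), where the dual affine coordinates $\eta = (\eta_1, \dots, \eta_n)$ are given by

$$\eta_i = \partial_i \psi(\theta) \qquad (88)$$

The dual potential is then defined as

$$\varphi(\eta) = \theta \cdot \eta - \psi(\theta)$$

where $\theta \cdot \eta = \theta^i \eta_i$ and θ is a function of η by Equation (88). The geodesic connecting p and q, a generalisation of the e-geodesic of Section 3.2, has the form

$$\theta(t) = \theta_p + t z, \qquad z := \theta_q - \theta_p$$

Hence, the velocity is constant, $\dot\theta(t) = z$. The canonical divergence from $\theta_p$ to $\theta_q$ is defined by

$$D(p \,\|\, q) = \int_0^1 t\, g_{ij}(\theta(t))\, z^i z^j \, dt = \int_0^1 t\, \frac{d^2}{dt^2} \psi(\theta(t)) \, dt = \psi(\theta_p) + \varphi(\eta_q) - \theta_p \cdot \eta_q$$

where we used $g_{ij} = \partial_i \partial_j \psi$ and integration by parts. This shows that our canonical divergence is the same as the canonical divergence defined in terms of the Bregman divergence of M. Now we come back to the symmetry property that we already addressed in Remark 2. We derived $D(p \,\|\, q)$ by using the primal affine connection ∇ and the related inverse exponential map. We can construct its dual $D^*(p \,\|\, q)$ by using the dual affine connection ∇* and the dual inverse exponential map. The dual affine coordinates are η, and the m-geodesic connecting p and q is given by

$$\eta(t) = \eta_p + t z^*, \qquad z^* := \eta_q - \eta_p$$

Hence, the velocity is constant, $\dot\eta(t) = z^*$. The dual canonical divergence D* is defined by

$$D^*(p \,\|\, q) = \int_0^1 t\, g^{ij}(\eta(t))\, z^*_i z^*_j \, dt$$

Here, $g^{ij} = \partial^i \partial^j \varphi(\eta)$, where $\partial^i := \partial / \partial \eta_i$. So we have

$$D^*(p \,\|\, q) = \int_0^1 t\, \frac{d^2}{dt^2} \varphi(\eta(t)) \, dt$$

By similar calculations, we have

$$D^*(p \,\|\, q) = \varphi(\eta_p) + \psi(\theta_q) - \eta_p \cdot \theta_q = D(q \,\|\, p) \qquad (103)$$

This proves that ∇ and ∇* give the same canonical divergence except that p and q are interchanged because of the duality. Such a nice property holds when M is dually flat.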
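To make the dually flat case concrete, the following sketch picks one specific potential as an assumption: ψ(θ) = Σ_i exp(θ_i), which corresponds to positive measures with components exp(θ_i). For this choice the dual coordinates are η_i = exp(θ_i), the dual potential is φ(η) = Σ_i (η_i ln η_i − η_i), and the metric is diagonal with entries exp(θ_i). The code compares the geodesic integral of t g_ij z^i z^j with the Bregman-type closed form ψ(θ_p) + φ(η_q) − θ_p · η_q. All names and sample values are hypothetical.

```python
import math


def psi(theta):
    """Assumed potential: psi(theta) = sum_i exp(theta_i)."""
    return sum(math.exp(t) for t in theta)


def eta_of(theta):
    """Dual coordinates eta_i = d psi / d theta_i = exp(theta_i)."""
    return [math.exp(t) for t in theta]


def phi(eta):
    """Dual potential of the assumed psi: sum_i (eta_i * ln(eta_i) - eta_i)."""
    return sum(e * math.log(e) - e for e in eta)


def bregman_form(theta_p, theta_q):
    """Closed form psi(theta_p) + phi(eta_q) - theta_p . eta_q."""
    eta_q = eta_of(theta_q)
    return psi(theta_p) + phi(eta_q) - sum(tp * eq for tp, eq in zip(theta_p, eta_q))


def geodesic_integral(theta_p, theta_q, steps=20000):
    """Midpoint-rule approximation of int_0^1 t * g_ij(theta(t)) z^i z^j dt
    along theta(t) = theta_p + t * z with z = theta_q - theta_p, where
    g_ij = diag(exp(theta_i)) for the assumed potential."""
    z = [tq - tp for tp, tq in zip(theta_p, theta_q)]
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        total += t * sum(math.exp(tp + t * zi) * zi ** 2
                         for tp, zi in zip(theta_p, z)) * h
    return total


# Illustrative coordinates (assumed values).
theta_p = [0.0, 0.5, -1.0]
theta_q = [0.3, -0.2, 0.4]
closed = bregman_form(theta_p, theta_q)
integral = geodesic_integral(theta_p, theta_q)
```

The two values should coincide up to quadrature error. Since θ is the affine coordinate of the e-connection in this example, the result also equals Σ_i (p_i − q_i + q_i ln(q_i / p_i)) for the measures p = exp(θ_p) and q = exp(θ_q).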

Geodesic Projections and Integrability
Given a divergence D on M and a point p ∈ M, we consider the set of points q that satisfy

$$D(p \,\|\, q) = \text{const}$$

where p is fixed. This set is the surface of the equi-divergence ball centered at p. When a smooth submanifold S is given, we search for a point $\hat p \in S$ that minimizes $D(p \,\|\, q)$, q ∈ S. Intuitively, we obtain such a minimizer by considering a ball centered at p. We increase its radius, starting from 0, until the ball touches S for the first time. Any touch point $\hat p$ is then a minimizer of $D(p \,\|\, q)$, q ∈ S.
When the geodesic connecting p and $\hat p$ is orthogonal to S at $\hat p$, we call $\hat p$ a geodesic projection of p onto S.
Definition 4. We say that the geodesic projection property holds if every minimizer $\hat p$ of the divergence D is given by the geodesic projection of p onto S.
We know that the geodesic projection property holds when M is dually flat, but it does not hold in general. The following condition guarantees the geodesic projection property:

Proposition 6. The geodesic projection property holds when the inverse exponential map X(q, p) is proportional to the gradient of $D(p \,\|\, q)$ with respect to q,

$$X(q, p) = c \cdot \operatorname{grad}_q D(p \,\|\, \cdot)$$

where c is a constant that may depend on q and p.
Proof. Consider the geodesic connecting $q = \hat p$ and p. Then, the tangent vector at q is X(q, p). Assume that X(q, p) has the same direction as the gradient $\operatorname{grad}_q D(p \,\|\, \cdot)$, that is, the vector orthogonal to the surface of the ball touching S. Then X(q, p) is also orthogonal to the tangent space of S in $\hat p$, as the tangent space of the ball contains the tangent space of S at this point. This means that $\hat p$ is a geodesic projection.
Obviously, when the vector field of the inverse exponential map is integrable, the geodesic projection property directly follows from Equation (12). We have shown that this integrability condition is satisfied for general dually flat manifolds. In particular, the integrability is satisfied for the α-connection $\nabla^{(\alpha)}$ defined on the cone $M_n$ of positive measures, which leads to the α-divergence as canonical divergence. The restriction of the α-connection to the simplex $S_{n-1}$ of probability distributions, denoted by $\bar\nabla^{(\alpha)}$, is still integrable, even though $S_{n-1}$ is not (dually) flat with respect to $\bar\nabla^{(\alpha)}$ if $\alpha \notin \{-1, +1\}$. As we have seen, the canonical divergence associated with $\bar\nabla^{(\alpha)}$ does not coincide with the restriction of the α-divergence to $S_{n-1}$. However, this restriction is still useful in the context of applications that require projections onto submanifolds S. The reason is that the geodesic

the geometrical objects derived from the canonical divergence D as defined in Equation (55). We recall the corresponding definitions from Section 1 in terms of a local coordinate system

derived from the canonical divergence $D(p \,\|\, q)$ of Definition 3 coincide with the original quantities g, ∇, and ∇*.

Proof. By differentiating Equation (70) with respect to $\xi_p$,