A Novel Approach to Canonical Divergences within Information Geometry

Ay, Nihat; Amari, Shun-ichi

doi:10.3390/e17127866

by Nihat Ay ^{1,2,3,*} and Shun-ichi Amari ^{4}

¹ Max Planck Institute for Mathematics in the Sciences, Inselstraße 22, Leipzig 04103, Germany

² Faculty of Mathematics and Computer Science, University of Leipzig, PF 100920, Leipzig 04009, Germany

³ Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA

⁴ Laboratory for Mathematical Neuroscience, RIKEN Brain Science Institute, Wako-shi Hirosawa 2-1, Saitama 351-0198, Japan

* Author to whom correspondence should be addressed.

Entropy 2015, 17(12), 8111-8129; https://doi.org/10.3390/e17127866

Submission received: 12 October 2015 / Revised: 21 November 2015 / Accepted: 25 November 2015 / Published: 9 December 2015

Abstract

A divergence function on a manifold M defines a Riemannian metric g and dually coupled affine connections ∇ and

\nabla^{*}

on M. When M is dually flat, that is, flat with respect to both ∇ and

\nabla^{*}

, a canonical divergence is known, which is uniquely determined from

(M, g, \nabla, \nabla^{*})

. We propose a natural definition of a canonical divergence for a general, not necessarily flat, M by using the geodesic integration of the inverse exponential map. The new definition of a canonical divergence reduces to the known canonical divergence in the case of dual flatness. Finally, we show that the integrability of the inverse exponential map implies the geodesic projection property.

Keywords:

information geometry; canonical divergence; relative entropy; α-divergence; α-geodesic; duality; geodesic projection

1. Introduction: Divergence and Dual Geometry

A divergence function

D (p ∥ q)

is a differentiable real-valued function of two points p and q in a manifold M. It satisfies the non-negativity condition

D (p ∥ q) \geq 0

(1)

with equality if and only if

p = q

. Thus, it is a distance-like function, but does not necessarily share all properties of a distance. For instance, it can be asymmetric in p and q. When a coordinate system

ξ : p \mapsto ξ_{p} = (ξ_{p}^{1}, \dots, ξ_{p}^{n}) \in R^{n}

is given on M, we impose the condition that, for two nearby points

ξ_{p}

and

ξ_{q} = ξ_{p} + Δ ξ

, D is expanded as

D (p ∥ q) = \frac{1}{2} {\overset{D}{g}}_{i j} (p) Δ ξ^{i} Δ ξ^{j} + O ({∥ Δ ξ ∥}^{3})

(2)

and

{({\overset{D}{g}}_{i j} (p))}_{i j}

is a positive definite matrix. Here, the Einstein summation convention is used, which means that summation is taken with respect to any index that appears twice in a term, as a lower as well as an upper index. Throughout the paper, we apply this convention or explicitly use the summation sign. The coefficients

{\overset{D}{g}}_{i j}

in Equation (2) define a Riemannian metric

\overset{D}{g}

. Furthermore, the divergence function D also allows us to define a pair of dual affine connections [1]. To be more explicit, we consider coordinates

ξ_{p} = (ξ_{p}^{1}, \dots, ξ_{p}^{n})

of p and coordinates

ξ_{q} = (ξ_{q}^{1}, \dots, ξ_{q}^{n})

of q and introduce the following simplified notations of differentiation

\partial_{i} = \frac{\partial}{\partial ξ_{p}^{i}}, \partial_{i}^{'} = \frac{\partial}{\partial ξ_{q}^{i}}

(3)

With

D (ξ_{p} ∥ ξ_{q}) = D (p ∥ q)

, the coefficients of the Riemannian metric can be written as

{\overset{D}{g}}_{i j} (p) = {- \partial_{i} \partial_{j}^{'} D (ξ_{p} ∥ ξ_{q})|}_{q = p} = {\partial_{i}^{'} \partial_{j}^{'} D (ξ_{p} ∥ ξ_{q})|}_{q = p}

(4)

Furthermore, the coefficients

\begin{matrix} {\overset{D}{Γ}}_{i j k} (p) & = & {- \partial_{i} \partial_{j} \partial_{k}^{'} D (ξ_{p} ∥ ξ_{q})|}_{q = p} \end{matrix}

(5)

\begin{matrix} {\overset{D}{Γ^{*}}}_{i j k} (p) & = & {- \partial_{i}^{'} \partial_{j}^{'} \partial_{k} D (ξ_{p} ∥ ξ_{q})|}_{q = p} \end{matrix}

(6)

define a pair of dual affine connections

\overset{D}{\nabla}

and

\overset{D}{\nabla^{*}}

[1]. The duality of the connections holds with respect to the Riemannian metric

\overset{D}{g}

in terms of the following condition:

X 〈 Y, Z 〉 = 〈 {\overset{D}{\nabla}}_{X} Y, Z 〉 + 〈 Y, {\overset{D}{\nabla^{*}}}_{X} Z 〉

(7)

for all vector fields

X, Y

and Z, where the brackets

〈 \cdot, \cdot 〉

denote the inner product with respect to

\overset{D}{g}

[2].
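As a numerical illustration of Equation (4), the following minimal Python sketch recovers the metric coefficients from a divergence by central finite differences. The divergence used is the relative entropy on positive measures, which appears later in the paper as Equation (41). The function names, the step size h, and the sample point are illustrative assumptions and not part of the original text.

```python
import math

def kl_positive(p, q):
    """Relative entropy on positive measures (Equation (41) of the paper)."""
    return sum(qi - pi + pi * math.log(pi / qi) for pi, qi in zip(p, q))

def metric_from_divergence(D, p, h=1e-3):
    """Approximate g_ij(p) = -(d/dp_i)(d/dq_j) D(p || q) at q = p (Equation (4))."""
    n = len(p)
    g = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(si, sj):
                # Shift the i-th coordinate of the first argument and
                # the j-th coordinate of the second argument.
                pp, qq = list(p), list(p)
                pp[i] += si * h
                qq[j] += sj * h
                return D(pp, qq)
            mixed = (shifted(1, 1) - shifted(1, -1)
                     - shifted(-1, 1) + shifted(-1, -1)) / (4.0 * h * h)
            g[i][j] = -mixed
    return g

p = [0.5, 1.5, 2.0]          # hypothetical sample point in the positive cone
g = metric_from_divergence(kl_positive, p)
```

For this choice of divergence, the resulting matrix should be close to the diagonal matrix with entries 1/p_i, that is, the Fisher metric of Section 3.1.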

The inverse problem is to find a divergence D which generates a given geometrical structure

(M, g, \nabla, \nabla^{*})

. Matumoto [3] showed that a divergence exists for any such manifold. However, it is not unique: infinitely many divergences give the same geometrical structure. For a dually flat manifold, Amari and Nagaoka [2] introduced a canonical divergence, which is a Bregman divergence. Extensions of the canonical divergence within conformal geometry have been studied by Kurose [4] and Matsuzoe [5]. The canonical divergence has nice properties such as the generalized Pythagorean theorem and the geodesic projection theorem. It is an important problem to define a canonical divergence in the general case. The present paper gives an answer to this problem by using the inverse exponential map. We already used the inverse exponential map in our previous work [6], where we studied a different divergence function. We showed that it recovers the metric g in the sense of Equation (4) and has some consistency with the dual connections ∇ and

\nabla^{*}

. However, it turns out that it does not reduce to the well-established canonical divergence in the dually flat case. The divergence introduced in the present article not only recovers the original geometry directly in terms of Equations (4)–(6) but also coincides with the original canonical divergence in the dually flat case.

2. A New Approach to the General Inverse Problem

We begin with a motivation in terms of a simple example where the manifold is

R^{n}

equipped with the standard Euclidean metric and connection (here, the Levi-Civita connection): Let p be a fixed point in

R^{n}

, and consider the vector field pointing to p, that is

R^{n} \to R^{n}, q \mapsto p - q

(8)

Obviously, the vector field Equation (8) can be seen as the negative gradient of the squared distance

D_{p} : R^{n} \to R, q \mapsto D_{p} (q) : = D (p ∥ q) : = \frac{1}{2} {∥ p - q ∥}^{2} = \frac{1}{2} \sum_{i = 1}^{n} {(p_{i} - q_{i})}^{2}

as potential function, that is

p - q = - {grad}_{q} D_{p}

(9)

Here, the gradient

{grad}_{q}

is taken with respect to the canonical inner product on

R^{n}

.

We shall now generalize the relation Equation (9) between the squared distance

D_{p}

and the difference of two points p and q to the more general setting of a differentiable manifold M. Given a fixed point

p \in M

, we want to define a vector field

q \mapsto X (q, p)

, at least in a neighbourhood of p, that corresponds to the difference vector field Equation (8). Obviously, the problem is that the difference

p - q

is not naturally defined for a general manifold M. We need an affine connection ∇ in order to have a notion of a difference. Given such a connection ∇, for each point

q \in M

and each direction

X \in T_{q} M

we consider the geodesic

γ_{q, X} (t)

, with the initial point q and the initial velocity X, that is

γ_{q, X} (0) = q

and

{\dot{γ}}_{q, X} (0) = X

. If

γ_{q, X} (t)

is defined for all

0 \leq t \leq 1

, the endpoint

p = γ_{q, X} (1)

is interpreted as the result of a translation of the point q along a straight line in the direction of the vector X. This straightness is expressed in terms of the local coordinates

ξ (t) : = (ξ^{1} (t), \dots, ξ^{n} (t)) : = ξ (γ_{q, X} (t))

of the geodesic

γ_{q, X}

by the following set of differential equations:

{\ddot{ξ}}^{i} (t) + Γ_{j k}^{i} (ξ (t)) {\dot{ξ}}^{j} (t) {\dot{ξ}}^{k} (t) = 0, i = 1, \dots, n

(10)

The translation of points along geodesics defines a map, the so-called exponential map:

{exp}_{q} : U_{q} \to M, X \mapsto γ_{q, X} (1)

(11)

where

U_{q} \subseteq T_{q} M

denotes the set of tangent vectors X, for which the domain of

γ_{q, X}

contains the unit interval

[0, 1]

.

Given two points p and q, one can interpret any X with

{exp}_{q} (X) = p

as a difference vector X that translates q to p. Throughout this paper we assume the existence and uniqueness of such a difference vector, denoted by

X (q, p)

(see Figure 1).

Figure 1. Illustration of (A) the difference vector

p - q

in

R^{n}

pointing from q to p; and (B) the difference vector

X (q, p) = {\dot{γ}}_{q, p} (0)

as the inverse of the exponential map in q.

This is a strong assumption, which is, however, always locally satisfied. On one hand, we are mainly interested in local properties. On the other hand, although being quite restrictive in general, this property will be satisfied in our information-geometric context, where g is given by the Fisher metric and ∇ is given by the m- and e-connections and their convex combinations, the α-connections.

If we attach to each point

q \in M

the difference vector

X (q, p)

, we obtain a vector field that corresponds to the vector field Equation (8) in

R^{n}

. In order to interpret this vector field as a negative gradient field of a (squared) distance function, and thereby generalize Equation (9), we need a Riemannian metric g on M. Given such a metric, we assume integrability of X and ∇, respectively, in the sense that for all p there exists a function

D_{p}

satisfying

X (q, p) = - {grad}_{q} D_{p}

(12)

Here, the Riemannian gradient is taken with respect to g, which is defined by the property that the total differential

d_{q} D_{p}

can be expressed as an inner product:

〈 {grad}_{q} D_{p}, Y 〉 = d_{q} D_{p} (Y), Y \in T_{q} M

Obviously, if there are functions

D_{p}

satisfying the condition of Equation (12) then they are unique up to a constant that can vary with p, and we can therefore assume

D_{p} (p) = 0

. Throughout the paper we will also use the standard notation

D (p ∥ q) = D_{p} (q)

of a divergence as a function D of two arguments. In order to recover D from Equation (12) we consider any curve

γ : [0, 1] \to M

that connects q with p, that is

γ (0) = q

and

γ (1) = p

. We compose the inner product of the curve velocity

\dot{γ} (t)

with the inverse of the exponential map

X (γ (t), p)

in

γ (t)

and integrate this along the curve:

\begin{matrix} \int_{0}^{1} 〈 X (γ (t), p), \dot{γ} (t) 〉 d t & = & - \int_{0}^{1} 〈{grad}_{γ (t)} D_{p}, \dot{γ} (t)〉 d t \\ & = & - \int_{0}^{1} (d_{γ (t)} D_{p}) (\dot{γ} (t)) d t \\ & = & - \int_{0}^{1} \frac{d (D_{p} \circ γ)}{d t} (t) d t \\ & = & D_{p} (γ (0)) - D_{p} (γ (1)) \\ & = & D_{p} (q) - D_{p} (p) = D_{p} (q) = D (p ∥ q) \end{matrix}

(13)
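The derivation of Equation (13) shows that, for an integrable field X, the integral does not depend on the chosen curve. A minimal Python sketch of this path independence in the Euclidean example of Equations (8) and (9) is given here. The quadrature rule, the two sample curves, and all names are assumptions made for illustration.

```python
import math

def path_integral(field, inner, gamma, steps=2000):
    """Midpoint-rule approximation of the integral over [0, 1] of
    <field(gamma(t)), gamma'(t)>, with gamma'(t) dt replaced by a chord."""
    dt = 1.0 / steps
    total = 0.0
    for k in range(steps):
        start, end = gamma(k * dt), gamma((k + 1) * dt)
        mid = gamma((k + 0.5) * dt)
        chord = [e - s for s, e in zip(start, end)]
        total += inner(mid, field(mid), chord)
    return total

p = [1.0, 2.0]    # fixed target point (hypothetical values)
q = [-1.0, 0.5]   # starting point (hypothetical values)

def difference_field(x):
    # Equation (8): the vector at x pointing to the fixed point p.
    return [pi - xi for pi, xi in zip(p, x)]

def euclidean_inner(x, u, v):
    # Canonical inner product on R^n (independent of the base point x).
    return sum(ui * vi for ui, vi in zip(u, v))

def straight(t):
    return [qi + t * (pi - qi) for pi, qi in zip(p, q)]

def curved(t):
    # Same endpoints as `straight`; the bump vanishes at t = 0 and t = 1.
    bump = math.sin(math.pi * t)
    s = straight(t)
    return [s[0] + bump, s[1] - 2.0 * bump]

half_squared_distance = 0.5 * sum((pi - qi) ** 2 for pi, qi in zip(p, q))
along_straight = path_integral(difference_field, euclidean_inner, straight)
along_curved = path_integral(difference_field, euclidean_inner, curved)
```

Both integrals should agree with half the squared Euclidean distance between p and q, up to discretization error.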

In particular, we can apply this derivation to the geodesic connecting q and p even when the integrability of X is not guaranteed and obtain the definition of a general canonical divergence, discussed in more detail in Section 5. Before we treat the general definition of a canonical divergence, however, we discuss important special cases of divergences within the cone of positive measures and the simplex of probability measures included in it. In particular, we verify that the well-known relative entropy (KL-divergence) and the α-entropy (α-divergence) can be derived in terms of Equation (13).

3. Natural Connections for Positive and Probability Measures

3.1. The Fisher Metric and Its Gradients

We represent measures on the set

\{1, \dots, n\}

as elements of

R^{n}

. In this representation, the Dirac measures

δ_{i}

,

i = 1, \dots, n

, form the canonical basis of

R^{n}

. We consider the n-dimensional cone of positive measures on the set

\{1, \dots, n\}

, defined by

M_{n} : = R_{+}^{n} = \{p = \sum_{i = 1}^{n} p_{i} δ_{i} \in R^{n} : p_{i} > 0 for all i\}

and the corresponding

(n - 1)

-dimensional simplex of normalized measures (probability measures)

S_{n - 1} \subset M_{n}

:

S_{n - 1} : = \{p = \sum_{i = 1}^{n} p_{i} δ_{i} \in R^{n} : p_{i} > 0 for all i, and \sum_{i = 1}^{n} p_{i} = 1\}

There is a natural Riemannian metric on

M_{n}

, called the Fisher metric:

g_{p} (X, Y) : = \sum_{i = 1}^{n} \frac{1}{p_{i}} X_{i} Y_{i}, X, Y \in T_{p} M_{n}

In theoretical biology, the Fisher metric is also known as Shahshahani metric (see [7], Equation (7.48)). Given a point

p \in S_{n - 1}

and a vector

X \in T_{p} M_{n}

, its projection onto

T_{p} S_{n - 1}

with respect to

g_{p}

is given by

Π_{p}^{⊤} X = \sum_{i = 1}^{n} (X_{i} - p_{i} \sum_{j = 1}^{n} X_{j}) δ_{i}

(14)

and the corresponding projection onto the orthogonal complement of

T_{p} S_{n - 1}

is given by

Π_{p}^{⊥} X = \sum_{i = 1}^{n} (p_{i} \sum_{j = 1}^{n} X_{j}) δ_{i}

(15)

For a function

V : M_{n} \to R

, this metric implies the Riemannian gradient

{grad}_{p} V = \sum_{i = 1}^{n} (p_{i} \frac{\partial V}{\partial p_{i}} (p)) δ_{i}

(16)

A vector field

X_{p} = \sum_{i = 1}^{n} p_{i} f_{i} (p) δ_{i}, p \in M_{n}

(17)

is the gradient of a function V if and only if it satisfies for all

i, j

\frac{\partial f_{i}}{\partial p_{j}} = \frac{\partial f_{j}}{\partial p_{i}}

(18)

If we consider a function that is defined on

S_{n - 1}

, for instance the restriction of V:

M_{n} \to R

to

S_{n - 1}

, then the vector Equation (16), evaluated in

p \in S_{n - 1}

, will not necessarily be an element of

T_{p} S_{n - 1}

. Therefore, in order to evaluate the gradient on

S_{n - 1}

, we have to project the vector Equation (16) onto

T_{p} S_{n - 1}

with respect to the metric g by using Equation (14). This leads to the following gradient formula for functions on

S_{n - 1}

:

{grad}_{p} V = \sum_{i = 1}^{n} p_{i} (\frac{\partial V}{\partial p_{i}} (p) - \sum_{j = 1}^{n} p_{j} \frac{\partial V}{\partial p_{j}} (p)) δ_{i}, p \in S_{n - 1}

(19)
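The relation between Equations (14)–(16) and (19) can be checked numerically. The following Python sketch, with hypothetical helper names and sample values, implements the two projections and both gradient formulas, taking the vector of partial derivatives of V as input.

```python
def fisher_inner(p, u, v):
    """Fisher metric g_p(u, v) = sum_i u_i v_i / p_i."""
    return sum(ui * vi / pi for pi, ui, vi in zip(p, u, v))

def project_tangent(p, x):
    """Equation (14): projection onto the tangent space of the simplex."""
    s = sum(x)
    return [xi - pi * s for pi, xi in zip(p, x)]

def project_normal(p, x):
    """Equation (15): projection onto the orthogonal complement."""
    s = sum(x)
    return [pi * s for pi in p]

def grad_cone(p, dV):
    """Equation (16): gradient on the cone, from the partial derivatives dV."""
    return [pi * di for pi, di in zip(p, dV)]

def grad_simplex(p, dV):
    """Equation (19): gradient on the simplex."""
    mean = sum(pi * di for pi, di in zip(p, dV))
    return [pi * (di - mean) for pi, di in zip(p, dV)]

p = [0.2, 0.3, 0.5]     # a point of the simplex (hypothetical values)
x = [1.0, -2.0, 0.5]    # an arbitrary vector in T_p M_n
dV = [0.7, -1.2, 2.0]   # hypothetical partial derivatives of some V at p

tangent = project_tangent(p, x)
normal = project_normal(p, x)
```

The tangent part sums to zero, the two parts are orthogonal with respect to the Fisher metric, and applying `project_tangent` to `grad_cone(p, dV)` reproduces `grad_simplex(p, dV)`.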

This motivates the consideration of general vector fields of the form

X_{p} = \sum_{i = 1}^{n} p_{i} (f_{i} (p) - \sum_{j = 1}^{n} p_{j} f_{j} (p)) δ_{i}, p \in S_{n - 1}

(20)

Such a vector field is integrable, in the sense that it is the gradient Equation (19) of a potential function V, if and only if the following condition holds for all

i, j, k

(see [7], Equation (19.23)):

\frac{\partial f_{i}}{\partial p_{j}} + \frac{\partial f_{j}}{\partial p_{k}} + \frac{\partial f_{k}}{\partial p_{i}} = \frac{\partial f_{i}}{\partial p_{k}} + \frac{\partial f_{k}}{\partial p_{j}} + \frac{\partial f_{j}}{\partial p_{i}}

(21)

3.2. The Mixture and the Exponential Connections

After having introduced the Fisher metric and corresponding gradient fields, we now define natural notions of straight lines on

M_{n}

and

S_{n - 1}

, respectively, induced by corresponding affine connections. Let us first introduce the straight lines of the so-called mixture connection

\nabla^{(m)}

on

M_{n}

. Given a point

p \in M_{n}

and a direction

X \in T_{p} M_{n}

, the most natural way to define a straight line that starts in p and has velocity X is given by the so-called m-geodesic

γ (t) = p + t X

(22)

We obtain the exponential map for

t = 1

, which is, in this simple example, the translation:

{exp}_{p}^{(m)} (X) = p + X

The inverse, therefore, maps a point q to the difference vector that translates p into q:

X^{(m)} (p, q) : = {({exp}_{p}^{(m)})}^{- 1} (q) = q - p

With this difference as X in Equation (22), we obtain the geodesic that connects p with q:

γ (t) = p + t (q - p)

(23)

If we choose a point

p \in S_{n - 1}

and

X \in T_{p} S_{n - 1}

, or two points

p, q \in S_{n - 1}

, respectively, then the corresponding geodesics, Equations (22) and (23), will stay in

S_{n - 1}

. Therefore, the restriction of the exponential map to

T_{p} S_{n - 1}

and its inverse are trivial:

{\bar{exp}}_{p}^{(m)} (X) = p + X, {\bar{X}}^{(m)} (p, q) : = {({\bar{exp}}_{p}^{(m)})}^{- 1} (q) = q - p

where we use a bar over symbols in order to denote the restriction of corresponding objects to

S_{n - 1}

.

Now let us come to the notion of an e-geodesic and the exponential map of the so-called e-connection

\nabla^{(e)}

. Given a point

p \in M_{n}

and a direction

X \in T_{p} M_{n}

, we consider the geodesic

γ (t) = \sum_{i = 1}^{n} p_{i} exp (t \frac{X_{i}}{p_{i}}) δ_{i}

(24)

(The “exp” on the right-hand side of Equation (24) denotes the standard real-valued natural exponential function.) The exponential map of the e-connection is given for

t = 1

:

{exp}_{p}^{(e)} (X) = \sum_{i = 1}^{n} p_{i} exp (\frac{X_{i}}{p_{i}}) δ_{i}

with the inverse

X^{(e)} (p, q) : = {({exp}_{p}^{(e)})}^{- 1} (q) = \sum_{i = 1}^{n} p_{i} ln (\frac{q_{i}}{p_{i}}) δ_{i}

This implies that the e-geodesic connecting p with q is given by

γ (t) = \sum_{i = 1}^{n} p_{i} {(\frac{q_{i}}{p_{i}})}^{t} δ_{i}

(25)

Clearly, if we start in a point

p \in S_{n - 1}

and go along the e-geodesic Equation (24) in a direction X that is tangential to

S_{n - 1}

, we will not stay in

S_{n - 1}

. Analogously, if we connect a point

p \in S_{n - 1}

with a point

q \in S_{n - 1}

in terms of the e-geodesic Equation (25), then the intermediate points will in general not be in the set

S_{n - 1}

. It turns out that, in order to obtain the right exponential map of the e-connection defined on

S_{n - 1}

, we have to normalize the geodesic, which leads to:

{\bar{exp}}_{p}^{(e)} (X) = \sum_{i = 1}^{n} \frac{p_{i} exp (\frac{X_{i}}{p_{i}})}{\sum_{j = 1}^{n} p_{j} exp (\frac{X_{j}}{p_{j}})} δ_{i}

{\bar{X}}^{(e)} (p, q) : = {({\bar{exp}}_{p}^{(e)})}^{- 1} (q) = \sum_{i = 1}^{n} p_{i} (ln (\frac{q_{i}}{p_{i}}) - \sum_{j = 1}^{n} p_{j} ln (\frac{q_{j}}{p_{j}})) δ_{i}
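As a sanity check of the two formulas above, the following Python sketch implements the normalized e-exponential map on the simplex and its inverse and verifies that they undo each other. The function names and the sample points are illustrative assumptions.

```python
import math

def exp_e_simplex(p, x):
    """Normalized e-exponential map at p, applied to a tangent vector x."""
    weights = [pi * math.exp(xi / pi) for pi, xi in zip(p, x)]
    z = sum(weights)
    return [w / z for w in weights]

def log_e_simplex(p, q):
    """Inverse map: the tangent vector at p that is mapped to q."""
    ratios = [math.log(qi / pi) for pi, qi in zip(p, q)]
    mean = sum(pi * ri for pi, ri in zip(p, ratios))
    return [pi * (ri - mean) for pi, ri in zip(p, ratios)]

p = [0.2, 0.3, 0.5]   # hypothetical points of the simplex
q = [0.6, 0.1, 0.3]
x = log_e_simplex(p, q)
q_recovered = exp_e_simplex(p, x)
```

The components of `x` sum to zero, so `x` is tangent to the simplex, and `q_recovered` agrees with `q` up to rounding.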

3.3. The α-Connections

Given

α \in [- 1, 1]

, we define the following convex combination of the mixture connection

\nabla^{(m)}

and the exponential connection

\nabla^{(e)}

on

M_{n}

:

\nabla^{(α)} : = \frac{1 - α}{2} \nabla^{(m)} + \frac{1 + α}{2} \nabla^{(e)} = \nabla^{(m)} + \frac{1 + α}{2} (\nabla^{(e)} - \nabla^{(m)})

(26)

The differential equation for the α-geodesic with initial point

p \in M_{n}

and initial velocity

X \in T_{p} M_{n}

is given by

{\ddot{γ}}_{i} - \frac{1 + α}{2} \frac{{\dot{γ}}_{i}^{2}}{γ_{i}} = 0, γ (0) = p, \dot{γ} (0) = X

(27)

One can easily verify that Equation (27) is solved by the following curve:

\begin{matrix} γ (t) & = & \sum_{i = 1}^{n} p_{i} {(1 + t \frac{1 - α}{2} \frac{X_{i}}{p_{i}})}^{\frac{2}{1 - α}} δ_{i} \end{matrix}

(28)

By setting

t = 1

, we can define the corresponding α-exponential map:

{exp}_{p}^{(α)} (X) = \sum_{i = 1}^{n} p_{i} {(1 + \frac{1 - α}{2} \frac{X_{i}}{p_{i}})}^{\frac{2}{1 - α}} δ_{i}

(29)

with the inverse

X^{(α)} (p, q) : = {({exp}_{p}^{(α)})}^{- 1} (q) = \frac{2}{1 - α} \sum_{i = 1}^{n} p_{i} ({(\frac{q_{i}}{p_{i}})}^{\frac{1 - α}{2}} - 1) δ_{i}

(30)

Finally, the α-geodesic with initial point p and endpoint q is given by

γ (t) = \sum_{i = 1}^{n} {(p_{i}^{\frac{1 - α}{2}} + t (q_{i}^{\frac{1 - α}{2}} - p_{i}^{\frac{1 - α}{2}}))}^{\frac{2}{1 - α}} δ_{i}

(31)
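Equations (27)–(31) can be verified numerically. The sketch below assumes -1 ≤ α < 1 (so that the exponent 2/(1 - α) is finite) and approximates the derivatives in Equation (27) by central differences. All names, the step size, and the sample values are hypothetical.

```python
def alpha_exp(p, x, alpha):
    """Equation (29): alpha-exponential map on the cone (requires alpha != 1)."""
    c = (1.0 - alpha) / 2.0
    return [pi * (1.0 + c * xi / pi) ** (1.0 / c) for pi, xi in zip(p, x)]

def alpha_log(p, q, alpha):
    """Equation (30): inverse of the alpha-exponential map."""
    c = (1.0 - alpha) / 2.0
    return [pi * ((qi / pi) ** c - 1.0) / c for pi, qi in zip(p, q)]

def alpha_geodesic(p, q, alpha, t):
    """Equation (31): alpha-geodesic from p (t = 0) to q (t = 1)."""
    c = (1.0 - alpha) / 2.0
    return [(pi ** c + t * (qi ** c - pi ** c)) ** (1.0 / c)
            for pi, qi in zip(p, q)]

def geodesic_residual(p, q, alpha, t, h=1e-4):
    """Left-hand side of Equation (27), using central differences in t."""
    before = alpha_geodesic(p, q, alpha, t - h)
    here = alpha_geodesic(p, q, alpha, t)
    after = alpha_geodesic(p, q, alpha, t + h)
    residual = []
    for g_minus, g, g_plus in zip(before, here, after):
        velocity = (g_plus - g_minus) / (2.0 * h)
        acceleration = (g_plus - 2.0 * g + g_minus) / (h * h)
        residual.append(acceleration - (1.0 + alpha) / 2.0 * velocity ** 2 / g)
    return residual

p = [0.5, 1.0, 2.0]   # hypothetical positive measures
q = [1.5, 0.7, 1.2]
alpha = 0.3
x = alpha_log(p, q, alpha)
```

Here `alpha_exp(p, x, alpha)` returns `q` up to rounding, and the residual of the geodesic equation is zero up to the finite-difference error.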

The α-connection

{\bar{\nabla}}^{(α)}

on

S_{n - 1}

is defined as the projection of

\nabla^{(α)}

with respect to the Fisher metric g. The corresponding geodesic equation is a modification of Equation (27):

{\ddot{γ}}_{i} - \frac{1 + α}{2} \{\frac{{\dot{γ}}_{i}^{2}}{γ_{i}} - γ_{i} \sum_{j = 1}^{n} \frac{{\dot{γ}}_{j}^{2}}{γ_{j}}\} = 0, γ (0) = p, \dot{γ} (0) = X

(32)

A natural ansatz is to normalize the unconstrained geodesics of Equations (28) and (31). However, it turns out that, in order to solve the geodesic Equation (32), both normalized curves have to be reparametrized. More precisely, it has been shown in [8] (Theorems 14.1 and 15.1) that, with appropriate reparametrizations

τ_{p, X}

and

τ_{p, q}

, we have the following form of the α-geodesic in the simplex

S_{n - 1}

:

γ_{p, X} (t) = \sum_{i = 1}^{n} \frac{p_{i} {(1 + τ_{p, X} (t) \frac{1 - α}{2} \frac{X_{i}}{p_{i}})}^{\frac{2}{1 - α}}}{\sum_{j = 1}^{n} p_{j} {(1 + τ_{p, X} (t) \frac{1 - α}{2} \frac{X_{j}}{p_{j}})}^{\frac{2}{1 - α}}} δ_{i}

(33)

and

γ_{p, q} (t) = \sum_{i = 1}^{n} \frac{{(p_{i}^{\frac{1 - α}{2}} + τ_{p, q} (t) (q_{i}^{\frac{1 - α}{2}} - p_{i}^{\frac{1 - α}{2}}))}^{\frac{2}{1 - α}}}{\sum_{j = 1}^{n} {(p_{j}^{\frac{1 - α}{2}} + τ_{p, q} (t) (q_{j}^{\frac{1 - α}{2}} - p_{j}^{\frac{1 - α}{2}}))}^{\frac{2}{1 - α}}} δ_{i}

(34)

Here, the conditions

γ_{p, X} (0) = p, {\dot{γ}}_{p, X} (0) = {\dot{τ}}_{p, X} (0) X = X, and γ_{p, q} (0) = p, γ_{p, q} (1) = q

imply

τ_{p, X} (0) = 0, {\dot{τ}}_{p, X} (0) = 1, and τ_{p, q} (0) = 0, τ_{p, q} (1) = 1

Now let us couple X and q by assuming

γ_{p, X} (1) = q

. Together with the condition

\sum_{i = 1}^{n} X_{i} = 0

, this implies

X = \frac{1}{τ_{p, X} (1)} \frac{2}{1 - α} \sum_{i = 1}^{n} p_{i} (\frac{{(\frac{q_{i}}{p_{i}})}^{\frac{1 - α}{2}}}{\sum_{j = 1}^{n} p_{j} {(\frac{q_{j}}{p_{j}})}^{\frac{1 - α}{2}}} - 1) δ_{i}

(35)

Furthermore, if the initial and endpoints of the two curves are identical, then

γ_{p, X} (t) = γ_{p, q} (t)

for all t. In particular,

\begin{matrix} X & = & {\dot{γ}}_{p, X} (0) = {\dot{γ}}_{p, q} (0) \\ & = & {\dot{τ}}_{p, q} (0) \frac{2}{1 - α} \sum_{i = 1}^{n} p_{i} ({(\frac{q_{i}}{p_{i}})}^{\frac{1 - α}{2}} - \sum_{j = 1}^{n} p_{j} {(\frac{q_{j}}{p_{j}})}^{\frac{1 - α}{2}}) δ_{i} \end{matrix}

(36)

A comparison of Equations (35) and (36) yields

{\dot{τ}}_{p, q} (0) \sum_{j = 1}^{n} p_{j} {(\frac{q_{j}}{p_{j}})}^{\frac{1 - α}{2}} = \frac{1}{τ_{p, X} (1)}

4. Canonical Divergences for Positive and Probability Measures

4.1. The Relative Entropy (KL-Divergence)

Now we apply the ansatz of Equation (12) in order to define divergence functions for the m- and e-connections on the cone

M_{n}

of positive measures. The inverse maps of the corresponding exponential maps are given by

\begin{matrix} X^{(m)} (q, p) & = & \sum_{i = 1}^{n} (p_{i} - q_{i}) δ_{i} \\ X^{(e)} (q, p) & = & \sum_{i = 1}^{n} q_{i} ln \frac{p_{i}}{q_{i}} δ_{i} \end{matrix}

(37)

We can easily verify that the corresponding vector fields

q \mapsto X^{(m)} (q, p), q \mapsto X^{(e)} (q, p)

(38)

are gradient fields: The functions

f_{i} (q) : = \frac{p_{i}}{q_{i}} - 1, and g_{i} (q) : = ln \frac{p_{i}}{q_{i}}

trivially satisfy the integrability condition

\frac{\partial f_{i}}{\partial q_{j}} = \frac{\partial f_{j}}{\partial q_{i}}

and

\frac{\partial g_{i}}{\partial q_{j}} = \frac{\partial g_{j}}{\partial q_{i}}

for all

i, j

. Therefore, for both connections, there are canonical divergence functions which solve the corresponding Equation (12).

We derive the canonical divergence of the m-connection first, which we denote by

D^{(m)}

. We consider two positive measures p and q and a curve γ:

[0, 1] \to M_{n}

connecting q with p, that is

γ (0) = q

and

γ (1) = p

. This implies

〈X^{(m)} (γ (t), p), \dot{γ} (t)〉 = \sum_{i = 1}^{n} \frac{1}{γ_{i} (t)} (p_{i} - γ_{i} (t)) {\dot{γ}}_{i} (t)

(39)

and

\begin{matrix} D^{(m)} (p ∥ q) & = & \int_{0}^{1} 〈X^{(m)} (γ (t), p), \dot{γ} (t)〉 d t \\ & = & \sum_{i = 1}^{n} \int_{0}^{1} \frac{1}{γ_{i} (t)} (p_{i} - γ_{i} (t)) {\dot{γ}}_{i} (t) d t \\ & = & \sum_{i = 1}^{n} {[p_{i} ln γ_{i} (t) - γ_{i} (t)]}_{0}^{1} \\ & = & \sum_{i = 1}^{n} (p_{i} ln p_{i} - p_{i} - p_{i} ln q_{i} + q_{i}) \\ & = & \sum_{i = 1}^{n} (q_{i} - p_{i} + p_{i} ln \frac{p_{i}}{q_{i}}) \end{matrix}

With the same calculation for the e-connection, we obtain the corresponding canonical divergence, which we denote by

D^{(e)}

. Again, we consider a curve γ connecting q with p. This implies

〈X^{(e)} (γ (t), p), \dot{γ} (t)〉 = \sum_{i = 1}^{n} {\dot{γ}}_{i} (t) ln \frac{p_{i}}{γ_{i} (t)}

(40)

and

\begin{matrix} D^{(e)} (p ∥ q) & = & \int_{0}^{1} 〈X^{(e)} (γ (t), p), \dot{γ} (t)〉 d t \\ & = & \sum_{i = 1}^{n} \int_{0}^{1} {\dot{γ}}_{i} (t) ln \frac{p_{i}}{γ_{i} (t)} d t \\ & = & \sum_{i = 1}^{n} {[γ_{i} (t) (1 + ln \frac{p_{i}}{γ_{i} (t)})]}_{0}^{1} \\ & = & \sum_{i = 1}^{n} (p_{i} - q_{i} (1 + ln \frac{p_{i}}{q_{i}})) \\ & = & \sum_{i = 1}^{n} (p_{i} - q_{i} + q_{i} ln \frac{q_{i}}{p_{i}}) \\ & = & D^{(m)} (q ∥ p) \end{matrix}

These calculations give rise to the following definition:

Definition 1.

The function

D : M_{n} \times M_{n} \to R

defined by

D (p ∥ q) : = \sum_{i = 1}^{n} q_{i} - \sum_{i = 1}^{n} p_{i} + \sum_{i = 1}^{n} p_{i} ln \frac{p_{i}}{q_{i}}

(41)

is called the relative entropy or Kullback–Leibler divergence. Its restriction to the set of probability distributions is given by

D (p ∥ q) : = \sum_{i = 1}^{n} p_{i} ln \frac{p_{i}}{q_{i}}

(42)
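The two path integrals that lead to Definition 1 can be reproduced numerically. The following Python sketch integrates the Fisher inner product of Equations (39) and (40) along an arbitrarily chosen curve in the cone and compares the results with Equation (41). The curve, the quadrature rule, and all names are assumptions made for illustration.

```python
import math

def kl_positive(p, q):
    """Equation (41): relative entropy on positive measures."""
    return sum(qi - pi + pi * math.log(pi / qi) for pi, qi in zip(p, q))

def x_m(x, p):
    """First row of Equation (37): inverse m-exponential map at x."""
    return [pi - xi for pi, xi in zip(p, x)]

def x_e(x, p):
    """Second row of Equation (37): inverse e-exponential map at x."""
    return [xi * math.log(pi / xi) for pi, xi in zip(p, x)]

def fisher_path_integral(field, gamma, p, steps=4000):
    """Midpoint approximation of the integral of <field, gamma'> in the Fisher metric."""
    dt = 1.0 / steps
    total = 0.0
    for k in range(steps):
        start, end = gamma(k * dt), gamma((k + 1) * dt)
        mid = gamma((k + 0.5) * dt)
        vec = field(mid, p)
        total += sum(v * (e - s) / m for v, s, e, m in zip(vec, start, end, mid))
    return total

p = [0.4, 1.1, 2.5]   # hypothetical positive measures
q = [1.3, 0.6, 0.9]

def gamma(t):
    # A curve from q (t = 0) to p (t = 1); the bump keeps all coordinates positive.
    bump = 0.2 * math.sin(math.pi * t)
    return [qi + t * (pi - qi) + bump for pi, qi in zip(p, q)]

d_m = fisher_path_integral(x_m, gamma, p)
d_e = fisher_path_integral(x_e, gamma, p)
```

Up to discretization error, `d_m` matches `kl_positive(p, q)` and `d_e` matches `kl_positive(q, p)`, in agreement with the relation between D^{(e)} and D^{(m)} derived above.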

Proposition 1.

The following holds:

X^{(m)} (q, p) = - {grad}_{q} D (p ∥ \cdot), X^{(e)} (q, p) = - {grad}_{q} D (\cdot ∥ p)

(43)

Furthermore, D is the only function on

M_{n} \times M_{n}

that satisfies the conditions Equation (43) and

D (p ∥ p) = 0

for all p.

Proof.

We first compute the partial derivatives

\begin{matrix} \frac{\partial D (p ∥ \cdot)}{\partial q_{i}} (q) = - \frac{p_{i}}{q_{i}} + 1, \frac{\partial D (\cdot ∥ p)}{\partial q_{i}} (q) = - ln \frac{p_{i}}{q_{i}} \end{matrix}

With Formula (16), we obtain

\begin{matrix} {({grad}_{q} D (p ∥ \cdot))}_{i} & = & q_{i} (- \frac{p_{i}}{q_{i}} + 1) = - p_{i} + q_{i} \\ {({grad}_{q} D (\cdot ∥ p))}_{i} & = & - q_{i} ln \frac{p_{i}}{q_{i}} \end{matrix}

A comparison with Equation (37) verifies Equation (43), which uniquely characterizes

D (p ∥ \cdot)

as well as

D (\cdot ∥ p)

, up to a constant depending on p. With the additional assumption

D (p ∥ p) = 0

for all p, this constant is fixed. ☐

One can now ask whether the restriction Equation (42) of the Kullback–Leibler divergence to the manifold

S_{n - 1}

is the right divergence function in the sense that the relations in Equation (43) also hold for the exponential maps of the restricted m- and e-connections. It is easy to verify that this is indeed the case. Let us elaborate on the geometric reason for this. We consider a general Riemannian manifold M and a submanifold N in it. Given an affine connection ∇ on M, we can define its restriction

\bar{\nabla}

to N. More precisely, denoting the projection of a vector Z in

T_{p} M

onto

T_{p} N

by

Π_{p}^{⊤} (Z)

, we define

{{\bar{\nabla}}_{X} Y|}_{p} : = Π_{p}^{⊤} ({\nabla_{X} Y|}_{p})

, where X and Y are vector fields on N. Furthermore, we denote the exponential map of

\bar{\nabla}

by

{\bar{exp}}_{p}

and its inverse by

\bar{X} (p, q)

.

Now, given

p \in N

, we consider a function

D_{p}

on M that satisfies Equation (12). With the restriction

{\bar{D}}_{p}

of

D_{p}

to the submanifold N, this directly implies

Π_{q}^{⊤} (X (q, p)) = - {grad}_{q} {\bar{D}}_{p}

However, in order to have

\bar{X} (q, p) = - {grad}_{q} {\bar{D}}_{p}

, which corresponds to Equation (12) on the submanifold N, the following equality is required:

\bar{X} (q, p) = Π_{q}^{⊤} (X (q, p))

(44)

This condition is satisfied for the m- and e-connections on

M_{n}

and its submanifold

S_{n - 1}

, which implies the following proposition.

Proposition 2.

The following holds:

{\bar{X}}^{(m)} (q, p) = - {grad}_{q} D (p ∥ \cdot), {\bar{X}}^{(e)} (q, p) = - {grad}_{q} D (\cdot ∥ p)

(45)

where D is given by Equation (42) in Definition 1. Furthermore, D is the only function on

S_{n - 1} \times S_{n - 1}

that satisfies the conditions (45) and

D (p ∥ p) = 0

for all p.

The objects and derivations of this section represent a special case of a general dually flat manifold M, which will be studied in Section 5.

4.2. The α-Divergence

We now extend the method of Section 4.1 to the α-connections, leading to a generalization of the relative entropy, the so-called α-divergence. From the definition of the α-exponential map on the manifold

M_{n}

of positive measures, given in Equation (29), we obtain the inverse

X^{(α)} (q, p) : = {({exp}_{q}^{(α)})}^{- 1} (p) = \frac{2}{1 - α} \sum_{i = 1}^{n} q_{i} ({(\frac{p_{i}}{q_{i}})}^{\frac{1 - α}{2}} - 1) δ_{i}

(46)

In order to derive the canonical divergence

D^{(α)}

of the α-connection, which is integrable, we consider two points p and q and a curve γ:

[0, 1]

\to M_{n}

connecting q with p. We obtain

〈X^{(α)} (γ (t), p), \dot{γ} (t)〉 = \frac{2}{1 - α} \sum_{i = 1}^{n} {\dot{γ}}_{i} (t) ({(\frac{p_{i}}{γ_{i} (t)})}^{\frac{1 - α}{2}} - 1)

(47)

and

\begin{matrix} D^{(α)} (p ∥ q) & = & \int_{0}^{1} 〈X^{(α)} (γ (t), p), \dot{γ} (t)〉 d t \\ & = & \sum_{i = 1}^{n} \int_{0}^{1} \frac{2}{1 - α} {\dot{γ}}_{i} (t) ({(\frac{p_{i}}{γ_{i} (t)})}^{\frac{1 - α}{2}} - 1) d t \\ & = & \sum_{i = 1}^{n} {[\frac{4}{1 - α^{2}} {γ_{i} (t)}^{\frac{1 + α}{2}} {p_{i}}^{\frac{1 - α}{2}} - \frac{2}{1 - α} γ_{i} (t)]}_{0}^{1} \\ & = & \sum_{i = 1}^{n} (\frac{2}{1 + α} p_{i} - (\frac{4}{1 - α^{2}} q_{i}^{\frac{1 + α}{2}} p_{i}^{\frac{1 - α}{2}} - \frac{2}{1 - α} q_{i})) \\ & = & \sum_{i = 1}^{n} (\frac{2}{1 - α} q_{i} + \frac{2}{1 + α} p_{i} - \frac{4}{1 - α^{2}} q_{i}^{\frac{1 + α}{2}} p_{i}^{\frac{1 - α}{2}}) \end{matrix}

Obviously, we have

D^{(- α)} (p ∥ q) = D^{(α)} (q ∥ p)

(48)

These calculations give rise to the following definition:

Definition 2.

The function

D^{(α)} : M_{n} \times M_{n} \to R

defined by

D^{(α)} (p ∥ q) : = \frac{2}{1 - α} \sum_{i = 1}^{n} q_{i} + \frac{2}{1 + α} \sum_{i = 1}^{n} p_{i} - \frac{4}{1 - α^{2}} \sum_{i = 1}^{n} q_{i}^{\frac{1 + α}{2}} p_{i}^{\frac{1 - α}{2}}

(49)

is called the α-divergence. Its restriction to probability measures is given as

\begin{matrix} D^{(α)} (p ∥ q) & = & \frac{4}{1 - α^{2}} (1 - \sum_{i = 1}^{n} q_{i}^{\frac{1 + α}{2}} p_{i}^{\frac{1 - α}{2}}) \end{matrix}
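Definition 2 and the duality relation of Equation (48) translate directly into code. The following Python sketch assumes -1 < α < 1; the function names and sample values are hypothetical.

```python
def alpha_divergence(p, q, alpha):
    """Equation (49) on positive measures, for -1 < alpha < 1."""
    a = (1.0 - alpha) / 2.0
    b = (1.0 + alpha) / 2.0
    return sum(qi / a + pi / b - (qi ** b) * (pi ** a) / (a * b)
               for pi, qi in zip(p, q))

def alpha_divergence_simplex(p, q, alpha):
    """Restriction of Equation (49) to probability measures."""
    a = (1.0 - alpha) / 2.0
    b = (1.0 + alpha) / 2.0
    overlap = sum((qi ** b) * (pi ** a) for pi, qi in zip(p, q))
    return (1.0 - overlap) / (a * b)

alpha = 0.3
p = [0.5, 1.0, 2.0]          # hypothetical positive measures
q = [1.5, 0.7, 1.2]
ps = [0.2, 0.3, 0.5]         # hypothetical probability measures
qs = [0.6, 0.1, 0.3]

dual_left = alpha_divergence(p, q, -alpha)
dual_right = alpha_divergence(q, p, alpha)
```

The values `dual_left` and `dual_right` coincide, as stated in Equation (48), and on normalized inputs the general formula agrees with its restriction.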

Proposition 3.

The following holds:

X^{(α)} (q, p) = - {grad}_{q} D^{(α)} (p ∥ \cdot)

(50)

Furthermore,

D^{(α)}

is the only function on

M_{n} \times M_{n}

that satisfies the condition (50) and

D^{(α)} (p ∥ p) = 0

for all p.

Proof.

We compute the partial derivative

\frac{\partial D^{(α)} (p ∥ \cdot)}{\partial q_{i}} (q) = \frac{2}{1 - α} (1 - q_{i}^{\frac{1 + α}{2} - 1} p_{i}^{\frac{1 - α}{2}})

With Formula (16), we obtain

\begin{matrix} {({grad}_{q} D^{(α)} (p ∥ \cdot))}_{i} & = & q_{i} \cdot \frac{2}{1 - α} (1 - q_{i}^{\frac{1 + α}{2} - 1} p_{i}^{\frac{1 - α}{2}}) \\ & = & \frac{2}{1 - α} (q_{i} - q_{i}^{\frac{1 + α}{2}} p_{i}^{\frac{1 - α}{2}}) \end{matrix}

A comparison with Equation (46) verifies Equation (50) which uniquely characterizes

D^{(α)} (p ∥ \cdot)

, up to a constant depending on p. With the additional assumption

D^{(α)} (p ∥ p) = 0

for all p, this constant is fixed. ☐
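The statement of Proposition 3 can also be checked by finite differences. The sketch below approximates the partial derivatives of the α-divergence of Equation (49) in its second argument, applies the gradient Formula (16), and compares the result with Equation (46). It assumes -1 < α < 1; all names, the step size, and the sample values are illustrative.

```python
def alpha_divergence(p, q, alpha):
    """Equation (49) on positive measures, for -1 < alpha < 1."""
    a = (1.0 - alpha) / 2.0
    b = (1.0 + alpha) / 2.0
    return sum(qi / a + pi / b - (qi ** b) * (pi ** a) / (a * b)
               for pi, qi in zip(p, q))

def x_alpha(q, p, alpha):
    """Equation (46): inverse alpha-exponential map at q, evaluated at p."""
    a = (1.0 - alpha) / 2.0
    return [qi * ((pi / qi) ** a - 1.0) / a for pi, qi in zip(p, q)]

def negative_fisher_gradient(p, q, alpha, h=1e-6):
    """-grad_q D(p || .) via Formula (16), with central-difference partials."""
    result = []
    for i in range(len(q)):
        up, down = list(q), list(q)
        up[i] += h
        down[i] -= h
        partial = (alpha_divergence(p, up, alpha)
                   - alpha_divergence(p, down, alpha)) / (2.0 * h)
        result.append(-q[i] * partial)
    return result

alpha = 0.3
p = [0.5, 1.0, 2.0]   # hypothetical positive measures
q = [1.5, 0.7, 1.2]
lhs = x_alpha(q, p, alpha)
rhs = negative_fisher_gradient(p, q, alpha)
```

The vectors `lhs` and `rhs` agree componentwise up to the finite-difference error.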

In what follows, we use the notation

D^{(α)}

also for

α \in \{- 1, 1\}

by setting

D^{(- 1)} (p ∥ q) : = D (p ∥ q)

and

D^{(1)} (p ∥ q) : = D (q ∥ p)

where D is the relative entropy defined by Equation (41). This is consistent with the definition of the α-connections, given by Equation (26), where we have the m-connection for

α = - 1

and the e-connection for

α = 1

. Note that

D^{(0)}

is closely related to the Hellinger distance

d^{H} (p, q) : = {(\sum_{i = 1}^{n} {({p_{i}}^{\frac{1}{2}} - {q_{i}}^{\frac{1}{2}})}^{2})}^{\frac{1}{2}}

More precisely, we have

D^{(0)} (p ∥ q) = 2 {(d^{H} (p, q))}^{2}

(51)
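Equation (51) and the choice of D^{(±1)} can be illustrated numerically. The following Python sketch checks the Hellinger relation and evaluates D^{(α)} for α very close to -1 and +1. That D^{(α)} converges to the relative entropy in these limits is a standard fact that the text does not spell out; the offset 1e-6, the names, and the sample values are assumptions.

```python
import math

def kl_positive(p, q):
    """Equation (41): relative entropy on positive measures."""
    return sum(qi - pi + pi * math.log(pi / qi) for pi, qi in zip(p, q))

def alpha_divergence(p, q, alpha):
    """Equation (49) on positive measures, for -1 < alpha < 1."""
    a = (1.0 - alpha) / 2.0
    b = (1.0 + alpha) / 2.0
    return sum(qi / a + pi / b - (qi ** b) * (pi ** a) / (a * b)
               for pi, qi in zip(p, q))

def hellinger(p, q):
    """Hellinger distance as defined in the text."""
    return math.sqrt(sum((math.sqrt(pi) - math.sqrt(qi)) ** 2
                         for pi, qi in zip(p, q)))

p = [0.5, 1.0, 2.0]   # hypothetical positive measures
q = [1.5, 0.7, 1.2]

d_zero = alpha_divergence(p, q, 0.0)
near_minus_one = alpha_divergence(p, q, -1.0 + 1e-6)
near_plus_one = alpha_divergence(p, q, 1.0 - 1e-6)
```

Here `d_zero` equals twice the squared Hellinger distance, `near_minus_one` is close to D(p ∥ q), and `near_plus_one` is close to D(q ∥ p).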

In fact, the derivation of

D^{(α)}

was based on the idea of associating a distance-like function with the α-connections through the general Equation (12). However, it turns out that, although they are naturally motivated, the functions

D^{(α)}

do not share all properties of the square of a distance, except for

α = 0

. For α ≠ 0, symmetry is not satisfied. On the other hand, we have

D^{(α)} (p ∥ q) \geq 0

, and

D^{(α)} (p ∥ q) = 0

if and only if

p = q

.

We now ask whether the restriction of

D^{(α)}

, which is defined for positive measures, to the simplex

S_{n - 1}

of probability distributions is the canonical divergence for the α-connections on

S_{n - 1}

. We have seen that this is the case for the m- and e-connections, that is for

α \in \{- 1, + 1\}

. However, for general α, the situation is more complicated. From Equation (36) we obtain

{\bar{X}}^{(α)} (q, p) = {\dot{τ}}_{q, p} (0) Π_{q}^{⊤} (X^{(α)} (q, p))

This equality deviates from the condition of Equation (44) by the factor

{\dot{τ}}_{q, p} (0)

, which proves that the restriction of the α-divergence to

S_{n - 1}

does not coincide with the canonical α-divergence on the simplex. As an example, we consider the case

α = 0

, where the α-connection is the Levi-Civita connection of the Fisher metric. As we will see in the next section, the canonical divergence in that case equals

{\bar{D}}^{(0)} (p ∥ q) = \frac{1}{2} {(d^{F} (p, q))}^{2}

, where

d^{F}

denotes the distance with respect to the Fisher metric (see Equation (62)). Obviously, this divergence is different from the divergence

D^{(0)}

, given by Equation (51), which is based on the distance in the ambient space

M_{n}

, the Hellinger distance.
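The difference between the two divergences for α = 0 can be made concrete on the simplex. The sketch below assumes the standard closed form of the Fisher distance, d^F(p, q) = 2 \arccos (\sum_i \sqrt{p_i q_i}), valid for the metric normalization \sum_i (dp_i)^2 / p_i; this formula is not stated in this section, and the function names are illustrative.

```python
import math

def bhattacharyya(p, q):
    """Sum of sqrt(p_i q_i); equals 1 exactly when p = q on the simplex."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))

def d0_restricted(p, q):
    """D^(0)(p || q) = 2 * (Hellinger distance)^2, as in Equation (51)."""
    return 2 * sum((math.sqrt(pi) - math.sqrt(qi)) ** 2 for pi, qi in zip(p, q))

def d0_canonical(p, q):
    """(1/2) * d_F^2 with the assumed closed form d_F = 2 * arccos(sum sqrt(p_i q_i))."""
    c = min(1.0, bhattacharyya(p, q))  # guard against rounding slightly above 1
    return 0.5 * (2 * math.acos(c)) ** 2

p = [0.2, 0.3, 0.5]
q = [0.6, 0.3, 0.1]
restricted, canonical = d0_restricted(p, q), d0_canonical(p, q)
print(restricted, canonical)
```

With the angle θ = \arccos (\sum_i \sqrt{p_i q_i}), the two quantities are 8 \sin^2 (θ / 2) and 2 θ^2. They agree to second order for nearby points, but the restricted divergence is strictly smaller whenever p ≠ q.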

5. General Canonical Divergence and Its Consistency

5.1. Canonical Divergence

We have derived a canonical divergence when the vector field X of the inverse exponential map, that is

{exp}_{q} (X (q, p)) = p

for all p and q, is integrable. We now define a canonical divergence in a general n-dimensional dual manifold

(M, g, \nabla, \nabla^{*})

. Consider a ∇-geodesic

γ_{q, p} : [0, 1] \to M

connecting q and p. We define a tangent vector field

X_{t} (q, p)

along this geodesic:

X_{t} (q, p) : = X (γ_{q, p} (t), p)

(52)

Obviously,

\begin{matrix} X_{0} (q, p) & = & X (q, p) \end{matrix}

(53)

\begin{matrix} X_{1} (q, p) & = & 0 \end{matrix}

(54)

Definition 3.

A canonical divergence from p to q is defined by the path integral

D (p ∥ q) = \int_{0}^{1} 〈X_{t} (q, p), {\dot{γ}}_{q, p} (t)〉 d t

(55)

Replacing the ∇-geodesic

γ_{q, p}

from q to p by the reversed ∇-geodesic

γ_{p, q}

from p to q and the vector field

X_{t} (q, p)

by the vector field

X_{t}^{*} (p, q) : = X^{*} (γ_{p, q} (t), q)

of the dual connection

\nabla^{*}

leads to the following related definition of a canonical divergence:

\begin{matrix} D^{'} (p ∥ q) & : = & \int_{0}^{1} 〈X_{t}^{*} (p, q), {\dot{γ}}_{p, q} (t)〉 d t \end{matrix}

(56)

\begin{matrix} = & - \int_{0}^{1} 〈X^{*} (γ_{q, p} (t), q), {\dot{γ}}_{q, p} (t)〉 d t \end{matrix}

(57)

Although motivated and derived in different terms, the divergence of the article [9] turns out to coincide with

D^{'}

. The authors apply Hooke’s law to a “

\nabla^{*}

-spring” and define their divergence, in terms of an expression related to Equation (57), as the work that is necessary to move a point of unit mass from q to p along the ∇-geodesic

γ_{q, p}

against the force field

X^{*} (γ_{q, p} (t), q)

. We became aware of this article after submission of our present article. The divergence

D^{'}

shares many nice properties of our canonical divergence. However, in the integrability case, it is not generally true that

X (q, p) = - {grad}_{q} D^{'} (p ∥ \cdot)

, a property that serves as the main motivation of our article and is satisfied by our canonical divergence of Equation (55).

Before stating the main result that the canonical divergence defined by Equation (55) induces the same Riemannian metric g and the same pair of affine connections ∇ and

\nabla^{*}

, we show some of its properties. Since the geodesic connecting

γ_{q, p} (t)

and p is a part of the geodesic connecting q and p, corresponding to the interval

[t, 1]

, the inverse exponential map at

γ_{q, p} (t)

satisfies

X_{t} (q, p) = (1 - t) {\dot{γ}}_{q, p} (t)

(58)

Hence, we have

D (p ∥ q) = \int_{0}^{1} (1 - t) {∥{\dot{γ}}_{q, p} (t)∥}^{2} d t

(59)

where

{∥{\dot{γ}}_{q, p} (t)∥}^{2} = 〈{\dot{γ}}_{q, p} (t), {\dot{γ}}_{q, p} (t)〉

(60)

This already proves

D (p ∥ q) \geq 0

, and

D (p ∥ q) = 0

if and only if

p = q

. If we replace the parameter t by

1 - t

and use

γ_{q, p} (t) = γ_{p, q} (1 - t)

, we directly obtain the following representation of the canonical divergence:

Proposition 4.

The divergence of Definition 3 is given by

D (p ∥ q) = \int_{0}^{1} t {∥{\dot{γ}}_{p, q} (t)∥}^{2} d t

(61)

where

γ_{p, q}

denotes the geodesic from p to q.

Remark 1.

In the special case where M is self-dual,

\nabla = \nabla^{*}

is the Levi-Civita connection with respect to g. In that case, the velocity field

{\dot{γ}}_{p, q}

is parallel along the geodesic

γ_{p, q}

, and therefore

{∥ {\dot{γ}}_{p, q} (t) ∥}_{γ_{p, q} (t)} = {∥ {\dot{γ}}_{p, q} (0) ∥}_{p} = {∥ X (p, q) ∥}_{p} = d (p, q)

where

d (p, q)

denotes the Riemannian distance between p and q. This implies that the canonical divergence corresponds to the energy of the geodesic

γ_{p, q}

, that is

D (p ∥ q) = \frac{1}{2} d^{2} (p, q)

(62)

In the general case, where ∇ is not necessarily the Levi-Civita connection, we obtain the energy of the geodesic

γ_{p, q}

as the symmetrized version of the canonical divergence:

\frac{1}{2} (D (p ∥ q) + D (q ∥ p)) = \frac{1}{2} \int_{0}^{1} {∥{\dot{γ}}_{p, q} (t)∥}^{2} d t

(63)
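Equation (63) can be checked on a simple example in which ∇ is not the Levi-Civita connection. The sketch below assumes a one-dimensional manifold with a coordinate θ in which the ∇-geodesics are straight lines and the metric is g(θ) = e^{θ}; this choice is purely illustrative and not taken from the text. The divergence is evaluated from Equation (61) by numerical quadrature.

```python
import math

def simpson(f, n=2000):
    """Composite Simpson rule on [0, 1]; n must be even."""
    h = 1.0 / n
    total = f(0.0) + f(1.0)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(k * h)
    return total * h / 3

def canonical_divergence(theta_p, theta_q, metric):
    """Equation (61) for the straight-line geodesic theta(t) = theta_p + t * z."""
    z = theta_q - theta_p
    return simpson(lambda t: t * metric(theta_p + t * z) * z * z)

def energy(theta_p, theta_q, metric):
    """One half of the integral of the squared speed along the same geodesic."""
    z = theta_q - theta_p
    return 0.5 * simpson(lambda t: metric(theta_p + t * z) * z * z)

metric = math.exp  # assumed example metric g(theta) = exp(theta)
a, b = 0.3, 1.1
forward = canonical_divergence(a, b, metric)
backward = canonical_divergence(b, a, metric)
symmetrized = 0.5 * (forward + backward)
print(forward, backward, symmetrized, energy(a, b, metric))
```

In this example the two directed divergences differ, while their mean reproduces the energy of the geodesic, as stated in Equation (63).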

Remark 2.

Let us compare the canonical divergence D of the affine connection ∇ with the canonical divergence

D^{*}

of its dual connection

\nabla^{*}

, both defined by Equation (55) or equivalently by Equation (61). In the special case of the α-connection

\nabla = \nabla^{(α)}

, we have

D^{*} (p ∥ q) = D (q ∥ p)

(see Equation (48)). In Section 5.3, we will prove that this kind of symmetry holds for every dually flat manifold. However, our canonical divergence does not necessarily have this property when the space is not dually flat. This is in contrast to most other approaches, where this symmetry is considered a natural property of any divergence. In order to have that property also in our setting, we can consider the mean canonical divergence

D_{m c d}^{\nabla} (p ∥ q) : = \frac{1}{2} (D (p ∥ q) + D^{*} (q ∥ p))

(64)

which obviously satisfies

D_{m c d}^{\nabla^{*}} (p ∥ q) = D_{m c d}^{\nabla} (q ∥ p)

(65)

As we will prove in the next section, the canonical divergence D induces the metric g and the connections ∇ and

\nabla^{*}

. The same holds for the mean canonical divergence

D_{m c d}^{\nabla}

. However, if the inverse exponential map of ∇ is integrable, then it is not generally true that

X (q, p) = - {grad}_{q} D_{m c d}^{\nabla} (p ∥ \cdot)

, which is inconsistent with the main motivation of our canonical divergence (see Equation (12)).

5.2. Main Consistency Result

Let

\overset{D}{g}

,

\overset{D}{\nabla}

, and

\overset{D}{\nabla^{*}}

be the geometrical objects derived from the canonical divergence D as defined in Equation (55). We recall the corresponding definitions from Section 1 in terms of a local coordinate system

ξ = (ξ^{1}, \dots, ξ^{n})

:

\begin{matrix} {\overset{D}{g}}_{i j} (p) & = & {\partial_{i}^{'} \partial_{j}^{'} D (ξ_{p} ∥ ξ_{q})|}_{q = p} \end{matrix}

(66)

\begin{matrix} {\overset{D}{Γ}}_{i j k} (p) & = & {- \partial_{i} \partial_{j} \partial_{k}^{'} D (ξ_{p} ∥ ξ_{q})|}_{q = p} \end{matrix}

(67)

\begin{matrix} {\overset{D}{Γ^{*}}}_{i j k} (p) & = & {- \partial_{i}^{'} \partial_{j}^{'} \partial_{k} D (ξ_{p} ∥ ξ_{q})|}_{q = p} \end{matrix}

(68)

We have defined our canonical divergence D based on a metric g and an affine connection ∇. It is natural to require that this divergence is consistent in the sense that the objects

\overset{D}{g}

,

\overset{D}{\nabla}

, and

\overset{D}{\nabla^{*}}

coincide with the original objects g, ∇, and

\nabla^{*}

of M, where

\nabla^{*}

is the dual affine connection of ∇ with respect to g. Since the geometry is determined by the derivatives of

D (ξ_{p} ∥ ξ_{q})

at

p = q

, we consider the case where p and q are close to each other, that is

z^{i} = ξ_{q}^{i} - ξ_{p}^{i}

(69)

is small for all i. We evaluate the divergence by Taylor expansion up to

O ({∥ z ∥}^{3})

. Note that

X (p, q)

is of order

∥ z ∥

.

Proposition 5.

When

∥ z ∥ = ∥ ξ_{q} - ξ_{p} ∥

is small, the canonical divergence is expanded as

D (p ∥ q) = \frac{1}{2} g_{i j} (p) z^{i} z^{j} + \frac{1}{6} Λ_{i j k} (p) z^{i} z^{j} z^{k} + O ({∥ z ∥}^{4})

(70)

where

Λ_{i j k} = 2 \partial_{i} g_{j k} - Γ_{i j k}

(71)

Proof.

We obtain the local coordinates

ξ (t)

of the geodesic

γ_{p, q} (t)

in Taylor series as

ξ^{i} (t) = ξ_{p}^{i} + t X^{i} - \frac{t^{2}}{2} Γ_{j k}^{i} X^{j} X^{k} + O ({∥ t X ∥}^{3})

(72)

where

X^{i} = X^{i} (p, q)

. When

z

is small, X is of order

O (∥ z ∥)

. Hence, we regard Equation (72) as a Taylor expansion with respect to X, valid for

t \in [0, 1]

when

z

is small. When

t = 1

, we have

z^{i} = X^{i} - \frac{1}{2} Γ_{j k}^{i} X^{j} X^{k}

(73)

where the higher-order terms are neglected. This in turn gives

X^{i} = z^{i} + \frac{1}{2} Γ_{j k}^{i} z^{j} z^{k}

(74)

We calculate

D (p ∥ q)

by using Equation (61). The velocity at t is given as

\begin{matrix} {\dot{ξ}}^{i} (t) & = & X^{i} - t Γ_{j k}^{i} X^{j} X^{k} \end{matrix}

(75)

\begin{matrix} = & z^{i} + \frac{1}{2} (1 - 2 t) Γ_{j k}^{i} z^{j} z^{k} \end{matrix}

(76)

We also use

g_{i j} (ξ (t)) = g_{i j} (ξ_{p}) + t \partial_{k} g_{i j} z^{k}

(77)

Collecting these terms, we have

t g_{i j} (ξ (t)) {\dot{ξ}}^{i} (t) {\dot{ξ}}^{j} (t) = t g_{i j} z^{i} z^{j} + \{t^{2} \partial_{i} g_{j k} + (- 2 t^{2} + t) Γ_{i j k}\} z^{i} z^{j} z^{k}

(78)

By integration, we have

\begin{matrix} D (p ∥ q) & = & \int_{0}^{1} t g_{i j} (ξ (t)) {\dot{ξ}}^{i} (t) {\dot{ξ}}^{j} (t) d t \end{matrix}

(79)

\begin{matrix} = & \frac{1}{2} g_{i j} z^{i} z^{j} + \frac{1}{6} Λ_{i j k} z^{i} z^{j} z^{k} \end{matrix}

(80)

where indices of

Λ_{i j k}

are symmetrized because of multiplication of

z^{i} z^{j} z^{k}

. This gives Equation (70). ☐
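The expansion of Proposition 5 can be tested numerically in a simple case. The sketch below assumes a one-dimensional example, chosen only for illustration, with a coordinate ξ that is affine for ∇ (so that Γ = 0 and geodesics are straight lines) and with metric g(ξ) = e^{ξ}, so that Λ = 2 ∂g - Γ = 2 e^{ξ}. The divergence is computed from Equation (61) by quadrature and compared with the cubic approximation of Equation (70).

```python
import math

def simpson(f, n=2000):
    """Composite Simpson rule on [0, 1]; n must be even."""
    h = 1.0 / n
    total = f(0.0) + f(1.0)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(k * h)
    return total * h / 3

def divergence(xi_p, xi_q):
    """Equation (61) along xi(t) = xi_p + t * z with the assumed metric exp(xi)."""
    z = xi_q - xi_p
    return simpson(lambda t: t * math.exp(xi_p + t * z) * z * z)

def cubic_expansion(xi_p, xi_q):
    """Right-hand side of Equation (70) without the remainder term."""
    z = xi_q - xi_p
    g = math.exp(xi_p)      # metric at p
    dg = math.exp(xi_p)     # derivative of the metric at p
    gamma = 0.0             # connection coefficient in the affine coordinate
    lam = 2 * dg - gamma    # Equation (71)
    return 0.5 * g * z ** 2 + lam * z ** 3 / 6

steps = (0.2, 0.1, 0.05)
errors = [abs(divergence(0.4, 0.4 + z) - cubic_expansion(0.4, 0.4 + z)) for z in steps]
print(errors)
```

Halving z should reduce the error by a factor close to 16, which is the behaviour expected from a remainder of fourth order.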

Theorem 1.

(Consistency theorem) The geometric quantities

\overset{D}{g}

,

\overset{D}{\nabla}

, and

\overset{D}{\nabla^{*}}

, derived from the canonical divergence

D (p ∥ q)

of Definition 3 coincide with the original quantities g, ∇, and

\nabla^{*}

.

Proof.

By differentiating Equation (70) with respect to

ξ_{p}

,

\begin{matrix} \partial_{i} D & = & \frac{1}{2} \partial_{i} g_{j k} z^{j} z^{k} - g_{i j} z^{j} - \frac{1}{2} Λ_{i j k} z^{j} z^{k} \end{matrix}

(81)

\begin{matrix} \partial_{i} \partial_{j} D & = & \frac{1}{2} \partial_{i} \partial_{j} g_{k l} z^{k} z^{l} - 2 \partial_{i} g_{j k} z^{k} + g_{i j} + Λ_{i j k} z^{k} \end{matrix}

(82)

where the indexed quantities on the right-hand side are to be symmetrized with respect to

i, j

. By evaluating

\partial_{i} \partial_{j} D

at

ξ_{p} = ξ_{q}

, i.e.,

z = 0

, we have

{\overset{D}{g}}_{i j} = g_{i j}

(83)

proving that the Riemannian metric derived from D is the same as the original one. We further differentiate Equation (82) with respect to

ξ_{q}

and evaluate it at

ξ_{p} = ξ_{q}

. This yields

\begin{matrix} {\overset{D}{Γ}}_{i j k} & = & - \partial_{i} \partial_{j} \partial_{k}^{'} D = 2 \partial_{i} g_{j k} - Λ_{i j k} \end{matrix}

(84)

\begin{matrix} = & Γ_{i j k} \end{matrix}

(85)

Hence, the affine connection

\overset{D}{\nabla}

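derived from D is exactly the same as the original affine connection ∇. Since the dual connection derived from D is dual to

\overset{D}{\nabla}

with respect to

\overset{D}{g} = g

, it coincides with

\nabla^{*}

. ☐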
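The statement of the consistency theorem can be illustrated by finite differences. The sketch below assumes a one-dimensional example that is not taken from the text: a coordinate ξ that is affine for ∇ (Γ = 0), the metric g(ξ) = e^{ξ}, and the closed form D(p ∥ q) = e^{ξ_p} + e^{ξ_q} (ξ_q - ξ_p - 1), which is what Equation (61) yields for a straight-line geodesic in this setting. By duality, the expected dual coefficient is Γ^{*} = ∂g = e^{ξ}. The helper names are hypothetical.

```python
import math

def divergence(xi_p, xi_q):
    """Assumed closed-form canonical divergence for the example metric exp(xi)."""
    return math.exp(xi_p) + math.exp(xi_q) * (xi_q - xi_p - 1.0)

def second_diff(f, x, h):
    """Central second difference of a function of one variable."""
    return (f(x + h) - 2.0 * f(x) + f(x - h)) / h ** 2

def induced_metric(x, h=1e-2):
    # Equation (66): two derivatives in the second argument, evaluated at q = p.
    return second_diff(lambda b: divergence(x, b), x, h)

def induced_gamma(x, h=1e-2):
    # Equation (67): minus (two derivatives in p, one derivative in q), at q = p.
    d_pp = lambda b: second_diff(lambda a: divergence(a, b), x, h)
    return -(d_pp(x + h) - d_pp(x - h)) / (2.0 * h)

def induced_gamma_star(x, h=1e-2):
    # Equation (68): minus (two derivatives in q, one derivative in p), at q = p.
    d_qq = lambda a: second_diff(lambda b: divergence(a, b), x, h)
    return -(d_qq(x + h) - d_qq(x - h)) / (2.0 * h)

x = 0.5
g, gamma, gamma_star = induced_metric(x), induced_gamma(x), induced_gamma_star(x)
print(g, gamma, gamma_star, math.exp(x))
```

Up to discretization error, the recovered quantities should match g = e^{ξ}, Γ = 0, and Γ^{*} = e^{ξ} at the chosen point.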

Remark 3.

In the special case

\nabla = \nabla^{*}

, the canonical divergence is given by half of the squared norm of the inverse exponential map (see Equation (62)):

D (p ∥ q) = \frac{1}{2} {∥ X (p, q) ∥}_{p}^{2}

(86)

The right-hand side of Equation (86) defines a divergence for a general connection, which coincides with the canonical divergence in the self-dual case. We have studied this divergence in our previous work [6]. We have shown that this divergence recovers g in terms of Equation (66). However, it fails to recover ∇ and

\nabla^{*}

in terms of Equations (67) and (68) directly. In order to overcome this shortcoming, we considered the α-connection

\nabla^{(α)} = \frac{1 - α}{2} \nabla + \frac{1 + α}{2} \nabla^{*}

and the corresponding inverse exponential map

X^{(α)}

, which imply the following version of Equation (86):

D^{(α)} (p ∥ q) : = \frac{1}{2} {∥ X^{(α)} (p, q) ∥}_{p}^{2}

(87)

(

D^{(α)}

does not denote the α-divergence here.) We have shown in [6] that for

α = - \frac{1}{3}

the divergence

D^{(α)}

, referred to as the standard divergence, induces the original quantities g, ∇, and

\nabla^{*}

. It turns out, however, that this first attempt to define a canonical divergence has serious limitations. For instance, it does not reduce to the known canonical divergence in the dually flat case. This important property is satisfied by the canonical divergence of Definition 3, as we are going to prove in the next section.

5.3. Canonical Divergence in a Dually Flat Manifold

When a manifold M is dually flat, it has an affine coordinate system

θ = (θ^{1}, \dots, θ^{n})

and a potential function

ψ (θ)

, where the dual affine coordinates

η = (η_{1}, \dots, η_{n})

are given by

η_{i} = \frac{\partial ψ (θ)}{\partial θ^{i}}, i = 1, \dots, n

(88)

The dual potential is then defined as

φ (η) = θ \cdot η - ψ (θ)

(89)

where

θ \cdot η = θ^{i} η_{i}

and θ is regarded as a function of η through Equation (88). The ∇-geodesic from p to q, a generalisation of the e-geodesic of Section 3.2, has the form

θ (t) = θ_{p} + t (θ_{q} - θ_{p})

(90)

Hence, the velocity is constant

\dot{θ} (t) = z = θ_{q} - θ_{p}

(91)

The canonical divergence from

θ_{p}

to

θ_{q}

is then given, according to Equation (61), by

D (θ_{p} ∥ θ_{q}) = \int_{0}^{1} t g_{i j} (θ (t)) z^{i} z^{j} d t

(92)

Since

g_{i j} = \partial_{i} \partial_{j} ψ

, we have

\begin{matrix} D (θ_{p} ∥ θ_{q}) & = & \int_{0}^{1} t \partial_{i} \partial_{j} ψ (θ_{p} + t z) z^{i} z^{j} d t \end{matrix}

(93)

\begin{matrix} = & \int_{0}^{1} t \ddot{ψ} (θ (t)) d t \end{matrix}

(94)

\begin{matrix} = & - \int_{0}^{1} \dot{ψ} (θ (t)) d t + {[t \dot{ψ} (θ (t))]}_{0}^{1} \end{matrix}

(95)

\begin{matrix} = & ψ (θ_{p}) + φ (η_{q}) - θ_{p} \cdot η_{q} \end{matrix}

(96)

This shows that our canonical divergence is the same as the canonical divergence defined in terms of the Bregman divergence of M.
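The derivation of Equations (92) to (96) can be checked on a concrete potential. The sketch below assumes the Bernoulli family as an example, with ψ(θ) = \log (1 + e^{θ}), and takes the dual potential to be the Legendre transform φ(η) = θ \cdot η - ψ(θ); the example and all function names are illustrative and not taken from the text.

```python
import math

def simpson(f, n=2000):
    """Composite Simpson rule on [0, 1]; n must be even."""
    h = 1.0 / n
    total = f(0.0) + f(1.0)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(k * h)
    return total * h / 3

def psi(theta):
    """Potential of the assumed Bernoulli example."""
    return math.log1p(math.exp(theta))

def eta(theta):
    """Equation (88): eta is the derivative of psi."""
    return 1.0 / (1.0 + math.exp(-theta))

def phi_at(theta):
    """Legendre transform theta * eta - psi(theta), evaluated at eta = eta(theta)."""
    return theta * eta(theta) - psi(theta)

def divergence_integral(theta_p, theta_q):
    """Equation (92) with g = psi'' = eta * (1 - eta) along theta(t) = theta_p + t * z."""
    z = theta_q - theta_p
    return simpson(lambda t: t * eta(theta_p + t * z) * (1 - eta(theta_p + t * z)) * z * z)

def divergence_closed(theta_p, theta_q):
    """Equation (96)."""
    return psi(theta_p) + phi_at(theta_q) - theta_p * eta(theta_q)

tp, tq = -0.4, 1.3
integral, closed = divergence_integral(tp, tq), divergence_closed(tp, tq)
mp, mq = eta(tp), eta(tq)
# Relative entropy of the Bernoulli distribution with mean mq from the one with mean mp.
kl_q_p = mq * math.log(mq / mp) + (1 - mq) * math.log((1 - mq) / (1 - mp))
print(integral, closed, kl_q_p)
```

In this example the path integral, the closed form of Equation (96), and the relative entropy with the arguments in reversed order all agree, which matches the convention D^{(1)}(p ∥ q) = D(q ∥ p) for the e-connection.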

Now we come back to the symmetry property that we already addressed in Remark 2. We derived

D (p ∥ q)

by using the primal affine connection ∇ and the related inverse exponential map. We can construct its dual

D^{*} (p ∥ q)

by using the dual affine connection

\nabla^{*}

and the dual inverse exponential map. The dual affine coordinates are η, and the geodesic of the dual connection connecting p and q, a generalisation of the m-geodesic of Section 3.2, is given by

η (t) = η_{p} + t (η_{q} - η_{p})

(97)

Hence, the velocity is constant

\dot{η} (t) = z^{*} = η_{q} - η_{p}

(98)

The dual canonical divergence

D^{*}

is defined by

D^{*} (p ∥ q) = \int_{0}^{1} t g^{i j} (η (t)) z_{i}^{*} z_{j}^{*} d t

(99)

Here,

g^{i j} (η) = \partial^{i} \partial^{j} φ (η)

(100)

where

\partial^{i} = \frac{\partial}{\partial η_{i}}

(101)

So we have

D^{*} (p ∥ q) = \int_{0}^{1} t \partial^{i} \partial^{j} φ (η_{p} + t z^{*}) z_{i}^{*} z_{j}^{*} d t

(102)

By similar calculations, we have

D^{*} (p ∥ q) = D (q ∥ p)

(103)

This proves that ∇ and

\nabla^{*}

give the same canonical divergence except that p and q are interchanged because of the duality. Such a nice property holds when M is dually flat.
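The symmetry of Equation (103) can be verified numerically in an example. The sketch below again assumes the Bernoulli family, parametrized directly by the dual coordinate η in (0, 1), with φ(η) = η \log η + (1 - η) \log (1 - η) and hence g^{11}(η) = 1 / (η (1 - η)); the example and the function names are illustrative assumptions.

```python
import math

def simpson(f, n=2000):
    """Composite Simpson rule on [0, 1]; n must be even."""
    h = 1.0 / n
    total = f(0.0) + f(1.0)
    for k in range(1, n):
        total += (4 if k % 2 else 2) * f(k * h)
    return total * h / 3

def dual_divergence(eta_p, eta_q):
    """Equation (102) along eta(t) = eta_p + t * z with phi'' = 1 / (eta * (1 - eta))."""
    z = eta_q - eta_p
    return simpson(lambda t: t * z * z / ((eta_p + t * z) * (1 - eta_p - t * z)))

def primal_divergence(eta_p, eta_q):
    """Equation (96): psi(theta_p) + phi(eta_q) - theta_p * eta_q, in terms of eta."""
    theta_p = math.log(eta_p / (1 - eta_p))
    psi_p = -math.log(1 - eta_p)
    phi_q = eta_q * math.log(eta_q) + (1 - eta_q) * math.log(1 - eta_q)
    return psi_p + phi_q - theta_p * eta_q

lhs = dual_divergence(0.2, 0.7)     # D*(p || q)
rhs = primal_divergence(0.7, 0.2)   # D(q || p)
print(lhs, rhs)
```

The dual divergence computed by quadrature should coincide with the primal divergence evaluated with interchanged arguments.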

6. Geodesic Projections and Integrability

Given a divergence D on M and a point

p \in M

, we consider the set of points q that satisfy

D (p ∥ q) = const

(104)

where p is fixed. This set is the surface of the equi-divergence ball centered at p. When a smooth submanifold S is given, we search for a point

\hat{p} \in S

that minimizes

D (p ∥ q)

,

q \in S

. Intuitively, we obtain such a minimizer by considering a ball centered at p. We increase its radius, starting from 0, until the ball touches S for the first time. Any touch point

\hat{p}

is then a minimizer of

D (p ∥ q)

,

q \in S

. When the geodesic connecting

\hat{p}

and p is orthogonal to S at

\hat{p}

, we call

\hat{p}

a geodesic projection of p onto S.

Definition 4.

We say that the geodesic projection property holds if every minimizer

\hat{p}

of the divergence D is given by the geodesic projection of p onto S.

We know that the geodesic projection property holds when M is dually flat, but it does not hold in general. The following condition guarantees the geodesic projection property:

Proposition 6.

The geodesic projection property holds when the inverse exponential map

X (q, p)

is in proportion to the gradient of

D (p ∥ q)

with respect to q,

X (q, p) = c \cdot {grad}_{q} D (p ∥ \cdot)

(105)

where c is a scalar factor that may depend on q and p.

Proof.

Consider the geodesic connecting

q = \hat{p}

and p. Then, the tangent vector at q is

X (q, p)

. Assume that

X (q, p)

has the same direction as the gradient

{grad}_{q} D (p ∥ \cdot)

, that is, the vector orthogonal to the surface of the ball touching S. Then

X (q, p)

is also orthogonal to the tangent space of S at

\hat{p}

, as the tangent space of the ball contains the tangent space of S at this point. This means that

\hat{p}

is a geodesic projection. ☐

When the vector field of the inverse exponential map is integrable, the geodesic projection property follows directly from Equation (12). We have shown that this integrability condition is satisfied for general dually flat manifolds. In particular, the integrability is satisfied for the α-connection

\nabla^{(α)}

defined on the cone

M_{n}

of positive measures, which leads to the α-divergence as canonical divergence. The restriction of the α-connection to the simplex

S_{n - 1}

of probability distributions, denoted by

{\bar{\nabla}}^{(α)}

, is still integrable, even though

S_{n - 1}

is not (dually) flat with respect to

{\bar{\nabla}}^{(α)}

if

α \notin \{- 1, + 1\}

. As we have seen, the canonical divergence associated with

{\bar{\nabla}}^{(α)}

does not coincide with the restriction of the α-divergence to

S_{n - 1}

. However, this restriction is still useful in the context of applications that require projections onto submanifolds S. The reason is that the geodesic projection property holds for

{\bar{\nabla}}^{(α)}

. To be more precise, consider the restriction of the α-divergence to the simplex

S_{n - 1}

:

\begin{matrix} D^{(α)} (p ∥ q) & = & \frac{4}{1 - α^{2}} (1 - \sum_{i = 1}^{n} q_{i}^{\frac{1 + α}{2}} p_{i}^{\frac{1 - α}{2}}) \end{matrix}

The gradient is given as

{grad}_{q} D^{(α)} (p ∥ \cdot) = - \frac{2}{1 - α} \sum_{i} q_{i} ({(\frac{p_{i}}{q_{i}})}^{\frac{1 - α}{2}} - \sum_{j} q_{j} {(\frac{p_{j}}{q_{j}})}^{\frac{1 - α}{2}}) δ_{i}

Comparing this with Equation (36) we see that

{\bar{X}}^{(α)} (q, p) = - {\dot{τ}}_{q, p} (0) {grad}_{q} D^{(α)} (p ∥ \cdot)

Since condition (105) of Proposition 6 is thereby satisfied, the geodesic projection property holds for

D^{(α)}

, even though it is not the canonical α-divergence on the simplex.
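The gradient formula for the restricted α-divergence can be cross-checked against the gradient on the cone of positive measures. The sketch below assumes that the orthogonal projection onto the tangent space of the simplex with respect to the Fisher metric is v_i ↦ v_i - q_i \sum_j v_j, which follows from the metric \sum_i a_i b_i / q_i but is not restated in this section; the function names are illustrative.

```python
def ambient_gradient(p, q, alpha):
    """Gradient on the cone of positive measures: 2/(1-alpha) * (q_i - q_i^b * p_i^a)."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return [(qi - qi ** b * pi ** a) / a for pi, qi in zip(p, q)]

def project_to_simplex_tangent(q, v):
    """Assumed Fisher-orthogonal projection onto {v : sum v_i = 0} at a probability vector q."""
    total = sum(v)
    return [vi - qi * total for vi, qi in zip(v, q)]

def simplex_gradient(p, q, alpha):
    """Closed form stated in the text for the gradient of the restricted alpha-divergence."""
    a = (1 - alpha) / 2
    ratios = [(pi / qi) ** a for pi, qi in zip(p, q)]
    mean = sum(qi * r for qi, r in zip(q, ratios))
    return [-qi * (r - mean) / a for qi, r in zip(q, ratios)]

p = [0.2, 0.3, 0.5]
q = [0.6, 0.3, 0.1]
projected = project_to_simplex_tangent(q, ambient_gradient(p, q, 0.5))
closed = simplex_gradient(p, q, 0.5)
print(projected)
print(closed)
```

Both vectors should coincide and have components summing to zero, as required for a tangent vector of the simplex.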

Author Contributions

The research was designed and carried out by both authors. They both wrote the paper, with main contribution by Nihat Ay. Both authors have read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Eguchi, S. Second order efficiency of minimum contrast estimators in a curved exponential family. Ann. Stat. 1983, 11, 793–803.
2. Amari, S.-I.; Nagaoka, H. Methods of Information Geometry; American Mathematical Society: Providence, RI, USA; Oxford University Press: Oxford, UK, 2000.
3. Matumoto, T. Any statistical manifold has a contrast function—On the C³-functions taking the minimum at the diagonal of the product manifold. Hiroshima Math. J. 1993, 23, 327–332.
4. Kurose, T. On the divergence of 1-conformally flat statistical manifolds. Tohoku Math. J. 1994, 46, 427–433.
5. Matsuzoe, H. On realization of conformally-projectively flat statistical manifolds and the divergences. Hokkaido Math. J. 1998, 27, 409–421.
6. Amari, S.-I.; Ay, N. Standard Divergence in Manifold of Dual Affine Connections. In Geometric Science of Information, Proceedings of the 2nd International Conference on Geometric Science of Information, Palaiseau, France, 28–30 October 2015.
7. Hofbauer, J.; Sigmund, K. Evolutionary Games and Population Dynamics; Cambridge University Press: Cambridge, UK, 2002.
8. Morozova, E.A.; Chentsov, N.N. Markov invariant geometry on manifolds of states. J. Sov. Math. 1991, 56, 2648–2669.
9. Henmi, M.; Kobayashi, R. Hooke’s law in statistical manifolds and divergences. Nagoya Math. J. 2000, 159, 1–24.

© 2015 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons by Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).
