Extended Divergence on a Foliation by Deformed Probability Simplexes

Uohashi, Keiko

doi:10.3390/e24121736

Faculty of Engineering, Tohoku Gakuin University, Tagajo 985-8537, Miyagi, Japan

Entropy 2022, 24(12), 1736; https://doi.org/10.3390/e24121736

Submission received: 30 October 2022 / Revised: 25 November 2022 / Accepted: 26 November 2022 / Published: 28 November 2022

(This article belongs to the Special Issue MaxEnt 2022—the 41st International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering)

Abstract

This study considers a new decomposition of an extended divergence on a foliation by deformed probability simplexes from the information geometry perspective. In particular, we treat the case where each deformed probability simplex corresponds to a set of q-escort distributions. For the foliation, different q-parameters and the corresponding

α

-parameters of dualistic structures are defined on each of the various leaves. We propose the divergence decomposition theorem that guides the proximity of q-escort distributions with different q-parameters and compare the new theorem to the previous theorem of the standard divergence on a Hessian manifold with a fixed

α

-parameter.

Keywords:

divergence; exponential family; escort distribution; relative entropy; nonextensive statistics; information geometry; affine geometry

1. Introduction

In the field of nonextensive statistics, q-normal distributions and their generalization, q-exponential families, play an important role [1,2,3]. Since Ohara first pointed out the correspondence between the q-parameter of nonextensive statistics and the

α

-parameter of information geometry [4,5], the information geometric structure of q-exponential families has been investigated [6,7,8,9,10,11,12,13,14].

On a set of probability distributions, divergences are usually defined for a fixed

α

-parameter of the dualistic structure. Using those results, we defined an extended divergence on a foliation by sets of probability distributions, setting different

α

-parameters on each leaf. In particular, we treated a foliation by deformed probability simplexes [15].

In this paper, we also study deformed probability simplexes corresponding to sets of escort distributions with q-parameters, which satisfy

q = (1 - α) / 2

for

α

-parameters of information geometry. We clarify the relationship among affine spaces, affine immersions and the extended divergence in more detail than in our previous paper. A comparison between the extended divergence and the duo Bregman divergence used in machine learning is also described [16].

First, we explain the dualistic structures,

α

-divergences, and the Tsallis relative entropy on the probability simplex, using the concept of affine geometry and information geometry. The relationship between an

α

-parameter and the Tsallis q-parameter is stated. Next, we describe the dualistic structures and the divergences generated by affine immersions on the deformed probability simplexes corresponding to sets of escort distributions, together with the related theory of Hessian manifolds and their level surfaces. We then define an extended divergence on a foliation by deformed probability simplexes. Finally, we propose a new decomposition of an extended divergence on the foliation.

2. The Tsallis Relative Entropy and the Kullback–Leibler Divergence on the Probability Simplex

In this section, we explain dualistic structures,

α

-divergences, and the Tsallis relative entropy on the probability simplex [4,5,12].

Let

A^{n + 1}

be an

(n + 1)

-dimensional real affine space and

{x^{1}, \dots, x^{n + 1}}

be the canonical affine coordinate system on

A^{n + 1}

, i.e.,

\tilde{D} d x = 0

, where

\tilde{D}

is the canonical flat affine connection on

A^{n + 1}

. Let

S^{n}

be a simplex in

A_{+}^{n + 1}

defined by

S^{n} = {p | p \in A_{+}^{n + 1}, \sum_{i = 1}^{n + 1} x^{i} (p) = 1} .

(1)

If

x^{1} (p), \dots, x^{n + 1} (p)

are regarded as probabilities of

n + 1

states,

S^{n}

is called the n-dimensional probability simplex. Let

{{\bar{p}}^{1}, \dots, {\bar{p}}^{n}}

be an affine coordinate system on

S^{n}

defined by

{\bar{p}}^{i} (p) = x^{i} (p) - x^{n + 1} (p)

for

i = 1, \dots, n

, and

{\partial_{1}, \dots, \partial_{n}}, where \partial_{i} |_{p} = (\frac{\partial}{\partial x^{i}} - \frac{\partial}{\partial x^{n + 1}}) |_{p}, p \in S^{n},

(2)

be a frame of tangent vector fields on

S^{n}

.

The Fisher metric

g = (g_{i j})

on

S^{n}

is defined by

g_{i j} (p) \equiv g (\partial_{i}, \partial_{j}) |_{p} = \sum_{k = 1}^{n + 1} x^{k} (p) \frac{\partial log x^{k}}{\partial x^{i}} |_{p} \frac{\partial log x^{k}}{\partial x^{j}} |_{p} = \frac{1}{x^{i} (p)} δ_{i j} + \frac{1}{x^{n + 1} (p)},

(3)

p \in S^{n}, i, j = 1, \dots, n,

where

δ_{i j}

is the Kronecker delta. We define an

α

-connection

\nabla^{(α)}

on

S^{n}

by

\nabla_{\partial_{i}}^{(α)} \partial_{j} = \sum_{k = 1}^{n} Γ_{i j}^{(α) k} \partial_{k},

(4)

Γ_{i j}^{(α) k} |_{p} = \frac{1 + α}{2} (- \frac{1}{x^{k} (p)} δ_{i j}^{k} + x^{k} (p) g_{i j} (p)), i, j, k = 1, \dots, n,

(5)

where

δ_{i j}^{k} = 1

if

i = j = k

, and

δ_{i j}^{k} = 0

otherwise. Then, the Levi–Civita connection ∇ of g coincides with

\nabla^{(0)}

. For

α \in R

, we have

X g (Y, Z) = g (\nabla_{X}^{(α)} Y, Z) + g (Y, \nabla_{X}^{(- α)} Z) for X, Y, Z \in X (S^{n}),

(6)

where

X (S^{n})

is the set of all smooth tangent vector fields on

S^{n}

. Then,

\nabla^{(- α)}

is called the dual connection of

\nabla^{(α)}

. For each

α

,

\nabla^{(α)}

is torsion-free and

\nabla^{(α)} g

is symmetric. Therefore, the triple

(S^{n}, \nabla^{(α)}, g)

is a statistical manifold, and

(S^{n}, \nabla^{(- α)}, g)

is its dual statistical manifold.
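As a numerical sanity check of Equation (3), the following minimal Python sketch evaluates the defining sum with the frame of Equation (2) and compares it with the closed form. The function names are illustrative assumptions and do not appear in the paper.

```python
def fisher_metric_from_sum(x):
    """Evaluate g_ij by the defining sum in Eq. (3) at x = (x^1, ..., x^{n+1})."""
    n = len(x) - 1

    def d_log(k, i):
        # Derivative of log x^k along the frame vector d/dx^i - d/dx^{n+1} (Eq. (2)).
        value = 0.0
        if k == i:
            value += 1.0 / x[k]
        if k == n:
            value -= 1.0 / x[k]
        return value

    return [[sum(x[k] * d_log(k, i) * d_log(k, j) for k in range(n + 1))
             for j in range(n)] for i in range(n)]


def fisher_metric_closed_form(x):
    """Closed form of Eq. (3): delta_ij / x^i + 1 / x^{n+1}."""
    n = len(x) - 1
    return [[(1.0 / x[i] if i == j else 0.0) + 1.0 / x[n]
             for j in range(n)] for i in range(n)]
```

Both functions take a list of n + 1 positive probabilities and return the n × n matrix of metric components.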

Note that affine connections

\nabla^{(1)}

and

\nabla^{(- 1)}

in Equations (4)–(6) are the dual connection and the canonical connection, respectively.

It is known that when

n \geq 2

, the curvature of the statistical manifold

(S^{n}, \nabla^{(α)}, g)

is a constant value

κ = \frac{1 + α}{2} \frac{1 - α}{2} = \frac{1 - α^{2}}{4} .

Therefore, the curvature of the dual statistical manifold

(S^{n}, \nabla^{(- α)}, g)

is also

κ = (1 - α^{2}) / 4

. Iff

α = \pm 1

, the curvature of

(S^{n}, \nabla^{(α)}, g)

is zero, and

(\nabla^{(α)}, \nabla^{(- α)}, g)

is called the dually flat structure.

For

α \neq \pm 1

, an

α

-divergence

D^{(α)}

on

A_{+}^{n + 1}

is often defined by

D^{(α)} (p, r) = \frac{4}{1 - α^{2}} {\frac{1 - α}{2} \sum_{i = 1}^{n + 1} x^{i} (p) + \frac{1 + α}{2} \sum_{i = 1}^{n + 1} x^{i} (r) - \sum_{i = 1}^{n + 1} x^{i} {(p)}^{\frac{1 - α}{2}} x^{i} {(r)}^{\frac{1 + α}{2}}},

(7)

p, r \in A_{+}^{n + 1} .

If

q = (1 - α) / 2

, it holds that

D^{(α)} (p, r) = \frac{1}{q} K_{q} (p, r), p, r \in S^{n},

(8)

for the Tsallis relative entropy

K_{q}

on

S^{n}

defined by

K_{q} (p, r) \equiv - \sum_{i = 1}^{n + 1} x^{i} (p) {ln}_{q} \frac{x^{i} (r)}{x^{i} (p)} = \frac{1}{1 - q} {1 - \sum_{i = 1}^{n + 1} x^{i} {(p)}^{q} x^{i} {(r)}^{1 - q}}, p, r \in S^{n},

(9)

where

{ln}_{q}

is the q-logarithmic function defined by

{ln}_{q} x \equiv \frac{x^{1 - q} - 1}{1 - q}, q \neq 1, x > 0

(10)

(see Refs. [1,2]). The Tsallis relative entropy

K_{q}

converges to the Kullback–Leibler divergence as

q \to 1

, because

{lim}_{q \to 1} {ln}_{q} x = log x

. In the information geometric view, the

α

-divergence

D^{(α)}

converges to the Kullback–Leibler divergence as

α \to - 1

.
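The relations in Equations (7)–(10) can be checked numerically. The sketch below is a minimal illustration under the assumption that distributions are given as lists of positive numbers summing to one; the function names are hypothetical. It implements Equation (7), the closed form of Equation (9), and the Kullback–Leibler divergence, so that Equation (8) and the limit q → 1 can be tested.

```python
import math


def alpha_divergence(p, r, alpha):
    """Alpha-divergence of Eq. (7) for positive vectors p, r and alpha != +1, -1."""
    c_p = (1.0 - alpha) / 2.0
    c_r = (1.0 + alpha) / 2.0
    cross = sum(pi ** c_p * ri ** c_r for pi, ri in zip(p, r))
    return 4.0 / (1.0 - alpha ** 2) * (c_p * sum(p) + c_r * sum(r) - cross)


def tsallis_relative_entropy(p, r, q):
    """Closed form of the Tsallis relative entropy K_q in Eq. (9), q != 1."""
    cross = sum(pi ** q * ri ** (1.0 - q) for pi, ri in zip(p, r))
    return (1.0 - cross) / (1.0 - q)


def kl_divergence(p, r):
    """Kullback-Leibler divergence, the q -> 1 limit of K_q."""
    return sum(pi * math.log(pi / ri) for pi, ri in zip(p, r))
```

With q = (1 − α)/2, i.e., α = 1 − 2q, the value of `alpha_divergence(p, r, 1 - 2 * q)` should agree with `tsallis_relative_entropy(p, r, q) / q` on the simplex.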

For the Tsallis q-parameter, the curvature of the statistical manifold

(S^{n}, \nabla^{(α)}, g)

is

κ = q (1 - q)

.
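For completeness, the curvature formula in terms of q follows by direct substitution of α = 1 − 2q (equivalently q = (1 − α)/2) into the expression for κ given earlier in this section:

```latex
\[
\begin{aligned}
\frac{1+\alpha}{2} &= \frac{1 + (1 - 2q)}{2} = 1 - q, \qquad
\frac{1-\alpha}{2} = \frac{1 - (1 - 2q)}{2} = q, \\
\kappa &= \frac{1+\alpha}{2}\cdot\frac{1-\alpha}{2} = q\,(1-q).
\end{aligned}
\]
```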

3. Divergences Generated by Affine Immersions as Level Surfaces

In this section, we describe the general theory of affine immersions and divergences related to level surfaces of the Hessian domain.

If the Hessian

\tilde{D} d φ = \sum_{i, j} (\partial^{2} φ) / (\partial x^{i} \partial x^{j}) d x^{i} d x^{j}

of a function

φ

on a domain

Ω \subseteq A^{n + 1}

is non-degenerate, the triple

(Ω, \tilde{D}, \tilde{g} = \tilde{D} d φ)

is called a Hessian domain. A statistical manifold is said to be flat if the curvature tensor of its affine connection vanishes. A flat statistical manifold is locally a Hessian domain. Conversely, a Hessian domain is a flat statistical manifold [12,17].

In a previous study, we showed the following theorem on the level surfaces of a Hessian function.

Theorem 1

([18]). Let M be a simply connected n-dimensional level surface of φ on an

(n + 1)

-dimensional Hessian domain

(Ω, \tilde{D}, \tilde{g} = \tilde{D} d φ)

with a Riemannian metric

\tilde{g}

and suppose that

n \geq 2

. If we consider

(Ω, \tilde{D}, \tilde{g})

a flat statistical manifold,

(M, D, g)

is a 1-conformally flat statistical submanifold of

(Ω, \tilde{D}, \tilde{g})

, where D and g denote the connection and the Riemannian metric on M induced by

\tilde{D}

and

\tilde{g}

, respectively.

Here, “1-conformally flat” represents the characterization of surfaces projected by a flat statistical manifold along dual coordinates. We continue to explain the terms used in Theorem 1 and the outline of the proof.

For

α \in R

, statistical manifolds

(N, \nabla, h)

and

(N, \bar{\nabla}, \bar{h})

are

α

-conformally equivalent if there exists a function

ϕ

on N such that

\begin{aligned} \bar{h} (X, Y) &= e^{ϕ} h (X, Y), \\ h ({\bar{\nabla}}_{X} Y, Z) &= h (\nabla_{X} Y, Z) - \frac{1 + α}{2} d ϕ (Z) h (X, Y) + \frac{1 - α}{2} \{d ϕ (X) h (Y, Z) + d ϕ (Y) h (X, Z)\}, \quad X, Y, Z \in X (N) . \end{aligned}

If

(N, \bar{\nabla}, \bar{h})

is 1-conformally equivalent to a flat statistical manifold

(N, \nabla, h)

,

(N, \bar{\nabla}, \bar{h})

is called a 1-conformally flat statistical manifold. A statistical manifold

(N, \nabla, h)

is 1-conformally flat iff the dual statistical manifold

(N, \nabla^{'}, h)

is

(- 1)

-conformally flat [19].

In terms of affine geometry,

(N, \nabla^{'}, h)

and

(N, {\bar{\nabla}}^{'}, h)

are

(- 1)

-conformally equivalent if and only if

\nabla^{'}

and

{\bar{\nabla}}^{'}

are projectively equivalent [20,21].

For an

(n + 1)

-dimensional Hessian domain

(Ω, \tilde{D}, \tilde{g} = \tilde{D} d φ)

, an n-dimensional level surface of

φ

has the dualistic structure as the statistical submanifold structure. On the other hand, the level surface also has the structure induced by the affine immersion. It is essential for Theorem 1 that the statistical submanifold structure coincides with the dualistic structure by the affine immersion on a level surface of

φ

.

For

(Ω, \tilde{D}, \tilde{g} = \tilde{D} d φ)

, let x be the canonical immersion of an n-dimensional level surface M into

Ω

. Let E be a transversal vector field on M defined by

E = - d φ {(\tilde{E})}^{- 1} \tilde{E},

(11)

where

\tilde{E}

is the gradient vector field of

φ

on

Ω

defined by

\tilde{g} (\tilde{X}, \tilde{E}) = d φ (\tilde{X}), \tilde{X} \in X (Ω) .

(12)

For an affine immersion

(x, E)

and the canonical flat affine connection

\tilde{D}

on

Ω \subseteq A^{n + 1}

, the induced affine connection

D^{E}

, the affine fundamental form

g^{E}

, the shape operator

S^{E}

and the transversal connection form

τ^{E}

on M are defined by

\begin{matrix} {\tilde{D}}_{X} Y & = & D_{X}^{E} Y + g^{E} (X, Y) E, \end{matrix}

(13)

\begin{matrix} {\tilde{D}}_{X} E & = & S^{E} (X) + τ^{E} (X) E, X, Y \in X (M) . \end{matrix}

(14)

See [21,22]. Then,

D^{E}

and

g^{E}

coincide with the restricted affine connection of

\tilde{D}

and the restricted Riemannian metric of

\tilde{g}

, respectively. For the level surface M, the transversal connection form satisfies that

τ^{E} \equiv 0

. Therefore,

(x, E)

is called an equiaffine immersion. It is known that a simply connected statistical manifold can be realized in

A^{n + 1}

by a non-degenerate equiaffine immersion iff it is 1-conformally flat [19]. Thus, Theorem 1 holds.

Next, we introduce a divergence on a Hessian domain, treating it as a flat statistical manifold.

The canonical divergence

ρ

of a Hessian domain

(Ω, \tilde{D}, \tilde{g} = \tilde{D} d φ)

is defined by

ρ (p, r) = φ (p) + φ^{*} (\tilde{ι} (r)) + \sum_{i = 1}^{n + 1} x^{i} (p) x_{i}^{'} (r) for p, r \in Ω,

(15)

where

\tilde{ι}

is the gradient mapping from

Ω

to the dual affine space

A_{n + 1}^{*}

, i.e.,

x_{i}^{'} = x_{i}^{*} \circ \tilde{ι} = - \frac{\partial φ}{\partial x^{i}},

(16)

and {

x_{1}^{*}, \dots, x_{n + 1}^{*}

} is the dual affine coordinate system of {

x^{1}, \dots, x^{n + 1}

}. The Legendre transform

φ^{*}

of

φ

is defined by

φ^{*} \circ \tilde{ι} = - \sum_{i = 1}^{n + 1} x^{i} x_{i}^{'} - φ .

(17)

See [12].

Let

ι

be the conormal immersion for the affine immersion

(x, E)

defined by Equations (11) and (12). By the definition of a conormal immersion,

ι

satisfies that

〈 ι (p), Y_{p} 〉 = 0, Y_{p} \in T_{p} M, 〈 ι (p), E_{p} 〉 = 1 for p \in M,

where

〈 a, b 〉

is the pairing of

a \in A_{n + 1}^{*}

and

b \in A^{n + 1}

. It is known that the conormal immersion

ι

coincides with the restriction of the gradient mapping

\tilde{ι}

to the level surface M.

The next definition is given in relation to affine immersions and divergences.

Definition 1

([19]). Let

(N, \nabla, h)

be a 1-conformally flat statistical manifold realized by a non-degenerate affine immersion

(v, ξ)

into

A^{n + 1}

, and w the conormal immersion for v. Then the divergence

ρ_{c o n f}

of

(N, \nabla, h)

is defined by

ρ_{c o n f} (p, r) = 〈 w (r), v (p) - v (r) 〉 for p, r \in N .

The

ρ_{c o n f}

definition is independent of the choice of a realization of

(N, \nabla, h)

.

The divergence

ρ_{c o n f}

is referred to as Kurose geometric divergence in affine geometry and as Fenchel–Young divergence in the machine learning community [23,24]. Since an n-dimensional level surface M of

(Ω, \tilde{D}, \tilde{g} = \tilde{D} d φ)

is a 1-conformally flat statistical manifold realized by a non-degenerate affine immersion

(x, E)

,

ρ_{c o n f}

on M is as follows:

ρ_{c o n f} (p, r) = 〈 ι (r), x (p) - x (r) 〉 for p, r \in M .

(18)

Let

ρ_{s u b}

be the restriction of the canonical divergence

ρ

to

(M, D, g)

as a statistical submanifold of

(Ω, \tilde{D}, \tilde{g})

. From Equations (15), (17) and (18), the next theorem holds.

Theorem 2

([20]). For a 1-conformally flat statistical submanifold

(M, D, g)

of

(Ω, \tilde{D}, \tilde{g})

, two divergences

ρ_{c o n f}

and

ρ_{s u b}

coincide.
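The computation behind Theorem 2 can be sketched as follows, assuming that the pairing is written in the coordinates of Equation (16) and using the fact stated above that the conormal immersion ι is the restriction of the gradient mapping to M. Substituting Equation (17) into Equation (15) and using φ(p) = φ(r) on a level surface gives:

```latex
\[
\begin{aligned}
\rho(p, r)
&= \varphi(p) + \Bigl(-\sum_{i=1}^{n+1} x^{i}(r)\, x'_{i}(r) - \varphi(r)\Bigr)
   + \sum_{i=1}^{n+1} x^{i}(p)\, x'_{i}(r) \\
&= \varphi(p) - \varphi(r) + \sum_{i=1}^{n+1} x'_{i}(r)\,\bigl(x^{i}(p) - x^{i}(r)\bigr), \\
\rho_{sub}(p, r)
&= \sum_{i=1}^{n+1} x'_{i}(r)\,\bigl(x^{i}(p) - x^{i}(r)\bigr)
 = \langle \iota(r),\, x(p) - x(r) \rangle
 = \rho_{conf}(p, r), \qquad p, r \in M .
\end{aligned}
\]
```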

4. Deformed Probability Simplexes and Escort Distributions Generated by Affine Immersions

In this section, we explain dualistic structures on deformed probability simplexes, which correspond to sets of escort distributions via affine immersion.

We set

p_{i} = x^{i} (p)

,

i = 1, \dots, n + 1

for

p \in S^{n}

, where

S^{n}

and

{x^{1}, \dots, x^{n + 1}}

are the probability simplex and the canonical affine coordinate system on

A^{n + 1}

, respectively. For

n + 1

states

p_{1}, \dots, p_{n + 1}

on

S^{n}

and

0 < q < 1

, if each probability

P (p_{i})

satisfies

P (p_{i}) = \frac{{(p_{i})}^{q}}{\sum_{j = 1}^{n + 1} {(p_{j})}^{q}}, i = 1, \dots, n + 1,

(19)

the probability distribution

P

is called the escort distribution [1,2], where

{(p_{i})}^{q}

is

p_{i}

powered by q.

The dualistic structure of a set of escort distributions is realized via an affine immersion into

A_{+}^{n + 1}

[4,5]. For

0 < q < 1

, let

f_{q}

be the affine immersion of

S^{n}

into

A_{+}^{n + 1}

defined by

x^{i} (f_{q} (p)) = \frac{1}{q} {(x^{i} (p))}^{q}, i = 1, \dots, n + 1, for p \in S^{n} .

(20)

Then, the escort distribution

P

is also represented as follows:

P (p_{i}) = \frac{θ^{i}}{\sum_{j = 1}^{n + 1} θ^{j}}, θ^{i} = \frac{1}{q} {(p_{i})}^{q}, i = 1, \dots, n + 1 .

(21)

For a function

ψ_{q}

on

A_{+}^{n + 1}

defined by

ψ_{q} = \frac{1}{1 - q} \sum_{i = 1}^{n + 1} {(q x^{i})}^{\frac{1}{q}},

(22)

the image

f_{q} (S^{n})

is a level surface of

ψ_{q}

satisfying

ψ_{q} = 1 / (1 - q)

. For

0 < q < 1

, the Hessian matrix of the function

ψ_{q}

is positive definite on

A_{+}^{n + 1}

. Then,

ψ_{q}

induces the Hessian structure

(A_{+}^{n + 1}, \tilde{D}, {\tilde{g}}_{q} \equiv (\partial^{2} ψ_{q} / \partial x^{i} \partial x^{j}))

. By definition

{\tilde{Γ}}_{i j k} = \sum_{l = 1}^{n + 1} {\tilde{g}}_{q k l} {\tilde{Γ}}_{i j}^{l} = \frac{\partial^{3} ψ_{q}}{\partial x^{i} \partial x^{j} \partial x^{k}}, i, j, k = 1, \dots, n + 1,

(23)

{\tilde{D}}_{\frac{\partial}{\partial x^{i}}}^{(α)} \frac{\partial}{\partial x^{j}} = \frac{1 - α}{2} \sum_{k = 1}^{n + 1} {\tilde{Γ}}_{i j}^{k} \frac{\partial}{\partial x^{k}}, α = 1 - 2 q,

(24)

the tetrad

(A_{+}^{n + 1}, \tilde{D}, {\tilde{D}}^{(- 1)}, {\tilde{g}}_{q})

is the dually flat structure. The connection

{\tilde{D}}^{(0)}

coincides with the Levi–Civita connection of the Riemannian metric

{\tilde{g}}_{q}

.
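The level-surface property of Equations (20) and (22) can be illustrated with a short Python sketch. The function names are assumptions made for this example. The diagonal Hessian entries are obtained here by differentiating Equation (22) twice, which is our own calculation rather than a formula quoted from the paper; the off-diagonal entries vanish because ψ_q is a sum of one-variable terms.

```python
def f_q(p, q):
    """Affine immersion of Eq. (20): x^i = (p_i)^q / q."""
    return [pi ** q / q for pi in p]


def psi_q(x, q):
    """Potential function of Eq. (22) on the positive orthant."""
    return sum((q * xi) ** (1.0 / q) for xi in x) / (1.0 - q)


def hessian_diagonal(x, q):
    """Diagonal of the Hessian of psi_q: (q x^i)^((1 - 2q) / q).

    Derived by differentiating Eq. (22) twice (an assumption of this sketch).
    """
    return [(q * xi) ** ((1.0 - 2.0 * q) / q) for xi in x]
```

For any p in the probability simplex and 0 < q < 1, `psi_q(f_q(p, q), q)` should equal 1 / (1 − q), and every entry of `hessian_diagonal` is positive, in line with the positive definiteness stated above.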

We denote by D and

g_{q}

the restricted

\tilde{D}

and

\tilde{g_{q}}

on

f_{q} (S^{n})

, and induce the dualistic structure of

(f_{q} (S^{n}), D, g_{q})

as the submanifold structure of

(A_{+}^{n + 1}, \tilde{D}, {\tilde{g}}_{q})

. From the discussion in Section 3,

(f_{q} (S^{n}), D, g_{q})

coincides with the dualistic structure induced by the equiaffine immersion

(f_{q}, E_{q})

, where

E_{q} \equiv - d ψ_{q} {(\tilde{E_{q}})}^{- 1} \tilde{E_{q}}

(25)

for the gradient vector field

\tilde{E_{q}}

of

ψ_{q}

on

A_{+}^{n + 1}

defined by

{\tilde{g}}_{q} (\tilde{X}, \tilde{E_{q}}) = d ψ_{q} (\tilde{X}) for \tilde{X} \in X (A_{+}^{n + 1}) .

(26)

The pullback of

(f_{q} (S^{n}), D, g_{q})

to

S^{n}

is

(- 1)

-conformally equivalent to

(S^{n}, \nabla^{(α)}, g)

defined by Equations (3)–(5). In addition,

(f_{q} (S^{n}), D, g_{q})

has a constant curvature

κ = q (1 - q) = (1 - α^{2}) / 4

[5].

On

(f_{q} (S^{n}), D, g_{q})

, the restricted divergence

ρ_{q}

from the canonical divergence of

(A_{+}^{n + 1}, \tilde{D}, {\tilde{g}}_{q})

coincides with the geometric divergence by Equation (18) from the affine immersion

(f_{q}, E_{q})

. For an affine coordinate system

{x_{1}^{'}, \dots, x_{n + 1}^{'}}

on

A^{n + 1}

defined by

x_{i}^{'} = - \frac{\partial ψ_{q}}{\partial x^{i}} = - \frac{1}{1 - q} {(q x^{i})}^{\frac{1 - q}{q}},

(27)

the divergence

ρ_{q}

of

(f_{q} (S^{n}), D, g_{q})

is described as

ρ_{q} (a, b) = \sum_{i = 1}^{n + 1} x_{i}^{'} (b) (x^{i} (a) - x^{i} (b)), a, b \in f_{q} (S^{n}) .

(28)

In addition, the pullback divergence of

ρ_{q}

to

S^{n}

coincides with

D^{(α)}

and hence, by Equation (8), with 1/q times the Tsallis relative entropy

K_{q}

[4].
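The pullback statement can be checked numerically. The following minimal sketch, with hypothetical function names, implements Equations (20), (27), (28) and (7) and compares ρ_q(f_q(p), f_q(r)) with D^(α)(p, r) for α = 1 − 2q.

```python
def f_q(p, q):
    """Affine immersion of Eq. (20): x^i = (p_i)^q / q."""
    return [pi ** q / q for pi in p]


def dual_coords(x, q):
    """Dual coordinates of Eq. (27): x'_i = -(q x^i)^((1 - q) / q) / (1 - q)."""
    return [-(q * xi) ** ((1.0 - q) / q) / (1.0 - q) for xi in x]


def rho_q(a, b, q):
    """Divergence of Eq. (28) for two points a, b on the leaf f_q(S^n)."""
    xb = dual_coords(b, q)
    return sum(xbi * (ai - bi) for xbi, ai, bi in zip(xb, a, b))


def alpha_divergence(p, r, alpha):
    """Alpha-divergence of Eq. (7)."""
    c_p = (1.0 - alpha) / 2.0
    c_r = (1.0 + alpha) / 2.0
    cross = sum(pi ** c_p * ri ** c_r for pi, ri in zip(p, r))
    return 4.0 / (1.0 - alpha ** 2) * (c_p * sum(p) + c_r * sum(r) - cross)
```

On the simplex both quantities reduce to (1 − Σ p_i^q r_i^{1−q}) / (q(1 − q)), which is why they agree.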

At the end of this section, we mention the divergence of

(A_{+}^{n + 1}, \tilde{D}, {\tilde{g}}_{q})

. By Equation (17), the Legendre transform

ψ_{q}^{*}

of

ψ_{q}

is

ψ_{q}^{*} (x^{'} (a)) = - ψ_{q} (a) - \sum_{i = 1}^{n + 1} x^{i} (a) x_{i}^{'} (a), a \in A_{+}^{n + 1} .

(29)

By Equations (15) and (16), the canonical divergence

ρ_{q}

of

(A_{+}^{n + 1}, \tilde{D}, {\tilde{g}}_{q})

is defined by

ρ_{q} (a, b) = ψ_{q} (a) - ψ_{q} (b) + \sum_{i = 1}^{n + 1} x_{i}^{'} (b) (x^{i} (a) - x^{i} (b)), a, b \in A_{+}^{n + 1},

(30)

which we denote by the same symbol

ρ_{q}

as the divergence of

(f_{q} (S^{n}), D, g_{q})

.

5. Extended Divergence on a Foliation by Deformed Probability Simplexes

Previous sections described the divergence for each fixed q and each fixed

α

. This section defines an extended divergence on a foliation by deformed probability simplexes

(f_{q} (S^{n}), D, g_{q})

for all

0 < q < 1

, and shows the divergence decomposition theorem. The contents of our paper [15] are included but are explained in detail by the setting of affine geometry.

To give the proximity of q-escort distributions with different q-parameters, we define an extended divergence on a foliation by deformed probability simplexes as follows.

Definition 2.

Let

S_{f o l} = \cup_{0 < q < 1} f_{q} (S^{n}) = {p | p \in A_{+}^{n + 1}, \sum_{i = 1}^{n + 1} x^{i} (p) > 1}

, which corresponds to a foliation

F = {f_{q} (S^{n}) | 0 < q < 1}

. We call a function

ρ_{f o l}

on

S_{f o l} \times S_{f o l}

defined by Equation (31) an extended divergence on a foliation by deformed probability simplexes.

ρ_{f o l} (a, b) \equiv ψ_{q (a)} (a) - ψ_{q (b)} (b) + \sum_{i = 1}^{n + 1} x_{i}^{'} (b) (x^{i} (a) - x^{i} (b))

(31)

for a \in f_{q (a)} (S^{n}), b \in f_{q (b)} (S^{n}), 0 < q (a) < 1, 0 < q (b) < 1 .

The i-th component of the conormal immersion of

(f_{q}, E_{q})

is

- \partial ψ_{q} / \partial x^{i}

. By the right-hand side of Equation (27), the dual coordinate of b, denoted by

x^{'} (b)

, satisfies that

- x^{'} (b) \equiv (- x_{1}^{'} (b), \dots, - x_{n + 1}^{'} (b)) \in f_{1 - q (b)} (S^{n}) .

Therefore, we consider

f_{1 - q} (S^{n})

as the dual simplex of

f_{q} (S^{n})

for

0 < q < 1

. When

q = 1 / 2

,

f_{q} (S^{n})

is self dual [4]. Note that the i-th component of the dual coordinate of b is denoted by

η_{i} (b) = - x_{i}^{'} (b) = (\partial ψ_{q} / \partial x^{i}) |_{b}

in [15].
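The dual-simplex relation can be verified directly from Equations (20) and (27). In the sketch below (function names are illustrative assumptions), the negated dual coordinates of a point of f_q(S^n) are compared with the image of the same probability vector under f_{1−q}.

```python
def f_q(p, q):
    """Affine immersion of Eq. (20): x^i = (p_i)^q / q."""
    return [pi ** q / q for pi in p]


def dual_coords(x, q):
    """Dual coordinates of Eq. (27): x'_i = -(q x^i)^((1 - q) / q) / (1 - q)."""
    return [-(q * xi) ** ((1.0 - q) / q) / (1.0 - q) for xi in x]


def negated_dual_point(p, q):
    """Return -x'(f_q(p)), which is expected to lie on f_{1-q}(S^n)."""
    return [-v for v in dual_coords(f_q(p, q), q)]
```

For q = 1/2 the two leaves coincide, so `negated_dual_point(p, 0.5)` reproduces `f_q(p, 0.5)`, which illustrates the self-dual case.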

On the extended divergence, the next proposition holds.

Proposition 1.

An extended divergence

ρ_{f o l}

on

S_{f o l}

satisfies the following:

(i) If

a, b \in f_{q (a)} (S^{n})

,

ρ_{f o l} (a, b) = ρ_{q (a)} (a, b) = D^{(α (a))} (f_{q (a)}^{- 1} (a), f_{q (a)}^{- 1} (b)),

where

ρ_{q}

is the divergence of

(f_{q} (S^{n}), D, g_{q})

by Equation (28),

D^{(α)}

is an α-divergence defined by Equation (7), and

α (a) = 1 - 2 q (a)

.

(ii) In the case of

q (a) \geq q (b)

,

ρ_{f o l} (a, b) \geq 0 for (a, b) \in S_{f o l} \times S_{f o l},

and

ρ_{f o l} (a, b) = 0

if and only if

a = b .

Proof.

If

a, b \in f_{q (a)} (S^{n})

,

ψ_{q (a)} (a) = ψ_{q (b)} (a) = ψ_{q (b)} (b)

. By Equations (28) and (31),

ρ_{f o l} (a, b) = \sum_{i = 1}^{n + 1} x_{i}^{'} (b) (x^{i} (a) - x^{i} (b)) = ρ_{q (a)} (a, b) .

Then, (i) holds. If

1 > q (a) \geq q (b) > 0

, it holds that

ψ_{q (a)} (a) \geq ψ_{q (b)} (b)

because

ψ_{q (a)} (a) = \frac{1}{1 - q (a)}, ψ_{q (b)} (b) = \frac{1}{1 - q (b)}

(32)

are induced by the definition of

f_{q} (S^{n})

. In addition,

f_{q (a)} (S^{n})

and

f_{q (b)} (S^{n})

are convex surfaces centered on the origin of

A_{+}^{n + 1}

, and the surface

f_{q (a)} (S^{n})

is closer to the origin than

f_{q (b)} (S^{n})

. Then,

\sum_{i = 1}^{n + 1} x_{i}^{'} (b) (x^{i} (a) - x^{i} (b)) \geq 0

. Thus, (ii) holds. □
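Proposition 1 (ii) can be probed numerically on random samples. The sketch below is an illustration only, not a proof: it implements Equation (31) and draws random points with q(a) ≥ q(b). All function names, the sampling helper, and the choice of sampling ranges are assumptions of this example.

```python
import random


def f_q(p, q):
    """Affine immersion of Eq. (20): x^i = (p_i)^q / q."""
    return [pi ** q / q for pi in p]


def psi_q(x, q):
    """Potential function of Eq. (22)."""
    return sum((q * xi) ** (1.0 / q) for xi in x) / (1.0 - q)


def dual_coords(x, q):
    """Dual coordinates of Eq. (27)."""
    return [-(q * xi) ** ((1.0 - q) / q) / (1.0 - q) for xi in x]


def rho_fol(a, q_a, b, q_b):
    """Extended divergence of Eq. (31); q_a and q_b label the leaves of a and b."""
    xb = dual_coords(b, q_b)
    pairing = sum(xbi * (ai - bi) for xbi, ai, bi in zip(xb, a, b))
    return psi_q(a, q_a) - psi_q(b, q_b) + pairing


def random_simplex_point(size, rng):
    """Hypothetical helper: a random interior point of the probability simplex."""
    weights = [rng.random() + 1e-3 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]
```

A typical use is to draw q(b) from (0, 1), then q(a) from [q(b), 1), map two random simplex points with `f_q`, and confirm that `rho_fol` is non-negative up to rounding error.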

We define the extended dual divergence

ρ_{f o l}^{*}

of

ρ_{f o l}

as follows:

ρ_{f o l}^{*} (a, b) \equiv ψ_{q (a)}^{*} (x^{'} (a)) - ψ_{q (b)}^{*} (x^{'} (b)) + \sum_{i = 1}^{n + 1} x^{i} (b) (x_{i}^{'} (a) - x_{i}^{'} (b))

(33)

for a \in f_{q (a)} (S^{n}), b \in f_{q (b)} (S^{n}), 0 < q (a) < 1, 0 < q (b) < 1,

where

ψ_{q}^{*}

is the Legendre transform of

ψ_{q}

for

0 < q < 1

. Then, the following holds.

Proposition 2.

The functions

ρ_{f o l}

and

ρ_{f o l}^{*}

satisfy that

ρ_{f o l}^{*} (b, a) = ρ_{f o l} (a, b) for a \in f_{q (a)} (S^{n}), b \in f_{q (b)} (S^{n}) .

(34)

Proof.

By the definition of the Legendre transform, we have

\begin{aligned} ρ_{f o l}^{*} (b, a) &= ψ_{q (b)}^{*} (x^{'} (b)) - ψ_{q (a)}^{*} (x^{'} (a)) + \sum_{i = 1}^{n + 1} x^{i} (a) (x_{i}^{'} (b) - x_{i}^{'} (a)) \\ &= - ψ_{q (b)} (b) - \sum_{i = 1}^{n + 1} x^{i} (b) x_{i}^{'} (b) + ψ_{q (a)} (a) + \sum_{i = 1}^{n + 1} x^{i} (a) x_{i}^{'} (a) + \sum_{i = 1}^{n + 1} x^{i} (a) (x_{i}^{'} (b) - x_{i}^{'} (a)) \\ &= ψ_{q (a)} (a) - ψ_{q (b)} (b) + \sum_{i = 1}^{n + 1} x_{i}^{'} (b) (x^{i} (a) - x^{i} (b)) \\ &= ρ_{f o l} (a, b) . \end{aligned}

□
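Proposition 2 can also be confirmed numerically. The sketch below uses the sign convention of Equation (17) for the Legendre transform, as in the proof above, and implements Equations (31) and (33). The function names are assumptions of this example.

```python
def f_q(p, q):
    """Affine immersion of Eq. (20): x^i = (p_i)^q / q."""
    return [pi ** q / q for pi in p]


def psi_q(x, q):
    """Potential function of Eq. (22)."""
    return sum((q * xi) ** (1.0 / q) for xi in x) / (1.0 - q)


def dual_coords(x, q):
    """Dual coordinates of Eq. (27)."""
    return [-(q * xi) ** ((1.0 - q) / q) / (1.0 - q) for xi in x]


def psi_q_star(x, q):
    """Legendre transform via Eq. (17), evaluated at the point with primal coordinates x."""
    xd = dual_coords(x, q)
    return -sum(xi * xdi for xi, xdi in zip(x, xd)) - psi_q(x, q)


def rho_fol(a, q_a, b, q_b):
    """Extended divergence of Eq. (31)."""
    xb = dual_coords(b, q_b)
    pairing = sum(xbi * (ai - bi) for xbi, ai, bi in zip(xb, a, b))
    return psi_q(a, q_a) - psi_q(b, q_b) + pairing


def rho_fol_star(a, q_a, b, q_b):
    """Extended dual divergence of Eq. (33)."""
    xa, xb = dual_coords(a, q_a), dual_coords(b, q_b)
    pairing = sum(bi * (xai - xbi) for bi, xai, xbi in zip(b, xa, xb))
    return psi_q_star(a, q_a) - psi_q_star(b, q_b) + pairing
```

Swapping the arguments of `rho_fol_star` relative to `rho_fol` should give the same value, for points on the same leaf or on different leaves.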

The extended divergence is related to the duo Bregman (pseudo-)divergence, where the parameters also define the convex functions [16]. To work with the entire parametrized probability distribution families and to explore the application of divergences, we must investigate their relationship.

6. Decomposition of an Extended Divergence

In this section, we explain the orthogonal foliation of

F

. Next, we give a decomposition of an extended divergence along the orthogonal leaf and the original leaf.

For the foliation

F = {f_{q} (S^{n}) | 0 < q < 1}

, we consider the flow on

S_{f o l}

defined using the following equation.

\frac{d x_{i}^{'}}{d t} = x_{i}^{'}, i = 1, \dots, n + 1,

(35)

where a function

x_{i}^{'}

on

S_{f o l}

takes the i-th component of the dual coordinate on

f_{q} (S^{n})

as Equation (27) for each

0 < q < 1

. An integral curve of Equation (35) is orthogonal to

f_{q} (S^{n})

for each q with respect to the pairing

〈, 〉

. The set of integral curves becomes the orthogonal foliation of

F

. We denote it by

F^{⊥}

.

Translating into the primal coordinate system, we have the next equation.

\frac{d x^{i}}{d t} = {\tilde{E}}^{i}, i = 1, \dots, n + 1, on S_{f o l},

(36)

{\tilde{E}}^{i} = {\tilde{E}}_{q}^{i} = \sum_{j = 1}^{n + 1} g_{q}^{i j} \frac{\partial ψ_{q}}{\partial x^{j}} on f_{q} (S^{n}),

(37)

where

(g_{q}^{i j})

is the inverse matrix of

(g_{q i j})

. The right-hand side of Equation (37) is calculated using Equations (11) and (12) for

ψ_{q}

. A leaf of

F^{⊥}

is an integral curve of the vector field

\tilde{E}

that takes the value

{\tilde{E}}_{q}

on

f_{q} (S^{n})

for each q.

The following theorem is on the decomposition of the extended divergence.

Theorem 3.

Let

S^{n}

be the probability simplex, and

(f_{q} (S^{n}), D, g_{q})

the 1-conformally flat statistical manifold generated by the affine immersion

(f_{q}, E_{q})

, where

f_{q}

is defined as

x^{i} (f_{q} (p)) = \frac{1}{q} {(x^{i} (p))}^{q}, i = 1, \dots, n + 1, for p \in S^{n},

(38)

ψ_{q} \equiv 1 / (1 - q) \sum_{i = 1}^{n + 1} {(q x^{i})}^{1 / q}

,

E_{q} \equiv - d ψ_{q} {({\tilde{E}}_{q})}^{- 1} {\tilde{E}}_{q}

,

{\tilde{E}}_{q}^{i} \equiv \sum_{j = 1}^{n + 1} g_{q}^{i j} \partial ψ_{q} / \partial x^{j}

, and

g_{q}

is the restriction of

(g_{q i j}) = \tilde{D} d ψ_{q}

to

f_{q} (S^{n})

. Let

a, b \in f_{q (a)} (S^{n})

,

0 < q (a) < 1

, and

c \in S_{f o l} \equiv \cup_{0 < q < 1} f_{q} (S^{n})

. If there exists an orthogonal leaf

L^{⊥} \in F^{⊥}

that includes b and c, we have

ρ_{f o l} (a, c) = μ ρ_{f o l} (a, b) + ρ_{f o l} (b, c), x^{'} (c) = μ x^{'} (b), μ > 0,

(39)

where

x^{'} (\cdot)

is the dual coordinate of

f_{q} (S^{n})

for each q.

Proof.

From

a, b \in f_{q (a)} (S^{n})

, it holds that

ψ_{q (a)} (a) = ψ_{q (b)} (b)

, where

q (b) = q (a)

. By the definition in Equation (31), we have

\begin{aligned} ρ_{f o l} (a, c) &= ψ_{q (a)} (a) - ψ_{q (c)} (c) + \sum_{i = 1}^{n + 1} x_{i}^{'} (c) (x^{i} (a) - x^{i} (c)) \\ &= ψ_{q (b)} (b) - ψ_{q (c)} (c) + \sum_{i = 1}^{n + 1} \{x_{i}^{'} (c) (x^{i} (a) - x^{i} (b)) + x_{i}^{'} (c) (x^{i} (b) - x^{i} (c))\} \\ &= μ \sum_{i = 1}^{n + 1} x_{i}^{'} (b) (x^{i} (a) - x^{i} (b)) + \{ψ_{q (b)} (b) - ψ_{q (c)} (c) + \sum_{i = 1}^{n + 1} x_{i}^{'} (c) (x^{i} (b) - x^{i} (c))\} \\ &= μ ρ_{f o l} (a, b) + ρ_{f o l} (b, c) . \end{aligned}

□
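The decomposition of Equation (39) can be tested on a concrete example. In the sketch below, the helper `point_on_orthogonal_leaf` is a hypothetical construction derived from Equation (27): on a leaf, −x'_i = (p_i)^{1−q} / (1 − q), so a point c on another leaf with x'(c) = μ x'(b) is obtained by a power transformation of the underlying probability vector followed by normalization. This construction and all names are assumptions of the example, not taken from the paper.

```python
def f_q(p, q):
    """Affine immersion of Eq. (20): x^i = (p_i)^q / q."""
    return [pi ** q / q for pi in p]


def psi_q(x, q):
    """Potential function of Eq. (22)."""
    return sum((q * xi) ** (1.0 / q) for xi in x) / (1.0 - q)


def dual_coords(x, q):
    """Dual coordinates of Eq. (27)."""
    return [-(q * xi) ** ((1.0 - q) / q) / (1.0 - q) for xi in x]


def rho_fol(a, q_a, b, q_b):
    """Extended divergence of Eq. (31)."""
    xb = dual_coords(b, q_b)
    pairing = sum(xbi * (ai - bi) for xbi, ai, bi in zip(xb, a, b))
    return psi_q(a, q_a) - psi_q(b, q_b) + pairing


def point_on_orthogonal_leaf(p_b, q_b, q_c):
    """Return (r, mu) such that x'(f_{q_c}(r)) = mu * x'(f_{q_b}(p_b)).

    Hypothetical helper: r_i is proportional to (p_i)^((1 - q_b) / (1 - q_c)).
    """
    s = (1.0 - q_b) / (1.0 - q_c)
    weights = [pi ** s for pi in p_b]
    z = sum(weights)
    r = [w / z for w in weights]
    mu = (1.0 - q_b) / ((1.0 - q_c) * z ** (1.0 - q_c))
    return r, mu
```

Given a and b on the leaf with parameter q and c = f_{q_c}(r) built from b by the helper, the value `rho_fol(a, q, c, q_c)` should match `mu * rho_fol(a, q, b, q) + rho_fol(b, q, c, q_c)`.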

See Figure 1 for a decomposition of extended divergence and graphs of deformed simplexes

f_{q} (S^{n})

.

A decomposition similar to Equation (39) is also available on a foliation by Hessian level surfaces of one Hessian manifold [20]. Theorem 3 generalizes the previous decomposition.

Finally, we describe the gradient flow on a leaf

f_{q} (S^{n})

using the extended divergence.

Theorem 4.

For a submanifold

(f_{q} (S^{n}), D, g_{q})

of

S_{f o l}

, we denote by

{x^{1}, \dots, x^{n}}

an affine coordinate system on

f_{q} (S^{n})

such that

D d x^{i} = 0

,

i = 1, \dots, n

, and set

g_{q i j} = g_{q} (\partial / \partial x^{i}, \partial / \partial x^{j})

,

(g_{q}^{i j}) = {(g_{q i j})}^{- 1}

. For a fixed point

c \in L^{⊥}

, the gradient flow on

f_{q} (S^{n})

defined by

\frac{d x^{i}}{d t} = - \sum_{j = 1}^{n} g_{q}^{i j} \frac{\partial}{\partial x^{j}} ρ_{f o l} (a_{x}, c), i = 1, \dots, n, a_{x} \in f_{q} (S^{n})

(40)

converges to the unique point

b \in L^{⊥} \cap f_{q} (S^{n})

, where

a_{x}

is a variable point parametrized as

{x^{1} (t), \dots, x^{n} (t)}

.

Proof.

By Theorem 3, for any

a_{x} \in f_{q} (S^{n})

, there exists

μ > 0

such that

ρ_{f o l} (a_{x}, c) = μ ρ_{f o l} (a_{x}, b) + ρ_{f o l} (b, c), x^{'} (c) = μ x^{'} (b) .

Equation (40) is described by the dual coordinate system

{x_{1}^{'}, \dots, x_{n}^{'}}

on

f_{q} (S^{n})

as follows:

\frac{d x_{i}^{'}}{d t} = - μ \frac{\partial}{\partial x^{i}} ρ_{f o l} (a_{x}, b), i = 1, \dots, n .

(41)

On

f_{q} (S^{n})

, by Proposition 1 (i),

ρ_{f o l}

coincides with the geometric divergence

ρ_{q}

, generated by the affine immersion

(f_{q}, E_{q})

. The geometric divergence generates the dual coordinate

x_{i}^{'}

such that

D^{*} d x_{i}^{'} = 0

,

i = 1, \dots, n

, which is derived from

x^{i}

[19]. Then, it holds that

\frac{d x_{i}^{'}}{d t} = - μ (x_{i}^{'} (a_{x}) - x_{i}^{'} (b)), i = 1, \dots, n,

(42)

and that

x_{i}^{'} = x_{i}^{'} (b) + (x_{i}^{'} (a |_{t = 0}) - x_{i}^{'} (b)) e^{- μ t}, i = 1, \dots, n,

(43)

where

a |_{t = 0}

is an initial point of Equation (40). Then, the gradient flow of Equation (40) converges to

b \in L^{⊥} \cap f_{q} (S^{n})

following a geodesic for the dual coordinate system. □
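The last step of the proof, from Equation (42) to Equation (43), is the solution of a linear ordinary differential equation in the dual coordinates. The sketch below compares a simple explicit Euler integration of Equation (42) with the closed form of Equation (43). The function names, the step count, and the sample values are assumptions chosen for illustration.

```python
import math


def closed_form_dual_flow(x0, xb, mu, t):
    """Closed-form solution of Eq. (43): exponential relaxation toward x'(b)."""
    return [b + (x - b) * math.exp(-mu * t) for x, b in zip(x0, xb)]


def euler_dual_flow(x0, xb, mu, t_end, steps):
    """Explicit Euler integration of Eq. (42): dx'_i/dt = -mu (x'_i - x'_i(b))."""
    x = list(x0)
    dt = t_end / steps
    for _ in range(steps):
        x = [xi - dt * mu * (xi - bi) for xi, bi in zip(x, xb)]
    return x
```

Each dual coordinate moves along a straight segment toward its target value, which corresponds to the geodesic statement in the proof.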

A gradient flow similar to Equation (40) has been given on a flat statistical submanifold [25]. A similar flow on a Hessian level surface, i.e., a 1-conformally flat statistical submanifold, has been given in [20]. Theorem 4 generalizes these previous theorems on gradient flows.

7. Conclusions

This study considers a foliation of deformed probability simplexes corresponding to sets of escort distributions with q-parameters, for the continuous transition of

α

-parameters on information geometry. Since these are typical q-exponential families, we still need to provide details on the extended divergence and natural definition of the foliation of q-exponential families.

The extended divergence measures the proximity of q-exponential distributions with different q-parameters. Our theory therefore provides a mathematical basis for generalizing methods of machine learning and statistical mechanics to families of q-distributions in which different q-parameters are mixed. The decomposition theorem is applicable to the problem of the optimal choice of the q-parameter, although concrete application methods remain open. It also remains to investigate the relationship with a new

λ

-duality on nonextensive statistical mechanics with mixed parameters [26,27].

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

We are grateful to the referees for their constructive comments.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Tsallis, C. Introduction to Nonextensive Statistical Mechanics: Approaching a Complex World; Springer: New York, NY, USA, 2009.
2. Naudts, J. Generalised Thermostatistics; Springer: London, UK, 2011.
3. Naudts, J. Estimators, escort probabilities, and ϕ-exponential families in statistical physics. J. Ineq. Pure Appl. Math. 2004, 5, 102.
4. Ohara, A. Geometry of distributions associated with Tsallis statistics and properties of relative entropy minimization. Phys. Lett. A 2007, 370, 184–193.
5. Ohara, A. Geometric study for the Legendre duality of generalized entropies and its application to the porous medium equation. Eur. Phys. J. B 2009, 70, 15–28.
6. Ohara, A.; Wada, T. Information geometry of q-Gaussian densities and behaviors of solutions to related diffusion equations. J. Phys. A Math. Theor. 2010, 43, 035002.
7. Matsuzoe, H.; Ohara, A. Geometry for q-exponential families. In Recent Progress in Differential Geometry and its Related Fields; Adachi, T., Hashimoto, H., Hristov, M.J., Eds.; World Scientific Publishing: Hackensack, NJ, USA, 2011; pp. 55–71.
8. Amari, S.; Ohara, A.; Matsuzoe, H. Geometry of deformed exponential families: Invariant, dually-flat and conformal geometry. Phys. A 2012, 391, 4308–4319.
9. Matsuzoe, H.; Henmi, M. Hessian structures and divergence functions on deformed exponential families. In Geometric Theory of Information, Signals and Communication Technology; Nielsen, F., Ed.; Springer: Basel, Switzerland, 2014; pp. 57–80.
10. Matsuzoe, H.; Wada, T. Deformed algebras and generalizations of independence on deformed exponential families. Entropy 2015, 17, 5729–5751.
11. Wada, T.; Matsuzoe, H.; Scarfone, A.M. Dualistic Hessian structures among the thermodynamic potentials in the κ-thermostatistics. Entropy 2015, 17, 7213–7229.
12. Amari, S. Information Geometry and Its Applications; Springer: Tokyo, Japan, 2016.
13. Scarfone, A.M.; Matsuzoe, H.; Wada, T. Information geometry of κ-exponential families: Dually-flat, Hessian and Legendre structures. Entropy 2018, 20, 436.
14. Matsuzoe, H. A sequence of escort distributions and generalizations of expectations on q-exponential family. Entropy 2017, 19, 7.
15. Uohashi, K. A foliation by deformed probability simplexes for transition of α-parameters. In Proceedings of the International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering, IHP, Paris, France, 18–22 July 2022.
16. Nielsen, F. Statistical divergences between densities of truncated exponential families with nested supports: Duo Bregman and duo Jensen divergences. Entropy 2022, 24, 421.
17. Shima, H. The Geometry of Hessian Structures; World Scientific: Singapore, 2007.
18. Uohashi, K.; Ohara, A.; Fujii, T. 1-conformally flat statistical submanifolds. Osaka J. Math. 2000, 37, 501–507.
19. Kurose, T. On the divergences of 1-conformally flat statistical manifolds. Tohoku Math. J. 1994, 46, 427–433.
20. Uohashi, K.; Ohara, A.; Fujii, T. Foliations and divergences of flat statistical manifolds. Hiroshima Math. J. 2000, 30, 403–414.
21. Nomizu, K.; Pinkall, U. On the geometry of affine immersions. Math. Z. 1987, 195, 165–178.
22. Nomizu, K.; Sasaki, T. Affine Differential Geometry: Geometry of Affine Immersions; Cambridge University Press: Cambridge, UK, 1994.
23. Azoury, K.S.; Warmuth, M.K. Relative loss bounds for on-line density estimation with the exponential family of distributions. Mach. Learn. 2001, 43, 211–246.
24. Blondel, M.; Martins, A.F.T.; Niculae, V. Learning with Fenchel-Young losses. J. Mach. Learn. Res. 2020, 21, 1–69.
25. Fujiwara, A.; Amari, S. Gradient systems in view of information geometry. Phys. D 1995, 80, 317–327.
26. Zhang, J.; Wong, T.-K.L. λ-Deformation: A canonical framework for statistical manifolds of constant curvature. Entropy 2022, 24, 193.
27. Wong, T.-K.L.; Zhang, J. Tsallis and Rényi deformations linked via a new λ-duality. IEEE Trans. Inf. Theory 2022, 68, 5353–5373.

Figure 1. A decomposition of extended divergence $\rho_{\mathrm{fol}}(a, c)$, graphs of the standard simplex ($q \to 1$), and deformed simplexes as $q = 0.75, 0.6, 0.5, 0.4, 0.25$ in $A_{+}^{2}$. For primal coordinates $a, b \in f_{0.75}(S^{1})$, and $c \in f_{0.6}(S^{1})$, dual coordinates satisfy $-x'(a), -x'(b) \in f_{0.25}(S^{1})$, and $-x'(c) \in f_{0.4}(S^{1})$. The primal geodesic between $a$ and $b$ is orthogonal to the dual one between $b$ and $c$ with respect to the metric $g_{0.75}$.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
