Lagrangian Submanifolds of Symplectic Structures Induced by Divergence Functions

Marco Favretti

doi:10.3390/e22090983

Dipartimento di Matematica Tullio Levi-Civita, Università degli Studi di Padova, 35121 Padova, Italy

Entropy 2020, 22(9), 983; https://doi.org/10.3390/e22090983

This article belongs to the Special Issue Information Geometry III

Abstract

Divergence functions play a relevant role in Information Geometry, as they allow for the introduction of a Riemannian metric and a dual connection structure on a finite-dimensional manifold of probability distributions. They also make it possible to define, in a canonical way, a symplectic structure on the square of this manifold, a property that received little attention in the literature until recent contributions. In this paper, we hint at a possible application: we study Lagrangian submanifolds of this symplectic structure and show that they are useful for describing the manifold of solutions of the Maximum Entropy Principle.

Keywords:

canonical divergence; Lagrangian submanifolds; Morse family; constrained optimization; geometric phase transitions

1. Introduction

Information Geometry [1,2] provides a sound and fruitful framework for interpreting statistics using notions of classical differential geometry [3]. A principal object in Information Geometry is the contrast or divergence function, which (informally speaking) measures the degree of separation between two probability distributions [4,5,6]. The main strength of divergence functions is that they make it possible to define a Riemannian structure on a finite-dimensional submanifold M of probability distributions endowed with a dual coordinate system, with far-reaching implications. A less-studied spin-off of contrast functions is the possibility of introducing a symplectic structure on the square of M by the pull-back of the canonical symplectic structure defined on the cotangent bundle

T^{*} M

. This procedure was introduced in 1995 in the pioneering paper [7], suggesting that symplectic geometry may have a natural role to play in statistics. Recently there has been renewed interest in possible applications of the symplectic structures introduced in [7], for example, to the study of analogies with discrete Lagrangian mechanics (see [8]) or of relations with completely integrable systems of Hamiltonian mechanics (see [9,10]).

In this paper, we try to look at a possible role for Lagrangian submanifolds of the above-discussed symplectic structure on

M^{2}

in the case that M is an exponential family

M (h, k)

. Exponential families are prototypical examples of finite dimensional manifolds admitting a dually flat canonical structure defined by the canonical divergence, and they play a relevant role in information geometry and statistics [1,2]. For our argument, their importance is due to the fact that they represent the manifold of solutions of the variational problem associated to the Maximum Entropy Principle (MEP) with linear constraints ([11,12]). In some applications to statistical mechanics, e.g., in the descriptions of phase transitions in Ising spin systems, MEP with nonlinear constraints is considered, see, e.g., in [13,14,15]. In this case, the set of possible solutions has a richer structure, which is well captured by a Lagrangian submanifold of

T^{*} M (h, k)

. In this work, we are concerned with the Lagrangian submanifolds defined in the square of

M (h, k)

via the canonical pull-back hinted at above.

The structure of the paper is as follows. In Section 2, we recall the needed tools of symplectic geometry; in Section 2.1, we review the canonical pull-back construction via divergence functions presented in [7]; and in Section 2.2, we specialize it to exponential families and the canonical divergence. In Section 3, we consider the special case of exponential families associated with MEP with nonlinear constraints.

2. Synopsis of Symplectic Geometry

We briefly recall the basic facts of symplectic geometry that are necessary for our argument, referring to classical textbooks for the proofs. A symplectic manifold

(M, ω)

is a smooth even-dimensional manifold M equipped with a non-degenerate, closed two-form

ω

(

d ω = 0

, where d is the exterior derivative). A submanifold L of M is a Lagrangian submanifold if

2 dim L = dim M

and the two-form restricted to L is vanishing,

ω_{| L} = 0

. A prototypical example of symplectic manifold is the cotangent bundle

T^{*} S

of a manifold S. If

x = (x^{1}, \dots, x^{n})

are local coordinates on S, and

(x, λ)

are local coordinates on

T^{*} S

, then the Liouville one-form

θ_{c}

on

T^{*} S

has the local expression

θ_{c} = λ_{i} d x^{i}

(summation over repeated indices is understood) and the symplectic two form is

ω = d θ_{c} = d λ_{i} \land d x^{i} .

(1)

A classical theorem of Darboux says that every symplectic manifold

(M, ω)

admits an atlas of local coordinates

(x, λ)

such that locally

ω

has the representation (1). A relevant example of Lagrangian submanifold of

T^{*} S

is the graph of the differential of a function

g : S \to R

, that is,

L_{g} = {(x, λ (x)) \in T^{*} S : λ (x) = d g (x), x \in S} .

Note that

L_{g}

is an n-dimensional submanifold which is transversal to the fibers of the fibration

π : T^{*} S \to S

, that is, its tangent bundle

T L_{g}

is transversal to the vertical bundle

ker T π

.

According to a theorem of Maslov–Hörmander ([16,17]), a general (i.e., not necessarily transversal) Lagrangian submanifold of

T^{*} S

can be locally described as the graph of a smooth function G depending on extra parameters. Let us sketch briefly this construction along the lines of the works in [18,19].

Let U be a k-dimensional manifold called supplementary manifold, and let

G : S \times U \to R

be a smooth function whose representation in a local chart is

G (x, u)

. We define the critical set of G as (we use the notation

{(G_{x})}_{i} = \partial G / \partial x^{i}

and

{(G_{x y})}_{i j} = \partial^{2} G / \partial x^{i} \partial y^{j}

for partial derivatives)

E = {(x, u) : G_{u} (x, u) = 0} .

(2)

If

d G_{u}

has maximal rank over

E

, that is,

rk d G_{u} = rk (G_{x u} G_{u u}) = k for all (x, u) \in E

(3)

then G is called a Morse family and the following

Λ_{G}

is a Lagrangian submanifold of

T^{*} S

,

Λ_{G} = {(x, G_{x} (x, u)) \in T^{*} S where (x, u) \in E} .

(4)

If there are no extra parameters

k = 0

, then

Λ_{G}

is the graph of a differential and thus

Λ_{G}

is a transversal submanifold. Note that the above rank condition (3) can be satisfied if the square submatrix

G_{u u}

has maximal rank, i.e.,

det G_{u u} \neq 0

on

E

. In this case, by the implicit function theorem there exists a locally defined function

u = u (x)

such that

E

is the graph of u and setting

\hat{G} (x) = G (x, u (x))

we have that

{\hat{G}}_{x} (x) = G_{x} (x, u (x)) + G_{u} (x, u (x)) u_{x} (x) = G_{x} (x, u (x)) for all (x, u) \in E .

Therefore, where

det G_{u u} \neq 0

on

E

, all the parameters u can be eliminated and

Λ_{\hat{G}}

is locally transversal to the fibers. The set of points of S where

det G_{u u} (x, u) = 0

for

(x, u) \in E

is called the caustic of

Λ_{G}

. These are the points where the Lagrangian submanifold is tangent to the fibers of

π : T^{*} S \to S

and transversality is lost.
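As an illustration of the construction above, the following minimal Python sketch considers the one-parameter family G(x, u) = u^3/3 + x u on S = R with k = 1. This family is a hypothetical example chosen for this sketch and is not taken from the sources cited here. The rank condition (3) holds because G_{x u} = 1, the critical set is x = -u^2, and the resulting Λ_G is the parabola x = -λ^2, which is tangent to the fiber over x = 0.

```python
# Hypothetical example (not from the paper): G(x, u) = u**3 / 3 + x * u on S = R, k = 1.

def G_u(x, u):
    return u**2 + x          # dG/du, vanishes on the critical set E

def G_x(x, u):
    return u                 # dG/dx, the fiber coordinate lambda on Lambda_G

def G_xu(x, u):
    return 1.0               # never zero, so the rank condition (3) holds

def G_uu(x, u):
    return 2.0 * u           # vanishes only at u = 0

def lagrangian_point(u):
    """Point (x, lambda) of Lambda_G with parameter u; E is {x = -u**2}."""
    x = -u**2                # solves G_u(x, u) = 0
    return x, G_x(x, u)

def on_caustic(u, tol=1e-12):
    """True where det G_uu vanishes on E, i.e. where transversality is lost."""
    x, _ = lagrangian_point(u)
    return abs(G_uu(x, u)) < tol

points = [lagrangian_point(u) for u in (-1.0, -0.5, 0.0, 0.5, 1.0)]
```

For x < 0 there are two points of Λ_G over x (the two branches λ = ±sqrt(-x)), and the parameter u cannot be eliminated globally; the caustic is the single point x = 0.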

2.1. Symplectic Structures Defined by Divergence Functions

Given a smooth n-dimensional manifold, M, let us denote with

M^{2} = M \times M

the square of M and with

Δ_{M} \subset M^{2}

the diagonal of

M^{2}

. We will use local coordinates

x = (x^{1}, \dots, x^{n})

on M and

(x, y) = (x^{1}, \dots, x^{n}, y^{1}, \dots, y^{n})

on

M^{2}

.

Let

D : M^{2} \to [0, + \infty)

be a smooth non-negative function whose representation in a local chart is

D (x, y) \geq 0

. We use the notations

{(D_{x})}_{i} = \frac{\partial D}{\partial x^{i}}, {(D_{y})}_{j} = \frac{\partial D}{\partial y^{j}}, {(D_{y x})}_{j i} = \frac{\partial}{\partial y^{j}} (\frac{\partial D}{\partial x^{i}}) = - ϕ_{j i}

for first and second order derivatives of D. The function D is a yoke (see [7] ) if the following conditions hold and D is a divergence (see [8]) if

(iii)

below holds on the whole

M^{2}

.

(i): $D = 0$ only on $Δ_{M}$
(ii): $D_{x} = 0$ and $D_{y} = 0$ on $Δ_{M}$
(iii): $ϕ = - D_{x y}$ is positive definite on $Δ_{M}$

thus points of

Δ_{M}

are minima of D. A divergence function acts as a pseudo-distance, but it satisfies neither the symmetry nor the triangle inequality condition. In [7], the following fibered map

F_{D} : M^{2} \to T^{*} M

over M is considered, whose representation in a local chart is

F_{D} (x, y) = (x, D_{x} (x, y)) .

(5)

By condition (iii) above there exists a neighborhood W of

Δ_{M}

, where

F_{D}

has a smooth inverse

F_{D}^{- 1} (x, λ) = (x, y (x, λ)) .

Using the local diffeomorphism

F_{D}

a symplectic structure

(W, ω_{D})

is defined in [7] via the pull-back

ω_{D} = F_{D}^{*} ω

of the canonical two form (1) on

T^{*} M

. The local form of

ω_{D}

can be computed as follows,

ω_{D} = F_{D}^{*} ω = F_{D}^{*} (d θ_{c}) = d (F_{D}^{*} θ_{c}) = d ({(D_{x})}_{i} d x^{i})

(6)

thus (see Section 3.2 in [7])

ω_{D} = \frac{\partial^{2} D}{\partial x^{j} \partial x^{i}} d x^{j} \land d x^{i} + \frac{\partial^{2} D}{\partial y^{j} \partial x^{i}} d y^{j} \land d x^{i} = - ϕ_{j i} d y^{j} \land d x^{i}

because the first term

\partial^{2} D / \partial x^{j} \partial x^{i}

is symmetric in the

i, j

indices. For the applications that we have in mind of the above theory, we will assume in (iii) above that

- D_{y x}

is positive definite on the whole

M^{2}

so that

F_{D}

is a global diffeomorphism.

Simple examples of Lagrangian submanifolds of

M^{2}

with respect to

ω_{D}

are (with a little abuse of notation) the n-dimensional submanifolds

M_{x} = M \times {y} \approx M

, which are also transversal to the fibers of

π_{1} : M^{2} \to M

,

π_{1} (x, y) = x

. Moreover, as

ω_{D} (u, u) = 0

,

Δ_{M}

is also a Lagrangian submanifold.

Note also that (6) implies that

F_{D}

is a symplectomorphism, thus

L = F_{D}^{- 1} (Λ)

is a Lagrangian submanifold of

M^{2}

whenever

Λ \subset T^{*} M

is a Lagrangian submanifold. In this paper, we will be mainly concerned with the study of Lagrangian submanifolds of

M^{2}

defined in this way.

In the following Section 2.2, we will compute the above introduced objects for the relevant case of exponential families of probability distributions and canonical divergence.

In [7], the Hamiltonian

H : T^{*} M \to [0, + \infty)

associated to a divergence function is defined as

H = D \circ F_{D}^{- 1}

and locally it has the form

H (x, λ) = D (x, y (x, λ)) .

(7)
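The objects of this section can be made concrete on a very simple case. The sketch below assumes M = R and the quadratic function D(x, y) = (x - y)^2 / 2, which is a hypothetical choice made for illustration and not an example from [7]. For it, ϕ = -D_{y x} = 1 on the whole of M^2, the map (5) is F_D(x, y) = (x, x - y), and the Hamiltonian (7) reduces to H(x, λ) = λ^2 / 2.

```python
# Hypothetical example (not from the paper): D(x, y) = (x - y)**2 / 2 on M = R.

def D(x, y):
    return 0.5 * (x - y) ** 2

def D_x(x, y):
    return x - y

def phi(x, y, eps=1e-5):
    """phi = -D_yx, approximated by a central difference of D_x in y."""
    return -(D_x(x, y + eps) - D_x(x, y - eps)) / (2 * eps)

def F_D(x, y):
    return x, D_x(x, y)          # fibered map (5)

def F_D_inv(x, lam):
    return x, x - lam            # solves lam = D_x(x, y) for y

def H(x, lam):
    return D(*F_D_inv(x, lam))   # Hamiltonian (7)
```

In this case F_D is a global diffeomorphism, in line with the assumption that ϕ is positive definite on the whole of M^2.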

2.2. Canonical Divergence and Exponential Families

In this section, we recall the basic definitions of exponential family and canonical divergence, as described, e.g., in [1,2]. Let

(X, B, d x)

be a probability space, where X may be a discrete set or

X = R^{k}

. We stipulate that in case of a discrete set the integrals over X with respect to the measure

d x

are substituted by summations. Let

P (X) = {p : X \to [0, + \infty), p (x) \geq 0, \int_{X} p d x = 1}

and suppose that

q \in P (X)

for suitable k, where

q (x) = e^{k (x)} > 0

. Consider n independent observables

h : X \to R^{n}, rk d h (x) = n \forall x \in X

and define the related free energy

ψ : Θ \subset R^{n} \to R

as (here

θ \cdot h = θ^{i} h_{i}

)

e^{ψ (θ)} = \int_{X} e^{θ \cdot h (x) + k (x)} d x .

(8)

The n real numbers

θ^{i}

are called canonical parameters. They define uniquely a probability distribution

p (\cdot; θ)

which belongs to the exponential family defined by

h, k

,

M (h, k) = {p (x; θ) = e^{θ \cdot h (x) + k (x) - ψ (θ)}, θ \in Θ} \subset P (X) .

(9)

The relevant fact is that

M (h, k)

is an n-dimensional submanifold of the infinite dimensional set

P (X)

and that the canonical parameters

θ

are local coordinates. Note that

q \in M (h, k)

as

ψ (0) = 0

and

q (x) = p (x; 0)

. Another system of local coordinates is provided by the so-called expectation parameters defined by

η = ψ_{θ} (θ) = E_{p_{θ}} [h] = \int_{X} h (x) p (x; θ) d x .

As

ψ

is a convex function, the gradient map

ψ_{θ} (θ) = η

is globally invertible with inverse

θ = \hat{θ} (η)

, which is also a gradient map

\hat{θ} (η) = φ_{η} (η)

, where

φ (η) = \hat{θ} (η) \cdot η - ψ (\hat{θ} (η))

(10)

is the Legendre transform of

ψ

(see, e.g., in [1]). We will denote with

p (x; η)

the point in

M (h, k)

associated to

η

. The Kullback–Leibler divergence is defined for general

(p, \tilde{p})

in

P {(X)}^{2}

as

D_{K L} (p, \tilde{p}) = \int_{X} p (x) log \frac{p (x)}{\tilde{p} (x)} d x .

The restriction of

D_{K L}

to

M {(h, k)}^{2}

, the square of

M (h, k)

,

D_{K L} : M {(h, k)}^{2} \to [0, + \infty)

is called canonical divergence. It can be shown (see in [1]) that when

M (h, k)

is referred to the coordinates

(η, θ)

,

D_{K L}

has the local representation

D (η, θ) = φ (η) + ψ (θ) - η \cdot θ .

(11)

Note that as

p (\cdot; θ) = q

for

θ = 0

D_{K L} (p, q) = D_{K L} (p (\cdot; η), p (\cdot; 0)) = φ (η) + ψ (0) - η \cdot 0 = φ (η) .

(12)
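Formulas (8)–(12) can be checked numerically on a small discrete family. The following Python sketch assumes a three-point space X, a single observable h, and a prior q; these numerical values are illustrative assumptions, not data from the paper. The inverse gradient map is computed by bisection, which is one possible choice among many.

```python
import math

# Assumed toy data (not from the paper): X = {1, 2, 3}, one observable h, prior q = e^k.
h = [0.0, 1.0, 3.0]
q = [0.5, 0.3, 0.2]

def psi(theta):
    """Free energy (8), with the integral replaced by a sum."""
    return math.log(sum(qi * math.exp(theta * hi) for qi, hi in zip(q, h)))

def p(theta):
    """Member of M(h, k) with canonical parameter theta, see (9)."""
    z = math.exp(psi(theta))
    return [qi * math.exp(theta * hi) / z for qi, hi in zip(q, h)]

def eta_of(theta):
    """Expectation parameter eta = psi_theta(theta) = E_p[h]."""
    return sum(hi * pi for hi, pi in zip(h, p(theta)))

def theta_hat(eta, lower=-50.0, upper=50.0):
    """Invert eta = psi_theta(theta) by bisection (psi_theta is increasing)."""
    for _ in range(200):
        mid = 0.5 * (lower + upper)
        if eta_of(mid) < eta:
            lower = mid
        else:
            upper = mid
    return 0.5 * (lower + upper)

def phi(eta):
    """Legendre transform (10) of psi."""
    t = theta_hat(eta)
    return t * eta - psi(t)

def D(eta, theta):
    """Canonical divergence in the mixed coordinates (eta, theta), formula (11)."""
    return phi(eta) + psi(theta) - eta * theta

def kl(p1, p2):
    """Kullback-Leibler divergence of two strictly positive discrete distributions."""
    return sum(a * math.log(a / b) for a, b in zip(p1, p2))
```

With these definitions, D(eta_of(t1), t2) agrees with kl(p(t1), p(t2)) up to rounding, and D(eta, 0) agrees with φ(η), as stated in (12).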

A key object is the map

F_{D}

introduced in (5) associated to

M (h, k)

and the canonical divergence (11). It has the local form in coordinates

(η, θ)

, see (5) and (11),

F_{D} (η, θ) = (η, D_{η}) = (η, φ_{η} (η) - θ),

(13)

with the explicit inverse, using local coordinates

(η, λ)

in

T^{*} M (h, k)

,

F_{D}^{- 1} (η, λ) = (η, θ (η, λ)) = (η, φ_{η} (η) - λ) = (η, \hat{θ} (η) - λ) .

(14)

A simple but elegant result of the above-introduced framework is the following.

Proposition 1.

Let

Λ_{G}

be a Lagrangian submanifold of

T^{*} M (h, k)

described by the Morse family

G (η, u)

as in (4). Then,

L_{S} = F_{D}^{- 1} (Λ_{G})

is a Lagrangian submanifold of

M {(h, k)}^{2}

described by the Morse family

S (η, u) = φ (η) - G (η, u)

.

Proof.

From (4) we have that

λ = G_{η} (η, u)

on

Λ_{G}

and from (14)

F_{D}^{- 1} (Λ_{G}) = {(η, θ) = (η, φ_{η} (η) - G_{η} (η, u)) = (η, S_{η} (η, u)), (η, u) \in E}

where

S (η, u) = φ (η) - G (η, u)

. Moreover, as

S_{u} (η, u) = - G_{u} (η, u)

the critical set

E

in (2) is the same. □

As a consequence of the above proposition, if

Λ_{G}

is transversal to the fibers of

T^{*} M (h, k)

(no extra parameters u), then its image in

M {(h, k)}^{2}

is transversal to the fibers of

π_{1}

.

Another interesting consequence is that the zero section of the cotangent bundle

T^{*} M (h, k)

, locally represented as

Z = {(η, 0) : η \in E}

, is mapped by

F_{D}^{- 1}

into

Z_{0} = F_{D}^{- 1} (Z) = {(η, \hat{θ} (η)) : η \in E}

which is contained in

D^{- 1} (0)

, the zero-level set of the canonical divergence. Indeed, from (10) and (11) we have that

D (η, \hat{θ} (η)) = φ (η) + ψ (\hat{θ} (η)) - η \cdot \hat{θ} (η) = φ (η) - φ (η) \equiv 0

(15)

thus

Z_{0} \subset D^{- 1} (0)

in the general case and

Z_{0} = D^{- 1} (0)

if

n = 1

. For later use, we compute from (7) the Hamiltonian associated to the canonical divergence

H (η, λ) = D \circ F_{D}^{- 1} (η, λ) = φ (η) + ψ (\hat{θ} (η) - λ) - η \cdot (\hat{θ} (η) - λ) .

We set for the sake of simplicity

\hat{θ} (η) = \hat{θ}

and we compute from (8) the free energy

ψ (\hat{θ} (η) - λ)

\begin{matrix} e^{ψ (\hat{θ} - λ)} & = & \int_{X} e^{(\hat{θ} - λ) \cdot h + k} d x = \int_{X} e^{(\hat{θ} - λ) \cdot h + k + ψ (\hat{θ}) - ψ (\hat{θ})} d x \\ & = & e^{ψ (\hat{θ})} \int_{X} e^{- λ \cdot h} e^{\hat{θ} \cdot h + k - ψ (\hat{θ})} d x = e^{ψ (\hat{θ})} E_{p_{\hat{θ}}} [e^{- λ \cdot h}] . \end{matrix}

(16)

Using (15) and (16), the Hamiltonian can be written using relation (10) as

\begin{matrix} H (η, λ) & = & φ (η) + ψ (\hat{θ}) + ln E_{p_{\hat{θ}}} [e^{- λ \cdot h}] - η \cdot \hat{θ} + η \cdot λ \\ & = & ln E_{p_{\hat{θ}}} [e^{- λ \cdot h}] + η \cdot λ . \end{matrix}

(17)

It is interesting to investigate more in detail the structure of the Lagrangian submanifold

L_{S} = F_{D}^{- 1} (Λ_{G}) \subset M {(h, k)}^{2}

by studying the form of the two probability distributions

F_{D}^{- 1} (η, λ) = (η, \hat{θ} - λ)

in

L_{S}

associated to the coordinates respectively

η

and

\hat{θ} - λ

. We compute from (9)

p (x; η) = e^{\hat{θ} \cdot h (x) + k (x) - ψ (\hat{θ})}

and using (16)

\begin{matrix} p (x; \hat{θ} - λ) & = & e^{(\hat{θ} - λ) \cdot h + k - ψ (\hat{θ} - λ)} \\ & = & e^{\hat{θ} \cdot h - λ \cdot h + k - ψ (\hat{θ}) - ln E_{p_{\hat{θ}}} [e^{- λ \cdot h}]} \\ & = & p (x; η) \frac{e^{- λ \cdot h (x)}}{E_{p_{\hat{θ}}} [e^{- λ \cdot h}]} . \end{matrix}

(18)

Note that setting

p (x; λ) = \frac{e^{- λ \cdot h (x)}}{Z (λ)} = \frac{e^{- λ \cdot h (x)}}{\int_{X} e^{- λ \cdot h (x)} d x}

relation (18) can be given the form

p (x; \hat{θ} - λ) = \frac{p (x; η) e^{- λ \cdot h (x)}}{\int_{X} p (x; η) e^{- λ \cdot h (x)} d x} = \frac{p (x; η) p (x; λ)}{\int_{X} p (x; η) p (x; λ) d x} .

(19)

We will give an interpretation of this relation in the case of discrete probability distributions in Section 3.2 below.
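The identities (17) and (18) lend themselves to a direct numerical check in the discrete case. The sketch below again assumes an illustrative three-point family (the values of h and q are assumptions, not data from the paper) and compares the shifted distribution p(·; \hat{θ} - λ) with the reweighted and normalized p(·; η), and the Hamiltonian (17) with the Kullback–Leibler divergence between the two.

```python
import math

# Assumed toy data (not from the paper): discrete X with three states.
h = [0.0, 1.0, 3.0]
q = [0.5, 0.3, 0.2]

def psi(theta):
    return math.log(sum(qi * math.exp(theta * hi) for qi, hi in zip(q, h)))

def p(theta):
    z = math.exp(psi(theta))
    return [qi * math.exp(theta * hi) / z for qi, hi in zip(q, h)]

def tilt(p_eta, lam):
    """Right-hand side of (18): reweight p(.; eta) by exp(-lam * h) and normalize."""
    w = [pi * math.exp(-lam * hi) for pi, hi in zip(p_eta, h)]
    s = sum(w)               # equals E_{p_theta_hat}[exp(-lam * h)]
    return [wi / s for wi in w]

def hamiltonian(theta_hat, lam):
    """Formula (17), with eta = E_p[h] at the canonical parameter theta_hat."""
    p_eta = p(theta_hat)
    eta = sum(hi * pi for hi, pi in zip(h, p_eta))
    mgf = sum(pi * math.exp(-lam * hi) for pi, hi in zip(p_eta, h))
    return math.log(mgf) + eta * lam

def kl(p1, p2):
    return sum(a * math.log(a / b) for a, b in zip(p1, p2))
```

For instance, with theta_hat = 0.4 and lam = 1.1, p(0.4 - 1.1) and tilt(p(0.4), 1.1) coincide up to rounding, and hamiltonian(0.4, 1.1) equals kl(p(0.4), p(-0.7)).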

3. Application to Maximum Entropy Principle with Nonlinear Constraints and Phase Transitions

A relevant application of the above-introduced framework concerns the use of the Maximum Entropy Principle with nonlinear constraints. Let us consider a physical system X whose description is given in terms of a probability distribution

q \in P (X)

. The Maximum Entropy Principle (E.T. Jaynes, see [11,12]) is a general inference procedure that allows one to update an initial probability distribution q on the basis of subsequent information on the system, represented by the average values

E_{p} [h]

of some observables h of interest for the system. The sought distribution p is the one that minimizes the relative entropy

D_{K L} (p, q)

on the set of the distributions which satisfy the constraints on

E_{p} [h]

. From a mathematical point of view, we are faced with a constrained extremization problem to be solved below using the Lagrange multipliers method.

We will see that the set of solutions for different values of the constraints defines a Lagrangian submanifold of a cotangent space of a manifold

M (h, k)

. We are interested in describing the corresponding Lagrangian submanifold in

M {(h, k)}^{2}

.

This section has a pedagogical character, so for the sake of simplicity we will avoid technicalities and assume that

X = {1, \dots, n}

is a discrete space and that there is only one observable of interest defined by assigning

h = (h_{1}, \dots, h_{n})

. The case of k observables can be dealt with along the same lines with no extra effort. The case of a continuous space

X \subset R^{n}

presents more technical difficulties and it is considered in [20].

Let

q_{i} = e^{k_{i}} \in P (X)

be the a priori distribution describing X. The Kullback–Leibler divergence is called relative entropy in this setting and has the form

D (p, q) = \sum_{i} p_{i} ln \frac{p_{i}}{q_{i}} .

Let

f : R \to R

be a smooth globally non-invertible function (think for example of a cubic

f (x) = x (x^{2} - a^{2})

for

a \in R

, see Figure 1 below). We look for the minima of D on the set of

p \in P (X)

that satisfy the nonlinear constraint on p in the form

g : R_{+}^{n} \to R, g (p) = y

that is

g (p) = f (E_{p} [h]) = f (\sum_{i = 1}^{n} h_{i} p_{i}) = y .

(20)

The choice of this type of constraints is motivated by classical applications in statistical physics. For example in the Ising model in the Curie–Weiss (mean field) approximation the average energy of the spin lattice is a quadratic function of the average magnetization

E_{p} [s]

, see [14,15]. We have that

d g (p) = f^{'} (E_{p} [h]) h = (f^{'} (E_{p} [h]) h_{1}, \dots, f^{'} (E_{p} [h]) h_{n}) .

(21)

Note that at this stage of the procedure we do not take into account the normalization constraint; we will enforce it afterwards by dividing any candidate extremum point

\hat{p}

by

\sum_{i} {\hat{p}}_{i}

. After introducing the Lagrange function where

λ

is the Lagrange multiplier associated to the constraint (20)

G (y, p, λ) = D (p, q) - λ (f (E_{p} [h]) - y)

(22)

we see that the candidate extrema are the solutions

(p, λ)

for given y of (here

i = 1, \dots, n

)

{(G_{p})}_{i} = ln \frac{p_{i}}{q_{i}} + 1 - λ f^{'} (E_{p} [h]) h_{i} = 0, G_{λ} = - (f (E_{p} [h]) - y) = 0

(23)

that is, setting

q_{i} = e^{k_{i}}

, we have to face a transcendental equation for the unnormalized probability

p_{i} = c e^{λ f^{'} (E_{p} [h]) h_{i} + k_{i}}, f (E_{p} [h]) = y .

(24)

After normalization, (24)₁ becomes

p_{i} = e^{λ f^{'} (E_{p} [h]) h_{i} + k_{i} - ψ (λ, p)}, e^{ψ (λ, p)} = \sum e^{λ f^{'} (E_{p} [h]) h_{i} + k_{i}} .

(25)

Let us denote with

f^{\leftarrow} (y) \subset R

the set of pre-images of y along f (see, e.g., Figure 1 below)

f^{\leftarrow} (y) = {η \in R : f (η) = y} = {η_{1}, \dots, η_{α}, \dots, η_{A}}, η_{α} = η_{α} (y)

(26)

where we have supposed that, for every y,

f^{\leftarrow} (y)

is a finite set of cardinality

A (y) < + \infty

. The crux is that we can substitute the constraint

f (E_{p} [h]) = y

in (24)₂ with the following equivalent one

f (E_{p} [h]) = y ⟺ E_{p} [h] \in f^{\leftarrow} (y)

therefore we can describe the—possibly non-unique—solution (25) of the extremum problem (23) as

p_{i}^{α} = e^{λ f^{'} (η_{α}) h_{i} + k_{i} - ψ (λ, α)}, e^{ψ (λ, α)} = \sum e^{λ f^{'} (η_{α}) h_{i} + k_{i}}

(27)

where

α = 1, \dots, A (y)

, showing that the candidate solution belongs to an exponential family

M (h, k)

. Note that in Information Geometry, the critical points of the MEP extremum problem are computed as geodesic projections over a submanifold which is an exponential family, and the multiplicity of solutions is related to the non-uniqueness of the geodesic projection, see [1,15].

Figure 1. Plot of

y = f (η) = η (η^{2} - a^{2})

. Points

b, c

correspond to points where

f^{'} (η) = 0

.

Note that where

f^{'} (η_{α}) \neq 0

setting

λ f^{'} (η_{α}) = θ_{α}

the solution (27) can be given the standard form (see in [1,14]) of MEP solution

{\hat{p}}_{i} = e^{\hat{θ} h_{i} + k_{i} - ψ (\hat{θ})}, \hat{θ} (η) = φ_{η} (η)

with linear constraint

E [h] = η_{α}

, hence (25) becomes

p_{i}^{α} = e^{θ_{α} h_{i} + k_{i} - ψ (θ_{α})}, e^{ψ (θ_{α})} = \sum e^{θ_{α} h_{i} + k_{i}} .

(28)

The multipliers

θ_{α} = \hat{θ} (η_{α} (y))

,

α = 1, \dots, A (y)

are uniquely determined (see (10)) by the equation

ψ_{θ} (θ) = η, i.e., θ = \hat{θ} (η) = φ_{η} (η)

(29)

for

η = η_{α} (y)

and accordingly we can compute the multipliers

λ

as

λ_{α} (y) = \frac{\hat{θ} (η_{α} (y))}{f^{'} (η_{α} (y))} .

(30)
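The procedure (26)–(30) can be sketched in Python for a toy case. The setting below is an assumption made for illustration and is not taken from the paper: a two-state system with h = (-1, +1) and uniform prior q, for which ψ(θ) = ln cosh θ and \hat{θ}(η) = atanh(η), together with the cubic f(η) = η(η^2 - a^2) with a = 0.5. The pre-images are found by bisection on the three monotone branches of f, which is one simple way to enumerate them.

```python
import math

# Assumed toy setting (not from the paper): two-state spin X = {1, 2}, h = (-1, +1),
# uniform prior q, and the cubic f(eta) = eta * (eta**2 - a**2) with a = 0.5.
A_PARAM = 0.5

def f(eta):
    return eta * (eta**2 - A_PARAM**2)

def f_prime(eta):
    return 3 * eta**2 - A_PARAM**2

def theta_hat(eta):
    """For this family psi(theta) = ln cosh(theta), so eta = tanh(theta)."""
    return math.atanh(eta)

def bisect(func, lo, hi, iters=200):
    """Root of func in [lo, hi], assuming a sign change between the endpoints."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if (func(lo) > 0) == (func(mid) > 0):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def preimages(y, eps=1e-9):
    """The set (26): roots of f(eta) = y inside (-1, 1), one per monotone branch."""
    c = A_PARAM / math.sqrt(3)                 # f'(eta) = 0 at eta = -c and eta = +c
    branches = [(-1 + eps, -c), (-c, c), (c, 1 - eps)]
    g = lambda eta: f(eta) - y
    return [bisect(g, lo, hi) for lo, hi in branches if g(lo) * g(hi) < 0]

def mep_solutions(y):
    """Triples (eta_alpha, lambda_alpha, p^alpha) following (28) and (30)."""
    out = []
    for eta in preimages(y):
        th = theta_hat(eta)
        lam = th / f_prime(eta)                # multiplier (30); needs f'(eta) != 0
        z = 2 * math.cosh(th)
        out.append((eta, lam, [math.exp(-th) / z, math.exp(th) / z]))
    return out
```

Under these assumptions, mep_solutions(0.02) returns three candidate distributions, one for each pre-image of y, whereas mep_solutions(0.1) returns a single one.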

Note that the solution to our constrained extremization problem (28) has the form of a curved exponential family (see [1]) with respect to the discrete parameter

α

. We will see in the next Section 3.1 that the framework of Lagrangian submanifold is useful to describe the global picture of the solutions in case of multiple solutions.

3.1. The Global Picture via Lagrange Submanifold

If we set in the Lagrange function (22)

(p, λ) = u

, we see that for

G (y, u)

the set of points

(y, u)

satisfying the first order necessary condition for unconstrained extremum (23) is the critical set

E = {(y, u) : G_{u} (y, u) = 0} .

We can check if the Lagrange function

G (y, u)

defines a Morse family using the rank condition (3)

rk (G_{y u} G_{u u}) = n + 1 for all (y, u) \in E

where in this case

(G_{y u} G_{u u}) = (\begin{matrix} 0 & G_{p p} & - d g^{T} \\ 1 & - d g & 0 \end{matrix})

(31)

and

G_{p p}

is the n-dimensional Hessian matrix (here

δ_{i j}

is Kronecker symbol)

{(G_{p p})}_{i j} = {(D_{p p})}_{i j} - λ f^{''} (E_{p} [h]) h_{i} h_{j} = \frac{δ_{i j}}{p_{i}} - λ f^{''} (E_{p} [h]) h_{i} h_{j} .

(32)

If

G (y, u)

is a Morse family, then by the Maslov–Hörmander theorem

Λ_{G} = {(y, G_{y}) where (y, u) \in E}

(33)

is a Lagrangian submanifold of

T^{*} R

. We claim that (33) provides a global description of the set of solutions (28). We have seen in Section 2 that a sufficient condition for the elimination of all extra parameters u is that

G_{u u}

has maximal rank for all

(y, u) \in E

. A criterion for this is given by the following classical result in constrained optimization theory, here adapted to our notation, which expresses the second-order sufficient condition for maxima or minima (see [14,21] for the proof).

Proposition 2.

If the symmetric matrix

G_{p p}

in (32) is (positive or negative) definite on

ker d g

for

(y, u) \in E

, then the square matrix

G_{u u}

in (31) has maximal rank.

From (21), we have that for

(y, u) \in E

ker d g (p) = {u \in R^{n} : f^{'} (η_{α}) h \cdot u = 0}

and from (32), that

G_{p p} u \cdot u = (\sum_{i} \frac{u_{i}^{2}}{p_{i}}) - λ f^{''} (η_{α}) {(h \cdot u)}^{2} .

It is straightforward to derive from the above relations that the two cases below hold

\{\begin{matrix} f^{'} (η_{α}) \neq 0 \Rightarrow ker d g (p) = {u : h \cdot u = 0} \Rightarrow G_{p p} u \cdot u > 0 \forall u \neq 0, \\ f^{'} (η_{α}) = 0 \Rightarrow ker d g (p) = R^{n} \Rightarrow G_{p p} u \cdot u \in R . \end{matrix}

Therefore, at points

(y, u) \in E

where

f^{'} (η_{α}) \neq 0

the Lagrangian submanifold

Λ_{G}

in (33) is transversal. At points in

E

where

f^{'} (η_{α}) = 0

, we have

d g = f^{'} (η_{α}) h = 0

, see (21), thus transversality is lost as—see the form of

G_{u u}

in (31)—for these points

det G_{u u} (p, λ) = 0, and (y, u) \in E .
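The role of d g in the block matrix G_{u u} of (31) can be illustrated numerically. In the sketch below, the inputs p, h, λ and the values fp = f'(η_α), fpp = f''(η_α) are supplied by hand; they are illustrative assumptions and are not computed from an actual solution of (23). The determinant routine is a plain Gaussian elimination written for this example.

```python
# Sketch with assumed inputs: p, h, lam and the values fp = f'(eta), fpp = f''(eta)
# are supplied by the caller; none of the numbers below come from the paper.

def build_G_uu(p, h, lam, fp, fpp):
    """Square block G_uu of (31) for u = (p, lambda): [[G_pp, -dg^T], [-dg, 0]]."""
    n = len(p)
    dg = [fp * hi for hi in h]                                  # formula (21)
    rows = []
    for i in range(n):
        row = [(1.0 / p[i] if i == j else 0.0) - lam * fpp * h[i] * h[j]
               for j in range(n)]                               # formula (32)
        rows.append(row + [-dg[i]])
    rows.append([-d for d in dg] + [0.0])
    return rows

def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if a[pivot][col] == 0.0:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result

p = [1 / 3, 1 / 3, 1 / 3]
h = [-1.0, 0.0, 1.0]
regular = det(build_G_uu(p, h, lam=0.5, fp=2.0, fpp=6.0))   # f' != 0
fold = det(build_G_uu(p, h, lam=0.5, fp=0.0, fpp=6.0))      # f' = 0, so dg = 0
```

When fp = 0 the last row and the last column of G_{u u} vanish, so the determinant is zero regardless of G_{p p}; when fp is nonzero and G_{p p} is definite on ker d g, the determinant is nonzero, in agreement with Proposition 2.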

We remark that the above introduced framework is able to give the global description of the set of solutions (28), (30) in terms of the Lagrangian submanifold locally described as

Λ_{f}^{(y)} = {(y, G_{y}) = (y, λ (η_{α} (y))) = (y, \frac{\hat{θ} (η_{α} (y))}{f^{'} (η_{α} (y))})} \subset T^{*} R_{y}

(34)

where

λ (η_{α} (y))

is given by (30). If we consider

f : E \subset R_{η} \to R_{y}

,

y = f (η)

as a local change of coordinates on

M (h, k)

(since f is locally invertible where

f^{'} (η) \neq 0

) it is easy to prove that

Proposition 3.

The submanifold

Λ_{f}^{(y)} \subset T^{*} R_{y}

in (34) is the image

Λ_{f}^{(y)} = T^{*} f (Λ_{f})

of

Λ_{f} = {(η, {\hat{θ}}_{α} (η)) : η \in E} \subset T^{*} M (h, k)

(35)

where

{\hat{θ}}_{α} (η)

is the multiplier in (29) associated to the constraint

E_{p} [h] = η_{α}

and

η_{α} \in I (η) = f^{\leftarrow} (f (η))

.

Proof.

If

y = f (η)

is the local change of coordinates in

M (h, k)

, then the tangent map

T f : T R_{η} \to T R_{y}

has the local form

(y, \dot{y}) = T f (η, \dot{η}) = (f (η), f^{'} (η) \dot{η})

and the cotangent map

T^{*} f : T^{*} R_{η} \to T^{*} R_{y}

has the local form

(y, λ) = T^{*} f (η, β) = (f (η), \frac{β}{f^{'} (η)})

if we want that the Liouville one-form (see above (1)) has the same canonical form

θ_{c} = λ d y = β d η

in the two coordinate charts. See, e.g., in [19] for a proof of this last classical result of differential geometry. □

We want to study the Lagrangian submanifold

Λ_{f}

defined in (35) and its image

L_{f} = F_{D}^{- 1} (Λ_{f}) \subset M {(h, k)}^{2}

, where

F_{D}^{- 1}

is defined in (14), whose local expression is

L_{f} = {(η, \hat{θ} (η) - \hat{θ} (η_{α})) : η \in E} .

(36)

First we consider the case that f is a globally invertible function. In this case,

I (η) = f^{\leftarrow} (f (η)) = {η}

and

\hat{θ} (η) = φ_{η} (η)

. The Lagrangian submanifold

Λ_{f}

in (35) is the graph of the differential

φ_{η} (η)

and it is transversal, see Figure 2a. Moreover, see below (9), if

η = η_{0} = E_{q} [h]

then

\hat{θ} (η_{0}) = 0

. As

ψ_{θ} (θ) = η

is invertible with inverse

θ = \hat{θ} (η)

, we have

\frac{d \hat{θ}}{d η} (η) = {(\frac{d^{2} ψ}{d θ^{2}})}^{- 1} = \frac{1}{{var}_{\hat{p}} (h)} = \frac{1}{E_{\hat{p}} [h^{2}] - η^{2}} > 0

and

\hat{θ} (η)

is a monotonically increasing function, see Figure 2a. Its image (36) is

L_{f} = M (h, k) \times {0}

, see Figure 2b.

Figure 2. The case of a transversal Lagrangian submanifold.

If we consider a globally non invertible function f as the one depicted in Figure 1, then

I (η)

contains multiple points and

Λ_{f}

is non transversal at points where

f^{'} (η) = 0

, see Figure 3a. The corresponding image

L_{f}

has multiple branches and it is not a manifold at points

(b, c)

where transversality fails, see Figure 3b.

Figure 3. The case of a folded, i.e., non transversal Lagrangian submanifold.
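The branches of L_f in (36) can be computed explicitly in a toy case. The sketch below assumes, for illustration only, a two-state family with h = (-1, +1) and uniform q, so that \hat{θ}(η) = atanh(η), and the cubic f(η) = η(η^2 - a^2) with a = 0.5; neither choice is taken from the paper. For this cubic the fiber I(η) has a closed form, because f(x) - f(η) factors as (x - η)(x^2 + η x + η^2 - a^2).

```python
import math

# Assumed toy setting (not from the paper): two-state spin with h = (-1, +1) and
# uniform q, so that theta_hat(eta) = atanh(eta); cubic f with a = 0.5.
A_PARAM = 0.5

def theta_hat(eta):
    return math.atanh(eta)

def fiber(eta):
    """I(eta) = set of x in (-1, 1) with f(x) = f(eta), f(x) = x * (x**2 - a**2).

    f(x) = f(eta) factors as (x - eta) * (x**2 + eta * x + eta**2 - a**2) = 0.
    """
    roots = [eta]
    disc = 4 * A_PARAM**2 - 3 * eta**2
    if disc > 0:
        s = math.sqrt(disc)
        roots += [(-eta + s) / 2, (-eta - s) / 2]
    return sorted(r for r in roots if abs(r) < 1)

def L_f_points(eta):
    """Points (eta, theta_hat(eta) - theta_hat(eta_alpha)) of (36) lying over eta."""
    return [(eta, theta_hat(eta) - theta_hat(e)) for e in fiber(eta)]
```

Over η = 0 the fiber is {-0.5, 0, 0.5}, so L_f has three points there, one of which lies on M(h, k) × {0}; over η = 0.9 the fiber reduces to {η} and only the point (0.9, 0) remains.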

3.2. Probability Distributions in $L_{f}$

In this section, we study the structure of the probability distributions in

L_{f}

. In the local coordinate systems

(η, θ)

of

M {(h, k)}^{2}

,

η

and

\hat{θ} (η)

describe the same probability distribution that we write for brevity as

p_{i} (η) = p_{i} (\hat{θ})

. Therefore, the probability distributions in

L_{f}

in (36) associated to

η

and

\hat{θ} (η) - \hat{θ} (η_{α})

are, respectively,

p_{i} (η) = e^{\hat{θ} h_{i} + k_{i} - ψ (\hat{θ})}

(37)

and, see (18),

p_{i} (\hat{θ} - \hat{θ} (η_{α})) = p_{i} (η) \frac{e^{- \hat{θ} (η_{α}) h_{i}}}{\sum_{i} p_{i} (η) e^{- \hat{θ} (η_{α}) h_{i}}} .

(38)

Setting

{\tilde{p}}_{i} (η_{α}) = \frac{e^{- \hat{θ} (η_{α}) h_{i}}}{Z (λ)}, Z (λ) = \sum_{i} e^{- \hat{θ} (η_{α}) h_{i}},

the above (38) can be rewritten as the discrete version of (19), that is,

p_{i} (\hat{θ} - \hat{θ} (η_{α})) = \frac{p_{i} (η) {\tilde{p}}_{i} (η_{α})}{\sum_{i} p_{i} (η) {\tilde{p}}_{i} (η_{α})} .

(39)

This last formula can be interpreted as follows; let A and B be two independent random variables

A, B

:

Ω \to X

, where

X = {1, \dots, n}

is the discrete state space, described by the probability distributions

p_{i}

and

{\tilde{p}}_{i}

, respectively (for example, A and B describe two dice with n faces). Then,

\sum_{i} p_{i} {\tilde{p}}_{i}

is the probability that A and B are found in the same state and

P r o b (A = i, B = i | A = B) = \frac{p_{i} {\tilde{p}}_{i}}{\sum_{i} p_{i} {\tilde{p}}_{i}}

in (39) is the conditional probability that A and B are found in the state i provided that they are found in the same state. Note that for

p_{i} (η)

in (37) we have

e^{k_{i}} = q_{i}

, thus (37) can be rewritten as

p_{i} (η) = q_{i} \frac{e^{\hat{θ} h_{i}}}{\sum_{i} q_{i} e^{\hat{θ} h_{i}}} = \frac{q_{i} {\tilde{p}}_{i} (\hat{θ})}{\sum_{i} q_{i} {\tilde{p}}_{i} (\hat{θ})}

and (39) above is equal to

p_{i} (\hat{θ} - \hat{θ} (η_{α})) = \frac{q_{i} p_{i} {\tilde{p}}_{i}}{\sum_{i} q_{i} p_{i} {\tilde{p}}_{i}} = P r o b (A = i, B = i, C = i | A = B = C)

where

A, B, C

are described by

q_{i}

,

p_{i} = {\tilde{p}}_{i} (\hat{θ} (η))

,

{\tilde{p}}_{i} = {\tilde{p}}_{i} (\hat{θ} (η_{α}))

.
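The probabilistic reading of (39) can be reproduced with a few lines of Python. The helper name coincidence_law and the numerical distributions below are assumptions introduced for this sketch; the function computes the law of the common value of several independent variables on X = {1, …, n}, conditioned on all of them taking the same value.

```python
def coincidence_law(*dists):
    """Law of the common value of independent variables, given that they all agree."""
    n = len(dists[0])
    joint = [1.0] * n
    for d in dists:
        joint = [j * di for j, di in zip(joint, d)]   # Prob(all variables equal i)
    total = sum(joint)                                # Prob(all variables coincide)
    return [j / total for j in joint]

# Illustrative distributions (assumed values) for two three-sided dice A and B.
p = [0.5, 0.3, 0.2]
p_tilde = [0.2, 0.3, 0.5]
law_two = coincidence_law(p, p_tilde)
```

Passing three distributions, e.g. coincidence_law(q, p, p_tilde) for a third distribution q, gives the three-variable conditional probability in the last formula of this section.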

4. Discussion

Canonical coordinates

η

and

θ

associated to an exponential family

M (h, k)

are dually flat coordinates with respect to the duality defined by the canonical divergence. With respect to these coordinates, a generalization of the Pythagorean theorem is proved in Information Geometry, which provides a generalized formulation of the Maximum Entropy Principle with linear constraints as a geodesic projection problem (see [2]). Multiplicity of the solutions

\hat{θ} (η)

of the Maximum Entropy problem is due to the non-uniqueness of the projection. In this paper, we have shown that the set of pairs

(η, \hat{θ} (η))

defines a transversal Lagrangian submanifold

Λ

of

T^{*} M (h, k)

, and we have seen with an example that if nonlinear constraints are considered, the set of possible multiple solutions to the Maximum Entropy problem is globally described by a folded (i.e., possibly non-transversal) Lagrangian submanifold

Λ_{f}

. We have computed their pull-back to the square manifold

M {(h, k)}^{2}

via the map

F_{D}^{- 1}

. We think that this framework offers a complementary view to the generalized Pythagorean theorem. We plan to address in a subsequent paper a generalization of the theory presented here to a more general form of nonlinear constraint.

Funding

This research received no external funding.

Conflicts of Interest

The author declares no conflict of interest.

References

1. Amari, S. Information Geometry and Its Applications; Springer: Berlin/Heidelberg, Germany, 2016; Volume 194.
2. Amari, S.; Nagaoka, H. Methods of Information Geometry; American Mathematical Soc.: Providence, RI, USA, 2007; Volume 191.
3. Murray, M.K.; Rice, J.W. Differential Geometry and Statistics; CRC Press: Boca Raton, FL, USA, 1993; Volume 48.
4. Amari, S.; Cichocki, A. Information geometry of divergence functions. Bull. Pol. Acad. Sci. Tech. 2010, 58, 183–195.
5. Eguchi, S. A differential geometric approach to statistical inference on the basis of contrast functionals. Hiroshima Math. J. 1985, 15, 341–391.
6. Ay, N.; Amari, S. A novel approach to canonical divergences within information geometry. Entropy 2015, 17, 8111–8129.
7. Barndorff-Nielsen, O.E.; Jupp, P.E. Statistics, yokes and symplectic geometry. Ann. Fac. Sci. Toulouse Math. 1997, 6, 389–427.
8. Leok, M.; Zhang, J. Connecting information geometry and geometric mechanics. Entropy 2017, 19, 518.
9. Noda, T. Symplectic structures on statistical manifolds. J. Aust. Math. Soc. 2011, 90, 371–384.
10. Nakamura, Y. Completely integrable gradient systems on the manifolds of Gaussian and multinomial distributions. Jpn. J. Ind. Appl. Math. 1993, 10, 179.
11. Jaynes, E.T. Information theory and statistical mechanics. Phys. Rev. 1957, 106, 620–630.
12. Jaynes, E.T. Probability Theory: The Logic of Science; Cambridge University Press: Cambridge, UK, 2003.
13. Brot, R. Phase Transitions. In Statistical Physics. Phase Transitions and Superfluidity; Brandeis University Summer Institute in Theoretical Physics, Gordon and Breach Science Publishers: London, UK, 1966; pp. 5–103.
14. Favretti, M. Lagrangian submanifolds generated by the Maximum Entropy principle. Entropy 2005, 7, 1–14.
15. Fujiwara, A.; Shigeru, S. Hereditary structure in Hamiltonians: Information geometry of Ising spin chains. Phys. Lett. A 2010, 374, 911–916.
16. Maslov, V.P.; Bouslaev, V.C.; Arnol’d, V.I. Theorie des Perturbations et Methodes Asymptotiques; Dunod: Paris, France, 1972.
17. Hormander, L. Fourier integral operators. I. Acta Math. 1971, 127, 79.
18. Weinstein, A. Lectures on Symplectic Manifolds; No. 29.; American Mathematical Soc.: Providence, RI, USA, 1977.
19. Cardin, F. Elementary Symplectic Topology and Mechanics; Springer: Berlin/Heidelberg, Germany, 2015.
20. Favretti, M. Isotropic submanifolds generated by the Maximum Entropy Principle and Onsager reciprocity relations. J. Funct. Anal. 2005, 227, 227–243.
21. Bertsekas, D.P. Constrained Optimization and Lagrange Multiplier Methods; Academic Press: Cambridge, MA, USA, 2014.

Figure 1. Plot of

y = f (η) = η (η^{2} - a^{2})

. Points

b, c

correspond to points where

f^{'} (η) = 0

.

Figure 2. The case of a transversal Lagrangian submanifold.

Figure 3. The case of a folded, i.e., non-transversal, Lagrangian submanifold.

© 2020 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
