Lagrangian Submanifolds of Symplectic Structures Induced by Divergence Functions

Divergence functions play a central role in Information Geometry, as they allow one to introduce a Riemannian metric and a dual connection structure on a finite-dimensional manifold of probability distributions. They also allow one to define, in a canonical way, a symplectic structure on the square of this manifold, a property that received little attention in the literature until recent contributions. In this paper, we hint at a possible application: we study Lagrangian submanifolds of this symplectic structure and show that they are useful for describing the manifold of solutions of the Maximum Entropy principle.


Introduction
Information Geometry [1,2] provides a sound and fruitful framework for interpreting statistics using notions of classical differential geometry [3]. A principal object in Information Geometry is the contrast or divergence function, which (informally speaking) measures the degree of separation between two probability distributions [4][5][6]. The main strength of divergence functions is that they allow one to define a Riemannian structure on a finite-dimensional submanifold $M$ of probability distributions endowed with a dual coordinate system, with far-reaching implications. A less-studied spin-off of contrast functions is the possibility of introducing a symplectic structure on the square of $M$ by pulling back the canonical symplectic structure defined on the cotangent bundle $T^*M$. This procedure was introduced in 1995 in the pioneering paper [7], suggesting that symplectic geometry may have a natural role to play in statistics. Recently there has been renewed interest in possible applications of the symplectic structures introduced in [7], for example, to the analogies with discrete Lagrangian mechanics (see [8]) or to the relations with completely integrable systems of Hamiltonian mechanics (see [9,10]).
In this paper, we explore a possible role for Lagrangian submanifolds of the above symplectic structure on $M^2$ in the case where $M$ is an exponential family $M(h, k)$. Exponential families are prototypical examples of finite-dimensional manifolds admitting a dually flat structure defined by the canonical divergence, and they play a relevant role in information geometry and statistics [1,2]. For our argument, their importance is due to the fact that they represent the manifold of solutions of the variational problem associated with the Maximum Entropy Principle (MEP) with linear constraints [11,12]. In some applications to statistical mechanics, e.g., in the description of phase transitions in Ising spin systems, MEP with nonlinear constraints is considered, see, e.g., [13][14][15]. In this case, the set of possible solutions has a richer structure, which is well captured by a Lagrangian submanifold of $T^*M(h, k)$. In this work, we are concerned with the Lagrangian submanifolds defined in the square of $M(h, k)$ via the canonical pull-back hinted at above. The structure of the paper is as follows. In Section 2, we recall the needed tools of symplectic geometry, and in Section 2.1 we review the pull-back construction via divergence functions presented in [7]. In Section 3, we consider the special case of exponential families associated with MEP with nonlinear constraints.

Synopsis of Symplectic Geometry
We briefly recall the basic facts of symplectic geometry needed for our argument, referring to classical textbooks for the proofs. A symplectic manifold $(M, \omega)$ is a smooth even-dimensional manifold $M$ equipped with a non-degenerate, closed two-form $\omega$ ($d\omega = 0$, where $d$ is the exterior derivative). A submanifold $L$ of $M$ is a Lagrangian submanifold if $2 \dim L = \dim M$ and the two-form restricted to $L$ vanishes, $\omega|_L = 0$. A prototypical example of a symplectic manifold is the cotangent bundle $T^*S$ of a manifold $S$. If $x = (x^1, \dots, x^n)$ are local coordinates on $S$, and $(x, \lambda)$ are local coordinates on $T^*S$, then the Liouville one-form $\theta_c$ on $T^*S$ has the local expression $\theta_c = \lambda_i\, dx^i$ (summation over repeated indices is understood) and the symplectic two-form is
$$\omega = d\theta_c = d\lambda_i \wedge dx^i. \qquad (1)$$
A classical theorem of Darboux says that every symplectic manifold $(M, \omega)$ admits an atlas of local coordinates $(x, \lambda)$ such that locally $\omega$ has the representation (1). A relevant example of a Lagrangian submanifold of $T^*S$ is the graph of the differential of a function $g : S \to \mathbb{R}$, that is,
$$L_g = \{ (x, \lambda) \in T^*S : \lambda_i = \partial g/\partial x^i (x) \}.$$
Note that $L_g$ is an $n$-dimensional submanifold which is transversal to the fibers of the fibration $\pi : T^*S \to S$, that is, its tangent bundle $TL_g$ is transversal to the vertical bundle $\ker T\pi$.
According to a theorem of Maslov and Hörmander (see [17]), a general (i.e., not necessarily transversal) Lagrangian submanifold of $T^*S$ can be locally described by means of a smooth function $G$ depending on extra parameters. Let us briefly sketch this construction along the lines of [18,19].
Let $U$ be a $k$-dimensional manifold, called the supplementary manifold, and let $G : S \times U \to \mathbb{R}$ be a smooth function whose representation in a local chart is $G(x, u)$. We use the notation $(G_x)_i = \partial G/\partial x^i$ and $(G_{xy})_{ij} = \partial^2 G/\partial x^i \partial y^j$ for partial derivatives. We define the critical set of $G$ as
$$E = \{ (x, u) \in S \times U : G_u(x, u) = 0 \}. \qquad (2)$$
If $dG_u$ has maximal rank over $E$, that is,
$$\operatorname{rk} \begin{pmatrix} G_{xu} \\ G_{uu} \end{pmatrix} = k \quad \text{for all } (x, u) \in E, \qquad (3)$$
then $G$ is called a Morse family and the following $\Lambda_G$ is a Lagrangian submanifold of $T^*S$,
$$\Lambda_G = \{ (x, \lambda) \in T^*S : \lambda = G_x(x, u), \; G_u(x, u) = 0 \}. \qquad (4)$$
If there are no extra parameters ($k = 0$), then $\Lambda_G$ is the graph of a differential and is thus a transversal submanifold. Note that the rank condition (3) is satisfied if the square submatrix $G_{uu}$ has maximal rank, i.e., $\det G_{uu} \neq 0$ on $E$. In this case, by the implicit function theorem there exists a locally defined function $u = u(x)$ such that $E$ is the graph of $u$, and setting $\hat G(x) = G(x, u(x))$ we have that
$$\hat G_x(x) = G_x(x, u(x)) + G_u(x, u(x))\, u_x(x) = G_x(x, u(x)).$$
Therefore, where $\det G_{uu} \neq 0$ on $E$, all the parameters $u$ can be eliminated and $\Lambda_{\hat G}$ is locally transversal to the fibers. The set of points of $S$ where $\det G_{uu}(x, u) = 0$ for $(x, u) \in E$ is called the caustic of $\Lambda_G$. These are the points where the Lagrangian submanifold is tangent to the fibers of $\pi : T^*S \to S$ and transversality is lost.
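The Morse family construction can be illustrated on a toy example. The sketch below assumes $S = \mathbb{R}$, one extra parameter, and the hypothetical generating function $G(x, u) = u^3/3 - x u$, which is not taken from the paper. Its critical set is $x = u^2$, the generated Lagrangian submanifold is the parabola $x = \lambda^2$ in $T^*\mathbb{R}$, and the caustic is the single point $x = 0$ where the parabola is tangent to the fiber.

```python
# Toy Morse family on S = R with one extra parameter u (k = 1).
# G(x, u) = u^3/3 - x*u is a hypothetical choice made only for this sketch.

def G_x(x, u):
    """dG/dx."""
    return -u

def G_u(x, u):
    """dG/du; its zero set is the critical set E = {x = u^2}."""
    return u * u - x

def G_xu(x, u):
    """Mixed second derivative d^2G/dx du."""
    return -1.0

def G_uu(x, u):
    """Second derivative d^2G/du^2."""
    return 2.0 * u

def lagrangian_points(us):
    """Points (x, lam) of Lambda_G, parametrised by u along the critical set."""
    pts = []
    for u in us:
        x = u * u                      # solves G_u(x, u) = 0
        pts.append((x, G_x(x, u)))     # lam = G_x(x, u)
    return pts

def rank_condition_holds(x, u):
    # For k = 1 the stacked matrix (G_xu, G_uu) has rank 1 iff it is non-zero.
    return (G_xu(x, u), G_uu(x, u)) != (0.0, 0.0)

us = [i / 10.0 for i in range(-20, 21)]
points = lagrangian_points(us)

# Lambda_G is the parabola x = lam^2 in the (x, lam) plane.
on_parabola = all(abs(x - lam * lam) < 1e-12 for x, lam in points)
# The rank condition (3) holds everywhere on E, so G is a Morse family.
morse = all(rank_condition_holds(u * u, u) for u in us)
# Caustic: base points of E where G_uu vanishes.
caustic = [u * u for u in us if G_uu(u * u, u) == 0.0]
```

Away from $x = 0$ the parameter can be eliminated as $u = \pm\sqrt{x}$, giving the two transversal branches $\lambda = \mp\sqrt{x}$; at the caustic the two branches meet and no single-valued $\hat G$ exists.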

Symplectic Structures Defined by Divergence Functions
Given a smooth $n$-dimensional manifold $M$, let us denote by $M^2 = M \times M$ the square of $M$ and by $\Delta_M \subset M^2$ the diagonal of $M^2$. We will use local coordinates $x = (x^1, \dots, x^n)$ on $M$ and $(x, y) = (x^1, \dots, x^n, y^1, \dots, y^n)$ on $M^2$.
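Let $D : M^2 \to [0, +\infty)$ be a smooth non-negative function whose representation in a local chart is $D(x, y) \ge 0$. We use the notations $(D_x)_i = \partial D/\partial x^i$, $(D_y)_i = \partial D/\partial y^i$, $(D_{xx})_{ij} = \partial^2 D/\partial x^i \partial x^j$, $(D_{xy})_{ij} = \partial^2 D/\partial x^i \partial y^j$, and $(D_{yy})_{ij} = \partial^2 D/\partial y^i \partial y^j$ for first and second order derivatives of $D$. The function $D$ is a yoke (see [7]) if the following conditions hold and $D$ is a divergence (see [8]) if (iii) below holds on the whole $M^2$.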
thus points of $\Delta_M$ are minima of $D$. A divergence function acts as a pseudo-distance, but it satisfies neither the symmetry condition nor the triangle inequality. In [7], the following fibered map $F_D : M^2 \to T^*M$ over $M$ is considered, whose representation in a local chart is
$$F_D(x, y) = (x, \lambda), \qquad \lambda = D_x(x, y). \qquad (5)$$
By condition (iii) above there exists a neighborhood $W$ of $\Delta_M$ where $F_D$ has a smooth inverse $F_D^{-1} : F_D(W) \to W$. Using the local diffeomorphism $F_D$, a symplectic structure $(W, \omega_D)$ is defined in [7] via the pull-back $\omega_D = F_D^* \omega$ of the canonical two-form (1) on $T^*M$. The local form of $\omega_D$ can be computed as follows. With $\omega = d(\lambda_i\, dx^i)$ and $\lambda_i = (D_x)_i$,
$$\omega_D = d\big( (D_x)_i\, dx^i \big) = (D_{xx})_{ij}\, dx^j \wedge dx^i + (D_{xy})_{ij}\, dy^j \wedge dx^i,$$
and the first term vanishes because $D_{xx}$ is symmetric, thus (see Section 3.2 in [7])
$$\omega_D = (D_{xy})_{ij}\, dy^j \wedge dx^i = -(D_{xy})_{ij}\, dx^i \wedge dy^j. \qquad (6)$$
For the applications of the above theory that we have in mind, we will assume in (iii) above that $-D_{yx}$ is positive definite on the whole $M^2$, so that $F_D$ is a global diffeomorphism. Simple examples of Lagrangian submanifolds of $M^2$ with respect to $\omega_D$ are (with a slight abuse of notation) the $n$-dimensional submanifolds $M_x = M \times \{y\} \approx M$, on which $dy = 0$ and which are also transversal to the fibers of $\pi_1 : M^2 \to M$, $\pi_1(x, y) = x$. Note also that (6) implies that $F_D$ is a symplectomorphism, thus $L = F_D^{-1}(\Lambda)$ is a Lagrangian submanifold of $M^2$ whenever $\Lambda \subset T^*M$ is a Lagrangian submanifold. In this paper, we will be mainly concerned with the study of Lagrangian submanifolds of $M^2$ defined in this way.
In the following Section 2.2, we will compute the above introduced objects for the relevant case of exponential families of probability distributions and canonical divergence.

Canonical Divergence and Exponential Families
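In this section, we recall the basic definitions of exponential family and canonical divergence, as described, e.g., in [1,2]. Let $(X, \mathcal{B}, dx)$ be a measure space, where $X$ may be a discrete set or $X = \mathbb{R}^k$. We stipulate that, when $X$ is a discrete set, the integrals over $X$ with respect to the measure $dx$ are replaced by summations. Let $h : X \to \mathbb{R}^n$ and $k : X \to \mathbb{R}$ be given functions with $q = e^{k} \in P(X)$, and for $\theta \in \mathbb{R}^n$ let
$$\psi(\theta) = \ln \int_X e^{\theta \cdot h(x) + k(x)}\, dx. \qquad (8)$$
The $n$ real numbers $\theta_i$ are called canonical parameters. They uniquely define a probability distribution $p(\cdot\,; \theta)$ which belongs to the exponential family defined by $h, k$,
$$M(h, k) = \{ p(\cdot\,; \theta) : p(x; \theta) = e^{\theta \cdot h(x) + k(x) - \psi(\theta)} \}. \qquad (9)$$
The relevant fact is that $M(h, k)$ is an $n$-dimensional submanifold of the infinite-dimensional set $P(X)$ and that the canonical parameters $\theta$ are local coordinates. Note that $q \in M(h, k)$, as $\psi(0) = 0$ and $q(x) = p(x; 0)$. Another system of local coordinates is provided by the so-called expectation parameters, defined by
$$\eta = E_{p(\cdot; \theta)}[h] = \int_X h(x)\, p(x; \theta)\, dx = \psi_\theta(\theta).$$
As $\psi$ is a convex function, the gradient map $\psi_\theta(\theta) = \eta$ is globally invertible with inverse $\theta = \hat\theta(\eta)$, which is also a gradient map, $\hat\theta(\eta) = \varphi_\eta(\eta)$, where
$$\varphi(\eta) = \hat\theta(\eta) \cdot \eta - \psi(\hat\theta(\eta)) \qquad (10)$$
is the Legendre transform of $\psi$ (see, e.g., [1]). We will denote with $p(x; \eta)$ the point in $M(h, k)$ associated to $\eta$. The Kullback-Leibler divergence is defined for general $(p, \hat p)$ in $P(X)^2$ as
$$D_{KL}(p, \hat p) = \int_X p(x) \ln \frac{p(x)}{\hat p(x)}\, dx.$$
It can be shown (see [1]) that when the two points of $M(h, k)$ are referred to the coordinates $(\eta, \theta)$, $D_{KL}$ has the local representation
$$D(\eta, \theta) = D_{KL}(p(\cdot\,; \eta), p(\cdot\,; \theta)) = \varphi(\eta) + \psi(\theta) - \eta \cdot \theta. \qquad (11)$$
Note that, as $p(\cdot\,; \theta) = q$ for $\theta = 0$, we have $D(\eta, 0) = D_{KL}(p(\cdot\,; \eta), q) = \varphi(\eta)$.
Entropy 2020, 22, 983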

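A key object is the map $F_D$ introduced in (5), associated to $M(h, k)$ and the canonical divergence (11). In coordinates $(\eta, \theta)$ it has the local form, see (5) and (11),
$$F_D(\eta, \theta) = (\eta, \lambda), \qquad \lambda = D_\eta(\eta, \theta) = \hat\theta(\eta) - \theta, \qquad (12)$$
with the explicit inverse, using local coordinates $(\eta, \lambda)$ in $T^*M(h, k)$,
$$F_D^{-1}(\eta, \lambda) = (\eta, \hat\theta(\eta) - \lambda). \qquad (13)$$
A simple but elegant result of the above-introduced framework is the following.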

Proposition 1.
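Let $\Lambda_G$ be a Lagrangian submanifold of $T^*M(h, k)$ described by the Morse family $G(\eta, u)$ as in (4). Then,
$$F_D^{-1}(\Lambda_G) = \{ (\eta, \theta) \in M(h, k)^2 : \theta = \hat\theta(\eta) - G_\eta(\eta, u), \; G_u(\eta, u) = 0 \}. \qquad (14)$$
As a consequence of the above proposition, if $\Lambda_G$ is transversal to the fibers of $T^*M(h, k)$ (no extra parameters $u$), then its image in $M(h, k)^2$ is the graph of $\theta = \hat\theta(\eta) - G_\eta(\eta)$ and is therefore transversal to the fibers of $\pi_1$.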
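Another interesting consequence is that the image under $F_D^{-1}$ of the zero section of the cotangent bundle $T^*M(h, k)$ is contained in $D^{-1}(0)$, the zero-level set of the canonical divergence. Indeed, the zero section $\lambda = 0$ is mapped to the points $(\eta, \hat\theta(\eta))$, and from (10) and (11) we have that
$$D(\eta, \hat\theta(\eta)) = \varphi(\eta) + \psi(\hat\theta(\eta)) - \eta \cdot \hat\theta(\eta) = 0.$$
For later use, we compute from (7) the Hamiltonian associated to the canonical divergence, see (15). Setting for the sake of simplicity $\hat\theta(\eta) = \hat\theta$, we compute from (8) the free energy $\psi(\hat\theta - \lambda)$, see (16). Using (15) and (16), the Hamiltonian can be written using relation (10) in the form (17). It is interesting to investigate in more detail the structure of the Lagrangian submanifold $L_S = F_D^{-1}(\Lambda_G) \subset M(h, k)^2$ by studying the form of the two probability distributions in $L_S$ associated to the coordinates $\eta$ and $\hat\theta - \lambda$, respectively. We compute from (9)
$$p(x; \eta) = e^{\hat\theta \cdot h(x) + k(x) - \psi(\hat\theta)}$$
and, using (17),
$$p(x; \hat\theta - \lambda) = e^{(\hat\theta - \lambda) \cdot h(x) + k(x) - \psi(\hat\theta - \lambda)} = \frac{p(x; \eta)\, e^{-\lambda \cdot h(x)}}{\int_X p(x'; \eta)\, e^{-\lambda \cdot h(x')}\, dx'}. \qquad (18)$$
Note that, setting $\hat p(x; \lambda) = e^{-\lambda \cdot h(x)}/Z(\lambda)$ with $Z(\lambda) = \int_X e^{-\lambda \cdot h(x)}\, dx$, relation (18) can be given the form
$$p(x; \hat\theta - \lambda) = \frac{p(x; \eta)\, \hat p(x; \lambda)}{\int_X p(x'; \eta)\, \hat p(x'; \lambda)\, dx'}. \qquad (19)$$
We will give an interpretation of this relation in the case of discrete probability distributions in Section 3.2 below.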
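The objects of this section can be checked numerically on a small discrete exponential family. The sketch below is only an illustration under stated assumptions: a four-point state space, one observable `h`, and a reference distribution `q`, all hypothetical values chosen for the example. It computes $\psi$, the expectation parameter, the inverse map $\hat\theta$ (by bisection), the Legendre transform $\varphi$, and verifies that $\varphi(\eta) + \psi(\theta) - \eta\theta$ agrees with the Kullback-Leibler divergence and vanishes on the image of the zero section.

```python
import math

# Hypothetical data for the sketch: X = {1, 2, 3, 4}, one observable h,
# reference distribution q_i = exp(k_i).
h = [0.0, 1.0, 2.0, 3.0]
q = [0.1, 0.2, 0.3, 0.4]

def psi(theta):
    """Free energy: log of the sum of q_i exp(theta h_i)."""
    return math.log(sum(qi * math.exp(theta * hi) for qi, hi in zip(q, h)))

def p(theta):
    """Member of the exponential family with canonical parameter theta."""
    z = math.exp(psi(theta))
    return [qi * math.exp(theta * hi) / z for qi, hi in zip(q, h)]

def eta_of(theta):
    """Expectation parameter eta = E_p[h] = psi'(theta)."""
    return sum(pi * hi for pi, hi in zip(p(theta), h))

def theta_hat(eta, lo=-50.0, up=50.0):
    """Inverse of the gradient map by bisection (eta_of is increasing)."""
    for _ in range(200):
        mid = 0.5 * (lo + up)
        if eta_of(mid) < eta:
            lo = mid
        else:
            up = mid
    return 0.5 * (lo + up)

def phi(eta):
    """Legendre transform of psi."""
    t = theta_hat(eta)
    return t * eta - psi(t)

def D(eta, theta):
    """Canonical divergence in mixed coordinates (eta, theta)."""
    return phi(eta) + psi(theta) - eta * theta

def kl(p1, p2):
    """Kullback-Leibler divergence between two discrete distributions."""
    return sum(a * math.log(a / b) for a, b in zip(p1, p2))

def F_D(eta, theta):
    """(eta, theta) -> (eta, lam) with lam = theta_hat(eta) - theta."""
    return eta, theta_hat(eta) - theta

def F_D_inv(eta, lam):
    """(eta, lam) -> (eta, theta_hat(eta) - lam)."""
    return eta, theta_hat(eta) - lam

eta0, theta0 = 1.3, 0.7
gap_kl = abs(D(eta0, theta0) - kl(p(theta_hat(eta0)), p(theta0)))
zero_section_value = D(*F_D_inv(eta0, 0.0))
roundtrip = F_D_inv(*F_D(eta0, theta0))
```

Bisection is used here only because it needs nothing beyond the standard library; any monotone root finder would serve for the one-parameter case.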

Application to Maximum Entropy Principle with Nonlinear Constraints and Phase Transitions
A relevant application of the above-introduced framework concerns the use of the Maximum Entropy Principle with nonlinear constraints. Let us consider a physical system $X$ whose description is given in terms of a probability distribution $q \in P(X)$. The Maximum Entropy Principle (E.T. Jaynes, see [11,12]) is a general inference procedure that allows one to update an initial probability distribution $q$ on the basis of subsequent information on the system, represented by the average values $E_p[h]$ of some observables $h$ of interest for the system. The sought distribution $p$ is the one that minimizes the relative entropy $D_{KL}(p, q)$ on the set of the distributions which satisfy the constraints on $E_p[h]$. From a mathematical point of view, we are faced with a constrained extremization problem, to be solved below using the method of Lagrange multipliers.
We will see that the set of solutions for different values of the constraints defines a Lagrangian submanifold of the cotangent bundle of a manifold $M(h, k)$. We are interested in describing the corresponding Lagrangian submanifold in $M(h, k)^2$.
This section has a pedagogical character, so for the sake of simplicity we will avoid technicalities and assume that $X = \{1, \dots, n\}$ is a discrete space and that there is only one observable of interest, defined by assigning $h = (h_1, \dots, h_n)$. The case of $k$ observables can be dealt with along the same lines with no extra effort. The case of a continuous space $X \subset \mathbb{R}^n$ presents more technical difficulties and is considered in [20].
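Let $q_i = e^{k_i} \in P(X)$ be the a priori distribution describing $X$. The Kullback-Leibler divergence is called relative entropy in this setting and has the form
$$D(p, q) = \sum_{i=1}^n p_i \ln \frac{p_i}{q_i} = \sum_{i=1}^n p_i (\ln p_i - k_i).$$
Let $f : \mathbb{R} \to \mathbb{R}$ be a smooth, globally non-invertible function (think, for example, of a cubic $f(x) = x(x^2 - a^2)$ for $a \in \mathbb{R}$, see Figure 1 below). We look for the minima of $D$ on the set of $p \in P(X)$ that satisfy a nonlinear constraint of the form $g(p) = y$ with $g : \mathbb{R}^n_+ \to \mathbb{R}$, that is,
$$g(p) = f(E_p[h]) = f\Big( \sum_i h_i p_i \Big) = y. \qquad (20)$$
The choice of this type of constraint is motivated by classical applications in statistical physics. For example, in the Ising model in the Curie-Weiss (mean field) approximation, the average energy of the spin lattice is a quadratic function of the average magnetization $E_p[s]$, see [14,15]. We have that
$$dg(p) = f'(E_p[h])\, h. \qquad (21)$$
Note that at this stage of the procedure we do not take the normalization constraint into account, stipulating that we will enforce it by dividing any candidate extremum point $\tilde p$ by $\sum_i \tilde p_i$. After introducing the Lagrange function
$$G(y, p, \lambda) = D(p, q) + \lambda\, (y - g(p)), \qquad (22)$$
where $\lambda$ is the Lagrange multiplier associated to the constraint (20), we see that the candidate extrema are the solutions $(p, \lambda)$, for given $y$, of (here $i = 1, \dots, n$)
$$G_{p_i}(y, p, \lambda) = 0, \qquad G_\lambda(y, p, \lambda) = 0, \qquad (23)$$
that is, setting $q_i = e^{k_i}$,
$$\ln \frac{p_i}{q_i} + 1 - \lambda f'(E_p[h])\, h_i = 0, \qquad f(E_p[h]) = y, \qquad (24)$$
so that we have to face a transcendental equation for the unnormalized probability. After normalization, (24)$_1$ becomes
$$p_i = \frac{q_i\, e^{\lambda f'(E_p[h]) h_i}}{\sum_j q_j\, e^{\lambda f'(E_p[h]) h_j}}. \qquad (25)$$
Let us denote with $f^{\leftarrow}(y) \subset \mathbb{R}$ the set of pre-images of $y$ along $f$ (see, e.g., Figure 1 below),
$$f^{\leftarrow}(y) = \{ \eta_\alpha \in \mathbb{R} : f(\eta_\alpha) = y, \; \alpha = 1, \dots, A(y) \},$$
where we have supposed that, for every $y$, $f^{\leftarrow}(y)$ is a finite set of cardinality $A(y) < +\infty$. The crux is that we can substitute the constraint $f(E_p[h]) = y$ in (24)$_2$ with the following equivalent one,
$$E_p[h] = \eta_\alpha \quad \text{for some } \eta_\alpha \in f^{\leftarrow}(y), \qquad (26)$$
therefore we can describe the (possibly non-unique) solution (25) of the extremum problem (23) as
$$p^\alpha_i = \frac{q_i\, e^{\lambda f'(\eta_\alpha) h_i}}{\sum_j q_j\, e^{\lambda f'(\eta_\alpha) h_j}}, \qquad (27)$$
where $\alpha = 1, \dots, A(y)$, showing that the candidate solution belongs to an exponential family $M(h, k)$.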
Note that in Information Geometry, the critical points of the MEP extremum problem are computed as geodesic projections onto a submanifold which is an exponential family, and multiplicity of solutions is related to the non-uniqueness of the geodesic projection, see [1,15].
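Note that, where $f'(\eta_\alpha) \neq 0$, setting $\lambda f'(\eta_\alpha) = \theta_\alpha$ the solution (27) can be given the standard form (see [1,14]) of the MEP solution with linear constraint $E_p[h] = \eta_\alpha$, hence (25) becomes
$$p^\alpha_i = e^{\theta_\alpha h_i + k_i - \psi(\theta_\alpha)}, \qquad \psi(\theta_\alpha) = \ln \sum_j e^{\theta_\alpha h_j + k_j}. \qquad (28)$$
The multipliers $\theta_\alpha = \hat\theta(\eta_\alpha(y))$, $\alpha = 1, \dots, A(y)$, are uniquely determined (see (10)) by the equation
$$\psi_\theta(\theta) = \eta \qquad (29)$$
for $\eta = \eta_\alpha(y)$, and accordingly we can compute the multipliers $\lambda$ as
$$\lambda(\eta_\alpha(y)) = \frac{\hat\theta(\eta_\alpha(y))}{f'(\eta_\alpha(y))}. \qquad (30)$$
Note that the solution (28) to our constrained extremization problem has the form of a curved exponential family (see [1]) with respect to the discrete parameter $\alpha$. We will see in the next Section 3.1 that the framework of Lagrangian submanifolds is useful for describing the global picture of the solutions in the case of multiple solutions.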
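The multiplicity of solutions can be made concrete with a small numerical experiment. The sketch below assumes a three-state system with observable values `h = (-1, 0, 1)`, a uniform prior, and the cubic constraint $f(\eta) = \eta(\eta^2 - a^2)$ with $a = 0.5$; these values are hypothetical and chosen only for illustration. For a given $y$ it locates the pre-images $\eta_\alpha$ by a sign-change scan, solves $E_p[h] = \eta_\alpha$ for $\theta_\alpha$ by bisection, and recovers $\lambda_\alpha = \theta_\alpha / f'(\eta_\alpha)$.

```python
import math

# Hypothetical toy data: three states, uniform prior, cubic constraint.
h = [-1.0, 0.0, 1.0]
q = [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]
a = 0.5

def f(eta):
    return eta * (eta * eta - a * a)

def f_prime(eta):
    return 3.0 * eta * eta - a * a

def p_of(theta):
    """Exponential-family member p_i = q_i exp(theta h_i) / Z."""
    w = [qi * math.exp(theta * hi) for qi, hi in zip(q, h)]
    z = sum(w)
    return [wi / z for wi in w]

def mean_h(dist):
    return sum(pi * hi for pi, hi in zip(dist, h))

def bisect(func, lo, up, iters=200):
    """Root of func on [lo, up], assuming a sign change between the endpoints."""
    f_lo = func(lo)
    for _ in range(iters):
        mid = 0.5 * (lo + up)
        f_mid = func(mid)
        if (f_mid > 0.0) == (f_lo > 0.0):
            lo, f_lo = mid, f_mid
        else:
            up = mid
    return 0.5 * (lo + up)

def preimages(y, lo=-0.99, up=0.99, steps=198):
    """All eta in (lo, up) with f(eta) = y, located by a sign-change scan."""
    grid = [lo + (up - lo) * i / steps for i in range(steps + 1)]
    roots = []
    for x0, x1 in zip(grid, grid[1:]):
        if (f(x0) - y) * (f(x1) - y) < 0.0:
            roots.append(bisect(lambda x: f(x) - y, x0, x1))
    return roots

def mep_solutions(y):
    """One tuple (eta, theta, lam, distribution) per pre-image of y."""
    sols = []
    for eta_a in preimages(y):
        theta_a = bisect(lambda t: mean_h(p_of(t)) - eta_a, -50.0, 50.0)
        lam_a = theta_a / f_prime(eta_a)   # valid only where f'(eta_a) != 0
        sols.append((eta_a, theta_a, lam_a, p_of(theta_a)))
    return sols

y0 = 0.02
solutions = mep_solutions(y0)
for eta_a, theta_a, lam_a, dist in solutions:
    print(f"eta={eta_a:+.4f} theta={theta_a:+.4f} lambda={lam_a:+.4f}")
```

For this choice of $y$ the cubic has three pre-images inside $(-1, 1)$, so three distinct candidate distributions satisfy the same nonlinear constraint. The scan-and-bisect root finder is a convenience for the sketch and would miss pre-images at which $f - y$ touches zero without changing sign, which is exactly the caustic situation $f'(\eta_\alpha) = 0$.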

The Global Picture via Lagrange Submanifold
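If we set $(p, \lambda) = u$ in the Lagrange function (22), $G(y, u) = D(p, q) + \lambda\, (y - g(p))$, we see that the set of points $(y, u)$ satisfying the first-order necessary condition (23) for an unconstrained extremum is the critical set
$$E = \{ (y, u) : G_u(y, u) = 0 \}.$$
We can check whether the Lagrange function $G(y, u)$ defines a Morse family using the rank condition (3),
$$\operatorname{rk} \begin{pmatrix} G_{yu} \\ G_{uu} \end{pmatrix} = n + 1 \quad \text{for all } (y, u) \in E,$$
where in this case
$$G_{yu} = (0, \dots, 0, 1), \qquad G_{uu} = \begin{pmatrix} G_{pp} & -dg^T \\ -dg & 0 \end{pmatrix}, \qquad (31)$$
and $G_{pp}$ is the $n$-dimensional Hessian matrix (here $\delta_{ij}$ is the Kronecker symbol)
$$(G_{pp})_{ij} = \frac{\delta_{ij}}{p_i} - \lambda f''(E_p[h])\, h_i h_j. \qquad (32)$$
If $G(y, u)$ is a Morse family, then by the Maslov-Hörmander theorem
$$\Lambda_G = \{ (y, \mu) \in T^*\mathbb{R} : \mu = G_y(y, u) = \lambda, \; G_u(y, u) = 0 \} \qquad (33)$$
is a Lagrangian submanifold of $T^*\mathbb{R}$. We claim that (33) provides a global description of the set of solutions (28). We have seen in Section 2 that a sufficient condition for the elimination of all extra parameters $u$ is that $G_{uu}$ has maximal rank for all $(y, u) \in E$. A criterion for this is given by the following classical result in constrained optimization theory, here adapted to our notation, which expresses the second-order sufficient condition for maxima or minima (see [14,21] for the proof).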

Proposition 2.
If the symmetric matrix G pp in (32) is (positive or negative) definite on ker dg for (y, u) ∈ E , then the square matrix G uu in (31) has maximal rank.
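From (21), we have that for $(y, u) \in E$
$$\ker dg(p) = \{ v \in \mathbb{R}^n : f'(\eta_\alpha)\, h \cdot v = 0 \}$$
and from (32) that
$$G_{pp}\, v \cdot v = \sum_i \frac{v_i^2}{p_i} - \lambda f''(\eta_\alpha)\, (h \cdot v)^2.$$
It is straightforward to derive from the above relations that the two cases below hold. If $f'(\eta_\alpha) \neq 0$, then $\ker dg = \{ v : h \cdot v = 0 \}$ and on this subspace $G_{pp}\, v \cdot v = \sum_i v_i^2/p_i > 0$ for $v \neq 0$, so $G_{pp}$ is positive definite on $\ker dg$ and, by Proposition 2, $\det G_{uu} \neq 0$. If $f'(\eta_\alpha) = 0$, then $\ker dg = \mathbb{R}^n$ and $dg(p) = 0$. Therefore, at points $(y, u) \in E$ where $f'(\eta_\alpha) \neq 0$ the Lagrangian submanifold $\Lambda_G$ in (33) is transversal. At points in $E$ where $f'(\eta_\alpha) = 0$, we have $dg = f'(\eta_\alpha) h = 0$, see (21), thus transversality is lost because (see the form of $G_{uu}$ in (31)) at these points $(y, u) \in E$ we have $\det G_{uu}(p, \lambda) = 0$.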
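We remark that the framework introduced above gives a global description of the set of solutions (28), (30) in terms of the Lagrangian submanifold locally described as
$$\Lambda_G = \{ (y, \lambda) \in T^*\mathbb{R} : \lambda = \lambda(\eta_\alpha(y)), \; \alpha = 1, \dots, A(y) \}, \qquad (34)$$
where $\lambda(\eta_\alpha(y))$ is given by (30). If we consider $f : E \subset \mathbb{R}_\eta \to \mathbb{R}_y$, $y = f(\eta)$, as a local change of coordinates on $M(h, k)$ (since $f$ is locally invertible where $f'(\eta) \neq 0$), it is easy to prove that
$$\Lambda_f = \{ (\eta, \beta) \in T^*M(h, k) : \beta = \hat\theta_\alpha(\eta), \; \eta_\alpha \in I(\eta) \}, \qquad (35)$$
where $\hat\theta_\alpha(\eta) = \hat\theta(\eta_\alpha)$ is the multiplier in (29) associated to the constraint $E_p[h] = \eta_\alpha$ and $\eta_\alpha \in I(\eta) = f^{\leftarrow}(f(\eta))$.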

Proof.
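If $y = f(\eta)$ is the local change of coordinates in $M(h, k)$, then the tangent map $Tf : T\mathbb{R}_\eta \to T\mathbb{R}_y$ has the local form $(y, \dot y) = Tf(\eta, \dot\eta) = (f(\eta), f'(\eta)\dot\eta)$ and the cotangent map $T^*f : T^*\mathbb{R}_\eta \to T^*\mathbb{R}_y$ has the local form
$$(y, \lambda) = T^*f(\eta, \beta) = \Big( f(\eta), \frac{\beta}{f'(\eta)} \Big)$$
if we require that the Liouville one-form (see above (1)) has the same canonical form $\theta_c = \lambda\, dy = \beta\, d\eta$ in the two coordinate charts; indeed, $\lambda\, dy = \lambda f'(\eta)\, d\eta$, so that $\beta = \lambda f'(\eta)$. See, e.g., [19] for a proof of this last classical result of differential geometry.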
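We want to study the Lagrangian submanifold $\Lambda_f$ defined in (35) and its image $L_f = F_D^{-1}(\Lambda_f) \subset M(h, k)^2$, see (14), whose local expression is
$$L_f = \{ (\eta, \theta) : \theta = \hat\theta(\eta) - \hat\theta_\alpha(\eta), \; \eta_\alpha \in I(\eta) \}. \qquad (36)$$
First we consider the case where $f$ is a globally invertible function. In this case, $I(\eta) = \{\eta\}$ and $\hat\theta_\alpha(\eta) = \hat\theta(\eta) = \varphi_\eta(\eta)$. The Lagrangian submanifold $\Lambda_f$ in (35) is the graph of the differential $\varphi_\eta(\eta)$ and it is transversal, see Figure 2a. Moreover, see below (9), its image is $L_f = \{ (\eta, 0) \} = M(h, k) \times \{q\}$. If we consider a globally non-invertible function $f$ such as the one depicted in Figure 1, then $I(\eta)$ contains multiple points and $\Lambda_f$ is non-transversal at points where $f'(\eta) = 0$, see Figure 3a.
The corresponding image $L_f$ has multiple branches and it is not a manifold at the points (b, c) where transversality fails, see Figure 3b.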

Probability Distributions in L f
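In this section, we study the structure of the probability distributions in $L_f$. In the local coordinate systems $(\eta, \theta)$ of $M(h, k)^2$, $\eta$ and $\hat\theta(\eta)$ describe the same probability distribution, which we write for brevity as $p_i(\eta) = p_i(\hat\theta)$. Therefore, the probability distributions in $L_f$ in (36) associated to $\eta$ and to $\hat\theta(\eta) - \hat\theta(\eta_\alpha)$ are
$$p_i(\eta) = e^{\hat\theta(\eta) h_i + k_i - \psi(\hat\theta(\eta))} \qquad (37)$$
and, see (18),
$$p_i(\hat\theta(\eta) - \hat\theta(\eta_\alpha)) = \frac{p_i(\eta)\, e^{-\hat\theta(\eta_\alpha) h_i}}{\sum_j p_j(\eta)\, e^{-\hat\theta(\eta_\alpha) h_j}}. \qquad (38)$$
Setting $\hat p_i(\eta_\alpha) = e^{-\hat\theta(\eta_\alpha) h_i}/Z(\lambda)$, $Z(\lambda) = \sum_i e^{-\hat\theta(\eta_\alpha) h_i}$, the above (38) can be rewritten as the discrete version of (19), that is,
$$p_i(\hat\theta(\eta) - \hat\theta(\eta_\alpha)) = \frac{p_i\, \hat p_i}{\sum_j p_j\, \hat p_j}. \qquad (39)$$
This last formula can be interpreted as follows: let $A$ and $B$ be two independent random variables $A, B : \Omega \to X$, where $X = \{1, \dots, n\}$ is the discrete state space, described by the probability distributions $p_i$ and $\hat p_i$, respectively (for example, $A$ and $B$ describe two dice with $n$ faces). Then, $\sum_i p_i \hat p_i$ is the probability that $A$ and $B$ are found in the same state and
$$\mathrm{Prob}(A = i, B = i \mid A = B) = \frac{p_i\, \hat p_i}{\sum_j p_j\, \hat p_j}$$
in (39) is the conditional probability that $A$ and $B$ are both found in the state $i$, provided that they are found in the same state. Note that for $p_i(\eta)$ in (37) we have $e^{k_i} = q_i$, thus (37) can be rewritten as
$$p_i(\eta) = \frac{q_i\, e^{\hat\theta(\eta) h_i}}{\sum_j q_j\, e^{\hat\theta(\eta) h_j}},$$
which has the same conditional-probability form, and (39) above is equal to
$$\frac{q_i\, e^{\hat\theta(\eta) h_i}\, e^{-\hat\theta(\eta_\alpha) h_i}}{\sum_j q_j\, e^{\hat\theta(\eta) h_j}\, e^{-\hat\theta(\eta_\alpha) h_j}} = \mathrm{Prob}(A = B = C = i \mid A = B = C),$$
where $A, B, C$ are three independent random variables described by $q_i$ and by the normalized distributions proportional to $e^{\hat\theta(\eta) h_i}$ and $e^{-\hat\theta(\eta_\alpha) h_i}$, respectively.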
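The conditional-probability reading of the normalized product $p_i \hat p_i / \sum_j p_j \hat p_j$ can be checked by direct enumeration. The sketch below uses two hypothetical three-sided dice with arbitrarily chosen rational probabilities (not taken from the paper) and exact arithmetic: it builds the joint law of two independent variables, conditions on the event $A = B$, and compares the result with the closed-form expression.

```python
from fractions import Fraction
from itertools import product

# Hypothetical laws of two independent three-sided dice A and B.
p = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
p_hat = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]
states = range(len(p))

# Joint law of (A, B) under independence.
joint = {(i, j): p[i] * p_hat[j] for i, j in product(states, repeat=2)}

# Probability that A and B are found in the same state.
prob_same = sum(pr for (i, j), pr in joint.items() if i == j)

# Conditional law of the common state, obtained by enumeration.
conditional = [joint[(i, i)] / prob_same for i in states]

# Closed-form expression: normalized pointwise product of the two laws.
norm = sum(p[j] * p_hat[j] for j in states)
formula = [p[i] * p_hat[i] / norm for i in states]

print(prob_same, conditional)
```

Using `Fraction` keeps the comparison exact, so the agreement between the enumerated conditional law and the normalized product does not depend on floating-point tolerance.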

Discussion
The canonical coordinates $\eta$ and $\theta$ associated to an exponential family $M(h, k)$ are dually flat coordinates with respect to the duality defined by the canonical divergence. With respect to these coordinates, a generalization of the Pythagorean theorem is proved in Information Geometry, which provides a generalized formulation of the Maximum Entropy Principle with linear constraints as a geodesic projection problem (see [2]). Multiplicity of the solutions $\hat\theta(\eta)$ of the Maximum Entropy problem is due to the non-uniqueness of the projection. In this paper, we have shown that the set of pairs $(\eta, \hat\theta(\eta))$ defines a transversal Lagrangian submanifold $\Lambda$ of $T^*M(h, k)$, and we have seen with an example that, if nonlinear constraints are considered, the set of possible multiple solutions to the Maximum Entropy problem is globally described by a folded (i.e., possibly non-transversal) Lagrangian submanifold $\Lambda_f$. We have computed their pull-back to the square manifold $M(h, k)^2$ via the map $F_D^{-1}$. We think that this framework offers a complementary view to the generalized Pythagorean theorem. We plan to address in a subsequent paper a generalization of the theory presented here to a more general form of nonlinear constraint.