Abstract
Divergence functions play a relevant role in Information Geometry, as they allow for the introduction of a Riemannian metric and a dual connection structure on a finite dimensional manifold of probability distributions. They also allow one to define, in a canonical way, a symplectic structure on the square of the above manifold of probability distributions, a property that received little attention in the literature until recent contributions. In this paper, we hint at a possible application: we study Lagrangian submanifolds of this symplectic structure and show that they are useful for describing the manifold of solutions of the Maximum Entropy principle.
1. Introduction
Information Geometry [1,2] provides a sound and fruitful framework for interpreting statistics using classical differential geometry notions [3]. A principal object in Information Geometry is the notion of contrast or divergence function, which (informally speaking) measures the degree of separation between two probability distributions [4,5,6]. The main thrust of divergence functions is that they allow one to define a Riemannian structure on a finite dimensional submanifold $M$ of probability distributions endowed with a dual coordinate system, with far reaching implications. A less-studied spin-off of contrast functions is the possibility of introducing a symplectic structure on the square of $M$ by the pull-back of the canonical symplectic structure defined on the cotangent bundle $T^*M$. This procedure was introduced in 1995 in the pioneering paper [7], suggesting that symplectic geometry may have a natural role to play in statistics. In recent times there has been a renewed interest in possible applications of the symplectic structures introduced in [7], for example, to studying the analogies with discrete Lagrangian mechanics (see [8]) or the relations with completely integrable systems of Hamiltonian mechanics (see [9,10]).
In this paper, we look at a possible role for Lagrangian submanifolds of the above-discussed symplectic structure on $M\times M$ in the case that $M$ is an exponential family. Exponential families are prototypical examples of finite dimensional manifolds admitting a dually flat canonical structure defined by the canonical divergence, and they play a relevant role in information geometry and statistics [1,2]. For our argument, their importance is due to the fact that they represent the manifold of solutions of the variational problem associated to the Maximum Entropy Principle (MEP) with linear constraints [11,12]. In some applications to statistical mechanics, e.g., in the description of phase transitions in Ising spin systems, MEP with nonlinear constraints is considered, see, e.g., [13,14,15]. In this case, the set of possible solutions has a richer structure, which is well captured by a Lagrangian submanifold of $T^*M$. In this work, we are concerned with the Lagrangian submanifolds defined in the square of $M$ via the canonical pull-back hinted at above.
The structure of the paper is as follows. In Section 2, we recall the needed tools of symplectic geometry, and in Section 2.1 we review the pull-back construction via divergence functions presented in [7]. In Section 3, we consider the special case of exponential families associated with MEP with nonlinear constraints.
2. Synopsis of Symplectic Geometry
We briefly recall the basic facts of symplectic geometry that are necessary for introducing our argument, referring to classical textbooks for the proofs of the results. A symplectic manifold $(M,\omega)$ is a smooth even-dimensional manifold $M$ equipped with a non-degenerate, closed two-form $\omega$ ($d\omega = 0$, where $d$ is the exterior derivative). A submanifold $L$ of $M$ is a Lagrangian submanifold if $\dim L = \frac{1}{2}\dim M$ and the two-form restricted to $L$ vanishes, $\omega|_L = 0$. A prototypical example of a symplectic manifold is the cotangent bundle $T^*S$ of a manifold $S$. If $x = (x^1,\dots,x^n)$ are local coordinates on $S$, and $(x,p)$ are local coordinates on $T^*S$, then the Liouville one-form $\vartheta$ on $T^*S$ has the local expression $\vartheta = p_i\,dx^i$ (summation over repeated indices is understood) and the symplectic two-form is
$$\omega = d\vartheta = dp_i\wedge dx^i. \tag{1}$$
A classical theorem of Darboux says that every symplectic manifold admits an atlas of local coordinates $(x,p)$ such that locally $\omega$ has the representation (1). A relevant example of a Lagrangian submanifold of $T^*S$ is the graph of the differential of a function $F: S\to\mathbb{R}$, that is,
$$L_F = \{(x,p)\in T^*S : p = dF(x)\}.$$
Note that $L_F$ is an n-dimensional submanifold which is transversal to the fibers of the fibration $\pi: T^*S\to S$, that is, its tangent bundle $TL_F$ is transversal to the vertical bundle $\ker T\pi$.
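As a quick check of this example, the following short derivation verifies that the graph of a differential is Lagrangian. The symbols $F$ (a smooth function on $S$), $L_F$ (its graph) and $\vartheta = p_i\,dx^i$ (the Liouville one-form, with $\omega = d\vartheta$) are notation assumed here for illustration.

```latex
\begin{aligned}
\vartheta|_{L_F} &= \frac{\partial F}{\partial x^i}\,dx^i = dF, \\
\omega|_{L_F} &= d\bigl(\vartheta|_{L_F}\bigr) = d(dF) = 0, \\
\dim L_F &= n = \tfrac{1}{2}\dim T^*S .
\end{aligned}
```

Both defining properties of a Lagrangian submanifold therefore hold, and transversality follows because $L_F$ is a graph over the base $S$.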
According to a theorem of Maslov–Hörmander ([16,17]), a general (i.e., not necessarily transversal) Lagrangian submanifold of $T^*S$ can be locally described by means of a smooth function G depending on extra parameters. Let us briefly sketch this construction along the lines of the works in [18,19].
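Let $U$ be a k-dimensional manifold, called the supplementary manifold, and let $G: S\times U\to\mathbb{R}$ be a smooth function whose representation in a local chart is $G(x,u)$. We define the critical set of $G$ as (we use the notation $G_x$ and $G_u$ for partial derivatives)
$$C_G = \{(x,u)\in S\times U : G_u(x,u) = 0\}.$$
If $G_u$ has maximal rank over $C_G$, that is,
$$\operatorname{rk}\,[\,G_{ux}\;\;G_{uu}\,] = k \quad\text{on } C_G, \tag{3}$$
then $G$ is called a Morse family and the following is a Lagrangian submanifold of $T^*S$,
$$L_G = \{(x,p)\in T^*S : p = G_x(x,u),\ G_u(x,u) = 0\}. \tag{4}$$
If there are no extra parameters $u$, then $L_G$ is the graph of a differential and thus is a transversal submanifold. Note that the above rank condition (3) is satisfied if the square submatrix $G_{uu}$ has maximal rank, i.e., $\det G_{uu}\neq 0$ on $C_G$. In this case, by the implicit function theorem there exists a locally defined function $\tilde u(x)$ such that $C_G$ is the graph of $\tilde u$, and setting $\tilde G(x) = G(x,\tilde u(x))$ we have that
$$\tilde G_x = G_x + G_u\,\tilde u_x = G_x \quad\text{on } C_G.$$
Therefore, where $\det G_{uu}\neq 0$ on $C_G$, all the parameters $u$ can be eliminated and $L_G$ is locally transversal to the fibers. The set of points of $S$ where $\det G_{uu} = 0$ for $(x,u)\in C_G$ is called the caustic of $L_G$. These are the points where the Lagrangian submanifold is tangent to the fibers of $\pi: T^*S\to S$ and transversality is lost.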
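The caustic can be made concrete with a one-parameter toy example. The sketch below uses the hypothetical Morse family $G(x,u) = u^3/3 - x\,u$ on $S = U = \mathbb{R}$ (not taken from the paper), assumes the convention $p = G_x$, $G_u = 0$ for the generated submanifold, and writes the derivatives out by hand; all function names are illustrative.

```python
# Toy Morse family G(x, u) = u**3 / 3 - x * u with one extra parameter u.
# Hypothetical example chosen for illustration; derivatives are hand-coded.

def G_u(x, u):
    """Partial derivative of G with respect to the extra parameter u."""
    return u**2 - x

def G_x(x, u):
    """Partial derivative of G with respect to the base coordinate x."""
    return -u

def G_uu(x, u):
    return 2.0 * u

def G_ux(x, u):
    return -1.0

def lagrangian_points(us):
    """Points (x, p) generated by G: solve G_u = 0 for x, then set p = G_x."""
    points = []
    for u in us:
        x = u**2          # G_u(x, u) = 0  <=>  x = u**2
        p = G_x(x, u)     # fiber coordinate on the critical set
        points.append((x, p))
    return points

def is_morse(x, u):
    """Rank condition: the 1 x 2 matrix [G_ux, G_uu] has rank 1."""
    return G_ux(x, u) != 0.0 or G_uu(x, u) != 0.0

def on_caustic(x, u, tol=1e-12):
    """True on the critical set at parameters where G_uu vanishes."""
    return abs(G_u(x, u)) < tol and abs(G_uu(x, u)) < tol

us = [-1.0, -0.5, 0.0, 0.5, 1.0]
pts = lagrangian_points(us)
for u, (x, p) in zip(us, pts):
    print(f"u={u:+.1f}  x={x:.2f}  p={p:+.2f}  caustic={on_caustic(x, u)}")
```

The generated curve is the parabola $x = p^2$: over each $x > 0$ there are two branches, and the only caustic point is $x = 0$, where the curve is tangent to the fiber. The rank condition holds everywhere because $G_{ux} = -1$ never vanishes.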
2.1. Symplectic Structures Defined by Divergence Functions
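Given a smooth n-dimensional manifold $M$, let us denote with $M\times M$ the square of $M$ and with $\Delta$ the diagonal of $M\times M$. We will use local coordinates $x$ on $M$ and $(x,x')$ on $M\times M$.
Let $D: M\times M\to\mathbb{R}$ be a smooth non-negative function whose representation in a local chart is $D(x,x')$. We use the notations
$$D_x = \frac{\partial D}{\partial x},\qquad D_{x'} = \frac{\partial D}{\partial x'},\qquad D_{xx} = \frac{\partial^2 D}{\partial x\,\partial x},\qquad D_{xx'} = \frac{\partial^2 D}{\partial x\,\partial x'},\qquad D_{x'x'} = \frac{\partial^2 D}{\partial x'\,\partial x'}$$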
for first and second order derivatives of D. The function D is a yoke (see [7] ) if the following conditions hold and D is a divergence (see [8]) if below holds on the whole .
- (i) $D(x,x') = 0$ only on $\Delta$
- (ii) $D_x = 0$ and $D_{x'} = 0$ on $\Delta$
- (iii) $-D_{xx'}$ is positive definite on $\Delta$
thus points of the diagonal $\Delta$ are minima of $D$. A divergence function acts as a pseudo-distance, but it satisfies neither the symmetry nor the triangle inequality condition. In [7], the following fibered map from $M\times M$ to $T^*M$ over $M$ is considered, whose representation in a local chart is
By condition (iii) above there exists a neighborhood $W$ of $\Delta$ where the map (5) has a smooth inverse
Using the local diffeomorphism (5), a symplectic structure on $W$ is defined in [7] via the pull-back of the canonical two-form (1) on $T^*M$. The local form of this symplectic structure can be computed as follows,
thus (see Section 3.2 in [7])
because the first term is symmetric in the indices. For the applications of the above theory that we have in mind, we will assume that the matrix in (iii) above is positive definite on the whole $M\times M$, so that the map (5) is a global diffeomorphism.
Simple examples of Lagrangian submanifolds of with respect to are (with a little abuse of notation) the n-dimensional submanifolds , which are also transversal to the fibers of , . Moreover, as , is also a Lagrangian submanifold.
Note also that (6) implies that the map (5) is a symplectomorphism; thus the preimage under (5) of a Lagrangian submanifold $L$ of $T^*M$ is a Lagrangian submanifold of $M\times M$. In this paper, we will be mainly concerned with the study of Lagrangian submanifolds of $M\times M$ defined in this way.
In the following Section 2.2, we will compute the above introduced objects for the relevant case of exponential families of probability distributions and canonical divergence.
In [7], the Hamiltonian associated to a divergence function is defined as and locally it has the form
2.2. Canonical Divergence and Exponential Families
In this section, we recall the basic definitions of exponential family and canonical divergence, as described, e.g., in [1,2]. Let be a probability space, where X may be a discrete set or . We stipulate that in case of a discrete set the integrals over X with respect to the measure are substituted by summations. Let
and suppose that for suitable k, where . Consider n independent observables
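and define the related free energy as (here $\theta\cdot h = \theta^i h_i$, where $h = (h_1,\dots,h_n)$ are the observables)
$$\psi(\theta) = \ln\int_X e^{\theta\cdot h}\,d\mu. \tag{8}$$
The n real numbers $\theta = (\theta^1,\dots,\theta^n)$ are called canonical parameters. They define uniquely a probability distribution $p_\theta$ which belongs to the exponential family $\mathcal{E}$ defined by $h$,
$$p_\theta(x) = e^{\theta\cdot h(x) - \psi(\theta)}. \tag{9}$$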
The relevant fact is that is a n-dimensional submanifold of the infinite dimensional set and that the canonical parameters are local coordinates. Note that as and . Another system of local coordinates is provided by the so-called expectation parameters defined by
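As $\psi$ is a strictly convex function, the gradient map $\theta\mapsto\eta = \nabla\psi(\theta)$ is globally invertible with inverse $\eta\mapsto\theta(\eta)$, which is also a gradient map, $\theta = \nabla\varphi(\eta)$, where
$$\varphi(\eta) = \theta(\eta)\cdot\eta - \psi(\theta(\eta))$$
is the Legendre transform of $\psi$ (see, e.g., [1]). We will denote with $p_\eta$ the point in $\mathcal{E}$ associated to $\eta$. The Kullback–Leibler divergence is defined for general $p$, $q$ in the set of probability distributions as
$$D(p\|q) = \int_X p\,\ln\frac{p}{q}\,d\mu.$$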
The restriction of to , the square of , is called canonical divergence. It can be shown (see in [1]) that when is referred to the coordinates , has the local representation
Note that as for
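The quantities of this section can be checked numerically on a finite state space. The sketch below assumes a counting reference measure on four states and a single hypothetical observable `h`; the helper names are illustrative and not taken from the paper. It verifies that the expectation parameter equals the derivative of the free energy, and that the Kullback–Leibler divergence between two members of the family equals the Bregman divergence of the free energy, a standard identity for exponential families.

```python
import math

h = [-1.0, 0.0, 1.0, 2.0]   # hypothetical observable values on four states

def psi(theta):
    """Free energy: log of the sum of exp(theta * h_i) (counting measure)."""
    return math.log(sum(math.exp(theta * hi) for hi in h))

def density(theta):
    """Member of the exponential family with canonical parameter theta."""
    z = psi(theta)
    return [math.exp(theta * hi - z) for hi in h]

def eta(theta):
    """Expectation parameter: the mean of h under density(theta)."""
    return sum(pi * hi for pi, hi in zip(density(theta), h))

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) for strictly positive p, q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def bregman(theta_a, theta_b):
    """psi(b) - psi(a) - (b - a) * psi'(a), using psi'(a) = eta(a)."""
    return psi(theta_b) - psi(theta_a) - (theta_b - theta_a) * eta(theta_a)

theta_a, theta_b = 0.3, -0.7
eps = 1e-6
numeric_grad = (psi(theta_a + eps) - psi(theta_a - eps)) / (2 * eps)

print("eta(theta_a)       =", eta(theta_a))
print("d psi / d theta    =", numeric_grad)
print("KL(p_a || p_b)     =", kl(density(theta_a), density(theta_b)))
print("Bregman divergence =", bregman(theta_a, theta_b))
```

The two divergence values agree up to rounding, and both vanish when the two parameters coincide.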
A key object is the map introduced in (5) associated to and the canonical divergence (11). It has the local form in coordinates , see (5) and (11),
with the explicit inverse, using local coordinates in ,
A simple but elegant result of the above-introduced framework is the following.
Proposition 1.
Let be a Lagrangian submanifold of described by the Morse family as in (4). Then, is a Lagrangian submanifold of described by the Morse family .
Proof.
As a consequence of the above proposition, if is transversal to the fibers of (no extra parameters u), then its image in is transversal to the fibers of .
Another interesting consequence is that the zero section of the cotangent bundle , locally represented as , is mapped by into
which is contained in $D^{-1}(0)$, the zero-level set of the canonical divergence. Indeed, from (10) and (11) we have that
thus in the general case and if . For later use, we compute from (7) the Hamiltonian associated to the canonical divergence
We set for the sake of simplicity and we compute from (8) the free energy
Using (15) and (16), the Hamiltonian can be written using relation (10) as
It is interesting to investigate more in detail the structure of the Lagrangian submanifold by studying the form of the two probability distributions in associated to the coordinates respectively and . We compute from (9)
and using (17)
Note that setting
relation (18) can be given the form
We will give an interpretation of this relation in the case of discrete probability distributions in Section 3.2 below.
3. Application to Maximum Entropy Principle with Nonlinear Constraints and Phase Transitions
A relevant application of the above-introduced framework concerns the use of the Maximum Entropy Principle with nonlinear constraints. Let us consider a physical system X whose description is given in terms of a probability distribution $p$. The Maximum Entropy Principle (E.T. Jaynes, see [11,12]) is a general inference procedure that allows one to update an initial probability distribution $q$ on the basis of subsequent information on the system, represented by the average values of some observables $h$ of interest for the system. The sought distribution $p$ is the one that minimizes the relative entropy $D(p\|q)$ on the set of the distributions which satisfy the constraints on the average values of $h$. From a mathematical point of view, we are faced with a constrained extremization problem, to be solved below using the Lagrange multipliers method.
We will see that the set of solutions for different values of the constraints defines a Lagrangian submanifold of a cotangent space of a manifold . We are interested in describing the corresponding Lagrangian submanifold in .
This section has a pedagogical character, so for the sake of simplicity we will avoid technicalities and assume that $X = \{1,\dots,n\}$ is a discrete space and that there is only one observable $h$ of interest, defined by assigning its values $h_i$, $i = 1,\dots,n$. The case of k observables can be dealt with along the same lines with no extra effort. The case of a continuous space presents more technical difficulties and it is considered in [20].
Let $q = (q_1,\dots,q_n)$ be the a priori distribution describing X. The Kullback–Leibler divergence is called relative entropy in this setting and has the form
$$D(p\|q) = \sum_{i=1}^n p_i\,\ln\frac{p_i}{q_i}.$$
Let $f:\mathbb{R}\to\mathbb{R}$ be a smooth globally non-invertible function (think for example of a cubic, see Figure 1 below). We look for the minima of D on the set of $p$ that satisfy a nonlinear constraint on the average value of $h$, that is,
$$f\Big(\sum_{i=1}^n p_i h_i\Big) = y. \tag{20}$$
The choice of this type of constraints is motivated by classical applications in statistical physics. For example in the Ising model in the Curie–Weiss (mean field) approximation the average energy of the spin lattice is a quadratic function of the average magnetization , see [14,15]. We have that
Note that we do not take into account at this stage of the procedure the normalization constraint $\sum_i p_i = 1$, stipulating that we will enforce it by dividing any candidate extremum point by $\sum_i p_i$. After introducing the Lagrange function, where $\lambda$ is the Lagrange multiplier associated to the constraint (20),
we see that the candidate extrema are the solutions for given y of (here )
that is, setting , we have to face a transcendental equation for the unnormalized probability
After normalization, (24)1 becomes
Let us denote with the set of pre-images of y along f (see, e.g., Figure 1 below)
where we have supposed that, for every y, is a finite set of cardinality . The crux is that we can substitute the constraint in (24)2 with the following equivalent one
therefore we can describe the—possibly non-unique—solution (25) of the extremum problem (23) as
where , showing that the candidate solution belongs to an exponential family . Note that in Information Geometry, the critical points of the MEP extremum problem are computed as geodesic projections onto a submanifold which is an exponential family, and the multiplicity of solutions is related to the non-uniqueness of the geodesic projection, see [1,15].
Figure 1.
Plot of . Points correspond to points where .
Note that where setting the solution (27) can be given the standard form (see in [1,14]) of MEP solution
with linear constraint , hence (25) becomes
The multipliers , are uniquely determined (see (10)) by the equation
for and accordingly we can compute the multipliers as
Note that the solution to our constrained extremization problem (28) has the form of a curved exponential family (see [1]) with respect to the discrete parameter . We will see in the next Section 3.1 that the framework of Lagrangian submanifold is useful to describe the global picture of the solutions in case of multiple solutions.
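To illustrate how a non-invertible $f$ produces several candidate solutions, the following sketch works through a small discrete case. It assumes a uniform prior, five hypothetical observable values `h`, the hypothetical cubic $f(m) = m^3 - m$ and the constraint level $y = 0$, whose pre-images $-1, 0, 1$ are known in closed form. For each pre-image it solves the corresponding linear-constraint problem by bisection on the multiplier. None of these specific choices come from the paper, and the sign convention of the multiplier `theta` is an assumption.

```python
import math

h = [-2.0, -1.0, 0.0, 1.0, 2.0]   # hypothetical observable values h_i
q = [0.2] * 5                      # uniform prior (an assumption)

def f(m):
    """Hypothetical non-invertible constraint function (a cubic)."""
    return m**3 - m

def tilted(theta):
    """Normalized distribution with p_i proportional to q_i * exp(theta * h_i)."""
    w = [qi * math.exp(theta * hi) for qi, hi in zip(q, h)]
    z = sum(w)
    return [wi / z for wi in w]

def mean(p):
    return sum(pi * hi for pi, hi in zip(p, h))

def relative_entropy(p):
    """D(p || q) for strictly positive p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def solve_theta(m, lower=-20.0, upper=20.0, iters=100):
    """Bisection: the mean of tilted(theta) is increasing in theta."""
    for _ in range(iters):
        mid = 0.5 * (lower + upper)
        if mean(tilted(mid)) < m:
            lower = mid
        else:
            upper = mid
    return 0.5 * (lower + upper)

preimages = [-1.0, 0.0, 1.0]       # the three roots of f(m) = 0
solutions = []
for m in preimages:
    theta = solve_theta(m)
    p = tilted(theta)
    solutions.append((m, theta, p, relative_entropy(p)))
    print(f"m={m:+.1f}  theta={theta:+.4f}  f(mean)={f(mean(p)):+.1e}  "
          f"D={relative_entropy(p):.4f}")
```

All three distributions satisfy the same nonlinear constraint $f(\sum_i p_i h_i) = 0$, each belongs to the same one-parameter exponential family, and they differ in the value of the relative entropy; the branch with mean $0$ coincides with the prior.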
3.1. The Global Picture via Lagrangian Submanifolds
If we set in the Lagrange function (22) , we see that for the set of points satisfying the first order necessary condition for unconstrained extremum (23) is the critical set
We can check if the Lagrange function defines a Morse family using the rank condition (3)
where in this case
and is the n-dimensional Hessian matrix (here is Kronecker symbol)
If is a Morse family, then by the Maslov–Hörmander theorem
is a Lagrangian submanifold of . We claim that (33) provides a global description of the set of solutions (28). We have seen in Section 2 that a sufficient condition for the elimination of all extra parameters u is that has maximal rank for all . A criterion for this is given by the following classical result in constrained optimization theory, here adapted to our notation, which expresses the second order sufficient condition for maxima or minima (see [14,21] for the proof).
Proposition 2.
From (21), we have that for
and from (32), that
It is straightforward to derive from the above relations that the two cases below hold
Therefore, at points where the Lagrangian submanifold in (33) is transversal. At points in where , we have , see (21), thus transversality is lost as—see the form of in (31)—for these points
We remark that the above introduced framework is able to give the global description of the set of solutions (28), (30) in terms of the Lagrangian submanifold locally described as
where is given by (30). If we consider , as a local change of coordinates on (since f is locally invertible where ) it is easy to prove that
Proposition 3.
Proof.
If is the local change of coordinates in , then the tangent map has the local form and the cotangent map has the local form
if we want that the Liouville one-form (see above (1)) has the same canonical form in the two coordinate charts. See, e.g., in [19] for a proof of this last classical result of differential geometry. □
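The classical fact invoked in the proof can be spelled out in one line. In the sketch below, $\tilde x = \chi(x)$ denotes a local change of coordinates on the base manifold and $(\tilde x,\tilde p)$ the induced coordinates on the cotangent bundle; these symbols are notation assumed here for illustration.

```latex
p_j = \tilde p_i\,\frac{\partial \chi^i}{\partial x^j}(x)
\quad\Longrightarrow\quad
\tilde p_i\,d\tilde x^i
  = \tilde p_i\,\frac{\partial \chi^i}{\partial x^j}\,dx^j
  = p_j\,dx^j .
```

The fiber coordinates thus transform with the transpose of the Jacobian of $\chi$, which is exactly the condition under which the Liouville one-form keeps its canonical expression in both charts.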
We want to study the Lagrangian submanifold defined in (35) and its image , where is defined in (14), whose local expression is
First we consider the case that f is a globally invertible function. In this case, and . The Lagrangian submanifold in (35) is the graph of the differential and it is transversal, see Figure 2a. Moreover, see below (9), if then . As is invertible with inverse , we have
and is a monotonically increasing function, see Figure 2a. Its image (36) is , see Figure 2b.
Figure 2.
The case of a transversal Lagrangian submanifold.
If we consider a globally non-invertible function f as the one depicted in Figure 1, then contains multiple points and is non-transversal at points where , see Figure 3a. The corresponding image has multiple branches and it is not a manifold at points where transversality fails, see Figure 3b.
Figure 3.
The case of a folded, i.e., non-transversal Lagrangian submanifold.
3.2. Probability Distributions in
In this section, we study the structure of the probability distributions in . In the local coordinate systems of , and describe the same probability distribution that we write for brevity as . Therefore, the probability distributions in in (36) associated to and are, respectively,
and, see (18),
Setting
the above (38) can be rewritten as the discrete version of (19), that is,
This last formula can be interpreted as follows. Let A and B be two independent random variables : , where is the discrete state space, described by the probability distributions and , respectively (for example, A and B describe two dice with n faces). Then, is the probability that A and B are found in the same state and
in (39) is the conditional probability that A and B are found in the state i provided that they are found in the same state. Note that for in (37) we have , thus (37) can be rewritten as
and (39) above is equal to
where are described by , , .
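The conditional-probability reading can be checked on a small example. The sketch below takes two hypothetical three-state distributions `a` and `b` for the independent variables A and B and computes the probability that they are found in the same state, together with the conditional distribution of that common state. Exact rational arithmetic is used so that the numbers can be verified by hand.

```python
from fractions import Fraction

# Hypothetical distributions of two independent three-state variables A and B.
a = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
b = [Fraction(1, 4), Fraction(1, 4), Fraction(1, 2)]

# Probability that A and B are found in the same state: sum_i a_i * b_i.
same = sum(ai * bi for ai, bi in zip(a, b))

# Conditional probability of the common state i, given that A = B.
cond = [ai * bi / same for ai, bi in zip(a, b)]

print("P(A = B)          =", same)
print("P(state i | A = B) =", [str(c) for c in cond])
print("normalization      =", sum(cond))
```

The conditional distribution is the normalized pointwise product of the two distributions, so when both belong to the same exponential family their canonical parameters simply add.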
4. Discussion
Canonical coordinates $\theta$ and $\eta$ associated to an exponential family are dually flat coordinates with respect to the duality defined by the canonical divergence. With respect to these coordinates, a generalization of the Pythagorean theorem is proved in Information Geometry, which provides a generalized formulation of the Maximum Entropy Principle with linear constraints as a geodesic projection problem (see [2]). Multiplicity of the solutions of the Maximum Entropy problem is due to the non-uniqueness of the projection. In this paper, we have shown that the set of couples defines a transversal Lagrangian submanifold of , and we have seen with an example that if nonlinear constraints are considered the set of possible multiple solutions to the Maximum Entropy problem is globally described by a folded (i.e., a possibly non-transversal) Lagrangian submanifold . We have computed their pull-back to the square manifold via the map . We think that this framework offers a complementary view to the generalized Pythagorean theorem. We plan to address in a subsequent paper a generalization of the theory presented here to a more general form of nonlinear constraint.
Funding
This research received no external funding.
Conflicts of Interest
The author declares no conflict of interest.
References
- Amari, S. Information Geometry and Its Applications; Springer: Berlin/Heidelberg, Germany, 2016; Volume 194.
- Amari, S.; Nagaoka, H. Methods of Information Geometry; American Mathematical Society: Providence, RI, USA, 2007; Volume 191.
- Murray, M.K.; Rice, J.W. Differential Geometry and Statistics; CRC Press: Boca Raton, FL, USA, 1993; Volume 48.
- Amari, S.; Cichocki, A. Information geometry of divergence functions. Bull. Pol. Acad. Sci. Tech. 2010, 58, 183–195.
- Eguchi, S. A differential geometric approach to statistical inference on the basis of contrast functionals. Hiroshima Math. J. 1985, 15, 341–391.
- Ay, N.; Amari, S. A novel approach to canonical divergences within information geometry. Entropy 2015, 17, 8111–8129.
- Barndorff-Nielsen, O.E.; Jupp, P.E. Statistics, yokes and symplectic geometry. Ann. Fac. Sci. Toulouse Math. 1997, 6, 389–427.
- Leok, M.; Zhang, J. Connecting information geometry and geometric mechanics. Entropy 2017, 19, 518.
- Noda, T. Symplectic structures on statistical manifolds. J. Aust. Math. Soc. 2011, 90, 371–384.
- Nakamura, Y. Completely integrable gradient systems on the manifolds of Gaussian and multinomial distributions. Jpn. J. Ind. Appl. Math. 1993, 10, 179.
- Jaynes, E.T. Information theory and statistical mechanics. Phys. Rev. 1957, 106, 620–630.
- Jaynes, E.T. Probability Theory: The Logic of Science; Cambridge University Press: Cambridge, UK, 2003.
- Brout, R. Phase Transitions. In Statistical Physics. Phase Transitions and Superfluidity; Brandeis University Summer Institute in Theoretical Physics, Gordon and Breach Science Publishers: London, UK, 1966; pp. 5–103.
- Favretti, M. Lagrangian submanifolds generated by the Maximum Entropy principle. Entropy 2005, 7, 1–14.
- Fujiwara, A.; Shuto, S. Hereditary structure in Hamiltonians: Information geometry of Ising spin chains. Phys. Lett. A 2010, 374, 911–916.
- Maslov, V.P.; Bouslaev, V.C.; Arnol’d, V.I. Théorie des Perturbations et Méthodes Asymptotiques; Dunod: Paris, France, 1972.
- Hörmander, L. Fourier integral operators. I. Acta Math. 1971, 127, 79.
- Weinstein, A. Lectures on Symplectic Manifolds; No. 29; American Mathematical Society: Providence, RI, USA, 1977.
- Cardin, F. Elementary Symplectic Topology and Mechanics; Springer: Berlin/Heidelberg, Germany, 2015.
- Favretti, M. Isotropic submanifolds generated by the Maximum Entropy Principle and Onsager reciprocity relations. J. Funct. Anal. 2005, 227, 227–243.
- Bertsekas, D.P. Constrained Optimization and Lagrange Multiplier Methods; Academic Press: Cambridge, MA, USA, 2014.
© 2020 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).