Connecting Information Geometry and Geometric Mechanics

The divergence function in information geometry, and the discrete Lagrangian in discrete geometric mechanics each induce a differential geometric structure on the product manifold Q×Q. We aim to investigate the relationship between these two objects, and the fundamental role that duality, in the form of Legendre transforms, plays in both fields. By establishing an analogy between these two approaches, we will show how a fruitful cross-fertilization of techniques may arise from switching formulations based on the cotangent bundle T∗Q (as in geometric mechanics) and the tangent bundle TQ (as in information geometry). In particular, we establish, through variational error analysis, that the divergence function agrees with the exact discrete Lagrangian up to third order if and only if Q is a Hessian manifold.


Introduction
Information geometry and geometric mechanics each induce geometric structures on an arbitrary manifold Q, and we investigate the relationship between these two approaches. More specifically, we study the interaction of three objects: TQ, the tangent bundle on which a Lagrangian function L(q, v) is defined; T*Q, the cotangent bundle on which a Hamiltonian function H(q, p) is defined; and Q × Q, the product manifold on which the divergence function (from the information geometric perspective) or the Type I generating function (from the geometric mechanics perspective) is defined. In discrete mechanics, the Q × Q ←→ T*Q correspondence is via a symplectomorphism given by the time-h flow map associated with the Hamiltonian H(q, p), and the Q × Q ←→ TQ correspondence is via the map relating the boundary-value and initial-value formulations of the Euler-Lagrange flow. It is, however, the TQ ←→ T*Q correspondence through the fiberwise Legendre map based on L or H that actually couples the Hamiltonian flow with the Lagrangian flow, leading to one and the same dynamics. We propose a decoupling of the Lagrangian and Hamiltonian dynamics through the use of a divergence function D(q, v, p) defined on the Pontryagin bundle TQ ⊕ T*Q that measures the discrepancy (or duality gap) between H(q, p) and L(q, v). We also establish, through variational error analysis, that the divergence function agrees with the exact discrete Lagrangian up to third order if and only if Q is a Hessian manifold.
Geometric mechanics [1] investigates the equations of motion using the Lagrangian, Hamiltonian, and Hamilton-Jacobi formulations of classical Newtonian mechanics. Two apparently different principles are used in those formulations: the principle of conservation (energy, momentum, etc.), leading to Hamiltonian dynamics, and the principle of variation (least action), leading to Lagrangian dynamics. The conservation properties of the Hamiltonian approach are with respect to the underlying symplectic geometry on the cotangent bundle, whereas the variational principles that result in the Euler-Lagrange equations of motion and the Hamilton-Jacobi equations reflect the geometry of the semispray on the tangent bundle. Lagrangian and Hamiltonian mechanics are two sides of the same coin: that they describe identical dynamics on the configuration space (base manifold) is both remarkable and to be expected from their construction, because the Lagrangian and Hamiltonian are dual to each other and related via the Legendre transform.
Information geometry [2,3], in the broadest sense of the term, provides a dualistic Riemannian geometric structure that is induced by a class of functions called divergence functions, which essentially provide a method of smoothly measuring a directed distance between any two points on the manifold, where the manifold is the space of probability densities. It arises in various branches of information science, including statistical inference, machine learning, neural computation, information theory, and optimization and control. Various geometric structures can be induced from divergence functions, including a metric, an affine connection, and a symplectic structure; these are reviewed in [4]. Convex duality and the Legendre transform play a key role in both constructing the divergence function and characterizing the various dualities encoded by information geometry [5,6].
Given that geometric mechanics and information geometry both prescribe dualistic geometric structures on a manifold, it is interesting to explore the extent to which these two frameworks are related. In geometric mechanics, the Legendre transform provides a link between the Hamiltonian function that is defined on the cotangent bundle T*Q and the Lagrangian function that is defined on the tangent bundle TQ, whereas in information geometry, it provides a link between the biorthogonal coordinates of the base manifold Q if it is dually-flat and exhibits the Hessian geometry. To understand their deep relationship, it turns out that we need to resort to the discrete formulation of geometric mechanics, and investigate the product manifold Q × Q. The basic tenet of discrete geometric mechanics is to preserve the fact that Hamiltonian flow is a symplectomorphism, and construct discrete time maps that are symplectic. This results in two ways of viewing discrete-time mechanics, either as maps on Q × Q or on T*Q, which are related via discrete Legendre transforms. The shift in focus from T*Q to Q × Q precisely lends itself to establishing a connection to information geometry, as the divergence function is naturally defined on Q × Q, and in both information geometry and discrete geometric mechanics, induces a symplectic structure on Q × Q. This is the basic observation that connects geometric mechanics and information geometry, and we will explore the implications of this connection in the paper.
Our paper is organized as follows. Section 2 provides a contemporary viewpoint of geometric mechanics, with Lagrangian and Hamiltonian systems discussed in parallel with one another, in terms of geometry on TQ and T*Q, respectively, including a discussion of Dirac mechanics on the Pontryagin bundle TQ ⊕ T*Q, which provides a unified treatment of Lagrangian and Hamiltonian mechanics. Section 3 summarizes the results of the discrete formulation of geometric mechanics, which is naturally defined on the product manifold Q × Q. Section 4 is a review of now-classical information geometry, including the Riemannian metric and affine connections on TQ, and the manner in which the divergence function naturally induces dualistic Riemannian structures. The special cases of Hessian geometry and biorthogonal coordinates are highlighted, showing how the Legendre transform is essential for characterizing dually-flat spaces. Section 5 starts with a presentation of the symplectic structure on Q × Q induced by a divergence function, which is naturally identified with the Type I generating function on it. We follow up by investigating its transformation into a Type II generating function (which plays a key role in discrete Hamiltonian mechanics). We then propose to decouple the discrete Hamiltonian and Lagrangian dynamics by using the divergence function to measure their duality gap. Finally, we perform variational error analysis to show that on a dually-flat Hessian manifold, the Bregman divergence is third-order accurate with respect to the exact discrete Lagrangian. Section 6 closes with a summary and discussion.

A Review of Geometric Mechanics
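Consider an n-dimensional configuration manifold Q, with local coordinates (q^1, …, q^n). The Lagrangian formulation of mechanics is defined on the tangent bundle TQ, in terms of a Lagrangian L : TQ → R. From this, one can construct an action integral S, which is a functional of the curve q : [t_1, t_2] → Q, given by

$$S(q) = \int_{t_1}^{t_2} L(q(t), \dot q(t))\, dt.$$

Then, Hamilton's variational principle states that

$$\delta S = 0, \tag{1}$$

where the variation δS is induced by an infinitesimal variation δq of the trajectory q, subject to the condition that the variations vanish at the endpoints, i.e., δq(t_1) = δq(t_2) = 0. Applying standard results from the calculus of variations, we obtain the following Euler-Lagrange equations of motion,

$$\frac{d}{dt}\frac{\partial L}{\partial \dot q^i} - \frac{\partial L}{\partial q^i} = 0. \tag{2}$$

The Hamiltonian formulation of mechanics is defined on the cotangent bundle T*Q, and the fiberwise Legendre transform, FL : TQ → T*Q, relates the tangent bundle and the cotangent bundle as follows, where p_i is the conjugate momentum to q^i:

$$(q^i, \dot q^i) \mapsto (q^i, p_i), \qquad p_i = \frac{\partial L}{\partial \dot q^i}. \tag{3}$$

The term fiberwise is used to emphasize the fact that FL establishes a pointwise correspondence between T_q Q and T*_q Q for any point q on Q. The cotangent bundle forms the phase space, on which one can define a Hamiltonian H : T*Q → R, H(q, p) = ⟨p, q̇(q, p)⟩ − L(q, q̇(q, p)), where q̇ is viewed as a function of (q, p) by inverting the Legendre transform (3), and ⟨p, v⟩ denotes the duality or natural pairing between a vector v and covector p at the point q ∈ Q. A straightforward calculation shows that

$$\frac{\partial H}{\partial p_i} = \dot q^i + p_j \frac{\partial \dot q^j}{\partial p_i} - \frac{\partial L}{\partial \dot q^j}\frac{\partial \dot q^j}{\partial p_i} = \dot q^i,$$

and

$$\frac{\partial H}{\partial q^i} = p_j \frac{\partial \dot q^j}{\partial q^i} - \frac{\partial L}{\partial q^i} - \frac{\partial L}{\partial \dot q^j}\frac{\partial \dot q^j}{\partial q^i} = -\frac{\partial L}{\partial q^i}.$$

From these, we transform the Euler-Lagrange equations into Hamilton's equations,

$$\dot q^i = \frac{\partial H}{\partial p_i}, \qquad \dot p_i = -\frac{\partial H}{\partial q^i}. \tag{5}$$

The canonical symplectic form ω_can on T*Q can be identified with a quadratic form induced by the skew-symmetric matrix J, i.e., ω_can(v, w) = v^T J w. With that identification, Hamilton's equations can be expressed as

$$\begin{pmatrix} \dot q \\ \dot p \end{pmatrix} = \begin{pmatrix} 0 & I \\ -I & 0 \end{pmatrix} \begin{pmatrix} \partial H/\partial q \\ \partial H/\partial p \end{pmatrix}, \qquad J = \begin{pmatrix} 0 & I \\ -I & 0 \end{pmatrix}.$$

Alternatively, Hamilton's equations (5) can be derived using Hamilton's phase space variational principle, which states that

$$\delta \int_{t_1}^{t_2} \left[ p_i \dot q^i - H(q, p) \right] dt = 0,$$

for infinitesimal variations δq that vanish at the endpoints. The infinitesimal variation of the integral is computed by differentiating under the integral, integrating by parts, and using the fact that the infinitesimal variations δq vanish at the endpoints, which yields

$$\int_{t_1}^{t_2} \left[ \left( \dot q^i - \frac{\partial H}{\partial p_i} \right) \delta p_i - \left( \dot p_i + \frac{\partial H}{\partial q^i} \right) \delta q^i \right] dt = 0,$$

and by the fundamental theorem of the calculus of variations, which states that the integral is stationary only when the terms in the parentheses multiplying the independent variations δq and δp vanish, we recover Hamilton's equations (5).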
Lagrangian and Hamiltonian mechanics are typically viewed as different representations of the same dynamical system, with the Legendre transform relating the two formulations. Here, the Legendre transform FL (with FH as its inverse) refers both to the map relating two sets of variables, (q, q̇) and (q, p), and to the relationship between two functions, the Lagrangian L(q, q̇) and the Hamiltonian H(q, p). The Legendre transform links pairs of convex conjugate functions; in classical mechanics, the Lagrangian L and Hamiltonian H are always related in this sense of forming a convex pair. The requirement that L(q, q̇) be strictly convex in the variable q̇ is referred to as hyperregularity. When the Lagrangian is positively homogeneous (or singular), the Legendre transform yields a Hamiltonian function that is identically zero, so the Hamiltonian analogue of the Lagrangian system does not exist, which is problematic in the context of analytic mechanics. In order to address such degeneracy, it is necessary to consider Dirac mechanics on Dirac manifolds, which is a simultaneous generalization of Lagrangian and Hamiltonian mechanics.
In geometric mechanics, including the contemporary Dirac formulation, the Lagrangian L and Hamiltonian H are always coupled via the fiberwise Legendre transform FL. In information geometry, it is a well-known fact that one can construct the divergence function (to be defined later), which captures the departure from such perfect coupling. In other words, we can view Lagrangian and Hamiltonian systems as two separate systems, which are endowed with their own dynamics and are in some sense dual to each other, and we then use the divergence function to measure their duality gap. For this reason, we will review the Lagrangian and Hamiltonian formulations of mechanics in terms of L(q, q̇) and H(q, p), respectively, without necessarily assuming that the Lagrangian and Hamiltonian are related by the Legendre transform.
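As a small numerical illustration of the convex-conjugate relationship and of the duality gap, the following Python sketch approximates the Legendre transform by a brute-force search over velocities and compares it with the closed form for a mechanical Lagrangian. The mass M, the quadratic potential, the velocity grid, and the function names are all assumptions made for this example; the gap is taken in the Fenchel-Young form L(q, v) + H(q, p) − ⟨p, v⟩, which is assumed here only for illustration and vanishes exactly when p = ∂L/∂v.

```python
M = 2.0  # hypothetical mass, chosen only for this example


def potential(q):
    # Hypothetical quadratic potential V(q) = q^2 / 2.
    return 0.5 * q * q


def lagrangian(q, v):
    # Mechanical Lagrangian L(q, v) = (1/2) M v^2 - V(q).
    return 0.5 * M * v * v - potential(q)


def hamiltonian_closed_form(q, p):
    # Convex conjugate of L in v: H(q, p) = p^2 / (2 M) + V(q).
    return p * p / (2.0 * M) + potential(q)


def legendre_transform(lagr, q, p, v_min=-10.0, v_max=10.0, n=20001):
    # Approximate H(q, p) = sup_v [p v - L(q, v)] on a uniform velocity grid.
    step = (v_max - v_min) / (n - 1)
    best = float("-inf")
    for i in range(n):
        v = v_min + i * step
        best = max(best, p * v - lagr(q, v))
    return best


def duality_gap(q, v, p):
    # Fenchel-Young gap: nonnegative, and zero only when p = M v.
    return lagrangian(q, v) + hamiltonian_closed_form(q, p) - p * v


q, p = 0.3, 1.7
h_numeric = legendre_transform(lagrangian, q, p)
h_exact = hamiltonian_closed_form(q, p)
gap_coupled = duality_gap(q, p / M, p)   # velocity matched to momentum
gap_decoupled = duality_gap(q, 0.0, p)   # velocity not matched to momentum
```

For this quadratic Lagrangian the gap reduces to (p − Mv)²/(2M), so it is strictly positive whenever velocity and momentum are not related by the Legendre transform.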

Lagrangian Mechanics as an Extremization System on TQ
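As noted previously, the Euler-Lagrange equations (2) arise from the stationarity conditions that describe the extremal curves of the action integral, over the class of varied curves that fix the endpoints. Carrying out the differentiation in (2) explicitly yields

$$\frac{\partial^2 L}{\partial \dot q^i \partial \dot q^j}\, \ddot q^j + \frac{\partial^2 L}{\partial \dot q^i \partial q^j}\, \dot q^j - \frac{\partial L}{\partial q^i} = 0. \tag{6}$$

The fundamental tensor g_ij associated with the Lagrangian L(q, v) is given by

$$g_{ij}(q, v) = \frac{\partial^2 L}{\partial v^i \partial v^j},$$

which is assumed to be positive-definite, i.e., the Lagrangian L is hyperregular. Let g^{il} denote the matrix inverse of g_ij; then (6) can be written as

$$\ddot q^i + 2 G^i(q, \dot q) = 0, \tag{7}$$

where

$$G^i(q, v) = \frac{1}{2}\, g^{il} \left( \frac{\partial^2 L}{\partial v^l \partial q^k}\, v^k - \frac{\partial L}{\partial q^l} \right).$$

So, (7) with the above G^i is the Euler-Lagrange equations in disguise, and its solution is an extremal curve of the action integral.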
Recall that a smooth curve on Q can be lifted to a curve on TQ in a natural way: a curve t → q(t) ∈ Q becomes t → (q(t), q̇(t)) ∈ TQ. Given an arbitrary G^i(q, v), the system of equations (7) specifies a family of curves, called a semispray. As seen above, semisprays arise naturally in variational calculus as extremal curves of the action integral associated with a Lagrangian.
Semisprays can be more generally described by a vector field. Recall that a vector field on Q is a section of TQ. Now, consider a vector field on the tangent bundle TQ; it is a section of the double tangent bundle T(TQ). The integral surfaces of the semispray induce a decomposition of the total space T(TQ) into the horizontal subspace H(TQ) and the vertical subspace V(TQ), which defines an Ehresmann connection. A vector on TQ encodes the second-order derivative of curves on Q, and a semispray defines the following vector field V on TQ:

$$V = v^i \frac{\partial}{\partial q^i} - 2 G^i(q, v) \frac{\partial}{\partial v^i},$$

where the factor −2 is there by convention. The integral curve of the semispray satisfies the second-order ordinary differential equation (7), and we say that a semispray is a vector field on the tangent bundle TQ which encodes a second-order system of differential equations on the base manifold Q.
A semispray is called a full spray if the spray coefficients G^i satisfy

$$G^i(q, \lambda v) = \lambda^2 G^i(q, v)$$

for λ > 0. In this case, the integral curve remains invariant under reparameterization by a positive number, i.e., it satisfies homogeneity. When the semispray becomes a (full) spray, the Lagrange geometry becomes Finsler geometry, and the fundamental tensor g_ij(q, v) becomes the Finsler-Riemann metric tensor (which includes the Riemann metric as a special case).
As noted above, a semispray induces an Ehresmann connection on Q, and this connection is torsion-free and typically nonlinear. Conversely, given a torsion-free connection, one can construct a semispray. The connection is homogeneous if and only if the semispray is a full spray. Moreover, if the spray is affine, then the connection is affine as well: an affine spray G^i takes the form

$$G^i(q, v) = \frac{1}{2}\, \Gamma^i_{jk}(q)\, v^j v^k,$$

where Γ^i_jk is referred to as the affine connection.
To summarize, Lagrangian dynamics is related to action minimization by the Euler operator, and leads to a semispray on the configuration manifold Q. Under suitable conditions, the Lagrangian function L(q, v) defined on TQ will lead to a torsion-free but generally nonlinear connection, and an affine connection only for a very special form of Lagrangian.

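Hamiltonian Mechanics as a Conservative System on T*Q
Given a Hamiltonian H : T*Q → R, we consider the Hamiltonian vector field X_H ∈ X(T*Q) (where X denotes a section) defined by

$$X_H = \frac{\partial H}{\partial p_i} \frac{\partial}{\partial q^i} - \frac{\partial H}{\partial q^i} \frac{\partial}{\partial p_i}. \tag{8}$$

It is straightforward to verify that H = const along the dynamical flow of X_H:

$$\frac{dH}{dt} = \frac{\partial H}{\partial q^i}\, \dot q^i + \frac{\partial H}{\partial p_i}\, \dot p_i = \frac{\partial H}{\partial q^i} \frac{\partial H}{\partial p_i} - \frac{\partial H}{\partial p_i} \frac{\partial H}{\partial q^i} = 0.$$

So, a Hamiltonian vector field X_H advects the Hamiltonian H along its flow, so that H is constant along solution curves, which implies that the Lie derivative of H along the flow of X_H vanishes,

$$\mathcal{L}_{X_H} H = dH(X_H) = 0.$$

Formally, starting from the tautological 1-form θ = ∑_i p_i dq^i on T*Q, one obtains a 2-form ω, called the Poincaré 2-form,

$$\omega(X, Y) = -d\theta(X, Y), \qquad \omega = \sum_i dq^i \wedge dp_i,$$

where X, Y are vector fields on T*Q.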
More generally, given a Hamiltonian H along with a symplectic form ω, which is, by definition, a closed, nondegenerate 2-form, one obtains the Hamiltonian vector field X_H on T*Q, defined in abstract notation by

$$i_{X_H}\, \omega = dH, \tag{9}$$

or equivalently, in a more familiar notation,

$$\dot q^i = \frac{\partial H}{\partial p_i}, \qquad \dot p_i = -\frac{\partial H}{\partial q^i}.$$

One can define the Poisson bracket {•, •} of two functions F and G by using their respective Hamiltonian vector fields and the symplectic form, {F, G} ≡ ω(X_F, X_G).
For the canonical symplectic form, it has the following coordinate expression,

$$\{F, G\} = \sum_i \left( \frac{\partial F}{\partial q^i} \frac{\partial G}{\partial p_i} - \frac{\partial F}{\partial p_i} \frac{\partial G}{\partial q^i} \right).$$

In this way, Hamilton's equations can be expressed in terms of the Poisson bracket as follows,

$$\dot q^i = \{q^i, H\}, \qquad \dot p_i = \{p_i, H\}. \tag{10}$$

By Darboux's theorem, it is always possible to choose local coordinates (q, p) on T*Q, referred to as canonical coordinates, such that the symplectic form has the expression ω = ∑_i dq^i ∧ dp_i. In these coordinates, Hamilton's equations defined in terms of the symplectic structure (9) and Poisson structure (10) recover the canonical Hamiltonian vector field (8).
Note that any smooth function H on T*Q induces a Hamiltonian vector field. An arbitrary vector field X on T*Q is locally Hamiltonian, i.e., induced by a smooth function H, if ι_X ω is closed, i.e., L_X ω = 0. In addition, a Hamiltonian vector field preserves the volume form Ω, i.e.,

$$\mathcal{L}_{X_H} \Omega = 0,$$

where Ω is the n-fold exterior product of ω,

$$\Omega = \underbrace{\omega \wedge \cdots \wedge \omega}_{n}.$$

Symplectic Maps and Symplectic Flows
A symplectic map is a diffeomorphism of T*Q that preserves its symplectic structure ω. We first consider a one-parameter family of symplectic maps generated by the flow map F_X : T*Q → T*Q of a vector field X ∈ X(T*Q). Since the entire family of symplectic maps leaves ω invariant, it follows that £_X ω = 0. It can be shown (using Cartan's magic formula, and the fact that ω is closed) that a vector field X ∈ X(T*Q) is symplectic if i_X ω is closed, i.e., d(i_X ω) = 0. By the Poincaré lemma, this implies that i_X ω is locally exact, that is, in the neighborhood of any point, there exists some function H : T*Q → R such that i_X ω = dH. So there always locally exists a Hamiltonian H : T*Q → R that generates a vector field X whose flow is symplectic with respect to ω.
More generally, a diffeomorphism φ : M_1 → M_2 is a symplectic map from a symplectic space (M_1, ω_1) to another space (M_2, ω_2) if

$$\varphi^* \omega_2 = \omega_1, \tag{11}$$

where ω_1, ω_2 are the symplectic forms on M_1, M_2, respectively. The above condition (11) holds if and only if, for any functions f, g on M_2,

$$\{ f \circ \varphi,\, g \circ \varphi \}_1 = \{ f, g \}_2 \circ \varphi.$$

With respect to Darboux coordinates about a point z = (q, p) ∈ M_1, the condition (11) that a map φ : M_1 → M_2 is symplectic can be expressed locally by (D_z φ)^T J (D_z φ) = J, where D_z φ denotes the Jacobian of φ at z.
A canonical transformation of T*M is an automorphism φ : T*M → T*M that preserves the canonical symplectic form, φ*ω_can = ω_can. The significance of canonical transformations is that they preserve the form of Hamilton's equations, and one can check that an automorphism φ is canonical by verifying that (D_z φ)^T J (D_z φ) = J in a Darboux coordinate chart.
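The Jacobian criterion can be checked directly in one degree of freedom. The sketch below, in plain Python, tests (D_z φ)^T J (D_z φ) = J for the time-t flow of a harmonic oscillator with Hamiltonian H = p²/2 + ω²q²/2, which is linear and so equals its own Jacobian, and for a uniform scaling of phase space, which is not symplectic. The helper names, the oscillator example, and the tolerance are assumptions introduced for illustration.

```python
import math

# Canonical structure matrix J for one degree of freedom, z = (q, p).
J = [[0.0, 1.0], [-1.0, 0.0]]


def matmul(a, b):
    # Product of two small dense matrices stored as nested lists.
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def is_symplectic(jac, tol=1e-12):
    # Check (D phi)^T J (D phi) = J entry by entry.
    lhs = matmul(transpose(jac), matmul(J, jac))
    return all(abs(lhs[i][j] - J[i][j]) <= tol
               for i in range(2) for j in range(2))


def oscillator_flow_jacobian(t, omega=1.0):
    # Jacobian of the time-t flow of H = p^2/2 + omega^2 q^2/2.
    c, s = math.cos(omega * t), math.sin(omega * t)
    return [[c, s / omega], [-omega * s, c]]


# Uniform scaling (q, p) -> (2q, 2p): a diffeomorphism that is not symplectic.
scaling_jacobian = [[2.0, 0.0], [0.0, 2.0]]

flow_ok = is_symplectic(oscillator_flow_jacobian(0.7, omega=2.0))
scaling_ok = is_symplectic(scaling_jacobian)
```

In two dimensions the condition reduces to the Jacobian having unit determinant, which is why the rotation-like flow passes and the scaling, with determinant 4, fails.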

Symplectic Structure on TQ Pulled Back from T * Q
If we endow T*Q with the canonical symplectic form, we can construct a symplectic form on TQ in such a way that these two spaces are symplectomorphic.
The mapping between TQ and T*Q can be constructed in two different ways. Case I involves the Legendre transform,

$$\mathbb{F}L : TQ \to T^*Q, \qquad (q, v) \mapsto \left( q, \frac{\partial L}{\partial v} \right),$$

and Case II involves the Riemannian metric tensor g (on Q),

$$\flat : TQ \to T^*Q, \qquad (q, v) \mapsto (q, g(v, \cdot)).$$

Note that we say that g is a pseudo-Riemannian metric on Q when g acts on a pair of tangent vectors in the tangent space T_q Q at a point q of Q; it can be viewed as a symmetric (0, 2)-tensor that maps TQ × TQ → R. On the other hand, the symplectic form ω is a skew-symmetric (0, 2)-tensor that acts on a pair of tangent vectors to T*Q.
Case I. Given the Lagrangian L : TQ → R, this induces the fiberwise Legendre transform FL : TQ → T*Q, which is given by (q, v) → (q, p) = (q, ∂L/∂v). If L is hyperregular, then this map is a diffeomorphism. If we endow TQ with the pullback symplectic form (FL)*ω, which is given by

$$(\mathbb{F}L)^* \omega = \frac{\partial^2 L}{\partial v^i \partial q^j}\, dq^i \wedge dq^j + \frac{\partial^2 L}{\partial v^i \partial v^j}\, dq^i \wedge dv^j,$$

then the Legendre transform is a symplectomorphism (by construction).
Case II. The Riemannian metric g induces the musical isomorphisms ♭ : TQ → T*Q and ♯ : T*Q → TQ between TQ and T*Q, which are the operations that lower and raise the index, respectively. If we endow TQ with the pullback symplectic form ♭*ω, which is given by

$$\flat^* \omega = g_{ij}\, dq^i \wedge dv^j + \frac{\partial g_{ij}}{\partial q^k}\, v^j\, dq^i \wedge dq^k,$$

then the musical isomorphism is a symplectomorphism (by construction).
Link between Case I and Case II. It is possible that the two ways of identifying TQ ↔ T*Q may be the same; this happens when g on TQ coincides with the second derivatives of L(q, v) with respect to the v-variable,

$$g_{ij}(q, v) = \frac{\partial^2 L(q, v)}{\partial v^i \partial v^j},$$

assuming L is hyperregular. The inverse of g, with components g^{ij}, can be obtained from g^{ij}(q, p) = ∂²H(q, p)/∂p_i ∂p_j, using the Hamiltonian H(q, p) defined on T*Q. Note that when the Lagrangian has the form L(q, v) = ½ v^T M(q) v − V(q), this corresponds to the Riemannian metric g being given by the kinetic energy metric M(q).

Hamilton-Jacobi Theory and Dirichlet-to-Neumann Map
In classical mechanics, the Hamilton-Jacobi equation is first introduced as a partial differential equation that the action integral satisfies. Recall that the action integral S along the solution of the Euler-Lagrange equation (2) over the time interval [t_0, t] is

$$S_{q_0}(q(t)) := \int_{t_0}^{t} L(q(s), \dot q(s))\, ds, \tag{14}$$

where q(•) is the solution of the Euler-Lagrange equation. This is referred to as Jacobi's solution of the Hamilton-Jacobi equation. Here, we assume that the initial position q(0) is fixed and the final position q(t) depends on the initial velocity v_0 = q̇(0). By taking a variation δq(t) of the endpoint q(t), one obtains a partial differential equation satisfied by S(q, t):

$$\frac{\partial S}{\partial t} + H\!\left( q, \frac{\partial S}{\partial q} \right) = 0. \tag{15}$$

This is the Hamilton-Jacobi equation, when H does not explicitly depend on t.
Conversely, it is shown that if S_{q_0}(q) is a solution of the Hamilton-Jacobi equation, then S_{q_0}(q) is a generating function for the family of canonical transformations (or symplectic flows) that describe the dynamics defined by Hamilton's equations. This result is the theoretical basis for the powerful technique of exact integration called separation of variables.
There are two uses of S_{q_0}(q). First, it serves to characterize the Dirichlet-to-Neumann map, which refers to the correspondence between the boundary data (q_0, q) ∈ Q × Q and the initial data (q_0, v_0) ∈ TQ of a dynamical system. Second, it provides a foliation of the configuration space Q, around the point q_0 and parameterized by t, that is defined by the condition S_{q_0}(q(t)) = const.
In the rest of the paper, we will view S_{q_0}(q) as a scalar-valued function of (q_0, q(t)), which we refer to as the exact discrete Lagrangian,

$$L_d^E(q_0, q_1; h) = \operatorname*{ext}_{\substack{q \in C^2([0, h], Q) \\ q(0) = q_0,\ q(h) = q_1}} \int_0^h L(q(t), \dot q(t))\, dt. \tag{16}$$

This is equivalent to the expression for Jacobi's solution, as the stationarity conditions of this variational characterization are simply the Euler-Lagrange equations. Furthermore, this characterization has the added benefit that it is well-defined even if the Lagrangian is degenerate. The exact discrete Lagrangian provides us with the time-h flow map for the Euler-Lagrange equation. Given a fixed initial point q_0, this defines a map which takes q ∈ Q to an initial velocity v_0, such that the Euler-Lagrange trajectory q(t) with initial condition (q(0), q̇(0)) = (q_0, v_0) has boundary values (q(0), q(h)) = (q_0, q). This is the Dirichlet-to-Neumann map Q × Q → TQ, (q_0, q) → (q_0, v_0).
To address the Dirichlet-to-Neumann map more generally, let us first recall the definition of a retraction:
Definition 1 ([7], Definition 4.1.1 on p. 55). A retraction on a manifold Q is a smooth mapping R : TQ → Q with the following properties: Let R_q : T_q Q → Q be the restriction of R to T_q Q for an arbitrary q ∈ Q; then,
(i) R_q(0_q) = q, where 0_q denotes the zero element of T_q Q;
(ii) with the identification T_{0_q}(T_q Q) ≃ T_q Q, R_q satisfies

$$T_{0_q} R_q = \mathrm{id}_{T_q Q}, \tag{17}$$

where T_{0_q} R_q is the tangent map of R_q at 0_q ∈ T_q Q.
Equation (17) implies that the map R_q : T_q Q → Q is invertible in some neighborhood of 0_q in T_q Q; its local inverse is conveniently denoted by R_q^{-1}. It is easy to see that the induced map TQ → Q × Q, v_q → (q, R_q(v_q)), is also invertible in some neighborhood of 0_q ∈ TQ for any q ∈ Q.
Let us introduce a special class of coordinate charts that are compatible with a given retraction map R : TQ → Q. A coordinate chart (U, ϕ) with ϕ : U → R^n, centered at q ∈ U so that ϕ(q) = 0, is retraction compatible at q if the compatibility condition

$$R_q(v_q) = \phi^{-1}\!\left( T_q \phi(v_q) \right) \tag{19}$$

holds, where we identify T_0 R^n with R^n as follows: Let ϕ = (q^1, …, q^n) with q^i : U → R for i = 1, …, n. Then

$$v^i \frac{\partial}{\partial q^i} \mapsto (v^1, \ldots, v^n),$$

where ∂/∂q^i is the unit vector in the q^i-direction in T_0 R^n. An atlas for the manifold Q is retraction compatible if it consists of retraction compatible coordinate charts.
In Equation (19), we assumed that T_q ϕ(v_q) ∈ ϕ(U) ⊂ R^n, and so, strictly speaking, R_q is defined on (T_q ϕ)^{-1}(ϕ(U)) ⊂ T_q Q. However, it is always possible to define a coordinate chart such that ϕ(U) = R^n by stretching out the open set ϕ(U) to R^n so that (19) is defined for any v_q ∈ T_q Q.
Retraction maps provide a general means to relate Q × Q to TQ: in essence, they provide a correspondence between {q} × Q and T_q Q for all q ∈ Q (we may take q ∈ Q to mean the projection of Q × Q onto either the first or the second slot).

Variational Mechanics and the Pontryagin Bundle
Lagrangian and Hamiltonian mechanics can be combined into Dirac mechanics [8,9], which is described on the Pontryagin bundle TQ ⊕ T*Q, which has position, velocity, and momentum as local coordinates.
Just as the Euler-Lagrange equations of motion arise out of Hamilton's principle, Hamilton's equations can also arise from Hamilton's phase space principle. On the Pontryagin bundle TQ ⊕ T*Q, which has local coordinates (q, v, p), a relaxation of Hamilton's principle (1) is the Hamilton-Pontryagin variational principle, which uses a Lagrange multiplier p to impose the second-order condition v = q̇,

$$\delta \int_{t_1}^{t_2} \left[ L(q, v) + \langle p, \dot q - v \rangle \right] dt = 0.$$

This encapsulates both Hamilton's and Hamilton's phase space variational principles, as well as the Legendre transform, and gives the implicit Euler-Lagrange equations,

$$\dot q = v, \qquad \dot p = \frac{\partial L}{\partial q}, \qquad p = \frac{\partial L}{\partial v}.$$

The last equation explicitly imposes the primary constraint condition, which is important when describing degenerate Lagrangian systems, such as electrical circuits. Note that the p are interpreted as Lagrange multipliers (Tulczyjew and Urbanski, [10]) in addition to their usual interpretation as conjugate momenta. The three equations can be combined by eliminating v and p to recover the Euler-Lagrange equations.
An important application of Hamilton-Jacobi theory is in optimal control theory. Consider a typical optimal control problem,

$$\min_{u(\cdot)} \int_0^T L(q(t), u(t))\, dt,$$

subject to the constraint

$$\dot q = f(q, u),$$

and the boundary conditions q(0) = q_0 and q(T) = q_T. We convert constrained optimization to unconstrained optimization by using Lagrange multipliers p (sometimes called the costate or auxiliary variables), and we can define the augmented cost functional,

$$\int_0^T \left\{ L(q, u) + \langle p, \dot q - f(q, u) \rangle \right\} dt = \int_0^T \left\{ \langle p, \dot q \rangle - \hat H(q, p, u) \right\} dt,$$

where we introduced the costate variables p, and also defined the control Hamiltonian, Ĥ(q, p, u) := ⟨p, f(q, u)⟩ − L(q, u).
The variables (q, p) form a Hamiltonian system, so we impose the optimality condition,

$$\frac{\partial \hat H}{\partial u}(q, p, u) = 0,$$

to obtain the equation for the optimal control u = u*(q, p), and we obtain the Hamiltonian, H(q, p) := max_u Ĥ(q, p, u) = Ĥ(q, p, u*(q, p)).
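We also define the optimal cost-to-go function,

$$C(q, t) := \int_t^T L\big( \bar q(s), u^*(\bar q(s), \bar p(s)) \big)\, ds = S^* - S(q, t),$$

where (q̄(s), p̄(s)) for s ∈ [0, T] is the solution of Hamilton's equations with the above H such that q̄(t) = q; S* is the optimal cost, and the function S(q, t) is defined by

$$S(q, t) := \int_0^t \left\{ \langle \bar p(s), \dot{\bar q}(s) \rangle - H(\bar q(s), \bar p(s)) \right\} ds.$$

Since this definition coincides with (14), the function S(q, t) = S* − C(q, t) satisfies the Hamilton-Jacobi equation (15); this reduces to the Hamilton-Jacobi-Bellman (HJB) equation for the optimal cost-to-go function C(q, t):

$$\frac{\partial C}{\partial t} + \min_u \left\{ \left\langle \frac{\partial C}{\partial q}, f(q, u) \right\rangle + L(q, u) \right\} = 0.$$

It can also be shown that the costate p of the optimal solution is related to the solution of the Hamilton-Jacobi-Bellman equation.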

Discrete Formulation of Geometric Mechanics
In this section, we review various schemes for discretizing mechanics (see, e.g., [11]). Geometric mechanics focuses on the differential geometric structure of the configuration manifolds, the associated symplectic and Poisson structures on the phase space, and the conservation laws generated by symmetries, and geometric structure-preserving numerical integration aims to preserve as many of these geometric properties as possible under discretization. The main idea is to start from the canonical symplectic form ω on T*Q, and look at the symplectomorphisms that preserve ω or its pullback via the Legendre transforms to Q × Q or TQ.

Symplectomorphisms from T*Q to TQ and to Q × Q
Given a cotangent bundle T*Q with a symplectic form ω, we wish to endow the bundles TQ and Q × Q with a symplectic structure. Given a function L : TQ → R, the Legendre transform is viewed as the fiber derivative FL : TQ → T*Q, (q, q̇) → (q, ∂L/∂q̇). The pullback of ω with respect to FL yields a symplectic structure ω_L = FL*ω on TQ.
Similarly, given a function L_d : Q × Q → R, we define the discrete fiber derivatives FL_d^± : Q × Q → T*Q, which serve as discrete Legendre transforms:

$$\mathbb{F}L_d^+(q_k, q_{k+1}) = \big( q_{k+1},\, D_2 L_d(q_k, q_{k+1}) \big), \qquad \mathbb{F}L_d^-(q_k, q_{k+1}) = \big( q_k,\, -D_1 L_d(q_k, q_{k+1}) \big).$$

Here D_1, D_2 refer to taking a derivative with respect to the first or second slot, respectively. The two choices of discrete fiber derivatives correspond to whether one views Q × Q as a bundle over Q with respect to π_− : (q_k, q_{k+1}) → q_k or π_+ : (q_k, q_{k+1}) → q_{k+1}, i.e., projection onto the first or the second slot. These induce symplectic structures

$$\omega_{L_d}^{\pm} = (\mathbb{F}L_d^{\pm})^* \omega.$$

Let F : T*Q → T*Q be a symplectic map and let the maps denoted by the dotted arrows in the figure above be defined by requiring that the diagram commutes. Then, these maps are also symplectic maps, the fiber derivative FL is a symplectomorphism between (TQ, ω_L) and (T*Q, ω), and the discrete fiber derivatives FL_d^± are symplectomorphisms between (Q × Q, ω_{L_d}^±) and (T*Q, ω).

Discrete Lagrangian Mechanics
The aim of geometric structure-preserving numerical integration is to preserve as many geometric conservation laws as possible under discretization. Discrete variational mechanics [11] is based on the discrete Hamilton's principle,

$$\delta \sum_{k=0}^{n-1} L_d(q_k, q_{k+1}) = 0, \tag{25}$$

where the endpoints q_0 and q_n are fixed, and the discrete Lagrangian, L_d : Q × Q → R, is a Type I generating function of the symplectic map. Recall that there exists an exact discrete Lagrangian L_d^E (16) that generates the exact time-h flow of a Lagrangian system, but it cannot be computed in general. One possible method of constructing computable discrete Lagrangians is the Galerkin approach, which involves replacing the infinite-dimensional function space C²([0, h], Q) and the integral in (16) with a finite-dimensional function space and a quadrature formula, respectively. Below are two examples of discrete Lagrangians:
(i) Midpoint rule

$$L_d(q_0, q_1) = h\, L\!\left( \frac{q_0 + q_1}{2}, \frac{q_1 - q_0}{h} \right).$$

This can be obtained from the Galerkin approach by considering the family of linear polynomials as the finite-dimensional function space, and the midpoint rule as the quadrature formula.
(ii) Störmer-Verlet integrator

$$L_d(q_0, q_1) = \frac{h}{2} \left[ L\!\left( q_0, \frac{q_1 - q_0}{h} \right) + L\!\left( q_1, \frac{q_1 - q_0}{h} \right) \right].$$

This can be obtained from the Galerkin approach by considering the family of linear polynomials as the finite-dimensional function space, and the trapezoidal rule as the quadrature formula.
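To make the relationship between a computable discrete Lagrangian and the exact one concrete, the following Python sketch compares the midpoint-rule discrete Lagrangian with the exact discrete Lagrangian of a harmonic oscillator, L(q, v) = v²/2 − ω²q²/2, for which the closed form L_d^E(q_0, q_1; h) = ω[(q_0² + q_1²) cos ωh − 2 q_0 q_1]/(2 sin ωh) is available (valid for sin ωh ≠ 0). The oscillator, the value of OMEGA, the initial data, and the function names are assumptions chosen for illustration. The endpoint q_1 is taken on the exact trajectory, and the error is measured at two step sizes.

```python
import math

OMEGA = 1.0  # hypothetical oscillator frequency, chosen for this example


def lagrangian(q, v):
    # Harmonic oscillator Lagrangian with unit mass.
    return 0.5 * v * v - 0.5 * OMEGA ** 2 * q * q


def exact_discrete_lagrangian(q0, q1, h):
    # Closed-form action of the oscillator between (0, q0) and (h, q1).
    s, c = math.sin(OMEGA * h), math.cos(OMEGA * h)
    return OMEGA / (2.0 * s) * ((q0 * q0 + q1 * q1) * c - 2.0 * q0 * q1)


def midpoint_discrete_lagrangian(q0, q1, h):
    # Linear interpolation in q combined with midpoint quadrature.
    return h * lagrangian(0.5 * (q0 + q1), (q1 - q0) / h)


def exact_endpoint(q0, v0, h):
    # Position at time h of the exact trajectory starting from (q0, v0).
    return q0 * math.cos(OMEGA * h) + v0 / OMEGA * math.sin(OMEGA * h)


def midpoint_error(q0, v0, h):
    q1 = exact_endpoint(q0, v0, h)
    return abs(midpoint_discrete_lagrangian(q0, q1, h)
               - exact_discrete_lagrangian(q0, q1, h))


# Halving h should shrink the error by roughly 2^3 for a second-order method.
ratio = midpoint_error(1.0, 0.5, 0.1) / midpoint_error(1.0, 0.5, 0.05)
```

The error of a discrete Lagrangian of order r scales like h^{r+1}, so a ratio close to 8 is what one would expect from the second-order midpoint rule in this setting.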
Performing variational calculus on the discrete Hamilton's principle (25) yields the discrete Euler-Lagrange (DEL) equations,

$$D_2 L_d(q_{k-1}, q_k) + D_1 L_d(q_k, q_{k+1}) = 0.$$

The above equation implicitly defines the discrete Lagrangian map F_{L_d} : (q_{k−1}, q_k) → (q_k, q_{k+1}) at points sufficiently close to the diagonal of Q × Q. This is equivalent to the implicit discrete Euler-Lagrange (IDEL) equations,

$$p_k = -D_1 L_d(q_k, q_{k+1}), \qquad p_{k+1} = D_2 L_d(q_k, q_{k+1}),$$

which is precisely the characterization of a symplectic map in terms of a Type I generating function.
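The discrete Euler-Lagrange equations can be solved explicitly in simple cases. The Python sketch below does so for the trapezoidal (Störmer-Verlet) discrete Lagrangian with L(q, v) = v²/2 − V(q), where the DEL equations reduce to q_{k+1} = 2q_k − q_{k−1} − h² V′(q_k). The quadratic potential, the step size, the initial data, and all function names are assumptions made for this illustration; momenta are recovered from p_k = D_2 L_d(q_{k−1}, q_k).

```python
import math


def grad_potential(q):
    # Hypothetical potential V(q) = q^2 / 2, so V'(q) = q.
    return q


def d1_discrete_lagrangian(q0, q1, h):
    # Slot-1 derivative of L_d = h[((q1-q0)/h)^2/2 - (V(q0)+V(q1))/2].
    return -(q1 - q0) / h - 0.5 * h * grad_potential(q0)


def d2_discrete_lagrangian(q0, q1, h):
    # Slot-2 derivative of the same discrete Lagrangian.
    return (q1 - q0) / h - 0.5 * h * grad_potential(q1)


def del_step(q_prev, q_curr, h):
    # Solve D2 L_d(q_prev, q_curr) + D1 L_d(q_curr, q_next) = 0 for q_next.
    return 2.0 * q_curr - q_prev - h * h * grad_potential(q_curr)


def integrate(q0, q1, h, steps):
    qs = [q0, q1]
    for _ in range(steps):
        qs.append(del_step(qs[-2], qs[-1], h))
    return qs


h = 0.1
trajectory = integrate(1.0, math.cos(h), h, 1000)

# Residual of the DEL equations on the first triple of points.
residual = (d2_discrete_lagrangian(trajectory[0], trajectory[1], h)
            + d1_discrete_lagrangian(trajectory[1], trajectory[2], h))

# Energy p^2/2 + V(q), with p_k = D2 L_d(q_{k-1}, q_k).
energies = [0.5 * d2_discrete_lagrangian(a, b, h) ** 2 + 0.5 * b * b
            for a, b in zip(trajectory, trajectory[1:])]
energy_spread = max(energies) - min(energies)
```

The energy is not conserved exactly, but in this example it stays within a narrow band over the whole run, which is the qualitative behavior one expects from a symplectic method.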
It implicitly defines the discrete Hamiltonian map F̃_{L_d} : (q_k, p_k) → (q_{k+1}, p_{k+1}), and it is symplectic with respect to the canonical symplectic form ω_can on T*Q, i.e., F̃_{L_d}* ω_can = ω_can. The two discrete fiber derivatives FL_d^± induce a single unique discrete symplectic form,

$$\omega_{L_d} = (\mathbb{F}L_d^{\pm})^* \omega_{\mathrm{can}},$$

and the discrete Lagrangian map is symplectic with respect to ω_{L_d}, i.e., F_{L_d}* ω_{L_d} = ω_{L_d}. The discrete Lagrangian and Hamiltonian maps can be expressed in terms of the discrete fiber derivatives,

$$F_{L_d} = (\mathbb{F}L_d^-)^{-1} \circ \mathbb{F}L_d^+, \qquad \tilde F_{L_d} = \mathbb{F}L_d^+ \circ (\mathbb{F}L_d^-)^{-1},$$

respectively. This characterization of the discrete flow maps underlies the proof of the variational error analysis theorem.
These properties may be summarized in a commutative diagram relating the discrete Lagrangian map F_{L_d} on Q × Q, the discrete Hamiltonian map F̃_{L_d} on T*Q, and the discrete fiber derivatives FL_d^±. If the exact discrete Lagrangian L_d^E is used, then the discrete Hamiltonian map F̃_{L_d} is equal to the time-h flow map of Hamilton's equations, and the dotted arrow is the time-h flow map of the Euler-Lagrange equations.
The variational integrator approach to constructing symplectic integrators simplifies the numerical analysis of these methods.In particular, the task of establishing the geometric conservation properties and order of accuracy of the discrete Lagrangian map F L d reduces to the simpler task of verifying certain properties of the discrete Lagrangian instead.

Discrete Hamilton-Jacobi Formulation
In the context of discrete variational mechanics, discrete Hamilton-Jacobi theory can be viewed as a composition theorem which relates the composition of symplectic maps generated by a Type II generating function $H_d^+(q_k, p_{k+1})$ with a symplectic map generated by a Type I generating function $S_d^k(q_0, q_k)$. By convention, the first argument $q_0$ in the composition generating function is typically omitted, and we simply consider it to be a function $S_d^k(q_k)$ of the final position $q_k$. The right discrete Hamiltonian, $H_d^+(q_k, p_{k+1})$ [12], is related to the discrete Lagrangian by the Legendre transform, $H_d^+(q_k, p_{k+1}) = p_{k+1} q_{k+1} - L_d(q_k, q_{k+1})$, where we impose the condition that $p_{k+1} = D_2 L_d(q_k, q_{k+1})$. Equivalently, this can be characterized variationally by $H_d^+(q_k, p_{k+1}) = \mathrm{ext}_{q_{k+1}} \left[ p_{k+1} q_{k+1} - L_d(q_k, q_{k+1}) \right]$. This leads to a discrete Hamilton's principle in phase space, $\delta \sum_{k=0}^{n-1} \left[ p_{k+1} q_{k+1} - H_d^+(q_k, p_{k+1}) \right] = 0$, which yields the right discrete Hamilton's equations (29), $q_{k+1} = D_2 H_d^+(q_k, p_{k+1})$, $p_k = D_1 H_d^+(q_k, p_{k+1})$, which is precisely the characterization of a symplectic map in terms of a Type II generating function.
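The continuous Hamilton-Jacobi equation can be derived by considering the evolution properties of Jacobi's solution, which is the action integral evaluated along the solution of the Euler-Lagrange equations. One can derive a discrete Hamilton-Jacobi theory by considering a discrete analogue of Jacobi's solution, expressed in terms of the right discrete Hamiltonian, $S_d^k(q_k) = S_d^0(q_0) + \sum_{l=0}^{k-1} \left[ p_{l+1} q_{l+1} - H_d^+(q_l, p_{l+1}) \right]$, which we evaluate along a solution of the right discrete Hamilton's equations (29). From this, we have (30), $S_d^{k+1}(q_{k+1}) - S_d^k(q_k) = p_{k+1} q_{k+1} - H_d^+(q_k, p_{k+1})$, where $p_{k+1}$ is considered to be a function of $q_k$ and $q_{k+1}$. Taking derivatives with respect to $q_{k+1}$, we obtain $DS_d^{k+1}(q_{k+1}) = p_{k+1} + \frac{\partial p_{k+1}}{\partial q_{k+1}} \left( q_{k+1} - D_2 H_d^+(q_k, p_{k+1}) \right)$, but the term inside the parentheses vanishes as we are restricting this to a solution of the right discrete Hamilton's equations. Therefore, we have that $p_{k+1} = DS_d^{k+1}(q_{k+1})$, which when substituted into (30) yields the discrete Hamilton-Jacobi equation, $S_d^{k+1}(q_{k+1}) - S_d^k(q_k) = DS_d^{k+1}(q_{k+1}) \cdot q_{k+1} - H_d^+(q_k, DS_d^{k+1}(q_{k+1}))$.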
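The discrete Hamilton-Jacobi construction can be checked by hand on a simple case. The following worked example assumes a free particle, $L(q, v) = v^2/2$, which is not discussed in the paper and is chosen here only because every quantity is available in closed form. The discrete Jacobi solution is normalized by $S_d^0 \equiv 0$, and the last line verifies the discrete Hamilton-Jacobi equation along a solution with constant momentum $p$, for $k \geq 1$.

```latex
% Worked example (illustrative assumption): free particle, L(q, v) = v^2 / 2
\begin{align*}
L_d(q_k, q_{k+1}) &= \frac{(q_{k+1} - q_k)^2}{2h},
  \qquad p_{k+1} = D_2 L_d(q_k, q_{k+1}) = \frac{q_{k+1} - q_k}{h}, \\
H_d^+(q_k, p_{k+1}) &= p_{k+1} q_{k+1} - L_d(q_k, q_{k+1})
  = p_{k+1} q_k + \tfrac{h}{2}\, p_{k+1}^2, \\
q_{k+1} &= D_2 H_d^+(q_k, p_{k+1}) = q_k + h\, p_{k+1},
  \qquad p_k = D_1 H_d^+(q_k, p_{k+1}) = p_{k+1} \equiv p, \\
S_d^k(q_k) &= \frac{(q_k - q_0)^2}{2kh},
  \qquad DS_d^{k+1}(q_{k+1}) = \frac{q_{k+1} - q_0}{(k+1)h} = p, \\
S_d^{k+1}(q_{k+1}) - S_d^k(q_k) &= \tfrac{h}{2}\, p^2
  = p\, q_{k+1} - \left( p\, q_k + \tfrac{h}{2}\, p^2 \right)
  = DS_d^{k+1}(q_{k+1})\, q_{k+1} - H_d^+\!\left(q_k, DS_d^{k+1}(q_{k+1})\right).
\end{align*}
```

Here the momentum is conserved exactly, so the discrete Jacobi solution is simply the accumulated discrete action $k \cdot h p^2 / 2$ with $q_k - q_0 = k h p$.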

Discrete Hamilton-Pontryagin Principle
Leok and Ohsawa [13] considered the discrete Hamilton's principle, relaxed the discrete second-order condition, $q_k^1 = q_{k+1}^0$, and reimposed it using Lagrange multipliers $p_{k+1}$, in order to derive the discrete Hamilton-Pontryagin principle, $\delta \sum_{k=0}^{n-1} \left[ L_d(q_k^0, q_k^1) + p_{k+1} (q_{k+1}^0 - q_k^1) \right] = 0$. Here, the superscript 0 or 1 on $q_k^i$ refers to the first or second slot, respectively, in $Q \times Q$. This in turn yields the implicit discrete Euler-Lagrange equations, $q_k^1 = q_{k+1}^0$, $p_{k+1} = D_2 L_d(q_k^0, q_k^1)$, $p_k = -D_1 L_d(q_k^0, q_k^1)$, where $D_1$, $D_2$ denote, as before, the partial derivatives with respect to the first and second argument of $L_d$.
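Making the identification $q_k = q_k^0 = q_{k-1}^1$, the last two equations define the discrete fiber derivatives, $\mathbb{F}L_d^\pm : Q \times Q \to T^*Q$, as given by (23) and (24). Discrete fiber derivatives induce a discrete symplectic form, $\omega_{L_d} \equiv (\mathbb{F}L_d^\pm)^* \omega_{can}$, and the discrete Lagrangian map $F_{L_d} = (\mathbb{F}L_d^-)^{-1} \circ \mathbb{F}L_d^+$ and the discrete Hamiltonian map $\tilde{F}_{L_d} = \mathbb{F}L_d^+ \circ (\mathbb{F}L_d^-)^{-1}$ preserve $\omega_{L_d}$ and $\omega_{can}$, respectively.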

Statistical Structure on M
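On a differentiable manifold $M$ endowed with a metric $g$ and a torsion-free affine connection $\nabla$, the compatibility of $g$ and $\nabla$ is encoded by the cubic form, the 3-tensor field $C = \nabla g$, i.e., the covariant derivative of $g$. In a local coordinate system with basis $\partial_i \equiv \partial/\partial x^i$, the metric tensor $g$ is locally represented by (33), $g_{ij}(x) = g(\partial_i, \partial_j)$, and the components of $\nabla$ take the contravariant form $\Gamma^l_{ij}$, where (34) $\nabla_{\partial_i} \partial_j = \sum_l \Gamma^l_{ij}\, \partial_l$, or the covariant form $\Gamma_{ij,k}$, where (35) $\Gamma_{ij,k} = g(\nabla_{\partial_i} \partial_j, \partial_k) = \sum_l g_{lk} \Gamma^l_{ij}$. Torsion-freeness of $\nabla$ implies the symmetry of its (first) two lower indices, i.e., $\Gamma^l_{ij} = \Gamma^l_{ji}$ and $\Gamma_{ij,k} = \Gamma_{ji,k}$.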
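We can now compute the cubic form, $C(\partial_i, \partial_j, \partial_k) = (\nabla_{\partial_k} g)(\partial_i, \partial_j) = \partial_k\, g(\partial_i, \partial_j) - g(\nabla_{\partial_k} \partial_i, \partial_j) - g(\partial_i, \nabla_{\partial_k} \partial_j)$, or in components (36), $C_{ijk} = \partial_k g_{ij} - \Gamma_{ki,j} - \Gamma_{kj,i}$. When the cubic form is identically zero, $\nabla$ is said to be parallel with respect to $g$. A torsion-free connection parallel to $g$ is called the Levi-Civita connection $\hat{\nabla}$ associated to the given metric $g$ (37): $\partial_k g_{ij} = \hat{\Gamma}_{ki,j} + \hat{\Gamma}_{kj,i}$. The fundamental theorem of Riemannian geometry establishes the existence and uniqueness of the Levi-Civita connection, which is a solution of (37), and is given by $\hat{\Gamma}_{ij,k} = \frac{1}{2} \left( \partial_i g_{jk} + \partial_j g_{ik} - \partial_k g_{ij} \right)$.
Generalizing the notion of parallelism of a connection is the notion of conjugacy (denoted by $*$) between two connections. A connection $\nabla^*$ is said to be conjugate (or dual) to $\nabla$ with respect to $g$ if (38) $Z\, g(X, Y) = g(\nabla_Z X, Y) + g(X, \nabla^*_Z Y)$ for all vector fields $X$, $Y$, $Z$. Clearly, $(\nabla^*)^* = \nabla$. Moreover, $\hat{\nabla}$, which satisfies (37), is special in the sense that it is self-conjugate, $(\hat{\nabla})^* = \hat{\nabla}$. Writing out (38) in components gives (39), $\partial_k g_{ij} = \Gamma_{ki,j} + \Gamma^*_{kj,i}$, where $\Gamma^*_{kj,i}$ is defined analogously to (34) and (35). From (36) and (38), the cubic form can now be written as (40) $C_{ijk} = \Gamma^*_{kj,i} - \Gamma_{kj,i}$. Clearly, by (36), $C_{ijk} = C_{jik}$, i.e., it is symmetric with respect to its first two indices. When both $\nabla$ and $\nabla^*$ are torsion-free, (40) implies that $C_{ijk} = C_{ikj}$, which leads to $C$ being totally symmetric in all the indices.
Requiring that $C_{ijk}$ is totally symmetric imposes a compatibility condition between $g$ and $\nabla$, so that they form a Codazzi pair (see Simon [14]), which generalizes the Levi-Civita coupling (whose corresponding cubic form is $C_{ijk} \equiv 0$). Lauritzen [15] defined a statistical manifold $(M, g, \nabla)$ to be a manifold $M$ equipped with a metric $g$ and a connection $\nabla$ such that (i) $\nabla$ is torsion-free; (ii) $\nabla g \equiv C$ is totally symmetric. Equivalently, a manifold has a statistical structure when the conjugate (with respect to $g$) $\nabla^*$ of a torsion-free connection $\nabla$ is also torsion-free. In this case, $\nabla^* g = -C$, and the Levi-Civita connection is $\hat{\nabla} = (\nabla + \nabla^*)/2$.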
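The curvature/flatness of a connection $\nabla$ is described by the Riemann curvature tensor $R$, defined as $R(\partial_i, \partial_j) \partial_k = (\nabla_{\partial_i} \nabla_{\partial_j} - \nabla_{\partial_j} \nabla_{\partial_i}) \partial_k = \sum_l R^l_{kij}\, \partial_l$, and substituting (34), the components of the Riemann curvature tensor are $R^l_{kij} = \partial_i \Gamma^l_{jk} - \partial_j \Gamma^l_{ik} + \sum_m \left( \Gamma^l_{im} \Gamma^m_{jk} - \Gamma^l_{jm} \Gamma^m_{ik} \right)$. By definition, $R^l_{kij}$ is antisymmetric when $i \leftrightarrow j$. The covariant form of the Riemann curvature is $R_{lkij} = \sum_m g_{lm} R^m_{kij}$. When the connection is the Levi-Civita connection (torsion-free and parallel with respect to $g$), $R_{lkij}$ is antisymmetric when $i \leftrightarrow j$ or when $k \leftrightarrow l$, and symmetric when $(i, j) \leftrightarrow (l, k)$. It is related to the Ricci tensor Ric by $\mathrm{Ric}_{kj} = \sum_{i,l} R_{lkij}\, g^{il}$. In addition, it can be shown that the curvatures $R_{lkij}$, $R^*_{lkij}$ for the pair of conjugate connections $\nabla$, $\nabla^*$ satisfy $g(R(X, Y) Z, W) + g(Z, R^*(X, Y) W) = 0$, i.e., $R_{lkij} = -R^*_{klij}$. A connection is said to be flat when $R^l_{kij}(x) \equiv 0$. So, $\nabla$ is flat if and only if $\nabla^*$ is flat. In this case, the manifold is said to be dually-flat, and the metric $g$ takes on a particular form (to be discussed later).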

Divergence Function and Induced Geometry
Here $D_i(x, y) = \partial_{x^i} D(x, y)$ and $D_{,i}(x, y) = \partial_{y^i} D(x, y)$ denote partial derivatives with respect to the $i$-th component of point $x$ and of point $y$, respectively, $D_{i,j}(x, y) = \partial_{x^i} \partial_{y^j} D(x, y)$ denotes the second-order mixed derivative, and so on.
On a manifold, divergence functions act as pseudo-distance functions that are nonnegative but need not be symmetric. Every divergence function induces a dualistic Riemannian structure, i.e., a statistical structure, as was first demonstrated by Eguchi (see [16]).
Lemma 1. A divergence function $D$ induces a Riemannian metric $g$ and a pair of torsion-free conjugate connections $\nabla$, $\nabla^*$ given as $g_{ij}(x) = -D_{i,j}(x, y)|_{x=y}$, $\Gamma_{ij,k}(x) = -D_{ij,k}(x, y)|_{x=y}$, $\Gamma^*_{ij,k}(x) = -D_{k,ij}(x, y)|_{x=y}$.
The connections $\Gamma_{ij,k}$, $\Gamma^*_{ij,k}$ are torsion-free and are conjugate with respect to the induced metric $g_{ij}$. Hence, the divergence function $D$ induces $(M, g, \nabla, \nabla^*)$, which is a statistical manifold (Lauritzen [15]).
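A popular divergence function is the Bregman divergence $B_\Phi(x, y)$ [17], which is associated to a strictly convex function $\Phi$ (41): $B_\Phi(x, y) = \Phi(x) - \Phi(y) - \langle x - y, \partial\Phi(y) \rangle_n$, where $\partial$ denotes the exterior derivative, and $\langle \cdot, \cdot \rangle_n$ denotes the canonical pairing of a vector $x = [x^1, \dots, x^n]$ and a dual vector $u = [u_1, \dots, u_n]$, $\langle x, u \rangle_n = \sum_{i=1}^n x^i u_i$. Where there is no danger of confusion, the subscript $n$ in $\langle \cdot, \cdot \rangle_n$ is often omitted. A basic fact in convex analysis is that the necessary and sufficient condition for a smooth function $\Phi$ to be strictly convex is (43) $\Phi(x) - \Phi(y) - \langle x - y, \partial\Phi(y) \rangle > 0$ for all $x \neq y$.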
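Recall that when $\Phi$ is convex, its convex conjugate, $\tilde{\Phi} : \tilde{V} \subseteq \mathbb{R}^n \to \mathbb{R}$, is defined through the Legendre transform: $\tilde{\Phi}(u) = \langle (\partial\Phi)^{-1}(u), u \rangle - \Phi((\partial\Phi)^{-1}(u))$, with $\tilde{\tilde{\Phi}} = \Phi$ and $(\partial\Phi) = (\partial\tilde{\Phi})^{-1}$. Since $\tilde{\Phi}$ is also convex, by (43), we obtain the Fenchel inequality, $\Phi(x) + \tilde{\Phi}(u) - \langle x, u \rangle \geq 0$ for any $x \in V$, $u \in \tilde{V}$, with equality holding if and only if (45) $u = (\partial\Phi)(x) \Longleftrightarrow x = (\partial\tilde{\Phi})(u)$, or, in component form, $u_i = \partial\Phi / \partial x^i \Longleftrightarrow x^i = \partial\tilde{\Phi} / \partial u_i$.
Using conjugate variables, we can introduce the canonical divergence $A_\Phi : V \times \tilde{V} \to \mathbb{R}_{\geq 0}$ (and $A_{\tilde{\Phi}} : \tilde{V} \times V \to \mathbb{R}_{\geq 0}$), $A_\Phi(x, v) = \Phi(x) + \tilde{\Phi}(v) - \langle x, v \rangle = A_{\tilde{\Phi}}(v, x)$. They are related to the Bregman divergence (41) via the relation $B_\Phi(x, (\partial\Phi)^{-1}(v)) = A_\Phi(x, v) = B_{\tilde{\Phi}}(v, (\partial\Phi)(x))$.
Though the Bregman divergence is not a metric, it satisfies a quadrilateral relation [18]: for any four points $x, x', x'', x''' \in V$, $B_\Phi(x, x') + B_\Phi(x''', x'') - B_\Phi(x, x'') - B_\Phi(x''', x') = \langle x - x''', \partial\Phi(x'') - \partial\Phi(x') \rangle$. As a special case, when $x''' = x''$, so that $B_\Phi(x''', x'') = 0$, the above equality reduces to the Pythagorean (generalized cosine) relation among three points $x, x', x'' \in V$: $B_\Phi(x, x'') + B_\Phi(x'', x') - B_\Phi(x, x') = \langle x - x'', \partial\Phi(x') - \partial\Phi(x'') \rangle$. This is the Pythagorean relation [3] for a dually-flat space. Using this relation, one can state minimization problems for divergence functions.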
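The quadrilateral relation can be expressed in terms of the canonical divergence $A_\Phi$ as follows, $A_\Phi(x, u) + A_\Phi(y, v) - A_\Phi(x, v) - A_\Phi(y, u) = \langle x - y, v - u \rangle$, for any four points $x, y \in V$ and $v, u \in \tilde{V}$.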
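The identities in this subsection are easy to check numerically. The sketch below assumes the one-dimensional convex function $\Phi(x) = e^x$, whose conjugate is $\tilde{\Phi}(u) = u \log u - u$ for $u > 0$; this choice and all function names are illustrative and not taken from the paper.

```python
import math

# Illustrative assumption: Phi(x) = exp(x) on R, with convex conjugate
# Phi~(u) = u*log(u) - u on u > 0, so that dPhi = exp and dPhi~ = log.

def phi(x):
    return math.exp(x)

def dphi(x):
    return math.exp(x)

def phi_conj(u):
    return u * math.log(u) - u

def dphi_conj(u):
    return math.log(u)

def bregman(x, y):
    """B_Phi(x, y) = Phi(x) - Phi(y) - (x - y) * dPhi(y)."""
    return phi(x) - phi(y) - (x - y) * dphi(y)

def bregman_conj(u, v):
    """Bregman divergence of the conjugate function, B_Phi~(u, v)."""
    return phi_conj(u) - phi_conj(v) - (u - v) * dphi_conj(v)

def canonical(x, v):
    """A_Phi(x, v) = Phi(x) + Phi~(v) - x*v, the Fenchel duality gap."""
    return phi(x) + phi_conj(v) - x * v
```

With these definitions, `canonical(x, dphi(x))` vanishes (Fenchel equality), `canonical(x, v)` agrees with `bregman(x, dphi_conj(v))`, and the four-point quadrilateral relation can be tested on arbitrary inputs.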

Note that $D^{(\alpha)}_\Phi(x, y)$ satisfies the relation (called referential duality in [5,19]) $D^{(\alpha)}_\Phi(x, y) = D^{(-\alpha)}_\Phi(y, x)$, that is, exchanging the two points in the directed distance amounts to $\alpha \leftrightarrow -\alpha$.

Hessian Manifolds and Biorthogonal Coordinates
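Applying Lemma 1 to the Bregman divergence $B_\Phi$ induces the following metric, $g_{ij}(x) = \frac{\partial^2 \Phi(x)}{\partial x^i \partial x^j}$, and the pair of torsion-free conjugate connections, $\Gamma_{ij,k}(x) = 0$, $\Gamma^*_{ij,k}(x) = \frac{\partial^3 \Phi(x)}{\partial x^i \partial x^j \partial x^k}$. In this case, $M$ is dually-flat. This yields a Hessian manifold, where $g$ takes the form of the Hessian of a strictly convex function $\Phi$. More generally, as shown in [5], the $\Phi$-divergence $D^{(\alpha)}_\Phi$ of (47), which degenerates to the Bregman divergence $B_\Phi$ when $\alpha = \pm 1$, induces an $\alpha$-independent Hessian metric $g_{ij}$ along with the following $\alpha$-connections, $\Gamma^{(\alpha)}_{ij,k}(x) = \frac{1 - \alpha}{2} \frac{\partial^3 \Phi(x)}{\partial x^i \partial x^j \partial x^k}$, $\Gamma^{*(\alpha)}_{ij,k}(x) = \frac{1 + \alpha}{2} \frac{\partial^3 \Phi(x)}{\partial x^i \partial x^j \partial x^k}$. Hessian manifolds enjoy a special status in information geometry, as they exhibit biorthogonal coordinates on $M$ that are globally affine coordinates despite the nontrivial Riemannian (Hessian) metric on $M$.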
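The statement that Lemma 1 applied to a Bregman divergence returns the Hessian of $\Phi$ can be illustrated numerically. The sketch below approximates $g_{ij}(x) = -D_{i,j}(x, y)|_{x=y}$ by central differences; the two-dimensional convex function `phi` and the helper `induced_metric` are hypothetical choices made for this illustration.

```python
import math

# Illustrative assumption: Phi(x) = exp(x0) + exp(x1) + (x0 + x1)^2 / 2,
# which is strictly convex on R^2.

def phi(x):
    return math.exp(x[0]) + math.exp(x[1]) + 0.5 * (x[0] + x[1]) ** 2

def grad_phi(x):
    s = x[0] + x[1]
    return [math.exp(x[0]) + s, math.exp(x[1]) + s]

def hess_phi(x):
    return [[math.exp(x[0]) + 1.0, 1.0], [1.0, math.exp(x[1]) + 1.0]]

def bregman(x, y):
    """B_Phi(x, y) = Phi(x) - Phi(y) - <x - y, grad Phi(y)>."""
    g = grad_phi(y)
    return phi(x) - phi(y) - sum((a - b) * c for a, b, c in zip(x, y, g))

def induced_metric(D, x, eps=1e-3):
    """Approximate g_ij(x) = -d^2 D / (dx^i dy^j) at y = x by central differences."""
    n = len(x)

    def shifted(point, k, step):
        moved = list(point)
        moved[k] += step
        return moved

    g = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            mixed = (D(shifted(x, i, eps), shifted(x, j, eps))
                     - D(shifted(x, i, eps), shifted(x, j, -eps))
                     - D(shifted(x, i, -eps), shifted(x, j, eps))
                     + D(shifted(x, i, -eps), shifted(x, j, -eps))) / (4 * eps * eps)
            g[i][j] = -mixed
    return g
```

Comparing `induced_metric(bregman, x)` with `hess_phi(x)` at any point should show agreement up to the finite-difference error. The same helper accepts any divergence function `D(x, y)`, so it can also be used to inspect metrics induced by divergences that are not of Bregman type.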
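Consider the coordinate transform $x \mapsto u$, under which $\partial^i \equiv \partial/\partial u_i = \sum_l F^{li}\, \partial_l$, where the Jacobian matrix $F$ is given by (48), $F_{ij}(x) = \frac{\partial u_i}{\partial x^j}$, $F^{ij}(u) = \frac{\partial x^i}{\partial u_j}$, $\sum_l F_{il} F^{lj} = \delta_i^j$, where $\delta_i^j$ is the Kronecker delta, which takes the value 1 when $i = j$ and 0 otherwise. If the new coordinate system $u = [u_1, \dots, u_n]$ (with components denoted by subscripts) is such that (49) $F_{ij}(x) = g_{ij}(x)$, then the $x$-coordinate system and the $u$-coordinate system are said to be biorthogonal to each other, since, from the definition of the metric tensor (33), $g(\partial_i, \partial^j) = \sum_l F^{lj} g(\partial_i, \partial_l) = \sum_l F^{lj} g_{il} = \delta_i^j$. In this case, we define $g^{ij}(u) = g(\partial^i, \partial^j)$, which is equal to $F^{ij}$, the Jacobian of the inverse coordinate transform $u \mapsto x$.
We also introduce the contravariant representation of the affine connection $\nabla$ with respect to the $u$-coordinate system, and denote it by the unconventional notation $\Gamma^{rs}_t$, which is defined by $\nabla_{\partial^r} \partial^s = \sum_t \Gamma^{rs}_t\, \partial^t$. The covariant representation of the affine connections will be denoted by superscripted $\Gamma$ and $\Gamma^*$, $\Gamma^{rs,t}(u) = g(\nabla_{\partial^r} \partial^s, \partial^t)$, $\Gamma^{*rs,t}(u) = g(\nabla^*_{\partial^r} \partial^s, \partial^t)$. The representations of the affine connections in the $u$-coordinate system (denoted by superscripts) and the $x$-coordinate system (denoted by subscripts) are related by $\Gamma^{rs}_t(u) = \sum_{i,j,k} \frac{\partial x^i}{\partial u_r} \frac{\partial x^j}{\partial u_s} \frac{\partial u_t}{\partial x^k} \Gamma^k_{ij}(x) + \sum_k \frac{\partial u_t}{\partial x^k} \frac{\partial^2 x^k}{\partial u_r \partial u_s}$ and (53) $\Gamma^{rs,t}(u) = \sum_{i,j,k} \frac{\partial x^i}{\partial u_r} \frac{\partial x^j}{\partial u_s} \frac{\partial x^k}{\partial u_t} \Gamma_{ij,k}(x) + \frac{\partial^2 x^t}{\partial u_r \partial u_s}$. Similar relations hold between $\Gamma^{*rs}_t(u)$ and $\Gamma^{*k}_{ij}(x)$, and between $\Gamma^{*rs,t}(u)$ and $\Gamma^*_{ij,k}(x)$.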
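Analogous to (39), we have the following identity, $\partial^t g^{rs}(u) = \frac{\partial^2 x^r}{\partial u_s \partial u_t} = \Gamma^{tr,s}(u) + \Gamma^{*ts,r}(u)$. Therefore, with respect to biorthogonal coordinates, a pair of conjugate connections $\nabla$, $\nabla^*$ satisfy $\Gamma^{*ts,r}(u) = -\sum_{i,j,k} g^{it}(u)\, g^{jr}(u)\, g^{ks}(u)\, \Gamma_{ij,k}(x)$ and $\Gamma^{*ts}_r(u) = -\sum_j g^{jt}(u)\, \Gamma^s_{jr}(x)$.
We now investigate conditions for the existence of biorthogonal coordinates on a Riemannian manifold $(M, g)$. From the definition (49), it can easily be shown that:
Proposition 1 ([20]). A Riemannian manifold $M$ with metric $g_{ij}$ admits biorthogonal coordinates if and only if (56) $\frac{\partial g_{ij}(x)}{\partial x^k} = \frac{\partial g_{ik}(x)}{\partial x^j}$, i.e., $\partial g_{ij} / \partial x^k$ is totally symmetric. In this case, $M$ is Hessian.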
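That (56) is satisfied for biorthogonal coordinates is evident by virtue of (48) and (49). Conversely, given (56), there must be $n$ functions $u_i(x)$, $i = 1, \dots, n$, such that $\frac{\partial u_i(x)}{\partial x^j} = g_{ij}(x) = g_{ji}(x) = \frac{\partial u_j(x)}{\partial x^i}$. The above identity implies that there exists a function $\Phi$ such that $u_i = \partial_i \Phi$ and, by positive definiteness of $g_{ij}$, $\Phi$ must be a strictly convex function. In this case, the $x$- and $u$-variables satisfy (45), and the pair of convex functions, $\Phi$ and its conjugate $\tilde{\Phi}$, are related to $g_{ij}$ and $g^{ij}$ by $g_{ij}(x) = \frac{\partial^2 \Phi(x)}{\partial x^i \partial x^j} \Longleftrightarrow g^{ij}(u) = \frac{\partial^2 \tilde{\Phi}(u)}{\partial u_i \partial u_j}$. It follows from Proposition 1 that a necessary and sufficient condition for a Riemannian manifold to admit biorthogonal coordinates is that its Levi-Civita connection is given by $\hat{\Gamma}_{ij,k}(x) = \frac{1}{2} \frac{\partial g_{ij}(x)}{\partial x^k} = \frac{1}{2} \frac{\partial^3 \Phi(x)}{\partial x^i \partial x^j \partial x^k}$. From this, the following can be shown: A Riemannian manifold $(M, g)$ admits a pair of biorthogonal coordinates $x$ and $u$ if and only if there exists a pair of conjugate connections $\nabla$ and $\nabla^*$ such that $\Gamma_{ij,k}(x) = 0$, $\Gamma^{*rs,t}(u) = 0$.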
In other words, biorthogonal coordinates are affine coordinates for the dually-flat pair of connections. In fact, we can now define a pair of torsion-free connections by $\Gamma_{ij,k}(x) = 0$, $\Gamma^*_{ij,k}(x) = \frac{\partial g_{ij}(x)}{\partial x^k}$, and show that they are conjugate with respect to $g$, that is, they satisfy (38). This means that we select an affine connection $\nabla$ such that $x$ is its affine coordinate system. From (53), when $\nabla^*$ is expressed in $u$-coordinates, the total symmetry of $\partial g_{ij} / \partial x^k$ gives $\Gamma^{*rs,t}(u) = 0$. This implies that $u$ is an affine coordinate system with respect to $\nabla^*$. Furthermore, $\Gamma^{rs,t}(u) = \frac{\partial^3 \tilde{\Phi}(u)}{\partial u_r \partial u_s \partial u_t}$, where $\tilde{\Phi}$ is the convex conjugate of $\Phi$. Therefore, biorthogonal coordinates are affine coordinates for a pair of dually-flat connections. On the manifold of parameterized probability density functions, if the $x$-coordinates are the natural parameters, then the $u$-coordinates are the expectations.
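The correspondence between natural parameters and expectations can be made concrete with a small example. The sketch below assumes the Bernoulli family, which is not discussed in the paper: the log-partition function $\Phi(\theta) = \log(1 + e^\theta)$ plays the role of the convex potential, the expectation parameter is $u = \partial\Phi(\theta)$, and the conjugate potential is the negative entropy. All function names are illustrative.

```python
import math

# Illustrative assumption: Bernoulli exponential family with natural
# parameter theta and expectation parameter u = E[X] in (0, 1).

def log_partition(theta):
    """Convex potential Phi(theta) = log(1 + exp(theta))."""
    return math.log1p(math.exp(theta))

def expectation(theta):
    """u = dPhi/dtheta, the logistic function."""
    return 1.0 / (1.0 + math.exp(-theta))

def neg_entropy(u):
    """Conjugate potential Phi~(u) = u log u + (1 - u) log(1 - u)."""
    return u * math.log(u) + (1.0 - u) * math.log(1.0 - u)

def natural(u):
    """theta = dPhi~/du, the logit function (inverse of expectation)."""
    return math.log(u / (1.0 - u))

def metric_natural(theta):
    """g(theta) = Phi''(theta) = u (1 - u), the Fisher information."""
    u = expectation(theta)
    return u * (1.0 - u)

def metric_expectation(u):
    """g~(u) = Phi~''(u) = 1 / (u (1 - u)), the inverse of g(theta)."""
    return 1.0 / (u * (1.0 - u))
```

In one dimension, biorthogonality reduces to the statement that `metric_natural(theta) * metric_expectation(u)` equals 1 when `u = expectation(theta)`, and the Fenchel equality reads `log_partition(theta) + neg_entropy(u) == theta * u` up to rounding.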

Linking Information Geometry with Geometric Mechanics
Symplectic Structure on Q × Q Induced from the Divergence Function D
We will now establish the connection between information geometry and discrete geometric mechanics. The divergence function from information geometry can be viewed as a Type I generating function of a symplectic map, and in particular, it can be viewed as a discrete Lagrangian in the sense of discrete Lagrangian mechanics. More specifically, let the configuration manifold be the information manifold, i.e., $Q = M$, and the discrete Lagrangian be the divergence function, i.e., $L_d = D$. With this identification, we observe that the information geometric construction of the symplectic structure on $M \times M$ described below is nothing but the discrete symplectic structure on $Q \times Q$ given in (28), where the discrete Lagrangian $L_d$ is replaced with the divergence function $D$.
From information geometry, a divergence function $D$ is given as a scalar-valued binary function on $Q$ (of dimension $n$). We now view it as a unary function on $Q \times Q$ (of dimension $2n$) that vanishes along the diagonal $\Delta_Q \subset Q \times Q$. In this subsection, we investigate the conditions under which a divergence function can serve as a generating function of a symplectic structure on $Q \times Q$. A compatible metric on $Q \times Q$ will also be derived. When restricted to the diagonal submanifold $\Delta_Q$, the skew-symmetric symplectic form will vanish, so $\Delta_Q$, which carries a statistical structure, is actually a Lagrangian submanifold (see [21,22]).
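First, we fix either a point $y$ in the second slot or a point $x$ in the first slot of $(x, y) \in Q \times Q$. This results in two $n$-dimensional submanifolds of $Q \times Q$, denoted by $Q_x = Q \times \{y\} \simeq Q$ (with the $y$ point fixed) and $Q_y = \{x\} \times Q \simeq Q$ (with the $x$ point fixed), respectively. The canonical symplectic form $\omega_x$ on the cotangent bundle $T^*Q_x$ is given by $\omega_x = \sum_i dx^i \wedge d\xi_i$. Given $D$, we define a map $L_D : Q \times Q \to T^*Q_x$, $(x, y) \mapsto (x, \xi)$, by $\xi_i = D_i(x, y)$. Recall that the comma in the subscript of a divergence function $D$ indicates whether it is being differentiated with respect to a variable in the first or second slot. It is easily checked that there exists a neighborhood of the diagonal $\Delta_Q \subset Q \times Q$ such that the map $L_D$ is a diffeomorphism. In particular, the Jacobian of the map is given by the block matrix $\begin{pmatrix} \delta_{ij} & D_{ij} \\ 0 & D_{i,j} \end{pmatrix}$, which is nondegenerate in a neighborhood of the diagonal $\Delta_Q$. We calculate the pullback by $L_D$ of the canonical symplectic form $\omega_x$ on $T^*Q_x$ to $Q \times Q$: $L_D^* \omega_x = \sum_i dx^i \wedge dD_i(x, y) = \sum_{ij} dx^i \wedge \left( D_{ij}\, dx^j + D_{i,j}\, dy^j \right) = \sum_{ij} D_{i,j}\, dx^i \wedge dy^j$. Here, $\sum_{ij} D_{ij}\, dx^i \wedge dx^j = 0$, since by the equality of mixed partials, $D_{ij}(x, y) = D_{ji}(x, y)$ always holds.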
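Similarly, we consider the canonical symplectic form $\omega_y = \sum_i dy^i \wedge d\eta_i$ on $T^*Q_y$ and define a map $R_D : Q \times Q \to T^*Q_y$, $(x, y) \mapsto (y, \eta)$, by $\eta_i = D_{,i}(x, y)$. Using $R_D$ to pull back $\omega_y$ to $Q \times Q$ yields an analogous formula: $R_D^* \omega_y = \sum_i dy^i \wedge dD_{,i}(x, y) = -\sum_{ij} D_{i,j}\, dx^i \wedge dy^j$. Therefore, based on the canonical symplectic forms on $T^*Q_x$ and $T^*Q_y$, we obtain, up to a sign, the same symplectic form on $Q \times Q$.
Theorem 1 ([22]). A divergence function $D$ induces a symplectic form (57), $\omega_D(x, y) = -\sum_{ij} D_{i,j}(x, y)\, dx^i \wedge dy^j$, on $Q \times Q$, which is, up to a sign, the pullback of the canonical symplectic forms $\omega_x$ and $\omega_y$ by the maps $L_D$ and $R_D$: $L_D^* \omega_x = -\omega_D$, $R_D^* \omega_y = \omega_D$.
With the symplectic form $\omega_D$ given as above, it is easy to check that $\omega_D$ is closed, $d\omega_D = 0$, since $D_{ik,j}$ is symmetric in $(i, k)$ and $D_{i,jk}$ is symmetric in $(j, k)$. It was Barndorff-Nielsen and Jupp [21] who first proposed (57) as an induced symplectic form on $Q \times Q$, apart from a minus sign; they called the divergence function $D$ a yoke.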
The fact that this symplectic structure coincides with the one introduced in discrete mechanics should come as no surprise. The $Q_x$ and $Q_y$ submanifolds are related to the two ways of viewing $Q \times Q$ as a bundle over $Q$, depending on whether one chooses $\pi_1 : (x, y) \mapsto x$ or $\pi_2 : (x, y) \mapsto y$ as the bundle projection. Then, the maps $L_D$, $R_D$ are, up to a sign, simply the discrete fiber derivatives $\mathbb{F}L_d^\pm$, where the discrete Lagrangian $L_d$ is replaced by the divergence function $D$.

Divergence as a Type I Generating Function
As we have seen previously, symplectic maps are a natural way of describing the flow of Hamiltonian mechanics on the cotangent bundle $T^*Q$. We will now consider the characterization of symplectic maps in terms of generating functions, and in particular, we review three different parameterizations based on the classification given in Goldstein [23].
Lemma 2. Given $(q_0, q_1) \in Q \times Q$, the map $(q_0, p_0) \mapsto (q_1, p_1)$ on $T^*Q$ is symplectic if and only if there exists $S_1 : Q \times Q \to \mathbb{R}$ such that (59) $p_0 = -D_1 S_1(q_0, q_1)$, $p_1 = D_2 S_1(q_0, q_1)$.
To prove this, observe that $dS_1(q_0, q_1) = (D_1 S_1)\, dq_0 + (D_2 S_1)\, dq_1$, and that the map is symplectic precisely when the one-form $p_1\, dq_1 - p_0\, dq_0$ is closed, hence locally exact, from which we immediately obtain $p_1\, dq_1 - p_0\, dq_0 = dS_1(q_0, q_1)$. Identifying the corresponding terms yields (59). Type I generating functions $S_1$ are linked with other types of generating functions via partial Legendre transforms. Fixing the first or second variable slot leads to, respectively, Type II or Type III generating functions, denoted by $S_2$ and $S_3$, respectively.
Let $\mathcal{H}^+$ be a submanifold, with local coordinates $(q_0, p_1)$, of $Q \times (T^*Q)$, with local coordinates $(q_0, (q_1, p_1))$, where $q_1$ is dependent on $q_0$ and $p_1$. Then $(q_0, p_0) \mapsto (q_1, p_1)$ on $T^*Q$ is symplectic if and only if there exists $S_2 : \mathcal{H}^+ \to \mathbb{R}$ such that $p_0 = D_1 S_2(q_0, p_1)$, $q_1 = D_2 S_2(q_0, p_1)$. Likewise, let $\mathcal{H}^-$ be a submanifold, whose local coordinates are $(q_1, p_0)$, of $(T^*Q) \times Q$, with local coordinates $((q_0, p_0), q_1)$, where $q_0$ is dependent on $p_0$ and $q_1$. Then $(q_0, p_0) \mapsto (q_1, p_1)$ on $T^*Q$ is symplectic if and only if there exists $S_3 : \mathcal{H}^- \to \mathbb{R}$ such that $p_1 = -D_1 S_3(q_1, p_0)$, $q_0 = -D_2 S_3(q_1, p_0)$.
In the case of discrete mechanics, the Type II generating function is denoted by $H^+(q_0, p_1)$ and the Type III generating function is denoted by $H^-(q_1, p_0)$. We compute their exterior derivatives: $dH^+ = (D_1 H^+)\, dq_0 + (D_2 H^+)\, dp_1$, $dH^- = (D_1 H^-)\, dq_1 + (D_2 H^-)\, dp_0$. From this, together with $p_1\, dq_1 - p_0\, dq_0 = dL_d$, we obtain $dH^+ = p_0\, dq_0 + q_1\, dp_1$ and $dH^- = -p_1\, dq_1 - q_0\, dp_0$. Therefore, symplectic maps can be defined implicitly in terms of a Type II generating function $H^+(q_0, p_1)$, $q_1 = D_2 H^+(q_0, p_1)$, $p_0 = D_1 H^+(q_0, p_1)$, and a Type III generating function $H^-(q_1, p_0)$, $p_1 = -D_1 H^-(q_1, p_0)$, $q_0 = -D_2 H^-(q_1, p_0)$. More explicitly, these are related to the discrete Lagrangian $L_d(q_0, q_1)$, which is a Type I generating function, by the following partial Legendre transforms: $H^+(q_0, p_1) = p_1 q_1 - L_d(q_0, q_1)$, $H^-(q_1, p_0) = -p_0 q_0 - L_d(q_0, q_1)$, or equivalently, $L_d(q_0, q_1) = p_1 q_1 - H^+(q_0, p_1) = -p_0 q_0 - H^-(q_1, p_0)$.
The upshot of the above discussion is that $p_0$, $p_1$ are Legendre dual variables with respect to $q_0$, $q_1$, whereas in the fiberwise Legendre transform $\mathbb{F}L$, it is $(q_0, v_0)$, $(q_1, v_1)$ which are dual to $(q_0, p_0)$, $(q_1, p_1)$; the dual correspondence is $q \leftrightarrow p$, instead of $v \leftrightarrow p$. As before, the two discrete Legendre dualities are due to the two ways of viewing $Q \times Q$ as a bundle over $Q$.
Note that in this case, the Legendre dual of $q_1$ is no longer $p_1 = \partial\Phi(q_1)$ as given by the fiberwise Legendre map, but is rather shifted by an amount $\partial\Phi(q_0)$. It is interesting that $H^+$ still takes the form of $B$, as does $L_d$. This is a special property of taking the Bregman divergence as the generating function.
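The partial Legendre transform from a Type I to a Type II generating function described in this subsection can be verified numerically. The sketch below is only an illustration: it assumes the Störmer-Verlet discrete Lagrangian of the harmonic oscillator, $L_d(q_0, q_1) = (q_1 - q_0)^2/(2h) - (h/4)(q_0^2 + q_1^2)$, with a hypothetical step size `STEP`, so that $p_1 = D_2 L_d$ can be inverted for $q_1$ in closed form.

```python
# Sketch under stated assumptions: Type I -> Type II partial Legendre
# transform for L_d(q0, q1) = (q1 - q0)^2/(2h) - (h/4)(q0^2 + q1^2).

STEP = 0.1  # hypothetical time step h

def L_d(q0, q1, h=STEP):
    return (q1 - q0) ** 2 / (2 * h) - 0.25 * h * (q0 * q0 + q1 * q1)

def q1_from(q0, p1, h=STEP):
    """Invert p1 = D_2 L_d(q0, q1) = (q1 - q0)/h - (h/2) q1 for q1."""
    return (q0 + h * p1) / (1.0 - 0.5 * h * h)

def H_plus(q0, p1, h=STEP):
    """Type II generating function H^+(q0, p1) = p1*q1 - L_d(q0, q1)."""
    q1 = q1_from(q0, p1, h)
    return p1 * q1 - L_d(q0, q1, h)

def central_diff(f, x, eps=1e-6):
    """Central-difference approximation of f'(x)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)
```

Differentiating `H_plus` with respect to its second argument should reproduce $q_1 = D_2 H^+(q_0, p_1)$, and differentiating with respect to its first argument should reproduce $p_0 = D_1 H^+(q_0, p_1) = -D_1 L_d(q_0, q_1)$.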

D-Divergence for Decoupling L and H
In geometric mechanics, Hamiltonian and Lagrangian dynamics represent one and the same dynamics; they are coupled, because $H(q, p)$ and $L(q, v)$ are related by the fiberwise Legendre transform $\mathbb{F}L = (\mathbb{F}H)^{-1}$, and in fact they form a Legendre pair. The conservation properties of the Hamiltonian approach with respect to the underlying symplectic geometry and the variational principles that arise in the Lagrangian and Hamilton-Jacobi theories reflect two sides of the same coin.
To appreciate this, we look at the interaction of the three manifolds $Q \times Q$, $T^*Q$ and $TQ$. We take $(q_k, q_{k+1})$ to be the configuration variable $q$ at successive time steps; it is the dynamical equation that governs the evolution from $q_k$ to $q_{k+1}$. The Hamiltonian dynamics, which is encoded in the preservation of $\omega_{can}$ on $T^*Q$, governs the discrete Hamiltonian flow of $H(q_k, p_k)$ through a Type I generating function $L_d(q_k, q_{k+1})$. On the other hand, the Lagrangian flow is governed by the retraction map $R_q$, such as the Dirichlet-to-Neumann map induced by Jacobi's solution $S_{q_0}(q)$ to the Hamilton-Jacobi equation. Those two dynamic updates $q_k \mapsto q_{k+1}$ need not be identical. In mechanics, the Hamiltonian energy conservation system and the Lagrangian extremization system lead to one and the same dynamics, precisely because $L(q_k, v_k)$ and $H(q_k, p_k)$ are linked through the fiberwise Legendre transform $\mathbb{F}L$ at $q_k$: $p_k = \frac{\partial L}{\partial v}(q_k, v_k)$, $L(q_k, v_k) + H(q_k, p_k) = \langle v_k, p_k \rangle_{q_k}$. In other words, $L$ and $H$ are perfectly coupled, with no duality gap.
Information geometry, on the other hand, starts with a divergence (or contrast) function $D(q, v, p)$ on $TQ \oplus T^*Q$, which measures the discrepancy between the two systems. Given $H(q, p)$ on $T^*Q$ and $L(q, v)$ on $TQ$, we write $D(q, v, p) = L(q, v) + H(q, p) - \langle v, p \rangle_q$.
Theorem 2. Let $L(q, v)$ and $H(q, p)$ be strictly convex functions, defined on $TQ$ and $T^*Q$ in terms of the variables $(q, v)$ and $(q, p)$, respectively. Then, for the following statements, any two imply the rest: (i) $D = 0$; (ii) $H(q, \cdot)$ and $L(q, \cdot)$ are (fiberwise) convex conjugate (Legendre dual) to each other; (iii) $p = \frac{\partial L}{\partial v}(q, v)$; (iv) $v = \frac{\partial H}{\partial p}(q, p)$. Then, $\frac{\partial L}{\partial q}(q, v) = -\frac{\partial H}{\partial q}(q, p)$, and the Euler-Lagrange equations are equivalent to Hamilton's equations, $\dot{q} = \frac{\partial H}{\partial p}$, $\dot{p} = -\frac{\partial H}{\partial q}$.
Our insight here is that $D$ does not have to vanish identically. The consequence is that we do not require the Lagrangian dynamics (extremization dynamics) and Hamiltonian dynamics (conservation dynamics) to be coupled; they will be allowed to evolve independently. The function $D$ allows us to study fiberwise symplectomorphisms of Dirac manifolds.
Let us consider the case that (ii) holds, i.e., $H(q, \cdot)$ and $L(q, \cdot)$ are Legendre duals of each other. Then, the canonical divergence $D$ can be written as the Bregman divergences $B_L$ and $B_H$, after applying the fiberwise Legendre map $\mathbb{F}L$ or $\mathbb{F}H$: $D(q, v_0, p_1) = B_L(q, v_0, \mathbb{F}H(p_1)) = B_H(q, p_1, \mathbb{F}L(v_0))$. This implies that, writing $p_i = \mathbb{F}L(v_i)$, the two Bregman divergences satisfy $B_L(q, v_0, v_1) = B_H(q, p_1, p_0)$. This is the reference-representation biduality [18,19], which is satisfied whenever $L$ and $H$ are Legendre duals of each other.
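The duality gap and the reference-representation biduality can be checked on a concrete Legendre pair. The sketch below assumes a $q$-independent, non-quadratic example, $L(v) = \cosh v$ with conjugate $H(p) = p \sinh^{-1}(p) - \sqrt{1 + p^2}$; this pair and the function names are illustrative choices, not taken from the paper.

```python
import math

# Illustrative assumption: a fiberwise Legendre pair with no q-dependence,
#   L(v) = cosh(v),  H(p) = p*asinh(p) - sqrt(1 + p^2).

def lagrangian(v):
    return math.cosh(v)

def hamiltonian(p):
    return p * math.asinh(p) - math.sqrt(1.0 + p * p)

def fiber_L(v):
    """Fiberwise Legendre map FL: v -> p = dL/dv."""
    return math.sinh(v)

def fiber_H(p):
    """Inverse map FH: p -> v = dH/dp."""
    return math.asinh(p)

def duality_gap(v, p):
    """D(v, p) = L(v) + H(p) - v*p, nonnegative by the Fenchel inequality."""
    return lagrangian(v) + hamiltonian(p) - v * p

def bregman_L(v0, v1):
    """B_L(v0, v1) = L(v0) - L(v1) - (v0 - v1) * dL/dv(v1)."""
    return lagrangian(v0) - lagrangian(v1) - (v0 - v1) * fiber_L(v1)

def bregman_H(p1, p0):
    """B_H(p1, p0) = H(p1) - H(p0) - (p1 - p0) * dH/dp(p0)."""
    return hamiltonian(p1) - hamiltonian(p0) - (p1 - p0) * fiber_H(p0)
```

The gap `duality_gap(v, fiber_L(v))` vanishes, while for a mismatched pair `(v0, p1)` with `p1 = fiber_L(v1)` it coincides with both `bregman_L(v0, v1)` and `bregman_H(p1, p0)`.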

Variational Error Analysis
Recall that we previously defined the exact discrete Lagrangian $L_d^E$ (16), which is related to Jacobi's solution of the Hamilton-Jacobi equation. The significance of the exact discrete Lagrangian is that it generates the exact discrete time flow of a Lagrangian system, but in general it cannot be computed explicitly. Instead, a computable discrete Lagrangian $L_d$ is used to construct a discretization of Lagrangian mechanics, and it induces the discrete Lagrangian map $F_{L_d}$.
Since discrete variational mechanics is expressed in terms of discrete Lagrangians, and the exact discrete Lagrangian generates the exact flow map of a continuous Lagrangian system, it is natural to ask whether we can characterize the order of accuracy of the Lagrangian map $F_{L_d}$ as an approximation of the exact flow map, in terms of the extent to which the discrete Lagrangian $L_d$ approximates the exact discrete Lagrangian $L_d^E$. This is indeed possible, and is referred to as variational error analysis. Theorem 2.3.1 of [11] shows that if a discrete Lagrangian $L_d$ approximates the exact discrete Lagrangian $L_d^E$ to order $p$, i.e., $L_d(q_0, q_1; h) = L_d^E(q_0, q_1; h) + O(h^{p+1})$, then the discrete Hamiltonian map, $\tilde{F}_{L_d} : (q_k, p_k) \mapsto (q_{k+1}, p_{k+1})$, viewed as a one-step method, is order $p$ accurate.
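The order condition can be observed numerically in a case where the exact discrete Lagrangian is known. The sketch below assumes the harmonic oscillator $L(q, v) = v^2/2 - q^2/2$, for which the classical action between $q_0$ and $q_1$ over time $h$ is $[(q_0^2 + q_1^2) \cos h - 2 q_0 q_1] / (2 \sin h)$, and compares it with the midpoint discrete Lagrangian along the exact flow. The example and the helper names are illustrative, not taken from the paper.

```python
import math

# Sketch under stated assumptions: harmonic oscillator with unit mass and
# frequency, compared along the exact solution q(t) = q0 cos t + v0 sin t.

def exact_discrete_lagrangian(q0, q1, h):
    """Classical action of the oscillator between q0 and q1 over time h."""
    return ((q0 * q0 + q1 * q1) * math.cos(h) - 2.0 * q0 * q1) / (2.0 * math.sin(h))

def midpoint_discrete_lagrangian(q0, q1, h):
    """L_d(q0, q1) = h * L((q0 + q1)/2, (q1 - q0)/h)."""
    v = (q1 - q0) / h
    q = 0.5 * (q0 + q1)
    return h * (0.5 * v * v - 0.5 * q * q)

def lagrangian_error(h, q0=1.0, v0=0.0):
    """|L_d - L_d^E| evaluated at the endpoints of the exact trajectory."""
    q1 = q0 * math.cos(h) + v0 * math.sin(h)
    return abs(midpoint_discrete_lagrangian(q0, q1, h)
               - exact_discrete_lagrangian(q0, q1, h))

def observed_order(h):
    """Estimate the exponent in |L_d - L_d^E| = O(h^(p+1)) by halving h."""
    return math.log2(lagrangian_error(h) / lagrangian_error(h / 2.0))
```

For small `h`, `observed_order(h)` should be close to 3, consistent with the midpoint rule approximating the exact discrete Lagrangian to order $p = 2$.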
As mentioned above, the divergence function $D$ from information geometry can serve as a Type I generating function of a symplectic map, and hence it can be viewed as a discrete Lagrangian in the sense of discrete Lagrangian mechanics. A divergence function also generates the Riemannian metric and affine connection structures on the diagonal manifold (Lemma 1), in addition to generating the symplectic structure on $Q \times Q$. Viewed in this way, a natural question is to what extent we can view the divergence function as corresponding to the exact Lagrangian flow of an associated continuous Lagrangian. We can show the following.
Theorem 3. The exact discrete Lagrangian $L_d^E(q(0), q(h), h)$ associated with the geodesic flow, with respect to the induced metric $g$, can be approximated by a divergence function $D(q(0), q(h))$ up to third order $O(h^3)$ accuracy, $L_d^E(q(0), q(h), h) = h L(q(0), v(0)) + D(q(0), q(h)) + O(h^3)$, if and only if $Q$ is a Hessian manifold, i.e., $D$ is the Bregman divergence $B_\Phi$, for some strictly convex function $\Phi$.
Clearly, $g_{ij} = g_{ji}$. Comparing the corresponding terms in powers of $h$ in the two expansions, we obtain that $\partial g_{ij} / \partial x^k$ is totally symmetric, which is condition (56). This, according to Proposition 1, demonstrates that the manifold $M$ is Hessian, and hence dually-flat. So, for the expansions to agree to $O(h^3)$, the inducing divergence function $D$ must be the Bregman divergence $B_\Phi$.

Summary
In this paper, we show the differences and connections between geometric mechanics and information geometry in canonically prescribing differential geometric structures on a smooth manifold $Q$. The Legendre transform plays a crucial role in both; however, it serves very different purposes in each. In geometric mechanics, the fiberwise Legendre map serves to link the cotangent bundle $T^*Q$ with the tangent bundle $TQ$, whereas in information geometry, the Legendre transform relates the pair of biorthogonal coordinates, which are special coordinates on a dually-flat manifold $Q$. More specifically, $\mathbb{F}L$ (or its inverse $\mathbb{F}H$) is invoked to establish the isomorphism $T^*Q \leftrightarrow TQ$ in geometric mechanics, whereas in information geometry, a Hessian metric $g$ built upon a convex function on $Q$ is used for the correspondence between two coordinate systems on $Q$, and also for potentially (but not necessarily) establishing a correspondence between $TQ$ and $T^*Q$.
The link between information geometry and discrete mechanics is much stronger when one considers the discrete version (as opposed to the traditional, continuous version) of geometric mechanics. Both endow a symplectic structure $\omega_\times$ on $Q \times Q$, through the use of a discrete Lagrangian $L_d$ in the case of geometric mechanics and a divergence function $D$ in the case of information geometry; in fact, they are both Type I generating functions for inducing $\omega_\times$ on $Q \times Q$ via pullback from the canonical symplectic structure $\omega_{can}$ on $T^*Q$. Using the Legendre transform, Type II generating functions can be constructed, which lead to the (right) discrete Hamiltonian $H^+$ in geometric mechanics and to the dual divergence function in information geometry.
Our analyses draw a distinction between the fiberwise Legendre map (which is used in the continuous mechanics setting), the Legendre transform between biorthogonal coordinates (which is used in information geometry), and the Legendre transform between Type I and Type II generating functions (which is used in the setting of both discrete geometric mechanics and information geometry). The distinctions are more prominent when one considers the Pontryagin bundle $TQ \oplus T^*Q$. There, we can construct a divergence function that actually measures the duality gap between the Lagrangian function and the Hamiltonian function that generate a pair of (forward and backward) Legendre maps. In so doing, we demonstrate that information geometry can be viewed as an extension of geometric mechanics based on Dirac mechanics and geometry, with a full-blown duality between the Lagrangian and Hamiltonian components.

Discussion and Future Directions
Noda [24] showed that, with respect to the symplectic structure $\omega_\times$ on $Q \times Q$, the Hamiltonian flow of the canonical divergence $A$ induces geodesic flows for $\nabla$ and $\nabla^*$. He interpreted biorthogonal coordinates as a single coordinate system on $Q \times Q$, in a way that is consistent with treating $A$ as the Type I generating function on $Q \times Q$. It remains unclear how the resulting Hamiltonian flow is related to the dynamical flow on the Dirac manifold.
In another related work, Ay and Amari [25] sought to characterize the canonical form of divergence functions for general (non dually-flat) manifolds. They investigated the retraction map $R_q : \{q\} \times Q \to T_q Q$, which we discussed in Section 2.5, and used the exponential map associated to any torsion-free affine connection $\nabla$ on $TQ$. This approach, based on parallel transport, in essence generates a semispray on $TQ$, and is quite different from characterizing the dynamics using the Hamiltonian flow on $T^*Q$. Note that even though one may define a symplectic structure (through pullback) on $TQ$ as well, Ay and Amari [25] treat the semispray on $TQ$ as the primary geometric object. Future research will clarify its relation to our approach, which is based on defining a symplectic structure on $Q \times Q$ directly.
Finally, comparing information geometry with geometric mechanics may shed light on universal machine learning algorithms. In machine learning or state estimation applications, we wish to have the estimated distribution be influenced by the observations, so that the estimated distribution eventually becomes consistent with the observed data. Let $\{x_i\}$ denote the sequence of predictions by (possibly a series of) model distributions $X$, and let $\{y_i\}$ denote the actual data generated by an unknown distribution $Y$ that we are trying to estimate. In practice, the divergence functions are constructed so that the pseudo-distance between two distributions $X$ and $Y$ can be computed using only complete information about $X$ and samples from $Y$. As such, we can measure the mismatch between the current prediction $x_i$ and the actual data $y_i$ using $D(x_i, y_i)$, since the asymmetry in the definition of $D$ is such that we only require samples $y_i$ from the true but unknown distribution. So, adding a momentum term to ensure gentle change in model predictions, a possible choice of a discrete Lagrangian for generating the discrete dynamics for the machine learning application might be given by $L(x_i, x_{i+1}) = D(x_i, x_{i+1}) + D(x_{i+1}, y_{i+1})$, where the first term can be interpreted as the action associated with the kinetic energy, and the second term is the action associated with the potential energy. By construction, the term $D(x_i, y_i)$ vanishes when the prediction $x_i$ is consistent with the actual observation $y_i$, and it is positive otherwise, so the term $D(x_i, y_i)$ can be viewed as a potential energy term that penalizes mismatch between the estimated distribution and the observational data. Our variational error analysis may thus shed light on an asymptotic theory of inference where sample size $N \to \infty$ is akin to discretization step $h \to 0$.
The link between geometric mechanics and information geometry, as revealed through our present investigation, is still rather preliminary. The possibility of a unified mathematical framework for information and mechanics is intriguing and remains a challenge for future research.