Lagrangian Function on the Finite State Space Statistical Bundle

The statistical bundle is the set of couples $(Q, W)$ of a probability density $Q$ and a random variable $W$ such that $\mathbb{E}_Q[W] = 0$. On a finite state space, we assume $Q$ to be a probability density with respect to the uniform probability and give an affine atlas of charts such that the resulting manifold is a model for Information Geometry. Velocity and acceleration of a one-dimensional statistical model are computed in this set-up. The Euler–Lagrange equations are derived from the Lagrange action integral. An example Lagrangian using minus the entropy as potential energy is briefly discussed.


Introduction
The set-up of classical Lagrangian Mechanics is a finite-dimensional Riemannian manifold. For example, see the monographs by V.I. Arnold ([1], Chapters III-IV), R. Abraham and J.E. Marsden ([2], Chapter 3), and J.E. Marsden and T.S. Ratiu ([3], Chapter 7). Classical Information Geometry, as it was first defined in the monograph by S.-I. Amari and H. Nagaoka [4], views parametric statistical models as a manifold endowed with a dually flat connection. In a recent paper, M. Leok and J. Zhang [5] have pointed out the natural relation between these two topics and have given a wide overview of the mathematical structures involved.
In the present paper, we take up the same research program with two further qualifications. First, we assume a non-parametric approach by considering the full set of positive probability functions on a finite set, as was done, for example, in our review paper [6]. The discussion is restricted here to a finite state space to avoid difficult technical problems. Second, we consider a specific expression of the tangent space of the statistical manifold, which is a Hilbert bundle that we call a statistical bundle. Our aim is to emphasize the basic statistical intuition of the geometric quantities involved. Because of that, we chose to systematically use the language of non-parametric differential geometry as it is developed in the monograph of S. Lang [7].
Herein, we use our version of Information Geometry; see the review paper [6]. Preliminary versions of this paper have been presented at the SigmaPhy2017 Conference held in Corfu, Greece, 10-14 July 2017, and at a seminar held at Collegio Carlo Alberto, Moncalieri, on 5 September 2017. In these early versions, we did not refer to Leok and Zhang's work, which we were unaware of at that time.
In Section 2, we review the definition and properties of the statistical bundle, and of the affine atlas that endows it with both a manifold structure and a natural family of transports between the fibers. In Section 3, we develop the formalism of the tangent space of the statistical bundle and derive the expression of the velocity and the acceleration of a one-dimensional statistical model in the given affine atlas. The derivation of the Euler-Lagrange equations, together with a relevant example, is discussed in Section 4.

Statistical Bundle
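We consider a finite sample space $\Omega$, with $\#\Omega = N$. The probability simplex is $\Delta(\Omega)$, and $\Delta^\circ(\Omega)$ is its interior. The uniform probability on $\Omega$ is denoted as $\mu$, $\mu(x) = 1/N$, $x \in \Omega$. The maximal exponential family $\mathcal{E}(\mu)$ is the set of all strictly positive probability densities of $(\Omega, \mu)$. The expected value of $f \colon \Omega \to \mathbb{R}$ with respect to the density $P \in \mathcal{E}(\mu)$ is denoted as
$$\mathbb{E}_P[f] = \mathbb{E}_\mu[fP] = \frac{1}{N} \sum_{x \in \Omega} f(x) P(x).$$
In [6,8,9], we made the case for the statistical bundle being the key structure of Information Geometry. The statistical bundle with base $\Omega$ is
$$S\mathcal{E}(\mu) = \left\{ (Q, V) \mid Q \in \mathcal{E}(\mu),\ \mathbb{E}_Q[V] = 0 \right\}.$$
The statistical bundle is a semi-algebraic subset of $\mathbb{R}^{2N}$; i.e., it is defined by algebraic equations and strict inequalities. It is trivially a real manifold. At each $Q \in \mathcal{E}(\mu)$, the fiber $S_Q\mathcal{E}(\mu)$ is endowed with the scalar product
$$(V_1, V_2) \mapsto \langle V_1, V_2 \rangle_Q = \mathbb{E}_Q[V_1 V_2].$$
To this structure we add a special affine atlas of charts in order to show a structure of affine manifold, which is of interest in statistical applications. The exponential atlas of the statistical manifold $S\mathcal{E}(\mu)$ is the collection of charts given for each $P \in \mathcal{E}(\mu)$ by
$$s_P \colon S\mathcal{E}(\mu) \ni (Q, V) \mapsto \left( s_P(Q),\ {}^{e}\mathbb{U}_{Q}^{P} V \right) \in S_P\mathcal{E}(\mu) \times S_P\mathcal{E}(\mu), \qquad (1)$$
where (with a slight abuse of notation)
$$s_P(Q) = \log\frac{Q}{P} - \mathbb{E}_P\left[\log\frac{Q}{P}\right], \qquad {}^{e}\mathbb{U}_{Q}^{P} V = V - \mathbb{E}_P[V]. \qquad (2)$$
As $s_P(P, V) = (0, V)$, we say that $s_P$ is the chart centered at $P$. If $s_P(Q) = U$, it is easy to derive the exponential form of $Q$ as a density with respect to $P$; namely, $Q = e^{U - \mathbb{E}_P[\log(P/Q)]} \cdot P$. As $\mathbb{E}_\mu[Q] = 1$, we have $1 = \mathbb{E}_P[e^{U}]\, e^{-\mathbb{E}_P[\log(P/Q)]}$, so that the cumulant function $K_P$ is defined on $S_P\mathcal{E}(\mu)$ by
$$K_P(U) = \log \mathbb{E}_P\left[e^{U}\right] = \mathbb{E}_P\left[\log\frac{P}{Q}\right] = D(P \,\|\, Q);$$
that is, $K_P(U)$ is the expression in the chart at $P$ of the Kullback-Leibler divergence $Q \mapsto D(P \,\|\, Q)$, and we can write $Q = e^{U - K_P(U)} \cdot P = e_P(U)$.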
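The patch centered at $P$ is
$$s_P^{-1} = e_P \colon \left(S_P\mathcal{E}(\mu)\right)^2 \ni (U, W) \mapsto \left( e_P(U),\ {}^{e}\mathbb{U}_{P}^{e_P(U)} W \right) \in S\mathcal{E}(\mu).$$
In statistical terms, the random variable $\log(Q/P)$ is the point-wise information about $Q$ relative to the reference $P$, while $s_P(Q)$ is the deviation from its mean value at $P$. The expression of the other divergence in the chart centered at $P$ is
$$D(Q \,\|\, P) = \mathbb{E}_Q\left[\log\frac{Q}{P}\right] = \mathbb{E}_{e_P(U)}[U] - K_P(U) = dK_P(U)[U] - K_P(U), \qquad Q = e_P(U).$$
The equation above shows that the two divergences are convex conjugate functions in the proper charts; see [10].
The transition maps of the exponential atlas in Equations (1) and (2) are
$$s_{P_2} \circ e_{P_1}(U, V) = \left( {}^{e}\mathbb{U}_{P_1}^{P_2} U + s_{P_2}(P_1),\ {}^{e}\mathbb{U}_{P_1}^{P_2} V \right),$$
so that the exponential atlas is indeed affine. Notice that the linear part is ${}^{e}\mathbb{U}_{P_1}^{P_2}$.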
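The chart, the patch, and the cumulant function can be checked numerically on a small state space. The following is a minimal sketch, assuming densities are stored as NumPy arrays with mean 1 (densities with respect to the uniform probability); the helper names (`random_density`, `expect`, `chart`, `patch`, `transport_e`, `kl`) are ours and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5  # size of the (hypothetical) state space


def random_density(rng, n):
    """Strictly positive density w.r.t. the uniform probability: mean(q) == 1."""
    w = rng.uniform(0.5, 1.5, size=n)
    return w / w.mean()


def expect(p, f):
    """E_P[f] = (1/N) * sum_x f(x) P(x)."""
    return np.mean(f * p)


def chart(p, q):
    """s_P(Q) = log(Q/P) - E_P[log(Q/P)]."""
    r = np.log(q / p)
    return r - expect(p, r)


def cumulant(p, u):
    """K_P(U) = log E_P[exp(U)]."""
    return np.log(expect(p, np.exp(u)))


def patch(p, u):
    """e_P(U) = exp(U - K_P(U)) * P."""
    return np.exp(u - cumulant(p, u)) * p


def transport_e(q, v):
    """Exponential transport into the fiber at Q: V - E_Q[V]."""
    return v - expect(q, v)


def kl(p, q):
    """D(P || Q) = E_P[log(P/Q)]."""
    return expect(p, np.log(p / q))


p1, p2, q = (random_density(rng, N) for _ in range(3))
u1 = chart(p1, q)

# The patch inverts the chart: e_P(s_P(Q)) = Q.
roundtrip_error = np.max(np.abs(patch(p1, u1) - q))
# The cumulant function is the divergence D(P || Q) in the chart at P.
cumulant_error = abs(cumulant(p1, u1) - kl(p1, q))
# The other divergence: D(Q || P) = E_Q[U] - K_P(U).
conjugate_error = abs(kl(q, p1) - (expect(q, u1) - cumulant(p1, u1)))
# Transition map: s_{P2}(e_{P1}(U)) = (U - E_{P2}[U]) + s_{P2}(P1).
transition_error = np.max(
    np.abs(chart(p2, q) - (transport_e(p2, u1) + chart(p2, p1)))
)
```

All four error values should vanish up to floating-point rounding, which illustrates that the atlas is affine with linear part given by the centering at the new reference density.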

The Tangent Space of the Statistical Bundle
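Let us compute the expression of the velocity at time $t$ of a smooth curve $t \mapsto \gamma(t) = (Q(t), W(t)) \in S\mathcal{E}(\mu)$ in the chart centered at $P$. The expression of the curve is
$$\gamma_P(t) = \left( s_P(Q(t)),\ {}^{e}\mathbb{U}_{Q(t)}^{P} W(t) \right),$$
and hence we have, by denoting the derivative in $\mathbb{R}^N$ by the dot,
$$\frac{d}{dt} s_P(Q(t)) = \frac{\dot Q(t)}{Q(t)} - \mathbb{E}_P\left[\frac{\dot Q(t)}{Q(t)}\right] = {}^{e}\mathbb{U}_{Q(t)}^{P} \frac{\dot Q(t)}{Q(t)}, \qquad (3)$$
$$\frac{d}{dt}\, {}^{e}\mathbb{U}_{Q(t)}^{P} W(t) = \dot W(t) - \mathbb{E}_P\left[\dot W(t)\right] = {}^{e}\mathbb{U}_{Q(t)}^{P} \left( \dot W(t) - \mathbb{E}_{Q(t)}\left[\dot W(t)\right] \right). \qquad (4)$$
If $t \mapsto Q(t)$ is a smooth curve in $\mathcal{E}(\mu)$, we define its velocity at $t$ to be
$$\overset{\star}{Q}(t) = \frac{\dot Q(t)}{Q(t)} = \frac{d}{dt} \log Q(t).$$
As $\mathbb{E}_{Q(t)}[\overset{\star}{Q}(t)] = \mathbb{E}_\mu[\dot Q(t)] = 0$, the mapping $t \mapsto (Q(t), \overset{\star}{Q}(t))$ is a curve in the statistical bundle whose expression in the chart centered at $P$ is $t \mapsto (U(t), \dot U(t))$, with $U(t) = s_P(Q(t))$. The velocity as defined above is nothing other than the score function of the one-dimensional statistical model; see, e.g., the textbook by B. Efron and T. Hastie (Section 4.2, [11]). The variance of the score function (i.e., the squared norm of $\overset{\star}{Q}(t)$ in $S_{Q(t)}\mathcal{E}(\mu)$) is classically known as the Fisher information at $t$.
We define the second statistical bundle to be
$$S_2\mathcal{E}(\mu) = \left\{ (Q, W, X, Y) \mid (Q, W) \in S\mathcal{E}(\mu),\ X, Y \in S_Q\mathcal{E}(\mu) \right\},$$
with charts $s_P(Q, W, X, Y) = \left( s_P(Q),\ {}^{e}\mathbb{U}_{Q}^{P} W,\ {}^{e}\mathbb{U}_{Q}^{P} X,\ {}^{e}\mathbb{U}_{Q}^{P} Y \right)$. We can identify the second bundle with the tangent space of the first bundle as follows.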
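For each curve $t \mapsto \gamma(t) = (Q(t), W(t))$ in the statistical bundle, define its velocity at $t$ to be
$$\overset{\star}{\gamma}(t) = \left( Q(t),\ W(t),\ \overset{\star}{Q}(t),\ \dot W(t) - \mathbb{E}_{Q(t)}\left[\dot W(t)\right] \right),$$
where $\overset{\star}{Q}(t) = \dot Q(t)/Q(t)$ is the velocity (score) of $t \mapsto Q(t)$. Then $t \mapsto \overset{\star}{\gamma}(t)$ is a curve in the second statistical bundle, and its expression in the chart at $P$ has the last two components equal to the values given in Equations (3) and (4). In particular, consider the curve $t \mapsto \chi(t) = (Q(t), \overset{\star}{Q}(t))$. The velocity is
$$\overset{\star}{\chi}(t) = \left( Q(t),\ \overset{\star}{Q}(t),\ \overset{\star}{Q}(t),\ \overset{\star\star}{Q}(t) \right),$$
where the acceleration $\overset{\star\star}{Q}(t)$ is
$$\overset{\star\star}{Q}(t) = \frac{d}{dt}\overset{\star}{Q}(t) - \mathbb{E}_{Q(t)}\left[\frac{d}{dt}\overset{\star}{Q}(t)\right] = \frac{\ddot Q(t)}{Q(t)} - \left( \overset{\star}{Q}(t)^2 - \mathbb{E}_{Q(t)}\left[\overset{\star}{Q}(t)^2\right] \right). \qquad (5)$$
It should be noted that the acceleration has been defined without explicitly mentioning the relevant connection. In fact, the connection here is implicitly defined by the transports ${}^{e}\mathbb{U}_{P}^{Q}$, which is unusual in Differential Geometry, but is quite natural from the probabilistic point of view; see P. Gibilisco and G. Pistone [12]. We shall see below that the non-parametric approach to Information Geometry allows the definition of a dual transport, hence a dual connection as in [4]. Because of that, we could have defined other types of acceleration together with the one we have defined. Namely, we could consider an exponential acceleration ${}^{e}D^2 Q(t) = \overset{\star\star}{Q}(t)$, a mixture acceleration ${}^{m}D^2 Q(t) = \ddot Q(t)/Q(t)$, and a Riemannian acceleration ${}^{0}D^2 Q(t) = \frac{1}{2}\left( {}^{e}D^2 Q(t) + {}^{m}D^2 Q(t) \right)$, each acceleration being associated with a specific connection; see the review paper [6]. We do not further discuss the different second-order geometries associated with the statistical bundle in this paper.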
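The velocity and the acceleration of a one-dimensional model can be illustrated numerically. The sketch below is ours, not part of the original derivation: it assumes the hypothetical curve $Q(t) \propto e^{tb + t^2 c} \cdot P$ with arbitrary vectors `b` and `c`, computes $\dot Q$ and $\ddot Q$ by central finite differences, and compares the score $\dot Q/Q$ and the centered derivative of the score, $\ddot Q/Q - (\overset{\star}{Q}{}^2 - \mathbb{E}_Q[\overset{\star}{Q}{}^2])$, with the closed forms $\dot\theta - \mathbb{E}_Q[\dot\theta]$ and $\ddot\theta - \mathbb{E}_Q[\ddot\theta]$ that hold for $\theta(t) = tb + t^2 c$.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 6
w = rng.uniform(0.5, 1.5, size=N)
P = w / w.mean()                     # reference density, mean(P) == 1
b = rng.uniform(-1.0, 1.0, size=N)   # hypothetical first-order direction
c = rng.uniform(-1.0, 1.0, size=N)   # hypothetical second-order direction


def expect(q, f):
    """E_Q[f] with Q a density w.r.t. the uniform probability."""
    return np.mean(f * q)


def curve(t):
    """Q(t) proportional to exp(t*b + t^2*c) * P, normalized to mean 1."""
    unnorm = np.exp(t * b + t**2 * c) * P
    return unnorm / unnorm.mean()


t0, h = 0.3, 1e-3
Q = curve(t0)
Q_dot = (curve(t0 + h) - curve(t0 - h)) / (2 * h)        # central difference
Q_ddot = (curve(t0 + h) - 2 * Q + curve(t0 - h)) / h**2  # second difference

score = Q_dot / Q                    # velocity of the curve at t0
fisher = expect(Q, score**2)         # Fisher information at t0
accel = Q_ddot / Q - (score**2 - fisher)  # centered derivative of the score

# Closed forms for this particular curve.
theta_dot = b + 2 * t0 * c
theta_ddot = 2 * c
score_exact = theta_dot - expect(Q, theta_dot)
accel_exact = theta_ddot - expect(Q, theta_ddot)

score_error = np.max(np.abs(score - score_exact))
accel_error = np.max(np.abs(accel - accel_exact))
```

Both the score and the acceleration are centered at $Q(t)$, so they belong to the fiber $S_{Q(t)}\mathcal{E}(\mu)$. Setting `c` to zero gives an exponential family, for which the acceleration vanishes.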
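Notice that we have the equation
$$\frac{\ddot Q(t)}{Q(t)} = \overset{\star\star}{Q}(t) + \left( \overset{\star}{Q}(t)^2 - \mathbb{E}_{Q(t)}\left[\overset{\star}{Q}(t)^2\right] \right). \qquad (6)$$
Following the original construction of Amari's Information Geometry [4], we have defined on the statistical bundle a manifold structure which is both affine and Riemannian. The base manifold $\mathcal{E}(\mu)$ is actually a Hessian manifold with respect to any of the convex functions $K_p(U) = \log \mathbb{E}_p\left[e^{U}\right]$, $U \in S_p\mathcal{E}(\mu)$ (see [14]). Many computations are actually performed using the Hessian structure. The following equations are easily checked and frequently used:
$$dK_P(U)[H] = \mathbb{E}_{e_P(U)}[H], \qquad (7)$$
$${}^{e}\mathbb{U}_{P}^{e_P(U)} H = H - dK_P(U)[H], \qquad (8)$$
$$d^2K_P(U)[H_1, H_2] = \left\langle {}^{e}\mathbb{U}_{P}^{e_P(U)} H_1,\ {}^{e}\mathbb{U}_{P}^{e_P(U)} H_2 \right\rangle_{e_P(U)}, \qquad (9)$$
$$d^3K_P(U)[H_1, H_2, H_3] = \mathbb{E}_{e_P(U)}\left[ \left({}^{e}\mathbb{U}_{P}^{e_P(U)} H_1\right) \left({}^{e}\mathbb{U}_{P}^{e_P(U)} H_2\right) \left({}^{e}\mathbb{U}_{P}^{e_P(U)} H_3\right) \right]. \qquad (10)$$
We have defined a centering operation that can be thought of as a transport among fibers,
$${}^{e}\mathbb{U}_{p}^{q} \colon S_p\mathcal{E}(\mu) \to S_q\mathcal{E}(\mu), \qquad {}^{e}\mathbb{U}_{p}^{q} V = V - \mathbb{E}_q[V],$$
whose adjoint is ${}^{m}\mathbb{U}_{q}^{p} V = \frac{q}{p} V$. In fact, for $V \in S_p\mathcal{E}(\mu)$ and $W \in S_q\mathcal{E}(\mu)$,
$$\left\langle {}^{e}\mathbb{U}_{p}^{q} V, W \right\rangle_q = \mathbb{E}_q\left[ (V - \mathbb{E}_q[V]) W \right] = \mathbb{E}_q[VW] = \mathbb{E}_p\left[ V \frac{q}{p} W \right] = \left\langle V, {}^{m}\mathbb{U}_{q}^{p} W \right\rangle_p,$$
so that ${}^{m}\mathbb{U}_{q}^{p}$ is the adjoint of ${}^{e}\mathbb{U}_{p}^{q}$. Moreover, if $U, V \in S_p\mathcal{E}(\mu)$, then
$$\left\langle {}^{e}\mathbb{U}_{p}^{q} U,\ {}^{m}\mathbb{U}_{p}^{q} V \right\rangle_q = \langle U, V \rangle_p.$$
Example 2 (Entropy flow). This example is taken from [8]. In the scalar field $\mathcal{H}(Q) = -\mathbb{E}_Q[\log Q]$ there is no dependence on the fiber. If $t \mapsto Q(t) = e^{V(t) - K_P(V(t))} \cdot P$ is a smooth curve in $\mathcal{E}(\mu)$ expressed in the chart centered at $P$, then we can write
$$\mathcal{H}(Q(t)) = K_P(V(t)) - \mathbb{E}_{Q(t)}\left[ V(t) + \log P + \mathcal{H}(P) \right] + \mathcal{H}(P) = K_P(V(t)) - dK_P(V(t))\left[ V(t) + \log P + \mathcal{H}(P) \right] + \mathcal{H}(P), \qquad (11)$$
where the argument of the last expectation belongs to the fiber $S_P\mathcal{E}(\mu)$ and we have expressed the expected value as a derivative by using Equation (7). Again using Equations (7) and (9), we compute the derivative of the entropy along the given curve as
$$\frac{d}{dt}\mathcal{H}(Q(t)) = -d^2K_P(V(t))\left[ V(t) + \log P + \mathcal{H}(P),\ \dot V(t) \right] = -\left\langle {}^{e}\mathbb{U}_{P}^{Q(t)}\left( V(t) + \log P + \mathcal{H}(P) \right),\ {}^{e}\mathbb{U}_{P}^{Q(t)} \dot V(t) \right\rangle_{Q(t)}.$$
We use now the equations ${}^{e}\mathbb{U}_{P}^{Q(t)}\left( V(t) + \log P + \mathcal{H}(P) \right) = \log Q(t) + \mathcal{H}(Q(t))$ and ${}^{e}\mathbb{U}_{P}^{Q(t)} \dot V(t) = \overset{\star}{Q}(t)$ to obtain
$$\frac{d}{dt}\mathcal{H}(Q(t)) = -\left\langle \log Q(t) + \mathcal{H}(Q(t)),\ \overset{\star}{Q}(t) \right\rangle_{Q(t)}.$$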
We have identified the gradient of the entropy in the statistical bundle,
$$\operatorname{grad} \mathcal{H}(Q) = -\left( \log Q + \mathcal{H}(Q) \right). \qquad (12)$$
Notice that the previous computation could have been done using the exponential family $Q(t) = e_P(tV)$. See the computation of the gradient flow in [8].
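Equation (12) can be checked numerically along an exponential family. The following minimal sketch assumes a hypothetical reference density `P` and direction `V`, takes $\mathcal{H}(Q) = -\mathbb{E}_Q[\log Q]$ with $Q$ a density with respect to the uniform probability, and compares a finite-difference derivative of the entropy along $Q(t) = e_P(tV)$ with the scalar product $\langle \operatorname{grad}\mathcal{H}(Q(t)), \overset{\star}{Q}(t) \rangle_{Q(t)}$.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 6
w = rng.uniform(0.5, 1.5, size=N)
P = w / w.mean()                    # reference density, mean(P) == 1
V = rng.uniform(-1.0, 1.0, size=N)  # hypothetical direction
V = V - np.mean(V * P)              # center so that E_P[V] = 0


def expect(q, f):
    """E_Q[f] with Q a density w.r.t. the uniform probability."""
    return np.mean(f * q)


def entropy(q):
    """H(Q) = -E_Q[log Q]."""
    return -expect(q, np.log(q))


def grad_entropy(q):
    """grad H(Q) = -(log Q + H(Q)), an element of the fiber at Q."""
    return -(np.log(q) + entropy(q))


def curve(t):
    """Exponential family Q(t) = e_P(tV)."""
    unnorm = np.exp(t * V) * P
    return unnorm / unnorm.mean()


t0, h = 0.4, 1e-4
Q = curve(t0)
score = V - expect(Q, V)  # velocity of the exponential family at t0

dH_numeric = (entropy(curve(t0 + h)) - entropy(curve(t0 - h))) / (2 * h)
dH_formula = expect(Q, grad_entropy(Q) * score)
grad_mean = expect(Q, grad_entropy(Q))  # should vanish: grad H is centered
```

The two derivatives agree up to the finite-difference error, and `grad_mean` is zero up to rounding, which confirms that the gradient lies in the fiber $S_Q\mathcal{E}(\mu)$.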
In the next section, we extend the computation illustrated in the example above to scalar fields on the statistical bundle.

Lagrangian Function
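A Lagrangian function is a smooth scalar field on the statistical bundle,
$$L \colon S\mathcal{E}(\mu) \ni (Q, W) \mapsto L(Q, W) \in \mathbb{R}.$$
At each fixed density $Q \in \mathcal{E}(\mu)$, the partial mapping $S_Q\mathcal{E}(\mu) \ni W \mapsto L(Q, W)$ is defined on the vector space $S_Q\mathcal{E}(\mu)$; hence, we can use the ordinary derivative, which in this case is called the fiber derivative,
$$d_2L(Q, W)[H] = \left. \frac{d}{dt} L(Q, W + tH) \right|_{t=0}, \qquad H \in S_Q\mathcal{E}(\mu).$$
Example 3 (Running Example 1). If
$$L(Q, W) = \frac{1}{2} \langle W, W \rangle_Q + \kappa \mathcal{H}(Q), \qquad (15)$$
then the fiber derivative is $d_2L(Q, W)[H] = \langle W, H \rangle_Q$. The example is suggested by the form of the classical Lagrangian function in mechanics, where the first term is the kinetic energy and $-\kappa \mathcal{H}(Q)$ is the potential energy.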
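As the statistical bundle $S\mathcal{E}(\mu)$ is non-trivial, the computation of the partial derivative of the Lagrangian with respect to the first variable requires some care. We want to compute the expression of the total derivative in a chart of the affine atlas defined in Equations (1) and (2).
Let $t \mapsto \gamma(t) = (Q(t), W(t))$ be a smooth curve in the statistical bundle. In the chart centered at $P$, we have
$$\gamma(t) = \left( e_P(U(t)),\ {}^{e}\mathbb{U}_{P}^{e_P(U(t))} V(t) \right),$$
with $t \mapsto \gamma_P(t) = (U(t), V(t))$ being a smooth curve in $(S_P\mathcal{E}(\mu))^2$. Let us compute the velocity of variation of the Lagrangian $L$ along the curve $\gamma$. Writing $L_P(U, V) = L\left( e_P(U),\ {}^{e}\mathbb{U}_{P}^{e_P(U)} V \right)$ for the expression of $L$ in the chart, we have
$$\frac{d}{dt} L(Q(t), W(t)) = d_1L_P(U(t), V(t))[\dot U(t)] + d_2L_P(U(t), V(t))[\dot V(t)]. \qquad (16)$$
If we write $Q = e_P(U)$ and $W = {}^{e}\mathbb{U}_{P}^{e_P(U)} V$, then we have
$$d_2L_P(U, V)[H] = d_2L(Q, W)\left[ {}^{e}\mathbb{U}_{P}^{e_P(U)} H \right], \qquad (17)$$
where $d_2L$ is the fiber derivative of $L$. As $\dot U(t) = {}^{e}\mathbb{U}_{Q(t)}^{P} \overset{\star}{Q}(t)$ and ${}^{e}\mathbb{U}_{P}^{e_P(U(t))} \dot V(t) = \dot W(t) - \mathbb{E}_{Q(t)}[\dot W(t)]$, it follows from Equations (16) and (17) that
$$\frac{d}{dt} L(Q(t), W(t)) = d_1L_P(U(t), V(t))\left[ {}^{e}\mathbb{U}_{Q(t)}^{P} \overset{\star}{Q}(t) \right] + d_2L(Q(t), W(t))\left[ \dot W(t) - \mathbb{E}_{Q(t)}[\dot W(t)] \right].$$
In the equation above, the first term on the RHS does not depend on $P$ because the LHS and the second term of the RHS do not depend on $P$. Hence, we define the first partial derivative of the Lagrangian function to be
$$d_1L(Q, W)[X] = d_1L_P(U, V)\left[ {}^{e}\mathbb{U}_{Q}^{P} X \right], \qquad X \in S_Q\mathcal{E}(\mu),$$
so that the derivative of $L$ along $\gamma$ becomes
$$\frac{d}{dt} L(Q(t), W(t)) = d_1L(Q(t), W(t))\left[ \overset{\star}{Q}(t) \right] + d_2L(Q(t), W(t))\left[ \dot W(t) - \mathbb{E}_{Q(t)}[\dot W(t)] \right].$$
In particular, if $W(t) = \overset{\star}{Q}(t)$, then
$$\frac{d}{dt} L(Q(t), \overset{\star}{Q}(t)) = d_1L(Q(t), \overset{\star}{Q}(t))\left[ \overset{\star}{Q}(t) \right] + d_2L(Q(t), \overset{\star}{Q}(t))\left[ \overset{\star\star}{Q}(t) \right].$$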

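Example 4 (Running Example 2). With the Lagrangian of Equation (15), we have
$$L_P(U, V) = \frac{1}{2} d^2K_P(U)[V, V] + \kappa \left( K_P(U) - dK_P(U)\left[ U + \log P + \mathcal{H}(P) \right] + \mathcal{H}(P) \right);$$
see Equations (9) and (11). The first partial derivative is
$$d_1L_P(U, V)[H] = \frac{1}{2} d^3K_P(U)[V, V, H] - \kappa\, d^2K_P(U)\left[ U + \log P + \mathcal{H}(P),\ H \right] = \frac{1}{2} \mathbb{E}_Q\left[ W^2\, {}^{e}\mathbb{U}_{P}^{Q} H \right] - \kappa \left\langle \log Q + \mathcal{H}(Q),\ {}^{e}\mathbb{U}_{P}^{Q} H \right\rangle_Q,$$
with $Q = e_P(U)$ and $W = {}^{e}\mathbb{U}_{P}^{e_P(U)} V$, where we have used Equations (9) and (10) together with ${}^{e}\mathbb{U}_{P}^{e_P(U)}\left( U + \log P + \mathcal{H}(P) \right) = \log Q + \mathcal{H}(Q)$. We have found that, for $X \in S_Q\mathcal{E}(\mu)$,
$$d_1L(Q, W)[X] = \frac{1}{2} \mathbb{E}_Q\left[ W^2 X \right] - \kappa \left\langle \log Q + \mathcal{H}(Q),\ X \right\rangle_Q,$$
and also
$$d_1L(Q, W)[X] = \left\langle \frac{1}{2}\left( W^2 - \mathbb{E}_Q[W^2] \right) - \kappa \left( \log Q + \mathcal{H}(Q) \right),\ X \right\rangle_Q.$$
Using the fiber derivative computed in the first part of the running example, we find
$$\frac{d}{dt} L(Q(t), \overset{\star}{Q}(t)) = \left\langle \frac{1}{2}\left( \overset{\star}{Q}(t)^2 - \mathbb{E}_{Q(t)}\left[\overset{\star}{Q}(t)^2\right] \right) - \kappa \left( \log Q(t) + \mathcal{H}(Q(t)) \right),\ \overset{\star}{Q}(t) \right\rangle_{Q(t)} + \left\langle \overset{\star}{Q}(t),\ \overset{\star\star}{Q}(t) \right\rangle_{Q(t)}.$$
Notice that Equation (12) shows that one of the terms in the equations above is $\operatorname{grad} \mathcal{H}(Q)$.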
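The total derivative of the example Lagrangian along a curve $t \mapsto (Q(t), \overset{\star}{Q}(t))$ can be verified numerically. The sketch below is an illustration under stated assumptions: the Lagrangian is taken to be kinetic energy plus $\kappa$ times the entropy, $L(Q, W) = \frac{1}{2}\mathbb{E}_Q[W^2] + \kappa\mathcal{H}(Q)$, as described in the running example; the curve $Q(t) \propto e^{tb + t^2c} \cdot P$, the vectors `b`, `c`, and the value of `kappa` are hypothetical; and `first_partial_rep` is our name for the random variable $\frac{1}{2}(W^2 - \mathbb{E}_Q[W^2]) - \kappa(\log Q + \mathcal{H}(Q))$ that represents the first partial derivative.

```python
import numpy as np

rng = np.random.default_rng(3)
N, kappa = 6, 0.7                    # kappa is an arbitrary illustrative value
w = rng.uniform(0.5, 1.5, size=N)
P = w / w.mean()                     # reference density, mean(P) == 1
b = rng.uniform(-1.0, 1.0, size=N)   # hypothetical curve parameters
c = rng.uniform(-1.0, 1.0, size=N)


def expect(q, f):
    """E_Q[f] with Q a density w.r.t. the uniform probability."""
    return np.mean(f * q)


def entropy(q):
    """H(Q) = -E_Q[log Q]."""
    return -expect(q, np.log(q))


def state(t):
    """Density, score, and acceleration of Q(t) ~ exp(t*b + t^2*c) * P."""
    unnorm = np.exp(t * b + t**2 * c) * P
    Q = unnorm / unnorm.mean()
    theta_dot = b + 2 * t * c
    score = theta_dot - expect(Q, theta_dot)
    accel = 2 * c - expect(Q, 2 * c)
    return Q, score, accel


def lagrangian(Q, W):
    """L(Q, W) = (1/2) <W, W>_Q + kappa * H(Q)."""
    return 0.5 * expect(Q, W**2) + kappa * entropy(Q)


def first_partial_rep(Q, W):
    """(1/2)(W^2 - E_Q[W^2]) - kappa * (log Q + H(Q)), centered at Q."""
    return 0.5 * (W**2 - expect(Q, W**2)) - kappa * (np.log(Q) + entropy(Q))


t0, h = 0.3, 1e-4
Q, S, A = state(t0)

# Finite-difference derivative of t -> L(Q(t), score(t)).
L_plus = lagrangian(*state(t0 + h)[:2])
L_minus = lagrangian(*state(t0 - h)[:2])
dL_numeric = (L_plus - L_minus) / (2 * h)

# d1 L [score] + d2 L [acceleration], both as scalar products at Q(t0).
dL_formula = expect(Q, first_partial_rep(Q, S) * S) + expect(Q, S * A)
```

The agreement of `dL_numeric` and `dL_formula` illustrates the decomposition of the total derivative into the first partial derivative applied to the velocity and the fiber derivative applied to the acceleration.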

Action Integral
If $t \mapsto Q(t)$ is a smooth curve in the exponential manifold, then the action integral of the Lagrangian along $t \mapsto (Q(t), \overset{\star}{Q}(t))$ is well defined. We consider the expression of $Q$ in the chart centered at $P$, $Q(t) = e^{U(t) - K_P(U(t))} \cdot P$.
The equation above has been derived using the exponential affine geometry of the statistical bundle and involves $\overset{\star\star}{Q}(t)$. However, by using Equations (5), (6), and (12), we find the equivalent form ${}^{0}D^2 Q(t) = \kappa \operatorname{grad} \mathcal{H}(Q(t))$.

Discussion
We have shown that the research program of applying concepts taken from Classical Mechanics to Statistics makes sense, even if no practical application has been produced in this paper. Some simple examples have been discussed in order to show clearly that the language of classical mechanics is indeed suggestive when applied to typical concepts in Statistics such as the Fisher score and statistical entropy. The derivation of the Euler-Lagrange equations is classically done in the setting of Riemannian geometry, while here we have used the affine structure of Information Geometry. The present provisional results prompt a generalization to non-finite sample spaces and the development of applications. Finally, the related Hamiltonian formalism remains to be investigated.