Information Geometry Formalism for the Spatially Homogeneous Boltzmann Equation

Information Geometry generalizes to infinite dimension by modeling the tangent space of the relevant manifold of probability densities with exponential Orlicz spaces. We review here several properties of the exponential manifold on a suitable set $\mathcal E$ of mutually absolutely continuous densities. We study in particular the fine properties of the Kullback-Leibler divergence in this context. We also show that this setting is well-suited for the study of the spatially homogeneous Boltzmann equation if $\mathcal E$ is a set of positive densities with finite relative entropy with respect to the Maxwell density. More precisely, we analyse the Boltzmann operator in the geometric setting from the point of view of its Maxwell's weak form as a composition of elementary operations in the exponential manifold, namely tensor product, conditioning, and marginalization, and we prove in a geometric way the basic facts, i.e., the H-theorem. We also illustrate the robustness of our method by discussing, besides the Kullback-Leibler divergence, the properties of the Hyv\"arinen divergence. This requires generalising our approach to Orlicz-Sobolev spaces to include derivatives.

Information geometry (IG) has been essentially developed by S.-I. Amari, see the monograph by Amari and Nagaoka [4]. In his work, all previous geometric (essentially metric) descriptions of probabilistic and statistical concepts are extended in the direction of affine differential geometry, including the fundamental treatment of connections. A corresponding concept for abstract manifolds, called statistical manifold, has been worked out by S. L. Lauritzen in [3]. Amari's framework is today considered a case of Hessian geometry as it is described in the monograph by H. Shima [38].
Other versions of IG have been studied to deal with non-parametric settings such as the Boltzmann equation as it is described in [14] and [41]. A very general set-up for information geometry is the following. Consider a one-dimensional family of positive densities with respect to a measure $\mu$, $\theta \mapsto p_\theta$, and a random variable $U$. A classical statistical computation, possibly due to Ronald Fisher, is
$$\frac{d}{d\theta} E_{p_\theta}[U] = \int U\, \frac{d}{d\theta} p_\theta \, d\mu = E_{p_\theta}\left[\left(U - E_{p_\theta}[U]\right) \frac{d}{d\theta}\log p_\theta\right].$$
The previous computation suggests the following geometric construction, which is rigorous if the sample space is finite and can be forced to work in general under suitable assumptions. We use the differential geometry language, e.g., [17]. If $\Delta$ is the probability simplex on a given sample space $(\Omega, \mathcal F)$, we define the statistical bundle of $\Delta$ to be
$$T\Delta = \left\{(\pi, u) \mid \pi \in \Delta,\ u \in L^2(\pi),\ E_\pi[u] = 0\right\}.$$
Given a one-dimensional curve in $\Delta$, $\theta \mapsto \pi_\theta$, we can define its velocity to be the curve $\theta \mapsto (\pi_\theta, D\pi_\theta) \in T\Delta$, where we define $D\pi_\theta = \frac{d}{d\theta}\log \pi_\theta$. Each fiber $T_\pi\Delta = L^2_0(\pi)$ has a scalar product and we have a parallel transport between fibers. This structure provides an interesting framework to interpret the Fisher computation cited above.

The basic case of a finite state space has been extended by Amari and coworkers to the case of a parametric set of strictly positive probability densities on a generic sample space. Following a suggestion by A. P. Dawid in [18,16], a particular nonparametric version of that theory was developed in a series of papers [36,20,35,19,12,13,22,31,32,34,33], where the set $\mathcal P_>$ of all strictly positive probability densities of a measure space is shown to be a Banach manifold (as it is defined in [8,1,24]) modeled on an Orlicz Banach space, see e.g., [28, Chapter II].
In the present paper, Sec. 2 recalls the theory and our notation about the model Orlicz spaces. This material is included for convenience only and can be skipped by any reader familiar with the papers [36,20,35,19,12,13,22,31,32,34,33] quoted above. The following Sec. 3 is mostly based on the same references and it is intended to introduce that manifold structure and to give a first example of application to the study of the Kullback-Leibler divergence. The special features of statistical manifolds that contain the Maxwell density are discussed in Sec. 4. In this case we can define the Boltzmann-Gibbs entropy and study its gradient flow. The setting for the Boltzmann equation is discussed in Sec. 5, where we show that the equation can be derived from probabilistic operations performed on the statistical manifold. Our application to the study of the Kullback-Leibler divergence is generalised in Sec. 6 to the more delicate case of the Hyvärinen divergence. This requires in particular a generalisation of the manifold structure to include differential operators and leads naturally to the introduction of Orlicz-Sobolev spaces.
We are aware that there are other approaches to non-parametric information geometry that are not based on the notion of the exponential family and that we do not consider here. We mention in particular [29] and [7].

Model Spaces
Given a σ-finite measure space $(\Omega, \mathcal F, \mu)$, we denote by $\mathcal P_>$ the set of all densities that are positive $\mu$-a.s., by $\mathcal P_\geq$ the set of all densities, and by $\mathcal P_1$ the set of measurable functions $f$ with $\int f\,d\mu = 1$.
We introduce here the Orlicz spaces we shall mainly investigate in the sequel. We refer to [13] and [28, Chapter II] for more details on the matter. We consider the Young function $\Phi : \mathbb R \ni x \mapsto \Phi(x) = \cosh x - 1$ and, for any $p \in \mathcal P_>$, the Orlicz space $L^\Phi(p) = L^{\cosh-1}(p)$ is defined as follows: a real random variable $U$ belongs to $L^\Phi(p)$ if $E_p[\Phi(\alpha U)] < +\infty$ for some $\alpha > 0$. The Orlicz space $L^\Phi(p)$ is a Banach space when endowed with the Luxemburg norm defined as
$$\|U\|_{\Phi,p} = \inf\left\{\lambda > 0 : E_p\left[\Phi\left(U/\lambda\right)\right] \le 1\right\}.$$
The conjugate function of $\Phi = \cosh - 1$ is
$$\Phi_*(y) = (\cosh-1)_*(y) = \int_0^y \operatorname{arsinh} u \, du = y \operatorname{arsinh} y + 1 - \sqrt{1+y^2}, \quad y \in \mathbb R,$$
which satisfies the so-called $\Delta_2$-condition as
$$\Phi_*(\alpha y) = \int_0^{|\alpha y|} (|\alpha y| - u)\,\Phi_*''(u)\,du = \int_0^{|\alpha y|} \frac{|\alpha||y| - u}{\sqrt{1+u^2}}\,du \le \max(1,\alpha^2)\,\Phi_*(y).$$
Since $\Phi_*$ is a Young function, for any $p \in \mathcal P_>$, one can define as above the associated Orlicz space $L^{\Phi_*}(p) = L^{(\cosh-1)_*}(p)$ and its corresponding Luxemburg norm $\|\cdot\|_{\Phi_*,p}$. Because the functions $\Phi$ and $\Phi_*$ form a Young pair, for each $U \in L^{\cosh-1}(p)$ and $V \in L^{(\cosh-1)_*}(p)$ we can deduce from Young's inequality
$$xy \le \Phi(x) + \Phi_*(y), \quad \forall (x,y) \in \mathbb R^2,$$
that the following Hölder's inequality holds:
$$|E_p[UV]| \le 2\,\|U\|_{\Phi,p}\,\|V\|_{\Phi_*,p}.$$
Moreover, it is a classical result that the space $L^{(\cosh-1)_*}(p)$ is separable and its dual space is $L^{\cosh-1}(p)$, the duality pairing being $(U,V) \mapsto \langle U, V\rangle_p = E_p[UV]$. We recall the following continuous embedding result that we shall use repeatedly in the paper:

Theorem 1. Given $p \in \mathcal P_>$, for any $1 < r < \infty$, the following embeddings are continuous:
$$L^\infty(p) \hookrightarrow L^{\cosh-1}(p) \hookrightarrow L^r(p) \hookrightarrow L^{(\cosh-1)_*}(p) \hookrightarrow L^1(p).$$

From this result, we easily deduce the following useful lemma.

Lemma 2. Given $p \in \mathcal P_>$ and $k \ge 1$. For any $u_1, \dots, u_k \in L^{\cosh-1}(p)$ one has $E_p[|u_1 \cdots u_k|] < \infty$.

Proof. According to Theorem 1, $u_i \in L^r(p)$ for any $1 < r < \infty$ and any $i = 1, \dots, k$. The proof then follows from the repeated use of Hölder's inequality.
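The objects of this section can be explored numerically on a finite sample space. The following is a minimal sketch, not part of the original development: the function names and the bisection scheme are illustrative assumptions, and the Luxemburg norm is taken in its standard form $\inf\{\rho > 0 : E_p[\Phi(u/\rho)] \le 1\}$.

```python
import math

def phi(x):
    """Young function Phi(x) = cosh(x) - 1."""
    return math.cosh(x) - 1.0

def phi_star(y):
    """Conjugate function Phi_*(y) = y*arsinh(y) + 1 - sqrt(1 + y^2)."""
    return y * math.asinh(y) + 1.0 - math.sqrt(1.0 + y * y)

def luxemburg_norm(u, p, young=phi, iters=200):
    """Luxemburg norm of u w.r.t. weights p on a finite sample space (assumed setting)."""
    top = max((abs(x) for x, w in zip(u, p) if w > 0), default=0.0)
    if top == 0.0:
        return 0.0

    def excess(rho):
        # E_p[young(u / rho)] - 1, restricted to the support of p
        return sum(w * young(x / rho) for x, w in zip(u, p) if w > 0) - 1.0

    hi = top
    while excess(hi) > 0:      # enlarge until the constraint holds
        hi *= 2.0
    lo = hi / 2.0
    while excess(lo) <= 0:     # shrink until the constraint fails
        lo /= 2.0
    for _ in range(iters):     # bisection on the monotone map rho -> excess(rho)
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0:
            lo = mid
        else:
            hi = mid
    return hi
```

For the symmetric two-point variable `u = [1, -1]` with `p = [0.5, 0.5]` the constraint reads $\cosh(1/\rho) - 1 = 1$, so the norm is $1/\operatorname{arcosh}(2)$; equality in Young's inequality occurs at $y = \sinh x$.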
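From now on we also define, for any $p \in \mathcal P_>$,
$$B_p = L^\Phi_0(p) := \left\{u \in L^{\cosh-1}(p) \mid E_p[u] = 0\right\}.$$
In the same way, we set
$${}^*B_p = L^{\Phi_*}_0(p) := \left\{u \in L^{(\cosh-1)_*}(p) \mid E_p[u] = 0\right\}.$$

2.1. Cumulant Generating Functional. Let $p \in \mathcal P_>$ be given. With the above notations one can define:

Definition 3. The cumulant generating functional is the mapping
$$K_p : B_p \ni u \mapsto \log E_p\left[e^u\right] \in [0, +\infty].$$

The following result [12,13] shows the properties of the exponential function as a superposition mapping [6].

Proposition 4. Let $a \ge 1$ and $p \in \mathcal P_>$ be given.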
(2) $v \mapsto \sum_{n=0}^\infty \frac{1}{n!}\left(\frac{v}{a}\right)^n$ is a power series from $L^\Phi(p)$ to $L^a(p)$, with radius of convergence larger than 1.
(3) The superposition mapping, $v \mapsto e^{v/a}$, is an analytic function from the open unit ball of $L^\Phi(p)$ to $L^a(p)$.
(2) $K_p$ is convex and lower semi-continuous, and its proper domain is a convex set that contains the open unit ball of $B_p$.
(3) $K_p$ is infinitely Gâteaux-differentiable in the interior of its proper domain.

Remark 1. One sees from property 2 above that the interior of the proper domain of $K_p$ is a non-empty open convex set. From now on we shall denote this interior by $\mathcal S_p$. Other properties of the functional $K_p$ are described below, as they relate directly to the exponential manifold.
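On a finite sample space the cumulant generating functional and the associated density $e^{u - K_p(u)}\,p$ can be computed directly. The sketch below is only an illustration under that finite-space assumption; the helper names (`center`, `K`, `e_chart`, `dK`) are hypothetical. It uses the identification of the first derivative of $K_p$ at $u$ in the direction $v$ with $E_q[v]$, $q = e^{u - K_p(u)}\,p$, which agrees with the gradient $q/p - 1$ quoted later in the text.

```python
import math

def center(u, p):
    """Subtract the p-mean so that the result belongs to B_p (finite-space analogue)."""
    mean = sum(w * x for w, x in zip(p, u))
    return [x - mean for x in u]

def K(u, p):
    """Cumulant generating functional K_p(u) = log E_p[exp(u)]."""
    return math.log(sum(w * math.exp(x) for w, x in zip(p, u)))

def e_chart(u, p):
    """Density q = exp(u - K_p(u)) * p associated with the chart coordinate u."""
    k = K(u, p)
    return [w * math.exp(x - k) for w, x in zip(p, u)]

def dK(u, v, p):
    """Directional derivative of K_p at u along v, computed as E_q[v]."""
    q = e_chart(u, p)
    return sum(a * b for a, b in zip(q, v))
```

By Jensen's inequality $K_p(u) \ge 0$ for centered $u$, with $K_p(0) = 0$; a central finite difference of `K` can be compared with `dK` as a sanity check.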

Exponential manifold
The set of positive densities, $\mathcal P_>$, locally around a given $p \in \mathcal P_>$, is modeled by the subspace of centered random variables in the Orlicz space $L^{\cosh-1}(p)$. Hence, it is crucial to discuss the isomorphism of the model spaces for different $p$'s in order to show the existence of an atlas defining a Banach manifold.
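Definition 6 (Statistical exponential manifold [13, Def. 20]). For $p \in \mathcal P_>$, the statistical exponential manifold at $p$ is
$$\mathcal E(p) = \left\{e^{u - K_p(u)} \cdot p \mid u \in \mathcal S_p\right\}.$$
We also need the following definition of connection by an open exponential arc; when two densities $p$ and $q$ are connected in this sense, one simply writes $p \smile q$.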
We collect here various results from [36,35,12,13] about additional properties of K p .
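(1) The first three derivatives of $K_p$ on $\mathcal S_p$ are, with $q = e^{u - K_p(u)} \cdot p$:
$$dK_p(u)[v] = E_q[v], \quad d^2K_p(u)[v_1,v_2] = \mathrm{Cov}_q(v_1,v_2), \quad d^3K_p(u)[v_1,v_2,v_3] = \mathrm{Cov}_q(v_1,v_2,v_3).$$
(2) The random variable $\frac{q}{p} - 1$ belongs to ${}^*B_p$ and:
$$dK_p(u)[v] = E_p\left[\left(\frac{q}{p} - 1\right) v\right].$$
In other words, the gradient of $K_p$ at $u$ is identified with an element of the predual space of $B_p$, viz. ${}^*B_p$, denoted by $\nabla K_p(u) = e^{u - K_p(u)} - 1 = \frac{q}{p} - 1$.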
(3) The weak derivative of the map $\mathcal S_p \ni u \mapsto \nabla K_p(u) \in {}^*B_p$ at $u$ applied to $w \in B_p$ is given by
$$d(\nabla K_p(u))[w] = \frac{q}{p}\left(w - E_q[w]\right),$$
and it is one-to-one at each point.
(4) $q/p \in L^{(\cosh-1)_*}(p)$.
(5) $B_q$ is defined by an orthogonality property: $B_q = \left\{u \in L^{\cosh-1}(p) \mid E_p\left[u\,\frac{q}{p}\right] = 0\right\}$.

On the basis of the above result, it appears natural to define the following parallel transports:
(1) The exponential transport ${}^eU_p^q : B_p \to B_q$ is computed as ${}^eU_p^q u = u - E_q[u]$.
(2) The mixture transport ${}^mU_q^p : {}^*B_q \to {}^*B_p$ is computed as ${}^mU_q^p v = \frac{q}{p}\, v$.

One has the following properties.

Proposition 12. Let $p, q \in \mathcal P_>$ be given. Then:
(1) The mixture transport ${}^mU_p^q$ and the exponential transport ${}^eU_q^p$ are dual of each other: if $u \in B_q$ and $v \in {}^*B_p$ then $\langle u, {}^mU_p^q v\rangle_q = \langle {}^eU_q^p u, v\rangle_p$.
(2) ${}^mU_p^q$ and ${}^eU_p^q$ together transport the duality pairing: if $u \in B_p$ and $v \in {}^*B_p$, then $\langle {}^eU_p^q u, {}^mU_p^q v\rangle_q = \langle u, v\rangle_p$.

In the scheme of the affine manifold, the domains of the charts centered at $p$ and $q$ respectively are either disjoint or equal if $p \smile q$.

Our discussion of the tangent bundle of the exponential manifold is based on the concept of the velocity of a curve as in [1, §3.3] and it is mainly intended to underline its statistical interpretation, which is obtained by identifying curves with one-parameter statistical models. For a statistical model $p(t)$, $t \in I$, the random variable $\dot p(t)/p(t)$ (which corresponds to the Fisher score) has zero expectation with respect to $p(t)$, and its meaning in the exponential manifold is velocity; see [11] on exponential families. More precisely, let $p(\cdot) : I \to \mathcal E(p)$, $I$ an open real interval containing zero. In the chart centered at $p$, the curve is $u(\cdot) : I \to B_p$, where $p(t) = e^{u(t) - K_p(u(t))} \cdot p$.
Definition 13 (Velocity field of a curve and tangent bundle).
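(1) Assume $t \mapsto u(t) = s_p(p(t))$ is differentiable with derivative $\dot u(t)$. Define:
$$Dp(t) = {}^eU_p^{p(t)}\,\dot u(t) = \dot u(t) - E_{p(t)}[\dot u(t)] = \frac{d}{dt}\log p(t) = \frac{\dot p(t)}{p(t)}.$$
Note that $Dp$ does not depend on the chart $s_p$ and that the derivative of $t \mapsto p(t)$ in the last term of the equation is computed in $L^{\Phi_*}(p)$. The curve $t \mapsto (p(t), Dp(t))$ is the velocity field of the curve.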
(2) On the set {(p, v)|p ∈ P > , v ∈ B p }, the charts: 3.1. Pretangent Bundle. Let M be a density and E (M ) its associated exponential manifold. Here M is generic, later it will be the Maxwell distribution. All densities are assumed to be in E (M ).
together with the charts: * s p : Let F be a vector field of the pretangent bundle, F : E (M ) → * T E (M ). In the chart centered at p, the vector field is expressed by Definition 15 (Covariant derivative in * T E (M )). Let F be a vector field of class C 1 of the pretangent bundle * T E (M ), and let G be a continuous vector field in the tangent bundle T E (M ). The covariant derivative is the vector field D G F of * T E (M ) defined at each q ∈ E (M ) by In the definition above the covariant derivative is computed in the mobile frame because its value at q is computed using the expression in the chart centered at q. In a fixed frame centered at p we write s p (q) = w so that e q (u) = e p (u − E p [u] + w), and compare the two expressions of F as follows.
], hence dF q (u) = m U q p dF p ( e U p q u + w) e U p q and at u = 0, we have dF q (0) = m U q p dF p (w) e U p q . It follows that the covariant derivative in the fixed frame at p is . The tangent and pretangent bundle can be coupled to produce the vector bundle of order 2 defined by q v, m U p q w) and the duality coupling: Proposition 16 (Covariant derivative of the duality coupling). Let F be a vector field of * T E (M ), and let G, X be vector fields of T E (M ), F, G of class C 1 and X continuous.
in the chart centered at any p ∈ E (M ): , and compute its derivative at 0 in the direction X(p).
We refer to [34,33] for further details on the geometric structure, namely the Hilbert bundle, the tangent mapping of a homeomorphism, and the Riemannian Hessian. We now turn to a basic example.

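3.2. Kullback-Leibler Divergence. The Kullback-Leibler divergence [23] on the exponential manifold $\mathcal E$ is the mapping
$$(q_1, q_2) \mapsto D(q_1 \| q_2) = E_{q_1}\left[\log\frac{q_1}{q_2}\right].$$
Notice that, for $q_2 = e^{u - K_{q_1}(u)} \cdot q_1$, the value $K_{q_1}(u) = E_{q_1}\left[\log\frac{q_1}{q_2}\right]$ is the expression in the chart centered at $q_1$ of the marginal Kullback-Leibler divergence $q_2 \mapsto D(q_1 \| q_2)$. Therefore, the Kullback-Leibler divergence is non-negative valued and zero if and only if $q_2 = q_1$ because of Theorem 8, item (4). Its expression in the chart centered at a generic $p \in \mathcal E$, with $q_i = e^{u_i - K_p(u_i)} \cdot p$, is
$$D_p(u_1, u_2) = K_p(u_2) - K_p(u_1) - dK_p(u_1)[u_2 - u_1],$$
which is the Bregman divergence [9] of the convex function $K_p : \mathcal S_p \to \mathbb R$.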
It follows from Proposition 5.(4) that it is $C^\infty$ jointly in both variables and, moreover, analytic. This regularity result is to be compared with what is available when the restriction $q_1 \smile q_2$ is removed, i.e., the semi-continuity [5, §9.4]. The partial derivative of $D_p$ in the first variable gives both the covariant derivative of the partial functional $q \mapsto D(q \| q_2)$ and its gradient. The negative gradient flow is the exponential arc of $q(0) \smile q_2$ in an exponential time scale.
The partial derivative of $D_p$ in the second variable gives both the covariant derivative of the partial functional $q \mapsto D(q_1 \| q)$ and its gradient. The negative gradient flow is a mixture model in an exponential time scale.
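The Bregman form of the Kullback-Leibler divergence can be checked on a finite sample space. The sketch below is illustrative only: the helper names are hypothetical, and the exponential arc is parametrized under the assumption that "exponential time scale" means weights $e^{-t}$ and $1 - e^{-t}$ on $\log q(0)$ and $\log q_2$.

```python
import math

def K(u, p):
    """K_p(u) = log E_p[exp(u)] on a finite sample space."""
    return math.log(sum(w * math.exp(x) for w, x in zip(p, u)))

def e_chart(u, p):
    """q = exp(u - K_p(u)) * p."""
    k = K(u, p)
    return [w * math.exp(x - k) for w, x in zip(p, u)]

def kl(q1, q2):
    """D(q1 || q2) = E_{q1}[log(q1/q2)]."""
    return sum(a * math.log(a / b) for a, b in zip(q1, q2))

def bregman(u1, u2, p):
    """K_p(u2) - K_p(u1) - dK_p(u1)[u2 - u1], with dK_p(u1)[v] = E_{q1}[v]."""
    q1 = e_chart(u1, p)
    slope = sum(a * (y - x) for a, x, y in zip(q1, u1, u2))
    return K(u2, p) - K(u1, p) - slope

def exponential_arc(q0, q2, t):
    """Normalized geometric interpolation q0^s * q2^(1-s), s = exp(-t) (assumed time scale)."""
    s = math.exp(-t)
    w = [a ** s * b ** (1.0 - s) for a, b in zip(q0, q2)]
    z = sum(w)
    return [x / z for x in w]
```

With this parametrization the divergence $D(q(t) \| q_2)$ is non-increasing in $t$, which is what one expects from a negative gradient flow of $q \mapsto D(q \| q_2)$.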

Gaussian space
In this Section the sample space is $\mathbb R^n$, $M$ denotes the standard $n$-dimensional Gaussian density (we denote it by $M$ after James Clerk Maxwell (1831-1879)) and $\mathcal E$ is the exponential manifold containing $M$. We recall that the Orlicz space $L^{\cosh-1}(M)$ is defined with the Young function $\Phi : x \mapsto \cosh x - 1$. The following propositions depend on the specific properties of the Gaussian density $M$. They do not hold in general.
and the latter is finite for all α such that α Hess f − I is negative definite.
2. The result comes from the fact that all polynomials belong to L 2 (M ) and one has 2 is defined and finite if the densities q 1 and q 2 belong to the same exponential manifold, the Boltzmann-Gibbs entropy (BG-entropy in the sequel) could be either non defined or infinite, precisely −∞, everywhere on some exponential manifolds, or finite everywhere on other exponential manifolds.
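Proposition 18. Assume $p \smile q$. Then:
(1) For each $a \ge 1$, $\log p \in L^a(p)$ if, and only if, $\log q \in L^a(q)$.
(2) $\log p \in L^{\cosh-1}(p)$ if, and only if, $\log q \in L^{\cosh-1}(q)$.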
Proof. If $p, q$ belong to the same exponential manifold, we can write $q = e^{u - K_p(u)} \cdot p$ and, from Theorem 8.(4), we obtain $\log q - \log p = u - K_p(u) \in L^{\cosh-1}(p) = L^{\cosh-1}(q)$, so that $\log q \in L^a(q)$ if, and only if, $\log p \in L^a(p)$, $a \ge 1$, and $\log q \in L^{\cosh-1}(q)$ if, and only if, $\log p \in L^{\cosh-1}(p)$.
In order to obtain a smooth function, we study the BG-entropy $H(q)$ on all manifolds $\mathcal E$ such that for at least one, and hence for all, $p \in \mathcal E$, it holds that $\log p \in L^{\cosh-1}(p)$. In such an exponential manifold we can write $H(q) = -E_q[\log q]$. For example, this is the case when the reference measure is finite and $p$ is constant. Another notable example is the Gaussian case, where the sample space is $\mathbb R^n$ endowed with the Lebesgue measure and $p(x) \propto \exp\left(-|x|^2/2\right)$. In such a case $\int \cosh(\alpha|x|^2)\exp\left(-|x|^2/2\right)dx < +\infty$ if $0 < \alpha < 1/2$.
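The integrability condition in the Gaussian case can be made explicit. The following worked computation is added for illustration; it only uses $\cosh y = (e^y + e^{-y})/2$ and the Gaussian integral $\int_{\mathbb R^n} e^{-a|x|^2}dx = (\pi/a)^{n/2}$, $a > 0$.

```latex
\begin{align*}
\int_{\mathbb R^n} \cosh\left(\alpha |x|^2\right) e^{-|x|^2/2}\,dx
  &= \frac12 \int_{\mathbb R^n} e^{-\left(\frac12 - \alpha\right)|x|^2}\,dx
   + \frac12 \int_{\mathbb R^n} e^{-\left(\frac12 + \alpha\right)|x|^2}\,dx \\
  &= \frac12 \left(\frac{2\pi}{1 - 2\alpha}\right)^{n/2}
   + \frac12 \left(\frac{2\pi}{1 + 2\alpha}\right)^{n/2},
  \qquad 0 < \alpha < \tfrac12 .
\end{align*}
```

The first integral diverges as $\alpha \uparrow 1/2$, which is why the condition $\alpha < 1/2$ appears.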
We investigate here the main properties of the BG-entropy in this context. First, one has Proposition 19. The BG-entropy is a smooth real function on the exponential manifold E. Namely, if p ∈ E then with −(u+log p+H(p)) ∈ B p , the representation of the BG-entropy in the chart centered at p is Notice that, for u = 0 and q = e p (u), Proposition 20. The gradient field ∇H over E can be identified, at each p, with random variable ∇H(p) ∈ B p ⊂ * B q , The gradient field ∇H over E, is then defined by D G H(p) = ∇H(p), G(p) p . This justifies the identification with the random variable Remark 3. The equation ∇H(p) = 0 implies log p = −H(p), hence p has to be constant and this requires it is the finite reference measure µ.
We refer to [33] for more details on the BG-entropy and in particular on the evolution of $H$ along $C^1$ curves $I \ni t \mapsto f_t$ in $\mathcal E$.

Boltzmann equation
We consider a space-homogeneous Boltzmann operator as it is defined, for example, in [41] and [40]. We retell the basic story in order to introduce our notations and the IG background. Orlicz spaces as a setting for Boltzmann's equation have been recently proposed by [26], while the use of exponential statistical manifolds has been suggested in [34, Example 11].

We denote by $v, w \in \mathbb R^3$ the velocities before collision, while the velocities after collision are denoted by $\bar v, \bar w \in \mathbb R^3$. The quadruple $(v, w, \bar v, \bar w) \in (\mathbb R^3)^4$ is assumed to satisfy the conservation laws
$$v + w = \bar v + \bar w, \qquad (4)$$
$$|v|^2 + |w|^2 = |\bar v|^2 + |\bar w|^2, \qquad (5)$$
which define an algebraic variety $M$ that we expect to have dimension $12 - (3 + 1) = 8$. The Jacobian matrix of the four defining Eq.s (4) and (5) is
$$\begin{pmatrix} I_3 & I_3 & -I_3 & -I_3 \\ 2v^T & 2w^T & -2\bar v^T & -2\bar w^T \end{pmatrix}. \qquad (6)$$
The Jacobian matrix in Eq. (6) has in general position full rank, and rank 3 if $v = w = \bar v = \bar w$. From (4) and (5) follows the conservation of both the scalar product, $v \cdot w = \bar v \cdot \bar w$, and of the norm of the difference, $|v - w| = |\bar v - \bar w|$, so that all the vectors of the quadruple lie on a circle with center $z = (v + w)/2 = (\bar v + \bar w)/2$ and are the four vertices of a rectangle. If $v = w$ then $\bar v = \bar w$, and also $v = w = \bar v = \bar w$ as the circle collapses to one point; $M^*$ denotes $M$ with these points removed.

There are various explicit and interesting parametrizations of $M^*$ available. An elementary parametrization consists of any algebraic solution of Eq.s (4) and (5) with respect to any of the 8 free coordinates. Other parametrizations are used in the literature, see classical references on the Boltzmann equation, e.g. [41].
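A first parametrization is
$$\bar v = \frac{v + w}{2} + \frac{|v - w|}{2}\,\sigma, \quad \bar w = \frac{v + w}{2} - \frac{|v - w|}{2}\,\sigma, \quad \sigma \in S^2. \qquad (7)$$
Conversely, on $M^*$ the collision transformation depends on the unit vector $\sigma = \widehat{\bar v - \bar w} \in S^2$, while the other terms depend on the collision invariants, as $|v - w|^2 = 2(|v|^2 + |w|^2) - |v + w|^2$. In conclusion, the transformation in Eq. (7) is
$$\bar v = \frac{v + w}{2} + \frac12\sqrt{2(|v|^2 + |w|^2) - |v + w|^2}\;\sigma, \quad \bar w = \frac{v + w}{2} - \frac12\sqrt{2(|v|^2 + |w|^2) - |v + w|^2}\;\sigma. \qquad (8)$$

A second parametrization of $M^*$ is obtained using the common space of two parallel sides of the velocity's rectangle,
$$\bar v = v - \Pi(v - w), \quad \bar w = w + \Pi(v - w), \qquad (9)$$
where $\Pi$ is the orthogonal projection on the subspace. Conversely, given any $\Pi$ in the set $\Pi(1)$ of projections of rank 1, consider the mapping
$$A_\Pi : (v, w) \mapsto (v_\Pi, w_\Pi) = \left((I - \Pi)v + \Pi w,\ \Pi v + (I - \Pi)w\right). \qquad (10)$$
The components in the direction of the image of $\Pi$ are exchanged, $\Pi v_\Pi = \Pi w$ and $\Pi w_\Pi = \Pi v$, while the orthogonal components are conserved. If $\omega$ is any of the two unit vectors such that $\Pi = \omega \otimes \omega$, the matrix $A_\Pi = A_{\omega\otimes\omega}$ does not depend on the direction of $\omega \in S^2$. Notice that $A_\Pi^T = A_\Pi$ and $A_\Pi A_\Pi = I_6$, that is, $A_\Pi$ is an orthogonal symmetric matrix. This parametrization uses the set $\Pi(1)$ of projection matrices of rank 1.

The $\sigma$-parametrization (7) and the $\Pi$-parametrization (10) are related as follows. Given the unit vector $\kappa = \widehat{v - w}$, the parameters $\Pi$ and $\sigma$ in Eq.s (9) and (8) are in 1-to-1 relation as
$$\Pi = \widehat{\kappa - \sigma} \otimes \widehat{\kappa - \sigma}, \quad \sigma = (I - 2\Pi)\kappa. \qquad (11)$$
These relations give the transition map from the parametrization (7) to the parametrization (10).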

5.2. Uniform probabilities on $S^2$ and on $\Pi(1)$. Let $\mu$ be the uniform probability on $S^2$, computed, for example, in polar coordinates. In such a way, $U : S^2 \to S^2$ and the right hand side of Eq. (12) does not depend on $U$.
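The two collision parametrizations discussed above, together with a uniform draw on $S^2$, can be sketched as follows. This is an illustration under stated assumptions: the $\sigma$-form places the outgoing velocities on the sphere of center $(v+w)/2$ and radius $|v-w|/2$, the $\Pi$-form exchanges the components along a unit vector $\omega$ with $\Pi = \omega \otimes \omega$, and all function names are hypothetical.

```python
import math
import random

def uniform_sphere(rng):
    """Uniform point on S^2, obtained by normalizing a standard Gaussian vector."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(3)]
        n = math.sqrt(sum(x * x for x in g))
        if n > 0.0:
            return [x / n for x in g]

def collide_sigma(v, w, sigma):
    """sigma-parametrization: center (v+w)/2, radius |v-w|/2, direction sigma."""
    z = [(a + b) / 2.0 for a, b in zip(v, w)]
    r = math.sqrt(sum((a - b) ** 2 for a, b in zip(v, w))) / 2.0
    return ([c + r * s for c, s in zip(z, sigma)],
            [c - r * s for c, s in zip(z, sigma)])

def collide_omega(v, w, omega):
    """Pi-parametrization with Pi = omega (x) omega: swap the components along omega."""
    d = sum((a - b) * o for a, b, o in zip(v, w, omega))  # omega . (v - w)
    return ([a - d * o for a, o in zip(v, omega)],
            [b + d * o for b, o in zip(w, omega)])

def invariants(v, w):
    """Collision invariants: total momentum v + w and total energy |v|^2 + |w|^2."""
    momentum = [a + b for a, b in zip(v, w)]
    energy = sum(a * a for a in v) + sum(b * b for b in w)
    return momentum, energy
```

Both maps preserve the collision invariants; the $\omega$-form is an involution and is unchanged when $\omega$ is replaced by $-\omega$, in line with the properties of $A_\Pi$ stated in the text.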

5.3.
Conditioning on the collision invariants. Given a function g : R 3 × R 3 , Eq. (8) shows that the function depends on the collision invariants only. This, in turn, implies thatĝ is the conditional expectation of g with respect of the collision invariants under any probability distribution on R 3 × R 3 such that the collision invariants and the unit vector of σ are independent, the unit vector σ being uniformly distributed. See below a more precise statement in the case of the Gaussian distribution. On the sample space (R 3 , dv), let M be the standard normal density defined in (3) (the Maxwell density). As, for all Π ∈ Π(1), Under the same distributions, the random variables V +W √ 2 , |V −W | √ 2 , V − W , are independent, with distributions given by respectively. Hence, given any S ∼ µ such that V +W  7) is supported by the manifold M * . Such a distribution has the property that the projections on both the first two and the last two components are M ⊗ M . The joint distribution is not Gaussian; in fact the support M * is not a linear subspace. We will call this distribution the normal collision distribution.
The second parametrization in Eq. (9) shows that the variety M contains the bundle of linear spaces The distribution of under the normal collision distribution is obtained from Eq. (11). In fact Π is the projector on the subspace generated by κ − σ where (v, w, σ) → κ = v − w is uniformly distributed and independent from σ. Hence, Eq. (18) shows that (v, w, σ) → κ − σ is uniformly distributed on S 2 so that the distribution in Eq. (24) is the ν measure defined in Eq. (13). Conditionally to Π, the normal collision distribution is Gaussian with covariance We can give the previous remarks a more probabilistic form as follows.
Proposition 21 (Conditioning). Let M be the density of the standard normal N(0 3 , I 3 ) and g : R 3 × R 3 → R be an integrable function. It holds the following (2) For a generic integrable h : The random variable is a function of the collision invariants i.e., it is of the formg(V + W, |V | 2 + |W | 2 ). For all bounded h 1 : R 3 → R and h 2 : R → R, we apply the previous computation to h = gh 1 h 2 to get (3) We use Item 2 and the equalityÂ (17), the left-end-side can be rewritten as an integral with respect to ω ∈ S 2 , If Π = ω ⊗ ω, then |κ · ω| = |ω ⊗ ωκ| = |Πκ|. Using that together with the definition of the measure ν on Π(1) in Eq. (13), we have the result: In order to show that E M ⊗M (f ) t < +∞ for t ∈ I, it is enough to consider the convex cases, t < 0 and t > 1, because otherwise E M ⊗M (f ) t 1. We have so that in the convex cases: To conclude, use Bayes' formula for conditional expectation, .
to the expressions of conditional expectation in Item 2 and Item 3 above.
Remark 4. In the last Item, we compute a conditional expectation of a density f , that is The random variablef : R 3 × R + → R is the density of the image of F dvdw with respect to the image of M (v)M (w) dvdw.

Interactions.
We introduce here the crucial role played by microscopic interaction in the definition of the Boltzmann collision operator. In the physics literature, such interactions are referred to as the kinetic collision kernel and take into account the intermolecular forces experienced by particles during a collision [41]. Before defining formally what we mean by interaction, we first observe that, if $M$ is the Maxwell density on $\mathbb R^3$ and $f, g \smile M$, then $f \otimes g \smile M \otimes M$, where $M \otimes M$ is the standard normal density on $\mathbb R^6$ and $f \otimes g$ is a density on $\mathbb R^6$.

Definition 22.
With the previous notations, we say that b : is a density.
Sometimes, we abuse notation by writing $E_{b \cdot f \otimes g}[\cdot]$, where the obvious normalization is not written down. As can be seen, here interactions indicate only a class of suitable weight functions $b$ for which $b(v, w)f(v)g(w)$ is still (up to normalisation) a density. This is in accordance with the usual role played by the kinetic collision kernel (see [41]).
According to the Portmanteau Theorem 8, it holds b Then the b is an interaction on E (M ) × E (M ) and for all f, g M the following holds.
and moreover a sufficiency relation holds i.e., for all integrable F : For the second inequality we use the Hardy-Littlewood-Sobolev inequality [25,Th. 4.3]. We have |u − v| λ dudv. If 1 β + λ 3 + 1 α = 2, by the H-L-S inequality the last integral is bounded by a constant times f α g β . From f, g M we get that f α g β is finite for α, β in a right neighborhood of 1. There exists = 3 λ 2 − 1 α − 1 β > 0 satisfying all conditions. 2. It is a special case of the Conditioning Theorem 21.
Let us discuss the differentiability of the operations we have just introduced.
with tangent mapping given for any vector field X ∈ T E (M ) by
Prop. 23. We shall call such a b a proper interaction.
For each random variable $g \in L^{\cosh-1}(M) = L^{\cosh-1}(f)$, define $\frac12\, g \oplus g$ by $(v, w) \mapsto \frac12\left(g(v) + g(w)\right)$, which is easily shown to belong to $L^{\cosh-1}(M \otimes M) = L^{\cosh-1}(f \otimes f)$. The mapping $g \mapsto \frac12\, g \oplus g$ is a version of the conditional expectation $E_{M \otimes M}(g \mid \mathcal S)$, where $\mathcal S$ is the σ-algebra generated by symmetric random variables. We define the operator $A$ as a version of a conditional expectation with respect to $\mathcal I \subset \mathcal S$, the σ-algebra generated by the collision invariants $(v, w) \mapsto (v + w, |v|^2 + |w|^2)$.
As constant random variables are in the kernel of the operator A, we assume E f [g] = 0. Analogously, as the kernel of the operator contains all symmetric random variables, we could always assume that g is anti-symmetric.
The nonlinear operator $f \mapsto E_{b \cdot f \otimes f}[Ag]$ is Maxwell's weak form of the Boltzmann operator, $g$ being a test function.
Proposition 25. Given a proper interaction b and a density f ∈ E (M ), the linear map (1) It can be represented in the duality where Q is the Boltzmann operator with interaction b, Especially, if f = e U −K M (U ) M and we take g = log f M , then Proof. The continuity follows from the Portmanteau theorem and the continuity of the conditional expectation. Item 1 follows from the projection properties of the conditional expectation and from general properties of Orlicz spaces. Item 2 is a special case of the previous one.
It follows from the previous proposition and from the discussion in [33, Prop. 10] that $f \mapsto Q(f)/f$ is a vector field in the pretangent bundle ${}^*T\mathcal E(M)$ and its flow is equivalent to the standard Boltzmann equation $\dot f_t = Q(f_t)$. We do not discuss in this paper the implications of that presentation of the Boltzmann equation for the existence properties. We turn our attention to the comparison of the Boltzmann field to the gradient field of the entropy.

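Entropy generation. If $t \mapsto p(t)$ is a curve in $\mathcal E(M)$, the entropy of $p(t)$ is defined for all $t$ and the variation of the entropy along the curve is computed as
$$\frac{d}{dt}H(p(t)) = -\frac{d}{dt}\int p(v;t)\log p(v;t)\,dv = -\int \log p(v;t)\,\dot p(v;t)\,dv.$$
In our setup, the computation takes the following form. Let $t \mapsto p_t$ be a differentiable curve in $\mathcal E(M)$ with velocity $t \mapsto Dp_t \in {}^*B_{p_t}$. As the gradient $\nabla H$ is given at $p$ by $(\nabla H)(p) = -(\log p + H(p)) \in B_p$, we have
$$\frac{d}{dt}H(p(t)) = -\left\langle \log p(t) + H(p(t)),\, Dp(t)\right\rangle_{p(t)}.$$
In particular, if $Dp(t) = \dot p(v;t)/p(v;t)$ we recover the previous computation, because $E_{p(t)}[Dp(t)] = 0$. Assume now that $(p, t) \mapsto \gamma(t;p)$ is the flow of a vector field $F : \mathcal E(M) \to {}^*T\mathcal E(M)$. Then
$$\frac{d}{dt}H(\gamma(t;p)) = -\left\langle \log\gamma(t;p) + H(\gamma(t;p)),\, D\gamma(t;p)\right\rangle_{\gamma(t;p)} = -\left\langle \log\gamma(t;p) + H(\gamma(t;p)),\, F(\gamma(t;p))\right\rangle_{\gamma(t;p)},$$
that is, for each $p \in \mathcal E(M)$, the entropy production at $p$ along the vector field $F$ is
$$\nabla_F H(p) = \langle \nabla H(p), F(p)\rangle_p = -\left\langle \log p + H(p),\, F(p)\right\rangle_p.$$
In particular, if the vector field $F$ is the Boltzmann vector field, $F(f) = Q(f)/f$, we have that for each $f \in \mathcal E(M)$ the Boltzmann entropy production is
$$-\left\langle \log f + H(f),\, Q(f)/f\right\rangle_f = -\int Q(f)(v)\,\log f(v)\,dv.$$
We recover a well-known formula for the entropy production of the Boltzmann operator [41]. We now proceed to compute the covariant derivative of the entropy production.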
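The formula for the variation of the entropy along a curve can be verified numerically on a finite sample space with counting reference measure. The sketch below is an illustration, not the paper's construction: it assumes the curve $p(t) \propto p_0\,e^{tu}$, whose velocity is $Dp(t) = u - E_{p(t)}[u]$, and the function names are hypothetical.

```python
import math

def entropy(p):
    """BG-entropy H(p) = -sum p log p (counting reference measure)."""
    return -sum(x * math.log(x) for x in p)

def curve(t, p0, u):
    """One-parameter exponential family p(t) proportional to p0 * exp(t*u)."""
    w = [a * math.exp(t * b) for a, b in zip(p0, u)]
    z = sum(w)
    return [x / z for x in w]

def velocity(t, p0, u):
    """Score Dp(t) = d/dt log p(t) = u - E_{p(t)}[u]."""
    p = curve(t, p0, u)
    mean = sum(a * b for a, b in zip(p, u))
    return [b - mean for b in u]

def entropy_rate(t, p0, u):
    """-< log p(t) + H(p(t)), Dp(t) >_{p(t)} as in the duality formula of the text."""
    p = curve(t, p0, u)
    dp = velocity(t, p0, u)
    h = entropy(p)
    return -sum(a * (math.log(a) + h) * s for a, s in zip(p, dp))
```

A central finite difference of `entropy(curve(t, ...))` in `t` should agree with `entropy_rate` up to discretization error.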

Proposition 26. Let $X$ be a vector field of $T\mathcal E(M)$ and let $F$ be a vector field of ${}^*T\mathcal E(M)$.
(1) The Hessian of the entropy (in the exponential connection) is $D_X \nabla H = -X$.
(2) The covariant derivative of the entropy production along $F$ is $D_X \langle \nabla H, F\rangle = -\langle X, F\rangle + \langle \nabla H, D_X F\rangle$.

Proof. We note that the entropy production along a vector field $F$, $\mathcal D = (\nabla_F H)(p) = \langle \nabla H, F\rangle$, is a function of the duality coupling of $T\mathcal E(M) \times {}^*T\mathcal E(M)$, so that we can apply Prop. 16 to compute its covariant derivative along $X$ as $D_X \mathcal D = \langle D_X \nabla H, F\rangle + \langle \nabla H, D_X F\rangle$. The first term involves $D_X \nabla H(p)$, which is the Hessian of the entropy in the exponential connection; let us compute it. First, we compute the expression of $\nabla H(q) = -(\log q + H(q)) \in B_q$, $q \in \mathcal E(M)$, in the chart centered at $p$. We have $-\log q = K_p(U) - U - \log p$, $U \in \mathcal S_p$, $q = e_p(U)$, and, finally, the expression of $\nabla H$ in the chart centered at $p$ is
$$(\nabla H)_p(U) = {}^eU_{e_p(U)}^{p}\,(\nabla H) \circ e_p(U) = \nabla H(p) - U.$$
Note that this function is affine, and its derivative in the direction $X(p)$ is $d(\nabla H)_p(U)[X(p)] = -X(p)$. It follows that $D_X \nabla H = -X$.
The application of this computation to the Boltzmann field i.e. F (f ) = Q(f )/f requires the existence of the covariant derivative of the Boltzmann operator. We leave this discussion as a research plan.

Weighted Orlicz-Sobolev model space
We show in this section that the Information Geometry formalism described in Sections 2 and 3 is robust enough to accommodate differential operators (e.g., the classical Laplacian). This leads naturally to the introduction of weighted Orlicz-Sobolev spaces. While the case "without derivative" studied in Section 3 was well-suited for the study of the fine properties of the Kullback-Leibler divergence, we illustrate in Sec. 6.3 our use of Orlicz-Sobolev spaces with the fine study of the Hyvärinen divergence. This is a special type of divergence between densities that involves an $L^2$-distance between gradients of densities [21] and has multiple applications. In particular, it is related to the so-called Fisher information as it is defined, for example, in [40, p. 49], which has deep connections with the Boltzmann equation, see [39]. However, the name Fisher information should not be used in a statistics context, where it rather refers to the expression in coordinates of the metric of statistical models considered as pseudo-Riemannian manifolds, e.g., [4].
We introduce the Orlicz-Sobolev spaces with weight M , Maxwell density on R n , where ∂ j is the derivative in the sense of distributions. They are both Banach spaces, see [28, §10].
Since $\partial_j(fG) = G\,\partial_j f + Gf\,\partial_j u$ a.e., one gets $\partial_j(fG) \in L^{(\cosh-1)_*}(M)$ for any $j = 1, \dots, n$, which proves the result. The Orlicz-Sobolev spaces $W^1_{\cosh-1}(M)$ and $W^1_{(\cosh-1)_*}(M)$, as defined in Eq. (26) and (26) respectively, are instances of Gaussian spaces of random variables and they inherit from the corresponding Orlicz spaces a duality form. In this duality the adjoint of the partial derivative has a special form coming from the form of the weight $M$, see e.g. [27, Ch. V]. We have the following:
(1) Let $f \in C_0^\infty(\mathbb R^n)$ and $g \in W^1_{\cosh-1}(M)$. Then
$$E_M\left[f\,\partial_j g\right] = E_M\left[\left(X_j f - \partial_j f\right) g\right],$$
where $X_j$ is the multiplication operator by the $j$-th coordinate $x_j$.
(1) As f M ∈ C ∞ 0 (R n ), we have by definition of distributional derivative, (2) Let us observe first that, according to Holder's inequality i.e. X j f ∈ L 1 (M ). Since Φ * = (cosh −1) * enjoys the so-called ∆ 2 -condition, to prove the stronger result First of all, using the tensorization property of the Gaussian measure, i.e. the fact that M (x) = M 1 (x 1 ) . . . M 1 (x n ) for any x = (x 1 , . . . , x n ) ∈ R n where M 1 stands for the one-dimensional standard Gaussian, we claim that it is enough to prove the result for n = 1. Indeed, given f ∈ W 1 (cosh −1) * (M ) and x j ∈ R (j = 1, . . . , n), any x = (x 1 , . . . , x n ) ∈ R n can be identified with where F denotes the distributional derivative of F = F (y). In particular, if there exists C > 0 such that we get the desired result.
Let us then prove (31) and fix F ∈ W 1 (cosh −1) * (M 1 ). From Φ * (y) = Now, the derivative of G exists because of the assumption F ∈ W 1 (cosh −1) * (M 1 ) (that is, |F | ∈ W 1 (cosh −1) * (M 1 ) and its derivative is given by the derivation of a composite function) and it is computed as Using Young's inequality with Φ = cosh −1 and Φ * = (cosh −1) * we get All the terms in the right-hand side of the above inequality are integrable with respect to the measure M 1 (y) dy over R. Indeed the first term is bounded as and y → yF (y) ∈ L 1 (R, M 1 (y)dy). The second term is integrable by assumption. The only concern is then the last term. For any r > 0, |y|>r M 1 (y) Now, splitting the integral into the two integrals r 0 and 0 −r , one can use the one-dimensional Hardy inequality in Orlicz-Sobolev space [15] to get that there exists C > 0 such that This achieves to prove (31). (3) Recall that (29) holds for any f ∈ C ∞ 0 (R n ) and any g ∈ L cosh −1 (M ). Since Φ * = (cosh −1) * enjoys the ∆ 2 -condition, it is a well-known fact that C ∞ 0 (R n ) is dense in W 1 (cosh −1) * (M ) (for the norm · W 1 (cosh −1) * (M ) ). Therefore, approximating any f ∈ W 1 (cosh −1) * (M ) by a sequence of C ∞ 0 (R n ) functions, we deduce the result from point 1.
6.1. Stein and Laplace operators. Following the language of [27, Chapter V], Item 3 of the above Proposition can be reformulated saying that This allows us to define the Stein operator $\delta$ on $L^{(\cosh-1)_*}(M)$ as where the domain $\mathrm{Dom}(\delta)$ of $\delta$ is exactly $W^1_{(\cosh-1)_*}(M)$ according to point 2 of the above Proposition. Notice that, since $\Phi_*$ enjoys the $\Delta_2$-condition, $\mathrm{Dom}(\delta)$ is dense in $L^{(\cosh-1)_*}(M)$. One then easily deduces that $\delta$ is a closed and densely defined operator in $L^{(\cosh-1)_*}(M)$.
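The duality behind the Stein operator can be checked numerically in dimension one. The sketch below assumes the standard Gaussian integration-by-parts form $E_M[f'\,g] = E_M[f\,(Xg - g')]$, that is $\delta g = Xg - g'$. This sign convention, the test functions $f(x) = x^3$ and $g(x) = x^2$, and the helper name `gaussian_expectation` are assumptions made for illustration only.

```python
import math

def gauss(x):
    """Density of the one-dimensional standard Gaussian M_1."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def gaussian_expectation(h, a=-12.0, b=12.0, n=24000):
    """Composite trapezoid approximation of the integral of h(x) * M_1(x) over [a, b]."""
    step = (b - a) / n
    total = 0.5 * (h(a) * gauss(a) + h(b) * gauss(b))
    for k in range(1, n):
        x = a + k * step
        total += h(x) * gauss(x)
    return total * step

# Illustrative test functions (assumptions, not taken from the text).
f = lambda x: x ** 3
df = lambda x: 3.0 * x ** 2
g = lambda x: x ** 2
dg = lambda x: 2.0 * x

# Left-hand side: E_M[f' g].
lhs = gaussian_expectation(lambda x: df(x) * g(x))
# Right-hand side: E_M[f * (X g - g')], with the assumed form delta g = X g - g'.
rhs = gaussian_expectation(lambda x: f(x) * (x * g(x) - dg(x)))

print("E[f' g]      =", lhs)
print("E[f delta g] =", rhs)
```

Both sides agree up to quadrature error; with these test functions the common value is 9, since $E[3X^4] = 9$ and $E[X^6 - 2X^4] = 15 - 6 = 9$ under the standard Gaussian.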
6.2. Exponential family based on Orlicz-Sobolev spaces with Gaussian weights. If we restrict the exponential family $\mathcal E(M)$ to $M$-centered random variables of $W^1_{\cosh-1}(M)$, that is in , we obtain the following nonparametric exponential family . Every feature of the exponential manifold carries over to this case. In particular, we can define the spaces , to be models for the tangent spaces of $\mathcal E_1(M)$. Note that the transport acts on these spaces $\in W_g$, so that we can define the tangent bundle to be $T\mathcal E_1(M) = \{(g, V) \mid g \in \mathcal E_1(M),\ V \in W_g\}$ and take as charts the restrictions of the charts defined on $T\mathcal E(M)$.
As a first example of application, note that the gradient of the BG-entropy
Remark 11.
(1) The mapping $\mathcal E_1(M) \ni f \mapsto \nabla \log f = f^{-1} \nabla f$ is, in statistical terms, an estimating function or a pivot, because $E_f[\nabla \log f] = 0$, $f \in \mathcal E_1(M)$. This means that it is a random variable whose expected value is zero if $f$ is correct. If $g$ is correct, then the expected value is E g [∇ log f ] = E g ∇ log f g = − Cov g log f g , ∇ log g M . The second moment of $\nabla \log \frac{f}{g}$ was used by Hyvärinen as a measure of deviation from $f$ to $g$ [21].
(2) Hyvärinen's work has been used to discuss proper scoring rules in [30].
(3) The same notion is known in Physics under the name of relative Fisher information e.g., see [40].
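The two statements of item (1), that the score has zero mean under its own density and that the second moment of $\nabla \log \frac{f}{g}$ measures the deviation from $f$ to $g$, can be illustrated on one-dimensional Gaussians. The sketch below treats $f$ and $g$ as densities with respect to Lebesgue measure, which is an assumption, and it computes the plain second moment $E_g[|\nabla \log f - \nabla \log g|^2]$ without any normalizing factor. All function names are hypothetical.

```python
import math

def normal_pdf(x, mean, sd):
    """Lebesgue density of N(mean, sd^2)."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def score(x, mean, sd):
    """Derivative of the log-density of N(mean, sd^2) with respect to x."""
    return -(x - mean) / (sd * sd)

def integrate(h, a=-30.0, b=30.0, n=60000):
    """Composite trapezoid rule for the integral of h over [a, b]."""
    step = (b - a) / n
    total = 0.5 * (h(a) + h(b))
    for k in range(1, n):
        total += h(a + k * step)
    return total * step

def score_mean(mean, sd):
    """E_f[d/dx log f] for f = N(mean, sd^2); expected to vanish."""
    return integrate(lambda x: score(x, mean, sd) * normal_pdf(x, mean, sd))

def score_second_moment(mean_f, sd_f, mean_g, sd_g):
    """E_g[(d/dx log f - d/dx log g)^2] for Gaussian f and g."""
    def integrand(x):
        diff = score(x, mean_f, sd_f) - score(x, mean_g, sd_g)
        return diff * diff * normal_pdf(x, mean_g, sd_g)
    return integrate(integrand)

# f = N(0, 1), g = N(1, 4): the score difference is -0.75 x - 0.25,
# with mean -1 and variance 2.25 under g, so the second moment is 3.25.
zero_mean = score_mean(1.0, 2.0)
deviation = score_second_moment(0.0, 1.0, 1.0, 2.0)
print("E_g[score of g]      =", zero_mean)
print("second moment (f, g) =", deviation)
```

For Gaussians the quantity has the closed form $\sigma_g^2 \left(\sigma_g^{-2} - \sigma_f^{-2}\right)^2 + (m_f - m_g)^2 / \sigma_f^4$, which gives $3.25$ for the parameters used here and vanishes exactly when $f = g$.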
In the following we denote the gradient of a function defined on the exponential manifold E 1 (M ), which is a random variable, by ∂.
(1) The Hyvärinen divergence is finite and infinitely differentiable everywhere in both variables.
(1) For each $w \in V$ the gradient $\nabla w$ is in $(L^{\Phi}(g))^n = (L^{\Phi}(M))^n$ for all $g \in \mathcal E_1(M)$, hence it is $g$-square integrable for all $g \in \mathcal E_1(M)$. Moreover, the squared norm function $S_M \times (B_M)^n \ni (v, w) \mapsto E_{e_M(v)}\left[|w|^2\right]$ is infinitely differentiable because it is the moment functional, We can compose this function with the linear function where we notice that all the terms are well defined whenever $u, v \in S_M \cap V$, $w \in V$.
Using now (33) with $w_2 = u - v$ and $w_1 = w$ we get that Since this is true for any $w \in V$ we get where, of course, $\Delta(u - v)$ is meant in $W^{-1}_{\cosh-1}(M)$ (notice that $w \in W^1_{\cosh-1}(M) \subset W^1_{(\cosh-1)_*}(M)$). The formula for the partial gradient in absolute variables follows from $(X - \nabla v) \cdot \nabla(u - v) = -\nabla \log g \cdot \nabla \log \frac{f}{g}$ for, with a slight abuse of notation, we identify $\Delta \log \frac{f}{g}$ with $\Delta(u - v)$.
(3) As above, let $g = e_M(v)$, $f = e_M(u)$, $u, v \in S_M \cap V$ be given. For any $w \in V$, we compute first the directional derivative. One gets now One recognizes in the first term −d (u → DH M (v u)) [w] while the second term is given by ) $|\nabla u - \nabla v|^2$ $= \operatorname{Cov}_g\left(w, |\nabla u - \nabla v|^2\right)$.
As in the previous item, this gives the result.
As is well documented, the Hyvärinen divergence is a powerful tool for the study of general diffusion processes. We have just shown that the Information Geometry formalism and the exponential manifold approach are robust enough to allow for a generalization to Orlicz-Sobolev spaces. We therefore believe that, just as the Boltzmann equation can be studied through the exponential manifold formalism in $L^{\cosh-1}(M)$, general diffusion processes can be investigated in $W^1_{\cosh-1}(M)$ with the formalism discussed in the present section. This is a plan for future work.

Conclusions and Discussion
We have shown that well-known geometric features of problems in Statistical Physics can be turned into precise formal results via a careful consideration of the relevant functional analysis.
In particular, the notion of flow in a Banach manifold modeled on Orlicz spaces can be used to clarify arguments based on the evolution of the classical Boltzmann-Gibbs entropy along the vector field associated with the Boltzmann equation.
In the last section we have shown how to construct a similar theory in the case where the generalized entropy under consideration is the so-called Fisher functional or Hyvärinen divergence. Such a generalised entropy is particularly well-suited for the study of general