Examples of Application of Nonparametric Information Geometry to Statistical Physics

We review a nonparametric version of Amari's Information Geometry in which the set of positive probability densities on a given sample space is endowed with an atlas of charts to form a differentiable manifold modeled on Orlicz Banach spaces. This nonparametric setting is used to discuss typical problems in Machine Learning and Statistical Physics, such as relaxed optimization, Kullback-Leibler divergence, Boltzmann entropy, and the Boltzmann equation.


Introduction
Information Geometry was developed in the seminal monograph by Amari and Nagaoka [1], where previous, essentially metric, descriptions of probabilistic and statistical concepts are extended in the direction of differential geometry, including the fundamental treatment of differential connections. The differential geometry involved in their construction is finite dimensional and the formalism is based on coordinate systems. Following a suggestion by Phil Dawid in [2,3], a particular nonparametric version of the Amari-Nagaoka theory was developed in a series of papers [4][5][6][7][8][9][10][11][12][13], where the set $\mathcal P_>$ of all strictly positive probability densities of a measure space is shown to be a Banach manifold (as defined in [14][15][16]) modelled on an Orlicz Banach space (see e.g. [17,Ch II]).
Specifically, Gibbs densities $q = e^{u - K_p(u)} \cdot p$, $E_p[u] = 0$, are represented by the chart $s_p \colon q \mapsto u$. Because of the exponential form, the random variable $u$ is required to belong to an exponential Orlicz space, which is similar to an ordinary Lebesgue space but lacks some important features of these spaces, such as reflexivity and separability. On the other hand, the nonparametric setting emphasizes the fact that statistical manifolds are actually affine manifolds with a Hessian structure, see [18].
Such a formalism has been frequently criticised as too involved to be of use in practical applications, and as lacking really new results with respect to the Amari-Nagaoka theory. However, it should be observed that most applications in Statistical Physics, such as Boltzmann equation theory [19], are intrinsically nonparametric. I like to quote here a line by Serge Lang in [16, p. vi]: "One major function of finding proofs valid in the infinite dimensional case is to provide proofs which are especially natural and simple in the finite dimensional case."
This paper is organized as follows. Sec.s 2 and 3 review the basic material on statistical exponential manifolds, with some emphasis on the functional analytic setting and on second order structures. Sec. 4 discusses examples of application to the differential geometry of expected values, Kullback-Leibler divergence, Boltzmann entropy, and the Boltzmann equation. Sec. 5 presents some topics that require further study, together with some lines of current research.

Model spaces.
Given a σ-finite measure space $(\Omega, \mathcal F, \mu)$, we denote by $\mathcal P_>$ the set of all densities which are positive µ-a.s., by $\mathcal P_\ge$ the set of all densities, and by $\mathcal P_1$ the set of measurable functions $f$ with $\int f\,d\mu = 1$. In the finite state space case $\mathcal P_1$ is an affine subspace, $\mathcal P_\ge$ is the simplex, and $\mathcal P_>$ is its topological interior. We summarize below the basic notations and results. Missing proofs can be found e.g. in [9] and in [17,Ch II].
If both $\phi$ and $\phi_*$ are monotone, continuous functions on $\mathbb R_\ge$ onto itself such that $\phi^{-1} = \phi_*$, we call the pair $\Phi(x) = \int_0^x \phi(s)\,ds$, $\Phi_*(y) = \int_0^y \phi_*(s)\,ds$ a Young pair. Each Young pair satisfies the Young inequality $|xy| \le \Phi(x) + \Phi_*(y)$ (1), with equality if, and only if, $y = \phi(x)$. The relation in a Young pair is symmetric and either element is called a Young function. We will use the following Young pairs: Let us derive a few elementary but crucial inequalities. If hence, as From (3) we have for a > 1 In a similar way, from we have a relation similar to (4), that is for all y Property (3) does not hold in this case. Such an inequality is called the $\Delta_2$-condition and has an important role in the theory of Orlicz spaces.
If $\Phi$ is any Young function, a real random variable $u$ belongs to the Orlicz space $L^\Phi(p)$ if $E_p[\Phi(\alpha u)] < +\infty$ for some $\alpha > 0$. A norm is obtained by taking the set $\{u : E_p[\Phi(u)] \le 1\}$ to be the closed unit ball. It follows that the open unit ball consists of those $u$'s such that $\alpha u$ is in the closed unit ball for some $\alpha > 1$. The corresponding norm $\|\cdot\|_{\Phi,p}$ is called the Luxemburg norm and defines a Banach space, see e.g. [17,Th 7.7]. From (6) and (4) it follows that cases (a) and (b) in (2) define equal vector spaces with equivalent norms, see [9, Lemma 1]; therefore we drop any mention of them.
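The Young inequality (1) can be checked numerically. The sketch below assumes the Young pair generated by $\sinh$ and its inverse, that is $\cosh y - 1$ and $x \sinh^{-1} x - \sqrt{1 + x^2} + 1$; the function names are illustrative and not taken from the references.

```python
import math

def phi_exp(y):
    """Young function cosh(y) - 1, the primitive of sinh vanishing at 0."""
    return math.cosh(y) - 1.0

def phi_conj(x):
    """Assumed conjugate: x*asinh(x) - sqrt(1 + x^2) + 1, the primitive of asinh."""
    return x * math.asinh(x) - math.sqrt(1.0 + x * x) + 1.0

def young_gap(x, y):
    """phi_conj(x) + phi_exp(y) - x*y, which the Young inequality says is >= 0."""
    return phi_conj(x) + phi_exp(y) - x * y

# Scan a grid of nonnegative arguments, then the equality curve y = asinh(x).
grid = [0.1 * k for k in range(51)]
smallest_gap = min(young_gap(x, y) for x in grid for y in grid)
largest_gap_on_curve = max(abs(young_gap(x, math.asinh(x))) for x in grid)
print(smallest_gap, largest_gap_on_curve)
```

Up to rounding, the gap is nonnegative on the grid and vanishes along $y = \sinh^{-1}(x)$, which is the equality case of (1) for this pair.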
The Young function $\cosh - 1$ has been chosen here because the condition $E_p[\Phi(\alpha v)] < +\infty$ is clearly equivalent to $E_p[e^{tv}] < +\infty$ for $t \in [-\alpha, \alpha]$, that is, the random variable $v$ has a Laplace transform around 0. The case of a moment generating function defined on all of the real line is special and defines a notable subspace of the Orlicz space. The use of such a space has been proposed by [20].
There are technical issues in working with Orlicz spaces such as $L^{(\cosh - 1)}(p)$, in particular the regularity of its unit sphere $S_{\cosh - 1} = \{u : \|u\|_{(\cosh - 1),p} = 1\}$. In fact, while $E_p[\cosh u - 1] = 1$ implies $u \in S_{\cosh - 1}$, the latter implies only $E_p[\cosh u - 1] \le 1$. Subspaces of $L^\Phi(p)$ where $\|u\|_{(\cosh - 1),p} = 1$ and $E_p[\cosh u - 1] < 1$ cannot hold at the same time are of special interest. In general, the sphere $S_{\cosh - 1}$ is not smooth, see an example in [13,Ex. 3].
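On a finite sample space the Luxemburg norm for $\Phi = \cosh - 1$ can be computed by bisection, which may help to fix ideas. The following is a minimal sketch, assuming a strictly positive probability vector `p` and using the gauge form of the norm, $\inf\{\rho > 0 : E_p[\Phi(u/\rho)] \le 1\}$; the helper names are hypothetical.

```python
import math

def mean_phi(u, p, rho):
    """E_p[cosh(u / rho) - 1] on a finite sample space."""
    return sum(pi * (math.cosh(ui / rho) - 1.0) for ui, pi in zip(u, p))

def luxemburg_norm(u, p, iterations=200):
    """Smallest rho with E_p[cosh(u / rho) - 1] <= 1, found by bisection.

    Assumes p is a strictly positive probability vector.
    """
    top = max(abs(ui) for ui in u)
    if top == 0.0:
        return 0.0
    hi = top  # feasible upper bound, since cosh(1) - 1 < 1
    lo = hi
    while mean_phi(u, p, lo) <= 1.0:  # shrink until the constraint fails
        lo /= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mean_phi(u, p, mid) > 1.0:
            lo = mid
        else:
            hi = mid
    return hi

p = [0.2, 0.3, 0.5]
u = [1.0, -2.0, 0.5]
norm_u = luxemburg_norm(u, p)
print(norm_u, mean_phi(u, p, norm_u))
```

In this finite setting the expectation at the norm equals 1, so the strict inequality $E_p[\cosh u - 1] < 1$ on the unit sphere discussed above can only occur on infinite sample spaces.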
If the functions $\Phi$ and $\Phi_*$ form a Young pair, then for each $u \in L^\Phi(p)$ and $v \in L^{\Phi_*}(p)$ the mapping $(u, v) \mapsto \langle u, v\rangle_p = E_p[uv]$ is a duality pairing, with $|\langle u, v\rangle_p| \le 2\|u\|_{\Phi,p}\|v\|_{\Phi_*,p}$. It is a classical result that in our case (2) the space $L^{\Phi_*}(p)$ is separable and its dual space is $L^\Phi(p)$, the duality pairing being $(u, v) \mapsto \langle u, v\rangle_p$. This duality extends to a continuous chain of spaces: where → denotes continuous injection.
Version May 11, 2014 submitted to Entropy

Cumulant generating functional
Let p ∈ P > be given. The following theorem has been proved in [8,Ch 2], see also [9].
is a continuous, symmetric, n-multi-linear map from L Φ (p) to L a (p).
v a n is a power series from L Φ (p) to L a (p) with radius of convergence ≥ 1.
3. The superposition mapping v → e v/a is an analytic function from the open unit ball of L Φ (p) to L a (p).
The previous theorem provides an improvement upon the original construction of [4].
The moment generating functional is the partition functional (normalizing factor) of the Gibbs model,   Other properties of the functional K p are described below as they relate directly to the exponential manifold.
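On a finite sample space the cumulant functional and the chart can be written out explicitly. The sketch below assumes $K_p(u) = \log E_p[e^u]$ and the chart $s_p(q) = \log(q/p) - E_p[\log(q/p)]$, both of which follow from the Gibbs form $q = e^{u - K_p(u)} \cdot p$ with $E_p[u] = 0$; the function names are illustrative.

```python
import math

def cumulant(u, p):
    """K_p(u) = log E_p[exp(u)]."""
    return math.log(sum(pi * math.exp(ui) for ui, pi in zip(u, p)))

def gibbs(u, p):
    """Density q = exp(u - K_p(u)) * p."""
    k = cumulant(u, p)
    return [pi * math.exp(ui - k) for ui, pi in zip(u, p)]

def chart(q, p):
    """s_p(q): the p-centered version of log(q / p)."""
    logs = [math.log(qi / pi) for qi, pi in zip(q, p)]
    mean = sum(pi * li for pi, li in zip(p, logs))
    return [li - mean for li in logs]

p = [0.2, 0.3, 0.5]
raw = [1.0, -1.0, 0.5]
center = sum(pi * ri for pi, ri in zip(p, raw))
u = [ri - center for ri in raw]  # E_p[u] = 0
q = gibbs(u, p)
print(sum(q), chart(q, p))
```

The round trip `chart(gibbs(u, p), p)` returns `u`, and `cumulant(u, p)` is exactly the normalizing constant that makes `q` sum to one.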

Exponential manifold
The set of positive densities $\mathcal P_>$ around a given $p \in \mathcal P_>$ is modelled on the subspace of centered random variables in the Orlicz space $L^\Phi(p)$; hence it is crucial to discuss the isomorphism of the model spaces for different $p$'s.
The following example is of interest for the applications in Sec. 4. Let $f_0$ be the standard normal density on $\mathbb R^N$ and $f$ a density,
Proposition 6 (Characterization of a maximal exponential model: [9,Th 19 and 21]). The following statements are equivalent:
1. $p, q \in \mathcal P_>$ are connected by an open exponential arc, $p \smile q$;
2. $q \in \mathcal E(p)$;
3. $\mathcal E(p) = \mathcal E(q)$;
4. $\log\frac{q}{p}$ belongs to both $L^\Phi(p)$ and $L^\Phi(q)$;
5. $L^\Phi(p)$ and $L^\Phi(q)$ are equal as vector spaces and their norms are equivalent.
The affine manifold we have defined has a simple and natural structure because of Prop. 6. The domains E (p), E (q) of the charts s p , s q are either disjoint or equal when p q: For ease of reference, various results from [4,5,8,9] are collected in the following proposition. We Proposition 8.
1. The first three derivatives of $K_p$ on $S_p$ are
2. The random variable $\frac{q}{p} - 1$ belongs to ${}^*B_p$ and In other words the gradient of $K_p$ at $u$ is identified with an element of ${}^*B_p$, denoted by and it is one-to-one at each point.

The mapping m
8. B q is defined by an orthogonality property:

Tangent bundle
Our discussion of the tangent bundle of the exponential manifold is based on the concept of velocity of a curve as in [15, §3.3] and it is mainly intended to underline its statistical interpretation, which is obtained by identifying curves with one-parameter statistical models. For a statistical model $p(t)$, $t \in I$, the random variable $\dot p(t)/p(t)$, the Fisher score, has zero expectation with respect to $p(t)$, and its meaning in the exponential manifold is velocity. If $p(t) = e^{tv - \psi(t)} \cdot p$, $v \in L^\Phi(p)$, is an exponential family, then $\dot p(t)/p(t) = v - \dot\psi(t) = v - E_{p(t)}[v]$, see [21] on exponential families.
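The identification of the velocity with the Fisher score can be illustrated on a finite sample space. The sketch below compares a central-difference approximation of $\frac{d}{dt}\log p(t)$ with the centered variable $v - E_{p(t)}[v]$ for a one-parameter exponential family; the names and the step size `h` are illustrative assumptions.

```python
import math

def exponential_family(t, v, p):
    """p(t) proportional to exp(t * v) * p, normalized to sum to one."""
    weights = [pi * math.exp(t * vi) for vi, pi in zip(v, p)]
    total = sum(weights)
    return [w / total for w in weights]

def score_exact(t, v, p):
    """Fisher score v - E_{p(t)}[v]."""
    pt = exponential_family(t, v, p)
    mean_v = sum(qi * vi for qi, vi in zip(pt, v))
    return [vi - mean_v for vi in v]

def score_numeric(t, v, p, h=1e-5):
    """Central difference of t -> log p(t), pointwise on the sample space."""
    ahead = exponential_family(t + h, v, p)
    behind = exponential_family(t - h, v, p)
    return [(math.log(a) - math.log(b)) / (2.0 * h) for a, b in zip(ahead, behind)]

p = [0.2, 0.3, 0.5]
v = [1.0, -0.5, 2.0]
t = 0.7
exact = score_exact(t, v, p)
numeric = score_numeric(t, v, p)
pt = exponential_family(t, v, p)
print(exact, numeric, sum(qi * si for qi, si in zip(pt, exact)))
```

The last printed value is the expectation of the score under $p(t)$, which vanishes up to rounding.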
Let $p(\cdot) \colon I \to \mathcal E(p)$, with $I$ an open real interval containing 0. In the chart centered at $p$ the curve is $u(\cdot) \colon$ The transition maps of the exponential manifold are
Definition 9 (Velocity field of a curve). Define Note that $\delta p$ does not depend on the chart $s_p$ and that the derivative of $t \mapsto p(t)$ in the last term of the equation is computed in $L^{\Phi_*}(p)$. The curve $t \mapsto (p(t), \delta p(t))$ is the velocity field of the curve.

On the set
Proposition 10 (Covariant derivative of a real function).

As
3. Assume moreover that dE p (u) ∈ B p * can be identified with an element ∇E p (u) ∈ * B p by Then for u = e p (q) Proof.

It is a computation based on
, and this definition does not depend on p.

Pretangent bundle
Because of the lack of reflexivity of the exponential Orlicz space, we are forced to distinguish between the dual tangent bundle (T P > ) * = {(p, v) : p ∈ P > , v ∈ (B p ) * } and a pretangent bundle.
The pretangent bundle is actually the tangent bundle of the mixture manifold on $\mathcal P_1 = \{f \in L^1(\mu) : \int f\,d\mu = 1\}$, whose charts are of the form $\eta_p(q) = q/p - 1 \in {}^*B_p$. For each $p \in \mathcal P_>$ consider the set ${}^*U_p = \{q \in \mathcal P_1 : q/p \in L^{\Phi_*}(p)\}$ and the mapping Let us characterize ${}^*U_p$ as the set of $q$'s of finite Kullback-Leibler divergence from $p$.
In conclusion, we have the inequalities If $D(q \| p) < +\infty$, then which in turn implies that The covariant gradient defined in Prop. 3(3) is a vector field of the pretangent bundle. Note that the injection $\mathcal P_> \to \mathcal P_1$ is represented in the charts centered at $p$ by $u \mapsto e^{u - K_p(u)} \cdot p - 1$. We do not further discuss here the mixture manifold and refer to [9,Sec. 5] for further information on this topic.
Let F be a vector field of the pretangent bundle * T P > . In the chart centered at p E (p) q → F (q) is represented by For each q = e p (u) ∈ E (p), w ∈ * B q , m U q p dF p (u) e U p q w ∈ * B q does not depend on p. Definition 14 (Covariant derivative in * T P > .). Let F be a vector field of the pretangent bundle * T P > and G a vector field in the tangent bundle P > , both of class C 1 on E (p). The covariant derivative is Tangent and pretangent bundle can be coupled to produce the new frame bundle with the duality coupling Proposition 15 (Covariant derivative of the duality coupling). Let F be a vector field of * T P > , G, H vector fields of T P > , all of class C 1 on a maximal exponential model E. Then Proof. Consider the real function E q → F, G (q) = E q [F (q)G(q)] in the chart centered at any p ∈ E, and compute its derivative.

The Hilbert bundle
The duality on $({}^*T \times T)\mathcal P_>$ is reminiscent of a Riemannian metric, but it is not one, because we do not have a Riemannian manifold unless the state space is finite. However, we can push the analogy further by constructing a Hilbert bundle. As $L^\Phi(p) \subset L^2(p) \subset L^{\Phi_*}(p)$, $p \in \mathcal P_>$, we have $B_p \subset H_p \subset {}^*B_p$, $L^2_0(p) = H_p$ being the fiber at $p$. The Hilbert bundle is provided with an atlas of charts by using the isometries $U^q_p \colon H_p \to H_q$ which result from the pull-back of the metric connection on the sphere $S_\mu = \{f \in L^2(\mu) : \int f^2\,d\mu = 1\}$, see [6,7,22] and [13,Sec. 4]. 1. For all $p, q \in \mathcal P_>$, the mapping Note that $U^r_q U^q_p = U^r_p$.

Definition 17 (Hilbert bundle). The charts
Let $t \mapsto p(t)$ be a $C^1$ curve in $\mathcal E(p)$, $p = p(0)$, $u(t) = s_p(p(t))$, and $F \colon \mathcal E(p)$ a $C^1$ vector field in $H\mathcal P_>$. In the chart centered at $p$ we have $F_p(u(t)) = U^p_{p(t)} (F \circ e_p)(u(t))$. A computation shows that which could be used as a nonparametric definition of the metric connection, see [22], [13,Sec. 4.4].

The second tangent bundle
We briefly discuss here the second order structure, i.e. the tangent bundle of the tangent bundle $T\mathcal P_>$. Let $F \colon I \ni t \mapsto (p(t), V(t))$ be a $C^1$ curve in the tangent bundle $T\mathcal P_>$. In the chart centered at $p$ we have It follows in particular that $E_{p(t)}[\dot V(t)] = -E_{p(t)}[V(t)\,\delta p(t)]$ and e U p(t) . Note that the left-hand side is not a transport but an extension of the transport, precisely the projection $\Pi_{p(t)} \colon L^\Phi(p) \to B_{p(t)}$. It follows from $\dot F_p(t) = (\dot u(t), \dot V_p(t))$ that the velocity vector is where we have denoted by $I(p(t)) = E_{p(t)}\left[\left(\frac{d}{dt}\ln p(t)\right)^2\right]$ the Fisher information. In this case we can write $\delta(p, \delta p)(t) = (\delta p(t), \dot{(\delta p)}(t) + I(p(t)))$.

Applications.
In this section we consider a typical set of examples where the nonparametric framework is applicable.

Expected value.
Let $f \in L^\Phi(p)$, $f_0 = f - E_p[f] \in B_p$, and consider the relaxed mapping The information geometric study of the relaxed mapping can be based on the notion of natural gradient, defined in a seminal paper by Amari [23] and currently used for optimization, see e.g. [24][25][26][27][28][29][30]. The covariant derivative of a real function is the nonparametric counterpart of Amari's natural gradient. From the properties of $K_p$ in Eq.s (8) and (9) of Prop. 8 we obtain the representation of the function in (13) in the chart centered at $p$, The covariant derivative at $(q, w) \in T B_q$ is computed from Def. 10(2) as Note that the gradient is never zero unless $f$ is constant and that the covariant derivative is zero for each vector field $G$ which is uncorrelated with $f$.
Consider the gradient vector field $F(q) = f - E_q[f] \in {}^*T\mathcal P_>$. The gradient flow is whose unique solution is the exponential family $p(t) \propto e^{tf} \cdot p(0)$. In fact, the gradient is actually the e-transport of $f_0$, $F(p) = {}^eU^{e_p(u)}_p f_0$, and the exponential family is the exponential curve of the e-transport.
Let us discuss the differentiability of the gradient. In the chart centered at $p$ the gradient is represented as Let us first compute the differential of $u \mapsto \langle F_p(u), w\rangle_p$, $w \in B_p$, in the direction $v \in B_p$, i.e. the weak differential: where we have used Prop. 8. At $u = 0$ The product $f_0 G(p)$ belongs to ${}^*B_p$. In fact, If $D_G \nabla E$ exists in ${}^*\mathcal P_>$ as a Fréchet derivative, then The differentiability in Orlicz spaces of superposition operators is discussed in detail in [31].
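The statement that the gradient flow of the expected value is solved by the exponential family can be checked numerically on a finite sample space. The sketch below assumes the flow reads $\dot p(t)/p(t) = f - E_{p(t)}[f]$, integrates it with an explicit Euler scheme, and compares the result with $p(t) \propto e^{tf} \cdot p(0)$; the scheme and all names are illustrative choices, not the method of the references.

```python
import math

def expectation(f, p):
    """E_p[f] on a finite sample space."""
    return sum(fi * pi for fi, pi in zip(f, p))

def gradient_flow(f, p0, t_end, steps):
    """Explicit Euler integration of dp/dt = p * (f - E_p[f])."""
    p = list(p0)
    dt = t_end / steps
    for _ in range(steps):
        m = expectation(f, p)
        p = [pi + dt * pi * (fi - m) for fi, pi in zip(f, p)]
    return p

def exponential_family(f, p0, t):
    """Closed-form candidate solution, proportional to exp(t * f) * p0."""
    weights = [pi * math.exp(t * fi) for fi, pi in zip(f, p0)]
    total = sum(weights)
    return [w / total for w in weights]

f = [1.0, 0.0, -1.0]
p0 = [0.2, 0.3, 0.5]
flowed = gradient_flow(f, p0, t_end=1.0, steps=2000)
closed = exponential_family(f, p0, 1.0)
print(flowed, closed)
```

The Euler step preserves the total mass, because the increment $p\,(f - E_p[f])$ sums to zero, so the discrepancy with the closed form is due to time discretization only.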

Kullback-Leibler divergence.
If $\mathcal E$ is a maximal exponential model, the mapping is represented in the charts centered at $p$ by hence, from Prop. 3(4) it is $C^\infty$ jointly in both variables, and moreover analytic. This regularity result is to be compared with what is available when the restriction $q_1 \smile q_2$ is removed, i.e. the semicontinuity [32].
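For two densities in the same exponential model, $q_i = e^{u_i - K_p(u_i)} \cdot p$, a direct computation from the Gibbs form gives $D(q_1 \| q_2) = K_p(u_2) - K_p(u_1) - E_{q_1}[u_2 - u_1]$, a Bregman divergence of the cumulant functional. The sketch below checks this identity on a finite sample space; it is an illustration under that assumption, with hypothetical function names.

```python
import math

def cumulant(u, p):
    """K_p(u) = log E_p[exp(u)]."""
    return math.log(sum(pi * math.exp(ui) for ui, pi in zip(u, p)))

def gibbs(u, p):
    """Density q = exp(u - K_p(u)) * p."""
    k = cumulant(u, p)
    return [pi * math.exp(ui - k) for ui, pi in zip(u, p)]

def kl_divergence(q1, q2):
    """D(q1 || q2) = E_{q1}[log(q1 / q2)]."""
    return sum(a * math.log(a / b) for a, b in zip(q1, q2))

def bregman_form(u1, u2, p):
    """K_p(u2) - K_p(u1) - E_{q1}[u2 - u1], with q1 = gibbs(u1, p)."""
    q1 = gibbs(u1, p)
    slope = sum(qi * (b - a) for qi, a, b in zip(q1, u1, u2))
    return cumulant(u2, p) - cumulant(u1, p) - slope

p = [0.2, 0.3, 0.5]
u1 = [0.5, -0.2, 0.1]
u2 = [-1.0, 0.4, 0.3]
direct = kl_divergence(gibbs(u1, p), gibbs(u2, p))
in_chart = bregman_form(u1, u2, p)
print(direct, in_chart)
```

The identity does not require $u_1, u_2$ to be centered, since adding a constant to $u$ shifts $K_p(u)$ by the same constant.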
The (partial) derivative of $u_2 \mapsto E_p(u_1, u_2)$ in the direction $v_2 \in B_p$ is and the covariant derivative of the partial functional $q \mapsto D(q_1 \| q)$ is The second mixed derivative of $E_p$ is Equivalently, we consider the mapping $q_1 \mapsto D_{2,w} D(q_1 \| q)$, in the chart

Boltzmann entropy.
While our discussion of the Kullback-Leibler divergence in the previous Sec. 4.2 does not require any special assumption other than the restriction of its domain to a maximal exponential model, in the present discussion of the Boltzmann entropy a further restriction is required. If $p, q$ belong to the same maximal exponential model, $p \smile q$, then from $q = e^{u - K_p(u)} \cdot p$ with $u \in B_p$ we obtain $\ln q - \ln p \in L^\Phi(p)$, so that $\ln q \in L^\Phi(p)$ if, and only if, $\ln p \in L^\Phi(p)$.
We study the Boltzmann entropy $E(q) = E_q[\ln(q)]$ on a maximal exponential model $\mathcal E$ such that for at least one, and hence for all, $p \in \mathcal E$ it holds $\ln(p) \in L^\Phi(p)$, i.e. $\int (p^{1+\alpha} + p^{1-\alpha})\,d\mu < +\infty$ for some $\alpha > 0$. This is for example the case when the reference measure is finite and $p$ is constant. Another notable example is the Gaussian case, i.e. the sample space is $\mathbb R^n$ endowed with the Lebesgue measure and $p(x) \propto \exp\left(-\frac12 |x|^2\right)$. In fact $\int \cosh(\alpha|x|^2) \exp\left(-\frac12 |x|^2\right) dx < +\infty$ for $0 < \alpha < 1/2$.
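The Gaussian integrability condition can be made concrete in dimension $n = 1$, where the standard chi-square moment generating function gives $E[\cosh(\alpha X^2)] = \frac12\left[(1 - 2\alpha)^{-1/2} + (1 + 2\alpha)^{-1/2}\right]$ for $X \sim N(0, 1)$ and $0 \le \alpha < 1/2$. The sketch below compares a trapezoidal quadrature with this closed form; the restriction to $n = 1$, the normalization by $\sqrt{2\pi}$, and the quadrature parameters are assumptions made for illustration.

```python
import math

def integrand(x, alpha):
    """cosh(alpha x^2) exp(-x^2 / 2) / sqrt(2 pi), written without overflow."""
    grow = math.exp((alpha - 0.5) * x * x)
    decay = math.exp(-(alpha + 0.5) * x * x)
    return 0.5 * (grow + decay) / math.sqrt(2.0 * math.pi)

def gaussian_cosh_moment(alpha, half_width=60.0, step=0.01):
    """Trapezoidal approximation of the integral over [-half_width, half_width]."""
    n = int(round(2.0 * half_width / step))
    total = 0.5 * (integrand(-half_width, alpha) + integrand(half_width, alpha))
    for k in range(1, n):
        total += integrand(-half_width + k * step, alpha)
    return total * step

def closed_form(alpha):
    """Value for 0 <= alpha < 1/2; it diverges as alpha approaches 1/2."""
    return 0.5 * ((1.0 - 2.0 * alpha) ** -0.5 + (1.0 + 2.0 * alpha) ** -0.5)

for alpha in (0.1, 0.25, 0.4):
    print(alpha, gaussian_cosh_moment(alpha), closed_form(alpha))
```

The closed form blows up as $\alpha \to 1/2$, which matches the range $0 < \alpha < 1/2$ stated above.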
Under our assumption, the Boltzmann entropy is a smooth function. As the representation in the chart centered at $p$ is hence it is a $C^\infty$ real function. The derivative in the direction $v$ equals The value of the covariant derivative $D_G E$ at $q$ and $G(q) = w$ is The gradient $\nabla E(q) \in (B_q)^*$, $D_G E(q) = \langle \nabla E(q), G(q)\rangle_q$, is identified with a random variable in $B_q \subset {}^*B_q$, and is a vector field in the tangent bundle $T\mathcal E$, hence a vector field in the Hilbert bundle $H\mathcal E$ and in the pretangent bundle ${}^*T\mathcal E$.
In the pretangent bundle the action of the dual exponential transport $({}^eU^p_q)^*$ is identified with ${}^mU^p_q$. It follows that the representation of the gradient in the chart centered at $p$ is Let us assume $u \mapsto F_p(u)$ is (strongly) differentiable and let us compute the derivative by the product rule. As $u \mapsto F_p(u)$ can be seen locally as the product of an analytic mapping $u \mapsto e^{u - K_p(u)}$ with values in $L^a(p)$, $a > 1$, because of Prop. 1, while the second factor is an analytic function with values in $L^\Phi(p) \subset \cap_{a>1} L^a(p)$, we can compute its differential in the direction $v \in B_p$ as the product of two functions in the Fréchet space $\cap_{a>1} L^a(p)$ as in particular, for $u = 0$, The covariant derivative of the gradient $\nabla E$ of the Boltzmann entropy in the pretangent bundle ${}^*T\mathcal E$ is The existence of the covariant derivative implies $\ln(p)\, G(p) \in L^{\Phi_*}(p)$, $p \in \mathcal E$. We do not discuss here the existence problem.
The computation of the covariant derivative of the same gradient in the tangent bundle $T\mathcal E$ would be but we cannot suggest any use of this computation.

Boltzmann equation
The use of Orlicz spaces as a setting for the Boltzmann equation has been recently discussed in [33], while the use of exponential manifolds has been suggested in [13,Example 11]. Here we further work out this framework for the space-homogeneous Boltzmann operator with angular collision kernel $B(z, x) = |x' z|$, see the presentation in [19]. In order to avoid a clash with the notations used in other parts of this paper, we use $v$ and $w$ to denote velocities in $\mathbb R^3$ in place of the more common couple $v$ and $v_*$, and the velocities after collision are denoted by $v_x$ and $w_x$ instead of $v'$, $v'_*$, $x \in S^2$ being a unit vector.
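Let $v, w \in \mathbb R^3$ be the velocities of two particles, and $\bar v, \bar w$ be the velocities after an elastic collision, i.e.
$$v + w = \bar v + \bar w, \qquad |v|^2 + |w|^2 = |\bar v|^2 + |\bar w|^2. \quad (14)$$
Using (14) we derive from the development of $|v + w|^2 = |\bar v + \bar w|^2$ that $v \cdot w = \bar v \cdot \bar w$. The four vectors $v, w, \bar v, \bar w$ all lie on a circle with center $z = (v + w)/2 = (\bar v + \bar w)/2$. In fact, the four vectors and $z$ lie on the same plane because $v - z = -(w - z)$, $\bar v - z = -(\bar w - z)$, and moreover $|v - z|^2 = |\bar v - z|^2$. As $v, w, \bar v, \bar w$ form a rectangle, we can denote by $x$ the common unit vector of the parallel sides $\bar w - w$ and $v - \bar v$ and write $\bar w - w = v - \bar v$ as the orthogonal projection of $v - w$ on $x$. Given the unit vector $x \in S^2$, the collision transformation $(v_x, w_x) = A_x (v, w)$ is
$$v_x = v - xx'(v - w), \qquad w_x = w + xx'(v - w), \qquad A_x = \begin{pmatrix} I - xx' & xx' \\ xx' & I - xx' \end{pmatrix}, \quad (15)$$
where $'$ denotes the transposed vector.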
Given any $x \in S^2$ we have $A_x = A_{-x}$. If $v, w, v_x, w_x$ are as in (15), then the elastic collision invariants of (14) hold, $v + w = v_x + w_x$, $|v|^2 + |w|^2 = |v_x|^2 + |w_x|^2$. The components in the direction $x$ are exchanged, $xx'v_x = xx'w$ and $xx'w_x = xx'v$, while the orthogonal components are conserved.
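The properties of the collision transformation listed above can be verified numerically. The sketch below assumes the map has the form $v_x = v - xx'(v - w)$, $w_x = w + xx'(v - w)$, which is the form compatible with the exchange of the components along $x$; the function names are illustrative.

```python
import math

def dot(a, b):
    """Euclidean inner product of two vectors given as lists."""
    return sum(ai * bi for ai, bi in zip(a, b))

def collide(v, w, x):
    """Assumed collision map: exchange the components of v and w along the unit vector x."""
    c = dot(x, [vi - wi for vi, wi in zip(v, w)])  # x'(v - w)
    v_x = [vi - c * xi for vi, xi in zip(v, x)]
    w_x = [wi + c * xi for wi, xi in zip(w, x)]
    return v_x, w_x

v = [1.0, 2.0, 3.0]
w = [-1.0, 0.5, 2.0]
x = [1.0 / math.sqrt(3.0)] * 3  # a unit vector on the sphere

v_x, w_x = collide(v, w, x)
momentum_gap = max(abs(a + b - c - d) for a, b, c, d in zip(v, w, v_x, w_x))
energy_gap = abs(dot(v, v) + dot(w, w) - dot(v_x, v_x) - dot(w_x, w_x))
v_back, w_back = collide(v_x, w_x, x)  # applying the map twice gives the identity
print(momentum_gap, energy_gap, v_back, w_back)
```

Applying `collide` twice returns the original pair, which is the involution property behind $|\det A_x| = 1$, and replacing `x` by its opposite gives the same output, in agreement with $A_x = A_{-x}$.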
Let σ be the uniform probability on $S^2$. For each positive function $g \colon \mathbb R^3 \times \mathbb R^3$ the integral and all other terms depend on the collision invariants, in particular $|v - w|^2 = 2(|v|^2 + |w|^2) - |v + w|^2$.
On the sample space $(\mathbb R^3, dv)$ let $f_0$ be the standard normal density, viz. the Maxwell distribution of velocities. As $A_x A_x = I_6$, the identity matrix on $\mathbb R^6$, in particular $|\det A_x| = 1$, we have We can give the previous remarks a more probabilistic form as follows.
Proposition 18. Let $f_0$ be the density of the standard normal $N(0_3, I_3)$.
and there exists a neighborhood $I$ of $[0, 1]$ where the one dimensional exponential family < +∞ for $t \in I$ it is enough to consider the convex cases $t < 0$ and $t > 1$. We have and in the convex cases The last equation is Bayes' formula for conditional expectation.
Definition 19. For each element of the maximal exponential model containing f 0 , f ∈ E (f 0 ), the Boltzmann operator is In our definition we have restricted the domain of the Boltzmann operator to a maximal exponential model containing the standard normal density in order to fit into our framework and be able to prove the smoothness of the operator. The maximal exponential model E (f 0 ) contains all normal densities f ∼ N(µ, Σ). It has other peculiar properties.
As f ∈ E (f 0 ), f = e u−K 0 (u) · f 0 , u belongs to the interior of the proper domain of K 0 , u ∈ S f 0 ⊂ B f 0 . It follows from Prop. 6 that we have the equality and isomorphism of the Banach spaces L Φ (f ) and [1,2]. In fact, is finite for all α if a ∈ [0, 2[ and for α < 1/2 if a = 2. In particular, it follows that V 1 (v) = |v| has finite moments with respect to f , |v| n f (v) dv < +∞, n = 1, 2, . . . .
Other properties are obtained in the proof of the following proposition. Proof. Let us consider first the second part of the Boltzmann operator Note that from inequality (5) which is finite as $|v - w|^2 \le 2(|v|^2 + |w|^2)$.
We consider now the first part of the Boltzmann operator We want to prove that where $L(a) = a \vee a^2$. It follows where both terms are finite. Finally, the integral of the Boltzmann operator is zero: The smoothness of the Boltzmann field could be studied by carefully analyzing the structure of the operator as superposition of

Marginalization.
The single operations of the chain are discussed in [5]. We do not do this analysis here, and conclude the section by rephrasing in our language the Maxwell form of the Boltzmann operator.

Conclusion and Discussion
We have shown that a careful consideration of the relevant functional analysis allows one to discuss some basic features of statistical models of interest in Statistical Physics in the framework of the nonparametric Information Geometry based on Orlicz spaces. In particular, we have defined the exponential statistical manifold and its vector bundles, namely the tangent bundle, the pretangent bundle, and the Hilbert bundle. Partial results are obtained on connections, a topic considered by many authors to be the very core of statistical manifold theory.
For example, the Boltzmann equation takes the form of an evolution equation for the Boltzmann field In the case of the pretangent bundle, we have been able to show that it is actually the tangent bundle of an extension of the exponential manifold, the mixture manifold, ${}^*T\mathcal P_> \to T\mathcal P_1$. The construction of a manifold whose tangent space would be the Hilbert bundle has been the object of much research. In some sense the answer is known because of the embedding $p \mapsto \sqrt p$ that maps positive densities $\mathcal P_>$ into the unit sphere $S_\mu$, but a proper definition of the charts is difficult in this setting. It has been suggested to use functions called deformed exponentials to mimic the theory of exponential families, see the monograph [34], and also [11], [13,Sec. 5]. An example of deformed exponential is which is a special case of the class introduced in [35,36]. The function $\exp_d$ maps $\mathbb R$ onto $\mathbb R_>$, is increasing, convex, and $\Phi_d(u) = \frac12(\exp_d(u) + \exp_d(-u)) - 1 = \frac12 u^2$.
The Young conjugate is $\Phi_{d,*} = \Phi_d$ and the Orlicz space is $L^{\Phi_d}(p) = L^2(p)$. A nonparametric exponential family around the positive density $p$ was defined by [37] to be If we assume $E_{\tilde p}[u] = 0$, where $\tilde p$ is a suitable density associated to $p$, then An account of this research in progress will be published elsewhere. We conclude by mentioning the different nonparametric approach of [38].