Global Geometry of Bayesian Statistics

In the previous work of the author, a non-trivial symmetry of the relative entropy in the information geometry of normal distributions was discovered. The same symmetry also appears in the symplectic/contact geometry of Hilbert modular cusps. Further, it was observed that a contact Hamiltonian flow presents a certain Bayesian inference on normal distributions. In this paper, we describe Bayesian statistics and information geometry in the language of current geometry in order to share our interest in statistics with general geometers and topologists. Then, we foliate the space of multivariate normal distributions by symplectic leaves so as to generalize the above result of the author. This foliation arises from the Cholesky decomposition of the covariance matrices.


Introduction
Suppose that a smooth manifold U is embedded in the space of positive probability densities defined on a fixed domain. Then, the relative entropy defines a separating premetric D : U × U → R_{≥0} on U. Here a premetric on U is a non-negative function on U × U vanishing along the diagonal set ∆ ⊂ U × U, and it is separating if it vanishes only on ∆. Its jet of order 3 at ∆ induces a family of differential geometric structures on U, which is the main subject of information geometry. There is a large body of literature on information geometry (see [1,2] and references therein). It is worth noting that another "canonical" choice of premetric other than the above D is discussed in [3].
In the case where U is the space of univariate normal distributions, the half plane H = R × R_{>0} ∋ (m, s) presents U, where m denotes the mean and s the standard deviation. Since the convolution of two normal densities is a normal density, it induces a product * on H called the convolution product. On the other hand, since the pointwise product of two normal densities is proportional to a normal density, it induces another product · on H called the Bayesian product. Their expressions are

(m, s) * (M, S) = (m + M, √(s² + S²)),
(m, s) · (M, S) = ((mS² + Ms²)/(s² + S²), sS/√(s² + S²)).

The flow (m, s, M, S) ↦ (e^t m, e^t s, e^{−t} M, e^{−t} S) (t ∈ R) preserves f as well as the graph F ⊂ H × H of the Fourier-like transformation.
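As a concrete check, the following Python sketch implements the two products on H and verifies the Bayesian product against the renormalized pointwise product of densities on a grid (a minimal sketch; the function names are ours, not the paper's).

```python
import math

def npdf(x, m, s):
    """Normal density with mean m and standard deviation s."""
    return math.exp(-((x - m) ** 2) / (2 * s * s)) / (s * math.sqrt(2 * math.pi))

def convolve(p, q):
    # Convolution product on H: means add, variances add.
    (m, s), (M, S) = p, q
    return (m + M, math.sqrt(s * s + S * S))

def bayes(p, q):
    # Bayesian (pointwise) product on H: precisions 1/s^2 add,
    # and the means combine with precision weights.
    (m, s), (M, S) = p, q
    v = (s * s * S * S) / (s * s + S * S)
    return ((m * S * S + M * s * s) / (s * s + S * S), math.sqrt(v))

# Check the Bayesian product against the renormalized pointwise product
# of the two densities, computed numerically on a fine grid.
p, q = (1.0, 2.0), (-0.5, 0.5)
xs = [-20 + 0.001 * k for k in range(40000)]
prod = [npdf(x, *p) * npdf(x, *q) for x in xs]
z = sum(prod) * 0.001                      # normalizing constant
mean = sum(x * w for x, w in zip(xs, prod)) * 0.001 / z
m_b, s_b = bayes(p, q)
assert abs(mean - m_b) < 1e-6
```

The same grid computation with the sum of two independent samples would recover `convolve`; the closed forms above are the classical facts that means and variances add under convolution, while precisions add under the pointwise product.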
The same symmetry appears in the contact/symplectic geometry related to the algebraic geometry of Hilbert modular cusps. Moreover, there exists a contact Hamiltonian flow whose restriction to the graph F presents a certain Bayesian inference. Its application appears in [5].
In this paper, we describe Bayesian statistics in the language of current geometry in order to share the problems with general geometers and topologists. Then, generalizing the above result of the author, we foliate the space of multivariate normal distributions by using the Cholesky decomposition of the covariance matrices, and define on each leaf the Fourier-like transformation, the stereograph of the relative entropy, and the contact Hamiltonian flow presenting a Bayesian inference. The ultimate aim of this research is to construct a Bayesian statistical model of space-time on which everything learns by changing its inner distribution along the leaf.

Symplectic/Contact Geometry
Current geometry does not rely heavily on tensor calculus. Instead, it uses (exterior) differential forms, which can be integrated along cycles, pulled back under smooth maps, and differentiated without affine connections. In symplectic/contact geometry, readers must be familiar with differential forms. This subsection is thus a minimal summary of the definitions. For the details, refer to [6].
A (positive) symplectic form on an oriented 2n-manifold is a closed 2-form ω satisfying ω^n > 0, where ω^n = ω ∧ · · · ∧ ω. If the orientation is reversed, the 2-form ω becomes a negative symplectic form. In either case, a symplectic form ω identifies a vector field X with an exact 1-form dH through the one-to-one correspondence defined by Hamilton's equation ι_X ω = −dH. Here ι denotes the interior product. Then, X is called a Hamiltonian vector field of the primitive function H (+ constant). The flow generated by X preserves the symplectic form ω; namely, the Lie derivative L_X ω (= ι_X dω + dι_X ω) vanishes. A Lagrangian submanifold is an n-manifold which is immersed in a symplectic 2n-manifold so that the pull-back of the symplectic form vanishes. The word "symplectic" is a calque of "complex". Indeed, there exists an almost complex structure J which is compatible with a given symplectic structure, i.e., for which ω(·, J·) is a Riemannian metric. In the case where J is integrable, ω is called a Kähler form of the complex manifold.
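As a minimal numerical illustration of Hamilton's equation (with our own choice of H, unrelated to the statistical setting), take ω = dx ∧ dy on R² and H = (x² + y²)/2; then ι_X ω = −dH gives X = (−y, x), whose flow is rotation, and one checks that the flow preserves both H and the area form ω:

```python
import math

# On R^2 with symplectic form w = dx ^ dy, Hamilton's equation i_X w = -dH
# gives X = (-H_y, H_x).  For H = (x^2 + y^2)/2 this is X = (-y, x).

def flow(x, y, t):
    # Exact flow of X = (-y, x): rotation by angle t.
    return (x * math.cos(t) - y * math.sin(t),
            x * math.sin(t) + y * math.cos(t))

x0, y0 = 1.2, -0.7
x1, y1 = flow(x0, y0, 0.9)

# The primitive function H is conserved along the flow ...
assert abs((x1**2 + y1**2) - (x0**2 + y0**2)) < 1e-12

# ... and the flow preserves w: in dimension 2 this means the time-t map
# preserves oriented area, i.e., its Jacobian determinant is 1.
h = 1e-6
ax, ay = flow(x0 + h, y0, 0.9)
bx, by = flow(x0, y0 + h, 0.9)
det = ((ax - x1) / h) * ((by - y1) / h) - ((bx - x1) / h) * ((ay - y1) / h)
assert abs(det - 1.0) < 1e-6
```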
On the other hand, a (positive) contact form on an oriented (2n − 1)-manifold N is a 1-form η satisfying η ∧ (dη) n−1 > 0. A (co-oriented) contact structure on N is the conformal class of a contact form. It can be presented as the oriented hyperplane field ker η. The product manifold R( t) × N carries the exact symplectic form d(e t η). Take a function h on N. Let X be the Hamiltonian vector field of the function e t h defined on the product manifold R × N. Then, the push-forward Y of X under the projection of R × N to the second factor is well-defined. The vector field Y is called the contact Hamiltonian vector field of the function h on N. The pair of the equations η(Y) = h and η ∧ L Y η = 0 uniquely determines Y. A Legendrian submanifold is an (n − 1)-manifold which is immersed in a contact (2n − 1)-manifold so that the pull-back of the contact form vanishes.

Bayesian Statistics
Suppose that any point y of a smooth manifold M equipped with a volume form dvol presents a positive probability density or probability ρ_y : W → R_{>0} defined on a (possibly discrete) measurable space W, where ρ_y depends smoothly on y, and ρ_y ≠ ρ_{y′} for y ≠ y′ ∈ M. Let 𝒱 be the space of positive volume forms with finite total volume on M. Take an arbitrary element V ∈ 𝒱 and consider it as the initial state of the mind M of an agent. Here W stands for (a part of) the world for the agent. Finding a datum w ∈ W in his world, the agent can take the value ρ_y(w) as a smooth positive function ρ_w : y ↦ ρ_y(w) on M, which is called the likelihood of the datum w. Then, he can multiply the volume form V by ρ_w > 0 to obtain a new element of 𝒱. This defines the updating map ϕ : W × 𝒱 → 𝒱, ϕ(w, V) = ρ_w V. The "psychological" probability density p_V on the mind M defined by p_V dvol = V/∫_M V is accordingly updated into the density p_{ϕ(w,V)} ∝ p_V ρ_w, which is called the conditional probability density given the datum w. Practically, Bayes' rule on conditional probabilities is expressed as

P(y ∈ ∆_y | w ∈ ∆_w) = (P(w ∈ ∆_w | y ∈ ∆_y) / P(w ∈ ∆_w)) P(y ∈ ∆_y).

Here P denotes the probability of an event, and ∆_y (respectively, ∆_w) a small portion of M (respectively, W). Since the state of the world does not depend on the mind of the agent, the probability P(w ∈ ∆_w) is independent of y, and therefore approximates a constant on M. On the other hand, the conditional probability P(w ∈ ∆_w | y ∈ ∆_y) of the datum w approximates a function of y which is clearly proportional to the above likelihood. This implies that the factor P(w ∈ ∆_w | y ∈ ∆_y)/P(w ∈ ∆_w) on the right-hand side of Bayes' rule is approximately proportional to the likelihood. Thus, Bayes' rule implies the updating of p_V via the map ϕ. The Bayesian product · mentioned in the introduction appears in this context. Namely, the variable of the first factor is the mean y of a normal distribution on W.
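The updating map ϕ(w, V) = ρ_w V can be sketched on a discrete mind M; the coin-bias hypotheses and the likelihood below are our own illustrative choices, not from the paper:

```python
# A minimal sketch of the updating map phi(w, V) = rho_w * V on a discrete
# mind M = {y1, ..., yk}, where a volume form is a dict of positive weights.

def update(V, likelihood, w):
    """phi(w, V): multiply the volume form (weights) by the likelihood of w."""
    return {y: v * likelihood(y, w) for y, v in V.items()}

def normalize(V):
    """The 'psychological' probability p_V = V / (total volume of V)."""
    z = sum(V.values())
    return {y: v / z for y, v in V.items()}

# Mind: three hypotheses for a coin's head-probability y (our toy example).
prior = {0.2: 1.0, 0.5: 1.0, 0.8: 1.0}
lik = lambda y, w: y if w == "H" else 1.0 - y   # likelihood of datum w

# Observing data updates V; normalizing at the end recovers Bayes' rule,
# since the updates only ever multiply by the likelihoods.
V = prior
for w in ["H", "H", "T", "H"]:
    V = update(V, lik, w)
p = normalize(V)

# Posterior odds = prior odds times the likelihood ratio (here prior odds = 1).
assert abs(p[0.8] / p[0.2] - (0.8**3 * 0.2) / (0.2**3 * 0.8)) < 1e-12
```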
The density of the normal distribution at the datum w can be considered as a function of y, which is proportional to a normal density on M. Thus, the Bayesian product of normal distributions on M presents the updating of the density of the predictive mean in the mind of the agent. For many people, the aim of Bayesian updating is practical; indeed, the aim of the above updating is the estimation of the mean. Nevertheless, it is quite natural for a geometer to multiply a volume form by a positive function once he is given them. In this regard, we can regard Bayesian updating as a dynamical system in a geometric setting. In particular, Bayesian updating within a conjugate prior is, at first, simply the iteration of the Bayesian product.

The Information Geometry
Hereafter, we identify the element p_V dvol ∈ U with the "psychological" probability density p_V. We call U a conjugate prior for the updating map ϕ if the cone Ū = {e^t V | t ∈ R, V ∈ U} satisfies ϕ(W × Ū) ⊂ Ū. (Whether or not there exists a preferred conjugate prior, and how to determine the initial state of the mind, is another interesting problem. For example, one may fix the asymptotic behavior of the state of mind according to the aim of the Bayesian inference and search for the optimal decision of the initial state. See [7] for an approach to this problem via information geometry.) Now we define the "distance" D̄ : Ū × Ū → R on Ū, which satisfies none of the axioms of distance, by

D̄(V₁, V₂) = ∫_M (ln(V₁/V₂)) V₁.

From the convexity of − ln, we see that the restriction D = D̄|_{U×U} ≥ − ln ∫_M V₂ = 0 is a separating premetric on U, which is called the Kullback–Leibler divergence in information theory. This implies that the germ of D along the diagonal set ∆ of U × U represents the zero section of the cotangent sheaf of U; that is, for any point x = (x₁, . . . , x_n) of any chart of U, the Taylor expansion of D(x + (1/2)dx, x − (1/2)dx) has no linear terms. Thus the differential dD : TU × TU → R also vanishes on the diagonal set ∆ of TU × TU. We regard the 1-form on TU represented by the germ of dD along ∆ as a quadratic tensor, and denote it by g (note that g_x : T_xU × T_xU → R is bilinear). It appears as two times the quadratic terms (1/2) ∑_{i,j} g_ij dx_i dx_j (g_ij = g_ji) in the above Taylor expansion. Of course, it also appears in the Taylor expansion of D(x − (1/2)dx, x + (1/2)dx). Thus it can also be considered as the quadratic terms of the symmetric sum D(x₁, x₂) + D(x₂, x₁). The tensor g is called the Fisher information in information theory. From the non-negativity of D, we may assume generically that g is a Riemannian metric. We would like to note that this construction of a Riemannian metric by means of the symmetric sum also works over Ū. Let ∇⁰ be the Levi-Civita connection of g.
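For univariate normal densities, D has the closed form D((m, s), (m′, s′)) = ln(s′/s) + (s² + (m − m′)²)/(2s′²) − 1/2, and the quadratic jet at the diagonal recovers the Fisher information g = diag(1/s², 2/s²) in the coordinates (m, s). The following sketch extracts g numerically as the Hessian of D in its second argument:

```python
import math

def kl(m1, s1, m2, s2):
    """D((m1, s1), (m2, s2)) for univariate normal densities (closed form)."""
    return math.log(s2 / s1) + (s1**2 + (m1 - m2) ** 2) / (2 * s2**2) - 0.5

def fisher(m, s, h=1e-4):
    """Diagonal of the Hessian of D(x, x') in x' at x' = x.

    Since D(x, x) = 0 and dD vanishes on the diagonal, the second central
    difference needs no middle term.
    """
    g_mm = (kl(m, s, m + h, s) + kl(m, s, m - h, s)) / h**2
    g_ss = (kl(m, s, m, s + h) + kl(m, s, m, s - h)) / h**2
    return g_mm, g_ss

g_mm, g_ss = fisher(0.3, 2.0)
assert abs(g_mm - 1 / 2.0**2) < 1e-4      # g_mm = 1/s^2
assert abs(g_ss - 2 / 2.0**2) < 1e-4      # g_ss = 2/s^2
```

The same numerical extraction applied to the swapped arguments D(x′, x) returns the same matrix, illustrating that the symmetric sum carries the same quadratic jet.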
We write the lowest-degree terms in the Taylor expansions of the anti-symmetric difference D(x₁, x₂) − D(x₂, x₁); they present the symmetric cubic tensor T. Together with g, it defines the family of α-connections ∇^α (α ∈ R) by g(∇^α_X Y, Z) = g(∇⁰_X Y, Z) − (α/2) T(X, Y, Z). Especially, we call ∇¹ and ∇^{−1} respectively the e-connection and the m-connection. The symmetric tensor T is sometimes called the skewness since it presents the asymmetry of D. Information geometry concerns the family of α-connections as well as the Fisher information metric on U. We usually do not extend it over Ū, for the symmetric sum of D lacks the asymmetry.

The Geometry of Normal Distributions
We consider the space U of multivariate normal distributions on M = Rⁿ. A vector µ = (µ_i)_{1≤i≤n} ∈ Rⁿ and an upper triangular matrix C = [c_ij]_{1≤i,j≤n} ∈ Mat(n, R) with positive diagonal entries parameterize U by declaring that µ presents the mean and C^T C the Cholesky decomposition of the covariance matrix. Further, we put σ = (σ_i) = (c_ii) and r_ij = c_ij/c_ii, so that C = diag(σ)[r_ij]. The matrix [r_ij] is unitriangular, i.e., a triangular matrix whose diagonal entries are all 1. Then, each point x = (µ, σ, r) ∈ U = Rⁿ × (R_{>0})ⁿ × R^{n(n−1)/2} presents the volume form of the corresponding normal distribution. Let ‖·‖² denote the sum of the squares of the entries of a matrix as well as of a vector. The relative entropy defines the premetric D(x, x′), x′ = (µ′, σ′, r′), by

D(x, x′) = (1/2)‖C(C′)⁻¹‖² + (1/2)‖(C′)^{−T}(µ′ − µ)‖² − n/2 + ∑_{i=1}^n ln(σ′_i/σ_i),

where C = C(σ, r) and C′ = C(σ′, r′). Let 1_n denote the unit matrix, ∆C the difference C(σ + ∆σ, r + ∆r) − C(σ, r), and r^{ij} the entries of the inverse matrix of [r_ij]. Expanding D at the diagonal up to second order in (∆µ, ∆σ, ∆r), we see that the Fisher information metric g has no cross terms among the three groups of coordinates: its representing matrix is the block diagonal matrix diag(g_µµ, g_σσ, g_rr,2, . . . , g_rr,n).
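The Cholesky form of the premetric can be tested against the familiar trace/determinant expression of the relative entropy, 2D = tr(Σ′⁻¹Σ) + (µ′ − µ)^T Σ′⁻¹(µ′ − µ) − n + ln(det Σ′/det Σ). The following sketch (with our own small-matrix helpers, for n = 2) checks the identity numerically:

```python
import math

def inv_upper2(C):
    # Inverse of a 2x2 upper triangular matrix [[a, b], [0, d]].
    a, b, d = C[0][0], C[0][1], C[1][1]
    return [[1 / a, -b / (a * d)], [0.0, 1 / d]]

def matmul2(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def transpose2(A):
    return [[A[j][i] for j in range(2)] for i in range(2)]

def frob2(A):
    # ||A||^2: sum of squares of the entries.
    return sum(A[i][j] ** 2 for i in range(2) for j in range(2))

# Two points x, x' given by the mean and the upper triangular factor C.
mu,  C  = [0.3, -1.0], [[2.0, 0.7], [0.0, 0.5]]
mu2, C2 = [1.1,  0.4], [[1.5, -0.2], [0.0, 2.5]]

# Cholesky form of D (the sigma_i are the diagonal entries of C, C').
dmu = [mu2[0] - mu[0], mu2[1] - mu[1]]
CiT = transpose2(inv_upper2(C2))                    # C'^{-T}
v = [CiT[i][0] * dmu[0] + CiT[i][1] * dmu[1] for i in range(2)]
D = (0.5 * frob2(matmul2(C, inv_upper2(C2)))
     + 0.5 * (v[0] ** 2 + v[1] ** 2) - 1.0
     + math.log(C2[0][0] / C[0][0]) + math.log(C2[1][1] / C[1][1]))

# Direct formula with Sigma = C^T C and Sigma'^{-1} = C'^{-1} C'^{-T}.
S = matmul2(transpose2(C), C)
S2i = matmul2(inv_upper2(C2), transpose2(inv_upper2(C2)))
tr = sum(matmul2(S2i, S)[i][i] for i in range(2))
quad = sum(dmu[i] * sum(S2i[i][j] * dmu[j] for j in range(2)) for i in range(2))
logdet = 2 * (math.log(C2[0][0] * C2[1][1]) - math.log(C[0][0] * C[1][1]))
D_direct = 0.5 * (tr + quad - 2 + logdet)
assert abs(D - D_direct) < 1e-9
```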
Lowering the upper indices of the α-connection, we put Γ^α_{IJ,K} = ∑_L g_{KL} Γ^{α,L}_{IJ}. With respect to the same coordinates, the coefficients of the e-connection and those of the m-connection can be written explicitly, although we omit the lengthy expressions. There is a particular system of coordinates for describing the e-connection: all of its coefficients vanish with respect to the natural parameter (Σ^{−1}µ, −(1/2)Σ^{−1}) of the exponential family, where Σ = C^T C denotes the covariance matrix. On the other hand, all the coefficients of the m-connection vanish with respect to the expectation parameter (µ, ν), where ν = (ν_ab)_{1≤a≤b≤n} is the upper half of C^T C + µµ^T.

The Generalization
This subsection is devoted to the generalization of the result of the author, which is mentioned in the introduction, to the above multivariate setting. We fix the third component r of the coordinate system (µ, σ, r), and change the presentation of the others. Namely, we take the natural projection π : U = H^n × R^{n(n−1)/2} → R^{n(n−1)/2} and replace the coordinates (µ, σ) on the fiber L(r) = π^{−1}(r) by the coordinates (m, s) appearing in the next proposition. The generalization is then straightforward. Note that dr(∇^α_{∂_µ} ∂_µ) is identically zero on some/any fiber L(r) if and only if α = 1. The fiber satisfies the following two properties.

Proposition 2. L(r) is closed under the convolution * and the Bayesian product ·, and thus inherits them.
Put u = (σ_i^{−2}) and y(x) = [r^{ij}]^T x. The density at (µ, σ, r) is proportional to exp(−(1/2)(y(x)^T diag(u) y(x) − 2m^T diag(u) y(x))), where m = y(µ). From this we see that (µ, σ, r) · (µ′, σ′, r) = (µ″, σ″, r″) implies r″ = r, u″ = u + u′ and diag(u″)m″ = diag(u)m + diag(u′)m′; a similar argument applies to the convolution product. The restriction of g to L(r) is ∑_{i=1}^n (dm_i² + 2 ds_i²)/s_i², and we define a compatible complex structure J : TL(r) → TL(r). We write the restriction D|_{L(r)} of the premetric D using the coordinates (m, s) as follows, where we omit r since the expression does not depend on r:

D((m, s), (m′, s′)) = ∑_{i=1}^n ( ln(s′_i/s_i) + (s_i² + (m_i − m′_i)²)/(2 s′_i²) − 1/2 ).
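Proposition 2 can be checked numerically. Since Σ = C^T C with C = diag(σ)[r_ij], for n = 2 the off-diagonal entry of the unitriangular factor is r₁₂ = Σ₁₂/Σ₁₁, and both products preserve it (a sketch with our own helper functions):

```python
# Sigma = R^T diag(sigma^2) R for the unitriangular upper R = [[1, r], [0, 1]].
def cov(sig, r):
    a, b = sig[0] ** 2, sig[1] ** 2
    return [[a, a * r], [a * r, a * r * r + b]]

def unitriangular_part(S):
    # From Sigma = C^T C with C = diag(sigma) R: r_12 = S_12 / S_11.
    return S[0][1] / S[0][0]

def inv2(S):
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    return [[S[1][1] / det, -S[0][1] / det], [-S[1][0] / det, S[0][0] / det]]

r = 0.8                      # shared leaf coordinate of both factors
S1 = cov([2.0, 0.5], r)
S2 = cov([1.0, 3.0], r)

# Convolution: covariances add; the leaf coordinate r is unchanged.
S_conv = [[S1[i][j] + S2[i][j] for j in range(2)] for i in range(2)]
assert abs(unitriangular_part(S_conv) - r) < 1e-12

# Bayesian product: precisions add; r is again unchanged.
P = [[inv2(S1)[i][j] + inv2(S2)[i][j] for j in range(2)] for i in range(2)]
S_bayes = inv2(P)
assert abs(unitriangular_part(S_bayes) - r) < 1e-12
```

This reflects the proof above: both operations act only on diag(u) in the factorization Σ⁻¹ = R⁻¹ diag(u) R^{−T}, leaving R fixed.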
We take the product U₁ × U₂ of two copies of U. Then, the products L₁(r) × L₂(R) of the fibers foliate U₁ × U₂. We call this the primary foliation of U₁ × U₂. For each (r, R) ∈ R^{n(n−1)}, we have the coordinate system (m, s, M, S) on the leaf L₁(r) × L₂(R). From the Kähler forms respectively on L₁(r) and L₂(R), we define the symplectic forms ω₁ ± ω₂ on L₁(r) × L₂(R), which induce the mutually opposite orientations in the case where n is odd. Hereafter, we consider the pair of regular Poisson structures defined by these symplectic structures on the primary foliation, and fix primitive 1-forms λ_± of ω₁ ± ω₂. We take the 2n-dimensional submanifolds F_{ε,δ} ⊂ L₁(r) × L₂(R) for ε ∈ Rⁿ and δ ∈ (R_{>0})ⁿ. The secondary foliation of U₁ × U₂ foliates any leaf L₁(r) × L₂(R) by the 3n-dimensional submanifolds F_ε = ⋃_{δ∈(R_{>0})ⁿ} F_{ε,δ} for ε ∈ Rⁿ. The tertiary foliation of U₁ × U₂ foliates all leaves F_ε of the secondary foliation by the 2n-dimensional submanifolds F_{ε,δ} for δ ∈ (R_{>0})ⁿ.

Proposition 4.
With respect to the Kähler form dλ − , the tertiary leaves F ε,δ are Lagrangian correspondences.
The restrictions λ_±|_N to the hypersurface N = {(m, s, M, S) ∈ L₁ × L₂ | ∏_{i=1}^n (s_i S_i) = 1} are contact forms. Let η_± denote them.

Proposition 5.
For any ε and δ with ∏_{i=1}^n δ_i = 1, the submanifold F_{ε,δ} ⊂ N is a disjoint union of n-dimensional submanifolds {s = const} ⊂ F_{ε,δ} which are integral submanifolds of the hyperplane field ker η₊ on N.
Proof. F_{ε,δ} maps to F_{ε,(ζ_{2i−1}ζ_{2i}δ_i)} under the diffeomorphism ϕ_{ε,ζ}. □

A curve (m(t), s(t)) ∈ Hⁿ is a geodesic with respect to the e-connection ∇¹ if and only if m_i/s_i² and 1/s_i² are affine functions of t for i = 1, . . . , n.

Definition 1. We say that an e-geodesic (m(t), s(t)) ∈ Hⁿ is intensive if it admits an affine parameterization such that the functions 1/s_i² are linear for i = 1, . . . , n.
Note that any e-geodesic is intensive in the case where n = 1.
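By the characterization above, an e-geodesic is a straight line in the natural parameter (m/s², −1/(2s²)). The following sketch (with our own choice of coefficients) builds such curves and exhibits an affine shift of the parameter that makes 1/s² linear, i.e., exhibits the e-geodesic as intensive:

```python
# e-geodesics in H = {(m, s)}: curves with m/s^2 and 1/s^2 affine in t.
# The coefficients (a, b, c, d) below are our own illustrative choices.

def geodesic(a, b, c, d):
    """Return t -> (m(t), s(t)) with m/s^2 = a*t + b and 1/s^2 = c*t + d."""
    def point(t):
        s2 = 1.0 / (c * t + d)          # valid while c*t + d > 0
        return ((a * t + b) * s2, s2 ** 0.5)
    return point

g = geodesic(1.0, 2.0, 0.5, 3.0)
for t in (0.0, 0.4, 1.3):
    m, s = g(t)
    assert abs(m / s**2 - (t + 2.0)) < 1e-12
    assert abs(1.0 / s**2 - (0.5 * t + 3.0)) < 1e-12

# The affine shift t -> t - 6 turns 1/s^2 = 0.5*t + 3 into the linear
# function 0.5*t (and m/s^2 into t - 4), so this e-geodesic is intensive.
h = geodesic(1.0, -4.0, 0.5, 0.0)
for t in (6.5, 8.0):
    assert abs(h(t)[0] - g(t - 6.0)[0]) < 1e-12
    assert abs(h(t)[1] - g(t - 6.0)[1]) < 1e-12
```

Such a shift exists whenever the slope c of 1/s² is non-zero, which is the generic situation in the univariate case.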

Proposition 9.
Given an intensive e-geodesic (m(t), s(t)), we can change the parameterization of its image under the diffeomorphism F̃_{ε,δ} to obtain an intensive e-geodesic.
Proof. Put m_i/s_i² = a_i t + b_i and 1/s_i² = c_i t. Then, for the image curve, the corresponding functions m_i/s_i² and 1/s_i² are respectively an affine function and a linear function of 1/t, so the reparameterization t ↦ 1/t yields an intensive e-geodesic. □
We have the hypersurface N = {∏_{i=1}^n (s_i S_i) = 1}, which is defined on any leaf L₁(r) × L₂(R) ≈ H^{2n} of the primary foliation of U₁ × U₂.

Corollary 1. For any δ ∈ (R_{>0})ⁿ, the flow on the leaf F_{0,δ} presents the iteration of the operation * on the first factor of U₁ × U₂ and that of the operation · on the second factor, as is described in Proposition 8 (see Figure 1).

From Corollary 1, we see that the flow model of Bayesian inference studied in [4,5] also works for the multivariate case. We now prove the theorem.

Proof. Take the vector field Y on L₁ × L₂ satisfying L_Y λ_± = λ_± and ι_Y d ln(s_i S_i) = 0 (i = 1, . . . , n); its restriction to N is then the contact Hamiltonian vector field Y, and the corresponding contact Hamiltonians vanish along F_{ε,δ}. Given a point (P₀, Q₀) ∈ N, we have the integral curve (m(t), s(t), r(t), M(t), S(t), R(t)) = (e^{2t} m₀, e^t s₀, r₀, M₀, e^{−t} S₀, R₀) of Y with initial point (P₀, Q₀). We can change the parameter of the curve (m(t), s(t), r₀) on the first factor with t′ = e^{−2t} to obtain an intensive e-geodesic. □
In the univariate case (n = 1), on the logarithmic sS-plane, we can take any point (ln s, ln S) = (ln ζ₁(1), ln ζ₂(1)) of the line {H = ln s + ln S = 0} other than the origin. That is, we put ζ₁(1) = e^t and ζ₂(1) = e^{−t} (t ≠ 0). Then, the map ϕ_{ε,ζ} sends (m, s, M, S) to (e^t m, e^t s, e^{−t}(M − ε), e^{−t} S). The Z-action generated by it rolls up the level sets of H, so that the quotient of the logarithmic sS-plane becomes the cylinder T¹ × R, which is the base space. On the other hand, the mM-plane, which is the fiber, expands horizontally and contracts vertically. This is the inverse monodromy along T¹. In general, we obtain a similar R^{2n}-bundle over T^{2n−1} × R if we take the 2n − 1 points of {H = 0} in general position.
From Proposition 7, we see that the leaf F ε of the secondary foliation with the set f ε , as well as the pair of the 1-forms λ ± with the function H, descends to the R 2n -bundle. If there exists further a Z 2n -lattice on the fiber R 2n which is simultaneously preserved by the maps ((m i ), (M i − ε i )) → ((ζ 2i−1 (k)m i ), (ζ 2i (k)(M i − ε i ))) (k = 1, . . . , 2n − 1), we obtain a T 2n -bundle over T 2n−1 × R. Such a choice of ζ(k) would be number theoretical. Indeed, this is the case for Hilbert modular cusps. Moreover, we are considering the 1-forms λ ± , which descend to the T 2n -bundle. See [9] for the standard construction with special attention to these 1-forms.
We should notice that the vector field Y does not descend to the T 2n -bundle. However, every actual Bayesian inference along Y eventually stops. Thus, we may take sufficiently large T 2n and consider Y as a locally supported vector field to perform the inference in the quotient space.

Discussion
Finally, we would like to comment on the transverse geometry of the primary foliation. The author conjectures that it has some relation to M-theory. See, e.g., [10] for a relation between Poisson geometry and matrix-theoretical or non-commutative geometrical physics.
We notice that, in the symplectic case where n = 3, the quotient manifold admits no Kähler structure (see [11]). However, it is still remarkable that the transverse symplectic 6-manifold is naturally ignored in the Bayesian inference described in this paper. Conjecturally, a similar model would help us to treat events in parallel worlds (or branes) in the same "psychological" procedure.