Conformal Flattening for Deformed Information Geometries on the Probability Simplex

Recent progress in the theory and application of statistical models built on generalized exponential functions has motivated efforts to deform the standard structure of information geometry. Representing functions play a central role in such deformations. In this paper, we consider two important notions in information geometry, i.e., invariance and dual flatness, from the viewpoint of representing functions. We first characterize the pairs of representing functions that realize the invariant geometry by solving a system of ordinary differential equations. Next, by proposing a new transformation technique, i.e., conformal flattening, we construct dually flat geometries from a certain class of non-flat geometries. Finally, we apply the results to demonstrate several properties of gradient flows on the probability simplex.


Introduction
The theory of information geometry has elucidated abundant geometric properties of manifolds equipped with a Riemannian metric and mutually dual affine connections. When it is applied to statistical models described by the exponential family, the logarithmic function plays a significant role in giving the standard information geometric structure to the models [1,2].
Inspired by recent progress in several areas of statistical physics and mathematical statistics [3-10], which has stimulated theoretical interest in, and possible applications of, generalized exponential families, one research direction in information geometry aims to construct deformed geometries based on the standard one while keeping its basic properties. A typical and classical example of such a deformation is the alpha-geometry [1,2], whose statistical definition can be regarded as replacing the logarithmic function by suitable power functions. Hence, for the sake of generalization and flexible applicability, much attention has been paid to such replacements, with representing functions serving as important tools [3,4,11,12].
Two major characteristics of the standard structure are dual flatness and invariance [2]. Dual flatness (or Hessian structure [13]) produces fruitful properties such as the existence of canonical coordinate systems, a pair of conjugate potential functions and the canonical divergence (relative entropy). These are connected through the Legendre duality relation, which is also fundamental in the generalization of statistical physics. On the other hand, invariance of the geometric structure is crucially valuable in developing mathematical statistics. It has been proved [14] that invariance holds only for the structure with a special triple of a Riemannian metric and a pair of mutually dual affine connections, which are respectively called the Fisher information and the alpha-connections (see Section 3 for their definitions). Studying these two characteristics from the viewpoint of representing functions would contribute to our geometrical understanding.
In this paper, we first characterize a pair of representing functions that realizes the invariant information geometric structure. Next, we propose a new transformation to obtain dually flat geometries from a certain class of non-flat information geometries, using concepts from affine differential geometry [15,16]. We call the transformation conformal flattening, which is a generalization of the way to realize the corresponding dually flat geometry from the alpha-geometry developed in [17,18]. As applications and easy consequences of the results, we finally show several properties of gradient flows associated with realized dually flat geometries. Focusing on geometric characteristics conserved by the transformation, we discuss the properties such as a relation between geodesics and flows, the first integral of the flows and so on. These properties are new and general. Hence, they refine the arguments of the flows in [18], where only the alpha-geometry is treated.
The paper is organized as follows. In Section 2, we introduce preliminary results, explaining several existing methods to construct information geometric structures, including the dually flat structure and the alpha-structure. We also give a short summary of concepts from affine differential geometry that will be used in this paper. Section 3 provides a characterization of the representing functions that realize the invariant geometry, i.e., the one equipped with the Fisher information and a pair of alpha-connections. The characterization is obtained by solving a simple system of ordinary differential equations. In Section 4, we first obtain a certain class of information geometric structures by regarding representing functions as immersions into an ambient affine space. Then, we demonstrate the conformal flattening that realizes the corresponding dually flat structure, and discuss its properties and relations with generalized entropies and escort probabilities [19]. Section 5 exhibits the geometric properties of gradient flows with respect to a conformally realized Riemannian metric. These flows reduce to the well-known replicator flow [20] (Chapter 16) when we consider the standard information geometry. By suitably choosing the pay-off functions, we see that the flow follows a geodesic curve or conserves a divergence from an equilibrium. In the final section, some concluding remarks are made.
Throughout the paper, we use a probability simplex as a statistical model for the sake of simplicity.

Preliminaries
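2.1. Information Geometry of $S^n$ and $\mathbb{R}^{n+1}_+$

Let us represent an element $p \in \mathbb{R}^{n+1}$ with its components $p_i$, $i = 1, \dots, n+1$, as $p = (p_i) \in \mathbb{R}^{n+1}$. Denote, respectively, the positive orthant by
$$\mathbb{R}^{n+1}_+ := \{ p = (p_i) \in \mathbb{R}^{n+1} \mid p_i > 0,\ i = 1, \dots, n+1 \}$$
and the relative interior of the probability simplex by
$$S^n := \Big\{ p \in \mathbb{R}^{n+1}_+ \ \Big|\ \sum_{i=1}^{n+1} p_i = 1 \Big\}.$$
Let $p(X)$ be a probability distribution of a random variable $X$ taking a value in the finite sample space $\Omega = \{1, 2, \dots, n, n+1\}$. We consider the set of distributions $p(X)$ with positive probabilities, i.e., $p(i) = p_i > 0$, $i = 1, \dots, n+1$, defined by
$$\Big\{ p(X) \ \Big|\ p(i) = p_i > 0,\ \sum_{i=1}^{n+1} p_i = 1 \Big\},$$
which is identified with $S^n$. A statistical model in $S^n$ is represented with parameters $\zeta = (\zeta^j)$, $j = 1, \dots, d \le n$, by
$$p(X; \zeta) = \sum_{i=1}^{n+1} p_i(\zeta)\, \delta_i(X),$$
where $\delta_i(X)$ equals 1 if $X = i$ and 0 otherwise, and each $p_i$ is smoothly parametrized by $\zeta$. For such a statistical model, the $\zeta^j$ can also be regarded as coordinates of the corresponding submanifold in $S^n$. For simplicity, we shall consider the full model, i.e., $d = n$ and the parameter set is in bijection with $S^n$ via the $p_i(\zeta)$'s.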
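The information geometric structure [2] on $S^n$, denoted by $(g, \nabla, \nabla^*)$, is composed of a pair of mutually dual torsion-free affine connections $\nabla$ and $\nabla^*$ with respect to a Riemannian metric $g$. If we write $\partial_i := \partial/\partial \zeta^i$, $i = 1, \dots, n$, together with $\Gamma_{ij,k} := g(\nabla_{\partial_i} \partial_j, \partial_k)$ and $\Gamma^*_{ij,k} := g(\nabla^*_{\partial_i} \partial_j, \partial_k)$, the mutual duality requires the components of $(g, \nabla, \nabla^*)$ to satisfy
$$\partial_i g_{jk} = \Gamma_{ij,k} + \Gamma^*_{ik,j}. \tag{1}$$
Let $L$ and $M$ be a pair of strictly monotone (i.e., one-to-one) smooth functions on the interval $(0, 1)$. One way of constructing such a structure $(g, \nabla, \nabla^*)$ is to define the components as follows [2,11]:
$$g_{ij}(p) = \sum_{X \in \Omega} \partial_i L(p(X))\, \partial_j M(p(X)), \tag{2}$$
$$\Gamma_{ij,k}(p) = \sum_{X \in \Omega} \partial_i \partial_j L(p(X))\, \partial_k M(p(X)), \tag{3}$$
$$\Gamma^*_{ij,k}(p) = \sum_{X \in \Omega} \partial_k L(p(X))\, \partial_i \partial_j M(p(X)). \tag{4}$$
In this paper, we call $L$ and $M$ representing functions. It is easy to verify the mutual duality (1) by differentiating (2).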
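Recall that a contrast function $\rho(p, r)$ on $S^n \times S^n$ induces a structure $(g, \nabla, \nabla^*)$ by
$$g_{ij} = -\partial_i \partial'_j \rho \big|_{p=r}, \tag{5}$$
$$\Gamma_{ij,k} = -\partial_i \partial_j \partial'_k \rho \big|_{p=r}, \tag{6}$$
$$\Gamma^*_{ij,k} = -\partial'_i \partial'_j \partial_k \rho \big|_{p=r}, \tag{7}$$
where $\partial_i$ and $\partial'_i$ denote differentiation with respect to the coordinates of $p$ and $r$, respectively [2]. A contrast function $\rho$ of the form
$$\rho(p, r) = \psi(\theta(p)) + \varphi(\eta(r)) - \sum_{i=1}^{n} \theta^i(p)\, \eta_i(r) \tag{8}$$
always induces the corresponding dually flat structure. Conversely, it is known [2] that if $(g, \nabla, \nabla^*)$ is dually flat, then there exists a unique contrast function of the form (8) that induces the structure. Hence, it is called the canonical divergence of $(g, \nabla, \nabla^*)$, and we say that the functions $\psi$ and $\varphi$ are potentials. By setting $p = r$, we see that a dually flat structure naturally gives the Legendre duality relation at each $p$, i.e., the function $\varphi$ is the Legendre conjugate of $\psi$ satisfying
$$\psi(\theta(p)) + \varphi(\eta(p)) - \sum_{i=1}^{n} \theta^i(p)\, \eta_i(p) = 0. \tag{9}$$

Applying the idea of affine hypersurface theory [15] is another way to construct an information geometric structure. Let $D$ be the canonical flat affine connection on $\mathbb{R}^{n+1}$. Consider an immersion $f$ from $S^n$ into $\mathbb{R}^{n+1}$ and a vector field $\xi$ on $S^n$ that is transversal to the hypersurface $f(S^n)$ in $\mathbb{R}^{n+1}$. Such a pair $(f, \xi)$, called an affine immersion, defines a torsion-free connection $\nabla$ and the affine fundamental form $g$ on $S^n$ via the Gauss formula as
$$D_X f_*(Y) = f_*(\nabla_X Y) + g(X, Y)\, \xi, \qquad X, Y \in \mathcal{X}(S^n), \tag{10}$$
where $\mathcal{X}(S^n)$ is the set of tangent vector fields on $S^n$ and $f_*$ denotes the differential of $f$. By regarding $g$ as a (pseudo-)Riemannian metric, one can discuss the realized structure $(g, \nabla)$ on $S^n$. We say that $(f, \xi)$ is non-degenerate if $g$ is non-degenerate, and equiaffine if $D_X \xi$ is tangent to $S^n$ for any $X \in \mathcal{X}(S^n)$. The latter ensures that the volume element $\theta$ on $S^n$ defined by
$$\theta(X_1, \dots, X_n) := \det\big(f_*(X_1), \dots, f_*(X_n), \xi\big) \tag{11}$$
is parallel with respect to $\nabla$ [15] (p.31). It is known [15,16] that there exists a torsion-free dual affine connection $\nabla^*$ satisfying (1) if and only if $(f, \xi)$ is non-degenerate and equiaffine. In this case, the obtained structure $(g, \nabla, \nabla^*)$ on $S^n$ is not dually flat in general. However, there always exist a positive function $\sigma$ and a dually flat structure $(\tilde{g}, \tilde{\nabla}, \tilde{\nabla}^*)$ on $S^n$ that satisfy the following relations [16]:
$$\begin{aligned}
\tilde{g} &= \sigma g, \\
g(\tilde{\nabla}_X Y, Z) &= g(\nabla_X Y, Z) - d(\ln \sigma)(Z)\, g(X, Y), \\
g(\tilde{\nabla}^*_X Y, Z) &= g(\nabla^*_X Y, Z) + d(\ln \sigma)(X)\, g(Y, Z) + d(\ln \sigma)(Y)\, g(X, Z).
\end{aligned} \tag{12}$$
Furthermore, there exists a specific contrast function $\rho(p, r)$ for $(g, \nabla, \nabla^*)$ called the geometric divergence. Then, a contrast function $\tilde{\rho}(p, r)$ that induces $(\tilde{g}, \tilde{\nabla}, \tilde{\nabla}^*)$ is given by the conformal divergence $\tilde{\rho}(p, r) = \sigma(r)\, \rho(p, r)$. These properties of the structure $(g, \nabla)$ realized by the non-degenerate and equiaffine immersion are called 1-conformal flatness [16].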

Characterization of Invariant Geometry by Representing Functions
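Suppose that a pair of representing functions $(L, M)$ defines an information geometric structure $(g, \nabla, \nabla^*)$ by (2), (3) and (4). In this section, we consider the condition on $(L, M)$ under which $(g, \nabla, \nabla^*)$ is invariant. This is equivalent [2,14] to requiring that $g$ be the Fisher information $g_F$ defined by
$$(g_F)_{ij}(p) = \sum_{X \in \Omega} p(X)\, \partial_i \ln p(X)\, \partial_j \ln p(X) \tag{13}$$
and that the pair of dual connections satisfy $\nabla = \nabla^{(\alpha)}$ and $\nabla^* = \nabla^{(-\alpha)}$ for a certain $\alpha \in \mathbb{R}$, where $\nabla^{(\alpha)}$ is the $\alpha$-connection defined by
$$\Gamma^{(\alpha)}_{ij,k}(p) = \sum_{X \in \Omega} p(X) \Big( \partial_i \partial_j \ln p(X) + \frac{1-\alpha}{2}\, \partial_i \ln p(X)\, \partial_j \ln p(X) \Big) \partial_k \ln p(X). \tag{14}$$
Hence, $g_{ij}$ expressed in (2) by the functions $L(u)$ and $M(u)$ coincides with the Fisher information if and only if the following equation holds:
$$L'(u)\, M'(u) = \frac{1}{u}. \tag{15}$$
Similarly, we derive a condition for $\Gamma_{ij,k}$ expressed in (3) to be the $\alpha$-connection. First, note that the following relations hold:
$$\partial_i \partial_j L(p) = L''(p)\, \partial_i p\, \partial_j p + L'(p)\, \partial_i \partial_j p, \tag{16}$$
$$\partial_k M(p) = M'(p)\, \partial_k p. \tag{17}$$
On the other hand, we have
$$\partial_i \partial_j \ln p = \frac{\partial_i \partial_j p}{p} - \frac{\partial_i p\, \partial_j p}{p^2}. \tag{18}$$
Substituting (16), (17) and (18) into (3) and (14), and comparing them, we obtain (15) again and
$$L''(u)\, M'(u) = -\frac{1+\alpha}{2u^2}. \tag{19}$$
Writing $L' := dL/du$ and $L'' := d^2L/du^2$, we have the following ODE from (15) and (19):
$$\frac{L''(u)}{L'(u)} = -\frac{1+\alpha}{2u}. \tag{20}$$
By integrations, for $\alpha \neq \pm 1$ we get
$$L'(u) = c\, u^{-(1+\alpha)/2} \tag{21}$$
and
$$L(u) = c_1 u^{(1-\alpha)/2} + c_2, \qquad M(u) = c_3 u^{(1+\alpha)/2} + c_4, \tag{22}$$
where $c$ and $c_i$, $i = 1, \dots, 4$, are constants with the constraint $c_1 c_3 = 4/(1 - \alpha^2)$. Thus, $(L, M)$ is essentially the pair of representing functions that derives the alpha-geometry, and the only remaining freedom for the invariance of the geometry is the adjustment of the constants. If we require solely (15), which means that only the Riemannian metric $g$ is the Fisher information $g_F$, there still remains much freedom for $(L, M)$.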
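The two conditions on $(L, M)$ can be checked numerically. The following is a minimal sketch, assuming the power-function pair $L(u) = c_1 u^{(1-\alpha)/2} + c_2$, $M(u) = c_3 u^{(1+\alpha)/2} + c_4$ with $c_1 c_3 = 4/(1-\alpha^2)$ and the conditions $L'M' = 1/u$ and $L''M' = -(1+\alpha)/(2u^2)$; the helper names (`alpha_pair`, `invariance_residuals`) and the finite-difference step sizes are illustrative choices, not part of the paper.

```python
def alpha_pair(alpha, c1=1.0, c2=0.0, c4=0.0):
    """Return (L, M) of power type; c3 is fixed by c1 * c3 = 4 / (1 - alpha^2)."""
    if abs(alpha) == 1.0:
        raise ValueError("this sketch assumes alpha != +-1")
    c3 = 4.0 / ((1.0 - alpha ** 2) * c1)
    L = lambda u: c1 * u ** ((1.0 - alpha) / 2.0) + c2
    M = lambda u: c3 * u ** ((1.0 + alpha) / 2.0) + c4
    return L, M


def d1(F, u, h=1e-5):
    """Central finite difference for the first derivative."""
    return (F(u + h) - F(u - h)) / (2.0 * h)


def d2(F, u, h=1e-4):
    """Central finite difference for the second derivative."""
    return (F(u + h) - 2.0 * F(u) + F(u - h)) / (h * h)


def invariance_residuals(alpha, u):
    """Residuals of L'M' = 1/u and L''M' = -(1 + alpha) / (2 u^2) at the point u."""
    L, M = alpha_pair(alpha)
    metric_residual = d1(L, u) * d1(M, u) - 1.0 / u
    connection_residual = d2(L, u) * d1(M, u) + (1.0 + alpha) / (2.0 * u * u)
    return metric_residual, connection_residual
```

Both residuals should vanish up to discretization error for any $u \in (0, 1)$, while a pair violating the constraint on $c_1 c_3$ leaves a nonzero metric residual.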

Affine Immersion of the Probability Simplex
Now we consider the affine immersion with the following assumptions.

Assumptions:
1. The affine immersion $(f, \xi)$ is non-degenerate and equiaffine.
2. The immersion $f$ is given component by component by a common representing function $L$, i.e., $f : S^n \ni p = (p_i) \mapsto x = (x_i) \in \mathbb{R}^{n+1}$, $x_i = L(p_i)$, $i = 1, \dots, n+1$.
3. The representing function $L : (0, 1) \to \mathbb{R}$ is sign-definite, concave with $L'' < 0$ and strictly increasing, i.e., $L' > 0$. Hence, the inverse of $L$, denoted by $E$, exists, i.e., $E \circ L = \mathrm{id}$.
4. Each component of $\xi$ satisfies $\xi_i < 0$, $i = 1, \dots, n+1$, on $S^n$.

Remark 1. From assumption 3, it follows that $L'E' = 1$, $E' > 0$ and $E'' > 0$. Regarding the sign-definiteness of $L$, note that we can adjust $L(u)$ to $L(u) + c$ by a suitable constant $c$ without loss of generality, since the resultant geometric structure is unchanged by the adjustment (see Theorem 1). For a fixed $L$ satisfying assumption 3, we can choose $\xi$ that meets assumptions 1 and 4. For example, if we take $\xi_i = -|L(p_i)|$, then $(f, \xi)$ is called centro-affine, which is known to be equiaffine [15] (p.37). Assumptions 3 and 4 also assure positive definiteness of $g$ (the details are described in the proof of Theorem 1). Hence, $(f, \xi)$ is non-degenerate and we can regard $g$ as a Riemannian metric on $S^n$.
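As a concrete illustration of assumption 3 and Remark 1, the sketch below checks the sign conditions and the inverse relation for one family of representing functions. The power family $L(t) = t^{1-q}/(1-q)$ with $q > 0$, $q \neq 1$ is chosen here only as an example, and the helper names `q_power_family` and `check_assumption_3` are hypothetical.

```python
def q_power_family(q):
    """L(t) = t^(1-q)/(1-q), its first two derivatives, and the inverse E with its derivatives."""
    if q <= 0.0 or q == 1.0:
        raise ValueError("this sketch assumes q > 0 and q != 1")
    a = 1.0 - q
    return {
        "L": lambda t: t ** a / a,
        "dL": lambda t: t ** (-q),
        "d2L": lambda t: -q * t ** (-q - 1.0),
        # On the range of L we have a * x = t^(1-q) > 0, so the powers below are real.
        "E": lambda x: (a * x) ** (1.0 / a),
        "dE": lambda x: (a * x) ** (q / a),
        "d2E": lambda x: q * (a * x) ** ((2.0 * q - 1.0) / a),
    }


def check_assumption_3(fam, t):
    """Evaluate the conditions of assumption 3 and Remark 1 at a point t in (0, 1)."""
    x = fam["L"](t)
    return {
        "increasing": fam["dL"](t) > 0.0,          # L' > 0
        "concave": fam["d2L"](t) < 0.0,            # L'' < 0
        "inverse_error": abs(fam["E"](x) - t),     # E(L(t)) = t
        "product_error": abs(fam["dL"](t) * fam["dE"](x) - 1.0),  # L'E' = 1
        "E_convex": fam["d2E"](x) > 0.0,           # E'' > 0
    }
```

For $q < 1$ the function $L$ is positive on $(0,1)$ and for $q > 1$ it is negative, so sign-definiteness holds in both cases.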

Conormal Vector and the Geometric Divergence
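Define a function $\Psi$ on $\mathbb{R}^{n+1}$ by
$$\Psi(x) := \sum_{i=1}^{n+1} E(x_i);$$
then $f(S^n)$ immersed in $\mathbb{R}^{n+1}$ is expressed as the level surface $\Psi(x) = 1$. Denote by $\mathbb{R}_{n+1}$ the dual space of $\mathbb{R}^{n+1}$ and by $\langle \nu, x \rangle$ the pairing of $x \in \mathbb{R}^{n+1}$ and $\nu \in \mathbb{R}_{n+1}$. The conormal vector [15] (p.57) $\nu : S^n \to \mathbb{R}_{n+1}$ for the affine immersion $(f, \xi)$ is defined by
$$\langle \nu(p), f_*(X) \rangle = 0 \quad \text{for all } X \in T_p S^n, \qquad \langle \nu(p), \xi(p) \rangle = 1 \tag{23}$$
for $p \in S^n$. Using the assumptions and noting the relations
$$\frac{\partial \Psi}{\partial x_i} = E'(x_i) = \frac{1}{L'(p_i)} > 0, \qquad i = 1, \dots, n+1, \tag{24}$$
we introduce a normalizing factor $\Lambda$ defined by
$$\Lambda(p) := \sum_{i=1}^{n+1} E'(x_i)\, \xi_i(p). \tag{25}$$
Then $\nu : S^n \to \mathbb{R}_{n+1}$ defined by
$$\nu_i(p) := \frac{1}{\Lambda(p)}\, E'(x_i) = \frac{1}{\Lambda(p)\, L'(p_i)} \tag{26}$$
satisfies (23), which we can confirm using the relation $\sum_{i=1}^{n+1} X^i = 0$ for $X = (X^i) \in \mathcal{X}(S^n)$. Furthermore, it follows from (24), (25) and assumption 4 that
$$\Lambda(p) < 0, \qquad \nu_i(p) < 0, \quad i = 1, \dots, n+1,$$
for all $p \in S^n$. It is known [15] (p.57) that the affine fundamental form $g$ can be represented by
$$g(X, Y) = -\langle \nu_*(X), f_*(Y) \rangle.$$
In our case, it is calculated via (26) as
$$g(X, Y) = -\frac{1}{\Lambda(p)} \sum_{i=1}^{n+1} E''(x_i)\, L'(p_i)^2\, X^i Y^i = \frac{1}{\Lambda(p)} \sum_{i=1}^{n+1} \frac{L''(p_i)}{L'(p_i)}\, X^i Y^i. \tag{27}$$
Hence, $g$ is positive definite by assumptions 3 and 4, and we can regard it as a Riemannian metric. Utilizing these notions from affine differential geometry, we can introduce a geometric divergence [16] as follows:
$$\rho(p, r) = \langle \nu(r), f(p) - f(r) \rangle = \sum_{i=1}^{n+1} \nu_i(r) \big( L(p_i) - L(r_i) \big) = \frac{1}{\Lambda(r)} \sum_{i=1}^{n+1} \frac{L(p_i) - L(r_i)}{L'(r_i)}. \tag{28}$$
It is easily checked, using (5), (6) and (7), that $\rho$ is actually a contrast function of the 1-conformally flat structure $(g, \nabla, \nabla^*)$.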
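The quantities of this subsection are straightforward to evaluate. The sketch below assumes the readings $\Lambda(p) = \sum_i \xi_i(p)/L'(p_i)$, $\nu_i(p) = 1/(\Lambda(p) L'(p_i))$, a diagonal metric with entries $L''(p_i)/(\Lambda(p) L'(p_i))$, and $\rho(p, r) = \sum_i \nu_i(r)(L(p_i) - L(r_i))$; the function names and the list-based representation of points of the simplex are illustrative choices.

```python
def conormal_data(p, dL, xi):
    """Return (Lambda, nu) for a point p, the derivative dL = L', and the transversal components xi."""
    w = [1.0 / dL(pi) for pi in p]                 # E'(x_i) = 1 / L'(p_i)
    lam = sum(wi * x for wi, x in zip(w, xi))      # normalizing factor Lambda(p)
    nu = [wi / lam for wi in w]                    # conormal components nu_i(p)
    return lam, nu


def affine_metric_diag(p, dL, d2L, lam):
    """Diagonal entries L''(p_i) / (Lambda * L'(p_i)) of the affine fundamental form."""
    return [d2L(pi) / (dL(pi) * lam) for pi in p]


def geometric_divergence(p, r, L, dL, xi_r):
    """rho(p, r) = sum_i nu_i(r) * (L(p_i) - L(r_i)), with xi_r the transversal vector at r."""
    _, nu_r = conormal_data(r, dL, xi_r)
    return sum(n * (L(pi) - L(ri)) for n, pi, ri in zip(nu_r, p, r))
```

With $L(t) = t^{1-q}/(1-q)$ and $\xi_i = -q\,p_i^{1-q}$, this should give $\Lambda = -q$ and diagonal metric entries $1/p_i$, i.e., the Fisher information.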
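Let $(\tilde{g}, \tilde{\nabla}, \tilde{\nabla}^*)$ be the structure induced by a conformal divergence $\tilde{\rho}(p, r) = \sigma(r)\, \rho(p, r)$ obtained from the geometric divergence $\rho$ in (28). For an arbitrary function $L$ within our setting given by the four assumptions, we prove that we can construct a dually flat structure $(\tilde{g}, \tilde{\nabla}, \tilde{\nabla}^*)$ by choosing the conformal factor $\sigma$ carefully. Hereafter, we call this transformation conformal flattening. Define
$$Z(p) := \sum_{i=1}^{n+1} \nu_i(p) = \frac{1}{\Lambda(p)} \sum_{i=1}^{n+1} \frac{1}{L'(p_i)};$$
then it is negative because each $\nu_i(p)$ is negative. The conformal divergence to $\rho$ with respect to the conformal factor $\sigma(r) := -1/Z(r)$ is
$$\tilde{\rho}(p, r) = -\frac{1}{Z(r)}\, \rho(p, r). \tag{29}$$

Theorem 1. If the conformal factor is given by $\sigma = -1/Z$, then the structure $(\tilde{g}, \tilde{\nabla}, \tilde{\nabla}^*)$ induced by $\tilde{\rho}$ is dually flat. Furthermore, the conformal divergence $\tilde{\rho}$ that induces $(\tilde{g}, \tilde{\nabla}, \tilde{\nabla}^*)$ on $S^n$ is canonical, where the Legendre conjugate potential functions and coordinate systems are explicitly given by
$$\theta^i(p) = x_i(p) - x_{n+1}(p) = L(p_i) - L(p_{n+1}), \quad i = 1, \dots, n, \tag{30}$$
$$\eta_i(p) = P_i(p) := \frac{\nu_i(p)}{Z(p)}, \quad i = 1, \dots, n, \tag{31}$$
$$\psi(p) = -x_{n+1}(p) = -L(p_{n+1}), \tag{32}$$
$$\varphi(p) = \frac{1}{Z(p)} \sum_{i=1}^{n+1} \nu_i(p)\, x_i(p) = \sum_{i=1}^{n+1} P_i(p)\, L(p_i). \tag{33}$$

Proof. Using the given relations, we first show that the conformal divergence $\tilde{\rho}$ is the canonical divergence for $(\tilde{g}, \tilde{\nabla}, \tilde{\nabla}^*)$:
$$\begin{aligned}
\tilde{\rho}(p, r) &= -\frac{1}{Z(r)} \sum_{i=1}^{n+1} \nu_i(r) \big( L(p_i) - L(r_i) \big) = \sum_{i=1}^{n+1} P_i(r) \big( L(r_i) - L(p_i) \big) \\
&= \varphi(r) - \sum_{i=1}^{n} P_i(r) \big( L(p_i) - L(p_{n+1}) \big) - \Big( \sum_{i=1}^{n+1} P_i(r) \Big) L(p_{n+1}) \\
&= \psi(p) + \varphi(r) - \sum_{i=1}^{n} \theta^i(p)\, \eta_i(r),
\end{aligned} \tag{34}$$
where we have used $\sum_{i=1}^{n+1} P_i(r) = 1$. Next, let us confirm that $\partial \psi / \partial \theta^i = \eta_i$. Since $\theta^i(p) = L(p_i) + \psi(p)$, $i = 1, \dots, n$, we have
$$p_i = E(\theta^i - \psi), \quad i = 1, \dots, n+1,$$
by setting $\theta^{n+1} := 0$. Hence, we have
$$1 = \sum_{i=1}^{n+1} E(\theta^i - \psi).$$
Differentiating by $\theta^j$, we obtain
$$0 = E'(\theta^j - \psi) - \Big( \sum_{i=1}^{n+1} E'(x_i) \Big) \frac{\partial \psi}{\partial \theta^j}.$$
This implies that
$$\frac{\partial \psi}{\partial \theta^j} = \frac{E'(x_j)}{\sum_{i=1}^{n+1} E'(x_i)} = P_j(p) = \eta_j(p).$$
Together with (34) and this relation, $\varphi$ is confirmed to be the Legendre conjugate of $\psi$.
The dual relation $\partial \varphi / \partial \eta_i = \theta^i$ follows automatically from the property of the Legendre transform.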
The following corollary is straightforward because all the quantities in the theorem depend only on $L$:

Corollary 1. Under the assumptions, the dually flat structure $(\tilde{g}, \tilde{\nabla}, \tilde{\nabla}^*)$ on $S^n$ obtained by the above conformal flattening does not depend on the choice of the transversal vector $\xi$.
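The coordinates and potentials of the conformally flattened structure can be computed from $L$ alone, in line with Corollary 1. The sketch below assumes the readings $\eta_i = P_i(p) = (1/L'(p_i)) / \sum_k (1/L'(p_k))$, $\psi(p) = -L(p_{n+1})$, $\theta^i(p) = L(p_i) + \psi(p)$, $\varphi(p) = \sum_i P_i(p) L(p_i)$ and $\tilde{\rho}(p, r) = \sum_i P_i(r)(L(r_i) - L(p_i))$; the function names are illustrative.

```python
def flatten(p, L, dL):
    """Return (theta, eta, psi, phi) at p; eta lists all n+1 values P_i(p), theta the n coordinates."""
    w = [1.0 / dL(pi) for pi in p]            # E'(x_i) = 1 / L'(p_i)
    total = sum(w)
    eta = [wi / total for wi in w]            # P_i(p); the first n entries are the dual coordinates
    psi = -L(p[-1])                           # potential psi(p) = -L(p_{n+1})
    theta = [L(pi) + psi for pi in p[:-1]]    # theta^i = L(p_i) - L(p_{n+1}), i = 1..n
    phi = sum(Pi * L(pi) for Pi, pi in zip(eta, p))
    return theta, eta, psi, phi


def conformal_divergence(p, r, L, dL):
    """Conformal divergence sum_i P_i(r) * (L(r_i) - L(p_i)); no transversal vector is needed."""
    _, eta_r, _, _ = flatten(r, L, dL)
    return sum(Pi * (L(ri) - L(pi)) for Pi, ri, pi in zip(eta_r, r, p))
```

Two quick sanity checks are the canonical form $\psi(p) + \varphi(r) - \sum_{i \le n} \theta^i(p)\eta_i(r)$ and, for $L = \ln$, agreement with the KL divergence $\sum_i r_i(\ln r_i - \ln p_i)$.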

Remark 2.
Note that the conformal metric is given by $\tilde{g} = -g/Z$ and is positive definite. Furthermore, the relation (12) means that the dual affine connections $\nabla^*$ and $\tilde{\nabla}^*$ are projectively (or $-1$-conformally) equivalent [15,16]. Hence, $\nabla^*$ is projectively flat. Furthermore, the above corollary implies that the realized affine connection $\nabla$ is also projectively equivalent to the flat connection $\tilde{\nabla}$ if we use the centro-affine immersion, i.e., $\xi_i = -L(p_i)$ [15,16]. See Proposition 3 for an application of the projective equivalence of affine connections.

Remark 3.
In our setting, conformal flattening is geometrically regarded as normalization of the conormal vector $\nu$. Hence, the dual coordinates $\eta_i(p) = P_i(p)$ can be interpreted as a generalization of the escort probability [10,19] (see the following example). Similarly, $\psi$ and $-\varphi$ might be seen as the associated Massieu function and entropy, respectively.

Remark 4.
While the immersion $f$ is composed of a representing function $L$ under assumption 2, a corresponding single-variable function $M$ does not generally exist for either $(g, \nabla, \nabla^*)$ or $(\tilde{g}, \tilde{\nabla}, \tilde{\nabla}^*)$. From the expressions of the Riemannian metrics $g$ in (27) and $\tilde{g} = -g/Z$, we see that the counterparts of the representing function $M(p_i)$ would be, respectively, $-\nu_i(p)$ and $P_i(p)$, but note that they are multi-variable functions of $p = (p_i)$.

Examples
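If we take $L$ to be the logarithmic function $L(t) = \ln(t)$, then the conformally flattened geometry immediately defines the standard dually flat structure $(g_F, \nabla^{(1)}, \nabla^{(-1)})$ on the simplex $S^n$. We see that $-\varphi(p)$ is the entropy, i.e., $\varphi(p) = \sum_{i=1}^{n+1} p_i \ln p_i$, and the conformal divergence is the KL divergence (relative entropy), i.e., $\tilde{\rho}(p, r) = D^{(KL)}(r \| p) = \sum_{i=1}^{n+1} r_i (\ln r_i - \ln p_i)$.

Next, let the affine immersion $(f, \xi)$ be defined by the following $L$ and $\xi$:
$$L(t) := \frac{1}{1-q}\, t^{1-q}, \qquad \xi_i(p) = -q(1-q)\, L(p_i),$$
with $0 < q$ and $q \neq 1$. We see that the immersion is centro-affine, scaled by the constant factor $q(1-q)$. Then, the immersion realizes the alpha-structure $(g_F, \nabla^{(\alpha)}, \nabla^{(-\alpha)})$ on $S^n$ with $q = (1 + \alpha)/2$. The geometric divergence is the alpha-divergence, i.e.,
$$\rho(p, r) = \frac{1}{q(1-q)} \Big( 1 - \sum_{i=1}^{n+1} p_i^{1-q}\, r_i^{q} \Big) = \frac{4}{1-\alpha^2} \Big( 1 - \sum_{i=1}^{n+1} p_i^{(1-\alpha)/2}\, r_i^{(1+\alpha)/2} \Big).$$
Following the procedure of conformal flattening described above, we have [17]
$$\Lambda(p) = -q, \qquad \nu_i(p) = -\frac{1}{q}\, p_i^{q}, \qquad -\frac{1}{Z(p)} = \frac{q}{\sum_{k=1}^{n+1} p_k^{q}},$$
and obtain a dually flat structure $(\tilde{g}, \tilde{\nabla}, \tilde{\nabla}^*)$ with $\tilde{g} = -g_F/Z$ via the formulas in Theorem 1:
$$\eta_i = \frac{p_i^{q}}{\sum_{k=1}^{n+1} p_k^{q}}, \qquad \theta^i = \ln_q(p_i) - \ln_q(p_{n+1}), \qquad \psi(p) = -\ln_q(p_{n+1}), \qquad \varphi(p) = \sum_{i=1}^{n+1} \eta_i \ln_q(p_i) = -\frac{S_q(p)}{1 + (1-q)\, S_q(p)},$$
where $L$ has been shifted by the constant $-1/(1-q)$, which is permitted by Remark 1. Here, $\ln_q$ and $S_q(p)$ are the $q$-logarithmic function and the Tsallis entropy [10], respectively, defined by
$$\ln_q(t) := \frac{t^{1-q} - 1}{1-q}, \qquad S_q(p) := \frac{\sum_{i=1}^{n+1} p_i^{q} - 1}{1-q}.$$
Note that the escort probability appears as the dual coordinate $\eta_i$.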
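The power-function example can be explored numerically. The following sketch assumes the standard definitions $\ln_q(t) = (t^{1-q} - 1)/(1-q)$ and $S_q(p) = (\sum_i p_i^q - 1)/(1-q)$, the alpha-divergence written in terms of $q = (1+\alpha)/2$, and the conformal factor $q / \sum_i r_i^q$; all function names are illustrative.

```python
def ln_q(t, q):
    """q-logarithm (t^(1-q) - 1) / (1 - q), for q != 1."""
    return (t ** (1.0 - q) - 1.0) / (1.0 - q)


def tsallis_entropy(p, q):
    """Tsallis entropy S_q(p) = (sum_i p_i^q - 1) / (1 - q)."""
    return (sum(pi ** q for pi in p) - 1.0) / (1.0 - q)


def escort(p, q):
    """Escort probabilities p_i^q / sum_k p_k^q, i.e., the dual coordinates eta_i in this example."""
    total = sum(pi ** q for pi in p)
    return [pi ** q / total for pi in p]


def alpha_divergence(p, r, q):
    """Geometric divergence (1 - sum_i p_i^(1-q) r_i^q) / (q (1 - q)), with q = (1 + alpha) / 2."""
    return (1.0 - sum(pi ** (1.0 - q) * ri ** q for pi, ri in zip(p, r))) / (q * (1.0 - q))


def conformal_alpha_divergence(p, r, q):
    """Alpha-divergence rescaled by the conformal factor sigma(r) = q / sum_i r_i^q."""
    return q / sum(ri ** q for ri in r) * alpha_divergence(p, r, q)
```

As $q \to 1$ the escort probabilities reduce to $p$ itself and the conformal divergence approaches the KL divergence $\sum_i r_i(\ln r_i - \ln p_i)$, which connects the two examples of this subsection.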

An Application to Gradient Flows on S n
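Recall the replicator flow on the simplex $S^n$ for given functions $f_i(p)$, defined by
$$\dot{p}_i = p_i \big( f_i(p) - \bar{f}(p) \big), \quad i = 1, \dots, n+1, \qquad \bar{f}(p) := \sum_{i=1}^{n+1} p_i f_i(p), \tag{35}$$
which is extensively studied in evolutionary game theory. It is known [20] (Chapter 16) that (i) the solution to (35) is the gradient flow that maximizes a function $V(p)$ satisfying
$$f_i = \frac{\partial V}{\partial p_i}, \quad i = 1, \dots, n+1, \tag{36}$$
with respect to the Shahshahani metric $g_S$ (see below), and (ii) the KL divergence is a local Lyapunov function for an equilibrium called the evolutionary stable state (ESS) in the case of $f_i(p) = \sum_{j=1}^{n+1} a_{ij} p_j$ with $(a_{ij}) \in \mathbb{R}^{(n+1) \times (n+1)}$.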
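Here, the Shahshahani metric $g_S$ is the diagonal metric on $\mathbb{R}^{n+1}_+$ with components $(g_S)_{ij}(p) = \big( \sum_{k=1}^{n+1} p_k / p_i \big) \delta_{ij}$. Note that the Shahshahani metric induces the Fisher metric $g_F$ on $S^n$. Further, the KL divergence is the canonical divergence [2] of $(g_F, \nabla^{(1)}, \nabla^{(-1)})$. Thus, the replicator dynamics (35) are closely related to the standard dually flat structure $(g_F, \nabla^{(1)}, \nabla^{(-1)})$, which is associated with exponential and mixture families of probability distributions. In addition, investigation of the flow is also important from the viewpoint of statistical physics governed by the Boltzmann-Gibbs distributions when we choose $V(p)$ as various physical quantities, e.g., free energy or entropy.

Similarly, when we consider various Legendre relations deformed by $L$, it would be of interest to investigate gradient flows on $S^n$ for a dually flat structure $(\tilde{g}, \tilde{\nabla}, \tilde{\nabla}^*)$ or a 1-conformally flat structure $(g, \nabla, \nabla^*)$. Since $g$ and $\tilde{g}$ can be naturally extended to $\mathbb{R}^{n+1}_+$ in diagonal form (we use the same notation for brevity),
$$g_{ij}(p) = \frac{1}{\Lambda(p)} \frac{L''(p_i)}{L'(p_i)}\, \delta_{ij}, \qquad \tilde{g}_{ij}(p) = -\frac{1}{Z(p)}\, g_{ij}(p), \tag{37}$$
from (27), we can define two gradient flows for $V(p)$ on $S^n$. One is the gradient flow for $g$, which is
$$\dot{p}_i = g_{ii}(p)^{-1} \big( f_i(p) - \bar{f}_g(p) \big), \qquad \bar{f}_g(p) := \frac{\sum_{k=1}^{n+1} g_{kk}(p)^{-1} f_k(p)}{\sum_{k=1}^{n+1} g_{kk}(p)^{-1}}, \tag{38}$$
for $i = 1, \dots, n+1$. It is verified that $\dot{p}$ is tangent to $S^n$, i.e., $\dot{p} \in T_p S^n$, and is the gradient of $V$, i.e.,
$$g(X, \dot{p}) = \sum_{i=1}^{n+1} f_i X^i = \sum_{i=1}^{n+1} \frac{\partial V}{\partial p_i} X^i \quad \text{for all } X = (X^i) \in \mathcal{X}(S^n).$$
In the same way, the other one for $\tilde{g}$ is defined by
$$\dot{p}_i = \tilde{g}_{ii}(p)^{-1} \big( f_i(p) - \bar{f}_g(p) \big), \quad i = 1, \dots, n+1, \tag{39}$$
where the weighted mean $\bar{f}_g$ is unchanged because the common factor $-Z(p)$ cancels. Note that both flows reduce to (35) when $L = \ln$. From (37), the following consequence is immediate:

Proposition 1. The trajectories of the gradient flows (38) and (39) starting from the same initial point coincide, while the velocities of their time evolutions differ by the factor $-Z(p)$.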
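The relation between the two flows can be seen directly at the level of vector fields. The sketch below assumes a diagonal metric and the reading $\dot{p}_i = g_{ii}^{-1}(f_i - \bar{f}_g)$ with $\bar{f}_g$ the $g^{-1}$-weighted mean of the $f_k$, which is the choice that keeps $\dot{p}$ tangent to the simplex; `gradient_field` is a hypothetical helper name.

```python
def gradient_field(g_diag, f):
    """Velocity of the gradient flow on the simplex for a diagonal metric.

    g_diag: positive diagonal entries g_ii(p); f: values f_i(p) = dV/dp_i.
    The weighted mean is subtracted so that the components of the velocity sum to zero.
    """
    inv = [1.0 / gi for gi in g_diag]
    f_bar = sum(a * fi for a, fi in zip(inv, f)) / sum(inv)
    return [a * (fi - f_bar) for a, fi in zip(inv, f)]
```

Evaluating this field for $g$ and for $\tilde{g} = -g/Z$ at the same point gives parallel vectors whose ratio is $-Z(p)$, and for $g_{ii} = 1/p_i$ the field is the right-hand side of the replicator equation $p_i(f_i - \sum_k p_k f_k)$.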
Taking account of the example of the alpha-geometry and its conformally flattened counterpart given in Subsection 4.3, a result shown in [18] can be regarded as a corollary of the above proposition.

Next, we consider in particular the case where $V(p)$ is a potential function or a divergence. As for a gradient flow on a manifold equipped with a dually flat structure $(\tilde{g}, \tilde{\nabla}, \tilde{\nabla}^*)$, the following result is known:

Proposition 2. [22] Consider the potential function $\psi(p)$ and the canonical divergence $\tilde{\rho}(p, r)$ of $(\tilde{g}, \tilde{\nabla}, \tilde{\nabla}^*)$ for an arbitrary prefixed point $r$. The gradient flows for $V(p) = \pm\psi(p)$ and $V(p) = \pm\tilde{\rho}(p, r)$ follow $\tilde{\nabla}^*$-geodesic curves.
Finally, we demonstrate another aspect of the flow (39). Let us consider in particular the functions $f_i$ given in (40), which are built from an anti-symmetric matrix $(a_{ij})$, i.e., $a_{ij} = -a_{ji}$. Note that these $f_i$ are not integrable, i.e., a non-trivial $V$ satisfying (36) does not exist because of the anti-symmetry of $a_{ij}$. Hence, for this case, (39) is no longer a gradient flow. However, we can prove the following result: Consider the flow (39) with the functions $f_i$ defined in (40) and assume that there exists an equilibrium $r \in S^n$ for the flow. Then, $\rho(p, r)$ and $\tilde{\rho}(p, r)$ are first integrals (conserved quantities) of the flow.
By the relation $E'(x_i) = 1/L'(p_i)$ and (31), it holds that $\sum_{i,j} a_{ij}\, E'(x_i)\, E'(x_j) = 0$ because of the anti-symmetry of $a_{ij}$.
Thus, $\tilde{\rho}(p, r)$ is a first integral of the flow. It follows from the definition of the conformal divergence (29) that $\rho(p, r)$ is also a first integral of the flow.

Remark 5.
From Proposition 1, the same statement holds for the flow (38). The proposition implies the fact [20] that the KL divergence is a first integral of the replicator flow (35) with the functions $f_i(p)$ in (40) defined by $L(t) = \ln t$ and $P_j(p) = p_j$.
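The conservation of the KL divergence in the standard case can be observed in a short simulation. The sketch below assumes linear pay-offs $f_i(p) = \sum_j a_{ij} p_j$ with an anti-symmetric matrix and an interior equilibrium $r$ satisfying $\sum_j a_{ij} r_j = 0$; the classical Runge-Kutta integrator, the step size and the function names are illustrative choices rather than anything prescribed by the paper.

```python
import math


def replicator_rhs(p, A):
    """Right-hand side p_i * (f_i - f_bar) with f_i = sum_j A[i][j] * p_j."""
    n = len(p)
    f = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
    f_bar = sum(pi * fi for pi, fi in zip(p, f))
    return [pi * (fi - f_bar) for pi, fi in zip(p, f)]


def rk4_step(p, A, dt):
    """One classical fourth-order Runge-Kutta step for the replicator flow."""
    k1 = replicator_rhs(p, A)
    k2 = replicator_rhs([pi + 0.5 * dt * k for pi, k in zip(p, k1)], A)
    k3 = replicator_rhs([pi + 0.5 * dt * k for pi, k in zip(p, k2)], A)
    k4 = replicator_rhs([pi + dt * k for pi, k in zip(p, k3)], A)
    return [pi + dt * (a + 2.0 * b + 2.0 * c + d) / 6.0
            for pi, a, b, c, d in zip(p, k1, k2, k3, k4)]


def kl_divergence(r, p):
    """KL divergence sum_i r_i * ln(r_i / p_i) from the equilibrium r to the state p."""
    return sum(ri * math.log(ri / pi) for ri, pi in zip(r, p))


def simulate(p0, A, dt, steps):
    """Integrate the flow from p0 and return the final state."""
    p = list(p0)
    for _ in range(steps):
        p = rk4_step(p, A, dt)
    return p
```

For the rock-paper-scissors matrix with the uniform distribution as equilibrium, the value of `kl_divergence` along the computed trajectory should stay constant up to integration error, so the orbit is a closed curve around $r$.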

Conclusions
We have considered two important aspects of information geometric structure, i.e., invariance and dual flatness, from the viewpoint of representing functions. As for the invariance of geometry, we have proved that the pair of representing functions that derives the alpha-structure is essentially unique. On the other hand, we have shown an explicit formula for conformal flattening, which transforms 1-conformally flat structures on the simplex $S^n$ realized by affine immersions into the corresponding dually flat structures. Finally, we have discussed several geometric properties of gradient flows associated with the two structures.
Presently, our analysis is restricted to the probability simplex, i.e., the space of discrete probability distributions. For the continuous case, similar or related results are obtained in [23,24] without using affine immersions. Extensions of the results obtained in this paper to continuous probability spaces and the exploration of relations to the literature are left for future work.
The conformal flattening can also be applied to the computationally efficient construction of a Voronoi diagram with respect to the geometric divergences [18]. Exploring the possibilities of other applications would be of interest.