An Elementary Introduction to Information Geometry

In this survey, we describe the fundamental differential-geometric structures of information manifolds, state the fundamental theorem of information geometry, and illustrate some use cases of these information manifolds in information sciences. The exposition is self-contained by concisely introducing the necessary concepts of differential geometry. Proofs are omitted for brevity.


Overview of information geometry
We present a concise and modern view of the basic structures lying at the heart of Information Geometry (IG), and report some applications of those information-geometric manifolds (termed "information manifolds") in statistics (Bayesian hypothesis testing) and machine learning (statistical mixture clustering).
By analogy to Information Theory (IT) pioneered by Claude Shannon [62] (in 1948) which considers primarily the communication of messages over noisy transmission channels, we may define Information Sciences as the fields that study "communication" between (noisy/imperfect) data and families of models (postulated as a priori knowledge). In short, Information Sciences (IS) seek methods to distill information from data to models. Thus, information sciences encompass information theory but also include Probability & Statistics, Machine Learning (ML), Artificial Intelligence (AI), Mathematical Programming, just to name a few areas.
In §5.2, we review some key milestones of information geometry and report some definitions of the field by its pioneers. A modern and broad definition of information geometry can be stated as the field that studies the geometry of decision making. This definition also includes model fitting (inference) that can be interpreted as a decision problem as illustrated in Figure 1: Namely, deciding which model parameter to choose from a family of parametric models. This framework was advocated by Abraham Wald [72,73,17] who considered all statistical problems as statistical decision problems. Distances play a crucial role not only for measuring the goodness-of-fit of data to model (say, likelihood in statistics, classifier loss functions in ML, objective functions in mathematical programming, etc.) but also for measuring the discrepancy (or deviance) between models.
Why adopt a geometric approach? Geometry allows one to study invariance and equivariance of "figures" in a coordinate-free approach. The geometric language (e.g., ball or projection) also provides affordances that help us reason intuitively about problems. Note that although figures can be visualized (i.e., plotted in coordinate charts), they should be thought of as purely abstract objects, namely, geometric figures.
Figure 1: The inference of a parameter $\hat{\theta}$ of a model from data $D$ can also be interpreted as a decision-making problem: decide which parameter of a parametric family of models $M = \{m_\theta\}_{\theta \in \Theta}$ "best" suits the data. Information geometry provides a differential-geometric manifold structure on $M$ that is useful for designing and studying decision rules.

Outline
The paper is organized as follows: In the first part (§2), we start by concisely introducing the necessary background of differential geometry in order to define a manifold $(M, g, \nabla)$ equipped with a metric tensor $g$ and an affine connection $\nabla$. We explain how this framework generalizes the Riemannian manifolds $(M, g)$ by stating the fundamental theorem of Riemannian geometry, which defines a unique torsion-free metric-compatible Levi-Civita connection from the metric tensor.
In the second part (§3), we explain the dualistic structures of information manifolds: We present the conjugate connection manifolds $(M, g, \nabla, \nabla^*)$, the statistical manifolds $(M, g, C)$ where $C$ is a cubic tensor, and show how to derive a family of information manifolds $(M, g, \nabla^{-\alpha}, \nabla^{\alpha})$ for $\alpha \in \mathbb{R}$ from any given pair $(\nabla = \nabla^{-1}, \nabla^* = \nabla^{1})$ of conjugate connections. We explain how to get conjugate connections from any smooth (potentially asymmetric) distance (called a divergence), present the dually flat manifolds obtained when considering Bregman divergences, and define, when dealing with parametric families of probability models, the exponential connection ${}^e\nabla$ and the mixture connection ${}^m\nabla$ that are coupled to the Fisher information metric. We discuss the concept of statistical invariance for the metric tensor and the notion of information monotonicity for statistical divergences. It follows that the Fisher metric is the unique invariant metric (up to a scaling factor), and that the $f$-divergences are the unique separable invariant divergences.
In the third part (§4), we illustrate these information-geometric structures with two simple applications: In the first application, we consider Bayesian hypothesis testing and show how the Chernoff information, which defines the best error exponent, can be geometrically characterized on the dually flat structure of an exponential family manifold. In the second application, we show how to cluster statistical mixtures sharing the same component distributions on the dually flat mixture family manifold.
(Footnote fragment on equivariance: Then we have $\widehat{t(\theta)} = t(\hat{\theta})$, where the MLE is denoted by $\hat{\cdot}$.)
Finally, we conclude in §5 by summarizing the important concepts and structures of information geometry, and by pointing to references and textbooks [12,4] that cover more advanced structures and applications. We also mention recent studies of generic classes of distances/divergences.
At the beginning of each part, we outline its contents. A summary of notations is provided on page 34.
Prerequisite: Basics of differential geometry
In §2.1, we review the basics of Differential Geometry (DG) for defining a manifold $(M, g, \nabla)$ equipped with both a metric tensor $g$ and an affine connection $\nabla$. We explain these two independent metric/connection structures in §2.2 and in §2.3, respectively. From a connection $\nabla$, we show how to derive the notion of covariant derivative in §2.3.1, parallel transport in §2.3.2 and geodesics in §2.3.3. We further explain the intrinsic curvature and torsion of manifolds induced by the connection in §2.3.4, and state the fundamental theorem of Riemannian geometry in §2.4: the existence of a unique torsion-free Levi-Civita metric connection ${}^{LC}\nabla$ that can be calculated from the metric. Thus Riemannian geometry $(M, g)$ is obtained as a special case of the more general manifold structure $(M, g, {}^{LC}\nabla)$: $(M, g) \equiv (M, g, {}^{LC}\nabla)$. Information geometry shall further consider a dual structure $(M, g, \nabla^*)$ associated to $(M, g, \nabla)$, and the pair of dual structures shall form an information manifold $(M, g, \nabla, \nabla^*)$.

Overview of differential geometry
Informally speaking, a smooth $D$-dimensional manifold $M$ is a topological space that locally behaves like the Euclidean space $\mathbb{R}^D$. Geometric objects (e.g., points and vector fields) and entities (e.g., functions and differential operators) live on $M$, and are coordinate-free but can conveniently be expressed in any local coordinate system of an atlas $\mathcal{A} = \{(U_i, x_i)\}_i$ of charts $(U_i, x_i)$ (fully covering the manifold) for calculations. A $C^k$ manifold is obtained when the chart transition maps are $C^k$. The manifold is said to be smooth when it is $C^\infty$. At each point $p \in M$, a tangent plane $T_p$ locally best linearizes the manifold. On any smooth manifold $M$, we can define two independent structures: 1. a metric tensor $g$, and 2. an affine connection $\nabla$.
The metric tensor $g$ induces on each tangent plane $T_p$ an inner product space that allows one to measure vector magnitudes (vector "lengths") and angles/orthogonality between vectors. The affine connection $\nabla$ is a differential operator that allows one to define: 1. the covariant derivative operator, which provides a way to calculate differentials of a vector field $Y$ with respect to another vector field $X$, namely the covariant derivative $\nabla_X Y$, 2. the parallel transport $\prod^{\nabla}_c$, which defines a way to transport vectors on tangent planes along any smooth curve $c$, 3. the notion of $\nabla$-geodesics $\gamma_\nabla$, which are defined as autoparallel curves, thus extending the ordinary notion of Euclidean straightness, 4. the intrinsic curvature and torsion of the manifold.

Metric tensor fields g
The tangent bundle of $M$ is defined as the "union" of all tangent spaces: $TM := \cup_p T_p = \{(p, v) : p \in M, v \in T_p\}$. A tangent vector $v$ plays the role of a directional derivative, with $vf$ informally meaning the derivative of a smooth function $f$ (belonging to the space of smooth functions $\mathcal{F}(M)$) along the direction $v$. A smooth vector field $X$ is defined as a "cross-section" of the tangent bundle: $X \in \mathfrak{X}(M) = \Gamma(TM)$, where $\mathfrak{X}(M)$ or $\Gamma(TM)$ denotes the space of smooth vector fields. A basis $B = \{b_1, \ldots, b_D\}$ of a finite $D$-dimensional vector space is a maximal linearly independent set of vectors. Tangent spaces carry the algebraic structure of vector spaces. Using local coordinates on a chart $(U, x)$, the vector field $X$ can be expressed as $X = \sum_{i=1}^D X^i e_i \stackrel{\Sigma}{=} X^i e_i$ using the Einstein summation convention on dummy indices (indicated by the notation $\stackrel{\Sigma}{=}$), where $(X)_B := (X^i)$ denotes the contravariant vector components (manipulated as "column vectors" in algebra) in the natural basis $B = \{e_1 = \partial_1, \ldots, e_D = \partial_D\}$ with $\partial_i :=: \frac{\partial}{\partial x^i}$. A tangent plane (vector space) equipped with an inner product $\langle \cdot, \cdot \rangle$ yields an inner product space. We define a reciprocal basis $B^* = \{e^{*i} = \partial^i\}_i$ of $B = \{e_i = \partial_i\}_i$ so that vectors can also be expressed using the covariant vector components in the natural reciprocal basis. The primal and reciprocal bases are mutually orthogonal by construction, as illustrated in Figure 2.
For any vector $v$, its contravariant components $v^i$ (superscript notation) and its covariant components $v_i$ (subscript notation) can be retrieved from $v$ using the inner product with the reciprocal and primal basis, respectively: $v^i = \langle v, e^{*i} \rangle$ and $v_i = \langle v, e_i \rangle$. The inner product defines a metric tensor $g$ and a dual metric tensor $g^*$: $g_{ij} := \langle e_i, e_j \rangle$ and $g^{*ij} := \langle e^{*i}, e^{*j} \rangle$.
Figure 2: Primal and reciprocal bases of an inner product space $\langle \cdot, \cdot \rangle$, with $\langle e_i, e^{*j} \rangle = \delta_i^j$. The primal/reciprocal bases are mutually orthogonal: $e^{*1}$ is orthogonal to $e_2$, and $e_1$ is orthogonal to $e^{*2}$.
Technically speaking, the metric tensor $g$ is a 2-covariant tensor field: $g \stackrel{\Sigma}{=} g_{ij}\, dx^i \otimes dx^j$, where $\otimes$ is the dyadic tensor product performed on pairs of covector basis elements $\{dx^i\}_i$ (the covectors corresponding to the reciprocal vector basis). Let $G = [g_{ij}]$ and $G^* = [g^{*ij}]$ denote the $D \times D$ matrices of metric components. It follows by construction of the reciprocal basis that $G^* = G^{-1}$. The reciprocal basis vectors $e^{*i}$ and primal basis vectors $e_i$ can be expressed using the dual metric $g^*$ and metric $g$ on the primal basis vectors $e_j$ and reciprocal basis vectors $e^{*j}$, respectively: $e^{*i} \stackrel{\Sigma}{=} g^{*ij} e_j$ and $e_i \stackrel{\Sigma}{=} g_{ij} e^{*j}$. The metric tensor field $g$ ("metric tensor" or "metric" for short) defines a smooth symmetric positive-definite bilinear form on the tangent bundle, so that for $u, v \in T_p$ we have $g(u, v) \in \mathbb{R}$, with $g(u, u) \geq 0$ and equality only for $u = 0$. We can also write equivalently $g_p(u, v) :=: \langle u, v \rangle_p :=: \langle u, v \rangle_{g(p)} :=: \langle u, v \rangle$. Two vectors $u$ and $v$ are said to be orthogonal, denoted by $u \perp v$, iff $\langle u, v \rangle = 0$. The length of a vector is induced from the norm $\|u\|_p :=: \|u\|_{g(p)} = \sqrt{\langle u, u \rangle_{g(p)}}$. Using local coordinates of a chart $(U, x)$, we get the vector contravariant/covariant components, and compute the metric tensor using matrix algebra (with column vectors by convention) as follows: $g(u, v) = (u)_B^\top\, G_{x(p)}\, (v)_B = (u)_{B^*}^\top\, G^{-1}_{x(p)}\, (v)_{B^*}$, since it follows from the primal/reciprocal basis that $G \times G^* = I$, the identity matrix. Thus on any tangent plane $T_p$, we get a Mahalanobis distance: $M_G(u, v) := \|u - v\|_G = \sqrt{\sum_{i=1}^D \sum_{j=1}^D G_{ij} (u^i - v^i)(u^j - v^j)}$. The inner product of two vectors $u$ and $v$ is a scalar (a 0-tensor) that can be equivalently calculated as: $\langle u, v \rangle := g(u, v) \stackrel{\Sigma}{=} u^i v_i \stackrel{\Sigma}{=} u_i v^i$.
A metric tensor $g$ of a manifold $M$ is said to be conformal when $\langle \cdot, \cdot \rangle_p = \kappa(p) \langle \cdot, \cdot \rangle_{\mathrm{Euclidean}}$, that is, when the inner product at $p$ is the Euclidean dot product scaled by a positive scalar function $\kappa(\cdot)$ of the point. In conformal geometry, we can measure angles between vectors in tangent planes as if we were in a Euclidean space, without any deformation. This is handy for checking orthogonality (in charts). For example, the Poincaré disk model of hyperbolic geometry is conformal but the Klein disk model is not conformal (except at the origin), see [44].
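The index manipulations of this section can be checked numerically on a single tangent plane. The following is a minimal sketch, assuming an arbitrary illustrative $2 \times 2$ metric matrix `G` (not taken from the text): it lowers indices with $G$, raises them with $G^* = G^{-1}$, and compares the three equivalent ways of computing the inner product. All function names are hypothetical helpers.

```python
# Illustrative metric matrix G = [g_ij] at a single point p (values are an assumption).
G = [[2.0, 1.0],
     [1.0, 3.0]]


def inverse_2x2(m):
    """Inverse of a 2x2 matrix; applied to G it gives the dual metric G* = G^{-1}."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def apply_metric(m, components):
    """Matrix-vector product, e.g. lowering an index: u_i = g_ij u^j."""
    n = len(components)
    return [sum(m[i][j] * components[j] for j in range(n)) for i in range(n)]


def inner(m, a, b):
    """Bilinear form a^T m b on component lists a and b."""
    return sum(ai * mbi for ai, mbi in zip(a, apply_metric(m, b)))


G_star = inverse_2x2(G)

u = [1.0, 2.0]    # contravariant components u^i
v = [3.0, -1.0]   # contravariant components v^i
u_cov = apply_metric(G, u)   # covariant components u_i
v_cov = apply_metric(G, v)   # covariant components v_i

ip_primal = inner(G, u, v)                        # u^i g_ij v^j
ip_dual = inner(G_star, u_cov, v_cov)             # u_i g*^ij v_j
ip_mixed = sum(a * b for a, b in zip(u, v_cov))   # u^i v_i
u_back = apply_metric(G_star, u_cov)              # raising the index recovers u^i
```

The three inner products agree up to floating-point rounding, which is the matrix identity $G \times G^* = I$ in action.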

Affine connections ∇
An affine connection ∇ is a differential operator defined on a manifold that allows us to define a covariant derivative of vector fields, a parallel transport of vectors on tangent planes along a smooth curve, and geodesics. Furthermore, an affine connection fully characterizes the curvature and torsion of a manifold.

Covariant derivatives ∇ X Y of vector fields
A connection defines a covariant derivative operator that tells us how to differentiate a vector field $Y$ according to another vector field $X$. The covariant derivative operator is denoted using the traditional gradient symbol $\nabla$. Thus a covariant derivative $\nabla$ is a function $\nabla : \mathfrak{X}(M) \times \mathfrak{X}(M) \to \mathfrak{X}(M)$ that has its own special subscript notation $\nabla_X Y :=: \nabla(X, Y)$ for indicating that it is differentiating a vector field $Y$ according to another vector field $X$. By prescribing $D^3$ smooth functions $\Gamma^k_{ij} = \Gamma^k_{ij}(p)$, called the Christoffel symbols of the second kind, we define the unique affine connection $\nabla$ that satisfies in local coordinates of chart $(U, x)$ the following equations: $\nabla_{\partial_i} \partial_j \stackrel{\Sigma}{=} \Gamma^k_{ij} \partial_k$. The Christoffel symbols can also be written as $\Gamma^k_{ij} := (\nabla_{\partial_i} \partial_j)^k$, where $(\cdot)^k$ denotes the $k$-th coordinate. The $k$-th component $(\nabla_X Y)^k$ of the covariant derivative of vector field $Y$ with respect to vector field $X$ is given by: $(\nabla_X Y)^k \stackrel{\Sigma}{=} X^i (\nabla_i Y)^k \stackrel{\Sigma}{=} X^i \left( \frac{\partial Y^k}{\partial x^i} + \Gamma^k_{ij} Y^j \right)$. The Christoffel symbols are not tensors (fields) because the transformation rules induced by a change of basis do not obey the tensor contravariant/covariant rules.

Parallel transport $\prod^{\nabla}_c$ along a smooth curve $c$
Since the manifold is not embedded in a Euclidean space, we cannot add a vector $v \in T_p$ to a vector $v' \in T_{p'}$ as the tangent vector spaces are unrelated to each other without a connection. Thus a connection $\nabla$ defines how to associate vectors between infinitesimally close tangent planes $T_p$ and $T_{p+dp}$. Then the connection allows us to smoothly transport a vector $v \in T_p$ by sliding it (with infinitesimal moves) along a smooth curve $c(t)$ (with $c(0) = p$ and $c(1) = q$), so that the vector $v_p \in T_p$ "corresponds" to a vector $v_q \in T_q$: This is called the parallel transport. This mathematical prescription is necessary in order to study dynamics on manifolds (e.g., study the motion of a particle on the manifold). We can express the parallel transport along the smooth curve $c$ as: $\forall v \in T_p,\ \forall t \in [0, 1]$, $v_{c(t)} = \prod^{\nabla}_{c(0) \to c(t)} v \in T_{c(t)}$. The parallel transport is schematically illustrated in Figure 3.
Figure 3: Illustration of the parallel transport of vectors on tangent planes along a smooth curve. For a smooth curve $c$, with $c(0) = p$ and $c(1) = q$, a vector $v_p \in T_p$ is parallel transported smoothly to a vector $v_q \in T_q$ such that for any $t \in [0, 1]$, we have $v_{c(t)} \in T_{c(t)}$.

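$\nabla$-geodesics $\gamma_\nabla$: Autoparallel curves
A connection $\nabla$ allows one to define $\nabla$-geodesics as autoparallel curves, that is, curves $\gamma$ such that $\nabla_{\dot\gamma} \dot\gamma = 0$. That is, the velocity vector $\dot\gamma$ is moving along the curve parallel to itself. In other words, $\nabla$-geodesics generalize the notion of "straight Euclidean" lines. In local coordinates $(U, x)$, with $\gamma(t) = (\gamma^k(t))_k$, the autoparallelism amounts to solving the following second-order Ordinary Differential Equations (ODEs): $\ddot\gamma^k(t) + \Gamma^k_{ij}\, \dot\gamma^i(t)\, \dot\gamma^j(t) \stackrel{\Sigma}{=} 0$, with $\gamma^l(t) = x^l \circ \gamma(t)$, where $\Gamma^k_{ij}$ are the Christoffel symbols of the second kind, with $\Gamma^k_{ij} \stackrel{\Sigma}{=} \Gamma_{ij,l}\, g^{lk}$ and $\Gamma_{ij,k} \stackrel{\Sigma}{=} g_{lk}\, \Gamma^l_{ij}$, where $\Gamma_{ij,l}$ are the Christoffel symbols of the first kind. Geodesics are 1D autoparallel submanifolds and $\nabla$-hyperplanes are defined similarly as autoparallel submanifolds of dimension $D - 1$. We may specify in subscript the connection that yields the geodesic $\gamma$: $\gamma_\nabla$.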
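To make the geodesic ODE concrete, here is a minimal sketch that integrates it numerically. It assumes, purely for illustration, the Euclidean plane in polar coordinates $(r, \theta)$ with its Levi-Civita connection, whose nonzero Christoffel symbols are $\Gamma^r_{\theta\theta} = -r$ and $\Gamma^\theta_{r\theta} = \Gamma^\theta_{\theta r} = 1/r$. The integrator (a classical fourth-order Runge-Kutta scheme) and all function names are choices made for this sketch, not part of the text.

```python
import math


def geodesic_rhs(state):
    """Right-hand side of the geodesic ODE in polar coordinates (r, theta).

    state = (r, theta, dr/dt, dtheta/dt). The accelerations are
    -Gamma^k_ij * velocity^i * velocity^j with Gamma^r_{theta theta} = -r
    and Gamma^theta_{r theta} = Gamma^theta_{theta r} = 1/r.
    """
    r, th, dr, dth = state
    ddr = r * dth * dth
    ddth = -2.0 * dr * dth / r
    return (dr, dth, ddr, ddth)


def rk4_step(state, h):
    """One classical Runge-Kutta step of size h."""
    def shifted(s, k, c):
        return tuple(si + c * ki for si, ki in zip(s, k))
    k1 = geodesic_rhs(state)
    k2 = geodesic_rhs(shifted(state, k1, h / 2))
    k3 = geodesic_rhs(shifted(state, k2, h / 2))
    k4 = geodesic_rhs(shifted(state, k3, h))
    return tuple(s + h / 6 * (a + 2 * b + 2 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))


def integrate_geodesic(state, t_end, steps):
    """Integrate the geodesic ODE from t = 0 to t = t_end."""
    h = t_end / steps
    for _ in range(steps):
        state = rk4_step(state, h)
    return state


# Start at Cartesian point (1, 0) with Cartesian velocity (0, 1):
# in polar coordinates r = 1, theta = 0, dr/dt = 0, dtheta/dt = 1.
r_end, th_end, _, _ = integrate_geodesic((1.0, 0.0, 0.0, 1.0), 1.0, 1000)
x_end = r_end * math.cos(th_end)
y_end = r_end * math.sin(th_end)
```

Converted back to Cartesian coordinates, the end point lies on the straight line $x = 1$, as expected for an autoparallel curve of the flat plane, even though the curve is not "straight" in the $(r, \theta)$ chart.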

Curvature and torsion of a manifold
An affine connection $\nabla$ defines a 4D Riemann-Christoffel curvature tensor $R$ (expressed using components $R^i_{jkl}$ of a $(1, 3)$-tensor). It follows from symmetry constraints that the number of independent components of the Riemann tensor is $\frac{D^2 (D^2 - 1)}{12}$ in $D$ dimensions. The coordinate-free equation of the curvature tensor is given by: $R(X, Y)Z := \nabla_X \nabla_Y Z - \nabla_Y \nabla_X Z - \nabla_{[X, Y]} Z$, where $[X, Y](f) = X(Y(f)) - Y(X(f))$ (for all $f \in \mathcal{F}(M)$) is the Lie bracket of vector fields. A manifold $M$ equipped with a connection $\nabla$ is said to be flat (meaning $\nabla$-flat) when $R = 0$. This holds in particular when there exists a coordinate system $x$ of a chart $(U, x)$ in which $\Gamma^k_{ij} = 0$, i.e., in which all connection coefficients vanish. For example, the Christoffel symbols vanish in a rectangular coordinate system of a plane but not in its polar coordinate system. A manifold is torsion-free when the connection is symmetric. A symmetric connection satisfies the following coordinate-free equation: $\nabla_X Y - \nabla_Y X = [X, Y]$. Using local chart coordinates, this amounts to checking that $\Gamma^k_{ij} = \Gamma^k_{ji}$. The torsion tensor is a $(1, 2)$-tensor defined by: $T(X, Y) := \nabla_X Y - \nabla_Y X - [X, Y]$.
In general, the parallel transport is path-dependent. The angle defect of a vector transported on an infinitesimal closed loop (a smooth curve with coinciding extremities) is related to the curvature. However, for a flat connection, the parallel transport does not depend on the path. Figure 4 illustrates the parallel transport along a curve for a curved manifold (the sphere manifold) and a flat manifold (the cylinder manifold). The Gaussian curvature at a point of a surface is the product of the minimal and maximal principal curvatures: $\kappa_G := \kappa_{\min} \kappa_{\max}$. For a cylinder, since $\kappa_{\min} = 0$, it follows that the Gaussian curvature of a cylinder is 0. Gauss's Theorema Egregium (meaning "remarkable theorem") proved that the Gaussian curvature is intrinsic and does not depend on how the surface is embedded into the ambient Euclidean space.

The fundamental theorem of Riemannian geometry: The Levi-Civita metric connection
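By definition, an affine connection $\nabla$ is said to be metric-compatible with $g$ when it satisfies, for any triple $(X, Y, Z)$ of vector fields, the following equation: $X \langle Y, Z \rangle = \langle \nabla_X Y, Z \rangle + \langle Y, \nabla_X Z \rangle$, which can be written equivalently as: $X g(Y, Z) = g(\nabla_X Y, Z) + g(Y, \nabla_X Z)$. Using local coordinates and the natural basis $\{\partial_i\}$ for vector fields, the metric-compatibility property amounts to checking that we have: $\partial_k g_{ij} = \langle \nabla_{\partial_k} \partial_i, \partial_j \rangle + \langle \partial_i, \nabla_{\partial_k} \partial_j \rangle$. A property of a metric-compatible connection is that the parallel transport $\prod^{\nabla}$ of vectors preserves the metric: $\langle u, v \rangle_{c(0)} = \left\langle \prod^{\nabla}_{c(0) \to c(t)} u, \prod^{\nabla}_{c(0) \to c(t)} v \right\rangle_{c(t)}$ for all $t$. That is, the parallel transport preserves angles (and orthogonality) and lengths of vectors in tangent planes when transported along a smooth curve. The fundamental theorem of Riemannian geometry states the existence of a unique torsion-free metric-compatible connection:
Theorem 1 (Levi-Civita metric connection) There exists a unique torsion-free affine connection compatible with the metric, called the Levi-Civita connection ${}^{LC}\nabla$.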
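The Christoffel symbols of the Levi-Civita connection can be expressed from the metric tensor $g$ as follows: ${}^{LC}\Gamma^k_{ij} \stackrel{\Sigma}{=} \frac{1}{2} g^{kl} \left( \partial_i g_{jl} + \partial_j g_{il} - \partial_l g_{ij} \right)$, where $g^{ij}$ denote the matrix elements of the inverse matrix $g^{-1}$.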
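As a hedged illustration of how the Levi-Civita Christoffel symbols are obtained from a metric, the sketch below evaluates $\frac{1}{2} g^{kl} (\partial_i g_{jl} + \partial_j g_{il} - \partial_l g_{ij})$ numerically in dimension 2. It assumes the Euclidean plane in polar coordinates, with metric matrix $\mathrm{diag}(1, r^2)$, as the example, and approximates the partial derivatives by central finite differences. The helper names and the step size are choices made for this sketch.

```python
def polar_metric(x):
    """Metric matrix [g_ij] of the Euclidean plane in polar coordinates x = (r, theta)."""
    r = x[0]
    return [[1.0, 0.0], [0.0, r * r]]


def inverse_2x2(m):
    """Inverse of a 2x2 matrix, used to obtain [g^ij]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def metric_partials(metric, x, h=1e-5):
    """dg[l][i][j] approximates the partial derivative of g_ij along coordinate l."""
    dg = []
    for l in range(2):
        x_plus, x_minus = list(x), list(x)
        x_plus[l] += h
        x_minus[l] -= h
        g_plus, g_minus = metric(x_plus), metric(x_minus)
        dg.append([[(g_plus[i][j] - g_minus[i][j]) / (2 * h) for j in range(2)]
                   for i in range(2)])
    return dg


def levi_civita_christoffel(metric, x):
    """gamma[k][i][j] = 1/2 g^{kl} (d_i g_{jl} + d_j g_{il} - d_l g_{ij})."""
    g_inv = inverse_2x2(metric(x))
    dg = metric_partials(metric, x)
    return [[[0.5 * sum(g_inv[k][l] * (dg[i][j][l] + dg[j][i][l] - dg[l][i][j])
                        for l in range(2))
              for j in range(2)]
             for i in range(2)]
            for k in range(2)]


# Index 0 is r and index 1 is theta; evaluate at r = 2, theta = 0.3.
gamma = levi_civita_christoffel(polar_metric, [2.0, 0.3])
```

At $r = 2$ the computed values approximate the closed forms $\Gamma^r_{\theta\theta} = -r$ and $\Gamma^\theta_{r\theta} = \Gamma^\theta_{\theta r} = 1/r$. The symmetry in the lower indices reflects the absence of torsion.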
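The Levi-Civita connection can also be defined coordinate-free with the Koszul formula: $2 g(\nabla_X Y, Z) = X(g(Y, Z)) + Y(g(X, Z)) - Z(g(X, Y)) + g([X, Y], Z) - g([X, Z], Y) - g([Y, Z], X)$. There exist metric-compatible connections with torsion, which are studied in theoretical physics. See for example the flat Weitzenböck connection [9].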
The metric tensor g induces the torsion-free metric-compatible Levi-Civita connection that determines the local structure of the manifold. However, the metric g does not fix the global topological structure: For example, although a cone and a cylinder have locally the same flat Euclidean metric, they exhibit different global structures.

Preview: Information geometry versus Riemannian geometry
In information geometry, we consider a pair of conjugate affine connections $\nabla$ and $\nabla^*$ (often but not necessarily torsion-free) that are coupled to the metric $g$: The structure is conventionally written as $(M, g, \nabla, \nabla^*)$. The key property is that the pair of conjugate connections is jointly compatible with the metric, and therefore the induced dual parallel transport preserves the metric: $\langle u, v \rangle_{c(0)} = \left\langle \prod^{\nabla}_{c(0) \to c(t)} u, \prod^{\nabla^*}_{c(0) \to c(t)} v \right\rangle_{c(t)}$. Thus the Riemannian manifold $(M, g)$ can be interpreted as the self-dual information-geometric manifold obtained for $\nabla = \nabla^* = {}^{LC}\nabla$, the unique torsion-free Levi-Civita metric connection: $(M, g) \equiv (M, g, {}^{LC}\nabla, {}^{LC}\nabla^* = {}^{LC}\nabla)$. However, let us point out that for a pair of self-dual Levi-Civita conjugate connections, the information-geometric manifold does not induce a distance. This contrasts with the Riemannian modeling $(M, g)$, which provides a Riemannian metric distance $D_\rho(p, q)$ defined by the length of the geodesic $\gamma$ connecting the two points $p = \gamma(0)$ and $q = \gamma(1)$ (shortest path): $D_\rho(p, q) := \int_0^1 \|\dot\gamma(t)\|_{\gamma(t)}\, dt = \int_0^1 \sqrt{g_{\gamma(t)}(\dot\gamma(t), \dot\gamma(t))}\, dt$. Usually, this Riemannian geodesic distance is not available in closed form (and needs to be approximated or bounded) because the geodesics cannot be explicitly parameterized (see geodesic shooting methods [7]).
We are now ready to introduce the key geometric structures of information geometry.

Information manifolds

Overview
In this part, we explain the dualistic structures of manifolds in information geometry. In §3.2, we first present the core Conjugate Connection Manifolds (CCMs) $(M, g, \nabla, \nabla^*)$, and show how to build Statistical Manifolds (SMs) $(M, g, C)$ from a CCM in §3.3. From any statistical manifold, we can build a 1-parameter family $(M, g, \nabla^{-\alpha}, \nabla^{\alpha})$ of CCMs, the information $\alpha$-manifolds. We state the fundamental theorem of information geometry in §3.5. These CCM and SM structures are not related to any distance a priori but require at first a pair $(\nabla, \nabla^*)$ of conjugate connections coupled to a metric tensor $g$. We show two methods to build an initial pair of conjugate connections. A first method consists in building a pair of conjugate connections $({}^D\nabla, {}^D\nabla^*)$ from any divergence $D$ in §3.6. In particular, we obtain self-conjugate connections when the divergence is symmetric: $D(\theta_1 : \theta_2) = D(\theta_2 : \theta_1)$. When the divergences are Bregman divergences (i.e., $D = B_F$ for a strictly convex and differentiable Bregman generator $F$), we obtain Dually Flat Manifolds (DFMs) $(M, \nabla^2 F, {}^F\nabla, {}^F\nabla^*)$ in §3.7. DFMs nicely generalize the Euclidean geometry and exhibit Pythagorean theorems. We further characterize when orthogonal ${}^F\nabla$-projections and dual ${}^F\nabla^*$-projections of a point onto a submanifold are unique.
A second method to get a pair of conjugate connections $({}^e\nabla, {}^m\nabla)$ consists in defining these connections from a regular parametric family of probability distributions $\mathcal{P} = \{p_\theta(x)\}_\theta$. In that case, the 'e'xponential connection ${}^e\nabla$ and the 'm'ixture connection ${}^m\nabla$ are coupled to the Fisher information metric ${}_{\mathcal{P}}g$. A statistical manifold $(\mathcal{P}, {}_{\mathcal{P}}g, {}_{\mathcal{P}}C)$ can be recovered by considering the skewness Amari-Chentsov cubic tensor ${}_{\mathcal{P}}C$, from which follows a 1-parameter family of CCMs, $(\mathcal{P}, {}_{\mathcal{P}}g, {}_{\mathcal{P}}\nabla^{-\alpha}, {}_{\mathcal{P}}\nabla^{+\alpha})$, the statistical expected $\alpha$-manifolds.
In this parametric statistical context, these information manifolds are called expected information manifolds because the various quantities are expressed from statistical expectations $E_\cdot[\cdot]$. Notice that these information manifolds can be used in information sciences in general, beyond the traditional fields of statistics. In statistics, we motivate the choice of the connections, metric tensors and divergences by studying statistical invariance criteria, in §3.9. We explain how to recover the expected $\alpha$-connections from standard $f$-divergences, which are the only separable divergences that satisfy the property of information monotonicity. Finally, in §3.10, we recall the Fisher-Rao expected Riemannian manifolds, which are Riemannian manifolds $(\mathcal{P}, {}_{\mathcal{P}}g)$ equipped with a geodesic metric distance called the Fisher-Rao distance, or Rao distance for short.

Conjugate connection manifolds: $(M, g, \nabla, \nabla^*)$
We begin with a definition:
Definition 1 (Conjugate connections) A connection $\nabla^*$ is said to be conjugate to a connection $\nabla$ with respect to the metric tensor $g$ if and only if, for any triple $(X, Y, Z)$ of smooth vector fields, the following identity (Eq. 31) is satisfied: $X \langle Y, Z \rangle = \langle \nabla_X Y, Z \rangle + \langle Y, \nabla^*_X Z \rangle$.
We can notationally rewrite Eq. 31 as: $X g(Y, Z) = g(\nabla_X Y, Z) + g(Y, \nabla^*_X Z)$, and make explicit that for each point $p \in M$, we have: $X_p\, g_p(Y_p, Z_p) = g_p((\nabla_X Y)_p, Z_p) + g_p(Y_p, (\nabla^*_X Z)_p)$. We check that the right-hand side is a scalar and that the left-hand side is a directional derivative of a real-valued function, which is also a scalar. Conjugation is an involution: $(\nabla^*)^* = \nabla$. A remarkable property is that the dual parallel transport of vectors preserves the metric. That is, for any smooth curve $c(t)$, the inner product is conserved when we transport one of the vectors, $u$, using the primal parallel transport $\prod^{\nabla}_c$ and the other vector, $v$, using the dual parallel transport $\prod^{\nabla^*}_c$.
Property 1 (Dual parallel transport preserves the metric) A pair $(\nabla, \nabla^*)$ of conjugate connections preserves the metric $g$ if and only if: $\forall t \in [0, 1]$, $\left\langle \prod^{\nabla}_{c(0) \to c(t)} u, \prod^{\nabla^*}_{c(0) \to c(t)} v \right\rangle_{c(t)} = \langle u, v \rangle_{c(0)}$.
Property 2 Given a connection $\nabla$ on $(M, g)$ (i.e., a structure $(M, g, \nabla)$), there exists a unique conjugate connection $\nabla^*$ (i.e., a dual structure $(M, g, \nabla^*)$).
We consider a manifold $M$ equipped with a pair of conjugate connections $\nabla$ and $\nabla^*$ that are coupled with the metric tensor $g$ so that the dual parallel transport preserves the metric. We define the mean connection $\bar\nabla$: $\bar\nabla = \frac{\nabla + \nabla^*}{2}$, with corresponding Christoffel coefficients denoted by $\bar\Gamma$. This mean connection coincides with the Levi-Civita metric connection: $\bar\nabla = {}^{LC}\nabla$.
Property 3 The mean connection $\bar\nabla$ is self-conjugate, and coincides with the Levi-Civita metric connection.

Statistical manifolds: (M, g, C)
Lauritzen introduced this cornerstone structure [30] of information geometry in 1987. Beware that although it bears the name "statistical manifold," it is a purely geometric construction that may be used outside of the field of Statistics. However, as we shall mention later, we can always find a statistical model $\mathcal{P}$ corresponding to a statistical manifold [69]. We shall see how we can convert a conjugate connection manifold into such a statistical manifold, and how we can subsequently derive an infinite family of CCMs from a statistical manifold. In other words, once we have a pair of conjugate connections, we will be able to build a family of pairs of conjugate connections. We define a totally symmetric cubic $(0, 3)$-tensor (i.e., 3-covariant tensor) called the Amari-Chentsov tensor: $C_{ijk} := \Gamma^{*}_{ij,k} - \Gamma_{ij,k}$, or in coordinate-free equation: $C(X, Y, Z) := \langle \nabla^*_X Y - \nabla_X Y, Z \rangle$. Using the local basis, this cubic tensor can be expressed as: $C_{ijk} = C(\partial_i, \partial_j, \partial_k) = \langle \nabla^*_{\partial_i} \partial_j - \nabla_{\partial_i} \partial_j, \partial_k \rangle$.
Definition 3 (Statistical manifold [30]) A statistical manifold $(M, g, C)$ is a manifold $M$ equipped with a metric tensor $g$ and a totally symmetric cubic tensor $C$.

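A family $\{(M, g, \nabla^{-\alpha}, \nabla^{\alpha})\}_{\alpha \in \mathbb{R}}$ of conjugate connection manifolds
For any pair $(\nabla, \nabla^*)$ of conjugate connections, we can define a 1-parameter family of connections $\{\nabla^\alpha\}_{\alpha \in \mathbb{R}}$, called the $\alpha$-connections, such that $(\nabla^{-\alpha}, \nabla^{\alpha})$ are dually coupled to the metric, with $\nabla^0 = \bar\nabla = {}^{LC}\nabla$, $\nabla^1 = \nabla$ and $\nabla^{-1} = \nabla^*$. By observing that the scaled cubic tensor $\alpha C$ is also a totally symmetric cubic 3-covariant tensor, we can derive the $\alpha$-connections from a statistical manifold $(M, g, C)$ as: $\Gamma^{\alpha}_{ij,k} = \Gamma^0_{ij,k} - \frac{\alpha}{2} C_{ijk}$ and $\Gamma^{-\alpha}_{ij,k} = \Gamma^0_{ij,k} + \frac{\alpha}{2} C_{ijk}$, where $\Gamma^0_{ij,k}$ are the Levi-Civita Christoffel symbols, and $\Gamma_{ij,k} \stackrel{\Sigma}{=} \Gamma^l_{ij}\, g_{lk}$ (by index juggling). The $\alpha$-connection $\nabla^\alpha$ can also be defined as follows: $g(\nabla^\alpha_X Y, Z) = g({}^{LC}\nabla_X Y, Z) - \frac{\alpha}{2} C(X, Y, Z)$ for all $X, Y, Z \in \mathfrak{X}(M)$.
Theorem 2 (Family of information $\alpha$-manifolds) For any $\alpha \in \mathbb{R}$, $(M, g, \nabla^{-\alpha}, \nabla^{\alpha} = (\nabla^{-\alpha})^*)$ is a conjugate connection manifold.
The $\alpha$-connections $\nabla^\alpha$ can also be constructed directly from a pair $(\nabla, \nabla^*)$ of conjugate connections by taking the following weighted combination: $\nabla^\alpha = \frac{1 + \alpha}{2} \nabla + \frac{1 - \alpha}{2} \nabla^*$.

The fundamental theorem of information geometry: $\nabla$ $\kappa$-curved $\Leftrightarrow$ $\nabla^*$ $\kappa$-curved
We now state the fundamental theorem of information geometry and its corollaries:
Theorem 3 (Dually constant curvature manifolds) If a torsion-free affine connection $\nabla$ has constant curvature $\kappa$ then its conjugate torsion-free connection $\nabla^*$ has necessarily the same constant curvature $\kappa$.
The proof is reported in [12] (Proposition 8.1.4, page 226). We get the following two corollaries, both obtained for $\kappa = 0$:
Corollary (Dually $\alpha$-flat manifolds) A manifold $(M, g, \nabla^{-\alpha}, \nabla^{\alpha})$ is $\nabla^{\alpha}$-flat if and only if it is $\nabla^{-\alpha}$-flat.
Corollary (Dually flat manifolds, $\alpha = \pm 1$) A manifold $(M, g, \nabla, \nabla^*)$ is $\nabla$-flat if and only if it is $\nabla^*$-flat.
Thus once we are given a pair of conjugate connections, we can always build a 1-parameter family of manifolds. Manifolds with constant curvature $\kappa$ are interesting from the computational viewpoint as dual geodesics have simple closed-form expressions.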
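The 1-parameter family built from a pair of conjugate connections can be sketched numerically. The example below is an illustration under stated assumptions, not the author's construction: it works in dimension 1 with a Hessian metric $g = F''(\theta)$ for the assumed potential $F(\theta) = \log(1 + e^\theta)$, takes one connection of the pair to be flat in the $\theta$-coordinate (coefficient 0) and its conjugate to have coefficient $F'''/F''$, and checks that the weighted combination at $\alpha = 0$ matches the Levi-Civita coefficient $\frac{1}{2} g^{-1} \partial_\theta g$ computed independently from the metric. The names `gamma_plus` and `gamma_minus` (the coefficients of $\nabla^{1}$ and $\nabla^{-1}$) are hypothetical.

```python
import math


def alpha_coefficient(gamma_plus, gamma_minus, alpha):
    """Coefficient of the alpha-connection as the weighted combination
    (1 + alpha)/2 * Gamma^{(+1)} + (1 - alpha)/2 * Gamma^{(-1)}."""
    return 0.5 * (1.0 + alpha) * gamma_plus + 0.5 * (1.0 - alpha) * gamma_minus


# Assumed 1D example: potential F(theta) = log(1 + exp(theta)), metric g = F''.
theta = 0.4
p = 1.0 / (1.0 + math.exp(-theta))   # F'(theta)
g = p * (1.0 - p)                    # F''(theta), the metric component g_11
d3F = g * (1.0 - 2.0 * p)            # F'''(theta)

gamma_plus = 0.0          # Gamma^1_11 of the connection that is flat in theta
gamma_minus = d3F / g     # Gamma^1_11 of its conjugate connection
gamma_lc = 0.5 * d3F / g  # Levi-Civita: (1/2) g^{11} * d g_11 / d theta

gamma_zero = alpha_coefficient(gamma_plus, gamma_minus, 0.0)
gamma_a = alpha_coefficient(gamma_plus, gamma_minus, 0.7)
gamma_minus_a = alpha_coefficient(gamma_plus, gamma_minus, -0.7)
```

For every $\alpha$, the coefficients of $\nabla^{\alpha}$ and $\nabla^{-\alpha}$ average to the Levi-Civita coefficient, which is the coefficient-level counterpart of the pair $(\nabla^{-\alpha}, \nabla^{\alpha})$ being conjugate.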

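Conjugate connections from divergences
Loosely speaking, a divergence $D(\cdot : \cdot)$ is a smooth distance [74], potentially asymmetric. In order to define precisely a divergence, let us first introduce the following handy notations: $\partial_{i,\cdot} f(x, y) = \frac{\partial}{\partial x^i} f(x, y)$, $\partial_{\cdot,j} f(x, y) = \frac{\partial}{\partial y^j} f(x, y)$, $\partial_{ij,k} f(x, y) = \frac{\partial^2}{\partial x^i \partial x^j} \frac{\partial}{\partial y^k} f(x, y)$, $\partial_{i,jk} f(x, y) = \frac{\partial}{\partial x^i} \frac{\partial^2}{\partial y^j \partial y^k} f(x, y)$, etc.
Definition (Divergence) A divergence $D : M \times M \to [0, \infty)$ on a manifold $M$ with respect to a local chart $\Theta \subset \mathbb{R}^D$ is a $C^3$-function satisfying the following properties: 1. $D(\theta : \theta') \geq 0$ for all $\theta, \theta' \in \Theta$, with equality if and only if $\theta = \theta'$, 2. $\partial_{i,\cdot} D(\theta : \theta')|_{\theta = \theta'} = \partial_{\cdot,j} D(\theta : \theta')|_{\theta = \theta'} = 0$ for all $i, j \in \{1, \ldots, D\}$, 3. the matrix $[-\partial_{i,j} D(\theta : \theta')|_{\theta = \theta'}]$ is positive-definite.
The dual divergence is defined by swapping the arguments: $D^*(\theta : \theta') := D(\theta' : \theta)$, and is also called the reverse divergence (reference duality in information geometry). Reference duality of divergences is an involution: $(D^*)^* = D$. The Euclidean distance is a metric distance but not a divergence. The squared Euclidean distance is a non-metric symmetric divergence. The metric tensor $g$ yields the Riemannian metric distance $D_\rho$, but it is never a divergence.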
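The following minimal sketch illustrates the asymmetry of a divergence and reference duality. It assumes the Kullback-Leibler divergence between discrete distributions as the example of an asymmetric divergence (an illustrative choice, with arbitrary distributions `p` and `q`), and the squared Euclidean distance as a symmetric divergence that fails the triangle inequality.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence between discrete distributions p and q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def reverse(divergence):
    """Dual (reverse) divergence D*(a : b) := D(b : a)."""
    return lambda a, b: divergence(b, a)


def squared_euclidean(a, b):
    """Squared Euclidean distance: symmetric, but not a metric distance."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

kl_pq = kl(p, q)
kl_qp = reverse(kl)(p, q)            # D*(p : q) = D(q : p)
kl_back = reverse(reverse(kl))(p, q)  # (D*)* = D

# Three collinear points 0, 1, 2: the triangle inequality fails.
d02 = squared_euclidean([0.0], [2.0])
d01_plus_d12 = squared_euclidean([0.0], [1.0]) + squared_euclidean([1.0], [2.0])
```

Here `kl_pq` and `kl_qp` differ, `kl_back` coincides with `kl_pq`, and `d02` exceeds `d01_plus_d12`, so the squared Euclidean distance is a divergence without being a metric.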
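From any given divergence $D$, we can define a conjugate connection manifold following the construction of Eguchi [20] (1983):
Theorem 4 (Manifold from divergence) $(M, {}^D g, {}^D\nabla, {}^{D^*}\nabla)$ is an information manifold with: ${}^D g := -\partial_{i,j} D(\theta : \theta')|_{\theta = \theta'} = {}^{D^*} g$, ${}^D\Gamma_{ij,k} := -\partial_{ij,k} D(\theta : \theta')|_{\theta = \theta'}$, and ${}^{D^*}\Gamma_{ij,k} := -\partial_{k,ij} D(\theta : \theta')|_{\theta = \theta'}$.
The associated statistical manifold is $(M, {}^D g, {}^D C)$ with: ${}^D C_{ijk} = {}^{D^*}\Gamma_{ij,k} - {}^D\Gamma_{ij,k}$. Since $\alpha\, {}^D C$ is a totally symmetric cubic tensor for any $\alpha \in \mathbb{R}$, we can derive a one-parameter family of conjugate connection manifolds: $\{(M, {}^D g, {}^D C^\alpha) \equiv (M, {}^D g, {}^D\nabla^{-\alpha}, ({}^D\nabla^{-\alpha})^* = {}^D\nabla^{\alpha})\}_{\alpha \in \mathbb{R}}$. In the remainder, we use the shortcut $(M, D)$ to denote the divergence-induced information manifold $(M, {}^D g, {}^D\nabla, {}^D\nabla^*)$. Notice that it follows from construction that: ${}^D\nabla^* = {}^{D^*}\nabla$.

Dually flat manifolds (Bregman geometry)
We consider dually flat manifolds that satisfy asymmetric Pythagorean theorems. These flat manifolds can be obtained from a canonical Bregman divergence. Consider a strictly convex smooth function $F(\theta)$ called a potential function, with $\theta \in \Theta$ where $\Theta$ is an open convex domain. Notice that the convexity of the function is preserved under an affine transformation. We associate to the potential function $F$ a corresponding Bregman divergence (parameter divergence): $B_F(\theta : \theta') := F(\theta) - F(\theta') - (\theta - \theta')^\top \nabla F(\theta')$. We also write the Bregman divergence between point $P$ and point $Q$ as $D(P : Q) := B_F(\theta(P) : \theta(Q))$, where $\theta(P)$ denotes the coordinates of a point $P$.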
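As a minimal sketch of the Bregman divergence, the code below implements $B_F(\theta : \theta') = F(\theta) - F(\theta') - (\theta - \theta')^\top \nabla F(\theta')$ for a generator supplied together with its gradient. The two generators used here are illustrative assumptions: the half squared Euclidean norm, which yields half the squared Euclidean distance, and the negative Shannon entropy $\sum_i x_i \log x_i$, which yields the Kullback-Leibler divergence on normalized distributions.

```python
import math


def bregman(F, grad_F, t1, t2):
    """Bregman divergence B_F(t1 : t2) = F(t1) - F(t2) - <t1 - t2, grad F(t2)>."""
    g2 = grad_F(t2)
    return F(t1) - F(t2) - sum((a - b) * g for a, b, g in zip(t1, t2, g2))


def half_sq_norm(t):
    """Generator F(t) = 1/2 ||t||^2."""
    return 0.5 * sum(x * x for x in t)


def grad_half_sq_norm(t):
    """Gradient of the half squared norm: the identity map."""
    return list(t)


def neg_entropy(t):
    """Generator F(t) = sum_i t_i log t_i (negative Shannon entropy), for t_i > 0."""
    return sum(x * math.log(x) for x in t)


def grad_neg_entropy(t):
    """Gradient of the negative entropy: log t_i + 1."""
    return [math.log(x) + 1.0 for x in t]


b_euclid = bregman(half_sq_norm, grad_half_sq_norm, [1.0, 2.0], [4.0, 6.0])

p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
b_kl = bregman(neg_entropy, grad_neg_entropy, p, q)
kl_direct = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Passing the gradient explicitly keeps the sketch dependency-free; an automatic-differentiation library could supply it instead.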
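The induced information-geometric structure is $(M, {}^F g, {}^F C) := (M, {}^{B_F} g, {}^{B_F} C)$ with: ${}^F g := {}^{B_F} g = [\partial_i \partial_j F(\theta)] = \nabla^2 F(\theta)$, ${}^F\Gamma_{ij,k}(\theta) := {}^{B_F}\Gamma_{ij,k}(\theta) = 0$, and ${}^F C_{ijk} := {}^{B_F} C_{ijk} = \partial_i \partial_j \partial_k F(\theta)$. Since all coefficients of the Christoffel symbols vanish (Eq. 54), the information manifold is ${}^F\nabla$-flat. The Levi-Civita connection ${}^{LC}\nabla$ is obtained from the metric tensor ${}^F g$ (and is usually not flat), and we get the conjugate connection $({}^F\nabla)^* = {}^F\nabla^1$ from $(M, {}^F g, {}^F C)$.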
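Let $F^*(\eta) := \sup_{\theta \in \Theta} \{\theta^\top \eta - F(\theta)\}$ denote the Legendre-Fenchel convex conjugate of $F$, with dual coordinates $\eta = \nabla F(\theta)$ and $\theta = \nabla F^*(\eta)$. We have the Crouzeix [15] identity relating the Hessians of the potential functions: $\nabla^2 F(\theta)\, \nabla^2 F^*(\eta) = I$, where $I$ denotes the $D \times D$ identity matrix. This Crouzeix identity reveals that $B = \{\partial_i\}_i$ and $B^* = \{\partial^j\}_j$ are the primal and reciprocal bases, respectively. The Bregman divergence can be reinterpreted using the Young-Fenchel (in)equality as the canonical divergence $A_{F,F^*}$ [8]: $B_F(\theta : \theta') = A_{F,F^*}(\theta : \eta') = F(\theta) + F^*(\eta') - \theta^\top \eta' = A_{F^*,F}(\eta' : \theta)$.
In a dually flat manifold, any pair of points $P$ and $Q$ can be linked either by the $\nabla$-geodesic (which is $\theta$-straight) or by the $\nabla^*$-geodesic (which is $\eta$-straight). In general, there are $2^3 = 8$ types of geodesic triangles in a dually flat manifold.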
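The Crouzeix identity and the canonical-divergence form of the Bregman divergence can be checked numerically in dimension 1. The sketch below assumes the potential $F(\theta) = \log(1 + e^\theta)$, whose convex conjugate is $F^*(\eta) = \eta \log \eta + (1 - \eta) \log(1 - \eta)$ on $(0, 1)$. This pair is an illustrative choice, and the function names are hypothetical.

```python
import math


def F(theta):
    """Potential F(theta) = log(1 + exp(theta))."""
    return math.log1p(math.exp(theta))


def grad_F(theta):
    """Dual coordinate eta = F'(theta)."""
    return 1.0 / (1.0 + math.exp(-theta))


def hess_F(theta):
    """Second derivative F''(theta) = eta (1 - eta)."""
    eta = grad_F(theta)
    return eta * (1.0 - eta)


def F_star(eta):
    """Convex conjugate F*(eta) = eta log eta + (1 - eta) log(1 - eta)."""
    return eta * math.log(eta) + (1.0 - eta) * math.log(1.0 - eta)


def grad_F_star(eta):
    """Primal coordinate theta = F*'(eta)."""
    return math.log(eta / (1.0 - eta))


def hess_F_star(eta):
    """Second derivative F*''(eta) = 1 / (eta (1 - eta))."""
    return 1.0 / (eta * (1.0 - eta))


def bregman_1d(G, dG, a, b):
    """One-dimensional Bregman divergence B_G(a : b)."""
    return G(a) - G(b) - (a - b) * dG(b)


theta1, theta2 = 0.3, -1.2
eta1, eta2 = grad_F(theta1), grad_F(theta2)

crouzeix = hess_F(theta1) * hess_F_star(eta1)               # should equal 1
b_primal = bregman_1d(F, grad_F, theta1, theta2)            # B_F(theta1 : theta2)
b_dual = bregman_1d(F_star, grad_F_star, eta2, eta1)        # B_F*(eta2 : eta1)
a_canonical = F(theta1) + F_star(eta2) - theta1 * eta2      # A_{F,F*}(theta1 : eta2)
```

The three divergence values agree up to rounding, illustrating that the dual Bregman divergence swaps both the generator and the order of the arguments.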
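Theorem 6 (Dual Pythagorean identities) When the $\theta$-straight geodesic from $P$ to $Q$ is orthogonal at $Q$ to the $\eta$-straight geodesic from $Q$ to $R$, that is, when $(\theta(P) - \theta(Q))^\top (\eta(Q) - \eta(R)) = 0$, we have $B_F(\theta(P) : \theta(R)) = B_F(\theta(P) : \theta(Q)) + B_F(\theta(Q) : \theta(R))$. Dually, when $(\eta(P) - \eta(Q))^\top (\theta(Q) - \theta(R)) = 0$, we have $B_{F^*}(\eta(P) : \eta(R)) = B_{F^*}(\eta(P) : \eta(Q)) + B_{F^*}(\eta(Q) : \eta(R))$.
We can define dual Bregman projections and characterize when these projections are unique: A submanifold $S \subset M$ is said to be $\nabla$-flat ($\nabla^*$-flat) iff it corresponds to an affine subspace in the $\theta$-coordinate system (in the $\eta$-coordinate system, respectively).
When $S$ is a $\nabla$-flat submanifold and $S'$ is a $\nabla^*$-flat submanifold, the divergence $D(S : S')$ between submanifold $S$ and submanifold $S'$ can be calculated using the method of alternating projections [4].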
Let us remark that Kurose [29] reported a Pythagorean theorem for dually constant curvature manifolds that generalizes the Pythagorean theorems of dually flat spaces.
The dually flat geometry can be investigated under the wider scope of Hessian manifolds [63], which consider potential functions defined only locally.
We now consider information manifolds induced by parametric statistical models.

Expected $\alpha$-manifolds of a family of parametric probability distributions: $(\mathcal{P}, {}_{\mathcal{P}}g, {}_{\mathcal{P}}\nabla^{-\alpha}, {}_{\mathcal{P}}\nabla^{\alpha})$
Informally speaking, an expected manifold is an information manifold built on a regular parametric family of distributions. It is sometimes called "expected" manifold or "expected" geometry in the literature [76] because the components of the metric tensor $g$ and the Amari-Chentsov cubic tensor $C$ are expressed using statistical expectations $E_\cdot[\cdot]$. Let $\mathcal{P}$ be a parametric family of probability distributions: $\mathcal{P} := \{p_\theta(x)\}_{\theta \in \Theta}$, with $\theta$ belonging to the open parameter space $\Theta$. The order of the family is the dimension of its parameter space. We define the likelihood function $L(\theta; x) := p_\theta(x)$ as a function of $\theta$ (the likelihood function is an equivalence class of functions defined modulo a positive scaling factor), and its corresponding log-likelihood function: $l(\theta; x) := \log L(\theta; x) = \log p_\theta(x)$. The score vector $s_\theta = \nabla_\theta l = (\partial_i l)_i$ indicates the sensitivity of the likelihood, with $\partial_i l :=: \frac{\partial}{\partial \theta_i} l(\theta; x)$. The Fisher information matrix (FIM) of size $D \times D$ for $\dim(\Theta) = D$ is defined by: ${}_{\mathcal{P}}I(\theta) := E_\theta[\partial_i l\, \partial_j l]_{ij} \succeq 0$, where $\succeq$ denotes the Löwner order. That is, for two symmetric positive-definite matrices $A$ and $B$, $A \succeq B$ if and only if the matrix $A - B$ is positive semidefinite. For regular models [12], the FIM is positive definite: ${}_{\mathcal{P}}I(\theta) \succ 0$, where $A \succ B$ if and only if the matrix $A - B$ is positive-definite.
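In statistics, the FIM plays a role in the attainable precision of unbiased estimators. For any unbiased estimator $\hat\theta_n$ built from $n$ independent and identically distributed observations, the Cramér-Rao lower bound [33] on the variance of the estimator is:
$$V_\theta[\hat\theta_n(X)] \succeq \frac{1}{n}\, {}_{\mathcal{P}}I^{-1}(\theta).$$
The FIM is invariant under reparameterization of the sample space X, and covariant under reparameterization of the parameter space Θ, see [12].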
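As a small illustration of the Cramér-Rao lower bound, the sketch below assumes a Bernoulli model parameterized by its success probability $p$ and $n$ independent observations (both are our assumptions for the example). The sample mean is unbiased, and its exact variance, computed from the binomial distribution, attains the bound $1/(n I(p))$. Function names are hypothetical.

```python
import math

def fisher_information(p):
    """FIM of Bernoulli(p) in the p-parameterization: E[(d/dp log p(x; p))^2]."""
    score_if_one = 1.0 / p             # d/dp log p
    score_if_zero = -1.0 / (1.0 - p)   # d/dp log(1 - p)
    return p * score_if_one ** 2 + (1.0 - p) * score_if_zero ** 2

def variance_of_sample_mean(p, n):
    """Exact variance of the sample mean of n Bernoulli(p) draws (binomial sum)."""
    first = 0.0
    second = 0.0
    for k in range(n + 1):
        weight = math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)
        estimate = k / n
        first += weight * estimate
        second += weight * estimate ** 2
    return second - first ** 2

p, n = 0.3, 25
cramer_rao_bound = 1.0 / (n * fisher_information(p))
variance = variance_of_sample_mean(p, n)
print(cramer_rao_bound, variance)
```

The two printed values agree up to rounding, which reflects that the sample mean is an efficient estimator for this particular model; for general models the variance is only bounded from below.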
We report the expression of the FIM for two important generic parametric families of probability distributions: (1) an exponential family, and (2) a mixture family.
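Example 1 (FIM of an exponential family E) An exponential family [41] $\mathcal{E}$ is defined for a sufficient statistic vector $t(x) = (t_1(x), \ldots, t_D(x))$ and an auxiliary carrier measure $k(x)$ by the following canonical density:
$$p(x; \theta) := \exp\left(\sum_{i=1}^{D} t_i(x)\theta_i - F(\theta) + k(x)\right),$$
where $F$ is the strictly convex cumulant function. Exponential families include the Gaussian family, the Gamma and Beta families, the probability simplex ∆, etc. The FIM of an exponential family is given by:
$${}_{\mathcal{E}}I(\theta) = \mathrm{Cov}_\theta[t(X)] = \nabla^2 F(\theta) \succ 0.$$
Example 2 (FIM of a mixture family M) A mixture family is defined for $D + 1$ functions $F_1, \ldots, F_D$ and $C$ as:
$$p(x; \theta) := \sum_{i=1}^{D} \theta_i F_i(x) + C(x),$$
where the functions $\{F_i(x)\}_i$ are linearly independent on the common support X and satisfy $\int F_i(x)\,d\mu(x) = 0$. The function $C$ is such that $\int C(x)\,d\mu(x) = 1$. Mixture families include statistical mixtures with prescribed component distributions and the probability simplex ∆. The FIM of a mixture family is given by:
$${}_{\mathcal{M}}I(\theta) = E_\theta\left[\frac{F_i(x)F_j(x)}{p_\theta(x)^2}\right]_{ij} = \left[\int \frac{F_i(x)F_j(x)}{p_\theta(x)}\,d\mu(x)\right]_{ij} \succ 0.$$
Notice that the probability simplex of discrete distributions can be modeled both as an exponential family and as a mixture family [4].
The expected α-geometry is built from the expected dual ±α-connections. The Fisher "information metric" tensor is built from the FIM as follows:
$${}_{\mathcal{P}}g(u, v) := (u)_\theta^\top\, {}_{\mathcal{P}}I(\theta)\, (v)_\theta.$$
The expected exponential connection and the expected mixture connection are given by
$${}_{\mathcal{P}}\nabla^{e}: \quad \Gamma^{e}_{ij,k} := E_\theta\left[(\partial_i \partial_j l)(\partial_k l)\right], \qquad {}_{\mathcal{P}}\nabla^{m}: \quad \Gamma^{m}_{ij,k} := E_\theta\left[(\partial_i \partial_j l + \partial_i l\, \partial_j l)(\partial_k l)\right].$$
The dualistic structure is denoted by $(\mathcal{P}, {}_{\mathcal{P}}g, {}_{\mathcal{P}}\nabla^{m}, {}_{\mathcal{P}}\nabla^{e})$, with the Amari-Chentsov cubic tensor, called the skewness tensor:
$$C_{ijk} := E_\theta\left[\partial_i l\, \partial_j l\, \partial_k l\right].$$
It follows that we can build a one-parameter family of expected information α-manifolds:
$$\left\{(\mathcal{P}, {}_{\mathcal{P}}g, {}_{\mathcal{P}}\nabla^{-\alpha}, {}_{\mathcal{P}}\nabla^{+\alpha})\right\}_{\alpha \in \mathbb{R}},$$
with
$${}_{\mathcal{P}}\nabla^{\alpha} := \frac{1+\alpha}{2}\, {}_{\mathcal{P}}\nabla^{e} + \frac{1-\alpha}{2}\, {}_{\mathcal{P}}\nabla^{m}, \qquad \Gamma^{\alpha}_{ij,k} = E_\theta\left[\left(\partial_i \partial_j l + \frac{1-\alpha}{2}\, \partial_i l\, \partial_j l\right)(\partial_k l)\right].$$
The Levi-Civita metric connection is recovered as follows:
$${}^{LC}\nabla = \frac{{}_{\mathcal{P}}\nabla^{-\alpha} + {}_{\mathcal{P}}\nabla^{\alpha}}{2} = {}_{\mathcal{P}}\nabla^{0}.$$
In the case of an exponential family E or a mixture family M equipped with the dual exponential/mixture connections, we get dually flat manifolds (Bregman geometry).
Indeed, for the exponential/mixture families, it is easy to check that the Christoffel symbols of $\nabla^e$ and $\nabla^m$ vanish in the corresponding affine coordinate systems: in the natural θ-coordinates, $\Gamma^{e} = 0$ for an exponential family and $\Gamma^{m} = 0$ for a mixture family.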
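The identity between the FIM of an exponential family and the Hessian of its cumulant function can be checked numerically. The following sketch assumes the Poisson family, with natural parameter $\theta = \log\lambda$, sufficient statistic $t(x) = x$ and cumulant $F(\theta) = e^\theta$ (our choice of example); the truncation point `x_max` and the finite-difference step are illustrative assumptions.

```python
import math

def poisson_pmf(x, lam):
    """Poisson probability mass, computed in log-space for numerical stability."""
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))

def fim_from_score(theta, x_max=200):
    """E[(d/dtheta log p(x; theta))^2], with score x - F'(theta) = x - lambda."""
    lam = math.exp(theta)
    return sum(poisson_pmf(x, lam) * (x - lam) ** 2 for x in range(x_max + 1))

def fim_from_cumulant(theta, h=1e-4):
    """Second derivative of F(theta) = exp(theta) by central finite differences."""
    F = math.exp
    return (F(theta + h) - 2.0 * F(theta) + F(theta - h)) / (h * h)

theta = math.log(3.5)
fim_score = fim_from_score(theta)
fim_hessian = fim_from_cumulant(theta)
print(fim_score, fim_hessian)
```

Both quantities are close to $\lambda = 3.5$, the variance of the sufficient statistic, as expected from $I(\theta) = \mathrm{Cov}_\theta[t(X)] = \nabla^2 F(\theta)$.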

Criteria for statistical invariance
So far we have explained how to build an information manifold (or information α-manifold) from a pair of conjugate connections. Then we reported two ways to obtain such a pair of conjugate connections: (1) from a parametric divergence, or (2) by using the predefined expected exponential/mixture connections. We now ask the following question: Which information manifold makes sense in statistics? We can refine the question as follows:
• Which metric tensors g make sense in statistics?
• Which affine connections ∇ make sense in statistics?
• Which statistical divergences make sense in statistics (from which we can get the metric tensor and dual connections)?
By definition, an invariant metric tensor g shall preserve the inner product under important statistical mappings called Markov embeddings. Informally, we embed $\Delta_D$ into $\Delta_{D'}$ with $D' > D$, and the induced metric should be preserved (see [4], page 62). Here, the parameter θ represents a discrete distribution.
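A D-dimensional parameter (discrete) divergence satisfies the information monotonicity if and only if:
$$D(\theta_{\bar{\mathcal{A}}} : \theta'_{\bar{\mathcal{A}}}) \le D(\theta : \theta')$$
for any coarse-grained partition $\mathcal{A}$, where $\theta_{\bar{\mathcal{A}}}$ denotes the coarse-grained distribution obtained by summing the probabilities of θ over each cell of the partition. This concept of coarse-graining is illustrated in Figure 6.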
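A separable divergence $D(\theta_1 : \theta_2)$ is a divergence that can be expressed as the sum of elementary scalar divergences $d(x : y)$:
$$D(\theta_1 : \theta_2) := \sum_i d(\theta_1^i : \theta_2^i).$$
For example, the squared Euclidean distance $D(\theta_1 : \theta_2) = \sum_i (\theta_1^i - \theta_2^i)^2$ is a separable divergence for the scalar divergence $d(x : y) = (x - y)^2$, whereas the Euclidean distance $\sqrt{\sum_i (\theta_1^i - \theta_2^i)^2}$ is not separable because of the square root operation. The only invariant and decomposable divergences when $D > 1$ are f-divergences [27] defined for a convex functional generator $f$:
$$I_f(\theta : \theta') := \sum_i \theta_i\, f\!\left(\frac{\theta'_i}{\theta_i}\right) \ge f(1), \qquad f(1) = 0.$$
The standard f-divergences are defined for f-generators satisfying $f'(1) = 0$ (choose $f_\lambda(u) := f(u) + \lambda(u - 1)$ since $I_{f_\lambda} = I_f$), and $f''(1) = 1$ (scale fixed).
Statistical f-divergences are invariant [58] under one-to-one/sufficient statistic transformations $y = t(x)$ of the sample space: with $p(x; \theta) = q(y(x); \theta)$,
$$I_f[p(x;\theta) : p(x;\theta')] = \int_{\mathcal{X}} p(x;\theta)\, f\!\left(\frac{p(x;\theta')}{p(x;\theta)}\right) d\mu(x) = \int_{\mathcal{Y}} q(y;\theta)\, f\!\left(\frac{q(y;\theta')}{q(y;\theta)}\right) d\mu(y) = I_f[q(y;\theta) : q(y;\theta')].$$
The dual f-divergence for the reference duality is
$$I_f^*[p : q] := I_f[q : p] = I_{f^\diamond}[p : q]$$
for the standard conjugate f-generator (diamond f-generator) with:
$$f^\diamond(u) := u\, f\!\left(\frac{1}{u}\right).$$
One can check that $f^\diamond$ is a standard f-generator when $f$ is standard. Let us report some common examples of f-divergences:
• The family of α-divergences, which includes the Kullback-Leibler divergence when α → 1:
$$\mathrm{KL}[p : q] = \int p(x) \log\frac{p(x)}{q(x)}\, d\mu(x)$$
for $f(u) = -\log u$.
• The total variation distance:
$$\mathrm{TV}[p : q] = \frac{1}{2}\int |p(x) - q(x)|\, d\mu(x)$$
for $f(u) = \frac{1}{2}|u - 1|$. The total variation distance is the only metric f-divergence.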
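The properties of f-divergences discussed above can be illustrated on discrete distributions. This is a minimal sketch under the convention $I_f[p : q] = \sum_i p_i f(q_i/p_i)$; the distributions, the partition, and the function names are illustrative assumptions. It checks that $f(u) = -\log u$ yields the Kullback-Leibler divergence, that the diamond generator swaps the arguments, and that coarse-graining does not increase the divergence (information monotonicity).

```python
import math

def f_divergence(p, q, f):
    """I_f[p : q] = sum_i p_i f(q_i / p_i) for strictly positive discrete distributions."""
    return sum(pi * f(qi / pi) for pi, qi in zip(p, q))

def conjugate(f):
    """Diamond generator u -> u f(1/u), which exchanges the two arguments."""
    return lambda u: u * f(1.0 / u)

def coarse_grain(p, partition):
    """Lump the atoms of p according to a partition given as lists of indices."""
    return [sum(p[i] for i in block) for block in partition]

def kl_generator(u):
    return -math.log(u)

def tv_generator(u):
    return 0.5 * abs(u - 1.0)

p = [0.1, 0.2, 0.3, 0.4]
q = [0.25, 0.25, 0.25, 0.25]
partition = [[0, 1], [2, 3]]

kl_pq = f_divergence(p, q, kl_generator)
kl_direct = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
kl_reverse = f_divergence(p, q, conjugate(kl_generator))  # equals KL[q : p]
kl_coarse = f_divergence(coarse_grain(p, partition),
                         coarse_grain(q, partition), kl_generator)
tv_pq = f_divergence(p, q, tv_generator)
print(kl_pq, kl_reverse, kl_coarse, tv_pq)
```

Any other partition of the four atoms could be substituted for `partition`; the coarse-grained divergence would remain bounded by the fine-grained one.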
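A remarkable property is that invariant standard f-divergences yield the Fisher information matrix and the α-connections. Indeed, an invariant standard f-divergence is related infinitesimally to the Fisher metric as follows (to second order in dθ, with summation over repeated indices):
$$I_f[p(x;\theta) : p(x;\theta + d\theta)] = \int p(x;\theta)\, f\!\left(\frac{p(x;\theta + d\theta)}{p(x;\theta)}\right) d\mu(x) \approx \frac{1}{2}\, g_{ij}(\theta)\, d\theta^i d\theta^j.$$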
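The infinitesimal relationship between standard f-divergences and the Fisher metric can be observed numerically. The sketch below assumes a Bernoulli family parameterized by its success probability $t$, for which $I(t) = 1/(t(1-t))$, and two standard generators chosen for illustration: $f(u) = -\log u + u - 1$ (Kullback-Leibler) and $f(u) = 2(\sqrt{u} - 1)^2$ (a scaled squared Hellinger divergence). Both satisfy $f(1) = 0$, $f'(1) = 0$ and $f''(1) = 1$.

```python
import math

def f_divergence(p, q, f):
    """I_f[p : q] = sum_i p_i f(q_i / p_i) for strictly positive discrete distributions."""
    return sum(pi * f(qi / pi) for pi, qi in zip(p, q))

def kl_standard(u):
    """Standard generator of the Kullback-Leibler divergence."""
    return -math.log(u) + u - 1.0

def hellinger_standard(u):
    """Standard generator of a scaled squared Hellinger divergence."""
    return 2.0 * (math.sqrt(u) - 1.0) ** 2

def bernoulli(t):
    return [1.0 - t, t]

t, dt = 0.3, 1e-3
fisher = 1.0 / (t * (1.0 - t))
quadratic_form = 0.5 * fisher * dt * dt

ratio_kl = f_divergence(bernoulli(t), bernoulli(t + dt), kl_standard) / quadratic_form
ratio_hellinger = f_divergence(bernoulli(t), bernoulli(t + dt), hellinger_standard) / quadratic_form
print(ratio_kl, ratio_hellinger)
```

Both ratios approach 1 as `dt` shrinks, since the two divergences share the same second-order expansion; they differ only in higher-order terms.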
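A statistical parameter divergence D on a parametric family of distributions P yields an equivalent parameter divergence ${}_{\mathcal{P}}D$:
$${}_{\mathcal{P}}D(\theta : \theta') := D[p(x;\theta) : p(x;\theta')]. \tag{94}$$
Thus we can build the information manifold induced by this parameter divergence ${}_{\mathcal{P}}D(\cdot : \cdot)$. For ${}_{\mathcal{P}}D(\cdot : \cdot) = I_f[\cdot : \cdot]$, the induced ±1-divergence connections are expected ±α-connections, where the value of α depends on the generator $f$.

Fisher-Rao expected Riemannian manifolds: $(\mathcal{P}, {}_{\mathcal{P}}g)$
Historically, a first manifold modeling of a regular parametric family of distributions $\mathcal{P} = \{p_\theta(x)\}_\theta$ was to consider the Fisher Information Matrix (FIM) as the Riemannian metric tensor g (see [25,60]), with:
$${}_{\mathcal{P}}I(\theta) := E_\theta\left[\partial_i l\, \partial_j l\right]_{ij},$$
where $\partial_i l := \frac{\partial}{\partial \theta_i} \log p(x; \theta)$. Under some regularity conditions, we can rewrite the FIM:
$${}_{\mathcal{P}}I(\theta) = -E_\theta\left[\partial_i \partial_j l\right]_{ij}.$$
The Riemannian geodesic metric distance $D_\rho$ is commonly called the Fisher-Rao distance:
$$D_\rho(P_{\theta_1}, P_{\theta_2}) = \int_0^1 \sqrt{\dot\gamma(t)^\top\, g_{\gamma(t)}\, \dot\gamma(t)}\; dt,$$
where γ denotes the geodesic passing through $\gamma(0) = \theta_1$ and $\gamma(1) = \theta_2$.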
Definition 5 (Fisher-Rao distance) The Fisher-Rao distance is the geodesic metric distance of the Fisher-Riemannian manifold (P, P g).

Let us give some examples of Fisher-Riemannian manifolds:
• The Fisher-Riemannian manifold of the family of categorical distributions (also called finite discrete distributions in [4]) amounts to spherical geometry [28] (spherical manifold).
• The Fisher-Riemannian manifold of the family of bivariate location-scale families amounts to hyperbolic geometry (hyperbolic manifold).
• The Fisher-Riemannian manifold of the family of location families amounts to Euclidean geometry (Euclidean manifold).
The first fundamental form of the Riemannian geometry is $ds^2 = \langle dx, dx \rangle = g_{ij}\, dx^i dx^j$ (with summation over repeated indices), where $ds$ denotes the line element.
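To make the spherical example concrete, the sketch below compares two ways of computing the Fisher-Rao distance between Bernoulli distributions: the well-known closed form $2\arccos(\sum_i \sqrt{p_i q_i})$ for categorical distributions, and a direct numerical integration of the line element $ds = \sqrt{I(t)}\,dt$ with $I(t) = 1/(t(1-t))$. The midpoint quadrature and the function names are illustrative choices.

```python
import math

def fisher_rao_categorical(p, q):
    """Closed form on the probability simplex: 2 * arccos(sum_i sqrt(p_i q_i))."""
    affinity = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return 2.0 * math.acos(min(1.0, affinity))

def fisher_rao_bernoulli_numeric(a, b, steps=20000):
    """Integrate ds = dt / sqrt(t (1 - t)) between a and b with the midpoint rule.

    For a one-parameter family the geodesic is the parameter segment itself.
    """
    h = (b - a) / steps
    total = 0.0
    for i in range(steps):
        t = a + (i + 0.5) * h
        total += h / math.sqrt(t * (1.0 - t))
    return abs(total)

a, b = 0.2, 0.7
closed_form = fisher_rao_categorical([1.0 - a, a], [1.0 - b, b])
numeric = fisher_rao_bernoulli_numeric(a, b)
print(closed_form, numeric)
```

The agreement reflects the fact that the map $p \mapsto 2\sqrt{p}$ sends the simplex onto a portion of a sphere of radius 2, on which the Fisher-Rao distance is the great-circle arc length.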
This Riemannian geometric structure applied to a family of parametric probability distributions was first proposed by Harold Hotelling [25] (in a handwritten note of 1929, reprinted typeset in [65]) and independently later by C. R. Rao [60] (1945, reprinted in [59]). In a similar vein, Jeffreys [26] proposed in 1946 to use the volume element of a manifold as an invariant prior: the eponymous Jeffreys prior.
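Notice that for a parametric family of probability distributions P, the Riemannian structure $(\mathcal{P}, {}_{\mathcal{P}}g)$ coincides with the self-dual conjugate connection manifold $(\mathcal{P}, {}_{\mathcal{P}}g, {}^{LC}\nabla, {}^{LC}\nabla)$, whose two connections both coincide with the Levi-Civita connection, induced by a symmetric f-divergence like the squared Hellinger divergence.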

The monotone α-embeddings
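Another common mathematically equivalent expression of the FIM [12] is given by:
$$I_{ij}(\theta) = 4 \int \partial_i \sqrt{p(x;\theta)}\; \partial_j \sqrt{p(x;\theta)}\; d\mu(x).$$
This form of the FIM is well-suited to prove that the FIM is always a positive semi-definite matrix [12] ($I(\theta) \succeq 0$). It turns out that we can define a family of equivalent representations of the FIM using the α-embedding [75] of the parametric family. First, we define the α-representation of densities $l_\alpha(x;\theta) := k_\alpha(p(x;\theta))$ with:
$$k_\alpha(u) := \begin{cases} \frac{2}{1-\alpha}\, u^{\frac{1-\alpha}{2}}, & \alpha \neq 1, \\ \log u, & \alpha = 1. \end{cases}$$
The function $l_\alpha(x;\theta)$ is called the α-likelihood function. Then the α-representation of the FIM, the α-FIM for short, is expressed as:
$$I^\alpha_{ij}(\theta) := \int \partial_i l_\alpha(x;\theta)\; \partial_j l_{-\alpha}(x;\theta)\; d\mu(x).$$
We can rewrite the α-FIM compactly as $I^\alpha_{ij}(\theta) = \int \partial_i l_\alpha\, \partial_j l_{-\alpha}\, d\mu(x)$. Expanding the α-FIM, we get:
$$I^\alpha_{ij}(\theta) = \begin{cases} \frac{4}{1-\alpha^2} \int \partial_i p(x;\theta)^{\frac{1-\alpha}{2}}\; \partial_j p(x;\theta)^{\frac{1+\alpha}{2}}\; d\mu(x), & \alpha \neq \pm 1, \\ \int \partial_i \log p(x;\theta)\; \partial_j p(x;\theta)\; d\mu(x), & \alpha \in \{-1, 1\}. \end{cases}$$
The 1-representation of the FIM is called the logarithmic representation and its 0-representation is called the square root representation. The α-score vectors $B_\alpha := \{\partial_i l_\alpha\}_i$ are interpreted as the tangent basis vectors of the α-base $B_\alpha$. Since $\partial_i l_\alpha\, \partial_j l_{-\alpha} = \frac{\partial_i p\, \partial_j p}{p}$ for every α, the FIM is α-independent.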
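The α-independence of the FIM can be checked numerically. This sketch assumes a Bernoulli family parameterized by its success probability $p$ (so that the integral over the sample space is a sum over $x \in \{0, 1\}$), and it differentiates the α-likelihood by central finite differences; the step size and the function names are illustrative assumptions.

```python
import math

def k_alpha(u, alpha):
    """alpha-representation: log u if alpha = 1, else 2/(1-alpha) * u^((1-alpha)/2)."""
    if alpha == 1.0:
        return math.log(u)
    return 2.0 / (1.0 - alpha) * u ** ((1.0 - alpha) / 2.0)

def density(x, p):
    """Bernoulli probability mass function."""
    return p if x == 1 else 1.0 - p

def d_alpha_likelihood(x, p, alpha, h=1e-6):
    """Central finite-difference derivative of l_alpha(x; p) with respect to p."""
    upper = k_alpha(density(x, p + h), alpha)
    lower = k_alpha(density(x, p - h), alpha)
    return (upper - lower) / (2.0 * h)

def alpha_fim(p, alpha):
    """alpha-FIM: sum over x of d l_alpha * d l_{-alpha}."""
    return sum(d_alpha_likelihood(x, p, alpha) * d_alpha_likelihood(x, p, -alpha)
               for x in (0, 1))

p = 0.3
reference = 1.0 / (p * (1.0 - p))  # usual FIM of the Bernoulli family
values = {alpha: alpha_fim(p, alpha) for alpha in (-1.0, 0.0, 0.5, 1.0)}
print(reference, values)
```

All four values agree with the usual FIM up to the finite-difference error, for the logarithmic representation (α = 1), the square root representation (α = 0) and the intermediate cases alike.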
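Furthermore, the α-representation of the FIM can be rewritten under mild conditions [12] as:
$$I^\alpha_{ij}(\theta) = -\frac{2}{1+\alpha} \int p(x;\theta)^{\frac{1+\alpha}{2}}\; \partial_i \partial_j l_\alpha(x;\theta)\; d\mu(x).$$
Since we have:
$$\partial_i \partial_j l_\alpha(x;\theta) = p(x;\theta)^{\frac{1-\alpha}{2}} \left(\partial_i \partial_j l + \frac{1-\alpha}{2}\, \partial_i l\, \partial_j l\right),$$
it follows that:
$$I^\alpha_{ij}(\theta) = -\frac{2}{1+\alpha}\left(-I_{ij}(\theta) + \frac{1-\alpha}{2}\, I_{ij}(\theta)\right) = I_{ij}(\theta).$$
Notice that when α = 1, we recover the equivalent expression of the FIM (under mild conditions):
$$I^1_{ij}(\theta) = -E_\theta\left[\partial_i \partial_j \log p(x;\theta)\right].$$
In particular, when the family is an exponential family [41] with cumulant function $F(\theta)$ (satisfying the mild conditions), we have:
$$I(\theta) = \nabla^2 F(\theta) \succ 0.$$
The α-embeddings can be generalized by considering a pair of strictly increasing real-valued functions ρ and τ (the conjugate embeddings) to yield the (ρ, τ)-geometry [75,51]. Zhang [75] further discussed the representation/reference biduality which was confounded in the α-geometry. Figure 7 displays the main types of information manifolds encountered in information geometry with their relationships.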

Some illustrating applications of dually flat manifolds
Information geometry [4] has found broad applications in information sciences. For example, we can mention:
• Statistics: asymptotic inference, Expectation-Maximization (EM and the novel information-geometric em), time series models (AutoRegressive Moving Average model, ARMA),
• Machine learning: Restricted Boltzmann machines (RBMs), neuromanifolds and natural gradient [66],
• Signal processing: Principal Component Analysis (PCA), Independent Component Analysis (ICA), Non-negative Matrix Factorization (NMF),
• Mathematical programming: barrier functions of interior point methods,
• Game theory: score functions.
In this part, we describe how to use the dually flat structures for handling an exponential family E (in a hypothesis testing problem detailed in §4.1) and a mixture family M (clustering statistical mixtures, §4.2). Note that for a general divergence, neither (E, D) nor (M, D) is dually flat. However, when D = KL, the Kullback-Leibler divergence, we get dually flat spaces that are computationally attractive since the primal/dual geodesics are straight lines in the corresponding global affine coordinate system.

Figure 7: Overview of the main types of information manifolds with their relationships in information geometry.
Figure 8: Statistical Bayesian hypothesis testing: The best Maximum A Posteriori (MAP) rule chooses to classify an observation from the class that yields the maximum likelihood.

Hypothesis testing in the dually flat exponential family manifold (E, KL * )
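Given two probability distributions $P_0 \sim p_0(x)$ and $P_1 \sim p_1(x)$, we ask to classify a set of i.i.d. observations $X_{1:n} = \{x_1, \ldots, x_n\}$ as sampled either from $P_0$ or from $P_1$. This is a statistical decision problem [35]. For example, $P_0$ can represent the signal distribution and $P_1$ the noise distribution. Figure 8 displays the probability distributions and the unavoidable error that is made by any statistical decision rule (on observations $x_1$ and $x_2$). Assume that both distributions $P_0 \sim P_{\theta_0}$ and $P_1 \sim P_{\theta_1}$ belong to the same exponential family $\mathcal{E} = \{P_\theta : \theta \in \Theta\}$, and consider the exponential family manifold with the dually flat structure $(\mathcal{E}, {}_{\mathcal{E}}g, {}_{\mathcal{E}}\nabla^{e}, {}_{\mathcal{E}}\nabla^{m})$. That is, the manifold equipped with the Fisher information metric tensor field, the expected exponential connection and the conjugate expected mixture connection. This structure can also be derived from a divergence manifold structure by choosing the reverse Kullback-Leibler divergence KL*:
$$(\mathcal{E}, {}_{\mathcal{E}}g, {}_{\mathcal{E}}\nabla^{e}, {}_{\mathcal{E}}\nabla^{m}) = (\mathcal{E}, \mathrm{KL}^*).$$
Therefore, the Kullback-Leibler divergence $\mathrm{KL}[P_\theta : P_{\theta'}]$ amounts to a Bregman divergence (for the cumulant function $F$ of the exponential family):
$$\mathrm{KL}^*[P_{\theta'} : P_\theta] = \mathrm{KL}[P_\theta : P_{\theta'}] = B_F(\theta' : \theta).$$
The best error exponent α* of the best Maximum A Posteriori (MAP) decision rule is found by minimizing the skewed Bhattacharyya coefficient (equivalently, by maximizing the skewed Bhattacharyya distance), which yields the Chernoff information [56]:
$$C[P_{\theta_1}, P_{\theta_2}] = -\log \min_{\alpha \in (0,1)} \int p_{\theta_1}^{1-\alpha}(x)\, p_{\theta_2}^{\alpha}(x)\, d\mu(x).$$
On the exponential family manifold E, the skewed Bhattacharyya distance
$$B_\alpha[P_{\theta_1} : P_{\theta_2}] = -\log \int p_{\theta_1}^{1-\alpha}(x)\, p_{\theta_2}^{\alpha}(x)\, d\mu(x)$$
amounts to a skew Jensen parameter divergence [40] (also called Burbea-Rao divergence):
$$J_F^{(\alpha)}(\theta_1 : \theta_2) = (1-\alpha)F(\theta_1) + \alpha F(\theta_2) - F(\theta^\alpha_{12}), \qquad \theta^\alpha_{12} = (1-\alpha)\theta_1 + \alpha\theta_2.$$
It can be shown that the Chernoff information (attained at the optimal exponent α*) is equivalent to a Bregman divergence: namely, the Bregman divergence for exponential families at the optimal exponent value α*.
Theorem 9 (Chernoff information [35]) The Chernoff information between two distributions belonging to the same exponential family amounts to a Bregman divergence:
$$C[P_{\theta_1} : P_{\theta_2}] = B_F(\theta_1 : \theta^{\alpha^*}_{12}) = B_F(\theta_2 : \theta^{\alpha^*}_{12}),$$
where $\theta^\alpha_{12} = (1-\alpha)\theta_1 + \alpha\theta_2$, and α* denotes the best error exponent.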
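Let $\theta^*_{12} := \theta^{\alpha^*}_{12}$ denote the natural parameter attained at the best error exponent α*. The geometry [35] of the best error exponent can be explained on the dually flat exponential family manifold as follows:
$$P_{\theta^*_{12}} = G_e(P_{\theta_1}, P_{\theta_2}) \cap \mathrm{Bi}_m(P_{\theta_1}, P_{\theta_2}),$$
where $G_e$ denotes the exponential geodesic $\gamma_{\nabla^e}$ and $\mathrm{Bi}_m$ the m-bisector:
$$\mathrm{Bi}_m(P_{\theta_1}, P_{\theta_2}) = \left\{P : F(\theta_1) - F(\theta_2) + \eta(P)^\top(\theta_2 - \theta_1) = 0\right\}.$$
Figure 9 illustrates how to retrieve the best error exponent from an exponential arc (θ-geodesic) intersecting the m-bisector.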
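The geometric characterization of the Chernoff information lends itself to a simple numerical procedure. The sketch below assumes two Poisson distributions (natural parameter $\theta = \log\lambda$, cumulant $F(\theta) = e^\theta$) and the convention $\theta^\alpha_{12} = (1-\alpha)\theta_1 + \alpha\theta_2$; it locates the intersection of the exponential geodesic with the m-bisector by bisection on α. The function names and the choice of bisection are our own illustrative assumptions, not a procedure prescribed by [35].

```python
import math

def F(theta):
    """Assumed cumulant function: Poisson family with theta = log(lambda)."""
    return math.exp(theta)

def grad_F(theta):
    return math.exp(theta)

def bregman(theta1, theta2):
    return F(theta1) - F(theta2) - (theta1 - theta2) * grad_F(theta2)

def skew_jensen(theta1, theta2, alpha):
    mid = (1.0 - alpha) * theta1 + alpha * theta2
    return (1.0 - alpha) * F(theta1) + alpha * F(theta2) - F(mid)

def chernoff_information(theta1, theta2, iterations=100):
    """Maximize the skew Jensen divergence over alpha in (0, 1).

    The derivative in alpha vanishes exactly on the m-bisector, and it is
    decreasing in alpha, so a bisection on its sign locates alpha*.
    """
    def slope(alpha):
        mid = (1.0 - alpha) * theta1 + alpha * theta2
        return F(theta2) - F(theta1) - (theta2 - theta1) * grad_F(mid)

    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        middle = 0.5 * (lo + hi)
        if slope(middle) > 0.0:
            lo = middle
        else:
            hi = middle
    alpha_star = 0.5 * (lo + hi)
    return alpha_star, skew_jensen(theta1, theta2, alpha_star)

theta1, theta2 = math.log(2.0), math.log(5.0)
alpha_star, chernoff = chernoff_information(theta1, theta2)
theta_star = (1.0 - alpha_star) * theta1 + alpha_star * theta2
print(alpha_star, chernoff, bregman(theta1, theta_star), bregman(theta2, theta_star))
```

At the optimum, the two Bregman divergences printed last coincide with the Chernoff information, in agreement with Theorem 9.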
Furthermore, instead of considering two distributions for this statistical binary decision problem, we may consider a set of n distributions $P_1, \ldots, P_n \in \mathcal{E}$. The geometry of the error exponent in this multiple hypothesis testing setting has been investigated in [34]. On the dually flat exponential family manifold, this amounts to checking the exponential arcs between natural neighbors (sharing Voronoi subfaces) of a Bregman Voronoi diagram [11]. See Figure 10 for an illustration.

Clustering mixtures in the dually flat mixture family manifold (M, KL)
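Given a set of k prescribed statistical distributions $p_0(x), \ldots, p_{k-1}(x)$, all sharing the same support X (say, $\mathbb{R}$), a mixture family M of order $D = k - 1$ consists of all strictly convex combinations of these component distributions [48]:
$$\mathcal{M} := \left\{ m(x;\theta) = \sum_{i=1}^{k-1} \theta_i\, p_i(x) + \left(1 - \sum_{i=1}^{k-1} \theta_i\right) p_0(x) \;:\; \theta_i > 0,\ \sum_{i=1}^{k-1} \theta_i < 1 \right\}.$$

Figure (labels recovered from the original layout): η-coordinate system; Chernoff distribution between natural neighbours.

We consider the expected information manifold $(\mathcal{M}, {}_{\mathcal{M}}g, {}_{\mathcal{M}}\nabla^{m}, {}_{\mathcal{M}}\nabla^{e})$, which is dually flat and equivalent to $(\mathcal{M}_\Theta, \mathrm{KL})$. That is, the KL divergence between two mixtures with prescribed components (w-mixtures, for short) is equivalent to a Bregman divergence for the generator $F(\theta) = -h(m_\theta) = \int m_\theta(x) \log m_\theta(x)\, d\mu(x)$, the differential Shannon information (negative entropy) [48]:
$$\mathrm{KL}[m_{\theta_1} : m_{\theta_2}] = B_F(\theta_1 : \theta_2).$$
Consider a set $\{m_{\theta_1}, \ldots, m_{\theta_n}\}$ of n w-mixtures [48]. Because $F(\theta) = -h(m(x;\theta))$ is the negative differential entropy of a mixture (not available in closed form [49]), we approximate the intractable F by another close tractable generator $\tilde{F}$. We use Monte Carlo stochastic sampling to get a Monte Carlo convex generator $\tilde{F}_S$ for an independent and identically distributed sample S.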
Thus we can build a nested sequence $(\mathcal{M}, \tilde{F}_{S_1}), \ldots, (\mathcal{M}, \tilde{F}_{S_m})$ of tractable dually flat manifolds for nested sample sets $S_1 \subset \ldots \subset S_m$ converging to the ideal mixture manifold $(\mathcal{M}, F)$: $\lim_{m \to \infty} (\mathcal{M}, \tilde{F}_{S_m}) = (\mathcal{M}, F)$ (where convergence is defined with respect to the induced canonical Bregman divergence). A key advantage of this approach is that for a given sample S, all computations carried out inside the dually flat manifold $(\mathcal{M}, \tilde{F}_S)$ are consistent, see [48].
For example, we can apply Bregman k-means [43] on these Monte Carlo dually flat spaces [42] of w-GMMs (Gaussian Mixture Models) to cluster a set of w-GMMs. Figure 12 displays the result of such a clustering.
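The Monte Carlo construction can be sketched on the simplest case of a w-mixture with two prescribed Gaussian components (order D = 1). The code below is a hedged illustration in the spirit of [48], not a reproduction of its implementation: the components, the importance-sampling proposal (the balanced mixture), the sample size and all function names are our assumptions. It builds a Monte Carlo generator from a fixed sample, evaluates the induced Bregman divergence, and compares it with a quadrature estimate of the KL divergence.

```python
import math
import random

def normal_pdf(x, mu, sigma=1.0):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def p0(x):
    return normal_pdf(x, -2.0)

def p1(x):
    return normal_pdf(x, 2.0)

def mixture(x, theta):
    """w-mixture with prescribed components: (1 - theta) p0 + theta p1."""
    return (1.0 - theta) * p0(x) + theta * p1(x)

def proposal(x):
    """Assumed importance-sampling density: the balanced mixture."""
    return mixture(x, 0.5)

def draw_sample(size, seed=0):
    rng = random.Random(seed)
    return [rng.gauss(-2.0 if rng.random() < 0.5 else 2.0, 1.0) for _ in range(size)]

def mc_generator(theta, sample):
    """Monte Carlo estimate of F(theta) = integral of m log m (convex in theta)."""
    return sum(mixture(s, theta) * math.log(mixture(s, theta)) / proposal(s)
               for s in sample) / len(sample)

def mc_gradient(theta, sample):
    """Derivative of the Monte Carlo generator with respect to theta."""
    return sum((p1(s) - p0(s)) * (math.log(mixture(s, theta)) + 1.0) / proposal(s)
               for s in sample) / len(sample)

def mc_bregman(theta1, theta2, sample):
    return (mc_generator(theta1, sample) - mc_generator(theta2, sample)
            - (theta1 - theta2) * mc_gradient(theta2, sample))

def kl_quadrature(theta1, theta2, lo=-12.0, hi=12.0, steps=24000):
    """Reference value of KL[m_theta1 : m_theta2] by the midpoint rule."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * h
        a, b = mixture(x, theta1), mixture(x, theta2)
        total += a * math.log(a / b) * h
    return total

sample = draw_sample(20000)
estimate = mc_bregman(0.2, 0.7, sample)
reference = kl_quadrature(0.2, 0.7)
print(estimate, reference)
```

Because each summand of the Monte Carlo generator is convex in θ, the induced Bregman divergence is nonnegative for any fixed sample, which is the consistency property mentioned above; a Bregman k-means would then operate on the θ-coordinates with this divergence.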
We have briefly described two applications using dually flat manifolds: (1) the dually flat exponential manifold induced by the statistical reverse Kullback-Leibler divergence on an exponential family (structure (E, KL*)), and (2) the dually flat mixture manifold induced by the statistical Kullback-Leibler divergence on a mixture family (structure (M, KL)). There are many other dually flat structures that can be met in a statistical context. For example, two other dually flat structures for the D-dimensional probability simplex $\Delta_D$ are reported in Amari's textbook [4]: (1) the conformal deformation of the α-geometry (page 88, Eq. 4.95 of [4]), and (2) the χ-escort geometry (page 91, Eq. 4.114 of [4]).
5 Conclusion: Summary, historical background, and perspectives

Summary
We explained the dualistic nature of information manifolds (M, g, ∇, ∇*) in information geometry. The dualistic structure is defined by a pair of conjugate connections coupled with the metric tensor, which provides a dual parallel transport that preserves the metric. We showed how to extend this structure to a 1-parameter family of structures. From a pair of conjugate connections, the pipeline can be informally summarized as:
$$(M, g, \nabla, \nabla^*) \Rightarrow (M, g, C) \Rightarrow (M, g, \alpha C) \Rightarrow (M, g, \nabla^{-\alpha}, \nabla^{\alpha}), \quad \forall \alpha \in \mathbb{R}.$$
We stated the fundamental theorem of information geometry on dual constant-curvature manifolds, including the special but important case of dually flat manifolds, on which there exist two potential functions and global affine coordinate systems related by the Legendre-Fenchel transformation. Although information geometry historically started with the Riemannian modeling $(\mathcal{P}, {}_{\mathcal{P}}g)$ of a parametric family of probability distributions P by letting the metric tensor be the Fisher information matrix, we have emphasized the dualistic view of information geometry, which considers non-Riemannian manifolds that can be derived from any divergence and are not necessarily tied to a statistical context (e.g., information manifolds can be used in mathematical programming [54]). Let us notice that for any symmetric divergence (e.g., any symmetrized f-divergence like the squared Hellinger divergence), the induced conjugate connections coincide with the Levi-Civita connection, but the Fisher-Rao metric distance does not coincide with the squared Hellinger divergence. On one hand, a Riemannian metric distance $D_\rho$ is never a divergence because the rooted distance functions fail to be smooth at the extremities, but a squared Riemannian metric distance is always a divergence. On the other hand, taking the power δ of a divergence D (i.e., $D^\delta$) for some δ > 0 may yield a metric distance (e.g., the square root of the Jensen-Shannon divergence [21]), but this is not always the case: the powered Jeffreys divergence $J^\delta$ is never a metric distance (see [68], page 889).
Recently, Optimal Transport (OT) theory [71] has gained interest in statistics and machine learning. However, the optimal transport distance between two members of the same elliptically contoured family is given by the same formula whatever the family (see [18], Eq. 16 and Eq. 17), whereas the corresponding Kullback-Leibler divergences differ. Another essential difference is that the Fisher-Rao manifold of location-scale families is hyperbolic, but the Wasserstein manifold of location-scale families has positive curvature [18,67].

A brief historical review of information geometry
The field of Information Geometry (IG) was historically motivated by providing some differential-geometric structure to statistical models in order to reason geometrically about statistical problems, with the goal of geometrizing mathematical statistics [14,3,32,28,5]: Harold Hotelling [25] first considered in the late 1920's the Fisher Information Matrix (FIM) I as a Riemannian metric tensor g, and interpreted a parametric family of probability distributions M as a Riemannian manifold (M, g) (footnote 18). In this pioneering work, Hotelling mentioned that location-scale probability families yield manifolds of constant negative curvature. This Riemannian modeling of parametric families of densities was further independently studied by Calyampudi Radhakrishna Rao in his celebrated paper [60] (1945), which also includes the Cramér-Rao lower bound [33] and the Rao-Blackwellization technique. Nowadays the induced Riemannian metric distance is often called the Fisher-Rao distance [64] or Rao distance [61]. Another use of Riemannian geometry in statistics was pioneered by Harold Jeffreys [26], who proposed to use as an invariant prior the normalized volume element of the expected Fisher-Riemannian manifold. In those seminal papers, there was no theoretical justification for using the Fisher information matrix as a metric tensor (besides the fact that it is a positive-definite matrix for regular identifiable models). Nowadays, this Riemannian metric tensor is called the information metric for short. Information geometry considers a generalization of this approach using a non-Riemannian dualistic modeling (M, g, ∇, ∇*) that coincides with the Riemannian manifold when ∇ = ∇* = LC ∇, the Levi-Civita connection (the unique torsion-free connection compatible with the metric tensor).
In the 1960's, Nikolai Chentsov (also commonly written Čencov) studied the algebraic category of all statistical decision rules with its induced geometric structures: namely, the expected α-geometries ("equivalent differential geometry") and the dually flat manifolds ("Nonsymmetric Pythagorean geometry" of the exponential families with respect to the Kullback-Leibler divergence). In the preface of the English translation of his 1972 Russian monograph [14], the field of investigation is defined as "geometrical statistics." However, in the original Russian monograph, Chentsov used the Russian term geometrostatistics. The term geometrostatistics was allegedly coined (footnote 19) by Andrey Kolmogorov to define the field of differential geometry of statistical models. In the monograph of Chentsov, the Fisher information metric is shown to be the unique metric tensor (up to a scaling factor) yielding statistical invariance under Markov morphisms (see [13] for a simpler proof that generalizes to positive measures).
The dual nature of the geometry was thoroughly investigated (footnote 20) by Shun-ichi Amari. In the preface of his 1985 monograph [3], Professor Amari coined the term information geometry as follows: "The differential-geometrical method developed in statistics is also applicable to other fields of sciences such as information theory and systems theory... They together will open a new field, which I would like to call information geometry." The role of differential geometry in statistics has been discussed in [10].
Note that the dual affine connections of information geometry have also been investigated independently in affine differential geometry [52] which considers invariance under volume-preserving affine transformations by defining a volume form (instead of a metric form for Riemannian geometry). The notion of dual parallel transport compatible with the metric is due to Aleksandr Norden [53].
We summarize the main fundamental structures of information manifolds below:

18 [...] Statistical Parameters" was read by a colleague and are fully typeset in [65]. We warmly thank Professor Stigler for sending us the scanned handwritten notes and for discussing by emails historical aspects of the birth of information geometry.
19 We thank Professor Alexander Holevo for email correspondences on this matter.
20 Professor Amari mentioned in [3] that he considered the Gaussian Riemannian manifold as a hyperbolic manifold in 1959, and was strongly influenced by Efron's paper on statistical curvature [19] (1975) to study the family of α-connections in the 1980's [2].

Perspectives
We recommend the two recent textbooks [12,4] for an in-depth coverage of (parametric) information geometry, and the book [22] for a thorough description of some infinite-dimensional statistical models. We did not report the various coefficients of the metric tensors, Christoffel symbols and skewness tensors for the expected α-geometry of common parametric models like the multivariate Gaussian distributions, the Gamma/Beta distributions, etc. They can be found in [6,12] and in various articles dealing with less common families of distributions [77]. Although we have focused on the finite parametric setting, information geometry also considers non-parametric families of distributions [57] and quantum information geometry [23].
We have shown that we can always create an information manifold (M, D) from any divergence function D. It is therefore important to consider generic classes of divergences in applications, which are ideally axiomatized and shown to have exhaustive characteristics. Beyond the three main Bregman/Csiszár/Jensen classes (these classes overlap [55]), we may also mention the class of conformal divergences [51,46], the class of projective divergences [47,50], etc. Figure 13 illustrates the relationships between the principal classes of distances.
There are many perspectives on information geometry, as attested by the new Springer journal and the biennial international conference "Geometric Science of Information" (GSI) [37,38,39].

Then the divergence reference duality yields conjugate divergence-based connections. In particular, when D is a Bregman divergence $B_F$ with reverse divergence $B_F^{r}(\theta_1 : \theta_2) = B_F(\theta_2 : \theta_1) = B_{F^*}(\nabla F(\theta_1) : \nabla F(\theta_2))$, the connection ${}^{F}\nabla$ induced by $B_F$ and the connection induced by $B_F^{r}$ are both flat. However, this notation makes the equations look a bit clumsy, so we preferred to stick with the same star symbol * for expressing these different dualities, which should be clear from the context.