Nonparametric Information Geometry: From Divergence Function to Referential-Representational Biduality on Statistical Manifolds

Divergence functions are the non-symmetric “distance” on the manifold, M_θ, of parametric probability density functions over a measure space, (X, μ). Classical information geometry prescribes, on M_θ: (i) a Riemannian metric given by the Fisher information; (ii) a pair of dual connections (giving rise to the family of α-connections) that preserve the metric under parallel transport by their joint actions; and (iii) a family of divergence functions (α-divergence) defined on M_θ × M_θ, which induce the metric and the dual connections. Here, we construct an extension of this differential geometric structure from M_θ (that of parametric probability density functions) to the manifold, M, of non-parametric functions on X, removing the positivity and normalization constraints. The generalized Fisher information and α-connections on M are induced by an α-parameterized family of divergence functions, reflecting the fundamental convex inequality associated with any smooth and strictly convex function. The infinite-dimensional manifold, M, has zero curvature for all these α-connections; hence, the generally non-zero curvature of M_θ can be interpreted as arising from an embedding of M_θ into M. Furthermore, when a parametric model (after a monotonic scaling) forms an affine submanifold, its natural and expectation parameters form biorthogonal coordinates, and such a submanifold is dually flat for α = ±1, generalizing the results of Amari’s α-embedding. The present analysis illuminates two different types of duality in information geometry, one concerning the referential status of a point (measurable function) expressed in the divergence function (“referential duality”) and the other concerning its representation under an arbitrary monotone scaling (“representational duality”).

Entropy 2013, 15

The differential geometric structure of statistical models with finite parameters is now well understood. Consider a family of probability functions (i.e., probability measures on discrete support or probability density functions on continuous support) as parameterized by θ. The collection of such probability functions, where each function is indexed by a point, θ ∈ R^n, forms a manifold, M_θ, under suitable conditions. Rao [17] identified Fisher information to be the Riemannian metric for M_θ. Efron [18], through investigating a one-parameter family of statistical models, elucidated the meaning of curvature for asymptotic statistical inference and pointed out its flatness for the exponential model. In his reaction to Efron's work, Dawid [19] invoked the differential geometric notion of linear connections on a manifold as preserving parallelism during vector transportation and pointed out other possible constructions of linear connections on M_θ, in addition to the non-flat Levi-Civita connection associated with the Fisher metric. Amari [2,20], in his path-breaking work, systematically advanced the theory of information geometry by constructing a parametric family of α-connections, Γ^(α), α ∈ R, along with a dualistic interpretation of α ↔ −α as conjugate connections on the manifold, M_θ. The e-connection (α = 1) vanishes (i.e., becomes identically zero) on the manifold of the exponential family of probability functions under natural parameters, whereas the m-connection (α = −1) vanishes on the manifold of the mixture family of probability functions under mixture parameters. Therefore, not only do Γ^(±1) have zero curvature for both the exponential and mixture families, but affine coordinates were found in which Γ^(1) and Γ^(−1) themselves vanish for the exponential and mixture families, respectively. This classic information geometry dealing with parametric statistical models has been investigated in the non-parametric setting using the tools of infinite-dimensional analysis [21–23], with non-parametric Fisher information given by [23]. This is made possible because topological issues were resolved by the pioneering work of [24] using the theory of Orlicz space for charting the exponential statistical manifold. Zhang and Hästö [25] characterized the probability manifold modeled on an ambient affine space via functional equations and generalized exponential charts. The goal of the present paper is to extend these non-parametric results by showing links among three inter-connected mathematical topics that underlie information geometry, namely: (i) divergence functions measuring the non-symmetric distance of any two points (density or measurable functions) on the manifold (the referential duality); (ii) convex analysis and the associated Legendre-Fenchel transformation linking the natural and expectation parameters of parametric models (the representational duality); and (iii) the resulting dual Riemannian structure involving the Fisher metric and the family of α-connections. Results in the parametric setting were summarized in [26].
The Riemannian manifold of parametric statistical models is a special kind, one that involves dual (also known as conjugate) connections; historically, such a mathematical theory was independently developed to investigate hypersurface immersion (see [27,28]). Lauritzen [29] characterized the general differential geometric context under which a one-parameter family of α-connections arises, as well as the meaning of conjugacy for a pair of connections on statistical manifolds [30]. Kurose [31,32] and then Matsuzoe [33,34] elucidated information geometry from an affine differential geometric perspective. See also [35] for a generalized notion of conjugate connections. It was Eguchi [36–38] who provided a generic way for inducing a metric and a pair of conjugate connections from an arbitrary divergence (what he called "contrast") function. The current exposition will build on this "Eguchi relation" between the metric and conjugate connections of the Riemannian manifold, M_θ, and the divergence function defined on M_θ × M_θ.
The main results of this paper include the introduction of an α-parametric family of divergence functionals on measurable functions (including probability functions) using any smooth and strictly convex function, and the induction, by such a divergence, of a metric and a family of conjugate connections that resemble, but generalize, the Fisher information proper and α-connections proper. In particular, we derive explicit expressions of the metric and conjugate connections on the infinite-dimensional manifold of all functions defined on the same support of the sample space. When finite-dimensional affine embedding is allowed, our formulae reduce to the familiar ones associated with the exponential family established in [2]. We carefully delineate two senses of duality associated with such manifolds, one related to the reference/comparison status of any pair of points (functions) and the other related to properly scaled representations of them.

Parametric Information Geometry Revisited
Here, we briefly summarize the well-known results of parametric information geometry in the classical (as opposed to quantum) sense. The motivation is two-fold. First, by reviewing the basic parametric results, we want to make sure that any generalization of the framework of information geometry will reduce to those formulae under appropriate conditions. Secondly, understanding how a divergence function is related to the dual Riemannian structure will enable us to approach the infinite-dimensional case by analogy, that is, through constructing more general classes of divergence functionals defined on function spaces.

Riemannian Manifold, Fisher Metric and α-Connections
Let (X, µ) be a measure space with σ-algebra built upon the atoms, dζ, of X. Let M_µ denote the space of probability density functions, p : X → R_+ (≡ R^+ ∪ {0}), defined on the sample space, X, with background measure dµ = µ(dζ):

$$ \mathcal{M}_\mu = \{\, p(\zeta) : E_\mu\{p(\zeta)\} = 1;\ p(\zeta) > 0,\ \forall \zeta \in \mathcal{X} \,\} \tag{1} $$

Here, and throughout this paper, E_µ{·} = ∫_X {·} dµ denotes the expectation of a measurable function (in curly brackets) with respect to the background measure, µ. We also denote E_p{·} = ∫_X {·} p dµ. A parametric family of density functions, p(·|θ), called a parametric statistical model, is the association of a density function, θ → p(·|θ), for each n-dimensional vector θ = [θ^1, …, θ^n]. The space of parametric statistical models forms a Riemannian manifold (where θ is treated as the local chart):

$$ \mathcal{M}_\theta = \{\, p(\zeta|\theta) \in \mathcal{M}_\mu : \theta \in \Theta \subseteq \mathbb{R}^n \,\} \subset \mathcal{M}_\mu \tag{2} $$

with the so-called Fisher metric [17]:

$$ g_{ij}(\theta) = E_\mu\left\{ p(\zeta|\theta)\, \frac{\partial \log p(\zeta|\theta)}{\partial \theta^i}\, \frac{\partial \log p(\zeta|\theta)}{\partial \theta^j} \right\} \tag{3} $$

and α-connections [20,39]:

$$ \Gamma^{(\alpha)}_{ij,k}(\theta) = E_\mu\left\{ p(\zeta|\theta) \left( \frac{1-\alpha}{2}\, \frac{\partial \log p(\zeta|\theta)}{\partial \theta^i}\, \frac{\partial \log p(\zeta|\theta)}{\partial \theta^j} + \frac{\partial^2 \log p(\zeta|\theta)}{\partial \theta^i\, \partial \theta^j} \right) \frac{\partial \log p(\zeta|\theta)}{\partial \theta^k} \right\} \tag{4} $$

with the α-connections satisfying the dualistic relation:

$$ \Gamma^{*(\alpha)}_{ij,k}(\theta) = \Gamma^{(-\alpha)}_{ij,k}(\theta) \tag{5} $$

Here, * denotes the conjugate (dual) connection. Recall that, in general, a metric is a bilinear map on the tangent space, and an affine connection is used to define parallel transport of vectors. The conjugacy in a pair of connections, Γ ←→ Γ*, is defined by their jointly preserving the metric when each acts on one of the two tangent vectors; that is, when the tangent vectors undergo parallel transport according to Γ or Γ*, respectively. Equivalently, and perhaps more fundamentally, the pair of conjugate connections preserve the dual pairing of vectors in the tangent space with co-vectors in the cotangent space [30]. Any Riemannian manifold with its metric, g, and conjugate connections, Γ, Γ*, given in the form of Equations (3)-(5) is called a statistical manifold (in the narrower sense) and is denoted as {M_θ, g, Γ^(±α)}. In the broader sense, a statistical manifold {M, g, Γ, Γ*} is a differentiable manifold equipped with a Riemannian metric g and a pair of torsion-free conjugate connections Γ ≡ Γ^(1), Γ* ≡ Γ^(−1), without necessarily requiring g and Γ, Γ* to take the forms of Equations (3)-(5).
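As an illustrative numerical sketch (not part of the original exposition), the conjugacy of the ±α-connections can be checked on a one-parameter model. The example assumes a Bernoulli model charted by its mean θ, the standard score-based forms g = E[(∂ log p)²] and Γ^(α) = E[((1−α)/2 (∂ log p)² + ∂² log p) ∂ log p], and the one-dimensional form of the conjugacy condition, ∂g = Γ^(α) + Γ^(−α). All function names are hypothetical.

```python
def score_moments(theta):
    """Return E[l'^2], E[l'^3] and E[l'' l'] for a Bernoulli model with mean theta,
    where l = log p and primes are derivatives with respect to theta."""
    probs = {1: theta, 0: 1.0 - theta}
    d1 = {1: 1.0 / theta, 0: -1.0 / (1.0 - theta)}           # first derivative of log p
    d2 = {1: -1.0 / theta**2, 0: -1.0 / (1.0 - theta)**2}    # second derivative of log p
    m2 = sum(probs[x] * d1[x] ** 2 for x in probs)
    m3 = sum(probs[x] * d1[x] ** 3 for x in probs)
    m21 = sum(probs[x] * d2[x] * d1[x] for x in probs)
    return m2, m3, m21

def fisher_metric(theta):
    """Fisher information of the Bernoulli model (single coordinate theta)."""
    return score_moments(theta)[0]

def alpha_connection(theta, alpha):
    """Covariant alpha-connection coefficient in the coordinate theta."""
    _, m3, m21 = score_moments(theta)
    return 0.5 * (1.0 - alpha) * m3 + m21

theta, alpha, h = 0.3, 0.4, 1e-5
# Central finite difference of the metric along theta.
dg = (fisher_metric(theta + h) - fisher_metric(theta - h)) / (2 * h)
# Sum of the alpha-connection and its assumed conjugate, the (-alpha)-connection.
dual_sum = alpha_connection(theta, alpha) + alpha_connection(theta, -alpha)
print(dg, dual_sum)
```

The two printed numbers should agree up to finite-difference error for any α, which is the sense in which the pair of connections jointly preserves the metric in this toy case.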

Exponential Family, Mixture Family and Their Generalization
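An exponential family of probability density functions is defined as:

$$ p^{(e)}(\zeta|\theta) = \exp\Big( F_0(\zeta) + \sum_i \theta^i F_i(\zeta) - \Phi(\theta) \Big) \tag{6} $$

where θ is its natural parameter and F_i(ζ) (i = 1, …, n) is a set of linearly independent functions with the same support in X, and the cumulant generating function ("potential function") Φ(θ) is:

$$ \Phi(\theta) = \log E_\mu\Big\{ \exp\Big( F_0(\zeta) + \sum_i \theta^i F_i(\zeta) \Big) \Big\} \tag{7} $$

Substituting Equation (6) into Equations (3) and (4), the Fisher metric and the α-connections are simply:

$$ g_{ij}(\theta) = \frac{\partial^2 \Phi(\theta)}{\partial \theta^i\, \partial \theta^j} \tag{8} $$

and:

$$ \Gamma^{(\alpha)}_{ij,k}(\theta) = \frac{1-\alpha}{2}\, \frac{\partial^3 \Phi(\theta)}{\partial \theta^i\, \partial \theta^j\, \partial \theta^k} \tag{9} $$

whereas the Riemannian curvature tensor (of an α-connection) is given by [2], p. 106:

$$ R^{(\alpha)}_{ij\mu\nu}(\theta) = \frac{1-\alpha^2}{4} \sum_{l,k} \big( \Phi_{il\nu}\Phi_{jk\mu} - \Phi_{il\mu}\Phi_{jk\nu} \big)\, \Phi^{lk} \tag{10} $$

where Φ^{ij} = g^{ij} is the matrix inverse of g_ij and subscripts of Φ indicate partial derivatives. Therefore, the α-connection for the exponential family is dually flat when α = ±1. In particular, all components of Γ^(1)_{ij,k} vanish, due to Equation (9), on the manifold formed by p^(e)(·|θ) in which the natural parameter, θ, serves as the local coordinates.
On the other hand, the mixture family:

$$ p^{(m)}(\zeta|\theta) = \sum_i \theta^i F_i(\zeta) \tag{11} $$

when viewed as a manifold charted by its mixture parameter, θ, with the constraints, Σ_i θ^i = 1 and ∫_X F_i(ζ) dµ = 1, turns out to have identically zero Γ^(−1)_{ij,k}. The connections, Γ^(1) and Γ^(−1), are also called the exponential and mixture connections, or e- and m-connection, respectively. The exponential family and the mixture family are special cases of the α-family [1,2] of density functions, p(ζ|θ), whose denormalization satisfies (with constant κ):

$$ l^{(\alpha)}\big(\kappa\, p(\zeta|\theta)\big) = F_0(\zeta) + \sum_i \theta^i F_i(\zeta) \tag{12} $$

under the α-embedding function, l^(α) : R^+ → R, defined as:

$$ l^{(\alpha)}(t) = \begin{cases} \log t, & \alpha = 1 \\[4pt] \dfrac{2}{1-\alpha}\, t^{(1-\alpha)/2}, & \alpha \neq 1 \end{cases} \tag{13} $$

The α-embedding of a probability density function plays an important role in Tsallis statistics; see, e.g., [40]. Under α-embedding, the denormalized density functions form the so-called α-affine manifold [1], p. 46. The Fisher metric and α-connections, under such α-representation, have the following expressions:

$$ g_{ij}(\theta) = E_\mu\left\{ \frac{\partial l^{(\alpha)}(p(\zeta|\theta))}{\partial \theta^i}\, \frac{\partial l^{(-\alpha)}(p(\zeta|\theta))}{\partial \theta^j} \right\} \tag{14} $$

$$ \Gamma^{(\alpha)}_{ij,k}(\theta) = E_\mu\left\{ \frac{\partial^2 l^{(\alpha)}(p(\zeta|\theta))}{\partial \theta^i\, \partial \theta^j}\, \frac{\partial l^{(-\alpha)}(p(\zeta|\theta))}{\partial \theta^k} \right\} \tag{15} $$

Clearly, on an α-affine manifold with any given α value, components of Γ^(α) are all identically zero by virtue of the definition of α-family Equation (12), and hence, ±α-connections are dually flat.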
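The statement that the Fisher metric and the α-connections of an exponential family reduce to derivatives of the potential Φ can be illustrated numerically. The sketch below assumes a hypothetical one-parameter exponential family on a three-point sample space with counting measure, F_0 = 0, and a made-up sufficient statistic F; it compares score-based expectations with finite-difference derivatives of Φ, taking the metric to be Φ'' and the α-connection to be (1−α)/2 · Φ'''.

```python
import math

F = [0.0, 1.0, 3.0]  # hypothetical sufficient statistic on a 3-point sample space

def potential(theta):
    """Cumulant generating function: log of the sum of exp(theta * F), with F_0 = 0."""
    return math.log(sum(math.exp(theta * f) for f in F))

def density(theta):
    """Exponential-family probabilities exp(theta * F - Phi(theta))."""
    phi = potential(theta)
    return [math.exp(theta * f - phi) for f in F]

def central_moment(theta, order):
    """Central moment of F under the model at theta."""
    p = density(theta)
    mean = sum(pi * f for pi, f in zip(p, F))
    return sum(pi * (f - mean) ** order for pi, f in zip(p, F))

def fisher_from_score(theta):
    # The score is F - E[F], so its second moment is the variance of F.
    return central_moment(theta, 2)

def alpha_connection_from_score(theta, alpha):
    # The second derivative of log p is -Phi'', constant in zeta, and E[score] = 0,
    # so only the cubic score term of the alpha-connection survives.
    return 0.5 * (1.0 - alpha) * central_moment(theta, 3)

theta, alpha, h = 0.7, -0.5, 1e-3
# Finite-difference second and third derivatives of the potential.
phi2 = (potential(theta + h) - 2 * potential(theta) + potential(theta - h)) / h**2
phi3 = (potential(theta + 2 * h) - 2 * potential(theta + h)
        + 2 * potential(theta - h) - potential(theta - 2 * h)) / (2 * h**3)
print(phi2, fisher_from_score(theta))
print(0.5 * (1 - alpha) * phi3, alpha_connection_from_score(theta, alpha))
```

Setting `alpha = 1.0` makes the connection coefficient vanish identically in this chart, matching the flatness of the e-connection under the natural parameter.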
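A quick way to see why the α-representation reproduces the usual Fisher metric is that the derivatives of the ±α embeddings multiply to 1/t, so that ∂_i l^(α)(p) ∂_j l^(−α)(p) = p ∂_i log p ∂_j log p. The following sketch checks this numerically, assuming the standard α-embedding l^(α)(t) = log t for α = 1 and (2/(1−α)) t^((1−α)/2) otherwise; the helper names are hypothetical.

```python
import math

def alpha_embedding(t, alpha):
    """l^(alpha)(t): log t for alpha = 1, else (2 / (1 - alpha)) * t**((1 - alpha) / 2)."""
    if alpha == 1:
        return math.log(t)
    return 2.0 / (1.0 - alpha) * t ** ((1.0 - alpha) / 2.0)

def derivative(func, t, h=1e-6):
    """Central finite-difference derivative of a scalar function."""
    return (func(t + h) - func(t - h)) / (2 * h)

t, alpha = 0.7, 0.3
# Product of the derivatives of the alpha- and (-alpha)-embeddings at t.
product = (derivative(lambda s: alpha_embedding(s, alpha), t)
           * derivative(lambda s: alpha_embedding(s, -alpha), t))
# Same product for the exponential/mixture pair (alpha = 1 and alpha = -1).
product_em = (derivative(lambda s: alpha_embedding(s, 1), t)
              * derivative(lambda s: alpha_embedding(s, -1), t))
print(product, product_em, 1.0 / t)
```

For α = −1 the embedding is the identity, l^(−1)(t) = t, which is the mixture representation; for α = 1 it is the logarithm, which is the exponential representation.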

Divergence Function and Induced Statistical Manifold
It is well known that the statistical manifold, {M_θ, g, Γ^(±α)}, with Fisher information as the metric, g, and the (±α)-connections, Γ^(±α), as conjugate connections, can be induced from a parametric family of divergence functions called "α-divergence". Here, we briefly review the link of divergence functions to the dual Riemannian geometry of statistical manifolds.

Kullback-Leibler Divergence, Bregman Divergence and α-Divergence
Divergence functions are distance-like quantities; they measure the directed (non-symmetric) difference of two probability density functions in the infinite-dimensional function space or two points in a finite-dimensional vector space of the parameters of a statistical model. An example is the Kullback-Leibler divergence (also known as KL cross-entropy) between two probability densities, p, q ∈ M_µ, here expressed in its extended form (i.e., without requiring p and q to be normalized):

$$ K(p, q) = E_\mu\left\{ q - p - p \log \frac{q}{p} \right\} \tag{16} $$

with a unique, global minimum of zero when p = q. For the exponential family Equation (6), the expression (16) takes the form of the so-called Bregman divergence [41]:

$$ B_\Phi(\theta_q, \theta_p) = \Phi(\theta_q) - \Phi(\theta_p) - \langle \theta_q - \theta_p,\ \partial\Phi(\theta_p) \rangle \tag{17} $$

where Φ is the potential function (7), ∂ is the gradient operator and ⟨·,·⟩ denotes the standard bilinear form (pairing) of a vector with a co-vector. The Bregman divergence (17) expresses the directed distance of two members, p and q, of the exponential family as indexed, respectively, by the two parameters, θ_p and θ_q.
A generalization of the Kullback-Leibler divergence is the α-divergence, defined as:

$$ D^{(\alpha)}(p, q) = \frac{4}{1-\alpha^2}\, E_\mu\left\{ \frac{1-\alpha}{2}\, p + \frac{1+\alpha}{2}\, q - p^{\frac{1-\alpha}{2}}\, q^{\frac{1+\alpha}{2}} \right\} \tag{18} $$

measuring the directed distance between any two density functions, p and q. It is easily seen that:

$$ \lim_{\alpha \to -1} D^{(\alpha)}(p, q) = K(p, q) \tag{19} $$

$$ \lim_{\alpha \to 1} D^{(\alpha)}(p, q) = K(q, p) \tag{20} $$

Note that traditionally (see [2,20]), the term (1−α)/2 · p + (1+α)/2 · q is replaced by 1 in the integrand of Equation (18), and the term q − p is absent in the integrand of Equation (16); the two forms coincide when p, q are probability densities with a normalization of one. Zhu and Rohwer [42,43], in what they called the δ-divergence, δ = (1−α)/2, supplied these extra terms as the "extended" forms of α-divergence and of Kullback-Leibler divergence. The importance of these terms will be seen later (Section 2.2).
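The reduction of the Kullback-Leibler divergence to a Bregman divergence on an exponential family can be verified numerically. The sketch below is an illustration under stated assumptions: a hypothetical one-parameter exponential family on three atoms with counting measure and F_0 = 0, the extended KL form Σ(q − p − p log(q/p)), and the Bregman form Φ(a) − Φ(b) − (a − b)Φ'(b). With these conventions the KL divergence with p as first argument matches the Bregman divergence with θ_q as first argument.

```python
import math

F = [0.0, 1.0, 3.0]  # hypothetical sufficient statistic values (F_0 = 0)

def potential(theta):
    """Potential Phi(theta) = log sum exp(theta * F)."""
    return math.log(sum(math.exp(theta * f) for f in F))

def density(theta):
    """Exponential-family probabilities exp(theta * F - Phi(theta))."""
    phi = potential(theta)
    return [math.exp(theta * f - phi) for f in F]

def grad_potential(theta):
    """Phi'(theta), which equals the expectation of F under p(.|theta)."""
    return sum(pi * f for pi, f in zip(density(theta), F))

def kl(p, q):
    """Extended Kullback-Leibler divergence: sum of q - p - p * log(q / p)."""
    return sum(qi - pi - pi * math.log(qi / pi) for pi, qi in zip(p, q))

def bregman(a, b):
    """Bregman divergence of the potential: Phi(a) - Phi(b) - (a - b) * Phi'(b)."""
    return potential(a) - potential(b) - (a - b) * grad_potential(b)

theta_p, theta_q = 0.4, -1.1
kl_pq = kl(density(theta_p), density(theta_q))
breg_qp = bregman(theta_q, theta_p)
print(kl_pq, breg_qp)
```

Swapping the arguments of `bregman` generally gives a different value, which reflects the directed (non-symmetric) character of both divergences.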
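The limiting behaviour of the extended α-divergence, and the fact that α ↔ −α exchanges the roles of the two points, can be checked on unnormalized positive functions over a finite sample space. This is a minimal sketch assuming counting measure and the extended forms D^(α)(p, q) = 4/(1−α²) Σ[(1−α)/2 · p + (1+α)/2 · q − p^((1−α)/2) q^((1+α)/2)] and K(p, q) = Σ(q − p − p log(q/p)); the sample values are arbitrary.

```python
import math

def kl_extended(p, q):
    """Extended KL divergence: sum of q - p - p * log(q / p); no normalization needed."""
    return sum(qi - pi - pi * math.log(qi / pi) for pi, qi in zip(p, q))

def alpha_divergence(p, q, alpha):
    """Extended alpha-divergence for alpha != +1, -1 (counting measure assumed)."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    total = sum(a * pi + b * qi - pi**a * qi**b for pi, qi in zip(p, q))
    return 4.0 / (1 - alpha**2) * total

p = [0.2, 0.5, 0.9]  # positive but deliberately not normalized
q = [0.4, 0.3, 0.6]

near_minus_one = alpha_divergence(p, q, -1 + 1e-6)  # approaches K(p, q)
near_plus_one = alpha_divergence(p, q, 1 - 1e-6)    # approaches K(q, p)
print(near_minus_one, kl_extended(p, q))
print(near_plus_one, kl_extended(q, p))
# Exchanging the two points is the same as flipping the sign of alpha.
print(alpha_divergence(p, q, 0.3), alpha_divergence(q, p, -0.3))
```

Because p and q are not normalized here, dropping the linear terms (1−α)/2 · p + (1+α)/2 · q would break non-negativity, which is one way to see why the extended form is used.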
Note that, strictly speaking, when the underlying space is a finite-dimensional vector space, that is, the space, R^n, for the parameters, θ, of a statistical model, p(·|θ), then the term "divergence function" is appropriate. However, if the underlying space is infinite-dimensional (over a sample space that may be uncountable), that is, the manifold, M_µ, of non-parametric probability densities, p and q, then the term "divergence functional" seems more appropriate. The latter implicitly defines a divergence function (through pullback) if the probability densities are embedded into a finite-dimensional submanifold, M_θ, in the case of a parametric statistical model, p(·|θ). As an example, for the exponential family Equation (6), the Kullback-Leibler divergence Equation (16) in terms of p and q implicitly defines a divergence in terms of θ_p, θ_q, i.e., the Bregman divergence Equation (17). In the following, we use the term divergence function when we intend to blur the distinction between whether it is defined on the finite-dimensional vector space or on the infinite-dimensional function space and, in the latter case, whether it is pulled back into the finite-dimensional submanifold. We will, however, use the term divergence functional when we emphasize the infinite-dimensional setting sans parametric embedding.
In general, a divergence function (also called "contrast function") is non-negative for all p, q and vanishes only when p = q; it is assumed to be sufficiently smooth. A divergence function will induce a Riemannian metric, g, in the form of Equation (3) by its second order properties and a pair of conjugate connections, Γ, Γ*, in the forms of Equations (4) and (5) by its third order properties (these relations were first formulated by Eguchi [36,37] and are reviewed next).

Induced Dual Riemannian Geometry
Let M be a Riemannian manifold endowed with a metric tensor field, g, whose restriction to any point, p, is a symmetric, positive bilinear form, ⟨·,·⟩, on T_p(M) × T_p(M). Here, T_p(M) denotes the space of all tangent vectors at the point, p ∈ M, and Σ(M) denotes the collection of all vector fields on M. Then:

$$ g(u, v) = \langle u, v \rangle \tag{21} $$

with u, v ∈ Σ(M). Let w ∈ Σ(M) be another vector field, and let d_w denote the directional derivative (of a function, vector field, etc.) along the direction corresponding to w (taken at any given point, p, if explicitly written out). An affine connection, ∇, is a map, Σ(M) × Σ(M) → Σ(M), (w, u) → ∇_w u, that is linear in u, w, while F-linear in w, but not in u. A pair of connections, ∇, ∇*, are said to be conjugate to each other if:

$$ d_w\, g(u, v) = \langle \nabla_w u, v \rangle + \langle u, \nabla^*_w v \rangle \tag{22} $$

or in component form, denoted by Γ, Γ*:

$$ \partial_k\, g_{ij} = \Gamma_{ki,j} + \Gamma^*_{kj,i} \tag{23} $$

The "contravariant" form, Γ^l_{ij}, of the affine connection defined by:

$$ \nabla_{\partial_i} \partial_j = \sum_l \Gamma^l_{ij}\, \partial_l \tag{24} $$

is related to the "covariant" form, Γ_{ij,k}, through:

$$ g\big( \nabla_{\partial_i} \partial_j,\ \partial_k \big) = \sum_l g_{lk}\, \Gamma^l_{ij} = \Gamma_{ij,k} \tag{25} $$

The Riemannian metric, g, and conjugate connections, ∇, ∇*, on a statistical manifold can be induced by a divergence function, D : M × M → R_+, which, by definition, satisfies: (i) D(p, q) ≥ 0 ∀p, q ∈ M, with equality holding iff p = q; (ii) (d_u)_p D(p, q)|_{p=q} = (d_v)_q D(p, q)|_{p=q} = 0; (iii) −(d_u)_p (d_v)_q D(p, q)|_{p=q} is positive definite; where the subscript, p, q, means that the directional derivative is taken with respect to the first and second arguments in D(p, q), respectively, along the direction, u or v.
Eguchi [36,37] showed that any such divergence function, D, satisfying (i)-(iii) will induce a Riemannian metric, g, and a pair of connections, ∇, ∇*, via:

$$ g(u, v) = -(d_u)_p\, (d_v)_q\, D(p, q)\big|_{p=q} \tag{26} $$

$$ \langle \nabla_w u, v \rangle = -(d_w)_p\, (d_u)_p\, (d_v)_q\, D(p, q)\big|_{p=q} \tag{27} $$

$$ \langle u, \nabla^*_w v \rangle = -(d_w)_q\, (d_v)_q\, (d_u)_p\, D(p, q)\big|_{p=q} \tag{28} $$

In index-laden component forms, they are:

$$ g_{ij} = -(\partial_i)_p\, (\partial_j)_q\, D(p, q)\big|_{p=q} \tag{29} $$

$$ \Gamma_{ij,k} = -(\partial_i)_p\, (\partial_j)_p\, (\partial_k)_q\, D(p, q)\big|_{p=q} \tag{30} $$

$$ \Gamma^*_{ij,k} = -(\partial_i)_q\, (\partial_j)_q\, (\partial_k)_p\, D(p, q)\big|_{p=q} \tag{31} $$

Equations (26)-(28) in coordinate-free form, or Equations (29)-(31) in index-laden form, link a divergence function, D, to the Riemannian metric, g, and conjugate connections, ∇, ∇*; henceforth, they will be called the Eguchi relation. It is easily verifiable that they satisfy Equation (22) or Equation (23), respectively. These relations are the stepping stones going from a divergence function defining (generally) non-symmetric distances between a pair of points on a manifold at large to the dual Riemannian geometric structure on the same manifold in the small. To apply to the infinite-dimensional context, we provide a proof (in Section 4) for the coordinate-free version Equations (26)-(28). This will allow us to first construct divergence functionals on the infinite-dimensional function space (the Kullback-Leibler divergence being a special example) and then derive explicit expressions for the non-parametric Riemannian metric and conjugate connections by explicating d_u, d_v, d_w.
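How a divergence function induces a metric through its second-order behaviour on the diagonal can be illustrated with finite differences. The sketch below is an assumption-laden toy example: it takes the Kullback-Leibler divergence on a Bernoulli model charted by its mean θ, uses the relation "metric equals minus the mixed second derivative of D(p, q) at p = q", and compares the result with the known Fisher information 1/(θ(1−θ)). The helper `induced_metric` is a hypothetical name.

```python
import math

def kl_bernoulli(a, b):
    """KL divergence between Bernoulli(a) and Bernoulli(b)."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def induced_metric(divergence, theta, h=1e-4):
    """Minus the mixed second derivative of D(p, q), evaluated at p = q = theta,
    approximated by a central finite difference in both arguments."""
    mixed = (divergence(theta + h, theta + h) - divergence(theta + h, theta - h)
             - divergence(theta - h, theta + h) + divergence(theta - h, theta - h))
    return -mixed / (4 * h * h)

theta = 0.3
g_induced = induced_metric(kl_bernoulli, theta)
g_fisher = 1.0 / (theta * (1.0 - theta))
print(g_induced, g_fisher)
```

The same finite-difference idea extends to third-order mixed derivatives for the connection coefficients, although the step size then needs more care to balance truncation and round-off error.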

Goals and Approach
Our goals in this paper are several-fold. First, we want to provide a unified perspective for the divergence functions encountered in the literature. There are two broad classes, those defined on the infinite-dimensional function space and those defined on the finite-dimensional vector space. The former class includes the one-parameter family of α-divergence (equivalently, the δ-divergence in [42,43]) and the family of Jensen difference related to the Shannon entropy function [44], both specializing to Kullback-Leibler divergence as a limiting case. The latter class includes the Bregman divergence [41], also called "geometric divergence" [32], which turns out to be identical to the "canonical divergence" [1] on a dually flat manifold expressed in a pair of biorthogonal coordinates; those coordinates are induced by a pair of conjugate convex functions via the Legendre-Fenchel transform [2,20]. Reference [15] recently investigated an infinite-dimensional version of the Bregman divergence, called the U-divergence. It will be shown in this paper that all of the above-mentioned divergence functions can be understood as convex inequalities associated with some real-valued, strictly convex function defined on R (for the infinite-dimensional case) or R^n (for the finite-dimensional case), with the convex mixture parameter assuming the role of α in the induced α-connection. Note that α ←→ −α in such divergence functions corresponds to an exchange of the two points the divergence functions measure (generally in a non-symmetric fashion), while α ←→ −α in the induced connections corresponds to the conjugacy operation for the pairing of two metric-compatible connections. Hence, our approach to divergence functions from convex analysis will address both of these aspects coherently, and an intimate relation between these two senses of duality is expected to emerge from our formulation (see below).
The second goal of our paper is to provide a more general form for the Fisher metric Equation (3) and the α-connections Equation (4) (or equivalently, Equations (14) and (15) under α-embedding), while still staying within the framework of [29] in characterizing statistical manifolds. One specific aim is to derive explicit expressions for the Fisher metric and α-connections for the infinite-dimensional case. In the past, an infinite-dimensional expression for the α-connection ∇^(α), as a mixture of ∇^(1) and ∇^(−1), has emerged, but was given only implicitly, with its interpretation debated [22,23]. Our approach exploits the coordinate-free version of the Eguchi relations Equations (26)-(28) directly, and derives the Fisher metric and α-connections from the general form of divergence functions mentioned in the last paragraph. The affine connection, ∇^(α), is formulated as the covariant derivative, which is characterized by a bilinear form. Since our divergence functional will be defined on the infinite-dimensional manifold, M, without restricting the underlying functions (individual points on M) to be normalized and positively-valued, the affine connections we derive are expected to have zero Riemann curvature as those in the ambient space. From this perspective, statistical curvature (the curvature of a statistical manifold) can be viewed as an embedding curvature, that is, curvature arising out of restricting to the submanifold, M_µ, of normalized and positive-valued functions (i.e., non-parametric statistical manifold), and further to the finite-dimensional submanifold M_θ (i.e., parametric statistical models).
Our third goal here is to clarify some fundamental issues in information geometry, including the meaning of duality and its relation to submanifold embedding. In its original development starting from [19], the flatness of the e-connection (or m-connection) is with respect to a particular family of density functions, namely, the exponential family (or mixture family). Later, Amari [2,20] generalized this observation to any α-family (i.e., a density function that is, after denormalization, affine under α-embedding): the α-connection is flat (indeed, Γ^(α)_{ij,k} vanishes) for the α-affine manifold (which is reduced to the exponential model for α = 1 and the mixture model for α = −1). One may be led to infer that the α parameter in the α-connection and the α parameter in α-embedding are one and the same and, thereby, conclude that ∇^(1)-flatness (or ∇^(−1)-flatness) is exclusively associated with the exponential family expressed in its natural parameter (or the mixture family expressed in its mixture parameter). Here, we point out that these conclusions are unwarranted: the flatness of an α-connection and the embedding of a probability function into an affine submanifold under α-representation are two related, but separate, issues. We will show that the α-connections for the infinite-dimensional ambient manifold, M, which contains the manifold of probability density functions, M_µ, as a submanifold, have zero (ambient) curvature for all α values. For finite-dimensional parametric statistical models, it is known that the α-connection will not in general have zero curvature even when α = ±1. Here, we will give precise conditions under which ∇^(±1) will be dually flat, i.e., when the denormalized statistical model can be affine embedded under any ρ-representation, where a strictly increasing function ρ : R → R generalizes the α-embedding function (13). In such cases, there exists a strictly convex potential function, akin to Equation (7) for the exponential statistical model, that will reduce the Fisher metric and α-connections to the forms of Equations (8) and (9). One may define the natural parameter and expectation parameter that are dual to each other and that form biorthogonal coordinates for the underlying manifold, just as for the exponential family.
Our analysis will clarify two different kinds of duality in information geometry, one related to the different status of a reference probability function and a comparison probability function (referential duality), the other related to the representation of each probability function via a pair of conjugate scalings (representational duality). Roughly speaking, the (±1)-duality reflects the former, whereas the e/m-duality reflects the latter. Previously, they were not distinguished; in our analysis, we are able to disambiguate these two senses of duality. For instance, we are able to devise a two-parameter family of divergence functions, where the two parameters play distinct roles in the induced geometry, one capturing referential duality and the other capturing representational duality. Interestingly, this two-parameter family of connections still takes the same form of the α-connection proper (with a single parameter), indicating that this extension is still within [29]'s conceptualization of dual connections in information geometry.
The technical challenge that we have to overcome in our derivations is doing calculus in the infinite-dimensional setting. Consider the set of measurable functions from X to R, which, in the presence of charts modeled on (open) subsets, {E_i}_{i∈I}, of a Banach space, form a manifold, M, of infinite dimension. Each point on M is a function, p : X → R, over the sample space X; and each chart, U ⊂ M, is equipped with a bijective map to the Banach space with a suitable norm (e.g., Orlicz space, as adopted by [21–24,45]). For non-parametric statistical models, [24] provided exponential charts modeled on Orlicz spaces, which was followed by the rest of the above-referenced works. We do not restrict ourselves to probability density functions and work, in general, with measurable functions (without positivity and normalization requirements); we treat probability functions as forming a submanifold in M defined by the positivity and normalization conditions. This approach gives us certain advantages in deriving, from divergence functions directly, the Riemannian geometry on M, whereby M serves as an ambient space to embed a statistical manifold, M_µ, as a submanifold in a standard way (by restricting the tangent vector field of M). The usual interpretation of the affine connection on M_µ as the projection of a natural connection on M is then "borrowed" over from the finite-dimensional setting to this infinite-dimensional setting. Our approach follows that of [46], which treats the infinite-dimensional manifold as a generic C^∞-Banach manifold and uses the theory of canonical spray (and the Morse-Palais Lemma) to construct the Riemannian metric and affine connections on such manifolds. However, we fall short of providing a topology on M as induced from the divergence functions and of comparing it with the one endowed by [24]. In particular, the conditions under which M_µ forms a proper submanifold of M remain to be identified. Neither have we addressed topological issues for the well-definedness of conjugate connections on such infinite-dimensional manifolds. We refer the readers to [23], who investigated whether the entire family of α-connections is well-defined for M endowed with the same topology.
The structure of the rest of the paper is as follows. Section 2 will deal with information geometry under the infinite-dimensional setting and Section 3 under the finite-dimensional setting. For ease of presentation, results will be provided in the main text, while their proofs will be deferred to Section 4. Section 5 closes with a discussion of the implications of the current framework. A preliminary report of this work was presented to IGAIA2 (Tokyo) and appeared in [47].

Information Geometry on Infinite-Dimensional Function Space
In this section, we first review the basic apparatus of the differentiable manifold with particular emphasis paid to the infinite-dimensional (non-parametric) setting (Section 2.1). We then define a family of divergence functionals based on convex analysis (Section 2.2) and use them to induce the dual Riemannian geometry on the infinite-dimensional manifold (Section 2.3). The section is concluded with an investigation of a special case of homogeneous divergence, called (α, β)-divergence, in which the two parameters play distinct, but interrelated, roles for referential duality and representational duality, thereby generalizing the familiar α-divergence in a sensible way (Section 2.4).

Differentiable Manifold in the Infinite-Dimensional Setting
Let U be an open set on the base manifold, M, containing a representative point, x_0, and F : U → R, a smooth function defined on this local patch, U ⊂ M. The set of smooth functions on M is denoted F(M). A curve, t → x(t), on the manifold is a collection of points, {x(t) ∈ U : t ∈ [0, 1]}, whereas a tangent vector (or simply "vector"), v at x_0 ∈ U, represents an equivalence class of curves passing through x_0 = x(0), all with the same direction and speed as specified by the vector, v = dx/dt|_{t=0}. We use T_{x_0}(M) to denote the space of all tangent vectors ("tangent space") at a given x_0; it is obviously a vector space. The tangent manifold, T M, is then the collection of tangent spaces for all points on M: T M = {∪T_x(M), x ∈ M}. A vector field, v(x), is the association of a vector, v, at each point, x, of the manifold, M; it is a cross-section of T M. The set of all smooth vector fields on M is denoted Σ(M). The tangent vector, v, acting on a function, F, will yield a scalar, denoted d_v F, called the directional derivative of F:

$$ d_v F = \lim_{t \to 0} \frac{F(x(t)) - F(x(0))}{t} $$

The tangent vector, v, acting on a vector field, u(x), is defined analogously:

$$ d_v u = \lim_{t \to 0} \frac{u(x(t)) - u(x(0))}{t} $$

In our setting, given a measure space, (X, µ), where samples are drawn from the set X and µ is the background measure, we call any function that maps X → R a ζ-function. The set of all ζ-functions forms a vector space, where vector addition is point-wise: (f_1 + f_2)(ζ) = f_1(ζ) + f_2(ζ), and scalar multiplication is simple multiplication: (cf)(ζ) = c f(ζ). We now consider the set of all ζ-functions with common support µ, which is assumed to form a manifold denoted as M_µ. A typical point of this manifold denotes a specific ζ-function, p(ζ) : ζ → p(ζ), defined over X, the sample space, which is infinite dimensional or even uncountable in general. Under suitable topology (e.g., [24]), all points, M_µ, form a manifold. On this manifold, any function, F : p → F(p), is referred to (in this paper) as a ζ-functional, because it takes in a ζ-function p(·) and outputs a scalar. The set of ζ-functionals on M is denoted F(M). (Note that "ζ-function" and "ζ-functional" are both functions (also called "maps" or "mappings") in the mathematical sense, with pre-specified domains and ranges. We make the distinction that the ζ-function refers to a real-valued function (e.g., density functions, random variables) defined on the sample space, X, and ζ-functional refers to a mapping from one or more ζ-functions to a real number.) A curve on M passing through a typical point, p, is nothing but a one-parameter family of ζ-functions, denoted as p(ζ|t), with p(ζ|0) = p. Here, ·|t is read as "given t", "indexed by t" or "parameterized by t": a one-parameter family of ζ-functions, p(ζ|t), is formed as t varies. For each fixed t, p(ζ|t) is a function, X × I → R. More generally, p(ζ|θ), where θ = [θ^1, …, θ^n] ∈ R^n, is a ζ-function indexed by n parameters. As θ varies, p(ζ|θ) represents a finite dimensional submanifold, M_θ ⊂ M, where:

$$ \mathcal{M}_\theta = \{\, p(\zeta|\theta) \in \mathcal{M} : \theta \in \mathbb{R}^n \,\} $$

In this paper, they are referred to as parametric models (and parametric statistical model if p(ζ|θ) is normalized and positive-valued).
In the infinite-dimensional setting, the following tangent vector, v:

$$ v(\zeta) = \frac{\partial p(\zeta|t)}{\partial t}\bigg|_{t=0} $$

is also a ζ-function. When the tangent vector, v, operates on the ζ-functional F(p):

$$ d_v F(p) = \lim_{t \to 0} \frac{F(p(\cdot|t)) - F(p(\cdot|0))}{t} $$

the outcome is another ζ-functional of both p(ζ) and v(ζ) and linear in the latter. A particular ζ-functional of interest in this paper is of the following form:

$$ F(p) = \int_{\mathcal{X}} f(p(\zeta))\, d\mu = E_\mu\{ f(p) \} $$

where f : R → R is a strictly convex function defined on the real line. In this case, to first order in t, p(ζ|t) = p(ζ) + v(ζ) t, so:

$$ d_v F(p) = \int_{\mathcal{X}} f'(p(\zeta))\, v(\zeta)\, d\mu = E_\mu\{ f'(p)\, v \} $$

which is linear in v(·).
A vector field, as a cross-section of T M, takes p(ζ) and associates a ζ-function. We denote a vector field as u(ζ|p) ∈ Σ(M), where the variable following the "|" sign indicates that u depends on the point, p(ζ), an element of the base manifold, M (we could also write it as u(p(ζ))(ζ) or u_p(ζ)). Though the vector fields defined above are not necessarily smooth, we will concentrate on smooth ones below. Of particular interest to us is the vector field, ρ(p(ζ)), for some strictly increasing function, ρ : R → R.
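The directional derivative of an integral-type ζ-functional can be illustrated on a finite sample space, where a ζ-function is just a list of values. The sketch below assumes a hypothetical three-atom background measure, the strictly convex choice f = exp, and the closed form "integral of f'(p) v against µ"; note that the ζ-function p is allowed to take negative values, in line with dropping the positivity constraint.

```python
import math

mu = [0.5, 0.25, 0.25]   # hypothetical background measure on 3 atoms
p = [0.8, -0.3, 1.5]     # a zeta-function; positivity is not required
v = [1.0, 2.0, -0.5]     # tangent vector, itself a zeta-function

def functional(p_values, f=math.exp):
    """F(p): integral of f(p) against mu, with f strictly convex (exp here)."""
    return sum(m * f(x) for m, x in zip(mu, p_values))

def directional_derivative_closed_form(p_values, v_values, f_prime=math.exp):
    """Integral of f'(p) * v against mu."""
    return sum(m * f_prime(x) * w for m, x, w in zip(mu, p_values, v_values))

def directional_derivative_numeric(p_values, v_values, t=1e-6):
    """Central difference of F along the straight curve p + t * v."""
    forward = functional([x + t * w for x, w in zip(p_values, v_values)])
    backward = functional([x - t * w for x, w in zip(p_values, v_values)])
    return (forward - backward) / (2 * t)

closed = directional_derivative_closed_form(p, v)
numeric = directional_derivative_numeric(p, v)
print(closed, numeric)
```

Doubling every entry of `v` doubles the closed-form value, which is the linearity in the tangent vector noted in the text.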
Differentiation of smooth vector fields can be defined analogously. The directional derivative, d_v u, of a vector field, u(ζ|p), which is a ζ-function also dependent on p(ζ), in the direction of v = v(ζ), which is another ζ-function, is:

$$ d_v u(\zeta|p) = \lim_{t \to 0} \frac{u(\zeta|p + tv) - u(\zeta|p)}{t}. $$

Note that d_v u is another ζ-function; that is why we can write d_v u(ζ|p) also as (d_v u)(ζ). As an example, the derivative of the vector field, ρ(p(ζ)), where ρ : R → R, in the direction of v(ζ) is:

$$ d_v \rho(p(\zeta)) = \rho'(p(\zeta)) \, v(\zeta). $$

With differentiation of vector fields defined, one can define the covariant derivative operation, ∇_w. When operating on a ζ-functional, the covariant derivative is simply the directional derivative (along direction w):

$$ \nabla_w F(p) = d_w F(p); $$

when operating on a vector field, say u(ζ|p), ∇_w is defined as (see [46]):

$$ \nabla_w u = d_w u + B(w, u) \tag{42} $$

where B : Σ(M) × Σ(M) → Σ(M) is a ζ-function, which is bilinear in the two tangent vectors (ζ-functions), w and u; it is the infinite-dimensional counterpart of the Christoffel symbol, Γ (for finite dimensions). We denote the conjugate covariant derivative, ∇*_w (as defined by Equation (22)) in terms of B* (with an asterisk denoting conjugacy):

$$ (\nabla^*_w u)(\zeta) = (d_w u)(\zeta) + B^*(w, u)(\zeta) $$

(here, we write out the explicit dependency on ζ). The Riemann curvature tensor, R, which measures the curvature of a connection, ∇ (as specified by B), is defined by the map, R : Σ(M) × Σ(M) × Σ(M) → Σ(M), where:

$$ R(u, v, w) = \nabla_u (\nabla_v w) - \nabla_v (\nabla_u w) - \nabla_{[u,v]} w. \tag{44} $$

The torsion tensor, T : Σ(M) × Σ(M) → Σ(M), is given by:

$$ T(u, v) = \nabla_u v - \nabla_v u - [u, v]. \tag{46} $$

2.2. D^(α)-Divergence, a Family of Generalized Divergence Functionals

Divergence functionals are defined with respect to a pair of ζ-functions in an infinite-dimensional function space. A divergence functional, D : M × M → R_+, maps two ζ-functions to a non-negative real number. To the extent that ζ-functions can be parameterized by finite-dimensional vectors, θ ⊆ R^n, a divergence functional on M × M will implicitly induce a divergence function on the parameter space, which is a subset of R^n × R^n. In this section, we will discuss the general form of the divergence functional and the associated infinite-dimensional manifold. Finite-dimensional embedding of ζ-functions (i.e., parametric models) will be discussed in Section 3.

Fundamental Convex Inequality and Divergence
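We start our exposition by reviewing the notion of a convex function on the real line, f : R → R. We recall the fundamental convex inequality that defines a strictly convex function, f:

$$ \frac{1-\alpha}{2} f(\gamma) + \frac{1+\alpha}{2} f(\delta) - f\!\left( \frac{1-\alpha}{2} \gamma + \frac{1+\alpha}{2} \delta \right) \ge 0 $$

for all γ, δ ∈ R, with equality holding, if and only if γ = δ, for all α ∈ (−1, 1). Geometrically, the value of the function, f, at any point, ε, in between two end points, γ and δ, lies on or below the chord connecting its values at these two points. This property of a strictly convex function can also be stated in elementary algebra as the Chord Theorem, namely:

$$ L(\gamma, \epsilon) \le L(\gamma, \delta) \le L(\epsilon, \delta) $$

where:

$$ L(\gamma, \delta) = \frac{f(\gamma) - f(\delta)}{\gamma - \delta} $$

(here, we assumed γ ≤ ε ≤ δ without loss of generality). In fact, the slope, L(γ, δ), is an increasing function in both δ and γ. The slopes for the chords connecting from the midpoint, ε = ((1−α)/2) γ + ((1+α)/2) δ, to either end point are, respectively:

$$ L^{(\alpha)}(\gamma, \delta) = \frac{2}{1-\alpha} \, \frac{f(\delta) - f(\epsilon)}{\delta - \gamma}, \qquad \tilde{L}^{(\alpha)}(\gamma, \delta) = \frac{2}{1+\alpha} \, \frac{f(\epsilon) - f(\gamma)}{\delta - \gamma} $$

with skew symmetry:

$$ L^{(\alpha)}(\gamma, \delta) = \tilde{L}^{(-\alpha)}(\delta, \gamma). $$

As α : −1 → 1 (i.e., as the point ε moves from γ to δ, the two fixed ends), both L^(α)(γ, δ) and L̃^(α)(γ, δ) are increasing functions of α, but the chord theorem dictates that the latter is always no greater than the former. In fact, their difference has a non-negative value:

$$ L^{(\alpha)}(\gamma, \delta) - \tilde{L}^{(\alpha)}(\gamma, \delta) = \frac{4}{1-\alpha^2} \, \frac{1}{\delta - \gamma} \left[ \frac{1-\alpha}{2} f(\gamma) + \frac{1+\alpha}{2} f(\delta) - f\!\left( \frac{1-\alpha}{2} \gamma + \frac{1+\alpha}{2} \delta \right) \right] \ge 0. $$

Though the above is obviously valid for α ∈ [−1, 1], it can be shown that it is also valid for any α ∈ R.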
The fundamental convex inequality applies to any two real numbers, γ, δ. We can treat γ, δ as the values of two functions, p, q : X → R, evaluated at any particular sample point, ζ, that is, γ = p(ζ), δ = q(ζ). This allows us to define the following family of divergence functionals (see [48]).
PROPOSITION 1. Let f : R → R be smooth and strictly convex, and ρ : R → R be strictly increasing. For any two ζ-functions, p, q, and any α ∈ R:

$$ D^{(\alpha)}_{f,\rho}(p, q) = \frac{4}{1-\alpha^2} \int_{\mathcal{X}} \left\{ \frac{1-\alpha}{2} f(\rho(p)) + \frac{1+\alpha}{2} f(\rho(q)) - f\!\left( \frac{1-\alpha}{2} \rho(p) + \frac{1+\alpha}{2} \rho(q) \right) \right\} d\mu \tag{54} $$

is non-negative and equals zero, if and only if p = q almost everywhere.

Proof. See Section 4.

Proposition 1 constructs a family (parameterized by α) of divergence functionals, D^(α), for two ζ-functions, in which referential duality is embodied as:

$$ D^{(\alpha)}_{f,\rho}(p, q) = D^{(-\alpha)}_{f,\rho}(q, p). $$

Its definition involves a strictly increasing function ρ, which can be taken to be the identity function if necessary. The reason ρ is introduced will be clear in the next subsection, where we introduce the notion of conjugate-scaled representations. Furthermore, in order to ensure that the integrals in Equation (54) are well defined, we restrict p, q to the set of ζ-functions for which these integrals are finite.

D^(α)-divergence was first introduced in [48]. It generalizes the familiar α-divergence, Equation (18): take f(p) = e^p and ρ(p) = log p; then D^(α)_{f,ρ}(p, q) = A^(α)(p, q). D^(α)-divergence becomes the U-divergence [15] when f(p) = U(p), ρ(p) = (U′)^{-1}(p), α → 1 for any strictly convex and strictly increasing U : R_+ → R. It is well known that U-divergence, for a suitable choice of U, specializes to β-divergence [49], and that both α- and β-divergence specialize to the Kullback-Leibler divergence as α → ±1 and β → 1, respectively.
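The construction in Proposition 1 can be checked numerically on a finite sample space. The sketch below is illustrative only: it assumes that the divergence has the standard form of a scaled gap in the fundamental convex inequality, that the measure µ is a list of weights, and that the classical α-divergence is taken in its extended form for positive, unnormalized functions. The names `d_alpha`, `alpha_divergence` and `mu` are hypothetical and not from the paper.

```python
import math

def d_alpha(p, q, alpha, f=math.exp, rho=math.log, mu=None):
    """Sketch of D^(alpha)_{f,rho}(p, q) on a finite sample space.

    p, q  : lists of function values p(zeta), q(zeta)
    mu    : weights of the background measure (counting measure if None)
    alpha : any real number other than +1 and -1
    """
    if mu is None:
        mu = [1.0] * len(p)
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    total = 0.0
    for pz, qz, w in zip(p, q, mu):
        rp, rq = rho(pz), rho(qz)
        # pointwise gap in the fundamental convex inequality
        total += w * (a * f(rp) + b * f(rq) - f(a * rp + b * rq))
    return 4.0 / (1.0 - alpha ** 2) * total

def alpha_divergence(p, q, alpha):
    """Extended alpha-divergence for positive (unnormalized) functions."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    gap = sum(a * pz + b * qz - pz ** a * qz ** b for pz, qz in zip(p, q))
    return 4.0 / (1.0 - alpha ** 2) * gap

p = [0.2, 0.5, 1.3]
q = [0.4, 0.1, 0.9]
for alpha in (-0.5, 0.0, 0.3, 2.0):
    print(alpha, d_alpha(p, q, alpha), alpha_divergence(p, q, alpha))
```

With the default choice f = exp and ρ = log, the two columns printed for each α should agree up to rounding, and the value should stay non-negative even for α = 2, outside the interval (−1, 1). Swapping the arguments while flipping the sign of α leaves the value unchanged.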

Conjugate-Scaled Representations of Measurable Functions
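In one dimension, any strictly convex function, f : R → R, can be written as an integral of a strictly increasing function, g, and vice versa: f(δ) = ∫_γ^δ g(t) dt + f(γ), with g = f′. Its convex conjugate, f*, has the integral expression f*(λ) = ∫_{g(γ)}^λ g^{-1}(t) dt + f*(g(γ)), with g^{-1} also strictly monotonic and γ, δ, λ ∈ R. (Here, the monotonicity condition replaces the requirement of a positive semi-definite Hessian in the case of a convex function of several variables.) The Legendre-Fenchel inequality:

$$ f(\delta) + f^*(\lambda) - \delta \lambda \ge 0 $$

can be cast as Young's inequality:

$$ \int_\gamma^\delta g(t) \, dt + \int_{g(\gamma)}^{\lambda} g^{-1}(t) \, dt + \gamma \, g(\gamma) \ge \delta \lambda $$

with equality holding, if and only if λ = g(δ). The conjugate function, f*, which is also strictly convex, satisfies (f*)* = f and (f*)′ = (f′)^{-1}.

We introduce the notion of ρ-representation of a ζ-function p(·) by defining a mapping, p → ρ(p), for a strictly increasing function, ρ : R → R. We say that a τ-representation of a ζ-function, p → τ(p), is conjugate to the ρ-representation with respect to a smooth and strictly convex function, f : R → R, if:

$$ \tau(p) = f'(\rho(p)) = ((f^*)')^{-1}(\rho(p)) \longleftrightarrow \rho(p) = (f')^{-1}(\tau(p)) = (f^*)'(\tau(p)). $$

As an example, we may let ρ(p) = l^(α)(p) be the α-representation, where l^(α) is given by Equation (13), and the conjugate representation is the (−α)-representation, τ(p) = l^(−α)(p). In this case, f and f* take the forms given in Equation (65), so that f(ρ(p)) and f*(τ(p)) are both linear in p.

More generally, strictly increasing functions from R → R form a group, with functional composition as the group composition operation and the functional inverse as the group inverse operation. That is, (i) for any two strictly increasing functions, ρ_1, ρ_2, their functional composition ρ_2 ∘ ρ_1 is strictly increasing; (ii) the functional inverse, ρ^{-1}, of any strictly increasing function, ρ, is also strictly increasing; (iii) there exists a strictly increasing function, ι, the identity function, such that ρ ∘ ι = ι ∘ ρ = ρ. From this perspective, f′ = τ ∘ ρ^{-1} and (f*)′ = ρ ∘ τ^{-1}, encountered above, are themselves two mutually inverse strictly increasing functions. If, in the above discussions, f′ = τ ∘ ρ^{-1} is further assumed to be strictly convex, that is:

$$ \frac{1-\alpha}{2} \tau(\rho^{-1}(\gamma)) + \frac{1+\alpha}{2} \tau(\rho^{-1}(\delta)) \ge \tau\!\left( \rho^{-1}\!\left( \frac{1-\alpha}{2} \gamma + \frac{1+\alpha}{2} \delta \right) \right) $$

for any γ, δ ∈ R and α ∈ (−1, 1), then by taking τ^{-1} on both sides of the inequality and renaming ρ^{-1}(γ) as γ and ρ^{-1}(δ) as δ, we obtain:

$$ \tau^{-1}\!\left( \frac{1-\alpha}{2} \tau(\gamma) + \frac{1+\alpha}{2} \tau(\delta) \right) \ge \rho^{-1}\!\left( \frac{1-\alpha}{2} \rho(\gamma) + \frac{1+\alpha}{2} \rho(\delta) \right). $$

This is to say:

$$ M^{(\alpha)}_\tau(\gamma, \delta) \ge M^{(\alpha)}_\rho(\gamma, \delta) $$

with equality holding, if and only if γ = δ, where:

$$ M^{(\alpha)}_\rho(\gamma, \delta) = \rho^{-1}\!\left( \frac{1-\alpha}{2} \rho(\gamma) + \frac{1+\alpha}{2} \rho(\delta) \right) \tag{70} $$

is the quasi-linear mean of two numbers γ, δ. Therefore, the integral over X of the difference M^(α)_τ(p, q) − M^(α)_ρ(p, q), suitably normalized, is also a divergence functional (see more discussions in Section 2.4).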
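The comparison of quasi-linear means can be illustrated with a small sketch. It assumes the usual form of the quasi-linear mean, ρ^{-1} applied to the convex mixture of ρ(γ) and ρ(δ), and picks ρ = log with τ the identity, so that τ ∘ ρ^{-1} = exp is strictly convex. The helper name `quasi_linear_mean` is hypothetical.

```python
import math

def quasi_linear_mean(gamma, delta, alpha, rho, rho_inv):
    """Quasi-linear mean of gamma and delta with weights (1-alpha)/2, (1+alpha)/2."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return rho_inv(a * rho(gamma) + b * rho(delta))

def identity(t):
    return t

gamma, delta = 2.0, 8.0
for alpha in (-0.9, -0.3, 0.0, 0.5):
    # tau = identity gives the weighted arithmetic mean
    m_tau = quasi_linear_mean(gamma, delta, alpha, identity, identity)
    # rho = log gives the weighted geometric mean
    m_rho = quasi_linear_mean(gamma, delta, alpha, math.log, math.exp)
    print(alpha, m_tau, m_rho, m_tau - m_rho)
```

In this special case the inequality reduces to the weighted arithmetic-geometric mean inequality, so the last printed column should be positive whenever γ ≠ δ.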

Canonical Divergence
The use of a conjugate pair of strictly convex functions, f and f*, allows us to define, in parallel with D^(α)_{f,ρ}(p, q) given in Equation (54), the conjugate family, D^(α)_{f*,τ}(p, q). The two families turn out to have the same form when α = ±1; this is the so-called canonical divergence.
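A numerical sketch of the α → 1 limit follows. It assumes that the canonical divergence takes the Legendre-Fenchel form ∫ {f(ρ(p)) + f*(τ(q)) − ρ(p) τ(q)} dµ, and it uses the specific choice f = exp, ρ = log, for which τ(q) = f′(ρ(q)) = q and f*(s) = s log s − s. All function names are hypothetical, and the finite sample space with counting measure is an assumption made for illustration.

```python
import math

def d_alpha(p, q, alpha):
    """D^(alpha) with f = exp, rho = log on a finite sample space."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    gap = sum(a * pz + b * qz - math.exp(a * math.log(pz) + b * math.log(qz))
              for pz, qz in zip(p, q))
    return 4.0 / (1.0 - alpha ** 2) * gap

def canonical_divergence(p, q):
    """Assumed form: sum of f(rho(p)) + f*(tau(q)) - rho(p) * tau(q)."""
    f = math.exp
    f_star = lambda s: s * math.log(s) - s   # convex conjugate of exp
    rho = math.log
    tau = lambda t: t                        # tau = f'(rho(t)) = t
    return sum(f(rho(pz)) + f_star(tau(qz)) - rho(pz) * tau(qz)
               for pz, qz in zip(p, q))

def extended_kl(q, p):
    """Extended Kullback-Leibler divergence for unnormalized functions."""
    return sum(qz * math.log(qz / pz) - qz + pz for pz, qz in zip(p, q))

p = [0.2, 0.5, 1.3]
q = [0.4, 0.1, 0.9]
print(canonical_divergence(p, q), extended_kl(q, p), d_alpha(p, q, 1 - 1e-6))
```

Under these assumptions the three printed numbers should coincide up to a small error: the Legendre-Fenchel form reduces to an extended Kullback-Leibler divergence, and D^(α) approaches it as α tends to 1.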

Geometry Induced by the D (α) -Divergence
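The last two sections showed that the divergence functional, D^(α), we constructed on M according to Equation (54) generalizes the α-divergence in a sensible way. Now, we investigate the metric and conjugate connections that such divergence functionals induce; this is accomplished by invoking the Eguchi relations, Equations (26)-(28).

PROPOSITION 2. At any given p ∈ M and for any vector fields, u, v ∈ Σ(M): (i) the metric tensor field, g : Σ(M) × Σ(M) → F(M), is given by:

$$ g(u, v) = \int_{\mathcal{X}} g(p(\zeta)) \, u(\zeta|p) \, v(\zeta|p) \, d\mu \tag{82} $$

where:

$$ g(t) = f''(\rho(t)) \, (\rho'(t))^2; \tag{83} $$

(ii) the family of covariant derivatives (connections) ∇^(α) : Σ(M) × Σ(M) → Σ(M) is given as:

$$ \nabla^{(\alpha)}_w u = \left( d_w u + b^{(\alpha)}(p(\zeta)) \, u(\zeta|p) \, w(\zeta|p) \right)(\zeta) \tag{84} $$

where:

$$ b^{(\alpha)}(t) = \frac{1-\alpha}{2} \, \frac{f'''(\rho(t)) \, \rho'(t)}{f''(\rho(t))} + \frac{\rho''(t)}{\rho'(t)}; \tag{85} $$

(iii) the family of conjugate covariant derivatives is:

$$ \nabla^{*(\alpha)}_w u = \left( d_w u + b^{(-\alpha)}(p(\zeta)) \, u(\zeta|p) \, w(\zeta|p) \right)(\zeta). \tag{86} $$

Proof. See Section 4.

Note that the g(·) term in Equation (82) and the b^(α)(·) term in the covariant derivatives, Equation (84), depend on p, the point on the base manifold, where the metric and covariant derivatives are evaluated. They both depend on the auxiliary "scaling functions", f and ρ. We may cast them into an equivalent, dually symmetric form as follows.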
COROLLARY 3. The g(·) function in expressing the metric, Equation (82), and b^(α)(·) in expressing the covariant derivatives, Equation (84), can be expressed in dualistic forms:

$$ g(t) = \rho'(t) \, \tau'(t) \tag{87} $$

and:

$$ b^{(\alpha)}(t) = \frac{d}{dt} \left( \frac{1+\alpha}{2} \log \rho'(t) + \frac{1-\alpha}{2} \log \tau'(t) \right). \tag{88} $$

Proof. See Section 4.

Corollary 3 makes it immediately evident that the Riemannian metrics induced by D^(α)_{f,ρ}(p, q) and by D^(α)_{f*,τ}(p, q) are identical for all α values, while the connections (covariant derivatives) induced by the two families of divergence are conjugate to each other, expressed as α ←→ −α. This implies that the conjugacy embodied by the definition of the pair of connections is related to both referential duality and representational duality.
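The statement that the induced metric does not depend on α can be probed with a finite-difference sketch at a single sample point. The assumptions here are that the Eguchi relation gives the metric density as minus the mixed second derivative of the pointwise divergence at q = p, and that this density equals f″(ρ(p)) ρ′(p)², or equivalently ρ′(p) τ′(p). The functions `d_point` and `metric_density` and the step size `h` are hypothetical choices for illustration.

```python
import math

def d_point(p, q, alpha, f, rho):
    """Pointwise integrand of D^(alpha)_{f,rho}, including the 4/(1-alpha^2) factor."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return 4.0 / (1.0 - alpha ** 2) * (
        a * f(rho(p)) + b * f(rho(q)) - f(a * rho(p) + b * rho(q)))

def metric_density(p, alpha, f, rho, h=1e-4):
    """Minus the mixed derivative d^2 D / (dp dq) at q = p, by central differences."""
    mixed = (d_point(p + h, p + h, alpha, f, rho)
             - d_point(p + h, p - h, alpha, f, rho)
             - d_point(p - h, p + h, alpha, f, rho)
             + d_point(p - h, p - h, alpha, f, rho)) / (4.0 * h * h)
    return -mixed

# Case 1: f = exp, rho = log, so rho' * tau' = (1/p) * 1 (Fisher-type density).
print(metric_density(0.7, 0.3, math.exp, math.log), 1 / 0.7)

# Case 2: f(t) = t^2 / 2, rho(p) = p^3 + p, so tau = rho and rho' * tau' = (3 p^2 + 1)^2.
half_square = lambda t: 0.5 * t * t
cubic = lambda t: t ** 3 + t
print(metric_density(0.7, 0.3, half_square, cubic), (3 * 0.7 ** 2 + 1) ** 2)
print(metric_density(0.7, -0.6, half_square, cubic))  # a different alpha
```

In each case the finite-difference value should be close to the closed-form density printed next to it, and changing α should not change the result beyond discretization error.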
It can be proven that covariant derivatives of the form of Equation (84) are both curvature-free and torsion-free.

PROPOSITION 4. For the entire family of covariant derivatives indexed by α (α ∈ R): (i) the Riemann curvature tensor R^(α)(u, v, w) ≡ 0; (ii) the torsion tensor T^(α)(u, v) ≡ 0.
In other words, the manifold, M, has zero-curvature and zero-torsion for all α.As such, it can serve as an ambient manifold to embed the manifold, M µ , of non-parametric probability density functions and the manifold, M θ , of parametric density functions, and any curvature on M µ or M θ may be interpreted as arising from embedding or restriction to a lower dimensional space.See, also, [50] for a discussion of curvatures of statistical manifolds.

Homogeneous (α, β)-Divergence and the Induced Geometry
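Suppose that f is, in addition to being strictly convex, strictly increasing. We may set ρ(t) = f^{-1}(εt) ←→ f(t) = ερ^{-1}(t), so that the divergence functional becomes:

$$ D^{(\alpha)}_\rho(p, q) = \frac{4\epsilon}{1-\alpha^2} \int_{\mathcal{X}} \left\{ \frac{1-\alpha}{2} p + \frac{1+\alpha}{2} q - \rho^{-1}\!\left( \frac{1-\alpha}{2} \rho(p) + \frac{1+\alpha}{2} \rho(q) \right) \right\} d\mu. $$

Now, the second term in the integrand is just the quasi-linear mean, M_ρ, introduced in Equation (70), where ρ is strictly increasing and concave here. As an example, take ρ(p) = log p, ε = 1; then M^(α)_ρ(p, q) = p^{(1−α)/2} q^{(1+α)/2}, and D^(α)_ρ(p, q) is the α-divergence, Equation (18), while: is an immediate generalization of the extended Kullback-Leibler divergence in Equation (16).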
With respect to the geometry induced from the (α, β)-divergence of Equation (92), we have the following result.
PROPOSITION 5. The metric g and affine connections (covariant derivatives) ∇^(α,β) corresponding to the (α, β)-divergence are given by:

$$ g(u, v) = \int_{\mathcal{X}} \frac{1}{p} \, u \, v \, d\mu, \qquad \nabla^{(\alpha,\beta)}_w u = d_w u - \frac{1 + \alpha\beta}{2p} \, u \, w $$

where u, v ∈ Σ(M) and p = p(ζ) is the point at which g and ∇ are evaluated.

Proof. The proof is immediate upon substituting Equations (64) and (65) into Equations (83) and (85).

This is to say, with respect to the (α, β)-divergence, the product of the two parameters, αβ, acts as the "alpha" parameter in the family of induced connections, so:

$$ \nabla^{*(\alpha,\beta)} = \nabla^{(-\alpha,\beta)} = \nabla^{(\alpha,-\beta)}. $$

Taking the limit β → 1 of ∇^(α,β) yields Amari's one-parameter family of α-connections in the infinite-dimensional setting, taking the very simple form:

$$ \nabla^{(\alpha)}_w u = d_w u - \frac{1+\alpha}{2p} \, u \, w. $$

The same is true for the limit α → 1 of ∇^(α,β) (the connections are indexed by β, of course).

Finite-Dimensional Parametric Models
Now, we restrict attention to a finite-dimensional submanifold of measurable functions whose ρ-representations are parameterized using θ = [θ^1, …, θ^n] ⊆ R^n. In this case, the divergence functional of the two functions, p and q, assumed to be specified, respectively, by θ_p and θ_q in the parametric model, becomes an implicit function of θ_p, θ_q. In other words, through introducing parametric models (i.e., a finite-dimensional submanifold) of the infinite-dimensional manifold of measurable functions, we arrive at a divergence function defined ("pulled back") over the vector space. We denote the ρ-representation of a parameterized measurable function as ρ(p(ζ|θ)), and the corresponding divergence function by D(θ_p, θ_q). It is important to realize that, while f(·) is strictly convex, F(p) = ∫_X f(p(ζ|θ)) dµ is not at all convex in θ in general.

Riemannian Geometry of Parametric Models
The parametric family of functions, p(ζ|θ), forms a submanifold of M defined by: where p(ζ|θ) is a ζ-function indexed by θ, i.e., θ is treated as a parameter to specify a ζ-function.M θ is a finite-dimensional submanifold of M. We also denote the manifold of a parametric statistical model as: The θ values themselves, called the natural parameter of the parametric (statistical) model, p(•|θ), are coordinates for M θ (or M θ ).The tangent vector fields, u, v, w, of M in the directions that are also tangent for M θ (or M θ ) take the form: The following proposition gives the metric and the family of α-connections in the parametric case.
An immediate consequence of this corollary is as follows. If we construct the divergence function, D_{f*,τ}(θ_p, θ_q), then the induced metric, g̃_ij, and the induced conjugate connections, Γ̃ and Γ̃*, are related to those induced from D_{f,ρ}(θ_p, θ_q) (and denoted without the ˜) via: with: So, the difference between using D_{f,ρ}(θ_p, θ_q) and D_{f*,τ}(θ_p, θ_q) reflects a conjugacy in the ρ- and τ-scalings of p(ζ|θ). Corollary 7 says that the conjugacy in the connection pair Γ ←→ Γ* reflects, in addition to the referential duality θ_p ←→ θ_q, the representational duality between ρ-scaling and τ-scaling of a ζ-function:

Affine Embedded Submanifold
We now define the notion of ρ-affinity. A parametric model, p(ζ|θ), is said to be ρ-affine if its ρ-representation can be embedded into a finite-dimensional affine space, i.e., if there exists a set of linearly independent functions λ_i(ζ) over the same support, X ∋ ζ, such that:

$$ \rho(p(\zeta|\theta)) = \sum_i \theta^i \lambda_i(\zeta). \tag{124} $$

As noted in Section 3.1.1, the parameter θ is its natural parameter. For any measurable function, p(ζ), the projection of its τ-representation onto the functions λ_i(ζ):

$$ \eta_i = \int_{\mathcal{X}} \tau(p(\zeta)) \, \lambda_i(\zeta) \, d\mu $$

forms a vector, η = [η_1, …, η_n]. We call η the expectation parameter of p(ζ), and the functions λ_i(ζ) the affine basis functions.
The above notion of ρ-affinity is a generalization of α-affine manifolds [1,2], where ρ- and τ-representations are just α- and (−α)-representations, respectively. Note that elements of the ρ-affine manifold may not be a probability model; rather, after denormalization, probability models can become ρ-affine. The issue of normalization will be discussed in Section 5.
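PROPOSITION 9. For a ρ-affine parametric model: (i) the function:

$$ \Phi(\theta) = \int_{\mathcal{X}} f(\rho(p(\zeta|\theta))) \, d\mu $$

is strictly convex; (ii) the divergence functional, D^(α)_{f,ρ}(p, q), takes the form of the divergence function:

$$ D^{(\alpha)}_\Phi(\theta_p, \theta_q) = \frac{4}{1-\alpha^2} \left[ \frac{1-\alpha}{2} \Phi(\theta_p) + \frac{1+\alpha}{2} \Phi(\theta_q) - \Phi\!\left( \frac{1-\alpha}{2} \theta_p + \frac{1+\alpha}{2} \theta_q \right) \right]; $$

(iii) the metric tensor, affine connections and the Riemann curvature tensor take the forms:

$$ g_{ij}(\theta) = \Phi_{ij}, \qquad \Gamma^{(\alpha)}_{ij,k}(\theta) = \frac{1-\alpha}{2} \Phi_{ijk}, \qquad \Gamma^{*(\alpha)}_{ij,k}(\theta) = \frac{1+\alpha}{2} \Phi_{ijk}, $$

$$ R^{(\alpha)}_{ij\mu\nu}(\theta) = \frac{1-\alpha^2}{4} \sum_{l,k} \left( \Phi_{il\nu} \Phi_{jk\mu} - \Phi_{il\mu} \Phi_{jk\nu} \right) \Phi^{lk}. $$

Here, Φ_ij, Φ_ijk denote, respectively, second and third partial derivatives of Φ(θ), and Φ^{ij} is the matrix inverse of Φ_ij.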
Proof. See Section 4.

Note that while the function, Φ(θ), can be viewed as the generalized cumulant generating function (or partition function), the function, Φ*(η), is the generalized entropy function. For an exponential family, the two are well known to be in one-to-one correspondence; either can be used to index a density function of the exponential family.
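The Legendre duality between a potential and its conjugate, and the resulting biorthogonality of the two coordinate systems, can be sketched in a toy finite-dimensional setting. The example assumes Φ(θ) = Σ_i exp(θ^i), which is what the potential ∫ f(ρ(p)) dµ becomes for f = exp on a finite sample space with indicator basis functions; this specific model, and all function names, are illustrative assumptions rather than constructions from the paper.

```python
import math

def phi(theta):
    """Toy potential Phi(theta) = sum_i exp(theta_i)."""
    return sum(math.exp(t) for t in theta)

def grad_phi(theta):
    """Expectation-type parameter eta = grad Phi(theta)."""
    return [math.exp(t) for t in theta]

def phi_star(eta):
    """Convex conjugate Phi*(eta) = sum_i (eta_i log eta_i - eta_i)."""
    return sum(e * math.log(e) - e for e in eta)

def grad_phi_star(eta):
    """Natural-type parameter theta = grad Phi*(eta)."""
    return [math.log(e) for e in eta]

def canonical(theta_p, eta_q):
    """Phi(theta_p) + Phi*(eta_q) - <theta_p, eta_q>, non-negative by Fenchel's inequality."""
    return phi(theta_p) + phi_star(eta_q) - sum(t * e for t, e in zip(theta_p, eta_q))

theta_p = [0.1, -0.4, 1.2]
theta_q = [0.6, 0.3, -0.8]
eta_q = grad_phi(theta_q)

print(grad_phi_star(eta_q))               # recovers theta_q up to rounding
print(canonical(theta_q, eta_q))          # zero at matching theta and eta
print(canonical(theta_p, eta_q))          # positive for distinct points
```

The two gradient maps invert each other, the Fenchel gap vanishes exactly when η is the gradient of Φ at θ, and the mixed-coordinate expression is positive otherwise.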
We point out that there are two kinds of duality associated with the divergence defined on a dually flat statistical manifold, one between D^(α)_Φ ←→ D^(−α)_Φ and the other between D^(α)_Φ ←→ D^(α)_{Φ*}. The first kind is related to the duality in the choice of the reference and the comparison status for the two points (θ versus η) for computing the value of the divergence and, hence, called "referential duality". The second kind is related to the duality in the choice of the representation of the point as a vector in the parameter versus gradient space (θ versus η) in the expression of the divergence function and, hence, called "representational duality". More concretely: The biduality is compactly reflected in the canonical divergence as:
PROOF OF PROPOSITION 2. With respect to Equation (54), note that (d_u)_p means that the functional derivative is with respect to p only (point q is treated as fixed): Applying functional derivative (d_v)_q, now with respect to q only, to the above equation yields: Setting p = q and invoking Equation (26) yields Equation (82) with Equation (83). Next, applying (d_w)_p to Equation (152), and realizing that u, v are both vector fields: Setting p = q, invoking Equation (27) and: and realizing that v(ζ|p) can be arbitrary, we have: where we have short-handed ρ for ρ(p(ζ)). Remember that ∇_w u is a ζ-function; the above equation yields: Thus, we obtain Equation (84) with Equation (85). The expression for ∇^{*(α)} is obtained analogously.

PROOF OF COROLLARY 3. From the identities: we obtain Equations (87) and (88) after substitution.

PROOF OF PROPOSITION 4. We first derive a general formula for the Riemann curvature tensor for the infinite-dimensional manifold, since that given by a popular textbook ([46], p. 226) appears to miss some terms. From Equation (42): so that: here d_u B = B_u refers to the derivative on the B-form itself and not on its v, w arguments. The expression for ∇_v(∇_u w) simply exchanges u → v in the above. Now: where [u, v] = d_u v − d_v u is a vector field, such that: Substituting them into Equation (44), we get a general expression of the Riemann curvature tensor in the infinite-dimensional setting: The expression for T(u, v) in Equation (46) becomes: In the current case, B evaluated at p(ζ) is the bilinear form: Substituting this into the above, and realizing that (d_u B)(v, w) is simply (b^(α))′ u v w, we immediately have R^(α)(u, v, w) = 0, as well as T^(α)(u, v) = 0.
PROOF OF PROPOSITION 6. Given Equation (107) as the tangent vector fields for parametric models with holonomic coordinates θ, we note that: so Equation (108) follows. Next, from: we have: It follows that: Note that given holonomic coordinates, Equation (107): Substituting into Equation (84) with Equation (88) yields Equations (113) and (114).

PROOF OF PROPOSITION 9. The assumption, Equation (124), implies that ∂ρ/∂θ^i = λ_i(ζ), so from Equation (108): That the above expression is positive definite is seen by observing: , due to the linear independence of the λ_i components and the strict convexity of f. Hence, Φ(θ) is strictly convex in θ, proving (i). An immediate consequence is that expression (127) is non-negative and vanishes, if and only if θ_p = θ_q. This establishes (ii), i.e., D_Φ(θ_p, θ_q) is a divergence function. Part (iii) follows from a straightforward application of the Eguchi relations, Equations (29)-(31).
PROOF OF COROLLARY 10. First, since f′(ρ(t)) = τ(t), we have the identity: From (126), taking a derivative with respect to θ^i, while noting that p(ζ|θ) satisfies (124), gives: and that: It follows from Equation (131) that Φ*, as defined in (i), is the conjugate of Φ, and that the relation in (ii) is the basic Legendre-Fenchel duality. Finally, the biorthogonality of η and θ as expressed by (iii) also becomes evident on account of (ii).

Discussions
This paper constructs a family of divergence functionals, induced by any smooth and strictly convex function, to measure the non-symmetric "distance" between two measurable functions defined on a sample space. Subject to an arbitrary monotone scaling, the divergence functional induces a Riemannian manifold with a metric tensor generalizing the conventional Fisher information and a pair of conjugate connections generalizing the conventional (±α)-connections. Such manifolds manifest biduality: referential duality (in choosing a reference point) and representational duality (in choosing a monotone scale). The (α, β)-divergence we gave as an example of this bidualistic structure extends the α-divergence, with α and β representing referential duality and representational duality, respectively. It induces the conventional Fisher metric and the conventional α-connection (with αβ as a single parameter). Finally, for the ρ-affine submanifold, a pair of conjugated potentials exist to induce the natural and expectation parameters as biorthogonal coordinates on the manifold.
Our approach demonstrated an intimate connection between convex analysis and information geometry. The divergence functionals (and the divergence functions in the finite-dimensional case) are associated with the fundamental convex inequality of a convex function, f : R → R (or Φ : R^n → R), with the convex mixture coefficient as the α-parameter in the induced geometry. Referential duality is associated with α ←→ −α, and representational duality is associated with the convex conjugacy f ←→ f* (or Φ ←→ Φ*). Thus, our analysis reveals that the e/m-duality and (±1)-duality that were used almost interchangeably in the current literature are not the same thing!
The kind of referential duality (originating from non-symmetric status for a referent and for a comparison object), while common in psychological and behavioral contexts [54,55], has always been implicitly acknowledged in statistics. Formal investigation of such non-symmetry between a reference probability distribution and comparison probability distribution in constructing divergence functions leads to the framework of preferred point geometry [56][57][58][59][60][61]. Preferred point geometry reformulates Amari's [20] expected geometry and Barndorff-Nielsen's [3] observed geometry by studying the product manifold, M_θ × M_θ, formed by an ordered pair of probability densities, (p, q), and defining a family of Riemannian metrics on the product manifold. The precise relation of the preferred point approach with our approach to referential duality needs future exploration.
With respect to representational duality, it is worth mentioning the field of affine differential geometry, which studies hypersurface realization of the dual Riemannian manifold involving a pair of conjugate connections (see [27,28]). References [31][32][33][34] investigated affine immersion of statistical manifolds. References [62][63][64][65] further illuminated a conformal structure when the (normalized) probability density functions undergo the l^(α) embedding. Such an embedding appears in the context of Tsallis statistics, where Shannon entropy and Kullback-Leibler cross-entropy (divergence) are generalized to a one-parameter family of entropy and cross-entropy (see, e.g., [40]). We demonstrated ([48], and here) that the ρ-affine manifold (Section 3.2) has an α-Hessian structure [26], a generalization of the Hessian manifold [66,67]. It remains to be illuminated whether a conformal structure arises for ρ-affine probability density functions after normalization.
It should be noted that, while any divergence function determines uniquely a statistical manifold (in the broad sense of [29]), the converse is not true. Though a statistical manifold equipped with an arbitrary metric tensor and a pair of conjugate, torsion-free connections always admits a divergence function [68], it is not unique in general, except when the connections are dually flat, in which case the divergence is uniquely determined as the canonical divergence. In this sense, there is nothing special about our use of D^(α)-divergence apart from it generalizing familiar divergences (including α-divergence in particular). Rather, D^(α)-divergence is merely a vehicle for us to derive the underlying dual Riemannian geometry. It remains to be elucidated why the convex mixture parameter turns out to be the α-parameter in the family of connections of the induced geometry. It seems that our generalizations of the Fisher metric and of conjugate α-connections hinge on this miraculous identification. Generalization from α-affinity/embedding to ρ-affinity/embedding, and the resulting generalized biorthogonality between natural and expectation parameters, is akin to generalizing L_p space to L_Φ (i.e., Orlicz) space, which is an entirely different matter. Future research will further clarify these fundamental relations between convexity, conjugacy, and duality in non-parametric (and parametric) information geometry.

Conclusions
We constructed an extension of parametric information geometry to the non-parametric setting by studying the manifold M of non-parametric functions on sample space (without positivity and normalization constraints). The generalized Fisher information and α-connections on M are induced by an α-parameterized family of divergence functions, reflecting the fundamental convex inequality associated with any smooth and strictly convex function. Parametric models are recovered as submanifolds of M. We also generalize Amari's α-embedding to an affine submanifold under arbitrary monotonic embedding, and show that its natural and expectation parameters form biorthogonal coordinates, and such a submanifold is dually flat for α = ±1. Our analysis illuminates two different types of duality in information geometry, one concerning the referential status of a point (measurable function) expressed in the divergence function ("referential duality") and the other concerning its representation under an arbitrary monotone scaling ("representational duality").