Conjugate Representations and Characterizing Escort Expectations in Information Geometry

Abstract: Based on the maximum entropy (MaxEnt) principle for a generalized entropy functional and the conjugate representations introduced by Zhang, we reformulate the method of information geometry. For a set of conjugate representations, the associated escort expectation is naturally introduced and characterized by the generalized score function, which has zero escort expectation. Furthermore, we show that the escort expectation induces a conformal divergence.


Introduction
Information geometry (IG) [1,2] is a differential-geometric method based on a Riemannian metric on a statistical manifold, which is constructed from a given parameterized probability distribution function (pdf) p_θ(x). It provides a useful tool to study, for example, the dually flat structures of a statistical manifold. Recently, much effort has been devoted to deformed exponential families of pdfs, in which the standard exponential function exp(x) and its inverse function ln(x) are replaced with a deformed exponential function and its inverse, which is called a deformed logarithmic function. Among the deformed exponential functions, relatively well-known ones include Tsallis' q-deformed exponential [3] and Kaniadakis' κ-deformed exponential [4]. Naudts [5] introduced the so-called ϕ-logarithmic function in terms of a positive increasing function ϕ(x) and studied the generalized thermostatistics. It has been shown that a q-deformed relative entropy is proportional to Amari's α-divergence and is related to the α-geometry on a statistical manifold with constant curvature [6]. To construct a suitable statistical manifold in IG, the α-representation (rep.), or α-immersion, of a pdf is usually used. It is well known that the α-rep. works well for an exponential family but does not necessarily work well for a non-exponential pdf, e.g., a κ-deformed exponential family [4]. A generalization of the α-rep. is the conjugate representations, or (ρ, τ)-reps., introduced by Zhang [7]. He also introduced the (ρ, τ)-divergences from the point of view of "representation duality". By finding a suitable conjugate rep. for the κ-deformed exponential pdf, the IG of the κ-generalized thermostatistics [8] was studied. We further studied the IG structures among the thermodynamic potentials in the κ-thermostatistics [9], in which the escort pdfs and escort expectations play an important role. Zhang [10] further showed that his conjugate reps. also include Naudts' ϕ-logarithm [5] as a special case. In this way, Zhang's conjugate reps. are very useful as a generalization of the α-reps. in IG. Amari [11] showed that (ρ, τ)-divergences generate a dually flat structure in the manifold of positive measures and in that of positive-definite matrices.
In this contribution, we reformulate the IG structures based on Zhang's conjugate reps. and the maximum entropy (MaxEnt) principle for a generalized entropy functional. Our approach differs from the previous works [10,11] in that we relate the ρ-rep. of a pdf p(x) to the Lagrange multipliers in the MaxEnt problem. This enables us to introduce a generalized score function and to characterize the escort expectations.
The rest of the paper is organized as follows. The next section provides the preliminaries on the basics of IG for the exponential families of pdfs. In Section 3, after a brief review of the conjugate reps. introduced by Zhang [7], we reformulate the IG structures based on the conjugate reps. and discuss the maximum entropy (MaxEnt) principle for a generalized entropy. For a set of conjugate reps., the associated generalized score function is introduced. The escort rep. and escort expectation are then naturally induced. Section 4 relates the conformal divergence to the difference of the entropies in terms of the escort expectations. The final section is devoted to our concluding remarks. Throughout the paper, we use the abbreviations ∂_i for ∂/∂θ^i and ∂^i for ∂/∂η_i.

Preliminaries
Information geometry [1,2] provides a useful tool for studying a family S of probability distribution functions (pdfs) p_θ(x) characterized by a set of real parameters θ = (θ^1, θ^2, ..., θ^M). S is called an (M-dimensional) statistical model, and a pdf p_θ(x) of S can be regarded as a point in a differentiable manifold ℳ with local coordinates {θ^i}. ℳ is called a statistical manifold, and a Riemannian metric on ℳ is provided by the Fisher information matrix [2]

    g^F_ij(θ) ≡ E_{p_θ}[ ∂_i ℓ_θ(x) ∂_j ℓ_θ(x) ],    (2)

where ℓ_θ(x) ≡ ln p_θ(x). In this contribution, we assume that g^F is positive definite, and E_{p_θ}[·] stands for the linear expectation with respect to the pdf p_θ(x).
A manifold ℳ is said to be e-flat (exponential-flat) if a coordinate system {θ^i} satisfies

    E_{p_θ}[ ∂_i ∂_j ℓ_θ(x) ∂_k ℓ_θ(x) ] = 0    (3)

identically. Any set of coordinates {θ^i} satisfying (3) is called e-affine coordinates. A well-known example of an e-flat manifold is the exponential family

    S_exp = { p_θ(x) = exp[ Σ_{m=1}^{M} θ^m F_m(x) − Ψ(θ) ] },    (4)

where each F_m(x) is a given function of a random variable x and Ψ(θ) is the normalization factor of the pdf p_θ(x). From the normalization of the pdf p_θ(x), we see that

    E_{p_θ}[ ∂_i ℓ_θ(x) ] = 0,    (5)

and, for the exponential family, we have

    ∂_i ∂_j ℓ_θ(x) = −∂_i ∂_j Ψ(θ),    (6)

which does not depend on x. Hence, the condition (3) is satisfied, and one confirms that the exponential family is e-flat. In addition, for the exponential family, we have

    ∂_i ℓ_θ(x) = F_i(x) − ∂_i Ψ(θ).    (7)

Taking the expectation of both sides and using Equation (5), we see that the m-affine coordinates (η-coordinates) of the exponential family are given by

    η_i ≡ E_{p_θ}[ F_i(x) ] = ∂_i Ψ(θ).    (8)

Accounting for Equations (7) and (8), from definition (2), we obtain

    g^F_ij(θ) = E_{p_θ}[ (F_i(x) − η_i)(F_j(x) − η_j) ],    (9)

which is the covariance matrix for the statistical model S_exp.
A manifold ℳ is said to be m-flat (mixture-flat) if a set of coordinate systems {η_i} satisfies identically. In this case, the set of coordinates {η_i} is called m-affine coordinates.
In a dually flat structure, the θ- and η-coordinates are related by the Legendre transformation where Ψ(θ) and Ψ*(η) are Legendre-Fenchel dual to each other and are called the θ- and η-potential functions, respectively. The canonical divergence function [2] for a pair of pdfs p_{θ_p}(x) and r_{θ_r}(x) can be defined by which is a Bregman divergence with the convex function Ψ(θ).
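As a small numerical illustration of the Legendre duality and the canonical divergence, the sketch below uses a hypothetical one-parameter exponential family on a four-point sample space (the sample space, the statistic F(x) = x, and all function names are assumptions made for illustration). It also assumes one common sign and argument-order convention, D(p‖r) = Ψ(θ_r) + Ψ*(η_p) − θ_r η_p, which may differ from the convention used in the text; under this convention the canonical divergence of an exponential family agrees with the Kullback-Leibler divergence.

```python
import math

# Hypothetical toy model: x in {0, 1, 2, 3}, p_theta(x) = exp(theta * x - Psi(theta)).
XS = [0, 1, 2, 3]


def psi(theta):
    """theta-potential: logarithm of the normalization sum."""
    return math.log(sum(math.exp(theta * x) for x in XS))


def pdf(theta):
    """Probabilities of the toy exponential family."""
    return [math.exp(theta * x - psi(theta)) for x in XS]


def eta(theta):
    """Expectation coordinate eta = E[F(x)] with F(x) = x."""
    return sum(p * x for p, x in zip(pdf(theta), XS))


def psi_star(theta):
    """Legendre dual potential Psi*(eta), evaluated at eta = eta(theta)."""
    return theta * eta(theta) - psi(theta)


def canonical_divergence(theta_p, theta_r):
    """Assumed convention: Psi(theta_r) + Psi*(eta_p) - theta_r * eta_p."""
    return psi(theta_r) + psi_star(theta_p) - theta_r * eta(theta_p)


def kl(theta_p, theta_r):
    """Kullback-Leibler divergence KL(p || r) computed directly."""
    return sum(p * math.log(p / r) for p, r in zip(pdf(theta_p), pdf(theta_r)))


if __name__ == "__main__":
    print(canonical_divergence(0.5, -0.3), kl(0.5, -0.3))
```

The two printed numbers should agree up to rounding, which is the Bregman-divergence form of the Kullback-Leibler divergence for this toy family.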
For a dually flat manifold S, the Pythagorean relation is generalized in terms of the divergence. Let p, r, s be three probability distributions in S. When the e-geodesic connecting p and r is orthogonal at r to the m-geodesic connecting r and s, the following generalized Pythagorean relation [2] holds.
As is well known, maximizing the Boltzmann-Gibbs-Shannon (BGS) entropy under the M constraints for a given set of U_m and the normalization ∫dx p(x) = 1 leads to the optimized pdf belonging to the exponential family S_exp. The Lagrange multipliers are the control parameters {θ^m} for the above M constraints. From the normalization of an exponential pdf, we readily obtain the θ-potential function Ψ(θ) as We note that, in addition to Equation (2), the Fisher metric g^F can be written equivalently in other expressions In particular, combining Equation (6) with (20), we readily confirm the important relation g^F_ij(θ) = ∂_i ∂_j Ψ(θ), that is, the Fisher metric coincides with the Hessian matrix of the θ-potential function Ψ(θ). It is known that an exponential family naturally has the dualistic Hessian structures and that their canonical divergences coincide with the Kullback-Leibler divergences. Furthermore, using Equation (8), the Fisher matrix can also be rewritten as which holds for the exponential family S_exp.
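The statement that the Fisher metric of an exponential family coincides with the Hessian of Ψ(θ), and with the covariance matrix of the functions F_m(x), can be checked numerically. The following is a minimal sketch on a hypothetical two-parameter family over a four-point sample space; the choice F_1(x) = x, F_2(x) = x², the finite-difference step, and all function names are assumptions for illustration only.

```python
import math

# Hypothetical toy model: x in {0, 1, 2, 3}, F_1(x) = x, F_2(x) = x**2.
XS = [0, 1, 2, 3]
STATS = [lambda x: float(x), lambda x: float(x * x)]


def linear_part(theta, x):
    """sum_m theta^m F_m(x)."""
    return sum(t * F(x) for t, F in zip(theta, STATS))


def psi(theta):
    """theta-potential: logarithm of the normalization sum."""
    return math.log(sum(math.exp(linear_part(theta, x)) for x in XS))


def pdf(theta):
    """p_theta(x) = exp(sum_m theta^m F_m(x) - Psi(theta))."""
    p0 = psi(theta)
    return [math.exp(linear_part(theta, x) - p0) for x in XS]


def eta(theta):
    """Expectation coordinates eta_m = E[F_m(x)]."""
    p = pdf(theta)
    return [sum(pi * F(x) for pi, x in zip(p, XS)) for F in STATS]


def covariance(theta):
    """Covariance matrix of (F_1, F_2) under p_theta."""
    p, e = pdf(theta), eta(theta)
    return [[sum(pi * (Fi(x) - e[i]) * (Fj(x) - e[j]) for pi, x in zip(p, XS))
             for j, Fj in enumerate(STATS)] for i, Fi in enumerate(STATS)]


def hessian_psi(theta, h=1e-4):
    """Central finite-difference approximation of the Hessian of Psi."""
    def shifted(i, di, j, dj):
        t = list(theta)
        t[i] += di
        t[j] += dj
        return psi(t)

    n = len(theta)
    return [[(shifted(i, h, j, h) - shifted(i, h, j, -h)
              - shifted(i, -h, j, h) + shifted(i, -h, j, -h)) / (4 * h * h)
             for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    theta = [0.3, -0.2]
    print(hessian_psi(theta))
    print(covariance(theta))
```

The two printed matrices should agree up to the finite-difference error.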
In general, the dual affine connections are induced from the metric. By applying ∂_i to Equation (19) for g^F, we see that the following relation holds where the Christoffel symbol of the first kind for the e-affine connection and that for the m-affine connection are defined by respectively. In addition, we can introduce a cubic form which characterizes the difference between the affine connection ∇^(e) (or ∇^(m)) and the Levi-Civita connection ∇^(0) through the relations

Conjugate Representations
Here, we briefly review Zhang's conjugate representations [7]. For a parameterized probability density function (pdf) p_θ(x) with a set of real parameters {θ^m}, m = 1, 2, ..., M, information geometry was founded by Amari [1] on the basis of his α-representations (reps.), defined by for a real parameter α ≠ 1, and on the α-divergence (31). As a generalization of the α-reps., Zhang [7] introduced the conjugate representations as follows.
Definition 1. A ρ-representation of a real positive number ξ is a mapping ξ → ρ(ξ), where ρ(ξ) is a strictly monotone function. For a smooth and strictly convex function f(ρ), a τ-representation ξ → τ(ξ) is said to be conjugate to the ρ-representation with respect to f(ρ) if the following relations are satisfied, where the convex functions f(ρ) and f*(τ) are Legendre dual to each other:

    τ(ξ) = df(ρ)/dρ,    ρ(ξ) = df*(τ)/dτ.    (32)

By utilizing the conjugate reps., the associated Bregman divergence can be defined as

The α-rep. is, of course, an example of the conjugate reps., and they are related as follows.

The α-divergence (31) is expressed as D_{f_α,ρ_α}(p|r).
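To make the relation between the α-rep. and the conjugate reps. concrete, the sketch below checks numerically, on discrete distributions, that a Bregman divergence built from a conjugate pair reproduces the α-divergence. The explicit forms used here are assumptions taken from the standard literature rather than from the text: ρ_α(p) = 2/(1−α) p^((1−α)/2), τ_α(p) = 2/(1+α) p^((1+α)/2), the Bregman form Σ[f(ρ(p)) − f(ρ(r)) − (ρ(p) − ρ(r)) τ(r)], and Amari's α-divergence 4/(1−α²)(1 − Σ p^((1−α)/2) r^((1+α)/2)). The range −1 < α < 1 is assumed.

```python
def rho_alpha(p, alpha):
    """Assumed alpha-rep.: 2/(1-alpha) * p**((1-alpha)/2)."""
    return 2.0 / (1.0 - alpha) * p ** ((1.0 - alpha) / 2.0)


def tau_alpha(p, alpha):
    """Assumed conjugate rep.: 2/(1+alpha) * p**((1+alpha)/2)."""
    return 2.0 / (1.0 + alpha) * p ** ((1.0 + alpha) / 2.0)


def f_alpha(rho, alpha):
    """Convex function whose derivative at rho_alpha(p) equals tau_alpha(p)."""
    u = (1.0 - alpha) * rho / 2.0
    return 2.0 / (1.0 + alpha) * u ** (2.0 / (1.0 - alpha))


def bregman(p, r, alpha):
    """Assumed Bregman divergence of the conjugate pair (discrete sums)."""
    total = 0.0
    for pi, ri in zip(p, r):
        a, b = rho_alpha(pi, alpha), rho_alpha(ri, alpha)
        total += f_alpha(a, alpha) - f_alpha(b, alpha) - (a - b) * tau_alpha(ri, alpha)
    return total


def alpha_divergence(p, r, alpha):
    """Amari's alpha-divergence for normalized discrete distributions."""
    s = sum(pi ** ((1.0 - alpha) / 2.0) * ri ** ((1.0 + alpha) / 2.0)
            for pi, ri in zip(p, r))
    return 4.0 / (1.0 - alpha ** 2) * (1.0 - s)


if __name__ == "__main__":
    p = [0.2, 0.3, 0.5]
    r = [0.4, 0.4, 0.2]
    print(bregman(p, r, 0.5), alpha_divergence(p, r, 0.5))
```

Under these assumed forms, f_α(ρ_α(p)) reduces to 2p/(1+α), so the first two terms of the Bregman sum cancel for normalized p and r, leaving the α-divergence.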
Remark 1. We assume that the ρ- and τ-functions satisfy suitable regularity conditions throughout this paper.
It is important to describe the domains and the tangents of the relevant ρ- and τ-functions. However, this is a very difficult matter in general. For example, consider a statistical manifold that is a set of q-Gaussian distributions, and use the α-rep. (36). We see that it is an α-affine manifold with α = 2q − 1 [1,7]. In this case, if the domain Ω (the total sample space) is R, then α must satisfy 1 < α < 5. If Ω = R^2, then α must satisfy 1 < α < 3. The lower bound comes from the regularity conditions of the statistical manifold (Amari and Nagaoka [1], Chapter 2), and the upper bounds come from the integrability conditions of probability densities. In this way, the regularity conditions for a set of ρ- and τ-functions are not determined by these functions alone, but depend on the total sample space and the given statistical model. Some arguments have been given in our previous paper [12].

MaxEnt
For a set (ρ, τ, f(ρ), f*(τ)) of conjugate reps., let us introduce a generalized entropy functional S defined by and consider the following MaxEnt problem.
where F_m(x) is a given function of x, and θ^m, m = 1, 2, ..., M, and γ are the Lagrange multipliers. Using the relation df*(τ)/dτ = ρ defined in Equation (32), this MaxEnt problem leads to where τ′(p) stands for dτ(p)/dp. We assume τ′(p(x)) ≠ 0 because if τ′(p) = 0 then the τ-rep. is a constant mapping, which fails to work as a rep., or immersion, of a pdf p(x). We thus obtain

Remark 2. Note that unless τ(p) = p, the constraints of this generalized MaxEnt problem are neither the standard expectations ∫dx p(x) F_m(x) nor the normalization ∫dx p(x) = 1 of the pdf p(x). However, the solution of this MaxEnt problem is expressed in terms of the inverse function of ρ(p) as

Definition 2. For any given ρ-rep., the generalized score function s_ρ(x) is defined by

Remark 3. In the above MaxEnt setting, substituting Equation (42) into the generalized score function, we obtain that

Theorem 1. For any set of conjugate reps. and the associated generalized score function s_ρ(x),

Proof. From the definition (32) of the conjugate reps., we see τ = df(ρ)/dρ. It follows that and integrating both sides over x, we obtain the result.
Definition 3 (Escort rep.). For a given ρ-rep. that satisfies dρ(p)/dp ≠ 0, we can introduce a new τ̃-rep., which is called the escort rep. of a pdf p(x) and is defined by where c is an appropriate constant and ρ^{−1}(ξ) is the inverse function of ρ(ξ).

Remark 4. For the α-reps., we have We thus see that which states that τ_α(p) is a self-escort rep. with the constant c = 2/(1 + α).
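The self-escort property of τ_α can be illustrated numerically. The sketch below assumes, as a reading of Definition 3, that the escort rep. is c times the derivative of the inverse function ρ^{−1} evaluated at ρ(p), that is, c / (dρ/dp). It also assumes the standard forms ρ_α(p) = 2/(1−α) p^((1−α)/2) and τ_α(p) = 2/(1+α) p^((1+α)/2). The function names and the finite-difference step are hypothetical.

```python
def rho_alpha(p, alpha):
    """Assumed alpha-rep.: 2/(1-alpha) * p**((1-alpha)/2)."""
    return 2.0 / (1.0 - alpha) * p ** ((1.0 - alpha) / 2.0)


def tau_alpha(p, alpha):
    """Assumed conjugate rep.: 2/(1+alpha) * p**((1+alpha)/2)."""
    return 2.0 / (1.0 + alpha) * p ** ((1.0 + alpha) / 2.0)


def escort_rep(rho, p, c, h=1e-6):
    """Assumed escort rep.: c / (d rho / d p), derivative by central difference."""
    drho = (rho(p + h) - rho(p - h)) / (2.0 * h)
    return c / drho


if __name__ == "__main__":
    alpha = 0.4
    c = 2.0 / (1.0 + alpha)
    for p in (0.1, 0.5, 0.9):
        escort = escort_rep(lambda t: rho_alpha(t, alpha), p, c)
        print(p, escort, tau_alpha(p, alpha))
```

With c = 2/(1 + α), the last two printed columns should coincide up to the finite-difference error, which is the self-escort property stated in Remark 4.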
One of the merits of introducing the escort rep. τ̃(p) is the next theorem.
Proof. Substituting the relation into Equation (50) leads to because the pdf p(x) is normalized.
For this τ̃-rep., we can introduce the associated convex functions f̃(ρ) and f̃*(τ̃) that satisfy respectively. Note that combining Theorem 1 with Theorem 2 leads to We then obtain the following corollary.

Corollary 1. For the conjugate reps. ρ and τ̃, the associated function f̃(ρ) satisfies which is the constant c defined in Equation (47) for any normalized pdf p(x).
Remark 5. For the α-rep., we see that

Definition 4 (Escort pdf and escort expectation). Define the escort pdf P_esc(x) with respect to a pdf p(x) by utilizing the escort rep. τ̃(p) as follows, and define the escort expectation with respect to p(x) of a given function F_i(x) as

Theorem 3. In the MaxEnt setting of Equation (39), the score function s_ρ(x) has zero escort expectation, i.e., E_{P_esc}[s_ρ(x)] = 0, and it follows that

Proof. From Equation (41) we have where we used E_{P_esc}[1] = 1. Since E_{P_esc}[s_ρ] = 0, we obtain Equation (59).
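The escort pdf and the zero escort expectation of the score function can be illustrated with a short numerical sketch. It rests on several assumptions not spelled out in the text above: the escort pdf is taken to be τ̃(p(x)) normalized by its sum, the generalized score is taken to be the θ-derivative of ρ(p_θ(x)), and the α-rep. is written with q = (1 + α)/2 as ρ(p) = p^(1−q)/(1−q) and τ̃(p) = p^q/q. The one-parameter family, the value of q, and all names are hypothetical.

```python
import math

XS = [0, 1, 2, 3]   # hypothetical finite sample space
Q = 0.7             # q = (1 + alpha) / 2, chosen arbitrarily with q != 1


def pdf(theta):
    """Hypothetical normalized family with p_theta(x) proportional to exp(theta * x)."""
    w = [math.exp(theta * x) for x in XS]
    z = sum(w)
    return [wi / z for wi in w]


def rho(p):
    """Assumed rho-rep. in terms of q: p**(1 - q) / (1 - q)."""
    return p ** (1.0 - Q) / (1.0 - Q)


def tau_escort(p):
    """Assumed (self-)escort rep.: p**q / q."""
    return p ** Q / Q


def escort_pdf(p):
    """Assumed escort pdf: tau_escort(p(x)) divided by its total sum."""
    w = [tau_escort(pi) for pi in p]
    z = sum(w)
    return [wi / z for wi in w]


def escort_expectation(p, values):
    """Expectation of `values` with respect to the escort pdf of p."""
    return sum(P * v for P, v in zip(escort_pdf(p), values))


def score(theta, h=1e-6):
    """Assumed generalized score: d rho(p_theta(x)) / d theta (central difference)."""
    up, dn = pdf(theta + h), pdf(theta - h)
    return [(rho(a) - rho(b)) / (2.0 * h) for a, b in zip(up, dn)]


if __name__ == "__main__":
    theta = 0.4
    print(escort_expectation(pdf(theta), score(theta)))
```

The printed value should vanish up to the finite-difference error. Under the assumed definitions this follows from normalization alone, since τ̃(p) dρ/dp is a constant and the θ-derivatives of the probabilities sum to zero.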
Remark 6. In our formalism, the escort expectation is characterized by the generalized score function s_ρ(x), which is unbiased, i.e., s_ρ(x) has zero escort expectation. We see that the Lagrange multiplier γ(θ) is the θ-potential Ψ(θ) associated with the escort expectation. The dual affine coordinate η^esc is and the associated Riemannian metric and cubic form are respectively. Since g^esc is a Hessian metric, the statistical manifold described by the θ- and η^esc-coordinates is dually flat.

Conformal Divergence
Let us consider the Bregman divergence (35) of the escort reps., i.e., The next theorem is the main result of this contribution.
Theorem 4. The relative escort expectation of ρ-reps. is the conformal (or scaled) divergence of D_{f̃*,τ̃}(p|r) with the scaling factor 1/∫dx τ̃(p(x)), i.e.,

Proof. From Corollary 1, we see that where c is an appropriate constant for any normalized pdf p(x), and it follows that Substituting this relation into Equation (64) leads to Dividing both sides by ∫dx τ̃(p(x)) and using the escort expectation, we obtain the result.
Remark 7. As an example of Theorem 4, let us consider the α-rep. case. Since τ̃_α = τ_α as shown in Remark 4, it follows that f̃*_α(τ̃) = f*_α(τ). The corresponding escort pdf becomes and Equation (65) becomes When we set q = (1 + α)/2, this relation becomes

    E_{P_esc}[ln_q p] − E_{P_esc}[ln_q r] = ( q / ∫dx p^q(x) ) D_{f*_α,τ_α}(p|r),

which was first shown by Matsuzoe and Ohara [13].
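The q-form of this relation, namely that the difference of the escort expectations of ln_q p and ln_q r equals q / Σ p^q times the dual Bregman divergence, can be checked on discrete distributions. The sketch below is an illustration under assumed explicit forms that are not given in the text above: ln_q(x) = (x^(1−q) − 1)/(1−q), τ(x) = x^q/q, ρ(x) = x^(1−q)/(1−q), f*(τ(x)) = x/(1−q), and the dual Bregman divergence Σ[f*(τ(p)) − f*(τ(r)) − (τ(p) − τ(r)) ρ(r)]. Integrals are replaced by sums, and q ≠ 1 with q > 0 is assumed.

```python
def ln_q(x, q):
    """q-logarithm: (x**(1-q) - 1) / (1 - q), for q != 1."""
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)


def escort_pdf(p, q):
    """q-escort distribution: p**q normalized by its sum."""
    z = sum(pi ** q for pi in p)
    return [pi ** q / z for pi in p]


def relative_escort_expectation(p, r, q):
    """E_Pesc[ln_q p] - E_Pesc[ln_q r], with the escort pdf taken with respect to p."""
    P = escort_pdf(p, q)
    return sum(Pi * (ln_q(pi, q) - ln_q(ri, q)) for Pi, pi, ri in zip(P, p, r))


def dual_bregman(p, r, q):
    """Assumed dual Bregman divergence of the tau-rep. (discrete sums)."""
    total = 0.0
    for pi, ri in zip(p, r):
        tau_p, tau_r = pi ** q / q, ri ** q / q
        rho_r = ri ** (1.0 - q) / (1.0 - q)
        total += pi / (1.0 - q) - ri / (1.0 - q) - (tau_p - tau_r) * rho_r
    return total


def conformal_divergence(p, r, q):
    """Dual Bregman divergence scaled by q / sum_x p(x)**q."""
    return q / sum(pi ** q for pi in p) * dual_bregman(p, r, q)


if __name__ == "__main__":
    p = [0.2, 0.3, 0.5]
    r = [0.4, 0.4, 0.2]
    print(relative_escort_expectation(p, r, 0.7), conformal_divergence(p, r, 0.7))
```

Both quantities reduce to (1 − Σ p^q r^(1−q)) / ((1−q) Σ p^q) for normalized p and r, so the two printed numbers should agree up to rounding.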

Concluding Remarks
We have discussed and reformulated the method of information geometry in terms of the conjugate reps. introduced by Zhang [7]. For an appropriate set of conjugate reps., the MaxEnt principle for a generalized entropy relates the associated Lagrange multipliers to the corresponding ρ-rep. (41) of the optimal pdf (42). For a generalized score function (Definition 2), the escort rep. and escort expectation are then naturally induced. The conformal divergence is related to the difference of the entropies in terms of the escort expectations, as shown in Theorem 4.
In previous work [9], we studied, for the κ-deformed exponential family, the dualistic Hessian geometries among the thermodynamic potentials in the κ-deformed thermostatistics, and found that there exist two different kinds of dual affine coordinates: one, η, is associated with the standard expectation, and the other, η^esc, is associated with the escort expectation. There, the double escort distributions, i.e., the escorts of the escort distributions, play an important role. For the q-deformed exponential family, one of the authors (H.M.) further studied a sequence (or hierarchy) of escort distributions [14]. We think that these results are not specific to the q- or κ-deformed exponential pdf, and we believe that the results of [9,14] can be studied systematically by applying the reformulated method developed in this work. Further studies are needed and will be carried out in future work.