Geometry Induced by a Generalization of Rényi Divergence

In this paper, we propose a generalization of Rényi divergence and investigate the geometry it induces. This generalization is given in terms of a φ-function, the same function that is used in the definition of non-parametric φ-families. The properties of φ-functions prove to be crucial in the generalization of Rényi divergence. Under appropriate conditions, we verify that the generalized Rényi divergence reduces, in a limiting case, to the φ-divergence. In generalized statistical manifolds, the φ-divergence induces a pair of dual connections $D^{(-1)}$ and $D^{(1)}$. We show that the family of connections $D^{(\alpha)}$ induced by the generalization of Rényi divergence satisfies the relation $D^{(\alpha)} = \frac{1-\alpha}{2} D^{(-1)} + \frac{1+\alpha}{2} D^{(1)}$, with α ∈ [−1, 1].


Introduction
Information geometry, the study of statistical models equipped with a differentiable structure, was pioneered by the work of Rao [1], and gained maturity with the work of Amari and many others [2][3][4]. It has been successfully applied in many different areas, such as statistical inference, machine learning, signal processing, and optimization [4,5]. In appropriate statistical models, the differentiable structure is induced by a (statistical) divergence. The Kullback-Leibler divergence induces a Riemannian metric, called the Fisher-Rao metric, and a pair of dual connections, the exponential and mixture connections. A statistical model endowed with the Fisher-Rao metric is called a (classical) statistical manifold. Amari also considered a family of α-divergences that induce a family of α-connections.
Much research in recent years has focused on the geometry of non-standard statistical models [6][7][8]. These models are defined in terms of a deformed exponential (also called ϕ-exponential). In particular, κ-exponential models and q-exponential families are investigated in [9,10]. Non-parametric (or infinite-dimensional) ϕ-families, which generalize exponential families in the non-parametric setting [13][14][15][16], were introduced by the authors in [11,12]. Based on the similarity between exponential and ϕ-families, we defined the so-called ϕ-divergence, of which the Kullback-Leibler divergence is a particular case. Statistical models equipped with a geometric structure induced by ϕ-divergences, which are called generalized statistical manifolds, are investigated in [17,18]. With respect to the pair of dual connections induced by the ϕ-divergence, parametric ϕ-families are dually flat.
The ϕ-divergence is intrinsically related to the (ρ, τ)-model of Zhang, which was proposed in [19,20], extended to the infinite-dimensional setting in [21], and explained in more detail in [22,23].
For instance, the metric induced by the ϕ-divergence and the (ρ, τ)-generalization of the Fisher-Rao metric, for the choices $\rho = \phi^{-1}$ and $f = \rho^{-1}$, differ by a conformal factor.
Among many attempts to generalize Kullback-Leibler divergence, Rényi divergence [24] is one of the most successful, having found many applications [25]. In the present paper, we propose a generalization of Rényi divergence, which we use to define a family of α-connections. This generalization is based on an interpretation of Rényi divergence as a kind of normalizing function. To generalize Rényi divergence, we consider functions satisfying some suitable conditions. To a function for which these conditions hold, we give the name of ϕ-function. In a limiting case, the generalized Rényi divergence reduces to the ϕ-divergence. In [17,18], the ϕ-divergence gives rise to a pair of dual connections $D^{(-1)}$ and $D^{(1)}$. We show that the connection $D^{(\alpha)}$ induced by the generalization of Rényi divergence satisfies the convex combination $D^{(\alpha)} = \frac{1-\alpha}{2} D^{(-1)} + \frac{1+\alpha}{2} D^{(1)}$. Eguchi in [26] investigated a geometry based on a normalizing function similar to the one used in the generalization of Rényi divergence. In [26], results were derived supposing that this normalizing function exists; conditions for its existence were not given. In the present paper, the existence of the normalizing function is ensured by conditions involved in the definition of ϕ-functions.
The rest of the paper is organized as follows. In Section 2, ϕ-functions are introduced and some properties are discussed. The Rényi divergence is generalized in Section 3. We investigate in Section 4 the geometry induced by the generalization of Rényi divergence. Section 4.2 provides evidence of the role of the generalized Rényi divergence in ϕ-families.

ϕ-Functions
Rényi divergence is defined in terms of the exponential function (to be more precise, the logarithm). A way of generalizing Rényi divergence is to replace the exponential function by another function, which satisfies some suitable conditions. To a function for which these conditions hold, we give the name ϕ-function. In this section, we define and investigate some properties of ϕ-functions.
Let (T, Σ, µ) be a measure space. Although we do not restrict our analysis to a particular measure space, the reader can think of T as the set of real numbers R, Σ as the Borel σ-algebra on R, and µ as the Lebesgue measure. We can also consider T to be a discrete set, a case in which µ is the counting measure.
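We say that ϕ : R → (0, ∞) is a ϕ-function if the following conditions are satisfied:
(a1) ϕ(·) is convex;
(a2) $\lim_{u \to -\infty} \phi(u) = 0$ and $\lim_{u \to \infty} \phi(u) = \infty$;
(a3) there exists a measurable function $u_0 : T \to (0, \infty)$ such that
$$\int_T \phi(c + \lambda u_0)\, d\mu < \infty, \quad \text{for all } \lambda > 0,$$
for each measurable function c : T → R satisfying $\int_T \phi(c)\, d\mu = 1$.
Thanks to condition (a3), we can generalize Rényi divergence using ϕ-functions. These conditions first appeared in [12], where the authors constructed non-parametric ϕ-families of probability distributions. We remark that if T is finite, condition (a3) is always satisfied.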
The restriction that $\int_T \phi(c)\, d\mu = 1$ can be weakened, as asserted in the next result.
By the definition of $u_0$, we can write which is the desired result.
As a consequence of Lemma 1, condition (a3) can be replaced by the following one: (a3') There exists a measurable function $u_0 : T \to (0, \infty)$ such that $\int_T \phi(c + \lambda u_0)\, d\mu < \infty$, for all λ > 0, for each measurable function c : T → R for which $\int_T \phi(c)\, d\mu < \infty$.
Example 2. Assume that the underlying measure µ is σ-finite and non-atomic. This is the case of the Lebesgue measure. Let us consider the function which clearly is convex, and satisfies the limits $\lim_{u \to -\infty} \phi(u) = 0$ and $\lim_{u \to \infty} \phi(u) = \infty$. Given any measurable function $u_0 : T \to (0, \infty)$, we will find a measurable function c : , According to Lemma 8.3 in [29], there exists a sub-sequence w k = v m n k and pairwise disjoint sets Then, we can write On the other hand, which shows that (2) is not satisfied.

Generalization of Rényi Divergence
In this section, we provide a generalization of Rényi divergence, which is given in terms of a ϕ-function. This generalization also depends on a parameter α ∈ [−1, 1]; for α = ±1, it is defined as a limit. Supposing that the underlying ϕ-function is continuously differentiable, we show that this limit exists and equals the ϕ-divergence [12]. In what follows, all probability distributions are assumed to have positive density. In other words, they belong to the collection
$$\mathcal{P}_\mu = \left\{ p \in L^0 : p > 0 \text{ and } \int_T p\, d\mu = 1 \right\},$$
where $L^0$ is the space of all real-valued, measurable functions on T, with equality µ-a.e. (µ-almost everywhere).
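The Rényi divergence of order α ∈ (−1, 1) between two probability distributions p and q in $\mathcal{P}_\mu$ is defined as
$$D_R^{(\alpha)}(p \,\|\, q) = \frac{4}{1-\alpha^2}\, \kappa(\alpha), \qquad \kappa(\alpha) = -\log\left( \int_T p^{\frac{1-\alpha}{2}} q^{\frac{1+\alpha}{2}}\, d\mu \right). \tag{4}$$
For α = ±1, the Rényi divergence is defined by taking a limit:
$$D_R^{(-1)}(p \,\|\, q) = \lim_{\alpha \downarrow -1} D_R^{(\alpha)}(p \,\|\, q), \tag{5}$$
$$D_R^{(1)}(p \,\|\, q) = \lim_{\alpha \uparrow 1} D_R^{(\alpha)}(p \,\|\, q). \tag{6}$$
Under some conditions, the limits in (5) and (6) are finite-valued, and converge to the Kullback-Leibler divergence. In other words,
$$D_R^{(-1)}(p \,\|\, q) = D_R^{(1)}(q \,\|\, p) = D_{KL}(p \,\|\, q),$$
where $D_{KL}(p \,\|\, q)$ denotes the Kullback-Leibler divergence between p and q, which is given by
$$D_{KL}(p \,\|\, q) = \int_T p \log\left(\frac{p}{q}\right) d\mu.$$
These conditions are stated in Proposition 1, given at the end of this section, for the case involving the generalized Rényi divergence.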
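The Rényi divergence in its standard form is given by
$$D_{\beta}(p \,\|\, q) = \frac{1}{\beta - 1} \log\left( \int_T p^{\beta} q^{1-\beta}\, d\mu \right), \quad \text{for } \beta \in (0, 1). \tag{7}$$
Expression (4) is related to this form by
$$D_R^{(\alpha)}(p \,\|\, q) = \frac{2}{1-\alpha}\, D_{\frac{1-\alpha}{2}}(p \,\|\, q).$$
Beyond the change of variables, which results in α ranging in [−1, 1], expressions (4) and (7) differ by the factor 2/(1 − α). We opted to insert the term 2/(1 − α) so that some kind of symmetry could be maintained when the limits α ↓ −1 and α ↑ 1 are considered. In addition, the geometry induced by the version (4) conforms with Amari's notation [5].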
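The limiting behavior described above can be checked numerically on a finite set T with the counting measure. The sketch below is only an illustration: it assumes the form $D_R^{(\alpha)}(p \,\|\, q) = \frac{4}{\alpha^2 - 1} \log \sum_t p(t)^{\frac{1-\alpha}{2}} q(t)^{\frac{1+\alpha}{2}}$, and the distributions `p` and `q`, as well as all function names, are hypothetical choices made for this example.

```python
import math

def renyi_divergence(p, q, alpha):
    """Renyi divergence of order alpha in (-1, 1) on a finite set (counting measure).

    Assumed form: 4 / (alpha^2 - 1) * log(sum_t p^((1-alpha)/2) * q^((1+alpha)/2)).
    """
    overlap = sum(pt ** ((1 - alpha) / 2) * qt ** ((1 + alpha) / 2)
                  for pt, qt in zip(p, q))
    return 4.0 / (alpha ** 2 - 1.0) * math.log(overlap)

def kl_divergence(p, q):
    """Kullback-Leibler divergence sum_t p(t) * log(p(t) / q(t))."""
    return sum(pt * math.log(pt / qt) for pt, qt in zip(p, q))

# Hypothetical strictly positive distributions on a three-point set.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

eps = 1e-6
lower_limit = renyi_divergence(p, q, -1 + eps)  # expected to be close to KL(p || q)
upper_limit = renyi_divergence(p, q, 1 - eps)   # expected to be close to KL(q || p)
midpoint_pq = renyi_divergence(p, q, 0.0)       # alpha = 0 treats p and q symmetrically
midpoint_qp = renyi_divergence(q, p, 0.0)
```

Evaluating slightly inside the interval (−1, 1) avoids the division by zero at α = ±1, where the divergence is defined only as a limit.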
The function κ(α), which depends on p and q, can be defined as the unique non-negative real number for which
$$\int_T \exp\left( \frac{1-\alpha}{2} \log(p) + \frac{1+\alpha}{2} \log(q) + \kappa(\alpha) \right) d\mu = 1. \tag{8}$$
The function κ(α) plays the role of a normalizing term. The generalization of Rényi divergence that we propose is based on the interpretation of κ(α) given in (8). Instead of the exponential function, we consider a ϕ-function in (8).
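Fix any ϕ-function ϕ : R → (0, ∞). Given any p and q in $\mathcal{P}_\mu$, we take κ(α) = κ(α; p, q) ≥ 0 so that
$$\int_T \phi\left( \frac{1-\alpha}{2} \phi^{-1}(p) + \frac{1+\alpha}{2} \phi^{-1}(q) + \kappa(\alpha) u_0 \right) d\mu = 1, \tag{9}$$
or, in other words, the term inside the integral is a probability distribution in $\mathcal{P}_\mu$. The existence and uniqueness of κ(α) as defined in (9) is guaranteed by condition (a3').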
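On a finite set T, the normalizing term κ(α) can be found by a one-dimensional root search, since the integral is non-decreasing in κ(α). The following sketch is an illustration under stated assumptions, not the authors' procedure: it assumes the normalization condition $\sum_t \phi\big(\frac{1-\alpha}{2}\phi^{-1}(p) + \frac{1+\alpha}{2}\phi^{-1}(q) + \kappa(\alpha)u_0\big) = 1$ and the scaling $D_\phi^{(\alpha)} = \frac{4}{1-\alpha^2}\kappa(\alpha)$. The function `solve_kappa`, the bisection strategy, the choice $u_0 \equiv 1$, and the deformed exponential $\phi(u) = u + \sqrt{1 + u^2}$ are all hypothetical choices made for this example.

```python
import math

def solve_kappa(p, q, alpha, phi, phi_inv, u0, iterations=200):
    """Bisection for kappa >= 0 with sum_t phi(a*phi_inv(p) + b*phi_inv(q) + kappa*u0) = 1."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    base = [a * phi_inv(pt) + b * phi_inv(qt) for pt, qt in zip(p, q)]

    def total(kappa):
        return sum(phi(c + kappa * u) for c, u in zip(base, u0))

    # By convexity of phi, total(0) <= 1; total is increasing and unbounded in kappa.
    lo, hi = 0.0, 1.0
    while total(hi) < 1.0:
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if total(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical data on a three-point set with counting measure.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
u0 = [1.0, 1.0, 1.0]
alpha = 0.3
a, b = (1 - alpha) / 2, (1 + alpha) / 2

# Classical case phi = exp: kappa has the closed form -log(sum p^a q^b).
kappa_exp = solve_kappa(p, q, alpha, math.exp, math.log, u0)
closed_form = -math.log(sum(pt ** a * qt ** b for pt, qt in zip(p, q)))

# A deformed exponential (convex, positive, limits 0 and infinity) and its inverse.
def phi_def(u):
    return u + math.sqrt(1.0 + u * u)

def phi_def_inv(y):
    return 0.5 * (y - 1.0 / y)

kappa_def = solve_kappa(p, q, alpha, phi_def, phi_def_inv, u0)
normalization = sum(phi_def(a * phi_def_inv(pt) + b * phi_def_inv(qt) + kappa_def * u)
                    for pt, qt, u in zip(p, q, u0))
generalized_divergence = 4.0 / (1.0 - alpha ** 2) * kappa_def
```

Because T is finite here, the integrability requirement in condition (a3') holds automatically, so the search always terminates with a finite κ(α).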
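We define a generalization of the Rényi divergence of order α ∈ (−1, 1) as
$$D_\phi^{(\alpha)}(p \,\|\, q) = \frac{2}{1+\alpha}\, \frac{2}{1-\alpha}\, \kappa(\alpha). \tag{10}$$
For α = ±1, this generalization is defined as a limit:
$$D_\phi^{(-1)}(p \,\|\, q) = \lim_{\alpha \downarrow -1} D_\phi^{(\alpha)}(p \,\|\, q), \tag{11}$$
$$D_\phi^{(1)}(p \,\|\, q) = \lim_{\alpha \uparrow 1} D_\phi^{(\alpha)}(p \,\|\, q). \tag{12}$$
The cases α = ±1 are related to a generalization of the Kullback-Leibler divergence, the so-called ϕ-divergence, which was introduced by the authors in [12]. The ϕ-divergence is given by
$$D_\phi(p \,\|\, q) = \frac{\displaystyle\int_T \frac{\phi^{-1}(p) - \phi^{-1}(q)}{(\phi^{-1})'(p)}\, d\mu}{\displaystyle\int_T \frac{u_0}{(\phi^{-1})'(p)}\, d\mu}. \tag{13}$$
(It was pointed out to us by an anonymous referee that this form of divergence is a special case of the (ρ, τ)-divergence for $\rho = \phi^{-1}$ and $f = \rho^{-1}$ (see Section 3.5 in [19]), apart from a conformal factor, which is the denominator of (13).) Under some conditions, the limit in (11) or (12) is finite-valued and converges to the ϕ-divergence:
$$D_\phi^{(-1)}(p \,\|\, q) = D_\phi^{(1)}(q \,\|\, p) = D_\phi(p \,\|\, q). \tag{14}$$
To show (14), we make use of the following result.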
As an immediate consequence of Lemma 2, we get the proposition below.

Generalized Statistical Manifolds
Statistical manifolds consist of a collection of probability distributions endowed with a metric and α-connections, which are defined in terms of the derivative of $l(t; \theta) = \log p(t; \theta)$. In a generalized statistical manifold, the metric and connection are defined in terms of $f(t; \theta) = \phi^{-1}(p(t; \theta))$. Instead of the logarithm, we consider the inverse $\phi^{-1}(\cdot)$ of a ϕ-function. Generalized statistical manifolds were introduced by the authors in [17,18]. Among examples of generalized statistical manifolds, (parametric) ϕ-families of probability distributions are of greatest importance. The non-parametric counterpart was investigated in [11,12]. The metric in ϕ-families can be defined as the Hessian of a function; i.e., ϕ-families are Hessian manifolds [30]. In [17,18], the ϕ-divergence gives rise to a pair of dual connections $D^{(-1)}$ and $D^{(1)}$; then, for α ∈ (−1, 1), the α-connection $D^{(\alpha)}$ is defined as the convex combination $D^{(\alpha)} = \frac{1-\alpha}{2} D^{(-1)} + \frac{1+\alpha}{2} D^{(1)}$. In the present paper, we show that the connection induced by $D_\phi^{(\alpha)}(\cdot \,\|\, \cdot)$, the generalization of Rényi divergence, corresponds to $D^{(\alpha)}$.

Definitions
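Let ϕ : R → (0, ∞) be a ϕ-function. A generalized statistical manifold $\mathcal{P} = \{p(t; \theta) : \theta \in \Theta\}$ is a collection of probability distributions $p_\theta(t) := p(t; \theta)$, indexed by parameters $\theta = (\theta_1, \dots, \theta_n) \in \Theta$ in a one-to-one relation, such that
(m1) Θ is a domain (open and connected set) in $\mathbb{R}^n$;
(m2) p(t; θ) is differentiable with respect to θ;
(m3) the matrix $g = (g_{ij})$ defined by
$$g_{ij} = -E'_\theta\left[ \frac{\partial^2 f_\theta}{\partial \theta_i\, \partial \theta_j} \right] \tag{22}$$
is positive definite at each θ ∈ Θ, where $f_\theta(t) = f(t; \theta) = \phi^{-1}(p(t; \theta))$ and
$$E'_\theta[\,\cdot\,] = \frac{\int_T (\cdot)\, \phi'(f_\theta)\, d\mu}{\int_T u_0\, \phi'(f_\theta)\, d\mu}; \tag{23}$$
(m4) the operations of integration with respect to µ and differentiation with respect to $\theta_i$ commute in all calculations found below, which are related to the metric and connections.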
The matrix $g = (g_{ij})$ equips $\mathcal{P}$ with a metric. By the chain rule, the tensor related to $g = (g_{ij})$ is invariant under change of coordinates. The (classical) statistical manifold is a particular case in which ϕ(u) = exp(u) and $u_0 = 1_T$.
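In the classical case ϕ(u) = exp(u) with $u_0 \equiv 1$, the metric reduces to the Fisher information, which can be computed either as minus the expected second derivative of the log-density or as the expected squared score. The sketch below checks this agreement by finite differences on a finite set with counting measure. The one-parameter family $p(t; \theta) \propto w(t) \exp(\theta\, u(t))$, the values in `W` and `U`, and the function names are hypothetical choices made only for this illustration.

```python
import math

# Hypothetical one-parameter family on a three-point set:
# p(t; theta) is proportional to W[t] * exp(theta * U[t]).
U = [0.0, 1.0, 3.0]
W = [0.2, 0.5, 0.3]

def density(theta):
    weights = [w * math.exp(theta * u) for w, u in zip(W, U)]
    z = sum(weights)
    return [x / z for x in weights]

def log_density(theta):
    return [math.log(x) for x in density(theta)]

def metric_from_second_derivative(theta, h=1e-4):
    """g = -E_theta[d^2 log p / d theta^2], with a central second difference."""
    p = density(theta)
    up, mid, down = log_density(theta + h), log_density(theta), log_density(theta - h)
    second = [(a - 2.0 * b + c) / (h * h) for a, b, c in zip(up, mid, down)]
    return -sum(pt * s for pt, s in zip(p, second))

def metric_from_score(theta, h=1e-4):
    """g = E_theta[(d log p / d theta)^2], with a central first difference."""
    p = density(theta)
    up, down = log_density(theta + h), log_density(theta - h)
    score = [(a - c) / (2.0 * h) for a, c in zip(up, down)]
    return sum(pt * s * s for pt, s in zip(p, score))

theta = 0.4
g_hessian = metric_from_second_derivative(theta)
g_score = metric_from_score(theta)

# For this exponential family, both should match the variance of U under p(.; theta).
p_theta = density(theta)
mean_u = sum(pt * u for pt, u in zip(p_theta, U))
variance_u = sum(pt * (u - mean_u) ** 2 for pt, u in zip(p_theta, U))
```

The step size `h` trades truncation error against rounding error; the value used here is a pragmatic choice rather than a tuned one.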
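We introduce a notation similar to Equation (23) that involves higher order derivatives of ϕ(·). For each n ≥ 1, we define
$$E^{(n)}_\theta[\,\cdot\,] = \frac{\int_T (\cdot)\, \phi^{(n)}(f_\theta)\, d\mu}{\int_T u_0\, \phi'(f_\theta)\, d\mu}. \tag{24}$$
We also use $E'_\theta[\cdot]$, $E''_\theta[\cdot]$ and $E'''_\theta[\cdot]$ to denote $E^{(n)}_\theta[\cdot]$ for n = 1, 2, 3, respectively. The notation (24) appears in expressions related to the metric and connections.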
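Using property (m4), we can find an alternate expression for $g_{ij}$ as well as an identification involving tangent spaces. The matrix $g = (g_{ij})$ can be equivalently defined by
$$g_{ij} = E''_\theta\left[ \frac{\partial f_\theta}{\partial \theta_i} \frac{\partial f_\theta}{\partial \theta_j} \right]. \tag{25}$$
As a consequence of this equivalence, the tangent space $T_{p_\theta}\mathcal{P}$ can be identified with $\widetilde{T}_{p_\theta}\mathcal{P}$, the vector space spanned by $\frac{\partial f_\theta}{\partial \theta_i}$, and endowed with the inner product $\langle \widetilde{X}, \widetilde{Y} \rangle_\theta = E''_\theta[\widetilde{X} \widetilde{Y}]$. The mapping $\sum_i a^i \frac{\partial}{\partial \theta_i} \mapsto \sum_i a^i \frac{\partial f_\theta}{\partial \theta_i}$ defines an isometry between $T_{p_\theta}\mathcal{P}$ and $\widetilde{T}_{p_\theta}\mathcal{P}$.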
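To verify (25), we differentiate $\int_T p_\theta\, d\mu = 1$ with respect to $\theta_i$, to get
$$0 = \int_T \frac{\partial}{\partial \theta_i} \phi(f_\theta)\, d\mu = \int_T \frac{\partial f_\theta}{\partial \theta_i}\, \phi'(f_\theta)\, d\mu. \tag{26}$$
Now, differentiating with respect to $\theta_j$, we obtain
$$0 = \int_T \frac{\partial^2 f_\theta}{\partial \theta_i\, \partial \theta_j}\, \phi'(f_\theta)\, d\mu + \int_T \frac{\partial f_\theta}{\partial \theta_i} \frac{\partial f_\theta}{\partial \theta_j}\, \phi''(f_\theta)\, d\mu,$$
and then (25) follows. In view of (26), we notice that every vector $\widetilde{X}$ belonging to $\widetilde{T}_{p_\theta}\mathcal{P}$ satisfies $E'_\theta[\widetilde{X}] = 0$. The metric $g = (g_{ij})$ gives rise to a Levi-Civita connection ∇ (i.e., a torsion-free, metric connection), whose corresponding Christoffel symbols $\Gamma_{ijk}$ are given by
$$\Gamma_{ijk} = \frac{1}{2} \left( \frac{\partial g_{ki}}{\partial \theta_j} + \frac{\partial g_{kj}}{\partial \theta_i} - \frac{\partial g_{ij}}{\partial \theta_k} \right). \tag{27}$$
Using expression (25) to calculate the derivatives in (27), we can express the symbols $\Gamma_{ijk}$ in terms of the notation (24). As we will show later, the Levi-Civita connection ∇ corresponds to the connection derived from the divergence $D_\phi^{(\alpha)}(\cdot \,\|\, \cdot)$ with α = 0.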
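Moreover, the domain $\Theta \subseteq \mathbb{R}^n$ is defined as the set of all vectors $\theta = (\theta_i)$ for which
$$\int_T \phi\left( c + \lambda \sum_{i=1}^n \theta_i u_i \right) d\mu < \infty,$$
for some λ > 1.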
Condition (i) implies that the mapping defined by (28) is one-to-one. Assumption (ii) makes ψ a non-negative function. Indeed, by the convexity of ϕ(·), along with (ii), we can write which implies ψ(θ) ≥ 0. By condition (iii), the domain Θ is an open neighborhood of the origin. If the set T is finite, condition (iii) is always satisfied. One can show that the domain Θ is open and convex. Moreover, the normalizing function ψ is also convex (or strictly convex if ϕ(·) is strictly convex). Conditions (ii) and (iii) also appear in the definition of non-parametric ϕ-families. For further details, we refer to [11,12].
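In a ϕ-family $\mathcal{F}_p$, the matrix $(g_{ij})$ given by (22) or (25) can be expressed as the Hessian of ψ, i.e., $g_{ij} = \frac{\partial^2 \psi}{\partial \theta_i\, \partial \theta_j}$. The next two results show how the generalization of Rényi divergence and the ϕ-divergence are related to the normalizing function in ϕ-families.
Proposition 2. In a ϕ-family $\mathcal{F}_p$, the generalization of Rényi divergence for α ∈ (−1, 1) can be expressed in terms of the normalizing function ψ as follows:
$$D_\phi^{(\alpha)}(p_\theta \,\|\, p_\vartheta) = \frac{2}{1+\alpha}\, \frac{2}{1-\alpha} \left[ \frac{1-\alpha}{2} \psi(\theta) + \frac{1+\alpha}{2} \psi(\vartheta) - \psi\left( \frac{1-\alpha}{2} \theta + \frac{1+\alpha}{2} \vartheta \right) \right], \tag{29}$$
for all θ, ϑ ∈ Θ.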
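Proof. Recall the definition of κ(α) as the real number for which
$$\int_T \phi\left( \frac{1-\alpha}{2} \phi^{-1}(p_\theta) + \frac{1+\alpha}{2} \phi^{-1}(p_\vartheta) + \kappa(\alpha) u_0 \right) d\mu = 1.$$
Using expression (28) for probability distributions in $\mathcal{F}_p$, we can write
$$\frac{1-\alpha}{2} \phi^{-1}(p_\theta) + \frac{1+\alpha}{2} \phi^{-1}(p_\vartheta) + \kappa(\alpha) u_0 = c + \sum_{i=1}^n \left( \frac{1-\alpha}{2} \theta_i + \frac{1+\alpha}{2} \vartheta_i \right) u_i - \left( \frac{1-\alpha}{2} \psi(\theta) + \frac{1+\alpha}{2} \psi(\vartheta) - \kappa(\alpha) \right) u_0$$
$$= c + \sum_{i=1}^n \left( \frac{1-\alpha}{2} \theta_i + \frac{1+\alpha}{2} \vartheta_i \right) u_i - \psi\left( \frac{1-\alpha}{2} \theta + \frac{1+\alpha}{2} \vartheta \right) u_0.$$
The last equality is a consequence of the domain Θ being convex. Thus, it follows that
$$\kappa(\alpha) = \frac{1-\alpha}{2} \psi(\theta) + \frac{1+\alpha}{2} \psi(\vartheta) - \psi\left( \frac{1-\alpha}{2} \theta + \frac{1+\alpha}{2} \vartheta \right).$$
By the definition of $D_\phi^{(\alpha)}(\cdot \,\|\, \cdot)$, we get (29).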
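The identity of Proposition 2 can be checked numerically in the classical case ϕ = exp, $u_0 \equiv 1$, where a ϕ-family is an exponential family and ψ is the log-partition function. The sketch below is an illustration only: it assumes a one-parameter family of the form $p_\theta = p_0 \exp(\theta u - \psi(\theta))$ on a finite set with counting measure, the Rényi form $\frac{4}{\alpha^2 - 1} \log \sum_t p^{\frac{1-\alpha}{2}} q^{\frac{1+\alpha}{2}}$, and a right-hand side $\frac{4}{1-\alpha^2}\big[\frac{1-\alpha}{2}\psi(\theta) + \frac{1+\alpha}{2}\psi(\vartheta) - \psi(\frac{1-\alpha}{2}\theta + \frac{1+\alpha}{2}\vartheta)\big]$. The base distribution `P0`, the statistic `U`, and the function names are hypothetical.

```python
import math

# Hypothetical base distribution and statistic on a three-point set.
P0 = [0.2, 0.5, 0.3]
RAW = [0.0, 1.0, 3.0]
MEAN = sum(p * r for p, r in zip(P0, RAW))
U = [r - MEAN for r in RAW]  # centered so that sum_t U[t] * P0[t] = 0

def psi(theta):
    """Normalizing function: log-partition of the family p0 * exp(theta * u - psi)."""
    return math.log(sum(p * math.exp(theta * u) for p, u in zip(P0, U)))

def member(theta):
    """Density p_theta = p0 * exp(theta * u - psi(theta))."""
    n = psi(theta)
    return [p * math.exp(theta * u - n) for p, u in zip(P0, U)]

def renyi_divergence(p, q, alpha):
    overlap = sum(pt ** ((1 - alpha) / 2) * qt ** ((1 + alpha) / 2)
                  for pt, qt in zip(p, q))
    return 4.0 / (alpha ** 2 - 1.0) * math.log(overlap)

def convexity_gap_form(theta, vartheta, alpha):
    """Right-hand side expressed through the normalizing function psi."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    gap = a * psi(theta) + b * psi(vartheta) - psi(a * theta + b * vartheta)
    return 4.0 / (1.0 - alpha ** 2) * gap

theta, vartheta, alpha = 0.4, -0.7, 0.25
lhs = renyi_divergence(member(theta), member(vartheta), alpha)
rhs = convexity_gap_form(theta, vartheta, alpha)
```

Centering the statistic mirrors the assumption that makes ψ non-negative with a minimum at the origin; the identity itself does not depend on it.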
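Proposition 3. In a ϕ-family $\mathcal{F}_p$, the ϕ-divergence is related to the normalizing function ψ by the equality
$$D_\phi(p_\theta \,\|\, p_\vartheta) = \psi(\vartheta) - \psi(\theta) - \sum_{i=1}^n \frac{\partial \psi}{\partial \theta_i}(\theta)\, (\vartheta_i - \theta_i), \tag{30}$$
for all θ, ϑ ∈ Θ.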
In Proposition 2, the expression on the right-hand side of Equation (29) defines a divergence on its own, which was investigated by Jun Zhang in [19]. Proposition 3 asserts that the ϕ-divergence $D_\phi(p_\theta \,\|\, p_\vartheta)$ coincides with the Bregman divergence [31,32] associated with the normalizing function ψ for points ϑ and θ in Θ. Because ψ is convex and attains a minimum at θ = 0, it follows that $\frac{\partial \psi}{\partial \theta_i}(\theta) = 0$ at θ = 0. As a result, equality (30) reduces to $D_\phi(p \,\|\, p_\theta) = \psi(\theta)$.
In this section, we assume that ϕ(·) is continuously differentiable and strictly convex. The latter assumption guarantees that $D_\phi^{(\alpha)}(p \,\|\, q) = 0$ if and only if p = q. The generalized Rényi divergence induces a metric $g = (g_{ij})$ in generalized statistical manifolds $\mathcal{P}$. This metric is given by
$$g_{ij} = -\left[ \left( \frac{\partial}{\partial \theta_i} \right)_{p} \left( \frac{\partial}{\partial \theta_j} \right)_{q} D_\phi^{(\alpha)}(p \,\|\, q) \right]_{q = p}.$$
To show that this expression defines a metric, we have to verify that $g_{ij}$ is invariant under change of coordinates, and $(g_{ij})$ is positive definite. The first claim follows from the chain rule. The positive definiteness of $(g_{ij})$ is a consequence of Proposition 4, which is given below.
where $c_\alpha = \frac{1-\alpha}{2} \phi^{-1}(p_\theta) + \frac{1+\alpha}{2} \phi^{-1}(p_\vartheta) + \kappa(\alpha) u_0$, we obtain .
By the standard differentiation rules, we can write ϕ′($c_\alpha$)dµ = 0 for $p_\vartheta = p_\theta$, the second term on the right-hand side of Equation (34) vanishes, and then If we use the notation introduced in (24), we can write It remains to show the case α = ±1. Comparing (13) and (23), we can write
$$D_\phi(p_\theta \,\|\, p_\vartheta) = E'_\theta\left[ \phi^{-1}(p_\theta) - \phi^{-1}(p_\vartheta) \right].$$
We use the equivalent expressions which follow from condition (32), to infer that Since $D_\phi^{(-1)}(p \,\|\, q) = D_\phi^{(1)}(q \,\|\, p) = D_\phi(p \,\|\, q)$, we conclude that the metric defined by (22) coincides with the metric induced by $D_\phi^{(\alpha)}(\cdot \,\|\, \cdot)$.
In generalized statistical manifolds, the generalized Rényi divergence D ijk are given by For α ∈ (−1, 1), the Christoffel symbols Γ
Proof. For α = ±1, equality (39) follows trivially. Thus, we assume α ∈ (−1, 1). By (34), we can write Applying $\left( \frac{\partial}{\partial \theta_j} \right)_{p_\theta}$ to the first term on the right-hand side of (40), and then equating $p_\vartheta = p_\theta$, we obtain Expression (39) follows from (37), (38) and (43).

Conclusions
In [17,18], the authors introduced a pair of dual connections $D^{(-1)}$ and $D^{(1)}$ induced by the ϕ-divergence. The main motivation of the present work was to find a (non-trivial) family of α-divergences, whose induced α-connections are convex combinations of $D^{(-1)}$ and $D^{(1)}$. As a result of our efforts, we proposed a generalization of Rényi divergence. The connection $D^{(\alpha)}$ induced by the generalization of Rényi divergence satisfies the relation $D^{(\alpha)} = \frac{1-\alpha}{2} D^{(-1)} + \frac{1+\alpha}{2} D^{(1)}$. To generalize Rényi divergence, we made use of properties of ϕ-functions. This makes evident the importance of ϕ-functions in the geometry of non-standard models. In standard statistical manifolds, even though Amari's α-divergence and Rényi divergence (with α ∈ [−1, 1]) do not coincide, they induce the same family of α-connections. This striking result requires further investigation. Future work should focus on how the generalization of Rényi divergence is related to Zhang's (ρ, τ)-divergence, and also how the present proposal is related to the model presented in [33].