Abstract
In this paper, we propose a generalization of Rényi divergence, and then we investigate its induced geometry. This generalization is given in terms of a φ-function, the same function that is used in the definition of non-parametric φ-families. The properties of φ-functions proved to be crucial in the generalization of Rényi divergence. Assuming appropriate conditions, we verify that the generalized Rényi divergence reduces, in a limiting case, to the φ-divergence. In a generalized statistical manifold, the φ-divergence induces a pair of dual connections $D^{(-1)}$ and $D^{(1)}$. We show that the family of connections $D^{(\alpha)}$ induced by the generalization of Rényi divergence satisfies the relation $D^{(\alpha)} = \frac{1-\alpha}{2} D^{(-1)} + \frac{1+\alpha}{2} D^{(1)}$, with $\alpha \in [-1, 1]$.
1. Introduction
Information geometry, the study of statistical models equipped with a differentiable structure, was pioneered by the work of Rao [1], and gained maturity with the work of Amari and many others [2,3,4]. It has been successfully applied in many different areas, such as statistical inference, machine learning, signal processing or optimization [4,5]. In appropriate statistical models, the differentiable structure is induced by a (statistical) divergence. The Kullback–Leibler divergence induces a Riemannian metric, called the Fisher–Rao metric, and a pair of dual connections, the exponential and mixture connections. A statistical model endowed with the Fisher–Rao metric is called a (classical) statistical manifold. Amari also considered a family of α-divergences that induce a family of α-connections.
Much research in recent years has focused on the geometry of non-standard statistical models [6,7,8]. These models are defined in terms of a deformed exponential (also called ϕ-exponential). In particular, κ-exponential models and q-exponential families are investigated in [9,10]. Non-parametric (or infinite-dimensional) φ-families, which generalize exponential families in the non-parametric setting [13,14,15,16], were introduced by the authors in [11,12]. Based on the similarity between exponential and φ-families, we defined the so-called φ-divergence, of which the Kullback–Leibler divergence is a particular case. Statistical models equipped with a geometric structure induced by φ-divergences, which are called generalized statistical manifolds, are investigated in [17,18]. With respect to the dual connections induced by the φ-divergence, parametric φ-families are dually flat.
The φ-divergence is intrinsically related to the (ρ, τ)-model of Zhang, which was proposed in [19,20], extended to the infinite-dimensional setting in [21], and explained in more detail in [22,23]. For instance, the metric induced by the φ-divergence and the (ρ, τ)-generalization of the Fisher–Rao metric differ, for suitable choices of the functions defining that model, by a conformal factor.
Among many attempts to generalize Kullback–Leibler divergence, Rényi divergence [24] is one of the most successful, having found many applications [25]. In the present paper, we propose a generalization of Rényi divergence, which we use to define a family of α-connections. This generalization is based on an interpretation of Rényi divergence as a kind of normalizing function. To generalize Rényi divergence, we consider functions satisfying some suitable conditions. A function for which these conditions hold is called a φ-function. In a limiting case, the generalized Rényi divergence reduces to the φ-divergence. In [17,18], the φ-divergence gives rise to a pair of dual connections $D^{(-1)}$ and $D^{(1)}$. We show that the connection $D^{(\alpha)}$ induced by the generalization of Rényi divergence satisfies the convex combination $D^{(\alpha)} = \frac{1-\alpha}{2} D^{(-1)} + \frac{1+\alpha}{2} D^{(1)}$.
Eguchi in [26] investigated a geometry based on a normalizing function similar to the one used in the generalization of Rényi divergence. In [26], results were derived supposing that this normalizing function exists; conditions for its existence were not given. In the present paper, the existence of the normalizing function is ensured by conditions involved in the definition of φ-functions.
The rest of the paper is organized as follows. In Section 2, φ-functions are introduced and some properties are discussed. The Rényi divergence is generalized in Section 3. We investigate in Section 4 the geometry induced by the generalization of Rényi divergence. Section 4.2 provides evidence of the role of the generalized Rényi divergence in φ-families.
2. φ-Functions
Rényi divergence is defined in terms of the exponential function (to be more precise, the logarithm). A way of generalizing Rényi divergence is to replace the exponential function by another function, which satisfies some suitable conditions. To a function for which these conditions hold, we give the name φ-function. In this section, we define and investigate some properties of φ-functions.
Let $(T, \Sigma, \mu)$ be a measure space. Although we do not restrict our analysis to a particular measure space, the reader can think of T as the set of real numbers $\mathbb{R}$, Σ as the Borel σ-algebra on $\mathbb{R}$, and μ as the Lebesgue measure. We can also consider T to be a discrete set, in which case μ is the counting measure.
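We say that $\varphi\colon \mathbb{R} \to (0, \infty)$ is a φ-function if the following conditions are satisfied:
- (a1) $\varphi(\cdot)$ is convex;
- (a2) $\lim_{u \to -\infty} \varphi(u) = 0$ and $\lim_{u \to \infty} \varphi(u) = \infty$;
- (a3) there exists a measurable function $u_0\colon T \to (0, \infty)$ such that
$$\int_T \varphi(c + \lambda u_0)\, d\mu < \infty, \quad \text{for all } \lambda > 0, \tag{1}$$
for each measurable function $c\colon T \to \mathbb{R}$ satisfying $\int_T \varphi(c)\, d\mu = 1$.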
Thanks to condition (a3), we can generalize Rényi divergence using φ-functions. These conditions first appeared in [12], where the authors constructed non-parametric φ-families of probability distributions. We remark that if T is finite, condition (a3) is always satisfied.
Examples of functions satisfying (a1)–(a3) abound. An example of great relevance is the exponential function $\varphi(u) = \exp(u)$, which satisfies conditions (a1)–(a3) with $u_0 = 1$ (the constant function equal to one on T). Another example of a φ-function is Kaniadakis’ κ-exponential [12,27,28].
Example 1.
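The Kaniadakis κ-exponential $\exp_\kappa\colon \mathbb{R} \to (0, \infty)$ for $\kappa \in [-1, 1]$ is defined as
$$\exp_\kappa(u) = \begin{cases} \big(\kappa u + \sqrt{1 + \kappa^2 u^2}\big)^{1/\kappa}, & \text{if } \kappa \neq 0, \\ \exp(u), & \text{if } \kappa = 0, \end{cases}$$
whose inverse is the so-called Kaniadakis κ-logarithm $\ln_\kappa\colon (0, \infty) \to \mathbb{R}$, which is given by
$$\ln_\kappa(v) = \begin{cases} \dfrac{v^{\kappa} - v^{-\kappa}}{2\kappa}, & \text{if } \kappa \neq 0, \\ \ln(v), & \text{if } \kappa = 0. \end{cases}$$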
It is clear that satisfies (a1) and (a2). Let be any measurable function for which . We will show that satisfies expression (1). For any and , we can write
where we used that . Then, we conclude that for all . Fix any measurable function such that . For each , we have
which shows that satisfies (a3). Therefore, the Kaniadakis’ κ-exponential is an example of φ-function.
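The scaling property used in Example 1 can be checked numerically. The sketch below is only an illustration: it assumes the standard closed forms of the κ-exponential and κ-logarithm, and it assumes that the bound established in the example reads $\exp_\kappa(\alpha u) \leq \alpha^{1/|\kappa|} \exp_\kappa(u)$ for $\alpha \geq 1$. The helper names (`exp_kappa`, `ln_kappa`, `scaling_bound_holds`) are hypothetical.

```python
import math


def exp_kappa(u: float, kappa: float) -> float:
    """Kaniadakis kappa-exponential; reduces to exp(u) when kappa == 0."""
    if kappa == 0.0:
        return math.exp(u)
    return (kappa * u + math.sqrt(1.0 + (kappa * u) ** 2)) ** (1.0 / kappa)


def ln_kappa(v: float, kappa: float) -> float:
    """Kaniadakis kappa-logarithm, the inverse of exp_kappa on (0, inf)."""
    if kappa == 0.0:
        return math.log(v)
    return (v ** kappa - v ** (-kappa)) / (2.0 * kappa)


def scaling_bound_holds(u: float, alpha: float, kappa: float, tol: float = 1e-12) -> bool:
    """Check exp_kappa(alpha*u) <= alpha**(1/|kappa|) * exp_kappa(u) for alpha >= 1, kappa != 0.

    The inequality is the assumed form of the bound in Example 1; `tol`
    only absorbs floating-point rounding in the equality case alpha == 1.
    """
    lhs = exp_kappa(alpha * u, kappa)
    rhs = alpha ** (1.0 / abs(kappa)) * exp_kappa(u, kappa)
    return lhs <= rhs * (1.0 + tol)


# Spot check on a small grid of arguments, scale factors and deformation parameters.
grid_ok = all(
    scaling_bound_holds(u, a, k)
    for u in (-5.0, -1.0, 0.0, 0.5, 3.0)
    for a in (1.0, 2.0, 10.0)
    for k in (-1.0, -0.5, 0.3, 1.0)
)
```

A grid check of this kind is no substitute for the proof; it merely makes the polynomial growth of $\exp_\kappa$, which is what condition (a3) exploits, concrete.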
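The restriction that $\int_T \varphi(c)\, d\mu = 1$ can be weakened, as asserted in the next result.
Lemma 1.
Let $\tilde{c}\colon T \to \mathbb{R}$ be any measurable function such that $\int_T \varphi(\tilde{c})\, d\mu < \infty$. Then, $\int_T \varphi(\tilde{c} + \lambda u_0)\, d\mu < \infty$ for all $\lambda > 0$.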
Proof.
Notice that if , then for some . From the definition of , it follows that , where . Now assume that . Consider any measurable set with measure . Let be a measurable function supported on A satisfying , where . Defining , we see that . By the definition of , we can write
which is the desired result. ☐
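As a consequence of Lemma 1, condition (a3) can be replaced by the following one:
(a3’) There exists a measurable function $u_0\colon T \to (0, \infty)$ such that
$$\int_T \varphi(c + \lambda u_0)\, d\mu < \infty, \quad \text{for all } \lambda > 0, \tag{2}$$
for each measurable function $c\colon T \to \mathbb{R}$ for which $\int_T \varphi(c)\, d\mu < \infty$.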
Without the equivalence between conditions (a3) and (a3’), we could not generalize Rényi divergence in the manner we propose. In fact, φ-functions could be defined directly in terms of (a3’), without mentioning (a3). We chose to begin with (a3) because this condition appeared initially in [12].
Not every function $\varphi\colon \mathbb{R} \to (0, \infty)$ for which conditions (a1) and (a2) hold satisfies condition (a3). A counterexample is given below.
Example 2.
Assume that the underlying measure μ is σ-finite and non-atomic. This is the case of the Lebesgue measure. Let us consider the function
which clearly is convex, and satisfies the limits and . Given any measurable function , we will find a measurable function with , for which expression (2) is not satisfied.
For each , we define
where . Because , we can find a sub-sequence such that
According to (Lemma 8.3 in [29]) , there exists a sub-sequence and pairwise disjoint sets for which
Let us define , where and is any measurable function such that for and . Observing that
we get
Then, we can write
3. Generalization of Rényi Divergence
In this section, we provide a generalization of Rényi divergence, which is given in terms of a φ-function. This generalization also depends on a parameter $\alpha \in [-1, 1]$; for $\alpha \in \{-1, 1\}$, it is defined as a limit. Supposing that the underlying φ-function is continuously differentiable, we show that this limit exists and results in the φ-divergence [12]. In what follows, all probability distributions are assumed to have positive density. In other words, they belong to the collection
$$\mathcal{P}_\mu = \Big\{ p \in L^0 : \int_T p\, d\mu = 1 \text{ and } p > 0 \Big\},$$
where $L^0$ is the space of all real-valued, measurable functions on T, with equality μ-a.e. (μ-almost everywhere).
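The Rényi divergence of order $\alpha \in (-1, 1)$ between two probability distributions p and q in $\mathcal{P}_\mu$ is defined as
$$\mathcal{D}_R^{(\alpha)}(p \,\|\, q) = \frac{4}{\alpha^2 - 1} \log\Big( \int_T p^{\frac{1-\alpha}{2}} q^{\frac{1+\alpha}{2}}\, d\mu \Big). \tag{4}$$
For $\alpha \in \{-1, 1\}$, the Rényi divergence is defined by taking a limit:
$$\mathcal{D}_R^{(-1)}(p \,\|\, q) = \lim_{\alpha \downarrow -1} \mathcal{D}_R^{(\alpha)}(p \,\|\, q), \tag{5}$$
$$\mathcal{D}_R^{(1)}(p \,\|\, q) = \lim_{\alpha \uparrow 1} \mathcal{D}_R^{(\alpha)}(p \,\|\, q). \tag{6}$$
Under some conditions, the limits in (5) and (6) are finite-valued and equal the Kullback–Leibler divergence. In other words,
$$\mathcal{D}_R^{(-1)}(p \,\|\, q) = \mathcal{D}_R^{(1)}(q \,\|\, p) = \mathcal{D}_{KL}(p \,\|\, q),$$
where $\mathcal{D}_{KL}(p \,\|\, q)$ denotes the Kullback–Leibler divergence between p and q, which is given by
$$\mathcal{D}_{KL}(p \,\|\, q) = \int_T p \log\Big(\frac{p}{q}\Big)\, d\mu.$$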
These conditions are stated in Proposition 1, given in the end of this section, for the case involving the generalized Rényi divergence.
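The limiting behavior can be illustrated on a finite set T with the counting measure. The following sketch assumes the form $\mathcal{D}_R^{(\alpha)}(p\,\|\,q) = \frac{4}{\alpha^2-1}\log \sum_t p^{(1-\alpha)/2} q^{(1+\alpha)/2}$ for expression (4); the distributions `p` and `q` and the function names are illustrative choices, not taken from the paper.

```python
import math


def renyi_alpha(p, q, alpha):
    """Renyi divergence in the alpha-parametrization of (4), for -1 < alpha < 1."""
    s = sum(pi ** ((1 - alpha) / 2) * qi ** ((1 + alpha) / 2) for pi, qi in zip(p, q))
    return 4.0 / (alpha ** 2 - 1.0) * math.log(s)


def kullback_leibler(p, q):
    """Kullback-Leibler divergence between strictly positive discrete densities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Two strictly positive probability vectors on a three-point set (illustrative).
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]

eps = 1e-5
near_minus_one = renyi_alpha(p, q, -1.0 + eps)  # approximates the limit in (5)
near_plus_one = renyi_alpha(p, q, 1.0 - eps)    # approximates the limit in (6)
```

In this parametrization, `near_minus_one` approaches the Kullback–Leibler divergence of p from q, while `near_plus_one` approaches it with the arguments swapped. This is the symmetry that the prefactor $4/(\alpha^2 - 1)$ is meant to preserve.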
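The Rényi divergence in its standard form is given by
$$\mathcal{D}_{\alpha}(p \,\|\, q) = \frac{1}{\alpha - 1} \log\Big( \int_T p^{\alpha} q^{1-\alpha}\, d\mu \Big), \quad \text{for } \alpha \in (0, 1). \tag{7}$$
Expression (4) is related to this form by
$$\mathcal{D}_R^{(\alpha)}(p \,\|\, q) = \frac{2}{1-\alpha}\, \mathcal{D}_{(1-\alpha)/2}(p \,\|\, q).$$
Beyond the change of variables, which results in α ranging in $[-1, 1]$, expressions (4) and (7) differ by the factor $2/(1-\alpha)$. We opted to insert the term $2/(1-\alpha)$ so that some kind of symmetry could be maintained when the limits $\alpha \downarrow -1$ and $\alpha \uparrow 1$ are considered. In addition, the geometry induced by the version (4) conforms with Amari’s notation [5].
The Rényi divergence can be defined for every $\alpha \in \mathbb{R}$. However, for $\alpha \notin [-1, 1]$, the expression (4) may not be finite-valued for every p and q in $\mathcal{P}_\mu$. To avoid some technicalities, we just consider $\alpha \in [-1, 1]$.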
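Given p and q in $\mathcal{P}_\mu$, let us define
$$\kappa(\alpha) = -\log\Big( \int_T p^{\frac{1-\alpha}{2}} q^{\frac{1+\alpha}{2}}\, d\mu \Big), \quad \text{for } \alpha \in [-1, 1],$$
which can be used to express the Rényi divergence as
$$\mathcal{D}_R^{(\alpha)}(p \,\|\, q) = \frac{4}{1 - \alpha^2}\, \kappa(\alpha).$$
The function $\kappa(\alpha)$, which depends on p and q, can be defined as the unique non-negative real number for which
$$\int_T \exp\Big( \frac{1-\alpha}{2} \log(p) + \frac{1+\alpha}{2} \log(q) + \kappa(\alpha) \Big)\, d\mu = 1. \tag{8}$$
The function $\kappa(\alpha)$ plays the role of a normalizing term. The generalization of Rényi divergence, which we propose, is based on the interpretation of $\kappa(\alpha)$ given in (8). Instead of the exponential function, we consider a φ-function in (8).
Fix any φ-function $\varphi\colon \mathbb{R} \to (0, \infty)$. Given any p and q in $\mathcal{P}_\mu$, we take $\kappa(\alpha) \geq 0$ so that
$$\int_T \varphi\Big( \frac{1-\alpha}{2} \varphi^{-1}(p) + \frac{1+\alpha}{2} \varphi^{-1}(q) + \kappa(\alpha) u_0 \Big)\, d\mu = 1, \tag{9}$$
or, in other words, the term inside the integral is a probability distribution in $\mathcal{P}_\mu$. The existence and uniqueness of $\kappa(\alpha)$ as defined in (9) is guaranteed by condition (a3’).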
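We define a generalization of the Rényi divergence of order $\alpha \in (-1, 1)$ as
$$\mathcal{D}_\varphi^{(\alpha)}(p \,\|\, q) = \frac{4}{1 - \alpha^2}\, \kappa(\alpha). \tag{10}$$
For $\alpha \in \{-1, 1\}$, this generalization is defined as a limit:
$$\mathcal{D}_\varphi^{(-1)}(p \,\|\, q) = \lim_{\alpha \downarrow -1} \mathcal{D}_\varphi^{(\alpha)}(p \,\|\, q), \tag{11}$$
$$\mathcal{D}_\varphi^{(1)}(p \,\|\, q) = \lim_{\alpha \uparrow 1} \mathcal{D}_\varphi^{(\alpha)}(p \,\|\, q). \tag{12}$$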
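The cases $\alpha \in \{-1, 1\}$ are related to a generalization of the Kullback–Leibler divergence, the so-called φ-divergence, which was introduced by the authors in [12]. The φ-divergence is given by
$$\mathcal{D}_\varphi(p \,\|\, q) = \frac{\displaystyle\int_T \frac{\varphi^{-1}(p) - \varphi^{-1}(q)}{(\varphi^{-1})'(p)}\, d\mu}{\displaystyle\int_T \frac{u_0}{(\varphi^{-1})'(p)}\, d\mu}. \tag{13}$$
(It was pointed out to us by an anonymous referee that this form of divergence is a special case of a divergence studied by Zhang (see Section 3.5 in [19]), apart from a conformal factor, which is the denominator of (13).)
Under some conditions, the limit in (11) or (12) is finite-valued and equals the φ-divergence:
$$\mathcal{D}_\varphi^{(-1)}(p \,\|\, q) = \mathcal{D}_\varphi^{(1)}(q \,\|\, p) = \mathcal{D}_\varphi(p \,\|\, q). \tag{14}$$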
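The construction can be made concrete on a finite set T. The sketch below is an illustration under stated assumptions: counting measure, $u_0 = 1$, the Kaniadakis κ-exponential as the φ-function, and the forms of (9), (10) and (13) in which $\kappa(\alpha)$ solves a normalization equation and the φ-divergence is a ratio of two weighted sums. The bisection solver and all function names are hypothetical.

```python
import math


def exp_kappa(u, k):
    """Kaniadakis kappa-exponential (k == 0 gives the ordinary exponential)."""
    return math.exp(u) if k == 0 else (k * u + math.sqrt(1 + (k * u) ** 2)) ** (1 / k)


def ln_kappa(v, k):
    """Kaniadakis kappa-logarithm, the inverse of exp_kappa."""
    return math.log(v) if k == 0 else (v ** k - v ** (-k)) / (2 * k)


def d_exp_kappa(u, k):
    """Derivative of exp_kappa: exp_kappa(u) / sqrt(1 + k^2 u^2)."""
    return exp_kappa(u, k) / math.sqrt(1 + (k * u) ** 2)


def kappa_alpha(p, q, alpha, k):
    """Solve sum_t phi(c_alpha(t) + kappa) = 1 for kappa >= 0 by bisection (u0 = 1)."""
    c = [(1 - alpha) / 2 * ln_kappa(pi, k) + (1 + alpha) / 2 * ln_kappa(qi, k)
         for pi, qi in zip(p, q)]

    def total(kap):
        return sum(exp_kappa(ci + kap, k) for ci in c)

    lo, hi = 0.0, 1.0          # convexity of phi gives total(0) <= 1
    while total(hi) < 1.0:     # total is increasing in kap, so a bracket exists
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if total(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def gen_renyi(p, q, alpha, k):
    """Generalized Renyi divergence, assumed form 4 / (1 - alpha^2) * kappa(alpha)."""
    return 4.0 / (1.0 - alpha ** 2) * kappa_alpha(p, q, alpha, k)


def phi_divergence(p, q, k):
    """phi-divergence in the assumed form (13), with u0 = 1."""
    a = [ln_kappa(pi, k) for pi in p]
    b = [ln_kappa(qi, k) for qi in q]
    w = [d_exp_kappa(ai, k) for ai in a]   # phi'(phi^{-1}(p)) = 1 / (phi^{-1})'(p)
    return sum((ai - bi) * wi for ai, bi, wi in zip(a, b, w)) / sum(w)


p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
near_limit = gen_renyi(p, q, -1.0 + 1e-4, 0.5)   # should be close to phi_divergence(p, q, 0.5)
```

With `k = 0` the solver returns minus the logarithm of the classical Rényi sum, so `gen_renyi` reduces to (4). For `k = 0.5`, evaluating near $\alpha = -1$ gives a value close to `phi_divergence(p, q, 0.5)`, in line with (14).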
To show (14), we make use of the following result.
Lemma 2.
Assume that is continuously differentiable. If for , the expression
is satisfied for all , then the derivative of exists at any , and is given by
where .
Proof.
For and , define
The function is defined implicitly by . If we show that
- (i)
- the function is continuous in a neighborhood of ,
- (ii)
- the partial derivatives and exist and are continuous at ,
- (iii)
- and ,
then the Implicit Function Theorem guarantees that the derivative exists and is given by (16).
We begin by verifying that is continuous. For fixed and , set . Denoting , we can write
for every and . Because the function on the right-hand side of (18) is integrable, we can apply the Dominated Convergence Theorem to conclude that
Now, we will show that the derivative of with respect to α exists and is continuous. Consider the difference
where . Represent by the function inside the integral sign in (19). For fixed and , denote , , and . Because is convex and increasing, it follows that
where . Observing that f is integrable, we can use the Dominated Convergence Theorem to get
and then
For and , the function inside the integral sign in (20) is dominated by f. As a result, a second use of the Dominated Convergence Theorem shows that is continuous at :
Using similar arguments, one can show that exists and is continuous at any and , and is given by
Clearly, expression (21) implies that for all and .
As an immediate consequence of Lemma 2, we get the proposition below.
4. Generalized Statistical Manifolds
Statistical manifolds consist of a collection of probability distributions endowed with a metric and α-connections, which are defined in terms of the derivative of $l_\theta = \log p_\theta$. In a generalized statistical manifold, the metric and connection are defined in terms of $f_\theta = \varphi^{-1}(p_\theta)$. Instead of the logarithm, we consider the inverse of a φ-function. Generalized statistical manifolds were introduced by the authors in [17,18]. Among examples of generalized statistical manifolds, (parametric) φ-families of probability distributions are of greatest importance. The non-parametric counterpart was investigated in [11,12]. The metric in φ-families can be defined as the Hessian of a function; i.e., φ-families are Hessian manifolds [30]. In [17,18], the φ-divergence gives rise to a pair of dual connections $D^{(-1)}$ and $D^{(1)}$; then, for $\alpha \in (-1, 1)$, the α-connection $D^{(\alpha)}$ is defined as the convex combination $D^{(\alpha)} = \frac{1-\alpha}{2} D^{(-1)} + \frac{1+\alpha}{2} D^{(1)}$. In the present paper, we show that the connection induced by $\mathcal{D}_\varphi^{(\alpha)}(\cdot \,\|\, \cdot)$, the generalization of Rényi divergence, corresponds to $D^{(\alpha)}$.
4.1. Definitions
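Let $\varphi\colon \mathbb{R} \to (0, \infty)$ be a φ-function. A generalized statistical manifold $\mathcal{P} = \{p_\theta : \theta \in \Theta\}$ is a collection of probability distributions $p_\theta \in \mathcal{P}_\mu$, indexed by parameters $\theta = (\theta^1, \dots, \theta^n) \in \Theta$ in a one-to-one relation, such that
- (m1) Θ is a domain (open and connected set) in $\mathbb{R}^n$;
- (m2) $p_\theta(t)$ is differentiable with respect to θ;
- (m3) the matrix $g = (g_{ij})$ defined by
$$g_{ij} = -E'_\theta\Big[ \frac{\partial^2 f_\theta}{\partial\theta^i \partial\theta^j} \Big] \tag{22}$$
is positive definite at each $\theta \in \Theta$, where $f_\theta = \varphi^{-1}(p_\theta)$ and
$$E'_\theta[\,\cdot\,] = \frac{\int_T (\,\cdot\,)\, \varphi'(f_\theta)\, d\mu}{\int_T u_0\, \varphi'(f_\theta)\, d\mu}; \tag{23}$$
- (m4) the operations of integration with respect to μ and differentiation with respect to $\theta^i$ commute in all calculations found below, which are related to the metric and connections.
The matrix $(g_{ij})$ equips $\mathcal{P}$ with a metric. By the chain rule, the tensor related to $(g_{ij})$ is invariant under change of coordinates. The (classical) statistical manifold is a particular case in which $\varphi(u) = \exp(u)$ and $u_0 = 1$.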
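We introduce a notation similar to Equation (23) that involves higher order derivatives of $\varphi$. For each $n \geq 1$, we define
$$E^{(n)}_\theta[\,\cdot\,] = \frac{\int_T (\,\cdot\,)\, \varphi^{(n)}(f_\theta)\, d\mu}{\int_T u_0\, \varphi'(f_\theta)\, d\mu}. \tag{24}$$
We also use $E'_\theta[\cdot]$, $E''_\theta[\cdot]$ and $E'''_\theta[\cdot]$ to denote $E^{(n)}_\theta[\cdot]$ for $n = 1, 2, 3$, respectively. The notation (24) appears in expressions related to the metric and connections.
Using property (m4), we can find an alternate expression for $g_{ij}$ as well as an identification involving tangent spaces. The matrix $(g_{ij})$ can be equivalently defined by
$$g_{ij} = E''_\theta\Big[ \frac{\partial f_\theta}{\partial\theta^i} \frac{\partial f_\theta}{\partial\theta^j} \Big]. \tag{25}$$
As a consequence of this equivalence, the tangent space $T_{p_\theta}\mathcal{P}$ can be identified with $\widetilde{T}_{p_\theta}\mathcal{P}$, the vector space spanned by $\partial f_\theta / \partial\theta^i$, and endowed with the inner product $\langle \widetilde{X}, \widetilde{Y} \rangle_\theta = E''_\theta[\widetilde{X}\widetilde{Y}]$. The mapping
$$\sum_i a^i \frac{\partial}{\partial\theta^i} \mapsto \sum_i a^i \frac{\partial f_\theta}{\partial\theta^i}$$
defines an isometry between $T_{p_\theta}\mathcal{P}$ and $\widetilde{T}_{p_\theta}\mathcal{P}$.
To verify (25), we differentiate $\int_T p_\theta\, d\mu = 1$, with respect to $\theta^i$, to get
$$0 = \int_T \frac{\partial f_\theta}{\partial\theta^i}\, \varphi'(f_\theta)\, d\mu. \tag{26}$$
Now, differentiating with respect to $\theta^j$, we obtain
$$0 = \int_T \frac{\partial^2 f_\theta}{\partial\theta^i \partial\theta^j}\, \varphi'(f_\theta)\, d\mu + \int_T \frac{\partial f_\theta}{\partial\theta^i} \frac{\partial f_\theta}{\partial\theta^j}\, \varphi''(f_\theta)\, d\mu,$$
and then (25) follows. In view of (26), we notice that every vector $\widetilde{X}$ belonging to $\widetilde{T}_{p_\theta}\mathcal{P}$ satisfies $E'_\theta[\widetilde{X}] = 0$.
The metric $g = (g_{ij})$ gives rise to a Levi–Civita connection ∇ (i.e., a torsion-free, metric connection), whose corresponding Christoffel symbols are given by
$$\Gamma_{ijk} = \frac{1}{2} \Big( \frac{\partial g_{ki}}{\partial\theta^j} + \frac{\partial g_{kj}}{\partial\theta^i} - \frac{\partial g_{ij}}{\partial\theta^k} \Big). \tag{27}$$
As we will show later, the Levi–Civita connection ∇ corresponds to the connection derived from the divergence $\mathcal{D}_\varphi^{(\alpha)}(\cdot \,\|\, \cdot)$ with $\alpha = 0$.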
4.2. φ-Families
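Let $c\colon T \to \mathbb{R}$ be a measurable function for which $p = \varphi(c)$ is a probability density in $\mathcal{P}_\mu$. Fix measurable functions $u_1, \dots, u_n\colon T \to \mathbb{R}$. A (parametric) φ-family $\mathcal{F}_p = \{p_\theta : \theta \in \Theta\}$, centered at $p = \varphi(c)$, is a set of probability distributions in $\mathcal{P}_\mu$, whose members can be written in the form
$$p_\theta = \varphi\Big( c + \sum_{i=1}^n \theta^i u_i - \psi(\theta) u_0 \Big), \quad \text{for each } \theta = (\theta^i) \in \Theta, \tag{28}$$
where $\psi\colon \Theta \to [0, \infty)$ is a normalizing function, which is introduced so that expression (28) defines a probability distribution belonging to $\mathcal{P}_\mu$.
The functions $u_1, \dots, u_n$ are not arbitrary. They are chosen to satisfy the following assumptions:
- (i) $u_0, u_1, \dots, u_n$ are linearly independent,
- (ii) $\int_T u_i\, \varphi'(c)\, d\mu = 0$, and
- (iii) there exists $\varepsilon > 0$ such that $\int_T \varphi(c + \lambda u_i)\, d\mu < \infty$, for all $\lambda \in (-\varepsilon, \varepsilon)$.
Moreover, the domain $\Theta \subseteq \mathbb{R}^n$ is defined as the set of all vectors $\theta = (\theta^i)$ for which
$$\int_T \varphi\Big( c + \lambda \sum_{i=1}^n \theta^i u_i \Big)\, d\mu < \infty, \quad \text{for some } \lambda > 1.$$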
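Condition (i) implies that the mapping defined by (28) is one-to-one. Assumption (ii) makes ψ a non-negative function. Indeed, by the convexity of $\varphi(\cdot)$, along with (ii), we can write
$$\int_T \varphi\Big( c + \sum_{i=1}^n \theta^i u_i \Big)\, d\mu \geq \int_T \Big[ \varphi(c) + \Big( \sum_{i=1}^n \theta^i u_i \Big) \varphi'(c) \Big]\, d\mu = 1,$$
which implies $\psi(\theta) \geq 0$. By condition (iii), the domain Θ is an open neighborhood of the origin. If the set T is finite, condition (iii) is always satisfied. One can show that the domain Θ is open and convex. Moreover, the normalizing function ψ is also convex (or strictly convex if $\varphi(\cdot)$ is strictly convex). Conditions (ii) and (iii) also appear in the definition of non-parametric φ-families. For further details, we refer to [11,12].
In a φ-family $\mathcal{F}_p$, the matrix $(g_{ij})$ given by (22) or (25) can be expressed as the Hessian of ψ. If $\varphi(\cdot)$ is strictly convex, then $(g_{ij})$ is positive definite. From
$$\frac{\partial^2 f_\theta}{\partial\theta^i \partial\theta^j} = -\frac{\partial^2 \psi}{\partial\theta^i \partial\theta^j}\, u_0,$$
it follows that $g_{ij} = \dfrac{\partial^2 \psi}{\partial\theta^i \partial\theta^j}$.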
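The role of the normalizing function is easiest to see in the classical special case. The sketch below assumes $\varphi = \exp$, $u_0 = 1$, a four-point set T with counting measure, and a single parameter θ; under these assumptions the form $p_\theta = \varphi(c + \theta u - \psi(\theta) u_0)$ reduces to an exponential family and ψ becomes the cumulant (log-partition) function. The reference density `p0` and the statistic `v` are arbitrary illustrative choices.

```python
import math

# Reference density p = exp(c) on a four-point set T (illustrative values).
p0 = [0.4, 0.3, 0.2, 0.1]

# Arbitrary statistic, centered so that assumption (ii) holds:
# for phi = exp we have phi'(c) = p0, hence sum_t u(t) * p0(t) = 0.
v = [1.0, -2.0, 0.5, 3.0]
mean_v = sum(vi * pi for vi, pi in zip(v, p0))
u = [vi - mean_v for vi in v]


def psi(theta: float) -> float:
    """Normalizing function for phi = exp and u0 = 1 (cumulant function)."""
    return math.log(sum(pi * math.exp(theta * ui) for pi, ui in zip(p0, u)))


def density(theta: float) -> list:
    """p_theta = exp(c + theta * u - psi(theta) * u0), with c = log(p0)."""
    s = psi(theta)
    return [pi * math.exp(theta * ui - s) for pi, ui in zip(p0, u)]
```

Because `u` is centered with respect to `p0`, Jensen's inequality gives `psi(theta) >= 0` with `psi(0) = 0`, which is the non-negativity argument of this section specialized to the exponential function. For a general φ-function, ψ usually has no closed form and must be found numerically.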
The next two results show how the generalization of Rényi divergence and the φ-divergence are related to the normalizing function in φ-families.
Proposition 2.
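In a φ-family $\mathcal{F}_p$, the generalization of Rényi divergence for $\alpha \in (-1, 1)$ can be expressed in terms of the normalizing function ψ as follows:
$$\mathcal{D}_\varphi^{(\alpha)}(p_\theta \,\|\, p_\vartheta) = \frac{4}{1-\alpha^2} \Big[ \frac{1-\alpha}{2} \psi(\theta) + \frac{1+\alpha}{2} \psi(\vartheta) - \psi\Big( \frac{1-\alpha}{2} \theta + \frac{1+\alpha}{2} \vartheta \Big) \Big], \tag{29}$$
for all $\theta, \vartheta \in \Theta$.
Proof.
Recall the definition of $\kappa(\alpha)$ as the real number for which
$$\int_T \varphi\Big( \frac{1-\alpha}{2} \varphi^{-1}(p_\theta) + \frac{1+\alpha}{2} \varphi^{-1}(p_\vartheta) + \kappa(\alpha) u_0 \Big)\, d\mu = 1.$$
Using expression (28) for probability distributions in $\mathcal{F}_p$, we can write
$$\frac{1-\alpha}{2} \varphi^{-1}(p_\theta) + \frac{1+\alpha}{2} \varphi^{-1}(p_\vartheta) + \kappa(\alpha) u_0 = c + \sum_{i=1}^n \Big( \frac{1-\alpha}{2} \theta^i + \frac{1+\alpha}{2} \vartheta^i \Big) u_i - \Big[ \frac{1-\alpha}{2} \psi(\theta) + \frac{1+\alpha}{2} \psi(\vartheta) - \kappa(\alpha) \Big] u_0 = \varphi^{-1}\big( p_{\frac{1-\alpha}{2}\theta + \frac{1+\alpha}{2}\vartheta} \big).$$
The last equality is a consequence of the domain Θ being convex. Thus, it follows that
$$\kappa(\alpha) = \frac{1-\alpha}{2} \psi(\theta) + \frac{1+\alpha}{2} \psi(\vartheta) - \psi\Big( \frac{1-\alpha}{2} \theta + \frac{1+\alpha}{2} \vartheta \Big).$$
By the definition of $\mathcal{D}_\varphi^{(\alpha)}(\cdot \,\|\, \cdot)$, we get (29). ☐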
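The identity in Proposition 2 lends itself to a numerical sanity check. The sketch below assumes a one-parameter φ-family on a four-point set with counting measure, $u_0 = 1$, and the κ-exponential with $\kappa = 0.5$ as the φ-function. It further assumes that (29) takes the form $\frac{4}{1-\alpha^2}\big[\frac{1-\alpha}{2}\psi(\theta) + \frac{1+\alpha}{2}\psi(\vartheta) - \psi\big(\frac{1-\alpha}{2}\theta + \frac{1+\alpha}{2}\vartheta\big)\big]$. All names and numerical values are hypothetical.

```python
import math

K = 0.5  # deformation parameter of the kappa-exponential (illustrative choice)


def phi(u):
    return (K * u + math.sqrt(1 + (K * u) ** 2)) ** (1 / K)


def phi_inv(v):
    return (v ** K - v ** (-K)) / (2 * K)


def dphi(u):
    return phi(u) / math.sqrt(1 + (K * u) ** 2)


def root_nonneg(f):
    """Bisection for s >= 0 with f(s) = 0, assuming f increasing and f(0) <= 0."""
    lo, hi = 0.0, 1.0
    while f(hi) < 0.0:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


p0 = [0.4, 0.3, 0.2, 0.1]
c = [phi_inv(x) for x in p0]                      # p0 = phi(c)
v = [1.0, -2.0, 0.5, 3.0]
shift = sum(vi * dphi(ci) for vi, ci in zip(v, c)) / sum(dphi(ci) for ci in c)
u1 = [vi - shift for vi in v]                     # enforces sum u1 * phi'(c) = 0


def psi(theta):
    """Normalizing function: solves sum_t phi(c + theta*u1 - psi) = 1 (u0 = 1)."""
    return root_nonneg(lambda s: 1.0 - sum(phi(ci + theta * ui - s) for ci, ui in zip(c, u1)))


def density(theta):
    s = psi(theta)
    return [phi(ci + theta * ui - s) for ci, ui in zip(c, u1)]


def gen_renyi(p, q, alpha):
    """4 / (1 - alpha^2) * kappa(alpha), with kappa(alpha) solving the normalization."""
    mix = [(1 - alpha) / 2 * phi_inv(a) + (1 + alpha) / 2 * phi_inv(b) for a, b in zip(p, q)]
    kap = root_nonneg(lambda s: sum(phi(m + s) for m in mix) - 1.0)
    return 4.0 / (1.0 - alpha ** 2) * kap


def zhang_form(theta, vartheta, alpha):
    """Right-hand side of (29) as assumed above, written in terms of psi only."""
    a, b = (1 - alpha) / 2, (1 + alpha) / 2
    return 4.0 / (1.0 - alpha ** 2) * (a * psi(theta) + b * psi(vartheta) - psi(a * theta + b * vartheta))
```

Evaluating `gen_renyi(density(0.3), density(-0.4), 0.2)` and `zhang_form(0.3, -0.4, 0.2)` should give the same number up to solver tolerance. The first goes through the divergence definition and the second only through ψ.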
Proposition 3.
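In a φ-family $\mathcal{F}_p$, the φ-divergence is related to the normalizing function ψ by the equality
$$\mathcal{D}_\varphi(p_\theta \,\|\, p_\vartheta) = \psi(\vartheta) - \psi(\theta) - \nabla\psi(\theta) \cdot (\vartheta - \theta), \tag{30}$$
for all $\theta, \vartheta \in \Theta$.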
Proof.
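The computation can be sketched as follows. This outline is a reconstruction, not necessarily the argument originally given: it assumes that the φ-divergence can be written as $\mathcal{D}_\varphi(p \,\|\, q) = E'_\theta[\varphi^{-1}(p) - \varphi^{-1}(q)]$ for $p = p_\theta$, that $f_\theta = \varphi^{-1}(p_\theta) = c + \sum_i \theta^i u_i - \psi(\theta) u_0$, and that $E'_\theta[u_0] = 1$ and $E'_\theta[\partial f_\theta / \partial\theta^i] = 0$.

```latex
\begin{align*}
f_\theta - f_\vartheta
  &= \sum_{i=1}^n (\theta^i - \vartheta^i)\, u_i - \big[\psi(\theta) - \psi(\vartheta)\big]\, u_0, \\
0 = E'_\theta\!\left[\frac{\partial f_\theta}{\partial \theta^i}\right]
  &= E'_\theta[u_i] - \frac{\partial \psi}{\partial \theta^i}(\theta)
  \quad\Longrightarrow\quad
  E'_\theta[u_i] = \frac{\partial \psi}{\partial \theta^i}(\theta), \\
\mathcal{D}_\varphi(p_\theta \,\|\, p_\vartheta)
  &= E'_\theta[f_\theta - f_\vartheta]
   = \sum_{i=1}^n (\theta^i - \vartheta^i)\, \frac{\partial \psi}{\partial \theta^i}(\theta)
     - \psi(\theta) + \psi(\vartheta) \\
  &= \psi(\vartheta) - \psi(\theta) - \nabla\psi(\theta) \cdot (\vartheta - \theta).
\end{align*}
```

The last line is the Bregman divergence of ψ between ϑ and θ.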
In Proposition 2, the expression on the right-hand side of Equation (29) defines a divergence on its own, which was investigated by Jun Zhang in [19]. Proposition 3 asserts that the φ-divergence $\mathcal{D}_\varphi(p_\theta \,\|\, p_\vartheta)$ coincides with the Bregman divergence [31,32] associated with the normalizing function ψ for points ϑ and θ in Θ. Because ψ is convex and attains a minimum at $\theta = 0$, it follows that $\partial\psi / \partial\theta^i = 0$ at $\theta = 0$. As a result, equality (30) reduces to $\mathcal{D}_\varphi(p \,\|\, p_\theta) = \psi(\theta)$.
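4.3. Geometry Induced by $\mathcal{D}_\varphi^{(\alpha)}(\cdot \,\|\, \cdot)$
In this section, we assume that $\varphi(\cdot)$ is continuously differentiable and strictly convex. The latter assumption guarantees that
$$\mathcal{D}_\varphi^{(\alpha)}(p \,\|\, q) = 0 \quad \text{if and only if} \quad p = q.$$
The generalized Rényi divergence induces a metric $g = (g_{ij})$ in generalized statistical manifolds $\mathcal{P}$. This metric is given by
$$g_{ij} = -\Big[ \Big( \frac{\partial}{\partial\theta^i} \Big)_p \Big( \frac{\partial}{\partial\theta^j} \Big)_q \mathcal{D}_\varphi^{(\alpha)}(p \,\|\, q) \Big]_{q = p}.$$
To show that this expression defines a metric, we have to verify that $g_{ij}$ is invariant under change of coordinates, and that $(g_{ij})$ is positive definite. The first claim follows from the chain rule. The positive definiteness of $(g_{ij})$ is a consequence of Proposition 4, which is given below.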
Proof.
Fix any . Applying the operator to
where , we obtain
which results in
By the standard differentiation rules, we can write
Noticing that for , the second term on the right-hand side of Equation (34) vanishes, and then
If we use the notation introduced in (24), we can write
Because , we conclude that the metric defined by (22) coincides with the metric induced by and . ☐
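In generalized statistical manifolds, the generalized Rényi divergence $\mathcal{D}_\varphi^{(\alpha)}(\cdot \,\|\, \cdot)$ induces a connection $D^{(\alpha)}$, whose Christoffel symbols $\Gamma^{(\alpha)}_{ijk}$ are given by
$$\Gamma^{(\alpha)}_{ijk} = -\Big[ \Big( \frac{\partial^2}{\partial\theta^i \partial\theta^j} \Big)_p \Big( \frac{\partial}{\partial\theta^k} \Big)_q \mathcal{D}_\varphi^{(\alpha)}(p \,\|\, q) \Big]_{q = p}.$$
Because $\mathcal{D}_\varphi^{(\alpha)}(p \,\|\, q) = \mathcal{D}_\varphi^{(-\alpha)}(q \,\|\, p)$, it follows that $D^{(\alpha)}$ and $D^{(-\alpha)}$ are mutually dual for any $\alpha \in [-1, 1]$. In other words, $\Gamma^{(\alpha)}_{ijk}$ and $\Gamma^{(-\alpha)}_{ijk}$ satisfy the relation $\dfrac{\partial g_{jk}}{\partial\theta^i} = \Gamma^{(\alpha)}_{ijk} + \Gamma^{(-\alpha)}_{ikj}$. A development involving expression (35) results in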
and
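For $\alpha \in (-1, 1)$, the Christoffel symbols $\Gamma^{(\alpha)}_{ijk}$ can be written as a convex combination of $\Gamma^{(-1)}_{ijk}$ and $\Gamma^{(1)}_{ijk}$, as asserted in the next result.
Proposition 5.
The Christoffel symbols $\Gamma^{(\alpha)}_{ijk}$ induced by the divergence $\mathcal{D}_\varphi^{(\alpha)}(\cdot \,\|\, \cdot)$ satisfy the relation
$$\Gamma^{(\alpha)}_{ijk} = \frac{1-\alpha}{2}\, \Gamma^{(-1)}_{ijk} + \frac{1+\alpha}{2}\, \Gamma^{(1)}_{ijk}, \quad \text{for } \alpha \in [-1, 1].$$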
5. Conclusions
In [17,18], the authors introduced a pair of dual connections $D^{(-1)}$ and $D^{(1)}$ induced by the φ-divergence. The main motivation of the present work was to find a (non-trivial) family of α-divergences, whose induced α-connections are convex combinations of $D^{(-1)}$ and $D^{(1)}$. As a result of our efforts, we proposed a generalization of Rényi divergence. The connection $D^{(\alpha)}$ induced by the generalization of Rényi divergence satisfies the relation $D^{(\alpha)} = \frac{1-\alpha}{2} D^{(-1)} + \frac{1+\alpha}{2} D^{(1)}$. To generalize Rényi divergence, we made use of properties of φ-functions. This makes evident the importance of φ-functions in the geometry of non-standard models. In standard statistical manifolds, even though Amari’s α-divergence and Rényi divergence (with $\alpha \in [-1, 1]$) do not coincide, they induce the same family of α-connections. This striking result requires further investigation. Future work should focus on how the generalization of Rényi divergence is related to Zhang’s (ρ, τ)-divergence, and also how the present proposal is related to the model presented in [33].
Acknowledgments
The authors are indebted to the anonymous reviewers for their valuable comments and corrections, which led to a great improvement of this paper. Charles C. Cavalcante also thanks the CNPq (Proc. 309055/2014-8) for partial funding.
Author Contributions
All authors contributed equally to the design of the research. The research was carried out by all authors. Rui F. Vigelis and Charles C. Cavalcante gave the central idea of the paper and managed the organization of it. Rui F. Vigelis wrote the paper. All the authors read and approved the final manuscript.
Conflicts of Interest
The authors declare no conflict of interest.
References
1. Rao, C.R. Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 1945, 37, 81–91.
2. Amari, S.-I. Differential geometry of curved exponential families—Curvatures and information loss. Ann. Stat. 1982, 10, 357–385.
3. Amari, S.-I. Differential-Geometrical Methods in Statistics; Springer: Berlin/Heidelberg, Germany, 1985; Volume 28.
4. Amari, S.-I.; Nagaoka, H. Methods of Information Geometry (Translations of Mathematical Monographs); American Mathematical Society: Providence, RI, USA, 2000; Volume 191.
5. Amari, S.-I. Information Geometry and Its Applications; Applied Mathematical Sciences Series; Springer: Berlin/Heidelberg, Germany, 2016; Volume 194.
6. Amari, S.-I.; Ohara, A.; Matsuzoe, H. Geometry of deformed exponential families: Invariant, dually-flat and conformal geometries. Physica A 2012, 391, 4308–4319.
7. Matsuzoe, H. Hessian structures on deformed exponential families and their conformal structures. Differ. Geom. Appl. 2014, 35 (Suppl.), 323–333.
8. Naudts, J. Estimators, escort probabilities, and ϕ-exponential families in statistical physics. J. Inequal. Pure Appl. Math. 2004, 5, 102.
9. Pistone, G. κ-exponential models from the geometrical viewpoint. Eur. Phys. J. B 2009, 70, 29–37.
10. Amari, S.-I.; Ohara, A. Geometry of q-exponential family of probability distributions. Entropy 2011, 13, 1170–1185.
11. Vigelis, R.F.; Cavalcante, C.C. The Δ2-Condition and φ-Families of Probability Distributions. In Geometric Science of Information; Springer: Berlin/Heidelberg, Germany, 2013; Volume 8085, pp. 729–736.
12. Vigelis, R.F.; Cavalcante, C.C. On φ-families of probability distributions. J. Theor. Probab. 2013, 26, 870–884.
13. Cena, A.; Pistone, G. Exponential statistical manifold. Ann. Inst. Stat. Math. 2007, 59, 27–56.
14. Grasselli, M.R. Dual connections in nonparametric classical information geometry. Ann. Inst. Stat. Math. 2010, 62, 873–896.
15. Pistone, G.; Sempi, C. An infinite-dimensional geometric structure on the space of all the probability measures equivalent to a given one. Ann. Stat. 1995, 23, 1543–1561.
16. Santacroce, M.; Siri, P.; Trivellato, B. New results on mixture and exponential models by Orlicz spaces. Bernoulli 2016, 22, 1431–1447.
17. Vigelis, R.F.; Cavalcante, C.C. Information Geometry: An Introduction to New Models for Signal Processing. In Signals and Images; CRC Press: Boca Raton, FL, USA, 2015; pp. 455–491.
18. Vigelis, R.F.; de Souza, D.C.; Cavalcante, C.C. New Metric and Connections in Statistical Manifolds. In Geometric Science of Information; Springer: Berlin/Heidelberg, Germany, 2015; Volume 9389, pp. 222–229.
19. Zhang, J. Divergence function, duality, and convex analysis. Neural Comput. 2004, 16, 159–195.
20. Zhang, J. Referential Duality and Representational Duality on Statistical Manifolds. In Proceedings of the 2nd International Symposium on Information Geometry and Its Applications, Pescara, Italy, 12–16 December 2005; pp. 58–67.
21. Zhang, J. Nonparametric information geometry: From divergence function to referential-representational biduality on statistical manifolds. Entropy 2013, 15, 5384–5418.
22. Zhang, J. Divergence Functions and Geometric Structures They Induce on a Manifold. In Geometric Theory of Information; Springer: Berlin/Heidelberg, Germany, 2014; pp. 1–30.
23. Zhang, J. On monotone embedding in information geometry. Entropy 2015, 17, 4485–4489.
24. Rényi, A. On measures of entropy and information. In Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability; University of California Press: Berkeley, CA, USA, 1961; Volume I, pp. 547–561.
25. Van Erven, T.; Harremoës, P. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inform. Theory 2014, 60, 3797–3820.
26. Eguchi, S.; Komori, O. Path Connectedness on a Space of Probability Density Functions. In Geometric Science of Information; Springer: Berlin/Heidelberg, Germany, 2015; Volume 9389, pp. 615–624.
27. Kaniadakis, G.; Lissia, M.; Scarfone, A.M. Deformed logarithms and entropies. Physica A 2004, 340, 41–49.
28. Kaniadakis, G. Theoretical foundations and mathematical formalism of the power-law tailed statistical distributions. Entropy 2013, 15, 3983–4010.
29. Musielak, J. Orlicz Spaces and Modular Spaces; Springer: Berlin/Heidelberg, Germany, 1983; Volume 1034.
30. Shima, H. The Geometry of Hessian Structures; World Scientific: Singapore, 2007.
31. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman Divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749.
32. Bregman, L.M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 1967, 7, 200–217.
33. Zhang, J.; Hästö, P. Statistical manifold as an affine space: A functional equation approach. J. Math. Psychol. 2006, 50, 60–65.
© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).