1. Introduction
Information geometry is a differential-geometric framework for studying finite-dimensional statistical models that coherently integrates the following notions:
- (i) A differentiable manifold $\mathcal{M}$ consisting of probability density functions or finite measures on a common sample space;
- (ii) A divergence function $D$ that defines an asymmetric proximity between points $p$, $q$ in $\mathcal{M}$;
- (iii) A Riemannian metric $g$ plus a pair of torsion-free dual (conjugate) affine connections $\nabla$, $\nabla^*$ on $\mathcal{M}$.
For completeness, we recall that a pair of affine connections $\nabla$, $\nabla^*$ on $\mathcal{M}$ are said to be dual (or conjugate) with respect to a Riemannian metric $g$ if for any vector fields $X$, $Y$, and $Z$ on $\mathcal{M}$, one has:
$$Z\, g(X, Y) = g(\nabla_Z X, Y) + g(X, \nabla^*_Z Y).$$
Here, $(g, \nabla, \nabla^*)$ is called a dualistic structure. When $D$ is the Kullback–Leibler divergence (or more generally, an $f$-divergence), the induced Riemannian metric $g$ is the Fisher–Rao metric, and the induced cubic form is the Amari–Chentsov tensor [1]. It can be shown that the Fisher–Rao metric and the Amari–Chentsov tensor are unique invariants, of respectively second and third orders, under sufficient statistics on the manifold $\mathcal{M}$ [2].
Geometrically, the standard model (denoted the -model in this paper) uses a pair of affine connections that are torsion-free, though in general, they are not curvature-free. An alternative, “partially flat” model (denoted the -model in this paper) was recently investigated in [3], leading to the notion of “statistical mirror symmetry” [4]. Under the -model, the affine connections ∇ and $\nabla^*$ are allowed to carry torsion, but are both curvature-free. See [4] for the geometric properties of the -model leading to a symplectic-to-complex correspondence characteristic of mirror Calabi–Yau manifolds studied in string theory and mathematical physics.
Within the usual -model, a special case is the dually flat geometry where the Riemannian metric can be expressed under special coordinate systems as a Hessian metric. Two prominent examples are the exponential family and the mixture family, where the Hessian metric coincides with the Fisher–Rao metric. The Hessian geometry is said to be dually flat because the Riemann curvature tensors of both the primal and the dual connections vanish; the corresponding primal and dual affine coordinate systems are linked via Legendre transformations by a pair of convex potentials. For an exponential family, these coordinates are precisely the natural (canonical) and mixture (expectation) coordinate systems, respectively. Note that the Hessian metric itself is not flat as its Levi-Civita connection contains curvature in general.
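The Legendre link between natural and expectation coordinates can be checked numerically on the simplest exponential family. The following is a minimal sketch for the Bernoulli family; the function names are illustrative choices and do not come from the paper. It verifies that the cumulant function and the negative entropy satisfy the Fenchel identity, and that the two gradient maps are mutually inverse.

```python
import math

def log_partition(theta):
    """Cumulant function psi(theta) = log(1 + e^theta) of the Bernoulli family."""
    return math.log1p(math.exp(theta))

def expectation(theta):
    """Expectation coordinate eta = psi'(theta), the mean of the Bernoulli variable."""
    return 1.0 / (1.0 + math.exp(-theta))

def neg_entropy(eta):
    """Convex conjugate phi(eta) = eta log eta + (1 - eta) log(1 - eta)."""
    return eta * math.log(eta) + (1.0 - eta) * math.log(1.0 - eta)

def natural(eta):
    """Natural coordinate theta = phi'(eta) = log(eta / (1 - eta))."""
    return math.log(eta / (1.0 - eta))

theta = 0.7
eta = expectation(theta)

# Fenchel identity psi(theta) + phi(eta) = theta * eta at conjugate points.
fenchel_gap = log_partition(theta) + neg_entropy(eta) - theta * eta

# The two gradient maps invert each other.
round_trip = natural(eta) - theta

# Hessian of psi in the natural coordinate, i.e. the Fisher information.
fisher = eta * (1.0 - eta)
```

Here `fisher` is the second derivative of the cumulant function, which is the Hessian metric in the natural coordinate and agrees with the Fisher–Rao metric for this family.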
Between the well-understood dually flat Hessian geometry and the full-blown -model, there is a wide range of geometries capturing various probability models. Of special interest are generalizations of the exponential family, namely deformed exponential families. The $\phi$-exponential family was introduced in the context of statistical physics [5]; it was later shown [6] to be equivalent to the $U$-model [7] motivated by applications in machine learning. Moreover, [6] revealed that both the $\phi$- and $U$-models can be generated from the $(\rho, \tau)$-model [8] through the mechanism of “gauge selection”. The $(\rho, \tau)$-metric generalizes the Fisher–Rao metric and may lead to a conformal Hessian metric for a $\phi$-exponential family. However, the connections are typically not curvature-free unless a special type of gauge is selected; this underlies the geometric characterization of the $q$-exponential model of Tsallis by [9,10,11].
In recent years, the second author [12], motivated by previous works with Pal on mathematical finance and optimal transport [13,14,15,16], studied a class of deformed exponential families generating constant curvatures through the use of a new divergence function called logarithmic divergence. By constant (information geometric) curvature, we mean that both the primal and dual Riemann curvature tensors have (the same) constant sectional curvature with respect to $g$. In [17], the present authors developed a unified framework, based on the notions of λ-duality and the λ-exponential family, which appears to provide a canonical extension of the dually flat geometry to the constant curvature case. Previously, statistical manifolds with constant curvature were studied using the abstract tools of affine differential geometry; see, e.g., [1,18] (also see [19]). Our framework provides a concrete approach and an explicit construction that elucidates how the properties of the exponential family and the dually flat geometry may be extended to the constant curvature case. In this paper, a careful exposition of the λ-deformation framework is provided from the perspective of λ-duality, namely the λ-deformation of Legendre duality.
The rest of the paper is organized as follows. In Section 2, we review the standard -model of information geometry with a focus on the dually flat geometry, based on convex duality and Bregman divergence, of the exponential and mixture families. The section closes with a preview of λ-deformation by introducing a suite of four deformation functions, as two pairs of mutually inverse functions: $\log_\lambda$ versus $\exp_\lambda$ and $\kappa_\lambda$ versus $\gamma_\lambda$, with the first pair deforming log and exp and the second pair deforming the identity function. In Section 3, we describe the λ-duality, which deforms the standard convex duality. In particular, we compare λ-duality and standard Legendre duality and show their relations to each other upon a change of parameterization. In Section 4, we define the λ-gradient and then the λ-logarithmic divergence and study the constant curvature information geometry the latter induces. In Section 5, we relate
-divergence to Rényi entropy by introducing the
λ-exponential and λ-mixture family. The two expressions of the λ-exponential family under divisive and subtractive normalization correspond to, respectively, Rényi deformation and Tsallis deformation. Section 6 concludes with a comparison of λ-deformation with the standard dually flat (Hessian) framework.
3. Deforming the Legendre Duality: λ-Duality
In this section, we describe the λ-duality and its link to the standard Legendre duality. We start by defining the notions of λ-conjugate and λ-convexity/λ-concavity, then draw a parallel to the regular Legendre duality. We proceed to establish a formal correspondence between the λ-duality and classical convex duality, including the associated notions of the λ-gradient, λ-logarithmic divergence, etc. Some of the derivations are illustrative, yet heuristic; a rigorous analysis in the spirit of Rockafellar [35] is left for future research.
3.1. Legendre Duality and Bregman Divergence Reviewed
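Recall from (4) that the convex conjugate $f^*$ of a function $f$ on $\mathbb{R}^d$ is defined by:
$$f^*(u) = \sup_{x} \left\{ x \cdot u - f(x) \right\}. \tag{11}$$
It can be proven that:
- (i) $f^*$ is convex;
- (ii) $(f^*)^* \le f$;
- (iii) $(f^*)^* = f$ if $f$ is convex and lower semicontinuous.
When $f$ is further differentiable, the Legendre transformation:
$$u = \nabla f(x),$$
which can be motivated by the first-order condition in (11), defines a “dual variable” $u$, satisfying the Fenchel identity:
$$f(x) + f^*(u) = x \cdot u.$$
We have $x = \nabla f^*(u)$, provided the second derivative (Hessian) $D^2 f$ is positive definite. The function $f$ also defines a Bregman divergence $B_f$ given by:
$$B_f(x, x') = f(x) - f(x') - \nabla f(x') \cdot (x - x'). \tag{12}$$
The Bregman divergence satisfies the reference–representation biduality [24,25] in the sense that:
$$B_f(x, x') = B_{f^*}(u', u),$$
where $u = \nabla f(x)$ and $u' = \nabla f(x')$. Note that when $f$ is convex and differentiable, the non-negativity of the Bregman divergence encodes the fact that for any $x, x'$:
$$f(x) \ge f(x') + \nabla f(x') \cdot (x - x').$$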
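The reference–representation biduality can be illustrated numerically. The sketch below assumes the standard form of the Bregman divergence, $B_f(x, x') = f(x) - f(x') - \nabla f(x') \cdot (x - x')$, and uses the illustrative pair $f(x) = \sum_i e^{x_i}$ and $f^*(u) = \sum_i (u_i \log u_i - u_i)$; this choice of potential is an assumption made for the example, not one taken from the paper.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def f(x):
    """Illustrative convex potential f(x) = sum_i exp(x_i)."""
    return sum(math.exp(xi) for xi in x)

def grad_f(x):
    return [math.exp(xi) for xi in x]

def f_star(u):
    """Convex conjugate f*(u) = sum_i (u_i log u_i - u_i), for u_i > 0."""
    return sum(ui * math.log(ui) - ui for ui in u)

def grad_f_star(u):
    return [math.log(ui) for ui in u]

def bregman(h, grad_h, a, b):
    """Bregman divergence of h from b to a: h(a) - h(b) - grad h(b) . (a - b)."""
    diff = [ai - bi for ai, bi in zip(a, b)]
    return h(a) - h(b) - dot(grad_h(b), diff)

x, x_prime = [0.3, -1.2], [1.0, 0.5]
u, u_prime = grad_f(x), grad_f(x_prime)      # Legendre transformation

primal = bregman(f, grad_f, x, x_prime)              # B_f(x, x')
dual = bregman(f_star, grad_f_star, u_prime, u)      # B_{f*}(u', u)
```

The two values agree up to rounding, with the roles of the two arguments swapped when passing to the dual variables.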
3.2. λ-Deformation of Legendre Duality
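The main idea behind the λ-deformation of the Legendre duality (“λ-duality”) is to replace the term $x \cdot u$ in (11) by a monotone transformation of $x \cdot u$. Given a parameter $\lambda \neq 0$, later revealed to be the curvature parameter of the information geometric characterization, we replace the term $x \cdot u$ by:
$$\kappa_\lambda(x \cdot u) = \frac{1}{\lambda} \log(1 + \lambda\, x \cdot u),$$
where $\kappa_\lambda(t) = \frac{1}{\lambda} \log(1 + \lambda t)$ and its inverse $\gamma_\lambda(t) = \frac{1}{\lambda}(e^{\lambda t} - 1)$ are given by (10). With this in mind, we give the following definition.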
Definition 1 (λ-conjugation). Let $\lambda \neq 0$. Given a function $f$, we define its λ-conjugate $f^{(\lambda)}$ by:
$$f^{(\lambda)}(u) = \sup_{x} \left\{ \frac{1}{\lambda} \log(1 + \lambda\, x \cdot u) - f(x) \right\}. \tag{14}$$
Generalized convex dualities have been heavily used in optimal transport theory [36,37] to characterize the optimal transport plans; in this context, it is called the $c$-duality where $c$ is the cost function of the transport problem. A major novelty of our framework is that the functional form of $\kappa_\lambda$ (and of $\gamma_\lambda$) leads to explicit formulas, which are not available in the general case. We remark that this is closely related to the fact that the associated information geometry has constant curvature $\lambda$.
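To make the λ-conjugate concrete, the following one-dimensional sketch evaluates it by brute force. It assumes the form $f^{(\lambda)}(u) = \sup_x \{ \frac{1}{\lambda}\log(1 + \lambda x u) - f(x) \}$ with the supremum restricted to $1 + \lambda x u > 0$, and uses the illustrative function $f(x) = \frac{1}{\lambda}\log(1 + \lambda x^2/2)$ with $\lambda = 0.5$; the grid bounds and the test point are arbitrary choices. The grid value is compared with the value at the stationary point $u = f'(x)/(1 - \lambda x f'(x))$ obtained from the first-order condition.

```python
import math

LAM = 0.5  # illustrative parameter, assumed positive in this example

def f(x):
    """f(x) = (1/lam) log(1 + lam x^2 / 2), so that (exp(lam f) - 1)/lam = x^2/2 is convex."""
    return math.log1p(LAM * x * x / 2.0) / LAM

def f_prime(x):
    return x / (1.0 + LAM * x * x / 2.0)

def kappa(t):
    """kappa(t) = (1/lam) log(1 + lam t), defined for 1 + lam t > 0."""
    return math.log1p(LAM * t) / LAM

def lam_conjugate(u, lo=-10.0, hi=10.0, n=200001):
    """Brute-force supremum of kappa(x u) - f(x) over grid points with 1 + lam x u > 0."""
    best = -math.inf
    for k in range(n):
        x = lo + (hi - lo) * k / (n - 1)
        if 1.0 + LAM * x * u > 0.0:
            best = max(best, kappa(x * u) - f(x))
    return best

x0 = 0.8
# First-order condition of the supremum: u / (1 + lam x u) = f'(x).
u0 = f_prime(x0) / (1.0 - LAM * x0 * f_prime(x0))

grid_value = lam_conjugate(u0)
stationary_value = kappa(x0 * u0) - f(x0)
```

For this example the stationary point is the unique maximizer on the admissible region, so the two values coincide up to grid resolution.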
It turns out that the λ-conjugation defined by (14) corresponds to an appropriately generalized notion of convexity or concavity, through the aid of the function $\gamma_\lambda$ given by (10). Henceforth, we let $\lambda \neq 0$ be a fixed constant.
Definition 2 (λ-exponential convexity and concavity). Let $\Omega \subset \mathbb{R}^d$ be an open convex set. A function $f: \Omega \to \mathbb{R}$ is said to be λ-exponentially convex (“λ-convex”), or λ-exponentially concave (“λ-concave”), if:
$$\gamma_\lambda(f) = \frac{1}{\lambda}\left(e^{\lambda f} - 1\right)$$
is convex, or concave, on Ω. When $f$ is $C^2$, we have equivalently that $f$ is λ-convex, or λ-concave, if the Hessian of $\gamma_\lambda(f)$ is positive definite, or negative definite. Note that the additive term $-1/\lambda$ in the above definition of $\gamma_\lambda(f)$ is not necessary; it is included so that $\lim_{\lambda \to 0} \gamma_\lambda(f) = f$, meaning that the limiting case of zero-convexity is just ordinary convexity.
It is easily shown that, for $\lambda$ a fixed positive number,
- (i) $f$ is λ-convex if and only if $-f$ is $(-\lambda)$-concave;
- (ii) $f$ is λ-concave if and only if $-f$ is $(-\lambda)$-convex.
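The sign-flip relations follow from the identity $\gamma_{-\lambda}(-t) = -\gamma_\lambda(t)$, assuming $\gamma_\lambda(t) = \frac{1}{\lambda}(e^{\lambda t} - 1)$. A small numerical sketch, with an illustrative quadratic $f$ and finite-difference curvature (both choices are assumptions of this example):

```python
import math

def gamma(lam, t):
    """gamma_lam(t) = (exp(lam t) - 1) / lam, for lam != 0."""
    return math.expm1(lam * t) / lam

def second_difference(h, x, step=1e-3):
    """Central second difference, a finite-difference proxy for h''(x)."""
    return (h(x + step) - 2.0 * h(x) + h(x - step)) / step ** 2

lam = 0.7
f = lambda x: x * x                     # convex, hence lam-convex for lam > 0
F = lambda x: gamma(lam, f(x))          # gamma_lam(f): expected to be convex
G = lambda x: gamma(-lam, -f(x))        # gamma_{-lam}(-f): expected to be concave

points = [-1.0, -0.2, 0.5, 1.3]
curv_F = [second_difference(F, x) for x in points]
curv_G = [second_difference(G, x) for x in points]
```

Since `G` equals `-F` pointwise, convexity of one is exactly concavity of the other, which is statement (i) for this particular $f$.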
Proposition 1. Given any , we define variable , which has range , and function by:Then, the convex (Legendre) conjugate of the function g:is related to the λ-conjugate of the function f via: Proof. We first prove the following identities:
where, going from the first to the second line, we used (
15) and the fact:
which is a re-write of the definition of
g given by (16).
With the above identity, we can proceed to prove this proposition. For
, we have:
Recasting the above relation yields (
17). □
Recall from convex analysis that $g^*$ is always a convex function regardless of whether $g$ is convex (by the property of Legendre conjugation). The expression of $g^*$ in (17) therefore implies that $f^{(\lambda)}$ is λ-convex, by the definition of λ-convexity.
Corollary 1. For any , its λ-conjugate as defined by (14) is a λ-convex function of u on (note may not necessarily be convex). Proof. We can also give a direct proof (essentially reversing the steps of the proof of Proposition 1).
□
Corollary 1 is the extension of the claim that for any $f$, the standard Legendre conjugate $f^*$ as given by (11) is always a convex function. Because of this, we can prove, in analogy to the standard Legendre conjugation ∗, the following relations:
- (i) $(f^{(\lambda)})^{(\lambda)} \le f$ for any $f$.
- (ii) $(f^{(\lambda)})^{(\lambda)} = f$ if $f$ is λ-convex.
3.3. Relations between the λ-Duality and Legendre Duality
We proceed to establish a formal relationship between the λ-duality and the ordinary Legendre duality, by relating the λ-conjugation of a λ-convex function $f$, denoted by $f^{(\lambda)}$, to the standard Legendre conjugation of a function (denoted by ∗).
We continue the analysis performed in Proposition 1. Taking λ-conjugation for a second time,
Here, the variable
is defined by:
and the function
by:
In the event when
f is
-convex, then
. Therefore:
Therefore,
. That is, the function
is just the (regular) Legendre conjugation
of the function
. In
parameterization, the
function has the expression of (
18) with
and
u related by (20). This parallels the fact that
, and in
parameterization, the
g function has the expression of (16) with
and
x related by (
19).
Summarizing the above, we have:
Theorem 1 (Connecting λ-duality to Legendre duality)
. Let f be a λ-convex function and be its λ-conjugate. Denote two functions g and :where the two variables and are given by:Then, the following statements are equivalent:- (i)
The variables satisfy the λ-duality of a pair of λ-convex functions : - (ii)
The variables satisfy the Legendre duality of a pair of convex functions :with: - (iii)
The variables satisfy the Legendre duality of a pair of convex functions :with:
Proof. To prove the equivalence of (
21) and (
22), we re-write the latter as:
where we inserted the following relations:
and replaced
by
x using (
19). Multiplying
on both sides, we obtain:
Noting
verifies (
21).
To prove the equivalence of (
21) and (
23), we rely on an analogous identity:
where:
We have, after multiplying
on both sides of (
24),
where the last step used:
Noting
due to
f assumed to be
-convex, then (
23) follows. □
We see that the functions and serve as link functions from the -pair of the -deformed Legendre conjugation to the -pair and the -pair of the regular Legendre conjugation.
4. λ-Logarithmic Divergence and Its Dualistic Geometry
In this section, we study the λ-deformation of the Bregman (canonical) divergence function and the resulting dualistic geometry (Riemannian metric and dual connections), which correspond to the λ-duality. This involves first establishing the λ-deformation to the gradient operation (so-called λ-gradient), which then leads to the so-called λ-logarithmic divergence function as deformation to the Bregman divergence. Finally, we show that the resulting Riemannian metric is a conformal Hessian metric, while the resulting dual connections are projectively flat (with constant curvature). The conformal and projective factor is parameterized by λ, which gives the curvature of the constant curvature space.
4.1. λ-Gradient
Definition 3 (λ-gradient). For a differentiable function $f$, define the λ-gradient $\nabla^{(\lambda)} f$ by:
$$\nabla^{(\lambda)} f(x) = \frac{1}{1 - \lambda\, \nabla f(x) \cdot x}\, \nabla f(x). \tag{25}$$
The work of [17] (Theorem 2.2) derived the above formula for deforming the gradient of a function, motivated by the λ-duality setting. For mathematical convenience, it is proven under some regularity conditions; a full generalization along the lines of [35] is a natural direction for further research.
Theorem 2 (λ-gradient for λ-duality). Let , and let f be a λ-exponentially convex function that is on some open convex set , such that (a) is strictly positive definite and (b) on Ω. Then we have
- (i)
is a -diffeomorphism from Ω to its range .
- (ii)
Denote . We have , and the following identity holds: - (iii)
Furthermore, .
Note that the λ-gradient differs from the regular gradient by a scalar multiplication. The duality between x and u under the λ-duality is mediated by a dual variable , which plays an important role in what follows.
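The scalar rescaling can be made explicit in code. The sketch below assumes the λ-gradient has the form $\nabla^{(\lambda)} f(x) = \nabla f(x) / (1 - \lambda \nabla f(x) \cdot x)$, which is the first-order condition of the λ-conjugate; the helper name `lam_gradient` and the sample vectors are hypothetical. Under this form, $u = \nabla^{(\lambda)} f(x)$ satisfies the algebraic identity $1 + \lambda\, x \cdot u = 1/(1 - \lambda \nabla f(x) \cdot x)$, so the logarithm in the deformed pairing is well defined whenever the denominator is positive.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def lam_gradient(grad, x, lam):
    """Rescale an ordinary gradient vector `grad` taken at x into the lam-gradient.

    Assumed form: grad / (1 - lam * grad . x), requiring a positive denominator.
    """
    denom = 1.0 - lam * dot(grad, x)
    if denom <= 0.0:
        raise ValueError("requires 1 - lam * grad . x > 0")
    return [gi / denom for gi in grad]

lam = -0.4
x = [0.5, -1.0, 2.0]
grad = [0.3, 0.1, -0.2]        # stand-in for grad f(x); any admissible vector works

u = lam_gradient(grad, x, lam)

pairing = 1.0 + lam * dot(x, u)               # argument of the logarithm
reciprocal = 1.0 / (1.0 - lam * dot(grad, x)) # should equal `pairing`

undeformed = lam_gradient(grad, x, 0.0)       # lam = 0 gives back the ordinary gradient
```

Setting `lam = 0.0` returns the ordinary gradient unchanged, matching the limiting case of the Legendre transformation.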
Let:
- (a)
denote the -conjugate variable corresponding to x with respect to ;
- (b)
be the Legendre conjugate variable corresponding to with respect to ;
- (c)
denote the -conjugate variable corresponding to u with respect to ;
- (d)
be the Legendre conjugate variable corresponding to with respect to .
Is there a simple relationship between them? The following proposition says
, where
and
x are linked by (
19), and
, where
and
u are linked by (20).
Proposition 2. We have:Here, we add the subscript to to emphasize the argument with respect to which the derivative is taken. Proof. We use matrix notations where the gradient is regarded as a column vector. Applying the multivariate chain rule to (
17), we have:
where
is the Jacobian of the transformation
and
denotes the transpose. For two vectors $x$ and $y$, their outer product is denoted by $x y^\top$, which is a rank-one square matrix with the $(i,j)$-entry $x^i y^j$.
From (
15), we have:
Since
by assumption, we can invert the Jacobian by the Sherman–Morrison formula (see [
12], Proposition 4) to obtain:
Plugging this into the above, we have:
Using (
25) to relate
to
, the first relation involving
is proven. The proof of the second relation in this proposition is analogous. □
Just as ordinary convexity leads to the notion of Bregman divergence (12), the notion of λ-exponential convexity leads to a generalization that we call the λ-logarithmic divergence. Henceforth, we let $f$ be a λ-exponentially convex function on an open convex domain $\Omega$, and we assume that the regularity conditions in Theorem 2 hold.
4.2. λ-Logarithmic Divergence
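By the definition of λ-convexity, we have that $\gamma_\lambda(f) = \frac{1}{\lambda}(e^{\lambda f} - 1)$ is convex on $\Omega$. By the ordinary convexity of $\gamma_\lambda(f)$, we have:
$$\frac{1}{\lambda}\left(e^{\lambda f(x)} - 1\right) \ge \frac{1}{\lambda}\left(e^{\lambda f(x')} - 1\right) + e^{\lambda f(x')}\, \nabla f(x') \cdot (x - x').$$
In terms of $f$, we have, after some manipulations,
$$\frac{1}{\lambda}\left(e^{\lambda (f(x) - f(x'))} - 1\right) \ge \nabla f(x') \cdot (x - x').$$
Since $\kappa_\lambda$ is increasing, we have, whenever the logarithm is defined:
$$f(x) - f(x') \ge \frac{1}{\lambda} \log\left(1 + \lambda\, \nabla f(x') \cdot (x - x')\right).$$
This motivates the following definition.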
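Definition 4 (λ-logarithmic divergence). We define the λ-logarithmic divergence $L_{\lambda,f}$ of $f$ by:
$$L_{\lambda,f}(x, x') = f(x) - f(x') - \frac{1}{\lambda}\log\left(1 + \lambda\, \nabla f(x') \cdot (x - x')\right). \tag{26}$$
See Figure 1 for a graphical illustration. We note that the logarithmic correction in (26) corresponds to a logarithmic first-order approximation, based at $x'$, which is possible due to the λ-exponential convexity of $f$. We also note that when $\lambda > 0$, it is possible that $L_{\lambda,f}(x, x') = \infty$. Nevertheless, $L_{\lambda,f}(x, x')$ is finite when $x$ and $x'$ are sufficiently close. Formally, letting $\lambda \to 0$ in (26) recovers the Bregman divergence.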
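The following sketch evaluates the divergence numerically. It assumes the form $L_{\lambda,f}(x, x') = f(x) - f(x') - \frac{1}{\lambda}\log(1 + \lambda \nabla f(x') \cdot (x - x'))$ and the illustrative potential $f(x) = |x|^2/2$, which is convex and hence λ-convex for $\lambda > 0$. Returning `math.inf` when the argument of the logarithm is non-positive is a convention adopted for this example.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def f(x):
    """Illustrative potential f(x) = |x|^2 / 2."""
    return 0.5 * dot(x, x)

def grad_f(x):
    return list(x)

def bregman(x, xp):
    diff = [a - b for a, b in zip(x, xp)]
    return f(x) - f(xp) - dot(grad_f(xp), diff)

def log_divergence(x, xp, lam):
    """Assumed form of the lam-logarithmic divergence, based at xp."""
    diff = [a - b for a, b in zip(x, xp)]
    slope = dot(grad_f(xp), diff)
    if 1.0 + lam * slope <= 0.0:
        return math.inf          # logarithm undefined: treated as +infinity here
    return f(x) - f(xp) - math.log1p(lam * slope) / lam

x, xp = [1.0, 0.5], [0.2, -0.3]
near = log_divergence(x, xp, 0.5)       # finite and non-negative
tiny = log_divergence(x, xp, 1e-6)      # close to the Bregman divergence
breg = bregman(x, xp)
far = log_divergence([0.5, 0.0], [2.0, 0.0], 0.5)   # points far apart, lam > 0
```

In this example `tiny` is numerically close to `breg`, illustrating the formal limit $\lambda \to 0$, while `far` shows a pair of points for which the logarithm is undefined with $\lambda > 0$.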
4.3. λ-Logarithmic Divergence in Different Forms
We now prove a lemma about the relationship of the variables and gradients or -gradients of f or . We assumed, for convenience, that for all .
Lemma 1. Given or equivalently , for arbitrary (such that the expressions are well defined), we have the following identities:where is a multiplicative factor (function of x or u) given by: Proof. Since
, substituting (
25), we have:
and:
so:
Taking the logarithm and rearranging, we obtain (
27).
On the other hand, because:
we also have:
The proof of (28) is similar. □
In this above lemma, and are arbitrary; it is interesting that a modified form of “linearity” holds even though is itself nonlinear. As a consequence, we have an alternative expression for .
Proposition 3. defined by (26) can also be written as:where . Of course, we may express the
λ-logarithmic divergence using the conjugate variables
as well. Indeed, we have the analogous reference–representation biduality (see [
24,
25]) that is characteristic of Bregman divergence and canonical divergence for dually flat spaces, that is (
8). See [
38] for the reference–representation biduality of a general
c-divergence (which includes both the Bregman and logarithmic divergences) based on optimal transport.
Theorem 3. The λ-logarithmic divergence satisfies the reference–representation biduality, namely:where and . Moreover, define the λ-deformed canonical divergence by:We have: Proposition 3 also allows us to derive our next theorem (Theorem 4) linking
λ-logarithmic divergence and Bregman divergence (also see [
19] for a discussion of conformal divergence in the affine immersion setting).
Theorem 4. The canonical forms of the λ-logarithmic divergence and are related to the canonical forms of the Bregman divergence and via a conformal transformation and the non-linear link function : Proof.
The proof of the second line of Theorem 4 is similar. We have
from Theorem 3. □
4.4. Dualistic Geometry of λ-Logarithmic Divergence
Regard $x$ as the primal (global) coordinate system of a manifold $\mathcal{M}$. As described in Section 2.1, we may use the λ-logarithmic divergence $L_{\lambda,f}$ of $f$ to construct a dualistic structure $(g, \nabla, \nabla^*)$. In this subsection, we provide explicit expressions of the corresponding coefficients and state some key geometric consequences.
We begin with the Riemannian metric.
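Theorem 5. The Riemannian metric $g$ induced from the λ-logarithmic divergence of $f$ is given in primal coordinate $x$ by:
$$g_{ij}(x) = \frac{\partial^2 f}{\partial x^i \partial x^j}(x) + \lambda\, \frac{\partial f}{\partial x^i}(x)\, \frac{\partial f}{\partial x^j}(x) = e^{-\lambda f(x)}\, \frac{\partial^2 \gamma_\lambda(f)}{\partial x^i \partial x^j}(x), \tag{29}$$
where $\gamma_\lambda(f) = \frac{1}{\lambda}(e^{\lambda f} - 1)$.
Proof. According to (2), direct differentiation of (26) gives the expression of (29). □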
By symmetry, under the dual coordinate system
, we have:
From the first equality in (29), we see that $g$ is a rank-one correction of the Hessian matrix $D^2 f$. From the second equality, we see that $g$ is in fact a conformal Hessian metric, i.e., it has the form $e^{-\lambda f} g_0$, where $g_0$ is the Hessian metric induced by the convex function $\gamma_\lambda(f) = \frac{1}{\lambda}(e^{\lambda f} - 1)$. This conclusion is entirely anticipated from Theorem 4.
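The agreement of the rank-one form and the conformal Hessian form can be checked by finite differences. The sketch below assumes the two expressions $D^2 f + \lambda\, \nabla f\, \nabla f^\top$ and $e^{-\lambda f} D^2 \gamma_\lambda(f)$ with $\gamma_\lambda(f) = \frac{1}{\lambda}(e^{\lambda f} - 1)$; the quadratic potential, the evaluation point, and the step size are illustrative choices.

```python
import math

LAM = 0.3  # illustrative parameter

def f(x):
    """Illustrative convex potential on R^2."""
    return x[0] ** 2 + 0.5 * x[1] ** 2 + 0.3 * x[0] * x[1]

def grad_f(x):
    return [2.0 * x[0] + 0.3 * x[1], x[1] + 0.3 * x[0]]

HESS_F = [[2.0, 0.3], [0.3, 1.0]]   # exact Hessian of the quadratic f

def big_f(x):
    """gamma_lam(f) = (exp(lam f) - 1) / lam."""
    return math.expm1(LAM * f(x)) / LAM

def hessian_fd(h, x, step=1e-4):
    """Central finite-difference Hessian of a scalar function h at x."""
    n = len(x)
    hess = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(di, dj):
                y = list(x)
                y[i] += di * step
                y[j] += dj * step
                return h(y)
            hess[i][j] = (shifted(1, 1) - shifted(1, -1)
                          - shifted(-1, 1) + shifted(-1, -1)) / (4.0 * step ** 2)
    return hess

x = [0.4, -0.7]
g = grad_f(x)

# Rank-one correction of the Hessian of f.
metric_rank_one = [[HESS_F[i][j] + LAM * g[i] * g[j] for j in range(2)]
                   for i in range(2)]

# Conformal rescaling of the Hessian of gamma_lam(f).
hess_big_f = hessian_fd(big_f, x)
scale = math.exp(-LAM * f(x))
metric_conformal = [[scale * hess_big_f[i][j] for j in range(2)]
                    for i in range(2)]

max_gap = max(abs(metric_rank_one[i][j] - metric_conformal[i][j])
              for i in range(2) for j in range(2))
```

The remaining gap `max_gap` is of the order of the finite-difference error, consistent with the two forms being the same matrix.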
To compute the Christoffel symbols of the primal and dual connections, we need an expression of the inverse of the Riemannian metric as a matrix. This is provided by the following proposition.
Proposition 4. The metric can be expressed as:where is the Jacobian matrix of the coordinate transformation and is the identity matrix with Kronecker as its entries. Here:and for and , due to Part (ii) of Theorem 2. Moreover, the inverse of can be expressed as: Proof. Using the
λ-logarithmic divergence represented as the generalized canonical divergence
(
26), we apply (
2) to obtain:
Expressing the above expression using matrix notations gives (
30). Formula (
31) follows by inverting (
30) using the Sherman–Morrison formula. □
Under the dualistic structure induced by a λ-logarithmic divergence, the primal and dual coordinate vector fields are no longer biorthogonal in the sense of (7). Nevertheless, we have the following generalization. Again, we write
.
Corollary 2. The inner product of the coordinate vector fields and is given by a λ-deformed “biorthogonality” relation: Proof. Write
. Then:
Simplifying the expression using (
30) gives the result. For details, see ([
12], Proposition 8). □
Theorem 6. The Christoffel symbols of the primal connection ∇ are given by:where as in Proposition 4. Furthermore, let be the Christoffel symbol of the second kind, then:where δ is the Kronecker delta. Similarly, under the dual coordinate system u, the Christoffel symbol (of the second kind) of the dual connection is given by: Proof. This is a straightforward computation using (
3) and Proposition 4. The details, which are a minor modification of the proof of ([
12], Proposition 5), are omitted. □
Although the curvatures of ∇ and $\nabla^*$ are nonzero, it can be shown that ∇ and $\nabla^*$ are both projectively flat, i.e., each of them is projectively equivalent to a flat connection. Specifically, any ∇-geodesic (resp. $\nabla^*$-geodesic) is a time-reparameterized straight line under the x (resp. u) coordinate system.
Theorem 7. The sectional curvatures of ∇ and $\nabla^*$ with respect to $g$ are both equal to λ.
Proof. See ([
12], Theorem 15). □
Using the dual projective flatness and Corollary 2, Reference ([12], Theorem 16) showed that the λ-logarithmic divergence satisfies a generalized Pythagorean theorem, which generalizes the property of Bregman divergence outlined in Section 2.2.
Theorem 8 (Generalized Pythagorean theorem). Let $P, Q, R \in \mathcal{M}$. Then:
$$L_{\lambda,f}(Q, P) + L_{\lambda,f}(R, Q) = L_{\lambda,f}(R, P)$$
if and only if the ∇-geodesic between Q and R and the $\nabla^*$-geodesic between Q and P meet $g$-orthogonally at Q.
To summarize, the dually flat geometry becomes a dually projectively flat geometry with constant sectional curvature $\lambda$, and the Hessian metric becomes a conformal Hessian metric. Nevertheless, the primal and dual geodesics are still straight lines (up to time reparametrizations), and the generalized Pythagorean theorem holds.
We say that the above λ-deformation framework is “canonical” because the statistical manifold $(\mathcal{M}, g, \nabla, \nabla^*)$, with a conformal Hessian metric $g$ given by (29) and a pair of dual projectively flat affine connections $\nabla$, $\nabla^*$ given by (32) and (33), is the only statistical structure with constant curvature ([12], Theorem 15). Moreover, given such a statistical manifold, one can construct locally a λ-logarithmic divergence, which induces the given geometry.
6. Summary and Conclusions
Our paper summarizes a canonical approach to deforming exponential and mixture families and the associated dually flat Hessian geometry. The λ-exponential family we introduced has two parameterizations (35) and (37):
The two expressions reflect subtractive and divisive normalizations: a typical example of the former is the $q$-exponential family with associated Tsallis entropy, whereas an example of the latter is the -family and the associated Rényi entropy. These two versions of deformation to an exponential family are two sides of the same coin; furthermore, the λ-exponential family is also linked to the λ-mixture family, when , via a reparameterization of the random functions above.
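The coincidence of these two parameterizations of the deformed family is associated with the λ-duality, which is the main focus of our exposition. The λ-duality is a “deformation” (see Table 1) of the usual Legendre duality reviewed in Section 3.1. In a nutshell, instead of convex functions, we worked with λ-convex functions $f$ such that $\frac{1}{\lambda}(e^{\lambda f} - 1)$ is convex, for a fixed $\lambda \neq 0$. Furthermore, instead of the convex conjugate, we used the λ-conjugate given by:
$$f^{(\lambda)}(u) = \sup_{x}\left\{\frac{1}{\lambda}\log(1 + \lambda\, x \cdot u) - f(x)\right\}.$$
The expression of the λ-duality:
$$f(x) + f^{(\lambda)}(u) = \frac{1}{\lambda}\log(1 + \lambda\, x \cdot u)$$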
turns out to be a re-write of the Legendre duality between
and
u:
and a re-write of the Legendre duality between
x and
:
Therefore, λ-duality is in essence the Legendre duality with a λ-dependent rescaling of the variables:
and:
The two pairs of convex functions
and
are linked with the pair of
λ-convex functions
via:
The λ-duality leads to nontrivial mathematical questions, e.g., a differential calculus in the spirit of Rockafellar and analogous to functions of the Legendre type. Some of the derivations in the current paper were heuristic, and a complete and rigorous development is left for future research.
Coming back to the probability families, we first verified that the subtractive potential
is convex in
and the divisive potential
is
λ-convex in
. Subtractive normalization using
is associated with the regular Legendre duality, whereas divisive normalization using
is associated with the
λ-duality. This gives an interpretation of the distinctiveness of Rényi entropy (used in the latter) from Tsallis entropy (used in the former) based on their intimate connection to the λ-duality (for $\lambda \neq 0$) or to the Legendre duality. As $\lambda$ is the parameter that controls the curvature in the Riemannian geometry of these probability families (see [12]), our framework provides a simple parametric deformation from the dually flat geometry (of the exponential model) to the dually projectively flat geometry (of the λ-exponential model). We expect that this framework will generate new insights in the applications of the $q$-exponential family and related concepts in statistical physics and information science.