Geometry Induced by a Generalization of Rényi Divergence

De Souza, David C.; Vigelis, Rui F.; Cavalcante, Charles C.

doi:10.3390/e18110407

by David C. De Souza ¹,†, Rui F. Vigelis ²,*,† and Charles C. Cavalcante ³,†

¹ Instituto Federal do Ceará, Campus Maracanaú, Fortaleza 61939-140, Brazil

² Computer Engineering School, Campus Sobral, Federal University of Ceará, Sobral 62010-560, Brazil

³ Department of Teleinformatics Engineering, Federal University of Ceará, Fortaleza 60455-900, Brazil

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Entropy 2016, 18(11), 407; https://doi.org/10.3390/e18110407

Submission received: 6 September 2016 / Revised: 27 October 2016 / Accepted: 11 November 2016 / Published: 17 November 2016

(This article belongs to the Special Issue Differential Geometrical Theory of Statistics)

Abstract: In this paper, we propose a generalization of Rényi divergence and investigate the geometry it induces. This generalization is given in terms of a φ-function, the same function that is used in the definition of non-parametric φ-families. The properties of φ-functions prove to be crucial in the generalization of Rényi divergence. Under appropriate conditions, we verify that the generalized Rényi divergence reduces, in a limiting case, to the φ-divergence. In a generalized statistical manifold, the φ-divergence induces a pair of dual connections $D^{(-1)}$ and $D^{(1)}$. We show that the family of connections $D^{(α)}$ induced by the generalization of Rényi divergence satisfies the relation $D^{(α)} = \frac{1 - α}{2} D^{(-1)} + \frac{1 + α}{2} D^{(1)}$, with $α \in [-1, 1]$.

Keywords: Rényi divergence; φ-function; φ-divergence; φ-family; statistical manifold; information geometry

1. Introduction

Information geometry, the study of statistical models equipped with a differentiable structure, was pioneered by the work of Rao [1], and gained maturity with the work of Amari and many others [2,3,4]. It has been successfully applied in many different areas, such as statistical inference, machine learning, signal processing or optimization [4,5]. In appropriate statistical models, the differentiable structure is induced by a (statistical) divergence. The Kullback–Leibler divergence induces a Riemannian metric, called the Fisher–Rao metric, and a pair of dual connections, the exponential and mixture connections. A statistical model endowed with the Fisher–Rao metric is called a (classical) statistical manifold. Amari also considered a family of α-divergences that induce a family of α-connections.

Much research in recent years has focused on the geometry of non-standard statistical models [6,7,8]. These models are defined in terms of a deformed exponential (also called ϕ-exponential). In particular, κ-exponential models and q-exponential families are investigated in [9,10]. Non-parametric (or infinite-dimensional) φ-families, which generalize exponential families in the non-parametric setting [13,14,15,16], were introduced by the authors in [11,12]. Based on the similarity between exponential and φ-families, we defined the so-called φ-divergence, of which the Kullback–Leibler divergence is a particular case. Statistical models equipped with a geometric structure induced by φ-divergences, which are called generalized statistical manifolds, are investigated in [17,18]. With respect to the pair of dual connections induced by the φ-divergence, parametric φ-families are dually flat.

The φ-divergence is intrinsically related to the $(ρ, τ)$-model of Zhang, which was proposed in [19,20], extended to the infinite-dimensional setting in [21], and explained in more detail in [22,23]. For instance, the metric induced by the φ-divergence and the $(ρ, τ)$-generalization of the Fisher–Rao metric, for the choices $ρ = φ^{-1}$ and $f = ρ^{-1}$, differ by a conformal factor.

Among many attempts to generalize Kullback–Leibler divergence, Rényi divergence [24] is one of the most successful, having found many applications [25]. In the present paper, we propose a generalization of Rényi divergence, which we use to define a family of α-connections. This generalization is based on an interpretation of Rényi divergence as a kind of normalizing function. To generalize Rényi divergence, we consider functions satisfying some suitable conditions. To a function for which these conditions hold, we give the name φ-function. In a limiting case, the generalized Rényi divergence reduces to the φ-divergence. In [17,18], the φ-divergence gives rise to a pair of dual connections $D^{(-1)}$ and $D^{(1)}$. We show that the connection $D^{(α)}$ induced by the generalization of Rényi divergence is the convex combination $D^{(α)} = \frac{1 - α}{2} D^{(-1)} + \frac{1 + α}{2} D^{(1)}$.

Eguchi in [26] investigated a geometry based on a normalizing function similar to the one used in the generalization of Rényi divergence. In [26], results were derived supposing that this normalizing function exists; conditions for its existence were not given. In the present paper, the existence of the normalizing function is ensured by conditions involved in the definition of φ-functions.

The rest of the paper is organized as follows. In Section 2, φ-functions are introduced and some properties are discussed. The Rényi divergence is generalized in Section 3. We investigate in Section 4 the geometry induced by the generalization of Rényi divergence. Section 4.2 provides evidence of the role of the generalized Rényi divergence in φ-families.

2. φ-Functions

Rényi divergence is defined in terms of the exponential function (to be more precise, the logarithm). A way of generalizing Rényi divergence is to replace the exponential function by another function, which satisfies some suitable conditions. To a function for which these conditions hold, we give the name φ-function. In this section, we define and investigate some properties of φ-functions.

Let $(T, Σ, μ)$ be a measure space. Although we do not restrict our analysis to a particular measure space, the reader can think of $T$ as the set of real numbers $\mathbb{R}$, $Σ$ as the Borel σ-algebra on $\mathbb{R}$, and $μ$ as the Lebesgue measure. We can also consider $T$ to be a discrete set, a case in which $μ$ is the counting measure.

We say that $φ : \mathbb{R} \to (0, \infty)$ is a φ-function if the following conditions are satisfied:

(a1): $φ(\cdot)$ is convex;
(a2): $\lim_{u \to -\infty} φ(u) = 0$ and $\lim_{u \to \infty} φ(u) = \infty$;
(a3): there exists a measurable function $u_{0} : T \to (0, \infty)$ such that

$$\int_{T} φ(c + λ u_{0}) \, dμ < \infty, \quad \text{for all } λ > 0,$$

(1)

for each measurable function $c : T \to \mathbb{R}$ satisfying $\int_{T} φ(c) \, dμ = 1$.

Thanks to condition (a3), we can generalize Rényi divergence using φ-functions. These conditions first appeared in [12], where the authors constructed non-parametric φ-families of probability distributions. We remark that if T is finite, condition (a3) is always satisfied.

Examples of functions $φ : \mathbb{R} \to (0, \infty)$ satisfying (a1)–(a3) abound. An example of great relevance is the exponential function $φ(u) = \exp(u)$, which satisfies conditions (a1)–(a3) with $u_{0} = 1_{T}$. Another example of a φ-function is Kaniadakis’ κ-exponential [12,27,28].

Example 1.

Kaniadakis’ κ-exponential $\exp_{κ} : \mathbb{R} \to (0, \infty)$ for $κ \in [-1, 1]$ is defined as

$$\exp_{κ}(u) = \begin{cases} (κ u + \sqrt{1 + κ^{2} u^{2}})^{1/κ}, & \text{if } κ \neq 0, \\ \exp(u), & \text{if } κ = 0, \end{cases}$$

whose inverse is the so-called Kaniadakis’ κ-logarithm $\log_{κ} : (0, \infty) \to \mathbb{R}$, which is given by

$$\log_{κ}(u) = \begin{cases} \frac{u^{κ} - u^{-κ}}{2κ}, & \text{if } κ \neq 0, \\ \ln(u), & \text{if } κ = 0. \end{cases}$$

It is clear that $\exp_{κ}(\cdot)$ satisfies (a1) and (a2). Let $u_{0} : T \to (0, \infty)$ be any measurable function for which $\int_{T} \exp_{κ}(u_{0}) \, dμ < \infty$. We will show that $u_{0}$ satisfies expression (1). For any $u \in \mathbb{R}$ and $α \geq 1$, we can write

$$\begin{aligned} \exp_{κ}(α u) &= α^{1/|κ|} \bigl(|κ| u + \sqrt{1/α^{2} + |κ|^{2} u^{2}}\bigr)^{1/|κ|} \\ &\leq α^{1/|κ|} \bigl(|κ| u + \sqrt{1 + |κ|^{2} u^{2}}\bigr)^{1/|κ|} \\ &= α^{1/|κ|} \exp_{κ}(u), \end{aligned}$$

where we used that $\exp_{κ}(\cdot) = \exp_{-κ}(\cdot)$. Then, we conclude that $\int_{T} \exp_{κ}(α u_{0}) \, dμ < \infty$ for all $α \geq 0$. Fix any measurable function $c : T \to \mathbb{R}$ such that $\int_{T} \exp_{κ}(c) \, dμ = 1$. For each $λ > 0$, the convexity of $\exp_{κ}(\cdot)$ and the bound above give

$$\begin{aligned} \int_{T} \exp_{κ}(c + λ u_{0}) \, dμ &\leq \frac{1}{2} \int_{T} \exp_{κ}(2c) \, dμ + \frac{1}{2} \int_{T} \exp_{κ}(2 λ u_{0}) \, dμ \\ &\leq 2^{1/|κ| - 1} \int_{T} \exp_{κ}(c) \, dμ + 2^{1/|κ| - 1} \int_{T} \exp_{κ}(λ u_{0}) \, dμ \\ &< \infty, \end{aligned}$$

which shows that $\exp_{κ}(\cdot)$ satisfies (a3). Therefore, Kaniadakis’ κ-exponential $\exp_{κ}(\cdot)$ is an example of a φ-function.
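The definitions in Example 1 can be tried out numerically. The following is a minimal sketch, not part of the original paper: the function names `kexp`, `klog` and `scaling_bound_holds` are hypothetical, and the sample points are arbitrary. It implements the κ-exponential and κ-logarithm as written above and spot-checks the scaling bound $\exp_{κ}(α u) \leq α^{1/|κ|} \exp_{κ}(u)$ for $α \geq 1$.

```python
import math

def kexp(u: float, kappa: float) -> float:
    """Kaniadakis kappa-exponential; reduces to exp(u) when kappa == 0."""
    if kappa == 0.0:
        return math.exp(u)
    return (kappa * u + math.sqrt(1.0 + (kappa * u) ** 2)) ** (1.0 / kappa)

def klog(x: float, kappa: float) -> float:
    """Kaniadakis kappa-logarithm, the inverse of kexp on (0, inf)."""
    if kappa == 0.0:
        return math.log(x)
    return (x ** kappa - x ** (-kappa)) / (2.0 * kappa)

def scaling_bound_holds(u: float, alpha: float, kappa: float) -> bool:
    """Spot-check exp_k(alpha*u) <= alpha**(1/|k|) * exp_k(u) for alpha >= 1, kappa != 0."""
    lhs = kexp(alpha * u, kappa)
    rhs = alpha ** (1.0 / abs(kappa)) * kexp(u, kappa)
    return lhs <= rhs * (1.0 + 1e-12)  # small slack for floating-point rounding

# Illustrative sample points (assumptions of this sketch).
round_trip = kexp(klog(2.5, 0.3), 0.3)
bound_ok = all(scaling_bound_holds(u, a, 0.5)
               for u in (-3.0, -0.5, 0.0, 0.7, 4.0)
               for a in (1.0, 2.0, 10.0))
```

A finite spot-check of this kind does not replace the proof; it only guards against transcription errors in the formulas.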

The restriction that $\int_{T} φ(c) \, dμ = 1$ can be weakened, as asserted in the next result.

Lemma 1.

Let $\tilde{c} : T \to \mathbb{R}$ be any measurable function such that $\int_{T} φ(\tilde{c}) \, dμ < \infty$. Then, $\int_{T} φ(\tilde{c} + λ u_{0}) \, dμ < \infty$ for all $λ > 0$.

Proof.

Notice that if $\int_{T} φ(\tilde{c}) \, dμ \geq 1$, then $\int_{T} φ(\tilde{c} - α u_{0}) \, dμ = 1$ for some $α \geq 0$. From the definition of $u_{0}$, it follows that $\int_{T} φ(\tilde{c} + λ u_{0}) \, dμ = \int_{T} φ(c + (α + λ) u_{0}) \, dμ < \infty$, where $c = \tilde{c} - α u_{0}$. Now assume that $\int_{T} φ(\tilde{c}) \, dμ < 1$. Consider any measurable set $A \subseteq T$ with measure $0 < μ(A) < μ(T)$. Let $u : T \to [0, \infty)$ be a measurable function supported on $A$ satisfying $φ(\tilde{c} + u) 1_{A} = [φ(\tilde{c}) + α] 1_{A}$, where $α = (1 - \int_{T} φ(\tilde{c}) \, dμ) / μ(A)$. Defining $c = (\tilde{c} + u) 1_{A} + \tilde{c} 1_{T \setminus A}$, we see that $\int_{T} φ(c) \, dμ = 1$. Since $\tilde{c} \leq c$ and $φ(\cdot)$ is increasing, the definition of $u_{0}$ allows us to write

$$\int_{T} φ(\tilde{c} + λ u_{0}) \, dμ \leq \int_{T} φ(c + λ u_{0}) \, dμ < \infty, \quad \text{for any } λ > 0,$$

which is the desired result. ☐

As a consequence of Lemma 1, condition (a3) can be replaced by the following one:

(a3’) There exists a measurable function $u_{0} : T \to (0, \infty)$ such that

$$\int_{T} φ(c + λ u_{0}) \, dμ < \infty, \quad \text{for all } λ > 0,$$

(2)

for each measurable function $c : T \to \mathbb{R}$ for which $\int_{T} φ(c) \, dμ < \infty$.

Without the equivalence between conditions (a3) and (a3’), we could not generalize Rényi divergence in the manner we propose. In fact, φ-functions could be defined directly in terms of (a3’), without mentioning (a3). We chose to begin with (a3) because this condition appeared initially in [12].

Not all functions $φ : \mathbb{R} \to (0, \infty)$ for which conditions (a1) and (a2) hold satisfy condition (a3). A function that satisfies (a1) and (a2) but fails (a3) is given below.

Example 2.

Assume that the underlying measure μ is σ-finite and non-atomic. This is the case of the Lebesgue measure. Let us consider the function

$$φ(u) = \begin{cases} e^{(u + 1)^{2}/2}, & u \geq 0, \\ e^{u + 1/2}, & u \leq 0, \end{cases}$$

(3)

which clearly is convex, and satisfies the limits $\lim_{u \to -\infty} φ(u) = 0$ and $\lim_{u \to \infty} φ(u) = \infty$. Given any measurable function $u_{0} : T \to (0, \infty)$, we will find a measurable function $c : T \to \mathbb{R}$ with $\int_{T} φ(c) \, dμ < \infty$, for which expression (2) is not satisfied.

For each $m \geq 1$, we define

$$v_{m}(t) := \Bigl(m \frac{\log(2)}{u_{0}(t)} - \frac{u_{0}(t)}{2} - 1\Bigr) 1_{E_{m}}(t),$$

where $E_{m} = \{t \in T : m \frac{\log(2)}{u_{0}(t)} - \frac{u_{0}(t)}{2} - 1 > 0\}$. Because $v_{m} ↑ \infty$, we can find a sub-sequence $\{v_{m_{n}}\}$ such that

$$\int_{E_{m_{n}}} e^{(v_{m_{n}} + u_{0} + 1)^{2}/2} \, dμ \geq 2^{n}.$$

According to Lemma 8.3 in [29], there exist a sub-sequence $w_{k} = v_{m_{n_{k}}}$ and pairwise disjoint sets $A_{k} \subseteq E_{m_{n_{k}}}$ for which

$$\int_{A_{k}} e^{(w_{k} + u_{0} + 1)^{2}/2} \, dμ = 1.$$

Let us define $c = \bar{c} 1_{T \setminus A} + \sum_{k = 1}^{\infty} w_{k} 1_{A_{k}}$, where $A = \bigcup_{k = 1}^{\infty} A_{k}$ and $\bar{c}$ is any measurable function such that $φ(\bar{c}(t)) > 0$ for $t \in T \setminus A$ and $\int_{T \setminus A} φ(\bar{c}) \, dμ < \infty$. Observing that

$$e^{(w_{k}(t) + u_{0}(t) + 1)^{2}/2} = 2^{m_{n_{k}}} e^{(w_{k}(t) + 1)^{2}/2}, \quad \text{for } t \in A_{k},$$

we get

$$\int_{A_{k}} e^{(w_{k} + 1)^{2}/2} \, dμ = \frac{1}{2^{m_{n_{k}}}}, \quad \text{for every } k \geq 1.$$

Then, we can write

$$\begin{aligned} \int_{T} φ(c) \, dμ &= \int_{T \setminus A} φ(\bar{c}) \, dμ + \sum_{k = 1}^{\infty} \int_{A_{k}} e^{(w_{k} + 1)^{2}/2} \, dμ \\ &= \int_{T \setminus A} φ(\bar{c}) \, dμ + \sum_{k = 1}^{\infty} \frac{1}{2^{m_{n_{k}}}} < \infty. \end{aligned}$$

On the other hand,

$$\begin{aligned} \int_{T} φ(c + u_{0}) \, dμ &= \int_{T \setminus A} φ(\bar{c} + u_{0}) \, dμ + \sum_{k = 1}^{\infty} \int_{A_{k}} e^{(u_{0} + w_{k} + 1)^{2}/2} \, dμ \\ &= \int_{T \setminus A} φ(\bar{c} + u_{0}) \, dμ + \sum_{k = 1}^{\infty} 1 = \infty, \end{aligned}$$

which shows that (2) is not satisfied.

3. Generalization of Rényi Divergence

In this section, we provide a generalization of Rényi divergence, which is given in terms of a φ-function. This generalization also depends on a parameter $α \in [-1, 1]$; for $α = \pm 1$, it is defined as a limit. Supposing that the underlying φ-function is continuously differentiable, we show that this limit exists and results in the φ-divergence [12]. In what follows, all probability distributions are assumed to have positive density. In other words, they belong to the collection

$$P_{μ} = \Bigl\{p \in L^{0} : \int_{T} p \, dμ = 1 \text{ and } p > 0\Bigr\},$$

where $L^{0}$ is the space of all real-valued, measurable functions on $T$, with equality μ-a.e. (μ-almost everywhere).

The Rényi divergence of order $α \in (-1, 1)$ between two probability distributions $p$ and $q$ in $P_{μ}$ is defined as

$$D_{R}^{(α)}(p \| q) = \frac{4}{α^{2} - 1} \log\Bigl(\int_{T} p^{\frac{1 - α}{2}} q^{\frac{1 + α}{2}} \, dμ\Bigr).$$

(4)

For $α = \pm 1$, the Rényi divergence is defined by taking a limit:

$$D_{R}^{(-1)}(p \| q) = \lim_{α ↓ -1} D_{R}^{(α)}(p \| q),$$

(5)

$$D_{R}^{(1)}(p \| q) = \lim_{α ↑ 1} D_{R}^{(α)}(p \| q).$$

(6)

Under some conditions, the limits in (5) and (6) are finite-valued and coincide with the Kullback–Leibler divergence. In other words,

$$D_{R}^{(-1)}(p \| q) = D_{R}^{(1)}(q \| p) = D_{KL}(p \| q) < \infty,$$

where $D_{KL}(p \| q)$ denotes the Kullback–Leibler divergence between $p$ and $q$, which is given by

$$D_{KL}(p \| q) = \int_{T} p \log\Bigl(\frac{p}{q}\Bigr) \, dμ.$$

These conditions are stated in Proposition 1, given at the end of this section, for the case involving the generalized Rényi divergence.

The Rényi divergence in its standard form is given by

$$D^{(α)}(p \| q) = \frac{1}{α - 1} \log\Bigl(\int_{T} p^{α} q^{1 - α} \, dμ\Bigr), \quad \text{for } α \in (0, 1).$$

(7)

Expression (4) is related to this form by

$$D_{R}^{(α)}(p \| q) = \frac{2}{1 - α} D^{((1 - α)/2)}(p \| q).$$

Beyond the change of variables, which results in α ranging in $[-1, 1]$, expressions (4) and (7) differ by the factor $2/(1 - α)$. We opted to insert the factor $2/(1 - α)$ so that some kind of symmetry could be maintained when the limits $α ↓ -1$ and $α ↑ 1$ are considered. In addition, the geometry induced by version (4) conforms with Amari’s notation [5].
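The relation between the two parametrizations, and the limiting behaviour at $α = \pm 1$, can be checked on a finite set $T$ with the counting measure. The sketch below is illustrative only: the function names and the two distributions are hypothetical, and the standard form is taken with the usual prefactor $1/(β - 1)$ for order $β \in (0, 1)$, which is the convention under which the stated relation holds.

```python
import math

def renyi_amari(p, q, alpha):
    """Expression (4): 4/(alpha^2 - 1) * log sum_t p^((1-alpha)/2) q^((1+alpha)/2)."""
    s = sum(pi ** ((1 - alpha) / 2) * qi ** ((1 + alpha) / 2) for pi, qi in zip(p, q))
    return 4.0 / (alpha ** 2 - 1.0) * math.log(s)

def renyi_standard(p, q, beta):
    """Standard Renyi divergence of order beta in (0, 1), prefactor 1/(beta - 1)."""
    s = sum(pi ** beta * qi ** (1 - beta) for pi, qi in zip(p, q))
    return math.log(s) / (beta - 1.0)

def kl(p, q):
    """Kullback-Leibler divergence between strictly positive discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical toy distributions on a three-point set.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
alpha = 0.4

lhs = renyi_amari(p, q, alpha)
rhs = 2.0 / (1.0 - alpha) * renyi_standard(p, q, (1.0 - alpha) / 2.0)

# Approach the end points alpha = -1 and alpha = +1 from inside (-1, 1).
near_minus_one = renyi_amari(p, q, -1.0 + 1e-6)
near_plus_one = renyi_amari(p, q, 1.0 - 1e-6)
```

On a finite set every integral is a finite sum, so the limits exist without extra integrability conditions; `near_minus_one` should be close to `kl(p, q)` and `near_plus_one` close to `kl(q, p)`.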

The Rényi divergence $D_{R}^{(α)}(\cdot \| \cdot)$ can be defined for every $α \in \mathbb{R}$. However, for $α \notin (-1, 1)$, expression (4) may not be finite-valued for every $p$ and $q$ in $P_{μ}$. To avoid some technicalities, we just consider $α \in [-1, 1]$.

Given $p$ and $q$ in $P_{μ}$, let us define

$$κ(α) = -\log\Bigl(\int_{T} p^{\frac{1 - α}{2}} q^{\frac{1 + α}{2}} \, dμ\Bigr), \quad \text{for } α \in [-1, 1],$$

which can be used to express the Rényi divergence as

$$D_{R}^{(α)}(p \| q) = \frac{4}{1 - α^{2}} κ(α), \quad \text{for } α \in (-1, 1).$$

The function $κ(α)$, which depends on $p$ and $q$, can be defined as the unique non-negative real number for which

$$\int_{T} \exp\Bigl(\frac{1 - α}{2} \ln(p) + \frac{1 + α}{2} \ln(q) + κ(α)\Bigr) \, dμ = 1.$$

(8)

The function $κ(α)$ plays the role of a normalizing term. The generalization of Rényi divergence that we propose is based on the interpretation of $κ(α)$ given in (8). Instead of the exponential function, we consider a φ-function in (8).

Fix any φ-function $φ : \mathbb{R} \to (0, \infty)$. Given any $p$ and $q$ in $P_{μ}$, we take $κ(α) = κ(α; p, q) \geq 0$ so that

$$\int_{T} φ\Bigl(\frac{1 - α}{2} φ^{-1}(p) + \frac{1 + α}{2} φ^{-1}(q) + κ(α) u_{0}\Bigr) \, dμ = 1,$$

(9)

or, in other words, the term inside the integral is a probability distribution in $P_{μ}$. The existence and uniqueness of $κ(α)$ as defined in (9) is guaranteed by condition (a3’).
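On a finite set $T$ with the counting measure, the normalizing term in (9) can be found by a one-dimensional root search, because the left-hand side of (9) is increasing in $κ(α)$ and, by convexity of φ, is at most 1 when $κ(α) = 0$. The following is a minimal sketch under those assumptions; the helper names (`normalizer`, `generalized_renyi`), the choice $κ = 0.5$ for the κ-exponential and the toy distributions are hypothetical, and bisection is simply one convenient solver.

```python
import math

def kexp(u, kappa=0.5):
    """Kaniadakis kappa-exponential (kappa != 0)."""
    return (kappa * u + math.sqrt(1.0 + (kappa * u) ** 2)) ** (1.0 / kappa)

def klog(x, kappa=0.5):
    """Kaniadakis kappa-logarithm, inverse of kexp."""
    return (x ** kappa - x ** (-kappa)) / (2.0 * kappa)

def normalizer(p, q, alpha, phi, phi_inv, u0, tol=1e-13):
    """Bisection for kappa(alpha) >= 0 in (9), with mu the counting measure."""
    mix = [0.5 * (1 - alpha) * phi_inv(pi) + 0.5 * (1 + alpha) * phi_inv(qi)
           for pi, qi in zip(p, q)]

    def total(k):
        return sum(phi(m + k * w) for m, w in zip(mix, u0))

    lo, hi = 0.0, 1.0
    while total(hi) < 1.0:   # total is increasing in k: phi is increasing and u0 > 0
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if total(mid) < 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def generalized_renyi(p, q, alpha, phi, phi_inv, u0):
    """4/(1 - alpha^2) * kappa(alpha), for alpha in (-1, 1)."""
    return 4.0 / (1.0 - alpha ** 2) * normalizer(p, q, alpha, phi, phi_inv, u0)

# Hypothetical toy data on a three-point set, with u0 = 1.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
u0 = [1.0, 1.0, 1.0]

# Classical case: phi = exp recovers kappa(0) = -log sum_t sqrt(p_t q_t).
k_classical = normalizer(p, q, 0.0, math.exp, math.log, u0)
k_closed_form = -math.log(sum(math.sqrt(pi * qi) for pi, qi in zip(p, q)))

# Deformed case with the kappa-exponential.
d_deformed = generalized_renyi(p, q, 0.0, kexp, klog, u0)
```

With `phi = math.exp` the solver reproduces the closed-form normalizer of the classical Rényi divergence, which is a useful sanity check before switching to a deformed exponential.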

We define a generalization of the Rényi divergence of order $α \in (-1, 1)$ as

$$D_{φ}^{(α)}(p \| q) = \frac{4}{1 - α^{2}} κ(α).$$

(10)

For $α = \pm 1$, this generalization is defined as a limit:

$$D_{φ}^{(-1)}(p \| q) = \lim_{α ↓ -1} D_{φ}^{(α)}(p \| q),$$

(11)

$$D_{φ}^{(1)}(p \| q) = \lim_{α ↑ 1} D_{φ}^{(α)}(p \| q).$$

(12)

The cases $α = \pm 1$ are related to a generalization of the Kullback–Leibler divergence, the so-called φ-divergence, which was introduced by the authors in [12]. The φ-divergence is given by

$$D_{φ}(p \| q) = \frac{\int_{T} \frac{φ^{-1}(p) - φ^{-1}(q)}{(φ^{-1})'(p)} \, dμ}{\int_{T} \frac{u_{0}}{(φ^{-1})'(p)} \, dμ}.$$

(13)

(It was pointed out to us by an anonymous referee that this form of divergence is a special case of the $(ρ, τ)$-divergence for $ρ = φ^{-1}$ and $f = ρ^{-1}$ (see Section 3.5 in [19]), apart from a conformal factor, which is the denominator of (13).)

Under some conditions, the limit in (11) or (12) is finite-valued and coincides with the φ-divergence:

$$D_{φ}^{(-1)}(p \| q) = D_{φ}^{(1)}(q \| p) = D_{φ}(p \| q) < \infty.$$

(14)

To show (14), we make use of the following result.

Lemma 2.

Assume that $φ(\cdot)$ is continuously differentiable. If, for $α_{0}, α_{1} \in \mathbb{R}$, the expression

$$\int_{T} φ\Bigl(\frac{1 - α}{2} φ^{-1}(p) + \frac{1 + α}{2} φ^{-1}(q)\Bigr) \, dμ < \infty$$

(15)

is satisfied for all $α \in [α_{0}, α_{1}]$, then the derivative of $κ(α)$ exists at any $α \in (α_{0}, α_{1})$, and is given by

$$\frac{\partial κ}{\partial α}(α) = -\frac{1}{2} \frac{\int_{T} [φ^{-1}(q) - φ^{-1}(p)] φ'(c_{α}) \, dμ}{\int_{T} φ'(c_{α}) u_{0} \, dμ},$$

(16)

where $c_{α} = \frac{1 - α}{2} φ^{-1}(p) + \frac{1 + α}{2} φ^{-1}(q) + κ(α) u_{0}$.

Proof.

For $α \in (α_{0}, α_{1})$ and $κ > 0$, define

$$g(α, κ) = \int_{T} φ\Bigl(\frac{1 - α}{2} φ^{-1}(p) + \frac{1 + α}{2} φ^{-1}(q) + κ u_{0}\Bigr) \, dμ.$$

The function $κ(α)$ is defined implicitly by $g(α, κ(α)) = 1$. If we show that

(i): the function $g(α, κ)$ is continuous in a neighborhood of $(α, κ(α))$,
(ii): the partial derivatives $\frac{\partial g}{\partial α}$ and $\frac{\partial g}{\partial κ}$ exist and are continuous at $(α, κ(α))$,
(iii): and $\frac{\partial g}{\partial κ}(α, κ(α)) > 0$,

then by the Implicit Function Theorem $κ(α)$ is differentiable at $α \in (α_{0}, α_{1})$, and

$$\frac{\partial κ}{\partial α}(α) = -\frac{(\partial g / \partial α)(α, κ(α))}{(\partial g / \partial κ)(α, κ(α))}.$$

(17)

We begin by verifying that $g(α, κ)$ is continuous. For fixed $α \in (α_{0}, α_{1})$ and $κ > 0$, set $κ_{0} = 2κ$. Denoting $A = \{t \in T : φ^{-1}(q(t)) > φ^{-1}(p(t))\}$, we can write

$$\begin{aligned} φ\Bigl(\frac{1 - β}{2} φ^{-1}(p) + \frac{1 + β}{2} φ^{-1}(q) + λ u_{0}\Bigr) &\leq φ\Bigl(φ^{-1}(p) + \frac{1 + β}{2} [φ^{-1}(q) - φ^{-1}(p)] + κ_{0} u_{0}\Bigr) \\ &\leq φ\Bigl(φ^{-1}(p) + \frac{1 + α_{1}}{2} [φ^{-1}(q) - φ^{-1}(p)] + κ_{0} u_{0}\Bigr) 1_{A} \\ &\quad + φ\Bigl(φ^{-1}(p) + \frac{1 + α_{0}}{2} [φ^{-1}(q) - φ^{-1}(p)] + κ_{0} u_{0}\Bigr) 1_{T \setminus A}, \end{aligned}$$

(18)

for every $β \in (α_{0}, α_{1})$ and $λ \in (0, κ_{0})$. Because the function on the right-hand side of (18) is integrable, we can apply the Dominated Convergence Theorem to conclude that

$$\lim_{(β, λ) \to (α, κ)} g(β, λ) = g(α, κ).$$

Now, we will show that the derivative of $g(α, κ)$ with respect to α exists and is continuous. Consider the difference

$$\frac{g(γ, λ) - g(β, λ)}{γ - β} = \int_{T} \frac{1}{γ - β} \Bigl[φ\Bigl(c_{β} + \frac{γ - β}{2} [φ^{-1}(q) - φ^{-1}(p)] + λ u_{0}\Bigr) - φ(c_{β} + λ u_{0})\Bigr] \, dμ,$$

(19)

where $c_{β} = \frac{1 - β}{2} φ^{-1}(p) + \frac{1 + β}{2} φ^{-1}(q)$. Represent by $f_{β, γ, λ}$ the function inside the integral sign in (19). For fixed $α \in (α_{0}, α_{1})$ and $κ > 0$, denote $\bar{α}_{0} = (α_{0} + α)/2$, $\bar{α}_{1} = (α + α_{1})/2$, and $κ_{0} = 2κ$. Because $φ(\cdot)$ is convex and increasing, it follows that

$$|f_{β, γ, λ}| \leq f_{\bar{α}_{1}, α_{1}, κ_{0}} 1_{A} - f_{\bar{α}_{0}, α_{0}, κ_{0}} 1_{T \setminus A} =: f, \quad \text{for all } β, γ \in (\bar{α}_{0}, \bar{α}_{1}) \text{ and } λ \in (0, κ_{0}),$$

where $A = \{t \in T : φ^{-1}(q(t)) > φ^{-1}(p(t))\}$. Observing that $f$ is integrable, we can use the Dominated Convergence Theorem to get

$$\lim_{γ \to β} \int_{T} f_{β, γ, λ} \, dμ = \int_{T} \Bigl(\lim_{γ \to β} f_{β, γ, λ}\Bigr) \, dμ,$$

and then

$$\frac{\partial g}{\partial α}(β, λ) = \frac{1}{2} \int_{T} [φ^{-1}(q) - φ^{-1}(p)] φ'(c_{β} + λ u_{0}) \, dμ.$$

(20)

For $β \in (\bar{α}_{0}, \bar{α}_{1})$ and $λ \in (0, κ_{0})$, the function inside the integral sign in (20) is dominated by $f$. As a result, a second use of the Dominated Convergence Theorem shows that $\frac{\partial g}{\partial α}$ is continuous at $(α, κ)$:

$$\lim_{(β, λ) \to (α, κ)} \frac{\partial g}{\partial α}(β, λ) = \frac{\partial g}{\partial α}(α, κ).$$

Using similar arguments, one can show that $\frac{\partial g}{\partial κ}(α, κ)$ exists and is continuous at any $α \in (α_{0}, α_{1})$ and $κ > 0$, and is given by

$$\frac{\partial g}{\partial κ}(α, κ) = \int_{T} u_{0} φ'(c_{α} + κ u_{0}) \, dμ,$$

(21)

where, as in (19), $c_{α} = \frac{1 - α}{2} φ^{-1}(p) + \frac{1 + α}{2} φ^{-1}(q)$. Clearly, expression (21) implies that $\frac{\partial g}{\partial κ}(α, κ) > 0$ for all $α \in (α_{0}, α_{1})$ and $κ > 0$.

We proved that items (i)–(iii) are satisfied. As a consequence, the derivative of $κ(α)$ exists at any $α \in (α_{0}, α_{1})$. Expression (16) for the derivative of $κ(α)$ follows from (17), (20) and (21). ☐

As an immediate consequence of Lemma 2, we get the proposition below.

Proposition 1.

Assume that $φ(\cdot)$ is continuously differentiable.

(a): If, for some $α_{0} < -1$, expression (15) is satisfied for all $α \in [α_{0}, -1)$, then

$$D_{φ}^{(-1)}(p \| q) = \lim_{α ↓ -1} D_{φ}^{(α)}(p \| q) = 2 \frac{\partial κ}{\partial α}(-1) = D_{φ}(p \| q) < \infty.$$
(b): If, for some $α_{1} > 1$, expression (15) is satisfied for all $α \in (1, α_{1}]$, then

$$D_{φ}^{(1)}(p \| q) = \lim_{α ↑ 1} D_{φ}^{(α)}(p \| q) = -2 \frac{\partial κ}{\partial α}(1) = D_{φ}(q \| p) < \infty.$$
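
The step from Lemma 2 to part (a) can be written out as follows. This is a sketch of the intermediate computation rather than text from the original proof; it uses only $κ(-1) = 0$ (at $α = -1$ the argument of φ in (9) is $φ^{-1}(p) + κ(-1) u_{0}$, and $p$ already integrates to 1), expression (16) with $c_{-1} = φ^{-1}(p)$, and the identity $(φ^{-1})'(p) = 1/φ'(φ^{-1}(p))$. Part (b) follows in the same way with $1 - α^{2} = (1 - α)(1 + α)$ expanded around $α = 1$.

```latex
\begin{aligned}
D_{\varphi}^{(-1)}(p \,\|\, q)
  &= \lim_{\alpha \downarrow -1} \frac{4}{(1-\alpha)(1+\alpha)}\,\kappa(\alpha)
   = \lim_{\alpha \downarrow -1} \frac{4}{1-\alpha}\cdot
     \frac{\kappa(\alpha)-\kappa(-1)}{\alpha-(-1)}
   = 2\,\frac{\partial \kappa}{\partial \alpha}(-1), \\
2\,\frac{\partial \kappa}{\partial \alpha}(-1)
  &= -\frac{\int_T [\varphi^{-1}(q)-\varphi^{-1}(p)]\,
            \varphi'(\varphi^{-1}(p))\,d\mu}
           {\int_T u_0\,\varphi'(\varphi^{-1}(p))\,d\mu}
   = \frac{\int_T \frac{\varphi^{-1}(p)-\varphi^{-1}(q)}{(\varphi^{-1})'(p)}\,d\mu}
          {\int_T \frac{u_0}{(\varphi^{-1})'(p)}\,d\mu}
   = D_{\varphi}(p \,\|\, q).
\end{aligned}
```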

4. Generalized Statistical Manifolds

Statistical manifolds consist of a collection of probability distributions endowed with a metric and α-connections, which are defined in terms of the derivative of $l(t; θ) = \log p(t; θ)$. In a generalized statistical manifold, the metric and connection are defined in terms of $f(t; θ) = φ^{-1}(p(t; θ))$. Instead of the logarithm, we consider the inverse $φ^{-1}(\cdot)$ of a φ-function. Generalized statistical manifolds were introduced by the authors in [17,18]. Among examples of the generalized statistical manifold, (parametric) φ-families of probability distributions are of greatest importance. The non-parametric counterpart was investigated in [11,12]. The metric in φ-families can be defined as the Hessian of a function; i.e., φ-families are Hessian manifolds [30]. In [17,18], the φ-divergence gives rise to a pair of dual connections $D^{(-1)}$ and $D^{(1)}$; and then for $α \in (-1, 1)$ the α-connection $D^{(α)}$ is defined as the convex combination $D^{(α)} = \frac{1 - α}{2} D^{(-1)} + \frac{1 + α}{2} D^{(1)}$. In the present paper, we show that the connection induced by $D_{φ}^{(α)}(\cdot \| \cdot)$, the generalization of Rényi divergence, corresponds to $D^{(α)}$.

4.1. Definitions

Let $φ : \mathbb{R} \to (0, \infty)$ be a φ-function. A generalized statistical manifold $P = \{p(t; θ) : θ \in Θ\}$ is a collection of probability distributions $p_{θ}(t) := p(t; θ)$, indexed by parameters $θ = (θ^{1}, \dots, θ^{n}) \in Θ$ in a one-to-one relation, such that

(m1): Θ is a domain (open and connected set) in $\mathbb{R}^{n}$;
(m2): $p(t; θ)$ is differentiable with respect to θ;
(m3): the matrix $g = (g_{ij})$ defined by

$$g_{ij} = -E_{θ}' \Bigl[\frac{\partial^{2} φ^{-1}(p_{θ})}{\partial θ^{i} \partial θ^{j}}\Bigr],$$

(22)

is positive definite at each $θ \in Θ$, where

$$E_{θ}'[\cdot] = \frac{\int_{T} (\cdot) φ'(φ^{-1}(p_{θ})) \, dμ}{\int_{T} u_{0} φ'(φ^{-1}(p_{θ})) \, dμ};$$

(23)
(m4): the operations of integration with respect to μ and differentiation with respect to $θ^{i}$ commute in all calculations found below, which are related to the metric and connections.

The matrix $g = (g_{ij})$ equips $P$ with a metric. By the chain rule, the tensor related to $g = (g_{ij})$ is invariant under change of coordinates. The (classical) statistical manifold is a particular case in which $φ(u) = \exp(u)$ and $u_{0} = 1_{T}$.

We introduce a notation similar to Equation (23) that involves higher order derivatives of $φ(\cdot)$. For each $n \geq 1$, we define

$$E_{θ}^{(n)}[\cdot] = \frac{\int_{T} (\cdot) φ^{(n)}(φ^{-1}(p_{θ})) \, dμ}{\int_{T} u_{0} φ'(φ^{-1}(p_{θ})) \, dμ}.$$

(24)

We also use $E_{θ}'[\cdot]$, $E_{θ}''[\cdot]$ and $E_{θ}'''[\cdot]$ to denote $E_{θ}^{(n)}[\cdot]$ for $n = 1, 2, 3$, respectively. The notation (24) appears in expressions related to the metric and connections.

Using property (m4), we can find an alternate expression for $g_{ij}$ as well as an identification involving tangent spaces. The matrix $g = (g_{ij})$ can be equivalently defined by

$$g_{ij} = E_{θ}'' \Bigl[\frac{\partial φ^{-1}(p_{θ})}{\partial θ^{i}} \frac{\partial φ^{-1}(p_{θ})}{\partial θ^{j}}\Bigr].$$

(25)

As a consequence of this equivalence, the tangent space $T_{p_{θ}} P$ can be identified with $\tilde{T}_{p_{θ}} P$, the vector space spanned by $\frac{\partial φ^{-1}(p_{θ})}{\partial θ^{i}}$, and endowed with the inner product $\langle \tilde{X}, \tilde{Y} \rangle_{θ} := E_{θ}''[\tilde{X} \tilde{Y}]$. The mapping

$$\sum_{i} a_{i} \frac{\partial}{\partial θ^{i}} \mapsto \sum_{i} a_{i} \frac{\partial φ^{-1}(p_{θ})}{\partial θ^{i}}$$

defines an isometry between $T_{p_{θ}} P$ and $\tilde{T}_{p_{θ}} P$.

To verify (25), we differentiate $\int_{T} p_{θ} \, dμ = 1$ with respect to $θ^{i}$, to get

$$0 = \frac{\partial}{\partial θ^{i}} \int_{T} p_{θ} \, dμ = \int_{T} \frac{\partial}{\partial θ^{i}} φ(φ^{-1}(p_{θ})) \, dμ = \int_{T} \frac{\partial φ^{-1}(p_{θ})}{\partial θ^{i}} φ'(φ^{-1}(p_{θ})) \, dμ.$$

(26)

Now, differentiating with respect to $θ^{j}$, we obtain

$$0 = \int_{T} \frac{\partial^{2} φ^{-1}(p_{θ})}{\partial θ^{i} \partial θ^{j}} φ'(φ^{-1}(p_{θ})) \, dμ + \int_{T} \frac{\partial φ^{-1}(p_{θ})}{\partial θ^{i}} \frac{\partial φ^{-1}(p_{θ})}{\partial θ^{j}} φ''(φ^{-1}(p_{θ})) \, dμ,$$

and then (25) follows. In view of (26), we notice that every vector $\tilde{X}$ belonging to $\tilde{T}_{p_{θ}} P$ satisfies $E_{θ}'[\tilde{X}] = 0$.

The metric $g = (g_{ij})$ gives rise to a Levi–Civita connection ∇ (i.e., a torsion-free, metric connection), whose corresponding Christoffel symbols $Γ_{ijk}$ are given by

$$Γ_{ijk} := \frac{1}{2} \Bigl(\frac{\partial g_{ki}}{\partial θ^{j}} + \frac{\partial g_{kj}}{\partial θ^{i}} - \frac{\partial g_{ij}}{\partial θ^{k}}\Bigr).$$

(27)

Using expression (25) to calculate the derivatives in (27), we can express

$$\begin{aligned} Γ_{ijk} &= E_{θ}'' \Bigl[\frac{\partial^{2} φ^{-1}(p_{θ})}{\partial θ^{i} \partial θ^{j}} \frac{\partial φ^{-1}(p_{θ})}{\partial θ^{k}}\Bigr] + \frac{1}{2} E_{θ}''' \Bigl[\frac{\partial φ^{-1}(p_{θ})}{\partial θ^{i}} \frac{\partial φ^{-1}(p_{θ})}{\partial θ^{j}} \frac{\partial φ^{-1}(p_{θ})}{\partial θ^{k}}\Bigr] \\ &\quad - \frac{1}{2} E_{θ}'' \Bigl[\frac{\partial φ^{-1}(p_{θ})}{\partial θ^{i}} \frac{\partial φ^{-1}(p_{θ})}{\partial θ^{k}}\Bigr] E_{θ}'' \Bigl[u_{0} \frac{\partial φ^{-1}(p_{θ})}{\partial θ^{j}}\Bigr] - \frac{1}{2} E_{θ}'' \Bigl[\frac{\partial φ^{-1}(p_{θ})}{\partial θ^{j}} \frac{\partial φ^{-1}(p_{θ})}{\partial θ^{k}}\Bigr] E_{θ}'' \Bigl[u_{0} \frac{\partial φ^{-1}(p_{θ})}{\partial θ^{i}}\Bigr] \\ &\quad + \frac{1}{2} E_{θ}'' \Bigl[\frac{\partial φ^{-1}(p_{θ})}{\partial θ^{i}} \frac{\partial φ^{-1}(p_{θ})}{\partial θ^{j}}\Bigr] E_{θ}'' \Bigl[u_{0} \frac{\partial φ^{-1}(p_{θ})}{\partial θ^{k}}\Bigr]. \end{aligned}$$

As we will show later, the Levi–Civita connection ∇ corresponds to the connection derived from the divergence $D_{φ}^{(α)}(\cdot \| \cdot)$ with $α = 0$.

4.2. φ-Families

Let $c : T \to \mathbb{R}$ be a measurable function for which $p = φ(c)$ is a probability density in $P_{μ}$. Fix measurable functions $u_{1}, \dots, u_{n} : T \to \mathbb{R}$. A (parametric) φ-family $F_{p} = \{p_{θ} : θ \in Θ\}$, centered at $p = φ(c)$, is a set of probability distributions in $P_{μ}$, whose members can be written in the form

$$p_{θ} := φ\Bigl(c + \sum_{i = 1}^{n} θ^{i} u_{i} - ψ(θ) u_{0}\Bigr), \quad \text{for each } θ = (θ^{i}) \in Θ,$$

(28)

where $ψ : Θ \to [0, \infty)$ is a normalizing function, which is introduced so that expression (28) defines a probability distribution belonging to $P_{μ}$.

The functions $u_{1}, \dots, u_{n}$ are not arbitrary. They are chosen to satisfy the following assumptions:

(i): $u_{0}, u_{1}, \dots, u_{n}$ are linearly independent,
(ii): $\int_{T} u_{i} φ'(c) \, dμ = 0$, and
(iii): there exists $ε > 0$ such that $\int_{T} φ(c + λ u_{i}) \, dμ < \infty$, for all $λ \in (-ε, ε)$.

Moreover, the domain $Θ \subseteq \mathbb{R}^{n}$ is defined as the set of all vectors $θ = (θ^{i})$ for which

$$\int_{T} φ\Bigl(c + λ \sum_{i = 1}^{n} θ^{i} u_{i}\Bigr) \, dμ < \infty, \quad \text{for some } λ > 1.$$

Condition (i) implies that the mapping defined by (28) is one-to-one. Assumption (ii) makes ψ a non-negative function. Indeed, by the convexity of $φ(\cdot)$, along with (ii), we can write

$$\int_{T} φ(c) \, dμ = \int_{T} \Bigl[φ(c) + \Bigl(\sum_{i = 1}^{n} θ^{i} u_{i}\Bigr) φ'(c)\Bigr] \, dμ \leq \int_{T} φ\Bigl(c + \sum_{i = 1}^{n} θ^{i} u_{i}\Bigr) \, dμ,$$

which implies $ψ(θ) \geq 0$. By condition (iii), the domain Θ is an open neighborhood of the origin. If the set T is finite, condition (iii) is always satisfied. One can show that the domain Θ is open and convex. Moreover, the normalizing function ψ is also convex (or strictly convex if $φ(\cdot)$ is strictly convex). Conditions (ii) and (iii) also appear in the definition of non-parametric φ-families. For further details, we refer to [11,12].

In a φ-family $F_{p}$, the matrix $(g_{ij})$ given by (22) or (25) can be expressed as the Hessian of ψ. If $φ(\cdot)$ is strictly convex, then $(g_{ij})$ is positive definite. From

$$\frac{\partial φ^{-1}(p_{θ})}{\partial θ^{i}} = u_{i} - \frac{\partial ψ}{\partial θ^{i}} u_{0}, \qquad -\frac{\partial^{2} φ^{-1}(p_{θ})}{\partial θ^{i} \partial θ^{j}} = \frac{\partial^{2} ψ}{\partial θ^{i} \partial θ^{j}} u_{0},$$

together with $E_{θ}'[u_{0}] = 1$, it follows from (22) that $g_{ij} = \frac{\partial^{2} ψ}{\partial θ^{i} \partial θ^{j}}$.

The next two results show how the generalization of Rényi divergence and the φ-divergence are related to the normalizing function in φ-families.

Proposition 2.

In a φ-family $F_{p}$, the generalization of Rényi divergence for $α \in (-1, 1)$ can be expressed in terms of the normalizing function ψ as follows:

$$D_{φ}^{(α)}(p_{θ} \| p_{ϑ}) = \frac{2}{1 + α} ψ(θ) + \frac{2}{1 - α} ψ(ϑ) - \frac{4}{1 - α^{2}} ψ\Bigl(\frac{1 - α}{2} θ + \frac{1 + α}{2} ϑ\Bigr),$$

(29)

for all $θ, ϑ \in Θ$.

Proof.

Recall the definition of

κ (α)

as the real number for which

\int_{T} φ (\frac{1 - α}{2} φ^{- 1} (p_{θ}) + \frac{1 + α}{2} φ^{- 1} (p_{ϑ}) + κ (α) u_{0}) d μ = 1 .

Using expression (28) for probability distributions in

F_{p}

, we can write

\begin{array}{l} \frac{1 - α}{2} φ^{- 1} (p_{θ}) + \frac{1 + α}{2} φ^{- 1} (p_{ϑ}) + κ (α) u_{0} \\ = c + \sum_{i = 1}^{n} (\frac{1 - α}{2} θ^{i} + \frac{1 + α}{2} ϑ^{i}) u_{i} - (\frac{1 - α}{2} ψ (θ) + \frac{1 + α}{2} ψ (ϑ) - κ (α)) u_{0} \\ = c + \sum_{i = 1}^{n} (\frac{1 - α}{2} θ^{i} + \frac{1 + α}{2} ϑ^{i}) u_{i} - ψ (\frac{1 - α}{2} θ + \frac{1 + α}{2} ϑ) u_{0} . \end{array}

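Because the domain Θ is convex, the point

\frac{1 - α}{2} θ + \frac{1 + α}{2} ϑ

belongs to Θ, and the last equality then follows from the uniqueness of the normalizing constant. Thus, it follows that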

κ (α) = \frac{1 - α}{2} ψ (θ) + \frac{1 + α}{2} ψ (ϑ) - ψ (\frac{1 - α}{2} θ + \frac{1 + α}{2} ϑ) .

By the definition of

D_{φ}^{(α)} (\cdot ∥ \cdot)

, we get (29). ☐
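Proposition 2 lends itself to a direct numerical check. The sketch below computes κ(α) by solving its defining equation with bisection and compares it with the expression in terms of ψ obtained in the proof. It assumes a one-parameter family on a four-point set with counting measure, u_0 = 1, and the κ-exponential as φ; it also uses the relation between the divergence and κ(α) that results from comparing (29) with the formula for κ(α), namely that the divergence equals 4κ(α)/(1 − α²). All function names are hypothetical.

```python
import math

KAPPA = 0.5  # illustrative deformation parameter (assumption)

def phi(x):
    return math.exp(math.asinh(KAPPA * x) / KAPPA)        # kappa-exponential

def phi_inv(y):
    return (y ** KAPPA - y ** (-KAPPA)) / (2.0 * KAPPA)   # kappa-logarithm

def dphi(x):
    return phi(x) / math.sqrt(1.0 + (KAPPA * x) ** 2)     # phi'

# Assumed setup: T = {0, 1, 2, 3}, counting measure, u_0 = 1, one parameter.
p = [0.1, 0.2, 0.3, 0.4]
c = [phi_inv(v) for v in p]
raw = [0.0, 1.0, 2.0, 3.0]
weights = [dphi(ct) for ct in c]
shift = sum(r * w for r, w in zip(raw, weights)) / sum(weights)
u = [r - shift for r in raw]          # centered so that condition (ii) holds

def psi(theta, lo=-50.0, hi=50.0):
    """Normalizing function, found by bisection on the total mass."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        total = sum(phi(ct + theta * ut - mid) for ct, ut in zip(c, u))
        if total > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def density(theta):
    """Probabilities p_theta = phi(c + theta*u - psi(theta))."""
    s = psi(theta)
    return [phi(ct + theta * ut - s) for ct, ut in zip(c, u)]

def kappa_alpha(alpha, theta, vartheta, lo=-50.0, hi=50.0):
    """Solve sum_t phi(a*phi_inv(p) + b*phi_inv(q) + k) = 1 for k by bisection."""
    a, b = (1.0 - alpha) / 2.0, (1.0 + alpha) / 2.0
    mix = [a * phi_inv(pt) + b * phi_inv(qt)
           for pt, qt in zip(density(theta), density(vartheta))]
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if sum(phi(m + mid) for m in mix) < 1.0:
            lo = mid    # total mass too small, so k has to grow
        else:
            hi = mid
    return 0.5 * (lo + hi)

alpha, theta, vartheta = 0.3, 0.5, -0.8
a, b = (1.0 - alpha) / 2.0, (1.0 + alpha) / 2.0
kappa_direct = kappa_alpha(alpha, theta, vartheta)
kappa_formula = a * psi(theta) + b * psi(vartheta) - psi(a * theta + b * vartheta)
d_direct = 4.0 / (1.0 - alpha ** 2) * kappa_direct
d_formula = (2.0 / (1.0 + alpha) * psi(theta) + 2.0 / (1.0 - alpha) * psi(vartheta)
             - 4.0 / (1.0 - alpha ** 2) * psi(a * theta + b * vartheta))
print(d_direct, d_formula)
```

The value obtained from the definition and the value from (29) should coincide up to rounding, and both should be strictly positive whenever the two parameters differ.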

Proposition 3.

In a φ-family

F_{p}

, the φ-divergence is related to the normalizing function ψ by the equality

D_{φ} (p_{θ} ∥ p_{ϑ}) = ψ (ϑ) - ψ (θ) - \nabla ψ (θ) \cdot (ϑ - θ),

(30)

for all

θ, ϑ \in Θ

.

Proof.

To show (30), we use

\frac{\partial ψ}{\partial θ^{i}} (θ) = \frac{\int_{T} u_{i} φ^{'} (φ^{- 1} (p_{θ})) d μ}{\int_{T} u_{0} φ^{'} (φ^{- 1} (p_{θ})) d μ},

which is a consequence of Lemma 10 in [12]. In view of

{(φ^{- 1})}^{'} (u) = 1 / φ^{'} (φ^{- 1} (u))

, expression (13) with

p = p_{θ}

and

q = p_{ϑ}

results in

D_{φ} (p_{θ} ∥ p_{ϑ}) = \frac{\int_{T} [φ^{- 1} (p_{θ}) - φ^{- 1} (p_{ϑ})] φ^{'} (φ^{- 1} (p_{θ})) d μ}{\int_{T} u_{0} φ^{'} (φ^{- 1} (p_{θ})) d μ} .

(31)

Inserting into (31) the difference

\begin{matrix} φ^{- 1} (p_{θ}) - φ^{- 1} (p_{ϑ}) & = (c + \sum_{i = 1}^{n} θ^{i} u_{i} - ψ (θ) u_{0}) - (c + \sum_{i = 1}^{n} ϑ^{i} u_{i} - ψ (ϑ) u_{0}) \\ & = ψ (ϑ) u_{0} - ψ (θ) u_{0} - \sum_{i = 1}^{n} (ϑ^{i} - θ^{i}) u_{i}, \end{matrix}

we get expression (30). ☐
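The Bregman form (30) can be compared numerically with the integral expression (31). The sketch below assumes a one-parameter family on a four-point set with counting measure, u_0 = 1, and the κ-exponential as φ (illustrative choices made here). The gradient of ψ is approximated by a central difference, and all function names are hypothetical.

```python
import math

KAPPA = 0.5  # illustrative deformation parameter (assumption)

def phi(x):
    return math.exp(math.asinh(KAPPA * x) / KAPPA)        # kappa-exponential

def phi_inv(y):
    return (y ** KAPPA - y ** (-KAPPA)) / (2.0 * KAPPA)   # kappa-logarithm

def dphi(x):
    return phi(x) / math.sqrt(1.0 + (KAPPA * x) ** 2)     # phi'

# Assumed setup: T = {0, 1, 2, 3}, counting measure, u_0 = 1, one parameter.
p = [0.1, 0.2, 0.3, 0.4]
c = [phi_inv(v) for v in p]
raw = [0.0, 1.0, 2.0, 3.0]
weights = [dphi(ct) for ct in c]
shift = sum(r * w for r, w in zip(raw, weights)) / sum(weights)
u = [r - shift for r in raw]          # centered so that condition (ii) holds

def psi(theta, lo=-50.0, hi=50.0):
    """Normalizing function, found by bisection on the total mass."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        total = sum(phi(ct + theta * ut - mid) for ct, ut in zip(c, u))
        if total > 1.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def phi_divergence(theta, vartheta):
    """Right-hand side of (31) on the finite set, with u_0 = 1."""
    s, r = psi(theta), psi(vartheta)
    x = [ct + theta * ut - s for ct, ut in zip(c, u)]       # phi^{-1}(p_theta)
    y = [ct + vartheta * ut - r for ct, ut in zip(c, u)]    # phi^{-1}(p_vartheta)
    w = [dphi(xt) for xt in x]
    return sum((xt - yt) * wt for xt, yt, wt in zip(x, y, w)) / sum(w)

def bregman(theta, vartheta, h=1e-4):
    """Right-hand side of (30), with a central-difference gradient of psi."""
    grad = (psi(theta + h) - psi(theta - h)) / (2.0 * h)
    return psi(vartheta) - psi(theta) - grad * (vartheta - theta)

theta, vartheta = 0.5, -0.8
d31 = phi_divergence(theta, vartheta)
d30 = bregman(theta, vartheta)
d_from_origin = phi_divergence(0.0, vartheta)   # first argument is p itself
print(d31, d30, d_from_origin, psi(vartheta))
```

The first two printed values should agree up to the finite-difference error, and the last two should agree as well, since the gradient of ψ vanishes at the origin under condition (ii).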

In Proposition 2, the expression on the right-hand side of Equation (29) defines a divergence on its own, which was investigated by Jun Zhang in [19]. Proposition 3 asserts that the φ-divergence

D_{φ} (p_{θ} ∥ p_{ϑ})

coincides with the Bregman divergence [31,32] associated with the normalizing function ψ for points ϑ and θ in Θ. Because ψ is convex and attains a minimum at

θ = 0

, it follows that

\frac{\partial ψ}{\partial θ^{i}} (θ) = 0

at

θ = 0

. As a result, equality (30) reduces to

D_{φ} (p ∥ p_{θ}) = ψ (θ)

.
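As a sanity check, the classical case can be worked out explicitly. The following sketch assumes φ = exp and u_0 = 1 (a specialization chosen here for illustration), so that φ' = φ, p = e^c, and condition (ii) reads as the vanishing of the integral of u_i p. Under these assumptions ψ is the cumulant generating function, and expression (31) becomes the Kullback–Leibler divergence:

```latex
% Assumptions: \varphi = \exp, u_0 = 1, hence p = e^{c} and \int_T u_i \, p \, d\mu = 0.
\begin{align*}
\psi(\theta) &= \log \int_{T} p \, \exp\Bigl(\sum_{i=1}^{n} \theta^{i} u_{i}\Bigr) d\mu, \\
D_{\varphi}(p_{\theta} \,\|\, p_{\vartheta})
  &= \int_{T} p_{\theta} \log \frac{p_{\theta}}{p_{\vartheta}} \, d\mu
   = \psi(\vartheta) - \psi(\theta) - \nabla\psi(\theta) \cdot (\vartheta - \theta), \\
D_{\varphi}(p \,\|\, p_{\theta})
  &= -\int_{T} p \Bigl(\sum_{i=1}^{n} \theta^{i} u_{i} - \psi(\theta)\Bigr) d\mu
   = \psi(\theta).
\end{align*}
```

The second line uses the fact that, in an exponential family, the gradient of ψ at θ is the mean of the u_i under p_θ; the third line uses the centering of the u_i under p.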

4.3. Geometry Induced by $D_{φ}^{(α)} (\cdot ∥ \cdot)$

In this section, we assume that

φ (\cdot)

is continuously differentiable and strictly convex. The latter assumption guarantees that

D_{φ}^{(α)} (p ∥ q) = 0 if and only if p = q .

(32)

The generalized Rényi divergence induces a metric

g = (g_{i j})

in generalized statistical manifolds

P

. This metric is given by

g_{i j} = - {[{(\frac{\partial}{\partial θ^{i}})}_{p} {(\frac{\partial}{\partial θ^{j}})}_{q} D_{φ}^{(α)} (p ∥ q)]}_{q = p} .

(33)

To show that this expression defines a metric, we have to verify that

g_{i j}

is invariant under change of coordinates, and

(g_{i j})

is positive definite. The first claim follows from the chain rule. The positive definiteness of

(g_{i j})

is a consequence of Proposition 4, which is given below.

Proposition 4.

The metric induced by

D_{φ}^{(α)} (\cdot ∥ \cdot)

coincides with the metric given by (22) or (25).

Proof.

Fix any

α \in (- 1, 1)

. Applying the operator

{(\frac{\partial}{\partial θ^{j}})}_{p_{ϑ}}

to

\int_{T} φ (c_{α}) d μ = 1,

where

c_{α} = \frac{1 - α}{2} φ^{- 1} (p_{θ}) + \frac{1 + α}{2} φ^{- 1} (p_{ϑ}) + κ (α) u_{0}

, we obtain

\int_{T} (\frac{1 + α}{2} \frac{\partial φ^{- 1} (p_{ϑ})}{\partial θ^{j}} + {(\frac{\partial}{\partial θ^{j}})}_{p_{ϑ}} κ (α) u_{0}) φ^{'} (c_{α}) d μ = 0,

which results in

{(\frac{\partial}{\partial θ^{j}})}_{p_{ϑ}} κ (α) = - \frac{1 + α}{2} \frac{\int_{T} \frac{\partial φ^{- 1} (p_{ϑ})}{\partial θ^{j}} φ^{'} (c_{α}) d μ}{\int_{T} u_{0} φ^{'} (c_{α}) d μ} .

By the standard differentiation rules, we can write

\begin{matrix} {(\frac{\partial}{\partial θ^{i}})}_{p_{θ}} {(\frac{\partial}{\partial θ^{j}})}_{p_{ϑ}} κ (α) = - \frac{1 + α}{2} \frac{\int_{T} [\frac{1 - α}{2} \frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{i}} + {(\frac{\partial}{\partial θ^{i}})}_{p_{θ}} κ (α) u_{0}] \frac{\partial φ^{- 1} (p_{ϑ})}{\partial θ^{j}} φ^{″} (c_{α}) d μ}{\int_{T} u_{0} φ^{'} (c_{α}) d μ} \\ + \frac{1 + α}{2} \frac{\int_{T} \frac{\partial φ^{- 1} (p_{ϑ})}{\partial θ^{j}} φ^{'} (c_{α}) d μ}{\int_{T} u_{0} φ^{'} (c_{α}) d μ} \frac{\int_{T} u_{0} [\frac{1 - α}{2} \frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{i}} + {(\frac{\partial}{\partial θ^{i}})}_{p_{θ}} κ (α) u_{0}] φ^{″} (c_{α}) d μ}{\int_{T} u_{0} φ^{'} (c_{α}) d μ} . \end{matrix}

(34)

Noticing that

\int_{T} \frac{\partial φ^{- 1} (p_{ϑ})}{\partial θ^{j}} φ^{'} (c_{α}) d μ = 0

for

p_{ϑ} = p_{θ}

, the second term on the right-hand side of Equation (34) vanishes, and then

{[{(\frac{\partial}{\partial θ^{i}})}_{p_{θ}} {(\frac{\partial}{\partial θ^{j}})}_{p_{ϑ}} κ (α)]}_{p_{ϑ} = p_{θ}} = - \frac{1 - α^{2}}{4} \frac{\int_{T} \frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{i}} \frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{j}} φ^{″} (φ^{- 1} (p_{θ})) d μ}{\int_{T} u_{0} φ^{'} (φ^{- 1} (p_{θ})) d μ} .

If we use the notation introduced in (24), we can write

g_{i j} = - {[{(\frac{\partial}{\partial θ^{i}})}_{p_{θ}} {(\frac{\partial}{\partial θ^{j}})}_{p_{ϑ}} D_{φ}^{(α)} (p_{θ} ∥ p_{ϑ})]}_{p_{ϑ} = p_{θ}} = E_{θ}^{″} [\frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{i}} \frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{j}}] .

It remains to show the case

α = \pm 1

. Comparing (13) and (23), we can write

D_{φ} (p_{θ} ∥ p_{ϑ}) = E_{θ}^{'} [φ^{- 1} (p_{θ}) - φ^{- 1} (p_{ϑ})] .

(35)

We use the equivalent expressions

g_{i j} = {[{(\frac{\partial^{2}}{\partial θ^{i} \partial θ^{j}})}_{p} D_{φ}^{(α)} (p ∥ q)]}_{q = p} = {[{(\frac{\partial^{2}}{\partial θ^{i} \partial θ^{j}})}_{q} D_{φ}^{(α)} (p ∥ q)]}_{q = p},

which follows from condition (32), to infer that

g_{i j} = {[{(\frac{\partial^{2}}{\partial θ^{i} \partial θ^{j}})}_{p_{ϑ}} D_{φ} (p_{θ} ∥ p_{ϑ})]}_{p_{θ} = p_{ϑ}} = - E_{θ}^{'} [\frac{\partial^{2} φ^{- 1} (p_{θ})}{\partial θ^{i} \partial θ^{j}}] .

(36)

Because

D_{φ}^{(- 1)} (p ∥ q) = D_{φ}^{(1)} (q ∥ p) = D_{φ} (p ∥ q)

, we conclude that the metric defined by (22) coincides with the metric induced by

D_{φ}^{(- 1)} (\cdot ∥ \cdot)

and

D_{φ}^{(1)} (\cdot ∥ \cdot)

. ☐
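Proposition 4 says that the induced metric does not depend on α. A minimal numerical illustration in the classical case is sketched below, assuming φ = exp and u_0 = 1 on a four-point set (an illustrative specialization chosen here). Under these assumptions the defining equation for κ(α) gives κ(α) as minus the logarithm of the sum of p^{(1−α)/2} q^{(1+α)/2}, and the metric should reduce to the variance of the statistic u. Function names are hypothetical.

```python
import math

# Assumed toy model: exponential family on T = {0, 1, 2, 3} (phi = exp, u_0 = 1).
p = [0.1, 0.2, 0.3, 0.4]        # reference distribution
u = [-2.0, -1.0, 0.0, 1.0]      # statistic centered under p: sum(p * u) = 0

def density(theta):
    """p_theta proportional to p * exp(theta * u)."""
    w = [pt * math.exp(theta * ut) for pt, ut in zip(p, u)]
    z = sum(w)
    return [wt / z for wt in w]

def renyi(alpha, theta, vartheta):
    """D^(alpha)(p_theta || p_vartheta) = 4 kappa(alpha) / (1 - alpha^2) for phi = exp."""
    a, b = (1.0 - alpha) / 2.0, (1.0 + alpha) / 2.0
    overlap = sum(pt ** a * qt ** b
                  for pt, qt in zip(density(theta), density(vartheta)))
    return -4.0 / (1.0 - alpha ** 2) * math.log(overlap)

def metric_fd(alpha, theta, h=1e-3):
    """Formula (33): minus the mixed derivative of D^(alpha) on the diagonal."""
    def d(s, t):
        return renyi(alpha, theta + s, theta + t)
    mixed = (d(h, h) - d(h, -h) - d(-h, h) + d(-h, -h)) / (4.0 * h * h)
    return -mixed

def fisher(theta):
    """Variance of u under p_theta, i.e. the second derivative of psi."""
    q = density(theta)
    mean = sum(qt * ut for qt, ut in zip(q, u))
    return sum(qt * (ut - mean) ** 2 for qt, ut in zip(q, u))

theta0 = 0.2
alphas = [-0.5, 0.0, 0.7]
estimates = [metric_fd(al, theta0) for al in alphas]
exact = fisher(theta0)
print(estimates, exact)
```

All three finite-difference estimates should match the variance up to discretization error, regardless of the value of α.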

In generalized statistical manifolds, the generalized Rényi divergence

D_{φ}^{(α)} (\cdot ∥ \cdot)

induces a connection

D^{(α)}

, whose Christoffel symbols

Γ_{i j k}^{(α)}

are given by

Γ_{i j k}^{(α)} = - {[{(\frac{\partial^{2}}{\partial θ^{i} \partial θ^{j}})}_{p} {(\frac{\partial}{\partial θ^{k}})}_{q} D_{φ}^{(α)} (p ∥ q)]}_{q = p} .

Because

D_{φ}^{(α)} (p ∥ q) = D_{φ}^{(- α)} (q ∥ p)

, it follows that

D^{(α)}

and

D^{(- α)}

are mutually dual for any

α \in [- 1, 1]

. In other words,

Γ_{i j k}^{(α)}

and

Γ_{i j k}^{(- α)}

satisfy the relation

\frac{\partial g_{j k}}{\partial θ^{i}} = Γ_{i j k}^{(α)} + Γ_{i k j}^{(- α)}

. A development involving expression (35) results in

\begin{matrix} Γ_{i j k}^{(1)} & = E_{θ}^{″} [\frac{\partial^{2} φ^{- 1} (p_{θ})}{\partial θ^{i} \partial θ^{j}} \frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{k}}] - E_{θ}^{'} [\frac{\partial^{2} φ^{- 1} (p_{θ})}{\partial θ^{i} \partial θ^{j}}] E_{θ}^{″} [u_{0} \frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{k}}], \end{matrix}

(37)

and

\begin{matrix} Γ_{i j k}^{(- 1)} & = E_{θ}^{″} [\frac{\partial^{2} φ^{- 1} (p_{θ})}{\partial θ^{i} \partial θ^{j}} \frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{k}}] + E_{θ}^{‴} [\frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{i}} \frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{j}} \frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{k}}] \\ - E_{θ}^{″} [\frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{j}} \frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{k}}] E_{θ}^{″} [u_{0} \frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{i}}] \\ - E_{θ}^{″} [\frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{i}} \frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{k}}] E_{θ}^{″} [u_{0} \frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{j}}] . \end{matrix}

(38)
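As a consistency check on the two endpoint connections, one can evaluate their Christoffel symbols in the coordinates θ of a φ-family directly from Proposition 3. The derivation below is a sketch under that assumption; it uses only (30) and the relation between the divergences with parameters −1 and 1 and the φ-divergence stated in the proof of Proposition 4.

```latex
% Sketch in a \varphi-family with coordinates \theta; \psi is the normalizing function.
\begin{align*}
D_{\varphi}^{(-1)}(p_{\theta} \,\|\, p_{\vartheta})
  &= \psi(\vartheta) - \psi(\theta) - \nabla\psi(\theta) \cdot (\vartheta - \theta), \\
\Gamma_{ijk}^{(-1)}
  &= -\Bigl[\frac{\partial^{2}}{\partial\theta^{i}\,\partial\theta^{j}}
      \Bigl(\frac{\partial\psi}{\partial\vartheta^{k}}(\vartheta)
           - \frac{\partial\psi}{\partial\theta^{k}}(\theta)\Bigr)\Bigr]_{\vartheta = \theta}
   = \frac{\partial^{3}\psi}{\partial\theta^{i}\,\partial\theta^{j}\,\partial\theta^{k}}(\theta), \\
D_{\varphi}^{(1)}(p_{\theta} \,\|\, p_{\vartheta})
  &= \psi(\theta) - \psi(\vartheta) - \nabla\psi(\vartheta) \cdot (\theta - \vartheta), \\
\Gamma_{ijk}^{(1)}
  &= \Bigl[\frac{\partial^{2}}{\partial\theta^{i}\,\partial\theta^{j}}
      \sum_{l=1}^{n} \frac{\partial^{2}\psi}{\partial\vartheta^{l}\,\partial\vartheta^{k}}(\vartheta)\,
      (\theta^{l} - \vartheta^{l})\Bigr]_{\vartheta = \theta}
   = 0 .
\end{align*}
```

The last expression vanishes because the term being differentiated is linear in θ. The sum of the two symbols equals the derivative of the Hessian of ψ, which agrees with the duality relation stated above.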

For

α \in (- 1, 1)

, the Christoffel symbols

Γ_{i j k}^{(α)}

can be written as a convex combination of

Γ_{i j k}^{(- 1)}

and

Γ_{i j k}^{(1)}

, as asserted in the next result.

Proposition 5.

The Christoffel symbols

Γ_{i j k}^{(α)}

induced by the divergence

D_{φ}^{(α)} (\cdot ∥ \cdot)

satisfy the relation

Γ_{i j k}^{(α)} = \frac{1 - α}{2} Γ_{i j k}^{(- 1)} + \frac{1 + α}{2} Γ_{i j k}^{(1)}, for α \in [- 1, 1] .

(39)

Proof.

For

α = \pm 1

, equality (39) follows trivially. Thus, we assume

α \in (- 1, 1)

. By (34), we can write

\begin{matrix} {(\frac{\partial}{\partial θ^{i}})}_{p_{θ}} {(\frac{\partial}{\partial θ^{k}})}_{p_{ϑ}} κ (α) & = - \frac{1 + α}{2} \frac{\int_{T} [\frac{1 - α}{2} \frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{i}} + {(\frac{\partial}{\partial θ^{i}})}_{p_{θ}} κ (α) u_{0}] \frac{\partial φ^{- 1} (p_{ϑ})}{\partial θ^{k}} φ^{″} (c_{α}) d μ}{\int_{T} u_{0} φ^{'} (c_{α}) d μ} \\ + \frac{1 + α}{2} \frac{\int_{T} \frac{\partial φ^{- 1} (p_{ϑ})}{\partial θ^{k}} φ^{'} (c_{α}) d μ}{\int_{T} u_{0} φ^{'} (c_{α}) d μ} \frac{\int_{T} u_{0} [\frac{1 - α}{2} \frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{i}} + {(\frac{\partial}{\partial θ^{i}})}_{p_{θ}} κ (α) u_{0}] φ^{″} (c_{α}) d μ}{\int_{T} u_{0} φ^{'} (c_{α}) d μ} . \end{matrix}

(40)

Applying

{(\frac{\partial}{\partial θ^{j}})}_{p_{θ}}

to the first term on the right-hand side of (40), and then setting

p_{ϑ} = p_{θ}

, we obtain

\begin{matrix} - \frac{1 - α^{2}}{4} E_{θ}^{″} [\frac{\partial^{2} φ^{- 1} (p_{θ})}{\partial θ^{i} \partial θ^{j}} \frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{k}}] - \frac{1 + α}{2} {(\frac{\partial^{2}}{\partial θ^{i} \partial θ^{j}})}_{p_{θ}} κ (α) E_{θ}^{″} [u_{0} \frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{k}}] \\ - \frac{1 - α^{2}}{4} \frac{1 - α}{2} E_{θ}^{‴} [\frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{i}} \frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{j}} \frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{k}}] \\ + \frac{1 - α^{2}}{4} \frac{1 - α}{2} E_{θ}^{″} [\frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{i}} \frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{k}}] E_{θ}^{″} [u_{0} \frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{j}}] . \end{matrix}

(41)

Similarly, if we apply

{(\frac{\partial}{\partial θ^{j}})}_{p_{θ}}

to the second term on the right-hand side of (40), and set

p_{ϑ} = p_{θ}

, we get

\frac{1 - α^{2}}{4} \frac{1 - α}{2} E_{θ}^{″} [\frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{j}} \frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{k}}] E_{θ}^{″} [u_{0} \frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{i}}] .

(42)

Collecting (41) and (42), we can write

\begin{matrix} Γ_{i j k}^{(α)} & = - \frac{4}{1 - α^{2}} {[{(\frac{\partial^{2}}{\partial θ^{i} \partial θ^{j}})}_{p_{θ}} {(\frac{\partial}{\partial θ^{k}})}_{p_{ϑ}} κ (α)]}_{p_{θ} = p_{ϑ}} \\ & = E_{θ}^{″} [\frac{\partial^{2} φ^{- 1} (p_{θ})}{\partial θ^{i} \partial θ^{j}} \frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{k}}] + \frac{1 - α}{2} E_{θ}^{‴} [\frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{i}} \frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{j}} \frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{k}}] \\ & - \frac{1 - α}{2} E_{θ}^{″} [\frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{j}} \frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{k}}] E_{θ}^{″} [u_{0} \frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{i}}] \\ & - \frac{1 - α}{2} E_{θ}^{″} [\frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{i}} \frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{k}}] E_{θ}^{″} [u_{0} \frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{j}}] \\ & - \frac{1 + α}{2} E_{θ}^{'} [\frac{\partial^{2} φ^{- 1} (p_{θ})}{\partial θ^{i} \partial θ^{j}}] E_{θ}^{″} [u_{0} \frac{\partial φ^{- 1} (p_{θ})}{\partial θ^{k}}], \end{matrix}

(43)

where we used

\begin{matrix} {(\frac{\partial^{2}}{\partial θ^{i} \partial θ^{j}})}_{p_{θ}} κ (α) & = \frac{1 - α^{2}}{4} {[{(\frac{\partial^{2}}{\partial θ^{i} \partial θ^{j}})}_{p_{θ}} D_{φ}^{(α)} (p_{θ} ∥ p_{ϑ})]}_{p_{ϑ} = p_{θ}} \\ & = \frac{1 - α^{2}}{4} g_{i j} = - \frac{1 - α^{2}}{4} E_{θ}^{'} [\frac{\partial^{2} φ^{- 1} (p_{θ})}{\partial θ^{i} \partial θ^{j}}] . \end{matrix}

Expression (39) follows from (37), (38) and (43). ☐
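Relation (39) can also be probed numerically. The sketch below works in the classical case, assuming φ = exp and u_0 = 1 on a four-point set (an illustrative specialization chosen here), where the φ-divergence is the Kullback–Leibler divergence. It estimates the three Christoffel symbols by finite differences of the corresponding divergences and compares them. Function names and numerical values are hypothetical.

```python
import math

# Assumed toy model: exponential family on T = {0, 1, 2, 3} (phi = exp, u_0 = 1).
p = [0.1, 0.2, 0.3, 0.4]        # reference distribution
u = [-2.0, -1.0, 0.0, 1.0]      # statistic centered under p

def density(theta):
    """p_theta proportional to p * exp(theta * u)."""
    w = [pt * math.exp(theta * ut) for pt, ut in zip(p, u)]
    z = sum(w)
    return [wt / z for wt in w]

def renyi(alpha, theta, vartheta):
    """D^(alpha)(p_theta || p_vartheta) for phi = exp and alpha in (-1, 1)."""
    a, b = (1.0 - alpha) / 2.0, (1.0 + alpha) / 2.0
    overlap = sum(pt ** a * qt ** b
                  for pt, qt in zip(density(theta), density(vartheta)))
    return -4.0 / (1.0 - alpha ** 2) * math.log(overlap)

def kl(theta, vartheta):
    """Kullback-Leibler divergence of p_theta from p_vartheta."""
    return sum(a * math.log(a / b)
               for a, b in zip(density(theta), density(vartheta)))

def christoffel_fd(div, theta, h=5e-3):
    """Minus (d^2/dtheta^2)_p (d/dtheta)_q of div(p, q) at q = p, by central differences."""
    def second_in_p(t):
        return (div(theta + h, t) - 2.0 * div(theta, t) + div(theta - h, t)) / (h * h)
    return -(second_in_p(theta + h) - second_in_p(theta - h)) / (2.0 * h)

alpha, theta0 = 0.3, 0.2
gamma_alpha = christoffel_fd(lambda s, t: renyi(alpha, s, t), theta0)
gamma_minus = christoffel_fd(kl, theta0)                     # D^(-1)(p || q) = D_phi(p || q)
gamma_plus = christoffel_fd(lambda s, t: kl(t, s), theta0)   # D^(1)(p || q) = D_phi(q || p)
combo = (1.0 - alpha) / 2.0 * gamma_minus + (1.0 + alpha) / 2.0 * gamma_plus
print(gamma_alpha, combo, gamma_minus, gamma_plus)
```

Up to discretization error, the first two printed values should coincide, which is what (39) asserts in this one-parameter example. The estimate for the symbol with parameter 1 should be close to zero in these coordinates.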

5. Conclusions

In [17,18], the authors introduced a pair of dual connections

D^{(- 1)}

and

D^{(1)}

induced by φ-divergence. The main motivation of the present work was to find a (non-trivial) family of α-divergences, whose induced α-connections are convex combinations of

D^{(- 1)}

and

D^{(1)}

. As a result of our efforts, we proposed a generalization of Rényi divergence. The connection

D^{(α)}

induced by the generalization of Rényi divergence satisfies the relation

D^{(α)} = \frac{1 - α}{2} D^{(- 1)} + \frac{1 + α}{2} D^{(1)}

. To generalize Rényi divergence, we made use of properties of φ-functions. This makes evident the importance of φ-functions in the geometry of non-standard models. In standard statistical manifolds, even though Amari’s α-divergence and Rényi divergence (with

α \in [- 1, 1]

) do not coincide, they induce the same family of α-connections. This striking result requires further investigation. Future work should focus on how the generalization of Rényi divergence is related to Zhang’s

(ρ, τ)

-divergence, and also how the present proposal is related to the model presented in [33].

Acknowledgments

The authors are indebted to the anonymous reviewers for their valuable comments and corrections, which led to a great improvement of this paper. Charles C. Cavalcante also thanks the CNPq (Proc. 309055/2014-8) for partial funding.

Author Contributions

All authors contributed equally to the design of the research. The research was carried out by all authors. Rui F. Vigelis and Charles C. Cavalcante gave the central idea of the paper and managed the organization of it. Rui F. Vigelis wrote the paper. All the authors read and approved the final manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Rao, C.R. Information and the accuracy attainable in the estimation of statistical parameters. Bull. Calcutta Math. Soc. 1945, 37, 81–91.
2. Amari, S.-I. Differential geometry of curved exponential families—Curvatures and information loss. Ann. Stat. 1982, 10, 357–385.
3. Amari, S.-I. Differential-Geometrical Methods in Statistics; Springer: Berlin/Heidelberg, Germany, 1985; Volume 28.
4. Amari, S.-I.; Nagaoka, H. Methods of Information Geometry (Translations of Mathematical Monographs); American Mathematical Society: Providence, RI, USA, 2000; Volume 191.
5. Amari, S.-I. Information Geometry and Its Applications; Applied Mathematical Sciences Series; Springer: Berlin/Heidelberg, Germany, 2016; Volume 194.
6. Amari, S.-I.; Ohara, A.; Matsuzoe, H. Geometry of deformed exponential families: Invariant, dually-flat and conformal geometries. Physica A 2012, 391, 4308–4319.
7. Matsuzoe, H. Hessian structures on deformed exponential families and their conformal structures. Differ. Geom. Appl. 2014, 35 (Suppl.), 323–333.
8. Naudts, J. Estimators, escort probabilities, and ϕ-exponential families in statistical physics. J. Inequal. Pure Appl. Math. 2004, 5, 102.
9. Pistone, G. κ-exponential models from the geometrical viewpoint. Eur. Phys. J. B 2009, 70, 29–37.
10. Amari, S.-I.; Ohara, A. Geometry of q-exponential family of probability distributions. Entropy 2011, 13, 1170–1185.
11. Vigelis, R.F.; Cavalcante, C.C. The Δ₂-Condition and φ-Families of Probability Distributions. In Geometric Science of Information; Springer: Berlin/Heidelberg, Germany, 2013; Volume 8085, pp. 729–736.
12. Vigelis, R.F.; Cavalcante, C.C. On φ-families of probability distributions. J. Theor. Probab. 2013, 26, 870–884.
13. Cena, A.; Pistone, G. Exponential statistical manifold. Ann. Inst. Stat. Math. 2007, 59, 27–56.
14. Grasselli, M.R. Dual connections in nonparametric classical information geometry. Ann. Inst. Stat. Math. 2010, 62, 873–896.
15. Pistone, G.; Sempi, C. An infinite-dimensional geometric structure on the space of all the probability measures equivalent to a given one. Ann. Stat. 1995, 23, 1543–1561.
16. Santacroce, M.; Siri, P.; Trivellato, B. New results on mixture and exponential models by Orlicz spaces. Bernoulli 2016, 22, 1431–1447.
17. Vigelis, R.F.; Cavalcante, C.C. Information Geometry: An Introduction to New Models for Signal Processing. In Signals and Images; CRC Press: Boca Raton, FL, USA, 2015; pp. 455–491.
18. Vigelis, R.F.; de Souza, D.C.; Cavalcante, C.C. New Metric and Connections in Statistical Manifolds. In Geometric Science of Information; Springer: Berlin/Heidelberg, Germany, 2015; Volume 9389, pp. 222–229.
19. Zhang, J. Divergence function, duality, and convex analysis. Neural Comput. 2004, 16, 159–195.
20. Zhang, J. Referential Duality and Representational Duality on Statistical Manifolds. In Proceedings of the 2nd International Symposium on Information Geometry and Its Applications, Pescara, Italy, 12–16 December 2005; pp. 58–67.
21. Zhang, J. Nonparametric information geometry: From divergence function to referential-representational biduality on statistical manifolds. Entropy 2013, 15, 5384–5418.
22. Zhang, J. Divergence Functions and Geometric Structures They Induce on a Manifold. In Geometric Theory of Information; Springer: Berlin/Heidelberg, Germany, 2014; pp. 1–30.
23. Zhang, J. On monotone embedding in information geometry. Entropy 2015, 17, 4485–4489.
24. Rényi, A. On measures of entropy and information. In Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability; University of California Press: Berkeley, CA, USA, 1961; Volume I, pp. 547–561.
25. Van Erven, T.; Harremoës, P. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inform. Theory 2014, 60, 3797–3820.
26. Eguchi, S.; Komori, O. Path Connectedness on a Space of Probability Density Functions. In Geometric Science of Information; Springer: Berlin/Heidelberg, Germany, 2015; Volume 9389, pp. 615–624.
27. Kaniadakis, G.; Lissia, M.; Scarfone, A.M. Deformed logarithms and entropies. Physica A 2004, 340, 41–49.
28. Kaniadakis, G. Theoretical foundations and mathematical formalism of the power-law tailed statistical distributions. Entropy 2013, 15, 3983–4010.
29. Musielak, J. Orlicz Spaces and Modular Spaces; Springer: Berlin/Heidelberg, Germany, 1983; Volume 1034.
30. Shima, H. The Geometry of Hessian Structures; World Scientific: Singapore, 2007.
31. Banerjee, A.; Merugu, S.; Dhillon, I.S.; Ghosh, J. Clustering with Bregman Divergences. J. Mach. Learn. Res. 2005, 6, 1705–1749.
32. Bregman, L.M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 1967, 7, 200–217.
33. Zhang, J.; Hästö, P. Statistical manifold as an affine space: A functional equation approach. J. Math. Psychol. 2006, 50, 60–65.

© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC-BY) license (http://creativecommons.org/licenses/by/4.0/).
