Article

Statistical Divergences between Densities of Truncated Exponential Families with Nested Supports: Duo Bregman and Duo Jensen Divergences

by
Frank Nielsen
Sony Computer Science Laboratories, Tokyo 141-0022, Japan
Entropy 2022, 24(3), 421; https://doi.org/10.3390/e24030421
Submission received: 2 March 2022 / Revised: 14 March 2022 / Accepted: 16 March 2022 / Published: 17 March 2022

Abstract:
By calculating the Kullback–Leibler divergence between two probability measures belonging to different exponential families dominated by the same measure, we obtain a formula that generalizes the ordinary Fenchel–Young divergence. Inspired by this formula, we define the duo Fenchel–Young divergence and report a majorization condition on its pair of strictly convex generators, which guarantees that this divergence is always non-negative. The duo Fenchel–Young divergence is also equivalent to a duo Bregman divergence. We show how to use these duo divergences by calculating the Kullback–Leibler divergence between densities of truncated exponential families with nested supports, and report a formula for the Kullback–Leibler divergence between truncated normal distributions. Finally, we prove that the skewed Bhattacharyya distances between truncated exponential families amount to equivalent skewed duo Jensen divergences.


1. Introduction

1.1. Exponential Families

Let $( X , Σ )$ be a measurable space, and consider a regular minimal exponential family [1] $E$ of probability measures $P θ$ all dominated by a base measure $μ$ ($P θ ≪ μ$):
$E = { P θ : θ ∈ Θ } .$
The Radon–Nikodym derivatives or densities of the probability measures $P θ$ with respect to $μ$ can be written canonically as
$p θ ( x ) = d P θ d μ ( x ) = exp θ ⊤ t ( x ) − F ( θ ) + k ( x ) ,$
where $θ$ denotes the natural parameter, $t ( x )$ the sufficient statistic [1,2,3,4], and $F ( θ )$ the log-normalizer [1] (or cumulant function). The optional auxiliary term $k ( x )$ allows us to change the base measure $μ$ into the measure $ν$ such that $d ν d μ ( x ) = e k ( x )$. The order D of the family is the dimension of the natural parameter space $Θ$:
$Θ = θ ∈ R D : ∫ X exp θ ⊤ t ( x ) + k ( x ) d μ ( x ) < ∞ ,$
where $R$ denotes the set of reals. The sufficient statistic $t ( x ) = ( t 1 ( x ) , … , t D ( x ) )$ is a vector of D functions. The sufficient statistic $t ( x )$ is said to be minimal when the $D + 1$ functions 1, $t 1 ( x )$, …, $t D ( x )$ are linearly independent [1]. The sufficient statistics $t ( x )$ are such that the probability $Pr [ X | θ ] = Pr [ X | t ( X ) ]$. That is, all information necessary for the statistical inference of parameter $θ$ is contained in $t ( X )$. Exponential families are characterized as families of parametric distributions with finite-dimensional sufficient statistics [1]. Exponential families ${ p λ }$ include among others the exponential, normal, gamma/beta, inverse gamma, inverse Gaussian, and Wishart distributions once a reparameterization $θ = θ ( λ )$ of the parametric distributions ${ p λ }$ is performed to reveal their natural parameters [1].
When the sufficient statistic $t ( x )$ is x, these exponential families [1] are called natural exponential families or tilted exponential families [5] in the literature. Indeed, the distributions $P θ$ of the exponential family $E$ can be interpreted as distributions obtained by tilting the base measure $μ$ [6]. In this paper, we consider either discrete exponential families like the family of Poisson distributions (univariate distributions of order $D = 1$ with respect to the counting measure) or continuous exponential families like the family of normal distributions (univariate distributions of order $D = 2$ with respect to the Lebesgue measure). The Radon–Nikodym derivative of a discrete exponential family is a probability mass function (pmf), and the Radon–Nikodym derivative of a continuous exponential family is a probability density function (pdf). The support of a pmf $p ( x )$ is $supp ( p ) = { x ∈ Z : p ( x ) > 0 }$ (where $Z$ denotes the set of integers) and the support of a d-variate pdf $p ( x )$ is $supp ( p ) = { x ∈ R d : p ( x ) > 0 }$. The Poisson distributions have support $N ∪ { 0 }$ where $N$ denotes the set of natural numbers ${ 1 , 2 , … }$. Densities of an exponential family all have coinciding support [1].
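As a concrete illustration of the canonical decomposition of Equation (2), the following sketch (not part of the paper) writes the Poisson pmf in exponential-family form with natural parameter $θ = \log λ$, sufficient statistic $t(x) = x$, log-normalizer $F(θ) = e^θ$, and auxiliary term $k(x) = -\log x!$, and checks it against the standard pmf:

```python
import math

# Sketch: the Poisson family in canonical exponential-family form
# p_theta(x) = exp(theta * t(x) - F(theta) + k(x)) with theta = log(lambda),
# t(x) = x, F(theta) = exp(theta), and carrier term k(x) = -log(x!).

def poisson_pmf(lam, x):
    # Standard Poisson pmf: e^{-lambda} lambda^x / x!
    return math.exp(-lam) * lam ** x / math.factorial(x)

def poisson_canonical(theta, x):
    F = math.exp(theta)                # log-normalizer F(theta)
    k = -math.log(math.factorial(x))  # k(x) tilts the counting base measure
    return math.exp(theta * x - F + k)

lam = 3.5
for x in range(12):
    assert abs(poisson_pmf(lam, x) - poisson_canonical(math.log(lam), x)) < 1e-12
```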

1.2. Truncated Exponential Families with Nested Supports

In this paper, we shall consider truncated exponential families [7] with nested supports. A truncated exponential family is a set of parametric probability distributions obtained by truncation of the support of an exponential family. Truncated exponential families are exponential families but their statistical inference is more subtle [8,9]. Let $E Trunc = { q θ }$ be a truncated exponential family of $E = { p θ }$ with nested supports $supp ( q θ ) ⊂ supp ( p θ )$. The canonical decompositions of densities $p θ$ and $q θ$ have the following expressions:
$p θ ( x ) = exp θ ⊤ t ( x ) + k ( x ) − F ( θ ) ,$
$q θ ( x ) = p θ ( x ) Z X Trunc ( θ ) = exp θ ⊤ t ( x ) + k ( x ) − F Trunc ( θ ) ,$
where the log-normalizer of the truncated exponential family is:
$F Trunc ( θ ) = F ( θ ) + log Z X Trunc ( θ ) ,$
where $Z X Trunc ( θ )$ is a normalizing term that takes into account the truncated support $X Trunc$. These equations show that densities of truncated exponential families only differ by their log-normalizer functions. Let $X Trunc = supp ( q θ )$ denote the support of the distributions of $E Trunc$ and $X = supp ( p θ )$ the support of $E$. Family $E Trunc$ is a truncated exponential family of $E$ that can be notationally written as $E X Trunc$. Family $E$ can also be interpreted as the (un)truncated exponential family $E X$ with densities $p θ X = p θ$. A truncated exponential family $E X Trunc$ of $E$ is said to have nested support when $X Trunc ⊂ X$. For example, the family of half-normal distributions defined on the support $X Trunc = [ 0 , ∞ )$ is a nested truncated exponential family of the family of normal distributions defined on the support $X = ( − ∞ , ∞ )$.

1.3. Kullback–Leibler Divergence between Exponential Family Distributions

For two $σ$-finite probability measures P and Q on $( X , Σ )$ such that P is dominated by Q ($P ≪ Q$), the Kullback–Leibler divergence between P and Q is defined by
$D KL [ P : Q ] = ∫ X log d P d Q d P = E P log d P d Q ,$
where $E P [ X ]$ denotes the expectation of a random variable $X ∼ P$ [10]. When $P$ is not dominated by $Q$, we set $D KL [ P : Q ] = + ∞$. Gibbs’ inequality [11] $D KL [ P : Q ] ≥ 0$ shows that the Kullback–Leibler divergence (KLD for short) is always non-negative. The proof of Gibbs’ inequality relies on Jensen’s inequality and holds for the wide class of f-divergences [12] induced by convex generators $f ( u )$:
$I f [ P : Q ] = ∫ X f d Q d P d P ≥ f ∫ X d Q d P d P = f ( 1 ) .$
The KLD is an f-divergence obtained for the convex generator $f ( u ) = − log u$.
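The identity between the KLD and the f-divergence with generator $f(u) = -\log u$, together with Gibbs’ inequality, can be sanity-checked numerically; the following sketch (an illustration on a small discrete example, not from the paper) compares the two formulas on a 3-point distribution:

```python
import math

# Sanity check: on a discrete example, D_KL[P:Q] equals the f-divergence
# I_f[P:Q] = sum_i p_i * f(q_i / p_i) with generator f(u) = -log(u),
# and Gibbs' inequality D_KL >= 0 holds.
p = [0.2, 0.5, 0.3]
q = [0.4, 0.4, 0.2]

kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
fdiv = sum(pi * -math.log(qi / pi) for pi, qi in zip(p, q))  # I_f with f(u) = -log u

assert abs(kl - fdiv) < 1e-12  # same divergence
assert kl >= 0.0               # Gibbs' inequality
```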

1.4. Kullback–Leibler Divergence between Exponential Family Densities

It is well-known that the KLD between two distributions $P θ 1$ and $P θ 2$ of $E$ amounts to computing an equivalent Fenchel–Young divergence [13]:
$D KL [ P θ 1 : P θ 2 ] = ∫ X p θ 1 ( x ) log p θ 1 ( x ) p θ 2 ( x ) d μ ( x ) = Y F , F * ( θ 2 , η 1 ) ,$
where $η = ∇ F ( θ ) = E P θ [ t ( x ) ]$ is the moment parameter [1] and
$∇ F ( θ ) = ∂ ∂ θ 1 F ( θ ) , … , ∂ ∂ θ D F ( θ ) ⊤ ,$
is the gradient of F with respect to $θ = [ θ 1 , … , θ D ] ⊤$. The Fenchel–Young divergence is defined for a pair of strictly convex conjugate functions [14] $F ( θ )$ and $F * ( η )$ related by the Legendre–Fenchel transform by
$Y F , F * ( θ 1 , η 2 ) : = F ( θ 1 ) + F * ( η 2 ) − θ 1 ⊤ η 2 .$
Amari (1985) first introduced this formula as the canonical divergence of dually flat spaces in information geometry [15] (Equation 3.21), and proved that the Fenchel–Young divergence is obtained as the KLD between densities belonging to the same exponential family [15] (Theorem 3.7). Azoury and Warmuth expressed the KLD $D KL [ P θ 1 : P θ 2 ]$ using dual Bregman divergences in [13] (2001):
$D KL [ P θ 1 : P θ 2 ] = B F ( θ 2 : θ 1 ) = B F * ( η 1 : η 2 ) ,$
where a Bregman divergence [16] $B F ( θ 1 : θ 2 )$ is defined for a strictly convex and differentiable generator $F ( θ )$ by:
$B F ( θ 1 : θ 2 ) : = F ( θ 1 ) − F ( θ 2 ) − ( θ 1 − θ 2 ) ⊤ ∇ F ( θ 2 ) .$
Acharyya termed the divergence $Y F , F *$ the Fenchel–Young divergence in his PhD thesis [17] (2013), and Blondel et al. called such divergences Fenchel–Young losses (2020) in the context of machine learning [18] (Equation (9) in Definition 2). The author also called this divergence the Legendre–Fenchel divergence in [19]. The Fenchel–Young divergence stems from the Fenchel–Young inequality [14,20]:
$F ( θ 1 ) + F * ( η 2 ) ≥ θ 1 ⊤ η 2 ,$
with equality if and only if $η 2 = ∇ F ( θ 1 )$.
Figure 1 visualizes the 1D Fenchel–Young divergence and gives a geometric proof that $Y F , F * ( θ 1 , η 2 ) ≥ 0$ with equality if and only if $η 2 = F ′ ( θ 1 )$. Indeed, by considering the behavior of the Legendre–Fenchel transformation under translations:
• if $F t ( θ ) = F ( θ + t )$ then $F t * ( η ) = F * ( η ) − η ⊤ t$ for all $t ∈ R$, and
• if $F λ ( θ ) = F ( θ ) + λ$ then $F λ * ( η ) = F * ( η ) − λ$ for all $λ ∈ R$,
we may assume without loss of generality that $F ( 0 ) = 0$. The function $F ′ ( θ )$ is strictly increasing and continuous since $F ( θ )$ is a strictly convex and differentiable function. Thus we have $F ( θ ) = ∫ 0 θ F ′ ( θ ) d θ$ and $F * ( η ) = ∫ 0 η F * ′ ( η ) d η = ∫ 0 η F ′ − 1 ( η ) d η$.
Figure 1. Visualizing the Fenchel–Young divergence.
The Bregman divergence $B F ( θ 1 : θ 2 )$ amounts to a dual Bregman divergence [13] between the dual parameters with swapped order: $B F ( θ 1 : θ 2 ) = B F * ( η 2 : η 1 )$ where $η i = ∇ F ( θ i )$ for $i ∈ { 1 , 2 }$. Thus the KLD between two distributions $P θ 1$ and $P θ 2$ of $E$ can be expressed equivalently as follows:
$D KL [ P θ 1 : P θ 2 ] = Y F , F * ( θ 2 : η 1 ) = B F ( θ 2 : θ 1 ) = B F * ( η 1 : η 2 ) = Y F * , F ( η 1 : η 2 ) .$
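The chain of equalities above can be verified numerically on a concrete exponential family. The following sketch (an illustration, assuming the family of exponential distributions $p_λ(x) = λ e^{-λ x}$ with natural parameter $θ = λ$ and log-normalizer $F(θ) = -\log θ$) approximates the KL integral by quadrature and compares it with $B_F(θ_2 : θ_1)$:

```python
import math

# Check that D_KL[p_{l1} : p_{l2}] = B_F(theta_2 : theta_1) for exponential
# distributions, where theta = lambda, F(theta) = -log(theta), F'(theta) = -1/theta.

def bregman(F, gradF, t1, t2):
    return F(t1) - F(t2) - (t1 - t2) * gradF(t2)

F = lambda t: -math.log(t)   # log-normalizer of the exponential family
gradF = lambda t: -1.0 / t   # eta = F'(theta)

l1, l2 = 2.0, 0.7
# Left-Riemann approximation of the KL integral on (0, 50]
dx, kl = 1e-4, 0.0
for i in range(1, 500001):
    x = i * dx
    p = l1 * math.exp(-l1 * x)
    q = l2 * math.exp(-l2 * x)
    kl += p * math.log(p / q) * dx

assert abs(kl - bregman(F, gradF, l2, l1)) < 1e-3
```

Note the swapped argument order: the KLD on distributions corresponds to the Bregman divergence $B_F(θ_2 : θ_1)$ on natural parameters.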
The symmetrized Kullback–Leibler divergence $D J [ P θ 1 : P θ 2 ]$ between two distributions $P θ 1$ and $P θ 2$ of $E$ is called Jeffreys’ divergence [21] and amounts to a symmetrized Bregman divergence [22]:
$D J [ P θ 1 : P θ 2 ] = D KL [ P θ 1 : P θ 2 ] + D KL [ P θ 2 : P θ 1 ] ,$
$= B F ( θ 2 : θ 1 ) + B F ( θ 1 : θ 2 ) ,$
$= ( θ 2 − θ 1 ) ⊤ ( η 2 − η 1 ) : = S F ( θ 1 , θ 2 ) .$
Note that the Bregman divergence $B F ( θ 1 : θ 2 )$ can also be interpreted as a surface area:
$B F ( θ 1 : θ 2 ) = ∫ θ 2 θ 1 ( F ′ ( θ ) − F ′ ( θ 2 ) ) d θ .$
Figure 2 illustrates the sided and symmetrized Bregman divergences.
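The surface-area interpretation of Equation above can be checked by quadrature; this sketch (an illustration with the convex generator $F(θ) = -\log θ$, not a figure from the paper) compares the integral $\int_{θ_2}^{θ_1} (F'(θ) - F'(θ_2))\, dθ$ with the Bregman divergence:

```python
import math

# Check the surface-area identity
# B_F(theta_1 : theta_2) = int_{theta_2}^{theta_1} (F'(theta) - F'(theta_2)) dtheta
# for the strictly convex generator F(theta) = -log(theta).
F = lambda t: -math.log(t)
Fp = lambda t: -1.0 / t

t1, t2 = 2.5, 0.8
breg = F(t1) - F(t2) - (t1 - t2) * Fp(t2)

# Midpoint-rule quadrature of the area between F' and the constant F'(theta_2)
dt, area = 1e-5, 0.0
n = int(round((t1 - t2) / dt))
for i in range(n):
    t = t2 + (i + 0.5) * dt
    area += (Fp(t) - Fp(t2)) * dt

assert abs(area - breg) < 1e-6
```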

1.5. Contributions and Paper Outline

We recall in Section 2 the formula obtained for the Kullback–Leibler divergence between densities of two different exponential families [23] (Equation (29)). Inspired by this formula, we give a definition of the duo Fenchel–Young divergence induced by a pair of strictly convex functions $F 1$ and $F 2$ (Definition 1) in Section 3, and prove that the divergence is always non-negative provided that $F 1$ upper bounds $F 2$. We then define the duo Bregman divergence (Definition 2) corresponding to the duo Fenchel–Young divergence. In Section 4, we show that the Kullback–Leibler divergence between a truncated density and a density of the same parametric exponential family amounts to a duo Fenchel–Young divergence or equivalently to a duo Bregman divergence on swapped parameters (Theorem 1). That is, we consider a truncated exponential family [7] $E 1$ of an exponential family $E 2$ such that the common support of the distributions of $E 1$ is contained in the common support of the distributions of $E 2$ and both canonical decompositions of the families coincide (see Equation (2)). In particular, when $E 2$ is also a truncated exponential family of $E$, then we express the KLD between two truncated distributions as a duo Bregman divergence. As examples, we report the formula for the Kullback–Leibler divergence between two densities of truncated exponential families (Corollary 1), and illustrate the formula for the Kullback–Leibler divergence between truncated exponential distributions (Example 6) and for the Kullback–Leibler divergence between truncated normal distributions (Example 7).
In Section 5, we further consider the skewed Bhattacharyya distance between densities of truncated exponential families and prove that it amounts to a duo Jensen divergence (Theorem 2). Finally, we conclude in Section 6.

2. Kullback–Leibler Divergence between Different Exponential Families

Consider now two exponential families [1] $P$ and $Q$ defined by their Radon–Nikodym derivatives with respect to two positive measures $μ P$ and $μ Q$ on $( X , Σ )$:
$P = P θ : θ ∈ Θ ,$
$Q = Q θ ′ : θ ′ ∈ Θ ′ .$
The corresponding natural parameter spaces are
$Θ = θ ∈ R D : ∫ X exp ( θ ⊤ t P ( x ) + k P ( x ) ) d μ P ( x ) < ∞ ,$
$Θ ′ = θ ′ ∈ R D ′ : ∫ X exp ( θ ′ ⊤ t Q ( x ) + k Q ( x ) ) d μ Q ( x ) < ∞ .$
The order of $P$ is D, $t P ( x )$ denotes the sufficient statistics of $P θ$, and $k P ( x )$ is a term to adjust/tilt the base measure $μ P$. Similarly, the order of $Q$ is $D ′$, $t Q ( x )$ denotes the sufficient statistics of $Q θ ′$, and $k Q ( x )$ is an optional term to adjust the base measure $μ Q$. Let $p θ$ and $q θ ′$ denote the Radon–Nikodym derivatives with respect to the measures $μ P$ and $μ Q$, respectively:
$p θ = d P θ d μ P = exp ( θ ⊤ t P ( x ) − F P ( θ ) + k P ( x ) ) ,$
$q θ ′ = d Q θ ′ d μ Q = exp ( θ ′ ⊤ t Q ( x ) − F Q ( θ ′ ) + k Q ( x ) ) ,$
where $F P ( θ )$ and $F Q ( θ ′ )$ denote the corresponding log-normalizers of $P$ and $Q$, respectively.
$F P ( θ ) = log ∫ exp ( θ ⊤ t P ( x ) + k P ( x ) ) d μ P ( x ) ,$
$F Q ( θ ) = log ∫ exp ( θ ⊤ t Q ( x ) + k Q ( x ) ) d μ Q ( x ) .$
The functions $F P$ and $F Q$ are strictly convex and real analytic [1]. Hence, those functions are infinitely differentiable on their open natural parameter spaces.
Consider the KLD between $P θ ∈ P$ and $Q θ ′ ∈ Q$ such that $μ P = μ Q$ (and hence $P θ ≪ Q θ ′$). Then the KLD between $P θ$ and $Q θ ′$ was first considered in [23]:
$D KL [ P θ : Q θ ′ ] = E P log d P θ d Q θ ′ , = E P θ θ ⊤ t P ( x ) − θ ′ ⊤ t Q ( x ) − F P ( θ ) + F Q ( θ ′ ) + k P ( x ) − k Q ( x ) d μ P d μ Q ︸ = 1 , = F Q ( θ ′ ) − F P ( θ ) + θ ⊤ E P θ [ t P ( x ) ] − θ ′ ⊤ E P θ t Q ( x ) + E P θ k P ( x ) − k Q ( x ) .$
Recall that the dual parameterization of an exponential family density $P θ$ is $P η$ with $η = E P θ [ t P ( x ) ] = ∇ F P ( θ )$ [1], and that the Fenchel–Young equality is $F ( θ ) + F * ( η ) = θ ⊤ η$ for $η = ∇ F ( θ )$. Thus the KLD between $P θ$ and $Q θ ′$ can be rewritten as
$D KL [ P θ : Q θ ′ ] = F Q ( θ ′ ) + F P * ( η ) − θ ′ ⊤ E P θ t Q ( x ) + E P θ k P ( x ) − k Q ( x ) .$
This formula was reported in [23] and generalizes the Fenchel–Young divergence [17] obtained when $P = Q$ (with $t P ( x ) = t Q ( x )$, $k P ( x ) = k Q ( x )$, and $F ( θ ) = F P ( θ ) = F Q ( θ )$ and $F * ( η ) = F P * ( η ) = F Q * ( η )$).
The formula of Equation (29) was illustrated in [23] with two examples: the KLD between Laplacian distributions and zero-centered Gaussian distributions, and the KLD between two Weibull distributions. Both these examples use the Lebesgue base measure for $μ P$ and $μ Q$.
Let us report another example that uses the counting measure as the base measure for $μ P$ and $μ Q$.
Example 1.
Consider the KLD between a Poisson probability mass function (pmf) and a geometric pmf. The canonical decompositions of the Poisson and geometric pmfs are summarized in Table 1. The KLD between a Poisson pmf $p λ$ and a geometric pmf $q p$ is equal to
$D KL [ P λ : Q p ] = F Q ( θ ′ ) + F P * ( η ) − E P θ [ t Q ( x ) ] · θ ′ + E P θ [ k P ( x ) − k Q ( x ) ] ,$
$= − log p + λ log λ − λ − λ log ( 1 − p ) − E P λ [ log x ! ]$
Since $E p λ [ − log x ! ] = − ∑ k = 0 ∞ e − λ λ k log ( k ! ) k !$, we have
$D KL [ P λ : Q p ] = − log p + λ log λ 1 − p − λ − ∑ k = 0 ∞ e − λ λ k log ( k ! ) k ! .$
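This closed form can be verified against a direct summation of the KLD series. The following sketch (an illustration with hypothetical parameter values $λ = 2$, $p = 0.4$) truncates both sums at $k = 100$, where the Poisson tail is negligible:

```python
import math

# Check the Poisson-vs-geometric KLD formula against a direct summation of
# p_lambda(k) * log(p_lambda(k) / q_p(k)) over the common support {0, 1, 2, ...}.
lam, p = 2.0, 0.4

# E_{P_lambda}[log x!]: the terms k = 0, 1 vanish since log(0!) = log(1!) = 0.
series = sum(math.exp(-lam) * lam**k * math.log(math.factorial(k)) / math.factorial(k)
             for k in range(2, 100))
closed = -math.log(p) + lam * math.log(lam / (1 - p)) - lam - series

def pois(k): return math.exp(-lam) * lam**k / math.factorial(k)
def geom(k): return p * (1 - p)**k  # geometric pmf on {0, 1, 2, ...}

direct = sum(pois(k) * math.log(pois(k) / geom(k)) for k in range(100))
assert abs(closed - direct) < 1e-9
```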
Note that we can calculate the KLD between two geometric distributions $Q p 1$ and $Q p 2$ as
$D KL [ Q p 1 : Q p 2 ] = B F Q ( θ ( p 2 ) : θ ( p 1 ) ) ,$
$= F Q ( θ ( p 2 ) ) − F Q ( θ ( p 1 ) ) − ( θ ( p 2 ) − θ ( p 1 ) ) η ( p 1 ) ,$
We obtain:
$D KL [ Q p 1 : Q p 2 ] = log p 1 p 2 − 1 − 1 p 1 log 1 − p 1 1 − p 2 .$
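The closed form for the KLD between two geometric pmfs can likewise be checked by direct summation; this sketch (an illustration with hypothetical values $p_1 = 0.3$, $p_2 = 0.6$) uses the pmf $q_p(x) = p(1-p)^x$ on $\{0, 1, 2, \dots\}$:

```python
import math

# Check the geometric-vs-geometric KLD closed form against direct summation.
p1, p2 = 0.3, 0.6

closed = math.log(p1 / p2) - (1 - 1 / p1) * math.log((1 - p1) / (1 - p2))
# Truncate at x = 500: the (1-p1)^x tail is negligible well before that.
direct = sum(p1 * (1 - p1)**x * math.log((p1 * (1 - p1)**x) / (p2 * (1 - p2)**x))
             for x in range(500))

assert abs(closed - direct) < 1e-9
```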

3. The Duo Fenchel–Young Divergence and Its Corresponding Duo Bregman Divergence

Inspired by the formula of Equation (29), we shall define the duo Fenchel–Young divergence using a dominance condition on a pair $( F 1 ( θ ) , F 2 ( θ ) )$ of strictly convex generators.
Definition 1
(duo Fenchel–Young divergence). Let $F 1 ( θ )$ and $F 2 ( θ )$ be two strictly convex functions such that $F 1 ( θ ) ≥ F 2 ( θ )$ for any $θ ∈ Θ 12 = dom ( F 1 ) ∩ dom ( F 2 )$. Then the duo Fenchel–Young divergence $Y F 1 , F 2 * ( θ , η ′ )$ is defined by
$Y F 1 , F 2 * ( θ , η ′ ) : = F 1 ( θ ) + F 2 * ( η ′ ) − θ ⊤ η ′ .$
When $F 1 ( θ ) = F 2 ( θ ) = : F ( θ )$, we have $F 1 * ( η ) = F 2 * ( η ) = : F * ( η )$, and we retrieve the ordinary Fenchel–Young divergence [17]:
$Y F , F * ( θ , η ′ ) : = F ( θ ) + F * ( η ′ ) − θ ⊤ η ′ ≥ 0 .$
Note that in Equation (35), we have $η ′ = ∇ F 2 ( θ ′ )$.
Property 1
(Non-negative duo Fenchel–Young divergence). The duo Fenchel–Young divergence is always non-negative.
Proof.
The proof relies on the reverse dominance property of strictly convex and differentiable conjugate functions:
Lemma 1
(Reverse majorization order of functions by the Legendre–Fenchel transform). Let $F 1 ( θ )$ and $F 2 ( θ )$ be two Legendre-type convex functions [14]. If $F 1 ( θ ) ≥ F 2 ( θ )$, then $F 2 * ( η ) ≥ F 1 * ( η )$.
Proof.
This property is graphically illustrated in Figure 3. The reverse dominance property of the Legendre–Fenchel transformation can be checked algebraically as follows:
$F 1 * ( η ) = sup θ ∈ Θ { η ⊤ θ − F 1 ( θ ) } ,$
$= η ⊤ θ 1 − F 1 ( θ 1 ) ( with η = ∇ F 1 ( θ 1 ) ) ,$
$≤ η ⊤ θ 1 − F 2 ( θ 1 ) ,$
$≤ sup θ ∈ Θ { η ⊤ θ − F 2 ( θ ) } = F 2 * ( η ) .$
□
Thus we have $F 1 * ( η ) ≤ F 2 * ( η )$ when $F 1 ( θ ) ≥ F 2 ( θ )$. Therefore it follows that $Y F 1 , F 2 * ( θ , η ′ ) ≥ 0$ since we have
$Y F 1 , F 2 * ( θ , η ′ ) : = F 1 ( θ ) + F 2 * ( η ′ ) − θ ⊤ η ′ ,$
$≥ F 1 ( θ ) + F 1 * ( η ′ ) − θ ⊤ η ′ = Y F 1 , F 1 * ( θ , η ′ ) ≥ 0 ,$
where $Y F 1 , F 1 *$ is the ordinary Fenchel–Young divergence, which is guaranteed to be non-negative from the Fenchel–Young inequality. □
Figure 3. (a) Visual illustration of the Legendre–Fenchel transformation: $F * ( η )$ is measured as the vertical gap (left long black line with both arrows) between the origin and the hyperplane of the “slope” $η$ tangent at $F ( θ )$ evaluated at $θ = 0$. (b) The Legendre transforms $F 1 * ( η )$ and $F 2 * ( η )$ of two functions $F 1 ( θ )$ and $F 2 ( θ )$ such that $F 1 ( θ ) ≥ F 2 ( θ )$ reverse the dominance order: $F 2 * ( η ) ≥ F 1 * ( η )$.
We can express the duo Fenchel–Young divergence using the primal coordinate systems as a generalization of the Bregman divergence to two generators that we term the duo Bregman divergence (see Figure 4):
$B F 1 , F 2 ( θ : θ ′ ) : = Y F 1 , F 2 * ( θ , η ′ ) = F 1 ( θ ) − F 2 ( θ ′ ) − ( θ − θ ′ ) ⊤ ∇ F 2 ( θ ′ ) ,$
with $η ′ = ∇ F 2 ( θ ′ )$.
This generalized Bregman divergence is non-negative when $F 1 ( θ ) ≥ F 2 ( θ )$. Indeed, we check that
$B F 1 , F 2 ( θ : θ ′ ) = F 1 ( θ ) − F 2 ( θ ′ ) − ( θ − θ ′ ) ⊤ ∇ F 2 ( θ ′ ) ,$
$≥ F 2 ( θ ) − F 2 ( θ ′ ) − ( θ − θ ′ ) ⊤ ∇ F 2 ( θ ′ ) = B F 2 ( θ : θ ′ ) ≥ 0 .$
Definition 2
(duo Bregman divergence). Let $F 1 ( θ )$ and $F 2 ( θ )$ be two strictly convex functions such that $F 1 ( θ ) ≥ F 2 ( θ )$ for any $θ ∈ Θ 12 = dom ( F 1 ) ∩ dom ( F 2 )$. Then the generalized Bregman divergence is defined by
$B F 1 , F 2 ( θ : θ ′ ) = F 1 ( θ ) − F 2 ( θ ′ ) − ( θ − θ ′ ) ⊤ ∇ F 2 ( θ ′ ) ≥ 0 .$
Example 2.
Consider $F 1 ( θ ) = a 2 θ 2$ for $a > 0$. We have $η = a θ$, $θ = η a$, and
$F 1 * ( η ) = η 2 a − a 2 η 2 a 2 = η 2 2 a .$
Let $F 2 ( θ ) = 1 2 θ 2$ so that $F 1 ( θ ) ≥ F 2 ( θ )$ for $a ≥ 1$. We check that $F 1 * ( η ) = η 2 2 a ≤ F 2 * ( η )$ when $a ≥ 1$. The duo Fenchel–Young divergence is
$Y F 1 , F 2 * ( θ , η ′ ) = a 2 θ 2 + 1 2 η ′ 2 − θ η ′ ≥ 0 ,$
when $a ≥ 1$. We can express the duo Fenchel–Young divergence in the primal coordinate systems as
$B F 1 , F 2 ( θ , θ ′ ) = a 2 θ 2 + 1 2 θ ′ 2 − θ θ ′ .$
When $a = 1$, $F 1 ( θ ) = F 2 ( θ ) = 1 2 θ 2 : = F ( θ )$, and we obtain $B F ( θ , θ ′ ) = 1 2 ∥ θ − θ ′ ∥ 2 2$, half the squared Euclidean distance as expected. Figure 5 displays the graph plot of the duo Bregman divergence for several values of a.
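The non-negativity claimed in Example 2 can be checked on a grid of parameter values; this sketch (an illustration with the hypothetical choice $a = 2$) evaluates $Y_{F_1,F_2^*}(θ, η')$ at many points:

```python
# Check of Example 2: for a >= 1, the duo Fenchel-Young divergence
# Y_{F1,F2*}(theta, eta') = (a/2) theta^2 + (1/2) eta'^2 - theta * eta'
# is non-negative (here a = 2, on a grid of (theta, eta') values).
a = 2.0

def duo_fy(theta, eta_p):
    return 0.5 * a * theta**2 + 0.5 * eta_p**2 - theta * eta_p

for i in range(-20, 21):
    for j in range(-20, 21):
        assert duo_fy(0.25 * i, 0.25 * j) >= 0.0
```

Indeed, completing the square gives $Y_{F_1,F_2^*}(θ, η') = \frac{1}{2}(θ - η')^2 + \frac{a-1}{2}θ^2 ≥ 0$ for $a ≥ 1$.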
Example 3.
Consider $F 1 ( θ ) = θ 2$ and $F 2 ( θ ) = θ 4$ on the domain $Θ = [ 0 , 1 ]$. We have $F 1 ( θ ) ≥ F 2 ( θ )$ for $θ ∈ Θ$. The convex conjugate of $F 1 ( η )$ is $F 1 * ( η ) = 1 4 η 2$. We have
$F_2^*(\eta) = \eta^{4/3} \left( \frac{1}{4^{1/3}} - \frac{1}{4^{4/3}} \right) = \frac{3}{4^{4/3}} \, \eta^{4/3}$
with $η 2 ( θ ) = 4 θ 3$. Figure 6 plots the convex functions $F 1 ( θ )$ and $F 2 ( θ )$, and their convex conjugates $F 1 * ( η )$ and $F 2 * ( η )$. We observe that $F 1 ( θ ) ≥ F 2 ( θ )$ on $θ ∈ [ 0 , 1 ]$ and that $F 1 * ( η ) ≤ F 2 * ( η )$ on $H = [ 0 , 2 ]$.
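Example 3 can be checked by computing the conjugates as a brute-force discretized supremum; this sketch (an illustration, not from the paper) compares the numerical Legendre–Fenchel transforms over $Θ = [0,1]$ with the closed forms and verifies the reversed dominance order on $H = [0, 2]$:

```python
# Brute-force check of Example 3: F1(theta) = theta^2, F2(theta) = theta^4
# on Theta = [0,1]; the conjugates satisfy F1*(eta) = eta^2/4 and
# F2*(eta) = 3/4^(4/3) * eta^(4/3), and the dominance order is reversed.

def conj(F, eta, n=10001):
    # Discretized Legendre-Fenchel transform sup_{theta in [0,1]} {eta*theta - F(theta)}
    return max(eta * (k / (n - 1)) - F(k / (n - 1)) for k in range(n))

F1 = lambda t: t**2
F2 = lambda t: t**4

for m in range(21):
    eta = 0.1 * m  # eta in [0, 2]
    c1, c2 = conj(F1, eta), conj(F2, eta)
    assert c2 >= c1 - 1e-9                               # reversed order: F2* >= F1*
    assert abs(c1 - eta**2 / 4) < 1e-6                   # closed form for F1*
    assert abs(c2 - 3 / 4**(4/3) * eta**(4/3)) < 1e-6    # closed form for F2*
```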
We now state a property between dual duo Bregman divergences:
Property 2
(Dual duo Fenchel–Young and Bregman divergences). We have
$Y F 1 , F 2 * ( θ : η ′ ) = B F 1 , F 2 ( θ : θ ′ ) = B F 2 * , F 1 * ( η ′ : η ) = Y F 2 * , F 1 ( η ′ : θ )$
Proof.
From the equality cases of the Fenchel–Young inequality, we have $F 1 ( θ ) = θ ⊤ η − F 1 * ( η )$ for $η = ∇ F 1 ( θ )$ and $F 2 ( θ ′ ) = θ ′ ⊤ η ′ − F 2 * ( η ′ )$ with $η ′ = ∇ F 2 ( θ ′ )$. Thus we have
$B F 1 , F 2 ( θ : θ ′ ) = F 1 ( θ ) − F 2 ( θ ′ ) − ( θ − θ ′ ) ⊤ ∇ F 2 ( θ ′ ) ,$
$= θ ⊤ η − F 1 * ( η ) − θ ′ ⊤ η ′ + F 2 * ( η ′ ) − ( θ − θ ′ ) ⊤ η ′ ,$
$= F 2 * ( η ′ ) − F 1 * ( η ) − ( η ′ − η ) ⊤ θ ,$
$= B F 2 * , F 1 * ( η ′ : η ) .$
Recall that $F 1 ( θ ) ≥ F 2 ( θ )$ implies that $F 1 * ( η ) ≤ F 2 * ( η )$ (Lemma 1), $θ = ∇ F 1 * ( η )$, and therefore the dual duo Bregman divergence is non-negative:
$B F 2 * , F 1 * ( η ′ : η ) = F 2 * ( η ′ ) − F 1 * ( η ) − ( η ′ − η ) ⊤ θ , ≥ F 1 * ( η ′ ) − F 1 * ( η ) − ( η ′ − η ) ⊤ ∇ F 1 * ( η ) ︸ B F 1 * ( η ′ : η ) ≥ 0 .$
□
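Property 2 can be verified numerically on the quadratic generators of Example 2; this sketch (an illustration with the hypothetical choice $a = 3$, so that $F_1^*(η) = η^2/(2a)$ and $F_2^*(η) = η^2/2$) checks that the primal and dual duo Bregman divergences coincide under $η = ∇F_1(θ) = aθ$ and $η' = ∇F_2(θ') = θ'$:

```python
# Check of Property 2 for F1(theta) = (a/2) theta^2, F2(theta) = (1/2) theta^2,
# with conjugates F1*(eta) = eta^2/(2a), F2*(eta) = eta^2/2.
a = 3.0

def duo_bregman_primal(t, tp):
    # B_{F1,F2}(theta : theta') = F1(theta) - F2(theta') - (theta - theta') F2'(theta')
    return 0.5 * a * t**2 - 0.5 * tp**2 - (t - tp) * tp

def duo_bregman_dual(ep, e):
    # B_{F2*,F1*}(eta' : eta) = F2*(eta') - F1*(eta) - (eta' - eta) * grad F1*(eta)
    return 0.5 * ep**2 - e**2 / (2 * a) - (ep - e) * (e / a)

for i in range(-10, 11):
    for j in range(-10, 11):
        t, tp = 0.3 * i, 0.3 * j
        # dual arguments: eta' = theta', eta = a * theta
        assert abs(duo_bregman_primal(t, tp) - duo_bregman_dual(tp, a * t)) < 1e-9
```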

4. Kullback–Leibler Divergence between Distributions of Truncated Exponential Families

Let $E 1 = { P θ : θ ∈ Θ 1 }$ be an exponential family of distributions all dominated by $μ$ with Radon–Nikodym density $p θ ( x ) = exp ( θ ⊤ t ( x ) − F 1 ( θ ) + k ( x ) )$ defined on the support $X 1$. Let $E 2 = { Q θ : θ ∈ Θ 2 }$ be another exponential family of distributions all dominated by $μ$ with Radon–Nikodym density $q θ ( x ) = exp ( θ ⊤ t ( x ) − F 2 ( θ ) + k ( x ) )$ defined on the support $X 2$ such that $X 1 ⊆ X 2$. Let $p ˜ θ ( x ) = exp ( θ ⊤ t ( x ) + k ( x ) )$ be the common unnormalized density so that
$p θ ( x ) = p ˜ θ ( x ) Z 1 ( θ )$
and
$q θ ( x ) = p ˜ θ ( x ) Z 2 ( θ ) = Z 1 ( θ ) Z 2 ( θ ) p θ ( x ) ,$
with $Z 1 ( θ ) = exp ( F 1 ( θ ) )$ and $Z 2 ( θ ) = exp ( F 2 ( θ ) )$ being the partition functions of $E 1$ and $E 2$, whose logarithms $F 1 ( θ )$ and $F 2 ( θ )$ are the respective log-normalizers.
We have
$D KL [ p θ 1 : q θ 2 ] = ∫ X 1 p θ 1 ( x ) log p θ 1 ( x ) q θ 2 ( x ) d μ ( x ) ,$
$= ∫ X 1 p θ 1 ( x ) log p θ 1 ( x ) p θ 2 ( x ) d μ ( x ) + ∫ X 1 p θ 1 ( x ) log Z 2 ( θ 2 ) Z 1 ( θ 2 ) d μ ( x ) ,$
$= D KL [ p θ 1 : p θ 2 ] + log Z 2 ( θ 2 ) − log Z 1 ( θ 2 ) .$
Since $D KL [ p θ 1 : p θ 2 ] = B F 1 ( θ 2 : θ 1 )$ and $log Z i ( θ ) = F i ( θ )$, we obtain
$D KL [ p θ 1 : q θ 2 ] = B F 1 ( θ 2 : θ 1 ) + F 2 ( θ 2 ) − F 1 ( θ 2 ) ,$
$= F 1 ( θ 2 ) − F 1 ( θ 1 ) − ( θ 2 − θ 1 ) ⊤ ∇ F 1 ( θ 1 ) + F 2 ( θ 2 ) − F 1 ( θ 2 ) ,$
$= F 2 ( θ 2 ) − F 1 ( θ 1 ) − ( θ 2 − θ 1 ) ⊤ ∇ F 1 ( θ 1 ) = : B F 2 , F 1 ( θ 2 : θ 1 ) .$
Observe that since $X 1 ⊆ X 2$, we have:
$F 2 ( θ ) = log ∫ X 2 p ˜ θ ( x ) d μ ( x ) ≥ log ∫ X 1 p ˜ θ ( x ) d μ ( x ) : = F 1 ( θ ) .$
Therefore $Θ 2 ⊆ Θ 1$, and the common natural parameter space is $Θ 12 = Θ 1 ∩ Θ 2 = Θ 2$.
Notice that the reverse Kullback–Leibler divergence $D KL * [ p θ 1 : q θ 2 ] = D KL [ q θ 2 : p θ 1 ] = + ∞$ since $Q θ 2$ is not dominated by $P θ 1$.
Theorem 1
(Kullback–Leibler divergence between truncated exponential family densities). Let $E 2 = { q θ 2 }$ be an exponential family with support $X 2$, and $E 1 = { p θ 1 }$ a truncated exponential family of $E 2$ with support $X 1 ⊂ X 2$. Let $F 1$ and $F 2$ denote the log-normalizers of $E 1$ and $E 2$ and $η 1$ and $η 2$ the moment parameters corresponding to the natural parameters $θ 1$ and $θ 2$. Then the Kullback–Leibler divergence between a truncated density of $E 1$ and a density of $E 2$ is
$D KL [ p θ 1 : q θ 2 ] = Y F 2 , F 1 * ( θ 2 : η 1 ) = B F 2 , F 1 ( θ 2 : θ 1 ) = B F 1 * , F 2 * ( η 1 : η 2 ) = Y F 1 * , F 2 ( η 1 : θ 2 ) .$
For example, consider the calculation of the KLD between an exponential distribution (viewed as a half Laplacian distribution, i.e., a Laplacian distribution truncated to the positive real support) and a Laplacian distribution defined on the real line support.
Example 4.
Let $R + + = { x ∈ R : x > 0 }$ denote the set of positive reals. Let $E 1 = { p λ ( x ) = λ exp ( − λ x ) , λ ∈ R + + , x > 0 }$ and $E 2 = { q λ ( x ) = λ exp ( − λ | x | ) , λ ∈ R + + , x ∈ R }$ denote the exponential families of exponential distributions and Laplacian distributions, respectively. We have the sufficient statistic $t ( x ) = − | x |$ and natural parameter $θ = λ$ so that $p ˜ θ ( x ) = exp ( − | x | θ )$. The log-normalizers are $F 1 ( θ ) = − log θ$ and $F 2 ( θ ) = − log θ + log 2$ (hence $F 2 ( θ ) ≥ F 1 ( θ )$). The moment parameter $η = ∇ F 1 ( θ ) = ∇ F 2 ( θ ) = − 1 θ = − 1 λ$. Thus using the duo Bregman divergence, we have:
$D KL [ p θ 1 : q θ 2 ] = B F 2 , F 1 ( θ 2 : θ 1 ) ,$
$= F 2 ( θ 2 ) − F 1 ( θ 1 ) − ( θ 2 − θ 1 ) ⊤ ∇ F 1 ( θ 1 ) ,$
$= log 2 + log λ 1 λ 2 + λ 2 λ 1 − 1 .$
Moreover, we can interpret that divergence using the Itakura–Saito divergence [24]:
$D IS [ λ 1 : λ 2 ] : = λ 1 λ 2 − log λ 1 λ 2 − 1 ≥ 0 .$
we have
$D KL [ p θ 1 : q θ 2 ] = D IS [ λ 2 : λ 1 ] + log 2 ≥ 0 .$
We check the result using the duo Fenchel–Young divergence:
$D KL [ p θ 1 : q θ 2 ] = Y F 2 , F 1 * ( θ 2 : η 1 ) ,$
with $F 1 * ( η ) = − 1 + log − 1 η$:
$D KL [ p θ 1 : q θ 2 ] = Y F 2 , F 1 * ( θ 2 : η 1 ) ,$
$= − log λ 2 + log 2 − 1 + log λ 1 + λ 2 λ 1 ,$
$= log λ 1 λ 2 + λ 2 λ 1 + log 2 − 1 .$
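The closed form of Example 4 can be validated by quadrature; this sketch (an illustration with hypothetical rates $λ_1 = 1.5$, $λ_2 = 0.8$) integrates $p \log(p/q)$ over the support of the exponential density:

```python
import math

# Check of Example 4: D_KL[p_{l1} : q_{l2}] = log(l1/l2) + l2/l1 + log 2 - 1
# for an exponential density p against a Laplacian density q.
l1, l2 = 1.5, 0.8
closed = math.log(l1 / l2) + l2 / l1 + math.log(2.0) - 1.0

# Left-Riemann quadrature on (0, 40]; the integrand's tail is negligible beyond.
dx, kl = 1e-4, 0.0
for i in range(1, 400001):
    x = i * dx
    p = l1 * math.exp(-l1 * x)             # exponential density on (0, inf)
    q = 0.5 * l2 * math.exp(-l2 * abs(x))  # Laplacian density on the real line
    kl += p * math.log(p / q) * dx

assert abs(kl - closed) < 1e-3
```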
Next, consider the calculation of the KLD between a half-normal distribution and a (full) normal distribution:
Example 5.
Consider $E 1$ and $E 2$ to be the scale family of the half standard normal distributions and the scale family of the standard normal distributions, respectively. We have $p ˜ θ ( x ) = exp − x 2 2 σ 2$ with $Z 1 ( θ ) = σ \sqrt{π / 2}$ and $Z 2 ( θ ) = σ \sqrt{2 π}$. Let the sufficient statistic be $t ( x ) = − x 2 2$ so that the natural parameter is $θ = 1 σ 2 ∈ R + +$. Here, we have both $Θ 1 = Θ 2 = R + +$. For this example, we check that $Z 1 ( θ ) = 1 2 Z 2 ( θ )$. We have $F 1 ( θ ) = − 1 2 log θ + 1 2 log π 2$ and $F 2 ( θ ) = − 1 2 log θ + 1 2 log ( 2 π )$ (with $F 2 ( θ ) ≥ F 1 ( θ )$). We have $η = − \frac{1}{2 θ} = − \frac{σ^2}{2}$. The KLD between two half scale normal distributions is
$D KL [ p θ 1 : p θ 2 ] = B F 1 ( θ 2 : θ 1 ) ,$
$= 1 2 log σ 2 2 σ 1 2 + σ 1 2 σ 2 2 − 1 .$
Since $F 1 ( θ )$ and $F 2 ( θ )$ differ only by a constant and the Bregman divergence is invariant under an affine term of its generator, we have
$D KL [ q θ 1 : q θ 2 ] = B F 2 ( θ 2 : θ 1 ) ,$
$= B F 1 ( θ 2 : θ 1 ) = 1 2 log σ 2 2 σ 1 2 + σ 1 2 σ 2 2 − 1 .$
Moreover, we can interpret those Bregman divergences as half of the Itakura–Saito divergence:
$D KL [ p θ 1 : p θ 2 ] = D KL [ q θ 1 : q θ 2 ] = B F 2 ( θ 2 : θ 1 ) = 1 2 D IS [ σ 1 2 : σ 2 2 ] .$
It follows that
$D KL [ p θ 1 : q θ 2 ] = B F 2 , F 1 ( θ 2 : θ 1 ) = F 2 ( θ 2 ) − F 1 ( θ 1 ) − ( θ 2 − θ 1 ) ⊤ ∇ F 1 ( θ 1 ) ,$
$= \frac{1}{2} \left( \log \frac{σ_2^2}{σ_1^2} + \frac{σ_1^2}{σ_2^2} + \log 4 − 1 \right) ,$
$= D KL [ q θ 1 : q θ 2 ] + log 2 .$
Since $log 2 > 0$, we have $D KL [ p θ 1 : q θ 2 ] ≥ D KL [ q θ 1 : q θ 2 ]$.
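Example 5 can also be validated by quadrature; this sketch (an illustration with hypothetical scales $σ_1 = 1$, $σ_2 = 2$) checks that the half-normal-vs-normal KLD equals the normal-vs-normal KLD plus $\log 2$:

```python
import math

# Check of Example 5: D_KL between a half-normal density of scale s1 and a
# centered normal density of scale s2 equals D_KL[q_{s1}:q_{s2}] + log 2.
s1, s2 = 1.0, 2.0
kl_qq = 0.5 * (math.log(s2**2 / s1**2) + s1**2 / s2**2 - 1.0)  # normal vs normal
closed = kl_qq + math.log(2.0)

# Left-Riemann quadrature on (0, 20], the support of the half-normal density.
dx, kl = 1e-4, 0.0
for i in range(1, 200001):
    x = i * dx
    p = math.sqrt(2.0 / math.pi) / s1 * math.exp(-x**2 / (2 * s1**2))   # half-normal
    q = 1.0 / (s2 * math.sqrt(2.0 * math.pi)) * math.exp(-x**2 / (2 * s2**2))
    kl += p * math.log(p / q) * dx

assert abs(kl - closed) < 1e-3
```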
Thus the Kullback–Leibler divergence between a truncated density and another density of the same exponential family amounts to calculating a duo Bregman divergence on the reverse parameter order: $D KL [ p θ 1 : q θ 2 ] = B F 2 , F 1 ( θ 2 : θ 1 )$. Let $D KL * [ p : q ] : = D KL [ q : p ]$ be the reverse Kullback–Leibler divergence. Then $D KL * [ q θ 2 : p θ 1 ] = B F 2 , F 1 ( θ 2 : θ 1 )$.
Notice that truncated exponential families are also exponential families but those exponential families may be non-steep [25].
Let $E 1 = { p θ a 1 , b 1 }$ and $E 2 = { p θ a 2 , b 2 }$ be two truncated exponential families of the exponential family $E = { p θ = d P θ d μ }$ with log-normalizer $F ( θ )$ such that
$p θ a i , b i ( x ) = p θ ( x ) Z a i , b i ( θ ) ,$
with $Z a i , b i ( θ ) = Φ θ ( b i ) − Φ θ ( a i )$, where $Φ θ ( x )$ denotes the CDF of $p θ ( x )$. Then the log-normalizer of $E i$ is $F i ( θ ) = F ( θ ) + log ( Φ θ ( b i ) − Φ θ ( a i ) )$ for $i ∈ { 1 , 2 }$.
Corollary 1
(Kullback–Leibler divergence between densities of truncated exponential families). Let $E i = { p θ a i , b i }$ be truncated exponential families of the exponential family $E = { p θ }$ with support $X i = [ a i , b i ] ⊂ X$ (where $X$ denotes the support of $E$) for $i ∈ { 1 , 2 }$. Then the Kullback–Leibler divergence between $p θ 1 a 1 , b 1$ and $p θ 2 a 2 , b 2$ is infinite if $[ a 1 , b 1 ] ⊄ [ a 2 , b 2 ]$ and has the following formula when $[ a 1 , b 1 ] ⊆ [ a 2 , b 2 ]$:
$D KL [ p θ 1 a 1 , b 1 : p θ 2 a 2 , b 2 ] = D KL [ p θ 1 a 1 , b 1 : p θ 2 a 1 , b 1 ] + log Z a 2 , b 2 ( θ 2 ) Z a 1 , b 1 ( θ 2 ) .$
Proof.
We have $p θ a 1 , b 1 = p θ Z a 1 , b 1 ( θ )$ and $p θ a 2 , b 2 = p θ Z a 2 , b 2 ( θ )$. Therefore $p θ a 2 , b 2 = p θ a 1 , b 1 Z a 1 , b 1 ( θ ) Z a 2 , b 2 ( θ )$. Thus we have
$D KL [ p θ 1 a 1 , b 1 : p θ 2 a 2 , b 2 ] = ∫ X 1 p θ 1 a 1 , b 1 ( x ) log p θ 1 a 1 , b 1 ( x ) p θ 2 a 2 , b 2 d μ ( x ) ,$
$= ∫ X 1 p θ 1 a 1 , b 1 ( x ) log p θ 1 a 1 , b 1 ( x ) p θ 2 a 1 , b 1 d μ ( x ) + log Z a 2 , b 2 ( θ 2 ) Z a 1 , b 1 ( θ 2 ) ,$
$= D KL [ p θ 1 a 1 , b 1 : p θ 2 a 1 , b 1 ] + log Z a 2 , b 2 ( θ 2 ) Z a 1 , b 1 ( θ 2 ) .$
□
Thus the KLD between truncated exponential family densities $p θ 1 a 1 , b 1$ and $p θ 2 a 2 , b 2$ amounts to the KLD between the densities with the same truncation parameter with an additive term depending on the log ratio of the mass with respect to the truncated supports evaluated at $θ 2$. We shall illustrate with two examples the calculation of the KLD between truncated exponential families.
Example 6.
Consider the calculation of the KLD between a truncated exponential distribution $p λ 1 a 1 , b 1$ with support $X 1 = [ a 1 , b 1 ]$ ($b 1 > a 1 ≥ 0$) and another truncated exponential distribution $p λ 2 a 2 , b 2$ with support $X 2 = [ a 2 , b 2 ]$ ($b 2 > a 2 ≥ 0$). We have $p λ ( x ) = λ exp ( − λ x )$ (density of the untruncated exponential family with natural parameter $θ = λ$, sufficient statistic $t ( x ) = − x$ and log-normalizer $F ( θ ) = − log θ$), $p λ 1 a 1 , b 1 = 1 Z a 1 , b 1 ( λ ) p λ 1 ( x )$, and $p λ 2 a 2 , b 2 = 1 Z a 2 , b 2 ( λ ) p λ 2 ( x )$. Let $Φ λ ( x ) = 1 − exp ( − λ x )$ denote the cumulative distribution function of the exponential distribution. We have $Z a , b ( λ ) = Φ λ ( b ) − Φ λ ( a )$ and
$F a , b ( λ ) = F ( λ ) + log ( Φ λ ( b ) − Φ λ ( a ) ) = − log λ + log ( e − λ a − e − λ b ) .$
If $[ a 1 , b 1 ] ⊄ [ a 2 , b 2 ]$ then $D KL [ p λ 1 : q λ 2 ] = + ∞$. Otherwise, $[ a 1 , b 1 ] ⊆ [ a 2 , b 2 ]$, and the exponential family ${ p λ }$ is a truncated exponential family of ${ q λ }$. Using the computer algebra system Maxima (https://maxima.sourceforge.io/ accessed on 15 March 2022), we find that
$− E p λ [ x ] = ( 1 + λ b ) e λ a − ( 1 + λ a ) e λ b λ ( e λ b − e λ a ) = F a , b ′ ( λ ) .$
Thus we have:
$D KL [ p λ 1 a 1 , b 1 : q λ 2 a 2 , b 2 ] = B F 2 , F 1 ( θ 2 : θ 1 ) , = F a 2 , b 2 ( λ 2 ) − F a 1 , b 1 ( λ 1 ) − ( λ 2 − λ 1 ) F a 1 , b 1 ′ ( λ 1 ) ,$
$= \log \frac{\lambda_1}{\lambda_2} + (\lambda_2 - \lambda_1)\, E_{p_{\lambda_1}^{a_1,b_1}}[x] + \log \frac{e^{-\lambda_2 a_2} - e^{-\lambda_2 b_2}}{e^{-\lambda_1 a_1} - e^{-\lambda_1 b_1}}.$
When $a 1 = a 2 = 0$ and $b 1 = b 2 = + ∞$, we recover the KLD between two exponential distributions $p λ 1$ and $p λ 2$:
$D KL [ p λ 1 : p λ 2 ] = B F ( λ 2 : λ 1 ) ,$
$= F ( θ 2 ) − F ( θ 1 ) − ( θ 2 − θ 1 ) F ′ ( θ 1 ) ,$
$= \frac{\lambda_2}{\lambda_1} - \log \frac{\lambda_2}{\lambda_1} - 1 = D_{\mathrm{IS}}[\lambda_2 : \lambda_1].$
Note that the KLD between two truncated exponential distributions with the same truncation support $X = [ a , b ]$ is
$D_{\mathrm{KL}}[p_{\lambda_1}^{a,b} : p_{\lambda_2}^{a,b}] = \log \frac{\lambda_1}{\lambda_2} + \log \frac{\Phi_{\lambda_2}(b) - \Phi_{\lambda_2}(a)}{\Phi_{\lambda_1}(b) - \Phi_{\lambda_1}(a)} + (\lambda_2 - \lambda_1)\, E_{p_{\lambda_1}^{a,b}}[x].$
We also check Corollary 1:
$D KL [ p λ 1 a 1 , b 1 : p λ 2 a 2 , b 2 ] = D KL [ p λ 1 a 1 , b 1 : p λ 2 a 1 , b 1 ] + log Z a 2 , b 2 ( λ 2 ) Z a 1 , b 1 ( λ 2 ) .$
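The closed-form KLD of Example 6 can be validated against numerical integration. The following Python sketch (assuming NumPy and SciPy; helper names and parameter values are illustrative) implements the duo Bregman expression and compares it with a direct quadrature of the integral definition:

```python
# Closed-form KLD of Example 6 checked against numerical integration
# (a sketch; helper names and parameter values are illustrative).
import numpy as np
from scipy.integrate import quad

def Z(lmb, a, b):
    # Mass of the exponential density lmb*exp(-lmb*x) on [a, b]
    return np.exp(-lmb * a) - np.exp(-lmb * b)

def pdf(x, lmb, a, b):
    return lmb * np.exp(-lmb * x) / Z(lmb, a, b)

def trunc_mean(lmb, a, b):
    # E[x] under the truncated exponential, i.e. -F'_{a,b}(lmb)
    return (((1 + lmb * a) * np.exp(lmb * b) - (1 + lmb * b) * np.exp(lmb * a))
            / (lmb * (np.exp(lmb * b) - np.exp(lmb * a))))

def kld_closed(l1, a1, b1, l2, a2, b2):
    # log(l1/l2) + (l2 - l1) E_{p1}[x] + log(Z_{a2,b2}(l2)/Z_{a1,b1}(l1))
    return (np.log(l1 / l2) + (l2 - l1) * trunc_mean(l1, a1, b1)
            + np.log(Z(l2, a2, b2) / Z(l1, a1, b1)))

def kld_num(l1, a1, b1, l2, a2, b2):
    f = lambda x: pdf(x, l1, a1, b1) * np.log(pdf(x, l1, a1, b1) / pdf(x, l2, a2, b2))
    return quad(f, a1, b1)[0]
```

The two computations agree up to quadrature error for any nested supports.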
The next example shows how to compute the Kullback–Leibler divergence between two truncated normal distributions:
Example 7.
Let $N a , b ( m , s )$ denote a truncated normal distribution with support the open interval $( a , b )$ ($a < b$) and probability density function defined by:
$p_{m,s}^{a,b}(x) = \frac{1}{Z_{a,b}(m,s)} \exp\left(-\frac{(x-m)^2}{2 s^2}\right),$
where $Z a , b ( m , s )$ is related to the partition function [26] expressed using the cumulative distribution function (CDF) $Φ m , s ( x )$:
$Z_{a,b}(m,s) = \sqrt{2\pi}\, s \left(\Phi_{m,s}(b) - \Phi_{m,s}(a)\right),$
with
$\Phi_{m,s}(x) = \frac{1}{2}\left(1 + \operatorname{erf}\left(\frac{x-m}{\sqrt{2}\, s}\right)\right),$
where $erf ( x )$ is the error function:
$\operatorname{erf}(x) := \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\, \mathrm{d}t .$
Thus we have $\operatorname{erf}(x) = 2\, \Phi(\sqrt{2}\, x) - 1$, where $\Phi(x) = \Phi_{0,1}(x)$.
The pdf can also be written as
$p_{m,s}^{a,b}(x) = \frac{1}{s}\, \frac{\phi\left(\frac{x-m}{s}\right)}{\Phi\left(\frac{b-m}{s}\right) - \Phi\left(\frac{a-m}{s}\right)},$
where $ϕ ( x )$ denotes the standard normal pdf ($ϕ ( x ) = p 0 , 1 − ∞ , + ∞ ( x )$):
$\phi(x) := \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right),$
and $\Phi(x) = \Phi_{0,1}(x) = \int_{-\infty}^x \phi(t)\,\mathrm{d}t$ is the standard normal CDF. When $a = -\infty$ and $b = +\infty$, we have $Z_{-\infty,\infty}(m,s) = \sqrt{2\pi}\, s$ since $\Phi(-\infty) = 0$ and $\Phi(+\infty) = 1$.
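The erf-based expression of the partition function is easy to verify by direct integration of the unnormalized Gaussian kernel. The following Python sketch (assuming SciPy; all names and parameter values are illustrative) performs this check:

```python
# Check that the erf-based partition function Z_{a,b}(m, s) matches a direct
# numerical integration of exp(-(x-m)^2/(2 s^2)) over [a, b] (a sketch).
import numpy as np
from math import erf, sqrt, pi
from scipy.integrate import quad

def Phi(x, m, s):
    # CDF of N(m, s^2) expressed with the error function
    return 0.5 * (1.0 + erf((x - m) / (sqrt(2.0) * s)))

def Z(m, s, a, b):
    # Z_{a,b}(m, s) = sqrt(2*pi) * s * (Phi_{m,s}(b) - Phi_{m,s}(a))
    return sqrt(2.0 * pi) * s * (Phi(b, m, s) - Phi(a, m, s))

m, s, a, b = 0.3, 1.2, -1.0, 2.0   # arbitrary illustrative values
Z_erf = Z(m, s, a, b)
Z_num = quad(lambda x: np.exp(-(x - m) ** 2 / (2 * s ** 2)), a, b)[0]
```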
The density $p_{m,s}^{a,b}(x)$ belongs to an exponential family $E_{a,b}$ with natural parameter $\theta = \left(\frac{m}{s^2}, -\frac{1}{2 s^2}\right)$, sufficient statistics $t(x) = (x, x^2)$, and log-normalizer:
$F_{a,b}(\theta) = -\frac{\theta_1^2}{4 \theta_2} + \log Z_{a,b}(\theta).$
The natural parameter space is $Θ = R × R − −$ where $R − − = { x ∈ R : x < 0 }$ denotes the set of negative real numbers.
The log-normalizer can be expressed using the source parameters $( m , s )$ (which are not the mean and standard deviation when the support is truncated, hence the notation m and s):
$F_{a,b}(m,s) = \frac{m^2}{2 s^2} + \log Z_{a,b}(m,s),$
$= \frac{m^2}{2 s^2} + \frac{1}{2} \log(2\pi s^2) + \log\left(\Phi_{m,s}(b) - \Phi_{m,s}(a)\right).$
We shall use the fact that the gradient of the log-normalizer of any exponential family distribution amounts to the expectation of the sufficient statistics [1]:
$∇ F a , b ( θ ) = E p m , s a , b [ t ( x ) ] = η .$
Parameter η is called the moment or expectation parameter [1].
The mean $\mu(m,s;a,b) = E_{p_{m,s}^{a,b}}[x] = \frac{\partial}{\partial \theta_1} F_{a,b}(\theta)$ and the variance $\sigma^2(m,s;a,b) = E_{p_{m,s}^{a,b}}[x^2] - \mu^2$ (with $E_{p_{m,s}^{a,b}}[x^2] = \frac{\partial}{\partial \theta_2} F_{a,b}(\theta)$) of the truncated normal $p_{m,s}^{a,b}$ can be expressed using the following formulas [26,27] (page 25):
$\mu(m,s;a,b) = m - s\, \frac{\phi(\beta) - \phi(\alpha)}{\Phi(\beta) - \Phi(\alpha)},$
$\sigma^2(m,s;a,b) = s^2 \left(1 - \frac{\beta \phi(\beta) - \alpha \phi(\alpha)}{\Phi(\beta) - \Phi(\alpha)} - \left(\frac{\phi(\beta) - \phi(\alpha)}{\Phi(\beta) - \Phi(\alpha)}\right)^2\right),$
where $\alpha := \frac{a-m}{s}$ and $\beta := \frac{b-m}{s}$. Thus we have the following moment parameter $\eta = (\eta_1, \eta_2)$ with
$η 1 ( m , s ; a , b ) = E p m , s a , b [ x ] = μ ( m , s ; a , b ) ,$
$η 2 ( m , s ; a , b ) = E p m , s a , b [ x 2 ] = σ 2 ( m , s ; a , b ) + μ 2 ( m , s ; a , b ) .$
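These closed-form moments can be cross-checked against `scipy.stats.truncnorm`, which parameterizes the truncated normal by the standardized bounds $\alpha$ and $\beta$. A minimal sketch (all helper names and parameter values are illustrative):

```python
# Check the closed-form truncated-normal mean/variance against
# scipy.stats.truncnorm (a sketch with arbitrary parameter values).
import numpy as np
from scipy.stats import truncnorm, norm

def trunc_moments(m, s, a, b):
    # Closed-form mean and variance of N(m, s^2) truncated to [a, b]
    alpha, beta = (a - m) / s, (b - m) / s
    Zab = norm.cdf(beta) - norm.cdf(alpha)
    dphi = norm.pdf(beta) - norm.pdf(alpha)
    mu = m - s * dphi / Zab
    var = s ** 2 * (1.0 - (beta * norm.pdf(beta) - alpha * norm.pdf(alpha)) / Zab
                    - (dphi / Zab) ** 2)
    return mu, var

m, s, a, b = 0.5, 1.5, -1.0, 2.0
mu, var = trunc_moments(m, s, a, b)
alpha, beta = (a - m) / s, (b - m) / s
eta1, eta2 = mu, var + mu ** 2   # moment parameters (eta_1, eta_2)
```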
Now consider two truncated normal distributions $p m 1 , s 1 a 1 , b 1$ and $p m 2 , s 2 a 2 , b 2$ with $[ a 1 , b 1 ] ⊆ [ a 2 , b 2 ]$ (otherwise, we have $D KL [ p m 1 , s 1 a 1 , b 1 : p m 2 , s 2 a 2 , b 2 ] = + ∞$). Then the KLD between $p m 1 , s 1 a 1 , b 1$ and $p m 2 , s 2 a 2 , b 2$ is equivalent to a duo Bregman divergence:
$D_{\mathrm{KL}}[p_{m_1,s_1}^{a_1,b_1} : p_{m_2,s_2}^{a_2,b_2}] = F_{a_2,b_2}(\theta_2) - F_{a_1,b_1}(\theta_1) - (\theta_2 - \theta_1)^\top \nabla F_{a_1,b_1}(\theta_1),$
$= \frac{m_2^2}{2 s_2^2} - \frac{m_1^2}{2 s_1^2} + \log \frac{Z_{a_2,b_2}(m_2,s_2)}{Z_{a_1,b_1}(m_1,s_1)} - \left(\frac{m_2}{s_2^2} - \frac{m_1}{s_1^2}\right) \eta_1(m_1,s_1;a_1,b_1) - \left(\frac{1}{2 s_1^2} - \frac{1}{2 s_2^2}\right) \eta_2(m_1,s_1;a_1,b_1).$
Note that $F_{a_2,b_2}(\theta) \geq F_{a_1,b_1}(\theta)$ since $[a_1, b_1] \subseteq [a_2, b_2]$ implies $Z_{a_1,b_1}(\theta) \leq Z_{a_2,b_2}(\theta)$.
This formula is valid for (1) the KLD between two truncated normal distributions, or for (2) the KLD between a truncated normal distribution and a (full support) normal distribution. Note that the formula depends on the erf function used in the function Φ. Furthermore, when $a_1 = a_2 = -\infty$ and $b_1 = b_2 = +\infty$, we recover (3) the KLD between two univariate normal distributions, since $\log \frac{Z_{a_2,b_2}(m_2,s_2)}{Z_{a_1,b_1}(m_1,s_1)} = \log \frac{s_2}{s_1} = \frac{1}{2} \log \frac{s_2^2}{s_1^2}$:
$D_{\mathrm{KL}}[p_{m_1,s_1} : p_{m_2,s_2}] = \frac{1}{2} \left( \log \frac{s_2^2}{s_1^2} + \frac{s_1^2}{s_2^2} + \frac{(m_2 - m_1)^2}{s_2^2} - 1 \right).$
Note that for full support normal distributions, we have $μ ( m , s ; − ∞ , + ∞ ) = m$ and $σ 2 ( m , s ; − ∞ , + ∞ ) = s 2$.
The entropy of a truncated normal distribution (an exponential family [28]) is $h[p_{m,s}^{a,b}] = -\int_a^b p_{m,s}^{a,b}(x) \log p_{m,s}^{a,b}(x)\, \mathrm{d}x = -F^*(\eta) = F(\theta) - \theta^\top \eta$. We find that
$h[p_{m,s}^{a,b}] = \log\left(\sqrt{2\pi e}\, s \left(\Phi(\beta) - \Phi(\alpha)\right)\right) + \frac{\alpha \phi(\alpha) - \beta \phi(\beta)}{2 \left(\Phi(\beta) - \Phi(\alpha)\right)}.$
When $(a, b) = (-\infty, \infty)$, we have $\Phi(\beta) - \Phi(\alpha) = 1$ and $\alpha \phi(\alpha) - \beta \phi(\beta) = 0$ since $\beta = -\alpha$, $\phi(-x) = \phi(x)$ (an even function), and $\lim_{\beta \to +\infty} \beta \phi(\beta) = 0$. Thus we recover the differential entropy of a normal distribution: $h[p_{\mu,\sigma}] = \log\left(\sqrt{2\pi e}\, \sigma\right)$.
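The duo Bregman formula of Example 7 can likewise be validated numerically. The following Python sketch (assuming NumPy and SciPy; all helper names and parameter values are illustrative) compares the closed form with a direct quadrature of the KLD integral for nested supports:

```python
# Numerical validation of the duo Bregman formula for the KLD between
# truncated normals (a sketch; helper names and values are illustrative).
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def logZ(m, s, a, b):
    return np.log(np.sqrt(2 * np.pi) * s
                  * (norm.cdf((b - m) / s) - norm.cdf((a - m) / s)))

def F(m, s, a, b):
    # Log-normalizer F_{a,b} expressed with the source parameters (m, s)
    return m ** 2 / (2 * s ** 2) + logZ(m, s, a, b)

def moments(m, s, a, b):
    # Moment parameters (E[x], E[x^2]) of the truncated normal
    al, be = (a - m) / s, (b - m) / s
    Z = norm.cdf(be) - norm.cdf(al)
    mu = m - s * (norm.pdf(be) - norm.pdf(al)) / Z
    var = s ** 2 * (1 - (be * norm.pdf(be) - al * norm.pdf(al)) / Z
                    - ((norm.pdf(be) - norm.pdf(al)) / Z) ** 2)
    return mu, var + mu ** 2

def kld_closed(m1, s1, a1, b1, m2, s2, a2, b2):
    # Duo Bregman divergence on swapped natural parameters
    eta1, eta2 = moments(m1, s1, a1, b1)
    th1 = (m1 / s1 ** 2, -1 / (2 * s1 ** 2))
    th2 = (m2 / s2 ** 2, -1 / (2 * s2 ** 2))
    return (F(m2, s2, a2, b2) - F(m1, s1, a1, b1)
            - (th2[0] - th1[0]) * eta1 - (th2[1] - th1[1]) * eta2)

def pdf(x, m, s, a, b):
    return np.exp(-(x - m) ** 2 / (2 * s ** 2) - logZ(m, s, a, b))

def kld_num(m1, s1, a1, b1, m2, s2, a2, b2):
    f = lambda x: pdf(x, m1, s1, a1, b1) * np.log(
        pdf(x, m1, s1, a1, b1) / pdf(x, m2, s2, a2, b2))
    return quad(f, a1, b1)[0]
```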

5. Bhattacharyya Skewed Divergence between Truncated Densities of an Exponential Family

The Bhattacharyya $α$-skewed divergence [29,30] between two densities $p ( x )$ and $q ( x )$ with respect to $μ$ is defined for a skewing scalar parameter $α ∈ ( 0 , 1 )$ as:
$D Bhat , α [ p : q ] : = − log ∫ X p ( x ) α q ( x ) 1 − α d μ ( x ) ,$
where $X$ denotes the support of the distributions. The Bhattacharyya distance is
$D_{\mathrm{Bhat}}[p, q] = D_{\mathrm{Bhat},\frac{1}{2}}[p : q] = -\log \int_X \sqrt{p(x)\, q(x)}\, \mathrm{d}\mu(x).$
The Bhattacharyya distance is not a metric distance since it does not satisfy the triangle inequality. It is related to the Hellinger distance [31] as follows:
$D_H[p, q] = \sqrt{\frac{1}{2} \int_X \left(\sqrt{p(x)} - \sqrt{q(x)}\right)^2 \mathrm{d}\mu(x)} = \sqrt{1 - \exp(-D_{\mathrm{Bhat}}[p, q])}.$
The Hellinger distance is a metric distance.
Let $I α [ p : q ] : = ∫ X p ( x ) α q ( x ) 1 − α d μ ( x )$ denote the skewed affinity coefficient so that $D Bhat , α [ p : q ] = − log I α [ p : q ]$. Since $I α [ p : q ] = I 1 − α [ q : p ]$, we have $D Bhat , α [ p : q ] = D Bhat , 1 − α [ q : p ]$.
Consider an exponential family $E = { p θ }$ with log-normalizer $F ( θ )$. Then it is well-known that the $α$-skewed Bhattacharyya divergence between two densities of an exponential family amounts to a skewed Jensen divergence [30] (originally called Jensen difference in [32]):
$D Bhat , α [ p θ 1 : p θ 2 ] = J F , α ( θ 1 : θ 2 ) ,$
where the skewed Jensen divergence is defined by
$J F , α ( θ 1 : θ 2 ) = α F ( θ 1 ) + ( 1 − α ) F ( θ 2 ) − F ( α θ 1 + ( 1 − α ) θ 2 ) .$
The convexity of the log-normalizer $F ( θ )$ ensures that $J F , α ( θ 1 : θ 2 ) ≥ 0$. The Jensen divergence can be extended to the full range of real α by rescaling it by $\frac{1}{\alpha(1-\alpha)}$; see [33].
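The identity $D_{\mathrm{Bhat},\alpha}[p_{\theta_1} : p_{\theta_2}] = J_{F,\alpha}(\theta_1 : \theta_2)$ is easy to check numerically, e.g., for the exponential distributions whose log-normalizer is $F(\theta) = -\log\theta$. A Python sketch (assuming SciPy; the rates are arbitrary illustrative choices):

```python
# Check that the alpha-skewed Bhattacharyya divergence between two
# exponential distributions equals the skewed Jensen divergence of the
# log-normalizer F(theta) = -log(theta) (a sketch with arbitrary rates).
import numpy as np
from scipy.integrate import quad

F = lambda th: -np.log(th)   # log-normalizer of the exponential distribution

def jensen(th1, th2, alpha):
    # J_{F,alpha}(th1 : th2)
    return alpha * F(th1) + (1 - alpha) * F(th2) - F(alpha * th1 + (1 - alpha) * th2)

def bhat(l1, l2, alpha):
    # Numerical alpha-skewed Bhattacharyya divergence on [0, inf)
    p = lambda x: l1 * np.exp(-l1 * x)
    q = lambda x: l2 * np.exp(-l2 * x)
    I = quad(lambda x: p(x) ** alpha * q(x) ** (1 - alpha), 0, np.inf)[0]
    return -np.log(I)
```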
Remark 1.
The Bhattacharyya skewed divergence $D_{\mathrm{Bhat},\alpha}[p : q]$ appears naturally as the negative of the log-normalizer of the exponential family induced by the exponential arc $\{ r_\alpha(x) : \alpha \in (0,1) \}$ linking two densities p and q with $r_\alpha(x) \propto p(x)^\alpha q(x)^{1-\alpha}$. This arc is an exponential family of order 1:
$r_\alpha(x) = \exp\left(\alpha \log p(x) + (1-\alpha) \log q(x) - \log Z_\alpha(p:q)\right),$
$= \exp\left(\alpha \log \frac{p(x)}{q(x)} - F_{pq}(\alpha)\right) q(x).$
The sufficient statistic is $t(x) = \log \frac{p(x)}{q(x)}$, the natural parameter is $\alpha \in (0,1)$, and the log-normalizer is $F_{pq}(\alpha) = \log Z_\alpha(p:q) = \log \int p(x)^\alpha q(x)^{1-\alpha}\, \mathrm{d}\mu(x) = -D_{\mathrm{Bhat},\alpha}[p:q]$. This shows that $D_{\mathrm{Bhat},\alpha}[p:q]$ is concave with respect to α since log-normalizers $F_{pq}(\alpha)$ are always convex. Grünwald called those exponential families the likelihood ratio exponential families [34].
Now, consider calculating $D_{\mathrm{Bhat},\alpha}[p_{\theta_1} : q_{\theta_2}]$ where $p_{\theta_1} \in E_1$ with $E_1$ a truncated exponential family of $E_2$, and $q_{\theta_2} \in E_2 = \{q_\theta\}$. We have $q_\theta(x) = \frac{Z_1(\theta)}{Z_2(\theta)}\, p_\theta(x)$ on the support of $E_1$, where $Z_1(\theta)$ and $Z_2(\theta)$ are the partition functions of $E_1$ and $E_2$, respectively. Thus we have
$I_\alpha[p_{\theta_1} : q_{\theta_2}] = \left(\frac{Z_1(\theta_2)}{Z_2(\theta_2)}\right)^{1-\alpha} I_\alpha[p_{\theta_1} : p_{\theta_2}],$
and the $α$-skewed Bhattacharyya divergence is
$D Bhat , α [ p θ 1 : q θ 2 ] = D Bhat , α [ p θ 1 : p θ 2 ] − ( 1 − α ) ( F 1 ( θ 2 ) − F 2 ( θ 2 ) ) .$
Therefore we obtain
$D Bhat , α [ p θ 1 : q θ 2 ] = J F 1 , α ( θ 1 : θ 2 ) − ( 1 − α ) ( F 1 ( θ 2 ) − F 2 ( θ 2 ) ) ,$
$= α F 1 ( θ 1 ) + ( 1 − α ) F 2 ( θ 2 ) − F 1 ( α θ 1 + ( 1 − α ) θ 2 ) ,$
$= : J F 1 , F 2 , α ( θ 1 : θ 2 ) .$
We call $J F 1 , F 2 , α ( θ 1 : θ 2 )$ the duo Jensen divergence. Since $F 2 ( θ ) ≥ F 1 ( θ )$, we check that
$J F 1 , F 2 , α ( θ 1 : θ 2 ) ≥ J F 1 , α ( θ 1 : θ 2 ) ≥ 0 .$
Figure 7 illustrates graphically the duo Jensen divergence $J F 1 , F 2 , α ( θ 1 : θ 2 )$.
Theorem 2.
The α-skewed Bhattacharyya divergence for $α ∈ ( 0 , 1 )$ between a truncated density of $E 1$ with log-normalizer $F 1 ( θ )$ and another density of an exponential family $E 2$ with log-normalizer $F 2 ( θ )$ amounts to a duo Jensen divergence:
$D Bhat , α [ p θ 1 : q θ 2 ] = J F 1 , F 2 , α ( θ 1 : θ 2 ) ,$
where $J F 1 , F 2 , α ( θ 1 : θ 2 )$ is the duo skewed Jensen divergence induced by two strictly convex functions $F 1 ( θ )$ and $F 2 ( θ )$ such that $F 2 ( θ ) ≥ F 1 ( θ )$:
$J F 1 , F 2 , α ( θ 1 : θ 2 ) = α F 1 ( θ 1 ) + ( 1 − α ) F 2 ( θ 2 ) − F 1 ( α θ 1 + ( 1 − α ) θ 2 ) .$
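Theorem 2 can also be verified numerically, e.g., for a truncated exponential distribution (with $F_1(\theta) = -\log\theta + \log(e^{-\theta a} - e^{-\theta b})$) against a full exponential distribution (with $F_2(\theta) = -\log\theta$). A Python sketch (assuming SciPy; the truncation support and natural parameters are arbitrary illustrative choices):

```python
# Theorem 2 checked numerically: the skewed Bhattacharyya divergence between
# a truncated exponential distribution (support [a, b]) and a full one equals
# the duo Jensen divergence of their log-normalizers (a sketch).
import numpy as np
from scipy.integrate import quad

a, b = 0.0, 2.0   # truncation support, nested in [0, inf)
F1 = lambda th: -np.log(th) + np.log(np.exp(-th * a) - np.exp(-th * b))
F2 = lambda th: -np.log(th)   # F2 >= F1 since the truncated mass is < 1

def duo_jensen(th1, th2, alpha):
    # J_{F1,F2,alpha}(th1 : th2)
    return (alpha * F1(th1) + (1 - alpha) * F2(th2)
            - F1(alpha * th1 + (1 - alpha) * th2))

def bhat(th1, th2, alpha):
    # Numerical alpha-skewed Bhattacharyya divergence over the support [a, b]
    p = lambda x: th1 * np.exp(-th1 * x) / (np.exp(-th1 * a) - np.exp(-th1 * b))
    q = lambda x: th2 * np.exp(-th2 * x)
    I = quad(lambda x: p(x) ** alpha * q(x) ** (1 - alpha), a, b)[0]
    return -np.log(I)
```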
In [30], it is reported that
$D KL [ p θ 1 : p θ 2 ] = B F ( θ 2 : θ 1 ) ,$
$= lim α → 0 1 α J F , α ( θ 2 : θ 1 ) = lim α → 0 1 α J F , 1 − α ( θ 1 : θ 2 ) ,$
$= lim α → 0 1 α D Bhat , α [ p θ 2 : p θ 1 ] = lim α → 0 1 α D Bhat , 1 − α [ p θ 1 : p θ 2 ] .$
Indeed, using the first-order Taylor expansion
$F(\theta_1 + \alpha(\theta_2 - \theta_1)) \approx F(\theta_1) + \alpha\, (\theta_2 - \theta_1)^\top \nabla F(\theta_1)$
when $\alpha \to 0$, we check that we have
$\frac{1}{\alpha} J_{F,\alpha}(\theta_2 : \theta_1) = \frac{F(\theta_1) + \alpha (F(\theta_2) - F(\theta_1)) - F(\theta_1 + \alpha(\theta_2 - \theta_1))}{\alpha},$
$\approx \frac{F(\theta_1) + \alpha (F(\theta_2) - F(\theta_1)) - F(\theta_1) - \alpha (\theta_2 - \theta_1)^\top \nabla F(\theta_1)}{\alpha},$
$= F(\theta_2) - F(\theta_1) - (\theta_2 - \theta_1)^\top \nabla F(\theta_1),$
$=: B_F(\theta_2 : \theta_1).$
Thus we have $lim α → 0 1 α J F , α ( θ 2 : θ 1 ) = B F ( θ 2 : θ 1 )$.
Moreover, we have
$lim α → 0 1 α D Bhat , 1 − α [ p : q ] = D KL [ p : q ] .$
Similarly, we can prove that
$lim α → 1 1 1 − α J F 1 , F 2 , α ( θ 1 : θ 2 ) = B F 2 , F 1 ( θ 2 : θ 1 ) ,$
which can be reinterpreted as
$lim α → 1 1 1 − α D Bhat , α [ p θ 1 : q θ 2 ] = D KL [ p θ 1 : q θ 2 ] .$
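These limit identities can be illustrated numerically: for a small skewing parameter, the scaled Jensen divergence is already close to the corresponding Bregman divergence. A Python sketch using $F(\theta) = -\log\theta$ (the choice of generator and parameter values are illustrative):

```python
# Numerical illustration that scaled Jensen divergences tend to Bregman
# divergences as the skewing parameter goes to 0 (a sketch with F = -log).
import numpy as np

F = lambda th: -np.log(th)
dF = lambda th: -1.0 / th   # F'(theta)

def jensen(th1, th2, alpha):
    return alpha * F(th1) + (1 - alpha) * F(th2) - F(alpha * th1 + (1 - alpha) * th2)

def bregman(th, thp):
    # B_F(th : thp) = F(th) - F(thp) - (th - thp) F'(thp)
    return F(th) - F(thp) - (th - thp) * dF(thp)

th1, th2 = 2.0, 0.5
alpha = 1e-6
scaled = jensen(th2, th1, alpha) / alpha   # approaches B_F(th2 : th1)
```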

6. Concluding Remarks

We considered the Kullback–Leibler divergence between two parametric densities $p θ ∈ E 1$ and $q θ ′ ∈ E 2$ belonging to truncated exponential families [7] $E 1$ and $E 2$, and we showed that their KLD is equivalent to a duo Bregman divergence with swapped parameter order (Theorem 1). This result generalizes the study of Azoury and Warmuth [13]. The duo Bregman divergence can be rewritten as a duo Fenchel–Young divergence using mixed natural/moment parameterizations of the exponential family densities (Definition 1). This second result generalizes the approach taken in information geometry [15,35]. We showed how to calculate the Kullback–Leibler divergence between two truncated normal distributions as a duo Bregman divergence. More generally, we proved that the skewed Bhattacharyya distance between two parametric densities of truncated exponential families amounts to a duo Jensen divergence (Theorem 2). We showed that scaled duo Jensen divergences tend asymptotically to duo Bregman divergences, generalizing a result of [30,33]. This study of duo divergences induced by a pair of generators was motivated by the formula obtained for the Kullback–Leibler divergence between two densities of two different exponential families originally reported in [23] (Equation (29)).
It would be interesting to find applications of the duo Fenchel–Young, Bregman, and Jensen divergences beyond the scope of calculating statistical distances between truncated exponential family densities. Note that in [36], the authors exhibit a relationship between densities with nested supports and quasi-convex Bregman divergences; however, the parametric densities considered there are not exponential families since their supports depend on the parameter. Recently, Khan and Swaroop [37] used the duo Fenchel–Young divergence in machine learning for knowledge-adaptation priors in the so-called change regularizer task.

Funding

This research received no external funding.


Acknowledgments

The author would like to thank the three reviewers for their helpful comments, which led to this improved paper.

Conflicts of Interest

The author declares no conflict of interest.

References

1. Sundberg, R. Statistical Modelling by Exponential Families; Cambridge University Press: Cambridge, UK, 2019; Volume 12.
2. Pitman, E.J.G. Sufficient statistics and intrinsic accuracy. Math. Proc. Camb. Philos. Soc. 1936, 32, 567–579.
3. Darmois, G. Sur les lois de probabilité à estimation exhaustive. C. R. Acad. Sci. Paris 1935, 260, 85.
4. Koopman, B.O. On distributions admitting a sufficient statistic. Trans. Am. Math. Soc. 1936, 39, 399–409.
5. Hiejima, Y. Interpretation of the quasi-likelihood via the tilted exponential family. J. Jpn. Stat. Soc. 1997, 27, 157–164.
6. Efron, B.; Hastie, T. Computer Age Statistical Inference: Algorithms, Evidence, and Data Science; Cambridge University Press: Cambridge, UK, 2021; Volume 6.
7. Akahira, M. Statistical Estimation for Truncated Exponential Families; Springer: Berlin/Heidelberg, Germany, 2017.
8. Bar-Lev, S.K. Large sample properties of the MLE and MCLE for the natural parameter of a truncated exponential family. Ann. Inst. Stat. Math. 1984, 36, 217–222.
9. Shah, A.; Shah, D.; Wornell, G. A computationally efficient method for learning exponential family distributions. Adv. Neural Inf. Process. Syst. 2021, 34. Available online: https://proceedings.neurips.cc/paper/2021/hash/84f7e69969dea92a925508f7c1f9579a-Abstract.html (accessed on 15 March 2022).
10. Keener, R.W. Theoretical Statistics: Topics for a Core Course; Springer: Berlin/Heidelberg, Germany, 2010.
11. Cover, T.M. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 1999.
12. Csiszár, I. Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Magyar Tud. Akad. Mat. Kutató Int. Közl. 1964, 8, 85–108.
13. Azoury, K.S.; Warmuth, M.K. Relative loss bounds for on-line density estimation with the exponential family of distributions. Mach. Learn. 2001, 43, 211–246.
14. Rockafellar, R.T. Convex Analysis; Princeton University Press: Princeton, NJ, USA, 2015.
15. Amari, S.I. Differential-Geometrical Methods in Statistics; Lecture Notes in Statistics; Springer: New York, NY, USA, 1985; Volume 28.
16. Bregman, L.M. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 1967, 7, 200–217.
17. Acharyya, S. Learning to Rank in Supervised and Unsupervised Settings Using Convexity and Monotonicity. Ph.D. Thesis, The University of Texas at Austin, Austin, TX, USA, 2013.
18. Blondel, M.; Martins, A.F.; Niculae, V. Learning with Fenchel–Young losses. J. Mach. Learn. Res. 2020, 21, 1–69.
19. Nielsen, F. An elementary introduction to information geometry. Entropy 2020, 22, 1100.
20. Mitroi, F.C.; Niculescu, C.P. An extension of Young's inequality. Abstract and Applied Analysis; Hindawi: London, UK, 2011; Volume 2011.
21. Jeffreys, H. The Theory of Probability; OUP Oxford: Oxford, UK, 1998.
22. Nielsen, F.; Nock, R. Sided and symmetrized Bregman centroids. IEEE Trans. Inf. Theory 2009, 55, 2882–2904.
23. Nielsen, F. On a variational definition for the Jensen–Shannon symmetrization of distances based on the information radius. Entropy 2021, 23, 464.
24. Itakura, F.; Saito, S. Analysis synthesis telephony based on the maximum likelihood method. In Proceedings of the 6th International Congress on Acoustics, Tokyo, Japan, 21–28 August 1968; pp. 280–292.
25. Del Castillo, J. The singly truncated normal distribution: A non-steep exponential family. Ann. Inst. Stat. Math. 1994, 46, 57–66.
26. Burkardt, J. The Truncated Normal Distribution; Technical Report; Department of Scientific Computing, Florida State University: Tallahassee, FL, USA, 2014.
27. Johnson, N.L.; Kotz, S.; Balakrishnan, N. Continuous Univariate Distributions, Volumes I and II; John Wiley and Sons: Hoboken, NJ, USA, 1994.
28. Nielsen, F.; Nock, R. Entropies and cross-entropies of exponential families. In Proceedings of the 2010 IEEE International Conference on Image Processing, Hong Kong, China, 26–29 September 2010; pp. 3621–3624.
29. Bhattacharyya, A. On a measure of divergence between two statistical populations defined by their probability distributions. Bull. Calcutta Math. Soc. 1943, 35, 99–109.
30. Nielsen, F.; Boltz, S. The Burbea–Rao and Bhattacharyya centroids. IEEE Trans. Inf. Theory 2011, 57, 5455–5466.
31. Hellinger, E. Neue Begründung der Theorie quadratischer Formen von unendlichvielen Veränderlichen. J. Reine Angew. Math. 1909, 1909, 210–271.
32. Rao, C.R. Diversity and dissimilarity coefficients: A unified approach. Theor. Popul. Biol. 1982, 21, 24–43.
33. Zhang, J. Divergence function, duality, and convex analysis. Neural Comput. 2004, 16, 159–195.
34. Grünwald, P.D. The Minimum Description Length Principle; MIT Press: Cambridge, MA, USA, 2007.
35. Nielsen, F. The many faces of information geometry. Not. Am. Math. Soc. 2022, 69.
36. Nielsen, F.; Hadjeres, G. Quasiconvex Jensen divergences and quasiconvex Bregman divergences. In Workshop on Joint Structures and Common Foundations of Statistical Physics, Information Geometry and Inference for Learning; Springer: Berlin/Heidelberg, Germany, 2020; pp. 196–218.
37. Emtiyaz Khan, M.; Swaroop, S. Knowledge-Adaptation Priors. arXiv 2021, arXiv:2106.08769.
Figure 2. Visualizing the sided and symmetrized Bregman divergences.
Figure 4. The duo Bregman divergence induced by two strictly convex and differentiable functions $F 1$ and $F 2$ such that $F 1 ( θ ) ≥ F 2 ( θ )$. We check graphically that $B F 1 , F 2 ( θ : θ ′ ) ≥ B F 2 ( θ : θ ′ )$ (vertical gaps).
Figure 5. The duo half squared Euclidean distance $D a 2 ( θ : θ ′ ) : = a 2 θ 2 + 1 2 θ ′ 2 − θ θ ′$ is non-negative when $a ≥ 1$: (a) half squared Euclidean distance ($a = 1$), (b) $a = 2$, (c) $a = 1 2$, which shows that the divergence can be negative then since $a < 1$.
Figure 6. The Legendre transform reverses the dominance ordering: $F 1 ( θ ) = θ 2 ≥ F 2 ( θ ) = θ 4 ⇔ F 1 * ( η ) ≤ F 2 * ( η )$ for $θ ∈ [ 0 , 1 ]$.
Figure 7. The duo Jensen divergence $J F 1 , F 2 , α ( θ 1 : θ 2 )$ is greater than the Jensen divergence $J F 1 , α ( θ 1 : θ 2 )$ for $F 2 ( θ ) ≥ F 1 ( θ )$.
Table 1. Canonical decomposition of the Poisson and the geometric discrete exponential families.

| Quantity | Poisson family $P$ | Geometric family $Q$ |
| --- | --- | --- |
| support | $\mathbb{N} \cup \{0\}$ | $\mathbb{N} \cup \{0\}$ |
| base measure | counting measure | counting measure |
| ordinary parameter | rate $\lambda > 0$ | success probability $p \in (0, 1)$ |
| pmf | $\frac{\lambda^x}{x!} \exp(-\lambda)$ | $(1-p)^x\, p$ |
| sufficient statistic | $t_P(x) = x$ | $t_Q(x) = x$ |
| natural parameter | $\theta(\lambda) = \log \lambda$ | $\theta(p) = \log(1-p)$ |
| cumulant function | $F_P(\theta) = \exp(\theta)$ | $F_Q(\theta) = -\log(1 - \exp(\theta))$ |
| | $F_P(\lambda) = \lambda$ | $F_Q(p) = -\log(p)$ |
| auxiliary term | $k_P(x) = -\log x!$ | $k_Q(x) = 0$ |
| moment $\eta = E[t(x)]$ | $\eta = \lambda$ | $\eta = \frac{e^\theta}{1 - e^\theta} = \frac{1}{p} - 1$ |
| negentropy ($F^*(\eta) = \theta \cdot \eta - F(\theta)$) | $F_P^*(\eta(\lambda)) = \lambda \log \lambda - \lambda$ | $F_Q^*(\eta(p)) = \left(\frac{1}{p} - 1\right) \log(1-p) + \log p$ |
 Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
