Strongly Convex Divergences

Melbourne, James

doi:10.3390/e22111327

Department of Electrical and Computer Engineering, University of Minnesota-Twin Cities, Minneapolis, MN 55455, USA

Entropy 2020, 22(11), 1327; https://doi.org/10.3390/e22111327

Submission received: 2 September 2020 / Revised: 6 November 2020 / Accepted: 9 November 2020 / Published: 21 November 2020

(This article belongs to the Special Issue Divergence Measures: Mathematical Foundations and Applications in Information-Theoretic and Statistical Problems)

Abstract

We consider a sub-class of the f-divergences satisfying a stronger convexity property, which we refer to as strongly convex, or

κ

-convex divergences. We derive new and old relationships, based on convexity arguments, between popular f-divergences.

Keywords:

information measures; f-divergence; hypothesis testing; total variation; skew-divergence; convexity; Pinsker’s inequality; Bayes risk; Jensen–Shannon divergence

1. Introduction

The concept of an f-divergence, introduced independently by Ali and Silvey [1], Morimoto [2], and Csiszár [3], unifies several important information measures between probability distributions, as integrals of a convex function f, composed with the Radon–Nikodym derivative of the two probability distributions. (An additional assumption can be made that f is strictly convex at 1, to ensure that

D_{f} (μ | | ν) > 0

for

μ \neq ν

. This obviously holds for any

f^{″} (1) > 0

, and can hold for some f-divergences without classical derivatives at 1; for instance, the total variation is strictly convex at 1. An example of an f-divergence that is not strictly convex is provided by the so-called “hockey-stick” divergence, where

f (x) = {(x - γ)}_{+}

, see [4,5,6].) For a convex function

f : (0, \infty) \to R

such that

f (1) = 0

, and measures P and Q such that

P ≪ Q

, the f-divergence from P to Q is given by

D_{f} (P | | Q) : = \int f (\frac{d P}{d Q}) d Q .

The canonical example of an f-divergence, realized by taking

f (x) = x \log x

, is the relative entropy (often called the KL-divergence), which we denote with the subscript f omitted. f-divergences inherit many properties enjoyed by this special case: non-negativity, joint convexity of arguments, and a data processing inequality. Other important examples include the total variation, the

χ^{2}

-divergence, and the squared Hellinger distance. The reader is directed to Chapters 6 and 7 of [7] for more background.
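
As an illustration of the definition, the following sketch evaluates an f-divergence between two distributions on a finite set. It assumes that Q has full support and that logarithms are natural; the function names and the example vectors are illustrative choices, not notation from the article.

```python
import math

def f_divergence(f, p, q):
    """D_f(P||Q) = sum_x q(x) f(p(x)/q(x)); assumes every q(x) > 0."""
    return sum(qx * f(px / qx) for px, qx in zip(p, q))

def f_kl(t):
    # f(t) = t log t with the convention 0 log 0 = 0 (natural logarithm)
    return t * math.log(t) if t > 0 else 0.0

def f_chi2(t):
    # f(t) = (t - 1)^2 gives the Pearson chi-square divergence
    return (t - 1.0) ** 2

def f_tv(t):
    # f(t) = |t - 1| / 2 gives the total variation normalised to be at most 1
    return abs(t - 1.0) / 2.0

# Two hypothetical distributions on a three-point space.
p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]

kl_pq = f_divergence(f_kl, p, q)
chi2_pq = f_divergence(f_chi2, p, q)
tv_pq = f_divergence(f_tv, p, q)
```

Each divergence is obtained from the same routine by swapping the convex function f, which is the unification described above.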

We are interested in how stronger convexity properties of f give improvements of classical f-divergence inequalities. More explicitly, we consider consequences of f being

κ

-convex, in the sense that the map

x \mapsto f (x) - κ x^{2} / 2

is convex. This is in part inspired by the work of Sason [8], who demonstrated that divergences that are

κ

-convex satisfy “stronger than

χ^{2}

” data-processing inequalities.

Perhaps the most well-known example of an f-divergence inequality is Pinsker’s inequality, which bounds the square of the total variation above by a constant multiple of the relative entropy. That is, for probability measures P and Q,

{| P - Q |}_{T V}^{2} \leq c D (P | | Q)

. The optimal constant is achieved for Bernoulli measures, and under our conventions for total variation,

c = \frac{1}{2 \log e}

. Many extensions and sharpenings of Pinsker’s inequality exist (for examples, see [9,10,11]). Building on the work of Guntuboyina [9] and Topsøe [11], we achieve a further sharpening of Pinsker’s inequality in Theorem 9.
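
A quick numerical sanity check of Pinsker’s inequality on Bernoulli measures can be sketched as follows. It assumes natural logarithms, so that the inequality reads 2 |P - Q|_TV^2 <= D(P || Q); the grid and helper names are illustrative.

```python
import math

def binary_kl(s, t):
    """Relative entropy D(Bern(s) || Bern(t)) in nats, for 0 < t < 1."""
    total = 0.0
    for a, b in ((s, t), (1.0 - s, 1.0 - t)):
        if a > 0:
            total += a * math.log(a / b)
    return total

def pinsker_gap(s, t):
    """D - 2 TV^2, where TV = |s - t| for Bernoulli measures."""
    return binary_kl(s, t) - 2.0 * (s - t) ** 2

# The gap should be nonnegative on the whole grid.
grid = [i / 20 for i in range(1, 20)]
worst_gap = min(pinsker_gap(s, t) for s in grid for t in grid)

# Near s = t = 1/2 the ratio D / TV^2 approaches the optimal constant 2.
eps = 1e-3
ratio_near_half = binary_kl(0.5 + eps, 0.5) / eps ** 2
```

The ratio near 1/2 approaching 2 is consistent with the constant being attained in the limit by Bernoulli measures.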

Aside from the total variation, most divergences of interest have stronger than affine convexity, at least when f is restricted to a sub-interval of the real line. This observation is especially relevant to the situation in which one wishes to study

D_{f} (P | | Q)

in the presence of a bounded Radon–Nikodym derivative

\frac{d P}{d Q} \in (a, b) ⊊ (0, \infty)

. One naturally obtains such bounds for skew divergences, that is, divergences of the form

(P, Q) \mapsto D_{f} ((1 - t) P + t Q | | (1 - s) P + s Q)

for

t, s \in [0, 1]

, as in this case,

\frac{(1 - t) P + t Q}{(1 - s) P + s Q} \leq \max \{\frac{1 - t}{1 - s}, \frac{t}{s}\}

. Important examples of skew-divergences include the skew divergence [12] based on the relative entropy and the Vincze–Le Cam divergence [13,14], called the triangular discrimination in [11], together with its generalization due to Györfi and Vajda [15] based on the

χ^{2}

-divergence. The Jensen–Shannon divergence [16] and its recent generalization [17] give examples of f-divergences realized as linear combinations of skewed divergences.

Let us outline the paper. In Section 2, we derive elementary results of

κ

-convex divergences and give a table of examples of

κ

-convex divergences. We demonstrate that

κ

-convex divergences can be lower bounded by the

χ^{2}

-divergence, and that the joint convexity of the map

(P, Q) \mapsto D_{f} (P | | Q)

can be sharpened under

κ

-convexity conditions on f. As a consequence, we obtain bounds between the mean square total variation distance of a set of distributions from its barycenter, and the average f-divergence from the set to the barycenter.

In Section 3, we investigate general skewing of f-divergences. In particular, we introduce the skew-symmetrization of an f-divergence, which recovers the Jensen–Shannon divergence and the Vincze–Le Cam divergences as special cases. We also show that a scaling of the Vincze–Le Cam divergence is minimal among skew-symmetrizations of

κ

-convex divergences on

(0, 2)

. We then consider linear combinations of skew divergences and show that a generalized Vincze–Le Cam divergence (based on skewing the

χ^{2}

-divergence) can be upper bounded by the generalized Jensen–Shannon divergence introduced recently by Nielsen [17] (based on skewing the relative entropy), reversing the classical convexity bounds

D (P | | Q) \leq \log (1 + χ^{2} (P | | Q)) \leq \log e χ^{2} (P | | Q)

. We also derive upper and lower total variation bounds for Nielsen’s generalized Jensen–Shannon divergence.

In Section 4, we consider a family of densities

{p_{i}}

weighted by

λ_{i}

, and a density q. We use the Bayes estimator

T (x) = \arg \max_{i} λ_{i} p_{i} (x)

to derive a convex decomposition of the barycenter

p = \sum_{i} λ_{i} p_{i}

and of q, each into two auxiliary densities. (Recall, a Bayes estimator is one that minimizes the expected value of a loss function. By the assumptions of our model, that

P (θ = i) = λ_{i}

, and

P (X \in A | θ = i) = \int_{A} p_{i} (x) d x

, we have

E ℓ (θ, \hat{θ}) = 1 - \int λ_{\hat{θ} (x)} p_{\hat{θ} (x)} (x) d x

for the loss function

ℓ (i, j) = 1 - δ_{i} (j)

and any estimator

\hat{θ}

. It follows that

E ℓ (θ, \hat{θ}) \geq E ℓ (θ, T)

by

λ_{\hat{θ} (x)} p_{\hat{θ} (x)} (x) \leq λ_{T (x)} p_{T (x)} (x)

. Thus, T is a Bayes estimator associated to ℓ. ) We use this decomposition to sharpen, for

κ

-convex divergences, an elegant theorem of Guntuboyina [9] that generalizes Fano and Pinsker’s inequality to f-divergences. We then demonstrate explicitly, using an argument of Topsøe, how our sharpening of Guntuboyina’s inequality gives a new sharpening of Pinsker’s inequality in terms of the convex decomposition induced by the Bayes estimator.

Notation

Throughout, f denotes a convex function

f : (0, \infty) \to R \cup {\infty}

, such that

f (1) = 0

. For a convex function defined on

(0, \infty)

, we define

f (0) : = \lim_{x \to 0} f (x)

. We denote by

f^{*}

, the convex function

f^{*} : (0, \infty) \to R \cup {\infty}

defined by

f^{*} (x) = x f (x^{- 1})

. We consider Borel probability measures P and Q on a Polish space

X

and define the f-divergence from P to Q, via densities p for P and q for Q with respect to a common reference measure

μ

as

\begin{matrix} D_{f} (p | | q) & = \int_{X} f (\frac{p}{q}) q d μ \\ & = \int_{\{p q > 0\}} q f (\frac{p}{q}) d μ + f (0) Q (\{p = 0\}) + f^{*} (0) P (\{q = 0\}) . \end{matrix}

(1)

We note that this representation is independent of

μ

, and such a reference measure always exists, take

μ = P + Q

for example.

For

t, s \in [0, 1]

, define the binary f-divergence

D_{f} (t | | s) : = s f (\frac{t}{s}) + (1 - s) f (\frac{1 - t}{1 - s})

(2)

with the conventions,

f (0) = \lim_{t \to 0^{+}} f (t)

,

0 f (0 / 0) = 0

, and

0 f (a / 0) = a \lim_{t \to \infty} f (t) / t

. For a random variable X and a set A, we denote the probability that X takes a value in A by

P (X \in A)

, the expectation of the random variable by

E X

, and the variance by

Var (X) : = E {| X - E X |}^{2}

. For a probability measure

μ

satisfying

μ (A) = P (X \in A)

for all Borel A, we write

X \sim μ

, and, when there exists a probability density function such that

P (X \in A) = \int_{A} f (x) d γ (x)

for a reference measure

γ

, we write

X \sim f

. For a probability measure

μ

on

X

, and an

L^{2}

function

f : X \to R

, we denote

{Var}_{μ} (f) : = Var (f (X))

for

X \sim μ

.

2. Strongly Convex Divergences

Definition 1.

A

R \cup {\infty}

-valued function f on a convex set

K \subseteq R

is κ-convex when

x, y \in K

and

t \in [0, 1]

implies

f ((1 - t) x + t y) \leq (1 - t) f (x) + t f (y) - κ t (1 - t) {(x - y)}^{2} / 2 .

(3)

For example, when f is twice differentiable, (3) is equivalent to

f^{″} (x) \geq κ

for

x \in K

. Note that the case

κ = 0

is just usual convexity.
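
Inequality (3) can be probed numerically. The sketch below takes f(x) = x log x (natural logarithm), whose second derivative 1/x is at least 1/b on (0, b], and samples random points to check (3) with κ = 1/b; it also shows that a larger κ fails. The interval, sample size, and names are illustrative assumptions.

```python
import math
import random

def kappa_convexity_gap(f, kappa, x, y, t):
    """Right side minus left side of (3); nonnegative when f is kappa-convex."""
    rhs = (1 - t) * f(x) + t * f(y) - kappa * t * (1 - t) * (x - y) ** 2 / 2
    return rhs - f((1 - t) * x + t * y)

def f_kl(x):
    return x * math.log(x)

rng = random.Random(0)
b = 2.0
samples = [(rng.uniform(0.1, b), rng.uniform(0.1, b), rng.random())
           for _ in range(1000)]

# f'' = 1/x >= 1/b on (0, b], so kappa = 1/b should satisfy (3).
min_gap = min(kappa_convexity_gap(f_kl, 1 / b, x, y, t) for x, y, t in samples)

# kappa = 1 exceeds f'' near x = 2, so (3) is violated there.
gap_too_strong = kappa_convexity_gap(f_kl, 1.0, 1.8, 2.0, 0.5)
```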

Proposition 1.

For

f : K \to R \cup {\infty}

and

κ \in [0, \infty)

, the following are equivalent:

f is κ-convex.
The function $f - κ {(t - a)}^{2} / 2$ is convex for any $a \in R$ .
The right handed derivative, defined as $f_{+}^{'} (t) : = \lim_{h ↓ 0} \frac{f (t + h) - f (t)}{h}$ satisfies,

$f_{+}^{'} (t) \geq f_{+}^{'} (s) + κ (t - s)$

for $t \geq s$ .

Proof.

Observe that it is enough to prove the result when

κ = 0

, where the proposition is reduced to the classical result for convex functions. □

Definition 2.

An f-divergence

D_{f}

is κ-convex on an interval K for

κ \geq 0

when the function f is κ-convex on K.

Table 1 lists some

κ

-convex f-divergences of interest to this article.

Observe that we have taken the normalization convention on the total variation (the total variation for a signed measure

μ

on a space X can be defined through the Hahn-Jordan decomposition of the measure into non-negative measures

μ^{+}

and

μ^{-}

such that

μ = μ^{+} - μ^{-}

, as

∥ μ ∥ = μ^{+} (X) + μ^{-} (X)

(see [18]); in our notation,

{| μ |}_{T V} = ∥ μ ∥ / 2

) which we denote by

{| P - Q |}_{T V}

, such that

{| P - Q |}_{T V} = \sup_{A} | P (A) - Q (A) | \leq 1

. In addition, note that the

α

-divergence interpolates Pearson’s

χ^{2}

-divergence when

α = 3

, one half Neyman’s

χ^{2}

-divergence when

α = - 3

, the squared Hellinger divergence when

α = 0

, and has limiting cases, the relative entropy when

α = 1

and the reverse relative entropy when

α = - 1

. If f is

κ

-convex on

[a, b]

, then the function defining its dual divergence,

f^{*} (x) : = x f (x^{- 1})

is

κ a^{3}

-convex on

[\frac{1}{b}, \frac{1}{a}]

. Recall that

f^{*}

satisfies the equality

D_{f^{*}} (P | | Q) = D_{f} (Q | | P)

. For brevity, we use

χ^{2}

-divergence to refer to the Pearson

χ^{2}

-divergence, and we articulate Neyman’s

χ^{2}

explicitly when necessary.

The next lemma is a restatement of Jensen’s inequality.

Lemma 1.

If f is κ-convex on the range of X,

E f (X) \geq f (E (X)) + \frac{κ}{2} Var (X) .

Proof.

Apply Jensen’s inequality to

f (x) - κ x^{2} / 2

. □
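
Lemma 1 is easy to check on a finitely supported random variable. The sketch below computes the gap E f(X) - f(E X) - (κ/2) Var(X); for f(x) = x^2 with κ = 2 the gap vanishes, and for f(x) = x log x on [1/2, 2] with κ = 1/2 it is nonnegative. The support points and weights are hypothetical.

```python
import math

def jensen_gap(f, kappa, values, probs):
    """E f(X) - f(E X) - (kappa / 2) Var(X) for a finitely supported X."""
    mean = sum(w * x for x, w in zip(values, probs))
    var = sum(w * (x - mean) ** 2 for x, w in zip(values, probs))
    expected_f = sum(w * f(x) for x, w in zip(values, probs))
    return expected_f - f(mean) - kappa * var / 2

values = [0.5, 1.0, 2.0]
probs = [0.2, 0.5, 0.3]

# f(x) = x^2 is 2-convex and the inequality holds with equality.
square_gap = jensen_gap(lambda x: x * x, 2.0, values, probs)

# f(x) = x log x satisfies f'' = 1/x >= 1/2 on [1/2, 2].
kl_gap = jensen_gap(lambda x: x * math.log(x), 0.5, values, probs)
```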

For a convex function f such that

f (1) = 0

and

c \in R

, the function

\tilde{f} (t) = f (t) + c (t - 1)

remains a convex function, and what is more satisfies

D_{f} (P | | Q) = D_{\tilde{f}} (P | | Q)

since

\int c (p / q - 1) q d μ = 0

.

Definition 3

(

χ^{2}

-divergence). For

f (t) = {(t - 1)}^{2}

, we write

χ^{2} (P | | Q) : = D_{f} (P | | Q) .

We pursue a generalization of the following bound on the total variation by the

χ^{2}

-divergence [19,20,21].

Theorem 1

([19,20,21]). For measures P and Q,

\begin{matrix} {| P - Q |}_{T V}^{2} \leq \frac{χ^{2} (P | | Q)}{2} . \end{matrix}

(4)

We mention the work of Harremoës and Vajda [20], in which it is shown, through a characterization of the extreme points of the joint range associated to a pair of f-divergences (valid in general), that the inequality characterizes the “joint range”, that is, the range of the function

(P, Q) \mapsto ({| P - Q |}_{T V}, χ^{2} (P | | Q))

. We use the following lemma, which shows that every strongly convex divergence can be lower bounded, up to its convexity constant

κ > 0

, by the

χ^{2}

-divergence.

Lemma 2.

For a κ-convex f,

D_{f} (P | | Q) \geq \frac{κ}{2} χ^{2} (P | | Q) .

Proof.

Define a

\tilde{f} (t) = f (t) - f_{+}^{'} (1) (t - 1)

and note that

\tilde{f}

defines the same

κ

-convex divergence as f. Thus, we may assume without loss of generality that

f_{+}^{'}

is uniquely zero when

t = 1

. Since f is

κ

-convex

ϕ : t \mapsto f (t) - κ {(t - 1)}^{2} / 2

is convex, and, by

f_{+}^{'} (1) = 0

,

ϕ_{+}^{'} (1) = 0

as well. Thus,

ϕ

takes its minimum when

t = 1

and hence

ϕ \geq 0

so that

f (t) \geq κ {(t - 1)}^{2} / 2

. Computing,

\begin{matrix} D_{f} (P | | Q) & = \int f (\frac{d P}{d Q}) d Q \\ \geq \frac{κ}{2} \int {(\frac{d P}{d Q} - 1)}^{2} d Q \\ = \frac{κ}{2} χ^{2} (P | | Q) . \end{matrix}

□

Based on a Taylor series expansion of f about 1, Nielsen and Nock ([22], [Corollary 1]) gave the estimate

\begin{matrix} D_{f} (P | | Q) \approx \frac{f^{″} (1)}{2} χ^{2} (P | | Q) \end{matrix}

(5)

for divergences with a non-zero second derivative and P close to Q. Lemma 2 complements this estimate with a lower bound, when f is

κ

-convex. In particular, if

f^{″} (1) = κ

, it shows that the approximation in (5) is an underestimate.

Theorem 2.

For measures P and Q, and a κ-convex divergence

D_{f}

,

\begin{matrix} {| P - Q |}_{T V}^{2} \leq \frac{D_{f} (P | | Q)}{κ} . \end{matrix}

(6)

Proof.

By Lemma 2 and then Theorem 1,

\begin{matrix} \frac{D_{f} (P | | Q)}{κ} \geq \frac{χ^{2} (P | | Q)}{2} \geq {| P - Q |}_{T V}^{2} . \end{matrix}

(7)

□
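
Lemma 2 and Theorem 2 can be illustrated with the relative entropy, under the assumption that the density ratio dP/dQ is bounded by b so that f(t) = t log t is (1/b)-convex on the range of the ratio (natural logarithms throughout). The distributions and helper names below are illustrative.

```python
import math
import random

def kl(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def chi2(p, q):
    return sum((px - qx) ** 2 / qx for px, qx in zip(p, q))

def tv(p, q):
    return 0.5 * sum(abs(px - qx) for px, qx in zip(p, q))

def gaps(p, q):
    """Slack in Lemma 2 and in Theorem 2 with kappa = 1 / sup(dP/dQ)."""
    b = max(px / qx for px, qx in zip(p, q))
    kappa = 1.0 / b
    lemma2 = kl(p, q) - kappa / 2 * chi2(p, q)
    theorem2 = kl(p, q) / kappa - tv(p, q) ** 2
    return lemma2, theorem2

lemma2_gap, theorem2_gap = gaps([0.5, 0.3, 0.2], [0.25, 0.25, 0.5])

# Repeat on random pairs of distributions with full support.
rng = random.Random(1)
def random_dist(n):
    raw = [rng.uniform(0.05, 1.0) for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

all_nonnegative = all(min(gaps(random_dist(4), random_dist(4))) >= -1e-12
                      for _ in range(200))
```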

The proof of Lemma 2 uses a pointwise inequality between convex functions to derive an inequality between their respective divergences. This simple technique was shown to have useful implications by Sason and Verdú in [6], where it appears as Theorem 1 and is used to give sharp comparisons in several f-divergence inequalities.

Theorem 3

(Sason–Verdú [6]). For divergences defined by g and f with

c f (t) \geq g (t)

for all t, we have

D_{g} (P | | Q) \leq c D_{f} (P | | Q) .

Moreover, if

f^{'} (1) = g^{'} (1) = 0

, then

\sup_{P \neq Q} \frac{D_{g} (P | | Q)}{D_{f} (P | | Q)} = \sup_{t \neq 1} \frac{g (t)}{f (t)} .

Corollary 1.

For a smooth κ-convex divergence f, the inequality

D_{f} (P | | Q) \geq \frac{κ}{2} χ^{2} (P | | Q)

(8)

is sharp multiplicatively in the sense that

\inf_{P \neq Q} \frac{D_{f} (P | | Q)}{χ^{2} (P | | Q)} = \frac{κ}{2} .

(9)

if

f^{″} (1) = κ

.

In information geometry, a standard f-divergence is defined as an f-divergence satisfying the normalization

f (1) = f^{'} (1) = 0, f^{″} (1) = 1

(see [23]). Thus, Corollary 1 shows that

\frac{1}{2} χ^{2}

provides a sharp lower bound on every standard f-divergence that is 1-convex. In particular, the lower bound in Lemma 2 complementing the estimate (5) is shown to be sharp.

Proof.

Without loss of generality, we assume that

f^{'} (1) = 0

. If

f^{″} (1) = κ + 2 ε

for some

ε > 0

, then taking

g (t) = {(t - 1)}^{2}

and applying Theorem 3 and Lemma 2

\sup_{P \neq Q} \frac{D_{g} (P | | Q)}{D_{f} (P | | Q)} = \sup_{t \neq 1} \frac{g (t)}{f (t)} \leq \frac{2}{κ} .

(10)

Observe that, after two applications of L’Hospital,

\lim_{ε \to 0} \frac{g (1 + ε)}{f (1 + ε)} = \lim_{ε \to 0} \frac{g^{'} (1 + ε)}{f^{'} (1 + ε)} = \frac{g^{″} (1)}{f^{″} (1)} = \frac{2}{κ} \leq \sup_{t \neq 1} \frac{g (t)}{f (t)} .

Thus, (9) follows. □

Proposition 2.

When

D_{f}

is an f-divergence such that f is κ-convex on

[a, b]

and that

P_{θ}

and

Q_{θ}

are probability measures indexed by a set Θ such that

a \leq \frac{d P_{θ}}{d Q_{θ}} (x) \leq b

, holds for all θ and

P : = \int_{Θ} P_{θ} d μ (θ)

and

Q : = \int_{Θ} Q_{θ} d μ (θ)

for a probability measure μ on Θ, then

D_{f} (P | | Q) \leq \int_{Θ} D_{f} (P_{θ} | | Q_{θ}) d μ (θ) - \frac{κ}{2} \int_{Θ} \int_{X} {(\frac{d P_{θ}}{d Q_{θ}} - \frac{d P}{d Q})}^{2} d Q d μ,

(11)

In particular, when

Q_{θ} = Q

for all θ

\begin{matrix} D_{f} (P | | Q) \\ \leq \int_{Θ} D_{f} (P_{θ} | | Q) d μ (θ) - \frac{κ}{2} \int_{Θ} \int_{X} {(\frac{d P_{θ}}{d Q} - \frac{d P}{d Q})}^{2} d Q d μ (θ) \\ \leq \int_{Θ} D_{f} (P_{θ} | | Q) d μ (θ) - κ \int_{Θ} {| P_{θ} - P |}_{T V}^{2} d μ (θ) \end{matrix}

(12)

Proof.

Let

d θ

denote a reference measure dominating

μ

so that

d μ = φ (θ) d θ

then write

ν_{θ} = ν (θ, x) = \frac{d Q_{θ}}{d Q} (x) φ (θ)

.

\begin{matrix} D_{f} (P | | Q) & = \int_{X} f (\frac{d P}{d Q}) d Q \\ = \int_{X} f (\int_{Θ} \frac{d P_{θ}}{d Q} d μ (θ)) d Q \\ = \int_{X} f (\int_{Θ} \frac{d P_{θ}}{d Q_{θ}} ν (θ, x) d θ) d Q \end{matrix}

(13)

By Jensen’s inequality, as in Lemma 1,

f (\int_{Θ} \frac{d P_{θ}}{d Q_{θ}} ν_{θ} d θ) \leq \int_{Θ} f (\frac{d P_{θ}}{d Q_{θ}}) ν_{θ} d θ - \frac{κ}{2} \int_{Θ} {(\frac{d P_{θ}}{d Q_{θ}} - \int_{Θ} \frac{d P_{θ_{0}}}{d Q_{θ_{0}}} ν_{θ_{0}} d θ_{0})}^{2} ν_{θ} d θ .

Integrating this inequality gives

D_{f} (P | | Q) \leq \int_{X} (\int_{Θ} f (\frac{d P_{θ}}{d Q_{θ}}) ν_{θ} d θ - \frac{κ}{2} \int_{Θ} {(\frac{d P_{θ}}{d Q_{θ}} - \int_{Θ} \frac{d P_{θ_{0}}}{d Q_{θ_{0}}} ν_{θ_{0}} d θ_{0})}^{2} ν_{θ} d θ) d Q

(14)

Note that

\int_{X} \int_{Θ} {(\frac{d P_{θ}}{d Q_{θ}} - \int_{Θ} \frac{d P_{θ_{0}}}{d Q_{θ_{0}}} ν_{θ_{0}} d θ_{0})}^{2} ν_{θ} d θ d Q = \int_{Θ} \int_{X} {(\frac{d P_{θ}}{d Q_{θ}} - \frac{d P}{d Q})}^{2} d Q d μ,

and

\begin{matrix} \int_{X} \int_{Θ} f (\frac{d P_{θ}}{d Q_{θ}}) ν (θ, x) d θ d Q & = \int_{Θ} \int_{X} f (\frac{d P_{θ}}{d Q_{θ}}) ν (θ, x) d Q d θ \\ & = \int_{Θ} \int_{X} f (\frac{d P_{θ}}{d Q_{θ}}) d Q_{θ} d μ (θ) \\ & = \int_{Θ} D_{f} (P_{θ} | | Q_{θ}) d μ (θ) \end{matrix}

(15)

Inserting these equalities into (14) gives the result.

To obtain the total variation bound, one needs only to apply Jensen’s inequality,

\begin{matrix} \int_{X} {(\frac{d P_{θ}}{d Q} - \frac{d P}{d Q})}^{2} d Q & \geq {(\int_{X} |\frac{d P_{θ}}{d Q} - \frac{d P}{d Q}| d Q)}^{2} \\ & = 4 {| P_{θ} - P |}_{T V}^{2} . \end{matrix}

(16)

□
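
The chain of inequalities (12) can be checked on a finite mixture. The sketch below uses the relative entropy with a common Q, taking κ = 1/b where b is the largest density ratio among the components (natural logarithms). The component distributions and weights are hypothetical.

```python
import math

def kl(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def tv(p, q):
    return 0.5 * sum(abs(px - qx) for px, qx in zip(p, q))

def mixture(components, weights):
    size = len(components[0])
    return [sum(w * c[k] for c, w in zip(components, weights)) for k in range(size)]

def proposition2_terms(components, weights, q):
    """Return the three expressions compared in (12), left to right."""
    p_bar = mixture(components, weights)
    b = max(c[k] / q[k] for c in components for k in range(len(q)))
    kappa = 1.0 / b  # t log t has second derivative 1/t >= 1/b on (0, b]
    avg = sum(w * kl(c, q) for c, w in zip(components, weights))
    # integral of (dP_theta/dQ - dP/dQ)^2 dQ, averaged over theta
    quad = sum(w * sum((c[k] - p_bar[k]) ** 2 / q[k] for k in range(len(q)))
               for c, w in zip(components, weights))
    tv_sq = sum(w * tv(c, p_bar) ** 2 for c, w in zip(components, weights))
    return kl(p_bar, q), avg - kappa * quad / 2, avg - kappa * tv_sq

left, middle, right = proposition2_terms(
    [[0.5, 0.3, 0.2], [0.1, 0.4, 0.5]], [0.5, 0.5], [0.25, 0.25, 0.5])
```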

Observe that, taking

Q = P = \int_{Θ} P_{θ} d μ (θ)

in Proposition 2, one obtains a lower bound for the average f-divergence from the set of distributions to their barycenter, by the mean square total variation of the set of distributions to the barycenter,

κ \int_{Θ} {| P_{θ} - P |}_{T V}^{2} d μ (θ) \leq \int_{Θ} D_{f} (P_{θ} | | P) d μ (θ) .

(17)

An alternative proof of this can be obtained by applying

{| P_{θ} - P |}_{T V}^{2} \leq D_{f} (P_{θ} | | P) / κ

from Theorem 2 pointwise.

The next result shows that, for f strongly convex, Pinsker type inequalities can never be reversed.

Proposition 3.

Given f strongly convex and

M > 0

, there exist measures P and Q such that

D_{f} (P | | Q) \geq M {| P - Q |}_{T V} .

(18)

Proof.

By

κ

-convexity

ϕ (t) = f (t) - κ t^{2} / 2

is a convex function. Thus,

ϕ (t) \geq ϕ (1) + ϕ_{+}^{'} (1) (t - 1) = - \frac{κ}{2} + (f_{+}^{'} (1) - κ) (t - 1)

and hence

\lim_{t \to \infty} \frac{f (t)}{t} \geq \lim_{t \to \infty} (\frac{κ t}{2} - \frac{κ}{2 t} + (f_{+}^{'} (1) - κ) (1 - \frac{1}{t})) = \infty .

Taking measures on the two points space

P = \{\frac{1}{2}, \frac{1}{2}\}

and

Q = \{\frac{1}{2 t}, 1 - \frac{1}{2 t}\}

gives

D_{f} (P | | Q) \geq \frac{1}{2} \frac{f (t)}{t}

which tends to infinity with

t \to \infty

, while

{| P - Q |}_{T V} \leq 1

. □
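
The two-point construction in the proof can be made concrete for the χ²-divergence, which is 2-convex. The sketch below evaluates the ratio of the divergence to the total variation for increasing t and illustrates that it is unbounded; the chosen values of t are arbitrary.

```python
def chi2_two_point(t):
    """chi^2(P||Q) for P = (1/2, 1/2) and Q = (1/(2t), 1 - 1/(2t)), t > 1."""
    p = (0.5, 0.5)
    q = (1 / (2 * t), 1 - 1 / (2 * t))
    return sum((a - b) ** 2 / b for a, b in zip(p, q))

def tv_two_point(t):
    """Total variation between the same P and Q (at most 1)."""
    return abs(0.5 - 1 / (2 * t))

# The ratio grows roughly linearly in t, so no reverse Pinsker bound holds.
ratios = [chi2_two_point(t) / tv_two_point(t) for t in (2, 10, 100, 1000)]
```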

In fact, building on the work of Basu-Shioya-Park [24] and Vajda [25], Sason and Verdú proved [6] that, for any f-divergence,

\sup_{P \neq Q} \frac{D_{f} (P | | Q)}{{| P - Q |}_{T V}} = f (0) + f^{*} (0)

. Thus, an f-divergence can be bounded above by a constant multiple of the total variation, if and only if

f (0) + f^{*} (0) < \infty

. From this perspective, Proposition 3 is simply the obvious fact that strongly convex functions have super linear (at least quadratic) growth at infinity.

3. Skew Divergences

If we denote

C v x (0, \infty)

to be the quotient of the cone of convex functions f on

(0, \infty)

such that

f (1) = 0

under the equivalence relation

f_{1} \sim f_{2}

when

f_{1} - f_{2} = c (x - 1)

for

c \in R

, then the map

f \mapsto D_{f}

gives a linear isomorphism between

C v x (0, \infty)

and the space of all f-divergences. The mapping

T : C v x (0, \infty) \to C v x (0, \infty)

defined by

T f = f^{*}

, where we recall

f^{*} (t) = t f (t^{- 1})

, gives an involution of

C v x (0, \infty)

. Indeed,

D_{T f} (P | | Q) = D_{f} (Q | | P)

, so that

D_{T (T (f))} (P | | Q) = D_{f} (P | | Q)

. Mathematically, skew divergences give an interpolation of this involution as

(P, Q) \mapsto D_{f} ((1 - t) P + t Q | | (1 - s) P + s Q)

gives

D_{f} (P | | Q)

by taking

s = 1

and

t = 0

or yields

D_{f^{*}} (P | | Q)

by taking

s = 0

and

t = 1

.

Moreover, as mentioned in the Introduction, skewing imposes boundedness of the Radon–Nikodym derivative

\frac{d P}{d Q}

, which allows us to constrain the domain of f-divergences and leverage

κ

-convexity to obtain f-divergence inequalities in this section.

The following appears as Theorem III.1 in the preprint [26]. It states that skewing an f-divergence preserves its status as such. This guarantees that the generalized skew divergences of this section are indeed f-divergences. A proof is given in the Appendix A for the convenience of the reader.

Theorem 4

(Melbourne et al. [26]). For

t, s \in [0, 1]

and a divergence

D_{f}

,

S_{f} (P | | Q) : = D_{f} ((1 - t) P + t Q | | (1 - s) P + s Q)

(19)

is an f-divergence as well.

Definition 4.

For an f-divergence, its skew symmetrization is

Δ_{f} (P | | Q) : = \frac{1}{2} D_{f} (P | | \frac{P + Q}{2}) + \frac{1}{2} D_{f} (Q | | \frac{P + Q}{2}) .

Δ_{f}

is determined by the convex function

x \mapsto \frac{1 + x}{4} (f (\frac{2 x}{1 + x}) + f (\frac{2}{1 + x})) .

(20)

Observe that

Δ_{f} (P | | Q) = Δ_{f} (Q | | P)

, and when

f (0) < \infty

,

Δ_{f} (P | | Q) \leq \sup_{x \in [0, 2]} f (x) < \infty

for all

P, Q

since

\frac{d P}{d (P + Q) / 2}

,

\frac{d Q}{d (P + Q) / 2} \leq 2

. When

f (x) = x \log x

, the relative entropy’s skew symmetrization is the Jensen–Shannon divergence. When

f (x) = {(x - 1)}^{2}

up to a normalization constant the

χ^{2}

-divergence’s skew symmetrization is the Vincze–Le Cam divergence which we state below for emphasis. The work of Topsøe [11] provides more background on this divergence, where it is referred to as the triangular discrimination.

Definition 5.

When

f (t) = \frac{{(t - 1)}^{2}}{t + 1}

, denote the Vincze–Le Cam divergence by

Δ (P | | Q) : = D_{f} (P | | Q) .

If one denotes the skew symmetrization of the

χ^{2}

-divergence by

Δ_{χ^{2}}

, one can compute easily from (20) that

Δ_{χ^{2}} (P | | Q) = Δ (P | | Q) / 2

. We note that although skewing preserves 0-convexity, by the above example, it does not preserve

κ

-convexity in general. The

χ^{2}

-divergence is a 2-convex divergence, while

f (t) = {(t - 1)}^{2} / (t + 1)

corresponding to the Vincze–Le Cam divergence satisfies

f^{″} (t) = \frac{8}{{(t + 1)}^{3}}

, which cannot be bounded away from zero on

(0, \infty)

.
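
The relationship between the skew symmetrization of the χ²-divergence and the Vincze–Le Cam divergence can be confirmed directly from Definition 4 on a finite space. This is a minimal sketch assuming P + Q has full support; the function names are illustrative.

```python
def f_divergence(f, p, q):
    """D_f(P||Q) on a finite space; assumes every q(x) > 0."""
    return sum(qx * f(px / qx) for px, qx in zip(p, q))

def skew_symmetrization(f, p, q):
    """Definition 4: average divergence of P and Q from their midpoint."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * f_divergence(f, p, m) + 0.5 * f_divergence(f, q, m)

def vincze_le_cam(p, q):
    """Delta(P||Q) = sum (p - q)^2 / (p + q), from f(t) = (t - 1)^2 / (t + 1)."""
    return sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)

def f_chi2(t):
    return (t - 1.0) ** 2

p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]
delta_chi2 = skew_symmetrization(f_chi2, p, q)
delta = vincze_le_cam(p, q)
```

On this example `delta_chi2` agrees with `delta / 2` up to rounding, and swapping the arguments leaves both quantities unchanged.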

Corollary 2.

For an f-divergence such that f is κ-convex on

(0, 2)

,

Δ_{f} (P | | Q) \geq \frac{κ}{4} Δ (P | | Q) = \frac{κ}{2} Δ_{χ^{2}} (P | | Q),

(21)

with equality when

f (t) = {(t - 1)}^{2}

, corresponding to the

χ^{2}

-divergence, where

Δ_{f}

denotes the skew symmetrized divergence associated to f and Δ is the Vincze–Le Cam divergence.

Proof.

Applying Proposition 2

\begin{matrix} 0 & = D_{f} (\frac{P + Q}{2} | | \frac{Q + P}{2}) \\ \leq \frac{1}{2} D_{f} (P | | \frac{Q + P}{2}) + \frac{1}{2} D_{f} (Q | | \frac{Q + P}{2}) - \frac{κ}{8} \int {(\frac{2 P}{P + Q} - \frac{2 Q}{P + Q})}^{2} d (P + Q) / 2 \\ = Δ_{f} (P | | Q) - \frac{κ}{4} Δ (P | | Q) . \end{matrix}

□

When

f (x) = x \log x

, we have

f^{″} (x) \geq \frac{\log e}{2}

on

[0, 2]

, which demonstrates that up to a constant

\frac{\log e}{8}

the Jensen–Shannon divergence bounds the Vincze–Le Cam divergence (see [11] for improvement of the inequality in the case of the Jensen–Shannon divergence, called the “capacitory discrimination” in the reference, by a factor of 2).
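
The resulting comparison between the Jensen–Shannon and Vincze–Le Cam divergences can be tested numerically. The sketch below assumes natural logarithms, in which case the constant is 1/8, and checks JSD >= Δ/8 on random pairs of distributions; the helper names and sample sizes are illustrative.

```python
import math
import random

def kl(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: skew symmetrization of the relative entropy."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def vincze_le_cam(p, q):
    return sum((a - b) ** 2 / (a + b) for a, b in zip(p, q) if a + b > 0)

rng = random.Random(2)
def random_dist(n):
    raw = [rng.uniform(0.01, 1.0) for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

pairs = [(random_dist(5), random_dist(5)) for _ in range(200)]
# Smallest observed slack in JSD - Delta / 8; expected to be nonnegative.
min_slack = min(jsd(p, q) - vincze_le_cam(p, q) / 8 for p, q in pairs)
```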

We now investigate more general, non-symmetric skewing in what follows.

Proposition 4.

For

α, β \in [0, 1]

, define

C (α) : = \begin{cases} 1 - α & \text{when } α \leq β \\ α & \text{when } α > β, \end{cases}

(22)

and

S_{α, β} (P | | Q) : = D ((1 - α) P + α Q | | (1 - β) P + β Q) .

(23)

Then,

S_{α, β} (P | | Q) \leq C (α) D_{\infty} (α | | β) {| P - Q |}_{T V},

(24)

where

D_{\infty} (α | | β) : = \log (\max \{\frac{α}{β}, \frac{1 - α}{1 - β}\})

is the binary ∞-Rényi divergence [27].

We need the following lemma originally proved by Audenaert in the quantum setting [28]. It is based on a differential relationship between the skew divergence [12] and the divergence of Györfi and Vajda [15] (see [29,30]).

Lemma 3

(Theorem III.1 [26]). For P and Q probability measures and

t \in [0, 1]

,

S_{0, t} (P | | Q) \leq (- \log t) {| P - Q |}_{T V} .

(25)

Proof of Proposition 4.

If

α \leq β

, then

D_{\infty} (α | | β) = \log \frac{1 - α}{1 - β}

and

C (α) = 1 - α

. In addition,

(1 - β) P + β Q = t ((1 - α) P + α Q) + (1 - t) Q

(26)

with

t = \frac{1 - β}{1 - α}

, thus

\begin{matrix} S_{α, β} (P | | Q) & = S_{0, t} ((1 - α) P + α Q | | Q) \\ & \leq (- \log t) {| ((1 - α) P + α Q) - Q |}_{T V} \\ & = C (α) D_{\infty} (α | | β) {| P - Q |}_{T V}, \end{matrix}

(27)

where the inequality follows from Lemma 3. Following the same argument for

α > β

, so that

C (α) = α

,

D_{\infty} (α | | β) = \log \frac{α}{β}

, and

(1 - β) P + β Q = t ((1 - α) P + α Q) + (1 - t) P

(28)

for

t = \frac{β}{α}

completes the proof. Indeed,

\begin{matrix} S_{α, β} (P | | Q) & = S_{0, t} ((1 - α) P + α Q | | P) \\ & \leq (- \log t) {| ((1 - α) P + α Q) - P |}_{T V} \\ & = C (α) D_{\infty} (α | | β) {| P - Q |}_{T V} . \end{matrix}

(29)

□

We recover the classical bound [11,16] of the Jensen–Shannon divergence by the total variation.

Corollary 3.

For probability measures P and Q,

JSD (P | | Q) \leq \log 2 \, {| P - Q |}_{T V} .

(30)

Proof.

This follows from Proposition 4 applied to each term, since

JSD (P | | Q) = \frac{1}{2} S_{0, \frac{1}{2}} (P | | Q) + \frac{1}{2} S_{1, \frac{1}{2}} (P | | Q)

. □

Proposition 4 gives a sharpening of Lemma 1 of Nielsen [17], who proved

S_{α, β} (P | | Q) \leq D_{\infty} (α | | β)

, and used the result to establish the boundedness of a generalization of the Jensen–Shannon Divergence.
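
The bound (24) and its consequence (30) can be explored on a grid of skewing parameters. The sketch below assumes natural logarithms and restricts β to (0, 1) so that the binary ∞-Rényi divergence is finite; the example distributions and names are hypothetical.

```python
import math

def kl(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def tv(p, q):
    return 0.5 * sum(abs(px - qx) for px, qx in zip(p, q))

def blend(p, q, a):
    return [(1 - a) * x + a * y for x, y in zip(p, q)]

def skew_kl(p, q, alpha, beta):
    """S_{alpha,beta}(P||Q) as in (23)."""
    return kl(blend(p, q, alpha), blend(p, q, beta))

def proposition4_bound(p, q, alpha, beta):
    """Right side of (24), for 0 < beta < 1."""
    c = 1 - alpha if alpha <= beta else alpha
    d_inf = math.log(max(alpha / beta, (1 - alpha) / (1 - beta)))
    return c * d_inf * tv(p, q)

p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]
alphas = [i / 10 for i in range(11)]
betas = [j / 10 for j in range(1, 10)]
min_slack = min(proposition4_bound(p, q, a, b) - skew_kl(p, q, a, b)
                for a in alphas for b in betas)

# Corollary 3 as the average of the cases (alpha, beta) = (0, 1/2) and (1, 1/2).
jsd_value = 0.5 * skew_kl(p, q, 0.0, 0.5) + 0.5 * skew_kl(p, q, 1.0, 0.5)
jsd_bound = math.log(2) * tv(p, q)
```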

Definition 6

(Nielsen [17]). For p and q densities with respect to a reference measure μ,

w_{i} > 0

, such that

\sum_{i = 1}^{n} w_{i} = 1

and

α_{i} \in [0, 1]

, define

J S^{α, w} (p : q) = \sum_{i = 1}^{n} w_{i} D ((1 - α_{i}) p + α_{i} q | | (1 - \bar{α}) p + \bar{α} q)

(31)

where

\sum_{i = 1}^{n} w_{i} α_{i} = \bar{α}

.

Note that, when

n = 2

,

α_{1} = 1

,

α_{2} = 0

and

w_{i} = \frac{1}{2}

,

J S^{α, w} (p : q) = JSD (p | | q)

, the usual Jensen–Shannon divergence. We now demonstrate that Nielsen’s generalized Jensen–Shannon Divergence can be bounded by the total variation distance just as the ordinary Jensen–Shannon Divergence.

Theorem 5.

For p and q densities with respect to a reference measure μ,

w_{i} > 0

, such that

\sum_{i = 1}^{n} w_{i} = 1

and

α_{i} \in (0, 1)

,

\log e \, {Var}_{w} (α) {| p - q |}_{T V}^{2} \leq J S^{α, w} (p : q) \leq A H (w) {| p - q |}_{T V}

(32)

where

H (w) : = - \sum_{i} w_{i} \log w_{i} \geq 0

and

A = \max_{i} | α_{i} - {\bar{α}}_{i} |

with

{\bar{α}}_{i} = \sum_{j \neq i} \frac{w_{j} α_{j}}{1 - w_{i}}

.

Note that, since

{\bar{α}}_{i}

is the w average of the

α_{j}

terms with

α_{i}

removed,

{\bar{α}}_{i} \in [0, 1]

and thus

A \leq 1

. We need the following Theorem from Melbourne et al. [26] for the upper bound.

Theorem 6

([26] Theorem 1.1). For

f_{i}

densities with respect to a common reference measure γ and

λ_{i} > 0

such that

\sum_{i = 1}^{n} λ_{i} = 1

,

h_{γ} (\sum_{i} λ_{i} f_{i}) - \sum_{i} λ_{i} h_{γ} (f_{i}) \leq T H (λ),

(33)

where

h_{γ} (f_{i}) : = - \int f_{i} (x) \log f_{i} (x) d γ (x)

and

T = \sup_{i} {| f_{i} - {\tilde{f}}_{i} |}_{T V}

with

{\tilde{f}}_{i} = \sum_{j \neq i} \frac{λ_{j}}{1 - λ_{i}} f_{j}

.

Proof of Theorem 5.

We apply Theorem 6 with

f_{i} = (1 - α_{i}) p + α_{i} q

,

λ_{i} = w_{i}

, and noticing that in general

h_{γ} (\sum_{i} λ_{i} f_{i}) - \sum_{i} λ_{i} h_{γ} (f_{i}) = \sum_{i} λ_{i} D (f_{i} | | f),

(34)

we have

\begin{matrix} J S^{α, w} (p : q) & = \sum_{i = 1}^{n} w_{i} D ((1 - α_{i}) p + α_{i} q | | (1 - \bar{α}) p + \bar{α} q) \\ \leq T H (w) . \end{matrix}

(35)

It remains to determine

T = \max_{i} {| f_{i} - {\tilde{f}}_{i} |}_{T V}

,

\begin{matrix} {\tilde{f}}_{i} - f_{i} & = \frac{f - f_{i}}{1 - λ_{i}} \\ = \frac{((1 - \bar{α}) p + \bar{α} q) - ((1 - α_{i}) p + α_{i} q)}{1 - w_{i}} \\ = \frac{(α_{i} - \bar{α}) (p - q)}{1 - w_{i}} \\ = (α_{i} - {\bar{α}}_{i}) (p - q) . \end{matrix}

(36)

Thus,

T = \max_{i} | α_{i} - {\bar{α}}_{i} | {| p - q |}_{T V} = A {| p - q |}_{T V},

and the proof of the upper bound is complete.

To prove the lower bound, we apply Pinsker’s inequality,

2 \log e \, {| P - Q |}_{T V}^{2} \leq D (P | | Q),

\begin{matrix} J S^{α, w} (p : q) & = \sum_{i = 1}^{n} w_{i} D ((1 - α_{i}) p + α_{i} q | | (1 - \bar{α}) p + \bar{α} q) \\ \geq \frac{1}{2} \sum_{i = 1}^{n} w_{i} 2 \log e {| ((1 - α_{i}) p + α_{i} q) - ((1 - \bar{α}) p + \bar{α} q) |}_{T V}^{2} \\ = \log e \sum_{i = 1}^{n} w_{i} {(α_{i} - \bar{α})}^{2} {| p - q |}_{T V}^{2} \\ = \log e {Var}_{w} (α) {| p - q |}_{T V}^{2} . \end{matrix}

(37)

□
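
Both bounds in (32) can be evaluated on a small example. The sketch below assumes natural logarithms (so that log e = 1) and uses hypothetical distributions, skewing parameters, and weights.

```python
import math

def kl(p, q):
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def tv(p, q):
    return 0.5 * sum(abs(px - qx) for px, qx in zip(p, q))

def blend(p, q, a):
    return [(1 - a) * x + a * y for x, y in zip(p, q)]

def generalized_js(p, q, alphas, weights):
    """JS^{alpha,w}(p:q) as in (31)."""
    a_bar = sum(w * a for a, w in zip(alphas, weights))
    ref = blend(p, q, a_bar)
    return sum(w * kl(blend(p, q, a), ref) for a, w in zip(alphas, weights))

def theorem5_bounds(p, q, alphas, weights):
    """Lower and upper bounds of (32), in nats."""
    a_bar = sum(w * a for a, w in zip(alphas, weights))
    var = sum(w * (a - a_bar) ** 2 for a, w in zip(alphas, weights))
    # alpha-bar_i is the w-average of the alphas with index i removed
    big_a = max(abs(a - (a_bar - w * a) / (1 - w)) for a, w in zip(alphas, weights))
    entropy = -sum(w * math.log(w) for w in weights)
    d = tv(p, q)
    return var * d ** 2, big_a * entropy * d

p = [0.5, 0.3, 0.2]
q = [0.25, 0.25, 0.5]
alphas = [0.1, 0.5, 0.9]
weights = [0.2, 0.3, 0.5]
js_value = generalized_js(p, q, alphas, weights)
lower, upper = theorem5_bounds(p, q, alphas, weights)
```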

Definition 7.

Given an f-divergence, densities p and q with respect to common reference measure,

α \in {[0, 1]}^{n}

and

w \in {(0, 1)}^{n}

such that

\sum_{i} w_{i} = 1

define its generalized skew divergence

D_{f}^{α, w} (p : q) = \sum_{i = 1}^{n} w_{i} D_{f} ((1 - α_{i}) p + α_{i} q | | (1 - \bar{α}) p + \bar{α} q) .

(38)

where

\bar{α} = \sum_{i} w_{i} α_{i}

.

Note that, by Theorem 4,

D_{f}^{α, w}

is an f-divergence. The generalized skew divergence of the relative entropy is the generalized Jensen–Shannon divergence

J S^{α, w}

. We denote the generalized skew divergence of the

χ^{2}

-divergence from p to q by

χ_{α, w}^{2} (p : q) : = \sum_{i} w_{i} χ^{2} ((1 - α_{i}) p + α_{i} q | | (1 - \bar{α}) p + \bar{α} q)

(39)

Note that, when

n = 2

and

α_{1} = 0

,

α_{2} = 1

and

w_{i} = \frac{1}{2}

, we recover the skew symmetrized divergence in Definition 4

D_{f}^{(0, 1), (1 / 2, 1 / 2)} (p : q) = Δ_{f} (p | | q)

(40)

The following theorem shows that the usual upper bound for the relative entropy by the

χ^{2}

-divergence can be reversed up to a factor in the skewed case.

Theorem 7.

For p and q with a common dominating measure μ,

χ_{α, w}^{2} (p : q) \leq 2 N_{\infty} (α, w) J S^{α, w} (p : q) .

Here,

N_{\infty} (α, w) : = \max_{i} e^{D_{\infty} (α_{i} | | \bar{α})} = \max_{i} \max \{\frac{1 - α_{i}}{1 - \bar{α}}, \frac{α_{i}}{\bar{α}}\}

for

α \in {[0, 1]}^{n}

and

w \in {(0, 1)}^{n}

such that

\sum_{i} w_{i} = 1

, where

\bar{α} = \sum_{i} w_{i} α_{i}

.

Proof.

By definition,

J S^{α, w} (p : q) = \sum_{i = 1}^{n} w_{i} D ((1 - α_{i}) p + α_{i} q | | (1 - \bar{α}) p + \bar{α} q) .

Taking

P_{i}

to be the measure associated to

(1 - α_{i}) p + α_{i} q

and Q given by

(1 - \bar{α}) p + \bar{α} q

, then

\frac{d P_{i}}{d Q} = \frac{(1 - α_{i}) p + α_{i} q}{(1 - \bar{α}) p + \bar{α} q} \leq \max \{\frac{1 - α_{i}}{1 - \bar{α}}, \frac{α_{i}}{\bar{α}}\} = e^{D_{\infty} (α_{i} | | \bar{α})} \leq N_{\infty} (α, w) .

(41)

Since

f (x) = x \log x

, the convex function associated to the usual KL divergence, satisfies

f^{″} (x) = \frac{1}{x}

, f is

\frac{1}{N_{\infty} (α, w)}

-convex on

[0, \sup_{x, i} \frac{d P_{i}}{d Q} (x)]

. Applying Proposition 2, we obtain

D (\sum_{i} w_{i} P_{i} | | Q) \leq \sum_{i} w_{i} D (P_{i} | | Q) - \frac{\sum_{i} w_{i} \int_{X} {(\frac{d P_{i}}{d Q} - \frac{d P}{d Q})}^{2} d Q}{2 N_{\infty} (α, w)} .

(42)

Since

P : = \sum_{i} w_{i} P_{i} = Q

, the left hand side of (42) is zero, while

\begin{matrix} \sum_{i} w_{i} \int_{X} {(\frac{d P_{i}}{d Q} - \frac{d P}{d Q})}^{2} d Q & = \sum_{i} w_{i} \int_{X} {(\frac{d P_{i}}{d P} - 1)}^{2} d P \\ = \sum_{i} w_{i} χ^{2} (P_{i} | | P) \\ = χ_{α, w}^{2} (p : q) . \end{matrix}

(43)

Rearranging gives,

\frac{χ_{α, w}^{2} (p : q)}{2 N_{\infty} (α, w)} \leq J S^{α, w} (p : q),

(44)

which is our conclusion. □
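The bound in the form (44) reached at the end of the proof can be checked numerically. The sketch below does so on a three-point space, using natural logarithms; the function names and the example values of p, q, α and w are assumptions chosen for illustration.

```python
import math

def mix(a, p, q):
    """Density (1 - a) p + a q."""
    return [(1 - a) * x + a * y for x, y in zip(p, q)]

def kl(p, q):
    """Relative entropy D(p || q) in nats."""
    return sum(x * math.log(x / y) for x, y in zip(p, q) if x > 0)

def chi2(p, q):
    """Pearson chi-square divergence."""
    return sum((x - y) ** 2 / y for x, y in zip(p, q))

def skew(div, alpha, w, p, q):
    """Generalized skew version of the divergence `div`."""
    abar = sum(wi * ai for wi, ai in zip(w, alpha))
    ref = mix(abar, p, q)
    return sum(wi * div(mix(ai, p, q), ref) for wi, ai in zip(w, alpha))

def n_inf(alpha, w):
    """N_infinity(alpha, w) = max_i max{(1 - a_i)/(1 - abar), a_i/abar}."""
    abar = sum(wi * ai for wi, ai in zip(w, alpha))
    return max(max((1 - ai) / (1 - abar), ai / abar) for ai in alpha)

p, q = [0.7, 0.2, 0.1], [0.1, 0.3, 0.6]
alpha, w = [0.0, 0.5, 1.0], [0.2, 0.3, 0.5]

chi2_skew = skew(chi2, alpha, w, p, q)
js_skew = skew(kl, alpha, w, p, q)
bound = 2 * n_inf(alpha, w) * js_skew   # right-hand side of (44), rearranged
```

Here the mean skew parameter is 0.65, so `n_inf(alpha, w)` equals 1/0.35, attained at the index with α_i = 0.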

4. Total Variation Bounds and Bayes Risk

In this section, we derive bounds on the Bayes risk associated to a family of probability measures with a prior distribution

λ

. Let us state definitions and recall basic relationships. Given probability densities

{p_{i}}_{i = 1}^{n}

on a space

X

with respect to a reference measure

μ

and

λ_{i} \geq 0

such that

\sum_{i = 1}^{n} λ_{i} = 1

, define the Bayes risk,

R : = R_{λ} (p) : = 1 - \int_{X} \max_{i} {λ_{i} p_{i} (x)} d μ (x)

(45)

If

ℓ (x, y) = 1 - δ_{x} (y)

, and we define

T (x) : = \arg \max_{i} λ_{i} p_{i} (x)

then observe that this definition is consistent with the usual definition of the Bayes risk associated to the loss function ℓ. Below, we consider

θ

to be a random variable on

\{1, 2, \dots, n\}

such that

P (θ = i) = λ_{i}

, and X to be a random variable with conditional distribution

P (X \in A | θ = i) = \int_{A} p_{i} (x) d μ (x)

. The following result shows that the Bayes risk gives the probability of the categorization error, under an optimal estimator.

Proposition 5.

The Bayes risk satisfies

R = \min_{\hat{θ}} E ℓ (θ, \hat{θ} (X)) = E ℓ (θ, T (X))

where the minimum is defined over

\hat{θ} : X \to \{1, 2, \dots, n\}

.

Proof.

Observe that

R = 1 - \int_{X} λ_{T (x)} p_{T (x)} (x) d μ (x) = E ℓ (θ, T (X))

. Similarly,

\begin{matrix} E ℓ (θ, \hat{θ} (X)) & = 1 - \int_{X} λ_{\hat{θ} (x)} p_{\hat{θ} (x)} (x) d μ (x) \\ \geq 1 - \int_{X} λ_{T (x)} p_{T (x)} (x) d μ (x) = R, \end{matrix}

which gives our conclusion. □
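On a finite space, Proposition 5 can be confirmed by brute force, since there are only finitely many deterministic estimators. The following sketch enumerates all of them; the prior `lam`, the densities `dens` and the function names are hypothetical choices made for the example.

```python
from itertools import product

def bayes_risk(lam, dens):
    """R = 1 - sum_x max_i lam_i p_i(x) for densities on a finite space."""
    m = len(dens[0])
    return 1 - sum(max(l * p[x] for l, p in zip(lam, dens)) for x in range(m))

def error_probability(lam, dens, est):
    """P(est(X) != theta) for a deterministic estimator est: x -> index."""
    return 1 - sum(lam[est[x]] * dens[est[x]][x] for x in range(len(est)))

lam = [0.5, 0.3, 0.2]
dens = [[0.6, 0.3, 0.1], [0.2, 0.6, 0.2], [0.1, 0.2, 0.7]]

R = bayes_risk(lam, dens)
# Minimum error over all 3^3 maps from the sample space to {0, 1, 2}.
best = min(error_probability(lam, dens, est)
           for est in product(range(3), repeat=3))
```

The minimum is attained by the maximizing rule T, which here selects index 0, 1, 2 at the three sample points respectively.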

It is known (see, for example, [9,31]) that the Bayes risk can also be tied directly to the total variation in the following special case, whose proof we include for completeness.

Proposition 6.

When

n = 2

and

λ_{1} = λ_{2} = \frac{1}{2}

, the Bayes risk associated to the densities

p_{1}

and

p_{2}

satisfies

2 R = 1 - | p_{1} - p_{2} |_{T V}

(46)

Proof.

Since

p_{T} = \frac{| p_{1} - p_{2} | + p_{1} + p_{2}}{2}

, integrating gives

\int_{X} p_{T} (x) d μ (x) = {| p_{1} - p_{2} |}_{T V} + 1

from which the equality follows. □
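A short numerical check of (46) follows. It assumes the normalization of total variation implied by the proof, namely half the integral of |p_1 - p_2|, so that the total variation lies in [0, 1]; the two densities are hypothetical.

```python
def bayes_risk_uniform_pair(p1, p2):
    """Bayes risk for two densities with prior (1/2, 1/2)."""
    return 1 - sum(max(0.5 * a, 0.5 * b) for a, b in zip(p1, p2))

def total_variation(p1, p2):
    """Total variation normalized to take values in [0, 1]."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p1, p2))

p1, p2 = [0.9, 0.1], [0.4, 0.6]
R = bayes_risk_uniform_pair(p1, p2)
tv = total_variation(p1, p2)
```

For these densities R = 0.25 and the total variation is 0.5, in agreement with 2R = 1 - |p_1 - p_2|_TV.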

Information theoretic bounds to control the Bayes and minimax risk have an extensive literature (see, for example, [9,32,33,34,35]). Fano’s inequality is the seminal result in this direction, and we direct the reader to a survey of such techniques in statistical estimation (see [36]). What follows can be understood as a sharpening of the work of Guntuboyina [9] under the assumption of

κ

-convexity.

The function

T (x) = \arg \max_{i} {λ_{i} p_{i} (x)}

induces the following convex decompositions of our densities. The density q can be realized as a convex combination of

q_{1} = \frac{λ_{T} q}{1 - Q}

where

Q = 1 - \int λ_{T} q d μ

and

q_{2} = \frac{(1 - λ_{T}) q}{Q}

,

q = (1 - Q) q_{1} + Q q_{2} .

If we take

p = \sum_{i} λ_{i} p_{i}

, then p can be decomposed as

ρ_{1} = \frac{λ_{T} p_{T}}{1 - R}

and

ρ_{2} = \frac{p - λ_{T} p_{T}}{R}

so that

p = (1 - R) ρ_{1} + R ρ_{2} .
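The two convex decompositions can be made concrete on a finite space. The sketch below builds q_1, q_2, ρ_1 and ρ_2 from the maximizing index T; the prior, the densities and the reference density q are hypothetical example data, and indices start at 0.

```python
lam = [0.5, 0.3, 0.2]
dens = [[0.6, 0.3, 0.1], [0.2, 0.6, 0.2], [0.1, 0.2, 0.7]]
q = [0.3, 0.3, 0.4]
n, m = len(lam), len(q)

# T(x): index maximizing lam_i p_i(x) at each sample point x.
T = [max(range(n), key=lambda i: lam[i] * dens[i][x]) for x in range(m)]
lam_T = [lam[T[x]] for x in range(m)]
top = [lam_T[x] * dens[T[x]][x] for x in range(m)]            # lam_T p_T
p = [sum(lam[i] * dens[i][x] for i in range(n)) for x in range(m)]

Q = 1 - sum(lam_T[x] * q[x] for x in range(m))
R = 1 - sum(top)                                              # Bayes risk

q1 = [lam_T[x] * q[x] / (1 - Q) for x in range(m)]
q2 = [(1 - lam_T[x]) * q[x] / Q for x in range(m)]
rho1 = [top[x] / (1 - R) for x in range(m)]
rho2 = [(p[x] - top[x]) / R for x in range(m)]
```

Each of `q1`, `q2`, `rho1`, `rho2` sums to one, and recombining them with weights (1 - Q, Q) and (1 - R, R) returns q and p.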

Theorem 8.

When f is κ-convex, on

(a, b)

with

a = \inf_{i, x} \frac{p_{i} (x)}{q (x)}

and

b = \sup_{i, x} \frac{p_{i} (x)}{q (x)}

\sum_{i} λ_{i} D_{f} (p_{i} | | q) \geq D_{f} (R | | Q) + \frac{κ W}{2}

where

W : = W (λ_{i}, p_{i}, q) : = \frac{{(1 - R)}^{2}}{1 - Q} χ^{2} (ρ_{1} | | q_{1}) + \frac{R^{2}}{Q} χ^{2} (ρ_{2} | | q_{2}) + W_{0}

for

W_{0} \geq 0

.

W_{0}

can be expressed explicitly as

W_{0} = \int (1 - λ_{T}) {Var}_{λ_{i \neq T}} (\frac{p_{i}}{q}) q d μ = \int \sum_{i \neq T} λ_{i} \frac{| p_{i} - \sum_{j \neq T} \frac{λ_{j}}{1 - λ_{T}} p_{j} |^{2}}{q} d μ,

where for fixed x, we consider the variance

{Var}_{λ_{i \neq T}} (\frac{p_{i}}{q})

to be the variance of a random variable taking values

p_{i} (x) / q (x)

with probability

λ_{i} / (1 - λ_{T (x)})

for

i \neq T (x)

. Note this term is a non-zero term only when

n > 2

.

Proof.

For a fixed x, we apply Lemma 1

\begin{matrix} \sum_{i} λ_{i} f (\frac{p_{i}}{q}) & = λ_{T} f (\frac{p_{T}}{q}) + (1 - λ_{T}) \sum_{i \neq T} \frac{λ_{i}}{1 - λ_{T}} f (\frac{p_{i}}{q}) \\ \geq λ_{T} f (\frac{p_{T}}{q}) + (1 - λ_{T}) [f (\frac{p - λ_{T} p_{T}}{q (1 - λ_{T})}) + \frac{κ}{2} {Var}_{λ_{i \neq T}} (\frac{p_{i}}{q})] \end{matrix}

(47)

Integrating,

\begin{matrix} \sum_{i} λ_{i} D_{f} (p_{i} | | q) \geq \int λ_{T} f (\frac{p_{T}}{q}) q & + \int (1 - λ_{T}) f (\frac{- λ_{T} p_{T} + \sum_{i} λ_{i} p_{i}}{q (1 - λ_{T})}) q + \frac{κ}{2} W_{0}, \end{matrix}

(48)

where

W_{0} = \int \sum_{i \neq T} λ_{i} \frac{| p_{i} - \sum_{j \neq T} \frac{λ_{j}}{1 - λ_{T}} p_{j} |^{2}}{q} d μ .

(49)

Applying the

κ

-convexity of f,

\begin{matrix} \int λ_{T} f (\frac{p_{T}}{q}) q & = (1 - Q) \int q_{1} f (\frac{p_{T}}{q}) \\ \geq (1 - Q) (f (\frac{\int λ_{T} p_{T}}{1 - Q}) + \frac{κ}{2} {Var}_{q_{1}} (\frac{p_{T}}{q})) \\ = (1 - Q) f (\frac{1 - R}{1 - Q}) + \frac{(1 - Q) κ}{2} W_{1}, \end{matrix}

(50)

with

\begin{matrix} W_{1} & : = {Var}_{q_{1}} (\frac{p_{T}}{q}) \\ = {(\frac{1 - R}{1 - Q})}^{2} {Var}_{q_{1}} (\frac{λ_{T} p_{T}}{λ_{T} q} \frac{1 - Q}{1 - R}) \\ = {(\frac{1 - R}{1 - Q})}^{2} {Var}_{q_{1}} (\frac{ρ_{1}}{q_{1}}) \\ = {(\frac{1 - R}{1 - Q})}^{2} χ^{2} (ρ_{1} | | q_{1}) \end{matrix}

(51)

Similarly,

\begin{matrix} \int (1 - λ_{T}) f (\frac{p - λ_{T} p_{T}}{q (1 - λ_{T})}) q & = Q \int q_{2} f (\frac{p - λ_{T} p_{T}}{q (1 - λ_{T})}) \\ \geq Q f (\int q_{2} \frac{p - λ_{T} p_{T}}{q (1 - λ_{T})}) + \frac{Q κ}{2} W_{2} \\ = Q f (\frac{R}{Q}) + \frac{Q κ}{2} W_{2} \end{matrix}

(52)

where

\begin{matrix} W_{2} & : = {Var}_{q_{2}} (\frac{p - λ_{T} p_{T}}{q (1 - λ_{T})}) \\ = {(\frac{R}{Q})}^{2} {Var}_{q_{2}} (\frac{p - λ_{T} p_{T}}{q (1 - λ_{T})} \frac{Q}{R}) \\ = {(\frac{R}{Q})}^{2} {Var}_{q_{2}} (\frac{ρ_{2}}{q_{2}}) \\ = {(\frac{R}{Q})}^{2} \int q_{2} {(\frac{ρ_{2}}{q_{2}} - 1)}^{2} \\ = {(\frac{R}{Q})}^{2} χ^{2} (ρ_{2} | | q_{2}) \end{matrix}

(53)

Writing

W = W_{0} + (1 - Q) W_{1} + Q W_{2}

, we have our result. □
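For Pearson's χ², whose generator f(t) = (t - 1)^2 is exactly quadratic with κ = 2, every Jensen-type step in the proof is an equality, so the two sides of Theorem 8 should coincide. The following sketch checks this on a three-point space with n = 3, using the expression for W from the statement of the theorem; all data and variable names are hypothetical.

```python
def chi2(a, b):
    """Pearson chi-square divergence between finite densities a and b."""
    return sum((x - y) ** 2 / y for x, y in zip(a, b))

f = lambda t: (t - 1) ** 2   # generator of Pearson's chi-square
kappa = 2.0                  # f'' = 2, so f is 2-convex on (0, infinity)

lam = [0.5, 0.3, 0.2]
dens = [[0.6, 0.3, 0.1], [0.2, 0.6, 0.2], [0.1, 0.2, 0.7]]
q = [0.3, 0.3, 0.4]
n, m = len(lam), len(q)

T = [max(range(n), key=lambda i: lam[i] * dens[i][x]) for x in range(m)]
lam_T = [lam[T[x]] for x in range(m)]
top = [lam_T[x] * dens[T[x]][x] for x in range(m)]            # lam_T p_T
p = [sum(lam[i] * dens[i][x] for i in range(n)) for x in range(m)]

Q = 1 - sum(lam_T[x] * q[x] for x in range(m))
R = 1 - sum(top)
q1 = [lam_T[x] * q[x] / (1 - Q) for x in range(m)]
q2 = [(1 - lam_T[x]) * q[x] / Q for x in range(m)]
rho1 = [top[x] / (1 - R) for x in range(m)]
rho2 = [(p[x] - top[x]) / R for x in range(m)]

# W_0: weighted spread of the non-maximal densities, pointwise in x.
W0 = 0.0
for x in range(m):
    rest = [i for i in range(n) if i != T[x]]
    mean = sum(lam[j] * dens[j][x] for j in rest) / (1 - lam_T[x])
    W0 += sum(lam[i] * (dens[i][x] - mean) ** 2 for i in rest) / q[x]

W = ((1 - R) ** 2 / (1 - Q) * chi2(rho1, q1)
     + R ** 2 / Q * chi2(rho2, q2) + W0)
lhs = sum(lam[i] * chi2(dens[i], q) for i in range(n))
binary = Q * f(R / Q) + (1 - Q) * f((1 - R) / (1 - Q))        # D_f(R || Q)
rhs = binary + kappa * W / 2
```

For a generator with f'' strictly larger than κ somewhere on (a, b), the same computation would give `lhs` strictly greater than `rhs`.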

Corollary 4.

When

λ_{i} = \frac{1}{n}

, and f is κ-convex on

(\inf_{i, x} p_{i} / q, \sup_{i, x} p_{i} / q)

\begin{matrix} \frac{1}{n} \sum_{i} D_{f} (p_{i} | | q) & \geq D_{f} (R | | (n - 1) / n) \\ + \frac{κ}{2} (n {(1 - R)}^{2} χ^{2} (ρ_{1} | | q) + \frac{n R^{2}}{n - 1} χ^{2} (ρ_{2} | | q) + W_{0}) \end{matrix}

(54)

further when

n = 2

,

\begin{matrix} \frac{D_{f} (p_{1} | | q) + D_{f} (p_{2} | | q)}{2} & \geq D_{f} (\frac{1 - | p_{1} - p_{2} |_{T V}}{2} | | \frac{1}{2}) \\ + \frac{κ}{4} ((1 + | p_{1} - p_{2} |_{T V})^{2} χ^{2} (ρ_{1} | | q) + (1 - | p_{1} - p_{2} |_{T V})^{2} χ^{2} (ρ_{2} | | q)) . \end{matrix}

(55)

Proof.

Note that

q_{1} = q_{2} = q

, since

λ_{i} = \frac{1}{n}

implies

λ_{T} = \frac{1}{n}

as well. In addition,

Q = 1 - \int λ_{T} q d μ = \frac{n - 1}{n}

so that applying Theorem 8 gives

\sum_{i = 1}^{n} D_{f} (p_{i} | | q) \geq n D_{f} (R | | (n - 1) / n) + \frac{κ n W (λ_{i}, p_{i}, q)}{2} .

(56)

The term W can be simplified as well. Since

1 - Q = \frac{1}{n}

and

q_{1} = q_{2} = q

, the three terms of W in Theorem 8 become

\begin{matrix} \frac{{(1 - R)}^{2}}{1 - Q} χ^{2} (ρ_{1} | | q_{1}) & = n {(1 - R)}^{2} χ^{2} (ρ_{1} | | q) \\ \frac{R^{2}}{Q} χ^{2} (ρ_{2} | | q_{2}) & = \frac{n R^{2}}{n - 1} χ^{2} (ρ_{2} | | q) \\ W_{0} & = \int \frac{\frac{1}{n} \sum_{i \neq T} {(p_{i} - \frac{1}{n - 1} \sum_{j \neq T} p_{j})}^{2}}{q} d μ . \end{matrix}

(57)

For the special case, one needs only to recall

R = \frac{1 - | p_{1} - p_{2} |_{T V}}{2}

while inserting 2 for n. □

Corollary 5.

When

p_{i} \leq q / t^{*}

for

t^{*} > 0

, and

f (x) = x \log x

\sum_{i} λ_{i} D (p_{i} | | q) \geq D (R | | Q) + \frac{t^{*} W (λ_{i}, p_{i}, q)}{2}

for

D (p_{i} | | q)

the relative entropy. In particular,

\sum_{i} λ_{i} D (p_{i} | | q) \geq D (p | | q) + D (R | | P) + \frac{t^{*} W (λ_{i}, p_{i}, p)}{2}

where

P = 1 - \int λ_{T} p d μ

for

p = \sum_{i} λ_{i} p_{i}

and

t^{*} = \min λ_{i}

.

Proof.

For the relative entropy,

f (x) = x \log x

is

\frac{1}{M}

-convex on

[0, M]

since

f^{″} (x) = 1 / x

. When

p_{i} \leq q / t^{*}

holds for all i, then we can apply Theorem 8 with

M = \frac{1}{t^{*}}

. For the second inequality, recall the compensation identity,

\sum_{i} λ_{i} D (p_{i} | | q) = \sum_{i} λ_{i} D (p_{i} | | p) + D (p | | q)

, and apply the first inequality to

\sum_{i} λ_{i} D (p_{i} | | p)

for the result. □

This gives a lower bound on the Jensen–Shannon divergence, defined as

JSD (μ | | ν) = \frac{1}{2} D (μ | | μ / 2 + ν / 2) + \frac{1}{2} D (ν | | μ / 2 + ν / 2)

. Let us also note that through the compensation identity

\sum_{i} λ_{i} D (p_{i} | | q) = \sum_{i} λ_{i} D (p_{i} | | p) + D (p | | q)

,

\sum_{i} λ_{i} D (p_{i} | | q) \geq \sum_{i} λ_{i} D (p_{i} | | p)

where

p = \sum_{i} λ_{i} p_{i}

. In the case that

λ_{i} = \frac{1}{N}

\begin{matrix} \sum_{i} λ_{i} D (p_{i} | | q) & \geq \sum_{i} λ_{i} D (p_{i} | | p) \\ \geq Q f (\frac{R}{Q}) + (1 - Q) f (\frac{1 - R}{1 - Q}) + \frac{t^{*} W}{2} \end{matrix}

(58)

Corollary 6.

For two densities

p_{1}

and

p_{2}

, the Jensen–Shannon divergence satisfies the following,

\begin{matrix} JSD (p_{1} | | p_{2}) & \geq D (\frac{1 - | p_{1} - p_{2} |_{T V}}{2} | | \frac{1}{2}) \\ + \frac{1}{8} ((1 + | p_{1} - p_{2} |_{T V})^{2} χ^{2} (ρ_{1} | | p) + (1 - | p_{1} - p_{2} |_{T V})^{2} χ^{2} (ρ_{2} | | p)) \end{matrix}

(59)

with

ρ_{i}

defined above and

p = p_{1} / 2 + p_{2} / 2

.

Proof.

Since

\frac{p_{i}}{(p_{1} + p_{2}) / 2} \leq 2

and

f (x) = x \log x

satisfies

f^{″} (x) \geq \frac{1}{2}

on

(0, 2)

, taking

q = \frac{p_{1} + p_{2}}{2}

in the

n = 2

example of Corollary 4 with

κ = \frac{1}{2}

yields the result. □

Noting that

2 D ((1 + V) / 2 | | 1 / 2) = (1 + V) \log (1 + V) + (1 - V) \log (1 - V) \geq V^{2} \log e

, we see that a further bound,

JSD (p_{1} | | p_{2}) \geq \frac{\log e}{2} V^{2} + \frac{{(1 + V)}^{2} χ^{2} (ρ_{1} | | p) + {(1 - V)}^{2} χ^{2} (ρ_{2} | | p)}{8},

(60)

can be obtained for

V = | p_{1} - p_{2} |_{T V}

.
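The leading terms of these bounds form a chain that is easy to check numerically: the Jensen–Shannon divergence dominates the binary relative entropy D((1 - V)/2 || 1/2), which in turn dominates V^2 (log e)/2. The sketch below works in nats, so log e = 1; the example densities and helper names are assumptions.

```python
import math

def kl(a, b):
    """Relative entropy D(a || b) in nats."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

def jsd(p1, p2):
    """Jensen-Shannon divergence with respect to the midpoint mixture."""
    mid = [(a + b) / 2 for a, b in zip(p1, p2)]
    return 0.5 * kl(p1, mid) + 0.5 * kl(p2, mid)

p1, p2 = [0.9, 0.1], [0.4, 0.6]
V = 0.5 * sum(abs(a - b) for a, b in zip(p1, p2))       # total variation
js = jsd(p1, p2)
binary = kl([(1 - V) / 2, (1 + V) / 2], [0.5, 0.5])     # D((1-V)/2 || 1/2)
pinsker = V ** 2 / 2                                    # log e = 1 in nats
```

The gap between `js` and `binary` is the room available to the χ² correction terms.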

On Topsøe’s Sharpening of Pinsker’s Inequality

For

P_{i}, Q

probability measures with densities

p_{i}

and q with respect to a common reference measure, and weights

t_{i} > 0

with

\sum_{i = 1}^{n} t_{i} = 1

, denote

P = \sum_{i} t_{i} P_{i}

, with density

p = \sum_{i} t_{i} p_{i}

. The compensation identity is

\sum_{i = 1}^{n} t_{i} D (P_{i} | | Q) = D (P | | Q) + \sum_{i = 1}^{n} t_{i} D (P_{i} | | P) .

(61)
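The compensation identity (61) is an exact algebraic identity, which the following sketch confirms for three hypothetical densities on a three-point space (natural logarithms; the variable names are illustrative).

```python
import math

def kl(a, b):
    """Relative entropy D(a || b) in nats."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

t = [0.5, 0.3, 0.2]
P = [[0.6, 0.3, 0.1], [0.2, 0.6, 0.2], [0.1, 0.2, 0.7]]
Q = [0.3, 0.3, 0.4]

# Mixture P = sum_i t_i P_i.
mixture = [sum(ti * Pi[x] for ti, Pi in zip(t, P)) for x in range(3)]

lhs = sum(ti * kl(Pi, Q) for ti, Pi in zip(t, P))
rhs = kl(mixture, Q) + sum(ti * kl(Pi, mixture) for ti, Pi in zip(t, P))
```

Since D(P || Q) is non-negative, the identity also shows that the mixture minimizes the weighted sum of relative entropies over Q.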

Theorem 9.

For

P_{1}

and

P_{2}

, denote

M_{k} = 2^{- k} P_{1} + (1 - 2^{- k}) P_{2}

, and define

\begin{matrix} M_{1} (k) = \frac{M_{k} 1_{\{P_{1} > P_{2}\}} + P_{2} 1_{\{P_{1} \leq P_{2}\}}}{M_{k} \{P_{1} > P_{2}\} + P_{2} \{P_{1} \leq P_{2}\}}, & M_{2} (k) = \frac{M_{k} 1_{\{P_{1} \leq P_{2}\}} + P_{2} 1_{\{P_{1} > P_{2}\}}}{M_{k} \{P_{1} \leq P_{2}\} + P_{2} \{P_{1} > P_{2}\}}, \end{matrix}

then the following sharpening of Pinsker’s inequality can be derived,

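\begin{matrix} D (P_{1} | | P_{2}) \geq (2 \log e) | P_{1} - P_{2} |_{T V}^{2} + \sum_{k = 0}^{\infty} 2^{k} (\frac{{(1 + V_{k})}^{2} χ^{2} (M_{1} (k), M_{k + 1})}{4} + \frac{{(1 - V_{k})}^{2} χ^{2} (M_{2} (k), M_{k + 1})}{4}), \end{matrix}

where

V_{k} = | M_{k} - P_{2} |_{T V} = 2^{- k} | P_{1} - P_{2} |_{T V}

.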

Proof.

When

n = 2

and

t_{1} = t_{2} = \frac{1}{2}

, if we denote

M = \frac{P_{1} + P_{2}}{2}

, then (61) reads as

\frac{1}{2} D (P_{1} | | Q) + \frac{1}{2} D (P_{2} | | Q) = D (M | | Q) + JSD (P_{1} | | P_{2}) .

(62)

Taking

Q = P_{2}

, we arrive at

D (P_{1} | | P_{2}) = 2 D (M | | P_{2}) + 2 JSD (P_{1} | | P_{2})

(63)

Iterating and writing

M_{k} = 2^{- k} P_{1} + (1 - 2^{- k}) P_{2}

, we have

D (P_{1} | | P_{2}) = 2^{n} D (M_{n} | | P_{2}) + 2 \sum_{k = 0}^{n - 1} 2^{k} JSD (M_{k} | | P_{2})

(64)

It can be shown (see [11]) that

2^{n} D (M_{n} | | P_{2}) \to 0

with

n \to \infty

, giving the following series representation,

D (P_{1} | | P_{2}) = 2 \sum_{k = 0}^{\infty} 2^{k} JSD (M_{k} | | P_{2}) .

(65)
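The series representation (65) can be observed numerically through the exact finite identity obtained by iterating (63) n times, D(P_1 || P_2) = 2^n D(M_n || P_2) + 2 Σ_{k<n} 2^k JSD(M_k || P_2). The sketch below evaluates both pieces for n = 20; the function name `topsoe_partial` and the example densities are assumptions.

```python
import math

def kl(a, b):
    """Relative entropy D(a || b) in nats."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)

def jsd(p1, p2):
    """Jensen-Shannon divergence with respect to the midpoint mixture."""
    mid = [(a + b) / 2 for a, b in zip(p1, p2)]
    return 0.5 * kl(p1, mid) + 0.5 * kl(p2, mid)

def topsoe_partial(p1, p2, n):
    """Return (2^n D(M_n || P2), 2 * sum_{k<n} 2^k JSD(M_k || P2))."""
    def M(k):
        s = 2.0 ** (-k)                       # M_k = 2^-k P1 + (1 - 2^-k) P2
        return [s * a + (1 - s) * b for a, b in zip(p1, p2)]
    tail = 2 ** n * kl(M(n), p2)
    series = 2 * sum(2 ** k * jsd(M(k), p2) for k in range(n))
    return tail, series

p1, p2 = [0.9, 0.1], [0.4, 0.6]
total = kl(p1, p2)
tail, series = topsoe_partial(p1, p2, 20)
```

The remainder `tail` decays roughly like 2^{-n} times χ²(P_1 || P_2)/2, which is the convergence used to pass to the infinite series.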

Note that the

ρ

-decomposition of

M_{k + 1} = \frac{M_{k} + P_{2}}{2}

is exactly

ρ_{i} = M_{i} (k)

, thus, by Corollary 6, writing

V_{k} = | M_{k} - P_{2} |_{T V} = 2^{- k} | P_{1} - P_{2} |_{T V}

,

\begin{matrix} D (P_{1} | | P_{2}) & = 2 \sum_{k = 0}^{\infty} 2^{k} JSD (M_{k} | | P_{2}) \\ \geq \sum_{k = 0}^{\infty} 2^{k} (V_{k}^{2} \log e + \frac{{(1 + V_{k})}^{2} χ^{2} (M_{1} (k), M_{k + 1})}{4} + \frac{{(1 - V_{k})}^{2} χ^{2} (M_{2} (k), M_{k + 1})}{4}) \\ = (2 \log e) | P_{1} - P_{2} |_{T V}^{2} + \sum_{k = 0}^{\infty} 2^{k} (\frac{{(1 + V_{k})}^{2} χ^{2} (M_{1} (k), M_{k + 1})}{4} + \frac{{(1 - V_{k})}^{2} χ^{2} (M_{2} (k), M_{k + 1})}{4}) . \end{matrix}

(66)

Thus, we arrive at the desired sharpening of Pinsker’s inequality. □

Observe that the

k = 0

term in the above series is equivalent to

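2^{0} (\frac{{(1 + | P_{1} - P_{2} |_{T V})}^{2} χ^{2} (M_{1} (0), M_{0 + 1})}{4} + \frac{{(1 - | P_{1} - P_{2} |_{T V})}^{2} χ^{2} (M_{2} (0), M_{0 + 1})}{4}) = \frac{{(1 + | p_{1} - p_{2} |_{T V})}^{2} χ^{2} (ρ_{1}, p)}{4} + \frac{{(1 - | p_{1} - p_{2} |_{T V})}^{2} χ^{2} (ρ_{2}, p)}{4},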

(67)

where

ρ_{i}

is the convex decomposition of

p = \frac{p_{1} + p_{2}}{2}

in terms of

T (x) = \arg \max {p_{1} (x), p_{2} (x)}

.

5. Conclusions

In this article, we begin a systematic study of strongly convex divergences, and how the strength of convexity of a divergence generator f, quantified by the parameter

κ

, influences the behavior of the divergence

D_{f}

. We prove that every strongly convex divergence dominates the square of the total variation, extending the classical bound provided by the

χ^{2}

-divergence. We also study a general notion of a skew divergence, providing new bounds, in particular for the generalized skew divergence of Nielsen. Finally, we show how

κ

-convexity can be leveraged to yield improvements of Bayes risk f-divergence inequalities, and as a consequence achieve a sharpening of Pinsker’s inequality.

Funding

This research was funded by NSF grant CNS 1809194.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A

Theorem A1.

The class of f-divergences is stable under skewing. That is, if f is convex, satisfying

f (1) = 0

, then

\begin{matrix} \hat{f} (x) : = (t x + (1 - t)) f (\frac{r x + (1 - r)}{t x + (1 - t)}) \end{matrix}

(A1)

is convex with

\hat{f} (1) = 0

as well.

Proof.

If

μ

and

ν

have respective densities u and v with respect to a reference measure

γ

, then

r μ + (1 - r) ν

and

t μ + (1 - t) ν

have densities

r u + (1 - r) v

and

t u + (1 - t) v

, respectively, so that

\begin{matrix} S_{f, r, t} (μ | | ν) & = \int f (\frac{r u + (1 - r) v}{t u + (1 - t) v}) (t u + (1 - t) v) d γ \end{matrix}

(A2)

\begin{matrix} = \int f (\frac{r \frac{u}{v} + (1 - r)}{t \frac{u}{v} + (1 - t)}) (t \frac{u}{v} + (1 - t)) v d γ \end{matrix}

(A3)

\begin{matrix} = \int \hat{f} (\frac{u}{v}) v d γ . \end{matrix}

(A4)

Since

\hat{f} (1) = f (1) = 0

, we need only prove

\hat{f}

convex. For this, recall that the conic transform g of a convex function f defined by

g (x, y) = y f (x / y)

for

y > 0

is convex, since

\begin{matrix} \frac{y_{1} + y_{2}}{2} f (\frac{x_{1} + x_{2}}{2} / \frac{y_{1} + y_{2}}{2}) & = \frac{y_{1} + y_{2}}{2} f (\frac{y_{1}}{y_{1} + y_{2}} \frac{x_{1}}{y_{1}} + \frac{y_{2}}{y_{1} + y_{2}} \frac{x_{2}}{y_{2}}) \end{matrix}

(A5)

\begin{matrix} \leq \frac{y_{1}}{2} f (x_{1} / y_{1}) + \frac{y_{2}}{2} f (x_{2} / y_{2}) . \end{matrix}

(A6)

Our result follows since

\hat{f}

is the composition of the affine function

A (x) = (r x + (1 - r), t x + (1 - t))

with the conic transform of f,

\begin{matrix} \hat{f} (x) = g (A (x)) . \end{matrix}

(A7)

□
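Theorem A1 can be illustrated numerically: the skewed generator of (A1) should reproduce the divergence between the two mixtures, vanish at 1, and be convex. The sketch below checks these three properties for the relative entropy generator on a finite grid; the parameter values r, t, the densities and the function names are hypothetical.

```python
import math

def skewed(f, r, t):
    """Skewed generator of (A1): x -> (t x + 1 - t) f((r x + 1 - r)/(t x + 1 - t))."""
    return lambda x: (t * x + 1 - t) * f((r * x + 1 - r) / (t * x + 1 - t))

def f_div(f, u, v):
    """D_f(u || v) = sum_x v(x) f(u(x) / v(x)) for strictly positive v."""
    return sum(b * f(a / b) for a, b in zip(u, v))

f = lambda x: x * math.log(x)        # relative entropy generator (nats)
r, t = 0.8, 0.3
f_hat = skewed(f, r, t)

u, v = [0.7, 0.2, 0.1], [0.1, 0.3, 0.6]
mix_r = [r * a + (1 - r) * b for a, b in zip(u, v)]
mix_t = [t * a + (1 - t) * b for a, b in zip(u, v)]

direct = f_div(f, mix_r, mix_t)      # divergence between the two mixtures
via_hat = f_div(f_hat, u, v)         # same quantity through the skewed generator

# Midpoint convexity of f_hat on a grid in (0, 6), with a small tolerance.
grid = [0.1 * k for k in range(1, 60)]
midpoint_ok = all(f_hat((a + b) / 2) <= (f_hat(a) + f_hat(b)) / 2 + 1e-12
                  for a in grid for b in grid)
```

A grid check of midpoint convexity is only a sanity test, not a proof; the argument via the conic transform above is what establishes convexity.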

References

1. Ali, S.M.; Silvey, S.D. A general class of coefficients of divergence of one distribution from another. J. Roy. Statist. Soc. Ser. B 1966, 28, 131–142.
2. Morimoto, T. Markov processes and the H-theorem. J. Phys. Soc. Jpn. 1963, 18, 328–331.
3. Csiszár, I. Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Magyar Tud. Akad. Mat. Kutató Int. Közl. 1963, 8, 85–108.
4. Csiszár, I. Information-type measures of difference of probability distributions and indirect observation. Stud. Sci. Math. Hung. 1967, 2, 229–318.
5. Polyanskiy, Y.; Poor, H.V.; Verdú, S. Channel coding rate in the finite blocklength regime. IEEE Trans. Inf. Theory 2010, 56, 2307–2359.
6. Sason, I.; Verdú, S. f-divergence inequalities. IEEE Trans. Inf. Theory 2016, 62, 5973–6006.
7. Polyanskiy, Y.; Wu, Y. Lecture Notes on Information Theory. Available online: http://people.lids.mit.edu/yp/homepage/data/itlectures_v5.pdf (accessed on 13 November 2019).
8. Sason, I. On data-processing and majorization inequalities for f-divergences with applications. Entropy 2019, 21, 1022.
9. Guntuboyina, A. Lower bounds for the minimax risk using f-divergences, and applications. IEEE Trans. Inf. Theory 2011, 57, 2386–2399.
10. Reid, M.; Williamson, R. Generalised Pinsker inequalities. arXiv 2009, arXiv:0906.1244.
11. Topsøe, F. Some inequalities for information divergence and related measures of discrimination. IEEE Trans. Inf. Theory 2000, 46, 1602–1609.
12. Lee, L. Measures of distributional similarity. In Proceedings of the 37th Annual Meeting of the Association For Computational Linguistics on Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 1999; pp. 25–32.
13. Le Cam, L. Asymptotic Methods in Statistical Decision Theory; Springer Series in Statistics; Springer: New York, NY, USA, 1986.
14. Vincze, I. On the concept and measure of information contained in an observation. Contrib. Probab. 1981, 207–214.
15. Györfi, L.; Vajda, I. A class of modified Pearson and Neyman statistics. Stat. Decis. 2001, 19, 239–251.
16. Lin, J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 1991, 37, 145–151.
17. Nielsen, F. On a generalization of the Jensen–Shannon divergence and the Jensen–Shannon centroid. Entropy 2020, 22, 221.
18. Folland, G. Real Analysis: Modern Techniques and Their Applications; John Wiley & Sons: Hoboken, NJ, USA, 1999.
19. Gibbs, A.L.; Su, F.E. On choosing and bounding probability metrics. Int. Stat. Rev. 2002, 70, 419–435.
20. Harremoës, P.; Vajda, I. On pairs of f-divergences and their joint range. IEEE Trans. Inf. Theory 2011, 57, 3230–3235.
21. Reiss, R. Approximate Distributions of Order Statistics: With Applications to Nonparametric Statistics; Springer: Berlin/Heidelberg, Germany, 2012.
22. Nielsen, F.; Nock, R. On the chi square and higher-order chi distances for approximating f-divergences. IEEE Signal Process. Lett. 2013, 21, 10–13.
23. Amari, S. Information Geometry and Its Applications; Springer: Berlin/Heidelberg, Germany, 2016; p. 194.
24. Basu, A.; Shioya, H.; Park, C. Statistical Inference: The Minimum Distance Approach; CRC Press: Boca Raton, FL, USA, 2011.
25. Vajda, I. On the f-divergence and singularity of probability measures. Period. Math. Hung. 1972, 2, 223–234.
26. Melbourne, J.; Talukdar, S.; Bhaban, S.; Madiman, M.; Salapaka, M.V. The differential entropy of mixtures: new bounds and applications. arXiv 2020, arXiv:1805.11257.
27. Erven, T.V.; Harremoës, P. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory 2014, 60, 3797–3820.
28. Audenaert, K.M.R. Quantum skew divergence. J. Math. Phys. 2014, 55, 112202.
29. Melbourne, J.; Madiman, M.; Salapaka, M.V. Relationships between certain f-divergences. In Proceedings of the 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Monticello, IL, USA, 24–27 September 2019; pp. 1068–1073.
30. Nishiyama, T.; Sason, I. On relations between the relative entropy and χ2-divergence, generalizations and applications. Entropy 2020, 22, 563.
31. Nielsen, F. Generalized Bhattacharyya and Chernoff upper bounds on Bayes error using quasi-arithmetic means. Pattern Recognit. Lett. 2014, 42, 25–34.
32. Birgé, L. A new lower bound for multiple hypothesis testing. IEEE Trans. Inf. Theory 2005, 51, 1611–1615.
33. Chen, X.; Guntuboyina, A.; Zhang, Y. On Bayes risk lower bounds. J. Mach. Learn. Res. 2016, 17, 7687–7744.
34. Xu, A.; Raginsky, M. Information-theoretic lower bounds on Bayes risk in decentralized estimation. IEEE Trans. Inf. Theory 2016, 63, 1580–1600.
35. Yang, Y.; Barron, A. Information-theoretic determination of minimax rates of convergence. Ann. Statist. 1999, 27, 1564–1599.
36. Scarlett, J.; Cevher, V. An introductory guide to Fano’s inequality with applications in statistical estimation. arXiv 2019, arXiv:1901.00555.

Table 1. Examples of Strongly Convex Divergences.

Divergence	f	$κ$	Domain
relative entropy (KL)	$t \log t$	$\frac{1}{M}$	$(0, M]$
total variation	$\frac{| t - 1 |}{2}$	0	$(0, \infty)$
Pearson’s $χ^{2}$	${(t - 1)}^{2}$	2	$(0, \infty)$
squared Hellinger	$2 (1 - \sqrt{t})$	$M^{- \frac{3}{2}} / 2$	$(0, M]$
reverse relative entropy	$- \log t$	$1 / M^{2}$	$(0, M]$
Vincze- Le Cam	$\frac{{(t - 1)}^{2}}{t + 1}$	$\frac{8}{{(M + 1)}^{3}}$	$(0, M]$
Jensen–Shannon	$(t + 1) \log \frac{2}{t + 1} + t \log t$	$\frac{1}{M (M + 1)}$	$(0, M]$
Neyman’s $χ^{2}$	$\frac{1}{t} - 1$	$2 / M^{3}$	$(0, M]$
Sason’s s	$\log {(s + t)}^{{(s + t)}^{2}} - \log {(s + 1)}^{{(s + 1)}^{2}}$	$2 \log (s + M) + 3$	$[M, \infty)$ , $s > e^{- 3 / 2}$
$α$-divergence	$\frac{4 (1 - t^{\frac{1 + α}{2}})}{1 - α^{2}}, α \neq \pm 1$	$M^{\frac{α - 3}{2}}$	$\begin{cases} [M, \infty), & α > 3 \\ (0, M], & α < 3 \end{cases}$

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2020 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
