Two Measures of Dependence

Lapidoth, Amos; Pfister, Christoph

doi:10.3390/e21080778

Open AccessEditor’s ChoiceArticle

Two Measures of Dependence

by

Amos Lapidoth

and

Christoph Pfister

^*

Signal and Information Processing Laboratory, ETH Zurich, 8092 Zurich, Switzerland

^*

Author to whom correspondence should be addressed.

Entropy 2019, 21(8), 778; https://doi.org/10.3390/e21080778

Submission received: 5 July 2019 / Revised: 2 August 2019 / Accepted: 5 August 2019 / Published: 8 August 2019

(This article belongs to the Special Issue Information Measures with Applications)

Download

Browse Figure

Versions Notes

Abstract

:

Two families of dependence measures between random variables are introduced. They are based on the Rényi divergence of order

α

and the relative

α

-entropy, respectively, and both dependence measures reduce to Shannon’s mutual information when their order

α

is one. The first measure shares many properties with the mutual information, including the data-processing inequality, and can be related to the optimal error exponents in composite hypothesis testing. The second measure does not satisfy the data-processing inequality, but appears naturally in the context of distributed task encoding.

Keywords:

data processing; dependence measure; relative α-entropy; Rényi divergence; Rényi entropy

1. Introduction

The solutions to many information-theoretic problems can be expressed using Shannon’s information measures such as entropy, relative entropy, and mutual information. Other problems require Rényi’s information measures, which generalize Shannon’s. In this paper, we analyze two Rényi measures of dependence,

J_{α} (X; Y)

and

K_{α} (X; Y)

, between random variables X and Y taking values in the finite sets

X

and

Y

, with

α \in [0, \infty]

being a parameter. (Our notation is similar to the one used for the mutual information: technically,

J_{α} (\cdot)

and

K_{α} (\cdot)

are functions not of X and Y, but of their joint probability mass function (PMF)

P_{X Y}

.) For

α \in [0, \infty]

, we define

J_{α} (X; Y)

and

K_{α} (X; Y)

as

\begin{matrix} J_{α} (X; Y) & ≜ min_{(Q_{X}, Q_{Y}) \in P (X) \times P (Y)} D_{α} (P_{X Y} ∥ Q_{X} Q_{Y}), \end{matrix}

(1)

\begin{matrix} K_{α} (X; Y) & ≜ min_{(Q_{X}, Q_{Y}) \in P (X) \times P (Y)} Δ_{α} (P_{X Y} ∥ Q_{X} Q_{Y}), \end{matrix}

(2)

where

P (X)

and

P (Y)

denote the set of all PMFs over

X

and

Y

, respectively;

D_{α} (P ∥ Q)

denotes the Rényi divergence of order

α

(see (50) ahead); and

Δ_{α} (P ∥ Q)

denotes the relative

α

-entropy (see (55) ahead). As shown in Proposition 7,

J_{α} (X; Y)

and

K_{α} (X; Y)

are in fact closely related.

The measures

J_{α} (X; Y)

and

K_{α} (X; Y)

have the following operational meanings (see Section 3):

J_{α} (X; Y)

is related to the optimal error exponents in testing whether the observed independent and identically distributed (IID) samples were generated according to the joint PMF

P_{X Y}

or an unknown product PMF; and

K_{α} (X; Y)

appears as a penalty term in the sum-rate constraint of distributed task encoding.

The measures

J_{α} (X; Y)

and

K_{α} (X; Y)

share many properties with Shannon’s mutual information [1], and both are equal to the mutual information when

α

is one. Except for some special cases, we have no closed-form expressions for

J_{α} (X; Y)

or

K_{α} (X; Y)

. As illustrated in Figure 1, unless

α

is one, the minimum in the definitions of

J_{α} (X; Y)

and

K_{α} (X; Y)

is typically not achieved by

Q_{X} = P_{X}

and

Q_{Y} = P_{Y}

. (When

α

is one, then the minimum is always achieved by

Q_{X} = P_{X}

and

Q_{Y} = P_{Y}

; this follows from Proposition 8 and the fact that

D_{1} (P_{X Y} ∥ Q_{X} Q_{Y}) = Δ_{1} (P_{X Y} ∥ Q_{X} Q_{Y}) = D (P_{X Y} ∥ Q_{X} Q_{Y})

.)

The rest of this paper is organized as follows. In Section 2, we review other generalizations of the mutual information. In Section 3, we discuss the operational meanings of

J_{α} (X; Y)

and

K_{α} (X; Y)

. In Section 4, we recall the required Rényi information measures and prove some preparatory results. In Section 5, we state the properties of

J_{α} (X; Y)

and

K_{α} (X; Y)

. In Section 6, we prove these properties.

2. Related Work

The measure

J_{α} (X; Y)

was discovered independently from the authors of the present paper by Tomamichel and Hayashi [2] (Equation (58)), who, for the case when

α > \frac{1}{2}

, derived some of its properties in [2] (Appendix A-C).

Other Rényi-based measures of dependence appeared in the past. Notable are those by Sibson [3], Arimoto [4], and Csiszár [5], respectively denoted by

I_{α}^{s} (\cdot)

,

I_{α}^{a} (\cdot)

, and

I_{α}^{c} (\cdot)

:

\begin{matrix} I_{α}^{s} (X; Y) & ≜ \frac{α}{α - 1} log \sum_{y} {[\sum_{x} P (x) P {(y | x)}^{α}]}^{\frac{1}{α}} \end{matrix}

(3)

\begin{matrix} = min_{Q_{Y}} D_{α} (P_{X Y} ∥ P_{X} Q_{Y}), \end{matrix}

(4)

\begin{matrix} I_{α}^{a} (X; Y) & ≜ H_{α} (X) - H_{α} (X | Y) \end{matrix}

(5)

\begin{matrix} = \frac{α}{α - 1} log \sum_{y} {[\sum_{x} \frac{P {(x)}^{α}}{\sum_{x^{'} \in X} P {(x^{'})}^{α}} P {(y | x)}^{α}]}^{\frac{1}{α}}, \end{matrix}

(6)

\begin{matrix} I_{α}^{c} (X; Y) & ≜ min_{Q_{Y}} \sum_{x} P (x) D_{α} (P_{Y | X = x} ∥ Q_{Y}), \end{matrix}

(7)

where, throughout the paper,

log (\cdot)

denotes the base-2 logarithm;

D_{α} (P ∥ Q)

denotes the Rényi divergence of order

α

(see (50) ahead);

H_{α} (X)

denotes the Rényi entropy of order

α

(see (45) ahead); and

H_{α} (X | Y)

denotes the Arimoto–Rényi conditional entropy [4,6,7], which is defined for positive

α

other than one as

\begin{matrix} H_{α} (X | Y) ≜ \frac{α}{1 - α} log \sum_{y} {[\sum_{x} P {(x, y)}^{α}]}^{\frac{1}{α}} . \end{matrix}

(8)

(Equation (4) follows from Proposition 9 ahead, and (6) follows from (45) and (8).) An overview of

I_{α}^{s} (\cdot)

,

I_{α}^{a} (\cdot)

, and

I_{α}^{c} (\cdot)

is provided in [8]. Another Rényi-based measure of dependence can be found in [9] (Equation (19)):

\begin{matrix} I_{α}^{t} (X; Y) ≜ D_{α} (P_{X Y} ∥ P_{X} P_{Y}) . \end{matrix}

(9)

The relation between

I_{α}^{c} (X; Y)

,

J_{α} (X; Y)

, and

I_{α}^{s} (X; Y)

for

α > 1

was established recently:

Proposition 1

([10] (Theorem IV.1)). For every PMF

P_{X Y}

and every

α > 1

,

\begin{matrix} I_{α}^{c} (X; Y) & \leq J_{α} (X; Y) \end{matrix}

(10)

\begin{matrix} \leq I_{α}^{s} (X; Y) . \end{matrix}

(11)

Proof.

This is proved in [10] for a measure-theoretic setting. Here, we specialize the proof to finite alphabets. We first prove (10):

\begin{matrix} J_{α} (X; Y) & = min_{Q_{Y}} min_{Q_{X}} D_{α} (P_{X Y} ∥ Q_{X} Q_{Y}) \end{matrix}

(12)

\begin{matrix} = min_{Q_{Y}} \frac{α}{α - 1} log \sum_{x} {[\sum_{y} P {(x, y)}^{α} Q_{Y} {(y)}^{1 - α}]}^{\frac{1}{α}} \end{matrix}

(13)

\begin{matrix} = min_{Q_{Y}} \frac{α}{α - 1} log \sum_{x} P (x) {[\sum_{y} P {(y | x)}^{α} Q_{Y} {(y)}^{1 - α}]}^{\frac{1}{α}} \end{matrix}

(14)

\begin{matrix} \geq min_{Q_{Y}} \frac{α}{α - 1} \sum_{x} P (x) log {[\sum_{y} P {(y | x)}^{α} Q_{Y} {(y)}^{1 - α}]}^{\frac{1}{α}} \end{matrix}

(15)

\begin{matrix} = min_{Q_{Y}} \sum_{x} P (x) \frac{1}{α - 1} log \sum_{y} P {(y | x)}^{α} Q_{Y} {(y)}^{1 - α} \end{matrix}

(16)

\begin{matrix} = I_{α}^{c} (X; Y), \end{matrix}

(17)

where (12) follows from the definition of

J_{α} (X; Y)

in (1); (13) follows from Proposition 9 ahead with the roles of

Q_{X}

and

Q_{Y}

swapped; (15) follows from Jensen’s inequality because

log (\cdot)

is concave and because

\frac{α}{α - 1} > 0

; and (17) follows from the definition of

I_{α}^{c} (X; Y)

in (7).

We next prove (11):

\begin{matrix} J_{α} (X; Y) & = min_{Q_{X}, Q_{Y}} D_{α} (P_{X Y} ∥ Q_{X} Q_{Y}) \end{matrix}

(18)

\begin{matrix} \leq min_{Q_{Y}} D_{α} (P_{X Y} ∥ P_{X} Q_{Y}) \end{matrix}

(19)

\begin{matrix} = I_{α}^{s} (X; Y), \end{matrix}

(20)

where (18) follows from the definition of

J_{α} (X; Y)

in (1), and (20) follows from (4). □

Many of the above Rényi information measures coincide when they are maximized over

P_{X}

with

P_{Y | X}

held fixed: for every conditional PMF

P_{Y | X}

and every positive

α

other than one,

\begin{matrix} max_{P_{X}} I_{α}^{a} (P_{X} P_{Y | X}) & = max_{P_{X}} I_{α}^{s} (P_{X} P_{Y | X}) \end{matrix}

(21)

\begin{matrix} = max_{P_{X}} I_{α}^{c} (P_{X} P_{Y | X}), \end{matrix}

(22)

where

P_{X} P_{Y | X}

denotes the joint PMF of X and Y; (21) follows from [4] (Lemma 1); and (22) follows from [5] (Proposition 1). It was recently established that, for

α > 1

, this is also true for

J_{α} (X; Y)

:

Proposition 2

([10] (Theorem V.1)). For every conditional PMF

P_{Y | X}

and every

α > 1

,

\begin{matrix} max_{P_{X}} J_{α} (P_{X} P_{Y | X}) = max_{P_{X}} I_{α}^{s} (P_{X} P_{Y | X}) . \end{matrix}

(23)

Proof.

By Proposition 1, we have for all

α > 1

\begin{matrix} max_{P_{X}} I_{α}^{c} (P_{X} P_{Y | X}) & \leq max_{P_{X}} J_{α} (P_{X} P_{Y | X}) \end{matrix}

(24)

\begin{matrix} \leq max_{P_{X}} I_{α}^{s} (P_{X} P_{Y | X}) . \end{matrix}

(25)

By (22), the left-hand side (LHS) of (24) is equal to the right-hand side (RHS) of (25), so (24) and (25) both hold with equality. □

Dependence measures can also be based on the f-divergence

D_{f} (P ∥ Q)

[11,12,13]. Every convex function

f : (0, \infty) \to R

satisfying

f (1) = 0

induces a dependence measure, namely

\begin{matrix} I_{f} (X; Y) & ≜ D_{f} (P_{X Y} ∥ P_{X} P_{Y}) \end{matrix}

(26)

\begin{matrix} = \sum_{x, y} P (x) P (y) f (\frac{P (x, y)}{P (x) P (y)}), \end{matrix}

(27)

where (27) follows from the definition of the f-divergence. (For

f (t) = t log t

,

I_{f} (X; Y)

is the mutual information.) Such dependence measures are used for example in [14], and a construction equivalent to (27) is studied in [15].

3. Operational Meanings

In this section, we discuss the operational meaning of

J_{α} (X; Y)

in hypothesis testing (Section 3.1) and of

K_{α} (X; Y)

in distributed task encoding (Section 3.2).

3.1. Testing Against Independence and $J_{α} (X; Y)$

Consider the hypothesis testing problem of guessing whether an observed sequence of pairs was drawn IID from some given joint PMF

P_{X Y}

or IID from some unknown product distribution. Thus, based on a sequence of pairs of random variables

{(X_{i}, Y_{i})}_{i = 1}^{n}

, two hypotheses have to be distinguished:

0): Under the null hypothesis, $(X_{1}, Y_{1}), \dots, (X_{n}, Y_{n})$ are IID according to $P_{X Y}$ .
1): Under the alternative hypothesis, $(X_{1}, Y_{1}), \dots, (X_{n}, Y_{n})$ are IID according to some unknown PMF of the form $Q_{X Y} = Q_{X} Q_{Y}$ , where $Q_{X}$ and $Q_{Y}$ are arbitrary PMFs over $X$ and $Y$ , respectively.

Associated with every deterministic test

T_{n} : X^{n} \times Y^{n} \to {0, 1}

and pair

(Q_{X}, Q_{Y})

are the type-I error probability

P_{X Y}^{\times n} [T_{n} (X^{n}, Y^{n}) = 1]

and the type-II error probability

{(Q_{X} Q_{Y})}^{\times n} [T_{n} (X^{n}, Y^{n}) = 0]

, where

R_{X Y}^{\times n} [A]

denotes the probability of an event

A

when

{(X_{i}, Y_{i})}_{i = 1}^{n}

are IID according to

R_{X Y}

. We seek sequences of tests whose worst-case type-II error probability decays exponentially faster than

2^{- n E_{Q}}

. To be more specific, for a fixed

E_{Q} \in R

, denote by

T (E_{Q})

the set of all sequences of deterministic tests

{T_{n}}_{n = 1}^{\infty}

for which

\begin{matrix} \underset{n \to \infty}{lim inf} min_{Q_{X}, Q_{Y}} - \frac{1}{n} log ({(Q_{X} Q_{Y})}^{\times n} [T_{n} (X^{n}, Y^{n}) = 0]) > E_{Q}, \end{matrix}

(28)

where

log (\cdot)

denotes the base-2 logarithm. Note that (28) implies—but is not equivalent to—that for n sufficiently large,

{(Q_{X} Q_{Y})}^{\times n} [T_{n} (X^{n}, Y^{n}) = 0] \leq 2^{- n E_{Q}}

for all

(Q_{X}, Q_{Y}) \in P (X) \times P (Y)

. For a fixed

E_{Q} \in R

, the optimal type-I error exponent that can be asymptotically achieved under the constraint (28) is given by

\begin{matrix} E_{P} (E_{Q}) ≜ sup_{{T_{n}}_{n = 1}^{\infty} \in T (E_{Q})} \underset{n \to \infty}{lim inf} - \frac{1}{n} log (P_{X Y}^{\times n} [T_{n} (X^{n}, Y^{n}) = 1]) . \end{matrix}

(29)

The measure

J_{α} (X; Y)

appears as follows: In [2] (first part of (57)), it is shown that for

E_{Q}

sufficiently close to

I (X; Y)

,

\begin{matrix} E_{P} (E_{Q}) = sup_{α \in (\frac{1}{2}, 1]} \frac{1 - α}{α} (J_{α} (X; Y) - E_{Q}), \end{matrix}

(30)

and in [16] (Theorem 3), it is shown that for all

E_{Q} \in R

,

\begin{matrix} E_{P}^{* *} (E_{Q}) = sup_{α \in (0, 1]} \frac{1 - α}{α} (J_{α} (X; Y) - E_{Q}), \end{matrix}

(31)

where

E_{P}^{* *} (\cdot)

denotes the Fenchel biconjugate of

E_{P} (\cdot)

. In general, the Fenchel biconjugation cannot be omitted because sometimes [16] (Equation (11) and Example 14)

\begin{matrix} E_{P} (E_{Q}) \neq E_{P}^{* *} (E_{Q}) . \end{matrix}

(32)

For large values of

E_{Q}

, the optimal type-I error tends to one as n tends to infinity. In this case, the type-I strong-converse exponent [17,18], which is defined for a sequence of tests

{T_{n}}_{n = 1}^{\infty}

as

\begin{matrix} S C_{P} ≜ \underset{n \to \infty}{lim sup} - \frac{1}{n} log (1 - P_{X Y}^{\times n} [T_{n} (X^{n}, Y^{n}) = 1]), \end{matrix}

(33)

measures how fast the type-I error tends to one as n tends to infinity (smaller values correspond to lower error probabilities). For a fixed

E_{Q} \in R

, the optimal type-I strong-converse exponent that can be asymptotically achieved under the constraint (28) is given by

\begin{matrix} S C_{P} (E_{Q}) ≜ inf_{{T_{n}}_{n = 1}^{\infty} \in T (E_{Q})} \underset{n \to \infty}{lim sup} - \frac{1}{n} log (1 - P_{X Y}^{\times n} [T_{n} (X^{n}, Y^{n}) = 1]) . \end{matrix}

(34)

In [2] (second part of (57)), it is shown that for

E_{Q}

sufficiently close to

I (X; Y)

,

\begin{matrix} S C_{P} (E_{Q}) = sup_{α > 1} \frac{1 - α}{α} (J_{α} (X; Y) - E_{Q}) . \end{matrix}

(35)

Here, the same

\frac{1 - α}{α} (J_{α} (X; Y) - E_{Q})

expression appears as in (30) and (31), but with a different set of

α

’s to optimize over.

3.2. Distributed Task Encoding and $K_{α} (X; Y)$

The task-encoding problem studied in [19] can be extended to a distributed setting as follows [20]: A source

{(X_{i}, Y_{i})}_{i = 1}^{\infty}

emits pairs of random variables

(X_{i}, Y_{i})

taking values in a finite alphabet

X \times Y

. For a fixed rate pair

(R_{X}, R_{Y}) \in R_{\geq 0}^{2}

and a positive integer n, the sequences

{X_{i}}_{i = 1}^{n}

and

{Y_{i}}_{i = 1}^{n}

are described separately using

⌊ 2^{n R_{X}} ⌋

and

⌊ 2^{n R_{Y}} ⌋

labels, respectively. The decoder produces a list comprising all the pairs

(x^{n}, y^{n})

whose description matches the given labels, and the goal is to minimize the

ρ

-th moment of the list size as n tends to infinity (for some

ρ > 0

).

For a fixed

ρ > 0

, a rate pair

(R_{X}, R_{Y}) \in R_{\geq 0}^{2}

is called achievable if there exists a sequence of encoders

{(f_{n}, g_{n})}_{n = 1}^{\infty}

,

\begin{matrix} f_{n} : X^{n} & \to {1, \dots, ⌊ 2^{n R_{X}} ⌋}, \end{matrix}

(36)

\begin{matrix} g_{n} : Y^{n} & \to {1, \dots, ⌊ 2^{n R_{Y}} ⌋}, \end{matrix}

(37)

such that the

ρ

-th moment of the list size tends to one as n tends to infinity, i.e.,

\begin{matrix} lim_{n \to \infty} E [{| L (X^{n}, Y^{n}) |}^{ρ}] = 1, \end{matrix}

(38)

where

\begin{matrix} L (x^{n}, y^{n}) ≜ {({\tilde{x}}^{n}, {\tilde{y}}^{n}) \in X^{n} \times Y^{n} : f_{n} ({\tilde{x}}^{n}) = f_{n} (x^{n}) \land g_{n} ({\tilde{y}}^{n}) = g_{n} (y^{n})} . \end{matrix}

(39)

For a memoryless source and a fixed

ρ > 0

, rate pairs in the interior of the region

R (ρ)

defined next are achievable, while those outside

R (ρ)

are not achievable [20] (Theorem 1). The region

R (ρ)

is defined as the set of all rate pairs

(R_{X}, R_{Y})

satisfying the following inequalities simultaneously:

\begin{matrix} R_{X} & \geq H_{\frac{1}{1 + ρ}} (X), \end{matrix}

(40)

\begin{matrix} R_{Y} & \geq H_{\frac{1}{1 + ρ}} (Y), \end{matrix}

(41)

\begin{matrix} R_{X} + R_{Y} & \geq H_{\frac{1}{1 + ρ}} (X, Y) + K_{\frac{1}{1 + ρ}} (X; Y), \end{matrix}

(42)

where

H_{α} (X)

denotes the Rényi entropy of order

α

(see (45) ahead).

To better understand the role of

K_{α} (X; Y)

, suppose that the sequences

{X_{i}}_{i = 1}^{n}

and

{Y_{i}}_{i = 1}^{n}

were allowed to be described jointly using

⌊ 2^{n R_{X}} ⌋ \cdot ⌊ 2^{n R_{Y}} ⌋ \approx 2^{n (R_{X} + R_{Y})}

labels. Then, by [19] (Theorem I.2), all rate pairs

(R_{X}, R_{Y}) \in R_{\geq 0}^{2}

satisfying the following inequality with strict inequality would be achievable, while those not satisfying the inequality would not:

\begin{matrix} R_{X} + R_{Y} \geq H_{\frac{1}{1 + ρ}} (X, Y) . \end{matrix}

(43)

Comparing (42) and (43), we see that the measure

K_{α} (X; Y)

appears as a penalty term on the sum-rate constraint incurred by requiring that the sequences be described separately as opposed to jointly.

4. Preliminaries

Throughout the paper,

log (\cdot)

denotes the base-2 logarithm,

X

and

Y

are finite sets,

P_{X Y}

denotes a joint PMF over

X \times Y

,

Q_{X}

denotes a PMF over

X

, and

Q_{Y}

denotes a PMF over

Y

. We use P and Q as generic PMFs over a finite set

X

. We denote by

supp (P) ≜ {x \in X : P (x) > 0}

the support of P, and by

P (X)

the set of all PMFs over

X

. When clear from the context, we often omit sets and subscripts: for example, we write

{min}_{Q_{X}, Q_{Y}}

for

{min}_{(Q_{X}, Q_{Y}) \in P (X) \times P (Y)}

,

\sum_{x}

for

\sum_{x \in X}

,

P (x)

for

P_{X} (x)

, and

P (y | x)

for

P_{Y | X} (y | x)

. Whenever a conditional probability

P (y | x)

is undefined because

P (x) = 0

, we define

P (y | x) ≜ 1 / | Y |

. We denote by

𝟙 {condition}

the indicator function that is one if the condition is satisfied and zero otherwise. In the definitions below, we use the following conventions:

\begin{matrix} \frac{0}{0} = 0, \frac{p}{0} = \infty \forall p > 0, 0 log 0 = 0, β log 0 = - \infty \forall β > 0 . \end{matrix}

(44)

The Rényi entropy of order

α

[21] is defined for positive

α

other than one as

\begin{matrix} H_{α} (X) ≜ \frac{1}{1 - α} log \sum_{x} P {(x)}^{α} . \end{matrix}

(45)

For

α

being zero, one, or infinity, we define by continuous extension of (45)

\begin{matrix} H_{0} (X) & ≜ log | supp (P) |, \end{matrix}

(46)

\begin{matrix} H_{1} (X) & ≜ H (X), \end{matrix}

(47)

\begin{matrix} H_{\infty} (X) & ≜ - log max_{x} P (x), \end{matrix}

(48)

where

H (X)

is the Shannon entropy. With this extension to

α \in {0, 1, \infty}

, the Rényi entropy satisfies the following basic properties:

Proposition 3

([5]). Let P be a PMF. Then,

(i): For all $α \in [0, \infty]$ , $H_{α} (X) \leq log | X |$ . If $α \in (0, \infty]$ , then $H_{α} (X) = log | X |$ if and only if X is distributed uniformly over $X$ .
(ii): The mapping $α \mapsto H_{α} (X)$ is nonincreasing on $[0, \infty]$ .
(iii): The mapping $α \mapsto H_{α} (X)$ is continuous on $[0, \infty]$ .

The relative entropy (or Kullback–Leibler divergence) is defined as

\begin{matrix} D (P ∥ Q) ≜ \sum_{x} P (x) log \frac{P (x)}{Q (x)} . \end{matrix}

(49)

The Rényi divergence of order

α

[21,22] is defined for positive

α

other than one as

\begin{matrix} D_{α} (P ∥ Q) ≜ \frac{1}{α - 1} log \sum_{x} P {(x)}^{α} Q {(x)}^{1 - α}, \end{matrix}

(50)

where we read

P {(x)}^{α} Q {(x)}^{1 - α}

as

P {(x)}^{α} / Q {(x)}^{α - 1}

if

α > 1

. For

α

being zero, one, or infinity, we define by continuous extension of (50)

\begin{matrix} D_{0} (P ∥ Q) & ≜ - log \sum_{x \in supp (P)} Q (x), \end{matrix}

(51)

\begin{matrix} D_{1} (P ∥ Q) & ≜ D (P ∥ Q), \end{matrix}

(52)

\begin{matrix} D_{\infty} (P ∥ Q) & ≜ log max_{x} \frac{P (x)}{Q (x)} . \end{matrix}

(53)

With this extension to

α \in {0, 1, \infty}

, the Rényi divergence satisfies the following basic properties:

Proposition 4.

Let P and Q be PMFs. Then,

(i): For all $α \in [0, 1)$ , $D_{α} (P ∥ Q)$ is finite if and only if $| supp (P) \cap supp (Q) | > 0$ . For all $α \in [1, \infty]$ , $D_{α} (P ∥ Q)$ is finite if and only if $supp (P) \subseteq supp (Q)$ .
(ii): For all $α \in [0, \infty]$ , $D_{α} (P ∥ Q) \geq 0$ . If $α \in (0, \infty]$ , then $D_{α} (P ∥ Q) = 0$ if and only if $P = Q$ .
(iii): For every $α \in [0, \infty]$ , the mapping $Q \mapsto D_{α} (P ∥ Q)$ is continuous.
(iv): The mapping $α \mapsto D_{α} (P ∥ Q)$ is nondecreasing on $[0, \infty]$ .
(v): The mapping $α \mapsto D_{α} (P ∥ Q)$ is continuous on $[0, \infty]$ .

Proof.

Part (i) follows from the definition of

D_{α} (P ∥ Q)

and the conventions (44), and Parts (ii)–(v) are shown in [22]. □

The Rényi divergence for negative

α

is defined as

\begin{matrix} D_{α} (P ∥ Q) ≜ \frac{1}{α - 1} log \sum_{x} \frac{Q {(x)}^{1 - α}}{P {(x)}^{- α}} . \end{matrix}

(54)

(We use negative

α

only in Lemma 19. More about negative orders can be found in [22] (Section V). For other applications of negative orders, see [23] (Proof of Theorem 1 and Example 1).)

The relative

α

-entropy [24,25] is defined for positive

α

other than one as

\begin{matrix} Δ_{α} (P ∥ Q) ≜ \frac{α}{1 - α} log \sum_{x} P (x) Q {(x)}^{α - 1} + log \sum_{x} Q {(x)}^{α} - \frac{1}{1 - α} log \sum_{x} P {(x)}^{α}, \end{matrix}

(55)

where we read

P (x) Q {(x)}^{α - 1}

as

P (x) / Q {(x)}^{1 - α}

if

α < 1

. The relative

α

-entropy appears in mismatched guessing [26], mismatched source coding [26] (Theorem 8), and mismatched task encoding [19] (Section IV). It also arises in robust parameter estimation and constrained compression settings [25] (Section II). For

α

being zero, one, or infinity, we define by continuous extension of (55)

\begin{matrix} Δ_{0} (P ∥ Q) & ≜ \{\begin{matrix} log \frac{| supp (Q) |}{| supp (P) |} & if supp (P) \subseteq supp (Q), \\ \infty & otherwise, \end{matrix} \end{matrix}

(56)

\begin{matrix} Δ_{1} (P ∥ Q) & ≜ D (P ∥ Q), \end{matrix}

(57)

\begin{matrix} Δ_{\infty} (P ∥ Q) & ≜ log \frac{{max}_{x} P (x)}{{| argmax (Q) |}^{- 1} \sum_{x \in argmax (Q)} P (x)}, \end{matrix}

(58)

where

argmax (Q) ≜ {x \in X : Q (x) = {max}_{x^{'} \in X} Q (x^{'})}

and

| argmax (Q) |

is the cardinality of this set. With this extension to

α \in {0, 1, \infty}

, the relative

α

-entropy satisfies the following basic properties:

Proposition 5.

Let P and Q be PMFs. Then,

(i): For all $α \in [0, 1]$ , $Δ_{α} (P ∥ Q)$ is finite if and only if $supp (P) \subseteq supp (Q)$ . For all $α \in (1, \infty)$ , $Δ_{α} (P ∥ Q)$ is finite if and only if $| supp (P) \cap supp (Q) | > 0$ .
(ii): For all $α \in [0, \infty]$ , $Δ_{α} (P ∥ Q) \geq 0$ . If $α \in (0, \infty)$ , then $Δ_{α} (P ∥ Q) = 0$ if and only if $P = Q$ .
(iii): For every $α \in (0, \infty)$ , the mapping $Q \mapsto Δ_{α} (P ∥ Q)$ is continuous.
(iv): The mapping $α \mapsto Δ_{α} (P ∥ Q)$ is continuous on $[0, \infty]$ .

(Part (i) differs from [19] (Proposition IV.1), where the conventions for

α > 1

differ from ours. Our conventions are compatible with [24,25], and, as stated in Part (iii), they result in the continuity of the mapping

Q \mapsto Δ_{α} (P ∥ Q)

.)

Proof of Proposition 5.

Part (i) follows from the definition of

Δ_{α} (P ∥ Q)

in (55) and the conventions (44). For

α \in (0, 1) \cup (1, \infty)

, Part (ii) follows from [19] (Proposition IV.1); for

α = 1

, Part (ii) holds because

Δ_{1} (P ∥ Q) = D (P ∥ Q)

; and for

α \in {0, \infty}

, Part (ii) follows from the definition of

Δ_{α} (P ∥ Q)

. Part (iii) follows from the definition of

Δ_{α} (P ∥ Q)

, and Part (iv) follows from [19] (Proposition IV.1). □

In the rest of this section, we prove some auxiliary results that we need later (Propositions 6–9). We first establish the relation between

D_{α} (P ∥ Q)

and

Δ_{α} (P ∥ Q)

.

Proposition 6

([26] (Section V, Property 4)). Let P and Q be PMFs, and let

α > 0

. Then,

\begin{matrix} Δ_{α} (P ∥ Q) = D_{\frac{1}{α}} (\tilde{P} ∥ \tilde{Q}), \end{matrix}

(59)

where the PMFs

\tilde{P}

and

\tilde{Q}

are given by

\begin{matrix} \tilde{P} (x) & ≜ \frac{P {(x)}^{α}}{\sum_{x^{'} \in X} P {(x^{'})}^{α}}, \end{matrix}

(60)

\begin{matrix} \tilde{Q} (x) & ≜ \frac{Q {(x)}^{α}}{\sum_{x^{'} \in X} Q {(x^{'})}^{α}} . \end{matrix}

(61)

Proof.

If

α = 1

, then (59) holds because

\tilde{P} = P

,

\tilde{Q} = Q

, and

Δ_{1} (P ∥ Q) = D_{1} (P ∥ Q) = D (P ∥ Q)

. Now let

α \neq 1

. Because

\tilde{P} (x)

and

\tilde{Q} (x)

are zero if and only if

P (x)

and

Q (x)

are zero, respectively, the LHS of (59) is finite if and only if its RHS is finite. If

D_{1 / α} (\tilde{P} ∥ \tilde{Q})

is finite, then (59) follows from a simple computation. □

In light of Proposition 6,

J_{α} (X; Y)

and

K_{α} (X; Y)

are related as follows:

Proposition 7.

Let

P_{X Y}

be a joint PMF, and let

α > 0

. Then,

\begin{matrix} K_{α} (X; Y) = J_{\frac{1}{α}} (\tilde{X}; \tilde{Y}), \end{matrix}

(62)

where the joint PMF of

\tilde{X}

and

\tilde{Y}

is given by

\begin{matrix} {\tilde{P}}_{X Y} (x, y) ≜ \frac{P_{X Y} {(x, y)}^{α}}{\sum_{(x^{'}, y^{'}) \in X \times Y} P_{X Y} {(x^{'}, y^{'})}^{α}} . \end{matrix}

(63)

Proof.

Let

α > 0

. For fixed PMFs

Q_{X}

and

Q_{Y}

, define the transformed PMFs

\tilde{Q_{X} Q_{Y}}

,

{\tilde{Q}}_{X}

, and

{\tilde{Q}}_{Y}

as

\begin{matrix} \tilde{Q_{X} Q_{Y}} (x, y) & ≜ \frac{{[Q_{X} (x) Q_{Y} (y)]}^{α}}{\sum_{(x^{'}, y^{'}) \in X \times Y} {[Q_{X} (x^{'}) Q_{Y} (y^{'})]}^{α}}, \end{matrix}

(64)

\begin{matrix} {\tilde{Q}}_{X} (x) & ≜ \frac{Q_{X} {(x)}^{α}}{\sum_{x^{'} \in X} Q_{X} {(x^{'})}^{α}}, \end{matrix}

(65)

\begin{matrix} {\tilde{Q}}_{Y} (y) & ≜ \frac{Q_{Y} {(y)}^{α}}{\sum_{y^{'} \in Y} Q_{Y} {(y^{'})}^{α}} . \end{matrix}

(66)

Then,

\begin{matrix} K_{α} (X; Y) & = min_{Q_{X}, Q_{Y}} Δ_{α} (P_{X Y} ∥ Q_{X} Q_{Y}) \end{matrix}

(67)

\begin{matrix} = min_{Q_{X}, Q_{Y}} D_{\frac{1}{α}} ({\tilde{P}}_{X Y} ∥ \tilde{Q_{X} Q_{Y}}) \end{matrix}

(68)

\begin{matrix} = min_{Q_{X}, Q_{Y}} D_{\frac{1}{α}} ({\tilde{P}}_{X Y} ∥ {\tilde{Q}}_{X} {\tilde{Q}}_{Y}) \end{matrix}

(69)

\begin{matrix} = min_{Q_{X}, Q_{Y}} D_{\frac{1}{α}} ({\tilde{P}}_{X Y} ∥ Q_{X} Q_{Y}) \end{matrix}

(70)

\begin{matrix} = J_{\frac{1}{α}} (\tilde{X}; \tilde{Y}), \end{matrix}

(71)

where (67) holds by the definition of

K_{α} (X; Y)

; (68) follows from Proposition 6; (69) holds because

\tilde{Q_{X} Q_{Y}} = {\tilde{Q}}_{X} {\tilde{Q}}_{Y}

; (70) holds because the transformations (65) and (66) are bijective on the set of PMFs over

X

and

Y

, respectively; and (71) holds by the definition of

J_{α} (X; Y)

. □

The next proposition provides a characterization of the mutual information that parallels the definitions of

J_{α} (X; Y)

and

K_{α} (X; Y)

. Because

D_{1} (P ∥ Q) = Δ_{1} (P ∥ Q) = D (P ∥ Q)

, this also shows that

J_{α} (X; Y)

and

K_{α} (X; Y)

reduce to the mutual information when

α

is one.

Proposition 8

([27] (Theorem 3.4)). Let

P_{X Y}

be a joint PMF. Then, for all PMFs

Q_{X}

and

Q_{Y}

,

\begin{matrix} D (P_{X Y} ∥ Q_{X} Q_{Y}) \geq D (P_{X Y} ∥ P_{X} P_{Y}), \end{matrix}

(72)

with equality if and only if

Q_{X} = P_{X}

and

Q_{Y} = P_{Y}

. Thus,

\begin{matrix} I (X; Y) = min_{Q_{X}, Q_{Y}} D (P_{X Y} ∥ Q_{X} Q_{Y}) . \end{matrix}

(73)

Proof.

A simple computation reveals that

\begin{matrix} D (P_{X Y} ∥ Q_{X} Q_{Y}) = D (P_{X Y} ∥ P_{X} P_{Y}) + D (P_{X} ∥ Q_{X}) + D (P_{Y} ∥ Q_{Y}), \end{matrix}

(74)

which implies (72) because

D (P ∥ Q) \geq 0

with equality if and only if

P = Q

. Thus, (73) holds because

I (X; Y) = D (P_{X Y} ∥ P_{X} P_{Y})

. □

The last proposition of this section is about a precursor to

J_{α} (X; Y)

, namely, the minimization of

D_{α} (P_{X Y} ∥ Q_{X} Q_{Y})

with respect to

Q_{Y}

only, which can be carried out explicitly. (This proposition extends [5] (Equation (13)) and [2] (Lemma 29).)

Proposition 9.

Let

P_{X Y}

be a joint PMF and

Q_{X}

a PMF. Then, for every

α \in (0, 1) \cup (1, \infty)

,

\begin{matrix} min_{Q_{Y}} D_{α} (P_{X Y} ∥ Q_{X} Q_{Y}) = \frac{α}{α - 1} log \sum_{y} {[\sum_{x} P {(x, y)}^{α} Q_{X} {(x)}^{1 - α}]}^{\frac{1}{α}}, \end{matrix}

(75)

with the conventions of (44). If the RHS of (75) is finite, then the minimum is achieved uniquely by

\begin{matrix} Q_{Y}^{*} (y) = \frac{{[\sum_{x} P {(x, y)}^{α} Q_{X} {(x)}^{1 - α}]}^{\frac{1}{α}}}{\sum_{y^{'} \in Y} {[\sum_{x} P {(x, y^{'})}^{α} Q_{X} {(x)}^{1 - α}]}^{\frac{1}{α}}} . \end{matrix}

(76)

For

α = \infty

,

\begin{matrix} min_{Q_{Y}} D_{\infty} (P_{X Y} ∥ Q_{X} Q_{Y}) = log \sum_{y} max_{x} \frac{P (x, y)}{Q_{X} (x)}, \end{matrix}

(77)

with the conventions of (44). If the RHS of (77) is finite, then the minimum is achieved uniquely by

\begin{matrix} Q_{Y}^{*} (y) = \frac{{max}_{x} [P (x, y) / Q_{X} (x)]}{\sum_{y^{'} \in Y} {max}_{x} [P (x, y^{'}) / Q_{X} (x)]} . \end{matrix}

(78)

Proof.

We first treat the case

α \in (0, 1) \cup (1, \infty)

. If the RHS of (75) is infinite, then the conventions imply that

D_{α} (P_{X Y} ∥ Q_{X} Q_{Y})

is infinite for every

Q_{Y} \in P (Y)

, so (75) holds. Otherwise, if the RHS of (75) is finite, then the PMF

Q_{Y}^{*}

given by (76) is well-defined, and a simple computation shows that for every

Q_{Y} \in P (Y)

,

\begin{matrix} D_{α} (P_{X Y} ∥ Q_{X} Q_{Y}) = \frac{α}{α - 1} log \sum_{y} {[\sum_{x} P {(x, y)}^{α} Q_{X} {(x)}^{1 - α}]}^{\frac{1}{α}} + D_{α} (Q_{Y}^{*} ∥ Q_{Y}) . \end{matrix}

(79)

The only term on the RHS of (79) that depends on

Q_{Y}

is

D_{α} (Q_{Y}^{*} ∥ Q_{Y})

. Because

D_{α} (Q_{Y}^{*} ∥ Q_{Y}) \geq 0

with equality if and only if

Q_{Y} = Q_{Y}^{*}

(Proposition 4), (79) implies (75) and (76).

The case

α = \infty

is analogous: if the RHS of (77) is infinite, then the LHS of (77) is infinite, too; and if the RHS of (77) is finite, then the PMF

Q_{Y}^{*}

given by (78) is well-defined, and a simple computation shows that for every

Q_{Y} \in P (Y)

,

\begin{matrix} D_{\infty} (P_{X Y} ∥ Q_{X} Q_{Y}) = log \sum_{y} max_{x} \frac{P (x, y)}{Q_{X} (x)} + D_{\infty} (Q_{Y}^{*} ∥ Q_{Y}) . \end{matrix}

(80)

The only term on the RHS of (80) that depends on

Q_{Y}

is

D_{\infty} (Q_{Y}^{*} ∥ Q_{Y})

. Because

D_{\infty} (Q_{Y}^{*} ∥ Q_{Y}) \geq 0

with equality if and only if

Q_{Y} = Q_{Y}^{*}

(Proposition 4), (80) implies (77) and (78). □

5. Two Measures of Dependence

We state the properties of

J_{α} (X; Y)

in Theorem 1 and those of

K_{α} (X; Y)

in Theorem 2. The enumeration labels in the theorems refer to the lemmas in Section 6 where the properties are proved. (The enumeration labels are not consecutive because, in order to avoid forward references in the proofs, the order of the results in Section 6 is not the same as here.)

Theorem 1.

Let X,

X_{1}

,

X_{2}

, Y,

Y_{1}

,

Y_{2}

, and Z be random variables taking values in finite sets. Then:

(Lemma 1): For every $α \in [0, \infty]$ , the minimum in the definition of $J_{α} (X; Y)$ exists and is finite.

The following properties of the mutual information

I (X; Y)

[28] (Chapter 2) are also satisfied by

J_{α} (X; Y)

:

(Lemma 2): For all $α \in [0, \infty]$ , $J_{α} (X; Y) \geq 0$ . If $α \in (0, \infty]$ , then $J_{α} (X; Y) = 0$ if and only if X and Y are independent (nonnegativity).
(Lemma 3): For all $α \in [0, \infty]$ , $J_{α} (X; Y) = J_{α} (Y; X)$ (symmetry).
(Lemma 4): If $X ⊸ - - Y ⊸ - - Z$ form a Markov chain, then $J_{α} (X; Z) \leq J_{α} (X; Y)$ for all $α \in [0, \infty]$ (data-processing inequality).
(Lemma 12): If the pairs $(X_{1}, Y_{1})$ and $(X_{2}, Y_{2})$ are independent, then $J_{α} (X_{1}, X_{2}; Y_{1}, Y_{2}) = J_{α} (X_{1}; Y_{1}) + J_{α} (X_{2}; Y_{2})$ for all $α \in [0, \infty]$ (additivity).
(Lemma 13): For all $α \in [0, \infty]$ , $J_{α} (X; Y) \leq log | X |$ with equality if and only if $(α \in [\frac{1}{2}, \infty]$ , X is distributed uniformly over $X$ , and $H (X | Y) = 0)$ .
(Lemma 14): For every $α \in [1, \infty]$ , $J_{α} (X; Y)$ is concave in $P_{X}$ for fixed $P_{Y | X}$ .

Moreover:

(Lemma 5): $J_{0} (X; Y) = 0$ .
(Lemma 6): Let $f : {1, \dots, | X |} \to X$ and $g : {1, \dots, | Y |} \to Y$ be bijective functions, and let $A$ be the $| X | \times | Y |$ matrix whose Row-i Column-j entry $A_{i, j}$ equals $\sqrt{P_{X Y} (f (i), g (j))}$ . Then,

$\begin{matrix} J_{\frac{1}{2}} (X; Y) = - 2 log σ_{1} (A), \end{matrix}$

(81)

where $σ_{1} (A)$ denotes the largest singular value of $A$ . (Because the singular values of a matrix are invariant under row and column permutations, the result does not depend on f or g.)
(Lemma 7): $J_{1} (X; Y) = I (X; Y)$ .
(Lemma 8): For all $α > 0$ ,

$\begin{matrix} (1 - α) J_{α} (X; Y) = min_{R_{X Y} \in P (X \times Y)} [(1 - α) D (R_{X Y} ∥ R_{X} R_{Y}) + α D (R_{X Y} ∥ P_{X Y})] . \end{matrix}$

(82)

Thus, being the minimum of concave functions in α, the mapping $α \mapsto (1 - α) J_{α} (X; Y)$ is concave on $(0, \infty)$ .
(Lemma 9): The mapping $α \mapsto J_{α} (X; Y)$ is nondecreasing on $[0, \infty]$ .
(Lemma 10): The mapping $α \mapsto J_{α} (X; Y)$ is continuous on $[0, \infty]$ .
(Lemma 11): If $X = Y$ with probability one, then

$\begin{matrix} J_{α} (X; Y) = \{\begin{matrix} \frac{α}{1 - α} H_{\infty} (X) & i f α \in [0, \frac{1}{2}], \\ H_{\frac{α}{2 α - 1}} (X) & i f α > \frac{1}{2}, \\ H_{\frac{1}{2}} (X) & i f α = \infty . \end{matrix} \end{matrix}$

(83)

The minimization problem in the definition of

J_{α} (X; Y)

has the following characteristics:

(Lemma 15): For every $α \in [\frac{1}{2}, \infty]$ , the mapping $(Q_{X}, Q_{Y}) \mapsto D_{α} (P_{X Y} ∥ Q_{X} Q_{Y})$ is convex, i.e., for all $λ, λ^{'} \in [0, 1]$ with $λ + λ^{'} = 1$ , all $Q_{X}, Q_{X}^{'} \in P (X)$ , and all $Q_{Y}, Q_{Y}^{'} \in P (Y)$ ,

$\begin{matrix} D_{α} (P_{X Y} ∥ (λ Q_{X} + λ^{'} Q_{X}^{'}) (λ Q_{Y} + λ^{'} Q_{Y}^{'})) \leq λ D_{α} (P_{X Y} ∥ Q_{X} Q_{Y}) + λ^{'} D_{α} (P_{X Y} ∥ Q_{X}^{'} Q_{Y}^{'}) . \end{matrix}$

(84)

For $α \in [0, \frac{1}{2})$ , the mapping need not be convex.
(Lemma 16): Let $α \in (0, 1) \cup (1, \infty)$ . If $(Q_{X}^{*}, Q_{Y}^{*})$ achieves the minimum in the definition of $J_{α} (X; Y)$ , then there exist positive normalization constants c and d such that

$\begin{matrix} Q_{X}^{*} (x) & = c {[\sum_{y} P {(x, y)}^{α} Q_{Y}^{*} {(y)}^{1 - α}]}^{\frac{1}{α}} \forall x \in X, \end{matrix}$

(85)

$\begin{matrix} Q_{Y}^{*} (y) & = d {[\sum_{x} P {(x, y)}^{α} Q_{X}^{*} {(x)}^{1 - α}]}^{\frac{1}{α}} \forall y \in Y, \end{matrix}$

(86)

with the conventions of (44). The case $α = \infty$ is similar: if $(Q_{X}^{*}, Q_{Y}^{*})$ achieves the minimum in the definition of $J_{\infty} (X; Y)$ , then there exist positive normalization constants c and d such that

$\begin{matrix} Q_{X}^{*} (x) & = c max_{y} \frac{P (x, y)}{Q_{Y}^{*} (y)} \forall x \in X, \end{matrix}$

(87)

$\begin{matrix} Q_{Y}^{*} (y) & = d max_{x} \frac{P (x, y)}{Q_{X}^{*} (x)} \forall y \in Y, \end{matrix}$

(88)

with the conventions of (44). (If $α = 1$ , then $Q_{X}^{*} = P_{X}$ and $Q_{Y}^{*} = P_{Y}$ by Proposition 8.) Thus, for all $α \in (0, \infty]$ , both inclusions $supp (Q_{X}^{*}) \subseteq supp (P_{X})$ and $supp (Q_{Y}^{*}) \subseteq supp (P_{Y})$ hold. $\frac{1}{2}$
(Lemma 20): For every $α \in (\frac{1}{2}, \infty]$ , the mapping $(Q_{X}, Q_{Y}) \mapsto D_{α} (P_{X Y} ∥ Q_{X} Q_{Y})$ has a unique minimizer. This need not be the case when $α \in [0, \frac{1}{2}]$ .

The measure

J_{α} (X; Y)

can also be expressed as follows:

(Lemma 17): For all $α \in (0, \infty]$ ,

$\begin{matrix} J_{α} (X; Y) = min_{Q_{X}} ϕ_{α} (Q_{X}), \end{matrix}$

(89)

where $ϕ_{α} (Q_{X})$ is defined as

$\begin{matrix} ϕ_{α} (Q_{X}) ≜ min_{Q_{Y}} D_{α} (P_{X Y} ∥ Q_{X} Q_{Y}) \end{matrix}$

(90)

and is given explicitly as follows: for $α \in (0, 1) \cup (1, \infty)$ ,

$\begin{matrix} ϕ_{α} (Q_{X}) = \frac{α}{α - 1} log \sum_{y} {[\sum_{x} P {(x, y)}^{α} Q_{X} {(x)}^{1 - α}]}^{\frac{1}{α}}, \end{matrix}$

(91)

with the conventions of (44); and for $α \in {1, \infty}$ ,

$\begin{matrix} ϕ_{1} (Q_{X}) & = D (P_{X Y} ∥ Q_{X} P_{Y}), \end{matrix}$

(92)

$\begin{matrix} ϕ_{\infty} (Q_{X}) & = log \sum_{y} max_{x} \frac{P (x, y)}{Q_{X} (x)}, \end{matrix}$

(93)

with the conventions of (44). For every $α \in [\frac{1}{2}, \infty]$ , the mapping $Q_{X} \mapsto ϕ_{α} (Q_{X})$ is convex. For $α \in (0, \frac{1}{2})$ , the mapping need not be convex.
(Lemma 18): For all $α \in (0, 1) \cup (1, \infty]$ ,

$\begin{matrix} J_{α} (X; Y) = \{\begin{matrix} min_{R_{X Y} \in P (X \times Y)} ψ_{α} (R_{X Y}) & i f α \in (0, 1), \\ max_{R_{X Y} \in P (X \times Y)} ψ_{α} (R_{X Y}) & i f α \in (1, \infty], \end{matrix} \end{matrix}$

(94)

where

$\begin{matrix} ψ_{α} (R_{X Y}) ≜ \{\begin{matrix} D (R_{X Y} ∥ R_{X} R_{Y}) + \frac{α}{1 - α} D (R_{X Y} ∥ P_{X Y}) & i f α \in (0, 1) \cup (1, \infty), \\ D (R_{X Y} ∥ R_{X} R_{Y}) - D (R_{X Y} ∥ P_{X Y}) & i f α = \infty . \end{matrix} \end{matrix}$

(95)

For every $α \in (1, \infty]$ , the mapping $R_{X Y} \mapsto ψ_{α} (R_{X Y})$ is concave. For all $α \in (1, \infty]$ and all $R_{X Y} \in P (X \times Y)$ , the statement $J_{α} (X; Y) = ψ_{α} (R_{X Y})$ is equivalent to $ψ_{α} (R_{X Y}) = D_{α} (P_{X Y} ∥ R_{X} R_{Y})$ .
(Lemma 19): For all $α \in (0, 1) \cup (1, \infty)$ ,

$\begin{matrix} J_{α} (X; Y) = min_{R_{X} ≪ P_{X}} \frac{1}{α - 1} [D_{\frac{α}{α - 1}} (P_{X} ∥ R_{X}) - α E_{0} (\frac{1 - α}{α}, R_{X})], \end{matrix}$

(96)

where the minimization is over all PMFs $R_{X}$ satisfying $R_{X} ≪ P_{X}$ $(i . e ., supp (R_{X}) \subseteq supp (P_{X}))$ ; $D_{α} (P ∥ Q)$ for negative α is given by (54); and Gallager’s $E_{0}$ function [29] is defined as

$\begin{matrix} E_{0} (ρ, R_{X}) ≜ - log \sum_{y} {[\sum_{x} R_{X} (x) P {(y | x)}^{\frac{1}{1 + ρ}}]}^{1 + ρ} . \end{matrix}$

(97)

We now move on to the properties of

K_{α} (X; Y)

. Some of these properties are derived from their counterparts of

J_{α} (X; Y)

using the relation

K_{α} (X; Y) = J_{1 / α} (\tilde{X}; \tilde{Y})

described in Proposition 7.

Theorem 2.

Let X,

X_{1}

,

X_{2}

, Y,

Y_{1}

,

Y_{2}

, and Z be random variables taking values in finite sets. Then:

(Lemma 21): For every $α \in [0, \infty]$ , the minimum in the definition of $K_{α} (X; Y)$ in (2) exists and is finite.

The following properties of the mutual information

I (X; Y)

are also satisfied by

K_{α} (X; Y)

:

(Lemma 22): For all $α \in [0, \infty]$ , $K_{α} (X; Y) \geq 0$ . If $α \in (0, \infty)$ , then $K_{α} (X; Y) = 0$ if and only if X and Y are independent (nonnegativity).
(Lemma 23): For all $α \in [0, \infty]$ , $K_{α} (X; Y) = K_{α} (Y; X)$ (symmetry).
(Lemma 34): If the pairs $(X_{1}, Y_{1})$ and $(X_{2}, Y_{2})$ are independent, then $K_{α} (X_{1}, X_{2}; Y_{1}, Y_{2}) = K_{α} (X_{1}; Y_{1}) + K_{α} (X_{2}; Y_{2})$ for all $α \in [0, \infty]$ (additivity).
(Lemma 35): For all $α \in [0, \infty]$ , $K_{α} (X; Y) \leq log | X |$ .

Unlike the mutual information,

K_{α} (X; Y)

does not satisfy the data-processing inequality:

(Lemma 36): There exists a Markov chain $X ⊸ - - Y ⊸ - - Z$ for which $K_{2} (X; Z) > K_{2} (X; Y)$ .

Moreover:

(Lemma 24): For all $α \in (0, \infty)$ ,

$\begin{matrix} K_{α} (X; Y) + H_{α} (X, Y) = min_{Q_{X}, Q_{Y}} - log M_{\frac{α - 1}{α}} (Q_{X}, Q_{Y}), \end{matrix}$

(98)

where $M_{β} (Q_{X}, Q_{Y})$ is the following weighted power mean [30] (Chapter III): For $β \in R ∖ {0}$ ,

$\begin{matrix} M_{β} (Q_{X}, Q_{Y}) ≜ {[\sum_{x, y} P (x, y) {[Q_{X} (x) Q_{Y} (y)]}^{β}]}^{\frac{1}{β}}, \end{matrix}$

(99)

where for $β < 0$ , we read $P (x, y) {[Q_{X} (x) Q_{Y} (y)]}^{β}$ as $P (x, y) / {[Q_{X} (x) Q_{Y} (y)]}^{- β}$ and use the conventions (44); and for $β = 0$ , using the convention $0^{0} = 1$ ,

$\begin{matrix} M_{0} (Q_{X}, Q_{Y}) ≜ \prod_{x, y} {[Q_{X} (x) Q_{Y} (y)]}^{P (x, y)} . \end{matrix}$

(100)
(Lemma 25): For $α = 0$ ,

$\begin{matrix} K_{0} (X; Y) & = log \frac{| supp (P_{X} P_{Y}) |}{| supp (P_{X Y}) |} \end{matrix}$

(101)

$\begin{matrix} \geq min_{Q_{X}, Q_{Y}} log max_{(x, y) \in supp (P_{X Y})} \frac{1}{Q_{X} (x) Q_{Y} (y)} - log | supp (P_{X Y}) | \end{matrix}$

(102)

$\begin{matrix} = lim_{α ↓ 0} K_{α} (X; Y), \end{matrix}$

(103)

where in the RHS of (102), we use the conventions (44). The inequality can be strict, so $α \mapsto K_{α} (X; Y)$ need not be continuous at $α = 0$ .
(Lemma 26): $K_{1} (X; Y) = I (X; Y)$ .
(Lemma 27): Let $f : {1, \dots, | X |} \to X$ and $g : {1, \dots, | Y |} \to Y$ be bijective functions, and let $B$ be the $| X | \times | Y |$ matrix whose Row-i Column-j entry $B_{i, j}$ equals $P_{X Y} (f (i), g (j))$ . Then,

$\begin{matrix} K_{2} (X; Y) = - 2 log σ_{1} (B) - H_{2} (X, Y), \end{matrix}$

(104)

where $σ_{1} (B)$ denotes the largest singular value of $B$ . (Because the singular values of a matrix are invariant under row and column permutations, the result does not depend on f or g.)
(Lemma 28): $K_{\infty} (X; Y) = 0$ .
(Lemma 29): The mapping $α \mapsto K_{α} (X; Y)$ need not be monotonic on $[0, \infty]$ .
(Lemma 30): The mapping $α \mapsto K_{α} (X; Y) + H_{α} (X, Y)$ is nonincreasing on $[0, \infty]$ .
(Lemma 31): The mapping $α \mapsto K_{α} (X; Y)$ is continuous on $(0, \infty]$ . (See Lemma 25 for the behavior at $α = 0$ .)
(Lemma 32): If $X = Y$ with probability one, then

$\begin{matrix} K_{α} (X; Y) = \{\begin{matrix} 2 H_{\frac{α}{2 - α}} (X) - H_{α} (X) & i f α \in [0, 2), \\ \frac{α}{α - 1} H_{\infty} (X) - H_{α} (X) & i f α \geq 2, \\ 0 & i f α = \infty . \end{matrix} \end{matrix}$

(105)
(Lemma 33): For every $α \in (0, 2)$ , the mapping $(Q_{X}, Q_{Y}) \mapsto Δ_{α} (P_{X Y} ∥ Q_{X} Q_{Y})$ in the definition of $K_{α} (X; Y)$ in (2) has a unique minimizer. This need not be the case when $α \in {0} \cup [2, \infty]$ .

6. Proofs

In this section, we prove the properties of

J_{α} (X; Y)

and

K_{α} (X; Y)

stated in Section 5.

Lemma 1.

For every

α \in [0, \infty]

, the minimum in the definition of

J_{α} (X; Y)

exists and is finite.

Proof.

Let

α \in [0, \infty]

. Then

{inf}_{Q_{X}, Q_{Y}} D_{α} (P_{X Y} ∥ Q_{X} Q_{Y})

is finite because

D_{α} (P_{X Y} ∥ P_{X} P_{Y})

is finite and because the Rényi divergence is nonnegative. The minimum exists because the set

P (X) \times P (Y)

is compact and the mapping

(Q_{X}, Q_{Y}) \mapsto D_{α} (P_{X Y} ∥ Q_{X} Q_{Y})

is continuous. □

Lemma 2.

For all

α \in [0, \infty]

,

J_{α} (X; Y) \geq 0

. If

α \in (0, \infty]

, then

J_{α} (X; Y) = 0

if and only if X and Y are independent (nonnegativity).

Proof.

The nonnegativity follows from the definition of

J_{α} (X; Y)

because the Rényi divergence is nonnegative for

α \in [0, \infty]

. If X and Y are independent, then

P_{X Y} = P_{X} P_{Y}

, and the choice

Q_{X} = P_{X}

and

Q_{Y} = P_{Y}

in the definition of

J_{α} (X; Y)

achieves

J_{α} (X; Y) = 0

. Conversely, if

J_{α} (X; Y) = 0

, then there exist PMFs

Q_{X}^{*}

and

Q_{Y}^{*}

satisfying

D_{α} (P_{X Y} ∥ Q_{X}^{*} Q_{Y}^{*}) = 0

. If, in addition,

α \in (0, \infty]

, then

P_{X Y} = Q_{X}^{*} Q_{Y}^{*}

by Proposition 4, and hence X and Y are independent. □

Lemma 3.

For all

α \in [0, \infty]

,

J_{α} (X; Y) = J_{α} (Y; X)

(symmetry).

Proof.

The definition of

J_{α} (X; Y)

is symmetric in X and Y. □

Lemma 4.

If

X ⊸ - - Y ⊸ - - Z

form a Markov chain, then

J_{α} (X; Z) \leq J_{α} (X; Y)

for all

α \in [0, \infty]

(data-processing inequality).

Proof.

Let

X ⊸ - - Y ⊸ - - Z

form a Markov chain, and let

α \in [0, \infty]

. Let

{\hat{Q}}_{X}

and

{\hat{Q}}_{Y}

be PMFs that achieve the minimum in the definition of

J_{α} (X; Y)

, so

\begin{matrix} J_{α} (X; Y) = D_{α} (P_{X Y} ∥ {\hat{Q}}_{X} {\hat{Q}}_{Y}) . \end{matrix}

(106)

Define the PMF

{\hat{Q}}_{Z}

as

\begin{matrix} {\hat{Q}}_{Z} (z) ≜ \sum_{y} {\hat{Q}}_{Y} (y) P_{Z | Y} (z | y) . \end{matrix}

(107)

(As noted in the preliminaries, we define

P_{Z | Y} (z | y) ≜ 1 / | Z |

when

P_{Y} (y) = 0

.) We show below that

\begin{matrix} D_{α} (P_{X Z} ∥ {\hat{Q}}_{X} {\hat{Q}}_{Z}) \leq D_{α} (P_{X Y} ∥ {\hat{Q}}_{X} {\hat{Q}}_{Y}), \end{matrix}

(108)

which implies the data-processing inequality because

\begin{matrix} J_{α} (X; Z) & \leq D_{α} (P_{X Z} ∥ {\hat{Q}}_{X} {\hat{Q}}_{Z}) \end{matrix}

(109)

\begin{matrix} \leq D_{α} (P_{X Y} ∥ {\hat{Q}}_{X} {\hat{Q}}_{Y}) \end{matrix}

(110)

\begin{matrix} = J_{α} (X; Y), \end{matrix}

(111)

where (109) holds by the definition of

J_{α} (X; Z)

; (110) follows from (108); and (111) follows from (106).

The proof of (108) is based on the data-processing inequality for the Rényi divergence. Define the conditional PMF

A_{X^{'} Z^{'} | X Y}

as

\begin{matrix} A_{X^{'} Z^{'} | X Y} (x^{'}, z^{'} | x, y) ≜ 𝟙 {x^{'} = x} P_{Z | Y} (z^{'} | y) . \end{matrix}

(112)

If

(X, Y) \sim P_{X Y}

, then the marginal distribution of

X^{'}

and

Z^{'}

is

\begin{matrix} (P_{X Y} A_{X^{'} Z^{'} | X Y}) (x^{'}, z^{'}) & = \sum_{x, y} P_{X Y} (x, y) A_{X^{'} Z^{'} | X Y} (x^{'}, z^{'} | x, y) \end{matrix}

(113)

\begin{matrix} = \sum_{y} P_{X Y} (x^{'}, y) P_{Z | Y} (z^{'} | y) \end{matrix}

(114)

\begin{matrix} = \sum_{y} P_{X Y} (x^{'}, y) P_{Z | X Y} (z^{'} | x^{'}, y) \end{matrix}

(115)

\begin{matrix} = P_{X Z} (x^{'}, z^{'}), \end{matrix}

(116)

where (114) follows from (112); and (115) holds because X, Y, and Z form a Markov chain. If

(X, Y) \sim {\hat{Q}}_{X} {\hat{Q}}_{Y}

, then the marginal distribution of

X^{'}

and

Z^{'}

is

\begin{matrix} ({\hat{Q}}_{X} {\hat{Q}}_{Y} A_{X^{'} Z^{'} | X Y}) (x^{'}, z^{'}) & = \sum_{x, y} {\hat{Q}}_{X} (x) {\hat{Q}}_{Y} (y) A_{X^{'} Z^{'} | X Y} (x^{'}, z^{'} | x, y) \end{matrix}

(117)

\begin{matrix} = \sum_{y} {\hat{Q}}_{X} (x^{'}) {\hat{Q}}_{Y} (y) P_{Z | Y} (z^{'} | y) \end{matrix}

(118)

\begin{matrix} = {\hat{Q}}_{X} (x^{'}) {\hat{Q}}_{Z} (z^{'}), \end{matrix}

(119)

where (118) follows from (112), and (119) follows from (107). Finally, we are ready to prove (108):

\begin{matrix} D_{α} (P_{X Z} ∥ {\hat{Q}}_{X} {\hat{Q}}_{Z}) & = D_{α} ((P_{X Y} A_{X^{'} Z^{'} | X Y}) ∥ ({\hat{Q}}_{X} {\hat{Q}}_{Y} A_{X^{'} Z^{'} | X Y})) \end{matrix}

(120)

\begin{matrix} \leq D_{α} (P_{X Y} ∥ {\hat{Q}}_{X} {\hat{Q}}_{Y}), \end{matrix}

(121)

where (120) follows from (116) and (119), and where (121) follows from the data-processing inequality for the Rényi divergence [22] (Theorem 9). □

Lemma 5.

J_{0} (X; Y) = 0

.

Proof.

By Lemma 2,

J_{0} (X; Y) \geq 0

, so it suffices to show that

J_{0} (X; Y) \leq 0

. Let

(\hat{x}, \hat{y}) \in X \times Y

satisfy

P_{X Y} (\hat{x}, \hat{y}) > 0

. Define the PMF

{\hat{Q}}_{X}

as

{\hat{Q}}_{X} (x) ≜ 𝟙 {x = \hat{x}}

and the PMF

{\hat{Q}}_{Y}

as

{\hat{Q}}_{Y} (y) ≜ 𝟙 {y = \hat{y}}

. Then,

D_{0} (P_{X Y} ∥ {\hat{Q}}_{X} {\hat{Q}}_{Y}) = 0

, so

J_{0} (X; Y) \leq 0

by the definition of

J_{0} (X; Y)

. □

Lemma 6.

Let

f : {1, \dots, | X |} \to X

and

g : {1, \dots, | Y |} \to Y

be bijective functions, and let

A

be the

| X | \times | Y |

matrix whose Row-i Column-j entry

A_{i, j}

equals

\sqrt{P_{X Y} (f (i), g (j))}

. Then,

\begin{matrix} J_{\frac{1}{2}} (X; Y) = - 2 log σ_{1} (A), \end{matrix}

(122)

where

σ_{1} (A)

denotes the largest singular value of

A

. (Because the singular values of a matrix are invariant under row and column permutations, the result does not depend on f or g.)

Proof.

By the definitions of

J_{α} (X; Y)

and the Rényi divergence,

\begin{matrix} J_{\frac{1}{2}} (X; Y) = - 2 log max_{Q_{X}, Q_{Y}} \sum_{x, y} \sqrt{Q_{X} (x)} \sqrt{P (x, y)} \sqrt{Q_{Y} (y)} . \end{matrix}

(123)

The claim follows from (123) because

\begin{matrix} max_{Q_{X}, Q_{Y}} \sum_{x, y} \sqrt{Q_{X} (x)} \sqrt{P (x, y)} \sqrt{Q_{Y} (y)} & = max_{{∥ u ∥}_{2} = {∥ v ∥}_{2} = 1} u^{T} A v \end{matrix}

(124)

\begin{matrix} = max_{{∥ v ∥}_{2} = 1} {∥ A v ∥}_{2} \end{matrix}

(125)

\begin{matrix} = σ_{1} (A), \end{matrix}

(126)

where

u

and

v

are column vectors with

| X |

and

| Y |

elements, respectively; (124) is shown below; (125) follows from the Cauchy–Schwarz inequality

| u^{T} A v {| \leq ∥ u ∥}_{2} {∥ A v ∥}_{2}

, which holds with equality if

u

and

A v

are linearly dependent; and (126) holds because the spectral norm of a matrix is equal to its largest singular value [31] (Example 5.6.6).

We now prove (124). Let

u

and

v

be vectors that satisfy

{∥ u ∥}_{2} = {∥ v ∥}_{2} = 1

, and define the PMFs

{\hat{Q}}_{X}

and

{\hat{Q}}_{Y}

as

{\hat{Q}}_{X} (x) ≜ u_{f^{- 1} (x)}^{2}

and

{\hat{Q}}_{Y} (y) ≜ v_{g^{- 1} (y)}^{2}

, where

f^{- 1}

and

g^{- 1}

denote the inverse functions of f and g, respectively. Then,

\begin{matrix} u^{T} A v & = \sum_{i, j} u_{i} A_{i, j} v_{j} \end{matrix}

(127)

\begin{matrix} \leq \sum_{i, j} | u_{i} | A_{i, j} | v_{j} | \end{matrix}

(128)

\begin{matrix} = \sum_{x, y} \sqrt{{\hat{Q}}_{X} (x)} \sqrt{P (x, y)} \sqrt{{\hat{Q}}_{Y} (y)} \end{matrix}

(129)

\begin{matrix} \leq max_{Q_{X}, Q_{Y}} \sum_{x, y} \sqrt{Q_{X} (x)} \sqrt{P (x, y)} \sqrt{Q_{Y} (y)}, \end{matrix}

(130)

where (128) holds because all the entries of

A

are nonnegative, and in (129), we changed the summation variables to

x ≜ f (i)

and

y ≜ g (j)

. It remains to show that equality can be achieved in (128) and (130). To that end, let

Q_{X}^{*}

and

Q_{Y}^{*}

be PMFs that achieve the maximum on the RHS of (130), and define the vectors

u

and

v

as

u_{i} ≜ Q_{X}^{*} {(f (i))}^{1 / 2}

and

v_{j} ≜ Q_{Y}^{*} {(g (j))}^{1 / 2}

. Then,

{∥ u ∥}_{2} = {∥ v ∥}_{2} = 1

, and (128) and (130) hold with equality, which proves (124). □

Lemma 7.

J_{1} (X; Y) = I (X; Y)

.

Proof.

This follows from Proposition 8 because

D_{1} (P_{X Y} ∥ Q_{X} Q_{Y})

in the definition of

J_{1} (X; Y)

is equal to

D (P_{X Y} ∥ Q_{X} Q_{Y})

. □

Lemma 8.

For all

α > 0

,

\begin{matrix} (1 - α) J_{α} (X; Y) = min_{R_{X Y} \in P (X \times Y)} [(1 - α) D (R_{X Y} ∥ R_{X} R_{Y}) + α D (R_{X Y} ∥ P_{X Y})] . \end{matrix}

(131)

Thus, being the minimum of concave functions in α, the mapping

α \mapsto (1 - α) J_{α} (X; Y)

is concave on

(0, \infty)

.

Proof.

For

α = 1

, (131) holds because

D (R_{X Y} ∥ P_{X Y}) \geq 0

with equality if

R_{X Y} = P_{X Y}

. For

α \in (0, 1)

,

\begin{matrix} (1 - α) J_{α} (X; Y) & = min_{Q_{X}, Q_{Y}} (1 - α) D_{α} (P_{X Y} ∥ Q_{X} Q_{Y}) \end{matrix}

(132)

\begin{matrix} = min_{Q_{X}, Q_{Y}} min_{R_{X Y}} [(1 - α) D (R_{X Y} ∥ Q_{X} Q_{Y}) + α D (R_{X Y} ∥ P_{X Y})] \end{matrix}

(133)

\begin{matrix} = min_{R_{X Y}} [(1 - α) D (R_{X Y} ∥ R_{X} R_{Y}) + α D (R_{X Y} ∥ P_{X Y})], \end{matrix}

(134)

where (132) holds by the definition of

J_{α} (X; Y)

; (133) follows from [22] (Theorem 30); and (134) follows from Proposition 8 after swapping the minima.

For

α > 1

, define the sets

\begin{matrix} Q & ≜ {(Q_{X}, Q_{Y}) \in P (X) \times P (Y) : supp (Q_{X} Q_{Y}) = X \times Y}, \end{matrix}

(135)

\begin{matrix} R & ≜ {R_{X Y} \in P (X \times Y) : supp (R_{X Y}) \subseteq supp (P_{X Y})} . \end{matrix}

(136)

Then,

\begin{matrix} (1 - α) J_{α} (X; Y) & = sup_{(Q_{X}, Q_{Y}) \in Q} (1 - α) D_{α} (P_{X Y} ∥ Q_{X} Q_{Y}) \end{matrix}

(137)

\begin{matrix} = sup_{(Q_{X}, Q_{Y}) \in Q} min_{R_{X Y} \in R} [(1 - α) D (R_{X Y} ∥ Q_{X} Q_{Y}) + α D (R_{X Y} ∥ P_{X Y})] \end{matrix}

(138)

\begin{matrix} = min_{R_{X Y} \in R} sup_{(Q_{X}, Q_{Y}) \in Q} [(1 - α) D (R_{X Y} ∥ Q_{X} Q_{Y}) + α D (R_{X Y} ∥ P_{X Y})] \end{matrix}

(139)

\begin{matrix} = min_{R_{X Y} \in P (X \times Y)} [(1 - α) D (R_{X Y} ∥ R_{X} R_{Y}) + α D (R_{X Y} ∥ P_{X Y})], \end{matrix}

(140)

where (137) follows from the definition of

J_{α} (X; Y)

because

1 - α < 0

and because the mapping

(Q_{X}, Q_{Y}) \mapsto D_{α} (P_{X Y} ∥ Q_{X} Q_{Y})

is continuous; (138) follows from [22] (Theorem 30); (139) follows from a minimax theorem and is justified below; and (140) follows from Proposition 8, a continuity argument, and the observation that

D (R_{X Y} ∥ P_{X Y})

is infinite if

R_{X Y} \notin R

.

We now verify the conditions of Ky Fan’s minimax theorem [32] (Theorem 2), which will establish (139). (We use Ky Fan’s minimax theorem because it does not require that the set

Q

be compact, and having a noncompact set

Q

helps to guarantee that the function f defined next takes on finite values only. A brief proof of Ky Fan’s minimax theorem appears in [33].) Let the function

f : R \times Q \to R

be defined by the expression in square brackets in (139), i.e.,

\begin{matrix} f (R_{X Y}, Q_{X}, Q_{Y}) ≜ (1 - α) D (R_{X Y} ∥ Q_{X} Q_{Y}) + α D (R_{X Y} ∥ P_{X Y}) . \end{matrix}

(141)

We check that

(i): the sets $Q$ and $R$ are convex;
(ii): the set $R$ is compact;
(iii): the function f is real-valued;
(iv): for every $(Q_{X}, Q_{Y}) \in Q$ , the function f is continuous in $R_{X Y}$ ;
(v): for every $(Q_{X}, Q_{Y}) \in Q$ , the function f is convex in $R_{X Y}$ ; and
(vi): for every $R_{X Y} \in R$ , the function f is concave in the pair $(Q_{X}, Q_{Y})$ .

Indeed, Parts (i) and (ii) are easy to see; Part (iii) holds because both relative entropies on the RHS of (141) are finite by our definitions of

Q

and

R

; and to show Parts (iv)–(vi), we rewrite f as:

\begin{matrix} f (R_{X Y}, Q_{X}, Q_{Y}) & = - H (R_{X Y}) - α \sum_{x, y} R_{X Y} (x, y) log P (x, y) \\ + (α - 1) \sum_{x} R_{X} (x) log Q_{X} (x) + (α - 1) \sum_{y} R_{Y} (y) log Q_{Y} (y) . \end{matrix}

(142)

From (142), we see that Part (iv) holds by our definitions of

Q

and

R

; Part (v) holds because the entropy is a concave function (so

- H (R_{X Y})

is convex), because linear functionals of

R_{X Y}

are convex, and because the sum of convex functions is convex; and Part (vi) holds because the logarithm is a concave function and because a nonnegative weighted sum of concave functions is concave. (In Ky Fan’s theorem, weaker conditions than Parts (i)–(vi) are required, but it is not difficult to see that Parts (i)–(vi) are sufficient.)

The last claim, namely, that the mapping

α \mapsto (1 - α) J_{α} (X; Y)

is concave on

(0, \infty)

, is true because the expression in square brackets on the RHS of (131) is concave in

α

for every

R_{X Y}

and because the pointwise minimum preserves the concavity. □

Lemma 9.

The mapping

α \mapsto J_{α} (X; Y)

is nondecreasing on

[0, \infty]

.

Proof.

This is true because for every

α, α^{'} \in [0, \infty]

with

α \leq α^{'}

,

\begin{matrix} min_{Q_{X}, Q_{Y}} D_{α} (P_{X Y} ∥ Q_{X} Q_{Y}) \leq min_{Q_{X}, Q_{Y}} D_{α^{'}} (P_{X Y} ∥ Q_{X} Q_{Y}), \end{matrix}

(143)

which holds because the Rényi divergence is nondecreasing in

α

(Proposition 4). □

Lemma 10.

The mapping

α \mapsto J_{α} (X; Y)

is continuous on

[0, \infty]

.

Proof.

By Lemma 8, the mapping

α \mapsto (1 - α) J_{α} (X; Y)

is concave on

(0, \infty)

, thus it is continuous on

(0, \infty)

, which implies that

α \mapsto J_{α} (X; Y)

is continuous on

(0, 1) \cup (1, \infty)

.

We next prove the continuity at

α = 0

. Let

Q_{X}^{*}

and

Q_{Y}^{*}

be PMFs that achieve the minimum in the definition of

J_{0} (X; Y)

. Then, for all

α \geq 0

,

\begin{matrix} D_{0} (P_{X Y} ∥ Q_{X}^{*} Q_{Y}^{*}) & = J_{0} (X; Y) \end{matrix}

(144)

\begin{matrix} \leq J_{α} (X; Y) \end{matrix}

(145)

\begin{matrix} \leq D_{α} (P_{X Y} ∥ Q_{X}^{*} Q_{Y}^{*}), \end{matrix}

(146)

where (145) holds because

α \mapsto J_{α} (X; Y)

is nondecreasing (Lemma 9), and (146) holds by the definition of

J_{α} (X; Y)

. The Rényi divergence is continuous in

α

(Proposition 4), so (144)–(146) and the sandwich theorem imply that

J_{α} (X; Y)

is continuous at

α = 0

.

We continue with the continuity at

α = \infty

. Define

\begin{matrix} τ ≜ min_{(x, y) \in supp (P_{X Y})} P (x, y) . \end{matrix}

(147)

Then, for all

α > 1

,

\begin{matrix} J_{\infty} (X; Y) & \geq J_{α} (X; Y) \end{matrix}

(148)

\begin{matrix} = min_{Q_{X}, Q_{Y}} \frac{1}{α - 1} log \sum_{x, y} P (x, y) \frac{P {(x, y)}^{α - 1}}{{[Q_{X} (x) Q_{Y} (y)]}^{α - 1}} \end{matrix}

(149)

\begin{matrix} \geq min_{Q_{X}, Q_{Y}} \frac{1}{α - 1} log max_{x, y} \frac{τ P {(x, y)}^{α - 1}}{{[Q_{X} (x) Q_{Y} (y)]}^{α - 1}} \end{matrix}

(150)

\begin{matrix} = \frac{1}{α - 1} log τ + min_{Q_{X}, Q_{Y}} log max_{x, y} \frac{P (x, y)}{Q_{X} (x) Q_{Y} (y)} \end{matrix}

(151)

\begin{matrix} = \frac{1}{α - 1} log τ + J_{\infty} (X; Y), \end{matrix}

(152)

where (148) holds because

α \mapsto J_{α} (X; Y)

is nondecreasing (Lemma 9), and (149) and (152) hold by the definitions of

J_{α} (X; Y)

and the Rényi divergence. The RHS of (152) tends to

J_{\infty} (X; Y)

as

α

tends to infinity, so

J_{α} (X; Y)

is continuous at

α = \infty

by the sandwich theorem.

It remains to show the continuity at

α = 1

. Let

α \in (\frac{3}{4}, 1) \cup (1, \frac{5}{4})

, and let

δ ≜ | 1 - α | \in (0, \frac{1}{4})

. Then, for all PMFs

Q_{X}

and

Q_{Y}

,

\begin{matrix} 2^{- δ D_{α} (P_{X Y} ∥ Q_{X} Q_{Y})} & \leq 2^{- δ D_{1 - δ} (P_{X Y} ∥ Q_{X} Q_{Y})} \end{matrix}

(153)

\begin{matrix} = \sum_{x, y} P (x, y) {[\frac{Q_{X} (x) Q_{Y} (y)}{P (x, y)}]}^{δ} \end{matrix}

(154)

\begin{matrix} = \sum_{x, y} P (x, y) {[\frac{P_{X} (x) P_{Y} (y)}{P (x, y)}]}^{δ} {[\frac{Q_{X} (x) Q_{Y} (y)}{P_{X} (x) P_{Y} (y)}]}^{δ} \end{matrix}

(155)

\begin{matrix} \leq {\sum_{x, y} P (x, y) {[\frac{P_{X} (x) P_{Y} (y)}{P (x, y)}]}^{2 δ}}^{\frac{1}{2}} \cdot {\sum_{x, y} P (x, y) {[\frac{Q_{X} (x) Q_{Y} (y)}{P_{X} (x) P_{Y} (y)}]}^{2 δ}}^{\frac{1}{2}} \end{matrix}

(156)

\begin{matrix} \leq {\sum_{x, y} P (x, y) {[\frac{P_{X} (x) P_{Y} (y)}{P (x, y)}]}^{2 δ}}^{\frac{1}{2}} \end{matrix}

(157)

\begin{matrix} = 2^{- δ D_{1 - 2 δ} (P_{X Y} ∥ P_{X} P_{Y})}, \end{matrix}

(158)

where (153) holds because

1 - δ \leq α

and because the Rényi divergence is nondecreasing in

α

(Proposition 4); (156) follows from the Cauchy–Schwarz inequality; and (157) holds because

\begin{matrix} {\sum_{x, y} P (x, y) {[\frac{Q_{X} (x) Q_{Y} (y)}{P_{X} (x) P_{Y} (y)}]}^{2 δ}}^{\frac{1}{2}} & \leq {\sum_{x} P_{X} (x) {[\frac{Q_{X} (x)}{P_{X} (x)}]}^{4 δ}}^{\frac{1}{4}} \cdot {\sum_{y} P_{Y} (y) {[\frac{Q_{Y} (y)}{P_{Y} (y)}]}^{4 δ}}^{\frac{1}{4}} \end{matrix}

(159)

\begin{matrix} = 2^{- δ D_{1 - 4 δ} (P_{X} ∥ Q_{X})} \cdot 2^{- δ D_{1 - 4 δ} (P_{Y} ∥ Q_{Y})} \end{matrix}

(160)

\begin{matrix} \leq 1, \end{matrix}

(161)

where (159) follows from the Cauchy–Schwarz inequality, and (161) holds because

1 - 4 δ > 0

and because the Rényi divergence is nonnegative for positive orders (Proposition 4). Thus, for all

α \in (\frac{3}{4}, \frac{5}{4})

,

\begin{matrix} D_{1 - 2 | 1 - α |} (P_{X Y} ∥ P_{X} P_{Y}) & \leq min_{Q_{X}, Q_{Y}} D_{α} (P_{X Y} ∥ Q_{X} Q_{Y}) \end{matrix}

(162)

\begin{matrix} = J_{α} (X; Y) \end{matrix}

(163)

\begin{matrix} \leq D_{α} (P_{X Y} ∥ P_{X} P_{Y}), \end{matrix}

(164)

where (162) follows from (158) if

α \neq 1

and from Proposition 8 if

α = 1

; and (164) holds by the definition of

J_{α} (X; Y)

. The Rényi divergence is continuous in

α

(Proposition 4), thus (162)–(164) and the sandwich theorem imply that

J_{α} (X; Y)

is continuous at

α = 1

. □

Lemma 11.

If

X = Y

with probability one, then

\begin{matrix} J_{α} (X; Y) = \{\begin{matrix} \frac{α}{1 - α} H_{\infty} (X) & i f α \in [0, \frac{1}{2}], \\ H_{\frac{α}{2 α - 1}} (X) & i f α > \frac{1}{2}, \\ H_{\frac{1}{2}} (X) & i f α = \infty . \end{matrix} \end{matrix}

(165)

Proof.

We show below that (165) holds for

α \in (0, 1) \cup (1, \infty)

. Thus, (165) holds also for

α \in {0, 1, \infty}

because both its sides are continuous in

α

: its LHS by Lemma 10, and its RHS by the continuity of the Rényi entropy (Proposition 3).

Fix

α \in (0, 1) \cup (1, \infty)

. Then,

\begin{matrix} J_{α} (X; Y) & = min_{Q_{X}} min_{Q_{Y}} D_{α} (P_{X Y} ∥ Q_{X} Q_{Y}) \end{matrix}

(166)

\begin{matrix} = min_{Q_{X}} \frac{α}{α - 1} log \sum_{y} {[\sum_{x} P {(x, y)}^{α} Q_{X} {(x)}^{1 - α}]}^{\frac{1}{α}} \end{matrix}

(167)

\begin{matrix} = min_{Q_{X}} \frac{α}{α - 1} log \sum_{x} P_{X} (x) Q_{X} {(x)}^{\frac{1 - α}{α}}, \end{matrix}

(168)

where (166) follows from Proposition 9, and (168) holds because

\begin{matrix} P_{X Y} (x, y) = \{\begin{matrix} P_{X} (x) & if x = y, \\ 0 & otherwise . \end{matrix} \end{matrix}

(169)

First consider the case

α > \frac{1}{2}

. Define

γ ≜ \sum_{x} P_{X} {(x)}^{\frac{α}{2 α - 1}}

. Then, for all

Q_{X} \in P (X)

,

\begin{matrix} \frac{α}{α - 1} log \sum_{x} P_{X} (x) Q_{X} {(x)}^{\frac{1 - α}{α}} & = \frac{α}{α - 1} log \sum_{x} {[γ γ^{- 1} P_{X} {(x)}^{\frac{α}{2 α - 1}}]}^{\frac{2 α - 1}{α}} Q_{X} {(x)}^{\frac{1 - α}{α}} \end{matrix}

(170)

\begin{matrix} = \frac{2 α - 1}{α - 1} log γ + D_{\frac{2 α - 1}{α}} (γ^{- 1} P_{X}^{\frac{α}{2 α - 1}} ∥ Q_{X}) \end{matrix}

(171)

\begin{matrix} = H_{\frac{α}{2 α - 1}} (X) + D_{\frac{2 α - 1}{α}} (γ^{- 1} P_{X}^{\frac{α}{2 α - 1}} ∥ Q_{X}), \end{matrix}

(172)

where (171) holds because

x \mapsto γ^{- 1} P_{X} {(x)}^{\frac{α}{2 α - 1}}

is a PMF. Because

\frac{2 α - 1}{α} > 0

, Proposition 4 implies that

D_{(2 α - 1) / α} (P ∥ Q) \geq 0

with equality if

Q = P

. This together with (168) and (172) establishes (165).

Now consider the case

α \in (0, \frac{1}{2}]

. For all

Q_{X} \in P (X)

,

\begin{matrix} \sum_{x} P_{X} (x) Q_{X} {(x)}^{\frac{1 - α}{α}} & \leq \sum_{x} P_{X} (x) Q_{X} (x) \end{matrix}

(173)

\begin{matrix} \leq \sum_{x} [max_{x^{'}} P_{X} (x^{'})] Q_{X} (x) \end{matrix}

(174)

\begin{matrix} = max_{x} P_{X} (x), \end{matrix}

(175)

where (173) holds because

Q_{X} (x) \in [0, 1]

for all

x \in X

and because

\frac{1 - α}{α} \geq 1

. The inequalities (173) and (174) both hold with equality when

Q_{X} (x) = 𝟙 {x = x^{*}}

, where

x^{*} \in X

is such that

P_{X} (x^{*}) = {max}_{x} P_{X} (x)

. Thus,

\begin{matrix} max_{Q_{X}} \sum_{x} P_{X} (x) Q_{X} {(x)}^{\frac{1 - α}{α}} = max_{x} P_{X} (x) . \end{matrix}

(176)

Now (165) follows:

\begin{matrix} J_{α} (X; Y) & = min_{Q_{X}} \frac{α}{α - 1} log \sum_{x} P_{X} (x) Q_{X} {(x)}^{\frac{1 - α}{α}} \end{matrix}

(177)

\begin{matrix} = \frac{α}{α - 1} log max_{Q_{X}} \sum_{x} P_{X} (x) Q_{X} {(x)}^{\frac{1 - α}{α}} \end{matrix}

(178)

\begin{matrix} = \frac{α}{α - 1} log max_{x} P_{X} (x) \end{matrix}

(179)

\begin{matrix} = \frac{α}{1 - α} H_{\infty} (X), \end{matrix}

(180)

where (177) follows from (168); (178) holds because

\frac{α}{α - 1} < 0

; (179) follows from (176); and (180) follows from the definition of

H_{\infty} (X)

. □

Lemma 12.

If the pairs

(X_{1}, Y_{1})

and

(X_{2}, Y_{2})

are independent, then

J_{α} (X_{1}, X_{2}; Y_{1}, Y_{2}) = J_{α} (X_{1}; Y_{1}) + J_{α} (X_{2}; Y_{2})

for all

α \in [0, \infty]

(additivity).

Proof.

Let the pairs

(X_{1}, Y_{1})

and

(X_{2}, Y_{2})

be independent. For

α \in (0, 1) \cup (1, \infty)

, we establish the lemma by showing the following two inequalities:

\begin{matrix} J_{α} (X_{1}, X_{2}; Y_{1}, Y_{2}) & \leq J_{α} (X_{1}; Y_{1}) + J_{α} (X_{2}; Y_{2}), \end{matrix}

(181)

\begin{matrix} J_{α} (X_{1}, X_{2}; Y_{1}, Y_{2}) & \geq J_{α} (X_{1}; Y_{1}) + J_{α} (X_{2}; Y_{2}) . \end{matrix}

(182)

Because

J_{α} (X; Y)

is continuous in

α

(Lemma 10), this will also establish the lemma for

α \in {0, 1, \infty}

.

To show (181), let

Q_{X_{1}}^{*}

and

Q_{Y_{1}}^{*}

be PMFs that achieve the minimum in the definition of

J_{α} (X_{1}; Y_{1})

, and let

Q_{X_{2}}^{*}

and

Q_{Y_{2}}^{*}

be PMFs that achieve the minimum in the definition of

J_{α} (X_{2}; Y_{2})

, so

\begin{matrix} J_{α} (X_{1}; Y_{1}) & = D_{α} (P_{X_{1} Y_{1}} ∥ Q_{X_{1}}^{*} Q_{Y_{1}}^{*}), \end{matrix}

(183)

\begin{matrix} J_{α} (X_{2}; Y_{2}) & = D_{α} (P_{X_{2} Y_{2}} ∥ Q_{X_{2}}^{*} Q_{Y_{2}}^{*}) . \end{matrix}

(184)

Then, (181) holds because

\begin{matrix} J_{α} (X_{1}, X_{2}; Y_{1}, Y_{2}) & \leq D_{α} (P_{X_{1} X_{2} Y_{1} Y_{2}} ∥ Q_{X_{1}}^{*} Q_{X_{2}}^{*} Q_{Y_{1}}^{*} Q_{Y_{2}}^{*}) \end{matrix}

(185)

\begin{matrix} = D_{α} (P_{X_{1} Y_{1}} ∥ Q_{X_{1}}^{*} Q_{Y_{1}}^{*}) + D_{α} (P_{X_{2} Y_{2}} ∥ Q_{X_{2}}^{*} Q_{Y_{2}}^{*}) \end{matrix}

(186)

\begin{matrix} = J_{α} (X_{1}; Y_{1}) + J_{α} (X_{2}; Y_{2}), \end{matrix}

(187)

where (185) holds by the definition of

J_{α} (X_{1}, X_{2}; Y_{1}, Y_{2})

as a minimum; (186) follows from a simple computation using the independence hypothesis

P_{X_{1} X_{2} Y_{1} Y_{2}} = P_{X_{1} Y_{1}} P_{X_{2} Y_{2}}

; and (187) follows from (183) and (184).

To establish (182), we consider the cases

α > 1

and

α < 1

separately, starting with

α > 1

. Let

{\hat{Q}}_{X_{1} X_{2}}

and

{\hat{Q}}_{Y_{1} Y_{2}}

be PMFs that achieve the minimum in the definition of

J_{α} (X_{1}, X_{2}; Y_{1}, Y_{2})

, so

\begin{matrix} J_{α} (X_{1}, X_{2}; Y_{1}, Y_{2}) = D_{α} (P_{X_{1} X_{2} Y_{1} Y_{2}} ∥ {\hat{Q}}_{X_{1} X_{2}} {\hat{Q}}_{Y_{1} Y_{2}}) . \end{matrix}

(188)

Define the function

f : X_{1} \times Y_{1} \to R \cup {\infty}

as

\begin{matrix} f (x_{1}, y_{1}) ≜ \sum_{x_{2}, y_{2}} P_{X_{2} Y_{2}} {(x_{2}, y_{2})}^{α} {[{\hat{Q}}_{X_{2} | X_{1}} (x_{2} | x_{1}) {\hat{Q}}_{Y_{2} | Y_{1}} (y_{2} | y_{1})]}^{1 - α}, \end{matrix}

(189)

and let

(x_{1}^{'}, y_{1}^{'}) \in X_{1} \times Y_{1}

be such that

\begin{matrix} f (x_{1}^{'}, y_{1}^{'}) = min_{x_{1}, y_{1}} f (x_{1}, y_{1}) . \end{matrix}

(190)

Define the PMFs

Q_{X_{2}}^{'}

and

Q_{Y_{2}}^{'}

as

\begin{matrix} Q_{X_{2}}^{'} (x_{2}) & ≜ {\hat{Q}}_{X_{2} | X_{1}} (x_{2} | x_{1}^{'}), \end{matrix}

(191)

\begin{matrix} Q_{Y_{2}}^{'} (y_{2}) & ≜ {\hat{Q}}_{Y_{2} | Y_{1}} (y_{2} | y_{1}^{'}) . \end{matrix}

(192)

Then,

\begin{matrix} 2^{(α - 1) J_{α} (X_{1}, X_{2}; Y_{1}, Y_{2})} & = 2^{(α - 1) D_{α} (P_{X_{1} X_{2} Y_{1} Y_{2}} ∥ {\hat{Q}}_{X_{1} X_{2}} {\hat{Q}}_{Y_{1} Y_{2}})} \end{matrix}

(193)

\begin{matrix} = \sum_{x_{1}, x_{2}, y_{1}, y_{2}} {[P_{X_{1} Y_{1}} (x_{1}, y_{1}) P_{X_{2} Y_{2}} (x_{2}, y_{2})]}^{α} {[{\hat{Q}}_{X_{1} X_{2}} (x_{1}, x_{2}) {\hat{Q}}_{Y_{1} Y_{2}} (y_{1}, y_{2})]}^{1 - α} \end{matrix}

(194)

\begin{matrix} = \sum_{x_{1}, y_{1}} P_{X_{1} Y_{1}} {(x_{1}, y_{1})}^{α} {[{\hat{Q}}_{X_{1}} (x_{1}) {\hat{Q}}_{Y_{1}} (y_{1})]}^{1 - α} f (x_{1}, y_{1}) \end{matrix}

(195)

\begin{matrix} \geq \sum_{x_{1}, y_{1}} P_{X_{1} Y_{1}} {(x_{1}, y_{1})}^{α} {[{\hat{Q}}_{X_{1}} (x_{1}) {\hat{Q}}_{Y_{1}} (y_{1})]}^{1 - α} f (x_{1}^{'}, y_{1}^{'}) \end{matrix}

(196)

\begin{matrix} = 2^{(α - 1) D_{α} (P_{X_{1} Y_{1}} ∥ {\hat{Q}}_{X_{1}} {\hat{Q}}_{Y_{1}}) + (α - 1) D_{α} (P_{X_{2} Y_{2}} ∥ Q_{X_{2}}^{'} Q_{Y_{2}}^{'})}, \end{matrix}

(197)

where (193) follows from (188); (194) holds by the independence hypothesis

P_{X_{1} X_{2} Y_{1} Y_{2}} = P_{X_{1} Y_{1}} P_{X_{2} Y_{2}}

; (195) follows from (189); (196) follows from (190); and (197) follows from (191) and (192). Taking the logarithm and multiplying by

\frac{1}{α - 1} > 0

establishes (182):

\begin{matrix} J_{α} (X_{1}, X_{2}; Y_{1}, Y_{2}) & \geq D_{α} (P_{X_{1} Y_{1}} ∥ {\hat{Q}}_{X_{1}} {\hat{Q}}_{Y_{1}}) + D_{α} (P_{X_{2} Y_{2}} ∥ Q_{X_{2}}^{'} Q_{Y_{2}}^{'}) \end{matrix}

(198)

\begin{matrix} \geq J_{α} (X_{1}; Y_{1}) + J_{α} (X_{2}; Y_{2}), \end{matrix}

(199)

where (199) holds by the definition of

J_{α} (X_{1}; Y_{1})

and

J_{α} (X_{2}; Y_{2})

.

The proof of (182) for

α \in (0, 1)

is essentially the same as for

α > 1

: Replace the minimum in (190) by a maximum. Inequality (196) is then reversed, but (198) continues to hold because

\frac{1}{α - 1} < 0

. Inequality (199) also continues to hold, and (198) and (199) together imply (182).

Lemma 13.

For all

α \in [0, \infty]

,

J_{α} (X; Y) \leq log | X |

with equality if and only if

(α \in [\frac{1}{2}, \infty]

, X is distributed uniformly over

X

, and

H (X | Y) = 0)

.

Proof.

Throughout the proof, define

X^{'} ≜ X

. We first show that

J_{α} (X; Y) \leq log | X |

for all

α \in [0, \infty]

:

\begin{matrix} J_{α} (X; Y) & \leq J_{α} (X; X^{'}) \end{matrix}

(200)

\begin{matrix} \leq J_{\infty} (X; X^{'}) \end{matrix}

(201)

\begin{matrix} = H_{\frac{1}{2}} (X) \end{matrix}

(202)

\begin{matrix} \leq log | X |, \end{matrix}

(203)

where (200) follows from the data-processing inequality (Lemma 4) because

X ⊸ - - X^{'} ⊸ - - Y

form a Markov chain; (201) holds because

J_{α} (X; X^{'})

is nondecreasing in

α

(Lemma 9); (202) follows from Lemma 11; and (203) follows from Proposition 3.

We now show that (200)–(203) can hold with equality only if the following conditions all hold:

(1): $α \in [\frac{1}{2}, \infty]$ ;
(ii): X is distributed uniformly over $X$ ; and
(iii): $H (X | Y) = 0$ , i.e., for every $y \in supp (P_{Y})$ , there exists an $x \in X$ for which $P (x | y) = 1$ .

Indeed, if

α < \frac{1}{2}

, then Lemma 11 implies that

\begin{matrix} J_{α} (X; X^{'}) = \frac{α}{1 - α} H_{\infty} (X) . \end{matrix}

(204)

Because

\frac{α}{1 - α} < 1

for such

α

’s and because

H_{\infty} (X) \leq log | X |

(Proposition 3), the RHS of (204) is strictly smaller than

log | X |

. This, together with (200), shows that Part (i) is a necessary condition. The necessity of Part (ii) follows from (203): if X is not distributed uniformly over

X

, then (203) holds with strict inequality (Proposition 3). As to the necessity of Part (iii),

\begin{matrix} J_{α} (X; Y) & \leq J_{\infty} (X; Y) \end{matrix}

(205)

\begin{matrix} = min_{Q_{X}} min_{Q_{Y}} D_{\infty} (P_{X Y} ∥ Q_{X} Q_{Y}) \end{matrix}

(206)

\begin{matrix} = min_{Q_{X}} log \sum_{y} max_{x} \frac{P (x, y)}{Q_{X} (x)} \end{matrix}

(207)

\begin{matrix} \leq log \sum_{y} max_{x} \frac{P (y) P (x | y)}{1 / | X |} \end{matrix}

(208)

\begin{matrix} = log | X | + log \sum_{y} P (y) max_{x} P (x | y) \end{matrix}

(209)

\begin{matrix} \leq log | X |, \end{matrix}

(210)

where (205) holds because

J_{α} (X; Y)

is nondecreasing in

α

(Lemma 9); (207) follows from Proposition 9; and (208) follows from choosing

Q_{X}

to be the uniform distribution. The inequality (210) is strict when Part (iii) does not hold, so Part (iii) is a necessary condition.

It remains to show that when Parts (i)–(iii) all hold,

J_{α} (X; Y) = log | X |

. By (203),

J_{α} (X; Y) \leq log | X |

always holds, so it suffices to show that Parts (i)–(iii) together imply

J_{α} (X; Y) \geq log | X |

. Indeed,

\begin{matrix} J_{α} (X; Y) & \geq J_{\frac{1}{2}} (X; Y) \end{matrix}

(211)

\begin{matrix} \geq J_{\frac{1}{2}} (X; X^{'}) \end{matrix}

(212)

\begin{matrix} = H_{\infty} (X) \end{matrix}

(213)

\begin{matrix} = log | X |, \end{matrix}

(214)

where (211) holds because Part (i) implies that

α \geq \frac{1}{2}

and because

J_{α} (X; Y)

is nondecreasing in

α

(Lemma 9); (212) follows from the data-processing inequality (Lemma 4) because Part (iii) implies that

X ⊸ - - Y ⊸ - - X^{'}

form a Markov chain; (213) follows from Lemma 11; and (214) follows from Part (ii). □

Lemma 14.

For every

α \in [1, \infty]

,

J_{α} (X; Y)

is concave in

P_{X}

for fixed

P_{Y | X}

.

Proof.

We prove the claim for

α \in (1, \infty)

; for

α \in {1, \infty}

the claim will then hold because

J_{α} (X; Y)

is continuous in

α

(Lemma 10).

Fix

α \in (1, \infty)

. Let

λ, λ^{'} \in [0, 1]

with

λ + λ^{'} = 1

, let

P_{X}

and

P_{X}^{'}

be PMFs, let

P_{Y | X}

be a conditional PMF, and define

f : X \times P (Y) \to R \cup {\infty}

as

\begin{matrix} f (x, Q_{Y}) ≜ {[\sum_{y} P_{Y | X} {(y | x)}^{α} Q_{Y} {(y)}^{1 - α}]}^{\frac{1}{α}} . \end{matrix}

(215)

Denoting

J_{α} (X; Y)

by

J_{α} (P_{X} P_{Y | X})

,

\begin{matrix} J_{α} ((λ P_{X} + λ^{'} P_{X}^{'}) P_{Y | X}) \end{matrix}

\begin{matrix} = min_{Q_{Y}} min_{Q_{X}} D_{α} ((λ P_{X} + λ^{'} P_{X}^{'}) P_{Y | X} ∥ Q_{X} Q_{Y}) \end{matrix}

(216)

\begin{matrix} = min_{Q_{Y}} \frac{α}{α - 1} log \sum_{x} {[\sum_{y} {[λ P_{X} (x) + λ^{'} P_{X}^{'} (x)]}^{α} P_{Y | X} {(y | x)}^{α} Q_{Y} {(y)}^{1 - α}]}^{\frac{1}{α}} \end{matrix}

(217)

\begin{matrix} = min_{Q_{Y}} \frac{α}{α - 1} log \sum_{x} [λ P_{X} (x) + λ^{'} P_{X}^{'} (x)] {[\sum_{y} P_{Y | X} {(y | x)}^{α} Q_{Y} {(y)}^{1 - α}]}^{\frac{1}{α}} \end{matrix}

(218)

\begin{matrix} = min_{Q_{Y}} \frac{α}{α - 1} log [λ \sum_{x} P_{X} (x) f (x, Q_{Y}) + λ^{'} \sum_{x} P_{X}^{'} (x) f (x, Q_{Y})] \end{matrix}

(219)

\begin{matrix} \geq min_{Q_{Y}} \frac{α}{α - 1} [λ log \sum_{x} P_{X} (x) f (x, Q_{Y}) + λ^{'} log \sum_{x} P_{X}^{'} (x) f (x, Q_{Y})] \end{matrix}

(220)

\begin{matrix} \geq λ min_{Q_{Y}} \frac{α}{α - 1} log \sum_{x} P_{X} (x) f (x, Q_{Y}) + λ^{'} min_{Q_{Y}} \frac{α}{α - 1} log \sum_{x} P_{X}^{'} (x) f (x, Q_{Y}) \end{matrix}

(221)

\begin{matrix} = λ J_{α} (P_{X} P_{Y | X}) + λ^{'} J_{α} (P_{X}^{'} P_{Y | X}), \end{matrix}

(222)

where (217) follows from Proposition 9 with the roles of

Q_{X}

and

Q_{Y}

swapped; (220) holds because

log (\cdot)

is concave; (221) holds because optimizing

Q_{Y}

separately cannot be worse than optimizing a common

Q_{Y}

; and (222) can be established using steps similar to (216)–(218). □

Lemma 15.

For every

α \in [\frac{1}{2}, \infty]

, the mapping

(Q_{X}, Q_{Y}) \mapsto D_{α} (P_{X Y} ∥ Q_{X} Q_{Y})

is convex, i.e., for all

λ, λ^{'} \in [0, 1]

with

λ + λ^{'} = 1

, all

Q_{X}, Q_{X}^{'} \in P (X)

, and all

Q_{Y}, Q_{Y}^{'} \in P (Y)

,

\begin{matrix} D_{α} (P_{X Y} ∥ (λ Q_{X} + λ^{'} Q_{X}^{'}) (λ Q_{Y} + λ^{'} Q_{Y}^{'})) \leq λ D_{α} (P_{X Y} ∥ Q_{X} Q_{Y}) + λ^{'} D_{α} (P_{X Y} ∥ Q_{X}^{'} Q_{Y}^{'}) . \end{matrix}

(223)

For

α \in [0, \frac{1}{2})

, the mapping need not be convex.

Proof.

We establish (223) for

α \in [\frac{1}{2}, 1)

and for

α \in (1, \infty)

, which also establishes (223) for

α \in {1, \infty}

because the Rényi divergence is continuous in

α

(Proposition 4). Afterwards, we provide an example where (223) is violated for all

α \in [0, \frac{1}{2})

.

We begin with the case where

α \in [\frac{1}{2}, 1)

:

\begin{matrix} 2^{(α - 1) λ D_{α} (P_{X Y} ∥ Q_{X} Q_{Y}) + (α - 1) λ^{'} D_{α} (P_{X Y} ∥ Q_{X}^{'} Q_{Y}^{'})} \\ = {[\sum_{x, y} P {(x, y)}^{α} {[Q_{X} (x) Q_{Y} (y)]}^{1 - α}]}^{λ} \cdot {[\sum_{x, y} P {(x, y)}^{α} {[Q_{X}^{'} (x) Q_{Y}^{'} (y)]}^{1 - α}]}^{λ^{'}} \end{matrix}

(224)

\begin{matrix} \leq λ \sum_{x, y} P {(x, y)}^{α} {[Q_{X} (x) Q_{Y} (y)]}^{1 - α} + λ^{'} \sum_{x, y} P {(x, y)}^{α} {[Q_{X}^{'} (x) Q_{Y}^{'} (y)]}^{1 - α} \end{matrix}

(225)

\begin{matrix} = \sum_{x, y} P {(x, y)}^{α} [\sqrt{λ} Q_{X} {(x)}^{1 - α} \sqrt{λ} Q_{Y} {(y)}^{1 - α} + \sqrt{λ^{'}} Q_{X}^{'} {(x)}^{1 - α} \sqrt{λ^{'}} Q_{Y}^{'} {(y)}^{1 - α}] \end{matrix}

(226)

\begin{matrix} \leq \sum_{x, y} P {(x, y)}^{α} \sqrt{λ Q_{X} {(x)}^{2 (1 - α)} + λ^{'} Q_{X}^{'} {(x)}^{2 (1 - α)}} \sqrt{λ Q_{Y} {(y)}^{2 (1 - α)} + λ^{'} Q_{Y}^{'} {(y)}^{2 (1 - α)}} \end{matrix}

(227)

\begin{matrix} \leq \sum_{x, y} P {(x, y)}^{α} {[λ Q_{X} (x) + λ^{'} Q_{X}^{'} (x)]}^{1 - α} \sqrt{λ Q_{Y} {(y)}^{2 (1 - α)} + λ^{'} Q_{Y}^{'} {(y)}^{2 (1 - α)}} \end{matrix}

(228)

\begin{matrix} \leq \sum_{x, y} P {(x, y)}^{α} {[λ Q_{X} (x) + λ^{'} Q_{X}^{'} (x)]}^{1 - α} {[λ Q_{Y} (y) + λ^{'} Q_{Y}^{'} (y)]}^{1 - α} \end{matrix}

(229)

\begin{matrix} = 2^{(α - 1) D_{α} (P_{X Y} ∥ (λ Q_{X} + λ^{'} Q_{X}^{'}) (λ Q_{Y} + λ^{'} Q_{Y}^{'}))}, \end{matrix}

(230)

where (225) follows from the arithmetic mean-geometric mean inequality; (227) follows from the Cauchy–Schwarz inequality; and (228) and (229) hold because the mapping

z \mapsto z^{2 (1 - α)}

is concave on

R_{\geq 0}

for

α \in [\frac{1}{2}, 1)

. Taking the logarithm and multiplying by

\frac{1}{α - 1} < 0

establishes (223).

Now, consider

α \in (1, \infty)

. Then,

\begin{matrix} 2^{(α - 1) D_{α} (P_{X Y} ∥ (λ Q_{X} + λ^{'} Q_{X}^{'}) (λ Q_{Y} + λ^{'} Q_{Y}^{'}))} \\ = \sum_{x, y} P {(x, y)}^{α} {[λ Q_{X} (x) + λ^{'} Q_{X}^{'} (x)]}^{1 - α} {[λ Q_{Y} (y) + λ^{'} Q_{Y}^{'} (y)]}^{1 - α} \end{matrix}

(231)

\begin{matrix} \leq \sum_{x, y} P {(x, y)}^{α} {[Q_{X} {(x)}^{λ} Q_{X}^{'} {(x)}^{λ^{'}}]}^{1 - α} {[Q_{Y} {(y)}^{λ} Q_{Y}^{'} {(y)}^{λ^{'}}]}^{1 - α} \end{matrix}

(232)

\begin{matrix} = \sum_{x, y} P {(x, y)}^{α} {[Q_{X} (x) Q_{Y} (y)]}^{(1 - α) λ} {[Q_{X}^{'} (x) Q_{Y}^{'} (y)]}^{(1 - α) λ^{'}} \end{matrix}

(233)

\begin{matrix} \leq {[\sum_{x, y} P {(x, y)}^{α} {[Q_{X} (x) Q_{Y} (y)]}^{1 - α}]}^{λ} \cdot {[\sum_{x, y} P {(x, y)}^{α} {[Q_{X}^{'} (x) Q_{Y}^{'} (y)]}^{1 - α}]}^{λ^{'}} \end{matrix}

(234)

\begin{matrix} = 2^{(α - 1) λ D_{α} (P_{X Y} ∥ Q_{X} Q_{Y}) + (α - 1) λ^{'} D_{α} (P_{X Y} ∥ Q_{X}^{'} Q_{Y}^{'})}, \end{matrix}

(235)

where (232) follows from the arithmetic mean-geometric mean inequality and the fact that the mapping

z \mapsto z^{1 - α}

is decreasing on

R_{> 0}

for

α > 1

, and (233) follows from Hölder’s inequality. Taking the logarithm and multiplying by

\frac{1}{α - 1} > 0

establishes (223).

Finally, we show that the mapping

(Q_{X}, Q_{Y}) \mapsto D_{α} (P_{X Y} ∥ Q_{X} Q_{Y})

does not need to be convex for

α \in [0, \frac{1}{2})

. Let X be uniformly distributed over

{0, 1}

, and let

Y = X

. Then, for all

α \in [0, \frac{1}{2})

,

\begin{matrix} D_{α} (P_{X Y} ∥ (0.5, 0.5) (0.5, 0.5)) > 0.5 D_{α} (P_{X Y} ∥ (1, 0) (1, 0)) + 0.5 D_{α} (P_{X Y} ∥ (0, 1) (0, 1)), \end{matrix}

(236)

because the LHS of (236) is equal to

log 2

, and the RHS of (236) is equal to

\frac{α}{1 - α} log 2

. □

Lemma 16.

Let

α \in (0, 1) \cup (1, \infty)

. If

(Q_{X}^{*}, Q_{Y}^{*})

achieves the minimum in the definition of

J_{α} (X; Y)

, then there exist positive normalization constants c and d such that

\begin{matrix} Q_{X}^{*} (x) & = c {[\sum_{y} P {(x, y)}^{α} Q_{Y}^{*} {(y)}^{1 - α}]}^{\frac{1}{α}} \forall x \in X, \end{matrix}

(237)

\begin{matrix} Q_{Y}^{*} (y) & = d {[\sum_{x} P {(x, y)}^{α} Q_{X}^{*} {(x)}^{1 - α}]}^{\frac{1}{α}} \forall y \in Y, \end{matrix}

(238)

with the conventions of (44). The case

α = \infty

is similar: if

(Q_{X}^{*}, Q_{Y}^{*})

achieves the minimum in the definition of

J_{\infty} (X; Y)

, then there exist positive normalization constants c and d such that

\begin{matrix} Q_{X}^{*} (x) & = c max_{y} \frac{P (x, y)}{Q_{Y}^{*} (y)} \forall x \in X, \end{matrix}

(239)

\begin{matrix} Q_{Y}^{*} (y) & = d max_{x} \frac{P (x, y)}{Q_{X}^{*} (x)} \forall y \in Y, \end{matrix}

(240)

with the conventions of (44). (If

α = 1

, then

Q_{X}^{*} = P_{X}

and

Q_{Y}^{*} = P_{Y}

by Proposition 8.) Thus, for all

α \in (0, \infty]

, both inclusions

supp (Q_{X}^{*}) \subseteq supp (P_{X})

and

supp (Q_{Y}^{*}) \subseteq supp (P_{Y})

hold.

Proof.

If

(Q_{X}^{*}, Q_{Y}^{*})

achieves the minimum in the definition of

J_{α} (X; Y)

, then

\begin{matrix} min_{Q_{Y}} D_{α} (P_{X Y} ∥ Q_{X}^{*} Q_{Y}) = D_{α} (P_{X Y} ∥ Q_{X}^{*} Q_{Y}^{*}) . \end{matrix}

(241)

Hence, (238) and (240) follow from (76) and (78) of Proposition 9 because

D_{α} (P_{X Y} ∥ Q_{X}^{*} Q_{Y}^{*}) = J_{α} (X; Y)

is finite. Swapping the roles of

Q_{X}

and

Q_{Y}

establishes (237) and (239). For

α \in (0, 1) \cup (1, \infty)

the claimed inclusions follow from (237) and (238); for

α = \infty

from (239) and (240); and for

α = 1

from Proposition 8. □

Lemma 17.

For all

α \in (0, \infty]

,

\begin{matrix} J_{α} (X; Y) = min_{Q_{X}} ϕ_{α} (Q_{X}), \end{matrix}

(242)

where

ϕ_{α} (Q_{X})

is defined as

\begin{matrix} ϕ_{α} (Q_{X}) ≜ min_{Q_{Y}} D_{α} (P_{X Y} ∥ Q_{X} Q_{Y}) \end{matrix}

(243)

and is given explicitly as follows: for

α \in (0, 1) \cup (1, \infty)

,

\begin{matrix} ϕ_{α} (Q_{X}) = \frac{α}{α - 1} log \sum_{y} {[\sum_{x} P {(x, y)}^{α} Q_{X} {(x)}^{1 - α}]}^{\frac{1}{α}}, \end{matrix}

(244)

with the conventions of (44); and for

α \in {1, \infty}

,

\begin{matrix} ϕ_{1} (Q_{X}) & = D (P_{X Y} ∥ Q_{X} P_{Y}), \end{matrix}

(245)

\begin{matrix} ϕ_{\infty} (Q_{X}) & = log \sum_{y} max_{x} \frac{P (x, y)}{Q_{X} (x)}, \end{matrix}

(246)

with the conventions of (44). For every

α \in [\frac{1}{2}, \infty]

, the mapping

Q_{X} \mapsto ϕ_{α} (Q_{X})

is convex. For

α \in (0, \frac{1}{2})

, the mapping need not be convex.

Proof.

We first establish (242) and (244)–(246): (242) follows from the definition of

J_{α} (X; Y)

; (244) and (246) follow from Proposition 9; and (245) holds because

\begin{matrix} min_{Q_{Y}} D (P_{X Y} ∥ Q_{X} Q_{Y}) & = min_{Q_{Y}} [D (P_{X Y} ∥ Q_{X} P_{Y}) + D (P_{Y} ∥ Q_{Y})] \end{matrix}

(247)

\begin{matrix} = D (P_{X Y} ∥ Q_{X} P_{Y}), \end{matrix}

(248)

where (247) follows from a simple computation, and (248) holds because

D (P_{Y} ∥ Q_{Y}) \geq 0

with equality if

Q_{Y} = P_{Y}

.

We now show that the mapping

Q_{X} \mapsto ϕ_{α} (Q_{X})

is convex for every

α \in [\frac{1}{2}, \infty]

. To that end, let

α \in [\frac{1}{2}, \infty]

, let

λ, λ^{'} \in [0, 1]

with

λ + λ^{'} = 1

, and let

Q_{X}, Q_{X}^{'} \in P (X)

. Let

{\hat{Q}}_{Y}

and

{\hat{Q}}_{Y}^{'}

be PMFs that achieve the minimum in the definitions of

ϕ_{α} (Q_{X})

and

ϕ_{α} (Q_{X}^{'})

, respectively. Then,

\begin{matrix} ϕ_{α} (λ Q_{X} + λ^{'} Q_{X}^{'}) & \leq D_{α} (P_{X Y} ∥ (λ Q_{X} + λ^{'} Q_{X}^{'}) (λ {\hat{Q}}_{Y} + λ^{'} {\hat{Q}}_{Y}^{'})) \end{matrix}

(249)

\begin{matrix} \leq λ D_{α} (P_{X Y} ∥ Q_{X} {\hat{Q}}_{Y}) + λ^{'} D_{α} (P_{X Y} ∥ Q_{X}^{'} {\hat{Q}}_{Y}^{'}) \end{matrix}

(250)

\begin{matrix} = λ ϕ_{α} (Q_{X}) + λ^{'} ϕ_{α} (Q_{X}^{'}), \end{matrix}

(251)

where (249) holds by the definition of

ϕ_{α} (\cdot)

; (250) holds because

D_{α} (P_{X Y} ∥ Q_{X} Q_{Y})

is convex in the pair

(Q_{X}, Q_{Y})

for

α \in [\frac{1}{2}, \infty]

(Lemma 15); and (251) follows from our choice of

{\hat{Q}}_{Y}

and

{\hat{Q}}_{Y}^{'}

.

Finally, we show that the mapping

Q_{X} \mapsto ϕ_{α} (Q_{X})

need not be convex for

α \in (0, \frac{1}{2})

. Let X be uniformly distributed over

{0, 1}

, and let

Y = X

. Then, for all

α \in (0, \frac{1}{2})

,

\begin{matrix} ϕ_{α} ((0.5, 0.5)) > 0.5 ϕ_{α} ((1, 0)) + 0.5 ϕ_{α} ((0, 1)), \end{matrix}

(252)

because the LHS of (252) is equal to

log 2

, and the RHS of (252) is equal to

\frac{α}{1 - α} log 2

. □

Lemma 18.

For all

α \in (0, 1) \cup (1, \infty]

,

\begin{matrix} J_{α} (X; Y) = \{\begin{matrix} min_{R_{X Y} \in P (X \times Y)} ψ_{α} (R_{X Y}) & i f α \in (0, 1), \\ max_{R_{X Y} \in P (X \times Y)} ψ_{α} (R_{X Y}) & i f α \in (1, \infty], \end{matrix} \end{matrix}

(253)

where

\begin{matrix} ψ_{α} (R_{X Y}) ≜ \{\begin{matrix} D (R_{X Y} ∥ R_{X} R_{Y}) + \frac{α}{1 - α} D (R_{X Y} ∥ P_{X Y}) & i f α \in (0, 1) \cup (1, \infty), \\ D (R_{X Y} ∥ R_{X} R_{Y}) - D (R_{X Y} ∥ P_{X Y}) & i f α = \infty . \end{matrix} \end{matrix}

(254)

For every

α \in (1, \infty]

, the mapping

R_{X Y} \mapsto ψ_{α} (R_{X Y})

is concave. For all

α \in (1, \infty]

and all

R_{X Y} \in P (X \times Y)

, the statement

J_{α} (X; Y) = ψ_{α} (R_{X Y})

is equivalent to

ψ_{α} (R_{X Y}) = D_{α} (P_{X Y} ∥ R_{X} R_{Y})

.

Proof.

For

α \in (0, 1) \cup (1, \infty)

, (253) follows from Lemma 8 by dividing by

1 - α

, which is positive or negative depending on whether

α

is smaller than or greater than one. For

α = \infty

, we establish (253) as follows: By Lemma 10, its LHS is continuous at

α = \infty

. We argue below that its RHS is continuous at

α = \infty

, i.e., that

\begin{matrix} lim_{α \to \infty} max_{R_{X Y}} ψ_{α} (R_{X Y}) = max_{R_{X Y}} ψ_{\infty} (R_{X Y}) . \end{matrix}

(255)

Because (253) holds for

α \in (1, \infty)

and because both its sides are continuous at

α = \infty

, it must also hold for

α = \infty

.

We now establish (255). Let

R_{X Y}^{*}

be a PMF that achieves the maximum on the RHS of (255). Then, for all

α > 1

,

\begin{matrix} ψ_{\infty} (R_{X Y}^{*}) & = max_{R_{X Y}} ψ_{\infty} (R_{X Y}) \end{matrix}

(256)

\begin{matrix} \geq max_{R_{X Y}} ψ_{α} (R_{X Y}) \end{matrix}

(257)

\begin{matrix} \geq ψ_{α} (R_{X Y}^{*}), \end{matrix}

(258)

where (257) holds because, by (254),

ψ_{\infty} (R_{X Y}) = ψ_{α} (R_{X Y}) + \frac{1}{α - 1} D (R_{X Y} ∥ P_{X Y}) \geq ψ_{α} (R_{X Y})

for all

R_{X Y} \in P (X \times Y)

. By (254),

α \mapsto ψ_{α} (R_{X Y}^{*})

is continuous at

α = \infty

, so the RHS of (258) approaches

ψ_{\infty} (R_{X Y}^{*})

as

α

tends to infinity, and (255) follows from the sandwich theorem.

We now show that

R_{X Y} \mapsto ψ_{α} (R_{X Y})

is concave for

α \in (1, \infty]

. A simple computation reveals that for all

α \in (1, \infty)

,

\begin{matrix} ψ_{α} (R_{X Y}) = H (R_{X}) + H (R_{Y}) + \frac{1}{α - 1} H (R_{X Y}) + \frac{α}{α - 1} \sum_{x, y} R_{X Y} (x, y) log P (x, y) . \end{matrix}

(259)

Because the entropy is a concave function and because a nonnegative weighted sum of concave functions is concave, this implies that

ψ_{α} (R_{X Y})

is concave in

R_{X Y}

for

α \in (1, \infty)

. By (254),

α \mapsto ψ_{α} (R_{X Y})

is continuous at

α = \infty

, so

ψ_{α} (R_{X Y})

is concave in

R_{X Y}

also for

α = \infty

.

We next show that if

α \in (1, \infty]

and

ψ_{α} (R_{X Y}) = D_{α} (P_{X Y} ∥ R_{X} R_{Y})

, then

J_{α} (X; Y) = ψ_{α} (R_{X Y})

. Let

α \in (1, \infty]

, and let

R_{X Y}

be a PMF that satisfies

ψ_{α} (R_{X Y}) = D_{α} (P_{X Y} ∥ R_{X} R_{Y})

. Then,

\begin{matrix} ψ_{α} (R_{X Y}) & \leq J_{α} (X; Y) \end{matrix}

(260)

\begin{matrix} \leq D_{α} (P_{X Y} ∥ R_{X} R_{Y}), \end{matrix}

(261)

where (260) follows from (253), and (261) holds by the definition of

J_{α} (X; Y)

. Because

ψ_{α} (R_{X Y})

is equal to

D_{α} (P_{X Y} ∥ R_{X} R_{Y})

, both inequalities hold with equality, which implies the claim.

Finally, we show that if

α \in (1, \infty]

and

J_{α} (X; Y) = ψ_{α} (R_{X Y})

, then

ψ_{α} (R_{X Y}) = D_{α} (P_{X Y} ∥ R_{X} R_{Y})

. We first consider

α \in (1, \infty)

. Let

R_{X Y}

be a PMF that satisfies

J_{α} (X; Y) = ψ_{α} (R_{X Y})

, and let

Q_{X}^{*}

and

Q_{Y}^{*}

be PMFs that achieve the minimum in the definition of

J_{α} (X; Y)

. Then,

\begin{matrix} J_{α} (X; Y) & = ψ_{α} (R_{X Y}) \end{matrix}

(262)

\begin{matrix} = D (R_{X Y} ∥ R_{X} R_{Y}) + \frac{α}{1 - α} D (R_{X Y} ∥ P_{X Y}) \end{matrix}

(263)

\begin{matrix} \leq D (R_{X Y} ∥ Q_{X}^{*} Q_{Y}^{*}) + \frac{α}{1 - α} D (R_{X Y} ∥ P_{X Y}) \end{matrix}

(264)

\begin{matrix} \leq D_{α} (P_{X Y} ∥ Q_{X}^{*} Q_{Y}^{*}) \end{matrix}

(265)

\begin{matrix} = J_{α} (X; Y), \end{matrix}

(266)

where (264) follows from Proposition 8, and (265) follows from [22] (Theorem 30). Thus, all inequalities hold with equality. Because (264) holds with equality,

Q_{X}^{*} = R_{X}

and

Q_{Y}^{*} = R_{Y}

by Proposition 8. Hence,

ψ_{α} (R_{X Y}) = D_{α} (P_{X Y} ∥ Q_{X}^{*} Q_{Y}^{*}) = D_{α} (P_{X Y} ∥ R_{X} R_{Y})

as desired. We now consider

α = \infty

. Here, (262)–(266) remain valid after replacing

\frac{α}{1 - α}

by

- 1

. (Now, (265) follows from a short computation.) Consequently,

ψ_{α} (R_{X Y}) = D_{α} (P_{X Y} ∥ R_{X} R_{Y})

holds also for

α = \infty

.

Lemma 19.

For all

α \in (0, 1) \cup (1, \infty)

,

\begin{matrix} J_{α} (X; Y) = min_{R_{X} ≪ P_{X}} \frac{1}{α - 1} [D_{\frac{α}{α - 1}} (P_{X} ∥ R_{X}) - α E_{0} (\frac{1 - α}{α}, R_{X})], \end{matrix}

(267)

where the minimization is over all PMFs

R_{X}

satisfying

R_{X} ≪ P_{X}

(i . e ., supp (R_{X}) \subseteq supp (P_{X}))

;

D_{α} (P ∥ Q)

for negative α is given by (54); and Gallager’s

E_{0}

function [29] is defined as

\begin{matrix} E_{0} (ρ, R_{X}) ≜ - log \sum_{y} {[\sum_{x} R_{X} (x) P {(y | x)}^{\frac{1}{1 + ρ}}]}^{1 + ρ} . \end{matrix}

(268)

Proof.

Let

α \in (0, 1) \cup (1, \infty)

, and define the set

R ≜ {R_{X} \in P (X) : supp (R_{X}) \subseteq supp (P_{X})}

. We establish (267) by showing that for all

R_{X} \in R

,

\begin{matrix} \frac{1}{α - 1} [D_{\frac{α}{α - 1}} (P_{X} ∥ R_{X}) - α E_{0} (\frac{1 - α}{α}, R_{X})] \geq J_{α} (X; Y), \end{matrix}

(269)

with equality for some

R_{X} \in R

.

Fix

R_{X} \in R

. If the LHS of (269) is infinite, then (269) holds trivially. Otherwise, define the PMF

{\hat{Q}}_{X}

as

\begin{matrix} {\hat{Q}}_{X} (x) ≜ \frac{P_{X} {(x)}^{\frac{α}{α - 1}} R_{X} {(x)}^{\frac{1}{1 - α}}}{\sum_{x^{'} \in X} P_{X} {(x^{'})}^{\frac{α}{α - 1}} R_{X} {(x^{'})}^{\frac{1}{1 - α}}}, \end{matrix}

(270)

where we use the convention that

0^{\frac{α}{α - 1}} \cdot 0^{\frac{1}{1 - α}} = 0

. (The RHS of (270) is finite whenever the LHS of (269) is finite.) Then, (269) holds because

\begin{matrix} J_{α} (X; Y) & = min_{Q_{X}} \frac{α}{α - 1} log \sum_{y} {[\sum_{x} P {(x, y)}^{α} Q_{X} {(x)}^{1 - α}]}^{\frac{1}{α}} \end{matrix}

(271)

\begin{matrix} \leq \frac{α}{α - 1} log \sum_{y} {[\sum_{x} P {(x, y)}^{α} {\hat{Q}}_{X} {(x)}^{1 - α}]}^{\frac{1}{α}} \end{matrix}

(272)

\begin{matrix} = log \sum_{x} P_{X} {(x)}^{\frac{α}{α - 1}} R_{X} {(x)}^{\frac{1}{1 - α}} + \frac{α}{α - 1} log \sum_{y} {[\sum_{x} R_{X} (x) P {(y | x)}^{α}]}^{\frac{1}{α}} \end{matrix}

(273)

\begin{matrix} = \frac{1}{α - 1} [D_{\frac{α}{α - 1}} (P_{X} ∥ R_{X}) - α E_{0} (\frac{1 - α}{α}, R_{X})], \end{matrix}

(274)

where (271) follows from Lemma 17, and (273) follows from (270) using some algebra. It remains to show that there exists an

R_{X} \in R

for which (272) holds with equality. To that end, let

Q_{X}^{*}

be a PMF that achieves the minimum on the RHS of (271), and define the PMF

R_{X}

as

\begin{matrix} R_{X} (x) ≜ \frac{P_{X} {(x)}^{α} Q_{X}^{*} {(x)}^{1 - α}}{\sum_{x^{'} \in X} P_{X} {(x^{'})}^{α} Q_{X}^{*} {(x^{'})}^{1 - α}}, \end{matrix}

(275)

where we use the convention that

0^{α} \cdot 0^{1 - α} = 0

. Because

supp (Q_{X}^{*}) \subseteq supp (P_{X})

(Lemma 16), the definitions (275) and (270) imply that

{\hat{Q}}_{X} = Q_{X}^{*}

. Hence, (272) holds with equality for this

R_{X} \in R

.

Lemma 20.

For every

α \in (\frac{1}{2}, \infty]

, the mapping

(Q_{X}, Q_{Y}) \mapsto D_{α} (P_{X Y} ∥ Q_{X} Q_{Y})

has a unique minimizer. This need not be the case when

α \in [0, \frac{1}{2}]

.

Proof.

First consider

α \in (\frac{1}{2}, 1)

. Let

(Q_{X}^{*}, Q_{Y}^{*})

and

({\hat{Q}}_{X}, {\hat{Q}}_{Y})

be pairs of PMFs that both minimize

(Q_{X}, Q_{Y}) \mapsto D_{α} (P_{X Y} ∥ Q_{X} Q_{Y})

. We establish uniqueness by arguing that

(Q_{X}^{*}, Q_{Y}^{*})

and

({\hat{Q}}_{X}, {\hat{Q}}_{Y})

must be identical. Observe that

\begin{matrix} J_{α} (X; Y) & \leq D_{α} (P_{X Y} ∥ (0.5 Q_{X}^{*} + 0.5 {\hat{Q}}_{X}) (0.5 Q_{Y}^{*} + 0.5 {\hat{Q}}_{Y})) \end{matrix}

(276)

\begin{matrix} \leq 0.5 D_{α} (P_{X Y} ∥ Q_{X}^{*} Q_{Y}^{*}) + 0.5 D_{α} (P_{X Y} ∥ {\hat{Q}}_{X} {\hat{Q}}_{Y}) \end{matrix}

(277)

\begin{matrix} = J_{α} (X; Y), \end{matrix}

(278)

where (276) holds by the definition of

J_{α} (X; Y)

, and (277) follows from Lemma 15. Hence, (277) holds with equality, which implies that (228) in the proof of Lemma 15 holds with equality, i.e.,

\begin{matrix} \sum_{x, y} P {(x, y)}^{α} \sqrt{0.5 Q_{X}^{*} {(x)}^{2 (1 - α)} + 0.5 {\hat{Q}}_{X} {(x)}^{2 (1 - α)}} \sqrt{0.5 Q_{Y}^{*} {(y)}^{2 (1 - α)} + 0.5 {\hat{Q}}_{Y} {(y)}^{2 (1 - α)}} \\ = \sum_{x, y} P {(x, y)}^{α} {[0.5 Q_{X}^{*} (x) + 0.5 {\hat{Q}}_{X} (x)]}^{1 - α} \sqrt{0.5 Q_{Y}^{*} {(y)}^{2 (1 - α)} + 0.5 {\hat{Q}}_{Y} {(y)}^{2 (1 - α)}} . \end{matrix}

(279)

We first argue that

Q_{X}^{*} = {\hat{Q}}_{X}

. Since

Q_{X}^{*}

and

{\hat{Q}}_{X}

are PMFs, it suffices to show that

Q_{X}^{*} (x) = {\hat{Q}}_{X} (x)

for every

x \in supp ({\hat{Q}}_{X})

. Let

\hat{x} \in supp ({\hat{Q}}_{X})

. Because

supp ({\hat{Q}}_{X}) \subseteq supp (P_{X})

(Lemma 16), there exists a

\hat{y} \in Y

such that

P (\hat{x}, \hat{y}) > 0

. Again by Lemma 16, this implies that

{\hat{Q}}_{Y} (\hat{y}) > 0

. Because the mapping

z \mapsto z^{2 (1 - α)}

is strictly concave on

R_{\geq 0}

for

α \in (\frac{1}{2}, 1)

, it follows from (279) that

Q_{X}^{*} (\hat{x}) = {\hat{Q}}_{X} (\hat{x})

. Swapping the roles of

Q_{X}

and

Q_{Y}

, we obtain that

Q_{Y}^{*} = {\hat{Q}}_{Y}

.

For

α = 1

, the minimizer is unique by Proposition 8 because

D_{1} (P_{X Y} ∥ Q_{X} Q_{Y}) = D (P_{X Y} ∥ Q_{X} Q_{Y})

.

Now consider

α \in (1, \infty]

. Here, we establish uniqueness via the characterization of

J_{α} (X; Y)

provided by Lemma 18. Let

ψ_{α} (R_{X Y})

be defined as in Lemma 18. Let

R_{X Y}

be a PMF that satisfies

J_{α} (X; Y) = ψ_{α} (R_{X Y})

, and let

(Q_{X}^{*}, Q_{Y}^{*})

be a pair of PMFs that minimizes

(Q_{X}, Q_{Y}) \mapsto D_{α} (P_{X Y} ∥ Q_{X} Q_{Y})

. If

α \in (1, \infty)

, then (264) in the proof of Lemma 18 holds with equality, i.e.,

\begin{matrix} D (R_{X Y} ∥ R_{X} R_{Y}) + \frac{α}{1 - α} D (R_{X Y} ∥ P_{X Y}) = D (R_{X Y} ∥ Q_{X}^{*} Q_{Y}^{*}) + \frac{α}{1 - α} D (R_{X Y} ∥ P_{X Y}) . \end{matrix}

(280)

Because the LHS of (280) is finite, Proposition 8 implies that

Q_{X}^{*} = R_{X}

and

Q_{Y}^{*} = R_{Y}

, thus the minimizer is unique. As shown in the proof of Lemma 18, (280) remains valid for

α = \infty

after replacing

\frac{α}{1 - α}

by

- 1

, thus the same argument establishes the uniqueness for

α = \infty

.

Finally, we show that, for

α \in [0, \frac{1}{2}]

, the mapping

(Q_{X}, Q_{Y}) \mapsto D_{α} (P_{X Y} ∥ Q_{X} Q_{Y})

can have more than one minimizer. Let X be uniformly distributed over

{0, 1}

, and let

Y = X

. Then, for all

α \in [0, \frac{1}{2}]

,

\begin{matrix} J_{α} (X; Y) & = \frac{α}{1 - α} log 2 \end{matrix}

(281)

\begin{matrix} = D_{α} (P_{X Y} ∥ (1, 0) (1, 0)) \end{matrix}

(282)

\begin{matrix} = D_{α} (P_{X Y} ∥ (0, 1) (0, 1)), \end{matrix}

(283)

where (281) follows from Lemma 11. □

Lemma 21.

For every

α \in [0, \infty]

, the minimum in the definition of

K_{α} (X; Y)

in (2) exists and is finite.

Proof.

Let

α \in [0, \infty]

, and denote by

U_{X}

and

U_{Y}

the uniform distribution over

X

and

Y

, respectively. Then

{inf}_{Q_{X}, Q_{Y}} Δ_{α} (P_{X Y} ∥ Q_{X} Q_{Y})

is finite because

Δ_{α} (P_{X Y} ∥ U_{X} U_{Y})

is finite and because the relative

α

-entropy is nonnegative (Proposition 5). For

α \in (0, \infty)

, the minimum exists because the set

P (X) \times P (Y)

is compact and the mapping

(Q_{X}, Q_{Y}) \mapsto Δ_{α} (P_{X Y} ∥ Q_{X} Q_{Y})

is continuous. For

α \in {0, \infty}

, the minimum exists because

(Q_{X}, Q_{Y}) \mapsto Δ_{α} (P_{X Y} ∥ Q_{X} Q_{Y})

takes on only a finite number of values: if

α = 0

, then

Δ_{α} (P_{X Y} ∥ Q_{X} Q_{Y})

depends on

Q_{X} Q_{Y}

only via

supp (Q_{X} Q_{Y}) \subseteq X \times Y

; and if

α = \infty

, then

Δ_{α} (P_{X Y} ∥ Q_{X} Q_{Y})

depends on

Q_{X} Q_{Y}

only via

argmax (Q_{X} Q_{Y}) \subseteq X \times Y

. □

Lemma 22.

For all

α \in [0, \infty]

,

K_{α} (X; Y) \geq 0

. If

α \in (0, \infty)

, then

K_{α} (X; Y) = 0

if and only if X and Y are independent (nonnegativity).

Proof.

The nonnegativity follows from the definition of

K_{α} (X; Y)

because the relative

α

-entropy is nonnegative for

α \in [0, \infty]

(Proposition 5). If X and Y are independent, then

P_{X Y} = P_{X} P_{Y}

, and the choice

Q_{X} = P_{X}

and

Q_{Y} = P_{Y}

in the definition of

K_{α} (X; Y)

achieves

K_{α} (X; Y) = 0

. Conversely, if

K_{α} (X; Y) = 0

, then there exist PMFs

Q_{X}^{*}

and

Q_{Y}^{*}

satisfying

Δ_{α} (P_{X Y} ∥ Q_{X}^{*} Q_{Y}^{*}) = 0

. If, in addition,

α \in (0, \infty)

, then

P_{X Y} = Q_{X}^{*} Q_{Y}^{*}

by Proposition 5, and hence X and Y are independent. □

Lemma 23.

For all

α \in [0, \infty]

,

K_{α} (X; Y) = K_{α} (Y; X)

(symmetry).

Proof.

The definition of

K_{α} (X; Y)

is symmetric in X and Y. □

Lemma 24.

For all

α \in (0, \infty)

,

\begin{matrix} K_{α} (X; Y) + H_{α} (X, Y) = min_{Q_{X}, Q_{Y}} - log M_{\frac{α - 1}{α}} (Q_{X}, Q_{Y}), \end{matrix}

(284)

where

M_{β} (Q_{X}, Q_{Y})

is the following weighted power mean [30] (Chapter III): For

β \in R ∖ {0}

,

\begin{matrix} M_{β} (Q_{X}, Q_{Y}) ≜ {[\sum_{x, y} P (x, y) {[Q_{X} (x) Q_{Y} (y)]}^{β}]}^{\frac{1}{β}}, \end{matrix}

(285)

where for

β < 0

, we read

P (x, y) {[Q_{X} (x) Q_{Y} (y)]}^{β}

as

P (x, y) / {[Q_{X} (x) Q_{Y} (y)]}^{- β}

and use the conventions (44); and for

β = 0

, using the convention

0^{0} = 1

,

\begin{matrix} M_{0} (Q_{X}, Q_{Y}) ≜ \prod_{x, y} {[Q_{X} (x) Q_{Y} (y)]}^{P (x, y)} . \end{matrix}

(286)

Proof.

Let

α \in (0, \infty)

, and define the PMF

{\tilde{P}}_{X Y}

as

\begin{matrix} {\tilde{P}}_{X Y} (x, y) ≜ \frac{P_{X Y} {(x, y)}^{α}}{\sum_{(x^{'}, y^{'}) \in X \times Y} P_{X Y} {(x^{'}, y^{'})}^{α}} . \end{matrix}

(287)

Then,

\begin{matrix} K_{α} (X; Y) & = J_{\frac{1}{α}} (\tilde{X}; \tilde{Y}) \end{matrix}

(288)

\begin{matrix} = min_{Q_{X}, Q_{Y}} D_{\frac{1}{α}} ({\tilde{P}}_{X Y} ∥ Q_{X} Q_{Y}), \end{matrix}

(289)

where (288) follows from Proposition 7, and (289) follows from the definition of

J_{1 / α} (\tilde{X}; \tilde{Y})

. A simple computation reveals that for all PMFs

Q_{X}

and

Q_{Y}

,

\begin{matrix} D_{\frac{1}{α}} ({\tilde{P}}_{X Y} ∥ Q_{X} Q_{Y}) = - log M_{\frac{α - 1}{α}} (Q_{X}, Q_{Y}) - H_{α} (X, Y) . \end{matrix}

(290)

Hence, (284) follows from (289) and (290). □

Lemma 25.

For

α = 0

,

\begin{matrix} K_{0} (X; Y) & = log \frac{| supp (P_{X} P_{Y}) |}{| supp (P_{X Y}) |} \end{matrix}

(291)

\begin{matrix} \geq min_{Q_{X}, Q_{Y}} log max_{(x, y) \in supp (P_{X Y})} \frac{1}{Q_{X} (x) Q_{Y} (y)} - log | supp (P_{X Y}) | \end{matrix}

(292)

\begin{matrix} = lim_{α ↓ 0} K_{α} (X; Y), \end{matrix}

(293)

where in the RHS of (292), we use the conventions (44). The inequality can be strict, so

α \mapsto K_{α} (X; Y)

need not be continuous at

α = 0

.

Proof.

We first prove (291). Recall that

\begin{matrix} Δ_{0} (P_{X Y} ∥ Q_{X} Q_{Y}) = \{\begin{matrix} log \frac{| supp (Q_{X} Q_{Y}) |}{| supp (P_{X Y}) |} & if supp (P_{X Y}) \subseteq supp (Q_{X} Q_{Y}), \\ \infty & otherwise . \end{matrix} \end{matrix}

(294)

Observe that

Δ_{0} (P_{X Y} ∥ Q_{X} Q_{Y})

is finite only if

supp (P_{X}) \subseteq supp (Q_{X})

and

supp (P_{Y}) \subseteq supp (Q_{Y})

. For such PMFs

Q_{X}

and

Q_{Y}

, we have

| supp (Q_{X} Q_{Y}) | \geq | supp (P_{X} P_{Y}) |

. Thus, for all PMFs

Q_{X}

and

Q_{Y}

,

\begin{matrix} Δ_{0} (P_{X Y} ∥ Q_{X} Q_{Y}) \geq log \frac{| supp (P_{X} P_{Y}) |}{| supp (P_{X Y}) |} . \end{matrix}

(295)

Choosing

Q_{X} = P_{X}

and

Q_{Y} = P_{Y}

achieves equality in (295), which establishes (291).

We now show (292). Let

Q_{X}

and

Q_{Y}

be the uniform distributions over

supp (P_{X})

and

supp (P_{Y})

, respectively. Then,

\begin{matrix} log max_{(x, y) \in supp (P_{X Y})} \frac{1}{Q_{X} (x) Q_{Y} (y)} - log | supp (P_{X Y}) | = log \frac{| supp (P_{X} P_{Y}) |}{| supp (P_{X Y}) |}, \end{matrix}

(296)

and hence (292) holds.

We next establish (293). To that end, define

\begin{matrix} τ ≜ min_{(x, y) \in supp (P_{X Y})} P (x, y) . \end{matrix}

(297)

We bound

K_{α} (X; Y) + H_{α} (X, Y)

as follows: For all

α \in (0, 1)

,

\begin{matrix} K_{α} (X; Y) + H_{α} (X, Y) & = min_{Q_{X}, Q_{Y}} \frac{α}{1 - α} log \sum_{x, y} P (x, y) {[Q_{X} (x) Q_{Y} (y)]}^{\frac{α - 1}{α}} \end{matrix}

(298)

\begin{matrix} \geq min_{Q_{X}, Q_{Y}} \frac{α}{1 - α} log \sum_{(x, y) \in supp (P_{X Y})} τ {[Q_{X} (x) Q_{Y} (y)]}^{\frac{α - 1}{α}} \end{matrix}

(299)

\begin{matrix} \geq min_{Q_{X}, Q_{Y}} \frac{α}{1 - α} log max_{(x, y) \in supp (P_{X Y})} τ {[Q_{X} (x) Q_{Y} (y)]}^{\frac{α - 1}{α}} \end{matrix}

(300)

\begin{matrix} = min_{Q_{X}, Q_{Y}} log max_{(x, y) \in supp (P_{X Y})} \frac{1}{Q_{X} (x) Q_{Y} (y)} - \frac{α}{1 - α} log \frac{1}{τ}, \end{matrix}

(301)

where (298) follows from Lemma 24. Similarly, for all

α \in (0, 1)

,

\begin{matrix} K_{α} (X; Y) + H_{α} (X, Y) & = min_{Q_{X}, Q_{Y}} \frac{α}{1 - α} log \sum_{x, y} P (x, y) {[Q_{X} (x) Q_{Y} (y)]}^{\frac{α - 1}{α}} \end{matrix}

(302)

\begin{matrix} \leq min_{Q_{X}, Q_{Y}} \frac{α}{1 - α} log max_{(x, y) \in supp (P_{X Y})} {[Q_{X} (x) Q_{Y} (y)]}^{\frac{α - 1}{α}} \end{matrix}

(303)

\begin{matrix} = min_{Q_{X}, Q_{Y}} log max_{(x, y) \in supp (P_{X Y})} \frac{1}{Q_{X} (x) Q_{Y} (y)}, \end{matrix}

(304)

where (302) is the same as (298). Now (293) follows from (301), (304), and the sandwich theorem because

{lim}_{α ↓ 0} \frac{α}{1 - α} log \frac{1}{τ} = 0

and because

{lim}_{α ↓ 0} H_{α} (X, Y) = log | supp (P_{X Y}) |

(Proposition 3).

Finally, we provide an example for which (292) holds with strict inequality. Let

X = {1, 2, 3}

, let

Y = {1, 2}

, and let

(X, Y)

be uniformly distributed over

{(1, 1), (2, 2), (3, 1)}

. The LHS of (292) then equals

log 2

. Using

\begin{matrix} Q_{X} (x) & ≜ \{\begin{matrix} 0.28 & if x \in {1, 3}, \\ 0.44 & if x = 2, \end{matrix} \end{matrix}

(305)

\begin{matrix} Q_{Y} (y) & ≜ \{\begin{matrix} 0.60 & if y = 1, \\ 0.40 & if y = 2, \end{matrix} \end{matrix}

(306)

we see that the RHS of (292) is upper bounded by

log \frac{5.952 \dots}{3}

, which is smaller than

log 2

. □

Lemma 26.

K_{1} (X; Y) = I (X; Y)

.

Proof.

The claim follows from Proposition 8 because

Δ_{1} (P_{X Y} ∥ Q_{X} Q_{Y})

in the definition of

K_{1} (X; Y)

is equal to

D (P_{X Y} ∥ Q_{X} Q_{Y})

. □

Lemma 27.

Let

f : {1, \dots, | X |} \to X

and

g : {1, \dots, | Y |} \to Y

be bijective functions, and let

B

be the

| X | \times | Y |

matrix whose Row-i Column-j entry

B_{i, j}

equals

P_{X Y} (f (i), g (j))

. Then,

\begin{matrix} K_{2} (X; Y) = - 2 log σ_{1} (B) - H_{2} (X, Y), \end{matrix}

(307)

where

σ_{1} (B)

denotes the largest singular value of

B

. (Because the singular values of a matrix are invariant under row and column permutations, the result does not depend on f or g.)

Proof.

Let

(\tilde{X}, \tilde{Y})

be distributed according to the joint PMF

\begin{matrix} {\tilde{P}}_{X Y} (x, y) ≜ {[β P_{X Y} (x, y)]}^{2}, \end{matrix}

(308)

where

\begin{matrix} β ≜ {[\sum_{x, y} P_{X Y} {(x, y)}^{2}]}^{- \frac{1}{2}} . \end{matrix}

(309)

Then,

\begin{matrix} K_{2} (X; Y) & = J_{\frac{1}{2}} (\tilde{X}; \tilde{Y}) \end{matrix}

(310)

\begin{matrix} = - 2 log σ_{1} (β B) \end{matrix}

(311)

\begin{matrix} = - 2 log [β σ_{1} (B)] \end{matrix}

(312)

\begin{matrix} = - 2 log σ_{1} (B) - H_{2} (X, Y), \end{matrix}

(313)

where (310) follows from Proposition 7; (311) follows from Lemma 6 and (308); (312) holds because

β > 0

; and (313) follows from the definition of

H_{2} (X, Y)

. □

Lemma 28.

K_{\infty} (X; Y) = 0

.

Proof.

Let the pair

(\hat{x}, \hat{y})

be such that

P (\hat{x}, \hat{y}) = {max}_{x, y} P (x, y)

, and define the PMFs

{\hat{Q}}_{X}

and

{\hat{Q}}_{Y}

as

{\hat{Q}}_{X} (x) = 𝟙 {x = \hat{x}}

and

{\hat{Q}}_{Y} (y) = 𝟙 {y = \hat{y}}

. Then,

Δ_{\infty} (P_{X Y} ∥ {\hat{Q}}_{X} {\hat{Q}}_{Y}) = 0

, so

K_{\infty} (X; Y) \leq 0

. Because

K_{\infty} (X; Y) \geq 0

(Lemma 22), this implies

K_{\infty} (X; Y) = 0

. □

Lemma 29.

The mapping

α \mapsto K_{α} (X; Y)

need not be monotonic on

[0, \infty]

.

Proof.

Let

P_{X Y}

be such that

supp (P_{X Y}) = X \times Y

and

I (X; Y) > 0

. Then,

\begin{matrix} K_{0} (X; Y) & = 0, \end{matrix}

(314)

\begin{matrix} K_{1} (X; Y) & > 0, \end{matrix}

(315)

\begin{matrix} K_{\infty} (X; Y) & = 0, \end{matrix}

(316)

which follow from Lemmas 25, 26, and 28, respectively. Thus,

α \mapsto K_{α} (X; Y)

is not monotonic on

[0, \infty]

. □

Lemma 30.

The mapping

α \mapsto K_{α} (X; Y) + H_{α} (X, Y)

is nonincreasing on

[0, \infty]

.

Proof.

We first show the monotonicity for

α \in (0, \infty)

. To that end, let

α, α^{'} \in (0, \infty)

with

α \leq α^{'}

, and let

M_{β} (Q_{X}, Q_{Y})

be defined as in (285) and (286). Then, for all PMFs

Q_{X}

and

Q_{Y}

,

\begin{matrix} M_{\frac{α - 1}{α}} (Q_{X}, Q_{Y}) \leq M_{\frac{α^{'} - 1}{α^{'}}} (Q_{X}, Q_{Y}), \end{matrix}

(317)

which follows from the power mean inequality [30] (III 3.1.1 Theorem 1) because

\frac{α - 1}{α} \leq \frac{α^{'} - 1}{α^{'}}

. Hence,

\begin{matrix} K_{α} (X; Y) + H_{α} (X, Y) & = min_{Q_{X}, Q_{Y}} - log M_{\frac{α - 1}{α}} (Q_{X}, Q_{Y}) \end{matrix}

(318)

\begin{matrix} \geq min_{Q_{X}, Q_{Y}} - log M_{\frac{α^{'} - 1}{α^{'}}} (Q_{X}, Q_{Y}) \end{matrix}

(319)

\begin{matrix} = K_{α^{'}} (X; Y) + H_{α^{'}} (X, Y), \end{matrix}

(320)

where (318) and (320) follow from Lemma 24, and (319) follows from (317).

The monotonicity extends to

α = 0

because

\begin{matrix} K_{0} (X; Y) + H_{0} (X, Y) & \geq lim_{α ↓ 0} K_{α} (X; Y) + H_{0} (X, Y) \end{matrix}

(321)

\begin{matrix} = lim_{α ↓ 0} [K_{α} (X; Y) + H_{α} (X, Y)], \end{matrix}

(322)

where (321) follows from Lemma 25, and (322) holds because

α \mapsto H_{α} (X, Y)

is continuous at

α = 0

(Proposition 3).

The monotonicity extends to

α = \infty

because for all

α \in (0, \infty)

,

\begin{matrix} K_{α} (X; Y) + H_{α} (X, Y) & \geq H_{α} (X, Y) \end{matrix}

(323)

\begin{matrix} \geq H_{\infty} (X, Y) \end{matrix}

(324)

\begin{matrix} = K_{\infty} (X; Y) + H_{\infty} (X, Y), \end{matrix}

(325)

where (323) holds because

K_{α} (X; Y) \geq 0

(Lemma 22); (324) holds because

H_{α} (X, Y)

is nonincreasing in

α

(Proposition 3); and (325) holds because

K_{\infty} (X; Y) = 0

(Lemma 28). □

Lemma 31.

The mapping

α \mapsto K_{α} (X; Y)

is continuous on

(0, \infty]

. (See Lemma 25 for the behavior at

α = 0

.)

Proof.

Because

α \mapsto H_{α} (X, Y)

is continuous on

[0, \infty]

(Proposition 3), it suffices to show that the mapping

α \mapsto K_{α} (X; Y) + H_{α} (X, Y)

is continuous on

(0, \infty]

. We first show that it is continuous on

(0, 1) \cup (1, \infty)

by showing that

α \mapsto (1 - \frac{1}{α}) [K_{α} (X; Y) + H_{α} (X, Y)]

is concave and hence continuous on

(0, \infty)

. For a fixed

α \in (0, \infty)

, let

(\tilde{X}, \tilde{Y})

be distributed according to the joint PMF

\begin{matrix} {\tilde{P}}_{X Y} (x, y) ≜ \frac{P_{X Y} {(x, y)}^{α}}{\sum_{(x^{'}, y^{'}) \in X \times Y} P_{X Y} {(x^{'}, y^{'})}^{α}} . \end{matrix}

(326)

Then, for all

α \in (0, \infty)

,

\begin{matrix} (1 - \frac{1}{α}) [K_{α} (X; Y) + H_{α} (X, Y)] \\ = (1 - \frac{1}{α}) J_{\frac{1}{α}} (\tilde{X}; \tilde{Y}) + (1 - \frac{1}{α}) H_{α} (X, Y) \end{matrix}

(327)

\begin{matrix} = min_{R_{X Y}} [(1 - \frac{1}{α}) D (R_{X Y} ∥ R_{X} R_{Y}) + \frac{1}{α} D (R_{X Y} ∥ {\tilde{P}}_{X Y}) + (1 - \frac{1}{α}) H_{α} (X, Y)] \end{matrix}

(328)

\begin{matrix} = min_{R_{X Y}} [(1 - \frac{1}{α}) D (R_{X Y} ∥ R_{X} R_{Y}) + (1 - \frac{1}{α}) H (R_{X Y}) + D (R_{X Y} ∥ P_{X Y})], \end{matrix}

(329)

where (327) follows from Proposition 7; (328) follows from Lemma 8; and (329) follows from a short computation. For every

R_{X Y} \in P (X \times Y)

, the expression in square brackets on the RHS of (329) is concave in

α

because the mapping

α \mapsto 1 - \frac{1}{α}

is concave on

(0, \infty)

and because

D (R_{X Y} ∥ R_{X} R_{Y})

and

H (R_{X Y})

are nonnegative. The pointwise minimum preserves the concavity, thus the LHS of (327) is concave in

α

and hence continuous in

α \in (0, \infty)

. This implies that

α \mapsto K_{α} (X; Y) + H_{α} (X, Y)

and hence

α \mapsto K_{α} (X; Y)

is continuous on

(0, 1) \cup (1, \infty)

.

We now establish continuity at

α = \infty

. Let

(\hat{x}, \hat{y})

be such that

P (\hat{x}, \hat{y}) = {max}_{x, y} P (x, y)

; define the PMFs

{\hat{Q}}_{X}

and

{\hat{Q}}_{Y}

as

{\hat{Q}}_{X} (x) ≜ 𝟙 {x = \hat{x}}

and

{\hat{Q}}_{Y} (y) ≜ 𝟙 {y = \hat{y}}

; and let

M_{β} (Q_{X}, Q_{Y})

be defined as in (285). Then, for all

α \in (1, \infty)

,

\begin{matrix} K_{\infty} (X; Y) + H_{\infty} (X, Y) & \leq K_{α} (X; Y) + H_{α} (X, Y) \end{matrix}

(330)

\begin{matrix} \leq - log M_{\frac{α - 1}{α}} ({\hat{Q}}_{X}, {\hat{Q}}_{Y}) \end{matrix}

(331)

\begin{matrix} = \frac{α}{α - 1} H_{\infty} (X, Y) \end{matrix}

(332)

\begin{matrix} = K_{\infty} (X; Y) + \frac{α}{α - 1} H_{\infty} (X, Y), \end{matrix}

(333)

where (330) holds because

K_{α} (X; Y) + H_{α} (X, Y)

is nonincreasing in

α

(Lemma 30); (331) follows from Lemma 24; (332) follows from the definitions of

M_{β} (Q_{X}, Q_{Y})

in (285) and

H_{\infty} (X, Y)

in (46); and (333) holds because

K_{\infty} (X; Y) = 0

(Lemma 28). Because

{lim}_{α \to \infty} \frac{α}{α - 1} = 1

, (330)–(333) and the sandwich theorem imply that

α \mapsto K_{α} (X; Y) + H_{α} (X, Y)

is continuous at

α = \infty

. This and the continuity of

α \mapsto H_{α} (X, Y)

at

α = \infty

(Proposition 3) establish the continuity of

α \mapsto K_{α} (X; Y)

at

α = \infty

.

It remains to show the continuity at

α = 1

. Let

α \in (\frac{4}{5}, 1) \cup (1, \frac{4}{3})

, and define

δ ≜ \frac{| α - 1 |}{α} \in (0, \frac{1}{4})

. (These definitions ensure that on the RHS of (340) ahead,

1 - 4 δ

will be positive.) Let

M_{β} (Q_{X}, Q_{Y})

be defined as in (285) and (286). Then, for all PMFs

Q_{X}

and

Q_{Y}

,

\begin{matrix} M_{\frac{α - 1}{α}} (Q_{X}, Q_{Y}) & \leq M_{δ} (Q_{X}, Q_{Y}) \end{matrix}

(334)

\begin{matrix} = {[\sum_{x, y} P (x, y) {[P_{X} (x) P_{Y} (y)]}^{δ} {[\frac{Q_{X} (x) Q_{Y} (y)}{P_{X} (x) P_{Y} (y)}]}^{δ}]}^{\frac{1}{δ}} \end{matrix}

(335)

\begin{matrix} \leq {[\sum_{x, y} P (x, y) {[P_{X} (x) P_{Y} (y)]}^{2 δ}]}^{\frac{1}{2 δ}} \cdot {[\sum_{x, y} P (x, y) {[\frac{Q_{X} (x) Q_{Y} (y)}{P_{X} (x) P_{Y} (y)}]}^{2 δ}]}^{\frac{1}{2 δ}} \end{matrix}

(336)

\begin{matrix} \leq {[\sum_{x, y} P (x, y) {[P_{X} (x) P_{Y} (y)]}^{2 δ}]}^{\frac{1}{2 δ}} \end{matrix}

(337)

\begin{matrix} = M_{2 δ} (P_{X}, P_{Y}), \end{matrix}

(338)

where (334) follows from the power mean inequality [30] (III 3.1.1 Theorem 1) because

\frac{α - 1}{α} \leq δ

; (336) follows from the Cauchy–Schwarz inequality; and (337) holds because

\begin{matrix} {[\sum_{x, y} P (x, y) {[\frac{Q_{X} (x)}{P_{X} (x)}]}^{2 δ} {[\frac{Q_{Y} (y)}{P_{Y} (y)}]}^{2 δ}]}^{\frac{1}{2 δ}} \\ \leq {[\sum_{x} P_{X} (x) {[\frac{Q_{X} (x)}{P_{X} (x)}]}^{4 δ}]}^{\frac{1}{4 δ}} \cdot {[\sum_{y} P_{Y} (y) {[\frac{Q_{Y} (y)}{P_{Y} (y)}]}^{4 δ}]}^{\frac{1}{4 δ}} \end{matrix}

(339)

\begin{matrix} = 2^{- D_{1 - 4 δ} (P_{X} ∥ Q_{X})} \cdot 2^{- D_{1 - 4 δ} (P_{Y} ∥ Q_{Y})} \end{matrix}

(340)

\begin{matrix} \leq 1, \end{matrix}

(341)

where (339) follows from the Cauchy–Schwarz inequality, and (341) holds because

1 - 4 δ > 0

and because the Rényi divergence is nonnegative for positive orders (Proposition 4). Thus, for all

α \in (\frac{4}{5}, \frac{4}{3})

,

\begin{matrix} - log M_{\frac{2 | α - 1 |}{α}} (P_{X}, P_{Y}) & \leq min_{Q_{X}, Q_{Y}} - log M_{\frac{α - 1}{α}} (Q_{X}, Q_{Y}) \end{matrix}

(342)

\begin{matrix} \leq - log M_{\frac{α - 1}{α}} (P_{X}, P_{Y}), \end{matrix}

(343)

where (342) follows from (338) if

α \neq 1

and from Proposition 8 and a simple computation if

α = 1

. By Lemma 24, this implies that for all

α \in (\frac{4}{5}, \frac{4}{3})

,

\begin{matrix} - log M_{\frac{2 | α - 1 |}{α}} (P_{X}, P_{Y}) & \leq K_{α} (X; Y) + H_{α} (X, Y) \end{matrix}

(344)

\begin{matrix} \leq - log M_{\frac{α - 1}{α}} (P_{X}, P_{Y}) . \end{matrix}

(345)

Because

β \mapsto M_{β} (P_{X}, P_{Y})

is continuous at

β = 0

[30] (III 1 Theorem 2(b)), (344)–(345) and the sandwich theorem imply that

α \mapsto K_{α} (X; Y) + H_{α} (X, Y)

is continuous at

α = 1

. This and the continuity of

α \mapsto H_{α} (X, Y)

at

α = 1

(Proposition 3) establish the continuity of

α \mapsto K_{α} (X; Y)

at

α = 1

. □

Lemma 32.

If

X = Y

with probability one, then

\begin{matrix} K_{α} (X; Y) = \{\begin{matrix} 2 H_{\frac{α}{2 - α}} (X) - H_{α} (X) & i f α \in [0, 2), \\ \frac{α}{α - 1} H_{\infty} (X) - H_{α} (X) & i f α \geq 2, \\ 0 & i f α = \infty . \end{matrix} \end{matrix}

(346)

Proof.

We first treat the cases

α = 0

,

α = 1

, and

α = \infty

. For

α = 0

, (346) holds because

\begin{matrix} K_{0} (X; Y) & = log \frac{| supp (P_{X} P_{Y}) |}{| supp (P_{X Y}) |} \end{matrix}

(347)

\begin{matrix} = log | supp (P_{X}) | \end{matrix}

(348)

\begin{matrix} = H_{0} (X), \end{matrix}

(349)

where (347) follows from Lemma 25, and (348) holds because the hypothesis

Pr [X = Y] = 1

implies that

| supp (P_{X} P_{Y}) | = {| supp (P_{X}) |}^{2}

and

| supp (P_{X Y}) | = | supp (P_{X}) |

. For

α = 1

, (346) holds because

K_{1} (X; Y) = I (X; Y)

(Lemma 26) and because

Pr [X = Y] = 1

implies that

I (X; Y) = H (X) = H_{1} (X)

. For

α = \infty

, (346) holds because

K_{\infty} (X; Y) = 0

(Lemma 28).

Now let

α \in (0, 1) \cup (1, \infty)

, and let

(\tilde{X}, \tilde{Y})

be distributed according to the joint PMF

\begin{matrix} {\tilde{P}}_{X Y} (x, y) & ≜ \frac{P_{X Y} {(x, y)}^{α}}{\sum_{(x^{'}, y^{'}) \in X \times Y} P_{X Y} {(x^{'}, y^{'})}^{α}} \end{matrix}

(350)

\begin{matrix} = \frac{P_{X} {(x)}^{α}}{\sum_{x^{'} \in X} P_{X} {(x^{'})}^{α}} 𝟙 {x = y}, \end{matrix}

(351)

where (351) holds because

P_{X Y} (x, y) = P_{X} (x) 𝟙 {x = y}

for all

x \in X

and all

y \in Y

. If

α < 2

, then (346) holds because

\begin{matrix} K_{α} (X; Y) & = J_{\frac{1}{α}} (\tilde{X}; \tilde{Y}) \end{matrix}

(352)

\begin{matrix} = H_{\frac{1}{2 - α}} (\tilde{X}) \end{matrix}

(353)

\begin{matrix} = \frac{2 - α}{1 - α} log \sum_{x} {[\frac{P_{X} {(x)}^{α}}{\sum_{x^{'} \in X} P_{X} {(x^{'})}^{α}}]}^{\frac{1}{2 - α}} \end{matrix}

(354)

\begin{matrix} = 2 H_{\frac{α}{2 - α}} (X) - H_{α} (X), \end{matrix}

(355)

where (352) follows from Proposition 7; (353) follows from Lemma 11 because

Pr [\tilde{X} = \tilde{Y}] = 1

and because

\frac{1}{α} > \frac{1}{2}

; and (355) follows from a simple computation. If

α \geq 2

, then (346) holds because

\begin{matrix} K_{α} (X; Y) & = J_{\frac{1}{α}} (\tilde{X}; \tilde{Y}) \end{matrix}

(356)

\begin{matrix} = \frac{1}{α - 1} H_{\infty} (\tilde{X}) \end{matrix}

(357)

\begin{matrix} = \frac{- 1}{α - 1} log max_{x} \frac{P_{X} {(x)}^{α}}{\sum_{x^{'} \in X} P_{X} {(x^{'})}^{α}} \end{matrix}

(358)

\begin{matrix} = \frac{α}{α - 1} H_{\infty} (X) - H_{α} (X), \end{matrix}

(359)

where (356) follows from Proposition 7; (357) follows from Lemma 11 because

Pr [\tilde{X} = \tilde{Y}] = 1

and because

\frac{1}{α} \leq \frac{1}{2}

; and (359) follows from a simple computation. □

Lemma 33.

For every

α \in (0, 2)

, the mapping

(Q_{X}, Q_{Y}) \mapsto Δ_{α} (P_{X Y} ∥ Q_{X} Q_{Y})

in the definition of

K_{α} (X; Y)

in (2) has a unique minimizer. This need not be the case when

α \in {0} \cup [2, \infty]

.

Proof.

Let

α \in (0, 2)

. By Proposition 7,

K_{α} (X; Y) = J_{1 / α} (\tilde{X}; \tilde{Y})

, where the pair

(\tilde{X}, \tilde{Y})

is distributed according to the joint PMF

{\tilde{P}}_{X Y}

defined in Proposition 7. The mapping

(Q_{X}, Q_{Y}) \mapsto D_{1 / α} ({\tilde{P}}_{X Y} ∥ Q_{X} Q_{Y})

in the definition of

J_{1 / α} (\tilde{X}; \tilde{Y})

has a unique minimizer by Lemma 20 because

\frac{1}{α} > \frac{1}{2}

. By Proposition 6, there is a bijection between the minimizers of

D_{1 / α} ({\tilde{P}}_{X Y} ∥ Q_{X} Q_{Y})

and

Δ_{α} (P_{X Y} ∥ Q_{X} Q_{Y})

, so the mapping

(Q_{X}, Q_{Y}) \mapsto Δ_{α} (P_{X Y} ∥ Q_{X} Q_{Y})

also has a unique minimizer.

We next show that for

α \in {0} \cup [2, \infty]

, the mapping

(Q_{X}, Q_{Y}) \mapsto Δ_{α} (P_{X Y} ∥ Q_{X} Q_{Y})

can have more than one minimizer. Let X be uniformly distributed over

{0, 1}

, and let

Y = X

. Then, by Lemma 32,

\begin{matrix} K_{α} (X; Y) = \{\begin{matrix} log 2 & if α = 0, \\ \frac{1}{α - 1} log 2 & if α \geq 2, \\ 0 & if α = \infty . \end{matrix} \end{matrix}

(360)

If

α = 0

, then it follows from the definition of

Δ_{0} (P ∥ Q)

in (56) that

Δ_{0} (P_{X Y} ∥ Q_{X} Q_{Y}) = log 2

whenever

supp (Q_{X}) = supp (Q_{Y}) = {0, 1}

, so the minimizer is not unique. Otherwise, if

α \in [2, \infty]

, it can be verified that

\begin{matrix} Δ_{α} (P_{X Y} ∥ (1, 0) (1, 0)) & = Δ_{α} (P_{X Y} ∥ (0, 1) (0, 1)) \end{matrix}

(361)

\begin{matrix} = \{\begin{matrix} \frac{1}{α - 1} log 2 & if α \geq 2, \\ 0 & if α = \infty, \end{matrix} \end{matrix}

(362)

so the minimizer is not unique in this case either. □

Lemma 34.

If the pairs

(X_{1}, Y_{1})

and

(X_{2}, Y_{2})

are independent, then

K_{α} (X_{1}, X_{2}; Y_{1}, Y_{2}) = K_{α} (X_{1}; Y_{1}) + K_{α} (X_{2}; Y_{2})

for all

α \in [0, \infty]

(additivity).

Proof.

We first treat the cases

α = 0

and

α = \infty

. For

α = 0

, the claim is true because

\begin{matrix} K_{0} (X_{1}, X_{2}; Y_{1}, Y_{2}) & = log \frac{| supp (P_{X_{1} X_{2}} P_{Y_{1} Y_{2}}) |}{| supp (P_{X_{1} X_{2} Y_{1} Y_{2}}) |} \end{matrix}

(363)

\begin{matrix} = log \frac{| supp (P_{X_{1}} P_{Y_{1}}) | \cdot | supp (P_{X_{2}} P_{Y_{2}}) |}{| supp (P_{X_{1} Y_{1}}) | \cdot | supp (P_{X_{2} Y_{2}}) |} \end{matrix}

(364)

\begin{matrix} = K_{0} (X_{1}; Y_{1}) + K_{0} (X_{2}; Y_{2}), \end{matrix}

(365)

where (363) and (365) follow from Lemma 25, and (364) follows from the independence hypothesis

P_{X_{1} X_{2} Y_{1} Y_{2}} = P_{X_{1} Y_{1}} P_{X_{2} Y_{2}}

. For

α = \infty

, the claim is true because

K_{\infty} (X; Y) = 0

(Lemma 28).

Now let

α \in (0, \infty)

, and let

({\tilde{X}}_{1}, {\tilde{X}}_{2}, {\tilde{Y}}_{1}, {\tilde{Y}}_{2})

be distributed according to the joint PMF

\begin{matrix} {\tilde{P}}_{X_{1} X_{2} Y_{1} Y_{2}} (x_{1}, x_{2}, y_{1}, y_{2}) & ≜ \frac{P_{X_{1} X_{2} Y_{1} Y_{2}} {(x_{1}, x_{2}, y_{1}, y_{2})}^{α}}{\sum_{x_{1}^{'}, x_{2}^{'}, y_{1}^{'}, y_{2}^{'}} P_{X_{1} X_{2} Y_{1} Y_{2}} {(x_{1}^{'}, x_{2}^{'}, y_{1}^{'}, y_{2}^{'})}^{α}} \end{matrix}

(366)

\begin{matrix} = \frac{P_{X_{1} Y_{1}} {(x_{1}, y_{1})}^{α}}{\sum_{x_{1}^{'}, y_{1}^{'}} P_{X_{1} Y_{1}} {(x_{1}^{'}, y_{1}^{'})}^{α}} \cdot \frac{P_{X_{2} Y_{2}} {(x_{2}, y_{2})}^{α}}{\sum_{x_{2}^{'}, y_{2}^{'}} P_{X_{2} Y_{2}} {(x_{2}^{'}, y_{2}^{'})}^{α}}, \end{matrix}

(367)

where (366) follows from the independence hypothesis

P_{X_{1} X_{2} Y_{1} Y_{2}} = P_{X_{1} Y_{1}} P_{X_{2} Y_{2}}

. Then,

\begin{matrix} K_{α} (X_{1}, X_{2}; Y_{1}, Y_{2}) & = J_{\frac{1}{α}} ({\tilde{X}}_{1}, {\tilde{X}}_{2}; {\tilde{Y}}_{1}, {\tilde{Y}}_{2}) \end{matrix}

(368)

\begin{matrix} = J_{\frac{1}{α}} ({\tilde{X}}_{1}; {\tilde{Y}}_{1}) + J_{\frac{1}{α}} ({\tilde{X}}_{2}; {\tilde{Y}}_{2}) \end{matrix}

(369)

\begin{matrix} = K_{α} (X_{1}; Y_{1}) + K_{α} (X_{2}; Y_{2}), \end{matrix}

(370)

where (368) and (370) follow from Proposition 7, and (369) follows from Lemma 12 because the pairs

({\tilde{X}}_{1}, {\tilde{Y}}_{1})

and

({\tilde{X}}_{2}, {\tilde{Y}}_{2})

are independent by (367). □

Lemma 35.

For all

α \in [0, \infty]

,

K_{α} (X; Y) \leq log | X |

.

Proof.

For

α = 0

, this is true because

\begin{matrix} K_{0} (X; Y) & = log \frac{| supp (P_{X} P_{Y}) |}{| supp (P_{X Y}) |} \end{matrix}

(371)

\begin{matrix} \leq log \frac{| X | \cdot | supp (P_{Y}) |}{| supp (P_{X Y}) |} \end{matrix}

(372)

\begin{matrix} \leq log | X |, \end{matrix}

(373)

where (371) follows from Lemma 25. For

α \in (0, \infty)

, the claim is true because

\begin{matrix} K_{α} (X; Y) & = J_{\frac{1}{α}} (\tilde{X}; \tilde{Y}) \end{matrix}

(374)

\begin{matrix} \leq log | X |, \end{matrix}

(375)

where (374) follows from Proposition 7, and (375) follows from Lemma 13. For

α = \infty

, the claim is true because

K_{\infty} (X; Y) = 0

(Lemma 28). □

Lemma 36.

There exists a Markov chain

X ⊸ - - Y ⊸ - - Z

for which

K_{2} (X; Z) > K_{2} (X; Y)

.

Proof.

Let the Markov chain

X ⊸ - - Y ⊸ - - Z

be given by

$P_{Z \| Y}$ $P_{X Y} (x, y)$	$y = 0$	$y = 1$
$x = 0$	$0.6$	0
$x = 1$	0	$0.4$

$P_{Z \| Y}$ $P_{Z \| Y} (z \| y)$	$z = 0$	$z = 1$
$y = 0$	$0.9$	$0.1$
$y = 1$	0	1

Using Lemma 27, we see that

K_{2} (X; Z) \approx 0.605

bits, which is larger than

K_{2} (X; Y) \approx 0.531

bits. □

Author Contributions

Writing–original draft preparation, A.L. and C.P.; writing–review and editing, A.L. and C.P.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef]
Tomamichel, M.; Hayashi, M. Operational interpretation of Rényi information measures via composite hypothesis testing against product and Markov distributions. IEEE Trans. Inf. Theory 2018, 64, 1064–1082. [Google Scholar] [CrossRef]
Sibson, R. Information radius. Z. Wahrscheinlichkeitstheorie verw. Geb. 1969, 14, 149–160. [Google Scholar] [CrossRef]
Arimoto, S. Information measures and capacity of order α for discrete memoryless channels. In Topics in Information Theory; Csiszár, I., Elias, P., Eds.; North-Holland Publishing Company: Amsterdam, The Netherlands, 1977; pp. 41–52. ISBN 0-7204-0699-4. [Google Scholar]
Csiszár, I. Generalized cutoff rates and Rényi’s information measures. IEEE Trans. Inf. Theory 1995, 41, 26–34. [Google Scholar] [CrossRef]
Fehr, S.; Berens, S. On the conditional Rényi entropy. IEEE Trans. Inf. Theory 2014, 60, 6801–6810. [Google Scholar] [CrossRef]
Sason, I.; Verdú, S. Arimoto–Rényi conditional entropy and Bayesian M-ary hypothesis testing. IEEE Trans. Inf. Theory 2018, 64, 4–25. [Google Scholar] [CrossRef]
Verdú, S. α-mutual information. In Proceedings of the 2015 Information Theory and Applications Workshop (ITA), San Diego, CA, USA, 1–6 February 2015; pp. 1–6. [Google Scholar] [CrossRef]
Tridenski, S.; Zamir, R.; Ingber, A. The Ziv–Zakai–Rényi bound for joint source-channel coding. IEEE Trans. Inf. Theory 2015, 61, 4293–4315. [Google Scholar] [CrossRef]
Aishwarya, G.; Madiman, M. Remarks on Rényi versions of conditional entropy and mutual information. In Proceedings of the 2019 IEEE International Symposium on Information Theory (ISIT), Paris, France, 7–12 July 2019; pp. 1117–1121. [Google Scholar]
Csiszár, I.; Shields, P.C. Information Theory and Statistics: A Tutorial; now Publishers: Hanover, MA, USA, 2004; ISBN 978-1-933019-05-5. [Google Scholar]
Liese, F.; Vajda, I. On divergences and informations in statistics and information theory. IEEE Trans. Inf. Theory 2006, 52, 4394–4412. [Google Scholar] [CrossRef]
Sason, I.; Verdú, S. f-divergence inequalities. IEEE Trans. Inf. Theory 2016, 62, 5973–6006. [Google Scholar] [CrossRef]
Jiao, J.; Han, Y.; Weissman, T. Dependence measures bounding the exploration bias for general measurements. In Proceedings of the 2017 IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017; pp. 1475–1479. [Google Scholar] [CrossRef]
Ziv, J.; Zakai, M. On functionals satisfying a data-processing theorem. IEEE Trans. Inf. Theory 1973, 19, 275–283. [Google Scholar] [CrossRef]
Lapidoth, A.; Pfister, C. Testing against independence and a Rényi information measure. In Proceedings of the 2018 IEEE Information Theory Workshop (ITW), Guangzhou, China, 25–29 November 2018; pp. 1–5. [Google Scholar] [CrossRef]
Han, T.S.; Kobayashi, K. The strong converse theorem for hypothesis testing. IEEE Trans. Inf. Theory 1989, 35, 178–180. [Google Scholar] [CrossRef]
Nakagawa, K.; Kanaya, F. On the converse theorem in statistical hypothesis testing. IEEE Trans. Inf. Theory 1993, 39, 623–628. [Google Scholar] [CrossRef]
Bunte, C.; Lapidoth, A. Encoding tasks and Rényi entropy. IEEE Trans. Inf. Theory 2014, 60, 5065–5076. [Google Scholar] [CrossRef]
Bracher, A.; Lapidoth, A.; Pfister, C. Distributed task encoding. In Proceedings of the 2017 IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017; pp. 1993–1997. [Google Scholar] [CrossRef]
Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 20 June–30 July 1960; Volume 1, pp. 547–561. [Google Scholar]
van Erven, T.; Harremoës, P. Rényi divergence and Kullback–Leibler divergence. IEEE Trans. Inf. Theory 2014, 60, 3797–3820. [Google Scholar] [CrossRef]
Sason, I.; Verdú, S. Improved bounds on lossless source coding and guessing moments via Rényi measures. IEEE Trans. Inf. Theory 2018, 64, 4323–4346. [Google Scholar] [CrossRef]
Ashok Kumar, M.; Sundaresan, R. Minimization problems based on relative α-entropy I: Forward projection. IEEE Trans. Inf. Theory 2015, 61, 5063–5080. [Google Scholar] [CrossRef]
Ashok Kumar, M.; Sundaresan, R. Minimization problems based on relative α-entropy II: Reverse projection. IEEE Trans. Inf. Theory 2015, 61, 5081–5095. [Google Scholar] [CrossRef]
Sundaresan, R. Guessing under source uncertainty. IEEE Trans. Inf. Theory 2007, 53, 269–287. [Google Scholar] [CrossRef]
Polyanskiy, Y.; Wu, Y. Lecture Notes on Information Theory. 2017. Available online: http://people.lids.mit.edu/yp/homepage/data/itlectures_v5.pdf (accessed on 18 August 2017).
Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2006; ISBN 978-0-471-24195-9. [Google Scholar]
Gallager, R.G. Information Theory and Reliable Communication; John Wiley & Sons: Hoboken, NJ, USA, 1968; ISBN 978-0-471-29048-3. [Google Scholar]
Bullen, P.S. Handbook of Means and Their Inequalities; Kluwer Academic Publishers: Dordrecht, The Netherlands, 2003; ISBN 978-1-4020-1522-9. [Google Scholar]
Horn, R.A.; Johnson, C.R. Matrix Analysis, 2nd ed.; Cambridge University Press: Cambridge, UK, 2013; ISBN 978-0-521-83940-2. [Google Scholar]
Fan, K. Minimax theorems. Proc. Natl. Acad. Sci. USA 1953, 39, 42–47. [Google Scholar] [CrossRef] [PubMed]
Borwein, J.M.; Zhuang, D. On Fan’s minimax theorem. Math. Program. 1986, 34, 232–234. [Google Scholar] [CrossRef]

Figure 1. (Left)

J_{α} (X; Y)

and

D_{α} (P_{X Y} ∥ P_{X} P_{Y})

versus

α

. (Right)

K_{α} (X; Y)

and

Δ_{α} (P_{X Y} ∥ P_{X} P_{Y})

versus

α

. In both plots, X is Bernoulli with

Pr (X = 1) = 0.2

, and Y is equal to X.

Figure 1. (Left)

J_{α} (X; Y)

and

D_{α} (P_{X Y} ∥ P_{X} P_{Y})

versus

α

. (Right)

K_{α} (X; Y)

and

Δ_{α} (P_{X Y} ∥ P_{X} P_{Y})

versus

α

. In both plots, X is Bernoulli with

Pr (X = 1) = 0.2

, and Y is equal to X.

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Lapidoth, A.; Pfister, C. Two Measures of Dependence. Entropy 2019, 21, 778. https://doi.org/10.3390/e21080778

AMA Style

Lapidoth A, Pfister C. Two Measures of Dependence. Entropy. 2019; 21(8):778. https://doi.org/10.3390/e21080778

Chicago/Turabian Style

Lapidoth, Amos, and Christoph Pfister. 2019. "Two Measures of Dependence" Entropy 21, no. 8: 778. https://doi.org/10.3390/e21080778

APA Style

Lapidoth, A., & Pfister, C. (2019). Two Measures of Dependence. Entropy, 21(8), 778. https://doi.org/10.3390/e21080778

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Two Measures of Dependence

Abstract

1. Introduction

2. Related Work

3. Operational Meanings

3.1. Testing Against Independence and $J_{α} (X; Y)$

3.2. Distributed Task Encoding and $K_{α} (X; Y)$

4. Preliminaries

5. Two Measures of Dependence

6. Proofs

Author Contributions

Funding

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

Article Menu

Two Measures of Dependence

Abstract

1. Introduction

2. Related Work

3. Operational Meanings

3.1. Testing Against Independence and J α ( X ; Y )

3.2. Distributed Task Encoding and K α ( X ; Y )

4. Preliminaries

5. Two Measures of Dependence

6. Proofs

Author Contributions

Funding

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

3.1. Testing Against Independence and $J_{α} (X; Y)$

3.2. Distributed Task Encoding and $K_{α} (X; Y)$