Abstract
Motivated by a horse betting problem, a new conditional Rényi divergence is introduced. It is compared with the conditional Rényi divergences that appear in the definitions of the dependence measures by Csiszár and Sibson, and the properties of all three are studied with emphasis on their behavior under data processing. In the same way that Csiszár’s and Sibson’s conditional divergences lead to the respective dependence measures, so does the new conditional divergence lead to the Lapidoth–Pfister mutual information. Moreover, the new conditional divergence is also related to the Arimoto–Rényi conditional entropy and to Arimoto’s measure of dependence. In the second part of the paper, the horse betting problem is analyzed where, instead of Kelly’s expected log-wealth criterion, a more general family of power-mean utility functions is considered. The key role in the analysis is played by the Rényi divergence, and in the setting where the gambler has access to side information, by the new conditional Rényi divergence. The setting with side information also provides another operational meaning to the Lapidoth–Pfister mutual information. Finally, a universal strategy for independent and identically distributed races is presented that—without knowing the winning probabilities or the parameter of the utility function—asymptotically maximizes the gambler’s utility function.
1. Introduction
As shown by Kelly [1,2], many of Shannon’s information measures appear naturally in the context of horse gambling when the gambler’s utility function is expected log-wealth. Here, we show that under a more general family of utility functions, gambling also provides a context for some of Rényi’s information measures. Moreover, the setting where the gambler has side information motivates a new Rényi-like conditional divergence, which we study and compare to other conditional divergences. The proposed family of utility functions in the context of gambling with side information also provides another operational meaning to the Rényi-like mutual information that was recently proposed by Lapidoth and Pfister [3]: it quantifies the gambler’s gain from the side information as the increase in the minimax value of the two-player zero-sum game in which the bookmaker picks the odds and the gambler then places her bets based on these odds and her side information.
Deferring the gambling-based motivation to the second part of the paper, we first describe the different conditional divergences and study some of their properties with emphasis on their behavior under data processing. We also show that the new conditional Rényi divergence relates to the Lapidoth–Pfister mutual information in much the same way that Csiszár’s and Sibson’s conditional divergences relate to their corresponding mutual informations. Before discussing the conditional divergences, we first recall other information measures.
The Kullback–Leibler divergence (or relative entropy) is an important concept in information theory and statistics [2,4,5,6]. It is defined between two probability mass functions (PMFs) P and Q over a finite set $\mathcal{X}$ as
$$D(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}, \qquad (1)$$
where $\log$ denotes the base-2 logarithm. Defining a conditional Kullback–Leibler divergence is straightforward because, as simple algebra shows, the two natural approaches lead to the same result:
$$D(P_{Y|X} \| Q_{Y|X} \,|\, P_X) = \sum_{x \in \mathrm{supp}(P_X)} P_X(x)\, D\bigl(P_{Y|X=x} \,\big\|\, Q_{Y|X=x}\bigr) \qquad (2)$$
$$= D\bigl(P_X P_{Y|X} \,\big\|\, P_X Q_{Y|X}\bigr), \qquad (3)$$
where $\mathrm{supp}(P)$ denotes the support of P, and $P_X P_{Y|X}$ in (3) and throughout denotes the PMF on $\mathcal{X} \times \mathcal{Y}$ that assigns the probability $P_X(x) P_{Y|X}(y|x)$ to the pair $(x, y)$.
The Rényi divergence of order $\alpha$ [7,8] between two PMFs P and Q is defined for all positive $\alpha$’s other than one as
$$D_\alpha(P \| Q) = \frac{1}{\alpha - 1} \log \sum_{x \in \mathcal{X}} P(x)^\alpha\, Q(x)^{1-\alpha}. \qquad (4)$$
A conditional Rényi divergence can be defined in more than one way. In this paper, we consider the following three definitions, two classic and one new:
where (5) is inspired by Csiszár [9]; (6) is inspired by Sibson [10]; and (7) is motivated by the horse betting problem discussed in Section 9. The first two conditional Rényi divergences were used to define the Rényi measures of dependence of Csiszár [9] and of Sibson [10]:
where the minimization is over all PMFs on the set . (Gallager’s function [11] and are in one-to-one correspondence; see (65) below.) The analogous minimization of leads to the Lapidoth–Pfister mutual information [3]:
where (11) is proved in Proposition 5.
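For concreteness, the following Python sketch evaluates the Rényi divergence and the three conditional versions on a small example. The explicit formulas used below (a per-x average of Rényi divergences for (5), a single Rényi divergence between the joint PMFs $P_X P_{Y|X}$ and $P_X Q_{Y|X}$ for (6), and an exponentially weighted average for (7)) reflect our reading of these definitions; the function names and the example PMFs are ours and not from the paper.

```python
import numpy as np

def renyi_div(P, Q, alpha):
    """Rényi divergence D_alpha(P||Q) in bits, for full-support PMFs and alpha > 0, alpha != 1."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    return np.log2(np.sum(P**alpha * Q**(1.0 - alpha))) / (alpha - 1.0)

def cond_div_csiszar(PX, PYgX, QYgX, alpha):
    """Average over x of D_alpha(P_{Y|X=x} || Q_{Y|X=x}); our reading of (5)."""
    return sum(PX[i] * renyi_div(PYgX[i], QYgX[i], alpha) for i in range(len(PX)))

def cond_div_sibson(PX, PYgX, QYgX, alpha):
    """D_alpha(P_X P_{Y|X} || P_X Q_{Y|X}); our reading of (6)."""
    return renyi_div((PX[:, None] * PYgX).ravel(), (PX[:, None] * QYgX).ravel(), alpha)

def cond_div_new(PX, PYgX, QYgX, alpha):
    """(alpha/(alpha-1)) log2 sum_x P_X(x) (sum_y P(y|x)^alpha Q(y|x)^(1-alpha))^(1/alpha);
    our reading of (7)."""
    inner = np.sum(PYgX**alpha * QYgX**(1.0 - alpha), axis=1) ** (1.0 / alpha)
    return alpha / (alpha - 1.0) * np.log2(np.dot(PX, inner))

PX = np.array([0.6, 0.4])
PYgX = np.array([[0.9, 0.1], [0.2, 0.8]])   # rows: P_{Y|X=x}
QYgX = np.array([[0.5, 0.5], [0.5, 0.5]])   # rows: Q_{Y|X=x}
for a in (0.5, 2.0):
    print(a, cond_div_csiszar(PX, PYgX, QYgX, a),
             cond_div_sibson(PX, PYgX, QYgX, a),
             cond_div_new(PX, PYgX, QYgX, a))
```

In this example the three values differ; in accordance with Remark 1 below, they coincide when neither conditional PMF depends on x.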
The first part of the paper is structured as follows: In Section 2, we discuss some preliminaries. In Section 3, Section 4 and Section 5, we study the properties of the three conditional Rényi divergences and their associated measures of dependence. In Section 6, we express the Arimoto–Rényi conditional entropy and the Arimoto measure of dependence [12] in terms of . In Section 7, we relate the conditional Rényi divergences to each other and discuss the relations between the Rényi dependence measures.
The second part of the paper deals with horse gambling under our proposed family of power-mean utility functions. It is in this context that the Rényi divergence (Theorem 9) and the conditional Rényi divergence (Theorem 10) appear naturally.
More specifically, consider a horse race with a finite nonempty set of horses $\mathcal{X}$, where a bookmaker offers odds $o(x)$-for-1 on each horse $x \in \mathcal{X}$, where $o(x) > 0$ [2] (Section 6.1). A gambler spends all her wealth placing bets on the horses. The fraction of her wealth that she bets on Horse $x$ is denoted $b(x)$, which sums to one over $x \in \mathcal{X}$, and the PMF b is her “betting strategy.” The winning horse, which we denote X, is drawn according to the PMF p, where we assume $p(x) > 0$ for all $x \in \mathcal{X}$. The wealth relative (or end-to-beginning wealth ratio) is the random variable
$$S = b(X)\, o(X).$$
Hence, given an initial wealth $w_0$, the gambler’s wealth after the race is $w_0 S$. We seek betting strategies that maximize the utility function
$$\frac{1}{\beta} \log \mathbb{E}\bigl[S^\beta\bigr],$$
where $\beta$ is a parameter that accounts for the risk sensitivity. This optimization generalizes the following cases:
- (a)
- In the limit as tends to , we optimize the worst-case return. The optimal strategy is risk-free in the sense that S does not depend on the winning horse (see Proposition 8).
- (b)
- If , then we optimize , which is known as the doubling rate [2] (Section 6.1). The optimal strategy is proportional betting, i.e., to choose (see Remark 4).
- (c)
- If , then we optimize , the expected return. The optimal strategy is to put all the money on a horse that maximizes (see Proposition 9).
- (d)
- In general, if , then it is optimal to put all the money on one horse (see Proposition 9). This is risky: if that horse loses, the gambler will go broke.
- (e)
- In the limit as tends to , we optimize the best-case return. The optimal strategy is to put all the money on a horse that maximizes (see Proposition 10).
Note that, for and , maximizing is equivalent to maximizing
which is known in the finance literature as Constant Relative Risk Aversion (CRRA) [13,14].
We refer to our utility function as “power mean” because it can be written as the logarithm of a weighted power mean [15,16]:
$$\frac{1}{\beta} \log \mathbb{E}\bigl[S^\beta\bigr] = \log \Bigl[ \Bigl( \sum_{x \in \mathcal{X}} p(x)\, \bigl(b(x)\, o(x)\bigr)^{\beta} \Bigr)^{1/\beta} \Bigr].$$
Because the power mean tends to the geometric mean as $\beta$ tends to zero [15] (Problem 8.1), the utility is continuous at $\beta = 0$:
$$\lim_{\beta \to 0} \frac{1}{\beta} \log \mathbb{E}\bigl[S^\beta\bigr] = \mathbb{E}\bigl[\log S\bigr] = \sum_{x \in \mathcal{X}} p(x) \log\bigl(b(x)\, o(x)\bigr).$$
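A minimal numerical sketch of this utility, with an illustrative three-horse example of our own (the function name and the numbers are ours), shows the power-mean behavior and the continuity at β = 0:

```python
import numpy as np

def utility(b, p, o, beta):
    """(1/beta) * log2 E[(b(X)o(X))^beta]; for beta == 0, the doubling rate E[log2 b(X)o(X)]."""
    b, p, o = map(np.asarray, (b, p, o))
    if beta == 0.0:
        return float(np.dot(p, np.log2(b * o)))
    return float(np.log2(np.dot(p, (b * o) ** beta)) / beta)

p = np.array([0.5, 0.3, 0.2])      # winning probabilities (assumed example)
o = np.array([2.0, 4.0, 5.0])      # o(x)-for-1 odds (assumed example)
b = p                              # proportional betting
for beta in (-4.0, -1.0, -1e-4, 0.0, 1e-4, 0.5, 1.0):
    print(f"beta={beta:+.4f}  utility={utility(b, p, o, beta):.6f}")
# The values at beta = -1e-4 and +1e-4 are close to the value at beta = 0,
# illustrating the continuity of the utility in beta.
```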
Campbell [17,18] used an exponential cost function with a similar structure to (15) to provide an operational meaning to the Rényi entropy in source coding. Other information-theoretic applications of exponential moments were studied in [19].
The second part of the paper is structured as follows: In Section 8, we relate the utility function to the Rényi divergence (Theorem 9) and derive its optimal gambling strategy. In Section 9, we consider the situation where the gambler observes side information prior to betting, a situation that leads to the conditional Rényi divergence (Theorem 10) and to a new operational meaning for the measure of dependence (Theorem 11). In Section 10, we consider the situation where the gambler invests only part of her money. In Section 11, we present a universal strategy for independent and identically distributed (IID) races that requires neither knowledge of the winning probabilities nor of the parameter of the utility function and yet asymptotically maximizes the utility function for all PMFs p and all .
2. Preliminaries
Throughout the paper, denotes the base-2 logarithm, and are finite sets, denotes a joint PMF over , denotes a PMF over , and denotes a PMF over . An expression of the form denotes the PMF on that assigns the probability . We use P and Q as generic PMFs over a finite set . We denote by the support of P, and by the set of all PMFs over . When clear from the context, we often omit sets and subscripts: for example, we write for , for , for , and for . When is 0, we define the conditional probability as . The conditional distribution of Y given is denoted by , thus
We denote by the indicator function that is one if the condition is satisfied and zero otherwise.
In the definition of the Kullback–Leibler divergence in (1), we use the conventions
In the definition of the Rényi divergence in (4), we read as for and use the conventions
The Rényi divergence for negative is defined as
(We use negative in the proof of Proposition 1 (e) below and in Remark 6. More about negative orders can be found in [8] (Section V). For other applications of negative orders, see [20] (Proof of Theorem 1 and Example 1).)
The Rényi divergence satisfies the following basic properties:
Proposition 1.
Let P and Q be PMFs. Then, the Rényi divergence $D_\alpha(P \| Q)$ satisfies the following:
- (a)
- For all,. If, thenif and only if.
- (b)
- For all,is finite if and only if. For all,is finite if and only if.
- (c)
- The mappingis continuous on.
- (d)
- The mappingis nondecreasing on.
- (e)
- The mappingis nonincreasing on.
- (f)
- The mappingis concave on.
- (g)
- The mappingis concave on.
- (h)
- (Data-processing inequality.) Letbe a conditional PMF, and define the PMFsThen, for all,
Proof.
See Appendix A. □
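Several of these properties are easy to spot-check numerically. The following sketch, with randomly drawn PMFs and an arbitrary channel of our own choosing, verifies the monotonicity in α of Part (d) and the data-processing inequality of Part (h):

```python
import numpy as np

def renyi_div(P, Q, alpha):
    """Rényi divergence in bits for full-support PMFs, alpha > 0, alpha != 1."""
    return np.log2(np.sum(P**alpha * Q**(1.0 - alpha))) / (alpha - 1.0)

rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(4))
Q = rng.dirichlet(np.ones(4))
A = rng.dirichlet(np.ones(3), size=4)        # conditional PMF A(y|x), rows indexed by x

alphas = [0.25, 0.5, 2.0, 4.0]
divs = [renyi_div(P, Q, a) for a in alphas]
print("nondecreasing in alpha:", all(d1 <= d2 + 1e-12 for d1, d2 in zip(divs, divs[1:])))

PA, QA = P @ A, Q @ A                        # output PMFs after processing through A
print("data processing:",
      all(renyi_div(PA, QA, a) <= renyi_div(P, Q, a) + 1e-12 for a in alphas))
```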
All three conditional Rényi divergences reduce to the unconditional Rényi divergence when both and are independent of X:
Remark 1.
Let,, andbe PMFs. Then, for all,
3. Csiszár’s Conditional Rényi Divergence
For a PMF and conditional PMFs and , Csiszár’s conditional Rényi divergence is defined for every as
For ,
which follows from the definition of the Rényi divergence in (4). For being zero, one, or infinity, we obtain from (21)–(23) and (2)
Augustin [21] and later Csiszár [9] defined the measure of dependence
Augustin used this measure to study the error exponents for channel coding with input constraints, while Csiszár used it to study generalized cutoff rates for channel coding with composition constraints. Nakiboğlu [22] studied more properties of . Inter alia, he analyzed the minimax properties of the Augustin capacity
where is a constraint set. The Augustin capacity is used in [23] to establish the sphere packing bound for memoryless channels with cost constraints.
The rest of the section presents some properties of . Being an average of Rényi divergences (see (29)), inherits many properties from the Rényi divergence:
Proposition 2.
Let $P_X$ be a PMF, and let $P_{Y|X}$ and $Q_{Y|X}$ be conditional PMFs. Then,
- (a)
- For all,. If, thenif and only iffor all.
- (b)
- For all,is finite if and only iffor all. For all,is finite if and only iffor all.
- (c)
- The mappingis continuous on.
- (d)
- The mappingis nondecreasing on.
- (e)
- The mappingis nonincreasing on.
- (f)
- The mappingis concave on.
- (g)
- The mappingis concave on.
Proof.
These follow from (29) and the properties of the Rényi divergence (Proposition 1). For Parts (f) and (g), recall that a nonnegative weighted sum of concave functions is concave. □
We next consider data-processing inequalities for . We distinguish between processing Y and processing X. The data-processing inequality for processing Y follows from the data-processing inequality for the (unconditional) Rényi divergence:
Theorem 1.
Letbe a PMF, and letandbe conditional PMFs. For a conditional PMF, define
Then, for all,
Proof.
See Appendix B. □
The following data-processing inequality for processing X holds for (as shown in Example 1 below, it does not extend to ):
Theorem 2.
Letbe a PMF, and letandbe conditional PMFs. For a conditional PMF, define the PMFs
Then, for all,
Note that , , and in Theorem 2 can be obtained from the following marginalizations:
Proof of Theorem 2.
See Appendix C. □
As a special case of Theorem 2, we obtain the following relation between the conditional and the unconditional Rényi divergence:
Corollary 1.
For a PMFand conditional PMFsand, define the marginal PMFs
Then, for all,
Proof.
See Appendix D. □
Consider next . It turns out that Corollary 1, and hence Theorem 2, cannot be extended to these values of (not even if is restricted to be independent of X, i.e., if ):
Example 1.
Let. For, define the PMFs,, andas
Then, for every, there exists ansuch that
where the PMF is defined by (46) and, irrespective of ϵ, satisfies .
Proof.
See Appendix E. □
4. Sibson’s Conditional Rényi Divergence
For a PMF and conditional PMFs and , Sibson’s conditional Rényi divergence is defined for every as
For ,
where (55) and (56) follow from the definition of the Rényi divergence in (4). For being zero, one, or infinity, we obtain from (21)–(23) and (3)
Sibson [10] defined the measure of dependence
This minimum can be computed explicitly [10] (Corollary 2.3): For ,
and for being one or infinity,
where denotes Shannon’s mutual information.
The concavity and convexity properties of and were studied by Ho–Verdú [24]. More properties of were collected by Verdú [25]. The maximization of with respect to and the minimax properties of were studied by Nakiboğlu [26] and Cai–Verdú [27].
The conditional Rényi divergence was used by Fong and Tan [28] to establish strong converse theorems for multicast networks. Yu and Tan [29] analyzed channel resolvability, among other measures, in terms of .
From (61) we see that Gallager’s function [11], which is defined as
is in one-to-one correspondence to Sibson’s measure of dependence:
Gallager’s function is important in channel coding: it appears in the random coding exponent [30] and in the sphere packing exponent [31,32] (see also Gallager [11]). The exponential strong converse theorem proved by Arimoto [33] also uses the function. Polyanskiy and Verdú [34] extended the exponential strong converse theorem to channels with feedback. Augustin [21] and Nakiboğlu [35,36] extended the sphere packing bound to channels with feedback.
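As a numerical sanity check of the correspondence between Gallager’s function and Sibson’s measure, the following sketch compares the two quantities. We use the standard relation $E_0(\rho, P_X) = \rho\, I_{1/(1+\rho)}(X;Y)$, which we take to be the content of (65), and the closed form $\frac{\alpha}{\alpha-1}\log_2 \sum_y \bigl(\sum_x P_X(x) P_{Y|X}(y|x)^\alpha\bigr)^{1/\alpha}$, which we take to be the explicit expression referred to above; the example channel and the function names are ours.

```python
import numpy as np

def sibson_I(PX, W, alpha):
    """Closed form (alpha/(alpha-1)) * log2 sum_y ( sum_x P_X(x) W(y|x)^alpha )^(1/alpha)."""
    inner = np.dot(PX, W**alpha) ** (1.0 / alpha)        # one term per output y
    return alpha / (alpha - 1.0) * np.log2(np.sum(inner))

def gallager_E0(rho, PX, W):
    """Gallager's E_0(rho, P_X) = -log2 sum_y ( sum_x P_X(x) W(y|x)^(1/(1+rho)) )^(1+rho)."""
    inner = np.dot(PX, W**(1.0 / (1.0 + rho))) ** (1.0 + rho)
    return -np.log2(np.sum(inner))

PX = np.array([0.5, 0.5])
W = np.array([[0.9, 0.1], [0.3, 0.7]])   # rows: P_{Y|X=x} (binary channel, assumed example)
for rho in (0.25, 0.5, 1.0, 2.0):
    alpha = 1.0 / (1.0 + rho)
    print(rho, gallager_E0(rho, PX, W), rho * sibson_I(PX, W, alpha))  # the two columns agree
```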
The rest of the section presents some properties of . Because can be written as an (unconditional) Rényi divergence (see (54)), it inherits many properties from the Rényi divergence:
Proposition 3.
Let $P_X$ be a PMF, and let $P_{Y|X}$ and $Q_{Y|X}$ be conditional PMFs. Then,
- (a)
- For all,. If, thenif and only iffor all.
- (b)
- For all,is finite if and only if (there exists ansuch that. For all,is finite if and only iffor all.
- (c)
- The mappingis continuous on.
- (d)
- The mappingis nondecreasing on.
- (e)
- The mappingis nonincreasing on.
- (f)
- The mappingis concave on.
- (g)
- The mappingis concave on.
Proof.
These follow from (54) and the properties of the Rényi divergence (Proposition 1). □
We next consider data-processing inequalities for . We distinguish between processing Y and processing X. The data-processing inequality for processing Y follows from the data-processing inequality for the (unconditional) Rényi divergence:
Theorem 3.
Letbe a PMF, and letandbe conditional PMFs. For a conditional PMF, define
Then, for all,
Proof.
See Appendix F. □
The data-processing inequality for processing X similarly follows from the data-processing inequality for the (unconditional) Rényi divergence:
Theorem 4.
Letbe a PMF, and letandbe conditional PMFs. For a conditional PMF, define the PMFs
Then, for all,
Proof.
See Appendix G. □
As a special case of Theorem 4, we obtain the following relation between the conditional and the unconditional Rényi divergence:
Corollary 2.
Letbe a PMF, and letandbe conditional PMFs. Define the marginal PMFs
Then, for all,
Proof.
This follows from Theorem 4 in the same way that Corollary 1 followed from Theorem 2. □
5. New Conditional Rényi Divergence
Let be a PMF, and let and be conditional PMFs. For , define
where (78) follows from the definition of the Rényi divergence in (4). (Except for the sign, the exponential averaging in (77) is very similar to the one of the Arimoto–Rényi conditional entropy; compare with (147) below.) For being zero, one, or infinity, we define by continuous extension of (77)
This conditional Rényi divergence has an operational meaning in horse betting with side information (see Theorem 10 below). Before discussing the measure of dependence associated with , we establish the following alternative characterization of :
Proposition 4.
Let $P_X$ be a PMF, and let $P_{Y|X}$ and $Q_{Y|X}$ be conditional PMFs. Then, for all ,
Proof.
We first treat the case . Some algebra reveals that, for every PMF ,
where the PMF is defined as
The right-hand side (RHS) of (82) is thus equal to the minimum over of the RHS of (83). Since with equality if (Proposition 1 (a)), this minimum is equal to the second term on the RHS of (83), which, by (78), equals .
For and , (82) follows from the same argument using that, for every PMF ,
where the PMF is defined as
Tomamichel and Hayashi [37] and Lapidoth and Pfister [3] independently introduced and studied the dependence measure
(For some measure-theoretic properties of , see Aishwarya–Madiman [38].) The measure can be related to the error exponents in a hypothesis testing problem where the samples are either from a known joint distribution or an unknown product distribution (see [37] (Equation (57)) and [39]). It also appears in horse betting with side information (see Theorem 11 below).
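The double minimization defining this measure can be carried out numerically. The sketch below alternates between the two marginals: for a fixed $Q_Y$, the minimizing $Q_X$ is proportional to $\bigl(\sum_y P_{XY}(x,y)^\alpha Q_Y(y)^{1-\alpha}\bigr)^{1/\alpha}$ (and symmetrically for $Q_Y$), which follows from a standard Lagrangian argument. This is our own illustrative implementation, not an algorithm from the paper; if stopped early, alternating minimization only yields an upper bound on the measure.

```python
import numpy as np

def renyi_div(P, Q, alpha):
    return np.log2(np.sum(P**alpha * Q**(1.0 - alpha))) / (alpha - 1.0)

def lp_mutual_info(PXY, alpha, iters=200):
    """Numerical value of min_{Q_X,Q_Y} D_alpha(P_XY || Q_X x Q_Y) by alternating minimization."""
    PXY = np.asarray(PXY, float)
    QX = PXY.sum(axis=1)            # initialize with the marginals
    QY = PXY.sum(axis=0)
    for _ in range(iters):
        a = np.sum(PXY**alpha * QY[None, :]**(1.0 - alpha), axis=1) ** (1.0 / alpha)
        QX = a / a.sum()            # optimal Q_X for the current Q_Y
        b = np.sum(PXY**alpha * QX[:, None]**(1.0 - alpha), axis=0) ** (1.0 / alpha)
        QY = b / b.sum()            # optimal Q_Y for the current Q_X
    return renyi_div(PXY.ravel(), np.outer(QX, QY).ravel(), alpha)

PXY = np.array([[0.4, 0.1], [0.1, 0.4]])   # assumed example joint PMF
for a in (0.5, 2.0, 4.0):
    print(a, lp_mutual_info(PXY, a))
```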
Similar to in (34) and in (60), the measure can be expressed as a minimization involving the new conditional Rényi divergence:
Proposition 5.
Letbe a joint PMF. Denote its marginal PMFs byandand its conditional PMFs byand, so. Then, for all,
Proof.
The rest of the section presents some properties of .
Proposition 6.
Let $P_X$ be a PMF, and let $P_{Y|X}$ and $Q_{Y|X}$ be conditional PMFs. Then,
- (a)
- For all,. If, thenif and only iffor all.
- (b)
- For all,is finite if and only if (there exists ansuch that. For all,is finite if and only iffor all.
- (c)
- The mappingis continuous on.
- (d)
- The mappingis nondecreasing on.
- (e)
- The mappingis nonincreasing on.
- (f)
- The mapping is concave on .
- (g)
- The mappingis concave on.
Proof.
We prove these properties as follows:
- (a)
- For all , Proposition 4 impliesThe nonnegativity of now follows from the nonnegativity of the Rényi divergence (Proposition 1 (a)). If for all , then . Hence, using on the RHS of (99), equals zero. Conversely, if and , then for some by Proposition 1 (a), which implies for all .
- (b)
- (c)
- For , is continuous because it is, by its definition in (77), a composition of continuous functions. The continuity at follows from a careful application of L’Hôpital’s rule.We next consider the continuity at . Define . Then, for all ,where (100) follows from the definition in (77). On the other hand, for all ,Because , it follows from (103) and (106) and the sandwich theorem thatwhere (108) follows from the continuity of the Rényi divergence (Proposition 1 (c)) and the definition of in (21).We conclude with the continuity at . Observe thatwhere (109) follows from the definition in (77), and (111) follows from the continuity of the Rényi divergence (Proposition 1 (c)) and the definition of in (23).
- (d)
- For all , Proposition 4 impliesBecause is nonincreasing on (Proposition 1 (d)) and because the pointwise minimum preserves the monotonicity, the mapping is nonincreasing on .
- (e)
- By Proposition 4,By the nonnegativity of the Rényi divergence (Proposition 1 (a)), the RHS of (113) is nonnegative for and nonpositive for . Hence, it suffices to show separately that the mapping is nonincreasing on and on . This is indeed the case: the mapping on the RHS of (113) is nonincreasing on (Proposition 1 (e)), and the monotonicity is preserved by the pointwise minimum and maximum, respectively.
- (f)
- For , Proposition 4 implies thatBecause is concave on (Proposition 1 (f)) and because the pointwise minimum preserves the concavity, the mapping is concave on .
- (g)
- This follows from Proposition 1 (g) in the same way that Part (f) followed from Proposition 1 (f). □
We next consider data-processing inequalities for . We distinguish between processing Y and processing X. The data-processing inequality for processing Y follows from the data-processing inequality for the (unconditional) Rényi divergence:
Theorem 5.
Let be a PMF, and let and be conditional PMFs. For a conditional PMF , define
Then, for all,
Proof.
We prove (117) for ; the claim will then extend to by the continuity of in (Proposition 6 (c)). For every , we can apply Proposition 1 (h) with the substitution of for to obtain
Processing X is different. Consider first that does not depend on X. Then, writing , we have the following result (which, as shown in Example 2 below, does not extend to general ):
Theorem 6.
Letandbe PMFs, and letbe a conditional PMF. For a conditional PMF, define the PMFs
Then, for all,
Once we provide the operational meaning of in horse betting with side information (Theorem 10 below), Theorem 6 will become very intuitive: it expresses the fact that preprocessing the side information cannot increase the gambler’s utility; see Remark 8. Note that and in Theorem 6 can be obtained from the following marginalization:
Proof of Theorem 6.
We show (122) for ; the claim will then extend to by the continuity of in (Proposition 6 (c)). Consider first . Then, (122) holds because
where (124) follows from (78); (125) follows from (121); (126) follows from (120); (127) follows from the Minkowski inequality [16] (III 2.4 Theorem 9); (129) holds because and imply , hence the first expression in square brackets on the left-hand side (LHS) of (129) equals one; and (130) follows from (78).
As a special case of Theorem 6, we obtain the following relation between the conditional and the unconditional Rényi divergence:
Corollary 3.
Letandbe PMFs, and letbe a conditional PMF. Define the marginal PMF
Then, for all,
Proof.
This follows from Theorem 6 in the same way that Corollary 1 followed from Theorem 2. □
Consider next that does depend on X. It turns out that Corollary 3, and hence Theorem 6, cannot be extended to this setting:
Example 2.
Letand. Define the PMFs,, andas
Then, forand for,
where the PMFsandare given by
Proof.
Numerically, bits, which is larger than bits. Similarly, bits, which is larger than bits. □
6. Relation to Arimoto’s Measures
Before discussing Arimoto’s measures, we first recall the definition of the Rényi entropy. The Rényi entropy of order [7] is defined for all positive ’s other than one as
For being zero, one, or infinity, we define by continuous extension of (141)
where denotes Shannon’s entropy. The Rényi entropy can be related to the Rényi divergence as follows:
where denotes the uniform distribution over .
There are different ways to define a conditional Rényi entropy [40]; we use Arimoto’s proposal. The Arimoto–Rényi conditional entropy of order [12,38,40,41] is defined for positive other than one as
where (147) follows from the definition of the Rényi entropy in (141). The Arimoto–Rényi conditional entropy plays a key role in guessing with side information [20,42,43,44] and in task encoding with side information [45]; and it can be related to hypothesis testing [41]. For being zero, one, or infinity, we define by continuous extension of (146)
where denotes Shannon’s conditional entropy. The analog of (145) for is:
Remark 2.
For all,
Proof.
Equation (151) follows, using some algebra, from the definition of in (78)–(81); and (152) follows from Proposition 4. (The characterization in (152) previously appeared as [40] (Theorem 4).) □
Arimoto [12] also defined the following measure of dependence:
where (154) follows from (141) and (146). Using Remark 2, we can express in terms of :
Remark 3.
For all,
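The following sketch evaluates the Arimoto–Rényi conditional entropy from its defining formula and cross-checks it against our reading of Remark 2, namely that $H_\alpha(X|Y)$ equals $\log_2|\mathcal{X}|$ minus the new conditional divergence between $P_{X|Y}$ and the uniform PMF given $P_Y$; it also evaluates Arimoto’s measure of dependence as $H_\alpha(X) - H_\alpha(X|Y)$. The example joint PMF and the function names are ours.

```python
import numpy as np

def renyi_entropy(P, alpha):
    """H_alpha(P) = (1/(1-alpha)) log2 sum_x P(x)^alpha."""
    return np.log2(np.sum(np.asarray(P, float) ** alpha)) / (1.0 - alpha)

def arimoto_cond_entropy(PXY, alpha):
    """Arimoto-Rényi H_alpha(X|Y) = (alpha/(1-alpha)) log2 sum_y ( sum_x P_XY(x,y)^alpha )^(1/alpha)."""
    inner = np.sum(np.asarray(PXY, float) ** alpha, axis=0) ** (1.0 / alpha)   # one term per y
    return alpha / (1.0 - alpha) * np.log2(np.sum(inner))

def cond_div_new(PY, PXgY, QX, alpha):
    """New conditional divergence of P_{X|Y} from a fixed Q_X given P_Y (our reading of (78))."""
    inner = np.sum(PXgY**alpha * QX[None, :] ** (1.0 - alpha), axis=1) ** (1.0 / alpha)
    return alpha / (alpha - 1.0) * np.log2(np.dot(PY, inner))

PXY = np.array([[0.4, 0.1], [0.1, 0.4]])    # assumed example, rows indexed by x
PX, PY = PXY.sum(axis=1), PXY.sum(axis=0)
PXgY = (PXY / PY[None, :]).T                # rows indexed by y: P_{X|Y=y}
unif = np.full(PXY.shape[0], 1.0 / PXY.shape[0])
for a in (0.5, 2.0):
    lhs = arimoto_cond_entropy(PXY, a)
    rhs = np.log2(PXY.shape[0]) - cond_div_new(PY, PXgY, unif, a)   # Remark 2, as we read it
    arimoto_I = renyi_entropy(PX, a) - lhs                          # Arimoto's dependence measure
    print(a, lhs, rhs, arimoto_I)
```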
7. Relations Between the Conditional Rényi Divergences and the Rényi Dependence Measures
In this section, we first establish the greater-or-equal-than order between the conditional Rényi divergences, where the order depends on whether or . We then show that this implies the same order between the dependence measures derived from the conditional Rényi divergences. Finally, we remark that many of the dependence measures coincide when they are maximized over all PMFs .
Proposition 7.
For all,
Proof.
Theorem 7.
For all,
For all,
Proof.
For both and , the relation follows from Proposition 7.
We next show that for . We show this for ; the claim will then extend to by the continuity in of and (Proposition 3 (c) and Proposition 2 (c)). For ,
where (162) follows from (55); (163) follows from Jensen’s inequality because is a concave function; and (164) follows from (30). The proof of the claim for is finished by dividing (162)–(164) by , which reverses the inequality because .
Corollary 4.
For all,
For all,
Proof.
The corollary now follows from (171)–(173) and Theorem 7. □
Despite , , , and being different measures, they often coincide when maximized over all PMFs :
Theorem 8.
For every conditional PMFand every,
In addition, for every conditional PMFand every,
For, the situation is different: there exists a conditional PMFsuch that, for every,
Proof.
Equation (174) follows from [9] (Proposition 1); (175) follows from [12] (Lemma 1); and (176) follows from [38] (Theorem V.1) for .
For , (178) holds because
where (179) follows from Proposition 5; (180) follows from (78); (181) and (185) follow from a minimax theorem and are justified below; (187) follows from (55); and (188) follows from (60).
To justify (181), we apply the minimax theorem [46] (Corollary 37.3.2) to the function ,
The sets of all PMFs over and over are convex and compact; the function f is jointly continuous in the pair because it is a composition of continuous functions; for every , the function f is linear and hence convex in ; and it only remains to show that the function f is concave in for every . Indeed, for every with , every , and every ,
where (193) follows from the reverse Minkowski inequality [16] (III 2.4 Theorem 9) because ; and (195) holds because the function is concave for .
The justification of (185) is very similar to that of (181); here, we apply the minimax theorem to the function ,
Compared to the justification of (181), the only essential difference lies in showing that the function g is concave in for every : here, this follows easily from the concavity of the function for .
We conclude the proof by establishing (177). Let , and let the conditional PMF be given by . (This corresponds to a binary noiseless channel.) Then, denoting by the uniform distribution over ,
where (199) follows from (61). On the other hand, for every and every PMF ,
where (200) follows from [3] (Lemma 11); (201) follows from (144); and (202) holds because . Inequality (177) now follows from (199) and (202). □
8. Horse Betting
In this section, we analyze horse betting with a gambler investing all her money. Recall from the introduction that the winning horse X is distributed according to the PMF p, where we assume for all ; that the odds offered by the bookmaker are denoted by ; that the fraction of her wealth that the gambler bets on Horse is denoted ; that the wealth relative is the random variable ; and that we seek betting strategies that maximize the utility function
Because the gambler invests all her money, b is a PMF. As in [47] (Section 10.3), define the constant
and the PMF
Using these definitions, the utility function can be decomposed as follows:
Theorem 9.
Let, and let b be a PMF. Then,
where the PMFis given by
Thus, choosinguniquely maximizesamong all PMFs b.
The three terms in (206) can be interpreted as follows:
- The first term, , depends only on the odds and is related to the fairness of the odds. The odds are called subfair if , fair if , and superfair if .
- The second term, , is related to the bookmaker’s estimate of the winning probabilities. It is zero if and only if the odds are inversely proportional to the winning probabilities.
- The third term, , is related to the gambler’s estimate of the winning probabilities. It is zero if and only if b is equal to .
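The decomposition can be checked numerically. The sketch below assumes, in line with our reading of (206) and (207), that the maximizing strategy is the tilted PMF $b^*(x) \propto p(x)^{1/(1-\beta)} o(x)^{\beta/(1-\beta)}$ and that the three terms are $\log_2 c$, $D_{1/(1-\beta)}(p\|r)$, and $-D_{1-\beta}(b^*\|b)$; the example probabilities, odds, and function names are ours.

```python
import numpy as np

def renyi_div(P, Q, alpha):
    return np.log2(np.sum(P**alpha * Q**(1.0 - alpha))) / (alpha - 1.0)

def utility(b, p, o, beta):
    return np.log2(np.dot(p, (b * o) ** beta)) / beta

p = np.array([0.5, 0.3, 0.2])            # assumed example winning probabilities
o = np.array([1.8, 3.5, 4.0])            # assumed example (subfair) odds
c = 1.0 / np.sum(1.0 / o)                # level of fairness
r = c / o                                # bookmaker's implied PMF

for beta in (-2.0, -0.5, 0.5):
    alpha = 1.0 / (1.0 - beta)
    bstar = p**alpha * o**(alpha - 1.0)
    bstar /= bstar.sum()                 # tilted PMF, our reading of (207)
    for b in (bstar, np.array([1/3, 1/3, 1/3])):
        lhs = utility(b, p, o, beta)
        rhs = np.log2(c) + renyi_div(p, r, alpha) - renyi_div(bstar, b, 1.0 - beta)
        print(f"beta={beta:+.1f}  U={lhs:.6f}  decomposition={rhs:.6f}")
    # For each beta, the first line (b = bstar) attains the larger utility.
```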
Remark 4.
(This decomposition appeared previously in [47] (Section 10.3).) Equation (208) implies that the doubling rate is maximized by proportional gambling, i.e., that is maximized if and only if b is equal to p.
Remark 5.
Considering the limitsand, the PMFsatisfies, for every,
where the setis defined as. It follows from Proposition 8 below that the RHS of (209) is the unique maximizer of ; and it follows from the proof of Proposition 9 below that the RHS of (210) is a maximizer (not necessarily unique) of .
Proof of Remark 5.
Recall that we assume for every . Then, (209) follows from (207) and the definition of c in (204). To establish (210), define and observe that, for every ,
where (211) follows from (207) and some algebra; and (212) is justified as follows: if , then equals one; and if , then tends to zero as because and because . □
Remark 6.
Using the definition in (24) for the Rényi divergence of negative orders, it is not difficult to see from the proof of Theorem 9 below that (206) also holds for . However, because the Rényi divergence of negative orders is nonpositive instead of nonnegative, the above interpretation is not valid anymore; in particular, for , choosing is in general not optimal.
Proof of Theorem 9.
We first show the maximization claim. The only term on the RHS of (206) that depends on b is . Because , this term is maximized if and only if (Proposition 1 (a)).
We now establish (206) for ; we omit the proof for , which can be found in [47] (Section 10.3). For ,
The rest of the section presents the cases , , and .
Proposition 8.
Let b be a PMF. Then,
Inequality (220) holds with equality if and only if for all .
Observe that if for all , then with probability one, i.e., S does not depend on the winning horse.
Proof of Proposition 8.
Equation (219) holds because
where (222) holds because, in the limit as tends to , the power mean tends to the minimum (since p is a PMF with for all [15] (Chapter 8)).
We show (220) by contradiction. Assume that there exists a PMF b that does not satisfy (220), thus
for all . Then,
where (224) holds because b is a PMF; (225) follows from (223); and (226) follows from the definition of c in (204). Because is impossible, such a b cannot exist, which establishes (220).
It is not difficult to see that (220) holds with equality if for all . We therefore focus on establishing that if (220) holds with equality, then for all . Observe first that, if (220) holds with equality, then, for all ,
Proposition 9.
Let, and let b be a PMF. Then,
Remark 7.
Proposition 9 implies that if, then it is optimal to bet on a single horse. Unless, this is not the case when: When, an optimal betting strategy requires placing a bet on every horse. This follows from Theorem 9 and our assumption thatandare all positive.
Proof of Proposition 9.
Proposition 10.
Let b be a PMF. Then,
Equality in (236) can be achieved by choosing for some satisfying
Proof.
9. Horse Betting with Side Information
In this section, we study the horse betting problem where the gambler observes some side information Y before placing her bets. This setting leads to the conditional Rényi divergence discussed in Section 5 (see Theorem 10). In addition, it provides a new operational meaning to the dependence measure (see Theorem 11).
We adapt our notation as follows: The joint PMF of X and Y is denoted . (Recall that X denotes the winning horse.) We drop the assumption that the winning probabilities are positive, but we assume that for all . We continue to assume that the gambler invests all her wealth, so a betting strategy is now a conditional PMF , and the wealth relative S is
The following decomposition of the utility function parallels that of Theorem 9:
Theorem 10.
Let. Then,
where the conditional PMFand the PMFare given by
Thus, choosinguniquely maximizesamong all conditional PMFs.
Proof.
We first show that is uniquely maximized by . The only term on the RHS of (243) that depends on is . Because , this term is maximized if and only if (Proposition 1 (a)). By our assumptions that for all and for all , we have for all . Consequently, if and only if .
We conclude with establishing (243) for . For ,
For every and every ,
which follows from (244) and (245). Now, (243) holds because
where (249) follows from (247) and (248) and the fact that ; (250) follows by identifying the Rényi divergence; (251) follows from (242); and (252) follows by identifying the conditional Rényi divergence using (78). □
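A small numerical illustration: because the expectation decouples over the observed y, the utility-maximizing strategy bets, for each y, the tilted PMF $b^*(x|y) \propto p(x|y)^{1/(1-\beta)} o(x)^{\beta/(1-\beta)}$ (our reading of the optimal strategy in Theorem 10). The sketch below compares the optimal utility with and without side information for an example joint PMF and odds of our own choosing.

```python
import numpy as np

def tilted(p, o, beta):
    """b(x) proportional to p(x)^(1/(1-beta)) o(x)^(beta/(1-beta))."""
    a = 1.0 / (1.0 - beta)
    b = p**a * o**(a - 1.0)
    return b / b.sum()

def optimal_utility(PXY, o, beta, use_side_info=True):
    """(1/beta) log2 E[(b(X|Y) o(X))^beta] under the tilted strategy."""
    PY = PXY.sum(axis=1 - 1) if False else PXY.sum(axis=0)
    if use_side_info:
        vals = 0.0
        for y in range(PXY.shape[1]):
            pxgy = PXY[:, y] / PY[y]
            b = tilted(pxgy, o, beta)
            vals += PY[y] * np.dot(pxgy, (b * o) ** beta)
    else:
        px = PXY.sum(axis=1)
        b = tilted(px, o, beta)
        vals = np.dot(px, (b * o) ** beta)
    return np.log2(vals) / beta

PXY = np.array([[0.35, 0.05],            # assumed example: rows x (horses), columns y
                [0.05, 0.25],
                [0.10, 0.20]])
o = np.array([2.5, 3.0, 4.0])
for beta in (-1.0, 0.5):
    print(beta, optimal_utility(PXY, o, beta, use_side_info=False),
                optimal_utility(PXY, o, beta, use_side_info=True))
    # The second value is never smaller: side information can only help.
```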
Remark 8.
It follows from Theorem 10 that, if the gambler gambles optimally, then, for,
Operationally, it is clear that preprocessing the side information cannot increase the gambler’s utility, i.e., that, for every conditional PMF,
whereandare derived from the joint PMFgiven by
The extreme case is when the preprocessing maps the side information to a constant and hence leads to the case where the side information is absent. In this case,is deterministic andequals. Theorem 9 and Theorem 10 then lead to the following relation between the conditional and unconditional Rényi divergence:
where the marginal PMFis given by
The last result of this section provides a new operational meaning to the Lapidoth–Pfister mutual information : assuming that and that the gambler knows the winning probabilities, measures how much the side information that is available to the gambler but not the bookmaker increases the gambler’s smallest guaranteed utility for a fixed level of fairness c. To see this, consider first the setting without side information. By Theorem 9, the gambler chooses to maximize her utility, where is defined in (207). Then, using the nonnegativity of the Rényi divergence (Proposition 1 (a)), the following lower bound on the gambler’s utility follows from (206):
We call the RHS of (258) the smallest guaranteed utility for a fixed level of fairness c because (258) holds with equality if the bookmaker chooses the odds inversely proportional to the winning probabilities. Comparing (258) with (259) below, we see that the difference due to the side information is . Note that is typically not the difference between the utility with and without side information; this is because the odds for which (258) and (259) hold with equality are typically not the same.
Theorem 11.
Let. Ifis equal tofrom Theorem 10, then
10. Horse Betting with Part of the Money
In this section, we treat the possibility that the gambler does not invest all her wealth. We restrict ourselves to the setting without side information and to . (For the case , see [47] (Section 10.5).) We assume that and for all . Denote by the fraction of her wealth that the gambler does not use for betting. (We assume .) Then, is a PMF, and the wealth relative S is the random variable
As in Section 8, define the constant
We treat the cases and separately, starting with the latter. If , then it is optimal to invest all the money:
Proposition 11.
Assume, let, and let b be a PMF onwith utility. Then, there exists a PMF on with and utility .
Proof.
Choose the PMF as follows:
On the other hand, if and the odds are subfair, i.e., if , then Claim (c) of the following theorem shows that investing all the money is not optimal:
Theorem 12.
Assume, let, and letbe a PMF onthat maximizesamong all PMFs b. Defining
the following claims hold:
- (a)
- Both the numerator and denominator on the RHS of (270) are positive, so Γ is well-defined and positive.
- (b)
- For every,
- (c)
- The quantitysatisfiesIn particular,.
Claim (b) implies that for every , if and only if . Ordering the elements of such that , the set thus has a special structure: it is either empty or equal to for some integer k. To maximize , the following procedure can be used: for every with the above structure, compute the corresponding b according to (270)–(273); and from these b’s, take one that maximizes . This procedure leads to an optimal solution: an optimal solution exists because we are optimizing a continuous function over a compact set, and it corresponds to a set that will be considered by the procedure.
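For readers who prefer a numerical cross-check of this procedure, the following sketch simply maximizes the utility over the cash fraction and the bets with a generic constrained optimizer; it is not an implementation of (270)–(273), and the example odds, probabilities, and risk parameter are ours.

```python
import numpy as np
from scipy.optimize import minimize

p = np.array([0.5, 0.3, 0.2])          # assumed example winning probabilities
o = np.array([1.8, 2.8, 3.5])          # assumed example subfair odds (sum of 1/o(x) exceeds 1)
beta = -1.0                            # assumed example risk parameter
m = len(p)

def neg_utility(z):
    """z = (b(0), b(1), ..., b(m)): cash fraction followed by the bets."""
    b0, b = z[0], z[1:]
    S = b0 + b * o                     # wealth relative for each possible winner
    return -np.log2(np.dot(p, S**beta)) / beta

z0 = np.full(m + 1, 1.0 / (m + 1))     # start from an interior point
res = minimize(neg_utility, z0, method="SLSQP",
               bounds=[(1e-9, 1.0)] * (m + 1),
               constraints=[{"type": "eq", "fun": lambda z: z.sum() - 1.0}])
print("cash fraction b(0):", res.x[0])
print("bets b(x):        ", res.x[1:])
print("utility:          ", -res.fun)
# SLSQP returns a stationary point; for this sketch that is enough to see whether
# keeping some cash beats investing everything at these subfair odds.
```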
Proof of Theorem 12.
The proof is based on the Karush–Kuhn–Tucker conditions. By separately considering the cases and , we first show that, for , a strategy is optimal if and only if the following conditions are satisfied for some :
and, for every ,
Consider first , and define the function ,
Since and since the logarithm is an increasing function, maximizing over b is equivalent to maximizing . Observe that is concave, thus, by the Karush–Kuhn–Tucker conditions [11] (Theorem 4.4.1), it is maximized by a PMF b if and only if there exists a such that (i) for all with ,
and (ii) for all with ,
Henceforth, we use the following notation: to designate that (i) and (ii) both hold, we write
Consider now , and define as in (276). Then, because , maximizing is equivalent to minimizing . The function is convex, thus Inequality (278) is reversed. Dividing by again reverses the inequalities, thus (280), (274), and (275) continue to hold for .
Having established that, for all , a strategy b is optimal if and only if (274) and (275) hold, we next continue with the proof. Let , and let be a PMF on that maximizes . By the above discussion, (274) and (275) are satisfied by for some . The LHS of (274) is positive, so . We now show that for all ,
To this end, fix . If , then (275) implies
and the RHS of (282) is equal to the RHS of (281) because, being equal to , it is positive. If , then (275) implies
so the RHS of (281) is zero and (281) hence holds.
Having established (281), we next show that for some . For a contradiction, assume that for all . Then,
where (284) follows from (275), and (285) holds because by assumption. However, this is impossible: (285) contradicts (274).
Let now be such that . Then, by (281),
Because and are positive, this implies . Thus, by (274),
Splitting the sum on the LHS of (287) depending on whether or , we obtain
where (289) follows from (275). Rearranging (290), we obtain
Recall that and . In addition, because and hence . Thus, , so both the numerator and denominator in the definition of in (270) are positive, which establishes Claim (a), namely that is well-defined and positive.
11. Universal Betting for IID Races
In this section, we present a universal gambling strategy for IID races that requires neither knowledge of the winning probabilities nor of the parameter of the utility function and yet asymptotically maximizes the utility function for all PMFs p and all . Consider n consecutive horse races, where the winning horse in the ith race is denoted for . We assume that are IID according to the PMF p, where for all . In every race, the bookmaker offers the same odds , and the gambler spends all her wealth placing bets on the horses. The gambler plays race after race, i.e., before placing bets on a race, she learns the winning horse of the previous race and receives the money from the bookmaker. Her betting strategy is hence a sequence of conditional PMFs . The wealth relative is the random variable
We seek betting strategies that maximize the utility function
We first establish that to maximize for a fixed , it suffices to use the same betting strategy in every race; see Theorem 13. We then show that the individual-sequence-universal strategy by Cover–Ordentlich [48] allows the gambler to asymptotically achieve the same normalized utility without knowing p or (see Theorem 14).
For a fixed , let the PMF be a betting strategy that maximizes the single-race utility discussed in Section 8, and denote by the utility associated with . Using the same betting strategy over n races leads to the utility , and it follows from (295) and (296) that
As we show next, is the maximum utility that can be achieved among all betting strategies:
Theorem 13.
Let, and letbe a sequence of conditional PMFs. Then,
Proof.
We show (298) for ; analogous arguments establish (298) for and . We prove (298) by induction on n. For , (298) holds because is the maximum single-race utility. Assume now and that (298) is valid for . For , (298) holds because
where (303) holds because maximizes the single-race utility , and (305) holds because (298) is valid for . □
In portfolio theory, Cover–Ordentlich [48] (Definition 1) proposed a universal strategy. Adapted to our setting, it leads to the following sequence of conditional PMFs:
where ; is the distribution on ; ; and
This strategy depends neither on the winning probabilities p nor on the parameter . Denoting the utility (296) associated with the strategy by , we have the following result:
Theorem 14.
For every,
Hence,
Proof.
Inequality (310) follows from Theorem 13; and (311) follows from (309) and (310) and the sandwich theorem. It thus remains to establish (309): We do so for ; analogous arguments establish (309) for and . For a fixed sequence , let be a PMF on that maximizes , and denote the wealth relative in (295) associated with using in every race by , thus
Let denote the wealth relative in (295) associated with the strategy and the sequence . Using [48] (Theorem 2) it follows that, for every ,
Remark 9.
As discussed in Section 8, the optimal single-race betting strategy varies significantly with different values of β, thus it might be a bit surprising that the Cover–Ordentlich strategy is not only universal with respect to the winning probabilities, but also with respect to β. This is due to the following two reasons: First, for fixed winning probabilities and a fixed β, it is optimal to use the same betting strategy in every race (see Theorem 13). Second, for every, the wealth relative of the Cover–Ordentlich strategy is not much worse than that of using the same strategyin every race, irrespective of(see (313)). Hence, irrespective of the optimal single-race betting strategy, the Cover–Ordentlich strategy is able to asymptotically achieve the same normalized utility.
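The strategy is easy to simulate. For horse races, the past wealth of a constant strategy b factorizes as $\prod_j b(x_j) o(x_j)$, so the odds cancel from the mixture, and (under our reading of the definition, with the Dirichlet(1/2, …, 1/2) prior of Cover–Ordentlich’s Definition 1) the bet placed on horse x before race i reduces to the add-half (Krichevsky–Trofimov) estimate $(N_i(x) + 1/2)/(i - 1 + m/2)$, where $N_i(x)$ counts the wins of x in the first $i-1$ races and m is the number of horses. The sketch below computes the normalized utility of this strategy exactly (by summing over all race outcomes for a small n) and compares it with the informed strategy of Theorem 13; the example numbers are ours.

```python
import numpy as np
from itertools import product

p = np.array([0.5, 0.3, 0.2])          # assumed example winning probabilities
o = np.array([2.0, 4.0, 5.0])          # assumed example odds
m, n = len(p), 8

def tilted(p, o, beta):                # optimal single-race strategy (see Section 8)
    a = 1.0 / (1.0 - beta)
    b = p**a * o**(a - 1.0)
    return b / b.sum()

def kt_bet(counts, i):                 # add-half (Dirichlet-(1/2,...,1/2) mixture) strategy
    return (counts + 0.5) / (i + m / 2.0)

def expected_Sbeta(strategy, beta):
    """E[S_n^beta], computed exactly by summing over all m^n race outcomes."""
    total = 0.0
    for seq in product(range(m), repeat=n):
        counts, prob, S = np.zeros(m), 1.0, 1.0
        for i, x in enumerate(seq):
            b = strategy(counts, i)
            prob *= p[x]
            S *= b[x] * o[x]
            counts[x] += 1
        total += prob * S**beta
    return total

for beta in (-1.0, 0.5):
    bstar = tilted(p, o, beta)
    u_kt = np.log2(expected_Sbeta(kt_bet, beta)) / (beta * n)
    u_opt = np.log2(expected_Sbeta(lambda c, i: bstar, beta)) / (beta * n)
    print(f"beta={beta:+.1f}  (1/n)U universal={u_kt:.4f}  (1/n)U informed={u_opt:.4f}")
```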
Author Contributions
Writing—original draft preparation, C.B., A.L., and C.P.; and writing—review and editing, C.B., A.L., and C.P. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A. Proof of Proposition 1
These properties mostly follow from van Erven–Harremoës [8]:
- (a)
- See [8] (Theorem 8).
- (b)
- This follows from the definitions in (4) and (21)–(23) and the conventions in (20).
- (c)
- This follows from [8] (Theorem 7) and the fact that by L’Hôpital’s rule. (Note that does not need to be continuous at when the alphabets are not finite; see the discussion after [8] (Equation (18)).)
- (d)
- See [8] (Theorem 3).
- (e)
- Let satisfy . Then,where (A1) and (A3) follow from [8] (Lemma 10), and (A2) holds because the Rényi divergence, extended to negative orders, is nondecreasing ([8] (Theorem 39)).
- (f)
- See [8] (Corollary 2).
- (g)
- For ,where (A5) follows from [8] (Theorem 30). Hence, is concave in because the expression in square brackets on the RHS of (A6) is concave in for every R and because the pointwise infimum preserves the concavity.
- (h)
- See [8] (Theorem 9).
Appendix B. Proof of Theorem 1
Beginning with (29),
where (A8) follows by applying, separately for every , Proposition 1 (h) with the conditional PMF .
Appendix C. Proof of Theorem 2
We show (43) for ; the claim then extends to by the continuity of in (Proposition 2 (c)). Let . Keeping in mind that , (43) holds because
where (A10) follows from (30); (A11) follows from (41) and (42); (A12) follows from Hölder’s inequality; (A13) holds because if and ; (A14) follows from Jensen’s inequality because is concave; (A15) follows from (40); (A16) holds because and imply , hence the expression in square brackets on the LHS of (A16) equals one; and (A17) follows from (30).
Appendix D. Proof of Corollary 1
Applying Theorem 2 with and the conditional PMF , we obtain
To complete the proof of (48), observe that
where (A19) holds because (41) and (46) imply and because (42) and (47) imply ; and (A20) follows from Remark 1.
Appendix E. Proof of Example 1
If , then it can be verified numerically that (53) holds for . Fix now . Then, for all ,
The RHS of (53) satisfies, for sufficiently small ,
where (A27) holds for sufficiently small because . Because , (53) follows from (A23) and (A28) for sufficiently small .
Appendix F. Proof of Theorem 3
Observe that, for all and all ,
Hence, (68) follows from (54) and
which follows from the data-processing inequality for the Rényi divergence by substituting for in Proposition 1 (h).
Appendix G. Proof of Theorem 4
Observe that, for all and all ,
Hence, (73) follows from (54) and
which follows from the data-processing inequality for the Rényi divergence by substituting for in Proposition 1 (h).
References
- Kelly, J.L., Jr. A new interpretation of information rate. Bell Syst. Tech. J. 1956, 35, 917–926.
- Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2006; ISBN 978-0-471-24195-9.
- Lapidoth, A.; Pfister, C. Two measures of dependence. Entropy 2019, 21, 778.
- Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
- Csiszár, I.; Körner, J. Information Theory: Coding Theorems for Discrete Memoryless Systems, 2nd ed.; Cambridge University Press: Cambridge, UK, 2011; ISBN 978-0-521-19681-9.
- Csiszár, I.; Shields, P.C. Information Theory and Statistics: A Tutorial; now Publishers: Hanover, MA, USA, 2004; ISBN 978-1-933019-05-5.
- Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 20 June–30 July 1960; Volume 1, pp. 547–561.
- Van Erven, T.; Harremoës, P. Rényi divergence and Kullback–Leibler divergence. IEEE Trans. Inf. Theory 2014, 60, 3797–3820.
- Csiszár, I. Generalized cutoff rates and Rényi’s information measures. IEEE Trans. Inf. Theory 1995, 41, 26–34.
- Sibson, R. Information radius. Z. Wahrscheinlichkeitstheorie verw. Geb. 1969, 14, 149–160.
- Gallager, R.G. Information Theory and Reliable Communication; John Wiley & Sons: Hoboken, NJ, USA, 1968; ISBN 978-0-471-29048-3.
- Arimoto, S. Information measures and capacity of order α for discrete memoryless channels. In Topics in Information Theory; Csiszár, I., Elias, P., Eds.; North-Holland Publishing Company: Amsterdam, The Netherlands, 1977; pp. 41–52. ISBN 0-7204-0699-4.
- Eeckhoudt, L.; Gollier, C.; Schlesinger, H. Economic and Financial Decisions under Risk; Princeton University Press: Princeton, NJ, USA, 2005; ISBN 978-0-691-12215-1.
- Soklakov, A.N. Economics of disagreement – financial intuition for the Rényi divergence. arXiv 2018, arXiv:1811.08308.
- Steele, J.M. The Cauchy–Schwarz Master Class: An Introduction to the Art of Mathematical Inequalities; Cambridge University Press: Cambridge, UK, 2004; ISBN 978-0-521-54677-5.
- Bullen, P.S. Handbook of Means and Their Inequalities; Kluwer Academic Publishers: Dordrecht, The Netherlands, 2003; ISBN 978-1-4020-1522-9.
- Campbell, L.L. A coding theorem and Rényi’s entropy. Inf. Control 1965, 8, 423–429.
- Campbell, L.L. Definition of entropy by means of a coding problem. Z. Wahrscheinlichkeitstheorie verw. Geb. 1966, 6, 113–118.
- Merhav, N. On optimum strategies for minimizing the exponential moments of a loss function. Commun. Inf. Syst. 2011, 11, 343–368.
- Sason, I.; Verdú, S. Improved bounds on lossless source coding and guessing moments via Rényi measures. IEEE Trans. Inf. Theory 2018, 64, 4323–4346.
- Augustin, U. Noisy Channels. Habilitation Thesis, Universität Erlangen–Nürnberg, Erlangen, Germany, 1978.
- Nakiboğlu, B. The Augustin capacity and center. Probl. Inf. Transm. 2019, 55, 299–342.
- Nakiboğlu, B. The sphere packing bound for memoryless channels. arXiv 2018, arXiv:1804.06372.
- Ho, S.-W.; Verdú, S. Convexity/concavity of Rényi entropy and α-mutual information. In Proceedings of the 2015 IEEE International Symposium on Information Theory (ISIT), Hong Kong, China, 14–19 June 2015; pp. 745–749.
- Verdú, S. α-mutual information. In Proceedings of the 2015 Information Theory and Applications Workshop (ITA), San Diego, CA, USA, 1–6 February 2015; pp. 1–6.
- Nakiboğlu, B. The Rényi capacity and center. IEEE Trans. Inf. Theory 2019, 65, 841–860.
- Cai, C.; Verdú, S. Conditional Rényi divergence saddlepoint and the maximization of α-mutual information. Entropy 2019, 21, 969.
- Fong, S.L.; Tan, V.Y.F. Strong converse theorems for classes of multimessage multicast networks: A Rényi divergence approach. IEEE Trans. Inf. Theory 2016, 62, 4953–4967.
- Yu, L.; Tan, V.Y.F. Rényi resolvability and its applications to the wiretap channel. IEEE Trans. Inf. Theory 2019, 65, 1862–1897.
- Gallager, R.G. A simple derivation of the coding theorem and some applications. IEEE Trans. Inf. Theory 1965, 11, 3–18.
- Shannon, C.E.; Gallager, R.G.; Berlekamp, E.R. Lower bounds to error probability for coding on discrete memoryless channels. I. Inf. Control 1967, 10, 65–103.
- Shannon, C.E.; Gallager, R.G.; Berlekamp, E.R. Lower bounds to error probability for coding on discrete memoryless channels. II. Inf. Control 1967, 10, 522–552.
- Arimoto, S. On the converse to the coding theorem for discrete memoryless channels. IEEE Trans. Inf. Theory 1973, 19, 357–359.
- Polyanskiy, Y.; Verdú, S. Arimoto channel coding converse and Rényi divergence. In Proceedings of the 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Allerton, IL, USA, 29 September–1 October 2010; pp. 1327–1333.
- Nakiboğlu, B. The sphere packing bound via Augustin’s method. IEEE Trans. Inf. Theory 2019, 65, 816–840.
- Nakiboğlu, B. The sphere packing bound for DSPCs with feedback à la Augustin. IEEE Trans. Commun. 2019, 67, 7456–7467.
- Tomamichel, M.; Hayashi, M. Operational interpretation of Rényi information measures via composite hypothesis testing against product and Markov distributions. IEEE Trans. Inf. Theory 2018, 64, 1064–1082.
- Aishwarya, G.; Madiman, M. Remarks on Rényi versions of conditional entropy and mutual information. In Proceedings of the 2019 IEEE International Symposium on Information Theory (ISIT), Paris, France, 7–12 July 2019; pp. 1117–1121.
- Lapidoth, A.; Pfister, C. Testing against independence and a Rényi information measure. In Proceedings of the 2018 IEEE Information Theory Workshop (ITW), Guangzhou, China, 25–29 November 2018; pp. 1–5.
- Fehr, S.; Berens, S. On the conditional Rényi entropy. IEEE Trans. Inf. Theory 2014, 60, 6801–6810.
- Sason, I.; Verdú, S. Arimoto–Rényi conditional entropy and Bayesian M-ary hypothesis testing. IEEE Trans. Inf. Theory 2018, 64, 4–25.
- Arıkan, E. An inequality on guessing and its application to sequential decoding. IEEE Trans. Inf. Theory 1996, 42, 99–105.
- Sundaresan, R. Guessing under source uncertainty. IEEE Trans. Inf. Theory 2007, 53, 269–287.
- Bracher, A.; Lapidoth, A.; Pfister, C. Guessing with distributed encoders. Entropy 2019, 21, 298.
- Bunte, C.; Lapidoth, A. Encoding tasks and Rényi entropy. IEEE Trans. Inf. Theory 2014, 60, 5065–5076.
- Rockafellar, R.T. Convex Analysis; Princeton University Press: Princeton, NJ, USA, 1970; ISBN 978-0-691-01586-6.
- Moser, S.M. Information Theory (Lecture Notes), version 6.6. 2018. Available online: http://moser-isi.ethz.ch/scripts.html (accessed on 8 March 2020).
- Cover, T.M.; Ordentlich, E. Universal portfolios with side information. IEEE Trans. Inf. Theory 1996, 42, 348–363.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).