Article

Conditional Rényi Divergences and Horse Betting

Signal and Information Processing Laboratory, ETH Zurich, 8092 Zurich, Switzerland
*
Author to whom correspondence should be addressed.
Entropy 2020, 22(3), 316; https://doi.org/10.3390/e22030316
Submission received: 17 December 2019 / Revised: 8 March 2020 / Accepted: 9 March 2020 / Published: 11 March 2020

Abstract
Motivated by a horse betting problem, a new conditional Rényi divergence is introduced. It is compared with the conditional Rényi divergences that appear in the definitions of the dependence measures by Csiszár and Sibson, and the properties of all three are studied with emphasis on their behavior under data processing. In the same way that Csiszár’s and Sibson’s conditional divergences lead to their respective dependence measures, so does the new conditional divergence lead to the Lapidoth–Pfister mutual information. Moreover, the new conditional divergence is also related to the Arimoto–Rényi conditional entropy and to Arimoto’s measure of dependence. In the second part of the paper, the horse betting problem is analyzed where, instead of Kelly’s expected log-wealth criterion, a more general family of power-mean utility functions is considered. The Rényi divergence plays the key role in the analysis, and in the setting where the gambler has access to side information, so does the new conditional Rényi divergence. The setting with side information also provides another operational meaning to the Lapidoth–Pfister mutual information. Finally, a universal strategy for independent and identically distributed races is presented that—without knowing the winning probabilities or the parameter of the utility function—asymptotically maximizes the gambler’s utility function.

1. Introduction

As shown by Kelly [1,2], many of Shannon’s information measures appear naturally in the context of horse gambling when the gambler’s utility function is expected log-wealth. Here, we show that under a more general family of utility functions, gambling also provides a context for some of Rényi’s information measures. Moreover, the setting where the gambler has side information motivates a new Rényi-like conditional divergence, which we study and compare to other conditional divergences. The proposed family of utility functions in the context of gambling with side information also provides another operational meaning to the Rényi-like mutual information that was recently proposed by Lapidoth and Pfister [3]: it captures the gambler’s gain from the side information, namely the increase in the minimax value of the two-player zero-sum game in which the bookmaker picks the odds and the gambler then places the bets based on these odds and her side information.
Deferring the gambling-based motivation to the second part of the paper, we first describe the different conditional divergences and study some of their properties with emphasis on their behavior under data processing. We also show that the new conditional Rényi divergence relates to the Lapidoth–Pfister mutual information in much the same way that Csiszár’s and Sibson’s conditional divergences relate to their corresponding mutual informations. Before discussing the conditional divergences, we first recall other information measures.
The Kullback–Leibler divergence (or relative entropy) is an important concept in information theory and statistics [2,4,5,6]. It is defined between two probability mass functions (PMFs) P and Q over a finite set X as
$$D(P\|Q) \triangleq \sum_{x\in\mathcal{X}} P(x)\log\frac{P(x)}{Q(x)}, \tag{1}$$
where log ( · ) denotes the base-2 logarithm. Defining a conditional Kullback–Leibler divergence is straightforward because, as simple algebra shows, the two natural approaches lead to the same result:
$$D(P_{Y|X}\|Q_{Y|X}|P_X) \triangleq \sum_{x\in\mathrm{supp}(P_X)} P(x)\, D(P_{Y|X=x}\|Q_{Y|X=x}) \tag{2}$$
$$= D(P_X P_{Y|X}\|P_X Q_{Y|X}), \tag{3}$$
where supp ( P ) { x X : P ( x ) > 0 } denotes the support of P, and in (3) and throughout P X P Y | X denotes the PMF on X × Y that assigns ( x , y ) the probability P X ( x ) P Y | X ( y | x ) .
The Rényi divergence of order α [7,8] between two PMFs P and Q is defined for all positive α ’s other than one as
$$D_\alpha(P\|Q) \triangleq \frac{1}{\alpha-1}\log\sum_{x\in\mathcal{X}} P(x)^\alpha\, Q(x)^{1-\alpha}. \tag{4}$$
A conditional Rényi divergence can be defined in more than one way. In this paper, we consider the following three definitions, two classic and one new:
$$D_\alpha^{\mathrm{c}}(P_{Y|X}\|Q_{Y|X}|P_X) \triangleq \sum_{x\in\mathrm{supp}(P_X)} P(x)\, D_\alpha(P_{Y|X=x}\|Q_{Y|X=x}), \tag{5}$$
$$D_\alpha^{\mathrm{s}}(P_{Y|X}\|Q_{Y|X}|P_X) \triangleq D_\alpha(P_X P_{Y|X}\|P_X Q_{Y|X}), \tag{6}$$
$$D_\alpha^{\mathrm{l}}(P_{Y|X}\|Q_{Y|X}|P_X) \triangleq \frac{\alpha}{\alpha-1}\log\sum_{x\in\mathrm{supp}(P_X)} P(x)\, 2^{\frac{\alpha-1}{\alpha} D_\alpha(P_{Y|X=x}\|Q_{Y|X=x})}, \tag{7}$$
where (5) is inspired by Csiszár [9]; (6) is inspired by Sibson [10]; and (7) is motivated by the horse betting problem discussed in Section 9. The first two conditional Rényi divergences were used to define the Rényi measures of dependence of Csiszár I α c ( X ; Y ) [9] and of Sibson I α s ( X ; Y ) [10]:
$$I_\alpha^{\mathrm{c}}(X;Y) \triangleq \min_{Q_Y} D_\alpha^{\mathrm{c}}(P_{Y|X}\|Q_Y|P_X), \tag{8}$$
$$I_\alpha^{\mathrm{s}}(X;Y) \triangleq \min_{Q_Y} D_\alpha^{\mathrm{s}}(P_{Y|X}\|Q_Y|P_X), \tag{9}$$
where the minimization is over all PMFs on the set Y . (Gallager’s E 0 function [11] and I α s ( X ; Y ) are in one-to-one correspondence; see (65) below.) The analogous minimization of D α l ( · ) leads to the Lapidoth–Pfister mutual information J α ( X ; Y ) [3]:
$$J_\alpha(X;Y) \triangleq \min_{Q_X,\,Q_Y} D_\alpha(P_{XY}\|Q_X Q_Y) \tag{10}$$
$$= \min_{Q_Y} D_\alpha^{\mathrm{l}}(P_{Y|X}\|Q_Y|P_X), \tag{11}$$
where (11) is proved in Proposition 5.
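For readers who prefer to experiment numerically, the following minimal sketch (Python with NumPy; the small PMFs are made up for illustration and are not taken from the paper) evaluates the Rényi divergence (4) and the three conditional divergences (5)–(7) on a toy example.

```python
import numpy as np

def renyi_div(p, q, alpha):
    """Renyi divergence D_alpha(p || q) in bits, for alpha in (0,1) or (1,inf)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.log2(np.sum(p**alpha * q**(1.0 - alpha))) / (alpha - 1.0)

def cond_div_csiszar(PYgX, QYgX, PX, alpha):
    """Definition (5): the P_X-average of the per-x Renyi divergences."""
    return sum(PX[x] * renyi_div(PYgX[x], QYgX[x], alpha)
               for x in range(len(PX)) if PX[x] > 0)

def cond_div_sibson(PYgX, QYgX, PX, alpha):
    """Definition (6): D_alpha between the joints P_X P_{Y|X} and P_X Q_{Y|X}."""
    return renyi_div((PX[:, None] * PYgX).ravel(), (PX[:, None] * QYgX).ravel(), alpha)

def cond_div_new(PYgX, QYgX, PX, alpha):
    """Definition (7): exponential averaging of the per-x Renyi divergences."""
    s = sum(PX[x] * 2.0**((alpha - 1.0) / alpha * renyi_div(PYgX[x], QYgX[x], alpha))
            for x in range(len(PX)) if PX[x] > 0)
    return alpha / (alpha - 1.0) * np.log2(s)

PX = np.array([0.5, 0.5])
PYgX = np.array([[0.9, 0.1], [0.2, 0.8]])   # row x is P_{Y|X=x}
QYgX = np.array([[0.5, 0.5], [0.5, 0.5]])
for a in (0.5, 2.0):
    print(a, cond_div_csiszar(PYgX, QYgX, PX, a),
          cond_div_sibson(PYgX, QYgX, PX, a),
          cond_div_new(PYgX, QYgX, PX, a))
```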
The first part of the paper is structured as follows: In Section 2, we discuss some preliminaries. In Section 3, Section 4 and Section 5, we study the properties of the three conditional Rényi divergences and their associated measure of dependence. In Section 6, we express the Arimoto–Rényi conditional entropy H α ( X | Y ) and the Arimoto measure of dependence I α a ( X ; Y ) [12] in terms of D α l ( P X | Y U X | P Y ) . In Section 7, we relate the conditional Rényi divergences to each other and discuss the relations between the Rényi dependence measures.
The second part of the paper deals with horse gambling under our proposed family of power-mean utility functions. It is in this context that the Rényi divergence (Theorem 9) and the conditional Rényi divergence D α l ( · ) (Theorem 10) appear naturally.
More specifically, consider a horse race with a finite nonempty set of horses $\mathcal{X}$, where a bookmaker offers odds o(x)-for-1 on each horse $x \in \mathcal{X}$, where $o\colon \mathcal{X} \to (0,\infty)$ [2] (Section 6.1). A gambler spends all her wealth placing bets on the horses. The fraction of her wealth that she bets on Horse $x \in \mathcal{X}$ is denoted $b(x) \geq 0$; these fractions sum to one over $x \in \mathcal{X}$, so b is a PMF, which we call her “betting strategy.” The winning horse, which we denote X, is drawn according to the PMF p, where we assume $p(x) > 0$ for all $x \in \mathcal{X}$. The wealth relative (or end-to-beginning wealth ratio) is the random variable
$$S \triangleq b(X)\, o(X). \tag{12}$$
Hence, given an initial wealth γ , the gambler’s wealth after the race is γ S . We seek betting strategies that maximize the utility function
$$U_\beta \triangleq \begin{cases} \frac{1}{\beta}\log \mathbb{E}\bigl[S^\beta\bigr] & \text{if } \beta \neq 0,\\ \mathbb{E}[\log S] & \text{if } \beta = 0,\end{cases} \tag{13}$$
where β R is a parameter that accounts for the risk sensitivity. This optimization generalizes the following cases:
(a)
In the limit as β tends to , we optimize the worst-case return. The optimal strategy is risk-free in the sense that S does not depend on the winning horse (see Proposition 8).
(b)
If β = 0 , then we optimize E [ log S ] , which is known as the doubling rate [2] (Section 6.1). The optimal strategy is proportional betting, i.e., to choose b = p (see Remark 4).
(c)
If β = 1 , then we optimize E [ S ] , the expected return. The optimal strategy is to put all the money on a horse that maximizes p ( x ) o ( x ) (see Proposition 9).
(d)
In general, if β 1 , then it is optimal to put all the money on one horse (see Proposition 9). This is risky: if that horse loses, the gambler will go broke.
(e)
In the limit as β tends to + , we optimize the best-case return. The optimal strategy is to put all the money on a horse that maximizes o ( x ) (see Proposition 10).
Note that, for $\beta \neq 0$ and $\eta \triangleq 1 - \beta$, maximizing $U_\beta$ is equivalent to maximizing
$$\mathbb{E}\!\left[\frac{S^{1-\eta}}{1-\eta}\right], \tag{14}$$
which is known in the finance literature as Constant Relative Risk Aversion (CRRA) [13,14].
We refer to our utility function as “power mean” because it can be written as the logarithm of a weighted power mean [15,16]:
$$U_\beta = \log\left[\sum_x p(x)\,\bigl(b(x)\, o(x)\bigr)^\beta\right]^{\frac{1}{\beta}}. \tag{15}$$
Because the power mean tends to the geometric mean as β tends to zero [15] (Problem 8.1), U β is continuous at β = 0 :
$$\lim_{\beta\to 0} U_\beta = \log\prod_x \bigl(b(x)\, o(x)\bigr)^{p(x)} \tag{16}$$
$$= \mathbb{E}[\log S]. \tag{17}$$
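As a small illustration of (13) and of the power-mean form (15), the following sketch (Python with NumPy; the race and the betting strategy below are made-up examples) evaluates $U_\beta$ for a fixed strategy and shows numerically that $U_\beta$ approaches the doubling rate as $\beta$ tends to zero, as in (16)–(17).

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])   # winning probabilities
o = np.array([2.0, 4.0, 5.0])   # o(x)-for-1 odds
b = np.array([0.6, 0.2, 0.2])   # a betting strategy (a PMF)

def utility(beta):
    """U_beta of (13)/(15): log of the p-weighted power mean of b(x)o(x), in bits."""
    if beta == 0.0:
        return float(np.sum(p * np.log2(b * o)))       # doubling rate E[log S]
    return float(np.log2(np.sum(p * (b * o)**beta)) / beta)

print(utility(1.0))     # log of the expected return
print(utility(0.0))     # doubling rate
print(utility(1e-6))    # close to utility(0.0), illustrating (16)-(17)
print(utility(-50.0))   # approaches log min_x b(x)o(x) as beta -> -infinity
```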
Campbell [17,18] used an exponential cost function with a similar structure to (15) to provide an operational meaning to the Rényi entropy in source coding. Other information-theoretic applications of exponential moments were studied in [19].
The second part of the paper is structured as follows: In Section 8, we relate the utility function U β to the Rényi divergence (Theorem 9) and derive its optimal gambling strategy. In Section 9, we consider the situation where the gambler observes side information prior to betting, a situation that leads to the conditional Rényi divergence D α l ( · ) (Theorem 10) and to a new operational meaning for the measure of dependence J α ( X ; Y ) (Theorem 11). In Section 10, we consider the situation where the gambler invests only part of her money. In Section 11, we present a universal strategy for independent and identically distributed (IID) races that requires neither knowledge of the winning probabilities nor of the parameter β of the utility function and yet asymptotically maximizes the utility function for all PMFs p and all β R .

2. Preliminaries

Throughout the paper, log ( · ) denotes the base-2 logarithm, X and Y are finite sets, P X Y denotes a joint PMF over X × Y , Q X denotes a PMF over X , and Q Y denotes a PMF over Y . An expression of the form P X P Y | X denotes the PMF on X × Y that assigns ( x , y ) the probability P X ( x ) P Y | X ( y | x ) . We use P and Q as generic PMFs over a finite set X . We denote by supp ( P ) { x X : P ( x ) > 0 } the support of P, and by P ( X ) the set of all PMFs over X . When clear from the context, we often omit sets and subscripts: for example, we write x for x X , min Q X , Q Y for min ( Q X , Q Y ) P ( X ) × P ( Y ) , P ( x ) for P X ( x ) , and P ( y | x ) for P Y | X ( y | x ) . When P ( x ) is 0, we define the conditional probability P ( y | x ) as 1 / | Y | . The conditional distribution of Y given X = x is denoted by P Y | X = x , thus
$$P_{Y|X=x}(y) = P(y|x). \tag{18}$$
We denote by 𝟙 { condition } the indicator function that is one if the condition is satisfied and zero otherwise.
In the definition of the Kullback–Leibler divergence in (1), we use the conventions
$$0\log\frac{0}{q} = 0 \;\;\forall\, q \geq 0, \qquad p\log\frac{p}{0} = \infty \;\;\forall\, p > 0. \tag{19}$$
In the definition of the Rényi divergence in (4), we read P ( x ) α Q ( x ) 1 α as P ( x ) α / Q ( x ) α 1 for α > 1 and use the conventions
$$\frac{0}{0} = 0, \qquad \frac{p}{0} = \infty \;\;\forall\, p > 0. \tag{20}$$
For α being zero, one, or infinity, we define by continuous extension of (4)
$$D_0(P\|Q) \triangleq -\log\sum_{x\in\mathrm{supp}(P)} Q(x), \tag{21}$$
$$D_1(P\|Q) \triangleq D(P\|Q), \tag{22}$$
$$D_\infty(P\|Q) \triangleq \log\max_x\frac{P(x)}{Q(x)}. \tag{23}$$
The Rényi divergence for negative α is defined as
$$D_\alpha(P\|Q) \triangleq \frac{1}{\alpha-1}\log\sum_x Q(x)^{1-\alpha}\, P(x)^{\alpha}. \tag{24}$$
(We use negative α in the proof of Proposition 1 (e) below and in Remark 6. More about negative orders can be found in [8] (Section V). For other applications of negative orders, see [20] (Proof of Theorem 1 and Example 1).)
The Rényi divergence satisfies the following basic properties:
Proposition 1.
Let P and Q be PMFs. Then, the Rényi divergence D α ( P Q ) satisfies the following:
(a) 
For all α [ 0 , ] , D α ( P Q ) 0 . If α ( 0 , ] , then D α ( P Q ) = 0 if and only if P = Q .
(b) 
For all α [ 0 , 1 ) , D α ( P Q ) is finite if and only if | supp ( P ) supp ( Q ) | > 0 . For all α [ 1 , ] , D α ( P Q ) is finite if and only if supp ( P ) supp ( Q ) .
(c) 
The mapping α D α ( P Q ) is continuous on [ 0 , ] .
(d) 
The mapping α D α ( P Q ) is nondecreasing on [ 0 , ] .
(e) 
The mapping α 1 α α D α ( P Q ) is nonincreasing on ( 0 , ) .
(f) 
The mapping α ( 1 α ) D α ( P Q ) is concave on [ 0 , ) .
(g) 
The mapping α ( α 1 ) D 1 / α ( P Q ) is concave on ( 0 , ) .
(h) 
(Data-processing inequality.) Let $A_{X'|X}$ be a conditional PMF, and define the PMFs
$$P'(x') \triangleq \sum_x P(x)\, A_{X'|X}(x'|x), \tag{25}$$
$$Q'(x') \triangleq \sum_x Q(x)\, A_{X'|X}(x'|x). \tag{26}$$
Then, for all $\alpha \in [0,\infty]$,
$$D_\alpha(P'\|Q') \leq D_\alpha(P\|Q). \tag{27}$$
Proof. 
See Appendix A. □
All three conditional Rényi divergences reduce to the unconditional Rényi divergence when both P Y | X and Q Y | X are independent of X:
Remark 1.
Let P Y , Q Y , and P X be PMFs. Then, for all α [ 0 , ] ,
$$D_\alpha^{\mathrm{c}}(P_Y\|Q_Y|P_X) = D_\alpha^{\mathrm{s}}(P_Y\|Q_Y|P_X) = D_\alpha^{\mathrm{l}}(P_Y\|Q_Y|P_X) = D_\alpha(P_Y\|Q_Y). \tag{28}$$
Proof. 
This follows from the definitions of D α c ( · ) , D α s ( · ) , and D α l ( · ) in (5)–(7). □

3. Csiszár’s Conditional Rényi Divergence

For a PMF P X and conditional PMFs P Y | X and Q Y | X , Csiszár’s conditional Rényi divergence D α c ( · ) is defined for every α [ 0 , ] as
$$D_\alpha^{\mathrm{c}}(P_{Y|X}\|Q_{Y|X}|P_X) \triangleq \sum_{x\in\mathrm{supp}(P_X)} P(x)\, D_\alpha(P_{Y|X=x}\|Q_{Y|X=x}). \tag{29}$$
For α ( 0 , 1 ) ( 1 , ) ,
$$D_\alpha^{\mathrm{c}}(P_{Y|X}\|Q_{Y|X}|P_X) = \frac{1}{\alpha-1}\sum_{x\in\mathrm{supp}(P_X)} P(x)\log\sum_y P(y|x)^\alpha\, Q(y|x)^{1-\alpha}, \tag{30}$$
which follows from the definition of the Rényi divergence in (4). For α being zero, one, or infinity, we obtain from (21)–(23) and (2)
$$D_0^{\mathrm{c}}(P_{Y|X}\|Q_{Y|X}|P_X) = -\sum_{x\in\mathrm{supp}(P_X)} P(x)\log\sum_{y\in\mathrm{supp}(P_{Y|X=x})} Q(y|x), \tag{31}$$
$$D_1^{\mathrm{c}}(P_{Y|X}\|Q_{Y|X}|P_X) = D(P_{Y|X}\|Q_{Y|X}|P_X), \tag{32}$$
$$D_\infty^{\mathrm{c}}(P_{Y|X}\|Q_{Y|X}|P_X) = \sum_{x\in\mathrm{supp}(P_X)} P(x)\log\max_y\frac{P(y|x)}{Q(y|x)}. \tag{33}$$
Augustin [21] and later Csiszár [9] defined the measure of dependence
$$I_\alpha^{\mathrm{c}}(X;Y) \triangleq \min_{Q_Y} D_\alpha^{\mathrm{c}}(P_{Y|X}\|Q_Y|P_X). \tag{34}$$
Augustin used this measure to study the error exponents for channel coding with input constraints, while Csiszár used it to study generalized cutoff rates for channel coding with composition constraints. Nakiboğlu [22] studied more properties of I α c ( X ; Y ) . Inter alia, he analyzed the minimax properties of the Augustin capacity
$$\sup_{P_X\in\mathcal{A}} I_\alpha^{\mathrm{c}}(P_X, P_{Y|X}) = \sup_{P_X\in\mathcal{A}}\min_{Q_Y} D_\alpha^{\mathrm{c}}(P_{Y|X}\|Q_Y|P_X), \tag{35}$$
where $\mathcal{A} \subseteq \mathcal{P}(\mathcal{X})$ is a constraint set. The Augustin capacity is used in [23] to establish the sphere packing bound for memoryless channels with cost constraints.
The rest of the section presents some properties of D α c ( · ) . Being an average of Rényi divergences (see (29)), D α c ( · ) inherits many properties from the Rényi divergence:
Proposition 2.
Let P X be a PMF, and let P Y | X and Q Y | X be conditional PMFs. Then,
(a) 
For all α [ 0 , ] , D α c ( P Y | X Q Y | X | P X ) 0 . If α ( 0 , ] , then D α c ( P Y | X Q Y | X | P X ) = 0 if and only if ( P Y | X = x = Q Y | X = x for all x supp ( P X ) ) .
(b) 
For all α [ 0 , 1 ) , D α c ( P Y | X Q Y | X | P X ) is finite if and only if ( | supp ( P Y | X = x ) supp ( Q Y | X = x ) | > 0 for all x supp ( P X ) ) . For all α [ 1 , ] , D α c ( P Y | X Q Y | X | P X ) is finite if and only if ( supp ( P Y | X = x ) supp ( Q Y | X = x ) for all x supp ( P X ) ) .
(c) 
The mapping α D α c ( P Y | X Q Y | X | P X ) is continuous on [ 0 , ] .
(d) 
The mapping α D α c ( P Y | X Q Y | X | P X ) is nondecreasing on [ 0 , ] .
(e) 
The mapping α 1 α α D α c ( P Y | X Q Y | X | P X ) is nonincreasing on ( 0 , ) .
(f) 
The mapping α ( 1 α ) D α c ( P Y | X Q Y | X | P X ) is concave on [ 0 , ) .
(g) 
The mapping α ( α 1 ) D 1 / α c ( P Y | X Q Y | X | P X ) is concave on ( 0 , ) .
Proof. 
These follow from (29) and the properties of the Rényi divergence (Proposition 1). For Parts (f) and (g), recall that a nonnegative weighted sum of concave functions is concave. □
We next consider data-processing inequalities for D α c ( · ) . We distinguish between processing Y and processing X. The data-processing inequality for processing Y follows from the data-processing inequality for the (unconditional) Rényi divergence:
Theorem 1.
Let $P_X$ be a PMF, and let $P_{Y|X}$ and $Q_{Y|X}$ be conditional PMFs. For a conditional PMF $A_{Y'|XY}$, define
$$P'_{Y'|X}(y'|x) \triangleq \sum_y P_{Y|X}(y|x)\, A_{Y'|XY}(y'|x,y), \tag{36}$$
$$Q'_{Y'|X}(y'|x) \triangleq \sum_y Q_{Y|X}(y|x)\, A_{Y'|XY}(y'|x,y). \tag{37}$$
Then, for all $\alpha \in [0,\infty]$,
$$D_\alpha^{\mathrm{c}}(P'_{Y'|X}\|Q'_{Y'|X}|P_X) \leq D_\alpha^{\mathrm{c}}(P_{Y|X}\|Q_{Y|X}|P_X). \tag{38}$$
Proof. 
See Appendix B. □
The following data-processing inequality for processing X holds for α [ 0 , 1 ] (as shown in Example 1 below, it does not extend to α ( 1 , ] ):
Theorem 2.
Let $P_X$ be a PMF, and let $P_{Y|X}$ and $Q_{Y|X}$ be conditional PMFs. For a conditional PMF $B_{X'|X}$, define the PMFs
$$P'_{X'}(x') \triangleq \sum_x P_X(x)\, B_{X'|X}(x'|x), \tag{39}$$
$$B_{X|X'}(x|x') \triangleq \begin{cases} P_X(x)\, B_{X'|X}(x'|x)\,/\,P'_{X'}(x') & \text{if } P'_{X'}(x') > 0,\\ 1/|\mathcal{X}| & \text{otherwise,}\end{cases} \tag{40}$$
$$P'_{Y|X'}(y|x') \triangleq \sum_x B_{X|X'}(x|x')\, P_{Y|X}(y|x), \tag{41}$$
$$Q'_{Y|X'}(y|x') \triangleq \sum_x B_{X|X'}(x|x')\, Q_{Y|X}(y|x). \tag{42}$$
Then, for all $\alpha \in [0,1]$,
$$D_\alpha^{\mathrm{c}}(P'_{Y|X'}\|Q'_{Y|X'}|P'_{X'}) \leq D_\alpha^{\mathrm{c}}(P_{Y|X}\|Q_{Y|X}|P_X). \tag{43}$$
Note that $P'_{X'}$, $P'_{Y|X'}$, and $Q'_{Y|X'}$ in Theorem 2 can be obtained from the following marginalizations:
$$P'_{X'}(x')\, P'_{Y|X'}(y|x') = \sum_x P_X(x)\, B_{X'|X}(x'|x)\, P_{Y|X}(y|x), \tag{44}$$
$$P'_{X'}(x')\, Q'_{Y|X'}(y|x') = \sum_x P_X(x)\, B_{X'|X}(x'|x)\, Q_{Y|X}(y|x). \tag{45}$$
Proof of Theorem 2.
See Appendix C. □
As a special case of Theorem 2, we obtain the following relation between the conditional and the unconditional Rényi divergence:
Corollary 1.
For a PMF $P_X$ and conditional PMFs $P_{Y|X}$ and $Q_{Y|X}$, define the marginal PMFs
$$P_Y(y) \triangleq \sum_x P_X(x)\, P_{Y|X}(y|x), \tag{46}$$
$$Q_Y(y) \triangleq \sum_x P_X(x)\, Q_{Y|X}(y|x). \tag{47}$$
Then, for all $\alpha \in [0,1]$,
$$D_\alpha(P_Y\|Q_Y) \leq D_\alpha^{\mathrm{c}}(P_{Y|X}\|Q_{Y|X}|P_X). \tag{48}$$
Proof. 
See Appendix D. □
Consider next α ( 1 , ] . It turns out that Corollary 1, and hence Theorem 2, cannot be extended to these values of α (not even if Q Y | X is restricted to be independent of X, i.e., if Q Y | X = Q Y ):
Example 1.
Let $\mathcal{X} = \mathcal{Y} = \{0,1\}$. For $\epsilon \in (0,1)$, define the PMFs $P_X$, $Q_Y^{(\epsilon)}$, and $P_{Y|X}^{(\epsilon)}$ as
$$P_X(0) = 0.5, \quad P_X(1) = 0.5, \tag{49}$$
$$Q_Y^{(\epsilon)}(0) = 1-\epsilon, \quad Q_Y^{(\epsilon)}(1) = \epsilon, \tag{50}$$
$$P_{Y|X}^{(\epsilon)}(0|0) = 1-\epsilon, \quad P_{Y|X}^{(\epsilon)}(1|0) = \epsilon, \tag{51}$$
$$P_{Y|X}^{(\epsilon)}(0|1) = \epsilon, \quad P_{Y|X}^{(\epsilon)}(1|1) = 1-\epsilon. \tag{52}$$
Then, for every $\alpha \in (1,\infty]$, there exists an $\epsilon \in (0,1)$ such that
$$D_\alpha\bigl(P_Y\big\|Q_Y^{(\epsilon)}\bigr) > D_\alpha^{\mathrm{c}}\bigl(P_{Y|X}^{(\epsilon)}\big\|Q_Y^{(\epsilon)}\big|P_X\bigr), \tag{53}$$
where the PMF $P_Y$ is defined by (46) and, irrespective of $\epsilon$, satisfies $P_Y(0) = P_Y(1) = 0.5$.
Proof. 
See Appendix E. □

4. Sibson’s Conditional Rényi Divergence

For a PMF P X and conditional PMFs P Y | X and Q Y | X , Sibson’s conditional Rényi divergence D α s ( · ) is defined for every α [ 0 , ] as
$$D_\alpha^{\mathrm{s}}(P_{Y|X}\|Q_{Y|X}|P_X) \triangleq D_\alpha(P_X P_{Y|X}\|P_X Q_{Y|X}). \tag{54}$$
For α ( 0 , 1 ) ( 1 , ) ,
$$D_\alpha^{\mathrm{s}}(P_{Y|X}\|Q_{Y|X}|P_X) = \frac{1}{\alpha-1}\log\sum_{x\in\mathrm{supp}(P_X)} P(x)\sum_y P(y|x)^\alpha\, Q(y|x)^{1-\alpha} \tag{55}$$
$$= \frac{1}{\alpha-1}\log\sum_{x\in\mathrm{supp}(P_X)} P(x)\, 2^{(\alpha-1) D_\alpha(P_{Y|X=x}\|Q_{Y|X=x})}, \tag{56}$$
where (55) and (56) follow from the definition of the Rényi divergence in (4). For α being zero, one, or infinity, we obtain from (21)–(23) and (3)
$$D_0^{\mathrm{s}}(P_{Y|X}\|Q_{Y|X}|P_X) = -\log\sum_{x\in\mathrm{supp}(P_X)} P(x)\sum_{y\in\mathrm{supp}(P_{Y|X=x})} Q(y|x), \tag{57}$$
$$D_1^{\mathrm{s}}(P_{Y|X}\|Q_{Y|X}|P_X) = D(P_{Y|X}\|Q_{Y|X}|P_X), \tag{58}$$
$$D_\infty^{\mathrm{s}}(P_{Y|X}\|Q_{Y|X}|P_X) = \log\max_{x\in\mathrm{supp}(P_X)}\max_y\frac{P(y|x)}{Q(y|x)}. \tag{59}$$
Sibson [10] defined the measure of dependence
$$I_\alpha^{\mathrm{s}}(X;Y) \triangleq \min_{Q_Y} D_\alpha^{\mathrm{s}}(P_{Y|X}\|Q_Y|P_X). \tag{60}$$
This minimum can be computed explicitly [10] (Corollary 2.3): For α ( 0 , 1 ) ( 1 , ) ,
$$I_\alpha^{\mathrm{s}}(X;Y) = \frac{\alpha}{\alpha-1}\log\sum_y\left[\sum_x P(x)\, P(y|x)^\alpha\right]^{\frac{1}{\alpha}}, \tag{61}$$
and for α being one or infinity,
$$I_1^{\mathrm{s}}(X;Y) = I(X;Y), \tag{62}$$
$$I_\infty^{\mathrm{s}}(X;Y) = \log\sum_y\max_x P(y|x), \tag{63}$$
where I ( X ; Y ) denotes Shannon’s mutual information.
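The closed form (61) is easy to evaluate numerically. The short sketch below (Python with NumPy; the joint distribution is an arbitrary toy example) computes $I_\alpha^{\mathrm{s}}(X;Y)$ and illustrates the limit (62).

```python
import numpy as np

PX = np.array([0.4, 0.6])
PYgX = np.array([[0.8, 0.2], [0.3, 0.7]])   # row x is P_{Y|X=x}

def sibson_info(alpha):
    """I_alpha^s(X;Y) via the closed form (61), in bits."""
    inner = np.sum(PX[:, None] * PYgX**alpha, axis=0)   # sum_x P(x) P(y|x)^alpha, per y
    return alpha / (alpha - 1.0) * np.log2(np.sum(inner**(1.0 / alpha)))

def shannon_mi():
    PXY = PX[:, None] * PYgX
    PY = PXY.sum(axis=0)
    return float(np.sum(PXY * np.log2(PXY / (PX[:, None] * PY[None, :]))))

print(sibson_info(0.5), sibson_info(2.0))
print(sibson_info(1.0 + 1e-6), shannon_mi())   # nearly equal, illustrating (62)
```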
The concavity and convexity properties of D α s ( · ) and I α s ( X ; Y ) were studied by Ho–Verdú [24]. More properties of I α s ( X ; Y ) were collected by Verdú [25]. The maximization of I α s ( X ; Y ) with respect to P X and the minimax properties of D α s ( · ) were studied by Nakiboğlu [26] and Cai–Verdú [27].
The conditional Rényi divergence D α s ( · ) was used by Fong and Tan [28] to establish strong converse theorems for multicast networks. Yu and Tan [29] analyzed channel resolvability, among other measures, in terms of D α s ( · ) .
From (61) we see that Gallager’s E 0 function [11], which is defined as
$$E_0(\rho, P_X, P_{Y|X}) \triangleq -\log\sum_y\left[\sum_x P(x)\, P(y|x)^{\frac{1}{1+\rho}}\right]^{1+\rho}, \tag{64}$$
is in one-to-one correspondence to Sibson’s measure of dependence:
$$I_\alpha^{\mathrm{s}}(X;Y) = \frac{\alpha}{1-\alpha}\, E_0\!\left(\frac{1-\alpha}{\alpha},\, P_X,\, P_{Y|X}\right). \tag{65}$$
Gallager’s E 0 function is important in channel coding: it appears in the random coding exponent [30] and in the sphere packing exponent [31,32] (see also Gallager [11]). The exponential strong converse theorem proved by Arimoto [33] also uses the E 0 function. Polyanskiy and Verdú [34] extended the exponential strong converse theorem to channels with feedback. Augustin [21] and Nakiboğlu [35,36] extended the sphere packing bound to channels with feedback.
The rest of the section presents some properties of D α s ( · ) . Because D α s ( · ) can be written as an (unconditional) Rényi divergence (see (54)), it inherits many properties from the Rényi divergence:
Proposition 3.
Let P X be a PMF, and let P Y | X and Q Y | X be conditional PMFs. Then,
(a) 
For all α [ 0 , ] , D α s ( P Y | X Q Y | X | P X ) 0 . If α ( 0 , ] , then D α s ( P Y | X Q Y | X | P X ) = 0 if and only if ( P Y | X = x = Q Y | X = x for all x supp ( P X ) ) .
(b) 
For all α [ 0 , 1 ) , D α s ( P Y | X Q Y | X | P X ) is finite if and only if (there exists an x supp ( P X ) such that | supp ( P Y | X = x ) supp ( Q Y | X = x ) | > 0 ) . For all α [ 1 , ] , D α s ( P Y | X Q Y | X | P X ) is finite if and only if ( supp ( P Y | X = x ) supp ( Q Y | X = x ) for all x supp ( P X ) ) .
(c) 
The mapping α D α s ( P Y | X Q Y | X | P X ) is continuous on [ 0 , ] .
(d) 
The mapping α D α s ( P Y | X Q Y | X | P X ) is nondecreasing on [ 0 , ] .
(e) 
The mapping α 1 α α D α s ( P Y | X Q Y | X | P X ) is nonincreasing on ( 0 , ) .
(f) 
The mapping α ( 1 α ) D α s ( P Y | X Q Y | X | P X ) is concave on [ 0 , ) .
(g) 
The mapping α ( α 1 ) D 1 / α s ( P Y | X Q Y | X | P X ) is concave on ( 0 , ) .
Proof. 
These follow from (54) and the properties of the Rényi divergence (Proposition 1). □
We next consider data-processing inequalities for D α s ( · ) . We distinguish between processing Y and processing X. The data-processing inequality for processing Y follows from the data-processing inequality for the (unconditional) Rényi divergence:
Theorem 3.
Let $P_X$ be a PMF, and let $P_{Y|X}$ and $Q_{Y|X}$ be conditional PMFs. For a conditional PMF $A_{Y'|XY}$, define
$$P'_{Y'|X}(y'|x) \triangleq \sum_y P_{Y|X}(y|x)\, A_{Y'|XY}(y'|x,y), \tag{66}$$
$$Q'_{Y'|X}(y'|x) \triangleq \sum_y Q_{Y|X}(y|x)\, A_{Y'|XY}(y'|x,y). \tag{67}$$
Then, for all $\alpha \in [0,\infty]$,
$$D_\alpha^{\mathrm{s}}(P'_{Y'|X}\|Q'_{Y'|X}|P_X) \leq D_\alpha^{\mathrm{s}}(P_{Y|X}\|Q_{Y|X}|P_X). \tag{68}$$
Proof. 
See Appendix F. □
The data-processing inequality for processing X similarly follows from the data-processing inequality for the (unconditional) Rényi divergence:
Theorem 4.
Let $P_X$ be a PMF, and let $P_{Y|X}$ and $Q_{Y|X}$ be conditional PMFs. For a conditional PMF $B_{X'|X}$, define the PMFs
$$P'_{X'}(x') \triangleq \sum_x P_X(x)\, B_{X'|X}(x'|x), \tag{69}$$
$$B_{X|X'}(x|x') \triangleq \begin{cases} P_X(x)\, B_{X'|X}(x'|x)\,/\,P'_{X'}(x') & \text{if } P'_{X'}(x') > 0,\\ 1/|\mathcal{X}| & \text{otherwise,}\end{cases} \tag{70}$$
$$P'_{Y|X'}(y|x') \triangleq \sum_x B_{X|X'}(x|x')\, P_{Y|X}(y|x), \tag{71}$$
$$Q'_{Y|X'}(y|x') \triangleq \sum_x B_{X|X'}(x|x')\, Q_{Y|X}(y|x). \tag{72}$$
Then, for all $\alpha \in [0,\infty]$,
$$D_\alpha^{\mathrm{s}}(P'_{Y|X'}\|Q'_{Y|X'}|P'_{X'}) \leq D_\alpha^{\mathrm{s}}(P_{Y|X}\|Q_{Y|X}|P_X). \tag{73}$$
Proof. 
See Appendix G. □
As a special case of Theorem 4, we obtain the following relation between the conditional and the unconditional Rényi divergence:
Corollary 2.
Let $P_X$ be a PMF, and let $P_{Y|X}$ and $Q_{Y|X}$ be conditional PMFs. Define the marginal PMFs
$$P_Y(y) \triangleq \sum_x P_X(x)\, P_{Y|X}(y|x), \tag{74}$$
$$Q_Y(y) \triangleq \sum_x P_X(x)\, Q_{Y|X}(y|x). \tag{75}$$
Then, for all $\alpha \in [0,\infty]$,
$$D_\alpha(P_Y\|Q_Y) \leq D_\alpha^{\mathrm{s}}(P_{Y|X}\|Q_{Y|X}|P_X). \tag{76}$$
Proof. 
This follows from Theorem 4 in the same way that Corollary 1 followed from Theorem 2. □

5. New Conditional Rényi Divergence

Let $P_X$ be a PMF, and let $P_{Y|X}$ and $Q_{Y|X}$ be conditional PMFs. For $\alpha \in (0,1)\cup(1,\infty)$, define
$$D_\alpha^{\mathrm{l}}(P_{Y|X}\|Q_{Y|X}|P_X) \triangleq \frac{\alpha}{\alpha-1}\log\sum_{x\in\mathrm{supp}(P_X)} P(x)\, 2^{\frac{\alpha-1}{\alpha} D_\alpha(P_{Y|X=x}\|Q_{Y|X=x})} \tag{77}$$
$$= \frac{\alpha}{\alpha-1}\log\sum_{x\in\mathrm{supp}(P_X)} P(x)\left[\sum_y P(y|x)^\alpha\, Q(y|x)^{1-\alpha}\right]^{\frac{1}{\alpha}}, \tag{78}$$
where (78) follows from the definition of the Rényi divergence in (4). (Except for the sign, the exponential averaging in (77) is very similar to the one of the Arimoto–Rényi conditional entropy; compare with (147) below.) For α being zero, one, or infinity, we define by continuous extension of (77)
$$D_0^{\mathrm{l}}(P_{Y|X}\|Q_{Y|X}|P_X) \triangleq -\log\max_{x\in\mathrm{supp}(P_X)}\sum_{y\in\mathrm{supp}(P_{Y|X=x})} Q(y|x), \tag{79}$$
$$D_1^{\mathrm{l}}(P_{Y|X}\|Q_{Y|X}|P_X) \triangleq D(P_{Y|X}\|Q_{Y|X}|P_X), \tag{80}$$
$$D_\infty^{\mathrm{l}}(P_{Y|X}\|Q_{Y|X}|P_X) \triangleq \log\sum_{x\in\mathrm{supp}(P_X)} P(x)\max_y\frac{P(y|x)}{Q(y|x)}. \tag{81}$$
This conditional Rényi divergence has an operational meaning in horse betting with side information (see Theorem 10 below). Before discussing the measure of dependence associated with D α l ( · ) , we establish the following alternative characterization of D α l ( · ) :
Proposition 4.
Let P X be a PMF, and let P Y | X and Q Y | X be conditional PMFs. Then, for all α [ 0 , ] ,
$$D_\alpha^{\mathrm{l}}(P_{Y|X}\|Q_{Y|X}|P_X) = \min_{Q_X} D_\alpha(P_X P_{Y|X}\|Q_X Q_{Y|X}). \tag{82}$$
Proof. 
We first treat the case $\alpha \in (0,1)\cup(1,\infty)$. Some algebra reveals that, for every PMF $Q_X$,
$$D_\alpha(P_X P_{Y|X}\|Q_X Q_{Y|X}) = D_\alpha\bigl(Q_X^{(\alpha)}\big\|Q_X\bigr) + \frac{\alpha}{\alpha-1}\log\sum_{x\in\mathrm{supp}(P_X)} P(x)\left[\sum_y P(y|x)^\alpha\, Q(y|x)^{1-\alpha}\right]^{\frac{1}{\alpha}}, \tag{83}$$
where the PMF $Q_X^{(\alpha)}$ is defined as
$$Q_X^{(\alpha)}(x) \triangleq \frac{P(x)\left[\sum_y P(y|x)^\alpha\, Q(y|x)^{1-\alpha}\right]^{1/\alpha}}{\sum_{x'\in\mathrm{supp}(P_X)} P(x')\left[\sum_y P(y|x')^\alpha\, Q(y|x')^{1-\alpha}\right]^{1/\alpha}}. \tag{84}$$
The right-hand side (RHS) of (82) is thus equal to the minimum over Q X of the RHS of (83). Since D α Q X ( α ) Q X 0 with equality if Q X = Q X ( α ) (Proposition 1 (a)), this minimum is equal to the second term on the RHS of (83), which, by (78), equals D α l ( P Y | X Q Y | X | P X ) .
For α = 1 and α = , (82) follows from the same argument using that, for every PMF Q X ,
$$D_1(P_X P_{Y|X}\|Q_X Q_{Y|X}) = D(P_X\|Q_X) + D(P_{Y|X}\|Q_{Y|X}|P_X), \tag{85}$$
$$D_\infty(P_X P_{Y|X}\|Q_X Q_{Y|X}) = D_\infty\bigl(Q_X^{(\infty)}\big\|Q_X\bigr) + \log\sum_{x\in\mathrm{supp}(P_X)} P(x)\max_y\frac{P(y|x)}{Q(y|x)}, \tag{86}$$
where the PMF $Q_X^{(\infty)}$ is defined as
$$Q_X^{(\infty)}(x) \triangleq \frac{P(x)\max_y P(y|x)/Q(y|x)}{\sum_{x'\in\mathrm{supp}(P_X)} P(x')\max_y P(y|x')/Q(y|x')}. \tag{87}$$
For α = 0 , (82) holds because
$$\min_{Q_X} D_0(P_X P_{Y|X}\|Q_X Q_{Y|X}) = \min_{Q_X}\left[-\log\sum_{x\in\mathrm{supp}(P_X)} Q(x)\sum_{y\in\mathrm{supp}(P_{Y|X=x})} Q(y|x)\right] \tag{88}$$
$$= -\log\max_{Q_X}\sum_{x\in\mathrm{supp}(P_X)} Q(x)\sum_{y\in\mathrm{supp}(P_{Y|X=x})} Q(y|x) \tag{89}$$
$$= -\log\max_{x\in\mathrm{supp}(P_X)}\sum_{y\in\mathrm{supp}(P_{Y|X=x})} Q(y|x) \tag{90}$$
$$= D_0^{\mathrm{l}}(P_{Y|X}\|Q_{Y|X}|P_X), \tag{91}$$
where (88) follows from the definition of D 0 ( P Q ) in (21), and (91) follows from (79). □
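The closed-form minimizer (84) makes Proposition 4 easy to check numerically. The sketch below (Python with NumPy; the PMFs are chosen arbitrarily) computes $D_\alpha^{\mathrm{l}}$ via (78) and compares it with the unconditional divergence evaluated at $Q_X^{(\alpha)}$.

```python
import numpy as np

alpha = 0.7
PX = np.array([0.3, 0.7])
PYgX = np.array([[0.6, 0.4], [0.1, 0.9]])
QYgX = np.array([[0.5, 0.5], [0.8, 0.2]])

def renyi_div(p, q, a):
    return np.log2(np.sum(p**a * q**(1.0 - a))) / (a - 1.0)

# D_alpha^l via (78)
inner = np.sum(PYgX**alpha * QYgX**(1.0 - alpha), axis=1)   # sum_y P(y|x)^a Q(y|x)^(1-a)
D_l = alpha / (alpha - 1.0) * np.log2(np.sum(PX * inner**(1.0 / alpha)))

# The minimizing Q_X^(alpha) of (84) and the resulting unconditional divergence
QX = PX * inner**(1.0 / alpha)
QX = QX / QX.sum()
joint_p = (PX[:, None] * PYgX).ravel()
joint_q = (QX[:, None] * QYgX).ravel()
print(D_l, renyi_div(joint_p, joint_q, alpha))   # the two values agree
```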
Tomamichel and Hayashi [37] and Lapidoth and Pfister [3] independently introduced and studied the dependence measure
$$J_\alpha(X;Y) \triangleq \min_{Q_X,\,Q_Y} D_\alpha(P_{XY}\|Q_X Q_Y). \tag{92}$$
(For some measure-theoretic properties of J α ( X ; Y ) , see Aishwarya–Madiman [38].) The measure J α ( X ; Y ) can be related to the error exponents in a hypothesis testing problem where the samples are either from a known joint distribution or an unknown product distribution (see [37] (Equation (57)) and [39]). It also appears in horse betting with side information (see Theorem 11 below).
Similar to I α c ( X ; Y ) in (34) and I α s ( X ; Y ) in (60), the measure J α ( X ; Y ) can be expressed as a minimization involving the new conditional Rényi divergence:
Proposition 5.
Let P X Y be a joint PMF. Denote its marginal PMFs by P X and P Y and its conditional PMFs by P Y | X and P X | Y , so P X Y = P X P Y | X = P Y P X | Y . Then, for all α [ 0 , ] ,
$$J_\alpha(X;Y) = \min_{Q_Y} D_\alpha^{\mathrm{l}}(P_{Y|X}\|Q_Y|P_X) \tag{93}$$
$$= \min_{Q_X} D_\alpha^{\mathrm{l}}(P_{X|Y}\|Q_X|P_Y). \tag{94}$$
Proof. 
Equation (93) holds because
$$\min_{Q_Y} D_\alpha^{\mathrm{l}}(P_{Y|X}\|Q_Y|P_X) = \min_{Q_Y}\min_{Q_X} D_\alpha(P_X P_{Y|X}\|Q_X Q_Y) \tag{95}$$
$$= J_\alpha(X;Y), \tag{96}$$
where (95) follows from Proposition 4, and (96) follows from (92). Swapping the roles of X and Y establishes (94):
$$\min_{Q_X} D_\alpha^{\mathrm{l}}(P_{X|Y}\|Q_X|P_Y) = \min_{Q_X}\min_{Q_Y} D_\alpha(P_Y P_{X|Y}\|Q_Y Q_X) \tag{97}$$
$$= J_\alpha(X;Y), \tag{98}$$
where (97) follows from Proposition 4, and (98) follows from (92). □
The rest of the section presents some properties of D α l ( · ) .
Proposition 6.
Let P X be a PMF, and let P Y | X and Q Y | X be conditional PMFs. Then,
(a) 
For all α [ 0 , ] , D α l ( P Y | X Q Y | X | P X ) 0 . If α ( 0 , ] , then D α l ( P Y | X Q Y | X | P X ) = 0 if and only if ( P Y | X = x = Q Y | X = x for all x supp ( P X ) ) .
(b) 
For all α [ 0 , 1 ) , D α l ( P Y | X Q Y | X | P X ) is finite if and only if (there exists an x supp ( P X ) such that | supp ( P Y | X = x ) supp ( Q Y | X = x ) | > 0 ) . For all α [ 1 , ] , D α l ( P Y | X Q Y | X | P X ) is finite if and only if ( supp ( P Y | X = x ) supp ( Q Y | X = x ) for all x supp ( P X ) ) .
(c) 
The mapping α D α l ( P Y | X Q Y | X | P X ) is continuous on [ 0 , ] .
(d) 
The mapping α D α l ( P Y | X Q Y | X | P X ) is nondecreasing on [ 0 , ] .
(e) 
The mapping α 1 α α D α l ( P Y | X Q Y | X | P X ) is nonincreasing on ( 0 , ) .
(f) 
The mapping α ( 1 α ) D α l ( P Y | X Q Y | X | P X ) is concave on [ 0 , 1 ] .
(g) 
The mapping α ( α 1 ) D 1 / α l ( P Y | X Q Y | X | P X ) is concave on [ 1 , ) .
Proof. 
We prove these properties as follows:
(a)
For all α [ 0 , ] , Proposition 4 implies
$$D_\alpha^{\mathrm{l}}(P_{Y|X}\|Q_{Y|X}|P_X) = \min_{Q_X} D_\alpha(P_X P_{Y|X}\|Q_X Q_{Y|X}). \tag{99}$$
The nonnegativity of D α l ( · ) now follows from the nonnegativity of the Rényi divergence (Proposition 1 (a)). If ( P Y | X = x = Q Y | X = x for all x supp ( P X ) ) , then P X P Y | X = P X Q Y | X . Hence, using Q X = P X on the RHS of (99), D α l ( P Y | X Q Y | X | P X ) equals zero. Conversely, if α ( 0 , ] and D α l ( · ) = 0 , then P X P Y | X = Q X Q Y | X for some Q X by Proposition 1 (a), which implies ( P Y | X = x = Q Y | X = x for all x supp ( P X ) ) .
(b)
This follows from the definitions in (77) and (79)–(81) and the conventions in (20).
(c)
For α ( 0 , 1 ) ( 1 , ) , D α l ( · ) is continuous because it is, by its definition in (77), a composition of continuous functions. The continuity at α = 1 follows from a careful application of L’Hôpital’s rule.
We next consider the continuity at α = 0 . Define τ min x supp ( P X ) P ( x ) . Then, for all α ( 0 , 1 ) ,
$$(\alpha-1)\, D_\alpha^{\mathrm{l}}(P_{Y|X}\|Q_{Y|X}|P_X) = \alpha\log\sum_{x\in\mathrm{supp}(P_X)} P(x)\, 2^{\frac{\alpha-1}{\alpha} D_\alpha(P_{Y|X=x}\|Q_{Y|X=x})} \tag{100}$$
$$\geq \alpha\log\sum_{x\in\mathrm{supp}(P_X)} \tau\, 2^{\frac{\alpha-1}{\alpha} D_\alpha(P_{Y|X=x}\|Q_{Y|X=x})} \tag{101}$$
$$\geq \alpha\log\max_{x\in\mathrm{supp}(P_X)} \tau\, 2^{\frac{\alpha-1}{\alpha} D_\alpha(P_{Y|X=x}\|Q_{Y|X=x})} \tag{102}$$
$$= \alpha\log\tau + \max_{x\in\mathrm{supp}(P_X)} (\alpha-1)\, D_\alpha(P_{Y|X=x}\|Q_{Y|X=x}), \tag{103}$$
where (100) follows from the definition in (77). On the other hand, for all α ( 0 , 1 ) ,
$$(\alpha-1)\, D_\alpha^{\mathrm{l}}(P_{Y|X}\|Q_{Y|X}|P_X) = \alpha\log\sum_{x\in\mathrm{supp}(P_X)} P(x)\, 2^{\frac{\alpha-1}{\alpha} D_\alpha(P_{Y|X=x}\|Q_{Y|X=x})} \tag{104}$$
$$\leq \alpha\log\max_{x\in\mathrm{supp}(P_X)} 2^{\frac{\alpha-1}{\alpha} D_\alpha(P_{Y|X=x}\|Q_{Y|X=x})} \tag{105}$$
$$= \max_{x\in\mathrm{supp}(P_X)} (\alpha-1)\, D_\alpha(P_{Y|X=x}\|Q_{Y|X=x}). \tag{106}$$
Because lim α 0 α log τ = 0 , it follows from (103) and (106) and the sandwich theorem that
$$\lim_{\alpha\to 0} D_\alpha^{\mathrm{l}}(P_{Y|X}\|Q_{Y|X}|P_X) = \lim_{\alpha\to 0}\frac{1}{\alpha-1}\max_{x\in\mathrm{supp}(P_X)} (\alpha-1)\, D_\alpha(P_{Y|X=x}\|Q_{Y|X=x}) \tag{107}$$
$$= -\log\max_{x\in\mathrm{supp}(P_X)}\sum_{y\in\mathrm{supp}(P_{Y|X=x})} Q(y|x), \tag{108}$$
where (108) follows from the continuity of the Rényi divergence (Proposition 1 (c)) and the definition of D 0 ( P Q ) in (21).
We conclude with the continuity at α = . Observe that
$$\lim_{\alpha\to\infty} D_\alpha^{\mathrm{l}}(P_{Y|X}\|Q_{Y|X}|P_X) = \lim_{\alpha\to\infty}\frac{\alpha}{\alpha-1}\log\sum_{x\in\mathrm{supp}(P_X)} P(x)\, 2^{\frac{\alpha-1}{\alpha} D_\alpha(P_{Y|X=x}\|Q_{Y|X=x})} \tag{109}$$
$$= \log\sum_{x\in\mathrm{supp}(P_X)} P(x)\, 2^{\lim_{\alpha\to\infty} D_\alpha(P_{Y|X=x}\|Q_{Y|X=x})} \tag{110}$$
$$= \log\sum_{x\in\mathrm{supp}(P_X)} P(x)\max_y\frac{P(y|x)}{Q(y|x)}, \tag{111}$$
where (109) follows from the definition in (77), and (111) follows from the continuity of the Rényi divergence (Proposition 1 (c)) and the definition of D ( P Q ) in (23).
(d)
For all α [ 0 , ] , Proposition 4 implies
$$D_\alpha^{\mathrm{l}}(P_{Y|X}\|Q_{Y|X}|P_X) = \min_{Q_X} D_\alpha(P_X P_{Y|X}\|Q_X Q_{Y|X}). \tag{112}$$
Because $\alpha \mapsto D_\alpha(P\|Q)$ is nondecreasing on $[0,\infty]$ (Proposition 1 (d)) and because the pointwise minimum preserves the monotonicity, the mapping $\alpha \mapsto D_\alpha^{\mathrm{l}}(\cdot)$ is nondecreasing on $[0,\infty]$.
(e)
By Proposition 4,
$$\frac{1-\alpha}{\alpha}\, D_\alpha^{\mathrm{l}}(P_{Y|X}\|Q_{Y|X}|P_X) = \begin{cases}\displaystyle \min_{Q_X}\, \frac{1-\alpha}{\alpha}\, D_\alpha(P_X P_{Y|X}\|Q_X Q_{Y|X}) & \text{if } \alpha\in(0,1],\\[1ex] \displaystyle \max_{Q_X}\, \frac{1-\alpha}{\alpha}\, D_\alpha(P_X P_{Y|X}\|Q_X Q_{Y|X}) & \text{if } \alpha\in(1,\infty). \end{cases} \tag{113}$$
By the nonnegativity of the Rényi divergence (Proposition 1 (a)), the RHS of (113) is nonnegative for α ( 0 , 1 ] and nonpositive for α ( 1 , ) . Hence, it suffices to show separately that the mapping α 1 α α D α l ( P Y | X Q Y | X | P X ) is nonincreasing on ( 0 , 1 ] and on ( 1 , ) . This is indeed the case: the mapping α 1 α α D α ( P X P Y | X Q X Q Y | X ) on the RHS of (113) is nonincreasing on ( 0 , ) (Proposition 1 (e)), and the monotonicity is preserved by the pointwise minimum and maximum, respectively.
(f)
For α [ 0 , 1 ] , Proposition 4 implies that
$$(1-\alpha)\, D_\alpha^{\mathrm{l}}(P_{Y|X}\|Q_{Y|X}|P_X) = \min_{Q_X}\, (1-\alpha)\, D_\alpha(P_X P_{Y|X}\|Q_X Q_{Y|X}). \tag{114}$$
Because α ( 1 α ) D α ( P X P Y | X Q X Q Y | X ) is concave on [ 0 , 1 ] (Proposition 1 (f)) and because the pointwise minimum preserves the concavity, the mapping α ( 1 α ) D α l ( P Y | X Q Y | X | P X ) is concave on [ 0 , 1 ] .
(g)
This follows from Proposition 1 (g) in the same way that Part (f) followed from Proposition 1 (f). □
We next consider data-processing inequalities for D α l ( · ) . We distinguish between processing Y and processing X. The data-processing inequality for processing Y follows from the data-processing inequality for the (unconditional) Rényi divergence:
Theorem 5.
Let $P_X$ be a PMF, and let $P_{Y|X}$ and $Q_{Y|X}$ be conditional PMFs. For a conditional PMF $A_{Y'|XY}$, define
$$P'_{Y'|X}(y'|x) \triangleq \sum_y P_{Y|X}(y|x)\, A_{Y'|XY}(y'|x,y), \tag{115}$$
$$Q'_{Y'|X}(y'|x) \triangleq \sum_y Q_{Y|X}(y|x)\, A_{Y'|XY}(y'|x,y). \tag{116}$$
Then, for all $\alpha \in [0,\infty]$,
$$D_\alpha^{\mathrm{l}}(P'_{Y'|X}\|Q'_{Y'|X}|P_X) \leq D_\alpha^{\mathrm{l}}(P_{Y|X}\|Q_{Y|X}|P_X). \tag{117}$$
Proof. 
We prove (117) for α ( 0 , 1 ) ( 1 , ) ; the claim will then extend to α [ 0 , ] by the continuity of D α l ( · ) in α (Proposition 6 (c)). For every x supp ( P X ) , we can apply Proposition 1 (h) with the substitution of A Y | Y , X = x for A Y | Y to obtain
$$D_\alpha(P'_{Y'|X=x}\|Q'_{Y'|X=x}) \leq D_\alpha(P_{Y|X=x}\|Q_{Y|X=x}). \tag{118}$$
For α ( 0 , 1 ) ( 1 , ) , (117) now follows from (77) and (118). □
Processing X is different. Consider first Q Y | X that does not depend on X. Then, writing Q Y | X = Q Y , we have the following result (which, as shown in Example 2 below, does not extend to general Q Y | X ):
Theorem 6.
Let $P_X$ and $Q_Y$ be PMFs, and let $P_{Y|X}$ be a conditional PMF. For a conditional PMF $B_{X'|X}$, define the PMFs
$$P'_{X'}(x') \triangleq \sum_x P_X(x)\, B_{X'|X}(x'|x), \tag{119}$$
$$B_{X|X'}(x|x') \triangleq \begin{cases} P_X(x)\, B_{X'|X}(x'|x)\,/\,P'_{X'}(x') & \text{if } P'_{X'}(x') > 0,\\ 1/|\mathcal{X}| & \text{otherwise,}\end{cases} \tag{120}$$
$$P'_{Y|X'}(y|x') \triangleq \sum_x B_{X|X'}(x|x')\, P_{Y|X}(y|x). \tag{121}$$
Then, for all $\alpha \in [0,\infty]$,
$$D_\alpha^{\mathrm{l}}(P'_{Y|X'}\|Q_Y|P'_{X'}) \leq D_\alpha^{\mathrm{l}}(P_{Y|X}\|Q_Y|P_X). \tag{122}$$
Once we provide the operational meaning of D α l ( · ) in horse betting with side information (Theorem 10 below), Theorem 6 will become very intuitive: it expresses the fact that preprocessing the side information cannot increase the gambler’s utility; see Remark 8. Note that P X and P Y | X in Theorem 6 can be obtained from the following marginalization:
$$P'_{X'}(x')\, P'_{Y|X'}(y|x') = \sum_x P_X(x)\, B_{X'|X}(x'|x)\, P_{Y|X}(y|x). \tag{123}$$
Proof of Theorem 6.
We show (122) for α ( 0 , 1 ) ( 1 , ) ; the claim will then extend to α [ 0 , ] by the continuity of D α l ( · ) in α (Proposition 6 (c)). Consider first α ( 1 , ) . Then, (122) holds because
$$\frac{\alpha-1}{\alpha}\, D_\alpha^{\mathrm{l}}(P'_{Y|X'}\|Q_Y|P'_{X'})$$
$$= \log\sum_{x'\in\mathrm{supp}(P'_{X'})} P'_{X'}(x')\left[\sum_y P'_{Y|X'}(y|x')^\alpha\, Q_Y(y)^{1-\alpha}\right]^{\frac{1}{\alpha}} \tag{124}$$
$$= \log\sum_{x'\in\mathrm{supp}(P'_{X'})} P'_{X'}(x')\left[\sum_y \Bigl(\sum_x B_{X|X'}(x|x')\, P_{Y|X}(y|x)\, Q_Y(y)^{\frac{1-\alpha}{\alpha}}\Bigr)^{\alpha}\right]^{\frac{1}{\alpha}} \tag{125}$$
$$= \log\sum_{x'\in\mathrm{supp}(P'_{X'})}\left[\sum_y \Bigl(\sum_{x\in\mathrm{supp}(P_X)} P_X(x)\, B_{X'|X}(x'|x)\, P_{Y|X}(y|x)\, Q_Y(y)^{\frac{1-\alpha}{\alpha}}\Bigr)^{\alpha}\right]^{\frac{1}{\alpha}} \tag{126}$$
$$\leq \log\sum_{x'\in\mathrm{supp}(P'_{X'})}\sum_{x\in\mathrm{supp}(P_X)}\left[\sum_y \Bigl(P_X(x)\, B_{X'|X}(x'|x)\, P_{Y|X}(y|x)\, Q_Y(y)^{\frac{1-\alpha}{\alpha}}\Bigr)^{\alpha}\right]^{\frac{1}{\alpha}} \tag{127}$$
$$= \log\sum_{x\in\mathrm{supp}(P_X)} P_X(x)\left[\sum_{x'\in\mathrm{supp}(P'_{X'})} B_{X'|X}(x'|x)\right]\left[\sum_y P_{Y|X}(y|x)^\alpha\, Q_Y(y)^{1-\alpha}\right]^{\frac{1}{\alpha}} \tag{128}$$
$$= \log\sum_{x\in\mathrm{supp}(P_X)} P_X(x)\left[\sum_y P_{Y|X}(y|x)^\alpha\, Q_Y(y)^{1-\alpha}\right]^{\frac{1}{\alpha}} \tag{129}$$
$$= \frac{\alpha-1}{\alpha}\, D_\alpha^{\mathrm{l}}(P_{Y|X}\|Q_Y|P_X), \tag{130}$$
where (124) follows from (78); (125) follows from (121); (126) follows from (120); (127) follows from the Minkowski inequality [16] (III 2.4 Theorem 9); (129) holds because $P_X(x) > 0$ and $P'_{X'}(x') = 0$ imply $B_{X'|X}(x'|x) = 0$, hence the first expression in square brackets on the left-hand side (LHS) of (129) equals one; and (130) follows from (78).
The proof for α ( 0 , 1 ) is very similar: (124)–(126) and (128)–(130) continue to hold, and (127) is reversed [16] (III 2.4 Theorem 9). Because now α 1 α < 0 , (122) continues to hold for α ( 0 , 1 ) . □
As a special case of Theorem 6, we obtain the following relation between the conditional and the unconditional Rényi divergence:
Corollary 3.
Let P X and Q Y be PMFs, and let P Y | X be a conditional PMF. Define the marginal PMF
$$P_Y(y) \triangleq \sum_x P_X(x)\, P_{Y|X}(y|x). \tag{131}$$
Then, for all $\alpha \in [0,\infty]$,
$$D_\alpha(P_Y\|Q_Y) \leq D_\alpha^{\mathrm{l}}(P_{Y|X}\|Q_Y|P_X). \tag{132}$$
Proof. 
This follows from Theorem 6 in the same way that Corollary 1 followed from Theorem 2. □
Consider next Q Y | X that does depend on X. It turns out that Corollary 3, and hence Theorem 6, cannot be extended to this setting:
Example 2.
Let $\mathcal{X} = \{0,1\}$ and $\mathcal{Y} = \{0,1,2\}$. Define the PMFs $P_X$, $P_{Y|X}$, and $Q_{Y|X}$ as
$$P_X(0) = 0.5, \quad P_X(1) = 0.5, \tag{133}$$
$$P_{Y|X}(0|0) = 0.96, \quad P_{Y|X}(1|0) = 0.02, \quad P_{Y|X}(2|0) = 0.02, \tag{134}$$
$$P_{Y|X}(0|1) = 0.12, \quad P_{Y|X}(1|1) = 0.02, \quad P_{Y|X}(2|1) = 0.86, \tag{135}$$
$$Q_{Y|X}(0|0) = 0.06, \quad Q_{Y|X}(1|0) = 0.92, \quad Q_{Y|X}(2|0) = 0.02, \tag{136}$$
$$Q_{Y|X}(0|1) = 0.02, \quad Q_{Y|X}(1|1) = 0.16, \quad Q_{Y|X}(2|1) = 0.82. \tag{137}$$
Then, for $\alpha = 0.5$ and for $\alpha = 2$,
$$D_\alpha(P_Y\|Q_Y) > D_\alpha^{\mathrm{l}}(P_{Y|X}\|Q_{Y|X}|P_X), \tag{138}$$
where the PMFs $P_Y$ and $Q_Y$ are given by
$$P_Y(y) \triangleq \sum_x P_X(x)\, P_{Y|X}(y|x), \tag{139}$$
$$Q_Y(y) \triangleq \sum_x P_X(x)\, Q_{Y|X}(y|x). \tag{140}$$
Proof. 
Numerically, $D_{0.5}(P_Y\|Q_Y) \approx 1.11$ bits, which is larger than $D_{0.5}^{\mathrm{l}}(P_{Y|X}\|Q_{Y|X}|P_X) \approx 0.93$ bits. Similarly, $D_2(P_Y\|Q_Y) \approx 2.95$ bits, which is larger than $D_2^{\mathrm{l}}(P_{Y|X}\|Q_{Y|X}|P_X) \approx 2.75$ bits. □
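The values quoted above can be reproduced with a few lines of Python (using NumPy); the formulas used are (4) and (78).

```python
import numpy as np

PX = np.array([0.5, 0.5])
PYgX = np.array([[0.96, 0.02, 0.02], [0.12, 0.02, 0.86]])
QYgX = np.array([[0.06, 0.92, 0.02], [0.02, 0.16, 0.82]])
PY, QY = PX @ PYgX, PX @ QYgX

def renyi_div(p, q, a):
    return np.log2(np.sum(p**a * q**(1.0 - a))) / (a - 1.0)

def cond_div_new(a):
    inner = np.sum(PYgX**a * QYgX**(1.0 - a), axis=1)
    return a / (a - 1.0) * np.log2(np.sum(PX * inner**(1.0 / a)))

for a in (0.5, 2.0):
    print(a, renyi_div(PY, QY, a), cond_div_new(a))   # D_alpha(P_Y||Q_Y) exceeds D_alpha^l
```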

6. Relation to Arimoto’s Measures

Before discussing Arimoto’s measures, we first recall the definition of the Rényi entropy. The Rényi entropy of order α [7] is defined for all positive α ’s other than one as
$$H_\alpha(X) \triangleq \frac{1}{1-\alpha}\log\sum_x P(x)^\alpha. \tag{141}$$
For α being zero, one, or infinity, we define by continuous extension of (141)
$$H_0(X) \triangleq \log\bigl|\mathrm{supp}(P_X)\bigr|, \tag{142}$$
$$H_1(X) \triangleq H(X), \tag{143}$$
$$H_\infty(X) \triangleq -\log\max_x P(x), \tag{144}$$
where H ( X ) denotes Shannon’s entropy. The Rényi entropy can be related to the Rényi divergence as follows:
$$H_\alpha(X) = \log|\mathcal{X}| - D_\alpha(P_X\|U_X), \tag{145}$$
where U X denotes the uniform distribution over X .
There are different ways to define a conditional Rényi entropy [40]; we use Arimoto’s proposal. The Arimoto–Rényi conditional entropy of order α [12,38,40,41] is defined for positive α other than one as
$$H_\alpha(X|Y) \triangleq \frac{\alpha}{1-\alpha}\log\sum_{y\in\mathrm{supp}(P_Y)} P(y)\left[\sum_x P(x|y)^\alpha\right]^{\frac{1}{\alpha}} \tag{146}$$
$$= \frac{\alpha}{1-\alpha}\log\sum_{y\in\mathrm{supp}(P_Y)} P(y)\, 2^{\frac{1-\alpha}{\alpha} H_\alpha(P_{X|Y=y})}, \tag{147}$$
where (147) follows from the definition of the Rényi entropy in (141). The Arimoto–Rényi conditional entropy plays a key role in guessing with side information [20,42,43,44] and in task encoding with side information [45]; and it can be related to hypothesis testing [41]. For α being zero, one, or infinity, we define by continuous extension of (146)
$$H_0(X|Y) \triangleq \log\max_{y\in\mathrm{supp}(P_Y)}\bigl|\mathrm{supp}(P_{X|Y=y})\bigr|, \tag{148}$$
$$H_1(X|Y) \triangleq H(X|Y), \tag{149}$$
$$H_\infty(X|Y) \triangleq -\log\sum_{y\in\mathrm{supp}(P_Y)} P(y)\max_x P(x|y), \tag{150}$$
where H ( X | Y ) denotes Shannon’s conditional entropy. The analog of (145) for H α ( X | Y ) is:
Remark 2.
For all α [ 0 , ] ,
$$H_\alpha(X|Y) = \log|\mathcal{X}| - D_\alpha^{\mathrm{l}}(P_{X|Y}\|U_X|P_Y) \tag{151}$$
$$= \log|\mathcal{X}| - \min_{Q_Y} D_\alpha(P_Y P_{X|Y}\|Q_Y U_X). \tag{152}$$
Proof. 
Equation (151) follows, using some algebra, from the definition of D α l ( · ) in (78)–(81); and (152) follows from Proposition 4. (The characterization in (152) previously appeared as [40] (Theorem 4).) □
Arimoto [12] also defined the following measure of dependence:
$$I_\alpha^{\mathrm{a}}(X;Y) \triangleq H_\alpha(X) - H_\alpha(X|Y) \tag{153}$$
$$= \frac{\alpha}{\alpha-1}\log\sum_y\left[\sum_x \frac{P(x)^\alpha}{\sum_{x'\in\mathcal{X}} P(x')^\alpha}\, P(y|x)^\alpha\right]^{\frac{1}{\alpha}}, \tag{154}$$
where (154) follows from (141) and (146). Using Remark 2, we can express I α a ( X ; Y ) in terms of D α l ( · ) :
Remark 3.
For all α [ 0 , ] ,
$$I_\alpha^{\mathrm{a}}(X;Y) = D_\alpha^{\mathrm{l}}(P_{X|Y}\|U_X|P_Y) - D_\alpha(P_X\|U_X). \tag{155}$$
Proof. 
This follows from (145), (151), and (153). □

7. Relations Between the Conditional Rényi Divergences and the Rényi Dependence Measures

In this section, we first establish the greater-or-equal-than order between the conditional Rényi divergences, where the order depends on whether α [ 0 , 1 ] or α [ 1 , ] . We then show that this implies the same order between the dependence measures derived from the conditional Rényi divergences. Finally, we remark that many of the dependence measures coincide when they are maximized over all PMFs P X .
Proposition 7.
For all α [ 0 , ] ,
$$D_\alpha^{\mathrm{l}}(P_{Y|X}\|Q_{Y|X}|P_X) \leq D_\alpha^{\mathrm{s}}(P_{Y|X}\|Q_{Y|X}|P_X). \tag{156}$$
Proof. 
This holds because
$$D_\alpha^{\mathrm{l}}(P_{Y|X}\|Q_{Y|X}|P_X) = \min_{Q_X} D_\alpha(P_X P_{Y|X}\|Q_X Q_{Y|X}) \tag{157}$$
$$\leq D_\alpha(P_X P_{Y|X}\|P_X Q_{Y|X}) \tag{158}$$
$$= D_\alpha^{\mathrm{s}}(P_{Y|X}\|Q_{Y|X}|P_X), \tag{159}$$
where (157) follows from Proposition 4, and (159) follows from the definition of D α s ( · ) in (54). □
Theorem 7.
For all α [ 0 , 1 ] ,
$$D_\alpha^{\mathrm{l}}(P_{Y|X}\|Q_{Y|X}|P_X) \leq D_\alpha^{\mathrm{s}}(P_{Y|X}\|Q_{Y|X}|P_X) \leq D_\alpha^{\mathrm{c}}(P_{Y|X}\|Q_{Y|X}|P_X). \tag{160}$$
For all α [ 1 , ] ,
$$D_\alpha^{\mathrm{c}}(P_{Y|X}\|Q_{Y|X}|P_X) \leq D_\alpha^{\mathrm{l}}(P_{Y|X}\|Q_{Y|X}|P_X) \leq D_\alpha^{\mathrm{s}}(P_{Y|X}\|Q_{Y|X}|P_X). \tag{161}$$
Proof. 
For both α [ 0 , 1 ] and α [ 1 , ] , the relation D α l ( · ) D α s ( · ) follows from Proposition 7.
We next show that D α s ( · ) D α c ( · ) for α [ 0 , 1 ] . We show this for α ( 0 , 1 ) ; the claim will then extend to α [ 0 , 1 ] by the continuity in α of D α s ( · ) and D α c ( · ) (Proposition 3 (c) and Proposition 2 (c)). For α ( 0 , 1 ) ,
$$(\alpha-1)\, D_\alpha^{\mathrm{s}}(P_{Y|X}\|Q_{Y|X}|P_X) = \log\sum_{x\in\mathrm{supp}(P_X)} P(x)\sum_y P(y|x)^\alpha\, Q(y|x)^{1-\alpha} \tag{162}$$
$$\geq \sum_{x\in\mathrm{supp}(P_X)} P(x)\log\sum_y P(y|x)^\alpha\, Q(y|x)^{1-\alpha} \tag{163}$$
$$= (\alpha-1)\, D_\alpha^{\mathrm{c}}(P_{Y|X}\|Q_{Y|X}|P_X), \tag{164}$$
where (162) follows from (55); (163) follows from Jensen’s inequality because $\log(\cdot)$ is a concave function; and (164) follows from (30). The proof of the claim for $\alpha \in (0,1)$ is finished by dividing (162)–(164) by $\alpha - 1$, which reverses the inequality because $\alpha - 1 < 0$.
We conclude by showing that D α c ( · ) D α l ( · ) for α [ 1 , ] . We show this for α ( 1 , ) ; the claim will then extend to α [ 1 , ] by the continuity of D α c ( · ) and D α l ( · ) in α (Proposition 2 (c) and Proposition 6 (c)). For α ( 1 , ) ,
$$D_\alpha^{\mathrm{c}}(P_{Y|X}\|Q_{Y|X}|P_X) = \sum_{x\in\mathrm{supp}(P_X)} P(x)\,\frac{1}{\alpha-1}\log\sum_y P(y|x)^\alpha\, Q(y|x)^{1-\alpha} \tag{165}$$
$$= \frac{\alpha}{\alpha-1}\sum_{x\in\mathrm{supp}(P_X)} P(x)\log\left[\sum_y P(y|x)^\alpha\, Q(y|x)^{1-\alpha}\right]^{\frac{1}{\alpha}} \tag{166}$$
$$\leq \frac{\alpha}{\alpha-1}\log\sum_{x\in\mathrm{supp}(P_X)} P(x)\left[\sum_y P(y|x)^\alpha\, Q(y|x)^{1-\alpha}\right]^{\frac{1}{\alpha}} \tag{167}$$
$$= D_\alpha^{\mathrm{l}}(P_{Y|X}\|Q_{Y|X}|P_X), \tag{168}$$
where (165) follows from (30); (167) follows from Jensen’s inequality because log ( · ) is a concave function; and (168) follows from (78). □
Corollary 4.
For all α [ 0 , 1 ] ,
$$J_\alpha(X;Y) \leq I_\alpha^{\mathrm{s}}(X;Y) \leq I_\alpha^{\mathrm{c}}(X;Y). \tag{169}$$
For all α [ 1 , ] ,
$$I_\alpha^{\mathrm{c}}(X;Y) \leq J_\alpha(X;Y) \leq I_\alpha^{\mathrm{s}}(X;Y). \tag{170}$$
Proof. 
By (34) and (60) and Proposition 5, respectively,
$$I_\alpha^{\mathrm{c}}(X;Y) = \min_{Q_Y} D_\alpha^{\mathrm{c}}(P_{Y|X}\|Q_Y|P_X), \tag{171}$$
$$I_\alpha^{\mathrm{s}}(X;Y) = \min_{Q_Y} D_\alpha^{\mathrm{s}}(P_{Y|X}\|Q_Y|P_X), \tag{172}$$
$$J_\alpha(X;Y) = \min_{Q_Y} D_\alpha^{\mathrm{l}}(P_{Y|X}\|Q_Y|P_X). \tag{173}$$
The corollary now follows from (171)–(173) and Theorem 7. □
Despite I α c ( X ; Y ) , I α s ( X ; Y ) , I α a ( X ; Y ) , and J α ( X ; Y ) being different measures, they often coincide when maximized over all PMFs P X :
Theorem 8.
For every conditional PMF P Y | X and every α ( 0 , 1 ) ( 1 , ) ,
$$\max_{P_X} I_\alpha^{\mathrm{c}}(P_X, P_{Y|X}) = \max_{P_X} I_\alpha^{\mathrm{s}}(P_X, P_{Y|X}) \tag{174}$$
$$= \max_{P_X} I_\alpha^{\mathrm{a}}(P_X, P_{Y|X}). \tag{175}$$
In addition, for every conditional PMF P Y | X and every α [ 1 2 , 1 ) ( 1 , ) ,
$$\max_{P_X} J_\alpha(P_X, P_{Y|X}) = \max_{P_X} I_\alpha^{\mathrm{s}}(P_X, P_{Y|X}). \tag{176}$$
For α ( 0 , 1 2 ) , the situation is different: there exists a conditional PMF P Y | X such that, for every α ( 0 , 1 2 ) ,
$$\max_{P_X} J_\alpha(P_X, P_{Y|X}) < \max_{P_X} I_\alpha^{\mathrm{s}}(P_X, P_{Y|X}). \tag{177}$$
Proof. 
Equation (174) follows from [9] (Proposition 1); (175) follows from [12] (Lemma 1); and (176) follows from [38] (Theorem V.1) for α ( 1 , ) .
We next establish (176) for α [ 1 2 , 1 ) . Observe that, for α [ 1 2 , 1 ) , (176) is equivalent to
max P X 2 α 1 α J α ( P X , P Y | X ) = max P X 2 α 1 α I α s ( P X , P Y | X ) .
For α [ 1 2 , 1 ) , (178) holds because
max P X 2 α 1 α J α ( P X , P Y | X ) = max P X min Q Y 2 α 1 α D α l ( P Y | X Q Y | P X )
= min P X max Q Y x P X ( x ) y P ( y | x ) α Q Y ( y ) 1 α 1 α
= max Q Y min P X x P X ( x ) y P ( y | x ) α Q Y ( y ) 1 α 1 α
= max Q Y min x y P ( y | x ) α Q Y ( y ) 1 α 1 α
= max Q Y min x y P ( y | x ) α Q Y ( y ) 1 α 1 α
= max Q Y min P X x P X ( x ) y P ( y | x ) α Q Y ( y ) 1 α 1 α
= min P X max Q Y x P X ( x ) y P ( y | x ) α Q Y ( y ) 1 α 1 α
= min P X max Q Y x P X ( x ) y P ( y | x ) α Q Y ( y ) 1 α 1 α
= max P X min Q Y 2 α 1 α D α s ( P Y | X Q Y | P X )
= max P X 2 α 1 α I α s ( P X , P Y | X ) ,
where (179) follows from Proposition 5; (180) follows from (78); (181) and (185) follow from a minimax theorem and are justified below; (187) follows from (55); and (188) follows from (60).
To justify (181), we apply the minimax theorem [46] (Corollary 37.3.2) to the function f : P ( Y ) × P ( X ) R ,
$$f(Q_Y, P_X) = \sum_x P_X(x)\left[\sum_y P(y|x)^\alpha\, Q_Y(y)^{1-\alpha}\right]^{\frac{1}{\alpha}}. \tag{189}$$
The sets of all PMFs over $\mathcal{X}$ and over $\mathcal{Y}$ are convex and compact; the function f is jointly continuous in the pair $(Q_Y, P_X)$ because it is a composition of continuous functions; for every $Q_Y \in \mathcal{P}(\mathcal{Y})$, the function f is linear and hence convex in $P_X$; and it only remains to show that the function f is concave in $Q_Y$ for every $P_X \in \mathcal{P}(\mathcal{X})$. Indeed, for every $\lambda, \lambda' \in [0,1]$ with $\lambda + \lambda' = 1$, every $Q_Y, Q'_Y \in \mathcal{P}(\mathcal{Y})$, and every $P_X \in \mathcal{P}(\mathcal{X})$,
$$f(\lambda Q_Y + \lambda' Q'_Y,\, P_X) \tag{190}$$
$$= \sum_x P_X(x)\left[\sum_y P(y|x)^\alpha\bigl(\lambda Q_Y(y) + \lambda' Q'_Y(y)\bigr)^{1-\alpha}\right]^{\frac{1}{\alpha}} \tag{191}$$
$$= \sum_x P_X(x)\left[\sum_y \Bigl(\lambda P(y|x)^{\frac{\alpha}{1-\alpha}} Q_Y(y) + \lambda' P(y|x)^{\frac{\alpha}{1-\alpha}} Q'_Y(y)\Bigr)^{1-\alpha}\right]^{\frac{1}{1-\alpha}\cdot\frac{1-\alpha}{\alpha}} \tag{192}$$
$$\geq \sum_x P_X(x)\left\{\left[\sum_y \Bigl(\lambda P(y|x)^{\frac{\alpha}{1-\alpha}} Q_Y(y)\Bigr)^{1-\alpha}\right]^{\frac{1}{1-\alpha}} + \left[\sum_y \Bigl(\lambda' P(y|x)^{\frac{\alpha}{1-\alpha}} Q'_Y(y)\Bigr)^{1-\alpha}\right]^{\frac{1}{1-\alpha}}\right\}^{\frac{1-\alpha}{\alpha}} \tag{193}$$
$$= \sum_x P_X(x)\left\{\lambda\left[\sum_y P(y|x)^\alpha\, Q_Y(y)^{1-\alpha}\right]^{\frac{1}{1-\alpha}} + \lambda'\left[\sum_y P(y|x)^\alpha\, Q'_Y(y)^{1-\alpha}\right]^{\frac{1}{1-\alpha}}\right\}^{\frac{1-\alpha}{\alpha}} \tag{194}$$
$$\geq \sum_x P_X(x)\left\{\lambda\left[\sum_y P(y|x)^\alpha\, Q_Y(y)^{1-\alpha}\right]^{\frac{1}{\alpha}} + \lambda'\left[\sum_y P(y|x)^\alpha\, Q'_Y(y)^{1-\alpha}\right]^{\frac{1}{\alpha}}\right\} \tag{195}$$
$$= \lambda f(Q_Y, P_X) + \lambda' f(Q'_Y, P_X), \tag{196}$$
where (193) follows from the reverse Minkowski inequality [16] (III 2.4 Theorem 9) because $\alpha \in [\tfrac{1}{2},1)$; and (195) holds because the function $z \mapsto z^{(1-\alpha)/\alpha}$ is concave for $\alpha \in [\tfrac{1}{2},1)$.
The justification of (185) is very similar to that of (181); here, we apply the minimax theorem to the function g : P ( Y ) × P ( X ) R ,
$$g(Q_Y, P_X) = \sum_x P_X(x)\sum_y P(y|x)^\alpha\, Q_Y(y)^{1-\alpha}. \tag{197}$$
Compared to the justification of (181), the only essential difference lies in showing that the function g is concave in Q Y for every P X P ( X ) : here, this follows easily from the concavity of the function z z 1 α for α [ 1 2 , 1 ) .
We conclude the proof by establishing (177). Let X = Y = { 0 , 1 } , and let the conditional PMF P Y | X be given by P Y | X ( y | x ) = 𝟙 { y = x } . (This corresponds to a binary noiseless channel.) Then, denoting by U X the uniform distribution over X ,
$$\max_{P_X} I_\alpha^{\mathrm{s}}(P_X, P_{Y|X}) \geq I_\alpha^{\mathrm{s}}(U_X, P_{Y|X}) \tag{198}$$
$$= \log 2, \tag{199}$$
where (199) follows from (61). On the other hand, for every α ( 0 , 1 2 ) and every PMF P X ,
$$J_\alpha(P_X, P_{Y|X}) = \frac{\alpha}{1-\alpha}\, H_\infty(P_X) \tag{200}$$
$$\leq \frac{\alpha}{1-\alpha}\log 2 \tag{201}$$
$$< \log 2, \tag{202}$$
where (200) follows from [3] (Lemma 11); (201) follows from (144); and (202) holds because α ( 0 , 1 2 ) . Inequality (177) now follows from (199) and (202). □

8. Horse Betting

In this section, we analyze horse betting with a gambler investing all her money. Recall from the introduction that the winning horse X is distributed according to the PMF p, where we assume p ( x ) > 0 for all x X ; that the odds offered by the bookmaker are denoted by o : X ( 0 , ) ; that the fraction of her wealth that the gambler bets on Horse x X is denoted b ( x ) 0 ; that the wealth relative is the random variable S b ( X ) o ( X ) ; and that we seek betting strategies that maximize the utility function
$$U_\beta \triangleq \begin{cases} \frac{1}{\beta}\log \mathbb{E}\bigl[S^\beta\bigr] & \text{if } \beta \neq 0,\\ \mathbb{E}[\log S] & \text{if } \beta = 0.\end{cases} \tag{203}$$
Because the gambler invests all her money, b is a PMF. As in [47] (Section 10.3), define the constant
$$c \triangleq \left[\sum_x \frac{1}{o(x)}\right]^{-1} \tag{204}$$
and the PMF
$$r(x) \triangleq \frac{c}{o(x)}. \tag{205}$$
Using these definitions, the utility function U β can be decomposed as follows:
Theorem 9.
Let β ( , 1 ) , and let b be a PMF. Then,
$$U_\beta = \log c + D_{\frac{1}{1-\beta}}(p\|r) - D_{1-\beta}\bigl(g^{(\beta)}\big\|b\bigr), \tag{206}$$
where the PMF $g^{(\beta)}$ is given by
$$g^{(\beta)}(x) \triangleq \frac{p(x)^{\frac{1}{1-\beta}}\, o(x)^{\frac{\beta}{1-\beta}}}{\sum_{x'\in\mathcal{X}} p(x')^{\frac{1}{1-\beta}}\, o(x')^{\frac{\beta}{1-\beta}}}. \tag{207}$$
Thus, choosing b = g ( β ) uniquely maximizes U β among all PMFs b.
The three terms in (206) can be interpreted as follows:
  • The first term, log c , depends only on the odds and is related to the fairness of the odds. The odds are called subfair if c < 1 , fair if c = 1 , and superfair if c > 1 .
  • The second term, D 1 / ( 1 β ) ( p r ) , is related to the bookmaker’s estimate of the winning probabilities. It is zero if and only if the odds are inversely proportional to the winning probabilities.
  • The third term, D 1 β ( g ( β ) b ) , is related to the gambler’s estimate of the winning probabilities. It is zero if and only if b is equal to g ( β ) .
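The following sketch (Python with NumPy; the race and the competing strategy are chosen arbitrarily) computes the optimal bets (207) and checks the decomposition (206) numerically.

```python
import numpy as np

beta = -1.0                        # any beta < 1
p = np.array([0.5, 0.3, 0.2])      # winning probabilities
o = np.array([2.5, 3.0, 6.0])      # odds

c = 1.0 / np.sum(1.0 / o)          # (204)
r = c / o                          # (205)
g = p**(1.0 / (1.0 - beta)) * o**(beta / (1.0 - beta))
g = g / g.sum()                    # (207), the optimal betting strategy

def renyi_div(P, Q, a):
    return np.log2(np.sum(P**a * Q**(1.0 - a))) / (a - 1.0)

def utility(b):
    return np.log2(np.sum(p * (b * o)**beta)) / beta

b = np.array([0.4, 0.4, 0.2])      # some other strategy, for comparison
print(utility(b), np.log2(c) + renyi_div(p, r, 1.0 / (1.0 - beta)) - renyi_div(g, b, 1.0 - beta))
print(utility(g) >= utility(b))    # g maximizes U_beta
```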
Remark 4.
For β = 0 , (206) reduces to the following decomposition of the doubling rate E [ log S ] :
$$\mathbb{E}[\log S] = \log c + D(p\|r) - D(p\|b). \tag{208}$$
(This decomposition appeared previously in [47] (Section 10.3).) Equation (208) implies that the doubling rate is maximized by proportional gambling, i.e., that E [ log S ] is maximized if and only if b is equal to p.
Remark 5.
Considering the limits $\beta \to -\infty$ and $\beta \uparrow 1$, the PMF $g^{(\beta)}$ satisfies, for every $x \in \mathcal{X}$,
$$\lim_{\beta\to-\infty} g^{(\beta)}(x) = \frac{c}{o(x)}, \tag{209}$$
$$\lim_{\beta\uparrow 1} g^{(\beta)}(x) = \frac{p(x)\,\mathbb{1}\{x\in\mathcal{S}\}}{\sum_{x'\in\mathcal{X}} p(x')\,\mathbb{1}\{x'\in\mathcal{S}\}}, \tag{210}$$
where the set $\mathcal{S}$ is defined as $\mathcal{S} \triangleq \{x\in\mathcal{X} : p(x)\,o(x) = \max_{x'}\,p(x')\,o(x')\}$. It follows from Proposition 8 below that the RHS of (209) is the unique maximizer of $\lim_{\beta\to-\infty} U_\beta$; and it follows from the proof of Proposition 9 below that the RHS of (210) is a maximizer (not necessarily unique) of $U_1$.
Proof of Remark 5.
Recall that we assume $p(x) > 0$ for every $x \in \mathcal{X}$. Then, (209) follows from (207) and the definition of c in (204). To establish (210), define $\tau \triangleq \max_{x'}\, p(x')\,o(x')$ and observe that, for every $x \in \mathcal{X}$,
$$\lim_{\beta\uparrow 1} g^{(\beta)}(x) = \lim_{\beta\uparrow 1}\frac{p(x)\,\bigl(p(x)\,o(x)/\tau\bigr)^{\frac{\beta}{1-\beta}}}{\sum_{x'\in\mathcal{X}} p(x')\,\bigl(p(x')\,o(x')/\tau\bigr)^{\frac{\beta}{1-\beta}}} \tag{211}$$
$$= \frac{p(x)\,\mathbb{1}\{x\in\mathcal{S}\}}{\sum_{x'\in\mathcal{X}} p(x')\,\mathbb{1}\{x'\in\mathcal{S}\}}, \tag{212}$$
where (211) follows from (207) and some algebra; and (212) is justified as follows: if $x \in \mathcal{S}$, then $\bigl(p(x)\,o(x)/\tau\bigr)^{\beta/(1-\beta)}$ equals one; and if $x \notin \mathcal{S}$, then $\bigl(p(x)\,o(x)/\tau\bigr)^{\beta/(1-\beta)}$ tends to zero as $\beta \uparrow 1$ because $p(x)\,o(x)/\tau < 1$ and because $\lim_{\beta\uparrow 1}\frac{\beta}{1-\beta} = +\infty$. □
Remark 6.
Using the definition in (24) for the Rényi divergence of negative orders, it is not difficult to see from the proof of Theorem 9 below that (206) also holds for β > 1 . However, because the Rényi divergence of negative orders is nonpositive instead of nonnegative, the above interpretation is not valid anymore; in particular, for β > 1 , choosing b = g ( β ) is in general not optimal.
Proof of Theorem 9.
We first show the maximization claim. The only term on the RHS of (206) that depends on b is $-D_{1-\beta}\bigl(g^{(\beta)}\big\|b\bigr)$. Because $1-\beta > 0$, this term is maximized if and only if $b = g^{(\beta)}$ (Proposition 1 (a)).
We now establish (206) for $\beta \in (-\infty,0)\cup(0,1)$; we omit the proof for $\beta = 0$, which can be found in [47] (Section 10.3). For $\beta \in (-\infty,0)\cup(0,1)$,
$$U_\beta = \frac{1}{\beta}\log\sum_x p(x)\, b(x)^\beta\, o(x)^\beta. \tag{213}$$
For every $x \in \mathcal{X}$,
$$p(x)\, b(x)^\beta\, o(x)^\beta = \left[\sum_{x'\in\mathcal{X}} p(x')^{\frac{1}{1-\beta}}\, o(x')^{\frac{\beta}{1-\beta}}\right]^{1-\beta}\cdot g^{(\beta)}(x)^{1-\beta}\, b(x)^\beta, \tag{214}$$
which follows from (207). Now, (206) holds because
$$U_\beta = \frac{1-\beta}{\beta}\log\sum_{x\in\mathcal{X}} p(x)^{\frac{1}{1-\beta}}\, o(x)^{\frac{\beta}{1-\beta}} + \frac{1}{\beta}\log\sum_x g^{(\beta)}(x)^{1-\beta}\, b(x)^\beta \tag{215}$$
$$= \frac{1-\beta}{\beta}\log\sum_{x\in\mathcal{X}} p(x)^{\frac{1}{1-\beta}}\, o(x)^{\frac{\beta}{1-\beta}} - D_{1-\beta}\bigl(g^{(\beta)}\big\|b\bigr) \tag{216}$$
$$= \log c + \frac{1-\beta}{\beta}\log\sum_{x\in\mathcal{X}} p(x)^{\frac{1}{1-\beta}}\, r(x)^{-\frac{\beta}{1-\beta}} - D_{1-\beta}\bigl(g^{(\beta)}\big\|b\bigr) \tag{217}$$
$$= \log c + D_{\frac{1}{1-\beta}}(p\|r) - D_{1-\beta}\bigl(g^{(\beta)}\big\|b\bigr), \tag{218}$$
where (215) follows from (213) and (214); (216) follows from identifying the Rényi divergence (recall that g ( β ) and b are PMFs); (217) follows from (205); and (218) follows from identifying the Rényi divergence (recall that r is a PMF). □
The rest of the section presents the cases $\beta \to -\infty$, $\beta \geq 1$, and $\beta \to +\infty$.
Proposition 8.
Let b be a PMF. Then,
$$\lim_{\beta\to-\infty} U_\beta = \log\min_x\, b(x)\, o(x) \tag{219}$$
$$\leq \log c. \tag{220}$$
Inequality (220) holds with equality if and only if b ( x ) = c / o ( x ) for all x X .
Observe that if b ( x ) = c / o ( x ) for all x X , then S = c with probability one, i.e., S does not depend on the winning horse.
Proof of Proposition 8.
Equation (219) holds because
$$\lim_{\beta\to-\infty} U_\beta = \lim_{\beta\to-\infty}\log\left[\sum_x p(x)\,\bigl(b(x)\, o(x)\bigr)^\beta\right]^{\frac{1}{\beta}} \tag{221}$$
$$= \log\min_x\, b(x)\, o(x), \tag{222}$$
where (222) holds because, in the limit as β tends to $-\infty$, the power mean tends to the minimum (since p is a PMF with $p(x) > 0$ for all $x \in \mathcal{X}$ [15] (Chapter 8)).
We show (220) by contradiction. Assume that there exists a PMF b that does not satisfy (220), thus
$$b(x)\, o(x) > c \tag{223}$$
for all x X . Then,
$$1 = \sum_x b(x) \tag{224}$$
$$> \sum_x \frac{c}{o(x)} \tag{225}$$
$$= 1, \tag{226}$$
where (224) holds because b is a PMF; (225) follows from (223); and (226) follows from the definition of c in (204). Because 1 > 1 is impossible, such a b cannot exist, which establishes (220).
It is not difficult to see that (220) holds with equality if b ( x ) = c / o ( x ) for all x X . We therefore focus on establishing that if (220) holds with equality, then b ( x ) = c / o ( x ) for all x X . Observe first that, if (220) holds with equality, then, for all x X ,
$$b(x)\, o(x) \geq c. \tag{227}$$
We now claim that (227) holds with equality for all x X . Indeed, if this were not the case, then there would exist an x X for which b ( x ) o ( x ) > c , thus (224)–(226) would hold, which would lead to a contradiction. Hence, if (220) holds with equality, then b ( x ) = c / o ( x ) for all x X . □
Proposition 9.
Let $\beta \geq 1$, and let b be a PMF. Then,
$$U_\beta \leq \log\max_x\, p(x)^{1/\beta}\, o(x). \tag{228}$$
Equality in (228) can be achieved by choosing $b(x) = \mathbb{1}\{x = x^*\}$ for some $x^* \in \mathcal{X}$ satisfying
$$p(x^*)^{1/\beta}\, o(x^*) = \max_x\, p(x)^{1/\beta}\, o(x). \tag{229}$$
Remark 7.
Proposition 9 implies that if β 1 , then it is optimal to bet on a single horse. Unless | X | = 1 , this is not the case when β < 1 : When β < 1 , an optimal betting strategy requires placing a bet on every horse. This follows from Theorem 9 and our assumption that p ( x ) and o ( x ) are all positive.
Proof of Proposition 9.
Inequality (228) holds because
$$U_\beta = \frac{1}{\beta}\log\sum_x p(x)\, b(x)^\beta\, o(x)^\beta \tag{230}$$
$$\leq \frac{1}{\beta}\log\sum_x p(x)\, b(x)\, o(x)^\beta \tag{231}$$
$$\leq \frac{1}{\beta}\log\sum_x b(x)\cdot\max_{x'\in\mathcal{X}}\, p(x')\, o(x')^\beta \tag{232}$$
$$= \frac{1}{\beta}\log\max_{x'\in\mathcal{X}}\, p(x')\, o(x')^\beta \tag{233}$$
$$= \log\max_{x'\in\mathcal{X}}\, p(x')^{1/\beta}\, o(x'), \tag{234}$$
where (231) holds because $b(x) \in [0,1]$ and $\beta \geq 1$, and (233) holds because b is a PMF. It is not difficult to see that (228) holds with equality if $b(x) = \mathbb{1}\{x = x^*\}$ for some $x^* \in \mathcal{X}$ satisfying (229). □
Proposition 10.
Let b be a PMF. Then,
$$\lim_{\beta\to+\infty} U_\beta = \log\max_x\, b(x)\, o(x) \tag{235}$$
$$\leq \log\max_x\, o(x). \tag{236}$$
Equality in (236) can be achieved by choosing $b(x) = \mathbb{1}\{x = x^*\}$ for some $x^* \in \mathcal{X}$ satisfying
$$o(x^*) = \max_x\, o(x). \tag{237}$$
Proof. 
Equation (235) holds because
$$\lim_{\beta\to+\infty} U_\beta = \lim_{\beta\to+\infty}\log\left[\sum_x p(x)\,\bigl(b(x)\, o(x)\bigr)^\beta\right]^{\frac{1}{\beta}} \tag{238}$$
$$= \log\max_x\, b(x)\, o(x), \tag{239}$$
where (239) holds because, in the limit as β tends to $+\infty$, the power mean tends to the maximum (since p is a PMF with $p(x) > 0$ for all $x \in \mathcal{X}$ [15] (Chapter 8)). Inequality (236) holds because $b(x) \leq 1$ for all $x \in \mathcal{X}$. It is not difficult to see that (236) holds with equality if $b(x) = \mathbb{1}\{x = x^*\}$ for some $x^* \in \mathcal{X}$ satisfying (237). □
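To contrast the regimes treated in Remark 4 and Propositions 8–10, the sketch below (Python with NumPy; a toy race, with −inf indicating that the gambler goes broke in the worst case) evaluates the four candidate strategies at several values of β.

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
o = np.array([2.5, 3.0, 6.0])
c = 1.0 / np.sum(1.0 / o)

def utility(b, beta):
    with np.errstate(divide="ignore"):
        if beta == 0.0:
            return float(np.sum(p * np.log2(b * o)))
        return float(np.log2(np.sum(p * (b * o)**beta)) / beta)

risk_free    = c / o                          # beta -> -infinity (Proposition 8)
proportional = p                              # beta = 0 (Remark 4)
all_in_ev    = np.eye(3)[np.argmax(p * o)]    # beta = 1 (Proposition 9)
best_odds    = np.eye(3)[np.argmax(o)]        # beta -> +infinity (Proposition 10)

for beta in (-20.0, 0.0, 1.0, 20.0):
    print(beta, [round(utility(b, beta), 3)
                 for b in (risk_free, proportional, all_in_ev, best_odds)])
```

For each β in the loop, the corresponding strategy from the list above attains the largest printed utility, in line with the results of this section.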

9. Horse Betting with Side Information

In this section, we study the horse betting problem where the gambler observes some side information Y before placing her bets. This setting leads to the conditional Rényi divergence D α l ( · ) discussed in Section 5 (see Theorem 10). In addition, it provides a new operational meaning to the dependence measure J α ( X ; Y ) (see Theorem 11).
We adapt our notation as follows: The joint PMF of X and Y is denoted p X Y . (Recall that X denotes the winning horse.) We drop the assumption that the winning probabilities p ( x ) are positive, but we assume that p ( y ) > 0 for all y Y . We continue to assume that the gambler invests all her wealth, so a betting strategy is now a conditional PMF b X | Y , and the wealth relative S is
$S \triangleq b(X \,|\, Y)\, o(X).$ (240)
As in Section 8, define the constant
$c \triangleq \Bigl[ \sum_x \frac{1}{o(x)} \Bigr]^{-1}$ (241)
and the PMF
$r_X(x) \triangleq \frac{c}{o(x)}.$ (242)
The following decomposition of the utility function U β parallels that of Theorem 9:
Theorem 10.
Let $\beta \in (-\infty, 1)$. Then,
$U_\beta = \log c + D^{\mathrm{l}}_{1/(1-\beta)}\bigl( p_{X|Y} \,\big\|\, r_X \,\big|\, p_Y \bigr) - D_{1-\beta}\bigl( g^{(\beta)}_{X|Y}\, g^{(\beta)}_Y \,\big\|\, b_{X|Y}\, g^{(\beta)}_Y \bigr),$ (243)
where the conditional PMF $g^{(\beta)}_{X|Y}$ and the PMF $g^{(\beta)}_Y$ are given by
$g^{(\beta)}_{X|Y}(x|y) \triangleq \frac{ p(x|y)^{\frac{1}{1-\beta}}\, o(x)^{\frac{\beta}{1-\beta}} }{ \sum_{x'} p(x'|y)^{\frac{1}{1-\beta}}\, o(x')^{\frac{\beta}{1-\beta}} },$ (244)
$g^{(\beta)}_Y(y) \triangleq \frac{ p(y) \bigl[ \sum_x p(x|y)^{\frac{1}{1-\beta}}\, o(x)^{\frac{\beta}{1-\beta}} \bigr]^{1-\beta} }{ \sum_{y'} p(y') \bigl[ \sum_x p(x|y')^{\frac{1}{1-\beta}}\, o(x)^{\frac{\beta}{1-\beta}} \bigr]^{1-\beta} }.$ (245)
Thus, choosing $b_{X|Y} = g^{(\beta)}_{X|Y}$ uniquely maximizes $U_\beta$ among all conditional PMFs $b_{X|Y}$.
Proof. 
We first show that $U_\beta$ is uniquely maximized by $g^{(\beta)}_{X|Y}$. The only term on the RHS of (243) that depends on $b_{X|Y}$ is $-D_{1-\beta}\bigl( g^{(\beta)}_{X|Y}\, g^{(\beta)}_Y \,\big\|\, b_{X|Y}\, g^{(\beta)}_Y \bigr)$. Because $1-\beta > 0$, this term is maximized if and only if $b_{X|Y}\, g^{(\beta)}_Y = g^{(\beta)}_{X|Y}\, g^{(\beta)}_Y$ (Proposition 1 (a)). By our assumptions that $p(y) > 0$ for all $y \in \mathcal{Y}$ and $o(x) > 0$ for all $x \in \mathcal{X}$, we have $g^{(\beta)}_Y(y) > 0$ for all $y \in \mathcal{Y}$. Consequently, $b_{X|Y}\, g^{(\beta)}_Y = g^{(\beta)}_{X|Y}\, g^{(\beta)}_Y$ if and only if $b_{X|Y} = g^{(\beta)}_{X|Y}$.
Consider now (243) for β = 0 . For β = 0 , (243) reduces to
$\mathrm{E}[\log S] = \log c + D\bigl( p_{X|Y}\, p_Y \,\big\|\, r_X\, p_Y \bigr) - D\bigl( p_{X|Y}\, p_Y \,\big\|\, b_{X|Y}\, p_Y \bigr),$ (246)
and some algebra reveals that (246) holds.
We conclude by establishing (243) for $\beta \in (-\infty, 0) \cup (0, 1)$. For $\beta \in (-\infty, 0) \cup (0, 1)$,
$U_\beta = \frac{1}{\beta} \log \sum_{x,y} p(x,y)\, b(x|y)^{\beta} o(x)^{\beta}.$ (247)
For every $x \in \mathcal{X}$ and every $y \in \mathcal{Y}$,
$p(x,y)\, b(x|y)^{\beta} o(x)^{\beta} = \Bigl[ \sum_{y' \in \mathcal{Y}} p(y') \Bigl( \sum_{x' \in \mathcal{X}} p(x'|y')^{\frac{1}{1-\beta}}\, o(x')^{\frac{\beta}{1-\beta}} \Bigr)^{1-\beta} \Bigr] \cdot g^{(\beta)}_Y(y)\, g^{(\beta)}_{X|Y}(x|y)^{1-\beta}\, b(x|y)^{\beta},$ (248)
which follows from (244) and (245). Now, (243) holds because
$U_\beta = \frac{1}{\beta} \log \sum_{y \in \mathcal{Y}} p(y) \Bigl[ \sum_{x \in \mathcal{X}} p(x|y)^{\frac{1}{1-\beta}}\, o(x)^{\frac{\beta}{1-\beta}} \Bigr]^{1-\beta} + \frac{1}{\beta} \log \sum_{x,y} \bigl[ g^{(\beta)}_{X|Y}(x|y)\, g^{(\beta)}_Y(y) \bigr]^{1-\beta} \bigl[ b(x|y)\, g^{(\beta)}_Y(y) \bigr]^{\beta}$ (249)
$= \frac{1}{\beta} \log \sum_{y \in \mathcal{Y}} p(y) \Bigl[ \sum_{x \in \mathcal{X}} p(x|y)^{\frac{1}{1-\beta}}\, o(x)^{\frac{\beta}{1-\beta}} \Bigr]^{1-\beta} - D_{1-\beta}\bigl( g^{(\beta)}_{X|Y}\, g^{(\beta)}_Y \,\big\|\, b_{X|Y}\, g^{(\beta)}_Y \bigr)$ (250)
$= \log c + \frac{1}{\beta} \log \sum_{y \in \mathcal{Y}} p(y) \Bigl[ \sum_{x \in \mathcal{X}} p(x|y)^{\frac{1}{1-\beta}}\, r_X(x)^{-\frac{\beta}{1-\beta}} \Bigr]^{1-\beta} - D_{1-\beta}\bigl( g^{(\beta)}_{X|Y}\, g^{(\beta)}_Y \,\big\|\, b_{X|Y}\, g^{(\beta)}_Y \bigr)$ (251)
$= \log c + D^{\mathrm{l}}_{1/(1-\beta)}\bigl( p_{X|Y} \,\big\|\, r_X \,\big|\, p_Y \bigr) - D_{1-\beta}\bigl( g^{(\beta)}_{X|Y}\, g^{(\beta)}_Y \,\big\|\, b_{X|Y}\, g^{(\beta)}_Y \bigr),$ (252)
where (249) follows from (247) and (248) and the fact that $g^{(\beta)}_Y(y) = g^{(\beta)}_Y(y)^{1-\beta}\, g^{(\beta)}_Y(y)^{\beta}$; (250) follows by identifying the Rényi divergence; (251) follows from (242); and (252) follows by identifying the conditional Rényi divergence using (78). □
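The maximizer (244) is easy to evaluate numerically. The sketch below (not part of the paper; the joint PMF, the odds, and the variable names are invented for illustration) computes $g^{(\beta)}_{X|Y}$ and checks, against randomly drawn conditional PMFs, that no other strategy yields a larger utility:

```python
import numpy as np

rng = np.random.default_rng(0)
p_xy = np.array([[0.30, 0.10],   # joint PMF p_XY: rows indexed by x, columns by y
                 [0.15, 0.25],
                 [0.05, 0.15]])
o = np.array([2.0, 3.0, 6.0])    # odds
beta = -1.0                      # any beta < 1 with beta != 0

p_y = p_xy.sum(axis=0)
p_x_given_y = p_xy / p_y

def U(b_x_given_y):
    """Utility (1/beta) log E[(b(X|Y) o(X))^beta] for the wealth relative (240)."""
    return np.log(np.sum(p_xy * (b_x_given_y * o[:, None]) ** beta)) / beta

# The conditional PMF (244); the PMF (245) is only needed for the decomposition itself.
w = p_x_given_y ** (1 / (1 - beta)) * o[:, None] ** (beta / (1 - beta))
g_x_given_y = w / w.sum(axis=0)

best_random = max(U(rng.dirichlet(np.ones(3), size=2).T) for _ in range(10000))
print(U(g_x_given_y), ">=", best_random)  # g maximizes the utility, cf. Theorem 10
```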
Remark 8.
It follows from Theorem 10 that, if the gambler gambles optimally, then, for $\beta \in (-\infty, 1)$,
$U_\beta = \log c + D^{\mathrm{l}}_{1/(1-\beta)}\bigl( p_{X|Y} \,\big\|\, r_X \,\big|\, p_Y \bigr).$ (253)
Operationally, it is clear that preprocessing the side information cannot increase the gambler's utility, i.e., that, for every conditional PMF $p_{Y'|Y}$,
$D^{\mathrm{l}}_{1/(1-\beta)}\bigl( p_{X|Y'} \,\big\|\, r_X \,\big|\, p_{Y'} \bigr) \le D^{\mathrm{l}}_{1/(1-\beta)}\bigl( p_{X|Y} \,\big\|\, r_X \,\big|\, p_Y \bigr),$ (254)
where $p_{X|Y'}$ and $p_{Y'}$ are derived from the joint PMF $p_{XYY'}$ given by
$p_{XYY'}(x,y,y') = p_Y(y)\, p_{X|Y}(x|y)\, p_{Y'|Y}(y'|y).$ (255)
This provides the intuition for Theorem 6, where (254) is shown directly.
The extreme case is when the preprocessing maps the side information to a constant and hence leads to the case where the side information is absent. In this case, $Y'$ is deterministic and $p_{X|Y'}$ equals $p_X$. Theorem 9 and Theorem 10 then lead to the following relation between the conditional and the unconditional Rényi divergence:
$D_{1/(1-\beta)}\bigl( p_X \,\big\|\, r_X \bigr) \le D^{\mathrm{l}}_{1/(1-\beta)}\bigl( p_{X|Y} \,\big\|\, r_X \,\big|\, p_Y \bigr),$ (256)
where the marginal PMF $p_X$ is given by
$p_X(x) = \sum_y p_{XY}(x,y).$ (257)
This motivates Corollary 3, where (256) is derived from (254).
The last result of this section provides a new operational meaning to the Lapidoth–Pfister mutual information $J_\alpha(X;Y)$: assuming that $\beta \in (-\infty, 1)$ and that the gambler knows the winning probabilities, $J_{1/(1-\beta)}(X;Y)$ measures how much the side information that is available to the gambler but not the bookmaker increases the gambler's smallest guaranteed utility for a fixed level of fairness $c$. To see this, consider first the setting without side information. By Theorem 9, the gambler chooses $b = g^{(\beta)}$ to maximize her utility, where $g^{(\beta)}$ is defined in (207). Then, using the nonnegativity of the Rényi divergence (Proposition 1 (a)), the following lower bound on the gambler's utility follows from (206):
$U_\beta \ge \log c.$ (258)
We call the RHS of (258) the smallest guaranteed utility for a fixed level of fairness $c$ because (258) holds with equality if the bookmaker chooses the odds inversely proportional to the winning probabilities. Comparing (258) with (259) below, we see that the difference due to the side information is $J_{1/(1-\beta)}(X;Y)$. Note that $J_{1/(1-\beta)}(X;Y)$ is typically not the difference between the utility with and without side information; this is because the odds for which (258) and (259) hold with equality are typically not the same.
Theorem 11.
Let $\beta \in (-\infty, 1)$. If $b_{X|Y}$ is equal to $g^{(\beta)}_{X|Y}$ from Theorem 10, then
$U_\beta \ge \log c + J_{1/(1-\beta)}(X;Y).$ (259)
Moreover, for every $c > 0$, there exist odds $o\colon \mathcal{X} \to (0,\infty)$ such that (259) holds with equality.
Proof. 
For this choice of $b_{X|Y}$, (259) holds because
$U_\beta = \log c + D^{\mathrm{l}}_{1/(1-\beta)}\bigl( p_{X|Y} \,\big\|\, r_X \,\big|\, p_Y \bigr)$ (260)
$\ge \log c + \min_{\tilde{r}_X \in \mathcal{P}(\mathcal{X})} D^{\mathrm{l}}_{1/(1-\beta)}\bigl( p_{X|Y} \,\big\|\, \tilde{r}_X \,\big|\, p_Y \bigr)$ (261)
$= \log c + J_{1/(1-\beta)}(X;Y),$ (262)
where (260) follows from Theorem 10, and (262) follows from Proposition 5.
Fix now $c > 0$, let $\tilde{r}_X$ achieve the minimum on the RHS of (261), and choose the odds
$o(x) = \frac{c}{\tilde{r}_X(x)}.$ (263)
Then, (261) holds with equality because $r_X = \tilde{r}_X$ by (241) and (242). □

10. Horse Betting with Part of the Money

In this section, we treat the possibility that the gambler does not invest all her wealth. We restrict ourselves to the setting without side information and to $\beta \in (-\infty, 0) \cup (0, 1)$. (For the case $\beta = 0$, see [47] (Section 10.5).) We assume that $p(x) > 0$ and $o(x) > 0$ for all $x \in \mathcal{X}$. Denote by $b(0)$ the fraction of her wealth that the gambler does not use for betting. (We assume $0 \notin \mathcal{X}$.) Then, $b\colon \mathcal{X} \cup \{0\} \to [0,1]$ is a PMF, and the wealth relative $S$ is the random variable
$S \triangleq b(0) + b(X)\, o(X).$ (264)
As in Section 8, define the constant
$c \triangleq \Bigl[ \sum_x \frac{1}{o(x)} \Bigr]^{-1}.$ (265)
We treat the cases $c < 1$ and $c \ge 1$ separately, starting with the latter. If $c \ge 1$, then it is optimal to invest all the money:
Proposition 11.
Assume $c \ge 1$, let $\beta \in \mathbb{R}$, and let $b$ be a PMF on $\mathcal{X} \cup \{0\}$ with utility $U_\beta$. Then, there exists a PMF $b'$ on $\mathcal{X} \cup \{0\}$ with $b'(0) = 0$ and utility $U'_\beta \ge U_\beta$.
Proof. 
Choose the PMF $b'$ as follows:
$b'(x) = \begin{cases} \frac{c}{o(x)}\, b(0) + b(x) & \text{if } x \in \mathcal{X}, \\ 0 & \text{if } x = 0. \end{cases}$ (266)
Then, for every $x \in \mathcal{X}$,
$b'(0) + b'(x)\, o(x) = c\, b(0) + b(x)\, o(x)$ (267)
$\ge b(0) + b(x)\, o(x),$ (268)
where (268) holds because $c \ge 1$ by assumption. For $\beta > 0$, $U'_\beta \ge U_\beta$ holds because (268) implies $\mathrm{E}[S'^{\beta}] \ge \mathrm{E}[S^{\beta}]$, where $S'$ denotes the wealth relative (264) associated with $b'$. For $\beta < 0$ and $\beta = 0$, $U'_\beta \ge U_\beta$ follows similarly from (268). □
On the other hand, if β < 1 and the odds are subfair, i.e., if c < 1 , then Claim (c) of the following theorem shows that investing all the money is not optimal:
Theorem 12.
Assume $c < 1$, let $\beta \in (-\infty, 0) \cup (0, 1)$, and let $b$ be a PMF on $\mathcal{X} \cup \{0\}$ that maximizes $U_\beta$ among all such PMFs. Defining
$\mathcal{S} \triangleq \{ x \in \mathcal{X} : b(x) > 0 \},$ (269)
$\Gamma \triangleq \frac{ 1 - \sum_{x \in \mathcal{S}} p(x) }{ 1 - \sum_{x \in \mathcal{S}} \frac{1}{o(x)} },$ (270)
$\gamma(x) \triangleq \max\Bigl\{ 0,\; \Gamma^{\frac{1}{\beta-1}}\, p(x)^{\frac{1}{1-\beta}}\, o(x)^{\frac{\beta}{1-\beta}} - \frac{1}{o(x)} \Bigr\}, \quad x \in \mathcal{X},$ (271)
the following claims hold:
(a) 
Both the numerator and denominator on the RHS of (270) are positive, so Γ is well-defined and positive.
(b) 
For every x X ,
$b(x) = \gamma(x)\, b(0).$ (272)
(c) 
The quantity b ( 0 ) satisfies
$b(0) = \frac{1}{1 + \sum_{x \in \mathcal{X}} \gamma(x)}.$ (273)
In particular, b ( 0 ) > 0 .
Claim (b) implies that, for every $x \in \mathcal{X}$, $b(x) > 0$ if and only if $p(x)\, o(x) > \Gamma$. Ordering the elements $x_1, x_2, \ldots$ of $\mathcal{X}$ such that $p(x_1)\, o(x_1) \ge p(x_2)\, o(x_2) \ge \cdots$, the set $\mathcal{S}$ thus has a special structure: it is either empty or equal to $\{x_1, x_2, \ldots, x_k\}$ for some integer $k$. To maximize $U_\beta$, the following procedure can be used: for every $\mathcal{S}$ with the above structure, compute the corresponding $b$ according to (270)–(273); and from these $b$'s, take one that maximizes $U_\beta$. This procedure leads to an optimal solution: an optimal solution $b$ exists because we are optimizing a continuous function over a compact set, and $b$ corresponds to a set $\mathcal{S}$ that will be considered by the procedure.
Proof of Theorem 12.
The proof is based on the Karush–Kuhn–Tucker conditions. By separately considering the cases $\beta \in (0,1)$ and $\beta < 0$, we first show that, for $\beta \in (-\infty, 0) \cup (0, 1)$, a strategy $b(\cdot)$ is optimal if and only if the following conditions are satisfied for some $\mu \in \mathbb{R}$:
$\sum_{x \in \mathcal{X}} p(x) \bigl( b(0) + b(x)\, o(x) \bigr)^{\beta-1} \begin{cases} = \mu & \text{if } b(0) > 0, \\ \le \mu & \text{if } b(0) = 0, \end{cases}$ (274)
and, for every $x \in \mathcal{X}$,
$p(x)\, o(x) \bigl( b(0) + b(x)\, o(x) \bigr)^{\beta-1} \begin{cases} = \mu & \text{if } b(x) > 0, \\ \le \mu & \text{if } b(x) = 0. \end{cases}$ (275)
Consider first $\beta \in (0,1)$, and define the function $\tau\colon \mathcal{P}(\mathcal{X} \cup \{0\}) \to \mathbb{R}$,
$\tau(b) \triangleq \sum_{x \in \mathcal{X}} p(x) \bigl( b(0) + b(x)\, o(x) \bigr)^{\beta}.$ (276)
Since $\beta > 0$ and since the logarithm is an increasing function, maximizing $U_\beta = \frac{1}{\beta} \log \mathrm{E}[S^\beta]$ over $b$ is equivalent to maximizing $\tau(b)$. Observe that $\tau$ is concave; thus, by the Karush–Kuhn–Tucker conditions [11] (Theorem 4.4.1), it is maximized by a PMF $b$ if and only if there exists a $\lambda \in \mathbb{R}$ such that (i) for all $x \in \mathcal{X} \cup \{0\}$ with $b(x) > 0$,
$\frac{\partial \tau}{\partial b(x)}(b) = \lambda,$ (277)
and (ii) for all $x \in \mathcal{X} \cup \{0\}$ with $b(x) = 0$,
$\frac{\partial \tau}{\partial b(x)}(b) \le \lambda.$ (278)
Henceforth, we use the following notation: to designate that (i) and (ii) both hold, we write
$\frac{\partial \tau}{\partial b(x)}(b) \begin{cases} = \lambda & \text{if } b(x) > 0, \\ \le \lambda & \text{if } b(x) = 0. \end{cases}$ (279)
Dividing both sides of (279) by $\beta > 0$ and defining $\mu \triangleq \frac{\lambda}{\beta}$, we obtain that (279) is equivalent to
$\frac{1}{\beta} \cdot \frac{\partial \tau}{\partial b(x)}(b) \begin{cases} = \mu & \text{if } b(x) > 0, \\ \le \mu & \text{if } b(x) = 0. \end{cases}$ (280)
Now, (280) translates to (274) for $x = 0$ and to (275) for $x \in \mathcal{X}$.
Consider now $\beta < 0$, and define $\tau$ as in (276). Then, because $\beta < 0$, maximizing $U_\beta = \frac{1}{\beta} \log \mathrm{E}[S^\beta]$ is equivalent to minimizing $\tau$. The function $\tau$ is convex; thus, Inequality (278) is reversed. Dividing by $\beta < 0$ again reverses the inequalities, thus (280), (274), and (275) continue to hold for $\beta < 0$.
Having established that, for all $\beta \in (-\infty, 0) \cup (0, 1)$, a strategy $b$ is optimal if and only if (274) and (275) hold, we next continue with the proof. Let $\beta \in (-\infty, 0) \cup (0, 1)$, and let $b$ be a PMF on $\mathcal{X} \cup \{0\}$ that maximizes $U_\beta$. By the above discussion, (274) and (275) are satisfied by $b$ for some $\mu \in \mathbb{R}$. The LHS of (274) is positive, so $\mu > 0$. We now show that, for all $x \in \mathcal{X}$,
$b(x) = \max\Bigl\{ 0,\; \Bigl( \frac{p(x)\, o(x)^{\beta}}{\mu} \Bigr)^{\frac{1}{1-\beta}} - \frac{b(0)}{o(x)} \Bigr\}.$ (281)
To this end, fix $x \in \mathcal{X}$. If $b(x) > 0$, then (275) implies
$b(x) = \Bigl( \frac{p(x)\, o(x)^{\beta}}{\mu} \Bigr)^{\frac{1}{1-\beta}} - \frac{b(0)}{o(x)},$ (282)
and the RHS of (282) is equal to the RHS of (281) because, being equal to b ( x ) , it is positive. If b ( x ) = 0 , then (275) implies
$\Bigl( \frac{p(x)\, o(x)^{\beta}}{\mu} \Bigr)^{\frac{1}{1-\beta}} - \frac{b(0)}{o(x)} \le 0,$ (283)
so the RHS of (281) is zero and (281) hence holds.
Having established (281), we next show that $b(\hat{x}) = 0$ for some $\hat{x} \in \mathcal{X}$. For a contradiction, assume that $b(x) > 0$ for all $x \in \mathcal{X}$. Then,
$\sum_{x \in \mathcal{X}} p(x) \bigl( b(0) + b(x)\, o(x) \bigr)^{\beta-1} = \mu \sum_{x \in \mathcal{X}} \frac{1}{o(x)}$ (284)
$> \mu,$ (285)
where (284) follows from (275), and (285) holds because c < 1 by assumption. However, this is impossible: (285) contradicts (274).
Let now $\hat{x} \in \mathcal{X}$ be such that $b(\hat{x}) = 0$. Then, by (281),
$\Bigl( \frac{p(\hat{x})\, o(\hat{x})^{\beta}}{\mu} \Bigr)^{\frac{1}{1-\beta}} - \frac{b(0)}{o(\hat{x})} \le 0.$ (286)
Because $p(\hat{x})$ and $o(\hat{x})$ are positive, this implies $b(0) > 0$. Thus, by (274),
$\sum_{x \in \mathcal{X}} p(x) \bigl( b(0) + b(x)\, o(x) \bigr)^{\beta-1} = \mu.$ (287)
Splitting the sum on the LHS of (287) depending on whether b ( x ) > 0 or b ( x ) = 0 , we obtain
$\mu = \sum_{x \in \mathcal{S}} p(x) \bigl( b(0) + b(x)\, o(x) \bigr)^{\beta-1} + \sum_{x \notin \mathcal{S}} p(x) \bigl( b(0) + b(x)\, o(x) \bigr)^{\beta-1}$ (288)
$= \sum_{x \in \mathcal{S}} \frac{\mu}{o(x)} + \sum_{x \notin \mathcal{S}} p(x)\, b(0)^{\beta-1}$ (289)
$= \mu \sum_{x \in \mathcal{S}} \frac{1}{o(x)} + b(0)^{\beta-1} \Bigl( 1 - \sum_{x \in \mathcal{S}} p(x) \Bigr),$ (290)
where (289) follows from (275). Rearranging (290), we obtain
$\mu \Bigl( 1 - \sum_{x \in \mathcal{S}} \frac{1}{o(x)} \Bigr) = b(0)^{\beta-1} \Bigl( 1 - \sum_{x \in \mathcal{S}} p(x) \Bigr).$ (291)
Recall that $\mu > 0$ and $b(0) > 0$. In addition, $1 - \sum_{x \in \mathcal{S}} p(x) > 0$ because $b(\hat{x}) = 0$ and hence $\hat{x} \notin \mathcal{S}$. Thus, $1 - \sum_{x \in \mathcal{S}} \frac{1}{o(x)} > 0$, so both the numerator and denominator in the definition of $\Gamma$ in (270) are positive, which establishes Claim (a), namely that $\Gamma$ is well-defined and positive.
To establish Claim (b), note that (291) and (270) imply that μ is given by
$\mu = b(0)^{\beta-1}\, \Gamma,$ (292)
which, when substituted into (281), yields (272).
We conclude by proving Claim (c). Because $b$ is a PMF on $\mathcal{X} \cup \{0\}$,
$1 = b(0) + \sum_{x \in \mathcal{X}} b(x)$ (293)
$= b(0) \Bigl( 1 + \sum_{x \in \mathcal{X}} \gamma(x) \Bigr),$ (294)
where (294) follows from (272). Rearranging (294) yields (273). □
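The search procedure described after Theorem 12 is straightforward to implement. The following sketch (not part of the paper; the numbers and helper names are invented, and candidate sets that violate Claim (a) are simply skipped) sorts the horses by $p(x)\,o(x)$, computes a candidate $b$ for every set $\mathcal{S} = \{x_1, \ldots, x_k\}$ via (270)–(273), and keeps the best one:

```python
import numpy as np

p = np.array([0.5, 0.3, 0.2])
o = np.array([1.5, 2.5, 4.0])    # subfair odds: c = 1/sum(1/o) < 1
beta = -1.0                      # any beta in (-inf, 0) or (0, 1)

def U(b0, b):
    """Utility (1/beta) log E[S^beta] with S = b(0) + b(X) o(X), cf. (264)."""
    return np.log(np.sum(p * (b0 + b * o) ** beta)) / beta

order = np.argsort(-p * o)       # horses sorted by p(x) o(x), largest first
best = (U(1.0, np.zeros_like(p)), 1.0, np.zeros_like(p))   # empty S: bet nothing
for k in range(1, len(p) + 1):
    S = order[:k]
    numer = 1 - p[S].sum()
    denom = 1 - (1 / o[S]).sum()
    if numer <= 0 or denom <= 0:
        continue                 # such an S cannot come from an optimal b (Claim (a))
    Gamma = numer / denom
    gamma = np.maximum(0.0, Gamma ** (1 / (beta - 1))
                       * p ** (1 / (1 - beta)) * o ** (beta / (1 - beta)) - 1 / o)
    b0 = 1 / (1 + gamma.sum())   # (273)
    b = gamma * b0               # (272)
    best = max(best, (U(b0, b), b0, b), key=lambda t: t[0])

print(best)                      # best utility found, b(0), and the bets b(x)
```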

11. Universal Betting for IID Races

In this section, we present a universal gambling strategy for IID races that requires neither knowledge of the winning probabilities nor of the parameter $\beta$ of the utility function and yet asymptotically maximizes the utility function for all PMFs $p$ and all $\beta \in \mathbb{R}$. Consider $n$ consecutive horse races, where the winning horse in the $i$th race is denoted $X_i$ for $i \in \{1, \ldots, n\}$. We assume that $X_1, \ldots, X_n$ are IID according to the PMF $p$, where $p(x) > 0$ for all $x \in \mathcal{X}$. In every race, the bookmaker offers the same odds $o\colon \mathcal{X} \to (0,\infty)$, and the gambler spends all her wealth placing bets on the horses. The gambler plays race after race, i.e., before placing bets for a race, she is revealed the winning horse of the previous race and receives the money from the bookmaker. Her betting strategy is hence a sequence of conditional PMFs $b_{X_1}, b_{X_2|X_1}, b_{X_3|X_1 X_2}, \ldots, b_{X_n|X_1 X_2 \cdots X_{n-1}}$. The wealth relative is the random variable
$S_n \triangleq \prod_{i=1}^{n} b(X_i \,|\, X_1, \ldots, X_{i-1})\, o(X_i).$ (295)
We seek betting strategies that maximize the utility function
$U_{\beta,n} \triangleq \begin{cases} \frac{1}{\beta} \log \mathrm{E}[S_n^{\beta}] & \text{if } \beta \ne 0, \\ \mathrm{E}[\log S_n] & \text{if } \beta = 0. \end{cases}$ (296)
We first establish that, to maximize $U_{\beta,n}$ for a fixed $\beta \in \mathbb{R}$, it suffices to use the same betting strategy in every race; see Theorem 13. We then show that the individual-sequence-universal strategy by Cover–Ordentlich [48] asymptotically achieves the same normalized utility without knowing $p$ or $\beta$ (see Theorem 14).
For a fixed $\beta \in \mathbb{R}$, let the PMF $b^*$ be a betting strategy that maximizes the single-race utility $U_\beta$ discussed in Section 8, and denote by $U^*_\beta$ the utility associated with $b^*$. Using the same betting strategy $b^*$ over $n$ races leads to the utility $U_{\beta,n}$, and it follows from (295) and (296) that
$U_{\beta,n} = n\, U^*_\beta.$ (297)
As we show next, $n\, U^*_\beta$ is the maximum utility that can be achieved among all betting strategies:
Theorem 13.
Let $\beta \in \mathbb{R}$, and let $b_{X_1}, b_{X_2|X_1}, b_{X_3|X_1 X_2}, \ldots, b_{X_n|X_1 X_2 \cdots X_{n-1}}$ be a sequence of conditional PMFs. Then,
$U_{\beta,n} \le n\, U^*_\beta.$ (298)
Proof. 
We show (298) for $\beta > 0$; analogous arguments establish (298) for $\beta < 0$ and $\beta = 0$. We prove (298) by induction on $n$. For $n = 1$, (298) holds because $U^*_\beta$ is the maximum single-race utility. Assume now $n \ge 2$ and that (298) is valid for $n-1$. For $\beta > 0$, (298) holds because
$U_{\beta,n} = \frac{1}{\beta} \log \mathrm{E}[S_n^{\beta}]$ (299)
$= \frac{1}{\beta} \log \sum_{x_1, \ldots, x_n} P(x_1) \cdots P(x_n) \prod_{i=1}^{n} b(x_i | x^{i-1})^{\beta} o(x_i)^{\beta}$ (300)
$= \frac{1}{\beta} \log \sum_{x_1, \ldots, x_{n-1}} P(x_1) \cdots P(x_{n-1}) \prod_{i=1}^{n-1} b(x_i | x^{i-1})^{\beta} o(x_i)^{\beta} \Bigl[ \sum_{x_n} P(x_n)\, b(x_n | x^{n-1})^{\beta} o(x_n)^{\beta} \Bigr]$ (301)
$\le \frac{1}{\beta} \log \sum_{x_1, \ldots, x_{n-1}} P(x_1) \cdots P(x_{n-1}) \prod_{i=1}^{n-1} b(x_i | x^{i-1})^{\beta} o(x_i)^{\beta} \Bigl[ \max_{b \in \mathcal{P}(\mathcal{X})} \sum_{x_n} P(x_n)\, b(x_n)^{\beta} o(x_n)^{\beta} \Bigr]$ (302)
$= \frac{1}{\beta} \log \sum_{x_1, \ldots, x_{n-1}} P(x_1) \cdots P(x_{n-1}) \prod_{i=1}^{n-1} b(x_i | x^{i-1})^{\beta} o(x_i)^{\beta} \Bigl[ \sum_{x_n} P(x_n)\, b^*(x_n)^{\beta} o(x_n)^{\beta} \Bigr]$ (303)
$= U_{\beta,n-1} + U^*_\beta$ (304)
$\le (n-1)\, U^*_\beta + U^*_\beta$ (305)
$= n\, U^*_\beta,$ (306)
where (303) holds because $b^*$ maximizes the single-race utility $U_\beta$, and (305) holds because (298) is valid for $n-1$. □
In portfolio theory, Cover–Ordentlich [48] (Definition 1) proposed a universal strategy. Adapted to our setting, it leads to the following sequence of conditional PMFs:
$\hat{b}(x_i | x^{i-1}) = \frac{ \int_{\mathcal{P}(\mathcal{X})} b(x_i)\, S_{i-1}(b, x^{i-1})\, \mathrm{d}\mu(b) }{ \int_{\mathcal{P}(\mathcal{X})} S_{i-1}(b, x^{i-1})\, \mathrm{d}\mu(b) },$ (307)
where $i \in \{1, 2, \ldots\}$; $\mu$ is the Dirichlet$(1/2, \ldots, 1/2)$ distribution on $\mathcal{P}(\mathcal{X})$; $S_0(b, x^0) \triangleq 1$; and
$S_i(b, x^i) \triangleq \prod_{j=1}^{i} b(x_j)\, o(x_j).$ (308)
This strategy depends neither on the winning probabilities $p$ nor on the parameter $\beta$. Denoting the utility (296) associated with the strategy $\hat{b}(x_i | x^{i-1})$ by $\hat{U}_{\beta,n}$, we have the following result:
Theorem 14.
For every $\beta \in \mathbb{R}$,
$n\, U^*_\beta - \log 2 - \frac{|\mathcal{X}| - 1}{2} \log(n+1) \le \hat{U}_{\beta,n}$ (309)
$\le n\, U^*_\beta.$ (310)
Hence,
$\lim_{n \to \infty} \frac{1}{n} \hat{U}_{\beta,n} = U^*_\beta.$ (311)
Proof. 
Inequality (310) follows from Theorem 13; and (311) follows from (309) and (310) and the sandwich theorem. It thus remains to establish (309). We do so for $\beta > 0$; analogous arguments establish (309) for $\beta < 0$ and $\beta = 0$. For a fixed sequence $x^n \in \mathcal{X}^n$, let $\tilde{b}$ be a PMF on $\mathcal{X}$ that maximizes $S_n(b, x^n)$, and denote the wealth relative in (295) associated with using $\tilde{b}$ in every race by $\tilde{S}_n(x^n)$; thus,
$\tilde{S}_n(x^n) = \max_{b \in \mathcal{P}(\mathcal{X})} \prod_{i=1}^{n} b(x_i)\, o(x_i).$ (312)
Let $\hat{S}_n(x^n)$ denote the wealth relative in (295) associated with the strategy $\hat{b}(x_i | x^{i-1})$ and the sequence $x^n$. Using [48] (Theorem 2), it follows that, for every $x^n \in \mathcal{X}^n$,
$\hat{S}_n(x^n) \ge \frac{1}{2} (n+1)^{-(|\mathcal{X}|-1)/2}\, \tilde{S}_n(x^n).$ (313)
This implies that (309) holds for β > 0 because
$\hat{U}_{\beta,n} = \frac{1}{\beta} \log \mathrm{E}\bigl[ \hat{S}_n(X^n)^{\beta} \bigr]$ (314)
$\ge \frac{1}{\beta} \log \mathrm{E}\bigl[ \tilde{S}_n(X^n)^{\beta} \bigr] - \log 2 - \frac{|\mathcal{X}|-1}{2} \log(n+1)$ (315)
$\ge \frac{1}{\beta} \log \sum_{x_1, \ldots, x_n} P(x_1) \cdots P(x_n) \prod_{i=1}^{n} b^*(x_i)^{\beta} o(x_i)^{\beta} - \log 2 - \frac{|\mathcal{X}|-1}{2} \log(n+1)$ (316)
$= n\, U^*_\beta - \log 2 - \frac{|\mathcal{X}|-1}{2} \log(n+1),$ (317)
where (315) follows from (313), and (316) follows from (312). □
Remark 9.
As discussed in Section 8, the optimal single-race betting strategy varies significantly with different values of $\beta$; thus, it might be a bit surprising that the Cover–Ordentlich strategy is not only universal with respect to the winning probabilities, but also with respect to $\beta$. This is due to the following two reasons: First, for fixed winning probabilities and a fixed $\beta$, it is optimal to use the same betting strategy in every race (see Theorem 13). Second, for every $x^n \in \mathcal{X}^n$, the wealth relative of the Cover–Ordentlich strategy is not much worse than that of using the same strategy $b(\cdot)$ in every race, irrespective of $b(\cdot)$ (see (313)). Hence, irrespective of the optimal single-race betting strategy, the Cover–Ordentlich strategy is able to asymptotically achieve the same normalized utility.
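To see the Cover–Ordentlich strategy in action, the following sketch (not part of the paper; the parameters are invented, and the mixture in (307) is approximated by Monte Carlo sampling from the Dirichlet$(1/2,\ldots,1/2)$ prior) runs the strategy on a simulated IID sequence and compares its wealth with the lower bound implied by (313):

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.5, 0.3, 0.2])    # used only to simulate the races
o = np.array([2.0, 3.0, 6.0])
n = 100
x = rng.choice(len(p), size=n, p=p)          # winning horses X_1, ..., X_n

M = 20000                                    # Monte Carlo sample size
B = rng.dirichlet(0.5 * np.ones(len(p)), size=M)   # constant strategies b ~ Dirichlet(1/2)
wealth = np.ones(M)                          # S_{i-1}(b, x^{i-1}) for every sampled b, cf. (308)
S_hat = 1.0                                  # wealth of the universal strategy
for i in range(n):
    b_hat = (B[:, x[i]] * wealth).sum() / wealth.sum()   # approximates (307)
    S_hat *= b_hat * o[x[i]]
    wealth *= B[:, x[i]] * o[x[i]]

# Best constant strategy in hindsight, cf. (312): the maximum is attained by the
# empirical frequencies of the observed sequence.
counts = np.bincount(x, minlength=len(p))
S_tilde = np.prod((counts / n) ** counts) * np.prod(o[x])

lower = np.log(S_tilde) - np.log(2) - (len(p) - 1) / 2 * np.log(n + 1)
print(np.log(S_hat), ">=", lower)   # (313) guarantees this for the exact mixture
```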

Author Contributions

Writing—original draft preparation, C.B., A.L., and C.P.; and writing—review and editing, C.B., A.L., and C.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proof of Proposition 1

These properties mostly follow from van Erven–Harremoës [8]:
(a)
See [8] (Theorem 8).
(b)
This follows from the definitions in (4) and (21)–(23) and the conventions in (20).
(c)
This follows from [8] (Theorem 7) and the fact that $\lim_{\alpha \to 1} D_\alpha(P \| Q) = D(P \| Q)$ by L'Hôpital's rule. (Note that $\alpha \mapsto D_\alpha(P \| Q)$ does not need to be continuous at $\alpha = 1$ when the alphabets are not finite; see the discussion after [8] (Equation (18)).)
(d)
See [8] (Theorem 3).
(e)
Let $\alpha, \alpha' \in (0, \infty)$ satisfy $\alpha \le \alpha'$. Then,
$\frac{1-\alpha}{\alpha} D_\alpha(P \| Q) = D_{1-\alpha}(Q \| P)$ (A1)
$\ge D_{1-\alpha'}(Q \| P)$ (A2)
$= \frac{1-\alpha'}{\alpha'} D_{\alpha'}(P \| Q),$ (A3)
where (A1) and (A3) follow from [8] (Lemma 10), and (A2) holds because the Rényi divergence, extended to negative orders, is nondecreasing ([8] (Theorem 39)).
(f)
See [8] (Corollary 2).
(g)
For $\alpha \in (0, \infty)$,
$(\alpha - 1)\, D_{1/\alpha}(P \| Q) = \alpha \Bigl( 1 - \frac{1}{\alpha} \Bigr) D_{1/\alpha}(P \| Q)$ (A4)
$= \alpha \inf_R \Bigl[ \frac{1}{\alpha} D(R \| P) + \Bigl( 1 - \frac{1}{\alpha} \Bigr) D(R \| Q) \Bigr]$ (A5)
$= \inf_R \bigl[ D(R \| P) + (\alpha - 1)\, D(R \| Q) \bigr],$ (A6)
where (A5) follows from [8] (Theorem 30). Hence, $(\alpha - 1)\, D_{1/\alpha}(P \| Q)$ is concave in $\alpha$ because the expression in square brackets on the RHS of (A6) is concave in $\alpha$ for every $R$ and because the pointwise infimum preserves the concavity.
(h)
See [8] (Theorem 9).

Appendix B. Proof of Theorem 1

Beginning with (29),
$D_\alpha^{\mathrm{c}}\bigl( P_{Y'|X} \,\big\|\, Q_{Y'|X} \,\big|\, P_X \bigr) = \sum_{x \in \mathrm{supp}(P_X)} P_X(x)\, D_\alpha\bigl( P_{Y'|X=x} \,\big\|\, Q_{Y'|X=x} \bigr)$ (A7)
$\le \sum_{x \in \mathrm{supp}(P_X)} P_X(x)\, D_\alpha\bigl( P_{Y|X=x} \,\big\|\, Q_{Y|X=x} \bigr)$ (A8)
$= D_\alpha^{\mathrm{c}}\bigl( P_{Y|X} \,\big\|\, Q_{Y|X} \,\big|\, P_X \bigr),$ (A9)
where (A8) follows by applying, separately for every $x \in \mathrm{supp}(P_X)$, Proposition 1 (h) with the conditional PMF $A_{Y'|Y, X=x}$.

Appendix C. Proof of Theorem 2

We show (43) for $\alpha \in (0, 1)$; the claim then extends to $\alpha \in [0, 1]$ by the continuity of $D_\alpha^{\mathrm{c}}(\cdot)$ in $\alpha$ (Proposition 2 (c)). Let $\alpha \in (0, 1)$. Keeping in mind that $\alpha - 1 < 0$, (43) holds because
$(\alpha - 1)\, D_\alpha^{\mathrm{c}}\bigl( P_{Y|X'} \,\big\|\, Q_{Y|X'} \,\big|\, P_{X'} \bigr)$
$= \sum_{x' \in \mathrm{supp}(P_{X'})} P_{X'}(x') \log \sum_y P_{Y|X'}(y|x')^{\alpha}\, Q_{Y|X'}(y|x')^{1-\alpha}$ (A10)
$= \sum_{x' \in \mathrm{supp}(P_{X'})} P_{X'}(x') \log \sum_y \Bigl[ \sum_x B_{X|X'}(x|x')\, P_{Y|X}(y|x) \Bigr]^{\alpha} \Bigl[ \sum_x B_{X|X'}(x|x')\, Q_{Y|X}(y|x) \Bigr]^{1-\alpha}$ (A11)
$\ge \sum_{x' \in \mathrm{supp}(P_{X'})} P_{X'}(x') \log \sum_y \sum_x B_{X|X'}(x|x')\, P_{Y|X}(y|x)^{\alpha}\, Q_{Y|X}(y|x)^{1-\alpha}$ (A12)
$= \sum_{x' \in \mathrm{supp}(P_{X'})} P_{X'}(x') \log \sum_{x \in \mathrm{supp}(P_X)} B_{X|X'}(x|x') \sum_y P_{Y|X}(y|x)^{\alpha}\, Q_{Y|X}(y|x)^{1-\alpha}$ (A13)
$\ge \sum_{x' \in \mathrm{supp}(P_{X'})} P_{X'}(x') \sum_{x \in \mathrm{supp}(P_X)} B_{X|X'}(x|x') \log \sum_y P_{Y|X}(y|x)^{\alpha}\, Q_{Y|X}(y|x)^{1-\alpha}$ (A14)
$= \sum_{x \in \mathrm{supp}(P_X)} P_X(x) \Bigl[ \sum_{x' \in \mathrm{supp}(P_{X'})} B_{X'|X}(x'|x) \Bigr] \log \sum_y P_{Y|X}(y|x)^{\alpha}\, Q_{Y|X}(y|x)^{1-\alpha}$ (A15)
$= \sum_{x \in \mathrm{supp}(P_X)} P_X(x) \log \sum_y P_{Y|X}(y|x)^{\alpha}\, Q_{Y|X}(y|x)^{1-\alpha}$ (A16)
$= (\alpha - 1)\, D_\alpha^{\mathrm{c}}\bigl( P_{Y|X} \,\big\|\, Q_{Y|X} \,\big|\, P_X \bigr),$ (A17)
where (A10) follows from (30); (A11) follows from (41) and (42); (A12) follows from Hölder's inequality; (A13) holds because $B_{X|X'}(x|x') = 0$ if $P_{X'}(x') > 0$ and $P_X(x) = 0$; (A14) follows from Jensen's inequality because $\log(\cdot)$ is concave; (A15) follows from (40); (A16) holds because $P_X(x) > 0$ and $P_{X'}(x') = 0$ imply $B_{X'|X}(x'|x) = 0$, hence the expression in square brackets on the LHS of (A16) equals one; and (A17) follows from (30).

Appendix D. Proof of Corollary 1

Applying Theorem 2 with $\mathcal{X}' \triangleq \{1\}$ and the conditional PMF $B_{X'|X}(x'|x) \triangleq 1$, we obtain
$D_\alpha^{\mathrm{c}}\bigl( P_{Y|X'} \,\big\|\, Q_{Y|X'} \,\big|\, P_{X'} \bigr) \le D_\alpha^{\mathrm{c}}\bigl( P_{Y|X} \,\big\|\, Q_{Y|X} \,\big|\, P_X \bigr).$ (A18)
To complete the proof of (48), observe that
$D_\alpha^{\mathrm{c}}\bigl( P_{Y|X'} \,\big\|\, Q_{Y|X'} \,\big|\, P_{X'} \bigr) = D_\alpha^{\mathrm{c}}\bigl( P_Y \,\big\|\, Q_Y \,\big|\, P_{X'} \bigr)$ (A19)
$= D_\alpha\bigl( P_Y \,\big\|\, Q_Y \bigr),$ (A20)
where (A19) holds because (41) and (46) imply $P_{Y|X'}(y|x') = P_Y(y)$ and because (42) and (47) imply $Q_{Y|X'}(y|x') = Q_Y(y)$; and (A20) follows from Remark 1.

Appendix E. Proof of Example 1

If $\alpha = \infty$, then it can be verified numerically that (53) holds for $\epsilon = 0.1$. Fix now $\alpha \in (1, \infty)$. Then, for all $\epsilon \in (0, 1)$,
$D_\alpha\bigl( P_Y \,\big\|\, Q_Y^{(\epsilon)} \bigr) = \frac{1}{\alpha - 1} \log \Bigl( 0.5^{\alpha} (1-\epsilon)^{1-\alpha} + 0.5^{\alpha} \epsilon^{1-\alpha} \Bigr)$ (A21)
$\ge \frac{1}{\alpha - 1} \log \Bigl( 0.5^{\alpha} \epsilon^{1-\alpha} \Bigr)$ (A22)
$= \frac{\alpha}{\alpha - 1} \log 0.5 + \log \frac{1}{\epsilon}.$ (A23)
The RHS of (53) satisfies, for sufficiently small ϵ ,
$D_\alpha^{\mathrm{c}}\bigl( P_{Y|X}^{(\epsilon)} \,\big\|\, Q_Y^{(\epsilon)} \,\big|\, P_X \bigr) = 0.5 \cdot 0 + 0.5 \cdot D_\alpha\bigl( P_{Y|X=1}^{(\epsilon)} \,\big\|\, Q_Y^{(\epsilon)} \bigr)$ (A24)
$= \frac{0.5}{\alpha - 1} \log \Bigl( \epsilon^{\alpha} (1-\epsilon)^{1-\alpha} + (1-\epsilon)^{\alpha} \epsilon^{1-\alpha} \Bigr)$ (A25)
$= \frac{0.5}{\alpha - 1} \log \Bigl( \epsilon^{1-\alpha} \bigl[ (1-\epsilon)^{\alpha} + \epsilon^{2\alpha - 1} (1-\epsilon)^{1-\alpha} \bigr] \Bigr)$ (A26)
$\le \frac{0.5}{\alpha - 1} \log \bigl( 2\, \epsilon^{1-\alpha} \bigr)$ (A27)
$= \frac{0.5}{\alpha - 1} \log 2 + 0.5 \log \frac{1}{\epsilon},$ (A28)
where (A27) holds for sufficiently small $\epsilon$ because $\lim_{\epsilon \to 0} \bigl[ (1-\epsilon)^{\alpha} + \epsilon^{2\alpha - 1} (1-\epsilon)^{1-\alpha} \bigr] = 1$. Because $\lim_{\epsilon \to 0} \log \frac{1}{\epsilon} = \infty$, (53) follows from (A23) and (A28) for sufficiently small $\epsilon$.

Appendix F. Proof of Theorem 3

Observe that, for all $x' \in \mathcal{X}$ and all $y' \in \mathcal{Y}'$,
$P_X(x')\, P_{Y'|X}(y'|x') = \sum_{x, y} P_X(x)\, P_{Y|X}(y|x)\, \mathbb{1}\{x' = x\}\, A_{Y'|XY}(y'|x, y),$ (A29)
$P_X(x')\, Q_{Y'|X}(y'|x') = \sum_{x, y} P_X(x)\, Q_{Y|X}(y|x)\, \mathbb{1}\{x' = x\}\, A_{Y'|XY}(y'|x, y).$ (A30)
Hence, (68) follows from (54) and
$D_\alpha\bigl( P_X P_{Y'|X} \,\big\|\, P_X Q_{Y'|X} \bigr) \le D_\alpha\bigl( P_X P_{Y|X} \,\big\|\, P_X Q_{Y|X} \bigr),$ (A31)
which follows from the data-processing inequality for the Rényi divergence by substituting $\mathbb{1}_{X'=X}\, A_{Y'|XY}$ for $A_{X'Y'|XY}$ in Proposition 1 (h).

Appendix G. Proof of Theorem 4

Observe that, for all $x' \in \mathcal{X}'$ and all $y \in \mathcal{Y}$,
$P_{X'}(x')\, P_{Y|X'}(y|x') = \sum_{x, y'} P_X(x)\, P_{Y|X}(y'|x)\, B_{X'|X}(x'|x)\, \mathbb{1}\{y = y'\},$ (A32)
$P_{X'}(x')\, Q_{Y|X'}(y|x') = \sum_{x, y'} P_X(x)\, Q_{Y|X}(y'|x)\, B_{X'|X}(x'|x)\, \mathbb{1}\{y = y'\}.$ (A33)
Hence, (73) follows from (54) and
$D_\alpha\bigl( P_{X'} P_{Y|X'} \,\big\|\, P_{X'} Q_{Y|X'} \bigr) \le D_\alpha\bigl( P_X P_{Y|X} \,\big\|\, P_X Q_{Y|X} \bigr),$ (A34)
which follows from the data-processing inequality for the Rényi divergence by substituting $B_{X'|X}\, \mathbb{1}_{Y'=Y}$ for $A_{X'Y'|XY}$ in Proposition 1 (h).

References

1. Kelly, J.L., Jr. A new interpretation of information rate. Bell Syst. Tech. J. 1956, 35, 917–926.
2. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2006; ISBN 978-0-471-24195-9.
3. Lapidoth, A.; Pfister, C. Two measures of dependence. Entropy 2019, 21, 778.
4. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
5. Csiszár, I.; Körner, J. Information Theory: Coding Theorems for Discrete Memoryless Systems, 2nd ed.; Cambridge University Press: Cambridge, UK, 2011; ISBN 978-0-521-19681-9.
6. Csiszár, I.; Shields, P.C. Information Theory and Statistics: A Tutorial; now Publishers: Hanover, MA, USA, 2004; ISBN 978-1-933019-05-5.
7. Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 20 June–30 July 1960; Volume 1, pp. 547–561.
8. Van Erven, T.; Harremoës, P. Rényi divergence and Kullback–Leibler divergence. IEEE Trans. Inf. Theory 2014, 60, 3797–3820.
9. Csiszár, I. Generalized cutoff rates and Rényi’s information measures. IEEE Trans. Inf. Theory 1995, 41, 26–34.
10. Sibson, R. Information radius. Z. Wahrscheinlichkeitstheorie verw. Geb. 1969, 14, 149–160.
11. Gallager, R.G. Information Theory and Reliable Communication; John Wiley & Sons: Hoboken, NJ, USA, 1968; ISBN 978-0-471-29048-3.
12. Arimoto, S. Information measures and capacity of order α for discrete memoryless channels. In Topics in Information Theory; Csiszár, I., Elias, P., Eds.; North-Holland Publishing Company: Amsterdam, The Netherlands, 1977; pp. 41–52. ISBN 0-7204-0699-4.
13. Eeckhoudt, L.; Gollier, C.; Schlesinger, H. Economic and Financial Decisions under Risk; Princeton University Press: Princeton, NJ, USA, 2005; ISBN 978-0-691-12215-1.
14. Soklakov, A.N. Economics of disagreement – financial intuition for the Rényi divergence. arXiv 2018, arXiv:1811.08308.
15. Steele, J.M. The Cauchy–Schwarz Master Class: An Introduction to the Art of Mathematical Inequalities; Cambridge University Press: Cambridge, UK, 2004; ISBN 978-0-521-54677-5.
16. Bullen, P.S. Handbook of Means and Their Inequalities; Kluwer Academic Publishers: Dordrecht, The Netherlands, 2003; ISBN 978-1-4020-1522-9.
17. Campbell, L.L. A coding theorem and Rényi’s entropy. Inf. Control 1965, 8, 423–429.
18. Campbell, L.L. Definition of entropy by means of a coding problem. Z. Wahrscheinlichkeitstheorie verw. Geb. 1966, 6, 113–118.
19. Merhav, N. On optimum strategies for minimizing the exponential moments of a loss function. Commun. Inf. Syst. 2011, 11, 343–368.
20. Sason, I.; Verdú, S. Improved bounds on lossless source coding and guessing moments via Rényi measures. IEEE Trans. Inf. Theory 2018, 64, 4323–4346.
21. Augustin, U. Noisy Channels. Habilitation Thesis, Universität Erlangen–Nürnberg, Erlangen, Germany, 1978.
22. Nakiboğlu, B. The Augustin capacity and center. Probl. Inf. Transm. 2019, 55, 299–342.
23. Nakiboğlu, B. The sphere packing bound for memoryless channels. arXiv 2018, arXiv:1804.06372.
24. Ho, S.-W.; Verdú, S. Convexity/concavity of Rényi entropy and α-mutual information. In Proceedings of the 2015 IEEE International Symposium on Information Theory (ISIT), Hong Kong, China, 14–19 June 2015; pp. 745–749.
25. Verdú, S. α-mutual information. In Proceedings of the 2015 Information Theory and Applications Workshop (ITA), San Diego, CA, USA, 1–6 February 2015; pp. 1–6.
26. Nakiboğlu, B. The Rényi capacity and center. IEEE Trans. Inf. Theory 2019, 65, 841–860.
27. Cai, C.; Verdú, S. Conditional Rényi divergence saddlepoint and the maximization of α-mutual information. Entropy 2019, 21, 969.
28. Fong, S.L.; Tan, V.Y.F. Strong converse theorems for classes of multimessage multicast networks: A Rényi divergence approach. IEEE Trans. Inf. Theory 2016, 62, 4953–4967.
29. Yu, L.; Tan, V.Y.F. Rényi resolvability and its applications to the wiretap channel. IEEE Trans. Inf. Theory 2019, 65, 1862–1897.
30. Gallager, R.G. A simple derivation of the coding theorem and some applications. IEEE Trans. Inf. Theory 1965, 11, 3–18.
31. Shannon, C.E.; Gallager, R.G.; Berlekamp, E.R. Lower bounds to error probability for coding on discrete memoryless channels. I. Inf. Control 1967, 10, 65–103.
32. Shannon, C.E.; Gallager, R.G.; Berlekamp, E.R. Lower bounds to error probability for coding on discrete memoryless channels. II. Inf. Control 1967, 10, 522–552.
33. Arimoto, S. On the converse to the coding theorem for discrete memoryless channels. IEEE Trans. Inf. Theory 1973, 19, 357–359.
34. Polyanskiy, Y.; Verdú, S. Arimoto channel coding converse and Rényi divergence. In Proceedings of the 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Allerton, IL, USA, 29 September–1 October 2010; pp. 1327–1333.
35. Nakiboğlu, B. The sphere packing bound via Augustin’s method. IEEE Trans. Inf. Theory 2019, 65, 816–840.
36. Nakiboğlu, B. The sphere packing bound for DSPCs with feedback à la Augustin. IEEE Trans. Commun. 2019, 67, 7456–7467.
37. Tomamichel, M.; Hayashi, M. Operational interpretation of Rényi information measures via composite hypothesis testing against product and Markov distributions. IEEE Trans. Inf. Theory 2018, 64, 1064–1082.
38. Aishwarya, G.; Madiman, M. Remarks on Rényi versions of conditional entropy and mutual information. In Proceedings of the 2019 IEEE International Symposium on Information Theory (ISIT), Paris, France, 7–12 July 2019; pp. 1117–1121.
39. Lapidoth, A.; Pfister, C. Testing against independence and a Rényi information measure. In Proceedings of the 2018 IEEE Information Theory Workshop (ITW), Guangzhou, China, 25–29 November 2018; pp. 1–5.
40. Fehr, S.; Berens, S. On the conditional Rényi entropy. IEEE Trans. Inf. Theory 2014, 60, 6801–6810.
41. Sason, I.; Verdú, S. Arimoto–Rényi conditional entropy and Bayesian M-ary hypothesis testing. IEEE Trans. Inf. Theory 2018, 64, 4–25.
42. Arıkan, E. An inequality on guessing and its application to sequential decoding. IEEE Trans. Inf. Theory 1996, 42, 99–105.
43. Sundaresan, R. Guessing under source uncertainty. IEEE Trans. Inf. Theory 2007, 53, 269–287.
44. Bracher, A.; Lapidoth, A.; Pfister, C. Guessing with distributed encoders. Entropy 2019, 21, 298.
45. Bunte, C.; Lapidoth, A. Encoding tasks and Rényi entropy. IEEE Trans. Inf. Theory 2014, 60, 5065–5076.
46. Rockafellar, R.T. Convex Analysis; Princeton University Press: Princeton, NJ, USA, 1970; ISBN 978-0-691-01586-6.
47. Moser, S.M. Information Theory (Lecture Notes), version 6.6. 2018. Available online: http://moser-isi.ethz.ch/scripts.html (accessed on 8 March 2020).
48. Cover, T.M.; Ordentlich, E. Universal portfolios with side information. IEEE Trans. Inf. Theory 1996, 42, 348–363.
