Conditional Rényi Divergences and Horse Betting

Motivated by a horse betting problem, a new conditional Rényi divergence is introduced. It is compared with the conditional Rényi divergences that appear in the definitions of the dependence measures by Csiszár and Sibson, and the properties of all three are studied with emphasis on their behavior under data processing. In the same way that Csiszár's and Sibson's conditional divergences lead to the respective dependence measures, so does the new conditional divergence lead to the Lapidoth–Pfister mutual information. Moreover, the new conditional divergence is also related to the Arimoto–Rényi conditional entropy and to Arimoto's measure of dependence. In the second part of the paper, the horse betting problem is analyzed where, instead of Kelly's expected log-wealth criterion, a more general family of power-mean utility functions is considered. The key role in the analysis is played by the Rényi divergence, and in the setting where the gambler has access to side information, the new conditional Rényi divergence is key. The setting with side information also provides another operational meaning to the Lapidoth–Pfister mutual information. Finally, a universal strategy for independent and identically distributed races is presented that—without knowing the winning probabilities or the parameter of the utility function—asymptotically maximizes the gambler's utility function.


Introduction
As shown by Kelly [1,2], many of Shannon's information measures appear naturally in the context of horse gambling when the gambler's utility function is expected log-wealth. Here, we show that under a more general family of utility functions, gambling also provides a context for some of Rényi's information measures. Moreover, the setting where the gambler has side information motivates a new Rényi-like conditional divergence, which we study and compare to other conditional divergences. The proposed family of utility functions in the context of gambling with side information also provides another operational meaning to the Rényi-like mutual information that was recently proposed by Lapidoth and Pfister [3]: it measures the gambler's gain from the side information as measured by the increase in the minimax value of the two-player zero-sum game in which the bookmaker picks the odds and the gambler then places the bets based on these odds and her side information.
Deferring the gambling-based motivation to the second part of the paper, we first describe the different conditional divergences and study some of their properties with emphasis on their behavior under data processing. We also show that the new conditional Rényi divergence relates to the Lapidoth-Pfister mutual information in much the same way that Csiszár's and Sibson's conditional divergences relate to their corresponding mutual informations. Before discussing the conditional divergences, we first recall other information measures.
The Kullback-Leibler divergence (or relative entropy) is an important concept in information theory and statistics [2,4,5,6]. It is defined between two probability mass functions (PMFs) P and Q over a finite set X as

D(P‖Q) ≜ ∑_{x∈supp(P)} P(x) log (P(x)/Q(x)), (1)

where log(·) denotes the base-2 logarithm and supp(P) ≜ {x ∈ X : P(x) > 0} denotes the support of P. Defining a conditional Kullback-Leibler divergence is straightforward because, as simple algebra shows, the two natural approaches lead to the same result:

∑_x P_X(x) D(P_{Y|X=x}‖Q_{Y|X=x}) (2)
= D(P_X P_{Y|X}‖P_X Q_{Y|X}), (3)

where, in (3) and throughout, P_X P_{Y|X} denotes the PMF on X × Y that assigns (x, y) the probability P_X(x) P_{Y|X}(y|x). The Rényi divergence of order α [7,8] between two PMFs P and Q is defined for all positive α's other than one as

D_α(P‖Q) ≜ (1/(α−1)) log ∑_x P(x)^α Q(x)^{1−α}. (4)

A conditional Rényi divergence can be defined in more than one way. In this paper, we consider the following three definitions, two classic and one new:

D^c_α(P_{Y|X}‖Q_{Y|X}|P_X) ≜ ∑_x P_X(x) D_α(P_{Y|X=x}‖Q_{Y|X=x}), (5)
D^s_α(P_{Y|X}‖Q_{Y|X}|P_X) ≜ D_α(P_X P_{Y|X}‖P_X Q_{Y|X}), (6)
D^l_α(P_{Y|X}‖Q_{Y|X}|P_X) ≜ (α/(α−1)) log ∑_x P_X(x) 2^{((α−1)/α) D_α(P_{Y|X=x}‖Q_{Y|X=x})}, (7)

where (5) is inspired by Csiszár [9]; (6) is inspired by Sibson [10]; and (7) is motivated by the horse betting problem discussed in Section 9. The first two conditional Rényi divergences were used to define the Rényi measures of dependence of Csiszár I^c_α(X; Y) [9] and of Sibson I^s_α(X; Y) [10]:

I^c_α(X; Y) ≜ min_{Q_Y} D^c_α(P_{Y|X}‖Q_Y|P_X), (8)
I^s_α(X; Y) ≜ min_{Q_Y} D^s_α(P_{Y|X}‖Q_Y|P_X), (9)

where the minimization is over all PMFs on the set Y. (Gallager's E_0 function [11] and I^s_α(X; Y) are in one-to-one correspondence; see (65) below.) The analogous minimization of D^l_α(·) leads to the Lapidoth-Pfister mutual information J_α(X; Y) [3]:

J_α(X; Y) ≜ min_{Q_X, Q_Y} D_α(P_X P_{Y|X}‖Q_X Q_Y) (10)
= min_{Q_Y} D^l_α(P_{Y|X}‖Q_Y|P_X), (11)

where (11) is proved in Proposition 5.

The first part of the paper is structured as follows: In Section 2, we discuss some preliminaries. In Sections 3-5, we study the properties of the three conditional Rényi divergences and their associated measures of dependence. In Section 6, we express the Arimoto-Rényi conditional entropy H_α(X|Y) and the Arimoto measure of dependence I^a_α(X; Y) [12] in terms of D^l_α(P_{X|Y}‖U_X|P_Y).
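As a concrete companion to the definitions in (1) and (4), the following Python sketch computes both divergences for finite PMFs given as probability lists (the function names and the example PMFs are ours, not part of the paper):

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence D(P||Q) in bits, cf. (1)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def renyi(p, q, a):
    """Renyi divergence D_alpha(P||Q) in bits for alpha in (0,1) or (1,oo), cf. (4)."""
    return math.log2(sum(pi ** a * qi ** (1 - a)
                         for pi, qi in zip(p, q) if pi > 0)) / (a - 1)

p, q = [0.5, 0.5], [0.9, 0.1]
# D_alpha tends to the Kullback-Leibler divergence as alpha tends to 1.
print(abs(renyi(p, q, 1.0001) - kl(p, q)) < 1e-3)  # True
```

The `if pi > 0` guards implement the convention that terms with P(x) = 0 contribute nothing.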
In Section 7, we relate the conditional Rényi divergences to each other and discuss the relations between the Rényi dependence measures. The second part of the paper deals with horse gambling under our proposed family of power-mean utility functions. It is in this context that the Rényi divergence (Theorem 9) and the conditional Rényi divergence D^l_α(·) (Theorem 10) appear naturally. More specifically, consider a horse race with a finite nonempty set of horses X, where a bookmaker offers odds o(x)-for-1 on each horse x ∈ X, with o : X → (0, ∞) [2] (Section 6.1). A gambler spends all her wealth placing bets on the horses. The fraction of her wealth that she bets on Horse x ∈ X is denoted b(x) ≥ 0; these fractions sum to 1 over x ∈ X, so the PMF b is her "betting strategy." The winning horse, which we denote X, is drawn according to the PMF p, where we assume p(x) > 0 for all x ∈ X. The wealth relative (or end-to-beginning wealth ratio) is the random variable

S ≜ b(X) o(X).

Hence, given an initial wealth γ, the gambler's wealth after the race is γ S. We seek betting strategies that maximize the utility function

U_β(b) ≜ (1/β) log E[S^β] if β ≠ 0, and U_0(b) ≜ E[log S],

where β ∈ R is a parameter that accounts for the risk sensitivity. This optimization generalizes the following cases:

(a) In the limit as β tends to −∞, we optimize the worst-case return. The optimal strategy is risk-free in the sense that S does not depend on the winning horse (see Proposition 8).
(b) If β = 0, then we optimize E[log S], which is known as the doubling rate [2] (Section 6.1). The optimal strategy is proportional betting, i.e., to choose b = p (see Remark 4).
(c) If β = 1, then we optimize E[S], the expected return. The optimal strategy is to put all the money on a horse that maximizes p(x) o(x) (see Proposition 9).
(d) In general, if β ≥ 1, then it is optimal to put all the money on one horse (see Proposition 9). This is risky: if that horse loses, the gambler will go broke.
(e) In the limit as β tends to +∞, we optimize the best-case return. The optimal strategy is to put all the money on a horse that maximizes o(x) (see Proposition 10).
Note that, for β ≠ 0 and η ≜ 1 − β, maximizing U_β is equivalent to maximizing

E[S^{1−η}]/(1−η),

which is known in the finance literature as Constant Relative Risk Aversion (CRRA) utility [13,14]. We refer to our utility function as "power mean" because it can be written as the logarithm of a weighted power mean [15,16]:

U_β(b) = log (∑_{x∈X} p(x) (b(x) o(x))^β)^{1/β}. (15)

Because the power mean tends to the geometric mean as β tends to zero [15] (Problem 8.1), U_β is continuous at β = 0:

lim_{β→0} U_β(b) = U_0(b) = E[log S].

Campbell [17,18] used an exponential cost function with a similar structure to (15) to provide an operational meaning to the Rényi entropy in source coding. Other information-theoretic applications of exponential moments were studied in [19].
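The continuity of U_β at β = 0 can be checked numerically. The sketch below assumes the closed form U_β(b) = (1/β) log₂ Σ_x p(x)(b(x)o(x))^β, with the β = 0 case read as the doubling rate; the example race is ours:

```python
import math

def utility(beta, p, b, o):
    # U_beta(b) = (1/beta) log2 E[S^beta]; at beta = 0, the doubling rate E[log2 S].
    if beta == 0:
        return sum(pi * math.log2(bi * oi) for pi, bi, oi in zip(p, b, o))
    return math.log2(sum(pi * (bi * oi) ** beta
                         for pi, bi, oi in zip(p, b, o))) / beta

p = [0.6, 0.4]   # winning probabilities
o = [1.5, 3.0]   # odds o(x)-for-1
b = [0.7, 0.3]   # a betting strategy (a PMF)
# Continuity at beta = 0: the weighted power mean tends to the geometric mean.
print(abs(utility(1e-6, p, b, o) - utility(0, p, b, o)) < 1e-4)  # True
```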
The second part of the paper is structured as follows: In Section 8, we relate the utility function U β to the Rényi divergence (Theorem 9) and derive its optimal gambling strategy. In Section 9, we consider the situation where the gambler observes side information prior to betting, a situation that leads to the conditional Rényi divergence D l α (·) (Theorem 10) and to a new operational meaning for the measure of dependence J α (X; Y) (Theorem 11). In Section 10, we consider the situation where the gambler invests only part of her money. In Section 11, we present a universal strategy for independent and identically distributed (IID) races that requires neither knowledge of the winning probabilities nor of the parameter β of the utility function and yet asymptotically maximizes the utility function for all PMFs p and all β ∈ R.

Preliminaries
Throughout the paper, log(·) denotes the base-2 logarithm, X and Y are finite sets, P_XY denotes a joint PMF over X × Y, Q_X denotes a PMF over X, and Q_Y denotes a PMF over Y. An expression of the form P_X P_{Y|X} denotes the PMF on X × Y that assigns (x, y) the probability P_X(x) P_{Y|X}(y|x). We use P and Q as generic PMFs over a finite set X. We denote by supp(P) ≜ {x ∈ X : P(x) > 0} the support of P, and by P(X) the set of all PMFs over X. When clear from the context, we often omit sets and subscripts: for example, we write ∑_x for ∑_{x∈X}, P(x) for P_X(x), and P(y|x) for P_{Y|X}(y|x). When P(x) is 0, we define the conditional probability P(y|x) as 1/|Y|. The conditional distribution of Y given X = x is denoted by P_{Y|X=x}, thus P_{Y|X=x}(y) = P_{Y|X}(y|x) for all y ∈ Y. We denote by 1{condition} the indicator function that is one if the condition is satisfied and zero otherwise.
In the definition of the Kullback-Leibler divergence in (1), we use the conventions 0 log(0/q) ≜ 0 for q ≥ 0 and p log(p/0) ≜ ∞ for p > 0. In the definition of the Rényi divergence in (4), we read P(x)^α Q(x)^{1−α} as P(x)^α / Q(x)^{α−1} for α > 1 and use the conventions 0/0 ≜ 0 and p/0 ≜ ∞ for p > 0. For α being zero, one, or infinity, we define by continuous extension of (4)

D_0(P‖Q) ≜ −log ∑_{x∈supp(P)} Q(x), (21)
D_1(P‖Q) ≜ D(P‖Q), (22)
D_∞(P‖Q) ≜ log max_{x∈supp(P)} (P(x)/Q(x)). (23)

The Rényi divergence can also be defined for negative orders α. (We use negative α in the proof of Proposition 1 (e) below and in Remark 6. More about negative orders can be found in [8] (Section V). For other applications of negative orders, see [20] (Proof of Theorem 1 and Example 1).) The Rényi divergence satisfies the following basic properties:

Proposition 1. Let P and Q be PMFs. Then, the Rényi divergence D_α(P‖Q) satisfies the following: (h) (Data-processing inequality.) Let A_{X′|X} be a conditional PMF, and define the PMFs

P_{X′}(x′) ≜ ∑_x A_{X′|X}(x′|x) P(x), Q_{X′}(x′) ≜ ∑_x A_{X′|X}(x′|x) Q(x).

Then, for all α ∈ [0, ∞],

D_α(P_{X′}‖Q_{X′}) ≤ D_α(P‖Q).

Proof. See Appendix A.
All three conditional Rényi divergences reduce to the unconditional Rényi divergence when both P_{Y|X} and Q_{Y|X} are independent of X:

Remark 1. Let P_Y, Q_Y, and P_X be PMFs. Then, for all α ∈ [0, ∞],

D^c_α(P_Y‖Q_Y|P_X) = D^s_α(P_Y‖Q_Y|P_X) = D^l_α(P_Y‖Q_Y|P_X) = D_α(P_Y‖Q_Y).

Proof. This follows from the definitions of D^c_α(·), D^s_α(·), and D^l_α(·) in (5)-(7).
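Remark 1 can be verified numerically. The sketch below implements the three conditional divergences under our reading of the definitions in (5)-(7); the example PMFs are arbitrary:

```python
import math

def renyi(p, q, a):
    return math.log2(sum(pi ** a * qi ** (1 - a)
                         for pi, qi in zip(p, q) if pi > 0)) / (a - 1)

# Our reading of the three definitions:
#   (5) Csiszar: the P_X-average of per-x Renyi divergences,
#   (6) Sibson:  D_alpha(P_X P_{Y|X} || P_X Q_{Y|X}),
#   (7) new:     (a/(a-1)) log2 sum_x P_X(x) (sum_y P^a Q^(1-a))^(1/a).
def d_c(PYX, QYX, PX, a):
    return sum(px * renyi(py, qy, a) for px, py, qy in zip(PX, PYX, QYX))

def d_s(PYX, QYX, PX, a):
    s = sum(px * sum(p ** a * q ** (1 - a) for p, q in zip(py, qy))
            for px, py, qy in zip(PX, PYX, QYX))
    return math.log2(s) / (a - 1)

def d_l(PYX, QYX, PX, a):
    s = sum(px * sum(p ** a * q ** (1 - a) for p, q in zip(py, qy)) ** (1 / a)
            for px, py, qy in zip(PX, PYX, QYX))
    return a / (a - 1) * math.log2(s)

# Remark 1: if neither P_{Y|X} nor Q_{Y|X} depends on x, all three collapse.
PX, PY, QY = [0.3, 0.7], [0.2, 0.8], [0.5, 0.5]
ref = renyi(PY, QY, 0.5)
print(all(abs(f([PY, PY], [QY, QY], PX, 0.5) - ref) < 1e-9 for f in (d_c, d_s, d_l)))
```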

Csiszár's Conditional Rényi Divergence
For a PMF P_X and conditional PMFs P_{Y|X} and Q_{Y|X}, Csiszár's conditional Rényi divergence is the P_X-average of the Rényi divergences between the conditional distributions:

D^c_α(P_{Y|X}‖Q_{Y|X}|P_X) = ∑_x P_X(x) D_α(P_{Y|X=x}‖Q_{Y|X=x}). (29)

For α ∈ (0, 1) ∪ (1, ∞),

D^c_α(P_{Y|X}‖Q_{Y|X}|P_X) = (1/(α−1)) ∑_x P_X(x) log ∑_y P_{Y|X}(y|x)^α Q_{Y|X}(y|x)^{1−α},

which follows from the definition of the Rényi divergence in (4). For α being zero, one, or infinity, we obtain the corresponding expressions from (21)-(23) and (2); in particular, D^c_1(·) coincides with the conditional Kullback-Leibler divergence in (2). Augustin [21] and later Csiszár [9] defined the measure of dependence

I^c_α(X; Y) ≜ min_{Q_Y} D^c_α(P_{Y|X}‖Q_Y|P_X). (34)

Augustin used this measure to study the error exponents for channel coding with input constraints, while Csiszár used it to study generalized cutoff rates for channel coding with composition constraints. Nakiboglu [22] studied more properties of I^c_α(X; Y). Inter alia, he analyzed the minimax properties of the Augustin capacity sup_{P_X∈A} I^c_α(X; Y), where A ⊆ P(X) is a constraint set. The Augustin capacity is used in [23] to establish the sphere packing bound for memoryless channels with cost constraints.

The rest of the section presents some properties of D^c_α(·). Being an average of Rényi divergences (see (29)), D^c_α(·) inherits many properties from the Rényi divergence:

Proposition 2. Let P_X be a PMF, and let P_{Y|X} and Q_{Y|X} be conditional PMFs. Then, D^c_α(P_{Y|X}‖Q_{Y|X}|P_X) ≥ 0; and if α ∈ (0, ∞], then D^c_α(P_{Y|X}‖Q_{Y|X}|P_X) = 0 if and only if P_{Y|X=x} = Q_{Y|X=x} for all x ∈ supp(P_X).
Proof. These follow from (29) and the properties of the Rényi divergence (Proposition 1). For Parts (f) and (g), recall that a nonnegative weighted sum of concave functions is concave.
We next consider data-processing inequalities for D^c_α(·). We distinguish between processing Y and processing X. The data-processing inequality for processing Y follows from the data-processing inequality for the (unconditional) Rényi divergence:

Theorem 1. Let P_X be a PMF, and let P_{Y|X} and Q_{Y|X} be conditional PMFs. For a conditional PMF A_{Y′|XY}, define

P_{Y′|X}(y′|x) ≜ ∑_y A_{Y′|XY}(y′|x, y) P_{Y|X}(y|x), Q_{Y′|X}(y′|x) ≜ ∑_y A_{Y′|XY}(y′|x, y) Q_{Y|X}(y|x).

Then, for all α ∈ [0, ∞],

D^c_α(P_{Y′|X}‖Q_{Y′|X}|P_X) ≤ D^c_α(P_{Y|X}‖Q_{Y|X}|P_X).

Proof. See Appendix B.
The following data-processing inequality for processing X holds for α ∈ [0, 1] (as shown in Example 1 below, it does not extend to α ∈ (1, ∞]):

Theorem 2. Let P_X be a PMF, and let P_{Y|X} and Q_{Y|X} be conditional PMFs. For a conditional PMF B_{X′|X}, define the PMFs P_{X′}, P_{Y|X′}, and Q_{Y|X′} through the marginalizations below. Then, for all α ∈ [0, 1],

D^c_α(P_{Y|X′}‖Q_{Y|X′}|P_{X′}) ≤ D^c_α(P_{Y|X}‖Q_{Y|X}|P_X).

Note that P_{X′}, P_{Y|X′}, and Q_{Y|X′} in Theorem 2 can be obtained from the following marginalizations:

P_{X′}(x′) = ∑_x B_{X′|X}(x′|x) P_X(x),
P_{X′}(x′) P_{Y|X′}(y|x′) = ∑_x B_{X′|X}(x′|x) P_X(x) P_{Y|X}(y|x),
P_{X′}(x′) Q_{Y|X′}(y|x′) = ∑_x B_{X′|X}(x′|x) P_X(x) Q_{Y|X}(y|x).

Proof of Theorem 2. See Appendix C.
As a special case of Theorem 2, we obtain the following relation between the conditional and the unconditional Rényi divergence:

Corollary 1. For a PMF P_X and conditional PMFs P_{Y|X} and Q_{Y|X}, define the marginal PMFs

P_Y(y) ≜ ∑_x P_X(x) P_{Y|X}(y|x), Q_Y(y) ≜ ∑_x P_X(x) Q_{Y|X}(y|x).

Then, for all α ∈ [0, 1],

D_α(P_Y‖Q_Y) ≤ D^c_α(P_{Y|X}‖Q_{Y|X}|P_X).

Proof. See Appendix D.
Consider next α ∈ (1, ∞]. It turns out that Corollary 1, and hence Theorem 2, cannot be extended to these values of α (not even if Q_{Y|X} is restricted to be independent of X, i.e., if Q_{Y|X} = Q_Y):

Example 1. For every α ∈ (1, ∞], there exists an ε ∈ (0, 1) such that D_α(P_Y‖Q_Y) > D^c_α(P_{Y|X}‖Q_Y|P_X), where the PMF P_Y is defined by (46) and, irrespective of ε, satisfies P_Y(0) = P_Y(1) = 0.5.
Proof. See Appendix E.

Sibson's Conditional Rényi Divergence

Sibson's conditional Rényi divergence is defined in (6) as D^s_α(P_{Y|X}‖Q_{Y|X}|P_X) ≜ D_α(P_X P_{Y|X}‖P_X Q_{Y|X}), and the associated measure of dependence is I^s_α(X; Y) of (9). The concavity and convexity properties of D^s_α(·) and I^s_α(X; Y) were studied by Ho-Verdú [24]. More properties of I^s_α(X; Y) were collected by Verdú [25]. The maximization of I^s_α(X; Y) with respect to P_X and the minimax properties of D^s_α(·) were studied by Nakiboglu [26] and Cai-Verdú [27]. The conditional Rényi divergence D^s_α(·) was used by Fong and Tan [28] to establish strong converse theorems for multicast networks. Yu and Tan [29] analyzed channel resolvability, among other measures, in terms of D^s_α(·). From (61) we see that Gallager's E_0 function [11], which is defined as

E_0(ρ, P_X) ≜ −log ∑_y (∑_x P_X(x) P_{Y|X}(y|x)^{1/(1+ρ)})^{1+ρ},

is in one-to-one correspondence to Sibson's measure of dependence:

E_0(ρ, P_X) = ρ I^s_{1/(1+ρ)}(X; Y). (65)

Gallager's E_0 function is important in channel coding: it appears in the random coding exponent [30] and in the sphere packing exponent [31,32] (see also Gallager [11]). The exponential strong converse theorem proved by Arimoto [33] also uses the E_0 function. Polyanskiy and Verdú [34] extended the exponential strong converse theorem to channels with feedback. Augustin [21] and Nakiboglu [35,36] extended the sphere packing bound to channels with feedback.
The rest of the section presents some properties of D s α (·). Because D s α (·) can be written as an (unconditional) Rényi divergence (see (54)), it inherits many properties from the Rényi divergence: Proposition 3. Let P X be a PMF, and let P Y|X and Q Y|X be conditional PMFs. Then, Proof. These follow from (54) and the properties of the Rényi divergence (Proposition 1).
We next consider data-processing inequalities for D^s_α(·). We distinguish between processing Y and processing X. The data-processing inequality for processing Y follows from the data-processing inequality for the (unconditional) Rényi divergence:

Theorem 3. Let P_X be a PMF, and let P_{Y|X} and Q_{Y|X} be conditional PMFs. For a conditional PMF A_{Y′|XY}, define

P_{Y′|X}(y′|x) ≜ ∑_y A_{Y′|XY}(y′|x, y) P_{Y|X}(y|x), Q_{Y′|X}(y′|x) ≜ ∑_y A_{Y′|XY}(y′|x, y) Q_{Y|X}(y|x).

Then, for all α ∈ [0, ∞],

D^s_α(P_{Y′|X}‖Q_{Y′|X}|P_X) ≤ D^s_α(P_{Y|X}‖Q_{Y|X}|P_X).

The data-processing inequality for processing X similarly follows from the data-processing inequality for the (unconditional) Rényi divergence:

Theorem 4. Let P_X be a PMF, and let P_{Y|X} and Q_{Y|X} be conditional PMFs. For a conditional PMF B_{X′|X}, define the PMFs P_{X′}, P_{Y|X′}, and Q_{Y|X′} as in Theorem 2. Then, for all α ∈ [0, ∞],

D^s_α(P_{Y|X′}‖Q_{Y|X′}|P_{X′}) ≤ D^s_α(P_{Y|X}‖Q_{Y|X}|P_X).

Proof. See Appendix G.
As a special case of Theorem 4, we obtain the following relation between the conditional and the unconditional Rényi divergence:

Corollary 2. Let P_X be a PMF, and let P_{Y|X} and Q_{Y|X} be conditional PMFs. Define the marginal PMFs

P_Y(y) ≜ ∑_x P_X(x) P_{Y|X}(y|x), Q_Y(y) ≜ ∑_x P_X(x) Q_{Y|X}(y|x).

Then, for all α ∈ [0, ∞],

D_α(P_Y‖Q_Y) ≤ D^s_α(P_{Y|X}‖Q_{Y|X}|P_X).

Proof. This follows from Theorem 4 in the same way that Corollary 1 followed from Theorem 2.

New Conditional Rényi Divergence
Let P_X be a PMF, and let P_{Y|X} and Q_{Y|X} be conditional PMFs. For α ∈ (0, 1) ∪ (1, ∞), the new conditional Rényi divergence is defined as

D^l_α(P_{Y|X}‖Q_{Y|X}|P_X) ≜ (α/(α−1)) log ∑_x P_X(x) 2^{((α−1)/α) D_α(P_{Y|X=x}‖Q_{Y|X=x})} (77)
= (α/(α−1)) log ∑_x P_X(x) (∑_y P_{Y|X}(y|x)^α Q_{Y|X}(y|x)^{1−α})^{1/α}, (78)

where (78) follows from the definition of the Rényi divergence in (4). (Except for the sign, the exponential averaging in (77) is very similar to the one of the Arimoto-Rényi conditional entropy; compare with (147) below.) For α being zero, one, or infinity, we define D^l_α(·) by continuous extension of (77). This conditional Rényi divergence has an operational meaning in horse betting with side information (see Theorem 10 below). Before discussing the measure of dependence associated with D^l_α(·), we establish the following alternative characterization of D^l_α(·):

Proposition 4. Let P_X be a PMF, and let P_{Y|X} and Q_{Y|X} be conditional PMFs. Then, for all α ∈ [0, ∞],

D^l_α(P_{Y|X}‖Q_{Y|X}|P_X) = min_{Q_X} D_α(P_X P_{Y|X}‖Q_X Q_{Y|X}). (82)

Proof. We first treat the case α ∈ (0, 1) ∪ (1, ∞). Some algebra reveals that, for every PMF Q_X,

D_α(P_X P_{Y|X}‖Q_X Q_{Y|X}) = D_α(Q*^{(α)}_X‖Q_X) + D^l_α(P_{Y|X}‖Q_{Y|X}|P_X), (83)

where the PMF Q*^{(α)}_X is defined as

Q*^{(α)}_X(x) ∝ P_X(x) (∑_y P_{Y|X}(y|x)^α Q_{Y|X}(y|x)^{1−α})^{1/α}.

The right-hand side (RHS) of (82) is thus equal to the minimum over Q_X of the RHS of (83). Since D_α(Q*^{(α)}_X‖Q_X) ≥ 0 with equality if Q_X = Q*^{(α)}_X (Proposition 1 (a)), this minimum is equal to the second term on the RHS of (83), which, by (78), equals D^l_α(P_{Y|X}‖Q_{Y|X}|P_X). For α = 1 and α = ∞, (82) follows from the same argument using an analogous decomposition with a suitably defined PMF Q*^{(∞)}_X. For α = 0, the claim follows from the definition of D_0(P‖Q) in (21) and from (79).
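Proposition 4 can be probed numerically: a fine grid over the binary PMFs Q_X should approach the closed-form value of D^l_α. The sketch below assumes the explicit-sum form of D^l_α (cf. (78)); the example distributions are ours:

```python
import math

def renyi_joint(PX, PYX, QX, QYX, a):
    # D_alpha(P_X P_{Y|X} || Q_X Q_{Y|X}) over the product alphabet
    s = sum((px * p) ** a * (qx * q) ** (1 - a)
            for px, qx, py, qy in zip(PX, QX, PYX, QYX)
            for p, q in zip(py, qy) if px * p > 0)
    return math.log2(s) / (a - 1)

def d_l(PYX, QYX, PX, a):
    # assumed closed form (78) of the new conditional Renyi divergence
    s = sum(px * sum(p ** a * q ** (1 - a) for p, q in zip(py, qy)) ** (1 / a)
            for px, py, qy in zip(PX, PYX, QYX))
    return a / (a - 1) * math.log2(s)

PX = [0.3, 0.7]
PYX = [[0.2, 0.8], [0.6, 0.4]]
QYX = [[0.5, 0.5], [0.1, 0.9]]
a = 2.0
# A fine grid over binary PMFs Q_X approximates the minimum in Proposition 4.
best = min(renyi_joint(PX, PYX, [t, 1 - t], QYX, a)
           for t in (i / 1000 for i in range(1, 1000)))
print(abs(best - d_l(PYX, QYX, PX, a)) < 1e-3)  # True
```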
Tomamichel and Hayashi [37] and Lapidoth and Pfister [3] independently introduced and studied the dependence measure

J_α(X; Y) ≜ min_{Q_X, Q_Y} D_α(P_{XY}‖Q_X Q_Y), (92)

where the minimization is over all PMFs Q_X on X and all PMFs Q_Y on Y. (For some measure-theoretic properties of J_α(X; Y), see Aishwarya-Madiman [38].) The measure J_α(X; Y) can be related to the error exponents in a hypothesis testing problem where the samples are either from a known joint distribution or an unknown product distribution (see [37] (Equation (57)) and [39]). It also appears in horse betting with side information (see Theorem 11 below).
Similar to I^c_α(X; Y) in (34) and I^s_α(X; Y) in (60), the measure J_α(X; Y) can be expressed as a minimization involving the new conditional Rényi divergence:

Proposition 5. Let P_XY be a joint PMF. Denote its marginal PMFs by P_X and P_Y and its conditional PMFs by P_{Y|X} and P_{X|Y}, so P_XY = P_X P_{Y|X} = P_Y P_{X|Y}. Then, for all α ∈ [0, ∞],

J_α(X; Y) = min_{Q_Y} D^l_α(P_{Y|X}‖Q_Y|P_X) (93)
= min_{Q_X} D^l_α(P_{X|Y}‖Q_X|P_Y). (94)

Proof. We establish (93) as follows:

min_{Q_Y} D^l_α(P_{Y|X}‖Q_Y|P_X) = min_{Q_Y} min_{Q_X} D_α(P_X P_{Y|X}‖Q_X Q_Y) (95)
= J_α(X; Y), (96)

where (95) follows from Proposition 4, and (96) follows from (92). Swapping the roles of X and Y establishes (94):

min_{Q_X} D^l_α(P_{X|Y}‖Q_X|P_Y) = min_{Q_X} min_{Q_Y} D_α(P_Y P_{X|Y}‖Q_Y Q_X) (97)
= J_α(X; Y), (98)

where (97) follows from Proposition 4, and (98) follows from (92).
The rest of the section presents some properties of D l α (·). Proposition 6. Let P X be a PMF, and let P Y|X and Q Y|X be conditional PMFs. Then, Proof. We prove these properties as follows: The nonnegativity of D l α (·) now follows from the nonnegativity of the Rényi divergence (Proposition 1 (a)). If P Y|X=x = Q Y|X=x for all x ∈ supp(P X ) , then P X P Y|X = P X Q Y|X . Hence, using Q X = P X on the RHS of (99), D l α (P Y|X Q Y|X |P X ) equals zero. Conversely, if α ∈ (0, ∞] and D l α (·) = 0, then P X P Y|X = Q X Q Y|X for some Q X by Proposition 1 (a), which implies This follows from the definitions in (77) and (79)-(81) and the conventions in (20).
The mapping α → D^l_α(P_{Y|X}‖Q_{Y|X}|P_X) is continuous on (0, 1) ∪ (1, ∞) because it is, by its definition in (77), a composition of continuous functions. The continuity at α = 1 follows from a careful application of L'Hôpital's rule.
We conclude with the continuity at α = ∞. Observe that (109) follows from the definition in (77), and (111) follows from the continuity of the Rényi divergence (Proposition 1 (c)). By Proposition 4,

((1−α)/α) D^l_α(P_{Y|X}‖Q_{Y|X}|P_X) = ((1−α)/α) min_{Q_X} D_α(P_X P_{Y|X}‖Q_X Q_{Y|X}). (113)

By the nonnegativity of the Rényi divergence (Proposition 1 (a)), the RHS of (113) is nonnegative for α ∈ (0, 1] and nonpositive for α ∈ (1, ∞). Hence, it suffices to show separately that the mapping α → ((1−α)/α) D^l_α(P_{Y|X}‖Q_{Y|X}|P_X) is nonincreasing on (0, 1] and on (1, ∞). This is indeed the case: the mapping α → ((1−α)/α) D_α(P_X P_{Y|X}‖Q_X Q_{Y|X}) on the RHS of (113) is nonincreasing on (0, ∞) (Proposition 1 (e)), and the monotonicity is preserved by the pointwise minimum and maximum, respectively.
(Proposition 1 (f)) and because the pointwise minimum preserves the concavity, the mapping This follows from Proposition 1 (g) in the same way that Part (f) followed from Proposition 1 (f).
We next consider data-processing inequalities for D^l_α(·). We distinguish between processing Y and processing X. The data-processing inequality for processing Y follows from the data-processing inequality for the (unconditional) Rényi divergence:

Theorem 5. Let P_X be a PMF, and let P_{Y|X} and Q_{Y|X} be conditional PMFs. For a conditional PMF A_{Y′|XY}, define P_{Y′|X} and Q_{Y′|X} as in Theorem 1. Then, for all α ∈ [0, ∞],

D^l_α(P_{Y′|X}‖Q_{Y′|X}|P_X) ≤ D^l_α(P_{Y|X}‖Q_{Y|X}|P_X). (117)

Proof. We prove (117) for α ∈ (0, 1) ∪ (1, ∞); the claim will then extend to α ∈ [0, ∞] by the continuity of D^l_α(·) in α (Proposition 6 (c)). For every x ∈ supp(P_X), we can apply Proposition 1 (h) with the substitution of A_{Y′|Y,X=x} for A_{Y′|Y} to obtain

D_α(P_{Y′|X=x}‖Q_{Y′|X=x}) ≤ D_α(P_{Y|X=x}‖Q_{Y|X=x}). (118)

For α ∈ (0, 1) ∪ (1, ∞), (117) now follows from (77) and (118).
Processing X is different. Consider first Q_{Y|X} that does not depend on X. Then, writing Q_{Y|X} = Q_Y, we have the following result (which, as shown in Example 2 below, does not extend to general Q_{Y|X}):

Theorem 6. Let P_X and Q_Y be PMFs, and let P_{Y|X} be a conditional PMF. For a conditional PMF B_{X′|X}, define the PMFs P_{X′} and P_{Y|X′} through the marginalization below. Then, for all α ∈ [0, ∞],

D^l_α(P_{Y|X′}‖Q_Y|P_{X′}) ≤ D^l_α(P_{Y|X}‖Q_Y|P_X). (122)

Once we provide the operational meaning of D^l_α(·) in horse betting with side information (Theorem 10 below), Theorem 6 will become very intuitive: it expresses the fact that preprocessing the side information cannot increase the gambler's utility; see Remark 8. Note that P_{X′} and P_{Y|X′} in Theorem 6 can be obtained from the following marginalization:

P_{X′}(x′) = ∑_x B_{X′|X}(x′|x) P_X(x),
P_{X′}(x′) P_{Y|X′}(y|x′) = ∑_x B_{X′|X}(x′|x) P_X(x) P_{Y|X}(y|x).

Proof of Theorem 6. We show (122) for α ∈ (0, 1) ∪ (1, ∞); the claim will then extend to α ∈ [0, ∞] by the continuity of D^l_α(·) in α (Proposition 6 (c)). Consider first α ∈ (1, ∞). Then, (122) holds by Minkowski's inequality [16] (III 2.4 Theorem 9). For α ∈ (0, 1), the inequality in (127) is reversed [16] (III 2.4 Theorem 9); because now (α−1)/α < 0, (122) continues to hold. As a special case of Theorem 6, we obtain the following relation between the conditional and the unconditional Rényi divergence:

Corollary 3. Let P_X and Q_Y be PMFs, and let P_{Y|X} be a conditional PMF. Define the marginal PMF

P_Y(y) ≜ ∑_x P_X(x) P_{Y|X}(y|x).

Then, for all α ∈ [0, ∞],

D_α(P_Y‖Q_Y) ≤ D^l_α(P_{Y|X}‖Q_Y|P_X).

Proof. This follows from Theorem 6 in the same way that Corollary 1 followed from Theorem 2.
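The data-processing direction of Theorem 6 can also be checked numerically. The sketch below uses the assumed closed form of D^l_α and an arbitrary example kernel B_{X′|X}; with Q_{Y|X} = Q_Y independent of X, processing X should not increase the divergence:

```python
import math

def d_l(PYX, QYX, PX, a):
    # assumed closed form of the new conditional Renyi divergence
    s = sum(px * sum(p ** a * q ** (1 - a) for p, q in zip(py, qy)) ** (1 / a)
            for px, py, qy in zip(PX, PYX, QYX))
    return a / (a - 1) * math.log2(s)

PX = [0.2, 0.5, 0.3]
PYX = [[0.1, 0.9], [0.6, 0.4], [0.3, 0.7]]
QY = [0.5, 0.5]                            # Q_{Y|X} = Q_Y independent of x
B = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]   # an example kernel B_{X'|X}

# Marginalize to obtain P_{X'}(x') and P_{Y|X'}(y|x').
PXp = [sum(PX[x] * B[x][xp] for x in range(3)) for xp in range(2)]
PYXp = [[sum(PX[x] * B[x][xp] * PYX[x][y] for x in range(3)) / PXp[xp]
         for y in range(2)] for xp in range(2)]

a = 2.0
before = d_l(PYX, [QY] * 3, PX, a)
after = d_l(PYXp, [QY] * 2, PXp, a)
print(after <= before + 1e-12)  # True: processing X cannot increase D^l
```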
Consider next Q_{Y|X} that does depend on X. It turns out that Corollary 3, and hence Theorem 6, cannot be extended to this setting:

Example 2. There are PMFs P_X, P_{Y|X}, and Q_{Y|X} such that, for α = 0.5 and for α = 2,

D_α(P_Y‖Q_Y) > D^l_α(P_{Y|X}‖Q_{Y|X}|P_X),

where the PMFs P_Y and Q_Y are the marginals P_Y(y) = ∑_x P_X(x) P_{Y|X}(y|x) and Q_Y(y) = ∑_x P_X(x) Q_{Y|X}(y|x).

Proof. Numerically, D_{0.5}(P_Y‖Q_Y) ≈ 1.11 bits, which is larger than D^l_{0.5}(P_{Y|X}‖Q_{Y|X}|P_X) ≈ 0.93 bits. Similarly, D_2(P_Y‖Q_Y) ≈ 2.95 bits, which is larger than D^l_2(P_{Y|X}‖Q_{Y|X}|P_X) ≈ 2.75 bits.

Relation to Arimoto's Measures
Before discussing Arimoto's measures, we first recall the definition of the Rényi entropy. The Rényi entropy of order α [7] is defined for all positive α's other than one as

H_α(X) ≜ (1/(1−α)) log ∑_x P_X(x)^α. (141)

For α being zero, one, or infinity, we define by continuous extension of (141)

H_0(X) ≜ log |supp(P_X)|,
H_1(X) ≜ H(X),
H_∞(X) ≜ −log max_x P_X(x),

where H(X) denotes Shannon's entropy. The Rényi entropy can be related to the Rényi divergence as follows:

H_α(X) = log |X| − D_α(P_X‖U_X), (144)

where U_X denotes the uniform distribution over X.
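The relation (144) between the Rényi entropy and the Rényi divergence from the uniform distribution is easy to verify numerically (the example PMF below is ours):

```python
import math

def renyi_entropy(p, a):
    # H_alpha(X) in bits, cf. (141), for alpha in (0,1) or (1,oo)
    return math.log2(sum(pi ** a for pi in p if pi > 0)) / (1 - a)

def renyi_div(p, q, a):
    return math.log2(sum(pi ** a * qi ** (1 - a)
                         for pi, qi in zip(p, q) if pi > 0)) / (a - 1)

p = [0.7, 0.2, 0.1]
u = [1 / 3] * 3   # the uniform distribution U_X
# Relation (144): H_alpha(X) = log|X| - D_alpha(P_X || U_X).
print(all(abs(renyi_entropy(p, a) - (math.log2(3) - renyi_div(p, u, a))) < 1e-9
          for a in (0.5, 2.0, 5.0)))  # True
```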

Relations Between the Conditional Rényi Divergences and the Rényi Dependence Measures
In this section, we first establish the greater-or-equal-than order between the conditional Rényi divergences, where the order depends on whether α ∈ [0, 1] or α ∈ [1, ∞]. We then show that this implies the same order between the dependence measures derived from the conditional Rényi divergences. Finally, we remark that many of the dependence measures coincide when they are maximized over all PMFs P X .

Proposition 7. For all PMFs P_X, all conditional PMFs P_{Y|X} and Q_{Y|X}, and all α ∈ [0, ∞],

D^l_α(P_{Y|X}‖Q_{Y|X}|P_X) ≤ D^s_α(P_{Y|X}‖Q_{Y|X}|P_X).

Proof. This holds because

D^l_α(P_{Y|X}‖Q_{Y|X}|P_X) = min_{Q_X} D_α(P_X P_{Y|X}‖Q_X Q_{Y|X}) (157)
≤ D_α(P_X P_{Y|X}‖P_X Q_{Y|X}) (158)
= D^s_α(P_{Y|X}‖Q_{Y|X}|P_X), (159)

where (157) follows from Proposition 4, and (159) follows from the definition of D^s_α(·) in (54).
Despite I^c_α(X; Y), I^s_α(X; Y), I^a_α(X; Y), and J_α(X; Y) being different measures, they often coincide when maximized over all PMFs P_X:

Theorem 8. For every conditional PMF P_{Y|X} and every α ∈ (0, 1) ∪ (1, ∞),

max_{P_X} I^c_α(X; Y) = max_{P_X} I^s_α(X; Y) = max_{P_X} I^a_α(X; Y). (174)

In addition, for every conditional PMF P_{Y|X} and every α ∈ [1/2, 1) ∪ (1, ∞),

max_{P_X} J_α(X; Y) = max_{P_X} I^s_α(X; Y).

For α ∈ (0, 1/2), the situation is different: there exists a conditional PMF P_{Y|X} such that, for every α ∈ (0, 1/2),

max_{P_X} J_α(X; Y) < max_{P_X} I^s_α(X; Y). (177)

Proof. Equation (174) follows from the results cited above. For α ∈ [1/2, 1), (178) holds because we can apply the minimax theorem: The sets of all PMFs over X and over Y are convex and compact; the function f is jointly continuous in the pair (Q_Y, P_X) because it is a composition of continuous functions; for every Q_Y ∈ P(Y), the function f is linear and hence convex in P_X; and it only remains to show that the function f is concave in Q_Y for every P_X ∈ P(X). Indeed, this holds for every λ, λ′ ∈ [0, 1] with λ + λ′ = 1, every Q_Y, Q′_Y ∈ P(Y), and every P_X ∈ P(X), where (193) follows from the reverse Minkowski inequality [16] (III 2.4 Theorem 9) because α ∈ [1/2, 1); and (195) holds because the function z → z^{(1−α)/α} is concave for α ∈ [1/2, 1). The justification of (185) is very similar to that of (181); here, we apply the minimax theorem to the function g : P(Y) × P(X) → R. Compared to the justification of (181), the only essential difference lies in showing that the function g is concave in Q_Y for every P_X ∈ P(X): here, this follows easily from the concavity of the function z → z^{1−α} for α ∈ [1/2, 1). We conclude the proof by establishing (177). Let X = Y = {0, 1}, and let the conditional PMF P_{Y|X} be given by P_{Y|X}(y|x) = 1{y = x}. (This corresponds to a binary noiseless channel.) Then, denoting by U_X the uniform distribution over X, the choice P_X = U_X yields I^s_α(X; Y) = 1 bit, (199) where (199) follows from (61). On the other hand, for every α ∈ (0, 1/2) and every PMF P_X, J_α(X; Y) < 1 bit, (202) where (200) follows from [3] (Lemma 11); (201) follows from (144); and (202) holds because α ∈ (0, 1/2). Inequality (177) now follows from (199) and (202).

Horse Betting
In this section, we analyze horse betting with a gambler investing all her money. Recall from the introduction that the winning horse X is distributed according to the PMF p, where we assume p(x) > 0 for all x ∈ X; that the odds offered by the bookmaker are denoted by o : X → (0, ∞); that the fraction of her wealth that the gambler bets on Horse x ∈ X is denoted b(x) ≥ 0; that the wealth relative is the random variable S ≜ b(X) o(X); and that we seek betting strategies that maximize the utility function U_β. Because the gambler invests all her money, b is a PMF. As in [47], define the constant

c ≜ (∑_{x∈X} 1/o(x))^{−1} (204)

and the PMF

r(x) ≜ c/o(x), x ∈ X. (205)
Using these definitions, the utility function U_β can be decomposed as follows:

Theorem 9. Let β ∈ (−∞, 1), and let b be a PMF. Then,

U_β(b) = log c + D_{1/(1−β)}(p‖r) − D_{1−β}(g^{(β)}‖b), (206)

where the PMF g^{(β)} is given by

g^{(β)}(x) ≜ p(x)^{1/(1−β)} o(x)^{β/(1−β)} / ∑_{x′∈X} p(x′)^{1/(1−β)} o(x′)^{β/(1−β)}. (207)

Thus, choosing b = g^{(β)} uniquely maximizes U_β among all PMFs b.
The three terms in (206) can be interpreted as follows:

1. The first term, log c, depends only on the odds and is related to the fairness of the odds. The odds are called subfair if c < 1, fair if c = 1, and superfair if c > 1.

2. The second term, D_{1/(1−β)}(p‖r), is related to the bookmaker's estimate of the winning probabilities. It is zero if and only if the odds are inversely proportional to the winning probabilities.

3. The third term, −D_{1−β}(g^{(β)}‖b), is related to the gambler's estimate of the winning probabilities. It is zero if and only if b is equal to g^{(β)}.
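The decomposition (206) lends itself to a numerical sanity check. The sketch below assumes the forms U_β(b) = (1/β) log₂ E[S^β], r(x) = c/o(x), and g^{(β)}(x) ∝ p(x)^{1/(1−β)} o(x)^{β/(1−β)} of (207); the race parameters are arbitrary:

```python
import math

def renyi_div(p, q, a):
    return math.log2(sum(pi ** a * qi ** (1 - a) for pi, qi in zip(p, q))) / (a - 1)

def utility(beta, p, b, o):
    return math.log2(sum(pi * (bi * oi) ** beta
                         for pi, bi, oi in zip(p, b, o))) / beta

p = [0.5, 0.3, 0.2]
o = [2.0, 4.0, 8.0]
b = [0.4, 0.4, 0.2]
beta = -1.0

c = 1 / sum(1 / oi for oi in o)                      # fairness level, cf. (204)
r = [c / oi for oi in o]                             # r(x) = c / o(x)
w = [pi ** (1 / (1 - beta)) * oi ** (beta / (1 - beta)) for pi, oi in zip(p, o)]
g = [wi / sum(w) for wi in w]                        # our reading of g^(beta), cf. (207)

lhs = utility(beta, p, b, o)
rhs = math.log2(c) + renyi_div(p, r, 1 / (1 - beta)) - renyi_div(g, b, 1 - beta)
print(abs(lhs - rhs) < 1e-9)  # True
```

Note how the orders of the two divergences, 1/(1−β) and 1−β, both reduce to 1 at β = 0, recovering the Kullback-Leibler decomposition of the doubling rate.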

Remark 4. For β = 0, (206) reduces to the following decomposition of the doubling rate E[log S]:

E[log S] = log c + D(p‖r) − D(p‖b). (208)

(This decomposition appeared previously in [47] (Section 10.3).) Equation (208) implies that the doubling rate is maximized by proportional gambling, i.e., that E[log S] is maximized if and only if b is equal to p.

Remark 5. Considering the limits β → −∞ and β ↑ 1, the PMF g^{(β)} satisfies, for every x ∈ X,

lim_{β→−∞} g^{(β)}(x) = c/o(x), (209)
lim_{β↑1} g^{(β)}(x) = 1{x ∈ S}/|S|, (210)

where the set S is defined as S ≜ {x ∈ X : p(x) o(x) = max_{x′∈X} p(x′) o(x′)}. It follows from Proposition 8 below that the RHS of (209) is the unique maximizer of lim_{β→−∞} U_β; and it follows from the proof of Proposition 9 below that the RHS of (210) is a maximizer (not necessarily unique) of U_1.
Proposition 8. Let b be a PMF. Then,

lim_{β→−∞} U_β(b) = log min_{x∈X} b(x) o(x) (219)
≤ log c, (220)

and (220) holds with equality if and only if b(x) = c/o(x) for all x ∈ X. Observe that if b(x) = c/o(x) for all x ∈ X, then S = c with probability one, i.e., S does not depend on the winning horse.

Proof of Proposition 8. Equation (219) holds because

lim_{β→−∞} U_β(b) = lim_{β→−∞} log (∑_x p(x) (b(x) o(x))^β)^{1/β} (221)
= log min_{x∈X} b(x) o(x), (222)

where (222) holds because, in the limit as β tends to −∞, the power mean tends to the minimum (since p is a PMF with p(x) > 0 for all x ∈ X [15] (Chapter 8)). We show (220) by contradiction. Assume that there exists a PMF b that does not satisfy (220), thus

b(x) o(x) > c for all x ∈ X. (223)

Then,

1 = ∑_x b(x) (224)
> ∑_x c/o(x) (225)
= 1, (226)

where (224) holds because b is a PMF; (225) follows from (223); and (226) follows from the definition of c in (204). Because 1 > 1 is impossible, such a b cannot exist, which establishes (220). It is not difficult to see that (220) holds with equality if b(x) = c/o(x) for all x ∈ X. We therefore focus on establishing that if (220) holds with equality, then b(x) = c/o(x) for all x ∈ X. Observe first that, if (220) holds with equality, then, for all x ∈ X,

b(x) o(x) ≥ c. (227)

We now claim that (227) holds with equality for all x ∈ X. Indeed, if this were not the case, then there would exist an x′ ∈ X for which b(x′) o(x′) > c, thus (224)-(226) would hold, which would lead to a contradiction. Hence, if (220) holds with equality, then b(x) = c/o(x) for all x ∈ X.

Proposition 9. Let β ≥ 1, and let b be a PMF. Then,

U_β(b) ≤ (1/β) log max_{x∈X} p(x) o(x)^β. (228)

Equality in (228) can be achieved by choosing b(x) = 1{x = x′} for some x′ ∈ X satisfying

p(x′) o(x′)^β = max_{x∈X} p(x) o(x)^β. (229)

Remark 7. Proposition 9 implies that if β ≥ 1, then it is optimal to bet on a single horse. Unless |X| = 1, this is not the case when β < 1: an optimal betting strategy then requires placing a bet on every horse. This follows from Theorem 9 and our assumption that p(x) and o(x) are all positive.
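Proposition 9's single-horse bound for β ≥ 1 can be stress-tested against random mixed bets; the bound (1/β) log₂ max_x p(x) o(x)^β below is our reading of (228), and the race parameters are arbitrary:

```python
import math
import random

def utility(beta, p, b, o):
    return math.log2(sum(pi * (bi * oi) ** beta
                         for pi, bi, oi in zip(p, b, o))) / beta

p = [0.5, 0.3, 0.2]
o = [2.0, 3.0, 6.0]
beta = 2.0
# Our reading of the bound (228): no mixed bet beats the best single horse.
bound = math.log2(max(pi * oi ** beta for pi, oi in zip(p, o))) / beta

random.seed(0)
ok = True
for _ in range(1000):
    raw = [random.random() for _ in p]
    b = [x / sum(raw) for x in raw]          # a random betting PMF
    ok = ok and utility(beta, p, b, o) <= bound + 1e-12
print(ok)  # True
```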

Proposition 10. Let b be a PMF. Then,

lim_{β→+∞} U_β(b) = log max_{x∈X} b(x) o(x) (235)
≤ log max_{x∈X} o(x). (236)

Equality in (236) can be achieved by choosing b(x) = 1{x = x′} for some x′ ∈ X satisfying

o(x′) = max_{x∈X} o(x). (237)

Proof. Equation (235) holds because

lim_{β→+∞} U_β(b) = lim_{β→+∞} log (∑_x p(x) (b(x) o(x))^β)^{1/β} (238)
= log max_{x∈X} b(x) o(x), (239)

where (239) holds because, in the limit as β tends to +∞, the power mean tends to the maximum (since p is a PMF with p(x) > 0 for all x ∈ X [15] (Chapter 8)). Inequality (236) holds because b(x) ≤ 1 for all x ∈ X. It is not difficult to see that (236) holds with equality if b(x) = 1{x = x′} for some x′ ∈ X satisfying (237).

Horse Betting with Side Information
In this section, we study the horse betting problem where the gambler observes some side information Y before placing her bets. This setting leads to the conditional Rényi divergence D l α (·) discussed in Section 5 (see Theorem 10). In addition, it provides a new operational meaning to the dependence measure J α (X; Y) (see Theorem 11).
We adapt our notation as follows: The joint PMF of X and Y is denoted p_XY. (Recall that X denotes the winning horse.) We drop the assumption that the winning probabilities p(x) are positive, but we assume that p(y) > 0 for all y ∈ Y. We continue to assume that the gambler invests all her wealth, so a betting strategy is now a conditional PMF b_{X|Y}, and the wealth relative S is

S ≜ b_{X|Y}(X|Y) o(X).

As in Section 8, define the constant

c ≜ (∑_{x∈X} 1/o(x))^{−1} (241)

and the PMF

r_X(x) ≜ c/o(x), x ∈ X. (242)

The following decomposition of the utility function U_β parallels that of Theorem 9:

Theorem 10. Let β ∈ (−∞, 1), and let b_{X|Y} be a conditional PMF. Then,

U_β(b_{X|Y}) = log c + D^l_{1/(1−β)}(p_{X|Y}‖r_X|p_Y) − D^s_{1−β}(g^{(β)}_{X|Y}‖b_{X|Y}|g^{(β)}_Y),

where the conditional PMF g^{(β)}_{X|Y} and the PMF g^{(β)}_Y are given by

g^{(β)}_{X|Y}(x|y) ≜ p_{X|Y}(x|y)^{1/(1−β)} o(x)^{β/(1−β)} / ∑_{x′∈X} p_{X|Y}(x′|y)^{1/(1−β)} o(x′)^{β/(1−β)},
g^{(β)}_Y(y) ∝ p_Y(y) (∑_{x∈X} p_{X|Y}(x|y)^{1/(1−β)} r_X(x)^{−β/(1−β)})^{1−β}.

Thus, choosing b_{X|Y} = g^{(β)}_{X|Y} uniquely maximizes U_β among all conditional PMFs b_{X|Y}.

Remark 8. It follows from Theorem 10 that, if the gambler gambles optimally, then, for β ∈ (−∞, 1),

U_β(g^{(β)}_{X|Y}) = log c + D^l_{1/(1−β)}(p_{X|Y}‖r_X|p_Y). (253)

Operationally, it is clear that preprocessing the side information cannot increase the gambler's utility, i.e., that, for every conditional PMF p_{Y′|Y},

D^l_{1/(1−β)}(p_{X|Y′}‖r_X|p_{Y′}) ≤ D^l_{1/(1−β)}(p_{X|Y}‖r_X|p_Y), (254)

where p_{X|Y′} and p_{Y′} are derived from the joint PMF p_{XYY′} given by p_{XYY′}(x, y, y′) = p_Y(y) p_{X|Y}(x|y) p_{Y′|Y}(y′|y). This provides the intuition for Theorem 6, where (254) is shown directly. The extreme case is when the preprocessing maps the side information to a constant and hence leads to the case where the side information is absent. In this case, Y′ is deterministic and p_{X|Y′} equals p_X. Theorem 9 and Theorem 10 then lead to the following relation between the conditional and unconditional Rényi divergence:

D_{1/(1−β)}(p_X‖r_X) ≤ D^l_{1/(1−β)}(p_{X|Y}‖r_X|p_Y), (256)

where the marginal PMF p_X is given by p_X(x) = ∑_y p_Y(y) p_{X|Y}(x|y). This motivates Corollary 3, where (256) is derived from (254).
The last result of this section provides a new operational meaning to the Lapidoth-Pfister mutual information J_α(X; Y): assuming that β ∈ (−∞, 1) and that the gambler knows the winning probabilities, J_{1/(1−β)}(X; Y) measures how much the side information that is available to the gambler but not the bookmaker increases the gambler's smallest guaranteed utility for a fixed level of fairness c. To see this, consider first the setting without side information. By Theorem 9, the gambler chooses b = g^{(β)} to maximize her utility, where g^{(β)} is defined in (207). Then, using the nonnegativity of the Rényi divergence (Proposition 1 (a)), the following lower bound on the gambler's utility follows from (206):

U_β(g^{(β)}) = log c + D_{1/(1−β)}(p‖r) ≥ log c. (258)

We call the RHS of (258) the smallest guaranteed utility for a fixed level of fairness c because (258) holds with equality if the bookmaker chooses the odds inversely proportional to the winning probabilities. Comparing (258) with (259) below, we see that the difference due to the side information is J_{1/(1−β)}(X; Y). Note that J_{1/(1−β)}(X; Y) is typically not the difference between the utility with and without side information; this is because the odds for which (258) and (259) hold with equality are typically not the same.
Proof. For this choice of b_{X|Y}, (259) holds by the chain (260)–(262), where (260) follows from Theorem 10 and (262) follows from Proposition 5.
Fix now c > 0, let r*_X achieve the minimum on the RHS of (261), and choose the odds accordingly. Then, (261) holds with equality because r_X = r*_X by (241) and (242).

Horse Betting with Part of the Money
In this section, we treat the possibility that the gambler does not invest all her wealth. We restrict ourselves to the setting without side information and to β ∈ (−∞, 0) ∪ (0, 1). (For the case β = 0, see [47] (Section 10.5).) We assume that p(x) > 0 and o(x) > 0 for all x ∈ X. Denote by b(0) the fraction of her wealth that the gambler does not use for betting. (We assume 0 ∉ X.) Then, b : X ∪ {0} → [0, 1] is a PMF, and the wealth relative S is the random variable S = b(0) + b(X) o(X). As in Section 8, define the constant c = (∑_{x∈X} 1/o(x))^{−1}. We treat the cases c < 1 and c ≥ 1 separately, starting with the latter. If c ≥ 1, then it is optimal to invest all the money: Proposition 11. Assume c ≥ 1, let β ∈ R, and let b be a PMF on X ∪ {0} with utility U_β. Then, there exists a PMF b′ on X ∪ {0} with b′(0) = 0 and utility U′_β ≥ U_β.
On the other hand, if β < 1 and the odds are subfair, i.e., if c < 1, then Claim (c) of the following theorem shows that investing all the money is not optimal: Theorem 12. Assume c < 1, let β ∈ (−∞, 0) ∪ (0, 1), and let b* be a PMF on X ∪ {0} that maximizes U_β among all PMFs b. Defining Γ as in (270), the following claims hold: (a) Both the numerator and the denominator on the RHS of (270) are positive, so Γ is well-defined and positive.
Under this ordering of the horses, the set S thus has a special structure: it is either empty or equal to {x_1, x_2, . . . , x_k} for some integer k. To maximize U_β, the following procedure can be used: for every set S with this structure, compute the corresponding b according to (270)–(273); from these candidates, take one that maximizes U_β. This procedure leads to an optimal solution: an optimal b* exists because we are optimizing a continuous function over a compact set, and b* corresponds to a set S that the procedure considers.
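The qualitative content of Proposition 11 and Theorem 12 (c) can be checked by brute force. The sketch below is ours and does not use the closed-form expressions (270)–(273); it assumes U_β(b) = (1/β) log ∑_x p(x) (b(0) + b(x) o(x))^β and simply grid-searches over PMFs on X ∪ {0} for a two-horse race.

```python
import numpy as np

def utility(b0, b, p, odds, beta):
    """U_beta = (1/beta) log sum_x p(x) (b(0) + b(x) o(x))^beta."""
    wealth = b0 + b * odds
    if beta < 0 and np.any(wealth == 0):
        return -np.inf  # zero wealth on some outcome is fatal for beta < 0
    return np.log(np.sum(p * wealth**beta)) / beta

def grid_opt(p, odds, beta, n=100):
    """Brute-force maximization of U_beta over PMFs b on X ∪ {0};
    a numerical stand-in for the procedure based on (270)-(273)."""
    best, best_b0 = -np.inf, None
    for i in range(n + 1):            # i/n is the cash fraction b(0)
        for j in range(n + 1 - i):
            u = utility(i / n, np.array([j, n - i - j]) / n, p, odds, beta)
            if u > best:
                best, best_b0 = u, i / n
    return best, best_b0

p = np.array([0.7, 0.3])
subfair = np.array([1.5, 1.5])    # c = 0.75 < 1
superfair = np.array([3.0, 3.0])  # c = 1.5 >= 1
for beta in (0.5, -1.0):
    assert grid_opt(p, subfair, beta)[1] > 0   # Theorem 12 (c): keep some cash
assert grid_opt(p, superfair, 0.5)[1] == 0     # Proposition 11: invest everything
```

With superfair odds, moving cash proportionally to 1/o(x) strictly increases the wealth on every outcome, which is why the optimizer always invests everything there.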
Having established that, for all β ∈ (−∞, 0) ∪ (0, 1), a strategy b is optimal if and only if (274) and (275) hold, we continue with the proof. Let β ∈ (−∞, 0) ∪ (0, 1), and let b* be a PMF on X ∪ {0} that maximizes U_β. By the above discussion, (274) and (275) are satisfied by b* for some µ ∈ R. The LHS of (274) is positive, so µ > 0. We now show that (281) holds for all x ∈ X. To this end, fix x ∈ X and suppose first that b*(x) > 0; then the RHS of (282) is equal to the RHS of (281) because, being equal to b*(x), it is positive. If b*(x) = 0, then (275) implies that the RHS of (281) is zero, and (281) hence holds.

Universal Betting for IID Races
In this section, we present a universal gambling strategy for IID races that requires neither knowledge of the winning probabilities nor of the parameter β of the utility function and yet asymptotically maximizes the utility function for all PMFs p and all β ∈ R. Consider n consecutive horse races, where the winning horse in the ith race is denoted X_i for i ∈ {1, . . . , n}. We assume that X_1, . . . , X_n are IID according to the PMF p, where p(x) > 0 for all x ∈ X. In every race, the bookmaker offers the same odds o : X → (0, ∞), and the gambler spends all her wealth placing bets on the horses. The gambler plays race after race, i.e., before placing bets for a race, she is revealed the winning horse of the previous race and receives the money from the bookmaker. Her betting strategy is hence a sequence of conditional PMFs b_{X_1}, b_{X_2|X_1}, b_{X_3|X_1 X_2}, . . . , b_{X_n|X_1 X_2 ··· X_{n−1}}. The wealth relative is the random variable S_n = ∏_{i=1}^{n} b(X_i | X_1, . . . , X_{i−1}) o(X_i).
We seek betting strategies that maximize the utility function U_{β,n} = (1/β) log E[S_n^β]. We first establish that, to maximize U_{β,n} for a fixed β ∈ R, it suffices to use the same betting strategy in every race (Theorem 13). We then show that the individual-sequence-universal strategy of Cover and Ordentlich [48] makes it possible to asymptotically achieve the same normalized utility without knowing p or β (Theorem 14).
For a fixed β ∈ R, let the PMF b* be a betting strategy that maximizes the single-race utility U_β discussed in Section 8, and denote by U*_β the utility associated with b*. Using the same betting strategy b* in every race leads to the utility U_{β,n}, and it follows from (295) and (296) that U_{β,n} = n U*_β.
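Theorem 14 invokes the Cover–Ordentlich strategy. For horse-race markets, its Dirichlet(1/2)-weighted universal portfolio reduces to betting the Krichevsky–Trofimov (KT) sequential probabilities; this reduction is an assumption of the following sketch, which is ours, not a statement from this section. The sketch compares the KT bettor's log-wealth on an individual sequence with the best constant strategy in hindsight; the odds cancel in the comparison, so only the betting probabilities matter.

```python
import numpy as np

def kt_bets(seq, m):
    """Sequential Krichevsky-Trofimov bets: b(x | past) = (N_x + 1/2) / (t + m/2),
    where N_x counts past occurrences of x and t is the number of past races."""
    counts = np.zeros(m)
    bets = []
    for t, x in enumerate(seq):
        bets.append((counts[x] + 0.5) / (t + m / 2))
        counts[x] += 1
    return np.array(bets)

m, n = 2, 50
seq = [i % 2 for i in range(n)]  # an individual sequence in which every horse wins
log_wealth_kt = np.sum(np.log(kt_bets(seq, m)))
counts = np.bincount(seq, minlength=m)
# the best constant strategy in hindsight bets the empirical frequencies
log_wealth_best = np.sum(counts * np.log(counts / n))
gap = log_wealth_best - log_wealth_kt
assert 0 <= gap <= 0.5 * (m - 1) * np.log(n) + 1  # per-sequence regret is O(log n)
```

Since the per-sequence gap grows only logarithmically in n, the normalized utility of the universal bettor approaches that of the best constant strategy, and hence that of b*, as n grows.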

Appendix F. Proof of Theorem 3
Observe that, for all x′ ∈ X and all y′ ∈ Y′, P_X(x′) P_{Y′|X}(y′|x′) = ∑_{x,y} P_X(x) P_{Y|X}(y|x) 1{x′ = x} A_{Y′|XY}(y′|x, y). Hence, (68) follows from (54) and D_α(P_X P_{Y′|X} ‖ P_X Q_{Y′|X}) ≤ D_α(P_X P_{Y|X} ‖ P_X Q_{Y|X}), which follows from the data-processing inequality for the Rényi divergence by substituting 1_{X′=X} A_{Y′|XY} for A_{X′Y′|XY} in Proposition 1 (h).
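The data-processing inequality invoked here is easy to verify numerically. The sketch below is ours: it checks, for random PMFs and a random channel, that passing both arguments through the same channel cannot increase the Rényi divergence.

```python
import numpy as np

def renyi_div(p, q, alpha):
    """Renyi divergence D_alpha(p || q) of order alpha (alpha > 0, alpha != 1), in nats."""
    return np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1)

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(4))          # PMF P
q = rng.dirichlet(np.ones(4))          # PMF Q
A = rng.dirichlet(np.ones(3), size=4)  # channel A: row x is a PMF on the output

for alpha in (0.3, 0.7, 2.0):
    # data-processing inequality: D_alpha(PA || QA) <= D_alpha(P || Q)
    assert renyi_div(p @ A, q @ A, alpha) <= renyi_div(p, q, alpha) + 1e-12
```

In the proof above, the channel is the deterministic copy of X combined with A_{Y′|XY}, which is a special case of such a stochastic map.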

Appendix G. Proof of Theorem 4
Observe that, for all x′ ∈ X′ and all y′ ∈ Y, P_{X′}(x′) P_{Y|X′}(y′|x′) = ∑_{x,y} P_X(x) P_{Y|X}(y|x) B_{X′|X}(x′|x) 1{y′ = y}, (A32) Hence, (73) follows from (54) and D_α(P_{X′} P_{Y|X′} ‖ P_{X′} Q_{Y|X′}) ≤ D_α(P_X P_{Y|X} ‖ P_X Q_{Y|X}), which follows from the data-processing inequality for the Rényi divergence by substituting B_{X′|X} 1_{Y′=Y} for A_{X′Y′|XY} in Proposition 1 (h).