Abstract
Motivated by a horse betting problem, a new conditional Rényi divergence is introduced. It is compared with the conditional Rényi divergences that appear in the definitions of the dependence measures by Csiszár and Sibson, and the properties of all three are studied with emphasis on their behavior under data processing. In the same way that Csiszár’s and Sibson’s conditional divergences lead to the respective dependence measures, so does the new conditional divergence lead to the Lapidoth–Pfister mutual information. Moreover, the new conditional divergence is also related to the Arimoto–Rényi conditional entropy and to Arimoto’s measure of dependence. In the second part of the paper, the horse betting problem is analyzed where, instead of Kelly’s expected log-wealth criterion, a more general family of power-mean utility functions is considered. The key role in the analysis is played by the Rényi divergence, and in the setting where the gambler has access to side information, by the new conditional Rényi divergence. The setting with side information also provides another operational meaning to the Lapidoth–Pfister mutual information. Finally, a universal strategy for independent and identically distributed races is presented that—without knowing the winning probabilities or the parameter of the utility function—asymptotically maximizes the gambler’s utility function.
1. Introduction
As shown by Kelly [1,2], many of Shannon’s information measures appear naturally in the context of horse gambling when the gambler’s utility function is expected log-wealth. Here, we show that under a more general family of utility functions, gambling also provides a context for some of Rényi’s information measures. Moreover, the setting where the gambler has side information motivates a new Rényi-like conditional divergence, which we study and compare to other conditional divergences. The proposed family of utility functions in the context of gambling with side information also provides another operational meaning to the Rényi-like mutual information that was recently proposed by Lapidoth and Pfister [3]: it quantifies the gambler’s gain from the side information as the increase in the minimax value of the two-player zero-sum game in which the bookmaker picks the odds and the gambler then places her bets based on these odds and her side information.
Deferring the gambling-based motivation to the second part of the paper, we first describe the different conditional divergences and study some of their properties with emphasis on their behavior under data processing. We also show that the new conditional Rényi divergence relates to the Lapidoth–Pfister mutual information in much the same way that Csiszár’s and Sibson’s conditional divergences relate to their corresponding mutual informations. Before discussing the conditional divergences, we first recall other information measures.
The Kullback–Leibler divergence (or relative entropy) is an important concept in information theory and statistics [2,4,5,6]. It is defined between two probability mass functions (PMFs) P and Q over a finite set $\mathcal{X}$ as
$$D(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}, \qquad (1)$$
where $\log$ denotes the base-2 logarithm. Defining a conditional Kullback–Leibler divergence is straightforward because, as simple algebra shows, the two natural approaches lead to the same result:
$$D(P_{Y|X} \| Q_{Y|X} \,|\, P_X) = \sum_{x \in \mathrm{supp}(P_X)} P_X(x)\, D\bigl(P_{Y|X=x} \,\big\|\, Q_{Y|X=x}\bigr) \qquad (2)$$
$$= D\bigl(P_X P_{Y|X} \,\big\|\, P_X Q_{Y|X}\bigr), \qquad (3)$$
where $\mathrm{supp}(P)$ denotes the support of P, and $P_X P_{Y|X}$ in (3) and throughout denotes the PMF on $\mathcal{X} \times \mathcal{Y}$ that assigns the probability $P_X(x) P_{Y|X}(y|x)$ to the pair $(x, y)$.
The Rényi divergence of order $\alpha$ [7,8] between two PMFs P and Q is defined for all positive $\alpha$’s other than one as
$$D_\alpha(P \| Q) = \frac{1}{\alpha - 1} \log \sum_{x \in \mathcal{X}} P(x)^\alpha\, Q(x)^{1-\alpha}. \qquad (4)$$
A conditional Rényi divergence can be defined in more than one way. In this paper, we consider the following three definitions, two classic and one new:
where (5) is inspired by Csiszár [9]; (6) is inspired by Sibson [10]; and (7) is motivated by the horse betting problem discussed in Section 9. The first two conditional Rényi divergences were used to define the Rényi measures of dependence of Csiszár [9] and of Sibson [10]:
where the minimization is over all PMFs on the set . (Gallager’s function [11] and are in one-to-one correspondence; see (65) below.) The analogous minimization of leads to the Lapidoth–Pfister mutual information [3]:
where (11) is proved in Proposition 5.
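For concreteness, the following Python sketch evaluates the Rényi divergence and the three conditional versions on a small example. The explicit formulas used below (a per-x average of Rényi divergences for (5), a single Rényi divergence between the joint PMFs $P_X P_{Y|X}$ and $P_X Q_{Y|X}$ for (6), and an exponentially weighted average for (7)) reflect our reading of these definitions; the function names and the example PMFs are ours and not from the paper.

```python
import numpy as np

def renyi_div(P, Q, alpha):
    """Rényi divergence D_alpha(P||Q) in bits, for full-support PMFs and alpha > 0, alpha != 1."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    return np.log2(np.sum(P**alpha * Q**(1.0 - alpha))) / (alpha - 1.0)

def cond_div_csiszar(PX, PYgX, QYgX, alpha):
    """Average over x of D_alpha(P_{Y|X=x} || Q_{Y|X=x}); our reading of (5)."""
    return sum(PX[i] * renyi_div(PYgX[i], QYgX[i], alpha) for i in range(len(PX)))

def cond_div_sibson(PX, PYgX, QYgX, alpha):
    """D_alpha(P_X P_{Y|X} || P_X Q_{Y|X}); our reading of (6)."""
    return renyi_div((PX[:, None] * PYgX).ravel(), (PX[:, None] * QYgX).ravel(), alpha)

def cond_div_new(PX, PYgX, QYgX, alpha):
    """(alpha/(alpha-1)) log2 sum_x P_X(x) (sum_y P(y|x)^alpha Q(y|x)^(1-alpha))^(1/alpha);
    our reading of (7)."""
    inner = np.sum(PYgX**alpha * QYgX**(1.0 - alpha), axis=1) ** (1.0 / alpha)
    return alpha / (alpha - 1.0) * np.log2(np.dot(PX, inner))

PX = np.array([0.6, 0.4])
PYgX = np.array([[0.9, 0.1], [0.2, 0.8]])   # rows: P_{Y|X=x}
QYgX = np.array([[0.5, 0.5], [0.5, 0.5]])   # rows: Q_{Y|X=x}
for a in (0.5, 2.0):
    print(a, cond_div_csiszar(PX, PYgX, QYgX, a),
             cond_div_sibson(PX, PYgX, QYgX, a),
             cond_div_new(PX, PYgX, QYgX, a))
```

In this example the three values differ; in accordance with Remark 1 below, they coincide when neither conditional PMF depends on x.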
The first part of the paper is structured as follows: In Section 2, we discuss some preliminaries. In Section 3, Section 4 and Section 5, we study the properties of the three conditional Rényi divergences and their associated measures of dependence. In Section 6, we express the Arimoto–Rényi conditional entropy and the Arimoto measure of dependence [12] in terms of . In Section 7, we relate the conditional Rényi divergences to each other and discuss the relations between the Rényi dependence measures.
The second part of the paper deals with horse gambling under our proposed family of power-mean utility functions. It is in this context that the Rényi divergence (Theorem 9) and the conditional Rényi divergence (Theorem 10) appear naturally.
More specifically, consider a horse race with a finite nonempty set of horses $\mathcal{X}$, where a bookmaker offers odds $o(x)$-for-1 on each horse $x \in \mathcal{X}$, where $o(x) > 0$ [2] (Section 6.1). A gambler spends all her wealth placing bets on the horses. The fraction of her wealth that she bets on Horse $x$ is denoted $b(x)$, which sums to one over $x \in \mathcal{X}$, and the PMF b is her “betting strategy.” The winning horse, which we denote X, is drawn according to the PMF p, where we assume $p(x) > 0$ for all $x \in \mathcal{X}$. The wealth relative (or end-to-beginning wealth ratio) is the random variable
$$S = b(X)\, o(X).$$
Hence, given an initial wealth $w_0$, the gambler’s wealth after the race is $w_0 S$. We seek betting strategies that maximize the utility function
$$\frac{1}{\beta} \log \mathbb{E}\bigl[S^\beta\bigr],$$
where $\beta$ is a parameter that accounts for the risk sensitivity. This optimization generalizes the following cases:
- (a)
- In the limit as tends to , we optimize the worst-case return. The optimal strategy is risk-free in the sense that S does not depend on the winning horse (see Proposition 8).
- (b)
- If , then we optimize , which is known as the doubling rate [2] (Section 6.1). The optimal strategy is proportional betting, i.e., to choose (see Remark 4).
- (c)
- If , then we optimize , the expected return. The optimal strategy is to put all the money on a horse that maximizes (see Proposition 9).
- (d)
- In general, if , then it is optimal to put all the money on one horse (see Proposition 9). This is risky: if that horse loses, the gambler will go broke.
- (e)
- In the limit as tends to , we optimize the best-case return. The optimal strategy is to put all the money on a horse that maximizes (see Proposition 10).
Note that, for and , maximizing is equivalent to maximizing
which is known in the finance literature as Constant Relative Risk Aversion (CRRA) [13,14].
We refer to our utility function as “power mean” because it can be written as the logarithm of a weighted power mean [15,16]:
$$\frac{1}{\beta} \log \mathbb{E}\bigl[S^\beta\bigr] = \log \Bigl[ \Bigl( \sum_{x \in \mathcal{X}} p(x)\, \bigl(b(x)\, o(x)\bigr)^{\beta} \Bigr)^{1/\beta} \Bigr].$$
Because the power mean tends to the geometric mean as $\beta$ tends to zero [15] (Problem 8.1), the utility is continuous at $\beta = 0$:
$$\lim_{\beta \to 0} \frac{1}{\beta} \log \mathbb{E}\bigl[S^\beta\bigr] = \mathbb{E}\bigl[\log S\bigr] = \sum_{x \in \mathcal{X}} p(x) \log\bigl(b(x)\, o(x)\bigr).$$
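A minimal numerical sketch of this utility, with an illustrative three-horse example of our own (the function name and the numbers are ours), shows the power-mean behavior and the continuity at β = 0:

```python
import numpy as np

def utility(b, p, o, beta):
    """(1/beta) * log2 E[(b(X)o(X))^beta]; for beta == 0, the doubling rate E[log2 b(X)o(X)]."""
    b, p, o = map(np.asarray, (b, p, o))
    if beta == 0.0:
        return float(np.dot(p, np.log2(b * o)))
    return float(np.log2(np.dot(p, (b * o) ** beta)) / beta)

p = np.array([0.5, 0.3, 0.2])      # winning probabilities (assumed example)
o = np.array([2.0, 4.0, 5.0])      # o(x)-for-1 odds (assumed example)
b = p                              # proportional betting
for beta in (-4.0, -1.0, -1e-4, 0.0, 1e-4, 0.5, 1.0):
    print(f"beta={beta:+.4f}  utility={utility(b, p, o, beta):.6f}")
# The values at beta = -1e-4 and +1e-4 are close to the value at beta = 0,
# illustrating the continuity of the utility in beta.
```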
Campbell [17,18] used an exponential cost function with a similar structure to (15) to provide an operational meaning to the Rényi entropy in source coding. Other information-theoretic applications of exponential moments were studied in [19].
The second part of the paper is structured as follows: In Section 8, we relate the utility function to the Rényi divergence (Theorem 9) and derive its optimal gambling strategy. In Section 9, we consider the situation where the gambler observes side information prior to betting, a situation that leads to the conditional Rényi divergence (Theorem 10) and to a new operational meaning for the measure of dependence (Theorem 11). In Section 10, we consider the situation where the gambler invests only part of her money. In Section 11, we present a universal strategy for independent and identically distributed (IID) races that requires neither knowledge of the winning probabilities nor of the parameter of the utility function and yet asymptotically maximizes the utility function for all PMFs p and all .
2. Preliminaries
Throughout the paper, denotes the base-2 logarithm, and are finite sets, denotes a joint PMF over , denotes a PMF over , and denotes a PMF over . An expression of the form denotes the PMF on that assigns the probability . We use P and Q as generic PMFs over a finite set . We denote by the support of P, and by the set of all PMFs over . When clear from the context, we often omit sets and subscripts: for example, we write for , for , for , and for . When is 0, we define the conditional probability as . The conditional distribution of Y given is denoted by , thus
We denote by the indicator function that is one if the condition is satisfied and zero otherwise.
In the definition of the Kullback–Leibler divergence in (1), we use the conventions
In the definition of the Rényi divergence in (4), we read as for and use the conventions
The Rényi divergence for negative is defined as
(We use negative in the proof of Proposition 1 (e) below and in Remark 6. More about negative orders can be found in [8] (Section V). For other applications of negative orders, see [20] (Proof of Theorem 1 and Example 1).)
The Rényi divergence satisfies the following basic properties:
Proposition 1.
Let P and Q be PMFs. Then, the Rényi divergence $D_\alpha(P \| Q)$ satisfies the following:
- (a)
- For all,. If, thenif and only if.
- (b)
- For all,is finite if and only if. For all,is finite if and only if.
- (c)
- The mappingis continuous on.
- (d)
- The mappingis nondecreasing on.
- (e)
- The mappingis nonincreasing on.
- (f)
- The mappingis concave on.
- (g)
- The mappingis concave on.
- (h)
- (Data-processing inequality.) Letbe a conditional PMF, and define the PMFsThen, for all,
Proof.
See Appendix A. □
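Several of these properties are easy to spot-check numerically. The following sketch, with randomly drawn PMFs and an arbitrary channel of our own choosing, verifies the monotonicity in α of Part (d) and the data-processing inequality of Part (h):

```python
import numpy as np

def renyi_div(P, Q, alpha):
    """Rényi divergence in bits for full-support PMFs, alpha > 0, alpha != 1."""
    return np.log2(np.sum(P**alpha * Q**(1.0 - alpha))) / (alpha - 1.0)

rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(4))
Q = rng.dirichlet(np.ones(4))
A = rng.dirichlet(np.ones(3), size=4)        # conditional PMF A(y|x), rows indexed by x

alphas = [0.25, 0.5, 2.0, 4.0]
divs = [renyi_div(P, Q, a) for a in alphas]
print("nondecreasing in alpha:", all(d1 <= d2 + 1e-12 for d1, d2 in zip(divs, divs[1:])))

PA, QA = P @ A, Q @ A                        # output PMFs after processing through A
print("data processing:",
      all(renyi_div(PA, QA, a) <= renyi_div(P, Q, a) + 1e-12 for a in alphas))
```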
All three conditional Rényi divergences reduce to the unconditional Rényi divergence when both and are independent of X:
Remark 1.
Let,, andbe PMFs. Then, for all,
3. Csiszár’s Conditional Rényi Divergence
For a PMF and conditional PMFs and , Csiszár’s conditional Rényi divergence is defined for every as
For ,
which follows from the definition of the Rényi divergence in (4). For being zero, one, or infinity, we obtain from (21)–(23) and (2)
Augustin [21] and later Csiszár [9] defined the measure of dependence
Augustin used this measure to study the error exponents for channel coding with input constraints, while Csiszár used it to study generalized cutoff rates for channel coding with composition constraints. Nakiboğlu [22] studied more properties of . Inter alia, he analyzed the minimax properties of the Augustin capacity
where is a constraint set. The Augustin capacity is used in [23] to establish the sphere packing bound for memoryless channels with cost constraints.
The rest of the section presents some properties of . Being an average of Rényi divergences (see (29)), inherits many properties from the Rényi divergence:
Proposition 2.
Let $P_X$ be a PMF, and let $P_{Y|X}$ and $Q_{Y|X}$ be conditional PMFs. Then,
- (a)
- For all,. If, thenif and only iffor all.
- (b)
- For all,is finite if and only iffor all. For all,is finite if and only iffor all.
- (c)
- The mappingis continuous on.
- (d)
- The mappingis nondecreasing on.
- (e)
- The mappingis nonincreasing on.
- (f)
- The mappingis concave on.
- (g)
- The mappingis concave on.
Proof.
These follow from (29) and the properties of the Rényi divergence (Proposition 1). For Parts (f) and (g), recall that a nonnegative weighted sum of concave functions is concave. □
We next consider data-processing inequalities for . We distinguish between processing Y and processing X. The data-processing inequality for processing Y follows from the data-processing inequality for the (unconditional) Rényi divergence:
Theorem 1.
Letbe a PMF, and letandbe conditional PMFs. For a conditional PMF, define
Then, for all,
Proof.
See Appendix B. □
The following data-processing inequality for processing X holds for (as shown in Example 1 below, it does not extend to ):
Theorem 2.
Letbe a PMF, and letandbe conditional PMFs. For a conditional PMF, define the PMFs
Then, for all,
Note that , , and in Theorem 2 can be obtained from the following marginalizations:
Proof of Theorem 2.
See Appendix C. □
As a special case of Theorem 2, we obtain the following relation between the conditional and the unconditional Rényi divergence:
Corollary 1.
For a PMFand conditional PMFsand, define the marginal PMFs
Then, for all,
Proof.
See Appendix D. □
Consider next . It turns out that Corollary 1, and hence Theorem 2, cannot be extended to these values of (not even if is restricted to be independent of X, i.e., if ):
Example 1.
Let. For, define the PMFs,, andas
Then, for every, there exists ansuch that
where the PMF is defined by (46) and, irrespective of ϵ, satisfies .
Proof.
See Appendix E. □
4. Sibson’s Conditional Rényi Divergence
For a PMF and conditional PMFs and , Sibson’s conditional Rényi divergence is defined for every as
For ,
where (55) and (56) follow from the definition of the Rényi divergence in (4). For being zero, one, or infinity, we obtain from (21)–(23) and (3)
Sibson [10] defined the measure of dependence
This minimum can be computed explicitly [10] (Corollary 2.3): For ,
and for being one or infinity,
where denotes Shannon’s mutual information.
The concavity and convexity properties of and were studied by Ho–Verdú [24]. More properties of were collected by Verdú [25]. The maximization of with respect to and the minimax properties of were studied by Nakiboğlu [26] and Cai–Verdú [27].
The conditional Rényi divergence was used by Fong and Tan [28] to establish strong converse theorems for multicast networks. Yu and Tan [29] analyzed channel resolvability, among other measures, in terms of .
From (61) we see that Gallager’s function [11], which is defined as
is in one-to-one correspondence to Sibson’s measure of dependence:
Gallager’s function is important in channel coding: it appears in the random coding exponent [30] and in the sphere packing exponent [31,32] (see also Gallager [11]). The exponential strong converse theorem proved by Arimoto [33] also uses the function. Polyanskiy and Verdú [34] extended the exponential strong converse theorem to channels with feedback. Augustin [21] and Nakiboğlu [35,36] extended the sphere packing bound to channels with feedback.
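As a numerical sanity check of the correspondence between Gallager’s function and Sibson’s measure, the following sketch compares the two quantities. We use the standard relation $E_0(\rho, P_X) = \rho\, I_{1/(1+\rho)}(X;Y)$, which we take to be the content of (65), and the closed form $\frac{\alpha}{\alpha-1}\log_2 \sum_y \bigl(\sum_x P_X(x) P_{Y|X}(y|x)^\alpha\bigr)^{1/\alpha}$, which we take to be the explicit expression referred to above; the example channel and the function names are ours.

```python
import numpy as np

def sibson_I(PX, W, alpha):
    """Closed form (alpha/(alpha-1)) * log2 sum_y ( sum_x P_X(x) W(y|x)^alpha )^(1/alpha)."""
    inner = np.dot(PX, W**alpha) ** (1.0 / alpha)        # one term per output y
    return alpha / (alpha - 1.0) * np.log2(np.sum(inner))

def gallager_E0(rho, PX, W):
    """Gallager's E_0(rho, P_X) = -log2 sum_y ( sum_x P_X(x) W(y|x)^(1/(1+rho)) )^(1+rho)."""
    inner = np.dot(PX, W**(1.0 / (1.0 + rho))) ** (1.0 + rho)
    return -np.log2(np.sum(inner))

PX = np.array([0.5, 0.5])
W = np.array([[0.9, 0.1], [0.3, 0.7]])   # rows: P_{Y|X=x} (binary channel, assumed example)
for rho in (0.25, 0.5, 1.0, 2.0):
    alpha = 1.0 / (1.0 + rho)
    print(rho, gallager_E0(rho, PX, W), rho * sibson_I(PX, W, alpha))  # the two columns agree
```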
The rest of the section presents some properties of . Because can be written as an (unconditional) Rényi divergence (see (54)), it inherits many properties from the Rényi divergence:
Proposition 3.
Let $P_X$ be a PMF, and let $P_{Y|X}$ and $Q_{Y|X}$ be conditional PMFs. Then,
- (a)
- For all,. If, thenif and only iffor all.
- (b)
- For all,is finite if and only if (there exists ansuch that. For all,is finite if and only iffor all.
- (c)
- The mappingis continuous on.
- (d)
- The mappingis nondecreasing on.
- (e)
- The mappingis nonincreasing on.
- (f)
- The mappingis concave on.
- (g)
- The mappingis concave on.
Proof.
These follow from (54) and the properties of the Rényi divergence (Proposition 1). □
We next consider data-processing inequalities for . We distinguish between processing Y and processing X. The data-processing inequality for processing Y follows from the data-processing inequality for the (unconditional) Rényi divergence:
Theorem 3.
Letbe a PMF, and letandbe conditional PMFs. For a conditional PMF, define
Then, for all,
Proof.
See Appendix F. □
The data-processing inequality for processing X similarly follows from the data-processing inequality for the (unconditional) Rényi divergence:
Theorem 4.
Letbe a PMF, and letandbe conditional PMFs. For a conditional PMF, define the PMFs
Then, for all,
Proof.
See Appendix G. □
As a special case of Theorem 4, we obtain the following relation between the conditional and the unconditional Rényi divergence:
Corollary 2.
Letbe a PMF, and letandbe conditional PMFs. Define the marginal PMFs
Then, for all,
Proof.
This follows from Theorem 4 in the same way that Corollary 1 followed from Theorem 2. □
5. New Conditional Rényi Divergence
Let be a PMF, and let and be conditional PMFs. For , define
where (78) follows from the definition of the Rényi divergence in (4). (Except for the sign, the exponential averaging in (77) is very similar to the one of the Arimoto–Rényi conditional entropy; compare with (147) below.) For being zero, one, or infinity, we define by continuous extension of (77)
This conditional Rényi divergence has an operational meaning in horse betting with side information (see Theorem 10 below). Before discussing the measure of dependence associated with , we establish the following alternative characterization of :
Proposition 4.
Let $P_X$ be a PMF, and let $P_{Y|X}$ and $Q_{Y|X}$ be conditional PMFs. Then, for all ,
Proof.
We first treat the case . Some algebra reveals that, for every PMF ,
where the PMF is defined as
The right-hand side (RHS) of (82) is thus equal to the minimum over of the RHS of (83). Since with equality if (Proposition 1 (a)), this minimum is equal to the second term on the RHS of (83), which, by (78), equals .
For and , (82) follows from the same argument using that, for every PMF ,
where the PMF is defined as
Tomamichel and Hayashi [37] and Lapidoth and Pfister [3] independently introduced and studied the dependence measure
(For some measure-theoretic properties of , see Aishwarya–Madiman [38].) The measure can be related to the error exponents in a hypothesis testing problem where the samples are either from a known joint distribution or an unknown product distribution (see [37] (Equation (57)) and [39]). It also appears in horse betting with side information (see Theorem 11 below).
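The double minimization defining this measure can be carried out numerically. The sketch below alternates between the two marginals: for a fixed $Q_Y$, the minimizing $Q_X$ is proportional to $\bigl(\sum_y P_{XY}(x,y)^\alpha Q_Y(y)^{1-\alpha}\bigr)^{1/\alpha}$ (and symmetrically for $Q_Y$), which follows from a standard Lagrangian argument. This is our own illustrative implementation, not an algorithm from the paper; if stopped early, alternating minimization only yields an upper bound on the measure.

```python
import numpy as np

def renyi_div(P, Q, alpha):
    return np.log2(np.sum(P**alpha * Q**(1.0 - alpha))) / (alpha - 1.0)

def lp_mutual_info(PXY, alpha, iters=200):
    """Numerical value of min_{Q_X,Q_Y} D_alpha(P_XY || Q_X x Q_Y) by alternating minimization."""
    PXY = np.asarray(PXY, float)
    QX = PXY.sum(axis=1)            # initialize with the marginals
    QY = PXY.sum(axis=0)
    for _ in range(iters):
        a = np.sum(PXY**alpha * QY[None, :]**(1.0 - alpha), axis=1) ** (1.0 / alpha)
        QX = a / a.sum()            # optimal Q_X for the current Q_Y
        b = np.sum(PXY**alpha * QX[:, None]**(1.0 - alpha), axis=0) ** (1.0 / alpha)
        QY = b / b.sum()            # optimal Q_Y for the current Q_X
    return renyi_div(PXY.ravel(), np.outer(QX, QY).ravel(), alpha)

PXY = np.array([[0.4, 0.1], [0.1, 0.4]])   # assumed example joint PMF
for a in (0.5, 2.0, 4.0):
    print(a, lp_mutual_info(PXY, a))
```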
Similar to in (34) and in (60), the measure can be expressed as a minimization involving the new conditional Rényi divergence:
Proposition 5.
Letbe a joint PMF. Denote its marginal PMFs byandand its conditional PMFs byand, so. Then, for all,
Proof.
The rest of the section presents some properties of .
Proposition 6.
Let $P_X$ be a PMF, and let $P_{Y|X}$ and $Q_{Y|X}$ be conditional PMFs. Then,
- (a)
- For all,. If, thenif and only iffor all.
- (b)
- For all,is finite if and only if (there exists ansuch that. For all,is finite if and only iffor all.
- (c)
- The mappingis continuous on.
- (d)
- The mappingis nondecreasing on.
- (e)
- The mappingis nonincreasing on.
- (f)
- The mapping is concave on .
- (g)
- The mappingis concave on.
Proof.
We prove these properties as follows:
- (a)
- For all , Proposition 4 impliesThe nonnegativity of now follows from the nonnegativity of the Rényi divergence (Proposition 1 (a)). If for all , then . Hence, using on the RHS of (99), equals zero. Conversely, if and , then for some by Proposition 1 (a), which implies for all .
- (b)
- (c)
- For , is continuous because it is, by its definition in (77), a composition of continuous functions. The continuity at follows from a careful application of L’Hôpital’s rule.We next consider the continuity at . Define . Then, for all ,where (100) follows from the definition in (77). On the other hand, for all ,Because , it follows from (103) and (106) and the sandwich theorem thatwhere (108) follows from the continuity of the Rényi divergence (Proposition 1 (c)) and the definition of in (21).We conclude with the continuity at . Observe thatwhere (109) follows from the definition in (77), and (111) follows from the continuity of the Rényi divergence (Proposition 1 (c)) and the definition of in (23).
- (d)
- For all , Proposition 4 impliesBecause is nonincreasing on (Proposition 1 (d)) and because the pointwise minimum preserves the monotonicity, the mapping is nonincreasing on .
- (e)
- By Proposition 4,By the nonnegativity of the Rényi divergence (Proposition 1 (a)), the RHS of (113) is nonnegative for and nonpositive for . Hence, it suffices to show separately that the mapping is nonincreasing on and on . This is indeed the case: the mapping on the RHS of (113) is nonincreasing on (Proposition 1 (e)), and the monotonicity is preserved by the pointwise minimum and maximum, respectively.
- (f)
- For , Proposition 4 implies thatBecause is concave on (Proposition 1 (f)) and because the pointwise minimum preserves the concavity, the mapping is concave on .
- (g)
- This follows from Proposition 1 (g) in the same way that Part (f) followed from Proposition 1 (f). □
We next consider data-processing inequalities for . We distinguish between processing Y and processing X. The data-processing inequality for processing Y follows from the data-processing inequality for the (unconditional) Rényi divergence:
Theorem 5.
Let be a PMF, and let and be conditional PMFs. For a conditional PMF , define
Then, for all,
Proof.
We prove (117) for ; the claim will then extend to by the continuity of in (Proposition 6 (c)). For every , we can apply Proposition 1 (h) with the substitution of for to obtain
Processing X is different. Consider first that does not depend on X. Then, writing , we have the following result (which, as shown in Example 2 below, does not extend to general ):
Theorem 6.
Letandbe PMFs, and letbe a conditional PMF. For a conditional PMF, define the PMFs
Then, for all,
Once we provide the operational meaning of in horse betting with side information (Theorem 10 below), Theorem 6 will become very intuitive: it expresses the fact that preprocessing the side information cannot increase the gambler’s utility; see Remark 8. Note that and in Theorem 6 can be obtained from the following marginalization:
Proof of Theorem 6.
We show (122) for ; the claim will then extend to by the continuity of in (Proposition 6 (c)). Consider first . Then, (122) holds because
where (124) follows from (78); (125) follows from (121); (126) follows from (120); (127) follows from the Minkowski inequality [16] (III 2.4 Theorem 9); (129) holds because and imply , hence the first expression in square brackets on the left-hand side (LHS) of (129) equals one; and (130) follows from (78).
As a special case of Theorem 6, we obtain the following relation between the conditional and the unconditional Rényi divergence:
Corollary 3.
Letandbe PMFs, and letbe a conditional PMF. Define the marginal PMF
Then, for all,
Proof.
This follows from Theorem 6 in the same way that Corollary 1 followed from Theorem 2. □
Consider next that does depend on X. It turns out that Corollary 3, and hence Theorem 6, cannot be extended to this setting:
Example 2.
Letand. Define the PMFs,, andas
Then, forand for,
where the PMFsandare given by
Proof.
Numerically, bits, which is larger than bits. Similarly, bits, which is larger than bits. □
6. Relation to Arimoto’s Measures
Before discussing Arimoto’s measures, we first recall the definition of the Rényi entropy. The Rényi entropy of order [7] is defined for all positive ’s other than one as
For being zero, one, or infinity, we define by continuous extension of (141)
where denotes Shannon’s entropy. The Rényi entropy can be related to the Rényi divergence as follows:
where denotes the uniform distribution over .
There are different ways to define a conditional Rényi entropy [40]; we use Arimoto’s proposal. The Arimoto–Rényi conditional entropy of order [12,38,40,41] is defined for positive other than one as
where (147) follows from the definition of the Rényi entropy in (141). The Arimoto–Rényi conditional entropy plays a key role in guessing with side information [20,42,43,44] and in task encoding with side information [45]; and it can be related to hypothesis testing [41]. For being zero, one, or infinity, we define by continuous extension of (146)
where denotes Shannon’s conditional entropy. The analog of (145) for is:
Remark 2.
For all,
Proof.
Equation (151) follows, using some algebra, from the definition of in (78)–(81); and (152) follows from Proposition 4. (The characterization in (152) previously appeared as [40] (Theorem 4).) □
Arimoto [12] also defined the following measure of dependence:
where (154) follows from (141) and (146). Using Remark 2, we can express in terms of :
Remark 3.
For all,
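The following sketch evaluates the Arimoto–Rényi conditional entropy from its defining formula and cross-checks it against our reading of Remark 2, namely that $H_\alpha(X|Y)$ equals $\log_2|\mathcal{X}|$ minus the new conditional divergence between $P_{X|Y}$ and the uniform PMF given $P_Y$; it also evaluates Arimoto’s measure of dependence as $H_\alpha(X) - H_\alpha(X|Y)$. The example joint PMF and the function names are ours.

```python
import numpy as np

def renyi_entropy(P, alpha):
    """H_alpha(P) = (1/(1-alpha)) log2 sum_x P(x)^alpha."""
    return np.log2(np.sum(np.asarray(P, float) ** alpha)) / (1.0 - alpha)

def arimoto_cond_entropy(PXY, alpha):
    """Arimoto-Rényi H_alpha(X|Y) = (alpha/(1-alpha)) log2 sum_y ( sum_x P_XY(x,y)^alpha )^(1/alpha)."""
    inner = np.sum(np.asarray(PXY, float) ** alpha, axis=0) ** (1.0 / alpha)   # one term per y
    return alpha / (1.0 - alpha) * np.log2(np.sum(inner))

def cond_div_new(PY, PXgY, QX, alpha):
    """New conditional divergence of P_{X|Y} from a fixed Q_X given P_Y (our reading of (78))."""
    inner = np.sum(PXgY**alpha * QX[None, :] ** (1.0 - alpha), axis=1) ** (1.0 / alpha)
    return alpha / (alpha - 1.0) * np.log2(np.dot(PY, inner))

PXY = np.array([[0.4, 0.1], [0.1, 0.4]])    # assumed example, rows indexed by x
PX, PY = PXY.sum(axis=1), PXY.sum(axis=0)
PXgY = (PXY / PY[None, :]).T                # rows indexed by y: P_{X|Y=y}
unif = np.full(PXY.shape[0], 1.0 / PXY.shape[0])
for a in (0.5, 2.0):
    lhs = arimoto_cond_entropy(PXY, a)
    rhs = np.log2(PXY.shape[0]) - cond_div_new(PY, PXgY, unif, a)   # Remark 2, as we read it
    arimoto_I = renyi_entropy(PX, a) - lhs                          # Arimoto's dependence measure
    print(a, lhs, rhs, arimoto_I)
```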
7. Relations Between the Conditional Rényi Divergences and the Rényi Dependence Measures
In this section, we first establish the greater-or-equal-than order between the conditional Rényi divergences, where the order depends on whether or . We then show that this implies the same order between the dependence measures derived from the conditional Rényi divergences. Finally, we remark that many of the dependence measures coincide when they are maximized over all PMFs .
Proposition 7.
For all,
Proof.
Theorem 7.
For all,
For all,
Proof.
For both and , the relation follows from Proposition 7.
We next show that for . We show this for ; the claim will then extend to by the continuity in of and (Proposition 3 (c) and Proposition 2 (c)). For ,
where (162) follows from (55); (163) follows from Jensen’s inequality because is a concave function; and (164) follows from (30). The proof of the claim for is finished by dividing (162)–(164) by , which reverses the inequality because .
Corollary 4.
For all,
For all,
Proof.
The corollary now follows from (171)–(173) and Theorem 7. □
Despite , , , and being different measures, they often coincide when maximized over all PMFs :
Theorem 8.
For every conditional PMFand every,
In addition, for every conditional PMFand every,
For, the situation is different: there exists a conditional PMFsuch that, for every,
Proof.
Equation (174) follows from [9] (Proposition 1); (175) follows from [12] (Lemma 1); and (176) follows from [38] (Theorem V.1) for .
For , (178) holds because
where (179) follows from Proposition 5; (180) follows from (78); (181) and (185) follow from a minimax theorem and are justified below; (187) follows from (55); and (188) follows from (60).
To justify (181), we apply the minimax theorem [46] (Corollary 37.3.2) to the function ,
The sets of all PMFs over and over are convex and compact; the function f is jointly continuous in the pair because it is a composition of continuous functions; for every , the function f is linear and hence convex in ; and it only remains to show that the function f is concave in for every . Indeed, for every with , every , and every ,
where (193) follows from the reverse Minkowski inequality [16] (III 2.4 Theorem 9) because ; and (195) holds because the function is concave for .
The justification of (185) is very similar to that of (181); here, we apply the minimax theorem to the function ,
Compared to the justification of (181), the only essential difference lies in showing that the function g is concave in for every : here, this follows easily from the concavity of the function for .
We conclude the proof by establishing (177). Let , and let the conditional PMF be given by . (This corresponds to a binary noiseless channel.) Then, denoting by the uniform distribution over ,
where (199) follows from (61). On the other hand, for every and every PMF ,
where (200) follows from [3] (Lemma 11); (201) follows from (144); and (202) holds because . Inequality (177) now follows from (199) and (202). □
8. Horse Betting
In this section, we analyze horse betting with a gambler investing all her money. Recall from the introduction that the winning horse X is distributed according to the PMF p, where we assume for all ; that the odds offered by the bookmaker are denoted by ; that the fraction of her wealth that the gambler bets on Horse is denoted ; that the wealth relative is the random variable ; and that we seek betting strategies that maximize the utility function
Because the gambler invests all her money, b is a PMF. As in [47] (Section 10.3), define the constant
and the PMF
Using these definitions, the utility function can be decomposed as follows:
Theorem 9.
Let, and let b be a PMF. Then,
where the PMFis given by
Thus, choosinguniquely maximizesamong all PMFs b.
The three terms in (206) can be interpreted as follows:
- The first term, , depends only on the odds and is related to the fairness of the odds. The odds are called subfair if , fair if , and superfair if .
- The second term, , is related to the bookmaker’s estimate of the winning probabilities. It is zero if and only if the odds are inversely proportional to the winning probabilities.
- The third term, , is related to the gambler’s estimate of the winning probabilities. It is zero if and only if b is equal to .
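The decomposition can be checked numerically. The sketch below assumes, in line with our reading of (206) and (207), that the maximizing strategy is the tilted PMF $b^*(x) \propto p(x)^{1/(1-\beta)} o(x)^{\beta/(1-\beta)}$ and that the three terms are $\log_2 c$, $D_{1/(1-\beta)}(p\|r)$, and $-D_{1-\beta}(b^*\|b)$; the example probabilities, odds, and function names are ours.

```python
import numpy as np

def renyi_div(P, Q, alpha):
    return np.log2(np.sum(P**alpha * Q**(1.0 - alpha))) / (alpha - 1.0)

def utility(b, p, o, beta):
    return np.log2(np.dot(p, (b * o) ** beta)) / beta

p = np.array([0.5, 0.3, 0.2])            # assumed example winning probabilities
o = np.array([1.8, 3.5, 4.0])            # assumed example (subfair) odds
c = 1.0 / np.sum(1.0 / o)                # level of fairness
r = c / o                                # bookmaker's implied PMF

for beta in (-2.0, -0.5, 0.5):
    alpha = 1.0 / (1.0 - beta)
    bstar = p**alpha * o**(alpha - 1.0)
    bstar /= bstar.sum()                 # tilted PMF, our reading of (207)
    for b in (bstar, np.array([1/3, 1/3, 1/3])):
        lhs = utility(b, p, o, beta)
        rhs = np.log2(c) + renyi_div(p, r, alpha) - renyi_div(bstar, b, 1.0 - beta)
        print(f"beta={beta:+.1f}  U={lhs:.6f}  decomposition={rhs:.6f}")
    # For each beta, the first line (b = bstar) attains the larger utility.
```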
Remark 4.
(This decomposition appeared previously in [47] (Section 10.3).) Equation (208) implies that the doubling rate is maximized by proportional gambling, i.e., that is maximized if and only if b is equal to p.
Remark 5.
Considering the limitsand, the PMFsatisfies, for every,
where the setis defined as. It follows from Proposition 8 below that the RHS of (209) is the unique maximizer of ; and it follows from the proof of Proposition 9 below that the RHS of (210) is a maximizer (not necessarily unique) of .
Proof of Remark 5.
Recall that we assume for every . Then, (209) follows from (207) and the definition of c in (204). To establish (210), define and observe that, for every ,
where (211) follows from (207) and some algebra; and (212) is justified as follows: if , then equals one; and if , then tends to zero as because and because . □
Remark 6.
Using the definition in (24) for the Rényi divergence of negative orders, it is not difficult to see from the proof of Theorem 9 below that (206) also holds for . However, because the Rényi divergence of negative orders is nonpositive instead of nonnegative, the above interpretation is not valid anymore; in particular, for , choosing is in general not optimal.
Proof of Theorem 9.
We first show the maximization claim. The only term on the RHS of (206) that depends on b is . Because , this term is maximized if and only if (Proposition 1 (a)).
We now establish (206) for ; we omit the proof for , which can be found in [47] (Section 10.3). For ,
The rest of the section presents the cases , , and .
Proposition 8.
Let b be a PMF. Then,
Inequality (220) holds with equality if and only if for all .
Observe that if for all , then with probability one, i.e., S does not depend on the winning horse.
Proof of Proposition 8.
Equation (219) holds because
where (222) holds because, in the limit as tends to , the power mean tends to the minimum (since p is a PMF with for all [15] (Chapter 8)).
We show (220) by contradiction. Assume that there exists a PMF b that does not satisfy (220), thus
for all . Then,
where (224) holds because b is a PMF; (225) follows from (223); and (226) follows from the definition of c in (204). Because is impossible, such a b cannot exist, which establishes (220).
It is not difficult to see that (220) holds with equality if for all . We therefore focus on establishing that if (220) holds with equality, then for all . Observe first that, if (220) holds with equality, then, for all ,
Proposition 9.
Let, and let b be a PMF. Then,
Remark 7.
Proposition 9 implies that if, then it is optimal to bet on a single horse. Unless, this is not the case when: When, an optimal betting strategy requires placing a bet on every horse. This follows from Theorem 9 and our assumption thatandare all positive.
Proof of Proposition 9.
Proposition 10.
Let b be a PMF. Then,
Equality in (236) can be achieved by choosing for some satisfying
Proof.
9. Horse Betting with Side Information
In this section, we study the horse betting problem where the gambler observes some side information Y before placing her bets. This setting leads to the conditional Rényi divergence discussed in Section 5 (see Theorem 10). In addition, it provides a new operational meaning to the dependence measure (see Theorem 11).
We adapt our notation as follows: The joint PMF of X and Y is denoted . (Recall that X denotes the winning horse.) We drop the assumption that the winning probabilities are positive, but we assume that for all . We continue to assume that the gambler invests all her wealth, so a betting strategy is now a conditional PMF , and the wealth relative S is
The following decomposition of the utility function parallels that of Theorem 9:
Theorem 10.
Let. Then,
where the conditional PMFand the PMFare given by
Thus, choosinguniquely maximizesamong all conditional PMFs.
Proof.
We first show that is uniquely maximized by . The only term on the RHS of (243) that depends on is . Because , this term is maximized if and only if (Proposition 1 (a)). By our assumptions that for all and for all , we have for all . Consequently, if and only if .
We conclude with establishing (243) for . For ,
For every and every ,
which follows from (244) and (245). Now, (243) holds because
where (249) follows from (247) and (248) and the fact that ; (250) follows by identifying the Rényi divergence; (251) follows from (242); and (252) follows by identifying the conditional Rényi divergence using (78). □
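A small numerical illustration: because the expectation decouples over the observed y, the utility-maximizing strategy bets, for each y, the tilted PMF $b^*(x|y) \propto p(x|y)^{1/(1-\beta)} o(x)^{\beta/(1-\beta)}$ (our reading of the optimal strategy in Theorem 10). The sketch below compares the optimal utility with and without side information for an example joint PMF and odds of our own choosing.

```python
import numpy as np

def tilted(p, o, beta):
    """b(x) proportional to p(x)^(1/(1-beta)) o(x)^(beta/(1-beta))."""
    a = 1.0 / (1.0 - beta)
    b = p**a * o**(a - 1.0)
    return b / b.sum()

def optimal_utility(PXY, o, beta, use_side_info=True):
    """(1/beta) log2 E[(b(X|Y) o(X))^beta] under the tilted strategy."""
    PY = PXY.sum(axis=1 - 1) if False else PXY.sum(axis=0)
    if use_side_info:
        vals = 0.0
        for y in range(PXY.shape[1]):
            pxgy = PXY[:, y] / PY[y]
            b = tilted(pxgy, o, beta)
            vals += PY[y] * np.dot(pxgy, (b * o) ** beta)
    else:
        px = PXY.sum(axis=1)
        b = tilted(px, o, beta)
        vals = np.dot(px, (b * o) ** beta)
    return np.log2(vals) / beta

PXY = np.array([[0.35, 0.05],            # assumed example: rows x (horses), columns y
                [0.05, 0.25],
                [0.10, 0.20]])
o = np.array([2.5, 3.0, 4.0])
for beta in (-1.0, 0.5):
    print(beta, optimal_utility(PXY, o, beta, use_side_info=False),
                optimal_utility(PXY, o, beta, use_side_info=True))
    # The second value is never smaller: side information can only help.
```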
Remark 8.
It follows from Theorem 10 that, if the gambler gambles optimally, then, for,
Operationally, it is clear that preprocessing the side information cannot increase the gambler’s utility, i.e., that, for every conditional PMF,
whereandare derived from the joint PMFgiven by
The extreme case is when the preprocessing maps the side information to a constant and hence leads to the case where the side information is absent. In this case,is deterministic andequals. Theorem 9 and Theorem 10 then lead to the following relation between the conditional and unconditional Rényi divergence:
where the marginal PMFis given by
The last result of this section provides a new operational meaning to the Lapidoth–Pfister mutual information : assuming that and that the gambler knows the winning probabilities, measures how much the side information that is available to the gambler but not the bookmaker increases the gambler’s smallest guaranteed utility for a fixed level of fairness c. To see this, consider first the setting without side information. By Theorem 9, the gambler chooses to maximize her utility, where is defined in (207). Then, using the nonnegativity of the Rényi divergence (Proposition 1 (a)), the following lower bound on the gambler’s utility follows from (206):
We call the RHS of (258) the smallest guaranteed utility for a fixed level of fairness c because (258) holds with equality if the bookmaker chooses the odds inversely proportional to the winning probabilities. Comparing (258) with (259) below, we see that the difference due to the side information is . Note that is typically not the difference between the utility with and without side information; this is because the odds for which (258) and (259) hold with equality are typically not the same.
Theorem 11.
Let. Ifis equal tofrom Theorem 10, then
10. Horse Betting with Part of the Money
In this section, we treat the possibility that the gambler does not invest all her wealth. We restrict ourselves to the setting without side information and to . (For the case , see [47] (Section 10.5).) We assume that and for all . Denote by the fraction of her wealth that the gambler does not use for betting. (We assume .) Then, is a PMF, and the wealth relative S is the random variable
As in Section 8, define the constant
We treat the cases and separately, starting with the latter. If , then it is optimal to invest all the money:
Proposition 11.
Assume, let, and let b be a PMF onwith utility. Then, there exists a PMF on with and utility .
Proof.
Choose the PMF as follows:
On the other hand, if and the odds are subfair, i.e., if , then Claim (c) of the following theorem shows that investing all the money is not optimal:
Theorem 12.
Assume, let, and letbe a PMF onthat maximizesamong all PMFs b. Defining
the following claims hold:
- (a)
- Both the numerator and denominator on the RHS of (270) are positive, so Γ is well-defined and positive.
- (b)
- For every,
- (c)
- The quantitysatisfiesIn particular,.
Claim (b) implies that for every , if and only if . Ordering the elements of such that , the set thus has a special structure: it is either empty or equal to for some integer k. To maximize , the following procedure can be used: for every with the above structure, compute the corresponding b according to (270)–(273); and from these b’s, take one that maximizes . This procedure leads to an optimal solution: an optimal solution exists because we are optimizing a continuous function over a compact set, and it corresponds to a set that will be considered by the procedure.
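For readers who prefer a numerical cross-check of this procedure, the following sketch simply maximizes the utility over the cash fraction and the bets with a generic constrained optimizer; it is not an implementation of (270)–(273), and the example odds, probabilities, and risk parameter are ours.

```python
import numpy as np
from scipy.optimize import minimize

p = np.array([0.5, 0.3, 0.2])          # assumed example winning probabilities
o = np.array([1.8, 2.8, 3.5])          # assumed example subfair odds (sum of 1/o(x) exceeds 1)
beta = -1.0                            # assumed example risk parameter
m = len(p)

def neg_utility(z):
    """z = (b(0), b(1), ..., b(m)): cash fraction followed by the bets."""
    b0, b = z[0], z[1:]
    S = b0 + b * o                     # wealth relative for each possible winner
    return -np.log2(np.dot(p, S**beta)) / beta

z0 = np.full(m + 1, 1.0 / (m + 1))     # start from an interior point
res = minimize(neg_utility, z0, method="SLSQP",
               bounds=[(1e-9, 1.0)] * (m + 1),
               constraints=[{"type": "eq", "fun": lambda z: z.sum() - 1.0}])
print("cash fraction b(0):", res.x[0])
print("bets b(x):        ", res.x[1:])
print("utility:          ", -res.fun)
# SLSQP returns a stationary point; for this sketch that is enough to see whether
# keeping some cash beats investing everything at these subfair odds.
```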
Proof of Theorem 12.
The proof is based on the Karush–Kuhn–Tucker conditions. By separately considering the cases and , we first show that, for , a strategy is optimal if and only if the following conditions are satisfied for some :
and, for every ,
Consider first , and define the function ,
Since and since the logarithm is an increasing function, maximizing over b is equivalent to maximizing . Observe that is concave, thus, by the Karush–Kuhn–Tucker conditions [11] (Theorem 4.4.1), it is maximized by a PMF b if and only if there exists a such that (i) for all with ,
and (ii) for all with ,
Henceforth, we use the following notation: to designate that (i) and (ii) both hold, we write
Consider now , and define as in (276). Then, because , maximizing is equivalent to minimizing . The function is convex, thus Inequality (278) is reversed. Dividing by again reverses the inequalities, thus (280), (274), and (275) continue to hold for .
Having established that, for all , a strategy b is optimal if and only if (274) and (275) hold, we next continue with the proof. Let , and let be a PMF on that maximizes . By the above discussion, (274) and (275) are satisfied by for some . The LHS of (274) is positive, so . We now show that for all ,
To this end, fix . If , then (275) implies
and the RHS of (282) is equal to the RHS of (281) because, being equal to , it is positive. If , then (275) implies
so the RHS of (281) is zero and (281) hence holds.
Having established (281), we next show that for some . For a contradiction, assume that for all . Then,
where (284) follows from (275), and (285) holds because by assumption. However, this is impossible: (285) contradicts (274).
Let now be such that . Then, by (281),
Because and are positive, this implies . Thus, by (274),
Splitting the sum on the LHS of (287) depending on whether or , we obtain
where (289) follows from (275). Rearranging (290), we obtain
Recall that and . In addition, because and hence . Thus, , so both the numerator and denominator in the definition of in (270) are positive, which establishes Claim (a), namely that is well-defined and positive.
11. Universal Betting for IID Races
In this section, we present a universal gambling strategy for IID races that requires neither knowledge of the winning probabilities nor of the parameter of the utility function and yet asymptotically maximizes the utility function for all PMFs p and all . Consider n consecutive horse races, where the winning horse in the ith race is denoted for . We assume that are IID according to the PMF p, where for all . In every race, the bookmaker offers the same odds , and the gambler spends all her wealth placing bets on the horses. The gambler plays race after race, i.e., before placing bets on a race, she learns the winning horse of the previous race and receives the money from the bookmaker. Her betting strategy is hence a sequence of conditional PMFs . The wealth relative is the random variable
We seek betting strategies that maximize the utility function
We first establish that to maximize for a fixed , it suffices to use the same betting strategy in every race; see Theorem 13. We then show that the individual-sequence-universal strategy by Cover–Ordentlich [48] allows the gambler to asymptotically achieve the same normalized utility without knowing p or (see Theorem 14).
For a fixed , let the PMF be a betting strategy that maximizes the single-race utility discussed in Section 8, and denote by the utility associated with . Using the same betting strategy over n races leads to the utility , and it follows from (295) and (296) that
As we show next, is the maximum utility that can be achieved among all betting strategies:
Theorem 13.
Let, and letbe a sequence of conditional PMFs. Then,
Proof.
We show (298) for ; analogous arguments establish (298) for and . We prove (298) by induction on n. For , (298) holds because is the maximum single-race utility. Assume now and that (298) is valid for . For , (298) holds because
where (303) holds because maximizes the single-race utility , and (305) holds because (298) is valid for . □
In portfolio theory, Cover–Ordentlich [48] (Definition 1) proposed a universal strategy. Adapted to our setting, it leads to the following sequence of conditional PMFs:
where ; is the distribution on ; ; and
This strategy depends neither on the winning probabilities p nor on the parameter . Denoting the utility (296) associated with the strategy by , we have the following result:
Theorem 14.
For every,
Hence,
Proof.
Inequality (310) follows from Theorem 13; and (311) follows from (309) and (310) and the sandwich theorem. It thus remains to establish (309): We do so for ; analogous arguments establish (309) for and . For a fixed sequence , let be a PMF on that maximizes , and denote the wealth relative in (295) associated with using in every race by , thus
Let denote the wealth relative in (295) associated with the strategy and the sequence . Using [48] (Theorem 2) it follows that, for every ,
Remark 9.
As discussed in Section 8, the optimal single-race betting strategy varies significantly with different values of β, thus it might be a bit surprising that the Cover–Ordentlich strategy is not only universal with respect to the winning probabilities, but also with respect to β. This is due to the following two reasons: First, for fixed winning probabilities and a fixed β, it is optimal to use the same betting strategy in every race (see Theorem 13). Second, for every, the wealth relative of the Cover–Ordentlich strategy is not much worse than that of using the same strategyin every race, irrespective of(see (313)). Hence, irrespective of the optimal single-race betting strategy, the Cover–Ordentlich strategy is able to asymptotically achieve the same normalized utility.
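The strategy is easy to simulate. For horse races, the past wealth of a constant strategy b factorizes as $\prod_j b(x_j) o(x_j)$, so the odds cancel from the mixture, and (under our reading of the definition, with the Dirichlet(1/2, …, 1/2) prior of Cover–Ordentlich’s Definition 1) the bet placed on horse x before race i reduces to the add-half (Krichevsky–Trofimov) estimate $(N_i(x) + 1/2)/(i - 1 + m/2)$, where $N_i(x)$ counts the wins of x in the first $i-1$ races and m is the number of horses. The sketch below computes the normalized utility of this strategy exactly (by summing over all race outcomes for a small n) and compares it with the informed strategy of Theorem 13; the example numbers are ours.

```python
import numpy as np
from itertools import product

p = np.array([0.5, 0.3, 0.2])          # assumed example winning probabilities
o = np.array([2.0, 4.0, 5.0])          # assumed example odds
m, n = len(p), 8

def tilted(p, o, beta):                # optimal single-race strategy (see Section 8)
    a = 1.0 / (1.0 - beta)
    b = p**a * o**(a - 1.0)
    return b / b.sum()

def kt_bet(counts, i):                 # add-half (Dirichlet-(1/2,...,1/2) mixture) strategy
    return (counts + 0.5) / (i + m / 2.0)

def expected_Sbeta(strategy, beta):
    """E[S_n^beta], computed exactly by summing over all m^n race outcomes."""
    total = 0.0
    for seq in product(range(m), repeat=n):
        counts, prob, S = np.zeros(m), 1.0, 1.0
        for i, x in enumerate(seq):
            b = strategy(counts, i)
            prob *= p[x]
            S *= b[x] * o[x]
            counts[x] += 1
        total += prob * S**beta
    return total

for beta in (-1.0, 0.5):
    bstar = tilted(p, o, beta)
    u_kt = np.log2(expected_Sbeta(kt_bet, beta)) / (beta * n)
    u_opt = np.log2(expected_Sbeta(lambda c, i: bstar, beta)) / (beta * n)
    print(f"beta={beta:+.1f}  (1/n)U universal={u_kt:.4f}  (1/n)U informed={u_opt:.4f}")
```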
Author Contributions
Writing—original draft preparation, C.B., A.L., and C.P.; and writing—review and editing, C.B., A.L., and C.P. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Conflicts of Interest
The authors declare no conflict of interest.
Appendix A. Proof of Proposition 1
These properties mostly follow from van Erven–Harremoës [8]:
- (a)
- See [8] (Theorem 8).
- (b)
- This follows from the definitions in (4) and (21)–(23) and the conventions in (20).
- (c)
- This follows from [8] (Theorem 7) and the fact that by L’Hôpital’s rule. (Note that does not need to be continuous at when the alphabets are not finite; see the discussion after [8] (Equation (18)).)
- (d)
- See [8] (Theorem 3).
- (e)
- Let satisfy . Then,where (A1) and (A3) follow from [8] (Lemma 10), and (A2) holds because the Rényi divergence, extended to negative orders, is nondecreasing ([8] (Theorem 39)).
- (f)
- See [8] (Corollary 2).
- (g)
- For ,where (A5) follows from [8] (Theorem 30). Hence, is concave in because the expression in square brackets on the RHS of (A6) is concave in for every R and because the pointwise infimum preserves the concavity.
- (h)
- See [8] (Theorem 9).
Appendix B. Proof of Theorem 1
Beginning with (29),
where (A8) follows by applying, separately for every , Proposition 1 (h) with the conditional PMF .
Appendix C. Proof of Theorem 2
We show (43) for ; the claim then extends to by the continuity of in (Proposition 2 (c)). Let . Keeping in mind that , (43) holds because
where (A10) follows from (30); (A11) follows from (41) and (42); (A12) follows from Hölder’s inequality; (A13) holds because if and ; (A14) follows from Jensen’s inequality because is concave; (A15) follows from (40); (A16) holds because and imply , hence the expression in square brackets on the LHS of (A16) equals one; and (A17) follows from (30).
Appendix D. Proof of Corollary 1
Applying Theorem 2 with and the conditional PMF , we obtain
To complete the proof of (48), observe that
where (A19) holds because (41) and (46) imply and because (42) and (47) imply ; and (A20) follows from Remark 1.
Appendix E. Proof of Example 1
If , then it can be verified numerically that (53) holds for . Fix now . Then, for all ,
The RHS of (53) satisfies, for sufficiently small ,
where (A27) holds for sufficiently small because . Because , (53) follows from (A23) and (A28) for sufficiently small .
Appendix F. Proof of Theorem 3
Observe that, for all and all ,
Hence, (68) follows from (54) and
which follows from the data-processing inequality for the Rényi divergence by substituting for in Proposition 1 (h).
Appendix G. Proof of Theorem 4
Observe that, for all and all ,
Hence, (73) follows from (54) and
which follows from the data-processing inequality for the Rényi divergence by substituting for in Proposition 1 (h).
References
- Kelly, J.L., Jr. A new interpretation of information rate. Bell Syst. Tech. J. 1956, 35, 917–926.
- Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2006; ISBN 978-0-471-24195-9.
- Lapidoth, A.; Pfister, C. Two measures of dependence. Entropy 2019, 21, 778.
- Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
- Csiszár, I.; Körner, J. Information Theory: Coding Theorems for Discrete Memoryless Systems, 2nd ed.; Cambridge University Press: Cambridge, UK, 2011; ISBN 978-0-521-19681-9.
- Csiszár, I.; Shields, P.C. Information Theory and Statistics: A Tutorial; now Publishers: Hanover, MA, USA, 2004; ISBN 978-1-933019-05-5.
- Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 20 June–30 July 1960; Volume 1, pp. 547–561.
- Van Erven, T.; Harremoës, P. Rényi divergence and Kullback–Leibler divergence. IEEE Trans. Inf. Theory 2014, 60, 3797–3820.
- Csiszár, I. Generalized cutoff rates and Rényi’s information measures. IEEE Trans. Inf. Theory 1995, 41, 26–34.
- Sibson, R. Information radius. Z. Wahrscheinlichkeitstheorie verw. Geb. 1969, 14, 149–160.
- Gallager, R.G. Information Theory and Reliable Communication; John Wiley & Sons: Hoboken, NJ, USA, 1968; ISBN 978-0-471-29048-3.
- Arimoto, S. Information measures and capacity of order α for discrete memoryless channels. In Topics in Information Theory; Csiszár, I., Elias, P., Eds.; North-Holland Publishing Company: Amsterdam, The Netherlands, 1977; pp. 41–52. ISBN 0-7204-0699-4.
- Eeckhoudt, L.; Gollier, C.; Schlesinger, H. Economic and Financial Decisions under Risk; Princeton University Press: Princeton, NJ, USA, 2005; ISBN 978-0-691-12215-1.
- Soklakov, A.N. Economics of disagreement – financial intuition for the Rényi divergence. arXiv 2018, arXiv:1811.08308.
- Steele, J.M. The Cauchy–Schwarz Master Class: An Introduction to the Art of Mathematical Inequalities; Cambridge University Press: Cambridge, UK, 2004; ISBN 978-0-521-54677-5.
- Bullen, P.S. Handbook of Means and Their Inequalities; Kluwer Academic Publishers: Dordrecht, The Netherlands, 2003; ISBN 978-1-4020-1522-9.
- Campbell, L.L. A coding theorem and Rényi’s entropy. Inf. Control 1965, 8, 423–429.
- Campbell, L.L. Definition of entropy by means of a coding problem. Z. Wahrscheinlichkeitstheorie verw. Geb. 1966, 6, 113–118.
- Merhav, N. On optimum strategies for minimizing the exponential moments of a loss function. Commun. Inf. Syst. 2011, 11, 343–368.
- Sason, I.; Verdú, S. Improved bounds on lossless source coding and guessing moments via Rényi measures. IEEE Trans. Inf. Theory 2018, 64, 4323–4346.
- Augustin, U. Noisy Channels. Habilitation Thesis, Universität Erlangen–Nürnberg, Erlangen, Germany, 1978.
- Nakiboğlu, B. The Augustin capacity and center. Probl. Inf. Transm. 2019, 55, 299–342.
- Nakiboğlu, B. The sphere packing bound for memoryless channels. arXiv 2018, arXiv:1804.06372.
- Ho, S.-W.; Verdú, S. Convexity/concavity of Rényi entropy and α-mutual information. In Proceedings of the 2015 IEEE International Symposium on Information Theory (ISIT), Hong Kong, China, 14–19 June 2015; pp. 745–749.
- Verdú, S. α-mutual information. In Proceedings of the 2015 Information Theory and Applications Workshop (ITA), San Diego, CA, USA, 1–6 February 2015; pp. 1–6.
- Nakiboğlu, B. The Rényi capacity and center. IEEE Trans. Inf. Theory 2019, 65, 841–860.
- Cai, C.; Verdú, S. Conditional Rényi divergence saddlepoint and the maximization of α-mutual information. Entropy 2019, 21, 969.
- Fong, S.L.; Tan, V.Y.F. Strong converse theorems for classes of multimessage multicast networks: A Rényi divergence approach. IEEE Trans. Inf. Theory 2016, 62, 4953–4967.
- Yu, L.; Tan, V.Y.F. Rényi resolvability and its applications to the wiretap channel. IEEE Trans. Inf. Theory 2019, 65, 1862–1897.
- Gallager, R.G. A simple derivation of the coding theorem and some applications. IEEE Trans. Inf. Theory 1965, 11, 3–18.
- Shannon, C.E.; Gallager, R.G.; Berlekamp, E.R. Lower bounds to error probability for coding on discrete memoryless channels. I. Inf. Control 1967, 10, 65–103.
- Shannon, C.E.; Gallager, R.G.; Berlekamp, E.R. Lower bounds to error probability for coding on discrete memoryless channels. II. Inf. Control 1967, 10, 522–552.
- Arimoto, S. On the converse to the coding theorem for discrete memoryless channels. IEEE Trans. Inf. Theory 1973, 19, 357–359.
- Polyanskiy, Y.; Verdú, S. Arimoto channel coding converse and Rényi divergence. In Proceedings of the 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Allerton, IL, USA, 29 September–1 October 2010; pp. 1327–1333.
- Nakiboğlu, B. The sphere packing bound via Augustin’s method. IEEE Trans. Inf. Theory 2019, 65, 816–840.
- Nakiboğlu, B. The sphere packing bound for DSPCs with feedback à la Augustin. IEEE Trans. Commun. 2019, 67, 7456–7467.
- Tomamichel, M.; Hayashi, M. Operational interpretation of Rényi information measures via composite hypothesis testing against product and Markov distributions. IEEE Trans. Inf. Theory 2018, 64, 1064–1082.
- Aishwarya, G.; Madiman, M. Remarks on Rényi versions of conditional entropy and mutual information. In Proceedings of the 2019 IEEE International Symposium on Information Theory (ISIT), Paris, France, 7–12 July 2019; pp. 1117–1121.
- Lapidoth, A.; Pfister, C. Testing against independence and a Rényi information measure. In Proceedings of the 2018 IEEE Information Theory Workshop (ITW), Guangzhou, China, 25–29 November 2018; pp. 1–5.
- Fehr, S.; Berens, S. On the conditional Rényi entropy. IEEE Trans. Inf. Theory 2014, 60, 6801–6810.
- Sason, I.; Verdú, S. Arimoto–Rényi conditional entropy and Bayesian M-ary hypothesis testing. IEEE Trans. Inf. Theory 2018, 64, 4–25.
- Arıkan, E. An inequality on guessing and its application to sequential decoding. IEEE Trans. Inf. Theory 1996, 42, 99–105.
- Sundaresan, R. Guessing under source uncertainty. IEEE Trans. Inf. Theory 2007, 53, 269–287.
- Bracher, A.; Lapidoth, A.; Pfister, C. Guessing with distributed encoders. Entropy 2019, 21, 298.
- Bunte, C.; Lapidoth, A. Encoding tasks and Rényi entropy. IEEE Trans. Inf. Theory 2014, 60, 5065–5076.
- Rockafellar, R.T. Convex Analysis; Princeton University Press: Princeton, NJ, USA, 1970; ISBN 978-0-691-01586-6.
- Moser, S.M. Information Theory (Lecture Notes), version 6.6. 2018. Available online: http://moser-isi.ethz.ch/scripts.html (accessed on 8 March 2020).
- Cover, T.M.; Ordentlich, E. Universal portfolios with side information. IEEE Trans. Inf. Theory 1996, 42, 348–363.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).