Abstract
Two families of dependence measures between random variables are introduced. They are based on the Rényi divergence of order α and the relative α-entropy, respectively, and both dependence measures reduce to Shannon's mutual information when their order α is one. The first measure shares many properties with the mutual information, including the data-processing inequality, and can be related to the optimal error exponents in composite hypothesis testing. The second measure does not satisfy the data-processing inequality, but appears naturally in the context of distributed task encoding.
1. Introduction
The solutions to many information-theoretic problems can be expressed using Shannon's information measures such as entropy, relative entropy, and mutual information. Other problems require Rényi's information measures, which generalize Shannon's. In this paper, we analyze two Rényi measures of dependence, J_α(X;Y) and K_α(X;Y), between random variables X and Y taking values in the finite sets 𝒳 and 𝒴, with α ∈ (0,∞) being a parameter. (Our notation is similar to the one used for the mutual information: technically, J_α(X;Y) and K_α(X;Y) are functions not of X and Y, but of their joint probability mass function (PMF) P_XY.) For α ∈ (0,∞), we define J_α(X;Y) and K_α(X;Y) as

J_α(X;Y) ≜ min_{Q_X ∈ 𝒫(𝒳), Q_Y ∈ 𝒫(𝒴)} D_α(P_XY ‖ Q_X × Q_Y),  (1)
K_α(X;Y) ≜ min_{Q_X ∈ 𝒫(𝒳), Q_Y ∈ 𝒫(𝒴)} Δ_α(P_XY ‖ Q_X × Q_Y),  (2)

where 𝒫(𝒳) and 𝒫(𝒴) denote the set of all PMFs over 𝒳 and 𝒴, respectively; D_α(·‖·) denotes the Rényi divergence of order α (see (50) ahead); and Δ_α(·‖·) denotes the relative α-entropy (see (55) ahead). As shown in Proposition 7, J_α(X;Y) and K_α(X;Y) are in fact closely related.
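For readers who want to experiment, the following minimal Python sketch evaluates both minimizations by brute force for binary alphabets. It uses the standard formulas for D_α and Δ_α (given in (50) and (55) ahead); the helper names and the grid search are illustrative choices, not an official implementation.

```python
import numpy as np

def renyi_div(P, Q, a):
    """Rényi divergence D_a(P||Q) of order a (0 < a, a != 1), in bits."""
    mask = P > 0
    return np.log2(np.sum(P[mask]**a * Q[mask]**(1.0 - a))) / (a - 1.0)

def rel_alpha_ent(P, Q, a):
    """Relative alpha-entropy Delta_a(P||Q) (0 < a, a != 1), in bits."""
    mask = P > 0
    t1 = (a / (1.0 - a)) * np.log2(np.sum(P[mask] * Q[mask]**(a - 1.0)))
    t2 = -np.log2(np.sum(P[mask]**a)) / (1.0 - a)
    t3 = np.log2(np.sum(Q[Q > 0]**a))
    return t1 + t2 + t3

def min_over_products(P_XY, a, div, grid=101):
    """Grid-search min over product PMFs Q_X x Q_Y (binary alphabets)."""
    qs = np.linspace(1e-6, 1.0 - 1e-6, grid)
    best = np.inf
    for qx in qs:
        for qy in qs:
            Q = np.outer([qx, 1 - qx], [qy, 1 - qy])
            best = min(best, div(P_XY.ravel(), Q.ravel(), a))
    return best

# Example: X Bernoulli and Y = X (the setting of Figure 1).
P = np.array([[0.2, 0.0], [0.0, 0.8]])
print(min_over_products(P, 0.5, renyi_div))      # J_{1/2}(X;Y)
print(min_over_products(P, 0.5, rel_alpha_ent))  # K_{1/2}(X;Y)
```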
The measures J_α(X;Y) and K_α(X;Y) have the following operational meanings (see Section 3): J_α(X;Y) is related to the optimal error exponents in testing whether the observed independent and identically distributed (IID) samples were generated according to the joint PMF P_XY or an unknown product PMF; and K_α(X;Y) appears as a penalty term in the sum-rate constraint of distributed task encoding.
The measures J_α(X;Y) and K_α(X;Y) share many properties with Shannon's mutual information [], and both are equal to the mutual information when α is one. Except for some special cases, we have no closed-form expressions for J_α(X;Y) or K_α(X;Y). As illustrated in Figure 1, unless α is one, the minimum in the definitions of J_α(X;Y) and K_α(X;Y) is typically not achieved by Q_X = P_X and Q_Y = P_Y. (When α is one, then the minimum is always achieved by P_X and P_Y; this follows from Proposition 8 and the fact that Δ_1(·‖·) = D_1(·‖·) = D(·‖·).)
Figure 1. (Left) D_α(P_XY ‖ P_X × P_Y) and J_α(X;Y) versus α. (Right) Δ_α(P_XY ‖ P_X × P_Y) and K_α(X;Y) versus α. In both plots, X is Bernoulli, and Y is equal to X.
The rest of this paper is organized as follows. In Section 2, we review other generalizations of the mutual information. In Section 3, we discuss the operational meanings of J_α(X;Y) and K_α(X;Y). In Section 4, we recall the required Rényi information measures and prove some preparatory results. In Section 5, we state the properties of J_α(X;Y) and K_α(X;Y). In Section 6, we prove these properties.
2. Related Work
The measure J_α(X;Y) was discovered independently of the authors of the present paper by Tomamichel and Hayashi [] (Equation (58)), who derived some of its properties in [] (Appendix A-C).
Other Rényi-based measures of dependence have appeared in the past. Notable are those by Sibson [], Arimoto [], and Csiszár [], respectively denoted by I_α^S(X;Y), I_α^A(X;Y), and I_α^C(X;Y):

I_α^S(X;Y) ≜ (α/(α−1)) log Σ_y [Σ_x P_X(x) P_{Y|X}(y|x)^α]^{1/α}  (3)
 = min_{Q_Y} D_α(P_XY ‖ P_X × Q_Y),  (4)
I_α^A(X;Y) ≜ H_α(X) − H_α^A(X|Y)  (5)
 = (α/(α−1)) log [ Σ_y (Σ_x P_XY(x,y)^α)^{1/α} / (Σ_x P_X(x)^α)^{1/α} ],  (6)
I_α^C(X;Y) ≜ min_{Q_Y} Σ_x P_X(x) D_α(P_{Y|X=x} ‖ Q_Y),  (7)

where, throughout the paper, log denotes the base-2 logarithm; D_α(·‖·) denotes the Rényi divergence of order α (see (50) ahead); H_α(·) denotes the Rényi entropy of order α (see (45) ahead); and H_α^A(X|Y) denotes the Arimoto–Rényi conditional entropy [,,], which is defined for positive α other than one as

H_α^A(X|Y) ≜ (α/(1−α)) log Σ_y [Σ_x P_XY(x,y)^α]^{1/α}.  (8)

(Equation (4) follows from Proposition 9 ahead, and (6) follows from (45) and (8).) An overview of I_α^S(X;Y), I_α^A(X;Y), and I_α^C(X;Y) is provided in []. Another Rényi-based measure of dependence can be found in [] (Equation (19)).
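The equality of (3) and (4) can be checked numerically. The following sketch computes Sibson's measure via the closed form above and compares it against a grid search over Q_Y for a binary example; the function names are illustrative.

```python
import numpy as np

def renyi_div(P, Q, a):
    mask = P > 0
    return np.log2(np.sum(P[mask]**a * Q[mask]**(1.0 - a))) / (a - 1.0)

def sibson_closed_form(P_XY, a):
    """(a/(a-1)) log2 sum_y (sum_x P_X(x) P_{Y|X}(y|x)^a)^(1/a)."""
    P_X = P_XY.sum(axis=1, keepdims=True)
    W = np.divide(P_XY, P_X, out=np.zeros_like(P_XY), where=P_X > 0)
    inner = (P_X * W**a).sum(axis=0)
    return (a / (a - 1.0)) * np.log2(np.sum(inner**(1.0 / a)))

def sibson_min_form(P_XY, a, grid=2001):
    """min over Q_Y of D_a(P_XY || P_X x Q_Y), binary Y, grid search."""
    P_X = P_XY.sum(axis=1)
    return min(renyi_div(P_XY.ravel(), np.outer(P_X, [q, 1 - q]).ravel(), a)
               for q in np.linspace(1e-6, 1 - 1e-6, grid))

P = np.array([[0.4, 0.1], [0.1, 0.4]])
print(sibson_closed_form(P, 2.0), sibson_min_form(P, 2.0))  # should agree
```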
The relation between J_α(X;Y) and the measures above was established recently:
Proposition 1
([] (Theorem IV.1)). For every PMF P_XY and every α ∈ (0,∞),

J_α(X;Y) ≤ I_α^C(X;Y).  (10)
Proof.
This is proved in [] for a measure-theoretic setting. Here, we specialize the proof to finite alphabets. We first prove (10):
where (12) follows from the definition of J_α(X;Y) in (1); (13) follows from Proposition 9 ahead with the roles of X and Y swapped; (15) follows from Jensen's inequality because the relevant power function is concave on the range of α under consideration; and (17) follows from the definition of I_α^C(X;Y) in (7).
Many of the above Rényi information measures coincide when they are maximized over P_X with P_{Y|X} held fixed: for every conditional PMF P_{Y|X} and every positive α other than one,

max_{P_X} I_α^S(X;Y)  (20)
 = max_{P_X} I_α^A(X;Y)  (21)
 = max_{P_X} I_α^C(X;Y),  (22)

where P_X P_{Y|X} denotes the joint PMF of X and Y; (21) follows from [] (Lemma 1); and (22) follows from [] (Proposition 1). It was recently established that this is also true for J_α(X;Y):
Proposition 2
([] (Theorem V.1)). For every conditional PMF P_{Y|X} and every α ∈ (0,1) ∪ (1,∞),

max_{P_X} J_α(X;Y) = max_{P_X} I_α^S(X;Y).
Proof.
Dependence measures can also be based on the f-divergence [,,]. Every convex function f: (0,∞) → ℝ satisfying f(1) = 0 induces a dependence measure, namely

I_f(X;Y) ≜ D_f(P_XY ‖ P_X × P_Y) = Σ_{x,y} P_X(x) P_Y(y) f( P_XY(x,y) / (P_X(x) P_Y(y)) ),  (27)

where (27) follows from the definition of the f-divergence. (For f(t) = t log t, I_f(X;Y) is the mutual information.) Such dependence measures are used for example in [], and a construction equivalent to (27) is studied in [].
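A minimal sketch of (27), with the choice f(t) = t log t recovering Shannon's mutual information; the helper name is illustrative.

```python
import numpy as np

def f_dependence(P_XY, f):
    """D_f(P_XY || P_X x P_Y) = sum_{x,y} P_X(x)P_Y(y) f(P_XY/(P_X P_Y))."""
    P_X = P_XY.sum(axis=1)
    P_Y = P_XY.sum(axis=0)
    prod = np.outer(P_X, P_Y)
    mask = prod > 0
    return np.sum(prod[mask] * f(P_XY[mask] / prod[mask]))

# f(t) = t*log2(t) gives the mutual information I(X;Y) in bits.
f_kl = lambda t: np.where(t > 0, t * np.log2(np.maximum(t, 1e-300)), 0.0)
P = np.array([[0.4, 0.1], [0.1, 0.4]])
print(f_dependence(P, f_kl))
```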
3. Operational Meanings
In this section, we discuss the operational meaning of J_α(X;Y) in hypothesis testing (Section 3.1) and of K_α(X;Y) in distributed task encoding (Section 3.2).
3.1. Testing Against Independence and J_α(X;Y)
Consider the hypothesis testing problem of guessing whether an observed sequence of pairs was drawn IID from some given joint PMF P_XY or IID from some unknown product distribution. Thus, based on a sequence of pairs of random variables (X_1,Y_1), …, (X_n,Y_n), two hypotheses have to be distinguished:
- 0)
- Under the null hypothesis, (X_1,Y_1), …, (X_n,Y_n) are IID according to P_XY.
- 1)
- Under the alternative hypothesis, (X_1,Y_1), …, (X_n,Y_n) are IID according to some unknown PMF of the form Q_X × Q_Y, where Q_X and Q_Y are arbitrary PMFs over 𝒳 and 𝒴, respectively.
Associated with every deterministic test and every pair (Q_X, Q_Y) are the type-I error probability (the probability of deciding on the alternative hypothesis when (X_i,Y_i) are IID according to P_XY) and the type-II error probability (the probability of deciding on the null hypothesis when (X_i,Y_i) are IID according to Q_X × Q_Y). We seek sequences of tests whose worst-case type-II error probability decays exponentially faster than 2^{−nR}. To be more specific, for a fixed R, denote by 𝒯_R the set of all sequences of deterministic tests whose type-II error probability, maximized over all product PMFs Q_X × Q_Y, decays exponentially faster than 2^{−nR} (28), where log denotes the base-2 logarithm. Note that (28) implies—but is not equivalent to—that for n sufficiently large, the type-II error probability is at most 2^{−nR} for all (Q_X, Q_Y). For a fixed R, the optimal type-I error exponent that can be asymptotically achieved under the constraint (28) is given by the largest exponential decay rate of the type-I error probability over all test sequences in 𝒯_R (29).
The measure J_α(X;Y) appears as follows: In [] (first part of (57)), it is shown that the optimal type-I error exponent (29) is, for R sufficiently close to I(X;Y), given by an expression (30) that optimizes a function of J_α(X;Y) and R over a set of α's; and in [] (Theorem 3), it is shown that for all R, the exponent (29) equals the Fenchel biconjugate (31) of that expression, where the biconjugation is taken in R. In general, the Fenchel biconjugation cannot be omitted because the expression and its biconjugate sometimes differ [] (Equation (11) and Example 14).
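As a concrete illustration of Fenchel biconjugation, the sketch below computes f** on a grid by applying the discrete Legendre transform twice; the sampled function is a hypothetical nonconvex example, not one from the hypothesis-testing problem.

```python
import numpy as np

def biconjugate_on_grid(x, fx):
    """Fenchel biconjugate f** on a grid: two discrete Legendre transforms.
    f** is the largest convex function lying below f."""
    s = np.linspace(-50, 50, 4001)                      # slope grid
    f_star = np.max(s[:, None] * x[None, :] - fx[None, :], axis=1)
    f_bi = np.max(s[:, None] * x[None, :] - f_star[:, None], axis=0)
    return f_bi

x = np.linspace(0, 2, 201)
fx = np.minimum((x - 0.5)**2, (x - 1.5)**2 + 0.1)       # nonconvex example
print(np.all(biconjugate_on_grid(x, fx) <= fx + 1e-9))  # True: f** <= f
```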
For large values of R, the optimal type-I error probability tends to one as n tends to infinity. In this case, the type-I strong-converse exponent [,], which is defined for a sequence of tests as the exponential rate at which one minus the type-I error probability decays to zero, measures how fast the type-I error tends to one as n tends to infinity (smaller values correspond to lower error probabilities). For a fixed R, the optimal type-I strong-converse exponent that can be asymptotically achieved under the constraint (28) is given by the smallest strong-converse exponent over all test sequences in 𝒯_R.
In [] (second part of (57)), it is shown that for R sufficiently close to I(X;Y), the optimal type-I strong-converse exponent admits a characterization in terms of J_α(X;Y).
Here, the same expression appears as in (30) and (31), but with a different set of α's to optimize over.
3.2. Distributed Task Encoding and K_α(X;Y)
The task-encoding problem studied in [] can be extended to a distributed setting as follows []: A source emits pairs of random variables (X_i, Y_i) taking values in a finite alphabet 𝒳 × 𝒴. For a fixed rate pair (R_X, R_Y) and a positive integer n, the sequences X^n = (X_1, …, X_n) and Y^n = (Y_1, …, Y_n) are described separately using ⌈2^{nR_X}⌉ and ⌈2^{nR_Y}⌉ labels, respectively. The decoder produces a list comprising all the pairs (x^n, y^n) whose descriptions match the given labels, and the goal is to minimize the ρ-th moment of the list size as n tends to infinity (for some positive ρ).
For a fixed ρ > 0, a rate pair (R_X, R_Y) is called achievable if there exists a sequence of encoders

f_n: 𝒳^n → {1, …, ⌈2^{nR_X}⌉},  g_n: 𝒴^n → {1, …, ⌈2^{nR_Y}⌉},

such that the ρ-th moment of the list size tends to one as n tends to infinity, i.e.,

lim_{n→∞} E[ |L_n(f_n(X^n), g_n(Y^n))|^ρ ] = 1,

where L_n(i, j) denotes the list of all pairs (x^n, y^n) with f_n(x^n) = i and g_n(y^n) = j.
For a memoryless source and a fixed ρ > 0, rate pairs in the interior of the region defined next are achievable, while those outside are not achievable [] (Theorem 1). The region is defined as the set of all rate pairs (R_X, R_Y) satisfying the following inequalities simultaneously:

R_X ≥ H_{1/(1+ρ)}(X),  (40)
R_Y ≥ H_{1/(1+ρ)}(Y),  (41)
R_X + R_Y ≥ H_{1/(1+ρ)}(X, Y) + K_{1/(1+ρ)}(X;Y),  (42)

where H_α(·) denotes the Rényi entropy of order α (see (45) ahead).
To better understand the role of K_α(X;Y), suppose that the sequences X^n and Y^n were allowed to be described jointly using ⌈2^{n(R_X+R_Y)}⌉ labels. Then, by [] (Theorem I.2), all rate pairs satisfying the following inequality with strict inequality would be achievable, while those not satisfying the inequality would not:

R_X + R_Y ≥ H_{1/(1+ρ)}(X, Y).  (43)
Comparing (42) and (43), we see that the measure K_{1/(1+ρ)}(X;Y) appears as a penalty term on the sum-rate constraint incurred by requiring that the sequences be described separately as opposed to jointly.
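A small numerical sketch of this comparison, assuming the region takes the reconstructed form (40)–(42) with α = 1/(1+ρ); the Rényi-entropy helper is illustrative, and the penalty K_α(X;Y) can be evaluated with the grid search from the sketch in Section 1.

```python
import numpy as np

def renyi_entropy(P, a):
    """H_a(P) = (1/(1-a)) log2 sum p^a (a > 0, a != 1), in bits."""
    p = P[P > 0]
    return np.log2(np.sum(p**a)) / (1.0 - a)

rho = 1.0
a = 1.0 / (1.0 + rho)                      # relevant Rényi order
P = np.array([[0.4, 0.1], [0.1, 0.4]])
H_joint = renyi_entropy(P.ravel(), a)      # joint-description sum rate (43)
print("joint sum-rate constraint:", H_joint, "bits per symbol")
# Separate descriptions need H_joint + K_a(X;Y) bits per symbol in total.
```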
4. Preliminaries
Throughout the paper, log denotes the base-2 logarithm; 𝒳 and 𝒴 are finite sets; P_XY denotes a joint PMF over 𝒳 × 𝒴; Q_X denotes a PMF over 𝒳; and Q_Y denotes a PMF over 𝒴. We use P and Q as generic PMFs over a finite set 𝒵. We denote by supp(P) the support of P, and by 𝒫(𝒵) the set of all PMFs over 𝒵. When clear from the context, we often omit sets and subscripts: for example, we write min_{Q_X} for min_{Q_X ∈ 𝒫(𝒳)}, Σ_x for Σ_{x∈𝒳}, P(x,y) for P_XY(x,y), and P(x) for P_X(x). Whenever a conditional probability is undefined because the conditioning probability is zero, we adopt a fixed convention for its value. We denote by 𝟙{·} the indicator function that is one if the condition is satisfied and zero otherwise. In the definitions below, we use the following conventions:

0/0 ≜ 0,  0 · ∞ ≜ 0,  log 0 ≜ −∞,  log(t/0) ≜ +∞ for t > 0,  and 0^s ≜ ∞ for s < 0.  (44)
The Rényi entropy of order α [] is defined for positive α other than one as

H_α(P) ≜ (1/(1−α)) log Σ_z P(z)^α.  (45)

For α being zero, one, or infinity, we define H_α(P) by continuous extension of (45):

H_0(P) ≜ log |supp(P)|,  (46)
H_1(P) ≜ H(P),  (47)
H_∞(P) ≜ −log max_z P(z),  (48)

where H(P) is the Shannon entropy. With this extension to α ∈ [0,∞], the Rényi entropy satisfies the following basic properties:
Proposition 3
([]). Let P be a PMF. Then,
- (i)
- For all α ∈ [0,∞], 0 ≤ H_α(P) ≤ log |𝒵|. If α ∈ (0,∞], then H_α(P) = log |𝒵| if and only if P is the uniform distribution over 𝒵.
- (ii)
- The mapping α ↦ H_α(P) is nonincreasing on [0,∞].
- (iii)
- The mapping α ↦ H_α(P) is continuous on [0,∞].
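A minimal implementation of (45)–(48), including the extensions; the demo illustrates Part (ii).

```python
import numpy as np

def renyi_entropy(P, a):
    """Rényi entropy H_a(P) in bits, with the continuous extensions
    at a = 0, 1, infinity of (46)-(48)."""
    p = np.asarray(P, dtype=float)
    p = p[p > 0]
    if a == 0:
        return np.log2(p.size)              # log cardinality of support
    if a == 1:
        return -np.sum(p * np.log2(p))      # Shannon entropy
    if np.isinf(a):
        return -np.log2(np.max(p))          # min-entropy
    return np.log2(np.sum(p**a)) / (1.0 - a)

P = np.array([0.5, 0.25, 0.25])
for a in [0, 0.5, 1, 2, np.inf]:
    print(a, renyi_entropy(P, a))  # nonincreasing in a (Proposition 3(ii))
```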
The relative entropy (or Kullback–Leibler divergence) is defined as

D(P ‖ Q) ≜ Σ_z P(z) log (P(z)/Q(z)).  (49)

The Rényi divergence of order α [,] is defined for positive α other than one as

D_α(P ‖ Q) ≜ (1/(α−1)) log Σ_z P(z)^α Q(z)^{1−α},  (50)

where we read P(z)^α Q(z)^{1−α} as P(z)^α / Q(z)^{α−1} if α > 1. For α being zero, one, or infinity, we define D_α(P ‖ Q) by continuous extension of (50):

D_0(P ‖ Q) ≜ −log Q(supp(P)),  (51)
D_1(P ‖ Q) ≜ D(P ‖ Q),  (52)
D_∞(P ‖ Q) ≜ log max_z (P(z)/Q(z)).  (53)
With this extension to , the Rényi divergence satisfies the following basic properties:
Proposition 4.
Let P and Q be PMFs over a finite set 𝒵. Then,
- (i)
- For all α ∈ [0,1), D_α(P‖Q) is finite if and only if supp(P) ∩ supp(Q) ≠ ∅. For all α ∈ [1,∞], D_α(P‖Q) is finite if and only if supp(P) ⊆ supp(Q).
- (ii)
- For all α ∈ [0,∞], D_α(P‖Q) ≥ 0. If α ∈ (0,∞], then D_α(P‖Q) = 0 if and only if P = Q.
- (iii)
- For every α ∈ (0,∞], the mapping (P,Q) ↦ D_α(P‖Q) is continuous.
- (iv)
- The mapping α ↦ D_α(P‖Q) is nondecreasing on [0,∞].
- (v)
- The mapping α ↦ D_α(P‖Q) is continuous on [0,∞].
Proof.
Part (i) follows from the definition of D_α(P‖Q) and the conventions (44), and Parts (ii)–(v) are shown in []. □
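A minimal implementation of (49)–(53), returning infinity where the divergence is infinite; the demo illustrates Part (iv).

```python
import numpy as np

def renyi_divergence(P, Q, a):
    """Rényi divergence D_a(P||Q) in bits, including the extensions
    (51)-(53) at a = 0, 1, infinity; returns np.inf where D_a is infinite."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    supp = P > 0
    if a == 0:
        return -np.log2(Q[supp].sum())          # -log Q(supp(P))
    if a >= 1 and np.any(Q[supp] == 0):
        return np.inf
    if a == 1:
        return np.sum(P[supp] * np.log2(P[supp] / Q[supp]))
    if np.isinf(a):
        return np.log2(np.max(P[supp] / Q[supp]))
    val = np.sum(P[supp]**a * Q[supp]**(1.0 - a))
    if val == 0.0:
        return np.inf                           # disjoint supports, a < 1
    return np.log2(val) / (a - 1.0)

P, Q = np.array([0.5, 0.5, 0.0]), np.array([0.7, 0.2, 0.1])
print([renyi_divergence(P, Q, a) for a in [0, 0.5, 1, 2, np.inf]])  # nondecreasing
```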
The Rényi divergence of negative order α < 0 is defined as

D_α(P ‖ Q) ≜ (1/(α−1)) log Σ_z P(z)^α Q(z)^{1−α}.  (54)

(We use negative α only in Lemma 19. More about negative orders can be found in [] (Section V). For other applications of negative orders, see [] (Proof of Theorem 1 and Example 1).)
The relative α-entropy [,] is defined for positive α other than one as

Δ_α(P ‖ Q) ≜ (α/(1−α)) log Σ_z P(z) Q(z)^{α−1} − (1/(1−α)) log Σ_z P(z)^α + log Σ_z Q(z)^α,  (55)

where we read P(z) Q(z)^{α−1} as P(z)/Q(z)^{1−α} if α < 1. The relative α-entropy appears in mismatched guessing [], mismatched source coding [] (Theorem 8), and mismatched task encoding [] (Section IV). It also arises in robust parameter estimation and constrained compression settings [] (Section II). For α being zero, one, or infinity, we define Δ_α(P ‖ Q) by continuous extension of (55); in particular,

Δ_0(P ‖ Q) ≜ log (|supp(Q)| / |supp(P)|) if supp(P) ⊆ supp(Q), and Δ_0(P ‖ Q) ≜ ∞ otherwise,  (56)
Δ_1(P ‖ Q) ≜ D(P ‖ Q),  (57)

where supp(P) denotes the support of P and |supp(P)| is the cardinality of this set. With this extension to α ∈ [0,∞], the relative α-entropy satisfies the following basic properties:
Proposition 5.
Let P and Q be PMFs over a finite set 𝒵. Then,
- (i)
- For all α ∈ [0,1], Δ_α(P‖Q) is finite if and only if supp(P) ⊆ supp(Q). For all α ∈ (1,∞), Δ_α(P‖Q) is finite if and only if supp(P) ∩ supp(Q) ≠ ∅.
- (ii)
- For all α ∈ [0,∞], Δ_α(P‖Q) ≥ 0. If α ∈ (0,∞), then Δ_α(P‖Q) = 0 if and only if P = Q.
- (iii)
- For every α ∈ (0,∞), the mapping (P,Q) ↦ Δ_α(P‖Q) is continuous.
- (iv)
- The mapping α ↦ Δ_α(P‖Q) is continuous on [0,∞].
(Part (i) differs from [] (Proposition IV.1), where the conventions for the powers of zero differ from ours. Our conventions are compatible with [,], and, as stated in Part (iii), they result in the continuity of the mapping (P,Q) ↦ Δ_α(P‖Q).)
Proof of Proposition 5.
Part (i) follows from the definition of Δ_α(P‖Q) in (55) and the conventions (44). For α ∈ (0,∞)\{1}, Part (ii) follows from [] (Proposition IV.1); for α = 1, Part (ii) holds because Δ_1(P‖Q) = D(P‖Q); and for α = 0, Part (ii) follows from the definition of Δ_0(P‖Q). Part (iii) follows from the definition of Δ_α(P‖Q), and Part (iv) follows from [] (Proposition IV.1). □
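A minimal implementation of (55), with the demo checking nonnegativity (Part (ii)) and Δ_α(P‖P) = 0.

```python
import numpy as np

def rel_alpha_entropy(P, Q, a):
    """Relative alpha-entropy Delta_a(P||Q) of (55), in bits,
    for 0 < a < infinity (a = 1 gives the relative entropy)."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    supp = P > 0
    if a == 1:
        if np.any(Q[supp] == 0):
            return np.inf
        return np.sum(P[supp] * np.log2(P[supp] / Q[supp]))
    if a < 1 and np.any(Q[supp] == 0):
        return np.inf                  # finite iff supp(P) subset of supp(Q)
    t1 = (a / (1.0 - a)) * np.log2(np.sum(P[supp] * Q[supp]**(a - 1.0)))
    t2 = -np.log2(np.sum(P[supp]**a)) / (1.0 - a)
    t3 = np.log2(np.sum(Q[Q > 0]**a))
    return t1 + t2 + t3

rng = np.random.default_rng(0)
P, Q = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))
print(rel_alpha_entropy(P, Q, 0.5) >= 0, rel_alpha_entropy(P, P, 0.5))  # True, ~0
```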
In the rest of this section, we prove some auxiliary results that we need later (Propositions 6–9). We first establish the relation between the relative α-entropy and the Rényi divergence.
Proposition 6
([] (Section V, Property 4)). Let P and Q be PMFs over a finite set 𝒵, and let α ∈ (0,1) ∪ (1,∞). Then,

Δ_α(P ‖ Q) = D_{1/α}(P̃ ‖ Q̃),

where the PMFs P̃ and Q̃ are given by

P̃(z) ≜ P(z)^α / Σ_{z′} P(z′)^α  and  Q̃(z) ≜ Q(z)^α / Σ_{z′} Q(z′)^α.
Proof.
In light of Proposition 6, J_α(X;Y) and K_α(X;Y) are related as follows:
Proposition 7.
Let P_XY be a joint PMF, and let α ∈ (0,1) ∪ (1,∞). Then,

K_α(X;Y) = J_{1/α}(X̃; Ỹ),

where the joint PMF of X̃ and Ỹ is given by

P_{X̃Ỹ}(x, y) ≜ P_XY(x, y)^α / Σ_{x′,y′} P_XY(x′, y′)^α.
Proof.
Let α ∈ (0,1) ∪ (1,∞). For fixed PMFs Q_X and Q_Y, define the transformed PMFs P_{X̃Ỹ}, Q̃_X, and Q̃_Y as in Proposition 6. By Proposition 6, Δ_α(P_XY ‖ Q_X × Q_Y) = D_{1/α}(P_{X̃Ỹ} ‖ Q̃_X × Q̃_Y). Minimizing the LHS over Q_X and Q_Y is equivalent to minimizing the RHS over Q̃_X and Q̃_Y because the mapping Q ↦ Q̃ is a bijection on the set of PMFs and because the transform of Q_X × Q_Y equals Q̃_X × Q̃_Y. □
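A numerical check of Proposition 6; the functions renyi_divergence and rel_alpha_entropy are the sketches given earlier in this section.

```python
import numpy as np

def tilt(P, a):
    """P^(a)(z) = P(z)^a / sum_z' P(z')^a."""
    t = np.asarray(P, float)**a
    return t / t.sum()

rng = np.random.default_rng(1)
P, Q = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))
for a in [0.3, 0.5, 2.0, 4.0]:
    lhs = rel_alpha_entropy(P, Q, a)
    rhs = renyi_divergence(tilt(P, a), tilt(Q, a), 1.0 / a)
    print(a, abs(lhs - rhs) < 1e-9)   # Delta_a(P||Q) = D_{1/a}(P~||Q~)
```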
The next proposition provides a characterization of the mutual information that parallels the definitions of J_α(X;Y) and K_α(X;Y). Because Δ_1(·‖·) = D_1(·‖·) = D(·‖·), this also shows that J_α(X;Y) and K_α(X;Y) reduce to the mutual information when α is one.
Proposition 8
([] (Theorem 3.4)). Let P_XY be a joint PMF. Then, for all PMFs Q_X and Q_Y,

D(P_XY ‖ Q_X × Q_Y) ≥ D(P_XY ‖ P_X × P_Y) = I(X;Y),

with equality if and only if Q_X = P_X and Q_Y = P_Y. Thus,

I(X;Y) = min_{Q_X, Q_Y} D(P_XY ‖ Q_X × Q_Y).
Proof.
The last proposition of this section is about a precursor to J_α(X;Y), namely, the minimization of D_α(P_XY ‖ Q_X × Q_Y) with respect to Q_Y only, which can be carried out explicitly. (This proposition extends [] (Equation (13)) and [] (Lemma 29).)
Proposition 9.
Let P_XY be a joint PMF, let Q_X be a PMF over 𝒳, and let α ∈ (0,1) ∪ (1,∞). Then,

min_{Q_Y} D_α(P_XY ‖ Q_X × Q_Y) = (α/(α−1)) log Σ_y [Σ_x P_XY(x,y)^α Q_X(x)^{1−α}]^{1/α},  (75, 77)

and, whenever the RHS is finite, the minimum is achieved uniquely by

Q*_Y(y) = [Σ_x P_XY(x,y)^α Q_X(x)^{1−α}]^{1/α} / Σ_{y′} [Σ_x P_XY(x,y′)^α Q_X(x)^{1−α}]^{1/α},  (76, 78)

where (75) and (76) refer to the case α ∈ (0,1), and (77) and (78) to the case α ∈ (1,∞).
Proof.
We first treat the case α ∈ (0,1). If the RHS of (75) is infinite, then the conventions (44) imply that D_α(P_XY ‖ Q_X × Q_Y) is infinite for every Q_Y, so (75) holds. Otherwise, if the RHS of (75) is finite, then the PMF Q*_Y given by (76) is well-defined, and a simple computation shows that for every Q_Y,

D_α(P_XY ‖ Q_X × Q_Y) = (α/(α−1)) log Σ_y [Σ_x P_XY(x,y)^α Q_X(x)^{1−α}]^{1/α} + D_α(Q*_Y ‖ Q_Y).  (79)

The only term on the RHS of (79) that depends on Q_Y is D_α(Q*_Y ‖ Q_Y). Because D_α(Q*_Y ‖ Q_Y) ≥ 0 with equality if and only if Q_Y = Q*_Y (Proposition 4), (79) implies (75) and (76).
The case α ∈ (1,∞) is analogous: if the RHS of (77) is infinite, then the LHS of (77) is infinite, too; and if the RHS of (77) is finite, then the PMF Q*_Y given by (78) is well-defined, and a simple computation shows that for every Q_Y, the decomposition (80) analogous to (79) holds. The only term on the RHS of (80) that depends on Q_Y is D_α(Q*_Y ‖ Q_Y). Because D_α(Q*_Y ‖ Q_Y) ≥ 0 with equality if and only if Q_Y = Q*_Y (Proposition 4), (80) implies (77) and (78). □
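A numerical check of the closed-form minimization in Proposition 9 (as reconstructed above) against a grid search; renyi_divergence is the earlier sketch.

```python
import numpy as np

def min_over_QY(P_XY, Q_X, a):
    """Closed-form min over Q_Y of D_a(P_XY || Q_X x Q_Y) and its minimizer."""
    A = np.sum(P_XY**a * Q_X[:, None]**(1.0 - a), axis=0)   # one term per y
    val = (a / (a - 1.0)) * np.log2(np.sum(A**(1.0 / a)))
    QY_star = A**(1.0 / a) / np.sum(A**(1.0 / a))
    return val, QY_star

P = np.array([[0.4, 0.1], [0.1, 0.4]])
Q_X = np.array([0.3, 0.7])
a = 2.0
val, QY_star = min_over_QY(P, Q_X, a)
grid = min(renyi_divergence(P.ravel(), np.outer(Q_X, [q, 1 - q]).ravel(), a)
           for q in np.linspace(1e-6, 1 - 1e-6, 20001))
print(val, grid)  # should agree to grid accuracy
```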
5. Two Measures of Dependence
We state the properties of J_α(X;Y) in Theorem 1 and those of K_α(X;Y) in Theorem 2. The enumeration labels in the theorems refer to the lemmas in Section 6 where the properties are proved. (The enumeration labels are not consecutive because, in order to avoid forward references in the proofs, the order of the results in Section 6 is not the same as here.)
Theorem 1.
Let X, X₁, X₂, Y, Y₁, Y₂, and Z be random variables taking values in finite sets. Then:
- (Lemma 1)
- For every α ∈ (0,∞), the minimum in the definition of J_α(X;Y) in (1) exists and is finite.
The following properties of the mutual information [] (Chapter 2) are also satisfied by J_α(X;Y):
- (Lemma 2)
- For all α ∈ [0,∞], J_α(X;Y) ≥ 0. If α ∈ (0,∞], then J_α(X;Y) = 0 if and only if X and Y are independent (nonnegativity).
- (Lemma 3)
- For all α ∈ [0,∞], J_α(X;Y) = J_α(Y;X) (symmetry).
- (Lemma 4)
- If X, Y, and Z form a Markov chain (in that order), then J_α(X;Z) ≤ J_α(X;Y) for all α ∈ [0,∞] (data-processing inequality).
- (Lemma 12)
- If the pairs (X₁, Y₁) and (X₂, Y₂) are independent, then J_α(X₁, X₂; Y₁, Y₂) = J_α(X₁; Y₁) + J_α(X₂; Y₂) for all α ∈ [0,∞] (additivity).
- (Lemma 13)
- For all α ∈ [0,∞], J_α(X;Y) ≤ log |𝒳|, with equality if and only if α ∈ [1/2,∞], X is distributed uniformly over 𝒳, and X is determined by Y (i.e., for every y with P_Y(y) > 0, there exists an x for which P_{X|Y}(x|y) = 1).
- (Lemma 14)
- For every α ∈ [0,∞], J_α(X;Y) is concave in P_X for fixed P_{Y|X}.
Moreover:
- (Lemma 5)
- J_0(X;Y) = 0.
- (Lemma 6)
- Let f: {1,…,|𝒳|} → 𝒳 and g: {1,…,|𝒴|} → 𝒴 be bijective functions, and let A be the |𝒳| × |𝒴| matrix whose Row-i Column-j entry equals √(P_XY(f(i), g(j))). Then, J_{1/2}(X;Y) = −2 log σ_max(A), where σ_max(A) denotes the largest singular value of A. (Because the singular values of a matrix are invariant under row and column permutations, the result does not depend on f or g.)
- (Lemma 7)
- J_1(X;Y) = I(X;Y).
- (Lemma 8)
- For all α ∈ (0,∞), (1−α) J_α(X;Y) = min_{R_XY} [ α D(R_XY ‖ P_XY) + (1−α) I(R_XY) ], where I(R_XY) denotes the mutual information of a pair of random variables with joint PMF R_XY. Thus, being the minimum of concave functions in α, the mapping α ↦ (1−α) J_α(X;Y) is concave on (0,∞).
- (Lemma 9)
- The mapping α ↦ J_α(X;Y) is nondecreasing on [0,∞].
- (Lemma 10)
- The mapping α ↦ J_α(X;Y) is continuous on [0,∞].
- (Lemma 11)
- If X = Y with probability one, then J_α(X;Y) = (α/(1−α)) H_∞(X) for α ∈ [0,1/2], and J_α(X;Y) = H_{α/(2α−1)}(X) for α ∈ [1/2,∞].
The minimization problem in the definition of J_α(X;Y) has the following characteristics:
- (Lemma 15)
- For every α ∈ (0,2], the mapping (Q_X, Q_Y) ↦ D_α(P_XY ‖ Q_X × Q_Y) is convex, i.e., for all λ ∈ [0,1] with λ̄ ≜ 1 − λ, all PMFs Q_X^(1), Q_X^(2) over 𝒳, and all PMFs Q_Y^(1), Q_Y^(2) over 𝒴, D_α(P_XY ‖ (λQ_X^(1) + λ̄Q_X^(2)) × (λQ_Y^(1) + λ̄Q_Y^(2))) ≤ λ D_α(P_XY ‖ Q_X^(1) × Q_Y^(1)) + λ̄ D_α(P_XY ‖ Q_X^(2) × Q_Y^(2)). For α > 2, the mapping need not be convex.
- (Lemma 16)
- Let α ∈ (0,1). If (Q*_X, Q*_Y) achieves the minimum in the definition of J_α(X;Y), then there exist positive normalization constants c and d such that

Q*_X(x) = c [Σ_y P_XY(x,y)^α Q*_Y(y)^{1−α}]^{1/α}  and  Q*_Y(y) = d [Σ_x P_XY(x,y)^α Q*_X(x)^{1−α}]^{1/α},

with the conventions of (44). The case α ∈ (1,∞) is similar: if (Q*_X, Q*_Y) achieves the minimum in the definition of J_α(X;Y), then there exist positive normalization constants c and d such that the same fixed-point equations hold, with the conventions of (44). (If α = 1, then Q*_X = P_X and Q*_Y = P_Y by Proposition 8.) Thus, for all α ∈ (0,∞), both inclusions supp(Q*_X) ⊆ supp(P_X) and supp(Q*_Y) ⊆ supp(P_Y) hold.
- (Lemma 20)
- For every α ∈ (1/2,∞), the mapping (Q_X, Q_Y) ↦ D_α(P_XY ‖ Q_X × Q_Y) has a unique minimizer. This need not be the case when α ∈ (0,1/2].
The measure J_α(X;Y) can also be expressed as follows:
- (Lemma 17)
- For all α ∈ (0,1) ∪ (1,∞), J_α(X;Y) = min_{Q_X} (α/(α−1)) log Σ_y [Σ_x P_XY(x,y)^α Q_X(x)^{1−α}]^{1/α}, and the function of Q_X being minimized is convex.
- (Lemma 18)
- For all α ∈ (0,1), J_α(X;Y) = min_{R_XY} [ I(R_XY) + (α/(1−α)) D(R_XY ‖ P_XY) ], and for all α ∈ (1,∞], J_α(X;Y) = max_{R_XY} [ I(R_XY) + (α/(1−α)) D(R_XY ‖ P_XY) ], where for α = ∞ we read α/(1−α) as −1. For every α ∈ (1,∞], the mapping R_XY ↦ I(R_XY) + (α/(1−α)) D(R_XY ‖ P_XY) is concave.
- (Lemma 19)
- For all α ∈ (0,1),

J_α(X;Y) = min_{Q_X} [ (α/(1−α)) E_0( (1−α)/α, Q_X ) − (1/(1−α)) D_{−α/(1−α)}(P_X ‖ Q_X) ],

where the minimization is over all PMFs Q_X satisfying supp(Q_X) ⊆ supp(P_X); D_α(·‖·) for negative α is given by (54); and Gallager's function E_0 [] is defined as

E_0(ρ, Q_X) ≜ −log Σ_y [ Σ_x Q_X(x) P_{Y|X}(y|x)^{1/(1+ρ)} ]^{1+ρ}.
We now move on to the properties of K_α(X;Y). Some of these properties are derived from their counterparts for J_α(X;Y) using the relation described in Proposition 7.
Theorem 2.
Let X, X₁, X₂, Y, Y₁, Y₂, and Z be random variables taking values in finite sets. Then:
- (Lemma 21)
- For every α ∈ [0,∞], the minimum in the definition of K_α(X;Y) in (2) exists and is finite.
The following properties of the mutual information are also satisfied by K_α(X;Y):
- (Lemma 22)
- For all α ∈ [0,∞], K_α(X;Y) ≥ 0. If α ∈ (0,∞), then K_α(X;Y) = 0 if and only if X and Y are independent (nonnegativity).
- (Lemma 23)
- For all α ∈ [0,∞], K_α(X;Y) = K_α(Y;X) (symmetry).
- (Lemma 34)
- If the pairs (X₁, Y₁) and (X₂, Y₂) are independent, then K_α(X₁, X₂; Y₁, Y₂) = K_α(X₁; Y₁) + K_α(X₂; Y₂) for all α ∈ [0,∞] (additivity).
- (Lemma 35)
- For all α ∈ [0,∞], K_α(X;Y) ≤ log |𝒳|.
Unlike the mutual information, K_α(X;Y) does not satisfy the data-processing inequality:
- (Lemma 36)
- There exists a Markov chain X ⊸ Y ⊸ Z for which K_2(X;Z) > K_2(X;Y).
Moreover:
- (Lemma 24)
- For all α ∈ (0,1) ∪ (1,∞),

K_α(X;Y) = min_{Q_X, Q_Y} [ log Σ_x Q_X(x)^α + log Σ_y Q_Y(y)^α − α log M^{(α−1)}(Q_X · Q_Y) ] − H_α(X,Y),

where M^{(β)}(·) is the following weighted power mean [] (Chapter III): For β ≠ 0,

M^{(β)}(g) ≜ [ Σ_{x,y} P_XY(x,y) g(x,y)^β ]^{1/β},

where for β < 0, we read g(x,y)^β as 1/g(x,y)^{−β} and use the conventions (44); and for β = 0, using the convention 0^0 ≜ 1,

M^{(0)}(g) ≜ Π_{x,y} g(x,y)^{P_XY(x,y)}.
- (Lemma 25)
- For α = 0: K_0(X;Y) = log (|supp(P_X)| |supp(P_Y)| / |supp(P_XY)|); moreover, lim_{α↓0} K_α(X;Y) exists and satisfies lim_{α↓0} K_α(X;Y) ≤ K_0(X;Y) (102), where in the RHS of (102), we use the conventions (44). The inequality can be strict, so α ↦ K_α(X;Y) need not be continuous at zero.
- (Lemma 26)
- K_1(X;Y) = I(X;Y).
- (Lemma 27)
- Let f: {1,…,|𝒳|} → 𝒳 and g: {1,…,|𝒴|} → 𝒴 be bijective functions, and let B be the |𝒳| × |𝒴| matrix whose Row-i Column-j entry equals P_XY(f(i), g(j)) / √(Σ_{x,y} P_XY(x,y)²). Then, K_2(X;Y) = −2 log σ_max(B), where σ_max(B) denotes the largest singular value of B. (Because the singular values of a matrix are invariant under row and column permutations, the result does not depend on f or g.)
- (Lemma 28)
- K_∞(X;Y) = 0.
- (Lemma 29)
- The mapping α ↦ K_α(X;Y) need not be monotonic on [0,∞].
- (Lemma 30)
- The mapping α ↦ K_α(X;Y) is nonincreasing on [1,∞].
- (Lemma 31)
- The mapping α ↦ K_α(X;Y) is continuous on (0,∞]. (See Lemma 25 for the behavior at α = 0.)
- (Lemma 32)
- If X = Y with probability one, then K_α(X;Y) = 2 H_{α/(2−α)}(X) − H_α(X) for α ∈ [0,2], and K_α(X;Y) = (α/(α−1)) H_∞(X) − H_α(X) for α ∈ [2,∞].
- (Lemma 33)
- For every α ∈ (0,2), the mapping (Q_X, Q_Y) ↦ Δ_α(P_XY ‖ Q_X × Q_Y) in the definition of K_α(X;Y) in (2) has a unique minimizer. This need not be the case when α ∈ [2,∞].
6. Proofs
In this section, we prove the properties of J_α(X;Y) and K_α(X;Y) stated in Section 5.
Lemma 1.
For every α ∈ (0,∞), the minimum in the definition of J_α(X;Y) exists and is finite.
Proof.
Let α ∈ (0,∞). Then J_α(X;Y) is finite because D_α(P_XY ‖ Q_X × Q_Y) is finite for uniform Q_X and Q_Y and because the Rényi divergence is nonnegative. The minimum exists because the set 𝒫(𝒳) × 𝒫(𝒴) is compact and the mapping (Q_X, Q_Y) ↦ D_α(P_XY ‖ Q_X × Q_Y) is continuous. □
Lemma 2.
For all α ∈ [0,∞], J_α(X;Y) ≥ 0. If α ∈ (0,∞], then J_α(X;Y) = 0 if and only if X and Y are independent (nonnegativity).
Proof.
The nonnegativity follows from the definition of J_α(X;Y) because the Rényi divergence is nonnegative for α ∈ [0,∞] (Proposition 4). If X and Y are independent, then P_XY = P_X × P_Y, and the choice Q_X = P_X and Q_Y = P_Y in the definition of J_α(X;Y) achieves D_α(P_XY ‖ Q_X × Q_Y) = 0. Conversely, if J_α(X;Y) = 0, then there exist PMFs Q_X and Q_Y satisfying D_α(P_XY ‖ Q_X × Q_Y) = 0. If, in addition, α > 0, then P_XY = Q_X × Q_Y by Proposition 4, and hence X and Y are independent. □
Lemma 3.
For all α ∈ [0,∞], J_α(X;Y) = J_α(Y;X) (symmetry).
Proof.
The definition of J_α(X;Y) is symmetric in X and Y. □
Lemma 4.
If X, Y, and Z form a Markov chain (in that order), then J_α(X;Z) ≤ J_α(X;Y) for all α ∈ [0,∞] (data-processing inequality).
Proof.
Let X, Y, and Z form a Markov chain, and let α ∈ [0,∞]. Let Q*_X and Q*_Y be PMFs that achieve the minimum in the definition of J_α(X;Y), so

J_α(X;Y) = D_α(P_XY ‖ Q*_X × Q*_Y).  (106)

Define the PMF Q*_Z as

Q*_Z(z) ≜ Σ_y P_{Z|Y}(z|y) Q*_Y(y).  (107)

(As noted in the preliminaries, we use a fixed convention for P_{Z|Y}(z|y) when P_Y(y) = 0.) We show below that

D_α(P_XZ ‖ Q*_X × Q*_Z) ≤ D_α(P_XY ‖ Q*_X × Q*_Y),  (108)

which implies the data-processing inequality because

J_α(X;Z) ≤ D_α(P_XZ ‖ Q*_X × Q*_Z) ≤ D_α(P_XY ‖ Q*_X × Q*_Y) = J_α(X;Y),

where the first inequality (109) holds by the definition of J_α(X;Z); the second inequality (110) follows from (108); and the final equality (111) follows from (106).
The proof of (108) is based on the data-processing inequality for the Rényi divergence. Define the conditional PMF

W(x, z | x′, y) ≜ 𝟙{x = x′} P_{Z|Y}(z|y).  (112)

If (X,Y) is distributed according to P_XY, then applying the channel W yields the marginal distribution

Σ_{x′,y} P_XY(x′, y) W(x, z | x′, y) = Σ_y P_XY(x, y) P_{Z|Y}(z|y) = P_XZ(x, z),

where the first step follows from (112), and the second holds because X, Y, and Z form a Markov chain. If (X,Y) is distributed according to Q*_X × Q*_Y, then applying W yields the marginal distribution

Σ_{x′,y} Q*_X(x′) Q*_Y(y) W(x, z | x′, y) = Q*_X(x) Σ_y Q*_Y(y) P_{Z|Y}(z|y) = Q*_X(x) Q*_Z(z),

where the first step follows from (112), and the second from (107). Finally, we are ready to prove (108): applying W to the pair of distributions in (106) gives

D_α(P_XZ ‖ Q*_X × Q*_Z) ≤ D_α(P_XY ‖ Q*_X × Q*_Y),

where the identification of the two marginals follows from the displays above, and the inequality follows from the data-processing inequality for the Rényi divergence [] (Theorem 9). □
Lemma 5.
J_0(X;Y) = 0.
Proof.
By Lemma 2, J_0(X;Y) ≥ 0, so it suffices to show that J_0(X;Y) ≤ 0. Let x₀ ∈ 𝒳 satisfy P_X(x₀) > 0. Define the PMF Q_X as Q_X(x) ≜ 𝟙{x = x₀} and the PMF Q_Y as Q_Y(y) ≜ P_{Y|X}(y|x₀). Then, D_0(P_XY ‖ Q_X × Q_Y) = −log (Q_X × Q_Y)(supp(P_XY)) = 0, so J_0(X;Y) ≤ 0 by the definition of J_0(X;Y). □
Lemma 6.
Let f: {1,…,|𝒳|} → 𝒳 and g: {1,…,|𝒴|} → 𝒴 be bijective functions, and let A be the |𝒳| × |𝒴| matrix whose Row-i Column-j entry equals √(P_XY(f(i), g(j))). Then,

J_{1/2}(X;Y) = −2 log σ_max(A),

where σ_max(A) denotes the largest singular value of A. (Because the singular values of a matrix are invariant under row and column permutations, the result does not depend on f or g.)
Proof.
By the definitions of and the Rényi divergence,
The claim follows from (123) because
where and are column vectors with and elements, respectively; (124) is shown below; (125) follows from the Cauchy–Schwarz inequality , which holds with equality if and are linearly dependent; and (126) holds because the spectral norm of a matrix is equal to its largest singular value [] (Example 5.6.6).
We now prove (124). Let and be vectors that satisfy , and define the PMFs and as and , where and denote the inverse functions of f and g, respectively. Then,
where (128) holds because all the entries of are nonnegative, and in (129), we changed the summation variables to and . It remains to show that equality can be achieved in (128) and (130). To that end, let and be PMFs that achieve the maximum on the RHS of (130), and define the vectors and as and . Then, , and (128) and (130) hold with equality, which proves (124). □
Lemma 7.
J_1(X;Y) = I(X;Y).
Proof.
This follows from Proposition 8 because D_1(·‖·) in the definition of J_1(X;Y) is equal to D(·‖·). □
Lemma 8.
For all α ∈ (0,∞),

(1−α) J_α(X;Y) = min_{R_XY} [ α D(R_XY ‖ P_XY) + (1−α) I(R_XY) ],  (131)

where I(R_XY) denotes the mutual information of a pair of random variables with joint PMF R_XY. Thus, being the minimum of concave functions in α, the mapping α ↦ (1−α) J_α(X;Y) is concave on (0,∞).
Proof.
For , (131) holds because with equality if . For ,
where (132) holds by the definition of ; (133) follows from [] (Theorem 30); and (134) follows from Proposition 8 after swapping the minima.
For , define the sets
Then,
where (137) follows from the definition of because and because the mapping is continuous; (138) follows from [] (Theorem 30); (139) follows from a minimax theorem and is justified below; and (140) follows from Proposition 8, a continuity argument, and the observation that is infinite if .
We now verify the conditions of Ky Fan’s minimax theorem [] (Theorem 2), which will establish (139). (We use Ky Fan’s minimax theorem because it does not require that the set be compact, and having a noncompact set helps to guarantee that the function f defined next takes on finite values only. A brief proof of Ky Fan’s minimax theorem appears in [].) Let the function be defined by the expression in square brackets in (139), i.e.,
We check that
- (i)
- the sets and are convex;
- (ii)
- the set is compact;
- (iii)
- the function f is real-valued;
- (iv)
- for every , the function f is continuous in ;
- (v)
- for every , the function f is convex in ; and
- (vi)
- for every , the function f is concave in the pair .
Indeed, Parts (i) and (ii) are easy to see; Part (iii) holds because both relative entropies on the RHS of (141) are finite by our definitions of and ; and to show Parts (iv)–(vi), we rewrite f as:
From (142), we see that Part (iv) holds by our definitions of and ; Part (v) holds because the entropy is a concave function (so is convex), because linear functionals of are convex, and because the sum of convex functions is convex; and Part (vi) holds because the logarithm is a concave function and because a nonnegative weighted sum of concave functions is concave. (In Ky Fan’s theorem, weaker conditions than Parts (i)–(vi) are required, but it is not difficult to see that Parts (i)–(vi) are sufficient.)
The last claim, namely, that the mapping is concave on , is true because the expression in square brackets on the RHS of (131) is concave in for every and because the pointwise minimum preserves the concavity. □
Lemma 9.
The mapping α ↦ J_α(X;Y) is nondecreasing on [0,∞].
Proof.
This is true because for every α₁, α₂ ∈ [0,∞] with α₁ ≤ α₂,

J_{α₁}(X;Y) ≤ D_{α₁}(P_XY ‖ Q*_X × Q*_Y) ≤ D_{α₂}(P_XY ‖ Q*_X × Q*_Y) = J_{α₂}(X;Y),

where Q*_X and Q*_Y achieve the minimum in the definition of J_{α₂}(X;Y), and where the middle inequality holds because the Rényi divergence is nondecreasing in α (Proposition 4). □
Lemma 10.
The mapping α ↦ J_α(X;Y) is continuous on [0,∞].
Proof.
By Lemma 8, the mapping α ↦ (1−α) J_α(X;Y) is concave on (0,∞), thus it is continuous on (0,∞), which implies that α ↦ J_α(X;Y) is continuous on (0,∞) except possibly at α = 1.
We next prove the continuity at α = 0. Let Q*_X and Q*_Y be PMFs that achieve the minimum in the definition of J_0(X;Y). Then, for all α ∈ (0,1),

J_0(X;Y) ≤ J_α(X;Y) ≤ D_α(P_XY ‖ Q*_X × Q*_Y),

where the first inequality (145) holds because J_α(X;Y) is nondecreasing in α (Lemma 9), and the second (146) holds by the definition of J_α(X;Y) as a minimum. The Rényi divergence is continuous in α (Proposition 4), so (144)–(146) and the sandwich theorem imply that α ↦ J_α(X;Y) is continuous at zero.
We continue with the continuity at . Define
Then, for all ,
where (148) holds because is nondecreasing (Lemma 9), and (149) and (152) hold by the definitions of and the Rényi divergence. The RHS of (152) tends to as tends to infinity, so is continuous at by the sandwich theorem.
It remains to show the continuity at . Let , and let . Then, for all PMFs and ,
where (153) holds because and because the Rényi divergence is nondecreasing in (Proposition 4); (156) follows from the Cauchy–Schwarz inequality; and (157) holds because
where (159) follows from the Cauchy–Schwarz inequality, and (161) holds because and because the Rényi divergence is nonnegative for positive orders (Proposition 4). Thus, for all ,
where (162) follows from (158) if and from Proposition 8 if ; and (164) holds by the definition of . The Rényi divergence is continuous in (Proposition 4), thus (162)–(164) and the sandwich theorem imply that is continuous at . □
Lemma 11.
If X = Y with probability one, then

J_α(X;Y) = (α/(1−α)) H_∞(X) for α ∈ [0,1/2],  and  J_α(X;Y) = H_{α/(2α−1)}(X) for α ∈ [1/2,∞].  (165)
Proof.
We show below that (165) holds for . Thus, (165) holds also for because both its sides are continuous in : its LHS by Lemma 10, and its RHS by the continuity of the Rényi entropy (Proposition 3).
First consider the case . Define . Then, for all ,
where (171) holds because is a PMF. Because , Proposition 4 implies that with equality if . This together with (168) and (172) establishes (165).
Now consider the case . For all ,
where (173) holds because for all and because . The inequalities (173) and (174) both hold with equality when , where is such that . Thus,
Now (165) follows:
where (177) follows from (168); (178) holds because ; (179) follows from (176); and (180) follows from the definition of . □
Lemma 12.
If the pairs (X₁, Y₁) and (X₂, Y₂) are independent, then J_α(X₁, X₂; Y₁, Y₂) = J_α(X₁; Y₁) + J_α(X₂; Y₂) for all α ∈ [0,∞] (additivity).
Proof.
Let the pairs and be independent. For , we establish the lemma by showing the following two inequalities:
Because is continuous in (Lemma 10), this will also establish the lemma for .
To show (181), let and be PMFs that achieve the minimum in the definition of , and let and be PMFs that achieve the minimum in the definition of , so
Then, (181) holds because
where (185) holds by the definition of as a minimum; (186) follows from a simple computation using the independence hypothesis ; and (187) follows from (183) and (184).
To establish (182), we consider the cases and separately, starting with . Let and be PMFs that achieve the minimum in the definition of , so
Define the function as
and let be such that
Define the PMFs and as
Then,
where (193) follows from (188); (194) holds by the independence hypothesis ; (195) follows from (189); (196) follows from (190); and (197) follows from (191) and (192). Taking the logarithm and multiplying by establishes (182):
where (199) holds by the definition of and .
Lemma 13.
For all α ∈ [0,∞], J_α(X;Y) ≤ log |𝒳|, with equality if and only if α ∈ [1/2,∞], X is distributed uniformly over 𝒳, and X is determined by Y (i.e., for every y with P_Y(y) > 0, there exists an x for which P_{X|Y}(x|y) = 1).
Proof.
Throughout the proof, define . We first show that for all :
where (200) follows from the data-processing inequality (Lemma 4) because form a Markov chain; (201) holds because is nondecreasing in (Lemma 9); (202) follows from Lemma 11; and (203) follows from Proposition 3.
We next show that J_α(X;Y) = log |𝒳| if and only if the following three conditions hold:
- (i)
- α ∈ [1/2, ∞];
- (ii)
- X is distributed uniformly over 𝒳; and
- (iii)
- X is determined by Y, i.e., for every y ∈ 𝒴 with P_Y(y) > 0, there exists an x ∈ 𝒳 for which P_{X|Y}(x|y) = 1.
Indeed, if α ∈ [0,1/2), then Lemma 11 implies that

J_α(X;X) = (α/(1−α)) H_∞(X).  (204)

Because α/(1−α) < 1 for such α's and because H_∞(X) ≤ log |𝒳| (Proposition 3), the RHS of (204) is strictly smaller than log |𝒳|. This, together with (200), shows that Part (i) is a necessary condition. The necessity of Part (ii) follows from (203): if X is not distributed uniformly over 𝒳, then (203) holds with strict inequality (Proposition 3). As to the necessity of Part (iii),
where (205) holds because is nondecreasing in (Lemma 9); (207) follows from Proposition 9; and (208) follows from choosing to be the uniform distribution. The inequality (210) is strict when Part (iii) does not hold, so Part (iii) is a necessary condition.
It remains to show that when Parts (i)–(iii) all hold, . By (203), always holds, so it suffices to show that Parts (i)–(iii) together imply . Indeed,
where (211) holds because Part (i) implies that and because is nondecreasing in (Lemma 9); (212) follows from the data-processing inequality (Lemma 4) because Part (iii) implies that form a Markov chain; (213) follows from Lemma 11; and (214) follows from Part (ii). □
Lemma 14.
For every α ∈ [0,∞], J_α(X;Y) is concave in P_X for fixed P_{Y|X}.
Proof.
We prove the claim for ; for the claim will then hold because is continuous in (Lemma 10).
Fix . Let with , let and be PMFs, let be a conditional PMF, and define as
Denoting by ,
where (217) follows from Proposition 9 with the roles of and swapped; (220) holds because is concave; (221) holds because optimizing separately cannot be worse than optimizing a common ; and (222) can be established using steps similar to (216)–(218). □
Lemma 15.
For every α ∈ (0,2], the mapping (Q_X, Q_Y) ↦ D_α(P_XY ‖ Q_X × Q_Y) is convex, i.e., for all λ ∈ [0,1] with λ̄ ≜ 1 − λ, all PMFs Q_X^(1), Q_X^(2) over 𝒳, and all PMFs Q_Y^(1), Q_Y^(2) over 𝒴,

D_α(P_XY ‖ (λQ_X^(1) + λ̄Q_X^(2)) × (λQ_Y^(1) + λ̄Q_Y^(2))) ≤ λ D_α(P_XY ‖ Q_X^(1) × Q_Y^(1)) + λ̄ D_α(P_XY ‖ Q_X^(2) × Q_Y^(2)).  (223)

For α > 2, the mapping need not be convex.
Proof.
We establish (223) for and for , which also establishes (223) for because the Rényi divergence is continuous in (Proposition 4). Afterwards, we provide an example where (223) is violated for all .
We begin with the case where :
where (225) follows from the arithmetic mean-geometric mean inequality; (227) follows from the Cauchy–Schwarz inequality; and (228) and (229) hold because the mapping is concave on for . Taking the logarithm and multiplying by establishes (223).
Lemma 16.
Let α ∈ (0,1). If (Q*_X, Q*_Y) achieves the minimum in the definition of J_α(X;Y), then there exist positive normalization constants c and d such that

Q*_X(x) = c [Σ_y P_XY(x,y)^α Q*_Y(y)^{1−α}]^{1/α}  (237)
Q*_Y(y) = d [Σ_x P_XY(x,y)^α Q*_X(x)^{1−α}]^{1/α},  (238)

with the conventions of (44). The case α ∈ (1,∞) is similar: if (Q*_X, Q*_Y) achieves the minimum in the definition of J_α(X;Y), then there exist positive normalization constants c and d such that the same fixed-point equations (239) and (240) hold, with the conventions of (44). (If α = 1, then Q*_X = P_X and Q*_Y = P_Y by Proposition 8.) Thus, for all α ∈ (0,∞), both inclusions supp(Q*_X) ⊆ supp(P_X) and supp(Q*_Y) ⊆ supp(P_Y) hold.
Proof.
If (Q*_X, Q*_Y) achieves the minimum in the definition of J_α(X;Y), then

D_α(P_XY ‖ Q*_X × Q*_Y) = min_{Q_Y} D_α(P_XY ‖ Q*_X × Q_Y).

Hence, (238) and (240) follow from (76) and (78) of Proposition 9 because J_α(X;Y) is finite. Swapping the roles of X and Y establishes (237) and (239). For α ∈ (0,1), the claimed inclusions follow from (237) and (238); for α ∈ (1,∞), from (239) and (240); and for α = 1, from Proposition 8. □
Lemma 17.
For all α ∈ (0,1) ∪ (1,∞), J_α(X;Y) = min_{Q_X} (α/(α−1)) log Σ_y [Σ_x P_XY(x,y)^α Q_X(x)^{1−α}]^{1/α}, and the function of Q_X being minimized is convex.
Proof.
We first establish (242) and (244)–(246): (242) follows from the definition of ; (244) and (246) follow from Proposition 9; and (245) holds because
where (247) follows from a simple computation, and (248) holds because with equality if .
We now show that the mapping is convex for every . To that end, let , let with , and let . Let and be PMFs that achieve the minimum in the definitions of and , respectively. Then,
where (249) holds by the definition of ; (250) holds because is convex in the pair for (Lemma 15); and (251) follows from our choice of and .
Lemma 18.
For all α ∈ (0,1),

J_α(X;Y) = min_{R_XY} [ I(R_XY) + (α/(1−α)) D(R_XY ‖ P_XY) ],  (253)

and for all α ∈ (1,∞], the same holds with the minimum replaced by a maximum, where

for α = ∞ we read α/(1−α) as −1.  (254)

For every α ∈ (1,∞], the mapping R_XY ↦ I(R_XY) + (α/(1−α)) D(R_XY ‖ P_XY) is concave.
Proof.
For , (253) follows from Lemma 8 by dividing by , which is positive or negative depending on whether is smaller than or greater than one. For , we establish (253) as follows: By Lemma 10, its LHS is continuous at . We argue below that its RHS is continuous at , i.e., that
Because (253) holds for and because both its sides are continuous at , it must also hold for .
We now establish (255). Let be a PMF that achieves the maximum on the RHS of (255). Then, for all ,
where (257) holds because, by (254), for all . By (254), is continuous at , so the RHS of (258) approaches as tends to infinity, and (255) follows from the sandwich theorem.
We now show that is concave for . A simple computation reveals that for all ,
Because the entropy is a concave function and because a nonnegative weighted sum of concave functions is concave, this implies that is concave in for . By (254), is continuous at , so is concave in also for .
We next show that if and , then . Let , and let be a PMF that satisfies . Then,
where (260) follows from (253), and (261) holds by the definition of . Because is equal to , both inequalities hold with equality, which implies the claim.
Finally, we show that if and , then . We first consider . Let be a PMF that satisfies , and let and be PMFs that achieve the minimum in the definition of . Then,
where (264) follows from Proposition 8, and (265) follows from [] (Theorem 30). Thus, all inequalities hold with equality. Because (264) holds with equality, and by Proposition 8. Hence, as desired. We now consider . Here, (262)–(266) remain valid after replacing by . (Now, (265) follows from a short computation.) Consequently, holds also for .
Lemma 19.
For all α ∈ (0,1),

J_α(X;Y) = min_{Q_X} [ (α/(1−α)) E_0( (1−α)/α, Q_X ) − (1/(1−α)) D_{−α/(1−α)}(P_X ‖ Q_X) ],  (267)

where the minimization is over all PMFs Q_X satisfying supp(Q_X) ⊆ supp(P_X); D_α(·‖·) for negative α is given by (54); and Gallager's function E_0 [] is defined as

E_0(ρ, Q_X) ≜ −log Σ_y [ Σ_x Q_X(x) P_{Y|X}(y|x)^{1/(1+ρ)} ]^{1+ρ}.  (268)
Proof.
Let , and define the set . We establish (267) by showing that for all ,
with equality for some .
Fix . If the LHS of (269) is infinite, then (269) holds trivially. Otherwise, define the PMF as
where we use the convention that . (The RHS of (270) is finite whenever the LHS of (269) is finite.) Then, (269) holds because
where (271) follows from Lemma 17, and (273) follows from (270) using some algebra. It remains to show that there exists an for which (272) holds with equality. To that end, let be a PMF that achieves the minimum on the RHS of (271), and define the PMF as
where we use the convention that . Because (Lemma 16), the definitions (275) and (270) imply that . Hence, (272) holds with equality for this .
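A minimal implementation of Gallager's function (268); the channel matrix and input PMF in the demo are arbitrary illustrative choices.

```python
import numpy as np

def gallager_E0(rho, Q_X, W):
    """Gallager's function E_0(rho, Q_X) for channel W(y|x), in bits:
    -log2 sum_y ( sum_x Q_X(x) W(y|x)^{1/(1+rho)} )^{1+rho}."""
    inner = np.sum(Q_X[:, None] * W**(1.0 / (1.0 + rho)), axis=0)
    return -np.log2(np.sum(inner**(1.0 + rho)))

W = np.array([[0.9, 0.1], [0.2, 0.8]])   # rows: W(.|x)
Q = np.array([0.5, 0.5])
print(gallager_E0(0.5, Q, W))
```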
Lemma 20.
For every α ∈ (1/2,∞), the mapping (Q_X, Q_Y) ↦ D_α(P_XY ‖ Q_X × Q_Y) has a unique minimizer. This need not be the case when α ∈ (0,1/2].
Proof.
First consider . Let and be pairs of PMFs that both minimize . We establish uniqueness by arguing that and must be identical. Observe that
where (276) holds by the definition of , and (277) follows from Lemma 15. Hence, (277) holds with equality, which implies that (228) in the proof of Lemma 15 holds with equality, i.e.,
We first argue that . Since and are PMFs, it suffices to show that for every . Let . Because (Lemma 16), there exists a such that . Again by Lemma 16, this implies that . Because the mapping is strictly concave on for , it follows from (279) that . Swapping the roles of and , we obtain that .
For , the minimizer is unique by Proposition 8 because .
Now consider . Here, we establish uniqueness via the characterization of provided by Lemma 18. Let be defined as in Lemma 18. Let be a PMF that satisfies , and let be a pair of PMFs that minimizes . If , then (264) in the proof of Lemma 18 holds with equality, i.e.,
Because the LHS of (280) is finite, Proposition 8 implies that and , thus the minimizer is unique. As shown in the proof of Lemma 18, (280) remains valid for after replacing by , thus the same argument establishes the uniqueness for .
Finally, we show that, for , the mapping can have more than one minimizer. Let X be uniformly distributed over , and let . Then, for all ,
where (281) follows from Lemma 11. □
Lemma 21.
For every α ∈ [0,∞], the minimum in the definition of K_α(X;Y) in (2) exists and is finite.
Proof.
Let α ∈ [0,∞], and denote by Q_X^u and Q_Y^u the uniform distributions over 𝒳 and 𝒴, respectively. Then K_α(X;Y) is finite because Δ_α(P_XY ‖ Q_X^u × Q_Y^u) is finite and because the relative α-entropy is nonnegative (Proposition 5). For α ∈ (0,∞), the minimum exists because the set 𝒫(𝒳) × 𝒫(𝒴) is compact and the mapping (Q_X, Q_Y) ↦ Δ_α(P_XY ‖ Q_X × Q_Y) is continuous. For α ∈ {0,∞}, the minimum exists because Δ_α takes on only a finite number of values: if α = 0, then Δ_0(P_XY ‖ Q_X × Q_Y) depends on (Q_X, Q_Y) only via supp(Q_X) and supp(Q_Y); and if α = ∞, then Δ_∞(P_XY ‖ Q_X × Q_Y) depends on (Q_X, Q_Y) only via the sets where Q_X and Q_Y attain their maxima. □
Lemma 22.
For all α ∈ [0,∞], K_α(X;Y) ≥ 0. If α ∈ (0,∞), then K_α(X;Y) = 0 if and only if X and Y are independent (nonnegativity).
Proof.
The nonnegativity follows from the definition of K_α(X;Y) because the relative α-entropy is nonnegative for α ∈ [0,∞] (Proposition 5). If X and Y are independent, then P_XY = P_X × P_Y, and the choice Q_X = P_X and Q_Y = P_Y in the definition of K_α(X;Y) achieves Δ_α(P_XY ‖ Q_X × Q_Y) = 0. Conversely, if K_α(X;Y) = 0, then there exist PMFs Q_X and Q_Y satisfying Δ_α(P_XY ‖ Q_X × Q_Y) = 0. If, in addition, α ∈ (0,∞), then P_XY = Q_X × Q_Y by Proposition 5, and hence X and Y are independent. □
Lemma 23.
For all α ∈ [0,∞], K_α(X;Y) = K_α(Y;X) (symmetry).
Proof.
The definition of K_α(X;Y) is symmetric in X and Y. □
Lemma 24.
For all α ∈ (0,1) ∪ (1,∞),

K_α(X;Y) = min_{Q_X, Q_Y} [ log Σ_x Q_X(x)^α + log Σ_y Q_Y(y)^α − α log M^{(α−1)}(Q_X · Q_Y) ] − H_α(X,Y),

where M^{(β)}(·) is the following weighted power mean [] (Chapter III): For β ≠ 0,

M^{(β)}(g) ≜ [ Σ_{x,y} P_XY(x,y) g(x,y)^β ]^{1/β},

where for β < 0, we read g(x,y)^β as 1/g(x,y)^{−β} and use the conventions (44); and for β = 0, using the convention 0^0 ≜ 1,

M^{(0)}(g) ≜ Π_{x,y} g(x,y)^{P_XY(x,y)}.
Proof.
Lemma 25.
For α = 0:

K_0(X;Y) = log (|supp(P_X)| |supp(P_Y)| / |supp(P_XY)|),  (291)

and

lim_{α↓0} K_α(X;Y) ≤ K_0(X;Y),  (292)

where in the RHS of (292), we use the conventions (44). The inequality can be strict, so α ↦ K_α(X;Y) need not be continuous at zero.
Proof.
We first prove (291). Recall that K_0(X;Y) = min_{Q_X,Q_Y} Δ_0(P_XY ‖ Q_X × Q_Y). Observe that Δ_0(P_XY ‖ Q_X × Q_Y) is finite only if supp(P_X) ⊆ supp(Q_X) and supp(P_Y) ⊆ supp(Q_Y). For such PMFs Q_X and Q_Y, we have |supp(Q_X × Q_Y)| ≥ |supp(P_X)| |supp(P_Y)|. Thus, for all PMFs Q_X and Q_Y,

Δ_0(P_XY ‖ Q_X × Q_Y) ≥ log (|supp(P_X)| |supp(P_Y)| / |supp(P_XY)|).  (295)

Choosing Q_X and Q_Y with supp(Q_X) = supp(P_X) and supp(Q_Y) = supp(P_Y) achieves equality in (295), which establishes (291).
We now show (292). Let Q_X^u and Q_Y^u be the uniform distributions over supp(P_X) and supp(P_Y), respectively. Then, for all α ∈ (0,1), K_α(X;Y) ≤ Δ_α(P_XY ‖ Q_X^u × Q_Y^u), and the RHS tends to Δ_0(P_XY ‖ Q_X^u × Q_Y^u) = K_0(X;Y) as α ↓ 0, and hence (292) holds.
We next establish (293). To that end, define
We bound as follows: For all ,
where (298) follows from Lemma 24. Similarly, for all ,
where (302) is the same as (298). Now (293) follows from (301), (304), and the sandwich theorem because and because (Proposition 3).
Finally, we provide an example for which (292) holds with strict inequality. Let , let , and let be uniformly distributed over . The LHS of (292) then equals . Using
we see that the RHS of (292) is upper bounded by , which is smaller than . □
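A numerical illustration of Lemma 25 (as reconstructed): K_0(X;Y) follows from support counts, while K_α(X;Y) for small α can be evaluated with the grid search from the sketch in Section 1; whether the limit is strictly smaller depends on the PMF.

```python
import numpy as np

P = np.array([[0.2, 0.0], [0.0, 0.8]])     # X Bernoulli, Y = X
supp = lambda v: int(np.sum(v > 0))
K0 = np.log2(supp(P.sum(1)) * supp(P.sum(0)) / supp(P.ravel()))
print("K_0 =", K0)                          # log2(2*2/2) = 1 bit
for a in [0.2, 0.1, 0.05]:                  # approach alpha -> 0
    print(a, min_over_products(P, a, rel_alpha_ent))
```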
Lemma 26.
K_1(X;Y) = I(X;Y).
Proof.
The claim follows from Proposition 8 because Δ_1(·‖·) in the definition of K_1(X;Y) is equal to D(·‖·). □
Lemma 27.
Let f: {1,…,|𝒳|} → 𝒳 and g: {1,…,|𝒴|} → 𝒴 be bijective functions, and let B be the |𝒳| × |𝒴| matrix whose Row-i Column-j entry equals P_XY(f(i), g(j)) / √(Σ_{x,y} P_XY(x,y)²). Then,

K_2(X;Y) = −2 log σ_max(B),

where σ_max(B) denotes the largest singular value of B. (Because the singular values of a matrix are invariant under row and column permutations, the result does not depend on f or g.)
Proof.
This follows from Proposition 7 and Lemma 6 because K_2(X;Y) = J_{1/2}(X̃;Ỹ) and because the matrix associated with P_{X̃Ỹ} in Lemma 6 equals B. □
Lemma 28.
K_∞(X;Y) = 0.
Proof.
Let the pair (x₀, y₀) be such that P_XY(x₀, y₀) = max_{x,y} P_XY(x,y), and define the PMFs Q_X and Q_Y as Q_X(x) ≜ 𝟙{x = x₀} and Q_Y(y) ≜ 𝟙{y = y₀}. Then, Δ_∞(P_XY ‖ Q_X × Q_Y) = 0, so K_∞(X;Y) ≤ 0. Because K_∞(X;Y) ≥ 0 (Lemma 22), this implies K_∞(X;Y) = 0. □
Lemma 29.
The mapping α ↦ K_α(X;Y) need not be monotonic on [0,∞].
Proof.
Let P_XY be such that supp(P_XY) = 𝒳 × 𝒴 and I(X;Y) > 0. Then,

K_0(X;Y) = 0,  K_1(X;Y) = I(X;Y) > 0,  and  K_∞(X;Y) = 0,

which follow from Lemmas 25, 26, and 28, respectively. Thus, α ↦ K_α(X;Y) is not monotonic on [0,∞]. □
Lemma 30.
The mapping α ↦ K_α(X;Y) is nonincreasing on [1,∞].
Proof.
We first show the monotonicity for . To that end, let with , and let be defined as in (285) and (286). Then, for all PMFs and ,
which follows from the power mean inequality [] (III 3.1.1 Theorem 1) because . Hence,
where (318) and (320) follow from Lemma 24, and (319) follows from (317).
The monotonicity extends to because
where (321) follows from Lemma 25, and (322) holds because is continuous at (Proposition 3).
Lemma 31.
The mapping α ↦ K_α(X;Y) is continuous on (0,∞]. (See Lemma 25 for the behavior at α = 0.)
Proof.
Because is continuous on (Proposition 3), it suffices to show that the mapping is continuous on . We first show that it is continuous on by showing that is concave and hence continuous on . For a fixed , let be distributed according to the joint PMF
Then, for all ,
where (327) follows from Proposition 7; (328) follows from Lemma 8; and (329) follows from a short computation. For every , the expression in square brackets on the RHS of (329) is concave in because the mapping is concave on and because and are nonnegative. The pointwise minimum preserves the concavity, thus the LHS of (327) is concave in and hence continuous in . This implies that and hence is continuous on .
We now establish continuity at . Let be such that ; define the PMFs and as and ; and let be defined as in (285). Then, for all ,
where (330) holds because is nonincreasing in (Lemma 30); (331) follows from Lemma 24; (332) follows from the definitions of in (285) and in (46); and (333) holds because (Lemma 28). Because , (330)–(333) and the sandwich theorem imply that is continuous at . This and the continuity of at (Proposition 3) establish the continuity of at .
It remains to show the continuity at . Let , and define . (These definitions ensure that on the RHS of (340) ahead, will be positive.) Let be defined as in (285) and (286). Then, for all PMFs and ,
where (334) follows from the power mean inequality [] (III 3.1.1 Theorem 1) because ; (336) follows from the Cauchy–Schwarz inequality; and (337) holds because
where (339) follows from the Cauchy–Schwarz inequality, and (341) holds because and because the Rényi divergence is nonnegative for positive orders (Proposition 4). Thus, for all ,
where (342) follows from (338) if and from Proposition 8 and a simple computation if . By Lemma 24, this implies that for all ,
Because is continuous at [] (III 1 Theorem 2(b)), (344)–(345) and the sandwich theorem imply that is continuous at . This and the continuity of at (Proposition 3) establish the continuity of at . □
Lemma 32.
If X = Y with probability one, then

K_α(X;Y) = 2 H_{α/(2−α)}(X) − H_α(X) for α ∈ [0,2],  and  K_α(X;Y) = (α/(α−1)) H_∞(X) − H_α(X) for α ∈ [2,∞].  (346)
Proof.
We first treat the cases α = 0, α = 1, and α = ∞. For α = 0, (346) holds because K_0(X;Y) = log (|supp(P_X)| |supp(P_Y)| / |supp(P_XY)|) by Lemma 25, and because the hypothesis X = Y implies that |supp(P_XY)| = |supp(P_X)| = |supp(P_Y)|, so K_0(X;Y) = log |supp(P_X)| = H_0(X). For α = 1, (346) holds because K_1(X;Y) = I(X;Y) (Lemma 26) and because X = Y implies I(X;Y) = H(X). For α = ∞, (346) holds because K_∞(X;Y) = 0 (Lemma 28).
Now let , and let be distributed according to the joint PMF
where (351) holds because for all and all . If , then (346) holds because
where (352) follows from Proposition 7; (353) follows from Lemma 11 because and because ; and (355) follows from a simple computation. If , then (346) holds because
where (356) follows from Proposition 7; (357) follows from Lemma 11 because and because ; and (359) follows from a simple computation. □
Lemma 33.
For every α ∈ (0,2), the mapping (Q_X, Q_Y) ↦ Δ_α(P_XY ‖ Q_X × Q_Y) in the definition of K_α(X;Y) in (2) has a unique minimizer. This need not be the case when α ∈ [2,∞].
Proof.
Let α ∈ (0,2). By Proposition 7, K_α(X;Y) = J_{1/α}(X̃;Ỹ), where the pair (X̃, Ỹ) is distributed according to the joint PMF P_{X̃Ỹ} defined in Proposition 7. The mapping in the definition of J_{1/α}(X̃;Ỹ) has a unique minimizer by Lemma 20 because 1/α > 1/2. By Proposition 6, there is a bijection between the minimizers of Δ_α(P_XY ‖ Q_X × Q_Y) and of D_{1/α}(P_{X̃Ỹ} ‖ Q̃_X × Q̃_Y), so the mapping (Q_X, Q_Y) ↦ Δ_α(P_XY ‖ Q_X × Q_Y) also has a unique minimizer.
We next show that for α ∈ [2,∞], the mapping can have more than one minimizer. Let X be uniformly distributed over 𝒳 with |𝒳| ≥ 2, and let Y = X. Then, by Lemma 32,
If , then it follows from the definition of in (56) that whenever , so the minimizer is not unique. Otherwise, if , it can be verified that
so the minimizer is not unique in this case either. □
Lemma 34.
If the pairs (X₁, Y₁) and (X₂, Y₂) are independent, then K_α(X₁, X₂; Y₁, Y₂) = K_α(X₁; Y₁) + K_α(X₂; Y₂) for all α ∈ [0,∞] (additivity).
Proof.
This follows from Proposition 7 and Lemma 12: the transformation in Proposition 7 maps the product PMF P_{X₁Y₁} × P_{X₂Y₂} to P_{X̃₁Ỹ₁} × P_{X̃₂Ỹ₂}, so K_α(X₁, X₂; Y₁, Y₂) = J_{1/α}(X̃₁, X̃₂; Ỹ₁, Ỹ₂) = J_{1/α}(X̃₁; Ỹ₁) + J_{1/α}(X̃₂; Ỹ₂) = K_α(X₁; Y₁) + K_α(X₂; Y₂) for α ∈ (0,1) ∪ (1,∞); the remaining orders follow from Lemmas 25, 26, and 28. □
Lemma 35.
For all α ∈ [0,∞], K_α(X;Y) ≤ log |𝒳|.
Proof.
This follows from Proposition 7 and Lemma 13 because K_α(X;Y) = J_{1/α}(X̃;Ỹ) ≤ log |𝒳| for α ∈ (0,1) ∪ (1,∞); for α ∈ {0, 1, ∞}, the claim follows from Lemmas 25, 26, and 28, respectively. □
Lemma 36.
There exists a Markov chain X ⊸ Y ⊸ Z for which K_2(X;Z) > K_2(X;Y).
Proof.
Let the Markov chain X ⊸ Y ⊸ Z be given by a suitable joint PMF on binary alphabets. Using Lemma 27, one can compute K_2(X;Y) and K_2(X;Z) in closed form and verify that K_2(X;Z) is larger than K_2(X;Y). □
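The specific PMF of the original example is not reproduced here, but such Markov chains can be found by random search; the following sketch, which reuses rel_alpha_ent and min_over_products from the sketch in Section 1, is an illustrative reconstruction, not the authors' example.

```python
import numpy as np

rng = np.random.default_rng(3)
a = 2.0
found = False
for _ in range(50):
    P_XY = rng.dirichlet(np.ones(4)).reshape(2, 2)    # joint PMF of (X, Y)
    W = rng.dirichlet(np.ones(2), size=2)             # channel P_{Z|Y}
    P_XZ = P_XY @ W                                   # Markov chain X-Y-Z
    K_XY = min_over_products(P_XY, a, rel_alpha_ent, grid=40)
    K_XZ = min_over_products(P_XZ, a, rel_alpha_ent, grid=40)
    if K_XZ > K_XY + 0.02:                            # margin beats grid error
        print("DPI violated:", K_XY, K_XZ)
        found = True
        break
if not found:
    print("no violation found in 50 random draws")
```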
| 0 | ||
| 0 |
| 0 | 1 |
Author Contributions
Writing–original draft preparation, A.L. and C.P.; writing–review and editing, A.L. and C.P.
Funding
This research received no external funding.
Conflicts of Interest
The authors declare no conflict of interest.
References
- Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef]
- Tomamichel, M.; Hayashi, M. Operational interpretation of Rényi information measures via composite hypothesis testing against product and Markov distributions. IEEE Trans. Inf. Theory 2018, 64, 1064–1082. [Google Scholar] [CrossRef]
- Sibson, R. Information radius. Z. Wahrscheinlichkeitstheorie verw. Geb. 1969, 14, 149–160. [Google Scholar] [CrossRef]
- Arimoto, S. Information measures and capacity of order α for discrete memoryless channels. In Topics in Information Theory; Csiszár, I., Elias, P., Eds.; North-Holland Publishing Company: Amsterdam, The Netherlands, 1977; pp. 41–52. ISBN 0-7204-0699-4. [Google Scholar]
- Csiszár, I. Generalized cutoff rates and Rényi’s information measures. IEEE Trans. Inf. Theory 1995, 41, 26–34. [Google Scholar] [CrossRef]
- Fehr, S.; Berens, S. On the conditional Rényi entropy. IEEE Trans. Inf. Theory 2014, 60, 6801–6810. [Google Scholar] [CrossRef]
- Sason, I.; Verdú, S. Arimoto–Rényi conditional entropy and Bayesian M-ary hypothesis testing. IEEE Trans. Inf. Theory 2018, 64, 4–25. [Google Scholar] [CrossRef]
- Verdú, S. α-mutual information. In Proceedings of the 2015 Information Theory and Applications Workshop (ITA), San Diego, CA, USA, 1–6 February 2015; pp. 1–6. [Google Scholar] [CrossRef]
- Tridenski, S.; Zamir, R.; Ingber, A. The Ziv–Zakai–Rényi bound for joint source-channel coding. IEEE Trans. Inf. Theory 2015, 61, 4293–4315. [Google Scholar] [CrossRef]
- Aishwarya, G.; Madiman, M. Remarks on Rényi versions of conditional entropy and mutual information. In Proceedings of the 2019 IEEE International Symposium on Information Theory (ISIT), Paris, France, 7–12 July 2019; pp. 1117–1121. [Google Scholar]
- Csiszár, I.; Shields, P.C. Information Theory and Statistics: A Tutorial; now Publishers: Hanover, MA, USA, 2004; ISBN 978-1-933019-05-5. [Google Scholar]
- Liese, F.; Vajda, I. On divergences and informations in statistics and information theory. IEEE Trans. Inf. Theory 2006, 52, 4394–4412. [Google Scholar] [CrossRef]
- Sason, I.; Verdú, S. f-divergence inequalities. IEEE Trans. Inf. Theory 2016, 62, 5973–6006. [Google Scholar] [CrossRef]
- Jiao, J.; Han, Y.; Weissman, T. Dependence measures bounding the exploration bias for general measurements. In Proceedings of the 2017 IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017; pp. 1475–1479. [Google Scholar] [CrossRef]
- Ziv, J.; Zakai, M. On functionals satisfying a data-processing theorem. IEEE Trans. Inf. Theory 1973, 19, 275–283. [Google Scholar] [CrossRef]
- Lapidoth, A.; Pfister, C. Testing against independence and a Rényi information measure. In Proceedings of the 2018 IEEE Information Theory Workshop (ITW), Guangzhou, China, 25–29 November 2018; pp. 1–5. [Google Scholar] [CrossRef]
- Han, T.S.; Kobayashi, K. The strong converse theorem for hypothesis testing. IEEE Trans. Inf. Theory 1989, 35, 178–180. [Google Scholar] [CrossRef]
- Nakagawa, K.; Kanaya, F. On the converse theorem in statistical hypothesis testing. IEEE Trans. Inf. Theory 1993, 39, 623–628. [Google Scholar] [CrossRef]
- Bunte, C.; Lapidoth, A. Encoding tasks and Rényi entropy. IEEE Trans. Inf. Theory 2014, 60, 5065–5076. [Google Scholar] [CrossRef]
- Bracher, A.; Lapidoth, A.; Pfister, C. Distributed task encoding. In Proceedings of the 2017 IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017; pp. 1993–1997. [Google Scholar] [CrossRef]
- Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 20 June–30 July 1960; Volume 1, pp. 547–561. [Google Scholar]
- van Erven, T.; Harremoës, P. Rényi divergence and Kullback–Leibler divergence. IEEE Trans. Inf. Theory 2014, 60, 3797–3820. [Google Scholar] [CrossRef]
- Sason, I.; Verdú, S. Improved bounds on lossless source coding and guessing moments via Rényi measures. IEEE Trans. Inf. Theory 2018, 64, 4323–4346. [Google Scholar] [CrossRef]
- Ashok Kumar, M.; Sundaresan, R. Minimization problems based on relative α-entropy I: Forward projection. IEEE Trans. Inf. Theory 2015, 61, 5063–5080. [Google Scholar] [CrossRef]
- Ashok Kumar, M.; Sundaresan, R. Minimization problems based on relative α-entropy II: Reverse projection. IEEE Trans. Inf. Theory 2015, 61, 5081–5095. [Google Scholar] [CrossRef]
- Sundaresan, R. Guessing under source uncertainty. IEEE Trans. Inf. Theory 2007, 53, 269–287. [Google Scholar] [CrossRef]
- Polyanskiy, Y.; Wu, Y. Lecture Notes on Information Theory. 2017. Available online: http://people.lids.mit.edu/yp/homepage/data/itlectures_v5.pdf (accessed on 18 August 2017).
- Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2006; ISBN 978-0-471-24195-9. [Google Scholar]
- Gallager, R.G. Information Theory and Reliable Communication; John Wiley & Sons: Hoboken, NJ, USA, 1968; ISBN 978-0-471-29048-3. [Google Scholar]
- Bullen, P.S. Handbook of Means and Their Inequalities; Kluwer Academic Publishers: Dordrecht, The Netherlands, 2003; ISBN 978-1-4020-1522-9. [Google Scholar]
- Horn, R.A.; Johnson, C.R. Matrix Analysis, 2nd ed.; Cambridge University Press: Cambridge, UK, 2013; ISBN 978-0-521-83940-2. [Google Scholar]
- Fan, K. Minimax theorems. Proc. Natl. Acad. Sci. USA 1953, 39, 42–47. [Google Scholar] [CrossRef] [PubMed]
- Borwein, J.M.; Zhuang, D. On Fan’s minimax theorem. Math. Program. 1986, 34, 232–234. [Google Scholar] [CrossRef]
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).