Two Measures of Dependence

Two families of dependence measures between random variables are introduced. They are based on the Rényi divergence of order α and the relative α-entropy, respectively, and both dependence measures reduce to Shannon’s mutual information when their order α is one. The first measure shares many properties with the mutual information, including the data-processing inequality, and can be related to the optimal error exponents in composite hypothesis testing. The second measure does not satisfy the data-processing inequality, but appears naturally in the context of distributed task encoding.


I. INTRODUCTION
At the heart of information theory lies the Shannon entropy, which, together with the relative entropy and the mutual information, appears in numerous contexts. One of the more successful attempts to generalize the Shannon entropy was made by Rényi [1], who introduced the Rényi entropy of order α, defined for α > 0 and α ≠ 1 as
\[ H_\alpha(X) \triangleq \frac{1}{1-\alpha} \log \sum_{x} P_X(x)^\alpha, \]
which has the desirable property that lim_{α→1} H_α(X) = H(X). But there does not seem to be a unique way to generalize the relative entropy and the mutual information to the Rényi setting.
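As a quick numerical illustration of this limit, the following sketch (not part of the paper; it assumes natural logarithms and an arbitrary example PMF) evaluates the Rényi entropy for orders approaching one and compares it with the Shannon entropy:

```python
# Sketch: H_alpha(X) approaches H(X) as alpha tends to 1 (natural logarithms).
import numpy as np

def renyi_entropy(p, alpha):
    # Rényi entropy of order alpha (alpha > 0, alpha != 1) of the PMF p.
    p = np.asarray(p, dtype=float)
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

def shannon_entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))

p = [0.5, 0.25, 0.125, 0.125]
for alpha in (0.5, 0.9, 0.99, 1.01, 1.1, 2.0):
    print(f"H_{alpha}(X) = {renyi_entropy(p, alpha):.4f}")
print(f"H(X)        = {shannon_entropy(p):.4f}")
```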
The two classical generalizations of relative entropy are reviewed in Section II. In Section III, our proposed generalizations of mutual information, J_α(X; Y) and K_α(X; Y), are introduced. Their properties are analyzed in Sections IV and V. Section VI provides an operational meaning to K_α(X; Y).
Other generalizations of mutual information appeared in the past. Notable are those by Sibson [2], Arimoto [3], and Csiszár [4]. An overview and some properties of these proposals are provided by Verdú [5].

II. GENERALIZATIONS OF RELATIVE ENTROPY
Throughout this section, P and Q are probability mass functions on a finite set 𝒳. The relative entropy (or Kullback–Leibler divergence) of P with respect to Q is defined as
\[ D(P\|Q) \triangleq \sum_{x} P(x) \log \frac{P(x)}{Q(x)}, \]
with the conventions 0 log(0/q) = 0 and p log(p/0) = ∞ for p > 0. The Rényi divergence of order α of P with respect to Q [1], [6] is defined for α > 0 and α ≠ 1 as
\[ D_\alpha(P\|Q) \triangleq \frac{1}{\alpha-1} \log \sum_{x} P(x)^\alpha Q(x)^{1-\alpha}, \]
with the convention that for α > 1, we read P(x)^α Q(x)^{1−α} as P(x)^α / Q(x)^{α−1} and say that 0/0 = 0 and p/0 = ∞ for p > 0. By a continuity argument, D_1(P‖Q) is defined as D(P‖Q).
The relative α-entropy of P with respect to Q is defined for α > 0 and α ≠ 1 as
\[ \Delta_\alpha(P\|Q) \triangleq \frac{\alpha}{1-\alpha} \log \sum_{x} P(x)\, Q(x)^{\alpha-1} - \frac{1}{1-\alpha} \log \sum_{x} P(x)^\alpha + \log \sum_{x} Q(x)^\alpha, \]
with the convention that for α < 1, we read P(x) Q(x)^{α−1} as P(x) / Q(x)^{1−α} and say that 0/0 = 0 and p/0 = ∞ for p > 0. It was first identified by Sundaresan [7] in the context of the Massey–Arikan guessing problem [8], [9], and it also plays a role in the context of mismatched task encoding, as shown by Bunte and Lapidoth [10]. Further properties of the relative α-entropy are studied by Kumar and Sundaresan [11], [12]. By a continuity argument [11, Lemma 2], Δ_1(P‖Q) is defined as D(P‖Q).
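The following snippet is a minimal sketch (not part of the paper) of the three quantities defined above; it assumes natural logarithms and strictly positive PMFs, so the 0/0 and p/0 conventions never come into play:

```python
# Sketch: relative entropy, Rényi divergence, and relative alpha-entropy
# for strictly positive PMFs p and q (natural logarithms).
import numpy as np

def relative_entropy(p, q):
    return np.sum(p * np.log(p / q))

def renyi_divergence(p, q, alpha):
    return np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1)

def relative_alpha_entropy(p, q, alpha):
    return ((alpha / (1 - alpha)) * np.log(np.sum(p * q**(alpha - 1)))
            - np.log(np.sum(p**alpha)) / (1 - alpha)
            + np.log(np.sum(q**alpha)))

p = np.array([0.6, 0.3, 0.1])
q = np.array([0.2, 0.5, 0.3])
print(relative_entropy(p, q))
print(renyi_divergence(p, q, 2.0), relative_alpha_entropy(p, q, 2.0))
```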
The following lemma shows that Δ_α(P‖Q) and D_α(P‖Q) are in fact closely related. (This relationship was first described in [7, Section V, Property 4].)

Lemma 1. Let P and Q be PMFs over a finite set 𝒳 and let α > 0 be a constant. Define the PMFs
\[ \hat P(x) \triangleq \frac{P(x)^\alpha}{\sum_{x'} P(x')^\alpha}, \qquad (6) \]
\[ \hat Q(x) \triangleq \frac{Q(x)^\alpha}{\sum_{x'} Q(x')^\alpha}. \qquad (7) \]
Then,
\[ \Delta_\alpha(P\|Q) = D_{1/\alpha}(\hat P\|\hat Q), \qquad (8) \]
where the LHS is ∞ if and only if the RHS is ∞.

Proof. For α ≠ 1, (8) follows by rewriting Δ_α(P‖Q) in terms of the normalized PMFs P̂ and Q̂; checking when either side of (8) is ∞ establishes that the LHS is ∞ if and only if the RHS is ∞ because P̂(x) and Q̂(x) are zero if and only if P(x) and Q(x) are zero, respectively. For α = 1, (8) is valid because we have P̂ = P, Q̂ = Q, and Δ_1(P‖Q) = D_1(P‖Q) = D(P‖Q) by definition.
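A quick numerical sanity check of Lemma 1 (again a sketch outside the paper, assuming strictly positive PMFs and natural logarithms): the relative α-entropy of P with respect to Q coincides with the Rényi divergence of order 1/α between the transformed PMFs.

```python
# Sketch: verify Delta_alpha(P||Q) = D_{1/alpha}(P_hat||Q_hat) on random PMFs.
import numpy as np

def renyi_divergence(p, q, alpha):
    return np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1)

def relative_alpha_entropy(p, q, alpha):
    return ((alpha / (1 - alpha)) * np.log(np.sum(p * q**(alpha - 1)))
            - np.log(np.sum(p**alpha)) / (1 - alpha)
            + np.log(np.sum(q**alpha)))

def hat(p, alpha):
    # The transformation (6)-(7): normalize p(x)^alpha to a PMF.
    t = p**alpha
    return t / t.sum()

rng = np.random.default_rng(0)
p, q = rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5))
for alpha in (0.5, 2.0, 3.0):
    lhs = relative_alpha_entropy(p, q, alpha)
    rhs = renyi_divergence(hat(p, alpha), hat(q, alpha), 1 / alpha)
    print(alpha, lhs, rhs)   # the two values should agree for each alpha
```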

III. TWO MEASURES OF DEPENDENCE
Throughout this section, X and Y are random variables taking values in finite sets according to the joint PMF P_XY. Based on the observation that the mutual information can be characterized as
\[ I(X;Y) = D(P_{XY}\|P_X P_Y) = \min_{Q_X, Q_Y} D(P_{XY}\|Q_X Q_Y), \qquad (10) \]
where the minimization is over all PMFs Q_X and Q_Y, two generalizations are proposed:
\[ J_\alpha(X;Y) \triangleq \min_{Q_X, Q_Y} D_\alpha(P_{XY}\|Q_X Q_Y), \qquad (11) \]
\[ K_\alpha(X;Y) \triangleq \min_{Q_X, Q_Y} \Delta_\alpha(P_{XY}\|Q_X Q_Y). \qquad (12) \]
Because D_1(P‖Q) = Δ_1(P‖Q) = D(P‖Q) and because of (10), J_1(X;Y) and K_1(X;Y) are equal to I(X;Y). The measures J_α(X;Y) and K_α(X;Y) are well-defined for all α > 0 because D_α(P‖Q) and Δ_α(P‖Q) are nonnegative and continuous in Q, so the minima on the RHS of (11) and (12) exist. Note that, (10) notwithstanding, the choice Q_X = P_X and Q_Y = P_Y need not be optimal in (11) and (12) if α ≠ 1. Also note that the optimization problems (11) and (12) are not convex because the set of all product PMFs is (except in trivial cases) not convex. (But see Theorem 1, Properties 8 and 10 ahead for two convex formulations in the case α > 1.)

In light of Lemma 1, J_α(X;Y) and K_α(X;Y) are related as follows:

Lemma 2. Let P_XY be a joint PMF over the finite sets 𝒳 and 𝒴 and let α > 0 be a constant. Define the PMF
\[ \hat P_{XY}(x,y) \triangleq \frac{P_{XY}(x,y)^\alpha}{\sum_{x',y'} P_{XY}(x',y')^\alpha}, \qquad (13) \]
and let (X̂, Ŷ) be a pair of random variables of joint PMF P̂_XY. Then,
\[ K_\alpha(X;Y) = J_{1/\alpha}(\hat X; \hat Y). \qquad (14) \]

Proof. For every α > 0,
\[ K_\alpha(X;Y) = \min_{Q_X, Q_Y} \Delta_\alpha(P_{XY}\|Q_X Q_Y) \qquad (15) \]
\[ = \min_{Q_X, Q_Y} D_{1/\alpha}\big(\hat P_{XY}\,\big\|\,\widehat{Q_X Q_Y}\big) \qquad (16) \]
\[ = \min_{Q_X, Q_Y} D_{1/\alpha}\big(\hat P_{XY}\,\big\|\,\hat Q_X \hat Q_Y\big) \qquad (17) \]
\[ = \min_{\hat Q_X, \hat Q_Y} D_{1/\alpha}\big(\hat P_{XY}\,\big\|\,\hat Q_X \hat Q_Y\big) \qquad (18) \]
\[ = J_{1/\alpha}(\hat X; \hat Y), \qquad (19) \]
where (15) follows from the definition (12); (16) follows from Lemma 1; (17) follows because the transformation (7) of a product is the product of the transformations; (18) follows because the transformation (7) is bijective on the set of PMFs; and (19) follows from the definition (11).
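The minimizations in (11) and (12) have no closed form in general, but for a small joint PMF they can be approximated numerically. The following sketch (not from the paper) computes J_α(X;Y) with a heuristic alternating scheme in which each marginal is updated in closed form while the other is held fixed; because of the non-convexity noted above, the iterates do not increase the objective but need not reach the global minimum. It assumes natural logarithms and strictly positive PMFs, and it obtains K_α(X;Y) via Lemma 2.

```python
# Sketch: heuristic evaluation of J_alpha(X;Y) and, via Lemma 2, K_alpha(X;Y).
import numpy as np

def renyi_div(p, q, a):
    return np.log(np.sum(p**a * q**(1 - a))) / (a - 1)

def J_alpha(P_xy, a, iters=200):
    qx = P_xy.sum(axis=1)          # initialize with the marginals
    qy = P_xy.sum(axis=0)
    for _ in range(iters):
        fx = np.sum(P_xy**a * qy[None, :]**(1 - a), axis=1)
        qx = fx**(1 / a); qx /= qx.sum()      # best Q_X for the current Q_Y
        fy = np.sum(P_xy**a * qx[:, None]**(1 - a), axis=0)
        qy = fy**(1 / a); qy /= qy.sum()      # best Q_Y for the current Q_X
    return renyi_div(P_xy.ravel(), np.outer(qx, qy).ravel(), a)

def K_alpha(P_xy, a, iters=200):
    # Lemma 2: K_alpha(X;Y) = J_{1/alpha}(X_hat;Y_hat) with P_hat prop. to P^alpha.
    P_hat = P_xy**a / np.sum(P_xy**a)
    return J_alpha(P_hat, 1 / a, iters)

rng = np.random.default_rng(1)
P_xy = rng.dirichlet(np.ones(12)).reshape(3, 4)
px, py = P_xy.sum(axis=1), P_xy.sum(axis=0)
mi = np.sum(P_xy * np.log(P_xy / np.outer(px, py)))
print("I(X;Y)       :", mi)
print("J_1.001(X;Y) :", J_alpha(P_xy, 1.001))   # should be close to I(X;Y)
print("K_2(X;Y)     :", K_alpha(P_xy, 2.0))
```

For α close to one, the output should be close to the mutual information, consistent with J_1(X;Y) = K_1(X;Y) = I(X;Y).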
IV. PROPERTIES OF J_α(X;Y)

Theorem 1. Let X, Y, and Z be random variables on finite sets. The following properties of the mutual information I(X;Y) are also satisfied by J_α(X;Y) for all α > 0:
1) J_α(X;Y) ≥ 0 with equality if and only if X and Y are independent (nonnegativity).
2) J_α(X;Y) = J_α(Y;X) (symmetry).
3) J_α(X;Z) ≤ J_α(X;Y) whenever X ⊸− Y ⊸− Z form a Markov chain (data-processing inequality).
4) J_α(X;Y) can be written as a minimization over a single PMF, where the minimization is over all PMFs Q_X; this is a convex optimization problem if α > 1.
In addition, J_α(X;Y) admits the following variational characterizations.
9) For all α ∈ (0, 1),
\[ J_\alpha(X;Y) = \min_{R_{XY}} \Big\{ \frac{\alpha}{1-\alpha}\, D(R_{XY}\|P_{XY}) + D(R_{XY}\|R_X R_Y) \Big\}, \]
where the minimization is over all joint PMFs R_XY, and R_X and R_Y denote the marginals of R_XY.
10) For all α > 1,
\[ J_\alpha(X;Y) = \max_{R_{XY}} \Big\{ D(R_{XY}\|R_X R_Y) - \frac{\alpha}{\alpha-1}\, D(R_{XY}\|P_{XY}) \Big\}, \]
where the maximization is over all joint PMFs R_XY, and R_X and R_Y denote the marginals of R_XY.

Proof. It is well-known that Properties 1-5 are satisfied by the mutual information [13, Chapter 2]. We are left to show that J_α(X;Y) satisfies Properties 1-10:
1) We use the fact that for all α > 0, D_α(P‖Q) ≥ 0 with equality if and only if P = Q [6, Theorem 8]. The nonnegativity of J_α(X;Y) then follows from (11) and from D_α(P‖Q) ≥ 0. If X and Y are independent, i.e., if P_XY = P_X P_Y, then choosing Q_X = P_X and Q_Y = P_Y in (11) shows that J_α(X;Y) ≤ D_α(P_XY‖P_X P_Y) = 0, so J_α(X;Y) = 0. Conversely, if J_α(X;Y) = 0, then D_α(P_XY‖Q_X Q_Y) = 0 for some PMFs Q_X and Q_Y, so P_XY = Q_X Q_Y, which in turn implies that X and Y are independent.
2) The symmetry of J_α(X;Y) in X and Y follows because (11) is symmetric in X and Y.
3) Assume that X ⊸− Y ⊸− Z, which is equivalent to
\[ P_{XYZ}(x,y,z) = P_{XY}(x,y)\, P_{Z|Y}(z|y) \qquad (21) \]
for all x, y, and z. Let Q_X and Q_Y be PMFs that achieve the minimum on the RHS of (11), so
\[ J_\alpha(X;Y) = D_\alpha(P_{XY}\|Q_X Q_Y). \qquad (22) \]
Define the PMF Q_Z as follows:
\[ Q_Z(z) \triangleq \sum_{y} Q_Y(y)\, P_{Z|Y}(z|y). \qquad (23) \]
We will show that for all α > 0,
\[ D_\alpha(P_{XZ}\|Q_X Q_Z) \le D_\alpha(P_{XY}\|Q_X Q_Y), \qquad (24) \]
which implies the data-processing inequality because
\[ J_\alpha(X;Z) \le D_\alpha(P_{XZ}\|Q_X Q_Z) \qquad (25) \]
\[ \le D_\alpha(P_{XY}\|Q_X Q_Y) \qquad (26) \]
\[ = J_\alpha(X;Y), \qquad (27) \]
where (25) follows from (11); (26) follows from (24); and (27) follows from (22). In order to prove (24), we use the fact that D_α(P‖Q) satisfies a data-processing inequality, namely, that for any conditional PMF A(x|x'),
\[ D_\alpha(P'\|Q') \le D_\alpha(P\|Q), \qquad (28) \]
where P'(x) = Σ_{x'} A(x|x') P(x') and Q'(x) = Σ_{x'} A(x|x') Q(x') [6, Theorem 9]. We choose
\[ A(x,z\,|\,x',y') \triangleq \mathbb{I}\{x = x'\}\, P_{Z|Y}(z|y'), \qquad (29) \]
where I{x = x'} is the indicator function that is one if x = x' and zero otherwise. Processing P_XY with this choice leads to
\[ \sum_{x',y'} A(x,z\,|\,x',y')\, P_{XY}(x',y') = \sum_{y'} P_{Z|Y}(z|y')\, P_{XY}(x,y') = P_{XZ}(x,z), \]
where the first equality follows from (29) and the second from (21). Processing Q_X Q_Y leads to
\[ \sum_{x',y'} A(x,z\,|\,x',y')\, Q_X(x') Q_Y(y') = Q_X(x) \sum_{y'} P_{Z|Y}(z|y')\, Q_Y(y') = Q_X(x)\, Q_Z(z), \]
where the first equality follows from (29) and the second from (23). Combining (28) with these two observations leads to (24).
4) The proof of this property is omitted.
9) For α ∈ (0, 1), we have [6, Theorem 30]
\[ D_\alpha(P\|Q) = \min_{R} \Big\{ \frac{\alpha}{1-\alpha}\, D(R\|P) + D(R\|Q) \Big\}, \qquad (57) \]
where the minimization is over all PMFs R. The claim follows by observing that
\[ J_\alpha(X;Y) = \min_{Q_X, Q_Y} \min_{R_{XY}} \Big\{ \frac{\alpha}{1-\alpha}\, D(R_{XY}\|P_{XY}) + D(R_{XY}\|Q_X Q_Y) \Big\} \qquad (58) \]
\[ = \min_{R_{XY}} \Big\{ \frac{\alpha}{1-\alpha}\, D(R_{XY}\|P_{XY}) + \min_{Q_X, Q_Y} D(R_{XY}\|Q_X Q_Y) \Big\} \qquad (59) \]
\[ = \min_{R_{XY}} \Big\{ \frac{\alpha}{1-\alpha}\, D(R_{XY}\|P_{XY}) + D(R_{XY}\|R_X R_Y) \Big\}, \qquad (60) \]
where (58) follows from (11) and (57); (59) follows by interchanging the order of the minimizations; (60) follows from (10) applied to R_XY; and the minimum in (60) is attained by a continuity argument.
10) For α > 1, we have [6, Theorem 30]
\[ D_\alpha(P\|Q) = \sup_{R} \Big\{ D(R\|Q) - \frac{\alpha}{\alpha-1}\, D(R\|P) \Big\}, \qquad (62) \]
where the supremum is over all PMFs R. A simple computation reveals that the expression in brackets,
\[ D(R_{XY}\|Q_X Q_Y) - \frac{\alpha}{\alpha-1}\, D(R_{XY}\|P_{XY}), \qquad (63) \]
is concave in R_XY because H(R_XY) and linear functionals of R_XY are concave in R_XY; in addition, (63) is convex in (Q_X, Q_Y) and continuous in R_XY, Q_X, and Q_Y. Then,
\[ \min_{Q_X, Q_Y} \sup_{R_{XY}} \Big\{ D(R_{XY}\|Q_X Q_Y) - \frac{\alpha}{\alpha-1}\, D(R_{XY}\|P_{XY}) \Big\} = \max_{R_{XY}} \min_{Q_X, Q_Y} \Big\{ D(R_{XY}\|Q_X Q_Y) - \frac{\alpha}{\alpha-1}\, D(R_{XY}\|P_{XY}) \Big\} \qquad (64) \]
\[ = \max_{R_{XY}} \Big\{ D(R_{XY}\|R_X R_Y) - \frac{\alpha}{\alpha-1}\, D(R_{XY}\|P_{XY}) \Big\}, \qquad (65) \]
where (64) can be justified by [14, Corollary 37.3.2] because the set of all PMFs is compact, convex, and nonempty and because the expression in brackets is continuous in R_XY and in (Q_X, Q_Y), convex in (Q_X, Q_Y), and concave in R_XY; and (65) follows from (10) applied to R_XY. Finally, combining (11), (62), (64), and (65) yields
\[ J_\alpha(X;Y) = \max_{R_{XY}} \Big\{ D(R_{XY}\|R_X R_Y) - \frac{\alpha}{\alpha-1}\, D(R_{XY}\|P_{XY}) \Big\}. \qquad (66) \]

V. PROPERTIES OF K_α(X;Y)

The relationship K_α(X;Y) = J_{1/α}(X̂;Ŷ) from Lemma 2 allows us to derive some properties of K_α(X;Y) from the properties of J_{1/α}(X̂;Ŷ). But, unlike J_α(X;Y), K_α(X;Y) does not satisfy the data-processing inequality.
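Two ingredients of the proof above can be checked numerically. The sketch below (not part of the paper; natural logarithms, strictly positive PMFs, and a randomly drawn channel are assumed) verifies the Rényi-divergence data-processing inequality (28) and the variational identity (57), whose minimizer is the geometric mixture R* proportional to P^α Q^{1−α}:

```python
# Sketch: check two ingredients of the proof of Theorem 1 on random PMFs.
# (i)  Data-processing inequality (28): D_alpha(PA || QA) <= D_alpha(P || Q).
# (ii) Variational identity (57) for alpha in (0,1):
#      D_alpha(P||Q) = min_R { alpha/(1-alpha) D(R||P) + D(R||Q) },
#      with minimizer R* proportional to P^alpha * Q^(1-alpha).
import numpy as np

def kl(p, q):
    return np.sum(p * np.log(p / q))

def renyi_div(p, q, alpha):
    return np.log(np.sum(p**alpha * q**(1 - alpha))) / (alpha - 1)

rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(6))
Q = rng.dirichlet(np.ones(6))

# (i) a random channel A: each row A[x'] is a PMF over the output alphabet
A = rng.dirichlet(np.ones(4), size=6)
for alpha in (0.5, 2.0, 5.0):
    assert renyi_div(P @ A, Q @ A, alpha) <= renyi_div(P, Q, alpha) + 1e-12

# (ii) the bracket of (57), evaluated at R* and at a random R
alpha = 0.4
Rstar = P**alpha * Q**(1 - alpha)
Rstar /= Rstar.sum()
def bracket(R):
    return alpha / (1 - alpha) * kl(R, P) + kl(R, Q)
print(renyi_div(P, Q, alpha), bracket(Rstar))     # these two should agree
print(bracket(rng.dirichlet(np.ones(6))))          # any other R gives a larger value
```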

VI. OPERATIONAL MEANING OF K α (X; Y )
The motivation to study J_α(X;Y) and K_α(X;Y) stems from [15], which extends the task-encoding problem studied in [10] to a distributed setting. It considers a discrete source {(X_i, Y_i)}_{i=1}^∞ over a finite alphabet that emits pairs of random variables (X_i, Y_i). For any positive integer n, the sequences {X_i}_{i=1}^n and {Y_i}_{i=1}^n are encoded separately, and the decoder outputs the list of all pairs (x^n, y^n) that share the given description. The goal is to minimize the ρ-th moment of the list size for some ρ > 0 as n goes to infinity. In the following theorem, necessary and sufficient conditions on the coding rates are given to drive the ρ-th moment of the list size asymptotically to one.