Article

Conditional Rényi Divergence Saddlepoint and the Maximization of α-Mutual Information

1 Department of Electrical Engineering, Princeton University, C307 Engineering Quadrangle, NJ 08540, USA
2 Independent Researcher, Princeton, NJ 08540, USA
* Author to whom correspondence should be addressed.
Entropy 2019, 21(10), 969; https://doi.org/10.3390/e21100969
Submission received: 8 August 2019 / Revised: 20 September 2019 / Accepted: 25 September 2019 / Published: 4 October 2019

Abstract: Rényi-type generalizations of entropy, relative entropy and mutual information have found numerous applications throughout information theory and beyond. While there is consensus that the ways A. Rényi generalized entropy and relative entropy in 1961 are the “right” ones, several candidates have been put forth as possible mutual informations of order $\alpha$. In this paper we lend further evidence to the notion that a Bayesian measure of statistical distinctness introduced by R. Sibson in 1969 (closely related to Gallager’s $E_0$ function) is the most natural generalization, lending itself to explicit computation and maximization, as well as closed-form formulas. This paper considers general (not necessarily discrete) alphabets and extends the major analytical results on the saddle point and saddle level of the conditional relative entropy to the conditional Rényi divergence. Several examples illustrate the main application of these results, namely, the maximization of $\alpha$-mutual information with and without constraints.

1. Introduction

The Rényi divergence of order $\alpha$ between two probability measures defined on the same measurable space,
D_\alpha(P \| Q) = \frac{1}{\alpha-1} \log \int \left( \frac{dP}{dQ}(x) \right)^{\alpha} dQ(x), \qquad (1)
is a useful generalization of the relative entropy $D(P\|Q)$ introduced by Rényi [1] in the discrete case ($\lim_{\alpha\to 1} D_\alpha(P\|Q) = D(P\|Q)$). Many of the properties satisfied by relative entropy hold for Rényi divergence, such as nonnegativity, convexity, lower semicontinuity, the data processing inequality, and additivity for product measures. $D_\alpha(P\|Q)$ can be defined in more generality without requiring $P \ll Q$. A comprehensive survey of the properties satisfied by Rényi divergence can be found in [2]. Just as $D(P\|Q)$, $D_\alpha(P\|Q)$ provides a useful gauge of the distinctness of $P$ and $Q$, which has found applications in large deviations problems (such as the asymptotic analysis of hypothesis testing [3,4,5]), lossless data compression [4,6,7], data transmission through noisy channels [8,9,10], and statistical physics [11]. If $P_1 \ll P_0$, then the Rényi divergence of order $\alpha \in (0,1)\cup(1,\infty)$ can be expressed in terms of relative entropy through [5]
(1-\alpha)\, D_\alpha(P_1 \| P_0) = \min_{P \ll P_1} \left\{ \alpha\, D(P\|P_1) + (1-\alpha)\, D(P\|P_0) \right\}. \qquad (2)
Although not an $f$-divergence, Rényi divergence is in one-to-one correspondence with the Hellinger divergence $H_\alpha(P\|Q)$ (e.g., [12])
D_\alpha(P\|Q) = \frac{1}{\alpha-1} \log\left( 1 + (\alpha-1)\, H_\alpha(P\|Q) \right). \qquad (3)
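The one-to-one correspondence between the Rényi and Hellinger divergences above is straightforward to check numerically. The following is a minimal sketch (the helper names are our own, and natural logarithms are used, so all quantities are in nats):

```python
import numpy as np

def renyi_div(p, q, a):
    """D_alpha(P||Q) in nats for strictly positive pmfs p, q."""
    return np.log(np.sum(p**a * q**(1 - a))) / (a - 1)

def hellinger_div(p, q, a):
    """Hellinger divergence H_alpha(P||Q) = (sum p^a q^(1-a) - 1)/(a - 1)."""
    return (np.sum(p**a * q**(1 - a)) - 1) / (a - 1)

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])
# D_alpha equals (1/(a-1)) log(1 + (a-1) H_alpha) for every simple order:
checks = [(renyi_div(p, q, a),
           np.log(1 + (a - 1) * hellinger_div(p, q, a)) / (a - 1))
          for a in (0.5, 2.0, 3.0)]
```

The same two helpers also exhibit the monotonicity of $D_\alpha$ in $\alpha$ discussed in Section 2.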
One of the major applications of relative entropy is to quantify statistical dependence in a joint probability measure by means of the mutual information
I(X;Y) = D(P_{XY} \,\|\, P_X \times P_Y). \qquad (4)
The corresponding straight generalization replacing relative entropy by Rényi divergence is also a measure of dependence but has found scant utility so far (see [6,13]). To explore the generalization that we study in this paper, namely $\alpha$-mutual information, we need to consider the conditional versions of relative entropy and Rényi divergence. These are defined in general for two random transformations $P_{Y|X}$ and $Q_{Y|X}$ and an unconditional probability measure $P_X$ simply as
D(P_{Y|X} \| Q_{Y|X} | P_X) = D(P_{Y|X} P_X \,\|\, Q_{Y|X} P_X), \qquad (5)
D_\alpha(P_{Y|X} \| Q_{Y|X} | P_X) = D_\alpha(P_{Y|X} P_X \,\|\, Q_{Y|X} P_X). \qquad (6)
A major difference between those conditional measures is that, while $D(P_{Y|X}\|Q_{Y|X}|P_X)$ is plainly the expectation $\int D(P_{Y|X=x}\|Q_{Y|X=x})\, dP_X(x)$, the conditional Rényi divergence depends on the function $D_\alpha(P_{Y|X=x}\|Q_{Y|X=x})$ in a more involved way. In this paper, the use of the conditional information measures will be circumscribed to the special case in which $Q_{Y|X}$ is actually an unconditional measure. In fact, a more productive way to express mutual information than (4) is the asymmetric expression
I(X;Y) = D(P_{Y|X} \| P_Y | P_X) \qquad (7)
= \min_Q D(P_{Y|X} \| Q | P_X). \qquad (8)
Equation (8) follows from the key additive decomposition formula
D(P_{Y|X} \| P_Y | P_X) = D(P_{Y|X} \| Q_Y | P_X) - D(P_Y \| Q_Y), \qquad (9)
where $Q_Y$ is an arbitrary measure dominating $P_Y$. We see that (8) is a Bayesian measure of the distinctness of the constellation of probability measures $\{P_{Y|X=x},\, x \in \mathcal{A}\}$, sometimes referred to as the information radius, where the center of gravity of the constellation is none other than $P_Y$. Equation (8) has proven to be very fertile, particularly when it comes to supremizing $I(X;Y)$ with respect to $P_X$, since the ensuing sup min optimization has a saddle point if and only if there is an input distribution that attains the maximal mutual information. The convexity of $D(P_{Y|X}\|Q|P_X)$ in $Q$ and its concavity (linearity) in $P_X$, along with the minimax theorem, ensure the existence of the saddle point whenever the set of allowed input distributions is compact. The Arimoto–Blahut algorithm [14,15] for finding $\max I(X;Y)$ in finite alphabet settings is also inspired by (8).
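The Arimoto–Blahut iteration for the unconstrained finite-alphabet case can be sketched in a few lines. This is only an illustration (the helper name, iteration count, and strictly-positive-channel assumption are our own; natural logarithms are used):

```python
import numpy as np

def blahut_arimoto(W, n_iter=1000):
    """Sketch of the Arimoto-Blahut iteration for max_{P_X} I(X;Y).

    W[x, y] = P_{Y|X}(y|x); all entries are assumed strictly positive
    so every logarithm below is finite. Returns (capacity in nats, P_X).
    """
    m = W.shape[0]
    p = np.full(m, 1.0 / m)                    # start from the uniform input
    for _ in range(n_iter):
        q = p @ W                              # output distribution P_Y
        d = np.sum(W * np.log(W / q), axis=1)  # D(P_{Y|X=x} || P_Y) per x
        p = p * np.exp(d)                      # multiplicative update
        p /= p.sum()
    q = p @ W
    d = np.sum(W * np.log(W / q), axis=1)
    return float(p @ d), p                     # I(X;Y) at the final iterate

# Binary symmetric channel with crossover 0.1: capacity is log 2 - h(0.1) nats
W = np.array([[0.9, 0.1], [0.1, 0.9]])
C, p_opt = blahut_arimoto(W)
```

For a symmetric channel the uniform input is a fixed point of the update, so the iteration recovers the familiar closed-form capacity.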
Mutual information $I(X;Y) = I(P_X, P_{Y|X})$ also possesses a saddle point (assuming convexity and compactness of the corresponding feasible sets) since it is concave in $P_X$ (to see that, nothing better than (9)) and convex in $P_{Y|X}$. This property has found rich applications in information theory (e.g., [16,17,18]), but neither it nor its generalization to $\alpha$-mutual information will concern us in this paper.
Even if a saddle point for the conditional relative entropy does not exist, Kemperman [19] showed that sup and min can be swapped, thereby establishing the existence of a saddle value.
Another well-known application of (8) and the conditional relative entropy saddle point is the so-called channel capacity–minimax redundancy theorem due to Gallager [20] and Ryabko [21] (see also [22,23]), which shows that the maximal mutual information obtained with a finite constellation $\{P_{Y|X=x},\, x \in \mathcal{A}\}$ is equal to the minimax redundancy in universal lossless data compression of an unknown source selected from $\{P_{Y|X=x},\, x \in \mathcal{A}\}$. Notable generalizations of this result to infinite alphabets, without requiring that a distribution maximizing mutual information exists, are due to Kemperman [19] and Haussler [24]. Recently, the Rényi counterpart of the channel capacity–minimax redundancy result has also been considered, under various restrictions, in [2,25,26].
The main purpose of this paper is to generalize the saddle-point property of conditional relative entropy and its applications to the maximization of mutual information when relative entropy is replaced by Rényi divergence. Towards that end, we recall the various directions in which mutual information has been generalized using Rényi divergence (see also [27]):
  • As aforementioned, the straight generalization $D_\alpha(P_{XY} \| P_X \times P_Y)$ has not yet found wide applicability.
  • In the discrete case and $\alpha \in (0,1)\cup(1,\infty)$, Arimoto [28] proposed the definition of the nonnegative quantity
    I_\alpha^{a}(X;Y) = H_\alpha(X) - H_\alpha^{a}(X|Y), \qquad (10)
    where the Rényi entropy [1] and the Arimoto–Rényi conditional entropy [28] are
    H_\alpha(X) = \frac{\alpha}{1-\alpha} \log \| P_X \|_\alpha, \qquad (11)
    H_\alpha^{a}(X|Y) = \frac{\alpha}{1-\alpha} \log E\left[ \| P_{X|Y}(\cdot|Y) \|_\alpha \right], \qquad (12)
    with the $\alpha$-norm of a probability mass function denoted as $\|P\|_\alpha = \left( \sum_{x\in\mathcal{A}} P^\alpha(x) \right)^{\frac{1}{\alpha}}$. Arimoto extended his algorithm in [14] to compute what he called the capacity of order $\alpha$,
    C_\alpha^{a} = \max_X I_\alpha^{a}(X;Y), \qquad (13)
    for finite-alphabet random transformations, and showed that there exist codes of rate $R$ and blocklength $n$ whose error probability is upper bounded by
    \inf_{\alpha \in (\frac{1}{2},1)} \exp\left( n\, \frac{\alpha-1}{\alpha} \left( C_\alpha^{a} - R \right) \right).
  • Augustin [29] and, later, Csiszár [4] defined
    I_\alpha^{c}(X;Y) = \min_{Q_Y} E\left[ D_\alpha\!\left( P_{Y|X}(\cdot|X) \,\|\, Q_Y \right) \right]. \qquad (14)
    $C_\alpha^{c} = \max_X I_\alpha^{c}(X;Y)$ is dubbed the Augustin capacity of order $\alpha$ in [30]. Csiszár [4] showed that for $\alpha \in (\frac{1}{2},1)$, $I_\alpha^{c}(X;Y)$ is the intercept on the $R$-axis of a supporting line of slope $1 - \frac{1}{\alpha}$ of the error exponent function for codes of rate $R$ with constant-composition $P_X$. Unfortunately, the minimization in (14) is not amenable to explicit solution.
  • For the purpose of coming up with a measure of the similarity among a finite collection of probability measures $\{P_{Y|X=x},\, x \in \mathcal{A}\}$, weighted by $P_X$ on $\mathcal{A}$, Sibson [31] proposed the information radius of order $\alpha$ as
    I_\alpha(X;Y) = \min_Q D_\alpha( P_{Y|X} \| Q | P_X ). \qquad (15)
    As we will see, the minimization in (15) can be solved explicitly. This is the generalization of mutual information we adopt in this paper and which, as in [27], we refer to as $\alpha$-mutual information. A word of caution is that in [4], the symbols $I_\alpha(X;Y)$ and $K_\alpha(X;Y)$ are used in lieu of what we denote by $I_\alpha^{c}(X;Y)$ and $I_\alpha(X;Y)$, respectively. $C_\alpha = \max_X I_\alpha(X;Y)$ is dubbed the Rényi capacity of order $\alpha$ in [26].
  • Independently, Lapidoth–Pfister [32] and Tomamichel–Hayashi [33] proposed
    I_\alpha^{l}(X;Y) = \min_{Q_X} \min_{Q_Y} D_\alpha( P_{XY} \,\|\, Q_X \times Q_Y ), \qquad (16)
    and showed that it determines the performance of composite hypothesis tests for independence, where the hypothesized joint distribution is known but under the independence hypothesis the marginals are unknown. It was shown in [34] that
    I_\alpha^{c}(X;Y) \le I_\alpha^{l}(X;Y) \le I_\alpha(X;Y). \qquad (17)
Despite the difference in the definitions of the various versions, it was shown in the discrete setting that [4,28]
C_\alpha^{a} = C_\alpha^{c} = C_\alpha. \qquad (18)
Therefore, solving for $\max_X I_\alpha(X;Y)$ carries added significance whenever one of the other definitions is adopted. Note that (17) and (18) imply that $C_\alpha = \max_{P_X} I_\alpha^{l}(P_X, P_{Y|X})$. A major application of the maximization of $I_\alpha(X;Y)$ is in the large deviations analysis of optimal data transmission codes, since the sphere-packing error exponent function and the random-coding error exponent function
E_{\mathrm{sp}}(R) = \sup_{\rho \ge 0} \left\{ \rho\, C_{\frac{1}{1+\rho}} - \rho R \right\}, \qquad (19)
E_{\mathrm{r}}(R) = \sup_{\rho \in [0,1]} \left\{ \rho\, C_{\frac{1}{1+\rho}} - \rho R \right\}, \qquad (20)
popularized in [35] and [36], respectively, are upper and lower bounds to the channel reliability function, respectively. A function similar to (20) has recently been shown [37] to yield the large deviations behavior of random coding in the setting of channel resolvability.
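The contrast between the Sibson and Augustin definitions can be illustrated numerically. The sketch below (hypothetical helper names; a brute-force grid stands in for the minimization over $Q_Y$, which is only practical for a binary output alphabet) evaluates the Sibson measure in closed form and the Augustin measure by grid search, and records the gap between them for two orders $\alpha > 1$:

```python
import numpy as np

def renyi_div(p, q, a):
    """D_alpha(P||Q) in nats for strictly positive pmfs p, q."""
    return np.log(np.sum(p**a * q**(1 - a))) / (a - 1)

def sibson(px, W, a):
    """Closed-form Sibson alpha-mutual information for W[x, y] = P(y|x)."""
    return a / (a - 1) * np.log(np.sum((px @ W**a)**(1 / a)))

def augustin(px, W, a):
    """Brute-force Augustin measure: min over Q_Y of E[D_alpha(P_{Y|X}||Q_Y)],
    gridding Q_Y = (t, 1-t) for a binary output alphabet."""
    grid = np.linspace(1e-4, 1 - 1e-4, 4001)
    return min(sum(px[x] * renyi_div(W[x], np.array([t, 1 - t]), a)
                   for x in range(len(px)))
               for t in grid)

px = np.array([0.3, 0.7])
W = np.array([[0.8, 0.2], [0.25, 0.75]])   # an asymmetric binary channel
gap = [sibson(px, W, a) - augustin(px, W, a) for a in (2.0, 4.0)]
```

For $\alpha > 1$ the conditional Rényi divergence is a generalized mean that dominates the arithmetic mean, so the Sibson measure is never below the Augustin measure at the same $P_X$; the computed gaps are nonnegative up to grid error.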
The organization of the paper is as follows. Section 2 states the definitions and properties of the various information measures that are used throughout the paper. In particular, we introduce the key notion of the α-response to an input probability measure through a given random transformation. In Section 3 we present the main results (with proofs relegated to Section 5) related to the saddle point and saddle value of the conditional Rényi divergence, allowing the optimization to be circumscribed to any convex set of input probability measures. The equivalence of the existence of a probability measure that maximizes α-mutual information and the existence of a saddle point is shown, and several illustrative examples of the use of this result in the computation of $C_\alpha$ are also given. The fact that a saddle level exists (i.e., sup and min commute) even if there is no input probability measure that achieves the supremum α-mutual information is established, thereby generalizing Kemperman’s [19] saddle-level result to Rényi divergence through a different route than that followed in [26].

2. Notation, Definitions and Properties

  • If $(\mathcal{A}, \mathcal{F}, P)$ is a probability space, $X \sim P$ indicates $P[X \in F] = P(F)$ for all $F \in \mathcal{F}$.
  • Let $(\mathcal{A}, \mathcal{F})$ and $(\mathcal{B}, \mathcal{G})$ be measurable spaces, which we refer to as the input and output spaces, respectively, with $\mathcal{A}$ and $\mathcal{B}$ referred to as the input and output alphabets, respectively. $P_{Y|X}\colon \mathcal{A} \to \mathcal{B}$ denotes a random transformation from $\mathcal{A}$ to $\mathcal{B}$, i.e., for any $x \in \mathcal{A}$, $P_{Y|X=x}(\cdot)$ is a probability measure on $(\mathcal{B}, \mathcal{G})$, and for any $B \in \mathcal{G}$, $P_{Y|X=\cdot}(B)$ is an $\mathcal{F}$-measurable function. For brevity, we will usually drop mention of the underlying $\sigma$-fields. If $P$ is a probability measure on $\mathcal{A}$ and $P_{Y|X}\colon \mathcal{A} \to \mathcal{B}$ is a random transformation, the corresponding joint probability measure on $\mathcal{A} \times \mathcal{B}$ is denoted by $P P_{Y|X}$ (or, interchangeably, $P_{Y|X} P$). The notation $P \to P_{Y|X} \to Q$ indicates that the output marginal of the joint probability measure $P P_{Y|X}$ is denoted by $Q$.
  • The relative information $\imath_{P\|Q}(x)$ between two probability measures $P$ and $Q$ on the same measurable space such that $P \ll Q$ is defined as
    \imath_{P\|Q}(x) = \log \frac{dP}{dQ}(x), \qquad (21)
    where $\frac{dP}{dQ}$ is the Radon–Nikodym derivative of $P$ with respect to $Q$. The relative entropy is
    D(P\|Q) = E\left[ \imath_{P\|Q}(X) \right], \quad X \sim P. \qquad (22)
  • Given $P_X \to P_{Y|X} \to P_Y$, the information density is defined as
    \imath_{X;Y}(a;b) = \imath_{P_{Y|X=a} \| P_Y}(b), \quad (a,b) \in \mathcal{A} \times \mathcal{B}. \qquad (23)
  • Fix $\alpha > 0$, $P_{Y|X}\colon \mathcal{A} \to \mathcal{B}$, and a probability measure $P_X$ on $\mathcal{A}$. Then, the output probability measure $P_Y^{[\alpha]}$ is called the α-response to $P_X$ if
    \imath_{Y^{[\alpha]} \| Y}(y) = \frac{1}{\alpha} \log E\left[ \exp\left( \alpha\, \imath_{X;Y}(X;y) - \kappa_\alpha \right) \right], \quad X \sim P_X, \qquad (24)
    where $P_X \to P_{Y|X} \to P_Y$, and $\kappa_\alpha$ is a scalar that guarantees that $P_Y^{[\alpha]}$ is a probability measure. For notational convenience, we omit the dependence of $\kappa_\alpha$ on $P_X$ and $P_{Y|X}$. Equivalently, if $p_Y^{[\alpha]}$ and $p_{Y|X}$ denote the densities with respect to some dominating measure, then (24) becomes
    p_Y^{[\alpha]}(y) = \exp\left( -\frac{\kappa_\alpha}{\alpha} \right) E^{\frac{1}{\alpha}}\left[ p_{Y|X}^{\alpha}(y|X) \right], \quad X \sim P_X. \qquad (25)
    In particular, the 1-response to $P_X$ is $P_Y$. In [26], the α-response to $P_X$ is dubbed the order α Rényi mean for prior $P_X$.
  • Given two probability measures $P$ and $Q$ on the same measurable space and a scalar $\alpha \in (0,1)\cup(1,\infty)$, the Rényi divergence of order α between $P$ and $Q$ is defined as [1]
    D_\alpha(P\|Q) = \frac{1}{\alpha-1} \log \int_{\mathcal{A}} p^{\alpha} q^{1-\alpha} \, d\mu, \qquad (26)
    where $p$ and $q$ are the Radon–Nikodym derivatives of $P$ and $Q$, respectively, with respect to a common dominating $\sigma$-finite measure $\mu$. We define $D_1(P\|Q) = D(P\|Q)$ as this coincides with the limit from the left at $\alpha = 1$. It is also the limit from the right whenever $D_\alpha(P\|Q) < \infty$ for some $\alpha > 1$. The cases $\alpha = 0$ and $\alpha = \infty$ can be defined by taking the corresponding limits. In this work, we only focus on the simple orders of α, i.e., $\alpha \in (0,1)\cup(1,\infty)$. As we saw in (1), if $P \ll Q$, then (26) becomes
    D_\alpha(P\|Q) = \frac{1}{\alpha-1} \log E\left[ \exp\left( \alpha\, \imath_{P\|Q}(W) \right) \right], \quad W \sim Q \qquad (27)
    = \frac{1}{\alpha-1} \log E\left[ \exp\left( (\alpha-1)\, \imath_{P\|Q}(V) \right) \right], \quad V \sim P. \qquad (28)
  • If $\alpha \in (0,1)\cup(1,\infty)$, then the binary Rényi divergence of order α is given by
    d_\alpha(p\|q) = D_\alpha\big( [p \ \ 1-p] \,\big\|\, [q \ \ 1-q] \big) \qquad (29)
    = \frac{1}{\alpha-1} \log\left( p^{\alpha} q^{1-\alpha} + (1-p)^{\alpha} (1-q)^{1-\alpha} \right). \qquad (30)
    Note that
    \lim_{\alpha \to 1} d_\alpha\!\left( p \,\middle\|\, \tfrac{1}{2} \right) = \log 2 - h(p), \qquad (31)
    where the usual binary entropy function is denoted by $h(x) = x \log\frac{1}{x} + (1-x)\log\frac{1}{1-x}$. Given $(p_0, p_1) \in (0,1)^2$, $p_0 \ne p_1$, the solution to $d_\alpha(p_0\|q) = d_\alpha(p_1\|q)$ is
    q = \left( 1 + \left( \frac{p_0^{\alpha} - p_1^{\alpha}}{(1-p_1)^{\alpha} - (1-p_0)^{\alpha}} \right)^{\frac{1}{1-\alpha}} \right)^{-1}. \qquad (32)
  • $D_\alpha(P\|Q) \ge 0$, with equality only if $P = Q$.
  • $D_\alpha(P\|Q)$ is monotonically increasing in α.
  • While we may have $D(P\|Q) = \infty$ and $P \ll Q$ simultaneously, $D_\alpha(P\|Q) = \infty$ for any $\alpha \in (0,1)$ is equivalent to $P$ and $Q$ being orthogonal. Conversely, if for some $\alpha > 1$, $D_\alpha(P\|Q) < \infty$, then $P \ll Q$.
  • The Rényi divergence satisfies the data-processing inequality: if $P_X \to P_{Y|X} \to P_Y$ and $Q_X \to P_{Y|X} \to Q_Y$, then
    D_\alpha(P_X \| Q_X) \ge D_\alpha(P_Y \| Q_Y). \qquad (33)
  • Gilardoni [38] gave a strengthened Pinsker’s inequality upper bounding the square of the total variation distance by
    |P - Q|^2 \le \inf_{\alpha \in (0,1]} \frac{2}{\alpha}\, D_\alpha(P\|Q) \qquad (34)
    = \inf_{\alpha > 0} \frac{2}{\min\{\alpha, 1\}}\, D_\alpha(P\|Q), \qquad (35)
    where we have used the monotonicity in α of the Rényi divergence.
  • The Rényi divergence is lower semicontinuous in the topology of setwise convergence, i.e., if for every event $A \in \mathcal{F}$, $P_n(A) \to P(A)$ and $Q_n(A) \to Q(A)$, then
    \liminf_{n \to \infty} D_\alpha(P_n \| Q_n) \ge D_\alpha(P\|Q), \quad \alpha \in (0, \infty]. \qquad (36)
    In particular, note that (36) holds if $|P_n - P| \to 0$ and $|Q_n - Q| \to 0$.
  • In the theory of robust lossless source coding [22,25], the following scalar, called the α-minimax redundancy of $P_{Y|X}$, is an important measure of the worst-case redundancy penalty that ensues when the encoder only knows that the data is generated according to one of the probability measures in the collection $\{P_{Y|X=x},\, x \in \mathcal{A}\}$:
    R_\alpha = \inf_{Q_Y} \sup_{x \in \mathcal{A}} D_\alpha(P_{Y|X=x} \| Q_Y), \qquad (37)
    where the infimum is over all the probability measures on $\mathcal{B}$.
  • Given an input distribution $P_X$ and random transformations $P_{Y|X}$, $Q_{Y|X}$, the conditional Rényi divergence of order $\alpha \in (0,1)\cup(1,\infty)$ is
    D_\alpha(P_{Y|X} \| Q_{Y|X} | P_X) = D_\alpha(P_X P_{Y|X} \,\|\, P_X Q_{Y|X}). \qquad (38)
    Although (38) also holds in the familiar case $\alpha = 1$, in general the conditional Rényi divergence is not the arithmetic average of $D_\alpha(P_{Y|X=x} \| Q_Y)$ with respect to $P_X$ if $\alpha \ne 1$. Instead, it is a generalized mean, or a scaled cumulant generating function evaluated at $\alpha - 1$. Specifically, if $X \sim P_X$, then
    D_\alpha(P_{Y|X} \| Q_Y | P_X) = \frac{1}{\alpha-1} \log E\left[ \exp\left( (\alpha-1)\, D_\alpha( P_{Y|X}(\cdot|X) \| Q_Y ) \right) \right]. \qquad (39)
    Regardless of whether $\alpha \in (0,1)$ or $\alpha \in (1,\infty)$, (39) implies that
    D_\alpha(P_{Y|X} \| Q_Y | P_X) \le \sup_{x \in \mathcal{A}} D_\alpha(P_{Y|X=x} \| Q_Y) \qquad (40)
    = \sup_{P_X} D_\alpha(P_{Y|X} \| Q_Y | P_X), \qquad (41)
    with the supremum in (41) over all input probability measures.
  • The key additive decomposition formula for the mutual information (9) has a nice counterpart for the α-mutual information [27]. Let $P_X \to P_{Y|X} \to P_Y$ and let $Q_Y$ be an arbitrary probability measure on $\mathcal{B}$ such that $P_Y \ll Q_Y$. Then, it is easy to verify that
    D_\alpha(P_{Y|X} \| P_Y^{[\alpha]} | P_X) = D_\alpha(P_{Y|X} \| Q_Y | P_X) - D_\alpha(P_Y^{[\alpha]} \| Q_Y), \qquad (42)
    a relationship noted by Sibson [31] in the discrete case.
  • Given $\alpha > 0$, $P_X$ and $P_{Y|X}$, the α-mutual information is [27,31]
    I_\alpha(X;Y) = \min_Q D_\alpha(P_{Y|X} \| Q | P_X) \qquad (43)
    = D_\alpha(P_{Y|X} \| P_Y^{[\alpha]} | P_X) \qquad (44)
    = \frac{1}{\alpha-1} \log E\left[ \exp\left( (\alpha-1)\, D_\alpha( P_{Y|X}(\cdot|X) \| P_Y^{[\alpha]} ) \right) \right] \qquad (45)
    = D_\alpha(P_{Y|X} \| P_Y | P_X) - D_\alpha(P_Y^{[\alpha]} \| P_Y), \qquad (46)
    where $P_X \to P_{Y|X} \to P_Y$. It can be checked that the constant in (24) is equal to
    \kappa_\alpha = (\alpha-1)\, I_\alpha(X;Y). \qquad (47)
    Note that $I_1(X;Y) = I(X;Y)$ but, in general, $I_\alpha(X;Y) \ne I_\alpha(Y;X)$.
  • An alternative expression for α-mutual information, which will come in handy in our analysis and which involves neither $P_Y$ nor $P_Y^{[\alpha]}$, is obtained by introducing an auxiliary probability measure $P_{\bar{Y}}$ dominating the collection $\{P_{Y|X=u},\, u \in \mathcal{A}\}$ [27]:
    I_\alpha(X;Y) = \frac{\alpha}{\alpha-1} \log E\left[ E^{\frac{1}{\alpha}}\left[ \exp\left( \alpha\, \imath_{X;\bar{Y}}(X;\bar{Y}) \right) \,\middle|\, \bar{Y} \right] \right], \quad (X, \bar{Y}) \sim P_X \times P_{\bar{Y}}, \qquad (48)
    where
    \imath_{X;\bar{Y}}(x;y) = \log \frac{dP_{Y|X=x}}{dP_{\bar{Y}}}(y). \qquad (49)
    As usual, sometimes it is convenient to fix $\sigma$-finite measures $\mu_X$ and $\mu_Y$ on the input and output spaces which dominate $P_X$ and $\{P_{Y|X=x}\colon x \in \mathcal{A}\}$, respectively, and denote the densities with respect to the reference measures by
    p_{Y|X}(y|x) = \frac{dP_{Y|X=x}}{d\mu_Y}(y), \qquad (50)
    p_X(x) = \frac{dP_X}{d\mu_X}(x). \qquad (51)
    Then, we can write the α-mutual information as
    I_\alpha(P_X, P_{Y|X}) = \frac{\alpha}{\alpha-1} \log \int_{\mathcal{B}} \left( \int_{\mathcal{A}} p_{Y|X}^{\alpha}(y|x)\, p_X(x)\, d\mu_X(x) \right)^{\frac{1}{\alpha}} d\mu_Y(y). \qquad (52)
  • In the special case of discrete alphabets,
    E_0(\rho, P_X, P_{Y|X}) = \rho\, I_{\frac{1}{1+\rho}}(X;Y), \qquad (53)
    where the left side is the familiar Gallager function defined in [36] for $\rho \in (0,1)$ as
    E_0(\rho, P_X, P_{Y|X}) = -\log \sum_{y \in \mathcal{B}} \left( \sum_{x \in \mathcal{A}} P_X(x)\, P_{Y|X}^{\frac{1}{1+\rho}}(y|x) \right)^{1+\rho}. \qquad (54)
  • Fix $\alpha > 0$, $P_{Y|X}\colon \mathcal{A} \to \mathcal{B}$, and a collection $\mathcal{P}$ of probability measures on the input space. Then, we denote
    C_\alpha(\mathcal{P}) = \sup_{P_X \in \mathcal{P}} I_\alpha(P_X, P_{Y|X}). \qquad (55)
    When $\mathcal{P}$ is the set of all input measures, we write simply $C_\alpha$, dubbed the Rényi capacity in [26]. $C_\alpha$ is a measure of the similarity of the family $\{P_{Y|X=x},\, x \in \mathcal{A}\}$, which plays an important role in the analysis of the fundamental limits of information transmission through noisy channels, particularly in the regime of exponentially small error probability. For a long time (e.g., [39]), the cutoff rate $C_{\frac{1}{2}}$ was conjectured to be the maximal rate for which reliable codes with manageable decoding complexity can be found. The zero-error capacity of the discrete memoryless channel with feedback is equal to either zero or [40]
    C_0^{f} = \max_X I_0(X;Y), \qquad (56)
    depending on whether there is $(a_1, a_2) \in \mathcal{A}^2$ such that $P_{Y|X}(\cdot|a_1) \perp P_{Y|X}(\cdot|a_2)$.
  • The related quantity $\max_{P_X} I_{\frac{1}{\alpha}}\big(P_X, P_{Y|X}^{\alpha}\big)$ arises in the study of the fundamental limits of guessing and task completion under mismatch [41,42].
  • While $D(P\|Q)$ is convex in the pair $(P,Q)$, the picture for Rényi divergence is somewhat more nuanced:
    (a) If $\alpha \in (0,1)$, then $D_\alpha(P\|Q)$ is convex in $(P,Q)$.
    (b) If $\alpha > 0$, then $D_\alpha(P\|Q)$ is convex in $Q$ for all $P$ (see [4]).
  • For any fixed pair $(P_{Y|X}, Q_{Y|X})$, $D_\alpha(P_{Y|X}\|Q_{Y|X}|P_X)$ is concave (resp. convex) in $P_X$ if $\alpha \ge 1$ (resp. $\alpha \in (0,1]$) (see [43]).
  • The α-mutual information $I_\alpha(P_X, P_{Y|X})$ is concave in $P_X$ for any fixed $P_{Y|X}$ and $\alpha > 1$ (see [43]). If $\alpha \in (0,1)\cup(1,\infty)$, then the following monotonically increasing function of $I_\alpha(P_X, P_{Y|X})$ is concave in $P_X$:
    \Gamma_\alpha\big( I_\alpha(P_X, P_{Y|X}) \big) = \frac{1}{\alpha-1}\, \varphi_{\frac{1}{\alpha}}\big( I_\alpha(P_X, P_{Y|X}) \big), \qquad (57)
    where $\varphi_\alpha(z) = \exp(z - \alpha z)$ (see [10,43]).
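The definitions above can be exercised numerically for a small discrete channel. The sketch below (hypothetical helper names; natural logarithms) computes the α-response via (25), evaluates the conditional Rényi divergence through the generalized mean (39), and checks the additive decomposition (42) together with the closed-form density expression for $I_\alpha$:

```python
import numpy as np

def renyi_div(p, q, a):
    """D_alpha(P||Q) in nats for strictly positive pmfs p, q."""
    return np.log(np.sum(p**a * q**(1 - a))) / (a - 1)

def cond_renyi_div(px, W, qy, a):
    """D_alpha(P_{Y|X} || Q_Y | P_X): generalized mean of the per-x divergences."""
    d = np.array([renyi_div(W[x], qy, a) for x in range(len(px))])
    return np.log(np.sum(px * np.exp((a - 1) * d))) / (a - 1)

def alpha_response(px, W, a):
    """Alpha-response to P_X: proportional to (E[W(y|X)^a])^(1/a), normalized."""
    r = (px @ W**a)**(1 / a)
    return r / r.sum()

px = np.array([0.25, 0.35, 0.40])
W = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])      # W[x, y] = P_{Y|X}(y|x)
a = 1.7
py_a = alpha_response(px, W, a)
qy = np.array([0.5, 0.25, 0.25])    # an arbitrary reference measure Q_Y

# Additive decomposition (42):
lhs = cond_renyi_div(px, W, py_a, a)
rhs = cond_renyi_div(px, W, qy, a) - renyi_div(py_a, qy, a)

# I_alpha in terms of densities (the alpha/(alpha-1) log-integral form):
I_closed = a / (a - 1) * np.log(np.sum((px @ W**a)**(1 / a)))
```

By (43)–(44), `lhs` is exactly the α-mutual information, so it must also coincide with `I_closed`.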

3. Conditional Rényi Divergence Game

As can be expected from (43), when maximizing the α-mutual information, for fixed $P_{Y|X}$, with respect to the input probability measure, it is interesting to consider a zero-sum game with payoff function
D_\alpha(P_{Y|X} \| Q | P_X)
such that one player tries to maximize it by choosing $P_X \in \mathcal{P}$, where $\mathcal{P}$ is a given collection of input probability measures, and the other player tries to minimize it by choosing the probability measure $Q \in \mathcal{Q}$ on the output space. Balancing simplicity and generality, and motivated by applications, while we allow $\mathcal{P}$ to be a proper subset of the set of all input probability measures, we assume that there are no restrictions on the choice of the output probability measure, and therefore $\mathcal{Q}$ stands for the whole collection of probability measures on the output space. This game also arises in the determination of the worst-case redundancy in (37). In Section 3.1 we consider the important special case in which there exists an input distribution that attains the supremum in (55). In the more general scenario in which the supremum may not be achieved, we cannot identify a saddle point, but we can indeed swap sup and min, as we show in Section 3.2.

3.1. Saddle point

We begin by showing that the maximal α -mutual information input distribution and its α -response form a saddle point.
Theorem 1.
Let $\mathcal{P}$ be a convex set of probability distributions on $\mathcal{A}$ and $\mathcal{Q}$ be the set of all probability distributions on $\mathcal{B}$. Let $\alpha \in (0,1)\cup(1,\infty)$. Suppose that there exists some $P_{X^\star} \in \mathcal{P}$ such that
I_\alpha(P_{X^\star}, P_{Y|X}) = \max_{P_X \in \mathcal{P}} I_\alpha(P_X, P_{Y|X}) < \infty, \qquad (58)
and denote the α-response to $P_{X^\star}$ by $P_{Y^\star}^{[\alpha]}$. Then, for any $(P_X, Q_Y) \in \mathcal{P} \times \mathcal{Q}$,
D_\alpha(P_{Y|X} \| P_{Y^\star}^{[\alpha]} | P_X) \le D_\alpha(P_{Y|X} \| P_{Y^\star}^{[\alpha]} | P_{X^\star}) \qquad (59)
\le D_\alpha(P_{Y|X} \| Q_Y | P_{X^\star}). \qquad (60)
Conversely, if $(P_{X^\star}, P_{Y^\star}^{[\alpha]})$ is a saddle point of $D_\alpha(P_{Y|X} \| \cdot \,|\, \cdot)$, namely, (59)–(60) are satisfied, then $P_{X^\star}$ maximizes the α-mutual information.
Remark 1.
Assuming that $\mathcal{P}$ includes $\delta_x$ (unit mass at $x \in \mathcal{A}$), (59) implies that for any $x \in \mathcal{A}$,
D_\alpha(P_{Y|X=x} \| P_{Y^\star}^{[\alpha]}) \le \max_{P_X \in \mathcal{P}} I_\alpha(P_X, P_{Y|X}). \qquad (61)
We can easily obtain corollaries to Theorem 1 that elucidate useful properties of the saddle point.
Corollary 1.
Let $\alpha \in (0,1)\cup(1,\infty)$. Under the assumptions in Theorem 1, for any $P_X \in \mathcal{P}$, we have
D_\alpha(P_Y^{[\alpha]} \| P_{Y^\star}^{[\alpha]}) \le I_\alpha(P_{X^\star}, P_{Y|X}) - I_\alpha(P_X, P_{Y|X}) < \infty, \qquad (62)
where $P_Y^{[\alpha]}$ is the α-response to $P_X$. Moreover, $P_Y^{[\alpha]} = P_{Y^\star}^{[\alpha]}$ if, in addition to $P_{X^\star}$, $P_X$ also attains $C_\alpha(\mathcal{P}) = \max_{P_X \in \mathcal{P}} I_\alpha(P_X, P_{Y|X})$.
Proof of Corollary 1.
For any $P_X \in \mathcal{P}$,
I_\alpha(P_X, P_{Y|X}) = D_\alpha(P_{Y|X} \| P_Y^{[\alpha]} | P_X) \qquad (63)
= D_\alpha(P_{Y|X} \| P_{Y^\star}^{[\alpha]} | P_X) - D_\alpha(P_Y^{[\alpha]} \| P_{Y^\star}^{[\alpha]}) \qquad (64)
\le D_\alpha(P_{Y|X} \| P_{Y^\star}^{[\alpha]} | P_{X^\star}) - D_\alpha(P_Y^{[\alpha]} \| P_{Y^\star}^{[\alpha]}) \qquad (65)
= I_\alpha(P_{X^\star}, P_{Y|X}) - D_\alpha(P_Y^{[\alpha]} \| P_{Y^\star}^{[\alpha]}), \qquad (66)
where (64) and (65) follow from (42) and (59), respectively. Since the Rényi divergence is nonnegative, $D_\alpha(P_Y^{[\alpha]} \| P_{Y^\star}^{[\alpha]}) = 0$ if $P_X$ also attains $C_\alpha(\mathcal{P})$.  □
Therefore, Corollary 1 implies that the α-responses to all the maximal α-mutual information input distributions must be identical. Moreover, if $\alpha > 1$, then the α-response to any input distribution in $\mathcal{P}$ satisfies $P_Y^{[\alpha]} \ll P_{Y^\star}^{[\alpha]}$.
If $\mathcal{P}$ is the space of all probability distributions on $\mathcal{A}$, then we obtain the following corollary.
Corollary 2.
Unconstrained maximization of α-mutual information. Suppose that $\alpha \in (0,1)\cup(1,\infty)$ and $\mathcal{P}$ contains all probability mass functions on the discrete alphabet $\mathcal{A}$. Fix $P_{Y|X}\colon \mathcal{A} \to \mathcal{B}$. For any input distribution $\bar{P}_X$, denote its support by $\bar{\mathcal{A}}_X \subseteq \mathcal{A}$ and the corresponding α-response by $\bar{P}_Y^{[\alpha]}$.
A necessary and sufficient condition for $\bar{P}_X$ to achieve $\max_X I_\alpha(X;Y) < \infty$ is
\max_{a \in \bar{\mathcal{A}}_X} D_\alpha(P_{Y|X=a} \| \bar{P}_Y^{[\alpha]}) = \min_{a \in \bar{\mathcal{A}}_X} D_\alpha(P_{Y|X=a} \| \bar{P}_Y^{[\alpha]}) \ge \max_{a \in \bar{\mathcal{A}}_X^{c}} D_\alpha(P_{Y|X=a} \| \bar{P}_Y^{[\alpha]}). \qquad (67)
Proof of Corollary 2.
  • $\max_X I_\alpha(X;Y) = I_\alpha(\bar{P}_X, P_{Y|X}) \Rightarrow (67)$: Regardless of whether $\alpha > 1$ or $\alpha < 1$, we see from (45) that if there exists some $x_0 \in \bar{\mathcal{A}}_X$ such that
    D_\alpha\big( P_{Y|X=x_0} \,\big\|\, \bar{P}_Y^{[\alpha]} \big) < \max_{P_X \in \mathcal{P}} I_\alpha(P_X, P_{Y|X}), \qquad (68)
    then $I_\alpha(\bar{P}_X, P_{Y|X}) < \max_{P_X \in \mathcal{P}} I_\alpha(P_X, P_{Y|X})$, which contradicts the assumed optimality of $\bar{P}_X$. Moreover, if there exists some $x_0 \in \bar{\mathcal{A}}_X^{c}$ such that (68) holds with the strict inequality reversed, then (59) would be violated, again contradicting the assumed optimality of $\bar{P}_X$.
  • $(67) \Rightarrow \max_X I_\alpha(X;Y) = I_\alpha(\bar{P}_X, P_{Y|X})$: Again, we see from (45) that if (67) is satisfied, then (59) is satisfied. Since $\bar{P}_Y^{[\alpha]}$ is the α-response to $\bar{P}_X$, (60) is also satisfied, and the converse part of Theorem 1 results in the optimality of $\bar{P}_X$.
 □
Remark 2.
According to Corollary 2, if some input distribution $P_{X^\star}$ achieves $C_\alpha$, we know that the α-response output distribution $P_{Y^\star}^{[\alpha]}$ is equidistant in $D_\alpha(\cdot \,\|\, P_{Y^\star}^{[\alpha]})$ to each of the output distributions in the collection
\mathcal{S} = \{ P_{Y|X=x},\ P_{X^\star}(x) > 0 \} \subset \mathcal{Q}. \qquad (69)
Moreover, we know that the optimal α-response output distribution is actually unique even if there exist several optimal input distributions. So the key is to find the unique centroid of $\mathcal{S}$ when the distance is measured by the Rényi divergence. In contrast to the maximization of the mutual information, the optimal α-response output distribution is no longer a mixture of the conditional output distributions.
Remark 3.
Corollary 2 enables us to recover Gallager’s finite-alphabet result in Theorem 5.6.5 of [44], which characterizes the maximal α-mutual information input distribution if $\alpha \in (0,1)$ when both $\mathcal{A}$ and $\mathcal{B}$ are finite. The optimal input distribution $P_{X^\star}$ must satisfy the following condition:
\sum_{y \in \mathcal{B}} \left( \sum_{x \in \mathcal{A}} P_{Y|X}^{\alpha}(y|x)\, P_{X^\star}(x) \right)^{\frac{1-\alpha}{\alpha}} P_{Y|X}^{\alpha}(y|u) \ge \sum_{y \in \mathcal{B}} \left( \sum_{x \in \mathcal{A}} P_{Y|X}^{\alpha}(y|x)\, P_{X^\star}(x) \right)^{\frac{1}{\alpha}}, \qquad (70)
with equality for all $u$ such that $P_{X^\star}(u) > 0$. To verify this condition, note that Corollary 2 requires that
\exp(\kappa_\alpha^\star) = \exp\left( (\alpha-1)\, C_\alpha \right) \le \sum_{y \in \mathcal{B}} P_{Y|X=u}^{\alpha}(y) \left( P_{Y^\star}^{[\alpha]}(y) \right)^{1-\alpha}, \qquad (71)
with equality if $P_{X^\star}(u) > 0$, and where $\kappa_\alpha^\star$ stands for the normalizing constant in (24) with $P_X \leftarrow P_{X^\star}$. Upon substitution of (25) with $P_X \leftarrow P_{X^\star}$, we obtain (70). The assumption of a finite output alphabet can be easily dispensed with to obtain the more general optimality condition
E\left[ \Psi_\alpha^{1-\alpha}(Y^\star) \exp\left( \alpha\, \imath_{X^\star;Y^\star}(u; Y^\star) \right) \right] \ge E\left[ \Psi_\alpha(Y^\star) \right], \qquad (72)
with equality for all $u$ such that $P_{X^\star}(u) > 0$. In (72), $\imath_{X^\star;Y^\star}$ stands for the information density corresponding to $P_{X^\star} \to P_{Y|X} \to P_{Y^\star}$ and
\Psi_\alpha(y) = E^{\frac{1}{\alpha}}\left[ \exp\left( \alpha\, \imath_{X^\star;Y^\star}(X^\star; y) \right) \right], \quad X^\star \sim P_{X^\star}. \qquad (73)
If $\alpha > 1$, condition (72) holds with the inequality reversed.
Remark 4.
When $\mathcal{B}$ is finite, it was shown in [2,4,25] that for any $\alpha \in [0, \infty]$,
C_\alpha = R_\alpha, \qquad (74)
where $R_\alpha$ is defined in (37). This is now established without imposing finiteness conditions, as long as there is an input that achieves the maximal α-mutual information, because
R_\alpha \le C_\alpha \qquad (75)
= \max_{P_X \in \mathcal{P}} \min_{Q_Y \in \mathcal{Q}} D_\alpha(P_{Y|X} \| Q_Y | P_X) \qquad (76)
\le \min_{Q_Y \in \mathcal{Q}} \max_{P_X \in \mathcal{P}} D_\alpha(P_{Y|X} \| Q_Y | P_X) \qquad (77)
= R_\alpha, \qquad (78)
where (75) follows from particularizing (59) to deterministic $P_X$, and (78) follows from (40).
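The identity $C_\alpha = R_\alpha$ is easy to check numerically in the finite-alphabet case. The sketch below (hypothetical helper names, coarse one-dimensional grids, natural logarithms) compares both sides for a binary asymmetric channel:

```python
import numpy as np

def renyi_div(p, q, a):
    """D_alpha(P||Q) in nats for strictly positive pmfs p, q."""
    return np.log(np.sum(p**a * q**(1 - a))) / (a - 1)

W = np.array([[0.8, 0.2], [0.3, 0.7]])     # W[x, y] = P_{Y|X}(y|x)
a = 0.6
ts = np.linspace(1e-4, 1 - 1e-4, 4001)

# C_alpha: maximize the closed-form alpha-mutual information over P_X = (t, 1-t)
def I_alpha(t):
    px = np.array([t, 1 - t])
    return a / (a - 1) * np.log(np.sum((px @ W**a)**(1 / a)))
C = max(I_alpha(t) for t in ts)

# R_alpha: minimize over Q_Y = (q, 1-q) the worst-case divergence in (37)
def worst(q):
    qy = np.array([q, 1 - q])
    return max(renyi_div(W[x], qy, a) for x in range(2))
R = min(worst(q) for q in ts)
```

Up to grid resolution, the maximin and minimax values coincide, as Remark 4 asserts.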

3.2. Minimax identity

In this section we drop the assumption that there exists an input probability measure that attains the maximal α -mutual information and show that the conditional Rényi divergence still satisfies a minimax identity, even if a saddle point does not exist.
Theorem 2.
Let $\mathcal{P}$ be a convex set of probability distributions on $\mathcal{A}$ and $\mathcal{Q}$ be the set of all probability distributions on $\mathcal{B}$. We have the minimax equality
C_\alpha(\mathcal{P}) = \sup_{P_X \in \mathcal{P}} \min_{Q_Y \in \mathcal{Q}} D_\alpha(P_{Y|X} \| Q_Y | P_X) \qquad (79)
= \min_{Q_Y \in \mathcal{Q}} \sup_{P_X \in \mathcal{P}} D_\alpha(P_{Y|X} \| Q_Y | P_X). \qquad (80)
Furthermore, if $C_\alpha(\mathcal{P}) < \infty$, then there exists a unique element in $\mathcal{Q}$ attaining the minimum in (80).
The assumption of convexity in Theorem 2 is not superfluous, as the following example illustrates.
Example 1.
Let A = B = N and Y = X + N , where N is a geometric random variable on the nonnegative integers with positive mean and independent of X. Let P be the non-convex set of all the deterministic probability measures on A . In this case, the left side of (80) is zero, while the right side is infinity. To see this, note that for any Q Y Q and n N , it follows from the data processing inequality applied to the binary deterministic transformation 1 { Y n } that
D α ( P Y | X = n Q Y ) d α 1 Q Y ( { n + 1 , } ) ,
whose right side diverges as n . Therefore, for any Q Y Q , (39) results in
sup P X P D α ( P Y | X Q Y | P X ) = .
Continuing with the theme of Remark 4, Theorem 2 extends the validity of $R_\alpha = C_\alpha$ without requiring the existence of a maximal α-mutual information input distribution. It was conjectured in [2] (and proved in [26]) that for $\alpha \in (0,\infty)$, if $R_\alpha < \infty$ and $\mathcal{B}$ is finite or countable, there exists a unique redundancy-achieving distribution
Q_Y^\star = \arg\min_{Q_Y} \sup_{x \in \mathcal{A}} D_\alpha(P_{Y|X=x} \| Q_Y), \qquad (83)
and for all probability measures $Q_Y$ on the output space,
\sup_{x \in \mathcal{A}} D_\alpha(P_{Y|X=x} \| Q_Y) \ge C_\alpha + D_\alpha(Q_Y^\star \| Q_Y). \qquad (84)
We can prove the conjecture easily with the help of Theorem 2.
Proof. 
Let $\mathcal{P}$ be the convex set of all probability measures on $\mathcal{A}$. Since $C_\alpha = R_\alpha < \infty$, by Theorem 2 we know there exists a unique $P_{Y^\star}^{[\alpha]}$ such that $\sup_{P_X \in \mathcal{P}} D_\alpha(P_{Y|X} \| P_{Y^\star}^{[\alpha]} | P_X) = C_\alpha$, which implies that $P_{Y^\star}^{[\alpha]}$ is precisely the unique redundancy-achieving distribution in (83). Moreover, as shown in the proof of Theorem 2, we can find a sequence $\{P_{X_n}\}_{n \ge 1}$ in $\mathcal{P}$ such that $I_\alpha(P_{X_n}, P_{Y|X}) \to C_\alpha$ as $n \to \infty$ and such that the corresponding α-responses $P_{Y_n}^{[\alpha]}$ converge to $P_{Y^\star}^{[\alpha]}$ in the total variation metric. Pick an arbitrary $Q_Y \in \mathcal{Q}$. If $\alpha > 1$ and $P_{Y|X=x} \not\ll Q_Y$ for some $x \in \mathcal{A}$, then $\sup_{x \in \mathcal{A}} D_\alpha(P_{Y|X=x} \| Q_Y) = \infty$ and (84) holds. Otherwise, by (42) we always have
D_\alpha(P_{Y|X} \| Q_Y | P_X) = D_\alpha(P_{Y|X} \| P_Y^{[\alpha]} | P_X) + D_\alpha(P_Y^{[\alpha]} \| Q_Y). \qquad (85)
For any $n \ge 1$, since $\mathcal{P}$ includes all probability measures on $\mathcal{A}$, we have
\sup_{x \in \mathcal{A}} D_\alpha(P_{Y|X=x} \| Q_Y) = \sup_{P_X \in \mathcal{P}} D_\alpha(P_{Y|X} \| Q_Y | P_X) \qquad (86)
\ge D_\alpha(P_{Y|X} \| Q_Y | P_{X_n}) \qquad (87)
= D_\alpha(P_{Y|X} \| P_{Y_n}^{[\alpha]} | P_{X_n}) + D_\alpha(P_{Y_n}^{[\alpha]} \| Q_Y) \qquad (88)
= I_\alpha(P_{X_n}, P_{Y|X}) + D_\alpha(P_{Y_n}^{[\alpha]} \| Q_Y), \qquad (89)
where (88) is due to (42). Taking the limit as $n \to \infty$, the lower semicontinuity of the Rényi divergence ensures that
\sup_{x \in \mathcal{A}} D_\alpha(P_{Y|X=x} \| Q_Y) \ge C_\alpha + D_\alpha(P_{Y^\star}^{[\alpha]} \| Q_Y), \qquad (90)
and therefore the sought-after $Q_Y^\star$ is none other than $P_{Y^\star}^{[\alpha]}$, the unique maximal α-mutual information output distribution.  □

4. Finding C α

In this section, we present a number of examples to illustrate how the results in Section 3 can be used to maximize the α-mutual information with respect to the input distribution. It is instructive to contrast the present approach with the maximization of α-mutual information invoking the KKT conditions, which is feasible both in the case α > 1, in which the functional is concave with respect to the input distribution, and in the case α ∈ (0, 1), in which a monotonically increasing function of the α-mutual information is concave. Simple finite-alphabet examples of such an approach can be found in [44] when dealing with the E_0 functional in (54). Thanks to Theorem 1, it is possible to avoid taking derivatives of any functionals.
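As a concrete companion to the examples that follow, the closed-form expression for the α-mutual information of a discrete random transformation (cf. (52) and the E_0 functional in (54)) can be evaluated and maximized by brute force. The sketch below is our own illustration (Python with NumPy; rows of the channel matrix index inputs, and the grid search stands in for the KKT machinery); it recovers the binary symmetric channel result of Example 2:

```python
import numpy as np

def alpha_mi(al, PX, W):
    # Sibson's expression: alpha/(alpha-1) * log sum_y (sum_x P(x) W(y|x)^alpha)^(1/alpha)
    s = (PX @ (W ** al)) ** (1.0 / al)
    return al / (al - 1) * np.log(s.sum())

# binary symmetric channel with crossover probability delta
delta, al = 0.25, 2.0
W = np.array([[1 - delta, delta], [delta, 1 - delta]])
ps = np.linspace(0.001, 0.999, 999)
vals = [alpha_mi(al, np.array([1 - p, p]), W) for p in ps]
p_star = ps[int(np.argmax(vals))]          # maximizing input mass on x = 1
C_grid = max(vals)
# closed form from Example 2 (in nats)
C_formula = np.log(2) + np.log(delta ** al + (1 - delta) ** al) / (al - 1)
```

The grid maximum is attained at the equiprobable input and agrees with the closed form, as Corollary 2 guarantees.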
Example 2 (Binary symmetric channel).
Let the input and output alphabet be A = B = { 0 , 1 } and the random transformation be
P_{Y|X} = [ 1−δ  δ ; δ  1−δ ],  δ ∈ [0, 1].
Let’s try the input distribution P X ( 0 ) = P X ( 1 ) = 0.5 . Then, according to (25), the α-response output distribution is also equiprobable P Y [ α ] ( 0 ) = P Y [ α ] ( 1 ) = 0.5 . Since
D_α(P_{Y|X=0} ‖ P_Y^[α]) = D_α(P_{Y|X=1} ‖ P_Y^[α]) = d_α(δ ‖ 1/2),
the conditions of Corollary 2 are met, P X attains the maximal α-mutual information and therefore,
C_α = d_α(δ ‖ 1/2) = log 2 + (1/(α−1)) log( δ^α + (1−δ)^α ),
which satisfies, according to (31)
lim_{α→1} C_α = log 2 − h(δ),
lim_{α→∞} C_α = log( 2 max{δ, 1−δ} ).
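Both limits can be sanity-checked numerically from the closed form above; the helper names below are ours, and all quantities are evaluated in nats (so the α → 1 limit is log 2 − h(δ) with the natural-log binary entropy):

```python
import numpy as np

def h(d):
    # binary entropy in nats
    return -d * np.log(d) - (1 - d) * np.log(1 - d)

def C_bsc(al, d):
    # closed form for the BSC from Example 2
    return np.log(2) + np.log(d ** al + (1 - d) ** al) / (al - 1)

delta = 0.11
# alpha -> 1: Shannon capacity log 2 - h(delta)
gap_shannon = abs(C_bsc(1 + 1e-6, delta) - (np.log(2) - h(delta)))
# alpha -> infinity: log(2 max{delta, 1 - delta})
gap_inf = abs(C_bsc(1000.0, delta) - np.log(2 * max(delta, 1 - delta)))
```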
Example 3 (Binary erasure channel).
Let the input/output alphabets be A = { 0 , 1 } and B = { 0 , e , 1 } , and the random transformation be
P_{Y|X} = [ 1−δ  0 ; δ  δ ; 0  1−δ ],  δ ∈ [0, 1].
(Departing from usual practice, columns/rows represent input/output letters respectively, i.e., probability vectors are column vectors, although for typographical convenience we show them as row vectors in the text.)
The α-response output distribution to P_X(0) = P_X(1) = 0.5 is
P_Y^[α](0) = P_Y^[α](1) = (1−δ) / ( δ·2^{1/α} + 2(1−δ) ),
P_Y^[α](e) = δ·2^{1/α} / ( δ·2^{1/α} + 2(1−δ) ).
By symmetry,
C_α = D_α(P_{Y|X=0} ‖ P_Y^[α]) = D_α(P_{Y|X=1} ‖ P_Y^[α])
= (1/(α−1)) log[ (1−δ)^α ( (1−δ)/(δ·2^{1/α} + 2(1−δ)) )^{1−α} + δ^α ( δ·2^{1/α}/(δ·2^{1/α} + 2(1−δ)) )^{1−α} ]
= (α/(α−1)) log( (1−δ)·2^{1−1/α} + δ ),
which satisfies (in bits)
lim_{α→1} C_α = 1 − δ,
lim_{α→∞} C_α = log_2(2 − δ).
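The α-response and the closed form for C_α in this example can be confirmed directly from first principles; the script below is our own check (rows of the channel matrix index inputs; columns are the outputs 0, e, 1):

```python
import numpy as np

def alpha_mi(al, PX, W):
    s = (PX @ (W ** al)) ** (1.0 / al)
    return al / (al - 1) * np.log(s.sum())

delta, al = 0.3, 2.0
W = np.array([[1 - delta, delta, 0.0],
              [0.0, delta, 1 - delta]])
PX = np.array([0.5, 0.5])
s = (PX @ (W ** al)) ** (1 / al)
resp = s / s.sum()                 # alpha-response to the equiprobable input

denom = delta * 2 ** (1 / al) + 2 * (1 - delta)
resp0_formula = (1 - delta) / denom
respe_formula = delta * 2 ** (1 / al) / denom
C_formula = al / (al - 1) * np.log((1 - delta) * 2 ** (1 - 1 / al) + delta)

gap_resp = max(abs(resp[0] - resp0_formula), abs(resp[1] - respe_formula))
gap_C = abs(alpha_mi(al, PX, W) - C_formula)
```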
Example 4 (Binary asymmetric channel).
Let the input and output alphabet be A = B = { 0 , 1 } and the random transformation be
P_{Y|X} = [ 1−δ_0  δ_1 ; δ_0  1−δ_1 ],  (δ_0, δ_1) ∈ [0, 1]².
If δ_0 + δ_1 = 1, then I_α(X;Y) = 0 for any input distribution. We will assume δ_0 + δ_1 < 1; otherwise, we can just relabel the output alphabet (0, 1) → (1, 0), or equivalently replace (δ_0, δ_1) by (1−δ_0, 1−δ_1). The condition
D α ( P Y | X = 0 P Y [ α ] ) = D α ( P Y | X = 1 P Y [ α ] )
is now d_α(1−δ_0 ‖ P_Y^[α](0)) = d_α(δ_1 ‖ P_Y^[α](0)), which, in view of (32), yields
P_Y^[α](0) = [ 1 + ( ((1−δ_0)^α − δ_1^α) / ((1−δ_1)^α − δ_0^α) )^{1/(1−α)} ]^{−1}.
We can verify from (25) that this corresponds to the α-response to P_X(1) = p = 1 − P_X(0), where p ∈ [0, 1] is the solution to
( δ_0^α (1−p) + (1−δ_1)^α p ) / ( (1−δ_0)^α (1−p) + δ_1^α p ) = ( ((1−δ_0)^α − δ_1^α) / ((1−δ_1)^α − δ_0^α) )^{α/(1−α)}.
Then,
C_α = D_α(P_{Y|X=0} ‖ P_Y^[α])
= (1/(α−1)) log( (1−δ_0)^α (1−δ_1)^α − δ_0^α δ_1^α ) + log( ((1−δ_0)^α − δ_1^α)^{1/(1−α)} + ((1−δ_1)^α − δ_0^α)^{1/(1−α)} ),
which satisfies
lim_{α→1} C_α = log( exp( h(δ_0)/(1−δ_1−δ_0) ) + exp( h(δ_1)/(1−δ_1−δ_0) ) ) − ( (1−δ_1) h(δ_0) + (1−δ_0) h(δ_1) ) / (1−δ_1−δ_0),
and
lim_{α→∞} C_α = log(2 − δ_0 − δ_1).
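Rather than solving for the optimal p analytically, one can locate the equal-divergence input numerically by bisection, in the spirit of the condition above, and confirm that the common divergence value matches the closed form for C_α. The sketch below is ours (parameter values and the bisection count are arbitrary choices):

```python
import numpy as np

d0, d1, al = 0.1, 0.3, 2.0
W = np.array([[1 - d0, d0],      # input 0
              [d1, 1 - d1]])     # input 1

def renyi_div(a, P, Q):
    return np.log(np.sum(P ** a * Q ** (1 - a))) / (a - 1)

def alpha_response(a, PX, W):
    s = (PX @ (W ** a)) ** (1 / a)
    return s / s.sum()

def div_gap(p):
    # difference of the two per-input divergences from the alpha-response
    r = alpha_response(al, np.array([1 - p, p]), W)
    return renyi_div(al, W[0], r) - renyi_div(al, W[1], r)

lo, hi = 1e-6, 1 - 1e-6
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if div_gap(mid) < 0:
        lo = mid
    else:
        hi = mid
p_star = 0.5 * (lo + hi)
r_star = alpha_response(al, np.array([1 - p_star, p_star]), W)
C_saddle = renyi_div(al, W[0], r_star)

# closed forms for the optimal output mass at 0 and for C_alpha
q0_formula = 1 / (1 + (((1 - d0) ** al - d1 ** al) /
                       ((1 - d1) ** al - d0 ** al)) ** (1 / (1 - al)))
C_formula = (np.log((1 - d0) ** al * (1 - d1) ** al - d0 ** al * d1 ** al) / (al - 1)
             + np.log(((1 - d0) ** al - d1 ** al) ** (1 / (1 - al))
                      + ((1 - d1) ** al - d0 ** al) ** (1 / (1 - al))))
```

For these parameters the equal-divergence output distribution places mass 0.6 on output 0, and the common divergence equals the closed-form capacity.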
Example 5 (Z channel).
Let the input and output alphabet be A = B = { 0 , 1 } and the random transformation be
P_{Y|X} = [ 1  δ ; 0  1−δ ],  δ ∈ [0, 1].
Since this is the special case ( δ 0 , δ 1 ) = ( 0 , δ ) of the binary asymmetric channel we obtain
C_α = log( 1 + ( (1 − δ^α) / (1−δ)^α )^{1/(1−α)} ).
The limit
lim_{α→1} C_α = log( 1 + (1−δ) δ^{δ/(1−δ)} )
coincides with the capacity of the Z-channel originally derived in [45].
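Both the closed form for C_α and its α → 1 limit can be verified numerically; the grid maximization and the near-1 evaluation below are our own illustrative checks (rows of the channel matrix index inputs):

```python
import numpy as np

def alpha_mi(al, PX, W):
    s = (PX @ (W ** al)) ** (1.0 / al)
    return al / (al - 1) * np.log(s.sum())

delta = 0.5
W = np.array([[1.0, 0.0],            # input 0 is noiseless
              [delta, 1 - delta]])   # input 1 flips to 0 with prob. delta

def C_z(al):
    # closed form for the Z channel
    return np.log(1 + ((1 - delta ** al) / (1 - delta) ** al) ** (1 / (1 - al)))

# grid maximization over binary inputs agrees with the closed form at alpha = 2
ps = np.linspace(1e-4, 1 - 1e-4, 9999)
C_grid = max(alpha_mi(2.0, np.array([1 - p, p]), W) for p in ps)
gap_grid = abs(C_grid - C_z(2.0))

# alpha -> 1 recovers the Shannon capacity of the Z channel
C_shannon = np.log(1 + (1 - delta) * delta ** (delta / (1 - delta)))
gap_limit = abs(C_z(1 + 1e-7) - C_shannon)
```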
The next example illustrates a case in which there are multiple optimal input distributions.
Example 6.
Let A = { 0 , 1 , 2 } , B = { 0 , 1 , 2 , 3 } and the random transformation be
P_{Y|X} = [ 1/2−δ  δ  1/2−δ ; δ  1/2−δ  δ ; δ  1/2−δ  δ ; 1/2−δ  δ  1/2−δ ],  δ ∈ (0, 1/2).
Let P_X0 = [1/2, 1/2, 0] and P_X1 = [0, 1/2, 1/2]. It is easy to verify that the corresponding α-responses are the equiprobable distribution on B. To verify that P_X0 and P_X1 attain the maximal α-mutual information, denote P_Y^[α] = [1/4, 1/4, 1/4, 1/4]. For all x ∈ A, we have
D_α(P_{Y|X=x} ‖ P_Y^[α]) = D_α( [δ, 1/2−δ, 1/2−δ, δ] ‖ [1/4, 1/4, 1/4, 1/4] )
= ((2α−1)/(α−1)) log 2 + (1/(α−1)) log( δ^α + (1/2−δ)^α )
= C_α,
where (120) follows from Corollary 2.
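The claim that both input distributions induce the equiprobable α-response, and hence achieve the same α-mutual information, can be checked mechanically. The explicit entries below follow our reading of the channel matrix above (each input's output distribution is a permutation of (1/2−δ, δ, δ, 1/2−δ)); rows index the inputs x = 0, 1, 2:

```python
import numpy as np

def alpha_response(al, PX, W):
    s = (PX @ (W ** al)) ** (1 / al)
    return s / s.sum()

def alpha_mi(al, PX, W):
    s = (PX @ (W ** al)) ** (1.0 / al)
    return al / (al - 1) * np.log(s.sum())

delta, al = 0.1, 2.0
a, b = 0.5 - delta, delta
W = np.array([[a, b, b, a],
              [b, a, a, b],
              [a, b, b, a]])
PX0 = np.array([0.5, 0.5, 0.0])
PX1 = np.array([0.0, 0.5, 0.5])
r0 = alpha_response(al, PX0, W)
r1 = alpha_response(al, PX1, W)
# both responses equal the uniform law on four outputs, and the two
# alpha-mutual informations coincide
gap_resp = float(max(np.abs(r0 - 0.25).max(), np.abs(r1 - 0.25).max()))
gap_mi = abs(alpha_mi(al, PX0, W) - alpha_mi(al, PX1, W))
```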
In the next example C α is constant in α .
Example 7 (Additive phase noise).
Let the input and output alphabet be A = B = [0, 2π) and the random transformation be Y = X + N mod 2π, where N is independent of X and is uniform on the interval [−θ_0, θ_0] with θ_0 ∈ (0, π]. Suppose P_X is uniform on [0, 2π); it is easy to verify that P_Y^[α] is also uniform on [0, 2π). Invoking (26), we obtain
D_α(P_{Y|X=x} ‖ P_Y^[α]) = log(π/θ_0),  x ∈ A,
which according to Theorem 1 must be equal to C α attained by P X .
Example 8 (Additive Gaussian noise).
Let A = B = ℝ and Y = X + N, where N ∼ N(0, σ^2) is independent of X. Fix α ∈ (0, 1) and P > 0. Suppose that the set, P, of allowable input probability measures on A consists of those that satisfy
E[ exp( − α(1−α) X^2 / (2(α^2 P + σ^2)) ) ] ≥ ( (α^2 P + σ^2) / (αP + σ^2) )^{1/2}.
By completing the square, it is easy to verify that P_X* = N(0, P) satisfies (122) with equality. Furthermore, its α-response is P_Y*^[α] = N(0, αP + σ^2). To show that P_X* does indeed attain C_α(P), first we compute
D_α(P_{Y|X=x} ‖ P_Y*^[α]) = (1/2) log( 1 + αP/σ^2 ) − (1/(2(1−α))) log( (αP + σ^2)/(α^2 P + σ^2) ) + ( α x^2 / (2(α^2 P + σ^2)) ) log e.
With (122) and (123), it is straightforward to see that if P_X ∈ P then
D_α(P_{Y|X} ‖ P_Y*^[α] | P_X) ≤ D_α(P_{Y|X} ‖ P_Y*^[α] | P_X*).
Consequently, Theorem 1 establishes that P_X* achieves the maximal α-mutual information, which using (39) is given by
C_α(P) = (1/2) log( 1 + αP/σ^2 ).
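The Gaussian divergence formula (123) can be validated against a direct numerical evaluation of the defining Rényi divergence integral between N(x, σ^2) and the α-response N(0, αP + σ^2). The discretization parameters below are arbitrary choices of ours:

```python
import numpy as np

al, P, sig2, x = 0.5, 1.0, 1.0, 1.3
s = al * P + sig2         # variance of the alpha-response N(0, al*P + sig2)
sa = al ** 2 * P + sig2   # the mixed variance appearing in (123)

def npdf(y, m, v):
    return np.exp(-(y - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

# D_alpha(P || Q) = (alpha - 1)^{-1} log integral p^alpha q^{1-alpha}
y = np.linspace(-40.0, 40.0, 400001)
integrand = npdf(y, x, sig2) ** al * npdf(y, 0.0, s) ** (1 - al)
D_numeric = np.log(np.sum(integrand) * (y[1] - y[0])) / (al - 1)

# closed form (123), in nats (log e = 1)
D_closed = (0.5 * np.log(1 + al * P / sig2)
            - 0.5 / (1 - al) * np.log(s / sa)
            + 0.5 * al * x ** 2 / sa)
gap = abs(D_numeric - D_closed)
```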

5. Proofs

5.1. Proof of Theorem 1

We deal with the converse statement first. If (59)–(60) are satisfied then (P_X*, P_Y*^[α]) is a saddle point, and therefore,
I_α(P_X*, P_{Y|X}) = max_{P_X ∈ P} min_{Q_Y ∈ Q} D_α(P_{Y|X} ‖ Q_Y | P_X)
= min_{Q_Y ∈ Q} max_{P_X ∈ P} D_α(P_{Y|X} ‖ Q_Y | P_X),
which, by definition of α-mutual information, implies that P_X* attains max_{P_X ∈ P} I_α(P_X, P_{Y|X}). To show that the optimal input and its α-response must form a saddle point, first we deal with the case α > 1, in which we can use the concavity of the conditional Rényi divergence. Choose arbitrary ν ∈ (0, 1) and P_X ∈ P. Let P_ν = ν P_X + (1−ν) P_X* and denote its α-response by
Q Y [ α ] ( ν ) = arg min Q Y Q D α ( P Y | X Q Y | P ν ) .
Therefore, Q_Y^[α](0) = P_Y*^[α]. We have
I_α(P_X*, P_{Y|X}) ≥ I_α(P_ν, P_{Y|X})
= D_α(P_{Y|X} ‖ Q_Y^[α](ν) | P_ν)
≥ ν D_α(P_{Y|X} ‖ Q_Y^[α](ν) | P_X) + (1−ν) D_α(P_{Y|X} ‖ Q_Y^[α](ν) | P_X*)
≥ ν D_α(P_{Y|X} ‖ Q_Y^[α](ν) | P_X) + (1−ν) min_{Q_Y ∈ Q} D_α(P_{Y|X} ‖ Q_Y | P_X*)
= ν D_α(P_{Y|X} ‖ Q_Y^[α](ν) | P_X) + (1−ν) I_α(P_X*, P_{Y|X}),
where (129) is due to the assumed optimality of P_X*, (131) holds because D_α(P_{Y|X} ‖ Q_Y | P_X) is concave in P_X for α > 1, and (130) and (133) are due to the definition of α-mutual information.
It follows that
D_α(P_{Y|X} ‖ Q_Y^[α](ν) | P_X) ≤ I_α(P_X*, P_{Y|X}) = D_α(P_{Y|X} ‖ P_Y*^[α] | P_X*).
Since |Q_Y^[α](ν) − Q_Y^[α](0)| → 0 as ν → 0, the lower semicontinuity of Rényi divergence and (134) imply (59).
We now show the desired result for α ∈ (0, 1). The method of proof in this case can also be adapted to the case α > 1, but it is more cumbersome than the foregoing argument, which is able to capitalize on the concavity of the conditional Rényi divergence. The starting point is the expression in (52), which we write as
I_α(P_X, P_{Y|X}) = (α/(α−1)) log f(p_X),
f(r) = ∫_B ( ∫_A p_{Y|X}^α(y|x) r(x) dμ_X(x) )^{1/α} dμ_Y(y),
where we have defined the functional f on the convex cone P̄ of nonnegative functions r on the input space such that r(x) = β (dP/dμ_X)(x) for some β ≥ 0 and P ∈ P. Recall from (48) that when the argument is a density, then we have
f(p_X) = E[ E^{1/α}[ exp( α ı_{X;Ȳ}(X;Ȳ) ) | Ȳ ] ],  (X, Ȳ) ∼ P_X × P_Ȳ.
By virtue of the convexity of ( · ) 1 α , f is a convex functional. Its directional (Gateaux) derivative is given by (note that the assumed finiteness of I α ( X ; Y ) allows swapping of differentiation and integration by means of the dominated convergence theorem)
f′(r; q) = (d/dδ) f(r + δq) |_{δ=0}
= (1/α) ∫_B ( ∫_A p_{Y|X}^α(y|x) r(x) dμ_X(x) )^{(1−α)/α} ∫_A p_{Y|X}^α(y|x) q(x) dμ_X(x) dμ_Y(y).
Define the Lagrangian
L(r, λ) = f(r) − λ ( ∫_A r(x) dμ_X(x) − 1 ).
Since f is convex and (135) is maximized by P_X* among all probability measures in the convex set P, there exists some λ_0 ≥ 0 such that
max_{λ ≥ 0} min_r L(r, λ) = L(p_X*, λ_0),
where the minimization is over the convex cone P̄. It follows from standard convex optimization (e.g., see p. 227 in Reference [46]) that the Gateaux derivative of L(·, λ_0) at p_X* in the direction of any q ∈ P̄ satisfies
L′(p_X*; q, λ_0) ≥ 0,
with equality if q = p_X*. Invoking (139), we obtain
L ( p X ; q , λ 0 ) = 1 α B A p Y | X α ( y | x ) p X ( x ) d μ X ( x ) 1 α α A p Y | X α ( y | x ) q ( x ) d μ X ( x ) d μ Y ( y ) λ 0 A p X ( x ) d μ X ( x ) 1 .
Specializing (142) and its condition for equality to q ≡ p_X we obtain
L′(p_X*; p_X, λ_0) ≥ L′(p_X*; p_X*, λ_0)
which, upon substitution of (143), becomes
f(p_X*) ≤ ∫_A ∫_B p_{Y|X}^α(y|x) ( ∫_A p_{Y|X}^α(y|x′) p_X*(x′) dμ_X(x′) )^{(1−α)/α} p_X(x) dμ_Y(y) dμ_X(x).
Taking 1 α 1 log ( · ) of both sides of (145), invoking (25) and (47), the inequality is reversed and we obtain
D_α(P_{Y|X} ‖ P_Y*^[α] | P_X) + ((1−α)/α) I_α(X;Y) ≤ (1/α) I_α(X;Y),
which upon rearranging is the sought-after inequality (59).

5.2. Proof of Theorem 2

In order to show
inf_{Q_Y ∈ Q} sup_{P_X ∈ P} D_α(P_{Y|X} ‖ Q_Y | P_X) ≤ sup_{P_X ∈ P} min_{Q_Y ∈ Q} D_α(P_{Y|X} ‖ Q_Y | P_X) = C_α(P),
we construct Q_Y* ∈ Q such that
D_α(P_{Y|X} ‖ Q_Y* | P_X) ≤ C_α(P),  for all P_X ∈ P.
Moreover, Q_Y* is indeed the minimizer on the leftmost side of (147), and we may replace the inf with min therein.
The construction of Q_Y* follows the Cauchy-sequence approach used in the proof of Kemperman's result in [47]. Let {P_Xn}_{n≥1} be a sequence of probability distributions in P such that
lim_{n→∞} I_α(P_Xn, P_{Y|X}) = C_α(P).
Fix an arbitrary P_X ∈ P and let P_n ⊂ P denote the convex hull of {P_X, P_X1, …, P_Xn}, which is a compact set. Although I_α(·, P_{Y|X}) may not be concave for α ∈ (0, 1), recall from (57) that the monotonically increasing function Γ_α is such that Γ_α(I_α(·, P_{Y|X})) is concave. So there exists some P_Xn* ∈ P_n that attains C_α(P_n). Thus for any n ≥ 1,
I_α(P_Xn, P_{Y|X}) ≤ I_α(P_Xn*, P_{Y|X}) ≤ C_α(P)
by the definition of C_α(P). The asymptotic optimality of the sequence {P_Xn} implies that I_α(P_Xn*, P_{Y|X}) = C_α(P_n) also converges to C_α(P).
Denote by P_Yn^[α] the α-response to P_Xn*. Then for any m ≥ n ≥ 1, we have
I_α(P_Xn*, P_{Y|X}) = D_α(P_{Y|X} ‖ P_Yn^[α] | P_Xn*)
= D_α(P_{Y|X} ‖ P_Ym^[α] | P_Xn*) − D_α(P_Yn^[α] ‖ P_Ym^[α])
≤ D_α(P_{Y|X} ‖ P_Ym^[α] | P_Xm*) − D_α(P_Yn^[α] ‖ P_Ym^[α])
= I_α(P_Xm*, P_{Y|X}) − D_α(P_Yn^[α] ‖ P_Ym^[α]),
where (151) and (154) are due to the definition of α-mutual information, (152) follows from (42), and (153) holds because of Theorem 1 applied to P_m, since P_Xn* ∈ P_m, as P_n ⊂ P_m for m ≥ n. Rearranging the end-to-end inequality in (151)–(154) results in
D_α(P_Yn^[α] ‖ P_Ym^[α]) ≤ I_α(P_Xm*, P_{Y|X}) − I_α(P_Xn*, P_{Y|X}).
But since I_α(P_Xn*, P_{Y|X}) converges to C_α(P), it is a Cauchy sequence, i.e.,
| I_α(P_Xn*, P_{Y|X}) − I_α(P_Xm*, P_{Y|X}) | → 0,  as n, m → ∞.
Hence, (155) ensures that {P_Yn^[α]} is also a Cauchy sequence in the sense that D_α(P_Yn^[α] ‖ P_Ym^[α]) → 0 as n, m → ∞. By the generalized Pinsker's inequality (35), {P_Yn^[α]} is also a Cauchy sequence in total variation distance, i.e., |P_Yn^[α] − P_Ym^[α]| → 0 as n, m → ∞. Since the space of probability measures is complete in the total variation distance, {P_Yn^[α]}_n must possess a limit, which we denote by P_Y^[α].
Now, by Theorem 1 applied to P n , we have
D_α(P_{Y|X} ‖ P_Yn^[α] | P_X) ≤ D_α(P_{Y|X} ‖ P_Yn^[α] | P_Xn*) ≤ C_α(P),
and since D_α(P ‖ Q) is lower semicontinuous in (P, Q) for α > 0, taking limits as n → ∞ of (157), we obtain
D_α(P_{Y|X} ‖ P_Y^[α] | P_X) ≤ C_α(P).
Next we show that (158) holds for all P_X ∈ P; in other words, the limit P_Y^[α] does not depend on the initial choice of P_X ∈ P. Choose an arbitrary distribution Q_X ∈ P, Q_X ≠ P_X, and introduce the following notation: P_n′ is the convex hull of {Q_X, P_X, P_X1, …, P_Xn}; Q_Xn* is a maximizer of I_α(·, P_{Y|X}) in P_n′; its α-response is Q_Yn^[α]; and Q_Y^[α] is the limit of the sequence {Q_Yn^[α]}_n. Then we have
D_α(P_Yn^[α] ‖ Q_Yn^[α]) = D_α(P_{Y|X} ‖ Q_Yn^[α] | P_Xn*) − D_α(P_{Y|X} ‖ P_Yn^[α] | P_Xn*)
≤ D_α(P_{Y|X} ‖ Q_Yn^[α] | Q_Xn*) − D_α(P_{Y|X} ‖ P_Yn^[α] | P_Xn*)
= I_α(Q_Xn*, P_{Y|X}) − I_α(P_Xn*, P_{Y|X}),
where (159) holds because of (42); (160) is due to Theorem 1 applied to P_n′ and the fact that P_Xn* ∈ P_n ⊂ P_n′ for any n ≥ 1; and (161) is because of the definition of the α-mutual information.
The same argument that led to the conclusion that I_α(P_Xn*, P_{Y|X}) → C_α(P) establishes that I_α(Q_Xn*, P_{Y|X}) → C_α(P). Therefore, taking limits as n → ∞ in (159)–(161) and applying the lower semicontinuity of D_α(P ‖ Q) again, we obtain
D α ( P Y [ α ] Q Y [ α ] ) = 0 ,
and therefore P_Y^[α] = Q_Y^[α]. Since the limiting output distribution is the same whether we use P_n or P_n′, and according to the latter the roles of P_X and Q_X are identical, we conclude that had we defined P_n with Q_X instead of P_X, we would have reached the same limiting output distribution; hence (158) holds for all P_X ∈ P. So we have constructed Q_Y* satisfying (148).
Finally, we show that P Y [ α ] is the only element that achieves
inf_{Q_Y ∈ Q} sup_{P_X ∈ P} D_α(P_{Y|X} ‖ Q_Y | P_X).
Arguing by contradiction, suppose that there exists another P Y ^ such that
sup_{P_X ∈ P} D_α(P_{Y|X} ‖ P_Ŷ | P_X) = C_α(P).
As earlier in the proof, let { P X n P } n be a sequence satisfying (149), and denote the corresponding α -responses by P Y n [ α ] . Then, invoking (42) again we have
D_α(P_{Y|X} ‖ P_Yn^[α] | P_Xn) + D_α(P_Yn^[α] ‖ P_Ŷ) = D_α(P_{Y|X} ‖ P_Ŷ | P_Xn)
≤ C_α(P) < ∞,
where the inequality follows from (163). Using (149) we obtain
D_α(P_Yn^[α] ‖ P_Ŷ) ≤ C_α(P) − D_α(P_{Y|X} ‖ P_Yn^[α] | P_Xn)
→ 0,
and by (35), it follows that
|P_Yn^[α] − P_Ŷ| → 0.
Furthermore, we established above that
|P_Yn^[α] − P_Y^[α]| → 0.
So, by the triangle inequality, we conclude that P Y ^ = P Y [ α ] .
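The construction in this proof (an asymptotically optimal input sequence whose α-responses converge in total variation to the unique optimal output distribution) can be illustrated on a binary symmetric channel, where the optimal input is equiprobable and the α-responses of any input sequence approaching it converge to the uniform output law. A small sketch with our own parameter choices:

```python
import numpy as np

delta, al = 0.2, 0.5
W = np.array([[1 - delta, delta],
              [delta, 1 - delta]])

def alpha_response(a, PX, W):
    s = (PX @ (W ** a)) ** (1 / a)
    return s / s.sum()

# inputs P_Xn approaching the optimal (uniform) input; total variation
# distance between their alpha-responses and the uniform output law
tv = []
for n in range(1, 201):
    PX = np.array([0.5 - 0.5 / (n + 1), 0.5 + 0.5 / (n + 1)])
    r = alpha_response(al, PX, W)
    tv.append(float(np.abs(r - 0.5).sum()))
```

The total variation distances decrease monotonically toward zero, as the Cauchy-sequence argument predicts.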

6. Conclusions

The supremization of α-mutual information with respect to the input distribution plays an important role in various information theoretic settings, most notably in the error exponent of optimal codes operating below capacity. We show that the optimal input distribution (if it exists), together with its α-response, forms a saddle point of the conditional Rényi divergence, and, vice versa, the existence of the saddle point ensures the existence of a maximal α-mutual information input distribution. The application of this result to various discrete and non-discrete settings illustrates the power and generality of this tool, which mirrors a similar result enjoyed by the conditional relative entropy; however, the proof of the latter result is much easier due to the more convenient structure of the objective function. Regardless of whether there exists an input distribution maximizing α-mutual information, there always exists a unique optimal output distribution, which is the limit of the α-responses of any asymptotically optimal sequence of input distributions. Furthermore, a saddle value exists and
sup_{P_X} min_{Q} D_α(P_{Y|X} ‖ Q | P_X) = min_{Q} sup_{P_X} D_α(P_{Y|X} ‖ Q | P_X)
even if we restrict the feasible set of input distributions to an arbitrary convex subset. These results lend further evidence to the notion that, out of all the available Rényi generalizations of mutual information, the α-mutual information defined in (43) is the most convenient and insightful, although I_α^c(X;Y) is also of considerable interest, particularly in the error exponent analysis of channels with cost constraints.

Author Contributions

Both authors contributed to the conceptualization, investigation, results, original draft preparation, editing, revision and response to reviewers.

Funding

This work was partially supported by the US National Science Foundation under Grant CCF-1016625, and in part by the Center for Science of Information, an NSF Science and Technology Center under Grant CCF-0939370.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Rényi, A. On measures of information and entropy. In Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability; Neyman, J., Ed.; University of California Press: Berkeley, CA, USA, 1961; pp. 547–561.
2. Van Erven, T.; Harremoës, P. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory 2014, 60, 3797–3820.
3. Blahut, R.E. Hypothesis testing and information theory. IEEE Trans. Inf. Theory 1974, 20, 405–417.
4. Csiszár, I. Generalized cutoff rates and Rényi's information measures. IEEE Trans. Inf. Theory 1995, 41, 26–34.
5. Shayevitz, O. On Rényi measures and hypothesis testing. In Proceedings of the 2011 IEEE International Symposium on Information Theory, Saint Petersburg, Russia, 31 July–5 August 2011; pp. 894–898.
6. Harremoës, P. Interpretations of Rényi entropies and divergences. Phys. A Stat. Mech. Its Appl. 2006, 365, 57–62.
7. Sason, I.; Verdú, S. Improved bounds on lossless source coding and guessing moments via Rényi measures. IEEE Trans. Inf. Theory 2018, 64, 4323–4346.
8. Haroutunian, E.A. Estimates of the exponent of the error probability for a semicontinuous memoryless channel. Probl. Inf. Transm. 1968, 4, 29–39.
9. Haroutunian, E.A.; Haroutunian, M.E.; Harutyunyan, A.N. Reliability criteria in information theory and in statistical hypothesis testing. Found. Trends Commun. Inf. Theory 2007, 4, 97–263.
10. Polyanskiy, Y.; Verdú, S. Arimoto channel coding converse and Rényi divergence. In Proceedings of the 48th Annual Allerton Conference on Communication, Control, and Computing, Allerton, IL, USA, 29 September–1 October 2010; pp. 1327–1333.
11. Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487.
12. Liese, F.; Vajda, I. Convex Statistical Distances; Number 95 in Teubner Texte zur Mathematik; Teubner: Leipzig, Germany, 1987.
13. Tridenski, S.; Zamir, R.; Ingber, A. The Ziv-Zakai-Renyi bound for joint source-channel coding. IEEE Trans. Inf. Theory 2015, 61, 429–4315.
14. Arimoto, S. An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Trans. Inf. Theory 1972, 18, 14–20.
15. Blahut, R.E. Computation of channel capacity and rate-distortion functions. IEEE Trans. Inf. Theory 1972, 18, 460–473.
16. Blachman, N.M. Communication as a game. Proc. IRE Wescon 1957, 2, 61–66.
17. Borden, J.M.; Mason, D.M.; McEliece, R.J. Some information theoretic saddlepoints. SIAM J. Control Optim. 1985, 23, 129–143.
18. Lapidoth, A.; Narayan, P. Reliable communication under channel uncertainty. IEEE Trans. Inf. Theory 1998, 44, 2148–2177.
19. Kemperman, J. On the Shannon capacity of an arbitrary channel. In Koninklijke Nederlandse Akademie van Wetenschappen, Indagationes Mathematicae; Elsevier: Amsterdam, The Netherlands, 1974; Volume 77, pp. 101–115.
20. Gallager, R.G. Source Coding With Side Information and Universal Coding; Technical Report LIDS-P-937; Lab. Information Decision Systems, Massachusetts Institute of Technology: Cambridge, MA, USA, 1979.
21. Ryabko, B. Encoding a source with unknown but ordered probabilities. Probl. Inf. Transm. 1979, 15, 134–138.
22. Davisson, L.; Leon-Garcia, A. A source matching approach to finding minimax codes. IEEE Trans. Inf. Theory 1980, 26, 166–174.
23. Ryabko, B. Comments on "A Source Matching Approach to Finding Minimax Codes". IEEE Trans. Inf. Theory 1981, 27, 780–781.
24. Haussler, D. A general minimax result for relative entropy. IEEE Trans. Inf. Theory 1997, 43, 1276–1280.
25. Yagli, S.; Altuğ, Y.; Verdú, S. Minimax Rényi redundancy. IEEE Trans. Inf. Theory 2018, 64, 3715–3733.
26. Nakiboğlu, B. The Rényi capacity and center. IEEE Trans. Inf. Theory 2019, 65, 841–860.
27. Verdú, S. α-mutual information. In Proceedings of the 2015 Information Theory and Applications Workshop, San Diego, CA, USA, 1–6 February 2015; pp. 1–6.
28. Arimoto, S. Information Measures and Capacity of Order α for Discrete Memoryless Channels. In Topics in Information Theory, Proceedings of the Coll. Math. Soc. János Bolyai; Bolyai: Keszthely, Hungary, 1975; pp. 41–52.
29. Augustin, U. Noisy Channels. Ph.D. Thesis, Universität Erlangen-Nürnberg, Erlangen, Germany, 1978.
30. Nakiboglu, B. The Augustin capacity and center. arXiv 2018, arXiv:1606.00304.
31. Sibson, R. Information radius. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 1969, 14, 149–160.
32. Lapidoth, A.; Pfister, C. Two measures of dependence. Entropy 2019, 21, 778.
33. Tomamichel, M.; Hayashi, M. Operational interpretation of Rényi information measures via composite hypothesis testing against product and Markov distributions. IEEE Trans. Inf. Theory 2018, 64, 1064–1082.
34. Aishwarya, G.; Madiman, M. Remarks on Rényi versions of conditional entropy and mutual information. In Proceedings of the 2019 IEEE International Symposium on Information Theory, Paris, France, 7–12 July 2019; pp. 1117–1121.
35. Shannon, C.E.; Gallager, R.G.; Berlekamp, E. Lower bounds to error probability for coding on discrete memoryless channels, I. Inf. Control 1967, 10, 65–103.
36. Gallager, R.G. A simple derivation of the coding theorem and some applications. IEEE Trans. Inf. Theory 1965, 11, 3–18.
37. Yagli, S.; Cuff, P. Exact soft-covering exponent. IEEE Trans. Inf. Theory 2019, 65, 6234–6262.
38. Gilardoni, G.L. On Pinsker's and Vajda's type inequalities for Csiszár's f-divergences. IEEE Trans. Inf. Theory 2010, 56, 5377–5386.
39. Massey, J.L. Coding and modulation in digital communications. In Proceedings of the 1974 International Zurich Seminar on Digital Communications, Zurich, Switzerland, 12–15 March 1974; pp. E2(1)–E2(4).
40. Shannon, C.E. The zero error capacity of a noisy channel. IRE Trans. Inf. Theory 1956, 2, 8–19.
41. Sundaresan, R. Guessing under source uncertainty. IEEE Trans. Inf. Theory 2007, 53, 269–287.
42. Bunte, C.; Lapidoth, A. Encoding tasks and Rényi entropy. IEEE Trans. Inf. Theory 2014, 60, 5065–5076.
43. Ho, S.W.; Verdú, S. Convexity/concavity of Rényi entropy and α-mutual information. In Proceedings of the 2015 IEEE International Symposium on Information Theory, Hong Kong, China, 14–19 June 2015; pp. 745–749.
44. Gallager, R.G. Information Theory and Reliable Communication; John Wiley and Sons: New York, NY, USA, 1968; Volume 2.
45. Verdú, S. Channel Capacity. In The Electrical Engineering Handbook, 2nd ed.; IEEE Press: Piscataway, NJ, USA; CRC Press: Boca Raton, FL, USA, 1997; Chapter 73.5; pp. 1671–1678.
46. Luenberger, D. Optimization by Vector Space Methods; John Wiley and Sons: Hoboken, NJ, USA, 1997.
47. Polyanskiy, Y.; Wu, Y. Lecture Notes on Information Theory. 2017. Available online: http://people.lids.mit.edu/yp/homepage/data/itlectures_v5.pdf (accessed on 30 April 2019).

Share and Cite

MDPI and ACS Style

Cai, C.; Verdú, S. Conditional Rényi Divergence Saddlepoint and the Maximization of α-Mutual Information. Entropy 2019, 21, 969. https://doi.org/10.3390/e21100969
