Conditional Rényi Divergence Saddlepoint and the Maximization of α-Mutual Information

Rényi-type generalizations of entropy, relative entropy and mutual information have found numerous applications throughout information theory and beyond. While there is consensus that the ways A. Rényi generalized entropy and relative entropy in 1961 are the “right” ones, several candidates have been put forth as possible mutual informations of order α. In this paper we lend further evidence to the notion that a Bayesian measure of statistical distinctness introduced by R. Sibson in 1969 (closely related to Gallager’s E0 function) is the most natural generalization, lending itself to explicit computation and maximization, as well as closed-form formulas. This paper considers general (not necessarily discrete) alphabets and extends the major analytical results on the saddle-point and saddle-level of the conditional relative entropy to the conditional Rényi divergence. Several examples illustrate the main application of these results, namely, the maximization of α-mutual information with and without constraints.


Introduction
The Rényi divergence of order α between two probability measures defined on the same measurable space, $D_\alpha(P\|Q)$, is a useful generalization of the relative entropy $D(P\|Q)$ introduced by Rényi [1] in the discrete case ($\lim_{\alpha\uparrow 1} D_\alpha(P\|Q) = D(P\|Q)$). Many of the properties satisfied by relative entropy hold for Rényi divergence, such as nonnegativity, convexity, lower semicontinuity, data processing inequality, and additivity for product measures. $D_\alpha(P\|Q)$ can be defined in more generality without requiring $P \ll Q$. A comprehensive survey of the properties satisfied by Rényi divergence can be found in [2]. Just as $D(P\|Q)$, $D_\alpha(P\|Q)$ provides a useful gauge of the distinctness of P and Q, which has found applications in large deviations problems (such as the asymptotic analysis of hypothesis testing [3][4][5]), lossless data compression [4,6,7], data transmission through noisy channels [8][9][10], and statistical physics [11]. If $P_1 \ll P_0$, then Rényi divergence of order α ∈ (0, 1) ∪ (1, ∞) can be expressed in terms of relative entropy through [5] $(1-\alpha)\,D_\alpha(P_1\|P_0) = \min_{P \ll P_1}\{\alpha D(P\|P_1) + (1-\alpha) D(P\|P_0)\}$. (2) Although Rényi divergence is not an f-divergence, it is in one-to-one correspondence with the Hellinger divergence $H_\alpha(P\|Q)$ (e.g., [12]). One of the major applications of relative entropy is to quantify statistical dependence in a joint probability measure by means of the mutual information $I(X;Y) = D(P_{XY}\|P_X \times P_Y)$.
The corresponding straight generalization replacing relative entropy by Rényi divergence is also a measure of dependence but has found scant utility so far (see [6,13]). To explore the generalization that we study in this paper, namely α-mutual information, we need to consider the conditional versions of relative entropy and Rényi divergence. These are defined in general for two random transformations $P_{Y|X}$ and $Q_{Y|X}$ and an unconditional probability measure $P_X$ simply as $D(P_{Y|X}\|Q_{Y|X}|P_X) = D(P_{Y|X}P_X\|Q_{Y|X}P_X)$, $D_\alpha(P_{Y|X}\|Q_{Y|X}|P_X) = D_\alpha(P_{Y|X}P_X\|Q_{Y|X}P_X)$. (6) A major difference between those conditional measures is that while $D(P_{Y|X}\|Q_{Y|X}|P_X)$ is plainly the expectation $\int D(P_{Y|X=x}\|Q_{Y|X=x})\,\mathrm{d}P_X(x)$, the conditional Rényi divergence depends on the function $D_\alpha(P_{Y|X=x}\|Q_{Y|X=x})$ in a more involved way. In this paper, the use of the conditional information measures will be restricted to the special case in which $Q_{Y|X}$ is actually an unconditional measure.
In fact, a more productive way to express mutual information than (4) is the asymmetric expression $I(X;Y) = D(P_{Y|X}\|P_Y|P_X)$ (7) $= \min_{Q} D(P_{Y|X}\|Q|P_X)$. (8) Equation (8) follows from the key additive decomposition formula $D(P_{Y|X}\|Q_Y|P_X) = I(X;Y) + D(P_Y\|Q_Y)$, (9) where $Q_Y$ is an arbitrary measure dominating $P_Y$. We see that (8) is a Bayesian measure of the distinctness of the constellation of probability measures $\{P_{Y|X=x}, x \in \mathcal{A}\}$, sometimes referred to as information radius, where the center of gravity of the constellation is none other than $P_Y$. Equation (8) has proven to be very fertile, particularly when it comes to maximizing $I(X;Y)$ with respect to $P_X$, since the ensuing sup min optimization has a saddle-point if and only if there is an input distribution that attains the maximal mutual information. The convexity of $D(P_{Y|X}\|Q|P_X)$ in Q and concavity (linearity) in $P_X$, along with the minimax theorem, ensure the existence of the saddle-point whenever the set of allowed input distributions is compact. The Arimoto-Blahut algorithm [14,15] for finding max I(X; Y) in finite alphabet settings is also inspired by (8).
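As a minimal numerical sketch of (8) and of the additive decomposition behind it, the following Python snippet evaluates both sides on a small binary channel. The channel matrix `W`, the input `px`, and all helper names are illustrative assumptions and do not come from the paper.

```python
import math

def kl(p, q):
    """Relative entropy D(P||Q) in nats for finite pmfs (q strictly positive)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def cond_kl(px, W, q):
    """Conditional relative entropy D(P_{Y|X} || Q | P_X) as a P_X-average."""
    return sum(px[x] * kl(W[x], q) for x in range(len(px)))

# Hypothetical binary-input, binary-output channel: W[x][y] = P_{Y|X}(y|x).
px = [0.3, 0.7]
W = [[0.9, 0.1], [0.2, 0.8]]

# Output marginal P_Y and mutual information via the asymmetric expression (7).
py = [sum(px[x] * W[x][y] for x in range(2)) for y in range(2)]
mi = cond_kl(px, W, py)

# Residual of the additive decomposition for a sweep of candidate Q_Y:
# D(P_{Y|X}||Q|P_X) - I(X;Y) - D(P_Y||Q) should vanish for every Q.
gaps = []
for k in range(1, 100):
    q = [k / 100, 1 - k / 100]
    gaps.append(cond_kl(px, W, q) - mi - kl(py, q))

print("I(X;Y) in nats:", mi)
print("largest decomposition residual:", max(abs(g) for g in gaps))
```

Because $D(P_Y\|Q_Y) \ge 0$, the vanishing residual shows directly why the minimum in (8) is attained at $Q = P_Y$.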

1.
As mentioned above, the straight generalization $D_\alpha(P_{XY}\|P_X \times P_Y)$ has not yet found wide applicability.

2.
In the discrete case and α ∈ (0, 1) ∪ (1, ∞), Arimoto [28] proposed the definition of the nonnegative quantity where the Rényi entropy [1] and Arimoto-Rényi conditional entropy [28] are with the α-norm of a probability mass function denoted as P α = (∑ x∈A P α (x)) 1 α . Arimoto extended his algorithm in [14] to compute what he called the capacity of order α, for finite-alphabet random transformations and showed that there exist codes of rate R and blocklength n whose error probability is upper bounded by

3.
Augustin [29] and, later, Csiszár [4] defined $I^c_\alpha(X;Y) = \min_{Q_Y} \int D_\alpha(P_{Y|X=x}\|Q_Y)\,\mathrm{d}P_X(x)$. (14) $C^c_\alpha = \max_X I^c_\alpha(X;Y)$ is dubbed the Augustin capacity of order α in [30]. Csiszár [4] showed that for $\alpha \in (\tfrac{1}{2}, 1)$, $I^c_\alpha(X;Y)$ is the intercept on the R-axis of a supporting line of slope $1 - \tfrac{1}{\alpha}$ of the error exponent function for codes of rate R with constant-composition $P_X$. Unfortunately, the minimization in (14) is not amenable to explicit solution.

4.
For the purpose of coming up with a measure of the similarity among a finite collection of probability measures $\{P_{Y|X=x}, x \in \mathcal{A}\}$, weighted by $P_X$ on $\mathcal{A}$, Sibson [31] proposed the information radius of order α as $I_\alpha(X;Y) = \min_{Q_Y} D_\alpha(P_{Y|X}\|Q_Y|P_X)$. (15) As we will see, the minimization in (15) can be solved explicitly. This is the generalization of mutual information we adopt in this paper and which, as in [27], we refer to as α-mutual information. A word of caution is that in [4], the symbols $I_\alpha(X;Y)$ and $K_\alpha(X;Y)$ are used in lieu of what we denote $I^c_\alpha(X;Y)$ and $I_\alpha(X;Y)$, respectively. $C_\alpha = \max_X I_\alpha(X;Y)$ is dubbed the Rényi capacity of order α in [26].

5.
Independently, Lapidoth-Pfister [32] and Tomamichel-Hayashi [33] proposed and showed that it determines the performance of composite hypothesis tests for independence where the hypothesized joint distribution is known but under the independence hypothesis the marginals are unknown. It was shown in [34] that Despite the difference in the definitions of the various versions, it was shown in the discrete setting that [4,28].
Therefore, solving for max X I α (X; Y) carries added significance, whenever one of the other definitions is adopted. Note that (17) and (18) imply that C α = max P X I l α (P X , P Y|X ). A major application for the maximization of I α (X; Y) is in the large deviation analysis of optimal data transmission codes since the sphere-packing error exponent function and the random-coding error exponent function popularized in [35] and [36], respectively, are upper and lower bounds to the channel reliability function, respectively. A function similar to (20) has recently been shown [37] to yield the large deviations behavior of random coding in the setting of channel resolvability. The organization of the paper is as follows. Section 2 states the definitions and properties of the various information measures that are used throughout the paper. In particular, we introduce the key notion of α-response to an input probability measure through a given random transformation. In Section 3 we present the main results (with proofs relegated to Section 5) related to the saddle-point and saddle-value of the conditional Rényi divergence, allowing the optimization to be circumscribed to any convex set of input probability measures. The equivalence of the existence of a probability measure that maximizes α-mutual information and the existence of a saddle point is shown and several illustrative examples of the use of this result in the computation of C α are also given. The fact that a saddle-level exists (i.e., sup min commute) even if there is no input probability measure that achieves the supremum α-mutual information is established, thereby generalizing Kemperman's [19] saddle-level result to Rényi divergence through a different route than that followed in [26].

Notation, Definitions and Properties
2. Let (A, F ) and (B, G ) be measurable spaces, which we refer to as the input and output spaces, respectively, with A and B referred to as the input and output alphabets respectively. P Y|X : A → B denotes a random transformation from A to B, i.e. for any x ∈ A, P Y|X=x (·) is a probability measure on (B, G ), and for any B ∈ G , P Y|X=· (B) is an F -measurable function. For brevity, we will usually drop mention of the underlying σ-fields. If P is a probability measure on A and P Y|X : A → B is a random transformation, the corresponding joint probability measure on A × B is denoted by P P Y|X (or, interchangeably, P Y|X P). The notation P → P Y|X → Q indicates that the output marginal of the joint probability measure P P Y|X is denoted by Q.
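3. The relative information $\imath_{P\|Q}(x)$ between two probability measures P and Q on the same measurable space such that $P \ll Q$ is defined as $\imath_{P\|Q}(x) = \log \frac{\mathrm{d}P}{\mathrm{d}Q}(x)$, where $\frac{\mathrm{d}P}{\mathrm{d}Q}$ is the Radon-Nikodym derivative of P with respect to Q. The relative entropy is $D(P\|Q) = \mathbb{E}[\imath_{P\|Q}(X)]$ with $X \sim P$.
4. Given $P_X \to P_{Y|X} \to P_Y$, the information density is defined as $\imath_{X;Y}(x;y) = \imath_{P_{Y|X=x}\|P_Y}(y)$.
5. Fix α > 0, $P_{Y|X}: \mathcal{A} \to \mathcal{B}$, and a probability measure $P_X$ on $\mathcal{A}$. Then, the output probability measure $P_{Y[\alpha]}$ is called the α-response to $P_X$ if $\imath_{P_{Y[\alpha]}\|P_Y}(y) = \frac{1}{\alpha} \log \mathbb{E}\left[\exp\left(\alpha\,\imath_{X;Y}(X;y) - \kappa_\alpha\right)\right]$, $X \sim P_X$, (24) where $P_X \to P_{Y|X} \to P_Y$, and $\kappa_\alpha$ is a scalar that guarantees that $P_{Y[\alpha]}$ is a probability measure. For notational convenience, we omit the dependence of $\kappa_\alpha$ on $P_X$ and $P_{Y|X}$. Equivalently, if $p_{Y[\alpha]}$ and $p_{Y|X}$ denote the densities with respect to some dominating measure, then (24) becomes $p_{Y[\alpha]}(y) = \exp\left(-\frac{\kappa_\alpha}{\alpha}\right) \left(\mathbb{E}\left[p^\alpha_{Y|X}(y|X)\right]\right)^{1/\alpha}$. (25) In particular, the 1-response to $P_X$ is $P_Y$. In [26], the α-response to $P_X$ is dubbed the order α Rényi mean for prior $P_X$.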
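6. Given two probability measures P and Q on the same measurable space and a scalar α ∈ (0, 1) ∪ (1, ∞), the Rényi divergence of order α between P and Q is defined as [1] $D_\alpha(P\|Q) = \frac{1}{\alpha-1} \log \int p^\alpha q^{1-\alpha}\,\mathrm{d}\mu$, (26) where p and q are the Radon-Nikodym derivatives of P and Q, respectively, with respect to a common dominating σ-finite measure µ. We define $D_1(P\|Q) = D(P\|Q)$ as this coincides with the limit from the left at α = 1. It is also the limit from the right whenever $D_\alpha(P\|Q) < \infty$ for some α > 1. The cases α = 0 and α = ∞ can be defined by taking the corresponding limits. In this work, we only focus on the simple orders of α, i.e. α ∈ (0, 1) ∪ (1, ∞). As we saw in (1), if $P \ll Q$, then (26) becomes $D_\alpha(P\|Q) = \frac{1}{\alpha-1} \log \mathbb{E}\left[\exp\left(\alpha\,\imath_{P\|Q}(Y)\right)\right]$ with $Y \sim Q$.
7. If α ∈ (0, 1) ∪ (1, ∞), then the binary Rényi divergence of order α is given by $d_\alpha(p\|q) = \frac{1}{\alpha-1} \log\left(p^\alpha q^{1-\alpha} + (1-p)^\alpha (1-q)^{1-\alpha}\right)$. Note that $\lim_{\alpha\to 1} d_\alpha(p\|\tfrac{1}{2}) = \log 2 - h(p)$, where the usual binary entropy function is denoted by $h(p) = p \log\frac{1}{p} + (1-p)\log\frac{1}{1-p}$.
8. $D_\alpha(P\|Q) \ge 0$, with equality only if P = Q.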
11. The Rényi divergence satisfies the data-processing inequality. If $P_X \to P_{Y|X} \to P_Y$ and $Q_X \to P_{Y|X} \to Q_Y$, then $D_\alpha(P_X\|Q_X) \ge D_\alpha(P_Y\|Q_Y)$.
12. Gilardoni [38] gave a strengthened Pinsker's inequality upper bounding the square of the total variation distance by where we have used the monotonicity in α of the Rényi divergence.
13. The Rényi divergence is lower semicontinuous in the topology of setwise convergence, i.e., if for In particular, note that (36) holds if |P n − Q n | → 0.
14. In the theory of robust lossless source coding [22,25] the following scalar, called the α-minimax redundancy of $P_{Y|X}$, is an important measure of the worst-case redundancy penalty that ensues when the encoder only knows that the data is generated according to one of the probability measures in the collection $\{P_{Y|X=x}, x \in \mathcal{A}\}$: $R_\alpha = \inf_{Q_Y} \sup_{x \in \mathcal{A}} D_\alpha(P_{Y|X=x}\|Q_Y)$, (37) where the infimum is over all the probability measures on $\mathcal{B}$.
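15. Given input distribution $P_X$ and random transformations $P_{Y|X}$, $Q_{Y|X}$, the conditional Rényi divergence is $D_\alpha(P_{Y|X}\|Q_{Y|X}|P_X) = D_\alpha(P_{Y|X}P_X\|Q_{Y|X}P_X)$. (38) Although (38) also holds for the familiar α = 1 case, in general the conditional Rényi divergence is not the arithmetic average of $D_\alpha(P_{Y|X=x}\|Q_Y)$ with respect to $P_X$ if α ≠ 1. Instead it is a generalized mean, or a scaled cumulant generating function evaluated at α − 1. Specifically, if $X \sim P_X$, then $D_\alpha(P_{Y|X}\|Q_Y|P_X) = \frac{1}{\alpha-1} \log \mathbb{E}\left[\exp\left((\alpha-1)\,D_\alpha(P_{Y|X}(\cdot|X)\|Q_Y)\right)\right]$. (39) Regardless of whether α ∈ (0, 1) or α ∈ (1, ∞), (39) implies that $D_\alpha(P_{Y|X}\|Q_Y|P_X) \le \sup_{x \in \mathcal{A}} D_\alpha(P_{Y|X=x}\|Q_Y)$ (40) $= \sup_{P_X} D_\alpha(P_{Y|X}\|Q_Y|P_X)$, (41) with the supremum in (41) over all input probability measures.
16. The key additive decomposition formula for the mutual information (9) has a nice counterpart for the α-mutual information [27]. Let $P_X \to P_{Y|X} \to P_Y$ and $Q_Y$ be an arbitrary probability measure on $\mathcal{B}$ such that $P_Y \ll Q_Y$. Then, it is easy to verify that $D_\alpha(P_{Y|X}\|Q_Y|P_X) = D_\alpha(P_{Y|X}\|P_{Y[\alpha]}|P_X) + D_\alpha(P_{Y[\alpha]}\|Q_Y)$, (42) a relationship noted by Sibson [31] in the discrete case.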
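17. Given α > 0, $P_X$ and $P_{Y|X}$, the α-mutual information is [27,31] $I_\alpha(X;Y) = \min_{Q_Y} D_\alpha(P_{Y|X}\|Q_Y|P_X)$ (43) $= D_\alpha(P_{Y|X}\|P_{Y[\alpha]}|P_X) = \frac{\alpha}{\alpha-1} \log \mathbb{E}\left[\mathbb{E}^{1/\alpha}\left[\exp\left(\alpha\,\imath_{X;Y}(\bar X;\bar Y)\right) \,\middle|\, \bar Y\right]\right]$ with $(\bar X, \bar Y) \sim P_X \times P_Y$, where $P_X \to P_{Y|X} \to P_Y$. It can be checked that the constant in (24) is equal to $\kappa_\alpha = (\alpha - 1)\,I_\alpha(X;Y)$. Note that $I_1(X;Y) = I(X;Y)$ but, in general, $I_\alpha(X;Y) \ne I_\alpha(Y;X)$.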
18. An alternative expression for α-mutual information, which will come in handy in our analysis and which does not involve either P Y or P Y [α] is obtained by introducing an auxiliary probability measure PȲ dominating the collection {P Y|X=u , u ∈ A} [27]: where As usual, sometimes it is convenient to fix σ-finite measures µ X and µ Y on the input and output spaces which dominate P X and {P Y|X=x : x ∈ A}, respectively, and denote their densities with respect to the reference measures by Then, we can write α-mutual information as 19. In the special case of discrete alphabets, where the left side is the familiar Gallager function defined in [36] for ρ ∈ (0, 1) as (54) 20. Fix α > 0, P Y|X : A → B, and a collection P of probability measures on the input space. Then, we denote When P is the set of all input measures we write simply C α , dubbed the Rényi capacity in [26]. C α is a measure of the similarity of the family {P Y|X=x , x ∈ A}, which plays an important role in the analysis of the fundamental limits of information transmission through noisy channels, particularly in the regime of exponentially small error probability. For a long time (e.g., [39]) the cutoff rate C 1 2 was conjectured to be the maximal rate for which reliable codes with manageable decoding complexity can be found. The zero-error capacity of the discrete memoryless channel with feedback is equal to either zero or [40] depending on whether there is (a 1 , a 2 ) ∈ A 2 such that P Y|X (·|a 1 ) ⊥ P Y|X (·|a 2 ). 21. The related quantity max P X I 1 α (P X , P α Y|X ) arises in the study of the fundamental limits of guessing and task completion under mismatch [41,42]. 22. While D(P Q) is convex in the pair (P, Q), the picture for Rényi divergence is somewhat more nuanced: is convex in Q for all P, (see [4]). [43]). 24. The α-mutual information I α (P X , P Y|X ) is concave in P X for any fixed P Y|X and α > 1 (see [43]). 
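25. If α ∈ (0, 1) ∪ (1, ∞), then the following monotonically increasing function of $I_\alpha(P_X, P_{Y|X})$ is concave in $P_X$: $\frac{1}{\alpha-1} \exp\left(\left(1 - \frac{1}{\alpha}\right) I_\alpha(P_X, P_{Y|X})\right)$.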
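For finite alphabets, the definitions in this section reduce to short sums. The sketch below, in Python with only the standard library, implements the Rényi divergence, the α-response, the α-mutual information and the conditional Rényi divergence, and checks the additive decomposition attributed above to Sibson. The function names, the 3×3 channel `W`, the input `px` and the test measure `q` are all hypothetical choices made for illustration; logarithms are natural.

```python
import math

def renyi_div(p, q, alpha):
    """D_alpha(P||Q) in nats for finite pmfs with q > 0; alpha in (0,1) or (1,inf)."""
    s = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))
    return math.log(s) / (alpha - 1.0)

def alpha_sums(px, W, alpha):
    """Unnormalized alpha-response: g(y) = (sum_x P_X(x) W(y|x)^alpha)^(1/alpha)."""
    return [sum(px[x] * W[x][y] ** alpha for x in range(len(px))) ** (1.0 / alpha)
            for y in range(len(W[0]))]

def alpha_response(px, W, alpha):
    """alpha-response to P_X: g normalized to a pmf."""
    g = alpha_sums(px, W, alpha)
    s = sum(g)
    return [v / s for v in g]

def alpha_mi(px, W, alpha):
    """alpha-mutual information: alpha/(alpha-1) * log sum_y g(y)."""
    return alpha / (alpha - 1.0) * math.log(sum(alpha_sums(px, W, alpha)))

def cond_renyi_div(px, W, q, alpha):
    """Conditional Renyi divergence D_alpha(P_{Y|X} || Q_Y | P_X) for q > 0."""
    s = sum(px[x] * W[x][y] ** alpha * q[y] ** (1.0 - alpha)
            for x in range(len(px)) for y in range(len(q)))
    return math.log(s) / (alpha - 1.0)

# Hypothetical channel (rows are P_{Y|X=x}), input and candidate output measure.
px = [0.2, 0.5, 0.3]
W = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
q = [0.5, 0.25, 0.25]

residuals = []
for alpha in (0.3, 0.5, 2.0, 5.0):
    resp = alpha_response(px, W, alpha)
    lhs = cond_renyi_div(px, W, q, alpha)
    rhs = alpha_mi(px, W, alpha) + renyi_div(resp, q, alpha)
    residuals.append(abs(lhs - rhs))

print("largest residual of the additive decomposition:", max(residuals))
```

Since the second term of the decomposition is nonnegative and vanishes at the α-response, the same computation confirms that the α-response is the minimizer defining the α-mutual information.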

Conditional Rényi Divergence Game
As can be expected from (43), when maximizing α-mutual information, for fixed $P_{Y|X}$, with respect to the input probability measure, it is interesting to consider a zero-sum game with payoff function $D_\alpha(P_{Y|X}\|Q_Y|P_X)$ such that one player tries to maximize it by choosing $P_X \in \mathcal{P}$, where $\mathcal{P}$ is a given collection of input probability measures, and the other player tries to minimize it by choosing the probability measure $Q_Y \in \mathcal{Q}$ on the output space. Balancing simplicity and generality and motivated by applications, while we allow $\mathcal{P}$ to be a proper subset of the set of all input probability measures, we assume that there are no restrictions in the choice of the output probability measure, and therefore $\mathcal{Q}$ stands for the whole collection of probability measures on the output space. This game also arises in the determination of the worst-case redundancy in (37). In Section 3.1 we consider the important special case in which there exists an input distribution that attains the supremum in (55). In the more general scenario in which the supremum may not be achieved, we cannot identify a saddle point but we can indeed swap sup and min as we show in Section 3.2.

Saddle point
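We begin by showing that the maximal α-mutual information input distribution and its α-response form a saddle point.

Theorem 1.
Let $\mathcal{P}$ be a convex set of probability distributions on $\mathcal{A}$ and $\mathcal{Q}$ be the set of all probability distributions on $\mathcal{B}$. Let α ∈ (0, 1) ∪ (1, ∞). Suppose that there exists some $P^*_X \in \mathcal{P}$ such that $I_\alpha(P^*_X, P_{Y|X}) = \max_{P_X \in \mathcal{P}} I_\alpha(P_X, P_{Y|X}) < \infty$, and denote the α-response to $P^*_X$ by $P^*_{Y[\alpha]}$. Then, for any $(P_X, Q_Y) \in \mathcal{P} \times \mathcal{Q}$, $D_\alpha(P_{Y|X}\|P^*_{Y[\alpha]}|P_X) \le D_\alpha(P_{Y|X}\|P^*_{Y[\alpha]}|P^*_X)$ (59) $\le D_\alpha(P_{Y|X}\|Q_Y|P^*_X)$. (60) Conversely, if $(P^*_X, P^*_{Y[\alpha]})$ is a saddle point of $D_\alpha(P_{Y|X}\|\cdot|\cdot)$, namely, (59)-(60) are satisfied, then $P^*_X$ maximizes the α-mutual information.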
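The saddle-point property can be probed numerically on a small example. The following Python sketch uses a hypothetical Z-channel with parameter `delta = 0.3` (our choice, not a value from the paper), locates the maximizing binary input by ternary search (valid here because $I_\alpha$ is unimodal in a binary input, by the concavity properties recalled in Section 2), and then measures by how much the two saddle-point inequalities could be violated on a grid. Helper names are assumptions.

```python
import math

def alpha_sums(px, W, alpha):
    return [sum(px[x] * W[x][y] ** alpha for x in range(len(px))) ** (1.0 / alpha)
            for y in range(len(W[0]))]

def alpha_response(px, W, alpha):
    g = alpha_sums(px, W, alpha)
    s = sum(g)
    return [v / s for v in g]

def alpha_mi(px, W, alpha):
    return alpha / (alpha - 1.0) * math.log(sum(alpha_sums(px, W, alpha)))

def cond_renyi_div(px, W, q, alpha):
    s = sum(px[x] * W[x][y] ** alpha * q[y] ** (1.0 - alpha)
            for x in range(len(px)) for y in range(len(q)))
    return math.log(s) / (alpha - 1.0)

delta = 0.3
W = [[1.0, 0.0], [delta, 1.0 - delta]]  # hypothetical Z-channel, W[x][y]

def saddle_gaps(alpha, grid=200):
    """Return (max over P_X of payoff at Q* minus C, min over Q of payoff at P* minus C)."""
    f = lambda p: alpha_mi([1.0 - p, p], W, alpha)
    lo, hi = 0.0, 1.0
    for _ in range(100):  # ternary search for the maximizing P_X(1)
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    p_star = 0.5 * (lo + hi)
    px_star = [1.0 - p_star, p_star]
    q_star = alpha_response(px_star, W, alpha)
    c = alpha_mi(px_star, W, alpha)
    left = max(cond_renyi_div([1 - k / grid, k / grid], W, q_star, alpha)
               for k in range(grid + 1)) - c
    right = min(cond_renyi_div(px_star, W, [k / grid, 1 - k / grid], alpha)
                for k in range(1, grid)) - c
    return left, right

for alpha in (0.5, 2.0):
    left, right = saddle_gaps(alpha)
    print(f"alpha={alpha}: max_P gap = {left:.2e} (should be <= 0 up to search error), "
          f"min_Q gap = {right:.2e} (should be >= 0)")
```

The first gap is only approximately zero because the ternary search finds the optimal input to finite precision; the second is nonnegative exactly, since the α-response minimizes the payoff for a fixed input.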

Remark 1.
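Assuming that $\mathcal{P}$ includes $\delta_x$ (unit mass at $x \in \mathcal{A}$), (59) implies that for any $x \in \mathcal{A}$, $D_\alpha(P_{Y|X=x}\|P^*_{Y[\alpha]}) \le D_\alpha(P_{Y|X}\|P^*_{Y[\alpha]}|P^*_X) = C_\alpha(\mathcal{P})$. We can easily obtain corollaries to Theorem 1 that elucidate useful properties of the saddle point.

Corollary 1.
Let α ∈ (0, 1) ∪ (1, ∞). Under the assumptions in Theorem 1, for any $P_X \in \mathcal{P}$ with α-response $P_{Y[\alpha]}$, we have $D_\alpha(P_{Y[\alpha]}\|P^*_{Y[\alpha]}) \le C_\alpha(\mathcal{P}) - I_\alpha(P_X, P_{Y|X})$, and $P_{Y[\alpha]} = P^*_{Y[\alpha]}$ if, in addition to $P^*_X$, $P_X$ also attains $C_\alpha(\mathcal{P}) = \max_{P_X \in \mathcal{P}} I_\alpha(P_X, P_{Y|X})$.
Therefore, Corollary 1 implies that the α-responses to all the maximal α-mutual information input distributions must be identical. Moreover, if α > 1, then every α-response to any input distribution in $\mathcal{P}$ satisfies $P_{Y[\alpha]} \ll P^*_{Y[\alpha]}$.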
If P is the space of all probability distributions on A, then we can get the following corollary.

Corollary 2.
Unconstrained maximization of α-mutual information. Suppose that α ∈ (0, 1) ∪ (1, ∞) and $\mathcal{P}$ contains all probability mass functions on the discrete alphabet $\mathcal{A}$. Fix $P_{Y|X}: \mathcal{A} \to \mathcal{B}$. For any input distribution $\bar P_X$, denote its support by $\bar{\mathcal{A}}_X \subset \mathcal{A}$ and the corresponding α-response by $\bar P_{Y[\alpha]}$. A necessary and sufficient condition for $\bar P_X$ to achieve $\max_X I_\alpha(X;Y) < \infty$ is $D_\alpha(P_{Y|X=x}\|\bar P_{Y[\alpha]}) \le I_\alpha(\bar P_X, P_{Y|X})$ for all $x \in \mathcal{A}$, with equality for all $x \in \bar{\mathcal{A}}_X$. (67)

Proof of Corollary 2.
• $\max_X I_\alpha(X;Y) = I_\alpha(\bar P_X, P_{Y|X})$ ⇒ (67): Regardless of whether α > 1 or α < 1, we see from (45) that if there exists some $x_0 \in \bar{\mathcal{A}}_X$ such that $D_\alpha(P_{Y|X=x_0}\|\bar P_{Y[\alpha]}) < I_\alpha(\bar P_X, P_{Y|X})$, (68) then $I_\alpha(\bar P_X, P_{Y|X}) < \max_{P_X \in \mathcal{P}} I_\alpha(P_X, P_{Y|X})$, which contradicts the assumed optimality of $\bar P_X$. Moreover, if there exists some $x_0 \in \bar{\mathcal{A}}^c_X$ such that (68) holds with the strict inequality reversed, then (59) would be violated, again contradicting the assumed optimality of $\bar P_X$.

• (67) ⇒ $\max_X I_\alpha(X;Y) = I_\alpha(\bar P_X, P_{Y|X})$: Again, we see from (45) that if (67) is satisfied, then (59) is satisfied. Since $\bar P_{Y[\alpha]}$ is the α-response to $\bar P_X$, (60) is also satisfied, and the converse part in Theorem 1 results in the optimality of $\bar P_X$.
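The optimality condition of Corollary 2 is easy to test on a toy channel in which one input letter is useless. In the Python sketch below (all names and numbers are illustrative assumptions), inputs 0 and 1 form a binary symmetric channel with crossover 0.1, while input 2 produces an equiprobable output. The candidate input puts mass 1/2 on each of the first two letters; the code checks that the per-letter divergences to its α-response equal the α-mutual information on the support and do not exceed it off the support, and then confirms by brute force over a grid on the simplex that no other input does better.

```python
import math

def renyi_div(p, q, alpha):
    s = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))
    return math.log(s) / (alpha - 1.0)

def alpha_sums(px, W, alpha):
    return [sum(px[x] * W[x][y] ** alpha for x in range(len(px))) ** (1.0 / alpha)
            for y in range(len(W[0]))]

def alpha_response(px, W, alpha):
    g = alpha_sums(px, W, alpha)
    s = sum(g)
    return [v / s for v in g]

def alpha_mi(px, W, alpha):
    return alpha / (alpha - 1.0) * math.log(sum(alpha_sums(px, W, alpha)))

delta = 0.1
W = [[1 - delta, delta], [delta, 1 - delta], [0.5, 0.5]]  # hypothetical channel
px_bar = [0.5, 0.5, 0.0]                                  # candidate optimal input

def check(alpha, n=20):
    """Return (I_alpha at px_bar, per-letter divergences, best I_alpha on a simplex grid)."""
    resp = alpha_response(px_bar, W, alpha)
    i_bar = alpha_mi(px_bar, W, alpha)
    divs = [renyi_div(W[x], resp, alpha) for x in range(3)]
    best = max(alpha_mi([i / n, j / n, (n - i - j) / n], W, alpha)
               for i in range(n + 1) for j in range(n + 1 - i))
    return i_bar, divs, best

for alpha in (0.5, 2.0):
    i_bar, divs, best = check(alpha)
    print(f"alpha={alpha}: I_alpha={i_bar:.6f}, divergences={divs}, grid max={best:.6f}")
```

The grid search is only a sanity check on a finite set of inputs; the guarantee of optimality comes from the corollary itself.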

Remark 2.
According to Corollary 2, if some input distribution $P^*_X$ achieves $C_\alpha$, we know the α-response output distribution $P^*_{Y[\alpha]}$ is equidistant, in the sense of $D_\alpha(\cdot\|P^*_{Y[\alpha]})$, to any of the output distributions in the collection $\mathcal{S} = \{P_{Y|X=x} : P^*_X(x) > 0\}$. Moreover, we know that the optimal α-response output distribution is actually unique even if there exist several optimal input distributions. So the key is to find the unique centroid of $\mathcal{S}$ when the distance is measured by the Rényi divergence. In contrast to the maximization of the mutual information, the optimal α-response output distribution is no longer a mixture of the conditional output distributions.

Remark 3.
Corollary 2 enables us to recover Gallager's finite alphabet result in Theorem 5.6.5 of [44], which characterizes the maximal α-mutual information input distribution if α ∈ (0, 1) when both A and B are finite. The optimal input distribution P * X must satisfy the following condition: with equality for all u such that P * X (u) > 0. To verify this condition, note that Corollary 2 requires that with equality if P * X (u) > 0, and where κ * α stands for the normalizing constant in (24) with P X ← P * X . Upon substitution of (25) with P X ← P * X , we obtain (70). The assumption of finite output alphabet can be easily dispensed with to obtain the more general optimality condition with equality for all u such that P * X (u) > 0. In (72), ı * X;Y stands for the information density corresponding to P * X → P Y|X → P * Y and If α > 1, condition (72) holds with the inequality reversed.

Remark 4.
When $\mathcal{B}$ is finite, it was shown in [2,4,25] that for any α ∈ [0, ∞], $C_\alpha = R_\alpha$, where $R_\alpha$ is defined in (37). This is now established without imposing finiteness conditions, as long as there is an input that achieves the maximal α-mutual information because where (75) follows from particularizing (59) to deterministic $P_X$, and (78) follows from (40).

Minimax identity
In this section we drop the assumption that there exists an input probability measure that attains the maximal α-mutual information and show that the conditional Rényi divergence still satisfies a minimax identity, even if a saddle point does not exist.

Theorem 2.
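Let $\mathcal{P}$ be a convex set of probability distributions on $\mathcal{A}$ and $\mathcal{Q}$ be the set of all probability distributions on $\mathcal{B}$. We have the minimax equality: $\sup_{P_X \in \mathcal{P}} \min_{Q_Y \in \mathcal{Q}} D_\alpha(P_{Y|X}\|Q_Y|P_X) = \min_{Q_Y \in \mathcal{Q}} \sup_{P_X \in \mathcal{P}} D_\alpha(P_{Y|X}\|Q_Y|P_X)$. (80) Furthermore, if $C_\alpha(\mathcal{P}) < \infty$, then there exists a unique element in $\mathcal{Q}$ attaining the minimum in (80).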
The assumption of convexity in Theorem 2 is not superfluous, as the following example illustrates.
Example 1. Let A = B = N and Y = X + N, where N is a geometric random variable on the nonnegative integers with positive mean and independent of X. Let P be the non-convex set of all the deterministic probability measures on A. In this case, the left side of (80) is zero, while the right side is infinity. To see this, note that for any Q Y ∈ Q and n ∈ N, it follows from the data processing inequality applied to the binary deterministic transformation 1{Y ≥ n} that whose right side diverges as n → ∞. Therefore, for any Q Y ∈ Q, (39) results in Continuing with the theme in Remark 4, Theorem 2 extends the validity of R α = C α without requiring the existence of the maximal α-mutual information input distribution. It was conjectured in [2] (and proved in [26]) that for α ∈ (0, ∞), if R α < ∞ and B is finite or countable, there exists a unique redundancy-achieving distribution and for all probability measures Q Y on the output space, We can prove the conjecture easily with the help of Theorem 2.
Proof. Let P be the convex set of all probability measures on A. Since C α = R α < ∞, by Theorem 2, we know there exists a unique P * such that sup P X ∈P D α (P Y|X P * is precisely the unique redundancy-achieving distribution in (83). Moreover, as shown in the proof of Theorem 2, we can find a sequence {P X n } n≥1 in P such that I α (P X n , P Y|X ) → C α as n → ∞ and such that the corresponding α-responses P Y n[α] converge to P * in the total variation metric. Pick an arbitrary Q Y ∈ Q. If α > 1 and P Y|X=x Q Y for some x ∈ A, then sup x∈A D α (P Y|X=x Q Y ) = ∞ and (84) holds. Otherwise, by (42) we always have For any n ≥ 1, since P includes all probability measures on A, we have where (88) is due to (42). Taking the limit as n → ∞, the lower-semicontinuity of the Rényi divergence ensures that and therefore the sought-after Q * Y is none other that P * , the unique maximal α-mutual information output distribution.

Finding C α
In this section, we present a number of examples to illustrate how the results in Section 3 can be used to maximize the α-mutual information with respect to the input distribution. It is instructive to contrast the present approach with the maximization of α-mutual information invoking the KKT conditions, which is feasible in both the case α > 1 in which the functional is concave with respect to the input distribution, and the case α ∈ (0, 1) in which a monotonically increasing function of α-mutual information is concave. Simple finite-alphabet examples of such approach can be found in [44] when dealing with the E 0 functional in (54). Thanks to Theorem 1 it is possible to avoid taking derivatives of any functionals.
Example 2 (Binary symmetric channel). Let the input and output alphabet be A = B = {0, 1} and the random transformation be Let's try the input distribution P * X (0) = P * X (1) = 0.5. Then, according to (25), the α-response output distribution is also equiprobable P * the conditions of Corollary 2 are met, P * X attains the maximal α-mutual information and therefore, which satisfies, according to (31)

Example 3 (Binary erasure channel). Let the input/output alphabets be A = {0, 1} and B = {0, e, 1}, and the random transformation be (Departing from usual practice, columns/rows represent input/output letters respectively, i.e. probability vectors are column vectors, although for typographical convenience we show them as row vectors in the text.) The α-response output distribution to P * By symmetry, which satisfies (in bits)

Example 4 (Binary asymmetric channel). Let the input and output alphabet be A = B = {0, 1} and the random transformation be If δ 0 + δ 1 = 1, then I α (X; Y) = 0 for any input distribution. We will assume δ 0 + δ 1 < 1. Otherwise, we can just relabel the output alphabet (0, 1) ← (1, 0), or equivalently (δ 0 , (0) , which, in view of (32) yields (108) We can verify from (25) that this corresponds to the α-response to P * (109) Then, and

Example 5 (Z channel). Let the input and output alphabet be A = B = {0, 1} and the random transformation be Since this is the special case (δ 0 , δ 1 ) = (0, δ) of the binary asymmetric channel we obtain (115) The limit coincides with the capacity of the Z-channel originally derived in [45].
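The symmetric examples above can be cross-checked numerically. The Python sketch below assumes the usual parameterizations, which are our assumption since the channel matrices are not reproduced here: a binary symmetric channel with crossover probability `delta` and a binary erasure channel with erasure probability `delta`. For the equiprobable input it compares the discrete formula for α-mutual information with two closed forms that we derive directly from that formula (they are not quoted from the paper): $d_\alpha(\delta\|\tfrac12)$ for the symmetric channel and $\frac{\alpha}{\alpha-1}\log\left(\delta + (1-\delta)2^{(\alpha-1)/\alpha}\right)$ for the erasure channel, in nats.

```python
import math

def alpha_mi(px, W, alpha):
    """alpha-mutual information (nats) for a finite channel W[x][y]."""
    s = sum(sum(px[x] * W[x][y] ** alpha for x in range(len(px))) ** (1.0 / alpha)
            for y in range(len(W[0])))
    return alpha / (alpha - 1.0) * math.log(s)

def binary_renyi_div(p, q, alpha):
    """Binary Renyi divergence d_alpha(p||q) in nats."""
    return math.log(p ** alpha * q ** (1 - alpha)
                    + (1 - p) ** alpha * (1 - q) ** (1 - alpha)) / (alpha - 1)

def binary_entropy(p):
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

delta = 0.11                                   # hypothetical channel parameter
bsc = [[1 - delta, delta], [delta, 1 - delta]]
bec = [[1 - delta, delta, 0.0], [0.0, delta, 1 - delta]]  # outputs ordered (0, e, 1)
uniform = [0.5, 0.5]

rows = []
for alpha in (0.5, 2.0, 1.000001):
    bsc_closed = binary_renyi_div(delta, 0.5, alpha)
    bec_closed = alpha / (alpha - 1) * math.log(delta + (1 - delta) * 2 ** ((alpha - 1) / alpha))
    rows.append((alpha, alpha_mi(uniform, bsc, alpha), bsc_closed,
                 alpha_mi(uniform, bec, alpha), bec_closed))

for alpha, bsc_num, bsc_cf, bec_num, bec_cf in rows:
    print(f"alpha={alpha}: BSC {bsc_num:.6f} vs {bsc_cf:.6f}, BEC {bec_num:.6f} vs {bec_cf:.6f}")

# As alpha -> 1 the values approach the Shannon capacities log 2 - h(delta) and (1 - delta) log 2.
print("Shannon limits:", math.log(2) - binary_entropy(delta), (1 - delta) * math.log(2))
```

The order α = 1.000001 is included only to illustrate the limiting behavior; the agreement with the Shannon capacities is approximate at that order, not exact.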
The next example illustrates a case in which there are multiple optimal input distributions.
It is easy to verify that the corresponding α-responses are the equiprobable distribution on B. To verify that P 0 X and P 1 X attain the maximal α-mutual information, denote P * where (120) follows from Corollary 2.
In the next example C α is constant in α.
Example 7 (Additive phase noise). Let the input and output alphabet be A = B = [0, 2π) and the random transformation be Y = (X + N) mod 2π, where N is independent of X and is uniform on the interval $[-\theta_0, \theta_0]$ with $\theta_0 \in (0, \pi]$. If $P^*_X$ is uniform on [0, 2π), it is easy to verify that its α-response $P^*_{Y[\alpha]}$ is also uniform on [0, 2π). Invoking (26), we obtain, for every $x \in \mathcal{A}$, $D_\alpha(P_{Y|X=x}\|P^*_{Y[\alpha]}) = \log\frac{\pi}{\theta_0}$, which according to Theorem 1 must be equal to $C_\alpha$ attained by $P^*_X$.
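A discretized analog makes the independence of α easy to see. The sketch below is our own finite-alphabet stand-in for the phase-noise channel, not a construction from the paper: the circle is replaced by the integers modulo `m`, and the noise is uniform on the `2k + 1` residues within circular distance `k` of zero. With an equiprobable input, the α-mutual information computed from the discrete formula should equal $\log\frac{m}{2k+1}$ for every order, mirroring $\log\frac{\pi}{\theta_0}$.

```python
import math

def alpha_sums(px, W, alpha):
    return [sum(px[x] * W[x][y] ** alpha for x in range(len(px))) ** (1.0 / alpha)
            for y in range(len(W[0]))]

def alpha_mi(px, W, alpha):
    return alpha / (alpha - 1.0) * math.log(sum(alpha_sums(px, W, alpha)))

def alpha_response(px, W, alpha):
    g = alpha_sums(px, W, alpha)
    s = sum(g)
    return [v / s for v in g]

m, k = 12, 2  # hypothetical discretization: 12 phases, noise support of 5 phases

# W[x][y] = 1/(2k+1) if y is within circular distance k of x, else 0.
W = [[1.0 / (2 * k + 1) if min((y - x) % m, (x - y) % m) <= k else 0.0
      for y in range(m)] for x in range(m)]
uniform = [1.0 / m] * m

orders = (0.25, 0.5, 2.0, 8.0)
values = [alpha_mi(uniform, W, a) for a in orders]
target = math.log(m / (2 * k + 1))

for a, v in zip(orders, values):
    print(f"alpha={a}: I_alpha={v:.12f}  (log(m/(2k+1))={target:.12f})")
```

By the symmetry of the construction the α-response to the equiprobable input is again equiprobable, so every per-letter divergence coincides with the value printed above.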

Example 8 (Additive Gaussian noise). Let
Fix α ∈ (0, 1) and P > 0. Suppose that the set, P, of allowable input probability measures on A consists of those that satisfy By completing the square, it is easy to verify that P * X ∼ N (0, P) satisfies (122) with equality. Furthermore, its α-response is P * ∼ N (0, αP + σ 2 ). To show that P * X does indeed attain C α (P ), first we compute With (122) and (123), it is straightforward to see that if P X ∈ P then Consequently, Theorem 1 establishes that P * X achieves the maximal α-mutual information, which using (39) is given by
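The claim about the α-response of the Gaussian input can be checked by brute-force numerical integration. The sketch below assumes the standard additive model Y = X + N with N ∼ N(0, σ²) independent of X, which is our reading of the example title. It evaluates the density form of α-mutual information on a grid for X ∼ N(0, P) and compares the result with $\frac{1}{2}\log\left(1 + \frac{\alpha P}{\sigma^2}\right)$, a closed form we obtain here by completing the square rather than quote from the paper. The grid parameters and function names are illustrative assumptions.

```python
import math

def normal_pdf(t, var):
    """Density of N(0, var) at t."""
    return math.exp(-t * t / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def gaussian_alpha_mi(P, sigma2, alpha, step=0.05, xmax=8.0, ymax=12.0):
    """Riemann-sum evaluation of I_alpha for X ~ N(0, P) and Y = X + N(0, sigma2).

    Returns (I_alpha in nats, variance of the alpha-response)."""
    xs = [-xmax + i * step for i in range(round(2 * xmax / step) + 1)]
    ys = [-ymax + j * step for j in range(round(2 * ymax / step) + 1)]
    px = [normal_pdf(x, P) for x in xs]
    g = []
    for y in ys:
        # inner integral: E[p_{Y|X}(y|X)^alpha], then raised to the power 1/alpha
        inner = step * sum(w * normal_pdf(y - x, sigma2) ** alpha for w, x in zip(px, xs))
        g.append(inner ** (1.0 / alpha))
    s = step * sum(g)  # normalizing constant of the alpha-response
    i_alpha = alpha / (alpha - 1.0) * math.log(s)
    resp_var = step * sum(y * y * v for y, v in zip(ys, g)) / s
    return i_alpha, resp_var

P, sigma2, alpha = 1.0, 1.0, 0.5  # hypothetical parameter values
i_num, var_num = gaussian_alpha_mi(P, sigma2, alpha)

print("numerical I_alpha      :", i_num)
print("0.5*log(1 + alpha*P/s2):", 0.5 * math.log(1.0 + alpha * P / sigma2))
print("alpha-response variance:", var_num, "(expected alpha*P + sigma2 =", alpha * P + sigma2, ")")
```

The variance estimate agreeing with αP + σ² is consistent with the statement that the α-response to N(0, P) is N(0, αP + σ²).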

Proof of Theorem 1
We deal with the converse statement first. If (59)-(60) are satisfied then (P * X , P * ) is a saddle point, and therefore, which, by definition of α-mutual information, implies that P * X attains max P X ∈P I α (P X , P Y|X ). To show that the optimal input and its α-response must form a saddle point, first we deal with the case α > 1, in which we can use the concavity of the conditional Rényi divergence. Choose arbitrary ν ∈ (0, 1) and P X ∈ P. Let P ν = νP X + (1 − ν)P * X and denote its α-response by . We have where (129) is due to the assumption of the optimality of P * X , (131) holds because D α (P Y|X Q Y |P X ) is concave in P X for α > 1 and (130) and (133) are due to the definition of α-mutual information.
We now show the desired result for α ∈ (0, 1). In this case, the method of proof is easy to adapt to the α > 1 case but it is more cumbersome than the foregoing proof, which is able to capitalize on the concavity of the conditional Rényi divergence. The starting point is the expression in (52) which we write as where we have defined the functional f on the convex coneP of nonnegative functions r on the input space such that r(x) = β dP dµ X (x) for some β ≥ 0 and P ∈ P. Recall from (48) that when the argument is a density, then we have By virtue of the convexity of (·) 1 α , f is a convex functional. Its directional (Gateaux) derivative is given by (note that the assumed finiteness of I α (X; Y) allows swapping of differentiation and integration by means of the dominated convergence theorem) Define the Lagrangian Since f is convex and (135) is maximized by P * X among all probability measures on the convex set P, there exists some λ 0 ≥ 0 such that max λ≥0 min r L(r, λ) = L(p * X , λ 0 ), where the minimization is over the convex coneP. It follows from standard convex optimization (e.g., see p. 227 in Reference [46]) that the Gateaux derivative of L(·, λ 0 ) at p * X in the direction of any q ∈P satisfies with equality if q = p * X . Invoking (139), we obtain Specializing (142) and its condition for equality to q ← p X we obtain which, upon substitution of (143), becomes Taking 1 α−1 log(·) of both sides of (145), invoking (25) and (47), the inequality is reversed and we obtain which upon rearranging is the sought-after inequality (59).

Proof of Theorem 2
In order to show Moreover, Q * Y is indeed the minimizer in the leftmost side of (147) and we may replace the inf with min therein.
The construction of Q * Y follows a Cauchy-sequence approach in the proof of Kemperman's result in [47]. Let {P X n } n≥1 be a sequence of probability distributions in P such that lim n→∞ I α (P X n , P Y|X ) = C α (P ).
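notation: $\hat{\mathcal{P}}_n$ is the convex hull of $\{Q_X, P_X, P_{X_1}, \ldots, P_{X_n}\}$; $Q^*_{X_n}$ is a maximizer of $I_\alpha(P_X, P_{Y|X})$ in $\hat{\mathcal{P}}_n$; its α-response is $Q^*_{Y_n[\alpha]}$; and $Q^*_{Y[\alpha]}$ is the limit of the sequence $\{Q^*_{Y_n[\alpha]}\}_n$. Then we have $D_\alpha(P^*_{Y_n[\alpha]}\|Q^*_{Y_n[\alpha]}) = D_\alpha(P_{Y|X}\|Q^*_{Y_n[\alpha]}|P^*_{X_n}) - I_\alpha(P^*_{X_n}, P_{Y|X})$ (159) $\le D_\alpha(P_{Y|X}\|Q^*_{Y_n[\alpha]}|Q^*_{X_n}) - I_\alpha(P^*_{X_n}, P_{Y|X})$ (160) $= I_\alpha(Q^*_{X_n}, P_{Y|X}) - I_\alpha(P^*_{X_n}, P_{Y|X})$, (161) where (159) holds because of (42); (160) is due to Theorem 1 applied to $\hat{\mathcal{P}}_n$ and the fact that $P^*_{X_n} \in \mathcal{P}_n \subset \hat{\mathcal{P}}_n$ for any n ≥ 1; and (161) is because of the definition of the α-mutual information.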
The same argument that led to the conclusion that I α (P * X n , P Y|X ) → C α (P ) establishes that I α (Q * X n , P Y|X ) → C α (P ). Therefore, taking limits as n → ∞ in (159)-(161) and applying the lower-semicontinuity of D α (P Q) again, we obtain and therefore P * . Since the limiting output distribution is the same whether we use P n or P n and according to the latter the roles of P X and Q X are identical, we conclude that had we defined P n with Q X instead of P X , we would have reached the same limiting output distribution and (158) holds for all P X ∈ P. So we have constructed Q * Y satisfying (148). Finally, we show that P * is the only element that achieves inf Q Y ∈Q sup P X ∈P D α (P Y|X Q Y |P X ).
Arguing by contradiction, suppose that there exists another P Y such that sup P X ∈P D α (P Y|X P Y |P X ) = C α (P ).
As earlier in the proof, let {P X n ∈ P } n be a sequence satisfying (149), and denote the corresponding α-responses by P Y n [α] . Then, invoking (42) again we have D α (P Y|X P Y n[α] |P X n ) + D α (P Y n[α] P Y ) = D α (P Y|X P Y |P X n ) (164) where the inequality follows from (163). Using (149) we obtain and by (35), it follows that Furthermore, we established above that So, by the triangle inequality, we conclude that P Y = P * .

Conclusions
The maximization of α-mutual information with respect to the input distribution plays an important role in various information theoretic settings, most notably in the error exponent of optimal codes operating below capacity. We show that the optimal input distribution (if it exists), together with its α-response, forms a saddle point of the conditional Rényi divergence and, conversely, that the existence of a saddle point ensures the existence of a maximal α-mutual information input distribution. The application of this result to various discrete and non-discrete settings illustrates the power and generality of this tool, which mirrors a similar result enjoyed by conditional relative entropy; however, the proof of the latter result is much easier due to the more convenient structure of the objective function. Regardless of whether there exists an input distribution maximizing α-mutual information, there always exists a unique optimal output distribution, which is the limit of the α-responses of any asymptotically optimal sequence of input distributions. Furthermore, a saddle-value exists and $\sup_{P_X} \min_{Q} D_\alpha(P_{Y|X}\|Q|P_X) = \min_{Q} \sup_{P_X} D_\alpha(P_{Y|X}\|Q|P_X)$ even if we restrict the feasible set of input distributions to be an arbitrary convex subset. These results lend further evidence to the notion that, out of all the available Rényi generalizations of mutual information, the α-mutual information defined as in (43) is the most convenient and insightful, although $I^c_\alpha(X;Y)$ is also of considerable interest, particularly in the error exponent analysis of channels with cost constraints.