Conditional Rényi Divergence Saddlepoint and the Maximization of α-Mutual Information
Abstract
1. Introduction
- As mentioned above, the most straightforward generalization has not yet found wide applicability.
 - In the discrete case, Arimoto [28] proposed the definition of the nonnegative quantity $I_\alpha^{\mathrm{A}}(X;Y) = H_\alpha(X) - H_\alpha^{\mathrm{A}}(X|Y)$, where the Rényi entropy [1] and the Arimoto-Rényi conditional entropy [28] are $H_\alpha(X) = \frac{\alpha}{1-\alpha}\log \|P_X\|_\alpha$ and $H_\alpha^{\mathrm{A}}(X|Y) = \frac{\alpha}{1-\alpha}\log \mathbb{E}\big[\|P_{X|Y}(\cdot\,|\,Y)\|_\alpha\big]$, with the $\alpha$-norm of a probability mass function $P$ denoted as $\|P\|_\alpha = \big(\sum_{x} P^\alpha(x)\big)^{1/\alpha}$. Arimoto extended his algorithm in [14] to compute what he called the capacity of order $\alpha$, $C_\alpha^{\mathrm{A}} = \max_{P_X} I_\alpha^{\mathrm{A}}(X;Y)$, for finite-alphabet random transformations, and showed that for $\alpha \in (\tfrac12, 1)$ there exist codes of rate $R$ and blocklength $n$ whose error probability is upper bounded by $\exp\big(-n\,\tfrac{1-\alpha}{\alpha}\,(C_\alpha^{\mathrm{A}} - R)\big)$.
 - Augustin [29] and, later, Csiszár [4] defined $\min_{Q_Y} \mathbb{E}\big[D_\alpha\big(P_{Y|X}(\cdot\,|\,X)\,\|\,Q_Y\big)\big]$ (14), whose maximum over all input distributions is dubbed the Augustin capacity of order $\alpha$ in [30]. Csiszár [4] showed that for $\alpha \in (0,1)$, (14) is the intercept on the R-axis of a supporting line of slope $\frac{\alpha-1}{\alpha}$ of the error exponent function for codes of rate R with constant composition $P_X$. Unfortunately, the minimization in (14) is not amenable to explicit solution.
 - For the purpose of coming up with a measure of the similarity among a finite collection of probability measures $P_1, \ldots, P_m$, weighted by a probability mass function $(w_1, \ldots, w_m)$, Sibson [31] proposed the information radius of order $\alpha$ as $\min_{Q}\, \frac{1}{\alpha-1}\log \sum_{i=1}^{m} w_i \exp\big((\alpha-1)\, D_\alpha(P_i\|Q)\big)$ (15). As we will see, the minimization in (15) can be solved explicitly. This is the generalization of mutual information we adopt in this paper and which, as in [27], we refer to as $\alpha$-mutual information, $I_\alpha(X;Y)$, obtained when the collection is $\{P_{Y|X=x}\}$ weighted by $P_X$. A word of caution is that [4] uses different symbols for these quantities than we do. $\max_{P_X} I_\alpha(X;Y)$ is dubbed the Rényi capacity of order $\alpha$ in [26]. A numerical comparison of (15) with Arimoto's definition is sketched right after this list.
 - Independently, Lapidoth-Pfister [32] and Tomamichel-Hayashi [33] proposed $J_\alpha(X;Y) = \min_{Q_X,\, Q_Y} D_\alpha\big(P_{XY}\,\|\,Q_X \times Q_Y\big)$ and showed that it determines the performance of composite hypothesis tests for independence in which the hypothesized joint distribution is known but, under the independence hypothesis, the marginals are unknown. Its relationship to the other order-$\alpha$ generalizations of mutual information is studied in [34].
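To make the preceding definitions concrete, here is a minimal NumPy sketch (the example channel, input distribution, and function names are ours, chosen purely for illustration) that evaluates Arimoto's quantity and Sibson's $\alpha$-mutual information from their discrete closed forms; for a fixed input distribution the two generally differ, and both approach the Shannon mutual information as $\alpha \to 1$.

```python
import numpy as np

def shannon_mi(P_X, W):
    """Shannon mutual information I(X;Y) in nats; W[x, y] = P(y | x)."""
    P_XY = P_X[:, None] * W
    P_Y = P_XY.sum(axis=0)
    mask = P_XY > 0
    return float(np.sum(P_XY[mask] * np.log(P_XY[mask] / (P_X[:, None] * P_Y[None, :])[mask])))

def arimoto_alpha_mi(P_X, W, alpha):
    """Arimoto's I_alpha^A = H_alpha(X) - H_alpha^A(X|Y), discrete case, natural logs."""
    H_alpha = alpha / (1 - alpha) * np.log(np.sum(P_X ** alpha) ** (1 / alpha))
    P_XY = P_X[:, None] * W
    P_Y = P_XY.sum(axis=0)
    posteriors = P_XY / P_Y[None, :]                      # column y holds P_{X|Y=y}
    alpha_norms = np.sum(posteriors ** alpha, axis=0) ** (1 / alpha)
    H_alpha_cond = alpha / (1 - alpha) * np.log(np.sum(P_Y * alpha_norms))
    return H_alpha - H_alpha_cond

def sibson_alpha_mi(P_X, W, alpha):
    """Sibson's alpha-mutual information from its closed form."""
    inner = np.sum(P_X[:, None] * W ** alpha, axis=0)     # E[ W(y|X)^alpha ] for each y
    return alpha / (alpha - 1) * np.log(np.sum(inner ** (1 / alpha)))

# An arbitrary 2-input, 3-output channel and input distribution, chosen only for illustration.
W = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.1, 0.8]])
P_X = np.array([0.4, 0.6])

for alpha in (0.5, 0.999, 2.0):
    print(f"alpha={alpha}:  Arimoto {arimoto_alpha_mi(P_X, W, alpha):.4f}"
          f"   Sibson {sibson_alpha_mi(P_X, W, alpha):.4f}")
print(f"Shannon mutual information: {shannon_mi(P_X, W):.4f}")
```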
 
2. Notation, Definitions and Properties
- If $(\mathcal{A}, \mathscr{F}, P)$ is a probability space, $X \sim P$ indicates $\mathbb{P}[X \in F] = P(F)$ for all $F \in \mathscr{F}$.
 - Let $(\mathcal{A}, \mathscr{F})$ and $(\mathcal{B}, \mathscr{G})$ be measurable spaces, which we refer to as the input and output spaces, respectively, with $\mathcal{A}$ and $\mathcal{B}$ referred to as the input and output alphabets, respectively. $P_{Y|X} \colon \mathcal{A} \to \mathcal{B}$ denotes a random transformation from $\mathcal{A}$ to $\mathcal{B}$, i.e., for any $x \in \mathcal{A}$, $P_{Y|X=x}(\cdot)$ is a probability measure on $(\mathcal{B}, \mathscr{G})$, and for any $G \in \mathscr{G}$, $P_{Y|X=\cdot}(G)$ is an $\mathscr{F}$-measurable function. For brevity, we will usually drop mention of the underlying $\sigma$-fields. If P is a probability measure on $\mathcal{A}$ and $P_{Y|X}$ is a random transformation, the corresponding joint probability measure on $\mathcal{A}\times\mathcal{B}$ is denoted by $P\, P_{Y|X}$ (or, interchangeably, $P_{Y|X}\, P$). The notation $P \to P_{Y|X} \to Q$ indicates that the output marginal of the joint probability measure $P\, P_{Y|X}$ is denoted by Q.
 - The relative information between two probability measures P and Q on the same measurable space such that $P \ll Q$ is defined as $\imath_{P\|Q}(x) = \log \frac{\mathrm{d}P}{\mathrm{d}Q}(x)$, where $\frac{\mathrm{d}P}{\mathrm{d}Q}$ is the Radon-Nikodym derivative of P with respect to Q. The relative entropy is $D(P\|Q) = \mathbb{E}\big[\imath_{P\|Q}(X)\big]$ with $X \sim P$.
 - Given $P_X \to P_{Y|X} \to P_Y$, the information density is defined as $\imath_{X;Y}(x;y) = \imath_{P_{Y|X=x}\|P_Y}(y)$.
 - Fix $\alpha \in (0,1)\cup(1,\infty)$, $P_{Y|X}$, and a probability measure $P_X$ on the input space. Then, the output probability measure $Q^*_\alpha$ is called the $\alpha$-response to $P_X$ if $\frac{\mathrm{d}Q^*_\alpha}{\mathrm{d}\nu}(y) = \frac{1}{\kappa_\alpha}\big(\mathbb{E}\big[\big(\tfrac{\mathrm{d}P_{Y|X=X}}{\mathrm{d}\nu}(y)\big)^{\alpha}\big]\big)^{1/\alpha}$ (24), where $X \sim P_X$, $\nu$ is a $\sigma$-finite measure dominating $\{P_{Y|X=x},\ x \in \mathcal{A}\}$, and $\kappa_\alpha$ is a scalar that guarantees that $Q^*_\alpha$ is a probability measure. For notational convenience, we omit the dependence of $Q^*_\alpha$ and $\kappa_\alpha$ on $P_X$ and $P_{Y|X}$. Equivalently, if $q^*_\alpha$ and $p_{Y|X}$ denote the densities with respect to some dominating measure, then (24) becomes $q^*_\alpha(y) = \frac{1}{\kappa_\alpha}\big(\mathbb{E}\big[p^{\alpha}_{Y|X}(y|X)\big]\big)^{1/\alpha}$ (25). In particular, the 1-response to $P_X$ is the output distribution $P_Y$ with $P_X \to P_{Y|X} \to P_Y$. In [26], the $\alpha$-response to $P_X$ is dubbed the order-$\alpha$ Rényi mean for prior $P_X$.
 - Given two probability measures P and Q on the same measurable space and a scalar $\alpha \in (0,1)\cup(1,\infty)$, the Rényi divergence of order $\alpha$ between P and Q is defined as [1] $D_\alpha(P\|Q) = \frac{1}{\alpha-1}\log\int p^{\alpha}\, q^{1-\alpha}\, \mathrm{d}\mu$ (26), where p and q are the Radon-Nikodym derivatives of P and Q, respectively, with respect to a common dominating $\sigma$-finite measure $\mu$. We define $D_1(P\|Q) = D(P\|Q)$, as this coincides with the limit from the left at $\alpha = 1$; it is also the limit from the right whenever $D_\alpha(P\|Q) < \infty$ for some $\alpha > 1$. The cases $\alpha = 0$ and $\alpha = \infty$ can be defined by taking the corresponding limits. In this work, we only focus on the simple orders of $\alpha$, i.e., $\alpha \in (0,1)\cup(1,\infty)$. As we saw in (1), if $P \ll Q$ and $X \sim P$, then (26) becomes $D_\alpha(P\|Q) = \frac{1}{\alpha-1}\log\mathbb{E}\big[\exp\big((\alpha-1)\,\imath_{P\|Q}(X)\big)\big]$.
 - If $\alpha \in (0,1)\cup(1,\infty)$ and $p, q \in [0,1]$, then the binary Rényi divergence of order $\alpha$ is given by $d_\alpha(p\|q) = \frac{1}{\alpha-1}\log\big(p^\alpha q^{1-\alpha} + (1-p)^\alpha (1-q)^{1-\alpha}\big)$. Note that, as $\alpha \to 1$, it converges to the binary relative entropy $d(p\|q) = -h(p) - p\log q - (1-p)\log(1-q)$, where the usual binary entropy function is denoted by $h(\cdot)$. Given $p_0 \ne p_1$, the solution to the equalizer equation $d_\alpha(p_0\|q) = d_\alpha(p_1\|q)$ is given explicitly by $\frac{q}{1-q} = \Big(\frac{(1-p_1)^\alpha - (1-p_0)^\alpha}{p_0^\alpha - p_1^\alpha}\Big)^{\frac{1}{1-\alpha}}$.
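As a quick numerical illustration of (26) and of its binary special case (the pmfs and function names below are ours, for illustration only), the following sketch evaluates $D_\alpha(P\|Q)$ at several simple orders, returns $\infty$ when $\alpha > 1$ and $P \not\ll Q$, and shows the values approaching the relative entropy as $\alpha \to 1$.

```python
import numpy as np

def renyi_divergence(P, Q, alpha):
    """D_alpha(P || Q) in nats for pmfs on a common finite alphabet, simple order alpha != 1."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    if alpha > 1 and np.any((P > 0) & (Q == 0)):
        return np.inf                         # P is not absolutely continuous w.r.t. Q
    mask = P > 0                              # terms with P(x) = 0 contribute nothing
    return float(np.log(np.sum(P[mask] ** alpha * Q[mask] ** (1 - alpha))) / (alpha - 1))

def binary_renyi_divergence(p, q, alpha):
    """d_alpha(p || q): Rényi divergence between Bernoulli(p) and Bernoulli(q)."""
    return renyi_divergence([p, 1 - p], [q, 1 - q], alpha)

def relative_entropy(P, Q):
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

P, Q = [0.5, 0.3, 0.2], [0.2, 0.5, 0.3]
for alpha in (0.5, 0.9, 0.999, 2.0):
    print(f"D_{alpha}(P||Q) = {renyi_divergence(P, Q, alpha):.4f}")
print(f"D(P||Q)        = {relative_entropy(P, Q):.4f}   (limit as alpha -> 1)")
print(f"d_0.5(0.3||0.6) = {binary_renyi_divergence(0.3, 0.6, 0.5):.4f}")
```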
- $D_\alpha(P\|Q) \ge 0$, with equality only if $P = Q$.
 - $D_\alpha(P\|Q)$ is monotonically non-decreasing in $\alpha$.
 - While we may have $P \not\ll Q$ and $D_\alpha(P\|Q) < \infty$ simultaneously, for any $\alpha \in (0,1)$, $D_\alpha(P\|Q) = \infty$ is equivalent to P and Q being orthogonal. Conversely, if $D_\alpha(P\|Q) < \infty$ for some $\alpha \ge 1$, then $P \ll Q$.
 - The Rényi divergence satisfies the data-processing inequality: if $P_X \to P_{Y|X} \to P_Y$ and $Q_X \to P_{Y|X} \to Q_Y$, then $D_\alpha(P_Y\|Q_Y) \le D_\alpha(P_X\|Q_X)$; a quick numerical check is given below.
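The following short check (with an arbitrary input pair and random transformation of our choosing) prints, for several orders, the divergence before and after the channel; on this example the values are non-negative, non-decreasing in $\alpha$, and satisfy the data-processing inequality, as the properties above require. It is a sanity check, not a proof.

```python
import numpy as np

def renyi_div(P, Q, alpha):
    # Rényi divergence of order alpha between pmfs with full support (no zero handling needed here).
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    return float(np.log(np.sum(P ** alpha * Q ** (1 - alpha))) / (alpha - 1))

P_X = np.array([0.6, 0.3, 0.1])
Q_X = np.array([0.2, 0.3, 0.5])
W = np.array([[0.8, 0.2],        # a ternary-input, binary-output random transformation
              [0.3, 0.7],
              [0.5, 0.5]])
P_Y, Q_Y = P_X @ W, Q_X @ W      # outputs of P_X -> W -> P_Y and Q_X -> W -> Q_Y

for alpha in (0.3, 0.7, 1.5, 3.0):
    unprocessed = renyi_div(P_X, Q_X, alpha)
    processed = renyi_div(P_Y, Q_Y, alpha)
    print(f"alpha={alpha}:  D(P_X||Q_X) = {unprocessed:.4f}   "
          f"D(P_Y||Q_Y) = {processed:.4f}   DPI holds: {processed <= unprocessed}")
```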
 - Gilardoni [38] gave a strengthened Pinsker inequality upper bounding the square of the total variation distance in terms of the Rényi divergence of order $\alpha \in (0,1]$; combined with the monotonicity of $D_\alpha(P\|Q)$ in $\alpha$, it yields such a bound for every positive order.
 - The Rényi divergence is lower semicontinuous in the topology of setwise convergence, i.e., if for every event $F$, $P_n(F) \to P(F)$ and $Q_n(F) \to Q(F)$, then $D_\alpha(P\|Q) \le \liminf_{n\to\infty} D_\alpha(P_n\|Q_n)$ (36). In particular, note that (36) holds if $P_n \to P$ and $Q_n \to Q$ in total variation.
 - In the theory of robust lossless source coding [22,25], the following scalar, called the $\alpha$-minimax redundancy of the collection $\{P_\theta : \theta \in \Theta\}$, is an important measure of the worst-case redundancy penalty that ensues when the encoder only knows that the data is generated according to one of the probability measures in the collection: $\inf_{Q}\, \sup_{\theta \in \Theta} D_\alpha(P_\theta\,\|\,Q)$, where the infimum is over all the probability measures on the data alphabet.
 - Given input distribution $P_X$ and random transformations $P_{Y|X}$, $Q_{Y|X}$, the conditional Rényi divergence of order $\alpha$ is $D_\alpha(P_{Y|X}\,\|\,Q_{Y|X}\,|\,P_X) = D_\alpha(P_X\, P_{Y|X}\,\|\,P_X\, Q_{Y|X})$ (38). Although (38) also holds for the familiar $\alpha = 1$ case, in general the conditional Rényi divergence is not the arithmetic average of $D_\alpha(P_{Y|X=x}\,\|\,Q_{Y|X=x})$ with respect to $P_X$ if $\alpha \ne 1$. Instead, it is a generalized mean, or a scaled cumulant generating function evaluated at $\alpha - 1$. Specifically, if $X \sim P_X$, then $D_\alpha(P_{Y|X}\,\|\,Q_{Y|X}\,|\,P_X) = \frac{1}{\alpha-1}\log\mathbb{E}\big[\exp\big((\alpha-1)\, D_\alpha(P_{Y|X=X}\,\|\,Q_{Y|X=X})\big)\big]$ (39). Regardless of whether $\alpha < 1$ or $\alpha > 1$, (39) implies that $\sup_{P_X} D_\alpha(P_{Y|X}\,\|\,Q_{Y|X}\,|\,P_X) = \sup_{x \in \mathcal{A}} D_\alpha(P_{Y|X=x}\,\|\,Q_{Y|X=x})$ (41), with the supremum in (41) over all input probability measures.
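To see how the generalized mean (39) differs from a plain arithmetic average, the sketch below (with hypothetical $P_{Y|X}$, $Q_{Y|X}$ and $P_X$ chosen only for illustration) computes the conditional Rényi divergence both from (39) and from the joint-distribution definition (38), and contrasts it with the arithmetic average of the per-input divergences.

```python
import numpy as np

def renyi_div(P, Q, alpha):
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    return float(np.log(np.sum(P ** alpha * Q ** (1 - alpha))) / (alpha - 1))

def conditional_renyi_div(P_X, W, V, alpha):
    """D_alpha(W || V | P_X) via the tilted average (39) of the per-input divergences."""
    d = np.array([renyi_div(W[x], V[x], alpha) for x in range(len(P_X))])
    return float(np.log(np.sum(P_X * np.exp((alpha - 1) * d))) / (alpha - 1))

# Hypothetical random transformations W, V and input pmf, chosen only for illustration.
W = np.array([[0.7, 0.3], [0.2, 0.8]])
V = np.array([[0.5, 0.5], [0.4, 0.6]])
P_X = np.array([0.5, 0.5])
alpha = 2.0

d = [renyi_div(W[x], V[x], alpha) for x in range(2)]
print("per-input divergences        :", [round(v, 4) for v in d])
print("arithmetic average           :", round(float(np.dot(P_X, d)), 4))
print("generalized mean, (39)       :", round(conditional_renyi_div(P_X, W, V, alpha), 4))
# (38): the same value from the Rényi divergence between the joints P_X W and P_X V.
print("joint-distribution form, (38):", round(renyi_div((P_X[:, None] * W).ravel(),
                                                        (P_X[:, None] * V).ravel(), alpha), 4))
```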
 - Given $\alpha \in (0,1)\cup(1,\infty)$, $P_X$ and $P_{Y|X}$, the $\alpha$-mutual information is [27,31] $I_\alpha(X;Y) = \min_{Q_Y} D_\alpha(P_{Y|X}\,\|\,Q_Y\,|\,P_X)$, where the minimum is over all probability measures on the output space and is achieved by the $\alpha$-response $Q^*_\alpha$ to $P_X$. It can be checked that the constant $\kappa_\alpha$ in (24) is equal to $\kappa_\alpha = \exp\big(\tfrac{\alpha-1}{\alpha}\, I_\alpha(X;Y)\big)$. Note that $I_1(X;Y) = I(X;Y)$ but, in general, $I_\alpha(X;Y) \ne I_\alpha(Y;X)$.
 - An alternative expression for $\alpha$-mutual information, which will come in handy in our analysis and which does not involve either the output distribution $P_Y$ or the $\alpha$-response, is obtained by introducing an auxiliary probability measure $Q$ dominating the collection $\{P_{Y|X=x},\ x \in \mathcal{A}\}$ [27]: $I_\alpha(X;Y) = \frac{\alpha}{\alpha-1}\log\int_{\mathcal{B}}\big(\mathbb{E}\big[\big(\tfrac{\mathrm{d}P_{Y|X=X}}{\mathrm{d}Q}(y)\big)^{\alpha}\big]\big)^{1/\alpha}\,\mathrm{d}Q(y)$, where $X \sim P_X$ and the value does not depend on the choice of $Q$. As usual, sometimes it is convenient to fix $\sigma$-finite measures $\mu$ and $\nu$ on the input and output spaces which dominate $P_X$ and $\{P_{Y|X=x}\}$, respectively, and denote the densities with respect to the reference measures by $p_X = \frac{\mathrm{d}P_X}{\mathrm{d}\mu}$ and $p_{Y|X}(\cdot|x) = \frac{\mathrm{d}P_{Y|X=x}}{\mathrm{d}\nu}$. Then, we can write the $\alpha$-mutual information as $I_\alpha(X;Y) = \frac{\alpha}{\alpha-1}\log\int_{\mathcal{B}}\big(\int_{\mathcal{A}} p_X(x)\, p^{\alpha}_{Y|X}(y|x)\,\mathrm{d}\mu(x)\big)^{1/\alpha}\mathrm{d}\nu(y)$.
 - In the special case of discrete alphabets, $E_0(\rho, P_X) = \rho\, I_{\frac{1}{1+\rho}}(X;Y)$, where the left side is the familiar Gallager function defined in [36] for $\rho > -1$ as $E_0(\rho, P_X) = -\log\sum_{y\in\mathcal{B}}\big(\sum_{x\in\mathcal{A}} P_X(x)\, P^{\frac{1}{1+\rho}}_{Y|X}(y|x)\big)^{1+\rho}$.
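Putting the last few items together, the following sketch (discrete case; the array names and example channel are ours) computes the $\alpha$-response and its normalizing constant, and verifies numerically that the conditional Rényi divergence from $P_{Y|X}$ to the $\alpha$-response equals $I_\alpha(X;Y)$, that $\kappa_\alpha = \exp\big(\tfrac{\alpha-1}{\alpha} I_\alpha(X;Y)\big)$, and that $E_0(\rho, P_X) = \rho\, I_{1/(1+\rho)}(X;Y)$.

```python
import numpy as np

def alpha_response(P_X, W, alpha):
    """The alpha-response to P_X and its normalizing constant kappa_alpha (discrete case)."""
    inner = np.sum(P_X[:, None] * W ** alpha, axis=0) ** (1 / alpha)
    kappa = float(inner.sum())
    return inner / kappa, kappa

def sibson_alpha_mi(P_X, W, alpha):
    inner = np.sum(P_X[:, None] * W ** alpha, axis=0)
    return float(alpha / (alpha - 1) * np.log(np.sum(inner ** (1 / alpha))))

def conditional_renyi_div(P_X, W, Q, alpha):
    """D_alpha(P_{Y|X} || Q | P_X) as a tilted average of the per-input divergences."""
    d = np.array([np.log(np.sum(W[x] ** alpha * Q ** (1 - alpha))) / (alpha - 1)
                  for x in range(len(P_X))])
    return float(np.log(np.sum(P_X * np.exp((alpha - 1) * d))) / (alpha - 1))

def gallager_E0(rho, P_X, W):
    """Gallager's E_0(rho, P_X) for a discrete memoryless channel."""
    return float(-np.log(np.sum(np.sum(P_X[:, None] * W ** (1 / (1 + rho)), axis=0) ** (1 + rho))))

W = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.1, 0.8]])
P_X = np.array([0.4, 0.6])
alpha = 0.5

Q_star, kappa = alpha_response(P_X, W, alpha)
I_alpha = sibson_alpha_mi(P_X, W, alpha)
print("I_alpha                               :", I_alpha)
print("D_alpha(P_{Y|X} || alpha-response|P_X):", conditional_renyi_div(P_X, W, Q_star, alpha))
print("kappa_alpha vs exp((a-1)/a * I_alpha) :", kappa, np.exp((alpha - 1) / alpha * I_alpha))
rho = (1 - alpha) / alpha                      # so that alpha = 1 / (1 + rho)
print("E_0(rho, P_X) vs rho * I_alpha        :", gallager_E0(rho, P_X, W), rho * I_alpha)
```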
 - Fix $\alpha \in (0,1)\cup(1,\infty)$, $P_{Y|X}$, and a collection $\mathcal{P}$ of probability measures on the input space. Then, we denote $C_\alpha(\mathcal{P}) = \sup_{P_X \in \mathcal{P}} I_\alpha(X;Y)$. When $\mathcal{P}$ is the set of all input probability measures, we write simply $C_\alpha$, dubbed the Rényi capacity in [26]. $C_\alpha$ is a measure of the similarity of the family $\{P_{Y|X=x},\ x \in \mathcal{A}\}$, which plays an important role in the analysis of the fundamental limits of information transmission through noisy channels, particularly in the regime of exponentially small error probability. For a long time (e.g., [39]) the cutoff rate $C_{1/2}$ was conjectured to be the maximal rate for which reliable codes with manageable decoding complexity can be found. The zero-error capacity of the discrete memoryless channel with feedback is equal to either zero or [40] $C_0 = \lim_{\alpha\downarrow 0} C_\alpha$, depending on whether there is an output $y \in \mathcal{B}$ such that $P_{Y|X}(y|x) > 0$ for all $x \in \mathcal{A}$.
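As a small illustration of $C_\alpha$ (a brute-force sketch of ours, independent of the methods developed later in the paper), the code below maximizes $I_{1/2}$ over binary input distributions for a binary symmetric channel and compares the result with the familiar closed form of the BSC cutoff rate, $\log 2 - \log\big(1 + 2\sqrt{\delta(1-\delta)}\big)$ nats.

```python
import numpy as np

def sibson_alpha_mi(P_X, W, alpha):
    inner = np.sum(P_X[:, None] * W ** alpha, axis=0)
    return float(alpha / (alpha - 1) * np.log(np.sum(inner ** (1 / alpha))))

def renyi_capacity_binary_input(W, alpha, grid=20001):
    """C_alpha for a two-input channel by brute-force search over input pmfs (p, 1-p)."""
    ps = np.linspace(0.0, 1.0, grid)
    vals = [sibson_alpha_mi(np.array([p, 1 - p]), W, alpha) for p in ps]
    k = int(np.argmax(vals))
    return vals[k], ps[k]

delta = 0.1
W = np.array([[1 - delta, delta],      # binary symmetric channel with crossover delta
              [delta, 1 - delta]])
C_half, p_star = renyi_capacity_binary_input(W, alpha=0.5)
print(f"cutoff rate C_1/2 by search : {C_half:.6f} nats, at P_X = ({p_star:.3f}, {1 - p_star:.3f})")
print(f"BSC closed form             : {np.log(2) - np.log(1 + 2 * np.sqrt(delta * (1 - delta))):.6f}")
```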
 - While $D(P\|Q)$ is convex in the pair $(P, Q)$, the picture for Rényi divergence is somewhat more nuanced:
- (a) If $\alpha \in (0,1]$, then $D_\alpha(P\|Q)$ is convex in the pair $(P, Q)$.
 - (b) For any fixed pair $(\alpha, Q)$, $\exp\big((\alpha-1)\, D_\alpha(P\|Q)\big)$ is concave (resp. convex) in $P$ if $\alpha \in (0,1)$ (resp. $\alpha \in (1,\infty)$) (see [43]).
 
3. Conditional Rényi Divergence Game
3.1. Saddle point
- Regardless of whether $\alpha < 1$ or $\alpha > 1$, we see from (45) that if there exists some input $x$ for which (68) holds, then the assumed optimality of the input distribution is contradicted. Moreover, if there exists some input $x$ for which (68) holds with the strict inequality reversed, then (59) would be violated, again contradicting the assumed optimality of the input distribution.
 
3.2. Minimax identity
4. Finding
5. Proofs
5.1. Proof of Theorem 1
5.2. Proof of Theorem 2
6. Conclusions
Author Contributions
Funding
Conflicts of Interest
References
- Rényi, A. On measures of information and entropy. In Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability; Neyman, J., Ed.; University of California Press: Berkeley, CA, USA, 1961; pp. 547–561. [Google Scholar]
 - Van Erven, T.; Harremoës, P. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory 2014, 60, 3797–3820. [Google Scholar] [CrossRef]
 - Blahut, R.E. Hypothesis testing and information theory. IEEE Trans. Inf. Theory 1974, 20, 405–417. [Google Scholar] [CrossRef]
 - Csiszár, I. Generalized cutoff rates and Rényi’s information measures. IEEE Trans. Inf. Theory 1995, 41, 26–34. [Google Scholar] [CrossRef]
 - Shayevitz, O. On Rényi measures and hypothesis testing. In Proceedings of the 2011 IEEE International Symposium on Information Theory, Saint Petersburg, Russia, 31 July–5 August 2011; pp. 894–898. [Google Scholar]
 - Harremoës, P. Interpretations of Rényi entropies and divergences. Phys. A Stat. Mech. Its Appl. 2006, 365, 57–62. [Google Scholar] [CrossRef]
 - Sason, I.; Verdú, S. Improved bounds on lossless source coding and guessing moments via Rényi measures. IEEE Trans. Inf. Theory 2018, 64, 4323–4346. [Google Scholar] [CrossRef]
 - Haroutunian, E.A. Estimates of the exponent of the error probability for a semicontinuous memoryless channel. Probl. Inf. Transm. 1968, 4, 29–39. [Google Scholar]
 - Haroutunian, E.A.; Haroutunian, M.E.; Harutyunyan, A.N. Reliability criteria in information theory and in statistical hypothesis testing. Found. Trends Commun. Inf. Theory 2007, 4, 97–263. [Google Scholar] [CrossRef]
 - Polyanskiy, Y.; Verdú, S. Arimoto channel coding converse and Rényi divergence. In Proceedings of the 48th Annual Allerton Conference on Communication, Control, and Computing, Allerton, IL, USA, 29 September–1 October 2010; pp. 1327–1333. [Google Scholar]
 - Tsallis, C. Possible generalization of Boltzmann-Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487. [Google Scholar] [CrossRef]
 - Liese, F.; Vajda, I. Convex Statistical Distances; Number 95 in Teubner Texte zur Mathematik; Teubner: Leipzig, Germany, 1987. [Google Scholar]
 - Tridenski, S.; Zamir, R.; Ingber, A. The Ziv-Zakai-Renyi bound for joint source-channel coding. IEEE Trans. Inf. Theory 2015, 61, 4293–4315. [Google Scholar] [CrossRef]
 - Arimoto, S. An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Trans. Inf. Theory 1972, 18, 14–20. [Google Scholar] [CrossRef]
 - Blahut, R.E. Computation of channel capacity and rate-distortion functions. IEEE Trans. Inf. Theory 1972, 18, 460–473. [Google Scholar] [CrossRef]
 - Blachman, N.M. Communication as a game. Proc. IRE Wescon 1957, 2, 61–66. [Google Scholar]
 - Borden, J.M.; Mason, D.M.; McEliece, R.J. Some information theoretic saddlepoints. SIAM J. Control Optim. 1985, 23, 129–143. [Google Scholar] [CrossRef]
 - Lapidoth, A.; Narayan, P. Reliable communication under channel uncertainty. IEEE Trans. Inf. Theory 1998, 44, 2148–2177. [Google Scholar] [CrossRef]
 - Kemperman, J. On the Shannon capacity of an arbitrary channel. In Koninklijke Nederlandse Akademie van Wetenschappen, Indagationes Mathematicae; Elsevier: Amsterdam, The Netherlands, 1974; Volume 77, pp. 101–115. [Google Scholar]
 - Gallager, R.G. Source Coding With Side Information and Universal Coding; Technical Report LIDS-P-937; Lab. Information Decision Systems, Massachusetts Institute of Technology: Cambridge, MA, USA, 1979. [Google Scholar]
 - Ryabko, B. Encoding a source with unknown but ordered probabilities. Probl. Inf. Transm. 1979, 15, 134–138. [Google Scholar]
 - Davisson, L.; Leon-Garcia, A. A source matching approach to finding minimax codes. IEEE Trans. Inf. Theory 1980, 26, 166–174. [Google Scholar] [CrossRef]
 - Ryabko, B. Comments on “A Source Matching Approach to Finding Minimax Codes”. IEEE Trans. Inf. Theory 1981, 27, 780–781. [Google Scholar] [CrossRef]
 - Haussler, D. A general minimax result for relative entropy. IEEE Trans. Inf. Theory 1997, 43, 1276–1280. [Google Scholar] [CrossRef]
 - Yagli, S.; Altuğ, Y.; Verdú, S. Minimax Rényi redundancy. IEEE Trans. Inf. Theory 2018, 64, 3715–3733. [Google Scholar] [CrossRef]
 - Nakiboğlu, B. The Rényi capacity and center. IEEE Trans. Inf. Theory 2019, 65, 841–860. [Google Scholar] [CrossRef]
 - Verdú, S. α-mutual information. In Proceedings of the 2015 Information Theory and Applications Workshop, San Diego, CA, USA, 1–6 February 2015; pp. 1–6. [Google Scholar]
 - Arimoto, S. Information Measures and Capacity of Order α for Discrete Memoryless Channels. In Topics in Information Theory, Proceedings of the Coll. Math. Soc. János Bolyai; Bolyai: Keszthely, Hungary, 1975; pp. 41–52. [Google Scholar]
 - Augustin, U. Noisy Channels. Ph.D. Thesis, Universität Erlangen-Nürnberg, Erlangen, Germany, 1978. [Google Scholar]
 - Nakiboğlu, B. The Augustin capacity and center. arXiv 2018, arXiv:1606.00304. [Google Scholar]
 - Sibson, R. Information radius. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 1969, 14, 149–160. [Google Scholar] [CrossRef]
 - Lapidoth, A.; Pfister, C. Two measures of dependence. Entropy 2019, 21, 778. [Google Scholar] [CrossRef]
 - Tomamichel, M.; Hayashi, M. Operational interpretation of Rényi information measures via composite hypothesis testing against product and Markov distributions. IEEE Trans. Inf. Theory 2018, 64, 1064–1082. [Google Scholar] [CrossRef]
 - Aishwarya, G.; Madiman, M. Remarks on Rényi versions of conditional entropy and mutual information. In Proceedings of the 2019 IEEE International Symposium on Information Theory, Paris, France, 7–12 July 2019; pp. 1117–1121. [Google Scholar]
 - Shannon, C.E.; Gallager, R.G.; Berlekamp, E. Lower bounds to error probability for coding on discrete memoryless channels, I. Inf. Control 1967, 10, 65–103. [Google Scholar] [CrossRef]
 - Gallager, R.G. A simple derivation of the coding theorem and some applications. IEEE Trans. Inf. Theory 1965, 11, 3–18. [Google Scholar] [CrossRef]
 - Yagli, S.; Cuff, P. Exact soft-covering exponent. IEEE Trans. Inf. Theory 2019, 65, 6234–6262. [Google Scholar] [CrossRef]
 - Gilardoni, G.L. On Pinsker’s and Vajda’s type inequalities for Csiszár’s f-divergences. IEEE Trans. Inf. Theory 2010, 56, 5377–5386. [Google Scholar] [CrossRef]
 - Massey, J.L. Coding and modulation in digital communications. In Proceedings of the 1974 International Zurich Seminar on Digital Communications, Zurich, Switzerland, 12–15 March 1974; pp. E2(1)–E2(4). [Google Scholar]
 - Shannon, C.E. The zero error capacity of a noisy channel. IRE Trans. Inf. Theory 1956, 2, 8–19. [Google Scholar] [CrossRef]
 - Sundaresan, R. Guessing under source uncertainty. IEEE Trans. Inf. Theory 2007, 53, 269–287. [Google Scholar] [CrossRef]
 - Bunte, C.; Lapidoth, A. Encoding tasks and Rényi entropy. IEEE Trans. Inf. Theory 2014, 60, 5065–5076. [Google Scholar] [CrossRef]
 - Ho, S.W.; Verdú, S. Convexity/concavity of Rényi entropy and α-mutual information. In Proceedings of the 2015 IEEE International Symposium on Information Theory, Hong Kong, China, 14–19 June 2015; pp. 745–749. [Google Scholar]
 - Gallager, R.G. Information Theory and Reliable Communication; John Wiley and Sons: New York, USA, 1968; Volume 2. [Google Scholar]
 - Verdú, S. Channel Capacity. In The Electrical Engineering Handbook, 2nd ed.; IEEE Press: Piscataway, NJ, USA; CRC Press: Boca Raton, FL, USA, 1997; Chapter 73.5; pp. 1671–1678. [Google Scholar]
 - Luenberger, D. Optimization by vector space methods; John Wiley and Sons: Hoboken, NJ, USA, 1997. [Google Scholar]
 - Polyanskiy, Y.; Wu, Y. Lecture Notes on Information Theory. 2017. Available online: http://people.lids.mit.edu/yp/homepage/data/itlectures_v5.pdf (accessed on 30 April 2019).
 