Next Article in Journal
A Bootstrap Framework for Aggregating within and between Feature Selection Methods
Next Article in Special Issue
Discriminant Analysis under f-Divergence Measures
Previous Article in Journal
Active Inference: Applicability to Different Types of Social Organization Explained through Reference to Industrial Engineering and Quality Management
Previous Article in Special Issue
Minimum Divergence Estimators, Maximum Likelihood and the Generalized Bootstrap
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Error Exponents and α-Mutual Information

Independent Researcher, Princeton, NJ 08540, USA
Entropy 2021, 23(2), 199; https://doi.org/10.3390/e23020199
Submission received: 7 December 2020 / Revised: 26 January 2021 / Accepted: 28 January 2021 / Published: 5 February 2021

Abstract

:
Over the last six decades, the representation of error exponent functions for data transmission through noisy channels at rates below capacity has seen three distinct approaches: (1) Through Gallager’s E 0 functions (with and without cost constraints); (2) large deviations form, in terms of conditional relative entropy and mutual information; (3) through the α -mutual information and the Augustin–Csiszár mutual information of order α derived from the Rényi divergence. While a fairly complete picture has emerged in the absence of cost constraints, there have remained gaps in the interrelationships between the three approaches in the general case of cost-constrained encoding. Furthermore, no systematic approach has been proposed to solve the attendant optimization problems by exploiting the specific structure of the information functions. This paper closes those gaps and proposes a simple method to maximize Augustin–Csiszár mutual information of order α under cost constraints by means of the maximization of the α -mutual information subject to an exponential average constraint.

1. Introduction

1.1. Phase 1: The MIT School

The capacity C of a stationary memoryless channel is equal to the maximal symbolwise input–output mutual information. Not long after Shannon [1] established this result, Rice [2] observed that, when operating at any encoding rate R < C , there exist codes whose error probability vanishes exponentially with blocklength, with a speed of decay that decreases as R approaches C. This early observation moved the center of gravity of information theory research towards the quest for the reliability function, a term coined by Shannon [3] to refer to the maximal achievable exponential decay as a function of R. The MIT information theory school, and most notably, Elias [4], Feinstein [5], Shannon [3,6], Fano [7], Gallager [8,9], and Shannon, Gallager and Berlekamp [10,11], succeeded in upper/lower bounding the reliability function by the sphere-packing error exponent function and the random coding error exponent function, respectively. Fortunately, these functions coincide for rates between C and a certain value, called the critical rate, thereby determining the reliability function in that region. The influential 1968 textbook by Gallager [9] set down the major error exponent results obtained during Phase 1 of research on this topic, including the expurgation technique to improve upon the random coding error exponent lower bound. Two aspects of those early works (and of Dobrushin’s contemporary papers [12,13] on the topic) stand out:
(a)
The error exponent functions were expressed as the result of the Karush-Kuhn-Tucker optimization of ad-hoc functions which, unlike mutual information, carried little insight. In particular, during the first phase, center stage is occupied by the parametrized function of the input distribution P X and the random transformation (or "channel”) P Y | X ,
E 0 ( ρ , P X ) = log y B x A P X ( x ) P Y | X 1 1 + ρ ( y | x ) 1 + ρ ,
introduced by Gallager in [8].
(b)
Despite the large-deviations nature of the setup, none of the tools from that then-nascent field (other than the Chernoff bound) found their way to the first phase of the work on error exponents; in particular, relative entropy, introduced by Kullback and Leibler [14], failed to put in an appearance.
To this date, the reliability function remains open for low rates even for the binary symmetric channel, despite a number of refined converse and achievability results (e.g., [15,16,17,18,19,20,21]) obtained since [9]. Our focus in this paper is not on converse/achievability techniques but on the role played by various information measures in the formulation of error exponent results.

1.2. Phase 2: Relative Entropy

The second phase of the error exponent research was pioneered by Haroutunian [22] and Blahut [23], who infused the expressions for the error exponent functions with meaning by incorporating relative entropy. The sphere-packing error exponent function corresponding to a random transformation P Y | X is given as
E sp ( R ) = sup P X min Q Y | X : A B I ( P X , Q Y | X ) R D ( Q Y | X P Y | X | P X ) .
Roughly speaking, optimal codes of rate R < C incur in errors due to atypical channel behavior, and large deviations establishes that the overwhelmingly most likely such behavior can be explained as if the channel would be supplanted by the one with mutual information bounded by R which is closest to the true channel in conditional relative entropy D ( Q Y | X P Y | X | P X ) . Within the confines of finite-alphabet memoryless channels, this direction opened the possibility of using the combinatorial method of types to obtain refined results robustifying the choice of the optimal code against incomplete knowledge of the channel. The 1981 textbook by Csiszár and Körner [24] summarizes the main results obtained during Phase 2.

1.3. Phase 3: Rényi Information Measures

Entropy and relative entropy were generalized by Rényi [25], who introduced the notions of Rényi entropy and Rényi divergence of order α . He arrived at Rényi entropy by relaxing the axioms Shannon proposed in [1], and showed to be satisfied by no measure but entropy. Shortly after [25], Campbell [26] realized the operational role of Rényi entropy in variable-length data compression if the usual average encoding length criterion E [ ( c ( X ) ) ] is replaced by an exponential average α 1 log E [ exp ( α ( c ( X ) ) ] . Arimoto [27] put forward a generalized conditional entropy inspired by Rényi’s measures (now known as Arimoto-Rényi conditional entropy) and proposed a generalized mutual information by taking the difference between Rényi entropy and the Arimoto-Rényi conditional entropy. The role of the Arimoto-Rényi conditional entropy in the analysis of the error probability of Bayesian M-ary hypothesis testing problems has been recently shown in [28], tightening and generalizing a number of results dating back to Fano’s inequality [29].
Phase 3 of the error exponent research was pioneered by Csiszár [30] where he established a connection between Gallager’s E 0 function and Rényi divergence by means of a Bayesian measure of the discrepancy among a finite collection of distributions introduced by Sibson [31]. Although [31] failed to realize its connection to mutual information, Csiszár [30,32] noticed that it could be viewed as a natural generalization of mutual information. Arimoto [27] also observed that the unconstrained maximization of his generalized mutual information measure with respect to the input distribution coincides with a scaled version of the maximal E 0 function. This resulted in an extension of the Arimoto-Blahut algorithm useful for the computation of error exponent functions [33] (see also [34]) for finite-alphabet memoryless channels.
Within Haroutunian’s framework [22] applied in the context of the method of types, Poltyrev [35] proposed an alternative to Gallager’s E 0 function, defined by means of a cumbersome maximization over a reverse random transformation. This measure turned out to coincide (modulo different parametrizations) with another generalized mutual information introduced four years earlier by Augustin in his unpublished thesis [36], by means of a minimization with respect to an output probability measure.
The key contribution in the development of this third phase is Csiszár’s paper [32] where he makes a compelling case for the adoption of Rényi’s information measures in the large deviations analysis of lossless data compression, hypothesis testing and data transmission. Recall that more than two decades earlier, Csiszár [30] had already established the connection of Gallager’s E 0 function and the generalized mutual information inspired by Sibson [31], which, henceforth, we refer to as the α -mutual information. Therefore, its relevance to the error exponent analysis of error correcting codes had already been established. Incidentally, more recently, another operational role was found for α -mutual information in the context of the large deviations analysis of composite hypothesis testing [37]. In addition to α -mutual information, and always working with discrete alphabets, Csiszár [32] considers the generalized mutual informations due to Arimoto [27], and to Augustin [36], which we refer to as the Augustin–Csiszár mutual information of order α . Csiszár shows that all those three generalizations of mutual information coincide upon their unconstrained maximization with respect to the input distribution. Further relationships among those Rényi-based generalized mutual informations have been obtained in recent years in [38,39,40,41,42,43,44,45]. In [32] the maximal α -mutual information or generalized capacity of order α finds an operational characterization as a generalized cutoff rate–an equivalent way to express the reliability function. This would have been the final word on the topic if it weren’t for its limitation to discrete-alphabet channels, and more importantly, encoding without cost constraints.

1.4. Cost Constraints

If the transmitted codebook is cost-constrained, i.e., every codeword ( c 1 , , c n ) is forced to satisfy i = 1 n b ( c i ) n θ for some nonnegative cost function b ( · ) , then the channel capacity is equal to the input–output mutual information maximized over input probability measures restricted to satisfy E [ b ( X ) ] θ . Gallager [9] incorporated cost constraints in his treatment of error exponents by generalizing (1) to the function
E 0 ( ρ , P X , r , θ ) = log y B x A P X ( x ) exp r b ( x ) r θ P Y | X 1 1 + ρ ( y | x ) 1 + ρ ,
with which he was able to prove an achievability result invoking Shannon’s random coding technique [1]. Gallager also suggested in the footnote of page 329 of [9] that the converse technique of [10] is amenable to extension to prove a sphere-packing converse based on (3). However, an important limitation is that that technique only applies to constant-composition codes (all codewords have the same empirical distribution). A more powerful converse circumventing that limitation (at least for symmetric channels) was given by [46] also expressing the upper bound on the reliability function by optimizing (3) with respect to ρ , r and P X . A notable success of the approach based on the optimization of (3) was the determination of the reliability function (for all rates below capacity) of the direct detection photon channel [47].
In contrast, the Phase Two expression (2) for the sphere-packing error exponent for cost-constrained channels is much more natural and similar to the way the expression for channel capacity is impacted by cost constraints, namely we simply constrain the maximization in (2) to satisfy E [ b ( X ) ] θ . Unfortunately, no general methods to solve the ensuing optimization have been reported.
Once cost constraints are incorporated, the equivalence among the maximal α -mutual information, maximal order- α Augustin–Csiszár mutual information, and maximal Arimoto mutual information of order α breaks down. Of those three alternatives, it is the maximal Augustin–Csiszár mutual information under cost constraints that appears in the error exponent functions. The challenge is that Augustin–Csiszár mutual information is much harder to evaluate, let alone maximize, than α -mutual information. The Phase 3 effort to encompass cost constraints started by Augustin [36] and was continued recently by Nakiboglu [43]. Their focus was to find a way to express (3) in terms of Rényi information measures. Although, as we explain in Item 62, they did not quite succeed, their efforts were instrumental in developing key properties of the Augustin–Csiszár mutual information.

1.5. Organization

To enhance readability and ease of reference, the rest of this work is organized in 81 items, grouped into Section 13 and an appendix.
Basic notions and notation (including the key concept of α -response) are collected in Section 2. Unlike much of the literature on the topic, we do not restrict attention to discrete input/output alphabets, nor do we impose any topological structures on them.
The paper is essentially self-contained. Section 3 covers the required background material on relative entropy, Rényi divergence of order α , and their conditional versions, including a key representation of Rényi divergence in terms of relative entropies and a tilted probability measure, and additive decompositions of Rényi divergence involving the α -response.
Section 4 studies the basic properties of α -mutual information and order- α Augustin–Csiszár mutual information. This includes their variational representations in terms of conventional (non-Rényi) information measures such as conditional relative entropy and mutual information, which are particularly simple to show in the main range of interest in applications to error exponents, namely, α ( 0 , 1 ) .
The interrelationships between α -mutual information and order- α Augustin–Csiszár mutual information are covered in Section 5, which introduces the dual notions of α -adjunct and α -adjunct of an input probability measure.
The maximizations with respect to the input distribution of α -mutual information and order- α Augustin–Csiszár mutual information account for their role in the fundamental limits in data transmission through noisy channels. Section 6 gives a brief review of the results in [45] for the maximization of α -mutual information. For Augustin–Csiszár mutual information, Section 7 covers its unconstrained maximization, which coincides with its α -mutual information counterpart. Section 8 proposes an approach to find C α c ( θ ) , the maximal Augustin–Csiszár mutual information of order α ( 0 , 1 ) subject to E [ b ( X ) ] θ . Instead of trying to identify directly the input distribution that maximizes Augustin–Csiszár mutual information, the method seeks its α -adjunct. This is tantamount to maximizing α -mutual information over a larger set of distributions.
Section 9 shows
ρ C 1 1 + ρ c ( θ ) = min r 0 max P X E 0 ( ρ , P X , r , θ ) ,
where the maximization on the right side is unconstrained. In other words, the minimax of Gallager’s E 0 function (3) with cost constraints is shown to be equal to the maximal Augustin–Csiszár mutual information, thereby bridging the existing gap between the Phase 1 and Phase 3 representations alluded to earlier in this introduction.
As in [48], Section 10 defines the sphere-packing and random-coding error exponent functions in the natural canonical form of Phase 2 (e.g., (2)), and gives a very simple proof of the nexus between the Phase 2 and Phase 3 representations, namely,
E sp ( R ) = sup ρ 0 ρ C 1 1 + ρ c ( θ ) ρ R ,
with or without cost constraints. In this regard, we note that, although all the ingredients required were already present at the time the revised version of [24] was published three decades after the original, [48] does not cover the role of Rényi’s information measures in channel error exponents.
Examples illustrating the proposed method are given in Section 11 and Section 12 for the additive Gaussian noise channel under a quadratic cost function, and the additive exponential noise channel under a linear cost function, respectively. Simple parametric expressions are given for the error exponent functions, and the least favorable channels that account for the most likely error mechanism (Section 1.2) are identified in both cases.

2. Relative Information and Information Density

We begin with basic terminology and notation required for the subsequent development.
1.
If ( A , F , P ) is a probability space, X P indicates P [ X F ] = P ( F ) for all F F .
2.
If probability measures P and Q defined on the same measurable space ( A , F ) satisfy P ( A ) = 0 for all A F such that Q ( A ) = 0 , we say that P is dominated by Q, denoted as P Q . If P and Q dominate each other, we write P Q . If there is an event such that P ( A ) = 0 and Q ( A ) = 1 , we say that P and Q are mutually singular, and we write P Q .
3.
If P Q , then d P d Q is the Radon-Nikodym derivative of the dominated measure P with respect to the reference measure Q. Its logarithm is known as the relative information, namely, the random variable
ı P Q ( a ) = log d P d Q ( a ) [ , + ) , a A .
As with the Radon-Nikodym derivative, any identity involving relative informations can be changed on a set of measure zero under the reference measure without incurring in any contradiction. If P Q R , then the chain rule of Radon-Nikodym derivatives yields
ı P Q ( a ) + ı Q R ( a ) = ı P R ( a ) , a A .
Throughout the paper, the base of exp and log is the same and chosen by the reader unless explicitly indicated otherwise. We frequently define a probability measure P from the specification of ı P Q and Q P since
P ( A ) = A exp ı P Q ( a ) d Q ( a ) , A F .
If X P and Y Q , it is often convenient to write ı X Y ( x ) instead of ı P Q ( x ) . Note that
E exp ı X Y ( Y ) = 1 .
Example 1.
If X N μ X , σ X 2 (Gaussian with mean μ X and variance σ X 2 ) and Y N μ Y , σ Y 2 , then,
ı X Y ( a ) = 1 2 log σ Y 2 σ X 2 + 1 2 ( a μ Y ) 2 σ Y 2 ( a μ X ) 2 σ X 2 log e .
4.
Let ( A , F ) and ( B , G ) be measurable spaces, known as the input and output spaces, respectively. Likewise, A and B are referred to as the input and output alphabets respectively. The simplified notation P Y | X : A B denotes a random transformation from ( A , F ) to ( B , G ) , i.e., for any x A , P Y | X = x ( · ) is a probability measure on ( B , G ) , and for any B G , P Y | X = · ( B ) is an F -measurable function.
5.
We abbreviate by P A the set of probability measures on ( A , F ) , and by P A × B the set of probability measures on ( A × B , F G ) . If P P A and P Y | X : A B is a random transformation, the corresponding joint probability measure is denoted by P P Y | X P A × B (or, interchangeably, P Y | X P ). The notation P P Y | X Q simply indicates that the output marginal of the joint probability measure P P Y | X is denoted by Q P B , namely,
Q ( B ) = P Y | X ( B | x ) d P X ( x ) = E P Y | X ( B | X ) , B G .
6.
If P X P Y | X P Y and P Y | X = a P Y , the information density ı X ; Y : A × B [ , ) is defined as
ı X ; Y ( a ; b ) = ı P Y | X = a P Y ( b ) , ( a , b ) A × B .
Following Rényi’s terminology [49], if P X P Y | X P X × P Y , the dependence between X and Y is said to be regular, and the information density can be defined on ( x , y ) A × B . Henceforth, we assume that P Y | X is such that the dependence between its input and output is regular regardless of the input probability measure. For example, if X = Y R , then P Y | X = a ( A ) = 1 { a A } , and their dependence is not regular, since for any P X with non-discrete components PXY Entropy 23 00199 i001 PX × PY.
7.
Let α > 0 , and P X P Y | X P Y . The α -response to P X P A is the output probability measure P Y [ α ] P Y with relative information given by
ı Y [ α ] Y ( y ) = 1 α log E [ exp ( α ı X ; Y ( X ; y ) κ α ) ] , X P X ,
where κ α is a scalar that guarantees that P Y [ α ] is a probability measure. Invoking (9), we obtain
κ α = α log E E 1 α [ exp ( α ı X ; Y ( X ; Y ¯ ) ) | Y ¯ ] , ( X , Y ¯ ) P X × P Y .
For brevity, the dependence of κ α on P X and P Y | X is omitted. Jensen’s inequality applied to ( · ) α results in κ α 0 for α ( 0 , 1 ) and κ α 0 for α > 1 . Although the α -response has a long record of services to information theory, this terminology and notation were introduced recently in [45]. Alternative terminology and notation were proposed in [42], which refers to the α -response as the order α Rényi mean. Note that κ 1 = 0 and the 1-response to P X is P Y . If p Y [ α ] and p Y | X denote the densities of P Y [ α ] and P Y | X with respect to some common dominating measure, then (13) becomes
p Y [ α ] ( y ) = exp κ α α E 1 α p Y | X α ( y | X ) , X P X .
For α > 1 (resp. α < 1 ) we can think of the normalized version of p Y | X α as a random transformation with less (resp. more) "noise" than p Y | X .
8.
We will have opportunity to apply the following examples.
Example 2.
If Y = X + N , where X N μ X , σ X 2 independent of N N μ N , σ N 2 , then the α-response to P X is
Y [ α ] N μ X + μ N , α σ X 2 + σ N 2 .
Example 3.
Suppose that Y = X + N , where N is exponential with mean ζ, independent of X, which is a mixed random variable with density
f X ( t ) = ζ α μ δ ( t ) + 1 ζ α μ 1 μ e t / μ 1 { t > 0 } ,
with α μ ζ . Then, Y [ α ] , the α-response to P X , is exponential with mean α μ .

3. Relative Entropy and Rényi Divergence

Given a pair of probability measures ( P , Q ) P A 2 , relative entropy and Rényi divergence gauge the distinctness between P and Q.
9.
Provided P Q , the relative entropy is the expectation of the relative information with respect to the dominated measure
D ( P Q ) = E ı P Q ( X ) , X P
= E exp ı P Q ( Y ) ı P Q ( Y ) , Y Q
0 ,
with equality if and only if P = Q . If P Entropy 23 00199 i001 Q, then D ( P Q ) = . As in Item 3, if X P and Y Q , we may write D ( X Y ) instead of D ( P Q ) , in the same spirit that the expectation and entropy of P are written as E [ X ] and H ( X ) , respectively.
10.
Arising in the sequel, a common optimization in information theory finds, among the probability measures satisfying an average cost constraint, that which is closest to a given reference measure Q in the sense of D ( · Q ) . For that purpose, the following result proves sufficient. Incidentally, we often refer to unconstrained maximizations over probability distributions. It should be understood that those optimizations are still constrained to the sets P A or P B . As customary in information theory, we will abbreviate max P X P A by max X or max P X .
Theorem 1.
Let P Z P A and suppose that g : A [ 0 , ) is a Borel measurable mapping. Then,
min X D ( X Z ) + E [ g ( X ) ] = log E [ exp ( g ( Z ) ) ] ,
achieved uniquely by P X * P Z defined by
ı X * Z ( a ) = g ( a ) log E [ exp ( g ( Z ) ) ] , a A .
Proof. 
Note that since g is nonnegative, η = E [ exp ( g ( Z ) ) ] ( 0 , 1 ] . Furthermore,
E [ g ( X * ) ] = g ( t ) exp ( g ( t ) ) d P Z ( t ) E [ exp ( g ( Z ) ) ] 0 , 1 e η .
Therefore, the subset of P A for which the term in { · } in (21) is finite is nonempty: Fix any P X from that subset, (which therefore satisfies P X P Z P X * ) and invoke the chain rule (7) to write
D ( X Z ) + E [ g ( X ) ] = E ı X X * ( X ) + ı X * Z ( X ) + g ( X )
= D ( X X * ) log E [ exp ( g ( Z ) ) ] , X P X ,
which is uniquely minimized by letting P X = P X * . Note that for typographical convenience we have denoted X * P X * . □
11.
Let p and q denote the Radon-Nikodym derivatives of probability measures P and Q, respectively, with respect to a common dominating σ -finite measure μ . The Rényi divergence of order α ( 0 , 1 ) ( 1 , ) between P and Q is defined as [25,50]
D α ( P Q ) = 1 α 1 log A p α q 1 α d μ
= 1 α 1 log E exp α ı P R ( Z ) + ( 1 α ) ı Q R ( Z ) , Z R
= 1 α 1 log E exp α ı P Q ( Y ) , Y Q
= 1 α 1 log E exp ( α 1 ) ı P Q ( X ) , X P ,
where (28) and (29) hold if P Q , and in (27), R is a probability measure that dominates both P and Q. Note that (28) and (29) state that ( t 1 ) D t ( X Y ) and t D 1 + t ( X Y ) are the cumulant generating functions of the random variables ı X Y ( Y ) and ı X Y ( X ) , respectively. The relative entropy is the limit of D α ( P Q ) as α 1 , so it is customary to let D 1 ( P Q ) = D ( P Q ) . For any α > 0 , D α ( P Q ) 0 with equality if and only if P = Q . Furthermore, D α ( P Q ) is non-decreasing in α , satisfies the skew-symmetric property
( 1 α ) D α ( P Q ) = α D 1 α ( Q P ) , α [ 0 , 1 ] ,
and
inf α ( 0 , 1 ) D α ( P Q ) = P Q inf α > 1 D α ( P Q ) = .
12.
The expressions in the following pair of examples will come in handy in Section 11 and Section 12.
Example 4.
Suppose that σ α 2 = α σ 1 2 + ( 1 α ) σ 0 2 > 0 and α ( 0 , 1 ) ( 1 , ) . Then,
D α N μ 0 , σ 0 2 N μ 1 , σ 1 2 = 1 2 log σ 1 2 σ 0 2 + 1 2 ( α 1 ) log σ 1 2 σ α 2 + α ( μ 1 μ 0 ) 2 2 σ α 2 log e ,
D N μ 0 , σ 0 2 N μ 1 , σ 1 2 = 1 2 log σ 1 2 σ 0 2 + 1 2 σ 0 2 σ 1 2 1 log e + ( μ 1 μ 0 ) 2 2 σ 1 2 log e
= lim α 1 D α N μ 0 , σ 0 2 N μ 1 , σ 1 2 .
Example 5.
Suppose Z is exponentially distributed with unit mean, i.e., its probability density function is e t 1 { t 0 } . For d 0 d 1 and α such that ( 1 α ) μ 0 + α μ 1 > 0 we obtain
D α μ 0 Z + d 0 μ 1 Z + d 1 = d 0 d 1 μ 1 log e + log μ 1 μ 0 + 1 1 α log α + ( 1 α ) μ 0 μ 1 ,
D μ 0 Z + d 0 μ 1 Z + d 1 = μ 0 μ 1 1 + d 0 d 1 μ 1 log e + log μ 1 μ 0
= lim α 1 D α μ 0 Z + d 0 μ 1 Z + d 1 .
13.
Intimately connected with the notion of Rényi divergence is the tilted probability measure P α defined, if D α ( P 1 P 0 ) < , by
ı P α Q ( a ) = α ı P 1 Q ( a ) + ( 1 α ) ı P 0 Q ( a ) + ( 1 α ) D α ( P 1 P 0 ) ,
where Q is any probability measure that dominates both P 0 and P 1 . Although (37) is defined in general, our main emphasis is on the range α ( 0 , 1 ) , in which, as long as P0 Entropy 23 00199 i002 P1, the tilted probability measure is defined and satisfies P α P 0 and P α P 1 , with corresponding relative informations
ı P α P 0 ( a ) = ı P α Q ( a ) ı P 0 Q ( a )
= ( 1 α ) D α ( P 1 P 0 ) + α ı P 1 Q ( a ) ı P 0 Q ( a ) ,
ı P α P 1 ( a ) = ı P α Q ( a ) ı P 1 Q ( a )
= ( 1 α ) D α ( P 1 P 0 ) ( 1 α ) ı P 1 Q ( a ) ı P 0 Q ( a ) ,
where we have used the chain rule for P α P 0 Q and P α P 1 Q . Taking a linear combination of (38)–(41) we conclude that, for all a A ,
( 1 α ) D α ( P 1 P 0 ) = ( 1 α ) ı P α P 0 ( a ) + α ı P α P 1 ( a ) .
Henceforth, we focus particular attention on the case α ( 0 , 1 ) since that is the region of interest in the application of Rényi information measures to the evaluation of error exponents in channel coding for codes whose rate is below capacity. In addition, often proofs simplify considerably for α ( 0 , 1 ) .
14.
Much of the interplay between relative entropy and Rényi divergence hinges on the following identity, which appears, without proof, in (3) of [51].
Theorem 2.
Let α ( 0 , 1 ) and assume that P0 Entropy 23 00199 i002 P1 are defined on the same measurable space. Then, for any P P 1 and P P 0 ,
α D ( P P 1 ) + ( 1 α ) D ( P P 0 ) = D ( P P α ) + ( 1 α ) D α ( P 1 P 0 ) ,
where P α is the tilted probability measure in (37) and (43) holds regardless of whether the relative entropies are finite. In particular,
D ( P P α ) < max { D ( P P 0 ) , D ( P P 1 ) } < .
Proof. 
We distinguish three overlapping cases:
(1)
D ( P P α ) < : Taking expectation of (42) with respect to a X P , yields (43) because
E ı P α P 0 ( X ) = D ( P P 0 ) D ( P P α ) ,
E ı P α P 1 ( X ) = D ( P P 1 ) D ( P P α ) ,
where, thanks to the assumption that D ( P P α ) < , we have invoked Corollary A1 in the Appendix A twice with ( P , Q , R ) ( P , P α , P 0 ) and ( P , Q , R ) ( P , P α , P 1 ) , respectively;
(2)
max { D ( P P 0 ) , D ( P P 1 ) } < : The proof is identical since we are entitled to invoke Corollary A1 to show (45) (resp., (46)) because D ( P P 0 ) < (resp., D ( P P 1 ) < ).
(3)
D ( P P α ) = and max { D ( P P 0 ) , D ( P P 1 ) } = : both sides of (43) are equal to .
Finally, to show that (44) follows from (43), simply recall from (31) that D α ( P 1 P 0 ) < . □
15.
Relative entropy and Rényi divergence are related by the following fundamental variational representation.
Theorem 3.
Fix α ( 0 , 1 ) and ( P 1 , P 0 ) P A 2 . Then, the Rényi divergence between P 1 and P 0 satisfies
( 1 α ) D α ( P 1 P 0 ) = min P α D ( P P 1 ) + ( 1 α ) D ( P P 0 ) ,
where the minimum is over P A . If P0 Entropy 23 00199 i002 P1, then the right side of (47) is attained by the tilted measure P α , and the minimization can be restricted to the subset of probability measures which are dominated by both P 1 and P 0 .
Proof. 
If P 0 P 1 , then both sides of (47) are + since there is no probability measure that is dominated by both P 0 and P 1 . If P0 Entropy 23 00199 i002 P1, then minimizing both sides of (43) with respect to P yields (47) and the fact that the tilted probability measure attains the minimum therein. □
The variational representation in (47) was observed in [39] in the finite-alphabet case, and, contemporaneously, in full generality in [50]. Unlike Theorem 3, both of those references also deal with α > 1 . The function d ( α ) = ( 1 α ) D α ( P 1 P 0 ) , with d ( 1 ) = lim α 1 d ( α ) , is concave in α because the right side of (47) is a minimum of affine functions of α .
16.
Given random transformations P Y | X : A B , Q Y | X : A B , and a probability measure P X P A on the input space, the conditional relative entropy is
D ( P Y | X Q Y | X | P X ) = D ( P Y | X P X Q Y | X P X )
= E D P Y | X ( · | X ) Q Y | X ( · | X ) , X P X .
Analogously, the conditional Rényi divergence is defined as
D α ( P Y | X Q Y | X | P X ) = D α ( P Y | X P X Q Y | X P X ) .
A word of caution: the notation in (50) conforms to that in [38,45] but it is not universally adopted, e.g., [43] uses the left side of (50) to denote the Rényi generalization of the right side of (49). We can express the conditional Rényi divergence as
D α ( P Y | X Q Y | X | P X )
= 1 α 1 log E exp ( α 1 ) D α P Y | X ( · | X ) Q Y | X ( · | X ) , X P X ,
= 1 α 1 log E d P Y | X d Q Y | X ( Y | X ) α 1 , ( X , Y ) P X P Y | X ,
where (52) holds if P X P Y | X P X Q Y | X . Jensen’s inequality applied to (51) results in
D α ( P Y | X Q Y | X | P X ) E D α ( P Y | X ( · | X ) Q Y | X ( · | X ) ) , α ( 0 , 1 ) ;
D α ( P Y | X Q Y | X | P X ) E D α ( P Y | X ( · | X ) Q Y | X ( · | X ) ) , α > 1 .
Nevertheless, an immediate and crucial observation we can draw from (51) is that the unconstrained maximizations of the sides of (53) and of (54) over P X do coincide: for all α > 0 ,
sup X D α ( P Y | X Q Y | X | P X ) = sup X E D α ( P Y | X ( · | X ) Q Y | X ( · | X ) )
= sup a A D α ( P Y | X = a Q Y | X = a ) .
17.
Conditional Rényi divergence satisfies the following additive decomposition, originally pointed out, without proof, by Sibson [31] in the setting of finite A .
Theorem 4.
Given P X P A , Q Y P B , P Y | X : A B , and α ( 0 , 1 ) ( 1 , ) , we have
D α ( P Y | X Q Y | P X ) = D α ( P Y | X P Y [ α ] | P X ) + D α ( P Y [ α ] Q Y ) .
Furthermore, with κ α as in (14),
D α P Y | X P Y [ α ] | P X = κ α α 1 .
Proof. Select an arbitrary probability measure R Y P B that dominates both Q Y and P Y , and, therefore, P Y [ α ] too. Letting ( X , Z ) P X × R Y , we have
D α ( P Y | X Q Y | P X ) = 1 α 1 log E d P X Y d P X × R Y ( X , Z ) α d Q Y d R Y ( Z ) 1 α
= 1 α 1 log E E exp α ı X ; Y ( X ; Z ) | Z d P Y d R Y ( Z ) α d Q Y d R Y ( Z ) 1 α
= κ α α 1 + 1 α 1 log E d P Y [ α ] d P Y ( Z ) α d P Y d R Y ( Z ) α d Q Y d R Y ( Z ) 1 α
= κ α α 1 + 1 α 1 log E d P Y [ α ] d R Y ( Z ) α d Q Y d R Y ( Z ) 1 α
= κ α α 1 + D α ( P Y [ α ] Q Y ) ,
where (61) follows from (13), and (62) follows from the chain rule of Radon-Nikodym derivatives applied to P Y [ α ] P Y R Y . Then, (58) follows by specializing Q Y = P Y [ α ] , and the proof of (57) is complete, upon plugging (58) into the right side of (63). □
A proof of (57) in the discrete case can be found in Appendix A of [37].
18.
For all α > 0 , given two inputs ( P X , Q X ) P A 2 and one random transformation P Y | X : A B , Rényi divergence (and, in particular, relative entropy) satisfies the data processing inequality,
D α ( P X Q X ) D α ( P Y Q Y ) ,
where P X P Y | X P Y , and Q X P Y | X Q Y . The data processing inequality for Rényi divergence was observed by Csiszár [52] in the more general context of f-divergences. More recently it was stated in [39,50]. Furthermore, given one input P X P A and two transformations P Y | X : A B and Q Y | X : A B , conditioning cannot decrease Rényi divergence,
D α ( P Y | X Q Y | X | P X ) D α ( P Y Q Y ) .
Since D α ( P Y | X Q Y | X | P X ) = D α ( P X P Y | X P X Q Y | X ) , (65) follows by applying (64) to a deterministic transformation which takes an input pair and outputs the second component. Inequalities (53) and (65) imply the convexity of D α ( P Q ) in ( P , Q ) for α ( 0 , 1 ] .

4. Dependence Measures

In this paper we are interested in three information measures that quantify the dependence between random variables X and Y, such that P X P Y | X P Y , namely, mutual information, and two of its generalizations, α - mutual information and Augustin–Csiszár mutual information of order α .
19.
The mutual information is
I ( X ; Y ) = I ( P X , P Y | X ) = D ( P Y | X P Y | P X )
= min Q Y D ( P Y | X Q Y | P X )
= min Q Y D ( P X Y P X × Q Y ) .
20.
Given α ( 0 , 1 ) ( 1 , ) , the α -mutual information is defined as (see [30,31,32,40,42,45])
I α ( X ; Y ) = I α ( P X , P Y | X )
= min Q Y D α ( P Y | X Q Y | P X )
= min Q Y D α ( P X Y P X × Q Y )
= D α P Y | X P Y [ α ] | P X
= 1 α 1 log E exp ( α 1 ) D α P Y | X ( · | X ) P Y [ α ] , X P X
= D α P Y | X P Y | P X D α P Y [ α ] P Y
= κ α α 1
= α α 1 log E [ E 1 α [ exp ( α ı X ; Y ( X ; Y ¯ ) ) | Y ¯ ] ] , ( X , Y ¯ ) P X × P Y ,
where (72) and (74) follow from (57); (73) is a special case of (51); (75) follows from Theorem 4; and, (76) is (14). In view of (67) and (69), we let I 1 ( X ; Y ) = I ( X ; Y ) . The notation we use for α -mutual information conforms to that used in [40,42,45,53]. Other notations include K α in [32,38,39] and I α g in [43]. I 0 ( X ; Y ) and I ( X ; Y ) are defined by taking the corresponding limits.
21.
Theorem 4 and (72) result in the additive decomposition
I α ( X ; Y ) = D α ( P Y | X Q Y | P X ) D α ( P Y [ α ] Q Y ) ,
for any Q Y with D α ( P Y [ α ] Q Y ) < , thereby generalizing the well-known decomposition for mutual information,
I ( X ; Y ) = D ( P Y | X Q Y | P X ) D ( P Y Q Y ) ,
which, in contrast to (77), is a simple consequence of the chain rule whenever the dependence between X and Y is regular, and of Lemma A1 in general.
22.
Example 6.
Additive independent Gaussian noise. If Y = X + N , where X N 0 , σ X 2 independent of N N 0 , σ N 2 , then, for α > 0 ,
Y [ α ] N 0 , α σ X 2 + σ N 2 ,
I α ( X ; X + N ) = I α ( X + N ; X ) = 1 2 log 1 + α σ X 2 σ N 2 .
23.
If α ( 0 , 1 ) , (47) and (69) result in
( 1 α ) I α ( P X , P Y | X ) = min Q X Q Y | X D ( Q X P X ) + α D ( Q Y | X P Y | X | Q X ) + ( 1 α ) I ( Q X , Q Y | X ) .
For α > 1 a proof of (81) is given in [39] for finite alphabets.
24.
Unlike I ( P X , P Y | X ) , we can express I α ( P X , P Y | X ) directly in terms of its arguments without involving the corresponding output distribution or the α -response to P X . This is most evident in the case of discrete alphabets, in which (76) becomes
I α ( X ; Y ) = α α 1 log y B x A P X ( x ) P Y | X = x α ( y ) 1 α ,
I 0 ( X ; Y ) = log max y B x A P X ( x ) 1 { P Y | X ( y | x ) > 0 } ,
I ( X ; Y ) = log b Y sup a : P X ( a ) > 0 P Y | X ( b | a ) .
For example, if X is discrete and H α ( X ) denotes the Rényi entropy of order α , then for all α > 0 ,
H α ( X ) = I 1 α ( X ; X ) .
If X and Y are equiprobable with P [ X Y ] = δ , then, in bits, I α ( X ; Y ) = 1 h α ( δ ) , where h α ( δ ) denotes the binary Rényi entropy.
25.
In the main region of interest, namely, α ( 0 , 1 ) , frequently we use a different parametrization in terms of ρ > 0 , with α = 1 1 + ρ .
Theorem 5.
For any ρ > 0 , we have the upper bound
ρ I 1 1 + ρ ( X ; Y ) min Q Y | X : A B D ( Q Y | X P Y | X | P X ) + ρ I ( P X , Q Y | X ) .
Proof. 
Fix Q Y | X : A B , and let P X Q Y | X Q Y . Then,
I 1 1 + ρ ( X ; Y ) D 1 1 + ρ ( P X Y P X × Q Y )
= 1 + ρ ρ min R X Y 1 1 + ρ D ( R X Y P X Y ) + ρ 1 + ρ D ( R X Y P X × Q Y )
1 ρ D ( Q Y | X P X P X Y ) + D ( Q Y | X P X P X × Q Y )
= 1 ρ D ( Q Y | X P Y | X | P X ) + I ( P X , Q Y | X ) ,
where (87), (88) and (90) follow from (69), (47) and (66) respectively. □
Just like (53), we will show in Section 7 that (86) becomes an equality upon the unconstrained maximization of both sides.
26.
Before introducing the last dependence measure in this section, recall from Definition 7 and (58) that P Y [ α ] P Y , the α -response (of P Y | X ) to P X defined by
ı Y [ α ] Y ( y ) = 1 α log E [ exp α ı X ; Y ( X ; y ) + ( 1 α ) D α P Y | X P Y [ α ] | P X ] ,
attains min Q Y D α ( P Y | X Q Y | P X ) , where the expectation is with respect to X P X . We proceed to define P Y α P Y , the α -response (of P Y | X ) to P X by means of
ı Y α Y ( y ) = 1 α log E exp ( α ı X ; Y ( X ; y ) + ( 1 α ) D α P Y | X ( · | X ) P Y α ,
with X P X . Note that P Y 1 = P Y [ 1 ] = P Y .
27.
In the case of discrete alphabets, (92) becomes the implicit equation
P Y α α ( y ) = a A P X ( a ) P Y | X α ( y | a ) b B P Y | X α ( b | a ) P Y α 1 α ( b ) , y B ,
which coincides with (9.24) in Fano’s 1961 textbook [7], with s 1 α , and is also given by Haroutunian in (19) of [22]. For example, if A = B is discrete and Y = X , then P Y α = P X , while P Y [ α ] α ( y ) = c P X ( y ) , y A .
28.
The α -response satisfies the following identity, which can be regarded as the counterpart of (57) satisfied by the α -response.
Theorem 6.
Fix P X P A , P Y | X : A B and Q Y P B . Then,
D α ( P Y α Q Y ) = 1 α 1 log E exp ( 1 α ) D α ( P Y | X ( · | X ) P Y α ) D α ( P Y | X ( · | X ) Q Y ) .
Proof. 
For brevity we assume Q Y P Y . Otherwise, the proof is similar adopting a reference measure that dominates both Q Y and P Y . The definition of unconditional Rényi divergence in Item 11 implies that we can write ( α 1 ) times the exponential of the left side of (94) as
exp ( α 1 ) D α ( P Y α Q Y ) = E d P Y α d P Y ( Y ) α d Q Y d P Y ( Y ) 1 α
= E exp α ı X ; Y ( X ; Y ) + ( 1 α ) D α P Y | X ( · | X ) P Y α d Q Y d P Y ( Y ) 1 α = E E exp α ı X ; Y ( X ; Y ) + ( 1 α ) ı Q Y P Y ( Y ) + D α P Y | X ( · | X ) P Y α | X
= E exp ( 1 α ) D α P Y | X ( · | X ) P Y α D α P Y | X ( · | X ) Q Y ,
where ( X , Y ) P X × P Y , (96) follows from (92), and (97) follows from the definition of unconditional Rényi divergence in (27). □
Theorem 7.
If α ( 0 , 1 ] , then
D α ( P Y α Q Y ) E D α ( P Y | X ( · | X ) Q Y ) E D α ( P Y | X ( · | X ) P Y α )
D ( P Y α Q Y ) .
If α 1 , inequalities (98) and (99) are reversed.
Proof. 
Assume α ( 0 , 1 ] . Jensen’s inequality applied to the right side of (94) results in (98). To show (99), again we assume for brevity Q Y P Y , and define the positive functions V : A × B ( 0 , ) and W : A × B ( 0 , ) ,
V ( x , y ) = exp α ı X ; Y ( x ; y ) + ( 1 α ) ı Y α Y ( y ) ,
W ( x , y ) = exp α ı X ; Y ( x ; y ) + ( 1 α ) ı Q Y P Y ( y ) .
Note that, with ( X , Y ) P X × P Y , and ( x , y ) A × B ,
E [ V ( x , Y ) ] = exp ( α 1 ) D α ( P Y | X = x P Y α ) ,
E [ W ( x , Y ) ] = exp ( α 1 ) D α ( P Y | X = x Q Y ) , E V ( X , y ) E [ V ( X , Y ) | X ] = exp ( 1 α ) ı Y α Y ( y ) ·
= · E exp α ı X ; Y ( X ; y ) + ( 1 α ) D α ( P Y | X ( · | X ) P Y α )
= d P Y α d P Y ( y ) .
where (104) uses (100) and (102) and (105) follows from (92). Then,
D α ( P Y | X = x Q Y ) D α ( P Y | X = x P Y α )
= 1 1 α log E [ V ( x , Y ) ] E [ W ( x , Y ) ]
1 1 α E V ( x , Y ) E [ V ( x , Y ) ] log V ( x , Y ) W ( x , Y )
= E V ( x , Y ) E [ V ( x , Y ) ] ı Y α Y ( Y ) ı Q Y P Y ( Y ) ,
where the expectations are with respect to Y P Y , and
  • (107) follows from the log-sum inequality for integrable non-negative random variables,
    E [ V ] log E [ V ] E [ W ] E V log V W ;
  • (108) ⇐ (100) and (101).
Taking expectation with respect to X P X of (106)–(108) yields (99) because of Lemma A1 and (105). If α 1 , then Jensen’s inequality applied to the right side of (94) results in (98) but with the opposite inequality. Moreover, (107) is reversed and the remainder of the proof holds verbatim. □
In the case of finite input-alphabets, a different proof of (99) is given in Appendix B of [54].
29.
Introduced in the unpublished dissertation [36] and rescued from oblivion in [32], the Augustin–Csiszár mutual information of order α is defined for α > 0 as
I α c ( X ; Y ) = I α c ( P X , P Y | X ) = min Q Y E D α ( P Y | X ( · | X ) Q Y )
= E D α ( P Y | X ( · | X ) P Y α ) ,
where (111) follows from (98) if α ( 0 , 1 ] , and from the reverse of (99) if α 1 . We conform to the notation in [40], where I α a was used to denote the difference between entropy and Arimoto-Rényi conditional entropy. In [32,39,43] the Augustin–Csiszár mutual information of order α is denoted by I α . In Augustin’s original notation [36], I ρ ( P X ) means I 1 ρ c ( P X , P Y | X ) , ρ ( 0 , 1 ) . Independently of [36], Poltyrev [35] introduced a functional (expressed as a maximization over a reverse random transformation) which turns out to be ρ I 1 1 + ρ c ( X ; Y ) and which he denoted by E 0 ( ρ , P X ) , although in Gallager’s notation that corresponds to ρ I 1 1 + ρ ( X ; Y ) , as we will see in (233). I 0 c ( X ; Y ) and I c ( X ; Y ) are defined by taking the corresponding limits.
30.
In the discrete case, (110) boils down to
I α c ( X ; Y ) = min Q Y 1 α 1 x A P X ( x ) log y B P Y | X α ( y | x ) Q Y 1 α ( y ) ,
which can be juxtaposed with the much easier expression in (82) for I α ( X ; Y ) involving no further optimization. Minimizing the Lagrangian, we can verify that the minimizer in (112) satisfies (93). With ( X , Y ¯ ) P X × Q Y , we have
I 0 c ( X ; Y ) = min Q Y E log 1 P [ P Y | X ( Y ¯ | X ) > 0 X ] ,
I c ( X ; Y ) = min Q Y E log P Y | X ( Y ¯ | X ) Q Y ( Y ¯ ) ,
where the expectations are with respect to X.
31.
The respective minimizers of (72) and (110), namely, the α -response and the α -response, are quite different. Most notably, in contrast to Item 7, an explicit expression for P Y α is unknown. Instead of defining P Y α through (92), [36] defines it, equivalently, as the fixed point of the operator (dubbed the Augustin operator in [43]) which maps the set of probability measures on the output space to itself,
d T α ( Q ) d Q ( y ) = E d P Y | X d Q ( y | X ) α exp ( 1 α ) D α ( P Y | X ( · | X ) Q ) ,
where X P X . Although we do not rely on them, Lemma 34.2 of ( α ( 0 , 1 ) ) and Lemma 13 of [43] ( α > 1 ) claim that the minimizer in (110), referred to in [43] as the Augustin mean of order α , is unique and is a fixed point of the operator T α regardless of P X . Moreover, Lemma 13(c) of [43] establishes that for α ( 0 , 1 ) and finite input alphabets, repeated iterations of the operator T α with initial argument P Y [ α ] converge to P Y α .
32.
It is interesting to contrast the next example with the formulas in Examples 2 and 6.
Example 7.
Additive independent Gaussian noise.If Y = X + N , where X N 0 , σ X 2 independent of N N 0 , σ N 2 , then
Y α N 0 , σ N 2 2 2 1 α + Δ + snr ,
snr = σ X 2 σ N 2 ,
Δ = 4 snr + 1 α snr 2 .
This result can be obtained by postulating a zero-mean Gaussian distribution with variance v α 2 as P Y α and verifying that (92) is indeed satisfied if v α 2 is chosen as in (116). The first step is to invoke (32), which yields
D α P Y | X = x P Y α = λ α 2 + α x 2 2 s α 2 log e ,
λ α = log v α 2 σ N 2 + 1 α 1 log v α 2 s α 2 ,
where we have denoted s α 2 = α v α 2 + ( 1 α ) σ N 2 . Since Y N 0 , σ X 2 + σ N 2 ,
ı X ; Y ( x ; y ) = 1 2 log σ X 2 + σ N 2 σ N 2 + 1 2 y 2 σ X 2 + σ N 2 ( y x ) 2 σ N 2 log e ,
ı Y α Y ( y ) = 1 2 log σ X 2 + σ N 2 v α 2 + 1 2 y 2 σ X 2 + σ N 2 y 2 v α 2 log e .
Assembling (120) and (121), the right side of (92) becomes
1 α log E exp ( α ı X ; Y ( X ; y ) + ( 1 α ) D α P Y | X ( · | X ) P Y α = 1 2 log σ X 2 + σ N 2 σ N 2 + 1 2 y 2 log e σ X 2 + σ N 2 + 1 α 2 α λ α + 1 α log E exp e α ( y X ) 2 2 σ N 2 + α ( 1 α ) X 2 2 s α 2
= 1 2 log σ X 2 + σ N 2 σ N 2 + 1 α 2 α λ α + y 2 log e 2 1 σ X 2 + σ N 2 s α 2 α ( 1 α ) σ X 2 σ N 2 s α 2 + α 2 v α 2 σ X 2 + 1 2 α log σ N 2 s α 2 σ N 2 s α 2 + α 2 v α 2 σ X 2
= 1 2 log σ X 2 + σ N 2 v α 2 + 1 2 y 2 σ X 2 + σ N 2 y 2 v α 2 log e ,
where (124) follows by Gaussian integration, and the marvelous simplification in (125) is satisfied provided that we choose
s α 2 = α σ X 2 v α 2 v α 2 σ N 2 .
Comparing (122) and (125), we see that (92) is indeed satisfied with Y α N 0 , v α 2 if v α 2 satisfies the quadratic Equation (126), whose solution is in (116)–(118). Invoking (32) and (116), we obtain
I α c ( X ; X + N ) = α snr 1 + α Δ + α snr log e + 1 2 log 1 + 1 2 Δ + snr 1 α 1 2 ( 1 α ) log 2 1 α + Δ + snr 1 + α Δ + α snr .
Beyond its role in evaluating the Augustin–Csiszár mutual information for Gaussian inputs, the Gaussian distribution in (116) has found some utility in the analysis of finite blocklength fundamental limits for data transmission [55].
33.
This item gives a variational representation for the Augustin–Csiszár mutual information in terms of mutual information and conditional relative entropy (i.e., non-Rényi information measures). As we will see in Section 10, this representation accounts for the role played by Augustin–Csiszár mutual information in expressing error exponent functions.
Theorem 8.
For α ( 0 , 1 ) , the Augustin–Csiszár mutual information satisfies the variational representation in terms of conditional relative entropy and mutual information,
( 1 α ) I α c ( P X , P Y | X ) = min Q Y | X α D ( Q Y | X P Y | X | P X ) + ( 1 α ) I ( P X , Q Y | X ) ,
where the minimum is over all the random transformations from the input to the output spaces.
Proof. 
Invoking (47) with ( P 1 , P 0 ) ( P Y | X = x , Q Y ) we obtain
( 1 α ) D α ( P Y | X = x Q Y ) = min R Y α D ( R Y P Y | X = x ) + ( 1 α ) D ( R Y Q Y )
= min R Y | X = x α D ( R Y | X = x P Y | X = x ) + ( 1 α ) D ( R Y | X = x Q Y ) .
Averaging over x P X , followed by minimization with respect to Q Y yields (128) upon recalling (67). □
In the finite-alphabet case with α ( 0 , 1 ) ( 1 , ) , the representation in (128) is implicit in the appendix of [32], and stated explicitly in [39], where it is shown by means of a minimax theorem. This is one of the instances in which the proof of the result is considerably easier for α ( 0 , 1 ) ; we can take the following route to show (128) for α > 1 . Neglecting to emphasize its dependence on P X , denote
f α ( Q Y , R Y | X ) = α 1 α D ( R Y | X P Y | X | P X ) + D ( R Y | X Q Y | P X ) .
Invoking (47) we obtain
D α ( P Y | X = x Q Y ) = max R Y | X = x α 1 α D ( R Y | X = x P Y | X = x ) + D ( R Y | X = x Q Y ) .
Averaging (132) with respect to P X followed by minimization over Q Y , results in
I α c ( P X , P Y | X ) = min Q Y max R Y | X f α ( Q Y , R Y | X )
max R Y | X min Q Y f α ( Q Y , R Y | X )
= max R Y | X α 1 α D ( R Y | X P Y | X | P X ) + I ( P X , R Y | X ) ,
which shows ≥ in (128). If a minimax theorem can be invoked to show equality in (134), then (128) is established for α > 1 . For that purpose, for fixed R Y | X , f ( · , R Y | X ) is convex and lower semicontinuous in Q Y on the set where it is finite. Rewriting
f ( Q Y , R Y | X ) = 1 1 α D ( R Y | X P Y | X | P X ) + D ( R Y | X Q Y | P X ) D ( R Y | X P Y | X | P X ) ,
it can be seen that f ( Q Y , · ) is upper semicontinuous and concave (if α > 1 ). A different, and considerably more intricate route is taken in Lemma 13(d) of [43], which also gives (128) for α > 1 assuming finite input alphabets.
34.
Unlike mutual information, neither I α ( X ; Y ) = I α ( Y ; X ) nor I α c ( X ; Y ) = I α c ( Y ; X ) hold in general.
Example 8.
Erasure transformation. Let A = { 0 , 1 } , B = { 0 , 1 , e } ,
P Y | X ( b | a ) = 1 δ , a = b ; δ , b = e ; 0 , a b e ,
with δ ( 0 , 1 ) , and P X ( 0 ) = 1 2 . Then, we obtain, for α ( 0 , 1 ) ( 1 , ) ,
I α ( X ; Y ) = I α c ( X ; Y ) = α α 1 log δ + ( 1 δ ) 2 1 1 α ,
I α ( Y ; X ) = 1 α 1 log δ + ( 1 δ ) 2 α 1 ,
I α c ( Y ; X ) = I ( X ; Y ) = 1 δ b i t s .
35.
It was shown in Theorem 5.2 of [38] that α -mutual information satisfies the data processing lemma, namely, if X and Z are conditionally independent given Y, then
I α ( X ; Z ) min I α ( X ; Y ) , I α ( Y ; Z ) ,
I α ( Z ; X ) min I α ( Z ; Y ) , I α ( Y ; X ) .
As shown by Csiszár [32] using the data processing inequality for Rényi divergence, the data processing lemma also holds for I α c .
36.
From (53), (54) and the monotonicity of D α ( P Q ) in α , we obtain the ordering
I β ( X ; Y ) I α ( X ; Y ) I α c ( X ; Y ) I ν c ( X ; Y ) I ( X ; Y ) , 0 < β α ν < 1 ;
I ( X ; Y ) I ν c ( X ; Y ) I α c ( X ; Y ) I α ( X ; Y ) I β ( X ; Y ) , 1 < ν α β .
37.
The convexity/concavity properties of the generalized mutual informations are summarized next.
Theorem 9.
(a)
ρ I 1 1 + ρ ( X ; Y ) and ρ I 1 1 + ρ c ( X ; Y ) are concave and monotonically non-decreasing in ρ 0 .
(b)
I ( · , P Y | X ) and I α c ( · , P Y | X ) are concave functions. The same holds for I α ( · , P Y | X ) if α > 1 .
(c)
If α ( 0 , 1 ) , then I ( P X , · ) , I α ( P X , · ) and I α c ( P X , · ) are convex functions.
Proof. 
(a)
According to (81) and (128), respectively, with α = 1 1 + ρ ( 0 , 1 ) , ρ I 1 1 + ρ ( X ; Y ) and ρ I 1 1 + ρ c ( X ; Y ) are the infima of affine functions with nonnegative slopes.
(b)
For mutual information the result goes back to [56] in the finite-alphabet case. In general, it holds since (67) is the infimum of linear functions of P X . The same reasoning applies to Augustin–Csiszár mutual information in view of (110). For α -mutual information with α > 1 , notice from (51) that D α ( P Y | X Q Y | P X ) is concave in P X if α > 1 . Therefore,
I α ( λ P X 1 + ( 1 λ ) P X 0 , P Y | X )
= inf Q Y D α ( P Y | X Q Y | λ P X 1 + ( 1 λ ) P X 0 )
inf Q Y λ D α ( P Y | X Q Y | P X 1 ) + ( 1 λ ) D α ( P Y | X Q Y | P X 0 )
λ I α ( P X 1 , P Y | X ) + ( 1 λ ) I α ( P X 0 , P Y | X ) .
(c)
The convexity of I ( P X , · ) and I α ( P X , · ) follow from the convexity of D α ( P Q ) in ( P , Q ) for α ( 0 , 1 ] as we saw in Item 18. To show convexity of I α c ( P X , · ) if α ( 0 , 1 ) , we apply (169) in Item 45 with P Y | X = λ P Y | X 1 + ( 1 λ ) P Y | X 0 , and invoke the convexity of I α ( P X , · ) :
( 1 α ) I α c ( P X , P Y | X )
= max Q X ( 1 α ) I α ( Q X , λ P Y | X 1 + ( 1 λ ) P Y | X 0 ) D ( P X Q X ) , max Q X λ 1 α ) I α ( Q X , P Y | X 1 ) D ( P X Q X )
+ ( 1 λ ) 1 α ) I α ( Q X , P Y | X 0 ) D ( P X Q X )
( 1 α ) λ I α c ( P X , P Y | X 1 ) + ( 1 λ ) I α c ( P X , P Y | X 0 ) .
Although not used in the sequel, we note, for completeness, that if α ( 0 , 1 ) ( 1 , ) , [38] (see corrected version in [41]) shows that exp 1 1 α I α ( · , P Y | X ) / ( α 1 ) is concave.

5. Interplay between I α ( P X , P Y | X ) and I α c ( P X , P Y | X )

In this section we study the interplay between both notions of mutual informations of order α , and, in particular, various variational representations of these information measures.
38.
For given α ( 0 , 1 ) ( 1 , ) and P Y | X : A B , define Q X [ α ] P X , the α -adjunct of P X by
ı Q X [ α ] P X ( x ) = ( α 1 ) D α P Y | X = x P Y [ α ] κ α ,
with κ α the constant in (14) and P Y [ α ] , the α -response to P X .
39.
Example 9.
Let Y = X + N with X N 0 , σ X 2 independent of N N 0 , σ N 2 , and snr = σ X 2 σ N 2 . The α-adjunct of the input is
Q X [ α ] = N 0 , σ X 2 1 + α 2 snr 1 + α snr .
40.
Theorem 10.
The α -response to Q X [ α ] is P Y [ α ] , the α-response to P X .
Proof. 
We just need to verify that (92) is satisfied if we substitute Y α by Y [ α ] , and instead of taking the expectation in the right side with respect to X P X we take it with respect to X ˜ Q X [ α ] . Then,
E exp ( α ı X ; Y ( X ˜ ; y ) + ( 1 α ) D α P Y | X ( · | X ˜ ) P Y [ α ]
= E exp ı Q X [ α ] P X ( X ) + α ı X ; Y ( X ; y ) + ( 1 α ) D α P Y | X ( · | X ) P Y [ α ]
= E exp ( α ı X ; Y ( X ; y ) κ α )
= exp α ı Y [ α ] Y ( y ) ,
where (154) is by change of measure, (155) follows by substitution of (152), and (156) is the same as (13). □
41.
For given α ( 0 , 1 ) ( 1 , ) and P Y | X : A B , we define Q X α P X , the α -adjunct of an input probability measure P X through
ı Q X α P X ( x ) = ( 1 α ) D α P Y | X = x P Y α + υ α ,
where P Y α is the α -response to P X and υ α is a normalizing constant so that Q X α is a probability measure. According to (9), we must have
E exp ı Q X α P X ( X ) = 1 , X P X .
Hence,
υ α = ( α 1 ) D α P Y | X P Y α | Q X α .
42.
With the aid of the expression in Example 7, we obtain
Example 10.
Let Y = X + N with X N 0 , σ X 2 independent of N N 0 , σ N 2 , and snr = σ X 2 σ N 2 . Then, the α -adjunct of the input is
Q X α = N 0 , σ X 2 1 + α ( Δ + snr ) 1 + α ( Δ snr ) + 2 α 2 snr ,
which, in contrast to Q X [ α ] , has larger variance than σ X 2 if α ( 0 , 1 ) .
43.
The following result is the dual of Theorem 10.
Theorem 11.
The α-response to Q X α is P Y α , the α -response to P X . Therefore,
υ α = ( α 1 ) I α Q X α , P Y | X .
Proof. 
The proof is similar to that of Theorem 10. We just need to verify that we obtain the right side of (92) if on the right side of (91) we substitute P X by Q X α and P Y [ α ] by P Y α . Let X ¯ Q X α . Then,
1 α log E exp α ı X ; Y ( X ¯ ; y ) + ( 1 α ) D α P Y | X P Y α | Q X α
= 1 α log E exp ı Q X α P X ( X ) + α ı X ; Y ( X ; y ) υ α
= 1 α log E exp α ı X ; Y ( X ; y ) + ( 1 α ) D α P Y | X ( · | X ) P Y α
= ı Y α Y ( y ) ,
where (162)–(164) follow by change of measure, (157), and (92), respectively. □
44.
By recourse to a minimax theorem, the following representation is given for α ( 0 , 1 ) ( 1 , ) in the case of finite alphabets in [39], and dropping the restriction on the finiteness of the output space in [43]. As we show, a very simple and general proof is possible for α ( 0 , 1 ) .
Theorem 12.
Fix α ( 0 , 1 ) , P X P A and P Y | X : A B . Then,
( 1 α ) I α ( X ; Y ) = min Q X ( 1 α ) I α c ( Q X , P Y | X ) + D ( Q X P X ) ,
where the minimum is attained by Q X [ α ] , the α-adjunct of P X defined in (152).
Proof. 
The variational representations in (81) and (128) result in (165). To show that the minimum is indeed attained by Q X [ α ] , recall from Theorem 10 that the α -response to Q X [ α ] is P Y [ α ] . Therefore, evaluating the term in { } in (165) for Q X Q X [ α ] yields, with X ˜ Q X [ α ] ,
( 1 α ) I α c ( Q X [ α ] , P Y | X ) + D ( Q X [ α ] P X )
= ( 1 α ) E D α ( P Y | X ( · | X ˜ ) P Y [ α ] ) + D ( Q X [ α ] P X )
= κ α
= ( 1 α ) I α ( X ; Y ) ,
where (167) follows from (152) and (168) results from (69)–(75). □
45.
For finite-input alphabets, Lemma 18(b) of [43] (earlier Theorem 3.4 of [35] gave an equivalent variational characterization assuming, in addition, finite output alphabets) established the following dual to Theorem 12.
Theorem 13.
Fix α ( 0 , 1 ) , P X P A and P Y | X : A B . Then,
( 1 α ) I α c ( X ; Y ) = max Q X ( 1 α ) I α ( Q X , P Y | X ) D ( P X Q X ) .
The maximum is attained by Q X α , the α -adjunct of P X defined by (157).
Proof. 
First observe that (165) implies that ≥ holds in (169). Second, the term in { } on the right side of (169) evaluated at Q X Q X α becomes
( 1 α ) I α ( Q X α , P Y | X ) D ( P X Q X α )
= ( 1 α ) I α ( Q X α , P Y | X ) + ( 1 α ) I α c ( P X , P Y | X ) + υ α
= ( 1 α ) I α c ( P X , P Y | X ) ,
where (170) follows by taking the expectation of minus (157) with respect to P X . Therefore, ≤ also holds in (169) and the maximum is attained by Q X α , as we wanted to show. □
Hinging on Theorem 8, Theorems 12 and 13 are given for α ( 0 , 1 ) which is the region of interest in the analysis of error exponents. Whenever, as in the finite-alphabet case, (128) holds for α > 1 , Theorems 12 and 13 also hold for α > 1 .
Notice that since the definition of Q X α involves P Y α , the fact that it attains the maximum in (169) does not bring us any closer to finding I α c ( X ; Y ) for a specific input probability measure P X . Fortunately, as we will see in Section 8, (169) proves to be the gateway to the maximization of I α c ( X ; Y ) in the presence of input-cost constraints.
46.
Focusing on the main range of interest, α ( 0 , 1 ) , we can express (169) as
I α c ( P X , P Y | X ) = max Q X I α ( Q X , P Y | X ) 1 1 α D ( P X Q X )
= max ξ 0 I ( ξ ) ξ 1 α
= I ( ξ α ) ξ α 1 α ,
where we have defined the function (dependent on α , P X , and P Y | X )
I ( ξ ) = max Q X : D ( P X Q X ) ξ I α ( Q X , P Y | X ) ,
and ξ α is the solution to
I ˙ ( ξ α ) = 1 1 α .
Recall that the maxima over the input distribution in (172) and (175) are attained by the α -adjunct Q X α defined in Item 41.
47.
At this point it is convenient to summarize the notions of input and output probability measures that we have defined for a given α , random transformation P Y | X , and input probability measure P X :
  • P Y : The familiar output probability measure P X P Y | X P Y , defined in Item 5.
  • P Y [ α ] : The α -response to P X , defined in Item 7. It is the unique achiever of the minimization in the definition of α -mutual information in (67).
  • P Y α : The α -response to P X defined in Item 26. It is the unique achiever of the minimization in the definition of Augustin–Csiszár α -mutual information in (110).
  • Q X [ α ] : The α -adjunct of P X , defined in (152). The α -response to Q X [ α ] is P Y [ α ] . Furthermore, Q X [ α ] achieves the minimum in (165).
  • Q X α : The α -adjunct of P X , defined in (157). The α -response to Q X α is P Y α . Furthermore, Q X α achieves the maximum in (169).

6. Maximization of I α ( X ; Y )

Just like the maximization of mutual information with respect to the input distribution yields the channel capacity (of course, subject to conditions [57]), the maximization of I α ( X ; Y ) and of I α c ( X ; Y ) arises in the analysis of error exponents, as we will see in Section 10. A recent in-depth treatment of the maximization of α -mutual information is given in [45]. As we see most clearly in (82) for the discrete case, when it comes to its optimization, one advantage of I α ( X ; Y ) over I ( X ; Y ) is that the input distribution does not affect the expression through its influence on the output distribution.
48.
The maximization of α -mutual information is facilitated by the following result.
Theorem 14 ([45]).
Given α ( 0 , 1 ) ( 1 , ) ; a random transformation P Y | X : A B ; and, a convex set P P A , the following are equivalent.
(a)
P X * P attains the maximal α-mutual information on P ,
I α ( P X * , P Y | X ) = max P P I α ( P , P Y | X ) < .
(b)
For any P X P , and any output distribution Q Y P B ,
D α ( P Y | X P Y [ α ] * | P X ) D α ( P Y | X P Y [ α ] * | P X * )
D α ( P Y | X Q Y | P X * ) ,
where P Y [ α ] * is the α-response to P X * .
Moreover, if P Y [ α ] denotes the α-response to P X , then
D α ( P Y [ α ] P Y [ α ] * ) I α ( P X * , P Y | X ) I α ( P X , P Y | X ) < .
Note that, while I α ( · , P Y | X ) may not be maximized by a unique (or, in fact, by any) input distribution, the resulting α -response P Y [ α ] * is indeed unique. If P is such that none of its elements attain the maximal I α , it is known [42,45] that the α -response to any asymptotically optimal sequence of input distributions converges to P Y [ α ] * . This is the counterpart of a result by Kemperman [58] concerning mutual information.
49.
The following example appears in [45].
Example 11.
Let Y = X + N where N N 0 , σ N 2 independent of X. Fix α ( 0 , 1 ) and P > 0 . Suppose that the set, P P A , of allowable input probability measures consists of those that satisfy the constraint
E exp e α ( 1 α ) X 2 2 α 2 P + σ N 2 α 2 P + σ N 2 α P + σ N 2 .
We can readily check that X * N 0 , P satisfies (181) with equality, and as we saw in Example 2, its α-response is P Y [ α ] * = N ( 0 , α P + σ 2 ) . Theorem 14 establishes that P X * does indeed maximize the α-mutual information among all the distributions in P , yielding (recall Example 6)
max P X P I α ( X ; Y ) = 1 2 log 1 + α P σ 2 .
Curiously, if, instead of P defined by the constraint (181), we consider the more conventional P = { X : E [ X 2 ] P } , then the left side of (182) is unknown at present. Numerical evidence shows that it can exceed the right side by employing non-Gaussian inputs.
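As an independent numerical sanity check of (182) (again a sketch of ours), the snippet below integrates the density form of the α-mutual information directly for X ∼ N ( 0 , P ) and compares the result with the right side of (182); the function name, the parameter values, and the integration limits are arbitrary.

```python
import numpy as np
from scipy.integrate import quad

def I_alpha_gaussian(alpha, P, sigma2, lim=40.0):
    # I_alpha(X;Y) in nats for X ~ N(0, P), Y = X + N, N ~ N(0, sigma2), by direct integration
    def inner(y):
        # E[ p_{Y|X}(y|X)^alpha ] with X ~ N(0, P)
        f = lambda x: (np.exp(-x**2 / (2*P)) / np.sqrt(2*np.pi*P)
                       * (np.exp(-(y - x)**2 / (2*sigma2)) / np.sqrt(2*np.pi*sigma2))**alpha)
        return quad(f, -lim, lim)[0]
    outer, _ = quad(lambda y: inner(y)**(1.0/alpha), -lim, lim)
    return alpha / (alpha - 1.0) * np.log(outer)

alpha, P, sigma2 = 0.6, 2.0, 1.0
print(I_alpha_gaussian(alpha, P, sigma2))      # numerical evaluation
print(0.5 * np.log(1.0 + alpha*P/sigma2))      # right side of (182), in nats
```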
50.
Recalling (56) and (178), we see that if P X * attains the finite maximal unconstrained α -mutual information and its α -response is denoted by P Y [ α ] * , then,
max X I α ( X ; Y ) = max P P I α ( P , P Y | X ) = max a A D α ( P Y | X = a P Y [ α ] * ) ,
which requires that P X * ( A α * ) = 1 , with
A α * = x A : D α ( P Y | X = x P Y [ α ] * ) = max a A D α ( P Y | X = a P Y [ α ] * ) .
For discrete alphabets, this requires that P X * ( x ) = 0 for every x that lies outside A α * , which is tantamount to
y B P Y | X α ( y | x ) E 1 α α P Y | X α ( y | X * ) exp α 1 α I α ( X * ; Y * ) ,
with equality for all x A such that P X * ( x ) > 0 . For finite-alphabet random transformations this observation is equivalent to Theorem 5.6.5 in [9].
51.
Getting slightly ahead of ourselves, we note that, in view of (128), an important consequence of Theorem 15 below is that, as anticipated in Item 25, the unconstrained maximization of I α ( X ; Y ) for α ( 0 , 1 ) can be expressed in terms of the solution to an optimization problem involving only conventional mutual information and conditional relative entropy. For ρ 0 ,
ρ sup X I 1 1 + ρ ( X ; Y ) = sup X min Q Y | X : A B D ( Q Y | X P Y | X | P X ) + ρ I ( P X , Q Y | X ) .

7. Unconstrained Maximization of I α c ( X ; Y )

52.
In view of the fact that it is much easier to determine the α -mutual information than the order- α Augustin–Csiszár information, it would be advantageous to show that the unconstrained maximum of I α c ( X ; Y ) equals the unconstrained maximum of I α ( X ; Y ) . In the finite-alphabet setting, in which it is possible to invoke a "minisup” theorem (e.g., see Section 7.1.7 of [59]), Csiszár [32] showed this result for α > 0 . The assumption of finite output alphabets was dropped in Theorem 1 of [42], and further generalized in Theorem 3 of the same reference. As we see next, for α ( 0 , 1 ) , it is possible to give an elementary proof without restrictions on the alphabets.
Theorem 15.
Let α ( 0 , 1 ) . If the suprema are over P A , the set of all probability measures defined on the input space, then
sup X I α c ( X ; Y ) = sup X I α ( X ; Y ) .
Proof. In view of (143), ≥ holds in (187). To show ≤, we assume sup X I α ( X ; Y ) < as, otherwise, there is nothing left to prove. The unconstrained maximization identity in (183) implies
sup X I α ( X ; Y ) = sup a A D α ( P Y | X = a P Y [ α ] * )
= sup P X P E D α ( P Y | X ( · | X ) P Y [ α ] * )
inf Q Q sup P X P E D α ( P Y | X ( · | X ) Q )
sup P X P inf Q Q E D α ( P Y | X ( · | X ) Q )
= sup X I α c ( X ; Y ) ,
where P Y [ α ] * is the unique α -response to any input that achieves the maximal α -mutual information, and if there is no such input, it is the limit of the α -responses to any asymptotically optimal input sequence (Item 48). □
Furthermore, if { X n } is asymptotically optimal for I α , i.e., lim n I α ( X n ; Y n ) = sup X I α ( X ; Y ) , then { X n } is also asymptotically optimal for I α c because for any δ > 0 , we can find N, such that for all n > N ,
I α ( X n ; Y n ) + δ sup a A D α ( P Y | X = a P Y [ α ] * )
E D α ( P Y | X ( · | X n ) P Y [ α ] * )
I α c ( X n ; Y n )
I α ( X n ; Y n ) .

8. Maximization of I α c ( X ; Y ) Subject to Average Cost Constraints

This section is at the heart of the relevance of Rényi information measures to error exponent functions.
53.
Given α ( 0 , 1 ) , P Y | X : A B , a cost function b : A [ 0 , ) and real scalar θ 0 , the objective is to maximize the Augustin–Csiszár mutual information allowing only those probability measures that satisfy E [ b ( X ) ] θ , namely,
C α c ( θ ) = sup P X : E [ b ( X ) ] θ I α c ( P X , P Y | X ) .
Unfortunately, identity (187) no longer holds when the maximizations over the input probability measure are cost-constrained, and, in general, we can only claim
C α c ( θ ) sup P X : E [ b ( X ) ] θ I α ( P X , P Y | X ) .
A conceptually simple approach to solve for C α c ( θ ) is to
(a)
postulate an input probability measure P X * that achieves the supremum in (197);
(b)
solve for its α -response P Y * using (92);
(c)
show that ( P X * , P Y * ) is a saddle point for the game with payoff function
B ( P X , Q Y ) = D α P Y | X = x Q Y d P X ,
where Q Y P B and P X is chosen from the convex subset of P A of probability measures that satisfy E [ b ( X ) ] θ .
Since P Y * is already known, by definition, to be the α -response to P X * , verifying the saddle point is tantamount to showing that B ( P X , P Y * ) is maximized by P X * among { P X P A : E [ b ( X ) ] θ } . Theorem 1 of [43] guarantees the existence of a saddle point in the case of finite input alphabets. In addition to the fact that it is not always easy to guess the optimum input P X * (see e.g., Section 12), the main stumbling block is the difficulty in determining the α -response to any candidate input distribution, although sometimes this is indeed feasible as we saw in Example 7.
54.
Naturally, Theorem 15 implies
C α c ( θ ) sup X I α ( X ; Y ) .
If the unconstrained maximization of I α c ( · , P Y | X ) is achieved by an input probability measure P X that satisfies E [ b ( X ) ] θ , then equality holds in (200), and both sides are equal to I α c ( P X , P Y | X ) . In that case, the average cost constraint is said to be inactive. For most cost functions and random transformations of practical interest, the cost constraint is active for all θ > 0 . To ascertain whether it is, we simply check whether some input achieving the right side of (200) happens to satisfy the constraint. If so, C α c ( θ ) has been found. The same holds if we can find a sequence { X n } such that E [ b ( X n ) ] θ and I α ( X n ; Y n ) sup X I α ( X ; Y ) . Otherwise, we proceed with the method described below. Thus, henceforth, we assume that the cost constraint is active.
55.
The approach proposed in this paper to solve for C α c ( θ ) for α ( 0 , 1 ) hinges on the variational representation in (172), which allows us to sidestep having to find any α -response. Note that once we set out to maximize I α c ( P X , P Y | X ) over P = { P X P A : E [ b ( X ) ] θ } , the allowable Q X in the maximization in (175) range over a ξ -blow-up of P defined by
Γ ξ ( P ) = Q X P A : P X P , such that D ( P X Q X ) ξ .
As we show in Item 56, we can accomplish such an optimization by solving an unconstrained maximization of the sum of α -mutual information and a term suitably derived from the cost function.
56.
It will not be necessary to solve for (176), as our goal is to further maximize (172) over P X subject to an average cost constraint. The Lagrangian corresponding to the constrained optimization in (197) is
L α ( ν , P X ) = I α c ( X ; Y ) ν E [ b ( X ) ] + ν θ ,
where on the left side we have omitted, for brevity, the dependence on θ stemming from the last term on the right side. The Lagrange multiplier method (e.g., [60]) implies that if X * achieves the supremum in (197), then there exists ν * 0 such that for all P X on A and ν 0 ,
L α ( ν * , P X ) L α ( ν * , P X * ) L α ( ν , P X * ) .
Note from (202) that the right inequality in (203) can only be achieved if
E [ b ( X * ) ] = θ ,
and, consequently,
C α c ( θ ) = L α ( ν * , P X * ) = min ν 0 max P X L α ( ν , P X ) = max P X min ν 0 L α ( ν , P X ) .
The pivotal result enabling us to obtain C α c ( θ ) without the need to deal with Augustin–Csiszár mutual information is the following.
Theorem 16.
Given α ( 0 , 1 ) , ν 0 , P Y | X : A B , and b : A [ 0 , ) , denote the function
A α ( ν ) = max X I α ( X ; Y ) + 1 1 α log E exp ( 1 α ) ν b ( X ) .
Then,
sup P X P A L α ( ν , P X ) = ν θ + A α ( ν ) ,
and
C α c ( θ ) = min ν 0 ν θ + A α ( ν ) .
Proof. 
Plugging (172) into (197) we obtain, with X P X , and X ^ Q X ,
sup P X P A L α ( ν , P X ) = sup P X I α c ( X ; Y ) ν E [ b ( X ) ] + ν θ
= sup P X P A max Q X P A I α ( Q X , P Y | X ) 1 1 α D ( P X Q X ) ν E [ b ( X ) ] + ν θ
= ν θ + max Q X P A I α ( Q X , P Y | X ) 1 1 α inf P X D ( P X Q X ) + ν ( 1 α ) E [ b ( X ) ]
= ν θ + max Q X P A I α ( Q X , P Y | X ) + 1 1 α log E exp ν ( 1 α ) b ( X ^ )
= ν θ + A α ( ν ) ,
where (209) and (213) follow from (202) and (206), respectively, and (212) follows by invoking Theorem 1 with Z Q X and
g ( a ) = ( 1 α ) ν b ( a ) ,
which is nonnegative since α ( 0 , 1 ) and ν > 0 . Finally, (208) follows from (205) and (207). □
In conclusion, we have shown that the maximization of Augustin–Csiszár mutual information of order α subject to E [ b ( X ) ] θ boils down to the unconstrained maximization of a Lagrangian consisting of the sum of α -mutual information and an exponential average of the cost function. Circumventing the need to deal with α -responses and with Augustin–Csiszár mutual information of order α leads to a particularly simple optimization, as illustrated in Section 11 and Section 12.
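To illustrate Theorem 16 numerically in the finite-alphabet case (a sketch of ours, with a hypothetical channel, cost function and parameter values), the snippet below maximizes the bracketed functional in (206) over the input simplex, with the exponential average computed as the expectation of exp of minus ( 1 α ) ν b ( X ) , which is how we read (206), and then minimizes ν θ + A α ( ν ) over ν as in (208). All outputs are in nats.

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar

def I_alpha(P, W, alpha):
    # alpha-mutual information (nats) of input pmf P through channel W[x, y], per (82)
    return alpha / (alpha - 1.0) * np.log(np.sum((P @ W**alpha) ** (1.0 / alpha)))

def A_alpha(nu, alpha, W, b):
    # A_alpha(nu) of (206): maximize I_alpha plus the exponential-average cost term over P_X
    def neg_obj(z):
        P = np.exp(z - z.max()); P /= P.sum()      # softmax parametrization of the simplex
        return -(I_alpha(P, W, alpha)
                 + np.log(P @ np.exp(-(1.0 - alpha) * nu * b)) / (1.0 - alpha))
    return -minimize(neg_obj, np.zeros(W.shape[0]), method='Nelder-Mead').fun

def C_alpha_cost(theta, alpha, W, b):
    # C_alpha^c(theta) = min over nu >= 0 of [nu*theta + A_alpha(nu)], per (208)
    res = minimize_scalar(lambda nu: nu*theta + A_alpha(nu, alpha, W, b),
                          bounds=(0.0, 50.0), method='bounded')
    return res.fun

# toy binary-input, ternary-output channel and cost (hypothetical, for illustration only)
W = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.2, 0.7]])
b = np.array([0.0, 1.0])
print(C_alpha_cost(theta=0.3, alpha=0.5, W=W, b=b))
```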
57.
Theorem 16 solves for the maximal Augustin–Csiszár mutual information of order α under an average cost constraint without having to find either the input probability measure P X * that attains it or its α -response P Y * (using the notation in Item 53). Instead, it gives the solution as
C α c ( θ ) = min ν 0 ν θ + max X I α ( X ; Y ) + 1 1 α log E exp ( 1 α ) ν b ( X ) .
Although we are not going to invoke a minimax theorem, with the aid of Theorem 9-(b) we can see that the functional within the inner brackets is concave in P X ; furthermore, if V ( 0 , 1 ] , then log E V ν is easily seen to be convex in ν with the aid of the Cauchy-Schwarz inequality. Before we characterize the saddle point ( ν * , Q X * ) of the game in (215), we note that ( P X * , P Y * ) can be readily obtained from ( ν * , Q X * ) .
Theorem 17.
Fix α ( 0 , 1 ) . Let ν * > 0 denote the minimizer on the right side of (215), and Q X * the input probability measure that attains the maximum in (206) (or (215)) for ν = ν * . Then,
(a)
Q X * is the α -adjunct of P X * .
(b)
P Y * = Q Y [ α ] * , the α-response to Q X * .
(c)
P X * Q X * with
ı P X * Q X * ( a ) = ( 1 α ) ν * b ( a ) + τ α , a A ,
where τ α is a normalizing constant ensuring that P X * is a probability measure.
Proof. 
(a)
We had already established in Theorem 13 that the maximum on the right side of (210) is achieved by the α -adjunct of P X . In the special case ν = ν * , such P X is P X * . Therefore, Q X * , the argument that achieves the maximum in (206) for ν = ν * , is the α -adjunct of P X * .
(b)
According to Theorem 11, the α -response to Q X * is the α -response to P X * , which is P Y * by definition.
(c)
For ν = ν * , P X * achieves the supremum in (209) and the infimum in (211). Therefore, (216) follows from Theorem 1 with Z Q X * and g ( · ) given by (214) particularized to ν = ν * .
The saddle point of (215) admits the following characterization.
Theorem 18.
If α ( 0 , 1 ) , the saddle point ( ν * , Q X * ) of (215) satisfies
E b ( X ¯ * ) exp ( 1 α ) ν * b ( X ¯ * ) = θ E exp ( 1 α ) ν * b ( X ¯ * ) , X ¯ * Q X * ;
D α P Y | X = a Q Y [ α ] * = ν * b ( a ) + c α ( ν * ) , a A ,
where Q Y [ α ] * is the α-response to Q X * , and c α ( ν * ) does not depend on a A . Furthermore,
A α ( ν * ) = c α ( ν * ) ,
C α c ( θ ) = ν * θ + c α ( ν * ) .
Proof. First, we show that the scalar ν * 0 that minimizes
f ( ν ) = ν θ + I α ( Q X * , P Y | X ) + 1 1 α log E exp ( 1 α ) ν b ( X ¯ * )
satisfies (217). If we abbreviate V = exp ( 1 α ) b ( X ¯ * ) ( 0 , 1 ] , then the dominated convergence theorem results in
d d ν ν θ + 1 1 α log E V ν = θ + 1 1 α E V ν log V E V ν .
Therefore, (217) is equivalent to f ˙ ( ν * ) = 0 , which is all we need on account of the convexity of f ( · ) . To show (218), notice that for all a A ,
( 1 α ) ν * b ( a ) τ α = ı Q X * P X * ( a )
= ( 1 α ) D α ( P Y | X = a P Y * ) + υ α ,
where (223) is (216) and (224) is (157) with P Y α P Y * in view of Theorem 17-(b). In conclusion, (218) holds with
c α ( ν * ) = υ α + τ α α 1 .
Finally, (206) implies
A α ( ν * ) = I α ( Q X * , P Y | X ) + 1 1 α log E exp ( 1 α ) ν * b ( X ¯ * ) = 1 α 1 log E exp ( α 1 ) D α P Y | X ( · | X ¯ * ) P Y *
+ 1 1 α log E exp ( α 1 ) ν * b ( X ¯ * ) = 1 α 1 log E exp ( α 1 ) ν * b ( X ¯ * ) + c α ( ν * )
+ 1 1 α log E exp ( α 1 ) ν * b ( X ¯ * )
= c α ( ν * ) ,
where (227) follows from the definition of α -mutual information and Theorem 17-(b), and (228) follows from (218). Plugging (219) into (208) results in (220). □
58.
Typically, the application of Theorem 18 involves
(a)
guessing the form of the auxiliary input Q X * (modulo some unknown parameter),
(b)
obtaining its α -response Q Y [ α ] * , and
(c)
verifying that (217) and (218) are satisfied for some specific choice of the unknown parameter.
With the same approach, we can postulate, for every ν 0 , an input distribution R X ν , whose α -response R Y [ α ] ν satisfies
D α P Y | X = a R Y [ α ] ν = ν b ( a ) + c α ( ν ) , a A ,
where the only condition we place on c α ( ν ) is that it not depend on a A . If this is indeed the case, then the same derivation in (226)–(229) results in
A α ( ν ) = c α ( ν ) ,
and we determine ν * as the solution to θ = c ˙ α ( ν * ) , in lieu of (217). Section 11 and Section 12 illustrate the effortless nature of this approach to solve for A α ( ν ) . Incidentally, (230) can be seen as the α -generalization of the condition in Problem 8.2 of [48], elaborated later in [61].

9. Gallager’s E 0 Functions and the Maximal Augustin–Csiszár Mutual Information

In keeping with Gallager’s setting [9], we stick to discrete alphabets throughout this section.
59.
In his derivation of an achievability result for discrete memoryless channels, Gallager [8] introduced the function (1), which we repeat for convenience,
E 0 ( ρ , P X ) = log y B x A P X ( x ) P Y | X 1 1 + ρ ( y | x ) 1 + ρ .
Comparing (82) and (232), we obtain
E 0 ( ρ , P X ) = ρ I 1 1 + ρ ( X ; Y ) ,
which, as we mentioned in Section 1, is the observation by Csiszár in [30] that triggered the third phase in the representation of error exponents. Popularized in [9], the E 0 function was employed by Shannon, Gallager and Berlekamp [10] for ρ 0 and by Arimoto [62] for ρ ( 1 , 0 ) in the derivation of converse results in data transmission, the latter of which considers rates above capacity, a region in which error probability increases with blocklength, approaching one at an exponential rate. For the achievability part, [8] showed upper bounds on the error probability involving E 0 ( ρ , P X ) for ρ [ 0 , 1 ] . Therefore, for rates below capacity, the α -mutual information only enters the picture for α ( 0 , 1 ) . One exception in which Rényi divergence of order greater than 1 plays a role at rates below capacity was found by Sason [63], where a refined achievability result is shown for binary linear codes for output symmetric channels (a case in which equiprobable P X maximizes (233)), as a function of their Hamming weight distribution.
Although Gallager did not have the benefit of the insight provided by the Rényi information measures, he did notice certain behaviors of E 0 reminiscent of mutual information. For example, the derivative of (233) with respect to ρ , at ρ 0 is equal to I ( X ; Y ) . As pointed out by Csiszár in [32], in the absence of cost constraints, Gallager’s E 0 function in (232) satisfies
max P X E 0 ( ρ , P X ) = ρ max X I 1 1 + ρ ( X ; Y ) = ρ max X I 1 1 + ρ c ( X ; Y ) ,
in view of (233) and (187).
Recall that Gallager’s modified E 0 function in the case of cost constraints is
E 0 ( ρ , P X , r , θ ) = log y B x A P X ( x ) exp r b ( x ) r θ P Y | X 1 1 + ρ ( y | x ) 1 + ρ ,
which, like (232), he introduced in order to show an achievability result. Up until now, no counterpart to (234) has been found with cost constraints and (235). This is accomplished in the remainder of this section.
60.
In the finite-alphabet case, the following result is useful to obtain a numerical solution for the functional in (206). More importantly, it is relevant to the discussion in Item 61.
Theorem 19.
In the special case of discrete alphabets, the function in (206) is equal to
A α ( ν ) = max G α α 1 log y B a A G ( a ) P Y | X α ( y | a ) 1 α ,
where the maximization is over all G : A [ 0 , ) such that
a A G ( a ) exp ( 1 α ) ν b ( a ) = 1 .
Proof. 
Recalling (82) we have
I α ( X ; Y ) + 1 1 α log E exp ( 1 α ) ν b ( X ) = α α 1 log y B x A P X ( x ) P Y | X = x α ( y ) 1 α
+ 1 1 α log E exp ( 1 α ) ν b ( X )
= α α 1 log y B E P Y | X α ( y | X ) E exp ( 1 α ) ν b ( X ) 1 α
= α α 1 log y B a A G ( a ) P Y | X α ( y | a ) 1 α ,
where
G ( x ) = P X ( x ) a A P X ( a ) exp ( 1 α ) ν b ( a ) .
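As a numerical companion to Theorem 19 (ours; channel, cost and values hypothetical), the inner maximization of (206) can be carried out over G subject to (237): for α ( 0 , 1 ) the prefactor α / ( α 1 ) is negative, so it suffices to minimize a convex function of G under a single linear equality constraint. The value returned below should agree with the direct maximization over the input simplex sketched at the end of Item 56; as there, we read the exponents in (237) with a minus sign.

```python
import numpy as np
from scipy.optimize import minimize

def A_alpha_via_G(nu, alpha, W, b):
    # A_alpha(nu) via (236)-(237); W[x, y] is the channel, b[x] the cost, alpha in (0, 1)
    nA = W.shape[0]
    w = np.exp(-(1.0 - alpha) * nu * b)              # weights appearing in the constraint (237)
    Wa = W**alpha
    S = lambda G: np.sum(np.maximum(Wa.T @ G, 1e-300) ** (1.0 / alpha))
    res = minimize(S, np.ones(nA) / w.sum(),          # feasible start: constant G
                   bounds=[(0.0, None)] * nA,
                   constraints=({'type': 'eq', 'fun': lambda G: G @ w - 1.0},),
                   method='SLSQP')
    return alpha / (alpha - 1.0) * np.log(S(res.x))   # negative prefactor: min of S gives max of (236)

# same toy channel and cost as in the sketch of Item 56 (hypothetical, for illustration only)
W = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.2, 0.7]])
b = np.array([0.0, 1.0])
print(A_alpha_via_G(nu=1.0, alpha=0.5, W=W, b=b))
```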
61.
We can now proceed to close the circle between the maximization of Augustin–Csiszár mutual information subject to average cost constraints (Phase 3 in Section 1) and Gallager’s approach (Phase 1 in Section 1).
Theorem 20.
In the discrete alphabet case, recalling the definitions in (202) and (235), for ρ > 0 ,
max P X E 0 ( ρ , P X , r , θ ) = ρ max P X L 1 1 + ρ r + r ρ , P X , r > 0 ;
min r 0 max P X E 0 ( ρ , P X , r , θ ) = ρ C 1 1 + ρ c ( θ ) ,
where the maximizations are over P A .
Proof. 
With
α = 1 1 + ρ a n d ν = r 1 + ρ ρ = r 1 α ,
the maximization of (235) with the respect to the input probability measure yields
max P X E 0 ( ρ , P X , r , θ )
= max P X ( 1 + ρ ) r θ log y B x A P X ( x ) exp r b ( x ) P Y | X 1 1 + ρ ( y | x ) 1 + ρ
= ρ ν θ + ρ max P X α α 1 log y B x A P X ( x ) exp ( 1 α ) ν b ( x ) P Y | X α ( y | x ) 1 α
= ρ ν θ + ρ max G α α 1 log y B x A G ( x ) P Y | X α ( y | x ) 1 α
= ρ ν θ + ρ A α ( ν )
= ρ max P X L α ( ν , P X ) ,
where
  • the maximization on the right side of (247) is over all G : A [ 0 , ) that satisfy (237), since that constraint is tantamount to enforcing the constraint that P X P A on the left side of (247);
  • (248) ⟸ Theorem 19;
  • (249) ⟸ Theorem 16.
The proof of (242) is complete once (244) is invoked to substitute α and ν from the right side of (249). If we now minimize the outer sides of (245)–(249) with respect to r we obtain, using (205) and (244),
min r 0 max P X E 0 ( ρ , P X , r , θ ) = ρ min r 0 max P X L α r 1 α , P X
= ρ min ν 0 max P X L α ν , P X
= ρ C 1 1 + ρ c ( θ ) .
On p. 329 of [9], Gallager poses the unconstrained maximization (i.e., over P X P A ) of the Lagrangian
E 0 ( ρ , P X , r , θ ) + γ a A P X ( a ) b ( a ) γ θ .
Note the apparent discrepancy between the optimizations in (243) and (253): the latter is parametrized by r and γ (in addition to ρ and θ ), while the maximization on the right side of (243) does not enforce any average cost constraint. In fact, there is no disparity since Gallager loc. cit. finds serendipitously that γ = 0 regardless of r and θ , and, therefore, just one parameter is enough.
62.
The raison d’être for Augustin’s introduction of I α c in [36] was his quest to view Gallager’s approach with average cost constraints under the optic of Rényi information measures. Contrasting (232) and (235) and inspired by the fact that, in the absence of cost constraints, (232) satisfies a variational characterization in view of (69) and (233), Augustin [36] dealt, not with (235), but with
min Q Y D α ( P ˜ Y | X Q Y | P X ) , where P ˜ Y | X = x = P Y | X = x exp r b ( x ) .
Assuming finite alphabets, Augustin was able to connect this quantity with the maximal I α c ( X ; Y ) under cost constraints in an arcane analysis that invokes a minimax theorem. This line of work was continued in Section 5 of [43], which refers to min Q Y D α ( P ˜ Y | X Q Y | P X ) as the Rényi-Gallager information. Unfortunately, since P ˜ Y | X is not a random transformation, the conditional pseudo-Rényi divergence D α ( P ˜ Y | X Q Y | P X ) need not satisfy the key additive decomposition in Theorem 4 so the approach of [36,43] fails to establish an identity equating the maximization of Gallager’s function (235) with the maximization of Augustin–Csiszár mutual information, which is what we have accomplished through a crisp and elementary analysis.

10. Error Exponent Functions

The central objects of interest in the error exponent analysis of data transmission are the functions E sp ( R , P X ) and E r ( R , P X ) of a random transformation P Y | X : A B . Reflecting the three different phases referred to in Section 1, there is no unanimity in the definition of those functions. Following [48], we adopt the standard canonical Phase 2 (Section 1.2) definitions of those functions, which are given in Items 63 and 67.
63.
If R 0 and P X P A , the sphere-packing error exponent function is (e.g., (10.19) of [48])
E sp ( R , P X ) = min Q Y | X : A B I ( P X , Q Y | X ) R D ( Q Y | X P Y | X | P X ) .
64.
As a function of R 0 , the basic properties of (254) for fixed ( P X , P Y | X ) are as follows.
(a)
If R I ( P X , P Y | X ) , then E sp ( R , P X ) = 0 ;
(b)
If R < I ( P X , P Y | X ) , then E sp ( R , P X ) > 0 ;
(c)
The infimum of the arguments for which the sphere-packing error exponent function is finite is denoted by R ( P X ) ;
(d)
On the interval R ( R ( P X ) , I ( P X , P Y | X ) ) , E sp ( R , P X ) is convex, strictly decreasing, continuous, and equal to (254) where the constraint is satisfied with equality. This implies that for R belonging to that interval, we can find ρ R 0 so that for all r 0 ,
E sp ( r , P X ) E sp ( R , P X ) ρ R r + ρ R R .
65.
In view of Theorem 8 and its definition in (254), it is not surprising that E sp ( R , P X ) is intimately related to the Augustin–Csiszár mutual information, through the following key identity.
Theorem 21.
E sp ( R , P X ) = sup ρ 0 ρ I 1 1 + ρ c ( X ; Y ) ρ R , R 0 ;
R ( P X ) = I 0 c ( X ; Y ) .
Proof. 
First note that ≥ holds in (256) because from (128) we obtain, for all ρ 0 ,
ρ I 1 1 + ρ c ( X ; Y ) = min Q Y | X D ( Q Y | X P Y | X | P X ) + ρ I ( P X , Q Y | X )
min Q Y | X : I ( P X , Q Y | X ) R D ( Q Y | X P Y | X | P X ) + ρ I ( P X , Q Y | X )
E sp ( R , P X ) + ρ R ,
where (260) follows from the definition in (254). To show ≤ in (256) for those R such that 0 < E sp ( R , P X ) < , Property (d) in Item 64 allows us to write
min Q Y | X D ( Q Y | X P Y | X | P X ) + ρ R I ( P X , Q Y | X ) = min r 0 E sp ( r , P X ) + ρ R r
E sp ( R , P X ) + ρ R R ,
where (262) follows from (255).
To determine the region where the sphere-packing error exponent is infinite and show (257), first note that if R < I 0 c ( X ; Y ) = lim α 0 I α c ( X ; Y ) , then E sp ( R , P X ) = because for any ρ 0 , the function in { } on the right side of (256) satisfies
ρ I 1 1 + ρ c ( X ; Y ) ρ R = ρ I 1 1 + ρ c ( X ; Y ) ρ I 0 c ( X ; Y ) + ρ I 0 c ( X ; Y ) ρ R
ρ I 0 c ( X ; Y ) ρ R ,
where (264) follows from the monotonicity of I α c ( X ; Y ) in α we saw in (143). Conversely, if I 0 c ( X ; Y ) < R < , there exists ϵ ( 0 , 1 ) such that I ϵ c ( X ; Y ) < R , which implies that in the minimization
I ϵ c ( X ; Y ) = min Q Y | X ϵ 1 ϵ D ( Q Y | X P Y | X | P X ) + I ( P X , Q Y | X )
we may restrict to those Q Y | X such that I ( P X , Q Y | X ) R , and consequently, I ϵ c ( X ; Y ) ϵ 1 ϵ E sp ( R , P X ) . Therefore, to avoid a contradiction, we must have E sp ( R , P X ) < .
The remaining case is I 0 c ( X ; Y ) = . Again, the monotonicity of the Augustin–Csiszár mutual information implies that I α c ( X ; Y ) = for all α > 0 . So, (128) prescribes D ( Q Y | X P Y | X | P X ) = for any Q Y | X such that I ( P X , Q Y | X ) < . Therefore, E sp ( R , P X ) = for all R 0 , as we wanted to show. □
Augustin [36] provided lower bounds on error probability for codes of type P X as a function of I α c ( X ; Y ) but did not state (256); neither did Csiszár in [32], as he was interested in a non-conventional parametrization (generalized cutoff rates) of the reliability function. As pointed out on p. 5605 of [64], the ingredients for the proof of (256) were already present in the hint of Problem 23 of Section II.5 of [24]. In the discrete case, an exponential lower bound on error probability for codes with constant composition P X is given as a function of I 1 1 + ρ c ( P X , P Y | X ) in [44,64]. As in [64], Nakiboglu [65] gives (256) as the definition of the sphere-packing function and connects it with (254) in Lemma 3 therein, within the context of discrete input alphabets.
In the discrete case, (257) is well-known (e.g., [66]), and given by (83). As pointed out in [40], max X I 0 c ( X ; Y ) is the zero-error capacity with noiseless feedback found by Shannon [67], provided there is at least a pair ( a 1 , a 2 ) A 2 such that P Y | X = a 1 P Y | X = a 2 . Otherwise, the zero-error capacity with feedback is zero.
66.
The critical rate, R c ( P X ) , is defined as the smallest abscissa at which the convex function E sp ( · , P X ) meets its supporting line of slope 1 . According to (256),
I 1 2 c ( X ; Y ) = R c ( P X ) + E sp ( R c ( P X ) , P X ) .
67.
If R 0 and P X P A , the random-coding exponent function is (e.g., (10.15) of [48])
E r ( R , P X ) = min Q Y | X : A B D ( Q Y | X P Y | X | P X ) + [ I ( P X , Q Y | X ) R ] + ,
with [ t ] + = max { 0 , t } .
68.
The random-coding error exponent function is determined by the sphere-packing error exponent function through the following relation, illustrated in Figure 1.
Theorem 22.
E r ( R , P X ) = min r R E sp ( r , P X ) + r R
= 0 , R I ( P X , P Y | X ) ; E sp ( R , P X ) , R [ R c ( P X ) , I ( P X , P Y | X ) ] ; I 1 2 c ( X ; Y ) R , R [ 0 , R c ( P X ) ] .
= sup ρ [ 0 , 1 ] ρ I 1 1 + ρ c ( X ; Y ) ρ R .
Proof. 
Identities (268) and (269) are well-known (e.g., Lemma 10.4 and Corollary 10.4 in [48]). To show (270), note that (256) expresses E sp ( · , P X ) as the supremum of supporting lines parametrized by their slope ρ . By definition of critical rate (for brevity, we do not show explicitly its dependence on P X ), if R [ R c , I ( P X , P Y | X ) ] , then E sp ( R , P X ) can be obtained by restricting the optimization in (256) to ρ [ 0 , 1 ] . In that segment of values of R, E sp ( R , P X ) = E r ( R , P X ) according to (269). Moreover, on the interval R [ 0 , R c ] , we have
max ρ [ 0 , 1 ] ρ I 1 1 + ρ c ( X ; Y ) ρ R = I 1 2 c ( X ; Y ) R
= E sp ( R c , P X ) + R c R
= E r ( R , P X ) ,
where we have used (266) and (269). □
The first explicit connection between E r ( R , P X ) and the Augustin–Csiszár mutual information was made by Poltyrev [35] although he used a different form for I α c ( X ; Y ) , as we discussed in (29).
69.
The unconstrained maximizations over the input distribution of the sphere-packing and random coding error exponent functions are denoted, respectively, by
E sp ( R ) = sup P X E sp ( R , P X ) ,
E r ( R ) = sup P X E r ( R , P X ) .
Coding theorems [8,9,10,22,48] have shown that when these functions coincide they yield the reliability function (optimum speed at which the error probability vanishes with blocklength) as a function of the rate R < max X I ( X ; Y ) . The intuition is that, for the most favorable input distribution, errors occur when the channel behaves so atypically that codes of rate R are not reliable. There are many ways in which the channel may exhibit such behavior and they are all unlikely, but the most likely among them is the one that achieves (254).
It follows from (187), (256) and (270) that (274) and (275) can be expressed as
E sp ( R ) = sup ρ 0 ρ sup X I 1 1 + ρ ( X ; Y ) ρ R ,
E r ( R ) = sup ρ [ 0 , 1 ] ρ sup X I 1 1 + ρ ( X ; Y ) ρ R .
Therefore, we can sidestep working with the Augustin–Csiszár mutual information in the absence of cost constraints.
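In the absence of cost constraints, (276) and (277) reduce the computation of the error exponent functions to repeated maximizations of α-mutual information. The sketch below (ours; the binary symmetric channel and the rate are chosen only for illustration) evaluates both functions for a discrete channel, in nats.

```python
import numpy as np
from scipy.optimize import minimize, minimize_scalar

def max_I_alpha(W, alpha):
    # sup over X of I_alpha(X;Y) in nats, maximizing (82) over the input simplex
    def neg_I(z):
        P = np.exp(z - z.max()); P /= P.sum()
        return -alpha / (alpha - 1.0) * np.log(np.sum((P @ W**alpha) ** (1.0 / alpha)))
    return -minimize(neg_I, np.zeros(W.shape[0]), method='Nelder-Mead').fun

def Esp(R, W, rho_max=50.0):
    # sphere-packing exponent via (276)
    f = lambda rho: -(rho * max_I_alpha(W, 1.0 / (1.0 + rho)) - rho * R)
    return -minimize_scalar(f, bounds=(1e-6, rho_max), method='bounded').fun

def Er(R, W):
    # random-coding exponent via (277): same objective, with rho restricted to [0, 1]
    f = lambda rho: -(rho * max_I_alpha(W, 1.0 / (1.0 + rho)) - rho * R)
    return -minimize_scalar(f, bounds=(1e-6, 1.0), method='bounded').fun

p = 0.1                                # binary symmetric channel with crossover probability 0.1
W = np.array([[1.0 - p, p],
              [p, 1.0 - p]])
R = 0.2                                # nats; the capacity is log 2 - h(0.1), about 0.368 nats
print(Esp(R, W), Er(R, W))
```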
70.
Shannon [1] showed that, operating at rates below maximal mutual information, it is possible to find codes whose error probability vanishes with blocklength; for the converse, instead of error probability, Shannon measured reliability by the conditional entropy of the message given the channel output. That alternative reliability measure, as well as its generalization to Arimoto-Rényi conditional entropy, is also useful in analyzing the average performance over code ensembles. It turns out (see, e.g., [28,68]) that, below capacity, those conditional entropies also vanish exponentially fast, in much the same way as the error probability, with bounds governed by E sp ( R ) and E r ( R ) , thereby lending additional operational significance to those functions.
71.
We now introduce a cost function b : A [ 0 , ) and real scalar θ 0 , and reexamine the optimizations in (274) and (275) allowing only those probability measures that satisfy E [ b ( X ) ] θ . With a patent, but unavoidable, abuse of notation we define
E sp ( R , θ ) = sup P X : E [ b ( X ) ] θ E sp ( R , P X )
= sup ρ 0 ρ sup P X : E [ b ( X ) ] θ I 1 1 + ρ c ( X ; Y ) ρ R
= sup ρ 0 ρ C 1 1 + ρ c ( θ ) ρ R
= sup ρ 0 ρ R + ρ min ν 0 ν θ + A 1 1 + ρ ( ν ) = sup ρ 0 ρ R + min ν 0 ρ ν θ
+ max X ρ I 1 1 + ρ ( X ; Y ) + ( 1 + ρ ) log E exp ρ ν 1 + ρ b ( X ) ,
where (279), (281) and (282) follow from (256), (208) and (206), respectively.
72.
In parallel to (278)–(281),
E r ( R , θ ) = sup P X : E [ b ( X ) ] θ E r ( R , P X )
= sup ρ [ 0 , 1 ] ρ sup P X : E [ b ( X ) ] θ I 1 1 + ρ c ( X ; Y ) ρ R
= sup ρ [ 0 , 1 ] ρ C 1 1 + ρ c ( θ ) ρ R ,
where (284) follows from (270). In particular, if we define the critical rate and the cutoff rate as
R c = sup P X : E [ b ( X ) ] θ R c ( P X ) ,
R 0 = sup P X : E [ b ( X ) ] θ I 1 2 c ( X ; Y ) ,
respectively, then it follows from (270) that
E r ( R ) = R 0 R , R [ 0 , R c ] .
Summarizing, the evaluation of E sp ( R , θ ) and E r ( R , θ ) can be accomplished by the method proposed in Section 8, at the heart of which is the maximization in (206) involving α -mutual information instead of Augustin–Csiszár mutual information. In Section 11 and Section 12, we illustrate the evaluation of the error exponent functions with two important additive-noise examples.

11. Additive Independent Gaussian Noise; Input Power Constraint

We illustrate the procedure in Item 58 by taking Example 6 considerably further.
73.
Suppose A = B = R , b ( x ) = x 2 , and P Y | X = a = N a , σ N 2 . We start by testing whether we can find R X ν P A such that its α -response satisfies (230). Naturally, it makes sense to try R X ν = N 0 , σ 2 for some yet to be determined σ 2 . As we saw in Example 6, this choice implies that its α -response is R Y [ α ] ν = N 0 , α σ 2 + σ N 2 . Specializing Example 4, we obtain
D α P Y | X = x R Y [ α ] ν = D α N x , σ N 2 N 0 , α σ 2 + σ N 2
= 1 2 log 1 + α σ 2 σ N 2 1 2 ( 1 α ) log 1 + α ( 1 α ) σ 2 α 2 σ 2 + σ N 2 + 1 2 α x 2 α 2 σ 2 + σ N 2 log e .
Therefore, (230) is indeed satisfied with
c α ( ν ) = 1 2 log 1 + α σ 2 σ N 2 1 2 ( 1 α ) log 1 + α ( 1 α ) σ 2 α 2 σ 2 + σ N 2 ,
ν = 1 2 α α 2 σ 2 + σ N 2 log e ,
where (292) follows if we choose the variance of the auxiliary input as
σ 2 = log e 2 α ν σ N 2 α 2
= σ N 2 α 2 α λ 1 .
In (294) we have introduced an alternative, more convenient, parametrization for the Lagrange multiplier
λ = 2 ν σ N 2 log e ( 0 , α ) .
In conclusion, with the choice in (293), N 0 , σ 2 attains the maximum in (206), and in view of (231), A α ( ν ) is given by the right side of (291) substituting σ 2 by (293). Therefore, we have
ν θ + A α ( ν ) = λ 2 snr log e + c α λ log e 2 σ N 2
= λ 2 snr log e + 1 2 log 1 + 1 λ 1 α 1 2 ( 1 α ) log α λ ( 1 α ) + log α 1 α ,
where we denoted snr = θ σ N 2 .
In accordance with Theorem 16 all that remains is to minimize (297) with respect to ν , or equivalently, with respect to λ . Differentiating (297) with respect to λ , the minimum is achieved at λ * satisfying
snr = 1 λ * α λ * α λ * + α λ * ,
whose only valid root (obtained by solving a quadratic equation) is
λ * = 1 + α snr α Δ 2 snr ( 1 α ) ( 0 , α ) ,
with Δ defined in (118). So, for α ( 0 , 1 ) , (208) becomes
C α c ( snr σ N 2 ) = 1 + α snr α Δ 4 ( 1 α ) log e + 1 2 log 1 + 2 snr ( 1 α ) 1 + α snr α Δ 1 α 1 2 ( 1 α ) log α snr + α Δ 1 2 snr α 2 .
Letting α = 1 1 + ρ , we obtain
C 1 1 + ρ c ( snr σ N 2 ) = snr 2 ρ 1 β log e + 1 2 log ( 1 + β snr ) 1 + ρ 2 ρ log ( 1 + ρ ) β ,
with
β = 1 2 1 1 α snr + Δ snr = 1 2 1 1 + ρ snr + 4 snr + 1 + ρ snr 1 2 .
74.
Alternatively, it is instructive to apply Theorem 18 to the current Gaussian/quadratic cost setting. Suppose we let Q X * = N 0 , σ * 2 , where σ * 2 is to be determined. With the aid of the formulas
E X 2 e μ X 2 = σ 2 1 + 2 μ σ 2 3 2 ,
E e μ X 2 = 1 1 + 2 μ σ 2 ,
where μ 0 , and X N 0 , σ 2 , (217) becomes
1 snr = σ N 2 σ * 2 + ( 1 α ) λ * ,
upon substituting σ 2 σ * 2 and
μ ν * 1 α log e = λ * 1 α 2 σ N 2 .
Likewise (218) translates into (291) and (292) with ( ν , σ 2 ) ( ν * , σ * 2 ) , namely,
c α ( ν * ) = 1 2 log 1 + α σ * 2 σ N 2 1 2 ( 1 α ) log 1 + α ( 1 α ) σ * 2 α 2 σ * 2 + σ N 2 ,
λ * = α σ N 2 α 2 σ * 2 + σ N 2 .
Eliminating σ * 2 from (305) by means of (308) results in (299) and the same derivation that led to (300) shows that it is equal to ν * θ + c α ( ν * ) .
75.
Applying Theorem 17, we can readily find the input distribution, P X * , that attains C α c ( θ ) as well as its α -response P Y * (recall the notation in Item 53). According to Example 2, P Y * , the α -response to Q X * , is Gaussian with zero mean and variance
σ N 2 + α σ * 2 = σ N 2 1 + 1 λ * 1 α
= σ N 2 2 2 1 α + Δ + snr ,
where (309) follows from (308) and (310) follows by using the expression for Δ in (118). Note from Example 7 that P Y * is nothing but the α -response to N 0 , snr σ N 2 . We can easily verify from Theorem 17 that indeed P X * = N 0 , snr σ N 2 since in this case (216) becomes
ı P X * Q X * ( a ) = ( 1 α ) ν * a 2 + τ α ,
which can only be satisfied by P X * = N 0 , snr σ N 2 in view of (305). As an independent confirmation, we can verify, after some algebra, that the right sides of (127) and (300) are identical.
In fact, in the current Gaussian setting, we could start by postulating that the distribution that maximizes the Augustin–Csiszár mutual information under the second moment constraint does not depend on α and is given by P X * = N 0 , θ . Its α -response P Y α * was already obtained in Example 7. Then, an alternative method to find C α c ( θ ) , given in Section 6.2 of [43], is to follow the approach outlined in Item 53. To validate the choice of P X * we must show that it maximizes B ( P X , P Y α * ) (in the notation introduced in (199)) among the subset of P A which satisfies E [ X 2 ] θ . This follows from the fact that D α P Y | X = x P Y α * is an affine function of x 2 .
76.
Let’s now use the result in Item 73 to evaluate, with a novel parametrization, the error exponent functions for the Gaussian channel under an average power constraint.
Theorem 23.
Let A = B = R , b ( x ) = x 2 , and P Y | X = a = N a , σ N 2 . Then, for β [ 0 , 1 ] ,
E sp ( R , snr σ N 2 ) = snr 2 ( 1 β ) log e 1 2 log 1 + snr β ( 1 β ) ,
R = 1 2 log 1 + β 2 β ( 1 β ) + 1 snr .
The critical rate and cutoff rate are, respectively,
R c = 1 2 log 1 2 + snr 4 + 1 2 1 + snr 2 4 ,
R 0 = 1 2 1 + snr 2 1 + snr 2 4 log e + 1 2 log 1 2 + 1 2 1 + snr 2 4 .
Proof. 
Expression (315) for the cutoff rate follows by letting ρ = 1 in (301) and (302). The supremum in (281) is attained by ρ * 0 that satisfies (recall the concavity result in Theorem 9-(a))
R = d d ρ ρ C 1 1 + ρ c ( snr σ N 2 ) | ρ ρ *
= 1 2 log snr + 1 β 1 2 log 1 + ρ * ,
obtained after a dose of symbolic computation working with (301). In particular, letting ρ * = 1 , we obtain the critical rate in (314). Note that if in (302) we substitute ρ ρ * , with ρ * given as a function of R, snr and β by (317), we end up with an equation involving R, snr , and β . We proceed to verify that that equation is, in fact, (313). By solving a quadratic equation, we can readily check that (302) is the positive root of
1 + ρ = snr ( 1 β ) + 1 β .
If we particularize (318) to ρ ρ * , with ρ * given by (317), namely,
ρ * = 1 + exp ( 2 R ) snr + 1 β ,
we obtain
exp ( 2 R ) = snr β + 1 snr β ( 1 β ) + 1 ,
which is (313). Notice that the right side of (320) is monotonic increasing in β > 0 ranging from 1 (for β = 0 ) to 1 + snr (for β = 1 ). Therefore, β [ 0 , 1 ] spans the whole gamut of values of R of interest.
Assembling (281), (301) and (317), we obtain
E sp ( R , snr σ N 2 )
= ρ * R + snr 2 1 β log e + ρ * 2 log ( 1 + β snr ) 1 + ρ * 2 log ( 1 + ρ * ) β = ρ * R + snr 2 1 β log e + ρ * 2 log ( 1 + β snr ) 1 + ρ * 2 log β
+ ( 1 + ρ * ) R 1 + ρ * 2 log snr + 1 β
= R + snr 2 1 β log e 1 2 log 1 + β snr
= snr 2 ( 1 β ) log e 1 2 log 1 + snr β ( 1 β ) ,
where (324) follows by substituting (313) on the left side. □
Note that the parametric expression in (312) and (313) (shown in Figure 2) is, in fact, a closed-form expression for E sp ( R , snr σ N 2 ) since we can invert (313) to obtain
β = 1 2 1 exp ( 2 R ) 1 + 1 + 4 snr ( 1 exp ( 2 R ) ) .
The random coding error exponent is
E r ( R , θ ) = E sp ( R , θ ) , R ( R c , 1 2 log ( 1 + snr ) ) ; R 0 R , R [ 0 , R c ] ,
with the critical rate R c and cutoff rate R 0 in (314) and (315), respectively. It can be checked that (326) coincides with the expression given by Gallager [9] (p. 340), where he optimizes (235) with respect to ρ and r, but not P X , which he just assumes to be P X = N 0 , θ . The expression for R c in (314) can be found in (7.4.34) of [9]; R 0 in (315) is implicit on p. 340 of [9], and explicit in, e.g., [69].
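The closed forms above are straightforward to evaluate. The sketch below (ours; parameter values arbitrary) computes E sp ( R , snr σ N 2 ) from (312) with β obtained from R through (325), the random-coding exponent from (326) with the critical and cutoff rates (314)–(315), and, as a cross-check, recomputes the sphere-packing exponent by maximizing ρ C 1 1 + ρ c ( snr σ N 2 ) ρ R over ρ using (301)–(302), as in (280). All quantities are in nats.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def Esp_awgn(R, snr):
    # sphere-packing exponent via (312), with beta obtained from R through (325)
    u = 1.0 - np.exp(-2.0 * R)
    beta = 0.5 * u * (1.0 + np.sqrt(1.0 + 4.0 / (snr * u)))
    return 0.5 * snr * (1.0 - beta) - 0.5 * np.log(1.0 + snr * beta * (1.0 - beta))

def Er_awgn(R, snr):
    # random-coding exponent via (326), with the critical and cutoff rates (314)-(315)
    s = np.sqrt(1.0 + snr**2 / 4.0)
    Rc = 0.5 * np.log(0.5 + snr / 4.0 + 0.5 * s)
    R0 = 0.5 * (1.0 + snr / 2.0 - s) + 0.5 * np.log(0.5 + 0.5 * s)
    return Esp_awgn(R, snr) if R > Rc else R0 - R

def C_awgn(rho, snr):
    # maximal Augustin-Csiszar information of order 1/(1+rho) under the power constraint, per (301)-(302)
    beta = 0.5 * (1.0 - (1.0 + rho) / snr
                  + np.sqrt((1.0 - (1.0 + rho) / snr)**2 + 4.0 / snr))
    return (snr * (1.0 - beta) / (2.0 * rho) + 0.5 * np.log(1.0 + beta * snr)
            - (1.0 + rho) / (2.0 * rho) * np.log((1.0 + rho) * beta))

def Esp_awgn_check(R, snr):
    # cross-check: the sphere-packing exponent as the supremum over rho in (280)
    f = lambda rho: -(rho * C_awgn(rho, snr) - rho * R)
    return -minimize_scalar(f, bounds=(1e-6, 50.0), method='bounded').fun

snr = 3.0
for R in (0.2, 0.4, 0.6):              # the capacity is 0.5*log(1+snr), about 0.693 nats
    print(R, Esp_awgn(R, snr), Esp_awgn_check(R, snr), Er_awgn(R, snr))
```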
77.
The expression for E sp ( R , θ ) in Theorem 23 has more structure than meets the eye. The analysis in Item 73 has shown that E sp ( R , P X ) is maximized over P X with second moment not exceeding θ by P X * = N 0 , θ regardless of R 0 , 1 2 log ( 1 + snr ) . The fact that we have found a closed-form expression for (254) when evaluated at such an input probability measure and P Y | X = a = N a , σ N 2 is indicative that the minimum therein is attained by a Gaussian random transformation Q Y | X * . This is indeed the case: define the random transformation
Q Y | X = a * = N β a , σ 1 2 ,
σ 1 2 σ N 2 = 1 + snr β ( 1 β ) .
In comparison with the nominal random transformation P Y | X = a = N a , σ N 2 , this channel attenuates the input and contaminates it with a more powerful noise. Then,
I ( P X * , Q Y | X * ) = 1 2 log 1 + β 2 β ( 1 β ) + 1 snr = R .
Furthermore, invoking (33), we get
D ( Q Y | X * P Y | X | P X * ) = E D N β X * , σ 1 2 N X * , σ N 2
= 1 2 ( β 1 ) 2 snr + σ 1 2 σ N 2 1 log e 1 2 log σ 1 2 σ N 2
= snr 2 ( 1 β ) log e 1 2 log 1 + snr β ( 1 β )
= E sp ( R , snr σ N 2 ) ,
where (333) is (312). Therefore, Q Y | X * does indeed achieve the minimum in (254) if P Y | X = a = N a , σ N 2 and P X * = N 0 , θ . So, the most likely error mechanism is the result of atypically large noise strength and an attenuated received signal. The two effects cannot be subsumed by additional noise variance alone: there is no σ 2 > 0 such that Q Y | X = a = N a , σ 2 achieves the minimum in (254).
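A quick numerical confirmation of this saddle channel (a sketch of ours; the chosen rate and snr are arbitrary): for the β associated with R through (325), the mutual information (329) should reproduce R, and the conditional relative entropy (331) should reproduce E sp ( R , snr σ N 2 ) in (312).

```python
import numpy as np

def tilted_channel_check(R, snr):
    u = 1.0 - np.exp(-2.0 * R)
    beta = 0.5 * u * (1.0 + np.sqrt(1.0 + 4.0 / (snr * u)))           # (325)
    s1 = 1.0 + snr * beta * (1.0 - beta)                               # sigma_1^2 / sigma_N^2, per (328)
    I = 0.5 * np.log(1.0 + beta**2 / (beta * (1.0 - beta) + 1.0/snr))  # (329); should equal R
    D = 0.5 * ((beta - 1.0)**2 * snr + s1 - 1.0) - 0.5 * np.log(s1)    # (331), in nats
    Esp = 0.5 * snr * (1.0 - beta) - 0.5 * np.log(s1)                  # (312); should equal D
    return I, D, Esp

print(tilted_channel_check(R=0.4, snr=3.0))
```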

12. Additive Independent Exponential Noise; Input-Mean Constraint

This section finds the sphere-packing error exponent for the additive independent exponential noise channel under an input-mean constraint.
78.
Suppose that A = B = [ 0 , ) , b ( x ) = x , and
Y = X + N ,
where N is exponentially distributed, independent of X, and E [ N ] = ζ . Therefore P Y | X = a has density
p Y | X = a ( t ) = 1 ζ e t a ζ 1 { t a } .
It is shown in [70,71] that
max X : E [ X ] θ I ( X ; X + N ) = log 1 + snr ,
snr = θ ζ ,
achieved by a mixed random variable with density
f X * ( t ) = ζ ζ + θ δ ( t ) + θ ( ζ + θ ) 2 e t / ( ζ + θ ) 1 { t > 0 } .
To determine C α c ( snr ζ ) , α ( 0 , 1 ) , we invoke Theorem 18. A sensible candidate for the auxiliary input distribution Q X * is a mixed random variable with density
q X * ( t ) = Γ * δ ( t ) + 1 Γ * 1 μ e t / μ 1 { t > 0 } ,
μ = ζ α Γ * ,
where Γ * ( 0 , 1 ) is yet to be determined. This is an attractive choice because its α -response, Q Y [ α ] * , is particularly simple: exponential with mean α μ = ζ Γ * , as we can verify using Laplace transforms. Then, if Z is exponential with unit mean, with the aid of Example 5, we can write
D α P Y | X = x Q Y [ α ] * = D α ( ζ Z + x α μ Z )
= x α μ log e + log α μ ζ + 1 1 α log α + ( 1 α ) ζ α μ
= Γ * x ζ log e log Γ * + 1 1 α log α + ( 1 α ) Γ * .
So, (218) is satisfied with
ν * = Γ * ζ log e ,
c α ( ν * ) = 1 1 α log α + ( 1 α ) Γ * log Γ * .
To evaluate (217), it is useful to note that if γ > 1 , then
E Z e γ Z = 1 ( 1 + γ ) 2 ,
E e γ Z = 1 1 + γ .
Therefore, the left side of (217) specializes to, with X ¯ * Q X * ,
E b ( X ¯ * ) exp ( 1 α ) ν * b ( X ¯ * ) = μ ( 1 Γ * ) 1 + μ ( 1 α ) ν * log e 2
= ζ α 1 Γ * 1 ,
while the expectation on the right side of (217) is given by
E exp ( 1 α ) ν * b ( X ¯ * ) = α + Γ * α Γ * .
Therefore, (217) yields
snr = 1 Γ * 1 α + ( 1 α ) Γ *
whose solution is
Γ * = 1 2 ρ snr 1 + snr 2 + 4 ρ snr 1 snr ,
with ρ = 1 α α . So, finally, (220), (344) and (345) give the closed-form expression
C α c ( θ ) = snr Γ * log e log Γ * + 1 1 α log α + ( 1 α ) Γ * .
As in Item 73, we can postulate an auxiliary distribution that satisfies (230) for every ν 0 . This is identical to what we did in (341)–(343) except that now (344) and (345) hold for generic ν and Γ . Then, (351) is the result of solving θ = c ˙ α ( ν * ) , which is, in fact, somewhat simpler than obtaining it through (217).
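As a numerical sanity check of the foregoing (ours; parameter values arbitrary), the root (351) must satisfy the saddle-point condition in the form 1 - Γ* = (1 + ρ Γ*) Γ* snr, and the closed form (353) must approach the constrained capacity log ( 1 + snr ) of (336) as α approaches 1. A short sketch:

```python
import numpy as np

def Gamma_star(alpha, snr):
    # the root (351), with rho = (1 - alpha)/alpha
    rho = (1.0 - alpha) / alpha
    return (np.sqrt((1.0 + snr)**2 + 4.0 * rho * snr) - 1.0 - snr) / (2.0 * rho * snr)

def C_exp(alpha, snr):
    # closed form (353) for C_alpha^c(theta) with theta = snr * zeta, in nats
    G = Gamma_star(alpha, snr)
    return snr * G - np.log(G) + np.log(alpha + (1.0 - alpha) * G) / (1.0 - alpha)

alpha, snr = 0.5, 3.0
G, rho = Gamma_star(alpha, snr), (1.0 - alpha) / alpha
print(1.0 - G, (1.0 + rho * G) * G * snr)       # both sides of the saddle-point condition agree
print(C_exp(alpha, snr))                         # order-1/2 value, i.e., the cutoff rate under the constraint
print(C_exp(0.999, snr), np.log(1.0 + snr))      # alpha near 1 recovers (336)
```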
79.
We proceed to get a very simple parametric expression for E sp ( R , θ ) .
Theorem 24.
Let A = B = [ 0 , ) , b ( x ) = x , and Y = X + N , with N exponentially distributed, independent of X, and E [ N ] = ζ . Then, under the average cost constraint E [ b ( X ) ] ζ snr ,
E sp ( R , ζ snr ) = 1 η 1 log e + log η ,
R = log ( 1 + η snr ) ,
where η ( 0 , 1 ] .
Proof. 
Rewriting (353) results in
ρ C 1 1 + ρ c ( θ ) = ρ snr Γ * log e ρ log Γ * + ( 1 + ρ ) log 1 + ρ Γ * 1 + ρ ,
which is monotonically decreasing with ρ . With Γ ˙ * = ρ Γ * ( ρ , snr ) , the counterpart of (317) is now
R = d d ρ ρ C 1 1 + ρ c ( θ ) | ρ ρ *
= ( Γ * + ρ * Γ ˙ * ) snr 1 Γ * + 1 + ρ * 1 + ρ * Γ * log e + log 1 + ρ * Γ * Γ * + ρ * Γ *
= ( Γ * + ρ * Γ ˙ * ) snr + 1 Γ * Γ * 1 1 + ρ * Γ * log e + log 1 + ρ * Γ * Γ * + ρ * Γ *
= log 1 + ρ * Γ * Γ * + ρ * Γ * ,
where the drastic simplification in (360) occurs because, with the current parametrization, (351) becomes
1 Γ * = ( 1 + ρ * Γ * ) Γ * snr .
Now we go ahead and express both ρ * and Γ * as functions of snr and R exclusively. We may rewrite (357)–(360) as
ρ * Γ * = exp ( R ) Γ * 1 exp ( R ) ,
which, when plugged in (361), results in
Γ * = 1 snr 1 exp ( R ) < 1 ,
ρ * = ( 1 + snr ) exp ( R ) 1 1 exp ( R ) 2 > 0 ,
where the inequalities in (363) and (364) follow from R < log ( 1 + snr ) . So, in conclusion,
E sp ( R , θ ) = max ρ 0 ρ C 1 1 + ρ c ( θ ) ρ R
= ρ * C 1 1 + ρ * c ( θ ) ρ * R
= ρ * snr Γ * log e ρ * log Γ * + ( 1 + ρ * ) log 1 + ρ * Γ * 1 + ρ * ρ * R
= ρ * snr Γ * log e ρ * log Γ * + ( 1 + ρ * ) ( R + log Γ * ) ρ * R
= ρ * snr Γ * log e + log Γ * + R
= snr exp ( R ) 1 1 log e + log exp ( R ) 1 snr
= 1 η 1 log e + log η ,
where we have introduced
η = exp ( R ) 1 snr = Γ * 1 snr Γ * .
Evidently, the left identity in (372) is the same as (355). □
The critical rate and the cutoff rate are obtained by particularizing (360) and (356) to ρ * = 1 and ρ = 1 , respectively. This yields
R c = log 1 + Γ 1 * 2 Γ 1 * ,
R 0 = snr Γ 1 * log e log 4 Γ 1 * + 2 log 1 + Γ 1 * ,
Γ 1 * = 1 + snr 2 + 4 snr 1 snr 2 snr .
As in (326), the random coding error exponent is
E r ( R , ζ snr ) = E sp ( R , ζ snr ) , R ( R c , log ( 1 + snr ) ) ; R 0 R , R [ 0 , R c ] ,
with the critical rate R c and cutoff rate R 0 in (373) and (375), respectively. This function is shown along with E sp ( R , ζ snr ) in Figure 3 for snr = 3 .
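Finally, a small numerical sketch (ours; snr and rates arbitrary) of the exponential-noise exponents: E sp follows from (354)–(355) with η = ( exp ( R ) - 1 ) / snr as in (372), and E r from the piecewise expression just displayed, with the critical and cutoff rates given above. All quantities are in nats.

```python
import numpy as np

def Esp_expnoise(R, snr):
    # sphere-packing exponent via (354)-(355): eta = (e^R - 1)/snr, cf. (372)
    eta = (np.exp(R) - 1.0) / snr
    return 1.0 / eta - 1.0 + np.log(eta)

def Er_expnoise(R, snr):
    # random-coding exponent: E_sp above the critical rate, R_0 - R below it (cf. Item 79)
    G1 = (np.sqrt((1.0 + snr)**2 + 4.0 * snr) - 1.0 - snr) / (2.0 * snr)   # Gamma_1^*
    Rc = np.log((1.0 + G1) / (2.0 * G1))                                    # critical rate
    R0 = snr * G1 - np.log(4.0 * G1) + 2.0 * np.log(1.0 + G1)               # cutoff rate
    return Esp_expnoise(R, snr) if R > Rc else R0 - R

snr = 3.0
for R in (0.3, 0.8, 1.2):              # the capacity is log(1+snr), about 1.386 nats
    print(R, Esp_expnoise(R, snr), Er_expnoise(R, snr))
```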
80.
In parallel to Item 77, we find the random transformation that explains the most likely mechanism to produce errors at every rate R, namely the minimizer of (254) when P X = P X * , the maximizer of the Augustin–Csiszár mutual information of order α . In this case, P X * is not as trivial to guess as in Section 11, but since we already found Q X * in (339) with Γ = Γ * , we can invoke Theorem 17 to show that the density of P X * achieving the maximal order- α Augustin–Csiszár mutual information is
p X * ( t ) = Γ * α + ( 1 α ) Γ * δ ( t ) + 1 Γ * α + ( 1 α ) Γ * α Γ * ζ e t Γ * / ζ 1 { t > 0 } ,
whose mean is, as it should,
α ζ Γ * 1 Γ * α + ( 1 α ) Γ * = ζ snr = θ .
Let Q Y * be exponential with mean θ + κ , and Q Y | X = a * have density
q Y | X = a * ( t ) = 1 κ e t a κ 1 { t a } ,
with
κ = ζ η ,
and η as defined in (372). Using Laplace transforms, we can verify that P X * Q Y | X * Q Y * where P X * is the probability measure with density in (377). Let Z be unit-mean exponentially distributed. Writing mutual information as the difference between the output differential entropy and the noise differential entropy we get
I ( P X * , Q Y | X * ) = h ( ( θ + κ ) Z ) h ( κ Z )
= log 1 + θ κ
= R ,
in view of (363). Furthermore, using (335) and (379),
D ( Q Y | X * P Y | X | P X * ) = log ζ κ + κ ζ 1 log e
= log η + 1 η 1 log e
= E sp ( R , ζ snr ) ,
where we have used (380) and (354). Therefore, we have shown that Q Y | X * is indeed the minimizer of (254). In this case, the most likely mechanism for errors to happen is that the channel adds independent exponential noise with mean ζ / η , instead of the nominal mean ζ . In this respect, the behavior is reminiscent of that of the exponential timing channel for which the error exponent is dominated (at least above critical rate) by an exponential server which is slower than the nominal [72].

13. Recap

81.
The analysis of the fundamental limits of noisy channels in the regime of vanishing error probability with blocklength growing without bound expresses channel capacity in terms of a basic information measure: the input–output mutual information maximized over the input distribution. In the regime of fixed nonzero error probability, the asymptotic fundamental limit is a function of not only capacity but channel dispersion [73], which is also expressible in terms of an information measure: the variance of the information density obtained with the capacity-achieving distribution. In the regime of exponentially decreasing error probability (at fixed rate below capacity) the analysis of the fundamental limits has gone through three distinct phases. No information measures were involved during the first phase and any optimization with respect to various auxiliary parameters and input distribution had to rely on standard convex optimization techniques, such as Karush-Kuhn-Tucker conditions, which not only are cumbersome to solve in this particular setting, but shed little light on the structure of the solution. The second phase firmly anchored the problem in a large deviations foundation, with the fundamental limits expressed in terms of conditional relative entropy as well as mutual information. Unfortunately, the associated maximinimization in (2) did not immediately lend itself to analytical progress. Thanks to Csiszár’s realization of the relevance of Rényi’s information measures to this problem, the third phase has found a way to, not only express the error exponent functions as a function of information measures, but to solve the associated optimization problems in a systematic way. While, in the absence of cost constraints, the problem reduces to finding the maximal α -mutual information, cost constraints make the problem much more challenging because of the difficulty in determining the order- α Augustin–Csiszár mutual information. Fortunately, thanks to the introduction of an auxiliary input distribution (the α -adjunct of the distribution that maximizes I α c ), we have shown that α -mutual information also comes to the rescue in the maximization of the order- α Augustin–Csiszár mutual information in the presence of average cost constraints. We have also finally ended the isolation of Gallager’s E 0 function with cost constraints from the representations in Phases 2 and 3. The pursuit of such a link is what motivated Augustin in 1978 to define a generalized mutual information measure. Overall, the analysis has given yet another instance of the benefits of variational representations of information measures, leading to solutions based on saddle points. However, we have steered clear of off-the-shelf minimax theorems and their associated topological constraints.
We have worked out two channels/cost constraints (additive Gaussian noise with quadratic cost, and additive exponential noise with a linear cost) that admit closed-form error-exponent functions, most easily expressed in parametric form. Furthermore, in Items 77 and 80 we have illuminated the structure of those closed-form expressions by identifying the anomalous channel behavior responsible for most errors at every given rate. In the exponential noise case, the solution is simply a noisier exponential channel, while in the Gaussian case it is the result of both a noisier Gaussian channel and an attenuated input.
These observations prompt the question of whether there might be an alternative general approach that eschews Rényi’s information measures to arrive at not only the most likely anomalous channel behavior, but the error exponent functions themselves.

Funding

This research received no external funding.

Data Availability Statement

Not applicable.

Acknowledgments

The manuscript incorporates constructive suggestions by Academic Editor Igal Sason and the anonymous referees.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A

Recall that the relative information ı P Q is defined only if P Q , while D ( P Q ) [ 0 , + ] is always defined and equal to + if (but not only if) P is not dominated by Q.
Lemma A1.
If Q R and X P R , then
E ı P R ( X ) ı Q R ( X ) = D ( P Q ) ,
regardless of whether the right side is finite.
Proof. 
If P Q R , we may invoke the chain rule (7) to decompose
ı P R ( a ) ı Q R ( a ) = ı P Q ( a ) .
Then, the result follows by taking expectations of (A2) when a X P .
To show that (A1) also holds when P is not dominated by Q, i.e., that the expectation on the left side is + , we invoke the Lebesgue decomposition theorem (e.g., p. 384 of [74]), which ensures that we can find α [ 0 , 1 ) , P 0 Q and P 1 Q , such that
P = α P 1 + ( 1 α ) P 0 .
Since P 1 P 0 , we have
D ( P 1 P ) = log 1 α ,
D ( P 0 P ) = log 1 1 α .
If X 1 P 1 , then
E ı P R ( X 1 ) ı Q R ( X 1 ) = E ı P 1 R ( X 1 ) ı Q R ( X 1 ) E ı P 1 R ( X 1 ) ı P R ( X 1 )
= D ( P 1 Q ) D ( P 1 P )
= D ( P 1 Q ) log 1 α ,
where
  • (A7) ⟸ (A1) with ( P , Q , R ) ( P 1 , Q , R ) and (A1) with ( P , Q , R ) ( P 1 , P , R ) , which we are entitled to invoke since P 1 is dominated by both Q and R;
  • (A8) ⟸ (A4).
Analogously, if X 0 P 0 , then
E ı P R ( X 0 ) = E ı P 0 R ( X 0 ) E ı P 0 R ( X 0 ) ı P R ( X 0 )
= D ( P 0 R ) D ( P 0 P )
= D ( P 0 R ) log 1 1 α .
Therefore, we are ready to conclude that
E ı P R ( X ) ı Q R ( X )
= α E ı P R ( X 1 ) ı Q R ( X 1 ) + ( 1 α ) E ı P R ( X 0 ) ı Q R ( X 0 )
= α D ( P 1 Q ) + ( 1 α ) D ( P 0 R ) ( 1 α ) E ı Q R ( X 0 ) h ( α )
= + ,
where
  • (A12) ⟸ (A3);
  • (A13) ⟸ h ( · ) is the binary entropy function, (A8) and (A11);
  • (A14) ⟸ E ı Q R ( X 0 ) = P 0 x A : d Q d R ( x ) = 0 = 1 P 0 Q .
Corollary A1.
Suppose that Q R and X P R . Then,
E ı Q R ( X ) = D ( P R ) D ( P Q ) ,
as long as at least one of the relative entropies on the right side is finite.

References

  1. Shannon, C.E. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948, 27, 379–423.
  2. Rice, S.O. Communication in the Presence of Noise–Probability of Error for Two Encoding Schemes. Bell Syst. Tech. J. 1950, 29, 60–93.
  3. Shannon, C.E. Probability of Error for Optimal Codes in a Gaussian Channel. Bell Syst. Tech. J. 1959, 38, 611–656.
  4. Elias, P. Coding for Noisy Channels. IRE Conv. Rec. 1955, 4, 37–46.
  5. Feinstein, A. Error Bounds in Noisy Channels without Memory. IRE Trans. Inf. Theory 1955, 1, 13–14.
  6. Shannon, C.E. Certain Results in Coding Theory for Noisy Channels. Inf. Control 1957, 1, 6–25.
  7. Fano, R.M. Transmission of Information; Wiley: New York, NY, USA, 1961.
  8. Gallager, R.G. A Simple Derivation of the Coding Theorem and Some Applications. IEEE Trans. Inf. Theory 1965, 11, 3–18.
  9. Gallager, R.G. Information Theory and Reliable Communication; Wiley: New York, NY, USA, 1968.
  10. Shannon, C.E.; Gallager, R.G.; Berlekamp, E. Lower Bounds to Error Probability for Coding on Discrete Memoryless Channels, I. Inf. Control 1967, 10, 65–103.
  11. Shannon, C.E.; Gallager, R.G.; Berlekamp, E. Lower Bounds to Error Probability for Coding on Discrete Memoryless Channels, II. Inf. Control 1967, 10, 522–552.
  12. Dobrushin, R.L. Asymptotic Estimates of the Error Probability for Transmission of Messages over a Discrete Memoryless Communication Channel with a Symmetric Transition Probability Matrix. Theory Probab. Appl. 1962, 7, 270–300.
  13. Dobrushin, R.L. Optimal Binary Codes for Low Rates of Information Transmission. Theory Probab. Appl. 1962, 7, 208–213.
  14. Kullback, S.; Leibler, R.A. On Information and Sufficiency. Ann. Math. Stat. 1951, 22, 79–86.
  15. Csiszár, I.; Körner, J. Graph Decomposition: A New Key to Coding Theorems. IEEE Trans. Inf. Theory 1981, 27, 5–11.
  16. Barg, A.; Forney, G.D., Jr. Random Codes: Minimum Distances and Error Exponents. IEEE Trans. Inf. Theory 2002, 48, 2568–2573.
  17. Sason, I.; Shamai, S. Performance Analysis of Linear Codes under Maximum-Likelihood Decoding: A Tutorial. Found. Trends Commun. Inf. Theory 2006, 3, 1–222.
  18. Ashikhmin, A.E.; Barg, A.; Litsyn, S.N. A New Upper Bound on the Reliability Function of the Gaussian Channel. IEEE Trans. Inf. Theory 2000, 46, 1945–1961.
  19. Haroutunian, E.A.; Haroutunian, M.E.; Harutyunyan, A.N. Reliability Criteria in Information Theory and in Statistical Hypothesis Testing. Found. Trends Commun. Inf. Theory 2007, 4, 97–263.
  20. Scarlett, J.; Peng, L.; Merhav, N.; Martinez, A.; Guillén i Fàbregas, A. Expurgated Random-Coding Ensembles: Exponents, Refinements, and Connections. IEEE Trans. Inf. Theory 2014, 60, 4449–4462.
  21. Somekh-Baruch, A.; Scarlett, J.; Guillén i Fàbregas, A. A Recursive Cost-Constrained Construction that Attains the Expurgated Exponent. In Proceedings of the 2019 IEEE International Symposium on Information Theory, Paris, France, 7–12 July 2019; pp. 2938–2942.
  22. Haroutunian, E.A. Estimates of the Exponent of the Error Probability for a Semicontinuous Memoryless Channel. Probl. Inf. Transm. 1968, 4, 29–39.
  23. Blahut, R.E. Hypothesis Testing and Information Theory. IEEE Trans. Inf. Theory 1974, 20, 405–417.
  24. Csiszár, I.; Körner, J. Information Theory: Coding Theorems for Discrete Memoryless Systems; Academic: New York, NY, USA, 1981.
  25. Rényi, A. On Measures of Information and Entropy. In Berkeley Symposium on Mathematical Statistics and Probability; Neyman, J., Ed.; University of California Press: Berkeley, CA, USA, 1961; pp. 547–561.
  26. Campbell, L.L. A Coding Theorem and Rényi’s Entropy. Inf. Control 1965, 8, 423–429.
  27. Arimoto, S. Information Measures and Capacity of Order α for Discrete Memoryless Channels. In Topics in Information Theory; Bolyai: Keszthely, Hungary, 1975; pp. 41–52.
  28. Sason, I.; Verdú, S. Arimoto–Rényi Conditional Entropy and Bayesian M-ary Hypothesis Testing. IEEE Trans. Inf. Theory 2018, 64, 4–25.
  29. Fano, R.M. Class Notes for Course 6.574: Statistical Theory of Information; Massachusetts Institute of Technology: Cambridge, MA, USA, 1953.
  30. Csiszár, I. A Class of Measures of Informativity of Observation Channels. Period. Mat. Hung. 1972, 2, 191–213.
  31. Sibson, R. Information Radius. Z. Wahrscheinlichkeitstheorie und Verw. Geb. 1969, 14, 149–161.
  32. Csiszár, I. Generalized Cutoff Rates and Rényi’s Information Measures. IEEE Trans. Inf. Theory 1995, 41, 26–34.
  33. Arimoto, S. Computation of Random Coding Exponent Functions. IEEE Trans. Inf. Theory 1976, 22, 665–671.
  34. Candan, C. Chebyshev Center Computation on Probability Simplex with α-Divergence Measure. IEEE Signal Process. Lett. 2020, 27, 1515–1519.
  35. Poltyrev, G.S. Random Coding Bounds for Discrete Memoryless Channels. Probl. Inf. Transm. 1982, 18, 9–21.
  36. Augustin, U. Noisy Channels. Ph.D. Thesis, Universität Erlangen-Nürnberg, Erlangen, Germany, 1978.
  37. Tomamichel, M.; Hayashi, M. Operational Interpretation of Rényi Information Measures via Composite Hypothesis Testing against Product and Markov Distributions. IEEE Trans. Inf. Theory 2018, 64, 1064–1082.
  38. Polyanskiy, Y.; Verdú, S. Arimoto Channel Coding Converse and Rényi Divergence. In Proceedings of the 48th Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, 29 September–1 October 2010; pp. 1327–1333.
  39. Shayevitz, O. On Rényi Measures and Hypothesis Testing. In Proceedings of the 2011 IEEE International Symposium on Information Theory, St. Petersburg, Russia, 31 July–5 August 2011; pp. 894–898.
  40. Verdú, S. α-Mutual Information. In Proceedings of the 2015 Information Theory and Applications Workshop (ITA), San Diego, CA, USA, 1–6 February 2015.
  41. Ho, S.W.; Verdú, S. Convexity/Concavity of Rényi Entropy and α-Mutual Information. In Proceedings of the 2015 IEEE International Symposium on Information Theory, Hong Kong, China, 15–19 June 2015; pp. 745–749.
  42. Nakiboglu, B. The Rényi Capacity and Center. IEEE Trans. Inf. Theory 2019, 65, 841–860.
  43. Nakiboglu, B. The Augustin Capacity and Center. arXiv 2018, arXiv:1803.07937.
  44. Dalai, M. Some Remarks on Classical and Classical-Quantum Sphere Packing Bounds: Rényi vs. Kullback–Leibler. Entropy 2017, 19, 355.
  45. Cai, C.; Verdú, S. Conditional Rényi Divergence Saddlepoint and the Maximization of α-Mutual Information. Entropy 2019, 21, 969.
  46. Vázquez-Vilar, G.; Martinez, A.; Guillén i Fàbregas, A. A Derivation of the Cost-Constrained Sphere-Packing Exponent. In Proceedings of the 2015 IEEE International Symposium on Information Theory, Hong Kong, China, 15–19 June 2015; pp. 929–933.
  47. Wyner, A.D. Capacity and Error Exponent for the Direct Detection Photon Channel. IEEE Trans. Inf. Theory 1988, 34, 1449–1471.
  48. Csiszár, I.; Körner, J. Information Theory: Coding Theorems for Discrete Memoryless Systems, 2nd ed.; Cambridge University Press: Cambridge, UK, 2011.
  49. Rényi, A. On Measures of Dependence. Acta Math. Hung. 1959, 10, 441–451.
  50. van Erven, T.; Harremoës, P. Rényi Divergence and Kullback–Leibler Divergence. IEEE Trans. Inf. Theory 2014, 60, 3797–3820.
  51. Csiszár, I.; Matúš, F. Information Projections Revisited. IEEE Trans. Inf. Theory 2003, 49, 1474–1490.
  52. Csiszár, I. Information-Type Measures of Difference of Probability Distributions and Indirect Observations. Stud. Sci. Math. Hung. 1967, 2, 299–318.
  53. Nakiboglu, B. The Sphere Packing Bound via Augustin’s Method. IEEE Trans. Inf. Theory 2019, 65, 816–840.
  54. Nakiboglu, B. The Augustin Capacity and Center. Probl. Inf. Transm. 2019, 55, 299–342.
  55. Vázquez-Vilar, G. Error Probability Bounds for Gaussian Channels under Maximal and Average Power Constraints. arXiv 2019, arXiv:1907.03163.
  56. Shannon, C.E. Geometrische Deutung einiger Ergebnisse bei der Berechnung der Kanalkapazität. Nachrichtentechnische Z. 1957, 10, 1–4.
  57. Verdú, S.; Han, T.S. A General Formula for Channel Capacity. IEEE Trans. Inf. Theory 1994, 40, 1147–1157.
  58. Kemperman, J.H.B. On the Shannon Capacity of an Arbitrary Channel. K. Ned. Akad. Van Wet. Indag. Math. 1974, 77, 101–115.
  59. Aubin, J.P. Mathematical Methods of Game and Economic Theory; North-Holland: Amsterdam, The Netherlands, 1979.
  60. Luenberger, D.G. Optimization by Vector Space Methods; Wiley: New York, NY, USA, 1969.
  61. Gastpar, M.; Rimoldi, B.; Vetterli, M. To Code, or Not to Code: Lossy Source–Channel Communication Revisited. IEEE Trans. Inf. Theory 2003, 49, 1147–1158.
  62. Arimoto, S. On the Converse to the Coding Theorem for Discrete Memoryless Channels. IEEE Trans. Inf. Theory 1973, 19, 357–359.
  63. Sason, I. On the Rényi Divergence, Joint Range of Relative Entropies, and a Channel Coding Theorem. IEEE Trans. Inf. Theory 2016, 62, 23–34.
  64. Dalai, M.; Winter, A. Constant Compositions in the Sphere Packing Bound for Classical-Quantum Channels. IEEE Trans. Inf. Theory 2017, 63, 5603–5617.
  65. Nakiboglu, B. The Sphere Packing Bound for Memoryless Channels. Probl. Inf. Transm. 2020, 56, 201–244.
  66. Dalai, M. Lower Bounds on the Probability of Error for Classical and Classical-Quantum Channels. IEEE Trans. Inf. Theory 2013, 59, 8027–8056.
  67. Shannon, C.E. The Zero Error Capacity of a Noisy Channel. IRE Trans. Inf. Theory 1956, 2, 8–19.
  68. Feder, M.; Merhav, N. Relations Between Entropy and Error Probability. IEEE Trans. Inf. Theory 1994, 40, 259–266.
  69. Einarsson, G. Signal Design for the Amplitude-Limited Gaussian Channel by Error Bound Optimization. IEEE Trans. Commun. 1979, 27, 152–158.
  70. Anantharam, V.; Verdú, S. Bits through Queues. IEEE Trans. Inf. Theory 1996, 42, 4–18.
  71. Verdú, S. The Exponential Distribution in Information Theory. Probl. Inf. Transm. 1996, 32, 86–95.
  72. Arikan, E. On the Reliability Exponent of the Exponential Timing Channel. IEEE Trans. Inf. Theory 2002, 48, 1681–1689.
  73. Polyanskiy, Y.; Poor, H.V.; Verdú, S. Channel Coding Rate in the Finite Blocklength Regime. IEEE Trans. Inf. Theory 2010, 56, 2307–2359.
  74. Royden, H.L.; Fitzpatrick, P. Real Analysis, 4th ed.; Prentice Hall: Boston, MA, USA, 2010.
Figure 1. $E_{\mathrm{sp}}(\cdot, P_X)$ and $E_{\mathrm{r}}(\cdot, P_X)$.
Figure 2. $E_{\mathrm{sp}}(R, \mathsf{snr}\,\sigma_N^2)$ in (312) and (313); logarithms in base 2.
Figure 3. Error exponent functions in (354), (355) and (376).