Next Article in Journal
Teaching Ordinal Patterns to a Computer: Efficient Encoding Algorithms Based on the Lehmer Code
Next Article in Special Issue
Rényi and Tsallis Entropies of the Aharonov–Bohm Ring in Uniform Magnetic Fields
Previous Article in Journal
Fast, Asymptotically Efficient, Recursive Estimation in a Riemannian Manifold
Previous Article in Special Issue
Entropic Matroids and Their Representation
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

On Data-Processing and Majorization Inequalities for f-Divergences with Applications

Department of Electrical Engineering, Technion—Israel Institute of Technology, Haifa 3200003, Israel
Entropy 2019, 21(10), 1022; https://doi.org/10.3390/e21101022
Submission received: 17 July 2019 / Revised: 12 October 2019 / Accepted: 17 October 2019 / Published: 21 October 2019
(This article belongs to the Special Issue Information Measures with Applications)

Abstract

:
This paper is focused on the derivation of data-processing and majorization inequalities for f-divergences, and their applications in information theory and statistics. For the accessibility of the material, the main results are first introduced without proofs, followed by exemplifications of the theorems with further related analytical results, interpretations, and information-theoretic applications. One application refers to the performance analysis of list decoding with either fixed or variable list sizes; some earlier bounds on the list decoding error probability are reproduced in a unified way, and new bounds are obtained and exemplified numerically. Another application is related to a study of the quality of approximating a probability mass function, induced by the leaves of a Tunstall tree, by an equiprobable distribution. The compression rates of finite-length Tunstall codes are further analyzed for asserting their closeness to the Shannon entropy of a memoryless and stationary discrete source. Almost all the analysis is relegated to the appendices, which form the major part of this manuscript.

1. Introduction

Divergences are non-negative measures of dissimilarity between pairs of probability measures which are defined on the same measurable space. They play a key role in the development of information theory, probability theory, statistics, learning, signal processing, and other related fields. One important class of divergence measures is defined by means of convex functions f, and it is called the class of f-divergences. It unifies fundamental and independently-introduced concepts in several branches of mathematics such as the chi-squared test for the goodness of fit in statistics, the total variation distance in functional analysis, the relative entropy in information theory and statistics, and it is closely related to the Rényi divergence which generalizes the relative entropy. The class of f-divergences was introduced in the sixties by Ali and Silvey [1], Csiszár [2,3,4,5,6], and Morimoto [7]. This class satisfies pleasing features such as the data-processing inequality, convexity, continuity and duality properties, finding interesting applications in information theory and statistics (see, e.g., [4,6,8,9,10,11,12,13,14,15]).
This manuscript is a research paper which is focused on the derivation of data-processing and majorization inequalities for f-divergences, and a study of some of their potential applications in information theory and statistics. Preliminaries are next provided.

1.1. Preliminaries and Related Works

We provide here definitions and known results from the literature which serve as a background to the presentation in this paper. We first provide a definition for the family of f-divergences.
Definition 1
([16], p. 4398). Let P and Q be probability measures, let μ be a dominating measure of P and Q (i.e., P , Q μ ), and let p : = d P d μ and q : = d Q d μ . The f-divergence from P to Q is given, independently of μ, by
D f ( P Q ) : = q f p q d μ ,
where
f ( 0 ) : = lim t 0 + f ( t ) ,
0 f 0 0 : = 0 ,
0 f a 0 : = lim t 0 + t f a t = a lim u f ( u ) u , a > 0 .
Definition 2.
Let Q X be a probability distribution which is defined on a set X , and that is not a point mass, and let W Y | X : X Y be a stochastic transformation. The contraction coefficient for f-divergences is defined as
μ f ( Q X , W Y | X ) : = sup P X : D f ( P X Q X ) ( 0 , ) D f ( P Y Q Y ) D f ( P X Q X ) ,
where, for all y Y ,
P Y ( y ) = ( P X W Y | X ) ( y ) : = X d P X ( x ) W Y | X ( y | x ) ,
Q Y ( y ) = ( Q X W Y | X ) ( y ) : = X d Q X ( x ) W Y | X ( y | x ) .
The notation in (6) and (7), and also in (20), (21), (42), (43), (44) in the continuation of this paper, is consistent with the standard notation used in information theory (see, e.g., the first displayed equation after (3.2) in [17]).
Contraction coefficients for f-divergences play a key role in strong data-processing inequalities (see [18,19,20], ([21], Chapter II), [22,23,24,25,26]). The following are essential definitions and results which are related to maximal correlation and strong data-processing inequalities.
Definition 3.
The maximal correlation between two random variables X and Y is defined as
ρ m ( X ; Y ) : = sup f , g E [ f ( X ) g ( Y ) ] ,
where the supremum is taken over all real-valued functions f and g such that
E [ f ( X ) ] = E [ g ( Y ) ] = 0 , E [ f 2 ( X ) ] 1 , E [ g 2 ( Y ) ] 1 .
Definition 4.
Pearson’s χ 2 -divergence [27] from P to Q is defined to be the f-divergence from P to Q (see Definition 1) with f ( t ) = ( t 1 ) 2 or f ( t ) = t 2 1 for all t > 0 ,
χ 2 ( P Q ) : = D f ( P Q )
= ( p q ) 2 q d μ
= p 2 q d μ 1
independently of the dominating measure μ (i.e., P , Q μ , e.g., μ = P + Q ).
Neyman’s χ 2 -divergence [28] from P to Q is the Pearson’s χ 2 -divergence from Q to P, i.e., it is equal to
χ 2 ( Q P ) = D g ( P Q )
with g ( t ) = ( t 1 ) 2 t or g ( t ) = 1 t t for all t > 0 .
Proposition 1
(([24], Theorem 3.2), [29]). The contraction coefficient for the χ 2 -divergence satisfies
μ χ 2 ( Q X , W Y | X ) = ρ m 2 ( X ; Y )
with X Q X and Y Q Y (see (7)).
Proposition 2
([25], Theorem 2). Let f : ( 0 , ) R be convex and twice continuously differentiable with f ( 1 ) = 0 and f ( 1 ) > 0 . Then, for any Q X that is not a point mass,
μ χ 2 ( Q X , W Y | X ) μ f ( Q X , W Y | X ) ,
i.e., the contraction coefficient for the χ 2 -divergence is the minimal contraction coefficient among all f-divergences with f satisfying the above conditions.
Remark 1.
A weaker version of (15) was presented in ([21], Proposition II.6.15) in the general alphabet setting, and the result in (15) was obtained in ([24], Theorem 3.3) for finite alphabets.
The following result provides an upper bound on the contraction coefficient for a subclass of f-divergences in the finite alphabet setting.
Proposition 3
([26], Theorem 8). Let f : [ 0 , ) R be a continuous convex function which is three times differentiable at unity with f ( 1 ) = 0 and f ( 1 ) > 0 , and let it further satisfy the following conditions:
(a) 
f ( t ) f ( 1 ) ( t 1 ) 1 f ( 3 ) ( 1 ) ( t 1 ) 3 f ( 1 ) 1 2 f ( 1 ) ( t 1 ) 2 , t > 0 .
(b) 
The function g : ( 0 , ) R , given by g ( t ) : = f ( t ) f ( 0 ) t for all t > 0 , is concave.
Then, for a probability mass function Q X supported over a finite set X ,
μ f ( Q X , W Y | X ) f ( 1 ) + f ( 0 ) f ( 1 ) min x X Q X ( x ) μ χ 2 ( Q X , W Y | X ) .
For the presentation of our majorization inequalities for f-divergences and related entropy bounds (see Section 2.3), essential definitions and basic results are next provided (see, e.g., [30], ([31], Chapter 13) and ([32], Chapter 2)). Let P be a probability mass function defined on a finite set X , let p max be the maximal mass of P, and let G P ( k ) be the sum of the k largest masses of P for k { 1 , , | X | } (hence, it follows that G P ( 1 ) = p max and G P ( | X | ) = 1 ).
Definition 5.
Consider discrete probability mass functions P and Q defined on a finite set X . It is said that P is majorized by Q (or Q majorizes P), and it is denoted by P Q , if G P ( k ) G Q ( k ) for all k { 1 , , | X | } (recall that G P ( | X | ) = G Q ( | X | ) = 1 ).
A unit mass majorizes any other distribution; on the other hand, the equiprobable distribution on a finite set is majorized by any other distribution defined on the same set.
Definition 6.
Let P n denote the set of all the probability mass functions that are defined on A n : = { 1 , , n } . A function f : P n R is said to be Schur-convex if for every P , Q P n such that P Q , we have f ( P ) f ( Q ) . Likewise, f is said to be Schur-concave if f is Schur-convex, i.e., P , Q P n and P Q imply that f ( P ) f ( Q ) .
Characterization of Schur-convex functions is provided, e.g., in ([30], Chapter 3). For example, there exist some connections between convexity and Schur-convexity (see, e.g., ([30], Section 3.C) and ([32], Chapter 2.3)). However, a Schur-convex function is not necessarily convex ([32], Example 2.3.15).
Finally, what is the connection between data processing and majorization, and why these types of inequalities are both considered in the same manuscript? This connection is provided in the following fundamental well-known result (see, e.g., ([32], Theorem 2.1.10), ([30], Theorem B.2) and ([31], Chapter 13)):
Proposition 4.
Let P and Q be probability mass functions defined on a finite set A . Then, P Q if and only if there exists a doubly-stochastic transformation W Y | X : A A (i.e., x A W Y | X ( y | x ) = 1 for all y A , and y A W Y | X ( y | x ) = 1 for all x A with W Y | X ( · | · ) 0 ) such that Q W Y | X P . In other words, P Q if and only if in their representation as column vectors, there exists a doubly-stochastic matrix W (i.e., a square matrix with non-negative entries such that the sum of each column or each row in W is equal to 1) such that P = W Q .

1.2. Contributions

This paper is focused on the derivation of data-processing and majorization inequalities for f-divergences, and it applies these inequalities to information theory and statistics.
The starting point for obtaining strong data-processing inequalities in this paper relies on the derivation of lower and upper bounds on the difference D f ( P X Q X ) D f ( P Y Q Y ) where ( P X , Q X ) and ( P Y , Q Y ) denote, respectively, pairs of input and output probability distributions with a given stochastic transformation W Y | X (i.e., where P X W Y | X P Y and Q X W Y | X Q Y ). These bounds are expressed in terms of the respective difference in the Pearson’s or Neyman’s χ 2 -divergence, and they hold for all f-divergences (see Theorems 1 and 2). By a different approach, we derive an upper bound on the contraction coefficient for f-divergences of a certain type, which gives an alternative strong data-processing inequality for the considered type of f-divergences (see Theorems 3 and 4). In this framework, a parametric subclass of f-divergences is introduced, its interesting properties are studied (see Theorem 5), all the data-processing inequalities which are derived in this paper are applied to this subclass, and these inequalities are exemplified numerically to examine their tightness (see Section 3.1).
This paper also derives majorization inequalities for f-divergences where part of these inequalities rely on the earlier data-processing inequalities (see Theorem 6). A different approach, which relies on the concept of majorization, serves to derive tight bounds on the maximal value of an f-divergence from a probability mass function P to an equiprobable distribution; the maximization is carried over all P with a fixed finite support where the ratio of their maximal to minimal probability masses does not exceed a given value (see Theorem 7). These bounds lead to accurate asymptotic results which apply to general f-divergences, and they strengthen and generalize recent results of this type with respect to the relative entropy [33], and the Rényi divergence [34]. Furthermore, we explore in Theorem 7 the convergence rates to the asymptotic results. Data-processing and majorization inequalities also serve to strengthen the Schur-concavity property of the Tsallis entropy (see Theorem 8), showing by a comparison to earlier bounds in [35,36] that none of these bounds is superseded by the other. Further analytical results which are related to the specialization of our central result on majorization inequalities in Theorem 7, applied to several important sub-classes of f-divergences, are provided in Section 3.2 (including Theorem 9). A quantity which is involved in our majorization inequalities in Theorem 7 is interpreted by relying on a variational representation of f-divergences (see Theorem 10).
As an application of the data-processing inequalities for f-divergences, the setup of list decoding is further studied, reproducing in a unified way some known bounds on the list decoding error probability, and deriving new bounds for fixed and variable list sizes (see Theorems 11–13).
As an application of the majorization inequalities in this paper, we study properties of a measure which is used to quantify the quality of approximating probability mass functions, induced by the leaves of a Tunstall tree, by an equiprobable distribution (see Theorem 14). An application of majorization inequalities for the relative entropy is used to derive a sufficient condition, expressed in terms of the principal and secondary real branches of the Lambert W function [37], for asserting the proximity of compression rates of finite-length (lossless and variable-to-fixed) Tunstall codes to the Shannon entropy of a memoryless and stationary discrete source (see Theorem 15).

1.3. Paper Organization

The paper is structured as follows: Section 2 provides our main new results on data-processing and majorization inequalities for f-divergences and related entropy measures. Illustration of the theorems in Section 2, and further mathematical results which follow from these theorems are introduced in Section 3. Applications in information theory and statistics are considered in Section 4. Proofs of all theorems are relegated to the appendices, which form a major part of this paper.

2. Main Results on f-Divergences

This section provides strong data-processing inequalities for f-divergences (see Section 2.1), followed by a study of a new subclass of f-divergences (see Section 2.2) which later serves to exemplify our data-processing inequalities. The third part of this section (see Section 2.3) provides majorization inequalities for f-divergences, and for the Tsallis entropy, whose derivation relies in part on the new data-processing inequalities.

2.1. Data-Processing Inequalities for f-Divergences

Strong data-processing inequalities are provided in the following, bounding the difference D f ( P X Q X ) D f ( P Y Q Y ) and ratio D f ( P Y Q Y ) D f ( P X Q X ) where ( P X , Q X ) and ( P Y , Q Y ) denote, respectively, pairs of input and output probability distributions with a given stochastic transformation.
Theorem 1.
Let X and Y be finite or countably infinite sets, let P X and Q X be probability mass functions that are supported on X , and let
ξ 1 : = inf x X P X ( x ) Q X ( x ) [ 0 , 1 ] ,
ξ 2 : = sup x X P X ( x ) Q X ( x ) [ 1 , ] .
Let W Y | X : X Y be a stochastic transformation such that for every y Y , there exists x X with W Y | X ( y | x ) > 0 , and let (see (6) and (7))
P Y : = P X W Y | X ,
Q Y : = Q X W Y | X .
Furthermore, let f : ( 0 , ) R be a convex function with f ( 1 ) = 0 , and let the non-negative constant c f : = c f ( ξ 1 , ξ 2 ) satisfy
f + ( v ) f + ( u ) 2 c f ( v u ) , u , v I , u < v
where f + denotes the right-side derivative of f, and
I : = I ( ξ 1 , ξ 2 ) = [ ξ 1 , ξ 2 ] ( 0 , ) .
Then,
(a) 
D f ( P X Q X ) D f ( P Y Q Y ) c f ( ξ 1 , ξ 2 ) χ 2 ( P X Q X ) χ 2 ( P Y Q Y )
0 ,
where equality holds in (24) if D f ( · · ) is Pearson’s χ 2 -divergence with c f 1 .
(b) 
If f is twice differentiable on I , then the largest possible coefficient in the right side of (22) is given by
c f ( ξ 1 , ξ 2 ) = 1 2 inf t I ( ξ 1 , ξ 2 ) f ( t ) .
(c) 
Under the assumption in Item (b), the following dual inequality also holds:
D f ( P X Q X ) D f ( P Y Q Y ) c f 1 ξ 2 , 1 ξ 1 χ 2 ( Q X P X ) χ 2 ( Q Y P Y )
0 ,
where f : ( 0 , ) R is the dual convex function which is given by
f ( t ) : = t f 1 t , t > 0 ,
and the coefficient in the right side of (27) satisfies
c f 1 ξ 2 , 1 ξ 1 = 1 2 inf t I ( ξ 1 , ξ 2 ) { t 3 f ( t ) }
with the convention that 1 ξ 1 = if ξ 1 = 0 . Equality holds in (27) if D f ( · · ) is Neyman’s χ 2 -divergence (i.e., D f ( P Q ) : = χ 2 ( Q P ) for all P and Q) with c f 1 .
(d) 
Under the assumption in Item (b), if
e f ( ξ 1 , ξ 2 ) : = 1 2 sup t I ( ξ 1 , ξ 2 ) f ( t ) < ,
then,
D f ( P X Q X ) D f ( P Y Q Y ) e f ( ξ 1 , ξ 2 ) χ 2 ( P X Q X ) χ 2 ( P Y Q Y ) .
Furthermore,
D f ( P X Q X ) D f ( P Y Q Y ) e f 1 ξ 2 , 1 ξ 1 χ 2 ( Q X P X ) χ 2 ( Q Y P Y )
where the coefficient in the right side of (33) satisfies
e f 1 ξ 2 , 1 ξ 1 = 1 2 sup t I ( ξ 1 , ξ 2 ) { t 3 f ( t ) } ,
which is assumed to be finite. Equalities hold in (32) and (33) if D f ( · · ) is Pearson’s or Neyman’s χ 2 -divergence with e f 1 or e f 1 , respectively.
(e) 
The lower and upper bounds in (24), (27), (32) and (33) are locally tight. More precisely, let { P X ( n ) } be a sequence of probability mass functions defined on X and pointwise converging to Q X which is supported on X , and let P Y ( n ) and Q Y be the probability mass functions defined on Y via (20) and (21) with inputs P X ( n ) and Q X , respectively. Suppose that
lim n inf x X P X ( n ) ( x ) Q X ( x ) = 1 ,
lim n sup x X P X ( n ) ( x ) Q X ( x ) = 1 .
If f has a continuous second derivative at unity, then
lim n D f ( P X ( n ) Q X ) D f ( P Y ( n ) Q Y ) χ 2 ( P X ( n ) Q X ) χ 2 ( P Y ( n ) Q Y ) = 1 2 f ( 1 ) ,
lim n D f ( P X ( n ) Q X ) D f ( P Y ( n ) Q Y ) χ 2 ( Q X P X ( n ) ) χ 2 ( Q Y P Y ( n ) ) = 1 2 f ( 1 ) ,
and these limits indicate the local tightness of the lower and upper bounds in Items (a)–(d).
Proof. 
See Appendix A.  □
An application of Theorem 1 gives the following result.
Theorem 2.
Let X and Y be finite or countably infinite sets, let n N , and let X n : = ( X 1 , , X n ) and Y n : = ( Y 1 , , Y n ) be random vectors taking values on X n and Y n , respectively. Let P X n and Q X n be the probability mass functions of discrete memoryless sources where, for all x ̲ X n ,
P X n ( x ̲ ) = i = 1 n P X i ( x i ) , Q X n ( x ̲ ) = i = 1 n Q X i ( x i ) ,
with P X i and Q X i supported on X for all i { 1 , , n } . Let each symbol X i be independently selected from one of the source outputs at time instant i with probabilities λ and 1 λ , respectively, and let it be transmitted over a discrete memoryless channel with transition probabilities
W Y n | X n ( y ̲ | x ̲ ) = i = 1 n W Y i | X i ( y i | x i ) , x ̲ X n , y ̲ Y n .
Let R X n ( λ ) be the probability mass function of the symbols at the channel input, i.e.,
R X n ( λ ) ( x ̲ ) = i = 1 n λ P X i ( x i ) + ( 1 λ ) Q X i ( x i ) , x ̲ X n , λ [ 0 , 1 ] ,
let
R Y n ( λ ) : = R X n ( λ ) W Y n | X n ,
P Y n : = P X n W Y n | X n ,
Q Y n : = Q X n W Y n | X n ,
and let f : ( 0 , ) R be a convex and twice differentiable function with f ( 1 ) = 0 . Then,
(a) 
For all λ [ 0 , 1 ] ,
D f ( R X n ( λ ) Q X n ) D f ( R Y n ( λ ) Q Y n )
c f ξ 1 ( n , λ ) , ξ 2 ( n , λ ) i = 1 n 1 + λ 2 χ 2 ( P X i Q X i ) i = 1 n 1 + λ 2 χ 2 ( P Y i Q Y i )
c f ξ 1 ( n , λ ) , ξ 2 ( n , λ ) λ 2 i = 1 n χ 2 ( P X i Q X i ) χ 2 ( P Y i Q Y i ) 0 ,
where c f ( · , · ) in the right sides of (45) and (46) is given in (26), and
ξ 1 ( n , λ ) : = i = 1 n 1 λ + λ inf x X P X i ( x ) Q X i ( x ) [ 0 , 1 ] ,
ξ 2 ( n , λ ) : = i = 1 n 1 λ + λ sup x X P X i ( x ) Q X i ( x ) [ 1 , ] .
(b) 
For all λ [ 0 , 1 ] ,
D f ( R X n ( λ ) Q X n ) D f ( R Y n ( λ ) Q Y n ) e f ξ 1 ( n , λ ) , ξ 2 ( n , λ ) i = 1 n 1 + λ 2 χ 2 ( P X i Q X i ) i = 1 n 1 + λ 2 χ 2 ( P Y i Q Y i )
where e f ( · , · ) , ξ 1 ( · , · ) and ξ 2 ( · , · ) in the right side of (49) are given in (31), (47) and (48), respectively.
(c) 
If f has a continuous second derivative at unity, and sup x X P X i ( x ) Q X i ( x ) < for all i { 1 , , n } , then
lim λ 0 + D f ( R X n ( λ ) Q X n ) D f ( R Y n ( λ ) Q Y n ) λ 2 = 1 2 f ( 1 ) i = 1 n χ 2 ( P X i Q X i ) χ 2 ( P Y i Q Y i ) .
The lower bounds in the right sides of (45) and (46), and the upper bound in the right side of (49) are tight as we let λ 0 + , yielding the limit in the right side of (50).
Proof. 
See Appendix B.  □
Remark 2.
Similar upper and lower bounds on D f ( P X n R X n ( λ ) ) D f ( P Y n R Y n ( λ ) ) can be obtained for all λ [ 0 , 1 ] . To that end, in (45)(49), one needs to replace f with f , switch between P X i and Q X i for all i, and replace λ with 1 λ .
In continuation to ([26], Theorem 8) (see Proposition 3 in Section 1.1), we next provide an upper bound on the contraction coefficient for a subclass of f-divergences (this subclass is different from the one which is addressed in ([26], Theorem 8)). Although the first part of the next result is stated for finite or countably infinite alphabets, it is clear from its proof that it also holds in the general alphabet setting. Connections to the literature are provided in Remarks A1–A3.
Theorem 3.
Let f : ( 0 , ) R be a function which satisfies the following conditions:
  • f is convex, differentiable at 1, f ( 1 ) = 0 , and f ( 0 ) : = lim t 0 + f ( t ) < ;
  • The function g : ( 0 , ) R , defined for all t > 0 by g ( t ) : = f ( t ) f ( 0 ) t , is convex.
Let P X and Q X be non-identical probability mass functions which are defined on a finite or a countably infinite set X , and let
κ ( ξ 1 , ξ 2 ) : = sup t ( ξ 1 , 1 ) ( 1 , ξ 2 ) f ( t ) + f ( 1 ) ( 1 t ) ( t 1 ) 2
where ξ 1 [ 0 , 1 ) and ξ 2 ( 1 , ] are given in (18) and (19). Then, in the setting of (20) and (21),
D f ( P Y Q Y ) D f ( P X Q X ) κ ( ξ 1 , ξ 2 ) f ( 0 ) + f ( 1 ) · χ 2 ( P Y Q Y ) χ 2 ( P X Q X ) .
Consequently, if Q X is finitely supported on X ,
μ f ( Q X , W Y | X ) 1 f ( 0 ) + f ( 1 ) · κ 0 , 1 min x X Q X ( x ) · μ χ 2 ( Q X , W Y | X ) .
Proof. 
See Appendix C.1.  □
Similarly to the extension of Theorem 1 to Theorem 2, a similar extension of Theorem 3 leads to the following result.
Theorem 4.
In the setting of (39)(44) in Theorem 2, and under the assumptions on f in Theorem 3, the following holds for all λ ( 0 , 1 ] and n N :
D f R Y n ( λ ) Q Y n D f R X n ( λ ) Q X n κ ξ 1 ( n , λ ) , ξ 2 ( n , λ ) f ( 0 ) + f ( 1 ) i = 1 n 1 + λ 2 χ 2 ( P Y i Q Y i ) 1 i = 1 n 1 + λ 2 χ 2 ( P X i Q X i ) 1 ,
with ξ 1 ( n , λ ) and ξ 2 ( n , λ ) and κ ( · , · ) defined in (47), (48) and (51), respectively.
Proof. 
See Appendix C.2.  □

2.2. A Subclass of f-Divergences

A subclass of f-divergences with interesting properties is introduced in Theorem 5. The data-processing inequalities in Theorems 2 and 4 are applied to these f-divergences in Section 3.
Theorem 5.
Let f α : [ 0 , ) R be given by
f α ( t ) : = ( α + t ) 2 log ( α + t ) ( α + 1 ) 2 log ( α + 1 ) , t 0
for all α e 3 2 . Then,
(a) 
D f α ( · · ) is an f-divergence which is monotonically increasing and concave in α, and its first three derivatives are related to the relative entropy and χ 2 -divergence as follows:
α D f α ( P Q ) = 2 ( α + 1 ) D α Q + P α + 1 Q ,
2 α 2 D f α ( P Q ) = 2 D Q α Q + P α + 1 ,
3 α 3 D f α ( P Q ) = 2 log e α + 1 · χ 2 Q α Q + P α + 1 .
(b) 
For every n N ,
( 1 ) n 1 n α n D f α ( P Q ) 0 ,
and, in addition to (56)–(58), for all n > 3
n α n D f α ( P Q ) = 2 ( 1 ) n 1 ( n 3 ) ! log e ( α + 1 ) n 2 exp ( n 2 ) D n 1 Q α Q + P α + 1 1 ,
where D n 1 ( · · ) in the right side of (60) denotes the Rényi divergence of order n 1 .
(c) 
D f α ( P Q ) k ( α ) χ 2 ( P Q )
k ( α ) exp D ( P Q ) 1
where the function k : [ e 3 2 , ) R is defined as
k ( α ) : = log ( α + 1 ) + 3 2 log e log e 3 α ,
which is monotonically increasing in α, satisfying k ( α ) 0.2075 log e for all α e 3 2 , and it tends to infinity as we let α . Consequently, unless P Q ,
lim α D f α ( P Q ) = + .
(d) 
D f α ( P Q ) log ( α + 1 ) + 3 2 log e log e α + 1 χ 2 ( P Q ) + log e 3 ( α + 1 ) exp 2 D 3 ( P Q ) 1 .
(e) 
For every ε > 0 and a pair of probability mass functions ( P , Q ) where D 3 ( P Q ) < , there exists α : = α ( P , Q , ε ) such that for all α > α
D f α ( P Q ) log ( α + 1 ) + 3 2 log e χ 2 ( P Q ) < ε .
(f) 
If a sequence of probability measures { P n } converges to a probability measure Q such that
lim n ess sup d P n d Q ( Y ) = 1 , Y Q ,
where P n Q for all sufficiently large n, then
lim n D f α ( P n Q ) χ 2 ( P n Q ) = log ( α + 1 ) + 3 2 log e .
(g) 
If α > β e 3 2 , then
0 ( α β ) ( α + β + 2 ) D α Q + P α + 1 Q
D f α ( P Q ) D f β ( P Q )
( α β ) min ( α + β + 2 ) D β Q + P β + 1 Q , 2 D ( P Q ) .
(h) 
The function f α : [ 0 , ) R , as given in (55), satisfies the conditions in Theorems 3 and 4 for all α e 3 2 . Furthermore, the corresponding function in (51) is equal to
κ α ( ξ 1 , ξ 2 ) : = sup t ( ξ 1 , 1 ) ( 1 , ξ 2 ) f α ( t ) + f α ( 1 ) ( 1 t ) ( t 1 ) 2
= f α ( ξ 2 ) + f α ( 1 ) ( 1 ξ 2 ) ( ξ 2 1 ) 2
for all ξ 1 [ 0 , 1 ) and ξ 2 ( 1 , ) .
Proof. 
See Appendix D.  □

2.3. f-Divergence Inequalities via Majorization

Let U n denote an equiprobable probability mass function on { 1 , , n } for an arbitrary n N , i.e., U n ( i ) : = 1 n for all i { 1 , , n } . By majorization theory and Theorem 1, the next result strengthens the Schur-convexity property of the f-divergence D f ( · U n ) (see ([38], Lemma 1)).
Theorem 6.
Let P and Q be probability mass functions which are supported on { 1 , , n } , and suppose that P Q . Let f : ( 0 , ) R be twice differentiable and convex with f ( 1 ) = 0 , and let q max and q min be, respectively, the maximal and minimal positive masses of Q. Then,
(a) 
n e f ( n q min , n q max ) Q 2 2 P 2 2
D f ( Q U n ) D f ( P U n )
n c f ( n q min , n q max ) Q 2 2 P 2 2 0 ,
where c f ( · , · ) and e f ( · , · ) are given in (26) and (31), respectively, and · 2 denotes the Euclidean norm. Furthermore, (74) and (75) hold with equality if D f ( · · ) = χ 2 ( · · ) .
(b) 
If P Q and q max q min ρ for an arbitrary ρ 1 , then
0 Q 2 2 P 2 2 ( ρ 1 ) 2 4 ρ n .
Proof. 
See Appendix E.  □
Remark 3.
If P is not supported on { 1 , , n } , then (74) and (75) hold if f is also right continuous at zero.
The next result provides upper and lower bounds on f-divergences from any probability mass function to an equiprobable distribution. It relies on majorization theory, and it follows in part from Theorem 6.
Theorem 7.
Let P n denote the set of all the probability mass functions that are defined on A n : = { 1 , , n } . For ρ 1 , let P n ( ρ ) be the set of all Q P n which are supported on A n with q max q min ρ , and let f : ( 0 , ) R be a convex function with f ( 1 ) = 0 . Then,
(a) 
The set P n ( ρ ) , for any ρ 1 , is a non-empty, convex and compact set.
(b) 
For a given Q P n , which is supported on A n , the f-divergences D f ( · Q ) and D f ( Q · ) attain their maximal values over the set P n ( ρ ) .
(c) 
For ρ 1 and an integer n 2 , let
u f ( n , ρ ) : = max Q P n ( ρ ) D f ( Q U n ) ,
v f ( n , ρ ) : = max Q P n ( ρ ) D f ( U n Q ) ,
let
Γ n ( ρ ) : = 1 1 + ( n 1 ) ρ , 1 n ,
and let the probability mass function Q β P n ( ρ ) be defined on the set A n as follows:
Q β ( j ) : = { ρ β , i f   j { 1 , , i β } , 1 n + i β ( ρ 1 ) 1 β , i f   j = i β + 1 , β , i f   j { i β + 2 , , n }
where
i β : = 1 n β ( ρ 1 ) β .
Then,
u f ( n , ρ ) = max β Γ n ( ρ ) D f ( Q β U n ) ,
v f ( n , ρ ) = max β Γ n ( ρ ) D f ( U n Q β ) .
(d) 
For ρ 1 and an integer n 2 , let the non-negative function g f ( ρ ) : [ 0 , 1 ] R + be given by
g f ( ρ ) ( x ) : = x f ρ 1 + ( ρ 1 ) x + ( 1 x ) f 1 1 + ( ρ 1 ) x , x [ 0 , 1 ] .
Then,
max m { 0 , , n } g f ( ρ ) m n u f ( n , ρ ) max x [ 0 , 1 ] g f ( ρ ) ( x ) ,
max m { 0 , , n } g f ( ρ ) m n v f ( n , ρ ) max x [ 0 , 1 ] g f ( ρ ) ( x )
with the convex function f : ( 0 , ) R in (29).
(e) 
The right-side inequalities in (85) and (86) are asymptotically tight ( n ). More explicitly,
lim n u f ( n , ρ ) = max x [ 0 , 1 ] x f ρ 1 + ( ρ 1 ) x + ( 1 x ) f 1 1 + ( ρ 1 ) x ,
lim n v f ( n , ρ ) = max x [ 0 , 1 ] ρ x 1 + ( ρ 1 ) x f 1 + ( ρ 1 ) x ρ + ( 1 x ) f 1 + ( ρ 1 ) x 1 + ( ρ 1 ) x .
(f) 
If g f ( ρ ) ( · ) in (84) is differentiable on ( 0 , 1 ) and its derivative is upper bounded by K f ( ρ ) 0 , then for every integer n 2
0 lim n u f ( n , ρ ) u f ( n , ρ ) K f ( ρ ) n .
(g) 
Let f ( 0 ) : = lim t 0 f ( t ) ( , + ] , and let n 2 be an integer. Then,
lim ρ u f ( n , ρ ) = 1 1 n f ( 0 ) + f ( n ) n .
Furthermore, if f ( 0 ) < , f is differentiable on ( 0 , n ) , and K n : = sup t ( 0 , n ) f ( t ) < , then, for every ρ 1 ,
0 lim ρ u f ( n , ρ ) u f ( n , ρ ) 2 K n ( n 1 ) n + ρ 1 .
(h) 
For ρ 1 , let the function f be also twice differentiable, and let M and m be constants such that the following condition holds:
0 m f ( t ) M , t 1 ρ , ρ .
Then, for all Q P n ( ρ ) ,
0 1 2 m n Q 2 2 1
D f ( Q U n )
1 2 M n Q 2 2 1
M ( ρ 1 ) 2 8 ρ
with equalities in (94) and (95) for the χ 2 divergence (with M = m = 2 ).
(i) 
Let d > 0 . If f ( t ) M f ( 0 , ) for all t > 0 , then D f ( Q U n ) d for all Q P n ( ρ ) , if
ρ 1 + 4 d M f + 8 d M f + 16 d 2 M f 2 .
Proof. 
See Appendix F.  □
Tsallis entropy was introduced in [39] as a generalization of the Shannon entropy (similarly to the Rényi entropy [40]), and it was applied to statistical physics in [39].
Definition 7
([39]). Let P X be a probability mass function defined on a discrete set X . The Tsallis entropy of order α ( 0 , 1 ) ( 1 , ) of X, denoted by S α ( X ) or S α ( P X ) , is defined as
S α ( X ) = 1 1 α x X P X α ( x ) 1
= P X α α 1 1 α ,
where P X α : = x X P X α ( x ) 1 α . The Tsallis entropy is continuously extended at orders 0, 1, and; at order 1, it coincides with the Shannon entropy on base e (expressed in nats).
Theorem 6 enables to strengthen the Schur-concavity property of the Tsallis entropy (see ([30], Theorem 13.F.3.a.)) as follows.
Theorem 8.
Let P and Q be probability mass functions which are supported on a finite set, and let P Q . Then, for all α > 0 ,
(a) 
0 L ( α , P , Q ) S α ( P ) S α ( Q ) U ( α , P , Q ) ,
where
L ( α , P , Q ) : = { 1 2 α q max α 2 Q 2 2 P 2 2 , i f   α ( 0 , 2 ] , 1 2 α q min α 2 Q 2 2 P 2 2 , i f   α ( 2 , ) ,
U ( α , P , Q ) : = { 1 2 α q min α 2 Q 2 2 P 2 2 , i f   α ( 0 , 2 ] , 1 2 α q max α 2 Q 2 2 P 2 2 , i f   α ( 2 , ) ,
and the bounds in (101) and (102) are attained at α = 2 .
(b) 
inf P Q , P Q S α ( P ) S α ( Q ) L ( α , P , Q ) = sup P Q , P Q S α ( P ) S α ( Q ) U ( α , P , Q ) = 1 ,
where the infimum and supremum in (103) can be restricted to probability mass functions P and Q which are supported on a binary alphabet.
Proof. 
See Appendix G.  □
Remark 4.
The lower bound in ([36], Theorem 1) also strengthens the Schur-concavity property of the Tsallis entropy. It can be verified that none of the lower bounds in ([36], Theorem 1) and Theorem 8 supersedes the other. For example, let α > 0 , and let P ε and Q ε be probability mass functions supported on A : = { 0 , 1 } with P ε ( 0 ) = 1 2 + ε and Q ε ( 0 ) = 1 2 + β ε where β > 1 and 0 < ε < 1 2 β . This yields P ε Q ε . From (A233) (see Appendix G),
lim ε 0 + S α ( P ε ) S α ( Q ε ) L ( α , P ε , Q ε ) = 1 .
If α = 1 , then S 1 ( P ε ) S 1 ( Q ε ) = 1 log e H ( P ε ) H ( Q ε ) , and the continuous extension of the lower bound in ([36], Theorem 1) at α = 1 is specialized to the earlier result by the same authors in ([35], Theorem 3); it states that if P Q , then H ( P ) H ( Q ) D ( Q P ) . In contrast to (104), it can be verified that
lim ε 0 + S 1 ( P ε ) S 1 ( Q ε ) 1 log e D ( Q ε P ε ) = β + 1 β 1 > 1 , β > 1 ,
which can be made arbitrarily large by selecting β to be sufficiently close to 1 (from above). This provides a case where the lower bound in Theorem 8 outperforms the one in ([35], Theorem 3).
Remark 5.
Due to the one-to-one correspondence between Tsallis and Rényi entropies of the same positive order, similar to the transition from ([36], Theorem 1) to ([36], Theorem 2), also Theorem 8 enables to strengthen the Schur-concavity property of the Rényi entropy. For information-theoretic implications of the Schur-concavity of the Rényi entropy, the reader is referred to, e.g., [34], ([41], Theorem 3) and ([42], Theorem 11).

3. Illustration of the Main Results and Implications

3.1. Illustration of Theorems 2 and 4

We apply here the data-processing inequalities in Theorems 2 and 4 to the new class of f-divergences introduced in Theorem 5.
In the setup of Theorems 2 and 4, consider communication over a time-varying binary-symmetric channel (BSC). Consequently, let X = Y = { 0 , 1 } , and let
P X i ( 1 ) = p i , Q X i ( 1 ) = q i ,
with p i ( 0 , 1 ) and q i ( 0 , 1 ) for every i { 1 , , n } . Let the transition probabilities P Y i | X i ( · | · ) correspond to BSC ( δ i ) (i.e., a BSC with a crossover probability δ i ), i.e.,
P Y i | X i ( y | x ) = { 1 δ i if   x = y , δ i if   x y .
For all λ [ 0 , 1 ] and x ̲ X n , the probability mass function at the channel input is given by
R X n ( λ ) ( x ̲ ) = i = 1 n R X i ( λ ) ( x i ) ,
with
R X i ( λ ) ( x ) = λ P X i ( x ) + ( 1 λ ) Q X i ( x ) , x { 0 , 1 } ,
where the probability mass function in (109) refers to a Bernoulli distribution with parameter λ p i + ( 1 λ ) q i . At the output of the time-varying BSC (see (42)–(44) and (107)), for all y ̲ Y n ,
R Y n ( λ ) ( y ̲ ) = i = 1 n R Y i ( λ ) ( y i ) , P Y n ( y ̲ ) = i = 1 n P Y i ( y i ) , Q Y n ( y ̲ ) = i = 1 n Q Y i ( y i ) ,
where
R Y i ( λ ) ( 1 ) = λ p i + ( 1 λ ) q i δ i ,
P Y i ( 1 ) = p i δ i ,
Q Y i ( 1 ) = q i δ i ,
with
a b : = a ( 1 b ) + ( 1 a ) b , 0 a , b 1 .
The χ 2 -divergence from Bernoulli ( p ) to Bernoulli ( q ) is given by
χ 2 Bernoulli ( p ) Bernoulli ( q ) = ( p q ) 2 q ( 1 q ) ,
and since the probability mass functions P X i , Q X i , P Y i and Q Y i correspond to Bernoulli distributions with parameters p i , q i , p i δ i and q i δ i , respectively, Theorem 2 gives that
c f α ξ 1 ( n , λ ) , ξ 2 ( n , λ ) i = 1 n 1 + λ 2 ( p i q i ) 2 q i ( 1 q i ) i = 1 n 1 + λ 2 ( p i δ i q i δ i ) 2 ( q i δ i ) ( 1 q i δ i )
D f α ( R X n ( λ ) Q X n ) D f α ( R Y n ( λ ) Q Y n )
e f α ξ 1 ( n , λ ) , ξ 2 ( n , λ ) i = 1 n 1 + λ 2 ( p i q i ) 2 q i ( 1 q i ) i = 1 n 1 + λ 2 ( p i δ i q i δ i ) 2 ( q i δ i ) ( 1 q i δ i )
for all λ [ 0 , 1 ] and n N . From (26), (31) and (55), we get that for all ξ 1 < 1 < ξ 2 ,
c f α ( ξ 1 , ξ 2 ) = 1 2 inf t [ ξ 1 , ξ 2 ] f α ( t )
= log ( α + ξ 1 ) + 3 2 log e ,
e f α ( ξ 1 , ξ 2 ) = 1 2 sup t [ ξ 1 , ξ 2 ] f α ( t )
= log ( α + ξ 2 ) + 3 2 log e ,
and, from (47), (48) and (106), for all λ ( 0 , 1 ] ,
ξ 1 ( n , λ ) : = i = 1 n 1 λ + λ min p i q i , 1 p i 1 q i [ 0 , 1 ) ,
ξ 2 ( n , λ ) : = i = 1 n 1 λ + λ max p i q i , 1 p i 1 q i ( 1 , ) ,
provided that p i q i for some i { 1 , , n } (otherwise, both f-divergences in the right side of (116) are equal to zero since P X i Q X i and therefore R X i ( λ ) Q X i for all i and λ [ 0 , 1 ] ). Furthermore, from Item (c) of Theorem 2, for every n N and α e 3 2 ,
lim λ 0 + D f α ( R X n ( λ ) Q X n ) D f α ( R Y n ( λ ) Q Y n ) λ 2 = log ( α + 1 ) + 3 2 log e i = 1 n ( p i q i ) 2 q i ( 1 q i ) ( p i δ i q i δ i ) 2 ( q i δ i ) ( 1 q i δ i ) ,
and the lower and upper bounds in the left side of (116) and the right side of (117), respectively, are tight as we let λ 0 , and they both coincide with the limit in the right side of (124).
Figure 1 illustrates the upper and lower bounds in (116) and (117) with α = 1 , p i 1 4 , q i 1 2 and δ i 0.110 for all i, and n { 1 , 10 , 50 } . In the special case where { δ i } are fixed for all i, the communication channel is a time-invariant BSC whose capacity is equal to 1 2 bit per channel use.
By referring to the upper and middle plots of Figure 1, if n = 1 or n = 10 , then the exact values of the differences of the f α -divergences in the right side of (116) are calculated numerically, being compared to the lower and upper bounds in the left side of (116) and the right side of (117) respectively. Since the f α -divergence does not tensorize, the computation of the exact value of each of the two f α -divergences in the right side of (116) involves a pre-computation of 2 n probabilities for each of the probability mass functions P X n , Q X n , P Y n and Q Y n ; this computation is prohibitively complex unless n is small enough.
We now apply the bound in Theorem 4. In view of (51), (54), (55) and (73), for all λ ( 0 , 1 ] and α e 3 2 ,
D f α R Y n ( λ ) Q Y n D f α R X n ( λ ) Q X n
κ α ξ 1 ( n , λ ) , ξ 2 ( n , λ ) f α ( 0 ) + f α ( 1 ) i = 1 n 1 + λ 2 χ 2 ( P Y i Q Y i ) 1 i = 1 n 1 + λ 2 χ 2 ( P X i Q X i ) 1
= f α ξ 2 ( n , λ ) + f α ( 1 ) 1 ξ 2 ( n , λ ) ξ 2 ( n , λ ) 1 2 f α ( 0 ) + f α ( 1 ) · i = 1 n 1 + λ 2 ( p i δ i q i δ i ) 2 ( q i δ i ) ( 1 q i δ i ) 1 i = 1 n 1 + λ 2 ( p i q i ) 2 q i ( 1 q i ) 1 ,
where ξ 1 ( n , λ ) [ 0 , 1 ) and ξ 2 ( n , λ ) ( 1 , ) are given in (122) and (123), respectively, and for t 0 ,
f α ( t ) + f α ( 1 ) ( 1 t ) = ( α + t ) 2 log ( α + t ) ( α + 1 ) 2 log ( α + 1 ) + 2 ( α + 1 ) log ( α + 1 ) + ( α + 1 ) log e ( 1 t ) .
Figure 2 illustrates the upper bound on D f α ( R Y n ( λ ) Q Y n ) D f α ( R X n ( λ ) Q X n ) (see (125)–(127)) as a function of λ ( 0 , 1 ] . It refers to the case where p i 1 4 , q i 1 2 , and δ i 0.110 for all i (similarly to Figure 1). The upper and middle plots correspond to n = 10 with α = 10 and α = 100 , respectively; the middle and lower plots correspond to α = 100 with n = 10 and n = 100 , respectively. The bounds in the upper and middle plots are compared to their exact values since their numerical computations are feasible for n = 10 . It is observed from the numerical comparisons for n = 10 (see the upper and middle plots in Figure 2) that the upper bounds are informative, especially for large values of α where the f α -divergence becomes closer to a scaled version of the χ 2 -divergence (see Item (e) in Theorem 5).

3.2. Illustration of Theorems 3 and 5

Following the application of the data-processing inequalities in Theorems 2 and 4 to a class of f-divergences (see Section 3.1), some interesting properties of this class are introduced in Theorem 5.
For α e 3 2 , let d f α : ( 0 , 1 ) 2 [ 0 , ) be the binary f α -divergence (see (55)), defined as
d f α ( p q ) : = D f α Bernoulli ( p ) Bernoulli ( q ) = q α + p q 2 log α + p q + ( 1 q ) α + 1 p 1 q 2 log α + 1 p 1 q
( α + 1 ) 2 log ( α + 1 ) , ( p , q ) ( 0 , 1 ) 2 .
Theorem 5 is illustrated in Figure 3, showing that d f α ( p q ) is monotonically increasing as a function of α e 3 2 (note that the concavity in α is not reflected from these plots because the horizontal axis of α is in logarithmic scaling). The binary divergence d f α ( p q ) is also compared in Figure 3 with its lower and upper bounds in (61) and (65), respectively, illustrating that these bounds are both asymptotically tight for large values of α . The asymptotic approximation of d f α ( p q ) for large α , expressed as a function of α and χ 2 ( p q ) (see (66)), is also depicted in Figure 3. The upper and lower plots in Figure 3 refer, respectively, to ( p , q ) = ( 0.1 , 0.9 ) and ( 0.2 , 0.8 ) ; a comparison of these plots show a better match between the exact value of the binary divergence, its upper and lower bounds, and its asymptotic approximation when the values of p and q are getting closer.
In view of the results in (66) and (68), it is interesting to note that the asymptotic value of D f α ( P Q ) for large values of α is also the exact scaling of this f-divergence for any finite value of α e 3 2 when the probability mass functions P and Q are close enough to each other.
We next consider the ratio of the contraction coefficients μ f α ( Q X , W Y | X ) μ χ 2 ( Q X , W Y | X ) where Q X is finitely supported on X and it is not a point mass (i.e., | X | 2 ), and W Y | X is arbitrary. For all α e 3 2 ,
1 μ f α ( Q X , W Y | X ) μ χ 2 ( Q X , W Y | X ) f α ( ξ ) + f α ( 1 ) ( 1 ξ ) ( ξ 1 ) 2 f α ( 0 ) + f α ( 1 ) ,
where f α : ( 0 , ) R is given in (55), and
ξ : = 1 min x X Q X ( x ) [ | X | , ) .
The left-side inequality in (130) is due to ([25], Theorem 2) (see Proposition 2), and the right-side inequality in (130) holds due to (53) and (73).
Figure 4 shows the upper bound on the ratio of contraction coefficients μ f α ( Q X , W Y | X ) μ χ 2 ( Q X , W Y | X ) , as it is given in the right-side inequality of (130), as a function of the parameter α e 3 2 . The curves in Figure 4 correspond to different values of ξ [ | X | , ) , as it is given in (131); these upper bounds are monotonically decreasing in α , and they asymptotically tend to 1 as we let α . Hence, in view of the left-side inequality in (130), the upper bound on the ratio of the contraction coefficients (in the right-side inequality) is asymptotically tight in α . The fact that the ratio of the contraction coefficients in the middle of (130) tends asymptotically to 1, as α gets large, is not directly implied by Item (e) of Theorem 5. The latter implies that, for fixed probability mass functions P and Q and for sufficiently large α ,
D f α ( P Q ) log ( α + 1 ) + 3 2 log e χ 2 ( P Q ) ;
however, there is no guarantee that for fixed Q and sufficiently large α , the approximation in (132) holds for all P. By the upper bound in the right side of (130), it follows however that μ f α ( Q X , W Y | X ) tends asymptotically (as we let α ) to the contraction coefficient of the χ 2 divergence.

3.3. Illustration of Theorem 7 and Further Results

Theorem 7 provides upper and lower bounds on an f-divergence, D f ( Q U n ) , from any probability mass function Q supported on a finite set of cardinality n to an equiprobable distribution over this set. We apply in the following, the exact formula for
d f ( ρ ) : = lim n max Q P n ( ρ ) D f ( Q U n ) , ρ 1
to several important f-divergences. From (87),
d f ( ρ ) = max x [ 0 , 1 ] x f ρ 1 + ( ρ 1 ) x + ( 1 x ) f 1 1 + ( ρ 1 ) x , ρ 1 .
Since f is a convex function on ( 0 , ) with f ( 1 ) = 0 , Jensen’s inequality implies that the function which is subject to maximization in the right-side of (134) is non-negative over the interval [ 0 , 1 ] . It is equal to zero at the endpoints of the interval [ 0 , 1 ] , so the maximum over this interval is attained at an interior point. Note also that, in view of Items (d) and (e) of Theorem 7, the exact asymptotic expression in (134) satisfies
max Q P n ( ρ ) D f ( Q U n ) d f ( ρ ) , n { 2 , 3 , } , ρ 1 .

3.3.1. Total Variation Distance

This distance is an f-divergence with f ( t ) : = | t 1 | for t > 0 . Substituting f into (134) gives
d f ( ρ ) = max x [ 0 , 1 ] 2 ( ρ 1 ) x ( 1 x ) 1 + ( ρ 1 ) x .
By setting to zero the derivative of the function which is subject to maximization in the right side of (136), it can be verified that the maximizer over this interval is equal to x = 1 1 + ρ , which implies that
d f ( ρ ) = 2 ( ρ 1 ) ρ + 1 , ρ 1 .

3.3.2. Alpha Divergences

The class of Alpha divergences forms a parametric subclass of the f-divergences, which includes in particular the relative entropy, χ 2 -divergence, and the squared-Hellinger distance. For α R , let
D A ( α ) ( P Q ) : = D u α ( P Q ) ,
where u α : ( 0 , ) R is a non-negative and convex function with u α ( 1 ) = 0 , which is defined for t > 0 as follows (see ([8], Chapter 2), followed by studies in, e.g., [10,16,43,44,45]):
u α ( t ) : = { t α α ( t 1 ) 1 α ( α 1 ) , α ( , 0 ) ( 0 , 1 ) ( 1 , ) , t log e t + 1 t , α = 1 , log e t , α = 0 .
The functions u 0 and u 1 are defined in the right side of (139) by a continuous extension of u α at α = 0 and α = 1 , respectively. The following relations hold (see, e.g., ([44], (10)–(13))):
D A ( 1 ) ( P Q ) = 1 log e D ( P Q ) ,
D A ( 0 ) ( P Q ) = 1 log e D ( Q P ) ,
D A ( 2 ) ( P Q ) = 1 2 χ 2 ( P Q ) ,
D A ( 1 ) ( P Q ) = 1 2 χ 2 ( Q P ) ,
D A ( 1 2 ) ( P Q ) = 4 H 2 ( P Q ) .
Substituting f : = u α (see (139)) into the right side of (134) gives that
Δ ( α , ρ ) : = d u α ( ρ )
= lim n max Q P n ( ρ ) D A ( α ) ( Q U n )
= max x [ 0 , 1 ] 1 + ( ρ α 1 ) x 1 + ( ρ 1 ) x α 1 .
Setting to zero the derivative of the function which is subject to maximization in the right side of (147) gives
x = x : = 1 + α ( ρ 1 ) ρ α ( 1 α ) ( ρ 1 ) ( ρ α 1 ) ,
where it can be verified that x ( 0 , 1 ) for all α ( , 0 ) ( 0 , 1 ) ( 1 , ) and ρ > 1 . Substituting (148) into the right side of (147) gives that, for all such α and ρ ,
Δ ( α , ρ ) = 1 α ( α 1 ) ( 1 α ) α 1 ( ρ α 1 ) α ( ρ ρ α ) 1 α ( ρ 1 ) α α 1 .
By a continuous extension of Δ ( α , ρ ) in (149) at α = 1 and α = 0 , it follows that for all ρ > 1
Δ ( 1 , ρ ) = Δ ( 0 , ρ ) = ρ log ρ ρ 1 log e ρ log e ρ ρ 1 .
Consequently, for all ρ > 1 ,
lim n max Q P n ( ρ ) D ( Q U n ) = log e lim n max Q P n ( ρ ) D A ( 1 ) ( Q U n )
= Δ ( 1 , ρ ) log e
= ρ log ρ ρ 1 log e ρ log e ρ ρ 1 ,
where (151) holds due to (140); (152) is due to (146), and (153) holds due to (150). This sharpens the result in ([33], Theorem 2) for the relative entropy from the equiprobable distribution, D ( Q U n ) = log n H ( Q ) , by showing that the bound in ([33], (7)) is asymptotically tight as we let n . The result in ([33], Theorem 2) can be further tightened for finite n by applying the result in Theorem 7 (d) with f ( t ) : = u 1 ( t ) log e = t log t + ( 1 t ) log e for all t > 0 (although, unlike the asymptotic result in (149), the refined bound for a finite n does not lend itself to a closed-form expression as a function of n; see also ([34], Remark 3), which provides such a refinement of the bound on D ( Q U n ) for finite n in a different approach).
From (141), (146) and (150), it follows similarly to (153) that for all ρ > 1
lim n max Q P n ( ρ ) D ( U n Q ) = Δ ( 0 , ρ ) log e
= ρ log ρ ρ 1 log e ρ log e ρ ρ 1 .
It should be noted that in view of the one-to-one correspondence between the Rényi divergence and the Alpha divergence of the same order α where, for α 1 ,
D α ( P Q ) = 1 α 1 log 1 + α ( α 1 ) D A ( α ) ( P Q ) ,
the asymptotic result in (149) can be obtained from ([34], Lemma 4) and vice versa; however, in [34], the focus is on the Rényi divergence from the equiprobable distribution, whereas the result in (149) is obtained by specializing the asymptotic expression in (134) for a general f-divergence. Note also that the result in ([34], Lemma 4) is restricted to α > 0 , whereas the result in (149) and (150) covers all values of α R .
In view of (146), (149), (153), (155), and the special cases of the Alpha divergences in (140)–(144), it follows that for all ρ > 1 and for every integer n 2
max Q P n ( ρ ) D ( Q U n ) Δ ( 1 , ρ ) log e = ρ log ρ ρ 1 log e ρ log e ρ ρ 1 ,
max Q P n ( ρ ) D ( U n Q ) Δ ( 0 , ρ ) log e = ρ log ρ ρ 1 log e ρ log e ρ ρ 1 ,
max Q P n ( ρ ) χ 2 ( Q U n ) 2 Δ ( 2 , ρ ) = ( ρ 1 ) 2 4 ρ ,
max Q P n ( ρ ) χ 2 ( U n Q ) 2 Δ ( 1 , ρ ) = ( ρ 1 ) 2 4 ρ ,
max Q P n ( ρ ) H 2 ( Q U n ) 1 4 Δ ( 1 2 , ρ ) = ( ρ 4 1 ) 2 ρ + 1 ,
and the upper bounds on the right sides of (157)–(161) are asymptotically tight in the limit where n tends to infinity.
The next result characterizes the function Δ : ( 0 , ) × ( 1 , ) R as it is given in (149) and (150).
Theorem 9.
The function Δ satisfies the following properties:
(a) 
For every ρ > 1 , Δ ( α , ρ ) is a convex function of α over the real line, and it is symmetric around α = 1 2 with a global minimum at α = 1 2 .
(b) 
The following inequalities hold:
α Δ ( α , ρ ) β Δ ( β , ρ ) , 0 < α β < ,
( 1 β ) Δ ( β , ρ ) ( 1 α ) Δ ( α , ρ ) , < α β < 1 .
(c) 
For every α R , Δ ( α , ρ ) is monotonically increasing and continuous in ρ ( 1 , ) , and lim ρ 1 + Δ ( α , ρ ) = 0 .
Proof. 
See Appendix H.1.  □
Remark 6.
The symmetry of Δ ( α , ρ ) around α = 1 2 (see Theorem 9 (a)) is not implied by the following symmetry property of the Alpha divergence around α = 1 2 (see, e.g., ([8], p. 36)):
D A ( 1 2 + α ) ( P Q ) = D A ( 1 2 α ) ( Q P ) .
Relying on Theorem 9, the following corollary gives a similar result to (146) where the order of Q and U n in D A ( α ) ( · · ) is switched.
Corollary 1.
For all α R and ρ > 1 ,
lim n max Q P n ( ρ ) D A ( α ) ( U n Q ) = Δ ( α , ρ ) .
Proof. 
See Appendix H.2.  □
We next further exemplify Theorem 7 for the relative entropy. Let f ( t ) : = t log t + ( 1 t ) log e for t > 0 . Then, f ( t ) = log e t , so the bounds on the second derivative of f over the interval [ 1 ρ , ρ ] are given by M = ρ log e and m = log e ρ . Theorem 7 (h) gives the following bounds:
n Q 2 2 1 log e 2 ρ D ( Q U n ) ρ n Q 2 2 1 log e 2 .
From ([33], Theorem 2) (and (157)),
D ( Q U n ) ρ log ρ ρ 1 log e ρ log e ρ ρ 1 .
Furthermore, (96) gives that
D ( Q U n ) 1 8 ( ρ 1 ) 2 log e ,
which, for ρ > 1 , is a looser bound in comparison to (167). It can be verified, however, that the dominant term in the Taylor series expansion (around ρ = 1 ) of the right side of (167) coincides with the right side of (168), so the bounds scale similarly for small values of ρ 1 .
Suppose that we wish to assert that, for every integer n 2 and for all probability mass functions Q P n ( ρ ) , the condition
D ( Q U n ) d log e
holds with a fixed d > 0 . Due to the left side inequality in (89), this condition is equivalent to the requirement that
lim n max Q P n ( ρ ) D ( Q U n ) d log e .
Due to the asymptotic tightness of the upper bound in the right side of (157) (as we let n ), requiring that this upper bound is not larger than d log e is necessary and sufficient for the satisfiability of (169) for all n and Q P n ( ρ ) . This leads to the analytical solution ρ ρ max ( 1 ) ( d ) with (see Appendix I)
ρ max ( 1 ) ( d ) : = W 1 e d 1 W 0 e d 1 ,
where W 0 and W 1 denote, respectively, the principal and secondary real branches of the Lambert W function [37]. Requiring the stronger condition where the right side of (168) is not larger than d log e leads to the sufficient solution ρ ρ max ( 2 ) with the simple expression
ρ max ( 2 ) ( d ) : = 1 + 8 d .
In comparison to ρ max ( 1 ) in (171), ρ max ( 2 ) in (172) is more insightful; these values nearly coincide for small values of d > 0 , providing in that case the same range of possible values of ρ for asserting the satisfiability of condition (169). As it is shown in Figure 5, for d 0.01 , the difference between the maximal values of ρ in (171) and (172) is marginal, though in general ρ max ( 1 ) ( d ) > ρ max ( 2 ) ( d ) for all d > 0 .

3.3.3. The Subclass of f-Divergences in Theorem 5

This example refers to the subclass of f-divergences in Theorem 5. For these f α -divergences, with α e 3 2 , substituting f : = f α from (55) into the right side of (134) gives that for all ρ 1
Φ ( α , ρ ) : = d f α ( ρ )
= lim n max Q P n ( ρ ) D f α ( Q U n ) = max x [ 0 , 1 ] x α + ρ 1 + ( ρ 1 ) x 2 log α + ρ 1 + ( ρ 1 ) x ( α + 1 ) 2 log ( α + 1 )
+ ( 1 x ) α + 1 1 + ( ρ 1 ) x 2 log α + 1 1 + ( ρ 1 ) x .
The exact asymptotic expression in the right side of (175) is subject to numerical maximization.
We next provide two alternative closed-form upper bounds, based on Theorems 5 and 7, and study their tightness. The two upper bounds, for all α e 3 2 and ρ 1 , are given by (see Appendix J)
Φ ( α , ρ ) log ( α + 1 ) + 3 2 log e log e α + 1 ( ρ 1 ) 2 4 ρ + log e 81 ( α + 1 ) ( ρ 1 ) ( 2 ρ + 1 ) ( ρ + 2 ) ρ ( ρ + 1 ) 2 ,
and
Φ ( α , ρ ) log ( α + ρ ) + 3 2 log e ( ρ 1 ) 2 4 ρ .
Suppose that we wish to assert that, for every integer n 2 and for all probability mass functions Q P n ( ρ ) , the condition
D f α ( Q U n ) d log e
holds with a fixed d > 0 and α e 3 2 . Due to (173)–(174) and the left side inequality in (89), the satisfiability of the latter condition is equivalent to the requirement that
Φ ( α , ρ ) d log e .
In order to obtain a sufficient condition for ρ to satisfy (179), expressed as an explicit function of α and d, the upper bound in the right side of (176) is slightly loosened to
Φ ( α , ρ ) a ( ρ 1 ) 2 + b min { ρ 1 , ( ρ 1 ) 2 } ,
where
a : = 4 log e 81 ( α + 1 ) ,
b : = 1 4 log ( α + 1 ) + 3 8 log e ,
for all ρ 1 and α e 3 2 . The upper bounds in the right sides of (176), (177) and (180) are derived in Appendix J.
In comparison to (179), the stronger requirement that the right side of (180) is less than or equal to d log e gives the sufficient condition
ρ ρ max ( α , d ) : = max ρ 1 ( α , d ) , ρ 2 ( α , d ) ,
with
ρ 1 ( α , d ) : = 1 + b 2 + 4 a d log e b 2 a ,
ρ 2 ( α , d ) : = 1 + d log e a + b .
Figure 6 compares the exact expression in (175) with its upper bounds in (176), (177) and (180). These bounds show good match with the exact value, and none of the bounds in (176) and (177) is superseded by the other; the bound in (180) is looser than (176), and it is derived for obtaining the closed-form solution in (183)–(185). The bound in (176) is tighter than the bound in (177) for small values of ρ 1 , whereas the latter bound outperforms the first one for sufficiently large values of ρ . It has been observed numerically that the tightness of the bounds is improved by increasing the value of α , and the range of parameters of ρ over which the bound in (176) outperforms the second bound in (177) is enlarged when α is increased. It is also shown in Figure 6 that the bound in (176) and its loosened version in (180) almost coincide for sufficiently small values of ρ (i.e., for ρ is close to 1), and also for sufficiently large values of ρ .

3.4. An Interpretation of u f ( · , · ) in Theorem 7

We provide here an interpretation of u f ( n , ρ ) in (77), for ρ > 1 and an integer n 2 ; note that u f ( n , 1 ) 0 since P n ( 1 ) = { U n } . Before doing so, recall that (82) introduces an identity which significantly simplifies the numerical calculation of u f ( n , ρ ) , and (85) gives (asymptotically tight) upper and lower bounds.
The following result relies on the variational representation of f-divergences.
Theorem 10.
Let f : ( 0 , ) R be convex with f ( 1 ) = 0 , and let f ¯ : R R { } be the convex conjugate function of f (a.k.a. the Fenchel-Legendre transform of f), i.e.,
f ¯ ( x ) : = sup t > 0 t x f ( t ) , x R .
Let ρ > 1 , and define A n : = { 1 , , n } for an integer n 2 . Then, the following holds:
(a) 
For every P P n ( ρ ) , a random variable X P , and a function g : A n R ,
E [ g ( X ) ] u f ( n , ρ ) + 1 n i = 1 n f ¯ g ( i ) .
(b) 
There exists P P n ( ρ ) such that, for every ε > 0 , there is a function g ε : A n R which satisfies
E [ g ε ( X ) ] u f ( n , ρ ) + 1 n i = 1 n f ¯ g ε ( i ) ε ,
with X P .
Proof. 
See Appendix K.  □
Remark 7.
The proof suggests a constructive way to obtain, for an arbitrary ε > 0 , a function g ε which satisfies (188).

4. Applications in Information Theory and Statistics

4.1. Bounds on the List Decoding Error Probability with f-Divergences

The minimum probability of error of a random variable X given Y, denoted by ε X | Y , can be achieved by a deterministic function (maximum-a-posteriori decision rule) L : Y X (see [42]):
ε X | Y = min L : Y X P [ X L ( Y ) ]
= P [ X L ( Y ) ]
= 1 E max x X P X | Y ( x | Y ) .
Fano’s inequality [46] gives an upper bound on the conditional entropy H ( X | Y ) as a function of ε X | Y (or, otherwise, providing a lower bound on ε X | Y as a function of H ( X | Y ) ) when X takes a finite number of possible values.
The list decoding setting, in which the hypothesis tester is allowed to output a subset of given cardinality, and an error occurs if the true hypothesis is not in the list, has great interest in information theory. A generalization of Fano’s inequality to list decoding, in conjunction with the blowing-up lemma ([17], Lemma 1.5.4), leads to strong converse results in multi-user information theory. This approach was initiated in ([47], Section 5) (see also ([48], Section 3.6)). The main idea of the successful combination of these two tools is that, given a code, it is possible to blow-up the decoding sets in a way that the probability of decoding error can be as small as desired for sufficiently large blocklengths; since the blown-up decoding sets are no longer disjoint, the resulting setup is a list decoder with sub-exponential list size (as a function of the block length).
In statistics, Fano’s-type lower bounds on Bayes and minimax risks, expressed in terms of f-divergences, are derived in [49,50].
In this section, we further study the setup of list decoding, and derive bounds on the average list decoding error probability. We first consider the special case where the list size is fixed (see Section 4.1.1), and then move to the more general case of a list size which depends on the channel observation (see Section 4.1.2).

4.1.1. Fixed-Size List Decoding

A generalization of Fano’s inequality for fixed-size list decoding is given in ([42], (139)), expressed as a function of the conditional Shannon entropy (strengthening ([51], Lemma 1)). A further generalization in this setup, which is expressed as a function of the Arimoto-Rényi conditional entropy with an arbitrary positive order (see Definition 9), is provided in ([42], Theorem 8).
The next result provides a generalized Fano’s inequality for fixed-size list decoding, expressed in terms of an arbitrary f-divergence. Several earlier results in the literature are recovered as special cases of this result, which is then strengthened as an application of Theorem 1.
Theorem 11.
Let $P_{XY}$ be a probability measure defined on $\mathcal{X} \times \mathcal{Y}$ with $|\mathcal{X}| = M$. Consider a decision rule $\mathcal{L} \colon \mathcal{Y} \to \binom{\mathcal{X}}{L}$, where $\binom{\mathcal{X}}{L}$ stands for the set of subsets of $\mathcal{X}$ with cardinality L, and $L < M$ is fixed. Denote the list decoding error probability by $P_{\mathcal{L}} := \mathbb{P}\bigl[X \notin \mathcal{L}(Y)\bigr]$. Let $U_M$ denote an equiprobable probability mass function on $\mathcal{X}$. Then, for every convex function $f \colon (0,\infty) \to \mathbb{R}$ with $f(1) = 0$,
$$\mathbb{E}\Bigl[ D_f\bigl( P_{X|Y}(\cdot|Y) \,\|\, U_M \bigr) \Bigr] \geq \frac{L}{M}\, f\biggl( \frac{M(1-P_{\mathcal{L}})}{L} \biggr) + \Bigl(1 - \frac{L}{M}\Bigr) f\biggl( \frac{M P_{\mathcal{L}}}{M - L} \biggr).$$
Proof. 
See Appendix L.  □
Remark 8.
The case where L = 1 (i.e., a decoder with a single output) gives ([50], (5)).
As consequences of Theorem 11, we first reproduce some earlier results as special cases.
Corollary 2
([42] (139)). Under the assumptions in Theorem 11,
$$H(X|Y) \leq \log M - d\Bigl( P_{\mathcal{L}} \,\Big\|\, 1 - \frac{L}{M} \Bigr),$$
where $d(\cdot \| \cdot) \colon [0,1] \times [0,1] \to [0, +\infty]$ denotes the binary relative entropy, defined as the continuous extension of $D([p, 1-p] \,\|\, [q, 1-q]) := p \log \frac{p}{q} + (1-p) \log \frac{1-p}{1-q}$ for $p, q \in (0,1)$.
Proof. 
The choice $f(t) := t \log t + (1-t) \log e$, for $t > 0$ (note that $f(t) = u_1(t) \log e$ with $u_1(\cdot)$ defined in (139)), gives
$$\mathbb{E}\Bigl[ D_f\bigl( P_{X|Y}(\cdot|Y) \,\|\, U_M \bigr) \Bigr] = \int_{\mathcal{Y}} \mathrm{d}P_Y(y)\, D\bigl( P_{X|Y}(\cdot|y) \,\|\, U_M \bigr) = \int_{\mathcal{Y}} \mathrm{d}P_Y(y)\, \bigl[ \log M - H(X \,|\, Y=y) \bigr] = \log M - H(X|Y),$$
and
$$\frac{L}{M}\, f\biggl( \frac{M(1-P_{\mathcal{L}})}{L} \biggr) + \Bigl(1 - \frac{L}{M}\Bigr) f\biggl( \frac{M P_{\mathcal{L}}}{M-L} \biggr) = d\Bigl( P_{\mathcal{L}} \,\Big\|\, 1 - \frac{L}{M} \Bigr).$$
Substituting (194)–(197) into (192) gives (193).  □
Theorem 11 enables us to reproduce a result in [42] which generalizes Corollary 2. It relies on Rényi information measures, and we first provide definitions for a self-contained presentation.
Definition 8
([40]). Let $P_X$ be a probability mass function defined on a discrete set $\mathcal{X}$. The Rényi entropy of order $\alpha \in (0,1) \cup (1,\infty)$ of X, denoted by $H_\alpha(X)$ or $H_\alpha(P_X)$, is defined as
$$H_\alpha(X) := \frac{1}{1-\alpha} \, \log \sum_{x \in \mathcal{X}} P_X^\alpha(x) = \frac{\alpha}{1-\alpha} \, \log \|P_X\|_\alpha.$$
The Rényi entropy is continuously extended at orders 0, 1, and ∞; at order 1, it coincides with the Shannon entropy H(X).
Definition 9
([52]). Let $P_{XY}$ be defined on $\mathcal{X} \times \mathcal{Y}$, where X is a discrete random variable. The Arimoto-Rényi conditional entropy of order $\alpha \in [0, \infty]$ of X given Y is defined as follows:
  • If $\alpha \in (0,1) \cup (1,\infty)$, then
$$H_\alpha(X|Y) = \frac{\alpha}{1-\alpha} \, \log \mathbb{E}\Biggl[ \biggl( \sum_{x \in \mathcal{X}} P_{X|Y}^\alpha(x|Y) \biggr)^{\frac{1}{\alpha}} \Biggr] = \frac{\alpha}{1-\alpha} \, \log \mathbb{E}\Bigl[ \bigl\| P_{X|Y}(\cdot|Y) \bigr\|_\alpha \Bigr] = \frac{\alpha}{1-\alpha} \, \log \int_{\mathcal{Y}} \mathrm{d}P_Y(y) \, \exp\Bigl( \frac{1-\alpha}{\alpha}\, H_\alpha(X \,|\, Y=y) \Bigr).$$
  • The Arimoto-Rényi conditional entropy is continuously extended at orders 0, 1, and ∞; at order 1, it coincides with the conditional Shannon entropy H ( X | Y ) .
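The following sketch evaluates Definitions 8 and 9 numerically; it is only an illustration (natural logarithms are used, and the joint pmf is passed as an array with rows indexed by x and columns by y).

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Rényi entropy H_alpha (in nats) of a pmf p, for alpha > 0, alpha != 1."""
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

def arimoto_renyi_cond_entropy(P_XY, alpha):
    """Arimoto-Rényi conditional entropy H_alpha(X|Y) (in nats), alpha > 0, alpha != 1:
       H_alpha(X|Y) = alpha/(1-alpha) * log E[ (sum_x P_{X|Y}(x|Y)^alpha)^(1/alpha) ]."""
    P_Y = P_XY.sum(axis=0)
    P_X_given_Y = P_XY / P_Y
    inner = np.sum(P_X_given_Y ** alpha, axis=0) ** (1.0 / alpha)   # alpha-norm per y
    return (alpha / (1.0 - alpha)) * np.log(np.sum(P_Y * inner))

# As alpha -> 1, both quantities approach the corresponding Shannon entropies.
P_XY = np.array([[0.2, 0.1], [0.1, 0.2], [0.25, 0.15]])
print(renyi_entropy(P_XY.sum(axis=1), 2.0), arimoto_renyi_cond_entropy(P_XY, 2.0))
```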
Definition 10
([42]). For all $\alpha \in (0,1) \cup (1,\infty)$, the binary Rényi divergence of order α, denoted by $d_\alpha(p \| q)$ for $(p,q) \in [0,1]^2$, is defined as $D_\alpha([p, 1-p] \,\|\, [q, 1-q])$. It is the continuous extension to $[0,1]^2$ of
$$d_\alpha(p \| q) = \frac{1}{\alpha - 1} \, \log \Bigl( p^\alpha q^{1-\alpha} + (1-p)^\alpha (1-q)^{1-\alpha} \Bigr).$$
For α = 1,
$$d_1(p \| q) := \lim_{\alpha \to 1} d_\alpha(p \| q) = d(p \| q).$$
The following result, which generalizes Corollary 2, is shown to be a consequence of Theorem 11. It was originally derived in ([42], Theorem 8) in a different way. The alternative derivation of this inequality relies on Theorem 11, applied to the family of Alpha-divergences (see (138)) as a subclass of the f-divergences.
Corollary 3
([42] Theorem 8). Under the assumptions in Theorem 11, for $\alpha \in (0,1) \cup (1,\infty)$,
$$H_\alpha(X|Y) \leq \log M - d_\alpha\Bigl( P_{\mathcal{L}} \,\Big\|\, 1 - \frac{L}{M} \Bigr) = \frac{1}{1-\alpha} \, \log \Bigl( L^{1-\alpha} (1-P_{\mathcal{L}})^\alpha + (M-L)^{1-\alpha} P_{\mathcal{L}}^\alpha \Bigr),$$
with equality in (205) if and only if
$$P_{X|Y}(x|y) = \begin{cases} \dfrac{P_{\mathcal{L}}}{M-L}, & x \notin \mathcal{L}(y), \\[2mm] \dfrac{1-P_{\mathcal{L}}}{L}, & x \in \mathcal{L}(y). \end{cases}$$
Proof. 
See Appendix M.  □
Another application of Theorem 11, with the selection $f(t) := |t-1|^s$ for $t \in [0, \infty)$ and a parameter $s \geq 1$, gives the following result.
Corollary 4.
Under the assumptions in Theorem 11, for all $s \geq 1$,
$$P_{\mathcal{L}} \geq 1 - \frac{L}{M} - \Bigl( L^{1-s} + (M-L)^{1-s} \Bigr)^{-\frac{1}{s}} \Biggl( \mathbb{E}\biggl[ \sum_{x \in \mathcal{X}} \Bigl| P_{X|Y}(x|Y) - \frac{1}{M} \Bigr|^s \biggr] \Biggr)^{\frac{1}{s}},$$
where (208) holds with equality if X and Y are independent with X being equiprobable. For s = 1 and s = 2, (208) respectively gives that
$$P_{\mathcal{L}} \geq 1 - \frac{L}{M} - \frac{1}{2} \, \mathbb{E}\biggl[ \sum_{x \in \mathcal{X}} \Bigl| P_{X|Y}(x|Y) - \frac{1}{M} \Bigr| \biggr],$$
$$P_{\mathcal{L}} \geq 1 - \frac{L}{M} - \sqrt{ \frac{L}{M} \Bigl(1 - \frac{L}{M}\Bigr) \Bigl( M \, \mathbb{E}\bigl[ P_{X|Y}(X|Y) \bigr] - 1 \Bigr) }.$$
The following refinement of the generalized Fano’s inequality in Theorem 11 relies on the version of the strong data-processing inequality in Theorem 1.
Theorem 12.
Under the assumptions in Theorem 11, let the convex function $f \colon (0,\infty) \to \mathbb{R}$ be twice differentiable, and assume that there exists a constant $m_f > 0$ such that
$$f''(t) \geq m_f, \quad t \in I(\xi_1, \xi_2),$$
where
$$\xi_1 := M \inf_{(x,y) \in \mathcal{X} \times \mathcal{Y}} P_{X|Y}(x|y), \qquad \xi_2 := M \sup_{(x,y) \in \mathcal{X} \times \mathcal{Y}} P_{X|Y}(x|y),$$
and the interval $I(\cdot, \cdot)$ is defined in (23). Let $u_+ := \max\{u, 0\}$ for $u \in \mathbb{R}$. Then,
(a) 
E D f P X | Y ( · | Y ) U M L M f M ( 1 P L ) L + 1 L M f M P L M L + 1 2 m f M E P X | Y ( X | Y ) 1 P L L P L M L + .
(b) 
If the list decoder selects the L most probable elements from X , given the value of Y Y , then (214) is strengthened to
E D f P X | Y ( · | Y ) U M L M f M ( 1 P L ) L + 1 L M f M P L M L + 1 2 m f M E P X | Y ( X | Y ) 1 P L L ,
where the last term in the right side of (215) is necessarily non-negative.
Proof. 
See Appendix N.  □
An application of Theorem 12 gives the following tightened version of Corollary 2.
Corollary 5.
Under the assumptions in Theorem 11, the following holds:
(a) 
Inequality (193) is strengthened to
H ( X | Y ) log M d P L 1 L M log e 2 E P X | Y ( X | Y ) 1 P L L P L M L + sup ( x , y ) X × Y P X | Y ( x | y ) .
(b) 
If the list decoder selects the L most probable elements from X , given the value of Y Y , then (216) is strengthened to
H ( X | Y ) log M d P L 1 L M log e 2 · E P X | Y ( X | Y ) 1 P L L + sup ( x , y ) X × Y P X | Y ( x | y ) .
Proof. 
The choice $f(t) := t \log t + (1-t) \log e$, for $t > 0$, gives (see (23) and (211)–(213))
$$m_f M = M \inf_{t \in I(\xi_1, \xi_2)} f''(t) = \frac{M \log e}{\xi_2} = \frac{\log e}{\sup_{(x,y) \in \mathcal{X} \times \mathcal{Y}} P_{X|Y}(x|y)}.$$
Substituting (194)–(197) and (218) into (214) and (215) gives, respectively, (216) and (217).  □
Remark 9.
Similarly to the bounds on $P_{\mathcal{L}}$ in (193) and (205), which tensorize when $P_{X|Y}$ is replaced by a product probability measure $P_{X^n|Y^n}(\underline{x}\,|\,\underline{y}) = \prod_{i=1}^{n} P_{X_i|Y_i}(x_i|y_i)$, this is also the case with the new bounds in (216) and (217).
Remark 10.
The ceil operation in the right side of (217) is redundant when $P_{\mathcal{L}}$ denotes the actual list decoding error probability (see (A335)–(A341)). However, for obtaining a lower bound on $P_{\mathcal{L}}$ from (217), the ceil operation assures that the bound is at least as good as the lower bound which relies on the generalized Fano’s inequality in (193).
Example 1.
Let X and Y be discrete random variables taking values in $\mathcal{X} = \{0, 1, \ldots, 8\}$ and $\mathcal{Y} = \{0, 1\}$, respectively, and let $P_{XY}$ be the joint probability mass function, given by
$$\bigl[ P_{XY}(x,y) \bigr]_{(x,y) \in \mathcal{X} \times \mathcal{Y}} = \frac{1}{512} \begin{bmatrix} 128 & 64 & 32 & 16 & 8 & 4 & 2 & 1 & 1 \\ 2 & 2 & 2 & 2 & 8 & 16 & 32 & 64 & 128 \end{bmatrix}^{\mathrm{T}}.$$
Let the list decoder select the L most probable elements of $\mathcal{X}$, given the value of $Y \in \mathcal{Y}$. Table 1 compares the list decoding error probability $P_{\mathcal{L}}$ with the lower bound which relies on the generalized Fano’s inequality in (193), its tightened version in (217), and the closed-form lower bound in (210), for fixed list sizes $L = 1, \ldots, 4$. For L = 3 and L = 4, (217) improves the lower bound in (193) (see Table 1). If L = 4, then the generalized Fano’s lower bound in (193) and also (210) are useless, whereas (217) gives a non-trivial lower bound. This also shows that neither of the new lower bounds in (210) and (217) is superseded by the other.
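For readers who wish to reproduce the closed-form bounds of Corollary 4 on this example, the following sketch evaluates the exact $P_{\mathcal{L}}$ of the most-probable-list decoder together with the lower bounds (209) and (210), for $L = 1, \ldots, 4$ (the Fano-type bound (193) requires a numerical inversion of the binary relative entropy and is omitted here).

```python
import numpy as np

# Joint pmf of Example 1 (see (219)): rows are x = 0..8, columns are y = 0, 1.
P_XY = np.array([[128, 2], [64, 2], [32, 2], [16, 2], [8, 8],
                 [4, 16], [2, 32], [1, 64], [1, 128]]) / 512.0
M = P_XY.shape[0]
P_Y = P_XY.sum(axis=0)
P_XgY = P_XY / P_Y

# E[P_{X|Y}(X|Y)] = sum_y P_Y(y) * sum_x P_{X|Y}(x|y)^2
E_post = np.sum(P_Y * np.sum(P_XgY ** 2, axis=0))

for L in range(1, 5):
    # exact error probability of the decoder listing the L most likely x for each y
    top_mass = np.sort(P_XgY, axis=0)[-L:, :].sum(axis=0)
    P_L = 1.0 - np.sum(P_Y * top_mass)
    # lower bound (209)  (s = 1 in (208))
    lb_209 = 1.0 - L / M - 0.5 * np.sum(P_Y * np.sum(np.abs(P_XgY - 1.0 / M), axis=0))
    # lower bound (210)  (s = 2 in (208))
    lb_210 = 1.0 - L / M - np.sqrt((L / M) * (1.0 - L / M) * (M * E_post - 1.0))
    print(L, P_L, max(lb_209, 0.0), max(lb_210, 0.0))
```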

4.1.2. Variable-Size List Decoding

In the more general setting of list decoding where the size of the list may depend on the channel observation, Fano’s inequality has been generalized as follows.
Proposition 5
(([48], Appendix 3.E) and [53]). Let $P_{XY}$ be a probability measure defined on $\mathcal{X} \times \mathcal{Y}$ with $|\mathcal{X}| = M$. Consider a decision rule $\mathcal{L} \colon \mathcal{Y} \to 2^{\mathcal{X}}$, and let the (average) list decoding error probability be given by $P_{\mathcal{L}} := \mathbb{P}\bigl[ X \notin \mathcal{L}(Y) \bigr]$, with $|\mathcal{L}(y)| \geq 1$ for all $y \in \mathcal{Y}$. Then,
$$H(X|Y) \leq h(P_{\mathcal{L}}) + \mathbb{E}\bigl[ \log |\mathcal{L}(Y)| \bigr] + P_{\mathcal{L}} \log M,$$
where $h \colon [0,1] \to [0, \log 2]$ denotes the binary entropy function. If $|\mathcal{L}(Y)| \leq N$ almost surely, then also
$$H(X|Y) \leq h(P_{\mathcal{L}}) + (1 - P_{\mathcal{L}}) \log N + P_{\mathcal{L}} \log M.$$
By relying on the data-processing inequality for f-divergences, we derive in the following an alternative explicit lower bound on the average list decoding error probability P L . The derivation relies on the E γ divergence (see, e.g., [54]), which forms a subclass of the f-divergences.
Theorem 13.
Under the assumptions in (220), for every $\gamma \geq 1$,
$$P_{\mathcal{L}} \geq \frac{1+\gamma}{2} - \frac{\gamma}{M} \, \mathbb{E}\bigl[ |\mathcal{L}(Y)| \bigr] - \frac{1}{2} \, \mathbb{E}\biggl[ \sum_{x \in \mathcal{X}} \Bigl| P_{X|Y}(x|Y) - \frac{\gamma}{M} \Bigr| \biggr].$$
Let $\gamma \geq 1$, let $|\mathcal{L}(y)| \leq \frac{M}{\gamma}$ for all $y \in \mathcal{Y}$, and suppose that, for every $y \in \mathcal{Y}$, the list decoder selects the $|\mathcal{L}(y)|$ most probable elements in $\mathcal{X}$ given $Y = y$. If $x_\ell(y)$ denotes the $\ell$-th most probable element in $\mathcal{X}$ given $Y = y$, where ties in probabilities are resolved arbitrarily, then (222) holds with equality if
$$P_{X|Y}\bigl( x_\ell(y) \,\big|\, y \bigr) = \begin{cases} \alpha(y), & \ell \in \{1, \ldots, |\mathcal{L}(y)|\}, \\[1mm] \dfrac{1 - \alpha(y)\,|\mathcal{L}(y)|}{M - |\mathcal{L}(y)|}, & \ell \in \{|\mathcal{L}(y)|+1, \ldots, M\}, \end{cases}$$
with $\alpha \colon \mathcal{Y} \to [0,1]$ being an arbitrary function which satisfies
$$\frac{\gamma}{M} \leq \alpha(y) \leq \frac{1}{|\mathcal{L}(y)|}, \quad y \in \mathcal{Y}.$$
Proof. 
See Appendix O.  □
Remark 11.
By setting γ = 1 and | L ( Y ) | = L (i.e., a decoding list of fixed size L), (222) is specialized to (209).
Example 2.
Let X and Y be discrete random variables taking their values in $\mathcal{X} = \{0,1,2,3,4\}$ and $\mathcal{Y} = \{0,1\}$, respectively, and let $P_{XY}$ be their joint probability mass function, which is given by
$$P_{XY}(0,0) = P_{XY}(1,0) = P_{XY}(2,0) = \tfrac{1}{8}, \quad P_{XY}(3,0) = P_{XY}(4,0) = \tfrac{1}{16},$$
$$P_{XY}(0,1) = P_{XY}(1,1) = P_{XY}(2,1) = \tfrac{1}{24}, \quad P_{XY}(3,1) = P_{XY}(4,1) = \tfrac{3}{16}.$$
Let $\mathcal{L}(0) := \{0,1,2\}$ and $\mathcal{L}(1) := \{3,4\}$ be the lists in $\mathcal{X}$, given the value of $Y \in \mathcal{Y}$. We get $P_Y(0) = P_Y(1) = \tfrac12$, so the conditional probability mass function of X given Y satisfies $P_{X|Y}(x|y) = 2 P_{XY}(x,y)$ for all $(x,y) \in \mathcal{X} \times \mathcal{Y}$. It can be verified that, if $\gamma = \tfrac54$, then $\max\bigl\{|\mathcal{L}(0)|, |\mathcal{L}(1)|\bigr\} = 3 \leq \tfrac{M}{\gamma}$, and also (223) and (224) are satisfied (here, $M := |\mathcal{X}| = 5$, $\alpha(0) = \tfrac14 = \tfrac{\gamma}{M}$ and $\alpha(1) = \tfrac38 \in \bigl[\tfrac14, \tfrac12\bigr]$). By Theorem 13, it follows that (222) holds in this case with equality, and the list decoding error probability is equal to $P_{\mathcal{L}} = 1 - \mathbb{E}\bigl[\alpha(Y)\,|\mathcal{L}(Y)|\bigr] = \tfrac14$ (i.e., it coincides with the lower bound in the right side of (222) with $\gamma = \tfrac54$). On the other hand, the generalized Fano’s inequality in (220) gives that $P_{\mathcal{L}} \geq 0.1206$ (the left side of (220) is $H(X|Y) = \tfrac52 \log 2 - \tfrac14 \log 3 = 2.1038$ bits); moreover, by letting $N := \max_{y \in \mathcal{Y}} |\mathcal{L}(y)| = 3$, (221) gives the looser lower bound $P_{\mathcal{L}} \geq 0.0939$. This exemplifies a case where the lower bound in Theorem 13 is tight, whereas the generalized Fano’s inequalities in (220) and (221) are looser.
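The equality claimed in this example can be checked in exact arithmetic; the following sketch evaluates both sides of (222) for the joint pmf in (225) with $\gamma = \frac{5}{4}$ (it assumes the reading of (222) given above).

```python
from fractions import Fraction as F

# Joint pmf of Example 2 (see (225)); P_XY[x][y] for x = 0..4 and y = 0, 1.
P_XY = [[F(1, 8), F(1, 24)], [F(1, 8), F(1, 24)], [F(1, 8), F(1, 24)],
        [F(1, 16), F(3, 16)], [F(1, 16), F(3, 16)]]
lists = {0: [0, 1, 2], 1: [3, 4]}          # L(0) and L(1)
gamma, M = F(5, 4), 5

P_Y = [sum(P_XY[x][y] for x in range(M)) for y in (0, 1)]
P_XgY = [[P_XY[x][y] / P_Y[y] for y in (0, 1)] for x in range(M)]

# exact list decoding error probability
P_L = 1 - sum(P_Y[y] * sum(P_XgY[x][y] for x in lists[y]) for y in (0, 1))

# right side of the bound (222)
E_list = sum(P_Y[y] * len(lists[y]) for y in (0, 1))
E_abs = sum(P_Y[y] * sum(abs(P_XgY[x][y] - gamma / M) for x in range(M)) for y in (0, 1))
rhs = (1 + gamma) / 2 - gamma * E_list / M - E_abs / 2

print(P_L, rhs)                            # both equal 1/4
```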

4.2. A Measure for the Approximation of Equiprobable Distributions by Tunstall Trees

The best possible approximation of an equiprobable distribution that one can obtain by using tree codes was considered in [38]. The optimal solution is obtained by using Tunstall codes, which are variable-to-fixed lossless compression codes (see ([55], Section 11.2.3), [56]). The main idea behind Tunstall codes is parsing the source sequence into variable-length segments of roughly the same probability, and then coding all these segments with codewords of fixed length. This task is done by assigning the leaves of a Tunstall tree, which correspond to segments of source symbols of variable length (according to the depth of the leaves in the tree), to codewords of fixed length; a short construction sketch in code is given below. The following result links Tunstall trees with majorization theory.
Proposition 6
([38] Theorem 1). Let P be the probability measure generated on the leaves by a Tunstall tree $\mathcal{T}$, and let Q be the probability measure generated by an arbitrary tree $\mathcal{S}$ with the same number of leaves as $\mathcal{T}$. Then, $P \prec Q$.
From Proposition 6 and the Schur-convexity of an f-divergence $D_f(\cdot \,\|\, U_n)$ (see ([38], Lemma 1)), it follows that (see ([38], Corollary 1))
$$D_f(P \,\|\, U_n) \leq D_f(Q \,\|\, U_n),$$
where n designates the common number of leaves of the trees $\mathcal{T}$ and $\mathcal{S}$.
Before we proceed, it is worth noting that the strong data-processing inequality in Theorem 6 implies that, if f is also twice differentiable, then (226) can be strengthened to
$$D_f(P \,\|\, U_n) + n \, c_f(n q_{\min}, n q_{\max}) \bigl( \|Q\|_2^2 - \|P\|_2^2 \bigr) \leq D_f(Q \,\|\, U_n),$$
where $q_{\max}$ and $q_{\min}$ denote, respectively, the maximal and minimal positive masses of Q on the n leaves of the tree $\mathcal{S}$, and $c_f(\cdot, \cdot)$ is given in (26).
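For concreteness, here is a minimal sketch of the greedy Tunstall construction described above (repeatedly expand the most probable leaf of the parsing tree); the function name and the example source are illustrative choices, not taken from [38,55,56].

```python
import heapq

def tunstall_leaf_pmf(p, n_leaves):
    """Greedy Tunstall construction for a memoryless source with symbol pmf p:
    start from the root's D children and repeatedly expand the most probable leaf
    until the tree has (about) n_leaves leaves.  Returns the sorted leaf probabilities."""
    D = len(p)
    heap = [(-pi, (i,)) for i, pi in enumerate(p)]     # max-heap via negated probabilities
    heapq.heapify(heap)
    while len(heap) + (D - 1) <= n_leaves:             # each expansion adds D - 1 leaves
        neg_q, word = heapq.heappop(heap)              # most probable leaf
        for i, pi in enumerate(p):
            heapq.heappush(heap, (neg_q * pi, word + (i,)))
    return sorted(-q for q, _ in heap)

# Example: binary source with pmf (0.7, 0.3) and 16 leaves (codewords of 4 bits).
leaves = tunstall_leaf_pmf([0.7, 0.3], 16)
print(len(leaves), float(sum(leaves)))                 # 16 leaves, total mass 1
print(max(leaves) / min(leaves))                       # spread of the leaf probabilities
```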
We next consider a measure which quantifies the quality of the approximation of the probability mass function P, induced by the leaves of a Tunstall tree, by an equiprobable distribution $U_n$ over a set whose cardinality (n) is equal to the number of leaves of the tree. To this end, consider the setup of Bayesian binary hypothesis testing where a random variable X has one of the two probability distributions
$$H_0 \colon X \sim P, \qquad H_1 \colon X \sim U_n,$$
with a-priori probabilities $\mathbb{P}[H_0] = \omega$ and $\mathbb{P}[H_1] = 1 - \omega$ for an arbitrary $\omega \in (0,1)$. The measure being considered here is equal to the difference between the minimum a-priori and minimum a-posteriori error probabilities of the Bayesian binary hypothesis testing model in (228), which is close to zero if the two distributions are sufficiently close.
The difference between the minimum a-priori and minimum a-posteriori error probabilities of a general Bayesian binary hypothesis testing model, with the two arbitrary alternative hypotheses $H_0 \colon X \sim P$ and $H_1 \colon X \sim Q$ and respective a-priori probabilities ω and 1 − ω, is defined to be the order-ω DeGroot statistical information $I_\omega(P, Q)$ [57] (see also ([16], Definition 3)). It can be expressed as an f-divergence:
$$I_\omega(P, Q) = D_{\phi_\omega}(P \,\|\, Q),$$
where ϕ ω : [ 0 , ) R is the convex function with ϕ ω ( 1 ) = 0 , given by (see ([16], (73)))
ϕ ω ( t ) : = min { ω , 1 ω } min { ω , 1 ω t } , t 0 .
The measure considered here for quantifying the closeness of P to the equiprobable distribution $U_n$ is therefore given by
$$d_{\omega, n}(P) := D_{\phi_\omega}(P \,\|\, U_n), \quad \omega \in (0,1),$$
which is bounded in the interval $\bigl[ 0, \min\{\omega, 1-\omega\} \bigr]$.
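Since (231) is, by the preceding paragraph, the gap between the minimum a-priori and minimum a-posteriori error probabilities of the test in (228), it can be evaluated directly from that operational description; the sketch below does so (without going through $\phi_\omega$) and applies it to a hypothetical leaf pmf.

```python
import numpy as np

def d_omega_n(P, omega):
    """Closeness-to-equiprobable measure (231), evaluated as the difference between the
    minimum a-priori and the minimum a-posteriori error probabilities of the binary
    Bayesian test (228) between P (prior omega) and the equiprobable U_n (prior 1-omega)."""
    P = np.asarray(P, dtype=float)
    n = P.size
    apriori = min(omega, 1.0 - omega)
    aposteriori = np.sum(np.minimum(omega * P, (1.0 - omega) / n))
    return apriori - aposteriori

# e.g., on a hypothetical 4-leaf pmf; the Tunstall leaf pmf from the sketch above can be used instead.
print(d_omega_n([0.30, 0.28, 0.22, 0.20], omega=0.5))
```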
The next result partially relies on Theorem 7.
Theorem 14.
The measure in (231) satisfies the following properties:
(a) 
It is the minimum of $D_{\phi_\omega}(Q \,\|\, U_n)$ over all probability measures $Q \in \mathcal{P}_n$ that are induced by an arbitrary tree with n leaves.
(b) 
$$d_{\omega, n}(P) \leq \max_{\beta \in \Gamma_n(\rho)} D_{\phi_\omega}(Q_\beta \,\|\, U_n),$$
with the function $\phi_\omega(\cdot)$ in (230), the interval $\Gamma_n(\rho)$ in (79), the probability mass function $Q_\beta$ in (80), and $\rho := \frac{1}{p_{\min}}$ the reciprocal of the minimal probability of the source symbols.
(c) 
The following bound holds for every $n \in \mathbb{N}$; its right side is the asymptotic limit of the right side of (232) as we let $n \to \infty$:
$$d_{\omega, n}(P) \leq \max_{x \in [0,1]} \biggl\{ x \, \phi_\omega\Bigl( \frac{\rho}{1 + (\rho-1)x} \Bigr) + (1-x) \, \phi_\omega\Bigl( \frac{1}{1 + (\rho-1)x} \Bigr) \biggr\}.$$
(d) 
If $f \colon (0,\infty) \to \mathbb{R}$ is convex and twice differentiable, continuous at zero, and $f(1) = 0$, then
$$D_f(P \,\|\, U_n) = \int_0^1 \frac{d_{\omega, n}(P)}{\omega^3} \, f''\Bigl( \frac{1-\omega}{\omega} \Bigr) \, \mathrm{d}\omega.$$
Proof. 
See Appendix P.1.  □
Remark 12.
The integral representation in (234) provides another justification for quantifying the closeness of P to an equiprobable distribution by the measure in (231).
Figure 7 refers to the upper bound on the closeness-to-equiprobable measure $d_{\omega,n}(P)$ in (233) for Tunstall trees with n leaves. The bound holds for all $n \in \mathbb{N}$, and it is shown as a function of $\omega \in [0,1]$ for several values of $\rho \in [1, \infty]$. In the limit where $\rho \to \infty$, the upper bound is equal to $\min\{\omega, 1-\omega\}$ since the minimum a-posteriori error probability of the Bayesian binary hypothesis testing model in (228) tends to zero. On the other hand, if $\rho = 1$, then the right side of (233) is identically equal to zero (since $\phi_\omega(1) = 0$).
Theorem 14 gives an upper bound on the measure in (231), for the closeness of the probability mass function generated on the leaves by a Tunstall tree to the equiprobable distribution, where this bound is expressed as a function of the minimal probability mass of the source. The following result, which relies on ([33], Theorem 4) and our earlier analysis related to Theorem 7, provides a sufficient condition on the minimal probability mass for asserting the closeness of the compression rate to the Shannon entropy of a stationary and memoryless discrete source.
Theorem 15.
Let P be a probability mass function of a stationary and memoryless discrete source, and let the emitted source symbols be from an alphabet of size $D \geq 2$. Let $\mathcal{C}$ be a Tunstall code which is used for source compression; let m and $\mathcal{X}$ denote, respectively, the fixed length and the alphabet of the codewords of $\mathcal{C}$ (where $|\mathcal{X}| \geq 2$), referring to a Tunstall tree of n leaves with $n \leq |\mathcal{X}|^m < n + (D-1)$. Let $p_{\min}$ be the minimal probability mass of the source symbols, and let
$$d = d(m, \varepsilon) := \begin{cases} \dfrac{m \varepsilon \log_{\mathrm{e}} |\mathcal{X}|}{1+\varepsilon} + \log_{\mathrm{e}}\Bigl( 1 - \dfrac{D-1}{|\mathcal{X}|^m} \Bigr), & \text{if } D > 2, \\[3mm] \dfrac{m \varepsilon \log_{\mathrm{e}} |\mathcal{X}|}{1+\varepsilon}, & \text{if } D = 2, \end{cases}$$
with an arbitrary $\varepsilon > 0$ such that $d > 0$. If
$$p_{\min} \geq \frac{W_0\bigl( -\mathrm{e}^{-d-1} \bigr)}{W_{-1}\bigl( -\mathrm{e}^{-d-1} \bigr)},$$
where $W_0$ and $W_{-1}$ denote, respectively, the principal and the secondary real branches of the Lambert W function [37], then the compression rate of the Tunstall code is larger than the Shannon entropy of the source by a factor which is at most $1 + \varepsilon$.
Proof. 
See Appendix P.2.  □
Remark 13.
The condition in (236) can be replaced by the stronger requirement that
p min 1 1 + 8 d .
Unless d is a small fraction of unity, there is a significant difference between the condition in (236) and the more restrictive condition in (237) (see Figure 8).
Example 3.
Consider a memoryless and stationary binary source, and a binary Tunstall code with codewords of length m = 10, referring to a Tunstall tree with $n = 2^m = 1024$ leaves. Letting $\varepsilon = 0.1$ in Theorem 15, it follows that if the minimal probability mass of the source satisfies $p_{\min} \geq 0.0978$ (see (235), and Figure 8 with $d = \frac{m \varepsilon \log_{\mathrm{e}} 2}{1+\varepsilon} = 0.6301$), then the compression rate of the Tunstall code is at most 10% larger than the Shannon entropy of the source.
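The numbers of Example 3 can be reproduced with the two real branches of the Lambert W function; the sketch below does so (reading the sufficient condition (236) as a threshold on $p_{\min}$ given by the ratio of the two branches evaluated at $-\mathrm{e}^{-d-1}$).

```python
import numpy as np
from scipy.special import lambertw

# Example 3: binary source (D = 2), binary codewords, m = 10, eps = 0.1.
m, eps = 10, 0.1
d = m * eps * np.log(2) / (1 + eps)          # d(m, eps) in (235) for D = 2
arg = -np.exp(-d - 1.0)                      # common argument of the two W branches
threshold = np.real(lambertw(arg, 0)) / np.real(lambertw(arg, -1))
print(d, threshold)                          # d ~ 0.6301, threshold ~ 0.0978
```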

Funding

This research received no external funding.

Acknowledgments

The author wishes to thank the Guest Editor, Amos Lapidoth, and the two anonymous reviewers for an efficient process in reviewing and handling this paper.

Conflicts of Interest

The author declares no conflict of interest.

Appendix A. Proof of Theorem 1

We start by proving Item (a). By our assumptions on Q X and W Y | X ,
P X ( x ) , Q X ( x ) > 0 , x X ,
x X W Y | X ( y | x ) > 0 , y Y ,
y Y W Y | X ( y | x ) = 1 , x X ,
W Y | X ( y | x ) 0 , ( x , y ) X × Y .
From (20), (21), (A1), (A2) and (A4), it follows that
P Y ( y ) = x X P X ( x ) W Y | X ( y | x ) > 0 , y Y ,
Q Y ( y ) = x X Q X ( x ) W Y | X ( y | x ) > 0 , y Y ,
which imply that, for all y Y ,
inf x X P X ( x ) Q X ( x ) P Y ( y ) Q Y ( y ) sup x X P X ( x ) Q X ( x ) .
Since by assumption P X and Q X are supported on X , and P Y and Q Y are supported on Y (see (A5) and (A6)), it follows that the left side inequality in (A7) is strict if the infimum in the left side is equal to 0, and the right side inequality in (A7) is strict if the supremum in the right side is equal to ∞. Hence, due to (18), (19) and (23),
P X ( x ) Q X ( x ) , P Y ( y ) Q Y ( y ) I ( ξ 1 , ξ 2 ) , ( x , y ) X × Y .
Since by assumption f : ( 0 , ) R is convex, it follows that its right derivative f + ( · ) exists, and it is monotonically non-decreasing and finite on ( 0 , ) (see, e.g., ([58], Theorem 1.2) or ([59], Theorem 24.1)). A straightforward generalization of ([60], Theorem 1.1) (see ([60], Remark 1)) gives
D f ( P X Q X ) D f ( P Y Q Y ) = ( x , y ) X × Y Q X ( x ) W Y | X ( y | x ) Δ P X ( x ) Q X ( x ) , P Y ( y ) Q Y ( y )
where
Δ ( u , v ) : = f ( u ) f ( v ) f + ( v ) ( u v ) , u , v > 0 .
In comparison to ([60], Theorem 1.1), the requirement that f is differentiable on ( 0 , ) is relaxed here, and the derivative of f is replaced by its right-side derivative. Note that if f is differentiable, then Δ P X ( x ) Q X ( x ) , P Y ( y ) Q Y ( y ) with Δ ( · , · ) as defined in (A10) is Bregman’s divergence [61]. The following equality, expressed in terms of Lebesgue-Stieltjes integrals, holds by ([16], Theorem 1):
Δ P X ( x ) Q X ( x ) , P Y ( y ) Q Y ( y )
= { 1 s P Y ( y ) Q Y ( y ) , P X ( x ) Q X ( x ) P X ( x ) Q X ( x ) s d f + ( s ) , if P X ( x ) Q X ( x ) P Y ( y ) Q Y ( y ) , 1 s P X ( x ) Q X ( x ) , P Y ( y ) Q Y ( y ) s P X ( x ) Q X ( x ) d f + ( s ) , if P X ( x ) Q X ( x ) < P Y ( y ) Q Y ( y ) .
From (18), (19), (22), (A8) and (A11), if P X ( x ) Q X ( x ) P Y ( y ) Q Y ( y ) , then
Δ P X ( x ) Q X ( x ) , P Y ( y ) Q Y ( y ) 2 c f ( ξ 1 , ξ 2 ) P Y ( y ) Q Y ( y ) P X ( x ) Q X ( x ) P X ( x ) Q X ( x ) s d s = c f ( ξ 1 , ξ 2 ) P X ( x ) Q X ( x ) P Y ( y ) Q Y ( y ) 2 ,
and similarly, if P X ( x ) Q X ( x ) < P Y ( y ) Q Y ( y ) , then
Δ P X ( x ) Q X ( x ) , P Y ( y ) Q Y ( y ) 2 c f ( ξ 1 , ξ 2 ) P X ( x ) Q X ( x ) P Y ( y ) Q Y ( y ) s P X ( x ) Q X ( x ) d s = c f ( ξ 1 , ξ 2 ) P X ( x ) Q X ( x ) P Y ( y ) Q Y ( y ) 2 .
By combining (A9), (A12) and (A13), it follows that
D f ( P X Q X ) D f ( P Y Q Y ) c f ( ξ 1 , ξ 2 ) ( x , y ) X × Y Q X ( x ) W Y | X ( y | x ) P X ( x ) Q X ( x ) P Y ( y ) Q Y ( y ) 2 ,
and an evaluation of the sum in the right side of (A14) gives (see (20), (21) and (A3))
( x , y ) X × Y Q X ( x ) W Y | X ( y | x ) P X ( x ) Q X ( x ) P Y ( y ) Q Y ( y ) 2 = x X P X 2 ( x ) Q X ( x ) y Y W Y | X ( y | x ) = 1 2 y Y P Y ( y ) Q Y ( y ) x X P X ( x ) W Y | X ( y | x ) = P Y ( y ) + y Y P Y 2 ( y ) Q Y 2 ( y ) x X Q X ( x ) W Y | X ( y | x ) = Q Y ( y )
= x X P X 2 ( x ) Q X ( x ) y Y P Y 2 ( y ) Q Y ( y )
= x X P X ( x ) Q X ( x ) 2 Q X ( x ) y Y P Y ( y ) Q Y ( y ) 2 Q Y ( y )
= χ 2 ( P X Q X ) χ 2 ( P Y Q Y ) .
Combining (A14)–(A18) gives (24); (25) is due to the data-processing inequality for f-divergences (applied to the χ 2 -divergence), and the non-negativity of c f ( ξ 1 , ξ 2 ) in (22).
The $\chi^2$-divergence is an f-divergence with $f(t) = (t-1)^2$ for $t \geq 0$. The condition in (22) allows setting $c_f(\xi_1, \xi_2) = 1$ in this case, implying that (24) then holds with equality.
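The equality for $f(t) = (t-1)^2$ can also be seen numerically: the identity in (A14)–(A18) states that the sum in the right side of (A14) equals the difference of the two $\chi^2$-divergences, and the sketch below checks this on randomly generated distributions and a random stochastic matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
nx, ny = 4, 3
P_X, Q_X = rng.dirichlet(np.ones(nx)), rng.dirichlet(np.ones(nx))
W = rng.dirichlet(np.ones(ny), size=nx)      # row x holds W_{Y|X}(.|x)

P_Y, Q_Y = P_X @ W, Q_X @ W
chi2 = lambda P, Q: np.sum((P - Q) ** 2 / Q)

lhs = chi2(P_X, Q_X) - chi2(P_Y, Q_Y)
rhs = np.sum(Q_X[:, None] * W
             * (P_X[:, None] / Q_X[:, None] - P_Y[None, :] / Q_Y[None, :]) ** 2)
print(lhs, rhs)                              # the two quantities agree up to rounding
```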
We next prove Item (b). Let f be twice differentiable on I : = I ( ξ 1 , ξ 2 ) (see (23)), and let ( u , v ) I × I with v > u . Dividing both sides of (22) by v u , and letting v u + , yields c f ( ξ 1 , ξ 2 ) 1 2 f ( u ) . Since this holds for all u I , it follows that c f ( ξ 1 , ξ 2 ) 1 2 inf t I f ( t ) . We next show that c f ( ξ 1 , ξ 2 ) in (26) fulfills the condition in (22), and therefore it is the largest possible value of c f to satisfy (22). By the mean value theorem of Lagrange, for all ( u , v ) I × I with v > u , there exists an intermediate value ξ ( u , v ) such that f ( v ) f ( u ) = f ( ξ ) ( v u ) ; hence, f ( v ) f ( u ) 2 c f ( ξ 1 , ξ 2 ) ( v u ) , so the condition in (22) is indeed fulfilled with c f : = c f ( ξ 1 , ξ 2 ) as given in (26).
We next prove Item (c). Let f : ( 0 , ) R be the dual convex function which is given by f ( t ) : = t f ( 1 t ) for all t > 0 with f ( 1 ) = f ( 1 ) = 0 . Since P X , P Y , Q X and Q Y are supported on X (see (A5) and (A6)), we have
D f ( P X Q X ) = D f ( Q X P X ) ,
D f ( P Y Q Y ) = D f ( Q Y P Y ) ,
ξ 1 : = inf x X Q X ( x ) P X ( x ) = sup x X P X ( x ) Q X ( x ) 1 = 1 ξ 2 ,
ξ 2 : = sup x X Q X ( x ) P X ( x ) = inf x X P X ( x ) Q X ( x ) 1 = 1 ξ 1 .
Consequently, it follows that
D f ( P X Q X ) D f ( P Y Q Y ) = D f ( Q X P X ) D f ( Q Y P Y )
c f ( ξ 1 , ξ 2 ) χ 2 ( Q X P X ) χ 2 ( Q Y P Y )
= c f 1 ξ 2 , 1 ξ 1 χ 2 ( Q X P X ) χ 2 ( Q Y P Y )
where (A23) holds due to (A19) and (A20); (A24) follows from (24) with f, P X and Q X replaced by f , Q X and P X , respectively, which then implies that ξ 1 and ξ 2 in (18) and (19) are, respectively, replaced by ξ 1 and ξ 2 in (A21) and (A22); finally, (A25) holds due to (A21) and (A22). Since by assumption f is twice differentiable on ( 0 , ) , so is f , and
( f ) ( t ) = 1 t 3 f 1 t , t > 0 .
Hence,
c f 1 ξ 2 , 1 ξ 1 = 1 2 inf u I 1 ξ 2 , 1 ξ 1 ( f ) ( u )
= 1 2 inf u I 1 ξ 2 , 1 ξ 1 1 u 3 f 1 u
= 1 2 inf t I ( ξ 1 , ξ 2 ) t 3 f ( t )
where (A27) follows from (24) with f, ξ 1 and ξ 2 replaced by f , 1 ξ 2 and 1 ξ 1 , respectively; (A28) holds due to (A26), and (A29) holds by substituting t = : 1 u . This proves (27) and (30), where (28) is due to the data-processing inequality for f-divergences, and the non-negativity of c f ( · , · ) .
Similarly to the condition for equality in (24), equality in (27) is satisfied if f ( t ) = ( t 1 ) 2 for all t > 0 , or equivalently f ( t ) = t f ( 1 t ) = ( t 1 ) 2 t for all t > 0 . This f-divergence is Neyman’s χ 2 -divergence where D f ( P Q ) : = χ 2 ( Q P ) for all P and Q with c f 1 (due to (30), and since t 3 f ( t ) = 2 for all t > 0 ).
The proof of Item (d) follows the same lines as the proof of Items (a)–(c), by replacing the condition in (22) with a complementary condition of the form
f + ( v ) f + ( u ) 2 e f ( ξ 1 , ξ 2 ) ( v u ) , u , v I ( ξ 1 , ξ 2 ) , u < v .
We finally prove Item (e) by showing that the lower and upper bounds in (24), (27), (32) and (33) are locally tight. More precisely, let { P X ( n ) } be a sequence of probability mass functions defined on X and pointwise converging to Q X which is supported on X , let P Y ( n ) and Q Y be the probability mass functions defined on Y via (20) and (21) with inputs P X ( n ) and Q X , respectively, and let { ξ 1 , n } and { ξ 2 , n } be defined, respectively, by (18) and (19) with P X being replaced by P X ( n ) . By the assumptions in (35) and (36),
lim n ξ 1 , n = lim n inf x X P X ( n ) ( x ) Q X ( x ) = 1 ,
lim n ξ 2 , n = lim n sup x X P X ( n ) ( x ) Q X ( x ) = 1 .
Consequently, if f has a continuous second derivative at unity, then (24), (26), (31), (32), (A31) and (A32) imply that
lim n D f ( P X ( n ) Q X ) D f ( P Y ( n ) Q Y ) χ 2 ( P X ( n ) Q X ) χ 2 ( P Y ( n ) Q Y ) = lim n c f ( ξ 1 , n , ξ 2 , n ) = lim n e f ( ξ 1 , n , ξ 2 , n ) = 1 2 f ( 1 ) ,
and similarly, from (27), (30), (33), (34), (A31) and (A32),
lim n D f ( P X ( n ) Q X ) D f ( P Y ( n ) Q Y ) χ 2 ( Q X P X ( n ) ) χ 2 ( Q Y P Y ( n ) ) = lim n c f 1 ξ 2 , n , 1 ξ 1 , n = lim n e f 1 ξ 2 , n , 1 ξ 1 , n = 1 2 f ( 1 ) ,
which, respectively, prove (37) and (38).

Appendix B. Proof of Theorem 2

We start by proving Item (a). By the assumption that P X i and Q X i are supported on X for all i { 1 , , n } , it follows from (39) that the probability mass functions P X n and Q X n are supported on X n . Consequently, from (41), also R X n ( λ ) is supported on X n for all λ [ 0 , 1 ] . Due to the product forms of Q X n and R X n ( λ ) in (39) and (41), respectively, we get from (47) that
ξ 1 ( n , λ ) = i = 1 n 1 λ + λ inf x X P X i ( x ) Q X i ( x ) = i = 1 n inf x X λ P X i ( x ) + ( 1 λ ) Q X i ( x ) Q X i ( x ) = inf x ̲ X n n i = 1 λ P X i ( x i ) + ( 1 λ ) Q X i ( x i ) n i = 1 Q X i ( x i ) = inf x ̲ X n R X n ( λ ) ( x ̲ ) Q X n ( x ̲ ) ( 0 , 1 ] ,
and likewise, from (48),
ξ 2 ( n , λ ) = sup x ̲ X n R X n ( λ ) ( x ̲ ) Q X n ( x ̲ ) [ 1 , )
for all λ [ 0 , 1 ] . In view of (24), (26), (A35) and (A36), replacing ( P X , P Y , Q X , Q Y , ξ 1 , ξ 2 ) in (24) and (26) with ( R X n ( λ ) , R Y n ( λ ) , Q X n , Q Y n , ξ 1 ( n , λ ) , ξ 2 ( n , λ ) ) , we obtain that, for all λ [ 0 , 1 ] ,
D f ( R X n ( λ ) Q X n ) D f ( R Y n ( λ ) Q Y n ) c f ξ 1 ( n , λ ) , ξ 2 ( n , λ ) χ 2 ( R X n ( λ ) Q X n ) χ 2 ( R Y n ( λ ) Q Y n ) .
Due to the setting in (39)–(44), for all y ̲ Y n and λ [ 0 , 1 ] ,
R Y n ( λ ) ( y ̲ ) = x ̲ X n R X n ( λ ) ( x ̲ ) W Y n | X n ( y ̲ | x ̲ ) = x ̲ X n i = 1 n λ P X i ( x i ) + ( 1 λ ) Q X i ( x i ) i = 1 n W Y i | X i ( y i | x i ) = i = 1 n x i X λ P X i ( x i ) + ( 1 λ ) Q X i ( x i ) W Y i | X i ( y i | x i ) = i = 1 n λ x X P X i ( x ) W Y i | X i ( y i | x ) + ( 1 λ ) x X Q X i ( x ) W Y i | X i ( y i | x ) = i = 1 n λ P Y i ( y i ) + ( 1 λ ) Q Y i ( y i ) = i = 1 n R Y i ( λ ) ( y i )
with
R Y i ( λ ) ( y ) : = λ P Y i ( y ) + ( 1 λ ) Q Y i ( y ) , i { 1 , , n } , y Y , λ [ 0 , 1 ] ,
and R Y i ( λ ) is the probability mass function at the channel output at time instant i. In particular, setting λ = 0 in (A38) gives
Q Y n ( y ̲ ) = i = 1 n Q Y i ( y i ) , y ̲ Y n .
Due to the tensorization property of the χ 2 divergence, and since R X n ( λ ) , R Y n ( λ ) , Q X n and Q Y n are product probability measures (see (39), (41), (A38) and (A40)), it follows that
χ 2 ( R X n ( λ ) Q X n ) = i = 1 n 1 + χ 2 ( R X i ( λ ) Q X i ) 1 ,
and
χ 2 ( R Y n ( λ ) Q Y n ) = i = 1 n 1 + χ 2 ( R Y i ( λ ) Q Y i ) 1 .
Substituting (A41) and (A42) into the right side of (A37) gives that, for all λ [ 0 , 1 ] ,
D f ( R X n ( λ ) Q X n ) D f ( R Y n ( λ ) Q Y n ) c f ξ 1 ( n , λ ) , ξ 2 ( n , λ ) i = 1 n 1 + χ 2 ( R X i ( λ ) Q X i ) i = 1 n 1 + χ 2 ( R Y i ( λ ) Q Y i ) .
Due to (41) and (A39), since
R X i ( λ ) = λ P X i + ( 1 λ ) Q X i ,
R Y i ( λ ) = λ P Y i + ( 1 λ ) Q Y i ,
and (see ([45], Lemma 5))
χ 2 ( λ P + ( 1 λ ) Q Q ) = λ 2 χ 2 ( P Q ) , λ [ 0 , 1 ]
for every pair of probability measures ( P , Q ) , it follows that
χ 2 ( R X i ( λ ) Q X i ) = λ 2 χ 2 ( P X i Q X i ) ,
χ 2 ( R Y i ( λ ) Q Y i ) = λ 2 χ 2 ( P Y i Q Y i ) .
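Both ingredients used here, the scaling identity (A46) and the tensorization of the $\chi^2$-divergence in (A41)–(A42), are easy to verify numerically; the following sketch does so for randomly generated distributions.

```python
import numpy as np

rng = np.random.default_rng(1)
chi2 = lambda P, Q: np.sum((P - Q) ** 2 / Q)

P1, Q1 = rng.dirichlet(np.ones(3)), rng.dirichlet(np.ones(3))
P2, Q2 = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))
lam = 0.3

# (A46):  chi^2( lam*P + (1-lam)*Q || Q ) = lam^2 * chi^2(P || Q)
print(chi2(lam * P1 + (1 - lam) * Q1, Q1), lam ** 2 * chi2(P1, Q1))

# (A41)-(A42):  1 + chi^2(P1 x P2 || Q1 x Q2) = (1 + chi^2(P1||Q1)) * (1 + chi^2(P2||Q2))
P12, Q12 = np.outer(P1, P2).ravel(), np.outer(Q1, Q2).ravel()
print(1 + chi2(P12, Q12), (1 + chi2(P1, Q1)) * (1 + chi2(P2, Q2)))
```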
Substituting (A47) and (A48) into the right side of (A43) gives (45). For proving the looser bound (46) from (45), and also for later proving the result in Item (c), we rely on the following lemma.
Lemma A1.
Let { a i } i = 1 n and { b i } i = 1 n be non-negative with a i b i for all i { 1 , , n } . Then,
(a) 
For all u 0 ,
i = 1 n ( 1 + a i u ) i = 1 n ( 1 + b i u ) i = 1 n ( a i b i ) u .
(b) 
If a i > b i for at least one index i, then
i = 1 n ( 1 + a i u ) i = 1 n ( 1 + b i u ) = i = 1 n ( a i b i ) u + O ( u 2 ) .
Proof. 
Let g : [ 0 , ) R be defined as
g ( u ) : = i = 1 n ( 1 + a i u ) i = 1 n ( 1 + b i u ) , u 0 .
We have g ( 0 ) = 0 , and the first two derivatives of g are given by
g ( u ) = i = 1 n a i j i ( 1 + a j u ) b i j i ( 1 + b j u ) ,
and
g ( u ) = i = 1 n j i a i a j k i , j ( 1 + a k u ) b i b j k i , j ( 1 + b k u ) .
Since by assumption a i b i 0 for all i, it follows from (A53) that g ( u ) 0 for all u 0 , which asserts the convexity of g on [ 0 , ) . Hence, for all u 0 ,
g ( u ) g ( 0 ) + g ( 0 ) u = i = 1 n ( b i a i ) u
where the right-side equality in (A54) is due to (A51) and (A52). This gives (A49).
We next prove Item (b) of Lemma A1. By the Taylor series expansion of the polynomial function g, we get
g ( u ) = g ( 0 ) + g ( 0 ) u + 1 2 g ( 0 ) u 2 + = i = 1 n ( b i a i ) u + 1 2 i = 1 n j i ( a i a j b i b j ) u 2 +
for all u 0 . Since by assumption a i b i 0 for all i, and there exists an index i { 1 , , n } such that a i > b i , it follows that the coefficient of u 2 in the right side of (A55) is positive. This yields (A50).  □
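A quick numerical sanity check of Item (a) of Lemma A1, with randomly drawn coefficients, is given below.

```python
import numpy as np

rng = np.random.default_rng(2)
b = rng.uniform(0.0, 1.0, size=5)
a = b + rng.uniform(0.0, 1.0, size=5)        # ensures a_i >= b_i >= 0

for u in np.linspace(0.0, 3.0, 7):
    lhs = np.prod(1 + a * u) - np.prod(1 + b * u)
    rhs = np.sum(a - b) * u
    assert lhs >= rhs - 1e-12                # inequality (A49)
print("Lemma A1(a) holds on this sample")
```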
We obtain here (46) from (45) and Item (a) of Lemma A1. To that end, for i { 1 , , n } , let
a i : = χ 2 ( P X i Q X i ) , b i : = χ 2 ( P Y i Q Y i ) , u : = λ 2
with u [ 0 , 1 ] for every λ [ 0 , 1 ] . Since by (39), (40), (43) and (44),
P X i W Y i | X i P Y i ,
Q X i W Y i | X i Q Y i ,
it follows from the data-processing inequality for f-divergences, and their non-negativity, that
a i b i 0 , i { 1 , , n } ,
which yields (46) from (45), (A49), (A56) and (A59).
We next prove Item (b) of Theorem 2. Similarly to the proof of (A37), we get from (32) (rather than (24)) that
D f ( R X n ( λ ) Q X n ) D f ( R Y n ( λ ) Q Y n ) e f ξ 1 ( n , λ ) , ξ 2 ( n , λ ) χ 2 ( R X n ( λ ) Q X n ) χ 2 ( R Y n ( λ ) Q Y n ) .
Combining (A41), (A42), (A47), (A48) and (A60) gives (49).
We finally prove Item (c) of Theorem 2. In view of (47) and (48), and by the assumption that sup x X P X i ( x ) Q X i ( x ) < for all i { 1 , , n } , we get
lim λ 0 + ξ 1 ( n , λ ) = 1 ,
lim λ 0 + ξ 2 ( n , λ ) = 1 .
Since, by assumption f has a continuous second derivative at unity, (26), (31), (A61) and (A62) imply that
lim λ 0 + c f ξ 1 ( n , λ ) , ξ 2 ( n , λ ) = 1 2 f ( 1 ) ,
lim λ 0 + e f ξ 1 ( n , λ ) , ξ 2 ( n , λ ) = 1 2 f ( 1 ) .
From (A56), (A59), and Item (b) of Lemma A1, it follows that
lim λ 0 + 1 λ 2 i = 1 n 1 + λ 2 χ 2 ( P X i Q X i ) i = 1 n 1 + λ 2 χ 2 ( P Y i Q Y i ) = i = 1 n χ 2 ( P X i Q X i ) χ 2 ( P Y i Q Y i ) .
The result in (50) finally follows from (45), (49) and (A63)–(A65). This indeed shows that the lower bounds in the right sides of (45) and (46), and the upper bound in the right side of (49), yield a tight result as we let $\lambda \to 0^+$, leading to the limit in the right side of (50).

Appendix C. Proof of Theorems 3 and 4

Appendix C.1. Proof of Theorem 3

We first obtain a lower bound on D f ( P X Q X ) , and then obtain an upper bound on D f ( P Y Q Y ) .
D f ( P X Q X ) = x X Q X ( x ) f P X ( x ) Q X ( x )
= x X Q X ( x ) P X ( x ) Q X ( x ) g P X ( x ) Q X ( x ) + f ( 0 )
= f ( 0 ) + x X P X ( x ) g P X ( x ) Q X ( x )
f ( 0 ) + g x X P X 2 ( x ) Q X ( x )
= f ( 0 ) + g 1 + χ 2 ( P X Q X )
f ( 0 ) + g ( 1 ) + g ( 1 ) χ 2 ( P X Q X )
= g ( 1 ) χ 2 ( P X Q X )
= f ( 1 ) + f ( 0 ) χ 2 ( P X Q X ) ,
where (A67) holds by the definition of g in Theorem 3 with the assumption that f ( 0 ) < ; (A69) is due to Jensen’s inequality and the convexity of g; (A70) holds by the definition of the χ 2 -divergence; (A71) holds due to the convexity of g, and its differentiability at 1 (due to the differentiability of f at 1); (A72) holds since f ( 0 ) + g ( 1 ) = f ( 1 ) = 0 ; finally, (A73) holds since f ( 1 ) = 0 implies that g ( 1 ) = f ( 1 ) + f ( 0 ) .
By ([62], Theorem 5), it follows that
D f ( P Y Q Y ) κ ( ξ 1 , ξ 2 ) χ 2 ( P Y Q Y ) ,
where κ ( ξ 1 , ξ 2 ) is given in (51). Combining (A66)–(A74) yields (52). Taking suprema on both sides of (52), with respect to all probability mass functions P X with P X Q X and P X Q X , gives (53) since by the definition of κ ( ξ 1 , ξ 2 ) in (51), it is monotonically decreasing in ξ 1 [ 0 , 1 ) and monotonically increasing in ξ 2 ( 1 , ] , while (18) and (19) yield
ξ 1 0 , ξ 2 1 min x X Q X ( x ) .
Remark A1.
The derivation in (A66)(A73) is conceptually similar to the proof of ([24], Lemma A.2). However, the function g here is convex, and our derivation involves the χ 2 -divergence.
Remark A2.
The proof of ([26], Theorem 8) (see Proposition 3 in Section 1.1 here) relies on ([24], Lemma A.2), where the function g is required to be concave in [24,26]. This leads, in the proof of ([26], Theorem 8), to an upper bound on D f ( P Y Q Y ) . One difference in the derivation of Theorem 3 is that our requirement on the convexity of g leads to a lower bound on D f ( P X Q X ) , instead of an upper bound on D f ( P Y Q Y ) . Another difference between the proofs of Theorem 3 and ([26], Theorem 8) is that we apply here the result in ([62], Theorem 5) to obtain an upper bound on D f ( P Y Q Y ) , whereas the proof of ([26], Theorem 8) relies on a Pinsker-type inequality (see ([63], Theorem 3)) to obtain a lower bound on D f ( P X Q X ) ; the latter lower bound relies on the condition on f in (16), which is not necessary for the derivation of the bound in Theorem 3.
Remark A3.
From ([62], Theorem 1 (b)), it follows that
sup P Q D f ( P Q ) χ 2 ( P Q ) = κ ( ξ 1 , ξ 2 ) ,
with κ ( ξ 1 , ξ 2 ) in the right side of (A76) as given in (51), and the supremum in the left side of (A76) is taken over all probability measures P and Q such that P Q . In view of ([62], Theorem 1 (b)), the equality in (A76) holds since the functions f ˜ , g ˜ : ( 0 , ) R , defined as f ˜ ( t ) : = f ( t ) + f ( 1 ) ( 1 t ) and g ˜ ( t ) : = ( t 1 ) 2 for all t > 0 , satisfy D f ˜ ( P Q ) = D f ( P Q ) and D g ˜ ( P Q ) = χ 2 ( P Q ) for all probability measures P and Q, and since f ˜ ( 1 ) = g ˜ ( 1 ) = 0 while the function g ˜ is also strictly positive on ( 0 , 1 ) ( 1 , ) . Furthermore, from the proof of ([62], Theorem 1 (b)), restricting P and Q to be probability mass functions which are defined over a binary alphabet, the ratio D f ( P Q ) χ 2 ( P Q ) can be made arbitrarily close to the supremum in the left side of (A76); such probability measures can be obtained as the output distributions P Y and Q Y of an arbitrary non-degenerate stochastic transformation W Y | X : X Y , with | Y | = 2 , by a suitable selection of probability input distributions P X and Q X , respectively (see (A5) and (A6)). In the latter case where | Y | = 2 , this shows the optimality of the non-negative constant κ ( ξ 1 , ξ 2 ) in the right side of (A74).

Appendix C.2. Proof of Theorem 4

Combining (A66)–(A73) gives that, for all λ [ 0 , 1 ] ,
D f R X n ( λ ) Q X n ( λ ) f ( 1 ) + f ( 0 ) χ 2 R X n ( λ ) Q X n ,
and from (A74)
D f R Y n ( λ ) Q Y n κ ξ 1 ( n , λ ) , ξ 2 ( n , λ ) χ 2 R Y n ( λ ) Q Y n .
From (A41) and (A47),
χ 2 R X n ( λ ) Q X n = i = 1 n 1 + λ 2 χ 2 ( P X i Q X i ) 1 ,
and similarly, from (A42) and (A48),
χ 2 R Y n ( λ ) Q Y n = i = 1 n 1 + λ 2 χ 2 ( P Y i Q Y i ) 1 .
Combining (A77)–(A80) yields (54).

Appendix D. Proof of Theorem 5

The function f α : [ 0 , ) R in (55) satisfies f α ( 1 ) = 0 , and for all α e 3 2
f α ( t ) = 2 log ( α + t ) + 3 log e > 0 , t > 0 ,
which yields the convexity of f α ( · ) on [ 0 , ) . This justifies the definition of the f-divergence
D f α ( P Q ) : = x X Q ( x ) f α P ( x ) Q ( x )
for probability mass functions P and Q, which are defined on a finite or countably infinite set X , with Q supported on X . In the general alphabet setting, sums and probability mass functions are, respectively, replaced by Lebesgue integrals and Radon-Nikodym derivatives. Differentiation of both sides of (A82) with respect to α gives
α D f α ( P Q ) = x X Q ( x ) r α P ( x ) Q ( x )
where
r α ( t ) : = f α ( t ) α
= 2 ( α + t ) log ( α + t ) 2 ( α + 1 ) log ( α + 1 ) + ( t 1 ) log e , t > 0 .
The function r α : ( 0 , ) R is convex since
r α ( t ) = 2 log e α + t > 0 , t > 0 ,
and r α ( 1 ) = 0 . Hence, D r α ( · · ) is an f-divergence, and it follows from (A83)–(A85) that
α D f α ( P Q ) = D r α ( P Q )
= 2 x X α Q ( x ) + P ( x ) log α + P ( x ) Q ( x ) 2 ( α + 1 ) log ( α + 1 )
= 2 ( α + 1 ) x X α Q ( x ) + P ( x ) α + 1 log α Q ( x ) + P ( x ) ( α + 1 ) Q ( x )
= 2 ( α + 1 ) D α Q + P α + 1 Q 0 ,
which gives (56), so D f α ( · · ) is monotonically increasing in α . Double differentiation of both sides of (A82) with respect to α gives
2 α 2 D f α ( P Q ) = x X Q ( x ) v α P ( x ) Q ( x )
where
v α ( t ) : = 2 f α ( t ) α 2
= 2 log ( α + t ) 2 log ( α + 1 ) , t > 0 .
The function v α : ( 0 , ) R is concave, and v α ( 1 ) = 0 . By referring to the f-divergence D v α ( · · ) , it follows from (A91)–(A93) that
2 α 2 D f α ( P Q ) = D v α ( P Q )
= 2 x X Q ( x ) log ( α + 1 ) log α + P ( x ) Q ( x )
= 2 x X Q ( x ) log ( α + 1 ) Q ( x ) α Q ( x ) + P ( x )
= 2 D Q α Q + P α + 1 0 ,
which gives (57), so D f α ( · · ) is concave in α for α e 3 2 . Differentiation of both sides of (A93) with respect to α gives that
3 f α ( t ) α 3 = 2 1 α + t 1 α + 1 log e ,
which implies that
3 α 3 D f α ( P Q ) = 2 log e x X Q ( x ) 1 α + P ( x ) Q ( x ) 1 α + 1
= 2 log e α + 1 x X Q 2 ( x ) α Q ( x ) + P ( x ) α + 1 1
= 2 log e α + 1 · χ 2 Q α Q + P α + 1 0 .
This gives (58), and it completes the proof of Item (a).
We next prove Item (b). From Item (a), the result in (59) holds for n = 1 , 2 , 3 . We provide in the following a proof of (59) for all n 3 . In view of (A98), it can be verified that for n 3 ,
n f α ( t ) α n = 2 ( 1 ) n 1 ( n 3 ) ! 1 ( α + t ) n 2 1 ( α + 1 ) n 2 log e ,
which, from (A82), implies that
( 1 ) n 1 n α n D f α ( P Q ) = x X Q ( x ) g α , n P ( x ) Q ( x )
with
g α , n ( t ) : = ( 1 ) n 1 n f α ( t ) α n
= 2 ( n 3 ) ! 1 ( α + t ) n 2 1 ( α + 1 ) n 2 log e , t > 0 .
The function g α , n : ( 0 , ) R is convex for n 3 , with g α , n ( 1 ) = 0 . By referring to the f-divergence D g α , n ( · · ) , its non-negativity and (A103) imply that for all n 3
( 1 ) n 1 n α n D f α ( P Q ) = D g α , n ( P Q ) 0 .
Furthermore, we get the following explicit formula for n-th partial derivative of D f α ( P Q ) with respect to α for n 3 :
n α n D f α ( P Q ) = ( 1 ) n 1 x X Q ( x ) g α , n P ( x ) Q ( x )
= 2 ( 1 ) n 1 ( n 3 ) ! log e ( α + 1 ) n 2 x X Q ( x ) α + 1 α + P ( x ) Q ( x ) n 2 1
= 2 ( 1 ) n 1 ( n 3 ) ! log e ( α + 1 ) n 2 x X Q n 1 ( x ) α Q ( x ) + P ( x ) α + 1 n 2 1
= 2 ( 1 ) n 1 ( n 3 ) ! log e ( α + 1 ) n 2 exp ( n 2 ) D n 1 Q α Q + P α + 1 1
where (A107) holds due to (A103); (A108) follows from (A104), and (A110) is satisfied by the definition of the Rényi divergence [40] which is given by
D β ( P Q ) : = 1 β 1 log x X P β ( x ) Q 1 β ( x ) , β ( 0 , 1 ) ( 1 , )
with D 1 ( P Q ) : = D ( P Q ) by continuous extension of D β ( · · ) at β = 1 . For n = 3 , the right side of (A110) is simplified to the right side of (58); this holds due to the identity
D 2 ( P Q ) = log 1 + χ 2 ( P Q ) .
To prove Item (c), from (55), for all t 0
f α ( t ) = 2 ( α + t ) log ( α + t ) + ( α + t ) log e ,
f α ( t ) = 2 log ( α + t ) + 3 log e ,
f α ( 3 ) ( t ) = 2 log e α + t ,
which implies by a Taylor series expansion of f α ( · ) that
f α ( t ) = f α ( 1 ) + f α ( 1 ) ( t 1 ) + 1 2 f α ( 1 ) ( t 1 ) 2 + 1 6 f α ( 3 ) ( ξ ) ( t 1 ) 3 , t 0
where ξ in the right side of (A116) is an intermediate value between 1 and t. Hence, for t 0 ,
f α ( t ) f α ( 1 ) ( t 1 ) + 1 2 f α ( 1 ) ( t 1 ) 2 + 1 6 f α ( 3 ) ( 0 ) ( t 1 ) 3 1 { t [ 0 , 1 ] }
f α ( 1 ) ( t 1 ) + 1 2 f α ( 1 ) 1 6 f α ( 3 ) ( 0 ) ( t 1 ) 2
= f α ( 1 ) ( t 1 ) + k ( α ) ( t 1 ) 2
where (A117) follows from (A116) since f α ( 1 ) = 0 and f α ( 3 ) ( · ) is monotonically decreasing and positive (see (A115)); 1 { t [ 0 , 1 ] } in the right side of (A117) denotes the indicator function which is equal to 1 if the relation t [ 0 , 1 ] holds, and it is otherwise equal to zero; (A118) holds since ( t 1 ) 3 1 { t [ 0 , 1 ] } ( t 1 ) 2 for all t 0 , and f α ( 3 ) ( 0 ) > 0 ; finally, (A119) follows by substituting (A114) and (A115) into the right side of (A118), which gives the equality
1 2 f α ( 1 ) 1 6 f α ( 3 ) ( 0 ) = k ( α )
with k ( · ) as defined in (63). Since the first term in the right side of (A119) does not affect an f-divergence (as it is equal to c ( t 1 ) for t 0 and some constant c), and for an arbitrary positive constant k > 0 and g ( t ) : = ( t 1 ) 2 for t 0 , we get D k g ( P Q ) = k χ 2 ( P Q ) , inequality (61) follows from (A117) and (A119). To that end, note that k = k ( α ) defined in (63) is monotonically increasing in α , and therefore k ( α ) k ( e 3 2 ) > 0.2075 for all α e 3 2 . Due to the inequality (see, e.g., ([64], Theorem 5), followed by refined versions in ([62], Theorem 20) and ([65], Theorem 9))
D ( P Q ) log 1 + χ 2 ( P Q ) ,
the looser lower bound on D f α ( P Q ) in the right side of (62), expressed as a function of the relative entropy D ( P Q ) , follows from (61). Hence, if P and Q are not identical, then (64) follows from (61) since χ 2 ( P Q ) > 0 and lim α k ( α ) = .
We next prove Item (d). The Taylor series expansion of f α ( · ) implies that, for all t 0 ,
f α ( t ) = f α ( 1 ) + f α ( 1 ) ( t 1 ) + 1 2 f α ( 1 ) ( t 1 ) 2 + 1 6 f α ( 3 ) ( 1 ) ( t 1 ) 3 + 1 24 f α ( 4 ) ( ξ ) ( t 1 ) 4
where ξ in the right side of (A122) is an intermediate value between 1 and t. Consequently, since f α ( 4 ) ( ξ ) = 2 log e ( α + ξ ) 2 < 0 and f α ( 1 ) = 0 , it follows from (A122) that, for all t 0 ,
f α ( t ) f α ( 1 ) ( t 1 ) + 1 2 f α ( 1 ) ( t 1 ) 2 + 1 6 f α ( 3 ) ( 1 ) ( t 1 ) 3
= f α ( 1 ) ( t 1 ) + 1 2 f α ( 1 ) ( t 1 ) 2 + 1 6 f α ( 3 ) ( 1 ) [ t 3 3 ( t 1 ) 2 3 ( t 1 ) 1 ]
= f α ( 1 ) 1 2 f α ( 3 ) ( 1 ) ( t 1 ) + 1 2 f α ( 1 ) f α ( 3 ) ( 1 ) ( t 1 ) 2 + 1 6 f α ( 3 ) ( 1 ) ( t 3 1 ) .
Based on (A123)–(A125), it follows that
D f α ( P Q ) 1 2 f α ( 1 ) f α ( 3 ) ( 1 ) χ 2 ( P Q ) + 1 6 f α ( 3 ) ( 1 ) x X Q ( x ) P ( x ) Q ( x ) 3 1
= 1 2 f α ( 1 ) f α ( 3 ) ( 1 ) χ 2 ( P Q ) + 1 6 f α ( 3 ) ( 1 ) 1 + x X P 3 ( x ) Q 2 ( x )
= 1 2 f α ( 1 ) f α ( 3 ) ( 1 ) χ 2 ( P Q ) + 1 6 f α ( 3 ) ( 1 ) exp 2 D 3 ( P Q ) 1 ,
where (A127) holds due to (A111) (with β = 3 ). Substituting (A114) and (A115) into the right side of (A127) gives (65).
We next prove Item (e). Let P and Q be probability mass functions such that D 3 ( P Q ) < , and let ε > 0 be arbitrarily small. Since the Rényi divergence D α ( P Q ) is monotonically non-decreasing in α > 0 (see ([66], Theorem 3)), it follows that D 2 ( P Q ) < , and therefore also
χ 2 ( P Q ) = exp D 2 ( P Q ) 1 < .
In view of (61), there exists α 1 : = α 1 ( P , Q , ε ) such that for all α > α 1
D f α ( P Q ) > log ( α + 1 ) + 3 2 log e χ 2 ( P Q ) ε ,
and, from (65), there exists α 2 : = α 2 ( P , Q , ε ) such that for all α > α 2
D f α ( P Q ) < log ( α + 1 ) + 3 2 log e χ 2 ( P Q ) + ε .
Letting α : = max { α 1 , α 2 } gives the result in (66) for all α > α .
Item (f) of Theorem 5 is a direct consequence of ([45], Lemma 4), which relies on ([67], Theorem 3). Let g ( t ) : = ( t 1 ) 2 for t 0 (hence, D g ( · · ) is the χ 2 divergence). If a sequence { P n } converges to a probability measure Q in the sense that the condition in (67) is satisfied, and P n Q for all sufficiently large n, then ([45], Lemma 4) yields
lim n D f α ( P n Q ) χ 2 ( P n Q ) = 1 2 f α ( 1 ) ,
which gives (68) from (A114) and (A131).
We next prove Item (g). Inequality (69) is trivial. Inequality (70) is obtained as follows:
D f α ( P Q ) D f β ( P Q ) = β α u D f u ( P Q ) d u
= β α 2 ( u + 1 ) D u Q + P u + 1 Q d u
β α 2 ( u + 1 ) d u · D α Q + P α + 1 Q
= ( α + 1 ) 2 ( β + 1 ) 2 D α Q + P α + 1 Q
= ( α β ) ( α + β + 2 ) D α Q + P α + 1 Q
where (A133) follows from (56), and (A134) holds since the function I : [ 0 , ) [ 0 , ) given by
I ( u ) : = D u Q + P u + 1 Q , u 0
is monotonically decreasing in u (note that by increasing the value of the non-negative variable u, the probability mass function u Q + P u + 1 gets closer to Q). This gives (70).
For proving inequality (71), we obtain two upper bounds on D f α ( P Q ) D f β ( P Q ) with α > β e 3 2 . For the derivation of the first bound, we rely on (A83). From (A84) and (A85),
r α ( t ) = 2 t log t s α ( t ) , t 0
where s α : ( 0 , ) R is given by
s α ( t ) : = 2 t log t 2 ( α + t ) log ( α + t ) + ( 1 t ) log e + 2 ( α + 1 ) log ( α + 1 ) , t 0 ,
with the convention that 0 log 0 = 0 (by a continuous extension of t log t at t = 0 ). Since s α ( 1 ) = 0 , and
s α ( t ) = 2 α t ( α + t ) > 0 , t > 0 ,
which implies that s α ( · ) is convex on ( 0 , ) , we get
α D f α ( P Q ) = D r α ( P Q )
= 2 D ( P Q ) D s α ( P Q )
2 D ( P Q )
where (A141) holds due to (A83) (recall the convexity of r α : ( 0 , ) R with r α ( 1 ) = 0 ); (A142) holds due to (A138) and since r ( t ) : = t log t for t > 0 implies that D r ( P Q ) = D ( P Q ) ; finally, (A143) follows from the non-negativity of the f-divergence D s α ( · · ) . Consequently, integration over the interval [ β , α ] ( α > β ) on the left side of (A141) and the right side of (A143) gives
D f α ( P Q ) D f β ( P Q ) 2 ( α β ) D ( P Q ) .
Note that the same reasoning of (A132)–(A136) also implies that
D f α ( P Q ) D f β ( P Q ) ( α β ) ( α + β + 2 ) D β Q + P β + 1 Q ,
which gives a second upper bound on the left side of (A145). Taking the minimal value among the two upper bounds in the right sides of (A144) and (A145) gives (71) (see Remark A4).
We finally prove Item (h). From (55) and (A81), the function f α : [ 0 , ) R is convex for α e 3 2 with f α ( 1 ) = 0 , f α ( 0 ) = α 2 log α ( α + 1 ) 2 log ( α + 1 ) R , and it is also differentiable at 1. It is left to prove that the function g α : ( 0 , ) R , defined as g α ( t ) : = f α ( t ) f α ( 0 ) t for t > 0 , is convex. From (55), the function g α is given explicitly by
g α ( t ) = ( α + t ) 2 log ( α + t ) α 2 log α t , t > 0 ,
and its second derivative is given by
g α ( t ) = w α ( t ) t 3 , t > 0 ,
with
w α ( t ) : = 2 α 2 log 1 + t α + t ( t 2 α ) log e , t 0 .
Since w α ( 0 ) = 0 , and
w α ( t ) = 2 t 2 log e α + t > 0 , t > 0 ,
it follows that w α ( t ) > 0 for all t > 0 ; hence, from (A147), g α ( t ) > 0 for t ( 0 , ) , which yields the convexity of the function g α ( · ) on ( 0 , ) for all α 0 . This shows that, for every α e 3 2 , the function f α : [ 0 , ) R satisfies all the required conditions in Theorems 3 and 4. We proceed to calculate the function κ α : [ 0 , 1 ) × ( 1 , ) R in (51), which corresponds to f : = f α , i.e., (see (72)),
κ α ( ξ 1 , ξ 2 ) = sup t ( ξ 1 , 1 ) ( 1 , ξ 2 ) z α ( t ) ,
with
z α ( t ) : = { f α ( t ) + f α ( 1 ) ( 1 t ) ( t 1 ) 2 , t [ 0 , 1 ) ( 1 , ) , 3 2 log e + log ( α + 1 ) , t = 1 ,
where the definition of z α ( 1 ) is obtained by continuous extension of the function z α ( · ) at t = 1 (recall that the function f α ( · ) is given in (55)). Differentiation shows that
z α ( t ) t = v α ( t ) ( t 1 ) 4 , t [ 0 , 1 ) ( 1 , ) ,
where, for t 0 ,
v α ( t ) : = ( 2 α + t + 1 ) ( t 1 ) 2 log e 2 ( α + 1 ) ( α + t ) ( t 1 ) log α + t α + 1 ,
and
v α ( t ) = ( t 1 ) 2 log e + 2 ( α + t ) ( t 1 ) log e 2 ( α + 1 ) ( 2 t + α 1 ) log α + t α + 1 ,
v α ( t ) = 6 ( t 1 ) log e + 2 ( α + 1 ) 2 log e α + t 4 ( α + 1 ) log α + t α + 1 ,
v α ( 3 ) ( t ) = 2 ( t 1 ) ( 3 t + 4 α + 1 ) ( α + t ) 2 .
From (A156), it follows that v α ( 3 ) ( t ) < 0 if t [ 0 , 1 ) , v α ( 3 ) ( 1 ) = 0 , and v α ( 3 ) ( t ) > 0 if t ( 1 , ) . Since v α ( · ) is therefore monotonically decreasing on [ 0 , 1 ] and it is monotonically increasing on [ 1 , ) , (A155) implies that
v α ( t ) v α ( 1 ) = 2 ( α + 1 ) log e > 0 , t 0 .
Since v α ( 1 ) = 0 (see (A154)), and v α ( · ) is monotonically increasing on [ 0 , ) , it follows that v α ( t ) < 0 for all t [ 0 , 1 ) and v α ( t ) > 0 for all t > 1 . This implies that v α ( t ) v α ( 1 ) = 0 for all t 0 (see (A153)); hence, from (A152), the function z α ( · ) is monotonically increasing on [ 0 , ) , and it is continuous over this interval (see (A151)). It therefore follows from (A150) that
κ α ( ξ 1 , ξ 2 ) = z α ( ξ 2 ) ,
for every ξ 1 [ 0 , 1 ) and ξ 2 ( 1 , ) (independently of ξ 1 ), which proves (73).
Remark A4.
None of the upper bounds in the right sides of (A144) and (A145) supersedes the other. For example, if P and Q correspond to Bernoulli ( p ) and Bernoulli ( q ) , respectively, and ( α , β , p , q ) = ( 2 , 1 , 1 5 , 2 5 ) , then the right sides of (A144) and (A145) are, respectively, equal to 0.264 log e and 0.156 log e . If on the other hand ( α , β , p , q ) = ( 10 , 1 , 1 5 , 2 5 ) , then the right sides of (A144) and (A145) are, respectively, equal to 2.377 log e and 3.646 log e .

Appendix E. Proof of Theorem 6

By assumption, P Q where the probability mass functions P and Q are defined on the set A : = { 1 , , n } . The majorization relation P Q is equivalent to the existence of a doubly-stochastic transformation W Y | X : A A such that (see Proposition 4)
Q W Y | X P .
(See, e.g., ([32], Theorem 2.1.10) or ([30], Theorem 2.B.2) or ([31], pp. 195–204)). Define
X = Y : = A , P X : = Q , Q X : = U n .
The probability mass functions given by
P Y : = P , Q Y : = U n
satisfy, respectively, (20) and (21). The first one is obvious from (A159)–(A161); equality (21) holds due to the fact that W Y | X : A A is a doubly stochastic transformation, which implies that for all y A
x A Q X ( x ) P Y | X ( y | x ) = 1 n x A P Y | X ( y | x )
= 1 n = Q Y ( y ) .
Since (by assumption) P X and Q X are supported on A , relations (20) and (21) hold in the setting of (A159)–(A161), and f : ( 0 , ) R is (by assumption) convex and twice differentiable, it is possible to apply the bounds in Theorem 1 (b) and (d). To that end, from (18), (19), (A160) and (A161),
ξ 1 = min x A Q ( x ) 1 n = n q min ,
ξ 2 = max x A Q ( x ) 1 n = n q max ,
which, from (24), (25), (32), (A160), (A161) and (A164), give that
e f ( n q min , n q max ) χ 2 ( Q U n ) χ 2 ( P U n )
D f ( Q U n ) D f ( P U n )
c f ( n q min , n q max ) χ 2 ( Q U n ) χ 2 ( P U n )
0 .
The difference of the χ 2 divergences in the left side of (A166) and the right side of (A167) satisfies
χ 2 ( Q U n ) χ 2 ( P U n ) = x A Q 2 ( x ) 1 n x A P 2 ( x ) 1 n = n Q 2 2 P 2 2 ,
and the substitution of (A169) into the bounds in (A166) and (A167) gives the results in (74) and (75).
Let f ( t ) = ( t 1 ) 2 for t > 0 , which yields from (26) and (31) that c f ( · , · ) = e f ( · , · ) = 1 . Since D f ( · · ) = χ 2 ( · · ) , it follows from (A169) that the upper and lower bounds in the left side of (74) and the right side of (75), respectively, coincide for the χ 2 -divergence; this therefore yields the tightness of these bounds in this special case.
We next prove (76). The following lower bound on the second-order Rényi entropy (a.k.a. the collision entropy) holds (see ([34], (25)–(27))):
H 2 ( Q ) : = log Q 2 2 log 4 n ρ ( 1 + ρ ) 2 ,
where q max q min ρ . This gives
Q 2 2 = exp H 2 ( Q ) ( 1 + ρ ) 2 4 n ρ .
By the Cauchy-Schwarz inequality, $\|P\|_2^2 \geq \frac{1}{n}$, which, together with (A171), gives
Q 2 2 P 2 2 ( ρ 1 ) 2 4 n ρ .
In view of the Schur-concavity of the Rényi entropy (see ([30], Theorem 13.F.3.a.)), the assumption P Q implies that
H 2 ( P ) H 2 ( Q ) ,
and an exponentiation of both sides of (A173) (see the left-side equality in (A170)) gives
Q 2 2 P 2 2 .
Combining (A172) and (A173) gives (76).

Appendix F. Proof of Theorem 7

We prove Item (a), showing that the set P n ( ρ ) (with ρ 1 ) is non-empty, convex and compact. Note that P n ( 1 ) = { U n } is a singleton, so the claim is trivial for ρ = 1 .
Let ρ > 1 . The non-emptiness of P n ( ρ ) is trivial since U n P n ( ρ ) . To prove the convexity of P n ( ρ ) , let P 1 , P 2 P n ( ρ ) , and let p max ( 1 ) , p max ( 2 ) , p min ( 1 ) and p min ( 2 ) be the (positive) maximal and minimal probability masses of P 1 and P 2 , respectively. Then, p max ( 1 ) p min ( 1 ) ρ and p max ( 2 ) p min ( 2 ) ρ yield
λ p max ( 1 ) + ( 1 λ ) p max ( 2 ) λ p min ( 1 ) + ( 1 λ ) p min ( 2 ) ρ , λ [ 0 , 1 ] .
For every λ [ 0 , 1 ] ,
min 1 i n λ P 1 ( i ) + ( 1 λ ) P 2 ( i ) λ p min ( 1 ) + ( 1 λ ) p min ( 2 ) ,
max 1 i n λ P 1 ( i ) + ( 1 λ ) P 2 ( i ) λ p max ( 1 ) + ( 1 λ ) p max ( 2 ) .
Combining (A175)–(A177) implies that
max 1 i n λ P 1 ( i ) + ( 1 λ ) P 2 ( i ) min 1 i n λ P 1 ( i ) + ( 1 λ ) P 2 ( i ) ρ ,
so λ P 1 + ( 1 λ ) P 2 P n ( ρ ) for all λ [ 0 , 1 ] . This proves the convexity of P n ( ρ ) .
The set of probability mass functions P n ( ρ ) is clearly bounded; for showing its compactness, it is left to show that P n ( ρ ) is closed. Let ρ > 1 , and let { P ( m ) } m = 1 be a sequence of probability mass functions in P n ( ρ ) which pointwise converges to P over the finite set A n . It is required to show that P P n ( ρ ) P n . As a limit of probability mass functions, P P n , and since by assumption P ( m ) P n ( ρ ) for all m N , it follows that
( n 1 ) ρ p min ( m ) + p min ( m ) ( n 1 ) p max ( m ) + p min ( m ) 1 ,
which yields p min ( m ) 1 ( n 1 ) ρ + 1 for all m. Since p max ( m ) ρ p min ( m ) for every m, it follows that also for the limiting probability mass function P we have p min 1 ( n 1 ) ρ + 1 > 0 , and p max ρ p min . This proves that P P n ( ρ ) , and therefore P n ( ρ ) is a closed set.
An alternative proof for Item (a) relies on the observation that, for ρ 1 ,
P n ( ρ ) = P n i j P : P ( i ) ρ P ( j ) 0 ,
which yields the convexity and compactness of the set P n ( ρ ) for all ρ 1 .
The result in Item (b) holds in view of Item (a), and due to the convexity and continuity of D f ( P Q ) in ( P , Q ) P n ( ρ ) × P n ( ρ ) (where p min , q min 1 ( n 1 ) ρ + 1 > 0 ). This implication is justified by the statement that a convex and continuous function over a non-empty convex and compact set attains its supremum over this set (see, e.g., ([68], Theorem 7.42) or ([59], Theorem 10.1 and Corollary 32.3.2)).
We next prove Item (c). If Q P n ( ρ ) , then 1 1 + ( n 1 ) ρ q min 1 n where the lower bound on q min is attained when Q is the probability mass function with n 1 masses equal to ρ q min and a single smaller mass equal to q min , and the upper bound is attained when Q is the equiprobable distribution. For an arbitrary Q P n ( ρ ) , let q min : = β where β can get any value in the interval Γ n ( ρ ) defined in (79). By ([34], Lemma 1), Q Q β and Q β P n ( ρ ) where Q β is given in (80). The Schur-convexity of D f ( · U n ) (see ([38], Lemma 1)) and the identity D f ( U n · ) = D f ( · U n ) give that
D f ( Q U n ) D f ( Q β U n ) , D f ( U n Q ) D f ( U n Q β )
for all Q P n ( ρ ) with q min = β Γ n ( ρ ) ; equalities hold in (A180) if Q = Q β P n ( ρ ) . The maximization of D f ( Q U n ) and D f ( U n Q ) over all probability mass functions Q P n ( ρ ) can be therefore simplified to the maximization of D f ( Q β U n ) and D f ( U n Q β ) , respectively, over the parameter β which lies in the interval Γ n ( ρ ) in (79). This proves (82) and (83).
We next prove Item (e), and then prove Item (d). In view of Item (c), the maximum of D f ( Q U n ) over all the probability mass functions Q P n ( ρ ) is attained by Q = Q β with β Γ n ( ρ ) (see (79)–(81)). From (80), Q β can be expressed as the n-length probability vector
$Q_\beta = \bigl(\,\underbrace{\rho\beta, \ldots, \rho\beta}_{i_\beta},\; 1-(n+i_\beta\rho-i_\beta-1)\beta,\; \underbrace{\beta, \ldots, \beta}_{n-i_\beta-1}\,\bigr).$
The influence of the ( i β + 1 ) -th entry of the probability vector in (A181) on D f ( Q β U n ) tends to zero as we let n . This holds since the entries of the vector in (A181) are written in decreasing order, which implies that for all β Γ n ( ρ ) (with ρ 1 )
$n\bigl[1-(n+i_\beta\rho-i_\beta-1)\beta\bigr] \in [n\beta,\, n\rho\beta] \subseteq \Bigl[\tfrac{n}{(n-1)\rho+1},\, \rho\Bigr] \subseteq \Bigl[\tfrac{1}{\rho},\, \rho\Bigr];$
from (A182) and the convexity of f on ( 0 , ) (so, f attains its finite maximum on every closed sub-interval of ( 0 , ) ), it follows that
$\bigl[1-(n+i_\beta\rho-i_\beta-1)\beta\bigr]\, f\Bigl(n\bigl[1-(n+i_\beta\rho-i_\beta-1)\beta\bigr]\Bigr) \le \bigl[1-(n+i_\beta\rho-i_\beta-1)\beta\bigr] \max_{u\in[\frac{1}{\rho},\,\rho]} f(u) \le \frac{\rho}{n}\, \max_{u\in[\frac{1}{\rho},\,\rho]} f(u) \xrightarrow[n\to\infty]{} 0.$
In view of (A181) and (A183), by letting n , the maximization of D f ( Q β U n ) over β Γ n ( ρ ) can be replaced by a maximization of D f ( Q ˜ m U n ) where
$\tilde Q_m := \bigl(\,\underbrace{\rho\beta, \ldots, \rho\beta}_{m},\; \underbrace{\beta, \ldots, \beta}_{n-m}\,\bigr) \in \mathcal{P}_n(\rho)$
with the free parameter m { 0 , , n } , and with β : = 1 n + ( ρ 1 ) m (the value of β is determined so that the total mass of Q ˜ m is 1). Hence, we get
lim n max β Γ n ( ρ ) D f ( Q β U n ) = lim n max m { 0 , , n } D f ( Q ˜ m U n ) .
The f-divergence in the right side of (A185) satisfies
$D_f(\tilde Q_m \| U_n) = \frac1n \sum_{i=1}^n f\bigl(n \tilde Q_m(i)\bigr) = \frac{m}{n}\, f\Bigl(\frac{\rho n}{n+(\rho-1)m}\Bigr) + \Bigl(1-\frac{m}{n}\Bigr) f\Bigl(\frac{n}{n+(\rho-1)m}\Bigr) = g_f^{(\rho)}\Bigl(\frac{m}{n}\Bigr),$
where (A188) holds by the definition of the function g f ( ρ ) ( · ) in (84). It therefore follows that
$\lim_{n\to\infty} u_f(n,\rho) = \lim_{n\to\infty} \max_{m\in\{0,\ldots,n\}} g_f^{(\rho)}\Bigl(\frac{m}{n}\Bigr) = \max_{x\in[0,1]} g_f^{(\rho)}(x),$
where (A189) holds by combining (82) and (A185)–(A188); (A190) holds by the continuity of the function g_f^(ρ)(·) on [0,1], which follows from (84) and the continuity of the convex function f on [1/ρ, ρ] for ρ ≥ 1 (recall that a convex function is continuous on every closed sub-interval of the interior of its domain, and by assumption f is convex on (0,∞)). This proves (87), by the definition of g_f^(ρ)(·) in (84).
Equality (88) follows from (87) by replacing g f ( ρ ) ( · ) with g f ( ρ ) ( · ) , with f : ( 0 , ) R as given in (29); this replacement is justified by the equality D f ( U n Q ) = D f ( Q U n ) .
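As a complement to the proof of Item (e), the following Python sketch (an illustration, not part of the original analysis) exemplifies the convergence in (87) for the choice f(t) = t log_e t (relative entropy). The closed-form expression used for g_f^(ρ) below is a reconstruction from (A184)–(A188), and the finite-n values are the lower bounds on u_f(n,ρ) given by the left side of (85).

```python
# Minimal numerical sketch (not from the paper): for f(t) = t*ln(t), the maximum of
# D_f(Q || U_n) over Q in P_n(rho) approaches max_{x in [0,1]} g_f^(rho)(x) as n grows.
import numpy as np

def f(t):
    return t * np.log(t)                      # t*ln(t); all arguments below are positive

def g(x, rho):
    # g_f^(rho)(x) = x*f(rho/(1+(rho-1)x)) + (1-x)*f(1/(1+(rho-1)x)), reconstructed from (A184)-(A188)
    denom = 1.0 + (rho - 1.0) * x
    return x * f(rho / denom) + (1.0 - x) * f(1.0 / denom)

def u_f_lower(n, rho):
    # max over m in {0,...,n} of D_f(Q~_m || U_n) = g_f^(rho)(m/n): the left side of (85)
    return max(g(m / n, rho) for m in range(n + 1))

rho = 4.0
asymptotic = max(g(x, rho) for x in np.linspace(0.0, 1.0, 10001))
for n in (4, 16, 64, 256):
    print(n, u_f_lower(n, rho), asymptotic)   # the gap shrinks roughly like O(1/n), cf. Item (f)
```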
Once Item (e) is proved, we return to prove Item (d). To that end, it is first shown that
$u_f(n,\rho) \le u_f(2n,\rho),$
$v_f(n,\rho) \le v_f(2n,\rho),$
for all ρ 1 and integers n 2 , with the functions u f and v f , respectively, defined in (77) and (78). Since D f ( P Q ) = D f ( Q P ) for all P , Q P n , (77) and (78) give that
v f ( n , ρ ) = u f ( n , ρ ) ,
so the monotonicity property in (A192) follows from (A191) by replacing f with f . To prove (A191), let Q P n ( ρ ) be a probability mass function which attains the maximum at the right side of (77), and let P be the probability mass function supported on A 2 n = { 1 , , 2 n } , and defined as follows:
$P(i) = \begin{cases} \tfrac12\, Q(i), & i \in \{1,\ldots,n\}, \\ \tfrac12\, Q(i-n), & i \in \{n+1,\ldots,2n\}. \end{cases}$
Since by assumption Q P n ( ρ ) , (A194) implies that P P 2 n ( ρ ) . It therefore follows that
$u_f(2n,\rho) = \max_{Q' \in \mathcal{P}_{2n}(\rho)} D_f(Q' \| U_{2n}) \ge D_f(P \| U_{2n}) = \frac{1}{2n} \biggl[\sum_{i=1}^{n} f\bigl(2n P(i)\bigr) + \sum_{i=n+1}^{2n} f\bigl(2n P(i)\bigr)\biggr] = \frac1n \sum_{i=1}^{n} f\bigl(n Q(i)\bigr) = D_f(Q \| U_n) = \max_{Q \in \mathcal{P}_n(\rho)} D_f(Q \| U_n) = u_f(n,\rho),$
where (A195) and (A201) hold due to (77); (A196) holds since P ∈ P_2n(ρ); finally, (A198) holds due to (A194), which implies that the two sums in the right side of (A197) are identical, and each of them is equal to the sum in the right side of (A198). This gives (A191), and likewise also (A192) (see (A193)).
Consequently, for all ρ ≥ 1 and integers n ≥ 2,
$u_f(n,\rho) \le \lim_{k\to\infty} u_f(2^k n, \rho) = \lim_{n\to\infty} u_f(n,\rho) = \max_{x\in[0,1]} g_f^{(\rho)}(x),$
where (A202) holds since, due to (A191), the sequence { u f ( 2 k n , ρ ) } k = 0 is monotonically increasing, which implies that the first term of this sequence is less than or equal to its limit. Equality (A203) holds since the limit in its right side exists (in view of the above proof of (87)), so its limit coincides with the limit of every subsequence; (A204) holds due to (A189) and (A190). A replacement of f with f gives, from (A193), that
v f ( n , ρ ) max x [ 0 , 1 ] g f ( ρ ) ( x ) .
Combining (A202)–(A205) gives the right-side inequalities in (85) and (86). The left-side inequality in (85) follows by combining (77), (A184) and (A186)–(A188), which gives
$u_f(n,\rho) = \max_{Q \in \mathcal{P}_n(\rho)} D_f(Q \| U_n) \ge \max_{m\in\{0,\ldots,n\}} D_f(\tilde Q_m \| U_n) = \max_{m\in\{0,\ldots,n\}} g_f^{(\rho)}\Bigl(\frac{m}{n}\Bigr).$
Likewise, in view of (A193), the left-side inequality in (86) follows from the left-side inequality in (85) by replacing f with f .
We next prove Item (f), providing an upper bound on the convergence rate of the limit in (87); an analogous result can be obtained for the convergence rate to the limit in (88) by replacing f with f in (29). To prove (89), in view of Items (d) and (e), we get that for every integer n 2
$0 \le \lim_{n\to\infty} u_f(n,\rho) - u_f(n,\rho) \le \max_{x\in[0,1]} g_f^{(\rho)}(x) - \max_{m\in\{0,\ldots,n\}} g_f^{(\rho)}\Bigl(\frac{m}{n}\Bigr) = \max_{x\in[0,1]} g_f^{(\rho)}(x) - \max_{m\in\{0,\ldots,n-1\}} g_f^{(\rho)}\Bigl(\frac{m}{n}\Bigr) = \max_{m\in\{0,\ldots,n-1\}} \max_{x\in[\frac{m}{n},\frac{m+1}{n}]} g_f^{(\rho)}(x) - \max_{m\in\{0,\ldots,n-1\}} g_f^{(\rho)}\Bigl(\frac{m}{n}\Bigr) \le \max_{m\in\{0,\ldots,n-1\}} \Bigl\{ \max_{x\in[\frac{m}{n},\frac{m+1}{n}]} g_f^{(\rho)}(x) - g_f^{(\rho)}\Bigl(\frac{m}{n}\Bigr) \Bigr\},$
where (A209) holds due to the monotonicity property in (A191), and also due to the existence of the limit of {u_f(n,ρ)}_{n∈ℕ}; (A210) holds due to (85); (A211) holds since the function g_f^(ρ): [0,1] → ℝ (as defined in (84)) satisfies g_f^(ρ)(1) = g_f^(ρ)(0) = 0 (recall that by assumption f(1) = 0); (A212) holds since [0,1] = ⋃_{m=0}^{n−1} [m/n, (m+1)/n], so the maximum of g_f^(ρ)(·) over the interval [0,1] is the largest among its maximal values over the sub-intervals [m/n, (m+1)/n] for m ∈ {0,…,n−1}; finally, (A213) holds since the maximum of a sum of functions is less than or equal to the sum of the maxima of these functions. If the function g_f^(ρ): [0,1] → ℝ is differentiable on (0,1), and its derivative is upper bounded by K_f(ρ) ≥ 0, then by the mean value theorem of Lagrange, for every m ∈ {0,…,n−1},
$g_f^{(\rho)}(x) - g_f^{(\rho)}\Bigl(\frac{m}{n}\Bigr) \le \frac{K_f(\rho)}{n}, \qquad x \in \Bigl[\frac{m}{n},\, \frac{m+1}{n}\Bigr].$
Combining (A209)–(A214) gives (89).
We next prove Item (g). By definition, it readily follows that P n ( ρ 1 ) P n ( ρ 2 ) if 1 ρ 1 < ρ 2 . By the definition in (77), for a fixed integer n 2 , it follows that the function u f ( n , · ) is monotonically increasing on [ 1 , ) . The limit in the left side of (90) therefore exists. Since D f ( Q U n ) is convex in Q, its maximum over the convex set of probability mass functions Q P n is obtained at one of the vertices of the simplex P n . Hence, a maximum of D f ( Q U n ) over this set is attained at Q = ( q 1 , , q n ) with q i = 1 for some i { 1 , , n } , and q j = 0 for j i . In the latter case,
$D_f(Q \| U_n) = \frac1n \sum_{k=1}^n f(n q_k) = \frac1n \bigl[(n-1)\, f(0) + f(n)\bigr].$
Note that Q ∉ ⋃_{ρ ≥ 1} P_n(ρ) (the union of {P_n(ρ)} over all ρ ≥ 1 consists of the probability mass functions in P_n whose support is the whole set A_n = {1,…,n}, so Q ∈ P_n is not an element of this union); hence, it follows that
$\lim_{\rho\to\infty} u_f(n,\rho) \le \Bigl(1-\frac1n\Bigr) f(0) + \frac{f(n)}{n}.$
On the other hand, for every ρ 1 ,
$u_f(n,\rho) \ge g_f^{(\rho)}\Bigl(\frac1n\Bigr) = \frac1n\, f\Bigl(\frac{\rho n}{n+\rho-1}\Bigr) + \Bigl(1-\frac1n\Bigr) f\Bigl(\frac{n}{n+\rho-1}\Bigr),$
where (A217) holds due to the left-side inequality of (85), and (A218) is due to (84). Combining (A217) and (A218), and the continuity of f at zero (by the continuous extension of the convex function f at zero), yields (by letting ρ )
$\lim_{\rho\to\infty} u_f(n,\rho) \ge \Bigl(1-\frac1n\Bigr) f(0) + \frac{f(n)}{n}.$
Combining (A216) and (A219) gives (90) for every integer n ≥ 2. In order to get an upper bound on the convergence rate in (90), suppose that f(0) < ∞, f is differentiable on (0,n), and K_n := sup_{t∈(0,n)} |f′(t)| < ∞. For every ρ ≥ 1, we get
$0 \le \lim_{\rho\to\infty} u_f(n,\rho) - u_f(n,\rho) \le \frac1n \Bigl[f(n) - f\Bigl(\frac{\rho n}{n+\rho-1}\Bigr)\Bigr] + \Bigl(1-\frac1n\Bigr) \Bigl[f(0) - f\Bigl(\frac{n}{n+\rho-1}\Bigr)\Bigr] \le \frac{K_n}{n} \Bigl(n - \frac{\rho n}{n+\rho-1}\Bigr) + \Bigl(1-\frac1n\Bigr) K_n\, \frac{n}{n+\rho-1} = \frac{2 K_n (n-1)}{n+\rho-1},$
where (A220) holds since the sets {P_n(ρ)}_{ρ≥1} are monotonically increasing in ρ; (A221) follows from (A216)–(A218); (A222) holds by the assumption that |f′(t)| ≤ K_n for all t ∈ (0,n), by the mean value theorem of Lagrange, and since 0 < n/(n+ρ−1) ≤ ρn/(n+ρ−1) ≤ n for all ρ ≥ 1 and n ∈ ℕ. This proves (91).
We next prove Item (h). Setting P := U_n yields P ≺ Q for every probability mass function Q which is supported on {1,…,n}. Since q_min + (n−1) q_max ≥ 1 and (n−1) q_min + q_max ≤ 1, and since by assumption q_max ≤ ρ q_min, it follows that
$[n q_{\min},\, n q_{\max}] \subseteq \Bigl[\frac{n}{1+(n-1)\rho},\; \frac{\rho n}{n-1+\rho}\Bigr] \subseteq \Bigl[\frac1\rho,\; \rho\Bigr].$
Combining the assumption in (92) with (A224) implies that
$m \le f''(t) \le M, \qquad t \in [n q_{\min},\, n q_{\max}].$
Hence, (26), (31) and (A225) yield
$\tfrac12\, m \le c_f(n q_{\min},\, n q_{\max}) \le e_f(n q_{\min},\, n q_{\max}) \le \tfrac12\, M.$
The lower bound on D f ( Q U n ) in the left side of (94) follows from a combination of (75), the left-side inequality in (A226), and P 2 2 = 1 n . Similarly, the upper bound on D f ( Q U n ) in the right side of (95) follows from a combination of (74), the right-side inequality in (A226), and the equality P 2 2 = 1 n . The looser upper bound on D f ( Q U n ) in the right side of (96), expressed as a function of M and ρ , follows by combining (74), (76), and the right-side inequality in (A226).
The tightness of the lower bound in the left side of (94) and the upper bound in the right side of (95) for the χ² divergence is clear from the fact that M = m = 2 if f(t) = (t−1)² for all t > 0; in this case, χ²(Q‖U_n) = n‖Q‖₂² − 1.
To prove Item (i), suppose that the second derivative of f is upper bounded on (0,∞), i.e., f″(t) ≤ M_f ∈ (0,∞) for all t > 0, and that there is a need to assert that D_f(Q‖U_n) ≤ d for an arbitrary d > 0. Condition (97) follows from (96) by solving the inequality M_f (ρ−1)²/(8ρ) ≤ d, with the variable ρ ≥ 1, for given d > 0 and M_f > 0 (note that M_f does not depend on ρ).
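The following hedged numerical sketch (not from the paper) checks the chain of bounds in Items (h)–(i) as reconstructed here, namely (m/2)(n‖Q‖₂² − 1) ≤ D_f(Q‖U_n) ≤ (M/2)(n‖Q‖₂² − 1) ≤ M(ρ−1)²/(8ρ), for f(t) = t log_e t (so all divergences are in nats), where f″(t) = 1/t and therefore m = 1/ρ and M = ρ are valid bounds on [1/ρ, ρ].

```python
# Hedged sketch: verify the reconstructed bounds of Items (h)-(i) for f(t) = t*ln(t).
import numpy as np

rng = np.random.default_rng(3)
n, rho = 6, 3.0
raw = rng.uniform(1.0, rho, size=n)            # masses in [1, rho] ensure q_max/q_min <= rho
Q = raw / raw.sum()

chi2 = n * np.sum(Q ** 2) - 1.0                # chi^2(Q || U_n)
kl = float(np.sum(Q * np.log(n * Q)))          # D(Q || U_n) in nats
m, M = 1.0 / rho, rho                          # bounds on f''(t) = 1/t over [1/rho, rho]
print(0.5 * m * chi2, kl, 0.5 * M * chi2, M * (rho - 1) ** 2 / (8 * rho))
```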

Appendix G. Proof of Theorem 8

The proof of Theorem 8 relies on Theorem 6. For α ( 0 , 1 ) ( 1 , ) , let u α : ( 0 , ) R be the non-negative and convex function given by (see, e.g., ([8], (2.1)) or ([16], (17)))
$u_\alpha(t) := \frac{t^\alpha - \alpha(t-1) - 1}{\alpha(\alpha-1)}, \qquad t > 0,$
and let u 1 : ( 0 , ) R be the convex function given by
$u_1(t) := \lim_{\alpha\to1} u_\alpha(t) = t \log_e t + 1 - t, \qquad t > 0.$
Let P and Q be probability mass functions which are supported on a finite set; without loss of generality, let their support be given by A n : = { 1 , , n } . Then, for α ( 0 , 1 ) ( 1 , ) ,
$D_{u_\alpha}(Q \| U_n) - D_{u_\alpha}(P \| U_n) = \frac1n \sum_{i=1}^n u_\alpha\bigl(n Q(i)\bigr) - \frac1n \sum_{i=1}^n u_\alpha\bigl(n P(i)\bigr) = \frac{n^{\alpha-1}}{\alpha(\alpha-1)} \biggl[\sum_{i=1}^n Q^\alpha(i) - \sum_{i=1}^n P^\alpha(i)\biggr] = \frac{n^{\alpha-1} \bigl[S_\alpha(P) - S_\alpha(Q)\bigr]}{\alpha},$
where
$S_\alpha(P) := \begin{cases} \dfrac{1}{1-\alpha}\, \Bigl(\sum_{i=1}^n P^\alpha(i) - 1\Bigr), & \alpha \in (0,1)\cup(1,\infty), \\[1mm] -\sum_{i=1}^n P(i)\, \log_e P(i), & \alpha = 1, \end{cases}$
designates the order-α Tsallis entropy of a probability mass function P defined on the set A_n. Equality (A229) also holds for α = 1 by continuous extension.
In view of (26) and (31), since u″_α(t) = t^{α−2} for all t > 0, it follows that
$c_{u_\alpha}(n q_{\min},\, n q_{\max}) = \begin{cases} \tfrac12\, n^{\alpha-2}\, q_{\max}^{\alpha-2}, & \text{if } \alpha \in (0,2], \\ \tfrac12\, n^{\alpha-2}\, q_{\min}^{\alpha-2}, & \text{if } \alpha \in (2,\infty), \end{cases}$
and
$e_{u_\alpha}(n q_{\min},\, n q_{\max}) = \begin{cases} \tfrac12\, n^{\alpha-2}\, q_{\min}^{\alpha-2}, & \text{if } \alpha \in (0,2], \\ \tfrac12\, n^{\alpha-2}\, q_{\max}^{\alpha-2}, & \text{if } \alpha \in (2,\infty). \end{cases}$
The combination of (74) and (75) under the assumption that P and Q are supported on A n and P Q , together with (A229), (A231) and (A232) gives (100)–(102). Furthermore, the left and right-side inequalities in (100) hold with equality if c u α ( · , · ) in (A231) and e u α ( · , · ) in (A232) coincide, which implies that the upper and lower bounds in (74) and (75) are tight in that case. Comparing c u α ( · , · ) in (A231) and e u α ( · , · ) in (A232) shows that they coincide if α = 2 .
To prove Item (b) of Theorem 8, let P_ε and Q_ε be probability mass functions supported on A = {0,1} where P_ε(0) = 1/2 + ε and Q_ε(0) = 1/2 + βε, with β > 1 and 0 < ε < 1/(2β). This yields P_ε ≺ Q_ε. The result in (103) is proved by showing that, for all α > 0,
$\lim_{\varepsilon\to0^+} \frac{S_\alpha(P_\varepsilon) - S_\alpha(Q_\varepsilon)}{L(\alpha, P_\varepsilon, Q_\varepsilon)} = 1, \qquad \lim_{\varepsilon\to0^+} \frac{S_\alpha(P_\varepsilon) - S_\alpha(Q_\varepsilon)}{U(\alpha, P_\varepsilon, Q_\varepsilon)} = 1,$
which shows that the infimum and supremum in (103) can be even restricted to the binary alphabet setting. For every α ( 0 , 1 ) ( 1 , ) ,
$S_\alpha(P_\varepsilon) - S_\alpha(Q_\varepsilon) = \frac{1}{1-\alpha} \Bigl[\sum_i P_\varepsilon^\alpha(i) - \sum_i Q_\varepsilon^\alpha(i)\Bigr] = \frac{1}{1-\alpha} \Bigl[\bigl(\tfrac12+\varepsilon\bigr)^\alpha + \bigl(\tfrac12-\varepsilon\bigr)^\alpha - \bigl(\tfrac12+\beta\varepsilon\bigr)^\alpha - \bigl(\tfrac12-\beta\varepsilon\bigr)^\alpha\Bigr] = \alpha\, 2^{2-\alpha} (\beta^2-1)\, \varepsilon^2 + O(\varepsilon^4),$
where (A235) follows from a Taylor series expansion around ε = 0 , and the passage in the limit where α 1 shows that (A235) also holds at α = 1 (due to the continuous extension of the order- α Tsallis entropy at α = 1 ). This implies that (A235) holds for all α > 0 . We now calculate the lower and upper bounds on S α ( P ε ) S α ( Q ε ) in (101) and (102), respectively.
  • For α ∈ (0,2],
    $L(\alpha, P_\varepsilon, Q_\varepsilon) = \tfrac12\, \alpha\, q_{\max}^{\alpha-2} \bigl(\|Q_\varepsilon\|_2^2 - \|P_\varepsilon\|_2^2\bigr) = \tfrac12\, \alpha \bigl(\tfrac12+\beta\varepsilon\bigr)^{\alpha-2} \Bigl[\bigl(\tfrac12+\beta\varepsilon\bigr)^2 + \bigl(\tfrac12-\beta\varepsilon\bigr)^2 - \bigl(\tfrac12+\varepsilon\bigr)^2 - \bigl(\tfrac12-\varepsilon\bigr)^2\Bigr] = \alpha\, 2^{2-\alpha} (\beta^2-1)\, \varepsilon^2\, (1+2\beta\varepsilon)^{\alpha-2}.$
  • For α ∈ (2,∞),
    $L(\alpha, P_\varepsilon, Q_\varepsilon) = \tfrac12\, \alpha\, q_{\min}^{\alpha-2} \bigl(\|Q_\varepsilon\|_2^2 - \|P_\varepsilon\|_2^2\bigr) = \alpha\, 2^{2-\alpha} (\beta^2-1)\, \varepsilon^2\, (1-2\beta\varepsilon)^{\alpha-2}.$
  • Similarly, for α ∈ (0,2],
    $U(\alpha, P_\varepsilon, Q_\varepsilon) = \tfrac12\, \alpha\, q_{\min}^{\alpha-2} \bigl(\|Q_\varepsilon\|_2^2 - \|P_\varepsilon\|_2^2\bigr) = \alpha\, 2^{2-\alpha} (\beta^2-1)\, \varepsilon^2\, (1-2\beta\varepsilon)^{\alpha-2},$
    and, for α ∈ (2,∞),
    $U(\alpha, P_\varepsilon, Q_\varepsilon) = \tfrac12\, \alpha\, q_{\max}^{\alpha-2} \bigl(\|Q_\varepsilon\|_2^2 - \|P_\varepsilon\|_2^2\bigr) = \alpha\, 2^{2-\alpha} (\beta^2-1)\, \varepsilon^2\, (1+2\beta\varepsilon)^{\alpha-2}.$
The combination of (A235)–(A237) yields (A233); similarly, the combination of (A235), (A238) and (A239) yields (A234).
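A small numerical sketch (an illustration, not part of the proof) of the limits (A233)–(A234) for the binary example above; the expressions used for L and U follow the reconstruction in (A236)–(A239), and all entropies are in nats.

```python
# Hedged numerical check of (A233)-(A234): both ratios approach 1 as eps -> 0+.
import numpy as np

def tsallis(p, alpha):
    p = np.asarray(p, dtype=float)
    if np.isclose(alpha, 1.0):
        return -np.sum(p * np.log(p))
    return (np.sum(p ** alpha) - 1.0) / (1.0 - alpha)

def bounds(alpha, p, q):
    # L and U as reconstructed from (A236)-(A239): 0.5*alpha*q^(alpha-2)*(||Q||_2^2 - ||P||_2^2),
    # with q = q_max (lower bound) and q = q_min (upper bound) for alpha <= 2; the roles swap for alpha > 2.
    gap2 = np.sum(np.asarray(q) ** 2) - np.sum(np.asarray(p) ** 2)
    qmin, qmax = min(q), max(q)
    if alpha <= 2:
        return 0.5 * alpha * qmax ** (alpha - 2) * gap2, 0.5 * alpha * qmin ** (alpha - 2) * gap2
    return 0.5 * alpha * qmin ** (alpha - 2) * gap2, 0.5 * alpha * qmax ** (alpha - 2) * gap2

alpha, beta = 1.5, 2.0
for eps in (1e-1, 1e-2, 1e-3):
    P = (0.5 + eps, 0.5 - eps)
    Q = (0.5 + beta * eps, 0.5 - beta * eps)
    diff = tsallis(P, alpha) - tsallis(Q, alpha)
    L, U = bounds(alpha, P, Q)
    print(eps, diff / L, diff / U)
```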

Appendix H. Proof of Theorem 9 and Corollary 1

Appendix H.1. Proof of Theorem 9

The proof of the convexity property of Δ ( · , ρ ) in (149), with ρ > 1 , over the real line R relies on ([69], Theorem 2.1) which states that if W is a non-negative random variable, then
λ α : = { E [ W α ] E α [ W ] log e α ( α 1 ) , α 0 , 1 log E [ W ] E [ log W ] , α = 0 E [ W log W ] E [ W ] log E [ W ] , α = 1
is log-convex in α R . This property has been used to derive f-divergence inequalities (see, e.g., ([62], Theorem 20), [65,69]).
Let Q P , and let W : = d Q d P be the Radon-Nikodym derivative (W is a non-negative random variable). Let the expectations in the right side of (A240) be taken with respect to P. In view of the above statement from ([69], Theorem 2.1), this gives the log-convexity of D A ( α ) ( Q P ) in α R . Since log-convexity yields convexity, it follows that D A ( α ) ( Q P ) is convex in α over the real line. Let P : = U n , and let Q P n ( ρ ) ; since Q P , it follows that D A ( α ) ( Q U n ) is convex in α R . The pointwise maximum of a set of convex functions is a convex function, which implies that max Q P n ( ρ ) D A ( α ) ( Q U n ) is convex in α R for every integer n 2 . Since the pointwise limit of a convergent sequence of convex functions is convex, it follows that lim n max Q P n ( ρ ) D A ( α ) ( Q U n ) is convex in α . This, by definition, is equal to Δ ( α , ρ ) (see (146)), which proves the convexity of this function in α R . From (149), for all ρ > 1 ,
Δ ( 1 + α , ρ ) = 1 ( α + 1 ) α ( α ) α ρ 1 + α 1 1 + α ρ ρ 1 + α α ( ρ 1 ) ( 1 + α ) 1 + α 1 = 1 ( α ) ( α 1 ) ( 1 + α ) α 1 ρ α ( ρ ρ α ) 1 + α ρ 1 + α ( ρ α 1 ) α ( ρ 1 ) ( α ) α 1 = 1 ( α ) ( α 1 ) ( 1 + α ) α 1 ρ ρ α 1 + α ρ α 1 α ( ρ 1 ) ( α ) α 1 = Δ ( α , ρ ) ,
which proves the symmetry property of Δ(α,ρ) around α = 1/2 for all ρ > 1. The convexity in α over the real line and the symmetry around α = 1/2 imply that Δ(α,ρ) attains its global minimum at α = 1/2, which is equal to 4(ρ^{1/4} − 1)²/(√ρ + 1) for all ρ > 1.
Inequalities (162) and (163) follow from ([8], Proposition 2.7); this proposition implies that, for every integer n 2 and for all probability mass functions Q defined on A n : = { 1 , , n } ,
α D A ( α ) ( Q U n ) β D A ( β ) ( Q U n ) , 0 < α β < ,
( 1 β ) D A ( 1 β ) ( Q U n ) ( 1 α ) D A ( 1 α ) ( Q U n ) , < α β < 1 .
Inequalities (162) and (163) follow, respectively, by maximizing both sides of (A242) or (A243) over Q P n ( ρ ) , and letting n tend to infinity.
For every α R , the function Δ ( α , ρ ) is monotonically increasing in ρ ( 1 , ) since (by definition) the set of probability mass functions { P n ( ρ ) } ρ 1 is monotonically increasing (i.e., P n ( ρ 1 ) P n ( ρ 2 ) if 1 ρ 1 < ρ 2 < ), and therefore the maximum of D A ( α ) ( Q U n ) over Q P n ( ρ ) is a monotonically increasing function of ρ [ 1 , ) ; the limit of this maximum, as we let n , is equal to Δ ( α , ρ ) in (149) for all ρ > 1 , which is therefore monotonically increasing in ρ over the interval ( 1 , ) . The continuity of Δ ( α , ρ ) in both α and ρ is due to its expression in (149) with its continuous extension at α = 0 and α = 1 in (150). Since P n ( 1 ) = { U n } , it follows from the continuity of Δ ( α , ρ ) that
lim ρ 1 + Δ ( α , ρ ) = D A ( α ) ( U n U n ) = 0 .

Appendix H.2. Proof of Corollary 1

For all α R and ρ > 1 ,
lim n max Q P n ( ρ ) D A ( α ) ( U n Q )
= lim n max Q P n ( ρ ) D A ( 1 α ) ( Q U n )
= Δ ( 1 α , ρ )
= Δ ( α , ρ ) ,
where (A244) holds due to the symmetry property in ([8], p. 36), which states that
$D_A^{(\alpha)}(P \| Q) = D_A^{(1-\alpha)}(Q \| P),$
for every α R and probability mass functions P and Q; (A245) is due to (146); finally, (A246) holds due to the symmetry property of Δ ( · , ρ ) around 1 2 in Theorem 9 (a).

Appendix I. Proof of (171)

In view of (154) and (155), it follows that the condition in (170) is satisfied if and only if ρ ρ where ρ ( 1 , ) is the solution of the equation
$\frac{\rho \log \rho}{\rho-1} - \log e - \log\Bigl(\frac{\rho \log_e \rho}{\rho-1}\Bigr) = d\, \log e,$
with a fixed d > 0.
$x := \frac{\rho \log_e \rho}{\rho - 1}$
leads to the equation
$x - \log_e x = d + 1.$
Negation and exponentiation of both sides of (A250) gives
$(-x)\, e^{-x} = -e^{-d-1}.$
Since ρ > 1 implies by (A249) that x > 1 , the proper solution for x is given by
$x = -W_{-1}\bigl(-e^{-d-1}\bigr), \qquad d > 0,$
where W_{−1} denotes the secondary real branch of the Lambert W function [37]; otherwise, the replacement of W_{−1} in the right side of (A252) with the principal real branch W_0 yields x ∈ (0,1).
We next proceed to solve for ρ as a function of x. From (A249), letting u := 1/ρ gives the equation u = e^{(u−1)x}, which is equivalent to
$(-u x)\, e^{-u x} = (-x)\, e^{-x} = -e^{-d-1},$
where (A254) follows from (A252) and from the definition of the Lambert W function (i.e., t = W(u) if and only if t e^t = u). The solutions of (A253) are given by
$-u x = W_{-1}\bigl(-e^{-d-1}\bigr)$
and
$-u x = W_0\bigl(-e^{-d-1}\bigr),$
which (from (A252)) correspond, respectively, to u = 1 and
$u = \frac{W_0\bigl(-e^{-d-1}\bigr)}{W_{-1}\bigl(-e^{-d-1}\bigr)} \in (0,1).$
Since ρ ( 1 , ) is equal to 1 u , the reciprocal of the right side of (A257) gives the proper solution for ρ (denoted by ρ max ( 1 ) ( d ) in (171)).
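The derivation above can be checked numerically with the Lambert W function available in SciPy; the following sketch (an illustration under the reconstruction of (A248)–(A250) given here, with natural logarithms) computes ρ_max^(1)(d) and verifies that the resulting ρ solves x − log_e x = d + 1 with x = ρ log_e ρ/(ρ−1).

```python
# Hedged sketch: rho_max^(1)(d) = W_{-1}(-e^{-d-1}) / W_0(-e^{-d-1}), cf. (A255)-(A257) and (171).
import numpy as np
from scipy.special import lambertw

def rho_max(d):
    z = -np.exp(-d - 1.0)                 # -e^{-d-1} lies in (-1/e, 0) for d > 0
    w_minus1 = lambertw(z, k=-1).real      # secondary real branch W_{-1}
    w_0 = lambertw(z, k=0).real            # principal real branch W_0
    return w_minus1 / w_0                  # rho = 1/u with u = W_0(.)/W_{-1}(.)

for d in (0.01, 0.1, 1.0):
    rho = rho_max(d)
    x = rho * np.log(rho) / (rho - 1.0)
    print(d, rho, x - np.log(x) - 1.0)     # the last column should reproduce d
```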

Appendix J. Proof of (176), (177) and (180)

We first derive the upper bound on Φ ( α , ρ ) in (176) for α e 3 2 and ρ 1 . For every Q P n ( ρ ) , with an integer n 2 ,
D f α ( Q U n ) log ( α + 1 ) + 3 2 log e log e α + 1 χ 2 ( Q U n )
+ log e 3 ( α + 1 ) exp 2 D 3 ( Q U n ) 1 log ( α + 1 ) + 3 2 log e log e α + 1 ( ρ 1 ) 2 4 ρ
+ log e 3 ( α + 1 ) exp 2 D 3 ( Q U n ) 1
where (A258) follows from (65), and (A259) holds due to (159). The second term in the right side of (A259) is next upper bounded: for all Q ∈ P_n(ρ),
D 3 ( Q U n ) = 1 2 log 1 + 6 D A ( 3 ) ( Q U n )
1 2 log 1 + 6 Δ ( 3 , ρ )
= 1 2 log 4 ( ρ 3 1 ) 3 27 ( ρ 1 ) ( ρ ρ 3 ) 2
= 1 2 log 4 ( ρ 2 + ρ + 1 ) 3 27 ρ 2 ( ρ + 1 ) 2
where (A260) holds by setting α = 3 in (156); (A261) follows from (135), (138) and (145); (A262) holds by setting α = 3 in (149); finally, (A263) follows from the factorizations
$(\rho^3-1)^3 = (\rho-1)^3 (\rho^2+\rho+1)^3, \qquad (\rho-1)(\rho-\rho^3)^2 = (\rho-1)^3 \rho^2 (\rho+1)^2.$
Substituting the bound in the right side of (A263) into the second term of the bound on the right side of (A259) implies that, for all Q P n ( ρ ) ,
D f α ( Q U n ) log ( α + 1 ) + 3 2 log e log e α + 1 ( ρ 1 ) 2 4 ρ
+ log e 3 ( α + 1 ) 4 ( ρ 2 + ρ + 1 ) 3 27 ρ 2 ( ρ + 1 ) 2 1 = log ( α + 1 ) + 3 2 log e log e α + 1 ( ρ 1 ) 2 4 ρ
+ log e 81 ( α + 1 ) ( ρ 1 ) ( 2 ρ + 1 ) ( ρ + 2 ) ρ ( ρ + 1 ) 2 ,
which therefore gives (176) by maximizing the left side of (A264) over Q P n ( ρ ) , and letting n tend to infinity (see (174)).
We next derive the upper bound in (177). The second derivative of the convex function f α : ( 0 , ) R in (55) is upper bounded over the interval 1 ρ , ρ by the positive constant M = 2 log ( α + ρ ) + 3 log e . From (96), it follows that for all Q P n ( ρ ) (with ρ 1 and an integer n 2 ) and α e 3 2 ,
$D_{f_\alpha}(Q \| U_n) \le \Bigl[\log(\alpha+\rho) + \tfrac32 \log e\Bigr] \frac{(\rho-1)^2}{4\rho},$
which, from (174), yields (177).
We finally derive the upper bound in (180) by loosening the bound in (176). The upper bound in the right side of (176) can be rewritten as
Φ ( α , ρ ) 1 4 log ( α + 1 ) + 3 8 log e ( ρ 1 ) 2 ρ + log e α + 1 1 81 2 + 2 ρ + 1 1 + ρ 2 1 4 ρ ( ρ 1 ) 2 .
For all ρ 1 ,
1 81 2 + 2 ρ + 1 1 + ρ 2 1 4 ρ 4 81 ,
which can be verified by showing that the left side of (A268) is monotonically increasing in ρ over the interval [ 1 , ) , and it tends to 4 81 as we let ρ . Furthermore, for all ρ 1 ,
$\frac{(\rho-1)^2}{\rho} \le \min\bigl\{\rho-1,\; (\rho-1)^2\bigr\},$
In view of inequalities (A268) and (A269), one gets (180) from (A267) (where the latter is an equivalent form of (176)).

Appendix K. Proof of Theorem 10

We start by proving Item (a). In view of the variational representation of f-divergences (see ([70], Theorem 2.1), and ([71], Lemma 1)), if f : ( 0 , ) R is convex with f ( 1 ) = 0 , and P and Q are probability measures defined on a set A , then
$D_f(P \| Q) = \sup_{g: \mathcal{A} \to \mathbb{R}} \Bigl\{ \mathbb{E}\bigl[g(X)\bigr] - \mathbb{E}\bigl[\bar f\bigl(g(Y)\bigr)\bigr] \Bigr\},$
where X ∼ P and Y ∼ Q, and the supremum is taken over all measurable functions g under which the expectations are finite.
Let P P n ( ρ ) , with ρ > 1 , and let Q : = U n ; these probability mass functions are defined on the set A n : = { 1 , , n } , and it follows that
$u_f(n,\rho) \ge D_f(P \| U_n) \ge \mathbb{E}\bigl[g(X)\bigr] - \frac1n \sum_{i=1}^n \bar f\bigl(g(i)\bigr),$
where (A271) holds by the definition in (77); (A272) holds due to (A270) with X P , and Y being an equiprobable random variable over A n . This gives (187).
We next prove Item (b). As above, let f : ( 0 , ) R be a convex function with f ( 1 ) = 0 . Let β Γ n ( ρ ) be a maximizer of the right side of (82). Then,
u f ( n , ρ ) = D f ( Q β U n )
= 1 n i = 1 n f n Q β ( i ) .
Let ε > 0 be selected arbitrarily. The convex conjugate of f̄ is f itself (i.e., repeating the convex conjugate operation (see (186)) twice on a convex function f returns f). From the convexity of f, it therefore follows that, for all t > 0, there exists x ∈ ℝ such that
$f(t) \le t\, x - \bar f(x) + \varepsilon.$
Let
t i : = n Q β ( i ) , i A n ,
let x : = x i ( ε ) R be selected to satisfy (A275) with t : = t i , and let the function g ε : A n R be defined as
g ε ( i ) = x i ( ε ) , i A n .
Consequently, it follows from (A275)–(A277) that for all such i
$f\bigl(n Q_\beta(i)\bigr) \le n Q_\beta(i)\, g_\varepsilon(i) - \bar f\bigl(g_\varepsilon(i)\bigr) + \varepsilon.$
Let P : = Q β P n ( ρ ) (see (80)), and X P . Then,
$u_f(n,\rho) = \frac1n \sum_{i=1}^n f\bigl(n Q_\beta(i)\bigr) \le \sum_{i=1}^n Q_\beta(i)\, g_\varepsilon(i) - \frac1n \sum_{i=1}^n \bar f\bigl(g_\varepsilon(i)\bigr) + \varepsilon = \mathbb{E}\bigl[g_\varepsilon(X)\bigr] - \frac1n \sum_{i=1}^n \bar f\bigl(g_\varepsilon(i)\bigr) + \varepsilon,$
where (A279) holds due to (A273) and (A274); (A280) follows from (A278); (A281) holds since by assumption P X = Q β . This gives (188).
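To illustrate the variational bound (187) concretely, the following sketch (not from the paper) takes f(t) = t log_e t, whose convex conjugate is f̄(x) = e^{x−1}; under this assumption, choosing g(i) = 1 + log_e(nP(i)) makes the bound tight, in line with the argument in Item (b).

```python
# Minimal sketch of the variational lower bound for D(P || U_n), assuming f(t) = t*ln(t)
# so that the convex conjugate is f_bar(x) = exp(x - 1).
import numpy as np

def kl_vs_uniform(P):
    n = len(P)
    return float(np.sum(P * np.log(n * P)))

def variational_bound(P, g):
    # E_P[g(X)] - (1/n) * sum_i f_bar(g(i))
    return float(np.sum(P * g) - np.mean(np.exp(g - 1.0)))

P = np.array([0.4, 0.3, 0.2, 0.1])            # any P in P_4(rho) with rho >= 4 works here
g_opt = 1.0 + np.log(len(P) * P)               # the maximizing g for this choice of f
g_sub = np.log(len(P) * P)                     # a suboptimal g, still a valid lower bound
print(kl_vs_uniform(P), variational_bound(P, g_opt), variational_bound(P, g_sub))
```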

Appendix L. Proof of Theorem 11

For y Y , let the L-size list of the decoder be given by L ( y ) = { x 1 ( y ) , , x L ( y ) } with L < M . Then, the (average) list decoding error probability is given by
P L = E P L ( Y )
where the conditional list decoding error probability, given that Y = y Y , is equal to
$P_L(y) = 1 - \sum_{\ell=1}^{L} P_{X|Y}\bigl(x_\ell(y) \,|\, y\bigr).$
For every y Y ,
$D_f\bigl(P_{X|Y}(\cdot|y) \,\|\, U_M\bigr) \ge D_f\biggl(\Bigl[\sum_{\ell=1}^L P_{X|Y}\bigl(x_\ell(y)|y\bigr),\; 1-\sum_{\ell=1}^L P_{X|Y}\bigl(x_\ell(y)|y\bigr)\Bigr] \,\Big\|\, \Bigl[\frac{L}{M},\; 1-\frac{L}{M}\Bigr]\biggr) = D_f\Bigl(\bigl[1-P_L(y),\; P_L(y)\bigr] \,\Big\|\, \Bigl[\frac{L}{M},\; 1-\frac{L}{M}\Bigr]\Bigr),$
where (A284) holds by the data-processing inequality for f-divergences, and since for every y Y
$\sum_{\ell=1}^L U_M\bigl(x_\ell(y)\bigr) = \sum_{\ell=1}^L \frac1M = \frac{L}{M};$
(A285) is due to (A283). Hence, it follows that
$\mathbb{E}\Bigl[D_f\bigl(P_{X|Y}(\cdot|Y) \,\|\, U_M\bigr)\Bigr] \ge \mathbb{E}\Bigl[D_f\Bigl(\bigl[1-P_L(Y),\, P_L(Y)\bigr] \,\Big\|\, \Bigl[\tfrac{L}{M},\, 1-\tfrac{L}{M}\Bigr]\Bigr)\Bigr] = \frac{L}{M}\, \mathbb{E}\Bigl[f\Bigl(\frac{M(1-P_L(Y))}{L}\Bigr)\Bigr] + \Bigl(1-\frac{L}{M}\Bigr) \mathbb{E}\Bigl[f\Bigl(\frac{M P_L(Y)}{M-L}\Bigr)\Bigr] \ge \frac{L}{M}\, f\Bigl(\frac{M\, \mathbb{E}[1-P_L(Y)]}{L}\Bigr) + \Bigl(1-\frac{L}{M}\Bigr) f\Bigl(\frac{M\, \mathbb{E}[P_L(Y)]}{M-L}\Bigr) = \frac{L}{M}\, f\Bigl(\frac{M(1-P_L)}{L}\Bigr) + \Bigl(1-\frac{L}{M}\Bigr) f\Bigl(\frac{M P_L}{M-L}\Bigr),$
where (A287) holds by taking expectations in (A284) and (A285) with respect to Y; (A288) holds by the definition of an f-divergence and the linearity of the expectation operator; (A289) follows from the convexity of f and Jensen's inequality; finally, (A290) holds by (A282).
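For f(t) = t log_e t, the bound (A290) becomes a Fano-type inequality for list decoding; the toy numerical check below (an illustration with a randomly drawn joint distribution, not from the paper) evaluates both sides of the inequality in nats.

```python
# Hedged toy check of (A290) with f(t) = t*ln(t):
#   E[D(P_{X|Y} || U_M)] >= (1-P_L)*ln(M*(1-P_L)/L) + P_L*ln(M*P_L/(M-L)).
import numpy as np

rng = np.random.default_rng(0)
M, K, L = 8, 5, 2                              # |X| = M, |Y| = K, fixed list size L < M
P_XY = rng.random((M, K)); P_XY /= P_XY.sum()  # a random joint distribution
P_Y = P_XY.sum(axis=0)
P_X_given_Y = P_XY / P_Y                       # columns are P_{X|Y}(.|y)

# decoder: the L most probable x given each y (any other list would also satisfy the bound)
PL_y = 1.0 - np.sort(P_X_given_Y, axis=0)[::-1][:L].sum(axis=0)
P_L = float(np.sum(P_Y * PL_y))

lhs = float(np.sum(P_Y * np.sum(P_X_given_Y * np.log(M * P_X_given_Y), axis=0)))
rhs = (1 - P_L) * np.log(M * (1 - P_L) / L) + P_L * np.log(M * P_L / (M - L))
print(lhs, rhs, lhs >= rhs)                    # the inequality should hold
```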

Appendix M. Proof of Corollary 3

Let α ( 0 , 1 ) ( 1 , ) , and let y Y . The proof starts by applying Theorem 11 in the setting where Y = y is deterministic, and the convex function f : ( 0 , ) R is given by f : = u α in (139), i.e.,
$f(t) = \frac{t^\alpha - \alpha(t-1) - 1}{\alpha(\alpha-1)}, \qquad t \ge 0.$
In this setting, (192) is specialized to
D f P X | Y ( · | y ) U M L M f M ( 1 P L ( y ) ) L + 1 L M f M P L ( y ) M L ,
where P L ( y ) is the conditional list decoding error probability given that Y = y . Substituting (A291) into the right side of (A292) gives
L M f M ( 1 P L ( y ) ) L + 1 L M f M P L ( y ) M L
= 1 α ( α 1 ) P L α ( y ) 1 L M 1 α + 1 P L ( y ) α L M 1 α 1
= 1 α ( α 1 ) exp ( α 1 ) d α P L ( y ) 1 L M 1 ,
where (A294) follows from (203). Substituting (A291) into the left side of (A292) gives
D f P X | Y ( · | y ) U M
= 1 M α ( α 1 ) x X M P X | Y ( x | y ) α α M P X | Y ( x | y ) 1 1
= 1 M α ( α 1 ) M α x X P X | Y α ( x | y ) α x X M P X | Y ( x | y ) 1 = 0 ( | X | = M ) M
= 1 α ( α 1 ) M α 1 x X P X | Y α ( x | y ) 1
= 1 α ( α 1 ) exp ( α 1 ) log M H α ( X | Y = y ) 1 .
Substituting (A294) and (A298) into the right and left sides of (A292), and rearranging terms while relying on the monotonicity property of an exponential function gives
$H_\alpha(X \,|\, Y=y) \le \log M - d_\alpha\Bigl(P_L(y) \,\Big\|\, 1-\frac{L}{M}\Bigr).$
We next obtain an upper bound on the Arimoto-Rényi conditional entropy.
H α ( X | Y )
= α 1 α log Y d P Y ( y ) exp 1 α α H α ( X | Y = y )
α 1 α log Y d P Y ( y ) exp 1 α α log M d α P L ( y ) 1 L M
= log M + α 1 α log Y d P Y ( y ) P L α ( y ) 1 L M 1 α + 1 P L ( y ) α L M 1 α 1 α
where (A300) holds due to (202); (A301) follows from (A299), and (A302) follows from (203). By ([42], Lemma 1), it follows that the integrand in the right side of (A302) is convex in P L ( y ) if α > 1 ; furthermore, it is concave in P L ( y ) if α ( 0 , 1 ) . Invoking Jensen’s inequality therefore yields (see (A282))
H α ( X | Y ) log M + α 1 α log P L α 1 L M 1 α + 1 P L α L M 1 α 1 α
= log M 1 α 1 log P L α 1 L M 1 α + 1 P L α L M 1 α
= log M d α P L 1 L M ,
where (A303) follows from Jensen’s inequality, and (A305) follows from (203). This proves (205) and (206) for all α ( 0 , 1 ) ( 1 , ) . The necessary and sufficient condition for (205) to hold with equality, as given in (207), follows from the proof of (A292) (see (A284)–(A286)), and from the use of Jensen’s inequality in (A303).
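The resulting inequality (205), in the form H_α(X|Y) ≤ log M − d_α(P_L ‖ 1 − L/M) derived above, can be checked numerically; the sketch below (a toy example, with the Arimoto–Rényi conditional entropy and the binary Rényi divergence written out in nats as reconstructed here) uses the top-L list decoder.

```python
# Hedged toy check of H_alpha(X|Y) <= log(M) - d_alpha(P_L || 1 - L/M), in nats.
import numpy as np

rng = np.random.default_rng(2)
M, K, L, alpha = 8, 4, 2, 2.0
P_XY = rng.random((M, K)); P_XY /= P_XY.sum()
P_Y = P_XY.sum(axis=0)
P_X_given_Y = P_XY / P_Y

def arimoto_renyi_cond_entropy(P_X_given_Y, P_Y, alpha):
    # Arimoto's definition: (alpha/(1-alpha)) * ln( sum_y P_Y(y) * ||P_{X|Y}(.|y)||_alpha )
    inner = (P_X_given_Y ** alpha).sum(axis=0) ** (1.0 / alpha)
    return (alpha / (1.0 - alpha)) * np.log(np.sum(P_Y * inner))

def d_binary_renyi(p, q, alpha):
    # binary Renyi divergence of order alpha, as reconstructed from (203) and (A293)-(A294)
    return np.log(p ** alpha * q ** (1 - alpha)
                  + (1 - p) ** alpha * (1 - q) ** (1 - alpha)) / (alpha - 1.0)

PL_y = 1.0 - np.sort(P_X_given_Y, axis=0)[::-1][:L].sum(axis=0)   # top-L list decoder
P_L = float(np.sum(P_Y * PL_y))
lhs = arimoto_renyi_cond_entropy(P_X_given_Y, P_Y, alpha)
rhs = np.log(M) - d_binary_renyi(P_L, 1.0 - L / M, alpha)
print(lhs, rhs, lhs <= rhs)
```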

Appendix N. Proof of Theorem 12

The proof of Theorem 12 relies on Theorem 1, and the proof of Theorem 11.
Let Z = {0,1} and, without any loss of generality, let X = {1,…,M}. For every y ∈ Y, define a deterministic transformation from X to Z such that every x ∈ L(y) is mapped to z = 0, and every x ∉ L(y) is mapped to z = 1. This corresponds to a conditional probability mass function, for every y ∈ Y, where W_{Z|X}^{(y)}(z|x) = 1 if either x ∈ L(y) and z = 0, or x ∉ L(y) and z = 1; otherwise, W_{Z|X}^{(y)}(z|x) = 0. Let L(y) := {x_1(y),…,x_L(y)} with L < M. Then, for every y ∈ Y, a conditional probability mass function P_{X|Y}(·|y) implies that
P Z ( y ) ( z ) : = x X P X | Y ( x | y ) W Z | X ( y ) ( z | x ) , z { 0 , 1 } ,
satisfies (see (A283))
$P_Z^{(y)}(0) = \sum_{\ell=1}^L P_{X|Y}\bigl(x_\ell(y)\,|\,y\bigr) = 1 - P_L(y), \qquad P_Z^{(y)}(1) = P_L(y).$
Under the deterministic transformation W Z | X ( y ) as above, the equiprobable distribution Q X ( y ) = U M (independently of y Y ) is mapped to a Bernoulli distribution over the two-elements set Z where
$Q_Z^{(y)} = \Bigl(\frac{L}{M},\; 1-\frac{L}{M}\Bigr), \qquad y \in \mathcal{Y}.$
Given Y = y Y , applying Theorem 1 with the transformation W Z | X ( y ) as above gives that
D f P X | Y ( · | y ) U M D f P Z ( y ) Q Z ( y ) + c f ξ 1 ( y ) , ξ 2 ( y ) χ 2 P X | Y ( · | y ) U M χ 2 P Z ( y ) Q Z ( y )
where, from (18) and (19),
$\xi_1(y) = \min_{x\in\mathcal{X}} \frac{P_{X|Y}(x|y)}{U_M(x)} = M \min_{x\in\mathcal{X}} P_{X|Y}(x|y), \qquad \xi_2(y) = \max_{x\in\mathcal{X}} \frac{P_{X|Y}(x|y)}{U_M(x)} = M \max_{x\in\mathcal{X}} P_{X|Y}(x|y).$
Since, from (212), (213), (A311) and (A312),
inf y Y ξ 1 ( y ) = M inf ( x , y ) X × Y P X | Y ( x | y ) = ξ 1 ,
sup y Y ξ 2 ( y ) = M sup ( x , y ) X × Y P X | Y ( x | y ) = ξ 2 ,
it follows from the definition of c f ( · , · ) in (26) that for every y Y
$c_f\bigl(\xi_1(y),\, \xi_2(y)\bigr) \ge c_f(\xi_1,\, \xi_2) = \tfrac12 \inf_{t \in I(\xi_1,\xi_2)} f''(t) \ge \tfrac12\, m_f,$
where the last inequality holds by the assumption in (211). Combining (A310) and (A315)–(A317) yields
D f P X | Y ( · | y ) U M D f P Z ( y ) Q Z ( y ) + 1 2 m f χ 2 P X | Y ( · | y ) U M χ 2 P Z ( y ) Q Z ( y ) ,
for every y Y . Hence,
E D f P X | Y ( · | Y ) U M E D f P Z ( Y ) Q Z ( Y ) + 1 2 m f E χ 2 P X | Y ( · | Y ) U M χ 2 P Z ( Y ) Q Z ( Y )
where (A319) holds by taking expectations with respect to Y on both sides of (A318).
Referring to the first term in the right side of (A319) gives
E D f P Z ( Y ) Q Z ( Y ) = E D f 1 P L ( Y ) , P L ( Y ) L M , 1 L M
L M f M 1 P L L + 1 L M f M P L M L ,
where (A320) follows from (A307)–(A309), and (A321) holds due to (A288)–(A290).
Referring to the second term in the right side of (A319) gives
E χ 2 P X | Y ( · | Y ) U M χ 2 P Z ( Y ) Q Z ( Y )
= E χ 2 P X | Y ( · | Y ) U M χ 2 1 P L ( Y ) , P L ( Y ) L M , 1 L M
= E M x X P X | Y 2 ( x | Y ) M 1 P L ( Y ) 2 L M P L 2 ( Y ) M L
= M E x X P X | Y 2 ( x | Y ) M L + 2 M L · E P L ( Y ) M L + M M L E P L 2 ( Y )
= M E x X P X | Y 2 ( x | Y ) M 1 2 P L L M 2 E P L 2 ( Y ) L ( M L ) ,
where (A322) follows from (A306)–(A309); (A323) follows from (A16)–(A18); (A325) is due to (A282). Furthermore, we get (since P L ( Y ) [ 0 , 1 ] )
$\mathbb{E}\bigl[P_L^2(Y)\bigr] \le \mathbb{E}\bigl[P_L(Y)\bigr] = P_L, \qquad \mathbb{E}\bigl[P_L^2(Y)\bigr] \ge \mathbb{E}^2\bigl[P_L(Y)\bigr] = P_L^2,$
and
$\mathbb{E}\Bigl[\sum_{x\in\mathcal{X}} P_{X|Y}^2(x\,|\,Y)\Bigr] = \int_{\mathcal{Y}} \mathrm{d}P_Y(y) \sum_{x\in\mathcal{X}} P_{X|Y}^2(x\,|\,y) = \int_{\mathcal{X}\times\mathcal{Y}} \mathrm{d}P_{XY}(x,y)\; P_{X|Y}(x\,|\,y) = \mathbb{E}\bigl[P_{X|Y}(X\,|\,Y)\bigr].$
Combining (A322)–(A330) gives
M E P X | Y ( X | Y ) 1 P L L P L M L +
E χ 2 P X | Y ( · | Y ) U M χ 2 P Z ( Y ) Q Z ( Y )
M E P X | Y ( X | Y ) 1 P L 2 L P L 2 M L ,
which provides tight upper and lower bounds on E χ 2 P X | Y ( · | Y ) U M χ 2 P Z ( Y ) Q Z ( Y ) if P L is small. Note that the lower bound on the left side of (A331) is non-negative since, by the data-processing inequality for the χ 2 divergence, the right side of (A331) should be non-negative (see (A306)–(A309)). Finally, combining (A319)–(A332) yields (214), which proves Item (a).
For proving Item (b), the upper bound on the left side of (A326) is tightened. If the list decoder selects the L most probable elements from X given the value of Y Y , then P L ( y ) 1 L M for every y Y . Hence, the bound in (A326) is replaced by the tighter bound
$\mathbb{E}\bigl[P_L^2(Y)\bigr] \le \Bigl(1-\frac{L}{M}\Bigr) P_L.$
Combining (A322)–(A325), (A328)–(A330) and (A333) gives the following improved lower bound in the left side of (A331):
M E P X | Y ( X | Y ) 1 P L L + E χ 2 P X | Y ( · | Y ) U M χ 2 P Z ( Y ) Q Z ( Y ) .
It is next shown that the operation ( · ) + in the left side of (A334) is redundant. From (A282) and (A283),
P L = 1 = 1 L E P X | Y x ( Y ) | Y
= 1 = 1 L Y d P Y ( y ) P X | Y x ( y ) | y
= 1 Y d P Y ( y ) = 1 L P X | Y x ( y ) | y ,
which then implies that
P L 1 L Y d P Y ( y ) = 1 L P X | Y 2 x ( y ) | y
1 L Y d P Y ( y ) x X P X | Y 2 x | y
1 L X × Y d P X Y ( x , y ) P X | Y ( x | y )
= 1 L E P X | Y ( X | Y ) ,
where (A338) is due to the Cauchy-Schwarz inequality applied to the right side of (A337), and (A339) holds since L(y) ⊆ X for all y ∈ Y. From (A335)–(A341), E P X | Y ( X | Y ) 1 P L L , which implies that the operation (·)_+ in the left side of (A334) is indeed redundant. Similarly to the proof of (214) (see (A319)–(A321)), (A334) yields (215) while ignoring the operation (·)_+ in the left side of (A334).

Appendix O. Proof of Theorem 13

For every y Y , let the M elements of X be sorted in decreasing order according to the conditional probabilities P X | Y ( · | y ) . Let x ( y ) be the -th most probable element in X given Y = y , i.e.,
P X | Y ( x 1 ( y ) | y ) P X | Y ( x 2 ( y ) | y ) P X | Y ( x M ( y ) | y ) .
The conditional list decoding error probability, given Y = y , satisfies
$P_L(y) \ge 1 - \sum_{\ell=1}^{|\mathcal{L}(y)|} P_{X|Y}\bigl(x_\ell(y)\,|\,y\bigr) =: P_L^{(\mathrm{opt})}(y),$
and the (average) list decoding error probability satisfies P L P L ( opt ) . Let U M denote the equiprobable distribution on X , and let g γ : [ 0 , ) R be given by g γ ( t ) : = ( t γ ) + with γ 1 , where u + : = max { u , 0 } for u R . The function g γ ( · ) is convex, and g γ ( 1 ) = 0 for γ 1 ; the f-divergence D g γ ( · · ) is named as the E γ divergence (see, e.g., [54]), i.e.,
$E_\gamma(P \| Q) := D_{g_\gamma}(P \| Q), \qquad \gamma \ge 1,$
for all probability measures P and Q. For every y Y ,
E γ P X | Y ( · | y ) U M E γ [ 1 P L ( opt ) ( y ) , P L ( opt ) ( y ) ] | L ( y ) | M , 1 | L ( y ) | M
= | L ( y ) | M · g γ M 1 P L ( opt ) ( y ) | L ( y ) | + 1 | L ( y ) | M g γ M P L ( opt ) ( y ) M | L ( y ) | ,
where (A346) holds due to the data-processing inequality for f-divergences, and because of (A344); (A347) holds due to (A345). Furthermore, in view of (A342) and (A344), it follows that M P_L^(opt)(y)/(M − |L(y)|) ≤ 1 for all y ∈ Y; by the definition of g_γ, it follows that
$g_\gamma\Bigl(\frac{M P_L^{(\mathrm{opt})}(y)}{M - |\mathcal{L}(y)|}\Bigr) = 0, \qquad \gamma \ge 1.$
Substituting (A348) into the right side of (A347) gives that, for all y Y ,
E γ P X | Y ( · | y ) U M | L ( y ) | M · g γ M 1 P L ( opt ) ( y ) | L ( y ) |
= 1 P L ( opt ) ( y ) γ | L ( y ) | M + .
Taking expectations with respect to Y in (A349) and (A350), and applying Jensen’s inequality to the convex function f ( u ) : = ( u ) + , for u R , gives
E E γ P X | Y ( · | Y ) U M E 1 P L ( opt ) ( Y ) γ | L ( Y ) | M +
1 E P L ( opt ) ( Y ) γ E | L ( Y ) | M +
= 1 P L ( opt ) γ E | L ( Y ) | M +
1 P L ( opt ) γ E | L ( Y ) | M .
On the other hand, the left side of (A351) is equal to
E E γ P X | Y ( · | Y ) U M
= E 1 M x X M P X | Y ( x | Y ) γ +
= E x X P X | Y ( x | Y ) γ M +
= 1 2 E x X P X | Y ( x | Y ) γ M + P X | Y ( x | Y ) γ M
= 1 2 E x X P X | Y ( x | Y ) γ M + 1 2 ( 1 γ ) ,
where (A355) is due to (A345), and since U M ( x ) = 1 M for all x X ; (A356) and (A357) hold, respectively, by the simple identities ( c u ) + = c u + , and u + = 1 2 ( | u | + u ) for c 0 and u R ; finally, (A358) holds since
x X P X | Y ( x | y ) γ M = γ + x X P X | Y ( x | y ) = 1 γ ,
for all y Y . Substituting (A355)–(A358) and rearranging terms gives that
$P_L \ge P_L^{(\mathrm{opt})} \ge \frac{1+\gamma}{2} - \frac{\gamma\, \mathbb{E}\bigl[|\mathcal{L}(Y)|\bigr]}{M} - \frac12\, \mathbb{E}\biggl[\sum_{x\in\mathcal{X}} \Bigl|P_{X|Y}(x\,|\,Y) - \frac{\gamma}{M}\Bigr|\biggr],$
which is the lower bound on the list decoding error probability in (222).
We next proceed to prove the sufficient conditions for equality in (222). First, if for all y Y , the list decoder selects the | L ( y ) | most probable elements in X given that Y = y , then equality holds in (A359). In this case, for all y Y , L ( y ) : = { x 1 ( y ) , , x | L ( y ) | } where x ( y ) denotes the -th most probable element in X , given Y = y , with ties in probabilities which are resolved arbitrarily (see (A342)). Let γ 1 . If, for every y Y , P X | Y ( x ( y ) | y ) is fixed for all { 1 , , | L ( y ) | } and P X | Y ( x ( y ) | y ) is fixed for all { | L ( y ) | + 1 , , M } , then equality holds in (A346) (and therefore equalities also hold in (A349) and (A351)). For all y Y , let the common values of the conditional probabilities P X | Y ( · | y ) over each of these two sets, respectively, be equal to α ( y ) and β ( y ) . Then,
$\alpha(y)\, |\mathcal{L}(y)| + \beta(y)\, \bigl(M - |\mathcal{L}(y)|\bigr) = \sum_{x\in\mathcal{X}} P_{X|Y}(x\,|\,y) = 1,$
which gives the condition in (223). Furthermore, if for all y Y , 1 P L ( opt ) ( y ) γ | L ( y ) | M 0 , then the operation ( · ) + in the right side of (A351) is redundant, which causes (A352) to hold with equality as an expectation of a linear function; furthermore, also (A354) holds with equality in this case (since an expectation of a non-negative and bounded function is non-negative and finite). By (223) and (A344), it follows that P L ( opt ) ( y ) = 1 α ( y ) | L ( y ) | for all y Y , and therefore the satisfiability of (224) implies that equalities hold in (A352) and (A354). Overall, under the above condition, it therefore follows that (222) holds with equality. To verify it explicitly, under conditions (223) and (224) which have been derived as above, the right side of (222) satisfies
1 + γ 2 γ E [ | L ( Y ) | ] M 1 2 E x X P X | Y ( x | Y ) γ M = 1 + γ 2 γ E [ | L ( Y ) | ] M 1 2 E α ( Y ) γ M | L ( Y ) | + γ M 1 α ( Y ) | L ( Y ) | M | L ( Y ) | M | L ( Y ) |
= 1 E α ( Y ) | L ( Y ) |
= E 1 = 1 | L ( Y ) | P X | Y x ( Y ) | Y
= P L ,
where (A361) holds since, under (224), it follows that 0 1 α ( Y ) | L ( Y ) | M | L ( Y ) | 1 M γ M for all γ 1 ; (A362) holds by straightforward algebra, where γ is canceled out; (A363) holds by the condition in (223); finally, (A364) holds by (A282), (A283) and (A342). This indeed explicitly verifies that the conditions in Theorem 13 yield an equality in (222).
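A toy numerical check (not from the paper) of the lower bound (222), in the reconstructed form P_L ≥ (1+γ)/2 − γE[|L(Y)|]/M − ½ E[Σ_x |P_{X|Y}(x|Y) − γ/M|], for a decoder with a variable list size.

```python
# Hedged check of the E_gamma-based lower bound (222) for any gamma >= 1.
import numpy as np

rng = np.random.default_rng(1)
M, K = 8, 4
P_XY = rng.random((M, K)); P_XY /= P_XY.sum()
P_Y = P_XY.sum(axis=0)
P_X_given_Y = P_XY / P_Y

list_sizes = np.array([1, 2, 3, 2])                       # |L(y)| for each y (variable list size)
sorted_post = np.sort(P_X_given_Y, axis=0)[::-1]          # each column sorted in decreasing order
PL_y = np.array([1.0 - sorted_post[:list_sizes[j], j].sum() for j in range(K)])
P_L = float(np.sum(P_Y * PL_y))                           # error probability of the top-|L(y)| decoder

gamma = 1.5
bound = ((1.0 + gamma) / 2.0
         - gamma * float(np.sum(P_Y * list_sizes)) / M
         - 0.5 * float(np.sum(P_Y * np.abs(P_X_given_Y - gamma / M).sum(axis=0))))
print(P_L, bound, P_L >= bound)
```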

Appendix P. Proofs of Theorems Related to Tunstall Trees

Appendix P.1. Proof of Theorem 14

Theorem 14 (a) follows from (226) (see ([38], Corollary 1)).
By ([72], Lemma 6), the ratio of the maximal to minimal positive masses of P is upper bounded by the reciprocal of the minimal probability mass of the source symbols. Theorem 14 (b) is therefore obtained from Theorem 7 (c). Theorem 14 (c) consequently holds due to Theorem 7 (d); the bound in the right side of (233), which holds for every number of leaves n in the Tunstall tree, is equal to the limit of the upper bound in the right side of (232) when we let n .
Theorem 14 (d) relies on ([16], Theorem 11) and the definition in (231), providing an integral representation of an f-divergence in (234) under the conditions in Item (d).

Appendix P.2. Proof of Theorem 15

In view of ([33], Theorem 4), if the fixed length of the codewords of the Tunstall code is equal to m, then the compression rate R of the code satisfies
R log | X | n H ( P ) log | X | n ρ log ρ ρ 1 log e ρ log e ρ ρ 1 1 log | X | ,
where H ( P ) denotes the Shannon entropy of the memoryless and stationary discrete source, ρ : = 1 p min , n is the number of leaves in Tunstall tree, and the logarithms with an unspecified base can be taken on an arbitrary base in the right side of (A365). By the setting in Theorem 15, the construction of the Tunstall tree satisfies n | X | m < n + ( D 1 ) . Hence, if D = 2 , then log | X | n = m ; if D > 2 , then log | X | n = m (since the length of the codewords is m), and log | X | n > m + log | X | 1 D 1 | X | m . Combining this with (A365) yields
R { m H ( P ) m + log 1 D 1 | X | m ρ log ρ ρ 1 log e ρ log e ρ ρ 1 1 log | X | , if   D > 2 , m H ( P ) m ρ log ρ ρ 1 log e ρ log e ρ ρ 1 1 log | X | , if   D = 2 .
In order to assert that R ( 1 + ε ) H ( P ) , it is requested that the right side of (A366) does not exceed ( 1 + ε ) H ( P ) . This gives
$\frac{\rho \log \rho}{\rho-1} - \log e - \log\Bigl(\frac{\rho \log_e \rho}{\rho-1}\Bigr) \le d\, \log e,$
where d is given in (235). In view of the exemplification of Theorem 7 for the relative entropy in Section 3.3.2, and the related analysis in Appendix I, the condition in (A367) is equivalent to ρ ≤ ρ_max^(1)(d) where ρ_max^(1)(d) is defined in (171). Since p_min = 1/ρ, this leads to the sufficient condition in (236) for the requested compression rate R of the Tunstall code.
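Assuming the threshold d has already been computed from (235) (not reproduced here; it depends on ε, m, D and |X|), the sufficient condition (236), as reconstructed here, reads p_min ≥ 1/ρ_max^(1)(d); a minimal sketch using the Lambert-W expression of (171):

```python
# Hedged sketch: smallest admissible p_min = 1 / rho_max^(1)(d) for a given threshold d from (235).
import numpy as np
from scipy.special import lambertw

def p_min_threshold(d):
    z = -np.exp(-d - 1.0)
    return float(lambertw(z, k=0).real / lambertw(z, k=-1).real)   # = 1 / rho_max^(1)(d)

print(p_min_threshold(0.05))   # the smallest admissible p_min for this (illustrative) value of d
```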

References

  1. Ali, S.M.; Silvey, S.D. A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. 1966, 28, 131–142. [Google Scholar] [CrossRef]
  2. Csiszár, I. Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Publ. Math. Inst. Hungar. Acad. Sci. 1963, 8, 85–108. (In German) [Google Scholar]
  3. Csiszár, I. A note on Jensen’s inequality. Studia Scientiarum Mathematicarum Hungarica 1966, 1, 185–188. [Google Scholar]
  4. Csiszár, I. Information-type measures of difference of probability distributions and indirect observations. Studia Scientiarum Mathematicarum Hungarica 1967, 2, 299–318. [Google Scholar]
  5. Csiszár, I. On topological properties of f-divergences. Studia Scientiarum Mathematicarum Hungarica 1967, 2, 329–339. [Google Scholar]
  6. Csiszár, I. A class of measures of informativity of observation channels. Periodica Mathematicarum Hungarica 1972, 2, 191–213. [Google Scholar] [CrossRef]
  7. Morimoto, T. Markov processes and the H-theorem. J. Phys. Soc. Jpn. 1963, 18, 328–331. [Google Scholar] [CrossRef]
  8. Liese, F.; Vajda, I. Convex Statistical Distances. In Teubner-Texte Zur Mathematik; Springer: Leipzig, Germany, 1987; Volume 95. [Google Scholar]
  9. Pardo, L. Statistical Inference Based on Divergence Measures; Chapman and Hall/CRC, Taylor & Francis Group: Boca Raton, FL, USA, 2006. [Google Scholar]
  10. Pardo, M.C.; Vajda, I. About distances of discrete distributions satisfying the data processing theorem of information theory. IEEE Trans. Inf. Theory 1997, 43, 1288–1293. [Google Scholar] [CrossRef]
  11. Stummer, W.; Vajda, I. On divergences of finite measures and their applicability in statistics and information theory. Statistics 2010, 44, 169–187. [Google Scholar] [CrossRef]
  12. Vajda, I. Theory of Statistical Inference and Information; Kluwer Academic Publishers: Dordrecht, The Netherlands, 1989. [Google Scholar]
  13. Ziv, J.; Zakai, M. On functionals satisfying a data-processing theorem. IEEE Trans. Inf. Theory 1973, 19, 275–283. [Google Scholar] [CrossRef]
  14. Zakai, M.; Ziv, J. A generalization of the rate-distortion theory and applications. In Information Theory—New Trends and Open Problems; Longo, G., Ed.; Springer: Berlin/Heidelberg, Germany, 1975; pp. 87–123. [Google Scholar]
  15. Merhav, N. Data processing theorems and the second law of thermodynamics. IEEE Trans. Inf. Theory 2011, 57, 4926–4939. [Google Scholar] [CrossRef]
  16. Liese, F.; Vajda, I. On divergences and informations in statistics and information theory. IEEE Trans. Inf. Theory 2006, 52, 4394–4412. [Google Scholar] [CrossRef]
  17. Csiszár, I.; Körner, J. Information Theory: Coding Theorems for Discrete Memoryless Systems, 2nd ed.; Cambridge University Press: Cambridge, UK, 2011. [Google Scholar]
  18. Ahlswede, R.; Gács, P. Spreading of sets in product spaces and hypercontraction of the Markov operator. Ann. Probab. 1976, 4, 925–939. [Google Scholar] [CrossRef]
  19. Calmon, F.P.; Polyanskiy, Y.; Wu, Y. Strong data processing inequalities for input constrained additive noise channels. IEEE Trans. Inf. Theory 2018, 64, 1879–1892. [Google Scholar] [CrossRef]
  20. Cohen, J.E.; Iwasa, Y.; Rautu, Gh.; Ruskai, M.B.; Seneta, E.; Zbăganu, G. Relative entropy under mappings by stochastic matrices. Linear Algebra Appl. 1993, 179, 211–235. [Google Scholar] [CrossRef]
  21. Cohen, J.E.; Kemperman, J.H.B.; Zbăganu, Gh. Comparison of Stochastic Matrices with Applications in Information Theory, Statistics, Economics and Population Sciences; Birkhäuser: Boston, MA, USA, 1998. [Google Scholar]
  22. Makur, A.; Polyanskiy, Y. Comparison of channels: Criteria for domination by a symmetric channel. IEEE Trans. Inf. Theory 2018, 64, 5704–5725. [Google Scholar] [CrossRef]
  23. Polyanskiy, Y.; Wu, Y. Dissipation of information in channels with input constraints. IEEE Trans. Inf. Theory 2016, 62, 35–55. [Google Scholar] [CrossRef]
  24. Raginsky, M. Strong data processing inequalities and Φ-Sobolev inequalities for discrete channels. IEEE Trans. Inf. Theory 2016, 62, 3355–3389. [Google Scholar] [CrossRef]
  25. Polyanskiy, Y.; Wu, Y. Strong data processing inequalities for channels and Bayesian networks. In Convexity and Concentration; Carlen, E., Madiman, M., Werner, E.M., Eds.; Springer: Berlin/Heidelberg, Germany, 2017; Volume 161, pp. 211–249. [Google Scholar]
  26. Makur, A.; Zheng, L. Linear bounds between contraction coefficients for f-divergences. arXiv 2018, arXiv:1510.01844.v4. [Google Scholar]
  27. Pearson, K. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Lond. Edinb. Dublin Philos. Mag. J. Sci. 1900, 50, 157–175. [Google Scholar] [CrossRef]
  28. Neyman, J. Contribution to the theory of the χ2 test. In Proceedings of the First Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 13–18 August 1945 and 27–29 January 1946; University of California Press: Berkeley, CA, USA, 1949; pp. 239–273. [Google Scholar]
  29. Sarmanov, O.V. Maximum correlation coefficient (non-symmetric case). In Selected Translations in Mathematical Statistics and Probability; American Mathematical Society: Providence, RI, USA, 1962. [Google Scholar]
  30. Marshall, A.W.; Olkin, I.; Arnold, B.C. Inequalities: Theory of Majorization and Its Applications, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2011. [Google Scholar]
  31. Steele, J.M. The Cauchy-Schwarz Master Class; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
  32. Bhatia, R. Matrix Analysis; Springer: Berlin/Heidelberg, Germany, 1997. [Google Scholar]
  33. Cicalese, F.; Gargano, L.; Vaccaro, U. Bounds on the entropy of a function of a random variable and their applications. IEEE Trans. Inf. Theory 2018, 64, 2220–2230. [Google Scholar] [CrossRef]
  34. Sason, I. Tight bounds on the Rényi entropy via majorization with applications to guessing and compression. Entropy 2018, 20, 896. [Google Scholar] [CrossRef]
  35. Ho, S.W.; Verdú, S. On the interplay between conditional entropy and error probability. IEEE Trans. Inf. Theory 2010, 56, 5930–5942. [Google Scholar] [CrossRef]
  36. Ho, S.W.; Verdú, S. Convexity/concavity of the Rényi entropy and α-mutual information. In Proceedings of the 2015 IEEE International Symposium on Information Theory, Hong Kong, China, 14–19 June 2015; pp. 745–749. [Google Scholar]
  37. Corless, R.M.; Gonnet, G.H.; Hare, D.E.G.; Jeffrey, D.J.; Knuth, D.E. On the Lambert W function. Adv. Comput. Math. 1996, 5, 329–359. [Google Scholar] [CrossRef]
  38. Cicalese, F.; Gargano, L.; Vaccaro, U. A note on approximation of uniform distributions from variable-to-fixed length codes. IEEE Trans. Inf. Theory 2006, 52, 3772–3777. [Google Scholar] [CrossRef]
  39. Tsallis, C. Possible generalization of the Boltzmann-Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487. [Google Scholar] [CrossRef]
  40. Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 20 June–30 July 1960; University of California Press: Berkeley, CA, USA, 1961; Volume 1, pp. 547–561. [Google Scholar]
  41. Cicalese, F.; Gargano, L.; Vaccaro, U. Minimum-entropy couplings and their applications. IEEE Trans. Inf. Theory 2019, 65, 3436–3451. [Google Scholar] [CrossRef]
  42. Sason, I.; Verdú, S. Arimoto-Rényi conditional entropy and Bayesian M-ary hypothesis testing. IEEE Trans. Inf. Theory 2018, 64, 4–25. [Google Scholar] [CrossRef]
  43. Amari, S.; Nagaoka, H. Methods of Information Geometry; Oxford University Press: New York, NY, USA, 2000. [Google Scholar]
  44. Cichocki, A.; Amari, S.I. Families of Alpha- Beta- and Gamma- divergences: Flexible and robust measures of similarities. Entropy 2010, 12, 1532–1568. [Google Scholar] [CrossRef]
  45. Sason, I. On f-divergences: Integral representations, local behavior, and inequalities. Entropy 2018, 20, 383. [Google Scholar] [CrossRef]
  46. Fano, R.M. Class Notes for Course 6.574: Transmission of Information; MIT: Cambridge, MA, USA, 1952. [Google Scholar]
  47. Ahlswede, R.; Gács, P.; Körner, J. Bounds on conditional probabilities with applications in multi-user communication. Z. Wahrscheinlichkeitstheorie verw. Gebiete 1977, 34, 157–177, Correction in 1977, 39, 353–354. [Google Scholar] [CrossRef]
  48. Raginsky, M.; Sason, I. Concentration of measure inequalities in information theory, communications and coding: Third edition. In Foundations and Trends (FnT) in Communications and Information Theory; NOW Publishers: Delft, The Netherlands, 2019; pp. 1–266. [Google Scholar]
  49. Chen, X.; Guntuboyina, A.; Zhang, Y. On Bayes risk lower bounds. J. Mach. Learn. Res. 2016, 17, 7687–7744. [Google Scholar]
  50. Guntuboyina, A. Lower bounds for the minimax risk using f-divergences, and applications. IEEE Trans. Inf. Theory 2011, 57, 2386–2399. [Google Scholar] [CrossRef]
  51. Kim, Y.H.; Sutivong, A.; Cover, T.M. State amplification. IEEE Trans. Inf. Theory 2008, 54, 1850–1859. [Google Scholar] [CrossRef]
  52. Arimoto, S. Information measures and capacity of order α for discrete memoryless channels. In Topics in Information Theory—2nd Colloquium; Csiszár, I., Elias, P., Eds.; Colloquia Mathematica Societatis Janós Bolyai; Elsevier: Amsterdam, The Netherlands, 1977; Volume 16, pp. 41–52. [Google Scholar]
  53. Ahlswede, R.; Körner, J. Source coding with side information and a converse for degraded broadcast channels. IEEE Trans. Inf. Theory 1975, 21, 629–637. [Google Scholar] [CrossRef]
  54. Liu, J.; Cuff, P.; Verdú, S. Eγ resolvability. IEEE Trans. Inf. Theory 2017, 63, 2629–2658. [Google Scholar]
  55. Brémaud, P. Discrete Probability Models and Methods: Probability on Graphs and Trees, Markov Chains and Random Fields, Entropy and Coding; Springer: Basel, Switzerland, 2017. [Google Scholar]
  56. Tunstall, B.K. Synthesis of Noiseless Compression Codes. Ph.D. Thesis, Georgia Institute of Technology, Atlanta, GA, USA, 1967. [Google Scholar]
  57. DeGroot, M.H. Uncertainty, information and sequential experiments. Ann. Math. Stat. 1962, 33, 404–419. [Google Scholar] [CrossRef]
  58. Roberts, A.W.; Varberg, D.E. Convex Functions; Academic Press: Cambridge, MA, USA, 1973. [Google Scholar]
  59. Rockafellar, R.T. Convex Analysis; Princeton University Press: Princeton, NJ, USA, 1996. [Google Scholar]
  60. Collet, J.F. An exact expression for the gap in the data processing inequality for f-divergences. IEEE Trans. Inf. Theory 2019, 65, 4387–4391. [Google Scholar] [CrossRef]
  61. Bregman, L.M. The relaxation method of finding the common points of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 1967, 7, 200–217. [Google Scholar] [CrossRef]
  62. Sason, I.; Verdú, S. f-divergence inequalities. IEEE Trans. Inf. Theory 2016, 62, 5973–6006. [Google Scholar] [CrossRef]
  63. Gilardoni, G.L. On Pinsker’s and Vajda’s type inequalities for Csiszár’s f-divergences. IEEE Trans. Inf. Theory 2010, 56, 5377–5386. [Google Scholar] [CrossRef]
  64. Gibbs, A.L.; Su, F.E. On choosing and bounding probability metrics. Int. Stat. Rev. 2002, 70, 419–435. [Google Scholar] [CrossRef]
  65. Simic, S. Second and third order moment inequalities for probability distributions. Acta Math. Hung. 2018, 155, 518–532. [Google Scholar] [CrossRef]
  66. Van Erven, T.; Harremoës, P. Rényi divergence and Kullback–Leibler divergence. IEEE Trans. Inf. Theory 2014, 60, 3797–3820. [Google Scholar] [CrossRef]
  67. Pardo, M.C.; Vajda, I. On asymptotic properties of information-theoretic divergences. IEEE Trans. Inf. Theory 2003, 49, 1860–1868. [Google Scholar] [CrossRef]
  68. Beck, A. Introduction to Nonlinear Optimization: Theory, Algorithms and Applications with Matlab; SIAM-Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2014. [Google Scholar]
  69. Simic, S. On logarithmic convexity for differences of power means. J. Inequalities Appl. 2008, 2007, 037359. [Google Scholar] [CrossRef]
  70. Keziou, A. Dual representation of φ-divergences and applications. C. R. Math. 2003, 336, 857–862. [Google Scholar] [CrossRef]
  71. Nguyen, X.; Wainwright, M.J.; Jordan, M.I. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Trans. Inf. Theory 2010, 56, 5847–5861. [Google Scholar] [CrossRef]
  72. Jelinek, F.; Schneider, K.S. On variable-length-to-block coding. IEEE Trans. Inf. Theory 1972, 18, 765–774. [Google Scholar] [CrossRef]
Figure 1. The bounds in Theorem 2 applied to D f α ( R X n ( λ ) Q X n ) D f α ( R Y n ( λ ) Q Y n ) (vertical axis) versus λ [ 0 , 1 ] (horizontal axis). The f α -divergence refers to Theorem 5. The probability mass functions P X n and Q X n correspond, respectively, to discrete memoryless sources emitting n i.i.d. Bernoulli ( p ) and Bernoulli ( q ) symbols; the symbols are transmitted over BSC ( δ ) with ( α , p , q , δ ) = 1 , 1 4 , 1 2 , 0.110 . The bounds in the upper and middle plots are compared to the exact values, being computationally feasible for n = 1 and n = 10 , respectively. The upper, middle and lower plots correspond, respectively, to n = 1 , n = 10 , and n = 50 .
Figure 2. The upper bound in Theorem 4 applied to D f α ( R Y n ( λ ) Q Y n ) D f α ( R X n ( λ ) Q X n ) (see (125)–(127)) in the vertical axis versus λ [ 0 , 1 ] in the horizontal axis. The f α -divergence refers to Theorem 5. The probability mass functions P X i and Q X i are Bernoulli ( p ) and Bernoulli ( q ) , respectively, for all i { 1 , , n } with n uses of BSC ( δ ) , and parameters ( p , q , δ ) = 1 4 , 1 2 , 0.110 . The upper and middle plots correspond to n = 10 with α = 10 and α = 100 , respectively; the middle and lower plots correspond to α = 100 with n = 10 and n = 100 , respectively. The bounds in the upper and middle plots are compared to the exact values, being computationally feasible for n = 10 .
Figure 3. Plots of d f α ( p q ) , its upper and lower bounds in (61) and (65), respectively, and its asymptotic approximation in (66) for large values of α . The plots are shown as a function of α e 3 2 , 1000 . The upper and lower plots refer, respectively, to ( p , q ) = ( 0.1 , 0.9 ) and ( p , q ) = ( 0.2 , 0.8 ) .
Figure 4. Curves of the upper bound on the ratio of contraction coefficients μ f α ( Q X , W Y | X ) μ χ 2 ( Q X , W Y | X ) (see the right-side inequality of (130)) as a function of the parameter α e 3 2 . The curves correspond to different values of ξ in (131).
Figure 5. A comparison of the maximal values of ρ (minus 1) according to (171) and (172), asserting the satisfiability of the condition D ( Q U n ) d log e , with an arbitrary d > 0 , for all integers n 2 and probability mass functions Q supported on { 1 , , n } with q max q min ρ . The solid line refers to the necessary and sufficient condition which gives (171), and the dashed line refers to a stronger condition which gives (172).
Figure 6. A comparison of the exact expression of Φ ( α , ρ ) in (175), with α = 1 , and its three upper bounds in the right sides of (176), (177) and (180) (called ’Upper bound 1’ (dotted line), ’Upper bound 2’ (thin dashed line), and ’Upper bound 3’ (thick dashed line), respectively).
Figure 7. Curves of the upper bound on the measure d ω , n ( P ) in (233), valid for all n N , as a function of ω [ 0 , 1 ] for different values of ρ : = 1 p min .
Figure 8. Curves for the smallest values of p min , in the setup of Theorem 15, according to the condition in (236) (solid line) and the more restrictive condition in (237) (dashed line) for binary Tunstall codes which are used to compress memoryless and stationary binary sources.
Table 1. The lower bounds on P L in (193), (210) and (217), and its exact value for fixed list size L (see Example 1).
L    Exact P_L    (193)    (217)    (210)
1    0.500        0.353    0.353    0.444
2    0.250        0.178    0.178    0.190
3    0.125        0.065    0.072    5.34 × 10^−5
4    0.063        0        0.016    0
