Abstract
This paper is focused on the derivation of data-processing and majorization inequalities for f-divergences, and their applications in information theory and statistics. For the accessibility of the material, the main results are first introduced without proofs, followed by exemplifications of the theorems with further related analytical results, interpretations, and information-theoretic applications. One application refers to the performance analysis of list decoding with either fixed or variable list sizes; some earlier bounds on the list decoding error probability are reproduced in a unified way, and new bounds are obtained and exemplified numerically. Another application is related to a study of the quality of approximating a probability mass function, induced by the leaves of a Tunstall tree, by an equiprobable distribution. The compression rates of finite-length Tunstall codes are further analyzed for asserting their closeness to the Shannon entropy of a memoryless and stationary discrete source. Almost all the analysis is relegated to the appendices, which form the major part of this manuscript.
1. Introduction
Divergences are non-negative measures of dissimilarity between pairs of probability measures which are defined on the same measurable space. They play a key role in the development of information theory, probability theory, statistics, learning, signal processing, and other related fields. One important class of divergence measures is defined by means of convex functions f, and it is called the class of f-divergences. It unifies fundamental and independently-introduced concepts in several branches of mathematics such as the chi-squared test for the goodness of fit in statistics, the total variation distance in functional analysis, the relative entropy in information theory and statistics, and it is closely related to the Rényi divergence which generalizes the relative entropy. The class of f-divergences was introduced in the sixties by Ali and Silvey [1], Csiszár [2,3,4,5,6], and Morimoto [7]. This class satisfies pleasing features such as the data-processing inequality, convexity, continuity and duality properties, finding interesting applications in information theory and statistics (see, e.g., [4,6,8,9,10,11,12,13,14,15]).
This manuscript is a research paper which is focused on the derivation of data-processing and majorization inequalities for f-divergences, and a study of some of their potential applications in information theory and statistics. Preliminaries are next provided.
1.1. Preliminaries and Related Works
We provide here definitions and known results from the literature which serve as a background to the presentation in this paper. We first provide a definition for the family of f-divergences.
Definition 1
([16], p. 4398). Let P and Q be probability measures, let μ be a dominating measure of P and Q (i.e., ), and let and . The f-divergence from P to Q is given, independently of μ, by
where
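For concreteness, the standard form of this definition (following the conventions in [16]) reads
$$
D_f(P \| Q) \triangleq \int f\!\left(\frac{p}{q}\right) q \, \mathrm{d}\mu, \qquad p \triangleq \frac{\mathrm{d}P}{\mathrm{d}\mu}, \quad q \triangleq \frac{\mathrm{d}Q}{\mathrm{d}\mu},
$$
with the conventions
$$
f(0) \triangleq \lim_{t \downarrow 0} f(t), \qquad 0 \, f\!\left(\tfrac{0}{0}\right) \triangleq 0, \qquad 0 \, f\!\left(\tfrac{a}{0}\right) \triangleq a \lim_{u \to \infty} \frac{f(u)}{u} \quad \text{for } a > 0.
$$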
Definition 2.
Let be a probability distribution which is defined on a set , and that is not a point mass, and let be a stochastic transformation. The contraction coefficient for f-divergences is defined as
where, for all ,
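In the standard notation, with $P_Y$ and $Q_Y$ denoting the output distributions induced by the stochastic transformation $W_{Y|X}$ applied to $P_X$ and $Q_X$, respectively, the contraction coefficient of Definition 2 takes the form
$$
\mu_f\big(Q_X, W_{Y|X}\big) \triangleq \sup_{P_X : \, 0 < D_f(P_X \| Q_X) < \infty} \frac{D_f(P_Y \| Q_Y)}{D_f(P_X \| Q_X)}, \qquad P_Y \triangleq P_X W_{Y|X}, \quad Q_Y \triangleq Q_X W_{Y|X}.
$$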
The notation in (6) and (7), and also in (20), (21), (42), (43), (44) in the continuation of this paper, is consistent with the standard notation used in information theory (see, e.g., the first displayed equation after (3.2) in [17]).
Contraction coefficients for f-divergences play a key role in strong data-processing inequalities (see [18,19,20], ([21], Chapter II), [22,23,24,25,26]). The following are essential definitions and results which are related to maximal correlation and strong data-processing inequalities.
Definition 3.
The maximal correlation between two random variables X and Y is defined as
where the supremum is taken over all real-valued functions f and g such that
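In its standard form (the Hirschfeld–Gebelein–Rényi maximal correlation), Definition 3 reads
$$
\rho_{\mathrm{m}}(X; Y) \triangleq \sup_{f, g} \, \mathbb{E}\big[ f(X) \, g(Y) \big], \qquad \mathbb{E}[f(X)] = \mathbb{E}[g(Y)] = 0, \quad \mathbb{E}\big[f^2(X)\big] = \mathbb{E}\big[g^2(Y)\big] = 1.
$$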
Definition 4.
Pearson’s -divergence [27] from P to Q is defined to be the f-divergence from P to Q (see Definition 1) with or for all ,
independently of the dominating measure μ (i.e., , e.g., ).
Neyman’s -divergence [28] from P to Q is the Pearson’s -divergence from Q to P, i.e., it is equal to
with or for all .
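In the discrete setting, the two divergences of Definition 4 take the familiar forms
$$
\chi^2(P \| Q) = \sum_{x} \frac{\big(P(x) - Q(x)\big)^2}{Q(x)}, \qquad \chi^2(Q \| P) = \sum_{x} \frac{\big(P(x) - Q(x)\big)^2}{P(x)},
$$
obtained with $f(t) = (t-1)^2$ or, equivalently (up to an affine term that does not affect the f-divergence), $f(t) = t^2 - 1$.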
Proposition 1
(([24], Theorem 3.2), [29]). The contraction coefficient for the -divergence satisfies
with and (see (7)).
Proposition 2
([25], Theorem 2). Let be convex and twice continuously differentiable with and . Then, for any that is not a point mass,
i.e., the contraction coefficient for the -divergence is the minimal contraction coefficient among all f-divergences with f satisfying the above conditions.
Remark 1.
The following result provides an upper bound on the contraction coefficient for a subclass of f-divergences in the finite alphabet setting.
Proposition 3
([26], Theorem 8). Let be a continuous convex function which is three times differentiable at unity with and , and let it further satisfy the following conditions:
- (a)
- (b)
- The function , given by for all , is concave.
Then, for a probability mass function supported over a finite set ,
For the presentation of our majorization inequalities for f-divergences and related entropy bounds (see Section 2.3), essential definitions and basic results are next provided (see, e.g., [30], ([31], Chapter 13) and ([32], Chapter 2)). Let P be a probability mass function defined on a finite set , let be the maximal mass of P, and let be the sum of the k largest masses of P for (hence, it follows that and ).
Definition 5.
Consider discrete probability mass functions P and Q defined on a finite set . It is said that P is majorized by Q (or Q majorizes P), and it is denoted by , if for all (recall that ).
A unit mass majorizes any other distribution; on the other hand, the equiprobable distribution on a finite set is majorized by any other distribution defined on the same set.
Definition 6.
Let denote the set of all the probability mass functions that are defined on . A function is said to be Schur-convex if for every such that , we have . Likewise, f is said to be Schur-concave if is Schur-convex, i.e., and imply that .
Characterization of Schur-convex functions is provided, e.g., in ([30], Chapter 3). For example, there exist some connections between convexity and Schur-convexity (see, e.g., ([30], Section 3.C) and ([32], Chapter 2.3)). However, a Schur-convex function is not necessarily convex ([32], Example 2.3.15).
Finally, what is the connection between data processing and majorization, and why are both types of inequalities considered in the same manuscript? The connection is given by the following fundamental and well-known result (see, e.g., ([32], Theorem 2.1.10), ([30], Theorem B.2) and ([31], Chapter 13)):
Proposition 4.
Let P and Q be probability mass functions defined on a finite set . Then, if and only if there exists a doubly-stochastic transformation (i.e., for all , and for all with ) such that . In other words, if and only if, in their representation as column vectors, there exists a doubly-stochastic matrix (i.e., a square matrix with non-negative entries such that the sum of each column and of each row in is equal to 1) such that .
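As a quick numerical illustration of Definition 5 and Proposition 4, the following Python snippet (a sketch with arbitrarily chosen numbers, not taken from this paper) applies a doubly-stochastic matrix W to a probability vector Q and verifies that P = WQ is majorized by Q by comparing the partial sums of the decreasingly ordered masses.

```python
import numpy as np

def majorizes(q, p, tol=1e-12):
    """Return True if p is majorized by q (p, q are pmfs of equal length)."""
    p_sorted = np.sort(p)[::-1]
    q_sorted = np.sort(q)[::-1]
    # P is majorized by Q iff, for every k, the sum of the k largest masses of P
    # does not exceed the corresponding sum for Q (the total sums are both 1).
    return bool(np.all(np.cumsum(p_sorted) <= np.cumsum(q_sorted) + tol))

# An arbitrary pmf Q and an arbitrary doubly-stochastic matrix W
# (all entries non-negative, each row and each column sums to 1).
Q = np.array([0.5, 0.3, 0.2])
W = np.array([[0.6, 0.3, 0.1],
              [0.3, 0.4, 0.3],
              [0.1, 0.3, 0.6]])
assert np.allclose(W.sum(axis=0), 1.0) and np.allclose(W.sum(axis=1), 1.0)

P = W @ Q                      # P = WQ, with P and Q viewed as column vectors (Proposition 4)
print(P, majorizes(Q, P))      # P sums to 1 and is majorized by Q
```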
1.2. Contributions
This paper is focused on the derivation of data-processing and majorization inequalities for f-divergences, and it applies these inequalities to information theory and statistics.
The starting point for obtaining strong data-processing inequalities in this paper is the derivation of lower and upper bounds on the difference where and denote, respectively, pairs of input and output probability distributions with a given stochastic transformation (i.e., where and ). These bounds are expressed in terms of the respective difference in Pearson’s or Neyman’s -divergence, and they hold for all f-divergences (see Theorems 1 and 2). By a different approach, we derive an upper bound on the contraction coefficient for f-divergences of a certain type, which gives an alternative strong data-processing inequality for the considered type of f-divergences (see Theorems 3 and 4). In this framework, a parametric subclass of f-divergences is introduced, its interesting properties are studied (see Theorem 5), all the data-processing inequalities derived in this paper are applied to this subclass, and these inequalities are exemplified numerically in order to examine their tightness (see Section 3.1).
This paper also derives majorization inequalities for f-divergences, part of which rely on the earlier data-processing inequalities (see Theorem 6). A different approach, which relies on the concept of majorization, serves to derive tight bounds on the maximal value of an f-divergence from a probability mass function P to an equiprobable distribution; the maximization is carried out over all P with a fixed finite support where the ratio of their maximal to minimal probability masses does not exceed a given value (see Theorem 7). These bounds lead to accurate asymptotic results which apply to general f-divergences, and they strengthen and generalize recent results of this type with respect to the relative entropy [33] and the Rényi divergence [34]. Furthermore, we explore in Theorem 7 the convergence rates to the asymptotic results. Data-processing and majorization inequalities also serve to strengthen the Schur-concavity property of the Tsallis entropy (see Theorem 8), showing by a comparison to earlier bounds in [35,36] that neither of these bounds supersedes the other. Further analytical results, which are related to the specialization of our central result on majorization inequalities in Theorem 7 to several important sub-classes of f-divergences, are provided in Section 3.2 (including Theorem 9). A quantity which is involved in our majorization inequalities in Theorem 7 is interpreted by relying on a variational representation of f-divergences (see Theorem 10).
As an application of the data-processing inequalities for f-divergences, the setup of list decoding is further studied, reproducing in a unified way some known bounds on the list decoding error probability, and deriving new bounds for fixed and variable list sizes (see Theorems 11–13).
As an application of the majorization inequalities in this paper, we study properties of a measure which is used to quantify the quality of approximating probability mass functions, induced by the leaves of a Tunstall tree, by an equiprobable distribution (see Theorem 14). An application of majorization inequalities for the relative entropy is used to derive a sufficient condition, expressed in terms of the principal and secondary real branches of the Lambert W function [37], for asserting the proximity of compression rates of finite-length (lossless and variable-to-fixed) Tunstall codes to the Shannon entropy of a memoryless and stationary discrete source (see Theorem 15).
1.3. Paper Organization
The paper is structured as follows: Section 2 provides our main new results on data-processing and majorization inequalities for f-divergences and related entropy measures. Illustration of the theorems in Section 2, and further mathematical results which follow from these theorems are introduced in Section 3. Applications in information theory and statistics are considered in Section 4. Proofs of all theorems are relegated to the appendices, which form a major part of this paper.
2. Main Results on f-Divergences
This section provides strong data-processing inequalities for f-divergences (see Section 2.1), followed by a study of a new subclass of f-divergences (see Section 2.2) which later serves to exemplify our data-processing inequalities. The third part of this section (see Section 2.3) provides majorization inequalities for f-divergences, and for the Tsallis entropy, whose derivation relies in part on the new data-processing inequalities.
2.1. Data-Processing Inequalities for f-Divergences
Strong data-processing inequalities are provided in the following, bounding the difference and ratio where and denote, respectively, pairs of input and output probability distributions with a given stochastic transformation.
Theorem 1.
Let and be finite or countably infinite sets, let and be probability mass functions that are supported on , and let
Let be a stochastic transformation such that for every , there exists with , and let (see (6) and (7))
Furthermore, let be a convex function with , and let the non-negative constant satisfy
where denotes the right-side derivative of f, and
Then,
- (a)
- (b)
- If f is twice differentiable on , then the largest possible coefficient in the right side of (22) is given by
- (c)
- (d)
- Under the assumption in Item (b), ifthen,
- (e)
- The lower and upper bounds in (24), (27), (32) and (33) are locally tight. More precisely, let be a sequence of probability mass functions defined on and pointwise converging to which is supported on , and let and be the probability mass functions defined on via (20) and (21) with inputs and , respectively. Suppose thatIf f has a continuous second derivative at unity, thenand these limits indicate the local tightness of the lower and upper bounds in Items (a)–(d).
Proof.
See Appendix A. □
An application of Theorem 1 gives the following result.
Theorem 2.
Let and be finite or countably infinite sets, let , and let and be random vectors taking values on and , respectively. Let and be the probability mass functions of discrete memoryless sources where, for all ,
with and supported on for all . Let each symbol be independently selected from one of the source outputs at time instant i with probabilities λ and , respectively, and let it be transmitted over a discrete memoryless channel with transition probabilities
Let be the probability mass function of the symbols at the channel input, i.e.,
let
and let be a convex and twice differentiable function with . Then,
- (a)
- (b)
- (c)
- If f has a continuous second derivative at unity, and for all , then
Proof.
See Appendix B. □
Remark 2.
In continuation of ([26], Theorem 8) (see Proposition 3 in Section 1.1), we next provide an upper bound on the contraction coefficient for a subclass of f-divergences (this subclass is different from the one addressed in ([26], Theorem 8)). Although the first part of the next result is stated for finite or countably infinite alphabets, it is clear from its proof that it also holds in the general alphabet setting. Connections to the literature are provided in Remarks A1–A3.
Theorem 3.
Let be a function which satisfies the following conditions:
- f is convex, differentiable at 1, , and ;
- The function , defined for all by , is convex.
Let and be non-identical probability mass functions which are defined on a finite or a countably infinite set , and let
where and are given in (18) and (19). Then, in the setting of (20) and (21),
Consequently, if is finitely supported on ,
Proof.
See Appendix C.1. □
Similarly to the extension of Theorem 1 to Theorem 2, an analogous extension of Theorem 3 leads to the following result.
Theorem 4.
Proof.
See Appendix C.2. □
2.2. A Subclass of f-Divergences
A subclass of f-divergences with interesting properties is introduced in Theorem 5. The data-processing inequalities in Theorems 2 and 4 are applied to these f-divergences in Section 3.
Theorem 5.
Let be given by
for all . Then,
- (a)
- is an f-divergence which is monotonically increasing and concave in α, and its first three derivatives are related to the relative entropy and -divergence as follows:
- (b)
- (c)
- where the function is defined aswhich is monotonically increasing in α, satisfying for all , and it tends to infinity as we let . Consequently, unless ,
- (d)
- (e)
- For every and a pair of probability mass functions where , there exists such that for all
- (f)
- If a sequence of probability measures converges to a probability measure Q such thatwhere for all sufficiently large n, then
- (g)
- If , then
- (h)
Proof.
See Appendix D. □
2.3. f-Divergence Inequalities via Majorization
Let denote an equiprobable probability mass function on for an arbitrary , i.e., for all . By majorization theory and Theorem 1, the next result strengthens the Schur-convexity property of the f-divergence (see ([38], Lemma 1)).
Theorem 6.
Let P and Q be probability mass functions which are supported on , and suppose that . Let be twice differentiable and convex with , and let and be, respectively, the maximal and minimal positive masses of Q. Then,
Proof.
See Appendix E. □
The next result provides upper and lower bounds on f-divergences from any probability mass function to an equiprobable distribution. It relies on majorization theory, and it follows in part from Theorem 6.
Theorem 7.
Let denote the set of all the probability mass functions that are defined on . For , let be the set of all which are supported on with , and let be a convex function with . Then,
- (a)
- The set , for any , is a non-empty, convex and compact set.
- (b)
- For a given , which is supported on , the f-divergences and attain their maximal values over the set .
- (c)
- For and an integer , letletand let the probability mass function be defined on the set as follows:whereThen,
- (d)
- For and an integer , let the non-negative function be given by
- (e)
- (f)
- (g)
- Let , and let be an integer. Then,Furthermore, if , f is differentiable on , and , then, for every ,
- (h)
- For , let the function f be also twice differentiable, and let M and m be constants such that the following condition holds:
- (i)
- Let . If for all , then for all , if
Proof.
See Appendix F. □
The Tsallis entropy was introduced in [39] as a generalization of the Shannon entropy (as is the Rényi entropy [40]), and it was applied there to statistical physics.
Definition 7
([39]). Let be a probability mass function defined on a discrete set . The Tsallis entropy of order of X, denoted by or , is defined as
where . The Tsallis entropy is continuously extended at orders 0, 1, and ∞; at order 1, it coincides with the Shannon entropy to base e (expressed in nats).
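In its standard form, the Tsallis entropy of order $\alpha \in (0,1) \cup (1,\infty)$ reads
$$
S_\alpha(X) = \frac{1}{\alpha - 1} \left( 1 - \sum_{x \in \mathcal{X}} P_X^\alpha(x) \right),
$$
and letting $\alpha \to 1$ recovers the Shannon entropy in nats, $S_1(X) = -\sum_{x \in \mathcal{X}} P_X(x) \ln P_X(x)$.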
Theorem 6 enables a strengthening of the Schur-concavity property of the Tsallis entropy (see ([30], Theorem 13.F.3.a.)) as follows.
Theorem 8.
Let P and Q be probability mass functions which are supported on a finite set, and let . Then, for all ,
Proof.
See Appendix G. □
Remark 4.
The lower bound in ([36], Theorem 1) also strengthens the Schur-concavity property of the Tsallis entropy. It can be verified that none of the lower bounds in ([36], Theorem 1) and Theorem 8 supersedes the other. For example, let , and let and be probability mass functions supported on with and where and . This yields . From (A233) (see Appendix G),
If , then , and the continuous extension of the lower bound in ([36], Theorem 1) at is specialized to the earlier result by the same authors in ([35], Theorem 3); it states that if , then . In contrast to (104), it can be verified that
which can be made arbitrarily large by selecting β to be sufficiently close to 1 (from above). This provides a case where the lower bound in Theorem 8 outperforms the one in ([35], Theorem 3).
Remark 5.
Due to the one-to-one correspondence between Tsallis and Rényi entropies of the same positive order, and similarly to the transition from ([36], Theorem 1) to ([36], Theorem 2), Theorem 8 also enables a strengthening of the Schur-concavity property of the Rényi entropy. For information-theoretic implications of the Schur-concavity of the Rényi entropy, the reader is referred to, e.g., [34], ([41], Theorem 3) and ([42], Theorem 11).
3. Illustration of the Main Results and Implications
3.1. Illustration of Theorems 2 and 4
We apply here the data-processing inequalities in Theorems 2 and 4 to the new class of f-divergences introduced in Theorem 5.
In the setup of Theorems 2 and 4, consider communication over a time-varying binary-symmetric channel (BSC). Consequently, let , and let
with and for every . Let the transition probabilities correspond to (i.e., a BSC with a crossover probability ), i.e.,
For all and , the probability mass function at the channel input is given by
with
where the probability mass function in (109) refers to a Bernoulli distribution with parameter . At the output of the time-varying BSC (see (42)–(44) and (107)), for all ,
where
with
The -divergence from to is given by
and since the probability mass functions , , and correspond to Bernoulli distributions with parameters , , and , respectively, Theorem 2 gives that
for all and . From (26), (31) and (55), we get that for all ,
and, from (47), (48) and (106), for all ,
provided that for some (otherwise, both f-divergences in the right side of (116) are equal to zero since and therefore for all i and ). Furthermore, from Item (c) of Theorem 2, for every and ,
and the lower and upper bounds in the left side of (116) and the right side of (117), respectively, are tight as we let , and they both coincide with the limit in the right side of (124).
Figure 1 illustrates the upper and lower bounds in (116) and (117) with , , and for all i, and . In the special case where are fixed for all i, the communication channel is a time-invariant BSC whose capacity is equal to bit per channel use.
Figure 1.
The bounds in Theorem 2 applied to (vertical axis) versus (horizontal axis). The -divergence refers to Theorem 5. The probability mass functions and correspond, respectively, to discrete memoryless sources emitting n i.i.d. and symbols; the symbols are transmitted over with . The bounds in the upper and middle plots are compared to the exact values, being computationally feasible for and , respectively. The upper, middle and lower plots correspond, respectively, to , , and .
In the upper and middle plots of Figure 1, where or , the exact values of the differences of the -divergences in the right side of (116) are calculated numerically and compared to the lower and upper bounds in the left side of (116) and the right side of (117), respectively. Since the -divergence does not tensorize, the computation of the exact value of each of the two -divergences in the right side of (116) involves a pre-computation of probabilities for each of the probability mass functions , , and ; this computation is prohibitively complex unless n is small enough.
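The tractability of the $\chi^2$-based bounds stems from the tensorization identity $\chi^2(P_1 \times \cdots \times P_n \,\|\, Q_1 \times \cdots \times Q_n) = \prod_{i=1}^n \big(1 + \chi^2(P_i \| Q_i)\big) - 1$, which avoids enumerating all $2^n$ sequence probabilities. The following Python sketch (with illustrative Bernoulli parameters, not the exact values used for Figure 1) verifies this identity against a brute-force computation for a small n.

```python
import itertools
import numpy as np

def chi2(p, q):
    """Pearson chi-squared divergence between two pmfs (assuming q > 0 everywhere)."""
    return float(np.sum((p - q) ** 2 / q))

def bernoulli(r):
    return np.array([1.0 - r, r])

# Illustrative per-letter Bernoulli parameters (assumed values, for demonstration only).
n = 8
P_i = [bernoulli(0.2) for _ in range(n)]
Q_i = [bernoulli(0.4) for _ in range(n)]

# Tensorization: chi-squared divergence of the product measures from the per-letter divergences.
chi2_tensor = np.prod([1.0 + chi2(p, q) for p, q in zip(P_i, Q_i)]) - 1.0

# Brute force over all 2^n binary sequences (feasible only for small n).
chi2_brute = 0.0
for x in itertools.product([0, 1], repeat=n):
    p = np.prod([P_i[i][xi] for i, xi in enumerate(x)])
    q = np.prod([Q_i[i][xi] for i, xi in enumerate(x)])
    chi2_brute += (p - q) ** 2 / q

print(chi2_tensor, chi2_brute)   # the two values coincide (up to rounding)
```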
We now apply the bound in Theorem 4. In view of (51), (54), (55) and (73), for all and ,
where and are given in (122) and (123), respectively, and for ,
Figure 2 illustrates the upper bound on (see (125)–(127)) as a function of . It refers to the case where , , and for all i (similarly to Figure 1). The upper and middle plots correspond to with and , respectively; the middle and lower plots correspond to with and , respectively. The bounds in the upper and middle plots are compared to their exact values since their numerical computations are feasible for . It is observed from the numerical comparisons for (see the upper and middle plots in Figure 2) that the upper bounds are informative, especially for large values of where the -divergence becomes closer to a scaled version of the -divergence (see Item (e) in Theorem 5).
Figure 2.
The upper bound in Theorem 4 applied to (see (125)–(127)) in the vertical axis versus in the horizontal axis. The -divergence refers to Theorem 5. The probability mass functions and are and , respectively, for all with n uses of , and parameters . The upper and middle plots correspond to with and , respectively; the middle and lower plots correspond to with and , respectively. The bounds in the upper and middle plots are compared to the exact values, being computationally feasible for .
3.2. Illustration of Theorems 3 and 5
Following the application of the data-processing inequalities in Theorems 2 and 4 to a class of f-divergences (see Section 3.1), some interesting properties of this class are introduced in Theorem 5.
Theorem 5 is illustrated in Figure 3, showing that is monotonically increasing as a function of (note that the concavity in is not reflected in these plots because the horizontal axis of is in logarithmic scaling). The binary divergence is also compared in Figure 3 with its lower and upper bounds in (61) and (65), respectively, illustrating that these bounds are both asymptotically tight for large values of . The asymptotic approximation of for large , expressed as a function of and (see (66)), is also depicted in Figure 3. The upper and lower plots in Figure 3 refer, respectively, to and ; a comparison of these plots shows a better match between the exact value of the binary divergence, its upper and lower bounds, and its asymptotic approximation when the values of p and q get closer.
In view of the results in (66) and (68), it is interesting to note that the asymptotic value of for large values of is also the exact scaling of this f-divergence for any finite value of when the probability mass functions P and Q are close enough to each other.
We next consider the ratio of the contraction coefficients where is finitely supported on and it is not a point mass (i.e., ), and is arbitrary. For all ,
where is given in (55), and
The left-side inequality in (130) is due to ([25], Theorem 2) (see Proposition 2), and the right-side inequality in (130) holds due to (53) and (73).
Figure 4 shows the upper bound on the ratio of contraction coefficients , as it is given in the right-side inequality of (130), as a function of the parameter . The curves in Figure 4 correspond to different values of , as it is given in (131); these upper bounds are monotonically decreasing in , and they asymptotically tend to 1 as we let . Hence, in view of the left-side inequality in (130), the upper bound on the ratio of the contraction coefficients (in the right-side inequality) is asymptotically tight in . The fact that the ratio of the contraction coefficients in the middle of (130) tends asymptotically to 1, as gets large, is not directly implied by Item (e) of Theorem 5. The latter implies that, for fixed probability mass functions P and Q and for sufficiently large ,
however, there is no guarantee that, for fixed Q and sufficiently large , the approximation in (132) holds for all P. By the upper bound in the right side of (130), it nevertheless follows that tends asymptotically (as we let ) to the contraction coefficient of the divergence.
3.3. Illustration of Theorem 7 and Further Results
Theorem 7 provides upper and lower bounds on an f-divergence, , from any probability mass function Q supported on a finite set of cardinality n to an equiprobable distribution over this set. In the following, we apply the exact formula for
to several important f-divergences. From (87),
Since f is a convex function on with , Jensen’s inequality implies that the function which is subject to maximization in the right side of (134) is non-negative over the interval . It is equal to zero at the endpoints of the interval , so the maximum over this interval is attained at an interior point. Note also that, in view of Items (d) and (e) of Theorem 7, the exact asymptotic expression in (134) satisfies
3.3.1. Total Variation Distance
By setting to zero the derivative of the function which is subject to maximization in the right side of (136), it can be verified that the maximizer over this interval is equal to , which implies that
3.3.2. Alpha Divergences
The class of Alpha divergences forms a parametric subclass of the f-divergences, which includes in particular the relative entropy, -divergence, and the squared-Hellinger distance. For , let
where is a non-negative and convex function with , which is defined for as follows (see ([8], Chapter 2), followed by studies in, e.g., [10,16,43,44,45]):
The functions and are defined in the right side of (139) by a continuous extension of at and , respectively. The following relations hold (see, e.g., ([44], (10)–(13))):
Setting to zero the derivative of the function which is subject to maximization in the right side of (147) gives
where it can be verified that for all and . Substituting (148) into the right side of (147) gives that, for all such and ,
By a continuous extension of in (149) at and , it follows that for all
Consequently, for all ,
where (151) holds due to (140); (152) is due to (146), and (153) holds due to (150). This sharpens the result in ([33], Theorem 2) for the relative entropy from the equiprobable distribution, , by showing that the bound in ([33], (7)) is asymptotically tight as we let . The result in ([33], Theorem 2) can be further tightened for finite n by applying the result in Theorem 7 (d) with for all (although, unlike the asymptotic result in (149), the refined bound for a finite n does not lend itself to a closed-form expression as a function of n; see also ([34], Remark 3), which provides such a refinement of the bound on for finite n in a different approach).
It should be noted that in view of the one-to-one correspondence between the Rényi divergence and the Alpha divergence of the same order where, for ,
the asymptotic result in (149) can be obtained from ([34], Lemma 4) and vice versa; however, in [34], the focus is on the Rényi divergence from the equiprobable distribution, whereas the result in (149) is obtained by specializing the asymptotic expression in (134) for a general f-divergence. Note also that the result in ([34], Lemma 4) is restricted to , whereas the result in (149) and (150) covers all values of .
In view of (146), (149), (153), (155), and the special cases of the Alpha divergences in (140)–(144), it follows that for all and for every integer
and the upper bounds on the right sides of (157)–(161) are asymptotically tight in the limit where n tends to infinity.
Theorem 9.
The function Δ satisfies the following properties:
- (a)
- For every , is a convex function of α over the real line, and it is symmetric around with a global minimum at .
- (b)
- The following inequalities hold:
- (c)
- For every , is monotonically increasing and continuous in , and .
Proof.
See Appendix H.1. □
Remark 6.
The symmetry of around (see Theorem 9 (a)) is not implied by the following symmetry property of the Alpha divergence around (see, e.g., ([8], p. 36)):
Relying on Theorem 9, the following corollary gives a similar result to (146) where the order of Q and in is switched.
Corollary 1.
For all and ,
Proof.
See Appendix H.2. □
We next further exemplify Theorem 7 for the relative entropy. Let for . Then, , so the bounds on the second derivative of f over the interval are given by and . Theorem 7 (h) gives the following bounds:
Furthermore, (96) gives that
which, for , is a looser bound in comparison to (167). It can be verified, however, that the dominant term in the Taylor series expansion (around ) of the right side of (167) coincides with the right side of (168), so the bounds scale similarly for small values of .
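As a reminder of the calculus step involved (with the natural logarithm): $f(t) = t \log t$ gives $f''(t) = 1/t$, which is monotonically decreasing, so on any interval $[a, b] \subset (0, \infty)$, standing here for the interval referred to in Theorem 7 (h),
$$
\min_{t \in [a, b]} f''(t) = \frac{1}{b}, \qquad \max_{t \in [a, b]} f''(t) = \frac{1}{a}.
$$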
Suppose that we wish to assert that, for every integer and for all probability mass functions , the condition
holds with a fixed . Due to the left side inequality in (89), this condition is equivalent to the requirement that
Due to the asymptotic tightness of the upper bound in the right side of (157) (as we let ), requiring that this upper bound is not larger than is necessary and sufficient for the satisfiability of (169) for all n and . This leads to the analytical solution with (see Appendix I)
where and denote, respectively, the principal and secondary real branches of the Lambert W function [37]. Requiring the stronger condition where the right side of (168) is not larger than leads to the sufficient solution with the simple expression
In comparison to in (171), in (172) is more insightful; these values nearly coincide for small values of , providing in that case the same range of possible values of for asserting the satisfiability of condition (169). As shown in Figure 5, for , the difference between the maximal values of in (171) and (172) is marginal, though in general for all .
Figure 5.
A comparison of the maximal values of (minus 1) according to (171) and (172), asserting the satisfiability of the condition , with an arbitrary , for all integers and probability mass functions Q supported on with . The solid line refers to the necessary and sufficient condition which gives (171), and the dashed line refers to a stronger condition which gives (172).
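The two real branches of the Lambert W function appearing in (171) can be evaluated numerically, e.g., with SciPy. The snippet below is only a generic illustration of how $W_0$ and $W_{-1}$ are computed; it does not reproduce the specific expressions in (171) and (172).

```python
import numpy as np
from scipy.special import lambertw

# W0 (principal branch) is real-valued on [-1/e, infinity);
# W_{-1} (secondary branch) is real-valued on [-1/e, 0).
x = -0.2
w0  = lambertw(x, k=0).real    # principal real branch
wm1 = lambertw(x, k=-1).real   # secondary real branch

# Sanity check: both branches satisfy w * exp(w) = x.
print(w0, wm1, w0 * np.exp(w0), wm1 * np.exp(wm1))
```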
3.3.3. The Subclass of f-Divergences in Theorem 5
This example refers to the subclass of f-divergences in Theorem 5. For these -divergences, with , substituting from (55) into the right side of (134) gives that for all
The exact asymptotic expression in the right side of (175) is subject to numerical maximization.
We next provide two alternative closed-form upper bounds, based on Theorems 5 and 7, and study their tightness. The two upper bounds, for all and , are given by (see Appendix J)
and
Suppose that we wish to assert that, for every integer and for all probability mass functions , the condition
holds with a fixed and . Due to (173)–(174) and the left side inequality in (89), the satisfiability of the latter condition is equivalent to the requirement that
In order to obtain a sufficient condition for to satisfy (179), expressed as an explicit function of and d, the upper bound in the right side of (176) is slightly loosened to
where
for all and . The upper bounds in the right sides of (176), (177) and (180) are derived in Appendix J.
In comparison to (179), the stronger requirement that the right side of (180) is less than or equal to gives the sufficient condition
with
Figure 6 compares the exact expression in (175) with its upper bounds in (176), (177) and (180). These bounds show a good match with the exact value, and neither of the bounds in (176) and (177) is superseded by the other; the bound in (180) is looser than (176), and it is derived for obtaining the closed-form solution in (183)–(185). The bound in (176) is tighter than the bound in (177) for small values of , whereas the latter bound outperforms the former for sufficiently large values of . It has been observed numerically that the tightness of the bounds is improved by increasing the value of , and that the range of parameters of over which the bound in (176) outperforms the bound in (177) is enlarged when is increased. It is also shown in Figure 6 that the bound in (176) and its loosened version in (180) almost coincide for sufficiently small values of (i.e., when is close to 1), and also for sufficiently large values of .
3.4. An Interpretation of in Theorem 7
We provide here an interpretation of in (77), for and an integer ; note that since . Before doing so, recall that (82) introduces an identity which significantly simplifies the numerical calculation of , and (85) gives (asymptotically tight) upper and lower bounds.
The following result relies on the variational representation of f-divergences.
Theorem 10.
Let be convex with , and let be the convex conjugate function of f (a.k.a. the Fenchel-Legendre transform of f), i.e.,
Let , and define for an integer . Then, the following holds:
- (a)
- For every , a random variable , and a function ,
- (b)
- There exists such that, for every , there is a function which satisfieswith .
Proof.
See Appendix K. □
Remark 7.
The proof suggests a constructive way to obtain, for an arbitrary , a function which satisfies (188).
4. Applications in Information Theory and Statistics
4.1. Bounds on the List Decoding Error Probability with f-Divergences
The minimum probability of error of a random variable X given Y, denoted by , can be achieved by a deterministic function (maximum-a-posteriori decision rule) (see [42]):
Fano’s inequality [46] gives an upper bound on the conditional entropy as a function of (or, equivalently, it provides a lower bound on as a function of ) when X takes a finite number of possible values.
The list decoding setting, in which the hypothesis tester is allowed to output a subset of given cardinality, and an error occurs if the true hypothesis is not in the list, is of great interest in information theory. A generalization of Fano’s inequality to list decoding, in conjunction with the blowing-up lemma ([17], Lemma 1.5.4), leads to strong converse results in multi-user information theory. This approach was initiated in ([47], Section 5) (see also ([48], Section 3.6)). The main idea behind the successful combination of these two tools is that, given a code, it is possible to blow up the decoding sets in a way that the probability of decoding error can be made as small as desired for sufficiently large blocklengths; since the blown-up decoding sets are no longer disjoint, the resulting setup is a list decoder with a sub-exponential list size (as a function of the blocklength).
In statistics, Fano-type lower bounds on Bayes and minimax risks, expressed in terms of f-divergences, are derived in [49,50].
In this section, we further study the setup of list decoding, and derive bounds on the average list decoding error probability. We first consider the special case where the list size is fixed (see Section 4.1.1), and then move to the more general case of a list size which depends on the channel observation (see Section 4.1.2).
4.1.1. Fixed-Size List Decoding
A generalization of Fano’s inequality for fixed-size list decoding is given in ([42], (139)), expressed as a function of the conditional Shannon entropy (strengthening ([51], Lemma 1)). A further generalization in this setup, which is expressed as a function of the Arimoto-Rényi conditional entropy with an arbitrary positive order (see Definition 9), is provided in ([42], Theorem 8).
The next result provides a generalized Fano’s inequality for fixed-size list decoding, expressed in terms of an arbitrary f-divergence. Some earlier results in the literature are reproduced from the next result, followed by its strengthening as an application of Theorem 1.
Theorem 11.
Let be a probability measure defined on with . Consider a decision rule , where stands for the set of subsets of with cardinality L, and is fixed. Denote the list decoding error probability by . Let denote an equiprobable probability mass function on . Then, for every convex function with ,
Proof.
See Appendix L. □
Remark 8.
The case where (i.e., a decoder with a single output) gives ([50], (5)).
As consequences of Theorem 11, we first reproduce some earlier results as special cases.
Corollary 2
([42] (139)). Under the assumptions in Theorem 11,
where denotes the binary relative entropy, defined as the continuous extension of for .
Proof.
Theorem 11 enables us to reproduce a result in [42] which generalizes Corollary 2. It relies on Rényi information measures, and we first provide definitions for a self-contained presentation.
Definition 8
([40]). Let be a probability mass function defined on a discrete set . The Rényi entropy of order of X, denoted by or , is defined as
The Rényi entropy is continuously extended at orders 0, 1, and ∞; at order 1, it coincides with the Shannon entropy .
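For reference, the standard form of the Rényi entropy of order $\alpha \in (0,1) \cup (1,\infty)$ is
$$
H_\alpha(X) = \frac{1}{1 - \alpha} \, \log \sum_{x \in \mathcal{X}} P_X^\alpha(x).
$$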
Definition 9
([52]). Let be defined on , where X is a discrete random variable. The Arimoto-Rényi conditional entropy of order of X given Y is defined as follows:
- If , then
- The Arimoto-Rényi conditional entropy is continuously extended at orders 0, 1, and ∞; at order 1, it coincides with the conditional Shannon entropy .
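In its standard form (see [52]), for a discrete random variable X and $\alpha \in (0,1) \cup (1,\infty)$, the Arimoto-Rényi conditional entropy is given by
$$
H_\alpha(X \mid Y) = \frac{\alpha}{1 - \alpha} \, \log \, \mathbb{E}\!\left[ \left( \sum_{x \in \mathcal{X}} P_{X|Y}^\alpha(x \mid Y) \right)^{\!\frac{1}{\alpha}} \right],
$$
where the expectation is taken with respect to Y.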
Definition 10
([42]). For all , the binary Rényi divergence of order α, denoted by for , is defined as . It is the continuous extension to of
For ,
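For reference, the standard form of the binary Rényi divergence of order $\alpha \in (0,1) \cup (1,\infty)$ is
$$
d_\alpha(p \| q) = \frac{1}{\alpha - 1} \, \log \Big( p^\alpha q^{1-\alpha} + (1-p)^\alpha (1-q)^{1-\alpha} \Big),
$$
and $d_1(p \| q)$ is defined by continuous extension, coinciding with the binary relative entropy.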
The following result, which generalizes Corollary 2, is shown to be a consequence of Theorem 11. It was originally derived in ([42], Theorem 8) in a different way. The alternative derivation of this inequality relies on Theorem 11, applied to the family of Alpha-divergences (see (138)) as a subclass of the f-divergences.
Corollary 3
Proof.
See Appendix M. □
Another application of Theorem 11 with the selection , for and a parameter , gives the following result.
Corollary 4.
The following refinement of the generalized Fano’s inequality in Theorem 11 relies on the version of the strong data-processing inequality in Theorem 1.
Theorem 12.
Under the assumptions in Theorem 11, let the convex function be twice differentiable, and assume that there exists a constant such that
where
and the interval is defined in (23). Let for . Then,
Proof.
See Appendix N. □
An application of Theorem 12 gives the following tightened version of Corollary 2.
Corollary 5.
Under the assumptions in Theorem 11, the following holds:
Proof.
Remark 9.
Remark 10.
The ceiling operation in the right side of (217) is redundant with denoting the list decoding error probability (see (A335)–(A341)). However, for obtaining a lower bound on with (217), the ceiling operation ensures that the bound is at least as good as the lower bound which relies on the generalized Fano’s inequality in (193).
Example 1.
Let X and Y be discrete random variables taking values in and , respectively, and let be the joint probability mass function, given by
Let the list decoder select the L most probable elements from , given the value of . Table 1 compares the list decoding error probability with the lower bound which relies on the generalized Fano’s inequality in (193), its tightened version in (217), and the closed-form lower bound in (210) for fixed list sizes of . For and , (217) improves the lower bound in (193) (see Table 1). If , then the generalized Fano’s lower bound in (193) and also (210) are useless, whereas (217) gives a non-trivial lower bound. It is shown here that neither of the new lower bounds in (210) and (217) supersedes the other.
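The list decoding error probabilities compared in Table 1 can be computed directly from the joint distribution: for each observation y, the decoder keeps the L values of x that maximize the joint probability, and an error occurs when the true x is not on the list. The sketch below does this for a small, arbitrarily chosen joint pmf (the actual pmf of Example 1 is not reproduced here).

```python
import numpy as np

def list_decoding_error(P_XY, L):
    """Average error probability of the decoder that lists, for each y,
    the L values of x with the largest joint probability P_XY[x, y]."""
    error = 0.0
    for y in range(P_XY.shape[1]):
        column = P_XY[:, y]
        kept = np.argsort(column)[::-1][:L]     # indices of the L most probable x given y
        error += column.sum() - column[kept].sum()
    return error

# An arbitrary joint pmf on a 4 x 3 alphabet (illustrative values only).
P_XY = np.array([[0.10, 0.05, 0.05],
                 [0.05, 0.20, 0.05],
                 [0.05, 0.05, 0.20],
                 [0.10, 0.05, 0.05]])
assert abs(P_XY.sum() - 1.0) < 1e-12

for L in (1, 2, 3):
    print(L, list_decoding_error(P_XY, L))
```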
4.1.2. Variable-Size List Decoding
In the more general setting of list decoding where the size of the list may depend on the channel observation, Fano’s inequality has been generalized as follows.
Proposition 5
(([48], Appendix 3.E) and [53]). Let be a probability measure defined on with . Consider a decision rule , and let the (average) list decoding error probability be given by with for all . Then,
where denotes the binary entropy function. If almost surely, then also
By relying on the data-processing inequality for f-divergences, we derive in the following an alternative explicit lower bound on the average list decoding error probability . The derivation relies on the divergence (see, e.g., [54]), which forms a subclass of the f-divergences.
Theorem 13.
Let , and let for all . Then, (222) holds with equality if, for every , the list decoder selects the most probable elements in given ; if denotes the ℓ-th most probable element in given , where ties in probabilities are resolved arbitrarily, then (222) holds with equality if
with being an arbitrary function which satisfies
Proof.
See Appendix O. □
Example 2.
Let X and Y be discrete random variables taking their values in and , respectively, and let be their joint probability mass function, which is given by
Let and be the lists in , given the value of . We get , so the conditional probability mass function of X given Y satisfies for all . It can be verified that, if , then , and also (223) and (224) are satisfied (here, , and ). By Theorem 13, it follows that (222) holds in this case with equality, and the list decoding error probability is equal to (i.e., it coincides with the lower bound in the right side of (222) with ). On the other hand, the generalized Fano’s inequality in (220) gives that (the left side of (220) is bits); moreover, by letting , (221) gives the looser lower bound . This exemplifies a case where the lower bound in Theorem 13 is tight, whereas the generalized Fano’s inequalities in (220) and (221) are looser.
4.2. A Measure for the Approximation of Equiprobable Distributions by Tunstall Trees
The best possible approximation of an equiprobable distribution that one can obtain by using tree codes has been considered in [38]. The optimal solution is obtained by using Tunstall codes, which are variable-to-fixed lossless compression codes (see ([55], Section 11.2.3), [56]). The main idea behind Tunstall codes is to parse the source sequence into variable-length segments of roughly the same probability, and then to code all these segments with codewords of fixed length. This task is done by assigning the leaves of a Tunstall tree, which correspond to segments of source symbols of variable length (according to the depth of the leaves in the tree), to codewords of fixed length. The following result links Tunstall trees with majorization theory.
Proposition 6
([38] Theorem 1). Let be the probability measure generated on the leaves by a Tunstall tree , and let be the probability measure generated by an arbitrary tree with the same number of leaves as of . Then, .
From Proposition 6, and the Schur-convexity of an f-divergence (see ([38], Lemma 1)), it follows that (see ([38], Corollary 1))
where n designates the common number of leaves of the trees and .
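Returning to the construction described above, the following Python sketch builds a Tunstall tree for a memoryless source under the standard greedy rule: the most probable leaf is repeatedly replaced by its single-symbol extensions until the target number of leaves is reached (the source distribution and the number of leaves below are illustrative assumptions). The leaf probabilities form the probability measure induced on the leaves.

```python
import heapq

def tunstall_leaves(source_pmf, num_leaves):
    """Return (segment, probability) pairs for the leaves of a Tunstall tree
    built for an i.i.d. source with the given symbol probabilities."""
    symbols = list(source_pmf)
    # Start with the one-symbol segments; use a max-heap on probability (negated for heapq).
    heap = [(-p, (s,)) for s, p in source_pmf.items()]
    heapq.heapify(heap)
    # Each expansion removes one leaf and adds |alphabet| leaves, i.e., adds |alphabet| - 1;
    # the construction stops at the largest reachable number of leaves not exceeding num_leaves.
    while len(heap) + (len(symbols) - 1) <= num_leaves:
        neg_p, segment = heapq.heappop(heap)          # most probable leaf
        for s in symbols:                             # replace it by its one-symbol extensions
            heapq.heappush(heap, (neg_p * source_pmf[s], segment + (s,)))
    return [(seg, -neg_p) for neg_p, seg in sorted(heap)]

# Illustrative binary source (assumed probabilities) and 8 leaves (codewords of length 3 bits).
for segment, prob in tunstall_leaves({'a': 0.7, 'b': 0.3}, num_leaves=8):
    print(''.join(segment), round(prob, 4))
```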
Before we proceed, it is worth noting that the strong data-processing inequality in Theorem 6 implies that if f is also twice differentiable, then (226) can be strengthened to
where and denote, respectively, the maximal and minimal positive masses of on the n leaves of a tree , and is given in (26).
We next consider a measure which quantifies the quality of the approximation of the probability mass function , induced by the leaves of a Tunstall tree, by an equiprobable distribution over a set whose cardinality (n) is equal to the number of leaves in the tree. To this end, consider the setup of Bayesian binary hypothesis testing where a random variable X has one of the two probability distributions
with a-priori probabilities , and for an arbitrary . The measure being considered here is equal to the difference between the minimum a-priori and minimum a-posteriori error probabilities of the Bayesian binary hypothesis testing model in (228), which is close to zero if the two distributions are sufficiently close.
The difference between the minimum a-priori and minimum a-posteriori error probabilities of a general Bayesian binary hypothesis testing model with the two arbitrary alternative hypotheses and with a-priori probabilities and , respectively, is defined to be the order- DeGroot statistical information [57] (see also ([16], Definition 3)). It can be expressed as an f-divergence:
where is the convex function with , given by (see ([16], (73)))
The measure considered here for quantifying the closeness of to the equiprobable distribution is therefore given by
which is bounded in the interval .
The next result partially relies on Theorem 7.
Theorem 14.
- (a)
- It is the minimum of with respect to all probability measures that are induced by an arbitrary tree with n leaves.
- (b)
- (c)
- The following bound holds for every , which is the asymptotic limit of the right side of (232) as we let :
- (d)
- If is convex and twice differentiable, continuous at zero and , then
Proof.
See Appendix P.1. □
Remark 12.
Figure 7 refers to the upper bound on the closeness-to-equiprobable measure in (233) for Tunstall trees with n leaves. The bound holds for all , and it is shown as a function of for several values of . In the limit where , the upper bound is equal to since the minimum a-posteriori error probability of the Bayesian binary hypothesis testing model in (228) tends to zero. On the other hand, if , then the right side of (233) is identically equal to zero (since ).
Figure 7.
Curves of the upper bound on the measure in (233), valid for all , as a function of for different values of .
Theorem 14 gives an upper bound on the measure in (231), for the closeness of the probability mass function generated on the leaves by a Tunstall tree to the equiprobable distribution, where this bound is expressed as a function of the minimal probability mass of the source. The following result, which relies on ([33], Theorem 4) and our earlier analysis related to Theorem 7, provides a sufficient condition on the minimal probability mass for asserting the closeness of the compression rate to the Shannon entropy of a stationary and memoryless discrete source.
Theorem 15.
Let P be a probability mass function of a stationary and memoryless discrete source, and let the emitted source symbols be from an alphabet of size . Let be a Tunstall code which is used for source compression; let m and denote, respectively, the fixed length and the alphabet of the codewords of (where ), referring to a Tunstall tree of n leaves with . Let be the minimal probability mass of the source symbols, and let
with an arbitrary such that . If
where and denote, respectively, the principal and secondary real branches of the Lambert W function [37], then the compression rate of the Tunstall code is larger than the Shannon entropy of the source by a factor which is at most .
Proof.
See Appendix P.2. □
Remark 13.
Example 3.
Consider a memoryless and stationary binary source, and a binary Tunstall code with codewords of length referring to a Tunstall tree with leaves. Letting in Theorem 15, it follows that if the minimal probability mass of the source satisfies (see (235), and Figure 8 with ), then the compression rate of the Tunstall code is at most larger than the Shannon entropy of the source.
Funding
This research received no external funding.
Acknowledgments
The author wishes to thank the Guest Editor, Amos Lapidoth, and the two anonymous reviewers for an efficient process in reviewing and handling this paper.
Conflicts of Interest
The author declares no conflict of interest.
Appendix A. Proof of Theorem 1
We start by proving Item (a). By our assumptions on and ,
Since by assumption and are supported on , and and are supported on (see (A5) and (A6)), it follows that the left side inequality in (A7) is strict if the infimum in the left side is equal to 0, and the right side inequality in (A7) is strict if the supremum in the right side is equal to ∞. Hence, due to (18), (19) and (23),
Since by assumption is convex, it follows that its right derivative exists, and it is monotonically non-decreasing and finite on (see, e.g., ([58], Theorem 1.2) or ([59], Theorem 24.1)). A straightforward generalization of ([60], Theorem 1.1) (see ([60], Remark 1)) gives
where
In comparison to ([60], Theorem 1.1), the requirement that f is differentiable on is relaxed here, and the derivative of f is replaced by its right-side derivative. Note that if f is differentiable, then with as defined in (A10) is Bregman’s divergence [61]. The following equality, expressed in terms of Lebesgue-Stieltjes integrals, holds by ([16], Theorem 1):
By combining (A9), (A12) and (A13), it follows that
and an evaluation of the sum in the right side of (A14) gives (see (20), (21) and (A3))
Combining (A14)–(A18) gives (24); (25) is due to the data-processing inequality for f-divergences (applied to the -divergence), and the non-negativity of in (22).
The -divergence is an f-divergence with for . The condition in (22) allows setting here , implying that (24) holds in this case with equality.
We next prove Item (b). Let f be twice differentiable on (see (23)), and let with . Dividing both sides of (22) by , and letting , yields . Since this holds for all , it follows that . We next show that in (26) fulfills the condition in (22), and therefore it is the largest possible value of to satisfy (22). By the mean value theorem of Lagrange, for all with , there exists an intermediate value such that ; hence, , so the condition in (22) is indeed fulfilled with as given in (26).
We next prove Item (c). Let be the dual convex function which is given by for all with . Since , , and are supported on (see (A5) and (A6)), we have
Consequently, it follows that
where (A23) holds due to (A19) and (A20); (A24) follows from (24) with f, and replaced by , and , respectively, which then implies that and in (18) and (19) are, respectively, replaced by and in (A21) and (A22); finally, (A25) holds due to (A21) and (A22). Since by assumption f is twice differentiable on , so is , and
Hence,
where (A27) follows from (24) with f, and replaced by , and , respectively; (A28) holds due to (A26), and (A29) holds by substituting . This proves (27) and (30), where (28) is due to the data-processing inequality for f-divergences, and the non-negativity of .
Similarly to the condition for equality in (24), equality in (27) is satisfied if for all , or equivalently for all . This f-divergence is Neyman’s -divergence where for all P and Q with (due to (30), and since for all ).
The proof of Item (d) follows the same lines as the proof of Items (a)–(c), by replacing the condition in (22) with a complementary condition of the form
We finally prove Item (e) by showing that the lower and upper bounds in (24), (27), (32) and (33) are locally tight. More precisely, let be a sequence of probability mass functions defined on and pointwise converging to which is supported on , let and be the probability mass functions defined on via (20) and (21) with inputs and , respectively, and let and be defined, respectively, by (18) and (19) with being replaced by . By the assumptions in (35) and (36),
Appendix B. Proof of Theorem 2
We start by proving Item (a). By the assumption that and are supported on for all , it follows from (39) that the probability mass functions and are supported on . Consequently, from (41), also is supported on for all . Due to the product forms of and in (39) and (41), respectively, we get from (47) that
and likewise, from (48),
for all . In view of (24), (26), (A35) and (A36), replacing in (24) and (26) with we obtain that, for all ,
Due to the setting in (39)–(44), for all and ,
with
and is the probability mass function at the channel output at time instant i. In particular, setting in (A38) gives
Due to the tensorization property of the divergence, and since , , and are product probability measures (see (39), (41), (A38) and (A40)), it follows that
and
Substituting (A41) and (A42) into the right side of (A37) gives that, for all ,
Due to (41) and (A39), since
and (see ([45], Lemma 5))
for every pair of probability measures , it follows that
Substituting (A47) and (A48) into the right side of (A43) gives (45). For proving the looser bound (46) from (45), and also for later proving the result in Item (c), we rely on the following lemma.
Lemma A1.
Let and be non-negative with for all . Then,
- (a)
- For all ,
- (b)
- If for at least one index i, then
Proof.
Let be defined as
We have , and the first two derivatives of g are given by
and
Since by assumption for all i, it follows from (A53) that for all , which asserts the convexity of g on . Hence, for all ,
where the right-side equality in (A54) is due to (A51) and (A52). This gives (A49).
We next prove Item (b) of Lemma A1. By the Taylor series expansion of the polynomial function g, we get
for all . Since by assumption for all i, and there exists an index such that , it follows that the coefficient of in the right side of (A55) is positive. This yields (A50). □
We obtain here (46) from (45) and Item (a) of Lemma A1. To that end, for , let
with for every . Since by (39), (40), (43) and (44),
it follows from the data-processing inequality for f-divergences, and their non-negativity, that
which yields (46) from (45), (A49), (A56) and (A59).
We next prove Item (b) of Theorem 2. Similarly to the proof of (A37), we get from (32) (rather than (24)) that
Combining (A41), (A42), (A47), (A48) and (A60) gives (49).
We finally prove Item (c) of Theorem 2. In view of (47) and (48), and by the assumption that for all , we get
Since, by assumption f has a continuous second derivative at unity, (26), (31), (A61) and (A62) imply that
From (A56), (A59), and Item (b) of Lemma A1, it follows that
Appendix C. Proof of Theorems 3 and 4
Appendix C.1. Proof of Theorem 3
We first obtain a lower bound on , and then obtain an upper bound on .
where (A67) holds by the definition of g in Theorem 3 with the assumption that ; (A69) is due to Jensen’s inequality and the convexity of g; (A70) holds by the definition of the -divergence; (A71) holds due to the convexity of g, and its differentiability at 1 (due to the differentiability of f at 1); (A72) holds since ; finally, (A73) holds since implies that .
By ([62], Theorem 5), it follows that
where is given in (51). Combining (A66)–(A74) yields (52). Taking suprema on both sides of (52), with respect to all probability mass functions with and , gives (53) since by the definition of in (51), it is monotonically decreasing in and monotonically increasing in , while (18) and (19) yield
Remark A1.
The derivation in (A66)–(A73) is conceptually similar to the proof of ([24], Lemma A.2). However, the function g here is convex, and our derivation involves the -divergence.
Remark A2.
The proof of ([26], Theorem 8) (see Proposition 3 in Section 1.1 here) relies on ([24], Lemma A.2), where the function g is required to be concave in [24,26]. This leads, in the proof of ([26], Theorem 8), to an upper bound on . One difference in the derivation of Theorem 3 is that our requirement on the convexity of g leads to a lower bound on , instead of an upper bound on . Another difference between the proofs of Theorem 3 and ([26], Theorem 8) is that we apply here the result in ([62], Theorem 5) to obtain an upper bound on , whereas the proof of ([26], Theorem 8) relies on a Pinsker-type inequality (see ([63], Theorem 3)) to obtain a lower bound on ; the latter lower bound relies on the condition on f in (16), which is not necessary for the derivation of the bound in Theorem 3.
Remark A3.
From ([62], Theorem 1 (b)), it follows that
with in the right side of (A76) as given in (51), and the supremum in the left side of (A76) is taken over all probability measures P and Q such that . In view of ([62], Theorem 1 (b)), the equality in (A76) holds since the functions , defined as and for all , satisfy and for all probability measures P and Q, and since while the function is also strictly positive on . Furthermore, from the proof of ([62], Theorem 1 (b)), restricting P and Q to be probability mass functions which are defined over a binary alphabet, the ratio can be made arbitrarily close to the supremum in the left side of (A76); such probability measures can be obtained as the output distributions and of an arbitrary non-degenerate stochastic transformation , with , by a suitable selection of probability input distributions and , respectively (see (A5) and (A6)). In the latter case where , this shows the optimality of the non-negative constant in the right side of (A74).
Appendix C.2. Proof of Theorem 4
Combining (A66)–(A73) gives that, for all ,
and from (A74)
From (A41) and (A47),
and similarly, from (A42) and (A48),
Combining (A77)–(A80) yields (54).
Appendix D. Proof of Theorem 5
The function in (55) satisfies , and for all
which yields the convexity of on . This justifies the definition of the f-divergence
for probability mass functions P and Q, which are defined on a finite or countably infinite set , with Q supported on . In the general alphabet setting, sums and probability mass functions are, respectively, replaced by Lebesgue integrals and Radon-Nikodym derivatives. Differentiation of both sides of (A82) with respect to gives
where
The function is convex since
and . Hence, is an f-divergence, and it follows from (A83)–(A85) that
which gives (56), so is monotonically increasing in . Double differentiation of both sides of (A82) with respect to gives
where
The function is concave, and . By referring to the f-divergence , it follows from (A91)–(A93) that
which gives (57), so is concave in for . Differentiation of both sides of (A93) with respect to gives that
which implies that
This gives (58), and it completes the proof of Item (a).
We next prove Item (b). From Item (a), the result in (59) holds for . We provide in the following a proof of (59) for all . In view of (A98), it can be verified that for ,
which, from (A82), implies that
with
The function is convex for , with . By referring to the f-divergence , its non-negativity and (A103) imply that for all
Furthermore, we get the following explicit formula for the n-th partial derivative of with respect to for :
where (A107) holds due to (A103); (A108) follows from (A104), and (A110) is satisfied by the definition of the Rényi divergence [40] which is given by
with by continuous extension of at . For , the right side of (A110) is simplified to the right side of (58); this holds due to the identity
To prove Item (c), from (55), for all
which implies by a Taylor series expansion of that
where in the right side of (A116) is an intermediate value between 1 and t. Hence, for ,
where (A117) follows from (A116) since and is monotonically decreasing and positive (see (A115)); in the right side of (A117) denotes the indicator function which is equal to 1 if the relation holds, and it is otherwise equal to zero; (A118) holds since for all , and ; finally, (A119) follows by substituting (A114) and (A115) into the right side of (A118), which gives the equality
with as defined in (63). Since the first term in the right side of (A119) does not affect an f-divergence (as it is equal to for and some constant c), and for an arbitrary positive constant and for , we get , and consequently inequality (61) follows from (A117) and (A119). To that end, note that defined in (63) is monotonically increasing in , and therefore for all . Due to the inequality (see, e.g., ([64], Theorem 5), followed by refined versions in ([62], Theorem 20) and ([65], Theorem 9))
the looser lower bound on in the right side of (62), expressed as a function of the relative entropy , follows from (61). Hence, if P and Q are not identical, then (64) follows from (61) since and .
We next prove Item (d). The Taylor series expansion of implies that, for all ,
where in the right side of (A122) is an intermediate value between 1 and t. Consequently, since and , it follows from (A122) that, for all ,
Based on (A123)–(A125), it follows that
where (A127) holds due to (A111) (with ). Substituting (A114) and (A115) into the right side of (A127) gives (65).
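The Rényi divergence entering (A110), (A111) and (65) can be evaluated directly on finite alphabets. The following minimal sketch (with hypothetical distributions) also checks numerically that it is non-decreasing in its order, a property invoked in the proof of Item (e) below (see ([66], Theorem 3)).

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """D_alpha(P||Q) in nats, finite alphabets, with q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if np.isclose(alpha, 1.0):          # continuous extension at alpha = 1 (relative entropy)
        mask = p > 0
        return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
    return float(np.log(np.sum(p**alpha * q**(1.0 - alpha))) / (alpha - 1.0))

p = np.array([0.5, 0.3, 0.2])   # hypothetical P
q = np.array([0.2, 0.3, 0.5])   # hypothetical Q

# Monotonicity in the order (see ([66], Theorem 3)): D_alpha is non-decreasing in alpha.
values = [renyi_divergence(p, q, a) for a in (0.5, 0.999, 1.0, 1.001, 2.0)]
assert all(x <= y + 1e-9 for x, y in zip(values, values[1:]))
```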
We next prove Item (e). Let P and Q be probability mass functions such that , and let be arbitrarily small. Since the Rényi divergence is monotonically non-decreasing in (see ([66], Theorem 3)), it follows that , and therefore also
In view of (61), there exists such that for all
and, from (65), there exists such that for all
Letting gives the result in (66) for all .
Item (f) of Theorem 5 is a direct consequence of ([45], Lemma 4), which relies on ([67], Theorem 3). Let for (hence, is the divergence). If a sequence converges to a probability measure Q in the sense that the condition in (67) is satisfied, and for all sufficiently large n, then ([45], Lemma 4) yields
which gives (68) from (A114) and (A131).
We next prove Item (g). Inequality (69) is trivial. Inequality (70) is obtained as follows:
where (A133) follows from (56), and (A134) holds since the function given by
is monotonically decreasing in u (note that by increasing the value of the non-negative variable u, the probability mass function gets closer to Q). This gives (70).
For proving inequality (71), we obtain two upper bounds on with . For the derivation of the first bound, we rely on (A83). From (A84) and (A85),
where is given by
with the convention that (by a continuous extension of at ). Since , and
which implies that is convex on , we get
where (A141) holds due to (A83) (recall the convexity of with ); (A142) holds due to (A138) and since for implies that ; finally, (A143) follows from the non-negativity of the f-divergence . Consequently, integration over the interval () on the left side of (A141) and the right side of (A143) gives
Note that the same reasoning of (A132)–(A136) also implies that
which gives a second upper bound on the left side of (A145). Taking the minimal value among the two upper bounds in the right sides of (A144) and (A145) gives (71) (see Remark A4).
We finally prove Item (h). From (55) and (A81), the function is convex for with , , and it is also differentiable at 1. It is left to prove that the function , defined as for , is convex. From (55), the function is given explicitly by
and its second derivative is given by
with
Since , and
it follows that for all ; hence, from (A147), for , which yields the convexity of the function on for all . This shows that, for every , the function satisfies all the required conditions in Theorems 3 and 4. We proceed to calculate the function in (51), which corresponds to , i.e., (see (72)),
with
where the definition of is obtained by continuous extension of the function at (recall that the function is given in (55)). Differentiation shows that
where, for ,
and
From (A156), it follows that if , , and if . Since is therefore monotonically decreasing on and it is monotonically increasing on , (A155) implies that
Since (see (A154)), and is monotonically increasing on , it follows that for all and for all . This implies that for all (see (A153)); hence, from (A152), the function is monotonically increasing on , and it is continuous over this interval (see (A151)). It therefore follows from (A150) that
for every and (independently of ), which proves (73).
Remark A4.
Neither of the upper bounds in the right sides of (A144) and (A145) supersedes the other. For example, if P and Q correspond to and , respectively, and , then the right sides of (A144) and (A145) are, respectively, equal to and . If, on the other hand, , then the right sides of (A144) and (A145) are, respectively, equal to and .
Appendix E. Proof of Theorem 6
By assumption, where the probability mass functions P and Q are defined on the set . The majorization relation is equivalent to the existence of a doubly-stochastic transformation such that (see Proposition 4)
(See, e.g., ([32], Theorem 2.1.10) or ([30], Theorem 2.B.2) or ([31], pp. 195–204)). Define
The probability mass functions given by
satisfy, respectively, (20) and (21). The first one is obvious from (A159)–(A161); equality (21) holds due to the fact that is a doubly stochastic transformation, which implies that for all
Since (by assumption) and are supported on , relations (20) and (21) hold in the setting of (A159)–(A161), and is (by assumption) convex and twice differentiable, it is possible to apply the bounds in Theorem 1 (b) and (d). To that end, from (18), (19), (A160) and (A161),
which, from (24), (25), (32), (A160), (A161) and (A164), give that
The difference of the divergences in the left side of (A166) and the right side of (A167) satisfies
and the substitution of (A169) into the bounds in (A166) and (A167) gives the results in (74) and (75).
Let for , which yields from (26) and (31) that . Since , it follows from (A169) that the upper and lower bounds in the left side of (74) and the right side of (75), respectively, coincide for the -divergence; this therefore yields the tightness of these bounds in this special case.
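Since this appendix hinges on the equivalence between the majorization relation and the existence of a doubly-stochastic transformation (Proposition 4), the following minimal numerical sketch (with hypothetical vectors) checks the standard partial-sum criterion for majorization and confirms that mapping a probability vector through a doubly-stochastic matrix produces a majorized vector.

```python
import numpy as np

def majorizes(q, p, tol=1e-12):
    """True if q majorizes p: equal totals, and every partial sum of the
    decreasingly-sorted entries of q dominates that of p."""
    p_s, q_s = np.sort(p)[::-1], np.sort(q)[::-1]
    return (abs(p_s.sum() - q_s.sum()) < tol
            and bool(np.all(np.cumsum(q_s) >= np.cumsum(p_s) - tol)))

rng = np.random.default_rng(1)
q = rng.dirichlet(np.ones(6))                    # hypothetical probability vector

# A doubly-stochastic matrix as a convex combination of permutation matrices
# (Birkhoff-von Neumann), applied to q.
perms = [np.eye(6)[rng.permutation(6)] for _ in range(4)]
weights = rng.dirichlet(np.ones(4))
D = sum(w * P for w, P in zip(weights, perms))   # doubly stochastic
p = q @ D

assert majorizes(q, p)   # the doubly-stochastic map only "flattens" q
```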
We next prove (76). The following lower bound on the second-order Rényi entropy (a.k.a. the collision entropy) holds (see ([34], (25)–(27))):
where . This gives
By the Cauchy-Schwarz inequality which, together with (A171), gives
In view of the Schur-concavity of the Rényi entropy (see ([30], Theorem 13.F.3.a.)), the assumption implies that
and an exponentiation of both sides of (A173) (see the left-side equality in (A170)) gives
Combining (A172) and (A173) gives (76).
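The Schur-concavity of the Rényi entropy, used in the step leading to (A173), can also be illustrated numerically; a minimal sketch with hypothetical vectors (order 2, i.e., the collision entropy) follows.

```python
import numpy as np

def renyi_entropy(p, alpha=2.0):
    """H_alpha(P) = log(sum_i p_i^alpha) / (1 - alpha); alpha = 2 is the collision entropy."""
    p = np.asarray(p, float)
    return float(np.log(np.sum(p**alpha)) / (1.0 - alpha))

# Hypothetical vectors with q majorizing p (partial sums of q dominate those of p):
p = np.array([0.4, 0.3, 0.2, 0.1])
q = np.array([0.7, 0.1, 0.1, 0.1])

# Schur-concavity: if q majorizes p, then H_alpha(p) >= H_alpha(q).
assert renyi_entropy(p, 2.0) >= renyi_entropy(q, 2.0)
```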
Appendix F. Proof of Theorem 7
We prove Item (a), showing that the set (with ) is non-empty, convex and compact. Note that is a singleton, so the claim is trivial for .
Let . The non-emptiness of is trivial since . To prove the convexity of , let , and let and be the (positive) maximal and minimal probability masses of and , respectively. Then, and yield
For every ,
Combining (A175)–(A177) implies that
so for all . This proves the convexity of .
The set of probability mass functions is clearly bounded; for showing its compactness, it is left to show that is closed. Let , and let be a sequence of probability mass functions in which pointwise converges to P over the finite set . It is required to show that . As a limit of probability mass functions, , and since by assumption for all , it follows that
which yields for all m. Since for every m, it follows that also for the limiting probability mass function P we have , and . This proves that , and therefore is a closed set.
An alternative proof for Item (a) relies on the observation that, for ,
which yields the convexity and compactness of the set for all .
The result in Item (b) holds in view of Item (a), and due to the convexity and continuity of in (where ). This implication is justified by the statement that a convex and continuous function over a non-empty convex and compact set attains its supremum over this set (see, e.g., ([68], Theorem 7.42) or ([59], Theorem 10.1 and Corollary 32.3.2)).
We next prove Item (c). If , then where the lower bound on is attained when Q is the probability mass function with masses equal to and a single smaller mass equal to , and the upper bound is attained when Q is the equiprobable distribution. For an arbitrary , let where can get any value in the interval defined in (79). By ([34], Lemma 1), and where is given in (80). The Schur-convexity of (see ([38], Lemma 1)) and the identity give that
for all with ; equalities hold in (A180) if . The maximization of and over all probability mass functions can therefore be simplified to the maximization of and , respectively, over the parameter which lies in the interval in (79). This proves (82) and (83).
We next prove Item (e), and then prove Item (d). In view of Item (c), the maximum of over all the probability mass functions is attained by with (see (79)–(81)). From (80), can be expressed as the n-length probability vector
The influence of the -th entry of the probability vector in (A181) on tends to zero as we let . This holds since the entries of the vector in (A181) are written in decreasing order, which implies that for all (with )
from (A182) and the convexity of f on (so, f attains its finite maximum on every closed sub-interval of ), it follows that
In view of (A181) and (A183), by letting , the maximization of over can be replaced by a maximization of where
with the free parameter , and with (the value of is determined so that the total mass of is 1). Hence, we get
The f-divergence in the right side of (A185) satisfies
where (A188) holds by the definition of the function in (84). It therefore follows that
where (A189) holds by combining (82) and (A185)–(A188); (A190) holds by the continuity of the function on , which follows from (84) and the continuity of the convex function f on for (recall that a convex function is continuous on every closed sub-interval of the interior of its domain, and by assumption f is convex on ). This proves (87), by the definition of in (84).
Equality (88) follows from (87) by replacing with , with as given in (29); this replacement is justified by the equality .
Once Item (e) is proved, we return to prove Item (d). To that end, it is first shown that
for all and integers , with the functions and , respectively, defined in (77) and (78). Since for all , (77) and (78) give that
so the monotonicity property in (A192) follows from (A191) by replacing f with . To prove (A191), let be a probability mass function which attains the maximum at the right side of (77), and let be the probability mass function supported on , and defined as follows:
Since by assumption , (A194) implies that . It therefore follows that
where (A195) and (A201) hold due to (77); (A196) holds since ; finally, (A198) holds due to (A194), which implies that the two sums in the right side of (A197) are identical, and they are equal to the sum in the right side of (A198). This gives (A191), and likewise also (A192) (see (A193)).
where (A202) holds since, due to (A191), the sequence is monotonically increasing, which implies that the first term of this sequence is less than or equal to its limit. Equality (A203) holds since the limit in its right side exists (in view of the above proof of (87)), so its limit coincides with the limit of every subsequence; (A204) holds due to (A189) and (A190). A replacement of f with gives, from (A193), that
Combining (A202)–(A205) gives the right-side inequalities in (85) and (86). The left-side inequality in (85) follows by combining (77), (A184) and (A186)–(A188), which gives
Likewise, in view of (A193), the left-side inequality in (86) follows from the left-side inequality in (85) by replacing f with .
We next prove Item (f), providing an upper bound on the convergence rate of the limit in (87); an analogous result can be obtained for the convergence rate to the limit in (88) by replacing f with in (29). To prove (89), in view of Items (d) and (e), we get that for every integer
where (A209) holds due to the monotonicity property in (A191), and also due to the existence of the limit of ; (A210) holds due to (85); (A211) holds since the function (as it is defined in (84)) satisfies (recall that by assumption ); (A212) holds since , so the maximization of over the interval is the maximum of the maximal values over the sub-intervals for ; finally, (A213) holds since the maximum of a sum of functions is less than or equal to the sum of the maxima of these functions. If the function is differentiable on , and its derivative is upper bounded by , then by the mean value theorem of Lagrange, for every ,
Combining (A209)–(A214) gives (89).
We next prove Item (g). By definition, it readily follows that if . By the definition in (77), for a fixed integer , it follows that the function is monotonically increasing on . The limit in the left side of (90) therefore exists. Since is convex in Q, its maximum over the convex set of probability mass functions is obtained at one of the vertices of the simplex . Hence, a maximum of over this set is attained at with for some , and for . In the latter case,
Note that (since the union of , for all , includes all the probability mass functions in which are supported on , so is not an element of this union); hence, it follows that
On the other hand, for every ,
where (A217) holds due to the left-side inequality of (85), and (A218) is due to (84). Combining (A217) and (A218), and the continuity of f at zero (by the continuous extension of the convex function f at zero), yields (by letting )
Combining (A216) and (A219) gives (90) for every integer . In order to get an upper bound on the convergence rate in (90), suppose that , f is differentiable on , and . For every , we get
where (A220) holds since the sets are monotonically increasing in ; (A221) follows from (A216)–(A218); (A222) holds by the assumption that for all , by the mean value theorem of Lagrange, and since for all and . This proves (91).
We next prove Item (h). Setting yields for every probability mass function Q which is supported on . Since and , and since by assumption , it follows that
Combining the assumption in (92) with (A224) implies that
The lower bound on in the left side of (94) follows from a combination of (75), the left-side inequality in (A226), and . Similarly, the upper bound on in the right side of (95) follows from a combination of (74), the right-side inequality in (A226), and the equality . The looser upper bound on in the right side of (96), expressed as a function of M and , follows by combining (74), (76), and the right-side inequality in (A226).
The tightness of the lower bound in the left side of (94) and the upper bound in the right side of (95) for the divergence is clear from the fact that if for all ; in this case, .
To prove Item (i), suppose that the second derivative of f is upper bounded on with for all , and there is a need to assert that for an arbitrary . Condition (97) follows from (96) by solving the inequality , with the variable , for given and (note that does not depend on ).
Appendix G. Proof of Theorem 8
The proof of Theorem 8 relies on Theorem 6. For , let be the non-negative and convex function given by (see, e.g., ([8], (2.1)) or ([16], (17)))
and let be the convex function given by
Let P and Q be probability mass functions which are supported on a finite set; without loss of generality, let their support be given by . Then, for ,
where
designates the order- Tsallis entropy of a probability mass function P defined on the set . Equality (A229) also holds for by continuous extension.
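For concreteness, the order-α Tsallis entropy of [39] in its standard form, together with its continuous extension at α = 1 to the Shannon entropy in nats, can be computed as in the following minimal sketch (the distribution is a hypothetical choice).

```python
import numpy as np

def tsallis_entropy(p, alpha):
    """Order-alpha Tsallis entropy, (1 - sum_i p_i^alpha) / (alpha - 1);
    the continuous extension at alpha = 1 is the Shannon entropy in nats."""
    p = np.asarray(p, float)
    p = p[p > 0]
    if np.isclose(alpha, 1.0):
        return float(-np.sum(p * np.log(p)))
    return float((1.0 - np.sum(p**alpha)) / (alpha - 1.0))

p = np.array([0.5, 0.25, 0.25])   # hypothetical PMF
# As alpha -> 1, the Tsallis entropy approaches the Shannon entropy:
assert np.isclose(tsallis_entropy(p, 1.0 + 1e-6), tsallis_entropy(p, 1.0), atol=1e-4)
```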
The combination of (74) and (75) under the assumption that P and Q are supported on and , together with (A229), (A231) and (A232) gives (100)–(102). Furthermore, the left and right-side inequalities in (100) hold with equality if in (A231) and in (A232) coincide, which implies that the upper and lower bounds in (74) and (75) are tight in that case. Comparing in (A231) and in (A232) shows that they coincide if .
To prove Item (b) of Theorem 8, let and be probability mass functions supported on where , , and and . This yields . The result in (103) is proved by showing that, for all ,
which shows that the infimum and supremum in (103) can even be restricted to the binary alphabet setting. For every ,
where (A235) follows from a Taylor series expansion around , and passing to the limit where shows that (A235) also holds at (due to the continuous extension of the order- Tsallis entropy at ). This implies that (A235) holds for all . We now calculate the lower and upper bounds on in (101) and (102), respectively.
- For ,
- For ,
- Similarly, for ,and, for ,
The combination of (A235)–(A237) yields (A233); similarly, the combination of (A235), (A238) and (A239) yields (A234).
Appendix H. Proof of Theorem 9 and Corollary 1
Appendix H.1. Proof of Theorem 9
The proof of the convexity property of in (149), with , over the real line relies on ([69], Theorem 2.1) which states that if W is a non-negative random variable, then
is log-convex in . This property has been used to derive f-divergence inequalities (see, e.g., ([62], Theorem 20), [65,69]).
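The precise statement of ([69], Theorem 2.1) is not reproduced here; a closely related elementary fact, namely that α ↦ E[W^α] is log-convex for a non-negative random variable W (a consequence of Hölder's inequality), can be checked numerically as in the following minimal sketch (the samples of W are hypothetical). This is offered only as an illustration of the kind of log-convexity being invoked, not as the cited theorem itself.

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.exponential(size=10_000)   # hypothetical non-negative random variable (i.i.d. samples)

alphas = np.linspace(0.1, 3.0, 30)
log_moments = np.log([np.mean(W**a) for a in alphas])   # log E[W^alpha] on a grid

# Log-convexity: discrete second differences of a convex function are non-negative.
assert np.all(np.diff(log_moments, 2) >= -1e-9)
```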
Let , and let be the Radon-Nikodym derivative (W is a non-negative random variable). Let the expectations in the right side of (A240) be taken with respect to P. In view of the above statement from ([69], Theorem 2.1), this gives the log-convexity of in . Since log-convexity yields convexity, it follows that is convex in over the real line. Let , and let ; since , it follows that is convex in . The pointwise maximum of a set of convex functions is a convex function, which implies that is convex in for every integer . Since the pointwise limit of a convergent sequence of convex functions is convex, it follows that is convex in . This, by definition, is equal to (see (146)), which proves the convexity of this function in . From (149), for all ,
which proves the symmetry property of around for all . The convexity in over the real line, and the symmetry around , imply that attains its global minimum at , which is equal to for all .
Inequalities (162) and (163) follow from ([8], Proposition 2.7); this proposition implies that, for every integer and for all probability mass functions Q defined on ,
Inequalities (162) and (163) follow, respectively, by maximizing both sides of (A242) or (A243) over , and letting n tend to infinity.
For every , the function is monotonically increasing in since (by definition) the set of probability mass functions is monotonically increasing (i.e., if ), and therefore the maximum of over is a monotonically increasing function of ; the limit of this maximum, as we let , is equal to in (149) for all , which is therefore monotonically increasing in over the interval . The continuity of in both and is due to its expression in (149) with its continuous extension at and in (150). Since , it follows from the continuity of that
Appendix H.2. Proof of Corollary 1
For all and ,
where (A244) holds due to the symmetry property in ([8], p. 36), which states that
for every and probability mass functions P and Q; (A245) is due to (146); finally, (A246) holds due to the symmetry property of around in Theorem 9 (a).
Appendix I. Proof of (171)
In view of (154) and (155), it follows that the condition in (170) is satisfied if and only if where is the solution of the equation
with a fixed . The substitution
leads to the equation
Negation and exponentiation of both sides of (A250) gives
Since implies by (A249) that , the proper solution for x is given by
where denotes the secondary real branch of the Lambert W function [37]; otherwise, the replacement of in the right side of (A252) with the principal real branch yields .
We next proceed to solve as a function of x. From (A249), letting gives the equation , which is equivalent to
where (A254) follows from (A252) and by the definition of the Lambert W function (i.e., if and only if ). The solutions of (A253) are given by
and
which (from (A252)) correspond, respectively, to and
Since is equal to , the reciprocal of the right side of (A257) gives the proper solution for (denoted by in (171)).
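The two real branches of the Lambert W function used above are available in standard numerical libraries. The following minimal sketch (with a hypothetical right-hand side c in (−1/e, 0)) recovers both real solutions of x e^x = c, the regime in which the secondary branch appears.

```python
import numpy as np
from scipy.special import lambertw

# w = W(z) is defined by w * exp(w) = z. For z in (-1/e, 0) there are two real branches:
# the principal branch W_0 (values in (-1, 0)) and the secondary branch W_{-1} (values < -1).
c = -0.2                          # hypothetical right-hand side in (-1/e, 0)
x0 = lambertw(c, k=0).real        # principal real branch W_0(c)
xm1 = lambertw(c, k=-1).real      # secondary real branch W_{-1}(c)

for x in (x0, xm1):
    assert np.isclose(x * np.exp(x), c)
assert xm1 < -1.0 < x0 < 0.0      # the two real solutions of x * exp(x) = c
```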
Appendix J. Proof of (176), (177) and (180)
We first derive the upper bound on in (176) for and . For every , with an integer ,
where (A258) follows from (65), and (A259) holds due to (159). By upper bounding the second term in the right side of (A259), for all ,
where (A260) holds by setting in (156); (A261) follows from (135), (138) and (145); (A262) holds by setting in (149); finally, (A263) follows from the factorizations
Substituting the bound in the right side of (A263) into the second term of the bound on the right side of (A259) implies that, for all ,
which therefore gives (176) by maximizing the left side of (A264) over , and letting n tend to infinity (see (174)).
We next derive the upper bound in (177). The second derivative of the convex function in (55) is upper bounded over the interval by the positive constant . From (96), it follows that for all (with and an integer ) and ,
which, from (174), yields (177).
We finally derive the upper bound in (180) by loosening the bound in (176). The upper bound in the right side of (176) can be rewritten as
For all ,
which can be verified by showing that the left side of (A268) is monotonically increasing in over the interval , and it tends to as we let . Furthermore, for all ,
Appendix K. Proof of Theorem 10
We start by proving Item (a). In view of the variational representation of f-divergences (see ([70], Theorem 2.1), and ([71], Lemma 1)), if is convex with , and P and Q are probability measures defined on a set , then
where and , and the supremum is taken over all measurable functions g under which the expectations are finite.
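For concreteness, the following minimal sketch (with hypothetical finite-alphabet distributions) instantiates the variational representation for f(t) = t log t, whose convex conjugate is f*(s) = e^{s−1}; the choice g(x) = 1 + log(P(x)/Q(x)) attains the supremum and recovers the relative entropy, while any other g yields a lower bound.

```python
import numpy as np

def variational_lower_bound(p, q, g):
    """E_P[g(X)] - E_Q[f*(g(X))] for f(t) = t*log(t), whose conjugate is f*(s) = exp(s - 1)."""
    p, q, g = (np.asarray(a, float) for a in (p, q, g))
    return float(np.sum(p * g) - np.sum(q * np.exp(g - 1.0)))

p = np.array([0.5, 0.3, 0.2])   # hypothetical P
q = np.array([0.2, 0.3, 0.5])   # hypothetical Q
kl = float(np.sum(p * np.log(p / q)))

g_opt = 1.0 + np.log(p / q)     # maximizing choice for f(t) = t*log(t): recovers D(P||Q)
assert np.isclose(variational_lower_bound(p, q, g_opt), kl)

g_sub = np.log(p / q)           # any other g only gives a lower bound
assert variational_lower_bound(p, q, g_sub) <= kl + 1e-12
```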
Let , with , and let ; these probability mass functions are defined on the set , and it follows that
where (A271) holds by the definition in (77); (A272) holds due to (A270) with , and Y being an equiprobable random variable over . This gives (187).
We next prove Item (b). As above, let be a convex function with . Let be a maximizer of the right side of (82). Then,
Let be selected arbitrarily. We have (i.e., applying the convex conjugate operation (see (186)) twice to a convex function f returns f itself). From the convexity of f, it therefore follows that, for all , there exists such that
Let
let be selected to satisfy (A275) with , and let the function be defined as
Consequently, it follows from (A275)–(A277) that for all such i
Appendix L. Proof of Theorem 11
For , let the L-size list of the decoder be given by with . Then, the (average) list decoding error probability is given by
where the conditional list decoding error probability, given that , is equal to
For every ,
where (A284) holds by the data-processing inequality for f-divergences, and since for every
(A285) is due to (A283). Hence, it follows that
where (A287) holds by taking expectations in (A284) and (A285) with respect to Y; (A288) holds by the definition of f-divergence, and the linearity of the expectation operator; (A289) follows from the convexity of f and Jensen’s inequality; finally, (A290) holds by (A282).
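For intuition on the quantity bounded in this appendix, the following minimal sketch computes, for a hypothetical toy joint distribution, the average list decoding error probability of the decoder that outputs the L a-posteriori most probable symbols; it is illustrative only, and is not tied to any specific bound above.

```python
import numpy as np

rng = np.random.default_rng(3)
M, K, L = 8, 5, 2                                   # |X|, |Y|, list size (hypothetical)
P_X = rng.dirichlet(np.ones(M))                     # prior on X
P_Y_given_X = rng.dirichlet(np.ones(K), size=M)     # channel rows P(y|x)

joint = P_X[:, None] * P_Y_given_X                  # joint probabilities P(x, y), shape (M, K)

# The decoder that lists the L a-posteriori most probable x for each y has error probability
#   P_L = Pr[X not in list(Y)] = 1 - sum_y (sum of the L largest entries of column y).
top_L_mass = np.sort(joint, axis=0)[-L:, :].sum(axis=0)
P_L = float(1.0 - top_L_mass.sum())
assert 0.0 <= P_L <= 1.0
print(f"average list decoding error probability: {P_L:.4f}")
```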
Appendix M. Proof of Corollary 3
Let , and let . The proof starts by applying Theorem 11 in the setting where is deterministic, and the convex function is given by in (139), i.e.,
In this setting, (192) is specialized to
where is the conditional list decoding error probability given that . Substituting (A291) into the right side of (A292) gives
where (A294) follows from (203). Substituting (A291) into the left side of (A292) gives
Substituting (A294) and (A298) into the right and left sides of (A292), and rearranging terms while relying on the monotonicity property of an exponential function gives
We next obtain an upper bound on the Arimoto-Rényi conditional entropy.
where (A300) holds due to (202); (A301) follows from (A299), and (A302) follows from (203). By ([42], Lemma 1), it follows that the integrand in the right side of (A302) is convex in if ; furthermore, it is concave in if . Invoking Jensen’s inequality therefore yields (see (A282))
where (A303) follows from Jensen’s inequality, and (A305) follows from (203). This proves (205) and (206) for all . The necessary and sufficient condition for (205) to hold with equality, as given in (207), follows from the proof of (A292) (see (A284)–(A286)), and from the use of Jensen’s inequality in (A303).
Appendix N. Proof of Theorem 12
The proof of Theorem 12 relies on Theorem 1, and the proof of Theorem 11.
Let and, without any loss of generality, let . For every , define a deterministic transformation from to such that every is mapped to , and every is mapped to . This corresponds to a conditional probability mass function, for every , where if and , or if and ; otherwise, . Let with . Then, for every , a conditional probability mass function implies that
satisfies (see (A283))
Under the deterministic transformation as above, the equiprobable distribution (independently of ) is mapped to a Bernoulli distribution over the two-element set where
Since, from (212), (213), (A311) and (A312),
it follows from the definition of in (26) that for every
where the last inequality holds by the assumption in (211). Combining (A310) and (A315)–(A317) yields
for every . Hence,
where (A319) holds by taking expectations with respect to Y on both sides of (A318).
Referring to the first term in the right side of (A319) gives
where (A320) follows from (A307)–(A309), and (A321) holds due to (A288)–(A290).
Referring to the second term in the right side of (A319) gives
where (A322) follows from (A306)–(A309); (A323) follows from (A16)–(A18); (A325) is due to (A282). Furthermore, we get (since )
and
Combining (A322)–(A330) gives
which provides tight upper and lower bounds on if is small. Note that the lower bound on the left side of (A331) is non-negative since, by the data-processing inequality for the divergence, the right side of (A331) should be non-negative (see (A306)–(A309)). Finally, combining (A319)–(A332) yields (214), which proves Item (a).
For proving Item (b), the upper bound on the left side of (A326) is tightened. If the list decoder selects the L most probable elements from given the value of , then for every . Hence, the bound in (A326) is replaced by the tighter bound
Combining (A322)–(A325), (A328)–(A330) and (A333) gives the following improved lower bound in the left side of (A331):
It is next shown that the operation in the left side of (A334) is redundant. From (A282) and (A283),
which then implies that
where (A338) is due to the Cauchy-Schwarz inequality applied to the right side of (A337), and (A339) holds since for all . From (A335)–(A341), , which implies that the operation in the left side of (A334) is indeed redundant. Similarly to the proof of (214) (see (A319)–(A321)), (A334) yields (215) while ignoring the operation in the left side of (A334).
Appendix O. Proof of Theorem 13
For every , let the M elements of be sorted in decreasing order according to the conditional probabilities . Let be the ℓ-th most probable element in given , i.e.,
The conditional list decoding error probability, given , satisfies
and the (average) list decoding error probability satisfies . Let denote the equiprobable distribution on , and let be given by with , where for . The function is convex, and for ; the f-divergence is referred to as the divergence (see, e.g., [54]), i.e.,
for all probability measures P and Q. For every ,
where (A346) holds due to the data-processing inequality for f-divergences, and because of (A344); (A347) holds due to (A345). Furthermore, in view of (A342) and (A344), it follows that for all ; by the definition of , it follows that
Substituting (A348) into the right side of (A347) gives that, for all ,
Taking expectations with respect to Y in (A349) and (A350), and applying Jensen’s inequality to the convex function , for , gives
On the other hand, the left side of (A351) is equal to
where (A355) is due to (A345), and since for all ; (A356) and (A357) hold, respectively, by the simple identities , and for and ; finally, (A358) holds since
for all . Substituting (A355)–(A358) and rearranging terms gives that
which is the lower bound on the list decoding error probability in (222).
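The f-divergence used in this appendix is only partially displayed above; assuming (as the citation of [54] suggests) that it is the E_γ divergence with f_γ(t) = (t − γ)^+ for γ ≥ 1, the following minimal sketch evaluates it on a finite alphabet and checks the equivalent maximization over events.

```python
import numpy as np
from itertools import chain, combinations

def E_gamma(p, q, gamma):
    """sum_x max(P(x) - gamma*Q(x), 0), i.e., D_f(P||Q) with f(t) = max(t - gamma, 0)."""
    return float(np.maximum(np.asarray(p, float) - gamma * np.asarray(q, float), 0.0).sum())

p = np.array([0.5, 0.3, 0.2])    # hypothetical P
q = np.array([0.25, 0.25, 0.5])  # hypothetical Q
gamma = 1.5

# Equivalent variational form: maximum of P(A) - gamma*Q(A) over all events A.
indices = range(len(p))
events = chain.from_iterable(combinations(indices, r) for r in range(len(p) + 1))
best = max(p[list(A)].sum() - gamma * q[list(A)].sum() for A in events)
assert np.isclose(E_gamma(p, q, gamma), best)
```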
We next proceed to prove the sufficient conditions for equality in (222). First, if for all , the list decoder selects the most probable elements in given that , then equality holds in (A359). In this case, for all , where denotes the ℓ-th most probable element in , given , with ties in probabilities which are resolved arbitrarily (see (A342)). Let . If, for every , is fixed for all and is fixed for all , then equality holds in (A346) (and therefore equalities also hold in (A349) and (A351)). For all , let the common values of the conditional probabilities over each of these two sets, respectively, be equal to and . Then,
which gives the condition in (223). Furthermore, if for all , , then the operation in the right side of (A351) is redundant, which causes (A352) to hold with equality as an expectation of a linear function; furthermore, also (A354) holds with equality in this case (since an expectation of a non-negative and bounded function is non-negative and finite). By (223) and (A344), it follows that for all , and therefore the satisfiability of (224) implies that equalities hold in (A352) and (A354). Overall, under the above condition, it therefore follows that (222) holds with equality. To verify it explicitly, under conditions (223) and (224) which have been derived as above, the right side of (222) satisfies
where (A361) holds since, under (224), it follows that for all ; (A362) holds by straightforward algebra, where is canceled out; (A363) holds by the condition in (223); finally, (A364) holds by (A282), (A283) and (A342). This indeed explicitly verifies that the conditions in Theorem 13 yield an equality in (222).
Appendix P. Proofs of Theorems Related to Tunstall Trees
Appendix P.1. Proof of Theorem 14
Theorem 14 (a) follows from (226) (see ([38], Corollary 1)).
By ([72], Lemma 6), the ratio of the maximal to minimal positive masses of is upper bounded by the reciprocal of the minimal probability mass of the source symbols. Theorem 14 (b) is therefore obtained from Theorem 7 (c). Theorem 14 (c) consequently holds due to Theorem 7 (d); the bound in the right side of (233), which holds for every number of leaves n in the Tunstall tree, is equal to the limit of the upper bound in the right side of (232) when we let .
Appendix P.2. Proof of Theorem 15
In view of ([33], Theorem 4), if the fixed length of the codewords of the Tunstall code is equal to m, then the compression rate R of the code satisfies
where denotes the Shannon entropy of the memoryless and stationary discrete source, , n is the number of leaves in the Tunstall tree, and the logarithms with an unspecified base in the right side of (A365) can be taken to an arbitrary base. By the setting in Theorem 15, the construction of the Tunstall tree satisfies . Hence, if , then ; if , then (since the length of the codewords is m), and . Combining this with (A365) yields
In order to assert that , it is required that the right side of (A366) does not exceed . This gives
where d is given in (235). In view of the discussion in Section 3.3.2 on the exemplification of Theorem 7 for the relative entropy, and the related analysis in Appendix I, the condition in (A367) is equivalent to where is defined in (171). Since , this leads to the sufficient condition in (236) for the required compression rate R of the Tunstall code.
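As background for the Tunstall-tree quantities analyzed in this appendix, the following minimal sketch implements the standard Tunstall construction (repeatedly splitting the most probable leaf of the parse tree of a memoryless source) and reports the compression rate m/E[phrase length] of the resulting fixed-length code; the source and the number of leaves are hypothetical choices, and the sketch is illustrative rather than a reproduction of the bounds above.

```python
import heapq
import math

def tunstall_leaves(source_pmf, n_leaves):
    """Standard Tunstall construction: keep splitting the most probable leaf of the
    parse tree until there are n_leaves leaves; returns (leaf probability, phrase length) pairs."""
    assert n_leaves >= len(source_pmf)
    heap = [(-p, 1) for p in source_pmf]   # max-heap via negated probabilities
    heapq.heapify(heap)
    while len(heap) < n_leaves:
        neg_p, length = heapq.heappop(heap)          # most probable leaf
        for p_sym in source_pmf:                     # replace it by its children
            heapq.heappush(heap, (neg_p * p_sym, length + 1))
    return [(-neg_p, length) for neg_p, length in heap]

source = [0.7, 0.3]        # hypothetical memoryless binary source
n = 16                     # hypothetical number of leaves
leaves = tunstall_leaves(source, n)

m = math.ceil(math.log2(n))                          # fixed codeword length in bits
expected_phrase_length = sum(p * l for p, l in leaves)
rate = m / expected_phrase_length                    # compression rate in bits per source symbol
entropy = -sum(p * math.log2(p) for p in source)
print(f"rate = {rate:.3f} bits/symbol >= H(X) = {entropy:.3f} bits/symbol")
```

The greedy splitting rule keeps the leaf probabilities as balanced as possible, which is the mechanism behind the closeness of the leaf-induced distribution to the equiprobable one studied above.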
References
- Ali, S.M.; Silvey, S.D. A general class of coefficients of divergence of one distribution from another. J. R. Stat. Soc. 1966, 28, 131–142. [Google Scholar] [CrossRef]
- Csiszár, I. Eine Informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Publ. Math. Inst. Hungar. Acad. Sci. 1963, 8, 85–108. (In German) [Google Scholar]
- Csiszár, I. A note on Jensen’s inequality. Studia Scientiarum Mathematicarum Hungarica 1966, 1, 185–188. [Google Scholar]
- Csiszár, I. Information-type measures of difference of probability distributions and indirect observations. Studia Scientiarum Mathematicarum Hungarica 1967, 2, 299–318. [Google Scholar]
- Csiszár, I. On topological properties of f-divergences. Studia Scientiarum Mathematicarum Hungarica 1967, 2, 329–339. [Google Scholar]
- Csiszár, I. A class of measures of informativity of observation channels. Periodica Mathematicarum Hungarica 1972, 2, 191–213. [Google Scholar] [CrossRef]
- Morimoto, T. Markov processes and the H-theorem. J. Phys. Soc. Jpn. 1963, 18, 328–331. [Google Scholar] [CrossRef]
- Liese, F.; Vajda, I. Convex Statistical Distances. In Teubner-Texte Zur Mathematik; Springer: Leipzig, Germany, 1987; Volume 95. [Google Scholar]
- Pardo, L. Statistical Inference Based on Divergence Measures; Chapman and Hall/CRC, Taylor & Francis Group: Boca Raton, FL, USA, 2006. [Google Scholar]
- Pardo, M.C.; Vajda, I. About distances of discrete distributions satisfying the data processing theorem of information theory. IEEE Trans. Inf. Theory 1997, 43, 1288–1293. [Google Scholar] [CrossRef]
- Stummer, W.; Vajda, I. On divergences of finite measures and their applicability in statistics and information theory. Statistics 2010, 44, 169–187. [Google Scholar] [CrossRef]
- Vajda, I. Theory of Statistical Inference and Information; Kluwer Academic Publishers: Dordrecht, The Netherlands, 1989. [Google Scholar]
- Ziv, J.; Zakai, M. On functionals satisfying a data-processing theorem. IEEE Trans. Inf. Theory 1973, 19, 275–283. [Google Scholar] [CrossRef]
- Zakai, M.; Ziv, J. A generalization of the rate-distortion theory and applications. In Information Theory—New Trends and Open Problems; Longo, G., Ed.; Springer: Berlin/Heidelberg, Germany, 1975; pp. 87–123. [Google Scholar]
- Merhav, N. Data processing theorems and the second law of thermodynamics. IEEE Trans. Inf. Theory 2011, 57, 4926–4939. [Google Scholar] [CrossRef]
- Liese, F.; Vajda, I. On divergences and informations in statistics and information theory. IEEE Trans. Inf. Theory 2006, 52, 4394–4412. [Google Scholar] [CrossRef]
- Csiszár, I.; Körner, J. Information Theory: Coding Theorems for Discrete Memoryless Systems, 2nd ed.; Cambridge University Press: Cambridge, UK, 2011. [Google Scholar]
- Ahlswede, R.; Gács, P. Spreading of sets in product spaces and hypercontraction of the Markov operator. Ann. Probab. 1976, 4, 925–939. [Google Scholar] [CrossRef]
- Calmon, F.P.; Polyanskiy, Y.; Wu, Y. Strong data processing inequalities for input constrained additive noise channels. IEEE Trans. Inf. Theory 2018, 64, 1879–1892. [Google Scholar] [CrossRef]
- Cohen, J.E.; Iwasa, Y.; Rautu, Gh.; Ruskai, M.B.; Seneta, E.; Zbăganu, G. Relative entropy under mappings by stochastic matrices. Linear Algebra Appl. 1993, 179, 211–235. [Google Scholar] [CrossRef]
- Cohen, J.E.; Kemperman, J.H.B.; Zbăganu, Gh. Comparison of Stochastic Matrices with Applications in Information Theory, Statistics, Economics and Population Sciences; Birkhäuser: Boston, MA, USA, 1998. [Google Scholar]
- Makur, A.; Polyanskiy, Y. Comparison of channels: Criteria for domination by a symmetric channel. IEEE Trans. Inf. Theory 2018, 64, 5704–5725. [Google Scholar] [CrossRef]
- Polyanskiy, Y.; Wu, Y. Dissipation of information in channels with input constraints. IEEE Trans. Inf. Theory 2016, 62, 35–55. [Google Scholar] [CrossRef]
- Raginsky, M. Strong data processing inequalities and Φ-Sobolev inequalities for discrete channels. IEEE Trans. Inf. Theory 2016, 62, 3355–3389. [Google Scholar] [CrossRef]
- Polyanskiy, Y.; Wu, Y. Strong data processing inequalities for channels and Bayesian networks. In Convexity and Concentration; Carlen, E., Madiman, M., Werner, E.M., Eds.; Springer: Berlin/Heidelberg, Germany, 2017; Volume 161, pp. 211–249. [Google Scholar]
- Makur, A.; Zheng, L. Linear bounds between contraction coefficients for f-divergences. arXiv 2018, arXiv:1510.01844.v4. [Google Scholar]
- Pearson, K. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Lond. Edinb. Dublin Philos. Mag. J. Sci. 1900, 50, 157–175. [Google Scholar] [CrossRef]
- Neyman, J. Contribution to the theory of the χ2 test. In Proceedings of the First Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 13–18 August 1945 and 27–29 January 1946; University of California Press: Berkeley, CA, USA, 1949; pp. 239–273. [Google Scholar]
- Sarmanov, O.V. Maximum correlation coefficient (non-symmetric case). In Selected Translations in Mathematical Statistics and Probability; American Mathematical Society: Providence, RI, USA, 1962. [Google Scholar]
- Marshall, A.W.; Olkin, I.; Arnold, B.C. Inequalities: Theory of Majorization and Its Applications, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2011. [Google Scholar]
- Steele, J.M. The Cauchy-Schwarz Master Class; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
- Bhatia, R. Matrix Analysis; Springer: Berlin/Heidelberg, Germany, 1997. [Google Scholar]
- Cicalese, F.; Gargano, L.; Vaccaro, U. Bounds on the entropy of a function of a random variable and their applications. IEEE Trans. Inf. Theory 2018, 64, 2220–2230. [Google Scholar] [CrossRef]
- Sason, I. Tight bounds on the Rényi entropy via majorization with applications to guessing and compression. Entropy 2018, 20, 896. [Google Scholar] [CrossRef]
- Ho, S.W.; Verdú, S. On the interplay between conditional entropy and error probability. IEEE Trans. Inf. Theory 2010, 56, 5930–5942. [Google Scholar] [CrossRef]
- Ho, S.W.; Verdú, S. Convexity/concavity of the Rényi entropy and α-mutual information. In Proceedings of the 2015 IEEE International Symposium on Information Theory, Hong Kong, China, 14–19 June 2015; pp. 745–749. [Google Scholar]
- Corless, R.M.; Gonnet, G.H.; Hare, D.E.G.; Jeffrey, D.J.; Knuth, D.E. On the Lambert W function. Adv. Comput. Math. 1996, 5, 329–359. [Google Scholar] [CrossRef]
- Cicalese, F.; Gargano, L.; Vaccaro, U. A note on approximation of uniform distributions from variable-to-fixed length codes. IEEE Trans. Inf. Theory 2006, 52, 3772–3777. [Google Scholar] [CrossRef]
- Tsallis, C. Possible generalization of the Boltzmann-Gibbs statistics. J. Stat. Phys. 1988, 52, 479–487. [Google Scholar] [CrossRef]
- Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 20 June–30 July 1960; University of California Press: Berkeley, CA, USA, 1961; Volume 1, pp. 547–561. [Google Scholar]
- Cicalese, F.; Gargano, L.; Vaccaro, U. Minimum-entropy couplings and their applications. IEEE Trans. Inf. Theory 2019, 65, 3436–3451. [Google Scholar] [CrossRef]
- Sason, I.; Verdú, S. Arimoto-Rényi conditional entropy and Bayesian M-ary hypothesis testing. IEEE Trans. Inf. Theory 2018, 64, 4–25. [Google Scholar] [CrossRef]
- Amari, S.; Nagaoka, H. Methods of Information Geometry; Oxford University Press: New York, NY, USA, 2000. [Google Scholar]
- Cichocki, A.; Amari, S.I. Families of Alpha- Beta- and Gamma- divergences: Flexible and robust measures of similarities. Entropy 2010, 12, 1532–1568. [Google Scholar] [CrossRef]
- Sason, I. On f-divergences: Integral representations, local behavior, and inequalities. Entropy 2018, 20, 383. [Google Scholar] [CrossRef]
- Fano, R.M. Class Notes for Course 6.574: Transmission of Information; MIT: Cambridge, MA, USA, 1952. [Google Scholar]
- Ahlswede, R.; Gács, P.; Körner, J. Bounds on conditional probabilities with applications in multi-user communication. Z. Wahrscheinlichkeitstheorie verw. Gebiete 1977, 34, 157–177, Correction in 1977, 39, 353–354. [Google Scholar] [CrossRef]
- Raginsky, M.; Sason, I. Concentration of measure inequalities in information theory, communications and coding: Third edition. In Foundations and Trends (FnT) in Communications and Information Theory; NOW Publishers: Delft, The Netherlands, 2019; pp. 1–266. [Google Scholar]
- Chen, X.; Guntuboyina, A.; Zhang, Y. On Bayes risk lower bounds. J. Mach. Learn. Res. 2016, 17, 7687–7744. [Google Scholar]
- Guntuboyina, A. Lower bounds for the minimax risk using f-divergences, and applications. IEEE Trans. Inf. Theory 2011, 57, 2386–2399. [Google Scholar] [CrossRef]
- Kim, Y.H.; Sutivong, A.; Cover, T.M. State amplification. IEEE Trans. Inf. Theory 2008, 54, 1850–1859. [Google Scholar] [CrossRef]
- Arimoto, S. Information measures and capacity of order α for discrete memoryless channels. In Topics in Information Theory—2nd Colloquium; Csiszár, I., Elias, P., Eds.; Colloquia Mathematica Societatis Janós Bolyai; Elsevier: Amsterdam, The Netherlands, 1977; Volume 16, pp. 41–52. [Google Scholar]
- Ahlswede, R.; Körner, J. Source coding with side information and a converse for degraded broadcast channels. IEEE Trans. Inf. Theory 1975, 21, 629–637. [Google Scholar] [CrossRef]
- Liu, J.; Cuff, P.; Verdú, S. Eγ resolvability. IEEE Trans. Inf. Theory 2017, 63, 2629–2658. [Google Scholar]
- Brémaud, P. Discrete Probability Models and Methods: Probability on Graphs and Trees, Markov Chains and Random Fields, Entropy and Coding; Springer: Basel, Switzerland, 2017. [Google Scholar]
- Tunstall, B.K. Synthesis of Noiseless Compression Codes. Ph.D. Thesis, Georgia Institute of Technology, Atlanta, GA, USA, 1967. [Google Scholar]
- DeGroot, M.H. Uncertainty, information and sequential experiments. Ann. Math. Stat. 1962, 33, 404–419. [Google Scholar] [CrossRef]
- Roberts, A.W.; Varberg, D.E. Convex Functions; Academic Press: Cambridge, MA, USA, 1973. [Google Scholar]
- Rockafellar, R.T. Convex Analysis; Princeton University Press: Princeton, NJ, USA, 1996. [Google Scholar]
- Collet, J.F. An exact expression for the gap in the data processing inequality for f-divergences. IEEE Trans. Inf. Theory 2019, 65, 4387–4391. [Google Scholar] [CrossRef]
- Bregman, L.M. The relaxation method of finding the common points of convex sets and its application to the solution of problems in convex programming. USSR Comput. Math. Math. Phys. 1967, 7, 200–217. [Google Scholar] [CrossRef]
- Sason, I.; Verdú, S. f-divergence inequalities. IEEE Trans. Inf. Theory 2016, 62, 5973–6006. [Google Scholar] [CrossRef]
- Gilardoni, G.L. On Pinsker’s and Vajda’s type inequalities for Csiszár’s f-divergences. IEEE Trans. Inf. Theory 2010, 56, 5377–5386. [Google Scholar] [CrossRef]
- Gibbs, A.L.; Su, F.E. On choosing and bounding probability metrics. Int. Stat. Rev. 2002, 70, 419–435. [Google Scholar] [CrossRef]
- Simic, S. Second and third order moment inequalities for probability distributions. Acta Math. Hung. 2018, 155, 518–532. [Google Scholar] [CrossRef]
- Van Erven, T.; Harremoës, P. Rényi divergence and Kullback–Leibler divergence. IEEE Trans. Inf. Theory 2014, 60, 3797–3820. [Google Scholar] [CrossRef]
- Pardo, M.C.; Vajda, I. On asymptotic properties of information-theoretic divergences. IEEE Trans. Inf. Theory 2003, 49, 1860–1868. [Google Scholar] [CrossRef]
- Beck, A. Introduction to Nonlinear Optimization: Theory, Algorithms and Applications with Matlab; SIAM-Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2014. [Google Scholar]
- Simic, S. On logarithmic convexity for differences of power means. J. Inequalities Appl. 2008, 2007, 037359. [Google Scholar] [CrossRef]
- Keziou, A. Dual representation of φ-divergences and applications. C. R. Math. 2003, 336, 857–862. [Google Scholar] [CrossRef]
- Nguyen, X.; Wainwright, M.J.; Jordan, M.I. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Trans. Inf. Theory 2010, 56, 5847–5861. [Google Scholar] [CrossRef]
- Jelinek, F.; Schneider, K.S. On variable-length-to-block coding. IEEE Trans. Inf. Theory 1972, 18, 765–774. [Google Scholar] [CrossRef]
© 2019 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).