Abstract
This paper provides tight bounds on the Rényi entropy of a function of a discrete random variable with a finite number of possible values, where the considered function is not one-to-one. To that end, a tight lower bound on the Rényi entropy of a discrete random variable with finite support is derived as a function of the size of the support and the ratio of the maximal to minimal probability masses. This work was inspired by the recently published paper by Cicalese et al., which is focused on the Shannon entropy, and it strengthens and generalizes the results of that paper to Rényi entropies of arbitrary positive orders. In view of these generalized bounds and the works by Arikan and Campbell, non-asymptotic bounds are derived for guessing moments and for the lossless data compression of discrete memoryless sources.
1. Introduction
Majorization theory is a simple and productive concept in the theory of inequalities, which also unifies a variety of familiar bounds [1,2]. These mathematical tools find various applications in diverse fields (see, e.g., [3]) such as economics [2,4,5], combinatorial analysis [2,6], geometric inequalities [2], matrix theory [2,6,7,8], Shannon theory [5,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25], and wireless communications [26,27,28,29,30,31,32,33].
This work, which relies on majorization theory, has been greatly inspired by the recent insightful paper by Cicalese et al. [12] (the research in the present paper was initiated while the author handled [12] as an associate editor). The work in [12] provides tight bounds on the Shannon entropy of a function of a discrete random variable with a finite number of possible values, where the considered function is not one-to-one. For that purpose, and while being of interest in its own right (see [12], Section 6), a tight lower bound on the Shannon entropy of a discrete random variable with finite support was derived in [12] as a function of the size of the support and the ratio of the maximal to minimal probability masses. The present paper aims to extend the bounds in [12] to Rényi entropies of arbitrary positive orders (note that the Shannon entropy is equal to the Rényi entropy of order 1), and to study the information-theoretic applications of these (non-trivial) generalizations in the context of the non-asymptotic analysis of guessing moments and lossless data compression.
The motivation for this work is rooted in the diverse information-theoretic applications of Rényi measures [34]. These include (but are not limited to) asymptotically tight bounds on guessing moments [35], information-theoretic applications such as guessing subject to distortion [36], joint source-channel coding and guessing with application to sequential decoding [37], guessing with a prior access to a malicious oracle [38], guessing while allowing the guesser to give up and declare an error [39], guessing in secrecy problems [40,41], guessing with limited memory [42], and guessing under source uncertainty [43]; encoding tasks [44,45]; Bayesian hypothesis testing [9,22,23], and composite hypothesis testing [46,47]; Rényi generalizations of the rejection sampling problem in [48], motivated by the communication complexity in distributed channel simulation, where these generalizations distinguish between causal and noncausal sampler scenarios [49]; Wyner’s common information in distributed source simulation under Rényi divergence measures [50]; various other source coding theorems [23,39,51,52,53,54,55,56,57,58], channel coding theorems [23,58,59,60,61,62,63,64], including coding theorems in quantum information theory [65,66,67].
The presentation in this paper is structured as follows: Section 2 provides notation and essential preliminaries for the analysis in this paper. Section 3 and Section 4 strengthen and generalize, in a non-trivial way, the bounds on the Shannon entropy in [12] to Rényi entropies of arbitrary positive orders (see Theorems 1 and 2). Section 5 relies on the generalized bound from Section 4 and the work by Arikan [35] to derive non-asymptotic bounds for guessing moments (see Theorem 3); Section 5 also relies on the generalized bound in Section 4 and the source coding theorem by Campbell [51] (see Theorem 4) for the derivation of non-asymptotic bounds for lossless compression of discrete memoryless sources (see Theorem 5).
2. Notation and Preliminaries
Let
- $P$ be a probability mass function defined on a finite set $\mathcal{A}$;
- $p_{\max}$ and $p_{\min}$ be, respectively, the maximal and minimal positive masses of $P$;
- $G_P(k)$ be the sum of the $k$ largest masses of $P$ for $k \in \{1, \ldots, |\mathcal{A}|\}$ (note that $G_P(1) = p_{\max}$ and $G_P(|\mathcal{A}|) = 1$);
- $\mathcal{P}_n$, for an integer $n \geq 1$, be the set of all probability mass functions defined on $\mathcal{A}$ with $|\mathcal{A}| = n$; without any loss of generality, let $\mathcal{A} = \{1, \ldots, n\}$;
- $\mathcal{P}_n(\rho)$, for $\rho \geq 1$ and an integer $n \geq 2$, be the subset of all probability measures $P \in \mathcal{P}_n$ such that $p_{\max} \leq \rho \, p_{\min}$.
Definition 1 (Majorization).
Consider discrete probability mass functions $P$ and $Q$ defined on the same (finite or countably infinite) set $\mathcal{A}$. It is said that $P$ is majorized by $Q$ (or $Q$ majorizes $P$), and it is denoted by $P \prec Q$, if $G_P(k) \leq G_Q(k)$ for all $k \geq 1$ (recall that, summing over the whole set, $G_P(|\mathcal{A}|) = G_Q(|\mathcal{A}|) = 1$). If $P$ and $Q$ are defined on finite sets of different cardinalities, then the probability mass function which is defined over the smaller set is first padded by zeros for making the cardinalities of these sets equal.
By Definition 1, a unit mass majorizes any other distribution; on the other hand, the equiprobable distribution on a finite set is majorized by any other distribution defined on the same set.
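As a quick illustration of Definition 1 (an illustration only, not part of the original analysis), the following Python sketch checks whether one probability vector majorizes another by comparing the partial sums of the decreasingly sorted masses, with zero-padding as described above; the example vectors are arbitrary.

```python
import numpy as np

def majorizes(q, p, tol=1e-12):
    """Return True if Q majorizes P (i.e., P ≺ Q), after zero-padding to a common length."""
    n = max(len(p), len(q))
    p = np.sort(np.pad(np.asarray(p, float), (0, n - len(p))))[::-1]
    q = np.sort(np.pad(np.asarray(q, float), (0, n - len(q))))[::-1]
    assert np.isclose(p.sum(), 1.0) and np.isclose(q.sum(), 1.0)
    # Q majorizes P iff every partial sum of the k largest masses of Q
    # dominates the corresponding partial sum for P.
    return bool(np.all(np.cumsum(q) >= np.cumsum(p) - tol))

print(majorizes([1.0], [0.5, 0.3, 0.2]))            # a unit mass majorizes any distribution
print(majorizes([0.5, 0.3, 0.2], [1/3, 1/3, 1/3]))  # the equiprobable law is majorized by any law
```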
Definition 2 (Schur-convexity/concavity).
A function $f \colon \mathcal{P}_n \to \mathbb{R}$ is said to be Schur-convex if, for every $P, Q \in \mathcal{P}_n$ such that $P \prec Q$, we have $f(P) \leq f(Q)$. Likewise, $f$ is said to be Schur-concave if $-f$ is Schur-convex, i.e., $P, Q \in \mathcal{P}_n$ and $P \prec Q$ imply that $f(P) \geq f(Q)$.
Definition 3
(Rényi entropy [34]). Let $X$ be a random variable taking values on a finite or countably infinite set $\mathcal{X}$, and let $P_X$ be its probability mass function. The Rényi entropy of order $\alpha \in (0, 1) \cup (1, \infty)$ is given by
$$H_\alpha(X) = \frac{1}{1-\alpha} \, \log \sum_{x \in \mathcal{X}} P_X^\alpha(x).$$
Unless the base is explicitly stated, the logarithm can be taken to an arbitrary base, with exp denoting the inverse function of log.
By its continuous extension,
$$H_1(X) := \lim_{\alpha \to 1} H_\alpha(X) = H(X),$$
where $H(X)$ is the (Shannon) entropy of $X$.
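For concreteness, here is a small Python sketch (an illustration, not taken from the paper) of the Rényi entropy in Definition 3 and of its continuous extension to the Shannon entropy at order 1; the pmf is an arbitrary dyadic example.

```python
import numpy as np

def renyi_entropy(p, alpha, base=2.0):
    """Rényi entropy of order alpha > 0 of a pmf p; alpha = 1 returns the Shannon entropy."""
    p = np.asarray([x for x in p if x > 0], dtype=float)
    if np.isclose(alpha, 1.0):
        return float(-np.sum(p * np.log(p)) / np.log(base))
    return float(np.log(np.sum(p ** alpha)) / ((1.0 - alpha) * np.log(base)))

p = [0.5, 0.25, 0.125, 0.125]              # Shannon entropy = 1.75 bits
for alpha in (0.5, 0.9, 0.999, 1.0, 1.001, 2.0):
    print(alpha, renyi_entropy(p, alpha))  # values approach 1.75 as alpha -> 1
```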
Proposition 1
(Schur-concavity of the Rényi entropy (Appendix F.3.a (p. 562) of [2])). The Rényi entropy of an arbitrary order $\alpha > 0$ is Schur-concave; in particular, for $\alpha = 1$, the Shannon entropy is Schur-concave.
Remark 1.
[17] (Theorem 2) strengthens Proposition 1, though it is not needed for our analysis.
Definition 4
(Rényi divergence [34]). Let $P$ and $Q$ be probability mass functions defined on a finite or countably infinite set $\mathcal{X}$. The Rényi divergence of order $\alpha$ is defined as follows:
- If $\alpha \in (0, 1) \cup (1, \infty)$, then
$$D_\alpha(P \| Q) = \frac{1}{\alpha - 1} \, \log \sum_{x \in \mathcal{X}} P^\alpha(x) \, Q^{1-\alpha}(x);$$
- By the continuous extension of $D_\alpha(P \| Q)$ at $\alpha = 1$,
$$D_1(P \| Q) = D(P \| Q),$$
where $D(P \| Q)$ in the right side is the relative entropy (a.k.a. the Kullback–Leibler divergence).
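The following sketch (illustrative only, with arbitrary pmfs of full support on the same set) evaluates the Rényi divergence of Definition 4 and its continuous extension to the relative entropy at order 1.

```python
import numpy as np

def renyi_divergence(p, q, alpha, base=2.0):
    """Rényi divergence D_alpha(P||Q) for pmfs with full support on the same finite set."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    if np.isclose(alpha, 1.0):
        return float(np.sum(p * np.log(p / q)) / np.log(base))   # Kullback–Leibler divergence
    return float(np.log(np.sum(p ** alpha * q ** (1.0 - alpha))) / ((alpha - 1.0) * np.log(base)))

p, q = [0.4, 0.4, 0.2], [1/3, 1/3, 1/3]
for alpha in (0.5, 0.99, 1.0, 2.0):
    print(alpha, renyi_divergence(p, q, alpha))   # the values approach D(P||Q) as alpha -> 1
```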
Throughout this paper, for $a \in \mathbb{R}$, $\lceil a \rceil$ denotes the ceiling of $a$ (i.e., the smallest integer not smaller than the real number $a$), and $\lfloor a \rfloor$ denotes the floor of $a$ (i.e., the greatest integer not greater than $a$).
3. A Tight Lower Bound on the Rényi Entropy
We provide in this section a tight lower bound on the Rényi entropy, of an arbitrary positive order $\alpha$, when the probability mass function of the discrete random variable is defined on a finite set of cardinality $n$, and the ratio of the maximal to minimal probability masses is upper bounded by an arbitrary fixed value $\rho \geq 1$. In other words, we derive the largest possible gap between the order-$\alpha$ Rényi entropies of an equiprobable distribution and a non-equiprobable distribution (defined on a finite set of the same cardinality) with a given value of the ratio of the maximal to minimal probability masses. The basic tool used for the development of our result in this section is majorization theory. Our result strengthens the result in [12] (Theorem 2) for the Shannon entropy, and it further provides a generalization to the Rényi entropy of an arbitrary positive order (recall that the Shannon entropy is equal to the Rényi entropy of order 1, see (4)). Furthermore, the approach for proving the main result in this section differs significantly from the proof in [12] for the Shannon entropy. The main result in this section is a key result for all that follows in this paper.
The following lemma is a restatement of [12] (Lemma 6).
Lemma 1.
Let with and an integer , and assume without any loss of generality that the probability mass function P is defined on the set . Let be defined on as follows:
where
Then,
- (1)
- , and ;
- (2)
- .
Proof.
See [12] (p. 2236) (top of the second column). ☐
Lemma 2.
Let , , and be an integer. For
let be defined on as follows:
where
Then, for every ,
Proof.
See Appendix A. ☐
Lemma 3.
For and , let
with . Then, for every ,
and is monotonically increasing in .
Proof.
See Appendix B. ☐
Lemma 4.
For and , the limit
exists, having the following properties:
- (a)
- If , thenand
- (b)
- If , then
- (c)
- For all ,
- For every , and ,
Proof.
See Appendix C. ☐
In view of Lemmata 1–4, we obtain the following main result in this section:
Theorem 1.
Let , , , and let in (16) designate the maximal gap between the order-α Rényi entropies of equiprobable and arbitrary distributions in . Then,
Remark 2.
For a numerical illustration of Theorem 1, Figure 1 provides a plot of in (20) and (22) as a function of , confirming numerically the properties in (21) and (23). Furthermore, Figure 2 provides plots of in (16) as a function of , for (left plot) and (right plot), with several values of ; the calculation of the curves in these plots relies on (15), (20) and (22), and they illustrate the monotonicity and boundedness properties in (24).
Figure 2.
Plots of in (16) (log is on base 2) as a function of , for (left plot) and (right plot), with several values of .
Remark 3.
Theorem 1 strengthens the result in [12] (Theorem 2) for the Shannon entropy (i.e., for ), in addition to its generalization to Rényi entropies of arbitrary orders . This is because our lower bound on the Shannon entropy is given by
whereas the looser bound in [12] is given by (see [12] ((7)) and (22) here)
and we recall that (see (24)). Figure 3 shows the improvement of the new lower bound (28) over (29) by comparing the two bounds for several values of n. Figure 3 indicates that the improvement of the lower bound on the Shannon entropy in (28) over the bound in (29) is very marginal for small values of ρ (even for small values of n), whereas the improvement over the bound in (29) is significant for large values of ρ; as n increases, the value of ρ also needs to be increased in order to observe an improvement of the lower bound in (28) over (29) (see Figure 3).
Figure 3.
A plot of in (22) versus for finite n (, and 8) as a function of ρ.
An improvement of the bound in (28) over (29) leads to a tightening of the upper bound in [12] (Theorem 4) on the compression rate of Tunstall codes for discrete memoryless sources, which further tightens the bound by Jelinek and Schneider in [69] (Equation (9)). More explicitly, in view of [12] (Section 6), an improved upper bound on the compression rate of these variable-to-fixed lossless source codes is obtained by combining [12] (Equations (36) and (38)) with a tightened lower bound on the entropy of the leaves of the tree graph for Tunstall codes. From (28), the latter lower bound is given by where is expressed in bits, is the reciprocal of the minimal positive probability of the source symbols, and n is the number of codewords (so, all codewords are of length bits). This yields a reduction in the upper bound on the non-asymptotic compression rate R of Tunstall codes from (see [12] (Equation (40)) and (22)) to bits per source symbol where denotes the source entropy (converging, in view of (17), to as we let ).
Remark 4.
Equality (15), with the minimizing probability mass function of the form (13), holds in general when the Rényi entropy is replaced by an arbitrary Schur-concave function (as can be easily verified from the proof of Lemma 2 in Appendix A). However, the analysis leading to Lemmata 3–4 and Theorem 1 applies specifically to the Rényi entropy.
4. Bounds on the Rényi Entropy of a Function of a Discrete Random Variable
This section relies on Theorem 1 and majorization for extending [12] (Theorem 1), which applies to the Shannon entropy, to Rényi entropies of any positive order. More explicitly, let
- $\mathcal{X}$ and $\mathcal{Y}$ be finite sets of cardinalities $|\mathcal{X}| = n$ and $|\mathcal{Y}| = m$ with $n > m$; without any loss of generality, let $\mathcal{X} = \{1, \ldots, n\}$ and $\mathcal{Y} = \{1, \ldots, m\}$;
- $X$ be a random variable taking values on $\mathcal{X}$ with a probability mass function $P_X$;
- $\mathcal{F}$ be the set of deterministic functions $f \colon \mathcal{X} \to \mathcal{Y}$; note that no such $f$ is one-to-one since $n > m$.
The main result in this section sharpens the inequality $H_\alpha(f(X)) \leq H_\alpha(X)$, which holds for every deterministic function $f \in \mathcal{F}$ and every $\alpha > 0$, by obtaining non-trivial upper and lower bounds on the maximal value of $H_\alpha(f(X))$ over the functions in $\mathcal{F}$. The calculation of the exact minimal value of $H_\alpha(f(X))$ over $\mathcal{F}$ is much easier, and it is expressed in closed form by capitalizing on the Schur-concavity of the Rényi entropy.
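The inequality above can be checked numerically: the pmf of f(X), padded with zeros to length n, majorizes the pmf of X, so by Schur-concavity its Rényi entropy cannot be larger. The following sketch (with a hypothetical pmf and a hypothetical mapping f) illustrates this; it is not the construction used in Theorem 2.

```python
import numpy as np

def renyi_entropy(p, alpha, base=2.0):
    p = np.asarray([x for x in p if x > 0], dtype=float)
    if np.isclose(alpha, 1.0):
        return float(-np.sum(p * np.log(p)) / np.log(base))
    return float(np.log(np.sum(p ** alpha)) / ((1.0 - alpha) * np.log(base)))

def pmf_of_fX(p_x, f, m):
    """Pmf of Y = f(X): masses of X mapped to the same value y in {0, ..., m-1} are summed."""
    p_y = np.zeros(m)
    for x, px in enumerate(p_x):
        p_y[f[x]] += px
    return p_y

p_x = [0.30, 0.25, 0.20, 0.15, 0.10]   # hypothetical pmf of X (n = 5)
f   = [0, 0, 1, 1, 2]                  # hypothetical deterministic map onto m = 3 values
p_y = pmf_of_fX(p_x, f, m=3)
for alpha in (0.5, 1.0, 2.0, 10.0):
    assert renyi_entropy(p_y, alpha) <= renyi_entropy(p_x, alpha) + 1e-12
```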
The following main result extends [12] (Theorem 1) to Rényi entropies of arbitrary positive orders.
Theorem 2.
Let be a random variable which satisfies .
- (a)
- For , if , let be the equiprobable random variable on ; otherwise, if , let be a random variable with the probability mass functionwhere is the maximal integer such thatThen, for every ,where
- (b)
- There exists an explicit construction of a deterministic function such thatwhere is independent of α, and it is obtained by using Huffman coding (as in [12] for ).
- (c)
- Let be a random variable with the probability mass functionThen, for every ,
Remark 5.
Setting $\alpha = 1$ specializes Theorem 2 to [12] (Theorem 1) (regarding the Shannon entropy). This point is further elaborated in Remark 8, after the proof of Theorem 2.
Remark 6.
Similarly to [12] (Lemma 1), an exact solution of the maximization problem in the left side of (32) is strongly NP-hard [70]; this means that, unless P = NP, there is no polynomial-time algorithm which, for an arbitrarily small ε > 0, computes an admissible deterministic function f such that
A proof of Theorem 2 relies on the following lemmata.
Lemma 5.
Proof.
Since (see [12] (Lemma 2)) with , and for all such that (see [12] (Lemma 4)), the result follows from the Schur-concavity of the Rényi entropy. ☐
Lemma 6.
Let , , and with . Then,
Proof.
Since f is a deterministic function in with , the probability mass function of is an element in which majorizes (see [12] (Lemma 3)). Inequality (39) then follows from Lemma 5. ☐
We are now ready to prove Theorem 2.
Proof.
In view of (39),
We next construct a function such that, for all ,
where the function in the right side of (41) is given in (33), and (42) holds due to (38) and (40). The function in our proof coincides with the construction in [12], and it is, therefore, independent of .
We first review and follow the concept of the proof of [12] (Lemma 5), and we then deviate from the analysis there for proving our result. The idea behind the proof of [12] (Lemma 5) relies on the following algorithm:
- (1)
- Start from the probability mass function $P_X$ with $n$ masses;
- (2)
- Merge successively pairs of probability masses by applying the Huffman algorithm;
- (3)
- Stop the merging process in Step 2 when a probability mass function $Q$ is obtained (with $m$ masses);
- (4)
- Construct the deterministic function $f$ by setting $f(x) = y$ for all the probability masses $P_X(x)$, with $x \in \mathcal{X}$, which are merged in Steps 2–3 into the node $y \in \mathcal{Y}$ of $Q$ (a short illustrative sketch of this merging procedure is given after the proof).
Let be the largest index such that (note that corresponds to the case where each node , with , is constructed by merging at least two masses of the probability mass function ). Then, according to [12] (p. 2225),
Let
be the sum of the smallest masses of the probability mass function Q. In view of (43), the vector
represents a probability mass function where the ratio of its maximal to minimal masses is upper bounded by 2.
At this point, our analysis deviates from [12] (p. 2225). Applying Theorem 1 to with gives
with
where (47) follows from (20); (48) is straightforward algebra, and (49) is the definition in (33).
In view of (44), let be the probability mass function which is given by
The validity of (60) is extended to by taking the limit on both sides of this inequality, and due to the continuity of in (33) at . Applying the majorization result in [12] ((31)), it follows from (60) and the Schur-concavity of the Rényi entropy that, for all ,
which together with (40), prove Items a) and b) of Theorem 2 (note that, in view of the construction of the deterministic function in Step 4 of the above algorithm, we get ).
We next prove Item c). Equality (36) is due to the Schur-concavity of the Rényi entropy, and since we have
- is an aggregation of X, i.e., the probability mass function of satisfies () where partition into m disjoint subsets as follows:
- By the assumption , it follows that for every such ;
- From (35), where the function is given by for all , and for all . Hence, is an element in the set of the probability mass functions of with which majorizes every other element from this set.
☐
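As a rough illustration of Steps 1–4 in the proof above (a sketch under the assumption that, at each stage, the two currently smallest masses are merged, as in the Huffman procedure; the pmf and the variable names are hypothetical), the following Python code carries out the merging until m masses remain and records the induced deterministic function f.

```python
import heapq

def huffman_style_aggregation(p_x, m):
    """Merge the two smallest masses repeatedly (Steps 1-3) until m masses remain,
    and record which original symbols end up in each merged node (Step 4)."""
    # heap entries: (mass, tie-breaking counter, frozenset of original symbols in this node)
    heap = [(px, x, frozenset({x})) for x, px in enumerate(p_x)]
    heapq.heapify(heap)
    counter = len(p_x)
    while len(heap) > m:
        m1, _, s1 = heapq.heappop(heap)
        m2, _, s2 = heapq.heappop(heap)
        heapq.heappush(heap, (m1 + m2, counter, s1 | s2))
        counter += 1
    f = {}                               # deterministic map: original symbol -> merged node index
    for y, (_, _, symbols) in enumerate(heap):
        for x in symbols:
            f[x] = y
    return f, [mass for mass, _, _ in heap]

p_x = [0.05, 0.05, 0.10, 0.10, 0.20, 0.50]   # hypothetical pmf with n = 6 masses
f, p_y = huffman_style_aggregation(p_x, m=3)
print(f, p_y)   # the two smallest masses are merged first, as in Huffman coding
```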
Remark 7.
Remark 8.
Inequality (43) leads to the application of Theorem 1 with (see (46)). In the derivation of Theorem 2, we refer to (see (47)–(49)) rather than referring to (although, from (24), we have for all ). We do so since, for , the difference between the curves of (as a function of ) and the curve of is marginal (see the dashed and solid lines in the left plot of Figure 2), and also because the function v in (33) is expressed in a closed form whereas is subject to numerical optimization for finite n (see (15) and (16)). For this reason, Theorem 2 coincides with the result in [12] (Theorem 1) for the Shannon entropy (i.e., for ) while providing a generalization of the latter result for Rényi entropies of arbitrary positive orders α. Theorem 1, however, both strengthens the bounds in [12] (Theorem 2) for the Shannon entropy with finite cardinality n (see Remark 3), and it also generalizes these bounds to Rényi entropies of all positive orders.
Remark 9.
The minimizing probability mass function in (35) for the optimization problem (36), and the maximizing probability mass function in (30) for the optimization problem (38), remain valid in general when the Rényi entropy of a positive order is replaced by an arbitrary Schur-concave function. However, the main results in (32)–(34) hold specifically for the Rényi entropy.
Remark 10.
Theorem 2 makes use of the random variables denoted by and , rather than (more simply) and respectively, because Section 5 considers i.i.d. samples and with and ; note, however, that the probability mass functions of and are different from and , respectively, and for that reason we make use of tilde symbols in the left sides of (30) and (35).
5. Information-Theoretic Applications: Non-Asymptotic Bounds for Lossless Compression and Guessing
Theorem 2 is applied in this section to derive non-asymptotic bounds for lossless compression of discrete memoryless sources and guessing moments. Each of the two subsections starts with a short background for making the presentation self-contained.
5.1. Guessing
5.1.1. Background
The problem of guessing discrete random variables has various theoretical and operational aspects in information theory (see [35,36,37,38,40,41,43,56,71,72,73,74,75,76,77,78,79,80,81]). The central object of interest is the distribution of the number of guesses required to identify a realization of a random variable X, taking values on a finite or countably infinite set $\mathcal{X}$, by successively asking questions of the form “Is X equal to x?” until the value of X is guessed correctly. A guessing function is a one-to-one function $g \colon \mathcal{X} \to \{1, \ldots, |\mathcal{X}|\}$, which can be viewed as a permutation of the elements of $\mathcal{X}$ in the order in which they are guessed. The required number of guesses is therefore equal to $g(x)$ when $X = x$ with $x \in \mathcal{X}$.
Lower and upper bounds on the minimal expected number of required guesses for correctly identifying the realization of X, expressed as a function of the Shannon entropy $H(X)$, have been respectively derived by Massey [77] and by McEliece and Yu [78], followed by a derivation of improved upper and lower bounds by De Santis et al. [80]. More generally, given a probability mass function $P_X$ on $\mathcal{X}$, it is of interest to minimize the generalized guessing moment $\mathbb{E}[g^\rho(X)]$ for $\rho > 0$. For an arbitrary positive $\rho$, the $\rho$-th moment of the number of guesses is minimized by selecting the guessing function to be a ranking function $g_X$, for which $g_X(x) = \ell$ if $P_X(x)$ is the ℓ-th largest mass [77]. Although the tie breaking affects the choice of $g_X$, the distribution of $g_X(X)$ does not depend on how ties are resolved. Not only does this strategy minimize the average number of guesses, but it also minimizes the $\rho$-th moment of the number of guesses for every $\rho > 0$. Upper and lower bounds on the $\rho$-th moment of ranking functions, expressed in terms of Rényi entropies, were derived by Arikan [35] and Boztaş [71], followed by recent improvements in the non-asymptotic regime by Sason and Verdú [56]. Although it is straightforward to evaluate the guessing moments numerically if $|\mathcal{X}|$ is small, the benefit of bounds expressed in terms of Rényi entropies is particularly relevant when dealing with a random vector $X^k = (X_1, \ldots, X_k)$ whose letters belong to a finite alphabet $\mathcal{X}$; computing all the probabilities of the mass function over the set $\mathcal{X}^k$, and then sorting them in decreasing order for the calculation of the $\rho$-th moment of the optimal guessing function for the elements of $\mathcal{X}^k$, becomes infeasible even for moderate values of k. In contrast, regardless of the value of k, bounds on guessing moments which depend on the Rényi entropy are readily computable if, for example, $X_1, \ldots, X_k$ are independent, in which case the Rényi entropy of the vector is equal to the sum of the Rényi entropies of its components. Arikan’s bounds in [35] are asymptotically tight for random vectors of length k as $k \to \infty$, thus providing the correct exponential growth rate of the guessing moments for sufficiently large k.
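To make the above concrete, the following sketch computes the exact ρ-th guessing moment of a ranking function by sorting the probabilities, and compares it with bounds of the form derived by Arikan [35]; the constants below are recalled from memory rather than quoted from (63)–(64), and the pmf is arbitrary, so this is an illustration rather than a restatement of the paper's bounds.

```python
import numpy as np

def guessing_moment(p, rho):
    """Exact rho-th moment of the number of guesses for an optimal (ranking) guessing function."""
    p = np.sort(np.asarray(p, dtype=float))[::-1]      # guess in decreasing order of probability
    ranks = np.arange(1, len(p) + 1, dtype=float)
    return float(np.sum(p * ranks ** rho))

def renyi_based_bounds(p, rho):
    """Bounds of the form S / (1 + ln n)^rho <= E[g^rho(X)] <= S, where
    S = (sum_x P(x)^{1/(1+rho)})^{1+rho} = exp(rho * H_{1/(1+rho)}(X)) with natural logarithms."""
    p = np.asarray(p, dtype=float)
    s = float(np.sum(p ** (1.0 / (1.0 + rho))) ** (1.0 + rho))
    return s / (1.0 + np.log(len(p))) ** rho, s

p = [0.4, 0.3, 0.15, 0.1, 0.05]                        # arbitrary pmf
for rho in (0.5, 1.0, 2.0):
    lower, upper = renyi_based_bounds(p, rho)
    print(rho, lower, guessing_moment(p, rho), upper)  # lower <= exact <= upper
```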
5.1.2. Analysis
We next analyze the following setup of guessing. Let $X_1, \ldots, X_k$ be i.i.d. random variables where each $X_i$ takes values on a finite set $\mathcal{X}$ with $|\mathcal{X}| = n$. To cluster the data [82] (see also [12] (Section 3.A) and references therein), suppose that each $X_i$ is mapped to $Y_i = f(X_i)$, where $f \colon \mathcal{X} \to \mathcal{Y}$ is an arbitrary deterministic function (independent of the index i) with $|\mathcal{Y}| = m < n$. Consequently, $Y_1, \ldots, Y_k$ are i.i.d., and each $Y_i$ takes values on the finite set $\mathcal{Y}$ with $|\mathcal{Y}| = m$.
Let $g_{X^k}$ and $g_{Y^k}$ be, respectively, the ranking functions of the random vectors $X^k = (X_1, \ldots, X_k)$ and $Y^k = (Y_1, \ldots, Y_k)$, obtained by sorting in separate decreasing orders the probabilities $P_{X^k}(x^k)$ for $x^k \in \mathcal{X}^k$, and $P_{Y^k}(y^k)$ for $y^k \in \mathcal{Y}^k$, where ties in both cases are resolved arbitrarily. In view of Arikan’s bounds on the $\rho$-th moment of ranking functions (see [35] (Theorem 1) for the lower bound, and [35] (Proposition 4) for the upper bound), since $X^k$ and $Y^k$ take $n^k$ and $m^k$ possible values, respectively, the following bounds hold for all $\rho > 0$:
In the following, we rely on Theorem 2 and the bounds in (63) and (64) to obtain bounds on the exponential reduction of the -th moment of the ranking function of as a result of its mapping to . First, the combination of (63) and (64) yields
In view of Theorem 2-(a) and (65), it follows that for an arbitrary and
where is a random variable whose probability mass function is given in (30). Please note that
where the first inequality in (68) holds since (see Lemma 5) and the Rényi entropy is Schur-concave.
By the explicit construction of the function according to the algorithm in Steps 1–4 in the proof of Theorem 2 (based on the Huffman procedure), by setting for every , it follows from (34) and (66) that for all
where the monotonically increasing function is given in (33), and it is depicted by the solid line in the left plot of Figure 2. In view of (33), it can be shown that the linear approximation is excellent for all , and therefore for all
Hence, for sufficiently large value of k, the gap between the lower and upper bounds in (67) and (69) is marginal, being approximately equal to for all .
The following theorem summarizes our result in this section.
Theorem 3.
Let
- be i.i.d. with taking values on a set with ;
- , for every , where is a deterministic function with ;
- and be, respectively, ranking functions of the random vectors and .
Then, for every ,
- (a)
- The lower bound in (67) holds for every deterministic function ;
- (b)
- The upper bound in (69) holds for the specific , whose construction relies on the Huffman algorithm (see Steps 1–4 of the procedure in the proof of Theorem 2);
- (c)
- The gap between these bounds, for and sufficiently large k, is at most
5.1.3. Numerical Result
The following simple example illustrates the tightness of the achievable upper bound and the universal lower bound in Theorem 3, especially for sufficiently long sequences.
Example 1.
Let X be geometrically distributed restricted to with the probability mass function
where and . Assume that are i.i.d. with , and let for a deterministic function with and . We compare the upper and lower bounds in Theorem 3 for the two cases where the sequence is of length or . The lower bound in (67) holds for an arbitrary deterministic , and the achievable upper bound in (69) holds for the construction of the deterministic function (based on the Huffman algorithm) in Theorem 3. Numerical results are shown in Figure 4, providing plots of the upper and lower bounds on in Theorem 3, and illustrating the improved tightness of these bounds when the value of k is increased from 100 (left plot) to 1000 (right plot). From Theorem 3-(c), for sufficiently large k, the gap between the upper and lower bounds is less than 0.08607 bits (for all ); this is consistent with the right plot of Figure 4 where .
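Since the parameters of Example 1 are not restated above, the following sketch uses placeholder values; it only illustrates how a geometric pmf truncated to a finite set and the per-letter Rényi entropies entering the bounds of Theorem 3 can be computed, exploiting the additivity of the Rényi entropy for i.i.d. vectors.

```python
import numpy as np

def truncated_geometric_pmf(beta, n):
    """Pmf proportional to beta**i on {1, ..., n}; beta and n are placeholder values here,
    not the (unstated) parameters of Example 1."""
    w = beta ** np.arange(1, n + 1, dtype=float)
    return w / w.sum()

def renyi_entropy(p, alpha, base=2.0):
    p = np.asarray([x for x in p if x > 0], dtype=float)
    if np.isclose(alpha, 1.0):
        return float(-np.sum(p * np.log(p)) / np.log(base))
    return float(np.log(np.sum(p ** alpha)) / ((1.0 - alpha) * np.log(base)))

p = truncated_geometric_pmf(beta=0.8, n=16)   # hypothetical parameters
# For i.i.d. X_1, ..., X_k: H_alpha(X^k) = k * H_alpha(X_1), so the per-letter values
# below suffice to evaluate the Rényi-entropy terms in the bounds for any length k.
for alpha in (0.5, 1.0, 2.0):
    print(alpha, renyi_entropy(p, alpha))
```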
Figure 4.
Plots of the upper and lower bounds on in Theorem 3, as a function of , for random vectors of length (left plot) or (right plot) in the setting of Example 1. Each plot shows the universal lower bound for an arbitrary deterministic , and the achievable upper bound with the construction of the deterministic function (based on the Huffman algorithm) in Theorem 3 (see, respectively, (67) and (69)).
5.2. Lossless Source Coding
5.2.1. Background
For uniquely decodable (UD) lossless source coding, Campbell [51,83] proposed the cumulant generating function of the codeword lengths as a generalization to the frequently used design criterion of average code length. Campbell’s motivation in [51] was to control the contribution of the longer codewords via a free parameter in the cumulant generating function: if the value of this parameter tends to zero, then the resulting design criterion becomes the average code length per source symbol; on the other hand, by increasing the value of the free parameter, the penalty for longer codewords is more severe, and the resulting code optimization yields a reduction in the fluctuations of the codeword lengths.
We next state the coding theorem by Campbell [51] for lossless compression of a discrete memoryless source (DMS) with UD codes, which is used in our analysis jointly with Theorem 2.
Theorem 4
(Campbell 1965, [51]). Consider a DMS which emits symbols with a probability mass function defined on a (finite or countably infinite) set . Consider a UD fixed-to-variable source code operating on source sequences of k symbols with an alphabet of the codewords of size D. Let be the length of the codeword which corresponds to the source sequence . Consider the scaled cumulant generating function of the codeword lengths
where
Then, for every , the following hold:
- (a)
- Converse result:
- (b)
- Achievability result: there exists a UD source code, for which
The term scaled cumulant generating function is used in view of [56] (Remark 20). The bounds in Theorem 4, expressed in terms of the Rényi entropy, imply that for sufficiently long source sequences, it is possible to make the scaled cumulant generating function of the codeword lengths approach the Rényi entropy as closely as desired by a proper fixed-to-variable UD source code; moreover, the converse result shows that there is no UD source code for which the scaled cumulant generating function of its codeword lengths lies below the Rényi entropy. By invoking L’Hôpital’s rule, one gets from (72)
Hence, by letting tend to zero in (74) and (75), it follows from (4) that Campbell’s result in Theorem 4 generalizes the well-known bounds on the optimal average length of UD fixed-to-variable source codes (see, e.g., [84] ((5.33) and (5.37))):
and (77) is satisfied by Huffman coding (see, e.g., [84] (Theorem 5.8.1)). Campbell’s result therefore generalizes Shannon’s fundamental result in [85] for the average codeword lengths of lossless compression codes, expressed in terms of the Shannon entropy.
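The following sketch is a numerical check under the assumptions spelled out in its comments (a single source letter, i.e., k = 1, an arbitrary pmf, and a length assignment recalled here from Campbell's achievability argument, namely ⌈−log_D Q_t(x)⌉ for the tilted pmf Q_t ∝ P^{1/(1+t)}): the scaled cumulant generating function then lies between the Rényi entropy of order 1/(1+t) and that entropy plus one, and Kraft's inequality holds.

```python
import numpy as np

def renyi_entropy(p, alpha, base=2.0):
    p = np.asarray(p, dtype=float)
    return float(np.log(np.sum(p ** alpha)) / ((1.0 - alpha) * np.log(base)))

def campbell_lengths(p, t, D=2):
    """Length assignment recalled from Campbell's achievability argument:
    l(x) = ceil(-log_D Q_t(x)) with the tilted pmf Q_t(x) proportional to P(x)**(1/(1+t))."""
    p = np.asarray(p, dtype=float)
    q = p ** (1.0 / (1.0 + t))
    q /= q.sum()
    return np.ceil(-np.log(q) / np.log(D)).astype(int)

def scaled_cgf(p, lengths, t, D=2):
    """(1/t) * log_D E[D**(t * l(X))] for a single source letter (k = 1)."""
    p = np.asarray(p, dtype=float)
    return float(np.log(np.sum(p * float(D) ** (t * lengths))) / (t * np.log(D)))

p, t, D = [0.5, 0.25, 0.15, 0.1], 0.5, 2        # arbitrary pmf and free parameter
lengths = campbell_lengths(p, t, D)
h = renyi_entropy(p, 1.0 / (1.0 + t), base=D)   # Rényi entropy of order 1/(1+t), base-D logs
print(np.sum(float(D) ** (-lengths)) <= 1.0 + 1e-12)   # Kraft's inequality
print(h, scaled_cgf(p, lengths, t, D), h + 1.0)         # converse and achievability gap < 1
```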
Following the work by Campbell [51], Courtade and Verdú derived in [52] non-asymptotic bounds for the scaled cumulant generating function of the codeword lengths for -optimal variable-length lossless codes [23,86]. These bounds were used in [52] to obtain simple proofs of the asymptotic normality of the distribution of codeword lengths, and the reliability function of memoryless sources allowing countably infinite alphabets. Sason and Verdú recently derived in [56] improved non-asymptotic bounds on the cumulant generating function of the codeword lengths for fixed-to-variable optimal lossless source coding without prefix constraints, and non-asymptotic bounds on the reliability function of a DMS, tightening the bounds in [52].
5.2.2. Analysis
The following analysis for lossless source compression with UD codes relies on a combination of Theorems 2 and 4.
Let $X_1, \ldots, X_k$ be i.i.d. symbols which are emitted from a DMS according to a probability mass function $P_X$ whose support is a finite set $\mathcal{X}$ with $|\mathcal{X}| = n$. Similarly to Section 5.1, to cluster the data, suppose that each symbol $X_i$ is mapped to $Y_i = f(X_i)$, where $f$ is an arbitrary deterministic function (independent of the index i) with $|\mathcal{Y}| = m < n$. Consequently, the i.i.d. symbols $Y_1, \ldots, Y_k$ take values on the set $\mathcal{Y}$. Consider two UD fixed-to-variable source codes: one operating on the sequences $x^k \in \mathcal{X}^k$, and the other operating on the sequences $y^k \in \mathcal{Y}^k$; let D be the size of the alphabets of both source codes. Let $\ell(x^k)$ and $\ell(y^k)$ denote the lengths of the codewords for the source sequences $x^k$ and $y^k$, respectively, and let their corresponding scaled cumulant generating functions be as defined in (72).
In view of Theorem 4-(b), for every , there exists a UD source code for the sequences in such that the scaled cumulant generating function of its codeword lengths satisfies (75). Furthermore, from Theorem 4-(a), we get
From (75), (78) and Theorem 2 (a) and (b), for every , there exist a UD source code for the sequences in , and a construction of a deterministic function (as specified by Steps 1–4 in the proof of Theorem 2, borrowed from [12]) such that the difference between the two scaled cumulant generating functions satisfies
where (79) holds for every UD source code operating on the sequences in with (for ) and the specific construction of as above, and in the right side of (79) is a random variable whose probability mass function is given in (30). The right side of (79) can be very well approximated (for all ) by using (70).
We proceed with a derivation of a lower bound on the left side of (79). In view of Theorem 4, it follows that (74) is satisfied for every UD source code which operates on the sequences in ; furthermore, Theorems 2 and 4 imply that, for every , there exists a UD source code which operates on the sequences in such that
where (81) is due to (39) since (for ) with an arbitrary deterministic function , and for every i; hence, from (74), (80) and (81),
We summarize our result as follows.
Theorem 5.
Let
- be i.i.d. symbols which are emitted from a DMS according to a probability mass function whose support is a finite set with ;
- Each symbol be mapped to where is the deterministic function (independent of the index i) with , as specified by Steps 1–4 in the proof of Theorem 2 (borrowed from [12]);
- Two UD fixed-to-variable source codes be used: one code encodes the sequences , and the other code encodes their mappings ; let the common size of the alphabets of both codes be D;
- and be, respectively, the scaled cumulant generating functions of the codeword lengths of the k-length sequences in (see (72)) and their mapping to .
Then, for every , the following holds for the difference between the scaled cumulant generating functions and :
- (a)
- There exists a UD source code for the sequences in such that the upper bound in (79) is satisfied for every UD source code which operates on the sequences in ;
- (b)
- (c)
- (d)
- The UD source codes in Items (a) and (b) for the sequences in and , respectively, can be constructed to be prefix codes by the algorithm in Remark 11.
Remark 11 (An Algorithm for Theorem 5 (d)).
A construction of the UD source codes for the sequences in and , whose existence is assured by Theorem 5 (a) and (b) respectively, is obtained by the following algorithm (of three steps) which also constructs them as prefix codes:
- (1)
- As a preparatory step, we first calculate the probability mass function from the given probability mass function and the deterministic function which is obtained by Steps 1–4 in the proof of Theorem 2; accordingly, for all . We then further calculate the probability mass functions for the i.i.d. sequences in and (see (73)); recall that the number of types in and is polynomial in k (being upper bounded by and , respectively), and the values of these probability mass functions are fixed over each type;
- (2)
- The sets of codeword lengths of the two UD source codes, for the sequences in and , can (separately) be designed according to the achievability proof in Campbell’s paper (see [51] (p. 428)). More explicitly, let ; for all , let be given bywithand let , for all , be given similarly to (83) and (84) by replacing with , and with . This suggests codeword lengths for the two codes which fulfil (75) and (80), and also, both satisfy Kraft’s inequality;
- (3)
- The separate construction of two prefix codes (a.k.a. instantaneous codes) based on their given sets of codeword lengths and , as determined in Step 2, is standard (see, e.g., the construction in the proof of [84] (Theorem 5.2.1)).
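As a complement to Step 3 (a sketch of the standard construction in the proof of [84] (Theorem 5.2.1), written here for binary codes with illustrative lengths), the following code builds a prefix code from any set of lengths satisfying Kraft's inequality by assigning, in order of increasing length, the binary expansion of the running Kraft sum.

```python
def prefix_code_from_lengths(lengths):
    """Construct a binary prefix code from lengths satisfying Kraft's inequality:
    process the lengths in increasing order and give each codeword the first l bits
    of the binary expansion of the running sum of 2**(-l_j) over earlier codewords."""
    assert sum(2.0 ** (-l) for l in lengths) <= 1.0 + 1e-12, "Kraft's inequality must hold"
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    codewords = [None] * len(lengths)
    acc = 0.0
    for i in order:
        l = lengths[i]
        w = int(round(acc * 2 ** l))        # acc has at most l binary digits at this point
        codewords[i] = format(w, "b").zfill(l)
        acc += 2.0 ** (-l)
    return codewords

print(prefix_code_from_lengths([2, 2, 2, 3, 4, 4]))   # ['00', '01', '10', '110', '1110', '1111']
```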
Theorem 5 is of interest since it provides upper and lower bounds on the reduction in the cumulant generating function of close-to-optimal UD source codes due to the clustering of the data, and Remark 11 suggests an algorithm for constructing such UD codes which are also prefix codes. For long enough sequences (as $k \to \infty$), the upper and lower bounds on the difference between the scaled cumulant generating functions of the suggested source codes for the original and clustered data almost match (see (79) and (82)), being roughly equal to (with logarithms to base D, which is the alphabet size of the source codes); as , the gap between these upper and lower bounds is less than . Furthermore, in view of (76),
so, it follows from (4), (33), (79) and (82) that the difference between the average code lengths (normalized by k) of the original and clustered data satisfies
where the gap between the upper and lower bounds is equal to .
Funding
This research received no external funding.
Conflicts of Interest
The author declares no conflict of interest.
Appendix A. Proof of Lemma 2
We first find the extreme values of under the assumption that . If , then P is the equiprobable distribution on and . On the other hand, if , then the minimal possible value of is obtained when P is the one-odd-mass distribution with masses equal to and a smaller mass equal to . The latter case yields .
Let , so it can take any value in the interval . From Lemma 1, and , and the Schur-concavity of the Rényi entropy yields for all with . Minimizing over can therefore be restricted to minimizing over .
Appendix B. Proof of Lemma 3
The sequence is non-negative since for all . To prove (17),
where (A2) holds since is monotonically decreasing in , and (A3) is due to (5) and .
Let denote the equiprobable probability mass function on . By the identity
and since, by Lemma 2, attains its minimum over the set of probability mass functions , it follows that attains its maximum over this set. Let be the probability measure which achieves the minimum in (see (16)), then from (A4)
Let be the probability mass function which is defined on as follows:
Since by assumption , it is easy to verify from (A7) that
Appendix C. Proof of Lemma 4
From Lemma 2, the minimizing distribution of is given by where
with , and . It therefore follows that the influence of the middle probability mass of on tends to zero as . Therefore, in this asymptotic case, one can instead minimize where
with the free parameter and (so that the total mass of is equal to 1).
For , straightforward calculation shows that
and by letting , the limit of the sequence exists, and it is equal to
Let be given by
Then, , and straightforward calculation shows that the derivative vanishes if and only if
We rely here on a specialized version of the mean value theorem, known as Rolle’s theorem, which states that any real-valued differentiable function attaining equal values at two distinct points must have a point between them at which its first derivative is zero. By Rolle’s theorem, and due to the uniqueness of the point in (A22), it follows that . Substituting (A22) into (A20) gives (20). Taking the limit of (20) as gives the result in (21).
References
- Hardy, G.H.; Littlewood, J.E.; Pólya, G. Inequalities, 2nd ed.; Cambridge University Press: Cambridge, UK, 1952. [Google Scholar]
- Marshall, A.W.; Olkin, I.; Arnold, B.C. Inequalities: Theory of Majorization and Its Applications, 2nd ed.; Springer: New York, NY, USA, 2011. [Google Scholar]
- Arnold, B.C. Majorization: Here, there and everywhere. Stat. Sci. 2007, 22, 407–413. [Google Scholar] [CrossRef]
- Arnold, B.C.; Sarabia, J.M. Majorization and the Lorenz Order with Applications in Applied Mathematics and Economics; Statistics for Social and Behavioral Sciences; Springer: New York, NY, USA, 2018. [Google Scholar]
- Cicalese, F.; Gargano, L.; Vaccaro, U. Information theoretic measures of distances and their econometric applications. In Proceedings of the 2013 IEEE International Symposium on Information Theory, Istanbul, Turkey, 7–12 July 2013; pp. 409–413. [Google Scholar]
- Steele, J.M. The Cauchy-Schwarz Master Class; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
- Bhatia, R. Matrix Analysis Graduate Texts in Mathematics; Springer: New York, NY, USA, 1997. [Google Scholar]
- Horn, R.A.; Johnson, C.R. Matrix Analysis, 2nd ed.; Cambridge University Press: Cambridge, UK, 2013. [Google Scholar]
- Ben-Bassat, M.; Raviv, J. Rényi’s entropy and probability of error. IEEE Trans. Inf. Theory 1978, 24, 324–331. [Google Scholar] [CrossRef]
- Cicalese, F.; Vaccaro, U. Bounding the average length of optimal source codes via majorization theory. IEEE Trans. Inf. Theory 2004, 50, 633–637. [Google Scholar] [CrossRef]
- Cicalese, F.; Gargano, L.; Vaccaro, U. How to find a joint probability distribution with (almost) minimum entropy given the marginals. In Proceedings of the 2017 IEEE International Symposium on Information Theory, Aachen, Germany, 25–30 June 2017; pp. 2178–2182. [Google Scholar]
- Cicalese, F.; Gargano, L.; Vaccaro, U. Bounds on the entropy of a function of a random variable and their applications. IEEE Trans. Inf. Theory 2018, 64, 2220–2230. [Google Scholar] [CrossRef]
- Cicalese, F.; Vaccaro, U. Maximum entropy interval aggregations. In Proceedings of the 2018 IEEE International Symposium on Information Theory, Vail, CO, USA, 17–20 June 2018; pp. 1764–1768. [Google Scholar]
- Harremoës, P. A new look on majorization. In Proceedings of the 2004 IEEE International Symposium on Information Theory and Its Applications, Parma, Italy, 10–13 October 2004; pp. 1422–1425. [Google Scholar]
- Ho, S.W.; Yeung, R.W. The interplay between entropy and variational distance. IEEE Trans. Inf. Theory 2010, 56, 5906–5929. [Google Scholar] [CrossRef]
- Ho, S.W.; Verdú, S. On the interplay between conditional entropy and error probability. IEEE Trans. Inf. Theory 2010, 56, 5930–5942. [Google Scholar] [CrossRef]
- Ho, S.W.; Verdú, S. Convexity/concavity of the Rényi entropy and α-mutual information. In Proceedings of the 2015 IEEE International Symposium on Information Theory, Hong Kong, China, 14–19 June 2015; pp. 745–749. [Google Scholar]
- Joe, H. Majorization, entropy and paired comparisons. Ann. Stat. 1988, 16, 915–925. [Google Scholar] [CrossRef]
- Joe, H. Majorization and divergence. J. Math. Anal. Appl. 1990, 148, 287–305. [Google Scholar] [CrossRef]
- Koga, H. Characterization of the smooth Rényi entropy using majorization. In Proceedings of the 2013 IEEE Information Theory Workshop, Seville, Spain, 9–13 September 2013; pp. 604–608. [Google Scholar]
- Puchala, Z.; Rudnicki, L.; Zyczkowski, K. Majorization entropic uncertainty relations. J. Phys. A Math. Theor. 2013, 46, 1–12. [Google Scholar] [CrossRef]
- Sason, I.; Verdú, S. Arimoto-Rényi conditional entropy and Bayesian M-ary hypothesis testing. IEEE Trans. Inf. Theory 2018, 64, 4–25. [Google Scholar] [CrossRef]
- Verdú, S. Information Theory, 2018; in preparation.
- Witsenhausen, H.S. Some aspects of convexity useful in information theory. IEEE Trans. Inf. Theory 1980, 26, 265–271. [Google Scholar] [CrossRef]
- Xi, B.; Wang, S.; Zhang, T. Schur-convexity on generalized information entropy and its applications. In Information Computing and Applications; Lecture Notes in Computer Science; Springer: New York, NY, USA, 2011; Volume 7030, pp. 153–160. [Google Scholar]
- Inaltekin, H.; Hanly, S.V. Optimality of binary power control for the single cell uplink. IEEE Trans. Inf. Theory 2012, 58, 6484–6496. [Google Scholar] [CrossRef]
- Jorswieck, E.; Boche, H. Majorization and matrix-monotone functions in wireless communications. Found. Trends Commun. Inf. Theory 2006, 3, 553–701. [Google Scholar] [CrossRef]
- Palomar, D.P.; Jiang, Y. MIMO transceiver design via majorization theory. Found. Trends Commun. Inf. Theory 2006, 3, 331–551. [Google Scholar] [CrossRef]
- Roventa, I. Recent Trends in Majorization Theory and Optimization: Applications to Wireless Communications; Editura Pro Universitaria & Universitaria Craiova: Bucharest, Romania, 2015. [Google Scholar]
- Sezgin, A.; Jorswieck, E.A. Applications of majorization theory in space-time cooperative communications. In Cooperative Communications for Improved Wireless Network Transmission: Framework for Virtual Antenna Array Applications; Information Science Reference; Uysal, M., Ed.; IGI Global: Hershey, PA, USA, 2010; pp. 429–470. [Google Scholar]
- Viswanath, P.; Anantharam, V. Optimal sequences and sum capacity of synchronous CDMA systems. IEEE Trans. Inf. Theory 1999, 45, 1984–1993. [Google Scholar] [CrossRef]
- Viswanath, P.; Anantharam, V.; Tse, D.N.C. Optimal sequences, power control, and user capacity of synchronous CDMA systems with linear MMSE multiuser receivers. IEEE Trans. Inf. Theory 1999, 45, 1968–1983. [Google Scholar] [CrossRef]
- Viswanath, P.; Anantharam, V. Optimal sequences for CDMA under colored noise: A Schur-saddle function property. IEEE Trans. Inf. Theory 2002, 48, 1295–1318. [Google Scholar] [CrossRef]
- Rényi, A. On measures of entropy and information. In Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 8–9 August 1961; pp. 547–561. [Google Scholar]
- Arikan, E. An inequality on guessing and its application to sequential decoding. IEEE Trans. Inf. Theory 1996, 42, 99–105. [Google Scholar] [CrossRef]
- Arikan, E.; Merhav, N. Guessing subject to distortion. IEEE Trans. Inf. Theory 1998, 44, 1041–1056. [Google Scholar] [CrossRef]
- Arikan, E.; Merhav, N. Joint source-channel coding and guessing with application to sequential decoding. IEEE Trans. Inf. Theory 1998, 44, 1756–1769. [Google Scholar] [CrossRef]
- Burin, A.; Shayevitz, O. Reducing guesswork via an unreliable oracle. IEEE Trans. Inf. Theory 2018, 64, 6941–6953. [Google Scholar] [CrossRef]
- Kuzuoka, S. On the conditional smooth Rényi entropy and its applications in guessing and source coding. arXiv, 2018; arXiv:1810.09070. [Google Scholar]
- Merhav, N.; Arikan, E. The Shannon cipher system with a guessing wiretapper. IEEE Trans. Inf. Theory 1999, 45, 1860–1866. [Google Scholar] [CrossRef]
- Sundaresan, R. Guessing based on length functions. In Proceedings of the 2007 IEEE International Symposium on Information Theory, Nice, France, 24–29 June 2007; pp. 716–719. [Google Scholar]
- Salamatian, S.; Beirami, A.; Cohen, A.; Médard, M. Centralized versus decentralized multi-agent guesswork. In Proceedings of the 2017 IEEE International Symposium on Information Theory, Aachen, Germany, 25–30 June 2017; pp. 2263–2267. [Google Scholar]
- Sundaresan, R. Guessing under source uncertainty. IEEE Trans. Inf. Theory 2007, 53, 269–287. [Google Scholar] [CrossRef]
- Bracher, A.; Lapidoth, A.; Pfister, C. Distributed task encoding. In Proceedings of the 2017 IEEE International Symposium on Information Theory, Aachen, Germany, 25–30 June 2017; pp. 1993–1997. [Google Scholar]
- Bunte, C.; Lapidoth, A. Encoding tasks and Rényi entropy. IEEE Trans. Inf. Theory 2014, 60, 5065–5076. [Google Scholar] [CrossRef]
- Shayevitz, O. On Rényi measures and hypothesis testing. In Proceedings of the 2011 IEEE International Symposium on Information Theory, Saint Petersburg, Russia, 31 July–5 August 2011; pp. 800–804. [Google Scholar]
- Tomamichel, M.; Hayashi, M. Operational interpretation of Rényi conditional mutual information via composite hypothesis testing against Markov distributions. In Proceedings of the 2016 IEEE International Symposium on Information Theory, Barcelona, Spain, 10–15 July 2016; pp. 585–589. [Google Scholar]
- Harsha, P.; Jain, R.; McAllester, D.; Radhakrishnan, J. The communication complexity of correlation. IEEE Trans. Inf. Theory 2010, 56, 438–449. [Google Scholar] [CrossRef]
- Liu, J.; Verdú, S. Rejection sampling and noncausal sampling under moment constraints. In Proceedings of the 2018 IEEE International Symposium on Information Theory, Vail, CO, USA, 17–22 June 2018; pp. 1565–1569. [Google Scholar]
- Yu, L.; Tan, V.Y.F. Wyner’s common information under Rényi divergence measures. IEEE Trans. Inf. Theory 2018, 64, 3616–3632. [Google Scholar] [CrossRef]
- Campbell, L.L. A coding theorem and Rényi’s entropy. Inf. Control 1965, 8, 423–429. [Google Scholar] [CrossRef]
- Courtade, T.; Verdú, S. Cumulant generating function of codeword lengths in optimal lossless compression. In Proceedings of the 2014 IEEE International Symposium on Information Theory, Honolulu, HI, USA, 29 June–4 July 2014; pp. 2494–2498. [Google Scholar]
- Courtade, T.; Verdú, S. Variable-length lossy compression and channel coding: Non-asymptotic converses via cumulant generating functions. In Proceedings of the 2014 IEEE International Symposium on Information Theory, Honolulu, HI, USA, 29 June–4 July 2014; pp. 2499–2503. [Google Scholar]
- Hayashi, M.; Tan, V.Y.F. Equivocations, exponents, and second-order coding rates under various Rényi information measures. IEEE Trans. Inf. Theory 2017, 63, 975–1005. [Google Scholar] [CrossRef]
- Kuzuoka, S. On the smooth Rényi entropy and variable-length source coding allowing errors. In Proceedings of the 2016 IEEE International Symposium on Information Theory, Barcelona, Spain, 10–15 July 2016; pp. 745–749. [Google Scholar]
- Sason, I.; Verdú, S. Improved bounds on lossless source coding and guessing moments via Rényi measures. IEEE Trans. Inf. Theory 2018, 64, 4323–4346. [Google Scholar] [CrossRef]
- Tan, V.Y.F.; Hayashi, M. Analysis of remaining uncertainties and exponents under various conditional Rényi entropies. IEEE Trans. Inf. Theory 2018, 64, 3734–3755. [Google Scholar] [CrossRef]
- Tyagi, H. Coding theorems using Rényi information measures. In Proceedings of the 2017 IEEE Twenty-Third National Conference on Communications, Chennai, India, 2–4 March 2017; pp. 1–6. [Google Scholar]
- Csiszár, I. Generalized cutoff rates and Rényi information measures. IEEE Trans. Inf. Theory 1995, 41, 26–34. [Google Scholar] [CrossRef]
- Polyanskiy, Y.; Verdú, S. Arimoto channel coding converse and Rényi divergence. In Proceedings of the Forty-Eighth Annual Allerton Conference on Communication, Control and Computing, Monticello, IL, USA, 29 September–1 October 2010; pp. 1327–1333. [Google Scholar]
- Sason, I. On the Rényi divergence, joint range of relative entropies, and a channel coding theorem. IEEE Trans. Inf. Theory 2016, 62, 23–34. [Google Scholar] [CrossRef]
- Yu, L.; Tan, V.Y.F. Rényi resolvability and its applications to the wiretap channel. In Lecture Notes in Computer Science, Proceedings of the 10th International Conference on Information Theoretic Security, Hong Kong, China, 29 November–2 December 2017; Springer: New York, NY, USA, 2017; Volume 10681, pp. 208–233. [Google Scholar]
- Arimoto, S. On the converse to the coding theorem for discrete memoryless channels. IEEE Trans. Inf. Theory 1973, 19, 357–359. [Google Scholar] [CrossRef]
- Arimoto, S. Information measures and capacity of order α for discrete memoryless channels. In Proceedings of the 2nd Colloquium on Information Theory, Keszthely, Hungary, 25–30 August 1975; Csiszár, I., Elias, P., Eds.; Colloquia Mathematica Societatis Janós Bolyai: Amsterdam, The Netherlands, 1977; Volume 16, pp. 41–52. [Google Scholar]
- Dalai, M. Lower bounds on the probability of error for classical and classical-quantum channels. IEEE Trans. Inf. Theory 2013, 59, 8027–8056. [Google Scholar] [CrossRef]
- Leditzky, F.; Wilde, M.M.; Datta, N. Strong converse theorems using Rényi entropies. J. Math. Phys. 2016, 57, 1–33. [Google Scholar] [CrossRef]
- Mosonyi, M.; Ogawa, T. Quantum hypothesis testing and the operational interpretation of the quantum Rényi relative entropies. Commun. Math. Phys. 2015, 334, 1617–1648. [Google Scholar] [CrossRef]
- Simic, S. Jensen’s inequality and new entropy bounds. Appl. Math. Lett. 2009, 22, 1262–1265. [Google Scholar] [CrossRef]
- Jelinek, F.; Schneider, K.S. On variable-length-to-block coding. IEEE Trans. Inf. Theory 1972, 18, 765–774. [Google Scholar] [CrossRef]
- Garey, M.R.; Johnson, D.S. Computers and Intractability: A Guide to the Theory of NP-Completeness; W. H. Freeman and Company: New York, NY, USA, 1979. [Google Scholar]
- Boztaş, S. Comments on “An inequality on guessing and its application to sequential decoding”. IEEE Trans. Inf. Theory 1997, 43, 2062–2063. [Google Scholar] [CrossRef]
- Bracher, A.; Hof, E.; Lapidoth, A. Guessing attacks on distributed-storage systems. In Proceedings of the 2015 IEEE International Symposium on Information Theory, Hong-Kong, China, 14–19 June 2015; pp. 1585–1589. [Google Scholar]
- Christiansen, M.M.; Duffy, K.R. Guesswork, large deviations, and Shannon entropy. IEEE Trans. Inf. Theory 2013, 59, 796–802. [Google Scholar] [CrossRef]
- Hanawal, M.K.; Sundaresan, R. Guessing revisited: A large deviations approach. IEEE Trans. Inf. Theory 2011, 57, 70–78. [Google Scholar] [CrossRef]
- Hanawal, M.K.; Sundaresan, R. The Shannon cipher system with a guessing wiretapper: General sources. IEEE Trans. Inf. Theory 2011, 57, 2503–2516. [Google Scholar] [CrossRef]
- Huleihel, W.; Salamatian, S.; Médard, M. Guessing with limited memory. In Proceedings of the 2017 IEEE International Symposium on Information Theory, Aachen, Germany, 25–30 June 2017; pp. 2258–2262. [Google Scholar]
- Massey, J.L. Guessing and entropy. In Proceedings of the 1994 IEEE International Symposium on Information Theory, Trondheim, Norway, 27 June–1 July 1994; p. 204. [Google Scholar]
- McEliece, R.J.; Yu, Z. An inequality on entropy. In Proceedings of the 1995 IEEE International Symposium on Information Theory, Whistler, BC, Canada, 17–22 September 1995; p. 329. [Google Scholar]
- Pfister, C.E.; Sullivan, W.G. Rényi entropy, guesswork moments and large deviations. IEEE Trans. Inf. Theory 2004, 50, 2794–2800. [Google Scholar] [CrossRef]
- De Santis, A.; Gaggia, A.G.; Vaccaro, U. Bounds on entropy in a guessing game. IEEE Trans. Inf. Theory 2001, 47, 468–473. [Google Scholar] [CrossRef]
- Yona, Y.; Diggavi, S. The effect of bias on the guesswork of hash functions. In Proceedings of the 2017 IEEE International Symposium on Information Theory, Aachen, Germany, 25–30 June 2017; pp. 2253–2257. [Google Scholar]
- Gan, G.; Ma, C.; Wu, J. Data Clustering: Theory, Algorithms, and Applications; ASA-SIAM Series on Statistics and Applied Probability; SIAM: Philadelphia, PA, USA, 2007. [Google Scholar]
- Campbell, L.L. Definition of entropy by means of a coding problem. Probab. Theory Relat. Field 1966, 6, 113–118. [Google Scholar] [CrossRef]
- Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2006. [Google Scholar]
- Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef]
- Kontoyiannis, I.; Verdú, S. Optimal lossless data compression: Non-asymptotics and asymptotics. IEEE Trans. Inf. Theory 2014, 60, 777–795. [Google Scholar] [CrossRef]
- Van Erven, T.; Harremoës, P. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory 2014, 60, 3797–3820. [Google Scholar] [CrossRef]
© 2018 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).