Tight Bounds on the Rényi Entropy via Majorization with Applications to Guessing and Compression

This paper provides tight bounds on the Rényi entropy of a function of a discrete random variable with a finite number of possible values, where the considered function is not one-to-one. To that end, a tight lower bound on the Rényi entropy of a discrete random variable with finite support is derived as a function of the size of the support and the ratio of the maximal to minimal probability masses. This work was inspired by the recently published paper by Cicalese et al., which is focused on the Shannon entropy, and it strengthens and generalizes the results of that paper to Rényi entropies of arbitrary positive orders. In view of these generalized bounds and the works by Arikan and Campbell, non-asymptotic bounds are derived for guessing moments and lossless data compression of discrete memoryless sources.

This work, which relies on majorization theory, has been greatly inspired by the recent insightful paper by Cicalese et al. [21]. The work in [21] provides tight bounds on the Shannon entropy of a function of a discrete random variable with a finite number of possible values, where the considered function is not one-to-one. For that purpose, and while being of interest in its own right (see [21, Section 6]), a tight lower bound on the Shannon entropy of a discrete random variable with finite support was derived in [21] as a function of the size of the support and the ratio of the maximal to minimal probability masses. The present paper aims to extend the bounds in [21] to Rényi entropies of arbitrary positive orders (note that the Shannon entropy is equal to the Rényi entropy of order 1), and to study the information-theoretic applications of these (non-trivial) generalizations in the context of non-asymptotic analysis of guessing moments and lossless data compression.
The presentation in this paper is structured as follows: Section II provides notation and essential preliminaries for the analysis in this paper. Sections III and IV strengthen and generalize, in a non-trivial way, the bounds on the Shannon entropy in [21] to Rényi entropies of arbitrary positive orders (see Theorems 1 and 2). Section V relies on the generalized bound from Section IV and the work by Arikan [1] to derive non-asymptotic bounds for guessing moments (see Theorem 3); Section V also relies on the generalized bound in Section IV and the source coding theorem by Campbell [15] (see Theorem 4) for the derivation of non-asymptotic bounds for lossless compression of discrete memoryless sources (see Theorem 5).

Let
• P be a probability mass function defined on a finite set X;
• p_max and p_min be, respectively, the maximal and minimal positive masses of P;
• G_P(k) be the sum of the k largest masses of P for k ∈ {1, . . ., |X|} (note that G_P(1) = p_max and G_P(|X|) = 1);
• P_n, for an integer n ≥ 2, be the set of all probability mass functions defined on X with |X| = n; without any loss of generality, let X = {1, . . ., n};
• P_n(ρ), for ρ ≥ 1 and an integer n ≥ 2, be the subset of all probability measures P ∈ P_n such that p_max ≤ ρ p_min.

Definition 1 (majorization): Consider discrete probability mass functions P and Q defined on the same (finite or countably infinite) set X. It is said that P is majorized by Q (or Q majorizes P), and it is denoted by P ≺ Q, if G_P(k) ≤ G_Q(k) for every k (with equality when the partial sums exhaust the total mass, since both sums are then equal to 1). If P and Q are defined on finite sets of different cardinalities, then the probability mass function which is defined over the smaller set is first padded by zeros for making the cardinalities of these sets be equal.
By Definition 1, a unit mass majorizes any other distribution; on the other hand, the equiprobable distribution on a finite set is majorized by any other distribution defined on the same set.
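The partial-sum criterion in Definition 1 is easy to check numerically. The following sketch is not from the paper (the helper name `majorizes` is ours); it pads the two vectors to equal cardinality and compares the cumulative sums of the decreasingly sorted masses:

```python
from typing import Sequence

def majorizes(q: Sequence[float], p: Sequence[float], tol: float = 1e-12) -> bool:
    """Return True if Q majorizes P (i.e., P ≺ Q), after zero-padding to equal length."""
    n = max(len(p), len(q))
    ps = sorted(list(p) + [0.0] * (n - len(p)), reverse=True)
    qs = sorted(list(q) + [0.0] * (n - len(q)), reverse=True)
    cp = cq = 0.0
    for i in range(n):
        cp += ps[i]
        cq += qs[i]
        if cq < cp - tol:        # a partial sum of Q falls below that of P
            return False
    return abs(cp - cq) < 1e-9   # total masses must agree (both equal 1)

# A unit mass majorizes any distribution; the equiprobable one is majorized by all.
assert majorizes([1.0, 0.0, 0.0, 0.0], [0.4, 0.3, 0.2, 0.1])
assert majorizes([0.4, 0.3, 0.2, 0.1], [0.25] * 4)
```

The zero-padding step mirrors the convention at the end of Definition 1 for sets of different cardinalities.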
Definition 2 (Schur-convexity/concavity): A function f : P_n → R is said to be Schur-convex if, for every P, Q ∈ P_n, P ≺ Q implies f(P) ≤ f(Q); it is said to be Schur-concave if −f is Schur-convex, i.e., if P ≺ Q implies f(P) ≥ f(Q).

Definition 3 (Rényi entropy [62]): Let X be a random variable taking values on a finite or countably infinite set X, and let P_X be its probability mass function. The Rényi entropy of order α ∈ (0, 1) ∪ (1, ∞) is given by H_α(X) = (1/(1−α)) log Σ_{x∈X} P_X^α(x). By its continuous extension,

H_1(X) = H(X), (4)
H_∞(X) = log(1/p_max), (5)

where H(X) is the (Shannon) entropy of X.
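For concreteness, the quantities in Definition 3 can be evaluated directly. This small sketch is ours, not the paper's; it computes H_α in bits, with the continuous extensions at α = 1 and α = ∞ as in (4) and (5):

```python
import math

def renyi_entropy(p, alpha: float, base: float = 2.0) -> float:
    """Order-alpha Rényi entropy of a probability mass function (in bits by default)."""
    p = [x for x in p if x > 0]
    if alpha == 1.0:                       # Shannon entropy, by continuous extension (4)
        return -sum(x * math.log(x, base) for x in p)
    if math.isinf(alpha):                  # min-entropy: log(1/p_max), see (5)
        return math.log(1.0 / max(p), base)
    return math.log(sum(x ** alpha for x in p), base) / (1.0 - alpha)

# For the equiprobable distribution, H_alpha = log n for every order alpha.
uniform = [0.25] * 4
assert all(abs(renyi_entropy(uniform, a) - 2.0) < 1e-9 for a in (0.5, 1.0, 2.0, math.inf))
# H_alpha is monotonically non-increasing in alpha:
p = [0.5, 0.25, 0.125, 0.125]
assert renyi_entropy(p, 0.5) >= renyi_entropy(p, 1.0) >= renyi_entropy(p, 2.0)
```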
Definition 4 (Rényi divergence [62]): Let P and Q be probability mass functions defined on a finite or countably infinite set X. The Rényi divergence of order α ∈ (0, 1) ∪ (1, ∞) is defined as D_α(P‖Q) = (1/(α−1)) log Σ_{x∈X} P^α(x) Q^{1−α}(x). By the continuous extension of D_α(P‖Q),

D_0(P‖Q) = max_{A : P(A)=1} log(1/Q(A)),
D_1(P‖Q) = D(P‖Q), (8)
D_∞(P‖Q) = log max_{x∈X} (P(x)/Q(x)),

where D(P‖Q) in the right side of (8) is the relative entropy (a.k.a. Kullback-Leibler divergence).
Throughout this paper, for a ∈ R, ⌈a⌉ denotes the ceiling of a (i.e., the smallest integer not smaller than the real number a), and ⌊a⌋ denotes the floor of a (i.e., the greatest integer not greater than a).

III. A TIGHT LOWER BOUND ON THE RÉNYI ENTROPY
We provide in this section a tight lower bound on the Rényi entropy, of an arbitrary order α > 0, when the probability mass function of the discrete random variable is defined on a finite set of cardinality n, and the ratio of the maximal to minimal probability masses is upper bounded by an arbitrary fixed value ρ ∈ [1, ∞). In other words, we derive the largest possible gap between the order-α Rényi entropies of an equiprobable distribution and a non-equiprobable distribution (defined on a finite set of the same cardinality) with a given value for the ratio of the maximal to minimal probability masses. The basic tool used for the development of our result in this section relies on majorization theory. Our result strengthens the result in [21, Theorem 2] for the Shannon entropy, and it further provides a generalization for the Rényi entropy of an arbitrary order α > 0 (recall that the Shannon entropy is equal to the Rényi entropy of order α = 1, see (4)). Furthermore, the approach for proving the main result in this section differs significantly from the proof in [21] for the Shannon entropy. The main result in this section is a key result for all that follows in this paper.
The following lemma is a restatement of [21,Lemma 6].
Lemma 1: Let P ∈ P_n(ρ) with ρ ≥ 1 and an integer n ≥ 2, and assume without any loss of generality that the probability mass function P is defined on the set X = {1, . . ., n}. Let Q ∈ P_n be defined on X as in (13). Then,
1) Q ∈ P_n(ρ);
2) P ≺ Q.
In view of Lemmata 1-4, we obtain the following main result in this section.

Theorem 1: Let α > 0 and ρ ≥ 1, and let c_α^{(n)}(ρ) in (16) designate the maximal gap between the order-α Rényi entropies of equiprobable and arbitrary distributions in P_n(ρ). Then,
a) The non-negative sequence {c_α^{(n)}(ρ)}_{n=2}^∞ can be calculated by the real-valued single-parameter optimization in the right side of (15).
Remark 3: Theorem 1 strengthens the result in [21, Theorem 2] for the Shannon entropy (i.e., for α = 1), in addition to its generalization to Rényi entropies of arbitrary orders α > 0. This is because our lower bound on the Shannon entropy is given by (28), whereas the looser bound in [21] is given by (29) (see [21, (7)] and (22) here), and we recall the bounds on c^{(n)}(ρ) in (24). Figure 3 shows the improvement of the new lower bound (28) over (29) by comparing the two bounds for ρ ∈ [1, 10^5] and several values of n. It is reflected from Figure 3 that there is a very marginal improvement of the lower bound on the Shannon entropy (28) over the bound in (29) if ρ ≤ 30 (even for small values of n), whereas there is a significant improvement over the bound in (29) for large values of ρ; by increasing the value of n, the value of ρ also needs to be increased for observing an improvement of the lower bound in (28) over (29) (see Figure 3).
The tightened bound in (28) further yields an improved upper bound on the compression rate of Tunstall codes for discrete memoryless sources, which further tightens the bound by Jelinek and Schneider in [43, Eq. (9)]. More explicitly, in view of [21, Section 6], an improved upper bound on the compression rate of these variable-to-fixed lossless source codes is obtained by combining [21, Eqs. (36) and (38)] with a tightened lower bound on the entropy H(W) of the leaves of the tree graph for Tunstall codes.
From (28), the latter lower bound on H(W) is obtained with ρ = 1/p_min, the reciprocal of the minimal positive probability of the source symbols, and with n as the number of codewords (so, all codewords are of length ⌈log_2 n⌉ bits). This yields a reduction of the upper bound on the non-asymptotic compression rate R of Tunstall codes from ⌈log_2 n⌉/H(X) (see [21, Eq. (40)] and (22) here) to a tightened value in bits per source symbol, where H(X) denotes the source entropy; in view of (17), the tightened bound converges to ⌈log_2 n⌉/H(X) as we let n → ∞.
Remark 4: Equality (15), with the minimizing probability mass function of the form (13), holds in general when the Rényi entropy is replaced by an arbitrary Schur-concave function (as can be easily verified from the proof of Lemma 2 in Appendix A). However, the analysis leading to Lemmata 3-4 and Theorem 1 applies particularly to the Rényi entropy.

IV. BOUNDS ON THE RÉNYI ENTROPY OF A FUNCTION OF A DISCRETE RANDOM VARIABLE
This section relies on Theorem 1 and majorization for extending [21, Theorem 1], which applies to the Shannon entropy, to Rényi entropies of any positive order. More explicitly, let α ∈ (0, ∞) and
• X and Y be finite sets of cardinalities |X| = n and |Y| = m with n > m ≥ 2; without any loss of generality, let X = {1, . . ., n} and Y = {1, . . ., m};
• X be a random variable taking values on X with a probability mass function P_X ∈ P_n;
• F_{n,m} be the set of deterministic functions f : X → Y; note that f ∈ F_{n,m} is not one-to-one since m < n.
The main result in this section sharpens the inequality H_α(f(X)) ≤ H_α(X), for every deterministic function f ∈ F_{n,m} with n > m ≥ 2 and α > 0, by obtaining non-trivial upper and lower bounds on the maximum of H_α(f(X)) over f ∈ F_{n,m}. The calculation of the exact value of the minimum of H_α(f(X)) over f ∈ F_{n,m} is much easier, and it is expressed in closed form by capitalizing on the Schur-concavity of the Rényi entropy.
The following main result extends [21, Theorem 1] to Rényi entropies of arbitrary positive orders.
Theorem 2: Let X ∈ {1, . . ., n} be a random variable which satisfies P_X(1) ≥ P_X(2) ≥ . . . ≥ P_X(n). If P_X(1) < 1/m, let X̃_m be the equiprobable random variable on {1, . . ., m}; otherwise, if P_X(1) ≥ 1/m, let X̃_m ∈ {1, . . ., m} be a random variable with the probability mass function in (30), where n* is the maximal integer i ∈ {1, . . ., m − 1} such that (31) holds. Then, for every α > 0,
a) The maximum of H_α(f(X)) over f ∈ F_{n,m} satisfies the upper and lower bounds in (32), with the function v given in (33);
b) There exists an explicit construction of a deterministic function f* ∈ F_{n,m} such that (34) holds, where f* is independent of α, and it is obtained by using Huffman coding (as in [21] for α = 1).
This point is further elaborated in Remark 8, after the proof of Theorem 2.
Remark 6: Similarly to [21, Lemma 1], an exact solution of the maximization problem in the left side of (32) is strongly NP-hard [30]; this means that, unless P = NP, there is no polynomial-time algorithm which, for an arbitrarily small ε > 0, computes a deterministic function in F_{n,m} attaining the maximum in the left side of (32) within accuracy ε. This motivates the derivation of the bounds in (32), and the simple construction of a deterministic function f* ∈ F_{n,m} in Item b) of Theorem 2.

A proof of Theorem 2 relies on the following lemmata.
Lemma 5: For every α > 0, we have P_X ≺ P_{X̃_m}, and H_α(Q) ≥ H_α(X̃_m) for every Q ∈ P_m such that P_X ≺ Q.
Proof: Since P_X ≺ P_{X̃_m} (see [21, Lemma 2]) with P_{X̃_m} ∈ P_m, and P_{X̃_m} ≺ Q for all Q ∈ P_m such that P_X ≺ Q (see [21, Lemma 4]), the result follows from the Schur-concavity of the Rényi entropy.
Lemma 6: Let X ∈ {1, . . ., n}, α > 0, and f ∈ F_{n,m} with m < n. Then, (39) holds.
Proof: Since f is a deterministic function in F_{n,m} with m < n, the probability mass function of f(X) is an element in P_m which majorizes P_X (see [21, Lemma 3]). Inequality (39) then follows from Lemma 5.
We are now ready to prove Theorem 2.
Proof: In view of (39), the lower bound in the left side of (32) holds. We next construct a function f* ∈ F_{n,m} such that, for all α > 0, (41) and (42) hold, where the function v : (0, ∞) → (0, ∞) in the right side of (41) is given in (33), and (42) holds due to (38) and (40). The function f* in our proof coincides with the construction in [21], and it is therefore independent of α.
We first review and follow the concept of the proof of [21, Lemma 5], and we then deviate from the analysis there for proving our result. The idea behind the proof of [21, Lemma 5] relies on the following algorithm:
1) Start from the probability mass function P_X ∈ P_n with P_X(1) ≥ . . . ≥ P_X(n);
2) Merge successively pairs of probability masses by applying the Huffman algorithm;
3) Stop the process in Step 2 when a probability mass function Q ∈ P_m, with m masses, is obtained;
4) Set f*(k) := j, for j ∈ {1, . . ., m}, for all probability masses P_X(k), with k ∈ {1, . . ., n}, being merged in Steps 2-3 into the node of Q(j).
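The merging procedure above can be sketched in a few lines. This is our illustration (the helper name `huffman_merge` is hypothetical); it tracks only the merged masses and the induced map, not the full Huffman tree:

```python
import heapq

def huffman_merge(p_x, m):
    """Merge the two smallest masses repeatedly (Huffman step) until m masses remain.
    Returns the merged pmf Q (in decreasing order) and a map f assigning each
    original index to the node of Q into which its mass was merged."""
    # Each heap entry: (mass, unique tie-breaker, set of original indices merged into it)
    heap = [(mass, k, {k}) for k, mass in enumerate(p_x)]
    heapq.heapify(heap)
    tie = len(p_x)
    while len(heap) > m:
        m1, _, s1 = heapq.heappop(heap)
        m2, _, s2 = heapq.heappop(heap)
        heapq.heappush(heap, (m1 + m2, tie, s1 | s2))
        tie += 1
    nodes = sorted(heap, reverse=True)          # order Q by decreasing masses
    q = [mass for mass, _, _ in nodes]
    f = {k: j for j, (_, _, idxs) in enumerate(nodes) for k in idxs}
    return q, f

q, f = huffman_merge([0.4, 0.2, 0.15, 0.15, 0.1], m=3)
assert len(q) == 3 and abs(sum(q) - 1.0) < 1e-9
assert sorted(f.keys()) == [0, 1, 2, 3, 4] and set(f.values()) == {0, 1, 2}
```

The unique integer tie-breaker in each heap entry prevents Python from ever comparing the index sets when two masses are equal.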
Let i ∈ {0, . . ., m − 1} be the largest index such that P_X(1) = Q(1), . . ., P_X(i) = Q(i) (note that i = 0 corresponds to the case where each node Q(j), with j ∈ {1, . . ., m}, is constructed by merging at least two masses of the probability mass function P_X). Then, according to [21, p. 2225], (43) holds. Let S be the sum of the m − i smallest masses of the probability mass function Q. In view of (43), the vector of these masses, normalized by S, represents a probability mass function where the ratio of its maximal to minimal masses is upper bounded by 2.
In view of (44), let Q* ∈ P_m be the probability mass function which is given by (45). From (50)-(55), we get (56) with (57)-(58). Since T ∈ [0, 1] and v(α) > 0 for α > 0, it can be verified from (56)-(58) that (60) holds for α ∈ (0, 1) ∪ (1, ∞). The validity of (60) is extended to α = 1 by taking the limit α → 1 on both sides of this inequality, and due to the continuity of v(·) in (33) at α = 1. Applying the majorization result Q* ≺ P_{X̃_m} in [21, (31)], it follows from (60) and the Schur-concavity of the Rényi entropy that (61) holds for all α > 0.
We next prove Item c). Equality (36) is due to the Schur-concavity of the Rényi entropy, and since we have:
• f(X) is an aggregation of X, i.e., the probability mass function of f(X) is given by Q(j) = Σ_{k ∈ I_j} P_X(k) for j ∈ {1, . . ., m}, where I_1, . . ., I_m partition {1, . . ., n} into m disjoint subsets;
• By the assumption P_X(1) ≥ P_X(2) ≥ . . . ≥ P_X(n), it follows that Q ≺ P_{Ỹ_m} for every such Q ∈ P_m;
• From (35), Ỹ_m = f(X) where the function f ∈ F_{n,m} is given by f(k) := 1 for all k ∈ {1, . . ., n − m + 1}, and f(n − m + i) := i for all i ∈ {2, . . ., m}. Hence, P_{Ỹ_m} is an element in the set of the probability mass functions of f(X) with f ∈ F_{n,m} which majorizes every other element from this set.

Remark 9: The minimizing probability mass function in (35) for the optimization problem (36), and the maximizing probability mass function in (30) for the optimization problem (38), remain valid in general when the Rényi entropy of a positive order is replaced by an arbitrary Schur-concave function. However, the main results in (32)-(34) hold particularly for the Rényi entropy.
Remark 10: Theorem 2 makes use of the random variables denoted by X̃_m and Ỹ_m, rather than (more simply) X_m and Y_m respectively, because Section V considers i.i.d. samples {X_i}_{i=1}^k and {Y_i}_{i=1}^k with X_i ∼ P_X and Y_i ∼ P_Y; note, however, that the probability mass functions of X̃_m and Ỹ_m are different from P_X and P_Y, respectively, and for that reason we make use of tilde symbols in the left sides of (30) and (35).

V. INFORMATION-THEORETIC APPLICATIONS: NON-ASYMPTOTIC BOUNDS FOR LOSSLESS COMPRESSION AND GUESSING
Theorem 2 is applied in this section to derive non-asymptotic bounds for lossless compression of discrete memoryless sources, and for guessing moments. Each of the two subsections starts with a short background for making the presentation self-contained.
A. Guessing
1) Background: The problem of guessing discrete random variables has various theoretical and operational aspects in information theory (see [1], [2], [3], [10], [11], [14], [17], [31], [32], [41], [54], [55], [56], [59], [65], [68], [74], [75], [85]). The central object of interest is the distribution of the number of guesses required to identify a realization of a random variable X, taking values on a finite or countably infinite set X = {1, . . ., |X|}, by successively asking questions of the form "Is X equal to x?" until the value of X is guessed correctly. A guessing function is a one-to-one function g : X → X, which can be viewed as a permutation of the elements of X in the order in which they are guessed. The required number of guesses is therefore equal to g(x) when X = x with x ∈ X.
Lower and upper bounds on the minimal expected number of required guesses for correctly identifying the realization of X, expressed as a function of the Shannon entropy H(X), have been respectively derived by Massey [54] and by McEliece and Yu [55], followed by a derivation of improved upper and lower bounds by De Santis et al. [65]. More generally, given a probability mass function P_X on X, it is of interest to minimize the generalized guessing moment E[g^ρ(X)] = Σ_{x∈X} P_X(x) g^ρ(x) for ρ > 0. For an arbitrary positive ρ, the ρ-th moment of the number of guesses is minimized by selecting the guessing function to be a ranking function g_X, for which g_X(x) = ℓ if P_X(x) is the ℓ-th largest mass [54]. Although tie breaking affects the choice of g_X, the distribution of g_X(X) does not depend on how ties are resolved. Not only does this strategy minimize the average number of guesses, but it also minimizes the ρ-th moment of the number of guesses for every ρ > 0. Upper and lower bounds on the ρ-th moment of ranking functions, expressed in terms of Rényi entropies, were derived by Arikan [1] and Boztaş [10], followed by recent improvements in the non-asymptotic regime by Sason and Verdú [68]. Although it is straightforward to evaluate the guessing moments numerically if |X| is small, the benefit of bounds expressed in terms of Rényi entropies is particularly relevant when dealing with a random vector X^k = (X_1, . . ., X_k) whose letters belong to a finite alphabet X; computing all the probabilities of the mass function P_{X^k} over the set X^k, and then sorting them in decreasing order for the calculation of the ρ-th moment of the optimal guessing function for the elements of X^k, becomes infeasible even for moderate values of k. In contrast, regardless of the value of k, bounds on guessing moments which depend on the Rényi entropy are readily computable if, for example, {X_i}_{i=1}^k are independent; in that case, the Rényi entropy of the vector is equal to the sum of the Rényi entropies of its components. Arikan's bounds in [1] are asymptotically tight for random vectors of length k as k → ∞, thus providing the correct exponential growth rate of the guessing moments for sufficiently large k.
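As a small sketch (ours, not from the paper), the optimal ρ-th guessing moment of a pmf is computed by sorting the masses in decreasing order; the final assertion checks it against the standard form of Arikan's upper bound, E[g_X^ρ(X)] ≤ (Σ_x P_X(x)^{1/(1+ρ)})^{1+ρ}, i.e., the log of the moment is at most ρ H_{1/(1+ρ)}(X) in nats:

```python
import math

def guessing_moment(p, rho: float) -> float:
    """rho-th moment of the optimal (ranking) guessing function: sort the masses
    in decreasing order and charge the i-th largest mass a cost of i**rho."""
    ranked = sorted(p, reverse=True)
    return sum(mass * (i + 1) ** rho for i, mass in enumerate(ranked))

p = [0.5, 0.25, 0.125, 0.125]
rho = 1.0
alpha = 1.0 / (1.0 + rho)
# Order-alpha Rényi entropy in nats (natural log), with alpha = 1/(1+rho):
h_alpha = math.log(sum(x ** alpha for x in p)) / (1.0 - alpha)
assert math.log(guessing_moment(p, rho)) <= rho * h_alpha + 1e-12
```

For independent letters, the exponent on the right side adds across components, which is why such bounds stay computable for long vectors.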

2) Analysis:
We next analyze the following setup of guessing. Let {X_i}_{i=1}^k be i.i.d. random variables where X_1 ∼ P_X takes values on a finite set X with |X| = n. In order to cluster the data [29] (see also [21, Section 3.A] and references therein), suppose that each X_i is mapped to Y_i = f(X_i), where f ∈ F_{n,m} is a deterministic function (independent of the index i) with m < n. Let g_{X^k} : X^k → {1, . . ., n^k} and g_{Y^k} : Y^k → {1, . . ., m^k} be, respectively, the ranking functions of the random vectors X^k = (X_1, . . ., X_k) and Y^k = (Y_1, . . ., Y_k), obtained by sorting in separate decreasing orders the probabilities P_{X^k}(x^k) = Π_{i=1}^k P_X(x_i) for x^k ∈ X^k, and P_{Y^k}(y^k) = Π_{i=1}^k P_Y(y_i) for y^k ∈ Y^k, where ties in both cases are resolved arbitrarily. In view of Arikan's bounds on the ρ-th moment of ranking functions (see [1, Theorem 1] for the lower bound, and [1, Proposition 4] for the upper bound), the bounds in (63) and (64) hold for all ρ > 0. In the following, we rely on Theorem 2 and the bounds in (63) and (64) to obtain bounds on the exponential reduction of the ρ-th moment of the ranking function of X^k as a result of its mapping to Y^k. First, the combination of (63) and (64) gives (65). In view of Theorem 2-a) and (65), it follows that (66) holds for an arbitrary f ∈ F_{n,m} and ρ > 0, where X̃_m is a random variable whose probability mass function is given in (30). Note that (68) holds, where the first inequality in (68) holds since P_X ≺ P_{X̃_m} (see Lemma 5) and the Rényi entropy is Schur-concave.
By the explicit construction of the function f* ∈ F_{n,m} according to the algorithm in Steps 1-4 in the proof of Theorem 2 (based on the Huffman procedure), and by setting Y_i := f*(X_i) for every i ∈ {1, . . ., k}, it follows from (34) and (66) that (69) holds for all ρ > 0, where the monotonically increasing function v : (0, ∞) → (0, ∞) is given in (33), and it is depicted by the solid line in the left plot of Figure 2. In view of (33), it can be shown that the linear approximation v(α) ≈ v(1)α is excellent for all α ∈ [0, 1], and therefore (70) holds for all ρ > 0. Hence, for a sufficiently large value of k, the gap between the lower and upper bounds in (67) and (69) is marginal, being approximately equal to 0.08607 ρ/(1+ρ) bits for all ρ > 0.
The following theorem summarizes our result in this section.
3) Numerical Result: The following simple example illustrates the tightness of the achievable upper bound and the universal lower bound in Theorem 3 (see, respectively, (69) and (67)), especially for sufficiently long sequences.
Example 1: Let X be geometrically distributed, restricted to {1, . . ., n}, with its probability mass function parameterized by a = 24/25 and n = 128. Assume that X_1, . . ., X_k are i.i.d. with X_1 ∼ P_X, and let Y_i = f(X_i) with a deterministic function f ∈ F_{n,m} with n = 128 and m = 16. We compare the upper and lower bounds in Theorem 3 for the two cases where the sequence X^k = (X_1, . . ., X_k) is of length k = 100 or k = 1000. The lower bound in (67) holds for an arbitrary deterministic f ∈ F_{n,m}, and the achievable upper bound in (69) holds for the construction of the deterministic function f = f* ∈ F_{n,m} (based on the Huffman algorithm) in Theorem 3.
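A small sketch of the source in Example 1 (ours; the exact probability mass function is not displayed in the extracted text, so the truncated-geometric form P_X(k) ∝ a^k on {1, . . ., n} is an assumption):

```python
# Assumed form of Example 1's source: P_X(k) proportional to a**k on {1, ..., n},
# with a = 24/25 and n = 128 as stated in the example; m = 16 clustered symbols.
a, n, m = 24 / 25, 128, 16
weights = [a ** k for k in range(1, n + 1)]
Z = sum(weights)                      # normalization constant
p_x = [w / Z for w in weights]

assert abs(sum(p_x) - 1.0) < 1e-12
# The masses are already sorted in decreasing order, as assumed in Theorem 2:
assert all(p_x[i] > p_x[i + 1] for i in range(n - 1))
```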
Numerical results are shown in Figure 4, providing plots of the upper and lower bounds in Theorem 3, and illustrating the improved tightness of these bounds when the value of k is increased from 100 (left plot) to 1000 (right plot). From Theorem 3-c), for sufficiently large k, the gap between the upper and lower bounds is less than 0.08607 bits (for all ρ > 0); this is consistent with the right plot of Figure 4.

B. Lossless Source Coding
1) Background: For uniquely-decodable (UD) lossless source coding, Campbell ([15], [16]) proposed the cumulant generating function of the codeword lengths as a generalization of the frequently used design criterion of average code length. Campbell's motivation in [15] was to control the contribution of the longer codewords via a free parameter in the cumulant generating function: if the value of this parameter tends to zero, then the resulting design criterion becomes the average code length per source symbol; on the other hand, by increasing the value of the free parameter, the penalty for longer codewords is more severe, and the resulting code optimization yields a reduction in the fluctuations of the codeword lengths.
We introduce the coding theorem by Campbell [15] for lossless compression of a discrete memoryless source (DMS) with UD codes, which serves our analysis jointly with Theorem 2.
Theorem 4 (Campbell 1965, [15]): Consider a DMS which emits symbols with a probability mass function P_X defined on a (finite or countably infinite) set X. Consider a UD fixed-to-variable source code operating on source sequences of k symbols, with a codeword alphabet of size D. Let ℓ(x^k) be the length of the codeword which corresponds to the source sequence x^k := (x_1, . . ., x_k) ∈ X^k. Consider the scaled cumulant generating function of the codeword lengths, defined in (72)-(73). Then, for every ρ > 0, the following hold:
a) Converse result: (74);
b) Achievability result: there exists a UD source code for which (75) holds.
The bounds in Theorem 4, expressed in terms of the Rényi entropy, imply that for sufficiently long source sequences, it is possible to make the scaled cumulant generating function of the codeword lengths approach the Rényi entropy as closely as desired by a proper fixed-to-variable UD source code; moreover, the converse result shows that there is no UD source code for which the scaled cumulant generating function of its codeword lengths lies below the Rényi entropy. By invoking L'Hôpital's rule, one gets (76) from (72). Hence, by letting ρ tend to zero in (74) and (75), it follows from (4) that Campbell's result in Theorem 4 generalizes the well-known bounds in (77) on the optimal average length of UD fixed-to-variable source codes, and (77) is satisfied by Huffman coding (see, e.g., [25, Theorem 5.8.1]). Campbell's result therefore generalizes Shannon's fundamental result in [70] for the average codeword lengths of lossless compression codes, expressed in terms of the Shannon entropy.
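Campbell's achievability construction assigns codeword lengths via a tilted distribution with exponent α = 1/(1+ρ). The sketch below is ours (the rounding form is our reading of the construction); it verifies that the resulting lengths satisfy Kraft's inequality, so a prefix (hence UD) code with these lengths exists:

```python
import math

def campbell_lengths(p, rho: float, D: int = 2):
    """Codeword lengths from the tilted distribution: with alpha = 1/(1+rho),
    normalize p**alpha and take ceil(-log_D of each tilted mass)."""
    alpha = 1.0 / (1.0 + rho)
    tilted = [x ** alpha for x in p]
    Z = sum(tilted)
    return [math.ceil(-math.log(t / Z, D)) for t in tilted]

p = [0.5, 0.25, 0.125, 0.125]
lengths = campbell_lengths(p, rho=1.0)
# Kraft's inequality must hold, so a prefix (hence UD) code with these lengths exists.
assert sum(2.0 ** (-l) for l in lengths) <= 1.0 + 1e-12
```

As ρ → 0 the tilt disappears (α → 1) and the lengths reduce to the usual Shannon-code lengths ⌈−log_D P(x)⌉, matching the limit in (76).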
Following the work by Campbell [15], Courtade and Verdú derived in [23] non-asymptotic bounds for the scaled cumulant generating function of the codeword lengths for P X -optimal variable-length lossless codes [48], [79].These bounds were used in [23] to obtain simple proofs of the asymptotic normality of the distribution of codeword lengths, and the reliability function of memoryless sources allowing countably infinite alphabets.
Sason and Verdú recently derived in [68] improved non-asymptotic bounds on the cumulant generating function of the codeword lengths for fixed-to-variable optimal lossless source coding without prefix constraints, and non-asymptotic bounds on the reliability function of a DMS, tightening the bounds in [23].
2) Analysis: The following analysis for lossless source compression with UD codes relies on a combination of Theorems 2 and 4.
Let X_1, . . ., X_k be i.i.d. symbols which are emitted from a DMS according to a probability mass function P_X whose support is a finite set X with |X| = n. Similarly to Section V-A, in order to cluster the data, suppose that each symbol X_i is mapped to Y_i = f(X_i) with a deterministic function f ∈ F_{n,m}. Let ℓ(x^k) and ℓ̃(y^k) denote the lengths of the codewords for the source sequences x^k and y^k, respectively, and let Λ_k(·) and Λ̃_k(·) denote their corresponding scaled cumulant generating functions (see (72)).
In view of Theorem 4-b), for every ρ > 0, there exists a UD source code for the sequences in X^k such that the scaled cumulant generating function of its codeword lengths satisfies (75). Furthermore, from Theorem 4-a), we get (78). From (75), (78) and Theorem 2 a) and b), for every ρ > 0, there exist a UD source code for the sequences in X^k, and a construction of a deterministic function f ∈ F_{n,m} (as specified by Steps 1-4 in the proof of Theorem 2, borrowed from [21]), such that the difference between the two scaled cumulant generating functions satisfies (79), where (79) holds for every UD source code operating on the sequences in Y^k with the specific construction of f ∈ F_{n,m} as above, and X̃_m in the right side of (79) is a random variable whose probability mass function is given in (30). The right side of (79) can be very well approximated (for all ρ > 0) by using (70).
We proceed with a derivation of a lower bound on the left side of (79). In view of Theorem 4, it follows that (74) is satisfied for every UD source code which operates on the sequences in X^k; furthermore, Theorems 2 and 4 imply that, for every f ∈ F_{n,m}, there exists a UD source code which operates on the sequences in Y^k such that (80)-(81) hold, where (81) is due to (39) since Y_i = f(X_i) (for i = 1, . . ., k) with an arbitrary deterministic function f ∈ F_{n,m}, and Y_i ∼ P_Y for every i; hence, from (74), (80) and (81), we get (82). We summarize our result as follows.
Theorem 5: Let
• X_1, . . ., X_k be i.i.d. symbols which are emitted from a DMS according to a probability mass function P_X whose support is a finite set X with |X| = n;
• Each symbol X_i be mapped to Y_i = f(X_i), where f ∈ F_{n,m} is the deterministic function (independent of the index i) with m < n, as specified by Steps 1-4 in the proof of Theorem 2 (borrowed from [21]);
• Two UD fixed-to-variable source codes be used: one code encodes the sequences x^k ∈ X^k, and the other code encodes their mappings y^k ∈ Y^k; let the common size of the alphabets of both codes be D;
• Λ_k(·) and Λ̃_k(·) be, respectively, the scaled cumulant generating functions of the codeword lengths of the k-length sequences in X^k (see (72)) and of their mappings to Y^k.
Then, for every ρ > 0, the following holds for the difference between the scaled cumulant generating functions:
a) There exists a UD source code for the sequences in X^k such that the upper bound in (79) is satisfied for every UD source code which operates on the sequences in Y^k;
b) There exists a UD source code for the sequences in Y^k such that the lower bound in (82) holds for every UD source code for the sequences in X^k; furthermore, the lower bound in (82) holds in general for every deterministic function f ∈ F_{n,m};
c) The gap between the upper and lower bounds in (79) and (82), respectively, is at most approximately 0.08607 ρ log_D 2/(1+ρ) + 2ρ/k (with logarithms on base D, the alphabet size of the source codes); as k → ∞, this gap is less than 0.08607 log_D 2.
The codes in Items a) and b) can be constructed as follows:
1) As a preparatory step, we first calculate the probability mass function P_Y from the given probability mass function P_X and the deterministic function f ∈ F_{n,m} which is obtained by Steps 1-4 in the proof of Theorem 2; accordingly, P_Y(y) = Σ_{x∈X : f(x)=y} P_X(x) for all y ∈ Y. We then further calculate the probability mass functions for the i.i.d. sequences in X^k and Y^k (see (73)); recall that the number of types in X^k and Y^k is polynomial in k (being upper bounded by (k + 1)^{n−1} and (k + 1)^{m−1}, respectively), and the values of these probability mass functions are fixed over each type;
2) The sets of codeword lengths of the two UD source codes, for the sequences in X^k and Y^k, can (separately) be designed according to the achievability proof in Campbell's paper (see [15, p. 428]). More explicitly, let α := 1/(1+ρ); for all x^k ∈ X^k, let ℓ(x^k) ∈ N be given by (83) with (84), and let ℓ̃(y^k) ∈ N, for all y^k ∈ Y^k, be given similarly to (83) and (84) by replacing P_X with P_Y, and P_{X^k} with P_{Y^k}. This suggests codeword lengths for the two codes which fulfil (75) and (80), and which also both satisfy Kraft's inequality;
3) The separate construction of two prefix codes (a.k.a. instantaneous codes), based on their given sets of codeword lengths determined in Step 2, is standard.
Furthermore, in view of (76), it follows from (4), (33), (79) and (82) that the difference between the average code lengths (normalized by k) of the original and the clustered data satisfies (86), where the gap between the upper and lower bounds in (86) is equal to 0.08607 log_D 2 + 1/k.

APPENDIX A PROOF OF LEMMA 2
We first find the extreme values of p_min under the assumption that P ∈ P_n(ρ). If p_max/p_min = 1, then P is the equiprobable distribution on X and p_min = 1/n. On the other hand, if p_max/p_min = ρ, then the minimal possible value of p_min is obtained when P is the one-odd-mass distribution with n − 1 masses equal to ρ p_min and a smaller mass equal to p_min. The latter case yields p_min = 1/(1 + (n−1)ρ). Let β := p_min, so β can get any value in the interval [1/(1 + (n−1)ρ), 1/n]. Let Q* be the probability mass function which is defined on {1, . . ., 2n} as in (93). Since by assumption P* ∈ P_n(ρ), it is easy to verify from (93) that (94) holds. Furthermore, (95) follows from (93), proving (18). Finally, in view of (91), c_α^{(n)}(ρ) is monotonically increasing in α since so is the Rényi divergence of order α (see [28, Theorem 3]).
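The two extreme values of p_min quoted above follow directly from normalization:

```latex
% Equiprobable case: p_{\max}/p_{\min} = 1 gives n equal masses, so p_{\min} = \tfrac{1}{n}.
% One-odd-mass case: n-1 masses equal to \rho\, p_{\min} and one smaller mass p_{\min}:
(n-1)\,\rho\, p_{\min} + p_{\min} = 1
\quad\Longrightarrow\quad
p_{\min} = \frac{1}{1+(n-1)\,\rho}.
```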
In the limit where α → 1, the Rényi entropy of order α tends to the Shannon entropy. Hence, letting α → 1 in (20), it follows that the corresponding bound holds for the Shannon entropy, where (109) follows by invoking L'Hôpital's rule. This proves (22).
which, together with (40), proves Items a) and b) of Theorem 2 (note that, in view of the construction of the deterministic function f* ∈ F_{n,m} in Step 4 of the above algorithm, we get H_α(f*(X)) = H_α(Q)).

Remark 8: Inequality (43) leads to the application of Theorem 1 with ρ = 2 (see (46)). In the derivation of Theorem 2, we refer to v(α) := c_α^{(∞)}(2) (see (47)-(49)) rather than referring to c_α^{(n)}(2) (although, from (24), we have 0 ≤ c_α^{(n)}(2) ≤ v(α) for all α > 0). We do so since, for n ≥ 16, the difference between the curve of c_α^{(n)}(2) (as a function of α > 0) and the curve of c_α^{(∞)}(2) is marginal (see the dashed and solid lines in the left plot of Figure 2), and also because the function v in (33) is expressed in closed form whereas c_α^{(n)}(2) is subject to numerical optimization for finite n (see (15) and (16)). For this reason, Theorem 2 coincides with the result in [21, Theorem 1] for the Shannon entropy (i.e., for α = 1) while providing a generalization of the latter result to Rényi entropies of arbitrary positive orders α. Theorem 1, however, both strengthens the bounds in [21, Theorem 2] for the Shannon entropy with finite cardinality n (see Remark 3), and generalizes these bounds to Rényi entropies of all positive orders.
Theorem 3: Let {X_i}_{i=1}^k be i.i.d. random variables where X_1 ∼ P_X takes values on a finite set X with |X| = n, and let Y_i = f(X_i) for i ∈ {1, . . ., k}, where f ∈ F_{n,m} is an arbitrary deterministic function (independent of the index i) with m < n. Consequently, {Y_i}_{i=1}^k are i.i.d., and each Y_i takes values on a finite set Y with |Y| = m < |X|.
Let g_{X^k} and g_{Y^k} be, respectively, the ranking functions of the random vectors X^k = (X_1, . . ., X_k) and Y^k = (Y_1, . . ., Y_k). Then, for every ρ > 0,
a) The lower bound in (67) holds for every deterministic function f ∈ F_{n,m};
b) The upper bound in (69) holds for the specific f* ∈ F_{n,m}, whose construction relies on the Huffman algorithm (see Steps 1-4 of the procedure in the proof of Theorem 2);

Fig. 4.
Fig. 4. Plots of the upper and lower bounds in Theorem 3, as a function of ρ > 0, for random vectors of length k = 100 (left plot) or k = 1000 (right plot) in the setting of Example 1. Each plot shows the universal lower bound for an arbitrary deterministic f ∈ F_{128,16}, and the achievable upper bound with the construction of the deterministic function f = f* ∈ F_{128,16} (based on the Huffman algorithm) in Theorem 3 (see, respectively, (67) and (69)).
d) The UD source codes in Items a) and b) for the sequences in X^k and Y^k, respectively, can be constructed to be prefix codes by the algorithm in Remark 11.
Remark 11 (An Algorithm for Theorem 5 d)): A construction of the UD source codes for the sequences in X^k and Y^k, whose existence is assured by Theorem 5 a) and b) respectively, is obtained by the three-step algorithm given after the statement of Theorem 5, which also constructs them as prefix codes.
The separate construction of the two prefix codes, based on the sets of codeword lengths determined in Step 2, is standard (see, e.g., the construction in the proof of [25, Theorem 5.2.1]). Theorem 5 is of interest since it provides upper and lower bounds on the reduction in the cumulant generating function of close-to-optimal UD source codes as a result of clustering data, and Remark 11 suggests an algorithm to construct such UD codes which are also prefix codes. For long enough sequences (as k → ∞), the upper and lower bounds on the difference between the scaled cumulant generating functions of the suggested source codes for the original and clustered data almost match (see (79) and (82)), being roughly equal to 0.08607 ρ log_D 2/(1+ρ).

The sequence {c_α^{(n)}(ρ)}_{n∈N} is non-negative since H_α(P) ≤ log n for all P ∈ P_n. Furthermore, to prove (17), note that (88) holds since H_α(P) is monotonically decreasing in α, and (89) is due to (5) and p_max ≤ ρ/n. Let U_n denote the equiprobable probability mass function on {1, . . ., n}. By the identity D_α(P‖U_n) = log n − H_α(P), (90), and since, by Lemma 2, H_α(·) attains its minimum over the set of probability mass functions P_n(ρ), it follows that D_α(·‖U_n) attains its maximum over this set. Let P* ∈ P_n(ρ) be the probability measure which achieves the minimum defining c_α^{(n)}(ρ). From Lemma 1, P ≺ Q_β and Q_β ∈ P_n(ρ), and the Schur-concavity of the Rényi entropy yields H_α(P) ≥ H_α(Q_β) for all P ∈ P_n(ρ) with p_min = β. Minimizing H_α(P) over P ∈ P_n(ρ) can hence be restricted to minimizing H_α(Q_β) over β in the interval Γ.