The Interplay between Error, Total Variation, Alpha-Entropy and Guessing: Fano and Pinsker Direct and Reverse Inequalities

Using majorization theory via “Robin Hood” elementary operations, optimal lower and upper bounds are derived on Rényi and guessing entropies with respect to either error probability (yielding reverse-Fano and Fano inequalities) or total variation distance to the uniform (yielding reverse-Pinsker and Pinsker inequalities). This gives a general picture of how the notion of randomness can be measured in many areas of computer science.


Introduction
In many areas of science, it is of primary importance to assess the "randomness" of a certain random variable X. That variable could represent, for example, a cryptographic key, a signature, some sensitive data, or any type of intended secret. For simplicity, we assume that X is an M-ary discrete random variable, taking values in a finite alphabet $\mathcal{X}$ of size M, with known probability distribution $p = (p_1, p_2, \ldots, p_M)$ (in short, $X \sim p$).
Depending on the application, many different criteria can be used to evaluate randomness. Some are information-theoretic, others are related to detection/estimation theory or to hypothesis testing. We review the most common ones in the following subsections.

Entropy
A "sufficiently random" X is often described as "entropic" in the literature. The usual notion of entropy is the Shannon entropy [1] which is classically thought of as a measure of "uncertainty". It has, however, an operational definition in the fields of data compression or source coding. The problem is to find the binary description of X with the shortest average description length or "coding rate". Note that the base of the logarithm is not specified in (1). Similar to all informationtheoretic quantities, the choice of the base determines the unit of information. Logarithms of base 2 give binary units (bits) or Shannons (Sh). Logarithms of base 10 give decimal units (dits) or Hartleys. Natural logarithms (base e) give natural units (nats).
This compression problem can be seen as equivalent to a "game of 20 questions" (§ 5.7.1 in [2]), where a binary codeword for X is identified as a sequence of answers to yes-no questions about X that uniquely identifies it. There is no limitation on the type of questions asked, except that they must be answered by yes (1) or no (0). The goal of the game is to minimize the average number of questions, which is equal to the coding rate. It is well known, since Shannon [1], that the entropy H(X) is a lower bound on the coding rate that can be achieved asymptotically for repeated descriptions.
In this perspective, entropy is a natural measure of efficient (lossless) compression rate. A highly random variable (with high entropy) cannot be compressed too much without losing information: "random" means "hard to compress".

Guessing Entropy
Another perspective arises in cryptography when one wants to guess a secret key. The situation is similar to the "game of 20 questions" of the preceding subsection. The difference is that the only possibility is to actually try out one possible key hypothesis at a time. In other words, yes-no questions are restricted to be of the form "is X equal to x?" until the correct value has been found. The optimal strategy that minimizes the average number of questions is to guess the values of X in order of decreasing probabilities: first, the value with maximum probability $p_{(1)}$, then the second maximum $p_{(2)}$, and so on. The corresponding minimum average number of guesses is the guessing entropy [3] (also known as "guesswork" [4]):

$G(X) = G(p) \triangleq \sum_k k\, p_{(k)}$. (2)

Massey [3] has shown that the guessing entropy G increases exponentially with the entropy H. A recent improved inequality is [5,6] $G > \frac{\exp H}{e} + \frac{1}{2}$. It is sometimes convenient to use log G instead of G, to express it in the same logarithmic unit of information as the entropy H.
In this perspective, a highly random variable (with high guessing entropy) cannot be guessed rapidly: "random" means "hard to guess".
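The optimal guessing strategy and the Massey-type bound above are easy to check numerically. The following sketch (ours; function names are illustrative) computes G by sorting the probabilities in decreasing order and verifies the lower bound $G > \exp(H)/e + 1/2$, with H taken in nats.

```python
import math

def guessing_entropy(p):
    """G(p) = sum_k k * p_(k), with probabilities sorted in decreasing order."""
    q = sorted(p, reverse=True)
    return sum((k + 1) * pk for k, pk in enumerate(q))

def shannon_entropy_nats(p):
    return sum(pk * math.log(1.0 / pk) for pk in p if pk > 0)

p = [0.4, 0.3, 0.2, 0.1]
G = guessing_entropy(p)
H = shannon_entropy_nats(p)
print(G, math.exp(H) / math.e + 0.5)     # G exceeds the Massey-type bound
assert G > math.exp(H) / math.e + 0.5
```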

Coincidence or Collision
Another perspective is to view X as a (publicly available) "identifier", "fingerprint" or "signature" obtained by a randomized algorithm from some sensitive data. In such a scheme, to prevent "collision attacks", it is important to ensure that X is "unique" in the sense that there is only a small chance that another independent X′ obtained by the same randomized algorithm coincides with X. Since X and X′ are i.i.d., the "index of coincidence" $P(X = X') = \sum_k p_k^2$ should be as small as possible, that is, the complementary quantity (sometimes called quadratic entropy [7])

$R_2(X) = R_2(p) \triangleq 1 - \sum_k p_k^2$ (3)

should be as large as possible. In the context of hash functions, this is called "universality" (Chapter 8 in [8]). The corresponding logarithmic measure is known as the collision entropy (Rényi entropy [9] of order 2, also known as quadratic entropy [10]):

$H_2(X) = H_2(p) \triangleq \log \frac{1}{\sum_k p_k^2}$, (4)

which should also be as large as possible. By concavity of the logarithm, $\sum_k p_k \log p_k \leq \log \sum_k p_k^2$, that is, $H \geq H_2$; hence, high collision entropy implies high entropy. In this perspective, a highly random variable (with high collision entropy) cannot be found easily by coincidence: "random" means "unique" or "hard to collide".

Estimation Error
In estimation or detection theory, one observes some disclosed data which may depend on X and tries to estimate X from the observation. The best estimator $\hat{x}$ minimizes the probability of error, $P(X \neq \hat{x}) = 1 - P(X = \hat{x})$. Therefore, the best estimate is the value x with the highest probability $p_{(1)}$, and the minimum probability of error is written:

$P_e(X) = P_e(p) \triangleq 1 - \max p = 1 - p_{(1)}$. (5)
If X is meant to be kept secret, then this probability of error should be as large as possible. The corresponding logarithmic measure is known as the min-entropy:

$H_\infty(X) = H_\infty(p) \triangleq \log \frac{1}{p_{(1)}} = \log \frac{1}{1 - P_e}$, (6)

which should also be as large as possible. It is easily seen that $H \geq H_2 \geq H_\infty$; hence, high min-entropy implies high entropy in all the previous senses.
In this perspective, a highly random variable (with high min-entropy) cannot be efficiently estimated: "random" means "hard to estimate" or "hard to detect". Figure 1 illustrates various randomness measures for a binary distribution.
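To make the comparison of these criteria concrete, the following sketch (ours) evaluates the measures introduced so far on a binary distribution (p, 1 − p), in the spirit of Figure 1; logs are in bits and all function names are illustrative.

```python
import math

def measures(p):
    """Randomness measures of the binary distribution (p, 1 - p); logs in bits."""
    dist = [p, 1.0 - p]
    H = sum(x * math.log2(1.0 / x) for x in dist if x > 0)   # Shannon entropy
    H2 = -math.log2(sum(x * x for x in dist))                # collision entropy
    Hmin = -math.log2(max(dist))                             # min-entropy
    G = sum((k + 1) * x for k, x in enumerate(sorted(dist, reverse=True)))
    Pe = 1.0 - max(dist)                                     # probability of error
    return H, H2, Hmin, G, Pe

for p in (0.5, 0.7, 0.9, 0.99):
    print(p, ["%.3f" % v for v in measures(p)])
```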

Some Generalizations
One can generalize the above concepts in multiple ways. We only mention a few. The α-entropy, or Rényi entropy of order α > 0, is defined as follows [9]:

$H_\alpha(X) = H_\alpha(p) \triangleq \frac{1}{1-\alpha} \log \sum_k p_k^\alpha = \frac{\alpha}{1-\alpha} \log \|p\|_\alpha$,

where $\|p\|_\alpha = (\sum_k p_k^\alpha)^{1/\alpha}$ is the "α-norm" (strictly speaking, $\|\cdot\|_\alpha$ is a norm only when α ≥ 1). The Shannon entropy $H = H_1$ is recovered in the limiting case α → 1, the collision entropy $H_2$ is recovered in the case α = 2, and the min-entropy $H_\infty$ is recovered in the limiting case α → ∞.
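A minimal sketch (ours) of the α-entropy, treating the limiting cases α → 1 and α → ∞ separately and checking them numerically:

```python
import math

def renyi_entropy(p, alpha):
    """Renyi (alpha-) entropy in bits; alpha = 1 and alpha = inf are limiting cases."""
    if alpha == 1.0:
        return sum(x * math.log2(1.0 / x) for x in p if x > 0)   # Shannon entropy
    if math.isinf(alpha):
        return -math.log2(max(p))                                # min-entropy
    return math.log2(sum(x ** alpha for x in p if x > 0)) / (1.0 - alpha)

p = [0.5, 0.25, 0.125, 0.125]
# H_alpha is nonincreasing in alpha:
print(renyi_entropy(p, 0.5), renyi_entropy(p, 1.0), renyi_entropy(p, 2.0))
# Orders near 1 and very large orders approach the Shannon and min-entropy, respectively:
print(renyi_entropy(p, 1.0001), renyi_entropy(p, 1000.0), renyi_entropy(p, math.inf))
```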
The ρ-guessing entropy, or guessing moment [11] of order ρ > 0, is defined as the minimum ρth-order moment of the number of guesses needed to find X. The same optimal strategy as for the guessing entropy yields the following:

$G_\rho(X) = G_\rho(p) \triangleq \sum_k k^\rho\, p_{(k)}$,

which generalizes $G = G_1$ for ρ = 1. Arikan [11] has shown that $\log G_\rho$ behaves asymptotically as $\rho H_{\frac{1}{1+\rho}}$. In particular, log G behaves asymptotically as the ½-entropy $H_{1/2}$. In some cryptographic scenarios, one has the ability to estimate or guess X in a given maximum number m of tries. The corresponding error probability takes the form $P(X \neq \hat{x}_1, X \neq \hat{x}_2, \ldots, X \neq \hat{x}_m)$. The same optimal strategy as for the guessing entropy $G_\rho$ yields an error probability of order m:

$P_e^m(X) = P_e^m(p) \triangleq 1 - (p_{(1)} + p_{(2)} + \cdots + p_{(m)})$,

which generalizes $P_e = P_e^1$ obtained for m = 1. One obtains similar randomness measures by replacing p with its "negation" $\bar{p}$, as explained in [12].
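The generalized quantities are equally simple to compute once the probabilities are sorted in decreasing order; the following sketch (ours, with illustrative names) evaluates $G_\rho$ and $P_e^m$.

```python
def guessing_moment(p, rho=1.0):
    """G_rho(p) = sum_k k^rho * p_(k), probabilities sorted in decreasing order."""
    q = sorted(p, reverse=True)
    return sum((k + 1) ** rho * pk for k, pk in enumerate(q))

def error_probability(p, m=1):
    """P_e^m(p) = 1 - (p_(1) + ... + p_(m)): probability of missing after m guesses."""
    q = sorted(p, reverse=True)
    return 1.0 - sum(q[:m])

p = [0.4, 0.3, 0.2, 0.1]
print(guessing_moment(p, 1.0), guessing_moment(p, 2.0))   # G and G_2
print(error_probability(p, 1), error_probability(p, 2))   # P_e and P_e^2
```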

"Distances" to the Uniform
A fairly common convention is that, when we "draw X at random", it is assumed that we sample it according to a uniform distribution unless otherwise explicitly indicated. Thus, the uniform distribution u, in which all M possible outcomes are equally likely ($u_k = \frac{1}{M}$ for all k), is considered as the ideal randomness. From this viewpoint, a variable X with distribution p should be all the more "random" as p is "close to uniform": randomness can be measured as some complementary "distance" from p to the uniform u, in the form, say, $d_{\max} - d(p, u)$, where the "distance" d has maximum value $d_{\max}$. Such a d(p, u) need not obey all axioms of a mathematical distance, but should at least be nonnegative and vanish only when p = u.
Many of the above entropic criteria fall into this category. For example,

$H(X) = \log M - D(p \| u)$, (10)

where $D(p \| q) = \sum_k p_k \log \frac{p_k}{q_k}$ denotes the (Kullback-Leibler) divergence (or "distance"). More generally,

$H_\alpha(X) = \log M - D_\alpha(p \| u)$, (11)

where $D_\alpha(p \| q) = \frac{1}{\alpha-1} \log \sum_k p_k^\alpha q_k^{1-\alpha}$ denotes the (Rényi) α-divergence [13]. In the particular case α = 2, since $\sum_k (p_k - \frac{1}{M})^2 = \sum_k p_k^2 - \frac{1}{M}$, the complementary index of coincidence $R_2$ (hence, the collision entropy $H_2$) is also related to the squared 2-norm distance to the uniform:

$R_2(p) = 1 - \tfrac{1}{M} - \|p - u\|_2^2$. (12)

It follows that the 2-norm distance is related to the 2-divergence by the formula $D_2(p \| u) = \log(1 + M \|p - u\|_2^2)$ (see, e.g., Lemma 3 in [14]). Similarly, in the particular case α = ½, one can write $H_{1/2}$ as a complementary quantity of the squared Hellinger distance to the uniform. Another important example is given next.
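Before turning to that example, here is a numerical check (ours) of the identities above, namely $D(p\|u) = \log M - H(p)$ and $D_2(p\|u) = \log(1 + M\|p - u\|_2^2)$, on an arbitrary distribution.

```python
import math

def kl_divergence(p, q):
    return sum(pk * math.log2(pk / qk) for pk, qk in zip(p, q) if pk > 0)

def renyi2_divergence(p, q):
    return math.log2(sum(pk * pk / qk for pk, qk in zip(p, q)))

p = [0.4, 0.3, 0.2, 0.1]
M = len(p)
u = [1.0 / M] * M
H = sum(pk * math.log2(1.0 / pk) for pk in p if pk > 0)
sq2 = sum((pk - 1.0 / M) ** 2 for pk in p)              # squared 2-norm distance to uniform
print(kl_divergence(p, u), math.log2(M) - H)            # both sides of D(p||u) = log M - H(p)
print(renyi2_divergence(p, u), math.log2(1 + M * sq2))  # both sides of the D_2 identity
```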

Statistical Distance to the Uniform
Suppose one wants to design a statistical experiment to decide whether X follows distribution p (null hypothesis $H_0$) or another distribution q (alternate hypothesis). Any statistical test takes the form "is X ∈ T?": if yes, then accept $H_0$; otherwise, reject it. Type-I and type-II errors have total probability $P(X \notin T) + Q(X \in T)$, where P, Q are the probability measures corresponding to p and q, respectively. Clearly, if $|P(X \in T) - Q(X \in T)|$ is small enough for every T, the two hypotheses p and q are indistinguishable in the sense that the decision errors have total probability arbitrarily close to 1.
The statistical (total variation) distance (§ 8.8 in [8]) is defined as follows:

$\Delta(p, q) \triangleq \frac{1}{2} \sum_k |p_k - q_k| = \max_T |P(X \in T) - Q(X \in T)|$,

where the ½ factor is present to ensure that 0 ≤ Δ(p, q) ≤ 1. The maximum in the definition of the statistical distance is attained for any event $T_+$ satisfying $\{x : p_x > q_x\} \subseteq T_+ \subseteq \{x : p_x \geq q_x\}$. The statistical distance is particularly important from a hypothesis testing viewpoint, since, as we have just seen, a very small distance Δ(p, q) ensures that no statistical test can distinguish the two hypotheses p and q.
Following the discussion of the preceding subsection, we can define "statistical randomness" as the complementary value of the statistical distance Δ(p, u) between p and the uniform distribution u. If q = u is uniform and letting $K = |T_+|$, then $\Delta(p, u) = P(T_+) - \frac{K}{M}$ has maximum value $1 - \frac{1}{M}$, and statistical randomness can be defined as follows:

$R(X) = R(p) \triangleq 1 - \tfrac{1}{M} - \Delta(p, u)$. (18)

This is similar to (12), where half the 1-norm is used in place of the squared 2-norm. From the hypothesis testing perspective, it follows that a high statistical randomness R ensures that no statistical test can effectively distinguish between the actual distribution and the uniform. This is, for example, the usual criterion used to evaluate randomness extractors in cryptology. Since equiprobable values are the least predictable, a highly random variable cannot be easily statistically predicted: "random" means "hard to predict".
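The following sketch (ours) computes the statistical distance to the uniform in the two equivalent ways described above (half the 1-norm, and $P(T_+) - |T_+|/M$) together with the statistical randomness R.

```python
def tv_to_uniform(p):
    """Statistical distance Delta(p, u) = (1/2) * sum_k |p_k - 1/M|."""
    M = len(p)
    return 0.5 * sum(abs(pk - 1.0 / M) for pk in p)

def statistical_randomness(p):
    """R(p) = 1 - 1/M - Delta(p, u): 0 for a deterministic p, 1 - 1/M for the uniform."""
    return 1.0 - 1.0 / len(p) - tv_to_uniform(p)

p = [0.4, 0.3, 0.2, 0.1]
M = len(p)
Tplus = [pk for pk in p if pk > 1.0 / M]    # a maximizing event T+
delta_alt = sum(Tplus) - len(Tplus) / M     # Delta(p, u) = P(T+) - |T+|/M
print(tv_to_uniform(p), delta_alt)          # the two computations agree
print(statistical_randomness(p))
```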

Conditional Versions
In many applications, the randomness of X is evaluated after observing some disclosed data or side information Y. The observed random variable Y can model any type of data and is not necessarily discrete. The conditional probability distribution of X having observed Y = y is denoted by $p_{X|y}$ to distinguish it from the unconditional distribution $p = p_X$ (without side information). By the law of total probability $P(X = x) = \mathbb{E}_y P(X = x | Y = y)$, $p_X$ is recovered by averaging all conditional distributions:

$p_X = \mathbb{E}_y\, p_{X|y}$, (19)

where $\mathbb{E}_y$ denotes the expectation operator over Y.
The "conditional randomness" of X given Y can then be defined as the average randomness measure of X|y over all possible observations, that is, the expectation over Y of all randomness measures of X|Y = y. For example, Shannon's conditional entropy or equivocation [1] is given by the following: Similarly: gives the average minimum number of guesses to find X after having observed Y. Additionally: R 2 (X|Y) E y R 2 (X|y) = E y R 2 (p X|y ) (22) gives the average probability of non-collision to identify X upon observation of Y, and P e (X|Y) E y P e (X|y) = 1 − E y max p X|y (23) gives the minimum average probability of error, as achieved by the maximum a posteriori (MAP) decision rule. The "conditional statistical randomness" is likewise defined as shown: For the generalized quantities of Section 1.5, the conditional ρ-guessing entropy is given by the following: and the conditional mth-order probability of error is as below: P m e (X|Y) E y P m e (X|y) = E y P m e (p X|y ).
For α-entropy, however, many different definitions of conditional α-entropy have been proposed in the literature [15]. The preferred choice for most applications seems to be Arimoto's definition [16]:

$H_\alpha(X|Y) \triangleq \frac{\alpha}{1-\alpha} \log \mathbb{E}_y \|p_{X|y}\|_\alpha$, (27)

where the expectation over Y is taken on the α-norm inside the logarithm and not outside. Shannon's conditional entropy H(X|Y) is recovered in the limiting case α → 1. One nice property of Arimoto's definition is that it is compatible with that of $P_e(X|Y)$ in the limiting case α → ∞, since the relation $H_\infty = \log \frac{1}{1-P_e}$ of (6) naturally extends to conditional quantities:

$H_\infty(X|Y) = \log \frac{1}{1 - P_e(X|Y)}$. (28)

Notice that for any order α ≠ 1, Arimoto's definition can be rewritten as a simple expectation of $\varphi_\alpha(H_\alpha)$ instead of $H_\alpha$:

$\varphi_\alpha(H_\alpha(X|Y)) = \mathbb{E}_y\, \varphi_\alpha(H_\alpha(X|y))$, (29)

where $\varphi_\alpha$ is the increasing function defined as follows:

$\varphi_\alpha(x) \triangleq \operatorname{sgn}(1-\alpha)\, \exp\big(\tfrac{1-\alpha}{\alpha}\, x\big)$. (30)

The requirement that $\varphi_\alpha$ is increasing is important in the following. The signum term was introduced so that $\varphi_\alpha$ is increasing, not only for 0 < α < 1, but also for α > 1. The exponential function exp is assumed to be in the same base as the logarithm ($\exp x = 2^x$ for x in bits, $10^x$ in dits, $e^x$ in nats). In what follows, we indifferently refer to $H_\alpha$ or $\varphi_\alpha(H_\alpha)$.
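To illustrate the conditional definitions, the sketch below (ours; the joint distribution is an arbitrary example) forms the conditional distributions $p_{X|y}$ from a joint matrix and evaluates $P_e(X|Y)$, G(X|Y) and Arimoto's conditional α-entropy, in which the expectation over Y is taken on the α-norm inside the logarithm; the α → ∞ behavior is checked only approximately, with a large finite α.

```python
import math

# Joint distribution p_XY[x][y] (rows: values of X, columns: values of Y); arbitrary example.
p_XY = [[0.20, 0.05],
        [0.10, 0.25],
        [0.15, 0.25]]

p_Y = [sum(row[y] for row in p_XY) for y in range(2)]
p_X_given_y = [[row[y] / p_Y[y] for row in p_XY] for y in range(2)]

def alpha_norm(p, alpha):
    return sum(x ** alpha for x in p) ** (1.0 / alpha)

def arimoto(alpha, conds, weights):
    """Arimoto conditional alpha-entropy: (alpha/(1-alpha)) * log2 of E_y ||p_{X|y}||_alpha."""
    avg_norm = sum(w * alpha_norm(c, alpha) for w, c in zip(weights, conds))
    return (alpha / (1.0 - alpha)) * math.log2(avg_norm)

Pe_cond = sum(w * (1.0 - max(c)) for w, c in zip(p_Y, p_X_given_y))
G_cond = sum(w * sum((k + 1) * x for k, x in enumerate(sorted(c, reverse=True)))
             for w, c in zip(p_Y, p_X_given_y))
print("P_e(X|Y) =", Pe_cond, "  G(X|Y) =", G_cond)
print("H_2(X|Y) =", arimoto(2.0, p_X_given_y, p_Y))
# For large alpha, Arimoto's H_alpha(X|Y) approaches log 1/(1 - P_e(X|Y)):
print(arimoto(50.0, p_X_given_y, p_Y), math.log2(1.0 / (1.0 - Pe_cond)))
```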

Aim and Outline
The enumeration in the preceding subsections is by no means exhaustive. Every subfield or application has its preferred criterion, either information/estimation theoretic or statistical, conditioned on some observations or not. Clearly, all these randomness measures share many properties.
Therefore, a natural question is to determine a (possibly minimal) set of properties that characterize all possible randomness measures. Many axiomatic approaches have been proposed for entropy [1,17], α-entropy [9], information leakage [18] or conditional entropy [19,20].
Extending the work in [21], Section 2 presents a simple alternative, which naturally encompasses all common randomness measures H, $H_\alpha$, G, $G_\rho$, $P_e$, $P_e^m$, $R_2$ and R, based on two natural axioms: • Equivalent random variables are equally random; • Knowledge reduces randomness (on average).
Many properties, shared by all randomness measures described above, are deduced from these two axioms.
Another important issue is to study the relationship between randomness measures, by establishing the exact locus or joint range of two such measures among all probability distributions with tight lower and upper bounds. In this paper, extending the presentation made in [21], we establish the optimal bounds relating information-theoretic (e.g., entropic) quantities on one hand and statistical quantities (probability of error and statistical distance) on the other hand.
Section 3 establishes general optimal Fano and reverse-Fano inequalities, relating any randomness measure to the probability of error. This generalizes Fano's original inequality [22]

$H(X|Y) \leq (1 - P_e(X|Y)) \log \frac{1}{1 - P_e(X|Y)} + P_e(X|Y) \log \frac{M-1}{P_e(X|Y)}$,

which has become ubiquitous in information theory (e.g., to derive converse channel coding theorems) and in statistics (e.g., to derive lower bounds on the maximum probability of error in multiple hypothesis testing).
Section 4 establishes general optimal Pinsker and reverse-Pinsker inequalities, relating any randomness measure to the statistical randomness or the statistical distance to the uniform. Generally speaking, Pinsker and reverse-Pinsker inequalities relate some divergence measure (e.g., $D(p\|q)$ or $D_\alpha(p\|q)$) between two distributions to their statistical distance Δ(p, q). Here, following the discussion in Section 1.6, we restrict ourselves to the divergence or distance to the uniform distribution q = u. (For the general case of arbitrary distributions p, q see, e.g., the historical perspective on Pinsker-Schützenberger inequalities in [23].) In this context, we improve the well-known Pinsker inequality [24,25], which reads $D(p\|u) = \log M - H(p) \geq 2 \log e \cdot \Delta^2(p, u)$. This inequality, of more general applicability for any distributions p, q, is no longer optimal in the particular case q = u.
Finally, Section 5 lists some applications in the literature, and Section 6 gives some research perspectives.

An Axiomatic Approach
Let X be any M-ary random variable with distribution p X . How should a measure of "randomness" R(X) ∈ R of X be defined in general? To simplify the discussion, we assume that R(X) ≥ 0 is nonnegative.
As advocated by Shannon [26], such a notion should not depend on the particular "reversible encoding" of X. In other words, any two equivalent random variables should have the same measure R(X), where equivalence is defined as follows.
Definition 1 (Equivalent Variables). Two random variables X and Y are equivalent: X ≡ Y, if there exist two mappings f and g, such that Y = f (X) a.s. (almost surely, i.e., with probability one) and X = g(Y) a.s.

Remark 1 (Equivalent Measures).
Obviously, it is also essentially equivalent to study R(X) or $R(X)^2$, for example, or any quantity of the form $\varphi(R(X))$, where $\varphi : \mathbb{R}_+ \to \mathbb{R}_+$ is any increasing (invertible) function.
Definition 2 (Conditional Randomness). Given any random variable Y, the conditional form of R is defined as follows:

$R(X|Y) \triangleq \mathbb{E}_y R(X|y) = \mathbb{E}_y R(p_{X|y})$, (31)

where X|y (or X|Y = y) denotes the random variable X conditioned on the event Y = y. This quantity represents the average amount of randomness of X knowing Y.

Remark 2 (Equivalent Conditional Measures).
Again, it is essentially equivalent to study R(X|Y) or ϕ(R(X|Y)), where ϕ : R + → R + is any increasing function. One may, therefore, generalize the notion of conditional randomness by writing ϕ(R(X|Y)) = E y ϕ(R(X|y)) in place of (31), the same as (29) for α-entropy. However, in the sequel, we stay with the basic Definition 2 and simply assume that ϕ(R) is considered instead of R whenever it is convenient to do so.
In the sequel, we study the implications of only two axioms:

Axiom 1 (Equivalence). If X ≡ Y, then R(X) = R(Y).

Axiom 2 (Knowledge Reduces Randomness). For any side information Y, R(X|Y) ≤ R(X). (32)

We find such postulates quite intuitive and natural. First, equivalent random variables should be equally random. Second, knowledge of some side observation should, on average, reduce randomness.
All randomness quantities described in Section 1 obviously satisfy Axiom 1. That they also satisfy Axiom 2 is shown in the following examples.

Example 1 (Entropies). For Shannon's entropy H, the inequality H(X|Y) ≤ H(X) is well known (Thm. 2.6.5 in [2]). This is often paraphrased as "conditioning reduces entropy", "knowledge reduces uncertainty" or "information can't hurt". The difference H(X) − H(X|Y) = I(X; Y) is the mutual information, which is always nonnegative. The inequality $H_\alpha(X|Y) \leq H_\alpha(X)$ is also known to hold for any α > 0; see [15,16] and Example 4 below.
Example 2 (Guessing Entropies). Axiom 2 for the guessing entropies G or G ρ can be easily checked from their definition, as follows.
Let $N \in \mathbb{N} = \{1, 2, \ldots\}$ be any random variable giving the number of guesses needed to find X in any guessing strategy. N is equivalent to X (Definition 1) since every value of N corresponds to a unique value of X, and vice versa. By definition, $G_\rho(X) = \min_{N \equiv X} \mathbb{E}(N^\rho)$, where the minimum is over all possible N equivalent to X (corresponding to all possible strategies). Now, $G_\rho(X|Y) = \mathbb{E}_y G_\rho(X|y) \leq \mathbb{E}_y \mathbb{E}(N^\rho | y) = \mathbb{E}(N^\rho)$, by the law of total expectation. Taking the minimum over N ≡ X gives $G_\rho(X|Y) \leq G_\rho(X)$, which is Axiom 2.
The case ρ = 1 was already shown in [27]. The result is quite intuitive: any side information Y can only improve the guess of X.
Example 3 (Error Probabilities). Axiom 2 for the error probability $P_e = P_e^1$ follows from the corresponding inequality for $H_\infty = \log\frac{1}{1-P_e}$ (see (28) and Example 1 for α = ∞), but it can also be checked directly from its definition, as well as in the case of $P_e^m$ of order m, as follows. The mth-order error probability is $P_e^m(X) = \min_{\hat{x}_1, \ldots, \hat{x}_m} P(X \neq \hat{x}_1, X \neq \hat{x}_2, \ldots, X \neq \hat{x}_m)$, i.e., the minimum probability that X is not equal to any of the first m estimates $\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_m$. Then, $P_e^m(X|Y) = \mathbb{E}_y \min_{\hat{x}_1, \ldots, \hat{x}_m} P(X \neq \hat{x}_1, \ldots, X \neq \hat{x}_m | y) \leq \mathbb{E}_y P(X \neq \hat{x}_1, \ldots, X \neq \hat{x}_m | y) = P(X \neq \hat{x}_1, \ldots, X \neq \hat{x}_m)$, by the law of total probability, for every sequence $\hat{x}_1, \ldots, \hat{x}_m$. Taking the minimum over such sequences gives $P_e^m(X|Y) \leq P_e^m(X)$, which is Axiom 2. The case m = 1 was already shown, e.g., in [27]. Again, the result is quite intuitive: any side information Y can only improve the estimation of X.

Symmetry and Concavity
We now rewrite Axioms 1 and 2 as equivalent conditions on probability distributions. Definition 3 (Probability "Simplex"). Let P be the set of all sequences of nonnegative numbers

$p = (p_1, p_2, p_3, \ldots)$ (33)

such that the following are satisfied: • Only a finite number of them are positive: $p_k = 0$ for all but finitely many k; • They sum to 1: $\sum_k p_k = 1$.
Notice that P has infinite dimension even though only a finite number of components are nonzero in every p ∈ P. Thus, any p ∈ P can be seen as the probability distribution of M-ary random variables with arbitrary large M.
Theorem 1 (Symmetry). Axiom 1 is equivalent to the condition that R(X) = R(p) is a symmetric function of p = (p 1 , p 2 , p 3 , . . .) ∈ P, identified as the probability distribution of X.
Proof. Let $\mathcal{X}$ be the finite set ("alphabet") of all values taken by $X \sim p_X$, and let f be an injective mapping from $\mathcal{X}$ to $\mathbb{N} = \{1, 2, \ldots\}$, whose image is a finite subset of $\mathbb{N}$. From Definition 1, X is equivalent to $f(X) \in \mathbb{N}$, with probabilities $p = (p_1, p_2, \ldots)$. Then, by Axiom 1, R(X) does not depend on the particular values of X but only on the corresponding probabilities, so that R(X) = R(p), where $p \in P$ is identified with $p_X$. Now, letting h be any bijection (permutation) of $\mathbb{N}$, Axiom 1 implies that R(p) does not depend on the ordering of the $p_k$'s, that is, R(p) is a symmetric function of p. Conversely, any bijection applied to X can only change the ordering of the $p_k$'s in $p = p_X$, which leaves R(p) = R(X) invariant.
Accordingly, it is easily checked directly that all the expressions, in terms of probability distributions p, of the randomness measures given in Section 1 are symmetric in p.

Remark 3. Some authors [17] define P as the union of all $P_M$ for $M \in \mathbb{N}$, where $P_M$ is the M-simplex $\{(p_1, p_2, \ldots, p_M) : p_k \geq 0,\ p_1 + \cdots + p_M = 1\}$. With this viewpoint, even when the expression of R(p) does not explicitly depend on M, one has to define R(p) separately for all different values of M as a function $R_M(p_1, p_2, \ldots, p_M)$ defined over $P_M$, and further impose the compatibility condition that $R_{M+1}(p_1, p_2, \ldots, p_M, 0) = R_M(p_1, p_2, \ldots, p_M)$, as in [17] (this is called "expansibility" in [20]). Such an expansibility condition is unnecessary to state explicitly in our approach: it is an obvious consequence of an appropriate choice of f in Definition 1, namely, the injective embedding of {1, 2, . . . , M} into {1, 2, . . . , M + 1}.
Theorem 2 (Concavity). Axiom 2 is equivalent to the condition that R(p) is concave in p.
Proof. Using the notations of Theorem 1, Definition 2 and (19), Axiom 2 can be rewritten as

$\mathbb{E}_y R(p_{X|y}) \leq R(\mathbb{E}_y\, p_{X|y}) = R(p_X)$. (34)

This is exactly Jensen's inequality for concave functions on the convex "simplex" P.

Remark 4 (ϕ-Concavity).
Similarly as in Remark 2, we may consider ϕ(R) in place of R in the definition of conditional randomness, where ϕ : R + → R is any increasing function. Then, by Theorem 2, ϕ(R) is concave, that is, R(p) is a ϕ-concave function of p (for example, for ϕ = log, one recovers the usual definition of a log-concave function). This is called "core-concavity" in [20].
Example 4 (Symmetric Concave Measures). All randomness measures of Examples 1-3 satisfy both Axioms 1 and 2, and are, therefore, symmetric concave in p. This can also be checked directly from certain closed-form expressions given in Section 1: • Shannon's entropy H, as well as the complementary index of coincidence $R_2$, can be written in the form $\sum_k r(p_k)$, where r is a strictly concave function. Thus, both are symmetric and strictly concave in p; • Statistical randomness R(p) can also be written in this form, where $r(p_k) = -\frac{1}{2}|p_k - \frac{1}{M}|$ is concave in $p_k$. Thus, R(p) is also symmetric concave and, therefore, is also an acceptable randomness measure satisfying Axioms 1 and 2; • For α-entropy, consider $\varphi_\alpha(H_\alpha(p)) = \operatorname{sgn}(1-\alpha)\, \|p\|_\alpha$, where $\varphi_\alpha$ is the increasing function (30). It is known that the α-norm $\|\cdot\|_\alpha$ is strictly convex for finite α > 1 (by Minkowski's inequality) and strictly concave for 0 < α < 1 (by the reverse Minkowski inequality). Thus, α-entropy is symmetric and (strictly) $\varphi_\alpha$-concave in the sense of Remark 4. Therefore, one finds anew that it satisfies Axioms 1 and 2.

Example 5.
The mixing property of the Shannon entropy H is well known (Thm. 2.7.3 in [2]). A well-known thermodynamic interpretation is that mixing two gases of equal entropy results in a gas with higher entropy.

Basic Properties in Terms of Random Variables
In terms of random variables, one can deduce the following properties.
If X and Y are independent, then R(X|Y) = R(X). In particular, let 0 denote any deterministic variable (by Definition 1, any deterministic random variable is equivalent to the constant 0). Then R(X|0) = R(X). Thus, "absolute" (unconditional) randomness R(X) can be recovered as a special case of conditional randomness.
Proof. If X and Y are independent, then p X|y = p X for (almost) any y, so that R(X|Y) = E y R(X|y) = E y R(X) = R(X). In particular, X and 0 are always independent.
Remark 5 (Strict Concavity). A randomness measure R is "strictly concave" in p if Jensen's inequality (34) holds with equality only when $p_{X|y} = p_X$ for almost all y. This can be stated in terms of random variables as follows. For any strictly concave randomness measure R, (32) is strict unless independence holds: R(X|Y) < R(X) unless X and Y are independent. Example 6 (Strictly Concave Measures). As already seen in Example 4, the entropy H, all α-entropies $\varphi_\alpha(H_\alpha)$ for finite α > 0, and $R_2$ are strictly concave. In particular, for entropy, H(X|Y) = H(X) if and only if X and Y are independent. This is well known since the mutual information I(X; Y) = H(X) − H(X|Y) vanishes only in the case of independence [2] (p. 28). More generally, for α-entropy, $H_\alpha(X|Y) = H_\alpha(X)$ if and only if X and Y are independent.

Corollary 3 (Additional Knowledge Reduces Randomness). For any Y, Z,

$R(X|Y, Z) \leq R(X|Y)$.
Proof. Inequality (32) applied to X|y and Z for fixed y gives $R(X|y, Z) = \mathbb{E}_{z|y} R(p_{X|y,z}) \leq R(p_{X|y}) = R(X|y)$. Taking the expectation over Y of both sides yields the announced inequality. Conversely, letting Y = 0, one obtains R(X|Z) ≤ R(X), which is (32).

Corollary 4 (Data Processing Inequality: Processing Knowledge Increases Randomness).
For any Markov chain X − Y − Z (i.e., such that $p_{X|Y,Z} = p_{X|Y}$), one has the following:

$R(X|Y) \leq R(X|Z)$.

This property is equivalent to (32).

Example 7 (Data Processing Inequalities).
For entropy H, the property H(X|Y) ≤ H(X|Z) amounts to I(X; Z) ≤ I(X; Y), i.e., (post-)processing in the Markov chain X − Y − Z can never increase information (§ 2.8 in [2]). The data processing inequality for $P_e$ and G was already shown in [27].

Equalization (Minorization) via Robin Hood Operations
We now turn to another type of "mixing" of probability distributions, sometimes known as Robin Hood operations. To quote Arnold [28]: "When Robin and his merry hoods performed an operation in the woods they took from the rich and gave to the poor. The Robin Hood principle asserts that this decreases inequality (subject only to the obvious constraint that you don't take too much from the rich and turn them into poor.)" Definition 4 (Robin Hood operations [28]). An elementary "Robin Hood" operation p → q in P modifies only two probabilities $(p_i, p_j) \to (q_i, q_j)$ (i ≠ j) in such a way that $|p_i - p_j| \geq |q_i - q_j|$. A (general) "Robin Hood operation" results from a finite sequence of elementary Robin Hood operations.
Notice that in an elementary Robin Hood operation, the sum $p_i + p_j = q_i + q_j$ must remain the same, since p and q are probability distributions. The fact that $|p_i - p_j|$ decreases "increases equality", i.e., makes the probabilities more equal. This can be written as follows:

$(p_i, p_j) \to (q_i, q_j) = (p_i - \delta, p_j + \delta)$, (40)

where δ has the same sign as $p_i - p_j$, provided that $|\delta| \leq |p_i - p_j|$ ("you don't take too much from the rich and turn them into poor"). Setting $\bar\lambda = \delta / (p_i - p_j)$, (40) can be easily rewritten in the form

$(p_i, p_j) \to (\lambda p_i + \bar\lambda p_j,\ \bar\lambda p_i + \lambda p_j)$, (41)

where λ ≥ 0, $\bar\lambda$ ≥ 0 and λ + $\bar\lambda$ = 1.
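To see an elementary Robin Hood operation in action, the sketch below (ours) transfers an amount δ from a larger probability to a smaller one and checks that the Shannon entropy, the complementary index of coincidence $R_2$ and the guessing entropy all (weakly) increase, as the Schur concavity results of this section predict.

```python
import math

def shannon(p):
    return sum(x * math.log2(1.0 / x) for x in p if x > 0)

def r2(p):
    """Complementary index of coincidence R_2(p) = 1 - sum_k p_k^2."""
    return 1.0 - sum(x * x for x in p)

def guessing(p):
    return sum((k + 1) * x for k, x in enumerate(sorted(p, reverse=True)))

def robin_hood(p, i, j, delta):
    """Move delta from the richer p[i] to the poorer p[j], with 0 <= delta <= p[i] - p[j]."""
    q = list(p)
    q[i] -= delta
    q[j] += delta
    return q

p = [0.6, 0.25, 0.1, 0.05]
q = robin_hood(p, 0, 3, 0.2)                  # (0.6, 0.05) -> (0.4, 0.25)
for f in (shannon, r2, guessing):
    print(f.__name__, f(p), "->", f(q))       # each measure (weakly) increases
```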
Remark 6 (Increasing Probability Product). In any elementary Robin Hood operation $(p_i, p_j) \to (\lambda p_i + \bar\lambda p_j, \bar\lambda p_i + \lambda p_j)$, the product

$q_i q_j = (\lambda p_i + \bar\lambda p_j)(\bar\lambda p_i + \lambda p_j) \geq p_i p_j$

always increases, with equality if and only if either λ = 0 or 1, or else $p_i = p_j$. This equality condition boils down to $|p_i - p_j| = |q_i - q_j|$, that is, the unordered set $\{p_i, p_j\} = \{q_i, q_j\}$ is unchanged. Therefore, in any general Robin Hood operation, the product of all modified probabilities always increases, unless the probability distribution is unchanged (up to the order of the probabilities).
Remark 7 (Inverse Robin Hood Operation). One can also define a "Sheriff of Nottingham" operation as an inverse Robin Hood operation, resulting from a finite sequence of elementary Sheriff of Nottingham operations of the form (p i , p j ) → (q i , q j ), where |p i − p j | ≤ |q i − q j |. Increasing the quantity |p i − p j | "increases inequality", i.e., makes the probabilities more unequal.
Definition 5 (Equalization Relation). We write X ⪯ Y ("X is equalized by Y") if $p_Y$ can be obtained from $p_X$ by a Robin Hood operation. Such an operation "equalizes" $p_X$ in the sense that $p_Y$ is "more equal" or "more uniform" than $p_X$. In terms of distributions, we also write $p_X \preceq p_Y$. Equivalently, $p_X$ can be obtained from $p_Y$ by a Sheriff of Nottingham operation ($p_X$ is more unequal than $p_Y$). We may also write Y ⪰ X or $p_Y \succeq p_X$.
Remark 8 (Generalization). The above definitions hold verbatim for any vector or finitely many nonnegative numbers p k with a fixed sum s = ∑ k p k (not necessarily equal to one). In the following, we sometimes use the concept of "equalization" in this slightly more general context.
Remark 9 (Minorization). X ⪯ Y amounts to saying that $p_X$ "majorizes" $p_Y$ in majorization theory [28,29]. So, in fact, the equalization relation is a "minorization", i.e., the opposite of a majorization. Unfortunately, it is common in majorization theory to write "Y ≺ X" when X "majorizes" Y, instead of X ⪯ Y when Y is "more equal" than X. Arguably, the notation adopted in this paper is more convenient, since it follows the usual order relation between randomness measures such as entropy.
Also notice that the present approach avoids the use of Lorenz order [28,29] and focuses on the more intuitive Robin Hood operations.
Remark 10 (Partial Order). It is easily seen that ⪯ is a partial order on the set of (finitely valued) discrete random variables (considering two variables "equal" if they are equivalent in the sense of Definition 1). Indeed, reflexivity and transitivity are immediate from the definition, and antisymmetry is, e.g., an easy consequence of Remark 6: if X ⪯ Y and Y ⪯ X, then the product of all modified probabilities of X cannot increase under the two combined Robin Hood operations. Therefore, $p_Y$ must be the same as $p_X$ up to order; hence, X ≡ Y.
The following fundamental lemmas establish expressions for maximally equal and unequal distributions.
Lemma 1 (Maximal Equalization). Any sequence of M nonnegative numbers $p = (p_1, \ldots, p_M)$ with sum s is equalized by the constant sequence: $p \preceq (\frac{s}{M}, \ldots, \frac{s}{M})$. In particular, any probability distribution p is equalized by the uniform distribution u: p ⪯ u.

Proof. Suppose at least one component of p is ≠ s/M. Since the $p_k$'s sum to s, there must be at least one $p_i > \frac{s}{M}$ and one $p_j < \frac{s}{M}$. By a suitable Robin Hood operation on $(p_i, p_j)$, at least one of these two probabilities can be made equal to $\frac{s}{M}$, reducing the total number of components ≠ s/M. Continuing in this manner, we arrive at all components equal to $\frac{s}{M}$ after, at most, M − 1 Robin Hood operations.
Lemma 2 (Maximal Inequalization). Any sequence of nonnegative numbers with sum s, each component being at most P, equalizes the sequence having $q = \lfloor \frac{s}{P} \rfloor$ components equal to P, one component equal to the remainder $r = s - qP$, and all other components equal to zero. In particular, for any probability distribution p:

δ ⪯ p,

where δ is the (Dirac) probability distribution of any deterministic variable. (This can be written in terms of random variables as X ⪰ 0, since, by Definition 1, any deterministic random variable is equivalent to the constant 0.)

Proof. Suppose at least two components lie strictly between 0 and P: $0 < p_i, p_j < P$. By a suitable Sheriff of Nottingham operation on $(p_i, p_j)$, at least one of these two probabilities can be made either equal to 0 or equal to P, reducing the number of components lying inside (0, P). Continuing in this manner, we arrive at, at most, one component $r \in (0, P)$. Finally, the sum constraint implies $s = qP + r$ where 0 ≤ r < P, whence $q = \lfloor \frac{s}{P} \rfloor$.
Theorem 3 (Schur Concavity). If R satisfies Axioms 1 and 2, then R is Schur concave with respect to the equalization relation: X ⪯ Y implies R(X) ≤ R(Y).

Remark 11. Theorem 3 can also be given a physical interpretation similar to Corollary 1. In fact, from (41), any Robin Hood operation can be seen as mixing two permuted probability distributions, which have equal randomness. Such mixing can only increase randomness.

Example 8 (Entropy is Schur-Concave).
That the Shannon entropy is Schur concave is well known (§ 13.E in [29]). Similar to concavity (Example 5), this has a physical interpretation: a liquid mixed with another results in a "more disordered", "more chaotic" system, which corresponds to a "more equal" distribution and a higher entropy (§ 1.A.9 in [29]).
Remark 12 (ϕ-Schur Concavity). Schur concavity is not equivalent to concavity (even when assuming symmetry). In fact, with the notations of Remark 4, it is obvious that Schur concavity of R is equivalent to Schur concavity of ϕ(R), where ϕ : R + → R + is any increasing function. In other words, while "ϕ-concavity" (in the sense of Remark 4) is not the same as concavity, there is no need to introduce "ϕ-Schur concavity", since it is always equivalent to Schur concavity.

Remark 13 (Strict Schur Concavity).
A randomness measure R is "strictly Schur concave" if the inequality R(X) ≤ R(Y) for X Y holds with equality R(X) = R(Y) if and only if X ≡ Y. If R(p) is strictly concave (see Remark 5), then equality holds in (49) if and only if either λ = 0 or 1, or else p i = p j . Either of these conditions means that {p i , p j } is unchanged. Therefore, in this case, R is also strictly Schur concave.
Remark 6 states that the product of nonzero probabilities is strictly Schur-concave.
Example 9 (Strictly Schur Concave Measures). The randomness measures presented in Section 1 are (Schur) concave, but not all of them are strictly Schur concave: • Not only is the Shannon entropy H Schur concave (Example 8), but, as seen in Example 6, H, as well as all α-entropies $\varphi_\alpha(H_\alpha)$ for finite α > 0 and $R_2$, are strictly concave and, hence, strictly Schur concave; • As also seen in Example 6, the guessing entropy G, or, more generally, the ρ-guessing entropy $G_\rho$, is not strictly concave in p. However, G and $G_\rho$ are strictly Schur concave by the following argument. It suffices to show that any elementary Robin Hood operation (40) with δ ≠ 0 that actually modifies the distribution strictly increases $G_\rho$. One may always choose δ as small as one pleases, since any elementary Robin Hood operation on $(p_i, p_j)$ can be seen as resulting from other ones on $(p_i, p_j)$ with smaller δ. One chooses δ small enough such that the elementary Robin Hood operation does not change the order of the probabilities in p. With the notations of Section 1.2, assuming, for example, that $p_i = p_{(k)} > p_{(l)} = p_j$ with k < l, the operation changes $G_\rho$ by $\delta(l^\rho - k^\rho) > 0$, so that $G_\rho$ strictly increases.
Corollary 5 (Minimal and Maximal Randomness). For any M-ary random variable X,

R(0) ≤ R(X) ≤ R(u).

In other words, minimal randomness is achieved for X = 0 (for any deterministic variable 0) and maximal randomness is achieved for uniformly distributed X.

Proof. From Lemmas 1 and 2, one obtains δ ⪯ $p_X$ ⪯ u. The result follows by Theorem 3.

Remark 14 (Zero Randomness).
Without loss of generality, we may always impose that R(0) = 0 by considering R(X) − R(0) in place of R(X). Then, zero randomness is achieved when X ≡ 0. It is easily checked from the expressions given in Section 1 that this convention holds for H, H α , log G, log G ρ , P e , P m e , R 2 and R. To simplify notations in the remainder of this paper, we assume that the zero randomness convention R(0) = 0 always holds.
Example 10 (Distribution Achieving Zero Randomness). By Remark 13, if R is strictly Schur concave, zero randomness is achieved only when X ≡ 0: • As seen in Example 9, this is the case for H, $H_\alpha$, log G, log $G_\rho$ and $R_2$. In particular, we recover the well-known property that zero entropy is achieved only when X is deterministic; • Although the error probability is not strictly Schur concave, one can check directly that $P_e(p) = 0$ if and only if $p_{(1)} = 1$, which corresponds to the δ distribution; • Similarly, from the discussion in Section 1.7, R(p) = 0 corresponds to the maximum value $\Delta(p, u) = 1 - \frac{1}{M}$, attained for $K = |T_+| = 1$ and $P(T_+) = 1$, which, again, corresponds to a δ distribution. To summarize, all quantities H, $H_\alpha$, log G, log $G_\rho$, $P_e$, $R_2$ and R satisfy (52). This observation is also an easy consequence of (10) or (11).

Example 11 (Distributions Achieving Maximal Randomness). The maximal randomness $R_M \triangleq R(u)$ is achieved for the uniform distribution u: • max H = H(u) = log M and, more generally, max $H_\alpha$ = $H_\alpha$(u) = log M; • max G = G(u) = $\frac{M+1}{2}$ and, more generally, max $G_\rho$ = $G_\rho$(u) = $\frac{1^\rho + 2^\rho + \cdots + M^\rho}{M}$, etc. Again, since G and $G_\rho$ are strictly Schur concave, their maximum is achieved if and only if X is uniformly distributed; • max $P_e$ = $P_e$(u) = 1 − $\frac{1}{M}$, and, more generally, max $P_e^m$ = $P_e^m$(u) = 1 − $\frac{m}{M}$. The maximum of $P_e(X)$ is achieved if and only if the maximum probability $p_{(1)}$ equals $\frac{1}{M}$, which implies that X is uniformly distributed; • the maximum value of $R_2$ and of R (see (12) and (18)) is achieved if and only if p = u. To summarize, for all quantities H, $H_\alpha$, log G, log $G_\rho$, $P_e$, $R_2$ and R, the unique maximizing distribution is the uniform distribution. Notice that, as expected, each of these maximum values increases with M.

Corollary 6 (Deterministic Data Processing Inequality: Processing Reduces Randomness).
For any deterministic function f, R(f(X)) ≤ R(X).

Proof.
Consider preimages by f of values y = f (x). The application of f can be seen as resulting from a sequence of elementary operations, each of which puts together two distinct values of x (say, x i and x j ) in the same preimage of some y. In terms of probability distributions, this amounts to a Sheriff of Nottingham operation (p i , p j ) → (p i + p j , 0).
Overall, one has f(X) ⪯ X. The result then follows by Schur concavity (Theorem 3).

Example 12. The fact that H(f(X)) ≤ H(X) is well known (see Ex. 2.4 in [2]). This can also be seen from the data processing inequality of Corollary 4 by noting that, since X − f(X) − f(X) is trivially a Markov chain, H(f(X)) = I(f(X); f(X)) ≤ I(X; f(X)) ≤ H(X).

Remark 16 (Lattices of Information and Majorization).
Shannon [26] defined the order relation X ≤ Y if X = g(Y) a.s. and showed that it satisfies the properties of a lattice, called the "information lattice" (see [30] for detailed proofs). With this notation, (53) writes as

X ≤ Y ⟹ R(X) ≤ R(Y).

Majorization (or the order relation X ⪯ Y) also satisfies the properties of a lattice, the "majorization lattice", as studied in [31]. From the proof of Corollary 6, one actually obtains the following:

X ≤ Y ⟹ X ⪯ Y.

Therefore, the majorization lattice is denser than the information lattice.
Corollary 7 (Zero Conditional Randomness). If R satisfies (52), then

R(X|Y) = 0 if and only if X is a deterministic function of Y a.s. (57)

Proof. Since R(X|y) ≥ 0 for any y, R(X|Y) = $\mathbb{E}_y$ R(X|y) = 0 if and only if R(X|y) = 0 for (almost) all y. By (52), this implies that X is deterministic given Y = y, i.e., X is a deterministic function of Y.
Example 13. From Example 10, (57) is true for H, $H_\alpha$, log G, log $G_\rho$, $P_e$, $R_2$ and R. • H(X|Y) = 0 if and only if X is a deterministic function of Y a.s.: knowledge of Y removes the equivocation only when X is fully determined by Y; • G(X|Y) = 1 if and only if X is a deterministic function of Y a.s., which is intuitively clear: knowing Y allows one to fully determine X in only one guess; • $P_e(X|Y)$ = 0 if and only if X is a deterministic function of Y a.s.: knowing Y allows one to estimate X without error only when X is fully determined by Y.

Fano and Reverse-Fano Inequalities
Definition 6 (Fano-type inequalities). A "Fano inequality" (resp. "reverse Fano inequality") for R(X) gives an upper (resp. lower) bound of R(X) as a function of the probability of error P e (X). Fano and reverse-Fano inequalities are similarly defined for conditional randomness R(X|Y), lower or upper bounded as a function of P e (X|Y).
In this section, we establish optimal Fano and reverse-Fano inequalities, in which both the upper and the lower bounds are tight. In other words, we determine the maximum and minimum of R for fixed $P_e$. The exact locus of the region $p \in P_M \mapsto (P_e(p), R(p)) = (P_e(X), R(X))$, as well as the exact locus of all attainable values of $(P_e(X|Y), R(X|Y))$, is determined analytically for fixed M, based on the following lemma.
Lemma 3 (Extremal Distributions for a Given Probability of Error). For any $p \in P_M$ with given $p_{(1)} = P_s = 1 - P_e$,

$(\underbrace{P_s, \ldots, P_s}_{\lfloor 1/P_s \rfloor}, 1 - \lfloor \tfrac{1}{P_s} \rfloor P_s, 0, \ldots, 0) \;\preceq\; p \;\preceq\; (P_s, \tfrac{P_e}{M-1}, \ldots, \tfrac{P_e}{M-1})$. (58)

Proof. On the left side, apply Lemma 2 with $P = \max p = p_{(1)} = P_s$ and s = 1. On the right side, with $p_{(1)} = P_s$ being fixed, apply Lemma 1 to the M − 1 remaining probabilities $(p_{(2)}, \ldots, p_{(M)})$, which sum to $s = 1 - P_s = P_e$.
Theorem 4 (Optimal Fano and Reverse-Fano Inequalities for R(X)). The optimal Fano and reverse-Fano inequalities for the randomness measure R(X) of any M-ary random variable X in terms of $P_e = P_e(X)$ are given analytically by the following:

$R\big(\underbrace{P_s, \ldots, P_s}_{\lfloor 1/P_s \rfloor}, 1 - \lfloor \tfrac{1}{P_s} \rfloor P_s, 0, \ldots, 0\big) \;\leq\; R(X) \;\leq\; R\big(P_s, \tfrac{P_e}{M-1}, \ldots, \tfrac{P_e}{M-1}\big)$. (59)

Proof. The proof is immediate from Lemma 3 and Theorem 3. The Fano and reverse-Fano bounds are achieved by the distributions on the left and right sides of (58), respectively.
A similar proof holding for any Schur concave R(X) was already given by Vajda and Vašek [17].
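As a numerical sanity check of Theorem 4 (our addition, using $R_2$ for concreteness and the extremal distributions of (58) as written above), the sketch below samples random M-ary distributions and verifies that $R_2$ always lies between the reverse-Fano and Fano bounds for the same $P_e$.

```python
import math, random

def r2(p):
    return 1.0 - sum(x * x for x in p)

def fano_dist(Pe, M):
    """(1 - Pe, Pe/(M-1), ..., Pe/(M-1)): the 'most equal' distribution with error Pe."""
    return [1.0 - Pe] + [Pe / (M - 1)] * (M - 1)

def reverse_fano_dist(Pe, M):
    """(Ps, ..., Ps, 1 - k*Ps, 0, ...) with k = floor(1/Ps): the 'most unequal' one."""
    Ps = 1.0 - Pe
    k = int(math.floor(1.0 / Ps))
    dist = [Ps] * k + [1.0 - k * Ps]
    return dist + [0.0] * max(0, M - len(dist))

random.seed(0)
M = 5
for _ in range(1000):
    p = [random.random() for _ in range(M)]
    s = sum(p)
    p = [x / s for x in p]
    Pe = 1.0 - max(p)
    assert r2(reverse_fano_dist(Pe, M)) - 1e-12 <= r2(p) <= r2(fano_dist(Pe, M)) + 1e-12
print("Theorem 4 bounds hold for R_2 on 1000 random distributions")
```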
Assuming the zero randomness convention for simplicity (Remark 14), Fano and reverse-Fano bounds can be qualitatively described as follows. They are illustrated in Figure 2.
Proposition 1 (Shape of the Fano Bound). The Fano bound $P_e \mapsto R(1 - P_e, \frac{P_e}{M-1}, \ldots, \frac{P_e}{M-1})$ is continuous in $P_e$ > 0, concave in $P_e$, and increases from 0 (for $P_e$ = 0) to $R_M$ (for $P_e = 1 - \frac{1}{M}$), where $R_M$ denotes maximal randomness (Remark 15). For any fixed $P_e$, it also increases with M.
Proof. Since R(p) ≥ 0 is concave over $P_M$ (Theorem 2), it is continuous on the interior of $P_M$. Since $P_e \mapsto (1 - P_e, \frac{P_e}{M-1}, \ldots, \frac{P_e}{M-1})$ is linear, the Fano bound results from the composition of a linear map and a concave function. It is, therefore, concave, and continuous at every $P_e$ > 0. It is clear from Lemma 3, or using a suitable Robin Hood operation, that the maximizing distribution becomes more equal as $P_e$ increases. Therefore, the Fano bound increases with $P_e$. The maximum is attained for $P_e = 1 - \frac{1}{M}$, which corresponds to the uniform distribution achieving maximum randomness $R_M$. For fixed $P_e$, it is also clear, using a suitable Robin Hood operation, that the maximizing distribution becomes more equal if M is increased by one. Therefore, the Fano bound also increases with M.
Proposition 2 (Shape of the Reverse-Fano Bound). The reverse-Fano bound is continuous and increasing in $P_e$, and on each interval $1 - \frac{1}{k} \leq P_e \leq 1 - \frac{1}{k+1}$ (k = 1, 2, . . . , M − 1) it is concave and increases from $R_k$ to $R_{k+1}$, where $R_k \triangleq R(\frac{1}{k}, \ldots, \frac{1}{k})$.

Proof. For any k ∈ {1, 2, . . . , M}, the reverse-Fano bound at $P_e = 1 - \frac{1}{k}$ is $R(\frac{1}{k}, \ldots, \frac{1}{k}) = R_k$. It suffices to prove that the reverse-Fano bound is continuous, concave and increasing on each interval $1 - \frac{1}{k} \leq P_e \leq 1 - \frac{1}{k+1}$. On this interval, the reverse-Fano bound is $R(1 - P_e, \ldots, 1 - P_e, 1 - k(1 - P_e))$. This results from the composition of a linear map and a concave function R(p), which is continuous in the interior of $P_k$. Therefore, it is concave in $P_e$, and continuous on the whole closed interval $[1 - \frac{1}{k}, 1 - \frac{1}{k+1}]$. Finally, it is clear from Lemma 2 or using a suitable Robin Hood operation that $(1 - P_e, \ldots, 1 - P_e, 1 - k(1 - P_e))$ becomes more equal as $P_e$ increases. Therefore, each piece increases from $R_k$ to $R_{k+1}$.
Remark 17 (Independence of the Reverse-Fano Bound from the Alphabet Size). Contrary to the (upper) Fano bound, the (lower) reverse-Fano bound is achieved by a probability distribution that does not depend on M. As a result, when the definition of R does not itself explicitly depend on M (as is the case for H, $H_\alpha$, G, $G_\rho$, $P_e$, $P_e^m$, $R_2$), the reverse-Fano bound is the same for all M, except that it is truncated at $P_e = 1 - \frac{1}{M}$, at which point it meets the (upper) Fano bound (see Figure 2).

Theorem 5 (Optimal Fano and Reverse-Fano Inequalities for R(X|Y)). The optimal Fano and reverse-Fano inequalities for the randomness measure R(X|Y) of any M-ary random variable X in terms of $P_e = P_e(X|Y)$ are given analytically by the following:

$\big(\lceil \tfrac{1}{P_s} \rceil^+ P_s - 1\big)\, \lfloor \tfrac{1}{P_s} \rfloor\, R_{\lfloor 1/P_s \rfloor} + \big(1 - \lfloor \tfrac{1}{P_s} \rfloor P_s\big)\, \lceil \tfrac{1}{P_s} \rceil^+\, R_{\lceil 1/P_s \rceil^+} \;\leq\; R(X|Y) \;\leq\; R\big(P_s, \tfrac{P_e}{M-1}, \ldots, \tfrac{P_e}{M-1}\big)$, (62)

where we have noted $\lceil x \rceil^+ \triangleq \lfloor x \rfloor + 1$ ($\lceil x \rceil^+$ is the usual ceiling function $\lceil x \rceil$, unless x is an integer), $P_s = 1 - P_e$ and $R_k = R(\frac{1}{k}, \ldots, \frac{1}{k})$.
Proof. The Fano region for X|Y = y, i.e., the locus of the points $(P_e(p_{X|y}), R(p_{X|y}))$ for each Y = y, is given by the inequalities (59). From the definition of conditional randomness, the exact locus of points $(P_e(X|Y), R(X|Y)) = \mathbb{E}_y (P_e(p_{X|y}), R(p_{X|y}))$ is composed of all convex combinations of points in the Fano region, that is, its convex envelope. The extreme points $(P_e = 0, R = R_1 = 0)$ and $(P_e = 1 - \frac{1}{M}, R = R_M)$ are unchanged. The upper Fano bound joining these two extreme points is concave by Proposition 1 and, therefore, already belongs to the convex envelope. It follows that the upper Fano bound in (59) remains the same, as given in (62). However, the lower reverse-Fano bound for R(X|Y) is the convex hull of the lower bound in (59). By Proposition 2, it is easily seen to be the piecewise linear curve joining all singular points $(P_e = 1 - \frac{1}{k}, R = R_k)$ for k = 1, 2, . . . , M (see Figure 2). A closed-form expression is obtained by noting that, when $\lfloor \frac{1}{1-P_e} \rfloor = k$, that is, $1 - \frac{1}{k} \leq P_e < 1 - \frac{1}{k+1}$, the equation of the straight line joining $(1 - \frac{1}{k}, R_k)$ and $(1 - \frac{1}{k+1}, R_{k+1})$ is $((k+1)P_s - 1)\, k R_k + (1 - k P_s)(k+1) R_{k+1}$. Plugging $k = \lfloor \frac{1}{P_s} \rfloor$ and $k + 1 = \lceil \frac{1}{P_s} \rceil^+$ gives the lower reverse-Fano bound in (62).
Remark 18 (Shape of the Fano and Reverse-Fano Bounds for Conditional Randomness). By Theorem 5, the Fano inequality for the conditional version R(X|Y) takes the same form as for R(X). In particular, it is increasing and concave in $P_e(X|Y)$. Compared to that for R(X), the reverse-Fano bound for R(X|Y), however, is a piecewise linear convex hull. Clearly, it is still continuous and increasing in $P_e(X|Y)$, as illustrated in Figure 2. If the corresponding sequence of slopes $k(k+1)(R_{k+1} - R_k)$ is increasing in k, then the reverse-Fano bound for R(X|Y) is also convex in $P_e(X|Y)$.
When ϕ(R) is considered in place of R, where ϕ is an increasing function (in particular, to define conditional randomness as in Remark 4), then Theorem 4 and the (upper) Fano bound of Theorem 5 can be directly applied to R. When ϕ is nonlinear, this may result in (upper) Fano bounds that are no longer concave. However, to obtain the reverse-Fano inequalities for R(X|Y), one has to apply Theorem 5 to ϕ(R(X|Y)) and then apply the inverse function $\varphi^{-1}$ to the left side of (62). When ϕ is nonlinear, the resulting "reverse-Fano bound" for R(X|Y) will no longer be piecewise linear. This is the case, e.g., for conditional α-entropies (see Example 15 below).

Example 14 (Fano and reverse-Fano Inequalities for Entropy).
For the Shannon entropy, the optimal Fano inequality (right sides of (59) and (62)) takes the form:

$H(X) \leq h(P_e) + P_e \log(M-1)$, (63)

$H(X|Y) \leq h(P_e) + P_e \log(M-1)$, (64)

where $P_e$ denotes $P_e(X)$ in (63) and $P_e(X|Y)$ in (64), and $h(P_e) = P_e \log \frac{1}{P_e} + (1 - P_e) \log \frac{1}{1 - P_e}$ is the binary entropy function. Inequality (64) is the original Fano inequality established in 1952 [22], which has become ubiquitous in information theory and in statistics to relate equivocation to probability of error. Inequality (63) trivially follows, in the case of blind estimation (Y ≡ 0). That these inequalities are sharp is well known (see, e.g., [32]).
The optimal reverse-Fano inequality (left sides of (59) and (62) with $R_k = \log k$) takes the form:

$H(X) \geq \lfloor \tfrac{1}{P_s} \rfloor P_s \log \tfrac{1}{P_s} + \big(1 - \lfloor \tfrac{1}{P_s} \rfloor P_s\big) \log \tfrac{1}{1 - \lfloor 1/P_s \rfloor P_s}$,

$H(X|Y) \geq \big(\lceil \tfrac{1}{P_s} \rceil^+ P_s - 1\big) \lfloor \tfrac{1}{P_s} \rfloor \log \lfloor \tfrac{1}{P_s} \rfloor + \big(1 - \lfloor \tfrac{1}{P_s} \rfloor P_s\big) \lceil \tfrac{1}{P_s} \rceil^+ \log \lceil \tfrac{1}{P_s} \rceil^+$.

These two lower bounds were first derived by Kovalevsky [33] in 1965. Optimality was already proven in [32].
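Both bounds of Example 14 are straightforward to evaluate. The sketch below (ours) prints Fano's upper bound $h(P_e) + P_e \log(M-1)$ and the piecewise-linear Kovalevsky-type lower bound for H(X|Y) written above; as expected, the two meet at log M when $P_e = 1 - 1/M$.

```python
import math

def h(x):
    """Binary entropy function (bits)."""
    if x in (0.0, 1.0):
        return 0.0
    return x * math.log2(1.0 / x) + (1.0 - x) * math.log2(1.0 / (1.0 - x))

def fano_upper(Pe, M):
    """Fano's bound h(Pe) + Pe * log(M - 1) on H(X|Y)."""
    return h(Pe) + Pe * math.log2(M - 1)

def kovalevsky_lower(Pe):
    """Piecewise-linear lower bound on H(X|Y) joining the points (1 - 1/k, log k)."""
    Ps = 1.0 - Pe
    k = math.floor(1.0 / Ps)
    return ((k + 1) * Ps - 1.0) * k * math.log2(k) + (1.0 - k * Ps) * (k + 1) * math.log2(k + 1)

M = 8
for Pe in (0.1, 0.3, 0.5, 0.7, 1 - 1 / M):
    print(Pe, kovalevsky_lower(Pe), fano_upper(Pe, M))   # the two meet at log2(M) = 3 when Pe = 7/8
```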
Example 16 (Fano and Reverse-Fano Inequalities for the Non-Collision Probability $R_2$). Theorem 4 readily gives the optimal Fano region for $R_2(X)$:

$1 - \lfloor \tfrac{1}{P_s} \rfloor P_s^2 - \big(1 - \lfloor \tfrac{1}{P_s} \rfloor P_s\big)^2 \;\leq\; R_2(X) \;\leq\; 1 - P_s^2 - \tfrac{P_e^2}{M-1}$.

This can also be easily deduced from (69) and (71) for α = 2 via (4). Fano and reverse-Fano inequalities for $R_2(X)$ were first stated without proof in [7]. The optimal Fano region for $R_2(X|Y)$, however, cannot be directly deduced from that of $H_2(X|Y)$, because a different kind of average over Y is involved. However, a direct application of Theorem 5 with $R_k = 1 - \frac{1}{k}$ gives the optimal Fano region:

$P_e(X|Y) \;\leq\; R_2(X|Y) \;\leq\; 1 - P_s^2 - \tfrac{P_e^2}{M-1}$,

where $P_e = P_e(X|Y)$ and $P_s = 1 - P_e$. Remarkably, the reverse-Fano inequality has a very simple form $R_2(X|Y) \geq P_e(X|Y)$ (see Figure 3).

Example 17 (Fano and Reverse-Fano Inequalities for Guessing Entropy). For the guessing entropy G, the Fano inequality is written as shown:

$G(X) \leq 1 + \tfrac{M}{2} P_e$.

One obtains similarly $G_2 \leq 1 + \frac{M}{3}(M + \frac{5}{2})P_e$, $G_3 \leq 1 + \frac{M}{4}(M^2 + 3M + 4)P_e$, etc. Due to the fact that $G_\rho(p)$ is linear in p, for fixed $\lfloor \frac{1}{1-P_e} \rfloor = k$, the reverse-Fano bound for $G_\rho(X)$ is linear in $P_e$. It follows that the bound is already piecewise linear, with a sequence of slopes $s_k = k(k+1)(R_{k+1} - R_k) = k(1^\rho + \cdots + (k+1)^\rho) - (k+1)(1^\rho + \cdots + k^\rho)$, which is easily seen to be increasing. Therefore, the (lower) reverse-Fano bound is piecewise linear and convex and coincides with its convex hull. In other words, the reverse-Fano inequality for $G_\rho(X)$ and $G_\rho(X|Y)$ takes the same form, easily determined from the left side of either (59) or (62):

$G_\rho(X|Y) \;\geq\; \big(\lceil \tfrac{1}{P_s} \rceil^+ P_s - 1\big) \lfloor \tfrac{1}{P_s} \rfloor\, R_{\lfloor 1/P_s \rfloor} + \big(1 - \lfloor \tfrac{1}{P_s} \rfloor P_s\big) \lceil \tfrac{1}{P_s} \rceil^+\, R_{\lceil 1/P_s \rceil^+}$, where $R_k = \tfrac{1^\rho + \cdots + k^\rho}{k}$.
As already shown in [27] for ρ = 1, the use of Schur concavity greatly simplifies the derivation. Figure 4 shows some optimal Fano regions for $H_{1/2}(X)$, H(X), $H_2(X)$ and log G(X).
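A quick numerical check (ours) of the guessing-entropy Fano bound $G \leq 1 + \frac{M}{2} P_e$ of Example 17, on randomly drawn distributions:

```python
import random

def guessing(p):
    return sum((k + 1) * x for k, x in enumerate(sorted(p, reverse=True)))

random.seed(1)
M = 6
for _ in range(1000):
    p = [random.random() for _ in range(M)]
    s = sum(p)
    p = [x / s for x in p]
    Pe = 1.0 - max(p)
    assert guessing(p) <= 1.0 + 0.5 * M * Pe + 1e-12
print("G <= 1 + (M/2) * Pe holds on 1000 random distributions")
```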

Pinsker and Reverse-Pinsker Inequalities
Pinsker and reverse-Pinsker inequalities relate some divergence measure (e.g., $D(p\|q)$ or $D_\alpha(p\|q)$) between two distributions to their statistical distance Δ(p, q). For simplicity, even though we restrict ourselves to the divergence or distance to the uniform distribution q = u, we still use the generic name "Pinsker inequalities". Following the discussion in Section 1.6, we adopt the following.
Definition 7 (Pinsker-type inequalities). A "Pinsker inequality" (resp. "reverse-Pinsker inequality") for R(X) gives an upper (resp. lower) bound of R(X) as a function of the statistical randomness R(X) (or statistical distance ∆(p, u)). Pinsker and reverse-Pinsker inequalities are similarly defined for conditional randomness R(X|Y), lower or upper bounded as a function of R(X|Y).

Example 20 (Pinsker and reverse-Pinsker Inequalities for Guessing Entropy).
For the guessing entropy, the optimal Pinsker bounds of Theorem 6 are easily determined. A notable property is that the optimal upper bound does not depend on the value of K. The upper bound is mentioned by Pliam in [4] as an upper bound on Δ(p, u). The methodology of this paper, based on Schur concavity, greatly simplifies the derivation. For the conditional guessing entropy G(X|Y), observe that the upper Pinsker bound for G(X) is linear (hence, concave) in R and that (91) is of the form $r_k = 1 + \frac{k(k+1)}{2M}$, where the sequence $r_k - r_{k-1} = \frac{k}{M}$ is increasing. Therefore, by Theorem 7, the optimal Pinsker region for the conditional guessing entropy G(X|Y) is the same as for G(X) (see (103)). Figure 7 shows some optimal Pinsker regions for $H_{1/2}(X)$, H(X), $H_2(X)$ and log G(X).

Example 21 (Statistical Randomness vs. Probability of Error). As a final example, we present the optimal regions of statistical randomness R vs. probability of error $P_e$. In this case, observe the following from Definitions 6 and 7: • The (optimal) Fano inequality for R is the same as the (optimal) reverse-Pinsker inequality for $P_e$; • The (optimal) Pinsker inequality for $P_e$ is the same as the (optimal) reverse-Fano inequality for R. Letting R = R(X) and $P_s = P_s(X)$, Theorem 4 readily gives the optimal Fano and reverse-Fano inequalities (104), and the corresponding optimal Pinsker and reverse-Pinsker inequalities (105) follow similarly. Likewise, letting R = R(X|Y) and $P_s = P_s(X|Y)$, Theorem 5 with $R_k = \frac{k-1}{M}$ readily gives the optimal Fano and reverse-Fano inequalities (106), while Theorem 7 gives the optimal Pinsker and reverse-Pinsker inequalities (107), in which the upper bound is the piecewise linear function connecting the points $(P_e = 1 - \frac{1}{k+1}, R = \frac{k}{M})$ for k = 0, 1, . . . , M − 1.
From the above observation, the left (reverse-Fano) inequality in (104) is equivalent to the right (Pinsker) inequality in (105), and, similarly, the left (reverse-Fano) inequality in (106) is equivalent to the right (Pinsker) inequality in (107), which do not seem obvious from the expressions above. The optimal Fano/Pinsker region is illustrated in Figure 8.
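Without reproducing the exact expressions (104)-(107), the joint range of $(P_e, R)$ can be explored empirically. The sketch below (ours) checks, on random distributions, one simple relation in this region, namely $R(X) \leq P_e(X)$, which is easily verified directly since $\Delta(p, u) \geq p_{(1)} - \frac{1}{M}$.

```python
import random

def stat_randomness(p):
    """R(p) = 1 - 1/M - Delta(p, u)."""
    M = len(p)
    return 1.0 - 1.0 / M - 0.5 * sum(abs(x - 1.0 / M) for x in p)

random.seed(2)
M = 5
for _ in range(1000):
    p = [random.random() for _ in range(M)]
    s = sum(p)
    p = [x / s for x in p]
    Pe = 1.0 - max(p)
    assert stat_randomness(p) <= Pe + 1e-12
print("R(X) <= P_e(X) holds on 1000 random distributions")
```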

Conclusions and Perspectives
We have derived optimal regions for randomness measures compared to either the error probability or the statistical randomness (or the total variation distance). One perspective is to provide similar optimal regions relating two arbitrary randomness measures. Of course, by (6), Fano regions such as $H_\alpha$ vs. $P_e$ can be trivially reinterpreted as regions $H_\alpha$ vs. $H_\infty$ (see, e.g., Figure 2 in [42] for the region H vs. $H_\infty$). Using some more involved derivations, the authors of [46] have investigated the optimal region H vs. $H_2$ and, more generally, the authors of [47,48] have investigated the optimal regions between two α-entropies of different orders. It would be desirable to apply the methods of this paper to the more general case of two arbitrary randomness measures. In particular, the determination of the optimal regions $H_\alpha$ vs. $G_\rho$ would allow one to assess the sharpness of the "Massey-type" inequalities of [5].
Catalytic majorization [49] was found to be a necessary and sufficient condition for the increase of all Rényi entropies (including the ones with negative parameters α). It would be interesting to find similar necessary and sufficient conditions for other types of randomness measures.
It is also possible to generalize the notion of entropies and other randomness quantities with respect to an arbitrary dominating measure instead of the counting measure, e.g., to extend the considerations of this paper from the discrete case to the continuous case. The relevant notion of majorization in this more general context is studied, e.g., in [50].
Concerning Pinsker regions, another perspective is to extend the results of this paper to the more general case of Pinsker and reverse-Pinsker inequalities, relating "distances" of two arbitrary distributions p, q by removing the restriction that q = u is uniform. Some results in this direction appear in [38,[51][52][53][54][55][56][57].
Other types of inequalities on randomness measures with different constraints can also be obtained via majorization theory [43,44].

Conflicts of Interest:
The author declares no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

X ∼ p: X follows the probability distribution p
H = H_1: Shannon entropy
H_2: collision entropy
H_∞: min-entropy
H_α: α-entropy
G = G_1: guessing entropy
G_ρ: ρ-guessing moment
P_e: probability of error
P_e^m: error probability of order m
P_s = 1 − P_e: probability of success
R = R_1: statistical randomness
Δ = 1 − 1/M − R: statistical distance to the uniform
R_2: complementary index of coincidence
R: any randomness measure