What Is Randomness? The Interplay between Alpha Entropies, Total Variation and Guessing

Abstract: In many areas of computer science, it is of primary importance to assess the randomness of a certain variable X. Many different criteria can be used to evaluate randomness, possibly after observing some disclosed data. A "sufficiently random" X is often described as "entropic". Indeed, Shannon's entropy is known to provide a resistance criterion against modeling attacks. More generally, one may consider the Rényi α-entropy, where Shannon's entropy, collision entropy and min-entropy are recovered as the particular cases α = 1, 2 and +∞, respectively. Guesswork, or guessing entropy, is also of great interest in relation to α-entropy. On the other hand, many applications rely instead on the "statistical distance", also known as "total variation" distance, to the uniform distribution. This criterion is particularly important because a very small distance ensures that no statistical test can effectively distinguish between the actual distribution and the uniform distribution. In this paper, we establish optimal lower and upper bounds between α-entropy and guessing entropy on one hand, and error probability and total variation distance to the uniform distribution on the other hand. In this context, it turns out that the best known "Pinsker inequality" and recent "reverse Pinsker inequalities" are not necessarily optimal. We recover or improve previous Fano-type and Pinsker-type inequalities used in several applications.


Some Well-Known "Randomness" Measures
It is of primary importance to assess the "randomness" of a certain random variable X, which represents some identifier, cryptographic key, signature or any type of intended secret. Applications include pseudo-random bit generators [1], general cipher security [2], randomness extractors [3] and hash functions ([4], Chapter 8), physically unclonable functions [5], and true random number generators [6], to list but a few. In all of these examples, X takes finitely many values x ∈ {x_1, x_2, ..., x_M} with probabilities p_X(x) = P(X = x). In this paper, it will be convenient to denote by

$p_{(1)} \ge p_{(2)} \ge \cdots \ge p_{(M)}$  (1)

any rearrangement of the probabilities p_X(x) in descending order (where ties can be resolved arbitrarily): p_(1) = max_x p_X(x) is the maximum probability, p_(2) the second maximum, etc.
In addition, we need to define the cumulative sums

$P_{(k)} \triangleq p_{(1)} + \cdots + p_{(k)} \qquad (k = 1, 2, \ldots, M)$  (2)

where, in particular, P_(M) = 1. Many different criteria can be used to evaluate the randomness of X or of its distribution p_X, depending on the type of attack that can be carried out to recover the whole or part of the secret, possibly after observing disclosed data Y. The observed random variable Y can be any random variable and is not necessarily discrete. The conditional probability distribution of X having observed Y = y is denoted by p_{X|y} to distinguish it from the unconditional distribution p_X. To simplify the notation, we write

$p(x) \triangleq p_X(x) = \mathbb{P}(X = x)$  (3)
$p(x|y) \triangleq p_{X|y}(x) = \mathbb{P}(X = x \mid Y = y).$  (4)
A "sufficiently random" secret is often described as "entropic" in the literature. Indeed, Shannon's entropy

$H(X) = \sum_x p(x) \log \frac{1}{p(x)}$  (5)

(with the convention 0 log(1/0) = 0) is known to provide a resistance criterion against modeling attacks. It was introduced by Shannon as a measure of the uncertainty of X. The average entropy after having observed Y is the usual conditional entropy

$H(X|Y) \triangleq \mathbb{E}_y\, H(X|y).$  (6)

A well-known generalization of Shannon's entropy is the Rényi entropy of order α > 0, or α-entropy,

$H_\alpha(X) \triangleq \frac{1}{1-\alpha} \log \sum_x p^\alpha(x)$  (7)

where, by continuity as α → 1, the 1-entropy H_1(X) = H(X) is Shannon's entropy. One may consider many different definitions of conditional α-entropy [7], but for many applications the preferred choice is Arimoto's definition [8-10]

$H_\alpha(X|Y) \triangleq \frac{\alpha}{1-\alpha} \log \mathbb{E}_y \|p_{X|y}\|_\alpha$  (8)

where the expectation over Y is taken over the "α-norm" $\|p\|_\alpha = \left(\sum_x p^\alpha(x)\right)^{1/\alpha}$ inside the logarithm. (Strictly speaking, $\|\cdot\|_\alpha$ is not a norm when α < 1.) For α = 2, the collision entropy

$H_2(X) = -\log \sum_x p^2(x) = -\log \mathbb{P}(X = X')$  (9)

where X′ is an independent copy of X, is often used to ensure security against collision attacks. Perhaps one of the most popular criteria is the min-entropy, defined when α → +∞ as

$H_\infty(X) = -\log \max_x p(x) = -\log p_{(1)}$  (10)

whose maximization is equivalent to a probability criterion to ensure a worst-case security level. Arimoto's conditional ∞-entropy takes the form

$H_\infty(X|Y) = -\log \mathbb{E}_y \max_x p(x|y) = -\log\left(1 - P_e(X|Y)\right)$  (11)

where we have noted

$P_e(X) = P_e(p) \triangleq 1 - p_{(1)}$  (12)
$P_e(X|Y) \triangleq \mathbb{E}_y\, P_e(X|y).$  (13)

The latter quantities correspond to the minimum probability of decision error using a MAP (maximum a posteriori probability) rule (see, e.g., [11]).
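As a concrete illustration (not part of the paper), these unconditional quantities can be computed numerically. The following minimal Python sketch, with hypothetical function names, evaluates the Rényi α-entropy, recovering Shannon entropy at α = 1, collision entropy at α = 2 and min-entropy at α = +∞:

```python
import math

def renyi_entropy(p, alpha):
    """Rényi alpha-entropy (in nats) of a probability vector p.

    alpha = 1 gives Shannon entropy (by continuity),
    alpha = math.inf gives min-entropy -log p_(1).
    """
    if alpha == 1:
        return -sum(x * math.log(x) for x in p if x > 0)
    if alpha == math.inf:
        return -math.log(max(p))
    return math.log(sum(x ** alpha for x in p)) / (1 - alpha)

p = [0.5, 0.25, 0.125, 0.125]
H1 = renyi_entropy(p, 1)            # Shannon entropy
H2 = renyi_entropy(p, 2)            # collision entropy
Hinf = renyi_entropy(p, math.inf)   # min-entropy
# H_alpha is nonincreasing in alpha:
assert H1 >= H2 >= Hinf
```

The assertion reflects the well-known monotonicity of H_α in α, which makes min-entropy the most conservative of these criteria.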
Guesswork or guessing entropy [2,12]

$G(X) = G(p_X) \triangleq \sum_{k=1}^{M} k\, p_{(k)}$  (14)

and, more generally, guessing moments of order ρ > 0, or ρ-guessing entropy,

$G_\rho(X) \triangleq \sum_{k=1}^{M} k^\rho\, p_{(k)}$  (15)

are also of great interest in relation to α-entropy [10,13,14]. The conditional versions given the observation Y are the expectations

$G(X|Y) \triangleq \mathbb{E}_y\, G(X|y)$  (16)
$G_\rho(X|Y) \triangleq \mathbb{E}_y\, G_\rho(X|y).$  (17)

When ρ = 1, this represents the average number of guesses that an attacker has to make to guess the secret X correctly after having observed Y [13].
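The guessing quantities can be illustrated in the same way. Below is a small sketch (the function name is ours, not the paper's) that computes the ρ-guessing entropy for the optimal strategy of guessing values in decreasing order of probability:

```python
def guessing_moment(p, rho=1.0):
    """rho-guessing entropy: E[(number of guesses)^rho] when guessing
    candidate values in decreasing order of probability."""
    q = sorted(p, reverse=True)                 # p_(1) >= p_(2) >= ...
    return sum((k + 1) ** rho * q[k] for k in range(len(q)))

p = [0.125, 0.5, 0.25, 0.125]
G = guessing_moment(p)   # average number of guesses, here 1*0.5 + 2*0.25 + 3*0.125 + 4*0.125
```

For a delta distribution one guess always suffices (G = 1), while for the uniform distribution on M values G = (M+1)/2, matching the extremal bounds discussed in the next section.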

Statistical (Total Variation) Distance to the Uniform Distribution
As shown in the sequel, all quantities introduced in the preceding section (H, H_α, P_e, G, G_ρ) have many properties in common. In particular, each of these quantities attains
• its minimum value for a delta (Dirac) distribution p = δ, that is, a deterministic random variable X with p_(1) = 1 and all other probabilities equal to 0;
• its maximum value for the uniform distribution p = u, that is, a uniformly distributed random variable X with p(x) = 1/M for all x.
Indeed, it can be easily checked that

$0 \le H_\alpha(X) \le \log M, \qquad 0 \le P_e(X) \le 1 - \tfrac{1}{M}$  (18)
$1 \le G(X) \le \tfrac{M+1}{2}, \qquad 1 \le G_\rho(X) \le \tfrac{1}{M} \sum_{k=1}^{M} k^\rho$  (19)

where the lower (resp. upper) bounds are attained for a delta (resp. uniform) distribution: the uniform distribution is the "most entropic" (H_α), the "hardest to guess" (G, G_ρ), and the "hardest to detect" (P_e). The maximum entropy property is related to the minimization of divergence [15]:

$H(p) = \log M - D(p\,\|\,u)$  (20)

where $D(p\,\|\,q) = \sum_x p(x) \log \frac{p(x)}{q(x)} \ge 0$ denotes the Kullback-Leibler divergence, which vanishes if and only if p = q. Therefore, entropy appears as the complementary value of the divergence to the uniform distribution. Similarly, for α-entropy,

$H_\alpha(p) = \log M - D_\alpha(p\,\|\,u)$  (21)

where $D_\alpha(p\,\|\,q) = \frac{1}{\alpha - 1} \log \sum_x p^\alpha(x)\, q^{1-\alpha}(x)$ denotes the Rényi α-divergence. Instead of the divergence to the uniform distribution, it is often desirable to rely on the statistical distance, also known as the total variation distance, to the uniform distribution. The general expression of the total variation distance is

$\Delta(p, q) \triangleq \frac{1}{2} \sum_x |p(x) - q(x)|$  (22)

where the 1/2 factor ensures that 0 ≤ ∆(p, q) ≤ 1. Equivalently,

$\Delta(p, q) = \max_T |P(T) - Q(T)|$  (23)

where the maximum is over any event T, and P, Q denote the respective probabilities w.r.t. p and q. As is well known, the maximum is attained for

$T = \{x \mid p(x) \ge q(x)\}.$  (24)

The total variation criterion is particularly important because a very small distance ∆(p, q) ensures that no statistical test can effectively distinguish between p and q. In fact, given some observation X following either p (null hypothesis H_0) or q (alternate hypothesis H_1), such a statistical test takes the form "is X ∈ T?" (then accept H_0, otherwise reject H_0).
If ∆(p, q) is small enough, then for any test T the type-I and type-II errors have total probability $P(X \notin T) + Q(X \in T) \ge 1 - \Delta(p, q) \approx 1$. Thus, in this sense, the two hypotheses p and q are indistinguishable (statistically equivalent).
By analogy with (20) and (21), we can then define the "statistical randomness" R(X) = R(p) ≥ 0 as the complementary value of the statistical distance to the uniform distribution, i.e., such that

$R(p) + \Delta(p, u) = 1$  (25)

holds. With this definition,

$R(p) = 1 - \Delta(p, u) = \sum_x \min\left(p(x), \tfrac{1}{M}\right).$  (26)

Thus the uniform distribution u is the "most random": R(u) = 1. What is fundamental is that R(X) ≈ 1 ensures that no statistical test can effectively distinguish the actual distribution from the uniform distribution. Again, the "least random" distribution corresponds to the deterministic case. In fact, from (24) we have

$\Delta(p, u) = P(T^+) - \frac{K}{M} = P_{(K)} - \frac{K}{M}$  (27)

where $T^+ = \{x \mid p(x) \ge \frac{1}{M}\}$ has cardinality K = |T⁺|, and P(T⁺) = P_(K) by definition (2). It is easily seen that ∆(p, u) attains its maximum value 1 − 1/M if and only if p = δ is a delta distribution. In summary,

$\frac{1}{M} \le R(X) \le 1$  (28)

where the lower (resp. upper) bound is attained for a delta (resp. uniform) distribution. The conditional version is again obtained by averaging over the observation:

$R(X|Y) \triangleq \mathbb{E}_y\, R(X|y).$  (29)
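The statistical quantities admit an equally direct numerical check. The sketch below (hypothetical names, our own illustration) computes ∆(p, u) from the sum-of-absolute-differences expression, checks it against the heavy-set formula based on T⁺ = {x | p(x) ≥ 1/M}, and verifies that R = 1 − ∆(p, u):

```python
def tv_to_uniform(p):
    """Total variation distance Delta(p, u) to the uniform distribution."""
    M = len(p)
    return 0.5 * sum(abs(x - 1 / M) for x in p)

def randomness(p):
    """Statistical randomness R(p) = 1 - Delta(p, u) = sum_x min(p(x), 1/M)."""
    M = len(p)
    return sum(min(x, 1 / M) for x in p)

p = [0.5, 0.25, 0.125, 0.125]
M = len(p)
# the two expressions agree: R = 1 - Delta
assert abs(randomness(p) - (1 - tv_to_uniform(p))) < 1e-12
# Delta can also be computed from the heavy set T+ = {x : p(x) >= 1/M}:
K = sum(1 for x in p if x >= 1 / M)          # |T+|
PK = sum(sorted(p, reverse=True)[:K])        # cumulative sum P_(K)
assert abs(tv_to_uniform(p) - (PK - K / M)) < 1e-12
```

The min-based form of R also makes its concavity in p transparent: it is a sum of minima of linear functions.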

F-Concavity: Knowledge Reduces Randomness and Data Processing
Knowledge of the observed data Y (on average) reduces uncertainty, improves detection or guessing, and reduces randomness, in the sense that

$H_\alpha(X|Y) \le H_\alpha(X)$  (30)
$G(X|Y) \le G(X), \qquad G_\rho(X|Y) \le G_\rho(X)$  (31)
$P_e(X|Y) \le P_e(X)$  (32)
$R(X|Y) \le R(X).$  (33)

When α = 1, the property H(X|Y) ≤ H(X) is well known ("conditioning reduces entropy" [15]): the difference H(X) − H(X|Y) = I(X;Y) is the mutual information, which is nonnegative. Property (30) for general α is also well known; see [7,8]. In view of (10) and (11), the case α = +∞ in (30) is equivalent to (32), which is obvious in the sense that any observation can only improve MAP detection. This, as well as (31), is also easily proved directly (see, e.g., [17]). For all quantities H, P_e, G, R, the conditional quantity is obtained by averaging over the observation as in (6), (13), (16) and (29). Since p(x) = E_y p(x|y), the fact that knowledge of Y reduces H, P_e, G or R amounts to saying that these are concave functions of the distribution p of X. Note that the concavity of R(X) = R(p) in p is clear from the definition (26), which shows (33).
For the entropy H, this has also been given a physical interpretation: "mixing" distributions (taking convex combinations of probability distributions) can only increase the entropy on average. For example, given any two distributions p and q, $H(\lambda p + \bar\lambda q) \ge \lambda H(p) + \bar\lambda H(q)$ where $0 \le \lambda = 1 - \bar\lambda \le 1$. Similarly, such mixing of distributions increases the average probability of error P_e, guessing entropy G, and statistical randomness R.
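This mixing property is easy to verify numerically. The following sketch (our own illustration, not the paper's) checks the concavity inequality $H(\lambda p + \bar\lambda q) \ge \lambda H(p) + \bar\lambda H(q)$ on randomly drawn distributions:

```python
import math
import random

def shannon(p):
    """Shannon entropy in nats, with the convention 0 log(1/0) = 0."""
    return -sum(x * math.log(x) for x in p if x > 0)

random.seed(0)
for _ in range(100):
    # two random distributions on 5 points and a random mixing weight
    p = [random.random() for _ in range(5)]
    p = [x / sum(p) for x in p]
    q = [random.random() for _ in range(5)]
    q = [x / sum(q) for x in q]
    lam = random.random()
    mix = [lam * a + (1 - lam) * b for a, b in zip(p, q)]
    # concavity: H(lam*p + (1-lam)*q) >= lam*H(p) + (1-lam)*H(q)
    assert shannon(mix) >= lam * shannon(p) + (1 - lam) * shannon(q) - 1e-12
```

The same loop, with shannon replaced by P_e, G or R, verifies the corresponding concavity claims.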
For the conditional α-entropy H_α(X|Y) with α ≠ 1, averaging over Y in the definition (8) is made on the α-norm of the distribution p_{X|y}, which is known to be convex for α > 1 (by Minkowski's inequality) and concave for 0 < α < 1 (by the reverse Minkowski inequality). Hence, the fact that knowledge reduces α-entropy (inequality (30)) is equivalent to the fact that H_α(p) is an F-concave function of p, that is, an increasing function F of a concave function of p. A straightforward generalization of (30)-(33) is the data processing inequality: for any Markov chain X − Y − Z, i.e., such that p(x|y,z) = p(x|y),

$H_\alpha(X|Y) \le H_\alpha(X|Z)$  (34)
$G(X|Y) \le G(X|Z), \qquad G_\rho(X|Y) \le G_\rho(X|Z)$  (35)
$P_e(X|Y) \le P_e(X|Z)$  (36)
$R(X|Y) \le R(X|Z).$  (37)

These follow by applying (30)-(33) conditionally on Z (additional knowledge reduces randomness) and then noting that p(x|y,z) = p(x|y) by the Markov property; see, e.g., [7,18] for H_α and [17] for G. Conversely, (30)-(33) can be re-obtained from (34)-(37) as the particular case Z = 0 (any deterministic variable representing zero information).

S-Concavity: Mixing Increases Randomness and Data Processing
Another type of mixing (different from the one described in the preceding section) is also useful in certain physical science considerations. It can be described as a sequence of elementary mixing operations, as follows. Suppose that one modifies only two probability values p_i = p(x_i) and p_j = p(x_j) for i ≠ j. Since the result should again be a probability distribution, the sum p_i + p_j should be kept constant. Then there are two possibilities:
• |p_i − p_j| decreases: the resulting distribution is "smoother", "more spread out", "more disordered". The operation can be written as $(p_i, p_j) \to (\lambda p_i + \bar\lambda p_j, \lambda p_j + \bar\lambda p_i)$ where $0 \le \lambda = 1 - \bar\lambda \le 1$, also known as a "transfer" operation. We call it an elementary mixing operation, or M-transformation for short.
• |p_i − p_j| increases: this is the reverse operation, an elementary unmixing operation, or U-transformation for short.
We say that a quantity is s-concave if it increases by any M-transformation (equivalently, decreases by any U-transformation). Note that any increasing function F of an s-concave function is again s-concave.
This notion coincides with that of Schur-concavity from majorization theory [19]. In fact, we say that p is majorized by q, written p ≺ q, if p is obtained from q by a (finite) sequence of elementary M-transformations or, what amounts to the same, that q majorizes p, that is, q is obtained from p by a (finite) sequence of elementary U-transformations. A well-known result ([19], p. 34) states that p ≺ q if and only if (see definition (2))

$P_{(k)} \le Q_{(k)} \qquad (k = 1, \ldots, M)$  (38)

where always P_(M) = Q_(M) = 1. From the above definitions it is immediate to see that all previously considered quantities H, H_α, G, G_ρ, P_e, R are s-concave: mixing increases uncertainty, guessing, error, and randomness, that is, p ≺ q implies

$H_\alpha(p) \ge H_\alpha(q)$  (39)
$G(p) \ge G(q), \qquad G_\rho(p) \ge G_\rho(q)$  (40)
$P_e(p) \ge P_e(q)$  (41)
$R(p) \ge R(q).$  (42)

For H_α and R, this can be easily seen from the fact that these quantities can be written as (an increasing function of) a quantity of the form ∑_x φ(p(x)) where φ is concave: the effect of an elementary M-transformation is then to increase this sum. For P_e it is obvious, and for G and G_ρ it is also easily proved using the characterization (38) and summation by parts [17]. Another kind of (functional or deterministic) data processing inequality can be obtained from (39)-(42) as a particular case. For any deterministic function f,

$H_\alpha(f(X)) \le H_\alpha(X)$  (43)
$G(f(X)) \le G(X), \qquad G_\rho(f(X)) \le G_\rho(X)$  (44)
$P_e(f(X)) \le P_e(X)$  (45)
$R(f(X)) \le R(X).$  (46)

Thus deterministic processing (by f) decreases (cannot increase) uncertainty, can only make guessing or detection easier, and decreases randomness. For α = 1, the inequality H(f(X)) ≤ H(X) can also be seen from the data processing inequality of the preceding section by noting that H(f(X)) = I(f(X); f(X)) ≤ I(X; f(X)) ≤ H(X) (since X − f(X) − f(X) is trivially a Markov chain).
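The cumulative-sum characterization (38) and the resulting Schur-concavity can be checked numerically. In the sketch below (hypothetical names, our illustration), majorization is tested by comparing the sorted cumulative sums, and the s-concavity of Shannon entropy is verified on an example:

```python
import math

def majorizes(q, p, tol=1e-12):
    """True if p is majorized by q (p ≺ q), i.e. P_(k) <= Q_(k) for all k,
    as in the cumulative-sum characterization (38)."""
    ps = sorted(p, reverse=True)
    qs = sorted(q, reverse=True)
    Pk = Qk = 0.0
    for a, b in zip(ps, qs):
        Pk += a
        Qk += b
        if Pk > Qk + tol:
            return False
    return True

def shannon(p):
    return -sum(x * math.log(x) for x in p if x > 0)

p = [0.4, 0.3, 0.2, 0.1]    # flatter distribution
q = [0.7, 0.2, 0.1, 0.0]    # more concentrated distribution
assert majorizes(q, p)                # p ≺ q
assert shannon(p) >= shannon(q)       # s-concavity: flatter means more entropy
```

The same check with G, G_ρ, P_e or R in place of shannon illustrates the remaining inequalities.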
To prove (43)-(46) in general, consider the preimages by f of the values y = f(x); it is enough to show that each of the quantities H_α, P_e, G, or R decreases under the elementary operation consisting of putting two distinct values x_i, x_j of x into the same preimage of y. For probability distributions, however, this operation amounts to the U-transformation (p_i, p_j) → (p_i + p_j, 0), and the result follows by s-concavity.
An equivalent formulation of (43)-(46) is the fact that any additional random variable Y increases uncertainty, guessing, probability of error, and randomness, in the sense that

$H_\alpha(X) \le H_\alpha(X, Y)$  (47)
$G(X) \le G(X, Y), \qquad G_\rho(X) \le G_\rho(X, Y)$  (48)
$P_e(X) \le P_e(X, Y)$  (49)
$R(X) \le R(X, Y).$  (50)

This is the particular case of (43)-(46) applied to the joint variable (X, Y) and the first projection f(x, y) = x. Conversely, (43)-(46) follow from (47)-(50) by applying them to (f(X), X) in place of (X, Y) and noting that the distribution of (f(X), X) is essentially that of X.
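The functional data processing inequality can be illustrated directly: a deterministic map f merges probability masses, which amounts to a sequence of U-transformations. The sketch below (our illustration, with names of our choosing) checks H(f(X)) ≤ H(X) for α = 1:

```python
import math
from collections import defaultdict

def shannon(p):
    return -sum(x * math.log(x) for x in p if x > 0)

# distribution of X on {0, 1, 2, 3} and a deterministic map f
pX = {0: 0.4, 1: 0.3, 2: 0.2, 3: 0.1}
f = lambda x: x % 2                  # merges preimages {0, 2} and {1, 3}

pfX = defaultdict(float)
for x, px in pX.items():
    pfX[f(x)] += px                  # push-forward distribution of f(X)

# deterministic processing cannot increase uncertainty:
assert shannon(pfX.values()) <= shannon(pX.values())
```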

Optimal Fano-Type and Pinsker-Type Bounds
We have seen that informational quantities such as the entropies H, H_α and guessing entropies G, G_ρ on one hand, and statistical quantities such as the probability of error of MAP detection P_e and the statistical randomness R on the other hand, satisfy many common properties: they decrease under knowledge and data processing, increase under mixing, etc. For this reason, it is desirable to establish the best possible bounds between one informational quantity (such as H_α or G_ρ) and one statistical quantity (P_e or R = 1 − ∆(p, u)).
To achieve this, we remark that for any distribution p, we have the following majorizations. For fixed P_e = 1 − P_s:

$\left(P_s, \frac{P_e}{M-1}, \ldots, \frac{P_e}{M-1}\right) \prec p \prec (\underbrace{P_s, \ldots, P_s}_{K}, 1 - KP_s, 0, \ldots, 0)$  (51)

where (necessarily) $K = \lfloor 1/P_s \rfloor$; and, for fixed R = 1 − ∆, a similar pair of majorizations (52) holds, where K = |{x | p(x) ≥ 1/M}| is as in (27). Applying the s-concavity of the entropies H_α or G_ρ to (51) gives closed-form upper bounds on entropies as a function of P_e, known as Fano inequalities, and closed-form lower bounds, known as reverse Fano inequalities. Figure 1 shows some optimal regions. The original Fano inequality was an upper bound on the conditional entropy H(X|Y) as a function of P_e(X|Y). It can be shown that the upper bounds in the conditional case are unchanged. Lower bounds on conditional entropies or α-entropies, however, have to be slightly modified, due to the averaging operation inside the function F (see Section 3 above), by taking the (piecewise linear) convex envelope of the lower curve on F^{-1}(H_α). In this way, one easily recovers the results of [20] for H, [11] for H_α, and [14,17] for G and G_ρ.
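As a numerical sanity check (ours, not the paper's), one can verify the Fano-type sandwich for Shannon entropy: by Schur-concavity, the two extremal distributions in the majorization (51) yield closed-form lower and upper bounds on H(X) for fixed P_e. The names below are hypothetical:

```python
import math

def shannon(p):
    return -sum(x * math.log(x) for x in p if x > 0)

def fano_bounds(Pe, M):
    """Lower/upper bounds on H(X) for fixed error probability Pe,
    from the two extremal distributions in the majorization sandwich (51)."""
    Ps = 1 - Pe
    # flattest distribution with p_(1) = Ps: gives the (Fano) upper bound
    flattest = [Ps] + [Pe / (M - 1)] * (M - 1)
    # most concentrated distribution with p_(1) = Ps, K = floor(1/Ps):
    # gives the (reverse Fano) lower bound
    K = int(1 / Ps)
    sharpest = [Ps] * K + [max(0.0, 1 - K * Ps)] + [0.0] * (M - K - 1)
    return shannon(sharpest), shannon(flattest)

M = 8
p = [0.4, 0.2, 0.1, 0.1, 0.1, 0.05, 0.03, 0.02]   # p_(1) = 0.4, so Pe = 0.6
lo, hi = fano_bounds(0.6, M)
assert lo <= shannon(p) <= hi
```

Replacing shannon by any other s-concave quantity (H_α, G_ρ) gives the corresponding Fano-type and reverse Fano-type bounds.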
Likewise, applying the s-concavity of the entropies H_α or G_ρ to (52) gives closed-form upper bounds on entropies as a function of R, similar to Pinsker inequalities, and closed-form lower bounds, similar to reverse Pinsker inequalities. Figure 2 shows some optimal regions.
The various Pinsker and reverse Pinsker inequalities that can be found in the literature give bounds between ∆(p, q) and D(p‖q) for general q. Such inequalities find application in quantum physics [21] and in deriving lower bounds on the minimax risk in nonparametric estimation [22]. Being of more general applicability, they turn out not to be optimal here, since we have optimized the bounds in the particular case q = u. Using our method, one again easily recovers previous results of [23] (and [24], Theorem 26) for H, and improves previous inequalities used in several applications [3,4,6].

Conclusions
Using a simple method based on "mixing" or majorization, we have established optimal (Fano-type and Pinsker-type) bounds between entropic quantities (H_α, G_ρ) and statistical quantities (P_e, R), in an interplay between information theory and statistics. As a perspective, a similar methodology could be developed for the statistical distance to an arbitrary (not necessarily uniform) distribution.