Time-Adaptive Statistical Test for Random Number Generators

The problem of constructing effective statistical tests for random number generators (RNGs) is considered. Currently, there are hundreds of statistical tests for RNGs, which are often combined into so-called batteries, each containing from a dozen to more than one hundred tests. When a battery is used, it is applied to a sequence generated by the RNG, and the calculation time is determined by the length of the sequence and the number of tests. Generally speaking, the longer the sequence, the smaller the deviations from randomness that a specific test can detect. Thus, when a battery is applied, on the one hand, the “better” the tests in the battery, the greater the chance of rejecting a “bad” RNG. On the other hand, the larger the battery, the less time it can spend on each test and, therefore, the shorter the test sequence. In turn, this reduces the ability to find small deviations from randomness. To reduce this trade-off, we propose an adaptive way to use batteries (and other sets) of tests, which requires less time but, in a certain sense, preserves the power of the original battery. We call this method a time-adaptive battery of tests. The suggested method is based on a theorem that describes the asymptotic properties of the so-called p-values of tests. Namely, the theorem claims that, if the RNG can be modeled by a stationary ergodic source, the value $-\log \pi(x_1 x_2 \ldots x_n)/n$ goes to $1 - h$ as $n$ grows, where $x_1 x_2 \ldots$ is the sequence, $\pi(\cdot)$ is the p-value of the most powerful test, and $h$ is the limit Shannon entropy of the source.


Introduction
Random number generators (RNGs) and pseudo-random number generators (PRNGs) are widely used in many applications. RNGs are based on physical sources, while pseudo-random numbers are generated by computers. The goal of an RNG or a PRNG is to generate sequences of binary digits that are distributed like the results of tossing an "honest" coin or, more precisely, obey the Bernoulli distribution with parameters (1/2, 1/2). As a rule, for RNGs and PRNGs used in practice, this property is verified experimentally with the help of statistical tests developed for this purpose.
Currently, there are more than one hundred applicable statistical tests, as well as dozens of RNGs based on different physical processes and an even greater number of PRNGs based on different mathematical algorithms; see [1,2,3] for a review. Informally, an ideal RNG should generate sequences that pass all tests. In practice, especially in cryptographic applications, this requirement is formulated as follows: an RNG must pass a so-called battery of statistical tests, that is, some fixed set of tests. When a battery is applied, each test in the battery is applied separately to the RNG. Among these batteries, we mention Marsaglia's Diehard battery, which contains 16 tests [4], the National Institute of Standards and Technology (NIST) battery of 15 tests [5], several batteries proposed by L'Ecuyer and Simard [2], which contain from 10 to 106 tests, and many others (see [1,2,6] for a review). In addition, these batteries contain many tests that can be used with different values of the parameters, potentially increasing the total number of tests in the battery. Note that an RNG used in practice should be tested from time to time, like any physical equipment, and therefore these test batteries should be used continuously.
How should large batteries of tests be evaluated? On the one hand, the larger the test battery, the more likely it is to find flaws in the tested RNG. On the other hand, the larger the battery, the more time is required for testing. (Thus, L'Ecuyer and Simard [2] remark on the need for small batteries to increase computational efficiency.) Another view is as follows: in reality, the time available to study any RNG is limited. Given a certain time budget, one can either use more tests on relatively short sequences generated by the RNG, or use fewer tests on longer sequences, which, in turn, gives more chances to find deviations from randomness in the considered RNG.
In order to reduce this trade-off, we propose time-adaptive testing of RNGs, in which, informally speaking, all the tests are first executed on relatively short sequences generated by the RNG, and then a few "promising" tests are applied for the final testing. Of course, the key question here is which tests are promising. For example, if a battery of two tests is applied to (relatively short) sequences of the same length, it can be assumed that the smaller the p-value, the more promising the test. But a more complicated situation may arise when we have to compare two tests that were applied to sequences of different lengths (for example, the first test was applied to a sequence of length $l_1$ and the second to a sequence of length $l_2$, $l_1 \neq l_2$). We show that if our goal is to choose the most powerful test, then a good strategy is to choose the test $i$ for which the ratio $-\log(\text{p-value}_i)/l_i$ is maximal. This recommendation is based on the following theorem: if an RNG can be modelled by a stationary ergodic source, the value $-\log \pi(x_1 x_2 \ldots x_n)/n$ goes to $1 - h$ as $n$ grows, where $x_1 x_2 \ldots$ is a generated sequence, $\pi(\cdot)$ is the p-value of the most powerful test, and $h$ is the limit Shannon entropy of the stationary ergodic source. This theorem plays an important role in the suggested time-adaptive scheme and is described in the first part of the paper, whereas the time-adaptive testing is described afterwards. The description is illustrated by experiments with the battery Rabbit from [2].
As far as we know, the proposed approach to testing RNGs is new, but the idea of finding the best test among many, testing the tests step by step in an increasing sequence, is widely used in algorithmic information theory, where the notion of random sequence is formally investigated and discussed [7,8].
2 Hypothesis testing and properties of p-values

Notation
We consider an RNG which generates a sequence of letters $x = x_1 x_2 \ldots x_n$, $n \geq 1$, from the finite alphabet $\{0, 1\}$. Two statistical hypotheses are considered: $H_0 = \{x$ obeys the uniform distribution $\mu_U$ on $\{0, 1\}^n\}$, and the alternative hypothesis $H_1 = \bar{H}_0$, that is, $H_1$ is the negation of $H_0$. This is a particular case of the so-called goodness-of-fit problem, and any test for it is called a test of fit, see [13]. Let $t$ be a test. Then, by definition, the significance level $\alpha$ equals the probability of the Type I error, $\alpha \in (0, 1)$. Denote a critical region of the test $t$ for the significance level $\alpha$ by $C_t(\alpha)$ and let $\bar{C}_t(\alpha) = \{0, 1\}^n \setminus C_t(\alpha)$. (Recall that a Type I error occurs if $H_0$ is true and is rejected. A Type II error occurs if $H_1$ is true, but $H_0$ is accepted. Besides, for a certain $x = x_1 x_2 \ldots x_n$, $H_0$ is rejected if and only if $x \in C_t(\alpha)$.)
Suppose that $H_1$ is true, and the investigated sequence $x = x_1 x_2 \ldots x_n$ is generated by an (unknown) source $\nu$. By definition, a test $t$ is consistent if, for any significance level $\alpha \in (0, 1)$, the probability of a Type II error goes to 0, that is,
$$\lim_{n \to \infty} \nu(\bar{C}_t(\alpha)) = 0.$$
Suppose that $H_1$ is true and the sequences $x \in \{0, 1\}^n$ obey a certain distribution $\nu$. It is well known in mathematical statistics that the optimal test (the Neyman-Pearson or NP test) is described by the Neyman-Pearson lemma, and the critical region of this test is defined as follows:
$$C_{NP}(\alpha) = \{x : \mu_U(x)/\nu(x) \leq \lambda_\alpha\}, \qquad (1)$$
where $\alpha \in (0, 1)$ is the significance level and the constant $\lambda_\alpha$ is chosen in such a way that $\mu_U(C_{NP}(\alpha)) = \alpha$, see [13]. (We do not take into account that the set $\{0, 1\}^n$ is finite. Strictly speaking, in such a case a randomized test should be used, but in what follows we consider the asymptotic behaviour of tests for large $n$, where this effect is negligible.) Note that, by definition, $\mu_U(x) = 2^{-n}$ for any $x \in \{0, 1\}^n$.

The p-value and its properties.
The notion of the critical region is connected with the so-called p-value, which we define for the NP test by the following equation:
$$\pi_{NP}(x) = \mu_U\{y : \mu_U(y)/\nu(y) \leq \mu_U(x)/\nu(x)\} = |\{y : \nu(y) \geq \nu(x)\}| / 2^n. \qquad (2)$$
Informally, $\pi_{NP}(x)$ is the probability, under the null hypothesis, of meeting a random point $y$ which is at least as extreme as the observed one.
The NP test is optimal in the sense that its probability of a Type II error is minimal, but when testing an RNG the alternative distribution is unknown and, hence, different tests are necessary. Let us consider a certain statistic $\tau$ (that is, a function on $\{0, 1\}^n$), and define the p-value for this $\tau$ and $x$ as follows:
$$\pi_\tau(x) = \mu_U\{y : \tau(y) \geq \tau(x)\} = |\{y : \tau(y) \geq \tau(x)\}| / 2^n. \qquad (3)$$
(Note that the definition of $\pi_{NP}$ in (2) corresponds to this equation if the value $\nu(x)$ is considered as a statistic, i.e. $\tau(x) = \nu(x)$.)
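For very small $n$, the p-value of a statistic can be evaluated by brute force, which may help to make the definition concrete. The Python sketch below is an illustration only. It assumes that the p-value is the $\mu_U$-probability of the set of words whose statistic is at least as large as the observed one, and the statistic `ones_deviation` is a hypothetical example that does not come from any battery discussed here.

```python
from itertools import product


def ones_deviation(word):
    """Hypothetical statistic: distance of the number of ones from n/2."""
    return abs(sum(word) - len(word) / 2)


def p_value(statistic, x):
    """Exact p-value under H0, found by enumerating all 2^n binary words."""
    n = len(x)
    observed = statistic(x)
    at_least_as_extreme = sum(
        1 for y in product((0, 1), repeat=n) if statistic(y) >= observed
    )
    return at_least_as_extreme / 2 ** n


balanced = (0, 1, 1, 0, 1, 0, 0, 1)   # four ones: the least extreme value
all_ones = (1,) * 8                   # the most extreme value

print(p_value(ones_deviation, balanced))  # every word qualifies
print(p_value(ones_deviation, all_ones))  # only 00000000 and 11111111 qualify
```

The enumeration costs $2^n$ evaluations of the statistic, so it is usable only for toy lengths. Practical tests rely on the known (often asymptotic) distribution of the statistic instead.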

The p-value and Shannon entropy.
It turns out that there exist tests whose asymptotic behaviour is close to that of the NP test for any (unknown) stationary ergodic source $\nu$, see [9]. Those tests are based on so-called universal codes (or data compressors) and are described in [10,11], where it is shown that they are consistent. We describe those tests in Appendix 1 and show that they are asymptotically optimal. The following theorem describes the asymptotic behaviour of p-values for stationary ergodic sources for the NP test and for the above-mentioned tests based on universal codes (see Appendix 1). We use this theorem as the theoretical basis for the adaptive statistical testing developed in this paper.
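Theorem 1 i) If $\nu$ is a stationary ergodic measure, then, with probability 1,
$$\lim_{n \to \infty} -\frac{1}{n} \log \pi_{NP}(x_1 x_2 \ldots x_n) = 1 - h(\nu), \qquad (4)$$
where $h(\nu)$ is the Shannon entropy of $\nu$, see [12] for the definition.
ii) There exists a statistic $\tau$ such that, for any stationary ergodic measure $\nu$, with probability 1,
$$\lim_{n \to \infty} -\frac{1}{n} \log \pi_\tau(x_1 x_2 \ldots x_n) = 1 - h(\nu), \qquad (5)$$
where the p-values $\pi_{NP}$ and $\pi_\tau$ are defined in (2) and (3), respectively.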
The statistic $\tau$ and the corresponding test of fit are described in Appendix 1, and the proof of the theorem is given in Appendix 2. Here we note that this theorem gives some idea of the relation between the Shannon entropy of the (unknown) process $\nu$ and the required sample size. Indeed, suppose that the NP test is used and the desired significance level is $\alpha$. Then (asymptotically) $\alpha$ should be larger than $\pi_{NP}(x)$, and from (4) we obtain $n > -\log \alpha/(1 - h(\nu))$ (for the most powerful test). It is known that the Shannon entropy is 1 if and only if $\nu$ is the uniform measure $\mu_U$. Therefore, in a certain sense, the difference $1 - h(\nu)$ estimates the distance between the distributions, and the last inequality shows that the required sample size goes to infinity as $\nu$ approaches the uniform distribution.
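The bound $n > -\log \alpha/(1 - h(\nu))$ can be evaluated numerically to see how quickly the required sample size grows. The following Python sketch assumes, purely for illustration, that the source emits independent bits equal to 1 with probability `p` (such a source is stationary and ergodic, and its entropy is the binary entropy function) and that logarithms are taken to base 2. The function names are hypothetical.

```python
import math


def binary_entropy(p):
    """Shannon entropy (bits per symbol) of an i.i.d. source with P(1) = p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def min_sample_size(alpha, h):
    """Asymptotic bound n > -log2(alpha) / (1 - h) for the most powerful test."""
    if h >= 1.0:
        return math.inf  # a uniform source cannot be rejected at any length
    return -math.log2(alpha) / (1.0 - h)


# The closer P(1) is to 1/2, the longer the sequence that is needed.
for p in (0.6, 0.51, 0.501):
    h = binary_entropy(p)
    print(p, round(h, 6), round(min_sample_size(0.001, h)))
```

The bound is asymptotic and refers to the most powerful test, so it should be read as an order-of-magnitude guide rather than a guarantee for any particular test in a battery.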
3 Time-adaptive statistical tests and their experimental investigation

Batteries of tests.
Let us consider a situation where randomness testing is performed by conducting a battery of statistical tests for randomness. Suppose that the battery contains $s$ tests and $\alpha_i$ is the significance level of the $i$-th test, $i = 1, \ldots, s$. If the battery is applied in such a way that the hypothesis $H_0$ is rejected when at least one test in the battery rejects it, then the significance level $\alpha$ of this battery satisfies the following inequality:
$$\alpha \leq \sum_{i=1}^{s} \alpha_i. \qquad (6)$$
If all the tests in the battery are independent, then the following equation is valid:
$$\alpha = 1 - \prod_{i=1}^{s} (1 - \alpha_i).$$
Clearly, the upper bound (6) is true for this case, since $1 - \prod_{i=1}^{s} (1 - \alpha_i) \leq \sum_{i=1}^{s} \alpha_i$. That is why we will use the estimate (6) below.
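The two ways of combining the significance levels can be compared numerically. The Python sketch below assumes that the upper bound is the sum of the individual levels and that, for independent tests, the battery level is one minus the product of the acceptance probabilities (the standard formula). The battery of 25 tests at level 0.001 is an arbitrary example, and the function names are hypothetical.

```python
import math


def union_bound(alphas):
    """Upper bound on the battery level: the sum of the individual levels."""
    return min(1.0, sum(alphas))


def independent_level(alphas):
    """Exact battery level when all tests are independent."""
    return 1.0 - math.prod(1.0 - a for a in alphas)


alphas = [0.001] * 25
print(union_bound(alphas))        # sum of the 25 levels
print(independent_level(alphas))  # slightly smaller than the sum
```

For small individual levels the two values nearly coincide, so little is lost by using the simpler upper bound, which does not require independence.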
We have considered a scenario in which a test is applied to a single sequence generated by an RNG, and then the researcher makes a decision about the RNG based on the test results. Another possibility, which has been considered by several authors, e.g. [2,5], is to use the following two-step procedure for testing RNGs. The idea is to generate $r$ sequences $x^1, x^2, \ldots, x^r$ and apply one test (say, $\tau$) to each of them independently. Then another test is applied to the obtained data $\tau(x^1), \tau(x^2), \ldots, \tau(x^r)$ (as a rule, those values are converted into a sequence of the corresponding p-values, and then the hypothesis that those p-values are uniformly distributed is tested). Then this procedure is repeated for the second test in the battery, and so on. The final decision is made on the basis of the results obtained. We do not consider this two-step procedure in detail, but note that time-adaptive testing can be applied in this situation, too.

3.2 The scheme of the time-adaptive testing.
Let there be an RNG which generates binary sequences, and a battery of $s$ tests with statistics $\tau_1, \tau_2, \ldots, \tau_s$. In addition, suppose that the total available testing time is limited to a certain amount $T$ and the level of significance is $\alpha \in (0, 1)$.
When time-adaptive testing is applied, all the calculations are separated into a preliminary stage and a final one. The result of the preliminary stage is the list of values
$$\gamma_i = -\log \pi_{\tau_i}(x^i_1 x^i_2 \ldots x^i_{n_i}) / n_i, \quad i = 1, \ldots, s, \qquad (7)$$
where the sequences $x^1_1 x^1_2 \ldots x^1_{n_1}, \ldots, x^s_1 x^s_2 \ldots x^s_{n_s}$ may have common parts (for example, the first sequence may be a prefix of the second, etc.). Then, taking into account the values (7), it is possible to choose some tests from the battery, apply them to a longer sequence, calculate new values $\gamma$, and so on. When the preliminary stage is completed, several tests from the battery should be chosen for the next stage.
The final stage is as follows. First, we divide the significance level $\alpha$ into $\alpha_1, \alpha_2, \ldots, \alpha_k$ in such a way that $\sum_{j=1}^{k} \alpha_j = \alpha$. Then, we obtain new sequence(s) $y^1_1 y^1_2 \ldots y^1_{m_1}, \ldots, y^k_1 y^k_2 \ldots y^k_{m_k}$, which may have common parts but are independent of $x^1_1 x^1_2 \ldots x^1_{n_1}, \ldots, x^s_1 x^s_2 \ldots x^s_{n_s}$, and calculate
$$\pi_{\tau_{i_j}}(y^j_1 y^j_2 \ldots y^j_{m_j}), \quad j = 1, \ldots, k. \qquad (8)$$
The hypothesis $H_0$ is accepted if $\pi_{\tau_{i_j}}(y^j_1 y^j_2 \ldots y^j_{m_j}) > \alpha_j$ for all $j = 1, \ldots, k$. Otherwise, $H_0$ is rejected. The parameters of the test should be chosen in such a way that the total time of calculation is not greater than the given limit $T$.
Claim The significance level of the described time-adaptive test is not larger than $\alpha$.
Indeed, the sequences $y^1_1 y^1_2 \ldots y^1_{m_1}, \ldots, y^k_1 y^k_2 \ldots y^k_{m_k}$ and $x^1_1 x^1_2 \ldots x^1_{n_1}, \ldots, x^s_1 x^s_2 \ldots x^s_{n_s}$ are independent and, hence, the results of the final stage do not depend on the preliminary one. When the battery $\tau_{i_1}, \tau_{i_2}, \ldots, \tau_{i_k}$ is applied, the significance level of $\tau_{i_j}$ equals $\alpha_j$, and the sum of these levels is $\sum_{j=1}^{k} \alpha_j = \alpha$. From (6) we can see that the significance level of the battery (and, hence, of the described testing) is not greater than $\alpha$.
Comment. The length of the sequences may depend on the speed of the tests. For example, this can be done as follows: let $v_i$ be the speed per bit of the test $\tau_i$, $i = 1, \ldots, s$. One possible way to take into account the speed difference is to calculate
$$\hat{\gamma}_i = -\log \pi_{\tau_i}(x^i_1 x^i_2 \ldots x^i_{n_i}) / (v_i n_i)$$
instead of (7) and similar expressions.
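The two-stage scheme above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the implementation used in the experiments: the preliminary stage has a single step, all tests see the same sample, the significance level is split equally among the chosen tests, and tests are ranked by $-\log_2(\text{p-value})$ divided by the sample length in bits. The two toy tests (`monobit_p`, `lag1_p`, with approximate p-values from a normal approximation) and the weak generator `biased_bytes` are hypothetical stand-ins for a real battery and a real RNG.

```python
import math
import random


def bits_of(data):
    """Unpack bytes into a list of bits (least significant bit first)."""
    return [(byte >> k) & 1 for byte in data for k in range(8)]


def monobit_p(data):
    """Toy test 1: two-sided p-value for the balance of zeros and ones."""
    bits = bits_of(data)
    s = sum(2 * b - 1 for b in bits)
    return math.erfc(abs(s) / math.sqrt(2 * len(bits)))


def lag1_p(data):
    """Toy test 2: two-sided p-value for agreement of neighbouring bits."""
    bits = bits_of(data)
    s = sum(1 if a == b else -1 for a, b in zip(bits, bits[1:]))
    return math.erfc(abs(s) / math.sqrt(2 * (len(bits) - 1)))


def gamma(p_value, n_bits):
    """Score -log2(p-value) / length used to rank tests."""
    return -math.log2(max(p_value, 1e-300)) / n_bits  # guard against log(0)


def time_adaptive_test(tests, generate, n_prelim, n_final, k, alpha):
    """Run all tests on a short sample, then the k best on fresh data."""
    sample = generate(n_prelim)
    scores = {name: gamma(test(sample), 8 * n_prelim)
              for name, test in tests.items()}
    chosen = sorted(scores, key=scores.get, reverse=True)[:k]
    fresh = generate(n_final)          # must be independent of `sample`
    alpha_j = alpha / len(chosen)      # equal split, so the parts sum to alpha
    p_values = {name: tests[name](fresh) for name in chosen}
    reject = any(p <= alpha_j for p in p_values.values())
    return reject, p_values


TOY_TESTS = {"monobit": monobit_p, "lag1": lag1_p}
rng = random.Random(2020)


def biased_bytes(n, p_one=0.52):
    """Hypothetical weak generator: each bit equals 1 with probability p_one."""
    return bytes(sum((rng.random() < p_one) << k for k in range(8))
                 for _ in range(n))


print(time_adaptive_test(TOY_TESTS, biased_bytes, 2_000, 20_000, 1, 0.001))
```

Because the final-stage data are generated after the selection and are not reused from the preliminary sample, the choice of tests does not affect the validity of the final p-values. A speed-aware variant could divide each score by the measured running time of the test as well.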

The experiments.
We carried out some experiments with the time-adaptive test based on the battery Rabbit from [2], which contains 26 tests. Let us first describe the choice of the RNG for our experiments. Nowadays there are many "bad" PRNGs and "good" ones. In other words, the output sequences of some known PRNGs have deviations from randomness that are quite easy to detect with many known tests, while other PRNGs do not have deviations that can be detected by known tests [2]. So, we need some families of RNGs with such deviations from randomness that they can be detected only in quite long output sequences. To do this, we take a good generator, MRG32k3a, and a bad one, LCG, from [2], generate sequences $g_1 g_2 \ldots$ and $b_1 b_2 \ldots$ with these two generators, and then prepare a "mixed" sequence $m_1 m_2 \ldots$ from them according to the rule (9), where $D$ is a parameter.
The time-adaptive testing was organised as follows. During the preliminary stage we first generated a file $m_1 m_2 \ldots m_{l_1}$ with $l_1$ = 2 000 000 bytes, tested it with 25 tests from the Rabbit battery and calculated the values (7) with $\log \equiv \log_2$, see the left part of Table 1. (This battery contains 26 tests, but one of them cannot be applied to such a short sequence.) Then we chose the 5 tests with the largest values of $-\log \pi_{t_i}(m_1 \ldots m_{l_1})/l_1$ (let them be $t_{i_1}, \ldots, t_{i_5}$), generated a sequence $m_1 \ldots m_{l_2}$ with $l_2$ = 6 000 000 bytes and applied the tests $t_{i_1}, \ldots, t_{i_5}$ to this sequence (see the example in the right part of Table 1). After that we found the test $t_f$ that attains the maximum of $-\log \pi_{t_r}(m_1 \ldots m_{l_k})/l_k$ over $k = 1, 2$ and all tests $t_r$ applied at step $k$ (see Table 1). This completed the preliminary stage. Then, during the second (final) stage, we generated a 40 000 000-byte sequence and applied the test $t_f$ to it. If the obtained p-value was less than 0.001, the hypothesis $H_0$ was rejected.
Note that the sequence lengths $l_1$ = 2 000 000 and $l_2$ = 6 000 000 are 5% and 15% of the final length of 40 000 000 bytes. So, the total length of the sequences tested by all the tests during the preliminary stage is 25 × 0.05 + 5 × 0.15 = 2 times the final length, i.e. 2 × 40 000 000 bytes, and together with the final stage the time-adaptive testing processes 3 × 40 000 000 bytes. On the other hand, if one applies the battery Rabbit to a sequence of the same length, the total length of the investigated sequences is 25 × 40 000 000 bytes, i.e. 25/3 ≈ 8.33 times more.
Let us consider one example in detail, taking $D = 2$ in (9). Table 1 contains the results of all the calculations carried out during the preliminary stage. We can see that the value $-\log_2 \pi/l$ is maximal for the test t13. Hence, at the final stage we applied the test t13 to a new 40 000 000-byte sequence. It turned out that $\pi_{t13} = 2.9 \times 10^{-26}$ and, hence, $H_0$ is rejected. Besides, we estimated the time of all the calculations (during both stages).
After that, we conducted an additional experiment to get the full picture. Namely, we calculated the p-values of all the tests for the same 40 000 000-byte sequence and estimated the total time of the calculations. It turned out that the p-values of two tests were less than 0.001, namely, $\pi_{t13} = 2.9 \times 10^{-26}$ and $\pi_{t22} = 1.1 \times 10^{-6}$. Besides, we estimated the time of the calculations for all the experiments.
So, the described time-adaptive testing revealed one of the two most powerful tests, while the time used was about 8 times less.
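The comparison of the amounts of data processed in this example can be checked with a short calculation. The Python sketch below uses the sequence lengths and numbers of tests quoted above and takes the total number of processed bytes as a proxy for running time, which is an assumption, since individual tests differ in speed. The variable names are hypothetical.

```python
final_length = 40_000_000        # bytes tested at the final stage
battery_size = 25                # tests applicable to the sequences

# (number of tests run, bytes per test) for each preliminary step
prelim_steps = [(25, 2_000_000), (5, 6_000_000)]

prelim_total = sum(tests * length for tests, length in prelim_steps)
adaptive_total = prelim_total + 1 * final_length   # one test at the final stage
full_battery_total = battery_size * final_length   # every test on the long sequence

print(prelim_total / final_length)          # preliminary stage, in units of the final length
print(full_battery_total / adaptive_total)  # how many times more data the full battery reads
```

With two tests kept for the final stage instead of one, `adaptive_total` grows by another `final_length`, and the ratio drops accordingly.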
We carried out similar experiments 20 times for $D = 2, 3, 4$ (in (9)) with different good and bad generators from [2]. Besides, we investigated several modifications of the considered scheme. In particular, we considered a case where, during the preliminary stage, we first chose the five best tests, as before, and then the two best tests for the final stage (instead of one, as in the experiment above). It turned out that, in all the cases considered, both the battery Rabbit and the time-adaptive testing rejected $H_0$.

Conclusion
First of all, we note that the proposed time-adaptive testing does not prescribe exact values of its numerous parameters. Among these parameters, we note the number of steps at the preliminary stage (in the considered example there were two such steps: selecting five tests and then one), the number of tests compared in one step, the length of the tested sequences, the rule for choosing tests at different stages, etc. The problem of choosing the parameters may be considered a problem of multidimensional optimization. There are many methods available for solving such problems (for example, neural networks and other AI algorithms), and some of them can be used along with time-adaptive testing.
As far as we know, no one has applied adaptive methods for testing randomness, but there are several well-known approaches that can be considered as steps in this direction.For example, L'Ecuyer and Simard recommend several batteries of different sizes that require different times (and the investigator may use them depending on how much time he has) [2].Another popular battery recommended for cryptographic applications also has some parameters that allow one to adjust the testing time [5].
We believe that the proposed approach makes it possible to investigate and optimize time-adaptive testing.
5 Appendix 1. Consistent tests based on universal codes.
The considered tests are based on so-called universal codes, which is why we first briefly describe them. For any integer $m$, a code $\varphi$ is defined as a map from the set of $m$-letter words to the set of all binary words such that $\varphi(u) \neq \varphi(v)$ for any distinct $m$-letter words $u$ and $v$. This property makes unique decoding possible. (More formally, $\varphi$ is an injective mapping from $\{0, 1\}^m$ to $\{0, 1\}^*$, where $\{0, 1\}^* = \bigcup_{i=1}^{\infty} \{0, 1\}^i$.) We will consider so-called universal codes, which have two properties; the second of them states that, for any stationary ergodic $\nu$ defined on the set of all infinite binary words $x = x_1 x_2 \ldots$, with probability one,
$$\lim_{n \to \infty} |\varphi(x_1 x_2 \ldots x_n)| / n = h(\nu), \qquad (11)$$
where $h(\nu)$ is the Shannon entropy of $\nu$. Such codes exist, see [12]. Note that the goal of a code is to "compress" sequences, i.e. to make the average length of the codeword $\varphi(x_1 x_2 \ldots x_n)$ as small as possible. The second property (11) shows that universal codes are asymptotically optimal, because the Shannon entropy is a lower bound on the length of the compressed sequence (per letter), see [12].
Let us return to the considered problem of hypothesis testing. Suppose it is known that a sample sequence $x = x_1 x_2 \ldots$ was generated by a stationary ergodic source and, as before, we consider the same $H_0$ against the same $H_1$. Let $\varphi$ be a universal code. The following test is suggested in [10]: if the length $|\varphi(x_1 \ldots x_n)| \leq n - \log_2(1/\alpha)$, then $H_0$ is rejected; otherwise it is accepted. Here, as before, $\alpha$ is the significance level and $|\varphi(x_1 \ldots x_n)|$ is the length of the encoded ("compressed") sequence. We denote this test by $T_\varphi$ and its statistic by $\tau_\varphi$, i.e.
$$\tau_\varphi(x_1 \ldots x_n) = n - |\varphi(x_1 \ldots x_n)|.$$
The following theorem is proven in [10,11]:
Theorem 2 For each stationary ergodic $\nu$, $\alpha \in (0, 1)$ and a universal code $\varphi$, with probability 1 the Type I error of the described test is not larger than $\alpha$ and the Type II error goes to 0 when $n \to \infty$.
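A compression-based test of this kind is easy to try with an off-the-shelf compressor. The Python sketch below is an illustration under explicit assumptions: `zlib` is used as a stand-in for a universal code (it is lossless and therefore injective, but it is not claimed to satisfy the formal definition of universality), and the rejection rule is taken to be that the compressed length in bits is at most $n - \log_2(1/\alpha)$, that is, the codeword must be shorter than the input by at least $\log_2(1/\alpha)$ bits. The function name `compression_test` is hypothetical.

```python
import math
import os
import zlib


def compression_test(data, alpha):
    """Return True (reject H0) if data compress to at most n - log2(1/alpha) bits."""
    n_bits = 8 * len(data)
    code_bits = 8 * len(zlib.compress(data, 9))   # length of the "codeword" in bits
    threshold = n_bits - math.log2(1.0 / alpha)
    return code_bits <= threshold


print(compression_test(os.urandom(100_000), 0.001))  # operating-system randomness
print(compression_test(bytes(100_000), 0.001))       # 100 000 zero bytes
```

The intuition is a counting argument: an injective code can map only a small fraction of all $n$-bit words to codewords that are noticeably shorter than $n$ bits, so noticeable compression is unlikely under $H_0$. General-purpose compressors add headers and detect only certain kinds of regularity, so in practice such a test is conservative and may miss deviations that specialised tests detect.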
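Proof of Theorem 1. The well-known Shannon-McMillan-Breiman (SMB) theorem claims that, for a stationary ergodic source $\nu$ and any $\varepsilon > 0$, $\delta > 0$, there exists $n'$ such that
$$\nu\{x : h(\nu) - \varepsilon < -\log \nu(x_1 \ldots x_n)/n < h(\nu) + \varepsilon\} > 1 - \delta \qquad (13)$$
for $n > n'$, see [12]. From this we obtain the estimate (14) for $n > n'$. It will be convenient to define
$$\Phi_{\varepsilon,\delta,n} = \{x : h(\nu) - \varepsilon < -\log \nu(x_1 \ldots x_n)/n < h(\nu) + \varepsilon\}. \qquad (15)$$
From this definition and (14) we obtain (16). For any $x \in \Phi_{\varepsilon,\delta,n}$ define the set $\Lambda_x$ by (17). Note that, by definition, $|\Lambda_x| \leq |\Phi_{\varepsilon,\delta,n}|$, and from (16) we obtain (18). For any $\rho \in (0, 1)$ we define $\Psi_\rho \subset \Phi_{\varepsilon,\delta,n}$ such that
$$\nu(\Psi_\rho) = \rho \quad \text{and} \quad \nu(u) \geq \nu(v) \ \text{for any} \ u \in \Psi_\rho, \ v \in \Phi_{\varepsilon,\delta,n} \setminus \Psi_\rho. \qquad (19)$$
(That is, $\Psi_\rho$ contains the most probable words whose total probability equals $\rho$.) Let us consider any $x \in (\Phi_{\varepsilon,\delta,n} \setminus \Psi_\rho)$. Taking into account the definition (19) and (16), we can see that the inequality (20) holds for this $x$. So, from this inequality and (18) we obtain (21). From equations (14), (15) and the inequalities above we obtain the required inequality. Having taken into account that this inequality is valid for all positive $\varepsilon$, $\delta$ and $\rho$, we obtain the first statement of the theorem.
The proof of the second statement of the theorem is close to the previous one. First, in the setting of Theorem 2, for any $\varepsilon > 0$, $\delta > 0$ we define
$$\hat{\Phi}_{\varepsilon,\delta,n} = \{x : h(\nu) - \varepsilon < |\varphi(x_1 \ldots x_n)|/n < h(\nu) + \varepsilon\}. \qquad (24)$$
Note that from (11) we can see that there exists $n''$ such that, for $n > n''$,
$$\nu(\hat{\Phi}_{\varepsilon,\delta,n}) > 1 - \delta. \qquad (25)$$
We will use the set $\Phi_{\varepsilon,\delta,n}$ (see (15)). Having taken into account the SMB theorem (13) and (25), we can see that
$$\nu(\hat{\Phi}_{\varepsilon,\delta,n} \cap \Phi_{\varepsilon,\delta,n}) > 1 - 2\delta$$
if $n > \max(n', n'')$. From this point on, the proof repeats the proof of the first statement if we use the set $(\hat{\Phi}_{\varepsilon,\delta,n} \cap \Phi_{\varepsilon,\delta,n})$ instead of $\Phi_{\varepsilon,\delta,n}$. The only difference is in the definitions (17) and (19), which should be changed accordingly. If we replace $\pi_{NP}$ with $\pi_{\tau_\varphi}$ and $\delta$ with $2\delta$, we obtain the proof of the second statement. The theorem is proven.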

Table 1. Time-adaptive testing. Preliminary stage.