Towards More Efficient Rényi Entropy Estimation

Estimation of Rényi entropy is of fundamental importance to many applications in cryptography, statistical inference, and machine learning. This paper aims to improve the existing estimators with regard to: (a) the sample size, (b) the estimator adaptiveness, and (c) the simplicity of the analyses. The contribution is a novel analysis of the generalized “birthday paradox” collision estimator. The analysis is simpler than in prior works, gives clear formulas, and strengthens existing bounds. The improved bounds are used to develop an adaptive estimation technique that outperforms previous methods, particularly in regimes of low or moderate entropy. Last but not least, to demonstrate that the developed techniques are of broader interest, a number of applications concerning theoretical and practical properties of “birthday estimators” are discussed.

To state the problem formally, consider a fixed discrete distribution with probability mass function $p_X$. The Rényi entropy [1] of some fixed positive order $d \neq 1$ is defined as
$$H_d(p_X) = \frac{1}{1-d} \log_2 \sum_x p_X(x)^d. \qquad (1)$$
The challenge is to estimate this quantity from $n$ independent samples $X_1, \ldots, X_n \sim p_X$. More precisely, we seek an explicit function $H$ of the samples such that the approximation
$$H(X_1, \ldots, X_n) \approx H_d(p_X) \qquad (2)$$
holds with a small error, with high probability, and with a possibly minimal sample size $n$. We are interested in non-parametric estimation, as the distribution $p_X$ remains unknown. As in prior work on Rényi entropy estimation, we focus on integer orders. This is not limiting, because Rényi entropies of positive integer orders: (a) encode the complete information about the distribution [29], (b) are sufficient for practical applications due to known smoothing and interpolation properties [30,31], and finally (c) are more efficient to estimate from the algorithmic perspective [32].

1. Lack of simplicity and numerical precision. The analyses of the state-of-the-art estimators [32,34,35] struggle with the variance of collision estimators, which is tackled either by Poissonization approximations [32,34] (which carry their own overhead) or by involved combinatorics [35]. As a consequence, the variance bounds are available only in asymptotic "big-Oh" notation, hiding constants and higher-order dependencies (such as relations to the order d), and are not suitable for applications in statistics or cryptography, which demand precise formulas. This point was already raised in the context of applied works on physically unclonable functions [36].

2. Adaptiveness gap. The focus of prior research was on establishing bounds under the worst-case choice over all distributions [32,34]. This is overly pessimistic, because distributions arising in practical applications are structured differently from the worst case (known to be a mixture of a uniform and a Dirac distribution). The bounds were somewhat improved in the follow-up work [35], where a prior entropy bound is assumed. Still, a gap remains, as an entropy bound is usually not known prior to the actual experiment. In fact, obtaining such a bound might be more costly than the application it serves.

3. Lack of established techniques. Prior work focused on delivering asymptotic formulas and did not elaborate much on techniques that could help obtain simpler and tighter bounds. The difficulty of analyzing collision estimators is a recurring issue, well known to researchers working on property testing [37]. Do we have a systematic method of handling it?

Our Contribution
This work fills the aforementioned gaps with the following contributions:

1. Simpler and more accurate analysis of collision estimators, the main building blocks of the state-of-the-art Rényi entropy estimators. We analyze the collision estimators as kernel averages, with the technique of Hoeffding's decomposition. This brings the promised simplicity and improvement in accuracy.

2. Adaptive estimation of Rényi entropy, using no prior knowledge of the sampling distribution. The sample size cost of the presented algorithm is essentially optimal (up to a poly-logarithmic factor).

3. Modular approach using the established methods of U-statistics and Robust Mean Estimation. Specifically, we point out that the moment estimation problem, to which Rényi entropy estimation reduces, can be seen as the estimation of certain U-statistics. While the dedicated statistical theory provides the bias-variance analysis, the confidence can be independently boosted by techniques of Robust Mean Estimation.
This paper aims to solve the bottlenecks identified above and, in this way, to close the gap between the theory-oriented state of the art and the demand coming from applied researchers and their practical use cases, such as [36].

Organization
The notation and preliminary concepts are discussed in Section 2. The technical results and applications are presented in Section 3. The proofs are discussed in Appendix A, and the work is concluded in Section 4.

Basic Notation
Throughout the paper, $p_X$ is the probability mass function of a fixed discrete distribution over an alphabet of size $K$, and $X_1, \ldots, X_n$ are observed independent samples from $p_X$. We denote $[n] = \{1, \ldots, n\}$ and let $\binom{[n]}{d}$ denote the collection of all $d$-element subsets of $[n]$.

Estimation of Entropy, Moments, and Collisions
We leverage the following observation from prior work: the task of Rényi entropy estimation is equivalent to the task of moment estimation. More precisely, the $d$-th moment of the probability mass function $p_X$ is defined as
$$P_d = \sum_x p_X(x)^d, \qquad (3)$$
and then, immediately from the definition, we have the following result.

Proposition 1 (Equivalence of Entropy and Moment Estimation). An estimate $\widehat{H}$ of the Rényi entropy of order $d$ (defined in (1)) has an additive error of $\epsilon$ if and only if $2^{(1-d)\widehat{H}}$ is an estimate of the $d$-th moment (defined in (3)) with a relative error of $\epsilon' = 2^{(d-1)\epsilon} - 1$.
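To make the reduction explicit, here is a sketch of the one-line calculation behind Proposition 1 (the signs track the worst case):

```latex
% From (1), H_d = \tfrac{1}{1-d}\log_2 P_d, equivalently P_d = 2^{(1-d)H_d}.
% An additive entropy error \epsilon thus becomes a multiplicative moment error:
\widehat{H} = H_d \pm \epsilon
\;\Longleftrightarrow\;
2^{(1-d)\widehat{H}} = P_d \cdot 2^{\mp(d-1)\epsilon},
\qquad
\Bigl|\tfrac{2^{(1-d)\widehat{H}}}{P_d} - 1\Bigr| \;\le\; 2^{(d-1)\epsilon} - 1 .
```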
Solving the (equivalent) problem of moment estimation is more convenient due to the beautiful representation of moments as collision probabilities. More precisely, we have
$$P_d = \Pr[X_1 = X_2 = \cdots = X_d],$$
where $X_1, \ldots, X_d$ are independent samples from $p_X$ (a generalization of the birthday paradox, hence the name "birthday estimator").

U-Statistics
For a symmetric real function $h$ of $d$ arguments, the U-statistic with kernel $h$ of the sample $X_1, \ldots, X_n$ is defined as
$$U_n = \binom{n}{d}^{-1} \sum_{1 \le i_1 < \cdots < i_d \le n} h(X_{i_1}, \ldots, X_{i_d}).$$
The U-statistic gives an unbiased estimate of the kernel expectation $\mathbf{E}\, h(X_1, \ldots, X_d)$, hence its name. U-statistics were invented by Hoeffding [38] to extend certain results, such as concentration bounds, to sums of partly dependent terms. Many statistical quantities can be related to U-statistics, for example moments or sample variances [39]. In the same spirit, we will view estimators of the collision probability in Equation (3) as U-statistics and use this connection to establish their desired properties.
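As an illustration, here is a direct transcription of the definition (exponential-time, for exposition only; the function names are ours, not taken from the supplementary code [41]):

```python
from itertools import combinations
from statistics import fmean

def u_statistic(samples, h, d):
    """Average the symmetric kernel h over all d-element subsets of the
    sample; this is the U-statistic U_n (O(n^d) time, exposition only)."""
    return fmean(h(*subset) for subset in combinations(samples, d))

# The collision kernel recovers the "birthday" moment estimator of P_d:
def collision_kernel(*xs):
    return 1.0 if len(set(xs)) == 1 else 0.0
```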

Robust Mean Estimation
It is difficult to directly obtain high confidence bounds (such as those of the Chernoff-Hoeffding type) for moment estimators. Instead, we will boost weaker bounds obtained from bias-variance analyses. To this end, we combine independent runs of estimators into high-confidence bounds using the technique of Robust Mean Estimation.
Estimating the mean of a distribution from i.i.d. samples is not trivial: the "obvious" empirical mean is inaccurate for heavy-tailed distributions. Following the recent survey [40], we mention here two solutions:
• the median-of-means approach organizes the data (such as independent algorithm outputs) into blocks and computes the median of the within-block means;
• the trimmed mean approach takes the mean of the independent runs, excluding a certain fraction of the smallest and biggest outcomes (removing outliers).
We note that any robust mean estimation method can be used to achieve confidence boosting. In this work, we stick to the median-of-means, sketched below; its standard guarantee [40] is that, with $8\log(1/\delta)$ blocks, the median of the block means deviates from the true mean by $O\!\big(\sigma\sqrt{\log(1/\delta)/n}\big)$ with probability at least $1 - \delta$.
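A minimal sketch of the median-of-means combiner (the block count is the standard choice; the names are ours):

```python
import statistics

def median_of_means(values, n_blocks):
    """Median-of-means: split values into n_blocks contiguous blocks,
    average within each block, and return the median of the block means.
    With n_blocks ~ 8*log(1/delta), a constant-confidence mean estimate
    is boosted to confidence 1 - delta."""
    if not 1 <= n_blocks <= len(values):
        raise ValueError("need 1 <= n_blocks <= len(values)")
    size = len(values) // n_blocks          # remainder dropped for simplicity
    means = [statistics.fmean(values[i * size:(i + 1) * size])
             for i in range(n_blocks)]
    return statistics.median(means)
```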

Moment Bounds
We will need some bounds on moments of probability distributions, in order to simplify formulas that arise from variance analysis. Specifically, we will use these auxiliary results to express higher-order moments in terms of moments of small order (d = 2 and d = 3).

Results
Following the convention from prior work, our results are stated for moment estimation, which is equivalent to entropy estimation as discussed in Proposition 1. Throughout the rest of the paper, we keep this reduction in mind.

Simpler & More Accurate Moment Estimation
The first novelty offered in the current work is a simplified and strengthened variance analysis of the state-of-the-art moment estimator, presented in Algorithm 1. Unlike prior work, we write the estimator output as the average of the collision kernel $h(x_{i_1}, \ldots, x_{i_d}) = \mathbf{I}(x_{i_1} = \cdots = x_{i_d})$ over the $d$-element subsets of the sample. This approach is not only more readable but ultimately also more accurate, as it links the task of moment estimation to the established theory of U-statistics [38].
On top of that comes the refined high-confidence moment estimator in Algorithm 2, built on robust mean estimators [40]. Due to this modularity, it uses fewer samples than the direct approach from prior work [34].
Theorem 1 (Bias-Variance Analysis of Moment Estimator). With the notation as above, the output of Algorithm 1 is unbiased,
$$\mathbf{E}[\widehat{P}_d] = P_d,$$
and its variance equals
$$\mathbf{Var}[\widehat{P}_d] = \binom{n}{d}^{-1} \sum_{k=1}^{d} \binom{d}{k}\binom{n-d}{d-k}\left(P_{2d-k} - P_d^2\right).$$
In particular, for any $n \ge d$ and $\epsilon > 0$, Chebyshev's inequality turns this into a bound on the relative error confidence:
$$\Pr\left[|\widehat{P}_d - P_d| \ge \epsilon P_d\right] \le \frac{\mathbf{Var}[\widehat{P}_d]}{\epsilon^2 P_d^2}.$$

Remark 1 (Efficient Implementation). Birthday estimators can be computed efficiently, in $O(n)$ memory and one pass over the samples, by using a dictionary to count empirical frequencies of observed elements. Such an implementation is given in the supplementary material [41].
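For concreteness, here is a sketch of the counting implementation from Remark 1 (function name is ours). It relies on the identity that exactly $\binom{N_x}{d}$ of the $d$-element subsets consist of $d$ copies of the symbol $x$, where $N_x$ is its empirical count:

```python
from collections import Counter
from math import comb

def birthday_moment_estimate(samples, d):
    """One-pass, O(n)-memory version of Algorithm 1: the fraction of
    d-element subsets of the sample whose entries are all equal."""
    n = len(samples)
    if n < d:
        raise ValueError("need at least d samples")
    counts = Counter(samples)                      # empirical frequencies
    colliding = sum(comb(c, d) for c in counts.values())
    return colliding / comb(n, d)                  # unbiased estimate of P_d
```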

Remark 2 (Structural Assumptions).
The bounds from Theorem 1 use the statistic $P_d$ of the sampling distribution. This explicit dependency is beneficial, as the further discussion clarifies. Lacking any prior knowledge, it can be bounded by the worst-case value in terms of the alphabet size, namely $P_d \ge K^{1-d}$ (attained by the uniform distribution).
We see that a choice of $n$ of order $d^2 \epsilon^{-2} P_d^{-1/d}$ already ensures constant confidence (using $P_{2d-1} \le P_d^{2-1/d}$ in Theorem 1). Higher confidence (smaller $\delta$) can be handled with the method of Robust Mean Estimation.

Theorem 2 (High-Confidence Moment Estimator). For any $\epsilon > 0$, Algorithm 2 approximates $P_d$ with probability $1 - \delta$ and a relative error of $\epsilon$, provided that
$$n = \Theta\left(d^2 \epsilon^{-2} \log(1/\delta)\, P_d^{-1/d}\right).$$

Remark 3. The constant can be refined a little based on the methods from [42].
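Schematically, Algorithm 2 composes the two ingredients above; a sketch assuming `birthday_moment_estimate` from the earlier block (each block estimate is a kernel mean, so taking the median of the blocks is exactly median-of-means boosting):

```python
import math
import statistics

def high_confidence_moment(samples, d, delta):
    """Algorithm 2 sketch: run the birthday estimator on ~8*log(1/delta)
    disjoint blocks and return the median of the block estimates."""
    n_blocks = max(1, math.ceil(8 * math.log(1 / delta)))
    size = len(samples) // n_blocks
    estimates = [birthday_moment_estimate(samples[i * size:(i + 1) * size], d)
                 for i in range(n_blocks)]
    return statistics.median(estimates)
```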
When comparing these results with prior work, we review the following aspects:

1. Novel techniques of broader interest. We recall that variance formulas have been challenging for prior works on entropy estimation ([32,34] resorted to Poisson approximations, while [35] gave an involved combinatorial argument), even for the case d = 2 (the lack of a sharp analysis caused many difficulties in property testing [37]). As opposed to these ad hoc approaches, we establish the formula in a simple and direct manner, pointing out that such formulas can be obtained by the techniques of U-statistics. When discussing applications, we will further benefit from these tools.

2. Clean and improved formulas. Our confidence bound does not involve any implicit constants, while prior works in their main statements have unspecified dependencies on $d$ (essentially, hiding more than absolute constants). We compare the accuracy bounds from this and prior works in Table 1 below. Our bound is strictly better, given that $P_d$ is minimized at the uniform distribution and thus $P_d^{-1/d} \le K^{1-1/d}$ (with a large gap when the distribution is far from uniform); the explicit dependency on $d$ in our bound is $4d^2$. Leveraging the theory of U-statistics, we will show that the factor $O(d^2)$ is optimal, which is also a novel contribution.

Table 1. The performance of the "birthday estimator" of moments. In the formulas, $\epsilon \in (0, 1)$ is the relative error, $n$ is the sample size, and $1 - \delta$ is the confidence (the probability that $|\widehat{P}_d - P_d| \le \epsilon P_d$).

Adaptive Estimation
As per our variance analysis, the performance actually depends not on the alphabet size $K$ (that is ultimately the pessimistic bound) but rather on a more fine-grained statistic of $X$, namely $P_d$. Following the result in Theorem 2, we could hope for a moment estimation with sample size
$$n = \Theta\left(d^2 \epsilon^{-2} \log(1/\delta)\, P_d^{-1/d}\right). \qquad (12)$$
The obvious obstacle is that, in general, we do not know $P_d$ in advance. We solve this problem by developing an adaptive algorithm. It does not assume the right number of samples in advance but proceeds gradually and eventually terminates with high probability, giving the answer within the desired margin of error and using only a few more samples than the ideal bound conjectured above. Its core is a subroutine that guesses the moment value, gradually refining the candidate.

Lower-Bounding Moments
The key ingredients of our approach are the following two subroutines: Algorithm 3 tests, based on samples, whether the moment is smaller or bigger than a proposed candidate; subsequently, Algorithm 4 loops the tester over a grid of candidate values.
The correctness of the approach is guaranteed by the lemmas stated below.
(Algorithm 4 loops the tester of Algorithm 3 over the candidate grid Q = 1, 1/2, 1/4, …, recycling already-drawn samples $X_i$ between calls, and returns the final candidate Q; see the sketch after Lemma 2.)

Lemma 1 (Moment Testing). Let $\widehat{P}$ be any estimator of $P_d$ which, for some function $C(\cdot)$ and any $\epsilon > 0$, $1 > \delta > 0$, achieves a relative error of $\epsilon$ with probability $1 - \delta$ when given $C(\delta)\epsilon^{-2} P_d^{-1/d}$ samples. Then Algorithm 3, when given at least $n(Q, \delta) = 4C(\delta) Q^{-1/d}$ samples of $X$, with probability $1 - \delta$, outputs TRUE when $P_d \le Q/2$ and FALSE when $P_d \ge 2Q$.
Lemma 2 (Moment Bounding). With the same $\widehat{P}$ as in Lemma 1, with probability $1 - \delta$, Algorithm 4 terminates after using at most $n = 4C\big(\delta/((d-1)\log K)\big)\, P_d^{-1/d}$ samples of $X$, and its output $Q$ satisfies $Q/2 \le P_d \le 4Q$.
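The following sketch mirrors Algorithms 3 and 4 under the premise of Lemma 1 (a black-box estimator with relative error below 1/2 on the stated sample budget). Here `draw_samples` and `C` are stand-ins for the sampling oracle and the function C(·), and a real implementation would recycle samples between calls, as the algorithm states:

```python
import math

def test_moment(estimate_fn, samples, Q):
    """Algorithm 3 sketch: with relative error < 1/2, the estimate falls
    below Q whenever P_d <= Q/2 and reaches Q whenever P_d >= 2Q."""
    return estimate_fn(samples) < Q

def bound_moment(estimate_fn, draw_samples, d, K, delta, C):
    """Algorithm 4 sketch: halve the candidate Q until the tester rejects.
    Lemma 2: with probability 1 - delta, the output satisfies
    Q/2 <= P_d <= 4Q."""
    steps = max(1, math.ceil((d - 1) * math.log2(K)))  # Q never drops below K^{1-d}
    per_step_delta = delta / steps                     # union bound over the loop
    Q = 1.0
    for _ in range(steps):
        n = math.ceil(4 * C(per_step_delta) * Q ** (-1 / d))
        if not test_moment(estimate_fn, draw_samples(n), Q):
            break
        Q /= 2
    return Q
```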

Construction of Adaptive Estimator
Armed with Lemma 2, we are ready to analyze adaptive estimation. The algorithm is the same as in the non-adaptive case; Lemma 2 is needed only to adjust the sample size.

Theorem 3 (Adaptive Moment Estimation). For any $\epsilon > 0$ and $1 > \delta > 0$, the adaptive estimator approximates $P_d$, using no prior knowledge of the distribution, with probability $1 - \delta$ and a relative error of $\epsilon$, using $n = \Theta\!\left(d^2 \epsilon^{-2} \log\!\big((d \log K)/\delta\big)\, P_d^{-1/d}\right)$ samples.
Remark 4 (Adaptive Overhead). The sample size for adaptive estimation differs from the "dream bound" in (12) by a (very small) factor of log log K.

Remark 5 (Sample Complexity Guarantees).
Observe that the algorithm is not guaranteed to achieve the "good" sample complexity every time, but rather with high probability. This is a minor issue, inherent to the concept of adaptive estimation, and it does not much affect performance in practical applications. Namely, we can always cap the total number of samples at the pessimistic level from prior work and fall back to fixed-size estimation should the adaptive estimation exceed the limit. This, however, happens with small probability $\delta$, which can be decreased further with little overhead.
Remark 6 (Comparison to Prior Work). We give a clear comparison of our adaptive estimation and prior work in Table 2 below. We always have $P_d \ge Q$ and $P_d^{-1/d} \le K^{1-1/d}$. Furthermore, usually $P_d/Q$ is much bigger than 1 (because in practice we do not know a prior lower bound in advance) and $P_d^{-1/d}$ is much smaller than $K^{1-1/d}$ (this gap can be as big as $K^{\Omega(1)}$, which is a huge factor for some applications; for example, in cryptography, we consider alphabets as big as $K = 2^{256}$). Thus, our bound outperforms the previous approaches for typical use cases. Table 2. The performance of moment (Rényi entropy) estimators. In the formulas, $K$ is the alphabet size, $\epsilon$ is the relative error, and the confidence is $1 - \delta$.

One-Sided Estimation: Random Sources for Cryptography
The collision probability $P_d$ for $d = 2$ plays an important role in cryptography: it quantifies the amount of randomness that can be extracted from a distribution [28]. For that extraction to work, $P_2$ should be small enough. Specifically, $P_2 \le 2^{-k}$ allows for the extraction of nearly $k$ bits of cryptographic quality. How can we check whether $P_2 \le 2^{-k}$?
To solve the problem, we adapt Algorithm 4 by adding early stopping; namely, we quit the loop once $Q \le 2^{-k}$. We take $n = O\!\left(\log(k/\delta)\, 2^{k/2}\right)$ samples.
It remains for us to show that the algorithm behaves as desired. By the guarantees in Lemma 1 and the union bound over at most $k$ steps, with probability $1 - \delta$, we have the following: when $P_2 < 2^{-k-1}$, the algorithm finishes with $Q \le 2^{-k}$; and when $P_2 > 2^{-k+1}$, the algorithm finishes with $Q \ge 2^{-k}$. This can be generalized to one-sided estimation for any $d$, where the goal is to decide whether $P_d = \Omega(2^{-k})$ or $P_d = O(2^{-k})$.
This one-sided estimation saves samples by testing only to the extent necessary. In cryptography, we do not have to estimate the whole entropy (which may be more costly, even with adaptive estimation) but only as much as suffices for the chosen application.
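A sketch of the early-stopping variant for d = 2, reusing `test_moment` from the previous sketch (`draw_samples` and `C` are again stand-ins):

```python
import math

def enough_extractable_bits(estimate_fn, draw_samples, k, delta, C):
    """One-sided test sketch: decide whether P_2 <= 2^{-k} (nearly k
    extractable bits). The candidate loop quits as soon as Q hits 2^{-k},
    so only n = O(log(k/delta) * 2^{k/2}) samples are ever requested."""
    per_step_delta = delta / k          # union bound over at most k steps
    Q = 1.0
    while Q > 2.0 ** (-k):
        n = math.ceil(4 * C(per_step_delta) * Q ** (-1 / 2))
        if not test_moment(estimate_fn, draw_samples(n), Q):
            return False                # P_2 is too large
        Q /= 2
    return True                         # reached Q <= 2^{-k}
```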

Birthday Estimators Are UMVUE
The acronym UMVUE stands for uniformly minimum variance unbiased estimator. We prove this conceptually strong and interesting characterization, which essentially shows that the birthday estimator in Theorem 1 is variance-optimal among all unbiased estimators. The argument is inspired by our variance analysis, viewing the estimator as a U-statistic.

Corollary 1 (Birthday estimators are UMVUE). Let $\widehat{P}_d$ be as in Algorithm 1 and let $\widetilde{P}_d$ be any other estimator of $P_d$ that is unbiased for every distribution of $X$. Then we have $\mathbf{Var}[\widehat{P}_d] \le \mathbf{Var}[\widetilde{P}_d]$.
Proof. We will use the known result due to Lehmann and Scheffé, which states that if an unbiased estimator is a function of a complete and sufficient statistic, then it has the smallest possible variance [43,44].
To apply this result, without loss of generality (it is a matter of encoding the alphabet), we assume that $X$ takes values in the set $\{1, \ldots, K\}$. Consider the sample $X_1, \ldots, X_n$, and let $\sigma$ be the rearrangement such that $X_{\sigma(1)} \le X_{\sigma(2)} \le \cdots \le X_{\sigma(n)}$ (this is called the order statistic). The estimator $\widehat{P}_d$ can be seen as the average of the symmetric function $h(X_{i_1}, \ldots, X_{i_d}) = \mathbf{I}(X_{i_1} = X_{i_2} = \cdots = X_{i_d})$ over tuples $i_1 < i_2 < \cdots < i_d$, and thus is also a function of $T = (X_{\sigma(1)}, X_{\sigma(2)}, \ldots, X_{\sigma(n)})$. The claim follows if we prove that $T$ is sufficient and complete (as a sample statistic).
Order statistics are sufficient for univariate i.i.d. samples. This is because the conditional distribution
$$\Pr[X_1 = x_1, \ldots, X_n = x_n \mid T = t]$$
is uniform over the rearrangements of $t$ and hence does not depend on $p_X$. Thus, $X_{\sigma(1)}, X_{\sigma(2)}, \ldots, X_{\sigma(n)}$ carries the same information about the distribution as the full sample. The completeness of $T$ means that there are no non-trivial unbiased estimators of zero; equivalently, if $\mathbf{E} f(T) = 0$ for all sampling distributions and some function $f$, then $\Pr[f(T) = 0] = 1$. Suppose then that $\mathbf{E} f(T) = 0$ holds for every finitely supported sampling distribution $X$. Let $X$ take values $i \in I$ with probability $p_i$. Then, extending $f$ symmetrically to unordered tuples, the assumption implies
$$\sum_{i_1, \ldots, i_n \in I \times \cdots \times I} p_{i_1} \cdots p_{i_n} f(i_1, \ldots, i_n) = 0$$
for every distribution $(p_i)$. The left-hand side is a multivariate polynomial in the variables $p_i$, which evaluates to zero on the entire simplex of dimension $n - 1$. Thus, its coefficients must be zero, which implies $f(i_1, \ldots, i_n) = 0$ for each tuple $i_1, \ldots, i_n$ and proves that $T$ is complete.

Central Limit Theorem for Birthday Estimators
We again represent the estimator as an average over tuples:
$$\widehat{P}_d = \binom{n}{d}^{-1} \sum_{i_1 < \cdots < i_d} h(X_{i_1}, \ldots, X_{i_d}), \qquad h(x_1, \ldots, x_d) = \mathbf{I}(x_1 = \cdots = x_d).$$
We view the whole expression as the U-statistic with the kernel function $h$. Then we show the following strong result (below, $\mathcal{N}(0, \sigma^2)$ denotes the normal distribution with zero mean and variance $\sigma^2$).

Corollary 2 (Asymptotic Normality). For $n \to +\infty$ it holds that
$$\sqrt{n}\left(\widehat{P}_d - P_d\right) \xrightarrow{d} \mathcal{N}\!\left(0,\; d^2\left(P_{2d-1} - P_d^2\right)\right).$$
The proof utilizes the classical convergence results for U-statistics [38] together with the derivation of our variance formula. Note that the result says that the central limit theorem applies, despite the fact that the summands are correlated. Clearly, the result is interesting in its own right, particularly because (a) it proves that our constant $O(d^2)$ is sharp, and (b) it can be used more generally to benchmark other proposed bounds, by comparison with the asymptotic Gaussian tail.
However, we would like to point out an application to applied statistical research. In [36], the Rényi entropy of order $d = 2$ was estimated for the output distributions of physically unclonable functions (PUFs), which are important in the field of cryptography. However, that methodology lacks statistical rigor: for the authors' needs, prior work on Rényi entropy estimation was insufficient in terms of clarity on constants; thus, they resorted to a naive application of the central limit theorem, which can give very biased results.
A more solid alternative would be to use the above corollary to (a) justify the soundness, at least in the regime of large n, and (b) establish a more robust estimation of the variance.
Proof of Corollary 2. The limiting variance equals $d^2 \sigma_1^2$ [39], with
$$\sigma_1^2 = \mathbf{Cov}\left[h(X_{i_1}, \ldots, X_{i_d}),\, h(X_{j_1}, \ldots, X_{j_d})\right],$$
where the tuples $(i_1, \ldots, i_d)$ and $(j_1, \ldots, j_d)$ have exactly $k = 1$ element in common. We analyze this expression when proving Theorem 1 and know that it equals $P_{2d-1} - P_d^2$. The claimed formula now follows.

Adaptive Testing in Evaluation of PUFs
Here we discuss again an application to [36], but from a different perspective. As explained by the authors, estimating the Rényi entropy of PUFs is a serious bottleneck: the alphabet is huge, which limits the experiment scope, even on computational clusters [36]. We would like to point out that part of these difficulties can be addressed by our adaptive estimation. In fact, PUFs provide an excellent use case, since their entropy is quite low; therefore, the moment $P_d$ in Theorem 3 is much bigger than its pessimistic alphabet-size bound, which translates into a much smaller sample size. We discuss this application in full detail in a follow-up work.

Applications to Property Testing
The estimator from Algorithm 1 was first studied in [45], but the variance bounds obtained there were not sharp. Curiously, in the ongoing research on closeness testing, birthday-like collision estimators (used as subroutines for uniformity checking) seemed to be suboptimal [46] until, very recently, the work [37] re-examined the variance formula for $d = 2$ and showed that it achieves (in our notation) the optimal dependence on $K$ and $\epsilon$. Thus, a breakthrough was possible thanks to a specialized version of (8). In this discussion, we would like to (a) point out that the general variance formula can likely have similar applications and impact for $d > 2$ and should be of broader interest, and (b) comment on a minor gap in an early version of the proof of the central result in [37]. Lemma 2.3 in [37], which establishes the variance bound for $d = 2$, is the key ingredient of the main results. The authors derive an expression bounding, in our notation, the variance in Theorem 1 for the case $d = 2$. In doing so, they face the term $n(n-1)(n-2)(P_3 - P_2^2)$ (in our notation) and bound it by $n^3(P_3 - P_2^2)$ (the last line of the derivation claims the upper bound, the following remark the lower bound). This reasoning, however, is only valid when $P_3 - P_2^2$ is non-negative, which is indeed the case by our Proposition 3.

Application to Statistical Inference
We will use Algorithm 1 to efficiently test whether a given distribution is close or far from a uniform one. The procedure described below is asymptotically equivalent but numerically superior to the one from [37].
Denote by $K$ the alphabet size and let $\gamma$ be such that $P_2 = \frac{1}{K} + \gamma$. We see that
$$\sum_x \left(p_X(x) - \tfrac{1}{K}\right)^2 = P_2 - \tfrac{1}{K} = \gamma,$$
which shows that $\gamma$ measures the squared $\ell_2$-distance between $p_X$ and the uniform distribution. For convenience, we will refer to $\gamma$ as the collision gap. Define
$$\widehat{\gamma} = \widehat{P}_2 - \tfrac{1}{K}.$$
This estimator gives an unbiased approximation of $\gamma$, because $\mathbf{E}[\widehat{\gamma}] = \mathbf{E}[\widehat{P}_2] - \frac{1}{K} = P_2 - \frac{1}{K} = \gamma$. Furthermore, its variance equals the variance of $\widehat{P}_2$, because $\frac{1}{K}$ is a deterministic constant. By Chebyshev's inequality, with probability $1 - \delta$, we have
$$|\widehat{\gamma} - \gamma| \le \sqrt{\mathbf{Var}[\widehat{P}_2]/\delta}.$$
We now analyze the variance in more detail. We have $P_2 = \frac{1}{K} + \gamma$ and $0 \le P_3 - P_2^2 \le \gamma^{3/2} + \frac{\gamma}{K}$ by Propositions 3 and 4. Therefore, by Theorem 1 we obtain, for $n \ge 2$,
$$\mathbf{Var}[\widehat{P}_2] \le \frac{4}{n}\left(\gamma^{3/2} + \frac{\gamma}{K}\right) + \frac{4}{n^2}\left(\frac{1}{K} + \gamma\right).$$
Thus, with probability $1 - \delta$, we have
$$|\widehat{\gamma} - \gamma| \le \sqrt{\frac{1}{\delta}\left(\frac{4}{n}\left(\gamma^{3/2} + \frac{\gamma}{K}\right) + \frac{4}{n^2}\left(\frac{1}{K} + \gamma\right)\right)}.$$
This two-sided inequality allows us to bound the range of values of $\gamma$ consistent with the observed statistic $\widehat{\gamma}$, which yields high-confidence bounds for $\gamma$.
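A sketch of the resulting uniformity test follows. Inverting the two-sided inequality exactly requires solving for γ; as a simplification, the sketch plugs the (clipped) estimate into the variance bound (`birthday_moment_estimate` is the earlier sketch; the function name is ours):

```python
import math

def collision_gap_interval(samples, K, delta):
    """Estimate the collision gap gamma = P_2 - 1/K and return a
    Chebyshev-based (1 - delta) confidence interval, using the bound
    Var <= (4/n)(g^{3/2} + g/K) + (4/n^2)(1/K + g) from the text."""
    n = len(samples)
    gap_hat = birthday_moment_estimate(samples, 2) - 1.0 / K
    g = max(gap_hat, 0.0)               # plug-in proxy for the true gamma
    var_bound = 4.0 / n * (g ** 1.5 + g / K) + 4.0 / n ** 2 * (1.0 / K + g)
    radius = math.sqrt(var_bound / delta)
    return gap_hat - radius, gap_hat + radius
```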
We illustrate the procedure numerically on real-world datasets. Data and results are summarized in Figure 1a,b. Our method confirms non-uniformity in both cases and provides confidence intervals. The details of the experiment are shared in the supplementary material [41].

Application to Entropy Estimation
The following experiment illustrates advantages of adaptive entropy estimation for distributions with large support and relatively low entropy, such as Zipf's law.
Let $X$ follow Zipf's law with parameter $s = 1.1$ and a support of $K = 10^4$ elements. By numerical calculation, we find that $P_2 \approx 0.40005$. Consider now the task of estimating the entropy of $X$ from samples. Theorem 3 allows us to save a large factor of about $K^{1/2} = 10^2$ in the number of samples. Calculations show that, on a sample of size about $n = 10{,}000$, Algorithm 2 finds an approximation $\widehat{P}_2 = 0.39898$ with a relative error of $\epsilon = 1/2$ and confidence $1 - \delta = 0.95$. The details appear in the supplementary material [41].

Figure 1. (a) For this dataset, $P_2 \approx 0.14794$ and $1/K \approx 0.14286$, so that the collision gap equals $\gamma \approx 0.00508$; our method gives the 99% confidence interval (0.00500, 0.00516). (b) Births from insurance claims (source: courtesy of Roy Murphy). For this dataset, $P_2 \approx 0.08348$ and $1/K \approx 0.08333$, so that the collision gap equals $\gamma \approx 0.00015$; our method gives the 99% confidence interval (0.00009, 0.00035).

Conclusions
This work simplifies the variance analysis of collision estimators, establishing exact closed-form formulas and improving upon prior data-oblivious bounds by making them dependent on certain statistics of the data. In particular, we use the derived formulas to estimate Rényi entropy adaptively, to characterize the asymptotic behavior of the estimator, and to give further applications.
Numerical experiments highlight the importance of the dependency of the sample size on the confidence. The constants involved affect the confidence exponentially, so further improvements are of significance for many real-world inference problems. This is left for future research.
Funding: This research received no external funding.

Institutional Review Board Statement: Not applicable.
Data Availability Statement: The data and code are shared in the GitHub repository [41].

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Proofs
Appendix A.1. Proof of Proposition 3
For convenience, introduce the function $f(u) = u^{d-1}$ and let $Z$ be a random variable taking value $p_i$ with probability $p_i$. Then we have
$$P_{2d-1} - P_d^2 = \mathbf{E}[f(Z)^2] - \left(\mathbf{E} f(Z)\right)^2 = \mathbf{Var}[f(Z)] \ge 0.$$
Since $0 \le p_i \le 1$ and $1 \le k \le d$, we have $p_i^{2d-1} \le p_i^{2d-k}$. We finally obtain
$$0 \le P_{2d-1} - P_d^2 \le P_{2d-k} - P_d^2.$$
This can be further sharpened using bounds on $P_{2d-1}$ in terms of lower-order moments.

Appendix A.2. Proof of Proposition 4
The property in Proposition 4 is equivalent to the monotonicity of $p$-norms; for a proof of the latter, see, for example, [47].
Appendix A.3. Proof of Theorem 1
From Proposition A1, it is straightforward to see that the estimator is unbiased.
Proposition A2 (Estimator Is Unbiased). For any tuple $\mathbf{i}$ we have $\mathbf{E}\, C_{\mathbf{i}} = \sum_x p_X(x)^d = P_d$. In particular, the estimator $\widehat{P}_d$ of the $d$-th moment is unbiased.
In order to establish a variance formula, we analyze the covariance of the terms C i .
Proposition A3 (Covariance Formulas). Let $\mathbf{i} = (i_1, \ldots, i_d)$ and $\mathbf{j} = (j_1, \ldots, j_d)$ be tuples of distinct indices. Suppose that exactly $k \ge 0$ of the entries in $\mathbf{i}$ collide with entries in $\mathbf{j}$, that is $|\mathbf{i} \cap \mathbf{j}| = k$. Then we have
$$\mathbf{E}[C_{\mathbf{i}} C_{\mathbf{j}}] = P_{2d-k} \quad \text{for } k \ge 1 \quad (\text{and } P_d^2 \text{ for } k = 0),$$
which implies that
$$\mathbf{Cov}[C_{\mathbf{i}}, C_{\mathbf{j}}] = P_{2d-k} - P_d^2 \quad \text{for } k \ge 1.$$
Remark A1 (Overlaps imply positive correlation). Note that $P_{2d-k} \ge P_d^2$ by Jensen's inequality. Thus, we have positive correlation. This can also be shown by the FKG correlation inequality [48].
Proof. We first prove the formula for $\mathbf{E}[C_{\mathbf{i}} C_{\mathbf{j}}]$. Consider the case $k = 0$, which means that $\mathbf{i}$ and $\mathbf{j}$ share no common index; the formula is then true because the $2d$ random variables $X_i, X_j$ for $i \in \mathbf{i}$ and $j \in \mathbf{j}$ are independent (lack of collisions among the indices), so that $\mathbf{E}[C_{\mathbf{i}} C_{\mathbf{j}}] = \mathbf{E}[C_{\mathbf{i}}]\,\mathbf{E}[C_{\mathbf{j}}]$. Consider now the case $k > 0$, which means that $\mathbf{i}$ and $\mathbf{j}$ overlap; on the event $C_{\mathbf{i}} C_{\mathbf{j}} = 1$ we have $X_i = X_j$ for all $i, j \in \mathbf{i} \cup \mathbf{j}$. Conditioning on the common value $x$ of the shared variables, we obtain
$$\Pr[C_{\mathbf{i}} C_{\mathbf{j}} = 1] = \sum_x p_X(x)^{2d-k},$$
because we have exactly $2d - k$ distinct variables $X_i$ or $X_j$, all of which must equal $x$. The formula follows by aggregating over the possible values of $x$.
The covariance formula now follows by combining the above with Proposition A1:
$$\mathbf{Cov}[C_{\mathbf{i}}, C_{\mathbf{j}}] = \mathbf{E}[C_{\mathbf{i}} C_{\mathbf{j}}] - \mathbf{E}[C_{\mathbf{i}}]\,\mathbf{E}[C_{\mathbf{j}}] = P_{2d-k} - P_d^2.$$
Finally, the covariance bounds from Propositions A1 and A3 remain to be combined. We need the following fact, which counts how many pairs of tuples realize each overlap pattern in the covariance formula.
Proposition A4 (Number of terms). There are $\binom{n}{d}\binom{d}{k}\binom{n-d}{d-k}$ pairs of distinct tuples $\mathbf{i}$ and $\mathbf{j}$ that satisfy $|\mathbf{i} \cap \mathbf{j}| = k$. The number of choices of the union $\mathbf{i} \cup \mathbf{j}$ equals $\binom{n}{2d-k}$.

Proof. Recall that $\mathbf{i}$ and $\mathbf{j}$ are $d$-combinations out of $[n]$. To enumerate the pairs with $|\mathbf{i} \cap \mathbf{j}| = k$, it suffices to choose $\mathbf{i}$ in $\binom{n}{d}$ ways, then the $k$ common elements in $\binom{d}{k}$ ways, and then the remaining elements of $\mathbf{j} \setminus \mathbf{i}$ in $\binom{n-d}{d-k}$ ways. This gives the formula.

Combining Propositions A3 and A4, we have
$$\mathbf{Var}[\widehat{P}_d] = \binom{n}{d}^{-2} \sum_{\mathbf{i}, \mathbf{j}} \mathbf{Cov}[C_{\mathbf{i}}, C_{\mathbf{j}}] = \binom{n}{d}^{-1} \sum_{k=1}^{d} \binom{d}{k}\binom{n-d}{d-k}\left(P_{2d-k} - P_d^2\right).$$

Appendix A.4. Proof of Lemmas 1 and 2
Observe that the internal loop can take at most $(d-1)\log K$ steps. This is because we decrease the candidate bound $Q$ in each step by a factor of 2, starting from $Q = 1$, down to the smallest possible value of $\frac{1}{K^{d-1}}$ (the smallest possible moment value, achieved by the uniform distribution). Therefore, when we set $\delta' = \delta/((d-1)\log K)$ in the subroutine, the guarantees from Lemma 1 hold in every step. Suppose that the loop takes exactly $k$ steps. This means that the output is $Q = 2^{-k}$ and that the subroutine outputs $b = \text{TRUE}$ at step $k - 1$, which gives $P_d \le 2^{-k+2} = 4Q$ by Lemma 1; at step $k$, we either get FALSE, which means $P_d \ge 2^{-k-1} = Q/2$, or $2^{-k} \le K^{-(d-1)}$ and then $P_d \ge K^{-(d-1)} \ge 2^{-k} > 2^{-k-1} = Q/2$.
We shall also clarify how the online sample access is used: as stated in the algorithm, we recycle the already-drawn samples $X_i$, requesting only the "missing" samples when the required sample size $n$ increases due to the change of the candidate $Q$. This recycling is accounted for in the union bound.
Appendix A.5. Proof of Theorem 3
We use $\widehat{P}$ from Theorem 1 combined with Robust Mean Estimation, which achieves accuracy $\epsilon$ and confidence $1 - \delta$ given the number of samples as in Theorem 2. This shows that $n = \Theta\!\left(d^2 \log(1/\delta)\, \epsilon^{-2} P_d^{-1/d}\right)$ works. Now it suffices to combine this with Lemma 2.