On the Convergence of the Benjamini–Hochberg Procedure

Abstract: The Benjamini–Hochberg procedure is one of the most widely used scientific methods to date. It is ubiquitous in genetics and other fields where the problem of multiple comparisons arises frequently. In this paper we show that, under fairly general assumptions on the distribution of the test statistic under the alternative hypothesis, as the number of tests increases the power of the Benjamini–Hochberg procedure converges at an exponential rate to a previously established limit of the power. We give a theoretical lower bound for the probability that, for a fixed number of tests, the power lies within a given interval around its limit, together with a software routine that calculates these values. This result is important when planning costly experiments and when estimating the achieved power after performing them.


Introduction
In many modern studies scientists perform the same statistical test on many objects, e.g., genes or single-nucleotide polymorphisms. The number of these objects can be very large, say millions, resulting in a multiple comparison issue. This is due to the fact that, even under the null hypothesis, we expect some of these tests to have p-values below a predetermined significance level α. There are various techniques to overcome this problem, with the Benjamini–Hochberg method [1] being the most used. Given α, for a set of m p-values the method rejects the tests corresponding to the k smallest p-values, where k is the largest index such that p_(k) ≤ kα/m. The method controls the FDR (false discovery rate), in the sense that the expected proportion of false discoveries out of all discoveries (rejected hypotheses) is at most α when the tests are independent, as shown in the original paper [1]. In some cases, rather than finding the cutoff index for rejecting tests, the so-called Benjamini–Hochberg-adjusted p-values are used; these are the p-values transformed by the formula on the right-hand side of the above inequality and truncated at 1.
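The step-up rule just described is straightforward to implement. Below is a minimal sketch in Python (the paper's accompanying routine is in R; this illustration is not the authors' code):

```python
# Sketch of the Benjamini-Hochberg procedure: reject the tests with the
# k smallest p-values, where k is the largest index with p_(k) <= k*alpha/m,
# and compute the BH-adjusted p-values (monotonized from the top, capped at 1).
def bh_reject(pvals, alpha):
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank with p_(k) <= k*alpha/m (0 if none)
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * alpha / m:
            k = rank
    return set(order[:k])  # indices of rejected hypotheses

def bh_adjust(pvals):
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adj = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # step-up: enforce monotonicity from the top
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adj[i] = running_min
    return adj
```

For example, with p-values (0.001, 0.008, 0.039, ...) and α = 0.05, a test is rejected exactly when its adjusted p-value is at most α.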
Although there are improved methods, either in terms of weaker assumptions (e.g., allowing dependency) or of better statistical power, e.g., [8][9][10], the original Benjamini–Hochberg article is among the most cited scientific papers of all time, ranked 59th most cited in [11]. We should underline that in this work we are not interested in improving the Benjamini–Hochberg method, but rather in making inferences about the statistical power of the original method, as it is the one most used in practice. Moreover, article [12] shows that in certain cases no correction of the standard methods is needed even when there is some degree of dependence among the tests. See Section 5 for more information and related work.
Calculating the statistical power, that is, the probability of correctly rejecting the null hypothesis when the alternative is true, is an important task when designing scientific studies. For example, it allows us to estimate the required sample size, and therefore the budget needed to perform the experiment, in order to ensure a reasonable chance of detecting the objects for which there is indeed a difference between the two groups. After performing the study, it allows us to determine whether we have found a relatively large proportion of the significant objects.
For practical applications we are interested not only in the limit of the statistical power, as established in previous works (see Section 5), but also in the speed of convergence to that limit. The speed of convergence allows us to find the probability that the statistical power is within a given range around its limit. Consequently, this allows us to calculate the probability that the statistical power is above a predetermined threshold, typically 80%. When this probability is relatively low, one would have to increase the sample size in order to obtain more favorable parameters of the Beta distribution that models the p-values under the alternative hypothesis.
In more detail, in this work we show that under a natural condition, which we call (M), the convergence is exponential, and we give explicit expressions for the constants involved in the asymptotics. That condition allows the use of suitable Beta distributions, among others. The relationship between our condition and the non-criticality condition in [13] is discussed below. Condition (M) allows us to find a lower bound for the probability that, for fixed parameters of the model, the power is within a given interval around the asymptotic limit. In this sense our work can be understood as a tool for planning experiments, complementing [14], another article by the authors of [15], which deals with a variety of questions such as the estimation of the proportion of significant tests, the effect of the sample size on the quality of the results, etc.

Some Preliminary Notation
Let X_1, X_2, ..., X_m, m ∈ N_+, with N_+ = {1, 2, ...}, be independent random variables defined on a common probability space (Ω, F, P). For some 1 ≤ m_0 < m the random variables {X_j}_{j ≤ m_0} are assumed to have a common cumulative distribution function (c.d.f.) F, whereas {X_j}_{m_0 < j ≤ m} are assumed to have a common c.d.f. G. In this work we set F to be the c.d.f. of the uniform distribution on (0, 1) and let, for the time being, G be any continuous c.d.f. We note that even if G is discrete, one can always approximate it with a suitable continuous distribution. We also work under the condition that m_0/m = γ ∈ (0, 1) is fixed and set m_1 := m − m_0 ∈ (0, m); this is essentially non-binding, since we develop exponential bounds that are valid, with different degrees of accuracy, for any fixed m_0, m and the respective ratio m_0/m. Usually γ is larger than 1/2, as the first m_0 random variables represent the non-significant observations. If X_(i)^m, i ≤ m, is the i-th order statistic of X_1, X_2, ..., X_m, then for a fixed level of rejection α ∈ (0, 1) the Benjamini–Hochberg procedure declares significant the first R_m order statistics, where

R_m := max{ i ≤ m : X_(i)^m ≤ αi/m },  with max ∅ := 0.  (1)

The truly non-significant tests have c.d.f. F and the truly significant tests possess c.d.f. G. It is well known from [15] (1.2) that for x ∈ (0, 1) we can express the event {R_m ≥ mx} in the following fashion:

{R_m ≥ ⌈mx⌉} = { max_{t ∈ A_α(x,m)} ( H_m(t) − t/α ) ≥ 0 },  (2)

where

A_α(x, m) := { αi/m : ⌈mx⌉ ≤ i ≤ m },  (3)

H_m(t) = (1/m) ∑_{j=1}^m I_{{X_j ≤ t}} is the empirical c.d.f. of X_1, X_2, ..., X_m, and ⌈y⌉ = min{n ∈ N_+ : n ≥ y} is the ceiling function. We can express H_m as

H_m(t) = (m_0/m) F_{m_0}(t) + (m_1/m) G_{m_1}(t) = γ F_{m_0}(t) + (1 − γ) G_{m_1}(t),

where F_{m_0} is the empirical c.d.f. of X_1, X_2, ..., X_{m_0} and G_{m_1} is the empirical c.d.f. of X_{m_0+1}, ..., X_m. Therefore a simple rearrangement of (2) yields

{R_m ≥ ⌈mx⌉} = { max_{t ∈ A_α(x,m)} ( γ F_{m_0}(t) + (1 − γ) G_{m_1}(t) − t/α ) ≥ 0 }.  (4)
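The event identity above rests on the elementary equivalence X_(i)^m ≤ αi/m ⇔ H_m(αi/m) ≥ i/m. A short simulation check, with an illustrative Beta(0.3, 3) alternative (Python here, although the paper's own routine is in R):

```python
import math, random

random.seed(1)
m, m0, alpha = 200, 150, 0.05
# p-values: m0 uniform (true nulls), m - m0 from a Beta alternative
pv = [random.random() for _ in range(m0)] + \
     [random.betavariate(0.3, 3.0) for _ in range(m - m0)]

# R_m = max{ i : p_(i) <= alpha * i / m } (0 if the set is empty)
ps = sorted(pv)
R = max([i for i in range(1, m + 1) if ps[i - 1] <= alpha * i / m], default=0)

def H(t):  # empirical c.d.f. H_m of all p-values
    return sum(p <= t for p in pv) / m

# For k = ceil(m*x): R_m >= k  iff  H_m(alpha*i/m) >= i/m for some i in {k,...,m}
for x in (0.05, 0.1, 0.25, 0.5):
    k = math.ceil(m * x)
    lhs = (R >= k)
    rhs = any(H(alpha * i / m) >= i / m for i in range(k, m + 1))
    assert lhs == rhs
```

The equivalence holds deterministically for every realisation; the simulation merely exercises it on random data.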

General Inequalities and Information concerning the Distribution of R_m
We introduce the function c(ε) = (1 + ε) ln(1 + ε) − ε, for ε > 0, which is investigated in Proposition A1 in Appendix A. Then the following inequalities hold true.

Proposition 1.
For any distribution function T with T(αx) > 0 and empirical cumulative distribution function T_n(·) = (1/n) ∑_{j=1}^n I_{{X_j ≤ ·}}, where {X_j}_{j ≥ 1} are i.i.d. random variables with c.d.f. T, any k, m ∈ N_+, any α, x, ρ ∈ (0, 1) and A_α(x, m) as in (3) above, we have, for every ε > 0, the bound (5). Finally, if T(t) c(ρεt/T(t)), as a function of t, is non-decreasing on [αx, α), we get the bound (6). Remark 1. Let us compute K_T when T(t) = F(t) = t for t ∈ (0, 1). Then T(t) c(ρεt/T(t)) = t c(ρε) is non-decreasing in t, and substituting into the right-hand side of (6) we deduce (7). Remark 2. Given that c(ε) ≥ min{ε, ε²}/(2e), see Proposition A1, we may remove the dependence on the function c in K_T, see (5), by bounding it from above accordingly. Remark 3. Inspecting the proof below, one easily discovers that for computational purposes it is better to use a sharper estimate in place of (5); for theoretical purposes the bounds stated in the theorem are more suitable.
Proof of Proposition 1. The second part of the minimum in (5) follows from the Dvoretzky–Kiefer–Wolfowitz inequality, see [16], which gives, for any ε > 0, that

P( sup_t |T_n(t) − T(t)| ≥ ε ) ≤ 2 e^{−2nε²}.

In Proposition A1 we investigate the function c(ε) = (1 + ε) ln(1 + ε) − ε. The first part of the minimum in (5) is immediate from Proposition A1, since k T_k(t) = ∑_{j=1}^k I_{{X_j ≤ t}} is binomially distributed with parameters k and T(t) = P(X_1 ≤ t), and since the number of elements of A_α(m, x) is bounded from above by |A_α(m, x)| ≤ m(1 − x). Next, (6) is deduced from the assumption that T(t) c(ρεt/T(t)), as a function of t, is non-decreasing on [αx, α) and the simple fact that A_α(m, x) ⊆ [αx, α).
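The binomial step in this proof is a standard Chernoff–Bennett tail bound: if X ~ Bin(k, p), then P(X ≥ (1 + ε)kp) ≤ exp(−kp·c(ε)) with c(ε) = (1 + ε) ln(1 + ε) − ε. A quick Monte Carlo sanity check of this bound (parameter values are illustrative):

```python
import math, random
random.seed(2)

def c(eps):  # c(eps) = (1 + eps) ln(1 + eps) - eps
    return (1 + eps) * math.log(1 + eps) - eps

k, p, eps, trials = 400, 0.1, 0.5, 5000
mu = k * p
# Chernoff/Bennett upper tail: P(Bin(k, p) >= (1 + eps) * mu) <= exp(-mu * c(eps))
bound = math.exp(-mu * c(eps))
hits = sum(
    sum(random.random() < p for _ in range(k)) >= (1 + eps) * mu
    for _ in range(trials)
)
emp = hits / trials
assert emp <= bound  # the empirical tail sits below the theoretical bound
```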
Applying Proposition 1 with T = F and T = G, and respectively m_0 = mγ and m_1 = m(1 − γ), ρ = γ and ρ = 1 − γ, we can confirm from (4) the proof of Theorem 3.1 in [15], in the sense that we have the convergence in probability of R_m/m. Clearly,

u_{α,γ}(x) := γα + (1 − γ) G(αx)/x,  x ∈ (0, 1],

is a non-increasing function on (0, 1]. Denote

x̄^β_{α,γ} := sup{ x ∈ (0, 1] : u_{α,γ}(x) ≥ α/β }  (12)

and

x^β_{α,γ} := sup{ x ∈ (0, 1] : u_{α,γ}(x) > α/β }.  (13)

Since u_{α,γ} is non-increasing we have that x̄^β_{α,γ} ≥ x^β_{α,γ}. Therefore from relations (2) and (4) we can easily obtain bounds on probabilities of the type P(R_m/m < x). We can provide more precise information about such probabilities in the general case. In the sequel we use I_{{A}} for the indicator function of the set A.

Proposition 2.
We have, for any ε > 0 and x ∈ (0, 1), the bound stated in the display above. Proof. Using successively (2), (4) and (10), with ρ = γ, k = m_0 = γm for the probability involving F_{m_0}, and ρ = 1 − γ, k = m_1 = (1 − γ)m for the expression concerning G_{m_1}, we arrive at the required inequalities. We only note that in the last expression of the first inequality above we have used that, once the dependence on F_{m_0} and G_{m_1} is estimated away, the remaining term is deterministic and thus the indicator function appears.

Condition (M)
In view of the fact that Theorem 3.1 in [15] holds, the limit in probability lim_{m→∞} R_m/m exists under the conditions given there. We shall therefore restrict our attention to the case when, for fixed α, the function

H(t) := G(t)/t − 1

is decreasing on (0, α]. In this case we have from (11) that x̄^β_{α,γ} = x^β_{α,γ}, and if β = α then we set x_{α,γ} := x^α_{α,γ}. From now on we work under this condition, which we call condition (M) and which is synthesized as follows: (M) the function H(t) = G(t)/t − 1 is strictly decreasing on (0, α].
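Under condition (M) the limiting rejection proportion x_{α,γ} can be computed numerically as the root in (0, 1) of γαx + (1 − γ)G(αx) = x (the standard characterization of the limit from [15]). A sketch that locates it by bisection for a Beta(0.3, 3) alternative, with G approximated by Monte Carlo; all parameter values are illustrative, and the paper's own routine is in R:

```python
import bisect, random
random.seed(3)

alpha, gamma = 0.05, 0.8
a, b = 0.3, 3.0  # Beta parameters of the alternative (illustrative)

# Monte Carlo approximation of G, the Beta(a, b) c.d.f.
samples = sorted(random.betavariate(a, b) for _ in range(200_000))
def G(t):
    return bisect.bisect_right(samples, t) / len(samples)

# Under condition (M), phi(x) = gamma*alpha*x + (1-gamma)*G(alpha*x) - x
# is positive near 0 and has a unique root x_{alpha,gamma} in (0, 1).
def phi(x):
    return gamma * alpha * x + (1 - gamma) * G(alpha * x) - x

lo, hi = 1e-9, 1.0
for _ in range(60):  # bisection on the sign change of phi
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if phi(mid) > 0 else (lo, mid)
x_lim = (lo + hi) / 2
power_limit = G(alpha * x_lim)  # the limiting power, see the later sections
```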

Remark 4.
As mentioned in [17], "[u]nder the alternative hypothesis, the p-values will have a distribution that has high density for small p-values and the density will decrease as the p-values increase". Therefore, when modeling the p-values under the alternative hypothesis with a Beta distribution B(a, b), its parameters should be such that a < 1 and b ≥ 1. Note that condition (M) also holds when b < 1 and a < (1 − 2α)/(1 − α), although this case should not occur for well-behaved statistical procedures.

Remark 5.
Note that in our context Equation (2.2) of [13] translates to α* = inf_{u>0} {u/G(u)} and, as mentioned there, the asymptotics of the Benjamini–Hochberg procedure exhibit very different behavior on the regions α ∈ (0, α*) and α > α*. In the same paper it is demonstrated that when G(t)/t → 0 the overwhelming proportion of discoveries will be amongst the non-significant features. Under our assumptions it is always true that α* = 0, and therefore we are always in the second regime, for which [13] provides a law of the iterated logarithm for R_n; see Theorem 2.2 of [13], where a precise almost-sure fluctuation estimate is established. A very interesting contribution is made in Theorems 3.1 and 3.2 in [18]. The author finds a Donsker-type convergence for the so-called threshold procedures, which are essentially functionals of the empirical function H_m(t) and which recover, for example, the false discovery rate; see Theorem 3.2 in [18]. This implies a convergence rate of 1/√n for |R_n/n − x_{α,γ}| and is applicable in a variety of other examples. Moreover, those results go beyond the scope of the Benjamini–Hochberg procedure and capture different procedures, some of which have higher power than the Benjamini–Hochberg procedure. In the particular case of the Benjamini–Hochberg procedure, however, we know from [15] that the probabilities of the events in question, where S_n/n is the power of the procedure, converge to 1. Our Theorems 1 and 3 strengthen this by showing that the speed of convergence is of exponential order. As discussed earlier, this allows a direct estimate of the probability that S_n/n lies in a prescribed interval.

Remark 6.
We note that upon the validity of condition (M) it is impossible for G to have atoms on (0, α): otherwise, if a ∈ (0, α) were an atom, then H would have a positive (upward) jump at a, which contradicts the assumption that H is decreasing. Thus, our initial requirement that G is continuous is subsumed in condition (M). This also means that H is continuous on (0, α) and thus x_{α,γ} is well defined. We then have the following elementary lemma.
Lemma 1. Assume that G has a density g on (0, α). If t g(t) − G(t) < 0 on (0, α), then condition (M) holds. The latter is satisfied if, for example, g'(t) < 0 on (0, α).

Proof. From the assumptions it is clear that on (0, α) we have H'(t) = (t g(t) − G(t))/t² < 0. Thus the first part of the proof is settled. Clearly, t g(t) − G(t) < 0 if g'(t) < 0 on (0, α), since then G(t) = ∫₀ᵗ g(s) ds > t g(t), and the overall proof is thus completed.
The next lemma provides an easier expression for K_G, see (5) and (6), provided condition (M) holds.

Lemma 2. Let condition (M) hold for G and let m_1 = m(1 − γ). Then K_G simplifies to the expression K^M_G given in (19). The same holds for the c.d.f. F of the uniform distribution, yielding K^M_F in (20). In particular, if γ > 1/2, then K^M_G is decreasing in x as long as the other parameters stay fixed. The same is valid for K^M_F.
Proof of Lemma 2. The proof is immediate from (6), since under condition (M) the function G(t) c(ρεt/G(t)) is non-decreasing on [αx, α): indeed, t/G(t) is non-decreasing under (M), and hence both factors are non-decreasing. We just note that by definition m_1 = m(1 − γ), and we have employed this for k in (6) to derive (19). Finally, (20) follows from (7).

Theoretical Bounds on the Distribution of R_m under Condition (M)
If condition (M) holds, we have the identity, equivalent to (17),

max_{t ∈ A_α(x,m)} H(t) = H(α⌈mx⌉/m),

and with β = α relation (18), together with (12) and (13), characterizes x_{α,γ}. Let β > 0 and denote, using (21), the threshold m_G(x, β, α, γ). We then have the following key result. Finally, for any x ∈ (0, x_{α,γ}) there exists ε(x) such that, for any 0 ≤ ε ≤ ε(x), the inequality (1 − γ)H(αx) > 1/α − 1 + 2ε holds, and then the bound is valid for any m ≥ m_G(x, 1/α − 1 + 2ε, α, γ). Remark 8. It seems important to evaluate or approximate m_G(x, 1/α − 1 − 2ε, α, γ) by some m' so that (24) is valid beyond this estimate, that is, for m ≥ m'. If H_d := inf_{t∈(0,α)} |H'(t)| > 0 then, since αx < α⌈mx⌉/m, 1 > x > x_{α,γ}, and H is decreasing on (0, α), we get, using (22) in the first relation, that (24) is always valid. Of course, various similar and more precise estimates can be achieved, but these are beyond the scope of this work.

Remark 9.
Considering Remark 3, the inequalities (24) and (25) can be improved for computational purposes by using the sharper expressions indicated there as upper bounds.

Remark 10.
We point out that in our setting lim_{m→∞} R_m/m > 0, and there have been a number of works that attempt to offer FDR control even under a weak dependence structure between the random variables following the uniform distribution, i.e., {X_j}_{j ≤ m_0}. We refer, for example, to [19]. There is also an interesting contribution, see [20], which discusses difficulties in controlling the FDR under weak dependence in the case when lim_{m→∞} R_m/m = 0.
For practical reasons, and especially for the accompanying software routine, it is optimal to choose ε in Theorem 1 as large as possible. This is due to the fact that the bounds K^M_F and K^M_G, see (19) and (20), are decreasing in ε. Since in practical computations m is usually available, it is then beneficial to choose the largest ε, as a function of x and m, for which the previous theorem holds. In this direction we have the following result: under the conditions of Theorem 1, with m ∈ N_+ and x ∈ (x_{α,γ}, 1) fixed, we have (28) with ε_G as in (29). Similarly, if m ∈ N and x ∈ (0, x_{α,γ}) are fixed and ⌈mx⌉/m < x_{α,γ}, then (30) holds with ε_G as in (31). Proof. To show (28) with ε_G as in (29), it suffices to show that for the chosen ε_G and the given m, x condition (16) holds. However, as condition (M) holds, the maximum over t ∈ A_α(x, m) is attained at the left endpoint, and we see that ε_G is the maximal possible choice with this property. For the second part of (29) we only recall that (1 − γ)H(αx_{α,γ}) = 1/α − 1, see (22). The rest of the theorem, that is (30) and (31), follows in a similar fashion from (27), wherein the expression in (31) is the maximal ε_G such that the required inequality holds for the fixed m and x ∈ (0, x_{α,γ}). We only note that in this instance we need x < ⌈mx⌉/m < x_{α,γ}, which is part of the assumptions of the theorem.

Exponential Convergence to the Theoretical Limit of the Power of the Benjamini-Hochberg Procedure
We proceed by studying the power of the test. It is clear that if we denote

S*_m := (1/m_1) ∑_{j=1}^{m_1} I_{{X_j ≤ αR_m/m}},  (32)

where {X_j}_{1 ≤ j ≤ m_1} are i.i.d. with c.d.f. G, then S*_m is the proportion of truly significant elements that we have correctly identified as significant, that is, the power of the test. If one wishes to understand the behavior of S*_m as m increases, one ought to decouple its dependence on R_m/m in (32). Theorem 1 aids in this matter. We note that if we choose x_1 < lim_{m→∞} R_m/m = x_{α,γ} < x_2, see (18) for the definition of x_{α,γ}, then from Theorem 1 we have, for every ε > 0, the exponential bounds in (34). Then we can formulate the following fundamental result that offers exponential bounds on the convergence of S*_m.

Remark 11.
We note that (36) has already been proved as Corollary 3.3 in [15]; here we additionally offer exponential bounds on this convergence.
Proof of Theorem 3. The proof of (35) follows immediately from (32), (34) and the considerations thereabout. The fact that ∑_{j=1}^{m_1} I_{{X_j ≤ αx}} is binomially distributed with parameters m_1 and G(αx) = P(X_1 ≤ αx) is straightforward, which, with the help of (35), proves (36). Indeed, the strong law of large numbers gives almost surely that (1/m_1) ∑_{j=1}^{m_1} I_{{X_j ≤ αx}} → G(αx), and from (35) lim_{m→∞} S*_m = G(αx_{α,γ}). Next, from (22) we get that

G(αx_{α,γ}) = (1 − αγ) x_{α,γ} / (1 − γ).

The latter is the limit on the right-hand side of (36). The very last relation, that is (37), follows again from (35) and an application of Proposition A1, which yields in our case, with m_1 = m(1 − γ), the required bounds. This concludes the proof.
As in the case of R_m/m, we state a more practical version of Theorem 3, aimed at computing the largest ε for fixed m, x_1, x_2. It is in the spirit of Theorem 2.

Theorem 4.
Let condition (M) be valid and assume the notation of Theorem 2. Let m ∈ N_+ be fixed, let x_{α,γ} ∈ (x_1, x_2) ⊂ (0, 1) with x_1 such that ⌈mx_1⌉/m < x_{α,γ}, and let η ∈ (0, 1) be fixed. Then, with ε_1 and ε_2 chosen as in Theorem 2 for x_1 and x_2 respectively, we have the stated bound, where c(η) = (1 + η) ln(1 + η) − η is considered in Proposition A1. For a one-sided bound we get the corresponding inequality. Remark 12. Noting Remark 9, we see that the expressions involving K^M_F and K^M_G can be substituted with the sharper ones indicated there, where T is either F or G.
Proof of Theorem 4. The proof is immediate; we only note that the assumption ⌈mx_1⌉/m < x_{α,γ} and the choice of ε_1, ε_2 are such that Theorem 2 applies to the right-hand side of (38), which yields an inequality identical to (35). The rest is as in the proof of Theorem 3.

Workflow
In this section we consider the case when the non-uniform distribution of {X_j}_{m_0 < j ≤ m} is Beta with parameters a and b, that is, G(·, a, b) is the c.d.f. of B(a, b), where a < 1 and b > 1 are taken so that condition (M) is satisfied. In this case Lemma 1 is valid and condition (M) holds, since elementary calculations yield that g'_{a,b} < 0 on (0, α) and lim_{x→0} g_{a,b}(x) = ∞, where g_{a,b} is the probability density function of B(a, b).
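The sufficient condition on the density can be verified numerically on a grid; a small check for the illustrative choice a = 0.3, b = 3:

```python
# Numerical sanity check of the sufficient condition used above: for a < 1
# and b > 1 the (unnormalised) Beta density t^(a-1) * (1-t)^(b-1) is
# strictly decreasing on (0, alpha), and it blows up as t -> 0.
alpha, a, b = 0.05, 0.3, 3.0  # illustrative parameters

def g_unnorm(t):
    return t ** (a - 1) * (1 - t) ** (b - 1)

grid = [alpha * (i + 1) / 1000 for i in range(1000)]
vals = [g_unnorm(t) for t in grid]
assert all(v1 > v2 for v1, v2 in zip(vals, vals[1:]))  # decreasing on (0, alpha)
assert g_unnorm(1e-12) > 1e6  # diverges near the origin
```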
For a given number of hypotheses m, level of significance α and proportion of truly non-significant hypotheses γ, we are interested in finding a theoretical bound for the probability that the power lies between a number l and 1, where of course only l < G(αx_{α,γ}, a, b) makes sense to consider. We shall only provide a step-by-step description of the implementation and refrain from developing an explicit theoretical example for given values of a, b, since the latter involves tedious, non-instructive and lengthy computations. We proceed by describing the workflow that has been implemented numerically in R.
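The empirical side of this workflow can be sketched as follows: simulate m p-values (a γ-fraction uniform, the rest Beta), apply the procedure, record the realised power, and estimate P(power ≥ l) over repetitions. A minimal Python stand-in for the R routine, with illustrative parameters and far fewer repetitions than the 1 million used for the tables:

```python
import random
random.seed(5)

m, alpha, gamma = 1000, 0.05, 0.8
a, b = 0.3, 3.0  # Beta parameters of the alternative (illustrative)
l = 0.15         # lower bound of interest for the power

def realised_power():
    m0 = int(gamma * m)
    m1 = m - m0
    pv = [random.random() for _ in range(m0)] + \
         [random.betavariate(a, b) for _ in range(m1)]
    ps = sorted(pv)
    R = max([i for i in range(1, m + 1) if ps[i - 1] <= alpha * i / m], default=0)
    # BH rejects exactly the p-values below or equal to alpha * R_m / m
    return sum(p <= alpha * R / m for p in pv[m0:]) / m1

reps = 2000
emp_prob = sum(realised_power() >= l for _ in range(reps)) / reps
```

Here `emp_prob` plays the role of the empirical probability reported in the Results section; the theoretical bound of Theorem 4 would sit below it.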

Results
The following Tables 1-4 give empirical results for some values of m, a, b, γ. We have included different cases for the true power limit, taking large, moderate and small values. In all cases the significance level is α = 0.05. These tables also show the empirical probability based on numerical simulations with 1 million repetitions. An empirical probability of 1 means that no simulated repetition had power lower than l. A theoretical probability of 0 or an empty value of η* indicates that some of the conditions of the method were not met. A larger sample size would result in a smaller parameter a of the Beta distribution, thus squeezing the distribution towards 0 and increasing the statistical power and its theoretical estimates, e.g., as in Tables 2 and 3.
We have carried out numerical simulations for many values of the parameters, and based on the results it appears that the theoretical bound works well when l is not close to the true power limit G(αx_{α,γ}, a, b). A link to our R code is provided in the Supplementary Materials section.
Interestingly, when m increases while all other parameters are kept fixed, the expected power decreases for most values of m, as shown in Figure 1. Perhaps this is because the set over which we take the maximum in (2) grows with m (except when the ceiling function jumps).
We note that, as expected, when a is closer to 1 and/or γ is smaller, the theoretical estimates become significantly worse. This is natural: the closer the alternative distribution is to the uniform, the harder it is to distinguish that the respective random variable X originates from it. The same holds when the lower bound l is very close to the true limit, so that it is harder to account for the possible error. All these points reflect natural restrictions when one deals with universal theoretical bounds, which here are seemingly derived in an optimal way.

Related Work
There are a number of results concerning the statistical power of the Benjamini–Hochberg procedure that relate to ours. Paper [15] discusses various properties of the procedure and shows that, when the number of tests m increases while the underlying distributions and the proportion of truly significant tests are kept the same, the statistical power converges under some conditions to a limit that can be found as the solution of a particular equation; see (3.8) of Corollary 3.3 of that article. A step forward is made in [13], wherein it is shown that the proportions related to the power of the test obey even a law of the iterated logarithm. Other notable contributions include the paper [18], which deals with various properties of the Benjamini–Hochberg procedure, including a Donsker-type convergence; it is discussed in more detail in Remark 5. We wish to highlight explicitly that, to our knowledge, there are no bounds, albeit conservative, for the probability of uncovering a given proportion of the truly significant tests (that is, S_n/n) relative to the limit of S_n/n as n goes to infinity. We achieve this by evaluating the speed of convergence, whereas the results under more general assumptions, including weak dependence between the tests, are limit results. The latter do not allow the derivation of concrete intervals, depending solely on the set of parameters, that hold with a prescribed probability for the proportion of correctly identified significant tests.
While we argue below that for genetic studies the assumption of independence is not binding at all, we wish to mention some achievements under the weak dependence assumption, since it is natural to pursue the extension of our results to this scenario as a future development. The limit of the power of the Benjamini–Hochberg procedure is proved in [15]. An interesting contribution is [12], which shows that for a large number of tests the FDR is controlled even when there is some degree of dependence between the test statistics. A conservative estimator for the proportion of false nulls among the hypotheses (the quantity 1 − γ in our work) is also found there (the proof uses the Dvoretzky–Kiefer–Wolfowitz inequality [16], as we do for different purposes in our work). The estimator is then used in a plug-in procedure in order to boost the power while keeping the FDR controlled. An estimator for γ may precede the application of our procedure.
We are not interested in improving the procedure itself, but rather in estimating the power of the original procedure which, as mentioned in the Introduction, is the most widely used among similar procedures.
As mentioned in [20], the results in [12] assume that R_n/n is asymptotically larger than zero, where R_n is the number of rejected tests out of all n tests (the R notation is consistent with ours throughout this work). A counterexample is given of a case in which R_n/n converges to zero with positive probability and in which the FDR is larger than α. Such a situation does not appear to be relevant to the RNA-seq data used for single-step differential expression methods that prompted our work. Moreover, paper [12] is concerned with controlling the FDR rather than with estimates of the power, which is the scope of our work: we are interested in the power of the procedure that scientists predominantly use, regardless of whether the FDR is controlled.

Considerations about Our Work
Because this is the first result of this exact type that we are aware of, we begin our considerations under the independence assumption. We want to point out that, as mentioned above, there are more general convergence results, but they do not include explicit bounds for a given set of parameters. The independence assumption holds approximately in the case of RNA-seq expression-level analysis, where there are groups of genes that are correlated within the group, but most pairs of genes are weakly correlated or uncorrelated. Even in the case of weak dependence, given real-world RNA-seq expression data, one cannot reliably estimate or replicate the dependence unless the sample size is much larger than that of the currently available datasets. In addition, more complicated assumptions on the structure of dependence would make the numerical procedure impossible to set up, as there would be too many parameters in the input, and these parameters cannot be easily estimated for the small sample sizes that are typical for RNA-seq experiments.
Theoretical results under more complicated assumptions on the dependence could be the subject of follow-up studies. Numerical simulations of RNA-seq data under different basic types of weak dependence could also be considered in subsequent studies.
It is also important to emphasize that our considerations differ from the basic situations in which statistical power is usually studied. In the case of a simple test, the statistical power is typically considered as a function of the sample size. However, in the case of complex tests such as the ones cited above, it is often not possible to estimate the distribution of the test statistic, even for a single test in the multiple comparison setup. In order to model a general setup, we model the test statistic under the alternative hypothesis using the Beta family of distributions. To use our current results, one would first have to estimate, approximately or exactly, the parameters of the Beta distribution as a function of the sample size of the particular test. Because such estimates are test-specific and cannot be studied in general, this would be the subject of another article. One would also have to estimate the proportion of significant tests and, having achieved that, apply our results for that proportion and those parameters of the Beta distribution, taking into account the possible values of the number of tests. For example, in the differential expression analysis methods mentioned above, the sample size corresponds to the number of individuals and the number of tests may correspond to the number of genes.