Computing Power and Sample Size for the False Discovery Rate in Multiple Applications

The false discovery rate (FDR) is a widely used metric of statistical significance for genomic data analyses that involve multiple hypothesis testing. Power and sample size considerations are important in planning studies that perform these types of genomic data analyses. Here, we propose a three-rectangle approximation of a p-value histogram to derive a formula to compute the statistical power and sample size for analyses that involve the FDR. We also introduce the R package FDRsamplesize2, which incorporates these and other power calculation formulas to compute power for a broad variety of studies not covered by other FDR power calculation software. A few illustrative examples are provided. The FDRsamplesize2 package is available on CRAN.


Introduction
The false discovery rate (FDR) is now a widely used metric of statistical significance for exploratory data analyses that perform a very large number of hypothesis tests.Benjamini and Hochberg introduced the FDR and proposed a p-value adjustment procedure to control it under certain conditions [1].Subsequently, many procedures to estimate or control the FDR and similar metrics of statistical significance have been developed [2].Most of these later procedures improve power by incorporating an estimate π0 of the proportion π 0 of hypothesis tests that have a true null hypothesis.Others compute π0 as the height of the rightmost bin of a histogram of all p-values [3][4][5]; Pounds and Cheng compute π0 as the minimum of 1 or twice the mean of all p-values [6]; others have elaborate methods to compute π0 .These FDR methods have found widespread application in the analysis of genomic data in exploratory analyses that test the association of each of many genomic features with a phenotype or other biological condition of interest.
The widespread use of these methods has fueled the development of several methods to compute power and sample size for studies that use the FDR to determine statistical significance.Most of these methods compute power and sample size for two-group comparisons.Jung proposed a particularly elegant method based on a simple equation involving the desired FDR τ, the desired average power 1 − β of tests with a false null [7], the proportion π 0 of tests with a true null, and the unadjusted p-value threshold α used to declare statistical significance [8].Jung solves this equation for α and then finds the sample size n to provide 1 − β average power to the tests with a false null hypothesis and provides specific details for using a t-test to compare the means of many variables across two groups [7].
Pounds and Cheng expand on Jung's [7] approach by considering that, in practice, the estimate π0 is used instead of π 0 itself to determine statistical significance [9].The estimate π0 is a function of the data and thus is dependent on sample size.Pounds and Cheng derive a formula for this relationship, incorporate it into Jung's equation [7], and then develop an iterative procedure that incorporates this consideration into sample size calculations [9].They also provide software to perform these calculations, using one-way ANOVA to compare the means of many variables across three or more groups.This iterative procedure can be computationally expensive.In some cases, the iterative procedure yields the same sample size calculation as Jung's equation because π0 converges to π 0 as the sample size increases.
Here, we propose a p-value histogram framework to develop a simplified approach to consider the estimation of π 0 in sample size calculations.We incorporate this approach into the FDRsamplesize2 R software package (version: 0.2.0) for power and sample size calculations.The FDRsamplesize2 package computes power and sample size for several types of scientific questions that are not covered by other packages.The FDRsamplesize2 package can perform these calculations in settings where all hypotheses are evaluated with one of the following tests: two-proportions z-test, a two-group comparison of Poisson data, comparison of negative binomial data, one-sample or two-sample t-test, t-test for non-zero correlation, Fisher's exact test, signed rank test, sign test, rank sum test, ANOVA, and single-predictor Cox regression.Our software also computes power and sample size using Jung's formula [7] and our p-value histogram formulas.It also computes power for using the Benjamini and Hochberg p-value adjustment method [1] (which does not estimate π0 for FDR control).

Background
Consider a setting that involves performing a very large number of hypothesis tests.A proportion π 0 of these tests have a true null hypothesis, and thus, the remaining proportion 1 − π 0 of tests have a false null hypothesis.We want to design a study so that we have adequate statistical power to declare significance for 1 − β of the 1 − π 0 of tests with a false null hypothesis (average power of 1 − β across tests with a false null hypothesis) while controlling the FDR at a fixed, pre-specified level τ.
For this problem, Jung noted that: accurately approximates the relationship among the FDR τ, the proportion π 0 of tests with a true null hypothesis, the (unadjusted) p-value threshold α used to declare statistical significance, and the average power 1 − β of all tests with a false null hypothesis at the α level [7].Solving for α yields a fixed p-value threshold: that achieves the desired average power 1 − β and desired FDR control τ, given the proportion π 0 of tests with a true null hypothesis.One can then use the power calculation formula for the statistical method of interest to compute the sample size n necessary for tests with a false null hypothesis to have an average power (1 − β) at the α level.
Pounds and Cheng noted that in practice: is used to determine statistical significance, where all observed p-values are used to compute an estimate π0 of π 0 as well as local FDR estimates τ as a function of the p-value threshold α [9].The estimate π0 appears only in the numerator of this equation because the FDR adjustment procedures only use π0 in the numerator.In FDR adjustment procedures, the empirical distribution function (EDF) of all p-values is used to estimate the denominator; thus, π0 is not needed in the denominator.In sample size calculations, the denominator is the average cumulative distribution function (CDF) of all p-values, which is a function of the conjectured value π 0 .The denominator is the average CDF of all tests, mathematically represented with separate terms for the π 0 of tests with a true null and the 1 − π 0 tests with a false null to illuminate the relationship between FDR and average power 1 − β.
Recall that the statistical power of a test at the α level is the CDF of the p-value for that test.Therefore, π0 and τ themselves also depend on the average power 1 − β of the tests with a false null hypothesis, and thus, the sample size.Pounds and Cheng derive the formula for the expected value E( π0 |n, θ) of π0 as a function of the sample size n and parameter vector θ (effect sizes of all tests), and then develop an iterative procedure that incorporates this consideration into the sample size calculations [9].This iterative procedure adjusts for the fact that data-based estimates are used in practice.In some cases, this procedure obtains larger sample sizes than Jung's formula because E( π0 |n, θ) ≥ π 0 for most methods to compute π0 .In other cases, this procedure obtains the same sample size as Jung's formula because E( π0 |n, θ) → π 0 as n → ∞ .Nevertheless, the iterative procedure is computationally burdensome.Thus, a more straightforward and less computationally complex approximation can be useful.Following [7], let us consider the approximate relationship: where π 0 is used to compute τ.For example, π 0 = π 0 in Jung's equation above.Also, the Benjamini and Hochberg method operationally uses π 0 = 1 to obtain τ [1].Using π 0 = 1 in the equation above and solving for α yields: as the p-value threshold determined by using the Benjamini and Hochberg procedure to control the FDR at τ in a setting with 1 − β average power of tests with a false null hypothesis [1].One can then use the power formula for the tests with false null hypotheses to determine the sample size needed to achieve an average power of 1 − β at the α BH level.Histograms of p-values are a useful component of several FDR control procedures.In the q-value procedure, Storey estimates π 0 by the height of a histogram bar for p-values between 0.95 and 1.00 [3].Nettleton et al. and Pounds et al. also use histograms to compute estimates of π 0 [4,5].This suggests that a histogram approximation to the p-value distribution may help in developing a useful framework for computing power and sample size for controlling the FDR in applications involving many hypothesis tests.
We now derive other estimators of π 0 by considering a three-rectangle histogram approximation of the distribution of p-values obtained by performing many hypothesis tests (Figure 1).A proportion π 0 of tests have a true null hypothesis.There is an average power 1 − β to declare the other 1 − π 0 tests significant at the threshold α.The π 0 tests with a true null have uniformly distributed p-values by the probability integral transform.For the 1 − π 0 tests with a false null, a proportion 1 − β yield p-values less than α and a proportion β yield p-values greater than α.This distribution can be approximated by three rectangles.The first is a rectangle with a height π 0 and base extending from p = 0 to p = 1; this rectangle is the uniform distribution for the proportion π 0 of p-values with a true null hypothesis (shown in gray in Figure 1).The second is a rectangle with a base extending from p = 0 to p = α and a total area (1 − π 0 )(1 − β); this rectangle represents the 1 − β subset of the proportion 1 − π 0 of p-values with a false null that are less than α (shown in gold in Figure 1).The height of the second rectangle is (1 − π 0 )(1 − β)/α and its center (mean) is at p = α/2.The third rectangle has a base extending from p = α to p = 1 with a total area (1 − π 0 )β; this rectangle represents those tests with a false null hypothesis that fail to be significant at the α level (i.e., a Type II error or what some may call a false negative result).It has a height (1 − π 0 )β/(1 − α) and center (mean) at p = (1 + α)/2 (shown in cyan in Figure 1).This three-rectangle approximation yields: for FDR methods such as those of [3][4][5] that estimate  by the height of a p-value histogram (histogram height; HH) near  = 1.It also yields: for FDR methods such as that of [6] that estimate  by 2 ‾; that is, twice the mean of all p-values (histogram mean; HM).Substituting  and  into the formula for ̃ yields simple second-order polynomials in , with a solution provided by the quadratic formula, as described in Appendix A. These formulas for  account for the dependency of the computed FDR values on the power of the tests with false null hypotheses without the need for the computational burden of Pounds' and Cheng's iterative procedure [9].
Thus, with the new framework, we have derived a simple mathematical formula for the p-value thresholds  ,  ,  , and  for the Benjamini and Hochberg procedure [1], Jung's formula [7], methods that use the p-value histogram height (HH) at  = 1 to estimate  , as in Equation ( 6), and methods that use the p-value histogram mean (HM) to estimate  , as in Equation (7), respectively.These four formulas can yield different values of  (Table 1) and thus lead to different sample size estimates.Some settings in the table may seem unusual in comparison to the 80% power and 5% level calculations often performed for testing a single hypothesis.However, some applications may warrant some of these other settings.For example, one may wish to identify 100 genes with a 50% FDR (to have 50 "real" hits for follow-up) in a comparison of the expression of 10,000 genes between samples treated and untreated with a drug that alters the expression of 100 genes.This would correspond to ̃= 0.50 (desired FDR), 1 −  = 0.50 (power to detect 50 out of 100 genes), and  = 0.99 (9900 genes out of 10,000 unaltered by drug).This three-rectangle approximation yields: for FDR methods such as those of [3][4][5] that estimate π 0 by the height of a p-value histogram (histogram height; HH) near p = 1.It also yields: for FDR methods such as that of [6] that estimate π 0 by 2p; that is, twice the mean of all p-values (histogram mean; HM).Substituting π hh and π hm into the formula for τ yields simple second-order polynomials in α, with a solution provided by the quadratic formula, as described in Appendix A. These formulas for α account for the dependency of the computed FDR values on the power of the tests with false null hypotheses without the need for the computational burden of Pounds' and Cheng's iterative procedure [9].Thus, with the new framework, we have derived a simple mathematical formula for the p-value thresholds α BH , α Jung , α hh , and α hm for the Benjamini and Hochberg procedure [1], Jung's formula [7], methods that use the p-value histogram height (HH) at p = 1 to estimate π 0 , as in Equation ( 6), and methods that use the p-value histogram mean (HM) to estimate π 0 , as in Equation (7), respectively.These four formulas can yield different values of α (Table 1) and thus lead to different sample size estimates.Some settings in the table may seem unusual in comparison to the 80% power and 5% level calculations often performed for testing a single hypothesis.However, some applications may warrant some of these other settings.For example, one may wish to identify 100 genes with a 50% FDR (to have 50 "real" hits for follow-up) in a comparison of the expression of 10,000 genes between samples treated and untreated with a drug that alters the expression of 100 genes.This would correspond to τ = 0.50 (desired FDR), 1 − β = 0.50 (power to detect 50 out of 100 genes), and π 0 = 0.99 (9900 genes out of 10,000 unaltered by drug).

Supported Tests
The above section describes how to determine a threshold α for unadjusted p-values that yield the desired FDR τ and average power 1 − β for a given proportion π 0 of tests with a true null hypothesis.The sample size to yield the desired average power 1 − β at the p-value threshold α must still be determined using power calculation formulas for the particular statistical test of interest.Our R package performs these calculations for the following commonly applied statistical tests:

Algorithmic Details
The FDRsampsize2 package uses the following general algorithm to determine the sample size: 1.
Given the desired FDR τ, the desired average power 1 − β, the proportion π 0 of tests with a true null, and the desired option for π 0 (BH, HH, HM, Jung), determine the p-value threshold α that achieves the desired FDR and average power.

2.
Compute the average power for each of the two initial sample sizes n 0 < n 1 (default n 0 = 3 and n 1 = 6; other values can be specified by the user).

3.
If the average power for both of these initial sample sizes is greater than the desired average power, then report n 0 and its average power as the result.4.
If the average power for both of these initial sample sizes is less than desired, then define a new n 0 = n 1 and n 1 = 2n 1 and compute the average power for these new n 0 and n 1 .Repeat until the average power for n 1 is greater than the desired average power 1 − β, and the average power for n 0 is less than the desired average power.

5.
With n 0 and n 1 and their average powers as initial values, use bisection to determine the smallest n with average power greater than or equal to the desired average power 1 − β. 6.
To avoid excessive computing time, stop iterative calculations of steps 4 and 5 after average power has been computed a specified maximum number of times and report the results achieved thus far.
The algorithm reports the input parameters; the computed sample size (or number of events) n; the computed average power; the computed α; the number of iterations (number of times average power was computed in steps 4 and 5); the bisection bounds for the sample size n 0 and n 1 ; and the maximum number of iterations.If the desired average power was not achieved, the user may continue the calculations by starting at the achieved bounds for n 0 and n 1 or repeating the calculations with a greater maximum number of iterations.

Example 1: Study Involving Many Sign Tests
Suppose we are planning a study that will collect 10,000 Bernoulli (yes/no, positive/negative, etc.) variables for each subject.For each of those variables, we will use the sign test to evaluate the hypothesis that the probability θ of success is θ 0 = 0.5 for each of those variables.We are interested in computing power and sample size for the setting in which 100 variables have a success probability of 0.8, and the remaining 9900 variables have a success probability of 0.5.In this setting, the null hypothesis H 0 : θ 0 = 0.5 is true for 9900/10,000 (π 0 = 0.99) of the variables.The package includes a function power.signtestthat uses the formula of [10] to compute the power of the sign test.The power.signtest is plugged into the average.power.signtestfunction.The later function computes the average power achieved by performing all tests at a fixed p-value level α for a given sample size and θ.The find.sample.sizefunction determines the sample size needed to achieve a certain average power while controlling the FDR at a specific level.
To determine the sample size needed to achieve the desired average power that controls the FDR at τ = 0.1, we can use the n.fdr.signtestfunction directly to obtain the sample size n, computed average power, fixed p-value threshold, number of iterations in the function output, and the bounds of the sample size and input parameters.theta = rep(c(0.8,0.5),c(100,9900)) # 9900 null; 100 non-null res = n.fdr.signtest(fdr= 0.1, pwr = 0.8, p = theta, pi0.hat = "BH") res In this case, we obtain the same sample size with the "BH" and "HH" options.That is not always the case.By studying Equations ( 6) and ( 7), we can see that the methods may obtain different results with smaller π 0 , greater power, and reduced tolerance for Type I errors.For example, setting θ = 0.55 for tests with a false null, θ = 0.50 for tests with a true null, π 0 = 0.90, 1 − β = 0.99, and τ = 0.01, we obtain n = 3143 with the "BH" option and n = 3109 with the "HH" option for this problem.

Example 2: Design of a Clinical Trial to Find Prognostic Genes
The FDRsamplesize2 package can also determine the sample size necessary to identify genes with expression values that are associated with progression-free survival in the context of a pediatric brain tumor clinical trial.A single-predictor Cox regression model will be used to test the association of each gene's expression with progression-free survival (PFS).The package defines the power.coxfunction that computes the power formula of Hsieh and Lavori for Cox regression modeling [11].This function computes power as a function of the number of events (disease progression or death).We are interested in determining the number of events necessary to identify 80% of the genes truly associated with PFS while controlling the FDR at 10% in a setting in which 1% of the genes have a hazard ratio of 2 and a variance of 1.The R code for this calculation is shown below: This calculation shows that 37 events provided an average power of 81.4% when using the Benjamini and Hochberg procedure to control the FDR at 10% [1].The calculations project that with this number of events, the analysis will choose a p-value threshold of 0.00089 and achieve the desired average power of 80%.

Example 3: Differential Expression Analysis of RNA-Seq Data
Hart et al. derived a formula to compute the power of using a negative binomial model for differential expression analysis of RNA-seq data across two groups when all tests are performed at a common p-value threshold α [12].They used that formula to compute the power for performing all tests at the α = 0.01 level for unadjusted p-values.They did not incorporate that formula into the procedure of Jung [7] or Pounds and Cheng [9].
The formula of Hart et al. expresses power as a function of sample size, the log fold change of expression, the mean read depth per gene, and the coefficient of variation per gene [12].The power.hart function computes this power formula.Below, we use this function for average power calculation in the n.fdr.negbin to compute the sample size needed to achieve the desired average power while controlling the FDR at 10% in the setting that 99% of genes have no effect (log fold-change of 0), 1% of genes have a 2.7-fold change (log-fold change = log(2.7)= 0.99).All genes have an average read depth of 5 and a coefficient of variation equal to 0.60.theta = log(rep(c(1,2.7),c(9900,100))) # log fold change for each gene mu = rep(5,10000) # read depth sig = rep(0.6,10000)# coefficient of variation res = n.fdr.negbin(fdr= 0.1, pwr = 0.8, log.fc = theta, mu = mu, sig = sig, pi0.hat = "BH") In this setting, with a sample size of 20 subjects per group, using the Benjamini and Hochberg procedure chooses α = 0.00089 and provides an average power of 80.9% [1].

Example 4: Computing Power of t-Tests for a Given Sample Size and Effect Sizes
The FDRsamplesize2 package includes a function fdr.avepow that computes the FDR and average power of a set of tests as a function of α for a given sample size and effect sizes.
For example, this function can compute the FDR and average power of a set of two-sample t-tests as a function of the p-value threshold α with n = 10 per group in a setting with π 0 = 0.95 and effect size δ = 2 for the remaining tests.delta = rep(c(0,2),c(95,5)) # 95% null; 5% with delta = 2 res=fdr.avepow( In this setting, a sample size of n = 10 per group, using a fixed p-value threshold of α = 0.005 yields an FDR of 10.3% and an average power of 87.8%.

Simulations
FDRsamplesize2 uses published power calculation formulas for specific tests; thus, its performance is dependent on the accuracy of those power calculation formulas at the p-value threshold α determined by solving the equation relating the average power, the FDR, and the p-value threshold α.We performed a series of simulations to evaluate the accuracy of the power and sample size calculations.The simulation script is provided as a Supplementary File (FDRsamplesize2-simulation-script.R).The results are provided in Supplementary Table S1 (FDRsamplesize2-simulation-results.xlsx).Generally, the average power, FDR, and p-value threshold averaged over simulation replications are very close to those computed from the formulas.The calculations yielded sample sizes providing the desired FDR and average power for the two-sample t-test, Wilcoxon rank-sum test, comparison of two Poisson variables, t-test for non-zero correlations, signed-rank test, Fisher's exact test, and one-way ANOVA.
For the few tests for which FDRsamplesize2 yielded a sample size with inadequate FDR or average power, the power calculations were based on asymptotic normal approximations.This was the case for the two proportions z-test, the sign test, Cox regression, and the comparison of two negative binomial variables.The asymptotic normal approximations for these tests may not be sufficiently accurate for the very small p-value thresholds needed to maintain the desired FDR control.These results show that the accuracy of Jung's equation relating the p-value threshold, average power, and FDR depends on the accuracy of the formula used to compute average power.Users should be aware of whether the underlying power calculation formulas are exact or approximate when using FDRsamplesize2.We designed the FDRsamplesize2 package so that users can easily incorporate improved power calculation formulas for specific tests as they become available.It may be advisable to double-check some sample size calculations with simulation.Still, FDRsamplesize2 provides a good starting point for using simulations to compute power and sample size.

Discussion
The FDR is a widely used metric of statistical significance in multiple-testing applications.Power and sample size calculations are an important consideration in planning studies.Formulas that relate the p-value threshold, FDR, average power, and the proportion π 0 of tests with a true null hypothesis are useful for performing these calculations.In some applications, it is important to consider how average power impacts the estimate π0 of π 0 that is used to compute an FDR estimate τ in practice.We proposed a simple threerectangle approximation to the p-value histogram to derive two new power calcualtion formulas that incorporate this consideration.We also developed the FDRsamplesize2 R package, which performs sample size calculations using these formulas and Jung's formula [7].Additionally, it computes power for using the Benjamini and Hochberg procedure, which does not compute an estimate π0 of π 0 but instead effectively uses π 0 = 1 [1].We incorporated power calculation formulas for several statistical tests not covered by other published FDR power calculation software.The FDRsamplesize2 package also conveniently includes some individual methods scattered across different software resources, such as the method of Hart et al. to compare many negative binomial variables (read-count) across two groups [12].Also, FDRsamplesize2 provides a framework to compute the p-value threshold α needed for the desired average power and FDR, and then users can use traditional single-test power calculation methods to determine the sample size needed to achieve the desired average power at that α level.This provides additional flexibility and scope for users.Thus, FDRsamplesize2 should be a valuable resource to researchers as they develop grant proposals and plan their experiments.
In our simulations, FDRsamplesize2 provided very accurate sample size calculations for all tests that used exact power calculation formulas.In a few cases where asymptotic normal approximations were used to compute power, FDRsamplesize2 yielded sample sizes that did not satisfy pre-specified requirements for the FDR or average power.The inadequate power was observed for all four methods of using π 0 or π0 : the Benjamini-Hochberg approach, Jung's equation, and the two histogram-based approaches proposed here.Thus, we encourage users to use simulation to double-check calculations based on asymptotic normal approximations for power formulas.Still, even in these cases, FDRsamplesize2 calculations will save users time by providing useful guidance for a reasonable starting point for computationally intensive simulation-based power calculations.The Supplementary Simulation Script (FDRsamplesize2-simulation-script.R) also provides a useful template for performing those simulations.
There are other valuable resources to compute power and sample size for the FDR as well.The ssize.fdr package implements the method of Liu and Hwang for two-sample t-tests and some ANOVA tests [13,14].The PROPER method was designed specifically for planning differential expression analyses of RNA-seq data [15].Hu et al. developed a method using the expectation-maximization algorithm to compute power and sample size [16].Shao and Tseng developed a method to carefully account for correlation among tests in computing power and sample size [17].Pawitan et al. developed a method for performing many two-sample t-tests [18].The scPower framework computes power for multi-sample single-cell transcriptomic experiments [19].Ching et al. provided a simulation-based framework to compute power for RNA-seq differential expression analysis [20].Jung et al. developed a method to compute sample size for a one-way blocked design [21].Lee and Whitmore provided a framework to compute power for elaborate linear models [22].Li et al. developed a bulk method for performing exact tests on RNA-seq data.All these methods are approximations that work well when a very large number (hundreds or more) of hypothesis tests are performed [23].Glueck et al. developed a method to compute the exact power of the Benjamini and Hochberg procedure [1] for applications involving a relatively small number of hypothesis tests (such as evaluating multiple co-primary endpoints in a clinical trial) [24].Numerous tools are now available for many specific study designs and scientific objectives in the literature.Certainly, there are still other applications for which power calculation methods need to be developed.These provide opportunities for those interested in developing power calculation methods for planning genomic studies.

Supplementary Materials:
The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/genes15030344/s1,Table S1: Results of Simulation Studies for desired power = 0.5 and desired FDR = 0.05.which can be equivalently expressed as: and then as:

Figure 1 .
Figure 1.A three-rectangle approximation to the p-value histogram.The gray rectangle at the bottom represents the uniform distribution of -values with a true null hypothesis.The gold rectangle at the top left represents the significant p-values with false null hypotheses that are less than .The cyan rectangle at the top right represents the non-significant p-values with false null hypotheses that are greater than or equal .

Figure 1 .
Figure 1.A three-rectangle approximation to the p-value histogram.The gray rectangle at the bottom represents the uniform distribution of p-values with a true null hypothesis.The gold rectangle at the top left represents the significant p-values with false null hypotheses that are less than α.The cyan rectangle at the top right represents the non-significant p-values with false null hypotheses that are greater than or equal α.