Common Medical and Statistical Problems: The Dilemma of the Sample Size Calculation for Sensitivity and Specificity Estimation

Sample size calculation in biomedical practice is typically based on the problematic Wald method for a binomial proportion, with potentially dangerous consequences. This work highlights the need to incorporate the concept of conditional probability in sample size determination to avoid reduced sample sizes that lead to inadequate confidence intervals. Therefore, new definitions are proposed for coverage probability and expected length of confidence intervals for conditional probabilities, like sensitivity and specificity. The new definitions were used to assess seven confidence interval estimation methods. In order to determine the sample size, two procedures (an optimal one, based on the new definitions, and an approximation) were developed for each estimation method. Our findings confirm the similarity of the approximated sample sizes to the optimal ones. R code is provided to disseminate these methodological advances and translate them into biomedical practice.


Introduction
Diagnostic tests are helpful tools in medical decision-making, since they give indication of the disease or infection status. Assessing the relevance and utility of such tests through the estimation of sensitivity (Se) and specificity (Sp), among other measures, is crucial to make informed choices on their use. During the COVID-19 pandemic, the concepts of diagnostic tests and their performance have been a hot topic of debate, even outside the scientific community.
Sensitivity and specificity are conventional measures of diagnostic test accuracy. Sensitivity is the probability of a positive test result given that the individual has the condition (infection or disease), i.e., the probability of correctly identifying an individual with the condition. Specificity is the probability of a negative test result given that the individual does not have the condition, i.e., the probability of correctly identifying an individual without the condition.
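As a minimal illustration of the two definitions above, the sketch below computes the usual point estimates of sensitivity and specificity from the four cells of a 2×2 diagnostic table. The function name and the counts are hypothetical and serve only as an example.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (Se, Sp) estimated from the counts of a 2x2 diagnostic table.

    tp, fn: individuals WITH the condition who tested positive / negative.
    tn, fp: individuals WITHOUT the condition who tested negative / positive.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one individual with and one without the condition")
    se = tp / (tp + fn)  # P(test positive | condition present)
    sp = tn / (tn + fp)  # P(test negative | condition absent)
    return se, sp


# Hypothetical counts, for illustration only.
se, sp = sensitivity_specificity(tp=45, fn=5, tn=430, fp=20)
print(f"Se = {se:.3f}, Sp = {sp:.3f}")
```

Note that each estimate uses only part of the sample: the denominator of the sensitivity is the number of individuals with the condition, and that of the specificity is the number without it.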
Interval estimation has been strongly recommended by [1] and increasingly adopted in the biomedical literature. To improve the quality of reporting of diagnostic accuracy studies, the Standards for Reporting Diagnostic Accuracy (STARD) statement was originally published in 2003 and updated in 2015 as STARD 2015 [2]. Since the first version of STARD, 95% confidence intervals (CI) for the accuracy estimates of diagnostic tests have been recommended. However, several works have failed to comply with these guidelines [3], and obsolete or poorly described methods continue to be used to obtain 95% confidence intervals.
For a binomial proportion, the Wald method has been the most popular and widespread method over the years, because it is simple to teach, understand, and use. However, in the last two decades, theoretical statistical research (e.g., [4][5][6][7][8]) has reported its drawbacks, which are particularly severe for small sample sizes and proportions near 0 or 1. Some researchers have addressed the subject of interval estimation for a binomial proportion [7][8][9], exploring and comparing alternative methods, and providing recommendations concerning their performance. Regarding criteria to evaluate the performance of interval estimation methods, coverage probability (CP) and expected length (EL) are two of the most frequently used.
Reporting how the sample size determination was conducted is also an important step in scientific research. In the biomedical literature, there is an under-reporting of the required sample size and the achieved sample size at the end of the study. The updated list of STARD 2015 added the need for methodological details on sample size [2]. Once again, sample size calculation is dominated by the Wald method to estimate a binomial proportion. [10] emphasizes that confidence intervals for proportions are influenced, not only by the actual estimate of the proportion, but also by the sample size, and argues that there is no totally reliable sample size choice to achieve a desired goal.
In diagnostic test accuracy, an additional problem derives from the incorrect replacement of the conditional probabilities (sensitivity and specificity) by a binomial proportion, at least from a theoretical point of view. The maximum likelihood estimator of sensitivity (specificity) is a ratio that depends on the number of individuals in the sample with (without) the condition, which is a random variable with a binomial distribution that depends on the prevalence. This situation requires a new approach that modifies the expressions of coverage probability and expected length, so that these two criteria can be used to compare the performance of different interval estimation methods and to calculate the sample size, avoiding the traditional Wald method.
In this work, new expressions for coverage probability and expected length of a conditional probability interval, like sensitivity and specificity, were developed. These expressions are used to compare alternative interval estimation methods and to determine optimal sample sizes using simple random sampling. Moreover, we compare these sample sizes based on the rewritten formula of the expected length with approximated sample sizes. The approximated procedure has two steps. The first one considers sample sizes based on a binomial proportion (as calculated in [11]) to estimate the number of patients. The second step determines the sample size by the ratio of the previous number of subjects with (without) the condition to the prevalence (1-prevalence), following [12]. Differences between optimal and approximate procedures were explored, in order to evaluate if the simpler procedure based on an approximation has some practical use.
Practical problems arising from real studies are also discussed. Among several studies presented in the literature, we focus on examples related to two diseases, hepatitis B [13] and depression [14], which illustrate the diversity and complexity of problems found in practice. Regarding the diagnostic accuracy of tests to detect hepatitis B surface antigen, [13] presented a systematic review of the literature, including a meta-analysis. Sample sizes of the 40 reported studies ranged from 50 to 3956 (median size: 284). A survey of published papers on the diagnostic accuracy of depression screening tools [14] included 89 studies with sample sizes varying from 34 to 42,676 (median size: 224).
In different studies for the same disease, a substantial heterogeneity was found in terms of sample size, sensitivity and specificity estimates, reporting of their uncertainty, and identification of the statistical methods used. Another important issue is the prevalence of the disease in the population under study. For hepatitis B, the prevalence values reported in [13] range from 1.9% to 84.0%. Samples are frequently drawn from sub-populations with a higher probability of having the disease than the overall population. Despite efforts to improve reporting of diagnostic accuracy (e.g., STARD 2015 guidelines [2]), individual studies based on small sample sizes are often published. This usually leads to confidence intervals with an undesirably large width. In the survey of published studies on depression, even in journals with a high impact factor, authors reported that only ". . . 34% of the studies provided reasonably accurate intervals for estimates of sensitivity and specificity . . . " (p. 147 in [14]). This fact can mislead researchers without a solid statistical background about the trustworthiness of statistical tools in biomedical practice. A major challenge in this context lies in the absence of a single statistical response to the high diversity of practical situations. Thus, statistically sound methods and associated software need to be developed to help researchers in the biomedical areas.

Interval Estimation Using Different Methods
The methods for constructing confidence or credibility intervals under analysis in the present study were selected among the ones that seem most promising according to previous research [6][7][8][11]: Clopper-Pearson, Anscombe, Agresti-Coull, Bayesian with Uniform prior and with Jeffreys prior, and Wilson, for a binomial proportion. For comparison reasons, the Wald method was also included in the study, given its common application in biomedical research. These methods provide two-sided confidence intervals for an unknown binomial proportion in the population, p, based on a sample of size n. A nominal confidence level, 100 × (1 − α)%, is pre-specified for the intervals, which means that the probability of including p, the so-called coverage probability, is intended to be (1 − α).
Some notation and expressions for coverage probability and expected length are summarized in Appendix A. The expressions for the lower and upper bounds of these methods can be found in Appendix B and the code for their implementation is available in Supplement S1. For more details about these methods see [8].
In terms of coverage probability, [8] distinguish three classes of interval estimation methods. The strictly conservative methods have minimum coverage probability greater than or equal to (1 − α) for all the values of p and n. A second class includes the methods which are correct on average, i.e., for each fixed n, the mean coverage probability over all possible values of p is greater than or equal to (1 − α). There is a third class comprising the remaining methods, which are neither strictly conservative nor correct on average. Accordingly, the Clopper-Pearson and Anscombe methods can be classified as strictly conservative, while Bayesian-U, Jeffreys, Wilson, and Agresti-Coull can be classified as correct on average. The Wald method belongs to the third group and stands out for displaying quite low coverage probabilities for low or high values of p, which makes it an unreliable method.
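The contrast between the Wald method and a method that is correct on average can be checked numerically. The sketch below is not the code from Supplement S1; it assumes the textbook Wald and Wilson formulas, truncates both intervals to [0, 1] (an assumption), and evaluates the conventional coverage probability of a binomial proportion by summing the binomial probabilities of the outcomes whose interval contains p. All function names are hypothetical.

```python
from math import comb, sqrt
from statistics import NormalDist


def wald(x, n, alpha=0.05):
    """Textbook Wald interval, truncated to [0, 1] (truncation is an assumption)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = x / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return max(0.0, p_hat - half), min(1.0, p_hat + half)


def wilson(x, n, alpha=0.05):
    """Textbook Wilson score interval, truncated to [0, 1]."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = x / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def coverage_probability(interval, n, p, alpha=0.05):
    """Conventional coverage: total probability of outcomes x whose interval contains p."""
    total = 0.0
    for x in range(n + 1):
        lower, upper = interval(x, n, alpha)
        if lower <= p <= upper:
            total += comb(n, x) * p**x * (1 - p) ** (n - x)
    return total


# Illustrative setting: a small sample and a proportion near 1.
cp_wald = coverage_probability(wald, n=50, p=0.95)
cp_wilson = coverage_probability(wilson, n=50, p=0.95)
print(f"Wald: {cp_wald:.3f}  Wilson: {cp_wilson:.3f}  nominal: 0.950")
```

In this setting the Wald interval falls short of the nominal level, largely because the outcome x = n yields a degenerate interval that cannot contain p, whereas the Wilson interval stays above it.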

New Expressions for Coverage Probability and Expected Length of a Conditional Probability Interval
To be adequate for sensitivity and specificity, as well as for other conditional probabilities, the conventional formulas for coverage probability and expected length of a confidence interval (presented in Appendix A) should be updated. If the proportion p, for which we aim to find an interval estimate, is the sensitivity or specificity, then p is a conditional probability. Accordingly, the usual estimator of the sensitivity (specificity) of a given diagnostic test is a ratio that depends on the state of the patient. Although the sample size, n, is known in advance, the numbers of individuals with and without the condition among the n individuals actually observed, $N_D$ and $N_{\bar{D}}$, respectively, are random variables. For example, $N_D$ has a binomial distribution with parameters n and η, in which η is the prevalence, i.e., the proportion of individuals with the condition in the population under study. In fact, if p = Se, then $(X \mid N_D = m_D)$ is the number of individuals that tested positive among the $m_D$ that have the condition in the sample of size n. It follows that $(X \mid N_D = m_D) \sim \text{binomial}(m_D, Se)$. Similarly, if $(Y \mid N_{\bar{D}} = m_{\bar{D}})$ is the number of individuals that tested negative among the $m_{\bar{D}}$ that do not have the condition in the sample of size n, we have $(Y \mid N_{\bar{D}} = m_{\bar{D}}) \sim \text{binomial}(m_{\bar{D}}, Sp)$.
Instead of the conventional expressions for coverage probability and expected length (see Appendix A), the proper definitions for sensitivity are those of Equations (1) and (2). Since it makes no sense to build an interval estimate if the number of individuals with the condition in the sample is null ($N_D = 0$), the restriction $N_D > 0$ was added in Equations (1) and (2). In a simple random sampling scheme like the one we adopt, $N_D$ has a binomial distribution. This approach may be extended to other sampling schemes by using appropriate distributions for $N_D$.
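As a sketch, the definitions described above can be written as follows. This form is an assumption inferred from the verbal description (an expected value over the non-null values of $N_D$ under simple random sampling), not a verbatim copy of Equations (1) and (2). Here $\mathrm{CP}(m_D, Se)$ and $\mathrm{EL}(m_D, Se)$ denote the conventional binomial quantities for $m_D$ trials.

```latex
\begin{align}
\mathrm{CP}(n, Se, \eta) &= \sum_{m_D = 1}^{n} \mathrm{CP}(m_D, Se)\, P(N_D = m_D \mid N_D > 0), \tag{1} \\
\mathrm{EL}(n, Se, \eta) &= \sum_{m_D = 1}^{n} \mathrm{EL}(m_D, Se)\, P(N_D = m_D \mid N_D > 0), \tag{2}
\end{align}
\qquad \text{where} \qquad
P(N_D = m_D \mid N_D > 0) = \frac{\binom{n}{m_D}\, \eta^{m_D} (1 - \eta)^{n - m_D}}{1 - (1 - \eta)^{n}} .
```

The analogous expressions for specificity would replace $Se$ by $Sp$, $N_D$ by $N_{\bar{D}}$, and $\eta$ by $1 - \eta$.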
According to these new definitions, the coverage probability (expected length) associated to a sensitivity interval emerges as an expected value of coverage probability (expected length) taken for all possible non-null values of N D . Formulas for the coverage probability and expected length of a specificity interval or other conditional probability derive from the same rationale and have analogous interpretations.
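The interpretation as an expected value over the non-null values of $N_D$ can be sketched numerically. The code below is not the R code from the Supplementary Materials; it assumes the Wilson interval as the underlying method, a binomial distribution for $N_D$ conditioned on $N_D > 0$, and hypothetical function names. It averages the conventional coverage probability and expected length over that distribution.

```python
from math import exp, lgamma, log, log1p, sqrt
from statistics import NormalDist


def binom_pmf(k, n, p):
    """Binomial pmf computed on the log scale (assumes 0 < p < 1)."""
    log_coef = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coef + k * log(p) + (n - k) * log1p(-p))


def wilson(x, m, alpha=0.05):
    """Textbook Wilson score interval for x successes in m trials."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = x / m
    denom = 1 + z * z / m
    centre = (p_hat + z * z / (2 * m)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / m + z * z / (4 * m * m)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def binomial_cp_el(m, p, alpha=0.05):
    """Conventional coverage probability and expected length for m trials."""
    cp = el = 0.0
    for x in range(m + 1):
        lower, upper = wilson(x, m, alpha)
        weight = binom_pmf(x, m, p)
        el += weight * (upper - lower)
        if lower <= p <= upper:
            cp += weight
    return cp, el


def sensitivity_cp_el(n, se, prevalence, alpha=0.05):
    """Average the conventional CP and EL over N_D ~ binomial(n, prevalence), given N_D > 0."""
    p_nonzero = 1 - (1 - prevalence) ** n
    cp = el = 0.0
    for m in range(1, n + 1):
        weight = binom_pmf(m, n, prevalence) / p_nonzero
        cp_m, el_m = binomial_cp_el(m, se, alpha)
        cp += weight * cp_m
        el += weight * el_m
    return cp, el


# Illustrative values: total sample of 100, Se = 0.90, prevalence 0.25.
cp_new, el_new = sensitivity_cp_el(n=100, se=0.90, prevalence=0.25)
cp_old, el_old = binomial_cp_el(m=100, p=0.90)
print(f"conditional: CP = {cp_new:.3f}, EL = {el_new:.3f}")
print(f"binomial with all n: CP = {cp_old:.3f}, EL = {el_old:.3f}")
```

Because only about a quarter of the 100 individuals are expected to have the condition, the expected length under the conditional definition is clearly larger than the one obtained by treating all n individuals as binomial trials.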

Optimal and Approximate Sample Size Determination
Regarding sample size determination, the previous definitions have implications for calculating the sample size, n, needed to obtain a 100 × (1 − α)% interval estimate for p with a desired expected width, ω, i.e., we seek n such that EL(n, p) = ω.
According to [11], for different confidence interval methods, these equations may have neither a closed-form nor an integer solution, but it is always possible to find an integer, n, that minimizes |EL(n, p) − ω| within a certain tolerance, ξ, which in most situations will be such that EL(n, p) ≈ ω. Consequently, we can use a procedure to determine all the values of n verifying |EL(n, p) − ω| ≤ ξ. If multiple solutions are found, we select the one that satisfies a chosen criterion, e.g., the one that maximizes coverage probability. Besides this option, the provided code enables other alternatives. If no solution is found, it is possible to increase ξ.
To calculate adequate sample sizes for the estimation of sensitivity and specificity with the desired confidence, we followed this procedure using the expected length new definition (2) stated above (an analogous expression was applied in the case of specificity). This procedure is designated optimal.
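The search can be sketched as follows. This is an illustrative Python version, not the R code from the Supplementary Materials. It assumes the Wilson interval, a user-supplied search range (`n_min`, `n_max`, both hypothetical parameters), and the selection rule of maximizing coverage probability among the solutions. To keep the pure-Python run short, the example uses a wider target (ω = 0.20) and a larger tolerance than those used for the tables.

```python
from math import exp, lgamma, log, log1p, sqrt
from statistics import NormalDist


def binom_pmf(k, n, p):
    """Binomial pmf computed on the log scale (assumes 0 < p < 1)."""
    log_coef = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coef + k * log(p) + (n - k) * log1p(-p))


def wilson(x, m, z):
    """Textbook Wilson score interval for x successes in m trials."""
    p_hat = x / m
    denom = 1 + z * z / m
    centre = (p_hat + z * z / (2 * m)) / denom
    half = z * sqrt(p_hat * (1 - p_hat) / m + z * z / (4 * m * m)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def binomial_cp_el(m, p, z):
    """Conventional coverage probability and expected length for m trials."""
    cp = el = 0.0
    for x in range(m + 1):
        lower, upper = wilson(x, m, z)
        weight = binom_pmf(x, m, p)
        el += weight * (upper - lower)
        cp += weight * (lower <= p <= upper)
    return cp, el


def optimal_sample_sizes(se, prevalence, width, tol, n_min, n_max, alpha=0.05):
    """Return every (n, EL, CP) in [n_min, n_max] with |EL - width| <= tol."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # Conventional CP and EL for each possible number of diseased individuals.
    table = [None] + [binomial_cp_el(m, se, z) for m in range(1, n_max + 1)]
    hits = []
    for n in range(n_min, n_max + 1):
        p_nonzero = 1 - (1 - prevalence) ** n
        cp = el = 0.0
        for m in range(1, n + 1):
            weight = binom_pmf(m, n, prevalence) / p_nonzero
            cp += weight * table[m][0]
            el += weight * table[m][1]
        if abs(el - width) <= tol:
            hits.append((n, el, cp))
    return hits


hits = optimal_sample_sizes(se=0.90, prevalence=0.25, width=0.20,
                            tol=0.005, n_min=50, n_max=300)
# Among all solutions, keep the one with the highest coverage probability.
best = max(hits, key=lambda h: h[2]) if hits else None
print(best)
```

If `hits` comes back empty, the tolerance can be increased or the search range widened, in line with the remark above about increasing ξ.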
In parallel, following a rationale similar to the one described in [12] for the Wald method, an approximated procedure is also used and compared with the optimal one, now considering other methods besides the Wald. This approximated procedure took previously reported values for interval estimates of binomial proportions obtained by [11] as the number of subjects with (without) the disease required for sensitivity (specificity) estimation, $n^{(Se)}$ (or $n^{(Sp)}$). Theoretically, $E(N_D) = \eta \cdot n$. In practice, $E(N_D)$ is estimated by $n^{(Se)}$, hence $n = n^{(Se)}/\eta$ is an approximation of the optimal sample size for sensitivity estimation. Following the same reasoning, in the case of specificity, the corresponding sample size is approximated by $n = n^{(Sp)}/(1 − \eta)$.
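The second step of the approximated procedure is a single division, sketched below. The function name is hypothetical, rounding up to the next integer is an assumption, and the input of 500 subjects is a made-up value rather than one taken from [11].

```python
from fractions import Fraction
from math import ceil


def approximate_sample_size(n_condition, prevalence, target="sensitivity"):
    """Scale a binomial-proportion sample size by the expected share of the relevant group.

    n_condition: required number of subjects with the condition (for sensitivity)
                 or without it (for specificity).
    prevalence:  assumed prevalence, strictly between 0 and 1.
    """
    if not 0 < prevalence < 1:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    eta = Fraction(str(prevalence))  # exact arithmetic avoids float artefacts before rounding up
    share = eta if target == "sensitivity" else 1 - eta
    return ceil(Fraction(n_condition) / share)  # rounding up is an assumption


# Hypothetical requirement of 500 subjects in the relevant group, prevalence 0.10.
n_se = approximate_sample_size(500, 0.10, target="sensitivity")
n_sp = approximate_sample_size(500, 0.10, target="specificity")
print(n_se, n_sp)
```

With a prevalence below 0.5, the same requirement in the relevant group translates into a much larger total sample for sensitivity than for specificity.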
All calculations were performed using the statistical software R [15] and the code can be found in Supplementary Materials.

New Expressions for Coverage Probability of Interval Estimation Methods
Comparisons of new and conventional expressions of coverage probability and expected length were performed, using the seven methods, for some practical examples. The choice of the best methods is essentially based on their coverage probability. Nevertheless, this indicator varies with the sensitivity (specificity) of the diagnostic test under study, the prevalence of the condition, and the sample size. Given the diversity of practical examples, we adopted specific values for prevalence, sensitivity, and specificity motivated by a study presented in [16] concerning the diagnosis of dengue fever, an important vector-borne disease common in tropical areas.
Coverage probabilities were calculated for different ranges of sensitivity, prevalence, and sample size. Figure 1 shows the coverage probability of sensitivity intervals as a function of sensitivity, for η = 0.25 and n = 500, obtained with the new formula and the conventional formula, using Clopper-Pearson (strictly conservative method) and Wilson (correct on average). The new formula yields much smoother coverage probability curves, in contrast with the sharper indentations obtained with the conventional expressions. Anscombe, Jeffreys, and Bayesian-U are not shown in the plots given their similarity with other methods. For most sensitivity values, the coverage probability corresponding to the higher prevalence, η = 0.25, tends to be closer to the nominal confidence level than that for η = 0.10. This tendency is not so clear for sensitivity closer to 1, where the coverage probability is quite unstable and erratic.
For the studied methods, regarding coverage probability values, we can recognise a section closer to Se = 1 with erratic and unstable values, a section closer to Se = 0.5 with stable values, and an intermediate section in between. Strictly conservative methods (Clopper-Pearson and Anscombe) present higher coverage probability than the nominal confidence level, with curves that seem detached from this target. The Wilson, Jeffreys, and Bayesian-U methods are similar, exhibiting stability around the nominal confidence level, except for erratic values close to 1. The Agresti-Coull method lies between the Wilson method and the higher, detached values calculated for the strictly conservative methods.
It is also important to study the coverage probability associated with the sensitivity interval as a function of the prevalence. The coverage probability tends to be nearer the target as the prevalence increases towards η = 0.5. Figure 3 presents two sensitivities, 0.96 and 0.73, for a sample size of 500 individuals. The curve tends to be nearer the nominal confidence level in the case of the lower sensitivity (0.73).

Impact on Sample Sizes
The sample sizes based on the optimal procedure were calculated for 95% intervals for sensitivity or specificity, with expected width ω = 0.05 and a tolerance of ξ = $10^{-4}$. Approximated sample sizes for sensitivity intervals were obtained from the values reported by [11], divided by the prevalence, η = 0.10, and, for specificity, the divisor was (1 − η) = 0.90. Tables 1 and 2 show the optimal sample sizes corresponding to sensitivity and specificity, respectively. Both tables also present the differences between optimal and approximate sample sizes, δ.
Only small differences were detected between optimal and approximate sample sizes. As sensitivity (specificity) increases, the sample sizes required to satisfy the established criteria decrease. Moreover, the sample sizes needed in the case of sensitivity are much higher than the sample sizes for the particular case of specificity, as expected if the prevalence is smaller than 0.5. In the majority of practical situations, the prevalence is less than 0.5 and therefore the infected or diseased subjects are less represented in the sample, thus demanding higher sample sizes to guarantee an adequate sensitivity interval.
Table 1. Optimal sample sizes ($n_{optimal}$) corresponding to several sensitivities, and differences between the optimal and approximate ($n_{approx}$) sample sizes, δ = $n_{optimal} − n_{approx}$, admitting η = 0.10, ω = 0.05, ξ = $10^{-4}$, and 95% nominal confidence level.

Table 2 (column headings only; the values are not reproduced here): Sp, followed by $n_{optimal}$ and δ for each of the Clopper-Pearson, Anscombe, Agresti-Coull, Bayesian Uniform, Jeffreys, Wilson, and Wald methods.
In the hepatitis B meta-analysis performed by [13], the pooled sensitivity and specificity are 0.90 and 0.995, respectively. Considering a prevalence of 0.10 for a particular population of suspected cases, ω = 0.05, and ξ = $10^{-4}$, the optimal sample sizes for the reported sensitivity range from 5446 (Wilson) to 5890 (Anscombe), according to Table 1. Consequently, all 40 studies fail to provide reliably accurate sensitivity intervals. By contrast, if the target is Sp = 0.99, optimal sample sizes range from 100 (Jeffreys) to 142 (Agresti-Coull), as presented in Table 2. Therefore, 90% and 80% of the studies fulfil the Jeffreys and Agresti-Coull requirements, respectively.
The conservative methods, Clopper-Pearson and Anscombe, demand higher sample sizes than the remaining methods. The sample sizes corresponding to the Agresti-Coull method are intermediate between the conservative methods and Bayesian-U, Jeffreys, and Wilson.
Tables 1 and 2 show optimal sample sizes associated with expected interval width ω = 0.05. However, taking a closer look at the studies reported in the hepatitis B meta-analysis [13], we can see that many studies provide results with much wider confidence intervals, which, although of limited interest in some scenarios, may be useful in others for ethical and economic reasons. Therefore, varying the interval width from 0.05 to 0.10, Tables 3 and 4 present optimal sample sizes for sensitivity and specificity intervals, admitting a prevalence of 0.10, for the Clopper-Pearson and Wilson methods.

Returning to the hepatitis B surface antigen example [13], suppose that, at the design stage, the authors anticipate a sensitivity of 0.95 for the new test and want a confidence interval with an expected width of ω = 0.10. Then, according to Figure 4, a sample of size 1750 would be needed if the prevalence in the population under study is 0.05. If this prevalence doubles (η = 0.10), the optimal sample size decreases to 1000, and for η = 0.25 a sample size between 300 and 400 would be enough to obtain a Clopper-Pearson sensitivity interval with the desired width.

Discussion and Conclusions
New formulas for coverage probability and expected length were derived in this work. These expressions are more suitable for comparing alternative confidence interval methods and for determining optimal sample size when a conditional probability is the estimation target and the individuals' condition is not known in advance, under simple random sampling. The new formulas are transposable to other equally useful proportions that are conditional probabilities, besides sensitivity and specificity, such as the positive and negative predictive values. This approach may be extended to other types of random sampling.
The new coverage probability expression originates much smoother curves, in contrast with the sharper indentations obtained with the classical binomial proportion approach.
Due to the diversity and complexity of the problems found in practice, there seems to be no confidence interval method that performs better than all the others in every situation. When sensitivity and specificity are not very close to 1, the Wilson method appears to be a good recommendation. However, in risk situations and near the boundary, strictly conservative methods, such as Anscombe and Clopper-Pearson, are better choices. Using the new expressions specifically tailored for conditional probabilities like sensitivity and specificity, this work confirmed what others [4][5][6][7][8] have shown more generally: the Wald method has performance problems and may reach very low coverage probability, particularly for sensitivity (specificity) near 1.
Although some reported sample sizes (Tables 1-4) may seem excessive when compared to certain sample sizes in medical practice, works such as [12,17] also discuss sample sizes of the same magnitude. Moreover, large samples can actually be found in medical research [18,19] and this is likely to be the case in the context of the COVID-19 pandemic, given the wide scale application of serological and RT-PCR tests [20].
Some of the accuracy studies included in the systematic review addressing the diagnosis of hepatitis B [13] provide confidence intervals for the tests' sensitivity and specificity which are very wide and therefore useless. Similarly, in the survey of published studies on depression [14], among 86 studies where 95% confidence intervals were provided or could be calculated, merely 8% had sensitivity interval widths not exceeding 0.10, and 62% had widths greater than or equal to 0.21.
Determining in advance the sample size required to meet a predefined interval width, based on a sound procedure, is crucial to obtain informative confidence intervals.
Since, even in suspected clinical populations, the prevalence may be smaller than 0.5, sample sizes required for adequate estimation of sensitivities are higher than for specificities, in most of the cases. Choosing sample sizes based on sensitivity values, particularly for lower sensitivities, may lead to quite high sample sizes. Consequently, these sample sizes often guarantee adequate expected length and coverage probability of the confidence interval also for the remaining performance measures simultaneously estimated. In essence, if confidence intervals for both sensitivity and specificity are desired, the maximum of the calculated sample sizes assures that both cases are attended.
In an experimental design, when the sample size to estimate the sensitivity or specificity is determined without taking into account that a conditional probability is involved, inadequate sizes may be obtained.
The proposed approach inevitably requires a conjecture regarding the unknown prevalence in the target population, because the number of individuals with (without) the condition is a random variable depending on the prevalence. However, this is always the case when the individuals' condition is not known in advance.
The approximate method leads to sample sizes comparable to the optimal method. Given the availability of the R code (see Supplement S1), the new procedure is easily accessible, having the great advantage of leading to the optimal result based on a solid theoretical framework. When an optimal solution is easily available, there is no need for an approximation.
When the coverage probability does not achieve the nominal confidence level (1 − α), it should be at least close to (1 − α). Between two methods with similar coverage probabilities, the choice of method may be based on additional criteria such as minimizing the expected length, among others. Coverage probability and expected length are the indicators of confidence interval performance covered in the present work. Discussion on additional criteria can be found in [21,22].

Appendix B
The expressions for the lower and upper bounds of the confidence interval methods for a binomial proportion are presented in Table A1 together with the R commands (from the code available in Supplement S1) needed to call the corresponding functions to obtain the confidence intervals. For more details about these methods see [8].
Table A1. Two-sided 100 × (1 − α)% confidence intervals for a binomial proportion, [L(X), U(X)], where X is the number of successes, and R commands.

Columns: Method; R command; L(X), U(X)