Article

Common Medical and Statistical Problems: The Dilemma of the Sample Size Calculation for Sensitivity and Specificity Estimation

by M. Rosário Oliveira 1,*,†, Ana Subtil 1,† and Luzia Gonçalves 2
1 Department of Mathematics and CEMAT, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais, 1049-001 Lisbon, Portugal
2 Unidade de Saúde Pública Internacional e Bioestatística, Global Health and Tropical Medicine, Instituto de Higiene e Medicina Tropical, Universidade Nova de Lisboa and Centro de Estatística e Aplicações da Universidade de Lisboa, Rua da Junqueira, 100, 1349-008 Lisbon, Portugal
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.
Mathematics 2020, 8(8), 1258; https://doi.org/10.3390/math8081258
Submission received: 29 June 2020 / Revised: 25 July 2020 / Accepted: 29 July 2020 / Published: 1 August 2020
(This article belongs to the Special Issue Applied Medical Statistics: Theory, Computation, Applicability)

Abstract

Sample size calculation in biomedical practice is typically based on the problematic Wald method for a binomial proportion, with potentially dangerous consequences. This work highlights the need to incorporate the concept of conditional probability in sample size determination, to avoid reduced sample sizes that lead to inadequate confidence intervals. Therefore, new definitions are proposed for coverage probability and expected length of confidence intervals for conditional probabilities, like sensitivity and specificity. The new definitions were used to assess seven confidence interval estimation methods. In order to determine the sample size, two procedures (an optimal one, based on the new definitions, and an approximation) were developed for each estimation method. Our findings confirm the similarity of the approximated sample sizes to the optimal ones. R code is provided to disseminate these methodological advances and translate them into biomedical practice.

1. Introduction

Diagnostic tests are helpful tools in medical decision-making, since they give an indication of the disease or infection status. Assessing the relevance and utility of such tests through the estimation of sensitivity (Se) and specificity (Sp), among other measures, is crucial to make informed choices on their use. During the COVID-19 pandemic, the concepts of diagnostic tests and their performance have been a hot topic of debate, even outside the scientific community.
Sensitivity and specificity are conventional measures of diagnostic test accuracy. Sensitivity is the probability of a positive test result given that the individual has the condition (infection or disease), i.e., the probability of correctly identifying an individual with the condition. Specificity is the probability of a negative test result given that the individual does not have the condition, i.e., the probability of correctly identifying an individual without the condition.
Interval estimation has been strongly recommended by [1] and increasingly adopted in the biomedical literature. To improve the quality of reporting of diagnostic accuracy studies, the Standards for Reporting Diagnostic Accuracy (STARD) statement was originally published in 2003 and updated in 2015 as STARD 2015 [2]. Since the first version of STARD, 95% confidence intervals (CI) for the accuracy estimates of diagnostic tests have been recommended. However, several works have failed to comply with these guidelines [3], and obsolete, poorly described methods continue to be used to obtain 95% confidence intervals.
For a binomial proportion, the Wald method has been the most popular and widespread method over the years, because it is simple to teach, understand, and use. However, in the last two decades, theoretical statistical research (e.g., [4,5,6,7,8]) has reported its drawbacks, which are particularly severe for small sample sizes and proportions near 0 or 1. Some researchers have addressed the subject of interval estimation for a binomial proportion [7,8,9], exploring and comparing alternative methods, and providing recommendations concerning their performance. Regarding criteria to evaluate the performance of interval estimation methods, coverage probability (CP) and expected length (EL) are two of the most frequently used.
Reporting how the sample size determination was conducted is also an important step in scientific research. In the biomedical literature, both the required sample size and the sample size achieved at the end of the study are under-reported. The updated list of STARD 2015 added the need for methodological details on sample size [2]. Once again, sample size calculation is dominated by the Wald method to estimate a binomial proportion. [10] emphasizes that confidence intervals for proportions are influenced not only by the actual estimate of the proportion but also by the sample size, and argues that there is no totally reliable sample size choice to achieve a desired goal.
In diagnostic test accuracy, an additional problem derives, at least from a theoretical point of view, from wrongly replacing the conditional probabilities (sensitivity and specificity) with a binomial proportion. The maximum likelihood estimator of sensitivity (specificity) is a ratio that depends on the number of individuals in the sample with (without) the condition, which is a random variable with a binomial distribution that depends on the prevalence. This situation requires a new approach that modifies the expressions of coverage probability and expected length, so that these two criteria can be used to compare the performance of different interval estimation methods and to calculate sample sizes, avoiding the traditional Wald method.
In this work, new expressions for coverage probability and expected length of a conditional probability interval, like sensitivity and specificity, were developed. These expressions are used to compare alternative interval estimation methods and to determine optimal sample sizes using simple random sampling. Moreover, we compare these sample sizes based on the rewritten formula of the expected length with approximated sample sizes. The approximated procedure has two steps. The first one considers sample sizes based on a binomial proportion (as calculated in [11]) to estimate the number of patients. The second step determines the sample size by the ratio of the previous number of subjects with (without) the condition to the prevalence (1-prevalence), following [12]. Differences between optimal and approximate procedures were explored, in order to evaluate if the simpler procedure based on an approximation has some practical use.
Practical problems arising from real studies are also discussed. Among several studies presented in the literature, we focus on examples related to two diseases, hepatitis B [13] and depression [14], which illustrate the diversity and complexity of problems found in practice. Regarding the diagnostic accuracy of tests to detect hepatitis B surface antigen, [13] presented a systematic review of the literature, including a meta-analysis. Sample sizes of the 40 reported studies ranged from 50 to 3956 (median size: 284). A survey of published papers on the diagnostic accuracy of depression screening tools [14] included 89 studies with sample sizes varying from 34 to 42,676 (median size: 224).
Across different studies of the same disease, substantial heterogeneity was found in terms of sample size, sensitivity and specificity estimates, reporting of their uncertainty, and identification of the statistical methods used. Another important issue is the prevalence of the disease in the population under study. For hepatitis B, the prevalence values reported in [13] range from 1.9% to 84.0%. Samples are frequently drawn from sub-populations with a higher probability of having the disease than the overall population. Despite efforts to improve reporting of diagnostic accuracy (e.g., STARD 2015 guidelines [2]), individual studies based on small sample sizes are often published. This usually leads to confidence intervals with an undesirably large width. In the survey of published studies on depression, even in journals with a high impact factor, authors reported that only “…34% of the studies provided reasonably accurate intervals for estimates of sensitivity and specificity …” (p. 147 in [14]). This fact can mislead researchers without a solid statistical background about the trustworthiness of statistical tools in biomedical practice. A major challenge in this context lies in the absence of a single statistical response to the high diversity of practical situations. Thus, statistically sound methods and associated software need to be developed to help researchers in the biomedical areas.

2. Materials and Methods

2.1. Interval Estimation Using Different Methods

The methods for constructing confidence or credibility intervals under analysis in the present study were selected among the ones that seem most promising according to previous research [6,7,8,11]: Clopper–Pearson, Anscombe, Agresti–Coull, Bayesian with Uniform prior and with Jeffreys prior, and Wilson, for a binomial proportion. For comparison reasons, the Wald method was also included in the study, given its common application in biomedical research. These methods provide two-sided confidence intervals for an unknown binomial proportion in the population, p, based on a sample of size n. A nominal confidence level, $100 \times (1-\alpha)\%$, is pre-specified for the intervals, which means that the probability of including p, the so-called coverage probability, is intended to be $(1-\alpha)$.
Some notation and expressions for coverage probability and expected length are summarized in Appendix A. The expressions for the lower and upper bounds of these methods can be found in Appendix B and the code for their implementation is available in Supplement S1. For more details about these methods see [8].
In terms of coverage probability, [8] distinguish three classes of interval estimation methods. The strictly conservative methods have minimum coverage probability greater than or equal to $(1-\alpha)$ for all values of p and n. A second class includes the methods which are correct on average, i.e., for each fixed n, the mean coverage probability over all possible values of p is greater than or equal to $(1-\alpha)$. A third class comprises the remaining methods, which are neither strictly conservative nor correct on average. Accordingly, the Clopper–Pearson and Anscombe methods can be stated as strictly conservative, while Bayesian-U, Jeffreys, Wilson, and Agresti–Coull can be classified as correct on average. The Wald method belongs to the third group and stands out for displaying quite low coverage probabilities for low or high values of p, which makes it an unreliable method.
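The distinction between minimum and mean coverage can be explored numerically. The following is a minimal Python sketch, not the authors' R code from Supplement S1: it evaluates the conventional coverage probability of Appendix A on a grid of p values for a fixed n and reports the minimum and the mean over that grid. The function names, the grid, and the textbook form of the Wald interval are assumptions made for this illustration.

```python
from math import comb, sqrt
from statistics import NormalDist


def wald(x, n, alpha=0.05):
    """Textbook Wald interval (assumed form), truncated to [0, 1]."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p_hat = x / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return max(p_hat - half, 0.0), min(p_hat + half, 1.0)


def wilson(x, n, alpha=0.05):
    """Wilson interval, written as in Table A1."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    centre = 2 * x + z * z
    half = z * sqrt(z * z + 4 * x * (1 - x / n))
    denom = 2 * (n + z * z)
    return (centre - half) / denom, (centre + half) / denom


def coverage(method, n, p, alpha=0.05):
    """Conventional CP(n, p): total binomial mass of the outcomes whose interval contains p."""
    total = 0.0
    for j in range(n + 1):
        lo, hi = method(j, n, alpha)
        if lo <= p <= hi:
            total += comb(n, j) * p**j * (1 - p) ** (n - j)
    return total


def summarise(method, n, grid_size=199, alpha=0.05):
    """Minimum and mean coverage over an equally spaced grid of p in (0, 1)."""
    grid = [(k + 1) / (grid_size + 1) for k in range(grid_size)]
    cps = [coverage(method, n, p, alpha) for p in grid]
    return min(cps), sum(cps) / len(cps)


for name, method in (("Wald", wald), ("Wilson", wilson)):
    cp_min, cp_mean = summarise(method, n=30)
    print(f"{name}: min CP = {cp_min:.3f}, mean CP = {cp_mean:.3f}")
```

A finite grid only approximates the minimum and the mean over all p, so this is a rough diagnostic rather than a proof that a method belongs to a given class.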

2.2. New Expressions for Coverage Probability and Expected Length of a Conditional Probability Interval

In order to become adequate for sensitivity and specificity, as well as other conditional probabilities, the conventional formulas for coverage probability and expected length of a confidence interval (presented in Appendix A) should be updated. If the proportion p, for which we aim to find an interval estimate, is the sensitivity or specificity, then p is a conditional probability. Accordingly, the usual estimator of the sensitivity (specificity) of a given diagnostic test is a ratio that depends on the state of the patient. Although the sample size, n, is known in advance, the numbers of individuals with and without the condition among the n individuals actually observed, $N_D$ and $N_{\bar{D}}$, respectively, are random variables. In particular, $N_D$ has a binomial distribution with parameters n and $\eta$, where $\eta$ is the prevalence, i.e., the proportion of individuals with the condition in the population under study. If $p = \mathrm{Se}$, then $(X \mid N_D = m_D)$ is the number of individuals that tested positive among the $m_D$ that have the condition in the sample of size n. It follows that $(X \mid N_D = m_D) \sim \mathrm{binomial}(m_D, \mathrm{Se})$. Similarly, letting $(Y \mid N_{\bar{D}} = m_{\bar{D}})$ be the number of individuals that tested negative among the $m_{\bar{D}}$ that do not have the condition in the sample of size n, we have $(Y \mid N_{\bar{D}} = m_{\bar{D}}) \sim \mathrm{binomial}(m_{\bar{D}}, \mathrm{Sp})$.
Instead of the conventional expressions for coverage probability and expected length (see Appendix A), the proper definitions for sensitivity are the following:
$$\mathrm{CP}(\mathrm{Se}, n, \eta) = \sum_{m=1}^{n} \mathrm{CP}(m, \mathrm{Se})\, \Pr\{N_D = m \mid N_D > 0\}, \tag{1}$$
$$\mathrm{EL}(\mathrm{Se}, n, \eta) = \sum_{m=1}^{n} \mathrm{EL}(m, \mathrm{Se})\, \Pr\{N_D = m \mid N_D > 0\}. \tag{2}$$
Since it makes no sense to build an interval estimate if the number of individuals with the condition in the sample is null ($N_D = 0$), the restriction $N_D > 0$ was added in Equations (1) and (2). In a simple random sampling scheme like the one we adopt, $N_D$ has a binomial distribution. This approach may be extended to other sampling schemes by using appropriate distributions for $N_D$.
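For simple random sampling, the conditional weights in Equations (1) and (2) can be written out explicitly. The expression below is not stated in this form in the text; it follows directly from the binomial distribution of the number of individuals with the condition and from the definition of conditional probability, the denominator being the probability of observing at least one individual with the condition.

```latex
\Pr\{N_D = m \mid N_D > 0\}
  = \frac{\Pr\{N_D = m\}}{1 - \Pr\{N_D = 0\}}
  = \frac{\binom{n}{m}\,\eta^{m}(1-\eta)^{n-m}}{1-(1-\eta)^{n}},
  \qquad m = 1, \dots, n .
```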
According to these new definitions, the coverage probability (expected length) associated with a sensitivity interval emerges as an expected value of the coverage probability (expected length) taken over all possible non-null values of $N_D$. Formulas for the coverage probability and expected length of a specificity interval or other conditional probability derive from the same rationale and have analogous interpretations.
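As an illustration of these definitions, the following Python sketch averages the conventional coverage probability and expected length over the conditional distribution of the number of individuals with the condition. It is a sketch under stated assumptions, not the authors' R implementation: the Wilson interval is used as the example method, simple random sampling is assumed, and all function names are hypothetical.

```python
from math import exp, lgamma, log, log1p, sqrt
from statistics import NormalDist


def binom_pmf(k, n, p):
    """Binomial probability P(K = k) for 0 < p < 1, computed on the log scale."""
    log_coef = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coef + k * log(p) + (n - k) * log1p(-p))


def wilson(x, m, alpha=0.05):
    """Wilson interval for x successes out of m trials (Table A1)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    centre = 2 * x + z * z
    half = z * sqrt(z * z + 4 * x * (1 - x / m))
    denom = 2 * (m + z * z)
    return (centre - half) / denom, (centre + half) / denom


def cp_el(m, p, alpha=0.05):
    """Conventional CP(m, p) and EL(m, p) for a binomial proportion (Appendix A)."""
    cp = el = 0.0
    for j in range(m + 1):
        w = binom_pmf(j, m, p)
        lo, hi = wilson(j, m, alpha)
        el += w * (hi - lo)
        if lo <= p <= hi:
            cp += w
    return cp, el


def cond_weights(n, eta):
    """Pr{N_D = m | N_D > 0} for m = 1..n, with N_D ~ binomial(n, eta)."""
    norm = 1.0 - (1.0 - eta) ** n
    return [binom_pmf(m, n, eta) / norm for m in range(1, n + 1)]


def cp_el_conditional(se, n, eta, alpha=0.05):
    """New CP(Se, n, eta) and EL(Se, n, eta): weighted averages over m = 1..n."""
    cp = el = 0.0
    for m, w in enumerate(cond_weights(n, eta), start=1):
        cp_m, el_m = cp_el(m, se, alpha)
        cp += w * cp_m
        el += w * el_m
    return cp, el


cp, el = cp_el_conditional(se=0.9, n=60, eta=0.25)
print(f"CP = {cp:.4f}, EL = {el:.4f}")
```

The double loop costs on the order of n² binomial terms, which stays manageable for sample sizes in the hundreds; the same structure applies to specificity by replacing the prevalence with its complement.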

2.3. Optimal and Approximate Sample Size Determination

Regarding sample size determination, the previous definitions have implications for calculating the sample size, n, needed to obtain a $100 \times (1-\alpha)\%$ interval estimate for p with a desired expected width, $\omega$, i.e., we seek n such that
$$\mathrm{EL}(n, p) = \omega.$$
According to [11], for different confidence interval methods, this equation may have neither a closed-form nor an integer solution, but it is always possible to find an integer, n, that minimizes $|\mathrm{EL}(n, p) - \omega|$ within a certain tolerance, $\xi$, which in most situations will be such that $\mathrm{EL}(n, p) \approx \omega$. Consequently, we can use a procedure to determine all the values of n satisfying:
$$|\mathrm{EL}(n, p) - \omega| \leq \xi.$$
If multiple solutions are found, we select the one that satisfies a chosen criterion, e.g., the one that maximizes coverage probability. Besides this option, the provided code enables other alternatives. If no solution is found, it is possible to increase $\xi$.
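The search just described can be sketched in a few lines of Python. This is an illustrative sketch rather than the code of Supplement S1: it uses the conventional expected length of Appendix A with the Wilson interval, scans a user-supplied range of candidate sample sizes, keeps those within the tolerance, and returns the candidate with the highest coverage probability. The function names, the candidate range, and the example values of p, the width, and the tolerance are assumptions chosen for illustration.

```python
from math import exp, lgamma, log, log1p, sqrt
from statistics import NormalDist


def binom_pmf(k, n, p):
    """Binomial probability P(K = k) for 0 < p < 1, computed on the log scale."""
    log_coef = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return exp(log_coef + k * log(p) + (n - k) * log1p(-p))


def wilson(x, n, alpha=0.05):
    """Wilson interval for x successes out of n trials (Table A1)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    centre = 2 * x + z * z
    half = z * sqrt(z * z + 4 * x * (1 - x / n))
    denom = 2 * (n + z * z)
    return (centre - half) / denom, (centre + half) / denom


def cp_el(n, p, alpha=0.05):
    """Conventional CP(n, p) and EL(n, p) (Appendix A)."""
    cp = el = 0.0
    for j in range(n + 1):
        w = binom_pmf(j, n, p)
        lo, hi = wilson(j, n, alpha)
        el += w * (hi - lo)
        if lo <= p <= hi:
            cp += w
    return cp, el


def sample_size(p, omega, xi, n_values, alpha=0.05):
    """Return (n, CP, EL) with |EL - omega| <= xi and the highest CP, or None."""
    candidates = []
    for n in n_values:
        cp, el = cp_el(n, p, alpha)
        if abs(el - omega) <= xi:
            candidates.append((cp, n, el))
    if not candidates:
        return None  # no solution: the caller may increase xi
    cp, n, el = max(candidates)  # highest CP; ties go to the larger n
    return n, cp, el


print(sample_size(p=0.9, omega=0.10, xi=1e-3, n_values=range(50, 401)))
```

Replacing `cp_el` with a version based on the conditional definitions of Section 2.2 would turn this into a sketch of the optimal procedure, at a higher computational cost.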
To calculate adequate sample sizes for the estimation of sensitivity and specificity with the desired confidence, we followed this procedure using the new definition of expected length (2) stated above (an analogous expression was applied in the case of specificity). This procedure is designated optimal.
In parallel, following a rationale similar to the one described in [12] for the Wald method, an approximated procedure is also used and compared with the optimal one, now considering other methods besides Wald. This approximated procedure took the values previously reported by [11] for interval estimates of binomial proportions as the number of subjects with (without) the disease required for sensitivity (specificity) estimation, $n(\mathrm{Se})$ (or $n(\mathrm{Sp})$). Theoretically, $E(N_D) = \eta \cdot n$. In practice, $E(N_D)$ is estimated by $n(\mathrm{Se})$, hence $n = n(\mathrm{Se})/\eta$ is an approximation of the optimal sample size for sensitivity estimation. Following the same reasoning, in the case of specificity, the corresponding sample size is approximated by $n = n(\mathrm{Sp})/(1-\eta)$.
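The arithmetic of the approximated procedure is simple enough to show directly. In the Python sketch below, the function name, the rounding up to the next integer, and the input of 100 subjects are assumptions made for illustration; the text does not state a rounding rule, and 100 is a hypothetical value rather than one taken from [11].

```python
from fractions import Fraction
from math import ceil


def approx_sample_size(n_binomial, prevalence, target="Se"):
    """Approximate total sample size n = n(Se)/eta or n = n(Sp)/(1 - eta).

    n_binomial is the number of subjects with (Se) or without (Sp) the
    condition, as obtained from a binomial-proportion sample size calculation.
    """
    eta = Fraction(str(prevalence))  # exact arithmetic avoids float artefacts
    if not 0 < eta < 1:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    share = eta if target == "Se" else 1 - eta
    return ceil(Fraction(n_binomial) / share)  # rounding up is an assumption


# Hypothetical requirement of 100 subjects in the relevant group, prevalence 0.10
print(approx_sample_size(100, 0.10, "Se"))  # 100 / 0.10
print(approx_sample_size(100, 0.10, "Sp"))  # 100 / 0.90, rounded up
```

With a prevalence below 0.5 the divisor for sensitivity is the smaller one, so the same requirement in the relevant group translates into a larger total sample for sensitivity than for specificity.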
All calculations were performed using the statistical software R [15] and the code can be found in Supplementary Materials.

3. Results

3.1. New Expressions for Coverage Probability of Interval Estimation Methods

Comparisons of new and conventional expressions of coverage probability and expected length were performed, using the seven methods, for some practical examples. The choice of the best methods is essentially based on their coverage probability. Nevertheless, this indicator varies with the sensitivity (specificity) of the diagnostic test under study, the prevalence of the condition, and the sample size. Given the diversity of practical examples, we adopted specific values for prevalence, sensitivity, and specificity motivated by a study presented in [16] concerning the diagnosis of dengue fever, an important vector-borne disease common in tropical areas.
Coverage probabilities were calculated considering different range values of sensitivity, prevalence, and sample size.
Figure 1 shows the coverage probability of sensitivity intervals as a function of sensitivity, for $\eta = 0.25$ and $n = 500$, obtained with the new formula and the conventional formula, using Clopper–Pearson (strictly conservative method) and Wilson (correct on average). The new formula yields much smoother coverage probability curves, in contrast with the sharper indentations obtained with the conventional expressions.
Figure 2 illustrates the coverage probability of a sensitivity interval, admitting two prevalence values (0.10 and 0.25) and a sample size of 500 individuals, according to four methods (Clopper–Pearson, Agresti–Coull, Wilson, and Wald).
Anscombe, Jeffreys, and Bayesian-U are not shown in the plots given their similarity with other methods. For most sensitivity values, the coverage probability corresponding to the highest prevalence, $\eta = 0.25$, tends to be closer to the nominal confidence level than the one for $\eta = 0.10$. This tendency is not so clear for sensitivity closer to 1, where the coverage probability is quite unstable and erratic.
For the studied methods, regarding coverage probability values, we can recognise a section closer to $\mathrm{Se} = 1$ with erratic and unstable values, a section closer to $\mathrm{Se} = 0.5$ with stable values, and an intermediate section in between. Strictly conservative methods (Clopper–Pearson and Anscombe) present higher coverage probability than the nominal confidence level, with curves that seem detached from this target. The Wilson, Jeffreys, and Bayesian-U methods are similar, exhibiting stability around the nominal confidence level, except for erratic values close to 1. The Agresti–Coull method lies between the Wilson method and the higher, detached values calculated for the strictly conservative methods.
It is also important to study the coverage probability associated with the sensitivity interval as a function of the prevalence. The coverage probability tends to be nearer the target as the prevalence increases towards $\eta = 0.5$. Figure 3 presents two sensitivities, 0.96 and 0.73, for a sample size of 500 individuals. The curve tends to be nearer the nominal confidence level in the case of the lower sensitivity (0.73).

3.2. Impact on Sample Sizes

The sample sizes based on the optimal procedure were calculated for 95% intervals for sensitivity or specificity, with expected width $\omega = 0.05$ and a tolerance of $\xi = 10^{-4}$. Approximated sample sizes for sensitivity intervals were obtained from the values reported by [11], divided by the prevalence, $\eta = 0.10$, and, for specificity, the divisor was $(1-\eta) = 0.90$.
Table 1 and Table 2 show the optimal sample sizes corresponding to sensitivity and specificity, respectively. Both tables also present the differences between optimal and approximate sample sizes, $\delta$.
Only small differences were detected between optimal and approximate sample sizes. As sensitivity (specificity) increases, the sample sizes required to satisfy the established criteria decrease. Moreover, the sample sizes needed in the case of sensitivity are much higher than the sample sizes for specificity, as expected if the prevalence is smaller than 0.5. In the majority of practical situations, the prevalence is less than 0.5 and therefore the infected or diseased subjects are less represented in the sample, thus demanding higher sample sizes to guarantee an adequate sensitivity interval.
In the hepatitis B meta-analysis performed by [13], the pooled sensitivity and specificity are 0.90 and 0.995, respectively. Considering a prevalence of 0.10 for a particular population of suspected cases, $\omega = 0.05$, and $\xi = 10^{-4}$, the optimal sample sizes for the reported sensitivity range from 5446 (Wilson) to 5890 (Anscombe), according to Table 1. Hence, none of the 40 studies, whose sample sizes range from 50 to 3956, is large enough to provide reliably accurate sensitivity intervals. By contrast, if the target is $\mathrm{Sp} = 0.99$, optimal sample sizes range from 100 (Jeffreys) to 142 (Agresti–Coull), as presented in Table 2. Therefore, 90% and 80% of the studies fulfil the Jeffreys and Agresti–Coull requirements, respectively.
The conservative methods, Clopper–Pearson and Anscombe, demand higher sample sizes than the remaining methods. The sample sizes corresponding to the Agresti–Coull method are intermediate between the conservative methods and Bayesian-U, Jeffreys, and Wilson.
Table 1 and Table 2 show optimal sample sizes associated with expected interval width $\omega = 0.05$. However, taking a closer look at the studies reported in the hepatitis B meta-analysis [13], we can see that many studies provide results with much wider confidence intervals, which, although of limited interest in some scenarios, may be useful in others, for ethical and economic reasons. Therefore, with the interval width ranging from 0.05 to 0.10, Table 3 and Table 4 present optimal sample sizes for sensitivity and specificity intervals, assuming a prevalence of 0.10, for the Clopper–Pearson and Wilson methods.
As expected, wider confidence intervals correspond to smaller optimal sample sizes. For example, if the researcher anticipates a sensitivity of 0.95 (and a prevalence of 0.10), a Clopper–Pearson interval of length $\omega = 0.05$ would require a sample of 3280 observations to fulfil the desired requirements. Increasing the interval width by 0.01 to $\omega = 0.06$, the number of observations needed ($n_{\mathrm{optimal}} = 2326$) is reduced by 954. Moreover, if the initial width doubles ($\omega = 0.10$), the optimal sample size decreases to 905, which is only 27.6% of the initial value.
Figure 4 and Figure 5 allow us to explore the expected length for desired confidence intervals as a function of optimal sample size, for specific values of prevalence, sensitivity (Figure 4) and specificity (Figure 5). In general, the expected lengths of the intervals decrease for high sensitivity and specificity, for the same sample size and prevalence.
Higher sample sizes and higher prevalence also lead to confidence intervals with lower expected length. The same figures also confirm findings already discussed: (i) Wilson, being correct on average, leads to smaller optimal sample sizes than Clopper–Pearson, a strictly conservative method; (ii) optimal sample sizes for specificity intervals are smaller than the ones demanded for sensitivity intervals when the prevalence of the condition is smaller than 0.5.
Illustrating again with the work on hepatitis B surface antigen [13]: if, when the experiment was designed, the authors had anticipated a sensitivity of 0.95 for the new test and wanted a confidence interval with an expected width of $\omega = 0.10$, then, according to Figure 4, for a prevalence in the population under study equal to 0.05, a sample of size 1750 would be needed. If this prevalence doubles ($\eta = 0.10$), the optimal sample size decreases to 1000, and for $\eta = 0.25$ a sample size between 300 and 400 would be enough to obtain a Clopper–Pearson sensitivity interval with the desired width.

4. Discussion and Conclusions

New formulas for coverage probability and expected length were derived in this work. These expressions are more suitable for comparing alternative confidence interval methods and for determining optimal sample size when a conditional probability is the estimation target and the individuals’ condition is not known in advance, under simple random sampling. The new formulas are transposable to other equally useful proportions that are conditional probabilities, besides sensitivity and specificity, such as the positive and negative predictive values. This approach may be extended to other types of random sampling.
The new coverage probability expression yields much smoother curves, in contrast with the sharper indentations obtained with the classical binomial proportion approach.
Due to the diversity and complexity of the problems found in practice, there seems to be no confidence interval method that performs better than all the others in every situation. When sensitivity and specificity are not very close to 1, the Wilson method appears to be a good recommendation. However, in risk situations and near the boundary, strictly conservative methods, such as Anscombe and Clopper–Pearson, are better choices. Using the new expressions specifically tailored for conditional probabilities like sensitivity and specificity, this work showed, as others before [4,5,6,7,8] have shown more generally, the performance problems of the Wald method, which may reach very low coverage probability, particularly for sensitivity (specificity) near 1.
Although some reported sample sizes (Table 1, Table 2, Table 3 and Table 4) may seem excessive when compared to certain sample sizes in medical practice, works such as [12,17] also discuss sample sizes of the same magnitude. Moreover, large samples can actually be found in medical research [18,19] and this is likely to be the case in the context of the COVID-19 pandemic, given the wide-scale application of serological and RT-PCR tests [20].
Some of the accuracy studies included in the systematic review addressing the diagnosis of hepatitis B [13] provide confidence intervals for the tests’ sensitivity and specificity which are very wide and therefore useless. For example, a study reports a sensitivity of 0.60 (95% CI [0.15, 0.95]) and another a specificity of 1.00 (95% CI [0.03, 1.00]), corresponding to interval widths of 0.80 and 0.97, respectively. These are stark examples of how pointless confidence intervals can be from the estimation point of view, since they comprise almost the complete range of possible values.
Similarly, in the survey of published studies on depression [14], among 86 studies where 95% confidence intervals were provided or could be calculated, merely 8% had sensitivity interval widths not exceeding 0.10, and 62% had widths greater than or equal to 0.21.
Determining in advance the sample size required to meet a predefined interval width, based on a sound procedure, is crucial to obtain informative confidence intervals.
Since, even in suspected clinical populations, the prevalence may be smaller than 0.5, sample sizes required for adequate estimation of sensitivities are higher than for specificities in most cases. Choosing sample sizes based on sensitivity values, particularly for lower sensitivities, may lead to quite high sample sizes. Consequently, these sample sizes often also guarantee adequate expected length and coverage probability of the confidence intervals for the remaining performance measures estimated simultaneously. In essence, if confidence intervals for both sensitivity and specificity are desired, the maximum of the calculated sample sizes ensures that both requirements are met.
In an experimental design, when the sample size to estimate the sensitivity or specificity is determined without taking into account that a conditional probability is involved, inadequate sizes may be obtained.
The proposed approach inevitably requires a conjecture regarding the unknown prevalence in the target population, because the number of individuals with (without) the condition is a random variable depending on the prevalence. However, this is always the case when the individuals’ condition is not known in advance.
The approximate method leads to sample sizes comparable to the optimal method. Given the availability of the R code (see Supplement S1), the new procedure is easily accessible, having the great advantage of leading to the optimal result based on a solid theoretical framework. When an optimal solution is easily available, there is no need for an approximation.

Supplementary Materials

The following are available online at https://www.mdpi.com/2227-7390/8/8/1258/s1, Supplement S1.

Author Contributions

M.R.O., A.S. and L.G. equally contributed for conceptualization; methodology; validation; formal analysis; investigation; writing—original draft preparation; writing—review and editing. M.R.O. and A.S. developed the software and the results’ visualization. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded partially by Fundação para a Ciência e a Tecnologia (FCT), Portugal, under the projects UIDB/00006/2020 and UIDP/00006/2020 of CEAUL, UIDB/04621/2020 and UIDP/04621/2020 of CEMAT/IST-ID, and UID/04413/2020 of GHTM.

Acknowledgments

We are thankful for the fruitful suggestions and encouragement from Prof. Polychronis Kostoulas, Chair of CA18208 Action, Novel tools for test evaluation and disease prevalence estimation, from COST (European Cooperation in Science and Technology), funded by the Horizon Framework Programme of the European Union, and also from António Pacheco, from Instituto Superior Técnico, Universidade de Lisboa. We would also like to thank Ana Pires and Conceição Amado, from Instituto Superior Técnico, Universidade de Lisboa, for the R code used in the initial steps of this research.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
CI      Confidence interval
CP      Coverage probability
EL      Expected length
Se      Sensitivity
Sp      Specificity

Appendix A

The notation and definitions presented here closely follow [8,11]. Suppose that a random sample of fixed size n is drawn from a large or infinite population and that we aim to find an interval estimate with a desired confidence level for an unknown proportion of interest in the population, p.
Let X ($0 \leq X \leq n$) be the random variable that counts the number of observations in the sample belonging to the category of interest and let $[L(X); U(X)]$ represent the random interval for p. We aim to attach to this interval a nominal confidence level fixed in advance as $100 \times (1-\alpha)\%$. This means that the probability of $[L(X); U(X)]$ containing the unknown p should be $(1-\alpha)$. Under the previously described conditions, X has a $\mathrm{binomial}(n, p)$ distribution. Since this is a discrete distribution, the probability of $[L(X); U(X)]$ containing the unknown p, designated coverage probability, may not achieve the target value of $(1-\alpha)$ for all possible values of the parameter.
As mentioned above, the coverage probability of the random confidence interval $[L(X); U(X)]$ is the probability that the random interval contains the unknown parameter p, i.e.,
$$\mathrm{CP}(n, p) = \sum_{j=0}^{n} \binom{n}{j}\, p^{j} (1-p)^{n-j}\, I_{[L(j);\,U(j)]}(p),$$
where $I_{[a,b]}(x) = 1$ if $x \in [a, b]$ and $I_{[a,b]}(x) = 0$ otherwise.
The expected length of the random confidence interval $[L(X); U(X)]$ for p is given by:
$$\mathrm{EL}(n, p) = \sum_{j=0}^{n} \binom{n}{j}\, p^{j} (1-p)^{n-j}\, \bigl(U(j) - L(j)\bigr).$$
When the coverage probability does not achieve the nominal confidence level $(1-\alpha)$, it should be at least close to $(1-\alpha)$. Between two methods with similar coverage probabilities, the choice of method may be based on additional criteria such as minimizing the expected length, among others. Coverage probability and expected length are the indicators of confidence interval performance covered in the present work. Discussion on additional criteria can be found in [21,22].

Appendix B

The expressions for the lower and upper bounds of the confidence interval methods for a binomial proportion are presented in Table A1 together with the R commands (from the code available in Supplement S1) needed to call the corresponding functions to obtain the confidence intervals. For more details about these methods see [8].
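Table A1. Two-sided $100 \times (1 - \alpha)\%$ confidence intervals for a binomial proportion, $[L(X), U(X)]$, where $X$ is the number of successes, and R commands.

| Method | R command | Case | $[L(X), U(X)]$ |
|---|---|---|---|
| Clopper–Pearson | prop.ci(X, n, α, ClopperP) | $X = 0$ | $\left[0,\ 1 - (\alpha/2)^{1/n}\right]$ |
| | | $0 < X < n$ | $\left[\mathrm{Beta}_{\alpha/2}(X, n - X + 1),\ \mathrm{Beta}_{1-\alpha/2}(X + 1, n - X)\right]$ |
| | | $X = n$ | $\left[(\alpha/2)^{1/n},\ 1\right]$ |
| Bayesian-U | prop.ci(X, n, α, BayesianU) | $X = 0$ | $\left[0,\ 1 - \alpha^{1/(n+1)}\right]$ |
| | | $0 < X < n$ | $\left[\mathrm{Beta}_{\alpha/2}(X + 1, n - X + 1),\ \mathrm{Beta}_{1-\alpha/2}(X + 1, n - X + 1)\right]$ |
| | | $X = n$ | $\left[\alpha^{1/(n+1)},\ 1\right]$ |
| Jeffreys | prop.ci(X, n, α, Jef) | $X = 0$ | $\left[0,\ 1 - (\alpha/2)^{1/n}\right]$ |
| | | $X = 1$ | $\left[0,\ \mathrm{Beta}_{1-\alpha/2}(2, n)\right]$ |
| | | $1 < X < n - 1$ | $\left[\mathrm{Beta}_{\alpha/2}\left(X + \tfrac{1}{2}, n - X + \tfrac{1}{2}\right),\ \mathrm{Beta}_{1-\alpha/2}\left(X + \tfrac{1}{2}, n - X + \tfrac{1}{2}\right)\right]$ |
| | | $X = n - 1$ | $\left[\mathrm{Beta}_{\alpha/2}(n, 2),\ 1\right]$ |
| | | $X = n$ | $\left[(\alpha/2)^{1/n},\ 1\right]$ |
| Agresti–Coull | prop.ci(X, n, α, AgrestC) | $0 \le X \le n$ | $\left[\max\left\{\frac{X+2}{n+4} - z_{1-\alpha/2}\sqrt{\frac{X+2}{(n+4)^2}\left(1 - \frac{X+2}{n+4}\right)};\ 0\right\},\ \min\left\{\frac{X+2}{n+4} + z_{1-\alpha/2}\sqrt{\frac{X+2}{(n+4)^2}\left(1 - \frac{X+2}{n+4}\right)};\ 1\right\}\right]$ |
| Wilson | prop.ci(X, n, α, Wils) | $0 \le X \le n$ | $\left[\frac{2X + z_{1-\alpha/2}^2 - z_{1-\alpha/2}\sqrt{z_{1-\alpha/2}^2 + 4X\left(1 - \frac{X}{n}\right)}}{2\left(n + z_{1-\alpha/2}^2\right)},\ \frac{2X + z_{1-\alpha/2}^2 + z_{1-\alpha/2}\sqrt{z_{1-\alpha/2}^2 + 4X\left(1 - \frac{X}{n}\right)}}{2\left(n + z_{1-\alpha/2}^2\right)}\right]$ |
| Anscombe | | $X = 0$ | $\left[0,\ \sin^2\left(\min\left\{\arcsin\sqrt{\frac{3/8 + 1/2}{n + 3/4}} + \frac{z_{1-\alpha/2}}{2\sqrt{n + 1/2}};\ \frac{\pi}{2}\right\}\right)\right]$ |
| | | $0 < X < n$ | $\left[\sin^2\left(\max\left\{\arcsin\sqrt{\frac{3/8 + X - 1/2}{n + 3/4}} - \frac{z_{1-\alpha/2}}{2\sqrt{n + 1/2}};\ 0\right\}\right),\ \sin^2\left(\min\left\{\arcsin\sqrt{\frac{3/8 + X + 1/2}{n + 3/4}} + \frac{z_{1-\alpha/2}}{2\sqrt{n + 1/2}};\ \frac{\pi}{2}\right\}\right)\right]$ |
| | | $X = n$ | $\left[\sin^2\left(\max\left\{\arcsin\sqrt{\frac{3/8 + n - 1/2}{n + 3/4}} - \frac{z_{1-\alpha/2}}{2\sqrt{n + 1/2}};\ 0\right\}\right),\ 1\right]$ |
| Wald | prop.ci(X, n, α, Wald) | $0 \le X \le n$ | $\left[\max\left\{\frac{X}{n} - z_{1-\alpha/2}\sqrt{\frac{X}{n^2}\left(1 - \frac{X}{n}\right)};\ 0\right\},\ \min\left\{\frac{X}{n} + z_{1-\alpha/2}\sqrt{\frac{X}{n^2}\left(1 - \frac{X}{n}\right)};\ 1\right\}\right]$ |

Note: $z_{\gamma}$ and $\mathrm{Beta}_{\gamma}(a, b)$ represent the $\gamma$-quantiles of the $N(0, 1)$ and $\mathrm{Beta}(a, b)$ distributions, respectively.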
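To make the closed-form rows of Table A1 concrete, the Wald, Agresti–Coull, and Wilson bounds can be transcribed directly. This is a Python sketch for illustration only: it does not reproduce the `prop.ci` function from Supplement S1, the function names are hypothetical, and the Beta-quantile methods are left out because they need a Beta quantile function that the Python standard library does not provide.

```python
from math import sqrt
from statistics import NormalDist


def z_quantile(alpha):
    """z_{1 - alpha/2}, the upper quantile of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - alpha / 2)


def wald(x, n, alpha=0.05):
    """Wald bounds, truncated to [0, 1]."""
    z = z_quantile(alpha)
    p_hat = x / n
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return max(p_hat - half, 0.0), min(p_hat + half, 1.0)


def agresti_coull(x, n, alpha=0.05):
    """Agresti-Coull bounds (add 2 successes and 2 failures), truncated to [0, 1]."""
    z = z_quantile(alpha)
    p_tilde = (x + 2) / (n + 4)
    half = z * sqrt(p_tilde * (1 - p_tilde) / (n + 4))
    return max(p_tilde - half, 0.0), min(p_tilde + half, 1.0)


def wilson(x, n, alpha=0.05):
    """Wilson score bounds."""
    z = z_quantile(alpha)
    centre = 2 * x + z**2
    spread = z * sqrt(z**2 + 4 * x * (1 - x / n))
    denom = 2 * (n + z**2)
    return (centre - spread) / denom, (centre + spread) / denom


# Example: 45 successes in 50 trials at the 95% nominal level.
for name, method in [("Wald", wald), ("Agresti-Coull", agresti_coull), ("Wilson", wilson)]:
    lower, upper = method(45, 50)
    print(f"{name:14s} [{lower:.4f}, {upper:.4f}]")
```

The Wald bounds collapse to a single point when X = 0 or X = n, which is one reason the method behaves poorly for proportions near 0 or 1.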

References

  1. Altman, D.; Machin, D.; Bryant, T.; Gardner, M. Statistics with Confidence. Confidence Intervals and Statistical Guidelines; BMJ: London, UK, 2000.
  2. Bossuyt, P.M.; Reitsma, J.B.; Bruns, D.; Gatsonis, C.A.; Glasziou, P.; Irwig, L.; Lijmer, J.G.; Moher, D.; Rennie, D.; de Vet, H.C.; et al. STARD 2015: An updated list of essential items for reporting diagnostic accuracy studies. Clin. Chem. 2015, h5527.
  3. Korevaar, D.; Wang, J.; van Enst, W.A.; Leeflang, M.M.; Hooft, L.; Smidt, N.; Bossuyt, P.M.M. Reporting diagnostic accuracy studies: Some improvements after 10 years of STARD. Radiology 2015, 274, 781–789.
  4. Newcombe, R. Two-sided confidence intervals for the single proportion: Comparison of seven methods. Stat. Med. 1998, 17, 857–872.
  5. Agresti, A.; Coull, B. Approximate is better than “exact” for interval estimation of binomial proportions. Am. Stat. 1998, 52, 119–126.
  6. Brown, L.; Cai, T.; DasGupta, A. Interval estimation for a binomial proportion. Stat. Sci. 2001, 16, 101–117.
  7. Brown, L.; Cai, T.; Dasgupta, A. Confidence intervals for a Binomial proportion and asymptotic expansions. Ann. Stat. 2002, 30, 160–201.
  8. Pires, A.; Amado, C. Interval estimators for a binomial proportion: Comparison of twenty methods. REVSTAT J. 2008, 6, 165–197.
  9. Andrés, A.M.; Hernández, M.Á. Two-tailed asymptotic inferences for a proportion. J. Appl. Stat. 2014, 41, 1516–1529.
  10. Zelmer, D.A. Estimating prevalence: A confidence game. J. Parasitol. 2013, 99, 386–389.
  11. Gonçalves, L.; Oliveira, M.; Pascoal, C.; Pires, A. Sample size for estimating a binomial proportion: Comparison of different methods. Appl. Stat. 2012, 39, 2453–2473.
  12. Flahault, A.; Cadilhac, M.; Thomas, G. Sample size calculation should be performed for design accuracy in diagnostic test studies. J. Clin. Epidemiol. 2005, 58, 859–862.
  13. Amini, A.; Varsaneux, O.; Kelly, H.; Tang, W.; Chen, W.; Boeras, D.I.; Falconer, J.; Tucker, J.D.; Chou, R.; Ishizaki, A.; et al. Diagnostic accuracy of tests to detect hepatitis B surface antigen: A systematic review of the literature and meta-analysis. BMC Infect. Dis. 2017, 17, 698.
  14. Thombs, B.D.; Rice, D.B. Sample sizes and precision of estimates of sensitivity and specificity from primary studies on the diagnostic accuracy of depression screening tools: A survey of recently published studies. Int. J. Methods Psychiat. Res. 2016, 25, 145–152.
  15. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2018.
  16. Sathish, N.; Vijayakumar, T.; Abraham, P.; Sridharan, G. Dengue fever: Its laboratory diagnosis, with special emphasis on IgM detection. Dengue Bull. 2003, 27, 116–125.
  17. Dendukuri, N.; Rahme, E.; Bélisle, P.; Joseph, L. Bayesian sample size determination for prevalence and diagnostic test studies in the absence of a gold standard test. Biometrics 2004, 60, 388–397.
  18. Gonçalves, L.; Subtil, A.; Oliveira, M.; Rosário, V.; Lee, P.; Shaio, M.F. Bayesian latent class models in malaria diagnosis. PLoS ONE 2012, e40633.
  19. Qiu, S.F.; Poon, W.Y.; Tang, M.L. Confidence intervals for proportion difference from two independent partially validated series. Stat. Methods Med. Res. 2016, 25, 2250–2273.
  20. Gudbjartsson, D.F.; Helgason, A.; Jonsson, H.; Magnusson, O.T.; Melsted, P.; Norddahl, G.L.; Saemundsdottir, J.; Sigurdsson, A.; Sulem, P.; Agustsdottir, A.B.; et al. Spread of SARS-CoV-2 in the Icelandic population. N. Engl. J. Med. 2020.
  21. Vos, P.; Hudson, S. Evaluation criteria for discrete confidence intervals: Beyond coverage and length. Am. Stat. 2005, 59, 137–142.
  22. Newcombe, R. Measures of location for confidence intervals for proportions. Commun. Stat.-Theor. Methods 2011, 40, 1743–1767.
Figure 1. Coverage probability of sensitivity intervals varying with the sensitivity, for η = 0.25 and n = 500, obtained with the new formula (solid line) and the conventional formula (dashed line), for Clopper–Pearson (left) and Wilson (right). The horizontal dotted line marks the nominal confidence level, 95%.
Figure 2. Coverage probability of sensitivity intervals varying with the sensitivity, assuming n = 500, obtained for η = 0.10 (grey line) and η = 0.25 (black line), for Clopper–Pearson (top left), Agresti–Coull (top right), Wilson (bottom left), and Wald (bottom right). The horizontal dotted line marks the nominal confidence level, 95%.
Figure 3. Coverage probability of sensitivity intervals varying with the prevalence, assuming n = 500, obtained for Se = 0.96 (solid line) and Se = 0.73 (dashed line), for Clopper–Pearson (top left), Agresti–Coull (top right), Wilson (bottom left), and Wald (bottom right). The horizontal dotted line marks the nominal confidence level, 95%.
Figure 4. Expected length of sensitivity intervals varying with the sample size (n), for η = 0.05, 0.10, 0.15, 0.25 and Se = 0.70, 0.80, 0.90, 0.95, 0.99, using the Clopper–Pearson and Wilson methods with 95% nominal confidence level.
Figure 5. Expected length of specificity intervals varying with the sample size (n), for η = 0.05, 0.10, 0.15, 0.25 and Sp = 0.70, 0.80, 0.90, 0.95, 0.99, using the Clopper–Pearson and Wilson methods with 95% nominal confidence level.
Table 1. Optimal sample sizes ($n_{\mathrm{optimal}}$) corresponding to several sensitivities, and differences between the optimal and approximate ($n_{\mathrm{aprox}}$) sample sizes, $\delta = n_{\mathrm{optimal}} - n_{\mathrm{aprox}}$, assuming $\eta = 0.10$, $\omega = 0.05$, $\xi = 10^{-4}$, and 95% nominal confidence level.
Columns: Se, followed by the pair ($n_{\mathrm{optimal}}$, $\delta$) for each method, in the order Clopper–Pearson, Anscombe, Agresti–Coull, Bayesian Uniform, Jeffreys, Wilson, Wald.
0.7511848−211855511459−111449−111145221153878114722
0.8010165−4510172−898693997711984737984777986454
0.858176−34818447829−17852227781−29785323785545
0.905880−10589005589−215557275488−2554626553212
0.915385553955511225024−265032225031−19502929
0.92487774887−234626645593945213145322451414
0.93435774368−224165354045353998840244398414
0.943824−16383663638−234955346333353525344111
0.95328003293331422299121291626300828288212
0.9627233273772653−7244002341124611230212
0.972170202185152178−219155179616194441613−7
0.98158441595517222140661272121462124899
0.9910799107991284149311191313104010NANA
Table 2. Optimal sample sizes ($n_{\mathrm{optimal}}$) corresponding to several specificities, and differences between the optimal and approximate ($n_{\mathrm{aprox}}$) sample sizes, $\delta = n_{\mathrm{optimal}} - n_{\mathrm{aprox}}$, assuming $\eta = 0.10$, $\omega = 0.05$, $\xi = 10^{-4}$, and 95% nominal confidence level.
Columns: Sp, followed by the pair ($n_{\mathrm{optimal}}$, $\delta$) for each method, in the order Clopper–Pearson, Anscombe, Agresti–Coull, Bayesian Uniform, Jeffreys, Wilson, Wald.
0.501741−21742−1169601709111697−11696017000
0.551724−1173812169121685−4168001685−416951
0.601673−11674−21628−1164011162901640111632−2
0.651600121601121543−21543−21556111543−215470
0.701469−114711142821425−11436101424−214331
0.7513258131701273−11275112829128171274−1
0.801130−5113861093010937109331093710955
0.85908−591568700865−5865−387118702
0.906572654−1621−36161610061736140
0.91602460345713561−15581559−35582
0.92545354605140502−1498−1503−15000
0.934840485−346234460443−144704420
0.94425−242604050390238423922381−1
0.953661366034903311323133313190
0.9630413051294−2271−126002740254−1
0.972401241−1242−1213019802160179−1
0.9817601770191−115601400161−1540
0.9911901190142010301000114−1NANA
Table 3. Optimal sample sizes ($n_{\mathrm{optimal}}$) corresponding to several sensitivities, for the Clopper–Pearson and Wilson methods, with $\omega$ varying between 0.05 and 0.10, assuming $\eta = 0.10$, $\xi = 10^{-4}$, and 95% nominal confidence level.
Column headers give the method (CP = Clopper–Pearson, W = Wilson) and the interval width ($\omega$).

| Se | CP 0.05 | W 0.05 | CP 0.06 | W 0.06 | CP 0.07 | W 0.07 | CP 0.08 | W 0.08 | CP 0.09 | W 0.09 | CP 0.10 | W 0.10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.75 | 11,848 | 11,538 | 8280 | 7997 | 6119 | 5831 | 4711 | 4459 | 3742 | 3532 | 3046 | 2854 |
| 0.80 | 10,165 | 9847 | 7110 | 6826 | 5259 | 5006 | 4053 | 3825 | 3222 | 3003 | 2625 | 2428 |
| 0.85 | 8176 | 7853 | 5728 | 5445 | 4243 | 5445 | 3275 | 3995 | 2607 | 3054 | 2126 | 2399 |
| 0.90 | 5880 | 5546 | 4133 | 3838 | 3071 | 2839 | 2377 | 2165 | 1897 | 1719 | 1552 | 1389 |
| 0.91 | 5385 | 5031 | 3789 | 3523 | 2818 | 2578 | 2183 | 1987 | 1744 | 1567 | 1428 | 1272 |
| 0.92 | 4877 | 4532 | 3436 | 3158 | 2559 | 2340 | 1984 | 1797 | 1587 | 1419 | 1301 | 1158 |
| 0.93 | 4357 | 4024 | 3074 | 2808 | 2293 | 2074 | 1781 | 1604 | 1426 | 1270 | 1170 | 1036 |
| 0.94 | 3824 | 3535 | 2705 | 2454 | 2021 | 1827 | 1573 | 1404 | 1262 | 1121 | 1040 | 917 |
| 0.95 | 3280 | 3008 | 2326 | 2098 | 1743 | 1569 | 1360 | 1217 | 1094 | 976 | 905 | 801 |
| 0.96 | 2723 | 2461 | 1951 | 1742 | 1460 | 1307 | 1148 | 1028 | 925 | 829 | 768 | 691 |
| 0.97 | 2170 | 1944 | 1555 | 1403 | 1174 | 1064 | 930 | 849 | 763 | 697 | 639 | 586 |
| 0.98 | 1584 | 1462 | 1167 | 1080 | 903 | 845 | 732 | 686 | 613 | 576 | 525 | 494 |
| 0.99 | 1079 | 1040 | 833 | 807 | 679 | 659 | 571 | 554 | 491 | 476 | 431 | 417 |
Table 4. Optimal sample sizes ($n_{\mathrm{optimal}}$) corresponding to several specificities, for the Clopper–Pearson and Wilson methods, with $\omega$ varying between 0.05 and 0.10, assuming $\eta = 0.10$, $\xi = 10^{-4}$, and 95% nominal confidence level.
Column headers give the method (CP = Clopper–Pearson, W = Wilson) and the interval width ($\omega$).

| Sp | CP 0.05 | W 0.05 | CP 0.06 | W 0.06 | CP 0.07 | W 0.07 | CP 0.08 | W 0.08 | CP 0.09 | W 0.09 | CP 0.10 | W 0.10 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.50 | 1741 | 1696 | 1222 | 1184 | 901 | 864 | 692 | 663 | 547 | 521 | 445 | 421 |
| 0.55 | 1724 | 1685 | 1203 | 1172 | 888 | 855 | 683 | 656 | 542 | 517 | 440 | 418 |
| 0.60 | 1673 | 1640 | 1171 | 1130 | 862 | 833 | 665 | 636 | 527 | 500 | 428 | 405 |
| 0.65 | 1600 | 1543 | 1115 | 1077 | 822 | 786 | 630 | 601 | 501 | 474 | 406 | 384 |
| 0.70 | 1469 | 1424 | 1028 | 988 | 760 | 726 | 583 | 556 | 464 | 437 | 377 | 354 |
| 0.75 | 1325 | 1281 | 925 | 887 | 680 | 650 | 523 | 497 | 416 | 391 | 339 | 316 |
| 0.80 | 1130 | 1093 | 790 | 753 | 586 | 553 | 450 | 424 | 358 | 334 | 291 | 270 |
| 0.85 | 908 | 871 | 636 | 604 | 471 | 443 | 364 | 338 | 289 | 267 | 236 | 215 |
| 0.90 | 657 | 617 | 459 | 426 | 342 | 314 | 264 | 240 | 211 | 190 | 172 | 154 |
| 0.91 | 602 | 559 | 421 | 390 | 313 | 286 | 242 | 220 | 193 | 174 | 158 | 141 |
| 0.92 | 545 | 503 | 383 | 351 | 284 | 259 | 220 | 199 | 176 | 157 | 144 | 128 |
| 0.93 | 484 | 447 | 341 | 313 | 255 | 231 | 198 | 177 | 158 | 141 | 130 | 115 |
| 0.94 | 425 | 392 | 301 | 273 | 225 | 202 | 174 | 156 | 140 | 124 | 115 | 101 |
| 0.95 | 366 | 333 | 258 | 233 | 193 | 173 | 151 | 134 | 121 | 108 | 100 | 88 |
| 0.96 | 304 | 274 | 216 | 194 | 162 | 145 | 127 | 113 | 102 | 92 | 85 | 76 |
| 0.97 | 240 | 216 | 172 | 155 | 130 | 118 | 103 | 93 | 84 | 77 | 70 | 64 |
| 0.98 | 176 | 161 | 129 | 119 | 100 | 93 | 81 | 76 | 67 | 63 | 57 | 54 |
| 0.99 | 119 | 114 | 92 | 89 | 75 | 72 | 63 | 61 | 54 | 52 | 47 | 45 |
