3.1. New Expressions for Coverage Probability of Interval Estimation Methods
Comparisons of new and conventional expressions of coverage probability and expected length were performed, using the seven methods, for some practical examples. The choice of the best methods is essentially based on their coverage probability. Nevertheless, this indicator varies with the sensitivity (specificity) of the diagnostic test under study, the prevalence of the condition, and the sample size. Given the diversity of practical examples, we adopted specific values for prevalence, sensitivity, and specificity motivated by a study presented in [16] concerning the diagnosis of dengue fever, an important vector-borne disease common in tropical areas.
Coverage probabilities were calculated considering different range values of sensitivity, prevalence, and sample size.
Figure 1 shows the coverage probability of sensitivity intervals as a function of sensitivity, for fixed values of prevalence and sample size, obtained with the new formula and with the conventional formula, using Clopper–Pearson (strictly conservative method) and Wilson (correct on average). The new formula produces much smoother coverage probability curves, in contrast with the sharper indentations obtained with the conventional expressions.
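The conventional coverage computation can be sketched for a simple binomial setting: with a fixed number of subjects n, the coverage at a true sensitivity p is the binomial probability mass of the outcomes x whose interval contains p. The sketch below is our own illustration of that conventional sum (it does not reproduce the paper's new expression), using the standard Clopper–Pearson and Wilson interval formulas.

```python
# Exact coverage probability of a binomial confidence interval, computed
# the "conventional" way (denominator fixed at n).
from scipy.stats import beta, norm, binom

def clopper_pearson(x, n, conf=0.95):
    """Exact (strictly conservative) interval from beta quantiles."""
    a = (1 - conf) / 2
    lo = 0.0 if x == 0 else beta.ppf(a, x, n - x + 1)
    hi = 1.0 if x == n else beta.ppf(1 - a, x + 1, n - x)
    return lo, hi

def wilson(x, n, conf=0.95):
    """Score interval; 'correct on average' around the nominal level."""
    z = norm.ppf(1 - (1 - conf) / 2)
    p_hat = x / n
    centre = (p_hat + z**2 / (2 * n)) / (1 + z**2 / n)
    half = z * ((p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) ** 0.5) / (1 + z**2 / n)
    return centre - half, centre + half

def coverage(method, p, n, conf=0.95):
    """P(interval covers p) = sum of the binomial pmf over the outcomes x
    whose interval contains the true value p."""
    total = 0.0
    for x in range(n + 1):
        lo, hi = method(x, n, conf)
        if lo <= p <= hi:
            total += binom.pmf(x, n, p)
    return total
```

The discrete indicator inside the sum is what causes the sharp indentations of the conventional curves: as p crosses an interval endpoint, a whole binomial mass term enters or leaves the sum at once.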
Figure 2 illustrates the coverage probability of a sensitivity interval, admitting two prevalence values (0.10 and 0.25) and a sample size of 500 individuals, according to four methods (Clopper–Pearson, Agresti–Coull, Wilson, and Wald).
Anscombe, Jeffreys, and Bayesian-U are not shown in the plots given their similarity to the other methods. For most sensitivity values, the coverage probability corresponding to the higher prevalence, 0.25, tends to be closer to the nominal confidence level than that corresponding to 0.10. This tendency is not so clear for sensitivity close to 1, where the coverage probability is quite unstable and erratic.
For the studied methods, regarding coverage probability values, we can recognise a section with sensitivity close to 1, where values are erratic and unstable, a section of lower sensitivities with stable values, and an intermediate section in between. Strictly conservative methods (Clopper–Pearson and Anscombe) present coverage probabilities higher than the nominal confidence level, with curves that seem detached from this target. The Wilson, Jeffreys, and Bayesian-U methods are similar, exhibiting stability around the nominal confidence level, except for erratic values close to 1. The Agresti–Coull method lies in between Wilson and the higher, detached values calculated for the strictly conservative methods.
It is also important to study the coverage probability associated with the sensitivity interval as a function of the prevalence. The coverage probability tends to be nearer the target as the prevalence increases.
Figure 3 presents two sensitivities, 0.96 and 0.73, for a sample size of 500 individuals. The curve tends to be nearer the nominal confidence level in the case of the lower sensitivity (0.73).
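The prevalence enters because, in a sample of N individuals, the number of diseased subjects is itself random. A minimal sketch of this idea, assuming D ~ Binomial(N, prevalence) and averaging the conditional coverage of a Clopper–Pearson sensitivity interval over D (our own illustrative formulation, not necessarily the paper's exact new expression):

```python
# Coverage of a sensitivity interval when the number of diseased subjects
# D in a sample of size N is random: D ~ Binomial(N, prev). The coverage
# is averaged over D, conditioning on D >= 1 (no interval exists at D = 0).
from scipy.stats import binom, beta

def cp_interval(x, n, conf=0.95):
    """Clopper–Pearson interval for x successes out of n."""
    a = (1 - conf) / 2
    lo = 0.0 if x == 0 else beta.ppf(a, x, n - x + 1)
    hi = 1.0 if x == n else beta.ppf(1 - a, x + 1, n - x)
    return lo, hi

def coverage_vs_prevalence(se, prev, N, conf=0.95):
    """Average, over D ~ Bin(N, prev), of the conditional coverage of the
    sensitivity interval built from the D diseased subjects."""
    total = 0.0
    for d in range(1, N + 1):
        w_d = binom.pmf(d, N, prev)
        inner = 0.0
        for x in range(d + 1):  # x = true positives among d diseased
            lo, hi = cp_interval(x, d, conf)
            if lo <= se <= hi:
                inner += binom.pmf(x, d, se)
        total += w_d * inner
    return total / (1 - binom.pmf(0, N, prev))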
3.2. Impact on Sample Sizes
The sample sizes based on the optimal procedure were calculated for confidence intervals for sensitivity or specificity, with a fixed expected width and tolerance. Approximate sample sizes for sensitivity intervals were obtained from the values reported by [11], divided by the prevalence, and, for specificity, the divisor was one minus the prevalence.
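This approximation can be sketched as follows; the tabulated values of [11] are not reproduced here, so the per-proportion sample size below uses the simple Wald planning formula as a stand-in, and all function names are ours.

```python
# Approximate sample sizes: a sample size for a single proportion, divided
# by the prevalence (sensitivity) or by 1 - prevalence (specificity).
import math
from scipy.stats import norm

def n_single_proportion(p, width, conf=0.95):
    """Wald-style n so the interval for a proportion p has total expected
    width `width` (i.e. half-width width / 2)."""
    z = norm.ppf(1 - (1 - conf) / 2)
    return math.ceil(z**2 * p * (1 - p) / (width / 2) ** 2)

def approx_n_sensitivity(se, prev, width, conf=0.95):
    # Only the diseased fraction (about prev of the sample) informs sensitivity.
    return math.ceil(n_single_proportion(se, width, conf) / prev)

def approx_n_specificity(sp, prev, width, conf=0.95):
    # The healthy fraction (about 1 - prev of the sample) informs specificity.
    return math.ceil(n_single_proportion(sp, width, conf) / (1 - prev))
```

The division makes explicit why, for prevalence below 0.5, sensitivity intervals demand much larger total samples than specificity intervals with the same precision target.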
Table 1 and Table 2 show the optimal sample sizes corresponding to sensitivity and specificity, respectively. Both tables also present the differences between optimal and approximate sample sizes.
Only small differences were detected between optimal and approximate sample sizes. As sensitivity (specificity) increases, the sample sizes required to satisfy the established criteria decrease. Moreover, the sample sizes needed in the case of sensitivity are much higher than the sample sizes for the particular case of specificity, as expected if the prevalence is smaller than 0.5. In the majority of practical situations, the prevalence is less than 0.5 and therefore the infected or diseased subjects are less represented in the sample, thus demanding higher sample sizes to guarantee an adequate sensitivity interval.
In the hepatitis B meta-analysis performed by [13], pooled estimates of sensitivity and specificity are reported. Considering a prevalence of 0.10 for a particular population of suspected cases, and the same width and tolerance requirements, the optimal sample sizes for the reported sensitivity range from 5446 (Wilson) to 5890 (Anscombe), according to Table 1. As such, all 40 studies fail to provide reliably accurate sensitivity intervals. By contrast, if the target is the specificity, optimal sample sizes range from 100 (Jeffreys) to 142 (Agresti–Coull), as presented in Table 2. Therefore, 90% and 80% of the studies fulfil the Jeffreys and Agresti–Coull requirements, respectively.
The conservative methods, Clopper–Pearson and Anscombe, demand higher sample sizes than the remaining methods. The sample sizes corresponding to the Agresti–Coull method are intermediate between the conservative methods and Bayesian-U, Jeffreys, and Wilson.
Table 1 and Table 2 show optimal sample sizes associated with an expected interval width of 0.05. However, taking a closer look at the studies reported in the hepatitis B meta-analysis [13], we can see that many studies provide results with much wider confidence intervals, which, although of limited interest in some scenarios, may be useful in others for ethical and economic reasons. Therefore, ranging the interval width from 0.05 to 0.10, Table 3 and Table 4 present optimal sample sizes for sensitivity and specificity intervals, admitting a prevalence of 0.10, for the Clopper–Pearson and Wilson methods.
As expected, wider confidence intervals correspond to smaller optimal sample sizes. For example, if the researcher anticipates a sensitivity of 0.95 (and a prevalence of 0.10), a Clopper–Pearson interval of length 0.05 would require a sample of 3280 observations to fulfil the desired requirements. Increasing the interval width by 0.01, to 0.06, reduces the number of observations needed by 954. Moreover, if the initial width doubles (to 0.10), the optimal sample size decreases to 905, which is only 27.6% of the initial value.
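The width-versus-sample-size trade-off can be reproduced in outline. The sketch below defines the "optimal" diseased-group size as the smallest n whose expected Clopper–Pearson interval length does not exceed the target width; the tolerance mentioned in the text is omitted for simplicity, so the resulting numbers are illustrative rather than the tables' exact values.

```python
# Smallest n whose expected Clopper-Pearson interval length meets a target
# width, then scaled up by the prevalence to a total sample size.
from math import ceil
from scipy.stats import beta, binom

def cp_length(x, n, conf=0.95):
    """Length of the Clopper–Pearson interval for x successes out of n."""
    a = (1 - conf) / 2
    lo = 0.0 if x == 0 else beta.ppf(a, x, n - x + 1)
    hi = 1.0 if x == n else beta.ppf(1 - a, x + 1, n - x)
    return hi - lo

def expected_length(n, p, conf=0.95):
    """E[length] = sum over outcomes x of pmf(x) * interval length at x."""
    return sum(binom.pmf(x, n, p) * cp_length(x, n, conf) for x in range(n + 1))

def optimal_n_diseased(se, width, conf=0.95, n_max=5000):
    """Smallest n with expected length <= width, by binary search
    (assumes the expected length decreases with n, which holds in practice)."""
    lo, hi = 2, n_max
    while lo < hi:
        mid = (lo + hi) // 2
        if expected_length(mid, se, conf) <= width:
            hi = mid
        else:
            lo = mid + 1
    if expected_length(lo, se, conf) > width:
        raise ValueError("n_max too small")
    return lo

def optimal_total_n(se, prev, width, conf=0.95):
    """Total sample size: diseased-group size scaled up by the prevalence."""
    return ceil(optimal_n_diseased(se, width, conf) / prev)
```

Doubling the target width shrinks the required n roughly fourfold, matching the pattern discussed above for Table 3 and Table 4.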
Figure 4 and Figure 5 allow us to explore the expected length of the desired confidence intervals as a function of the optimal sample size, for specific values of prevalence and of sensitivity (Figure 4) or specificity (Figure 5). In general, for the same sample size and prevalence, the expected lengths of the intervals decrease for high sensitivity and specificity. Higher sample sizes and higher prevalences also lead to confidence intervals with lower expected length. The same figures also confirm findings already discussed: (i) Wilson, being correct on average, leads to smaller optimal sample sizes than Clopper–Pearson, a strictly conservative method; (ii) optimal sample sizes designed for specificity intervals are smaller than the ones demanded for sensitivity intervals, when the prevalence of the condition is smaller than 0.5.
Illustrating again with the hepatitis B surface antigen study [13]: when the experiment was designed, if the authors anticipated a sensitivity of 0.95 for the new test and wanted a confidence interval with a given expected width, then, according to Figure 4, for the prevalence of the population under study, a sample of size 1750 would be needed. If this prevalence doubles, the optimal sample size decreases to 1000, and for higher prevalences a sample size between 300 and 400 would be enough to obtain a Clopper–Pearson sensitivity interval with the desired width.