An Elaboration on Sample Size Planning for Performing a One-Sample Sensitivity and Specificity Analysis by Basing on Calculations on a Specified 95% Confidence Interval Width

Sample size calculation based on a specified width of the 95% confidence interval offers researchers the freedom to set the level of accuracy that they aim to achieve for a particular study. This paper describes the general conceptual context for performing sensitivity and specificity analysis. Sample size tables for sensitivity and specificity analysis based on a specified 95% confidence interval width are then provided. These recommendations for sample size planning cover two different scenarios: one for a diagnostic purpose and another for a screening purpose. Further discussion of the other relevant considerations for determining a minimum sample size requirement, and of how to draft the sample size statement for sensitivity and specificity analysis, is also provided.


Introduction
Diagnostic research is one of the most popular types of research in the medical field. It is a study that aims to quantify the accuracy of a test's added contribution beyond the test results readily available to the physician or researcher in determining the presence or absence of a particular disease or to predict the two distinct categories of patients such as poor health or good health [1][2][3][4][5][6][7]. This type of study is very important for efficiently identifying and offering the appropriate medical management to the right patient [8][9][10].
Research design is one of the important factors that will define the success of diagnostic research, and one of the necessary considerations for any research design is to conduct proper sample size planning. Before calculating the required sample size, a researcher will need to first understand the overall concept, the underpinning assumptions, and all the measurable parameters for a diagnostic test. Figure 1 illustrates a common scenario for diagnostic research. In this example, the researcher aims to determine the accuracy of a particular screening test to determine the serum level of a particular biochemical marker for detecting colorectal cancer in a patient. The outcome of a diagnostic test must be objectively evaluated against a definitive measurement provided by a gold standard test such as in this case from a biopsy test.
True Positive (TP) cases are those with a positive test result that actually have the disease according to the gold standard, whereas True Negative (TN) cases are those with a negative test result that actually do not have the disease. This means that the sensitivity of a diagnostic or screening test is an assessment of how well it detects the TP cases (e.g., patients with colorectal cancer) as compared with the gold standard technique (i.e., performing a biopsy of the organ itself), whereas the specificity of a diagnostic or screening test is an assessment of how well it detects the TN cases (e.g., patients without colorectal cancer) as compared with the same gold standard technique. In other words, sensitivity = TP/(TP + FN) and specificity = TN/(TN + FP), where FN and FP denote the False Negative and False Positive cases, respectively. Based on these formulas, the sensitivity and specificity of the test in this example are calculated to be 87.8% and 83.3%, respectively. The Positive Predictive Value (PPV) is the proportion of people with a positive test result who actually have the disease, and the Negative Predictive Value (NPV) is the proportion of those with a negative result who do not have the disease. In this example, the values of PPV and NPV are calculated to be 90.9% and 78.1%, respectively. Overall, the test has good sensitivity and specificity. Ideally, most researchers will aim to achieve perfect accuracy, that is, a performance as good as the gold standard. However, this can rarely be achieved since a screening test that has been invented or developed will usually be far cheaper, offer a faster method of detection, and be more convenient and user-friendly in its procedures. Thus, most researchers will usually afford some allowances for its accuracy that are attributable to chance or random error [8][9][10].
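As an illustration of these four measures, the following minimal Python sketch computes them from a 2x2 table of test results against the gold standard. The counts are hypothetical and are not the values behind Figure 1; the function name `diagnostic_metrics` is also an assumption introduced here.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Return the four accuracy measures from the cells of a 2x2 table.

    tp, fp, fn, tn are the counts of true positives, false positives,
    false negatives, and true negatives judged against the gold standard.
    """
    return {
        "sensitivity": tp / (tp + fn),  # diseased subjects correctly detected
        "specificity": tn / (tn + fp),  # non-diseased subjects correctly cleared
        "ppv": tp / (tp + fp),          # positive results that are truly diseased
        "npv": tn / (tn + fn),          # negative results that are truly non-diseased
    }


# Hypothetical counts, chosen only for illustration.
metrics = diagnostic_metrics(tp=90, fp=20, fn=10, tn=80)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity are computed within the diseased and non-diseased groups, respectively, which is why the number of subjects in each group, and hence the prevalence, matters for sample size planning.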
Normally, there are three possible conclusions that can be drawn from diagnostic research. First, the test is both sensitive and specific and thus suitable for use as a diagnostic test or marker [1][2][3][4][5]. Second, the test is suitable only as a screening tool because it is high in either sensitivity or specificity (but not both) and low in the other measure [11][12][13][14][15][16][17]. Lastly, the test is neither sensitive nor specific; this is the worst-case scenario in diagnostic research, rendering the test unsuitable for either the diagnosis or the screening of a disease [18][19][20]. The ideal result is an excellent measure for both sensitivity and specificity, or at least for one of the two, so that the test can still be deemed acceptable as a screening tool at a bare minimum. This paper takes this position further by proposing that a careful evaluation of the actual purpose of diagnostic research (either diagnosis or screening of a disease) is necessary because the two purposes are not the same and each requires a different approach to sample size planning. Many previous studies have provided detailed estimations or calculations of the sample size requirement for diagnostic tests, as presented in Table 1. Although there are already numerous published papers on sample size planning for sensitivity and specificity tests, further detailed step-by-step guidance on how to apply the relevant knowledge is still necessary to ensure that researchers do not inadvertently omit any pertinent considerations during sample size planning for diagnostic research.
Furthermore, the sample size determination must also be guided by the specific study objective, that is, the aim of a particular diagnostic test (and also its expected level of accuracy).

Table 1 (rows 16 and 17 only):
16. Bujang and Adnan [36], 2016: Sample size calculation based on differences in hypothesis testing
17. Negida et al. [37], 2019: Sample size calculation based on sensitivity, specificity, and the area under the ROC curve

Therefore, this study extends the determination of the necessary sample size requirement in these situations by discussing the detailed step-by-step procedures of sample size planning for diagnostic research, incorporating a specified width of the 95% confidence interval for both sensitivity and specificity. The advantage of using the width as a proxy measure for the effect size is that it enables the researcher to impose a pre-specified limit on the 95% confidence interval for the sensitivity and specificity values that the researcher aims to achieve. Accordingly, a set of sample size tables is compiled so that the researcher can quickly carry out the necessary sample size planning without needing to understand the complexity of the computations involved.

Methods
Diagnostics 2023, 13, 1390
The sample size calculations were based on two-sided confidence intervals for a one-sample sensitivity and specificity analysis [35]. The binomial confidence intervals were calculated with an 'exact' method called the Clopper-Pearson interval, which derives the interval limits directly from the cumulative probabilities of the binomial distribution [38]. For all these calculations, the alpha is set at 0.05, the confidence interval width is set at 0.1 or 0.2, and the prevalence is set at 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, or 0.9.
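To make the Clopper-Pearson idea concrete, the sketch below computes the exact interval by bisection on the binomial cumulative probabilities, and then searches for the smallest group size whose interval width meets a target. It uses only the Python standard library. It is not the PASS implementation: the function names, the rule of taking the observed count as round(n * p), and the "first n that meets the width" search are all assumptions made for illustration, so the results need not match the published tables exactly.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(g, target, iters=100):
    """Bisection on [0, 1] for g(p) = target, where g increases with p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(x, n, alpha=0.05):
    """Two-sided exact (Clopper-Pearson) interval for x successes in n trials."""
    # Lower limit solves P(X >= x | p) = alpha / 2; it is 0 when x == 0.
    lower = 0.0 if x == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(x - 1, n, p), alpha / 2)
    # Upper limit solves P(X <= x | p) = alpha / 2, i.e. P(X > x | p) = 1 - alpha / 2.
    upper = 1.0 if x == n else _solve_increasing(
        lambda p: 1 - binom_cdf(x, n, p), 1 - alpha / 2)
    return lower, upper


def ci_width(n, p, alpha=0.05):
    """Interval width when the observed proportion is (about) p in n subjects."""
    x = round(n * p)  # assumed rounding rule, not taken from the paper
    lower, upper = clopper_pearson(x, n, alpha)
    return upper - lower


def min_group_size(p, width, alpha=0.05, n_max=2000):
    """Smallest n (within one group) whose interval width is <= width."""
    for n in range(1, n_max + 1):
        if ci_width(n, p, alpha) <= width:
            return n
    raise ValueError("no n <= n_max reaches the requested width")


# Example: subjects needed in the diseased group for sensitivity 0.90, width 0.2.
print(min_group_size(0.90, 0.2))
```

The value returned is the number of subjects needed within one group (diseased for sensitivity, non-diseased for specificity); it still has to be scaled by the prevalence to obtain the total number of subjects to recruit.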
For a study design that aims for diagnostic purposes, the values of both sensitivity and specificity are set at 0.7, 0.8, 0.9, or 0.95. A diagnostic test or marker should ideally have excellent levels of both sensitivity and specificity. In this paper, the minimum value of sensitivity and specificity is set at 0.70. For ease of interpretation, values of both sensitivity and specificity of 0.95 can be regarded as excellent accuracy, 0.90 as nearly excellent accuracy, 0.80 as good accuracy, and 0.70 as fairly good accuracy.
For a study design that aims for screening purposes with a particular emphasis on sensitivity, the pre-specified sensitivity values are set at 0.95, 0.90, 0.80, and 0.70 while the specificity is fixed at 0.5. Meanwhile, for screening with a particular emphasis on specificity, the pre-specified specificity values are set at 0.95, 0.90, 0.80, and 0.70 while the sensitivity is fixed at 0.5. To develop a screening strategy, it might be necessary for the researchers to sacrifice either the sensitivity or the specificity. In the case where a researcher has planned for a study to have a high level of sensitivity, the minimum setting for its sensitivity will be at least 0.7 while the minimum setting for its specificity is 0.5. All the calculations were performed by using Power and Sample Size Software (PASS) (PASS 2020 Power Analysis and Sample Size Software (2020). NCSS, LLC. Kaysville, UT, USA, ncss.com/software/pass).

Results
There are three main factors that can contribute to the requirement of a larger sample size. Firstly, smaller target values for sensitivity and specificity (within the range of 0.5 to 0.95 considered here) will usually command a larger sample size requirement. Secondly, the prevalence of a disease or outcome of interest will dictate the sample size requirement in that a lower prevalence will necessitate a larger sample size for determining sensitivity, whereas a higher prevalence will demand a larger sample size for determining specificity. Thirdly, a narrower desired width of the confidence interval (which is equivalent to a smaller margin of error) will also command a larger sample size requirement. Hence, there is no 'ideal' sample size that can be universally applied, because the sample size ultimately depends on the conditions and prerequisites used to set the target effect size (Tables 2-4).
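The role of prevalence can be made explicit with a simple relation. This is a sketch under the assumption, not stated in the paper, that the interval-based calculation first yields the number of diseased subjects (n_D) and non-diseased subjects (n_ND) needed, which are then scaled by the prevalence P to give the total sample size for sensitivity (n_a) and for specificity (n_b):

```latex
n_{a} = \left\lceil \frac{n_{D}}{P} \right\rceil ,
\qquad
n_{b} = \left\lceil \frac{n_{ND}}{1 - P} \right\rceil
```

Under this relation, halving the prevalence doubles the total required for sensitivity, while the total required for specificity grows as the prevalence approaches 1.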
For a diagnostic purpose, the researcher will usually aim for an excellent level of both sensitivity and specificity. Therefore, these sample size calculations are presented in Table 2, which provides identical values for sensitivity and specificity. Depending on the target sensitivity and specificity, the minimum sample size requirement can range from 58 to 26,580 subjects. The ideal goal for a researcher is to achieve an excellent level of accuracy (i.e., both sensitivity and specificity of at least 0.95), in which case only a smaller sample size will usually be required. However, considering the highly probable risk of not reaching an excellent level of both sensitivity and specificity, a researcher is encouraged to recruit more subjects to ensure that he/she can confidently conclude that the reported level of accuracy of a particular diagnostic condition is at least satisfactory (i.e., with sensitivity and specificity of at least 0.80).
For a screening purpose, the researcher will usually aim for an excellent level of either sensitivity or specificity, but not both. To facilitate sample size planning for a screening condition, these sample size calculations are presented in Tables 3 and 4. Most studies that emphasize a screening purpose will aim for a higher degree of sensitivity and, thus, they may need to sacrifice their specificity levels. A list of pre-specified sensitivity values is provided, with a minimum value of 0.70, along with a fixed pre-specified specificity value of 0.50 (Table 3). The ideal goal for a researcher is to achieve an excellent degree of sensitivity such as 0.95. Based on the values in Table 3, if a researcher sets the desired width of the 95% confidence interval at 0.20, then the minimum sample size requirement will range between 174 and 1041 depending on the prevalence of the disease or outcome of interest.
Note: n_a refers to the sample size for sensitivity; n_b refers to the sample size for specificity.

Discussion
Scholars have developed numerous techniques to estimate or calculate a minimum sample size requirement for diagnostic research. No single technique is superior to the others because the choice depends on the study's purpose and the researchers' expectations. The sample size issues regarding diagnostic research were first discussed by Linnet in 1987, and the discussion has continued since then, with dozens of articles published on the matter [21][22][23][24][25][26][27][28][29][30][31][32][33][34][35][36][37]. A summary of these studies is presented in Table 1.
Previous studies have reported some spurious findings in research on sample size planning for diagnostic studies, such as the requirement of a very small minimum sample size [39,40]. To avoid misconstruing the statistical rigor inherent in this type of analysis, this paper adopts a different approach by offering sample size tables for sensitivity and specificity analysis that are based on the desired width of the 95%CI as a measure of clinical or scientific significance, thereby emphasizing the importance of this desired interval width as the level of confidence in all types of diagnostic research [35].
By imposing a tighter limit on the desired width of the 95%CI (i.e., 0.1 or 0.2), the researcher can be more confident that the accuracy reported by the study is scientifically justifiable. It is well known that a statistically significant result (i.e., p < 0.05) can arise merely from an extremely large sample size [41]. Thus, some scholars may argue over the utility of the p-value, but it is nevertheless still applicable and acceptable until now [42][43][44][45]. Therefore, by imposing an additional condition such as a fixed limit on the desired width of the 95%CI, the researcher is less likely to be misguided and will be better able to ensure that an acceptable level of accuracy can realistically be achieved.

How Should the Sample Size Tables Be Used?
The sample size tables are provided in this paper to facilitate sample size planning for all studies related to sensitivity and specificity analysis. Firstly, a researcher needs to determine the prevalence of a disease or the outcome of interest (such as 'poor outcome' or 'good outcome'). The prevalence of a disease can vary widely depending on which type of study population a researcher aims to study. In other words, the researcher has to decide the specific type of study population for which a diagnostic or screening condition is intended. For example, the prevalence of colon cancer among a 'high-risk' population is obviously very much higher than that among a healthy population. If the researcher aims to implement the diagnostic or screening test or marker among the 'high-risk' population for colon cancer, then the prevalence of colon cancer should be estimated from that 'high-risk' population (i.e., patients from a hospital setting such as the surgical specialist clinic).
Secondly, the researcher needs to decide whether the test or marker is intended to be an alternative diagnostic tool/marker or will solely be used for screening purposes. Researchers should be able to decide the desired aim of a particular test or marker since they are also the subject matter experts in the specialized field and should therefore know the true capabilities and expectations of a diagnostic test or marker. Hence, Table 2 should be referred to if a researcher intends to develop a diagnostic test or marker, whereas Tables 3 and 4 should be referred to if they intend to develop a screening test or marker.
Thirdly, the researcher also needs to decide beforehand the target values of both the sensitivity and specificity of a test or marker. If they intend to develop a diagnostic test or marker, then there will be four different possible sets of sensitivity and specificity values. For the sake of simplicity, this paper recommends that sensitivity and specificity values of 0.95, 0.90, 0.80, and 0.70 be regarded as indicating an excellent, nearly excellent, good, and fairly good diagnostic test/marker, respectively. Finally, the researcher will also have to decide beforehand the desired interval width of the 95%CI (i.e., either 0.1 or 0.2). The choice of the desired interval width is likely to be driven by the actual intended purpose of the study, the availability of resources, and the capability and experience level of the researcher under the various experimental conditions. Say, for example, the prevalence of a disease is set at 40.0%, the desired interval width for the 95%CI is set at 0.1, and the desired sensitivity and specificity are both set at 0.90. Based on these conditions, the minimum required sample size for the determination of sensitivity is 395 and that for the determination of specificity is 264. In this case, the sample size of 395 should be chosen since it is the larger of the two. In another scenario, say, for example, the prevalence of a disease is set at 70.0%, the desired interval width for the 95%CI is set at 0.1, and the desired sensitivity and specificity are both set at 0.90. Based on such conditions, the minimum required sample size for assessing sensitivity is 135 and that for assessing specificity is 1304. Again, in this case, the sample size of 1304 should be chosen for the same reason.
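The first worked example can be checked with a short Python sketch. It assumes that about 158 subjects are needed in each of the diseased and non-diseased groups; this figure is back-calculated from the quoted totals of 395 and 264 and is not taken from the tables. The function names are hypothetical, and the prevalence is converted to an exact fraction to avoid floating-point rounding before taking the ceiling.

```python
from fractions import Fraction
from math import ceil


def total_for_sensitivity(n_diseased, prevalence):
    """Total subjects to recruit so that about n_diseased have the disease."""
    return ceil(n_diseased / Fraction(str(prevalence)))


def total_for_specificity(n_non_diseased, prevalence):
    """Total subjects to recruit so that about n_non_diseased are disease-free."""
    return ceil(n_non_diseased / (1 - Fraction(str(prevalence))))


def required_total(n_a, n_b):
    """The study must satisfy both requirements, so take the larger one."""
    return max(n_a, n_b)


n_group = 158  # assumption: back-calculated per-group size, not from the tables
n_a = total_for_sensitivity(n_group, 0.4)
n_b = total_for_specificity(n_group, 0.4)
print(n_a, n_b, required_total(n_a, n_b))  # 395 264 395
```

Taking the maximum guarantees that both the sensitivity and the specificity estimates reach the desired interval width in the same study.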

Issues That Can Arise from Very Large Sample Sizes Involving Very Low Level of Prevalence
It is evident that some of the calculations in the tables have yielded extremely large sample size requirements. For example, Table 2 shows that a minimum of 26,580 subjects will be needed to claim a sensitivity and specificity of 70.0% each, based on a desired 95%CI width of 0.05 in a study population with a 5.0% prevalence of disease. There are two main issues to consider here. Firstly, it is necessary to carefully consider whether sensitivity and specificity values of 70.0% will satisfy both the researchers and the stakeholders (who are the end-users of the test or marker) and, secondly, it is also necessary to determine whether or not the researchers can realistically cope with the work involved in recruiting a very large number of subjects.
This means that the recruitment of a large number of subjects will only be worthwhile if the test can realistically be shown to be very highly sensitive and specific, such as having both sensitivity and specificity at 95.0%. In other words, it is only recommended to recruit an unusually large number of subjects if there are sufficient grounds to believe that a diagnostic test or marker demonstrates a very high level of accuracy. Such grounds can often be retrieved from the literature or they can be based on cumulative scientific evidence for the accuracy of the test or marker.
Thus, the most important consideration here is that the core emphasis of diagnostic research should be to develop a sensitive and specific marker by garnering sufficient cumulative evidence of its sensitivity and/or specificity, and not merely to study a particular diagnostic test/marker without having accrued sufficient evidence of its sensitivity and specificity.
In other words, it is not recommended to conduct a study with a very large number of subjects merely to prove that a diagnostic test/marker has both sensitivity and specificity of 70.0%. These calculations are presented in this paper merely to illustrate the point that recruiting such a high number of subjects can be justified if and only if the accrued scientific evidence already indicates that a particular diagnostic/screening test or marker has a high level of sensitivity and/or specificity, which provides a valid rationale for the study [46][47][48][49][50]. Otherwise, it is not recommended to do so.

Determination of Sample Size Requirements for Diagnostic Purposes
One previous study provided a list of recommended criteria for creating a sample size statement that should ideally include five elements. These elements shall consist of Step 1: to understand the objective of the study, Step 2: to select the appropriate statistical analysis, Step 3: to calculate or estimate the sample size, Step 4: to provide additional allowances during the subject recruitment procedure to cater for a certain proportion of non-response, and Step 5: to write a standard sample size statement [51]. For the purpose of writing a standard sample size statement, a common scenario has been created as follows: the researchers aim to prove that a particular new marker extracted from a patient's blood is suitable for use as a diagnostic marker to determine whether the patient has colon cancer.
Thus, the sample size statement is written as follows: "This study aims to determine whether marker X is highly accurate to detect all patients with colon cancer. The basis of its sample size calculation is derived from both sensitivity and specificity analyses. In a population at risk of colon cancer (i.e., patients who have already exhibited and reported to have usual symptoms of colon cancer), the prevalence of colon cancer is 10.0%. For a reliable diagnostic marker, the researcher will typically aim for the new marker to have a degree of both sensitivity and specificity of at least 95.0%. The sample size calculation is based on the desired width of the 95% confidence interval for both its sensitivity and specificity to be set at 0.1. Based on the abovementioned conditions, the minimum sample size requirement to perform a study for determining its sensitivity is 940 patients and that for determining its specificity is 105 patients. Therefore, the minimum sample size of 940 patients shall be deemed necessary since it is the larger of the two. In order to provide additional allowances for incorporating a possible non-response rate of 20.0%, the minimum required sample size is then further inflated to 1175 patients."

Determination of Sample Size Requirements for Screening Purposes
A similar scenario can be applied in the following example, in which the researcher now aims to prove that a particular new marker extracted from a patient's blood is suitable for use as a screening marker for colon cancer (i.e., a sensitivity of 70.0% or more) with a fixed specificity of 50.0%, or vice versa.
Thus, the sample size statement is written as follows: "This study aims to determine whether marker Y is highly sensitive to screen a patient for the purpose of detecting colon cancer. The basis of its sample size calculation is derived from both sensitivity and specificity analyses. In a population at risk of colon cancer (i.e., patients who have already exhibited and reported to have usual symptoms of colon cancer), the prevalence of colon cancer is 10.0%. To obtain a reliable screening marker, the researcher will typically aim for the new marker to have a degree of sensitivity of at least 95.0% and that of specificity of at least 50.0%. The sample size calculation is based on the desired width of a 95% confidence interval for both its sensitivity and specificity to be set at 0.2. Based on the abovementioned conditions, the minimum sample size requirement to perform a study for determining its sensitivity is 290 patients and that for determining its specificity is 116 patients. Therefore, the minimum sample size of 290 patients shall be deemed necessary since it is the larger of the two. In order to provide additional allowances for incorporating a possible non-response rate of 20.0%, the minimum required sample size is then further inflated to 363 patients."
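The non-response allowance in both sample size statements can be reproduced with a short Python sketch. It assumes that the inflation is done by dividing the minimum sample size by (1 - non-response rate) and rounding up, a convention that is consistent with both quoted figures; the function name is hypothetical.

```python
from fractions import Fraction
from math import ceil


def inflate_for_non_response(n, non_response_rate):
    """Inflate n so that about n subjects remain after the expected non-response."""
    # Assumed convention: divide by the expected response proportion, round up.
    return ceil(n / (1 - Fraction(str(non_response_rate))))


print(inflate_for_non_response(940, 0.2))  # 1175 (diagnostic example)
print(inflate_for_non_response(290, 0.2))  # 363 (screening example)
```

Note that this is not the same as adding 20.0% to the sample size (940 x 1.2 = 1128), which would leave fewer than 940 subjects after a 20.0% non-response.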

Conclusions
Researchers often need a quick and simple 'rule-of-thumb' or method to estimate or calculate the minimum sample size requirement. This paper provides background information on a diagnostic study, a list of sample size tables for determining the minimum sample sizes required for performing both the sensitivity and specificity analysis together with a clear and concise guideline on how to use the sample size tables for performing such analysis under a wide variety of differing conditions, and, lastly, it wraps up the whole discussion by offering an illustrative example of how a standard sample size statement should be written.
This paper recommends that the researcher set a tighter desired width for the 95% confidence interval (i.e., 0.1 or 0.2) for better sample size planning. All in all, this paper will assist the researcher in conducting proper sample size planning for diagnostic research and, hence, it helps the researcher to reach a simple and quick decision on sample size planning without resorting to highly complicated statistical techniques for the computations or to a formal in-depth acquisition of the technical knowledge of the subject matter.