Correcting Diagnostic Test Sensitivity and Specificity for Patient Misclassifications Resulting from Use of an Imperfect Reference Standard

Investigational diagnostic tests are validated by using a reference standard (RS). If the RS is imperfect (i.e., it has sensitivity [Se] and/or specificity [Sp] < 1), incorrect values for the investigational test’s Se and Sp may result because of patient misclassification by the RS. Formulas were derived to correct a test’s Se and Sp that were determined by using an imperfect RS. The following derived formulas correct for misclassification and give the true numbers of disease-positive and disease-negative patients (nDP and nDN) from the apparent numbers of disease-positive and disease-negative patients (anDP and anDN) and the Se and Sp of the RS (SeR, SpR): nDP = (anDP × SpR + anDN × SpR − anDN)/JR; nDN = (anDP × SeR + anDN × SeR − anDP)/JR, where JR is Youden’s Index for the RS (JR = SeR + SpR − 1). The following derived formulas give the correct Se and Sp of an investigational test (SeI and SpI): SeI = (anTPI × SpR − nDP × SeR × SpR + nDP × JR + nDN × SpR² − nDN × SpR − SpR × anTNI + anTNI)/(nDP × JR); SpI = (anTPI − anTPI × SeR + nDP × SeR² − nDP × SeR − SeR × nDN × SpR + nDN × JR + SeR × anTNI)/(nDN × JR), where anTPI is the apparent number of true-positive test results, and anTNI is the apparent number of true-negative test results. The derived formulas correct for patient misclassification by an imperfect RS and give the correct values of a diagnostic test’s Se and Sp.


Introduction
Sensitivity (Se) and specificity (Sp) are fundamental measures of the performance of a diagnostic test. They, respectively, provide the probability of a positive test result in patients with the index disease (the disease that the test is designed to detect or exclude) and a negative test result in patients without the disease [1]. When combined with pre-test probability (prevalence) of disease in a tested population, they can be used to determine the more clinically useful metrics positive predictive value (PPV) and negative predictive value (NPV) [2]. Alternatively, Se and Sp can be used to determine likelihood ratios, which can be used with the pre-test odds (rather than probability) of disease to estimate the post-test odds of disease [3].
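The relationships described above between Se, Sp, prevalence, predictive values, and likelihood ratios can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the numeric inputs in the usage note are hypothetical and are not taken from any study cited here.

```python
def predictive_values(se, sp, prevalence):
    """Return (PPV, NPV) from sensitivity, specificity, and pre-test probability."""
    tp = se * prevalence                # expected proportion of true positives
    fp = (1 - sp) * (1 - prevalence)    # expected proportion of false positives
    tn = sp * (1 - prevalence)          # expected proportion of true negatives
    fn = (1 - se) * prevalence          # expected proportion of false negatives
    return tp / (tp + fp), tn / (tn + fn)


def post_test_probability(se, sp, prevalence, positive=True):
    """Return the post-test probability of disease via likelihood ratios and odds."""
    # Positive likelihood ratio for a positive result, negative one otherwise.
    lr = se / (1 - sp) if positive else (1 - se) / sp
    pre_test_odds = prevalence / (1 - prevalence)
    post_test_odds = pre_test_odds * lr
    return post_test_odds / (1 + post_test_odds)
```

For instance, with hypothetical values Se = 0.9, Sp = 0.8, and prevalence = 0.1, `post_test_probability(0.9, 0.8, 0.1)` should agree with the PPV returned by `predictive_values(0.9, 0.8, 0.1)`, because the two routes are algebraically equivalent.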
The Se and Sp of an investigational test (IT) must therefore be accurately determined before the test can be useful in clinical practice. Typically, accuracy is determined through one or more clinical studies in which study patients undergo both the IT and another diagnostic test, which serves as a reference standard (RS). The RS is used to classify each patient as disease-positive (DP) or disease-negative (DN), and these classifications are used to classify the patient's IT result as true positive (TP), true negative (TN), false positive (FP), or false negative (FN). The respective numbers of patient classifications (nTP, nTN, nFP, and nFN) are used to determine the sensitivity and specificity of the investigational test (SeI and SpI, respectively; Formulas (1) and (2)).
$$Se_I = n_{TP} / n_{DP} \tag{1}$$
$$Sp_I = n_{TN} / n_{DN} \tag{2}$$
If the RS is perfect (SeR = SpR = 1 [100%]), then it is a true standard of truth, and each patient is correctly classified as DP or DN. SeI and SpI can then be determined straightforwardly and correctly. When a perfect RS is used to evaluate a patient, there are only two possible outcomes for the RS (either a true-positive RS result classifying the patient as DP or a true-negative RS result classifying the patient as DN); and when a perfect RS is used to classify positive and negative IT results, there can only be four possible IT classifications (TP, TN, FP, and FN).
However, if the RS is imperfect (i.e., SeR and/or SpR < 1 [<100%]), then it can misclassify patients, resulting in incorrect counts (nDP, nDN, nTP, and nTN), and thus incorrect values of SeI and SpI. In this case, there can be four (rather than two) possible classifications of a patient by the imperfect RS: a true-positive result classifying the patient as DP, a true-negative result classifying the patient as DN, a false-positive result misclassifying the patient as DP, or a false-negative result misclassifying the patient as DN. When the imperfect RS is used to classify the positive and negative IT results, the IT can yield true or false results, which could agree or disagree with the RS, which in turn could be true or false. As a result, there could be eight categories: TP, TN, FP, and FN from comparison of the IT results to accurate RS results, plus apparent TP, TN, FP, and FN from comparison of the IT results to inaccurate RS results. Consequently, determination of the true Se and Sp of the IT when an imperfect RS is used is not as straightforward as when the RS is perfect.
This paper was inspired by the report by McKeith and coworkers of a clinical study that determined the Se and Sp of [123I]ioflupane with single-photon emission computed tomography (SPECT) for assessing patients with suspected dementia with Lewy bodies (DLB) [4], a study that relied on an imperfect reference standard. In that study, the International Consensus Criteria (ICC [5]) for diagnosing DLB were used as the RS. However, it was known from a validation study [6] that the Se and Sp of the ICC were less than perfect (0.83 and 0.95, respectively). Thus, the true Se and Sp of ioflupane for DLB were uncertain. For this reason, I sought ways to adjust the apparent values of Se and Sp of ioflupane imaging, as reported by McKeith et al. [4], to account for the known Se and Sp of the diagnostic criteria used as the reference standard.
A literature search yielded several relevant articles. Nihashi et al. [7] reported using a Bayesian latent class model for adjusting the Se and Sp of DaTscan for DLB for eight clinical studies (including follow-up data from the McKeith 2007 study [4], but not the original data), although neither the published article nor the Supplemental Materials provided sufficient details to allow one to reproduce their results.
Umemneku Chikere et al. [8] discussed three methods of correcting for the effects of an imperfect RS: Brenner [9], Gart and Buck [10], and Staquet et al. [11]. All of these authors took different approaches and reported different equations. They did not report enough detail to allow one to determine if their derivations were correct.
Trikalinos et al. mentioned the possibility of adjusting results that are based on an imperfect RS but did not report derivation of the formulas needed to do so [12]. Therefore, this work was initiated to derive, in diagnostic terms readily understandable to clinicians and with full transparency of the derivations, the formulas needed to correct for patient misclassifications by an imperfect RS and to determine the true values of a diagnostic test's Se and Sp when they were determined by using an imperfect RS. Those results are reported here; application of the results to the McKeith ioflupane study [4], along with a review of relevant literature, will be reported separately.

Materials and Methods
Formulas were derived on the basis of an analysis of how a reference standard (RS) is used to classify patients as disease-positive or disease-negative and how misclassifications by an imperfect RS affect the apparent values of SeI and SpI. Throughout, conditional independence of the RS and IT is assumed; i.e., the RS and IT misclassify patients independently. This assumption is reasonable if, for example, the RS and IT work by different mechanisms.
Two diagrams were created to depict patient misclassifications by the RS (Figure 1) and their effect on the apparent values of SeI and SpI (Figure 2). In both figures, the prefix a was used to denote an apparent value. In Figure 2, to differentiate between the reference and investigational tests with respect to the numbers of true positives (nTP), true negatives (nTN), false positives (nFP), false negatives (nFN), Se, and Sp, these variables had the subscripts R (for reference test) or I (for investigational test) added.
Figure 1. When an imperfect RS is used in a clinical study of an investigational test, the apparent number of disease-positive subjects may include some subjects who are actually disease negative, as well as subjects who are truly disease positive. Likewise, the apparent number of disease-negative subjects may include some subjects who are actually disease positive, as well as subjects who are truly disease negative.
Diagnostics 2023, 13, x FOR PEER REVIEW 4 of 9
The anTP will equal the number of subjects who got a true-positive result on both the RS and the IT, plus the number of subjects who got a false-positive result on both the RS and the IT. Likewise, anTN will equal the number of subjects with a true-negative result on both the RS and the IT, plus the number of subjects with a false-negative result on both the RS and the IT. Thus, both the apparent sensitivity and the apparent specificity of the IT (i.e., the sensitivity and specificity calculated in the clinical study) will depend on both the sensitivity and specificity of the RS (SeR and SpR).
Figure 1 shows how an imperfect RS results in patient misclassifications: multiplying the true numbers of disease-positive (nDP) and disease-negative (nDN) patients by SeR and SpR and their complements gives the numbers of true-positive, false-negative, true-negative, and false-positive RS results, which combine to give the apparent number of disease-positive patients (anDP) and the apparent number of disease-negative patients (anDN). In Figure 1, nTPR is the number of patients with true-positive RS results, nFNR is the number of patients with false-negative RS results, nTNR is the number of patients with true-negative RS results, nFPR is the number of patients with false-positive RS results, SeR is the sensitivity of the reference standard, SpR is the specificity of the reference standard, anDP is the apparent number of disease-positive patients, and anDN is the apparent number of disease-negative patients.
Figure 2 shows how the patient misclassifications result in incorrect values of the IT's sensitivity (SeI) and specificity (SpI): multiplying the components of anDP and anDN by the true (but initially unknown) values of SeI and SpI (or their complements) gives the apparent numbers of true-positive (anTPI) and true-negative (anTNI) IT results. Dividing these by anDP and anDN, respectively (in accordance with Equations (1) and (2) above), gives incorrect apparent values of the IT's Se and Sp (aSeI and aSpI). In Figure 2:
• anTPI1 is the apparent number of true-positive IT results based on nTPR
• anTPI2 is the apparent number of true-positive IT results based on nFPR
• anFNI1 is the apparent number of false-negative IT results based on nTPR
• anFNI2 is the apparent number of false-negative IT results based on nFPR
• anFPI1 is the apparent number of false-positive IT results based on nFNR
• anFPI2 is the apparent number of false-positive IT results based on nTNR
• anTNI1 is the apparent number of true-negative IT results based on nFNR
• anTNI2 is the apparent number of true-negative IT results based on nTNR
• anTPI is the apparent total number of true-positive IT results
• anTNI is the apparent total number of true-negative IT results
The two diagrams were analyzed to develop formulas that were then solved to give nDP, nDN, SeI, and SpI starting from the apparent results of a clinical study. Figure 1 shows the relationship between nDP, nDN, SeR, SpR, anDP, and anDN. From Figure 1, Formulas (3) and (4) can be deduced:
$$an_{DP} = n_{DP} \times Se_R + n_{DN} \times (1 - Sp_R) \tag{3}$$
$$an_{DN} = n_{DN} \times Sp_R + n_{DP} \times (1 - Se_R) \tag{4}$$
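The relationships in Figure 1 that underlie Formulas (3) and (4) can be sketched as a small forward model. This is a hedged illustration only: the function name `apparent_counts` and the intermediate variable names are hypothetical, and the counts are treated as expected values rather than whole patients.

```python
def apparent_counts(n_dp, n_dn, se_r, sp_r):
    """Return (an_dp, an_dn): apparent disease-positive and disease-negative counts.

    n_dp, n_dn : true numbers of disease-positive and disease-negative patients
    se_r, sp_r : sensitivity and specificity of the reference standard (RS)
    """
    n_tpr = n_dp * se_r          # true-positive RS results
    n_fnr = n_dp * (1 - se_r)    # false-negative RS results (misclassified as DN)
    n_tnr = n_dn * sp_r          # true-negative RS results
    n_fpr = n_dn * (1 - sp_r)    # false-positive RS results (misclassified as DP)
    an_dp = n_tpr + n_fpr        # everyone the RS calls disease-positive
    an_dn = n_tnr + n_fnr        # everyone the RS calls disease-negative
    return an_dp, an_dn
```

Note that the total number of patients is preserved: the two apparent counts always sum to `n_dp + n_dn`, because misclassification only moves patients between the two groups.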

Correction for Patient Misclassifications by an Imperfect RS: Calculation of True Numbers of Disease-Positive and Disease-Negative Patients
If anDP, anDN, SeR, and SpR are all known, then Formulas (3) and (4) constitute a system of equations with two unknowns (nDP and nDN). This was solved by using an online system-of-equations calculator [13]. However, it was first necessary to substitute single-letter variables (e.g., x for nDP, y for nDN, and a, b, c, etc. for the other variables), because the online calculator interpreted two-letter variables as two variables rather than as a single variable. Solution of the system of equations leads to Formulas (5) and (6) (details shown in Supplementary Materials).
$$n_{DP} = \frac{an_{DP} \times Sp_R + Sp_R \times an_{DN} - an_{DN}}{Se_R + Sp_R - 1} \tag{5}$$
$$n_{DN} = \frac{an_{DP} \times Se_R + Se_R \times an_{DN} - an_{DP}}{Se_R + Sp_R - 1} \tag{6}$$
The denominators in Formulas (5) and (6) are equal to Youden's J statistic [14] (Youden's Index) for the RS (Equation (7)):
$$J_R = Se_R + Sp_R - 1 \tag{7}$$
Therefore, Formulas (5) and (6) can be rewritten as Formulas (8) and (9):
$$n_{DP} = (an_{DP} \times Sp_R + Sp_R \times an_{DN} - an_{DN}) / J_R \tag{8}$$
$$n_{DN} = (an_{DP} \times Se_R + Se_R \times an_{DN} - an_{DP}) / J_R \tag{9}$$
Note that because the sum of nDP and nDN must equal N (the number of patients in the study; Equation (10)), one could also calculate just one value (either nDP or nDN), and subtract it from N to get the other value (Equation (11) or Equation (12)):
$$N = n_{DP} + n_{DN} \tag{10}$$
$$n_{DP} = N - n_{DN} \tag{11}$$
$$n_{DN} = N - n_{DP} \tag{12}$$
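As a hedged sketch, Formulas (8) and (9) can be written as a short Python function. The function name `true_counts` and the guard against a Youden's Index of zero are assumptions added for illustration; the source does not discuss that edge case.

```python
def true_counts(an_dp, an_dn, se_r, sp_r):
    """Return (n_dp, n_dn), the misclassification-corrected patient counts.

    an_dp, an_dn : apparent disease-positive / disease-negative counts from the RS
    se_r, sp_r   : sensitivity and specificity of the RS
    """
    j_r = se_r + sp_r - 1                       # Youden's Index, Equation (7)
    if abs(j_r) < 1e-12:
        # Assumption for this sketch: an uninformative RS (J = 0) cannot be corrected.
        raise ValueError("Youden's Index of the RS is zero; correction is undefined")
    n_dp = (an_dp * sp_r + sp_r * an_dn - an_dn) / j_r   # Formula (8)
    n_dn = (an_dp * se_r + se_r * an_dn - an_dp) / j_r   # Formula (9)
    return n_dp, n_dn
```

The returned values are generally not whole numbers; whether and when to round them to whole patients is left to the caller.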

Calculation of Sensitivity and Specificity of the Investigational Test
Next, expressions for the true numbers of patients with true-positive and true-negative investigational test results (nTPI and nTNI, respectively) were derived. Figure 2 shows the possible outcomes of a clinical study of an IT using an RS. From Figure 2, it can be deduced that:
$$an_{TPI} = n_{DP} \times Se_R \times Se_I + n_{DN} \times (1 - Sp_R) \times (1 - Sp_I) \tag{13}$$
$$an_{TNI} = n_{DN} \times Sp_R \times Sp_I + n_{DP} \times (1 - Se_R) \times (1 - Se_I) \tag{14}$$
If nDP, nDN, SeR, and SpR are all known, then Equations (13) and (14) constitute a system of equations with two unknowns (SeI and SpI). Substituting single-letter variables and using the same online system-of-equations calculator referenced above leads to Equation (15) for the sensitivity of the investigational test (SeI) and Equation (16) for the specificity of the investigational test (SpI).
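$$Se_I = \frac{an_{TPI} \times Sp_R - n_{DP} \times Se_R \times Sp_R + n_{DP} \times Se_R + n_{DP} \times Sp_R - n_{DP} + n_{DN} \times Sp_R^2 - n_{DN} \times Sp_R - Sp_R \times an_{TNI} + an_{TNI}}{n_{DP} \times (Se_R + Sp_R - 1)} \tag{15}$$
$$Sp_I = \frac{an_{TPI} - an_{TPI} \times Se_R + n_{DP} \times Se_R^2 - n_{DP} \times Se_R - Se_R \times n_{DN} \times Sp_R + n_{DN} \times Se_R + n_{DN} \times Sp_R - n_{DN} + Se_R \times an_{TNI}}{n_{DN} \times (Se_R + Sp_R - 1)} \tag{16}$$
These can be simplified somewhat by substituting some of the terms with Youden's Index. Thus, Equation (15) becomes Equation (17):
$$Se_I = \frac{an_{TPI} \times Sp_R - n_{DP} \times Se_R \times Sp_R + n_{DP} \times J_R + n_{DN} \times Sp_R^2 - n_{DN} \times Sp_R - Sp_R \times an_{TNI} + an_{TNI}}{n_{DP} \times J_R} \tag{17}$$
Similarly, substituting some terms with Youden's Index converts Equation (16) into Equation (18):
$$Sp_I = \frac{an_{TPI} - an_{TPI} \times Se_R + n_{DP} \times Se_R^2 - n_{DP} \times Se_R - Se_R \times n_{DN} \times Sp_R + n_{DN} \times J_R + Se_R \times an_{TNI}}{n_{DN} \times J_R} \tag{18}$$
As a quick check of the validity of the equations, if SeR = SpR = 1, then Equations (17) and (18) should simplify to show that SeI equals anTPI/nDP and that SpI equals anTNI/nDN, and they do (see Supplementary Materials).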
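The corrected sensitivity and specificity in Equations (17) and (18) can be sketched in Python as follows. The function name `corrected_se_sp` is hypothetical, and the sketch assumes (as the text does) conditional independence of the RS and IT and that nDP and nDN have already been corrected for misclassification.

```python
def corrected_se_sp(an_tpi, an_tni, n_dp, n_dn, se_r, sp_r):
    """Return (se_i, sp_i), the corrected Se and Sp of the investigational test.

    an_tpi, an_tni : apparent numbers of true-positive / true-negative IT results
    n_dp, n_dn     : true (corrected) numbers of disease-positive / -negative patients
    se_r, sp_r     : sensitivity and specificity of the reference standard
    """
    j_r = se_r + sp_r - 1  # Youden's Index of the RS

    # Equation (17)
    se_i = (an_tpi * sp_r - n_dp * se_r * sp_r + n_dp * j_r
            + n_dn * sp_r ** 2 - n_dn * sp_r
            - sp_r * an_tni + an_tni) / (n_dp * j_r)

    # Equation (18)
    sp_i = (an_tpi - an_tpi * se_r + n_dp * se_r ** 2 - n_dp * se_r
            - se_r * n_dn * sp_r + n_dn * j_r
            + se_r * an_tni) / (n_dn * j_r)
    return se_i, sp_i
```

With a perfect RS (`se_r = sp_r = 1`), the function reduces to `an_tpi / n_dp` and `an_tni / n_dn`, mirroring the validity check described above.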

Calculation of Apparent Sensitivity and Specificity of the Investigational Test
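For the sake of completeness, Equations (19) and (20), which show the relationship between nDP, nDN, SeR, SpR, SeI, and SpI and the apparent sensitivity (aSeI) and specificity (aSpI) of the investigational test, were deduced from Figure 2. Details are provided in the Supplementary Materials.
$$aSe_I = \frac{an_{TPI}}{an_{DP}} = \frac{n_{DP} \times Se_R \times Se_I + n_{DN} \times (1 - Sp_R) \times (1 - Sp_I)}{n_{DP} \times Se_R + n_{DN} \times (1 - Sp_R)} \tag{19}$$
$$aSp_I = \frac{an_{TNI}}{an_{DN}} = \frac{n_{DN} \times Sp_R \times Sp_I + n_{DP} \times (1 - Se_R) \times (1 - Se_I)}{n_{DN} \times Sp_R + n_{DP} \times (1 - Se_R)} \tag{20}$$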
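The relationship between the true and apparent values can also be sketched as a forward calculation that follows the branches of Figure 2. This is an illustrative sketch under the conditional-independence assumption stated in the Methods; the function name `apparent_se_sp` and the variable names are hypothetical, and the arrangement of terms may differ from the form given in the Supplementary Materials.

```python
def apparent_se_sp(n_dp, n_dn, se_r, sp_r, se_i, sp_i):
    """Return (a_se_i, a_sp_i): the Se and Sp a study would appear to show.

    n_dp, n_dn : true numbers of disease-positive / disease-negative patients
    se_r, sp_r : sensitivity and specificity of the reference standard
    se_i, sp_i : true sensitivity and specificity of the investigational test
    """
    # Apparent group sizes produced by the imperfect RS (Figure 1).
    an_dp = n_dp * se_r + n_dn * (1 - sp_r)
    an_dn = n_dn * sp_r + n_dp * (1 - se_r)

    # Apparent true positives: IT positive among RS-positive patients.
    an_tpi = n_dp * se_r * se_i + n_dn * (1 - sp_r) * (1 - sp_i)
    # Apparent true negatives: IT negative among RS-negative patients.
    an_tni = n_dn * sp_r * sp_i + n_dp * (1 - se_r) * (1 - se_i)

    return an_tpi / an_dp, an_tni / an_dn
```

When the RS is perfect, the apparent values returned equal the true `se_i` and `sp_i`; otherwise they generally differ.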

Proportion-Based Equations
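The above equations are in terms of patient counts. They can be converted to equations that are based on proportions by dividing by N, the total number of patients in the study. For example, starting with Equation (17) and dividing every term by N (equivalent to multiplying both the numerator and denominator by 1/N), one gets Equation (21):
$$Se_I = \frac{\frac{an_{TPI}}{N} \times Sp_R - \frac{n_{DP}}{N} \times Se_R \times Sp_R + \frac{n_{DP}}{N} \times J_R + \frac{n_{DN}}{N} \times Sp_R^2 - \frac{n_{DN}}{N} \times Sp_R - Sp_R \times \frac{an_{TNI}}{N} + \frac{an_{TNI}}{N}}{\frac{n_{DP}}{N} \times J_R} \tag{21}$$
Since nDP/N = Pr (the prevalence of the index disease), and since nDN/N = 1 − Pr, Equation (21) can be rewritten as Equation (22):
$$Se_I = \frac{pa_{TPI} \times Sp_R - Pr \times Se_R \times Sp_R + Pr \times J_R + (1 - Pr) \times Sp_R^2 - (1 - Pr) \times Sp_R - Sp_R \times pa_{TNI} + pa_{TNI}}{Pr \times J_R} \tag{22}$$
In Equation (22), paTNI is the proportion of patients with an apparently true-negative investigational test result, Pr is the prevalence of the index disease, JR is Youden's Index for the RS, paTPI is the proportion of patients with an apparently true-positive investigational test result, SpR is the specificity of the RS, and SeR is the sensitivity of the RS.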
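Similarly, dividing every term in Equation (18) by N gives Equation (23):
$$Sp_I = \frac{pa_{TPI} - pa_{TPI} \times Se_R + Pr \times Se_R^2 - Pr \times Se_R - Se_R \times (1 - Pr) \times Sp_R + (1 - Pr) \times J_R + Se_R \times pa_{TNI}}{(1 - Pr) \times J_R} \tag{23}$$
In Equation (23), the variables have the same meaning as in Equation (22).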
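A proportion-based version of the correction can be sketched by dividing every count in Equations (17) and (18) by N, as described above. The function name `corrected_se_sp_from_proportions` is hypothetical, and `pr` is assumed here to be the true prevalence nDP/N (i.e., after correction for misclassification by the RS), not the apparent prevalence.

```python
def corrected_se_sp_from_proportions(pa_tpi, pa_tni, pr, se_r, sp_r):
    """Return (se_i, sp_i) from proportions instead of patient counts.

    pa_tpi, pa_tni : proportions of all patients with apparently true-positive /
                     apparently true-negative IT results (an_tpi / N, an_tni / N)
    pr             : prevalence of the index disease (assumed to be n_dp / N)
    se_r, sp_r     : sensitivity and specificity of the reference standard
    """
    j_r = se_r + sp_r - 1   # Youden's Index of the RS
    q = 1 - pr              # proportion of disease-negative patients (n_dn / N)

    se_i = (pa_tpi * sp_r - pr * se_r * sp_r + pr * j_r
            + q * sp_r ** 2 - q * sp_r
            - sp_r * pa_tni + pa_tni) / (pr * j_r)

    sp_i = (pa_tpi - pa_tpi * se_r + pr * se_r ** 2 - pr * se_r
            - se_r * q * sp_r + q * j_r
            + se_r * pa_tni) / (q * j_r)
    return se_i, sp_i
```

Because numerator and denominator are both scaled by 1/N, this function should return the same values as a count-based calculation applied to the same study.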

Example Calculation Using Test Data
As an example, suppose a clinical study of an investigational diagnostic test using an imperfect RS finds that 50 patients are apparently disease-positive by the RS (anDP = 50), and 60 are apparently disease-negative (anDN = 60). Suppose that SeR = 0.90 and SpR = 0.85, and that the numbers of patients with apparently true IT classifications (anTPI and anTNI) are, respectively, 38 and 48. These data correspond to aSeI = 0.76 and aSpI = 0.80. Youden's Index (JR) for the reference standard is 0.75. Using Formula (8), nDP = 45 after rounding. Since N = nDP + nDN, then nDN = 110 − 45, or 65. The number of patients misclassified by the RS is the absolute value of the difference between anDP and nDP (or anDN
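The arithmetic of this example can be retraced with a short script. This is a sketch only: the variable names are hypothetical, the script keeps the unrounded values of nDP and nDN (whereas the text above rounds them to whole patients), and the corrected SeI and SpI it prints are computed here from Equations (17) and (18) rather than quoted from the source.

```python
# Inputs taken from the example above.
an_dp, an_dn = 50, 60        # apparent disease-positive / disease-negative counts
se_r, sp_r = 0.90, 0.85      # Se and Sp of the reference standard
an_tpi, an_tni = 38, 48      # apparent true-positive / true-negative IT results

n_total = an_dp + an_dn                      # N = 110
j_r = se_r + sp_r - 1                        # Youden's Index of the RS

# Apparent Se and Sp of the investigational test.
a_se_i = an_tpi / an_dp
a_sp_i = an_tni / an_dn

# Formula (8) and Equation (12): corrected patient counts (left unrounded here).
n_dp = (an_dp * sp_r + sp_r * an_dn - an_dn) / j_r
n_dn = n_total - n_dp

# Equations (17) and (18): corrected Se and Sp of the investigational test.
se_i = (an_tpi * sp_r - n_dp * se_r * sp_r + n_dp * j_r
        + n_dn * sp_r ** 2 - n_dn * sp_r
        - sp_r * an_tni + an_tni) / (n_dp * j_r)
sp_i = (an_tpi - an_tpi * se_r + n_dp * se_r ** 2 - n_dp * se_r
        - se_r * n_dn * sp_r + n_dn * j_r
        + se_r * an_tni) / (n_dn * j_r)

print(f"J_R = {j_r:.2f}, n_DP = {n_dp:.2f}, n_DN = {n_dn:.2f}")
print(f"apparent Se = {a_se_i:.2f}, apparent Sp = {a_sp_i:.2f}")
print(f"corrected Se = {se_i:.4f}, corrected Sp = {sp_i:.4f}")
```

Rounding nDP and nDN to whole patients before applying Equations (17) and (18) would shift the corrected values slightly, so the choice of when to round should be stated alongside any reported result.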

Discussion
In assessing a new investigational diagnostic test, it is not always feasible to use a perfect RS, and an imperfect RS (one with Se and/or Sp < 1) must sometimes be used, which can result in patient misclassifications and incorrect values of SeI and SpI. Such situations raise questions about the accuracy of the investigational test's Se and Sp determined using the imperfect RS. In this work, formulas were derived for correctly calculating the investigational test's true Se and true Sp from results obtained with any reference standard whose Se and Sp are known.
Three prior studies [9][10][11] reported derivations of formulas for correcting Se and Sp for misclassification by an imperfect RS. Their approaches differed from each other and from the approach taken here. In addition, they did not report enough detail to allow one to determine if their derivations were correct. Therefore, comparison of this work to theirs was difficult but was successful for the approaches by Gart and Buck [10] and Staquet et al. [11] (See Supplementary Materials for details).
Brenner [9] reported equations for aSeI and aSpI for a case-control study if SeI, SpI, SeR, SpR, and the exposure Pr were all known, but did not report solving for SeI and SpI. This differs from the account given by Umemneku Chikere et al. [8], who I believe may have misinterpreted the Brenner equations for aSeI and aSpI as being for SeI and SpI.
Gart and Buck [10] discussed the use of screening and reference tests for estimating disease Pr in epidemiologic studies. They derived equations for what they termed co-positivity and co-negativity (which I determined to be equivalent to aSeI and aSpI), and solved these for SeI and SpI if Pr, SeR, SpR, aSeI, and aSpI are known. I was able to show that my equations for SeI and SpI (after transformation into proportion-based variables) were equivalent to theirs (see Supplementary Materials for details).
Staquet and colleagues [11] reported equations for calculating SeI and SpI provided that SeR, SpR, anTPI, anTNI, anFPI, and anFNI are known. I was able to show that my equations for SeI and SpI were equivalent to theirs (see Supplementary Materials for details).
Trikalinos et al. [12] did not report equations for SeI or SpI, but I compared their equations for the cells of their 2 × 2 contingency table (corresponding to paTPI and paTNI), and they matched what I had derived (data not shown).

Strengths and Weaknesses of This Work
This work builds on that of Trikalinos et al. [12], Gart and Buck [10], Staquet et al. [11] and Brenner [9]. These authors discussed potential methods of handling diagnostic studies that use an imperfect RS but did not provide sufficient detail to allow easy replication of their methods. Although Trikalinos et al. [12] mentioned the possibility of adjusting results that are based on an imperfect RS, they did not report derivation of the formulas needed to do so, as I have. One advantage of my work is that I provide full details of the derivations (see Supplementary Materials) so that others may easily reproduce and confirm my work.
In addition, I report formulas that can use either absolute patient counts or proportions (e.g., prevalence), in contrast to prior authors, who reported formulas for only one or the other approach.

Conclusions
Validation of a new diagnostic test by use of an imperfect RS (one with Se and/or Sp < 1) introduces patient misclassifications that result in deviation of the apparent sensitivity and specificity of the index test (aSeI and aSpI, respectively) from the true values. By analyzing the role of the reference standard in the determination of the sensitivity and specificity of an index test, it is possible to derive formulas that correct for patient misclassification by an imperfect RS, as well as for the subsequent error introduced into aSeI and aSpI. The analysis showed that the more imperfect the RS (i.e., the lower the Se and Sp of the RS), the greater the error introduced into aSeI and aSpI. Therefore, when an imperfect RS is used to validate a diagnostic test, it may be necessary to apply corrections to arrive at accurate values of Se and Sp for the test.
This work builds on that of prior authors who discussed potential methods of handling diagnostic studies that use an imperfect RS but did not provide sufficient detail to allow easy replication of their methods. In contrast, full details of the derivations are provided (in Supplementary Materials) to provide transparency, so that others may confirm and perhaps build upon this work. In addition, formulas based on both patient counts and patient proportions are provided, in contrast to prior authors, who provided either one or the other.
For this corrective method to be feasible, two conditions must be met. First, the assumption of conditional independence of the index test and the RS must be true; this assumption is reasonable if the two tests work by different mechanisms (e.g., if the index test relies on laboratory methods and the RS relies on autopsy). Second, one obviously needs to know the values of the Se and Sp of the RS, which may not always be the case. However, if they are known, then the derived formulas can help provide needed corrections to the apparent values of an index test's Se and Sp.