Communicating Risk: Developing an “Efficiency Index” for Dementia Screening Tests

Diagnostic and screening tests carry risks, such as misdiagnosis, as well as the potential benefit of correct diagnosis. Effective communication of this risk to both clinicians and patients can be problematic. The purpose of this study was to develop a metric called the “efficiency index” (EI), defined as the ratio of test accuracy to test inaccuracy, to evaluate screening tests for dementia. This measure was compared with the previously described “likelihood to be diagnosed or misdiagnosed” (LDM), which is also based on “numbers needed” metrics. Datasets from prospective pragmatic test accuracy studies examining four brief cognitive screening instruments (Mini-Mental State Examination; Montreal Cognitive Assessment; Mini-Addenbrooke’s Cognitive Examination (MACE); and Free-Cog) were analysed to calculate values for EI and LDM, and to examine their variation with test cut-off (for MACE) and with dementia prevalence. EI values were also calculated using a modification of McGee’s heuristic for the simplification of likelihood ratios to estimate percentage change in diagnostic probability. The findings indicate that EI is easier to calculate than LDM and, unlike LDM, may be classified either qualitatively or quantitatively in a manner similar to likelihood ratios. EI shows the utility or inutility of diagnostic and screening tests, illustrating the inevitable trade-off between diagnosis and misdiagnosis. It may be a useful metric for communicating risk in a way that is easily intelligible to both clinicians and patients.


Introduction
No medical treatment or test is without potential harms as well as benefits, and hence none is free of risk. Communicating such risk to patients for the purpose of shared decision making has attracted much attention and research in recent times, for example into the most appropriate methods by which to achieve such communication effectively. Guidance on both verbal and numerical qualifiers of risk has appeared [1,2]. As regards numerical qualifiers, options often used in the context of therapeutic interventions include absolute risk (AR), relative risk (RR), and the number needed to treat (NNT) [3], but no consensus on the optimum method has been established.
Considering diagnostic or screening tests, performance is typically described by comparison with a reference standard, such as a criterion diagnosis or a reference test, by constructing a 2 × 2 contingency table, such that all (N) index test results may be cross-tabulated as true positive (TP), false positive (FP), false negative (FN), or true negative (TN). From this standard 2 × 2 contingency table (Figure 1), various parameters of test discrimination may be calculated, many of which are familiar to clinicians as descriptors of test performance, such as sensitivity (Sens; or true positive rate) and specificity (Spec; or true negative rate), positive and negative predictive values, and positive and negative likelihood ratios (LR+, LR−) [4]. One may thus distinguish from the 2 × 2 contingency table two conditions or relations between the index test and the reference standard: consistency, or matching, of outcomes (+/+ or TP, and −/− or TN); and contradiction, or mismatching (+/− or FP, and −/+ or FN). From these two conditions, the paired complementary parameters of accuracy (Acc = (TP + TN)/N) and inaccuracy or error rate (Inacc = (FP + FN)/N) may be derived. As negations, these may also be described using the Boolean NOT operator, since Acc = 1 − Inacc and Inacc = 1 − Acc.
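The complementary relationship between Acc and Inacc can be illustrated with a short sketch (the counts are hypothetical, not drawn from the studies reported here):

```python
def accuracy(tp, fp, fn, tn):
    """Acc = (TP + TN)/N: proportion of index test results matching the reference standard."""
    n = tp + fp + fn + tn
    return (tp + tn) / n

def inaccuracy(tp, fp, fn, tn):
    """Inacc = (FP + FN)/N: proportion of mismatching (erroneous) results."""
    n = tp + fp + fn + tn
    return (fp + fn) / n

# Hypothetical 2 x 2 table: Acc and Inacc are complements (Acc = 1 - Inacc).
acc = accuracy(80, 10, 5, 105)      # 185/200 = 0.925
inacc = inaccuracy(80, 10, 5, 105)  # 15/200 = 0.075
```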
How might these various test measures be effectively communicated to patients who are unfamiliar with the principles and nomenclature of binary classification, but worried that they might have a dementia disorder because of memory symptoms? A metric called the "likelihood to be diagnosed or misdiagnosed" (LDM) has been developed [5,6] which may be useful for the purpose of communicating risk, specifically the risk of testing leading to misdiagnosis as opposed to correct diagnosis.
LDM was conceptualised as analogous to the "likelihood to be helped or harmed" (LHH) metric which was developed to communicate the results of therapeutic (randomised controlled) trials. LHH is based on "number needed to" metrics, specifically the number needed to treat (NNT) for a specified treatment outcome (e.g., cure, remission, 50% reduction in symptoms) [7] and the number needed to harm (NNH) [8]. LHH is the ratio of NNH to NNT, which is desirably as large as possible (high NNH, low NNT), thus summarising treatment benefits and risks [9].
The LDM metric has proved serviceable in evaluating a wide range of neurological signs and cognitive screening instruments (CSIs) used in the evaluation of disorders of cognition [5,6,12]. Nevertheless, LDM has some limitations and shortcomings. Consistent with its ad hoc development, based on existing metrics, LDM combines rates with different denominators which are not easily reconciled. Calculation of several parameters from the 2 × 2 table is required to reach LDM (Sens, Spec, Y, NND, Inacc, NNM), although ad hoc calculators exist [13]. Furthermore, the "number needed to diagnose" based on the Youden index incorporates considerations not only of diagnosis but also of misdiagnosis, since Sens = 1 − false negative rate, and Spec = 1 − false positive rate. The resulting LDM has boundary values of −1 (useless test: Sens = Spec = 0, NND = −1; Inacc = 1, NNM = 1) and ∞ (perfect test: Sens = Spec = 1, NND = 1; Inacc = 0, NNM = ∞), and so LDM values cannot be perfectly equated with the qualitative classification scheme developed for likelihood ratios (LR) [14] which has been used for making recommendations on tests suitable for dementia by the UK National Institute for Health and Care Excellence [15]. Unlike LDM, LR has boundary values of 0 and ∞, although LDM shares with LR an inflection point at 1 (LDM < 1 favours misdiagnosis, LDM > 1 favours diagnosis) [5,6].
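The chain of parameters required to reach LDM (NND = 1/Y from the Youden index, NNM = 1/Inacc, and LDM as their ratio NNM/NND) can be sketched as follows, using hypothetical values:

```python
def ldm(sens, spec, inacc):
    """LDM = NNM/NND, where NND = 1/Y (Youden index Y = Sens + Spec - 1)
    and NNM = 1/Inacc; equivalently LDM = Y/Inacc.
    Note: undefined (division by zero) when Y = 0."""
    y = sens + spec - 1      # Youden index
    nnd = 1 / y              # number needed to diagnose
    nnm = 1 / inacc          # number needed to misdiagnose
    return nnm / nnd

# Hypothetical values: Sens = 0.9, Spec = 0.7, Inacc = 0.2
# -> Y = 0.6, NND = 1.67, NNM = 5, LDM = 3 (favours diagnosis, being > 1).
value = ldm(0.9, 0.7, 0.2)
```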
A simple method to overcome these shortcomings of LDM may be proposed. The NND may be redefined, as NND*, using Acc rather than Sens and Spec, such that NND* = 1/Acc. This formulation is analogous to the previous definition of NNM [10], where NNM = 1/Inacc. Both these measures now share the same denominator from the 2 × 2 contingency table, N, and calculation is thus simplified, such that:

NNM/NND* = (1/Inacc)/(1/Acc) = Acc/Inacc (1)

Whilst this ratio might justifiably be termed a "likelihood to be diagnosed or misdiagnosed", an alternative name would be preferable to avoid confusion with the previously defined LDM. Kraemer denoted TP + TN as "efficiency" [16], so FP + FN might be termed "inefficiency", and hence the ratio of efficiency to inefficiency may be denoted the "efficiency index" (EI). Hence:

EI = Acc/Inacc = (TP + TN)/(FP + FN) (2)

The boundary values of EI are 0 (useless test: Acc = 0, Inacc = 1) and ∞ (perfect test: Acc = 1, Inacc = 0), as for likelihood ratios.
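The two equivalent routes to EI, from the four cell counts or from Acc alone (since Inacc = 1 − Acc), can be sketched as follows (hypothetical counts):

```python
def ei_from_counts(tp, fp, fn, tn):
    """EI = (TP + TN)/(FP + FN): Kraemer's 'efficiency' over 'inefficiency'."""
    return (tp + tn) / (fp + fn)

def ei_from_acc(acc):
    """Equivalent form EI = Acc/Inacc, since Inacc = 1 - Acc."""
    return acc / (1 - acc)

# Hypothetical table (TP=80, FP=10, FN=5, TN=105): Acc = 0.925,
# so both routes give EI = 185/15 = 0.925/0.075.
ei_a = ei_from_counts(80, 10, 5, 105)
ei_b = ei_from_acc(0.925)
```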
The primary aim of this study was to examine the utility of EI and compare it to the previously defined LDM parameter when applied to test accuracy studies of several brief CSIs, namely Mini-Mental State Examination (MMSE) [17], Montreal Cognitive Assessment (MoCA) [18], Mini-Addenbrooke's Cognitive Examination (MACE) [19], and Free-Cog [20]. Secondary aims were: to examine other methods to calculate EI and to compare performance with LRs, particularly with McGee's method of simplifying LR values as percentage changes in diagnostic probability [21]; and to compare EI with a previously described measure based on Acc and Inacc, the identification index (II) [22], defined as II = (Acc − Inacc) = 2 Acc − 1.

Methods
Data from pragmatic prospective screening test accuracy studies using a standardised methodology were re-analysed. The studies were undertaken in a dedicated cognitive disorders clinic located in a secondary care setting (regional neuroscience centre) and examined four brief cognitive screening instruments (administration time ca. 5-10 min), all with a denominator of 30 points: Mini-Mental State Examination (MMSE) [23,24], Montreal Cognitive Assessment (MoCA) [25], Mini-Addenbrooke's Cognitive Examination (MACE) [26], and Free-Cog [27].
In each study, criterion diagnosis of dementia followed standard diagnostic criteria (DSM-IV) and was made independent of scores on index CSIs to avoid review bias. For each study, prevalence of dementia was calculated as the sum of TP and FN divided by the total number of patients (N) assessed. All studies followed either the STAndards for the Reporting of Diagnostic accuracy studies (STARD) [28] or the derived guidelines specific for dementia studies, STARDdem [29], dependent on the exact date at which each study was undertaken. In all studies, subjects gave informed consent and study protocols were approved by the institute's committee on human research (Walton Centre for Neurology and Neurosurgery Approval: N 310).
For each CSI, the following parameters were calculated: Acc, Inacc, Y, LDM, II; and EI by using the values of Acc and Inacc (Equation (1)).
The variation of EI with test cut-off was examined using data from the test accuracy study of MACE [26] and compared with the variation in LDM. The variation of EI with prevalence of dementia (P) was also examined and compared to values for LDM [12].
EI was also calculated from values of test Sens and Spec in the MACE study. Since:

Acc = (P × Sens) + ((1 − P) × Spec)

and

Inacc = (P × (1 − Sens)) + ((1 − P) × (1 − Spec))

hence:

EI = [(P × Sens) + ((1 − P) × Spec)]/[(P × (1 − Sens)) + ((1 − P) × (1 − Spec))] (3)

The performance of EI was also compared to that of LRs, which may be used to calculate the difference between pre- and post-test odds, since post-test odds = pre-test odds × LR. McGee showed that LR+ values of 2, 5, and 10 increase the probability of diagnosis by approximately 15%, 30%, and 45%, respectively, whereas LR− values of 0.5, 0.2, and 0.1 decrease the probability of diagnosis by approximately 15%, 30%, and 45%, respectively. These figures derive from the almost linear relationship of probability and the natural logarithm of odds over the range 0.1–0.9, such that the percentage change in probability may be calculated independent of pre-test probability as:

Change in probability = 0.19 × log_e(LR)

This simple heuristic obviates calculations between pre- and post-test odds and probabilities [21]. As the boundary values of EI (0, ∞) correspond to those of LRs, calculations were undertaken to assess whether or not the heuristic described for LR values also holds for EI values. These calculations used data from the studies of MACE [26] and MoCA [25]. As both of these studies had a similar (low) pre-test probability of dementia, the issue was further examined using data from a test accuracy study of the Test Your Memory (TYM) test [30], this being the study with the highest pre-test probability of dementia reported from this clinic [31] (because an informant is generally required to assist with TYM, and many patients without dementia attend the clinic alone [32]).
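The re-expression of Acc and Inacc in terms of prevalence, Sens, and Spec (Equation (3)) can be sketched as follows (illustrative values, not drawn from the MACE study):

```python
def ei_from_prevalence(p, sens, spec):
    """EI via Equation (3): Acc and Inacc expressed in terms of
    prevalence P, sensitivity, and specificity."""
    acc = p * sens + (1 - p) * spec
    inacc = p * (1 - sens) + (1 - p) * (1 - spec)
    return acc / inacc

# Illustrative values: P = 0.2, Sens = 0.9, Spec = 0.7
# -> Acc = 0.74, Inacc = 0.26, EI = 0.74/0.26.
value = ei_from_prevalence(0.2, 0.9, 0.7)
```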

Results
A summary of the studies of MMSE, MoCA, MACE, and Free-Cog (Table 1) showed broadly similar prevalence of dementia, median age, and gender ratio in each patient cohort. Comparing the various metrics for the diagnosis of dementia versus no dementia for each of these CSIs (Table 2) showed a similar ranking (best to worst) for Acc, Y, LDM, II, and EI. Comparing the EI and LDM metrics across a range of MACE cut-offs (Table 3, Figure 2) showed that, as for the other CSIs (Table 2), EI was a more optimistic score than LDM and that, unlike II, EI nowhere had a negative value. The maxima for EI and LDM almost coincided (LDM at ≤15/30; EI at ≤14/30). Values for EI and LDM were approximately equal at higher test cut-offs but diverged at lower cut-offs in this dataset, which may reflect the high sensitivity and low specificity of MACE for the diagnosis of dementia [26]. Comparing the EI and LDM metrics across a range of P (Table 4, Figure 3) showed that the values increasingly diverge at higher prevalence in this dataset. Using Equation (1), the value of EI for MACE was 2.817 (Table 2). This value was checked by using Equation (3), substituting the values for dementia prevalence (P = 0.151) and Sens and Spec at the MACE test threshold (≤20/30; Table 1). The same value of EI was thus obtained using two different methods.
To examine whether or not McGee's simple rules, obviating calculations between pre- and post-test odds and probabilities for LRs [21], are also applicable to EIs, data from the MACE study were used [26]. Using calculations based on the observed pre-test probability, MACE increased the diagnostic probability of dementia in this patient cohort from approximately 15% to approximately 33%, an 18% increase.
Using the equation derived by McGee to calculate change in diagnostic probability independent of pre-test probability [21], and substituting LR with EI:

Change in probability = 0.19 × log_e(EI) = 0.19 × log_e(2.817) = 0.197

Thus, similar values for percentage change in probability were obtained using the two different methods (18% vs. 19.7%).
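The two routes to the change in diagnostic probability can be sketched as follows; the worked example uses the MACE figures quoted above (pre-test probability 0.151, EI 2.817):

```python
import math

def exact_change(pre_p, ei):
    """Exact change in diagnostic probability, treating EI like an LR:
    post-test odds = pre-test odds x EI."""
    pre_odds = pre_p / (1 - pre_p)
    post_odds = pre_odds * ei
    post_p = post_odds / (1 + post_odds)
    return post_p - pre_p

def heuristic_change(ei):
    """McGee's approximation, independent of pre-test probability."""
    return 0.19 * math.log(ei)

# MACE values from the text: the exact method gives ~0.18 (an 18% increase),
# the heuristic ~0.197 (19.7%).
exact = exact_change(0.151, 2.817)
approx = heuristic_change(2.817)
```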
Calculations of change in diagnostic probability were performed assuming EI values of 2.0, 5.0, 10.0, 0.5, 0.2, and 0.1 using both methods, i.e., dependent on and independent of the observed pre-test probability. Similar calculations were also performed using the MoCA test accuracy study data [25], in which pre-test probability was similar to MACE but EI was 0.745 (i.e., favouring misdiagnosis). The results (Table 5) show that for EI values > 1, favouring correct diagnosis, the percentage changes in diagnostic probability were similar when calculated independent of pre-test probability (column 1) and when calculated using observed pre-test probabilities (columns 2 and 3), approximating McGee's 15, 30, and 45% increases for EI = 2, 5, and 10. However, for EI values < 1, favouring misdiagnosis, McGee's 15, 30, and 45% decreases for EI = 0.5, 0.2, and 0.1 were not observed, presumably because of the low pre-test probabilities (ca. 15%) in these patient cohorts. In the S-shaped curve which describes the relationship between probability and log_e odds [21], this may correspond to the part of the plot away from the nearly linear portion which runs from approximately 0.1 to approximately 0.9. To further examine this point, data from a test accuracy study of the Test Your Memory (TYM) test [31] were reanalysed, wherein the pre-test probability of dementia was 0.35, the highest reported in studies from this clinic. The change in probability based on the observed pre-test probability (Table 5, column 4) approximated the values calculated independent of pre-test probability (column 1) more closely than for MACE and MoCA for EI values < 1.

Discussion
This paper defines a new parameter, EI, which may be of use not only for the evaluation of diagnostic and screening tests but also for the communication of risk to both clinicians and patients. The examples presented illustrate how EI may be used in a clinical setting. EI may be applied in any test accuracy study which permits construction of a 2 × 2 contingency table. EI may be conceptualised, like the previously defined LDM parameter, as a ratio of test harms (misdiagnosis) and benefits (diagnosis), and hence a measure of what has previously been termed the "fragility" of screening and diagnostic tests [33]. Indeed, it might have been named the "fragility index", understood as a propensity to break or fail. However, "efficiency index" emphasises its relation to efficiency, understood as the ability to do things well, and conceptualised as a ratio of useful output to total input, or product per cost. EI nevertheless differs from efficiency in that efficiency has a value of at most 1, whereas the upper bound of EI, for a perfect diagnostic or screening test, is ∞.
Comparing EI to LDM, EI has kinship with, but advantages over, LDM. It is more easily explicable (and, hence, more elegant) than the makeshift (although not arbitrary) derivation of LDM. EI is easier to calculate than LDM, requiring at its simplest only the four values from the cells of the 2 × 2 contingency table (Equation (2)), whilst retaining the inflection point at the value of 1 (EI > 1 indicates greater likelihood of correct diagnosis; EI < 1 indicates greater likelihood of misdiagnosis).
EI and LDM share the same denominator, Inacc, but differ in numerator, Acc and Y, respectively, these numerators being the multiplicative inverse of NND* vs. NND, respectively. Interpreting the results (Table 3, Figure 2), the differences in EI and LDM values thus relate to the higher values of Acc compared to Y (lower values of NND* compared to NND), especially at lower cut-off values.
Comparing EI to LRs, whilst both can be calculated from the values of Sens and Spec, and share boundary values (0, ∞), EI is dependent on prevalence (Equation (3)) whereas LRs are not (at least algebraically, although in practice there may be variation [34]). Based on the calculations (Table 5) it seems appropriate to use the same qualitative classification scheme for EI values as proposed by Jaeschke et al. for LRs [14] (Table 6, column 1). Furthermore, McGee's simplification of LRs, as percentage change in diagnostic probability [21], also appears to be applicable to EI values (Table 5), and hence this numerical classification might also be used (Table 6, column 2). Interpreting the EI results is therefore straightforward for clinicians evaluating diagnostic or screening tests, requiring no new classificatory system.
Comparing EI to the previously described identification index, II = Acc − Inacc [22]: unlike this simple subtraction of Inacc from Acc, EI does not produce the negative values previously noted to occur with II (when Inacc > Acc; Table 3), whose meaning is difficult to comprehend (and indeed may be meaningless) [26]. Part of the reason for this may be that EI is developed from "number needed to" metrics (NND*, NNM), whereas II was used as the basis for a "number needed to" metric, the "number needed to screen" [22]. EI therefore has advantages over II.
Comparing EI to Y, EI is dependent on disease prevalence (Table 4, Figure 3), since Acc and Inacc are calculated using values from both columns of the 2 × 2 contingency table, whereas Y is independent of P since Sens and Spec are strict columnar ratios (although these values may of course vary with the heterogeneity of clinical populations, or spectrum bias [34]). The maximal value of Y arbitrarily assumes disease prevalence to be 50%, which is not often the case in clinical practice. Both EI and Y treat FN and FP as equally undesirable, an assumption which is often not the case in clinical practice where FN may be considered more costly. Y can be negative (boundary values −1, +1), with negative values occurring if the test result is negatively associated with the correct diagnosis (although Y can be normalised, as the balanced accuracy). This is unlike EI (boundary values 0, ∞), which makes risk of misdiagnosis more explicit (values < 1), although this does not directly indicate whether FP or FN is the principal cause of inaccuracy and hence misdiagnosis.
Comparing EI to the diagnostic odds ratio, DOR (= (TP × TN)/(FP × FN)), both treat FN and FP as equally undesirable. DOR is independent of P, at least notionally, since it may be expressed solely in terms of Sens and Spec (= (Sens × Spec)/[(1 − Sens) × (1 − Spec)]). Both DOR and EI give optimistic results, DOR by choosing the best quality of a test and ignoring its weaknesses, particularly in populations with very high or very low risk. DOR becomes unstable and inflated as its denominator approaches zero; this is also true of EI, although because the classes from the 2 × 2 contingency table are treated additively in EI, rather than multiplicatively as in DOR, the chance of the denominator being zero is smaller. Hence, EI may have advantages over DOR.
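The additive versus multiplicative treatment of the four classes can be illustrated with hypothetical counts: as FN shrinks towards zero, DOR inflates far faster than EI.

```python
def dor(tp, fp, fn, tn):
    """Diagnostic odds ratio: multiplicative treatment of the four classes."""
    return (tp * tn) / (fp * fn)

def ei(tp, fp, fn, tn):
    """Efficiency index: additive treatment of the same classes."""
    return (tp + tn) / (fp + fn)

# Hypothetical table with a near-zero FN count (TP=90, FP=10, FN=1, TN=99):
# DOR = 8910/10 = 891, whereas EI = 189/11, a far more moderate value.
dor_value = dor(90, 10, 1, 99)
ei_value = ei(90, 10, 1, 99)
```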
EI may also be compared to other unitary measures, including the critical success index, F measure, area under the receiver operating characteristic curve (AUC ROC), and Matthews' correlation coefficient (MCC) [4,33,35]. Critical success index and F measure ignore TN values, unlike EI. AUC ROC combines test accuracy over a range of thresholds which may be both clinically relevant and clinically nonsensical, and hence gives a very optimistic measure of test accuracy. MCC takes into account the size of all four classes in the 2 × 2 contingency table and is widely regarded as being a very informative score for establishing the quality of a binary classifier, but the calculation (geometric mean of Y and the predictive summary index) is less straightforward than for EI, and values can be negative (boundary values −1, +1). Hence, EI may have advantages over these measures, none of which readily conveys risk of misdiagnosis.
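The stated relation between MCC and the geometric mean of Y and the predictive summary index (PSI) can be checked numerically; the counts are hypothetical, and the sign is taken from Y, since the geometric-mean form yields the magnitude only.

```python
import math

def mcc_direct(tp, fp, fn, tn):
    """Matthews correlation coefficient from the four cell counts."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den

def mcc_from_y_psi(tp, fp, fn, tn):
    """MCC as the geometric mean of Y (Youden) and PSI (predictive summary
    index), with the sign taken from Y (Y and PSI share the sign of
    TP*TN - FP*FN)."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    y = sens + spec - 1
    psi = ppv + npv - 1
    return math.copysign(math.sqrt(abs(y * psi)), y)

# Hypothetical table: the two computations agree.
a = mcc_direct(80, 10, 5, 105)
b = mcc_from_y_psi(80, 10, 5, 105)
```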
For the communication of risk to patients, use of qualitative information is generally discouraged because of the potential ambiguity of such terms [1,2]. Hence, the suggested qualitative classification of EI values (Table 6, column 1) may not be of use in this situation. However, unlike other quantitative measures, EI involves no fractions, frequencies, or percentages, which may be advantageous when discussing risk with those with low numeracy skills [1,2]. EI is based on "number needed" metrics which were originally deemed more intuitive for patients as well as clinicians [7], but empirical studies have suggested that NNT is more difficult for patients to understand than AR and RR [3,36].
EI is a dimensionless number, like RR and DOR, and a value > 1 suggests diagnostic value, just as a value > 1 suggests association for RR and better-than-random classification for DOR. As a consequence of the coronavirus pandemic, there may be a general awareness of another metric which has an inflection point at 1, namely R₀, the reproduction number, used to denote the spread of infectious disease in a population, where infection is spreading if R₀ > 1 but not if R₀ < 1.
In summary, the proposed EI may prove an acceptable unitary measure of diagnostic and screening test utility for clinicians as it is easy to calculate and interpret. It may also be useful for communicating risk of diagnosis and misdiagnosis, in the specific example of dementia, to patients, but further empirical studies will be required specifically to address this question.