# The Elephant in the Machine: Proposing a New Metric of Data Reliability and its Application to a Medical Case to Assess Classification Reliability


## Abstract


## 1. Introduction

- In Section 2, we discuss the concept of reliability in decision support and show, also in visual form, how this concept relates to the degree of agreement between raters, the quality of the ground truth in multi-rater settings, and the accuracy of the resulting models. To this aim, we outline the main approaches proposed in the literature to measure reliability and the main shortcomings that our proposal aims to overcome;
- In Section 3, we introduce and discuss the proposed weighted reliability score, presenting the underlying theoretical framework and its analytical derivation. Further, in Section 3.2, we discuss the main so-called paradoxes of reliability, that is, intuitive properties that a measure of ground truth reliability should satisfy but that the most commonly adopted measures violate, and we show that our metric is resistant to these paradoxes. Finally, in Section 3.3, we describe the design of a user study we performed to provide a proof of concept of our metric and to illustrate the main differences between the proposed solution and commonly adopted reliability metrics;
- Finally, in Section 6, we summarize the main contributions of this paper and describe the motivations for further research.

## 2. Background and Motivations

#### Reliability in Decision Support

- How to assess the extent to which the raters involved genuinely assert one specific rating rather than another, beyond self-assessment or naive chance models;
- How to properly take into account, in the definition of a measure of reliability, not only the mutual agreement of the involved raters, but also their competence in the specific annotation task and the confidence they attach to their ratings;
- Lastly, how to combine these two components into a single coherent measure of the reliability of a set of annotated cases.

## 3. Method

#### 3.1. Derivation of the Weighted Reliability Score

- If rater ${r}_{i}$ has “chosen” random choice, then we assume her/his label is selected according to a distribution ${\mathbf{p}}_{{r}_{i}}\left(x\right)=\langle p\left(1\right),\ldots,p\left(n\right)\rangle$ where either $p\left(k\right)=\frac{1}{n}$ (assuming a uniform distribution over the alternatives) or $p\left(k\right)=e\left(k\right)$, where $e\left(k\right)$ is an empirical prior estimate of the real prevalence of label $k$ (this can be derived from a ground truth labeling, if available, or from all the labelings given by the multiple raters);
- If rater ${r}_{i}$ has “chosen” peaked choice, then we assume her/his label is selected according to a distribution ${\mathbf{d}}_{{r}_{i}}=\langle d\left(1\right),\ldots,d\left(n\right)\rangle$ where $\exists !k.\ d\left(k\right)=1$, that is, the entire probability mass is placed on a single label.
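The two choice modes above can be sketched as a small sampling routine. This is an illustrative sketch, not the authors' implementation; the function name and the parameter `w` (the probability that the rater makes a peaked rather than random choice) are ours:

```python
import random

def draw_label(n, peaked_label, w, prior=None, rng=random):
    """Draw one rating: with probability w the rater makes a 'peaked choice'
    (all probability mass on a single label); otherwise a 'random choice',
    uniform over the n labels or following an empirical prior e(k)."""
    if rng.random() < w:
        return peaked_label                             # d(k) = 1 for exactly one k
    if prior is None:
        return rng.randrange(n)                         # p(k) = 1/n
    return rng.choices(range(n), weights=prior)[0]      # p(k) = e(k)
```

For instance, `draw_label(4, 2, 1.0)` always returns label 2 (pure peaked choice), while `draw_label(4, 2, 0.0)` returns a uniformly random label among the four alternatives.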

#### 3.2. Paradoxes of Reliability Measures

**Paradox 1.**

**Paradox 2.**

**Theorem 1.**

**Proof.**

**Definition 1.**

**Theorem 2.**

**Proof.**

#### 3.3. User Study

## 4. Results

## 5. Discussion

## 6. Conclusions

- We intend to extend the proposed model of reliability so that it can be applied also in the case of annotation tasks in which the target annotations are either numeric or ordinal values, as well as in the case of missing annotations and incomplete ratings, that is when, for a given case, one or more raters do not provide an annotation;
- We intend to further investigate the relationship between the ground truth reliability (as measured by the $\varrho $ score) and the actual model accuracy to obtain more robust and precise estimates based on computational learning theory;
- As shown in [37], the accuracy of raters is also heavily affected by the complexity and difficulty of the cases considered (or, similarly, by the proportion of really hard-to-interpret cases in the ground truth): difficulty can be another contextual (i.e., case-specific) factor reducing the probability that a specific label is correct, confidence and competence being equal. Thus, this parameter should also be collected from the team of raters involved, even if in a necessarily subjective and qualitative way, and factored into the derivation of the weighted reliability score $\varrho $ from the degree of concordance $\sigma $.
- We also intend to further enrich our $\varrho $ metric by considering not just the number of agreements and their reliability, taken individually for each case, but also whether the agreements form a scarce majority configuration (i.e., one of those majorities where a single changed rating could change the assigned label), under the assumption that these cases are intrinsically less “reliable” than the cases where disagreements do not affect the majority decision.
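As a sketch of the last point, a scarce majority can be detected by checking whether moving a single vote from the winning label to the runner-up would tie or overturn the decision. This is our own hypothetical illustration of the idea, not part of the proposed metric:

```python
from collections import Counter

def is_scarce_majority(ratings):
    """True if changing one rating (winner -> runner-up) would tie or
    overturn the majority label, i.e., the decision is unstable."""
    counts = [c for _, c in Counter(ratings).most_common()]
    top = counts[0]
    runner_up = counts[1] if len(counts) > 1 else 0
    # After the change: winner has top - 1 votes, runner-up has runner_up + 1.
    # The decision flips or ties when top - 1 <= runner_up + 1.
    return top - runner_up <= 2
```

A 2-vs-1 majority is scarce (one changed rating produces a tie), whereas a unanimous 3-vs-0 decision is not.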

## Author Contributions

## Funding

## Conflicts of Interest

## Abbreviations

| Abbreviation | Meaning |
| --- | --- |
| GDPR | General Data Protection Regulation |
| IRCCS | Istituto di Ricovero e Cura a Carattere Scientifico |
| ML | Machine Learning |

## References

1. Quekel, L.G.; Kessels, A.G.; Goei, R.; van Engelshoven, J.M. Miss rate of lung cancer on the chest radiograph in clinical practice. *Chest* **1999**, *115*, 720–724.
2. Graber, M.L. The incidence of diagnostic error in medicine. *BMJ Qual. Saf.* **2013**, *22*, ii21–ii27.
3. Jewett, M.A.; Bombardier, C.; Caron, D.; Ryan, M.R.; Gray, R.R.; Louis, E.L.S.; Witchell, S.J.; Kumra, S.; Psihramis, K.E. Potential for inter-observer and intra-observer variability in x-ray review to establish stone-free rates after lithotripsy. *J. Urol.* **1992**, *147*, 559–562.
4. Cabitza, F.; Ciucci, D.; Rasoini, R. A giant with feet of clay: On the validity of the data that feed machine learning in medicine. In *Organizing for the Digital World*; Springer: Cham, Switzerland, 2019; pp. 121–136.
5. Cabitza, F.; Campagner, A.; Ciucci, D. New Frontiers in Explainable AI: Understanding the GI to Interpret the GO. In *International Cross-Domain Conference for Machine Learning and Knowledge Extraction*; Springer: Berlin/Heidelberg, Germany, 2019; pp. 27–47.
6. Svensson, C.M.; Hübler, R.; Figge, M.T. Automated classification of circulating tumor cells and the impact of interobserver variability on classifier training and performance. *J. Immunol. Res.* **2015**, *2015*, 573165.
7. Gwet, K.L. *Handbook of Inter-Rater Reliability: The Definitive Guide to Measuring the Extent of Agreement among Raters*; Advanced Analytics, LLC: Piedmont, CA, USA, 2014.
8. Topol, E.J. High-performance medicine: The convergence of human and artificial intelligence. *Nat. Med.* **2019**, *25*, 44–56.
9. Liu, X.; Faes, L.; Kale, A.U.; Wagner, S.K.; Fu, D.J.; Bruynseels, A.; Mahendiran, T.; Moraes, G.; Shamdas, M.; Kern, C.; et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: A systematic review and meta-analysis. *Lancet Digit. Health* **2019**, *1*, e271–e297.
10. Beigman, E.; Klebanov, B.B. Learning with annotation noise. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP; Association for Computational Linguistics: Stroudsburg, PA, USA, 2009; pp. 280–287.
11. Beigman Klebanov, B.; Beigman, E. From annotator agreement to noise models. *Comput. Linguist.* **2009**, *35*, 495–503.
12. Rajkomar, A.; Dean, J.; Kohane, I. Machine learning in medicine. *N. Engl. J. Med.* **2019**, *380*, 1347–1358.
13. Heinecke, S.; Reyzin, L. Crowdsourced PAC learning under classification noise. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing; AAAI Press: Palo Alto, CA, USA, 2019; Volume 7, pp. 41–49.
14. Pinto, A.; Brunese, L. Spectrum of diagnostic errors in radiology. *World J. Radiol.* **2010**, *2*, 377.
15. Brady, A.P. Error and discrepancy in radiology: Inevitable or avoidable? *Insights Imaging* **2017**, *8*, 171–182.
16. Hripcsak, G.; Heitjan, D.F. Measuring agreement in medical informatics reliability studies. *J. Biomed. Inform.* **2002**, *35*, 99–110.
17. Hunt, R.J. Percent agreement, Pearson’s correlation, and kappa as measures of inter-examiner reliability. *J. Dent. Res.* **1986**, *65*, 128–130.
18. McHugh, M.L. Interrater reliability: The kappa statistic. *Biochem. Medica* **2012**, *22*, 276–282.
19. Fleiss, J.L. Measuring nominal scale agreement among many raters. *Psychol. Bull.* **1971**, *76*, 378.
20. Krippendorff, K. *Content Analysis: An Introduction to Its Methodology*; Sage Publications: Thousand Oaks, CA, USA, 2018.
21. Feinstein, A.R.; Cicchetti, D.V. High agreement but low kappa: I. The problems of two paradoxes. *J. Clin. Epidemiol.* **1990**, *43*, 543–549.
22. Cicchetti, D.V.; Feinstein, A.R. High agreement but low kappa: II. Resolving the paradoxes. *J. Clin. Epidemiol.* **1990**, *43*, 551–558.
23. Hayes, A.F.; Krippendorff, K. Answering the call for a standard reliability measure for coding data. *Commun. Methods Meas.* **2007**, *1*, 77–89.
24. Powers, D.M. The problem with kappa. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, Avignon, France, 23–27 April 2012; Association for Computational Linguistics: Stroudsburg, PA, USA, 2012; pp. 345–355.
25. Zhao, X.; Feng, G.C.; Liu, J.S.; Deng, K. We agreed to measure agreement—Redefining reliability de-justifies Krippendorff’s alpha. *China Media Res.* **2018**, *14*, 1.
26. Duffy, L.; Gajree, S.; Langhorne, P.; Stott, D.J.; Quinn, T.J. Reliability (inter-rater agreement) of the Barthel Index for assessment of stroke survivors: Systematic review and meta-analysis. *Stroke* **2013**, *44*, 462–468.
27. Brancati, D. *Social Scientific Research*; Sage: Thousand Oaks, CA, USA, 2018.
28. Costa Monteiro, E.; Mari, L. Preliminary notes on metrological reliability. In Proceedings of the 21st IMEKO World Congress on Measurement in Research and Industry, Prague, Czech Republic, 30 August–4 September 2015.
29. Resnik, M.D. *Choices: An Introduction to Decision Theory*; University of Minnesota Press: Minneapolis, MN, USA, 1987.
30. Rasch, G. *Probabilistic Models for Some Intelligence and Attainment Tests*; Danish Institute for Educational Research: Copenhagen, Denmark, 1980.
31. Charles Feng, G.; Zhao, X. Do not force agreement: A response to Krippendorff (2016). *Methodology* **2016**, *12*, 145–148.
32. Krippendorff, K. Commentary: A dissenting view on so-called paradoxes of reliability coefficients. *Ann. Int. Commun. Assoc.* **2013**, *36*, 481–499.
33. Krippendorff, K. Misunderstanding reliability. *Methodology* **2016**, *12*, 139–144.
34. Gwet, K.L. Computing inter-rater reliability and its variance in the presence of high agreement. *Br. J. Math. Stat. Psychol.* **2008**, *61*, 29–48.
35. Bien, N.; Rajpurkar, P.; Ball, R.L.; Irvin, J.; Park, A.; Jones, E.; Bereket, M.; Patel, B.N.; Yeom, K.W.; Shpanskaya, K.; et al. Deep-learning-assisted diagnosis for knee magnetic resonance imaging: Development and retrospective validation of MRNet. *PLoS Med.* **2018**, *15*, e1002699.
36. Landis, J.R.; Koch, G.G. The measurement of observer agreement for categorical data. *Biometrics* **1977**, *33*, 159–174.
37. Campagner, A.; Sconfienza, L.; Cabitza, F. H-accuracy, an alternative metric to assess classification models in medicine. In *Digital Personalized Health and Medicine*; Studies in Health Technology and Informatics; IOS Press: Amsterdam, The Netherlands, 2020; Volume 270.
38. Cabitza, F.; Campagner, A.; Balsano, C. Bridging the “last mile” gap between AI implementation and operation: “data awareness” that matters. *Ann. Transl. Med.* **2020**, *8*, 501.

**Figure 1.** The figure depicts (on the y-axis) the number of raters that need to be involved to obtain a 95% accurate ground truth, as a function of the average accuracy of the raters involved (on the x-axis), if known. These estimates are obtained analytically and hence have general application.
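The analytical estimate behind this kind of figure can be reproduced, assuming independent raters and a binary label, as a binomial tail probability: a simple majority of $n$ raters, each correct with probability $p$, is correct with probability $\sum_{k>n/2}\binom{n}{k}p^{k}(1-p)^{n-k}$. The following sketch (function names are ours) searches for the smallest odd $n$ reaching a target ground-truth accuracy:

```python
from math import comb

def majority_accuracy(n: int, p: float) -> float:
    """Probability that a simple majority of n independent raters,
    each correct with probability p, yields the correct binary label."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

def raters_needed(p: float, target: float = 0.95, max_n: int = 201) -> int:
    """Smallest odd number of raters whose majority vote reaches `target`."""
    for n in range(1, max_n + 1, 2):  # odd n avoids ties
        if majority_accuracy(n, p) >= target:
            return n
    raise ValueError("target not reached within max_n raters")
```

For example, with raters that are 85% accurate on average, `raters_needed(0.85)` returns 5, since three such raters reach only about 94% majority accuracy while five exceed 97%.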

**Figure 2.** The figure depicts the average, minimum, and maximum accuracy of the datasets obtainable for a given number of raters, each with an average accuracy of $85\%\pm 1\%$, by simple majority voting among the raters for each case. The three estimates of accuracy were obtained by simulation: random sampling from a simulated population of 100 individuals with the above-mentioned characteristics, and computing, respectively, the average, minimum, and maximum observed value among the extracted samples.
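A simulation in the spirit of this caption might look as follows; the population size and accuracy spread come from the caption, while the numbers of cases and samples, and all names, are our own illustrative assumptions:

```python
import random

def simulate_dataset_accuracy(n_raters, n_cases=1000, n_samples=200,
                              pop_size=100, acc=0.85, spread=0.01, seed=0):
    """Empirically estimate the (avg, min, max) accuracy of a dataset
    labeled by majority voting, sampling `n_raters` from a population of
    `pop_size` raters with accuracies uniform in [acc - spread, acc + spread]."""
    rng = random.Random(seed)
    population = [rng.uniform(acc - spread, acc + spread) for _ in range(pop_size)]
    results = []
    for _ in range(n_samples):
        team = rng.sample(population, n_raters)
        correct = 0
        for _ in range(n_cases):
            votes = sum(rng.random() < p for p in team)  # votes for the true label
            if votes > n_raters / 2:
                correct += 1
        results.append(correct / n_cases)
    return (sum(results) / len(results), min(results), max(results))
```

With three raters, the average simulated dataset accuracy lands near the analytical majority-vote value of about 94%, with the min/max spread reflecting sampling variability.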

**Figure 3.** Graphical representation of the relationship between the number of raters involved in any process of ground truthing, their accuracy, and the accuracy of the resulting dataset (by majority voting). As in the previous figures, these estimates are obtained analytically and hence have general application.

**Figure 4.** Representation of the general relationship between inter-rater agreement (measured as ${P}_{o}$), the accuracy of the ground truth, and the actual accuracy of an ML model trained on that ground truth. The estimates are obtained analytically and hence have general application. For this reason, the diagram can serve as a sort of nomogram: given the level of agreement achieved to produce a ground truth and the accuracy of an ML model trained on it, one can read off an estimate of the actual accuracy of the model.

**Figure 5.** The general procedure proposed to get an estimate of the actual reliability of a classification model.

**Figure 6.** Graphical presentation of the general idea of the weighted reliability score, $\varrho $. This reliability score is multi-dimensional as it encompasses both the confidence of the raters and their competence. In this graphical example, rater ${r}_{i}$ is much less competent than rater ${r}_{j}$, and hence, the probability that her/his rating is correct is lower.

**Figure 7.** Results in graphical form. (**a**) The raters’ performance in the ROC space: circles represent single raters; the red line represents random guessing. (**b**) Joyplot and histogram illustrating the raters’ confidence distributions.

**Figure 8.** Results in graphical format: (**a**) The distribution of the labels in terms of right/wrong annotation, with respect to the confidence levels. For each confidence level, the percentages in the left column indicate the proportion of that confidence level with respect to all the cases; the top and bottom percentages in the right column indicate the proportions of right (resp. wrong) labels in the corresponding confidence level. (**b**) Relationship between the raters’ confidence and the probability of error; the regression line suggests a mild inverse proportionality.

**Figure 9.** Representation of the relationship between the multi-rater reliability (measured as $\varrho $), the accuracy of the ground truth, and the actual accuracy of an ML model trained on that ground truth. The figure can be used as a sort of nomogram: given the level of reliability for a given ground truth and the accuracy of an ML model on such a ground truth, the diagram can be used to obtain an estimate of the actual accuracy of the model.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Cabitza, F.; Campagner, A.; Albano, D.; Aliprandi, A.; Bruno, A.; Chianca, V.; Corazza, A.; Di Pietto, F.; Gambino, A.; Gitto, S.;
et al. The Elephant in the Machine: Proposing a New Metric of Data Reliability and its Application to a Medical Case to Assess Classification Reliability. *Appl. Sci.* **2020**, *10*, 4014.
https://doi.org/10.3390/app10114014
