Evaluating Explanatory Capabilities of Machine Learning Models in Medical Diagnostics: A Human-in-the-Loop Approach
Abstract
1. Introduction
2. State of the Art
2.1. Explainability
- Local: These methods try to elucidate the rationale behind the model’s prediction for a particular instance or a group of proximate instances.
- Global: These methods try to comprehend the overall behavior of the model.
- Model-Agnostic: These methods function independently of the specific ML model employed; they can be implemented across a range of models, irrespective of their architecture or underlying algorithms. They use techniques such as perturbing input features and observing the impact on model predictions.
- Model-Specific: These methods are tailored to a particular type of ML model and exploit the model’s internal structure and characteristics to provide explanations. The complexity of these methods depends on how transparent the models are.
2.1.1. Explainability in Healthcare
2.1.2. Problems with Explainability Methods
2.1.3. Frameworks for Explainability Assessment
2.2. Application Domain: Pancreatic Cancer
2.2.1. Cancer Staging
- T: Tumor size and possible growth outside the pancreas into nearby blood vessels.
- N: Number of nearby lymph nodes to which the cancer has spread.
- M: Cancer spread to distant lymph nodes or distant organs (metastasized) such as the liver, peritoneum (the lining of the abdominal cavity), lungs, or bones.
- Stage 0: Cancer is present but it has not spread.
- Stage I (no spread or resectable): Cancer is limited to the pancreas and is up to 2 cm in size (stage IA), or is greater than 2 cm but less than 4 cm (stage IB).
- Stage II (local spread or borderline resectable): The cancer is limited to the pancreas and its size is greater than 4 cm, or there is spread locally to the nearby lymph nodes.
- Stage III (wider spread or unresectable): Cancer may have expanded to nearby blood vessels or nerves but has not metastasized to distant sites.
- Stage IV (metastatic): Cancer has spread to distant organs.
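The stage descriptions above can be read as a small decision rule. The following is a minimal sketch that encodes only the simplified criteria listed in this section. The function name and its arguments are hypothetical, Stage 0 is left out because the list gives no measurable criterion for it, a tumor of exactly 4 cm is assigned to stage IB as an assumption, and the rule is not a substitute for the full staging tables or for clinical judgment.

```python
def simplified_stage(size_cm, nodes_positive, vessels_or_nerves, distant_spread):
    """Map coarse findings to the simplified stage groups listed above.

    Illustrative only: argument names are hypothetical and the rules mirror
    the summary list in this section, not the complete staging manual.
    """
    if distant_spread:                      # Stage IV: spread to distant organs
        return "IV"
    if vessels_or_nerves:                   # Stage III: nearby vessels or nerves
        return "III"
    if nodes_positive > 0 or size_cm > 4:   # Stage II: > 4 cm or nearby nodes
        return "II"
    if size_cm > 2:                         # Stage IB: > 2 cm (4 cm boundary assumed)
        return "IB"
    return "IA"                             # Stage IA: up to 2 cm


print(simplified_stage(1.5, 0, False, False))
print(simplified_stage(3.0, 2, False, False))
```

The checks run from the most advanced stage downward, so a case that satisfies several criteria is assigned the highest applicable stage.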
2.2.2. Pancreatic Cancer Medical Guidelines
- Resectable disease. In this case, the guidelines suggest proceeding with surgery (without neoadjuvant therapy), or with an endoscopic ultrasound-guided biopsy if neoadjuvant therapy is considered and the biopsy has not previously been done; stenting is considered if clinically indicated.
- Borderline resectable disease. Here, an endoscopic ultrasound-guided biopsy is preferred (if not previously done), and a staging laparoscopy and baseline CA19-9 (an important biomarker for pancreatic cancer) are considered.
- Locally advanced disease. In this case, a biopsy should be performed if not previously done, which may result in (a) cancer not confirmed, (b) adenocarcinoma confirmed, or (c) other cancer confirmed.
- Metastatic disease. According to the guidelines, the following measures should be taken: (a) In cases of jaundice, a self-expanding metal stent should be placed. (b) Genetic testing should be performed for any inherited mutations, if not previously conducted. (c) Molecular profiling of tumor tissue should be conducted, if not previously performed.
Algorithm 1 Clinical presentation and workup (PANC-1) [49]
- Clinical suspicion of pancreatic cancer or evidence of dilated pancreatic and/or bile duct (indicators: age, gender)
- Pancreatic protocol CT (abdomen)
- Multidisciplinary consultation
- if No metastatic disease then
  - Medical tests:
    * Chest and pelvic CT
    * Consider endoscopic ultrasonography (EUS)
    * Consider MRI as clinically indicated for indeterminate liver lesions
    * Consider PET/CT in high-risk patients
    * Consider endoscopic retrograde cholangiopancreatography (ERCP) with stent placement
    * Liver function test and baseline CA19-9 after adequate biliary drainage
    * Genetic testing for inherited mutations if diagnosis confirmed
  - Possible outcomes:
    * Refer to high-volume center for evaluation
    * Resectable Disease, Treatment (indicators: primary diagnosis, site of resection or biopsy)
    * Borderline Resectable Disease, No Metastases (indicators: primary diagnosis, site of resection or biopsy)
    * Locally Advanced Disease (indicators: primary diagnosis, site of resection or biopsy)
    * Metastatic Disease, First-Line Therapy, and Maintenance Therapy
- else if Metastatic disease then
  - Metastatic Disease
  - Biopsy confirmation, from a metastatic site preferred
  - Medical test:
    * Genetic testing for inherited mutations
    * Molecular profiling of tumor tissue is recommended
    * Complete staging with chest and pelvic CT
  - Outcomes: (indicators: primary diagnosis, site of resection or biopsy)
    * Metastatic Disease: First-Line and Maintenance Therapy
- end if
3. Data and Methodology
3.1. Data Set
3.2. Feature Selection
3.2.1. Recommended Set
3.2.2. Maximum Set
3.2.3. Minimum Set
3.2.4. Features and Guidelines
3.3. Machine Learning Models
- Bagging, also referred to as bootstrap aggregation, is an ensemble learning method frequently used to reduce variance on noisy data sets: several models are trained on bootstrap samples of the data and their predictions are combined through an aggregation technique such as voting or averaging.
- Boosting is an ensemble technique that improves the performance of a classifier through iterative refinement. Several classifiers are trained in sequence, each one concentrating on the errors of its predecessors, and their outputs are then combined into a composite prediction.
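The bagging idea in the first item can be illustrated with a minimal sketch. It assumes a one-dimensional toy data set and uses threshold classifiers ("stumps") as base models; the data, function names, and number of models are hypothetical and are not taken from the study.

```python
import random
from collections import Counter


def fit_stump(points):
    """Fit a 1-D threshold classifier that predicts 1 when x > threshold."""
    best_threshold, best_errors = None, None
    for threshold, _ in points:
        errors = sum((x > threshold) != bool(y) for x, y in points)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def bagging_fit(points, n_models=25, seed=0):
    """Train one stump per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(points) for _ in points]) for _ in range(n_models)]


def bagging_predict(thresholds, x):
    """Aggregate the individual stump predictions by majority vote."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]


# Hypothetical (value, label) pairs.
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0), (3.0, 1), (3.5, 1), (4.0, 1), (4.5, 1)]
ensemble = bagging_fit(data)
print(bagging_predict(ensemble, 1.0), bagging_predict(ensemble, 4.0))
```

Each stump sees a slightly different resample, so individual thresholds vary; the vote smooths out that variability, which is the variance reduction bagging aims for.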
3.3.1. Decision Trees
3.3.2. Random Forests
3.3.3. XGBoost
3.4. Explainability Models
3.4.1. Model-Specific Methods
- Split-improvement scores that are specific to tree-based methods. These scores naturally aggregate the improvement associated with each node split. They can be readily recorded during the tree-building process.
- Permutation methods that involve quantifying the change in value or accuracy when the values of one feature are replaced with irrelevant noise, typically produced by a permutation.
Mean Decrease in Impurity (MDI)
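As a rough illustration of the split-improvement scores described above, the sketch below computes the weighted Gini impurity decrease of a single split. MDI for a feature is then obtained by summing these decreases over every node that splits on that feature (and averaging over trees in an ensemble). The use of Gini impurity and the toy labels are assumptions made for the example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right, n_total):
    """Weighted impurity decrease credited to the feature used for one split."""
    n = len(parent)
    children = (len(left) * gini(left) + len(right) * gini(right)) / n
    # The decrease is scaled by the share of all samples that reach this node.
    return (n / n_total) * (gini(parent) - children)


# Hypothetical root split of 8 samples into two pure children.
parent = ["I", "I", "I", "I", "II", "II", "II", "II"]
left, right = parent[:4], parent[4:]
print(impurity_decrease(parent, left, right, n_total=len(parent)))
```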
Mean Decrease Accuracy (MDA)
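The permutation approach described above can be sketched in a few lines. The example assumes a hypothetical toy model and data set, and measures the mean drop in accuracy after shuffling one feature column; it illustrates the idea and is not the implementation used in the study.

```python
import random


def accuracy(model, rows, labels):
    """Share of rows for which the model output matches the label."""
    return sum(model(r) == y for r, y in zip(rows, labels)) / len(rows)


def permutation_importance(model, rows, labels, feature, n_repeats=20, seed=0):
    """Mean drop in accuracy after shuffling the values of one feature."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(n_repeats):
        column = [r[feature] for r in rows]
        rng.shuffle(column)
        # Rebuild the rows with only the chosen feature permuted.
        shuffled = [dict(r, **{feature: v}) for r, v in zip(rows, column)]
        drops.append(baseline - accuracy(model, shuffled, labels))
    return sum(drops) / n_repeats


# Hypothetical toy model: it predicts from "size" only and ignores "noise".
def model(row):
    return int(row["size"] > 2.0)


rows = [{"size": s, "noise": k % 3} for k, s in enumerate([0.5, 1.0, 1.5, 3.0, 3.5, 4.0])]
labels = [model(r) for r in rows]
print(permutation_importance(model, rows, labels, "size"))
print(permutation_importance(model, rows, labels, "noise"))
```

Shuffling a feature the model ignores leaves the accuracy unchanged, so its importance is zero, whereas shuffling the feature the model relies on lowers the accuracy.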
3.4.2. Model-Agnostic Methods
SHapley Additive exPlanations (SHAP)
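SHAP attributions are based on Shapley values from cooperative game theory. The sketch below computes exact Shapley values by brute-force enumeration of feature coalitions, which is only practical for a handful of features; the SHAP library relies on faster approximations and model-specific algorithms. The linear model, instance, and baseline are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(value, n_features):
    """Exact Shapley values for a coalition value function value(frozenset)."""
    players = range(n_features)
    phi = [0.0] * n_features
    for i in players:
        others = [j for j in players if j != i]
        for size in range(len(others) + 1):
            # Weight of coalitions of this size in the Shapley average.
            weight = factorial(size) * factorial(n_features - size - 1) / factorial(n_features)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical linear model f(x) = 2*x0 + 1*x1 + 0*x2, explained against a baseline.
weights, x, baseline = [2.0, 1.0, 0.0], [3.0, 5.0, 7.0], [1.0, 1.0, 1.0]


def value(coalition):
    # Features in the coalition take the instance value; the rest stay at the baseline.
    return sum(w * (x[j] if j in coalition else baseline[j]) for j, w in enumerate(weights))


phi = shapley_values(value, 3)
print(phi)
```

For this linear example each attribution equals the weight times the distance from the baseline (approximately 4, 4, and 0), and the attributions sum to the difference between the prediction for the instance and the prediction for the baseline.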
Local Interpretable Model-Agnostic Explanations (LIME)
4. Results
4.1. Accuracy of the Resulting Models
- Accuracy is the proportion of all predictions that a classification machine learning (ML) model gets right.
- Precision is the proportion of instances predicted as the target class that actually belong to it.
- Recall is the proportion of instances of the target class that the ML model correctly identifies.
- F1-score is the harmonic mean of precision and recall, which balances the two and makes it a more comprehensive metric for evaluating classification models.
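For a multi-class problem these metrics have to be averaged over classes. The sketch below assumes support-weighted averaging, which the text does not state explicitly; under that assumption weighted recall always equals accuracy, a pattern that would be consistent with the results table, where the two columns coincide. The labels are hypothetical.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Accuracy plus support-weighted precision, recall and F1-score."""
    n = len(y_true)
    support = Counter(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n
    precision = recall = f1 = 0.0
    for cls, count in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        predicted = sum(p == cls for p in y_pred)
        p_cls = tp / predicted if predicted else 0.0
        r_cls = tp / count
        f_cls = 2 * p_cls * r_cls / (p_cls + r_cls) if p_cls + r_cls else 0.0
        weight = count / n  # each class contributes in proportion to its support
        precision += weight * p_cls
        recall += weight * r_cls
        f1 += weight * f_cls
    return accuracy, precision, recall, f1


# Hypothetical stage labels and predictions.
y_true = ["I", "I", "II", "II", "II", "IV"]
y_pred = ["I", "II", "II", "II", "II", "IV"]
print(weighted_metrics(y_true, y_pred))
```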
- Maximum Depth (md): This parameter defines the maximum number of levels allowed in a tree. It controls the complexity of the model by limiting how many times the data can be split. For Decision Trees, a lower depth (e.g., md = 2 or 3) was used to keep the models transparent and easy to interpret. For Random Forests and XGBoost, the depth was varied between 2 and 4 to find a balance between capturing complex relationships and avoiding overfitting.
- Minimum Samples per Leaf (msl): This parameter sets the minimum number of data points (samples) that must exist in a leaf node (a terminal node at the end of a branch) for a split to be valid. It acts as a regularization tool to prevent the tree from creating branches that account for only a few specific, possibly outlier, cases. In Decision Trees, this was set as high as 30 for the “Recommended” set to simplify the tree structure. In Random Forests, it was consistently set to 5 to maintain a level of detail across the ensemble.
- Number of Estimators (ne): This parameter applies to ensemble models such as XGBoost. It specifies the number of individual trees (weak learners) built and combined in the final model. Generally, more estimators can improve model accuracy by allowing the system to correct more errors from previous trees, but beyond a certain point additional trees provide diminishing returns while increasing computational cost and the risk of overfitting. The XGBoost models therefore used either 40 or 50 estimators, depending on the feature set, because a low value is most appropriate for small data sets such as ours.
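The abbreviations above map naturally onto the keyword arguments that common tree libraries expose. The sketch below assumes the scikit-learn and XGBoost parameter names `max_depth`, `min_samples_leaf`, and `n_estimators`; the source does not say which libraries were used, and the helper function is hypothetical.

```python
# Assumed mapping from the abbreviations used in this section to the keyword
# names found in scikit-learn and XGBoost estimators.
ABBREVIATIONS = {"md": "max_depth", "msl": "min_samples_leaf", "ne": "n_estimators"}


def parse_parameters(text):
    """Turn a string such as 'md = 3, msl = 5' into keyword arguments."""
    kwargs = {}
    for item in text.split(","):
        name, value = item.split("=")
        kwargs[ABBREVIATIONS[name.strip()]] = int(value.strip())
    return kwargs


print(parse_parameters("md = 3, msl = 5"))
print(parse_parameters("md = 2, ne = 50"))
```

If scikit-learn were the library in use, the resulting dictionary could be unpacked directly, for example `DecisionTreeClassifier(**parse_parameters("md = 3, msl = 5"))`.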
4.2. Interpretability of the Resulting Models
4.2.1. Extracting Feature Importance from XAI Methods
- num_features = 10. Caps the number of features shown in each explanation; low values keep the explanation sparse and interpretable for humans.
- discretize_continuous = True. Recommended for tabular data; it makes the results much easier to read as simple if–then rules.
- top_labels = 5. Sets how many classes to explain; in this case, the five highest-scored labels.
- kernel_width = 0.75 (default for tabular). Smaller values are recommended for very local behavior (more unstable); larger values for more global behavior (smoother results).
- n_permutations = 5000 (default). Should be increased whenever explanations vary significantly between runs.
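The effect of the kernel width can be seen by looking at how perturbed samples are weighted by their distance from the instance being explained. The sketch below assumes an exponential kernel of the form used in the Python LIME implementation; the distances and widths are illustrative values only.

```python
import math


def exponential_kernel(distance, kernel_width):
    """Weight given to a perturbed sample at this distance from the instance."""
    return math.sqrt(math.exp(-(distance ** 2) / kernel_width ** 2))


for width in (0.25, 0.75, 3.0):
    weights = [round(exponential_kernel(d, width), 3) for d in (0.0, 0.5, 1.0, 2.0)]
    print(width, weights)
```

With a small width the weights fall off quickly, so only samples very close to the instance influence the surrogate model (very local, but less stable); a large width keeps distant samples relevant and yields smoother, more global explanations.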
4.2.2. Jaccard Similarity Index
Similarity Index
- The “Gold Standard” is often a set, not a ranking. In information retrieval, there is a clear difference between the 1st and 10th result. However, in medical guidelines (like NCCN), clinical variables are often presented as a category of importance rather than a strict linear list.
- Handling “Zero-Importance” Features. Medical data often has many features that should have zero importance (noise). Recommendation metrics (RBO and MAP) are designed to measure how well you rank relevant items. They do not have a natural way of handling the “negative space”, that is, the features that should be ignored. Jaccard naturally penalizes “false positives” (when the model gives high importance to a variable the doctor knows is irrelevant). This is crucial for medical safety to ensure the model is not hallucinating importance.
- Complexity vs. Interpretability. Our goal is to make AI results “interpretable” for clinicians. In that sense, Jaccard is highly intuitive: “What percentage of the features we care about did the AI actually pick up?”. By using the weighted version, we keep the simplicity of the Jaccard logic but allow the specific importance values to influence the score, without needing the complex “convergence parameters” required by RBO.
- Robustness to Small Perturbations. Recent studies in XAI stability show that rank-based metrics (like Kendall’s Tau) can be hyper-sensitive to tiny changes in feature weights that do not actually change the clinical meaning of the explanation. Jaccard is more stable because it looks at the “overlap” of significant features rather than the “order” of every single minor feature.
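The plain and weighted Jaccard indices discussed above can be sketched as follows. The weighted form shown is the usual sum-of-minima over sum-of-maxima definition; the feature weights are hypothetical, with higher values meaning more important, and how the study converts its importance tiers into weights is not specified here.

```python
def jaccard(a, b):
    """Plain Jaccard index between two sets of selected features."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def weighted_jaccard(a, b):
    """Weighted Jaccard: sum of minima over sum of maxima of the weights."""
    keys = set(a) | set(b)
    numerator = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    denominator = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return numerator / denominator if denominator else 1.0


# Hypothetical importance weights (higher means more important).
method = {"Stage": 3, "Age": 2, "N": 2, "Gender": 1}
guideline = {"Stage": 3, "Age": 1, "N": 2, "T": 3}
print(jaccard(method, guideline))
print(weighted_jaccard(method, guideline))
```

A feature that only one side considers important (here "Gender" for the method and "T" for the guideline) adds to the denominator but not to the numerator, which is how the index penalizes false positives and omissions.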
4.2.3. Feature Importance Using the Minimum Feature Set
4.2.4. Feature Importance Using the Recommended Feature Set
4.2.5. Feature Importance Using the Maximum Feature Set
5. Discussion
5.1. Performance
5.2. Human-in-the-Loop
5.3. Explainability
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| AI | Artificial Intelligence |
| XAI | Explainable AI |
| ML | Machine Learning |
| HITL | Human-in-the-Loop |
| DT | Decision Tree |
| RF | Random Forest |
| XGBoost | eXtreme Gradient Boosting |
| MDI | Mean Decrease in Impurity |
| MDA | Mean Decrease Accuracy |
| SHAP | SHapley Additive exPlanations |
| LIME | Local Interpretable Model-Agnostic Explanations |
References
- Linardatos, P.; Papastefanopoulos, V.; Kotsiantis, S. Explainable AI: A Review of Machine Learning Interpretability Methods. Entropy 2021, 23, 18.
- Angelov, P.P.; Soares, E.A.; Jiang, R.; Arnold, N.I.; Atkinson, P.M. Explainable artificial intelligence: An analytical review. Wiley Interdiscip. Rev. Data Min. Knowl. Discov. 2021, 11, e1424.
- Barredo Arrieta, A.; Díaz-Rodríguez, N.; Del Ser, J.; Bennetot, A.; Tabik, S.; Barbado, A.; Garcia, S.; Gil-Lopez, S.; Molina, D.; Benjamins, R.; et al. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 2020, 58, 82–115.
- Montavon, G.; Samek, W.; Müller, K.R. Methods for interpreting and understanding deep neural networks. Digit. Signal Process. 2018, 73, 1–15.
- Slack, D.; Hilgard, A.; Singh, S.; Lakkaraju, H. Reliable Post hoc Explanations: Modeling Uncertainty in Explainability. Adv. Neural Inf. Process. Syst. 2021, 34, 9391–9404.
- Kotsiantis, S.B. Decision trees: A recent overview. Artif. Intell. Rev. 2013, 39, 261–283.
- Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
- Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; KDD ’16. pp. 785–794.
- Mosqueira-Rey, E.; Hernández-Pereira, E.; Alonso-Ríos, D.; Bobes-Bascarán, J.; Fernández-Leal, A. Human-in-the-loop Machine Learning: A State of the Art. Artif. Intell. Rev. 2023, 56, 3005–3054.
- Gilpin, L.H.; Bau, D.; Yuan, B.Z.; Bajwa, A.; Specter, M.; Kagal, L. Explaining Explanations: An Overview of Interpretability of Machine Learning. In Proceedings of the 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), Turin, Italy, 1–3 October 2018; pp. 80–89.
- Lipton, Z.C. The mythos of model interpretability. Commun. ACM 2018, 61, 36–43.
- Carrillo, A.; Cantú, L.F.; Noriega, A. Individual Explanations in Machine Learning Models: A Survey for Practitioners. arXiv 2021, arXiv:2104.04144.
- Goodman, B.; Flaxman, S. European Union regulations on algorithmic decision-making and a “right to explanation”. AI Mag. 2017, 38, 50–57.
- Adadi, A.; Berrada, M. Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI). IEEE Access 2018, 6, 52138–52160.
- Holzinger, A.; Langs, G.; Denk, H.; Zatloukal, K.; Müller, H. Causability and explainability of artificial intelligence in medicine. WIREs Data Min. Knowl. Discov. 2019, 9, e1312.
- Holzinger, A.; Biemann, C.; Pattichis, C.S.; Kell, D.B. What do we need to build explainable AI systems for the medical domain? arXiv 2017, arXiv:1712.09923.
- Loh, H.W.; Ooi, C.P.; Seoni, S.; Barua, P.D.; Molinari, F.; Acharya, U.R. Application of explainable artificial intelligence for healthcare: A systematic review of the last decade (2011–2022). Comput. Methods Programs Biomed. 2022, 226, 107161.
- Yan, K.; Fong, S.; Li, T.; Song, Q. Multimodal Machine Learning for Prognosis and Survival Prediction in Renal Cell Carcinoma Patients: A Two-Stage Framework with Model Fusion and Interpretability Analysis. Appl. Sci. 2024, 14, 5686.
- Nair, P.C.; Gupta, D.; Devi, B.I.; Kanjirangat, V. Building an Explainable Diagnostic Classification Model for Brain Tumor using Discharge Summaries. Procedia Comput. Sci. 2023, 218, 2058–2070.
- Ganeshkumar, M.; Ravi, V.; Sowmya, V.; Gopalakrishnan, E.; Soman, K. Explainable deep learning-based approach for multilabel classification of electrocardiogram. IEEE Trans. Eng. Manag. 2021, 70, 2787–2799.
- Moncada-Torres, A.; van Maaren, M.C.; Hendriks, M.P.; Siesling, S.; Geleijnse, G. Explainable machine learning can outperform Cox regression predictions and provide insights in breast cancer survival. Sci. Rep. 2021, 11, 6968.
- Gulum, M.A.; Trombley, C.M.; Kantardzic, M. A Review of Explainable Deep Learning Cancer Detection Models in Medical Imaging. Appl. Sci. 2021, 11, 4573.
- Hauser, K.; Kurz, A.; Haggenmüller, S.; Maron, R.C.; von Kalle, C.; Utikal, J.S.; Meier, F.; Hobelsberger, S.; Gellrich, F.F.; Sergon, M.; et al. Explainable artificial intelligence in skin cancer recognition: A systematic review. Eur. J. Cancer 2022, 167, 54–69.
- Zhang, Y.; Song, K.; Sun, Y.; Tan, S.; Udell, M. Why Should You Trust My Explanation? Understanding Uncertainty in LIME Explanations. arXiv 2019, arXiv:1904.12991.
- Bhatt, U.; Xiang, A.; Sharma, S.; Weller, A.; Taly, A.; Jia, Y.; Ghosh, J.; Puri, R.; Moura, J.M.F.; Eckersley, P. Explainable machine learning in deployment. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, Barcelona, Spain, 27–30 January 2020; FAT* ’20. pp. 648–657.
- Dimanov, B.; Bhatt, U.; Jamnik, M.; Weller, A. You shouldn’t trust me: Learning models which conceal unfairness from multiple explanation methods. In Proceedings of the Frontiers in Artificial Intelligence and Applications: ECAI 2020, Santiago de Compostela, Spain, 29 August–8 September 2020; pp. 2473–2480.
- Ghorbani, A.; Abid, A.; Zou, J. Interpretation of Neural Networks Is Fragile. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January–1 February 2019; Volume 33, pp. 3681–3688.
- Slack, D.; Hilgard, S.; Jia, E.; Singh, S.; Lakkaraju, H. Fooling LIME and SHAP: Adversarial Attacks on Post hoc Explanation Methods. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, New York, NY, USA, 7–8 February 2020; AIES ’20. pp. 180–186.
- Dombrowski, A.; Alber, M.; Anders, C.J.; Ackermann, M.; Müller, K.; Kessel, P. Explanations can be manipulated and geometry is to blame. arXiv 2019, arXiv:1906.07983.
- Alvarez-Melis, D.; Jaakkola, T.S. On the Robustness of Interpretability Methods. arXiv 2018, arXiv:1806.08049.
- Lee, E.; Braines, D.; Stiffler, M.; Hudler, A.; Harborne, D. Developing the sensitivity of LIME for better machine learning explanation. In Proceedings of the Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications; Pham, T., Ed.; International Society for Optics and Photonics: Baltimore, MD, USA, 2019; Volume 11006, pp. 349–356.
- Zafar, M.R.; Khan, N.M. DLIME: A Deterministic Local Interpretable Model-Agnostic Explanations Approach for Computer-Aided Diagnosis Systems. arXiv 2019, arXiv:1906.10263.
- Bang, H.; Boggust, A.; Satyanarayan, A. Explanation Alignment: Quantifying the Correctness of Model Reasoning at Scale. In Proceedings of the Computer Vision—ECCV 2024 Workshops; Del Bue, A., Canton, C., Pont-Tuset, J., Tommasi, T., Eds.; Springer: Cham, Switzerland, 2025; pp. 288–315.
- Kazmierczak, R.; Azzolin, S.; Berthier, E.; Hedström, A.; Delhomme, P.; Filliat, D.; Bousquet, N.; Frehse, G.; Mancini, M.; Caramiaux, B.; et al. Benchmarking XAI Explanations with Human-Aligned Evaluations. arXiv 2025, arXiv:2411.02470.
- Awal, M.A.; Roy, C.K. EvaluateXAI: A framework to evaluate the reliability and consistency of rule-based XAI techniques for software analytics tasks. J. Syst. Softw. 2024, 217, 112159.
- Arras, L.; Osman, A.; Samek, W. CLEVR-XAI: A benchmark dataset for the ground truth evaluation of neural network explanations. Inf. Fusion 2022, 81, 14–40.
- Bray, F.; Ferlay, J.; Soerjomataram, I.; Siegel, R.L.; Torre, L.A.; Jemal, A. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 2018, 68, 394–424.
- Tomczak, K.; Czerwińska, P.; Wiznerowicz, M. The Cancer Genome Atlas (TCGA): An immeasurable source of knowledge. Contemp. Oncol. 2015, 19, 68–77.
- McGuigan, A.; Kelly, P.; Turkington, R.C.; Jones, C.; Coleman, H.G.; McCain, R.S. Pancreatic cancer: A review of clinical diagnosis, epidemiology, treatment and outcomes. World J. Gastroenterol. 2018, 24, 4846.
- Hunter, B.; Hindocha, S.; Lee, R.W. The Role of Artificial Intelligence in Early Cancer Diagnosis. Cancers 2022, 14, 1524.
- Kenner, B.; Chari, S.T.; Kelsen, D.; Klimstra, D.S.; Pandol, S.J.; Rosenthal, M.; Rustgi, A.K.; Taylor, J.A.; Yala, A.; Abul-Husn, N.; et al. Artificial intelligence and early detection of pancreatic cancer: 2020 summative review. Pancreas 2021, 50, 251.
- Dmitriev, K.; Marino, J.; Baker, K.; Kaufman, A.E. Visual Analytics of a Computer-Aided Diagnosis System for Pancreatic Lesions. IEEE Trans. Vis. Comput. Graph. 2021, 27, 2174–2185.
- Bakasa, W.; Viriri, S. Pancreatic cancer survival prediction: A survey of the state-of-the-art. Comput. Math. Methods Med. 2021, 2021, 1188414.
- Walczak, S.; Velanovich, V. An evaluation of artificial neural networks in predicting pancreatic cancer survival. J. Gastrointest. Surg. 2017, 21, 1606–1612.
- Hayashi, H.; Uemura, N.; Matsumura, K.; Zhao, L.; Sato, H.; Shiraishi, Y.; Yamashita, Y.I.; Baba, H. Recent advances in artificial intelligence for pancreatic ductal adenocarcinoma. World J. Gastroenterol. 2021, 27, 7480.
- Bradley, A.; Van Der Meer, R.; McKay, C. Personalized pancreatic cancer management: A systematic review of how machine learning is supporting decision-making. Pancreas 2019, 48, 598–604.
- Amin, M.B.; Greene, F.L.; Edge, S.B.; Compton, C.C.; Gershenwald, J.E.; Brookland, R.K.; Meyer, L.; Gress, D.M.; Byrd, D.R.; Winchester, D.P. The eighth edition AJCC cancer staging manual: Continuing to build a bridge from a population-based to a more “personalized” approach to cancer staging. CA Cancer J. Clin. 2017, 67, 93–99.
- Cong, L.; Liu, Q.; Zhang, R.; Cui, M.; Zhang, X.; Gao, X.; Guo, J.; Dai, M.; Zhang, T.; Liao, Q.; et al. Tumor size classification of the 8th edition of TNM staging system is superior to that of the 7th edition in predicting the survival outcome of pancreatic cancer patients after radical resection and adjuvant chemotherapy. Sci. Rep. 2018, 8, 10383.
- NCCN. Pancreatic Adenocarcinoma, Version 3.2019; National Comprehensive Cancer Network: Plymouth Meeting, PA, USA, 2022.
- Cancer Genome Atlas Research Network; Weinstein, J.N.; Collisson, E.A.; Mills, G.B.; Shaw, K.R.; Ozenberger, B.A.; Ellrott, K.; Shmulevich, I.; Sander, C.; Stuart, J.M. The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet. 2013, 45, 1113–1120.
- Mosqueira-Rey, E.; Hernández-Pereira, E.; Bobes-Bascarán, J.; Alonso-Ríos, D.; Pérez-Sánchez, A.; Fernández-Leal, A.; Moret-Bonillo, V.; Vidal-Ínsua, Y.; Vázquez-Rivera, F. Addressing the data bottleneck in medical deep learning models using a human-in-the-loop machine learning approach. Neural Comput. Appl. 2024, 36, 2597–2616.
- Samaan, J.S.; Abboud, Y.; Oh, J.; Jiang, Y.; Watson, R.; Park, K.; Liu, Q.; Atkins, K.; Hendifar, A.; Gong, J.; et al. Pancreatic cancer incidence trends by race, ethnicity, age and sex in the United States: A population-based study, 2000–2018. Cancers 2023, 15, 870.
- Tempero, M.A.; Malafa, M.P.; Al-Hawary, M.; Behrman, S.W.; Benson, A.B.; Cardin, D.B.; Chiorean, E.G.; Chung, V.; Czito, B.; Chiaro, M.D.; et al. Pancreatic Adenocarcinoma, Version 2.2021, NCCN Clinical Practice Guidelines in Oncology. J. Natl. Compr. Cancer Netw. 2021, 19, 439–457.
- Elias, R.; Cockrum, P.; Surinach, A.; Wang, S.; Chul Chu, B.; Shahrokni, A. Real-world impact of age at diagnosis on treatment patterns and survival outcomes of patients with metastatic pancreatic ductal adenocarcinoma. Oncologist 2022, 27, 469–475.
- Hoffmann, A.S.; Hennigs, A.; Feisst, M.; Moderow, M.; Heublein, S.; Deutsch, T.M.; Togawa, R.; Schäfgen, B.; Wallwiener, M.; Golatta, M.; et al. Impact of age on indication for chemotherapy in early breast cancer patients: Results from 104 German institutions from 2008 to 2017. Arch. Gynecol. Obstet. 2023, 308, 219–229.
- Breiman, L.; Friedman, J.; Olshen, R.; Stone, C.J. Classification and Regression Trees; Taylor and Francis Group: New York, NY, USA, 2017.
- Bentéjac, C.; Csörgo, A.; Martínez-Muñoz, G. A comparative analysis of gradient boosting algorithms. Artif. Intell. Rev. 2021, 54, 1937–1967.
- Saarela, M.; Jauhiainen, S. Comparison of feature importance measures as explanations for classification models. SN Appl. Sci. 2021, 3, 272.
- Zhou, Z.; Hooker, G. Unbiased Measurement of Feature Importance in Tree-Based Methods. ACM Trans. Knowl. Discov. Data 2021, 15, 1–21.
- Lundberg, S.M.; Lee, S.I. A Unified Approach to Interpreting Model Predictions. Adv. Neural Inf. Process. Syst. 2017, 30, 4765–4774.
- Ribeiro, M.T.; Singh, S.; Guestrin, C. Why Should I Trust You?: Explaining the Predictions of Any Classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; KDD ’16. pp. 1135–1144.
- Haeupler, B.; Manasse, M.; Talwar, K. Consistent Weighted Sampling Made Fast, Small, and Easy. arXiv 2014, arXiv:1410.4266.
- Spearman, C. The Proof and Measurement of Association between Two Things. Am. J. Psychol. 1904, 15, 72–101.
- Kendall, M.G. A New Measure of Rank Correlation. Biometrika 1938, 30, 81–93.
- Webber, W.; Moffat, A.; Zobel, J. A Similarity Measure for Indefinite Rankings. ACM Trans. Inf. Syst. 2010, 28, 1–38.
- Manning, C.D.; Raghavan, P.; Schütze, H. Introduction to Information Retrieval; Cambridge University Press: Cambridge, UK, 2008.
- Guillot Suarez, C. Human-in-the-Loop Hyperparameter Tuning of Deep Nets to Improve Explainability of Classifications. Master’s Thesis, Aalto University, Espoo, Finland, 2022.
- Mosqueira-Rey, E.; Hernández-Pereira, E.; Alonso-Ríos, D.; Bobes-Bascarán, J. A Classification and Review of Tools for Developing and Interacting with Machine Learning Systems. In Proceedings of the 37th ACM/SIGAPP Symposium on Applied Computing, New York, NY, USA, 25–29 April 2022; SAC ’22. pp. 1092–1101.
- Anderson, M.R.; Antenucci, D.; Cafarella, M.J. Runtime Support for Human-in-the-Loop Feature Engineering System. IEEE Data Eng. Bull. 2016, 39, 62–84.
- Gkorou, D.; Larranaga, M.; Ypma, A.; Hasibi, F.; van Wijk, R.J. Get a human-in-the-loop: Feature engineering via interactive visualizations. In Proceedings of the Workshop on Interactive Adaptive Learning Co-Located with European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2020), Ghent, Belgium, 14–18 September 2020; Volume 2660.
- Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 2019, 1, 206–215.







| Feature | Experts | Description |
|---|---|---|
| Age | 2 | The patient’s age (in years). |
| Adenocarcinoma invasion | 1 | Confirmation that the pancreas tumor sample being submitted to TCGA is an invasive adenocarcinoma. |
| Histological type | 1 | Histologic subtype, if available, for the pancreas adenocarcinoma tumor sample being submitted to TCGA. |
| Initial diagnosis method | 2 | Initial pathologic diagnosis method. |
| Lymph nodes positive HE | 1 | Number of lymph nodes positive by Hematoxylin and Eosin (HE) stain. |
| Maximum tumor dimension | 1 | Length of the largest dimension/diameter of the original tumor as stated on the pathology report. |
| Neoplasm cancer status | 2 | State or condition of an individual’s neoplasm at a particular point in time. |
| Neoplasm histologic grade | 1 | Description of a tumor based on how abnormal the cancer cells and tissue look under a microscope and how quickly the cancer cells are likely to grow and spread. |
| Pathologic stage | 1 | Extent of the cancer, especially whether the disease has spread from the original site to other parts of the body. |
| Pathologic N | 2 | Codes to represent the stage of cancer based on the nodes present. |
| Pathologic M | 1 | Code to represent the defined absence or presence of distant spread or metastases to locations via vascular channels or lymphatics beyond the regional lymph nodes. |
| Residual tumor | 1 | Text terms to describe the status of a tissue margin following surgical resection. |
| Surgery type performed | 2 | Type of surgical procedure performed. |
| Year of initial diagnosis | 1 | Year of initial diagnosis. |
| Feature | Experts | Description |
|---|---|---|
| Alcoholic exposure category | | Indicate the patient’s current level of exposure to alcohol. |
| Days to new tumor after treatment | 1 | Days to new tumor after initial treatment. |
| Family history of cancer | 3 | Indicate if a first-degree relative (parents, siblings, or children) of the patient has a history of a cancer diagnosis. |
| Ethnicity | | An individual’s self-described social and cultural grouping, specifically whether an individual describes themselves as Hispanic or Latino. |
| Gender | | Text designations that identify gender. Gender is described as the assemblage of properties that distinguish people on the basis of their societal roles. |
| History of diabetes | 3 | Indicate if the patient has been previously diagnosed with diabetes. |
| New tumor events | 1 | Indicate whether a new tumor event occurs. |
| Other DX | | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness. |
| Pathologic T | 3 | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T). |
| Primary therapy outcome success | 1 | Indicates a complete remission or response to the prescribed treatment. |
| Race | | An arbitrary classification of a taxonomic group that is a division of a species. It usually arises as a consequence of geographical isolation within a species and is characterized by shared heredity, physical attributes and behavior, and in the case of humans, by common history, nationality, or geographic distribution. |
| Radiation therapy | 2 | Whether the treatment includes a radiation therapy. |
| Smoking history | | Category describing current smoking status and smoking history as self-reported by a patient. |
| Age | T | N | M | Stage |
|---|---|---|---|---|
| 35–88 | TX | N0 | M0 | 0 |
| | T1 | N1 | M1 | I, IA, IB |
| | T2 | N1b | MX | II, IIA, IIB |
| | T3 | NX | | III |
| | T4 | | | IV |
| Abbr. | Feature | Guidelines | Experts |
|---|---|---|---|
| Age | Age | 3 | 2 |
| Stage | Stage | 1 | 1 |
| T | T | 2 | 3 |
| N | N | 2 | 2 |
| M | M | 2 | 1 |
| Year | Year of initial diagnosis | 3 | 1 |
| Adeno. | Adenocarcinoma invasion | 2 | 1 |
| Type | Histological type | 2 | 1 |
| Status | Neoplasm cancer status | 1 | 2 |
| Grade | Neoplasm histologic grade | 3 | 1 |
| Dimensi. | Maximum tumor dimension | 1 | 1 |
| Residual | Residual tumor | 3 | 1 |
| Diagnos. | Initial diagnosis method | 2 | 2 |
| Surgery | Surgery type performed | 2 | 2 |
| Lymph | Lymph nodes positive by HE | 1 | 1 |
| Gender | Gender | | |
| Race | Race | | |
| Ethnic. | Ethnicity | | |
| Other | Other DX | | |
| Diabetes | History of diabetes | 3 | 3 |
| Family | Family history of cancer | 2 | 3 |
| Radiat. | Radiation therapy | 2 | 2 |
| Therapy | Therapy outcome success | 2 | 1 |
| N. tumor | New tumor events | 1 | |
| Days to | Days to new tumor | 1 | |
| Tobacco | Tobacco smoking history | | |
| Alcohol | Alcoholic exposure category | | |
| Model | Parameters | Feature Set | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|---|
| DT | md = 3, msl = 5 | Minimum | 0.66 | 0.72 | 0.66 | 0.63 |
| DT | md = 2, msl = 30 | Recommended | 0.57 | 0.58 | 0.57 | 0.55 |
| DT | md = 3, msl = 20 | Maximum | 0.62 | 0.63 | 0.62 | 0.62 |
| RF | md = 2, msl = 5 | Minimum | 0.54 | 0.76 | 0.54 | 0.43 |
| RF | md = 3, msl = 5 | Recommended | 0.51 | 0.76 | 0.51 | 0.38 |
| RF | md = 4, msl = 5 | Maximum | 0.51 | 0.76 | 0.51 | 0.38 |
| XGB | md = 2, ne = 50 | Minimum | 0.66 | 0.70 | 0.66 | 0.64 |
| XGB | md = 4, ne = 40 | Recommended | 0.54 | 0.53 | 0.54 | 0.52 |
| XGB | md = 3, ne = 40 | Maximum | 0.59 | 0.62 | 0.59 | 0.55 |
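
In every row of the table above, the Recall value equals the Accuracy value. That is exactly what support-weighted averaging of per-class recall produces, so the Precision, Recall and F1-Score columns are plausibly support-weighted averages. The averaging scheme is not stated in this part of the text, so this reading is an assumption. The sketch below is a minimal pure-Python illustration of the identity; the function name `weighted_metrics` and the toy labels are hypothetical and are not the study's data.

```python
from collections import Counter


def weighted_metrics(y_true, y_pred):
    """Return accuracy plus support-weighted precision, recall and F1."""
    n = len(y_true)
    support = Counter(y_true)  # number of true instances per class
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / n

    precision = recall = f1 = 0.0
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        actual = support[label]
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / actual if actual else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = actual / n  # class weight = share of true instances
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical binary labels, for illustration only.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 1, 1, 1, 1, 1, 1, 0]
scores = weighted_metrics(y_true, y_pred)
print(scores)
```

Because each class's recall is weighted by that class's share of the true labels, the weighted sum collapses to the total number of correct predictions divided by the sample size, which is the accuracy. Weighted precision and F1 have no such identity, which is why those columns differ from Accuracy in the table.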
| Feature | MDI-DT | MDA-DT | SHAP-DT | LIME-DT | MDI-RF | MDA-RF | SHAP-RF | LIME-RF | MDI-XGB | MDA-XGB | SHAP-XGB | LIME-XGB | Guid. | Exp. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age | 2 | 2 | 1 | 2 | 1 | 3 | 2 | 3 | 2 | 1 | 1 | 2 | 3 | 2 |
| Stage | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| T | 3 | 3 | 3 | 2 | 3 | 2 | 3 | 2 | 3 | |||||
| N | 2 | 3 | 2 | 2 | 1 | 2 | 3 | 2 | 2 | 3 | 2 | 2 | ||
| M | 3 | 2 | 3 | 2 | 2 | 2 | 1 | 2 | 1 | |||||
| | MDI-DT | MDA-DT | SHAP-DT | LIME-DT | Guidelines | Experts |
|---|---|---|---|---|---|---|
| MDI-DT | 1.00 | 0.71 | 0.75 | 1.00 | 0.55 | 0.64 |
| MDA-DT | 0.71 | 1.00 | 0.71 | 0.71 | 0.36 | 0.45 |
| SHAP-DT | 0.75 | 0.71 | 1.00 | 0.75 | 0.42 | 0.50 |
| LIME-DT | 1.00 | 0.71 | 0.75 | 1.00 | 0.54 | 0.63 |
| Guidelines | 0.55 | 0.36 | 0.42 | 0.54 | 1.00 | 0.75 |
| Experts | 0.64 | 0.45 | 0.50 | 0.63 | 0.75 | 1.00 |
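
A similarity matrix like the one above can be checked and queried with a few lines of code. The sketch below transcribes the six rows of this table, verifies that the matrix is symmetric with a unit diagonal, and looks up which explanation method scores highest against the experts and against the guidelines. The similarity measure itself is not defined in this part of the text, so the code only reads the reported values; the helper names are hypothetical.

```python
labels = ["MDI-DT", "MDA-DT", "SHAP-DT", "LIME-DT", "Guidelines", "Experts"]

# Values transcribed from the table above (decision tree, first feature set).
sim = [
    [1.00, 0.71, 0.75, 1.00, 0.55, 0.64],
    [0.71, 1.00, 0.71, 0.71, 0.36, 0.45],
    [0.75, 0.71, 1.00, 0.75, 0.42, 0.50],
    [1.00, 0.71, 0.75, 1.00, 0.54, 0.63],
    [0.55, 0.36, 0.42, 0.54, 1.00, 0.75],
    [0.64, 0.45, 0.50, 0.63, 0.75, 1.00],
]


def is_symmetric_with_unit_diagonal(matrix, tol=1e-9):
    """Check m[i][j] == m[j][i] and m[i][i] == 1 for a square matrix."""
    size = len(matrix)
    return all(
        abs(matrix[i][j] - matrix[j][i]) <= tol
        and (i != j or abs(matrix[i][i] - 1.0) <= tol)
        for i in range(size)
        for j in range(size)
    )


def closest_method(reference, methods):
    """Return the (method, similarity) pair with the highest score."""
    ref = labels.index(reference)
    return max(((m, sim[labels.index(m)][ref]) for m in methods),
               key=lambda pair: pair[1])


methods = ["MDI-DT", "MDA-DT", "SHAP-DT", "LIME-DT"]
symmetric = is_symmetric_with_unit_diagonal(sim)
best_vs_experts = closest_method("Experts", methods)
best_vs_guidelines = closest_method("Guidelines", methods)
print(symmetric, best_vs_experts, best_vs_guidelines)
```

The same two helpers apply unchanged to the other similarity matrices in this section once their values are transcribed.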
| | MDI-XGB | MDA-XGB | SHAP-XGB | LIME-XGB | Guidelines | Experts |
|---|---|---|---|---|---|---|
| MDI-XGB | 1.00 | 0.83 | 0.83 | 0.66 | 0.91 | 0.83 |
| MDA-XGB | 0.83 | 1.00 | 0.83 | 0.66 | 0.75 | 0.83 |
| SHAP-XGB | 0.83 | 0.83 | 1.00 | 0.66 | 0.75 | 0.69 |
| LIME-XGB | 0.66 | 0.66 | 0.66 | 1.00 | 0.59 | 0.82 |
| Guidelines | 0.91 | 0.75 | 0.75 | 0.59 | 1.00 | 0.75 |
| Experts | 0.83 | 0.83 | 0.69 | 0.82 | 0.75 | 1.00 |
| Feature | MDI-DT | MDA-DT | SHAP-DT | LIME-DT | MDI-RF | MDA-RF | SHAP-RF | LIME-RF | MDI-XGB | MDA-XGB | SHAP-XGB | LIME-XGB | Guid. | Exp. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age | 1 | 2 | 1 | 2 | 1 | 3 | 2 | 1 | 1 | 2 | 1 | 1 | 3 | 2 |
| Stage | 2 | 1 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 1 | 1 | 1 |
| N | 2 | 1 | 2 | 3 | 2 | 3 | 3 | 2 | 2 | |||||
| M | 2 | 2 | 3 | 3 | 3 | 3 | 2 | 1 | ||||||
| Year | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 3 | 1 | |||
| Adeno. | 2 | 3 | 2 | 1 | ||||||||||
| Type | 2 | 2 | 1 | 2 | 2 | 1 | 1 | 3 | 2 | 2 | 1 | |||
| Status | 2 | 2 | 1 | 1 | 3 | 1 | 3 | 1 | 2 | 2 | 2 | 1 | 2 | |
| Grade | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 3 | 2 | 3 | 1 | |||
| Dimensi. | 3 | 3 | 2 | 2 | 1 | 3 | 3 | 2 | 2 | 3 | 1 | 1 | ||
| Residual | 1 | 2 | 1 | 2 | 1 | 1 | 2 | 1 | 1 | 3 | 1 | |||
| Diagnos. | 3 | 1 | 3 | 1 | 1 | 2 | 2 | 2 | 2 | |||||
| Surgery | 3 | 3 | 1 | 2 | 3 | 3 | 3 | 2 | 2 | |||||
| Lymph | 2 | 1 | 2 | 3 | 2 | 2 | 3 | 1 | 1 | |||||
| | MDI-DT | MDA-DT | SHAP-DT | LIME-DT | Guidelines | Experts |
|---|---|---|---|---|---|---|
| MDI-DT | 1.00 | 0.53 | 0.80 | 0.50 | 0.24 | 0.31 |
| MDA-DT | 0.53 | 1.00 | 0.60 | 0.43 | 0.23 | 0.27 |
| SHAP-DT | 0.80 | 0.60 | 1.00 | 0.54 | 0.27 | 0.30 |
| LIME-DT | 0.50 | 0.43 | 0.54 | 1.00 | 0.50 | 0.58 |
| Guidelines | 0.24 | 0.23 | 0.27 | 0.50 | 1.00 | 0.71 |
| Experts | 0.31 | 0.27 | 0.30 | 0.58 | 0.71 | 1.00 |
| | MDI-XGB | MDA-XGB | SHAP-XGB | LIME-XGB | Guidelines | Experts |
|---|---|---|---|---|---|---|
| MDI-XGB | 1.00 | 0.47 | 0.68 | 0.72 | 0.63 | 0.65 |
| MDA-XGB | 0.47 | 1.00 | 0.35 | 0.41 | 0.30 | 0.37 |
| SHAP-XGB | 0.68 | 0.35 | 1.00 | 0.75 | 0.50 | 0.58 |
| LIME-XGB | 0.72 | 0.41 | 0.75 | 1.00 | 0.54 | 0.66 |
| Guidelines | 0.63 | 0.30 | 0.50 | 0.54 | 1.00 | 0.71 |
| Experts | 0.65 | 0.37 | 0.58 | 0.66 | 0.71 | 1.00 |
| Feature | MDI-DT | MDA-DT | SHAP-DT | LIME-DT | MDI-RF | MDA-RF | SHAP-RF | LIME-RF | MDI-XGB | MDA-XGB | SHAP-XGB | LIME-XGB | Guid. | Exp. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Age | 3 | 2 | 3 | 3 | 1 | 2 | 3 | 1 | 1 | 1 | 3 | 2 | ||
| Stage | 2 | 2 | 1 | 1 | 1 | 1 | 1 | 1 | 2 | 1 | 1 | 1 | 1 | 1 |
| T | 2 | 2 | 3 | 3 | 2 | 3 | ||||||||
| N | 2 | 1 | 2 | 3 | 2 | 2 | ||||||||
| M | 3 | 2 | 3 | 3 | 3 | 2 | 1 | |||||||
| Year | 1 | 1 | 1 | 2 | 1 | 3 | 1 | 2 | 2 | 1 | 1 | 3 | 1 | |
| Adeno. | 3 | 3 | 3 | 2 | 1 | |||||||||
| Type | 3 | 2 | 1 | 3 | 3 | 3 | 1 | 2 | 3 | 2 | 1 | |||
| Status | 3 | 3 | 3 | 2 | 1 | 2 | ||||||||
| Grade | 3 | 2 | 2 | 3 | 2 | 3 | 2 | 3 | 3 | 1 | ||||
| Dimensi. | 1 | 3 | 1 | 3 | 2 | 2 | 1 | 1 | ||||||
| Residual | 3 | 2 | 1 | 2 | 3 | 2 | 1 | 2 | 3 | 1 | ||||
| Diagnos. | 3 | 3 | 3 | 3 | 2 | 2 | 2 | 2 | ||||||
| Surgery | 3 | 3 | 3 | 3 | 2 | 2 | 2 | |||||||
| Lymph | 2 | 3 | 3 | 2 | 2 | 2 | 1 | 1 | ||||||
| Gender | 2 | 3 | 2 | 1 | 1 | 2 | ||||||||
| Race | 2 | 2 | 1 | 3 | 3 | 2 | ||||||||
| Ethnic. | 3 | 3 | 3 | 3 | ||||||||||
| Other | 3 | 3 | ||||||||||||
| Diabetes | 2 | 3 | 3 | 3 | 3 | |||||||||
| Family | 2 | 1 | 1 | 2 | 2 | 2 | 2 | 3 | 1 | 1 | 1 | 2 | 3 | |
| Radiat. | 2 | 3 | 3 | 3 | 2 | 3 | 2 | 2 | ||||||
| Therapy | 3 | 2 | 3 | 3 | 3 | 2 | 1 | |||||||
| N. tumor | 3 | 3 | 2 | 1 | |||||||||||
| Days to | 2 | 3 | 1 | |||||||||||
| Tobacco | 2 | 3 | 3 | 3 | 2 | 2 | 3 | |||||||
| Alcohol | 2 | 3 | 3 | 2 | 3 | |||||||||
| | MDI-DT | MDA-DT | SHAP-DT | LIME-DT | Guidelines | Experts |
|---|---|---|---|---|---|---|
| MDI-DT | 1.00 | 0.67 | 0.80 | 0.43 | 0.15 | 0.13 |
| MDA-DT | 0.67 | 1.00 | 0.55 | 0.29 | 0.10 | 0.14 |
| SHAP-DT | 0.80 | 0.55 | 1.00 | 0.56 | 0.18 | 0.15 |
| LIME-DT | 0.43 | 0.29 | 0.56 | 1.00 | 0.33 | 0.24 |
| Guidelines | 0.15 | 0.10 | 0.18 | 0.33 | 1.00 | 0.63 |
| Experts | 0.13 | 0.14 | 0.15 | 0.24 | 0.63 | 1.00 |
| | MDI-XGB | MDA-XGB | SHAP-XGB | LIME-XGB | Guidelines | Experts |
|---|---|---|---|---|---|---|
| MDI-XGB | 1.00 | 0.24 | 0.60 | 0.49 | 0.43 | 0.40 |
| MDA-XGB | 0.24 | 1.00 | 0.38 | 0.30 | 0.26 | 0.29 |
| SHAP-XGB | 0.60 | 0.38 | 1.00 | 0.62 | 0.40 | 0.40 |
| LIME-XGB | 0.49 | 0.30 | 0.62 | 1.00 | 0.39 | 0.36 |
| Guidelines | 0.43 | 0.26 | 0.40 | 0.39 | 1.00 | 0.63 |
| Experts | 0.40 | 0.29 | 0.40 | 0.36 | 0.63 | 1.00 |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Bobes-Bascarán, J.; Mosqueira-Rey, E.; Fernández-Leal, Á.; Alonso-Ríos, D.; Figueirido-Arnoso, I.; Vidal-Ínsua, Y. Evaluating Explanatory Capabilities of Machine Learning Models in Medical Diagnostics: A Human-in-the-Loop Approach. Mathematics 2026, 14, 497. https://doi.org/10.3390/math14030497

