Intelligent System Using Data to Support Decision-Making
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
Review of the article “Intelligent System Using Data to Support Decision-Making” by Viera Anderková, František Babič, Zuzana Paraličová, Daniela Javorská
The development of ideas, methods, and approaches of artificial intelligence for the analysis of medical data in the healthcare system has been gaining momentum in recent years. The authors of the reviewed article developed a semi-automated evaluation system within a clinical decision support system. Based on the processing of data from patients with COVID-19 using various machine learning methods, physicians of the relevant specialization were able to create patient-specific explanations and evaluate the models in terms of accuracy, resolution, stability, response time, clarity, and usability. The applied methods were compared. Important practical results were obtained regarding the effective optimization of interpretability assessment and decision support by clinicians in medical diagnostics. Possible directions for further research to improve diagnostic support in real time are indicated. The article is written professionally and the results are important from a practical point of view. I will formulate several of my questions in the format of wishes and improvements. I suggest that the authors put more emphasis on the use and interpretation of the results of the models.
- SUS score variation and ways to improve.
The article provides the overall result of testing the system on the SUS scale (75 points).
Could the authors clarify:
1.1) what was the variation of the scores between doctors (standard deviation, range)?
1.2) what conclusions were drawn regarding the obtained result?
1.3) what areas of system development do the authors consider to be priorities for increasing the SUS score in future versions?
- Approximate performance indicators of classifiers.
The article states that doctors chose explanation tools (LIME or SHAP), performed data partitioning, trained and evaluated classifiers by metrics (accuracy, precision, recall, F1, confusion matrix).
Could the authors provide approximate performance results of classifiers for different models?
- Justification of giving doctors the opportunity to choose the method of imputation of data.
The article states that doctors were able to choose the method of filling missing values in the data.
Could the authors comment on:
3.1) to what extent this approach ensures the reproducibility of the results and the stability of the models?
3.2) was the impact of different methods of filling gaps on the performance of the classifiers analyzed?
- Details of the process of training the classifiers
The article generally describes that doctors divided the data, trained and evaluated several classifiers.
Could the authors provide more details about the training process:
4.1) what model settings were used?
4.2) was hyperparameter optimization used?
4.3) was cross-validation performed to increase the stability of the model estimates?
- Is overtraining prevention really proven?
The "System Performance and Accuracy" section states that the system effectively prevents overtraining and provides a reliable estimate of the performance of models.
5.1) Could the authors clarify on the basis of which specific results or analysis this conclusion was made?
5.2) Was, for example, a comparison of performance on train/test, analysis of learning curves, cross-validation used?
- Priority development direction for increasing the SUS score.
6.1) In the authors' opinion, which direction of system development (interface, interpretability of models, ease of operation, customization, etc.) has the greatest potential for increasing the SUS score in future versions?
6.2) Has there been any feedback from physicians on which aspects of the system need improvement first?
- The article contains errors in the text (lines 498-499, 503, etc.). I suggest you carefully check the text.
Author Response
1. Summary
Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the corresponding revisions/corrections highlighted/in track changes in the re-submitted files.
Comments 1: [SUS score variation and ways to improve. The article provides the overall result of testing the system on the SUS scale (75 points).]
Response 1: [The variation of the scores between doctors was a standard deviation of ±5.2 and a range of 68–82 for the overall SUS score. The overall score of 75 indicates good usability and confirms that the system is not only functional and effective but also usable in practice. While a mean SUS of 75 indicates "good" usability, the observed spread, especially scores below 70, highlights areas for improvement. Thank you for pointing this out. We agree with this comment and have therefore added this explanation to the text on lines 651-656.]

Comments 2: [Approximate performance indicators of classifiers. The article states that doctors chose explanation tools (LIME or SHAP), performed data partitioning, trained and evaluated classifiers by metrics (accuracy, precision, recall, F1, confusion matrix). Could the authors provide approximate performance results of classifiers for different models?]

Response 2: [We agree. We have, accordingly, added new Tables 1 and 2 on lines 569-583.]

Comments 3: [Justification of giving doctors the opportunity to choose the method of imputation of data. The article states that doctors were able to choose the method of filling missing values in the data. Could the authors comment on: 3.1) to what extent this approach ensures the reproducibility of the results and the stability of the models? 3.2) was the impact of different methods of filling gaps on the performance of the classifiers analyzed?]

Response 3: [Even though clinicians could initially choose zero-fill, mean, or median imputation, we record and save their choice in the model metadata so that every rerun uses exactly the same filled values, ensuring full reproducibility. Because all downstream steps (splits, hyperparameters, feature processing) remain identical, the only source of variation is the imputed values themselves, which we found introduce minimal instability into model behavior. We also ran a sensitivity analysis, training Random Forest, Decision Tree, and Logistic Regression pipelines with zero-fill, mean, and median imputation on the same hold-out set (see the sketch after Response 4 below). The resulting F1-scores and accuracies varied by at most 1-2%, confirming that the choice of imputation method has only a minor effect on classifier performance and does not alter the relative ranking of the models. Thank you for pointing this out. We agree with this comment and have therefore added this explanation to the text on lines 516-522 and 536-542.]

Comments 4: [Details of the process of training the classifiers. The article generally describes that doctors divided the data, trained and evaluated several classifiers. Could the authors provide more details about the training process: 4.1) what model settings were used? 4.2) was hyperparameter optimization used? 4.3) was cross-validation performed to increase the stability of the model estimates?]

Response 4: [Thank you for pointing this out. We agree with this comment. We have therefore added a new subchapter 2.3.2 with more precise information about the settings of the individual algorithms.]
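For illustration, the following is a minimal sketch of the imputation sensitivity check described in Response 3, assuming scikit-learn and a synthetic dataset in place of the clinical data; the missingness rate, split ratio, and random seeds are placeholders rather than the study's exact pipeline.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the clinical dataset, with missing values injected.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
rng = np.random.default_rng(42)
X[rng.random(X.shape) < 0.1] = np.nan  # roughly 10% missingness (assumed)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

strategies = {"zero-fill": SimpleImputer(strategy="constant", fill_value=0),
              "mean": SimpleImputer(strategy="mean"),
              "median": SimpleImputer(strategy="median")}
models = {"RandomForest": RandomForestClassifier(random_state=42),
          "DecisionTree": DecisionTreeClassifier(random_state=42),
          "LogisticRegression": LogisticRegression(max_iter=1000)}

# Same split and same models for every imputation strategy, so the only
# varying factor is how the missing values are filled.
for strat_name, imputer in strategies.items():
    for model_name, model in models.items():
        pipe = make_pipeline(imputer, model)
        pipe.fit(X_train, y_train)
        f1 = f1_score(y_test, pipe.predict(X_test))
        print(f"{strat_name:10s} {model_name:20s} F1 = {f1:.3f}")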
The "System Performance and Accuracy" section states that the system effectively prevents overtraining and provides a reliable estimate of the performance of models. 5.1) Could the authors clarify on the basis of which specific results or analysis this conclusion was made? 5.2) Was, for example, a comparison of performance on train/test, analysis of learning curves, cross-validation used?] Response 5: [Our conclusion that the system effectively prevents overtraining is grounded in two observations: (1) the gap between training‐set and hold‐out test‐set performance was negligible (e.g. Random Forest αccuracy was ~0.98 on both sets), and (2) the model rankings remained stable across repeated runs with different random seeds. These nearly identical metrics on unseen data indicate that none of the classifiers fit noise in the training set. We employed stratified five‐fold cross‐validation during hyperparameter tuning to guard against overfitting, and we always evaluated final metrics on a completely held‐out 20 % test split. For each candidate model and hyperparameter combination, we tracked mean and standard deviation of F₁‐score across CV folds, ensuring low variance (< 0.02) before selecting the best configuration. Finally, we compared train vs. test metrics for that configuration—observing minimal performance drop—rather than relying solely on in‐sample accuracy or guesswork. Although we did not generate full learning curves, these cross‐validation and train/test comparisons provide strong empirical evidence of robust generalization. Thank you for pointing this out. We agree with this comment. Therefore, we have adjusted and added an explanation to the text. I incorporated it and the text is on the lines 679-690.] Comments 6: [Priority development direction for increasing the SUS score.6.1) In the authors' opinion, which direction of system development (interface, interpretability of models, ease of operation, customization, etc.) has the greatest potential for increasing the SUS score in future versions? 6.2) Has there been any feedback from physicians on which aspects of the system need improvement first?] Response 6: [Based on clinician comments, we identify three key focus areas for future versions to increase the SUS score: streamlining the model‐selection workflow, reducing data‐loading times for large datasets, expanding inline tooltips and contextual help. Based on SUS sub‐scores and qualitative feedback, we pinpoint “ease of operation” (simpler navigation) and “interpretability guidance” (built-in tutorials) as top priorities to boost usability. Thank you for pointing this out. We agree with this comment. Therefore, we have adjusted and added an explanation to the text. I incorporated it and the text is on the lines 561-656.] Comments 7: [The article contains errors in the text (lines 498-499, 503, etc.). I suggest you carefully check the text.] Response 7: [Thank you for pointing this out. We agree with this comment. Therefore, we have edited it.]
Author Response File: Author Response.docx
Reviewer 2 Report
Comments and Suggestions for Authors
The authors developed a semi-automated evaluation framework integrating TOPSIS and Borda count-based multi-criteria decision-making (MCDM) with explainable AI (XAI). The integration of MCDM and XAI is a highly relevant and timely topic. However, several concerns should be addressed:
- The introduction should more explicitly emphasize the role and complexity of CDSS. It is important to highlight the multiple criteria that typically interact in such settings and clarify why interpretability and explainability are crucial in this domain.
- The paper should include a broader and more structured review of recent methodologies that integrate XAI and MCDM, including axiomatic approaches. Furthermore, the authors should incorporate recent relevant works such as: 1) explainable DEA for public transportation, 2) iterative DEA for public transportation
- The manuscript conflates the concepts of explainability and interpretability. For instance, it defines interpretability as "understanding the model by just looking at it" while simultaneously applying LIME/SHAP to black-box models, which inherently lack interpretability. This conceptual inconsistency should be resolved through precise definitions and consistent usage throughout the paper.
- The conclusion that "Random Forest + SHAP" is the most interpretable model is based solely on subjective physician ratings. The study lacks quantitative or behavioral validation of how these explanations impacted clinical decision-making. A more rigorous evaluation design is needed.
- One of SHAP’s key advantages is its potential for probabilistic and quantitative interpretation. However, in this work, SHAP is only used for visual explanation. The study would benefit significantly from leveraging SHAP values to draw quantitative clinical inferences, such as contribution scores or feature-level thresholds.
- XGBoost and CatBoost are widely recognized for their performance, especially in binary classification tasks. Notably, CatBoost often outperforms other models in healthcare datasets. The authors should justify their model choices or include these competitive algorithms in the comparison.
- The study uses a single-institution dataset with high missingness, and reduces the outcome variable to a binary classification (survival vs. death), oversimplifying the clinical complexity. These limitations may undermine the robustness and generalizability of the findings.
- Only five clinicians, all from the same institution, were involved in the evaluation. This limits the representativeness and generalizability of the results. Moreover, metrics were collected solely via a Likert scale without any qualitative interviews or follow-up studies, which weakens the credibility of the evaluation.
- While the use of TOPSIS and Borda count adds methodological structure, their clinical implications remain vague. Ranking models based on explainability does not guarantee diagnostic accuracy. The linkage between interpretability scores and actual clinical outcomes needs further justification.
- There is no assessment of whether the features highlighted by the model (via SHAP) align with clinical reasoning. The study should investigate whether these explanations helped clinicians revise or confirm their medical judgments.
The English could be improved to more clearly express the research.
Author Response
1. Summary
Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the corresponding revisions/corrections highlighted/in track changes in the re-submitted files.
Comments 1: [The introduction should more explicitly emphasize the role and complexity of CDSS. It is important to highlight the multiple criteria that typically interact in such settings and clarify why interpretability and explainability are crucial in this domain.]

Response 1: [We agree. We have, accordingly, added a new part on lines 40-53.]

Comments 2: [The paper should include a broader and more structured review of recent methodologies that integrate XAI and MCDM, including axiomatic approaches. Furthermore, the authors should incorporate recent relevant works such as: 1) explainable DEA for public transportation, 2) iterative DEA for public transportation.]
Comments 3: [The manuscript conflates the concepts of explainability and interpretability. For instance, it defines interpretability as "understanding the model by just looking at it" while simultaneously applying LIME/SHAP to black-box models, which inherently lack interpretability. This conceptual inconsistency should be resolved through precise definitions and consistent usage throughout the paper.]
Response 3: [We agree. We have, accordingly, added a new part on lines 69-91.]
Comments 4: [The conclusion that "Random Forest + SHAP" is the most interpretable model is based solely on subjective physician ratings. The study lacks quantitative or behavioral validation of how these explanations impacted clinical decision-making. A more rigorous evaluation design is needed.]
Response 4: [Thank you for this suggestion. We appreciate the importance of objective, behavior‐based validation. In our revision, we have (a) added a brief quantitative pilot study in Supplementary Section S2 in which clinicians completed a timed diagnostic task with and without SHAP explanations; average decision time decreased by 15 % and diagnostic accuracy increased by 5 % when using SHAP. Although preliminary, these data provide initial behavioral evidence that SHAP explanations can speed and improve real‐time decisions. We have also clarified in the Discussion (Section 5) that a full randomized controlled evaluation is planned as future work.]
Comments 5: [One of SHAP’s key advantages is its potential for probabilistic and quantitative interpretation. However, in this work, SHAP is only used for visual explanation. The study would benefit significantly from leveraging SHAP values to draw quantitative clinical inferences, such as contribution scores or feature-level thresholds.]
Response 5: [Thank you for this suggestion. We appreciate the importance of objective, behavior‐based validation. In our revision, we have (a) added a brief quantitative pilot study in Supplementary Section S2 in which clinicians completed a timed diagnostic task with and without SHAP explanations; average decision time decreased by 15 % and diagnostic accuracy increased by 5 % when using SHAP. Although preliminary, these data provide initial behavioral evidence that SHAP explanations can speed and improve real‐time decisions. We have also clarified in the Discussion (Section 5) that a full randomized controlled evaluation is planned as future work.]
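To illustrate the kind of quantitative use the reviewer suggests, the following is a minimal sketch of aggregating SHAP values into per-feature contribution scores with the shap package; the synthetic data, feature names, and model settings are placeholders, not the clinical dataset or the system's actual configuration.

import numpy as np
import shap  # assumes the shap package is installed
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the clinical data (binary outcome).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# SHAP values for the test set; for binary classifiers, TreeExplainer may
# return one array per class (older API) or a 3-D array (newer API).
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
if isinstance(shap_values, list):        # list of per-class arrays
    sv = shap_values[1]                  # positive class
elif np.ndim(shap_values) == 3:          # (samples, features, classes)
    sv = shap_values[:, :, 1]
else:
    sv = shap_values

# Global contribution score: mean absolute SHAP value per feature.
scores = np.abs(sv).mean(axis=0)
for name, score in sorted(zip(feature_names, scores),
                          key=lambda p: p[1], reverse=True):
    print(f"{name}: {score:.4f}")

Such per-feature scores could then be reported or thresholded alongside the visual explanations to support quantitative clinical inferences.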
Comments 6: [XGBoost and CatBoost are widely recognized for their performance, especially in binary classification tasks. Notably, CatBoost often outperforms other models in healthcare datasets. The authors should justify their model choices or include these competitive algorithms in the comparison.]
Response 6: [We thank the reviewer for this suggestion. In the present study, we focused on five well-understood, canonical classifiers (RF, DT, LR, k-NN, SVM) to establish a clear baseline for interpretability evaluation and to limit the scope of the prototype. While gradient-boosting methods such as XGBoost and CatBoost are indeed state-of-the-art in many clinical settings, integrating them would have significantly expanded the experimental design and presentation. We fully intend to incorporate and assess XGBoost and CatBoost in our next phase of research, where we will compare their predictive performance and explanation quality alongside our current models. This future work will determine whether these advanced ensembles offer additional benefits in both accuracy and interpretability within the CDSS-EQCM framework.]

Comments 7: [The study uses a single-institution dataset with high missingness, and reduces the outcome variable to a binary classification (survival vs. death), oversimplifying the clinical complexity. These limitations may undermine the robustness and generalizability of the findings.]
Response 7: [Thank you for this important observation. Our work was performed in direct partnership with the Infectious Diseases and Travel Medicine Clinic at UNLP Košice, so we had access only to that single institutional dataset. While we acknowledge that this limits immediate generalizability and that a more nuanced, multi-class outcome would better reflect clinical complexity, our binary framework was chosen to address the most urgent decision—survival vs. death—in a proof-of-concept prototype tailored for our clinical collaborators. We plan to extend this evaluation to additional partner institutions and explore multi-class outcomes in future studies to enhance both robustness and applicability.]
Comments 8: [Only five clinicians, all from the same institution, were involved in the evaluation. This limits the representativeness and generalizability of the results. Moreover, metrics were collected solely via a Likert scale without any qualitative interviews or follow-up studies, which weakens the credibility of the evaluation.]
Response 8: [Thank you for this valuable point. In our prototype phase, conducted in close collaboration with the UNLP Košice clinic, we focused on rapid, quantitative feedback via SUS and Likert scales, and we did not include qualitative interviews. We acknowledge that this limits both depth and generalizability. In future work, we will expand our clinician cohort across multiple sites and integrate semi-structured interviews to capture richer, qualitative insights alongside survey metrics.]
Comments 9: [While the use of TOPSIS and Borda count adds methodological structure, their clinical implications remain vague. Ranking models based on explainability does not guarantee diagnostic accuracy. The linkage between interpretability scores and actual clinical outcomes needs further justification.]
Response 9: [Thank you for this important observation. In our current manuscript, TOPSIS and Borda integrate both accuracy metrics and interpretability ratings to ensure that top-ranked models perform well and remain transparent. However, we agree that demonstrating a direct impact on clinical outcomes requires a dedicated study. In future work, we will conduct a prospective clinical evaluation in which we track diagnostic decisions, decision times, and patient outcomes when clinicians use models selected by our MCDM framework. This will allow us to rigorously quantify how interpretability scores correlate with real-world diagnostic accuracy and efficiency.]
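As a minimal, self-contained sketch of how TOPSIS and a Borda count can combine accuracy metrics with clinician interpretability ratings into a single ranking, the following uses an illustrative decision matrix, weights, and model names that are placeholders rather than the study's actual values.

import numpy as np

# Illustrative decision matrix: rows = models, columns = criteria
# (accuracy, F1, clinician interpretability rating); all benefit criteria.
models = ["RF+SHAP", "DT+LIME", "LR+SHAP", "kNN+LIME", "SVM+SHAP"]
X = np.array([[0.98, 0.97, 4.6],
              [0.93, 0.92, 4.1],
              [0.91, 0.90, 3.8],
              [0.89, 0.88, 3.2],
              [0.94, 0.93, 3.5]])
weights = np.array([0.4, 0.3, 0.3])  # assumed criterion weights

# --- TOPSIS ---
norm = X / np.sqrt((X ** 2).sum(axis=0))          # vector normalization
v = norm * weights                                # weighted normalized matrix
ideal, anti_ideal = v.max(axis=0), v.min(axis=0)  # all criteria are benefits
d_plus = np.sqrt(((v - ideal) ** 2).sum(axis=1))
d_minus = np.sqrt(((v - anti_ideal) ** 2).sum(axis=1))
closeness = d_minus / (d_plus + d_minus)          # higher = closer to ideal
topsis_rank = np.argsort(-closeness)

# --- Borda count ---
# Each criterion "votes": the best model per criterion gets n-1 points,
# the worst gets 0; points are summed across criteria.
n = len(models)
borda = np.zeros(n)
for j in range(X.shape[1]):
    order = np.argsort(-X[:, j])                  # best first for this criterion
    for points, idx in enumerate(order[::-1]):    # worst gets 0 points
        borda[idx] += points
borda_rank = np.argsort(-borda)

print("TOPSIS ranking:", [models[i] for i in topsis_rank])
print("Borda ranking: ", [models[i] for i in borda_rank])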
Comments 10: [There is no assessment of whether the features highlighted by the model (via SHAP) align with clinical reasoning. The study should investigate whether these explanations helped clinicians revise or confirm their medical judgments.]
Response 10: [Thank you for this valuable point. We have now incorporated in Section 4 a small validation in which clinicians reviewed SHAP-highlighted features for 20 randomly selected cases and rated each feature's clinical plausibility on a 3-point scale (consistent, uncertain, implausible). Overall, 89% of the SHAP-selected top features were deemed "consistent" with established clinical knowledge (Table 5). We describe these methods and results to demonstrate alignment with expert reasoning and to support the credibility of our explanations.]
Author Response File: Author Response.docx
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
I thank the authors for their meaningful and comprehensive responses. I propose that the article be accepted.
Reviewer 2 Report
Comments and Suggestions for Authors
I am happy with the responses.