Use of Attention Maps to Enrich Discriminability in Deep Learning Prediction Models Using Longitudinal Data from Electronic Health Records
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper evaluates several attention-based deep learning architectures for processing longitudinal electronic health record (EHR) data, with the goal of finding the one that provides the best balance between discriminability (predictive performance) and clinical plausibility (interpretability of the model's attention weights). The authors evaluate four progressively complex deep learning architectures and compare their performance in predicting 1-year all-cause mortality using 10 years of EHR data from Catalonia, Spain. The result is that while simpler architectures generally have better discriminability, the attention maps of more complex architectures are more informative and clinically plausible. The results show that discriminability and clinically meaningful interpretability do not always run in parallel, and that it is important to establish an appropriate balance between the two in healthcare applications where model transparency is highly valued.
However, there are some issues in the article that I hope the author can answer or revise:
It was mentioned in Section V that in healthcare, prioritising higher transparency and clinically reasonable attention weights may justify choosing a model with a slight decrease in discriminability, provided the overall performance remains adequate. The problem is how to achieve this balance quantitatively. For example, the GRU and down-up CNN in this paper's experiments show clinical plausibility, but with a non-negligible decrease in discriminability. To what extent does such a decrease become unacceptable?
Author Response
Comments 1:
This paper evaluates several attention-based deep learning architectures for processing longitudinal electronic health record (EHR) data, with the goal of finding the one that provides the best balance between discriminability (predictive performance) and clinical plausibility (interpretability of the model's attention weights). The authors evaluate four progressively complex deep learning architectures and compare their performance in predicting 1-year all-cause mortality using 10 years of EHR data from Catalonia, Spain. The result is that while simpler architectures generally have better discriminability, the attention maps of more complex architectures are more informative and clinically plausible. The results show that discriminability and clinically meaningful interpretability do not always run in parallel, and that it is important to establish an appropriate balance between the two in healthcare applications where model transparency is highly valued.
However, there are some issues in the article that I hope the author can answer or revise: It was mentioned in Section V that in healthcare, prioritising higher transparency and clinically reasonable attention weights may justify choosing a model with a slight decrease in discriminability, provided the overall performance remains adequate. The problem is how to quantitatively achieve their balance. For example, GRU and down-up CNN in the experiment of this paper have clinical rationality, but there is a non-small decrease in discriminability. To what extent is such a decrease unacceptable?
Response 1:
We appreciate the opportunity to discuss this further with the reviewer. As we see it, there are two determinants when selecting one model over another with respect to discriminability: 1) whether the overall discriminability remains acceptable, which is the focus of this article, and 2) the decrease or increase in specific metrics.
It is challenging to define a general threshold for “acceptable discriminability” when focusing on a single use case. Determining a broadly applicable threshold would require further investigation across multiple use cases and consultation with a broader range of stakeholders. In fact, while we averaged all metrics to detect decreasing or increasing discriminability, other use cases might focus on only one of the metrics. For instance, some healthcare applications might tolerate more false positives in order to be more conservative (e.g., when considering preventive treatments), while others would prefer more false negatives (e.g., when treatments are aggressive). In our case, averaging all of them made sense given the exploratory nature of our study. The goal of our study was not to define this threshold but to explore whether an increase in clinical plausibility could coexist with a reduction in discriminability.
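For illustration only, the short Python sketch below (with hypothetical metric names and values, not the results reported in the manuscript) shows one way such an averaging could be expressed, with optional weights for use cases that wish to prioritise a particular metric such as sensitivity:

```python
import numpy as np

# Hypothetical per-model discriminability metrics (illustrative values only,
# not the results reported in the manuscript).
metrics = {
    "transformer":   {"auroc": 0.88, "auprc": 0.41, "sensitivity": 0.72, "specificity": 0.85},
    "gru_attention": {"auroc": 0.85, "auprc": 0.37, "sensitivity": 0.70, "specificity": 0.83},
}

def summarize(model_metrics, weights=None):
    """Average the metrics; optional weights let a use case emphasise one metric."""
    names = sorted(model_metrics)
    values = np.array([model_metrics[n] for n in names])
    if weights is None:
        weights = np.ones(len(names))
    return float(np.average(values, weights=np.asarray(weights, dtype=float)))

for model, m in metrics.items():
    print(model, round(summarize(m), 3))
```

A use case favouring sensitivity, for example, would simply pass a weight vector that emphasises that metric instead of the unweighted average used in this exploratory study.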
In this direction, and also following directions from Reviewer 2, we have reformulated the objective of the study: “In this study, several attention-based deep learning architectures handling longitudinal EHR data were designed with increasing degrees of complexity and compared aiming to test whether improved discriminability necessarily implies improved interpretability” both in the introduction and in the Abstract.
For future work, we aim to introduce a more formalized framework to quantify this balance, potentially incorporating focus groups with clinicians to more clearly define thresholds for acceptable reductions in discriminability while considering the transparency of the models in each use case. These focus groups should be preceded by a study of discriminability and interpretability in different use cases, so that the particular application can be included in the discussion on the threshold for acceptable performance considering the increase in transparency.
In this direction, we have added the following line to the discussion: “In addition, it should be noted that determining an “adequate overall performance” threshold from a single study may be difficult, as each healthcare application may need to optimize particular metrics over others. To define it, future work should consider performing this study in different use cases and consulting clinicians to establish criteria, perhaps through focus groups, on the determination of this threshold for acceptable discriminability, taking into account the loss or gain of interpretability of the model.”.
We hope this response addresses the reviewer’s concern. We are happy to elaborate further or provide additional details if needed.
Reviewer 2 Report
Comments and Suggestions for Authors
The manuscript focuses on evaluating four different deep learning architectures for medical decision-making, based on longitudinal Electronic Health Records. The primary goal is to identify an architecture that effectively balances discriminative capabilities (accuracy) and interpretability (attention maps). The authors recognize the limitations of solely relying on discriminative performance, emphasizing the importance of understanding the decision-making process.
Although the work is promising, several issues must be addressed:
1. Literature Review
- Consider incorporating works, even taking inspiration from diverse fields, that explore the significance of attention layers alongside discriminative metrics. This will provide a comprehensive understanding of the challenges and opportunities in this area, and also of the approaches and metrics currently employed.
- Also, I suggest clearly detailing the specific gap that the current work aims to address. This could involve discussing the limitations of existing approaches in balancing accuracy and interpretability, or the lack of guidelines and metrics for designing interpretable yet accurate deep-learning models in medical decision-making.
2. Novel Contribution:
- I suggest that the authors better detail the novel contribution they aim to present in this work. Consider presenting it as a guideline for designing deep-learning-based approaches for medical decision-making. This could involve a step-by-step process, including data preprocessing, model selection, training, and evaluation.
3. Data Structure and Analysis:
- Please try to provide a more detailed description of the EHR data, including the types of variables (e.g., demographic, clinical, laboratory) and the average number of records per patient, and explain how missing data are managed.
- Investigate and discuss the distribution of different diseases among the patients, considering factors like cause of death or hospital admissions. This analysis can help assess the coherence of the attention weights with the clinical context. For example, if many patients died from cardiac issues, the model's attention should be focused on cardiac-related features.
4. Few typos have been found:
- Line 93: PROVINDING attention weights -> PROVIDING
- Figure 1: dynamic CATEGORIC features -> CATEGORICAL
Author Response
Comments 1:
The manuscript focuses on evaluating four different deep learning architectures for medical decision-making, based on longitudinal Electronic Health Records. The primary goal is to identify an architecture that effectively balances discriminative capabilities (accuracy) and interpretability (attention maps). The authors recognize the limitations of solely relying on discriminative performance, emphasizing the importance of understanding the decision-making process.
Although the work is promising, several issues must be addressed:
- Literature Review
- Consider incorporating works, even taking inspiration from diverse fields, that explore the significance of attention layers alongside discriminative metrics. This will provide a comprehensive understanding of the challenges and opportunities in this area, and also of the approaches and metrics currently employed.
Response 1:
We thank the reviewer for this suggestion. In the previous version of the manuscript, in the second paragraph of the “Related work” section, we already presented the origin of attention maps and some examples of approaches introducing them in healthcare applications (NLP: Bahdanau 2014; General: Santana 2021, Brauwers 2023, Riedl 2019; Healthcare (Imaging): Xu 2024, Zhang 2023; Healthcare (EHR): Choi 2016, Zhang 2018, Kabeshova 2020). We have now clarified in this same paragraph of the Related Work section that, with tabular data from EHRs, “In these cases, in addition to transparency, attentional mechanisms also help to handle long temporal sequences more successfully.”, which are the two main contributions of attention maps. We also cited some references representing the current debate on the role of attention mechanisms in explainability (Bibal 2022, Wiegreffe-Pinter 2019, Jain-Wallace 2019).
Following the reviewer’s suggestion, we have included more information and references on the challenges of evaluating attention maps in a new paragraph in the Related Work section: “Sen et al. performed a study to assess how good attention maps are based on their consensus with human explanations, following the recommendations from Riedl on human-centered artificial intelligence, which define a good explanation as one that is plausible. They defined some metrics to measure this consensus, such as the overlap between human annotations and attention mechanisms, but recognized that the subjectivity of humans themselves can make it difficult to define these metrics.”.
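As an illustration of the kind of consensus metric discussed above (a minimal sketch of an overlap-style measure, not Sen et al.'s exact formulation, with hypothetical values), the snippet below compares the features most attended by a model against those a clinician marked as relevant:

```python
import numpy as np

def attention_human_overlap(attention_weights, human_relevant, k=5):
    """Fraction of the top-k attended features that a human annotator also marked as relevant.

    attention_weights: 1-D array, one weight per feature (hypothetical example).
    human_relevant: set of feature indices a human annotator considered important.
    """
    top_k = np.argsort(attention_weights)[::-1][:k]  # indices of the k largest weights
    return len(set(top_k) & human_relevant) / k

# Illustrative values only.
weights = np.array([0.02, 0.30, 0.05, 0.25, 0.01, 0.20, 0.17])
human = {1, 3, 5}  # e.g., features a clinician flagged as clinically plausible drivers
print(attention_human_overlap(weights, human, k=3))  # -> 1.0
```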
Comments 2: Also, I suggest clearly detailing the specific gap that the current work aims to address. This could involve discussing the limitations of existing approaches in balancing accuracy and interpretability, or the lack of guidelines and metrics for designing interpretable yet accurate deep-learning models in medical decision-making.
Response 2:
In fact, the present manuscript arose from a systematic review on AI techniques for predicting health outcomes from longitudinal EHR data (https://doi.org/10.1093/jamia/ocad168), in which we saw that the authors of the included studies usually made great efforts to compare their architecture with existing ones in order to claim theirs was better, but they did so exclusively at the level of discriminability. They did not also compare the clinical plausibility of the resulting explanations (when transparency mechanisms were included). This is what we meant in the Introduction with “Benchmarking studies usually assess only discriminability, but attention weights' clinical relevance also requires scrutiny, ensuring meaningful explanations align with clinical insights”, a sentence that is now accompanied by references to studies doing so.
Therefore, in this work we sought to test whether better discriminability also implied better clinical plausibility of the model. In this line, we have reformulated the objective of the study, hoping that it is clearer now, both in the Abstract and the Introduction: “In this study, several attention-based deep learning architectures handling longitudinal EHR data were designed with increasing degrees of complexity and compared aiming to test whether improved discriminability necessarily implies improved interpretability”.
Before that, in the introduction, aligned with your suggestion, we have also incorporated the following lines: “As noted in a previous systematic review, there is a lack of guidelines in how to run benchmarking studies on prediction models in healthcare, hampering comparability between models. Recently developed reporting guidelines for prediction models in healthcare with AI such as TRIPOD-AI do not define either how to perform comparisons on interpretability.”
Comment 3:
- Novel Contribution:
- I suggest that the authors better detail the novel contribution they aim to present in this work. Consider presenting it as a guideline for designing deep-learning-based approaches for medical decision-making. This could involve a step-by-step process, including data preprocessing, model selection, training, and evaluation.
Response 3:
We hope the new wording of the objective helps clarify our contribution. Although we agree with the reviewer that such a guideline would be very interesting and necessary, especially considering the current lack of guidelines in this regard, it is not the objective of this work. In line with our reply to Reviewer 1, developing a guideline for designing architectures and choosing among them would require more studies and a discussion group to establish the best thresholds for discriminability and clinical plausibility. We have added “future work should consider performing this study in different use cases and consulting clinicians to establish criteria, perhaps through focus groups, on the determination of this threshold for acceptable discriminability, taking into account the loss or gain of interpretability of the model.” to future work.
We hope the reviewer will understand that producing such a guideline without a broader consensus on this subject, and relying exclusively on a single use case, could be presumptuous.
Comments 4:
- Data Structure and Analysis:
- Please try to provide a more detailed description of the EHR data, including the types of variables (e.g., demographic, clinical, laboratory) and the average number of records per patient, and explain how missing data are managed.
Response 4:
We agree with the reviewer on the importance of detailing all data processing information in the manuscript. We have now included the information on the average number of records per patient in Section 3.2.1 (“This left a mean of 7 records (years) (standard deviation: 3.15) per person.”). The information on the types of variables can be found in Table 1 (and the complete list of variables per domain can be found in our GitHub repository). The information on missing data was in the Appendix, and it is now also included in Section 3.2.1 (“Missingness was addressed in this work through masking. Further details can be found in the Supplementary Material”). We did not impute the missing data; instead, we worked with a masking layer that indicates to the model which values were missing and which were not. This allows the model to work with the actual observed values, without any imputation.
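For illustration only, a minimal sketch of this masking idea (assuming a Keras-style Masking layer; this is not our exact implementation) is shown below: years with no records are encoded with a sentinel value, and the masking layer tells the recurrent layer to skip those time steps rather than rely on imputed values.

```python
import numpy as np
import tensorflow as tf

# Illustrative sketch of time-step masking (Keras-style; hypothetical dimensions).
MASK_VALUE = 0.0
n_years, n_features = 10, 24

model = tf.keras.Sequential([
    tf.keras.layers.Masking(mask_value=MASK_VALUE, input_shape=(n_years, n_features)),
    tf.keras.layers.GRU(32),                         # mask-aware recurrent layer
    tf.keras.layers.Dense(1, activation="sigmoid"),  # 1-year mortality probability
])

# One hypothetical patient: the first three years have no records and stay at the sentinel.
x = np.zeros((1, n_years, n_features), dtype="float32")
x[0, 3:] = np.random.rand(n_years - 3, n_features)   # observed values from year 3 onward
print(model.predict(x, verbose=0).shape)  # (1, 1)
```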
Comments 5:
Investigate and discuss the distribution of different diseases among the patients, considering factors like cause of death or hospital admissions. This analysis can help assess the coherence of the attention weights with the clinical context. For example, if many patients died from cardiac issues, the model's attention should be focused on cardiac-related features.
Response 5:
We completely agree this analysis would be very interesting. However, we do not have access to the cause of death. In any case, and hoping to respond to the reviewer's request, we have prepared a descriptive table of some of the variables for the whole study population and in relation to the outcome (“A more detailed description of the population and the outcome can be found in the Supplementary Material.”).
Comments 6: Few typos have been found:
Response 6: Thank you for noting these typos. We have corrected them. Thank you for all your contributions.
Round 2
Reviewer 2 Report
Comments and Suggestions for Authors
The authors carefully address all the raised concerns.
I think the new version of the manuscript has been notably improved.
Author Response
Comments 1:
The authors carefully address all the raised concerns.
I think the new version of the manuscript has been notably improved.
Response 1:
Thank you very much for your revisions.