Investigating De-Identification Methodologies in Dutch Medical Texts: A Replication Study of Deduce and Deidentify
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This paper investigates de-identification methodologies for Dutch medical texts, focusing on a replication study of Deduce and Deidentify. The topic is becoming increasingly relevant due to evolving privacy and data protection regulations. The healthcare sector is particularly sensitive to these issues, making compliance a critical concern for both public and private institutions. Non-compliance can lead to severe consequences, reinforcing the strategic importance of this research area, which continues to attract significant attention and effort.
The paper could benefit from substantial improvements. The abstract and introduction should be refined to provide a clearer framing of the research. Specifically, what is the main objective of this study? The paper should explicitly outline the key requirements of the de-identification process. What characteristics must the data retain to remain useful? What is the intended application of the de-identified data? Additionally, the choice of de-identification technique should be better justified. Why were Deduce and Deidentify selected over other methods, such as Homomorphic Encryption?
A more structured approach is needed to define the primary research question earlier in the paper. Currently, it is introduced only at the end of the introduction, which affects clarity. The discussion of related work should also be expanded and systematically presented, highlighting this study’s contributions and distinctions compared to existing research in the field.
Other aspects that require clarification include the impact of the Dutch language on this problem. Additionally, the writing style should be improved for better readability. The example using Dutch text could be enhanced by providing a translation into English or clearly indicating in section 2.1.2 that a translation is available.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
Here's a well-crafted review with an insightful critique:
I think that this study provides an interesting contribution to the field of medical data privacy by examining the effectiveness of two de-identification techniques—Deduce and Deidentify—on Dutch electronic health records (EHRs). The use of both an annotation-based dataset and a synthetic dataset generated by OpenAI’s GPT-4 adds an innovative dimension, offering a broader perspective on how these techniques perform across different types of data. The comparative analysis using precision, recall, and F1 scores allows for a rigorous evaluation of their strengths and limitations.
I also feel that there are technical insights into the variability in de-identification performance across different entities, with Deduce demonstrating superior accuracy overall. This is surprising, and the observed performance gap between the two techniques perhaps needs more discussion; in particular, the 0.42 improvement on synthetic datasets and the 0.2 advantage on real-world data underscore the challenges of adapting de-identification models to diverse data sources. However, the study would benefit from a deeper discussion on how these methods generalize across different hospital settings and languages, particularly given the variability in clinical text structures.
A key area that warrants further exploration is the connection between de-identification and healthcare analytics, especially in the context of machine learning applications during the COVID-19 pandemic. The pandemic underscored the urgent need for secure data-sharing mechanisms to facilitate real-time analytics for disease surveillance, patient outcome prediction, and resource allocation (see the overview work in Zhe Fei et al., An Overview of Healthcare Data Analytics With Applications to the COVID-19 Pandemic, IEEE Transactions on Big Data, 2022, and also Hang et al., MEGA: Machine Learning-Enhanced Graph Analytics for Infodemic Risk Management, IEEE Journal on Biomedical and Health Informatics, 2023, which highlights the same issue arising from language- and country-specific differences, since the pandemic is worldwide). Ensuring robust de-identification is crucial for enabling researchers to analyze EHR data without compromising patient privacy. The paper could strengthen its impact by discussing how these de-identification techniques affect downstream machine learning models used for healthcare analytics, such as the trade-offs between de-identification quality and model performance in clinical decision-making.
Overall, I find that this study is well-structured and methodologically sound, providing a strong foundation for future research in medical text de-identification. Addressing the implications for healthcare analytics, particularly in pandemic-related contexts, would further enhance its relevance and practical utility.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
Changes were made to better structure the work, clarify the study's objectives, and outline the requirements of the de-identification process. Additionally, the choice of de-identification techniques was justified.
Overall, the key improvement suggestions were taken into account.
Author Response
Dear reviewer,
Thank you very much for your comments and for your approval.
Best,
Pablo Mosteiro, on behalf of all the authors
Reviewer 2 Report
Comments and Suggestions for Authors
The paper hasn't really addressed the reviewers' previous comments fully. Per the last review comment, the paper should begin with a stronger problem motivation, emphasizing the global significance of de-identification beyond Dutch clinical data. It's fine to mention Dutch and EUR legal compliance, but this should be given more justification, as there is a growing importance of privacy protection, particularly in the wake of COVID-19, where large-scale data collection through contact tracing underscored the need for effective de-identification methods. This is why the reviewer suggested the need for the study to resonate with a wider audience and demonstrate its relevance beyond a specific national healthcare system by connecting to the healthcare analytics of COVID-19 as one reason. There are many reasons (legal compliance, COVID, the next pandemic, etc.), but referring to international privacy regulations such as WHO GDPR in the wake of COVID will further strengthen the justification for the work.
To improve generalizability, Section 4 should explicitly discuss the challenges of applying de-identification methods across different hospital settings and languages. Since clinical text structures vary significantly across healthcare systems, incorporating examples or citing studies from diverse regions will provide a more comprehensive perspective. If feasible, a discussion on how these techniques might be adapted for multilingual or non-European datasets would enhance the study’s applicability.
Moreover, the paper should address how de-identification impacts downstream machine learning models used in healthcare analytics (see the discussions in the 2021 paper An Overview of Healthcare Data Analytics With Applications). It is now 2025, and with Generative AI such as large language models, the trade-offs between de-identification quality and model performance, particularly in clinical decision-making tasks, have changed somewhat. Hence, in the discussion or results section, an analysis (or at least a qualitative discussion) of how different de-identification strategies in the last ten years or so affect predictive modeling outcomes would add valuable insight. If practical, a small case study or illustrative example could further reinforce this point. After all, the reason we have legal compliance in public health is that we have gone through public health challenges, and de-identification was certainly an unresolved problem; otherwise governments worldwide would not have mandated contact tracing and the need to preserve privacy when they saw the backlash.
By making these revisions, the paper will become more accessible to a global audience, demonstrating the broader impact of de-identification in healthcare beyond the Dutch context while also addressing its implications for machine learning and clinical analytics.
Author Response
For research article Investigating De-identification Methodologies in Dutch Medical Texts: A Replication Study of Deduce and Deidentify
Response to Reviewer 2 Comments
1. Summary
Thank you very much for taking the time to review this manuscript. Please find the detailed responses below and the corresponding revisions highlighted in blue in the re-submitted file.
2. Point-by-point response to Comments and Suggestions for Authors
Comments 1: the paper should begin with a stronger problem motivation, emphasizing the global significance of de-identification beyond Dutch clinical data
Response 1: We agree with this comment. Therefore, we have highlighted the first sentence in the abstract and the first paragraph in the introduction, which tackle this point.
Comments 2: It's fine to mention Dutch and EUR legal compliance but this should be made with more justification as there is a growing importance of privacy protection, particularly in the wake of COVID-19, where large-scale data collection through contact tracing underscored the need for effective de-identification methods.
Response 2: Agree. We have, accordingly, added a sentence to the first paragraph of the introduction to emphasize this point.
Comments 3: To improve generalizability, Section 4 should explicitly discuss the challenges of applying de-identification methods across different hospital settings and languages. Since clinical text structures vary significantly across healthcare systems, incorporating examples or citing studies from diverse regions will provide a more comprehensive perspective. If feasible, a discussion on how these techniques might be adapted for multilingual or non-European datasets would enhance the study’s applicability.
Response 3: We have added Section 4.3.
Comments 4: Moreover, the paper should address how de-identification impacts downstream machine learning models used in healthcare analytics (see the discussions in the 2021 paper An Overview of Healthcare Data Analytics With Applications). It is now 2025, and with Generative AI such as large language models, the trade-offs between de-identification quality and model performance, particularly in clinical decision-making tasks, have changed somewhat. Hence, in the discussion or results section, an analysis (or at least a qualitative discussion) of how different de-identification strategies in the last ten years or so affect predictive modeling outcomes would add valuable insight. If practical, a small case study or illustrative example could further reinforce this point. After all, the reason we have legal compliance in public health is that we have gone through public health challenges, and de-identification was certainly an unresolved problem; otherwise governments worldwide would not have mandated contact tracing and the need to preserve privacy when they saw the backlash.
Response 4: Thank you for your comment. We agree with your assessment. For that reason, in the previous round we added a sentence to the introduction (highlighted again now) that addresses the issue of downstream task performance. Unfortunately, we do not (yet) have a case study on our data, so we cannot compare Deduce and Deidentify on a downstream task. However, we have added this point to a new Section 6, Limitations and future work.
4. Additional clarifications
We acknowledge that the present study seems quite limited to the clinical domain and the Dutch language. We have the intention of making it broader in the future, in a follow-up paper. We hope that we have been more transparent about this in our current revisions, and that we have also pointed out how this challenge can be tackled.
Round 3
Reviewer 2 Report
Comments and Suggestions for Authors
It is recommended that the authors address the previous review comments fully and improve their references for motivation; the references are not comprehensive, as mentioned in the previous review. Please use the paper: Z. Fei et al., "An Overview of Healthcare Data Analytics With Applications to the COVID-19 Pandemic," in IEEE Transactions on Big Data, vol. 8, no. 6, pp. 1463-1480, 1 Dec. 2022 and references therein to improve motivation.
Here is another recommended and related paper, which has been cited by at least ten works related to COVID-19: Libbi, C.A.; Trienes, J.; Trieschnigg, D.; Seifert, C. Generating Synthetic Training Data for Supervised De-Identification of Electronic Health Records. Future Internet 2021, 13, 136. https://doi.org/10.3390/fi13050136
Author Response
Dear reviewer,
Thank you for taking the time to review our paper so carefully. We are confident that our paper has become much better thanks to your review.
The first paper, titled 'An Overview of Healthcare Data Analytics with Applications to the COVID-19 Pandemic', discusses general big data challenges and introduces analytical and computational epidemiological methods specifically in relation to COVID-19 data. However, it does not focus on natural language processing (NLP) or the de-identification of medical text data, nor does it explore Dutch medical texts or the methodologies of Deduce and Deidentify, which are fundamental to our research. Given the large difference in topic, methods, and objectives, citing this paper might not add relevant support or context to our investigation of NLP-based de-identification methodologies in Dutch medical texts. So, from my perspective, I think we might not cite this paper.
The second paper, titled 'Generating Synthetic Training Data for Supervised De-Identification of Electronic Health Records', is actually related to Section '2.1.2. Synthetic dataset' of our paper, so we have added a new sentence:
Previous research has also demonstrated the effectiveness of using generative language models, such as LSTM and GPT-2, to create synthetic EHR datasets annotated for named-entity recognition, highlighting their utility for downstream NLP tasks like de-identification [15].
Author Response File: Author Response.pdf