Peer-Review Record

Effective Machine Learning Techniques for Non-English Radiology Report Classification: A Danish Case Study

by Alice Schiavone 1,*,†, Lea Marie Pehrson 1,2,3,†, Silvia Ingala 2,4, Rasmus Bonnevie 5, Marco Fraccaro 5, Dana Li 2,3, Michael Bachmann Nielsen 1,2,3 and Desmond Elliott 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 20 December 2024 / Revised: 27 January 2025 / Accepted: 5 February 2025 / Published: 17 February 2025
(This article belongs to the Section Medical & Healthcare AI)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

 

Scope: The authors highlight a very crucial problem: annotating very large sets of data. They focus on non-English radiology reports, specifically in Danish, using techniques previously applied to English cases with great success.

 

The authors clearly explain the problem, and the need to solve it, in the introduction section.

 

The authors have clearly listed their contribution in the same section to assist the reader in clearly and precisely understanding their achievements.

 

The authors provided a very detailed and helpful section with regards to their data collection. It would have been beneficial to summarise the different cases, reports, entries etc. in a table for better clarity.

 

It would have been beneficial to elaborate more on how the human experts were assigned for labelling and how the overall procedure unfolded.

 

The authors did a very good job in explaining the overall training process of the LLM models.

 

The authors presented very interesting and insightful results, fully highlighting each LLM’s good points and potential improvements, adopting several evaluation metrics, and appropriate charts and tables.

 

The authors also compared their results with those of related studies, an important component any study should include.

 

Overall, an interesting and insightful paper to read.

 

Author Response

Comment 1: “The authors provided a very detailed and helpful section with regards to their data collection. It would have been beneficial to summarise the different cases, reports, entries etc. in a table for better clarity.”

Response 1: We acknowledge the reviewer’s comment. We initially included such a table in the manuscript, but removed it to improve readability. Instead, we decided to include more detailed information about the data and the class distribution in Appendix A, as stated in Lines 91 and 125.

 

Comment 2: “It would have been beneficial to elaborate more on how the human experts were assigned for labelling and how the overall procedure unfolded.”

Response 2: The process included multiple stages of annotation, with tasks distributed dynamically to ensure thorough coverage of all findings, according to the project requirements. In the first labeling task, a radiographer annotated randomly selected reports, with on-demand support from a radiologist. To obtain a challenging and enriched dataset, subsequent annotation tasks were selected using an active-learning-inspired strategy: we picked the cases where an image classifier trained on the base labels disagreed most with those labels, which we empirically found to indicate flaws in the automatic annotation.
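The selection strategy described above can be sketched as follows. This is an illustrative sketch, not the authors' code: the function name and the use of absolute label-probability disagreement as the ranking score are assumptions.

```python
def select_for_annotation(base_labels, predicted_probs, k):
    """Rank cases by |base label - predicted probability| and return
    the indices of the k cases where the classifier disagrees most
    with the automatic base labels."""
    disagreement = [abs(y - p) for y, p in zip(base_labels, predicted_probs)]
    ranked = sorted(range(len(disagreement)),
                    key=lambda i: disagreement[i], reverse=True)
    return ranked[:k]

# Toy example: base labels from automatic annotation vs. classifier confidence.
labels = [1, 0, 1, 0, 1]
probs = [0.9, 0.8, 0.2, 0.1, 0.6]
print(select_for_annotation(labels, probs, 2))  # → [1, 2]
```

Cases 1 and 2 are selected because the classifier is confidently positive on a negatively labeled case and vice versa, the pattern the authors found to indicate annotation flaws.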

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript presents and compares machine learning methods for Danish radiology report classification. The manuscript's content is interesting enough for readers, but it needs more clarification on the novelty and methods. I suggest publication after major revision. Please see details below:

1. The manuscript needs extensive language check and editing.

2. What are the novelty and challenges in applying the existing method that was used for English in a Danish case? The method itself should be applicable to any language. Please clarify.

3. Font/format error in Figure 1: letter "g" is incomplete in words such as "Labeling" and "RegEx".

4.  Grammatical error in Line 315: "were compared of .." 

5. Figure 8: Legends for some line types are missing, and the lines' colors are not clear to distinguish.

6. Line 209: "there was a" -> "there was an"

7. Table 3: "In bold, the best score". It's not a complete sentence. Please rewrite.

Comments on the Quality of English Language

The manuscript needs extensive language check and editing to reduce grammatical errors. Please see the non-exhaustive examples in my comments.

Author Response

Comment 1: Language checking and editing (points 1, 3, 4, 6, 7)

Response 1: We acknowledge and thank the reviewer for pointing out these grammatical errors. We have made the following changes to the manuscript to address these comments:

  • Tables 3, A2, A3, and A4: “In bold, the best score” changed to “Best score per metric highlighted in bold”.
  • Lines 205-206: “there was a” changed to “there was an”.
  • Figure 1: this is not a format error but the style of the font “Syne”; we changed it to another serif font (“Ovo”).
  • Lines 304-305: “Regular expression rules and machine learning models were compared of the task of detecting medical findings within Danish radiology reports.” was changed to “This study compared regular expression rules and machine learning models for detecting medical findings within Danish radiology reports.”

Lastly, a co-author, who is a native English speaker with a PhD in Computer Science from the University of Edinburgh, reviewed the manuscript for language. The changes have improved the readability of most sections, including the task introduction and our methodology. (We do not list these changes individually, as the editing was extensive.)



Comment 2: “Figure 8: Legends for some line types are missing, and the lines' colors are not clear to distinguish.”

Response 2: We are uncertain which legends the reviewer considers missing. The five models tested are shown in different colours, and the two tested classes (positive and negative mentions) are differentiated by different markers (“o” and “x”, respectively).



Comment 3: “What are the novelty and challenges in applying the existing method that was used for English in a Danish case? The method itself should be applicable to any language. Please clarify.”

Response 3: In natural language processing (NLP), methods and evaluations for the English language have taken precedence over all other languages. In fact, English has been called the “unmarked” language, as part of a drive to improve attention to non-English language processing [1]. It is known that methods applied to English do not always transfer to other languages. In this case, the rule-based method that we developed for our target language clearly requires rules tailored to that language. Take Figure 2 as an example: the rules use a sub-word vocabulary unique to Danish and its grammar, as well as an extended alphabet (å, æ, ø) that is not present in English methods. For this reason we had to hand-craft 360 RegEx rules, as described in Section 2.2. Large Language Models, while more flexible, are still limited to a predefined vocabulary, defined when the model and its tokenizer are trained [2]. We cannot assume that models trained on English data perform as well on other languages [3]. For this reason, we test this procedure on models that have been trained on varying amounts of Danish text. We found from multiple metrics (Figure 5; Tables 2, 3, and 4) that the model that saw the most Danish data during training (DanskBERT) is the best-performing model on our task, showing that the task is language-dependent and benefits from more language resources. This is consistent with the worst results being obtained by the model trained mostly on English and other non-Danish data (mBERT). Our methodology should apply to other languages for which RegEx rules can be written and pre-trained language models are available; however, investigating this for other non-English languages is beyond the scope of the current paper.
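The language-dependence of such rules can be illustrated with a minimal sketch. This is a hypothetical example, not one of the manuscript's 360 rules: the finding terms chosen here are assumptions, but they show why a Danish rule must handle native compounds and the extended alphabet (å, æ, ø) alongside Latin-derived terminology.

```python
import re

# A rule matching a pneumonia mention must cover both the Latin-derived
# "pneumoni" and the native Danish compound "lungebetændelse" (with æ),
# which no English-built rule set would contain.
pattern = re.compile(r"\b(pneumoni|lungebetændelse)\b", re.IGNORECASE)

report = "Der ses tegn på lungebetændelse i højre underlap."
print(bool(pattern.search(report)))  # → True
```

Python's `re` module treats æ, ø, and å as word characters by default, so `\b` boundaries behave correctly around Danish words; rule sets written for ASCII-only engines would not.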



[1] Bender, E. M. On Achieving and Evaluating Language-Independence in NLP. Linguistic Issues in Language Technology 6, 2011.

[2] Vaswani, A., et al. Attention Is All You Need. Advances in Neural Information Processing Systems, 2017.

[3] Zhang, X.; Li, S.; Hauer, B.; Shi, N.; Kondrak, G. Don’t Trust ChatGPT when your Question is not in English: A Study of Multilingual Abilities and Types of LLMs. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7915–7927, Singapore, 2023.

Reviewer 3 Report

Comments and Suggestions for Authors

This study explores efficient and cost-effective techniques for the automated annotation of Danish chest X-ray reports and proposes a machine learning model training strategy that achieves performance comparable to similar methods in English. This is a significant research endeavor. However, the article's layout and content require further revision.

1. The introduction should focus on presenting the research problem, objectives, etc. Figure 1, which illustrates the methodological strategy proposed in this study, should appear in Chapter 2.

2. Chapter Two primarily describes the materials and methods of this study. It is recommended to establish two secondary headings: "Materials" and "Methods." Under these two secondary headings, you can then create tertiary headings to elaborate in detail.

3. The main objectives of this research are: (1) to develop a string matching algorithm, (2) to compare the performance of various methods with different BERT-based machine learning models, and (3) to generate an overview of the experimental results regarding the annotation work required to achieve the desired performance. However, the methodology described in Chapter Two does not clearly present the content of the algorithm developed in this study. Additionally, it is recommended to review the content and structure of Chapter Two and reconsider whether experimental and evaluation metrics could be incorporated into Chapter Three.

4. In the conclusion section, after summarizing the key points of this research, it is recommended to include additional content such as suggestions for future research directions.

Author Response

Comment 1: “Figure 1, which illustrates the methodological strategy proposed in this study, should appear in Chapter 2”
Response 1: We acknowledge the reviewer's suggestion. We have moved Figure 1 from Chapter One to Chapter Two.


Comment 2: “However, the methodology described in Chapter Two does not clearly present the content of the algorithm developed in this study.”
Response 2: As correctly summarized by the reviewer, the main objectives of our work are the development of a string matching algorithm and the training of a BERT-based language model to solve the task of radiology report classification in Danish. We briefly introduce these tools in Sections 2.2 and 2.3, respectively, and more details are available in Appendix B: Implementation Details. We chose to exclude these details from the main text to simplify the text. As no link to this Appendix was present in the original version of the manuscript, we added one in Lines 115-116 with the sentence “More detail about pre-trained language models and the RegEx-based method is available in Appendix B.”


Comment 3: “Chapter Two primarily describes the materials and methods of this study. It is recommended to establish two secondary headings: "Materials" and "Methods." Under these two secondary headings, you can then create tertiary headings to elaborate in detail.”
Response 3: We agree that the structure of Chapter Two can be improved. To avoid excessive subheading nesting, we made the following changes:

  • We renamed Section 2.1 from “Data Collection” to “Materials: Data Collection”.
  • We renamed Section 2.2 from “Regular Expressions (RE)” to “Methods: Regular Expressions (RE)”.
  • We renamed Section 2.3 from “Pretrained Language Models” to “Methods: Pretrained Language Models”.
  • We removed Section 2.4 (“Data Splits”) and moved its text into Section 2.5.


Comment 4: “It is recommended to review the content and structure of Chapter Two and reconsider whether experimental and evaluation metrics could be incorporated into Chapter Three.”
Response 4: We have moved Section 2.7 (“Evaluation metrics”) to Chapter 3.


Comment 5: “In the conclusion section, after summarizing the key points of this research, it is recommended to include additional content such as suggestions for future research directions.”
Response 5: We agree with the reviewer. We had originally discussed suggestions for future research in the last paragraph of the Discussion (Chapter Four); however, we agree that including these thoughts in the Conclusion section improves the manuscript’s readability. We moved, and slightly reworked, Lines 312-316 to the end of the Conclusion.

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

The manuscript has improved a lot and I recommend publication.

Reviewer 3 Report

Comments and Suggestions for Authors

The manuscript has been revised in response to the comments raised during the first round of review, and the revised paper meets the publication requirements of the journal.
