Venous Thrombosis Risk Assessment Based on Retrieval-Augmented Large Language Models and Self-Validation
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The paper deals with assessing the risk of venous thrombosis using large language models and self-validation. The topic is timely and attracts great interest given the rapid emergence of improved LLM versions. However, model hallucination is a problem, and it is especially dangerous in healthcare, which is the subject of the paper.
The paper has a partially clear structure of the kind expected from original research (Introduction - Methods - Experimentation and Analysis - Discussion - Patents). The reference list contains 39 sources, approximately 43.6% of which were published 5 or more years ago. The authors analyzed in sufficient detail various ways of connecting individual language models and proposed the most suitable configuration, which they also verified experimentally with individual parts removed. However, given the above-mentioned scope of research, I would like to add the following suggestions for improving the quality of the paper:
1.) Add an ORCID identifier for each author. This will provide a better link to the authors' previous published research.
2.) I suggest adding a chapter, Limitations of the study, to the proposed structure (Introduction - Methods - Experimentation and Analysis - Discussion - Patents), in which the authors identify cases where their proposed solution fails. Identifying such cases helps to direct possible future research. The reasons for these failures can then be discussed in the Discussion chapter. The proposed chapter would start from line 670 in the submitted paper. In the Limitations of the study chapter, also discuss the transferability of your solution to other languages used in the world.
3.) On line 280, Threshold is mentioned, but nowhere is it stated how it was determined. Was its determination also verified experimentally? Every precisely determined value used in the paper needs to be described together with the procedure used to determine it. Describe that procedure where the value first occurs. For example, from line 280 onwards, state that the value was determined experimentally and describe how.
4.) Tables 2, 3 and 4 should be included in the appendix, as they are large in scope and serve for illustration only. The tables would thus start from line 711 in the submitted document as appendices.
5.) Under what license are the data analyzed? Without this information, it is not possible to verify whether the presented results can be published.
6.) In the paper, you use the term large language model (for example, line 429) and the term LLMs (for example, line 438). Use consistent terminology throughout the paper so that the thing or phenomenon described is identified unambiguously.
7.) Figure 6 shows the system interface, but in the upper left corner of the image there are patient names. Are these the names and data of real patients? This data needs to be anonymized to protect personal data.
8.) Have the results of your research and its applicability been evaluated by practitioners? If so, please provide at least a brief statement in the Discussion section.
9.) As I wrote above, up to 43.6% of the sources in the presented research were published 5 or more years ago. Replace these with newer publications so that most sources are from the last 5 years. In such a rapidly developing field of research, the situation changes from month to month. This will keep your research up to date.
Comments on the Quality of English Language
The English could be improved to more clearly express the research.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This study proposes a multi-scale adaptive evaluation framework based on retrieval-augmented generation (RAG). However, some issues need to be addressed, mainly concerning the readability of the paper and how easily the work can be understood. The shortcomings are listed as follows:
- Tables 3, 4, 9, 10 and 11 break across pages; each should be displayed on a single page.
- The font size in Figure 6 is too small.
- When presenting the existing VTE risk assessment methods, the advantages and disadvantages of different methods (e.g., rule-based systems, machine learning models, deep learning models) can be compared in more detail, and the limitations of these methods in clinical practice can be illustrated with real-life examples.
- Some mathematical notations and parameter definitions are missing. Please re-check all the equations and define the missing symbols and parameters.
- Throughout the paper, there are many redundant and unnecessary descriptions. The authors are advised to refine the content of the paper.
- The limitations of the model and future research directions should be discussed in more detail in the conclusion section.
- In the Experimental results and analysis section, the experimental analyses are insufficient.
- Some recent NAS or LLM methods can be included to further revise the manuscript.
- There are some errors in the references.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
This paper proposes a multi-scale adaptive VTE risk assessment framework based on RAG with a self-validation mechanism using the Qwen2.5-7B large language model fine-tuned via LoRA. The study optimizes EMR knowledge bases using entity-context association and vector databases and uses a closed-loop generation-verification process to reduce hallucination. Experiments on multiple risk assessment scales showed performance improvements over traditional clinical assessments and some earlier machine learning models. The work’s main strength is the integration of retrieval, fine-tuning, and verification into a unified framework for a real-world clinical problem.
Technical Comments / Weaknesses:
The authors should justify why Qwen2.5-7B was selected over many other strong open-source LLMs, especially given that newer models like Llama 3 and DeepSeek variants are also available and could provide fairer benchmarking.
The authors should explain more clearly how the fine-tuning dataset was generated and whether any human verification was involved, as synthetic data risks propagating biases from source guidelines.
The authors should improve the clarity of the knowledge representation for RAG retrieval, particularly addressing how large text chunks are managed to avoid LLM input limits, especially when medical documents are very long.
The authors should provide real example outputs to demonstrate how the model performs.
The authors should better explain how the hallucination detection works during the self-verification phase.
The authors should not mix their fine-tuning, retrieval, and generation-validation contributions in a single narrative without clear modular separation. As written, it is difficult to attribute which component contributes most to the final improvements.
The literature review is too narrow and weak. The authors should cite additional works from recent LLM and RAG developments. Please cite the following:
- https://www.mdpi.com/2504-2289/8/9/115
- https://ieeexplore.ieee.org/abstract/document/10577164
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
I appreciate all the work and time that the authors invested in improving the submitted paper. Given the changes made, I have the following comment:
1.) Thank you for adding information about the evaluation by practitioners. Your statement says "By first trying it in a department, such as the rehabilitation department, and evaluating the working hours, efficiency, percentage of system users, number of times of personal use and other parameters of the clinical practitioners before and after the research trial ...". If the authors evaluated individual parameters, as they state, it is appropriate to add them to the submitted paper.
A short description can be added from line 812 onwards. In the case of a longer description or larger tables, I suggest adding this information to a separate appendix.
The English could be improved to more clearly express the research.
Author Response
Comments 1: Your statement says "By first trying it in a department, such as the rehabilitation department, and evaluating the working hours, efficiency, percentage of system users, number of times of personal use and other parameters of the clinical practitioners before and after the research trial ...". If the authors evaluated individual parameters, as they state, it is appropriate to add them to the submitted paper. A short description can be added from line 812 onwards. In the case of a longer description or larger tables, I suggest adding this information to a separate appendix.
Response: We agree with the reviewer’s suggestion. It is indeed important to present pilot usage data from clinical scenarios. Accordingly, we have organized and included a set of publicly shareable parameter statistics collected during the pilot trial. Relevant revisions have been made starting from line 798 to provide a smooth transition into this section. The corresponding statistical results are now presented in Table 11, beginning at line 816.
We sincerely thank the reviewer once again for this valuable and constructive recommendation.
Reviewer 3 Report
Comments and Suggestions for Authors
The authors have made considerable efforts to address the comments, and I am satisfied with the revised manuscript.
Author Response
We sincerely thank the reviewers once again for their valuable suggestions.