Leveraging Large Language Models for Departmental Classification of Medical Records
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
The paper proposes the use of domain-specific large language models (LLMs) to automate the classification of medical inquiries into appropriate clinical departments. It focuses on fine-tuning models such as GLM2, GatorTron, LLaMA, and Gemma2 using LoRA and QLoRA techniques to reduce training costs while preserving diagnostic accuracy. The topic is timely and relevant, aligning with ongoing advancements in automating healthcare workflows using LLMs. It addresses an important gap in department-level triaging, particularly through the lens of models trained on non-English (Chinese) medical datasets.
However, the contribution is moderate, as the general concept of using LLMs for classification has been explored in prior works. The novelty would be strengthened by clarifying how the proposed approach surpasses models like ChatDoctor, ClinicalBERT, or GatorTron (Lines 74–96), especially in terms of cross-lingual generalization.
Several critical methodological details are missing:
Lines 111–134: Please specify batch size, learning rate, number of epochs, optimizer type, and the training environment (hardware); a sketch of the expected level of detail follows this list.
Add a clear architectural overview diagram (recommended after Line 134) to help readers visualize the model pipeline.
The choice of ROUGE and BLEU (Lines 152–174) is not well justified for a classification task. The authors should include more appropriate metrics—such as accuracy, precision, recall, and F1-score—for evaluation (see the metrics sketch after this list).
The authors did not compare the results with traditional or established baselines like SVM, ClinicalBERT, or BioBERT. Adding these would provide the necessary context for understanding performance gains (suggested before Line 175).
The term “medical chatbot” (Lines 14, 27–28, 224) should be clarified—whether this refers to a backend classifier, a user-facing interface, or a complete end-to-end conversational system.
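To illustrate the level of reporting detail requested in the training-setup point above, here is a minimal sketch of how such a configuration could be stated using Hugging Face TrainingArguments; every value below is a hypothetical placeholder, not the authors' actual setting.

```python
# Hypothetical example of the training configuration that should be reported
# (all values are placeholders, not the authors' actual settings).
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./checkpoints",
    per_device_train_batch_size=8,   # batch size
    learning_rate=2e-4,              # learning rate
    num_train_epochs=3,              # number of epochs
    optim="adamw_torch",             # optimizer type
    fp16=True,                       # mixed precision (hardware-dependent)
    logging_steps=50,
)
# The hardware should also be stated explicitly, e.g., GPU model and memory.
```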
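Likewise, for the evaluation-metrics point, a minimal sketch of how accuracy, precision, recall, and F1-score could be computed with scikit-learn; the department labels and predictions are illustrative only.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Illustrative gold labels and model predictions (department names are made up).
y_true = ["cardiology", "neurology", "cardiology", "dermatology"]
y_pred = ["cardiology", "cardiology", "cardiology", "dermatology"]

accuracy = accuracy_score(y_true, y_pred)
# Macro averaging weights every department equally, which matters
# when class frequencies are imbalanced.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} F1={f1:.2f}")
```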
Conclusions and Limitations
The conclusions (Lines 263–273) are supported by the results, but appear overly optimistic given some limitations:
The language specificity of the dataset (Chinese) is acknowledged, but the implications for generalizability are not deeply discussed.
The handling of rare or complex cases (Lines 230–239) is presented as a limitation, but proposed solutions (e.g., expert systems or hybrid models) are discussed only briefly.
References
Most references are current and relevant, though some (e.g., [12], [13], [21]) are insufficiently contextualized in the methodology section. Please clarify their relevance or consider removing them. Additional references were suggested to enhance the literature context:
Generative AI Models (2018–2024): Advancements and Applications in Kidney Care – contextualizes the clinical role of generative AI.
A Survey of DeepSeek Models – covers emerging domain-specialized LLMs.
ChatGPT: Transforming Healthcare with AI – outlines real-world use cases in healthcare triage and education.
Tables and Figures
Tables 1 and 2: Consider adding statistical significance analysis (e.g., p-values or confidence intervals) to support performance comparisons; a bootstrap sketch follows this list.
Figures 3 and 4: Clarify the relationship between speed (steps/s, samples/s) and model accuracy. Faster models may trade off predictive power.
Figure 1: The dataset example is useful. Consider adding a class distribution chart for department labels to better contextualize model performance.
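As one concrete option for the significance analysis suggested for Tables 1 and 2, a paired bootstrap over test examples yields confidence intervals without distributional assumptions; this minimal sketch uses hypothetical per-example outcomes, not the authors' results.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1 = correct, 0 = wrong, per test example, for two models (hypothetical data).
model_a = np.array([1, 1, 0, 1, 1, 0, 1, 1])
model_b = np.array([1, 0, 0, 1, 0, 0, 1, 1])

# Paired bootstrap: resample test indices and record the accuracy difference.
diffs = []
for _ in range(10_000):
    idx = rng.integers(0, len(model_a), len(model_a))
    diffs.append(model_a[idx].mean() - model_b[idx].mean())

lo, hi = np.percentile(diffs, [2.5, 97.5])  # 95% confidence interval
print(f"accuracy difference 95% CI: [{lo:.3f}, {hi:.3f}]")
# If the interval excludes 0, the difference is significant at alpha = 0.05.
```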
Additional Suggestions
Revise awkward or unclear phrasing in Lines 14 and 124–125. For example, “LoRA could remain the parameters…” should be revised to: “LoRA preserves the original model’s parameters while reducing the number of trainable components.” A configuration sketch illustrating this point follows these suggestions.
Avoid redundancy: the descriptions of LoRA and QLoRA are repeated across Lines 124–134; consider merging for clarity and conciseness.
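For context on the corrected LoRA description, this minimal sketch shows a typical LoRA setup with the PEFT library, where the base weights stay frozen and only small adapter matrices are trained; the base model, rank, and target modules are illustrative, not the authors' settings.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Base weights stay frozen; only the low-rank adapter matrices are trained.
base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder model
lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the update
    target_modules=["c_attn"],  # which projections get adapters (model-specific)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # shows the small trainable fraction
```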
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
This study deals with the possibility of integrating large language models into medical applications, especially for chatbot systems whose task is to classify medical records and provide a preliminary diagnosis. The authors used state-of-the-art LLMs, combined with Low-Rank Adaptation (LoRA) and QLoRA (Quantized Low-Rank Adaptation) techniques, to minimize the time required for training the model. To evaluate the results, the authors used both ROUGE and BLEU metrics.
Despite the good results in the classification of diagnoses, this paper lacks some key details that would enable a better understanding of the obtained results and their significance.
Lack of details related to training and fine-tuning. The data used are publicly available, but the paper itself provides no details about the dataset used for training the model: dataset size, structure, preprocessing steps, or label distributions. Additionally, there is no explanation of how the LLMs (e.g., GLM2, GatorTron, LLaMA) were fine-tuned: what parameters were used, how long the models were trained, what hardware was employed, or how overfitting was avoided. The use of LoRA and QLoRA is mentioned, but without describing how they were integrated or justified.
Complete lack of details related to the architecture of the chatbot used for evaluation. There is no description of the complete process: how user input is processed, what components are involved (e.g., intent detection, entity extraction), or how outputs are generated and validated. There is also no discussion of how external medical knowledge bases were integrated, nor how they influence model predictions.
Insufficient experimental design: the paper does not state what prompts were used, how ground truth was determined, or how the results compare against baselines.
While there is a brief mention of HIPAA and GDPR, the paper fails to substantiate how privacy concerns are handled in practice. More importantly, it does not address algorithmic bias, data representativeness, or the risks of deploying LLMs in healthcare without clinical oversight.
Although the idea of integrating LLMs into medical chatbots is promising, this paper does not meet the standards required for publication in its current form.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
Dear Authors,
The work presented addresses a highly timely and relevant topic. It is well-structured, follows an appropriate format, and is generally easy to read. However, I recommend a comprehensive review of grammar and textual cohesion throughout the manuscript.
I also suggest including additional examples of complex clinical cases in which the model fails, in order to better illustrate its limitations and to expand the discussion regarding its direct clinical applicability—particularly in the context of automated triage and referral processes.
Comments on the Quality of English Language
I recommend a comprehensive review of grammar and textual cohesion throughout the manuscript.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 4 Report
Comments and Suggestions for Authors
Automatic classification of medical records with small LLMs is an innovative and very interesting application. The open publication of models and datasets (on HuggingFace and GitHub) is commendable.
The integration of modern techniques (LoRA and QLoRA) to reduce the cost of training is also very interesting.
That said:
Although they explicitly recognize that the model performs poorly in complex or rare cases, the presentation of the results should be much more nuanced:
The high ROUGE-1 score (96%) should be put in context, as it can be misleading: the task is very simple (just classifying the department). See the sketch after this list.
No specific quantification of the error in complex or rare cases is provided.
They use an almost exclusively Chinese dataset, with very little real cultural or linguistic diversity. Although it is not mandatory, it would be worthwhile to contextualize the results in other languages or Western medical systems.
Only very simple examples are provided, and the model's performance in difficult cases or with combined symptomatology is not illustrated.
They do not show examples of errors or confusions (we do not know where the model fails) and do not analyze the types of cases that are difficult for the model (multi-departmental cases, ambiguous symptoms, etc.). They also do not analyze the model's ability to handle rare cases, medical emergencies, or atypical symptomatology.
They do not discuss whether the model exhibits bias by age, sex, ethnicity, or type of symptomatology.
Although it can be partly extrapolated, they do not specify the hardware configuration used (GPU, RAM, etc.), limiting the replicability of the experiment.
They claim that the system can improve clinical efficiency, but do not provide quantitative data or comparisons with real practice.
There is no reference to real clinical validation with human medical professionals.
Nothing is said about the real difficulties of adopting the system in healthcare settings (privacy, integration with EHRs, medical acceptance).
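To make the ROUGE-1 point above concrete: when the reference is a one-word department label, ROUGE-1 reduces to an exact-match check, so a high corpus average mostly reflects accuracy on an easy label set rather than nuanced generation quality. A minimal sketch with the rouge_score package and illustrative labels:

```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1"])

# With one-word department labels, ROUGE-1 is essentially binary:
# full credit for the right label, zero otherwise.
print(scorer.score("cardiology", "cardiology")["rouge1"].fmeasure)  # 1.0
print(scorer.score("cardiology", "neurology")["rouge1"].fmeasure)   # 0.0
```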
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
All the concerns are addressed.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
Comments and Suggestions for Authors
After reviewing the edited version of the study, I am satisfied with the changes, and I think the authors addressed all my previous comments and suggestions. I would recommend the paper be accepted for publication.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
Comments and Suggestions for Authors
Dear authors, thank you very much for your comments on my suggestions. Your work appears more enriched with the changes made. Congratulations!
Author Response
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 4 Report
Comments and Suggestions for Authors
Two points still need a bit more attention:
The examples are too few—more edge cases and error analysis would help.
The clinical value is still theoretical—real-world validation or expert feedback would make the claims stronger.
That said, the paper has improved a lot. I recommend acceptance, with revisions at least on the last point.
Author Response
Please see the attachment.
Author Response File: Author Response.pdf