Article
Peer-Review Record

AppHerb: Language Model for Recommending Traditional Thai Medicine

by Thanawat Piyasawetkul 1, Suppachai Tiyaworanant 2 and Tarapong Srisongkram 3,*
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 24 June 2025 / Revised: 23 July 2025 / Accepted: 24 July 2025 / Published: 29 July 2025
(This article belongs to the Section Medical & Healthcare AI)

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper presents a novel LLM fine-tuned for traditional Thai medicine. It is based on a pretrained Gemma-2 model, fine-tuned on a small dataset extracted from two TTM books.

The text is clear and the English is good, but a few references are unresolved (Figures 2 and 3). Table 6 is not correctly numbered.

Honestly, Tables 2 to 8 are not really interesting for occidental readers; they could be removed and summarized in the text. This would reduce the length of the paper and leave more space for other, more technical work.

The results provided (performances around 20 to 30%) are not very good, probably due to the small size of the dataset. The risks of training a very large model on a small dataset are overfitting (the model memorizes the data instead of generalizing from it and may not perform well on unseen data) and loss instability (evaluation loss may remain high while training loss decreases). It would be interesting to compare the weights of the original model and the trained model, to see how the model changed internally during training: for instance, the percentage of weights whose amplitude changed by more than 25%, 50%, 75%, or 100% in the process. But maybe other metrics can be used.
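The weight-change metric proposed here could be computed along the following lines. This is a toy sketch on synthetic weight lists, not the actual AppHerb model; the function name and thresholds are illustrative assumptions.

```python
import random

def pct_changed(w_before, w_after, thresholds=(0.25, 0.50, 0.75, 1.00)):
    """For each threshold t, return the fraction of weights whose
    relative change |after - before| / |before| exceeds t."""
    out = {}
    for t in thresholds:
        n = sum(
            1 for b, a in zip(w_before, w_after)
            if b != 0 and abs(a - b) / abs(b) > t
        )
        out[t] = n / len(w_before)
    return out

# Toy illustration on synthetic weights (not the paper's model).
random.seed(0)
before = [random.gauss(0, 1) for _ in range(10_000)]
after = [w + random.gauss(0, 0.3) for w in before]
print(pct_changed(before, after))
```

In practice the same loop would run over the flattened parameter tensors of the base and fine-tuned checkpoints.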

The comparison with other non-specialized models is interesting (Table 11); it shows that a larger model may perform even better than this model, even if not trained on the TTM dataset (but trained on TCM). So, why did you limit your model choice to fewer than 14B parameters?

The use of LoRA seems important in the context of the small dataset. Can you compare the results with and without it?

The chapter at line 247 is interesting and should be emphasized in the abstract as well, and maybe developed more if the authors remove the tables as I suggest.

Author Response

Response to Reviewer 1

Comments and Suggestions for Authors

This paper presents a novel LLM fine-tuned for traditional Thai medicine. It is based on a pretrained Gemma-2 model, fine-tuned on a small dataset extracted from two TTM books.

  • Dear reviewer, we genuinely appreciate all the comments and suggestions that have enabled us to improve the quality of the manuscript.
  • Our responses to the reviewer’s comments are provided below in a point-by-point fashion.
  • All changes to the text in the revised manuscript are marked in red highlighting.
  • We hope our revisions and responses, provided below, address the thoughtful comments from the reviewers and improve the quality of this manuscript.

 

The text is clear and the English is good, but a few references are unresolved (Figures 2 and 3). Table 6 is not correctly numbered.

  • Thank you for your valuable observation. The missing references to Figures 2 and 3 have been corrected, and Table 6 has been renumbered correctly.

 

Honestly, Tables 2 to 8 are not really interesting for occidental readers; they could be removed and summarized in the text. This would reduce the length of the paper and leave more space for other, more technical work.

  • Thank you for your thoughtful feedback regarding Tables 2 to 8. We understand your concern about their relevance to a broader readership. We have relocated these tables to the Supporting Information to streamline the main text, while preserving the data due to its significance in traditional medicine.

 

The results provided (performances around 20 to 30%) are not very good, probably due to the small size of the dataset. The risks of training a very large model on a small dataset are overfitting (the model memorizes the data instead of generalizing from it and may not perform well on unseen data) and loss instability (evaluation loss may remain high while training loss decreases). It would be interesting to compare the weights of the original model and the trained model, to see how the model changed internally during training: for instance, the percentage of weights whose amplitude changed by more than 25%, 50%, 75%, or 100% in the process. But maybe other metrics can be used.

  • Thank you for your insightful comments. We agree that the limited performance is attributable to the restricted data size. Nevertheless, in response to your suggestion, we have strengthened our evaluation by including pre-fine-tuning performance data, bootstrapped confidence intervals, and additional gold-standard metrics such as BERTScore and Bilingual Evaluation Understudy (BLEU) score in the revised manuscript.

The comparison with other non-specialized models is interesting (Table 11); it shows that a larger model may perform even better than this model, even if not trained on the TTM dataset (but trained on TCM). So, why did you limit your model choice to fewer than 14B parameters?

  • Thank you for highlighting our comparison. We appreciate your question regarding the selection of our model size. We have clarified in the manuscript that our decision to restrict the model size to 14 billion parameters was primarily driven by hardware constraints and the practical feasibility of performing local fine-tuning and inference. Nonetheless, we acknowledge the potential for further improvement using larger models, and we have included this as part of the future directions of our research (within the Discussion part).

 

The use of LoRA seems important in the context of the small dataset. Can you compare the results with and without it?

  • Thank you for your valuable suggestion regarding the use of LoRA in the context of limited data. We agree that such a comparison would offer meaningful insights. However, due to current hardware and time constraints, we are currently unable to train the full model without LoRA. We acknowledge the importance of this comparison and plan to explore it in future work.
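For readers unfamiliar with the technique under discussion, the LoRA setup can be illustrated schematically: a frozen weight matrix W plus a trainable low-rank update BA scaled by α/r. The dimensions and values below are toy assumptions, not the authors' actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r = 8, 2                          # hidden size and LoRA rank (toy values)
alpha = 16                           # LoRA scaling factor
W = rng.normal(size=(d, d))          # pretrained weight, kept frozen
A = rng.normal(size=(r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                 # trainable up-projection, zero-initialized

def lora_forward(x):
    # Frozen path plus scaled low-rank update: W x + (alpha / r) * B A x
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d)
# With B zero-initialized, the adapter starts as an exact no-op on W x.
assert np.allclose(lora_forward(x), W @ x)

# Only 2*r*d adapter parameters are trained, versus d*d for full fine-tuning.
print(2 * r * d, "trainable LoRA params vs", d * d, "full fine-tuning params")
```

This makes the reviewer's point concrete: without LoRA, every entry of W would be updated, which is what the authors state was infeasible on their hardware.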

 

The chapter at line 247 is interesting and should be emphasized in the abstract as well, and maybe developed more if the authors remove the tables as I suggest.

  • Thank you for your encouraging feedback on the section beginning at line 247. We appreciate your suggestion to highlight it more prominently. As recommended, we have moved the tables out of the main text to streamline the manuscript and have expanded this section accordingly. We also incorporated its key points into the abstract to better reflect its significance.

 

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Dear authors

Your manuscript presents a well-structured and innovative approach to adapting generative AI models to a low-resource language and knowledge domain, namely Traditional Thai Medicine (TTM). The manuscript introduces a fine-tuned LLM using the Gemma-2 model and applies it to two tasks: i) treatment prediction and ii) herbal recipe generation. Your manuscript offers a proof-of-concept that could be valuable for local public health and AI applications in cultural contexts.

However, the manuscript would benefit from several improvements before being ready for publication.

The introduction presents the rationale and context, but the narrative could be more concise. While the relevance of the work is well-justified, some claims, especially regarding the potential for improving public trust or transforming public health, should be rephrased with more caution and balanced with a discussion of the current limitations of LLMs.

In the materials and methods section, the authors demonstrate a thorough and systematic approach. However, more details are needed about the data preprocessing steps and validation strategies. The authors mention manual transcription from historical texts; how were errors minimized and consistency ensured in that process? Additionally, there is no mention of inter-annotator agreement or quality control in the corpus construction. While the use of LoRA and tokenization tools like PyThaiNLP is appropriate, it would strengthen the paper to include the model training duration, validation strategies (e.g., cross-validation or early stopping), and statistical rigour in the evaluation.

 

The results section shows model performance using standard metrics, but there is a lack of statistical analysis. The reported F1 scores (26–31%) suggest modest model effectiveness and should be discussed more critically. Figures showing the loss decline are referenced but marked as "Error! Reference not found"; this must be corrected. Table 11, comparing AppHerb to TCM-based models, is informative, though a clearer normalisation or explanation of scale differences would help interpretation.

Furthermore, in the discussion, the strengths and weaknesses of the model are well laid out. However, the discussion would benefit from a more structured comparison to the state-of-the-art, including error analysis or case studies where the model succeeded or failed. The section on the challenges posed by the Thai language is valuable and appropriately acknowledges limitations, but the manuscript would benefit from suggestions on how to overcome those in future research.

Finally, the conclusion restates key findings and acknowledges the proof-of-concept nature of the study. It appropriately avoids overstated claims. However, the practical implications could be explored in greater depth (e.g., how might this be deployed in educational or clinical environments?)

 

Author Response

Response to Reviewer 2

Comments and Suggestions for Authors

 

Dear authors

Your manuscript presents a well-structured and innovative approach to adapting generative AI models to a low-resource language and knowledge domain, namely Traditional Thai Medicine (TTM). The manuscript introduces a fine-tuned LLM using the Gemma-2 model and applies it to two tasks: i) treatment prediction and ii) herbal recipe generation. Your manuscript offers a proof-of-concept that could be valuable for local public health and AI applications in cultural contexts.

However, the manuscript would benefit from several improvements before being ready for publication.

  • Dear Reviewer, we genuinely appreciate all the comments and suggestions that have enabled us to improve the quality of the manuscript significantly.
  • Our responses to the reviewer’s comments are provided below in a point-by-point fashion.
  • All changes to the text in the revised manuscript are marked in red highlighting.
  • We hope that our revisions and responses, provided below, address the thoughtful comments from the reviewers and significantly improve the quality of this manuscript.

 

The introduction presents the rationale and context, but the narrative could be more concise. While the relevance of the work is well-justified, some claims, especially regarding the potential for improving public trust or transforming public health, should be rephrased with more caution and balanced with a discussion of the current limitations of LLMs.

  • Thank you for your feedback on the introduction. We appreciate your guidance on refining the narrative and moderating the broader claims. We have revised the manuscript to present a more concise introduction and have tempered the language around the potential impact, ensuring a more balanced discussion that acknowledges current limitations of LLMs.

 

In the materials and methods section, the authors demonstrate a thorough and systematic approach. However, more details are needed about the data preprocessing steps and validation strategies. The authors mention manual transcription from historical texts; how were errors minimized and consistency ensured in that process? Additionally, there is no mention of inter-annotator agreement or quality control in the corpus construction. While the use of LoRA and tokenization tools like PyThaiNLP is appropriate, it would strengthen the paper to include the model training duration, validation strategies (e.g., cross-validation or early stopping), and statistical rigour in the evaluation.

  • Thank you for your detailed and thoughtful comments on the materials and methods section. We have added further clarification on our data preprocessing, noting that manual transcription was reviewed rigorously during the data cleaning and preparation phase, particularly while extracting herb names, to ensure accuracy and consistency.
  • While initial errors were present, we are confident that the final data set used for training is representative and reliable. Additionally, we have included the model training duration and expanded on model configurations to strengthen the methodological robustness.

 

The results section shows model performance using standard metrics, but there is a lack of statistical analysis. The reported F1 scores (26–31%) suggest modest model effectiveness and should be discussed more critically. Figures showing the loss decline are referenced but marked as "Error! Reference not found"; this must be corrected. Table 11, comparing AppHerb to TCM-based models, is informative, though a clearer normalisation or explanation of scale differences would help interpretation.

  • Thank you for your review of the results section and for identifying areas for improvement in both analysis and presentation. We have addressed the issues by recalculating the results and incorporating bootstrapped confidence intervals to provide a more robust statistical interpretation.
  • The broken figure reference has been corrected, and Table 11 has been revised with clearer normalization and explanatory notes to aid the understanding of our results.
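A percentile-bootstrap confidence interval of the kind described here can be sketched as follows. The per-example F1 scores in the demo are hypothetical, not the paper's data, and the resampling parameters are illustrative defaults.

```python
import random

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=42):
    """Percentile-bootstrap confidence interval for the mean of
    per-example scores: resample with replacement, take the mean of
    each resample, and read off the (alpha/2, 1 - alpha/2) quantiles."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(scores) / n, (lo, hi)

# Hypothetical per-example F1 scores, not the paper's actual results.
f1 = [0.31, 0.22, 0.28, 0.19, 0.35, 0.27, 0.24, 0.30]
mean, (lo, hi) = bootstrap_ci(f1)
print(f"mean F1 = {mean:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

The same procedure applies to any per-example metric (ROUGE, BLEU, BERTScore) and is especially useful with small test sets, where the interval width makes the uncertainty explicit.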

 

Furthermore, in the discussion, the strengths and weaknesses of the model are well laid out. However, the discussion would benefit from a more structured comparison to the state-of-the-art, including error analysis or case studies where the model succeeded or failed. The section on the challenges posed by the Thai language is valuable and appropriately acknowledges limitations, but the manuscript would benefit from suggestions on how to overcome those in future research.

  • Thank you for your insightful suggestions regarding the discussion section. We appreciate your emphasis on comparative analysis and future directions. We have revised the discussion to include structured best- and worst-case analyses, offering clearer insight into the model’s performance. Additionally, we expanded the section on linguistic challenges and outlined potential strategies for addressing them in future research.

 

Finally, the conclusion restates key findings and acknowledges the proof-of-concept nature of the study. It appropriately avoids overstated claims. However, the practical implications could be explored in greater depth (e.g., how might this be deployed in educational or clinical environments?)

  • Thank you for your insightful suggestion. We have revised the conclusion to better highlight the practical implications of our work. In particular, we now specify how the AppHerb models can be applied in real-world scenarios. For example, the TrP and HRG models are publicly accessible on Hugging Face, and a Thai-language chatbot interface is available on GitHub to demonstrate interactive use.
  • These tools can serve as educational resources for students and healthcare professionals to explore AI-assisted Thai traditional medicine. In clinical environments, they may aid in preliminary decision support or be integrated into electronic health systems after appropriate clinical validation. We have also emphasized that clinical deployment would require further evaluation to ensure safety and reliability.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

For the manuscript ‘AppHerb: Language Model for Recommending Traditional Thai Medicine’ here are my comments to improve it:

Abstract: Why TCM focused models cannot be directly applied to TTM.

Introduction: The authors' claim about 'false information' could be made more solid by adding a recent Thailand-specific citation. The introduction could better foreshadow the specific technical issues of the Thai language (e.g., low-resource language, lack of explicit word delimiters) that are critical to the problem and are currently only mentioned much later (page 11); this would lead more effectively into the technical contribution of the work. Adding sources on AI in herbal medicine could expand the context beyond TCM and define the gap more precisely. Rephrasing "no TTM-related chatbot" as "we fill this gap" would better articulate the work. The research questions could be stated explicitly, e.g., "Can LoRA-tuned LLMs encode TTM recipes?" and "How do they compare to TCM models?"

 

Material and Methods: How were the validation procedures performed to guarantee transcription fidelity and inter-annotator consistency, particularly for the 405 and 256 records? It makes sense to transform multi-symptom strings into lists, but mention whether hierarchical or ontological mapping was used to enhance uniformity. The exam leaderboard is a reasonable stand-in for overall Thai-language ability, but TTM language could vary in tone and vocabulary compared to exam writing; consider adding an analysis of M3Exam content overlap with TTM corpora. Rank = 8 and α scaling were chosen, but no hyperparameter search is reported; a brief description of the tuning process would help reproducibility. The Thai translation of the Alpaca prompt is accurate, but were prompt variants ever put into the pipeline? If not, note this as a point for improvement. Instead of LCS-based scoring, other scores like BLEU or BERTScore can weigh more heavily on semantic matches. Ablation or sensitivity analyses are not present.
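For reference, the LCS-based scoring contrasted here with BLEU and BERTScore is essentially ROUGE-L, which can be sketched as below. Whitespace tokenization is a simplification for the demo; Thai text would need a word tokenizer such as PyThaiNLP, as the paper uses.

```python
def lcs_len(a, b):
    # Classic DP for the longest common subsequence length of two token lists.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ta in enumerate(a, 1):
        for j, tb in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if ta == tb else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate, reference):
    # Whitespace split is a stand-in; Thai needs a real tokenizer.
    c, r = candidate.split(), reference.split()
    l = lcs_len(c, r)
    if l == 0:
        return 0.0
    p, rec = l / len(c), l / len(r)
    return 2 * p * rec / (p + rec)

print(rouge_l_f1("a b c d", "a b x d"))  # 0.75: LCS "a b d" over two 4-token strings
```

Because the LCS only rewards exact token matches in order, synonyms and paraphrases score zero, which is the reviewer's motivation for adding embedding-based metrics like BERTScore.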

 

Result: Lines 175, 177, and 214 contain broken references ("Error! Reference source not found"); kindly fix them. Adding confidence intervals or bootstrapped variability would help assess statistical significance, especially given the small test sizes in Table 10. Consider adding examples for a qualitative analysis of one high- vs. one low-performing case to identify model weaknesses.

 

Discussion: Consider running the HRG variability analysis with artificially truncated inputs or outputs to verify the effect. The workaround through a Gemma-2 variant is creative, but group-query attention patterns may not generalize to the LoRA-tuned model; acknowledge this limitation and propose directions for future work. You could discuss employing multilingual models or character-level tokenizers as future directions to alleviate segmentation issues.

 

Conclusion: Reiterate that clinical validation is required before any deployment and consider recommending collaboration with TTM practitioners for real‐world testing.

 

References: Check references according to journal guidelines. It would be better to include SOTA and recent references from 2024 and 2025, as there are currently no references cited from 2025.

Author Response

Response to Reviewer 3

Comments and Suggestions for Authors

For the manuscript ‘AppHerb: Language Model for Recommending Traditional Thai Medicine’ here are my comments to improve it:

  • Dear Reviewer, we genuinely appreciate all the comments and suggestions that have enabled us to improve the quality of the manuscript significantly.
  • Our responses to the reviewer’s comments are provided below in a point-by-point fashion.
  • All changes to the text in the revised manuscript are marked in red highlighting.
  • We hope that our revisions and responses, provided below, address the thoughtful comments from the reviewers and significantly improve the quality of this manuscript.

 

Abstract: Why TCM focused models cannot be directly applied to TTM.

  • Thank you for raising this important point regarding the applicability of TCM-focused models to TTM. Our intention was not to directly apply TCM models to TTM tasks, as the two systems differ significantly in both linguistic structure and domain-specific knowledge. These distinctions necessitate tailored approaches for effective modeling.

 

Introduction: The authors' claim about 'false information' could be made more solid by adding a recent Thailand-specific citation. The introduction could better foreshadow the specific technical issues of the Thai language (e.g., low-resource language, lack of explicit word delimiters) that are critical to the problem and are currently only mentioned much later (page 11); this would lead more effectively into the technical contribution of the work. Adding sources on AI in herbal medicine could expand the context beyond TCM and define the gap more precisely. Rephrasing "no TTM-related chatbot" as "we fill this gap" would better articulate the work. The research questions could be stated explicitly, e.g., "Can LoRA-tuned LLMs encode TTM recipes?" and "How do they compare to TCM models?"

  • Thank you for your detailed feedback on the introduction. We have revised the introduction to include a Thailand-specific citation regarding misinformation, introduced the linguistic challenges of Thai earlier in the narrative, and clarified the broader context of AI in herbal medicine. We also rephrased the gap statement and explicitly articulated the research questions to better align with the technical contributions of our work.

 

Material and Methods: How were the validation procedures performed to guarantee transcription fidelity and inter-annotator consistency, particularly for the 405 and 256 records? It makes sense to transform multi-symptom strings into lists, but mention whether hierarchical or ontological mapping was used to enhance uniformity. The exam leaderboard is a reasonable stand-in for overall Thai-language ability, but TTM language could vary in tone and vocabulary compared to exam writing; consider adding an analysis of M3Exam content overlap with TTM corpora. Rank = 8 and α scaling were chosen, but no hyperparameter search is reported; a brief description of the tuning process would help reproducibility. The Thai translation of the Alpaca prompt is accurate, but were prompt variants ever put into the pipeline? If not, note this as a point for improvement. Instead of LCS-based scoring, other scores like BLEU or BERTScore can weigh more heavily on semantic matches. Ablation or sensitivity analyses are not present.

  • Thank you for your comprehensive and insightful feedback on the Materials and Methods section. We have updated the manuscript to include details on the ontological mapping used to standardize multi-symptom strings, and added an analysis comparing M3Exam language with the older Thai used in our data set. While prompt variants were not incorporated in this version, we have noted this as an area for future improvement. We also refined our LoRA hyperparameters and included BLEU and BERTScore metrics, along with a sensitivity analysis of the generation parameters, into the Results section.

 

Result: Lines 175, 177, and 214 contain broken references ("Error! Reference source not found"); kindly fix them. Adding confidence intervals or bootstrapped variability would help assess statistical significance, especially given the small test sizes in Table 10. Consider adding examples for a qualitative analysis of one high- vs. one low-performing case to identify model weaknesses.

  • Thank you for pointing out the broken references and for your valuable suggestions to enhance the statistical and qualitative analysis of the results. We have corrected the reference errors at lines 175, 177, and 214. Additionally, we incorporated bootstrapped confidence intervals to better assess statistical significance, and included a qualitative discussion comparing high- and low-performing cases to highlight model strengths and limitations.

 

Discussion: Consider running the HRG variability analysis with artificially truncated inputs or outputs to verify the effect. The workaround through a Gemma-2 variant is creative, but group-query attention patterns may not generalize to the LoRA-tuned model; acknowledge this limitation and propose directions for future work. You could discuss employing multilingual models or character-level tokenizers as future directions to alleviate segmentation issues.

  • Thank you for your thoughtful suggestions regarding the discussion section. We appreciate your insights on model variability and future directions. While we did not implement artificial truncation of inputs or outputs, we have included a sensitivity analysis and performance comparison before and after fine-tuning to better illustrate model behavior. We also acknowledged the limitations of using the Gemma-2 variant, noting that its attention patterns may not generalize to our LoRA-tuned model. Furthermore, we expanded the future work section to include the potential use of multilingual models and character-level tokenizers to address segmentation challenges.

 

Conclusion: Reiterate that clinical validation is required before any deployment and consider recommending collaboration with TTM practitioners for real‐world testing.

  • Thank you for your valuable recommendation regarding the conclusion. We fully agree on the importance of clinical validation prior to deployment. We have revised the conclusion to emphasize the need for clinical validation and have recommended collaboration with TTM practitioners to facilitate real-world testing and application.

 

References: Check references according to journal guidelines. It would be better to include SOTA and recent references from 2024 and 2025, as there are currently no references cited from 2025.

  • Thank you for your comment regarding the references. We appreciate your emphasis on aligning citations with journal standards and considering recent state-of-the-art work. While we acknowledge the importance of SOTA references, our study is positioned as a proof-of-concept focused on parameter-efficient fine-tuning under resource constraints. Incorporating SOTA models would exceed our hardware capabilities and diverge from the core objective of demonstrating feasibility within the TTM domain.
  • Nonetheless, we have reviewed and updated the references to ensure compliance with MDPI journal guidelines and included the most relevant recent works where appropriate.

 

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The paper's quality has improved; I have no further comments.

Author Response

Comments and Suggestions for Authors

The paper's quality has improved; I have no further comments.

  • Thank you for your comment.

Reviewer 3 Report

Comments and Suggestions for Authors

The authors have addressed many of the initial concerns, and the quality has been significantly improved. However, for the manuscript to be accepted for publication, the following minor revisions are recommended.

In Materials and Methods, the description of the data preparation process should be expanded. The current mention of manual verification is insufficient; a formal, systematic procedure for establishing transcription reliability and ensuring inter-annotator consistency across the collected records must be detailed to validate the dataset's integrity. The reproducibility of the study is currently limited by the lack of information on hyperparameter tuning. The authors should provide a brief rationale for their choice of LoRA parameters, whether it was based on preliminary experiments, established practices, or another method.

In Discussion, the suggested experiment with artificially truncated inputs/outputs, which was designed to assess model robustness, was not conducted. The authors should either perform this analysis or provide a clear justification for its exclusion. The authors wisely note the potential for the Gemma-2 variant's attention patterns not to generalize (Line 327); this critical point deserves greater emphasis to ensure readers fully grasp the scope and limitations of the presented attention analysis.

Author Response

Response to Reviewer 3

Reviewer(s)' Comments to Author:

  • Our responses to the reviewer’s comments are provided below in a point-by-point fashion. All changes to the text in the revised manuscript are marked in blue highlighting.

 

Comments and Suggestions for Authors

The authors have addressed many of the initial concerns, and the quality has been significantly improved. However, for the manuscript to be accepted for publication, the following minor revisions are recommended.

  • We thank the reviewer for the rigorous review and insightful comments. We have made revisions throughout the manuscript as suggested by the reviewer.

In Materials and Methods, the description of the data preparation process should be expanded. The current mention of manual verification is insufficient; a formal, systematic procedure for establishing transcription reliability and ensuring inter-annotator consistency across the collected records must be detailed to validate the dataset's integrity. The reproducibility of the study is currently limited by the lack of information on hyperparameter tuning. The authors should provide a brief rationale for their choice of LoRA parameters, whether it was based on preliminary experiments, established practices, or another method.

  • Thank you for your thoughtful feedback regarding the Materials and Methods section. We have taken steps to clarify and strengthen the description of our data preparation and model configuration processes.
  • To address the point on manual verification, we have now included a Python-based algorithm used to transform turn variables into symptom representations (Table 2). The table illustrates the systematic nature of our preprocessing pipeline in plain English text. Accordingly, we have updated the table indexing to maintain consistency with the revised manuscript.
  • Regarding hyperparameter tuning, we recognize that this aspect was not extensively explored. Given that our model’s peak performance reached approximately 24% F1 score, and theoretical considerations suggest that modest hyperparameter variation would not yield significant gains, we determined that further tuning would not justify the computational resources required. This limitation has now been acknowledged in the revised text.
  • Lastly, the LoRA parameter choices were based on the implementation guidance provided in the Unsloth documentation. We have clarified this point to improve transparency and reproducibility.

In Discussion, the suggested experiment with artificially truncated inputs/outputs, which was designed to assess model robustness, was not conducted. The authors should either perform this analysis or provide a clear justification for its exclusion. The authors wisely note the potential for the Gemma-2 variant's attention patterns not to generalize (Line 327); this critical point deserves greater emphasis to ensure readers fully grasp the scope and limitations of the presented attention analysis.

  • Thank you for your thoughtful feedback on the discussion. We opted not to conduct a truncated input/output analysis in this study because the overall F1 scores with full prompts remain quite low, indicating that the current model does not yet robustly capture the necessary domain knowledge for TTM tasks. At this stage, truncated analysis would provide limited actionable insight, as the baseline model performance is not sufficient for meaningful robustness or ablation studies. We believe that such an analysis will be more informative after further improvements to the data set, model, or when expert validation is available. We plan to pursue this direction in future work.
  • Regarding the Gemma-2 variant, we revised the relevant passage to emphasize this limitation. We also updated the score in line 299.

 

Author Response File: Author Response.pdf
