1. Introduction
Traditional Thai Medicine (TTM) has been the foundation of the healthcare system in Thailand for generations. For centuries, TTM and other traditional medicines (TMs) have held a significant place in global health, and the WHO recognizes their importance by incorporating TM knowledge and technologies into primary healthcare [1]. TTM, like other TM systems, faces the common challenge of being rooted in empirical knowledge with limited objective standards, which results in complex and often ambiguous relationships among formulations, individual herbs, and their associated therapeutic effects [2]. While TTM provides affordable healthcare solutions, patients with limited knowledge are vulnerable to false claims about the efficacy of traditional remedies, which poses significant concerns [3,4]. It is therefore essential to address this issue through evidence-based validation and improved public education.
Artificial intelligence (AI) has become a popular tool in various fields, including healthcare. Generative AI (Gen AI) is a subset of AI that has gained significant attention for its capacity to generate new data [5]. However, current models cannot answer TTM-specific questions or perform related tasks correctly, and this lack of TTM knowledge makes them likely to provide inaccurate information [6,7] to users seeking answers to health-related questions. Because there is no TTM-related chatbot in Thailand, we aim to fill this gap by developing a fine-tuned TTM language model.
Working with the Thai language for natural language processing (NLP) presents unique technical challenges. Thai is considered a low-resource language due to the limited availability of annotated data sets and domain-specific corpora for training robust models. In addition, Thai text is written without spaces or other explicit delimiters between words, making essential NLP tasks such as word segmentation considerably more challenging than in languages like English or Chinese. These linguistic characteristics complicate the development of accurate and efficient AI models tailored for Thai, particularly in specialized domains such as TTM.
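As a concrete illustration (not part of our pipeline), the open-source PyThaiNLP toolkit can segment an unspaced Thai sentence into words; the sentence and its translation are taken from our attention analysis below, while the exact token output depends on the engine and dictionary used:

```python
# Minimal sketch of Thai word segmentation with PyThaiNLP
# (assumes: pip install pythainlp). Illustrative only.
from pythainlp.tokenize import word_tokenize

# "Language model for recommending traditional Thai medicine"
text = "แบบจำลองภาษาสำหรับแนะนำตำรับแผนไทย"

# The default "newmm" engine performs dictionary-based maximal matching.
tokens = word_tokenize(text, engine="newmm")
print(tokens)
# e.g., ['แบบจำลอง', 'ภาษา', 'สำหรับ', 'แนะนำ', 'ตำรับ', 'แผนไทย']
```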
To guide our investigation, we focus on two core research questions: First, can LoRA-tuned large language models (LLMs) effectively encode and generate accurate TTM recipes and treatment predictions from available data? Second, how does the performance of these TTM-specific models compare to large language models trained or fine-tuned on related domains, such as Traditional Chinese Medicine (TCM)? By explicitly addressing these questions, our study aims to evaluate the potential of advanced language modeling techniques for capturing complex and domain-specific knowledge in a low-resource language setting.
We utilized Low-Rank Adaptation (LoRA) [8], a parameter-efficient fine-tuning method that could improve the adaptability and efficiency of language models for TTM knowledge adaptation. We developed two variants of language models for different tasks, treatment prediction (TrP) and herbal recipe generation (HRG), and collectively named this framework AppHerb. We then tested both models on an external test set and evaluated their performance using standard evaluation metrics. Our aim was for these models to provide users with accurate and safe information regarding TTM treatment. Furthermore, the models’ ability to analyze complex relationships among formulae, herbs, and efficacy could potentially facilitate new insights and discoveries in TTM.
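For context, a minimal LoRA setup of the kind used here might look as follows; this sketch uses the Hugging Face peft library, and the model ID, rank, scaling factor, and target modules are illustrative assumptions rather than our exact configuration:

```python
# Sketch of a LoRA fine-tuning setup (transformers + peft).
# All hyperparameters below are illustrative, not our exact settings.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_id = "google/gemma-2-9b"  # placeholder base model ID
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

config = LoraConfig(
    r=16,                                 # low-rank dimension (illustrative)
    lora_alpha=32,                        # LoRA scaling factor (illustrative)
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Only the injected low-rank matrices are trainable;
# the 9B base weights remain frozen.
model = get_peft_model(model, config)
model.print_trainable_parameters()
```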
3. Results
For the TrP training data set, Gemma-2U exhibited a maximum average training loss of 270.00%, which progressively decreased over the course of 300 training steps, reaching a minimum loss of 18.20%, as illustrated in Figure 2. Similarly, for the HRG training data set, the model recorded a maximum average training loss of 312.25%, which was reduced to 19.78% after 300 training steps, as shown in Figure 3. The AppHerb fine-tuning process took approximately 59 to 64 min for TrP and 66 to 74 min for HRG.
Gemma-2U was trained separately on the two data sets, leading to the first version of AppHerb, which includes two variants: one for the TrP task and one for the HRG task. We evaluated system performance using mean (triplicate) precision, recall, and F1 scores with 95% bootstrapped confidence intervals to account for test set variability. On the test set, TrP exhibited mean precision, recall, and F1 scores of 26.54, 28.14, and 24.00 percent, respectively; HRG exhibited mean precision, recall, and F1 scores of 32.51, 24.42, and 24.84 percent, respectively. All AppHerb test performance is shown in Table 5, where “Gemma-2U-” denotes the original model before fine-tuning; bootstrapped confidence intervals are included.
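For reference, bootstrapped intervals of this kind can be obtained with a standard percentile bootstrap over per-case scores; a minimal sketch, assuming per-case F1 values have already been computed (the sample values are hypothetical):

```python
# Sketch: 95% percentile-bootstrap confidence interval for a mean metric.
import numpy as np

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    # Resample test cases with replacement; record each resample's mean.
    means = np.array([
        rng.choice(scores, size=len(scores), replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return scores.mean(), (lo, hi)

# Hypothetical per-case F1 scores (percent):
f1_per_case = [13.2, 24.8, 36.9, 18.5, 29.1]
mean_f1, (ci_lo, ci_hi) = bootstrap_ci(f1_per_case)
print(f"mean F1 = {mean_f1:.2f}, 95% CI [{ci_lo:.2f}, {ci_hi:.2f}]")
```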
AppHerb’s BERTScore precision, recall, and F1 were 84.10, 84.77, and 84.39, respectively, for TrP, and 85.19, 84.24, and 84.68, respectively, for HRG (see Table 6). Additionally, the BLEU scores of AppHerb-TrP and AppHerb-HRG were 74.28 and 76.16, respectively (see Table 7). These results highlight the importance of reporting confidence intervals alongside point estimates, particularly with limited test set sizes.
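These metrics can be computed with standard open-source packages; a minimal sketch using bert-score and sacrebleu (assumed installed), where the candidate and reference strings are placeholders and character-level BLEU tokenization is an assumption made here because Thai text has no spaces:

```python
# Sketch: BERTScore and corpus BLEU for generated Thai text
# (assumes: pip install bert-score sacrebleu). Placeholder data.
from bert_score import score as bertscore
import sacrebleu

candidates = ["..."]  # model-generated outputs (Thai text)
references = ["..."]  # gold outputs from the TTM test set

# BERTScore compares contextual token embeddings;
# lang="th" selects a multilingual backbone.
P, R, F1 = bertscore(candidates, references, lang="th")
print(f"BERTScore P/R/F1: {P.mean().item():.4f} / "
      f"{R.mean().item():.4f} / {F1.mean().item():.4f}")

# BLEU measures n-gram overlap; character tokenization sidesteps
# Thai word segmentation (an assumption, not necessarily our setting).
bleu = sacrebleu.corpus_bleu(candidates, [references], tokenize="char")
print(f"BLEU: {bleu.score:.2f}")
```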
4. Discussion
In this study, we found that the HRG model showed less consistent training loss reduction than the TrP model. This may be attributed to the structure of the HRG data set, in which the input (x) is relatively brief while the expected output (y) is more complex and lengthier. The TrP data set exhibits the opposite structure: the input (x) is relatively long and descriptive, while the output (y) consists of a short disease name or clinical outcome. To enhance the training performance of the HRG model, future studies should consider incorporating more diverse data. Additionally, adjusting the LoRA parameters or adopting a new base model could further improve the model’s generative capabilities.
AppHerb-TrP achieved a BERT F1 score of 84.39 (95% CI: [83.33–85.45]), compared to Gemma-2U-’s 77.33 (95% CI: [76.99–77.67]), indicating a notable improvement in the model’s ability to correctly identify and generate relevant treatment outputs after fine-tuning. Precision and recall were also improved, with a marked improvement in precision (84.10 vs. 74.62), suggesting that AppHerb-TrP produced more accurate and relevant treatment predictions with fewer false positives. AppHerb-HRG outperformed its base model, with a BERT F1 score of 84.68 (95% CI: [84.27–85.11]) compared to 76.16 (95% CI: [75.62–76.78]) for the base model. Gains in both precision and recall indicate that the fine-tuned model was not only generating more precise herbal recipes but was also better at capturing relevant components.
We observed a linguistic complexity during the data cleaning step. Gemma-2U was pre-trained on Thai examinations written in the modern Thai language, whereas our TTM data were written in a traditional form of Thai used at least 200 years ago. We preserved the original wording for two reasons: to maintain familiarity for TTM practitioners, and because some words cannot be accurately translated into modern Thai. The significant changes made were expanding disease names into symptom descriptions (TrP data set) and extracting plant common names through a mix of manual and automatic methods (HRG data set).
For the TrP model, 9 out of 41 cases exhibited precision higher than 50 percent; these were related to the fire element (5 cases), the wind element (2 cases), and urinary tract-related disease (1 case). When examining the results generated by the HRG model, higher performance was observed in cases related to skin diseases: 4 out of 27 cases achieved a precision of 50% or higher, with 2 of those cases involving skin-related conditions. This result suggests that the AppHerb model may be effective in addressing TTM- and skin-related problems. See the Supplementary Materials for more details.
After analyzing the best and worst performances on the test set, we found that reference tokens in the TrP task were shorter and typically contained only a single entity, namely a set of symptoms. Therefore, the scoring outcome tended to be all-or-none, with F1 scores ranging from 0 to 100%. In contrast, for HRG, the reference tokens were lengthier and richer in entities, such as the part of the plant used, the method of preparation, and the symptoms treated. For this reason, the scoring outcome differed from that of the other task, as reflected in the HRG F1 scores, which ranged from approximately 13 to 37%. However, these scores are intended primarily for comparison with the original TTM textbooks; the generated output should be further evaluated by TTM experts to ensure clinical relevance and appropriateness.
To better understand the generation hyperparameters, we conducted a sensitivity analysis to identify the model’s best-performing configuration and to study how different hyperparameter combinations affect the model score, as shown in Figure 4 for TrP and Figure 5 for HRG; extended details are provided in the Supplementary Materials. However, we acknowledge that our exploration of hyperparameter configurations was not exhaustive, as the peak model F1 scores remained modest (F1 = 24.00 and 24.84). Based on these results, we do not expect that more extensive hyperparameter tuning alone would yield a substantial improvement in performance without addressing broader challenges such as data quality, model architecture, or domain knowledge integration.
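A sensitivity analysis of this kind can be organized as a small grid search over sampling hyperparameters; a minimal sketch, where the grid values are illustrative and eval_f1 is a hypothetical helper standing in for our metric pipeline:

```python
# Sketch: grid search over generation hyperparameters (temperature, top_p).
# Grid values are illustrative; eval_f1 is a hypothetical scoring helper.
import itertools

def generate(model, tokenizer, prompt, temperature, top_p, max_new_tokens=256):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        do_sample=True,
        temperature=temperature,
        top_p=top_p,
        max_new_tokens=max_new_tokens,
    )
    # Strip the prompt tokens before decoding the completion.
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)

def sweep(model, tokenizer, val_pairs, eval_f1):
    results = {}
    for t, p in itertools.product([0.3, 0.7, 1.0], [0.8, 0.9, 0.95]):
        preds = [generate(model, tokenizer, x, t, p) for x, _ in val_pairs]
        refs = [y for _, y in val_pairs]
        results[(t, p)] = eval_f1(preds, refs)  # mean F1 for this config
    best = max(results, key=results.get)
    return best, results
```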
LoRA has become a critical component in language model development, as it has been shown to deliver fine-tuning performance on par with, or even superior to, traditional methods in LLMs [8]. In this study, LoRA was used not only to address hardware limitations but also to preserve the original 9B parameters of the base model: during fine-tuning, only the newly added LoRA layers are updated, while the base model remains frozen. This approach significantly improves time efficiency and reduces hardware demands.
The AppHerb models (post-fine-tuning) significantly outperformed Gemma-2U (the base model) on both tasks, except for recall on TrP, where Gemma-2U scored higher. The models’ performance was comparable to that of the existing models discussed in the literature review.
Table 8 and Figure 6 provide a preliminary comparison of AppHerb’s HRG performance with two much larger TCM language models that perform similar tasks: GSCCAM [32] and RoKEPG [33]. While the initial results (F1 = 24.84 vs. 29.99 and 25.92) appear encouraging, several important limitations must be considered. We used a 575-times smaller training data set (229 rows compared to over 100,000 rows in GSCCAM), which inherently caused overfitting and consequently constrained the model’s generalizability. While the reported F1 scores (26–31%) demonstrate some predictive capability, these modest scores are far from indicating reliable performance. Our work represents an application of established methods to a specific domain rather than a novel methodological invention. Furthermore, this work constitutes a proof of concept that has not undergone clinical validation, which would be a necessary step before considering any practical application in healthcare settings.
To enhance model transparency, we attempted to visualize a self-attention matrix of the base model. However, the Unsloth model does not expose attention outputs. Consequently, we explored alternative Gemma-2 variants with characteristics similar to the Unsloth model (9B parameters and 4-bit quantization) and selected the “StoyanGanchev/gemma-2-9b-nf4” model, which has 9B parameters and uses the NormalFloat4 data type. Figure 7 illustrates the self-attention matrix generated for the sentence แบบจำลองภาษาสำหรับแนะนำตำรับแผนไทย (“Language model for recommending traditional Thai medicine”). The input sentence was tokenized into [แบบ, จำ, ลอง, ภาษา, สำหรับ, แนะนำ, ตำ, รับ, แผน, ไทย], and self-attention was computed to explore the relationships among tokens within the sentence. The Gemma-2 model employed in this study comprises 42 layers, each with 16 attention heads. The visualized self-attention matrix represents the averaged attention weights across these 16 heads for a specific layer.
For a given layer’s attention tensor $A \in \mathbb{R}^{B \times H \times T \times T}$ (where $A_{b,h,i,j}$ is the weight from the source token $j$ to the target token $i$ in batch $b$ and head $h$), the averaged attention between the $i$-th target and $j$-th source token for a single input (batch 1) across all $H$ heads is

$$\bar{A}_{i,j} = \frac{1}{H} \sum_{h=1}^{H} A_{1,h,i,j}$$
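As a sketch of how this matrix can be extracted, the variant named above can be loaded with attention outputs enabled; the chosen layer index and plotting details are illustrative, and loading the nf4 checkpoint additionally assumes bitsandbytes is installed:

```python
# Sketch: extract and plot head-averaged self-attention for one layer.
# Layer index and plotting details are illustrative; loading the nf4
# checkpoint assumes bitsandbytes support.
import torch
import matplotlib.pyplot as plt
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "StoyanGanchev/gemma-2-9b-nf4"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# "eager" attention is required for output_attentions with Gemma-2.
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             attn_implementation="eager")

text = "แบบจำลองภาษาสำหรับแนะนำตำรับแผนไทย"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions: one (batch, heads, T, T) tensor per layer (42 total).
layer = 20                               # illustrative layer index
attn = outputs.attentions[layer][0]      # first (only) batch -> (16, T, T)
avg_attn = attn.mean(dim=0)              # average over the 16 heads

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
plt.imshow(avg_attn.float().cpu().numpy(), cmap="viridis")
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.colorbar(label="attention weight")
plt.show()
```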
The distribution of attention weights varied across the model’s layers. A prominent pattern was the consistently strong attention directed toward the first token by all other tokens. However, it is important to emphasize that the grouped-query attention patterns we observed in this Gemma-2 variant workaround may not necessarily generalize to the LoRA-tuned model or to other variants applied to different tasks or data domains. Moreover, we did not compare performance and attention behavior between LoRA and non-LoRA variants, which may be worth exploring in a future study; a direct comparison would be valuable for validating and extending these attention analyses in the context of LoRA-based fine-tuning.
A key challenge of this study is that it is based on Thai, a low-resource language, which may affect the accuracy of text generation in a multilingual model family like Gemma. Unlike English or Chinese, Thai lacks explicit sentence boundary markers such as full stops, allowing arbitrarily nested sub-sentences that are difficult for a machine to learn. In terms of writing system, Thai is an abugida, in which consonants and vowels combine in complex ways and words are written without spaces, necessitating advanced segmentation techniques. In contrast, Chinese uses a logographic writing system, in which each character represents a word or concept. This comparison highlights the inherent complexity of the Thai language and the difficulties it poses for NLP tasks. A bilingual or monolingual LLM pre-trained on a Thai corpus may be better suited to assessing the impact of low-resource language understanding and may make better use of our TTM data set. Despite these challenges, our preliminary findings suggest potential for further development, while acknowledging the substantial differences in scale and scope between our application and the established Chinese medicine models we referenced.
Potentially, AppHerb could support a range of practical uses. In educational settings, the system may serve as an interactive tool for students, health practitioners, or the general public to explore TTM knowledge and receive clear explanations of traditional concepts, formulas, and practices. Educators might use the platform to design exercises that improve students’ understanding of herbal ingredients, treatment methods, or the rationale behind specific TTM prescriptions. In clinical environments, a validated form of AppHerb could assist practitioners by offering rapid, provisional suggestions for herbal treatments and facilitating the preparation of herbal formulas.
Future work will need to address the current limitations by expanding the data sets, enhancing model transparency, collaborating with TTM practitioners, and refining the fine-tuning approach to improve performance and reliability for real-world testing. While our current work was limited to models of at most 14 billion parameters due to hardware constraints, future iterations may explore larger-scale models, which could offer enhanced representational power and improved predictive performance. Additionally, the next version of model development may incorporate multilingual models or character-level tokenizers, which can help alleviate segmentation challenges in low-resource language data sets and improve model robustness across diverse languages and scripts.