1. Introduction
With the rapid advancement of Large Language Models (LLMs) in the medical domain, their potential for processing vast corpora of classical Traditional Chinese Medicine (TCM) literature and supporting exam-oriented medical reasoning has attracted growing attention. However, TCM diagnosis is not merely a process of pattern matching; rather, it involves the complex paradigm of syndrome differentiation and treatment determination. Conventional LLMs rely primarily on static knowledge acquired during pre-training, and in the absence of real-time, authoritative domain constraints, they are prone to generating content that is inconsistent with TCM theoretical principles in professional learning tasks, a phenomenon commonly referred to as hallucination. Previous studies have shown that even advanced models such as GPT-4o and instruction-tuned TCM-specific large models exhibit limited performance on the Taiwanese National Examination for TCM practitioners, largely due to linguistic biases and insufficient coverage of classical TCM texts [
1,
2].
To address the limitations mentioned above, the Retrieval-Augmented Generation (RAG) architecture has been proposed, aiming to reduce hallucination risks and enhance model interpretability by incorporating external knowledge retrieval mechanisms [
3,
4]. In recent years, related studies, such as agent-based RAG systems developed for diagnostic reasoning that improve model interpretability, as well as the Yaoshi RAG framework applied to TCM dietary recommendations [
5], have demonstrated the potential of RAG in domains like TCM, which are highly dependent on textual knowledge, to enhance reasoning reliability and application adaptability [
5,
6,
7]. Nevertheless, existing RAG architectures remain generally constrained by noise arising from ineffective retrieval and lack precise knowledge reinforcement that targets model-level logical deficiencies. This limitation indicates that merely supplementing knowledge through retrieval is insufficient to compensate for the structural logical shortcomings inherent in LLM-based TCM reasoning processes.
Moreover, to improve model reasoning accuracy in professional TCM tasks, it is theoretically possible to enhance model capabilities through large-scale re-pretraining or full-parameter fine-tuning. However, such approaches are often associated with substantial costs in constructing specialized domain corpora, as well as significant computational and memory requirements, resulting in limited practical feasibility in clinical and academic settings with limited resources. In contrast, Supervised Fine-Tuning (SFT) [
8] offers a resource-efficient approach to guide models to learn domain-specific reasoning patterns, enabling adaptation to TCM syndrome differentiation and diagnostic tasks without restructuring the entire parameter space. Recently proposed parameter-efficient fine-tuning strategies [
9] have further lowered the practical adoption threshold of such methods, allowing them to be widely applied under limited hardware conditions. However, SFT remains primarily constrained by the coverage and temporal validity of the training data itself, making it difficult to independently address challenges such as the continuous evolution of professional TCM knowledge and reasoning hallucinations. This limitation further highlights the complementary necessity of integrating SFT with external knowledge injection mechanisms, such as RAG.
The objective of this study is to design and validate a generative artificial intelligence architecture that integrates supervised fine-tuning to internalize domain-specific reasoning capabilities, while employing a dynamically controlled RAG mechanism to reinforce the model with external knowledge sources. This integrated approach aims to improve the reliability and inference stability of LLMs in knowledge-intensive medical reasoning tasks. More specifically, this study seeks to establish a stable and efficient knowledge-intensive medical reasoning framework to mitigate hallucination phenomena exhibited by LLMs in highly rigorous professional reasoning scenarios, such as the national examination for practitioners of Traditional Chinese Medicine, and to comprehensively improve diagnostic accuracy and credibility of reasoning. The specific research contributions of this study are summarized as follows.
This study systematically collected a total of 11,476 multiple-choice questions from the Professional and Technical Senior Examination for Chinese Medicine Practitioners administered between 2005 and 2025. The resulting dataset comprehensively covers core knowledge domains, including fundamental TCM theories, syndrome differentiation principles, and examination-oriented diagnostic reasoning. By constructing an authoritative, reproducible, and highly challenging benchmark characterized by rigorous reasoning demands, this work addresses the longstanding lack of standardized evaluation resources for generative AI in the TCM domain. Moreover, the proposed benchmark provides a solid empirical foundation for comparative analysis and systematic evaluation of TCM-oriented large language models under controlled assessment settings.
- 2.
Proposing a Parameter-Efficient SFT-Based TCM Reasoning Framework for Professional Examination Tasks
This study adopts a parameter-efficient supervised fine-tuning strategy to guide large language models in learning structured TCM diagnostic workflows, syndrome differentiation logic, and examination-style reasoning patterns. Through this approach, the model demonstrates improved stability in high-level medical reasoning tasks within standardized evaluation contexts, effectively reducing hallucination phenomena commonly observed in general-purpose language models when applied to specialized TCM knowledge domains, while enhancing reasoning consistency and answer reliability.
- 3.
Integrating LoRA Fine-Tuning with an Error-Driven Knowledge Extraction–Based Hybrid RAG Framework
This study proposes a hybrid retrieval-augmented generation architecture that integrates LoRA-based fine-tuning with an error-driven knowledge extraction mechanism. By systematically identifying logical gaps and knowledge deficiencies in model reasoning performance through benchmark-based evaluation, the framework dynamically constructs a targeted, task-specific knowledge base for iterative reinforcement. This design exemplifies a novel architectural paradigm enabling deep integration between generative models and knowledge-centric AI systems in complex professional reasoning tasks.
- 4.
Designing a Dynamic Semantic Similarity Threshold Mechanism to Enhance the Precision and Interpretability of Knowledge Augmentation
To mitigate the risk of introducing low-relevance or unnecessary information during retrieval, this study further proposes a dynamic semantic similarity threshold mechanism as the activation condition for retrieval-augmented generation. This mechanism ensures that knowledge retrieval is triggered only under conditions of high semantic relevance and evidential support, thereby strengthening reasoning interpretability, information credibility, and generative output stability from a system design perspective.
- 5.
Empirical Validation of the Performance Advantages of the LoRA + RAG Architecture in Complex TCM Reasoning Tasks
Experimental results demonstrate that the proposed LoRA + RAG architecture effectively balances reasoning stability and answer accuracy, substantially narrowing the performance gap observed in existing TCM language models when evaluated on high-complexity professional reasoning tasks. These findings highlight the effectiveness of carefully designed generative AI systems for structured reasoning under standardized and high-difficulty assessment settings.
- 6.
Proposing a Reproducible and Empirically Grounded Research Methodology for Generative AI in Traditional Chinese Medicine
This study presents a comprehensive and reproducible methodological framework encompassing dataset construction, parameter-efficient fine-tuning, knowledge augmentation design, and evaluation mechanisms. Beyond addressing key concerns related to system design, performance robustness, and interpretability, the proposed methodology offers an extensible research blueprint for future investigations of TCM language models in professional education, examination preparation, and controlled reasoning support tasks, thereby laying the groundwork for subsequent studies targeting more open-ended clinical scenarios.
The remainder of this paper is organized as follows.
Section 2 reviews related work, including the current status and challenges of large language models in Traditional Chinese Medicine, the necessity of retrieval-augmented generation for TCM question answering, and existing performance optimization strategies such as LoRA-based fine-tuning.
Section 3 describes the proposed methodology, detailing the overall system architecture, parameter-efficient fine-tuning strategy, error-driven knowledge construction process, and the dynamic RAG inference mechanism.
Section 4 presents the experimental setup and evaluation results across different reasoning stages, along with ablation studies and statistical analyses.
Section 5 provides a comprehensive discussion of the findings, limitations, and design trade-offs. Finally,
Section 6 concludes the paper and outlines directions for future work.
2. Related Works
This section reviews related work in four key areas: (1) the current status and challenges of large language models in Traditional Chinese Medicine, (2) the necessity of retrieval-augmented generation for TCM question answering, (3) existing strategies for improving RAG performance, and (4) LoRA-based parameter-efficient fine-tuning approaches.
2.1. Current Status and Challenges of LLMs in Traditional Chinese Medicine
With the rise of generative AI, the development of models equipped with professional competence in TCM has increasingly become a focal point of research. Zhang et al. [
7] noted that although LLMs demonstrate outstanding performance in natural language processing, their applications in the TCM domain still face significant challenges. These challenges primarily stem from the highly abstract nature of TCM theories, such as Yin-Yang and Five Elements, as well as syndrome differentiation and treatment principles, which differ substantially from modern medical paradigms, making it difficult for general models to accurately capture the deep semantics of classical texts and specialized terminology.
Current TCM-oriented large models, such as BianCang [
10], pretrained on large-scale pharmacopeia corpora, and ZhongJing [
11], trained with expert feedback, outperform general models in understanding professional vocabulary and instructions. However, their accuracy and stability in handling complex reasoning tasks still leave considerable room for improvement.
For general-purpose LLMs, Su and Gu [
12] highlighted that linguistic and cultural biases embedded in training corpora may lead to systematic bias when addressing TCM-related problems. Wu et al. [
1] further reported that even ultra-large models with 200B parameters, such as GPT-4o, achieved only 62.29% accuracy on the Taiwanese National Examination for TCM practitioners, underscoring the limitations imposed by language bias and the scarcity of classical TCM content. This evidence demonstrates that relying solely on internal parameters and pretrained knowledge remains insufficient to support the rigorous professional reasoning and evaluation demands of the TCM domain.
2.2. The Necessity of RAG in TCM Question Answering
To alleviate hallucination and knowledge deficiency problems that arise when LLMs operate in closed knowledge environments, the RAG architecture has been widely applied across multiple professional domains, including medicine, with its flexibility and feasibility validated by numerous studies [
13,
14]. Mansurova et al. [
15] highlighted that although LLMs possess broad general knowledge, their outputs are often constrained by factual inaccuracies and outdated information, thereby undermining answer credibility. By guiding models to rely on external vector databases, RAG can reduce hallucination rates and improve factual consistency in responses.
Furthermore, Dong et al. [
8] demonstrated that when LLMs acquire specialized capabilities such as mathematical reasoning and code generation, their developmental trajectories differ markedly from general abilities, and multi-task learning frequently leads to capability conflicts or catastrophic forgetting. These findings suggest that domain-specific logical reasoning cannot be reliably achieved through limited data or generic instruction tuning alone, but instead requires extensive, high-quality professional knowledge support.
This characteristic is particularly critical for TCM diagnosis, where knowledge demands are highly specialized and precise. Chen et al. [
16] confirmed that the need for interpretability in TCM applications of LLMs is increasing, yet current techniques remain insufficient in reasoning transparency and evidence presentation. RAG ensures that models generate answers grounded in validated TCM classics such as Huangdi Neijing and Shanghan Lun, thereby reducing interference from low-quality web corpora. Recent studies, including agent-based RAG systems for tracking diagnostic reasoning and the Yaoshi-RAG framework for food-medicine homology recommendations [
5], further demonstrate the substantial potential of retrieval mechanisms in enhancing interpretability and adaptability of medical reasoning. However, the design of retrieval conditions and knowledge injection strategies still requires more systematic investigation.
2.3. Exploring Performance Optimization Strategies for RAG
Recent literature has proposed several key strategies to enhance the implementation efficiency of RAG systems. Şakar and Emekci [
17] evaluated multiple RAG approaches encompassing diverse vector databases, embedding models, and LLMs. Their findings emphasized the importance of balancing contextual quality with semantic similarity-based ranking methods, as well as understanding the trade-offs among similarity scores, token usage, execution time, and hardware utilization.
In terms of prompting, Liu et al. [
18] demonstrated that the ordering of queries and documents within prompts significantly influences the model’s ability to filter critical information when handling noisy retrieval tasks. Due to the autoregressive nature of language models, placing the query before the retrieved documents markedly improves the model’s capacity to filter irrelevant content. Pradhan [
19] further highlighted that combining prompt engineering with the RAG framework enables input design to guide model reasoning behavior, thereby reducing hallucination rates while enhancing alignment between generated outputs and practical requirements.
Taken together, most prior optimization methods for RAG architectures rely on static configurations and lack mechanisms for dynamic adjustment based on reasoning errors or domain-specific mistakes. For professional tasks that demand rigorous logical reasoning, such as the Taiwanese National Examination for TCM practitioners or clinical syndrome differentiation, achieving a dynamic balance between retrieval quality and generative stability while precisely correcting domain-specific reasoning errors remains an open area for systematic research.
2.4. LoRA Fine-Tuning Strategy
As LLMs continue to expand in scale, traditional full-model fine-tuning faces prohibitive computational and memory costs due to the need to update massive numbers of parameters. To address this challenge, Hu et al. [
9] introduced Low-Rank Adaptation (LoRA) as a parameter-efficient fine-tuning method. By incorporating low-rank decomposition matrices into Transformer weights, LoRA enables downstream task adaptation through training only a small number of additional parameters, while preserving the original pretrained weights. LoRA represents weight updates as combinations of low-rank matrices, thereby drastically reducing the number of trainable parameters. Empirical evidence across multiple natural language processing tasks has demonstrated that LoRA can significantly lower memory requirements while maintaining performance comparable to, or even surpassing, full-model fine-tuning. Due to its efficiency and practicality, LoRA has become one of the core methods in parameter-efficient fine-tuning research. Subsequent studies further explored injection strategies and design conditions for LoRA across different modules, such as attention layers, showing strong adaptability across diverse model architectures and task scenarios [
20]. In practical applications, LoRA has also been widely adopted for rapid fine-tuning and version management under limited hardware resources, demonstrating substantial utility and real-world value.
3. Materials and Methods
This section provides a detailed description of the dataset construction process, the design of the two-stage system architecture, the implementation of dynamic inference strategies, and the hardware and software environments required for system development.
3.1. Dataset Construction Process
The dataset for this study was derived from the National Examination for TCM practitioners administered by the Ministry of Examination in Taiwan between 2005 and 2025 [
21], spanning a total of 21 years. Based on the single-choice questions from the national TCM licensing examination, we constructed a domain-specific corpus comprising 11,476 multiple-choice items. In parallel, we established a corresponding conceptualized knowledge repository for the professional TCM domain. The scope and principles of corpus construction are detailed as follows:
The dataset covers both stages of the annual examination, including Stage I (Fundamentals of TCM I-II) and Stage II (Clinical TCM I–IV), covering six core domains in total.
- 2.
Data Cleaning Principles
To ensure consistency in the evaluation process and to make automated evaluation feasible, all items containing images or other non-textual content were systematically excluded. Only text-only questions were retained, and the formatting of correct answers was standardized. The resulting item set supports subsequent model inference and performance evaluation.
- 3.
Error-Driven Knowledge Document Construction
An error-driven strategy was employed to construct a conceptualized TCM knowledge repository for retrieval augmentation. At this stage, the process serves as a proof of concept for transforming empirically observed reasoning failures into targeted external knowledge, rather than as a fully automated knowledge engineering pipeline. Baseline model responses were examined for all types of examination questions. Analysis of incorrect predictions indicated that most errors arose from semantically or conceptually similar answer options, which caused failures in the fine-grained concept discrimination required by professional TCM reasoning, rather than from a complete lack of domain knowledge. Accordingly, the implementation focuses on aggregating questions that repeatedly triggered such confusion. For these high-frequency error cases, relevant domain knowledge was extracted and manually restructured into “conceptualized professional TCM knowledge documents,” defined as concise concept-level explanations that clarify key discriminative principles, rather than question-specific solutions or verbatim textbook excerpts.
Knowledge construction followed a two-stage expert-in-the-loop process. A TCM resident physician conducted initial knowledge extraction and drafting, which was subsequently reviewed and validated by an attending TCM physician for correctness, conceptual clarity, and consistency with standard TCM teaching and examination logic. Only documents derived from recurring error patterns and approved through this review were retained. We acknowledge that this manual process does not yet include a comprehensive error taxonomy, quantitative inter-expert agreement metrics, or a standardized knowledge engineering workflow. These limitations reflect the exploratory nature of the study. Future work will extend the framework through systematic error categorization, multi-expert validation, and scalable knowledge construction pipelines. Nevertheless, the current results demonstrate the feasibility of leveraging recurrent model errors as a principled basis for targeted knowledge augmentation in professional TCM reasoning tasks.
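The data cleaning principles described above can be sketched in a few lines of Python. This is a minimal illustration, not the study's actual preprocessing code: the `ExamItem` schema, the `has_image` flag, the four-option (A to D) answer format, and the decision to drop items whose answer key cannot be standardized are all assumptions introduced here.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExamItem:
    """Hypothetical schema for one examination question."""
    question: str
    options: dict[str, str]   # e.g. {"A": "...", "B": "..."}
    answer: str               # raw answer key as transcribed
    has_image: bool = False   # True if the item depends on a figure


def normalize_answer(raw: str) -> Optional[str]:
    """Map variants such as '(b)', ' B ', or full-width 'Ｂ' to 'B'."""
    # Upper-case, convert full-width letters to ASCII, keep the first A-D letter.
    text = raw.strip().upper().translate(
        {ord(f): ord(a) for f, a in zip("ＡＢＣＤ", "ABCD")}
    )
    match = re.search(r"[ABCD]", text)
    return match.group(0) if match else None


def clean_items(items: list[ExamItem]) -> list[ExamItem]:
    """Keep text-only items whose answer key can be standardized."""
    cleaned = []
    for item in items:
        if item.has_image:
            continue  # image-based items are excluded
        key = normalize_answer(item.answer)
        if key is None or key not in item.options:
            continue  # assumption: unusable answer keys are dropped
        cleaned.append(ExamItem(item.question, item.options, key, False))
    return cleaned
```

Standardizing the answer key up front means that later accuracy scoring can reduce to an exact string comparison between the predicted option letter and the key.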
3.2. Parameter-Efficient Supervised Fine-Tuning
To establish a foundational model capable of syndrome differentiation and diagnostic reasoning in Traditional Chinese Medicine (TCM), this study integrates a parameter-efficient supervised fine-tuning (SFT) strategy into the proposed RAG framework, adopting Low-Rank Adaptation (LoRA) as the core fine-tuning method. Compared with full-parameter fine-tuning, LoRA introduces only a small number of trainable low-rank parameter modules, allowing effective domain adaptation while preserving the general language understanding and reasoning capabilities of the pretrained model.
Formally, for a linear transformation weight W in the model, LoRA represents its update as a low-rank matrix decomposition ΔW = BA, where A and B are trainable parameters, while the original weight W remains frozen during fine-tuning. This design substantially reduces the number of trainable parameters and the associated computational cost. In this study, LoRA modules were injected into key linear layers of the Transformer architecture, including the attention projection layers and feed-forward networks. The supervised fine-tuning corpus was derived from the Taiwanese National Examination for TCM practitioners, enabling the model to internalize structured diagnostic reasoning patterns required for professional TCM inference.
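To make the decomposition ΔW = BA concrete, the following pure-Python sketch builds an adapted weight from a frozen matrix and two small trainable factors, and counts the trainable parameters. The α/r scaling and the zero initialization of B follow the original LoRA formulation and are assumed, not confirmed, to match this study's implementation; the matrix sizes are toy values chosen for illustration.

```python
import random


def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_effective_weight(w, a, b, alpha, rank):
    """Return W + (alpha / rank) * B @ A; W itself is never modified."""
    delta = matmul(b, a)  # (d_out x r) @ (r x d_in) -> d_out x d_in
    scale = alpha / rank
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


def trainable_params(d_out, d_in, rank):
    """Number of entries in B (d_out x r) plus A (r x d_in)."""
    return rank * (d_out + d_in)


d_out, d_in, rank, alpha = 6, 4, 2, 8  # toy sizes, not the study's values
rng = random.Random(0)
w = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]    # frozen
a = [[rng.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]  # trainable
b = [[0.0] * rank for _ in range(d_out)]                              # trainable, zero-initialized

# With B initialized to zero, the adapted weight equals the pretrained weight.
w_eff = lora_effective_weight(w, a, b, alpha, rank)
```

For an illustrative 4096 × 4096 projection with rank 8, `trainable_params(4096, 4096, 8)` gives 65,536 trainable values against 16,777,216 in the full matrix, roughly 0.4% of the original.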
Through this parameter-efficient supervised fine-tuning strategy, the LLM was able to establish stable domain-specific reasoning capabilities under limited computational resources. The fine-tuned model subsequently serves as the core reasoning engine within the dynamic RAG architecture.
The LoRA configuration follows the Baichuan training setup, with a low-rank dimension (rank) of 8, a scaling factor (α) of 32, and a dropout rate of 0.1. Model training was conducted using the Hugging Face Trainer framework with a mini-batch size of 3 and gradient accumulation over 6 steps to simulate a larger effective batch size. Half-precision (FP16) training was enabled to further reduce GPU memory consumption. A fixed learning rate of 2 × 10⁻⁵ was applied across multiple training epochs, and model checkpoints were periodically saved to facilitate training resumption and model comparison.
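The reported hyperparameters can be collected in a small configuration object, which also makes the effective batch size explicit. This is only a sketch: the class and field names are illustrative and are not taken from the study's code, and the effective batch size is computed per device because the text does not state how the two GPUs were used during training.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters as reported in the text; field names are illustrative."""
    lora_rank: int = 8
    lora_alpha: int = 32
    lora_dropout: float = 0.1
    per_device_batch_size: int = 3
    gradient_accumulation_steps: int = 6
    learning_rate: float = 2e-5
    fp16: bool = True

    @property
    def lora_scale(self) -> float:
        # Scaling applied to the low-rank update in the standard LoRA formulation.
        return self.lora_alpha / self.lora_rank

    @property
    def effective_batch_size(self) -> int:
        # Samples contributing to each optimizer step on one device.
        return self.per_device_batch_size * self.gradient_accumulation_steps


cfg = FineTuneConfig()
print(cfg.effective_batch_size)  # 18
print(cfg.lora_scale)            # 4.0
```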
Throughout the training process, all loss values and hyperparameter settings were logged and visualized in real time using the Weights & Biases (WandB) toolkit, allowing detailed monitoring and analysis of convergence behavior.
3.3. Design of System Architecture
This study is based primarily on the BianCang-Qwen2.5-7B-Instruct model [
10] and implements a two-stage generative AI reasoning framework tailored for the National Examination for TCM practitioners. The overall architecture consists of two major modules: knowledge preprocessing and dynamic RAG inference, as illustrated in
Figure 1. In the first stage, an error-driven knowledge preprocessing pipeline systematically constructs external knowledge resources contextualized to the TCM domain. In the second stage, during the actual inference process, the model dynamically adjusts its generation strategy based on the semantic similarity between the input query and the external knowledge. This generative AI framework demonstrates how the model performs dynamic decision-making between RAG and zero-shot inference, ensuring that retrieval augmentation is activated only when the context is highly relevant and evidence-supported. In this way, the system achieves a balance between knowledge reinforcement and reasoning stability. The functions of each stage are described in detail as follows:
This stage adopts an error-driven strategy, analyzing items incorrectly answered by the model during benchmark evaluation. The key missing concepts are extracted and transformed into high-quality knowledge documents. These documents are then converted to semantic embeddings and stored in a FAISS vector database [
22], serving as external knowledge resources for subsequent retrieval-augmented generation.
- 2.
Stage II: Dynamic RAG Inference
This stage processes clinical or examination questions entered into the system. Semantic similarity retrieval is performed first against the vector database, followed by the introduction of a dynamic decision mechanism to determine whether external knowledge should be invoked. Based on the strength of semantic relevance between the retrieved content and the input question, the inference pathway is automatically directed either to the RAG mode or the zero-shot mode. This ensures precision in model responses. The thresholding algorithm and inference logic underlying this mechanism are elaborated in the following section.
3.4. Dynamic RAG Inference Mechanism and Parameter Configuration
This study’s inference system adopts a dynamic decision routing mechanism that alternates between RAG mode and zero-shot generation mode according to the characteristics of different tasks. To implement this dynamic inference process, the mechanism and parameter configuration are structured around three core components: dynamic semantic similarity threshold, generation parameter optimization, and inference workflow with prompt construction. Each of these components is elaborated in the following sections.
3.4.1. Dynamic Semantic Similarity Threshold
In the dynamic retrieval-augmented generation process, this study designs a threshold-based mechanism grounded in semantic similarity to determine whether external knowledge documents should be incorporated into inference. By comparing semantic similarity scores against a predefined threshold, the system dynamically routes inference between RAG mode and zero-shot generation mode, thereby ensuring that generated content remains unaffected by irrelevant retrieval noise. The detailed workflow is as follows:
Upon receiving an input question, the system first uses the semantic embedding model jina-embeddings-v4 [
23] to convert both the input and the knowledge documents into high-dimensional semantic vectors. The similarity scores between the input vector and the document vectors are then calculated and used as a basis for subsequent decision-making.
- 2.
Dynamic Threshold Setting
During inference, a fixed semantic similarity threshold is applied to distinguish highly relevant retrieval results from low-relevance ones. This threshold is designed to balance information recall with noise suppression, which is particularly critical in knowledge-intensive reasoning tasks. Sensitivity experiments conducted in this study identified 0.85 as the optimal threshold value.
- 3.
Inference Path Routing Mechanism
When the similarity score of the retrieved results is greater than or equal to the threshold, the system adopts the RAG mode, integrating the retrieved knowledge documents into the prompts to support generation. Conversely, if the similarity score falls below the threshold, the system switches to zero-shot generation mode, allowing the base language model to perform inference directly, and thereby avoiding the risk of low-relevance information misleading the reasoning process.
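The three steps above (similarity computation, thresholding, and path routing) can be sketched as follows. The sketch assumes cosine similarity over embedding vectors and routing on the single best-scoring document; the text specifies only "similarity scores" and the 0.85 threshold, so the metric, the top-1 choice, and all function names are assumptions. Short lists of floats stand in for real embeddings.

```python
import math
from dataclasses import dataclass

SIMILARITY_THRESHOLD = 0.85  # value reported in the text


@dataclass
class Retrieved:
    text: str
    score: float


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve_top(query_vec, docs):
    """docs: list of (text, embedding). Returns the best match, or None if empty."""
    scored = [Retrieved(text, cosine(query_vec, vec)) for text, vec in docs]
    return max(scored, key=lambda r: r.score, default=None)


def route(query_vec, docs, threshold=SIMILARITY_THRESHOLD):
    """Return ('rag', hit) when score >= threshold, else ('zero_shot', hit)."""
    best = retrieve_top(query_vec, docs)
    if best is not None and best.score >= threshold:
        return "rag", best
    return "zero_shot", best
```

Returning the best hit even on the zero-shot path keeps the similarity score available for logging, which makes it easier to audit how often retrieval was suppressed.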
3.4.2. Optimization of Generation Parameter
In the inference system developed for this study, the temperature parameter in the generation stage is used primarily to control the randomness of the model output. Sensitivity experiments on parameter settings identified 0 as the optimal temperature value. This configuration enables the model to adopt a deterministic generation strategy, in which each decoding step consistently selects the token with the highest probability. As a result, the system maintains output consistency and stability across multiple inference runs, ensuring reliable reasoning performance.
3.4.3. Inference Workflow and Prompt Construction
When the semantic similarity between the input question and the knowledge document repository meets the conditions for retrieval-augmented generation, the system activates the RAG inference workflow. In this process, the retrieved knowledge documents are combined with structured prompts and fed into the base language model to guide generation. In terms of output design, the system not only produces the answer to the input question but also outputs the associated semantic similarity scores and the original text of the referenced knowledge documents. These additional outputs support subsequent analysis and record-keeping.
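A possible shape for this workflow is sketched below: a prompt builder that appends retrieved documents when they are available, and a record type that carries the answer together with the similarity score and the referenced text. The prompt wording, the placement of the question before the reference documents, and every name in the sketch are hypothetical; the study's actual template is not given in the text. The `generate` argument stands for any callable that maps a prompt string to model output.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class InferenceRecord:
    answer: str
    mode: str                              # "rag" or "zero_shot"
    similarity: Optional[float] = None     # score of the retrieved document
    references: list[str] = field(default_factory=list)


def build_prompt(question: str, options: dict[str, str], references: list[str]) -> str:
    """Assemble a prompt; the wording of the template is hypothetical."""
    lines = ["Question: " + question]
    lines += [f"{key}. {text}" for key, text in sorted(options.items())]
    if references:
        lines.append("Reference knowledge:")
        lines += [f"- {ref}" for ref in references]
    lines.append("Answer with a single option letter.")
    return "\n".join(lines)


def run_inference(question: str, options: dict[str, str], references: list[str],
                  similarity: Optional[float],
                  generate: Callable[[str], str]) -> InferenceRecord:
    """Generate an answer and keep the evidence used alongside it."""
    prompt = build_prompt(question, options, references)
    mode = "rag" if references else "zero_shot"
    return InferenceRecord(generate(prompt).strip(), mode, similarity, list(references))
```

Passing an empty `references` list yields the zero-shot prompt, so the same function can serve both inference paths.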
3.5. Statistical Analysis Method
This study uses McNemar’s test [
24] to perform inferential statistical analysis and compare performance under different inference conditions. McNemar’s test is a non-parametric method designed for paired samples in which the response variable is measured on a binary nominal scale (e.g., correct/incorrect). In addition, effect sizes were quantified using Cohen’s g, calculated from the counts of discordant pairs, where b and c denote incorrect → correct and correct → incorrect transitions, respectively, and n is the total number of questions in the corresponding evaluation subset. Because paired McNemar tests were performed for the overall dataset as well as the Stage I and Stage II subsets, multiple comparisons were addressed using a Holm correction, and both raw and adjusted p-values are reported in the Results.
In this study, the same base language model (BianCang-Qwen2-7B) was evaluated on an identical set of questions from the National Examination of TCM (11,476 items) under two inference conditions: “without RAG” and “with RAG.” Since both conditions were applied to the same questions using the same model, the results constitute paired observations, making McNemar’s test appropriate for comparative analysis.
The significance level for the statistical testing was established at α = 0.05. When the computed p-value was less than 0.05, the null hypothesis was rejected, indicating that the response distribution under the two inference conditions exhibited a statistically significant difference.
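The paired analysis can be illustrated with a short standard-library sketch. It assumes the exact (binomial) form of McNemar’s test, the Holm step-down adjustment in its usual form, and the common definition of Cohen’s g as |b / (b + c) − 0.5|; whether the study used the exact or the chi-square variant, and the precise form of its effect-size formula, are not stated here, so these choices are assumptions. The study itself reports using statsmodels for the test computation.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts b and c."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    # Probability of k or fewer successes under Binomial(n, 0.5), doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def cohens_g(b: int, c: int) -> float:
    """Cohen's g in its common form: |b / (b + c) - 0.5|."""
    return abs(b / (b + c) - 0.5) if (b + c) else 0.0


def holm_adjust(p_values: list[float]) -> list[float]:
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted
```

With three comparisons (overall, Stage I, Stage II), the smallest raw p-value is multiplied by 3, the next by 2, and the largest by 1, after which each adjusted value is compared against α = 0.05.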
3.6. Development Platform and Tools
The development platform and tools used in this study are detailed as follows:
The operating system used was Ubuntu 22.04 LTS, with a memory capacity of 32 GB. All experiments were carried out on a system equipped with two NVIDIA GeForce GTX 1080 Ti GPUs (2 × 11 GB VRAM) and a single Intel® Core™ i5-8400 six-core processor.
- 2.
Development Tools
The implementation was carried out using Python 3.12, with the frameworks Hugging Face Transformers 4.57.1 and PyTorch 2.3.1+cu118 [
25]. For experiments involving the RAG architecture, the study utilized Sentence Transformers 5.1.2 and faiss-gpu-cu11 1.13.2, with jina-embeddings-v4 serving as the embedding model for vectorization. Statistical analysis, including McNemar’s test computation, was performed using the statsmodels 0.14.1 package.
4. Results
This study adopts accuracy as the primary evaluation metric, with the experimental results reported in the format of Mean ± Standard Deviation (SD). All experiments were repeated five times with different random seed values (42, 123, 456, 789, 2025) under deterministic decoding (temperature = 0) and deterministic CUDA settings, and the aggregated results are reported to assess system-level robustness.
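As a minimal sketch of this reporting format, the snippet below computes per-run accuracy and aggregates several runs as Mean ± SD. It assumes the sample standard deviation (n − 1 denominator); the text does not say which estimator was used. The per-seed values are toy numbers for illustration only and are not results from this study.

```python
from statistics import mean, stdev


def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Percentage of items whose predicted option matches the answer key."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)


def summarize(accuracies: list[float]) -> str:
    """Format run-level accuracies (in %) as 'mean ± SD' using the sample SD."""
    return f"{mean(accuracies):.1f} ± {stdev(accuracies):.1f}"


# Toy per-seed accuracies, for illustration only (not reported results).
runs = [70.0, 72.0, 74.0, 72.0, 72.0]
print(summarize(runs))  # 72.0 ± 1.4
```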
4.1. Comparison of Cross-Model Performance (With and Without RAG)
This section presents the performance differences of various base LLMs before and after the integration of the RAG architecture. The tested models include TCM-specific models (BianCang-Qwen2-7B, ZhongJing-8B) as well as general-purpose models (Llama3-8B [26], Taide-8B [27]), allowing a comparative analysis of the impact of RAG across different model types.
As shown in Table 1, all four models achieved greater accuracy after integration of the RAG architecture compared to the condition without RAG. In the non-RAG setting, BianCang-Qwen2-7B attained the highest accuracy (61.0%), while the other models ranged between 22.8% and 34.0%. With RAG enabled, BianCang-Qwen2-7B improved to 89.0%, the highest among all tested models, and the remaining models also demonstrated varying degrees of accuracy improvement.
Further analysis of the general-purpose models (Llama3-8B and Taide-8B) shows that although both exhibited significant gains after RAG integration (+17.6% and +20.5%, respectively), their final accuracy still failed to surpass the 60% threshold. This result indicates that when handling domain-specific tasks such as syndrome differentiation or classical terminology in TCM, general models remain constrained by limited semantic understanding, even when supported by external reference materials.
It should be noted that the ZhongJing-8B TCM-specific model achieved only marginally better baseline performance (34.0%) than the general models, and its improvement after RAG (+8.6%) was the lowest among all models tested. This suggests that if a base model lacks robust domain terminology comprehension and sufficient reasoning alignment training, it may still fail to effectively integrate retrieved information with its internal inference capabilities. Consequently, this study selected BianCang-Qwen2-7B as the primary model for subsequent experiments.
Overall, the findings demonstrate that retrieval augmentation alone is insufficient to compensate for inadequate domain semantic understanding in the base model. RAG can only deliver effective knowledge reinforcement when the underlying model already possesses a certain level of domain adaptation. This highlights that, in the design of knowledge-intensive medical reasoning systems, the domain suitability of the base model is a critical prerequisite for the success of retrieval augmentation.
4.2. Semantic Similarity Threshold Experiment
This experiment investigates the impact of the semantic similarity threshold parameter on model performance within the dynamic RAG inference mechanism. Using BianCang-Qwen2-7B, the best-performing model, we evaluated accuracy under three threshold settings: 0.80, 0.85, and 0.90.
Table 2 demonstrates that when the threshold is set to 0.85, the model achieves the highest accuracy (89.0%). When the threshold is too low (0.80), the accuracy decreases to approximately 88.2%, indicating that an insufficiently strict threshold allows the system to incorporate excessive external texts with weak semantic relevance. This redundant information tends to distract the model’s attention and introduces confusion during answer generation. Conversely, when the threshold is too high (0.90), the accuracy drops to about 84.7%. Although retrieval precision improves, excessive filtering excludes supplementary knowledge that could provide valuable context, thus impairing performance on complex reasoning tasks. These findings indicate that, among the tested settings, 0.85 offers the best balance between “retrieval volume” and “retrieval purity,” and serves as a critical parameter to ensure that the dynamic RAG system acquires knowledge that is both relevant and useful.
The experimental results on the semantic similarity thresholds, as shown in Table 2, highlight a clear trade-off between retrieval quality and retrieval quantity. A threshold set too low introduces excessive low-relevance knowledge that interferes with reasoning, while a threshold set too high risks omitting necessary supplementary information. This outcome underscores that, in knowledge-intensive reasoning tasks, RAG should adopt a dynamic triggering mechanism with adaptive regulation rather than a fixed or always-on static design.
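The threshold behaviour described in this section can be sketched as a simple gate over similarity scores. This is a minimal illustration under stated assumptions, not the system’s implementation: cosine similarity is assumed as the similarity measure, the two-dimensional vectors stand in for real embeddings, and the names `gated_context`, `KNOWLEDGE`, and `QUERY` are hypothetical.

```python
import math

THRESHOLD = 0.85  # setting with the highest accuracy in Table 2


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def gated_context(query_vec, knowledge, threshold=THRESHOLD):
    """Return entries whose similarity clears the threshold, best first.

    An empty list means no entry qualified, so the caller would answer
    without retrieved context (zero-shot).
    """
    kept = [(cosine(query_vec, vec), text) for text, vec in knowledge]
    kept = [(score, text) for score, text in kept if score >= threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in kept]


# Toy knowledge entries with placeholder 2-D "embeddings" (hypothetical).
KNOWLEDGE = [
    ("entry A", [0.9, 0.1]),
    ("entry B", [0.6, 0.8]),
    ("entry C", [0.9, 0.5]),
    ("entry D", [5.0, 3.5]),
]
QUERY = [1.0, 0.0]

for t in (0.80, 0.85, 0.90):
    print(t, gated_context(QUERY, KNOWLEDGE, t))
```

Lowering the threshold admits more weakly related entries, while raising it leaves fewer entries and, in the limit, none, which corresponds to the two failure modes discussed above.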
4.3. Experiment with Temperature Parameter
This section further investigates the impact of generation strategies on the stability of model reasoning. Using BianCang-Qwen2-7B as the testbed, a series of experiments were conducted to adjust randomness while keeping all other variables constant. The performance was evaluated under temperature settings of 0, 0.1, 0.3, and 0.7.
The experimental results in Table 3 show that when the temperature is set to 0, the model achieves the highest accuracy under both RAG and non-RAG conditions. This finding validates the clear advantage of adopting a “deterministic generation” strategy for tasks such as TCM diagnosis, which demand highly specialized and fact-based knowledge. As the temperature parameter increases to 0.3 or 0.7, the accuracy gradually declines from 87.7% to 82.3%. Excessive randomness introduced by higher temperatures leads the model to generate lexical predictions that deviate from established TCM reasoning patterns, manifesting as hallucinations. For professional evaluations such as the National TCM Licensing Examination, where answers are strictly defined, maintaining the lowest possible temperature to suppress randomness ensures greater rigor and stability in reasoning, while reducing the risk of uncontrolled generation.
The results in Table 3 further highlight that although randomness can enhance linguistic diversity, the stability of generation is far more critical in knowledge-intensive medical reasoning tasks. Therefore, for professional reasoning applications characterized by definitive correct answers and strict logical requirements, adopting low-temperature or deterministic generation strategies is an essential design choice to safeguard the credibility of inference.
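The effect of the temperature parameter on token selection can be illustrated with a temperature-scaled softmax. This is a generic sketch rather than the decoding code used in the study: the logits are hypothetical, and treating temperature 0 as greedy (argmax) selection is the usual convention and is assumed here.

```python
import math


def next_token_distribution(logits, temperature):
    """Softmax over logits / temperature; temperature 0 is treated as greedy argmax."""
    if temperature == 0:
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.5, 0.5]  # hypothetical scores for three candidate tokens
for t in (0, 0.1, 0.3, 0.7):
    print(t, [round(p, 3) for p in next_token_distribution(logits, t)])
```

As the temperature rises, probability mass moves from the top-scoring candidate to the alternatives, so sampling is more likely to select a lower-ranked token. This is the source of the run-to-run variation that deterministic decoding removes.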
4.4. McNemar’s Test Significance Analysis
To verify whether the performance improvement achieved by integrating the RAG architecture is statistically significant, this study applied McNemar’s test to paired samples comparing model outputs under “non-RAG” and “RAG” conditions. The results of the statistical analysis are presented in Table 4, Table 5 and Table 6.
Table 4 reports the analysis across the entire dataset (11,476 questions), while Table 5 and Table 6 provide subgroup analyses for Stage I (Fundamental TCM, 3583 questions) and Stage II (Clinical TCM, 7893 questions), respectively.
The combined results of the McNemar test in Table 4, Table 5 and Table 6 confirm that the RAG architecture demonstrates highly significant advantages at different stages of the examination tasks. In the overall dataset (Table 4), the chi-square statistic reached χ² = 2903.73 with p < 0.001, far below the significance threshold of α = 0.05, thus rejecting the null hypothesis. This indicates that the observed performance improvement is attributable to the retrieval mechanism rather than to random error. In addition to McNemar’s tests, the effect sizes were quantified using Cohen’s g. For comparisons before/after the integration of RAG (Table 4, Table 5 and Table 6), Cohen’s g indicates a substantial net improvement (overall g = 0.280, Stage I g = 0.272, Stage II g = 0.284), with extremely small p-values that remain significant after Holm correction (Holm-adjusted p < 10⁻¹⁹⁰).
Further stage-specific analysis shows that both Stage I (Fundamental Medicine) and Stage II (Clinical Medicine) achieved statistical significance. While Stage II contributed a larger absolute improvement due to its greater number of questions, the ratio of improvements to declines was similar across both stages: 18.4:1 in Stage I and 20.6:1 in Stage II. This indicates that RAG provides robust knowledge reinforcement for both fundamental and clinical TCM tasks, with somewhat stronger benefits in clinical reasoning.
Moreover, across all experimental groups, the proportion of declines caused by retrieval interference remained extremely low (approximately 1.4% overall). This demonstrates that the adopted semantic similarity threshold of 0.85, combined with optimized prompt sequencing, successfully balanced “information acquisition” with “noise suppression.” As a result, the model maintained a stable and precise reasoning quality even when confronted with the complexity of classical TCM texts.
Taken together, these statistical significance analyses further validate that performance gains primarily stem from systematic integration of retrieval augmentation rather than stochastic variation in generation. This underscores that in high-risk medical reasoning contexts, RAG is not merely a performance-enhancing tool but a structural design essential to ensure the reliability and reproducibility of inference.
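The transition quantities discussed in this section (the improvement-to-decline ratio and the proportion of declines) follow directly from the discordant counts of a paired comparison. The sketch below shows that arithmetic. The function name is hypothetical, and the counts are placeholders chosen for readability rather than the values in Tables 4 to 6.

```python
def transition_summary(improved, declined, total):
    """Summarize paired answer transitions between two inference conditions."""
    ratio = improved / declined if declined else float("inf")
    return {
        # incorrect -> correct transitions per correct -> incorrect transition
        "improve_to_decline_ratio": ratio,
        # share of all questions that became incorrect
        "decline_rate": declined / total,
        # equals the accuracy difference between the two conditions,
        # because questions answered identically cancel out
        "net_gain": (improved - declined) / total,
    }


# Placeholder counts (hypothetical), not taken from the study's tables.
summary = transition_summary(improved=200, declined=10, total=1000)
print(summary)
```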
4.5. LoRA + RAG Cross-Technology Integration Experiment
To further optimize system performance, this study explored the integration of LoRA-based supervised fine-tuning with the RAG framework, with the aim of validating the synergistic relationship between internalized knowledge and external retrieval. The experimental results are presented in Table 7.
After integrating LoRA with the RAG framework, the overall accuracy improved from 89.0% under pure RAG to 90.1%. In addition, the combined architecture yielded a notable reduction in standard deviation across repeated experiments. The improvement was more pronounced in Stage I (Fundamental TCM), with accuracy increasing by +1.3%, suggesting that SFT strengthened the stability of the model in handling foundational professional logic. In contrast, the gain in Stage II (Clinical TCM) was relatively modest (+0.8%), reflecting the greater difficulty in improving reasoning performance in complex clinical diagnostic tasks where SFT alone remains insufficient.
Together, these cross-technology integration results demonstrate that LoRA-based SFT and RAG play complementary roles in the reasoning process. LoRA contributes to stabilizing the internal reasoning structure, while RAG compensates for gaps in external knowledge. This finding underscores that for knowledge-intensive medical reasoning tasks, no single technique can simultaneously ensure both reasoning stability and knowledge completeness. Instead, the hybrid architecture offers greater practical feasibility by combining internal domain alignment with dynamic external grounding.
To assess the impact of supervised fine-tuning on retrieval-augmented generation, we also conducted a paired comparison between the vanilla RAG framework and the RAG + SFT architecture using McNemar’s test, focusing on answer-level transitions for identical questions.
The overall evaluation comprised 11,476 questions, as summarized in Table 8. Both models answered 9605 questions correctly and 547 questions incorrectly, indicating a substantial overlap in prediction outcomes. However, among questions with prediction discrepancies, RAG + SFT corrected 711 questions previously answered incorrectly by vanilla RAG, while 613 questions showed performance degradation after fine-tuning. McNemar’s test yielded a test statistic of χ² = 7.11 with p < 0.01, indicating a statistically significant overall improvement of RAG + SFT over the vanilla RAG baseline. This result demonstrates that supervised fine-tuning contributes to systematic correction of incorrect responses rather than simply preserving existing correct predictions. For the RAG vs. RAG + SFT comparisons (Table 8, Table 9 and Table 10), the effect sizes are small (overall g = 0.009, Stage I g = 0.0133, Stage II g = 0.006). After Holm correction across all three related tests (overall, Stage I, Stage II), the improvement remains significant for the overall dataset and Stage I (Holm-adjusted p = 0.0230 and 0.0444), but not for Stage II (Holm-adjusted p = 0.1032), suggesting that SFT produces a modest but reliable gain primarily in the foundational subset.
Stage I evaluation consisted of 3583 questions related to the knowledge of foundational Traditional Chinese Medicine (Table 9). Both models answered 3009 questions correctly and 152 questions incorrectly. Regarding the discrepancies, RAG + SFT improved 235 questions, while 187 questions exhibited degradation. McNemar’s test confirmed that this difference is statistically significant (χ² = 5.23, p < 0.05), indicating that supervised fine-tuning effectively improves model performance on structured, knowledge-intensive tasks.
Stage II evaluation included 7893 clinical reasoning questions (Table 10). Both models answered 6596 questions correctly and 395 questions incorrectly. RAG + SFT improved performance on 476 questions, while 426 questions showed degradation. Although the net improvement remained positive, McNemar’s test produced χ² = 2.66 with p = 0.10, indicating that the observed difference did not reach statistical significance under conventional thresholds.
Across all evaluation settings, RAG + SFT consistently demonstrated a higher number of incorrect-to-correct transitions than correct-to-incorrect transitions. Statistically significant improvements were observed for the overall dataset and Stage I tasks, while Stage II performance exhibited a non-significant but directionally consistent improvement trend.
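The statistics reported for the RAG vs. RAG + SFT comparison can be approximately recomputed from the discordant counts given above. The sketch below is a check written with the Python standard library, not the study’s statsmodels code. It assumes the continuity-corrected McNemar statistic and a Holm step-down adjustment over the three tests, and the function names are hypothetical.

```python
import math

# Discordant counts reported for RAG vs. RAG + SFT:
# (incorrect -> correct, correct -> incorrect).
COUNTS = {"overall": (711, 613), "stage_1": (235, 187), "stage_2": (476, 426)}


def mcnemar_cc(b, c):
    """Continuity-corrected chi-square statistic and its 1-df p-value."""
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    return chi2, math.erfc(math.sqrt(chi2 / 2.0))


def holm_adjust(p_values):
    """Holm step-down adjustment; returns adjusted p-values in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity of the adjusted values.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


names = list(COUNTS)
stats = [mcnemar_cc(*COUNTS[name]) for name in names]
chi2_by_name = {name: chi2 for name, (chi2, _) in zip(names, stats)}
adjusted = dict(zip(names, holm_adjust([p for _, p in stats])))

for name in names:
    print(name, round(chi2_by_name[name], 2), round(adjusted[name], 4))
```

Under these assumptions, the recomputed statistics agree with the reported χ² values to two decimals, and the adjusted p-values agree with the reported ones to within about 0.001. The small residual differences are plausibly due to rounding of intermediate values.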
4.6. LoRA-Only Ablation Study
To isolate the contribution of parameter-efficient supervised fine-tuning from retrieval augmentation, we conducted a LoRA-only ablation under the non-RAG (zero-shot) inference setting. Using the same stage-wise benchmark split (Stage I: 3583 questions; Stage II: 7893 questions), the LoRA-fine-tuned BianCang-Qwen2.5-7B model substantially outperformed the baseline without fine-tuning. Specifically, accuracy increased from 62.35% to 78.79% on Stage I and from 60.97% to 77.54% on Stage II, as shown in Table 11. These results indicate that LoRA-based SFT alone provides a large gain by internalizing domain-specific reasoning patterns, which allows the improvements to be attributed to individual components. At the same time, when considered together with the LoRA + RAG results reported in Section 4.5, the ablation supports a complementary interpretation in which LoRA improves internal reasoning alignment while retrieval further supplies external knowledge grounding.
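The stage-wise gains can be expressed in percentage points, and a question-weighted pooled accuracy can be derived from the stage sizes. The following sketch performs this arithmetic on the figures quoted above. The pooled values are derived here for illustration and are not figures reported in the study; the names `STAGES` and `pooled` are hypothetical.

```python
# Stage sizes and accuracies (percent) as quoted in the text above.
STAGES = {
    "Stage I": {"n": 3583, "baseline": 62.35, "lora": 78.79},
    "Stage II": {"n": 7893, "baseline": 60.97, "lora": 77.54},
}

# Gain per stage in percentage points.
gains = {
    name: round(stage["lora"] - stage["baseline"], 2)
    for name, stage in STAGES.items()
}


def pooled(key):
    """Question-weighted accuracy across both stages (a derived quantity)."""
    total = sum(stage["n"] for stage in STAGES.values())
    return sum(stage["n"] * stage[key] for stage in STAGES.values()) / total


print(gains)
print(round(pooled("baseline"), 2), round(pooled("lora"), 2))
```

Both stages gain roughly 16.5 percentage points, so the LoRA-only improvement is nearly uniform across foundational and clinical questions in this ablation.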
5. Discussion
A comparative analysis between existing studies and the methodological design of this research reveals fundamental differences in several key dimensions, including retrieval strategies, knowledge construction, model adaptation, and evaluation scale. These differences are not merely architectural choices, but reflect distinct design philosophies regarding how large language models should reason in knowledge-intensive medical domains. Throughout this section, we interpret the findings primarily in the context of standardized professional examination settings rather than direct clinical deployment.
Most prior studies rely on static or always-on RAG mechanisms that continuously inject external knowledge during inference. Although such designs increase information coverage, they also increase the risk that irrelevant or low-quality retrieval results interfere with reasoning, particularly in domains requiring high logical precision. In contrast, the dynamic semantic threshold mechanism proposed in this study activates retrieval only when both semantic relevance and knowledge necessity exceed a predefined confidence level. From a system design perspective, this conditional gating enables a more balanced integration of external evidence and internal reasoning, treating retrieval as a situational aid rather than a mandatory inference component. Instances falling below the threshold are therefore processed in zero-shot mode, since preliminary analysis showed that introducing weakly relevant evidence more often degraded reasoning consistency than improved accuracy, although this conservative design may exclude potentially useful lower-ranked contextual information.
With respect to knowledge construction strategies, existing RAG-based approaches often rely directly on raw literature or large-scale unstructured corpora, implicitly assuming that retrieval relevance alone is sufficient to improve reasoning quality. However, such approaches rarely account for the specific error patterns exhibited by language models in real reasoning processes. This study addresses this gap by introducing an error-driven conceptual knowledge construction strategy, in which benchmark evaluation errors are systematically analyzed and transformed into structured TCM knowledge units. As a result, external knowledge in the proposed system is no longer a passive supplement, but a targeted intervention designed to directly repair semantic reasoning gaps and factual misconceptions. This design choice underscores a more substantive role of RAG in professional reasoning tasks, shifting its function from information expansion to error correction.
At the level of model adaptation, prior research frequently omits fine-tuning altogether or relies on full-parameter fine-tuning schemes, which impose high computational costs and limit scalability in real-world deployments. Using parameter-efficient fine-tuning (PEFT) methods such as LoRA, this study demonstrates that domain-specific reasoning patterns can be effectively internalized without sacrificing the general capabilities of the base model. More importantly, the integration of LoRA contributes to stabilizing inference behavior when retrieval information is introduced, suggesting that PEFT plays a dual role: reducing resource requirements while reinforcing internal reasoning structures essential for knowledge-intensive tasks.
Experimental results further confirm that incorporating RAG substantially mitigates hallucinations and reasoning deficiencies in large language models applied to professional TCM reasoning. The observed accuracy gains of BianCang-Qwen2-7B after RAG integration, together with statistically significant McNemar’s test results, provide strong empirical evidence that retrieval augmentation enhances not only task performance but also inference reliability and reproducibility. These findings support the view that RAG functions as a structural safeguard against unsupported generation in high-rigor medical reasoning scenarios.
Additionally, the inference-time system configuration was found to play a critical role in maintaining reasoning stability. Setting the semantic similarity threshold to 0.85 achieved an effective balance between introducing the necessary external knowledge and suppressing irrelevant retrieval noise. When retrieval relevance fell below this threshold, the system automatically reverted to zero-shot inference, preventing unrelated knowledge from misleading logical judgment. Furthermore, adopting a deterministic generation strategy (temperature = 0) ensured high consistency and predictability, which is particularly important in professional medical tasks characterized by definitive correct answers and low tolerance for stochastic variation.
In cross-technology integration experiments, the combination of LoRA with RAG yielded further performance improvements, consistent with observations reported in other application domains [28]. LoRA is particularly attractive because it offers cost-effectiveness and robustness, especially in resource-constrained or iterative deployment settings. A more fine-grained analysis, however, indicates that these improvements are predominantly concentrated in foundational TCM knowledge tasks, whereas gains in Stage II tasks involving higher-complexity clinical reasoning are comparatively limited and do not reach statistical significance. This pattern suggests that advanced clinical reasoning, characterized by subtle syndrome differentiation and multi-cue diagnostic integration, may require more refined and higher-quality knowledge representations, as well as stronger alignment between internalized reasoning patterns and externally retrieved information, to fully benefit from hybrid architectures.
To address component isolation, we additionally report a LoRA-only ablation in the non-RAG inference setting, demonstrating that a substantial portion of the performance gain can be attributed to supervised fine-tuning. The remaining improvement observed in the entire LoRA + RAG system reflects the complementary benefit of retrieval-based knowledge grounding. This ablation clarifies the individual contribution of SFT, while further analyses, such as varying LoRA target modules or ranks, and comparisons with full-parameter fine-tuning, are left for future work in light of the study’s focus on resource-efficient deployment.
Finally, several directions for future work emerge from this study. First, model generalization should be further evaluated by extending beyond exam-oriented benchmarks to open-ended clinical case scenarios (e.g., free-text case narratives and multi-turn diagnostic reasoning). Second, the knowledge base could be expanded to incorporate both classical TCM texts and contemporary literature to enable a richer contextual grounding. Third, more fine-grained mechanisms for dynamically regulating the interaction between internal model knowledge and external retrieval cues should be explored. Addressing these challenges will be essential to overcome current performance bottlenecks and to support subsequent studies that move from standardized assessment settings to broader clinical case reasoning, while continuing to strengthen educational and professional reasoning support.
6. Conclusions
This study proposes and systematically validates a knowledge-intensive professional reasoning architecture that integrates Low-Rank Adaptation (LoRA) fine-tuning with retrieval-augmented generation. A large-scale empirical evaluation was conducted using 11,476 questions from the National TCM Licensing Examination spanning 21 years. The results demonstrate that an error-driven knowledge construction strategy, transforming reasoning gaps exposed during model inference into conceptualized and retrievable external knowledge resources, combined with a dynamic semantic similarity threshold mechanism set at 0.85, effectively enhances accuracy, stability, and response reliability in professional TCM reasoning tasks under standardized assessment settings, while substantially reducing hallucination phenomena.
Overall, this study confirms that a hybrid architecture combining parameter-efficient fine-tuning with dynamic retrieval augmentation can simultaneously stabilize internal reasoning structures and selectively reinforce external knowledge under limited computational resources. Rather than targeting direct clinical deployment, this design highlights the feasibility and effectiveness of generative AI systems for TCM professional education, examination preparation, and structured reasoning support. Moreover, the proposed framework provides a reproducible and extensible methodological reference for future research on knowledge-intensive reasoning systems, laying a solid foundation for future studies that may extend from standardized assessments to more open-ended clinical case reasoning.