Benchmark Evaluation of a Tool-Augmented Large Language Model Agent Using Traditional Asian Medicine Metadata
Abstract
1. Introduction
- We constructed a structured TAM metadata database (4780 entries across five entity types) with three function-calling tools and evaluated the tool-augmentation effect across four LLMs on 1900 public benchmark items.
- We demonstrated through statistical testing that tool augmentation can improve lightweight models on factual retrieval tasks.
- We conducted a three-way comparison showing that autonomous tool invocation (agent) outperforms fixed retrieval (RAG) on factual tasks.
- We identified substantial model-level variation in tool usage patterns (2–87%) and demonstrated through conditional analysis that models selectively invoke tools on items they find uncertain.
2. Related Work
2.1. RAG and Tool-Augmented LLMs
2.2. TAM/TCM Benchmarks and Knowledge Representation
3. Materials and Methods
3.1. System Architecture
3.1.1. Metadata Database
3.1.2. Function-Calling Tools
- (1)
- search_entity: Searches the TAM metadata database for terms. Herbs, acupoints, syndromes, and symptoms can be searched by Korean, Chinese, English, Pinyin, or WHO code, with optional category specification to limit the search scope. The function accepts three parameters: term (required, the search query string), category (optional, one of “herb”, “syndrome”, “tcm_symptom”, “mm_symptom”, or “acupoint”), and limit (optional, maximum number of results to return, default 5).
- (2)
- get_entity_info: Returns the complete metadata of a specific entity, including detailed information such as herb properties and meridian tropism, acupoint location and needling technique, and syndrome definitions. The function accepts two required parameters: term (the entity name or identifier) and category (the entity type).
- (3)
- get_multilingual_names: Returns the multilingual names (Korean, Chinese, English, Pinyin, and Latin) of a specific entity. The function accepts two required parameters: term and category.
3.1.3. LLM Agent Pipeline
3.2. Benchmarks
3.2.1. TCMBench
3.2.2. TCMEval-SDT
3.3. Experimental Design
3.3.1. Model Configuration
3.3.2. Experimental Conditions
- (1)
- Baseline: The LLM generates responses using only its parametric knowledge without tools. Function calling is deactivated; no tool definitions are included in the API request.
- (2)
- Tool-augmented (Agent): Three metadata retrieval tools are provided to the LLM through the function-calling API interface, which autonomously determines whether to invoke tool calls. Up to three tool-call iterations per item are allowed.
- (3)
- Non-agentic retrieval (RAG): For each benchmark item, the search_entity tool is automatically invoked with the question text as the query, and the top-5 retrieved metadata entries are prepended to the prompt as context. The LLM receives this retrieved context but cannot make additional tool calls. This condition serves as a baseline to isolate the effect of autonomous tool-calling from the effect of the metadata itself. The RAG condition was evaluated for GPT-5-mini, the model that showed the largest agent-condition improvement.
3.3.3. Prompt Optimization
3.4. Evaluation Methods
3.4.1. TCMBench Scoring
3.4.2. TCMEval-SDT Scoring
3.4.3. Tool Usage Statistics
3.5. Statistical Analysis
3.6. Cost Calculation
3.7. Experimental Setup
4. Results
4.1. Overall Performance Comparison
4.1.1. TCMBench Performance
4.1.2. TCMEval-SDT Performance
4.2. Tool-Augmentation Effect
4.3. Tool Usage Patterns
4.4. Retrieval and Tool-Calling Ablations
5. Discussion
6. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| LLM | Large Language Model |
| TCM | Traditional Chinese Medicine |
| DB | Database |
| RAG | Retrieval-Augmented Generation |
| SDT | Syndrome Differentiation Thought |
| WHO | World Health Organization |
| MM | Modern Medicine |
References
- Bicknell, B.T.; Butler, D.; Whalen, S.; Ricks, J.; Dixon, C.J.; Clark, A.B.; Spaedy, O.; Skelton, A.; Edupuganti, N.; Dzubinski, L. ChatGPT-4 Omni performance in USMLE disciplines and clinical skills: Comparative analysis. JMIR Med. Educ. 2024, 10, e63430. [Google Scholar] [CrossRef]
- Thirunavukarasu, A.J.; Ting, D.S.J.; Elangovan, K.; Gutierrez, L.; Tan, T.F.; Ting, D.S.W. Large language models in medicine. Nat. Med. 2023, 29, 1930–1940. [Google Scholar] [CrossRef]
- Singhal, K.; Tu, T.; Gottweis, J.; Sayres, R.; Wulczyn, E.; Amin, M.; Hou, L.; Clark, K.; Pfohl, S.R.; Cole-Lewis, H. Toward expert-level medical question answering with large language models. Nat. Med. 2025, 31, 943–950. [Google Scholar] [CrossRef]
- Jin, H.K.; Lee, H.E.; Kim, E. Performance of ChatGPT-3.5 and GPT-4 in national licensing examinations for medicine, pharmacy, dentistry, and nursing: A systematic review and meta-analysis. BMC Med. Educ. 2024, 24, 1013. [Google Scholar] [CrossRef]
- Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.S.; Wei, J.; Chung, H.W.; Scales, N.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S. Large language models encode clinical knowledge. Nature 2023, 620, 172–180. [Google Scholar] [CrossRef]
- Shool, S.; Adimi, S.; Saboori Amleshi, R.; Bitaraf, E.; Golpira, R.; Tara, M. A systematic review of large language model (LLM) evaluations in clinical medicine. BMC Med. Inform. Decis. Mak. 2025, 25, 117. [Google Scholar] [CrossRef] [PubMed]
- Ren, Y.; Luo, X.; Wang, Y.; Li, H.; Zhang, H.; Li, Z.; Lai, H.; Li, X.; Ge, L.; Estill, J. Large language models in traditional Chinese medicine: A scoping review. J. Evid.-Based Med. 2025, 18, e12658. [Google Scholar] [CrossRef] [PubMed]
- Liu, Y.; Yuan, Y.; Yan, K.; Li, Y.; Sacca, V.; Hodges, S.; Cannistra, M.; Jeong, P.; Wu, J.; Kong, J. Evaluating the role of large language models in traditional Chinese medicine diagnosis and treatment recommendations. npj Digit. Med. 2025, 8, 466. [Google Scholar] [CrossRef]
- Wang, Z.; Hao, M.; Peng, S.; Huang, Y.; Lu, Y.; Yao, K.; Yang, X.; Zhu, Y. TCMEval-SDT: A benchmark dataset for syndrome differentiation thought of traditional Chinese medicine. Sci. Data 2025, 12, 437. [Google Scholar] [CrossRef]
- Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.J.; Madotto, A.; Fung, P. Survey of hallucination in natural language generation. ACM Comput. Surv. 2023, 55, 248. [Google Scholar] [CrossRef]
- Pal, A.; Umapathi, L.K.; Sankarasubbu, M. Med-halt: Medical domain hallucination test for large language models. In Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL), Singapore, 6–7 December 2023; pp. 314–334. [Google Scholar]
- Yue, W.; Wang, X.; Zhu, W.; Guan, M.; Zheng, H.; Wang, P.; Sun, C.; Ma, X. Tcmbench: A comprehensive benchmark for evaluating large language models in traditional chinese medicine. arXiv 2024, arXiv:2406.01126. [Google Scholar]
- Chen, Q.; Hu, Y.; Peng, X.; Xie, Q.; Jin, Q.; Gilson, A.; Singer, M.B.; Ai, X.; Lai, P.-T.; Wang, Z. Benchmarking large language models for biomedical natural language processing applications and recommendations. Nat. Commun. 2025, 16, 3280. [Google Scholar] [CrossRef]
- Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-t.; Rocktäschel, T. Retrieval-augmented generation for knowledge-intensive nlp tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474. [Google Scholar]
- Amugongo, L.M.; Mascheroni, P.; Brooks, S.; Doering, S.; Seidel, J. Retrieval augmented generation for large language models in healthcare: A systematic review. PLoS Digit. Health 2025, 4, e0000877. [Google Scholar] [CrossRef] [PubMed]
- Schick, T.; Dwivedi-Yu, J.; Dessì, R.; Raileanu, R.; Lomeli, M.; Hambro, E.; Zettlemoyer, L.; Cancedda, N.; Scialom, T. Toolformer: Language models can teach themselves to use tools. Adv. Neural Inf. Process. Syst. 2023, 36, 68539–68551. [Google Scholar]
- Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.R.; Cao, Y. React: Synergizing reasoning and acting in language models. In Proceedings of the Eleventh International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
- Gao, Y.; Li, R.; Croxford, E.; Caskey, J.; Patterson, B.W.; Churpek, M.; Miller, T.; Dligach, D.; Afshar, M. Leveraging medical knowledge graphs into large language models for diagnosis prediction: Design and application study. JMIR AI 2025, 4, e58670. [Google Scholar] [CrossRef] [PubMed]
- Wołk, K. Evaluating Retrieval-Augmented Generation Variants for Clinical Decision Support: Hallucination Mitigation and Secure On-Premises Deployment. Electronics 2025, 14, 4227. [Google Scholar] [CrossRef]
- Duan, Y.; Zhou, Q.; Li, Y.; Qin, C.; Wang, Z.; Kan, H.; Hu, J. Research on a traditional Chinese medicine case-based question-answering system integrating large language models and knowledge graphs. Front. Med. 2025, 11, 1512329. [Google Scholar] [CrossRef]
- Wang, X.; Sun, X.; Yang, L.; Zhang, Y.; Yang, T.; Xie, J.; Hu, K. Reinforcement learning for LLM-based explainable TCM prescription recommendation with implicit preferences from small language models. Chin. Med. 2025, 20, 193. [Google Scholar] [CrossRef]
- Li, Y.-X.; Elnaffar, S.; Chen, H.-Y.; Chen, N.-J.; Lai, P.-Y.; Li, N.-Q.; Chong, Y.; Qiao, J.; Liu, T.; Peng, Z.-B. An LLM Method for Understanding Traditional Chinese Medicine: Mechanism Exploration and Innovative Application. IEEE J. Biomed. Health Inform. 2025. ahead of print. [Google Scholar] [CrossRef]
- Guo, P.; Jiang, M.; Hu, S.; Jiang, Q.; Li, L.; Wu, J.; Ma, Y.; Wu, Z. Advancing the modernization of traditional Chinese medicine through artificial intelligence and multimodal data integration. Chin. Med. 2026, 21, 54. [Google Scholar] [CrossRef] [PubMed]
- Gao, K.; Liu, L.; Lei, S.; Li, Z.; Huo, P.; Wang, Z.; Dong, L.; Deng, W.; Bu, D.; Zeng, X. HERB 2.0: An updated database integrating clinical and experimental evidence for traditional Chinese medicine. Nucleic Acids Res. 2025, 53, D1404–D1414. [Google Scholar] [CrossRef]
- Wu, Y.; Zhang, F.; Yang, K.; Fang, S.; Bu, D.; Li, H.; Sun, L.; Hu, H.; Gao, K.; Wang, W. SymMap: An integrative database of traditional Chinese medicine enhanced by symptom mapping. Nucleic Acids Res. 2019, 47, D1110–D1117. [Google Scholar] [CrossRef] [PubMed]
- Regional Office for the Western Pacific—World Health Organization. WHO Standard Acupuncture Point Locations in the Western Pacific Region; World Health Organization: Geneva, Switzerland, 2008. [Google Scholar]
- Lim, S. WHO standard acupuncture point locations. Evid.-Based Complement. Altern. Med. 2010, 7, 167–168. [Google Scholar]
- Gong, E.J.; Bang, C.S.; Lee, J.J.; Baik, G.H. Knowledge-practice performance gap in clinical large language models: Systematic review of 39 benchmarks. J. Med. Internet Res. 2025, 27, e84120. [Google Scholar] [CrossRef]
- Yue, W.; Ji, W.; Wang, X.; Ma, X.; Wang, P.; Wang, X. Sdpr: Prescription recommendation with syndrome differentiation in traditional chinese medicine. IEEE J. Biomed. Health Inform. 2025, 29, 3736–3749. [Google Scholar] [CrossRef]





| Entity Type | Count | Source | Key Attributes |
|---|---|---|---|
| Herbs | 698 | HERB DB, SymMap | Multilingual names, properties, meridians, classification |
| Syndromes | 233 | SymMap | Multilingual names, definition |
| TCM Symptoms | 2285 | SymMap | Multilingual names, definition, body part, nature |
| MM Symptoms | 1148 | SymMap | Multilingual name, definition, UMLS/MeSH/ICD-10 codes |
| Acupoints | 416 | WHO Standard | Multilingual names, WHO code, meridian, location, needling method |
| Total | 4780 |
| Section | n (R) | Baseline | RAG | Agent | Agent–RAG |
|---|---|---|---|---|---|
| TCMBench Term | 309 | 75.4% | 77.7% | 85.6% | +7.9 pp |
| TCMBench Other | 930 | 71.0% | 72.5% | 76.7% | +4.2 pp |
| SDT Patho. | 299 | 0.474 | 0.494 | 0.508 | +0.014 |
| SDT Syndr. | 299 | 0.517 | 0.506 | 0.478 | −0.028 |
| Model | Tool Used | n | Not Used | n | Baseline |
|---|---|---|---|---|---|
| GPT-5.2 | 72.7% | 11 | 81.4% | 317 | 84.1% |
| GPT-5-mini | 80.3% | 71 | 89.5% | 153 | 82.1% |
| Claude Sonnet 4.6 | 91.2% | 284 | 100.0% | 18 | 89.7% |
| Claude Haiku 4.5 | 71.5% | 137 | 77.3% | 176 | 73.5% |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Lee, W.-Y.; Kim, J.-H.; Leem, J.; Lee, B.-W.; Lee, S.; Kim, Y.W. Benchmark Evaluation of a Tool-Augmented Large Language Model Agent Using Traditional Asian Medicine Metadata. Appl. Sci. 2026, 16, 3377. https://doi.org/10.3390/app16073377
Lee W-Y, Kim J-H, Leem J, Lee B-W, Lee S, Kim YW. Benchmark Evaluation of a Tool-Augmented Large Language Model Agent Using Traditional Asian Medicine Metadata. Applied Sciences. 2026; 16(7):3377. https://doi.org/10.3390/app16073377
Chicago/Turabian StyleLee, Won-Yung, Ji-Hwan Kim, Jungtae Leem, Byung-Wook Lee, Seungho Lee, and Young Woo Kim. 2026. "Benchmark Evaluation of a Tool-Augmented Large Language Model Agent Using Traditional Asian Medicine Metadata" Applied Sciences 16, no. 7: 3377. https://doi.org/10.3390/app16073377
APA StyleLee, W.-Y., Kim, J.-H., Leem, J., Lee, B.-W., Lee, S., & Kim, Y. W. (2026). Benchmark Evaluation of a Tool-Augmented Large Language Model Agent Using Traditional Asian Medicine Metadata. Applied Sciences, 16(7), 3377. https://doi.org/10.3390/app16073377

