Next Article in Journal
Data-Efficient Multi-Objective Design of Auxiliary Localization Coils for Misalignment-Robust UAV WPT
Previous Article in Journal
Calibrating the Unit Cell Method for Jet-Grout Column Groups: A Field-Derived Mobilization Factor Approach
 
 
Font Type:
Arial Georgia Verdana
Font Size:
Aa Aa Aa
Line Spacing:
Column Width:
Background:
Article

Benchmark Evaluation of a Tool-Augmented Large Language Model Agent Using Traditional Asian Medicine Metadata

1
College of Korean Medicine, Wonkwang University, Iksan 54538, Republic of Korea
2
Research Center of Traditional Korean Medicine, Wonkwang University, Iksan 54538, Republic of Korea
3
Department of Sasang Constitutional Medicine, Division of Clinical Medicine, School of Korean Medicine, Pusan National University, Busan 46241, Republic of Korea
4
Department of Diagnostics, College of Korean Medicine, Wonkwang University, Iksan 54538, Republic of Korea
5
College of Korean Medicine, Dongguk University, Gyeongju 38066, Republic of Korea
6
Dongje Medical Co., Ltd., Daegu 42187, Republic of Korea
7
College of Korean Medicine, Woosuk University, Jeonju 54987, Republic of Korea
8
School of Korean Medicine, Dongguk University, Goyang-si 38066, Republic of Korea
*
Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2026, 16(7), 3377; https://doi.org/10.3390/app16073377
Submission received: 13 March 2026 / Revised: 30 March 2026 / Accepted: 30 March 2026 / Published: 31 March 2026

Abstract

This study evaluated a tool-augmented large language model (LLM) agent system utilizing a traditional Asian medicine (TAM) metadata. A structured information comprising 4780 entries across five entity types (herbs, syndromes, TCM symptoms, modern medicine symptoms, and acupoints) was constructed. Four LLMs (GPT-5.2, GPT-5-mini, Claude Sonnet 4.6, and Claude Haiku 4.5) were evaluated under baseline, non-agentic retrieval (RAG), and tool-augmented (agent) conditions using two public benchmarks: TCMBench (1300 items) and TCMEval-SDT (600 items). McNemar’s test and bootstrap confidence intervals were applied to examine performance differences. Tool augmentation showed a nominally significant improvement for GPT-5-mini on TCMBench term-related items (+4.5 percentage points on paired items, McNemar p = 0.034, 95% CI [+0.9, +8.5] pp), while Claude Sonnet 4.6 showed a nominally significant decline on SDT pathogenesis (−0.038, bootstrap p = 0.006). The agent condition outperformed the non-agentic RAG baseline for GPT-5-mini on TCMBench (agent 85.6% vs. RAG 77.7% vs. baseline 75.4%), suggesting that selective, autonomous tool invocation is more effective than fixed retrieval. Tool usage rates varied substantially across models (2–87%), with moderate usage (30–40%) associated with the most consistent gains. These findings provide empirical evidence on the potential and limitations of metadata-based tool augmentation for LLMs in the TAM domain.

1. Introduction

The rapid advancement of large language models (LLMs) has prompted active exploration of their applicability in the medical domain. LLMs have achieved expert-level performance on medical licensing examinations—for instance, GPT-4 Omni attained a 90.4% accuracy rate on the United States Medical Licensing Examination (USMLE) [1,2], and Med-PaLM 2 reached performance levels comparable to those of medical professionals on the MedQA benchmark [3]. A systematic review and meta-analysis by Moura et al. [4] further confirmed that GPT-4 consistently surpassed passing thresholds across national licensing examinations in medicine, pharmacy, dentistry, and nursing. Beyond examination performance, LLMs have demonstrated potential across diverse medical tasks, including clinical knowledge encoding [5], medical question answering, and clinical decision support. A comprehensive systematic review by Karabacak et al. [6], encompassing 761 studies, documented the exponential growth of LLM evaluation research in clinical medicine—from 1 study in 2019 to 557 in 2024—reflecting the rapidly expanding interest in this field.
In the field of traditional Asian medicine (TAM) and traditional Chinese medicine (TCM), attempts to utilize LLMs for syndrome differentiation, herbal knowledge question answering, and prescription recommendation have been increasing. Wang et al. [7] conducted a scoping review of LLM applications in TCM, identifying promising directions in clinical reasoning, diagnosis, and education. More recently, Wang et al. [8] evaluated the role of LLMs in TCM diagnosis and treatment recommendations using models such as GPT-4o and Qwen, and Zheng et al. [9] investigated approaches to evaluate and improve the syndrome differentiation thinking ability of LLMs. Despite these emerging efforts, systematic benchmark-based evaluation studies in this domain remain in their early stages.
However, LLMs rely on parametric knowledge derived from their training data, which can lead to hallucination—the generation of plausible but factually incorrect content—in tasks requiring precise factual information [10]. The medical domain is particularly vulnerable to such hallucinations; Umapathi et al. [11] developed Med-HALT, a hallucination test specifically designed for medical LLMs, demonstrating that even state-of-the-art models produce medically inaccurate outputs on reasoning and memory-based tasks. In TAM, accurate factual information such as the properties and meridian tropism of herbs, WHO standard codes for acupoints, and definitions of syndromes is clinically important. The limitations of LLMs in this regard have been confirmed through TCMBench, a benchmark for evaluating TCM knowledge [12], which reported that multiple LLMs showed insufficient performance on TCM licensing examination items, with particular room for improvement in tasks requiring accurate reproduction of specialized terminology. Meanwhile, the TCMEval-SDT benchmark provides a systematic framework for evaluating LLMs’ pathogenesis reasoning and syndrome differentiation abilities in clinical reasoning tasks [9]. Chen et al. [13] further demonstrated through a large-scale benchmarking study that LLM performance varies substantially depending on task type, with factual retrieval tasks and complex reasoning tasks showing markedly different patterns.
In this study, we developed a tool-augmented LLM agent system equipped with a TAM metadata database and function-calling-based tools and evaluated the tool-augmentation effect across four LLMs using two public benchmarks. We additionally compared the agent system against a non-agentic retrieval (RAG) baseline, applied statistical significance testing, and conducted confounding and sensitivity analyses. The main contributions of this study are as follows:
  • We constructed a structured TAM metadata database (4780 entries across five entity types) with three function-calling tools and evaluated the tool-augmentation effect across four LLMs on 1900 public benchmark items.
  • We demonstrated through statistical testing that tool augmentation can improve lightweight models on factual retrieval tasks.
  • We conducted a three-way comparison showing that autonomous tool invocation (agent) outperforms fixed retrieval (RAG) on factual tasks.
  • We identified substantial model-level variation in tool usage patterns (2–87%) and demonstrated through conditional analysis that models selectively invoke tools on items they find uncertain.

2. Related Work

2.1. RAG and Tool-Augmented LLMs

Approaches to mitigate LLM hallucination in knowledge-intensive tasks include retrieval-augmented generation (RAG) [14] and tool augmentation. Sahoo et al. [15] conducted a systematic review of RAG applications for LLMs in healthcare, highlighting both the potential and remaining challenges. Tool augmentation represents an alternative paradigm, providing LLMs with external tools for database queries, calculations, and API calls [16]. The ReAct framework proposed by Yao et al. [17] formalized this approach by synergizing reasoning and acting, allowing agents to interleave chain-of-thought reasoning with external tool invocations. In the medical domain, structured knowledge integration approaches such as leveraging medical knowledge graphs with LLMs for diagnosis prediction [18] have shown promise. Hoch and Simbeck evaluated retrieval-augmented generation variants for clinical decision support, demonstrating the importance of controlling retrieval behavior to balance accuracy and efficiency [19].

2.2. TAM/TCM Benchmarks and Knowledge Representation

TCMBench [12] provides a comprehensive benchmark for evaluating TCM knowledge across 16 subjects, while TCMEval-SDT [9] offers a framework for assessing syndrome differentiation and pathogenesis reasoning. Li et al. developed a TCM case-based question-answering system integrating LLMs with knowledge graphs [20]. Recent advances in TCM prescription recommendation include reinforcement learning-based approaches [21] and syndrome differentiation-based methods [22]. Guo et al. reviewed the advancing modernization of TCM through artificial intelligence and multimodal data integration [23]. However, no prior study has systematically evaluated tool-augmented LLM agent systems utilizing TAM-specific metadata with statistical rigor.

3. Materials and Methods

3.1. System Architecture

3.1.1. Metadata Database

We constructed a metadata database that structures TAM terminology and metadata for use through LLM function calling. The database comprises five entity types with a total of 4780 entries; the sources and key attributes of each entity are presented in Table 1.
The herb data (698 entries) were collected from the HERB DB [24] and SymMap [25] databases. HERB is a high-throughput experiment- and reference-guided database of TCM that integrates information on herbs, ingredients, targets, and diseases; it has recently been updated to version 2.0 with expanded clinical and experimental evidence. The herb entries include multilingual names along with attributes such as properties, meridian tropism, drug classification, and use parts. Korean names for all entity types were curated and verified by two Korean medicine experts. Syndrome data (233 entries), TCM symptom data (2285 entries), and modern medicine (MM) symptom data (1148 entries) were collected from the SymMap database, an integrative TCM database that maps relationships between traditional Chinese medicine symptoms, modern medicine symptoms, herbs, and molecular targets. The MM symptom data include UMLS, MeSH, and ICD-10 codes, enabling linkage with modern medical terminology systems.
Acupoint data (416 entries) were constructed based on the World Health Organization (WHO) standard acupoint code system [26]. The WHO standardization project established unified nomenclature and locations for 361 acupoints across 14 meridians [27], providing an internationally recognized framework for acupoint identification. The acupoint entries include attributes such as affiliated meridian, location, point selection method, and needling technique. During preprocessing, entries with suppression flags set in the original data were removed, attribute names were standardized, and data were stored in CSV (comma-separated values) format. At runtime, the data were loaded into pandas DataFrames in memory for search operations.

3.1.2. Function-Calling Tools

Three function-calling tools were designed to enable LLMs to utilize the metadata database. Function calling (also referred to as tool use) is a mechanism provided by LLM APIs that allows models to generate structured requests for external function execution during response generation. Each tool is automatically invoked upon the LLM’s request and returns search results in JSON (JavaScript Object Notation) format. The tool definitions were provided to the LLM through the API’s function-calling interface, allowing the model to autonomously decide when and which tool to invoke based on the input query.
(1)
search_entity: Searches the TAM metadata database for terms. Herbs, acupoints, syndromes, and symptoms can be searched by Korean, Chinese, English, Pinyin, or WHO code, with optional category specification to limit the search scope. The function accepts three parameters: term (required, the search query string), category (optional, one of “herb”, “syndrome”, “tcm_symptom”, “mm_symptom”, or “acupoint”), and limit (optional, maximum number of results to return, default 5).
(2)
get_entity_info: Returns the complete metadata of a specific entity, including detailed information such as herb properties and meridian tropism, acupoint location and needling technique, and syndrome definitions. The function accepts two required parameters: term (the entity name or identifier) and category (the entity type).
(3)
get_multilingual_names: Returns the multilingual names (Korean, Chinese, English, Pinyin, and Latin) of a specific entity. The function accepts two required parameters: term and category.

3.1.3. LLM Agent Pipeline

When a user query (or benchmark item) is input, the LLM analyzes the query and, under the tool-augmented condition, optionally queries metadata through function calling before generating the final response. The overall pipeline operates as follows: (1) the benchmark item is formatted as a user message along with a system prompt; (2) the LLM processes the input and may generate one or more function-calling requests; (3) each requested function is executed against the metadata database, and the results are appended to the conversation context; (4) the LLM generates the final response incorporating both its parametric knowledge and any retrieved metadata.
The system prompt for the tool-augmented condition included: (1) role designation as a TAM expert, (2) information about the types and number of available metadata entries (e.g., “You have access to a TAM metadata database containing 698 herbs, 233 syndromes, 2285 TCM symptoms, 1148 MM symptoms, and 416 acupoints”), and (3) tool usage principles—“Tools are supplementary; first derive an answer from your own knowledge and only verify uncertain facts with tools; no more than three tool calls per item are recommended.” Under the baseline condition, only the TAM expert role was designated without any tool-related content. For each item, the LLM was allowed up to three iterations for tool calls, and the maximum number of generated tokens was set to 2048. Temperature was set to 0 for all models.

3.2. Benchmarks

Two public benchmarks in the TAM domain were used to evaluate the system.

3.2.1. TCMBench

TCMBench [12] is a public benchmark for evaluating TCM knowledge. In this study, we used 1300 expert-verified items. The items are five-choice multiple-choice questions written in Chinese, spanning 16 subjects including Basic Theory of TCM, TCM Diagnostics, Chinese Materia Medica, Formulas, and Acupuncture.
We classified items from five subjects directly related to the metadata database content (Basic Theory of TCM, TCM Diagnostics, Chinese Materia Medica, Formulas, and Acupuncture) as ‘term-related’ items (331 items) and items from the remaining 11 subjects as ‘other’ items (969 items). This classification was designed to analyze whether the tool-augmentation effect differs by task type—specifically, whether items requiring factual recall of metadata-relevant terminology show greater benefit from tool augmentation compared to items requiring broader clinical knowledge.

3.2.2. TCMEval-SDT

TCMEval-SDT [9] is a benchmark for evaluating syndrome differentiation ability on clinical cases. Each of 300 clinical cases requires performance of two tasks—pathogenesis reasoning and syndrome identification—yielding a total of 600 items. Each item has multiple correct answers in an open-ended format. Unlike TCMBench’s factual recall format, TCMEval-SDT requires comprehensive clinical reasoning, integrating patient symptoms, medical history, and theoretical frameworks to arrive at correct diagnoses. This contrast in task requirements provides an opportunity to evaluate the differential effects of tool augmentation across distinct cognitive demands.

3.3. Experimental Design

3.3.1. Model Configuration

Four LLMs from two API providers were selected (Supplementary Table S1): one large and one lightweight model from each provider, forming large–lightweight model pairs to enable comparison of tool-augmentation effects by model scale. OpenAI’s GPT-5.2 (large, $1.75/$14.00 per million input/output tokens) and GPT-5-mini (lightweight, $0.25/$2.00), and Anthropic’s Claude Sonnet 4.6 (large, $3.00/$15.00) and Claude Haiku 4.5 (lightweight, $0.80/$4.00) were selected.

3.3.2. Experimental Conditions

Three conditions were compared:
(1)
Baseline: The LLM generates responses using only its parametric knowledge without tools. Function calling is deactivated; no tool definitions are included in the API request.
(2)
Tool-augmented (Agent): Three metadata retrieval tools are provided to the LLM through the function-calling API interface, which autonomously determines whether to invoke tool calls. Up to three tool-call iterations per item are allowed.
(3)
Non-agentic retrieval (RAG): For each benchmark item, the search_entity tool is automatically invoked with the question text as the query, and the top-5 retrieved metadata entries are prepended to the prompt as context. The LLM receives this retrieved context but cannot make additional tool calls. This condition serves as a baseline to isolate the effect of autonomous tool-calling from the effect of the metadata itself. The RAG condition was evaluated for GPT-5-mini, the model that showed the largest agent-condition improvement.
The total experimental scale for the main experiment comprised 1900 items × 4 models × 2 conditions = 15,200 API calls, plus 1900 items × 1 model for the RAG condition, with 0.5-s intervals between calls to comply with rate limits.

3.3.3. Prompt Optimization

A pilot experiment (80 items, 20 per benchmark section) revealed that unrestricted tool access led to excessive invocations in Claude Sonnet 4.6, degrading SDT performance. Accordingly, the system prompt for the tool-augmented condition included usage guidelines instructing models to derive answers independently first and use tools only to verify uncertain facts, with a recommended limit of three calls per item. The 80 pilot items were retained in the main experiment; a sensitivity analysis confirmed their inclusion did not materially affect results (maximum delta difference: 0.85 pp; Supplementary Table S2).

3.4. Evaluation Methods

3.4.1. TCMBench Scoring

TCMBench was scored using single-answer accuracy. To extract the correct answer alphabet (A–E) from the LLM’s free-text response, multi-stage regular expression patterns were applied in sequence: (i) explicit answer markers (“Answer:”, “答案:” etc.), (ii) emphasis markers (bold or underlined single letters), (iii) single alphabets on the first line, and (iv) end-of-response search, thereby maximizing the extraction success rate. This multi-stage approach achieved a 99.7% extraction rate (1296 out of 1300 items). Accuracy was calculated as the proportion of items where the predicted answer matched the correct answer.

3.4.2. TCMEval-SDT Scoring

TCMEval-SDT was evaluated using the multi-answer F1 score. For each item, the LLM’s response was parsed to extract predicted answer terms, which were then compared against the reference answer set. Precision (|predicted ∩ correct|/|predicted|) and recall (|predicted ∩ correct|/|correct|) were calculated between the predicted and correct answer sets, and the F1 score was computed as their harmonic mean. This metric accounts for both the completeness (recall) and precision of the model’s predictions, providing a balanced evaluation for items with multiple correct answers. We acknowledge that this string-matching approach may not fully capture synonyms or semantically equivalent expressions; however, the consistent application of the same scoring across all conditions ensures valid within-study comparisons.

3.4.3. Tool Usage Statistics

To analyze tool usage patterns under the tool-augmented condition, the tool usage rate (proportion of items with at least one tool call) and average number of tool calls per item were calculated for each model. Additionally, we recorded which specific tools were invoked (search_entity, get_entity_info, or get_multilingual_names) to characterize the tool selection strategies of different models.

3.5. Statistical Analysis

To assess the statistical significance of performance differences between baseline and tool-augmented conditions, two complementary approaches were employed. For TCMBench (binary correct/incorrect outcomes on paired items), McNemar’s test was applied. For each model–section pair, we constructed a 2 × 2 contingency table of item-level outcome changes: b (baseline wrong → tool correct) and c (baseline correct → tool wrong). Only items with valid answer extractions under both conditions were included in the paired analysis. For all task sections (including SDT continuous F1 scores), nonparametric bootstrap confidence intervals were computed. For each model–section pair, 10,000 bootstrap resamples of item-level score differences (tool − baseline) were drawn with replacement, and 95% percentile confidence intervals were derived. A two-sided bootstrap p-value was calculated as twice the proportion of resamples on the opposite side of zero from the observed mean delta. All significance tests were conducted at α = 0.05. As this is an exploratory study with 16 model–section comparisons, the familywise error rate (FWE) under the global null hypothesis would be approximately 0.56. Results that reached p < 0.05 but did not meet the Bonferroni-corrected threshold (α = 0.003) are referred to as nominally significant throughout this paper.

3.6. Cost Calculation

API costs were calculated using the formula: cost = (input_tokens/1,000,000 × input_price) + (output_tokens/1,000,000 × output_price), where input_price and output_price correspond to the per-million-token rates listed in Supplementary Table S1. Under the tool-augmented condition, costs include overhead from tool definition tokens (included in every API call) and tool result tokens (appended to the conversation context after each tool invocation).

3.7. Experimental Setup

Experiments were conducted in a Python 3.11 (Anaconda) environment. Key packages included the OpenAI Python SDK (version 1.82.0), Anthropic Python SDK (version 0.84.0), pandas 2.2.3, and rapidfuzz 3.14.3. The metadata database was loaded from CSV files into pandas DataFrames; no separate vector database or embedding model was used. All API calls were made sequentially with 0.5-s inter-call delays.

4. Results

4.1. Overall Performance Comparison

4.1.1. TCMBench Performance

The subject-level performance of the four models on TCMBench is shown in Figure 1. Under the baseline condition, Claude Sonnet 4.6 achieved the highest accuracy with 89.4% on term-related items and 87.4% on other items, followed by GPT-5.2 at 84.3% (term) and 78.8% (other). The lightweight models, GPT-5-mini (75.4% term, 71.0% other) and Claude Haiku 4.5 (73.7% term, 70.3% other), showed 10–17 percentage points lower accuracy compared to the large models.
Under the tool-augmented condition, GPT-5-mini showed the most pronounced performance change, improving from 75.4% to 85.6% (+10.2 percentage points) on term-related items and from 71.0% to 76.7% (+5.7 percentage points) on other items. Claude Sonnet 4.6 improved from 89.4% to 91.7% (+2.3 percentage points) on term-related items. In contrast, GPT-5.2 showed slight decreases under tool augmentation (−3.2 percentage points on term, −2.2 on other), while Claude Haiku 4.5 showed marginal changes (+1.1 on term, +0.4 on other).

4.1.2. TCMEval-SDT Performance

Under the baseline condition, Claude Sonnet 4.6 showed the highest performance with 0.652 on pathogenesis and 0.671 on syndrome, followed closely by GPT-5.2 at 0.611 (pathogenesis) and 0.663 (syndrome). Under the tool-augmented condition, SDT performance showed slight decreases or no change for most models. Only GPT-5-mini improved on the pathogenesis task (from 0.474 to 0.508, +0.034), while declining on the syndrome task (from 0.517 to 0.478, −0.039). Claude Sonnet 4.6 showed declines of −0.038 (pathogenesis) and −0.021 (syndrome) (Figure 2).

4.2. Tool-Augmentation Effect

We observed tool-augmentation effects that varied substantially by task type and model (Figure 3; Supplementary Table S3). On TCMBench term-related items, GPT-5-minishowed a nominally significant improvement (+4.5 pp on paired items; McNemar p = 0.034, bootstrap 95% CI [+0.9, +8.5] pp), while no other model reached nominal significance. The +4.5 pp delta from paired analysis differs from the +10.2 pp overall accuracy gain (Section 4.1) because the paired analysis excludes items where answer extraction failed in either condition; GPT-5-mini’s high extraction failure rate (26.1%) under tool augmentation substantially reduced the paired sample. On SDT tasks, Claude Sonnet 4.6 showed a nominally significant decline on pathogenesis (Δ = −0.038, p = 0.006).

4.3. Tool Usage Patterns

Tool usage rates under the tool-augmented condition varied substantially across models (Figure 4a). GPT-5.2 barely utilized the provided tools (2%, 37/1900 items), whereas Claude Haiku 4.5 (30%) and GPT-5-mini (40%) showed selective usage. Claude Sonnet 4.6 was the most active (87%, average 2.6 calls/item).
API costs scaled with tool usage rate (Figure 4b). GPT-5-mini under the tool-augmented condition incurred costs comparable to GPT-5.2 baseline, while achieving similar TCMBench term accuracy (85.6% vs. 84.3%), suggesting that a lightweight model with tool augmentation can match a larger model’s factual performance at similar cost. In contrast, Claude Sonnet 4.6’s tool-augmented condition was approximately 10× more expensive than the GPT models, driven by higher per-token pricing and frequent tool invocations.
Tool-augmented performance profiles across four task axes revealed distinct model strengths (Figure 5). Claude Sonnet 4.6 exhibited the largest overall coverage, while GPT-5-mini with tool augmentation approached large-model performance on TCMBench tasks.

4.4. Retrieval and Tool-Calling Ablations

To isolate the effect of autonomous tool-calling from the metadata itself, we compared three conditions for GPT-5-mini: baseline, non-agentic retrieval (RAG), and agent (Table 2). On TCMBench, a consistent pattern of Agent > RAG > Baseline was observed, with the agent outperforming RAG by +7.9 pp on term-related items and +4.2 pp on other items. The RAG condition provided only modest improvement over baseline (+2.3 pp on term), suggesting that autonomous, selective retrieval is substantially more effective than fixed retrieval. On SDT tasks, differences between conditions were small and inconsistent.
We further analyzed performance conditioned on whether models actually invoked tools for each item (Table 3). Across all models, accuracy on tool-invoked items was consistently lower than on items without tool use, indicating that models selectively invoke tools on uncertain or difficult items rather than uniformly. This selective behavior complicates causal interpretation, as the observed performance differences reflect both the direct effect of tool-provided information and model-specific tool-calling tendencies.

5. Discussion

This study evaluated a tool-augmented LLM agent system for TAM across four models under baseline, RAG, and agent conditions. The main findings were threefold: tool augmentation showed a nominally significant improvement for a lightweight model (GPT-5-mini) on factual retrieval tasks (McNemar p = 0.034) but did not benefit clinical reasoning tasks; autonomous agent-based retrieval outperformed non-agentic RAG (85.6% vs. 77.7% on TCMBench term); and tool usage patterns varied widely across models (2–87%), with moderate usage associated with the most consistent gains.
Tool augmentation effects were strongly dependent on both model scale and task type. GPT-5-mini showed the only nominally significant improvement on factual retrieval tasks (McNemar p = 0.034), while high-baseline models showed no significant changes. However, tool usage rates varied dramatically across models (2–87%), and the conditional analysis (Table 3) revealed that models selectively invoke tools on uncertain items, complicating causal interpretation. On SDT clinical reasoning tasks, tool augmentation was associated with slight declines, with Claude Sonnet 4.6 showing a nominally significant decrease on pathogenesis (p = 0.006), representing the strongest individual effect observed across all comparisons. This task-type dependency is consistent with the knowledge-practice gap identified by Bedi et al. [28]: the metadata database provides factual information that directly supports recall but contributes less to clinical reasoning requiring synthesis of patient presentations with theoretical frameworks. Future tool designs could incorporate procedural knowledge such as diagnostic reasoning pathways and syndrome differentiation decision trees [21,29].
The three-way comparison (Table 2) demonstrated that autonomous agent-based retrieval consistently outperformed non-agentic RAG on factual tasks (+7.9 pp over RAG on term items), suggesting that selective, query-targeted retrieval is substantially more effective than indiscriminate fixed retrieval [14,15]. Tool usage patterns further revealed distinct model-specific behaviors: GPT-5.2 rarely invoked tools (2%), reflecting high parametric confidence, while Claude Sonnet 4.6’s aggressive usage (87%) required prompt-based control to prevent context window saturation and performance degradation. GPT-5-mini’s moderate usage (40%) was associated with the most consistent gains, aligning with the principle that effective tool use requires selective invocation [16,17].
From a cost perspective, GPT-5-mini with tool augmentation achieved factual accuracy comparable to GPT-5.2’s baseline at similar cost, while Claude Sonnet 4.6’s tool-augmented condition was approximately 10× more expensive, driven by higher per-token pricing and frequent tool invocations (Supplementary Table S1, Figure 4b). For cost-sensitive applications, a lightweight model with tool augmentation may represent a viable alternative to deploying a large model, though this comparison is limited to per-token API pricing.
This study has several limitations. The answer extraction failure rate varied across models, with GPT-5-mini showing a particularly high rate under tool augmentation (26.1%), potentially introducing selection bias. Despite this attrition, GPT-5-mini’s improvement remained nominally significant on the reduced paired sample (McNemar p = 0.034, n ≈ 245 scorable term items out of 331 total), suggesting the effect is robust to the sample reduction. The SDT F1 metric relies on string matching and may underestimate performance for paraphrased responses. As an exploratory analysis, results reaching p < 0.05 were reported as nominally significant even when the Bonferroni-corrected threshold (α = 0.003 for 16 comparisons) was not met. Future work with expanded model coverage, larger benchmark sets, and repeated sampling could strengthen the statistical power for more definitive conclusions. Additionally, the Gemini model family was excluded, and the RAG comparison was limited to GPT-5-mini only. The evaluation code and metadata database are planned for public release upon acceptance.

6. Conclusions

This study provides empirical evidence that tool-augmented LLM agents can improve factual retrieval in the TAM domain, particularly for lightweight models, while autonomous tool invocation outperforms fixed retrieval strategies. Furthermore, we identified that tool-calling frequency is a critical factor influencing agent performance, and that moderate, selective invocation yields the most consistent gains. These findings highlight the need for task-aligned tool design incorporating procedural and clinical knowledge, and model-specific prompt optimization for tool invocation control when developing medical AI agent systems.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/app16073377/s1, Table S1: Model Configuration and API Pricing; Table S2: Pilot Sensitivity Analysis; Table S3: Statistical Testing Results.

Author Contributions

Conceptualization, W.-Y.L. and Y.W.K.; methodology, W.-Y.L. and J.-H.K.; software, W.-Y.L.; validation, W.-Y.L., J.-H.K. and S.L.; formal analysis, W.-Y.L.; investigation, W.-Y.L. and J.-H.K.; resources, B.-W.L. and J.L.; data curation, W.-Y.L. and J.L.; writing—original draft preparation, W.-Y.L.; writing—review and editing, J.-H.K., J.L., S.L. and Y.W.K.; visualization, W.-Y.L.; supervision, S.L. and Y.W.K.; project administration, Y.W.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by grants from the National Research Foundation of Korea (NRF), funded by the Korean government (MSIT) (grant numbers 2022R1I1A2066653, and RS-2026-25494818) and Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health & Welfare, Republic of Korea (RS-2025-02303579).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The benchmarks used in this study are publicly available: TCMBench (https://github.com/ywjawmw/TCMBench, accessed on 15 January 2026) and TCMEval-SDT (https://github.com/zhuyan166/TCMEval/tree/main/evaluation/TCMEval-SDT, accessed on 8 January 2026). The metadata database and evaluation code are available at https://github.com/wonyung-lee/km-agent, accessed on 29 March 2026.

Conflicts of Interest

Author Byung-Wook Lee was employed by the company Dongje Medical Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LLMLarge Language Model
TCMTraditional Chinese Medicine
DBDatabase
RAGRetrieval-Augmented Generation
SDTSyndrome Differentiation Thought
WHOWorld Health Organization
MMModern Medicine

References

  1. Bicknell, B.T.; Butler, D.; Whalen, S.; Ricks, J.; Dixon, C.J.; Clark, A.B.; Spaedy, O.; Skelton, A.; Edupuganti, N.; Dzubinski, L. ChatGPT-4 Omni performance in USMLE disciplines and clinical skills: Comparative analysis. JMIR Med. Educ. 2024, 10, e63430. [Google Scholar] [CrossRef]
  2. Thirunavukarasu, A.J.; Ting, D.S.J.; Elangovan, K.; Gutierrez, L.; Tan, T.F.; Ting, D.S.W. Large language models in medicine. Nat. Med. 2023, 29, 1930–1940. [Google Scholar] [CrossRef]
  3. Singhal, K.; Tu, T.; Gottweis, J.; Sayres, R.; Wulczyn, E.; Amin, M.; Hou, L.; Clark, K.; Pfohl, S.R.; Cole-Lewis, H. Toward expert-level medical question answering with large language models. Nat. Med. 2025, 31, 943–950. [Google Scholar] [CrossRef]
  4. Jin, H.K.; Lee, H.E.; Kim, E. Performance of ChatGPT-3.5 and GPT-4 in national licensing examinations for medicine, pharmacy, dentistry, and nursing: A systematic review and meta-analysis. BMC Med. Educ. 2024, 24, 1013. [Google Scholar] [CrossRef]
  5. Singhal, K.; Azizi, S.; Tu, T.; Mahdavi, S.S.; Wei, J.; Chung, H.W.; Scales, N.; Tanwani, A.; Cole-Lewis, H.; Pfohl, S. Large language models encode clinical knowledge. Nature 2023, 620, 172–180. [Google Scholar] [CrossRef]
  6. Shool, S.; Adimi, S.; Saboori Amleshi, R.; Bitaraf, E.; Golpira, R.; Tara, M. A systematic review of large language model (LLM) evaluations in clinical medicine. BMC Med. Inform. Decis. Mak. 2025, 25, 117. [Google Scholar] [CrossRef] [PubMed]
  7. Ren, Y.; Luo, X.; Wang, Y.; Li, H.; Zhang, H.; Li, Z.; Lai, H.; Li, X.; Ge, L.; Estill, J. Large language models in traditional Chinese medicine: A scoping review. J. Evid.-Based Med. 2025, 18, e12658. [Google Scholar] [CrossRef] [PubMed]
  8. Liu, Y.; Yuan, Y.; Yan, K.; Li, Y.; Sacca, V.; Hodges, S.; Cannistra, M.; Jeong, P.; Wu, J.; Kong, J. Evaluating the role of large language models in traditional Chinese medicine diagnosis and treatment recommendations. npj Digit. Med. 2025, 8, 466. [Google Scholar] [CrossRef]
  9. Wang, Z.; Hao, M.; Peng, S.; Huang, Y.; Lu, Y.; Yao, K.; Yang, X.; Zhu, Y. TCMEval-SDT: A benchmark dataset for syndrome differentiation thought of traditional Chinese medicine. Sci. Data 2025, 12, 437. [Google Scholar] [CrossRef]
  10. Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.J.; Madotto, A.; Fung, P. Survey of hallucination in natural language generation. ACM Comput. Surv. 2023, 55, 248. [Google Scholar] [CrossRef]
  11. Pal, A.; Umapathi, L.K.; Sankarasubbu, M. Med-halt: Medical domain hallucination test for large language models. In Proceedings of the 27th Conference on Computational Natural Language Learning (CoNLL), Singapore, 6–7 December 2023; pp. 314–334. [Google Scholar]
  12. Yue, W.; Wang, X.; Zhu, W.; Guan, M.; Zheng, H.; Wang, P.; Sun, C.; Ma, X. Tcmbench: A comprehensive benchmark for evaluating large language models in traditional chinese medicine. arXiv 2024, arXiv:2406.01126. [Google Scholar]
  13. Chen, Q.; Hu, Y.; Peng, X.; Xie, Q.; Jin, Q.; Gilson, A.; Singer, M.B.; Ai, X.; Lai, P.-T.; Wang, Z. Benchmarking large language models for biomedical natural language processing applications and recommendations. Nat. Commun. 2025, 16, 3280. [Google Scholar] [CrossRef]
  14. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.-t.; Rocktäschel, T. Retrieval-augmented generation for knowledge-intensive nlp tasks. Adv. Neural Inf. Process. Syst. 2020, 33, 9459–9474. [Google Scholar]
  15. Amugongo, L.M.; Mascheroni, P.; Brooks, S.; Doering, S.; Seidel, J. Retrieval augmented generation for large language models in healthcare: A systematic review. PLoS Digit. Health 2025, 4, e0000877. [Google Scholar] [CrossRef] [PubMed]
  16. Schick, T.; Dwivedi-Yu, J.; Dessì, R.; Raileanu, R.; Lomeli, M.; Hambro, E.; Zettlemoyer, L.; Cancedda, N.; Scialom, T. Toolformer: Language models can teach themselves to use tools. Adv. Neural Inf. Process. Syst. 2023, 36, 68539–68551. [Google Scholar]
  17. Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.R.; Cao, Y. React: Synergizing reasoning and acting in language models. In Proceedings of the Eleventh International Conference on Learning Representations, Virtual, 25–29 April 2022. [Google Scholar]
  18. Gao, Y.; Li, R.; Croxford, E.; Caskey, J.; Patterson, B.W.; Churpek, M.; Miller, T.; Dligach, D.; Afshar, M. Leveraging medical knowledge graphs into large language models for diagnosis prediction: Design and application study. JMIR AI 2025, 4, e58670. [Google Scholar] [CrossRef] [PubMed]
  19. Wołk, K. Evaluating Retrieval-Augmented Generation Variants for Clinical Decision Support: Hallucination Mitigation and Secure On-Premises Deployment. Electronics 2025, 14, 4227. [Google Scholar] [CrossRef]
  20. Duan, Y.; Zhou, Q.; Li, Y.; Qin, C.; Wang, Z.; Kan, H.; Hu, J. Research on a traditional Chinese medicine case-based question-answering system integrating large language models and knowledge graphs. Front. Med. 2025, 11, 1512329. [Google Scholar] [CrossRef]
  21. Wang, X.; Sun, X.; Yang, L.; Zhang, Y.; Yang, T.; Xie, J.; Hu, K. Reinforcement learning for LLM-based explainable TCM prescription recommendation with implicit preferences from small language models. Chin. Med. 2025, 20, 193. [Google Scholar] [CrossRef]
  22. Li, Y.-X.; Elnaffar, S.; Chen, H.-Y.; Chen, N.-J.; Lai, P.-Y.; Li, N.-Q.; Chong, Y.; Qiao, J.; Liu, T.; Peng, Z.-B. An LLM Method for Understanding Traditional Chinese Medicine: Mechanism Exploration and Innovative Application. IEEE J. Biomed. Health Inform. 2025. ahead of print. [Google Scholar] [CrossRef]
  23. Guo, P.; Jiang, M.; Hu, S.; Jiang, Q.; Li, L.; Wu, J.; Ma, Y.; Wu, Z. Advancing the modernization of traditional Chinese medicine through artificial intelligence and multimodal data integration. Chin. Med. 2026, 21, 54. [Google Scholar] [CrossRef] [PubMed]
  24. Gao, K.; Liu, L.; Lei, S.; Li, Z.; Huo, P.; Wang, Z.; Dong, L.; Deng, W.; Bu, D.; Zeng, X. HERB 2.0: An updated database integrating clinical and experimental evidence for traditional Chinese medicine. Nucleic Acids Res. 2025, 53, D1404–D1414. [Google Scholar] [CrossRef]
  25. Wu, Y.; Zhang, F.; Yang, K.; Fang, S.; Bu, D.; Li, H.; Sun, L.; Hu, H.; Gao, K.; Wang, W. SymMap: An integrative database of traditional Chinese medicine enhanced by symptom mapping. Nucleic Acids Res. 2019, 47, D1110–D1117. [Google Scholar] [CrossRef] [PubMed]
  26. Regional Office for the Western Pacific—World Health Organization. WHO Standard Acupuncture Point Locations in the Western Pacific Region; World Health Organization: Geneva, Switzerland, 2008. [Google Scholar]
  27. Lim, S. WHO standard acupuncture point locations. Evid.-Based Complement. Altern. Med. 2010, 7, 167–168. [Google Scholar]
  28. Gong, E.J.; Bang, C.S.; Lee, J.J.; Baik, G.H. Knowledge-practice performance gap in clinical large language models: Systematic review of 39 benchmarks. J. Med. Internet Res. 2025, 27, e84120. [Google Scholar] [CrossRef]
  29. Yue, W.; Ji, W.; Wang, X.; Ma, X.; Wang, P.; Wang, X. Sdpr: Prescription recommendation with syndrome differentiation in traditional chinese medicine. IEEE J. Biomed. Health Inform. 2025, 29, 3736–3749. [Google Scholar] [CrossRef]
Figure 1. Accuracy heatmap by subject on TCMBench. Accuracy of four LLMs under baseline and tool-augmented conditions across 16 subjects. Term-related subjects are marked with †.
Figure 1. Accuracy heatmap by subject on TCMBench. Accuracy of four LLMs under baseline and tool-augmented conditions across 16 subjects. Term-related subjects are marked with †.
Applsci 16 03377 g001
Figure 2. Performance comparison on TCMEval-SDT (F1 score). F1 scores for pathogenesis and syndrome tasks across four LLMs. Gray = baseline; colored = tool-augmented.
Figure 2. Performance comparison on TCMEval-SDT (F1 score). F1 scores for pathogenesis and syndrome tasks across four LLMs. Gray = baseline; colored = tool-augmented.
Applsci 16 03377 g002
Figure 3. heatmap of tool-augmentation effect. Green indicates performance gains; red indicates performance declines with tool augmentation.
Figure 3. heatmap of tool-augmentation effect. Green indicates performance gains; red indicates performance declines with tool augmentation.
Applsci 16 03377 g003
Figure 4. Tool usage rate and cost–performance analysis. (a) Tool usage rate by model. (b) Cost vs. performance; circles = baseline, stars = tool-augmented. y-axis = average accuracy across four task sections.
Figure 4. Tool usage rate and cost–performance analysis. (a) Tool usage rate by model. (b) Cost vs. performance; circles = baseline, stars = tool-augmented. y-axis = average accuracy across four task sections.
Applsci 16 03377 g004
Figure 5. Radar chart of tool-augmented performance. Performance comparison across four task axes: TCMBench term/other (accuracy) and SDT pathogenesis/syndrome (F1 score).
Figure 5. Radar chart of tool-augmented performance. Performance comparison across four task axes: TCMBench term/other (accuracy) and SDT pathogenesis/syndrome (F1 score).
Applsci 16 03377 g005
Table 1. Composition of the metadata database.
Table 1. Composition of the metadata database.
Entity TypeCountSourceKey Attributes
Herbs698HERB DB, SymMapMultilingual names, properties, meridians, classification
Syndromes233SymMapMultilingual names, definition
TCM Symptoms2285SymMapMultilingual names, definition, body part, nature
MM Symptoms1148SymMapMultilingual name, definition, UMLS/MeSH/ICD-10 codes
Acupoints416WHO StandardMultilingual names, WHO code, meridian, location, needling method
Total4780
Table 2. Comparison of GPT-5-mini performance across baseline, non-agentic retrieval (RAG), and agent conditions. GPT-5-mini was selected as it showed the largest agent-condition improvement.
Table 2. Comparison of GPT-5-mini performance across baseline, non-agentic retrieval (RAG), and agent conditions. GPT-5-mini was selected as it showed the largest agent-condition improvement.
Sectionn (R)BaselineRAGAgentAgent–RAG
TCMBench Term30975.4%77.7%85.6%+7.9 pp
TCMBench Other93071.0%72.5%76.7%+4.2 pp
SDT Patho.2990.4740.4940.508+0.014
SDT Syndr.2990.5170.5060.478−0.028
Table 3. TCMBench term-related accuracy conditioned on tool usage.
Table 3. TCMBench term-related accuracy conditioned on tool usage.
ModelTool UsednNot UsednBaseline
GPT-5.272.7%1181.4%31784.1%
GPT-5-mini80.3%7189.5%15382.1%
Claude Sonnet 4.691.2%284100.0%1889.7%
Claude Haiku 4.571.5%13777.3%17673.5%
“Tool Used” = accuracy on items where the model invoked at least one tool; “Not Used” = accuracy on items where no tools were invoked; “Baseline” = accuracy under the baseline (no-tool) condition on the same items. n = number of items in each category.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Lee, W.-Y.; Kim, J.-H.; Leem, J.; Lee, B.-W.; Lee, S.; Kim, Y.W. Benchmark Evaluation of a Tool-Augmented Large Language Model Agent Using Traditional Asian Medicine Metadata. Appl. Sci. 2026, 16, 3377. https://doi.org/10.3390/app16073377

AMA Style

Lee W-Y, Kim J-H, Leem J, Lee B-W, Lee S, Kim YW. Benchmark Evaluation of a Tool-Augmented Large Language Model Agent Using Traditional Asian Medicine Metadata. Applied Sciences. 2026; 16(7):3377. https://doi.org/10.3390/app16073377

Chicago/Turabian Style

Lee, Won-Yung, Ji-Hwan Kim, Jungtae Leem, Byung-Wook Lee, Seungho Lee, and Young Woo Kim. 2026. "Benchmark Evaluation of a Tool-Augmented Large Language Model Agent Using Traditional Asian Medicine Metadata" Applied Sciences 16, no. 7: 3377. https://doi.org/10.3390/app16073377

APA Style

Lee, W.-Y., Kim, J.-H., Leem, J., Lee, B.-W., Lee, S., & Kim, Y. W. (2026). Benchmark Evaluation of a Tool-Augmented Large Language Model Agent Using Traditional Asian Medicine Metadata. Applied Sciences, 16(7), 3377. https://doi.org/10.3390/app16073377

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Metrics

Back to TopTop