Article

MedToolica: Finetuning-Free Agentic Compositional Tool Learning for 3D CT Reasoning

AI Innovation Lab, Weill Cornell Medicine-Qatar, Doha P.O. Box 24144, Qatar
* Author to whom correspondence should be addressed.
Mach. Learn. Knowl. Extr. 2026, 8(6), 162; https://doi.org/10.3390/make8060162
Submission received: 30 April 2026 / Revised: 6 June 2026 / Accepted: 8 June 2026 / Published: 11 June 2026

Abstract

Clinical reasoning over 3D CT scans is inherently compositional, requiring the integration of anatomical measurement, pathology assessment, spatial comparison, and clinical interpretation. We introduce MedToolica, a finetuning-free, role-based agentic framework for quantitative 3D abdominal CT reasoning that decomposes complex queries into structured sub-tasks coordinated through specialized expert tools. Empirical evaluation across quantitative reasoning benchmarks demonstrates that MedToolica is particularly effective in organ-centric measurement tasks when supported by reliable expert tools, achieving strong quantitative agreement (e.g., CCC = 0.99 for organ HU estimation versus 0.46 for finetuned baselines) and notable gains on multi-step visual reasoning tasks. In contrast, lesion-oriented tasks remain constrained by upstream tool limitations, indicating that reasoning sophistication alone cannot compensate for unreliable perception. Furthermore, we observe that the capability of the core language model substantially influences orchestration quality: smaller LLM orchestrators exhibit reduced overall accuracy due to higher execution failure rates (25% vs. 79%) and increased susceptibility to hallucination (43% vs. 2%). Collectively, these findings identify expert tool reliability and orchestration capability as critical determinants of performance in compositional medical AI and highlight both the promise and current limitations of finetuning-free agentic reasoning for quantitative 3D CT analysis.

1. Introduction

Computed Tomography (CT) provides rich and detailed information about anatomical structures. Clinical interpretation of 3D CT scans is compositional by nature. Compositionality is defined as the ability to construct complex functions from reusable primitives, which provides a foundation for systematic generalization. For example, determining whether a liver lesion is resectable requires measuring its volume, assessing its Hounsfield Unit (HU) value, localizing it within a hepatic segment, comparing it against surrounding structures, and applying clinical resectability criteria, all linked in a chain of dependent reasoning steps. This inherent compositionality motivates a fundamental question: can an AI system that explicitly models this compositional structure outperform monolithic models that learn to answer queries end-to-end?
Despite significant progress in applying AI to radiological workflows, existing approaches to three-dimensional CT analysis have largely been developed as specialized models targeting specific clinical tasks. Prior work has demonstrated strong performance in areas such as organ segmentation [1,2], lesion detection [3,4], and classification [5] where each model focuses on a well-defined aspect of analysis. Recent advances in vision–language models (VLMs) have extended these capabilities toward more flexible query interfaces for 3D CT data; however, such models introduce additional limitations.
First, many VLMs for 3D CT analysis rely heavily on finetuning and are trained on relatively narrow datasets. As a result, their representational structures and learning dynamics struggle to generalize under distribution shifts. Consequently, these models often exhibit biases toward specific reporting styles or answer distributions [6,7,8], and their application to out-of-distribution data typically requires further finetuning, which is not always feasible in practical or clinical settings.
Second, most existing approaches are optimized for direct prediction, producing single-shot outputs without explicitly modeling intermediate reasoning steps. This limits the ability to assess whether predictions are grounded in visual evidence or influenced by spurious correlations learned during training [9,10]. From a clinical perspective, interpretability is critical, as practitioners must understand how a model arrives at its conclusions [11]. Despite achieving strong performance on specific tasks, many current 3D medical models for CT scans lack transparency, providing final answers without interpretable reasoning processes.
Agentic frameworks, in contrast, provide a natural paradigm for handling compositional clinical reasoning. By decomposing complex queries into structured sub-tasks and routing them to specialized agents and expert tools, such systems can leverage the strengths of existing domain-specific models without requiring retraining, thereby improving robustness to out-of-distribution scenarios. This approach aligns with the inherently compositional nature of clinical workflows while also introducing transparency, as each intermediate step can be inspected and validated, an essential property for clinical deployment.
To this end, we propose an agentic framework, termed MedToolica, for coordinated analysis of 3D CT data through the integration of multiple specialized models. Unlike monolithic end-to-end architectures, our approach adopts a modular design based on tool learning and role-based agent orchestration, enabling a finetuning-free inference pipeline. Within this framework, complex clinical queries are decomposed into sequences of sub-tasks, each mapped to appropriate tools, enabling hierarchical and long-horizon reasoning. We use this framework as an initial study and an empirical lens, and make four main contributions:
  • We introduce MedToolica, a finetuning-free, role-based agentic framework for quantitative 3D abdominal CT reasoning that integrates specialized tool outputs within a modular workflow.
  • We empirically demonstrate that expert tool reliability is a central bottleneck in compositional medical reasoning and quantify its impact on overall system performance across tasks and model families.
  • We benchmark MedToolica against finetuned 3D medical VLMs and agentic/reasoning baselines, characterizing where the framework offers advantages and where performance remains comparable or task dependent.
  • We systematically analyze how LLM scale and capability affect orchestration quality, tool utilization patterns, and multi-hop reasoning behavior in this agentic setting.

2. Related Works

3D Medical Vision–Language Models. Building on recent progress in 2D medical VLMs [12,13,14,15,16,17], a growing body of work has explored 3D CT understanding. RadFM [18] employs a Q-former-based architecture to aggregate volumetric representations prior to alignment with textual embeddings. Likewise, E3D-GPT [19] utilizes a self-supervised pretrained 3D encoder and a 3D perceiver module to project volumetric features into a language model’s latent space. Furthermore, M3D [20] introduces spatial average pooling to efficiently compress 3D features, whereas Med3DVLM [21] enhances cross-slice representation learning via an MLP-Mixer that jointly models spatial and semantic dependencies. SAMF [22] integrates slice-wise and volumetric representations for CT analysis, and SPINE [23] explores a segmentation-guided approach in spinal MRI. Extending this direction, reinforcement learning-based reasoning has been incorporated via GRPO training [24], producing structured intermediate reasoning steps in Med3D-R1 [25]. Despite these advances, all such approaches rely on monolithic finetuned architectures with limited support for modular, compositional reasoning.
Agentic Approaches in Medical Imaging. More recently, agentic paradigms have been introduced to enhance medical VLMs with dynamic reasoning and tool utilization capabilities. Early works such as AURA [26], MMedAgent [27], and MedRax [28] adopted the ReAct [29] framework to coordinate multiple expert modules for radiology-specific tasks, enabling iterative reasoning through tool invocation. Moreover, finetuning-based approaches such as VILA-M3 [30], MMedAgent-RL [31], and OralGPT-Plus [32] integrate tool usage within a central VLM, where the model is trained to invoke external tools during inference. In a similar direction, Voxel Prompt [33] proposes a finetuning-based programmatic interface that translates natural language queries into executable code, allowing vision models to perform analytical operations such as quantitative measurement and longitudinal assessment.
Other approaches explore agentic designs in more specialized settings. MedAgent-Pro [34] targets multimodal clinical reasoning by incorporating evidence-based decision processes and structured workflows for diagnosis. Additionally, MedXRay-CAD [35] introduces an agent-based framework for combining retrieval mechanisms and expert classifiers to align image and text representations for respiratory disease diagnosis and report generation. MARCH [36] introduces an agentic framework for CT report generation that incorporates retrieval-based validation. Its dual-stage pipeline first employs a trainable model to produce an initial report draft, which is subsequently processed by an agentic system that validates and refines the generated report through retrieval-guided verification.
Moreover, CT-Agent [37] focuses on improving perception efficiency in 3D CT analysis by leveraging anatomy-specific tools alongside token compression mechanisms through a ReAct approach to address spatial complexity. Also, RadAgent [38] explores agentic CT analysis through the orchestration of expert models and external tools within a ReAct framework, with a primary focus on chest CT applications. Beyond medical reasoning, M3Builder [39] extends the agentic paradigm to machine learning systems by enabling collaborative agents for training time automation.
Our work investigates finetuning-free, role-based compositional reasoning for 3D volumetric CT analysis, with a particular emphasis on quantitative medical analysis through modular tool integration. Unlike conventional ReAct-based paradigms, we adopt an agentic role-based framework and further provide an empirical characterization of tool quality as a key determinant of performance in compositional agentic pipelines. See Table 1 for related works.

3. Materials and Methods

Our framework adopts the training-free, tool card-based orchestration philosophy introduced in OctoTools [42] but adapts it for medical 3D CT reasoning through explicit role specialization and CT-specific expert tools. A principal advantage of this design lies in its finetuning-free nature, which permits the incorporation of new tools without retraining, thereby enhancing the system's flexibility and suitability for dynamic clinical environments. Through this regime, complex medical queries are decomposed into subproblems and resolved via the coordinated application of multiple analytical capabilities (see Figure 1 for more details). An illustration of the agent roles, the per-agent system prompts, and the tool I/O schemas used during the iterative context-update loop is shown in Figure 2.

3.1. Problem Formulation

We formulate the task as mapping a natural language query $q$ and a 3D CT volume $x$ to an output $y$. Unlike conventional approaches that learn a direct mapping $f : (x, q) \mapsto y$, our framework models the problem as a sequential decision process over a set of tools $\mathcal{T}$. At each step $t$, an action $a_t \in \mathcal{T}$ is selected based on the current context $C_t$, and the resulting tool output $o_t$ is used to update the context.
The objective is to construct a sequence of actions $\{a_1, \dots, a_T\}$ such that the final context $C_T$ contains sufficient information to generate the output $y$. The context $C$ represents the accumulated state of the system, including intermediate tool outputs, query representations, and reasoning traces, and serves as the basis for decision making across steps.

3.2. Agentic Flow

MedToolica follows a multi-step, role-based architecture adapted from OctoTools consisting of five agents: the Query Analyzer, Action Predictor, Action Executor, Context Verifier, and Summarizer. It should be noted that the five-role design was not introduced to arbitrarily increase agent count, but rather to reflect the structure of the clinical reasoning process itself.
The Query Analyzer operates at the initial stage of the framework, where it interprets and disambiguates the user query while identifying the tools required for downstream processing. Its output follows a structured format composed of four elements: Concise Summary, which captures the main objective of the query; Required Skills, which lists the capabilities needed to solve the task; Relevant Tools, which identifies suitable tools from the toolbox; and Additional Context, which includes any supplementary considerations relevant to effective task completion. For a detailed description of the output format schemas, refer to Appendix A.
Subsequently, the Action Predictor and Action Executor operate in an iterative loop. The predictor selects the most appropriate next action based on the current context, while the executor invokes the corresponding tool and returns its output. After each step, the context is updated to support adaptive multi-step reasoning. The Action Predictor produces four concise fields: Justification, explaining the selected action; Context, containing the information required for tool execution; Sub Goal, defining the immediate objective; and Tool Name, specifying the selected tool.
The Context Verifier assesses whether the query has been sufficiently resolved after each iteration. If additional reasoning is required, the framework continues by selecting further actions. This verification stage also improves robustness by enabling recovery from intermediate errors through refinement of subsequent steps. Its output consists of two fields: Stop Signal, indicating whether the reasoning process should terminate, and Analysis, providing a brief justification for that decision.
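The structured agent outputs described above could be represented as typed records. The following is a minimal sketch, assuming Python dataclasses; the class names, attribute names, and the example tool name are hypothetical renderings of the fields named in the text, not the schemas given in Appendix A.

```python
from dataclasses import dataclass, field


@dataclass
class QueryAnalysis:
    """Query Analyzer output: the four elements named in the text."""
    concise_summary: str
    required_skills: list[str] = field(default_factory=list)
    relevant_tools: list[str] = field(default_factory=list)
    additional_context: str = ""


@dataclass
class PredictedAction:
    """Action Predictor output: what to do next and why."""
    justification: str
    context: str
    sub_goal: str
    tool_name: str


@dataclass
class Verification:
    """Context Verifier output: whether to stop, with a short rationale."""
    stop_signal: bool
    analysis: str


# Hypothetical first action for a spleen-volume query.
action = PredictedAction(
    justification="A spleen mask is needed before volume can be measured.",
    context="organ=spleen",
    sub_goal="Segment the spleen",
    tool_name="whole_body_segmentation",
)
verdict = Verification(stop_signal=False, analysis="Volume not yet computed.")
```

Keeping each role's output in a fixed record makes malformed agent responses easy to detect before they reach the next stage.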
Once the query is deemed complete or reaches the maximum number of iterations/steps, the process transitions to the Summarizer, which aggregates the intermediate results and generates the final response. A detailed description of the procedure is provided in Algorithm 1, while the system prompts used to guide the agents are illustrated in Figure 2A.
Algorithm 1 MedToolica: agentic 3D CT analysis via compositional tool learning with max iterations $N = 10$
Require: Query $q$, CT scan $x$, toolset $\mathcal{T}$, max iterations $N$
  1: Initialize context $C \leftarrow \{q\}$
  2: Initialize done $\leftarrow$ False
  3: Initialize number_iter $\leftarrow 0$
Stage 1: Query Understanding
  4: $C \leftarrow \text{QueryAnalyzer}(q, \mathcal{T})$        ▹ analyze the query
Stage 2: Iterative Reasoning and Tool Use
  5: while not done and number_iter $< N$ do
  6:     $a \leftarrow \text{ActionPredictor}(C, \mathcal{T})$        ▹ select next tool/action
  7:     $o \leftarrow \text{ActionExecutor}(a, x, C)$        ▹ execute tool
  8:     $C \leftarrow C \cup \{o\}$        ▹ update context with output
  9:     done $\leftarrow \text{ContextVerifier}(C)$        ▹ check termination
 10:     number_iter $\leftarrow$ number_iter $+ 1$
 11: end while
Stage 3: Final Response
 12: $y \leftarrow \text{Summarizer}(C)$        ▹ generate final answer
 13: return $y$
While the five-agent decomposition was designed to mirror the sequential stages of clinical reasoning—query understanding, iterative action, verification, and summarization—alternative configurations are conceivable. For instance, the Action Predictor and Action Executor could be merged into a single action agent, or the Context Verifier could be folded into the Summarizer. However, such merges would reduce modularity and limit the ability to recover from intermediate errors independently at each stage.
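The control flow of Algorithm 1 can be sketched in a few lines of Python. This is an illustrative sketch only: the agents are passed in as plain callables, the function and parameter names are hypothetical, and keeping the raw query in the context next to the Query Analyzer output is an assumption rather than a detail confirmed by the text. Tool errors are recorded in the context instead of aborting the run, which is one simple way to allow the recovery behavior described for the Context Verifier.

```python
from typing import Any, Callable

Context = list[Any]


def run_pipeline(
    query: str,
    scan: Any,
    tools: dict[str, Callable[..., Any]],
    analyze: Callable[[str, list[str]], Any],
    predict: Callable[[Context, list[str]], tuple[str, dict]],
    verify: Callable[[Context], bool],
    summarize: Callable[[Context], str],
    max_iters: int = 10,
) -> str:
    """Run the three stages of Algorithm 1 with caller-supplied agents."""
    names = list(tools)
    # Stage 1: query understanding (assumption: the query stays in context).
    context: Context = [query, analyze(query, names)]
    # Stage 2: iterative reasoning and tool use.
    done, n_iter = False, 0
    while not done and n_iter < max_iters:
        tool_name, kwargs = predict(context, names)
        try:
            output = tools[tool_name](scan, **kwargs)
        except Exception as exc:
            # Record the failure so later steps can react to it.
            output = {"error": f"{tool_name}: {exc}"}
        context.append(output)
        done = verify(context)
        n_iter += 1
    # Stage 3: final response.
    return summarize(context)


# Toy stand-ins for the LLM agents and expert tools (all hypothetical,
# including the returned volume).
tools = {
    "segment": lambda scan, organ: {"mask": organ},
    "volume": lambda scan, organ: {"volume_ml": 210.0},
}
plan = iter([("segment", {"organ": "spleen"}), ("volume", {"organ": "spleen"})])

answer = run_pipeline(
    "What is the spleen volume?",
    scan=None,
    tools=tools,
    analyze=lambda q, names: {"relevant_tools": names},
    predict=lambda ctx, names: next(plan),
    verify=lambda ctx: any(isinstance(o, dict) and "volume_ml" in o for o in ctx),
    summarize=lambda ctx: f"Spleen volume: {ctx[-1]['volume_ml']} mL",
)
```

In a real deployment each callable would wrap an LLM call with the corresponding system prompt; the loop itself stays unchanged when tools are added or swapped.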

3.3. Tool Cards

In this work, tool learning refers to the selection and composition of pre-existing expert models rather than the teaching of the core orchestrator to use tools in an optimal manner. The proposed framework incorporates a curated set of expert tools, each designed to address specific analytical tasks in three-dimensional CT scan interpretation. In particular, four primary tools are employed for whole-body segmentation, kidney cyst detection, liver subsegment segmentation, and lesion (tumor) segmentation. These domain-specific tools are complemented by a collection of auxiliary utility modules that facilitate compositional reasoning and downstream analysis. See Figure 2B for more details on tools and their input and output formats.
For whole-body segmentation and kidney cyst detection, we adopt models provided by TotalSegmentator [43] which offer comprehensive anatomical coverage and reliable performance across diverse structures. Similarly, liver subsegment segmentation is conducted using a specialized model derived from the same framework tailored to provide fine-grained partitioning of hepatic regions [44]. For tumor and lesion analysis, we employ the DiffTumor model [3], which supports the detection and segmentation of lesions across major abdominal organs, including the liver, kidneys, and pancreas.
In addition to these expert models, a suite of utility tools has been developed to support higher-level reasoning and quantitative assessment. These include but are not limited to (i) a measurement tool for estimating anatomical volumes and Hounsfield Unit (HU) statistics from CT scans and corresponding segmentation masks, (ii) a tumor analysis tool for counting and characterizing lesions based on segmentation outputs, and (iii) an overlap analysis tool for identifying and quantifying spatial intersections between multiple segmentation masks.
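As a rough illustration of what such utility tools compute, the sketch below derives volume, HU statistics, and mask overlap from a CT volume and binary masks. It is a dependency-free approximation, not the authors' implementation: volumes and masks are assumed to be flattened voxel sequences of equal length, the function names are hypothetical, and a production tool would more likely operate on array types from an imaging library.

```python
from statistics import mean, pstdev


def mask_volume_ml(mask, spacing_mm):
    """Volume of a binary mask: voxel count times voxel volume (mm^3 to mL)."""
    dz, dy, dx = spacing_mm
    n_voxels = sum(1 for m in mask if m)
    return n_voxels * dz * dy * dx / 1000.0


def hu_stats(ct, mask):
    """Mean and population standard deviation of HU values inside the mask."""
    values = [hu for hu, m in zip(ct, mask) if m]
    if not values:
        raise ValueError("mask is empty; HU statistics are undefined")
    return mean(values), pstdev(values)


def overlap_voxels(mask_a, mask_b):
    """Number of voxels that belong to both masks."""
    return sum(1 for a, b in zip(mask_a, mask_b) if a and b)


# Toy data: six voxels with made-up HU values and masks.
ct = [40, 60, 50, -100, 55, 45]
organ = [1, 1, 1, 0, 1, 1]
lesion = [0, 0, 1, 0, 1, 0]

volume = mask_volume_ml(organ, spacing_mm=(2.0, 1.0, 1.0))
mean_hu, sd_hu = hu_stats(ct, organ)
shared = overlap_voxels(organ, lesion)
```

Because these quantities are computed directly from the masks, their accuracy depends almost entirely on the quality of the upstream segmentation.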
It should be noted that the selection of tools in this work was guided by accessibility, ease of integration, and their reported generalizability and performance. However, the proposed framework is modular by design; therefore, practitioners may integrate alternative segmentation models or tools without modifying the overall orchestration framework provided interface compatibility is maintained.

3.4. Dataset

The proposed framework is evaluated on a curated subset of the DeepVQATumor dataset [45], a benchmark designed for reasoning over three-dimensional CT scans in a VQA setting. The evaluation focuses on three categories—measurement, visual reasoning, and medical reasoning—each assessing distinct aspects of clinical and analytical capability. Due to computational constraints and the high similarity among question templates, we uniformly sample up to 200 examples from each subcategory. For subcategories containing fewer than 200 samples, all available examples are retained. This results in 600 measurement examples, 112 medical reasoning examples, and 2230 visual reasoning examples, making a total of 2942 samples in the final evaluation set. Refer to Appendix D for more details and the distribution of the curated dataset.
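The capped uniform sampling described above can be sketched as follows. The record layout (a `subcategory` key), the function name, and the fixed seed are assumptions made for illustration; the text does not specify how sampling was seeded or implemented.

```python
import random
from collections import defaultdict


def sample_per_subcategory(examples, cap=200, seed=0):
    """Keep all examples of small subcategories; sample `cap` from larger ones."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for example in examples:
        groups[example["subcategory"]].append(example)
    sampled = []
    for name in sorted(groups):  # sorted for a reproducible order
        items = groups[name]
        if len(items) <= cap:
            sampled.extend(items)
        else:
            sampled.extend(rng.sample(items, cap))  # without replacement
    return sampled


# Toy pool: one large and one small subcategory (hypothetical names).
pool = [{"subcategory": "organ_volume", "id": i} for i in range(250)]
pool += [{"subcategory": "fatty_liver", "id": i} for i in range(12)]
subset = sample_per_subcategory(pool)
```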

3.5. Evaluation Metrics

We evaluate the proposed framework from two complementary perspectives: (1) end-task predictive performance compared with baseline methods, and (2) the effectiveness and efficiency of different core LLMs used for orchestration.
Performance against baselines. For regression-oriented tasks, we report the Mean Absolute Error (MAE) and Concordance Correlation Coefficient (CCC) [46]. The MAE quantifies the average magnitude of prediction error by measuring the absolute deviation between predicted and ground-truth values, thereby providing an interpretable measure of numerical discrepancy in the original measurement units, as defined in Equation (1). In contrast, the CCC evaluates the agreement between predictions and the ground truth by jointly accounting for both precision (the strength of association between measurements) and accuracy (the deviation from the identity line). Compared with correlation-only metrics, the CCC provides a more robust assessment of measurement agreement in settings exhibiting scale mismatch, systematic bias, or heteroscedastic variability, as shown in Equation (2).
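$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right| \tag{1}$$
$$\mathrm{CCC} = \frac{2 \cdot \mathrm{Cov}(y, \hat{y})}{\sigma_y^2 + \sigma_{\hat{y}}^2 + (\mu_y - \mu_{\hat{y}})^2} \tag{2}$$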
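To make the difference between the two regression metrics concrete, the following sketch implements both using only the standard library. It assumes population (1/N) moments for the variances and covariance, which is one common convention for the CCC; the function names and the toy values are illustrative.

```python
from statistics import fmean


def mae(y_true, y_pred):
    """Mean absolute error in the original measurement units."""
    return fmean([abs(t - p) for t, p in zip(y_true, y_pred)])


def ccc(y_true, y_pred):
    """Concordance correlation coefficient with population (1/N) moments."""
    n = len(y_true)
    mu_t, mu_p = fmean(y_true), fmean(y_pred)
    var_t = sum((t - mu_t) ** 2 for t in y_true) / n
    var_p = sum((p - mu_p) ** 2 for p in y_pred) / n
    cov = sum((t - mu_t) * (p - mu_p) for t, p in zip(y_true, y_pred)) / n
    return 2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)


# Toy example: predictions are perfectly correlated with the truth
# but shifted by a constant offset of +1.
truth = [1.0, 2.0, 3.0, 4.0]
shifted = [2.0, 3.0, 4.0, 5.0]

error = mae(truth, shifted)
agreement = ccc(truth, shifted)
```

In this toy case the Pearson correlation is 1, yet the CCC is well below 1 because the squared mean difference in the denominator penalizes the systematic offset.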
For VQA tasks, we report prediction accuracy together with 95% confidence intervals using the Wilson score interval. Accuracy measures the proportion of correctly answered instances and is defined as follows:
$$\mathrm{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}(\hat{y}_i = y_i) \tag{3}$$
To quantify uncertainty in accuracy estimates under finite sample sizes, we compute 95% Wilson score confidence intervals. Given an observed accuracy $\hat{p}$ from $N$ samples and the standard normal quantile $z$, the Wilson interval is computed as follows:
$$\mathrm{CI}_{\mathrm{Wilson}} = \frac{\hat{p} + \frac{z^2}{2N} \pm z \sqrt{\frac{\hat{p}(1 - \hat{p})}{N} + \frac{z^2}{4N^2}}}{1 + \frac{z^2}{N}} \tag{4}$$
where $z = 1.96$ for a 95% confidence interval.
Additionally, to statistically compare paired model predictions evaluated on identical question instances, we perform pair-wise McNemar tests. For two competing methods, let $n_{10}$ denote the number of instances correctly predicted by method A but incorrectly predicted by method B, and let $n_{01}$ denote the converse. Since the test depends only on discordant prediction pairs, the McNemar statistic is defined as follows:
$$\chi^2 = \frac{(n_{10} - n_{01})^2}{n_{10} + n_{01}} \tag{5}$$
For settings with limited discordant counts, we employ the exact McNemar test based on the binomial distribution, where the null hypothesis assumes symmetric disagreement between paired methods:
$$H_0 : P(n_{10}) = P(n_{01}) \tag{6}$$
A statistically significant result indicates that one method consistently answers more matched instances correctly than the other beyond differences expected from sampling variability.
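Both variants of the McNemar test depend only on the two discordant counts, so they can be sketched with the standard library alone. The code below is illustrative: the function names are hypothetical, the chi-square statistic is computed without continuity correction to match the formula above, and the exact test is the usual two-sided binomial test with success probability 0.5.

```python
import math


def discordant_counts(correct_a, correct_b):
    """Count pairs where exactly one of the two methods is correct."""
    n10 = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    n01 = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    return n10, n01


def mcnemar_chi2(n10, n01):
    """McNemar chi-square statistic without continuity correction."""
    if n10 + n01 == 0:
        return 0.0
    return (n10 - n01) ** 2 / (n10 + n01)


def mcnemar_exact_p(n10, n01):
    """Two-sided exact p-value; discordant pairs ~ Binomial(n, 0.5) under H0."""
    n = n10 + n01
    k = min(n10, n01)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical per-question correctness flags for two methods.
method_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0]
method_b = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

n10, n01 = discordant_counts(method_a, method_b)
statistic = mcnemar_chi2(n10, n01)
p_value = mcnemar_exact_p(n10, n01)
```

Instances on which both methods agree (the last two pairs here) do not influence either quantity.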
Evaluation of LLM orchestration. To assess the role of the core LLM in coordinating agents and tools, we evaluate both efficiency and answer quality. We report the failure rate, which is decomposed into three categories:
(1) Ground-truth mismatch, defined as the disagreement between the final system prediction and the annotated ground-truth answer, reflecting overall task-level correctness of the full pipeline.
(2) Reasoning failure, which includes cases where the model produces an incorrect answer due to inappropriate tool selection, hallucinated reasoning, or generating responses without invoking required tools, including incorrect routing decisions that prevent successful task completion within the allowed number of interaction steps.
(3) Execution failure, which captures instances where the framework fails to complete the inference process, such as invalid output formatting that interrupts execution, failure to produce a final answer, or any breakdown in the agent–tool interaction protocol that prevents completion of the pipeline.

4. Results

For performance comparison, we selected SAMF [22], M3D [20], and Med3DVLM [21] as they are recent state-of-the-art models for abdominal CT analysis with publicly available implementations, enabling fair and reproducible benchmarking. All three models were finetuned on the DeepVQATumor dataset, and the finetuning configuration is provided in Table A4. Additionally, we compared MedToolica against MedGemma1.5 [15], a 2D medical model capable of interpreting 3D CT scans in a slice-wise manner, as well as against ReAct [29], an alternative agentic framework.

4.1. Illustrative Examples of Compositional Tool Use

Figure 3 and Figure 4 present representative examples of task solving through compositional reasoning and tool invocation. In Figure 3, the model estimates spleen volume using a two-step process where it invokes a segmentation model followed by a volume computation tool to obtain the final measurement.
In Figure 4, the model computes the aggregated volume of the liver and spleen over four steps. Notably, the process includes an initial incorrect assumption that segmentation outputs were already available, leading to an erroneous tool call and execution failure. Despite this, the model successfully corrects its trajectory in subsequent steps and completes the task, demonstrating its ability to recover from intermediate errors during reasoning.

4.2. Measurement

Table 2 presents organ-level measurement results where agentic frameworks demonstrate consistently higher performance across all three tasks. Against finetuning-based models, on organ volume estimation, we obtain CCC = 0.99 and MAE = 38.78 versus CCC = 0.90 and MAE = 123.66 for Med3DVLM, a near-perfect agreement (CCC > 0.99) reflecting direct volumetric extraction from accurate TotalSegmentator masks. Finetuned models, which must infer volumes implicitly from visual patterns without explicit measurement tools, cannot reach this ceiling. On this task, both agentic approaches perform similarly, with MedToolica slightly outperforming the ReAct-based approach.
A similar trend is observed for the organ aggregation task, where our framework outperforms finetuning-based models in both MAE and CCC. However, compared to the ReAct-based approach, our method achieves slightly lower performance, although both approaches achieve moderate agreement (CCC > 0.9). Notably, the ReAct method attains a lower nominal error in this setting.
Similarly, on organ HU measurement, finetuning-based baselines achieve poor agreement (CCC < 0.50), indicating that they predict mean HU but fail to track individual variation. In contrast, our framework achieves CCC = 0.92, corresponding to moderate-to-strong agreement, compared to CCC = 0.76 for the ReAct-based approach (poor agreement). This improvement is driven by direct HU computation from segmentation masks, rendering the task straightforward once reliable organ segmentation is available.
Regarding the monotonic association between predictions and the ground truth, MedToolica demonstrates consistently strong relationships across all three organ-related measurement tasks, achieving Spearman's correlation coefficients of $\rho > 0.90$ with statistical significance ($p < 10^{-5}$). The finetuned baselines also exhibit strong monotonic associations in organ volume measurement and organ aggregation tasks, typically achieving $\rho$ values between 0.80 and 0.90 ($p < 10^{-5}$). However, in organ HU measurement, all baseline methods, including the ReAct approach, underperform relative to MedToolica, yielding only moderate correlations ($\rho \approx 0.40$–$0.59$), as illustrated in Figure 5.
In terms of prediction bias and error dispersion in measurement tasks, Figure 6 shows that agentic methods exhibit a tighter clustering of points around zero with narrower limits of agreement (approximately −400 to 300), indicating lower bias and more consistent estimates. In contrast, the finetuned baseline models display a wider spread of errors and broader limits of agreement (approximately −500 to 500). Moreover, for the baseline methods, prediction errors tend to increase with measurement magnitude and are accompanied by a larger number of outliers, whereas MedToolica remains comparatively stable across larger volumes with fewer extreme deviations.
Nevertheless, organ-level measurement tasks inherently emphasize quantitative measurement and structured aggregation, which may preferentially benefit agentic systems with explicit tool support compared to end-to-end finetuned VLMs. This consideration should be taken into account when interpreting comparative performance gains.
All lesion-related measurement tasks (lesion volume, diameter, slice localization, counting) yield very poor agreement, with CCC values of less than 0.2 across all five methods. This indicates that lesion measurement is currently beyond the capability of every evaluated approach and is not a limitation specific to our framework. We report these results in Appendix C for completeness.

4.3. Visual Reasoning

Table 3 demonstrates the uneven performance advantage of MedToolica over competing baselines. Kidney volume comparison (66.8% vs. 64.3% for ReAct and 42.3.6% for Med3DVLM) requires segmenting both kidneys, measuring each, and comparing them, a three-step compositional chain that our framework handles structurally. Finetuned models lack explicit measurement tools and appear to rely on perceptual pattern matching, which fails in this quantitative task. This trend is further supported by the McNemar test results shown in Figure 7, where MedToolica significantly outperforms competing finetuning-based baselines on this task ($p < 10^{-5}$).
Inter-segment comparison (63.5% vs. 61.0% for ReAct, 52.5% for M3D, 47.5% for Med3DVLM) requires multi-step reasoning in which models must first identify lesions across different liver subsegments and subsequently compare segment-wise lesion counts to determine which segment exhibits a higher burden. McNemar test results indicate that MedToolica consistently achieves equal or higher accuracy than other finetuning-based baselines, with improvements ranging from moderate to statistically significant levels. However, the difference between ReAct and MedToolica in this subcategory is not statistically significant. This task highlights a notable advantage of tool-driven approaches in supporting structured, multi-stage analytical reasoning.
Lesion outlier detection (64.1% vs. 53.8% for Med3DVLM and 51.3% for SAMF) requires identifying abnormal lesions relative to clinical norms. Although MedToolica achieves higher accuracy than competing baselines, the relatively wide 95% Wilson confidence intervals suggest substantial uncertainty in the estimated performance, which is likely attributable to the small size of the test set for this subcategory (see Figure 7). Furthermore, the large p-values observed in the McNemar test indicate that the performance differences between models are not statistically significant, implying comparable predictive behavior in this task despite MedToolica’s numerical advantage.
Largest lesion location (43.2% vs. 44.1% for MedGemma1.5) exhibited comparatively lower absolute accuracy; however, the McNemar test revealed no statistically significant performance difference between models. A similar trend was observed for largest lesion attenuation (42.5% vs. 51.5% for Med3DVLM and 50.0% for M3D), where MedToolica achieved lower numerical performance, yet McNemar analysis indicated no significant difference ($p > 0.05$).
In the case of organ enlargement (59.2% vs. 73.0% for M3D and Med3DVLM), where statistically significant differences were observed ($p < 10^{-3}$), the models were generally able to perform volume measurements correctly but failed in the subsequent application of clinical reference thresholds (see Figure A4). The primary source of error in this task is informative in nature and is categorized as output interpretation failure, further analyzed in the Discussion section.
Similarly, output interpretation failure was evident in largest lesion attenuation, where lesion detection and HU measurement were largely accurate but the final classification frequently failed due to ambiguity in applying HU range-based diagnostic criteria (Figure A6).

4.4. Medical Reasoning

Table 4 presents results for the five medical reasoning tasks. On pancreas steatosis, which requires integrating pancreatic HU measurements with clinically established fat infiltration thresholds, ReAct achieves the highest accuracy (83.3% vs. 75.0% for MedToolica and 58.4% for SAMF). Despite this numerical advantage, McNemar analysis indicates no statistically significant performance difference between models. Furthermore, the wide 95% Wilson confidence intervals reflect substantial uncertainty in the estimated accuracies, limiting the reliability of comparative conclusions for this task (see Figure 7B).
A similar pattern is observed for lesion resectability, cyst resectability, and fatty liver. Whilst MedToolica underperforms numerically on these tasks, the McNemar test consistently shows non-significant differences across models, and the broad 95% Wilson confidence intervals suggest high variance and limited confidence in the reported estimates. Consequently, these results do not provide strong statistical evidence for meaningful performance separation among competing approaches in the medical reasoning category.
This outcome is likely attributable to the relatively small size of the curated medical reasoning subset after dataset filtering. Accordingly, we consider limited sample size and the resulting statistical uncertainty as an important limitation of the present study for this category.
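To illustrate how strongly the sample size drives this uncertainty, the sketch below computes a 95% Wilson score interval. It is an illustrative implementation rather than the study's analysis code, and it assumes 12 curated samples for pancreas steatosis (as listed in Table A6), so that 10/12 and 9/12 correspond to the reported 83.3% and 75.0%.

```python
from math import sqrt


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p_hat = correct / total
    denom = 1.0 + z * z / total
    center = (p_hat + z * z / (2 * total)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / total + z * z / (4 * total * total)) / denom
    return center - half_width, center + half_width


# Assumed counts: 10/12 and 9/12 reproduce the reported 83.3% and 75.0%.
for name, correct in [("ReAct", 10), ("MedToolica", 9)]:
    low, high = wilson_interval(correct, 12)
    print(f"{name}: {correct / 12:.1%} (95% CI {low:.3f} to {high:.3f})")
```

With only 12 samples, each interval spans several tens of percentage points and the two intervals overlap substantially, which is consistent with the non-significant McNemar results.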

4.5. Orchestration Efficiency

To examine the impact of different core LLMs, we compare models in terms of failure rate and successful reasoning traces (Figure 8). Among all evaluated models, Nvidia-Nemotron3-30B achieves the lowest overall failure rate, demonstrating the strongest orchestration performance. Notably, Qwen3-14B, despite having approximately half the number of parameters, attains a comparable failure rate and ranks as the second-best model in terms of orchestration capability. In contrast, GLM-4.7-Flash-31B exhibits a failure rate exceeding 40%, underperforming both Qwen3-14B and Ministral3-14B despite its larger parameter size. In general, smaller-scale models show substantially weaker orchestration ability, with failures arising from both execution-level errors (e.g., invalid agent output formats that disrupt the pipeline) and reasoning-level errors (e.g., incorrect tool selection or failure to follow the intended reasoning trajectory).
In terms of the tool-calling behavior of core LLMs, Table 5 reports the average number of reasoning steps, tool utilization (i.e., the proportion of steps involving tool invocation), and failure burden (i.e., the proportion of tool calls that result in execution failures). A key pattern emerges when comparing mid- to large-scale models: Nemotron3-30B-A3B and Gemma4-31B maintain relatively low numbers of average steps and moderate tool utilization while achieving comparatively lower failure burdens, indicating more stable orchestration behavior. Notably, Qwen3-14B shows high tool utilization (95.83%) but also the highest failure burden (32.17%), implying that, although it frequently delegates to tools, it is more prone to execution-level errors during interaction. Despite this, its overall system-level failure rate remains competitive (see Figure 8), suggesting partial recovery through subsequent reasoning steps. In contrast, GLM-4.7-Flash-31B and smaller Qwen variants demonstrate weaker orchestration stability characterized by either reduced tool dependence or elevated failure rates.
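The three quantities defined above can be derived from logged reasoning traces. The following sketch shows one way to do so; the `Step` record layout is hypothetical, and pooling steps across traces (rather than averaging per-trace ratios) is an assumption, since the aggregation is not specified here.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One reasoning step of a trace (hypothetical record layout)."""
    used_tool: bool            # did this step invoke a tool?
    tool_failed: bool = False  # did that tool call end in an execution failure?


def orchestration_metrics(traces: list[list[Step]]) -> dict[str, float]:
    """Average steps, tool utilization (%), and failure burden (%), pooled over traces."""
    total_steps = sum(len(trace) for trace in traces)
    tool_calls = sum(step.used_tool for trace in traces for step in trace)
    failed_calls = sum(
        step.used_tool and step.tool_failed for trace in traces for step in trace
    )
    return {
        "avg_steps": total_steps / len(traces) if traces else 0.0,
        "tool_utilization": 100.0 * tool_calls / total_steps if total_steps else 0.0,
        "failure_burden": 100.0 * failed_calls / tool_calls if tool_calls else 0.0,
    }


# Two toy traces, for illustration only.
traces = [
    [Step(True), Step(True, tool_failed=True), Step(False)],
    [Step(True)],
]
metrics = orchestration_metrics(traces)
print(metrics)
```

Under these definitions, a model can reach high tool utilization and still carry a high failure burden, because the two ratios use different denominators (all steps versus tool calls only).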

5. Discussion

Clinical reasoning and quantitative assessment over 3D CT are inherently compositional processes that often require multi-step perception, measurement, and interpretation. In this work, we investigate the extent to which a finetuning-free agentic framework can serve as an alternative to monolithic end-to-end models in this setting. Our findings suggest that compositional quantitative reasoning in 3D CT is feasible, but its success depends strongly on two interacting factors: the reliability of the expert perception tools and the ability of the core LLM to orchestrate them effectively. These results position tool quality and orchestration capability as central determinants of performance in modular medical reasoning pipelines.
That being said, we distinguish compositional reasoning from sequential tool orchestration. Sequential tool orchestration refers to the execution of tools through a predetermined workflow where the order of operations is largely fixed and independent of intermediate reasoning outcomes. In contrast, compositional reasoning involves decomposing a complex clinical query into intermediate objectives, iteratively acquiring and evaluating evidence, and synthesizing the resulting information to reach a final conclusion. Although MedToolica leverages external tools during inference, its operation extends beyond sequential tool orchestration. Rather than following a fixed execution pipeline, the system employs a multi-agent reasoning process in which agents dynamically select actions, exchange information, and refine subsequent steps based on intermediate findings and verification outcomes. Consequently, tool use serves as a component of the broader reasoning process rather than the primary organizing principle of the system.
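The distinction can be made concrete with a small sketch. It is not the MedToolica implementation: the LLM-driven roles are replaced by plain Python callables, and all function names, tool names, and tool outputs are hypothetical stand-ins.

```python
from typing import Callable

Tool = Callable[[dict], dict]


def run_fixed_pipeline(tools: list[Tool], context: dict) -> dict:
    """Sequential orchestration: the tool order is fixed in advance."""
    for tool in tools:
        context = {**context, **tool(context)}
    return context


def run_compositional(
    toolbox: dict[str, Tool],
    predict_action: Callable[[dict], str],
    is_sufficient: Callable[[dict], bool],
    context: dict,
    max_steps: int = 10,
) -> dict:
    """Compositional loop: the next tool depends on what is already known."""
    for _ in range(max_steps):
        if is_sufficient(context):           # verifier decides when to stop
            break
        tool_name = predict_action(context)  # chosen from intermediate results
        context = {**context, **toolbox[tool_name](context)}
    return context


# Toy stand-ins for expert tools (hypothetical names and fixed outputs).
toolbox = {
    "segment_organ": lambda ctx: {"mask": "liver_mask"},
    "measure_volume": lambda ctx: {"volume_ml": 1500.0},
}

result = run_compositional(
    toolbox,
    predict_action=lambda ctx: "measure_volume" if "mask" in ctx else "segment_organ",
    is_sufficient=lambda ctx: "volume_ml" in ctx,
    context={"query": "What is the liver volume?"},
)
print(result["volume_ml"])
```

In the fixed pipeline the caller supplies the order of operations; in the loop, the action predictor and the verifier determine both which tool runs next and when the process terminates.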
Asymmetrical advantages through compositional reasoning. Our results suggest that the benefits of compositional reasoning are task dependent rather than universal. Agentic methods exhibit genuine structural advantages over finetuning-based models in organ-level quantitative tasks that require direct numerical comparison, aggregation of quantitative measurements, or spatial integration across anatomical regions. Notable improvements include organ volume estimation (up to −85 reduction in nominal error (mAE)) and organ aggregation calculations (−100 reduction in mAE), both accompanied by strong agreement (CCC > 0.9 ), as well as kidney volume comparison (+21.2 points over MedGemma1.5). In contrast, MedToolica underperforms in tasks involving subjective downstream clinical decisions (e.g., largest lesion attenuation and organ enlargement) and lesion-oriented reasoning tasks. However, pair-wise McNemar analyses on VQA tasks indicate that these differences are not statistically significant, with the exception of the organ enlargement task. These findings suggest that compositional reasoning offers distinct advantages for structured quantitative reasoning, whilst its effectiveness remains constrained in lesion-centric and clinically interpretive settings.
Tool quality is the primary determinant of system performance. The results consistently support our central hypothesis. Tasks involving organ volume segmentation, supported by TotalSegmentator, show substantial gains over finetuned baselines, with strong CCC agreement. In contrast, lesion detection tasks supported by DiffTumor exhibit poor performance across all methods. When tool outputs are unreliable, subsequent reasoning stages operate on inaccurate inputs, causing errors to accumulate and potentially intensify throughout the multi-step pipeline. Therefore, in agentic medical AI systems, tool reliability should be regarded as a first-class design criterion, alongside the design of the reasoning and orchestration architecture.
Two distinct failure modes with distinct remedies. Our analysis identifies two qualitatively different failure modes. The first is tool failure with error propagation: DiffTumor produces inaccurate segmentations, and downstream measurement amplifies these inaccuracies to low CCC values across all lesion tasks, regardless of reasoning sophistication. More generally, this observation suggests that robustness in compositional agentic systems may also benefit from mechanisms that explicitly mitigate tool-level uncertainty. One potential direction is to aggregate or fuse outputs from multiple tools to improve robustness and confidence in the final prediction, as explored in prior work on models such as OralGPT [32]. In this study, we do not incorporate explicit mechanisms for multi-tool fusion, fallback strategies, or uncertainty estimation over tool outputs, as our focus is on evaluating the feasibility of a finetuning-free agentic framework for quantitative 3D CT reasoning under a fixed tool pipeline. We acknowledge these aspects as limitations of the current work and consider them promising directions for future research.
The second failure mode is output interpretation failure. Even when TotalSegmentator produces reliable segmentations or DiffTumor successfully detects a lesion, the pipeline may still fail to convert tool outputs into correct clinical decisions. This is most evident in tasks where the final answer depends on threshold- or range-based judgments: both the organ enlargement and largest lesion attenuation tasks successfully obtain the required measurements, yet the final prediction is frequently wrong due to ambiguous or incorrect clinical cutoffs. A practical remedy would be to refine system prompts by explicitly providing clinically plausible ranges or decision criteria for each class, thereby supporting more accurate final reasoning.
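As an illustration of this remedy, the sketch below keeps the decision rule in one explicit object that can be applied deterministically and also rendered as prompt text. The class labels follow the largest lesion attenuation task; the function names are hypothetical, and the tolerance value is a placeholder for illustration only, not a clinical cutoff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttenuationCriteria:
    """Explicit decision rule; the default tolerance is a placeholder, not a clinical value."""
    iso_tolerance_hu: float = 10.0


def classify_attenuation(
    lesion_hu: float,
    organ_hu: float,
    criteria: AttenuationCriteria = AttenuationCriteria(),
) -> str:
    """Map a lesion/organ HU pair to hypo, iso, or hyper using the explicit rule."""
    diff = lesion_hu - organ_hu
    if abs(diff) <= criteria.iso_tolerance_hu:
        return "iso"
    return "hyper" if diff > 0 else "hypo"


def criteria_prompt(criteria: AttenuationCriteria) -> str:
    """Render the same rule as text that could be appended to a system prompt."""
    t = criteria.iso_tolerance_hu
    return (
        f"Classify as 'iso' if |lesion HU - organ HU| <= {t} HU, "
        f"'hyper' if the lesion is more than {t} HU above the organ, "
        f"and 'hypo' if it is more than {t} HU below."
    )


print(classify_attenuation(lesion_hu=40.0, organ_hu=100.0))
print(criteria_prompt(AttenuationCriteria()))
```

Keeping the rule in a single place means the prompt text and any deterministic post-check cannot drift apart, which targets the ambiguity in cutoffs described above.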
Transparent reasoning offers a meaningful clinical advantage. Unlike single-shot end-to-end finetuned models, our framework produces explicit intermediate reasoning traces, including segmentation outputs, volume measurements, HU statistics, and spatial comparisons, all of which can be inspected and verified at each stage. This level of transparency provides clear practical value in clinical settings. For example, a radiologist can confirm whether a reported liver volume was derived from an accurate segmentation or whether a resectability assessment was based on the appropriate anatomical measurements. Therefore, interpretability is not an incidental feature of the agentic design but an intrinsic consequence of compositional reasoning. Although a fine-grained ablation of individual agent roles and verifier mechanisms could provide additional insight into the contribution of specific design choices, such an analysis is beyond the scope of the present study, whose primary objective is to evaluate the feasibility of a finetuning-free agentic framework for quantitative 3D CT reasoning.
Scaling the core LLM substantially improves orchestration quality. To investigate the influence of the core LLM on MedToolica, we evaluated models spanning 0.6 B to 31 B parameters. Overall, larger models demonstrated more stable orchestration behavior characterized by lower failure rates, fewer execution breakdowns, and more coherent reasoning trajectories. However, parameter scale alone did not fully determine performance. For example, Qwen3-14B achieved competitive orchestration quality despite having substantially fewer parameters than several 30B-class models, suggesting that orchestration capability depends not only on model size but also on robustness in multi-step tool-mediated reasoning. An important observation is that high tool utilization does not necessarily imply effective orchestration; models such as Ministral3-14B and Qwen-8B invoked tools in nearly all reasoning steps yet still incurred considerable failure burdens, indicating that successful agentic reasoning requires reliable tool execution and recovery mechanisms rather than frequent tool use alone. Conversely, smaller models (e.g., Qwen-0.6B and Qwen-4B) exhibited longer reasoning trajectories, reduced or inconsistent tool utilization, and elevated reasoning and execution errors, highlighting their limited ability to maintain structured agent–tool coordination in complex medical workflows.
Limitations and future directions. The study has limitations that point to concrete future directions: First, performance on lesion-related tasks is fundamentally constrained by DiffTumor; replacing or augmenting it with a more accurate segmentation model is the most direct path to improvement. More broadly, the present work does not address mechanisms for mitigating tool unreliability, such as multi-tool fusion, fallback strategies, or uncertainty-aware reasoning. Second, the current implementation does not incorporate a dedicated 3D medical VLM as its central reasoning component. Integrating a specialized 3D medical VLM, either as an expert module or as the primary decision-making model for tool orchestration, could introduce learned clinical priors that complement tool-based measurement workflows. Third, this work does not include ablation studies investigating different agent decompositions, merged-agent configurations, or reduced-agent designs, limiting insight into the relative impact of these design choices. Finally, the relatively small subset size of the medical reasoning categories limits the statistical reliability of conclusions within this setting and motivates further large-scale investigation. Future work should extend this evaluation to a broader range of models across architectures and scales, as well as investigate alternative agent decompositions and merged-agent configurations.

6. Conclusions

In this work, we investigated MedToolica, a finetuning-free, role-based agentic framework, and explored its applicability for quantitative 3D abdominal CT reasoning. Our goal was to investigate the feasibility, strengths, and limitations of compositional quantitative reasoning in volumetric CT analysis by leveraging external tools for measurement, interpretation, and multi-step medical reasoning. Our results show that the performance of compositional reasoning is jointly governed by expert tool reliability and the orchestration capability of the core LLM, and that LLM scale and capability materially affect tool usage patterns, failure modes, and downstream multi-hop reasoning behavior. Taken together, these findings indicate that finetuning-free agentic reasoning should be viewed as an important complementary paradigm to finetuned medical models. Finetuned architectures may retain advantages in perception-heavy settings and challenging edge cases, while agentic compositional frameworks can provide meaningful gains in structured quantitative assessment when coupled with reliable expert tools.

Author Contributions

Conceptualization, A.H. and A.S.; methodology, A.H.; software, A.H.; validation, A.H.; formal analysis, A.H. and A.S.; investigation, A.H. and A.S.; resources, A.H.; data curation, A.H.; writing—original draft preparation, A.H.; writing—review and editing, A.H. and A.S.; visualization, A.H. and A.S.; supervision, A.S.; project administration, A.S.; funding acquisition, A.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The data are publicly available at https://github.com/Schuture/DeepTumorVQA (accessed on 7 June 2026), and the code and implementation of MedToolica are publicly accessible at https://github.com/serag-ai/MedToolica (accessed on 7 June 2026).

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
| Abbreviation | Definition |
| --- | --- |
| CT | Computed Tomography |
| LLM | Large Language Model |
| VQA | Visual Question Answering |
| HU | Hounsfield Unit |
| VLM | Vision–Language Model |

Appendix A. Output Format Structure

We define structured output formats for the key components of the framework, along with their corresponding descriptions provided to the core orchestrator. The orchestrator is required to strictly adhere to these formats to enable systematic downstream processing. Table A1, Table A2, and Table A3 present the output schemas for the Query Analyzer, Action Predictor, and Context Verifier, respectively.
Table A1. Output schema for the Query Analyzer, which analyzes the user’s query and determines the necessary steps to address it effectively.

| Output Component | Description |
| --- | --- |
| Concise summary | A concise summary of the query’s main points and objectives, as well as the content in any accompanying inputs. |
| Required skills | A list of required skills, with a brief explanation for each. |
| Relevant tools | A list of relevant tools from the toolbox, with a brief explanation of how each tool is utilized and its potential limitations. |
| Additional context | Any additional considerations that might be important for addressing the query effectively. |
Table A2. Output schema for Action Prediction, which predicts the next-best action to take in the problem-solving process.

| Output Component | Description |
| --- | --- |
| Justification | A detailed explanation of why the selected tool is the best choice for the next step considering the context and previous outcomes. |
| Context | MUST include ALL necessary information for the tool to function, structured as follows: (1) relevant data from previous steps; (2) file names or paths created or used in previous steps (list EACH ONE individually); (3) variable names and their values from previous steps’ results; (4) any other context-specific information required by the tool. An example output is (do not copy, use only as reference): Image path: “example/image.jpg”, Previous detection results: [list of objects] |
| Sub-goal | A specific, achievable objective for the tool based on its metadata and previous outcomes. It MUST contain any involved data, file names, and variables from previous steps and their results, which the tool can act upon. |
| Tool name | MUST be the exact name of a tool from the available tools list. |
Table A3. Output schema for the Context Verifier, which evaluates whether the current context and memory are sufficient to address the query effectively.

| Output Component | Description |
| --- | --- |
| Analysis | Provide a detailed analysis of why the memory is sufficient. Reference specific information from the memory and explain its relevance to each aspect of the task. Address how each main point of the query has been satisfied. |
| Stop signal | Whether to stop the problem-solving process and proceed to generating the final output. “True”: if the memory is sufficient for addressing the query to proceed and no additional available tools need to be used. If ONLY manual verification without tools is needed, choose “True”. “False”: if the memory is insufficient and needs more information from additional tool usage. |
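Because the orchestrator must adhere strictly to these formats, a lightweight validation step can turn a malformed response into an explicit execution-level error instead of a silent downstream failure. The sketch below validates a Context Verifier response; it assumes, hypothetically, that the response is serialized as a JSON object with the keys `analysis` and `stop_signal`, which is not necessarily the serialization used by MedToolica.

```python
import json
from dataclasses import dataclass


@dataclass
class VerifierOutput:
    analysis: str
    stop: bool


def parse_verifier_output(raw: str) -> VerifierOutput:
    """Parse and validate one Context Verifier response.

    Assumes a JSON object with the (hypothetical) keys "analysis" and
    "stop_signal"; raises ValueError on any violation of that format.
    """
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("verifier output must be a JSON object")
    missing = {"analysis", "stop_signal"} - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    signal = str(data["stop_signal"]).strip().lower()
    if signal not in {"true", "false"}:
        raise ValueError(f"stop_signal must be 'True' or 'False', got {data['stop_signal']!r}")
    return VerifierOutput(analysis=str(data["analysis"]), stop=(signal == "true"))


parsed = parse_verifier_output('{"analysis": "Liver volume was measured.", "stop_signal": "True"}')
print(parsed.stop)
```

A caller can catch `ValueError` and either re-prompt the model or record the step as an execution failure.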

Appendix B. Finetuning Configuration

We report the training settings used for the three benchmark vision–language models evaluated in this study: M3D, SAMF, and Med3DVLM. To ensure a fair comparison, all three models were trained using their official open-source implementations and adapted to the DeepTumorVQA dataset with minimal modifications to the original architectures and optimization procedures.
Table A4. Finetuning hyperparameters for baseline models.

| Hyperparameter | M3D | SAMF | Med3DVLM |
| --- | --- | --- | --- |
| Core language model | Phi3-4B | Phi3-4B | Qwen2.5-7B |
| Learning rate | $5 \times 10^{-4}$ | $5 \times 10^{-5}$ | $5 \times 10^{-5}$ |
| Per-device batch size | 4 | 4 | 8 |
| Gradient accumulation steps | 4 | 8 | 1 |
| Learning rate scheduler | Cosine | Cosine | Cosine |
| Warmup ratio | 0.03 | 0.03 | 0.03 |
| Weight decay | 0.0 | 0.0 | 0.0 |
| Training epochs | 3 | 5 | 5 |

Appendix C. Lesion Measurement Results

Table A5 presents results for lesion-level measurement tasks. Across both paradigms—single-shot finetuned VLM baselines and our segmentation-assisted framework—all methods yield consistently low CCC values, generally below 0.20, indicating poor agreement between predicted and ground-truth measurements. These low CCC scores suggest limited reliability of current approaches for lesion-specific quantitative assessment in 3D CT, regardless of whether measurements are inferred directly by end-to-end VLMs or derived through segmentation-guided reasoning pipelines. The uniform difficulty across methods highlights the inherent challenges of lesion-level measurement in current 3D CT reasoning systems.
Additionally, all methods exhibit substantially weaker monotonic associations between predictions and the ground truth (refer to Figure A1). MedToolica achieves the strongest overall performance, attaining moderate positive correlations across most tasks, particularly for largest lesion slice localization ($\rho = 0.83$, $p < 10^{-5}$) and tumor–organ HU difference ($\rho = 0.68$, $p < 10^{-5}$). The finetuned baselines generally demonstrate weak-to-moderate correlations, with performance varying considerably across tasks. While several baseline methods achieve statistically significant positive associations for lesion volume, lesion slice localization, and lesion diameter measurements, correlation magnitudes typically remain limited ($\rho \approx 0.20$–$0.59$). Notably, tasks such as lesion counting and lesion count by location show weak, absent, or occasionally negative correlations for several baselines, indicating limited capability in capturing lesion-specific quantitative relationships. That being said, the reduced correlation strengths across methods underscore the increased difficulty of lesion-level quantitative reasoning in 3D CT compared with organ-level measurement tasks.
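CCC and Spearman's ρ answer different questions, which is why a task can show a high ρ alongside a low CCC. The sketch below implements both in plain Python on synthetic values (not data from this study): predictions that preserve the ordering of the ground truth but are offset and rescaled obtain ρ = 1 while the CCC stays low.

```python
from math import sqrt
from statistics import fmean


def ccc(pred: list[float], truth: list[float]) -> float:
    """Lin's concordance correlation coefficient (population moments)."""
    mp, mt = fmean(pred), fmean(truth)
    var_p = fmean([(p - mp) ** 2 for p in pred])
    var_t = fmean([(t - mt) ** 2 for t in truth])
    cov = fmean([(p - mp) * (t - mt) for p, t in zip(pred, truth)])
    return 2 * cov / (var_p + var_t + (mp - mt) ** 2)


def average_ranks(values: list[float]) -> list[float]:
    """1-based ranks, with ties receiving the average rank of their block."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x: list[float], y: list[float]) -> float:
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(pred: list[float], truth: list[float]) -> float:
    """Spearman's rho: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(pred), average_ranks(truth))


# Synthetic example: correct ordering, wrong scale and offset.
truth = [10.0, 20.0, 30.0, 40.0, 50.0]
pred = [2 * t + 15 for t in truth]
print(f"Spearman rho = {spearman(pred, truth):.2f}, CCC = {ccc(pred, truth):.2f}")
```

The CCC denominator penalizes both the variance mismatch and the squared difference in means, so it measures agreement with the identity line, whereas ρ depends only on rank order.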
Table A5. Lesion measurement results. All methods fail across all tasks (low CCC and high mAE), confirming this is a tool-level limitation rather than a reasoning limitation. See Section 4.1 for discussion.
ModelLesion VolumeLargest DiameterLargest SliceOrgan HU DiffLesion CountCount by Location
mAE CCC mAE CCC mAE CCC mAE CCC mAE CCC mAE CCC
MedGemma1.5 [15]489.280.116.86−0.0521.90.2946.990.045.50.243.690.05
SAMF [22]272.610.055.670.0118.920.3030.170.244.52−0.032.340.11
M3D [20]328.480.165.150.2016.280.0931.410.122.630.131.430.01
Med3DVLM [21]281.130.115.940.1012.910.5432.770.192.740.011.400.15
ReAct251.020.1254.09−0.00819.920.1940.540.143.230.752.080.32
MedToolica257.160.146.470.1511.180.0621.970.554.340.632.600.21
The consistent failure across all lesion-related tasks in agentic frameworks likely reflects the fundamental limitations of DiffTumor as the upstream perception tool. First, DiffTumor was originally trained using a synthesis-based paradigm [3] that, while broadly generalizable, appears to be less robust for small lesions (e.g., below approximately 10 mm), where limited voxel support reduces segmentation confidence and boundary precision. Second, lesion appearance in abdominal CT is highly heterogeneous across organ types: liver lesions exhibit variable enhancement patterns, pancreatic lesions may have poorly defined margins, and renal cysts differ structurally from solid tumors. This heterogeneity likely contributes to inconsistent segmentation quality across organ types, propagating errors into downstream volume, diameter, and count measurements. These findings suggest that reasoning sophistication alone cannot fully compensate for unreliable upstream segmentation, reinforcing tool reliability as a first-class design criterion in agentic medical AI.
Figure A1. Spearman correlation coefficients ($\rho$) across models for lesion-related measurement tasks. Cell values represent Spearman’s $\rho$, quantifying the monotonic association between predictions and ground-truth measurements. Significance annotations: (***) $p < 10^{-5}$, (**) $p < 10^{-3}$, (*) $p < 0.05$, and “ns” denotes a non-significant correlation.

Appendix D. Dataset Curation Process

We utilized the DeepTumorVQA [45] dataset, an AI-generated benchmark constructed from metadata associated with publicly available 3D CT datasets. Rather than evaluating the complete dataset, we curated a subset for two primary reasons: First, evaluating the full benchmark is computationally expensive, particularly given the limited computational and storage resources available in our laboratory setting. Second, a large portion of the dataset contains highly similar questions, as many samples are automatically generated from metadata templates despite originating from different CT scans. Consequently, evaluating the entire dataset would substantially increase computational cost while providing limited additional diversity in reasoning content.
During curation, we excluded samples derived from the RSNA 2023 Abdominal Trauma Detection dataset and curated the remaining data by selecting up to 200 samples per subcategory whenever possible. We additionally attempted to preserve the original task distributions to minimize distributional shift between the curated and original datasets. For regression tasks, the original and curated distributions across subcategories are illustrated in Figure A2. The curated subset largely maintains class and subtype distributions similar to those of the original benchmark.
For visual reasoning tasks, Table A7 presents the class distributions before and after curation across visual reasoning subcategories. In general, only minor distribution shifts are observed. The lesion outlier category remains comparatively small due to its limited representation in the original dataset and is further reduced after removal of the excluded dataset portion.
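A capped, distribution-preserving selection of this kind can be sketched as follows. This is an illustrative procedure, not the curation script used for this study; the dictionary keys `subcategory` and `label`, the function name, and the use of proportional allocation with random sampling are all assumptions.

```python
import random
from collections import Counter, defaultdict


def curate_subset(samples: list[dict], max_per_subcategory: int = 200, seed: int = 0) -> list[dict]:
    """Select up to `max_per_subcategory` samples per subcategory while keeping
    each class's share close to its share in the original data.

    `samples` holds dicts with the hypothetical keys "subcategory" and "label".
    """
    rng = random.Random(seed)
    by_subcategory: dict = defaultdict(lambda: defaultdict(list))
    for sample in samples:
        by_subcategory[sample["subcategory"]][sample["label"]].append(sample)

    curated = []
    for classes in by_subcategory.values():
        total = sum(len(items) for items in classes.values())
        budget = min(max_per_subcategory, total)
        for items in classes.values():
            # Proportional allocation; rounding may shift the total by a few samples.
            quota = round(budget * len(items) / total)
            curated.extend(rng.sample(items, min(quota, len(items))))
    return curated


# Synthetic example: one large and one small subcategory.
pool = (
    [{"subcategory": "organ_enlargement", "label": "no"}] * 700
    + [{"subcategory": "organ_enlargement", "label": "yes"}] * 300
    + [{"subcategory": "lesion_outlier", "label": "no"}] * 20
    + [{"subcategory": "lesion_outlier", "label": "yes"}] * 10
)
subset = curate_subset(pool)
print(Counter((s["subcategory"], s["label"]) for s in subset))
```

Subcategories smaller than the cap are kept in full, which is why small categories such as lesion outlier remain small after curation.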
A notable limitation of this work is the relatively small size of the medical reasoning subset. This category is already underrepresented in the original benchmark and becomes further reduced after curation, as summarized in Table A6. While most medical reasoning subcategories preserve class-wise distributions after curation, with the exception of fatty liver, the reduced sample size limits the breadth of medical reasoning evaluation and should be considered when interpreting the results.
Table A6. Original and curated class-wise distributions across medical reasoning subcategories. While the curated dataset largely preserves the original class distributions, a noticeable limitation is the increased class imbalance in the fatty liver subcategory after curation.

| Subcategory | Class | Original Distribution (%) | Curated Distribution (%) | Original Total Samples | Curated Total Samples |
| --- | --- | --- | --- | --- | --- |
| Fatty Liver | No | 48.05 | 85.00 | 154 | 12 |
| Fatty Liver | Moderate | 32.47 | 10.00 | 154 | 12 |
| Fatty Liver | Light | 19.48 | 5.43 | 154 | 12 |
| Pancreatic Steatosis | Yes | 47.37 | 41.67 | 76 | 12 |
| Pancreatic Steatosis | No | 52.63 | 58.33 | 76 | 12 |
| Pancreatic Cyst Resectability | Yes | 80.00 | 58.33 | 35 | 12 |
| Pancreatic Cyst Resectability | No | 20.00 | 41.67 | 35 | 12 |
| Pancreatic Lesion Resectability | Res. | 48.28 | 27.57 | 29 | 14 |
| Pancreatic Lesion Resectability | Border. Res. | 34.48 | 70.40 | 29 | 14 |
| Pancreatic Lesion Resectability | Unres. | 17.24 | 2.03 | 29 | 14 |
| Lesion Type Classification | Tumor | 86.49 | 90.00 | 74 | 42 |
| Lesion Type Classification | Cyst | 13.51 | 10.00 | 74 | 42 |
Table A7. Original and curated class-wise distributions across visual reasoning subcategories. The primary limitation of the curation process is the reduced sample size of the lesion outlier subcategory.

| Subcategory | Class | Original Distribution (%) | Curated Distribution (%) | Total Original Samples | Total Curated Samples |
| --- | --- | --- | --- | --- | --- |
| Largest Lesion Attenuation | Hypo | 38.64 | 36.50 | 660 | 200 |
| Largest Lesion Attenuation | Hyper | 37.12 | 34.00 | 660 | 200 |
| Largest Lesion Attenuation | Iso | 24.24 | 29.50 | 660 | 200 |
| Kidney Volume Comparison | Left | 37.04 | 42.35 | 872 | 200 |
| Kidney Volume Comparison | Right | 29.24 | 32.14 | 872 | 200 |
| Kidney Volume Comparison | Same | 33.72 | 25.51 | 872 | 200 |
| Organ Enlargement | No | 73.59 | 63.50 | 3480 | 200 |
| Organ Enlargement | Yes | 26.41 | 36.50 | 3480 | 200 |
| Lesion Outlier | No | 43.66 | 48.72 | 71 | 39 |
| Lesion Outlier | Yes | 56.34 | 51.28 | 71 | 39 |
Figure A2. Distribution of regression tasks across both organ-based and lesion-oriented tasks. Figures shown with a cyan heatmap represent the original dataset distributions, while purple heatmaps correspond to the curated dataset distributions.

Appendix E. Hallucination Examples

Smaller models exhibit substantially higher hallucination rates, as reported in Table 5. In this context, hallucination refers to cases in which the model attempts to answer directly without appropriate tool invocation. Such responses are typically derived from pretrained priors rather than evidence obtained from CT scan analysis. Figure A3 illustrates a representative example from the Qwen0.6B model.
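Under this definition, a trace can be flagged automatically when its final answer is not backed by any successful tool output. The sketch below is one possible check, not the procedure used to produce the reported rates; the step keys `tool` and `succeeded` are hypothetical.

```python
def is_hallucinated(trace: list[dict]) -> bool:
    """Flag a trace whose final answer is not backed by any successful tool output.

    Each step is a dict with the hypothetical keys "tool" (tool name or None)
    and "succeeded" (bool).
    """
    return not any(step.get("tool") and step.get("succeeded", False) for step in trace)


# A trace in which the only tool call fails, followed by a direct answer.
failed_trace = [{"tool": "wrong_tool", "succeeded": False}, {"tool": None}]
grounded_trace = [{"tool": "segment_organ", "succeeded": True}, {"tool": None}]
print(is_hallucinated(failed_trace), is_hallucinated(grounded_trace))
```

This check covers both direct answers with no tool call and answers produced after every tool call has failed; it does not detect answers that ignore or contradict a successful tool output.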
Figure A3. Example of hallucination in the Qwen0.6B model. The task requires measuring liver volume. The model introduces an unsupported prior assumption during the planning stage (see Query Analyzer output). In Step 1, an incorrect tool is invoked and fails to produce any valid output, yet the Context Verifier accepts the step. The model subsequently generates a final answer without supporting evidence from the CT scan or tool outputs.

Appendix F. Incorrect Clinical Thresholding as a Source of Output Interpretation Failure

A recurring failure mode of MedToolica is output interpretation failure. Even when TotalSegmentator produces reliable segmentations or DiffTumor successfully detects lesions, the framework may still fail to translate accurate tool outputs into correct clinical judgments. This issue is most apparent in tasks where the final answer depends on threshold-based or range-based decisions, such as fatty liver, cyst resectability, lesion resectability, organ enlargement, and largest lesion attenuation. In these cases, errors often arise not from faulty perception tools, but from incorrect assumptions about clinically meaningful cutoffs or normal ranges. Figure A4, Figure A5, and Figure A6 present representative examples involving organ enlargement prediction, fatty liver assessment, and lesion attenuation classification, respectively.
Figure A4. Example of output interpretation failure in the organ enlargement task. The framework follows a correct reasoning trajectory and obtains a pancreas volume close to the ground truth. However, the final prediction is incorrect because an inaccurate normal reference range for pancreas volume is assumed.
Figure A5. Example of output interpretation failure in the fatty liver task. The orchestration process correctly identifies the required tools and obtains relevant measurements, but the final clinical decision is incorrect due to an erroneous threshold assumption.
Figure A6. Example of output interpretation failure in largest lesion attenuation classification. The framework correctly identifies the lesion and follows an appropriate reasoning path, but the final classification is incorrect because an inappropriate attenuation threshold is assumed. One intermediate tool call is unnecessary, although it does not affect the final outcome.
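The output interpretation failures described in this appendix can be illustrated with a minimal sketch. It assumes a hypothetical `ReferenceRange` record and a hypothetical `classify_enlargement` helper, neither of which is part of MedToolica, and it uses placeholder numbers that are not clinical reference values. The point is only that one accurate measurement can receive different labels depending on which normal range the final reasoning step assumes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReferenceRange:
    """Closed interval [low, high] treated as 'normal' (hypothetical helper type)."""
    low: float
    high: float
    source: str  # provenance of the range, kept so the decision can be audited


def classify_enlargement(volume_ml: float, ref: ReferenceRange) -> str:
    """Label a measured organ volume against an explicit reference range."""
    if volume_ml > ref.high:
        return "enlarged"
    if volume_ml < ref.low:
        return "reduced"
    return "normal"


# Placeholder values for illustration only; NOT clinical reference values.
measured_volume_ml = 95.0
curated_range = ReferenceRange(40.0, 110.0, source="hypothetical curated table")
assumed_range = ReferenceRange(30.0, 80.0, source="hypothetical orchestrator assumption")

label_curated = classify_enlargement(measured_volume_ml, curated_range)
label_assumed = classify_enlargement(measured_volume_ml, assumed_range)

# The measurement is identical; only the assumed range differs.
print(label_curated, label_assumed)
```

One possible mitigation, offered here only as an assumption, is to supply such cutoffs from a curated table rather than leaving them to the orchestrator's recall.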

References

  1. Xu, M.; Amiranashvili, T.; Navarro, F.; Fritsak, M.; Hamamci, I.E.; Shit, S.; Wittmann, B.; Er, S.; Christ, S.M.; de la Rosa, E.; et al. CADS: A Comprehensive Anatomical Dataset and Segmentation for Whole-Body Anatomy in Computed Tomography. arXiv 2025, arXiv:2507.22953.
  2. He, Y.; Guo, P.; Tang, Y.; Myronenko, A.; Nath, V.; Xu, Z.; Yang, D.; Zhao, C.; Simon, B.; Belue, M.; et al. VISTA3D: A unified segmentation foundation model for 3D medical imaging. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 20863–20873.
  3. Chen, Q.; Chen, X.; Song, H.; Xiong, Z.; Yuille, A.; Wei, C.; Zhou, Z. Towards generalizable tumor synthesis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 11147–11158.
  4. Wu, L.; Zhuang, J.; Ni, X.; Chen, H. Freetumor: Advance tumor segmentation via large-scale tumor synthesis. arXiv 2024, arXiv:2406.01264.
  5. Di Piazza, T.; Lazarus, C.; Nempont, O.; Boussel, L. Structured Spectral Graph Representation Learning for Multi-label Abnormality Analysis from 3D CT Scans. arXiv 2025, arXiv:2510.10779.
  6. Hamamci, I.E.; Er, S.; Menze, B. Ct2rep: Automated radiology report generation for 3d medical imaging. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Cham, Switzerland, 2024; pp. 476–486.
  7. Hosseini, A.; Ibrahim, A.; Serag, A. M3: Multimodal artificial intelligence for medical report generation and visual question answering from 3D abdominal CT scans. BJR|Artif. Intell. 2025, 2, ubaf011.
  8. Di Piazza, T.; Lazarus, C.; Nempont, O.; Boussel, L. Ct-agrg: Automated abnormality-guided report generation from 3d chest ct volumes. In Proceedings of the 2025 IEEE 22nd International Symposium on Biomedical Imaging (ISBI), Houston, TX, USA, 14–17 April 2025; pp. 1–5.
  9. Hamamci, I.E.; Er, S.; Wang, C.; Almas, F.; Simsek, A.G.; Esirgun, S.N.; Dogan, I.; Durugol, O.F.; Hou, B.; Shit, S.; et al. Generalist foundation models from a multimodal dataset for 3D computed tomography. Nat. Biomed. Eng. 2026, 1–19.
  10. Chen, H.; Zhao, W.; Li, Y.; Zhong, T.; Wang, Y.; Shang, Y.; Guo, L.; Han, J.; Liu, T.; Liu, J.; et al. 3d-ct-gpt: Generating 3d radiology reports through integration of large vision-language models. arXiv 2024, arXiv:2409.19330.
  11. Abd-Alrazaq, A.; Solaiman, B.; Mekki, Y.M.; Al-Thani, D.; Farooq, F.; Alkubeyyer, M.; Abubacker, M.Z.; AlSaad, R.; Aziz, S.; Serag, A.; et al. Hype vs reality in the integration of artificial intelligence in clinical workflows. JMIR Form. Res. 2025, 9, e70921.
  12. Moor, M.; Huang, Q.; Wu, S.; Yasunaga, M.; Dalmia, Y.; Leskovec, J.; Zakka, C.; Reis, E.P.; Rajpurkar, P. Med-flamingo: A multimodal medical few-shot learner. In Machine Learning for Health (ML4H); PMLR: Cambridge, MA, USA, 2023; pp. 353–367.
  13. Li, C.; Wong, C.; Zhang, S.; Usuyama, N.; Liu, H.; Yang, J.; Naumann, T.; Poon, H.; Gao, J. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Adv. Neural Inf. Process. Syst. 2023, 36, 28541–28564.
  14. Xu, W.; Chan, H.P.; Li, L.; Aljunied, M.; Yuan, R.; Wang, J.; Xiao, C.; Chen, G.; Liu, C.; Li, Z.; et al. Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning. arXiv 2025, arXiv:2506.07044.
  15. Sellergren, A.; Kazemzadeh, S.; Jaroensri, T.; Kiraly, A.; Traverse, M.; Kohlberger, T.; Xu, S.; Jamil, F.; Hughes, C.; Lau, C.; et al. Medgemma technical report. arXiv 2025, arXiv:2507.05201.
  16. Pan, J.; Liu, C.; Wu, J.; Liu, F.; Zhu, J.; Li, H.B.; Chen, C.; Ouyang, C.; Rueckert, D. Medvlm-r1: Incentivizing medical reasoning capability of vision-language models (vlms) via reinforcement learning. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Cham, Switzerland, 2025; pp. 337–347.
  17. Lai, Y.; Zhong, J.; Li, M.; Zhao, S.; Li, Y.; Psounis, K.; Yang, X. Med-r1: Reinforcement learning for generalizable medical reasoning in vision-language models. IEEE Trans. Med. Imaging 2026, 45, 2727–2737.
  18. Wu, C.; Zhang, X.; Zhang, Y.; Hui, H.; Wang, Y.; Xie, W. Towards generalist foundation model for radiology by leveraging web-scale 2d&3d medical data. Nat. Commun. 2025, 16, 7866.
  19. Lai, H.; Jiang, Z.; Yao, Q.; Wang, R.; He, Z.; Tao, X.; Wei, W.; Lv, W.; Zhou, S.K. E3D-GPT: Enhanced 3D visual foundation for medical vision-language model. arXiv 2024, arXiv:2410.14200.
  20. Bai, F.; Du, Y.; Huang, T.; Meng, M.Q.H.; Zhao, B. M3d: Advancing 3d medical image analysis with multi-modal large language models. arXiv 2024, arXiv:2404.00578.
  21. Xin, Y.; Ates, G.C.; Gong, K.; Shao, W. Med3dvlm: An efficient vision-language model for 3d medical image analysis. IEEE J. Biomed. Health Inform. 2025, 30, 2524–2536.
  22. Hosseini, A.; Ibrahim, A.; Serag, A. From Slices to Volumes: Multi-scale Fusion of 2D and 3D Features for CT Scan Report Generation. In International Conference on Medical Image Computing and Computer-Assisted Intervention; Springer: Cham, Switzerland, 2025; pp. 268–277.
  23. Helmy, H.; Hosseini, A.; Ibrahim, A.; Baig-Mirza, A.; Sadek, A.R.; Serag, A. SPINE: Segmentation-guided Processing and Integration of Multimodal Spinal MRI for Natural-Language Enhanced Report Generation. Appl. Artif. Intell. 2026, 40, 2626117.
  24. Shao, Z.; Wang, P.; Zhu, Q.; Xu, R.; Song, J.; Bi, X.; Zhang, H.; Zhang, M.; Li, Y.; Wu, Y.; et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv 2024, arXiv:2402.03300.
  25. Lai, H.; Jiang, Z.; Zhang, K.; Yao, Q.; Wang, R.; He, Z.; Tao, X.; Wei, W.; Zhou, S.K. Med3D-R1: Incentivizing Clinical Reasoning in 3D Medical Vision-Language Models for Abnormality Diagnosis. arXiv 2026, arXiv:2602.01200.
  26. Fathi, N.; Kumar, A.; Arbel, T. Aura: A multi-modal medical agent for understanding, reasoning and annotation. In International Workshop on Agentic AI for Medicine; Springer: Cham, Switzerland, 2025; pp. 105–114.
  27. Li, B.; Yan, T.; Pan, Y.; Luo, J.; Ji, R.; Ding, J.; Xu, Z.; Liu, S.; Dong, H.; Lin, Z.; et al. Mmedagent: Learning to use medical tools with multi-modal agent. In Findings of the Association for Computational Linguistics: EMNLP 2024; Association for Computational Linguistics: Miami, FL, USA, 2024; pp. 8745–8760.
  28. Fallahpour, A.; Ma, J.; Munim, A.; Lyu, H.; Wang, B. Medrax: Medical reasoning agent for chest x-ray. arXiv 2025, arXiv:2502.02673.
  29. Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.R.; Cao, Y. React: Synergizing reasoning and acting in language models. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023.
  30. Nath, V.; Li, W.; Yang, D.; Myronenko, A.; Zheng, M.; Lu, Y.; Liu, Z.; Yin, H.; Law, Y.M.; Tang, Y.; et al. Vila-m3: Enhancing vision-language models with medical expert knowledge. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 14788–14798.
  31. Xia, P.; Wang, J.; Peng, Y.; Zeng, K.; Dong, Z.; Wu, X.; Tang, X.; Zhu, H.; Li, Y.; Zhang, L.; et al. Mmedagent-rl: Optimizing multi-agent collaboration for multimodal medical reasoning. arXiv 2025, arXiv:2506.00555.
  32. Fan, Y.; Hao, J.; Chen, H.; Bao, J.; Shao, Y.; Liang, Y.; Hung, K.F.; Tang, H. OralGPT-Plus: Learning to Use Visual Tools via Reinforcement Learning for Panoramic X-ray Analysis. arXiv 2026, arXiv:2603.06366.
  33. Hoopes, A.; Dey, N.; Butoi, V.I.; Guttag, J.V.; Dalca, A.V. VoxelPrompt: A Vision Agent for End-to-End Medical Image Analysis. arXiv 2024, arXiv:2410.08397.
  34. Wang, Z.; Wu, J.; Cai, L.; Low, C.H.; Yang, X.; Li, Q.; Jin, Y. Medagent-pro: Towards evidence-based multi-modal medical diagnosis via reasoning agentic workflow. arXiv 2025, arXiv:2503.18968.
  35. Raza, M.; Salem, S.; Kwon, H.; Hussain, J.; Gu, Y.H.; Al-Antari, M.A. Multimodal Knowledge-Infused VLM for Respiratory Disease Prediction and Clinical Report Generation. IEEE J. Biomed. Health Inform. 2025; early access.
  36. Lin, Y.; Ding, Y.; Wu, Y.; Peng, Y. MARCH: Multi-Agent Radiology Clinical Hierarchy for CT Report Generation. arXiv 2026, arXiv:2604.16175.
  37. Mao, Y.; Xu, W.; Qin, Y.; Gao, Y. CT-Agent: A multimodal-LLM agent for 3D CT radiology question answering. arXiv 2025, arXiv:2505.16229.
  38. Roschewitz, M.; Styppa, K.; Tao, Y.; Sohn, J.; Delbrouck, J.B.; Gundersen, B.; Deperrois, N.; Bluethgen, C.; Vogt, J.; Menze, B.; et al. RadAgent: A tool-using AI agent for stepwise interpretation of chest computed tomography. arXiv 2026, arXiv:2604.15231.
  39. Feng, J.; Zheng, Q.; Wu, C.; Zhao, Z.; Zhang, Y.; Wang, Y.; Xie, W. M3Builder: A multi-agent system for automated machine learning in medical imaging. In International Workshop on Agentic AI for Medicine; Springer: Cham, Switzerland, 2025; pp. 115–124.
  40. Sellergren, A.; Gao, C.; Mahvar, F.; Kohlberger, T.; Jamil, F.; Traverse, M.; Tono, A.; Sadjad, B.; Yang, L.; Lau, C.; et al. Medgemma 1.5 technical report. arXiv 2026, arXiv:2604.05081.
  41. Erdur, A.C.; Scholz, D.; Pan, J.; Wiestler, B.; Rueckert, D.; Peeken, J.C. Agentic Large Language Models for Training-Free Neuro-Radiological Image Analysis. arXiv 2026, arXiv:2604.16729.
  42. Lu, P.; Chen, B.; Liu, S.; Thapa, R.; Boen, J.; Zou, J. Octotools: An agentic framework with extensible tools for complex reasoning. arXiv 2025, arXiv:2502.11271.
  43. Wasserthal, J.; Breit, H.C.; Meyer, M.T.; Pradella, M.; Hinck, D.; Sauter, A.W.; Heye, T.; Boll, D.T.; Cyriac, J.; Yang, S.; et al. TotalSegmentator: Robust segmentation of 104 anatomic structures in CT images. Radiol. Artif. Intell. 2023, 5, e230024.
  44. Tian, J.; Liu, L.; Shi, Z.; Xu, F. Automatic couinaud segmentation from CT volumes on liver using GLC-UNet. In International Workshop on Machine Learning in Medical Imaging; Springer: Cham, Switzerland, 2019; pp. 274–282.
  45. Chen, Y.; Xiao, W.; Bassi, P.R.; Zhou, X.; Er, S.; Hamamci, I.E.; Zhou, Z.; Yuille, A. Are vision language models ready for clinical diagnosis? a 3d medical benchmark for tumor-centric visual question answering. arXiv 2025, arXiv:2505.18915.
  46. Lin, L.I.-K. A concordance correlation coefficient to evaluate reproducibility. Biometrics 1989, 45, 255–268.
Figure 1. Overview of MedToolica. Starting from a query, the system decomposes the request into manageable sub-tasks, selectively invokes appropriate tools, and integrates their results via iterative reasoning to produce a final answer.
Figure 2. Illustration of the setup used in the MedToolica framework. (A) Agents and their corresponding system prompts, used to guide each agent in performing a specific task. (B) List of selected tools with their corresponding input/output formats along with a description of the role of each tool.
Figure 3. Examples of the orchestration flow of the MedToolica system for measuring spleen volume.
Figure 4. Example of an organ aggregation task completed in four steps, including self-recovery from an execution error in Step 1, ultimately leading to the correct final answer.
Figure 5. Spearman correlation coefficients (ρ) across models and measurement tasks. Cell values denote Spearman’s ρ, while significance levels are indicated by (***) p < 10⁻⁵, and “ns” denotes a non-significant correlation.
Figure 6. Bland–Altman analysis of prediction bias and error dispersion between model predictions and the ground truth. Agentic approaches (MedToolica and ReAct) achieve the smallest systematic bias and the narrowest limits of agreement across measurement tasks, indicating closer agreement with ground-truth values.
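For readers unfamiliar with the quantities in Figure 6, the following is a minimal sketch of a Bland–Altman computation. It assumes the conventional definitions (bias as the mean of prediction minus ground truth, limits of agreement as bias ± 1.96 sample standard deviations); the function name `bland_altman` and the toy values are hypothetical, and the authors' exact settings may differ.

```python
from statistics import fmean, stdev


def bland_altman(pred, truth):
    """Return (bias, lower_limit, upper_limit) for paired measurements."""
    if len(pred) != len(truth) or len(pred) < 2:
        raise ValueError("need two equally sized samples with at least 2 pairs")
    diffs = [p - t for p, t in zip(pred, truth)]
    bias = fmean(diffs)   # systematic offset of predictions from the ground truth
    spread = stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * spread, bias + 1.96 * spread


# Toy values for illustration only.
truth = [100.0, 200.0, 300.0, 400.0]
pred = [102.0, 198.0, 305.0, 395.0]

bias, lower, upper = bland_altman(pred, truth)
print(f"bias={bias:.2f}, limits of agreement=({lower:.2f}, {upper:.2f})")
```

A bias near zero with narrow limits corresponds to the behaviour the caption attributes to the agentic approaches.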
Figure 7. Visualization of 95% Wilson confidence intervals (left) and pair-wise McNemar test results (right) across (A) visual reasoning and (B) medical reasoning tasks. Significance levels are indicated by (***) p < 10⁻⁵, (**) p < 10⁻³, (*) p < 0.05, and “ns” denotes a non-significant difference.
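The two statistics named in Figure 7 can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the Wilson interval uses the standard score-interval formula, and the McNemar test is shown in its exact binomial form because the text does not state which variant was used. The names `wilson_interval` and `mcnemar_exact` and the example counts are hypothetical.

```python
from math import comb, sqrt


def wilson_interval(successes: int, n: int, z: float = 1.959963984540054):
    """Wilson score interval for a binomial proportion (z defaults to 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: items model A answers correctly and model B incorrectly.
    c: items model B answers correctly and model A incorrectly.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical counts for illustration only.
lo, hi = wilson_interval(50, 100)
p_value = mcnemar_exact(1, 9)
print(f"Wilson 95% CI: ({lo:.4f}, {hi:.4f}); McNemar p = {p_value:.4f}")
```

Only the discordant pairs enter the McNemar test, which is why two models with similar accuracy can still differ significantly when they fail on different items.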
Figure 8. Effect of core LLM selection on orchestration behavior in MedToolica. Models are evaluated by decomposed failure profiles including hallucination, reasoning errors, execution errors, and ground-truth misprediction.
Table 1. Comparison of medical VLMs and agentic frameworks.
| Method | Modality | Domain | Training Requirement | Task Focus | Agentic Design | Tool Extensibility | Quantitative Tool Support |
|---|---|---|---|---|---|---|---|
| Non-Agentic Medical VLMs | | | | | | | |
| RadFM [18] | 3D | CT | End-to-End Pretraining | VQA/MRG | × | | |
| E3D-GPT [19] | 3D | CT | Full FT | VQA | × | | |
| M3D [20] | 3D | CT, MRI | Full FT | VQA/MRG | × | | |
| Med3DVLM [21] | 3D | CT | Full FT | VQA | × | | |
| Ct2Rep [6] | 3D | Chest CT | End-to-End Pretraining | MRG | × | | |
| CtChat [9] | 3D | Chest CT | Full FT | VQA/MRG | × | | |
| SAMF [22] | 3D | Chest CT | Full FT | VQA/MRG | × | | |
| Med3D-R1 [25] | 3D | CT | Full + RL FT | VQA/Reasoning | × | | |
| MedGemma1.5 [40] | 2D/3D | General | Full FT | VQA/MRG/Reasoning | × | | |
| Agentic Medical Systems | | | | | | | |
| AURA [26] | 2D | Chest Xray | × | MRG/VQA/Diagnosis | Orchestrator/ReAct | Modular | None |
| MedRax [28] | 2D | Chest Xray | × | MRG/VQA/Grounding | Orchestrator/ReAct | Modular | None |
| MMedAgent [27] | 2D | General | × | MRG/VQA/Diagnosis/Retrieval | Orchestrator/ReAct | Modular | None |
| MMedAgent-RL [31] | 2D | General | RL FT | VQA | Orchestrator/ReAct | Fixed | None |
| VILA-M3 [30] | 3D | General | Full FT | MRG/VQA | Orchestrator/ReAct | Fixed | Limited |
| MARCH [36] | 3D | CTs | Partial FT | MRG/Retrieval | Handoffs/ReAct | Modular | None |
| VoxelPrompt [33] | 3D | Brain MRI | Full FT | MRG/VQA/Grounding | Orchestrator/ReAct | Fixed | Explicit |
| Neuro-Radiological Agent [41] | 3D | Brain MRI | × | VQA | Agent-as-Tools/Handoffs/Orchestrator | Modular | Explicit |
| CT-Agent [37] | 3D | Chest CT | × | MRG/VQA | Orchestrator/ReAct | Modular | Limited |
| RadAgent [38] | 3D | Chest CT | × | MRG/VQA | Orchestrator/ReAct | Modular | None |
| MedToolica | 3D | Abdominal CTs | × | VQA/Measurement | Orchestrator/Role-Based (Octotools) | Modular | Explicit |
Table 2. Results on organ-level measurement tasks. Bold values indicate the best performance across baselines.
| Model | Organ Volume mAE | Organ Volume CCC | Organ HU mAE | Organ HU CCC | Organ Aggregation mAE | Organ Aggregation CCC |
|---|---|---|---|---|---|---|
| MedGemma1.5 [15] | 257.11 | 0.76 | 56.74 | 0.06 | 478.53 | 0.46 |
| SAMF [22] | 124.15 | 0.89 | 29.65 | 0.29 | 234.06 | 0.90 |
| M3D [20] | 137.56 | 0.87 | 28.13 | 0.38 | 242.51 | 0.86 |
| Med3DVLM [21] | 123.66 | 0.90 | 27.10 | 0.46 | 212.43 | 0.88 |
| ReAct | 48.791 | 0.98 | 15.00 | 0.76 | 65.749 | 0.98 |
| MedToolica | 38.781 | 0.99 | 9.564 | 0.92 | 107.18 | 0.93 |
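The two metrics in Table 2 can be sketched as follows, assuming that mAE denotes mean absolute error and that CCC is Lin's concordance correlation coefficient [46] computed with population (1/n) moments: CCC = 2·s_xy / (s_x² + s_y² + (x̄ − ȳ)²). The function names and toy values are hypothetical and are not taken from the paper.

```python
from statistics import fmean


def mean_absolute_error(pred, truth):
    """Average absolute deviation between predictions and ground truth."""
    return fmean(abs(p - t) for p, t in zip(pred, truth))


def concordance_correlation(pred, truth):
    """Lin's CCC using population (1/n) variances and covariance."""
    n = len(pred)
    if n != len(truth) or n == 0:
        raise ValueError("need two equally sized, non-empty samples")
    mx, my = fmean(pred), fmean(truth)
    sxx = sum((x - mx) ** 2 for x in pred) / n
    syy = sum((y - my) ** 2 for y in truth) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(pred, truth)) / n
    denom = sxx + syy + (mx - my) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined for two identical constant samples")
    return 2 * sxy / denom


# Toy values: predictions are perfectly correlated with the truth
# but carry a constant offset of +50.
truth = [100.0, 200.0, 300.0, 400.0]
pred = [150.0, 250.0, 350.0, 450.0]

mae = mean_absolute_error(pred, truth)
ccc = concordance_correlation(pred, truth)
print(f"mAE = {mae:.1f}, CCC = {ccc:.4f}")
```

Unlike Pearson correlation, which would be 1 for these toy values, CCC drops below 1 because it also penalizes the systematic offset, which is why it suits measurement agreement.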
Table 3. Accuracy on visual reasoning tasks (closed-ended). Bold values indicate the best performance across baselines.
| Model | Organ Enlargement | Kidney Volume Comp. | Lesion Outlier | Largest Lesion Attenuation | Largest Lesion Location | Inter-Segment Comp. |
|---|---|---|---|---|---|---|
| MedGemma1.5 [15] | 44.5 | 45.6 | 47.4 | 33.5 | 44.1 | 37.8 |
| SAMF [22] | 36.5 | 41.3 | 51.3 | 34.0 | 31.7 | 8.0 |
| M3D [20] | 73.0 | 27.6 | 46.2 | 50.0 | 29.0 | 52.5 |
| Med3DVLM [21] | 73.0 | 42.3 | 53.8 | 51.5 | 36.1 | 47.5 |
| ReAct | 59.5 | 64.3 | 35.9 | 17.0 | 37.2 | 61.0 |
| MedToolica | 59.2 | 66.8 | 64.1 | 42.5 | 43.2 | 63.5 |
Table 4. Accuracy on medical reasoning tasks. Bold values indicate the best performance across baselines.
| Model | Fatty Liver | Pancreas Steatosis | Cyst Resectability | Lesion Resectability | Lesion Type Classification |
|---|---|---|---|---|---|
| MedGemma1.5 [15] | 50.0 | 58.3 | 60.0 | 42.9 | 57.1 |
| SAMF [22] | 21.4 | 58.4 | 66.7 | 50.0 | 95.2 |
| M3D [20] | 85.7 | 50.0 | 66.7 | 21.4 | 92.9 |
| Med3DVLM [21] | 85.7 | 41.7 | 60.0 | 28.6 | 92.9 |
| ReAct | 85.7 | 83.3 | 20.0 | 21.4 | 35.7 |
| MedToolica | 57.1 | 75.0 | 40.0 | 35.7 | 76.2 |
Table 5. Diagnostic view of how different core LLMs interact with tools in the orchestration framework, capturing not only how frequently tools are used but also how reliably they are executed. Tool utilization measures the proportion of reasoning steps involving tool interactions, while failure burden quantifies the fraction of tool calls resulting in execution failures.
| Model | Avg Steps | Tool Utilization | Failure Burden |
|---|---|---|---|
| Nemotron3-30B-A3B | 2.78 ± 1.60 | 93.53 | 19.62 |
| Qwen-14B | 3.60 ± 1.74 | 95.83 | 32.17 |
| Ministral3-14B | 3.64 ± 1.94 | 98.35 | 13.96 |
| Gemma4-31B | 2.39 ± 1.33 | 88.28 | 12.32 |
| GLM-4.7-Flash-31B | 3.36 ± 2.36 | 56.25 | 15.35 |
| Ministral3-8B | 4.30 ± 3.05 | 47.44 | 14.70 |
| Qwen-8B | 2.89 ± 1.87 | 96.53 | 34.05 |
| Nemotron3-4B | 2.91 ± 2.44 | 53.61 | 23.08 |
| Qwen-4B | 4.69 ± 2.62 | 83.80 | 49.11 |
| Qwen-0.6B | 7.55 ± 4.04 | 36.42 | 43.27 |
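The two ratios defined in the Table 5 caption can be sketched as below. The `Step` record, the `tool_metrics` function, and the assumption of at most one tool call per reasoning step are all hypothetical; the trace format actually logged by MedToolica is not described here, and values are expressed as percentages on the assumption that the table does the same.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Step:
    """One reasoning step of a hypothetical orchestration trace."""
    tool: Optional[str] = None  # None when the step involves no tool interaction
    failed: bool = False        # True when the tool call ended in an execution failure


def tool_metrics(steps: List[Step]) -> Tuple[float, float]:
    """Return (tool_utilization, failure_burden), both as percentages."""
    tool_steps = [s for s in steps if s.tool is not None]
    # Share of reasoning steps that involve a tool interaction.
    utilization = 100.0 * len(tool_steps) / len(steps) if steps else 0.0
    # Share of tool calls that resulted in an execution failure.
    burden = (
        100.0 * sum(s.failed for s in tool_steps) / len(tool_steps)
        if tool_steps
        else 0.0
    )
    return utilization, burden


# Hypothetical trace: a failed call, a retry, a second tool, then a final answer.
trace = [
    Step(tool="segmenter", failed=True),
    Step(tool="segmenter"),
    Step(tool="volume_calculator"),
    Step(),  # pure reasoning step producing the final answer
]

utilization, burden = tool_metrics(trace)
print(f"tool utilization = {utilization:.2f}%, failure burden = {burden:.2f}%")
```

Keeping the two ratios separate matters because a model can call tools often yet fail frequently, or call them rarely yet reliably.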
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Hosseini, A.; Serag, A. MedToolica: Finetuning-Free Agentic Compositional Tool Learning for 3D CT Reasoning. Mach. Learn. Knowl. Extr. 2026, 8, 162. https://doi.org/10.3390/make8060162
