1. Introduction
Computed Tomography (CT) provides rich and detailed information about anatomical structures. Clinical interpretation of 3D CT scans is compositional by nature, where compositionality is the ability to construct complex functions from reusable primitives, which provides a foundation for systematic generalization. For example, determining whether a liver lesion is resectable requires measuring its volume, assessing its attenuation in Hounsfield Units (HU), localizing it within a hepatic segment, comparing it against surrounding structures, and applying clinical resectability criteria, all linked in a chain of dependent reasoning steps. This inherent compositionality motivates a fundamental question: can an AI system that explicitly models this compositional structure outperform monolithic models that learn to answer queries end-to-end?
Despite significant progress in applying AI to radiological workflows, existing approaches to three-dimensional CT analysis have largely been developed as specialized models targeting specific clinical tasks. Prior work has demonstrated strong performance in areas such as organ segmentation [1, 2], lesion detection [3, 4], and classification [5], where each model focuses on a well-defined aspect of analysis. Recent advances in vision–language models (VLMs) have extended these capabilities toward more flexible query interfaces for 3D CT data; however, such models introduce additional limitations.
First, many VLMs for 3D CT analysis rely heavily on finetuning and are trained on relatively narrow datasets. As a result, their learned representations generalize poorly under distribution shifts. These models often exhibit biases toward specific reporting styles or answer distributions [6, 7, 8], and their application to out-of-distribution data typically requires further finetuning, which is not always feasible in practical or clinical settings.
Second, most existing approaches are optimized for direct prediction, producing single-shot outputs without explicitly modeling intermediate reasoning steps. This limits the ability to assess whether predictions are grounded in visual evidence or influenced by spurious correlations learned during training [9, 10]. From a clinical perspective, interpretability is critical, as practitioners must understand how a model arrives at its conclusions [11]. Despite achieving strong performance on specific tasks, many current 3D medical models for CT scans lack transparency, providing final answers without interpretable reasoning processes.
Agentic frameworks, in contrast, provide a natural paradigm for handling compositional clinical reasoning. By decomposing complex queries into structured sub-tasks and routing them to specialized agents and expert tools, such systems can leverage the strengths of existing domain-specific models without requiring retraining, thereby improving robustness to out-of-distribution scenarios. This approach aligns with the inherently compositional nature of clinical workflows while also introducing transparency, as each intermediate step can be inspected and validated, an essential property for clinical deployment.
To this end, we propose an agentic framework, termed MedToolica, for coordinated analysis of 3D CT data through the integration of multiple specialized models. Unlike monolithic end-to-end architectures, our approach adopts a modular design based on tool learning and role-based agent orchestration, enabling a finetuning-free inference pipeline. Within this framework, complex clinical queries are decomposed into sequences of sub-tasks, each mapped to appropriate tools, enabling hierarchical and long-horizon reasoning. We use this framework as an initial study and an empirical lens, and make the following contributions.
We introduce MedToolica, a finetuning-free, role-based agentic framework for quantitative 3D abdominal CT reasoning that integrates specialized tool outputs within a modular workflow.
We empirically demonstrate that expert tool reliability is a central bottleneck in compositional medical reasoning and quantify its impact on overall system performance across tasks and model families.
We benchmark MedToolica against finetuned 3D medical VLMs and agentic/reasoning baselines, characterizing where the framework offers advantages and where performance remains comparable or task dependent.
We systematically analyze how LLM scale and capability affect orchestration quality, tool utilization patterns, and multi-hop reasoning behavior in this agentic setting.
2. Related Works
3D Medical Vision–Language Models. Building on recent progress in 2D medical VLMs [12, 13, 14, 15, 16, 17], a growing body of work has explored 3D CT understanding. RadFM [18] employs a Q-former-based architecture to aggregate volumetric representations prior to alignment with textual embeddings. Likewise, E3D-GPT [19] utilizes a self-supervised pretrained 3D encoder and a 3D perceiver module to project volumetric features into a language model’s latent space. Furthermore, M3D [20] introduces spatial average pooling to efficiently compress 3D features, whereas Med3DVLM [21] enhances cross-slice representation learning via an MLP-Mixer that jointly models spatial and semantic dependencies. SAMF [22] integrates slice-wise and volumetric representations for CT analysis, and SPINE [23] explores a segmentation-guided approach in spinal MRI. Extending this direction, reinforcement learning-based reasoning has been incorporated via GRPO training [24], producing structured intermediate reasoning steps in Med3D-R1 [25]. Despite these advances, all such approaches rely on monolithic finetuned architectures with limited support for modular, compositional reasoning.
Agentic Approaches in Medical Imaging. More recently, agentic paradigms have been introduced to enhance medical VLMs with dynamic reasoning and tool utilization capabilities. Early works such as AURA [26], MMedAgent [27], and MedRax [28] adopted the ReAct [29] framework to coordinate multiple expert modules for radiology-specific tasks, enabling iterative reasoning through tool invocation. Moreover, finetuning-based approaches such as VILA-M3 [30], MMedAgent-RL [31], and OralGPT-Plus [32] integrate tool usage within a central VLM, where the model is trained to invoke external tools during inference. In a similar direction, Voxel Prompt [33] proposes a finetuning-based programmatic interface that translates natural language queries into executable code, allowing vision models to perform analytical operations such as quantitative measurement and longitudinal assessment.
Other approaches explore agentic designs in more specialized settings. MedAgent-Pro [34] targets multimodal clinical reasoning by incorporating evidence-based decision processes and structured workflows for diagnosis. Additionally, MedXRay-CAD [35] introduces an agent-based framework that combines retrieval mechanisms and expert classifiers to align image and text representations for respiratory disease diagnosis and report generation. MARCH [36] introduces an agentic framework for CT report generation that incorporates retrieval-based validation. Its dual-stage pipeline first employs a trainable model to produce an initial report draft, which is subsequently processed by an agentic system that validates and refines the generated report through retrieval-guided verification.
Moreover, CT-Agent [37] focuses on improving perception efficiency in 3D CT analysis by leveraging anatomy-specific tools alongside token compression mechanisms through a ReAct approach to address spatial complexity. Also, RadAgent [38] explores agentic CT analysis through the orchestration of expert models and external tools within a ReAct framework, with a primary focus on chest CT applications. Beyond medical reasoning, M3Builder [39] extends the agentic paradigm to machine learning systems by enabling collaborative agents for training-time automation.
Our work investigates finetuning-free, role-based compositional reasoning for 3D volumetric CT analysis, with a particular emphasis on quantitative medical analysis through modular tool integration. Unlike conventional ReAct-based paradigms, we adopt an agentic role-based framework and further provide an empirical characterization of tool quality as a key determinant of performance in compositional agentic pipelines. See Table 1 for a summary of related works.
3. Materials and Methods
Our framework adopts the training-free, tool card-based orchestration philosophy introduced in OctoTools [42] but adapts it for medical 3D CT reasoning through explicit role specialization and CT-specific expert tools. A principal advantage of this design lies in its finetuning-free nature, which permits the incorporation of new tools without retraining, thereby enhancing the system’s flexibility and suitability for dynamic clinical environments. Under this regime, complex medical queries are decomposed into subproblems and resolved via the coordinated application of multiple analytical capabilities (see Figure 1 for more details). An illustration of the agent roles, the per-agent system prompts, and the tool I/O schemas used during the iterative context-update loop is shown in Figure 2.
3.1. Problem Formulation
We formulate the task as mapping a natural language query $q$ and a 3D CT volume $x$ to an output $y$. Unlike conventional approaches that learn a direct mapping $f: (q, x) \mapsto y$, our framework models the problem as a sequential decision process over a set of tools $\mathcal{T}$. At each step $t$, an action $a_t$ is selected based on the current context $c_t$, and the resulting tool output $o_t$ is used to update the context.
The objective is to construct a sequence of actions $a_1, \dots, a_T$ such that the final context $c_T$ contains sufficient information to generate the output $y$. The context represents the accumulated state of the system, including intermediate tool outputs, query representations, and reasoning traces, and serves as the basis for decision making across steps.
3.2. Agentic Flow
MedToolica follows a multi-step, role-based architecture adapted from OctoTools consisting of five agents: the Query Analyzer, Action Predictor, Action Executor, Context Verifier, and Summarizer. It should be noted that the five-role design was not introduced to arbitrarily increase agent count, but rather to reflect the structure of the clinical reasoning process itself.
The Query Analyzer operates at the initial stage of the framework, where it interprets and disambiguates the user query while identifying the tools required for downstream processing. Its output follows a structured format composed of four elements: Concise Summary, which captures the main objective of the query; Required Skills, which lists the capabilities needed to solve the task; Relevant Tools, which identifies suitable tools from the toolbox; and Additional Context, which includes any supplementary considerations relevant to effective task completion. For a detailed description of the output format schemas, refer to Appendix A.
Subsequently, the Action Predictor and Action Executor operate in an iterative loop. The predictor selects the most appropriate next action based on the current context, while the executor invokes the corresponding tool and returns its output. After each step, the context is updated to support adaptive multi-step reasoning. The Action Predictor produces four concise fields: Justification, explaining the selected action; Context, containing the information required for tool execution; Sub Goal, defining the immediate objective; and Tool Name, specifying the selected tool.
The Context Verifier assesses whether the query has been sufficiently resolved after each iteration. If additional reasoning is required, the framework continues by selecting further actions. This verification stage also improves robustness by enabling recovery from intermediate errors through refinement of subsequent steps. Its output consists of two fields: Stop Signal, indicating whether the reasoning process should terminate, and Analysis, providing a brief justification for that decision.
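The structured outputs described for the Query Analyzer, Action Predictor, and Context Verifier can be pictured as small typed records. The following is a minimal sketch only: the field names are taken from the text, but the class names, Python types, and the `validate_action` helper are assumptions for illustration and are not the schemas given in Appendix A.

```python
from dataclasses import dataclass


@dataclass
class QueryAnalysis:
    """Query Analyzer output (field names from the text, types assumed)."""
    concise_summary: str
    required_skills: list[str]
    relevant_tools: list[str]
    additional_context: str = ""


@dataclass
class ActionPrediction:
    """Action Predictor output (field names from the text, types assumed)."""
    justification: str
    context: str
    sub_goal: str
    tool_name: str


@dataclass
class Verification:
    """Context Verifier output (field names from the text, types assumed)."""
    stop_signal: bool
    analysis: str


def validate_action(action: ActionPrediction, toolbox: set[str]) -> bool:
    """Hypothetical guard: reject an action that names a tool outside the toolbox."""
    return action.tool_name in toolbox


# Illustrative values only; the tool names are made up for this sketch.
action = ActionPrediction(
    justification="A spleen mask is needed before measuring volume.",
    context="ct_path=scan.nii.gz",
    sub_goal="Segment the spleen",
    tool_name="organ_segmenter",
)
is_valid = validate_action(action, {"organ_segmenter", "measurement"})
```

Parsing each agent response into a record of this kind would make malformed outputs detectable before they reach the next stage, which relates to the execution failures discussed later in the evaluation.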
Once the query is deemed complete or the maximum number of iterations is reached, the process transitions to the Summarizer, which aggregates the intermediate results and generates the final response. A detailed description of the procedure is provided in Algorithm 1, while the system prompts used to guide the agents are illustrated in Figure 2A.
Algorithm 1 MedToolica: agentic 3D CT analysis via compositional tool learning with max iterations
Require: Query q, CT scan x, toolset T, max iterations N
 1: Initialize context c ← ∅
 2: Initialize done ← False
 3: Initialize number_iter ← 0
    Stage 1: Query Understanding
 4: c ← QueryAnalyzer(q, T) ▹ analyze the query
    Stage 2: Iterative Reasoning and Tool Use
 5: while not done and number_iter < N do
 6:     a ← ActionPredictor(c) ▹ select next tool/action
 7:     o ← ActionExecutor(a, x) ▹ execute tool
 8:     c ← c ∪ {o} ▹ update context with output
 9:     done ← ContextVerifier(c) ▹ check termination
10:     number_iter ← number_iter + 1
11: end while
    Stage 3: Final Response
12: y ← Summarizer(c) ▹ generate final answer
13: return y
While the five-agent decomposition was designed to mirror the sequential stages of clinical reasoning—query understanding, iterative action, verification, and summarization—alternative configurations are conceivable. For instance, the Action Predictor and Action Executor could be merged into a single action agent, or the Context Verifier could be folded into the Summarizer. However, such merges would reduce modularity and limit the ability to recover from intermediate errors independently at each stage.
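The control flow of the five roles can be sketched in a few lines of Python. This is an illustrative skeleton under stated assumptions, not the authors' implementation: each agent is passed in as a plain callable, the context is a dictionary, and the stub tools and the default of five iterations are hypothetical.

```python
from typing import Callable

Tool = Callable[[dict], str]  # assumption: a tool maps the context to a text output


def run_pipeline(query, ct_path, tools: dict[str, Tool],
                 analyze, predict, verify, summarize, max_iters=5):
    """Predict -> execute -> verify until done or the step budget is spent."""
    context = {"query": query, "ct_path": ct_path, "steps": []}
    context["analysis"] = analyze(query, sorted(tools))  # Stage 1: query understanding
    done, n_iter = False, 0
    while not done and n_iter < max_iters:               # Stage 2: iterative tool use
        tool_name = predict(context)
        try:
            output = tools[tool_name](context)
        except Exception as exc:
            # A failed call is recorded so that later steps can recover from it.
            output = f"ERROR: {exc!r}"
        context["steps"].append({"tool": tool_name, "output": output})
        done = verify(context)
        n_iter += 1
    return summarize(context)                            # Stage 3: final response


# Stub agents and tools, for illustration only.
tools = {
    "segment": lambda ctx: "spleen_mask",
    "measure": lambda ctx: "volume_ml=212.0",
}
answer = run_pipeline(
    "What is the spleen volume?", "scan.nii.gz", tools,
    analyze=lambda q, names: {"relevant_tools": names},
    predict=lambda ctx: "segment" if not ctx["steps"] else "measure",
    verify=lambda ctx: ctx["steps"][-1]["tool"] == "measure",
    summarize=lambda ctx: ctx["steps"][-1]["output"],
)
```

Keeping the predictor, executor, and verifier as separate callables mirrors the modularity argument above: a failed tool call is written into the context instead of aborting the run, so the next prediction can react to it.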
3.3. Tool Cards
In this work, tool learning refers to the selection and composition of pre-existing expert models rather than to training the core orchestrator to use tools optimally. The proposed framework incorporates a curated set of expert tools, each designed to address specific analytical tasks in three-dimensional CT scan interpretation. In particular, four primary tools are employed for whole-body segmentation, kidney cyst detection, liver subsegment segmentation, and lesion (tumor) segmentation. These domain-specific tools are complemented by a collection of auxiliary utility modules that facilitate compositional reasoning and downstream analysis. See Figure 2B for more details on tools and their input and output formats.
For whole-body segmentation and kidney cyst detection, we adopt models provided by TotalSegmentator [43], which offer comprehensive anatomical coverage and reliable performance across diverse structures. Similarly, liver subsegment segmentation is conducted using a specialized model derived from the same framework and tailored to provide fine-grained partitioning of hepatic regions [44]. For tumor and lesion analysis, we employ the DiffTumor model [3], which supports the detection and segmentation of lesions across major abdominal organs, including the liver, kidneys, and pancreas.
In addition to these expert models, a suite of utility tools has been developed to support higher-level reasoning and quantitative assessment. These include but are not limited to (i) a measurement tool for estimating anatomical volumes and Hounsfield Unit (HU) statistics from CT scans and corresponding segmentation masks, (ii) a tumor analysis tool for counting and characterizing lesions based on segmentation outputs, and (iii) an overlap analysis tool for identifying and quantifying spatial intersections between multiple segmentation masks.
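The three utility tools reduce to simple voxel arithmetic once a segmentation mask is available. The sketch below is a dependency-free illustration and not the paper's code: it assumes the CT volume and masks have been flattened to equal-length sequences, that the mask is binary, and that voxel spacing is given in millimetres. A practical version would more likely operate on NumPy arrays loaded from NIfTI files.

```python
from statistics import mean, pstdev


def volume_ml(mask, spacing_mm):
    """Mask volume in millilitres: voxel count x voxel volume (mm^3) / 1000."""
    sx, sy, sz = spacing_mm
    return sum(1 for m in mask if m) * sx * sy * sz / 1000.0


def hu_stats(hu, mask):
    """Mean and population standard deviation of HU values inside the mask."""
    inside = [v for v, m in zip(hu, mask) if m]
    if not inside:
        return None  # empty mask: no statistics to report
    return {"mean": mean(inside), "std": pstdev(inside)}


def overlap_voxels(mask_a, mask_b):
    """Number of voxels that belong to both masks."""
    return sum(1 for a, b in zip(mask_a, mask_b) if a and b)


# Toy four-voxel example with made-up values.
hu = [50, 60, -100, 55]
organ = [1, 1, 0, 1]
lesion = [0, 1, 0, 0]
organ_volume = volume_ml(organ, (1.0, 1.0, 2.0))
organ_hu = hu_stats(hu, organ)
shared = overlap_voxels(organ, lesion)
```

Because these quantities are computed directly from the mask, their accuracy is bounded by the segmentation quality, which is the dependency on tool reliability examined in the results.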
It should be noted that the selection of tools in this work was guided by accessibility, ease of integration, and their reported generalizability and performance. However, the proposed framework is modular by design; therefore, practitioners may integrate alternative segmentation models or tools without modifying the overall orchestration framework provided interface compatibility is maintained.
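One way to picture the interface-compatibility requirement is a registry of tool cards that the orchestrator reads but never hard-codes. The sketch below is hypothetical: `ToolCard`, `Toolbox`, and all field and tool names are assumptions for illustration and do not reproduce the schemas shown in Figure 2B.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ToolCard:
    """Hypothetical tool card: metadata for the agents plus the callable itself."""
    name: str
    description: str
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]
    run: Callable[..., dict]


class Toolbox:
    def __init__(self):
        self._cards = {}

    def register(self, card: ToolCard) -> None:
        if card.name in self._cards:
            raise ValueError(f"duplicate tool name: {card.name}")
        self._cards[card.name] = card

    def describe(self) -> str:
        """Plain-text listing that could be placed in an agent prompt."""
        return "\n".join(
            f"{c.name}: {c.description} "
            f"(in: {', '.join(c.inputs)}; out: {', '.join(c.outputs)})"
            for c in self._cards.values()
        )

    def call(self, name: str, **kwargs) -> dict:
        return self._cards[name].run(**kwargs)


toolbox = Toolbox()
toolbox.register(ToolCard(
    name="organ_segmenter",  # made-up name; stands in for any segmentation model
    description="Segments a named organ and returns a mask path.",
    inputs=("ct_path", "organ"),
    outputs=("mask_path",),
    run=lambda ct_path, organ: {"mask_path": f"{organ}_mask.nii.gz"},
))
result = toolbox.call("organ_segmenter", ct_path="scan.nii.gz", organ="spleen")
```

Under this design, swapping in an alternative segmentation model amounts to registering a new card with the same input and output fields; the orchestration loop itself is unchanged.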
3.4. Dataset
The proposed framework is evaluated on a curated subset of the DeepVQATumor dataset [45], a benchmark designed for reasoning over three-dimensional CT scans in a VQA setting. The evaluation focuses on three categories (measurement, visual reasoning, and medical reasoning), each assessing distinct aspects of clinical and analytical capability. Due to computational constraints and the high similarity among question templates, we uniformly sample up to 200 examples from each subcategory. For subcategories containing fewer than 200 samples, all available examples are retained. This results in 600 measurement examples, 112 medical reasoning examples, and 2230 visual reasoning examples, making a total of 2942 samples in the final evaluation set. Refer to Appendix D for more details and the distribution of the curated dataset.
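The capped sampling rule (at most 200 examples per subcategory, all examples when fewer are available) can be sketched as follows. The `subcategory` key, the fixed seed, and the function name are assumptions for illustration; the text does not specify how the sampling was seeded.

```python
import random
from collections import defaultdict


def sample_per_subcategory(examples, cap=200, seed=0):
    """Uniformly sample up to `cap` examples from each subcategory."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    groups = defaultdict(list)
    for ex in examples:
        groups[ex["subcategory"]].append(ex)
    selected = []
    for name in sorted(groups):  # sorted for a deterministic iteration order
        items = groups[name]
        # Keep everything when the subcategory is at or below the cap.
        selected.extend(items if len(items) <= cap else rng.sample(items, cap))
    return selected


# Toy data: one large and one small subcategory.
pool = [{"subcategory": "organ_volume", "id": i} for i in range(250)]
pool += [{"subcategory": "fatty_liver", "id": i} for i in range(30)]
subset = sample_per_subcategory(pool)
```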
3.5. Evaluation Metrics
We evaluate the proposed framework from two complementary perspectives: (1) end-task predictive performance compared with baseline methods, and (2) the effectiveness and efficiency of different core LLMs used for orchestration.
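Performance against baselines. For regression-oriented tasks, we report the Mean Absolute Error (MAE) and Concordance Correlation Coefficient (CCC) [46]. The MAE quantifies the average magnitude of prediction error by measuring the absolute deviation between predicted and ground-truth values, thereby providing an interpretable measure of numerical discrepancy in the original measurement units, as defined in Equation (1). In contrast, the CCC evaluates the agreement between predictions and the ground truth by jointly accounting for both precision (the strength of association between measurements) and accuracy (the deviation from the identity line). Compared with correlation-only metrics, the CCC provides a more robust assessment of measurement agreement in settings exhibiting scale mismatch, systematic bias, or heteroscedastic variability, as shown in Equation (2).

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \lvert \hat{y}_i - y_i \rvert \tag{1}$$

$$\mathrm{CCC} = \frac{2\rho\,\sigma_{\hat{y}}\,\sigma_{y}}{\sigma_{\hat{y}}^{2} + \sigma_{y}^{2} + (\mu_{\hat{y}} - \mu_{y})^{2}} \tag{2}$$

Here $\hat{y}_i$ and $y_i$ denote the predicted and ground-truth values for sample $i$, $N$ is the number of samples, $\rho$ is the Pearson correlation between predictions and ground truth, and $\mu$ and $\sigma^{2}$ are the corresponding means and variances.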
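To make the difference between the two regression metrics concrete, the sketch below computes the MAE and Lin's CCC using only the standard library. It uses population (1/N) moments for the CCC, which is one common convention and an assumption here, and the toy values are made up.

```python
from statistics import mean


def mae(pred, truth):
    """Mean absolute error in the original measurement units."""
    return mean(abs(p - t) for p, t in zip(pred, truth))


def ccc(pred, truth):
    """Lin's concordance correlation coefficient with population (1/N) moments."""
    mp, mt = mean(pred), mean(truth)
    var_p = mean((p - mp) ** 2 for p in pred)
    var_t = mean((t - mt) ** 2 for t in truth)
    cov = mean((p - mp) * (t - mt) for p, t in zip(pred, truth))
    return 2 * cov / (var_p + var_t + (mp - mt) ** 2)


truth = [100.0, 200.0, 300.0]
shifted = [150.0, 250.0, 350.0]  # perfectly correlated, but biased by +50
perfect_agreement = ccc(truth, truth)
biased_agreement = ccc(shifted, truth)
biased_error = mae(shifted, truth)
```

The shifted predictions have a Pearson correlation of 1 with the ground truth, yet their CCC is 16/19 (about 0.84) because the constant offset of 50 is penalized. This is the sense in which the CCC accounts for deviation from the identity line in addition to association.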
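For VQA tasks, we report prediction accuracy together with 95% confidence intervals using the Wilson score interval. Accuracy measures the proportion of correctly answered instances and is defined as follows:

$$\mathrm{Accuracy} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\left[\hat{y}_i = y_i\right]$$

To quantify uncertainty in accuracy estimates under finite sample sizes, we compute 95% Wilson score confidence intervals. Given an observed accuracy $\hat{p}$ from $N$ samples and the standard normal quantile $z$, the Wilson interval is computed as follows:

$$\frac{\hat{p} + \dfrac{z^{2}}{2N} \pm z\sqrt{\dfrac{\hat{p}(1-\hat{p})}{N} + \dfrac{z^{2}}{4N^{2}}}}{1 + \dfrac{z^{2}}{N}}$$

where $z \approx 1.96$ for a 95% confidence interval.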
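The Wilson score interval is straightforward to compute directly. The following sketch uses the standard closed form with z = 1.96; the function name and the example counts (25 correct out of 39) are illustrative assumptions, not figures taken from the tables.

```python
import math


def wilson_interval(correct, n, z=1.96):
    """Wilson score confidence interval for a proportion `correct / n`."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = correct / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point error at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


low, high = wilson_interval(25, 39)
```

With a sample this small the interval spans roughly 29 percentage points, which illustrates why small subcategories yield wide intervals even when the point estimate looks favourable.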
Additionally, to statistically compare paired model predictions evaluated on identical question instances, we perform pair-wise McNemar tests. For two competing methods, let $n_{10}$ denote the number of instances correctly predicted by method A but incorrectly predicted by method B, and let $n_{01}$ denote the converse. Since the test depends only on discordant prediction pairs, the McNemar statistic is defined as follows:

$$\chi^{2} = \frac{(n_{10} - n_{01})^{2}}{n_{10} + n_{01}}$$

For settings with limited discordant counts, we employ the exact McNemar test based on the binomial distribution, where the null hypothesis assumes symmetric disagreement between paired methods:

$$H_0:\; n_{10} \sim \mathrm{Binomial}\!\left(n_{10} + n_{01},\ \tfrac{1}{2}\right)$$
A statistically significant result indicates that one method consistently answers more matched instances correctly than the other beyond differences expected from sampling variability.
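Both variants of the McNemar test depend only on the two discordant counts, so they can be sketched with the standard library. The chi-square version below omits the continuity correction, which is an assumption since the text does not say whether one was applied; the argument names and example counts are illustrative.

```python
import math


def mcnemar_chi2(a_only, b_only):
    """Chi-square statistic (1 d.o.f., no continuity correction) and p-value."""
    n = a_only + b_only
    if n == 0:
        return 0.0, 1.0  # no discordant pairs: nothing to distinguish
    stat = (a_only - b_only) ** 2 / n
    # Survival function of chi-square with 1 d.o.f.: P(|Z| > sqrt(stat)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def mcnemar_exact(a_only, b_only):
    """Two-sided exact p-value with discordant pairs ~ Binomial(n, 0.5) under H0."""
    n = a_only + b_only
    if n == 0:
        return 1.0
    tail = sum(math.comb(n, k) for k in range(min(a_only, b_only) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


stat, p_asymptotic = mcnemar_chi2(10, 30)  # many discordant pairs
p_small = mcnemar_exact(1, 9)              # few discordant pairs
p_balanced = mcnemar_exact(5, 5)           # symmetric disagreement
```

Instances that both methods answer correctly, or both answer incorrectly, never enter the computation. Two methods with similar accuracy can therefore still differ significantly if their errors fall on different questions in an asymmetric way.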
Evaluation of LLM orchestration. To assess the role of the core LLM in coordinating agents and tools, we evaluate both efficiency and answer quality. We report the failure rate, which is decomposed into three categories:
(1) Ground-truth mismatch, defined as the disagreement between the final system prediction and the annotated ground-truth answer, reflecting overall task-level correctness of the full pipeline.
(2) Reasoning failure, which includes cases where the model produces an incorrect answer due to inappropriate tool selection, hallucinated reasoning, or generating responses without invoking required tools, including incorrect routing decisions that prevent successful task completion within the allowed number of interaction steps.
(3) Execution failure, which captures instances where the framework fails to complete the inference process, such as invalid output formatting that interrupts execution, failure to produce a final answer, or any breakdown in the agent–tool interaction protocol that prevents completion of the pipeline.
4. Results
For performance comparison, we selected SAMF [22], M3D [20], and Med3DVLM [21] as they are recent state-of-the-art models for abdominal CT analysis with publicly available implementations, enabling fair and reproducible benchmarking. All three models were finetuned on the DeepVQATumor dataset, and the finetuning configuration is provided in Table A4. Additionally, we compared MedToolica against MedGemma1.5 [15], a 2D medical model capable of interpreting 3D CT scans in a slice-wise manner, as well as against ReAct [29], an alternative agentic framework.
4.1. Illustrative Examples of Compositional Tool Use
Figure 3 and Figure 4 present representative examples of task solving through compositional reasoning and tool invocation. In Figure 3, the model estimates spleen volume using a two-step process in which it invokes a segmentation model followed by a volume computation tool to obtain the final measurement.
In Figure 4, the model computes the aggregated volume of the liver and spleen over four steps. Notably, the process includes an initial incorrect assumption that segmentation outputs were already available, leading to an erroneous tool call and execution failure. Despite this, the model successfully corrects its trajectory in subsequent steps and completes the task, demonstrating its ability to recover from intermediate errors during reasoning.
4.2. Measurement
Table 2 presents organ-level measurement results, where agentic frameworks demonstrate consistently higher performance across all three tasks. On organ volume estimation, MedToolica attains a higher CCC and a lower MAE than Med3DVLM, the finetuning-based model used for comparison here (values in Table 2), and reaches near-perfect agreement, reflecting direct volumetric extraction from accurate TotalSegmentator masks. Finetuned models, which must infer volumes implicitly from visual patterns without explicit measurement tools, do not reach this ceiling. On this task, both agentic approaches perform similarly, with MedToolica slightly outperforming the ReAct-based approach.
A similar trend is observed for the organ aggregation task, where our framework outperforms finetuning-based models in both MAE and CCC. However, compared to the ReAct-based approach, our method achieves slightly lower performance, although both approaches achieve moderate agreement (see Table 2). Notably, the ReAct method attains a lower nominal error in this setting.
Similarly, on organ HU measurement, finetuning-based baselines achieve poor agreement, indicating that they predict the mean HU but fail to track individual variation. In contrast, our framework achieves moderate-to-strong agreement, whereas the ReAct-based approach shows poor agreement (CCC values in Table 2). This improvement is driven by direct HU computation from segmentation masks, rendering the task straightforward once reliable organ segmentation is available.
Regarding the monotonic association between predictions and the ground truth, MedToolica demonstrates consistently strong, statistically significant Spearman correlations across all three organ-related measurement tasks. The finetuned baselines also exhibit strong monotonic associations in the organ volume measurement and organ aggregation tasks. However, in organ HU measurement, all baseline methods, including the ReAct approach, underperform relative to MedToolica, yielding only moderate correlations, as illustrated in Figure 5.
In terms of prediction bias and error dispersion in measurement tasks, Figure 6 shows that agentic methods exhibit a tighter clustering of points around zero with narrower limits of agreement (approximately −400 to 300), indicating lower bias and more consistent estimates. In contrast, the finetuned baseline models display a wider spread of errors and broader limits of agreement (approximately −500 to 500). Moreover, for the baseline methods, prediction errors tend to increase with measurement magnitude and are accompanied by a larger number of outliers, whereas MedToolica remains comparatively stable across larger volumes with fewer extreme deviations.
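Bias and limits of agreement of this kind are commonly obtained from a Bland-Altman analysis of the prediction errors. The sketch below assumes that convention (mean difference plus or minus 1.96 sample standard deviations), which the text does not state explicitly, and uses made-up values.

```python
from statistics import mean, stdev


def bland_altman(pred, truth, z=1.96):
    """Bias and limits of agreement of the differences (pred - truth)."""
    diffs = [p - t for p, t in zip(pred, truth)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return {"bias": bias, "lower": bias - z * sd, "upper": bias + z * sd}


pred = [110.0, 190.0, 310.0, 395.0]
truth = [100.0, 200.0, 300.0, 400.0]
agreement = bland_altman(pred, truth)
```

A bias near zero with narrow limits corresponds to the tight clustering described for the agentic methods, while limits that widen with measurement magnitude would indicate the magnitude-dependent error noted for the finetuned baselines.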
Nevertheless, organ-level measurement tasks inherently emphasize quantitative measurement and structured aggregation, which may preferentially benefit agentic systems with explicit tool support compared to end-to-end finetuned VLMs. This consideration should be taken into account when interpreting comparative performance gains.
All lesion-related measurement tasks (lesion volume, diameter, slice localization, counting) yield very poor agreement, with CCC values of less than 0.2 across all five methods. This indicates that lesion measurement remains beyond the capability of every evaluated approach and is not a limitation specific to our framework. We report these results in Appendix C for completeness.
4.3. Visual Reasoning
Table 3 shows that the performance advantage of MedToolica over competing baselines is uneven across tasks.
Kidney volume comparison (66.8% vs. 64.3% for ReAct and 42.3.6% for Med3DVLM) requires segmenting both kidneys, measuring each, and comparing them, a three-step compositional chain that our framework handles structurally. Finetuned models lack explicit measurement tools and appear to rely on perceptual pattern matching, which fails in this quantitative task. This trend is further supported by the McNemar test results shown in
Figure 7, where MedToolica significantly outperforms competing finetuning-based baselines on this task.
Inter-segment comparison (63.5% vs. 61.0% for ReAct, 52.5% for M3D, 47.5% for Med3DVLM) requires multi-step reasoning in which models must first identify lesions across different liver subsegments and subsequently compare segment-wise lesion counts to determine which segment exhibits a higher burden. McNemar test results indicate that MedToolica consistently achieves equal or higher accuracy than other finetuning-based baselines, with improvements ranging from moderate to statistically significant levels. However, the difference between ReAct and MedToolica in this subcategory is not statistically significant. This task highlights a notable advantage of tool-driven approaches in supporting structured, multi-stage analytical reasoning.
Lesion outlier detection (64.1% vs. 53.8% for Med3DVLM and 51.3% for SAMF) requires identifying abnormal lesions relative to clinical norms. Although MedToolica achieves higher accuracy than competing baselines, the relatively wide 95% Wilson confidence intervals suggest substantial uncertainty in the estimated performance, which is likely attributable to the small size of the test set for this subcategory (see Figure 7). Furthermore, the large p-values observed in the McNemar test indicate that the performance differences between models are not statistically significant, implying comparable predictive behavior in this task despite MedToolica’s numerical advantage.
Largest lesion location (43.2% vs. 44.1% for MedGemma1.5) exhibited comparatively lower absolute accuracy; however, the McNemar test revealed no statistically significant performance difference between models. A similar trend was observed for largest lesion attenuation (42.5% vs. 51.5% for Med3DVLM and 50.0% for M3D), where MedToolica achieved lower numerical performance, yet McNemar analysis indicated no significant difference.
In the case of organ enlargement (59.2% vs. 73.0% for M3D and Med3DVLM), where statistically significant differences were observed, the models were generally able to perform volume measurements correctly but failed in the subsequent application of clinical reference thresholds (see Figure A4). The primary source of error in this task is informative: it is categorized as output interpretation failure and is analyzed further in the Discussion section.
Similarly, output interpretation failure was evident in largest lesion attenuation, where lesion detection and HU measurement were largely accurate but the final classification frequently failed due to ambiguity in applying HU range-based diagnostic criteria (Figure A6).
4.4. Medical Reasoning
Table 4 presents results for the five medical reasoning tasks. ReAct achieves the highest accuracy on pancreas steatosis (83.3% vs. 75.0% for MedToolica and 58.4% for SAMF), a task that requires integrating pancreatic HU measurements with clinically established fat infiltration thresholds. Despite this numerical advantage, McNemar analysis indicates no statistically significant performance difference between models. Furthermore, the wide 95% Wilson confidence intervals reflect substantial uncertainty in the estimated accuracies, limiting the reliability of comparative conclusions for this task (see Figure 7B).
A similar pattern is observed for lesion resectability, cyst resectability, and fatty liver. Whilst MedToolica underperforms numerically on these tasks, the McNemar test consistently shows non-significant differences across models, and the broad 95% Wilson confidence intervals suggest high variance and limited confidence in the reported estimates. Consequently, these results do not provide strong statistical evidence for meaningful performance separation among competing approaches in the medical reasoning category.
This outcome is likely attributable to the relatively small size of the curated medical reasoning subset after dataset filtering. Accordingly, we consider limited sample size and the resulting statistical uncertainty as an important limitation of the present study for this category.
4.5. Orchestration Efficiency
To examine the impact of different core LLMs, we compare models in terms of failure rate and successful reasoning traces (Figure 8). Among all evaluated models, Nvidia-Nemotron3-30B achieves the lowest overall failure rate, demonstrating the strongest orchestration performance. Notably, Qwen3-14B, despite having approximately half the number of parameters, attains a comparable failure rate and ranks as the second-best model in terms of orchestration capability. In contrast, GLM-4.7-Flash-31B exhibits a failure rate exceeding 40%, underperforming both Qwen3-14B and Ministral3-14B despite its larger parameter size. In general, smaller-scale models show substantially weaker orchestration ability, with failures arising from both execution-level errors (e.g., invalid agent output formats that disrupt the pipeline) and reasoning-level errors (e.g., incorrect tool selection or failure to follow the intended reasoning trajectory).
In terms of the tool-calling behavior of core LLMs, Table 5 reports the average number of reasoning steps, tool utilization (i.e., the proportion of steps involving tool invocation), and failure burden (i.e., the proportion of tool calls that result in execution failures). A key pattern emerges when comparing mid- to large-scale models: Nemotron3-30B-A3B and Gemma4-31B maintain relatively low numbers of average steps and moderate tool utilization while achieving comparatively lower failure burdens, indicating more stable orchestration behavior. Notably, Qwen3-14B shows high tool utilization (95.83%) but also the highest failure burden (32.17%), implying that, although it frequently delegates to tools, it is more prone to execution-level errors during interaction. Despite this, its overall system-level failure rate remains competitive (see Figure 8), suggesting partial recovery through subsequent reasoning steps. In contrast, GLM-4.7-Flash-31B and smaller Qwen variants demonstrate weaker orchestration stability, characterized by either reduced tool dependence or elevated failure rates.
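The three tool-calling quantities follow directly from logged reasoning traces. The sketch below implements the definitions given in the text under an assumed trace format: each run is a list of step records with a hypothetical `tool` key (`None` when no tool was invoked) and a hypothetical `failed` flag.

```python
def orchestration_metrics(traces):
    """Average steps, tool utilization, and failure burden over logged runs.

    traces: list of runs; each run is a list of step dicts with the
    hypothetical keys 'tool' (name or None) and 'failed' (bool).
    """
    n_steps = sum(len(run) for run in traces)
    calls = [step for run in traces for step in run if step["tool"] is not None]
    failed = sum(1 for step in calls if step["failed"])
    return {
        "avg_steps": n_steps / len(traces) if traces else 0.0,
        # Proportion of steps that invoke a tool.
        "tool_utilization": len(calls) / n_steps if n_steps else 0.0,
        # Proportion of tool calls that end in an execution failure.
        "failure_burden": failed / len(calls) if calls else 0.0,
    }


# Two toy runs with made-up tool names.
traces = [
    [
        {"tool": "segment", "failed": False},
        {"tool": "measure", "failed": True},
        {"tool": None, "failed": False},
    ],
    [{"tool": "segment", "failed": False}],
]
metrics = orchestration_metrics(traces)
```

Failure burden is computed per tool call and not per query, so a model can show a high failure burden and still reach a competitive system-level failure rate when later steps recover from a failed call.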
5. Discussion
Clinical reasoning and quantitative assessment over 3D CT are inherently compositional processes that often require multi-step perception, measurement, and interpretation. In this work, we investigate the extent to which a finetuning-free agentic framework can serve as an alternative to monolithic end-to-end models in this setting. Our findings suggest that compositional quantitative reasoning in 3D CT is feasible, but its success depends strongly on two interacting factors: the reliability of the expert perception tools and the ability of the core LLM to orchestrate them effectively. These results position tool quality and orchestration capability as central determinants of performance in modular medical reasoning pipelines.
We further distinguish compositional reasoning from sequential tool orchestration. Sequential tool orchestration refers to the execution of tools through a predetermined workflow where the order of operations is largely fixed and independent of intermediate reasoning outcomes. In contrast, compositional reasoning involves decomposing a complex clinical query into intermediate objectives, iteratively acquiring and evaluating evidence, and synthesizing the resulting information to reach a final conclusion. Although MedToolica leverages external tools during inference, its operation extends beyond sequential tool orchestration. Rather than following a fixed execution pipeline, the system employs a multi-agent reasoning process in which agents dynamically select actions, exchange information, and refine subsequent steps based on intermediate findings and verification outcomes. Consequently, tool use serves as a component of the broader reasoning process rather than the primary organizing principle of the system.
Asymmetrical advantages through compositional reasoning. Our results suggest that the benefits of compositional reasoning are task-dependent rather than universal. Agentic methods exhibit genuine structural advantages over finetuning-based models in organ-level quantitative tasks that require direct numerical comparison, aggregation of quantitative measurements, or spatial integration across anatomical regions. Notable improvements include organ volume estimation (up to −85 reduction in nominal error (mAE)) and organ aggregation calculations (−100 reduction in mAE), both accompanied by strong CCC agreement, as well as kidney volume comparison (+21.2 points over MedGemma1.5). In contrast, MedToolica underperforms in tasks involving subjective downstream clinical decisions (e.g., largest lesion attenuation and organ enlargement) and lesion-oriented reasoning tasks. However, pairwise McNemar analyses on VQA tasks indicate that these differences are not statistically significant, with the exception of the organ enlargement task. These findings suggest that compositional reasoning offers distinct advantages for structured quantitative reasoning, whilst its effectiveness remains constrained in lesion-centric and clinically interpretive settings.
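For readers unfamiliar with the paired comparison used above, the sketch below illustrates one common form of the McNemar test on per-question correctness flags from two models answering the same VQA items. It uses the exact (binomial) variant; the text does not state whether the exact or the chi-square variant was used, so this choice, together with the function names, is an assumption made for illustration.

```python
from math import comb
from typing import Sequence, Tuple


def discordant_counts(correct_a: Sequence[bool],
                      correct_b: Sequence[bool]) -> Tuple[int, int]:
    """Count items where exactly one of the two models is correct."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same items")
    b = sum(a and not o for a, o in zip(correct_a, correct_b))  # A right, B wrong
    c = sum(o and not a for a, o in zip(correct_a, correct_b))  # B right, A wrong
    return b, c


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts.

    Under the null hypothesis, each discordant item favors either model
    with probability 0.5, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant items: no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Hypothetical correctness flags, not data from the paper.
    model_a = [True, True, False, True, False, True]
    model_b = [True, False, False, False, True, False]
    b, c = discordant_counts(model_a, model_b)
    print(b, c, mcnemar_exact(b, c))
```

Because only discordant items contribute to the statistic, small evaluation subsets with few disagreements tend to yield non-significant results even when accuracy gaps appear large.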
Tool quality is the primary determinant of system performance. The results consistently support our central hypothesis. Tasks involving organ volume segmentation, supported by TotalSegmentator, show substantial gains over finetuned baselines, with strong CCC agreement. In contrast, lesion detection tasks supported by DiffTumor exhibit poor performance across all methods. When tool outputs are unreliable, subsequent reasoning stages operate on inaccurate inputs, causing errors to accumulate and potentially intensify throughout the multi-step pipeline. Therefore, in agentic medical AI systems, tool reliability should be regarded as a first-class design criterion, alongside the design of the reasoning and orchestration architecture.
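As a point of reference for the agreement values discussed here, the following sketch computes a concordance correlation coefficient between predicted and reference measurements. It assumes that CCC denotes Lin's concordance correlation coefficient with population (1/n) moments; the function name and the example values are hypothetical and not drawn from the paper.

```python
from statistics import fmean
from typing import Sequence


def concordance_ccc(pred: Sequence[float], ref: Sequence[float]) -> float:
    """Lin's concordance correlation coefficient (assumed definition).

    CCC = 2 * cov(pred, ref) / (var(pred) + var(ref) + (mean_pred - mean_ref)^2)
    """
    n = len(pred)
    if n != len(ref) or n < 2:
        raise ValueError("need two equally sized samples with at least 2 items")
    mean_p, mean_r = fmean(pred), fmean(ref)
    var_p = sum((x - mean_p) ** 2 for x in pred) / n
    var_r = sum((y - mean_r) ** 2 for y in ref) / n
    cov = sum((x - mean_p) * (y - mean_r) for x, y in zip(pred, ref)) / n
    denom = var_p + var_r + (mean_p - mean_r) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined for identical constant samples")
    return 2 * cov / denom


if __name__ == "__main__":
    reference = [1.0, 2.0, 3.0]
    print(concordance_ccc([1.0, 2.0, 3.0], reference))  # perfect agreement
    print(concordance_ccc([2.0, 3.0, 4.0], reference))  # constant offset lowers CCC
```

Unlike Pearson correlation, this coefficient penalizes systematic offsets and scale differences, which is why inaccurate upstream segmentations can depress it even when predictions remain correlated with the reference.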
Two distinct failure modes with distinct remedies. Our analysis identifies two qualitatively different failure modes. The first is tool failure with error propagation: DiffTumor produces inaccurate segmentations, and downstream measurement amplifies these inaccuracies, yielding low CCC values across all lesion tasks regardless of reasoning sophistication. More generally, this observation suggests that robustness in compositional agentic systems may also benefit from mechanisms that explicitly mitigate tool-level uncertainty. One potential direction is to aggregate or fuse outputs from multiple tools to improve robustness and confidence in the final prediction, as explored in prior work on models such as OralGPT [32]. In this study, we do not incorporate explicit mechanisms for multi-tool fusion, fallback strategies, or uncertainty estimation over tool outputs, as our focus is on evaluating the feasibility of a finetuning-free agentic framework for quantitative 3D CT reasoning under a fixed tool pipeline. We acknowledge these aspects as limitations of the current work and consider them promising directions for future research.
The second failure mode is output interpretation failure. Even when TotalSegmentator produces reliable segmentations or DiffTumor successfully detects a lesion, the pipeline may still fail to convert tool outputs into correct clinical decisions. This is most evident in tasks where the final answer depends on threshold- or range-based judgments: both the organ enlargement and largest lesion attenuation tasks successfully obtain the required measurements, yet the final prediction is frequently wrong due to ambiguous or incorrect clinical cutoffs. A practical remedy would be to refine system prompts by explicitly providing clinically plausible ranges or decision criteria for each class, thereby supporting more accurate final reasoning.
Transparent reasoning offers a meaningful clinical advantage. Unlike single-shot end-to-end finetuned models, our framework produces explicit intermediate reasoning traces, including segmentation outputs, volume measurements, HU statistics, and spatial comparisons, all of which can be inspected and verified at each stage. This level of transparency provides clear practical value in clinical settings. For example, a radiologist can confirm whether a reported liver volume was derived from an accurate segmentation or whether a resectability assessment was based on the appropriate anatomical measurements. Therefore, interpretability is not an incidental feature of the agentic design but an intrinsic consequence of compositional reasoning. Although a fine-grained ablation of individual agent roles and verifier mechanisms could provide additional insight into the contribution of specific design choices, such an analysis is beyond the scope of the present study, whose primary objective is to evaluate the feasibility of a finetuning-free agentic framework for quantitative 3D CT reasoning.
Scaling the core LLM substantially improves orchestration quality. To investigate the influence of the core LLM on MedToolica, we evaluated models spanning 0.6B to 31B parameters. Overall, larger models demonstrated more stable orchestration behavior characterized by lower failure rates, fewer execution breakdowns, and more coherent reasoning trajectories. However, parameter scale alone did not fully determine performance. For example, Qwen3-14B achieved competitive orchestration quality despite having substantially fewer parameters than several 30B-class models, suggesting that orchestration capability depends not only on model size but also on robustness in multi-step tool-mediated reasoning. An important observation is that high tool utilization does not necessarily imply effective orchestration; models such as Ministral3-14B and Qwen-8B invoked tools in nearly all reasoning steps yet still incurred considerable failure burdens, indicating that successful agentic reasoning requires reliable tool execution and recovery mechanisms rather than frequent tool use alone. Conversely, smaller models (e.g., Qwen-0.6B and Qwen-4B) exhibited longer reasoning trajectories, reduced or inconsistent tool utilization, and elevated reasoning and execution errors, highlighting their limited ability to maintain structured agent–tool coordination in complex medical workflows.
Limitations and future directions. This study has several limitations that point to concrete future directions. First, performance on lesion-related tasks is fundamentally constrained by DiffTumor; replacing or augmenting it with a more accurate segmentation model is the most direct path to improvement. More broadly, the present work does not address mechanisms for mitigating tool unreliability, such as multi-tool fusion, fallback strategies, or uncertainty-aware reasoning. Second, the current implementation does not incorporate a dedicated 3D medical VLM as its central reasoning component. Integrating a specialized 3D medical VLM, either as an expert module or as the primary decision-making model for tool orchestration, could introduce learned clinical priors that complement tool-based measurement workflows. Third, this work does not include ablation studies investigating different agent decompositions, merged-agent configurations, or reduced-agent designs, limiting insight into the relative impact of these design choices. Finally, the relatively small subset size of the medical reasoning categories limits the statistical reliability of conclusions within this setting and motivates further large-scale investigation. Future work should extend this evaluation to a broader range of models across architectures and scales and carry out the agent-design ablations noted above.