Too Many Tools, Too Much Confusion? Navigating Agentic Tool Selection at Scale
Abstract
1. Introduction
2. Related Work
2.1. Tool-Using Agents and Scalability
2.2. Multi-Agent Systems for Tool Coordination
3. Proposed Approach
- Recall-first selection. Missing a required tool prevents task completion; therefore, we optimize for recall at the candidate-generation stage, tolerating additional false positives that can later be filtered by schema checks and execution-time validation.
- Context-efficient reasoning. Decomposition reduces the cognitive load of the LLM by exposing smaller, step-specific candidate sets.
- Collision mitigation. We explicitly reduce vector-space collisions between semantically similar tools via representation shaping of tool documents (expanded descriptions and synthetic questions).
| Algorithm 1 Unified Retrieval–Plan–Select (RPS) pipeline (recall-first). | |
Require: User request q; tool catalog ; vector DB over ; budgets . | |
1: Index. For all , build and add to vector DB. | |
2: Planner context. . | |
3: Decomposition. | ▹ dummy or context-based |
4: | |
5: for each subtask do | |
6: | |
7: | |
8: | |
9: end for | |
10: return
| |
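The control flow of Algorithm 1 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `retrieve`, `decompose`, and `select` are hypothetical callables standing in for the vector DB and LLM calls, and the toy tool catalog, word-overlap retriever, and budget values are assumptions made only so the example runs.

```python
from typing import Callable, List, Set

def rps_pipeline(
    query: str,
    retrieve: Callable[[str, int], List[str]],
    decompose: Callable[[str, List[str]], List[str]],
    select: Callable[[str, List[str]], List[str]],
    planner_budget: int,
    worker_budget: int,
) -> Set[str]:
    # Planner context: top-M tool documents for the whole request.
    planner_context = retrieve(query, planner_budget)
    # Decomposition: split the request into subtasks given that context.
    subtasks = decompose(query, planner_context)
    selected: Set[str] = set()
    for subtask in subtasks:
        # Step-local candidates: top-K tools for this subtask only.
        candidates = retrieve(subtask, worker_budget)
        # Step-local selection; the union over subtasks is the final answer.
        selected.update(select(subtask, candidates))
    return selected

# --- Toy stand-ins (hypothetical, for illustration only) ---
TOOLS = {
    "check_balance": "check the balance of a bank card",
    "pay_debt": "pay off a credit card debt",
    "set_alarm": "set an alarm clock",
}

def toy_retrieve(text: str, k: int) -> List[str]:
    # Rank tools by word overlap; a real system would query a vector DB.
    words = set(text.lower().split())
    ranked = sorted(TOOLS, key=lambda name: -len(words & set(TOOLS[name].split())))
    return ranked[:k]

def toy_decompose(query: str, context: List[str]) -> List[str]:
    # A real planner would be an LLM conditioned on `context`.
    return [part.strip() for part in query.split(" then ")]

def toy_select(subtask: str, candidates: List[str]) -> List[str]:
    # A real worker would be an LLM choosing among the candidates.
    return candidates[:1]

result = rps_pipeline(
    "check the balance of my card then pay off the debt",
    toy_retrieve, toy_decompose, toy_select,
    planner_budget=3, worker_budget=2,
)
```

Passing the three stages as callables keeps the skeleton independent of any particular embedding model or LLM, which is convenient when swapping components for ablations.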
3.1. Tool Corpus Preparation
3.2. Vector Index and Notation
- Planner retrieval with budget M (default ): returns a planning context of tool documents.
- Worker retrieval with budget K (default ): for each subtask s, returns a step-local candidate set.
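As a rough illustration of the two retrieval calls, the sketch below ranks tools by cosine similarity and truncates to a budget. The in-memory `index` dictionary, the three-dimensional vectors, and the budget values are assumptions for the example; the source uses a vector DB and does not prescribe this code.

```python
import math
from typing import Dict, List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    # Cosine similarity; returns 0.0 if either vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec: Sequence[float], index: Dict[str, Sequence[float]], k: int) -> List[str]:
    # Exhaustive search; a vector DB would typically use ANN search instead.
    ranked = sorted(index, key=lambda name: cosine(query_vec, index[name]), reverse=True)
    return ranked[:k]

# Hypothetical embeddings for three tools.
index = {
    "set_schedule_priority": [0.9, 0.1, 0.0],
    "set_schedule_tag": [0.8, 0.3, 0.0],
    "check_balance": [0.0, 0.1, 0.9],
}

# Planner retrieval: larger budget M over the full request embedding.
planner_context = top_k([1.0, 0.0, 0.0], index, k=3)
# Worker retrieval: smaller budget K over one subtask embedding.
worker_candidates = top_k([0.0, 0.0, 1.0], index, k=1)
```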
3.3. Step-Local Selection
3.4. LLM Prompt Design and Template Specification
4. Experiments
- Embedding ablation. Context decomposition with , augmented tool docs, and worker ; we swap the embedding model and measure selection metrics on the Ultratool subset of 100 difficult cases.
- VectorDB retrieval test. Given a reference tool pool plus injected noise, we only run retrieval: planner ⇒ subtasks ⇒ for each subtask return ; union over subtasks forms the predicted set. We vary decomposition (none/dummy/context) and document expansion (vanilla vs. augmented).
- LLM context selection test. We provide the worker LLM with a mixed candidate pool without calling the vector DB and ask it to select tools. We vary decomposition (none/dummy/context) and document expansion.
- End-to-end. Full pipeline with retrieval and worker selection; we vary , decomposition (context).
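For the retrieval-only test, the predicted set is the union of per-subtask candidates, which is then compared with the reference tools. The sketch below assumes plain set-based precision, recall, and F1 for a single query; how the paper aggregates across queries is not stated here, and the tool names are illustrative.

```python
from typing import Iterable, Set, Tuple

def union_of_subtasks(per_subtask: Iterable[Iterable[str]]) -> Set[str]:
    # Merge step-local candidate lists into one predicted tool set.
    predicted: Set[str] = set()
    for candidates in per_subtask:
        predicted.update(candidates)
    return predicted

def precision_recall_f1(predicted: Set[str], reference: Set[str]) -> Tuple[float, float, float]:
    # Set-based metrics; empty sets yield 0.0 rather than raising.
    tp = len(predicted & reference)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(reference) if reference else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

predicted = union_of_subtasks([
    ["check_balance", "query_debt"],
    ["pay_debt", "set_alarm"],
])
reference = {"check_balance", "query_debt", "pay_debt", "check_application"}
precision, recall, f1 = precision_recall_f1(predicted, reference)
```

In this toy case three of four predicted tools are correct and three of four reference tools are found, so all three metrics equal 0.75.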
4.1. Setup
4.2. Embedding Model Selection
4.3. Description Augmentation for Tool Selection
4.4. Tool Selection Stress Tests
- Random noise. Inject n uniformly random tools from .
- Semantically close noise. Inject n nearest distractors to each reference tool by cosine similarity over . This explicitly simulates representation collisions between look-alike tools.
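The two noise regimes can be sketched as below. This is an illustrative reading of the description, not the authors' code: the function names, the fixed random seed, the de-duplication of distractors, and the two-dimensional toy embeddings are all assumptions.

```python
import math
import random
from typing import Dict, List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def inject_random_noise(reference: List[str], catalog: List[str], n: int, seed: int = 0) -> List[str]:
    # Add n tools drawn uniformly at random from the non-reference catalog.
    rng = random.Random(seed)
    pool = sorted(set(catalog) - set(reference))
    return list(reference) + rng.sample(pool, min(n, len(pool)))

def inject_close_noise(reference: List[str], embeddings: Dict[str, Sequence[float]], n: int) -> List[str]:
    # For each reference tool, add its n nearest distractors by cosine similarity.
    noisy = list(reference)
    for tool in reference:
        others = [t for t in embeddings if t not in noisy]
        others.sort(key=lambda t: cosine(embeddings[tool], embeddings[t]), reverse=True)
        noisy.extend(others[:n])
    return noisy

# Hypothetical embeddings; the first two tools are near-duplicates in vector space.
EMB = {
    "set_schedule_priority": [0.9, 0.1],
    "set_schedule_tag": [0.8, 0.3],
    "check_balance": [0.1, 0.9],
    "pay_debt": [0.2, 0.8],
}
random_pool = inject_random_noise(["set_schedule_priority"], list(EMB), n=2)
close_pool = inject_close_noise(["set_schedule_priority"], EMB, n=1)
```

With semantically close noise, the look-alike `set_schedule_tag` is always pulled into the pool, which is what makes this regime harder than random injection.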
4.4.1. Random Noise, No Decomposition
4.4.2. Semantically Close Noise
| Noise | Configuration | Recall | F1 | Recall 95% CI | p-value vs. RAG-only |
|---|---|---|---|---|---|
| 0 | RAG-only | 1.000 | 1.000 | [1.000, 1.000] | – |
| 10 | RAG-only | 0.857 | 0.422 | [0.825, 0.889] | – |
| 30 | RAG-only | 0.632 | 0.302 | [0.588, 0.676] | – |
| 50 | RAG-only | 0.597 | 0.283 | [0.551, 0.643] | – |
| 0 | RPS-Dummy | 1.000 | 1.000 | [1.000, 1.000] | – |
| 10 | RPS-Dummy | 0.848 | 0.471 | [0.821, 0.875] | 0.39 |
| 30 | RPS-Dummy | 0.732 | 0.336 | [0.698, 0.766] | <0.01 |
| 50 | RPS-Dummy | 0.699 | 0.305 | [0.662, 0.736] | <0.01 |
| 0 | RPS-Context | 1.000 | 1.000 | [1.000, 1.000] | – |
| 10 | RPS-Context | 0.859 | 0.449 | [0.822, 0.896] | 0.44 |
| 30 | RPS-Context | 0.763 | 0.324 | [0.717, 0.809] | <0.001 |
| 50 | RPS-Context | 0.743 | 0.298 | [0.692, 0.794] | <0.001 |
| 0 | RPS-Context+Aug | 1.000 | 1.000 | [1.000, 1.000] | – |
| 10 | RPS-Context+Aug | 0.863 | 0.458 | [0.832, 0.894] | 0.65 |
| 30 | RPS-Context+Aug | 0.784 | 0.340 | [0.732, 0.836] | <0.001 |
| 50 | RPS-Context+Aug | 0.752 | 0.303 | [0.695, 0.809] | <0.001 |

| Noise | Configuration | Recall | F1 | Recall 95% CI | p-value vs. RAG-only |
|---|---|---|---|---|---|
| 0 | RAG-only | 0.635 | 0.725 | [0.590, 0.680] | – |
| 10 | RAG-only | 0.407 | 0.453 | [0.341, 0.473] | – |
| 30 | RAG-only | 0.347 | 0.393 | [0.272, 0.422] | – |
| 50 | RAG-only | 0.323 | 0.364 | [0.252, 0.394] | – |
| 0 | RPS-Dummy | 0.665 | 0.750 | [0.633, 0.697] | 0.07 |
| 10 | RPS-Dummy | 0.449 | 0.494 | [0.407, 0.491] | 0.07 |
| 30 | RPS-Dummy | 0.388 | 0.433 | [0.343, 0.433] | 0.12 |
| 50 | RPS-Dummy | 0.347 | 0.384 | [0.300, 0.394] | 0.27 |
| 0 | RPS-Context | 0.799 | 0.868 | [0.762, 0.836] | <0.001 |
| 10 | RPS-Context | 0.557 | 0.533 | [0.510, 0.604] | <0.001 |
| 30 | RPS-Context | 0.474 | 0.433 | [0.417, 0.531] | <0.01 |
| 50 | RPS-Context | 0.448 | 0.405 | [0.396, 0.500] | <0.01 |
| 0 | RPS-Context+Aug | 0.773 | 0.854 | [0.718, 0.828] | <0.001 |
| 10 | RPS-Context+Aug | 0.555 | 0.559 | [0.507, 0.603] | <0.001 |
| 30 | RPS-Context+Aug | 0.463 | 0.453 | [0.407, 0.519] | <0.01 |
| 50 | RPS-Context+Aug | 0.432 | 0.422 | [0.368, 0.496] | <0.05 |
4.5. Visualization of Embedding Spaces
- Vanilla descriptions (vendor-provided).
- Expanded descriptions (augmented with clarifications + synthetic questions).
- Out-of-distribution, noise-matched version (original descriptions padded with random language noise to match the expanded length).
- Expanded descriptions produce sharper and more stable local clusters, revealing meaningful semantic structure.
- Random-noise padding leads to diffusion and cluster breakup, indicating the importance of semantic rather than syntactic expansion.
4.6. End-to-End Pipeline Estimation
Computational Overhead
5. Conclusions
6. Limitations
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| LLM | Large Language Model |
| RAG | Retrieval-Augmented Generation |
| IR | Information Retrieval |
| CPU | Central Processing Unit |
| GPU | Graphics Processing Unit |
| API | Application Programming Interface |
| DB | Database |
| UMAP | Uniform Manifold Approximation and Projection |
| RPS | Retrieval–Plan–Select |
| JSON | JavaScript Object Notation |
| MCP | Model Context Protocol |
| ANN | Approximate Nearest-Neighbor search |
Appendix A. Extended Comparison of Related Methods
| Method | Core Idea | Retrieval Strategy | Decomposition Usage | Synthetic Descriptions | Strengths | Limitations |
|---|---|---|---|---|---|---|
| Toolformer [7] | Self-supervised learning of when and how to call external tools. | No explicit large-scale retrieval stage; tools are used in a constrained setting. | Not central. | No. | Foundational paradigm for tool-augmented language modeling. | Not designed for retrieval from large heterogeneous tool repositories. |
| ToolNet [8] | Tool selection through directed dependency graphs. | Selection is guided by predefined or constructed graph relations. | Partially, through graph-structured dependencies. | No. | Captures structured relations between tools and supports multi-tool reasoning. | Requires available or constructed tool relationships; scaling to very large repositories is limited. |
| ToolRet [3] | Benchmarking large-scale tool retrieval as a bottleneck for tool-using LLMs. | IR-style retrieval over a large tool collection. | No explicit decomposition mechanism. | No. | Demonstrates that retrieval quality strongly limits downstream tool-use success. | Primarily a benchmark and analysis framework rather than a complete retrieval-selection pipeline. |
| NesTools [4] | Evaluation of nested tool use where one tool output becomes another tool input. | Uses controlled distractor tools for evaluation. | Captures nested tool dependencies but not query decomposition for retrieval. | No. | Useful for studying multi-step tool dependency and compositional tool use. | Does not directly model dense embedding-space semantic collisions. |
| Toolshed [5] | Advanced RAG-style tool retrieval through enriched tool documentation. | Enhanced tool documents, tool knowledge bases, query expansion, and reranking. | Yes; uses query decomposition and multi-query expansion. | Yes. | Strongly improves retrieval by combining document enrichment with advanced retrieval and reranking. | Closely related to RPS, but mainly framed as RAG-tool fusion rather than recall-first retrieval with explicit retrieval-selection trade-off analysis. |
| ProTIP [6] | Learning prototypical tool representations aligned with descriptions and usage traces. | Contrastive alignment between tools and queries in a shared embedding space. | Not central. | No. | Learns stronger semantic representations for tool matching. | Focuses on representation learning rather than combining decomposition, synthetic descriptions, and recall-first candidate generation. |
| RPS (ours) | Recall-first pipeline for scalable tool retrieval and selection. | Contextual step-local candidate generation before downstream selection. | Yes; decomposes user requests into contextual subqueries. | Yes. | Combines decomposition, description enhancement, and recall-oriented candidate generation for large semantically dense repositories. | Adds preprocessing and retrieval-time complexity; downstream selection may still be affected by denser candidate neighborhoods. |

| Method | Core Idea | Retrieval Strategy | Decomposition Usage | Synthetic Descriptions | Strengths | Limitations |
|---|---|---|---|---|---|---|
| Think-then-Act [9] | Improving tool use through explicit reasoning before action. | Planning-oriented selection in a constrained tool setting. | Yes, at the reasoning and planning level. | No. | Strengthens sequential decision-making and tool-use reasoning. | Does not primarily target first-stage recall in very large tool repositories. |
| UniHGKR [11] | Unified heterogeneous knowledge retrieval. | Retrieval over heterogeneous knowledge sources and structures. | Not primarily designed as tool-query decomposition. | No. | Useful for integrating different knowledge structures into retrieval. | Focuses on heterogeneous knowledge retrieval rather than large-scale tool retrieval and selection. |
| Toolink [12] | Tool use through chain-of-solving paradigms. | Selects or organizes tools within a multi-step solving process. | Yes, at the chain-of-solving level. | No. | Supports structured multi-step tool use and reasoning. | Does not directly address recall degradation in large tool catalogs. |
| Chain of Tools [10] | Multi-tool orchestration through chained tool execution. | Focuses on selecting and executing tool chains. | Yes, through sequential tool planning. | No. | Effective for modeling dependencies among tool calls during execution. | Assumes relevant tools can be found or provided; retrieval recall is not the central focus. |
| XAgents [13] | Multi-agent cooperation for tool utilization. | Tool access and coordination through agent roles and rules. | Partially, via role-based task division. | No. | Provides a coordination mechanism for complex agentic workflows. | Usually assumes predefined domains or constrained tool environments. |
| Iterative feedback retrieval [17] | Refining retrieval queries based on execution feedback. | Iterative query refinement after retrieval or execution outcomes. | Partially; refinement is feedback-driven rather than decomposition-centered. | No. | Improves retrieval over single-pass methods and can complement agentic pipelines. | Adds iterative overhead and does not focus on synthetic tool representation enhancement. |
| ToolExpNet [18] | Tool similarity and dependency modeling through experience networks. | Selection based on simulated or learned agent–tool interaction experience. | Partially, through dependency-aware multi-tool selection. | No. | Captures inter-tool relations and improves downstream selection quality. | Focuses more on dependency-aware selection than on first-stage recall from large repositories. |
Appendix B. LLM Prompts
Appendix B.1. Tool Description Expansion Prompt

Appendix B.2. Planner Agent System Prompt

Appendix B.3. Execution Agent System Prompt

Appendix C. Augmented Tool Descriptions
- Pair A: set_schedule_priority vs. set_schedule_tag
- Vanilla (vendor) descriptions.
- set_schedule_priority
- description: "Set priority for schedule"
- set_schedule_tag
- description: "Set a tag for the schedule"
- Augmented description + synthetic questions (indexed).
- set_schedule_priority
- description_expanded:
- "The set_schedule_priority tool allows users to assign a specific priority level to a particular schedule, ensuring that critical tasks are addressed promptly. This tool is essential for project managers and team leaders who need to allocate resources effectively based on task urgency. Typical use-cases include setting deadlines for project milestones, prioritizing client meetings, or managing personal tasks in a busy schedule. By defining a clear priority, users can streamline their workflows and focus on the most important activities first. Additionally, this tool provides feedback on whether the priority assignment was successful, allowing users to confirm changes instantly."
- synthetic_questions:
- - "Can you set the priority of the team meeting to high?"
- - "I need to prioritize my project deadline; can you help me with that?"
- - "How do I change the priority of my schedule for next week's tasks?"
- - "Can you confirm if the priority for the client presentation was set successfully?"
- - "I want to ensure that my weekly planning session is marked as urgent; can you do that?"
- set_schedule_tag
- description_expanded:
- "The 'set_schedule_tag' tool is designed to assign specific tags to schedules, enhancing organization and retrieval of information. By utilizing a tagging system, users can categorize their schedules based on various criteria such as project type, priority, or status. Typical use-cases include labeling schedules for team projects, differentiating between personal and work-related tasks, or highlighting urgent deadlines. This tool allows users to define tags with an ID, name, color, and description, providing comprehensive details for each tag. The outcome of using this tool indicates whether the tagging process was successful, ensuring users can effectively manage their scheduling needs."
- synthetic_questions:
- - "Can you help me set a tag named 'Urgent' with a red color for my project schedule titled 'Q4 Planning'?"
- - "I need to add a green tag called 'Completed' to my meeting schedule titled 'Weekly Team Sync', can you do that?"
- - "Please create a tag with ID 'proj123', name 'Research', and description 'Time allocated for research activities' for my schedule 'Project Development'."
- - "How can I set a blue tag named 'Review' for the schedule titled 'Client Feedback Meeting'?"
- - "I want to categorize my 'Marketing Campaign' schedule with a yellow tag called 'Pending Approval', could you assist me with that?"
- Pair B: set_schedule_date_range vs. set_schedule_repetition
- Vanilla (vendor) descriptions.
- set_schedule_date_range
- description: "Set the start and end dates of the schedule"
- set_schedule_repetition
- description: "Set the repetition type and end date of the schedule"
- Augmented description + synthetic questions (indexed).
- set_schedule_date_range
- description_expanded:
- "The 'set_schedule_date_range' tool is designed to establish a defined period for scheduling events by setting both a start and end date. This tool is particularly useful in scenarios where project timelines, appointments, or any scheduled activities need to be clearly defined to avoid conflicts and ensure proper planning. Users can set the title for the schedule, which helps in identifying the purpose of the scheduled events easily. The tool ensures that all scheduling adheres to the specified dates, improving organization and time management. A successful operation returns a status indicating whether the date range was accurately established, providing immediate feedback on the action taken."
- synthetic_questions:
- - "Can you set the schedule for my project planning from January 10 to January 20?"
- - "I need to create a schedule titled 'Team Meetings' starting on February 1 and ending on February 28; can you do that?"
- - "Please set the date range for my training sessions from March 5 to March 12."
- - "Can you establish a schedule for the annual review process with a start date of April 1 and an end date of April 15?"
- - "I want to set up a schedule for the conference preparation from May 10 to May 20; can you assist with that?"
- set_schedule_repetition
- description_expanded:
- "The set_schedule_repetition tool is designed to establish a recurring schedule by allowing users to set the type of repetition and specify an end date for that schedule. This tool is particularly useful in various scenarios such as planning regular meetings, setting up reminders for recurring tasks, or organizing events that need to happen at fixed intervals, such as weekly training sessions or monthly reports. Users can select different repetition types, such as daily, weekly, or monthly, to fit their scheduling needs. Additionally, specifying an end date helps in managing the duration of the recurrence, ensuring that users have control over how long the repetition will last. Overall, this tool simplifies the process of creating and managing repetitive schedules, making it easier for users to stay organized and on track."
- synthetic_questions:
- - "Can you set a weekly meeting to repeat every Monday until the end of the year?"
- - "How do I schedule a monthly report submission that ends in six months?"
- - "I need to set a daily reminder for a task that should repeat until the end of next month, can you help?"
- - "Can you create a bi-weekly training schedule that runs for three months?"
- - "How can I set up a quarterly review meeting that repeats until December?"
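One way to turn such a record into the text that gets embedded is sketched below. The concatenation format, the `build_index_document` helper, and the `name` field are assumptions for illustration; the source does not specify how the expanded description and synthetic questions are serialized before indexing.

```python
from typing import Dict, List, Union

ToolRecord = Dict[str, Union[str, List[str]]]

def build_index_document(tool: ToolRecord) -> str:
    # Prefer the expanded description; fall back to the vendor one.
    body = tool.get("description_expanded") or tool["description"]
    parts = [f"Tool: {tool['name']}", str(body)]
    questions = tool.get("synthetic_questions") or []
    if questions:
        # Synthetic questions pull the document toward likely user phrasings.
        parts.append("Example questions:")
        parts.extend(f"- {q}" for q in questions)
    return "\n".join(parts)

vanilla = {
    "name": "set_schedule_priority",
    "description": "Set priority for schedule",
}
augmented = {
    "name": "set_schedule_priority",
    "description": "Set priority for schedule",
    "description_expanded": (
        "The set_schedule_priority tool allows users to assign a specific "
        "priority level to a particular schedule, ensuring that critical "
        "tasks are addressed promptly."
    ),
    "synthetic_questions": [
        "Can you set the priority of the team meeting to high?",
    ],
}

vanilla_doc = build_index_document(vanilla)
augmented_doc = build_index_document(augmented)
```

The augmented document carries both the longer functional description and user-style phrasings, which is the representation-shaping idea behind separating look-alike tools such as `set_schedule_priority` and `set_schedule_tag`.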
Appendix D. Augmentation Heatmaps
First, I need to check the progress of my credit card application under my ID card, then check the balance of my ICBC (Industrial and Commercial Bank of China) card, and inquire about the debt amount of my China Merchants Bank credit card. Finally, I want to use the balance of my ICBC card to pay off the 500 yuan debt on my China Merchants Bank credit card.


Appendix E. Benchmark Datasets and Licenses
- The Ultratool benchmark is distributed under the Apache 2.0 License.
- The ToolLinkOS benchmark is released under the MIT License.
- The ToolRet benchmark is available under the Apache 2.0 License.
References
1. Huang, S.; Zhong, W.; Lu, J.; Zhu, Q.; Gao, J.; Liu, W.; Hou, Y.; Zeng, X.; Wang, Y.; Shang, L.; et al. Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios. arXiv 2024, arXiv:2401.17167.
2. Lumer, E.; Basavaraju, P.H.; Mason, M.; Burke, J.A.; Subbiah, V.K. Graph RAG-Tool Fusion. arXiv 2025, arXiv:2502.07223.
3. Shi, Z.; Wang, Y.; Yan, L.; Ren, P.; Wang, S.; Yin, D.; Ren, Z. Retrieval Models Aren’t Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, Vienna, Austria, 27 July–1 August 2025.
4. Han, H.; Zhu, T.; Zhang, X.; Wu, M.; Xiong, H.; Chen, W. NesTools: A Dataset for Evaluating Nested Tool Learning Abilities of Large Language Models. arXiv 2024, arXiv:2410.11805.
5. Lumer, E.; Subbiah, V.K.; Burke, J.A.; Basavaraju, P.H.; Huber, A. Toolshed: Scale Tool-Equipped Agents with Advanced RAG-Tool Fusion and Tool Knowledge Bases. arXiv 2024, arXiv:2410.14594.
6. Anantha, R.; Bandyopadhyay, B.; Kashi, A.; Mahinder, S.; Hill, A.W.; Chappidi, S. ProTIP: Progressive tool retrieval improves planning. arXiv 2023, arXiv:2312.10332.
7. Schick, T.; Dwivedi-Yu, J.; Dessì, R.; Raileanu, R.; Lomeli, M.; Hambro, E.; Zettlemoyer, L.; Cancedda, N.; Scialom, T. Toolformer: Language models can teach themselves to use tools. Adv. Neural Inf. Process. Syst. 2023, 36, 68539–68551.
8. Liu, X.; Peng, Z.; Yi, X.; Xie, X.; Xiang, L.; Liu, Y.; Xu, D. ToolNet: Connecting large language models with massive tools via tool graph. arXiv 2024, arXiv:2403.00839.
9. Shen, Y.; Jiang, H.; Qu, H.; Zhao, J. Think-then-Act: A Dual-Angle Evaluated Retrieval-Augmented Generation. arXiv 2024, arXiv:2406.13050.
10. Shi, Z.; Gao, S.; Chen, X.; Feng, Y.; Yan, L.; Shi, H.; Yin, D.; Chen, Z.; Verberne, S.; Ren, Z. Chain of tools: Large language model is an automatic multi-tool learner. arXiv 2024, arXiv:2405.16533.
11. Min, D.; Xu, Z.; Qi, G.; Huang, L.; You, C. UniHGKR: Unified Instruction-aware Heterogeneous Knowledge Retrievers. arXiv 2024, arXiv:2410.20163.
12. Qian, C.; Xiong, C.; Liu, Z.; Liu, Z. Toolink: Linking toolkit creation and using through chain-of-solving on open-source model. arXiv 2023, arXiv:2310.05155.
13. Yang, H.; Gu, M.; Zhao, R.; Hu, F.; Deng, Z.; Chen, Y. XAgents: A Framework for Interpretable Rule-Based Multi-Agents Cooperation. arXiv 2024, arXiv:2411.13932.
14. Sprigler, A.; Drobek, A.; Weinstock, K.; Tapsoba, W.; Childress, G.; Dao, A.; Gral, L. Synergistic Simulations: Multi-Agent Problem Solving with Large Language Models. arXiv 2024, arXiv:2409.13753.
15. Huang, Y.; Cheng, F.; Zhou, F.; Li, J.; Gong, J.; Yang, H.; Fan, Z.; Jiang, C.; Xue, S.; Chen, F. Romas: A role-based multi-agent system for database monitoring and planning. arXiv 2024, arXiv:2412.13520.
16. Liu, T.; Wang, X.; Huang, W.; Xu, W.; Zeng, Y.; Jiang, L.; Yang, H.; Li, J. Groupdebate: Enhancing the efficiency of multi-agent debate using group discussion. arXiv 2024, arXiv:2409.14051.
17. Xu, Q.; Li, Y.; Xia, H.; Li, W. Enhancing tool retrieval with iterative feedback from large language models. arXiv 2024, arXiv:2406.17465.
18. Zhang, Z.; Chen, Z.; Zhu, H.; Chen, Z.; Du, N.; Li, X. ToolExpNet: Optimizing multi-tool selection in llms with similarity and dependency-aware experience networks. In Proceedings of the Findings of the Association for Computational Linguistics: ACL 2025, Vienna, Austria, 27 July–1 August 2025; pp. 15706–15722.






| Work | Main Focus | Key Mechanism | Relation to RPS |
|---|---|---|---|
| Toolformer [7] | Tool-use learning | Self-supervised API-call prediction | Foundational tool-use setting; not focused on large-scale retrieval. |
| ToolNet [8] | Structured tool selection | Directed graph over tool relations | Models dependencies, but assumes predefined tool structure. |
| ToolRet [3] | Retrieval benchmark | Large-scale tool-retrieval evaluation | Establishes retrieval bottleneck; does not propose a full RPS-like pipeline. |
| NesTools [4] | Nested tool use | Multi-step tool dependency benchmark | Tests tool chaining, but does not model dense semantic collisions directly. |
| Toolshed [5] | RAG-based tool retrieval | Enriched tool documents, query expansion, and reranking | Closely related; RPS emphasizes recall-first retrieval and downstream selection trade-offs. |
| ProTIP [6] | Tool representation learning | Contrastive alignment of tools and queries | Improves embedding geometry; complementary to RPS-style decomposition and augmentation. |
| Think-then-Act/Chain of Tools [9,10] | Tool orchestration | Planning and sequential tool execution | Improves downstream reasoning after candidates are available. |
| RPS (ours) | Recall-first tool retrieval | Contextual decomposition and synthetic description enhancement | Combines decomposition, enhanced descriptions, and recall-oriented candidate generation. |

| Dataset | Number of Tools | Number of Queries |
|---|---|---|
| Ultratool | 2032 | 1000 |
| ToolLinkOS | 573 | 1569 |
| ToolRet-Web | 37,292 | 5230 |

| Embedding Model | Precision | Recall | F1 |
|---|---|---|---|
| TF–IDF | 0.258 | 0.299 | 0.268 |
| TinyBERT | 0.292 | 0.328 | 0.297 |
| BGE-Small | 0.300 | 0.343 | 0.306 |
| E5-Large | 0.313 | 0.337 | 0.311 |
| gte-Qwen2-1.5B-instruct | 0.343 | 0.363 | 0.337 |

| Embedding Model | Type | Params | Dim. | Max Tokens |
|---|---|---|---|---|
| TF–IDF | Sparse lexical | – | – | – |
| TinyBERT | Compact encoder | ∼14 M | 312 | 512 |
| BGE-Small | Dense encoder | ∼33 M | 384 | 512 |
| E5-Large | Dense encoder | ∼335 M | 1024 | 512 |
| gte-Qwen2-1.5B-instruct | Instruction-tuned embedder | 1.5 B | 1536 | 32k |

| Metric | Vanilla | Augmented | Δ (abs.) | Δ (rel.) | p-value |
|---|---|---|---|---|---|
| Recall@5 | 0.184 | 0.281 | +0.097 | +52.7% | 0.005 |
| nDCG@5 | 0.160 | 0.241 | +0.081 | ≈ +50.6% | 0.007 |
| Recall@10 | 0.288 | 0.403 | +0.115 | ≈ +39.9% | 0.005 |
| nDCG@10 | 0.199 | 0.289 | +0.090 | +45.3% | 0.008 |
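For reference, Recall@k and nDCG@k for a single query can be computed as in the sketch below. It assumes binary relevance and the common logarithmic discount; the exact nDCG variant used in the paper is not stated here, and the ranked list is made up for the example.

```python
import math
from typing import List, Set

def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    # Fraction of relevant tools that appear in the top-k results.
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def ndcg_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    # Binary-relevance DCG with a log2 position discount, normalized by the ideal DCG.
    dcg = sum(1.0 / math.log2(i + 2) for i, tool in enumerate(ranked[:k]) if tool in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal else 0.0

# Hypothetical ranking: two of three relevant tools are retrieved, at ranks 1 and 3.
ranked = ["check_balance", "set_alarm", "pay_debt", "set_schedule_tag", "set_schedule_priority"]
relevant = {"check_balance", "pay_debt", "query_debt"}

r5 = recall_at_k(ranked, relevant, k=5)
n5 = ndcg_at_k(ranked, relevant, k=5)
```

Recall@k ignores position within the top k, whereas nDCG@k rewards placing relevant tools earlier, so the two metrics can move differently under augmentation.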

| Noise | Stage | Recall | F1 | Recall 95% CI | Recall p-value vs. Noise 0 |
|---|---|---|---|---|---|
| 0 | VectorDB retrieval | 1.000 | 1.000 | [1.000, 1.000] | – |
| 10 | VectorDB retrieval | 0.992 | 0.493 | [0.985, 1.000] | 0.030 |
| 30 | VectorDB retrieval | 0.934 | 0.460 | [0.897, 0.971] | <0.01 |
| 50 | VectorDB retrieval | 0.890 | 0.436 | [0.843, 0.937] | <0.01 |
| 0 | LLM selection | 0.674 | 0.761 | [0.642, 0.705] | – |
| 10 | LLM selection | 0.645 | 0.741 | [0.608, 0.682] | 0.035 |
| 30 | LLM selection | 0.598 | 0.694 | [0.554, 0.642] | <0.01 |
| 50 | LLM selection | 0.586 | 0.668 | [0.543, 0.639] | <0.01 |

| Configuration | Recall | F1 | Recall 95% CI |
|---|---|---|---|
| End-to-End (context decomposition, augmented description) | |||
| Context decomp., , max 3 loops | 0.231 | 0.246 | [0.167, 0.295] |
| (+5/loop), multi-query, fallback | 0.557 | 0.325 | [0.496, 0.618] |
| + post-selection | 0.424 | 0.396 | [0.359, 0.489] |
| RAG, context decomposition | |||
| per subtask | 0.435 | 0.336 | [0.366, 0.504] |
| per subtask | 0.465 | 0.353 | [0.392, 0.538] |
| RAG only (no decomposition) | |||
| from VectorDB | 0.251 | 0.298 | [0.207, 0.295] |
| from VectorDB | 0.247 | 0.277 | [0.212, 0.282] |
| Ultratool Reference Plan | |||
| (+5/loop), multi-query, fallback | 0.640 | 0.359 | [0.587, 0.693] |
| + post-selection | 0.492 | 0.465 | [0.433, 0.551] |

| Configuration | Recall | F1 | Recall 95% CI |
|---|---|---|---|
| Ultratool | |||
| End-to-End (context, augmented tools), | 0.392 | 0.353 | [0.333, 0.451] |
| RAG (context, vanilla tools), | 0.494 | 0.329 | [0.442, 0.546] |
| RAG only (no decomposition), | 0.340 | 0.362 | [0.305, 0.375] |
| ToolLinkOS | |||
| End-to-End (context, augmented tools), | 0.323 | 0.438 | [0.297, 0.349] |
| RAG (context, vanilla tools), | 0.313 | 0.429 | [0.285, 0.341] |
| RAG only (no decomposition), | 0.208 | 0.314 | [0.187, 0.229] |
| ToolRet | |||
| End-to-End (context, augmented tools), | 0.343 | 0.214 | [0.312, 0.374] |
| RAG (context, vanilla tools), | 0.347 | 0.227 | [0.323, 0.371] |
| RAG only (no decomposition), | 0.300 | 0.242 | [0.272, 0.328] |

| Configuration | Retrieval Calls | LLM Calls | Tool Docs Shown | Scaling |
|---|---|---|---|---|
| RAG-only | 1 | 1 | K | constant |
| RPS-Dummy | m | linear in m | ||
| RPS-Context | linear in m | |||
| RPS-Context+Aug | (aug.) | linear in m |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Kamiński, J.; Galyukshev, I.; Chuprin, S.; Kuznetsov, A.; Kalyuzhnaya, A. Too Many Tools, Too Much Confusion? Navigating Agentic Tool Selection at Scale. Algorithms 2026, 19, 447. https://doi.org/10.3390/a19060447

