Peer-Review Record

Too Many Tools, Too Much Confusion? Navigating Agentic Tool Selection at Scale

Algorithms 2026, 19(6), 447; https://doi.org/10.3390/a19060447

by Jerzy Kamiński ¹,*, Ilya Galyukshev ¹,², Sergey Chuprin ¹, Artem Kuznetsov ¹ and Anna Kalyuzhnaya ¹

Reviewer 1: Anonymous

Reviewer 2: Anonymous

Reviewer 3: Anonymous

Submission received: 9 April 2026 / Revised: 17 May 2026 / Accepted: 27 May 2026 / Published: 1 June 2026

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This paper addresses a significant and practical problem in scalable tool-augmented LLMs. Its core strengths are a well-motivated, modular framework and a rigorous, insightful experimental analysis that yields the valuable finding about "hard negatives." The primary weaknesses are the need for stronger baseline comparisons and several issues with clarity and presentation. With revisions focused on sharpening the novelty statement, strengthening the baseline comparison, fixing presentation errors, and clarifying methodological details, this paper can make a solid contribution to the field.

The paper clearly identifies a critical, timely problem: the performance degradation of LLM-based agents in massive tool repositories. The proposed "recall-first" pipeline combining query decomposition and tool description augmentation is a well-motivated and modular contribution. However, the novelty claim could be stated more explicitly in the abstract and introduction.
The paper honestly reports a nuanced and important finding: while description augmentation improves retrieval recall, it can sometimes hurt downstream LLM selection (e.g., Table 6). The discussion linking this to "denser local neighborhoods" making in-context discrimination harder is insightful. However, this trade-off analysis could be deepened.
The choice of baselines is somewhat limited. The primary comparison is between variants of the authors' own pipeline (no decomposition, dummy, context). While Table 8 shows gains, it would significantly strengthen the paper to include comparisons with recent, relevant strong baselines mentioned in the Related Work, such as Toolshed [5] or ProTIP [6]. Even a discussion of why a direct comparison was not feasible (e.g., code unavailability, vastly different experimental setups) would be beneficial. Without this, it's difficult to gauge the absolute performance of the RPS pipeline.
The methodology in Section 3 is generally well-described, and the inclusion of prompts in the Appendix is commendable for reproducibility. However, some details are unclear, such as: Lines 148-149: "Constants follow our implementation: M=15 planner tools, K∈{5,10} worker tools..." How were these specific values (M=15, K=5/10) chosen? Was there any sensitivity analysis? Justifying these hyperparameters is important.
The results are data-rich, but the presentation can be optimized. Tables 4-7 must be reformatted for clarity in the final version. A consistent and clear presentation of the main result tables (perhaps in a summarized form) is crucial for the reader.
The paper has some typographical and formatting errors. A thorough proofread is essential.

Author Response

We thank the reviewer for the positive assessment of the paper's motivation, modular framework, experimental analysis, and the finding regarding hard negatives. We also appreciate the reviewer's suggestions concerning novelty, baselines, methodological details, and presentation.

Comments 1: The paper clearly identifies a critical, timely problem: the performance degradation of LLM-based agents in massive tool repositories. The proposed "recall-first" pipeline combining query decomposition and tool description augmentation is a well-motivated and modular contribution. However, the novelty claim could be stated more explicitly in the abstract and introduction.

Response 1: We revised both the Abstract and Introduction to make the novelty and contributions of the work more explicit. The revised Abstract now directly states the core methodological components of RPS: recall-first retrieval, context-aware query decomposition, step-local candidate generation, and synthetic tool description augmentation. We also added concrete KPI values in the Abstract to clarify the empirical contribution of the method.

The contributions are now stated explicitly as:

- a recall-first Retrieval-Plan-Select decomposition;
- a compact synthetic description augmentation recipe; and
- a collision-aware stress-testing protocol for random and semantically close noise.

This revision is intended to make clear what is methodologically new and how the proposed framework differs from flat retrieval or standard in-context tool selection.

Comments 2: The paper honestly reports a nuanced and important finding: while description augmentation improves retrieval recall, it can sometimes hurt downstream LLM selection (e.g., Table 6). The discussion linking this to "denser local neighborhoods" making in-context discrimination harder is insightful. However, this trade-off analysis could be deepened.

Response 2: We expanded the discussion of this trade-off in the experimental analysis and visualization sections. The revised manuscript now explains that augmented tool descriptions can improve vector retrieval by reducing high-similarity semantic collisions, but may also create denser local neighborhoods and longer per-tool contexts. This can make the final in-context selection step more difficult for the LLM.

We also strengthened the interpretation of the hard-negative stress tests. The revised discussion now explicitly distinguishes between retrieval-stage gains and selection-stage trade-offs, showing that description augmentation is most beneficial for recall-oriented candidate generation, while final tool selection may require additional pruning or compression strategies.
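To make the notion of a high-similarity semantic collision concrete, the following is a minimal sketch that lists pairs of tool descriptions whose embeddings are nearly parallel. The function names, the toy vectors, and the 0.9 cosine threshold are illustrative assumptions; this record does not state the collision definition or threshold used in the manuscript.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def find_collisions(
    embeddings: Dict[str, Sequence[float]],
    threshold: float = 0.9,  # assumed value, not taken from the manuscript
) -> List[Tuple[str, str]]:
    """Return every pair of tools whose embeddings have cosine similarity >= threshold."""
    return [
        (a, b)
        for a, b in combinations(embeddings, 2)
        if cosine(embeddings[a], embeddings[b]) >= threshold
    ]
```

Under this assumed definition, comparing the number of returned pairs for vanilla and augmented descriptions would give one rough measure of how much augmentation separates near-duplicate tools.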

Comments 3: The choice of baselines is somewhat limited. The primary comparison is between variants of the authors' own pipeline (no decomposition, dummy, context). While Table 8 shows gains, it would significantly strengthen the paper to include comparisons with recent, relevant strong baselines mentioned in the Related Work, such as Toolshed [5] or ProTIP [6]. Even a discussion of why a direct comparison was not feasible (e.g., code unavailability, vastly different experimental setups) would be beneficial. Without this, it's difficult to gauge the absolute performance of the RPS pipeline.

Response 3: We substantially expanded the Related Work section and added comparative tables describing ToolRet, Toolshed, ProTIP, and other relevant tool retrieval and orchestration approaches. These tables compare prior methods along dimensions such as core mechanism, retrieval strategy, decomposition usage, synthetic descriptions, strengths, and limitations.

We also added an explicit discussion of why direct quantitative comparison with Toolshed and ProTIP is currently difficult. In particular, the revised manuscript explains that these methods do not provide fully reproducible evaluation pipelines compatible with our benchmark setup, and in some cases depend on different knowledge-base construction procedures or unavailable full retrieval/orchestration stacks.

Thus, while we cannot provide a strictly controlled direct comparison under identical conditions, we now position RPS more clearly relative to these approaches and explain the methodological differences.

Comments 4: The methodology in Section 3 is generally well-described, and the inclusion of prompts in the Appendix is commendable for reproducibility. However, some details are unclear, such as: Lines 148-149: "Constants follow our implementation: M=15 planner tools, K∈{5,10} worker tools..." How were these specific values (M=15, K=5/10) chosen? Was there any sensitivity analysis? Justifying these hyperparameters is important.

Response 4: We added a dedicated explanation of these parameters in the experimental setup. The revised manuscript clarifies that M and K control different stages of the pipeline. The planner budget M = 15 is used only to provide contextual evidence for decomposition and is not part of the final candidate pool directly. We justify this value using the empirical distribution of reference tool counts in Ultratool, where the typical number of required tools is much smaller than 15. Thus, M = 15 provides the planner with several plausible tools per reasoning step while keeping the prompt manageable.

The worker budget K ∈ {5, 10} is explained as a step-local retrieval budget. In decomposed RPS configurations, K is applied independently to each subtask, whereas in the RAG-only baseline it is applied once to the original query. We therefore clarify that K = 5 in RPS is not directly equivalent to a global top-5 retrieval setting.
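The difference between a global and a step-local budget can be illustrated with a small sketch. Everything here is a hypothetical stand-in: the token-overlap scorer replaces the embedding retriever, and the function names are not taken from the authors' implementation.

```python
from typing import Callable, Dict, List


def token_overlap(query: str, description: str) -> float:
    """Toy relevance score: fraction of query tokens found in the description."""
    q = set(query.lower().split())
    d = set(description.lower().split())
    return len(q & d) / len(q) if q else 0.0


def top_k(query: str, tools: Dict[str, str], k: int,
          score: Callable[[str, str], float] = token_overlap) -> List[str]:
    """Return the names of the k highest-scoring tools for one query."""
    ranked = sorted(tools, key=lambda name: score(query, tools[name]), reverse=True)
    return ranked[:k]


def rag_only_candidates(query: str, tools: Dict[str, str], k: int) -> List[str]:
    """Global budget: K is applied once to the original query."""
    return top_k(query, tools, k)


def step_local_candidates(subtasks: List[str], tools: Dict[str, str], k: int) -> List[str]:
    """Step-local budget: K is applied to each subtask, then the results are merged."""
    pool: List[str] = []
    for subtask in subtasks:
        for name in top_k(subtask, tools, k):
            if name not in pool:  # de-duplicate while keeping first-seen order
                pool.append(name)
    return pool
```

With S subtasks, the merged pool can hold up to S × K distinct tools, which is why K = 5 under decomposition is not comparable to a single top-5 retrieval.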

Comments 5: The results are data-rich, but the presentation can be optimized. Tables 4-7 must be reformatted for clarity in the final version. A consistent and clear presentation of the main result tables (perhaps in a summarized form) is crucial for the reader.

Response 5: We reformatted the main result tables to improve readability and consistency. The revised tables now use more compact groupings, clearer configuration names, confidence intervals, and p-values where appropriate. We also added summarizing text before and after the tables to help readers interpret the results without relying only on raw numerical values.

Comments 6: The paper has some typographical and formatting errors. A thorough proofread is essential.

Response 6: We performed a full proofreading and formatting pass. We corrected grammar issues, improved notation consistency, revised captions, fixed table formatting, improved appendix formatting, and checked references and section transitions.

Reviewer 2 Report

Comments and Suggestions for Authors

This paper addresses the scalability challenges of LLM agents in large-scale toolkits by proposing the RPS framework: it reduces semantic collisions through a recall-first strategy, contextual task decomposition, and enhanced tool descriptions. The framework provides both theoretical and practical guidance for efficient tool selection in large-scale toolkits for LLM agents.

It is recommended to explicitly state the key performance indicators (KPIs) in the abstract and briefly describe the core innovations of the methodology. The current abstract merely mentions "performance improvement" in general terms without supporting specific data; therefore, it is advisable to clearly specify the KPIs and succinctly outline the methodological innovations.
In the Related Work section, the current textual descriptions are somewhat fragmented. Tables would provide an intuitive comparison of the proposed method with existing approaches, highlighting the uniqueness of the "query decomposition + synthetic description" approach. It is recommended to include comparative tables detailing the core technologies, advantages, and limitations of methods such as ToolRet, Toolshed, and ProTIP.
In Section 4.2 of the experiment, the existing results are presented without explanation regarding the rationale for selecting these models, which may raise concerns about "selective experimentation." It is recommended to specify the selection criteria for candidate models in the embedding model section and provide comparative analyses of model parameters.
In Section 4.4 of the experiment, the results only present numerical differences without statistical testing, making it difficult to confirm whether the improvements are attributable to the method itself rather than random fluctuations. It is recommended to supplement Tables 4, 5, and 6 with p-values (e.g., "p<0.01") to verify the statistical significance of performance differences under various configurations.
In Section 4.5 of the experiment, the existing visualization lacks clear labeling, making it difficult to intuitively understand how enhanced descriptions reduce semantic collisions. It is recommended to add color coding explanations in Figure 5 and illustrate the clustering changes under different description strategies using representative tools.

Author Response

We thank the reviewer for recognizing the relevance of the scalability challenge and the practical value of the RPS framework. We also appreciate the detailed suggestions on KPIs, related work, model selection, statistical testing, and visualization.

Comments 1: It is recommended to explicitly state the key performance indicators (KPIs) in the abstract and briefly describe the core innovations of the methodology. The current abstract merely mentions "performance improvement" in general terms without supporting specific data; therefore, it is advisable to clearly specify the KPIs and succinctly outline the methodological innovations.

Response 1: We revised the Abstract to include both the main methodological innovations and concrete performance indicators. The revised Abstract now reports end-to-end Recall improvements on Ultratool, ToolLinkOS, and ToolRet, as well as Recall@10 improvements from description augmentation and the reduction of high-similarity semantic collisions.

We also explicitly describe the key methodological components: context-aware query decomposition, synthetic tool description augmentation, recall-first retrieval, and step-local candidate generation.

Comments 2: In the Related Work section, the current textual descriptions are somewhat fragmented. Tables would provide an intuitive comparison of the proposed method with existing approaches, highlighting the uniqueness of the "query decomposition + synthetic description" approach. It is recommended to include comparative tables detailing the core technologies, advantages, and limitations of methods such as ToolRet, Toolshed, and ProTIP.

Response 2: We expanded and reorganized the Related Work section. The revised version now includes a compact comparative table in the main text and a more detailed comparison in the appendix. These tables compare existing approaches in terms of their main focus, key mechanism, relation to RPS, retrieval strategy, decomposition usage, synthetic descriptions, strengths, and limitations.

This addition is intended to make the positioning of RPS clearer and to highlight the uniqueness of combining query decomposition, synthetic tool descriptions, and recall-first candidate generation.

Comments 3: In Section 4.2 of the experiment, the existing results are presented without explanation regarding the rationale for selecting these models, which may raise concerns about "selective experimentation." It is recommended to specify the selection criteria for candidate models in the embedding model section and provide comparative analyses of model parameters.

Response 3: We revised Section 4.2 to clarify the embedding model selection protocol. The candidate models were selected before the main experiments and cover several retrieval regimes: a sparse lexical baseline, compact dense encoders, a larger dense encoder, and an instruction-tuned embedding model. We also clarify that the selected embedding backbone is fixed for all subsequent experiments and is not tuned separately per benchmark or configuration.

In addition, we added a separate table summarizing the approximate characteristics of the candidate embedding models, including model type, parameter count, embedding dimension, and maximum token length. This was added to make the model comparison more transparent and to reduce concerns about selective experimentation.

Comments 4: In Section 4.4 of the experiment, the results only present numerical differences without statistical testing, making it difficult to confirm whether the improvements are attributable to the method itself rather than random fluctuations. It is recommended to supplement Tables 4, 5, and 6 with p-values (e.g., "p<0.01") to verify the statistical significance of performance differences under various configurations.

Response 4: We added bootstrap-based statistical estimates to the relevant experimental tables. The revised tables now include 95% confidence intervals and paired bootstrap p-values. The p-values are computed by paired non-parametric bootstrap resampling over benchmark queries. This allows us to assess whether observed Recall differences are likely to reflect systematic improvements rather than random fluctuations.

We added this information to the pure retrieval results, random-noise stress tests, and semantically close hard-negative stress tests.
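A paired non-parametric bootstrap over queries can be sketched as follows. This is only an illustration under stated assumptions: the function name, the 2000 resamples, the percentile interval, and the one-sided p-value convention (share of resamples with no improvement) are choices made here, not details reported in this record.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap(
    baseline: Sequence[float],
    system: Sequence[float],
    n_resamples: int = 2000,  # assumed; the manuscript's setting is not given here
    seed: int = 0,
) -> Tuple[float, Tuple[float, float], float]:
    """Bootstrap the mean per-query recall difference (system minus baseline).

    Returns (observed mean difference, 95% percentile interval, one-sided p-value).
    """
    if len(baseline) != len(system) or len(baseline) == 0:
        raise ValueError("need two non-empty score lists of equal length")
    if n_resamples <= 0:
        raise ValueError("n_resamples must be positive")

    rng = random.Random(seed)
    n = len(baseline)
    # Pairing: each query contributes one difference, so both systems are
    # always resampled on the same queries.
    diffs = [s - b for s, b in zip(system, baseline)]
    observed = sum(diffs) / n

    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()

    lower = means[int(0.025 * n_resamples)]
    upper = means[min(n_resamples - 1, int(0.975 * n_resamples))]
    # Assumed convention: fraction of resamples in which the system does not
    # outperform the baseline.
    p_value = sum(1 for m in means if m <= 0.0) / n_resamples
    return observed, (lower, upper), p_value
```

Here `baseline` and `system` would hold per-query Recall values for two configurations evaluated on the same benchmark queries, in the same order.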

Comments 5: In Section 4.5 of the experiment, the existing visualization lacks clear labeling, making it difficult to intuitively understand how enhanced descriptions reduce semantic collisions. It is recommended to add color coding explanations in Figure 5 and illustrate the clustering changes under different description strategies using representative tools.

Response 5: We revised the Figure 5 discussion and caption. The manuscript now explains that colors correspond to kNN-derived local density clusters in the UMAP projection and are used only to visualize neighborhood structure, not manually assigned API categories. We also describe the three compared description strategies: vanilla descriptions, expanded descriptions with synthetic questions, and random-noise padding.

In the text, we now explicitly connect the visualization to the quantitative collision analysis: expanded descriptions create sharper local semantic clusters, while random padding produces a more diffuse and less semantically organized structure. This supports the conclusion that retrieval improvements come from semantic augmentation rather than merely increased description length.

Reviewer 3 Report

Comments and Suggestions for Authors

Congratulations to the authors on this groundbreaking work, which provides a logical framework for decomposing complex tasks through the RPS pipeline. Particularly valuable are the stress-test protocol that models semantic conflicts and the use of synthetic queries, which have been shown to improve retrieval recall when scaling tool sets.

Suggestions for improvement:

- Figure 5 is currently too cluttered; the legibility of the labels needs to be improved to allow for a more effective interpretation of the results.

- Although the study is highly valuable even in its current form, a comprehensive assessment of the method’s practical applicability would benefit from a quantitative comparison of the additional costs and time requirements with those of standard RAG solutions.

Author Response

We thank the reviewer for the encouraging assessment of the RPS framework, the stress-test protocol, and the use of synthetic queries for improving retrieval recall. We also appreciate the suggestions regarding Figure 5 and practical applicability.

Comments 1: Figure 5 is currently too cluttered; the legibility of the labels needs to be improved to allow for a more effective interpretation of the results.

Response 1: We revised Figure 5 and its accompanying explanation to improve interpretability. We added a clearer color-coding explanation, clarified the meaning of the clusters, and selected representative tool labels to illustrate the semantic structure of the embedding space. We also revised the caption to make the comparison between vanilla descriptions, expanded descriptions, and random-noise padding easier to understand.

The revised discussion now explains how the visualization supports the quantitative finding that semantic augmentation reduces high-similarity collisions while preserving meaningful local neighborhoods.

Comments 2: Although the study is highly valuable even in its current form, a comprehensive assessment of the method’s practical applicability would benefit from a quantitative comparison of the additional costs and time requirements with those of standard RAG solutions.

Response 2: We added a new computational overhead analysis to the end-to-end evaluation section. The revised manuscript now includes an analytical comparison between RAG-only and RPS-style pipelines in terms of retrieval calls, LLM calls, tool documents shown to the model, and scaling behavior with respect to the number of decomposed subtasks.

This analysis clarifies that RPS is not intended to reduce inference cost relative to flat RAG. Instead, it trades additional planning and step-local retrieval calls for higher recall in large and semantically dense tool repositories. We explicitly state that the overhead scales approximately linearly with the number of decomposed subtasks and that the practical cost depends on the number of subtasks, augmented document length, and LLM provider latency.
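The approximately linear scaling can be illustrated with a simple call-count model. The counts below are assumptions made for this sketch (one planner retrieval of M tools, one planner LLM call, and one retrieval plus one selection call per subtask); the exact accounting in the manuscript may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineCost:
    retrieval_calls: int
    llm_calls: int
    tool_docs_shown: int


def rag_only_cost(k: int) -> PipelineCost:
    """Flat RAG: one retrieval of K tools and one selection call."""
    return PipelineCost(retrieval_calls=1, llm_calls=1, tool_docs_shown=k)


def rps_cost(num_subtasks: int, m: int = 15, k: int = 5) -> PipelineCost:
    """RPS-style pipeline under the assumed accounting described above.

    Planner stage: 1 retrieval (M tools) + 1 LLM call.
    Each subtask:  1 retrieval (K tools) + 1 LLM selection call (assumption).
    """
    return PipelineCost(
        retrieval_calls=1 + num_subtasks,
        llm_calls=1 + num_subtasks,
        tool_docs_shown=m + num_subtasks * k,
    )
```

For example, with three subtasks, M = 15, and K = 5, this model gives 4 retrieval calls, 4 LLM calls, and 30 tool documents, against 1, 1, and 5 for flat RAG with K = 5. Each additional subtask adds a constant increment, which matches the linear trend described above.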
