Article
Peer-Review Record

Multi-Agent Coordination Strategies vs Retrieval-Augmented Generation in LLMs: A Comparative Evaluation

Electronics 2025, 14(24), 4883; https://doi.org/10.3390/electronics14244883
by Irina Radeva 1,2,*, Ivan Popchev 3, Lyubka Doukovska 1,2 and Miroslava Dimitrova 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 15 November 2025 / Revised: 7 December 2025 / Accepted: 8 December 2025 / Published: 11 December 2025

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This manuscript addresses a critical research gap by comparing multi-agent coordination strategies against optimized retrieval-augmented generation (RAG) baselines. It is a meaningful and timely contribution. However, several aspects need to be addressed to improve this manuscript.

  1. All agents use the same base model rather than heterogeneous architectures or complementary roles. Why?
  2. The collaborative strategy only aggregates outputs equally without conflict resolution or structured synthesis. How to balance it?
  3. All 100 test cases are from climate-smart agriculture question-answering. No testing on datasets outside agriculture to verify whether model-strategy alignment holds across domains.
  4. No reporting of total inference time per query, token consumption, or resource utilization, which are critical for judging whether multi-agent coordination is worth the overhead.
  5. The reward coefficient and penalty coefficient are set based on "standard practices" without sensitivity analysis.

Author Response

Response to Reviewer 1 Comments
1.    Summary
Thank you very much for taking the time to review this manuscript and for your constructive feedback. We appreciate your recognition of the research gap and your thoughtful suggestions for strengthening the experimental methodology. Your comments have helped us significantly improve the manuscript. Please find the detailed responses below. Revised text is presented in red font within each response, with key additions quoted directly from the revised manuscript. Comments have been added to the re-submitted manuscript to indicate where each reviewer comment is addressed.
2.    Questions for General Evaluation
Question    Reviewer's Evaluation    Response
Does the introduction provide sufficient background and include all relevant references?    Can be improved    Addressed in Section 1 with expanded scope clarification and literature contrast.
Is the research design appropriate?    Can be improved    Addressed with expanded Section 3.2.7 detailing four-phase experimental design.
Are the methods adequately described?    Yes    Thank you. Further details added in Appendix A.
Are the results clearly presented?    Yes    Thank you. Additional computational metrics added in Table 5.
Are the conclusions supported by the results?    Can be improved    Addressed with softened language and explicit scope limitations in Section 6.
Are all figures and tables clear and well-presented?    Can be improved    New Table 5 (computational metrics) and Table 6 (sensitivity analysis) added.

3.    Point-by-point Response to Comments and Suggestions
Comment 1: "All agents use the same base model rather than heterogeneous architectures or complementary roles. Why?"
Response 1: Thank you for raising this important methodological question. The use of homogeneous agent architectures was a deliberate design choice to isolate the effects of coordination strategies from confounding factors introduced by model heterogeneity. We have added explicit justification in Section 3.2.6:
"All agents within each configuration use the same base model. This design choice isolates the effects of the coordination strategy from confounding factors introduced by model heterogeneity. Heterogeneous agent architectures, in which agents with different capabilities or specialisations collaborate, are a promising area for future research."
This limitation is also acknowledged in Section 5.3 (Limitations), where we note that heterogeneous architectures combining models with complementary strengths represent an unexplored area that could produce different results.
Comment 2: "The collaborative strategy only aggregates outputs equally without conflict resolution or structured synthesis. How to balance it?"
Response 2: We agree with this observation. To address this limitation, we have implemented and evaluated an improved collaborative strategy with structured synthesis. A new Section 3.2.2 "Improved collaborative strategy: Two-Phase Consensus" has been added:
"This improved collaborative strategy was implemented to address the limitations of simple response aggregation. The two-phase collaborative consensus operates as follows: Phase 1 (Independent Analysis): All agents generate responses in parallel. Each agent provides a confidence score reflecting their certainty in their response. Phase 2 (Collaborative Synthesis): The agent with the highest confidence score is designated as the lead synthesiser. This agent receives summaries of all responses (the first 300 characters of each response) and carries out a structured synthesis task. The synthesis task involves integrating the strongest points from each analysis, resolving contradictions and generating a unified response."
Implementation details including the synthesis prompt template are provided in Appendix A.7. Results show that Two-Phase Consensus improved Collaborative performance by 7–14 percentage points relative to Original implementations (Section 5.1).
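For readers of this response, a minimal sketch of the two-phase consensus flow described above is given below. It is illustrative only: the names (AgentResponse, call_llm) are placeholders rather than the actual maPaSSER implementation, and the confidence score is assumed to be self-reported by each agent.

```python
# Minimal sketch of the two-phase consensus described above. AgentResponse and
# call_llm are illustrative placeholders, not the maPaSSER API.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AgentResponse:
    role: str
    text: str
    confidence: float  # self-reported certainty in [0, 1]

def two_phase_consensus(roles: List[str], query: str, context: str,
                        call_llm: Callable[[str, str], AgentResponse],
                        summary_chars: int = 300) -> str:
    # Phase 1: each agent answers independently (in parallel in the real system)
    # and reports a confidence score alongside its response.
    responses = [call_llm(role, f"{query}\n\nContext:\n{context}") for role in roles]

    # Phase 2: the most confident agent becomes the lead synthesiser. It sees
    # only the first 300 characters of every response and produces one answer
    # that integrates strong points and resolves contradictions.
    lead = max(responses, key=lambda r: r.confidence)
    summaries = "\n\n".join(f"[{r.role}] {r.text[:summary_chars]}" for r in responses)
    synthesis_prompt = (f"As lead synthesiser, integrate the strongest points, "
                        f"resolve contradictions, and give one unified answer.\n\n"
                        f"Question: {query}\n\nAgent summaries:\n{summaries}")
    return call_llm(lead.role, synthesis_prompt).text
```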
Comment 3: "All 100 test cases are from climate-smart agriculture question-answering. No testing on datasets outside agriculture to verify whether model-strategy alignment holds across domains."
Response 3: We acknowledge this limitation. A new Section 5.3 "Limitations" has been added to explicitly address the single-domain constraint:
"Several factors limit the generalisability of these findings. This includes the fact that all evaluations used factual question-answering pairs from a single knowledge base (the FAO Climate-Smart Agriculture Sourcebook), and that performance patterns may differ for other domains, particularly those requiring multi-step reasoning or creative synthesis."
The Abstract and Conclusions (Section 6) have also been updated to explicitly acknowledge this scope limitation and recommend multi-domain evaluation as a direction for future research.
Comment 4: "No reporting of total inference time per query, token consumption, or resource utilization, which are critical for judging whether multi-agent coordination is worth the overhead."
Response 4: We agree that computational metrics are essential for evaluating multi-agent overhead. A new Section 4.4 "Computational Efficiency Analysis" has been added with Table 5 presenting token consumption and processing time measurements:
"Multi-agent configurations incurred computational overhead through multiple inference calls and coordination mechanisms. Token consumption provides a hardware-independent measure of computational overhead. Table 5 presents token consumption and processing time measurements across models and strategies."
Key findings include the following: the Collaborative strategy requires 58.2% higher token consumption (mean: 2,656 tokens) than the other strategies (mean: 1,658–1,717 tokens); hardware heterogeneity (Apple M1 for Llama vs. Intel Xeon CPU for Mistral/Granite) precludes direct timing comparisons between models, but within-model comparisons reveal 13–25% variation across strategies; and all configurations used three agents per query.
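As a quick arithmetic check of the relative overhead quoted above, using only the per-strategy means reported in this response:

```python
# Quick check of the overhead figure, using the per-strategy means quoted above.
collaborative_mean = 2656            # mean tokens per query, Collaborative strategy
other_strategy_means = (1658, 1717)  # reported range for the other strategies

for other in other_strategy_means:
    overhead = (collaborative_mean - other) / other * 100
    print(f"vs. {other} tokens: +{overhead:.1f}%")
# Prints roughly +60.2% and +54.7%, bracketing the reported 58.2% figure.
```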
Comment 5: "The reward coefficient and penalty coefficient are set based on 'standard practices' without sensitivity analysis."
Response 5: We agree that sensitivity analysis is essential for validating the T-CPS metric. A new Section 4.7 "Sensitivity Analysis Results" has been added with comprehensive analysis:
"The T-CPS metric depends on two parameters: α (consistency reward, default = 0.10) and β (variance penalty, default = 0.05). To verify that conclusions were not dependent on these specific values, sensitivity analysis was conducted across 25 parameter combinations (α = {0.05, 0.10, 0.15, 0.20, 0.25} and β = {0.025, 0.05, 0.075, 0.10, 0.15}). T-CPS was recalculated for all 31 configurations in each combination."
Table 6 presents the sensitivity analysis results. Key findings: Configuration rankings remain highly stable across all parameter variations (Spearman ρ > 0.95). The α parameter exhibits near-perfect positive correlation with T-CPS (mean r = 0.9993), while β shows negligible correlation (mean r = −0.034). Variance decomposition indicates that α explains 99.87% of variance in T-CPS, while β explains only 0.13%. This validates that study conclusions are robust to reasonable parameter variations.
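A schematic of the sensitivity procedure described above is sketched below; `t_cps` stands in for the manuscript's actual T-CPS formula, which is not reproduced here, so only the 5 × 5 parameter sweep and the rank-stability check are shown.

```python
# Schematic of the alpha/beta sensitivity sweep. `t_cps` is a placeholder for
# the manuscript's T-CPS formula; only the sweep and the rank-stability check
# against the default parameters are illustrated.
from itertools import product
from scipy.stats import spearmanr

ALPHAS = [0.05, 0.10, 0.15, 0.20, 0.25]   # consistency reward values
BETAS = [0.025, 0.05, 0.075, 0.10, 0.15]  # variance penalty values

def sensitivity_sweep(configs, t_cps, alpha0=0.10, beta0=0.05):
    """configs: per-configuration result summaries; t_cps(config, alpha, beta) -> score."""
    baseline = [t_cps(c, alpha0, beta0) for c in configs]
    stability = {}
    for alpha, beta in product(ALPHAS, BETAS):     # 25 parameter combinations
        scores = [t_cps(c, alpha, beta) for c in configs]
        rho, _ = spearmanr(baseline, scores)       # ranking agreement with defaults
        stability[(alpha, beta)] = rho
    return stability  # reported outcome: rho > 0.95 for every combination
```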
4.    Response to Comments on the Quality of English Language
The reviewer indicated that "The English is fine and does not require any improvement." We thank the reviewer for this assessment. Minor editorial refinements have been made throughout the manuscript for clarity and consistency.
5.    Additional Clarifications
In addition to the specific changes requested, we have made the following enhancements to strengthen the manuscript:
•    Added Section 3.2.7 "Experimental configuration summary" detailing the four-phase experimental design (Baseline Establishment → Original Multi-Agent Evaluation → Retrieval Fragmentation Isolation → Implementation Enhancement).
•    Expanded Appendix A with detailed implementation information including hardware configurations (Table A.1), software dependencies (Table A.2), RAG parameters (Table A.4), agent role assignments (Table A.5), and prompt templates (Sections A.5–A.7).
•    Added Supplementary Materials including inter-metric correlation analysis (Table S1), weight perturbation stability analysis (Table S2), and leave-one-out ablation results (Table S3).
•    Updated Table 4 to include all 28 multi-agent configurations ranked by T-CPS degradation magnitude, with unified presentation of statistical metrics.
•    Expanded the bibliography with additional references supporting the methodological framework and contextualizing findings within recent multi-agent research.
We believe these revisions comprehensively address all comments raised by Reviewer 1 and strengthen the scientific contribution of the manuscript.

Author Response File: Author Response.pdf

Reviewer 2 Report

Comments and Suggestions for Authors

Main contents: This article investigates whether adding multi-agent coordination on top of retrieval-augmented generation improves performance compared with carefully tuned single-agent RAG systems. The authors extend the PaSSER framework to support four coordination strategies (collaborative, sequential, competitive and hierarchical) and evaluate three open-source models, Mistral 7B, Llama 3.1 8B and Granite 3.2 8B, on one hundred climate-smart agriculture question-answer pairs. Performance is measured with two composite metrics, CPS and T-CPS, which aggregate nine lexical, semantic and fluency measures. Across sixteen multi-agent configurations, fourteen exhibit significant degradation and none clearly surpasses its RAG baseline. Collaborative coordination is consistently the worst performer, while Llama 3.1 8B with sequential or hierarchical control shows only limited performance loss. Additional experiments with shared retrieval context for Granite suggest that coordination overhead rather than fragmented retrieval is the dominant source of degradation. The study thus indicates that tuned single-agent RAG is usually preferable under similar conditions.

The following are my detailed comments, one by one.

>> The evaluation scope appears parochial relative to the research question. All experiments use one hundred question-answer pairs from climate-smart agriculture. These items come from the same source that was used to tune similarity thresholds for the RAG baselines, so there is no coverage of other domains, no multi-step reasoning tasks, and no creative tasks where multi-agent systems might reasonably be expected to show advantages. By contrast, many multi-agent studies assess systems on more demanding scenarios such as programming, planning, cooperative games, or multi-step question answering, and they report cases where multi-agent configurations surpass single-agent ones. Such extrapolation to most deployment contexts under similar conditions seems weakly supported by the data.

>> The multi-agent architecture itself remains rather jejune. All agents share one base model and differ mostly in prompts. There is no evidence of genuine specialization or role-specific fine-tuning, and those prompts are generated automatically by an assistive tool rather than being shaped through a systematic optimisation process. Coordination strategies are also minimal; for example, the collaborative strategy simply concatenates and summarizes all answers without any explicit selection or debate, while in the hierarchical strategy the manager sees only very short summaries of the two worker agents. In contrast, successful multi-agent frameworks in the literature typically emphasize carefully designed roles, structured interaction protocols, and multi-round reflection, so the negative findings in this study likely reflect these design choices more than any inherent flaw in multi-agent coordination itself.

>> Baseline RAG configurations receive markedly more tuning than the multi-agent ones. The authors sweep similarity thresholds across the entire range for each model using three hundred sixty-nine questions, select the optimum with respect to T-CPS, and then fix that setting for all subsequent experiments, including those that incorporate multi-agent coordination. Yet multi-agent setups with independent retrieval or shared context may require different thresholds, different numbers of retrieved documents, or different context lengths to work effectively, especially when the goal is to diversify supporting evidence. Because the paper explores shared context only for Granite, omits corresponding tests for the other models, and provides no sensitivity analysis for thresholds or retrieval parameters, it becomes difficult to separate the claim that multi-agent systems are worse from the simpler possibility that the evaluation setup is methodologically tendentious.

>> The composite metric and the statistical analysis look baroque. CPS aggregates nine highly correlated metrics using fixed weights that appear to have been chosen by intuition rather than calibrated on data or validated with domain experts, and the paper does not examine whether its conclusions remain stable when those weights are perturbed or when individual metrics are removed. T-CPS then introduces reward and penalty terms based on the coefficient of variation, with parameters alpha and beta borrowed from generic machine learning literature, yet no sensitivity study is reported for this particular application. At the same time, the authors conduct many pairwise comparisons across configurations on the same dataset without any adjustment for multiple testing, and by focusing almost entirely on CPS and T-CPS they obscure which specific aspects of quality (content accuracy, fluency, or stability) are most strongly degraded for multi-agent systems.

>> Claims about coordination overhead and generality seem slightly quixotic. The attempt to disentangle retrieval fragmentation from coordination effects is carried out only for Granite by comparing independent retrieval with shared context, yet the resulting conclusion is extended to all three models. Nor does the paper report details such as the number of coordination rounds, inference time, token usage, or computational cost for each strategy, even though it asserts that overhead is the primary cause of degradation, and differences in hardware (two models on CPUs and one on an M1 machine with a GPU) further weaken any inference about deployment cost. Finally, the discussion and conclusion portray multi-agent approaches as inefficient for most RAG scenarios, while existing work already documents settings with more complex tasks and carefully engineered multi-agent architectures where performance can improve substantially, so a more cautious stance would treat the present findings as evidence about one very specific configuration rather than as a refutation of multi-agent coordination in general.

Author Response

Response to Reviewer 2 Comments
1.    Summary
Thank you very much for taking the time to review this manuscript and for your detailed, constructive feedback. We appreciate your thorough analysis of the experimental design, composite metrics, and generalizability concerns. Your comments have helped us significantly strengthen the manuscript. Please find the detailed responses below. Revised text is presented in red font within each response, with key additions quoted directly from the revised manuscript. Comments have been added to the re-submitted manuscript to indicate where each reviewer comment is addressed.
2.    Questions for General Evaluation
Question    Reviewer's Evaluation    Response
Does the introduction provide sufficient background and include all relevant references?    Yes    Thank you. Scope clarification added in Section 1.
Is the research design appropriate?    Can be improved    Addressed with four-phase experimental design (Section 3.2.7) and design rationale.
Are the methods adequately described?    Can be improved    MCDM framework expanded (Section 3.3) with weight rationale and validation.
Are the results clearly presented?    Can be improved    New Tables 5–6 added; computational metrics and sensitivity analysis included.
Are the conclusions supported by the results?    Can be improved    Claims softened throughout; scope limitations explicitly stated in Section 6.
Are all figures and tables clear and well-presented?    Can be improved    Table 4 reorganized; new Tables 5–6 added; Supplementary Tables S1–S3.

3.    Point-by-point Response to Comments and Suggestions
Comment 1: "The evaluation scope appears parochial relative to the research question. All experiments use one hundred question answer pairs from climate smart agriculture. These items come from the same source that was used to tune similarity thresholds for the RAG baselines, so there is no coverage of other domains, no multi-step reasoning tasks, and no creative tasks where multi-agent systems might reasonably be expected to show advantages."
Response 1: We acknowledge this limitation and have made several revisions to address it. The Introduction (Section 1) now includes explicit scope clarification:
"Previous studies demonstrating the benefits of multi-agent systems have typically used resource-intensive configurations. Examples include iterative, multi-round debates... Additionally, multi-agent capabilities scale strongly with model size: GPT-4-turbo exceeds Llama-2-70B by more than threefold on coordination metrics, suggesting that coordination benefits may require model capabilities that extend beyond typical deployment constraints."
A new Section 5.3 "Limitations" explicitly acknowledges domain specificity: "All evaluations used factual question-answering pairs from a single knowledge base (the FAO Climate-Smart Agriculture Sourcebook), and performance patterns may differ for other domains, particularly those requiring multi-step reasoning or creative synthesis." The Abstract and Conclusions have been updated to frame findings as applicable to "local deployment scenarios with computational constraints" rather than making universal claims.
Comment 2: "The multi agent architecture itself remains rather jejune. All agents share one base model and differ mostly in prompts. There is no evidence of genuine specialization or role specific fine tuning, and those prompts are generated automatically by an assistive tool rather than being shaped through a systematic optimisation process."
Response 2: We agree that this is a limitation of the current study. The homogeneous architecture was a deliberate design choice to isolate coordination effects, as explained in Section 3.2.6:
"All agents within each configuration use the same base model. This design choice isolates the effects of the coordination strategy from confounding factors introduced by model heterogeneity."
Section 5.3 (Limitations) now explicitly acknowledges this constraint: "Heterogeneous architectures that combine models with complementary strengths are an unexplored area that could produce different results. Role prompts were generated using AI assistance (GitHub Copilot with Claude Sonnet 4.5), reflecting realistic development practices. However, hand-crafted, domain-specific prompts with careful role differentiation might produce different results." Future work directions in Section 6 include heterogeneous agent teams and role-specific prompting.
Comment 3: "Baseline RAG configurations receive markedly more tuning than the multi agent ones. The authors sweep similarity thresholds across the entire range for each model using three hundred sixty nine questions, select the optimum with respect to T CPS, and then fix that setting for all subsequent experiments, including those that incorporate multi agent coordination."
Response 3: We have added Section 3.2.7 "Experimental configuration summary" to clarify the experimental rationale and address this concern:
"Phase 1: Baseline Establishment. Single-agent RAG baselines (3 configurations) establish reference performance for each model using tuned similarity thresholds: 0.95 for Mistral 7B and Granite 3.2 8B, and 0.90 for Llama 3.1 8B. These thresholds were determined through evaluation on the full 369-question dataset (Section 3.4). The baselines represent the performance target against which multi-agent configurations are compared."
The experimental design deliberately uses tuned baselines because practitioners would compare multi-agent approaches against their best available single-agent configuration. Using suboptimal baselines would inflate multi-agent benefits artificially. This methodological choice is now explicitly justified in the text. The Optimized configurations (Phase 4) apply shared context retrieval to all models, ensuring multi-agent configurations also receive implementation enhancements.
Comment 4: "The composite metric and the statistical analysis look baroque. CPS aggregates nine highly correlated metrics using fixed weights that appear to have been chosen by intuition rather than calibrated on data or validated with domain experts, and the paper does not examine whether its conclusions remain stable when those weights are perturbed or when individual metrics are removed."
Response 4: We have substantially expanded the methodology to address these concerns. Section 3.3.2 now justifies the aggregation method selection within the MCDM framework:
"SAW was selected for this evaluation based on three considerations. Firstly, the data requirements: AHP requires pairwise expert comparisons across all criteria, TOPSIS requires the specification of ideal and anti-ideal reference points, and outranking methods require concordance and discordance thresholds. SAW only requires normalised scores and criterion weights, which are both directly available from the evaluation metrics."
Section 3.3.4 now provides explicit weight rationale within a four-dimensional evaluation framework: content accuracy (50%), semantic relevance (20%), lexical overlap (15%), and linguistic quality (15%). Regarding the "highly correlated" concern, Supplementary Table S1 presents the complete inter-metric correlation matrix, showing mean pairwise correlation r = 0.452, indicating complementary rather than redundant metrics.
New Section 4.7 presents comprehensive sensitivity analysis across 25 parameter combinations. Supplementary Table S2 confirms ranking stability (Spearman ρ > 0.95 across all weight perturbations). Supplementary Table S3 provides leave-one-out ablation analysis showing that conclusions remain stable when individual metrics are removed.
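A structural sketch of the leave-one-out ablation mentioned above (Supplementary Table S3) is given below. `cps_score(scores, weights)` is a placeholder for the manuscript's actual aggregation; only the drop-one-metric, renormalise, and rank-comparison steps are shown.

```python
# Structural sketch of the leave-one-out ablation: drop one metric, renormalise
# the remaining weights, recompute the composite score for every configuration,
# and compare rankings with the full metric set. `cps_score` is a placeholder.
from scipy.stats import spearmanr

def leave_one_out_stability(config_scores, weights, cps_score):
    """config_scores: per-configuration metric dicts; weights: metric -> weight."""
    full = [cps_score(s, weights) for s in config_scores]
    stability = {}
    for dropped in weights:
        reduced = {m: w for m, w in weights.items() if m != dropped}
        norm = sum(reduced.values())
        reduced = {m: w / norm for m, w in reduced.items()}  # reweight to sum to 1
        ablated = [cps_score(s, reduced) for s in config_scores]
        rho, _ = spearmanr(full, ablated)  # rank agreement after removing the metric
        stability[dropped] = rho
    return stability
```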
Comment 5: "Claims about coordination overhead and generality seem slightly quixotic. The attempt to disentangle retrieval fragmentation from coordination effects is carried out only for Granite by comparing independent retrieval with shared context, yet the resulting conclusion is extended to all three models. Nor does the paper report details such as the number of coordination rounds, inference time, token usage, or computational cost for each strategy."
Response 5: We have made several revisions to address these concerns. First, the experimental design has been clarified in Section 3.2.7 to show that Phase 4 (Optimized configurations) applies shared context retrieval to ALL three models, not just Granite:
"Phase 4: Implementation Enhancement. Optimized multi-agent configurations (12 configurations) incorporate two enhancements. All strategies receive shared context retrieval, where agents receive identical retrieved documents from a single query."
Second, new Section 4.4 "Computational Efficiency Analysis" with Table 5 now reports token consumption and processing time for all model-strategy combinations. Key findings include: Collaborative strategy requires 58.2% higher token consumption; all configurations used three agents per query; hardware heterogeneity precludes direct timing comparisons between models but within-model patterns are reported.
Third, claims have been softened throughout. Section 6 now states: "These findings suggest boundary conditions for multi-agent RAG deployment: coordination benefits demonstrated in prior work using iterative debate, adversarial roles, or larger models (70B+) may not extend to simpler coordination strategies with resource-constrained open-source models." The Abstract explicitly notes "study limitations include evaluation on a single domain (agriculture), use of 7–8B parameter models, and homogeneous agent architectures."
4.    Response to Comments on the Quality of English Language
The reviewer indicated that "The English is fine and does not require any improvement." We thank the reviewer for this assessment. Minor editorial refinements have been made throughout the manuscript for clarity and consistency.
5.    Additional Clarifications
We appreciate the reviewer's rigorous evaluation. In response to the overall concerns about generalizability, we have reframed the paper's contributions as establishing boundary conditions for multi-agent RAG deployment rather than making categorical claims. The revised manuscript:
•    Explicitly positions findings as applicable to "local deployment scenarios with computational constraints" using 7–8B parameter models.
•    Acknowledges that larger models, heterogeneous agents, and different task types may produce different results.
•    Frames the negative results as valuable scientific contributions establishing where multi-agent approaches encounter limitations.
•    Provides Supplementary Materials (Tables S1–S3) validating the CPS/T-CPS methodology.
•    Adds practical deployment guidelines in Section 6 based on empirical findings.
•    Expanded the bibliography with additional references supporting the methodological framework and contextualizing findings within recent multi-agent research.
We believe these revisions comprehensively address all comments raised by Reviewer 2 and present a more appropriately scoped contribution to the field.

Author Response File: Author Response.pdf

Reviewer 3 Report

Comments and Suggestions for Authors

1)  
This study makes a substantial contribution by systematically comparing a well‑tuned single‑agent RAG system with four multi‑agent coordination strategies across three 7–8B open‑source LLMs, and by showing that most multi‑agent configurations reduce both performance and consistency; however, it would be helpful for readers if the Introduction and Conclusion sections could more explicitly indicate that the experimental scope—single domain (CSA sourcebook), 100 QA pairs, and 7–8B models—constrains the generalizability of the conclusions.

Suggestion: It is recommended to add a brief subsection such as “Limitations and Scope” in the Introduction and Discussion/Conclusion to explicitly summarize under which conditions the findings are primarily applicable, considering the domain used, dataset size, model scale, and fixed coordination design.

2)  
The conceptual description of the multi‑agent framework and the overview of the coordination strategies are clearly presented; however, the implementation details (such as role‑specific prompt structures, the number of agents and token settings, decision/selection rules, and decoding hyperparameters) are relatively concise, which may make it challenging for follow‑up studies to fully reproduce the setup or to trace the concrete causes of performance degradation.

Suggestion: Adding a subsection or an appendix titled “Prompt and Coordination Configuration” in the Method section, where you summarize for each strategy the agent role definitions, core prompt templates (at the level of diagrams or pseudocode), and key hyperparameters (e.g., temperature, max tokens) in tabular form, would further strengthen reproducibility and interpretability.

3)  
In the results analysis, the finding that the collaborative strategy yields the largest performance degradation across all models and forms a “universal collaborative degradation” group is intriguing, but the paper would benefit from a somewhat richer qualitative discussion of how the deliberation/consensus procedure in this strategy differs from designs in prior multi‑agent work and what interaction patterns might be driving this degradation.

Suggestion: In the Related Work and Discussion sections, you might briefly contrast representative collaborative/consensus approaches with your implementation through narrative comparison or a simple table, and add a summarized dialogue example in the appendix or main text to illustrate recurrent error patterns (e.g., reduction or distortion of evidence, non‑productive repetition in discussions).

4)  
CPS and T‑CPS are important strengths of this study, but from the reader’s perspective it would be helpful to have a slightly more concrete explanation of how the nine underlying metrics are weighted and aggregated, and what intuitive roles each metric plays in practical decision‑making about whether to adopt multi‑agent coordination.

Suggestion: Briefly outlining the rationale for the chosen metric weights in the Method or Appendix, and providing a visualized example (diagram or table) of metric‑level scores for one or two representative configurations, would make the interpretation of CPS/T‑CPS more intuitive.

5)  
Although the paper concludes that a single‑agent RAG is practically more suitable under most conditions, the practical impact of the results could be further enhanced if concise, concrete guidelines were provided in summary form for practitioners who need to select model–strategy combinations.

Suggestion: In the final part of the Conclusion, a short “Practical Deployment Guidelines” summary could be included, for example stating that in 7–8B models with single‑domain QA and a tuned RAG setup, a single‑agent configuration is generally recommended, while combinations such as the Sequential/Hierarchical strategies with Llama 3.1 8B—where performance degradation is minimal—may be considered as exceptions.

 

Author Response

Response to Reviewer 3 Comments
1.    Summary
Thank you very much for taking the time to review this manuscript and for your constructive feedback. We appreciate your recognition of the study's contribution and your practical suggestions for strengthening the scope, reproducibility, and applicability of our findings. Your comments have helped us significantly improve the manuscript. Please find the detailed responses below. Revised text is presented in red font within each response, with key additions quoted directly from the revised manuscript. Comments have been added to the re-submitted manuscript to indicate where each reviewer comment is addressed.
2.    Questions for General Evaluation
Question    Reviewer's Evaluation    Response
Does the introduction provide sufficient background and include all relevant references?    Can be improved    Scope clarification added in Section 1; limitations acknowledged.
Is the research design appropriate?    Can be improved    Section 3.2.7 added with four-phase experimental design.
Are the methods adequately described?    Can be improved    Appendix A expanded with prompts, configurations, and hyperparameters.
Are the results clearly presented?    Must be improved    Tables 5–6 added; Section 5.1 expanded with qualitative analysis.
Are the conclusions supported by the results?    Can be improved    Section 5.3 (Limitations) added; deployment guidelines in Section 6.
Are all figures and tables clear and well-presented?    Can be improved    New Tables 5–6; Supplementary Tables S1–S3; Appendix Tables A.1–A.5.

3.    Point-by-point Response to Comments and Suggestions
Comment 1: "This study makes a substantial contribution by systematically comparing a well-tuned single-agent RAG system with four multi-agent coordination strategies across three 7–8B open-source LLMs, and by showing that most multi-agent configurations reduce both performance and consistency; however, it would be helpful for readers if the Introduction and Conclusion sections could more explicitly indicate that the experimental scope—single domain (CSA sourcebook), 100 QA pairs, and 7–8B models—constrains the generalizability of the conclusions. Suggestion: It is recommended to add a brief subsection such as 'Limitations and Scope' in the Introduction and Discussion/Conclusion to explicitly summarize under which conditions the findings are primarily applicable."
Response 1: We have implemented this suggestion by adding a dedicated Section 5.3 "Limitations" that explicitly summarizes the scope constraints:
"Several factors limit the generalisability of these findings. All evaluations used factual question-answering pairs from a single knowledge base (the FAO Climate-Smart Agriculture Sourcebook), and performance patterns may differ for other domains, particularly those requiring multi-step reasoning or creative synthesis. The 7–8B parameter models tested represent a specific capability tier; larger models (70B+) that have demonstrated stronger coordination capabilities in prior work were not evaluated. Heterogeneous architectures that combine models with complementary strengths are an unexplored area that could produce different results."
The Introduction (Section 1) now includes scope clarification referencing the specific conditions under which multi-agent benefits have been demonstrated in prior work. The Abstract explicitly notes: "Study limitations include evaluation on a single domain (agriculture), use of 7–8B parameter models, and homogeneous agent architectures." Section 6 (Conclusions) frames findings as applicable to "local deployment scenarios with computational constraints" rather than making universal claims.
Comment 2: "The conceptual description of the multi-agent framework and the overview of the coordination strategies are clearly presented; however, the implementation details (such as role-specific prompt structures, the number of agents and token settings, decision/selection rules, and decoding hyperparameters) are relatively concise, which may make it challenging for follow-up studies to fully reproduce the setup. Suggestion: Adding a subsection or an appendix titled 'Prompt and Coordination Configuration' in the Method section, where you summarize for each strategy the agent role definitions, core prompt templates, and key hyperparameters in tabular form, would further strengthen reproducibility."
Response 2: We have substantially expanded Appendix A to address this concern. The appendix now includes:
•    Table A.1: Hardware configurations (Apple M1 iMac for Llama, Intel Xeon server for Mistral/Granite)
•    Table A.2: Software dependencies with version numbers
•    Table A.4: RAG configuration parameters (chunk size, overlap, embedding model, similarity thresholds)
•    Table A.5: Agent role definitions for all four coordination strategies
•    Section A.5: Original coordination strategy prompt templates
•    Section A.6: Improved coordination strategy modifications
•    Section A.7: Improved Collaborative strategy two-phase consensus protocol
Key hyperparameters are now documented: temperature = 0.3, max_tokens = 512, three agents per configuration. The complete source code remains available on GitHub (maPaSSER repository) for full reproducibility.
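For quick reference, the documented settings could be collected in a single configuration block such as the sketch below; the dictionary keys and model identifier strings are illustrative, and Appendix A together with the maPaSSER repository remains authoritative.

```python
# Documented decoding and coordination settings in one place. Keys and model
# identifier strings are illustrative; Appendix A / maPaSSER are authoritative.
AGENT_CONFIG = {
    "temperature": 0.3,   # decoding temperature shared by all agents
    "max_tokens": 512,    # per-agent generation limit
    "num_agents": 3,      # agents per multi-agent configuration
    "models": ["mistral-7b", "llama-3.1-8b", "granite-3.2-8b"],  # evaluated base models
}
```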
Comment 3: "In the results analysis, the finding that the collaborative strategy yields the largest performance degradation across all models and forms a 'universal collaborative degradation' group is intriguing, but the paper would benefit from a somewhat richer qualitative discussion of how the deliberation/consensus procedure in this strategy differs from designs in prior multi-agent work and what interaction patterns might be driving this degradation. Suggestion: In the Related Work and Discussion sections, you might briefly contrast representative collaborative/consensus approaches with your implementation."
Response 3: We have added qualitative analysis addressing this question in two locations. Section 1 (Introduction) now contrasts our approach with prior successful multi-agent implementations:
"Previous studies demonstrating the benefits of multi-agent systems have typically used resource-intensive configurations. Examples include iterative, multi-round debates; adversarial agent roles that challenge and refine responses; and architectures that combine different model families to provide complementary strengths."
Section 5.1 (Discussion) provides detailed analysis of the interaction patterns driving collaborative degradation:
"The collaborative strategy exhibits the largest degradation because its consensus-based synthesis compounds rather than corrects individual agent limitations. When agents independently generate suboptimal responses, the synthesis step tends to preserve shared errors while potentially discarding correct but minority viewpoints. This contrasts with successful multi-agent implementations that use iterative debate (where agents challenge each other's reasoning), adversarial roles (where designated critics identify weaknesses), or model heterogeneity (where different architectures contribute complementary strengths)."
Comment 4: "CPS and T-CPS are important strengths of this study, but from the reader's perspective it would be helpful to have a slightly more concrete explanation of how the nine underlying metrics are weighted and aggregated, and what intuitive roles each metric plays in practical decision-making. Suggestion: Briefly outlining the rationale for the chosen metric weights in the Method or Appendix, and providing a visualized example of metric-level scores, would make the interpretation of CPS/T-CPS more intuitive."
Response 4: We have added Section 3.3.4 "Weight rationale" that provides explicit justification for metric weights within a four-dimensional evaluation framework:
"Content accuracy (50% total weight): METEOR (20%) captures semantic adequacy through synonym matching and stemming; BERTScore F1 (15%) measures deep semantic similarity via contextual embeddings; Cosine Similarity (15%) quantifies vector-space alignment between generated and reference responses. Semantic relevance (20%): Pearson Correlation (20%) assesses the linear relationship between response and reference semantic vectors. Lexical overlap (15%): ROUGE-L F1 (15%) measures longest common subsequence overlap. Linguistic quality (15%): Perplexity (15%, inverted) penalises incoherent or disfluent text."
Section 3.3.2 now explains why the Simple Additive Weighting (SAW) method was selected over alternatives (AHP, TOPSIS, ELECTRE, PROMETHEE) based on data requirements, interpretability, and theoretical appropriateness. Supplementary Table S1 provides the complete inter-metric correlation matrix demonstrating that metrics are complementary rather than redundant (mean r = 0.452).
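To make the aggregation concrete, a minimal Simple Additive Weighting sketch is shown below using the six metric weights quoted above (the manuscript's full nine-metric set and exact normalisation may differ). Scores are assumed to be normalised to [0, 1] with higher being better, so perplexity is inverted before weighting; the example values are hypothetical.

```python
# Minimal SAW sketch using the metric weights quoted above. Scores are assumed
# normalised to [0, 1] with higher = better; perplexity is inverted beforehand.
CPS_WEIGHTS = {
    "meteor": 0.20,               # content accuracy
    "bertscore_f1": 0.15,         # content accuracy
    "cosine_similarity": 0.15,    # content accuracy
    "pearson_correlation": 0.20,  # semantic relevance
    "rouge_l_f1": 0.15,           # lexical overlap
    "perplexity_inverted": 0.15,  # linguistic quality (lower perplexity -> higher score)
}

def cps(normalised_scores: dict) -> float:
    """Weighted sum of normalised metric scores (Simple Additive Weighting)."""
    return sum(w * normalised_scores[m] for m, w in CPS_WEIGHTS.items())

# Hypothetical metric-level scores for one configuration:
example = {"meteor": 0.62, "bertscore_f1": 0.88, "cosine_similarity": 0.85,
           "pearson_correlation": 0.80, "rouge_l_f1": 0.55, "perplexity_inverted": 0.70}
print(round(cps(example), 3))  # 0.731
```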
Comment 5: "Although the paper concludes that a single-agent RAG is practically more suitable under most conditions, the practical impact of the results could be further enhanced if concise, concrete guidelines were provided in summary form for practitioners who need to select model–strategy combinations. Suggestion: In the final part of the Conclusion, a short 'Practical Deployment Guidelines' summary could be included."
Response 5: We have added practical deployment guidelines to Section 6 (Conclusions):
"For practitioners deploying RAG systems with 7–8B parameter open-source models, these findings suggest several guidelines. Single-agent RAG with tuned similarity thresholds should be the default choice for factual question-answering tasks. If multi-agent coordination is required for other reasons (e.g., diverse perspective generation), Sequential or Hierarchical strategies with Llama 3.1 8B offer the smallest performance trade-offs (−2.1% to −5.4% vs baseline). Collaborative strategies should be avoided in their current form due to consistent degradation across all tested configurations."
These guidelines directly address the reviewer's suggestion by providing actionable recommendations tied to specific experimental findings.
4.    Response to Comments on the Quality of English Language
The reviewer indicated that "The English is fine and does not require any improvement." We thank the reviewer for this assessment. Minor editorial refinements have been made throughout the manuscript for clarity and consistency.
5.    Additional Clarifications
We are grateful for the reviewer's constructive suggestions, which have substantially improved the manuscript's clarity, reproducibility, and practical applicability. Beyond the specific responses above, we have also:
•    Added Section 4.4 "Computational Efficiency Analysis" with Table 5 presenting token consumption and processing time data.
•    Added Section 4.7 "Sensitivity Analysis Results" with Table 6 validating ranking stability across 25 parameter combinations.
•    Provided Supplementary Materials including inter-metric correlation analysis (Table S1), weight perturbation stability (Table S2), and leave-one-out ablation analysis (Table S3).
•    Updated Table 4 with all 28 multi-agent configurations ranked by T-CPS degradation.
•    Expanded the bibliography with additional references supporting the methodological framework and contextualizing findings within recent multi-agent research.
We believe these revisions comprehensively address all comments raised by Reviewer 3 and present a more complete, reproducible, and practically useful contribution to the field.

Author Response File: Author Response.pdf

Round 2

Reviewer 2 Report

Comments and Suggestions for Authors

The reviewer's comments were addressed clearly, and the manuscript has been improved significantly. I recommend acceptance.
