Multi-Agent Coordination Strategies vs. Retrieval-Augmented Generation in LLMs: A Comparative Evaluation
Abstract
1. Introduction
1.1. Research Objectives
- Comparative performance assessment. This study evaluates the performance of four coordination strategies (collaborative, sequential, competitive, hierarchical) across three open-source models (Mistral 7B, Llama 3.1 8B, Granite 3.2 8B). The objective is to determine whether multi-agent configurations outperform, match, or underperform calibrated single-agent RAG baselines.
- Degradation source identification. The study aims to isolate the relative contributions of coordination overhead versus retrieval fragmentation to performance changes. Independent retrieval and shared context retrieval configurations are compared to decompose these effects quantitatively.
- Model-strategy interaction analysis. This objective investigates whether coordination effectiveness depends on model architecture. Differential responses to identical coordination protocols across model families are characterized.
- Consistency-performance trade-offs. The study examines whether multi-agent coordination affects output variability alongside mean performance. The Threshold-Aware Composite Performance Score (T-CPS) is employed to evaluate performance and stability jointly.
1.2. Approach
2. Related Work
3. Methods
3.1. Experimental Infrastructure
3.2. Multi-Agent Coordination Strategies and Experimental Design
3.2.1. Collaborative Strategy
3.2.2. Improved Collaborative Strategy: Two-Phase Consensus
3.2.3. Competitive Strategy
3.2.4. Hierarchical Strategy
3.2.5. Sequential Strategy
3.2.6. Retrieval Context Configuration
- Independent Retrieval: Each agent independently queries the vector database and retrieves its own document subset based on similarity ranking. This configuration allows agents to access potentially different retrieved passages, introducing retrieval diversity but also potential inconsistency in available context. Independent retrieval was used for all Original multi-agent configurations.
- Shared Context Retrieval: All agents receive identical retrieved document sets extracted through a single vector database query. This configuration ensures uniform input context across all agents, eliminating retrieval fragmentation as a confounding variable and isolating pure coordination effects. Shared context retrieval was applied to all Optimized configurations (all three models) and to Granite-SCR configurations.
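To make the distinction concrete, the sketch below contrasts the two retrieval configurations using ChromaDB, the vector database used in this study (Appendix A.3). This is an illustrative sketch only: the collection handle, the `embed` function, and the way per-agent queries are derived in the independent case are assumptions, not the repository's actual code.

```python
# Illustrative sketch: independent vs. shared context retrieval with ChromaDB.
# `embed` (query -> embedding vector) and the collection name are hypothetical.
import chromadb

client = chromadb.Client()
collection = client.get_or_create_collection(
    name="rag_corpus", metadata={"hnsw:space": "cosine"}  # cosine similarity
)

TOP_K = 7  # top_k_docs from Appendix A.3


def independent_retrieval(agent_queries, embed):
    """Each agent issues its own query and may see a different document subset."""
    contexts = []
    for query in agent_queries:
        result = collection.query(query_embeddings=[embed(query)], n_results=TOP_K)
        contexts.append(result["documents"][0])
    return contexts


def shared_context_retrieval(user_query, embed, n_agents=3):
    """A single query is issued once; every agent receives the identical document set."""
    result = collection.query(query_embeddings=[embed(user_query)], n_results=TOP_K)
    shared_docs = result["documents"][0]
    return [shared_docs for _ in range(n_agents)]
```

In the shared variant the retrieved set is duplicated rather than re-queried, which is what removes retrieval fragmentation as a confounding factor.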
3.2.7. Experimental Configuration Summary
3.3. Performance Evaluation Framework
3.3.1. The Multi-Criteria Evaluation Problem
3.3.2. Aggregation Method Selection
3.3.3. Component Metrics
3.3.4. Composite Performance Score (CPS)
- $p_j$ indicates the polarity of metric $j$: $p_j = +1$ if higher values indicate better performance, $p_j = -1$ if lower values indicate better performance
- $w_j$ is the assigned weight for metric $j$, with $\sum_{j} w_j = 1$
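A minimal sketch of this weighted aggregation is shown below, using the component metric weights from the evaluation framework (F1 0.20, METEOR 0.15, BLEU 0.15, cosine similarity 0.10, Pearson correlation 0.10, and 0.075 each for ROUGE-1, ROUGE-L, Laplace perplexity, and Lidstone perplexity). The min–max normalization step and the treatment of the perplexity metrics as lower-is-better are assumptions for illustration; the definitions in this section take precedence.

```python
# Illustrative CPS aggregation sketch; the normalization scheme is an assumption.
WEIGHTS = {
    "f1": 0.20, "meteor": 0.15, "bleu": 0.15,
    "cosine_similarity": 0.10, "pearson": 0.10,
    "rouge1_f": 0.075, "rougeL_f": 0.075,
    "laplace_perplexity": 0.075, "lidstone_perplexity": 0.075,
}
# Polarity: +1 if higher is better, -1 if lower is better (assumed for perplexity).
POLARITY = {name: (-1 if "perplexity" in name else +1) for name in WEIGHTS}


def composite_performance_score(metrics, mins, maxs):
    """Weighted sum of normalized metrics, flipping lower-is-better metrics."""
    cps = 0.0
    for name, weight in WEIGHTS.items():
        span = (maxs[name] - mins[name]) or 1.0
        normalized = (metrics[name] - mins[name]) / span  # min-max to [0, 1]
        if POLARITY[name] < 0:
            normalized = 1.0 - normalized  # lower raw value -> higher score
        cps += weight * normalized
    return cps
```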
3.3.5. Threshold-Aware Composite Performance Score (T-CPS)
- $\overline{CPS}_{m,s}$ is the mean CPS for model $m$ with strategy $s$
- $CV_{m,s} = \sigma_{m,s} / \overline{CPS}_{m,s}$ is the coefficient of variation, with $\sigma_{m,s}$ denoting the standard deviation of CPS across evaluation instances
- $\alpha$ defines the reward coefficient for stable configurations
- $\beta$ defines the penalty coefficient for high variability
- The coefficient of variation normalizes variability assessment by expressing standard deviation as a proportion of the mean, enabling fair comparison across configurations with different baseline performance levels. Lower CV values indicate more consistent behaviour across queries and evaluation runs.
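The sketch below shows how the CV and a threshold-aware adjustment of this kind could be computed. The specific functional form used here (a flat reward of α when CV is at or below a stability threshold, and a CV-proportional penalty of β otherwise) is an assumed illustration only; the T-CPS equation defined in this section is authoritative.

```python
# Illustrative only: one possible threshold-aware adjustment of mean CPS.
# The reward/penalty structure below is an assumed form, not the paper's equation.
import statistics


def t_cps(cps_values, alpha, beta, cv_threshold=0.15):
    """Adjust mean CPS by rewarding low-variability (stable) configurations."""
    mean_cps = statistics.mean(cps_values)
    cv = statistics.stdev(cps_values) / mean_cps  # coefficient of variation
    if cv <= cv_threshold:
        return mean_cps * (1.0 + alpha)            # reward stable configurations
    return mean_cps * (1.0 - beta * cv)            # penalize high variability
```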
3.4. Baseline Configuration and Statistical Analysis
4. Results
4.1. Experimental Overview
4.2. Overall Performance Comparison
4.3. Statistical Significance and Effect Size Analysis
4.4. Computational Efficiency Analysis
4.5. CPS and T-CPS Relationship Analysis
4.6. Model-Specific Coordination Response Patterns
4.7. Sensitivity Analysis Results
4.8. Summary of Results
- All 28 multi-agent configurations showed statistically significant degradation relative to baseline RAG (p < 0.01 for all comparisons).
- Performance degradation ranged from −4.39% (Mistral-Opt Hierarchical) to −35.31% (Granite Collaborative), with effect sizes |d| = 0.28 to 6.09.
- Shared context retrieval (Granite-SCR) improved performance by 4.2–12.5 percentage points relative to independent retrieval, but all configurations remained 14.2–31.1% below baseline.
- Llama 3.1 8B demonstrated selective tolerance (Sequential and Hierarchical strategies showed degradation below 6%), while Granite 3.2 8B and Mistral 7B Original configurations showed degradation exceeding 25%.
- Optimized configurations reduced degradation by 10–15 percentage points for Mistral and Granite, but showed no benefit for Llama.
- Collaborative coordination showed the largest degradation across all models (20.7–35.3%) despite achieving high output consistency.
- T-CPS sensitivity analysis confirmed ranking stability: 93.5% of configurations maintained identical ranks across 25 parameter combinations.
5. Discussion
5.1. Performance Impact of Multi-Agent Coordination
5.2. Performance Patterns Across Models and Strategies
5.3. Limitations
5.4. Comparison with Prior Literature
6. Conclusions
Supplementary Materials
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| API | Application Programming Interface |
| BLEU | Bilingual Evaluation Understudy |
| CPS | Composite Performance Score |
| CV | Coefficient of Variation |
| JSON | JavaScript Object Notation |
| LLM | Large Language Model |
| MCDM | Multi-Criteria Decision Making |
| METEOR | Metric for Evaluation of Translation with Explicit ORdering |
| RAG | Retrieval-Augmented Generation |
| ROUGE | Recall-Oriented Understudy for Gisting Evaluation |
| SCR | Shared Context Retrieval |
| T-CPS | Threshold-aware Composite Performance Score |
Appendix A. Implementation Details and Reproducibility
Appendix A.1. System Architecture
Appendix A.1.1. Hardware Configurations
| Component | Mistral 7B/Granite 8B | Llama 3.1 8B |
|---|---|---|
| Processor | Intel Xeon (server) | Apple M1 |
| Inference Engine | Ollama (CPU) | MLX (MPS) |
| Acceleration | None (CPU only) | Metal Performance Shaders |
Appendix A.1.2. Software Stack
| Component | Technology |
|---|---|
| LLM Inference | Ollama API (local)/MLX (Apple Silicon) |
| Vector Database | ChromaDB |
| Embeddings | Ollama embeddings (Mistral model) |
| Backend Framework | Python FastAPI |
| Frontend | React |
Appendix A.2. Generation Hyperparameters
| Parameter | Value | Description |
|---|---|---|
| temperature | 0.7 | Controls randomness in generation |
| top_p | 0.9 | Nucleus sampling threshold |
| top_k | 50 | Top-k sampling (Mistral); 40 for other models |
| repetition_penalty | 1.1 | Penalty for repeated tokens |
| repetition_context_size | 20 | Context window for repetition detection |
| max_new_tokens | Dynamic | 90% of remaining context window (up to 2000) |
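As a concrete illustration, the snippet below passes these values to a local Ollama endpoint via its /api/generate options. The mapping from the paper's parameter names to Ollama option names (e.g., repetition_context_size → repeat_last_n) is an assumption, and the dynamic max_new_tokens rule is approximated by a fixed cap of 2000.

```python
# Illustrative generation call using the hyperparameters above.
# Option names follow Ollama's API; the parameter-name mapping is an assumption.
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def generate(prompt, model="mistral"):
    response = requests.post(OLLAMA_URL, json={
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": 0.7,      # randomness in generation
            "top_p": 0.9,            # nucleus sampling threshold
            "top_k": 50,             # 40 for non-Mistral models
            "repeat_penalty": 1.1,   # penalty for repeated tokens
            "repeat_last_n": 20,     # repetition detection window
            "num_predict": 2000,     # fixed stand-in for the dynamic token budget
        },
    }, timeout=600)
    response.raise_for_status()
    return response.json()["response"]
```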
Appendix A.3. Retrieval Configuration
| Parameter | Value |
|---|---|
| top_k_docs | 7 |
| Embedding model | Mistral (via Ollama) |
| Vector database | ChromaDB |
| Similarity metric | Cosine similarity |
| Similarity threshold | Model-specific (0.90–0.95) |
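A sketch of how the similarity threshold in this table might be applied to ChromaDB results follows. With the cosine space, ChromaDB reports distances as 1 − cosine similarity, so the conversion is standard; however, the exact stage at which the paper applies the threshold is not stated here, and this post-retrieval filtering step is an assumption.

```python
# Illustrative post-retrieval filtering using the Appendix A.3 parameters.
# Assumes `collection` uses the cosine space, so distance = 1 - cosine similarity.
TOP_K_DOCS = 7
SIMILARITY_THRESHOLD = 0.95  # model-specific, 0.90-0.95


def retrieve_filtered(collection, query_embedding):
    result = collection.query(
        query_embeddings=[query_embedding],
        n_results=TOP_K_DOCS,
        include=["documents", "distances"],
    )
    docs, dists = result["documents"][0], result["distances"][0]
    # Keep only passages whose cosine similarity meets the threshold.
    return [doc for doc, dist in zip(docs, dists)
            if (1.0 - dist) >= SIMILARITY_THRESHOLD]
```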
Appendix A.4. Agent Configuration
Appendix A.4.1. Number of Agents
Appendix A.4.2. Role Assignments by Strategy
| Strategy | Agent 1 | Agent 2 | Agent 3 |
|---|---|---|---|
| Collaborative | Analyzer | Critic | Synthesizer |
| Sequential | Analyzer | Processor | Reviewer |
| Competitive | Expert A | Expert B | Expert C |
| Hierarchical | Specialist 1 | Specialist 2 | Manager |
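For reference, these role assignments can be captured in a simple strategy-to-roles mapping. The structure below is an illustrative configuration sketch, not the repository's actual data format.

```python
# Illustrative mapping of coordination strategy to the three agent roles above.
AGENT_ROLES = {
    "collaborative": ["Analyzer", "Critic", "Synthesizer"],
    "sequential":    ["Analyzer", "Processor", "Reviewer"],
    "competitive":   ["Expert A", "Expert B", "Expert C"],
    "hierarchical":  ["Specialist 1", "Specialist 2", "Manager"],
}
```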
Appendix A.5. Prompt Templates
Appendix A.5.1. Base Agent Prompt (With RAG Context)
Appendix A.5.2. Base Agent Prompt (Without RAG Context)
Appendix A.5.3. Model-Specific Formatting
Appendix A.6. Coordination Mechanisms
Appendix A.6.1. Collaborative Strategy
Appendix A.6.2. Sequential Strategy
Appendix A.6.3. Competitive Strategy
Appendix A.6.4. Hierarchical Strategy
Appendix A.7. Two-Phase Collaborative Consensus (Improved)
Appendix A.8. Code Availability
Appendix A.9. Data Availability
References






| Group | Configs | Models | Retrieval | Collaborative Consensus | Other Strategies |
|---|---|---|---|---|---|
| Baselines | 3 | All | Single query | N/A | N/A |
| Original | 12 | All | Independent | Simple aggregation | Standard |
| Granite-SCR | 4 | Granite only | Shared | Simple aggregation | Standard |
| Optimized | 12 | All | Shared | Two-Phase | Standard |
| Total | 31 | | | | |
| Category | Metric | Weight | Subtotal |
|---|---|---|---|
| Content Accuracy | F1 Score | 0.20 | |
| | METEOR | 0.15 | |
| | BLEU | 0.15 | 0.50 |
| Semantic Relevance | Cosine Similarity | 0.10 | |
| | Pearson Correlation | 0.10 | 0.20 |
| Lexical/Fluency | ROUGE-1.f | 0.075 | |
| | ROUGE-L.f | 0.075 | |
| | Laplace Perplexity * | 0.075 | |
| | Lidstone Perplexity * | 0.075 | 0.30 |
| Total | | 1.00 | |
| Model | Configuration | Threshold | CPS | T-CPS | CV | CPS Δ% | T-CPS Δ% | Balance Score | Selection |
|---|---|---|---|---|---|---|---|---|---|
| Mistral 7B | Baseline | — | 0.5181 | 0.5610 | 0.1501 | — | — | — | |
| Mistral 7B | SELECTED | 0.95 | 0.5448 | 0.5911 | 0.1339 | +5.16% | +5.37% | 40.11 | Max T-CPS |
| Mistral 7B | Alternative 1 | 0.90 | 0.5332 | 0.5787 | 0.1312 | +2.93% | +3.16% | 24.07 | |
| Mistral 7B | Alternative 2 | 0.70 | 0.5321 | 0.5776 | 0.1283 | +2.70% | +2.97% | 23.13 | |
| Granite 3.2 8B | Baseline | — | 0.5112 | 0.5552 | 0.1243 | — | — | — | |
| Granite 3.2 8B | SELECTED | 0.95 | 0.5176 | 0.5622 | 0.1240 | +1.25% | +1.26% | 10.12 | Max T-CPS |
| Granite 3.2 8B | Alternative 1 | 0.80 | 0.5172 | 0.5619 | 0.1220 | +1.18% | +1.20% | 9.85 | |
| Granite 3.2 8B | Alternative 2 | 0.75 | 0.5168 | 0.5613 | 0.1239 | +1.10% | +1.10% | 8.91 | |
| Llama 3.1 8B | Baseline | — | 0.4982 | 0.5394 | 0.1497 | — | — | — | |
| Llama 3.1 8B | SELECTED | 0.90 | 0.5074 | 0.5495 | 0.1479 | +1.85% | +1.87% | 12.65 | Max T-CPS |
| Llama 3.1 8B | Alternative 1 | 0.55 | 0.5046 | 0.5478 | 0.1286 | +1.30% | +1.55% | 12.05 | |
| Llama 3.1 8B | Alternative 2 | 0.80 | 0.5039 | 0.5450 | 0.1589 | +1.15% | +1.04% | 6.57 |
| Rank | Configuration | Type | CPS | T-CPS | Δ CPS (%) | Δ T-CPS (%) | t-Stat | p-Value | d | |d| | Effect | Sig |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BASELINES | ||||||||||||
| — | Mistral 7B | Baseline | 0.5758 | 0.6226 | — | — | — | — | — | — | — | — |
| — | Granite 3.2 8B | Baseline | 0.5622 | 0.6087 | — | — | — | — | — | — | — | — |
| — | Llama 3.1 8B | Baseline | 0.5363 | 0.5793 | — | — | — | — | — | — | — | — |
| MINIMAL DEGRADATION (Δ > −10%) | ||||||||||||
| 1 | Mistral-Opt Hierarchical | Optimized | 0.5505 | 0.5952 | −4.39 | −4.40 | −2.775 | 0.007 | −0.28 | 0.28 | Small | ** |
| 2 | Llama Sequential | Original | 0.5103 | 0.5511 | −4.85 | −4.87 | −2.972 | 0.004 | −0.30 | 0.30 | Small | ** |
| 3 | Mistral-Opt Sequential | Optimized | 0.5454 | 0.5897 | −5.28 | −5.28 | −3.404 | <0.001 | −0.34 | 0.34 | Small | *** |
| 4 | Llama Hierarchical | Original | 0.5078 | 0.5484 | −5.31 | −5.33 | −2.752 | 0.007 | −0.28 | 0.28 | Small | ** |
| 5 | Llama Competitive | Original | 0.5012 | 0.5413 | −6.54 | −6.56 | −4.058 | <0.001 | −0.41 | 0.41 | Small | *** |
| 6 | Llama-Opt Hierarchical | Optimized | 0.4989 | 0.5382 | −6.97 | −7.09 | −4.177 | <0.001 | −0.42 | 0.42 | Small | *** |
| 7 | Mistral-Opt Competitive | Optimized | 0.5279 | 0.5723 | −8.32 | −8.08 | −6.435 | <0.001 | −0.64 | 0.64 | Medium | *** |
| 8 | Llama-Opt Competitive | Optimized | 0.4907 | 0.5305 | −8.50 | −8.42 | −5.678 | <0.001 | −0.57 | 0.57 | Medium | *** |
| 9 | Llama-Opt Sequential | Optimized | 0.4905 | 0.5302 | −8.54 | −8.48 | −5.677 | <0.001 | −0.57 | 0.57 | Medium | *** |
| MODERATE DEGRADATION (−20% < Δ ≤ −10%) | ||||||||||||
| 10 | Granite-SCR Competitive | SCR | 0.4823 | 0.5210 | −14.21 | −14.41 | −9.853 | <0.001 | −0.99 | 0.99 | Large | *** |
| 11 | Granite-Opt Competitive | Optimized | 0.4789 | 0.5180 | −14.82 | −14.90 | −10.955 | <0.001 | −1.10 | 1.10 | Large | *** |
| 12 | Granite-SCR Hierarchical | SCR | 0.4715 | 0.5102 | −16.13 | −16.18 | −12.358 | <0.001 | −1.24 | 1.24 | V.Large | *** |
| 13 | Granite-Opt Sequential | Optimized | 0.4703 | 0.5084 | −16.35 | −16.48 | −12.041 | <0.001 | −1.20 | 1.20 | V.Large | *** |
| 14 | Granite-Opt Hierarchical | Optimized | 0.4637 | 0.5020 | −17.52 | −17.53 | −14.087 | <0.001 | −1.41 | 1.41 | V.Large | *** |
| SEVERE DEGRADATION (−30% < Δ ≤ −20%) | ||||||||||||
| 15 | Mistral-Opt Collaborative | Optimized | 0.4567 | 0.4961 | −20.68 | −20.32 | −21.500 | <0.001 | −2.15 | 2.15 | V.Large | *** |
| 16 | Llama-Opt Collaborative | Optimized | 0.4236 | 0.4611 | −21.01 | −20.40 | −25.945 | <0.001 | −2.59 | 2.59 | V.Large | *** |
| 17 | Granite-Opt Collaborative | Optimized | 0.4421 | 0.4810 | −21.36 | −20.98 | −25.241 | <0.001 | −2.52 | 2.52 | V.Large | *** |
| 18 | Granite-SCR Sequential | SCR | 0.4335 | 0.4658 | −22.89 | −23.48 | −14.282 | <0.001 | −1.43 | 1.43 | V.Large | *** |
| 19 | Mistral Competitive | Original | 0.4318 | 0.4663 | −25.01 | −25.10 | −27.854 | <0.001 | −2.79 | 2.79 | V.Large | *** |
| 20 | Granite Competitive | Original | 0.4198 | 0.4534 | −25.33 | −25.51 | −29.753 | <0.001 | −2.98 | 2.98 | V.Large | *** |
| 21 | Mistral Hierarchical | Original | 0.4187 | 0.4522 | −27.28 | −27.37 | −24.877 | <0.001 | −2.49 | 2.49 | V.Large | *** |
| 22 | Llama Collaborative | Original | 0.3849 | 0.4157 | −28.23 | −28.24 | −38.486 | <0.001 | −3.85 | 3.85 | V.Large | *** |
| 23 | Granite Hierarchical | Original | 0.4015 | 0.4336 | −28.58 | −28.77 | −34.973 | <0.001 | −3.50 | 3.50 | V.Large | *** |
| 24 | Mistral Sequential | Original | 0.4098 | 0.4426 | −28.83 | −28.91 | −23.972 | <0.001 | −2.40 | 2.40 | V.Large | *** |
| 25 | Granite Sequential | Original | 0.3977 | 0.4295 | −29.26 | −29.44 | −35.898 | <0.001 | −3.59 | 3.59 | V.Large | *** |
| EXTREME DEGRADATION (Δ ≤ −30%) | ||||||||||||
| 26 | Granite-SCR Collaborative | SCR | 0.3852 | 0.4195 | −31.48 | −31.08 | −46.857 | <0.001 | −4.69 | 4.69 | V.Large | *** |
| 27 | Mistral Collaborative | Original | 0.3736 | 0.4035 | −35.12 | −35.19 | −49.716 | <0.001 | −4.97 | 4.97 | V.Large | *** |
| 28 | Granite Collaborative | Original | 0.3637 | 0.3928 | −35.31 | −35.47 | −60.914 | <0.001 | −6.09 | 6.09 | V.Large | *** |
| Model | Strategy | Processing Time (s) Mean (SD) | Total Tokens Mean (SD) |
|---|---|---|---|
| Granite 3.2 8B | Collaborative | 2211.0 (622.9) | 3481 (864) |
| | Sequential | 2257.2 (747.8) | 2251 (650) |
| | Competitive | 1998.3 (587.5) | 2179 (636) |
| | Hierarchical | 2247.2 (611.7) | 2351 (629) |
| Mistral 7B | Collaborative | 1584.7 (525.1) | 3076 (924) |
| | Sequential | 1494.6 (501.1) | 1920 (659) |
| | Competitive | 1798.6 (550.4) | 2043 (644) |
| | Hierarchical | 1868.3 (573.5) | 1958 (613) |
| Llama 3.1 8B | Collaborative | 56.1 (26.3) | 1410 (707) |
| | Sequential | 57.3 (27.0) | 804 (481) |
| | Competitive | 52.6 (27.7) | 813 (494) |
| | Hierarchical | 59.8 (31.0) | 842 (568) |
| Analysis | Metric | Value |
|---|---|---|
| Correlation | Mean r(α) | +0.9993 |
| | Mean r(β) | −0.034 |
| | Configs with significant α (p < 0.001) | 31/31 |
| | Configs with significant β (p < 0.05) | 0/31 |
| Variance Decomposition | Variance explained by α | 99.87% |
| | Variance explained by β | 0.13% |
| Regression | Mean b1 (α coefficient) | +0.396 |
| | Mean b2 (β coefficient) | −0.022 |
| | R2 (all configurations) | 1.000 |
| Effect Magnitude | Mean T-CPS range | 0.082 |
| | Mean T-CPS % change | 17.1% |
| Ranking Stability | Configurations with zero rank change | 29/31 (93.5%) |
| | Maximum rank change observed | 1 position |
| | Top 2 positions stable | Yes (all 25 combinations) |