Article

Multi-Agent Coordination Strategies vs. Retrieval-Augmented Generation in LLMs: A Comparative Evaluation

1 Intelligent Systems Department, Institute of Information and Communication Technologies, Bulgarian Academy of Sciences, 1113 Sofia, Bulgaria
2 Faculty of Informatics and Mathematics, Trakia University, 6000 Stara Zagora, Bulgaria
3 Bulgarian Academy of Sciences, 1040 Sofia, Bulgaria
* Author to whom correspondence should be addressed.
Electronics 2025, 14(24), 4883; https://doi.org/10.3390/electronics14244883
Submission received: 15 November 2025 / Revised: 7 December 2025 / Accepted: 8 December 2025 / Published: 11 December 2025

Abstract

This paper evaluates multi-agent coordination strategies against single-agent retrieval-augmented generation (RAG) for open-source language models. Four coordination strategies (collaborative, sequential, competitive, hierarchical) were tested across Mistral 7B, Llama 3.1 8B, and Granite 3.2 8B using 100 domain-specific question–answer pairs (3100 total evaluations). Performance was assessed using Composite Performance Score (CPS) and Threshold-aware CPS (T-CPS), aggregating nine metrics spanning lexical, semantic, and linguistic dimensions. Under the tested conditions, all 28 multi-agent configurations showed degradation relative to single-agent baselines, ranging from −4.4% to −35.3%. Coordination overhead was identified as a primary contributing factor. Llama 3.1 8B tolerated Sequential and Hierarchical coordination with minimal degradation (−4.9% to −5.3%). Mistral 7B with shared context retrieval achieved comparable results. Granite 3.2 8B showed degradation of 14–35% across all strategies. Collaborative coordination exhibited the largest degradation across all models. Study limitations include evaluation on a single domain (agriculture), use of 7–8B parameter models, and homogeneous agent architectures. These findings suggest that single-agent RAG may be preferable for factual question-answering tasks in local deployment scenarios with computational constraints. Future research should explore larger models, heterogeneous agent teams, role-specific prompting, and advanced consensus mechanisms.

1. Introduction

Retrieval-Augmented Generation (RAG) enhances large language model capabilities by integrating external knowledge retrieval into the generation process [1]. RAG systems have been explored across various application contexts [2,3,4,5], with ongoing challenges in threshold configuration and architectural design [6].
Multi-agent coordination has been proposed as a mechanism to improve LLM reasoning quality through distributed processing and consensus mechanisms [7,8,9]. Multi-agent coordination strategies refer to the organizational frameworks and interaction protocols that govern how multiple agents work together, communicate and adjust their actions within these systems. Four primary coordination architectures—collaborative (peer-to-peer deliberation), sequential (pipeline refinement), competitive (selection-based) and hierarchical (manager-worker)—represent distinct approaches to distributed processing [10].
However, the effectiveness of multi-agent coordination strategies when applied to RAG systems remains largely unexamined. Previous studies demonstrating the benefits of multi-agent systems have typically used resource-intensive configurations. Examples include iterative, multi-round debates [11], adversarial role assignment with judge arbitration [8], elaborate role specialization with structured verification [7] and large commercial models exceeding 70 billion parameters [12]. Whether simpler coordination strategies can provide practical benefits using accessible, open-source models remains an open question with direct implications for deployments with resource constraints.
Existing multi-agent evaluations typically compare coordination approaches against simple single-agent prompting baselines rather than against RAG systems that already incorporate external knowledge retrieval. Whether coordination mechanisms provide sufficient performance improvements to justify their computational overhead when added to functioning RAG systems has not been systematically investigated.

1.1. Research Objectives

Multi-agent systems for large language models have generated considerable research attention. Claims of improved reasoning through agent collaboration have been reported [7,8,13].
This study addresses whether adding multi-agent coordination to configured RAG systems enhances or degrades performance compared to single-agent baselines.
Four specific research objectives were formulated.
  • Comparative performance assessment. This study evaluates performance of four coordination strategies (collaborative, sequential, competitive, hierarchical) across three open-source models (Mistral 7B, Llama 3.1 8B, Granite 3.2 8B). The objective is to determine whether multi-agent configurations outperform, match, or underperform calibrated single-agent RAG baselines.
  • Degradation source identification. The study aims to isolate the relative contributions of coordination overhead versus retrieval fragmentation to performance changes. Independent retrieval and shared context retrieval configurations are compared to decompose these effects quantitatively.
  • Model-strategy interaction analysis. This objective investigates whether coordination effectiveness depends on model architecture. Differential responses to identical coordination protocols across model families are characterized.
  • Consistency-performance trade-offs. The study examines whether multi-agent coordination affects output variability alongside mean performance. The Threshold-aware Composite Performance Score (T-CPS) is employed to evaluate stability-performance trade-offs simultaneously.
Evidence-based deployment guidance is provided to practitioners. Three questions are addressed: (1) Under what conditions, if any, is multi-agent coordination justified? (2) Under what conditions do single-agent RAG baselines demonstrate superior performance? (3) What performance patterns emerge across coordination strategies? The evaluation uses three open-source models with 7–8B parameters across four coordination strategies. These are tested on 100 factual question–answer pairs from a domain-specific knowledge base, the FAO Climate-Smart Agriculture Sourcebook.

1.2. Approach

Four multi-agent coordination strategies (collaborative, sequential, competitive, hierarchical) were evaluated across three 7–8B open-source models (Mistral 7B [14], Llama 3.1 8B [15], Granite 3.2 8B [16,17]) using 100 domain-specific question–answer pairs from the Climate-Smart Agriculture Sourcebook [18]. All agents within each configuration use the same base model to isolate coordination strategy effects. Performance was assessed using Composite Performance Score (CPS) [6] and Threshold-aware CPS (T-CPS) within the PaSSER evaluation framework [19], comparing all multi-agent configurations against tuned single-agent RAG baselines. Experimental methodology and performance metrics are detailed in Section 3. This evaluation focuses on a specific deployment scenario: factual question-answering using 7–8B parameter open-source models with a single-domain knowledge base. The scope is intentionally constrained to isolate coordination effects under controlled conditions. Findings apply to the tested configurations and should not be generalized to multi-step reasoning tasks, larger model scales, heterogeneous agent architectures, or alternative coordination implementations.
The remainder of this paper is structured as follows. Section 2 reviews related work in retrieval-augmented generation and multi-agent language model systems. Section 3 describes the experimental methodology, including PaSSER framework extensions, coordination strategies, evaluation corpus, and performance metrics. Section 4 presents results across performance, stability, and efficiency dimensions. Section 5 discusses coordination performance patterns and comparisons with prior literature. Section 6 concludes with key findings, practical deployment guidelines, limitations, and future research directions. Appendix A provides implementation details for reproducibility.

2. Related Work

Retrieval-Augmented Generation systems enhance language model capabilities by incorporating external knowledge retrieval into the generation process. The foundational RAG approach was introduced in 2020 [1], establishing a paradigm that has since been evaluated across diverse application contexts [3,5]. A comprehensive review traces the evolution of RAG systems from their roots in information retrieval to current modular architectures supporting dynamic reasoning and real-time knowledge integration [20]. Recent work has identified architectural challenges and failure modes in RAG deployments [2], leading to developments in adaptive retrieval control [4], complex reasoning support [21], knowledge graph integration [22], and domain-specific embedding configurations [23]. Adaptive retrieval strategies that dynamically adjust retrieval parameters based on query characteristics have also been explored [24]. Such frameworks feature heterogeneous weighted graph indices and adaptive planning that selects appropriate retrieval strategies based on query features, demonstrating improved compatibility with both small and large language models. These advances demonstrate that RAG systems can achieve substantial performance through architectural refinement and parameter tuning. However, existing research focuses primarily on single-agent RAG configurations. Whether multi-agent coordination mechanisms provide additional benefits beyond well-configured single-agent RAG remains an open question.
Multi-agent systems for large language models employ distributed processing to address complex tasks through agent collaboration. Comprehensive surveys characterize coordination mechanisms, organizational structures, and application domains [25,26,27], identifying four primary coordination architectures. Collaborative (peer-to-peer) strategies enable agents to work cooperatively toward shared objectives with specialized roles, with code generation frameworks demonstrating improvements through explicit communication protocols [7,28]. Competitive (debate-based) approaches leverage adversarial interaction between agents to enhance factual accuracy through iterative refinement and disagreement-driven error correction [8]. Hierarchical (manager-worker) architectures delegate subtasks from manager agents to specialized workers, excelling in complex workflows requiring strategic planning and resource allocation [29]. Sequential (pipeline) strategies implement multi-stage refinement where agents progressively improve outputs through iterative processing, effective for tasks requiring diverse perspectives [9,30]. Recent applications demonstrate multi-agent systems’ potential across diverse domains. Cognitive agents powered by LLMs have been integrated within the Scaled Agile Framework for software project management, demonstrating advanced task delegation and inter-agent communication capabilities [31]. Hierarchical multi-agent architectures for power grid anomaly detection have shown that lower-layer agents handling specialized monitoring tasks combined with upper-layer coordinators performing multimodal feature fusion and global decision-making can achieve high precision through distributed collaboration [32].
Recent studies highlight the importance of coordination structure selection based on task characteristics [25], with adaptive strategies outperforming fixed architectures in specific domains. However, most evaluations compare multi-agent approaches against simple single-agent prompting baselines rather than against retrieval-augmented generation systems that already incorporate external knowledge. This evaluation gap obscures whether coordination overhead is justified when added to functioning RAG systems.
The selection of embedding models significantly impacts RAG system performance. Systematic evaluation of embedding model similarity using Centred Kernel Alignment across five BEIR datasets reveals distinct performance patterns [33]. Models from the same family exhibit high embedding similarity, while cross-family similarity varies substantially. Top-k retrieval similarity shows high variance at low k values, complicating model selection for RAG applications. The Massive Text Embedding Benchmark (MTEB) provides standardized evaluation across 58 datasets spanning eight task categories [34]. However, MTEB primarily assesses single-query retrieval performance, not addressing multi-turn interactions or agent-based scenarios that emerge in multi-agent RAG deployments.
Evaluation methodologies for LLM performance continue to evolve beyond traditional benchmarks. Systematic performance evaluation through strategic decision-making tasks reveals significant variations across models under consistent evaluation protocols [35]. Customizable evaluation frameworks that accommodate domain-specific quality criteria address limitations of standardized benchmarks that fail to capture specialized application requirements [36].
Embedding models exhibit distinct characteristics that influence threshold behaviour. Distance metric properties vary across embedding spaces [37], semantic granularity affects optimal threshold selection [33,34], and different pre-training objectives produce distinct embedding space geometries [38,39]. These model-specific characteristics inform baseline RAG configuration in this work. This study addresses the identified evaluation gap through comparison of coordination strategies against calibrated RAG baselines, determining when multi-agent coordination enhances or degrades system performance.
Recent multi-agent LLM research has demonstrated performance improvements under specific conditions. Iterative debate, in which agents refine their responses over multiple rounds after observing the outputs of others, has been shown to improve mathematical reasoning and factuality [11]. Adversarial role assignment with judge arbitration enhances divergent thinking on counter-intuitive tasks [8]. High task completion rates on software engineering benchmarks have been achieved using standardized operating procedures with role specialization [7]. These studies share common characteristics: iterative refinement mechanisms, explicit role differentiation, and evaluation on reasoning-intensive tasks. By contrast, the present evaluation employs simpler coordination strategies without iterative debate or multi-round refinement, testing whether basic coordination mechanisms provide benefit when applied to factual question-answering with smaller models.
Multi-agent capabilities appear to scale with model size; GPT-4-turbo outperforms Llama-2-70B by more than a factor of three on coordination metrics [12]. A survey of 52 multi-agent systems found a heavy reliance on selection-based mechanisms (such as dictatorial or plurality voting) rather than consensus synthesis [40].
The present study evaluates whether the benefits of coordination extend to settings with resource constraints, such as smaller open-source models with 7–8 billion parameters, simpler coordination strategies that do not involve iterative refinement and tasks that involve factual retrieval rather than multi-step reasoning.

3. Methods

This section describes the experimental method, including the PaSSER framework extensions, multi-agent coordination strategies, evaluation corpus, and performance metrics.

3.1. Experimental Infrastructure

The PaSSER framework was originally designed for systematic evaluation of RAG threshold configurations [19]. The present study extends PaSSER to support multi-agent reasoning while maintaining backward compatibility with single-agent RAG evaluation. The extended PaSSER system comprises three components: (1) a Python 3.11.14 Flask API server for LLM inference and coordination; (2) a React frontend for configuration and monitoring; and (3) a ChromaDB vector database using sentence-transformers embeddings.
Three open-source models with comparable parameter counts were selected for evaluation. Mistral 7B (version 7b-instruct-v0.3) is a 7.3 billion parameter instruction-tuned model released under Apache 2.0 licence. Llama 3.1 8B (version 8b-instruct) is Meta’s 8 billion parameter instruction-tuned model with extended context support. Granite 3.2 8B (version 8b-instruct) is IBM’s 8 billion parameter enterprise-focused instruction-tuned model. Model selection prioritized four criteria: comparable parameter counts (7–8B), open-source availability, active maintenance, and instruction-tuning for question-answering. All models were obtained from the Hugging Face model repository.
Experiments were conducted on two hardware configurations: An Apple M1 with Llama 3.1 8B and an Intel Xeon server with Mistral 7B and Granite 3.2 8B. To eliminate network latency variability and external service dependencies, all models were deployed locally. Due to the heterogeneous nature of the hardware configuration, direct comparison of absolute timing metrics is not possible; therefore, the analysis focuses on within-model patterns and hardware-independent quality measures. Full hardware specifications are provided in Appendix A.1.
Figure 1 presents the complete PaSSER framework architecture. The three-layer design enables systematic evaluation of multi-agent coordination strategies through modular components for model deployment, retrieval-augmented generation, performance assessment, and blockchain-based result verification.

3.2. Multi-Agent Coordination Strategies and Experimental Design

The evaluation dataset consisted of 100 question–answer pairs derived from the “Climate Smart Agriculture Source Book” [18]. Each model-strategy combination was evaluated on the same question set.
Figure 2 illustrates the seven-step experimental workflow, from initial vector store creation to final result verification. This workflow was applied consistently to all model-strategy combinations.
Four coordination strategies were implemented with the following operational protocols:

3.2.1. Collaborative Strategy

Three agents processed each query in parallel. Each used identical retrieved context and system prompts. Each generated a response with a confidence score based on internal coherence. The final output aggregated all responses into a summary preserving individual perspectives.
Consensus Mechanism: No explicit selection or filtering occurred; all agent outputs contributed equally to the final response. Confidence scores were averaged to produce an aggregate consensus score. This approach assumes that multiple independent perspectives enhance coverage and reduce individual agent blind spots. System prompts for all coordination strategies were developed with AI assistance (GitHub Copilot with Claude Sonnet 3.5). This approach prioritizes reproducibility over manual prompt engineering. Systematically engineered prompts may yield different results; this limitation is acknowledged.
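The aggregation step described above can be sketched in Python. This is a minimal sketch, not the PaSSER implementation: agents are represented as hypothetical callables that take a query and context and return a dict with `text` and `confidence` keys.

```python
from concurrent.futures import ThreadPoolExecutor

def collaborative_answer(query, context, agents):
    """Run all agents in parallel on identical retrieved context, then
    aggregate: every response is kept (no selection or filtering) and
    confidence scores are averaged into a consensus score."""
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        results = list(pool.map(lambda agent: agent(query, context), agents))
    summary = "\n\n".join(
        f"Perspective {i + 1}: {r['text']}" for i, r in enumerate(results)
    )
    consensus = sum(r["confidence"] for r in results) / len(results)
    return {"answer": summary, "consensus_score": consensus}
```

Because no filtering occurs, the output length grows with the number of agents, which is one plausible source of the lexical-overlap degradation reported later for this strategy.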

3.2.2. Improved Collaborative Strategy: Two-Phase Consensus

This improved collaborative strategy was implemented to address the limitations of simple response aggregation. The two-phase collaborative consensus operates as follows: Phase 1 (Independent Analysis): All agents generate responses in parallel, each providing a confidence score reflecting its certainty in its response. Phase 2 (Collaborative Synthesis): The agent with the highest confidence score is designated as the lead synthesizer. This agent receives summaries of all responses (the first 300 characters of each response) and carries out a structured synthesis task.
The synthesis task involves integrating the strongest points from each analysis, resolving contradictions and generating a unified response. Implementation details are provided in Appendix A.7, with general prompt templates in Appendix A.5.
This differs from the original collaborative strategy (Section 3.2.1), where responses were concatenated without structured conflict resolution. The implementation is available in the project repository [38].
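The two phases can be sketched as follows (again with hypothetical agent callables; the prompt wording is illustrative, not the template from Appendix A.5):

```python
def two_phase_consensus(query, context, agents, summary_chars=300):
    """Two-phase collaborative consensus (sketch). Phase 1: all agents
    answer independently with confidence scores. Phase 2: the most
    confident agent synthesizes truncated summaries of all drafts."""
    drafts = [agent(query, context) for agent in agents]            # Phase 1
    lead = max(range(len(drafts)), key=lambda i: drafts[i]["confidence"])
    summaries = "\n".join(d["text"][:summary_chars] for d in drafts)
    synthesis_prompt = (
        "Integrate the strongest points from each analysis, resolve "
        f"contradictions, and give a unified answer to: {query}\n{summaries}"
    )
    return agents[lead](synthesis_prompt, context)["text"]          # Phase 2
```

Note that the lead agent is called a second time with a synthesis prompt, so this strategy costs one extra generation pass compared to simple concatenation.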
Each model-strategy combination was evaluated on all 100 test instances. The complete experimental design comprised 31 configurations (see Table 1), resulting in a total of 3100 evaluations: 2800 multi-agent evaluations and 300 baseline evaluations. Test execution followed randomized instance ordering within each configuration to mitigate temporal drift effects. For each test, the system recorded all nine component metrics for CPS calculation, response generation time (milliseconds), token counts (prompt tokens, completion tokens, total tokens), and coordination metrics for multi-agent configurations only (agent participation count, consensus rounds, disagreement rate, consensus score).

3.2.3. Competitive Strategy

Implementation: Three agents processed each query independently in parallel using an identical retrieved context. Each agent assigned a confidence score (0–100 scale) to its generated response based on semantic coherence, factual consistency with context, and answer completeness.
Selection Mechanism: The agent with the highest confidence score was designated as the winner. Only this agent’s response was returned as the final answer, with the winner’s confidence serving as the consensus score. Disagreement was quantified by calculating variance in confidence scores across agents (normalized to 0–1 range).
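The selection mechanism can be sketched as below. The exact disagreement normalization is not specified in the text, so the population variance of rescaled confidences is an assumption here; agents are hypothetical callables returning `text` and `confidence`.

```python
from statistics import pvariance

def competitive_answer(query, context, agents):
    """Each agent answers independently on identical context; the
    highest-confidence response wins and only it is returned.
    Disagreement is the variance of confidences rescaled to [0, 1]
    (one plausible normalization, assumed for this sketch)."""
    results = [agent(query, context) for agent in agents]
    winner = max(results, key=lambda r: r["confidence"])
    disagreement = pvariance([r["confidence"] / 100 for r in results])
    return {"answer": winner["text"],
            "consensus_score": winner["confidence"],
            "disagreement": disagreement}
```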

3.2.4. Hierarchical Strategy

Implementation: Two specialist agents processed the query first, generating independent preliminary analyses. A third agent acting as manager received truncated summaries (first 400 characters) of specialist outputs along with the original query and retrieved context. The manager agent synthesized specialist insights into a unified final response with an expanded context window (1.5× standard context size) to accommodate specialist inputs.
Coordination Protocol: Sequential two-phase processing: (1) parallel specialist analysis (2 agents), followed by (2) manager synthesis (1 agent). The manager could selectively integrate, reconcile, or prioritize specialist perspectives based on relevance and quality assessment.
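The two-phase manager-worker protocol can be sketched as follows; the manager signature (taking a context-scale argument) is a hypothetical way to model the expanded context window, not the actual API.

```python
def hierarchical_answer(query, context, specialists, manager,
                        summary_chars=400, context_scale=1.5):
    """Phase 1: parallel specialist analyses. Phase 2: manager synthesis
    over truncated (400-character) specialist summaries, with an
    expanded context budget (1.5x) to accommodate the extra inputs."""
    analyses = [s(query, context) for s in specialists]             # Phase 1
    briefing = "\n".join(
        f"Specialist {i + 1}: {a['text'][:summary_chars]}"
        for i, a in enumerate(analyses)
    )
    return manager(query, context, briefing, context_scale)["text"]  # Phase 2
```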

3.2.5. Sequential Strategy

Implementation: Agents processed queries in sequence (Agent 1 → Agent 2 → Agent 3). Each agent received the original query, retrieved context, and the previous agent’s response. This enabled progressive refinement.
Refinement Protocol: Agent 1 generated an initial response. Agent 2 reviewed and refined it based on identified gaps or errors. Agent 3 performed final refinement to produce the system response.
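The pipeline reduces to a simple fold over the agent list (a sketch with hypothetical agent callables; the first agent receives no predecessor draft):

```python
def sequential_answer(query, context, agents):
    """Pipeline refinement: each agent sees the query, the retrieved
    context, and its predecessor's draft (None for Agent 1); the final
    agent's output is the system response."""
    draft = None
    for agent in agents:
        draft = agent(query, context, draft)["text"]
    return draft
```

Unlike the parallel strategies, total latency here is the sum of the three generation times, since each agent must wait for its predecessor.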

3.2.6. Retrieval Context Configuration

Two retrieval context delivery modes were evaluated to isolate coordination effects from retrieval fragmentation:
- Independent Retrieval: Each agent independently queries the vector database and retrieves its own document subset based on similarity ranking. This configuration allows agents to access potentially different retrieved passages, introducing retrieval diversity but also potential inconsistency in available context. Independent retrieval was used for all Original multi-agent configurations.
- Shared Context Retrieval: All agents receive identical retrieved document sets extracted through a single vector database query. This configuration ensures uniform input context across all agents, eliminating retrieval fragmentation as a confounding variable and isolating pure coordination effects. Shared context retrieval was applied to all Optimized configurations (all three models) and to Granite-SCR configurations.
Both configurations maintain identical similarity thresholds (0.95), chunk sizes, embedding models (sentence-transformers with Mistral), and k-value settings (k = 5). The comparison enables assessment of whether multi-agent performance degradation stems from coordination overhead or from inconsistent retrieval across agents. All agents within each configuration use the same base model. This design choice isolates the effects of the coordination strategy from confounding factors introduced by model heterogeneity. Heterogeneous agent architectures, in which agents with different capabilities or specializations collaborate, are a promising area for future research.
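The distinction between the two modes can be sketched as follows. The `collection.query` interface is hypothetical (it is not the actual ChromaDB API), and with a deterministic store both modes return the same passages; in practice, independent retrieval can diverge when agents phrase or embed their queries differently.

```python
def retrieve(collection, query, k=5, threshold=0.95):
    """Hypothetical vector-store lookup: top-k hits filtered by a
    similarity threshold (0.95 in the study's configuration)."""
    return [doc for doc, sim in collection.query(query, k) if sim >= threshold]

def build_contexts(collection, query, n_agents, shared=True):
    """Shared context: one query, the identical passage list handed to
    every agent. Independent: each agent issues its own query and may
    end up with a different document subset (retrieval fragmentation)."""
    if shared:
        ctx = retrieve(collection, query)
        return [ctx] * n_agents
    return [retrieve(collection, query) for _ in range(n_agents)]
```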

3.2.7. Experimental Configuration Summary

The experimental design comprises 31 configurations organized into four groups. The experimental sequence was designed to isolate different sources of performance variation through progressive investigation.
Phase 1: Baseline Establishment. Single-agent RAG baselines (3 configurations) establish reference performance for each model using tuned similarity thresholds: 0.95 for Mistral 7B and Granite 3.2 8B, and 0.90 for Llama 3.1 8B. These thresholds were determined through evaluation on the full 369-question dataset (Section 3.4). The baselines represent the performance target against which multi-agent configurations are compared.
Phase 2: Original Multi-Agent Evaluation. Original multi-agent configurations (12 configurations: 3 models × 4 strategies) evaluate standard coordination using independent retrieval, where each agent queries the vector database separately. Simple consensus mechanisms are applied: response concatenation for Collaborative, highest-confidence selection for Competitive, manager synthesis for Hierarchical, and final-agent output for Sequential. This phase addresses the primary research question: does multi-agent coordination improve RAG performance?
Phase 3: Retrieval Fragmentation Isolation. Granite-SCR configurations (4 configurations) were introduced to isolate the effect of retrieval fragmentation from coordination overhead. These configurations apply shared context retrieval to Granite 3.2 8B while retaining the simple consensus mechanisms from the Original group. Comparison between Granite-Original and Granite-SCR enables quantification of performance differences attributable to agents receiving different retrieved documents.
Phase 4: Implementation Enhancement. Optimized multi-agent configurations (12 configurations) incorporate two enhancements. All strategies receive shared context retrieval, where agents receive identical retrieved documents from a single query. The Collaborative strategy additionally employs Two-Phase Consensus (Section 3.2.2). Sequential, Competitive, and Hierarchical strategies use their standard mechanisms with shared context only.
Two-Phase Consensus was applied exclusively to the Collaborative strategy for two reasons. First, preliminary analysis from Phase 2 indicated that Collaborative warranted targeted intervention. Second, the remaining strategies incorporate inherent selection or structured processing mechanisms (winner selection for Competitive, manager synthesis for Hierarchical, sequential refinement for Sequential) that address response integration differently. Collaborative’s simple response concatenation represented a distinct mechanism that could potentially benefit from structured synthesis.
Table 1 summarizes the experimental design. Configurations labelled “Optimized” incorporate shared context retrieval (all strategies) and Two-Phase Consensus (Collaborative only). This terminology denotes enhanced implementation rather than mathematically derived optimal values. “Standard” denotes each strategy’s default coordination mechanism without enhancements. The experimental sequence proceeded chronologically in the order listed (Baselines → Original → Granite-SCR → Optimized). All configurations were evaluated using 100 question–answer pairs. Baseline similarity thresholds were determined through a separate 369-question evaluation (Section 3.4).

3.3. Performance Evaluation Framework

3.3.1. The Multi-Criteria Evaluation Problem

Evaluating the performance of a language model presents a multi-criteria decision problem. No single metric can adequately capture response quality. For example, lexical overlap measures (BLEU and ROUGE) may reward surface-level similarity while missing semantic equivalence, and semantic similarity scores may overlook factual errors. Fluency metrics, meanwhile, may favour verbose but uninformative outputs. Practitioners selecting among model configurations must compare alternatives that may excel in different areas.
This challenge is further intensified when comparing multi-agent coordination strategies against single-agent baselines. For example, a coordination strategy might improve semantic coherence while degrading lexical precision, or enhance fluency while introducing factual inconsistencies. Without a principled aggregation method, such trade-offs cannot be evaluated systematically.
Multi-Criteria Decision Making (MCDM) provides established methodologies for addressing such problems [41]. Within the MCDM framework, model configurations constitute the alternatives and evaluation metrics constitute the criteria, with the aggregation method providing a decision rule for ranking the alternatives.

3.3.2. Aggregation Method Selection

Several MCDM aggregation methods exist, including Simple Additive Weighting (SAW) [42], TOPSIS [41], AHP [43], and outranking methods such as ELECTRE and PROMETHEE. The selection of a method depends on the decision context and the available information.
SAW was selected for this evaluation based on three considerations. Firstly, the data requirements: AHP requires pairwise expert comparisons across all criteria, TOPSIS requires the specification of ideal and anti-ideal reference points, and outranking methods require concordance and discordance thresholds. SAW only requires normalized scores and criterion weights, which are both directly available from the evaluation metrics. Secondly, in terms of interpretability, SAW produces a weighted average that practitioners can readily interpret and decompose. Thirdly, in terms of appropriateness, when criteria are compensatory and commensurable, SAW is theoretically appropriate. Both conditions are met in the present evaluation context.

3.3.3. Component Metrics

Nine evaluation metrics were selected to capture complementary quality dimensions. The selection criterion required that the metrics assess distinct aspects of response quality, rather than providing redundant measurements.
METEOR measures text quality by aligning words between generated and reference texts, accounting for exact matches, stems, and synonyms. Higher scores indicate better alignment.
ROUGE measures n-gram overlap between generated and reference texts. Two variants were used: ROUGE-2 (bigram overlap) and ROUGE-L (longest common subsequence), both computing F1-scores combining precision and recall. Implementation used the rouge library’s get_scores function.
BERTScore evaluates contextual alignment using transformer-based embeddings, computing token-level cosine similarities aggregated into F1 scores. Implementation used the bert-score library.
B-RT (BERT-based Reference-free Text Evaluation) provides reference-free assessment across coherence, consistency, fluency, and relevance using BERT representations. Two variants were used: B-RT.average (overall quality) and B-RT.fluency (grammatical naturalness).
F1 Score combines precision and recall as their harmonic mean: F1 = 2 × Precision × Recall / (Precision + Recall). Implementation used scikit-learn for token-level agreement measurement.
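A minimal pure-Python version of this token-level F1 (a stand-in sketch, not the scikit-learn-based implementation used in the study):

```python
from collections import Counter

def token_f1(prediction, reference):
    """Token-level F1: precision and recall over the multiset overlap of
    whitespace tokens, combined as their harmonic mean."""
    pred, ref = prediction.split(), reference.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```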
Perplexity quantifies model uncertainty in predicting text sequences, with lower values indicating better predictive performance. Two smoothing variants were used: Laplace (add-one, α = 1.0) and Lidstone (parameterized, α = 0.5) to handle zero-probability n-grams. Implementation used the nltk.lm module.
All metrics were computed using the PaSSER framework and aggregated across 100 test instances per configuration to provide mean and standard deviation values for statistical analysis. Detailed implementation is available in the GitHub repository [44] (backEnd.py and maBackEnd.py scripts).
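To make the metric definitions concrete, the following minimal pure-Python sketch implements two of the simpler component metrics: token-level F1 (harmonic mean of precision and recall, as defined above) and a bigram-overlap ROUGE-2 analogue. This is an illustrative approximation only; the study itself used the scikit-learn and rouge libraries rather than this code.

```python
from collections import Counter

def token_f1(candidate: str, reference: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over token overlap."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def rouge2_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-2: F1 over bigram overlap (the study used the rouge
    library's get_scores; this re-implements only the bigram F1 idea)."""
    def bigrams(tokens):
        return Counter(zip(tokens, tokens[1:]))
    cand, ref = candidate.lower().split(), reference.lower().split()
    cb, rb = bigrams(cand), bigrams(ref)
    overlap = sum((cb & rb).values())
    if overlap == 0:
        return 0.0
    precision = overlap / max(sum(cb.values()), 1)
    recall = overlap / max(sum(rb.values()), 1)
    return 2 * precision * recall / (precision + recall)
```

Both functions return values in [0, 1], matching the score ranges assumed by the normalization step in Section 3.3.4.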

3.3.4. Composite Performance Score (CPS)

CPS aggregates the nine component metrics into a single score using SAW.
The CPS formulation addresses three fundamental challenges in multi-metric evaluation: (1) heterogeneous metric scales requiring normalization, (2) opposing directionality where some metrics improve with higher values while others improve with lower values, and (3) differential importance of evaluation dimensions necessitating weighted aggregation.
For each query q evaluated under model m with coordination strategy s, the CPS is computed as:
$$CPS_q^{m,s} = \sum_{i=1}^{n} w_i \left[ d_i \, \frac{m_{i,q} - \min_i}{\max_i - \min_i} + \frac{1 - d_i}{2} \right]$$
where:
- $d_i \in \{-1, +1\}$ indicates the polarity of metric $i$: $d_i = +1$ if higher values indicate better performance, $d_i = -1$ if lower values indicate better performance;
- $w_i$ is the assigned weight for metric $i$, with $\sum_{i=1}^{n} w_i = 1$.
The normalization procedure ensures all metrics are scaled to the [0, 1] range while preserving their performance directionality. For metrics where higher values indicate better performance ($d_i = +1$), the standard min-max normalization is applied:
$$\text{Normalized value} = \frac{m_i - \min_i}{\max_i - \min_i}$$
For metrics where lower values indicate better performance ($d_i = -1$), specifically the perplexity metrics in this study, the normalization is inverted so that superior performance (lower perplexity) maps to higher normalized scores:
$$\text{Normalized value} = \frac{\max_i - m_i}{\max_i - \min_i}$$
The aggregated CPS for model m with strategy s across Q queries is:
$$\mu^{m,s} = \frac{1}{Q} \sum_{q=1}^{Q} CPS_q^{m,s}$$
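The CPS computation defined above can be sketched as follows. This is an illustrative implementation, not the PaSSER framework code; it assumes the per-metric minima and maxima are supplied by the caller (e.g., observed across all configurations).

```python
def cps(scores, weights, polarity, mins, maxs):
    """Composite Performance Score via SAW with polarity-aware min-max
    normalization. scores[i] is the raw value of metric i for one query;
    polarity[i] is +1 (higher is better) or -1 (lower is better)."""
    total = 0.0
    for m, w, d, lo, hi in zip(scores, weights, polarity, mins, maxs):
        norm = (m - lo) / (hi - lo) if hi > lo else 0.0
        if d == -1:
            # Invert so lower raw values (e.g., perplexity) map to higher
            # normalized scores: 1 - norm == (max - m) / (max - min).
            norm = 1.0 - norm
        total += w * norm
    return total
```

For example, a perplexity metric at its observed minimum contributes its full weight, while a similarity metric at its observed maximum does the same, so a configuration that is best on every metric attains CPS = 1 when the weights sum to 1.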
Weight specification. In the MCDM framework, weights represent decision-maker priorities for the specific application context. For retrieval-augmented generation, two considerations guided weight assignment.
The weights reflect a four-dimensional evaluation framework [6] that prioritizes content accuracy, which is paramount in RAG applications where users require factually correct information. The dimensions are: (1) content accuracy (50%) through F1, METEOR, and BLEU; (2) semantic relevance (20%) through Cosine Similarity and Pearson Correlation; (3) lexical overlap (15%) through ROUGE-1 and ROUGE-L; and (4) linguistic quality (15%) through Laplace and Lidstone Perplexity. This structure ensures that configurations cannot achieve high scores solely through secondary dimensions.
Table 2 presents the complete weight distribution.
Supplementary analyses confirm that the nine metrics capture complementary quality dimensions (mean pairwise correlation r = 0.452) and that configuration rankings remain stable under weight perturbations (Spearman ρ > 0.95 across all tests). Complete validation results are provided in Supplementary Materials (Tables S1–S3).
The selection of evaluation metrics for LLM systems requires careful consideration of task-specific requirements and model capabilities. Recent evaluations demonstrate that even state-of-the-art LLMs achieve only 60% accuracy on complex reasoning tasks, emphasizing the importance of comprehensive evaluation frameworks that capture both performance and reasoning reliability [45].

3.3.5. Threshold-Aware Composite Performance Score (T-CPS)

Mean performance alone is insufficient for deployment decisions. A configuration that performs well on some queries but poorly on others may be unsuitable for production systems where consistent behaviour is required. This principle is established in manufacturing quality control, where process capability indices evaluate both mean accuracy and variance [46], and in portfolio theory, where risk-adjusted returns account for volatility [47].
The consistency requirement is particularly relevant for multi-agent systems, which introduce variability sources beyond single-agent sampling randomness: voting ambiguities in collaborative strategies; error propagation in sequential refinement; selection inconsistency in competitive evaluation; and synthesis variability in hierarchical integration.
T-CPS integrates mean performance with consistency through a reward-penalty structure, penalizing high variability configurations.
The T-CPS for model m with strategy s is computed as:
$$T\text{-}CPS^{m,s} = \mu^{m,s} \left( 1 + \alpha \left( 1 - CV^{m,s} \right) \right) - \beta \left( CV^{m,s} \right)^2$$
where:
- $\mu^{m,s}$ is the mean CPS for model $m$ with strategy $s$;
- $CV^{m,s} = \sigma^{m,s} / \mu^{m,s}$ is the coefficient of variation, with $\sigma^{m,s}$ denoting the standard deviation of CPS across evaluation instances;
- $\alpha$ defines the reward coefficient for stable configurations;
- $\beta$ defines the penalty coefficient for high variability.
The coefficient of variation normalizes variability assessment by expressing standard deviation as a proportion of the mean, enabling fair comparison across configurations with different baseline performance levels. Lower CV values indicate more consistent behaviour across queries and evaluation runs.
The reward term $\mu^{m,s} \left( 1 + \alpha \left( 1 - CV^{m,s} \right) \right)$ increases scores for stable configurations. When $CV^{m,s} = 0$ (perfect consistency), the configuration receives the maximum reward of $\mu^{m,s} (1 + \alpha)$. As variability increases, the reward diminishes, reaching $\mu^{m,s}$ when $CV^{m,s} = 1$ (standard deviation equals mean).
The penalty term $\beta \left( CV^{m,s} \right)^2$ applies a quadratic penalty for high variability, ensuring that unstable configurations are appropriately downweighted. The quadratic form provides progressive penalization: configurations with moderate variability receive modest penalties, while highly unstable configurations (large CV) are substantially penalized.
Parameter values α = 0.1 and β = 0.05 follow standard practice in machine learning [48,49]. These values ensure that both average quality and stability contribute to the evaluation without over-penalizing the normal variability of language model outputs [50].
With α = 0.1, a perfectly consistent system (CV = 0) receives a 10% reward, and the reward shrinks as variability grows. With β = 0.05, the quadratic CV² term imposes only small penalties on low-variability systems but substantially larger ones on highly variable systems. Together, these parameters favour systems that combine good average performance with reasonable consistency, matching the requirements of deployed language models [51].
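As a concrete sketch, the T-CPS formula with the default α = 0.1 and β = 0.05 can be computed from per-query CPS values as follows. The use of the population standard deviation here is an assumption for illustration; the paper does not specify which estimator was used.

```python
import statistics

def t_cps(cps_values, alpha=0.1, beta=0.05):
    """Threshold-aware CPS: reward low coefficient of variation,
    quadratically penalize high variability."""
    mu = statistics.mean(cps_values)
    sigma = statistics.pstdev(cps_values)  # population SD (assumed estimator)
    cv = sigma / mu if mu else 0.0
    return mu * (1 + alpha * (1 - cv)) - beta * cv ** 2
```

With perfectly consistent scores (CV = 0) the result is simply μ(1 + α), i.e., a 10% premium over the mean CPS at the default α.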
A sensitivity analysis was conducted across 25 parameter combinations (α = {0.05, 0.10, 0.15, 0.20, 0.25} and β = {0.025, 0.05, 0.075, 0.10, 0.15}) to verify ranking stability. The parameter values were selected to cover practically relevant ranges, while avoiding extremes that would distort the metric’s intended purpose. For example, excessive α would cause consistency to dominate base quality, while excessive β would penalize normal statistical variation.
The results of the sensitivity analysis are presented in Section 4.7. CPS and T-CPS address different evaluation questions. CPS answers: which configuration produces better outputs on average? T-CPS answers: which configuration provides reliable performance suitable for deployment? Both metrics are reported to enable assessment from research and deployment perspectives.
This formulation is adapted from our previous work [6] on threshold selection in RAG systems. In the present multi-agent context, T-CPS serves to identify coordination strategies that achieve strong mean performance while maintaining acceptable output stability—a dual objective critical for practical system deployment.
The T-CPS framework is particularly valuable in multi-agent evaluation because coordination mechanisms inherently introduce additional variability sources beyond single-agent sampling randomness. By simultaneously evaluating performance and stability, T-CPS identifies configurations that not only achieve high average quality but also deliver predictable, reliable outputs—characteristics essential for user-facing applications where inconsistent behaviour undermines trust and usability.

3.4. Baseline Configuration and Statistical Analysis

Single-agent RAG baseline configurations were established through systematic threshold evaluation for each model. The selection procedure employed the same evaluation framework—PaSSER with a 369-question Climate-Smart Agriculture dataset and comprehensive metric assessment.
Each model was evaluated across similarity thresholds ranging from 0.50 to 0.95 in increments of 0.05. For each threshold configuration, both CPS and T-CPS were computed across all 369 questions. Statistical significance was assessed through paired t-tests comparing each threshold configuration against an untuned retrieval baseline without explicit threshold filtering (α = 0.05). Performance improvements and consistency measures (coefficient of variation) were evaluated to identify threshold configurations that balance both mean performance gains and output stability.
Baseline thresholds were selected by identifying configurations that maximized T-CPS while maintaining statistically significant performance improvements. This approach ensures baselines represent not only strong mean performance but also reliable, consistent outputs across diverse queries—a critical requirement for practical RAG system deployment.
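The selection rule described above can be sketched as a small function: among candidate thresholds, keep only those whose paired t-test against the untuned baseline is significant, then pick the one with the highest T-CPS. The function and the example values below are illustrative (the T-CPS and p-values are hypothetical stand-ins, not figures from Table 3).

```python
def select_threshold(candidates, alpha=0.05):
    """Pick the threshold maximizing T-CPS among configurations whose
    paired t-test vs. the untuned baseline is significant (p < alpha).
    candidates: dict mapping threshold -> (t_cps, p_value)."""
    significant = {t: v[0] for t, v in candidates.items() if v[1] < alpha}
    if not significant:
        return None  # no threshold beats the untuned baseline significantly
    return max(significant, key=significant.get)

# Hypothetical example: a higher-T-CPS threshold is rejected because its
# improvement is not statistically significant.
example = {0.95: (0.5911, 0.001), 0.90: (0.5800, 0.010), 0.85: (0.6000, 0.200)}
```

Note that this rule can prefer a threshold with a lower point estimate over one whose apparent advantage is not statistically reliable, which is the intended behaviour.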
The evaluation identified threshold 0.95 for Mistral 7B (T-CPS: 0.5911, +5.37% improvement) and Granite 3.2 8B (T-CPS: 0.5622, +1.26% improvement), and threshold 0.90 for Llama 3.1 8B (T-CPS: 0.5495, +1.87% improvement). These threshold-calibrated configurations establish the single-agent RAG baselines against which multi-agent coordination strategies are compared.
Table 3 presents the baseline selection analysis, showing the performance characteristics of optimal thresholds alongside the baseline (no threshold) configuration and alternative competitive thresholds for comparison.
Figure 3 compares threshold configurations. Selected baselines (gold-highlighted) consistently achieve superior T-CPS performance: Mistral 7B demonstrates the largest improvement potential (+5.37% at threshold 0.95), while Granite 3.2 8B (+1.26% at threshold 0.95) and Llama 3.1 8B (+1.87% at threshold 0.90) show more modest but reliable gains. The close alignment between CPS and T-CPS bars indicates that selected thresholds balance mean performance without compromising output consistency, establishing rigorous reference points for multi-agent comparison.
The selected baseline configurations demonstrate statistically significant improvements over untuned retrieval while maintaining stable, consistent output quality. Mistral 7B achieves the largest improvement magnitude (+5.37% T-CPS) with strong balance between performance gains and stability (Balance Score: 40.11). Granite 3.2 8B exhibits the most stable performance profile (CV: 0.1240) with modest but reliable improvements (+1.26% T-CPS). Llama 3.1 8B demonstrates intermediate characteristics, balancing moderate improvement (+1.87% T-CPS) with acceptable consistency (CV: 0.1479).
These threshold-calibrated baselines establish rigorous comparison conditions for evaluating multi-agent coordination effectiveness, ensuring that any observed performance differences reflect coordination mechanisms rather than suboptimal baseline configurations.
Performance comparisons employed percentage change metrics relative to RAG baselines:
$$CPS_{change}\,(\%) = \frac{CPS_{MA} - CPS_{RAG}}{CPS_{RAG}} \times 100$$
Positive values indicate multi-agent improvements. Negative values indicate degradation. Similar calculations applied to T-CPS. Statistical significance testing used paired t-tests, Cohen’s d effect sizes, and 95% confidence intervals. Configuration-level aggregation computed mean CPS, standard deviation, and coefficient of variation across all 100 test instances for each model-strategy combination. Strategy-level analysis averaged results across the three models to identify general coordination patterns independent of specific model characteristics.
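The comparison statistics can be sketched as follows. The percentage-change function mirrors the formula above; the Cohen’s d variant shown (mean of paired differences divided by the standard deviation of differences) is one common formulation for paired designs and is an assumption here, since the paper does not state which variant was used.

```python
import math

def pct_change(cps_ma, cps_rag):
    """Percentage change of a multi-agent configuration vs. its RAG baseline."""
    return (cps_ma - cps_rag) / cps_rag * 100.0

def cohens_d(paired_a, paired_b):
    """Cohen's d for paired samples: mean difference over the sample SD of
    the per-query differences (assumed formulation)."""
    diffs = [a - b for a, b in zip(paired_a, paired_b)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    sd = math.sqrt(sum((x - mean_d) ** 2 for x in diffs) / (n - 1))
    return mean_d / sd if sd else 0.0
```

In practice, the paired t-test itself would be computed with a statistics package (e.g., scipy.stats.ttest_rel) over the 100 per-query CPS pairs for each model-strategy combination.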

4. Results

4.1. Experimental Overview

The multi-agent evaluation used a 100-question subset drawn from the 369-question Climate-Smart Agriculture Sourcebook dataset employed for baseline threshold determination (Section 3.4). This subset enabled multi-agent testing while maintaining experimental tractability. The baseline configurations used similarity thresholds determined through the full 369-question evaluation: 0.95 for Mistral 7B and Granite 3.2 8B, and 0.90 for Llama 3.1 8B.
The experimental design encompassed 3100 test instances across 31 configurations organized into four groups (Section 3.2.7). Baselines comprised 3 single-agent RAG configurations. Original multi-agent configurations (12) used independent retrieval with simple consensus mechanisms. Granite-SCR configurations (4) isolated the effect of shared context retrieval using Granite 3.2 8B. Optimized configurations (12) incorporated shared context retrieval, with Two-Phase Consensus applied to the Collaborative strategy.
Performance was assessed using the CPS and T-CPS metrics (Section 3.3). Statistical validation employed paired t-tests to compare each multi-agent configuration with its corresponding baseline (α = 0.05), with effect sizes quantified using Cohen’s d. Results are organized to follow the experimental sequence: Phase 2 findings (Section 4.2, Section 4.3, Section 4.4 and Section 4.5), Phase 3 findings (Section 4.6), and Phase 4 findings (Section 4.6).

4.2. Overall Performance Comparison

Table 4 presents performance results ranked by degradation magnitude across all model-strategy combinations. The unified presentation includes both CPS and T-CPS scores, percentage changes from baseline, statistical significance assessments, and effect size measurements. Configurations were grouped by degradation severity to facilitate interpretation.
Figure 4 illustrates T-CPS performance across all configuration groups, organized by coordination strategy. The visualization reveals several patterns. Collaborative coordination produced degradation across all implementations, with Original configurations (red bars) showing larger gaps from baseline than Optimized configurations (light blue). Sequential and Hierarchical strategies demonstrated model-dependent responses, with some configurations approaching baseline performance. Optimized configurations narrowed the gap to baselines compared to Original configurations across most strategies, except for Llama 3.1 8B Sequential where the Original configuration achieved smaller degradation than the Optimized variant.
Regarding the primary research question (Section 3.2.7, Phase 2), all 28 multi-agent configurations exhibited statistically significant performance degradation relative to their respective baselines (p < 0.01). Effect sizes ranged from small (|d| = 0.28) to very large (|d| = 6.09). CPS scores ranged from 0.3637 to 0.5505, representing degradation from 4.39% (Mistral-Opt Hierarchical) to 35.31% (Granite Collaborative). T-CPS degradation followed similar patterns (4.40% to 35.47%). Both metrics showed parallel degradation, indicating that multi-agent approaches affected both mean quality and response consistency.
Systematic patterns emerged across degradation severity groupings. The Minimal Degradation group (ranks 1–9) contained configurations with small to medium effect sizes (|d| = 0.28–0.64), primarily Llama 3.1 8B and Mistral-Optimized configurations. The Moderate Degradation group (ranks 10–14) showed large effect sizes (|d| = 0.99–1.41), comprising Granite configurations with shared context retrieval. The Severe Degradation group (ranks 15–25) exhibited very large effect sizes (|d| = 1.43–3.85), including Original configurations for Mistral and Granite. The Extreme Degradation group (ranks 26–28) demonstrated the largest effect sizes (|d| = 4.69–6.09), exclusively comprising Collaborative configurations with simple aggregation.
The Collaborative strategy occupied ranks 15–17, 22, and 26–28, indicating consistent underperformance across all models and implementation approaches. This pattern informed the targeted intervention described in Section 3.2.7 (Phase 4).

4.3. Statistical Significance and Effect Size Analysis

Statistical analysis indicated that observed performance differences were unlikely to be attributable to random variation. Of the 28 multi-agent configurations evaluated, all showed statistically significant degradation at p < 0.01, with 25 reaching highly significant levels (p < 0.001). The three configurations with the smallest effect sizes—Mistral-Opt Hierarchical (rank 1), Llama Sequential (rank 2), and Mistral-Opt Sequential (rank 3)—demonstrated small effect sizes (|d| = 0.28–0.34) while still reaching statistical significance (p < 0.01).
Effect sizes, quantified through Cohen’s d and reported as |d| in Table 4, revealed the magnitude of practical significance beyond statistical thresholds. Given the exploratory nature of multi-agent coordination analysis and the large effect sizes observed, no adjustment for multiple comparisons was applied; significance was interpreted at α = 0.05 for each individual comparison.
Effect size groupings showed the following distribution: ranks 1–6 showed small effects (|d| < 0.5), ranks 7–9 showed medium effects (0.5 ≤ |d| < 0.8), ranks 10–11 showed large effects (0.8 ≤ |d| < 1.2), and ranks 12–28 showed very large effects (|d| ≥ 1.2). The concentration of very large effects in the lower ranks indicates that most multi-agent configurations produced substantial practical differences from baseline performance.

4.4. Computational Efficiency Analysis

Multi-agent configurations incurred computational overhead through multiple inference calls and coordination mechanisms. Token consumption provides a hardware-independent measure of computational overhead. Table 5 presents token consumption and processing time measurements across models and strategies.
The Collaborative strategy requires substantially higher token consumption (mean: 2656 tokens), representing a 58.2% overhead compared to other strategies (mean: 1658–1717 tokens). This increase is due to the additional synthesis step requiring a fourth inference call.
Due to hardware heterogeneity (Apple M1 for Llama 3.1 8B and Intel Xeon CPU for Mistral 7B and Granite 3.2 8B), processing time comparisons across models are not directly comparable. Within-model comparisons reveal a variation of 13–25% across strategies, with no single strategy being consistently the fastest across all models.
All configurations used three agents per query. The number of coordination rounds varied according to strategy: Collaborative and Competitive strategies required single-round parallel processing, Hierarchical strategy required two rounds (specialist analysis followed by manager synthesis), and Sequential strategy required three rounds (progressive refinement across agents). Multi-agent configurations require three to four inference calls per query, compared to a single call for baseline RAG. This establishes a computational overhead independent of strategy-specific factors.

4.5. CPS and T-CPS Relationship Analysis

The relationship between CPS and T-CPS revealed insights about the interplay between mean performance and output consistency. For baseline configurations, T-CPS scores consistently exceeded CPS scores by 8.0% to 8.3% (Mistral: +8.1%, Llama: +8.0%, Granite: +8.3%), reflecting the reward term in the T-CPS formulation for stable outputs. This consistent increase indicated that baseline RAG achieved both relatively high performance and good consistency.
Multi-agent configurations exhibited more complex CPS-T-CPS relationships. Collaborative strategies (ranks 15–17, 22, 26–28) demonstrated a pattern worth noting: despite having the lowest CPS scores across all configurations (0.3637–0.4567), they achieved T-CPS increase similar to baselines (approximately 8.0% over CPS). Llama Collaborative (rank 22) showed T-CPS increase of 8.0% over CPS, as did Mistral Collaborative (rank 27) and Granite Collaborative (rank 28). This "stable mediocrity" pattern indicated that consensus mechanisms reduced variability by converging toward consistent but lower-quality solutions.
In contrast, Competitive strategies (ranks 5, 7–8, 10–11, 19–20) showed more modest T-CPS increase (6–9%), indicating higher variability in outputs. Hierarchical strategies exhibited T-CPS increase ranging from 7.5% to 8.2%. Sequential strategies demonstrated intermediate behaviour with T-CPS increase around 8.0–8.2%. These patterns suggested that different coordination mechanisms produced distinct consistency profiles.
The Granite-SCR configurations maintained T-CPS increase patterns similar to standard Granite configurations, with both groups showing 7–9% increase. This indicated that shared context delivery affected mean performance without substantially altering output consistency characteristics, suggesting that retrieval fragmentation primarily impacted mean quality rather than variability.
Across all configurations, T-CPS remained below baseline values, indicating that multi-agent coordination affected both mean performance and consistency-aware metrics. This dual pattern has implications for production deployment where predictable behaviour is relevant.

4.6. Model-Specific Coordination Response Patterns

The three evaluated models demonstrated distinct response patterns to multi-agent coordination, revealing dependencies between model architecture and coordination effectiveness.
Phase 2 Findings: Original Configurations
Figure 5 illustrates the response patterns of the models to multi-agent coordination. Mistral 7B (left panel) shows a clear gap between Original configurations (red bars) and the baseline, with all Original strategies producing degradation exceeding 25%. Llama 3.1 8B (centre panel) exhibits a different pattern, with Original Sequential and Hierarchical configurations approaching baseline performance. Granite 3.2 8B (right panel) shows consistent degradation across all Original configurations, with gaps exceeding 25% for all strategies.
Llama 3.1 8B exhibited selective tolerance to coordination overhead. Sequential (rank 2) showed degradation of −4.85% (p < 0.01, |d| = 0.30). Hierarchical (rank 4) showed similar patterns (−5.31%, p < 0.01, |d| = 0.28). In contrast, Collaborative (rank 22) showed substantial degradation (−28.24%, p < 0.001, |d| = 3.85). This selective pattern suggests that Llama’s architecture accommodates certain coordination patterns—specifically sequential processing and hierarchical management—while showing greater sensitivity to consensus-based mechanisms.
Granite 3.2 8B demonstrated degradation across all coordination strategies in Original configurations (ranks 20, 23, 25, 28), with all configurations showing highly significant performance losses (p < 0.001) and large to very large effect sizes (|d| = 2.98 to 6.09). The magnitude of degradation ranged from 25.51% (Competitive) to 35.47% (Collaborative).
Mistral 7B exhibited degradation in Original configurations (ranks 19, 21, 24, 27), with all strategies producing highly significant losses (p < 0.001) ranging from 25.10% to 35.19%. Effect sizes (|d| = 2.40 to 4.97) indicated substantial practical significance.
Phase 3 Findings: Retrieval Fragmentation Isolation
The Granite-SCR configurations (Section 3.2.7, Phase 3) enabled isolation of retrieval fragmentation effects. Comparison between Granite-Original and Granite-SCR quantifies the performance difference attributable to the retrieval method.
Granite-SCR configurations showed improved performance relative to Granite-Original: Competitive improved from −25.33% to −14.21% (difference: 11.1 percentage points), Hierarchical from −28.58% to −16.13% (12.5 pp), Sequential from −29.26% to −22.89% (6.4 pp), and Collaborative from −35.31% to −31.08% (4.2 pp). These improvements indicate that retrieval fragmentation contributed measurably to performance degradation.
However, all Granite-SCR configurations remained below baseline performance (−14.21% to −31.08%), indicating that shared context retrieval alone did not recover baseline performance. The residual degradation suggests that coordination overhead, rather than retrieval inconsistency alone, represents a substantial factor in multi-agent RAG performance under these experimental conditions.
Phase 4 Findings: Implementation Enhancements
Optimized configurations (Section 3.2.7, Phase 4) incorporated shared context retrieval for all strategies plus Two-Phase Consensus for Collaborative.
Mistral 7B showed notable improvement from Optimized configurations. Optimized configurations (ranks 1, 3, 7, 15) achieved degradation of 4.39–20.32%, compared to 25.10–35.19% for Original configurations—a difference of 12–15 percentage points. Mistral-Opt Hierarchical (rank 1, −4.39%) and Mistral-Opt Sequential (rank 3, −5.28%) achieved the smallest degradation among all multi-agent configurations.
Granite 3.2 8B demonstrated moderate improvement from Optimized configurations: 4.90–20.98% degradation compared to 25.51–35.47% for Original, representing 10–14 percentage points of improvement. Granite-Opt Competitive (rank 11, −14.90%) performed similarly to Granite-SCR Competitive (rank 10, −14.21%), suggesting that for non-Collaborative strategies, shared context retrieval accounted for most of the improvement.
Llama 3.1 8B showed a different pattern. Original configurations outperformed Optimized by 1.8–3.6 percentage points for Sequential (Original: −4.85%, Optimized: −8.48%), Hierarchical (Original: −5.33%, Optimized: −7.09%), and Competitive (Original: −6.56%, Optimized: −8.42%). This suggests that Llama’s architecture already accommodates coordination overhead effectively for these strategies, and the additional processing in Optimized configurations introduced computational cost without corresponding quality improvement.
For the Collaborative strategy specifically, Optimized configurations improved over Original across all models: Mistral from −35.19% to −20.32% (14.9 pp improvement), Granite from −35.47% to −20.98% (14.5 pp), and Llama from −28.24% to −20.40% (7.8 pp). Despite these improvements, Collaborative remained the lowest-performing strategy within each model family.
Figure 6 provides a comparative heatmap of performance degradation across Original and Optimized configurations. The visualization shows that Optimized configurations (right columns) exhibit lighter shading (smaller degradation) than Original configurations (left columns) for Mistral 7B and Granite 3.2 8B across all strategies. Llama 3.1 8B shows mixed patterns, with Original Sequential and Hierarchical showing lighter shading than their Optimized counterparts. The Collaborative column shows darker shading (larger degradation) than other strategies across all models and configuration types.

4.7. Sensitivity Analysis Results

The T-CPS metric depends on two parameters: α (consistency reward, default = 0.10) and β (variance penalty, default = 0.05). To verify that conclusions were not dependent on these specific values, sensitivity analysis was conducted across 25 parameter combinations (α = {0.05, 0.10, 0.15, 0.20, 0.25} and β = {0.025, 0.05, 0.075, 0.10, 0.15}). T-CPS was recalculated for all 31 configurations in each combination.
Table 6 presents the sensitivity analysis results. Correlation analysis showed that α exhibited near-perfect positive correlation with T-CPS (mean r = 0.9993; all p < 0.001), whereas β showed negligible correlation (mean r = −0.034; not significant). Variance decomposition indicated that α explained 99.87% of the variance in T-CPS, while β explained 0.13%. Multiple regression yielded R2 = 1.000.
Across the parameter range, T-CPS values varied by approximately 17%, but this variation was uniform across all configurations, ensuring that relative comparisons remained stable. Ranking stability analysis supported this: 29 of 31 configurations (93.5%) showed no change in rank across all 25 parameter combinations. Two configurations (Granite-SCR-Sequential and Llama-Opt-Collaborative) exhibited fluctuation of one position between ranks 21 and 22. The top two positions remained stable across all combinations.
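The grid sweep and rank-stability check described above can be sketched as follows. The configuration entries in the example are hypothetical (mean CPS, CV) pairs, not values from Table 6; the α/β grid matches the one used in the study.

```python
from itertools import product

def rank_stability(configs, alphas, betas):
    """For each (alpha, beta) pair, rank configurations by T-CPS and report
    how many configurations keep a single rank across the whole grid.
    configs: dict mapping name -> (mean_cps, cv)."""
    def t_cps(mu, cv, a, b):
        return mu * (1 + a * (1 - cv)) - b * cv ** 2
    ranks = {name: set() for name in configs}
    for a, b in product(alphas, betas):
        ordered = sorted(configs, key=lambda n: t_cps(*configs[n], a, b),
                         reverse=True)
        for pos, name in enumerate(ordered, start=1):
            ranks[name].add(pos)
    stable = sum(1 for r in ranks.values() if len(r) == 1)
    return stable, len(configs)

# The study's parameter grid (25 combinations):
ALPHAS = [0.05, 0.10, 0.15, 0.20, 0.25]
BETAS = [0.025, 0.05, 0.075, 0.10, 0.15]
```

Because the α reward rescales all configurations roughly uniformly, rankings tend to be insensitive to the exact parameter choice, consistent with the 93.5% rank stability reported above.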

4.8. Summary of Results

The evaluation across 3100 test instances provides empirical findings regarding multi-agent coordination effectiveness under the tested conditions:
  • All 28 multi-agent configurations showed statistically significant degradation relative to baseline RAG (p < 0.01 for all comparisons).
  • Performance degradation ranged from −4.39% (Mistral-Opt Hierarchical) to −35.31% (Granite Collaborative), with effect sizes |d| = 0.28 to 6.09.
  • Shared context retrieval (Granite-SCR) improved performance by 4.2–12.5 percentage points relative to independent retrieval, but all configurations remained 14.2–31.1% below baseline.
  • Llama 3.1 8B demonstrated selective tolerance (Sequential and Hierarchical strategies showed degradation below 6%), while Granite 3.2 8B and Mistral 7B Original configurations showed degradation exceeding 25%.
  • Optimized configurations reduced degradation by 10–15 percentage points for Mistral and Granite, but showed no benefit for Llama.
  • Collaborative coordination showed the largest degradation across all models (20.7–35.3%) despite achieving high output consistency.
  • T-CPS sensitivity analysis confirmed ranking stability: 93.5% of configurations maintained identical ranks across 25 parameter combinations.
These findings address the research questions posed in Section 1.1. Regarding comparative performance (Objective 1), multi-agent configurations underperformed baselines across all tested conditions. Regarding degradation sources (Objective 2), both retrieval fragmentation and coordination overhead contributed to degradation, with coordination overhead representing the larger component. Regarding model-strategy interactions (Objective 3), effectiveness varied substantially by model architecture. Regarding consistency-performance trade-offs (Objective 4), Collaborative achieved high consistency but low quality, while other strategies showed variable consistency profiles.

5. Discussion

5.1. Performance Impact of Multi-Agent Coordination

Thirty-one experimental conditions were evaluated: three model baselines (Llama 3.1 8B, Granite 3.2 8B/Granite-SCR, Mistral 7B) and twenty-eight multi-agent configurations across four coordination strategies. All twenty-eight multi-agent configurations showed statistically significant performance degradation relative to their respective single-agent RAG baselines (p < 0.01) under the evaluated implementation conditions. As detailed in Section 4.3, statistical analysis indicated multi-agent degradation. Even the best-performing configurations (Mistral-Opt Hierarchical and Llama Sequential, ranks 1–2 in Table 4) showed statistically significant degradation (p < 0.01), albeit with small effect sizes (|d| = 0.28–0.30). The total underperformance rate reaches 100% when directionality rather than statistical significance is considered.
These findings suggest caution regarding assumptions that multi-agent systems automatically improve LLM reasoning in all contexts. The results align with recent observations about multi-agent architecture limitations [52,53,54,55].
Two potential mechanisms may contribute to the observed degradation. First, coordination overhead—voting, synthesis, sequential processing—may consume resources without proportional quality improvements in certain implementations. Second, consensus mechanisms in their evaluated form appeared to converge toward compromise solutions rather than amplifying the strongest individual responses. Collaborative coordination showed the largest degradation across all models (ranks 15–17, 22, 26–28), with degradation ranging from −20.68% to −35.31% and effect sizes ranging from very large to extreme (|d| = 2.15–6.09, all p < 0.001). The Collaborative panels in Figure 4 illustrate this pattern, with all bars (Original, Optimized, and SCR) falling substantially below baseline across all models.
The consistent underperformance of collaborative coordination in this evaluation suggests boundary conditions for multi-agent RAG deployment. Prior work demonstrating the benefits of multi-agent systems has typically employed conditions that differ substantially from those of resource-constrained settings. Examples include iterative, multi-round debates in which agents refine their responses after observing the outputs of others [11], adversarial role assignment with explicit judge arbitration [8], and standardized operating procedures with role specialization and verifiable intermediate outputs [7]. Additionally, multi-agent capabilities scale strongly with model size: GPT-4-turbo exceeds Llama-2–70B by more than threefold on coordination metrics [12], suggesting that coordination benefits may require model capabilities that extend beyond typical deployment constraints.
This study therefore evaluated simpler coordination strategies using accessible, open-source models with 7–8 billion parameters to determine whether multi-agent approaches provide practical benefits in resource-constrained conditions. The Two-Phase Consensus implementation (Section 3.2.2) introduced structured synthesis through a lead agent, improving Collaborative performance by 7–14 percentage points relative to Original implementations. Despite this improvement, Collaborative coordination remained the lowest-performing strategy within each model family.
These results suggest that, in the absence of the supporting mechanisms present in successful multi-agent systems (e.g., iterative refinement, role differentiation or larger model capacity), consensus-based synthesis introduces information loss. Selection-based strategies (Competitive and Hierarchical) outperformed synthesis-based approaches (Collaborative), which is consistent with research showing that voting and selection mechanisms often yield more robust collective decisions than consensus aggregation [40]. For factual question-answering tasks involving resource-constrained models, single-agent RAG or selection-based multi-agent strategies are more effective deployment choices than consensus-based coordination.
T-CPS analysis (Section 4.5) revealed a “stable mediocrity” pattern: several configurations achieved low output variability while producing consistently poor-quality results. This demonstrates that reliability and quality are distinct properties.
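The distinction can be made concrete with a short computation: the coefficient of variation (CV = standard deviation / mean) captures stability, while the mean captures quality, and the two can diverge. The scores below are illustrative values, not data from the study.

```python
from statistics import mean, stdev

def coefficient_of_variation(scores):
    """CV = standard deviation / mean; lower means more stable output."""
    return stdev(scores) / mean(scores)

# Illustrative CPS-like scores (not taken from the study's data):
baseline = [0.55, 0.60, 0.52, 0.58, 0.57]          # higher quality, some spread
stable_mediocre = [0.40, 0.41, 0.40, 0.39, 0.40]   # consistently poor

# The "stable mediocrity" pattern: lower CV, but also a lower mean.
assert coefficient_of_variation(stable_mediocre) < coefficient_of_variation(baseline)
assert mean(stable_mediocre) < mean(baseline)
```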

5.2. Performance Patterns Across Models and Strategies

The three evaluated models demonstrated distinct response patterns to multi-agent coordination, as reported in Section 4.6. The experimental phases (Section 3.2.7) enabled isolation of these patterns: Phase 2 (Original configurations) established baseline coordination effects, Phase 3 (Granite-SCR) isolated retrieval fragmentation, and Phase 4 (Optimized configurations) tested implementation enhancements.
Llama 3.1 8B exhibited selective tolerance to coordination overhead in this evaluation. Sequential (rank 2) and Hierarchical (rank 4) strategies showed minimal degradation (−4.85% and −5.31%, respectively, both p < 0.01) with small effect sizes. Competitive strategy (rank 5) showed small but significant degradation (−6.54%, p < 0.001). Collaborative strategy exhibited substantial degradation (rank 22: −28.24%, p < 0.001). This selective pattern suggests that Llama’s architecture accommodates sequential processing and hierarchical management while remaining vulnerable to consensus-based mechanisms.
Mistral 7B exhibited a split response pattern. Original configurations (ranks 19, 21, 24, 27) showed severe degradation (25.0–35.1%), while Optimized configurations (ranks 1, 3, 7, 15) achieved substantially reduced degradation (4.4–20.7%). This 12–15 percentage point improvement from implementation enhancements suggests that Mistral benefits from shared context retrieval and structured consensus mechanisms.
Granite 3.2 8B demonstrated degradation across all coordination strategies. Original configurations (ranks 20, 23, 25, 28) showed severe degradation (25.3–35.3%), while Optimized and SCR configurations (ranks 10–14, 16–18) showed moderate degradation (14.2–22.9%). The Granite-SCR configurations enabled decomposition of enhancement effects, as discussed below.
Figure 5 illustrates these model-specific response patterns. Llama 3.1 8B exhibits selective tolerance to Sequential and Hierarchical coordination, with Original configurations (red bars) approaching baseline. Mistral 7B shows a clear gap between Original and Optimized configurations across all strategies. Granite 3.2 8B demonstrates consistent degradation, though SCR and Optimized configurations (orange and light blue bars) show reduced gaps compared to Original.
Strategy-specific patterns emerged across models. Among Original configurations, Sequential strategies showed the least degradation for Mistral 7B (rank 24: −28.83%) and Granite 3.2 8B (rank 25: −29.26%), though both remained severely degraded. This relative performance advantage is apparent in Figure 4, where the Sequential panel shows bars closer to baseline compared to the Collaborative and Competitive panels. Llama 3.1 8B Sequential (rank 2) demonstrated minimal degradation (−4.85%, p < 0.01, small effect). Competitive strategy showed minimal degradation only for Llama 3.1 8B (rank 5: −6.54%, small effect), while showing severe degradation for Mistral 7B Original (rank 19: −25.01%) and Granite 3.2 8B Original (rank 20: −25.33%). Hierarchical strategies showed similar patterns, with Llama Original (rank 4: −5.31%) substantially outperforming Mistral Original (rank 21: −27.28%) and Granite Original (rank 23: −28.58%). Collaborative coordination showed the largest degradation across all configurations (ranks 15–17, 22, 26–28: −20.68% to −35.31%, all p < 0.001); high consensus scores (near 100%) indicate that agents produced homogeneous outputs, eliminating diversity benefits while still requiring multiple coordinated inference calls.
The Granite-SCR configurations (Section 3.2.7, Phase 3) isolated the contribution of retrieval fragmentation. Shared context retrieval improved performance by 4.2–12.5 percentage points relative to Granite-Original configurations, with strategy-specific sensitivity: the largest gains were achieved by Hierarchical (CPS: +17.5%, T-CPS: +17.0%), followed by Competitive (CPS: +14.9%, T-CPS: +14.2%), Sequential (CPS: +9.1%, T-CPS: +7.8%), and Collaborative (CPS: +5.9%, T-CPS: +5.8%).
This differential response reveals architectural dependencies. Hierarchical and Competitive strategies rely more heavily on input consistency, because the coordinator or selection mechanism must process agent outputs that reference a common context. Sequential and Collaborative mechanisms introduce additional failure modes: sequential error propagation and consensus deadlock cannot be mitigated by shared context.
Despite these improvements, all Granite-SCR configurations remained significantly inferior to baseline RAG performance (CPS: 0.5658; T-CPS: 0.6125), with degradation ranging from 14.2% (Competitive) to 31.6% (Collaborative), all p < 0.001. Consistency metrics remained nearly identical across retrieval configurations (Granite-SCR CV: 0.0380–0.0907 vs. Granite 3.2 8B CV: 0.0327–0.0482), indicating that coordination mechanisms, rather than retrieval fragmentation, primarily determine output stability. This confirms that coordination overhead remains the dominant factor limiting multi-agent effectiveness: shared context reduces retrieval fragmentation effects but cannot overcome fundamental coordination costs.
The strategy-specific degradation patterns (Figure 4) demonstrate that Collaborative coordination produces the largest performance loss across all models, while Sequential and Hierarchical strategies exhibit model-dependent tolerance. Figure 6 provides a comparative heatmap confirming that Optimized configurations reduced degradation for Mistral and Granite, while Llama Original configurations outperformed Optimized for non-Collaborative strategies.

5.3. Limitations

Several factors limit the generalisability of these findings. All evaluations used factual question–answering pairs from a single knowledge base (the FAO Climate-Smart Agriculture Sourcebook), so performance patterns may differ for other domains, particularly those requiring multi-step reasoning or creative synthesis. The evaluated models (7–8 billion parameters) represent a specific capability tier; larger models with enhanced instruction-following abilities may exhibit different coordination dynamics.
All agents used identical base models with different system prompts. Heterogeneous architectures that combine models with complementary strengths are an unexplored area that could produce different results. Role prompts were generated using AI assistance (GitHub Copilot with Claude Sonnet 4.5), reflecting realistic development practices. However, hand-crafted, domain-specific prompts with careful role differentiation might produce different results. The Collaborative strategy in Original configurations implemented straightforward aggregation. The Two-Phase Consensus enhancement (Optimized configurations) introduced structured synthesis but did not include iterative refinement or explicit conflict resolution. Advanced protocols such as weighted voting, structured argumentation or learned aggregation might preserve individual strengths while mitigating weaknesses.
Hardware constraints further limit interpretation. Mistral 7B and Granite 3.2 8B used CPU-only inference on Intel Xeon processors, whereas Llama 3.1 8B used Apple’s M1 unified memory architecture. These heterogeneous conditions preclude direct timing comparisons between models. Consequently, the analysis focuses on quality and stability metrics (CPS and T-CPS) rather than absolute efficiency measures.

5.4. Comparison with Prior Literature

These findings extend prior work by comparing multi-agent strategies against RAG baselines rather than simple prompting baselines. This distinction matters: most studies evaluate multi-agent systems against basic single-agent prompts, a comparison that can make coordination appear more effective than it is. RAG systems already incorporate external knowledge retrieval, setting a higher performance bar that makes multi-agent coordination harder to justify.
The baseline RAG outperformed all multi-agent configurations in the evaluated conditions, consistent with concerns raised in the recent literature that coordination benefits may not outweigh coordination overhead.
An unexpected pattern was also observed: Sequential coordination performed best among multi-agent approaches, whereas most prior work emphasizes debate-based and hierarchical approaches [7,8,13]. This pattern suggests that no universal "best" strategy exists; effectiveness depends on specific task characteristics and domain requirements.

6. Conclusions

This study compared four multi-agent coordination strategies with single-agent RAG baselines across three open-source language models. The experimental design comprised 31 configurations organized into four groups (Section 3.2.7): three single-agent RAG baselines, 12 Original multi-agent configurations with independent retrieval, four Granite-SCR configurations to isolate retrieval effects, and 12 Optimized multi-agent configurations with shared context retrieval and Two-Phase Consensus for Collaborative strategy.
Of the 28 multi-agent configurations evaluated, none matched baseline performance. Degradation ranged from −4.39% (Mistral-Opt Hierarchical) to −35.31% (Granite Collaborative). All configurations showed statistically significant degradation (p < 0.01), with effect sizes ranging from small (|d| = 0.28) to very large (|d| = 6.09).
The Granite-SCR experiments isolated the relative contributions of retrieval fragmentation and coordination overhead. Shared context retrieval improved performance by 4.2–12.5 percentage points relative to independent retrieval; however, all configurations remained 14.2–31.1% below baseline performance. This finding suggests that coordination overhead, rather than retrieval inconsistency, represents a primary limiting factor under the evaluated conditions. Figure 4 illustrates strategy-specific patterns, showing that Collaborative coordination produced the largest degradation across all implementation approaches. Strategy selection (Figure 4) and model architecture (Figure 5) jointly determine the degradation magnitude, while Figure 6 confirms that Optimized configurations reduced degradation for Mistral and Granite.
Model-specific response patterns were observed (Figure 5). Llama 3.1 8B demonstrated selective tolerance, with Sequential and Hierarchical strategies producing minimal degradation (−4.85% and −5.31%, respectively). Mistral 7B and Granite 3.2 8B exhibited degradation across all strategies in Original configurations (25.0–35.1%), though Optimized configurations reduced this to 4.4–21.4%. The Two-Phase Consensus mechanism reduced Collaborative degradation by 7–14 percentage points across all models, indicating that coordination effectiveness depends on both model architecture and implementation approach.
For practitioners considering deploying multi-agent RAG with 7–8 billion parameter models, these findings imply that single-agent RAG with tuned similarity thresholds offers superior performance in most scenarios. When a multi-agent architecture is required, Sequential and Hierarchical strategies consistently outperform Collaborative coordination across all models (Figure 4). Among these, Llama 3.1 8B with Sequential or Hierarchical coordination offers the smallest performance penalty. Collaborative coordination produced the largest degradation across all models regardless of implementation enhancements, suggesting that alternative strategies should be preferred when multi-agent deployment is required. These recommendations apply to the evaluated configuration space; different domains or model scales may exhibit different patterns.
These findings suggest boundary conditions for multi-agent RAG deployment: coordination benefits demonstrated in prior work using iterative debate, adversarial roles, or larger models (70B+) may not extend to simpler coordination strategies with resource-constrained open-source models. Results that establish where approaches encounter limitations contribute to the scientific understanding of multi-agent systems, particularly when they identify conditions under which commonly held assumptions may not apply.
Future research should investigate the following areas: role-specific prompting with explicit agent specialization; advanced consensus mechanisms incorporating structured argumentation or learned aggregation; adaptive strategy selection based on query characteristics; and joint tuning of retrieval thresholds with coordination parameters. The experimental phases employed in this study (baseline → Original → SCR → Optimized) provide a template for systematic investigation of multi-agent enhancements. Evaluating larger models (70B+) and heterogeneous agent teams may reveal the conditions under which multi-agent coordination provides a net benefit over single-agent RAG.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/electronics14244883/s1, Table S1: Pairwise Pearson correlation matrix for CPS metrics; Table S2: Spearman rank correlation between baseline and perturbed CPS rankings across weight variations; Table S3: Leave-one-out ablation analysis showing Spearman correlation between full-metric and ablated rankings.

Author Contributions

Conceptualization, I.R. and I.P.; methodology, I.R.; software, I.R. and M.D.; validation, I.P., I.R. and L.D.; formal analysis, I.R. and I.P.; investigation, I.R. and L.D.; resources, M.D.; data curation, I.R.; writing—original draft preparation, I.R.; writing—review and editing, I.P. and M.D.; visualization, I.R.; supervision, I.P.; project administration, L.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Center of Competence "Digitization of the economy in an environment of Big data", BG05M2OP001-1.002-0002-C05, OP SESG.

Data Availability Statement

The complete datasets generated and analyzed during this study are publicly available in the project’s GitHub repository at https://github.com/scpdxtest/maPaSSER (accessed on 13 November 2025). The repository contains: (1) raw experimental results in CSV format (Mistral 7B, Llama 3.1 8B, Granite 3.2 8B) and five configurations (four multi-agent coordination strategies plus RAG baseline), including individual metric scores (METEOR, ROUGE-1, ROUGE-L, BLEU, Laplace perplexity, Lidstone perplexity, cosine similarity, Pearson correlation, F1 score) and composite performance measures (CPS, T-CPS); (2) the enhanced PaSSER framework implementation with Python Flask API server (multiagent_server_api_2.py) supporting multi-agent coordination protocols, React-based frontend for configuration and real-time monitoring, and MongoDB integration for result persistence; (3) documented implementations of four coordination strategies (collaborative, sequential, competitive, hierarchical) with explicit voting, aggregation, and selection logic; (4) the 100-item question–answer evaluation corpus derived from the Climate-smart Agriculture Sourcebook; (5) Python analysis scripts for statistical validation (tcps_script_2.py, tcps_analysis_2.txt) including CPS/T-CPS calculations and performance comparisons; (6) operational efficiency metrics including inference latency, token consumption, and coordination overhead automatically logged by the system; and (7) comprehensive installation instructions and reproducibility documentation in README.md. The original PaSSER framework for RAG evaluation is available at https://github.com/scpdxtest/PaSSER (accessed on 1 April 2024).

Acknowledgments

During the preparation of this manuscript, the authors used Claude (Anthropic, Claude Opus 4.5) for manuscript editing, figure generation and data presentation, as well as for drafting response and cover letters. DeepL (version 25.11.23262385) was used for translation assistance. The authors have reviewed and edited all outputs and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
API: Application Programming Interface
BLEU: Bilingual Evaluation Understudy
CPS: Composite Performance Score
CV: Coefficient of Variation
JSON: JavaScript Object Notation
LLM: Large Language Model
MCDM: Multi-Criteria Decision Making
METEOR: Metric for Evaluation of Translation with Explicit ORdering
RAG: Retrieval-Augmented Generation
ROUGE: Recall-Oriented Understudy for Gisting Evaluation
SCR: Shared Context Retrieval
T-CPS: Threshold-aware Composite Performance Score

Appendix A. Implementation Details and Reproducibility

Appendix A.1. System Architecture

The multi-agent RAG system was implemented using a Python-based backend with FastAPI for API endpoints and a React-based frontend for the evaluation interface. Local LLM inference was performed through Ollama for CPU-based deployment and MLX for Apple Silicon acceleration.

Appendix A.1.1. Hardware Configurations

Two hardware configurations were employed. Mistral 7B and Granite 3.2 8B models were executed on an Intel Xeon server using CPU inference. Llama 3.1 8B was executed on Apple M1 hardware using Metal Performance Shaders (MPS) acceleration via MLX. This hardware heterogeneity precludes direct cross-model timing comparisons; analysis therefore focuses on within-model patterns and hardware-independent metrics.
Table A1. Hardware configurations for model deployment.
Component        | Mistral 7B/Granite 8B | Llama 3.1 8B
Processor        | Intel Xeon (server)   | Apple M1
Inference Engine | Ollama (CPU)          | MLX (MPS)
Acceleration     | None (CPU only)       | Metal Performance Shaders
Two retrieval modes were evaluated. Independent retrieval allows each agent to query the database separately. Shared context retrieval provides identical documents to all agents from a single query. Configuration details are described in Section 3.2.6.
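The difference between the two modes can be sketched as follows; `retrieve` is a hypothetical stand-in for the ChromaDB query used in the actual system, and the agent names are illustrative.

```python
# Sketch of the two retrieval modes. `retrieve` stands in for the
# vector-database query performed by the real system.

def retrieve(query, top_k=7):
    """Stand-in returning top_k pseudo-documents for a query."""
    return [f"passage {i} about {query}" for i in range(top_k)]

def independent_retrieval(query, agents):
    # Each agent issues its own query; result sets may diverge if the
    # retriever is non-deterministic or agents rephrase the query.
    return {agent: retrieve(query) for agent in agents}

def shared_context_retrieval(query, agents):
    # One query; every agent receives the identical document set.
    docs = retrieve(query)
    return {agent: docs for agent in agents}

agents = ["analyzer", "critic", "synthesizer"]
shared = shared_context_retrieval("soil health", agents)
assert all(ctx == shared["analyzer"] for ctx in shared.values())
```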

Appendix A.1.2. Software Stack

Table A2. Software dependencies and version specifications.
Component         | Technology
LLM Inference     | Ollama API (local)/MLX (Apple Silicon)
Vector Database   | ChromaDB
Embeddings        | Ollama embeddings (Mistral model)
Backend Framework | Python FastAPI
Frontend          | React

Appendix A.2. Generation Hyperparameters

Generation parameters were held constant across all experimental conditions to ensure comparability. The following settings were applied:
Table A3. Model specifications and deployment parameters.
Parameter               | Value   | Description
temperature             | 0.7     | Controls randomness in generation
top_p                   | 0.9     | Nucleus sampling threshold
top_k                   | 50      | Top-k sampling (Mistral); 40 for other models
repetition_penalty      | 1.1     | Penalty for repeated tokens
repetition_context_size | 20      | Context window for repetition detection
max_new_tokens          | Dynamic | 90% of remaining context window (up to 2000)
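As a sketch, these parameters might be packaged into an Ollama-style generation request as follows; `build_generation_request` is an illustrative helper, not the study's implementation. The option names follow the public Ollama REST API, and the dynamic token budget follows the rule stated above (90% of the remaining context window, capped at 2000).

```python
def build_generation_request(model, prompt, remaining_context, top_k=40):
    # Dynamic token budget: 90% of remaining context, capped at 2000.
    max_new_tokens = min(2000, int(0.9 * remaining_context))
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": 0.7,     # randomness in generation
            "top_p": 0.9,           # nucleus sampling threshold
            "top_k": top_k,         # 50 for Mistral, 40 for other models
            "repeat_penalty": 1.1,  # penalty for repeated tokens
            "num_predict": max_new_tokens,
        },
    }

req = build_generation_request("mistral", "What is climate-smart agriculture?",
                               remaining_context=1500, top_k=50)
assert req["options"]["num_predict"] == 1350  # 90% of 1500
```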

Appendix A.3. Retrieval Configuration

Document retrieval was performed using ChromaDB with cosine similarity scoring. The following parameters were applied:
Table A4. RAG configuration parameters.
Parameter            | Value
top_k_docs           | 7
Embedding model      | Mistral (via Ollama)
Vector database      | ChromaDB
Similarity metric    | Cosine similarity
Similarity threshold | Model-specific (0.90–0.95)

Appendix A.4. Agent Configuration

Appendix A.4.1. Number of Agents

All coordination strategies employed three agents per query. This configuration was selected to balance coordination complexity against computational overhead while maintaining sufficient diversity of perspectives.

Appendix A.4.2. Role Assignments by Strategy

Role assignments were determined by coordination strategy as follows:
Table A5. Role assignments by coordination strategy.
Strategy      | Agent 1      | Agent 2      | Agent 3
Collaborative | Analyzer     | Critic       | Synthesizer
Sequential    | Analyzer     | Processor    | Reviewer
Competitive   | Expert A     | Expert B     | Expert C
Hierarchical  | Specialist 1 | Specialist 2 | Manager

Appendix A.5. Prompt Templates

Appendix A.5.1. Base Agent Prompt (With RAG Context)

When RAG context was available, the following prompt template was employed:
Based on the following information, provide a comprehensive
answer to the question.
Knowledge Base:
{rag_context}
Question: {query}
Role: You are {role}. Analyze the information above
and provide a detailed, well-structured answer.
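In code, the template above reduces to simple string substitution; the filled-in values here are illustrative.

```python
# The RAG prompt template as a Python format string.
RAG_TEMPLATE = (
    "Based on the following information, provide a comprehensive\n"
    "answer to the question.\n"
    "Knowledge Base:\n"
    "{rag_context}\n"
    "Question: {query}\n"
    "Role: You are {role}. Analyze the information above\n"
    "and provide a detailed, well-structured answer."
)

prompt = RAG_TEMPLATE.format(
    rag_context="Cover crops improve soil structure and water retention.",
    query="How do cover crops affect soil?",
    role="an agronomy analyzer",
)
assert "Knowledge Base:" in prompt and "an agronomy analyzer" in prompt
```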

Appendix A.5.2. Base Agent Prompt (Without RAG Context)

When RAG context was not available, a simplified template was used:
Question: {query}
Role: You are {role}. Provide a detailed, comprehensive
answer based on your knowledge.

Appendix A.5.3. Model-Specific Formatting

Prompt formatting was adapted to model-specific tokenisation requirements. Mistral 7B uses instruction tokens to delimit user input, with prompts formatted as [INST] {prompt content} [/INST]. Granite 3.2 8B and Llama 3.1 8B used plain text format with an "Answer:" suffix appended to trigger response generation.
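A minimal sketch of this dispatch logic, assuming model names are plain strings; the helper name is illustrative.

```python
def format_for_model(prompt, model):
    # Mistral expects [INST] ... [/INST] instruction delimiters; Granite
    # and Llama receive plain text with an "Answer:" suffix appended.
    if model.startswith("mistral"):
        return f"[INST] {prompt} [/INST]"
    return f"{prompt}\nAnswer:"

assert format_for_model("What is CSA?", "mistral") == "[INST] What is CSA? [/INST]"
assert format_for_model("What is CSA?", "llama3.1").endswith("Answer:")
```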

Appendix A.6. Coordination Mechanisms

Appendix A.6.1. Collaborative Strategy

In the collaborative strategy, all agents processed the query in parallel. Response aggregation was performed by concatenating summaries and selecting the most comprehensive answer based on internal quality scoring.

Appendix A.6.2. Sequential Strategy

In the sequential strategy, agents processed the query in serial order. Agent 1 generated an initial response (R1). Agent 2 received R1 as additional context and generated response R2. Agent 3 received both R1 and R2 as context and generated the final response R3.
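The chaining described above can be sketched as follows; `toy_generate` is an illustrative stand-in for actual model inference, used only to show what context each agent receives.

```python
def sequential_coordination(query, roles, generate):
    """Each agent receives the query plus all earlier responses (R1, R2, ...)."""
    responses = []
    for role in roles:
        context = "\n".join(
            f"Previous response {i + 1}: {r}" for i, r in enumerate(responses)
        )
        responses.append(generate(role, query, context))
    return responses[-1]  # the last agent produces the final answer

# Toy generator that records how many prior responses each agent saw.
def toy_generate(role, query, context):
    return f"{role} answer (saw {context.count('Previous response')} prior)"

final = sequential_coordination("q", ["Analyzer", "Processor", "Reviewer"], toy_generate)
assert final == "Reviewer answer (saw 2 prior)"
```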

Appendix A.6.3. Competitive Strategy

In the competitive strategy, all agents processed the query in parallel independently. The best response was selected based on internal quality scoring incorporating confidence, completeness, and coherence factors.
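A minimal sketch of score-based selection; the weights and score fields are illustrative, since the study combines confidence, completeness, and coherence in unspecified proportions.

```python
def competitive_coordination(responses):
    """Return the text of the highest-scoring response.
    Weights here are illustrative, not the study's exact function."""
    def score(r):
        return 0.5 * r["confidence"] + 0.3 * r["completeness"] + 0.2 * r["coherence"]
    return max(responses, key=score)["text"]

candidates = [
    {"text": "A", "confidence": 0.9, "completeness": 0.6, "coherence": 0.7},
    {"text": "B", "confidence": 0.7, "completeness": 0.9, "coherence": 0.9},
]
assert competitive_coordination(candidates) == "B"  # 0.80 beats 0.77
```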

Appendix A.6.4. Hierarchical Strategy

In the hierarchical strategy, specialist agents (Agents 1–2) generated individual analyses in parallel. The manager agent (Agent 3) subsequently synthesized specialist outputs into the final response.
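The specialist-manager flow can be sketched as follows; `toy_generate` stands in for model inference, and the parallel specialist calls are simulated sequentially.

```python
def hierarchical_coordination(query, specialists, manager, generate):
    # Specialists answer first (in parallel in the real system, simulated
    # sequentially here); the manager synthesizes their outputs.
    analyses = [generate(s, query, "") for s in specialists]
    briefing = "\n".join(f"{s}: {a}" for s, a in zip(specialists, analyses))
    return generate(manager, query, briefing)

# Toy generator that counts how many specialist analyses it was shown.
def toy_generate(role, query, context):
    return f"{role} response ({context.count(':')} inputs)"

final = hierarchical_coordination(
    "q", ["Specialist 1", "Specialist 2"], "Manager", toy_generate
)
assert final == "Manager response (2 inputs)"
```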
In Optimized configurations, all strategies received shared context retrieval, ensuring identical input documents across agents.

Appendix A.7. Two-Phase Collaborative Consensus (Improved)

The improved collaborative strategy (Section 3.2.2) implemented structured synthesis through a two-phase process.
Phase 1: Independent Analysis
All agents generated responses independently with associated confidence scores.
Phase 2: Collaborative Synthesis
The agent with the highest confidence score was designated as lead synthesiser. The following synthesis prompt was employed:
Based on the following question and multiple expert analyses,
create a comprehensive unified answer that combines the best
insights from all perspectives.
Original Question: {query}
Expert Analyses Summary:
{insights_summary}
Your task: Synthesize these perspectives into a single,
coherent, comprehensive answer that:
1. Integrates the strongest points from each analysis
2. Resolves any contradictions or overlaps
3. Provides a complete, well-structured response
4. Maintains accuracy and precision
Final Score Computation
The final score was computed as: Final Score = 0.7 × synthesis_quality + 0.3 × average_confidence. If synthesis failed, the system reverted to the best individual response.
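A sketch of this scoring rule, with a simplified fallback that returns the best individual confidence; the actual system reverts to the best individual response itself.

```python
def collaborative_final_score(synthesis_quality, confidences, synthesis_ok=True):
    """Final Score = 0.7 * synthesis_quality + 0.3 * average confidence.
    Simplified fallback: on synthesis failure, return the best individual
    confidence (the real system falls back to the best response itself)."""
    if not synthesis_ok:
        return max(confidences)
    avg_confidence = sum(confidences) / len(confidences)
    return 0.7 * synthesis_quality + 0.3 * avg_confidence

score = collaborative_final_score(0.8, [0.6, 0.7, 0.9])
assert abs(score - 0.78) < 1e-9  # 0.7 * 0.8 + 0.3 * 0.7333...
assert collaborative_final_score(0.0, [0.2, 0.9], synthesis_ok=False) == 0.9
```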

Appendix A.8. Code Availability

Full source code is available in the GitHub repository (maPaSSER). Key components include:
Backend API: multiagent_MAC_server_api.py
Evaluation Scripts: tcps_script_2.py
Global Constants: globalConst.py

Appendix A.9. Data Availability

Test datasets and experimental results are available in the GitHub repository. File naming conventions are as follows:
Baseline RAG results: filtered_newTest{Model}_{threshold}.csv
Multi-agent results: {Model}{Strategy}.csv
Improved configurations: {Model}_Optimized_{Strategy}.csv
SCR configurations: Granite_SCR_{Strategy}.csv

References

  1. Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020; Curran Associates Inc.: Red Hook, NY, USA, 2020. [Google Scholar]
  2. Barnett, S.; Kurniawan, S.; Thudumu, S.; Brannelly, Z.; Abdelrazek, M. Seven Failure Points When Engineering a Retrieval Augmented Generation System. In Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering—Software Engineering for AI, Lisbon, Portugal, 14–15 April 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 194–199. [Google Scholar]
  3. Chen, J.; Lin, H.; Han, X.; Sun, L. Benchmarking Large Language Models in Retrieval-Augmented Generation. Proc. AAAI Conf. Artif. Intell. 2024, 38, 17754–17762. [Google Scholar] [CrossRef]
  4. Yu, W.; Zhang, H.; Pan, X.; Cao, P.; Ma, K.; Li, J.; Wang, H.; Yu, D. Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; Al-Onaizan, Y., Bansal, M., Chen, Y.-N., Eds.; Association for Computational Linguistics: Miami, FL, USA, 2024; pp. 14672–14685. [Google Scholar]
  5. Salemi, A.; Zamani, H. Evaluating Retrieval Quality in Retrieval-Augmented Generation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024; Association for Computing Machinery: New York, NY, USA, 2024; pp. 2395–2400. [Google Scholar]
  6. Radeva, I.; Popchev, I.; Dimitrova, M. Similarity Thresholds in Retrieval-Augmented Generation. In Proceedings of the 2024 IEEE 12th International Conference on Intelligent Systems (IS), Varna, Bulgaria, 29–31 August 2024; pp. 1–7. [Google Scholar]
  7. Hong, S.; Zhuge, M.; Chen, J.; Zheng, X.; Cheng, Y.; Zhang, C.; Wang, J.; Wang, Z.; Yau, S.K.S.; Lin, Z.H.; et al. MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  8. Liang, T.; He, Z.; Jiao, W.; Wang, X.; Wang, Y.; Wang, R.; Yang, Y.; Shi, S.; Tu, Z. Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; Al-Onaizan, Y., Bansal, M., Chen, Y.-N., Eds.; Association for Computational Linguistics: Miami, FL, USA, 2024; pp. 17889–17904. [Google Scholar]
  9. Park, J.S.; O’Brien, J.; Cai, C.J.; Morris, M.R.; Liang, P.; Bernstein, M.S. Generative Agents: Interactive Simulacra of Human Behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, San Francisco, CA, USA, 29 October–1 November 2023; Association for Computing Machinery: New York, NY, USA, 2023. [Google Scholar]
  10. Bond, A.H.; Gasser, L. Readings in Distributed Artificial Intelligence; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1988; ISBN 978-0-934613-63-7. [Google Scholar]
Figure 1. PaSSER framework architecture. The framework comprises three integrated layers: LLM Infrastructure layer supporting local deployment of Mistral 7B, Llama 3.1 8B, and Granite 3.2 8B models with ChromaDB vector database and multi-agent coordination engine implementing four strategies (Collaborative, Sequential, Competitive, Hierarchical); Web Application layer providing React/PrimeReact frontend with LangChain RAG pipeline, configuration management, and real-time monitoring; Evaluation & Storage layer featuring Python-based metric computation (Flask API with NLTK, PyTorch 2.5.1), CPS/T-CPS calculation, statistical analysis, and blockchain integration via Antelope for immutable result verification.
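The four coordination strategies named in the caption (Collaborative, Sequential, Competitive, Hierarchical) can be sketched as plain functions over a single-agent generation call. This is an illustrative sketch under stated assumptions, not the PaSSER implementation: `gen` is a hypothetical `(agent, prompt) -> answer` callable, and `score` is a hypothetical answer-ranking function.

```python
from typing import Callable, List

# Hypothetical single-agent call: (agent_name, prompt) -> answer.
Generate = Callable[[str, str], str]

def collaborative(agents: List[str], question: str, gen: Generate) -> str:
    """All agents answer independently; answers are merged by simple aggregation."""
    return " ".join(gen(a, question) for a in agents)

def sequential(agents: List[str], question: str, gen: Generate) -> str:
    """Each agent refines the previous agent's draft in turn."""
    draft = ""
    for agent in agents:
        draft = gen(agent, f"{question}\nPrevious draft: {draft}")
    return draft

def competitive(agents: List[str], question: str, gen: Generate,
                score: Callable[[str], float]) -> str:
    """All agents answer independently; the highest-scoring answer wins."""
    return max((gen(a, question) for a in agents), key=score)

def hierarchical(manager: str, workers: List[str], question: str,
                 gen: Generate) -> str:
    """Worker agents draft answers; a manager agent synthesizes the final one."""
    drafts = [gen(w, question) for w in workers]
    return gen(manager, question + "\nDrafts:\n" + "\n".join(drafts))
```

The `collaborative` variant above mirrors the "simple aggregation" consensus of the Original configurations in Table 1; the Optimized configurations instead use a two-phase consensus.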
Figure 2. Testing Workflow in the PaSSER Framework. The seven-step experimental protocol encompasses: (1) vector store creation with domain-specific document embeddings via ChromaDB, (2) dataset preparation with reference question–answer pairs, (3) retrieval configuration establishing similarity thresholds and context parameters, (4) multi-agent execution implementing coordination strategies (Collaborative, Sequential, Competitive, Hierarchical), (5) response generation with agent coordination mechanisms (via a Python API), (6) performance evaluation computing metrics, and (7) result recording.
Figure 3. Baseline threshold selection: performance comparison across threshold configurations. Grouped bar charts comparing CPS improvement (blue bars) and T-CPS improvement (red bars) for three threshold configurations per model. Selected baselines (highlighted in gold): Mistral 7B at threshold 0.95 (CPS: +5.16%, T-CPS: +5.37%), Granite 3.2 8B at threshold 0.95 (CPS: +1.25%, T-CPS: +1.26%), Llama 3.1 8B at threshold 0.90 (CPS: +1.85%, T-CPS: +1.87%). Alternative configurations (Alt 1, Alt 2) show competing threshold choices with lower T-CPS improvements. Values displayed on bars indicate percentage improvement relative to untuned baseline. Selected thresholds maximize T-CPS while maintaining statistically significant CPS improvements, balancing mean performance with output consistency.
Figure 4. T-CPS performance comparison by coordination strategy. Each panel presents one strategy (Collaborative, Sequential, Competitive, Hierarchical) across all models. Bars represent Baseline (dark blue), Original (red), Optimized (light blue), and Granite-SCR (orange) configurations. The dashed line indicates the mean baseline T-CPS.
Figure 5. T-CPS performance comparison by model. Each panel presents one model (Mistral 7B, Llama 3.1 8B, Granite 3.2 8B) across all coordination strategies. Bars represent Original (red), Optimized (light blue), and SCR (orange, Granite only) configurations. The vertical dashed line indicates each model’s single-agent RAG baseline.
Figure 6. Performance degradation: original vs. shared context configurations. Heatmap comparing Original and Optimized configurations; values represent the percentage change from the single-agent baseline (T-CPS metric). Color scale: green indicates smaller degradation; red indicates larger degradation.
Table 1. Experimental configuration groups.
| Group | Configs | Models | Retrieval | Collaborative Consensus | Other Strategies |
|---|---|---|---|---|---|
| Baselines | 3 | All | Single query | N/A | N/A |
| Original | 12 | All | Independent | Simple aggregation | Standard |
| Granite-SCR | 4 | Granite only | Shared | Simple aggregation | Standard |
| Optimized | 12 | All | Shared | Two-Phase | Standard |
| **Total** | **31** | | | | |
Table 2. CPS metric weights.
| Category | Metric | Weight | Subtotal |
|---|---|---|---|
| Content Accuracy | F1 Score | 0.20 | |
| | METEOR | 0.15 | |
| | BLEU | 0.15 | 0.50 |
| Semantic Relevance | Cosine Similarity | 0.10 | |
| | Pearson Correlation | 0.10 | 0.20 |
| Lexical/Fluency | ROUGE-1.f | 0.075 | |
| | ROUGE-L.f | 0.075 | |
| | Laplace Perplexity * | 0.075 | |
| | Lidstone Perplexity * | 0.075 | 0.30 |
| **Total** | | **1.00** | |
Notes: * Inverted metrics (lower values indicate better performance).
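The CPS weights in Table 2 translate directly into a weighted sum. A minimal sketch, assuming each metric has already been normalized to [0, 1] with higher = better (the two perplexity metrics are inverted first, per the table note); the metric key names are illustrative, not the framework's actual identifiers.

```python
# CPS as a weighted sum of the nine metrics from Table 2. Weights are taken
# verbatim from the table; key names are illustrative.
CPS_WEIGHTS = {
    "f1": 0.20, "meteor": 0.15, "bleu": 0.15,             # content accuracy, 0.50
    "cosine": 0.10, "pearson": 0.10,                      # semantic relevance, 0.20
    "rouge1_f": 0.075, "rougeL_f": 0.075,                 # lexical/fluency, 0.30
    "laplace_ppl_inv": 0.075, "lidstone_ppl_inv": 0.075,  # inverted perplexities
}

def cps(metrics: dict) -> float:
    """Composite Performance Score: weighted sum of normalized metric values."""
    assert abs(sum(CPS_WEIGHTS.values()) - 1.0) < 1e-9  # weights sum to 1.00
    return sum(w * metrics[name] for name, w in CPS_WEIGHTS.items())
```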
Table 3. Baseline configuration selection for multi-agent comparison.
| Model | Configuration | Threshold | CPS | T-CPS | CV | CPS Δ% | T-CPS Δ% | Balance Score | Selection |
|---|---|---|---|---|---|---|---|---|---|
| Mistral 7B | Baseline | — | 0.5181 | 0.5610 | 0.1501 | | | | |
| Mistral 7B | SELECTED | 0.95 | 0.5448 | 0.5911 | 0.1339 | +5.16% | +5.37% | 40.11 | Max T-CPS |
| Mistral 7B | Alternative 1 | 0.90 | 0.5332 | 0.5787 | 0.1312 | +2.93% | +3.16% | 24.07 | |
| Mistral 7B | Alternative 2 | 0.70 | 0.5321 | 0.5776 | 0.1283 | +2.70% | +2.97% | 23.13 | |
| Granite 3.2 8B | Baseline | — | 0.5112 | 0.5552 | 0.1243 | | | | |
| Granite 3.2 8B | SELECTED | 0.95 | 0.5176 | 0.5622 | 0.1240 | +1.25% | +1.26% | 10.12 | Max T-CPS |
| Granite 3.2 8B | Alternative 1 | 0.80 | 0.5172 | 0.5619 | 0.1220 | +1.18% | +1.20% | 9.85 | |
| Granite 3.2 8B | Alternative 2 | 0.75 | 0.5168 | 0.5613 | 0.1239 | +1.10% | +1.10% | 8.91 | |
| Llama 3.1 8B | Baseline | — | 0.4982 | 0.5394 | 0.1497 | | | | |
| Llama 3.1 8B | SELECTED | 0.90 | 0.5074 | 0.5495 | 0.1479 | +1.85% | +1.87% | 12.65 | Max T-CPS |
| Llama 3.1 8B | Alternative 1 | 0.55 | 0.5046 | 0.5478 | 0.1286 | +1.30% | +1.55% | 12.05 | |
| Llama 3.1 8B | Alternative 2 | 0.80 | 0.5039 | 0.5450 | 0.1589 | +1.15% | +1.04% | 6.57 | |

Notes: All configurations evaluated on 369 question–answer pairs from the Climate-Smart Agriculture dataset; CPS: Composite Performance Score (see Section 3.3 for metric details); T-CPS: Threshold-aware CPS with stability reward α = 0.1 and variability penalty β = 0.05; CV: Coefficient of Variation (σ/μ; lower values indicate more consistent performance); Balance Score: T-CPS Δ% / CV (measures the stability–performance trade-off); Selection criterion: maximum T-CPS among configurations with positive improvements and acceptable stability.
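The CV and Balance Score columns follow directly from the definitions in the table notes (CV = σ/μ; Balance Score = T-CPS Δ% / CV). A minimal sketch:

```python
import statistics

def coefficient_of_variation(scores) -> float:
    """CV = sigma / mu; lower values indicate more consistent performance."""
    return statistics.pstdev(scores) / statistics.mean(scores)

def balance_score(tcps_delta_pct: float, cv: float) -> float:
    """Stability-performance trade-off from the Table 3 notes: T-CPS Δ% / CV."""
    return tcps_delta_pct / cv
```

For the selected Mistral 7B row, `balance_score(5.37, 0.1339)` gives ≈ 40.1, consistent with the tabulated 40.11 (computed from unrounded inputs).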
Table 4. Multi-agent coordination performance ranked by degradation magnitude.
| Rank | Configuration | Type | CPS | T-CPS | Δ CPS (%) | Δ T-CPS (%) | t-Stat | p-Value | d | \|d\| | Effect | Sig. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | **BASELINES** | | | | | | | | | | | |
| — | Mistral 7B | Baseline | 0.5758 | 0.6226 | | | | | | | | |
| — | Granite 3.2 8B | Baseline | 0.5622 | 0.6087 | | | | | | | | |
| — | Llama 3.1 8B | Baseline | 0.5363 | 0.5793 | | | | | | | | |
| | **MINIMAL DEGRADATION (Δ > −10%)** | | | | | | | | | | | |
| 1 | Mistral-Opt Hierarchical | Optimized | 0.5505 | 0.5952 | −4.39 | −4.40 | −2.775 | 0.007 | −0.28 | 0.28 | Small | ** |
| 2 | Llama Sequential | Original | 0.5103 | 0.5511 | −4.85 | −4.87 | −2.972 | 0.004 | −0.30 | 0.30 | Small | ** |
| 3 | Mistral-Opt Sequential | Optimized | 0.5454 | 0.5897 | −5.28 | −5.28 | −3.404 | <0.001 | −0.34 | 0.34 | Small | *** |
| 4 | Llama Hierarchical | Original | 0.5078 | 0.5484 | −5.31 | −5.33 | −2.752 | 0.007 | −0.28 | 0.28 | Small | ** |
| 5 | Llama Competitive | Original | 0.5012 | 0.5413 | −6.54 | −6.56 | −4.058 | <0.001 | −0.41 | 0.41 | Small | *** |
| 6 | Llama-Opt Hierarchical | Optimized | 0.4989 | 0.5382 | −6.97 | −7.09 | −4.177 | <0.001 | −0.42 | 0.42 | Small | *** |
| 7 | Mistral-Opt Competitive | Optimized | 0.5279 | 0.5723 | −8.32 | −8.08 | −6.435 | <0.001 | −0.64 | 0.64 | Medium | *** |
| 8 | Llama-Opt Competitive | Optimized | 0.4907 | 0.5305 | −8.50 | −8.42 | −5.678 | <0.001 | −0.57 | 0.57 | Medium | *** |
| 9 | Llama-Opt Sequential | Optimized | 0.4905 | 0.5302 | −8.54 | −8.48 | −5.677 | <0.001 | −0.57 | 0.57 | Medium | *** |
| | **MODERATE DEGRADATION (−20% < Δ ≤ −10%)** | | | | | | | | | | | |
| 10 | Granite-SCR Competitive | SCR | 0.4823 | 0.5210 | −14.21 | −14.41 | −9.853 | <0.001 | −0.99 | 0.99 | Large | *** |
| 11 | Granite-Opt Competitive | Optimized | 0.4789 | 0.5180 | −14.82 | −14.90 | −10.955 | <0.001 | −1.10 | 1.10 | Large | *** |
| 12 | Granite-SCR Hierarchical | SCR | 0.4715 | 0.5102 | −16.13 | −16.18 | −12.358 | <0.001 | −1.24 | 1.24 | V.Large | *** |
| 13 | Granite-Opt Sequential | Optimized | 0.4703 | 0.5084 | −16.35 | −16.48 | −12.041 | <0.001 | −1.20 | 1.20 | V.Large | *** |
| 14 | Granite-Opt Hierarchical | Optimized | 0.4637 | 0.5020 | −17.52 | −17.53 | −14.087 | <0.001 | −1.41 | 1.41 | V.Large | *** |
| | **SEVERE DEGRADATION (−30% < Δ ≤ −20%)** | | | | | | | | | | | |
| 15 | Mistral-Opt Collaborative | Optimized | 0.4567 | 0.4961 | −20.68 | −20.32 | −21.500 | <0.001 | −2.15 | 2.15 | V.Large | *** |
| 16 | Llama-Opt Collaborative | Optimized | 0.4236 | 0.4611 | −21.01 | −20.40 | −25.945 | <0.001 | −2.59 | 2.59 | V.Large | *** |
| 17 | Granite-Opt Collaborative | Optimized | 0.4421 | 0.4810 | −21.36 | −20.98 | −25.241 | <0.001 | −2.52 | 2.52 | V.Large | *** |
| 18 | Granite-SCR Sequential | SCR | 0.4335 | 0.4658 | −22.89 | −23.48 | −14.282 | <0.001 | −1.43 | 1.43 | V.Large | *** |
| 19 | Mistral Competitive | Original | 0.4318 | 0.4663 | −25.01 | −25.10 | −27.854 | <0.001 | −2.79 | 2.79 | V.Large | *** |
| 20 | Granite Competitive | Original | 0.4198 | 0.4534 | −25.33 | −25.51 | −29.753 | <0.001 | −2.98 | 2.98 | V.Large | *** |
| 21 | Mistral Hierarchical | Original | 0.4187 | 0.4522 | −27.28 | −27.37 | −24.877 | <0.001 | −2.49 | 2.49 | V.Large | *** |
| 22 | Llama Collaborative | Original | 0.3849 | 0.4157 | −28.23 | −28.24 | −38.486 | <0.001 | −3.85 | 3.85 | V.Large | *** |
| 23 | Granite Hierarchical | Original | 0.4015 | 0.4336 | −28.58 | −28.77 | −34.973 | <0.001 | −3.50 | 3.50 | V.Large | *** |
| 24 | Mistral Sequential | Original | 0.4098 | 0.4426 | −28.83 | −28.91 | −23.972 | <0.001 | −2.40 | 2.40 | V.Large | *** |
| 25 | Granite Sequential | Original | 0.3977 | 0.4295 | −29.26 | −29.44 | −35.898 | <0.001 | −3.59 | 3.59 | V.Large | *** |
| | **EXTREME DEGRADATION (Δ ≤ −30%)** | | | | | | | | | | | |
| 26 | Granite-SCR Collaborative | SCR | 0.3852 | 0.4195 | −31.48 | −31.08 | −46.857 | <0.001 | −4.69 | 4.69 | V.Large | *** |
| 27 | Mistral Collaborative | Original | 0.3736 | 0.4035 | −35.12 | −35.19 | −49.716 | <0.001 | −4.97 | 4.97 | V.Large | *** |
| 28 | Granite Collaborative | Original | 0.3637 | 0.3928 | −35.31 | −35.47 | −60.914 | <0.001 | −6.09 | 6.09 | V.Large | *** |

Notes: Statistical significance: *** p < 0.001, ** p < 0.01, * p < 0.05; Δ values: percentage difference from the respective model baseline (negative indicates degradation); Cohen's d: standardized effect size computed as d = t/√n, where n = 100; Effect-size thresholds: Negligible (|d| < 0.2), Small (0.2–0.5), Medium (0.5–0.8), Large (0.8–1.2), Very Large (|d| > 1.2); Type: Original = simple response aggregation with independent retrieval; Optimized = Two-Phase Collaborative Consensus (for Collaborative) or shared context retrieval only (for other strategies); SCR = Shared Context Retrieval only (Granite 3.2 8B); N = 100 test instances per configuration.
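The effect-size columns can be reproduced from the t-statistics using the formula in the notes, d = t/√n with n = 100, and the stated |d| thresholds:

```python
import math

def cohens_d(t_stat: float, n: int) -> float:
    """Standardized effect size from a t-statistic: d = t / sqrt(n)."""
    return t_stat / math.sqrt(n)

def effect_label(d: float) -> str:
    """Effect-size bins from the Table 4 notes."""
    a = abs(d)
    if a < 0.2:
        return "Negligible"
    if a < 0.5:
        return "Small"
    if a < 0.8:
        return "Medium"
    if a <= 1.2:
        return "Large"
    return "Very Large"
```

For rank 1, `cohens_d(-2.775, 100)` gives −0.2775 ≈ −0.28, classified as Small, as tabulated.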
Table 5. Token consumption and processing time by model and strategy.
| Model | Strategy | Processing Time (s), Mean (SD) | Total Tokens, Mean (SD) |
|---|---|---|---|
| Granite 3.2 8B | Collaborative | 2211.0 (622.9) | 3481 (864) |
| | Sequential | 2257.2 (747.8) | 2251 (650) |
| | Competitive | 1998.3 (587.5) | 2179 (636) |
| | Hierarchical | 2247.2 (611.7) | 2351 (629) |
| Mistral 7B | Collaborative | 1584.7 (525.1) | 3076 (924) |
| | Sequential | 1494.6 (501.1) | 1920 (659) |
| | Competitive | 1798.6 (550.4) | 2043 (644) |
| | Hierarchical | 1868.3 (573.5) | 1958 (613) |
| Llama 3.1 8B | Collaborative | 56.1 (26.3) | 1410 (707) |
| | Sequential | 57.3 (27.0) | 804 (481) |
| | Competitive | 52.6 (27.7) | 813 (494) |
| | Hierarchical | 59.8 (31.0) | 842 (568) |
Notes: Processing time in seconds; Granite and Mistral on Intel Xeon CPU; Llama on Apple M1; Direct timing comparisons across models not valid due to hardware differences; Token counts are hardware-independent and comparable across models.
Table 6. Sensitivity analysis results.
| Analysis | Metric | Value |
|---|---|---|
| Correlation | Mean r(α) | +0.9993 |
| | Mean r(β) | −0.034 |
| | Configs with significant α (p < 0.001) | 31/31 |
| | Configs with significant β (p < 0.05) | 0/31 |
| Variance Decomposition | Variance explained by α | 99.87% |
| | Variance explained by β | 0.13% |
| Regression | Mean b₁ (α coefficient) | +0.396 |
| | Mean b₂ (β coefficient) | −0.022 |
| | R² (all configurations) | 1.000 |
| Effect Magnitude | Mean T-CPS range | 0.082 |
| | Mean T-CPS % change | 17.1% |
| Ranking Stability | Configurations with zero rank change | 29/31 (93.5%) |
| | Maximum rank change observed | 1 position |
| | Top 2 positions stable | Yes (all 25 combinations) |

Note: the two configurations with a rank change, Granite-SCR-Sequential and Llama8B-Optimized-Collaborative, alternated between ranks 21 and 22.
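The correlation pattern in Table 6 (T-CPS driven almost entirely by α, with a weak negative dependence on β) can be reproduced in miniature with a surrogate model. This sketch is illustrative only: the linear `tcps` form below is an assumption, not the paper's actual T-CPS definition; it merely shows how a dominant α term yields r(α) near +1 and a small negative r(β) over a parameter grid.

```python
import math
from itertools import product

def pearson(x, y) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x)
                           * sum((b - my) ** 2 for b in y))

# Surrogate T-CPS: an assumed linear response to the stability reward (alpha)
# and variability penalty (beta); cps, stability s, and CV are held fixed.
def tcps(alpha: float, beta: float, cps=0.55, s=0.8, cv=0.13) -> float:
    return cps * (1 + alpha * s - beta * cv)

grid = list(product([0.00, 0.05, 0.10, 0.15, 0.20],        # alpha values
                    [0.000, 0.025, 0.050, 0.075, 0.100]))  # beta values
scores = [tcps(a, b) for a, b in grid]
r_alpha = pearson([a for a, _ in grid], scores)  # strongly positive
r_beta = pearson([b for _, b in grid], scores)   # weakly negative
```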
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

Radeva, I.; Popchev, I.; Doukovska, L.; Dimitrova, M. Multi-Agent Coordination Strategies vs. Retrieval-Augmented Generation in LLMs: A Comparative Evaluation. Electronics 2025, 14, 4883. https://doi.org/10.3390/electronics14244883