5.1. Model Capability vs. Coordination Complexity
Configurations containing Qwen 2.5 7B consistently outperformed alternatives regardless of the coordination strategy applied. The single-agent baseline (S1) with Qwen in few-shot mode achieved the highest F1 (82.6%) and WSS@95 (43.4%), outperforming every multi-agent alternative. This pattern held across all tested configurations, but it should be interpreted within the context of the selected domain: blockchain-based e-voting is a terminologically overloaded setting, which amplifies ambiguity in title-abstract screening. Generalisability to other domains remains to be validated.
Figure 6 illustrates this comparison across the top five configurations. The identical GS and FC values for the two few-shot configurations confirm that screening behaviour remained stable at full corpus scale, while the single-agent baseline maintained a clear advantage across metrics.
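For reference, the two headline metrics can be computed from a confusion matrix as in the minimal sketch below. It assumes the standard definitions, F1 = 2TP / (2TP + FP + FN) and WSS@R = (TN + FN) / N − (1 − R); the function names and the example counts are hypothetical and are not taken from the study's evaluation script.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 as the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def wss_at(tn: int, fn: int, total: int, recall_level: float = 0.95) -> float:
    """Work saved over sampling at a given recall level (assumed standard form).

    (tn + fn) is the number of papers the screener excluded, i.e. the papers
    a human reviewer no longer has to read.
    """
    return (tn + fn) / total - (1.0 - recall_level)


# Hypothetical counts for illustration only (not the study's confusion matrix).
example_f1 = f1_score(tp=50, fp=20, fn=0)
example_wss = wss_at(tn=120, fn=0, total=190)
```

Under this definition, WSS@95 rewards every correctly or incorrectly excluded paper equally, so it should be read together with recall rather than on its own.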
This outcome does not support the assumption that ensemble coordination compensates for individual model weaknesses. When one model performs better, adding weaker models reduces overall quality. S2 (majority voting) and S4 (confidence-weighted aggregation) produced identical results for all metrics. This indicates that self-reported confidence from 7–8B parameter models does not add meaningful differentiation. This pattern is consistent with findings from a recent multi-agent screening study using API-based models, where majority voting also outperformed more complex adjudication and debate strategies [23]. S5 (two-stage filtering) offered computational efficiency by filtering up to 41% of papers in Stage 1, but did not improve precision. S3 (recall-focused OR) amplified false positives without recall gains.
These results align with recent findings on multi-agent coordination with models of comparable size. A comparative evaluation of four coordination strategies against single-agent RAG baselines reported consistent performance degradation, with coordination overhead identified as the primary factor [22]. This study extends this observation from question answering to binary screening. Coordination overhead appears to be a general limitation at the 7–8B scale, consistent with a recent evaluation of 18 LLMs across three biomedical systematic reviews [27], where model rankings varied substantially across domains. Mistral achieved the highest inter-rater agreement (PABAK = 0.621) in clinical reviews, whereas Qwen dominated in the present interdisciplinary domain, suggesting that model superiority is domain-dependent rather than absolute. Notably, smaller models (llama3.1:8b, MCC = 0.302) outperformed their larger counterparts (llama3.1:70b, MCC = 0.242) in that study, reinforcing the observation that parameter count alone does not determine screening quality at this scale. Whether larger models (13B–70B) benefit from coordination remains an open question. Greater reasoning diversity among agents could make ensemble deliberation productive at higher parameter scales.
The pairwise McNemar’s tests (Section 4.6) confirmed that S1 Qwen few-shot was statistically superior to all four multi-agent alternatives (p ≤ 0.0005, power > 0.93), while no significant differences were detected among Ranks 2–5 (all p = 1.0, power ≤ 0.07). The ranking in Table 9 for the multi-agent configurations therefore reflects point estimates rather than statistically significant performance gaps, whereas the single-agent advantage is statistically confirmed.
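To make the logic of the pairwise comparison concrete, the following is a minimal sketch of an exact two-sided McNemar test on discordant pairs. It is an illustration only: this section does not state whether the exact or the chi-square variant was used, and the function name and the counts in the usage lines are hypothetical.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from discordant pair counts.

    b: papers classified correctly by configuration A only
    c: papers classified correctly by configuration B only
    Papers on which both configurations are right (or both wrong) carry no
    information about which configuration is better and are ignored.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: the configurations are indistinguishable
    k = min(b, c)
    # Lower tail of Binomial(n, 0.5), doubled for a two-sided test and capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts, chosen only to show the two regimes.
p_one_sided_split = mcnemar_exact(12, 0)  # all discordant pairs favour A
p_balanced_split = mcnemar_exact(5, 5)    # discordant pairs evenly split
```

When the discordant pairs are evenly split, the p-value is 1.0, which is the situation in which rankings between configurations reflect point estimates only.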
The equivalence between S4 and S2 is explained by the confidence distributions of the individual models serving as agents in both strategies. Table 19 presents the self-reported confidence levels for each model across all 2036 full corpus papers, together with the strategy-level agreement between S4 and S2.
LLaMA 3.1 8B and Mistral 7B reported HIGH confidence on 99.5% and 96.2% of all papers, respectively, providing virtually no discriminative signal for aggregation weighting. Only Qwen 2.5 7B exhibited meaningful variation, reporting HIGH confidence on 68.3% of INCLUDE decisions and 93.7% of EXCLUDE decisions, with the remaining votes classified as MEDIUM. When two of three models assign HIGH confidence to the same decision, the weighted sum in S4 produces the same outcome as unweighted majority voting. The two papers on which S4 and S2 disagreed both had two models returning UNCERTAIN, a category that does not affect INCLUDE/EXCLUDE screening counts. The three-level confidence scale (HIGH = 0.9, MEDIUM = 0.7, LOW = 0.5) proved too coarse for effective differentiation at the 7–8B parameter scale. Finer-grained calibration or continuous confidence scoring would be required for confidence weighting to offer a practical advantage over majority voting. All computations were performed using a dedicated Python script available in the project repository.
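The coarseness of the scale can be checked by enumeration. The sketch below assumes that S4 sums the confidence weights per label and selects the label with the larger total, and it considers only INCLUDE/EXCLUDE votes from three agents; the actual S4 rule, in particular its handling of UNCERTAIN, may differ. All function names are hypothetical.

```python
from itertools import product

# Confidence scale as stated in the text.
WEIGHTS = {"HIGH": 0.9, "MEDIUM": 0.7, "LOW": 0.5}


def majority_vote(votes):
    """votes: sequence of (label, confidence); label is INCLUDE or EXCLUDE."""
    includes = sum(1 for label, _ in votes if label == "INCLUDE")
    return "INCLUDE" if includes * 2 > len(votes) else "EXCLUDE"


def weighted_vote(votes):
    """Assumed S4-style rule: sum the confidence weights per label."""
    score = {"INCLUDE": 0.0, "EXCLUDE": 0.0}
    for label, confidence in votes:
        score[label] += WEIGHTS[confidence]
    return "INCLUDE" if score["INCLUDE"] > score["EXCLUDE"] else "EXCLUDE"


def count_disagreements(n_agents: int = 3) -> int:
    """Count vote combinations on which the two rules decide differently."""
    options = list(product(["INCLUDE", "EXCLUDE"], WEIGHTS))  # 6 per agent
    return sum(
        1
        for votes in product(options, repeat=n_agents)
        if majority_vote(votes) != weighted_vote(votes)
    )


disagreements = count_disagreements()  # 0 under the stated assumptions
```

Under these assumptions, two agreeing votes always carry a combined weight of at least 1.0, while a single dissenting vote carries at most 0.9, so weighting cannot overturn a two-to-one majority regardless of the confidence levels reported.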
Table 20 quantifies the computational cost of each strategy on the full corpus (n = 2036). Because S1, S2, and S4 were executed on an Apple MacBook Air M2 (16 GB) and S5 on an Apple MacBook Pro M1 Pro (16 GB), wall-clock times are not directly comparable between strategy groups; token counts serve as the hardware-independent cost metric.
S2 and S4 consumed 1.36× the tokens per paper compared to S1, with no improvement in screening performance, which provides quantitative evidence for the coordination overhead observed in Section 4.2.3. The two strategies were computationally identical (same inference calls, same total tokens), differing only in aggregation logic. S5 (two-stage) consumed between 0.94× and 1.19× the tokens of S1: Stage 1 filtering resolved 35–41% of papers with a single model call, partially offsetting the cost of three-model evaluation in Stage 2. Because Qwen produces substantially longer responses than Mistral and LLaMA, the S5 configuration with Mistral as Stage 1 filter generates fewer total output tokens than S1 with Qwen, despite requiring more model calls. All computations were performed using a dedicated Python script available in the project repository.
To quantify the effect of model composition on ensemble performance, a pairwise ablation was conducted comparing ensembles containing Granite 3.3 8B (MLG: Mistral + LLaMA + Granite) against ensembles where Granite was replaced by Qwen 2.5 7B (MLQ: Mistral + LLaMA + Qwen). Table 21 presents the results.
Recall was 1.000 in all ten configurations. Replacing Granite with Qwen reduced false positives by 19 to 44 papers across strategies. The effect was most pronounced under S3 (recall-focused OR), where Granite’s near-universal INCLUDE behaviour (199/200 in zero-shot, FP = 126 as a single agent) propagated directly through the OR aggregation. Under majority voting (S2), Granite’s vote was outnumbered when Mistral and LLaMA agreed on EXCLUDE, partially mitigating the damage. These results suggest that ensemble coordination does not compensate for a non-discriminative model.
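The difference between the two aggregation rules in the presence of a non-discriminative agent can be illustrated with a toy example. The sketch below assumes that S3 includes a paper if any agent votes INCLUDE and that S2 requires a strict majority; the six papers and their votes are hypothetical and do not come from the study's data.

```python
def or_rule(votes):
    """Recall-focused OR: include if any agent votes INCLUDE."""
    return any(votes)


def majority_rule(votes):
    """Majority voting: include if more than half of the agents vote INCLUDE."""
    return sum(votes) * 2 > len(votes)


def count_false_positives(per_paper_votes, relevant, rule):
    """Count papers the ensemble includes although they are not relevant."""
    return sum(
        1
        for votes, is_relevant in zip(per_paper_votes, relevant)
        if rule(votes) and not is_relevant
    )


# Hypothetical toy data. Vote order: (agent_a, agent_b, always_include_agent);
# True means INCLUDE. The third agent includes every paper.
votes = [
    (True, True, True),    # relevant paper
    (False, False, True),  # irrelevant
    (False, False, True),  # irrelevant
    (True, False, True),   # irrelevant, agent_a also errs
    (False, False, True),  # irrelevant
    (True, True, True),    # relevant paper
]
relevant = [True, False, False, False, False, True]

fp_or = count_false_positives(votes, relevant, or_rule)              # 4
fp_majority = count_false_positives(votes, relevant, majority_rule)  # 1
```

In this toy setting, the OR rule passes every irrelevant paper through because of the always-include agent, whereas majority voting admits a false positive only when a second agent makes the same error.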
5.2. Error Patterns and the Precision Ceiling
Error analysis (RQ3) showed that false positive accumulation was the dominant source of screening error. Only two false negatives were observed across configurations, while a substantial number of papers were repeatedly classified as false positives across strategies.
The moderate inter-rater agreement (κ = 0.515) indicates that a portion of disagreement arises from inherent ambiguity in the classification task. This is consistent with prior observations that screening agreement decreases in interdisciplinary domains with overlapping terminology [24]. In such cases, differences between model predictions and ground truth may reflect uncertainty in the labels rather than purely model error.
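As a reminder of how the agreement statistics used in this discussion behave, the sketch below computes Cohen's κ and PABAK from a 2×2 rater agreement table using their standard definitions. The function names and the example tables are hypothetical; they are not the rater data behind the reported κ = 0.515.

```python
def cohen_kappa(a: int, b: int, c: int, d: int) -> float:
    """Cohen's kappa for two raters and a binary decision.

    a: both raters include, b: only rater 1 includes,
    c: only rater 2 includes, d: both raters exclude.
    """
    n = a + b + c + d
    observed = (a + d) / n
    # Chance agreement from each rater's marginal include/exclude rates.
    expected = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2
    if expected == 1.0:
        return 1.0  # degenerate table: both raters always give the same label
    return (observed - expected) / (1.0 - expected)


def pabak(a: int, b: int, c: int, d: int) -> float:
    """Prevalence- and bias-adjusted kappa: 2 * observed agreement - 1."""
    n = a + b + c + d
    return 2 * (a + d) / n - 1


# Two hypothetical tables with the same observed agreement (80%) but
# different prevalence of INCLUDE decisions.
balanced = (40, 10, 10, 40)
skewed = (5, 10, 10, 75)
kappa_balanced, kappa_skewed = cohen_kappa(*balanced), cohen_kappa(*skewed)
pabak_balanced, pabak_skewed = pabak(*balanced), pabak(*skewed)
```

With identical observed agreement, κ is lower for the skewed table than for the balanced one while PABAK is unchanged, which is one reason κ values from screening tasks with few included papers should be compared across studies with care.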
The sensitivity analysis by agreement subset (Section 4.5) provides empirical support for this interpretation. The precision drop on disputed papers (0.375–0.452 for the top five configurations) aligns with the observation that the dominant false positives correspond to papers near the inclusion–exclusion boundary. When ground-truth labels themselves are uncertain, lower precision is an expected artefact of label ambiguity rather than an indication of reduced model capability.
The persistent false positives identified in Section 4.4 matched inclusion keywords but did not meet the contribution requirements. This suggests that distinguishing contribution type from topic may exceed the information available in titles and abstracts alone.
The relationship between exclusion criteria usage and false positive rates supports this interpretation. Models that explicitly applied exclusion criteria produced fewer false positives, while models that appeared to rely on surface-level matching showed limited discriminative capacity. This observation aligns with prior work showing that the formulation and application of inclusion and exclusion criteria strongly influence screening outcomes [27].
5.3. Implications for Practice
For practical applications of LLM-assisted screening, differences between models produced substantially larger performance variations than differences between strategies.
For well-defined domains, a single capable model with few-shot prompting may be sufficient. The observed workload reduction (WSS@95 = 43.4%) is comparable to ranges reported in prior studies of LLM-assisted screening [27], despite the absence of retrieval augmentation in the present framework.
Multi-agent strategies remain relevant in cases where no single model achieves acceptable recall. In such settings, majority voting provides a simple approach, while two-stage strategies may offer computational savings.
Because screening criteria are externalised (Section 3.1), applying the framework to a new domain requires only redefining these criteria and constructing a domain-specific Gold Standard. Empirical validation on additional domains remains a direction for future work. Existing reporting frameworks such as RDAL [12] and PRISMA-trAIce [13] address transparency in AI-assisted reviews, while the blockchain-based audit mechanism proposed here complements these approaches by providing infrastructure-level decision traceability. Similar blockchain-based logging approaches have been adopted in other digital governance domains, such as infrastructure security for IoT networks [38].
5.4. Limitations
The framework was validated on a single domain. The results should therefore be interpreted as domain-specific, and generalisability to other domains remains to be established.
The Gold Standard was constructed exclusively from Pool A, papers containing voting-related keywords, which intentionally over-represents terminologically ambiguous cases. While this design provides a rigorous stress test for screening accuracy, precision estimates derived from this subset may not generalise directly to the full corpus, where the proportion of clearly irrelevant papers is higher.
The Gold Standard of 200 papers (190 for evaluation) limits the statistical power of precision comparisons. The few-shot calibration examples were drawn from the same corpus, which may constitute indirect data leakage despite the exclusion of all 10 calibration papers from evaluation.
All models tested were instruction-tuned transformer variants in the 7–8B parameter range, deployed on local hardware. This limited architectural diversity may have contributed to the observed S4–S2 equivalence, as similarly sized models tend to produce correlated confidence estimates. Larger models may benefit more from multi-agent coordination, where greater parameter capacity could support productive ensemble deliberation. Recent evidence from API-based models (GPT-4o Mini, Claude 3 Haiku, Gemini 1.5 Flash) supports this hypothesis, as multi-agent collaboration yielded consistent improvements over individual baselines at higher parameter scales [23].
The inter-rater agreement (κ = 0.515) introduces uncertainty in the ground truth labels, which may affect the interpretation of model performance.
Prompt sensitivity was not examined: the same prompt template was used for all experiments, and the results may differ with alternative wordings or instruction structures. The conclusion that model selection was more important than strategy selection was derived from four models; a larger and more diverse pool of models would be needed to confirm this pattern.