Comparing Single-Agent and Multi-Agent Strategies in LLM-Based Title-Abstract Screening

Radeva, Irina; Noncheva, Teodora; Doukovska, Lyubka; Popchev, Ivan

doi:10.3390/electronics15081661

Open AccessArticle

Comparing Single-Agent and Multi-Agent Strategies in LLM-Based Title-Abstract Screening

¹

Intelligent Systems Department, Institute of Information and Communication Technologies, Bulgarian Academy of Sciences, 1113 Sofia, Bulgaria

²

Faculty of Digital and Green Technologies, Trakia University, Student Town, 6000 Stara Zagora, Bulgaria

³

Bulgarian Academy of Sciences, 1000 Sofia, Bulgaria

^*

Author to whom correspondence should be addressed.

Electronics 2026, 15(8), 1661; https://doi.org/10.3390/electronics15081661

Submission received: 24 March 2026 / Revised: 9 April 2026 / Accepted: 14 April 2026 / Published: 15 April 2026

(This article belongs to the Special Issue Data Mining in Natural Language Processing: Latest Advances and Prospects)

Download

Browse Figures

Review Reports Versions Notes

Abstract

Title-abstract screening remains labour-intensive, especially in interdisciplinary domains where shared terminology increases misclassification risk. This study compared five LLM coordination strategies—single-agent baseline, majority voting, recall-focused ensemble, confidence-weighted aggregation, and two-stage filtering—using four 4-bit quantised open-source models (Mistral 7B, LLaMA 3.1 8B, Granite 3.3 8B, Qwen 2.5 7B) in zero-shot and few-shot configurations. The evaluation was conducted on a Gold Standard of 200 papers from a corpus of 2036 records on blockchain-based e-voting. The best-performing configuration—a single-agent strategy with Qwen 2.5 7B in few-shot mode—achieved recall of 100%, precision of 70.4%, F1 of 82.6%, and a 43.4% reduction in manual screening effort, outperforming all multi-agent alternatives. Confidence-weighted aggregation produced results identical to majority voting, indicating that self-reported confidence from 7–8B parameter models did not add discriminative value. All screening decisions were logged on a private blockchain with timestamped anchoring for reproducibility. These results suggest that, for domain-specific screening tasks, careful model selection outweighs multi-agent coordination overhead, and that few-shot prompting with a well-matched model can achieve human-level recall with substantially reduced manual effort.

Keywords:

large language models (LLMs); screening tasks; LLM coordination strategies; model selection; few-shot prompting; reproducibility; blockchain-verified audit trail

1. Introduction

1.1. Problem

Structured literature screening processes are the foundation of evidence-based research. They follow structured protocols to identify, screen, and synthesise all relevant studies on a given topic [1]. The screening stage is widely recognised as the most resource-intensive phase of this process. Thousands of candidate papers must be evaluated against predefined criteria, typically by two independent reviewers [2].

The challenge is not only one of scale. In interdisciplinary research areas, the vocabulary of the target field often overlaps with that of adjacent fields. The same terms carry different meanings in different disciplines, and papers from neighbouring domains may appear relevant based on their titles and abstracts but fall outside the scope of the review. This terminological overlap increases both the volume of candidates and the difficulty of distinguishing target from non-target papers. Even experienced reviewers may disagree on borderline cases, making screening in such domains particularly slow and inconsistent.

1.2. Opportunity

Large language models can process title—abstract pairs and produce structured inclusion/exclusion decisions without task-specific training [3]. Several studies have shown that LLM-assisted screening can reduce manual workload while maintaining high recall [4]. However, most existing work has relied on proprietary models accessed through commercial APIs. Few studies have explored whether locally deployed open-source models can perform this task effectively. Fewer still have examined how multiple LLM agents can be coordinated to improve screening decisions, and whether different coordination strategies suit different types of domains.

1.3. Research Questions

Three research questions define the scope of this study.

RQ1: How does the type of coordination strategy—from simple voting to two-stage filtering—affect screening performance when using heterogeneous open-source LLM ensembles?

RQ2: Which strategy and model combination achieves the best balance between recall and screening effort reduction?

RQ3: What systematic error patterns emerge when LLMs screen papers in a terminologically overloaded domain?

1.4. Goal and Tasks

To address these questions, the goal of this study is to compare single-agent and multi-agent LLM coordination strategies for automated title—abstract screening. Four locally deployed, 4-bit quantised open-source models are tested in a terminologically overloaded domain—blockchain-based e-voting, i.e., systems that employ distributed ledger technology to record, verify, or audit votes in public or institutional elections. In this domain, terms such as “voting”, “election”, and “consensus” carry purely technical meaning in blockchain literature, making automated screening particularly challenging.

To achieve this goal, the study was structured into five sequential tasks, each corresponding to a stage of the experimental workflow illustrated in Figure 1:

To implement LLM-based screening strategies and human expert screening within a single platform with blockchain audit logging.
To construct the target corpus through systematic search, deduplication, and keyword-based filtering from five open-access academic databases.
To define domain-specific inclusion and exclusion criteria for title—abstract screening.
To construct a Gold Standard through dual independent human screening with disagreement resolution, and to select few-shot examples for LLM calibration.
To evaluate all configurations on the Gold Standard, apply the top-ranked ones to the full corpus, and analyse systematic error patterns.

This study makes four contributions. First, it provides a controlled comparison of single-agent and multi-agent LLM coordination strategies for structured title-abstract screening, with pairwise statistical testing (McNemar’s exact test) confirming that performance differences among the top configurations are not statistically significant at the current sample size. Second, it demonstrates that, at the 7–8B parameter scale, model selection was the primary determinant of screening quality, outweighing coordination strategy—a finding with direct practical implications for resource-constrained deployments. Third, it identifies a persistent precision ceiling driven by terminological ambiguity in title-abstract screening, supported by cross-strategy error analysis on both the Gold Standard and the full corpus. Fourth, the framework is applied and evaluated in an interdisciplinary, terminologically overloaded domain, demonstrating the practical feasibility and the domain-specific challenges of automated screening.

1.5. Paper Organisation

The remainder of this paper is organised as follows. Section 2 reviews related work on literature screening and LLM-assisted screening approaches. Section 3 describes the proposed framework, including corpus construction, Gold Standard creation, strategy design, and evaluation protocol. Section 4 presents the experimental results. Section 5 discusses findings, practical recommendations, and limitations. Section 6 concludes the paper.

2. Related Work

This section reviews research in three areas relevant to the present study: (a) LLM-assisted screening in systematic reviews, (b) multi-agent and ensemble LLM strategies, and (c) inter-rater reliability in screening. The section concludes with a summary of the identified research gap.

2.1. LLM-Assisted Screening in Systematic Reviews

The PRISMA 2020 guidelines require transparent documentation of study selection procedures, including any automation tools [1]. Title-abstract screening remains one of the most labour-intensive phases. An analysis of human reviewers in multiple systematic reviews reported a mean error rate of 10.76% during abstract screening [5], providing an empirical baseline for evaluating automated approaches.

Early automation relied on traditional machine learning. A voting perceptron classifier was applied to 15 drug class reviews, and the Work Saved over Sampling at 95% recall (WSS@95) metric was introduced as a standard measure of screening efficiency [2]. Active learning tools such as ASReview further improved efficiency by prioritising records for human review using multiple classifiers and query strategies [6]. However, these tools still require iterative human labelling.

Large language models introduced a different approach. An evaluation of ChatGPT (GPT-3.5 Turbo) for systematic review screening found performance comparable to traditional classifiers such as support vector machines, without task-specific training [7]. A pre-registered study of GPT-4 tested title/abstract screening, full-text review, and data extraction across peer-reviewed, grey, and non-English literature. GPT-4 achieved high specificity but variable sensitivity depending on dataset balance. It was concluded that GPT-4 may function as a secondary reviewer but should not replace human judgement entirely [8]. A hybrid workflow combining LLM analysis with human verification was also proposed, where the LLM identified misclassified articles that were missed during human-only screening [9]. A methodological review emphasised that recall must be prioritised in early screening phases and that iterative prompt refinement is essential [10]. A three-layer strategy using GPT-3.5 and GPT-4 was proposed, where each layer evaluated a different inclusion criterion: research design, target population, and intervention [11]. This layered approach is conceptually similar to the two-stage filtering strategy (S5) in this study. The insufficiency of existing reporting standards for AI-aided screening has been explicitly documented. The RDAL checklist demonstrated that PRISMA guidelines do not require detailed recording of screening decisions, model settings, or training data in active learning-aided reviews [12]. A broader extension, PRISMA-trAIce, was subsequently developed to cover transparent AI reporting across phases of a systematic review [13]. Both initiatives confirm that reproducibility in AI-assisted screening remains an open methodological challenge.

2.2. Multi-Agent and Ensemble LLM Strategies

A survey of LLM-based multi-agent systems identified three communication paradigms: cooperative, debate, and competitive [14]. Cooperative communication underlies majority voting (S2) and recall-focused aggregation (S3) in the present study, while strategy S5 adopts a two-stage design inspired by the debate paradigm but uses majority voting rather than iterative argumentation. A self-consistency decoding method samples multiple reasoning paths from a language model and selects the most frequent answer through majority voting [15]. This approach improved accuracy on arithmetic and commonsense benchmarks and provides the theoretical basis for strategy S2. In a complementary direction, a multi-agent debate approach refines responses through structured argumentation rounds among multiple model instances [16]. The debate mechanism improved both factual accuracy and reasoning by exposing errors through peer critique, informing the multi-model panel design of strategy S5. Adversarial debate combined with voting mechanisms was also investigated to reduce LLM hallucinations, using dynamic weighting to prioritise high-performing models [17].

A framework for reliable decision-making in multi-agent LLM systems compared aggregation strategies including majority voting, decentralised communication, and spoke-and-wheel architectures. Majority voting and decentralised approaches consistently formed the Pareto front of reliability across tasks [18]. These aggregation patterns are consistent with the majority voting (S2) and confidence-weighted (S4) strategies adopted here. Scaling the number of agents in a majority-voting framework also yielded consistent accuracy improvements [19], supporting the use of three-agent ensembles here.

Ensemble techniques have been applied to scientific literature classification by combining outputs from multiple LLMs using a confidence calibration framework. The ensemble achieved higher accuracy than any individual model [20]. A survey of ensemble approaches distinguished between model-level, parameter-level, and task-specific ensembles. Ensemble methods improved robustness but introduced additional computational costs [21]. This trade-off is a central consideration here, where all models were executed locally on consumer hardware. A comparative evaluation of four coordination strategies (collaborative, sequential, competitive, and hierarchical) against calibrated single-agent RAG baselines reported statistically significant performance degradation across 28 tested configurations, with coordination overhead identified as the primary contributing factor [22]. A recent evaluation applied three multi-agent collaboration strategies—majority voting, multiagent debate, and LLM-based adjudication—directly to abstract screening across 28 biomedical systematic reviews. Majority voting with three API-based models consistently outperformed individual models, while adjudicator-as-a-ranker achieved the best results among adjudication variants [23].

2.3. Inter-Rater Reliability in Systematic Reviews

Inter-rater reliability (IRR) is a fundamental concern in systematic reviews. An analysis of screening practices found that IRR is widely under-reported and that coding behaviour varies both between and within individuals over time [24]. A study of inter-reviewer reliability across clinical systematic reviews reported a mean Cohen’s kappa of 0.82 for abstract screening [25]. However, agreement decreased for interdisciplinary or emerging research areas where terminology was not yet standardised. Inter-rater agreement was also assessed using the PROBAST tool for prediction model studies, revealing kappa values between 0.04 and 0.26 at the domain level [26].

These findings suggest that moderate agreement levels are expected in interdisciplinary domains where terminology spans multiple fields, particularly when screening criteria require distinguishing application context rather than topic [24].

2.4. Research Gaps

The reviewed literature reveals several gaps. First, most evaluations of LLM screening have relied on proprietary models such as GPT-3.5 and GPT-4 [7,8,10,11]. Recent work has begun testing open-source models deployed locally via frameworks such as Ollama [27], but these evaluations assessed models individually rather than in coordinated multi-agent configurations. Although multi-agent strategies have recently been applied to biomedical screening with API-based models [23], no study has applied multi-agent coordination strategies using quantised open-source models deployed locally, nor evaluated such strategies in terminologically complex interdisciplinary domains.

Second, multi-agent strategies have been explored for general reasoning tasks [14,15,16,17,18,19] but not applied to systematic review screening. The present study compares five strategies (S1–S5) implemented from established multi-agent decision-making patterns [14,15,16,17,18,19,20,21]. A recent evaluation of multi-agent coordination for RAG-based question answering confirmed consistent performance degradation when applied to 7–8B parameter models [22]. Whether similar patterns emerge in the structurally different task of binary screening classification has not been investigated.

Third, no prior work has combined LLM-based screening with blockchain-based audit trails to ensure decision provenance and reproducibility. The reproducibility gap in AI-aided screening has been explicitly recognised—the RDAL checklist [12] and PRISMA-trAIce [13] address reporting standards, but infrastructure-level decision provenance remains unaddressed. A blockchain-based framework for logging AI decision provenance on a permissioned ledger was proposed for IoT environments [28]. An analysis of blockchain integration patterns showed that the choice of integration architecture affects the auditability of the resulting system [29]. However, the application of blockchain-based audit trails to systematic review screening has not been investigated.

Fourth, prior evaluations used primarily biomedical datasets [2,8,9,11]. The present study applies multi-agent screening to an interdisciplinary domain where terminology spans multiple fields.

3. Methods

3.1. Framework Overview

This study presents a task-driven framework for designing and evaluating multi-agent LLM coordination strategies in structured title-abstract screening tasks. The framework is designed to be domain-independent. Screening criteria are defined in a separate configuration module and injected into prompts at runtime; the framework can be applied to any corpus. It consists of five sequential phases (Figure 1):

Corpus construction. A domain-specific corpus is assembled from multiple open-access databases using a Boolean search strategy. The collected records are deduplicated and filtered to retain only papers relevant to the target domain.
Gold Standard creation. A subset of the corpus is sampled and screened independently by two human reviewers. Inter-rater agreement is measured using Cohen’s Kappa and PABAK. Disagreements are resolved by a third reviewer. The resulting consensus labels serve as ground truth for LLM evaluation.
Strategy design. Five LLM coordination strategies of increasing complexity are defined: single-agent screening (S1), majority voting (S2), recall-focused ensemble (S3), confidence-weighted aggregation (S4), and two-stage filtering (S5). Each strategy is tested in zero-shot and few-shot modes.
Evaluation. LLM screening decisions are compared against the Gold Standard using Recall, Precision, F1 Score, and Work Saved over Sampling at 95% recall (WSS@95). A minimum Recall threshold of 95% is applied, reflecting the requirement that systematic reviews must not miss relevant studies.
Blockchain audit. All screening decisions—both human and LLM—are logged to a private Antelope blockchain (nodeos v4.0.4). Periodic Merkle root anchoring to a public repository (Zenodo; manual deposit via web interface) and timestamping (OpenTimestamps reference client; https://opentimestamps.org) provide external verifiability without exposing individual records.

The eight steps are organised into five sequential phases, as indicated in Figure 1: corpus construction (Phase 1), Gold Standard creation (Phase 2), strategy design and LLM screening (Phase 3), evaluation with error analysis and full corpus validation (Phase 4), and blockchain audit (Phase 5). The framework was integrated into PaSSER-SR (https://github.com/scpdxtest/PaSSER-SR, accessed on 13 April 2026), an open-source platform described in the PaSSER-SR Platform section (Section 3.8). It was validated on a case study domain described in Section 3.2.

3.2. Case Study Domain

The framework was validated using a corpus of blockchain-based e-voting systems. This field was chosen because it poses a significant challenge for automated screening, given the substantial terminological overlap with related areas (see Section 1.4).

The models must distinguish the application context of shared terms rather than rely on keyword matching alone. For instance, a paper titled “A Secure Voting Protocol for Blockchain Governance” may contain all expected keywords yet fall entirely outside the scope of e-voting systems research.

The domain therefore serves as a rigorous test case. If the proposed coordination strategies achieve acceptable screening performance in a terminologically overloaded domain, they may reasonably be expected to perform at least as well in domains with cleaner terminological boundaries.

3.3. Dataset Construction

3.3.1. Search Strategy

A Boolean search strategy was designed to capture blockchain applications in electoral processes. Search terms were organised into two groups: Group A contained blockchain-related terms (blockchain, distributed ledger, DLT, smart contract, decentralised), and Group B contained electoral terms (voting, election, e-voting, electoral, ballot, referendum, voter registration, vote counting). Each query required at least one term from each group. Terms related to non-electoral blockchain mechanisms were excluded during post-processing (DAO voting, governance voting, governance token, token voting). The search covered publications from 2015 to 2025, spanning the period from blockchain’s emergence as a research topic to the present.

3.3.2. Data Sources

Papers were collected from five open-access databases: OpenAlex, Semantic Scholar, CORE, arXiv, and MDPI. This selection ensured broad coverage without reliance on subscription-based services. The Boolean query was adapted for each database API. Table 1 summarises the retrieval results.

3.3.3. Deduplication

A two-stage deduplication process was applied to the combined corpus. First, exact DOI matching identified duplicates between databases. Second, for records without DOIs, normalised title similarity was computed using the Ratcliff/Obershelp algorithm (Python 3.11.14 difflib.SequenceMatcher), with a threshold of 0.85. When duplicates were found, metadata was merged to retain the most complete record. This process removed 1208 duplicates, yielding 4021 unique papers of which 857 papers appeared in more than one database.

3.3.4. E-Voting Context Filtering

The unified corpus contained papers matching the broad Boolean query, including works on blockchain consensus mechanisms, decentralised governance, and smart city platforms that were not related to electoral processes.

A keyword-based filter was applied to retain only papers with explicit electoral context. The filter matched 48 domain-specific terms (e.g., election, ballot, voter, referendum, polling station, voter registration, parliamentary, presidential) against each paper’s title and abstract. Papers without any matching term were excluded. This step removed 1985 papers, producing a final corpus of 2036 papers for screening. Figure 2 presents the PRISMA 2020 flow diagram of the selection process.

3.4. Gold Standard Protocol

The evaluation of automated screening tools requires a set of expert-labelled decisions serving as ground truth—commonly referred to as a Gold Standard in the systematic review literature [2,6]. The following protocol was applied to construct such a set for this study.

3.4.1. Sampling

The Gold Standard was constructed from the filtered corpus of 2036 papers. Papers were partitioned into two pools based on the presence of electoral keywords in the title or abstract: Pool A (1954 papers with keywords) and Pool B (82 papers without keywords). A random sample of 200 papers was drawn from Pool A using a fixed seed (seed = 42) to ensure reproducibility. Pool B was not sampled due to its small size and the expectation that few papers would meet the inclusion criteria. All 200 papers contained at least one electoral keyword, providing a sample enriched for borderline cases where screening decisions are most difficult. The sampling strategy intentionally focuses on keyword-rich records, increasing the proportion of difficult cases. The resulting Gold Standard should therefore be interpreted as a stress-test set rather than a statistically representative sample of the full corpus.

3.4.2. Screening Criteria

Five inclusion criteria (IC1–IC5) and six exclusion criteria (EC1–EC6) were specifically defined for this study based on the scope of research into blockchain-based e-voting systems. The initial set was derived from the research questions and domain boundaries outlined in Section 3.2. This was then refined during a pilot screening of 30 papers selected from the corpus, prior to the main annotation round. IC1 and IC2 establish the mandatory intersection of the two research domains: blockchain technology and e-voting. IC3–IC5 narrow the scope to papers with empirical, security-related or implementation content. The six exclusion criteria target common sources of false inclusion identified during the pilot. EC1 and EC2 exclude non-e-voting blockchain applications. EC3 excludes non-research contributions. EC4 excludes governance mechanisms outside of public electoral voting. EC5 excludes incomplete records. EC6 excludes purely theoretical work without a concrete e-voting application. To be included, a paper is required to satisfy IC1 and IC2, and at least one for IC3–IC5, while not meeting any exclusion criterion. Table 2 presents the criteria definitions.

3.4.3. Screening Procedure

Prior to independent screening, the screening criteria were discussed between the two reviewers to align interpretation. Each reviewer then screened all 200 papers independently in blind mode using the PaSSER-SR Human Screening Module. For each paper, reviewers recorded a decision (INCLUDE, EXCLUDE, or UNCERTAIN), a confidence level (HIGH, MEDIUM, or LOW), the specific criteria met or violated, and free-text reasoning.

3.4.4. Inter-Rater Agreement and Disagreement Resolution

Inter-rater reliability was measured using Cohen’s Kappa (κ) and Prevalence-Adjusted Bias-Adjusted Kappa (PABAK). Disagreements were resolved by a third reviewer who examined the original paper, both reviewers’ decisions and reasoning, and made a final determination. The agreement statistics and disagreement patterns are reported in the Gold Standard Results section.

3.5. LLM Coordination Strategies

The five coordination strategies were implemented for this study, each based on an established multi-agent decision-making paradigm identified in Section 2.2 [15,16,18,20] and adapted for the title-abstract screening task. Each strategy receives a paper’s title and abstract as input and produces a binary decision (INCLUDE or EXCLUDE), a confidence level (HIGH, MEDIUM, or LOW), the criteria applied, and free-text reasoning.

S1: Single Agent (Baseline). A single LLM model screens each paper independently. This strategy serves as the baseline against which multi-agent approaches are compared.

S2: Majority Voting. Three LLMs screen each paper independently. The final decision is determined by simple majority [15,19,20]: if two or more models agree on INCLUDE or EXCLUDE, that decision is adopted. If no majority is reached, the paper is marked UNCERTAIN. The aggregated confidence is computed as the mean of individual confidence scores, mapped to HIGH (≥0.85), MEDIUM (≥0.65), or LOW.

S3: Recall-Focused Ensemble. Three LLMs screen each paper independently. If any model votes INCLUDE, the final decision is INCLUDE (OR logic). This strategy prioritises recall at the expense of precision, reflecting the principle that missing a relevant study is more costly than including an irrelevant one. If no model votes INCLUDE but at least one votes UNCERTAIN, the paper is marked UNCERTAIN; otherwise, it is marked EXCLUDE.

S4: Confidence-Weighted Aggregation. Three LLMs screen each paper independently. Each vote is weighted by the model’s self-reported confidence level, mapped to numerical weights: HIGH = 0.9, MEDIUM = 0.7, LOW = 0.5. INCLUDE votes contribute positive weight, EXCLUDE votes contribute negative weight, and UNCERTAIN votes contribute zero to the numerator but their confidence weight is included in the normalisation denominator. The weighted scores are normalised, and the final decision is determined by threshold: a normalised score above +0.2 results in INCLUDE, below −0.2 in EXCLUDE, and between these values in UNCERTAIN.

S5: Two-Stage Filtering. This strategy separates screening into two stages, combining confidence-based pre-filtering with majority voting. In Stage 1, a designated fast-filter model screens each paper. Papers receiving a HIGH-confidence EXCLUDE decision are immediately excluded. All remaining papers proceed to Stage 2, where two additional models independently screen the paper. If all three models (including the Stage 1 response) reach consensus, that decision is final. If disagreement persists, the majority decision is adopted; in the event of a tie, INCLUDE is selected to preserve recall. The fast-filter model and panel models are configurable, allowing role assignment based on individual model strengths. In this notation, the arrow format X → Y + Z denotes the role assignment: X is the Stage 1 fast-filter model, and Y and Z are the two Stage 2 panel models. For example, M → L + Q indicates that Mistral serves as the fast filter, while LLaMA and Qwen perform the Stage 2 screening. The Stage 2 aggregation follows the same majority voting rule as S2: the final decision is the majority among all three model responses (one from Stage 1 and two from Stage 2). Unlike S2, which applies majority voting to all papers, S5 first excludes papers receiving a HIGH-confidence EXCLUDE decision from the fast-filter model, reducing the number of papers that require evaluation by all three models.

Models

All strategies were tested with four open-source LLMs: Mistral 7B Instruct v0.3 (Mistral AI), Meta LLaMA 3.1 8B Instruct, Qwen 2.5 7B Instruct (Alibaba), and IBM Granite 3.3 8B Instruct. Model selection was governed by two requirements. First, a core design principle of PaSSER-SR is privacy-preserving, cloud-independent operation. All inference is performed locally on Apple Silicon hardware using the Apple MLX framework (mlx v.0.30.3, mlx-metal v.0.30.3, mlx-lm v.0.30.4) with 4-bit quantisation. This restricts the candidate pool to models that have MLX-compatible quantised variants and fit within the unified memory of consumer-grade hardware. No cloud-based or commercial APIs were used. Second, to maximise architectural diversity within this constraint, four models were chosen from independent development pipelines with different pre-training corpora, fine-tuning procedures, and alignment strategies.

Four models were selected from the 7–8B instruction-tuned model families that were independently developed, with MLX-compatible 4-bit variants. This selection was partly influenced by prior experience with these models in the PaSSER platform. The chosen models satisfy the minimum ensemble size of three required for majority voting (S2), and one additional model enables varied three-model combinations for S5. The 7–8 billion parameter range represents the practical upper bound for 4-bit local inference on devices with 16–32 GB of unified memory. The PaSSER-SR platform extends the original PaSSER platform [30], which employed Mistral 7B, Llama2 7B, and Orca2 7B for RAG evaluation in the smart agriculture domain. This study uses newer model versions (LLaMA 3.1, Qwen 2.5) and adds IBM Granite. Three of the four model families (Mistral, LLaMA, and Granite) were previously evaluated in a multi-agent RAG context using the PaSSER platform [22], enabling cross-study comparison of model behaviour across different tasks.

Each strategy–model combination was tested in two prompt modes: zero-shot and few-shot. Both modes used a two-part prompt. The system prompt defined the model’s role as a systematic review screener, listed the inclusion and exclusion criteria (Table 2), and specified the output format (decision, confidence, and reasoning). The screening prompt provided the paper’s title and abstract:

“Evaluate this paper for inclusion in a systematic review on blockchain-based electoral systems. TITLE: [title] ABSTRACT: [abstract] Based on the inclusion/exclusion criteria, provide your decision in JSON format”.

In few-shot mode, ten labelled examples (Section 3.6) were inserted between the system and screening prompts, each containing a title, abstract, ground-truth decision, and brief reasoning. A single prompt template was used for all models, strategies, and configurations; no alternative wordings were tested.

3.6. Few-Shot Example Selection Protocol

Few-shot examples were selected through error-driven analysis of zero-shot results rather than random sampling. The selection protocol consisted of three steps.

Step 1: Cross-strategy error aggregation. Error analysis reports from all zero-shot runs were aggregated over strategy–model configurations. For each paper appearing as a false positive (FP) or false negative (FN), the number of configurations in which the error occurred was counted. Papers producing errors in a larger number of configurations were considered more representative of systematic model weaknesses.

Step 2: Error pattern categorisation. FP errors were categorised by confusion type: (a) domain boundary confusion, where blockchain consensus terminology was mistaken for electoral voting (related to EC2 and EC4); (b) missing implementation, where papers discussed elections without proposing a concrete blockchain system (EC1); (c) opinion or review papers without an original contribution (EC3); and (d) governance confusion, where DAO or corporate voting was misidentified as public electoral voting (EC4). FN errors were categorised by (a) overly aggressive application of exclusion criteria (EC2 and EC3) and (b) terminology mismatch, where synonyms such as “distributed ledger” were not recognised as blockchain-related.

Step 3: Representative selection. Five EXCLUDE and five INCLUDE examples were selected to maximise pattern coverage over the identified error categories. EXCLUDE examples prioritised papers that appeared as FP in the largest number of configurations. INCLUDE examples targeted specific model weaknesses, such as the tendency of certain models to over-apply EC2 and EC3 when screening review papers.

The 10 selected papers were marked as calibration examples (is_calibration = true) in the Gold Standard database and excluded from all evaluation runs to prevent data leakage. Each few-shot example included the paper title, abstract, ground-truth decision, criteria applied, and a brief reasoning statement. Few-shot runs were therefore evaluated on the remaining approximately 190 papers from the Gold Standard.

3.7. Evaluation Protocol

Each strategy–model–prompt mode configuration was evaluated against the human ground truth established during Gold Standard screening (Section 3.4). The ground truth decision for each paper was determined as follows: where both screeners agreed, the consensus decision was used; where they disagreed, the resolution decision provided by the third reviewer was adopted.

Screening was framed as a binary classification task. INCLUDE was treated as the positive class and EXCLUDE as the negative class. Four standard confusion matrix counts were computed: true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). From these, four primary metrics were derived.

Recall (sensitivity) measured the proportion of relevant papers correctly identified by the LLM strategy:

R e c a l l = \frac{T P}{T P + F N}

(1)

Precision (positive predictive value) measured the proportion of papers classified as INCLUDE that were genuinely relevant:

P r e c i s i o n = \frac{T P}{T P + F P}

(2)

F1 Score provided the harmonic mean of Recall and Precision:

F_{1} = \frac{2 \times P r e c i s i o n \times R e c a l l}{P r e c i s i o n + R e c a l l}

(3)

Work Saved over Sampling at 95% recall (WSS@95) quantified the reduction in screening workload compared to random sampling [2]:

W S S 95 = \frac{T N + F N}{N} - 0.05

(4)

This metric is meaningful only when Recall ≥ 0.95 and represents the proportion of papers that need not be manually screened beyond the 5% baseline cost of random sampling [31]. The subtraction of 0.05 removes the trivial baseline: randomly screening 95% of the corpus would also achieve 95% recall while saving 5% of the workload. WSS@95 therefore measures only the additional screening effort saved by the classifier beyond what random ordering would achieve.

Recall ≥ 0.95 was adopted as the primary acceptance threshold. In systematic review screening, missing relevant studies (false negatives) poses a greater risk to review validity than including irrelevant ones (false positives), which are eliminated during full-text review [2]. Strategies meeting this threshold were ranked by WSS@95 in descending order. Strategies failing to meet the threshold were ranked separately by Recall.

Confidence intervals. For Recall and Precision, 95% confidence intervals were computed using the Wilson score method [32]. The Wilson interval was selected over the Wald (normal approximation) interval because it provides more accurate coverage for small sample sizes and for proportions near 0 or 1 [33], both of which apply to the present evaluation set.

UNCERTAIN treatment. Human screening decisions included an UNCERTAIN category for ambiguous cases. Two treatment modes were evaluated: (a) recall-focused, where UNCERTAIN decisions in both ground truth and predictions were mapped to INCLUDE, representing a conservative approach that avoids missing potentially relevant papers; (b) precision-focused, where UNCERTAIN decisions were mapped to EXCLUDE. All results reported in Section 4 use the recall-focused treatment unless otherwise stated.

Calibration exclusion. The 10 papers marked as few-shot calibration examples (Section 3.6) were excluded from all evaluation runs. Each configuration was therefore evaluated on approximately 190 papers from the Gold Standard. The selection of these calibration examples was informed by zero-shot error patterns observed on the same Gold Standard, which constitutes an indirect form of data leakage despite the exclusion of all 10 calibration papers from evaluation. The sensitivity of few-shot results to the specific examples chosen was not examined and remains a limitation of the current design.

Full corpus validation. To assess whether screening performance generalises beyond the Gold Standard, the best-performing configurations were applied to the full corpus of 2036 papers. Evaluation metrics were computed on the 190 GS evaluation papers embedded within the corpus, using the same ground truth and the same formulas described above. Consistency between GS-only and full-corpus metrics served as evidence that LLM performance does not degrade at scale.

3.8. PaSSER-SR Platform

All experimental procedures were conducted within PaSSER-SR, a web-based platform developed as an extension of the PaSSER platform [30]. PaSSER-SR was designed to support the full systematic review screening workflow, from corpus management through human and LLM-based screening to evaluation and audit. The system architecture is illustrated in Figure 3.

Presentation layer. The user interface was implemented in React 18.3.1 with the PrimeReact 10.9.7 component library. Four functional modules were provided: (a) Corpus Collection, for browsing and filtering the imported paper corpus; (b) Human Screening, for manual title-abstract screening with structured criteria selection, confidence tracking, and disagreement resolution; (c) LLM Screening, for configuring and executing automated screening runs with real-time progress monitoring via WebSocket; and (d) Evaluation and Audit, for computing performance metrics, comparing strategies, and managing the blockchain audit trail.

Application layer. The backend consisted of two Python FastAPI (v.0.100.1) services and two supporting modules (Figure 3). The first service (screening_api.py) handled project management, corpus and Gold Standard operations, human screening decisions, disagreement resolution, and role-based access control. Three user roles were defined: screener (submits decisions), resolver (handles disagreements), and admin (manages projects, exports, and audit). The second service (llm_screening_api.py) managed LLM model loading and inference via the Apple MLX framework, strategy execution, and few-shot example retrieval. A dedicated evaluation module (evaluate.py) computed Recall, Precision, F1, and WSS@95 with Wilson confidence intervals for strategy–model–prompt mode configurations. A blockchain logging module (blockchain_logger) recorded all screening decisions to the Antelope blockchain, computed Merkle roots via the logexport action, and coordinated OpenTimestamps anchoring and Zenodo DOI publication (Section 3.9).

Data and infrastructure layer. Three components supported data storage and processing. MongoDB (v.5.0.23) served as the primary database, storing corpus papers, Gold Standard records, human screening decisions, LLM decisions, disagreement resolutions, and evaluation results. The Apple MLX framework (mlx-lm) provided local LLM inference on Apple Silicon hardware with 4-bit quantised models, eliminating the need for cloud-based API services. A private Antelope blockchain recorded all screening decisions as immutable audit entries; the audit trail is described in Section 3.9.

The platform enforced a strict separation between calibration and evaluation data. Papers marked as few-shot calibration examples (is_calibration = true) were automatically excluded from LLM screening evaluation runs. Human screening followed a dual independent review protocol: two screeners assessed each paper independently, and disagreements were resolved by a third reviewer with the resolver role. For full corpus screening jobs, the evaluation_only parameter ensured that the 10 calibration papers were excluded from metric computation, preventing data leakage from few-shot examples into the evaluation set. All decisions, including timestamps, criteria selections, and confidence levels, were logged to both MongoDB and the Antelope blockchain.

All experimental configurations, prompts, and evaluation procedures are available in the project repository, enabling full reproducibility of the reported results.

All experiments were executed on consumer-grade Apple Silicon hardware. S1, S2, S3, and S4 configurations were run on MacBook Air M2 (16 GB RAM); S5 configurations were run on MacBook Pro M1 Pro (16 GB RAM). All models were 4-bit quantised and served locally via the MLX framework. Inference was sequential (one model call at a time).

3.9. Blockchain Audit Trail

A three-tier verification mechanism was implemented to ensure the reproducibility and integrity of all screening decisions. At the operational level (Tier 1), every human and LLM screening event was recorded as an immutable transaction on a private Antelope blockchain, capturing the paper identifier, screener identity, decision, confidence level, exclusion criteria, and timestamp. At defined milestones (Tier 2), all accumulated records were exported as a JSON file, a Merkle root was computed, and a timestamp proof was generated via OpenTimestamps and anchored to the Bitcoin blockchain. The complete export, together with its timestamp proof, was deposited on Zenodo with a persistent DOI (Tier 3), enabling independent verification without access to the private chain [34]. The deposited archive can be verified by running the OpenTimestamps client (https://opentimestamps.org) against the JSON file.

4. Results

This section presents the empirical results obtained from the corpus construction pipeline, the Gold Standard screening, and the evaluation of five LLM screening strategies with four open-source models. The evaluation covered 25 configurations (12 zero-shot and 13 few-shot) against the Gold Standard of 200 papers. The primary acceptance criterion was recall ≥ 0.95, ensuring that at least 95% of relevant studies were retained.

4.1. Corpus Statistics

Papers were collected from five open access databases using the search query described in Section 3.2. Table 3 extends this with deduplication results and cross-database overlap. “Unique to DB” indicates records found exclusively in one database. “Shared” indicates records found in the given database and at least one other. “% of Corpus” is calculated as After Dedup/4021. Percentages exceed 100% because shared papers are counted under each contributing database.

The search retrieved 5229 raw records. After deduplication by DOI matching and title similarity (≥0.85 threshold), the unified corpus contained 4021 unique papers. Of these, 857 (21.3%) appeared in two or more databases, confirming the value of multi-source search for comprehensive coverage.

Semantic Scholar contributed the largest share of the corpus (56.9%), followed by OpenAlex (38.6%) and CORE (20.1%). The arXiv and MDPI databases contributed smaller but complementary sets of 259 and 86 records, respectively. Semantic Scholar contained 1511 exclusive papers (37.6%), indicating that reliance on a single database would have resulted in substantial omissions.

Electoral keyword filtering (Section 3.3) reduced the corpus from 4021 to 2036 papers. The excluded 1985 papers lacked e-voting context in their title or abstract. The Gold Standard sample of 200 papers was drawn from the filtered corpus using simple random sampling with a fixed seed (Section 3.4).

4.2. Gold Standard and Inter-Rater Agreement

Two independent reviewers screened all 200 Gold Standard papers independently against the inclusion and exclusion criteria defined in Table 2. Disagreements were resolved through discussion and, where necessary, adjudication by a third reviewer. Table 4 presents the inter-rater reliability metrics. Cohen’s κ was interpreted according to the Landis and Koch scale [35].

Cohen’s κ of 0.515 indicates moderate inter-rater agreement, with an observed agreement of 75.0% and a PABAK of 0.500. This level of agreement is consistent with the terminological overlap characteristic of the target domain (Section 3.2).

Table 5 presents the agreement matrix between the two reviewers. The largest source of disagreement was between INCLUDE and EXCLUDE decisions: 14 papers were classified as INCLUDE by Reviewer 1 but EXCLUDE by Reviewer 2, while 23 papers showed the reverse pattern. A further 13 disagreements involved the UNCERTAIN category. All 50 disagreements were resolved by a third reviewer.

After disagreement resolution, the final Gold Standard comprised 200 papers: 67 classified as INCLUDE (33.5%), 126 as EXCLUDE (63.0%), and 7 as UNCERTAIN (3.5%). Table 6 presents the final distribution and the mapping to LLM evaluation ground truth.

The UNCERTAIN papers were treated as INCLUDE for LLM evaluation purposes, following a conservative approach that maximises the recall requirement. This yielded an effective evaluation set of 74 positive (INCLUDE + UNCERTAIN) and 126 negative (EXCLUDE) cases for zero-shot evaluation (n = 200). For few-shot evaluation, 10 calibration papers were excluded, resulting in 69 positive and 121 negative cases (n = 190).

4.2.1. Zero-Shot Screening Results

Three S5 configurations were excluded from the main results: two (L → Q + M in both prompt modes) produced valid responses for only a subset of papers (124 of 200 in zero-shot, 132 of 190 in few-shot), and one (Q → G, zero-shot) used only two models instead of three, representing a degenerate case. Table 7 presents the performance of 12 zero-shot configurations covering five strategies and four models. Each configuration was evaluated on the full Gold Standard of 200 papers (n = 200). All metrics reported as percentages except TP, FP, FN, TN (counts). Configurations with recall ≥ 95% are considered qualified. WSS@95 values are reported for all configurations but are meaningful only when recall ≥ 0.95; values for unqualified configurations should not be compared against qualified ones. Ensemble abbreviations: M = Mistral 7B, L = LLaMA 3.1 8B, Q = Qwen 2.5 7B, G = Granite 3.3 8B. For S5, the notation X → Y + Z indicates Stage 1 filter → Stage 2 panel.

Of the 12 zero-shot configurations, 11 met the recall threshold of 95%. The single unqualified configuration was S1 with Qwen 2.5 7B, which achieved a recall of 94.6% (FN = 4). Despite failing the recall criterion, this configuration achieved the highest precision (72.9%) and F1 score (82.3%) among all zero-shot runs, demonstrating a clear recall–precision trade-off.

Granite 3.3 8B exhibited a distinct failure pattern. Under S1, it classified 199 out of 200 papers as INCLUDE (TN = 0, FP = 126), yielding a negative WSS@95 of −4.5%. This indicates that Granite showed no discriminative power in this setting and performed no better than including all papers without screening. This behaviour propagated to all ensemble strategies containing Granite (S2 MLG, S4 MLG, S5 G → M + L), which underperformed their Qwen-based counterparts.

The best zero-shot performance by WSS@95 was S5 (M → L + Q) at 34.0%, closely followed by S2 MLQ and S4 MLQ at 33.5%. An identical result pattern was observed between S2 MLQ and S4 MLQ across metrics, suggesting that the confidence-weighted aggregation in S4 produced no benefit over simple majority voting in S2 for this dataset.

The following section examines whether few-shot prompting addresses these zero-shot limitations, particularly Qwen’s recall failure and Granite’s lack of discriminative power.

4.2.2. Few-Shot Screening Results

Table 8 presents the results of 13 few-shot configurations. Each was evaluated on n = 190 papers, as 10 papers used for calibration examples (Section 3.6) were excluded from the evaluation set to prevent data leakage. Notation follows Table 7. All 13 configurations met the recall ≥ 95% threshold.

All 13 few-shot configurations met the recall threshold. The best overall configuration was S1 with Qwen 2.5 7B (few-shot), which achieved perfect recall (100.0%), precision of 70.4%, F1 of 82.6%, and WSS@95 of 43.4%. This was the highest-ranked configuration across all 25 tested combinations.

The most notable few-shot effect was observed for Qwen 2.5 7B under S1. In zero-shot mode, Qwen was the only model that failed the recall threshold (94.6%, FN = 4). With few-shot prompting, recall increased to 100.0% while precision remained high (70.4% vs. 72.9% in zero-shot). The few-shot examples effectively corrected the overly conservative exclusion pattern that caused the zero-shot failure.

Granite 3.3 8B showed no improvement with few-shot prompting. Under S1, it again classified nearly all papers as INCLUDE (FP = 121, TN = 0), and the same pattern appeared under S3 MLG (FP = 121, TN = 0). Granite’s inability to discriminate between relevant and irrelevant papers persisted regardless of prompt mode.

For matched strategy–model pairs, few-shot prompting produced mixed effects on precision. For Mistral 7B, precision decreased in most configurations (e.g., S1: 52.1% → 48.6%), suggesting that the few-shot examples introduced a bias towards inclusion. For LLaMA 3.1 8B, precision remained comparable (e.g., S1: 59.7% → 56.2%). Only Qwen 2.5 7B showed a consistent beneficial pattern where the recall improvement outweighed the modest precision decrease.

4.2.3. Strategy Comparison and Ranking

Table 9 presents the top five configurations ranked by a composite criterion: recall ≥ 95% as a hard threshold, then F1 score as the primary ranking metric. Abbreviations: ZS = zero-shot, FS = few-shot. Wilson 95% confidence intervals are shown for Recall and Precision.

The single-agent baseline (S1) achieved the highest rank, followed by two S5 (two-stage) configurations and S2/S4 with identical metrics.

Figure 4 illustrates the performance trends across all five strategies. In panels (a) and (b), S1 and S5 consistently outperform S2, S3, and S4 in both F1 and WSS@95. Few-shot prompting improved S1 and S5 but provided marginal or negative gains for S2–S4. Recall was 100% for all qualified configurations in both prompt modes; consequently, panel (c) reports only precision, which was the sole differentiating metric. Best zero-shot/few-shot models per strategy: S1—LLaMA/Qwen; S2, S3, S4—M + L + Q/M + L + Q; S5—L + M + G/Q + M + L.

Three observations emerge from the 25 tested configurations. First, configurations including Qwen consistently outperformed those using Granite, regardless of strategy. Second, the equivalence of S2 and S4 observed in Section 4.2.1 persisted across tested ensembles. Third, multi-agent strategies did not consistently outperform the single-agent baseline—the best S1 configuration surpassed all multi-agent alternatives. The 95% Wilson confidence intervals for precision overlapped across the top five configurations (Table 9). Formal pairwise testing is reported in Section 4.6.

4.3. Full Corpus Screening

To assess whether screening performance generalises beyond the Gold Standard, the top five configurations from Table 9 were applied to the full corpus of 2036 papers. Evaluation metrics were computed on the 190 Gold Standard evaluation papers embedded within the corpus (Section 3.7). Table 10 presents the screening results. GS-embedded metrics were computed on n = 190 (calibration excluded). Included = papers selected for full-text review from the full corpus of 2036 (INCLUDE + UNCERTAIN). FP and FN = GS-embedded error counts. For Ranks 1 and 2 (few-shot, n = 190), all confusion matrix counts are identical to Table 8. For Ranks 3–5 (zero-shot), the metrics differ from Table 7 due to the exclusion of calibration papers (n = 190 vs. n = 200).

All five configurations maintained 100% recall on the embedded Gold Standard papers, confirming that no relevant studies were lost during full corpus screening. The single false negative observed for S5 M → L + Q ZS in the GS-only evaluation (Table 7, n = 200) was a calibration paper excluded from the n = 190 set.

Cross-strategy agreement on the candidate set was high: 926 of the 950 papers selected by S1 (97.5%) were also selected by four multi-agent configurations. Only 2 papers were unique to S1, while 140 papers were selected by multi-agent strategies but excluded by S1. Whether these 140 papers represent genuine inclusions or additional false positives cannot be determined without full-text review. The three zero-shot multi-agent configurations (Ranks 3–5) produced nearly identical INCLUDE sets, differing on fewer than 5 papers.

The complementary set—papers selected by four multi-agent strategies but not by S1—comprised 140 papers (INCLUDE and UNCERTAIN decisions were treated as “selected” throughout this analysis, consistent with the counts reported above). To assess whether these represent genuine misses by S1 or false positives by the multi-agent strategies, each paper was cross-referenced with the 200-paper Gold Standard, and the S1 exclusion metadata (confidence level and exclusion criteria cited) were extracted from the screening log. The results are presented in Table 11.

Of the 11 papers with ground-truth labels, all were classified as EXCLUDE in the Gold Standard, confirming that S1 correctly rejected them as true negatives while the multi-agent strategies produced false positives. S1 excluded 139 of 140 papers with MEDIUM confidence, citing predominantly EC5 (99.3%) and EC3 (96.4%), with most papers violating four or five exclusion criteria simultaneously. Title-level inspection of the remaining 129 papers without ground-truth labels revealed that the majority are surveys, literature reviews, or conceptual frameworks that match topical keywords but lack original empirical contributions.

The 950 papers selected by the top-ranked configuration (of which one was classified as uncertain and excluded from the final review list) constitute the candidate set for full-text review.

4.4. Error Analysis

Error analysis was performed on all 25 Gold Standard configurations using a consistent evaluation set of n = 190 papers (calibration examples excluded). The analysis was then extended to the five full corpus configurations to assess whether error patterns persisted at scale.

False positive persistence. In 25 GS configurations, 52 unique papers were flagged as false positives at least once. The distribution was highly concentrated: a persistent pool of 20 papers appeared in the FP set of all five strategies, while 13 papers appeared exclusively in S1 configurations. Multi-agent screening did not introduce new error types—no paper appeared as a false positive only in a multi-agent strategy. Figure 5 illustrates the FP overlap structure for both GS and full corpus evaluations. Panels (a) and (b) show the GS results aggregated across all 25 configurations (n = 190): the persistent pool of 20 papers common to all five strategies is clearly separated from strategy-specific false positives. Panels (c) and (d) show the full corpus evaluation (five top-ranked configurations, n = 190 GS-embedded papers), where this concentration intensified—28 of 48 unique false positives were shared across all five configurations. Dark shading indicates papers appearing as FP in all strategies; red indicates papers unique to S1.

Exclusion criteria application. An inverse relationship was observed between EC citation frequency and false positive count (Table 12). Granite cited no exclusion criteria in either mode, consistent with keyword matching rather than criteria-based reasoning. Qwen cited EC most frequently and produced the fewest false positives.

Few-shot prompting effect. Few-shot prompting increased EC citation counts for models capable of criteria-based reasoning (Table 12), while Granite remained at zero. However, few-shot prompting also produced a slight FP increase for all models. The calibration examples promoted more thorough criteria matching but did not prevent inclusion of papers that superficially satisfy the inclusion criteria. The net effect is improved criteria articulation without consistent precision gains.

Full corpus error categories. The five full corpus configurations produced 48 unique false positives on the GS-embedded set, of which 28 appeared in all five configurations. Table 13 categorises these persistent false positives by exclusion criterion. The categories are based on the exclusion criteria cited by both human reviewers in agreed cases, or by the third reviewer during disagreement resolution.

Table 13 reveals that the dominant error category was EC3 (no original contribution), accounting for 19 of the 28 persistent false positives. These papers typically present surveys, tutorials, or position papers whose abstracts contain all expected inclusion keywords but offer no original system, framework, or empirical evaluation. The remaining categories—EC1 (no blockchain implementation), EC5 (non-English abstract), and EC2 (non-electoral domain)—together account for 9 papers. This distribution indicates that the primary precision bottleneck is not terminological ambiguity per se, but the inability of the models to distinguish the type of contribution from its topic based on title and abstract alone.

False negatives. Only two FN cases occurred in 25 GS configurations, both in single-agent S1 (Granite ZS and LLaMA FS). All multi-agent configurations and all five full corpus configurations recorded zero false negatives. The error analysis indicates that false positive accumulation, not false negative risk, is the primary source of screening uncertainty.

These findings address RQ3 by identifying consistent error patterns for models and strategies.

4.5. Sensitivity to Ground-Truth Agreement

To assess whether the moderate inter-rater agreement (κ = 0.515) compromises the reliability of the evaluation, the Gold Standard was divided into two subsets: 150 papers on which there was agreement and 50 papers on which there was disagreement. For few-shot configurations, ten calibration papers (six agreed and four disputed) were excluded, yielding evaluation sets of 144 and 46 papers, respectively. Table 14 presents the results. Recall remained at 1.000 for both subsets and all five top configurations. The exception was S5 M → L + Q zero-shot (rank 3), which missed one disputed paper (recall = 0.938). Precision was consistently lower on the disputed subset. The largest discrepancy was observed for S1 Qwen few-shot (0.821 agreed versus 0.452 disputed). For the multi-agent configurations, precision on the disputed subset dropped further to 0.375–0.390. The difference in F1 score between subsets ranged from 0.27 to 0.29, primarily due to a higher concentration of false positives in the disputed subset. These results suggest that moderate κ introduces noise into precision estimates, but does not undermine the primary screening function: recall remained stable regardless of ground-truth certainty. The complete agreed versus disputed results for all configurations are available in the project data deposit.

4.6. Pairwise Statistical Comparison

Since the Wilson confidence intervals for recall and precision overlapped for the top five configurations (Table 9), pairwise comparisons were conducted using McNemar’s exact test for paired binary outcomes—the recommended method for classifiers evaluated on the same test set, as it accounts for paired structure and avoids inflated Type I error rates [36]. The test examines the 2 × 2 contingency table of paired outcomes (Table 15).

n A B

and

n B A

denote discordant pairs (papers classified correctly by only one configuration) and

n d = n A B + n B A

. The test focuses exclusively on these discordant pairs.

Under the null hypothesis of equal classifier performance, the number of discordant pairs in each direction follows a binomial distribution. Following the methodology of [37], the exact binomial formulation was adopted:

H 0 : n A B ~ B i n o m i a l (n d, 0.5)

(5)

The exact two-sided p-value is computed as

p = 2 \cdot \sum C (n d, i) \cdot {0.5}^{n d}, i = n A B, \dots, n d w h e n n A B \geq n B A

(6)

where

C (n d, i)

denotes the binomial coefficient. This formulation was preferred over the asymptotic

χ^{2}

approximation:

χ^{2} = {(n A B - n B A)}^{2} / (n A B + n B A)

(7)

The number of discordant pairs was below the threshold of 25 recommended for the asymptotic variant. Statistical power was estimated using the normal approximation to the McNemar statistic, based on the observed number of discordant pairs and their asymmetry ratio. All computations were performed using dedicated Python scripts available in the GitHub (https://github.com/scpdxtest/PaSSER-SR, accessed on 13 April 2026) repository.

Table 16 presents the pairwise results for all 10 comparisons among the top five configurations.

Four of the 10 pairwise comparisons reached statistical significance. All four involved S1 Qwen few-shot against each of the remaining configurations: p = 0.0005 for Rank 1 vs. 2, and p = 0.0003 for Rank 1 vs. Ranks 3–5. The discordant pairs were strongly asymmetric—15:1 and 16:1 in favour of S1. This means S1 classified 15–16 papers correctly that the opposing configuration misclassified, with only one paper showing the reverse. Statistical power for these four comparisons exceeded 0.93. The remaining six comparisons (among Ranks 2–5) showed no significant differences (all p = 1.0). Three patterns emerged. First, S1 Qwen few-shot was statistically superior to all multi-agent alternatives, with 16–17 discordant pairs out of 190 common papers (8.4–8.9%). Second, S5 Q → M + L few-shot differed from the three zero-shot configurations in only 7 papers (3.7%), with near-symmetric errors (4:3). Third, the three zero-shot configurations (Ranks 3–5) produced nearly identical classifications—at most 1 discordant pair. This confirms that confidence-weighted aggregation (S4) added no discriminative value over majority voting (S2) at the 7–8B parameter scale. For the six non-significant comparisons, the maximum discordant pair count was 7, yielding a power of 0.07. The evaluation set is therefore underpowered for discriminating among the multi-agent configurations but adequate for detecting the single-agent precision advantage.

To complement the pairwise analysis, 95% Clopper–Pearson exact confidence intervals were computed for recall (Table 17).

For zero-shot configurations evaluated on all 200 papers, Ranks 4 and 5 (74 true positives) achieved a lower bound of 0.951, exceeding the 0.95 threshold. Rank 3 (73 of 74 true positives, recall = 0.986) yielded a lower bound of 0.927—the only top-five configuration with imperfect recall. For few-shot configurations evaluated on 190 papers (69 true positives), the lower bound of 0.948 falls marginally below 0.95. This shortfall reflects the reduced evaluation set (10 calibration papers excluded) rather than a difference in classification behaviour.

4.7. Summary of Results

Table 18 consolidates the evaluation metrics for the top five configurations from the Gold Standard (GS) evaluation and the full corpus (FC) screening. The GS columns report performance on the fixed evaluation sets (n = 190 for few-shot, n = 200 for zero-shot), while the FC columns report performance on the same 190 GS papers embedded within the full corpus of 2036 records.

For the two few-shot configurations (Ranks 1–2), all GS and FC metrics are identical, confirming that screening behaviour remained stable when the evaluation set was embedded in the full corpus. For the three zero-shot configurations (Ranks 3–5), FC metrics differ from GS because 10 calibration papers—which contain 5 false positives and 1 false negative—were excluded from the FC evaluation (n = 190 vs. n = 200). This exclusion accounts for the improved FP counts (49 → 44) and, for Rank 3 (S5 M → L + Q ZS), the increase in recall from 98.7% to 100.0%. This comparison is illustrated in Figure 6.

Table 18 confirms that the ranking established on the Gold Standard was preserved in the full corpus evaluation. The single-agent baseline maintained its advantage across metrics, while the multi-agent configurations produced comparable results among themselves. The persistent false positive pool identified in Section 4.4 represents a precision ceiling inherent to title-abstract screening, where distinguishing original contributions from general discussions exceeds the discriminative capacity of the tested models. These results address RQ1 and RQ2 by showing that model selection has a stronger impact on performance than coordination strategy.

Pairwise McNemar’s exact tests (Section 4.6) revealed that the top-ranked configuration (S1 Qwen few-shot) was statistically superior to all four multi-agent alternatives (p ≤ 0.0005), with 15–16 discordant pairs favouring S1 and a statistical power exceeding 0.93. The remaining six pairwise comparisons among Ranks 2–5 showed no significant differences (all p = 1.0), with a maximum power of 0.07—confirming that the evaluation set is underpowered for discriminating among the multi-agent configurations. Performance was sensitive to ground-truth quality: precision on the disputed subset of the Gold Standard dropped to 0.375–0.452 for the top five configurations, compared with 0.707–0.821 on the agreed subset (Section 4.5). This pattern is consistent with the moderate inter-rater agreement (κ = 0.515) and suggests that a substantial portion of the false positive count reflects label ambiguity rather than model error.

5. Discussion

The experimental results addressed the three research questions posed in Section 1.3. This section interprets the findings, examines their implications, and acknowledges the boundaries of the current study.

5.1. Model Capability vs. Coordination Complexity

These findings should be interpreted within the context of the selected domain. Blockchain-based e-voting represents a terminologically overloaded setting, which amplifies ambiguity in title-abstract screening. The observed pattern is consistent across all tested configurations. Its generalisability to other domains remains to be validated. Configurations containing Qwen 2.5 7B consistently outperformed alternatives regardless of the coordination strategy applied. The single-agent baseline (S1) with Qwen in few-shot mode achieved the highest F1 (82.6%) and WSS@95 (43.4%), outperforming every multi-agent alternative. Figure 6 illustrates this comparison across the top five configurations. The identical GS and FC values for the two few-shot configurations confirm that screening behaviour remained stable at full corpus scale, while the single-agent baseline maintained a clear advantage across metrics.

This outcome does not support the assumption that ensemble coordination compensates for individual model weaknesses. When one model performs better, adding weaker models reduces overall quality. S2 (majority voting) and S4 (confidence-weighted aggregation) produced identical results for all metrics. This indicates that self-reported confidence from 7–8B parameter models does not add meaningful differentiation. This pattern is consistent with findings from a recent multi-agent screening study using API-based models, where majority voting also outperformed more complex adjudication and debate strategies [23]. S5 (two-stage filtering) offered computational efficiency by filtering up to 41% of papers in Stage 1, but did not improve precision. S3 (recall-focused OR) amplified false positives without recall gains.

These results align with recent findings on multi-agent coordination with models of comparable size. A comparative evaluation of four coordination strategies against single-agent RAG baselines reported consistent performance degradation, with coordination overhead identified as the primary factor [22]. This study extends this observation from question answering to binary screening. Coordination overhead appears to be a general limitation at the 7–8B scale, consistent with a recent evaluation of 18 LLMs across three biomedical systematic reviews [27], where model rankings varied substantially across domains. Mistral achieved the highest inter-rater agreement (PABAK = 0.621) in clinical reviews, whereas Qwen dominated in the present interdisciplinary domain—suggesting that model superiority is domain-dependent rather than absolute. Notably, smaller models (llama3.1:8b, MCC = 0.302) outperformed their larger counterparts (llama3.1:70b, MCC = 0.242) in that study, reinforcing the observation that parameter count alone does not determine screening quality at this scale. Whether larger models (13B–70B) benefit from coordination remains an open question. Greater reasoning diversity among agents could make ensemble deliberation productive at higher parameter scales.

The pairwise McNemar’s tests (Section 4.6) confirmed that S1 Qwen few-shot was statistically superior to all four multi-agent alternatives (p ≤ 0.0005, power > 0.93), while no significant differences were detected among Ranks 2–5 (all p = 1.0, power ≤ 0.07). The ranking in Table 9 for the multi-agent configurations therefore reflects point estimates rather than statistically significant performance gaps, whereas the single-agent advantage is statistically confirmed.

The equivalence between S4 and S2 is explained by the confidence distributions of the individual models serving as agents in both strategies. Table 19 presents the self-reported confidence levels for each model across all 2036 full corpus papers, together with the strategy-level agreement between S4 and S2.

LLaMA 3.1 8B and Mistral 7B reported HIGH confidence on 99.5% and 96.2% of all papers, respectively, providing virtually no discriminative signal for aggregation weighting. Only Qwen 2.5 7B exhibited meaningful variation, reporting HIGH confidence on 68.3% of INCLUDE decisions and 93.7% of EXCLUDE decisions, with the remaining votes classified as MEDIUM. When two of three models assign HIGH confidence to the same decision, the weighted sum in S4 produces the same outcome as unweighted majority voting. The two disagreements both involved papers where two models returned UNCERTAIN—a category that does not affect INCLUDE/EXCLUDE screening counts. The three-level confidence scale (HIGH = 0.9, MEDIUM = 0.7, LOW = 0.5) proved too coarse for effective differentiation at the 7–8B parameter scale. Finer-grained calibration or continuous confidence scoring would be required for confidence weighting to offer a practical advantage over majority voting. All computations were performed using a dedicated Python script available in the project repository.

Table 20 quantifies the computational cost of each strategy on the full corpus (n = 2036). Because S1, S2, and S4 were executed on Apple MacBook Air M2 (16 GB) and S5 on Apple MacBook Pro M1 Pro (16 GB), wall-clock times are not directly comparable between strategy groups; token counts serve as the hardware-independent cost metric.

S2 and S4 consumed 1.36× the tokens per paper compared to S1, with no improvement in screening performance—confirming the coordination overhead observed in Section 4.2.3 with quantitative evidence. The two strategies were computationally identical (same inference calls, same total tokens), differing only in aggregation logic. S5 (two-stage) consumed between 0.94× and 1.19× the tokens of S1: Stage 1 filtering resolved 35–41% of papers with a single model call, partially offsetting the cost of three-model evaluation in Stage 2. Because Qwen produces substantially longer responses than Mistral and LLaMA, the S5 configuration with Mistral as Stage 1 filter generates fewer total output tokens than S1 with Qwen, despite requiring more model calls. All computations were performed using a dedicated Python script available in the project repository.

To quantify the effect of model composition on ensemble performance, a pairwise ablation was conducted comparing ensembles containing Granite 3.3 8B (MLG: Mistral + LLaMA + Granite) against ensembles where Granite was replaced by Qwen 2.5 7B (MLQ: Mistral + LLaMA + Qwen). Table 21 presents the results.

Recall was 1.000 in all ten configurations. Replacing Granite with Qwen reduced false positives by 19 to 44 papers across strategies. The effect was most pronounced under S3 (recall-focused OR), where Granite’s near-universal INCLUDE behaviour (199/200 in zero-shot, FP = 126 as a single agent) propagated directly through the OR aggregation. Under majority voting (S2), Granite’s vote was outnumbered when Mistral and LLaMA agreed on EXCLUDE, partially mitigating the damage. These results suggest that ensemble coordination does not compensate for a non-discriminative model.

5.2. Error Patterns and the Precision Ceiling

Error analysis (RQ3) showed that false positive accumulation was the dominant source of screening error. Only two false negatives were observed across configurations, while a substantial number of papers were repeatedly classified as false positives across strategies.

The moderate inter-rater agreement (κ = 0.515) indicates that a portion of disagreement arises from inherent ambiguity in the classification task. This is consistent with prior observations that screening agreement decreases in interdisciplinary domains with overlapping terminology [24]. In such cases, differences between model predictions and ground truth may reflect uncertainty in the labels rather than purely model error.

The sensitivity analysis by agreement subset (Section 4.5) provides empirical support for this interpretation. The precision drop on disputed papers (0.375–0.452 for the top five configurations) aligns with the observation that the dominant false positives correspond to papers near the inclusion–exclusion boundary. When ground-truth labels themselves are uncertain, lower precision is an expected artefact of label ambiguity rather than an indication of reduced model capability.

The persistent false positives identified in Section 4.4 matched inclusion keywords but did not meet the contribution requirements. This suggests that distinguishing contribution type from topic may exceed the information available in titles and abstracts alone.

The relationship between exclusion criteria usage and false positive rates supports this interpretation. Models that explicitly applied exclusion criteria produced fewer false positives, while models that appeared to rely on surface-level matching showed limited discriminative capacity. This observation aligns with prior work showing that the formulation and application of inclusion and exclusion criteria strongly influence screening outcomes [27].

5.3. Implications for Practice

For practical applications of LLM-assisted screening, differences between models produced substantially larger performance variations than differences between strategies.

For well-defined domains, a single capable model with few-shot prompting may be sufficient. The observed workload reduction (WSS@95 = 43.4%) is comparable to ranges reported in prior studies of LLM-assisted screening [27], despite the absence of retrieval augmentation in the present framework.

Multi-agent strategies remain relevant in cases where no single model achieves acceptable recall. In such settings, majority voting provides a simple approach, while two-stage strategies may offer computational savings.

Because screening criteria are externalised (Section 3.1), applying the framework to a new domain requires only redefining these criteria and constructing a domain-specific Gold Standard. Empirical validation on additional domains remains a direction for future work. Existing reporting frameworks such as RDAL [12] and PRISMA-trAIce [13] address transparency in AI-assisted reviews, while the blockchain-based audit mechanism proposed here complements these approaches by providing infrastructure-level decision traceability. Similar blockchain-based logging approaches have been adopted in other digital governance domains, such as infrastructure security for IoT networks [38].

5.4. Limitations

The framework was validated on a single domain. The results should therefore be interpreted as domain-specific, and generalisability to other domains remains to be established.

The Gold Standard was constructed exclusively from Pool A, papers containing voting-related keywords, which intentionally over-represents terminologically ambiguous cases. While this design provides a rigorous stress test for screening accuracy, precision estimates derived from this subset may not generalise directly to the full corpus, where the proportion of clearly irrelevant papers is higher.

The Gold Standard of 200 papers (190 for evaluation) limits the statistical power of precision comparisons. The few-shot calibration examples were drawn from the same corpus, which may constitute indirect data leakage despite the exclusion of all 10 calibration papers from evaluation.

All models tested were instruction-tuned transformer variants in the 7–8B parameter range, deployed on local hardware. This limited architectural diversity may have contributed to the observed S4–S2 equivalence, as similarly sized models tend to produce correlated confidence estimates. Larger models may benefit more from multi-agent coordination, where greater parameter capacity could support productive ensemble deliberation. Recent evidence from API-based models (GPT-4o Mini, Claude 3 Haiku, Gemini 1.5 Flash) supports this hypothesis, as multi-agent collaboration yielded consistent improvements over individual baselines at higher parameter scales [23]. The inter-rater agreement (κ = 0.515) introduces uncertainty in the ground truth labels, which may affect the interpretation of model performance.

The sensitivity of the prompt was not examined. The same prompt template was used for all experiments. The results may differ with alternative wordings or instruction structures. The conclusion that model selection was more important than strategy selection was derived from four models. A larger and more diverse pool of models would be needed to confirm this pattern.

6. Conclusions

This study evaluated five LLM coordination strategies for automating title-abstract screening tasks—single-agent baseline (S1), majority voting (S2), recall-focused ensemble (S3), confidence-weighted aggregation (S4), and two-stage filtering (S5)—using four open-source 7–8B parameter models deployed locally. The evaluation was conducted on a Gold Standard of 200 papers from a corpus of 2036 records in the terminologically complex domain of blockchain-based e-voting.

The findings should be interpreted within the context of the selected domain and require validation in other application areas.

Three principal findings emerged, each addressing one of the research questions posed in Section 1.3.

RQ1 (coordination strategy effect): Model selection was the primary determinant of screening performance, outweighing strategy selection. The single-agent strategy with Qwen 2.5 7B in few-shot mode achieved the highest point estimates among all configurations (recall = 100.0%, precision = 70.4%, F1 = 82.6%, WSS@95 = 43.4%). Pairwise McNemar’s exact tests (Section 4.6) confirmed that S1 was statistically superior to all four multi-agent alternatives (p ≤ 0.0005, power > 0.93), while no significant differences were detected among the remaining configurations (all p = 1.0, power ≤ 0.07). Confidence-weighted aggregation (S4) produced results identical to majority voting (S2), indicating that self-reported confidence from 7–8B parameter models does not provide additional discriminative value in this setting.

RQ2 (best strategy–model combination): The single-agent baseline (S1) with Qwen 2.5 7B in few-shot mode achieved the highest point estimates for both recall and effort reduction. Pairwise comparisons confirmed that no multi-agent configuration matched the single-agent baseline (Section 4.6), while multi-agent configurations were statistically indistinguishable from each other. The 95% Wilson confidence intervals for precision overlapped across the top five configurations, suggesting limited statistical separation. Multi-agent strategies introduced additional computational cost without measurable benefit under the conditions of this study.

RQ3 (systematic error patterns): The dominant source of screening error was false positive accumulation driven by terminological overlap. Models that actively applied exclusion criteria (Qwen) produced fewer false positives, while models relying on keyword matching (Granite) showed limited discriminative power. A persistent pool of 28 false positives—predominantly surveys and position papers (EC3)—appeared in all five full corpus configurations, suggesting a practical precision ceiling in title-abstract screening.

The blockchain-based audit mechanism combining private chain logging, OpenTimestamps anchoring, and Zenodo archival provides decision-level traceability that addresses part of the documented reproducibility gap in AI-aided screening [12]. Full reproducibility additionally requires access to the source code and prompt templates, which are available in the project repository.

The main contributions of this study are: (1) a controlled comparison of single-agent and multi-agent LLM coordination strategies using 4-bit quantised 7–8B parameter models deployed locally; (2) empirical evidence that, at this model scale, a single well-prompted model (Qwen 2.5 7B, few-shot) outperformed all multi-agent alternatives, and that self-reported confidence scores did not add discriminative value; (3) a three-tier blockchain-based audit mechanism combining private chain logging, Bitcoin-anchored temporal proofs, and DOI-based archival publication, addressing the documented reproducibility gap in AI-aided screening; and (4) application and evaluation of the framework in an interdisciplinary, terminologically overloaded domain. These contributions map directly onto the five tasks defined in Section 1.4. The platform implementation and corpus construction (Tasks 1–2) provided the experimental infrastructure; the definition of domain-specific criteria and creation of a Gold Standard (Tasks 3–4) established the ground truth for evaluation; and the systematic evaluation with error analysis (Task 5) produced the above-summarised empirical findings.

Future work should validate these findings across domains with different terminological characteristics. It should also be investigated whether larger models benefit from multi-agent coordination. Recent evidence at higher parameter scales suggests this possibility [23]. Two technical extensions could further address the persistent false positive pool. Retrieval-augmented generation (RAG) could provide full-text access during screening. Active learning could iteratively refine the decision boundary by selecting the most informative papers for human review. The inclusion and exclusion criteria were designed specifically for the present domain. Future applications should examine how domain-specific criteria are best derived, for instance through sensitivity analysis of individual criteria or error-driven iterative refinement, and whether a systematic approach to criteria selection can improve screening accuracy across different research areas. Alternative prompt structures and output formats could also be beneficial without changing the model or strategy. The blockchain audit mechanism could be extended to cover the full-text review stage. All of these directions build directly on the PaSSER-SR infrastructure.

Author Contributions

Conceptualization, I.R. and T.N.; methodology, I.R.; software, I.R. and T.N.; validation, I.P., L.D. and T.N.; formal analysis, I.R.; investigation, T.N. and L.D.; resources, I.R., T.N. and L.D.; data curation, I.R. and T.N.; writing—original draft preparation, I.R. and T.N.; writing—review and editing, I.R. and I.P.; visualization, T.N.; supervision, I.R., I.P. and L.D.; project administration, L.D., I.R. and T.N.; funding acquisition, L.D. and T.N. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Centre of Competence Digitization of the economy in an environment of Big data, BG05M2OP001-1.002-0002-C05, OP SESG.

Data Availability Statement

The screening audit log, Gold Standard evaluation results, and blockchain verification files are available on Zenodo at https://doi.org/10.5281/zenodo.19182242 [34]. The source code for the PaSSER-SR platform and the experimental scripts are available on GitHub at https://github.com/scpdxtest/PaSSER-SR (accessed on 13 April 2026).

Acknowledgments

During the preparation of this manuscript, the authors used Claude (Anthropic, Opus 4.6) for the purposes of drafting the initial structure of the study, scaffolding source code, generating alternative layouts for tables and figures, cross-checking numerical values between tables and figures in the manuscript, verifying alignment between paper descriptions and source code implementations, refining English language precision and terminological consistency, technical verification of textual content, and consolidating textual amendments during the revision process. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:

CI	Confidence Interval
EC1–EC6	Exclusion Criteria 1–6
FC	Full Corpus
FN	False Negative
FP	False Positive
FS	Few-Shot
GS	Gold Standard
IC1–IC5	Inclusion Criteria 1–5
LLM	Large Language Model
MLX	Apple Machine Learning Framework
OTS	OpenTimestamps
PABAK	Prevalence-Adjusted Bias-Adjusted Kappa
PRISMA	Preferred Reporting Items for Systematic Reviews and Meta-Analyses
RAG	Retrieval-Augmented Generation
TN	True Negative
TP	True Positive
WSS@95	Work Saved over Sampling at 95% Recall
ZS	Zero-Shot

References

Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 Statement: An Updated Guideline for Reporting Systematic Reviews. BMJ 2021, 372, 71. [Google Scholar] [CrossRef]
Cohen, A.M.; Hersh, W.R.; Peterson, K.; Yen, P.-Y. Reducing Workload in Systematic Review Preparation Using Automated Citation Classification. J. Am. Med. Inform. Assoc. 2006, 13, 206–219. [Google Scholar] [CrossRef]
Guo, E.; Gupta, M.; Deng, J.; Park, Y.-J.; Paget, M.; Naugler, C. Automated Paper Screening for Clinical Reviews Using Large Language Models: Data Analysis Study. J. Med. Internet Res. 2024, 26, e48996. [Google Scholar] [CrossRef]
Akinseloyin, O.; Jiang, X.; Palade, V. A Question-Answering Framework for Automated Abstract Screening Using Large Language Models. J. Am. Med. Inform. Assoc. 2024, 31, 1939–1952. [Google Scholar] [CrossRef]
Wang, Z.; Nayfeh, T.; Tetzlaff, J.; O’Blenis, P.; Murad, M.H. Error Rates of Human Reviewers during Abstract Screening in Systematic Reviews. PLoS ONE 2020, 15, e0227742. [Google Scholar] [CrossRef]
van de Schoot, R.; de Bruin, J.; Schram, R.; Zahedi, P.; de Boer, J.; Weijdema, F.; Kramer, B.; Huijts, M.; Hoogerwerf, M.; Ferdinands, G.; et al. An Open Source Machine Learning Framework for Efficient and Transparent Systematic Reviews. Nat. Mach. Intell. 2021, 3, 125–133. [Google Scholar] [CrossRef]
Syriani, E.; Dávid, I.; Kumar, G.A. Assessing the Ability of ChatGPT to Screen Articles for Systematic Reviews. arXiv 2023, arXiv:2307.06464. [Google Scholar] [CrossRef]
Khraisha, Q.; Put, S.; Kappenberg, J.; Warraitch, A.; Hadfield, K. Can Large Language Models Replace Humans in Systematic Reviews? Evaluating GPT-4’s Efficacy in Screening and Extracting Data from Peer-Reviewed and Grey Literature in Multiple Languages. Res. Synth. Methods 2024, 15, 616–626. [Google Scholar] [CrossRef] [PubMed]
Ye, A.; Maiti, A.; Schmidt, M.; Pedersen, S.J. A Hybrid Semi-Automated Workflow for Systematic and Literature Review Processes with Large Language Model Analysis. Future Internet 2024, 16, 167. [Google Scholar] [CrossRef]
Galli, C.; Gavrilova, A.V.; Calciolari, E. Large Language Models in Systematic Review Screening: Opportunities, Challenges, and Methodological Considerations. Information 2025, 16, 378. [Google Scholar] [CrossRef]
Matsui, K.; Utsumi, T.; Aoki, Y.; Maruki, T.; Takeshima, M.; Takaesu, Y. Human-Comparable Sensitivity of Large Language Models in Identifying Eligible Studies Through Title and Abstract Screening: 3-Layer Strategy Using GPT-3.5 and GPT-4 for Systematic Reviews. J. Med. Internet Res. 2024, 26, e52758. [Google Scholar] [CrossRef]
Lombaers, P.; de Bruin, J.; van de Schoot, R. Reproducibility and Data Storage for Active Learning-Aided Systematic Reviews. Appl. Sci. 2024, 14, 3842. [Google Scholar] [CrossRef]
Holst, D.; Moenck, K.; Koch, J.; Schmedemann, O.; Schüppstuhl, T. Transparent Reporting of AI in Systematic Literature Reviews: Development of the PRISMA-trAIce Checklist. JMIR AI 2025, 4, e80247. [Google Scholar] [CrossRef]
Guo, T.; Chen, X.; Wang, Y.; Chang, R.; Pei, S.; Chawla, N.V.; Wiest, O.; Zhang, X. Large Language Model Based Multi-Agents: A Survey of Progress and Challenges. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24; Larson, K., Ed.; International Joint Conferences on Artificial Intelligence Organization: California, CA, USA, 2024; pp. 8048–8057. [Google Scholar]
Wang, X.; Wei, J.; Schuurmans, D.; Le, Q.; Chi, E.H.; Zhou, D. Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv 2022, arXiv:2203.11171. [Google Scholar]
Du, Y.; Li, S.; Torralba, A.; Tenenbaum, J.B.; Mordatch, I. Improving Factuality and Reasoning in Language Models through Multiagent Debate. In Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria, 21–27 July 2024; Proceedings of Machine Learning Research (PMLR): Cambridge, MA, USA, 2024; pp. 11733–11763. [Google Scholar]
Yang, Y.; Ma, Y.; Feng, H.; Cheng, Y.; Han, Z. Minimizing Hallucinations and Communication Costs: Adversarial Debate and Voting Mechanisms in LLM-Based Multi-Agents. Appl. Sci. 2025, 15, 3676. [Google Scholar] [CrossRef]
Yeow Lee, X.; Akatsuka, S.; Vidyaratne, L.; Kumar, A.; Farahat, A.; Gupta, C. Reliable Decision-Making for Multi-Agent LLM Systems. In Proceedings of the 39th AAAI Conference on Artificial Intelligence (AAAI-25), Philadelphia, PA, USA, 25 February–4 March 2025. [Google Scholar]
Li, J.; Zhang, Q.; Yu, Y.; Fu, Q.; Ye, D. More Agents Is All You Need. Trans. Mach. Learn. Res. 2024, 2024, 1–18. [Google Scholar]
Bernasconi, E.; Redavid, D.; Ferilli, S. Integrated Survey Classification and Trend Analysis via LLMs: An Ensemble Approach for Robust Literature Synthesis. Electronics 2025, 14, 3404. [Google Scholar] [CrossRef]
Mienye, I.D.; Swart, T.G. Ensemble Large Language Models: A Survey. Information 2025, 16, 688. [Google Scholar] [CrossRef]
Radeva, I.; Popchev, I.; Doukovska, L.; Dimitrova, M. Multi-Agent Coordination Strategies vs. Retrieval-Augmented Generation in LLMs: A Comparative Evaluation. Electronics 2025, 14, 4883. [Google Scholar] [CrossRef]
Akinseloyin, O.; Jiang, X.; Palade, V. Large Language Model-Based Multiagent Collaboration for Abstract Screening toward Automated Systematic Reviews. Biol. Methods Protoc. 2026, 11, bpag006. [Google Scholar] [CrossRef]
Belur, J.; Tompson, L.; Thornton, A.; Simon, M. Interrater Reliability in Systematic Review Methodology: Exploring Variation in Coder Decision-Making. Sociol. Methods Res. 2021, 50, 837–865. [Google Scholar] [CrossRef]
Hanegraaf, P.; Wondimu, A.; Mosselman, J.J.; de Jong, R.; Abogunrin, S.; Queiros, L.; Lane, M.; Postma, M.J.; Boersma, C.; van der Schans, J. Inter-Reviewer Reliability of Human Literature Reviewing and Implications for the Introduction of Machine-Assisted Systematic Reviews: A Mixed-Methods Review. BMJ Open 2024, 14, e076912. [Google Scholar] [CrossRef]
Langenhuijsen, L.F.S.; Janse, R.J.; Venema, E.; Kent, D.M.; van Diepen, M.; Dekker, F.W.; Steyerberg, E.W.; de Jong, Y. Systematic Metareview of Prediction Studies Demonstrates Stable Trends in Bias and Low PROBAST Inter-Rater Agreement. J. Clin. Epidemiol. 2023, 159, 159–173. [Google Scholar] [CrossRef]
Delgado-Chaves, F.M.; Jennings, M.J.; Atalaia, A.; Wolff, J.; Horvath, R.; Mamdouh, Z.M.; Baumbach, J.; Baumbach, L. Transforming Literature Screening: The Emerging Role of Large Language Models in Systematic Reviews. Proc. Natl. Acad. Sci. USA 2025, 122, e2411962122. [Google Scholar] [CrossRef] [PubMed]
Kulothungan, V. Using Blockchain Ledgers to Record AI Decisions in IoT. IoT 2025, 6, 37. [Google Scholar] [CrossRef]
Radeva, I. Blockchain Integration: Development and Implementation, 1st ed.; Printing Office of Prof. Marin Drinov Publishing House of Bulgarian Academy of Sciences: Sofia, Bulgaria, 2024; ISBN 978-619-245-473-9. [Google Scholar]
Radeva, I.; Popchev, I.; Doukovska, L.; Dimitrova, M. Web Application for Retrieval-Augmented Generation: Implementation and Testing. Electronics 2024, 13, 1361. [Google Scholar] [CrossRef]
Kusa, W.; Lipani, A.; Knoth, P.; Hanbury, A. An Analysis of Work Saved over Sampling in the Evaluation of Automated Citation Screening in Systematic Literature Reviews. Intell. Syst. Appl. 2023, 18, 200193. [Google Scholar] [CrossRef]
Wilson, E.B. Probable Inference, the Law of Succession, and Statistical Inference. J. Am. Stat. Assoc. 1927, 22, 209–212. [Google Scholar] [CrossRef]
Agresti, A.; Coull, B.A. Approximate Is Better than “Exact” for Interval Estimation of Binomial Proportions. Am. Stat. 1998, 52, 119–126. [Google Scholar] [CrossRef]
Radeva, I.; Noncheva, T.; Doukovska, L.; Popchev, I. Blockchain-Verified Audit Trail: Automated Title-Abstract Screening. Zenodo. 2025. Available online: https://zenodo.org/records/19182242 (accessed on 13 April 2026).
Landis, J.R.; Koch, G.G. The Measurement of Observer Agreement for Categorical Data. Biometrics 1977, 33, 159–174. [Google Scholar] [CrossRef]
Dietterich, T.G. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms. Neural Comput. 1998, 10, 1895–1923. [Google Scholar] [CrossRef] [PubMed]
Fagerland, M.W.; Lydersen, S.; Laake, P. The McNemar Test for Binary Matched-Pairs Data: Mid-p and Asymptotic Are Better than Exact Conditional. BMC Med. Res. Methodol. 2013, 13, 91. [Google Scholar] [CrossRef] [PubMed]
Pawar, P.P.; Kumar, D.; Kumar Meesala, M.; Kumar Pareek, P.; Reddy Addula, S.; Shwetha, K.S. Securing Digital Governance: A Deep Learning and Blockchain Framework for Malware Detection in IoT Networks. In Proceedings of the 2024 International Conference on Integrated Intelligence and Communication Systems (ICIICS), Kalaburagi, India, 22–23 November 2024; pp. 1–8. [Google Scholar] [CrossRef]

Figure 1. Experimental workflow of the screening evaluation framework.

Figure 2. PRISMA 2020 flow diagram of the selection process.

Figure 3. PaSSER-SR system architecture.

Figure 4. Performance metrics by LLM coordination strategies: (a) F1 Score, (b) WSS@95, (c) Precision.

Figure 5. False positive overlap across strategies: (a) FP composition, GS (25 configs, N = 190); (b) pairwise FP overlap, GS; (c) FP composition, FC (5 configs, N = 190); (d) pairwise FP overlap, FC.

Figure 6. Comparison of top five configurations on Gold Standard (GS) and full corpus (FC): (a) F1 score, (b) WSS@95, and (c) false positive counts.

Table 1. Data sources and retrieval statistics.

Database	Access Method	Records Retrieved
OpenAlex	REST API	1644
Semantic Scholar	API with rate limiting	2392
CORE	API with registration	848
arXiv	OAI-PMH API	259
MDPI	BibTeX export	86
Total		5229

All collection scripts are publicly available in the project repository https://github.com/scpdxtest/PaSSER-SR (accessed on 13 April 2026).

Table 2. Criteria for title-abstract screening.

Code	Type	Criterion
IC1	Incl.	Proposes, describes, or evaluates a blockchain-based model, framework, or system
IC2	Incl.	Addresses electoral process (voter authentication, registration, petition signing, voting, counting, auditing, dispute resolution) for public or institutional elections (national, regional, local, university, organisation)
IC3	Incl.	Includes empirical evaluation or experimental results
IC4	Incl.	Contains security/privacy analysis
IC5	Incl.	Describes implementation or prototype
EC1	Excl.	No blockchain technology discussed, or mentions blockchain without specific implementation
EC2	Excl.	Focuses on non-electoral domain (e.g., finance, supply chain, healthcare, IoT, energy) or discusses decentralisation/blockchain in general without electoral application
EC3	Excl.	Opinion pieces, position papers, tutorials, or general overviews/surveys without systematic method or original contribution
EC4	Excl.	DAO governance, corporate voting, or technical voting/election mechanisms (consensus protocols, node/notary/leader election, Byzantine voting)
EC5	Excl.	Abstract missing, insufficient, unclear scope, or not in English
EC6	Excl.	Only theoretical discussion, or general blockchain/smart contract concepts without concrete electoral application

Table 3. Distribution of records by databases before and after deduplication.

Database	Raw Records	After Dedup	Unique to DB	Shared	% of Corpus
OpenAlex	1644	1554	811	743	38.6
Semantic Scholar	2392	2289	1511	778	56.9
CORE	848	809	689	120	20.1
arXiv	259	259	130	129	6.4
MDPI	86	86	23	63	2.1
Total	5229	4021	3164	857	—

Table 4. Inter-rater reliability metrics for the Gold Standard screening.

Metric	Value
Total papers screened (dual review)	200
Observed agreement (Po)	0.750 (75.0%)
Expected agreement (Pe)	0.485 (48.5%)
Cohen’s κ	0.515
PABAK	0.500
Interpretation (Landis and Koch)	Moderate
Agreed INCLUDE	58
Agreed EXCLUDE	92
Agreed UNCERTAIN	0
Disagreements	50
Resolved by third reviewer	50

Table 5. Agreement matrix between Reviewer 1 and Reviewer 2 (n = 200).

	R2: INCLUDE	R2: EXCLUDE	R2: UNCERTAIN	Total
R1: INCLUDE	58	14	3	75
R1: EXCLUDE	23	92	8	123
R1: UNCERTAIN	0	2	0	2
Total	81	108	11	200

Table 6. Final Gold Standard decision distribution after disagreement resolution.

Decision	Count	Percentage	LLM Ground Truth
INCLUDE	67	33.5%	INCLUDE (67)
EXCLUDE	126	63.0%	EXCLUDE (126)
UNCERTAIN	7	3.5%	INCLUDE (7)
Total	200	100%	74 INC/126 EXC

Table 7. Zero-shot screening performance on the Gold Standard (n = 200).

Str.	Model	Rec.%	Prec.%	F1%	WSS@95%	TP	FP	FN	TN
S1	Mistral 7B	98.7	52.1	68.2	25.0	73	67	1	59
	LLaMA 3.1 8B	100.0	59.7	74.8	33.0	74	50	0	76
	Qwen 2.5 7B	94.6	72.9	82.3	47.0	70	26	4	100
	Granite 3.3 8B	98.7	36.7	53.5	−4.5	73	126	1	0
S2	MLG	100.0	51.7	68.2	23.5	74	69	0	57
	MLQ	100.0	60.2	75.1	33.5	74	49	0	77
S3	MLQ	100.0	51.7	68.2	23.5	74	69	0	57
S4	MLG	100.0	52.1	68.5	24.0	74	68	0	58
	MLQ	100.0	60.2	75.1	33.5	74	49	0	77
S5	G → M + L	100.0	52.1	68.5	24.0	74	68	0	58
	L → M + G	100.0	59.7	74.8	33.0	74	50	0	76
	M → L + Q	98.7	59.8	74.5	34.0	73	49	1	77

Note: The full evaluation set (n = 200) is used, as zero-shot configurations do not employ calibration examples. The ≥95% recall acceptance threshold reflects the convention that missing relevant studies poses a greater risk than including irrelevant ones.

Table 8. Few-shot screening performance (n = 190).

Str.	Model	Rec.%	Prec.%	F1%	WSS@95%	TP	FP	FN	TN
S1	Mistral 7B	100.0	48.6	65.4	20.3	69	73	0	48
	LLaMA 3.1 8B	98.6	56.2	71.6	31.3	68	53	1	68
	Qwen 2.5 7B	100.0	70.4	82.6	43.4	69	29	0	92
	Granite 3.3 8B	100.0	36.3	53.3	−5.0	69	121	0	0
S2	MLG	100.0	47.6	64.5	18.7	69	76	0	45
	MLQ	100.0	58.0	73.4	32.4	69	50	0	71
S3	MLG	100.0	36.3	53.3	−5.0	69	121	0	0
	MLQ	100.0	47.3	64.2	18.2	69	77	0	44
S4	MLG	100.0	47.6	64.5	18.7	69	76	0	45
	MLQ	100.0	58.0	73.4	32.4	69	50	0	71
S5	Q → M + L	100.0	61.6	76.2	36.0	69	43	0	78
	M → L + Q	100.0	58.0	73.4	32.4	69	50	0	71
	L → M + Q	98.6	57.6	72.7	32.9	68	50	1	71

Note: 10 calibration papers were excluded from evaluation (see Section 3.6), resulting in n = 190. The ≥95% recall acceptance threshold reflects the convention that missing relevant studies poses a greater risk than including irrelevant ones.

Table 9. Top five configurations by F1 score.

Rank	Str.	Model(s)	Mode	Rec.%	Prec.% [95% CI]	F1%	WSS@95%
1	S1	Qwen 2.5 7B	FS	100.0	70.4 [60.7–78.5]	82.6	43.4
2	S5	Q → M + L	FS	100.0	61.6 [52.4–70.1]	76.2	36.0
3	S5	M → L + Q	ZS	98.7	59.8 [51.0–68.1]	74.5	34.0
4	S2	MLQ	ZS	100.0	60.2 [51.3–68.4]	75.1	33.5
5	S4	MLQ	ZS	100.0	60.2 [51.3–68.4]	75.1	33.5

Table 10. Full corpus screening results.

Rank	Str.	Model(s)	Mode	Rec.%	Prec.%	F1%	WSS@95%	Included	FP
1	S1	Qwen 2.5 7B	FS	100.0	70.4	82.6	43.4	950	29
2	S5	Q → M + L	FS	100.0	61.6	76.2	36.0	1131	43
3	S5	M → L + Q	ZS	100.0	61.1	75.8	35.5	1121	44
4	S2	MLQ	ZS	100.0	61.1	75.8	35.5	1125	44
5	S4	MLQ	ZS	100.0	61.1	75.8	35.5	1124	44

Table 11. Cross-strategy agreement: papers excluded by S1 but included by multi-agent strategies.

Category	Metric	Value
Set composition	S1 selected (full corpus)	950
	All four MA selected	1066
	Intersection (S1 ∩ MA)	926
	In all MA but not S1	140
GS cross-reference	Of 140: found in GS	11
	GS decision for all 11	EXCLUDE
	S1 correct (true negatives)	11/11
S1 exclusion metadata	Confidence MEDIUM	139 (99.3%)
	Confidence LOW	1 (0.7%)
	Criterion EC5 cited	139 (99.3%)
	Criterion EC3 cited	135 (96.4%)

Table 12. False positive counts and exclusion criteria citations per model (S1, n = 190).

Model	FP (ZS)	FP (FS)	EC Cited (ZS)	EC Cited (FS)
Granite 3.3 8B	121	121	0	0
Mistral 7B	62	73	30	84
LLaMA 3.1 8B	45	53	163	270
Qwen 2.5 7B	25	29	248	365

Table 13. Exclusion categories for false positives common to all five full corpus configurations.

Category	Count	Example
EC3: No original contribution (opinion, overview, survey)	19	“Disrupting the Ballot Box: Blockchain as a Catalyst for Innovation in Electoral Processes”
EC1: No blockchain implementation	6	“Privacy Preserving E-Voting System Using Homomorphic Encryption”
EC5: Non-English abstract	2	“Implementasi Sistem E-Voting Berbasis Blockchain…”
EC2: Non-electoral domain	1	“Controllable anonymous authentication scheme based on blockchain…”
Total	28

Table 14. Screening performance on agreed and disputed Gold Standard subsets.

Rank	Config.	Subset	n	TP	FP	FN	TN	Recall	Precision	F1
1	S1 Qwen FS	Agreed	144	55	12	0	77	1.000	0.821	0.902
1	S1 Qwen FS	Disputed	46	14	17	0	15	1.000	0.452	0.622
2	S5 Q → M + L FS	Agreed	144	55	21	0	68	1.000	0.724	0.840
2	S5 Q → M + L FS	Disputed	46	14	22	0	10	1.000	0.389	0.560
3	S5 M → L + Q ZS	Agreed	150	58	24	0	68	1.000	0.707	0.829
3	S5 M → L + Q ZS	Disputed	50	15	25	1	9	0.938	0.375	0.536
4	S2 MLQ ZS	Agreed	150	58	24	0	68	1.000	0.707	0.829
4	S2 MLQ ZS	Disputed	50	16	25	0	9	1.000	0.390	0.561
5	S4 MLQ ZS	Agreed	150	58	24	0	68	1.000	0.707	0.829
5	S4 MLQ ZS	Disputed	50	16	25	0	9	1.000	0.390	0.561

Note: Few-shot configurations were evaluated on 190 papers (6 agreed-subset and 4 disputed-subset calibration papers excluded).

Table 15. Structure of the 2 × 2 contingency table for paired classifier comparison.

	Config B Correct	Config B Incorrect
Config A correct	ncc	nAB
Config A incorrect	nBA	nww

Table 16. Pairwise McNemar’s exact test results for the top five configurations.

Pair	Config A	Config B	n	ncc	nAB	nBA	nww	nd	p	Power
1–2	S1 Qwen FS	S5 Q → M + L FS	190	146	15	1	28	16	0.0005 *	0.94
1–3	S1 Qwen FS	S5 M → L + Q ZS	190	145	16	1	28	17	0.0003 *	0.95
1–4	S1 Qwen FS	S2 MLQ ZS	190	145	16	1	28	17	0.0003 *	0.95
1–5	S1 Qwen FS	S4 MLQ ZS	190	145	16	1	28	17	0.0003 *	0.95
2–3	S5 Q → M + L FS	S5 M → L + Q ZS	190	143	4	3	40	7	1.00	0.07
2–4	S5 Q → M + L FS	S2 MLQ ZS	190	143	4	3	40	7	1.00	0.07
2–5	S5 Q → M + L FS	S4 MLQ ZS	190	143	4	3	40	7	1.00	0.07
3–4	S5 M → L + Q ZS	S2 MLQ ZS	200	150	0	1	49	1	1.00	0.17
3–5	S5 M → L + Q ZS	S4 MLQ ZS	200	150	0	1	49	1	1.00	0.17
4–5	S2 MLQ ZS	S4 MLQ ZS	200	151	0	0	49	0	—	—

Note: papers evaluated in common. ncc: both correct. nAB: correct only by A. nBA: correct only by B. nww: both incorrect. nd: discordant pairs. p: exact two-sided p-value. Power: estimated following [37]. Dashes: test undefined (zero discordant pairs). *: statistically significant (p < 0.05).

Table 17. Clopper–Pearson 95% exact confidence intervals for recall.

Rank	Configuration	n	Correct/Pos	Recall	95% CI
1	S1 Qwen FS	190	69/69	1.000	[0.948, 1.000]
2	S5 Q → M + L FS	190	69/69	1.000	[0.948, 1.000]
3	S5 M → L + Q ZS	200	73/74	0.986 *	[0.927, 1.000]
4	S2 MLQ ZS	200	74/74	1.000	[0.951, 1.000]
5	S4 MLQ ZS	200	74/74	1.000	[0.951, 1.000]

Correct/Pos: true INCLUDE papers correctly identified out of total true INCLUDE papers. *: imperfect recall.

Table 18. Summary comparison of top-ranked configurations on Gold Standard and full corpus.

		Rank 1	Rank 2	Rank 3	Rank 4	Rank 5
Str.		S1	S5	S5	S2	S4
Config		Qwen 2.5 7B	Q → M + L	M → L + Q	MLQ	MLQ
Mode		FS	FS	ZS	ZS	ZS
Rec%	GS	100.0	100.0	98.7	100.0	100.0
	FC	100.0	100.0	100.0	100.0	100.0
F1%	GS	82.6	76.2	74.5	75.1	75.1
	FC	82.6	76.2	75.8	75.8	75.8
WSS@95%	GS	43.4	36.0	34.0	33.5	33.5
	FC	43.4	36.0	35.5	35.5	35.5
FP	GS	29	43	49	49	49
	FC	29	43	44	44	44
FN	GS	0	0	1	0	0
	FC	0	0	0	0	0

Table 19. Agent confidence distribution and S4 vs. S2 agreement.

Model	n	HIGH (%)	MEDIUM (%)	LOW (%)
Mistral 7B	2036	1959 (96.2)	19 (0.9)	58 (2.8)
LLaMA 3.1 8B	2036	2025 (99.5)	9 (0.4)	2 (0.1)
Qwen 2.5 7B	2036	1585 (77.8)	451 (22.2)	0 (0.0)
S4 vs. S2 agreement	2036		2034 (99.9%)	2 disagreements

Table 20. Computational cost per strategy on the full corpus (4-bit MLX, sequential inference).

Config	Calls/Paper	Output Tokens/Paper	×Tokens vs. S1	Hardware
S1 Qwen FS	1	440	1.00	M2
S2 MLQ ZS	3	599	1.36	M2
S4 MLQ ZS	3	599	1.36	M2
S5 Q → M + L FS	1 or 3 *	524	1.19	M1 Pro
S5 M → L + Q ZS	1 or 3 *	415	0.94	M1 Pro

Note: * S5 uses two-stage filtering: papers receiving a HIGH-confidence EXCLUDE in Stage 1 are resolved with a single model call; the remaining papers proceed to a three-model panel in Stage 2. Stage 1 resolved 41% (S5 Q → M + L FS) and 35% (S5 M → L + Q ZS) of papers.

Table 21. Granite ablation: MLG (with Granite) vs. MLQ (with Qwen) ensembles on the Gold Standard.

Strategy	Mode	F1 (MLG)	F1 (MLQ)	ΔF1	Prec (MLG)	Prec (MLQ)	FP (MLG)	FP (MLQ)	FP Reduction
S2	ZS	0.682	0.751	+0.069	0.517	0.602	69	49	20
S2	FS	0.645	0.734	+0.089	0.476	0.580	76	50	26
S3	FS	0.533	0.642	+0.109	0.363	0.473	121	77	44
S4	ZS	0.685	0.751	+0.066	0.521	0.602	68	49	19
S4	FS	0.645	0.734	+0.089	0.476	0.580	76	50	26

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Radeva, I.; Noncheva, T.; Doukovska, L.; Popchev, I. Comparing Single-Agent and Multi-Agent Strategies in LLM-Based Title-Abstract Screening. Electronics 2026, 15, 1661. https://doi.org/10.3390/electronics15081661

AMA Style

Radeva I, Noncheva T, Doukovska L, Popchev I. Comparing Single-Agent and Multi-Agent Strategies in LLM-Based Title-Abstract Screening. Electronics. 2026; 15(8):1661. https://doi.org/10.3390/electronics15081661

Chicago/Turabian Style

Radeva, Irina, Teodora Noncheva, Lyubka Doukovska, and Ivan Popchev. 2026. "Comparing Single-Agent and Multi-Agent Strategies in LLM-Based Title-Abstract Screening" Electronics 15, no. 8: 1661. https://doi.org/10.3390/electronics15081661

APA Style

Radeva, I., Noncheva, T., Doukovska, L., & Popchev, I. (2026). Comparing Single-Agent and Multi-Agent Strategies in LLM-Based Title-Abstract Screening. Electronics, 15(8), 1661. https://doi.org/10.3390/electronics15081661

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Comparing Single-Agent and Multi-Agent Strategies in LLM-Based Title-Abstract Screening

Abstract

1. Introduction

1.1. Problem

1.2. Opportunity

1.3. Research Questions

1.4. Goal and Tasks

1.5. Paper Organisation

2. Related Work

2.1. LLM-Assisted Screening in Systematic Reviews

2.2. Multi-Agent and Ensemble LLM Strategies

2.3. Inter-Rater Reliability in Systematic Reviews

2.4. Research Gaps

3. Methods

3.1. Framework Overview

3.2. Case Study Domain

3.3. Dataset Construction

3.3.1. Search Strategy

3.3.2. Data Sources

3.3.3. Deduplication

3.3.4. E-Voting Context Filtering

3.4. Gold Standard Protocol

3.4.1. Sampling

3.4.2. Screening Criteria

3.4.3. Screening Procedure

3.4.4. Inter-Rater Agreement and Disagreement Resolution

3.5. LLM Coordination Strategies

Models

3.6. Few-Shot Example Selection Protocol

3.7. Evaluation Protocol

3.8. PaSSER-SR Platform

3.9. Blockchain Audit Trail

4. Results

4.1. Corpus Statistics

4.2. Gold Standard and Inter-Rater Agreement

4.2.1. Zero-Shot Screening Results

4.2.2. Few-Shot Screening Results

4.2.3. Strategy Comparison and Ranking

4.3. Full Corpus Screening

4.4. Error Analysis

4.5. Sensitivity to Ground-Truth Agreement

4.6. Pairwise Statistical Comparison

4.7. Summary of Results

5. Discussion

5.1. Model Capability vs. Coordination Complexity

5.2. Error Patterns and the Precision Ceiling

5.3. Implications for Practice

5.4. Limitations

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI