Review

Large Language Models in Systematic Review Screening: Opportunities, Challenges, and Methodological Considerations

1 Histology and Embryology Laboratory, Department of Medicine and Surgery, University of Parma, 43126 Parma, Italy
2 Department of Biosciences, University of Milan, 20122 Milan, Italy
3 Department of Medicine and Surgery, Dental School, University of Parma, 43126 Parma, Italy
4 Centre for Oral Clinical Research, Institute of Dentistry, Faculty of Medicine and Dentistry, Queen Mary University of London, London E1 2AD, UK
* Author to whom correspondence should be addressed.
Information 2025, 16(5), 378; https://doi.org/10.3390/info16050378
Submission received: 12 March 2025 / Revised: 23 April 2025 / Accepted: 28 April 2025 / Published: 1 May 2025
(This article belongs to the Special Issue Semantic Web and Language Models)

Abstract

Systematic reviews require labor-intensive screening processes—an approach prone to bottlenecks, delays, and scalability constraints in large-scale reviews. Large Language Models (LLMs) have recently emerged as a powerful alternative, capable of operating in zero-shot or few-shot modes to classify abstracts according to predefined criteria without requiring the continuous human intervention on which semi-automated platforms depend. This review focuses on the central challenges that users in the biomedical field encounter when integrating LLMs—such as GPT-4—into evidence-based research. It examines critical requirements for software and data preprocessing, discusses various prompt strategies, and underscores the continued need for human oversight to maintain rigorous quality control. By drawing on current practices for cost management, reproducibility, and prompt refinement, this article highlights how review teams can substantially reduce screening workloads without compromising the comprehensiveness of evidence-based inquiry. The findings presented aim to balance the strengths of LLM-driven automation with structured human checks, ensuring that systematic reviews retain their methodological integrity while leveraging the efficiency gains made possible by recent advances in artificial intelligence.

1. Introduction

Systematic reviews have become essential for evidence-based decision-making in several fields, particularly in medicine, because they synthesize all relevant studies on a particular topic in a transparent and methodical way [1]. By adhering to predefined protocols and rigorous inclusion criteria, systematic reviews aim to minimize bias and generate high-level evidence to inform policy, clinical practice, and research priorities [2]. Standard guidance on these methods emphasizes formulating a clear research question (often using the PICO framework [3]), developing a detailed review protocol, conducting a comprehensive search of multiple databases, screening studies for eligibility, extracting data, assessing quality and risk of bias, and synthesizing findings for reporting [4].
One of the most labor-intensive and error-prone stages in a systematic review is the initial screening of titles and abstracts to identify pertinent studies that can be used as a base for the systematic review [5]. Double screening by independent reviewers has long been considered the gold standard [6], yet it is equally recognized that this level of rigor demands substantial time and human resources [7]. Large academic databases often generate thousands of search results, requiring researchers to manually sift through extensive lists of potentially relevant citations to pinpoint the small number of articles that meet inclusion criteria—often fewer than a dozen for a typical review. This process makes it both logistically challenging and costly for research teams to maintain high accuracy while also managing time constraints [8]. Although searching itself is highly structured [9]—often guided by established protocols such as the Cochrane Handbook or NICE guidelines [10,11]—many reviews still rely predominantly on manual screening methods that are vulnerable to inconsistencies across reviewers, especially when they lack extensive experience or when the volume of studies is extremely large [12]. Because the accuracy of screening directly influences the reliability of the final synthesis, poorly executed screening risks omitting critical evidence, ultimately undermining the entire review process.
In response to these challenges, semi-automated tools such as Rayyan, Abstrackr, or Research Screener have emerged to assist with citation management, deduplication, and study selection [13]. Rayyan offers a web- and mobile-based AI-assisted environment that learns from inclusion and exclusion decisions and suggests likely matches [14]. Research Screener uses deep learning and text embeddings to re-rank articles after each new judgment made by the reviewer [15]. While these semi-automated methods decrease the number of abstracts that require manual review, they typically rely on iterative human feedback to train their predictive models [16]. This approach often helps maintain high recall—the proportion of truly relevant articles identified—but still requires a prolonged “learning phase” before users realize the most significant time savings.
Alongside semi-automated platforms, a variety of additional automation efforts have sought to refine each stage of a systematic review. Some tools focus exclusively on searching, such as LitSuggest, which recommends relevant articles from PubMed [17], whereas others support more advanced tasks like data extraction (RobotReviewer, ExaCT) [18,19]. Despite their potential, full automation remains elusive, particularly in later phases of a review where human expertise is needed to interpret nuanced results [12]. Moreover, most software currently operates in isolation, forcing researchers to stitch together different tools that are not always interoperable [16].
Recent advances in natural language processing (NLP) have begun to shift the focus from traditional machine learning pipelines to modern Large Language Models (LLMs), such as GPT-4 and other state-of-the-art architectures [20]. Unlike conventional semi-automated screening tools, LLMs can classify abstracts in a zero-shot or few-shot mode simply by relying on well-structured queries, commonly referred to as “prompts” that detail inclusion and exclusion criteria [21]. Multiple studies have evaluated LLMs against human screening in diverse domains and reported encouraging results, albeit with notable variability across different models and datasets [22]. Recent studies highlight both the promise and variability of LLMs in medical literature screening. For instance, Delgado-Chaves et al. evaluated 18 LLMs across three clinical domains, observing classification accuracy ranging from 40% to 92% [23]. Similarly, it has been shown that systematic prompt optimization enabled GPT-4o and Claude-3.5 to achieve sensitivities and specificities approaching 98% for thoracic surgery meta-analyses, suggesting that targeted adjustments during screening rounds yield measurable improvements [24]. Meanwhile, investigations into open-source models revealed similar dependencies on design choices: the testing of four LLMs on biomedical datasets documented dramatic fluctuations in sensitivity and specificity based on model selection and prompt phrasing [25]. And even high-performing models such as GPT-4 have been reported to falter when confronted with dataset imbalances or low-prevalence agreement scenarios—a potent reminder of the persistent gap between laboratory validation and real-world application [26]. Such findings attest to the growing promise of LLMs for accelerating the screening phase of systematic reviews but also highlight the need for human oversight in verifying edge cases and ensuring high performance.
Modern LLMs can be rapidly adapted through prompt engineering, often making them more flexible for screening tasks in varied domains [27]. Nevertheless, clear protocols and refined inclusion/exclusion criteria remain vital because even the most advanced LLM can propagate errors if initial instructions or domain-specific nuances are overlooked [23,24]. Amid these opportunities and caveats, the question is no longer whether LLMs can assist in systematic review screening, but rather how best to implement them so that they enhance speed and consistency without compromising the rigorous standards necessary for high-quality evidence synthesis [28].
In this paper, we explore the integration of Large Language Models (LLMs) into the screening stage of systematic reviews. We discuss critical methodological challenges, including data preparation, model selection, prompt engineering, and the importance of maintaining human oversight to ensure screening quality and rigor. This review highlights practical strategies for optimizing prompts and managing costs and provides actionable insights on reducing the manual workload associated with literature screening. The study underscores the necessity of balancing automation benefits with structured quality checks and proposes draft guidelines to standardize the use of LLMs, ensuring reproducibility, ethical integrity, and methodological consistency in systematic reviews. What sets this review apart is its integrative scope. We braid together three threads that have so far run in parallel: the engineering know-how around transformer variants and parameter-efficient tuning; the new CONSORT-AI and PRISMA-AI standards that demand transparent error analysis; and a pragmatic, step-by-step guideline that makes those standards actionable for screening teams. By combining them into a single, reproducible workflow, we believe the conversation can move from whether LLMs can help to how any review group can deploy them responsibly.

2. LLM-Based Screening: Rationale and Key Considerations

2.1. Foundational Concepts and Terminology

A few terms used throughout this survey benefit from a clear initial definition. Systematic reviews are a publication type that follows predefined protocols to collect and synthesize evidence in a transparent manner [1], while LLMs are sophisticated generative models capable of interpreting and producing natural language [29].
Prompt engineering is the practice of carefully crafting the instructions or queries (known as prompts) presented to an LLM, with the goal of eliciting the most accurate or context-appropriate response [30]. This review will also mention terms such as zero-shot classification. In zero-shot classification, the model applies instructions to novel tasks without prior specialized training. In one-shot or few-shot classifications, it can absorb context from, respectively, one or a handful of examples provided in the prompt [31].
Recall (synonymous with sensitivity in classification tasks) refers to the proportion of truly relevant articles a model correctly identifies, while precision represents the proportion of articles labeled as relevant that genuinely meet the inclusion criteria [32]. Specificity, conversely, measures the model’s ability to exclude irrelevant articles (true negative rate). In systematic review screening—where minimizing missed evidence (false negatives) is paramount—recall/sensitivity is prioritized, though precision remains critical to avoid overwhelming reviewers with false positives. Throughout this review, both recall and precision emerge as central to assessing an LLM’s screening performance.
The use of LLMs in literature screening builds on fundamental systematic review principles—such as formulating clear inclusion/exclusion criteria—and adapts them to AI-driven classification, enabling abstracts to be categorized efficiently without a lengthy training phase and freeing researchers to focus on other aspects of the review process. By structuring prompts around well-defined inclusion and exclusion criteria, an LLM’s language understanding can be exploited to categorize studies and identify the articles pertinent to the systematic review. This process still requires careful human intervention, especially for checking gray areas and refining prompt wording [28].

2.2. Technical and Logistical Considerations

Research teams with foundational expertise in systematic review methodology and familiarity with computational tools are likely to derive the greatest benefit from LLM-based approaches, as they can more effectively integrate these technologies into their workflows. Implementations vary in complexity, but, at a minimum, users need reliable access to a modern LLM such as GPT-4, GPT-3.5, or DeepSeek-R1, which can be accessed through a cloud-based API [33]; some LLMs can be installed locally if suitable hardware and software are available [34]. Establishing API-based access involves setting up credentials and ensuring a stable internet connection, whereas running an LLM locally requires significant computational resources, including a dedicated GPU with sufficient memory (usually 8–16 GB of VRAM for smaller open-source models and substantially more for larger architectures). Cloud services such as Google Colab can be a useful resource for running models remotely on platforms equipped with the necessary resources [35]. These hardware demands can influence the scale of the review and the practicalities of high-volume screening, particularly when screening thousands of abstracts.
A critical element of this setup is a robust environment for data handling and preprocessing. Python libraries such as pandas are frequently employed in the literature for data preprocessing [36]: organizing references, removing duplicates, and converting downloaded records into a uniform tabular format, ideally stored in a uniform file format (e.g., CSV). Although coding expertise does not have to be extensive, a working knowledge of Python syntax, basic scripting, and command-line tools significantly streamlines the process of merging database outputs, cleaning messy metadata, and customizing LLM prompts [37]. Familiarity with virtual environments or package managers (such as conda or pip) can be particularly helpful for maintaining consistency and reproducibility, since the rapid pace of AI development often results in frequent updates and version changes to software packages [38].
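For illustration, a minimal pandas sketch of this preprocessing step is shown below; the file names and column headings (title) are assumptions about how the database exports were saved, not a prescribed format.
import pandas as pd

# Load exports from several databases into one table (file and column names are illustrative).
frames = [pd.read_csv(path, encoding="utf-8") for path in ["pubmed.csv", "embase.csv"]]
citations = pd.concat(frames, ignore_index=True)

# Normalize the title field before matching, then remove exact duplicates.
citations["title_norm"] = (
    citations["title"].fillna("").str.lower().str.replace(r"\s+", " ", regex=True).str.strip()
)
citations = citations.drop_duplicates(subset="title_norm", keep="first")

# Store the cleaned screening set in a single UTF-8 encoded file.
citations.to_csv("screening_set.csv", index=False, encoding="utf-8")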
Cost considerations for API-based LLM services are a recurring theme in the literature, particularly for large-scale reviews [39,40]. Balancing the benefits of higher accuracy from more advanced models with the financial impact of repeated queries is vital for long-term feasibility. If budgets are constrained, smaller open-source models may provide an adequate starting point, even though they occasionally require more extensive prompt tuning or additional error checking to reach acceptable levels of recall and precision [41,42,43]. For users planning to work on private or sensitive datasets, the local deployment of open-source or self-hosted models can address data security concerns, but this option does increase the burden of setup, hardware maintenance, and ongoing troubleshooting [44,45,46,47].
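To make such budgeting concrete, the short sketch below estimates input token counts with the tiktoken library; the encoding choice and the per-1000-token price are illustrative assumptions to be replaced with the values published by the chosen provider.
import tiktoken

# cl100k_base is the tokenizer used by several OpenAI chat models; adjust to the target model.
encoding = tiktoken.get_encoding("cl100k_base")

def estimate_screening_cost(abstracts, prompt_template, usd_per_1k_input_tokens=0.005):
    """Rough input-side cost estimate; the price argument is a placeholder, not a quoted rate."""
    total_tokens = sum(len(encoding.encode(prompt_template + text)) for text in abstracts)
    return total_tokens, total_tokens / 1000 * usd_per_1k_input_tokens

tokens, cost = estimate_screening_cost(["Abstract text one...", "Abstract text two..."], "Inclusion criteria: ...")
print(f"{tokens} input tokens, approx. ${cost:.2f}")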

2.3. Prerequisites for Literature Screening

Published guidance underscores the importance of articulating a clear research question and inclusion criteria prior to screening [48]. These are often summarized through frameworks such as PICO (Population, Intervention, Comparison, Outcome) or one of its close variations [3,49,50,51]. For instance, a review investigating “the effectiveness of antihypertensive Drug A versus placebo in reducing systolic blood pressure among adults with hypertension” could specify the following:
  • Population: Adults aged 18–75 with primary hypertension.
  • Intervention: Daily oral administration of Drug A.
  • Comparison: Placebo.
  • Outcome: Mean change in systolic blood pressure at 12 weeks.
Frameworks like PICO are widely used to structure research questions for systematic reviews, and their usefulness for AI-driven solutions has been broadly investigated in the current literature [28,52,53,54,55].
Dai et al. demonstrated that the precision of an LLM’s output depends substantially on how accurately these criteria are translated into prompts or instructions [24]. Articulating the review question in detail ensures that the model can target specific populations, interventions, and outcomes without veering into irrelevant territory [56].
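One convenient way to keep this mapping explicit, sketched below under the assumption that the team works in Python, is to store the PICO elements of the hypertension example as structured data and render them into the criteria section of a screening prompt.
# PICO elements for the worked hypertension example; the wording is illustrative.
pico = {
    "Population (P)": "Adults aged 18-75 with primary hypertension",
    "Intervention (I)": "Daily oral administration of Drug A",
    "Comparison (C)": "Placebo",
    "Outcome (O)": "Mean change in systolic blood pressure at 12 weeks",
}

# Render the criteria block that will later be embedded in the screening prompt.
criteria_block = "\n".join(f"{label}: {value}." for label, value in pico.items())
print(criteria_block)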

3. Implementation Considerations for LLM-Based Screening

3.1. Data Acquisition and Preparation

Any literature screening begins (Figure 1) with a comprehensive search for potential studies in relevant databases [57,58,59]. Database access forms a cornerstone of any systematic review [60,61], and access to databases like Medline and Embase [60,62] is frequently cited as critical for comprehensive evidence retrieval [57,60,61,63]. Medline is undoubtedly the most renowned literature database in biomedicine and can be freely accessed both through its PubMed web portal [64] and programmatically via its API, using Python libraries such as Biopython [65]; the latter route is advantageous because retrieved articles can be stored in data structures that can be passed directly to LLMs.
In most systematic reviews, querying the databases involves crafting detailed strategies that reflect the components outlined by PICO [5]. Queries are often expanded to include synonyms, related keywords, and Medical Subject Headings (MeSH), if applicable, to avoid overlooking relevant articles [66]. These elements can be combined in a database-specific syntax to query the database [67]. For instance, a review question structured around PICO components, such as “the efficacy of cognitive behavioral therapy (CBT) versus antidepressants for reducing depressive symptoms in adolescents”, might generate a PubMed query structured as follows:
(“Adolescent”[MeSH] OR “teen*”[tiab] OR “youth”[tiab])
AND (“Cognitive Behavioral Therapy”[MeSH] OR “CBT”[tiab])
AND (“Antidepressive Agents”[MeSH] OR “SSRI”[tiab] OR “SNRI”[tiab])
AND (“Depression”[MeSH] OR “depressive symptoms”[tiab])
AND (“Treatment Outcome”[MeSH] OR “remission”[tiab])
Here, keywords are accompanied by tags within square brackets that specify, e.g., whether they belong to the MeSH lexicon [66] or whether they should be searched in titles and abstracts. Once the searches are executed, the resulting citations are exported in a consistent format—such as CSV, RIS, or XML—that can be imported and processed by data analysis libraries in Python (or other environments) [57]. If the query is conducted by accessing PubMed via its API, the results can be stored directly in a Python data structure (e.g., a pandas DataFrame), ready for subsequent processing.
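A minimal sketch of this API-based route is shown below; it assumes Biopython is installed, that a contact e-mail is registered with NCBI as its usage policy requires, and it reuses (in truncated form) the adolescent-depression query above.
import pandas as pd
from Bio import Entrez

Entrez.email = "reviewer@example.org"  # required by NCBI; replace with a real address

query = '("Adolescent"[MeSH] OR "teen*"[tiab]) AND ("Cognitive Behavioral Therapy"[MeSH] OR "CBT"[tiab])'

# Retrieve matching PubMed IDs, then fetch the corresponding records in XML form.
ids = Entrez.read(Entrez.esearch(db="pubmed", term=query, retmax=200))["IdList"]
records = Entrez.read(Entrez.efetch(db="pubmed", id=ids, retmode="xml"))

rows = []
for article in records["PubmedArticle"]:
    meta = article["MedlineCitation"]["Article"]
    abstract = " ".join(str(part) for part in meta.get("Abstract", {}).get("AbstractText", []))
    rows.append({"pmid": str(article["MedlineCitation"]["PMID"]),
                 "title": str(meta["ArticleTitle"]),
                 "abstract": abstract})

citations = pd.DataFrame(rows)  # tabular structure ready for the screening pipeline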
After collating results from multiple databases, the records must be cleaned and standardized [68]. This often involves removing exact duplicates—an issue that arises frequently when the same article appears across different platforms—and converting files to a uniform character encoding (e.g., UTF-8) to avoid text corruption [69]. Standardizing metadata fields, such as title and abstract, is a common preprocessing step to ensure consistency [70]. Attention to detail at this stage pays off in later steps, as it prevents misclassification and missed articles caused by inconsistent data fields [71]. Data cleaning can dramatically improve downstream performance by reducing irrelevant noise that misleads both conventional machine learning pipelines and modern LLMs [23].

3.2. LLM Selection and Prompt Design

At the heart of any LLM-augmented screening pipeline lies a two-fold decision: which model architecture to deploy and how to shape the prompts that govern its behavior. The transformer family dominates this choice because its self-attention layers capture the long-range dependencies that typify biomedical abstracts [72,73]. Encoder-only variants such as BERT or PubMedBERT compress a paragraph into a contextual embedding that a lightweight classification head can read [74], whereas decoder-only chat models like GPT-3.5 or GPT-4 emit an “ACCEPT” or “REJECT” token directly through autoregressive generation [75]. Hybrid encoder–decoder systems—including Flan-T5, Mixtral, and the latest DeepSeek releases—offer bidirectional context plus generative flexibility, an advantage whenever reviewers ask the model to justify its label with a short evidence snippet [76,77,78].
Architecture alone, however, rarely suffices; clinical subfields often introduce terminology and contextual nuances that degrade zero-shot performance, necessitating parameter-efficient fine-tuning methods. Low-rank adaptation (LoRA), for instance, injects trainable matrices into transformer blocks, allowing models to assimilate domain-specific patterns from limited labeled data while preserving foundational knowledge. This approach balances specialization with computational frugality, avoiding the prohibitive costs of full-model retraining [79].
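A minimal sketch of such an adapter setup with the Hugging Face peft library is given below; the base checkpoint, rank, and other hyperparameters are illustrative placeholders, and a biomedical encoder such as PubMedBERT could be substituted for the generic BERT model.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

# Base encoder with a binary include/exclude head; the checkpoint name is illustrative.
base = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # used to tokenize abstracts for training

# Inject trainable low-rank adapter matrices into the attention projections only.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                 # rank of the update matrices
    lora_alpha=16,       # scaling factor
    lora_dropout=0.1,
    target_modules=["query", "value"],
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of the full parameter count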
Validation frameworks must align with these technical choices. Rigorous evaluation reserves a stratified subset of citations to assess generalization, measuring performance via AUROC [80] and confusion matrices stratified by PICO elements to expose biases—such as reduced recall for non-English studies. Stress tests across clinical domains (e.g., applying a cardiology-tuned model to oncology abstracts) can be used to probe overfitting, while temporal validation on post-deployment data monitors concept drift. Human adjudication of low-confidence predictions closes the feedback loop, iteratively refining both prompts and model parameters.
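The following scikit-learn sketch illustrates one way to compute these quantities, assuming a held-out validation table with gold-standard labels, model scores, binary predictions, and a stratification column (all column names are illustrative):
import pandas as pd
from sklearn.metrics import roc_auc_score, confusion_matrix

# validation: DataFrame with gold-standard 'label', model 'score', binary 'pred',
# and a stratification column such as 'language' (names are illustrative).
validation = pd.read_csv("validation_set.csv")

print("AUROC:", roc_auc_score(validation["label"], validation["score"]))

# Stratified confusion matrices expose biases such as reduced recall for a subgroup.
for group, subset in validation.groupby("language"):
    tn, fp, fn, tp = confusion_matrix(subset["label"], subset["pred"], labels=[0, 1]).ravel()
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    print(f"{group}: recall={recall:.2f} (TP={tp}, FN={fn}, FP={fp}, TN={tn})")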
Practical implementation, however, confronts infrastructural realities. Smaller open-source checkpoints such as GPT-2 or the base Flan-T5 can power pilot projects on ordinary GPUs, but their shallow context limits mean that nuanced inclusion criteria often demand aggressive prompt engineering and post hoc filtering. By contrast, frontier-scale models—GPT-4, Claude-3 Sonnet, Gemini-1.5 Pro—deliver near-human recall on most public benchmarks yet incur higher token costs [81] and may require protected-data agreements if accessed through commercial APIs [44]. Table S1 summarizes the evidence to date, mapping each architecture to its documented strengths, typical pitfalls, and reported screening accuracy [23,24,28,82].
Comparative evaluations consistently show a wide range of performance across models [83,84,85,86] and it has been shown that larger models can excel in contextual awareness, preserving coherence when confronted with intricate abstracts, while smaller models often produce output more quickly but risk overlooking nuanced details [87]. Generation speed and context management can vary widely across models depending on both the underlying architecture and the target hardware environment [88].
These efficiency considerations become especially relevant in systematic reviews, which can easily involve thousands of abstracts to classify [89]. Overly long prompts, although potentially more instructive, may result in slower inference speeds, and a near-linear increase has been observed in total processing time in models receiving large prompt sizes [88]. Researchers must weigh the complexity of their prompts—particularly if they include multiple inclusion/exclusion criteria or domain-specific nuances—against the desire for rapid classification.
Developing effective prompts remains the other crucial pillar in model selection. Even advanced models with large context windows can produce erratic outputs if the instructions conflict or are excessively vague [24]. Slight rewording of a prompt can either inflate false positives (when instructions are too permissive) or inadvertently exclude relevant studies (when instructions are too strict) [23,24].

3.3. Fundamentals and Challenges of Prompt Design

A prompt refers to the instructions, context, or background information given to an LLM so that the model can respond in a manner consistent with the user’s objectives [90]. Unlike traditional machine learning classifiers that rely on iterative retraining, modern LLMs use these prompts as immediate instructions that guide the model’s behavior [90,91]. The structure of a well-crafted prompt typically includes a concise statement of the task (for example, “You are assisting in a systematic review”), any relevant context (such as inclusion/exclusion criteria or a description of the population and interventions), the textual data to be analyzed (i.e., the title or abstract), and explicit output instructions (indicating whether to “ACCEPT” or “REJECT”) [21,92]. In the context of systematic reviews, prompts often encode key methodological requirements—whether defined via PICO or other frameworks—so that an LLM can scan each abstract for relevant details like patient characteristics, study design, or reported outcomes [93,94].
Below is an example of a possible prompt that uses PICO criteria for a literature search of RCTs.
System: You are an AI assistant helping with a systematic review on [TOPIC OR CONDITION].
User Prompt: You will decide if each article should be ACCEPTED or REJECTED based on the following criteria:
Population (P): Adult patients (≥18 years) with [SPECIFIC POPULATION OR CONDITION]. If the abstract does not mention age, or does not clearly describe non-adult populations, do not penalize.
Intervention (I): Must involve [INTERVENTION 1] combined with [INTERVENTION 2]. If either is implied or partially mentioned, do not penalize.
Comparison (C): Ideally a group that uses [CONTROL OR COMPARISON], or some control lacking [KEY INTERVENTION]. If not stated but not contradicted, do not penalize.
Outcomes (O): Must measure [PRIMARY OUTCOME] or at least mention [SECONDARY OUTCOMES OR RELEVANT PARAMETERS]. If the abstract does not state outcomes explicitly but mentions [RELEVANT OUTCOME KEYWORDS], do not penalize.
Study design: Must be an RCT or strongly imply random allocation. If uncertain, do not penalize.
Follow-up: Minimum [X] months. If not stated or unclear, do not penalize unless it says <[X] months.
Decision Rule: If no criterion is explicitly violated, respond only with “ACCEPT”. If any criterion is clearly contradicted (e.g., non-randomized design, pediatric population, <[X] months follow-up), respond with “REJECT”. Provide no additional explanation.
Title: {title}
Abstract: {abstract}
Evidence suggests that prompts with minimal extraneous details improve classification accuracy [95]. In systematic review applications, this often means specifying the precise triggers for “ACCEPT”, e.g., randomized study designs and adult populations, versus the explicit triggers for “REJECT”, e.g., purely animal research or pediatric cohorts.
We recently tested some advanced LLMs with PICO-styled prompts for literature screening and found that Claude 3 Haiku achieved perfect recall (100%) across all quartiles of screening data but exhibited starkly variable precision (16.1–75.0%) depending on prompt style and dataset quartile, with concise prompts generally outperforming verbose ones in early screening stages [96]. In contrast, GPT-4o demonstrated near-perfect performance (100% recall, ≥88.2% precision) across all quartiles and prompt types, with a remarkable performance across all the datasets we used [96]. Examples of prompts used in this work can be found in Appendix A.
The complexity of biomedical abstracts, which may discuss multiple interventions, outcomes, or populations, can pose a challenge if the prompts are too broad, too vague, or contain contradictory statements. For instance, telling the LLM to accept studies if they mention any adult participants but then also requesting rejection if the study includes children under 18 could lead to confusion if the abstract features a mixed population [91]. Careful wording of the criteria or prompt instructions can thus mitigate incorrect interpretations, a phenomenon made more likely when dealing with large corpora.
Another key challenge is ensuring that the model does not “hallucinate” details not actually present in the abstract [97]. These hallucinations typically appear as confidently stated yet unsupported claims about critical aspects of studies, such as randomization status, interventions, or measured outcomes. Because LLMs are probabilistic text generators trained on diverse textual corpora, they can sometimes invent content—such as extra interventions, specific follow-up durations, or outcome measures—simply because the prompt or question implies these details are relevant, and countermeasures to mitigate this phenomenon are a fertile area of investigation [98,99,100]. Some studies propose mitigating hallucinations by human-based fact checking [101], e.g., reviewers can explicitly request models to cite verbatim sentences from the abstract that support classification decisions, facilitating rapid human verification and minimizing unwarranted assumptions. For instance, at the end of a prompt such as the one we proposed above, the text could be modified as follows:
Decision Rule: If no criterion is explicitly violated, respond only with “ACCEPT”. If any criterion is clearly contradicted (e.g., non-randomized design, pediatric population, <[X] months follow-up), respond with “REJECT”. Additionally, after your decision, briefly quote the specific sentence or phrase from the abstract or title that directly supports your classification. If no explicit supporting sentence is present, clearly state: “NO SUPPORTING TEXT FOUND”.
During pilot rounds, this simple rule almost eliminates fabricated trial characteristics because any hallucinated field is immediately visible to the human auditor and can be traced back to the offending prompt wording.
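A lightweight automated check can support this audit. The sketch below assumes the output format requested above (a decision followed by a quoted supporting sentence) and flags any response whose quote cannot be found verbatim in the abstract.
import re

def quote_is_supported(model_output: str, abstract: str) -> bool:
    """Return True only if the quoted support text actually appears in the abstract."""
    if "NO SUPPORTING TEXT FOUND" in model_output:
        return False  # route to manual review rather than silently accepting the decision
    match = re.search(r'["“](.+?)["”]', model_output, flags=re.S)
    if match is None:
        return False  # no quote supplied despite the instruction
    quote = re.sub(r"\s+", " ", match.group(1)).strip().lower()
    text = re.sub(r"\s+", " ", abstract).strip().lower()
    return quote in text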
Prompt refinement typically evolves through iterative testing [102]. Many researchers begin with a “soft” or inclusive instruction set that aims to maximize recall, then review a subset of “Accepted” outputs to identify obvious false positives that indicate the need for more stringent language. Likewise, a “strict” approach can guard against irrelevant articles but risks excluding borderline studies whose abstracts do not explicitly list every inclusion criterion. Domain-specific jargon or abbreviations also introduce complexity, since the LLM might misinterpret specialized terms or incorrectly infer the presence of required conditions [103]. For instance, a study might use “RCT” in the text but never explicitly mention “randomized controlled trial”, leading certain prompts to accept or reject the article prematurely if they only look for the spelled-out term [104]. Researchers should therefore tailor prompts to the language patterns common in the target domain, possibly by leveraging known synonyms or by describing relevant terms in the instructions (“Consider an ‘RCT’ the same as a ‘randomized controlled trial’”). Even in a best-case scenario, LLMs might misclassify abstracts that mention unclear or conflicting details. While advanced LLMs have grown remarkably adept at context-sensitive classification, no prompt can capture every edge case in biomedical literature, particularly in specialized reviews that examine niche interventions or unique study designs [103]. Documenting prompt versions, analyzing errors, and iterating toward more precise instructions appears to remain central to balancing recall, precision, and cost efficiency for large-scale screening efforts.

3.4. Model Deployment

Once the prompt strategies are established, the LLM can be applied to classify each abstract as either “Accepted” or “Rejected”. This step typically involves passing the abstract text and the relevant prompt to the LLM and collecting the output in a structured data frame. Metadata such as timestamps, model version, or confidence levels (if provided by the API or tool) can also be recorded for subsequent auditing and reproducibility. Consistent recordkeeping at this juncture lays the groundwork for quality assurance and the potential to replicate the screening approach in the future [12].
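A minimal sketch of such a deployment loop is shown below; it assumes the OpenAI Python client (v1.x) and the citation table produced during preprocessing, while the model identifier, prompt variable, and file names are illustrative.
from datetime import datetime, timezone

import pandas as pd
from openai import OpenAI  # assumes the openai>=1.0 Python client

client = OpenAI()          # reads OPENAI_API_KEY from the environment
MODEL = "gpt-4o"           # illustrative model identifier
PROMPT = (
    "You will decide if each article should be ACCEPTED or REJECTED based on the criteria...\n"
    "Title: {title}\nAbstract: {abstract}"
)  # abbreviated; the full screening prompt goes here

def classify(title: str, abstract: str) -> str:
    response = client.chat.completions.create(
        model=MODEL,
        temperature=0,  # deterministic output aids reproducibility
        messages=[
            {"role": "system", "content": "You are an AI assistant helping with a systematic review."},
            {"role": "user", "content": PROMPT.format(title=title, abstract=abstract)},
        ],
    )
    return response.choices[0].message.content.strip()

citations = pd.read_csv("screening_set.csv")  # cleaned table from the preprocessing step
log = []
for _, row in citations.iterrows():
    decision = classify(row["title"], row["abstract"])
    log.append({"id": row["pmid"], "decision": decision, "model": MODEL,
                "timestamp": datetime.now(timezone.utc).isoformat()})

pd.DataFrame(log).to_csv("screening_log.csv", index=False)  # audit trail for reproducibility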
The available evidence suggests that human expertise remains indispensable in systematic reviews, even when leveraging advanced language models [105]. Human oversight often involves auditing subsets of rejected articles to identify misclassifications, as noted in the literature [106]. If borderline cases appear in this category, adjustments to prompt wording or acceptance thresholds may be necessary. Conversely, a quick review of the “Accepted” abstracts helps detect obvious false positives. This iterative feedback loop, reminiscent of semi-automated screening tools [14,15], can be accomplished more rapidly and flexibly through zero-shot or few-shot prompting in LLMs. Refinements continue until the screening team is satisfied that the model reliably captures relevant studies without becoming overly permissive. Once optimized, these prompt settings can be incorporated into ongoing or future screening efforts, with systematic refinement shown to improve both recall and precision while reducing reviewer workload [24].

3.5. Current Recommendations and Trade-Offs

The literature emphasizes recall as a priority in early screening phases to minimize the omission of relevant studies. Studies therefore often recommend maximizing inclusivity during database searches and the early screening phases—accepting borderline cases to avoid excluding key studies prematurely [107,108,109,110].
Well-defined inclusion/exclusion criteria are consistently linked to reduced misclassifications [21,92,111,112], and all changes to prompts or model settings should be carefully documented, including the rationale and observed impacts on accuracy or recall. Pilot tests screening small sets of known abstracts may be very valuable for surfacing issues early, preventing downstream errors that might otherwise emerge only after processing thousands of articles. While these checks may seem laborious, they safeguard reproducibility and avert more significant oversights.
Cost is another practical consideration, particularly when using proprietary models with API-based pricing [113]. Although ongoing refinement and error analysis may eventually lower expenses by reducing unnecessary queries, research teams must weigh the financial overhead of repeated API calls against potential performance gains.
A flexible, layered approach often balances efficiency and rigor effectively. Early screening rounds benefit from broad prompts and inclusive language to preserve potentially relevant studies, while later phases can adopt stricter criteria or incorporate prior labels to filter clearly irrelevant articles, thereby improving precision and reducing the full-text workload. Throughout this process, error analysis—especially the detection of false negatives—remains central to safeguarding evidence integrity, to keep, as it is often mentioned, “humans in the loop” [114]. Strategically limited manual checks, such as random sampling of “Rejected” articles or the verification of ambiguous abstracts, confirm model reliability without requiring exhaustive rechecks [23], preserving the time-saving advantages of automation.
Ultimately, LLMs should be viewed as high-efficiency filters that augment—rather than replace—expert judgment. Whether employing a single inclusive prompt strategy or a multi-stage filtering model, iterative refinement and selective validation allow the method to adapt to the review’s scope, resource constraints, and citation volume.

4. Ethical, Practical, and Methodological Implications

The adaptability of LLMs offers a clear advantage over more rigid machine learning models [115]. In zero-shot or few-shot modes, the model’s performance depends heavily on how well the prompt captures the essence of the inclusion and exclusion criteria. It has been shown that refining those criteria substantially boosts accuracy and can approach human-level recall [23,24]. These gains do not negate the importance of human expertise. Rather, oversight remains pivotal to interpret borderline abstracts, continually adjust prompts, and preserve the rigor of evidence synthesis [116]. This necessity is now formalized in emerging reporting standards, which mandate transparency in both model behavior and reviewer oversight.
The CONSORT-AI checklist requires trialists to analyze “performance errors” and explain how such errors were detected and quantified, while item 19 specifically calls for the disclosure of systematic misclassification patterns and their clinical consequences [117]. Applied to LLM-assisted screening, this obligation translates into routinely stratifying confusion matrix outputs by study design, publication language, geographical origin, and other equity-relevant covariates; large error differentials across strata are an early warning of algorithmic bias and should trigger efficient refinement of prompts or thresholds, or temporary reversion to manual screening for the affected subgroup. In parallel, the nascent PRISMA-AI initiative emphasizes that systematic reviews evaluating AI interventions must document the technical particulars of each algorithm, the provenance of its training data, and the measures used to appraise bias, explainability, and generalizability [118]. For evidence syntheses that use rather than evaluate LLMs, reviewers can meet these expectations by depositing the full prompt history, the model version hash, and the audit log of reviewer–LLM disagreements in the Supplementary Materials, allowing external readers to recreate—or challenge—the decision pathway that led from raw citation to inclusion status.
Two recurring failure modes deserve explicit mention. First, representation bias arises when an LLM trained predominantly on English-language or high-income country abstracts develops decision boundaries that underprioritize studies from under-represented regions; this bias typically manifests as a precipitous drop in recall for those strata and can be mitigated by seeding the prompt with carefully chosen minority language exemplars and by enforcing the CONSORT-AI mandate to report subgroup-specific error rates. Second, hallucination—the confident attribution of methods or outcomes not present in the source text—often emerges when prompts invite the model to “infer” unstated elements, as we mentioned before.
Beyond these technical considerations, the recent literature emphasizes the need for explicit guidelines to optimize LLM usage in research contexts [119,120,121,122,123,124] and highlights the importance of transparency in disclosing AI involvement and the ethical requirement of human accountability. As the importance of AI grows rapidly in science and in everyday life alike, educating users about the strengths and limitations of language models becomes critical.
Scholars caution that LLMs should not replace expert judgment; rather, they should enhance it by rapidly filtering large volumes of text, provided their outputs are continually verified and documented for reproducibility. Transparency about any AI-assisted workflow is essential to maintain scientific integrity [119], and risks like hallucinations and bias must be mitigated through ongoing validation [124]. Likewise, Ranjan et al. suggest that structured methods—whether fine-tuning or retrieval-augmented techniques—can boost performance, but only if accompanied by guidelines that ensure data curation and prompt engineering are implemented consistently and responsibly [120,125]. Adopting such measures is of particular importance when applying LLM-based screening to medical fields, given the high stakes of omitting relevant studies or introducing biased results into the evidence base [121].
This paper reviews many of the time and resource challenges associated with traditional screening, but it also raises new questions about how to manage prompt complexity, maintain cost-effectiveness when making numerous API calls, and log each classification for reproducibility. These authors are convinced that combining LLM technologies with robust oversight, adherence to ethical standards, and comprehensive user training will ensure that these tools bolster rather than compromise the credibility of systematic reviews. In particular, transparent, consensus-driven procedures that govern data handling, prompt design, model validation, and human verification may be an important addition. To move the field beyond ad hoc experimentation, we propose a set of draft standardized guidelines intended to help review teams integrate LLM-assisted screening into routine systematic review practice while preserving the rigor and reproducibility demanded by existing methodological standards.

5. Proposed Standardized Guidelines for LLM-Assisted Screening

These guidelines are intended to assist review teams in adopting Large Language Models (LLMs) for title and abstract screening while preserving the methodological transparency, reproducibility, and ethical integrity expected of systematic reviews. They are framed to complement existing reporting standards—such as PRISMA, PRISMA-AI, and CONSORT-AI—rather than to replace them, and they assume adherence to established protocols for search strategy design, data extraction, and risk of bias assessment [117,118].

5.1. Planning and Governance

Before any model is invoked, the protocol should state that an LLM will be used, describe its intended role, and identify the responsible investigators who will oversee model configuration, prompt refinement, and human verification. The protocol should also specify how modifications to prompts or model parameters will be recorded, how disagreements between reviewers and the LLM will be resolved, and how data protection requirements will be met when cloud-based services are involved.

5.2. Data Preparation

All search results must be exported in a structured, machine-readable format (for example, RIS, XML, or CSV). Deduplication should be performed with deterministic or probabilistic matching rules and documented in sufficient detail to allow replication. Metadata fields (e.g., authors, title, abstract, journal, year) should be normalized and stored in a single table with a unique record identifier that travels with the citation throughout the workflow.

5.3. Model Selection and Disclosure

The exact LLM version, access mode (API, local deployment, or third-party platform), context window length, and any fine-tuning or retrieval-augmented components must be reported. If a proprietary model is used, investigators should note whether data are transmitted to external servers and, if so, what safeguards are in place. For open-source models, the commit hash or model checkpoint must be cited to ensure identical replication.

5.4. Prompt Engineering

Prompts should encode the same inclusion and exclusion criteria that are stated in the review protocol, using unambiguous language that maps directly to PICO (or equivalent) elements. Each new prompt version requires a brief rationale and an audit of its effect on a validation subset. Whenever possible, prompts should instruct the model to supply the verbatim sentence (or its index) upon which its classification decision is based, thereby facilitating rapid checking for hallucinated or mis-attributed information.

5.5. Screening Procedure

An initial high-recall pass should be performed with permissive thresholds to minimize the risk of false negatives. Human reviewers must then inspect a random sample of both accepted and rejected citations; the sample size should be large enough to detect a ten percent error rate with ninety-five percent confidence. Discrepancies trigger prompt revision or threshold adjustment, after which the model is rerun on the full dataset. The iterative cycle continues until two successive validation samples each show an error rate below the pre-specified tolerance set in the protocol.
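One common way to operationalize this sampling threshold (stated here as an assumption rather than a prescribed rule) is the zero-failure binomial argument: choose the smallest sample in which an underlying error rate of at least ten percent would produce at least one observed error with ninety-five percent probability, as in the sketch below.
import math

def min_validation_sample(error_rate: float = 0.10, confidence: float = 0.95) -> int:
    """Smallest n such that an error rate >= error_rate yields at least one
    observed error with the requested probability (zero-failure binomial rule)."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - error_rate))

print(min_validation_sample())  # 29 citations per audited pool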

5.6. Quality Control and Bias Monitoring

A confusion matrix for the final prompt–model combination must be produced on the last validation set, alongside 95% confidence intervals for recall, precision, specificity, and negative predictive value. Investigators should examine whether error rates differ systematically by publication date, study design, geographic region, or language of publication. Any observed imbalance must be discussed in the manuscript and, where feasible, mitigated through prompt adjustment or supplementary manual screening.
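As a minimal sketch, assuming statsmodels is available, the Wilson score interval for recall can be computed from the confusion matrix counts as follows; the same call covers precision, specificity, and negative predictive value by changing the numerator and denominator (the counts shown are illustrative).
from statsmodels.stats.proportion import proportion_confint

# Counts from the final validation confusion matrix (values are illustrative).
tp, fn, fp, tn = 118, 4, 22, 356

recall = tp / (tp + fn)
lower, upper = proportion_confint(count=tp, nobs=tp + fn, alpha=0.05, method="wilson")
print(f"Recall = {recall:.3f} (95% CI {lower:.3f}-{upper:.3f})")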

5.7. Documentation and Reproducibility

All code used for data wrangling, prompting, and result aggregation should be archived in a public repository with an explicit open-source license. The prompt history, model-usage logs, and the final list of screened-in and screened-out records must be provided as Supplementary Material in machine-readable form. If proprietary or sensitive data prevent full release, a synthetic dataset that mimics the structure of the real data should be shared to illustrate the workflow.

5.8. Ethical and Legal Compliance

The review team must certify that the use of an LLM complies with local data protection regulations and institutional policies. If abstracts contain personally identifiable or sensitive health information, reviewers must ensure that the data are processed either on secure local hardware or under a contractual agreement that meets the relevant jurisdiction’s privacy standards. Any potential conflicts of interest related to model providers should be disclosed.

5.9. Reporting in the Manuscript

The Methods Section must describe the LLM workflow with enough granularity for replication, including the date of the last model call, the total number of tokens processed (for cost estimation), and the human time spent on verification. The Results Section should present the performance metrics specified above, and the Discussion Section should address the implications of any residual misclassifications, the cost–benefit profile of the approach, and anticipated avenues for future refinement.

6. Conclusions

Integrating LLMs into systematic review screening offers substantial time savings during the initial evaluation of abstracts, particularly when well-defined inclusion criteria are applied. The use of LLMs not only potentially expands the scope of the literature that can feasibly be screened but also alleviates the burden on research teams. By selecting models appropriate to specific review goals, tailoring prompts to their capabilities, conducting manual validation, and documenting iterative refinements, it is possible to achieve robust recall and precision while maintaining the integrity of evidence-based conclusions. Looking ahead, LLM applications could extend beyond screening to automate data extraction, assess risk of bias, or even synthesize findings—though these tasks will require rigorous validation frameworks to ensure accuracy and mitigate hallucinations. Future research should also explore hybrid systems combining LLMs with symbolic AI or knowledge graphs to enhance interpretability and domain specificity.
To translate these advances into practice, policymakers and review consortiums must establish standardized guidelines for LLM use, addressing ethical concerns (e.g., transparency in model selection, reproducibility of prompts) and equitable access to computational resources. The draft framework provided here represents a starting point for such efforts. As LLM capabilities evolve, interdisciplinary collaboration—among AI developers, methodologists, and domain experts—will be critical to balance innovation with methodological rigor, ensuring these tools augment rather than undermine the trustworthiness of systematic reviews.
Taken together, the workflow diagram, the parameter-efficient tuning recipe, and the draft CONSORT-AI aligned checklist transform dispersed technical insights into a single, reproducible protocol. This consolidation is, to our knowledge, the first attempt to move the discussion from “can LLMs screen?” to “how should every review team deploy them responsibly and report the process in a journal-ready format?”, thereby adding a layer of pragmatic novelty that complements—not replicates—the existing corpus.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/info16050378/s1. Table S1: Summary table of LLM performance in literature screening.

Author Contributions

Conceptualization, C.G. and E.C.; methodology, C.G.; software, A.V.G.; writing—original draft preparation, C.G.; writing—review and editing, A.V.G. and E.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

No data were generated.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1

The following is an example of a prompt used in a recent work [96] to screen the literature for RCTs on periodontal regeneration with Emdogain and bone grafts:
You are assisting in a systematic review on periodontal regeneration comparing Emdogain (EMD) + bone graft (BG) versus BG alone. Your task is to decide whether the following article should be ACCEPTED or REJECTED based on the following “soft approach” criteria:
**Inclusion Criteria**:
1. **Population (P)**: Adult periodontitis patients (≥18 years old) with at least one intrabony or furcation defect.
2. **Intervention (I)**: Regenerative surgical procedures involving EMD combined with any type of bone graft material (EMD+BG).
3. **Comparison (C)**: Regenerative surgical procedures involving BG alone.
4. **Outcomes (O)**:
   - Primary: CAL (Clinical Attachment Level) gain, PD (Probing Depth) reduction.
   - Secondary: Pocket closure, wound healing, gingival recession, tooth loss, patient-reported outcome measures (PROMs), adverse events.
5. **Study Design**:
   - Randomized controlled trial (RCT), parallel or split-mouth design.
   - ≥10 patients per arm.
   - ≥6 months follow-up.
**Decision Approach**:
- If **at least one** of the above criteria is explicitly met or strongly implied, **AND** none of the criteria are explicitly contradicted, then **ACCEPT**.
- If **any** criterion is clearly violated (e.g., population is exclusively children, follow-up is 3 months, or design is not an RCT), then **REJECT**.
- If **no** criterion is clearly met, **REJECT**.
Below is the article’s title and abstract. Decide if it should be ACCEPTED or REJECTED according to the “soft approach” described.
Title: {title}
Abstract: {abstract}
**If the article is acceptable, respond with exactly:**
ACCEPT
**Otherwise, respond with exactly:**
REJECT

Appendix A.2

The following is an example of a more concise prompt used in the same investigation for the same screening:
You are an expert periodontology assistant. You are assisting in a systematic review on periodontal regeneration comparing Emdogain (EMD) + bone graft (BG) versus bone graft alone. Evaluate this article step by step:
1. Population: If the text states adult patients with intrabony/furcation defects, or is silent about age/defect type, it’s not violated.
2. Intervention: If Emdogain + bone graft is mentioned or strongly implied, we consider this met.
3. Comparison: If a group uses bone graft alone, or there’s at least a control lacking Emdogain, consider it met.
4. Outcomes: If they mention CAL gain or PD reduction, or are silent, do not penalize. Only reject if they clearly never measure any clinical outcomes.
5. Study design: If they claim RCT or strongly imply it, accept. If they mention a different design (case series, pilot with fewer than 10 patients, or <6-month follow-up), reject.
If at least one criterion is explicitly met and none are clearly violated, answer ACCEPT. Otherwise, REJECT.
If you are unsure, default to ACCEPT unless a contradiction is stated.
Article Title: {title}
Abstract: {abstract}
Respond with ONLY ‘ACCEPT’ or ‘REJECT’ accordingly.

References

  1. Mulrow, C.D. Systematic Reviews: Rationale for systematic reviews. BMJ 1994, 309, 597–599. [Google Scholar] [CrossRef] [PubMed]
  2. Parums, D.V. Review articles, systematic reviews, meta-analysis, and the updated preferred reporting items for systematic reviews and meta-analyses (PRISMA) 2020 guidelines. Med. Sci. Monit. 2021, 27, e934475. [Google Scholar] [PubMed]
  3. Methley, A.M.; Campbell, S.; Chew-Graham, C.; McNally, R.; Cheraghi-Sohi, S. PICO, PICOS and SPIDER: A comparison study of specificity and sensitivity in three search tools for qualitative systematic reviews. BMC Health Serv. Res. 2014, 14, 579. [Google Scholar] [CrossRef]
  4. Linares-Espinós, E.; Hernández, V.; Domínguez-Escrig, J.L.; Fernández-Pello, S.; Hevia, V.; Mayor, J.; Padilla-Fernández, B.; Ribal, M.J. Methodology of a systematic review. Actas Urol. Esp. 2018, 42, 499–506. [Google Scholar]
  5. Dickersin, K.; Scherer, R.; Lefebvre, C. Systematic reviews: Identifying relevant studies for systematic reviews. BMJ 1994, 309, 1286–1291. [Google Scholar]
  6. Greenhalgh, T.; Thorne, S.; Malterud, K. Time to challenge the spurious hierarchy of systematic over narrative reviews? Eur. J. Clin. Investig. 2018, 48, e12931. [Google Scholar] [CrossRef]
  7. Waffenschmidt, S.; Knelangen, M.; Sieben, W.; Bühn, S.; Pieper, D. Single screening versus conventional double screening for study selection in systematic reviews: A methodological systematic review. BMC Med. Res. Methodol. 2019, 19, 132. [Google Scholar] [CrossRef]
  8. Cooper, C.; Booth, A.; Varley-Campbell, J.; Britten, N.; Garside, R. Defining the process to literature searching in systematic reviews: A literature review of guidance and supporting studies. BMC Med. Res. Methodol. 2018, 18, 85. [Google Scholar]
  9. Furlan, J.C.; Singh, J.; Hsieh, J.; Fehlings, M.G. Methodology of Systematic Reviews and Recommendations. J. Neurotrauma 2011, 28, 1335–1339. [Google Scholar]
  10. Cumpston, M.; Li, T.; Page, M.J.; Chandler, J.; Welch, V.A.; Higgins, J.P.; Thomas, J. Updated guidance for trusted systematic reviews: A new edition of the Cochrane Handbook for Systematic Reviews of Interventions. Cochrane Database Syst. Rev. 2019, 2019, ED000142. [Google Scholar]
  11. Dunning, J.; Lecky, F. The NICE guidelines in the real world: A practical perspective. Emerg. Med. J. 2004, 21, 404. [Google Scholar] [PubMed]
  12. Van Dinter, R.; Tekinerdogan, B.; Catal, C. Automation of systematic literature reviews: A systematic literature review. Inf. Softw. Technol. 2021, 136, 106589. [Google Scholar]
  13. Wang, Z.; Nayfeh, T.; Tetzlaff, J.; O’Blenis, P.; Murad, M.H. Error rates of human reviewers during abstract screening in systematic reviews. PLoS ONE 2020, 15, e0227742. [Google Scholar]
  14. Ouzzani, M.; Hammady, H.; Fedorowicz, Z.; Elmagarmid, A. Rayyan—A web and mobile app for systematic reviews. Syst. Rev. 2016, 5, 210. [Google Scholar] [PubMed]
  15. Chai, K.E.K.; Lines, R.L.J.; Gucciardi, D.F.; Ng, L. Research Screener: A machine learning tool to semi-automate abstract screening for systematic reviews. Syst. Rev. 2021, 10, 93. [Google Scholar] [CrossRef]
  16. Khalil, H.; Ameen, D.; Zarnegar, A. Tools to support the automation of systematic reviews: A scoping review. J. Clin. Epidemiol. 2022, 144, 22–42. [Google Scholar] [CrossRef]
  17. Allot, A.; Lee, K.; Chen, Q.; Luo, L.; Lu, Z. LitSuggest: A web-based system for literature recommendation and curation using machine learning. Nucleic Acids Res. 2021, 49, W352–W358. [Google Scholar] [PubMed]
  18. Marshall, I.J.; Kuiper, J.; Wallace, B.C. RobotReviewer: Evaluation of a system for automatically assessing bias in clinical trials. J. Am. Med. Inform. Assoc. 2016, 23, 193–201. [Google Scholar]
  19. Kiritchenko, S.; De Bruijn, B.; Carini, S.; Martin, J.; Sim, I. ExaCT: Automatic extraction of clinical trial characteristics from journal publications. BMC Med. Inf. Decis. Mak. 2010, 10, 56. [Google Scholar]
  20. Sindhu, B.; Prathamesh, R.P.; Sameera, M.B.; KumaraSwamy, S. The evolution of large language model: Models, applications and challenges. In Proceedings of the 2024 International Conference on Current Trends in Advanced Computing (ICCTAC), Bengaluru, India, 8–9 May 2024; pp. 1–8. [Google Scholar]
  21. Cao, C.; Sang, J.; Arora, R.; Kloosterman, R.; Cecere, M.; Gorla, J.; Saleh, R.; Chen, D.; Drennan, I.; Teja, B.; et al. Prompting is all you need: LLMs for systematic review screening. medRxiv 2024, 2024–2106. [Google Scholar] [CrossRef]
  22. Scherbakov, D.; Hubig, N.; Jansari, V.; Bakumenko, A.; Lenert, L.A. The emergence of Large Language Models (LLM) as a tool in literature reviews: An LLM automated systematic review. arXiv 2024, arXiv:2409.04600. [Google Scholar]
  23. Delgado-Chaves, F.M.; Jennings, M.J.; Atalaia, A.; Wolff, J.; Horvath, R.; Mamdouh, Z.M.; Baumbach, J.; Baumbach, L. Transforming literature screening: The emerging role of large language models in systematic reviews. Proc. Natl. Acad. Sci. USA 2025, 122, e2411962122. [Google Scholar] [CrossRef]
  24. Dai, Z.Y.; Shen, C.; Ji, Y.L.; Li, Z.Y.; Wang, Y.; Wang, F.Q. Accuracy of Large Language Models for Literature Screening in Systematic Reviews and Meta-Analyses. 2024. Available online: https://ssrn.com/abstract=4943759 (accessed on 11 March 2025).
  25. Dennstädt, F.; Zink, J.; Putora, P.M.; Hastings, J.; Cihoric, N. Title and abstract screening for literature reviews using large language models: An exploratory study in the biomedical domain. Syst. Rev. 2024, 13, 158. [Google Scholar] [CrossRef]
26. Khraisha, Q.; Put, S.; Kappenberg, J.; Warraitch, A.; Hadfield, K. Can large language models replace humans in systematic reviews? Evaluating GPT-4’s efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages. Res. Synth. Methods 2024, 15, 616–626. [Google Scholar] [CrossRef] [PubMed]
  27. Blevins, T.; Gonen, H.; Zettlemoyer, L. Prompting Language Models for Linguistic Structure. arXiv 2022, arXiv:2211.07830. [Google Scholar]
  28. Lieberum, J.L.; Töws, M.; Metzendorf, M.I.; Heilmeyer, F.; Siemens, W.; Haverkamp, C.; Böhringer, D.; Meerpohl, J.J.; Eisele-Metzger, A. Large language models for conducting systematic reviews: On the rise, but not yet ready for use—A scoping review. J. Clin. Epidemiol. 2025, 181, 111746. [Google Scholar] [CrossRef]
  29. Zhao, W.X.; Zhou, K.; Li, J.; Tang, T.; Wang, X.; Hou, Y.; Min, Y.; Zhang, B.; Zhang, J.; Dong, Z.; et al. A survey of large language models. arXiv 2023, arXiv:2303.18223. [Google Scholar]
  30. Gao, A. Prompt Engineering for Large Language Models. 2023. Available online: https://ssrn.com/abstract=4504303 (accessed on 11 March 2025).
  31. Dang, H.; Mecke, L.; Lehmann, F.; Goller, S.; Buschek, D. How to prompt? Opportunities and challenges of zero-and few-shot learning for human-AI interaction in creative applications of generative models. arXiv 2022, arXiv:2209.01390. [Google Scholar]
  32. Cottam, J.A.; Heller, N.C.; Ebsch, C.L.; Deshmukh, R.; Mackey, P.; Chin, G. Evaluation of Alignment: Precision, Recall, Weighting and Limitations. In Proceedings of the 2020 IEEE International Conference on Big Data (Big Data), Atlanta, GA, USA, 10–13 December 2020; pp. 2513–2519. [Google Scholar]
  33. Wang, Y.; Yu, J.; Yao, Z.; Zhang, J.; Xie, Y.; Tu, S.; Fu, Y.; Feng, Y.; Zhang, J.; Zhang, J.; et al. A solution-based LLM API-using methodology for academic information seeking. arXiv 2024, arXiv:2405.15165. [Google Scholar]
  34. Kumar, B.V.P.; Ahmed, M.D.S. Beyond Clouds: Locally Runnable LLMs as a Secure Solution for AI Applications. Digit. Soc. 2024, 3, 49. [Google Scholar]
  35. Bisong, E. Google Colaboratory. In Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners; Bisong, E., Ed.; Apress: Berkeley, CA, USA, 2019; pp. 59–64. [Google Scholar]
36. McKinney, W. Data Structures for Statistical Computing in Python. In Proceedings of the 9th Python in Science Conference, Austin, TX, USA, 28 June–3 July 2010; pp. 51–56. [Google Scholar]
  37. Grigorov, D. Harnessing Python 3.11 and Python Libraries for LLM Development. In Introduction to Python and Large Language Models: A Guide to Language Models; Springer: Berlin/Heidelberg, Germany, 2024; pp. 303–368. [Google Scholar]
  38. Maji, A.K.; Gorenstein, L.; Lentner, G. Demystifying Python Package Installation with conda-env-mod. In Proceedings of the 2020 IEEE/ACM International Workshop on HPC User Support Tools (HUST) and Workshop on Programming and Performance Visualization Tools (ProTools), Atlanta, GA, USA, 18 November 2020; pp. 27–37. [Google Scholar]
  39. Shekhar, S.; Dubey, T.; Mukherjee, K.; Saxena, A.; Tyagi, A.; Kotla, N. Towards optimizing the costs of llm usage. arXiv 2024, arXiv:2402.01742. [Google Scholar]
  40. Chen, X.; Gao, C.; Chen, C.; Zhang, G.; Liu, Y. An empirical study on challenges for llm developers. arXiv 2024, arXiv:2408.05002. [Google Scholar]
  41. Irugalbandara, C.; Mahendra, A.; Daynauth, R.; Arachchige, T.K.; Dantanarayana, J.; Flautner, K.; Tang, L.; Kang, Y.; Mars, J. Scaling down to scale up: A cost-benefit analysis of replacing OpenAI’s LLM with open source SLMs in production. In Proceedings of the 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), Indianapolis, IN, USA, 5–7 May 2024; pp. 280–291. [Google Scholar]
  42. Ding, D.; Mallick, A.; Wang, C.; Sim, R.; Mukherjee, S.; Ruhle, V.; Lakshmanan, L.V.; Awadallah, A.H. Hybrid llm: Cost-efficient and quality-aware query routing. arXiv 2024, arXiv:2404.14618. [Google Scholar]
  43. Chen, L.; Zaharia, M.; Zou, J. Frugalgpt: How to use large language models while reducing cost and improving performance. arXiv 2023, arXiv:2305.05176. [Google Scholar]
  44. Yan, B.; Li, K.; Xu, M.; Dong, Y.; Zhang, Y.; Ren, Z.; Cheng, X. On protecting the data privacy of large language models (llms): A survey. arXiv 2024, arXiv:2403.05156. [Google Scholar]
  45. Yao, Y.; Duan, J.; Xu, K.; Cai, Y.; Sun, Z.; Zhang, Y. A survey on large language model (llm) security and privacy: The good, the bad, and the ugly. High-Confid. Comput. 2024, 4, 100211. [Google Scholar]
  46. Huang, B.; Yu, S.; Li, J.; Chen, Y.; Huang, S.; Zeng, S.; Wang, S. Firewallm: A portable data protection and recovery framework for llm services. In International Conference on Data Mining and Big Data; Springer: Berlin/Heidelberg, Germany, 2023; pp. 16–30. [Google Scholar]
  47. Feretzakis, G.; Verykios, V.S. Trustworthy AI: Securing sensitive data in large language models. AI 2024, 5, 2773–2800. [Google Scholar] [CrossRef]
  48. Meline, T. Selecting studies for systemic review: Inclusion and exclusion criteria. Contemp. Issues Commun. Sci. Disord. 2006, 33, 21–27. [Google Scholar]
  49. Cooke, A.; Smith, D.; Booth, A. Beyond PICO. Qual. Health Res. 2012, 22, 1435–1443. [Google Scholar] [CrossRef]
  50. Frandsen, T.F.; Bruun Nielsen, M.F.; Lindhardt, C.L.; Eriksen, M.B. Using the full PICO model as a search tool for systematic reviews resulted in lower recall for some PICO elements. J. Clin. Epidemiol. 2020, 127, 69–75. [Google Scholar] [CrossRef]
  51. Brown, D. A Review of the PubMed PICO Tool: Using Evidence-Based Practice in Health Education. Health Promot. Pract. 2020, 21, 496–498. [Google Scholar] [CrossRef] [PubMed]
  52. Gosak, L.; Štiglic, G.; Pruinelli, L.; Vrbnjak, D. PICOT questions and search strategies formulation: A novel approach using artificial intelligence automation. J. Nurs. Scholarsh. 2025, 57, 5–16. [Google Scholar] [PubMed]
53. De Cassai, A.; Dost, B.; Karapinar, Y.E.; Beldagli, M.; Yalin, M.S.O.; Turunc, E.; Turan, E.I.; Sella, N. Evaluating the utility of large language models in generating search strings for systematic reviews in anesthesiology: A comparative analysis of top-ranked journals. Reg. Anesth. Pain Med. 2025, 2024–10623. [Google Scholar]
  54. Huang, W.H.; Poojary, V.; Hofer, K.; Fazeli, M.S. MSR217 Evaluating a Large Language Model Approach for Full-Text Screening Task in Systematic Literature Reviews with Domain Expert Input. Value Health 2024, 27, S481. [Google Scholar]
  55. Jin, Q.; Leaman, R.; Lu, Z. PubMed and beyond: Biomedical literature search in the age of artificial intelligence. eBioMedicine 2024, 100, 104988. [Google Scholar] [CrossRef]
  56. Scells, H.; Zuccon, G.; Koopman, B.; Deacon, A.; Azzopardi, L.; Geva, S. Integrating the Framing of Clinical Questions via PICO into the Retrieval of Medical Literature for Systematic Reviews. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, Singapore, 6–10 November 2017; ACM: New York, NY, USA, 2017; pp. 2291–2294. [Google Scholar]
  57. Chigbu, U.E.; Atiku, S.O.; Du Plessis, C.C. The Science of Literature Reviews: Searching, Identifying, Selecting, and Synthesising. Publications 2023, 11, 2. [Google Scholar] [CrossRef]
  58. Patrick, L.J.; Munro, S. The literature review: Demystifying the literature search. Diabetes Educ. 2004, 30, 30–38. [Google Scholar]
  59. Gusenbauer, M.; Haddaway, N.R. Which academic search systems are suitable for systematic reviews or meta-analyses? Evaluating retrieval qualities of Google Scholar, PubMed, and 26 other resources. Res. Synth. Methods 2020, 11, 181–217. [Google Scholar] [CrossRef]
  60. Heintz, M.; Hval, G.; Tornes, R.A.; Byelyey, N.; Hafstad, E.; Næss, G.E.; Bakkeli, M. Optimizing the literature search: Coverage of included references in systematic reviews in Medline and Embase. J. Med. Libr. Assoc. 2023, 111, 599–605. [Google Scholar] [CrossRef]
  61. Lu, Z. PubMed and beyond: A survey of web tools for searching biomedical literature. Database 2011, 2011, baq036. [Google Scholar] [CrossRef]
  62. Bramer, W.M.; Giustini, D.; Kramer, B.M.R. Comparing the coverage, recall, and precision of searches for 120 systematic reviews in Embase, MEDLINE, and Google Scholar: A prospective study. Syst. Rev. 2016, 5, 39. [Google Scholar] [CrossRef] [PubMed]
  63. Page, D. Systematic Literature Searching and the Bibliographic Database Haystack. Electron. J. Bus. Res. Methods 2008, 6, 199–208. [Google Scholar]
  64. White, J. PubMed 2.0. Med. Ref. Serv. Q. 2020, 39, 382–387. [Google Scholar] [CrossRef]
  65. Cock, P.J.; Antao, T.; Chang, J.T.; Chapman, B.A.; Cox, C.J.; Dalke, A.; Friedberg, I.; Hamelryck, T.; Kauff, F.; Wilczynski, B.; et al. Biopython: Freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 2009, 25, 1422–1423. [Google Scholar]
  66. Lu, Z.; Kim, W.; Wilbur, W.J. Evaluation of query expansion using MeSH in PubMed. Inf. Retr. Boston 2009, 12, 69–80. [Google Scholar] [PubMed]
  67. Stuart, D. Database search translation tools: MEDLINE transpose, ovid search translator, and SR-accelerator polyglot search translator. J. Electron. Resour. Med. Libr. 2023, 20, 152–159. [Google Scholar]
  68. Pichiyan, V.; Muthulingam, S.; Nalajala, S.; Ch, A.; Das, M.N. Web scraping using natural language processing: Exploiting unstructured text for data extraction and analysis. Procedia Comput. Sci. 2023, 230, 193–202. [Google Scholar]
69. Mavrogiorgos, K.; Mavrogiorgou, A.; Kiourtis, A.; Zafeiropoulos, N.; Kleftakis, S.; Kyriazis, D. Automated rule-based data cleaning using NLP. In Proceedings of the 2022 32nd Conference of Open Innovations Association (FRUCT), Tampere, Finland, 9–11 November 2022; pp. 162–168. [Google Scholar]
  70. Ulrich, H.; Kock-Schoppenhauer, A.K.; Deppenwiese, N.; Gött, R.; Kern, J.; Lablans, M.; Majeed, R.W.; Stöhr, M.R.; Stausberg, J.; Varghese, J.; et al. Understanding the nature of metadata: Systematic review. J. Med. Internet Res. 2022, 24, e25440. [Google Scholar]
  71. Yang, M.; Adomavicius, G.; Burtch, G.; Ren, Y. Mind the gap: Accounting for measurement error and misclassification in variables generated via data mining. Inf. Syst. Res. 2018, 29, 4–24. [Google Scholar]
72. Chernyavskiy, A.; Ilvovsky, D.; Nakov, P. Transformers: “The end of history” for natural language processing. In Machine Learning and Knowledge Discovery in Databases. Research Track, Proceedings of the European Conference, ECML PKDD 2021, Bilbao, Spain, 13–17 September 2021; Proceedings, Part III 21; Springer: Cham, Switzerland, 2021; pp. 677–693. [Google Scholar]
  73. Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Online, 16–20 November 2020; pp. 38–45. [Google Scholar]
  74. Shreyashree, S.; Sunagar, P.; Rajarajeswari, S.; Kanavalli, A. A Literature Review on Bidirectional Encoder Representations from Transformers. In Inventive Computation and Information Technologies; Smys, S., Balas, V.E., Palanisamy, R., Eds.; Springer Nature: Singapore, 2022; pp. 305–320. [Google Scholar]
  75. Zarrieß, S.; Voigt, H.; Schüz, S. Decoding Methods in Neural Language Generation: A Survey. Information 2021, 12, 355. [Google Scholar] [CrossRef]
  76. Ashwathy, J.S.; SR, N.; Pyati, T. The Progression of ChatGPT: An Evolutionary Study from GPT-1 to GPT-4. J. Innov. Data Sci. Big Data Manag. 2024, 3, 38–44. [Google Scholar]
  77. Kalyan, K.S. A survey of GPT-3 family large language models including ChatGPT and GPT-4. Nat. Lang. Process. J. 2024, 6, 100048. [Google Scholar] [CrossRef]
  78. Gao, T.; Jin, J.; Ke, Z.T.; Moryoussef, G. A Comparison of DeepSeek and Other LLMs. arXiv 2025, arXiv:2502.03688. [Google Scholar]
  79. Sun, Z.; Yang, H.; Liu, K.; Yin, Z.; Li, Z.; Xu, W. Recent Advances in LoRa: A Comprehensive Survey. ACM Trans. Sens. Netw. 2022, 18, 1–44. [Google Scholar] [CrossRef]
  80. Chang, P.W.; Newman, T.B. Receiver Operating Characteristic (ROC) Curves: The Basics and Beyond. Hosp. Pediatr. 2024, 14, e330–e334. [Google Scholar] [CrossRef] [PubMed]
  81. Wu, Y.; Gu, Y.; Feng, X.; Zhong, W.; Xu, D.; Yang, Q.; Liu, H.; Qin, B. Extending context window of large language models from a distributional perspective. arXiv 2024, arXiv:2410.01490. [Google Scholar]
  82. Huotala, A.; Kuutila, M.; Ralph, P.; Mäntylä, M. The promise and challenges of using LLMs to accelerate the screening process of systematic reviews. In Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, Salerno, Italy, 18–21 June 2024; pp. 262–271. [Google Scholar]
  83. Wu, S.; Koo, M.; Blum, L.; Black, A.; Kao, L.; Fei, Z.; Scalzo, F.; Kurtz, I. Benchmarking open-source large language models, GPT-4 and Claude 2 on multiple-choice questions in nephrology. NEJM AI 2024, 1, AIdbp2300092. [Google Scholar]
  84. Rydzewski, N.R.; Dinakaran, D.; Zhao, S.G.; Ruppin, E.; Turkbey, B.; Citrin, D.E.; Patel, K.R. Comparative evaluation of LLMs in clinical oncology. NEJM AI 2024, 1, AIoa2300151. [Google Scholar]
  85. Wu, S.; Koo, M.; Blum, L.; Black, A.; Kao, L.; Scalzo, F.; Kurtz, I. A comparative study of open-source large language models, gpt-4 and claude 2: Multiple-choice test taking in nephrology. arXiv 2023, arXiv:2308.04709. [Google Scholar]
  86. Safavi-Naini, S.A.A.; Ali, S.; Shahab, O.; Shahhoseini, Z.; Savage, T.; Rafiee, S.; Samaan, J.S.; Shabeeb, R.A.; Ladak, F.; Yang, J.O.; et al. Vision-Language and Large Language Model Performance in Gastroenterology: GPT, Claude, Llama, Phi, Mistral, Gemma, and Quantized Models. arXiv 2024, arXiv:2409.00084. [Google Scholar]
  87. Berglund, L.; Stickland, A.C.; Balesni, M.; Kaufmann, M.; Tong, M.; Korbak, T.; Kokotajlo, D.; Evans, O. Taken out of context: On measuring situational awareness in LLMs. arXiv 2023, arXiv:2309.00667. [Google Scholar]
  88. Agarwal, L.; Nasim, A. Comparison and Analysis of Large Language Models (LLMs). 2024. Available online: https://ssrn.com/abstract=4939534 (accessed on 11 March 2025).
  89. Polanin, J.R.; Pigott, T.D.; Espelage, D.L.; Grotpeter, J.K. Best practice guidelines for abstract screening large-evidence systematic reviews and meta-analyses. Res. Synth. Methods 2019, 10, 330–342. [Google Scholar]
  90. Beurer-Kellner, L.; Fischer, M.; Vechev, M. Prompting is programming: A query language for large language models. Proc. ACM Program. Lang. 2023, 7, 1946–1969. [Google Scholar]
91. He, J.; Rungta, M.; Koleczek, D.; Sekhon, A.; Wang, F.X.; Hasan, S. Does Prompt Formatting Have Any Impact on LLM Performance? arXiv 2024, arXiv:2411.10541. [Google Scholar]
  92. Zaghir, J.; Naguib, M.; Bjelogrlic, M.; Névéol, A.; Tannier, X.; Lovis, C. Prompt Engineering Paradigms for Medical Applications: Scoping Review. J. Med. Internet Res. 2024, 26, e60501. [Google Scholar] [CrossRef] [PubMed]
  93. Colangelo, M.T.; Guizzardi, S.; Meleti, M.; Calciolari, E.; Galli, C. How to Write Effective Prompts for Screening Biomedical Literature Using Large Language Models. BioMedInformatics 2025, 5, 15. [Google Scholar] [CrossRef]
  94. Cao, C.; Sang, J.; Arora, R.; Chen, D.; Kloosterman, R.; Cecere, M.; Gorla, J.; Saleh, R.; Drennan, I.; Teja, B. Development of Prompt Templates for Large Language Model–Driven Screening in Systematic Reviews. Ann. Intern Med. 2025, 178, 389–401. [Google Scholar] [CrossRef]
  95. Wang, W.; Shi, J.; Wang, C.; Lee, C.; Yuan, Y.; Huang, J.T.; Lyu, M.R. Learning to ask: When llms meet unclear instruction. arXiv 2024, arXiv:2409.00557. [Google Scholar]
  96. Galli, C.; Colangelo, M.T.; Guizzardi, S.; Meleti, M.; Calciolari, E. A Zero-Shot Comparison of Large Language Models for Efficient Screening in Periodontal Regeneration Research. Preprints 2025. [Google Scholar] [CrossRef]
  97. Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. ACM Trans. Inf. Syst. 2025, 43, 1–55. [Google Scholar] [CrossRef]
  98. Bhattacharya, R. Strategies to mitigate hallucinations in large language models. Appl. Mark. Anal. 2024, 10, 62–67. [Google Scholar]
  99. Gosmar, D.; Dahl, D.A. Hallucination Mitigation using Agentic AI Natural Language-Based Frameworks. arXiv 2025, arXiv:2501.13946. [Google Scholar]
  100. Hassan, M. Measuring the Impact of Hallucinations on Human Reliance in LLM Applications. J. Robot. Process Autom. AI Integr. Work. Optim. 2025, 10, 10–20. [Google Scholar]
  101. Rawte, V.; Chakraborty, S.; Pathak, A.; Sarkar, A.; Tonmoy, S.I.; Chadha, A.; Sheth, A.; Das, A. The troubling emergence of hallucination in large language models-an extensive definition, quantification, and prescriptive remediations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023. [Google Scholar]
  102. Fagbohun, O.; Harrison, R.M.; Dereventsov, A. An empirical categorization of prompting techniques for large language models: A practitioner’s guide. arXiv 2024, arXiv:2402.14837. [Google Scholar]
  103. Mai, H.T.; Chu, C.X.; Paulheim, H. Do LLMs really adapt to domains? An ontology learning perspective. In The Semantic Web–ISWC 2024, Proceedings of the 23rd International Semantic Web Conference, Baltimore, MD, USA, 11–15 November 2024; Proceedings, Part I; Springer: Cham, Switzerland, 2024; pp. 126–143. [Google Scholar]
  104. Sumanathilaka, T.; Micallef, N.; Hough, J. Can LLMs assist with Ambiguity? A Quantitative Evaluation of various Large Language Models on Word Sense Disambiguation. arXiv 2024, arXiv:2411.18337. [Google Scholar]
105. Duenas, T.; Ruiz, D. The Risks of Human Overreliance on Large Language Models for Critical Thinking. ResearchGate, 2024. Available online: https://www.researchgate.net/publication/385743952_The_Risks_Of_Human_Overreliance_On_Large_Language_Models_For_Critical_Thinking (accessed on 11 March 2025).
  106. Schiller, C.A. The human factor in detecting errors of large language models: A systematic literature review and future research directions. arXiv 2024, arXiv:2403.09743. [Google Scholar]
  107. Page, M.J.; Higgins, J.P.T.; Sterne, J.A.C. Assessing risk of bias due to missing results in a synthesis. In Cochrane Handbook for Systematic Reviews of Interventions; Wiley: Hoboken, NJ, USA, 2019; pp. 349–374. [Google Scholar]
  108. Goossen, K.; Tenckhoff, S.; Probst, P.; Grummich, K.; Mihaljevic, A.L.; Büchler, M.W.; Diener, M.K. Optimal literature search for systematic reviews in surgery. Langenbecks Arch. Surg. 2018, 403, 119–129. [Google Scholar]
  109. Ewald, H.; Klerings, I.; Wagner, G.; Heise, T.L.; Stratil, J.M.; Lhachimi, S.K.; Hemkens, L.G.; Gartlehner, G.; Armijo-Olivo, S.; Nussbaumer-Streit, B. Searching two or more databases decreased the risk of missing relevant studies: A metaresearch study. J. Clin. Epidemiol. 2022, 149, 154–164. [Google Scholar]
  110. Cooper, C.; Varley-Campbell, J.; Carter, P. Established search filters may miss studies when identifying randomized controlled trials. J. Clin. Epidemiol. 2019, 112, 12–19. [Google Scholar]
  111. Giray, L. Prompt Engineering with ChatGPT: A Guide for Academic Writers. Ann. Biomed. Eng. 2023, 51, 2629–2633. [Google Scholar] [CrossRef]
  112. Meskó, B. Prompt Engineering as an Important Emerging Skill for Medical Professionals: Tutorial. J. Med. Internet Res. 2023, 25, e50638. [Google Scholar] [CrossRef] [PubMed]
  113. Wong, E. Comparative Analysis of Open Source and Proprietary Large Language Models: Performance and Accessibility. Adv. Comput. Sci. 2024, 7, 1–7. [Google Scholar]
  114. Shah, C. From prompt engineering to prompt science with human in the loop. arXiv 2024, arXiv:2401.04122. [Google Scholar]
  115. Ray, S. A Quick Review of Machine Learning Algorithms. In Proceedings of the 2019 International Conference on Machine Learning, Big Data, Cloud and Parallel Computing (COMITCon), Faridabad, India, 14–16 February 2019; pp. 35–39. [Google Scholar]
  116. Tang, X.; Jin, Q.; Zhu, K.; Yuan, T.; Zhang, Y.; Zhou, W.; Qu, M.; Zhao, Y.; Tang, J.; Zhang, Z.; et al. Prioritizing safeguarding over autonomy: Risks of llm agents for science. arXiv 2024, arXiv:2402.04247. [Google Scholar]
  117. Liu, X.; Rivera, S.C.; Moher, D.; Calvert, M.J.; Denniston, A.K.; Ashrafian, H.; Beam, A.L.; Chan, A.W.; Collins, G.S.; Deeks, A.D.J.; et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: The CONSORT-AI extension. Nat. Med. 2020, 26, 1364–1374. [Google Scholar] [CrossRef]
  118. Cacciamani, G.E.; Chu, T.N.; Sanford, D.I.; Abreu, A.; Duddalwar, V.; Oberai, A.; Kuo, C.C.J.; Liu, X.; Denniston, A.K.; Vasey, B.; et al. PRISMA AI reporting guidelines for systematic reviews and meta-analyses on AI in healthcare. Nat. Med. 2023, 29, 14–15. [Google Scholar]
  119. Kim, J.K.; Chua, M.; Rickard, M.; Lorenzo, A. ChatGPT and large language model (LLM) chatbots: The current state of acceptability and a proposal for guidelines on utilization in academic medicine. J. Pediatr. Urol. 2023, 19, 598–604. [Google Scholar] [CrossRef] [PubMed]
  120. Ranjan, R.; Gupta, S.; Singh, S.N. A comprehensive survey of bias in llms: Current landscape and future directions. arXiv 2024, arXiv:2409.16430. [Google Scholar]
  121. Ullah, E.; Parwani, A.; Baig, M.M.; Singh, R. Challenges and barriers of using large language models (LLM) such as ChatGPT for diagnostic medicine with a focus on digital pathology—A recent scoping review. Diagn. Pathol. 2024, 19, 43. [Google Scholar]
  122. Barman, K.G.; Wood, N.; Pawlowski, P. Beyond transparency and explainability: On the need for adequate and contextualized user guidelines for LLM use. Ethics Inf. Technol. 2024, 26, 47. [Google Scholar]
  123. Barman, K.G.; Caron, S.; Claassen, T.; De Regt, H. Towards a benchmark for scientific understanding in humans and machines. Minds Mach. 2024, 34, 6. [Google Scholar]
  124. Jiao, J.; Afroogh, S.; Xu, Y.; Phillips, C. Navigating llm ethics: Advancements, challenges, and future directions. arXiv 2024, arXiv:2406.18841. [Google Scholar]
  125. Patil, R.; Gudivada, V. A review of current trends, techniques, and challenges in large language models (llms). Appl. Sci. 2024, 14, 2074. [Google Scholar] [CrossRef]
Figure 1. A simplified workflow for integrating Large Language Models (LLMs) into systematic review screening. The process is divided into four phases: data preparation (including broad database searches and deduplication), model and prompt configuration (the selection of an appropriate LLM and the formulation of screening instructions), screening (automated classification followed by human verification), and finalization (quality control, prompt refinement, and the synthesis of selected studies). Abbreviations: LLM, Large Language Model; QC, quality control.
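To make the four phases of Figure 1 concrete, the sketch below outlines one possible zero-shot screening loop in Python. It is illustrative only: the eligibility criteria, file names, prompt wording, and the call_llm() placeholder are assumptions introduced here for demonstration, not part of the published workflow, and the stub must be replaced with the API client or locally hosted model a review team actually uses.

```python
# A minimal sketch (not the authors' implementation) of the four-phase
# workflow in Figure 1, assuming search results have already been retrieved
# and deduplicated into a CSV file with "id", "title", and "abstract" columns.

import pandas as pd

# --- Phase 1: data preparation ---------------------------------------------
records = pd.read_csv("deduplicated_records.csv")  # hypothetical file name

# --- Phase 2: model and prompt configuration --------------------------------
# Example eligibility criteria; a real review would use its own PICO-based text.
CRITERIA = (
    "Include randomized controlled trials in adult patients evaluating "
    "intervention X against any comparator; exclude animal studies, "
    "case reports, and conference abstracts."
)

def build_prompt(title: str, abstract: str) -> str:
    """Combine the eligibility criteria with one record's title and abstract."""
    return (
        "You are screening studies for a systematic review.\n"
        f"Eligibility criteria: {CRITERIA}\n\n"
        f"Title: {title}\nAbstract: {abstract}\n\n"
        "Answer with exactly one word: INCLUDE or EXCLUDE."
    )

def call_llm(prompt: str) -> str:
    """Placeholder for a real LLM call (e.g., a chat-completion request).

    Swap this stub for the provider's client library or a locally hosted
    model; here it simply excludes everything so the script stays runnable
    without an API key.
    """
    return "EXCLUDE"

# --- Phase 3: screening ------------------------------------------------------
decisions = []
for row in records.itertuples(index=False):
    answer = call_llm(build_prompt(row.title, row.abstract)).strip().upper()
    decisions.append("INCLUDE" if answer.startswith("INCLUDE") else "EXCLUDE")
records["llm_decision"] = decisions

# --- Phase 4: finalization ---------------------------------------------------
# Flag every provisional inclusion (and, ideally, a random sample of the
# exclusions) for human verification before quality control and synthesis.
records["needs_human_check"] = records["llm_decision"] == "INCLUDE"
records.to_csv("screening_decisions.csv", index=False)
```

Restricting the model to a one-word INCLUDE/EXCLUDE answer simplifies downstream parsing, while the needs_human_check column keeps the human-verification and quality-control steps of the workflow explicit rather than implicit.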
