Article

Performance Comparison of Large Language Models for Efficient Literature Screening

1 Histology and Embryology Laboratory, Department of Medicine and Surgery, University of Parma, Via Volturno 39, 43126 Parma, Italy
2 Department of Medicine and Surgery, Dental School, University of Parma, 43126 Parma, Italy
3 Centre for Oral Clinical Research, Institute of Dentistry, Faculty of Medicine and Dentistry, Queen Mary University of London, London E1 2AD, UK
* Author to whom correspondence should be addressed.
BioMedInformatics 2025, 5(2), 25; https://doi.org/10.3390/biomedinformatics5020025
Submission received: 18 March 2025 / Revised: 18 April 2025 / Accepted: 29 April 2025 / Published: 7 May 2025

Abstract

Background: Systematic reviewers face a growing body of biomedical literature, making early-stage article screening increasingly time-consuming. In this study, we assessed six large language models (LLMs)—OpenHermes, Flan T5, GPT-2, Claude 3 Haiku, GPT-3.5 Turbo, and GPT-4o—for their ability to identify randomized controlled trials (RCTs) in datasets of increasing difficulty. Methods: We first retrieved articles from PubMed and used all-mpnet-base-v2 to measure semantic similarity to known target RCTs, stratifying the collection into quartiles of descending relevance. Each LLM then received either verbose or concise prompts to classify articles as “Accepted” or “Rejected”. Results: Claude 3 Haiku, GPT-3.5 Turbo, and GPT-4o consistently achieved high recall, though their precision varied in the quartile with the highest similarity, where false positives increased. By contrast, smaller or older models struggled to balance sensitivity and specificity, with some over-including irrelevant studies or missing key articles. Importantly, multi-stage prompts did not guarantee performance gains for weaker models, whereas single-prompt approaches proved effective for advanced LLMs. Conclusions: These findings underscore that both model capability and prompt design strongly affect classification outcomes, suggesting that newer LLMs, if properly guided, can substantially expedite systematic reviews.

1. Introduction

Systematic reviews and meta-analyses are fundamental tools in biomedical research, offering robust insights into the efficacy and safety of medical interventions by synthesizing findings from multiple studies [1]. However, the initial screening of abstracts—a step often involving thousands of articles—remains one of the most time-consuming aspects of conducting these reviews, particularly when database searches prioritize sensitivity over specificity [2]. As the volume of published research grows exponentially, there is an urgent need for more efficient strategies to manage this screening burden and support evidence-based decision-making [3,4].
A typical systematic review involves searching databases like PubMed, Scopus, or Embase to identify studies that align with predefined criteria, such as population, intervention, comparison, outcomes, and timeframe (PICOT) [5]. While these broad searches aim to avoid missing important studies, they also generate large numbers of irrelevant or duplicate records. Manual title and abstract screening is not only laborious but also subject to human error, fatigue, and inconsistencies among reviewers [6].
Advances in artificial intelligence (AI), particularly in natural language processing (NLP), have introduced new opportunities to automate or partially automate screening workflows. Large language models (LLMs)—AI systems trained on vast amounts of text data—can classify abstracts based on predefined inclusion criteria through tailored instructions or minor adjustments [7,8,9,10]. These automated approaches promise to accelerate systematic reviews, enhance consistency, and allow researchers to concentrate on critical appraisal and data synthesis [11].
This can be extremely advantageous in fields where technical and scientific advances require scholars to continuously curate and update systematic reviews [12], such as periodontal regeneration. In periodontology, for instance, the regenerative treatment of bone defects has gained prominence, with interventions such as guided tissue regeneration, bone graft materials, and adjunctive bioactive molecules studied extensively [13,14,15,16]. However, evidence of their effectiveness remains fragmented, underscoring the need for the rigorous synthesis of findings from randomized controlled trials (RCTs).
In the present study, we developed a screening pipeline using two distinct groups of LLMs—smaller or older models (OpenHermes, Flan T5, and GPT-2) and more advanced models (Claude 3 Haiku, GPT-3.5 Turbo, and GPT-4o)—to evaluate their ability to classify periodontal regeneration studies. Using a recent systematic review by Fidan et al. [17] and a systematic review by Yang et al. [18] as benchmarks, we compiled titles and abstracts from RCTs identified in their work to create a “target set” of relevant studies. We then conducted a broad PubMed search for “periodontal regeneration” or “bronchoscopy” to generate larger pools of potentially related articles. To prioritize the screening process, we used AI-based text similarity to rank articles by their resemblance to the target set [19,20,21]. This approach organized articles from both pools into four groups of descending similarity, with the top group expected to contain the most similar studies and thus representing the most challenging task for LLMs. We then tested different LLMs with a zero-shot approach, i.e., without previous fine-tuning, and let them classify the papers based on the focused question and inclusion criteria of Fidan et al.’s or Yang et al.’s review, prompting them to output “ACCEPT” or “REJECT”. We generated confusion matrices and standard classification metrics (accuracy, precision, recall, and F1 score) to evaluate how closely each model’s performance matched expert judgments [22]. High recall was particularly important to avoid discarding studies that truly met the inclusion criteria [23].
This study thus aims to clarify the strengths and limitations of LLMs in literature screening and thereby optimize their use in streamlining the screening process. This could reduce the manual workload for researchers, accelerate evidence synthesis, and ultimately support the development of more effective therapies.

2. Materials and Methods

2.1. Study Design and Rationale

We designed this methodological study to test whether large language models could effectively screen articles for a systematic review using ad hoc curated datasets. To build these datasets, we arbitrarily chose two recent reviews: (1) “Combination of Enamel Matrix Derivatives with Bone Graft vs Bone Graft Alone in the Treatment of Periodontal Intrabony and Furcation Defects: A Systematic Review and Meta-Analysis” by Fidan et al. [17], which investigated whether combining enamel matrix derivative (EMD) with bone grafts (BG) offers additional clinical benefits compared to BG alone in patients with periodontal intrabony or furcation defects, and (2) “The diagnostic performance and optimal strategy of cone beam CT-assisted bronchoscopy for peripheral pulmonary lesions: A systematic review and meta-analysis” by Yang et al. [18], which investigated the clinical value of CBCT-assisted bronchoscopy in lung cancer diagnosis.

2.1.1. Fidan’s Review

This review was chosen because of the topic, which fell within the domain knowledge of our team, and because it was very recent and clearly stated its focused question (FQ) and inclusion/exclusion criteria in a format that could be easily parsed. The FQ of this review paper was as follows:
“Does the combination of EMD + BG provide additional clinical benefits compared with BG alone in terms of CAL gain, PD reduction, pocket closure, composite outcome of treatment success, gingival recession (REC) and bone gain in periodontal intrabony and furcation defects?”
From the text of this systematic review, we extracted a PICOT framework [24]:
  • Population (P): Adult periodontitis patients (≥18 years old) with at least one intrabony or furcation defect;
  • Intervention (I): Periodontal regenerative surgical procedures using EMD combined with bone grafts (EMD + BG);
  • Comparison (C): Periodontal regenerative surgical procedures using bone grafts alone (BG);
  • Outcomes (O): CAL gain, PD reduction (primary); secondary outcomes included pocket closure, wound healing, gingival recession (REC), tooth loss, PROMs, and adverse events.
The inclusion criteria explicitly mentioned in the review were as follows:
  • Study Design: Randomized controlled trials (RCTs), parallel or split-mouth, with ≥10 patients per arm;
  • Follow-up: Minimum of 6 months after the surgical procedure;
  • Population: Adult periodontitis patients (≥18 years) with intrabony or furcation defects;
  • Intervention: EMD + BG (i.e., Emdogain combined with any bone graft material);
  • Comparison: BG alone;
  • Outcomes: At least CAL gain and PD reduction.
We used these inclusion criteria to draft a set of explicit exclusion criteria:
  • Studies focusing exclusively on children (<18 years);
  • Studies without a clear mention of EMD or bone grafts;
  • Follow-up period of <6 months or uncertain;
  • Non-randomized studies or fewer than 10 patients per arm.
Based on these criteria, the authors of this review performed a literature search on Medline, Embase, and Web of Science and conducted a manual search on several relevant publications in the field, eventually identifying nine target articles (Appendix A.1). These target articles formed the core reference set for the evaluation of our models.

2.1.2. Yang’s Review

This review was selected to confirm the insights obtained with the former review because its primary objective—assessing the diagnostic performance and optimal strategy of cone beam CT (CBCT)-assisted bronchoscopy for peripheral pulmonary lesions (PPLs)—fell within a medical specialty quite different from that of Fidan’s work, so it could indicate the applicability of these LLMs across different medical fields.
Unlike Fidan’s paper, Yang’s systematic review did not frame an explicit FQ, and its research question had to be inferred from the text, namely, whether and how CBCT improves diagnostic yield, navigational success, and safety compared with other bronchoscopic approaches.
While Yang’s review did not explicitly lay out a PICOT statement in the same structured manner as Fidan’s, one could be extracted as follows:
  • Population (P): adult patients presenting with radiologically suspicious PPLs;
  • Intervention (I): CBCT-augmented bronchoscopy;
  • Comparison (C): standard endobronchial or navigational tools alone;
  • Outcomes (O): diagnostic yield, procedure times, and complication rates.
Criteria for exclusion were as follows: (1) reviews, commentaries, and non-human studies were not included, and (2) studies with ≤10 cases were excluded.

2.2. Data Acquisition

First, we generated two lists of target articles, comprising the 9 studies identified by Fidan et al. (Appendix A.1) and the 15 studies investigated by Yang et al. (Appendix A.2). We then created pools of thematically related articles by searching Medline [25] using the queries “periodontal regeneration” and “bronchoscopy”, respectively, which were arbitrarily chosen to retrieve articles on topics similar to those of Fidan’s and Yang’s systematic reviews. Due to PubMed’s limit of retrieving up to 10,000 records in a single pass, we segmented our search by publication date ranges. We used Biopython’s Entrez.efetch [26,27] to parse and store the title and abstract texts in pandas DataFrames [28]. Duplicates overlapping with the target articles were removed to ensure a clean distinction between relevant (target = 1) and presumed irrelevant (target = 0) papers. The initial pools contained 16,597 articles on periodontal regeneration and 16,460 articles on bronchoscopy, respectively.
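For illustration, this retrieval step can be sketched as follows with Biopython and pandas; the e-mail address, query, and date window are placeholders, and in practice the search was repeated over several date ranges and the fetches were batched.

from Bio import Entrez, Medline
import pandas as pd

Entrez.email = "your.name@example.org"  # required by NCBI; placeholder address

# Search one publication-date window (PubMed caps a single retrieval at 10,000 records)
search = Entrez.esearch(db="pubmed", term="periodontal regeneration",
                        datetype="pdat", mindate="2010/01/01", maxdate="2014/12/31",
                        retmax=10000)
id_list = Entrez.read(search)["IdList"]
search.close()

# Fetch titles and abstracts for the retrieved PMIDs in MEDLINE format
fetch = Entrez.efetch(db="pubmed", id=",".join(id_list), rettype="medline", retmode="text")
records = Medline.parse(fetch)
rows = [{"PMID": r.get("PMID", ""), "Title": r.get("TI", ""), "Abstract": r.get("AB", "")}
        for r in records]
fetch.close()

pool_df = pd.DataFrame(rows)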

2.3. Data Pre-Processing

We cleaned all DataFrames belonging to both pools by removing invalid entries or replacing missing fields in the Title and Abstract columns with empty strings. Titles and abstracts were concatenated to form a single text input per article for embedding. We used the sentence-transformer all-mpnet-base-v2 model to encode each article into a 768-dimensional embedding vector [20]. We computed the mean embedding of Fidan’s nine target articles and calculated the cosine similarity between each pool article’s embedding and this mean vector [21]. Articles in the pool were then ranked and split into four quartiles (Q1–Q4) according to similarity scores.
Each quartile contained several thousand articles (~4000–5000), which we anticipated would require a long time to be processed by the LLMs. For feasibility, we therefore randomly sampled 200 articles from each quartile. The composition of the final subsets (“reduced quartiles”, n = 200) that were used for testing is shown in Table S1. The same procedure was applied to the bronchoscopy pool, eventually creating four reduced quartiles of 200 bronchoscopy articles each.
The target articles from the two reviews were then added to their respective datasets. These datasets were designed to pose increasingly difficult challenges for the LLMs to correctly discriminate between relevant and non-relevant articles.
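A minimal sketch of this pre-processing stage is given below, assuming pool_df and target_df DataFrames with Title and Abstract columns as built in Section 2.2; the random seed and variable names are illustrative, and the re-appending of the target articles is omitted for brevity.

from sentence_transformers import SentenceTransformer, util
import pandas as pd

model = SentenceTransformer("all-mpnet-base-v2")

# One text field per article: concatenated title and abstract
pool_df["text"] = pool_df["Title"].fillna("") + " " + pool_df["Abstract"].fillna("")
target_df["text"] = target_df["Title"].fillna("") + " " + target_df["Abstract"].fillna("")

# 768-dimensional embeddings for the pool and for the target (reference) articles
pool_emb = model.encode(pool_df["text"].tolist(), convert_to_tensor=True, show_progress_bar=True)
target_emb = model.encode(target_df["text"].tolist(), convert_to_tensor=True)

# Cosine similarity of each pool article to the mean target embedding
mean_target = target_emb.mean(dim=0, keepdim=True)
pool_df["similarity"] = util.cos_sim(pool_emb, mean_target).squeeze().cpu().numpy()

# Rank into four quartiles (Q1 = lowest similarity, Q4 = highest) and sample 200 articles each
pool_df["quartile"] = pd.qcut(pool_df["similarity"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])
reduced = pool_df.groupby("quartile", observed=True).sample(n=200, random_state=42)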

2.4. Dataset Analysis

To identify thematic patterns within the titles in the datasets, we also employed BERTopic, a hierarchical topic modeling framework that integrates transformer-based embeddings, dimensionality reduction, clustering, and keyword extraction [29]. All-mpnet-base-v2 embeddings were used and reduced with UMAP (Uniform Manifold Approximation and Projection), a nonlinear reduction technique that preserves both local and global data structure by modeling topological relationships [30]. Clustering was performed using HDBSCAN, a density-based algorithm adept at identifying semantically dense regions while filtering noise [31]. Clusters were subsequently annotated via a class-based TF-IDF variant (c-TF-IDF). Unlike traditional TF-IDF, which operates at the document level, c-TF-IDF emphasizes terms salient to entire clusters, downweighting high-frequency yet nonspecific vocabulary [29].
To enhance topic interpretability, we passed BERTopic-generated keywords to GPT-3.5 Turbo (see below for details), which was prompted as follows:
“Generate a brief, grammatically correct title (3–5 words) that summarizes the following biomedical research keywords. Avoid technical jargon and focus on overarching themes: [KEYWORDS]”
The list of topics in the aggregated datasets can be found in Table S2.
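This topic-modeling step can be sketched as follows, assuming the reduced dataset from Section 2.3; the UMAP and HDBSCAN hyperparameters shown are illustrative defaults rather than the exact values used.

from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer("all-mpnet-base-v2")
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine", random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=30, metric="euclidean", prediction_data=True)

topic_model = BERTopic(embedding_model=embedding_model,
                       umap_model=umap_model,
                       hdbscan_model=hdbscan_model,
                       verbose=True)

titles = reduced["Title"].fillna("").tolist()
topics, _ = topic_model.fit_transform(titles)

# c-TF-IDF keyword labels per cluster; topic -1 collects unclassified documents
print(topic_model.get_topic_info()[["Topic", "Count", "Name"]])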

2.5. LLM-Based Classification

We tested several models to classify each article in the reduced quartile sets as either “Accepted” or “Rejected” according to the inclusion criteria above.
  • OpenHermes: OpenHermes is an instruction-tuned language model based on the Mistral 7B architecture (7 billion parameters), designed for effective natural language understanding and generation across a wide range of tasks [32]. For this study, we employed the quantized version of OpenHermes-2.5-Mistral-7B-GGUF (openhermes-2.5-mistral-7b.Q4_K_M.gguf), freely available on Hugging Face. Quantization reduced the model’s full-precision parameters to 4-bit values, significantly improving computational efficiency while maintaining high performance (a minimal loading sketch is given at the end of this section);
  • Flan T5: Flan-T5 is an instruction-tuned language model developed by Google, designed for general-purpose natural language understanding and generation tasks [33]. Flan-T5 was fine-tuned on a wide array of instruction-following datasets and optimized for handling tasks such as classification, summarization, and question answering with high accuracy and contextual awareness;
  • GPT-2: GPT-2, developed by OpenAI, lacks the instruction-tuning and domain-specific optimization of more advanced models, but it remains a valuable baseline for understanding the capabilities of earlier-generation language models [34];
  • Claude 3 Haiku: Claude 3 Haiku is a next-generation model developed by Anthropic, featuring advanced language understanding and reasoning capabilities. Optimized for a wide range of tasks, it is instruction-tuned to follow specific prompts and has shown strong performance in classification scenarios [35];
  • GPT-3.5 Turbo: GPT-3.5 Turbo, developed by OpenAI, is an optimized and cost-efficient version of the GPT-3.5 model, providing robust natural language understanding and generation capabilities [36]. With significantly improved contextual reasoning and instruction-following compared to GPT-2, GPT-3.5 Turbo performs better in structured classification tasks. In this study, GPT-3.5 Turbo was utilized via OpenAI’s API;
  • GPT-4o: GPT-4o is the optimized version of GPT-4, and it combines enhanced instruction-following capabilities with improved contextual understanding [37]. GPT-4o performs better in complex decision-making and classification scenarios than its predecessors. GPT-4o was accessed via OpenAI’s API, too.
All LLMs were run with default parameters (e.g., temperature, top_k, and top_p in generative settings). Two authors (C.G. and M.M.) independently reviewed the output of the LLMs according to the inclusion/exclusion criteria described in Section 2.1. Disagreements were discussed and resolved by consensus.
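As an example of how the quantized open-source models can be run locally, the following sketch uses llama-cpp-python, one common runtime for GGUF checkpoints; the runtime choice, file path, and context size are our assumptions rather than details stated above.

from llama_cpp import Llama

# Load the 4-bit quantized OpenHermes checkpoint downloaded from Hugging Face
llm = Llama(model_path="openhermes-2.5-mistral-7b.Q4_K_M.gguf", n_ctx=4096, verbose=False)

def classify_local(prompt: str) -> str:
    """Generate a short completion and map it to the Accepted/Rejected labels used in this study."""
    out = llm(prompt, max_tokens=8)  # sampling parameters left at their defaults
    text = out["choices"][0]["text"].strip().upper()
    return "Accepted" if "ACCEPT" in text else "Rejected"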

2.6. Prompting

2.6.1. Base Prompt

LLMs require adequate prompting to function, i.e., a set of instructions that the model follows to perform its task [38]. We designed the prompts based on the focused questions and the PICOT frameworks outlined in the papers. For the review by Fidan et al., 2024, we drafted a general prompt as follows:
You are assisting in a systematic review on periodontal regeneration comparing
Emdogain (EMD) + bone graft (BG) versus BG alone.
Your task is to decide whether the following article should be ACCEPTED
or REJECTED based on the following “soft approach” criteria:
**Inclusion Criteria**:
1. **Population (P)**: Adult periodontitis patients (≥18 years old) with
at least one intrabony or furcation defect.
2. **Intervention (I)**: Regenerative surgical procedures involving EMD
combined with any type of bone graft material (EMD+BG).
3. **Comparison (C)**: Regenerative surgical procedures involving BG alone.
4. **Outcomes (O)**:
- Primary: CAL (Clinical Attachment Level) gain, PD (Probing Depth) reduction.
- Secondary: Pocket closure, wound healing, gingival recession, tooth loss,
patient-reported outcome measures (PROMs), adverse events.
5. **Study Design**:
- Randomized controlled trial (RCT), parallel or split-mouth design.
- ≥10 patients per arm.
- ≥6 months follow-up.
**Decision Approach**:
- If **at least one** of the above criteria is explicitly met or strongly implied,
**AND** none of the criteria are explicitly contradicted, then **ACCEPT**.
- If **any** criterion is clearly violated (e.g., population is exclusively children,
follow-up is 3 months, or design is not an RCT), then **REJECT**.
- If **no** criterion is clearly met, **REJECT**.
Below is the article’s title and abstract. Decide if it should be ACCEPTED or REJECTED
according to the “soft approach” described.
Title: {title}
Abstract: {abstract}
**If the article is acceptable, respond with exactly:**
ACCEPT
**Otherwise, respond with exactly:**
REJECT
For the review by Yang et al., 2025, we generated the following prompt:
“You are assisting in a systematic review on Cone Beam CT (CBCT)-guided bronchoscopy for peripheral pulmonary lesions (PPLs).
Your task is to decide whether the following article should be ACCEPTED or REJECTED based on the “soft approach” criteria below:
**Inclusion Criteria**:
  • **Population (P)**: Patients with peripheral pulmonary lesions (PPLs) detected by CT examination.
  • **Intervention (I)**: Diagnostic CBCT-guided bronchoscopy.
  • **Outcomes (O)**: Diagnostic yield (e.g., success rate) and/or safety outcomes (e.g., complications).
  • Study Size: >10 patients.
Soft Approach:
ACCEPT if ≥1 criterion is explicitly met or strongly implied AND no criteria are explicitly contradicted.
REJECT if any criterion is violated (e.g., ≤10 patients, no diagnostic/safety outcomes, non-CBCT-guided intervention).
REJECT if no criteria are clearly met.
Below is the article’s title and abstract. Decide if it should be ACCEPTED or REJECTED
according to the “soft approach” described.
Title: {title}
Abstract: {abstract}
**If the article is acceptable, respond with exactly:**
ACCEPT
**Otherwise, respond with exactly:**
REJECT
These prompts can be conceived as “soft-approach” prompts because they emphasize recall: if any criterion might be satisfied, the article is likely ACCEPTED unless explicitly contradicted. This prompt style was tested with all the models. These prompts are also quite verbose because they spell out the criteria for a decision explicitly.
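A minimal sketch of how such a prompt can be dispatched through OpenAI’s chat API is shown below; BASE_PROMPT is a hypothetical variable holding one of the prompt texts above (with its {title} and {abstract} placeholders), reduced is the quartile DataFrame from Section 2.3, and Claude 3 Haiku would be called analogously through Anthropic’s client.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_api(prompt_template: str, title: str, abstract: str,
                 model: str = "gpt-3.5-turbo") -> str:
    """Fill the prompt template with one article and map the model's reply to Accepted/Rejected."""
    prompt = prompt_template.format(title=title, abstract=abstract)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],  # default temperature/top_p
    )
    answer = response.choices[0].message.content.strip().upper()
    return "Accepted" if answer.startswith("ACCEPT") else "Rejected"

# Example: screen one reduced quartile with the verbose prompt
reduced["prediction"] = [classify_api(BASE_PROMPT, r.Title, r.Abstract)
                         for r in reduced.itertuples()]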

2.6.2. Double Prompt—OpenHermes, Flan T5, and GPT-2

While testing the LLMs, it soon became apparent that the less advanced models (OpenHermes, Flan T5, and GPT-2) struggled with such structured prompts, so we decided to also test their performance with a double-prompt approach, i.e., by splitting the task into two sub-tasks, each with its own dedicated prompt. The initial prompt was very simple, as follows:
Initial filter to reject articles that do not mention Bone Graft (BG) or Emdogain (EMD).
This prompt was designed to shortlist the articles, rejecting all reports that did not directly focus on bone grafts or Emdogain and thereby reducing the number of articles to screen in more detail. The second prompt, identical to the standard prompt, was then applied only to the articles retained by the first screening. Given the relatively low performance of these models, only data from the first review are reported hereafter.
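A sketch of this two-stage pipeline is given below; classify_fn stands for any single-string wrapper such as classify_local above, and the filter text is a paraphrase of the simple first prompt rather than its exact wording.

FILTER_PROMPT = (
    "You are screening articles for a systematic review. Answer ACCEPT only if the article "
    "explicitly mentions a bone graft (BG) or Emdogain (EMD); otherwise answer REJECT.\n\n"
    "Title: {title}\nAbstract: {abstract}\nAnswer with exactly ACCEPT or REJECT."
)

def double_prompt_screen(df, classify_fn, detailed_prompt):
    """Stage 1: coarse keyword-level filter; Stage 2: full criteria applied to the survivors."""
    keep = [classify_fn(FILTER_PROMPT.format(title=r.Title, abstract=r.Abstract)) == "Accepted"
            for r in df.itertuples()]
    shortlisted = df[keep].copy()
    shortlisted["prediction"] = [
        classify_fn(detailed_prompt.format(title=r.Title, abstract=r.Abstract))
        for r in shortlisted.itertuples()
    ]
    return shortlisted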

2.6.3. Concise Prompt—Claude 3 Haiku, GPT-3.5 Turbo, and GPT-4o

For Claude 3 Haiku, GPT-3.5 Turbo, and GPT-4o only, we also tested more concise prompts that REJECT papers unless they definitively match the inclusion criteria. This approach aimed to reduce false positives but risked more false negatives.
The more concise prompt for the review by Fidan et al., 2024 was as follows:
You are an expert periodontology assistant. You are assisting in a systematic review on periodontal regeneration comparing
Emdogain (EMD) + bone graft (BG) versus bone graft alone. Evaluate this article step by step:
1. Population: If the text states adult patients with intrabony/furcation defects, or is silent about age/defect type, it’s not violated.
2. Intervention: If Emdogain + bone graft is mentioned or strongly implied, we consider this met.
3. Comparison: If a group uses bone graft alone, or there’s at least a control lacking Emdogain, consider it met.
4. Outcomes: If they mention CAL gain or PD reduction, or are silent, do not penalize.
Only reject if they clearly never measure any clinical outcomes.
5. Study design: If they claim RCT or strongly imply it, accept. If they mention a different design (case series, pilot with fewer than 10 patients, or <6-month follow-up), reject.
If at least one criterion is explicitly met and none are clearly violated, answer ACCEPT. Otherwise, REJECT.
If you are unsure, default to ACCEPT unless a contradiction is stated.
Article Title: {title}
Abstract: {abstract}
Respond with ONLY ‘ACCEPT’ or ‘REJECT’ accordingly.
For the review by Yang et al., 2025, the concise prompt was as follows:
You are an expert in diagnostic bronchoscopy. Assist in a systematic review on CBCT-guided bronchoscopy for peripheral pulmonary lesions (PPLs). Evaluate this article step by step:
  • Population: If the text states patients with PPLs detected by CT, or is silent about PPLs/CT, it’s not violated.
  • Intervention: If CBCT-guided bronchoscopy is mentioned or strongly implied, consider this met.
  • Outcomes: If they mention diagnostic yield (e.g., success rate) or safety outcomes (e.g., complications), or are silent, do not penalize. Only reject if they clearly never measure these.
  • Study Size: If they mention >10 patients or are silent, accept. If they state ≤10, reject.
Decision rules:
If ≥1 criterion is explicitly met and none are clearly violated, answer ACCEPT.
If any criterion is violated (e.g., ≤10 patients, non-CBCT-guided methods), answer REJECT.
Default to ACCEPT unless a contradiction is stated.
Article Title: {title}
Abstract: {abstract}
Respond with ONLY ‘ACCEPT’ or ‘REJECT’ accordingly.

2.7. Performance Evaluation

To assess how accurately each large language model (LLM) classified the articles, we compared the model’s predictions—“Accepted” or “Rejected”—with the ground truth (target). An article was assigned target = 1 if it met the published inclusion criteria (i.e., it was truly relevant); otherwise, target = 0 indicated a non-relevant study. For the target articles, any disagreement between the model’s prediction and the ground truth was considered an error, and for a subset of the remaining articles, we verified their relevance status via manual or semi-manual checks.
We used scikit-learn to generate confusion matrices, which summarize four key outcomes. A true positive (TP) occurs when a model correctly identifies a relevant article as “Accepted”, whereas a false positive (FP) arises if it incorrectly deems an irrelevant article to be “Accepted”. Conversely, a false negative (FN) means a relevant article was mistakenly “Rejected”, and a true negative (TN) means an irrelevant article was correctly “Rejected”. From these, we calculated accuracy, precision, recall (sensitivity), and F1 score, as described in Appendix B.
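A brief sketch of this evaluation, assuming a reduced DataFrame carrying the binary target column and the prediction column produced during classification, could look as follows.

from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

# 1 = relevant (target) article, 0 = presumed irrelevant
y_true = reduced["target"].astype(int)
y_pred = (reduced["prediction"] == "Accepted").astype(int)

# Confusion matrix laid out as [[TN, FP], [FN, TP]] for labels [0, 1]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

metrics = {
    "accuracy": accuracy_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred, zero_division=0),
    "recall": recall_score(y_true, y_pred, zero_division=0),
    "f1": f1_score(y_true, y_pred, zero_division=0),
}
print(tn, fp, fn, tp, metrics)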
The whole methodology is summarized as a flowchart in Figure 1.

2.8. Software and Hardware

All data processing and model testing in this study were conducted using Google Colab notebooks [39]. Colab’s cloud-based environment is particularly advantageous for researchers without access to local high-end computing infrastructure because all computing can be performed remotely. We leveraged a Tesla T4 GPU in Colab to accelerate the embedding steps and reduce the computational overhead, which is crucial when dealing with thousands of abstracts that would otherwise take hours to process. For local inference of smaller models like OpenHermes, we utilized a similar GPU-based setup, which allowed us to run quantized versions of the models on our standard hardware. Meanwhile, Claude 3 Haiku required calls to Anthropic’s API, while GPT-3.5 Turbo and GPT-4o required calls to OpenAI’s API, an approach that significantly streamlined implementation but introduced usage costs and external dependencies.
We employed pandas version 2.2.2 [28] for efficient data management and Biopython version 1.85 [27] for PubMed retrieval, which ensured robust handling of metadata and abstracts. We relied on the sentence-transformers library (version 3.3.1) for generating all-mpnet-base-v2 embeddings [40], which balances speed and accuracy, and used the well-established scikit-learn (version 1.6) library for computing classification metrics and confusion matrices [41]. These software choices minimized coding overhead while standardizing data processing and ensured sufficient computational power and memory for both local and API-based model inference.
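For reproducibility, the package versions reported above can be verified in the runtime with a short check such as the following.

import Bio
import pandas
import sentence_transformers
import sklearn

# Versions used in this study: pandas 2.2.2, Biopython 1.85,
# sentence-transformers 3.3.1, scikit-learn 1.6
for name, module in [("pandas", pandas), ("Biopython", Bio),
                     ("sentence-transformers", sentence_transformers),
                     ("scikit-learn", sklearn)]:
    print(f"{name}: {module.__version__}")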

3. Results

We successfully generated four datasets for each review, each containing 200 articles sampled from quartiles Q1 to Q4 based on their similarity to the target articles. All the previously identified target RCTs were added to every subset for each respective pool. The distribution of cosine similarity scores between the articles in each quartile and the target set follows a clear gradient of relevance. As expected, Quartile 1 (Q1) exhibited the lowest similarity, while Quartile 4 (Q4) showed the highest similarity, as can be seen in Table S1.
A box-and-whisker plot was generated to visualize the distribution of similarity scores within each quartile (Figure 2A,B).
Q1 displayed the widest range of scores, with numerous outliers toward the lower end of the similarity scale, likely representing articles that were easy to filter out. This trend reflects the increasing difficulty of distinguishing between relevant and non-relevant articles as similarity scores rise. Table S2 contains the topics identified by the BERTopic algorithm for both article pools. The periodontal pool (Table S2A) was possibly more homogeneous, and only five main topic clusters were identified (including a −1 unclassified topic group), mostly about periodontal regeneration, the only exception being topic #2 (endodontic treatment outcomes), which would more properly fall within endodontics or the endodontic–periodontal relationship. This group was, however, the smallest one in the pool (n = 49). The bronchoscopy pool (Table S2B) was more diverse, with 20 topics (including the unclassified −1 topic group), with themes ranging from cancer to pulmonary infections and transplant surgery.

3.1. OpenHermes

The performance of the OpenHermes model was evaluated using two alternative prompting strategies: a single prompt and a double-prompt strategy (comprising an initial filtering stage, followed by a detailed evaluation stage). When using the single prompt, OpenHermes demonstrated high accuracy across all quartiles but consistently struggled to identify relevant articles (true positives, TPs), as shown in Figure 3A.
For Q1, the model achieved an accuracy of 95.5% because it correctly classified 190 true negative (TN) papers out of 191. The model had only one TP and eight false negatives (FNs), resulting in a moderate precision of 50% but low recall and F1 scores of 11.1% and 18.2%, respectively. For Q2–Q4, the model again achieved near-perfect accuracy, but no TP articles were identified, leading to undefined precision and F1 scores (Figure 3A). While the model effectively rejected TN articles, it consistently failed to identify TPs across quartiles, which defeats the purpose of the screening.
The double-prompt strategy, designed to improve performance by introducing an initial filtering step, showed mixed results. In the initial filtering phase (Figure 3B), OpenHermes primarily focused on rejecting articles that lacked explicit mentions of EMD or BG. For Q1, the model achieved 94.5% accuracy, with 189 TNs and 2 false positives (FPs), but no TPs were identified. For Q2–Q4, the model achieved 95.5% accuracy, correctly rejecting all irrelevant articles, yet failed to detect any TPs, with precision and F1 scores remaining undefined (Figure 3B).
In the subsequent detailed evaluation phase, OpenHermes was queried again to assess only the articles that had passed the initial filtering. However, the model’s performance remained suboptimal, as no TPs could be identified in any quartile because all target articles had already been filtered out during the first round (Figure 3C).

3.2. Flan T5

The performance of Flan T5 was similarly evaluated using the same two strategies as OpenHermes. Under the single prompt approach, Flan T5 demonstrated moderate accuracy across all quartiles but struggled with precision, recall, and F1 scores due to a high number of FPs and limited TPs. In Q1, the model achieved 64.5% accuracy, with 128 TNs, 63 FPs, 8 FNs, and 1 TP. Precision was 1.6%, recall was 11.1%, and the F1 score was 2.7%, highlighting the model’s limitations in identifying relevant articles. Similar trends were observed across Q2–Q4, with accuracy ranging from 68% to 76.0%. The precision, recall, and F1 scores remained consistently low, as the model failed to effectively distinguish relevant articles from irrelevant ones (Figure 4A). The high number of FPs across all quartiles indicates the model’s tendency to accept a large number of irrelevant articles.
When using the double-prompt strategy, Flan T5’s performance improved significantly during the initial filtering phase (Figure 4B), where the model focused on rejecting articles that did not meet broad inclusion criteria (e.g., a lack of explicit mentions of EMD or BG). In this phase, the model achieved strong recall, correctly identifying all relevant articles (FNs = 0) across all quartiles. However, precision was limited due to a substantial number of FPs.
For example, the model achieved 93.2% accuracy in Q1, with 178 TNs, 13 FPs, 0 FNs, and 9 TPs. Precision was 40.9%, recall was 100%, and the F1 score was 58% (Figure 4B). Accuracy across quartiles ranged from 78% to 89.5%, with recall remaining perfect but precision declining as FPs increased, particularly in Q4 (17%), where FPs reached 44.
In the detailed evaluation phase (Figure 4C), the model’s performance declined significantly. Despite achieving reasonable true negative counts, Flan T5 failed to identify any TPs across all quartiles. In Q1, the model achieved 54.5% accuracy, with 12 TNs, 1 FP, 9 FNs, and 0 TPs, while the precision, recall, and F1 scores were zero. Similar trends were observed in Q2–Q4, with accuracy ranging from 60.0% to 77.4% (Figure 4C). These results highlight the limitations of the model’s ability to refine article selection during the detailed evaluation phase, as it struggled to achieve any balance between recall and precision.

3.3. GPT-2

Using the single-prompt strategy, GPT-2 demonstrated consistent behavior across all quartiles (Figure 5A). The model classified every article as “Accepted”, resulting in TNs = 0 and FNs = 0: all target articles were counted as TPs, but only alongside a very large number of FPs. This led to a precision of 4.5% across all quartiles, while recall remained perfect at 100% due to the complete acceptance of the target articles (Figure 5A). This behavior kept the F1 score down to a mere 8.7%, and the overall accuracy for the single-prompt approach was likewise 4.5%, underlining the inadequacy of this strategy for discerning relevant from irrelevant articles.
In contrast, the performance of the model changed with the double-prompt approach. During the initial filtering, GPT-2 demonstrated improved performance by rejecting some irrelevant articles (Figure 5B). Across the quartiles, TNs increased significantly (e.g., Q1: TNs = 178), which was mirrored by a reduction in FPs (e.g., Q1: FPs = 13). This adjustment boosted accuracy to about 90% for Q1–Q3, though it was 78% in Q4 (Figure 5B). Meanwhile, precision declined substantially over the quartiles (Q1: 40.9%, Q2: 30%, Q3: 28.1%, and Q4: 17%). Recall remained constant at 100% in the initial filtering phase, ensuring that no target articles were excluded prematurely, while the F1 score declined from 58% in Q1 to 29% in Q4 owing to the increasing number of FP articles (Figure 5B). In the subsequent detailed evaluation phase (Figure 5C), the model showed minimal change, with no TNs or FNs identified. FPs matched TPs across all quartiles, emphasizing that all initially accepted articles were retained in the final stage. Precision for the detailed evaluation phase was consistent with the initial filtering results, but the accuracy and F1 scores did not improve (Figure 5C).

3.4. Claude 3 Haiku

The Claude 3 Haiku model exhibited perfect recall (100%) across all quartiles (Q1–Q4) in the Fidan 2024 dataset, never missing any relevant articles (FNs = 0) but with varying precision and accuracy, particularly in Quartile 4. Using the verbose prompt, it delivered high accuracy (97.5–96.5%) and solid precision (64.3–56.2%) in Q1–Q3; however, in Q4, it declined to 76.5% accuracy and 16.1% precision due to increased false positives (Figure 6A).
With the concise prompt, it generally outperformed the verbose version in Q1–Q3, achieving up to 98.5% accuracy and 75.0% precision in Q1 and maintaining 100% recall overall, yet it again encountered a surge in false positives in Q4, causing a drop to 65.5% accuracy and 11.5% precision (Figure 6B). This same pattern was largely replicated in the Yang 2025 corpus, where the model’s verbose prompt yielded consistently high accuracy and precision in Q1–Q2, near-perfect recall in Q3 with accuracy around 99.5%, and a notable decrease in Q4 accuracy to 87.5% and precision to 37.5% (Figure 6C). The concise prompt showed similarly impeccable recall overall but suffered in Q4, with precision falling to 20.8% and overall accuracy to 71.5%, underscoring once more the trade-off between perfect recall and elevated false positives (Figure 6D).

3.5. GPT-3.5 Turbo

The performance of GPT-3.5 Turbo, when employing verbose and concise prompts on Fidan 2024 data, revealed consistently high TN counts and minimal FPs, although both strategies faced challenges in Q4. Under the verbose prompt, Q1 exhibited 99.0% accuracy and 88.9% precision, Q2 and Q3 remained at or above these values, and Q4 showed a drop to 30% precision with an F1 score of 46.1% despite retaining 100% recall (Figure 7A).
The concise prompt strategy yielded comparable overall accuracy but more fluctuations in FPs, evident from a precision of 57.1% and F1 score of 69.4% in Q1, an increase to 72.7% precision in Q2, a drop to 53.3% in Q3, and a sharp decline to 21.6% in Q4 (Figure 7B). In the Yang 2025 dataset, GPT-3.5 Turbo’s verbose prompt again maintained high recall in Q1–Q3, with accuracy near or above 98.5%, although Q3 showed slight dips in recall or precision in some runs, and Q4 once again experienced more false positives (Figure 7C). The concise prompt mirrored these swings, delivering up to 95.0% accuracy in Q1 with 60.0% precision and 70–90% precision in Q2–Q3 but dropping to 55.6% precision in Q4, confirming its persistent sensitivity to borderline articles in the highest-similarity quartile (Figure 7D).

3.6. GPT-4o

The GPT-4o model with the verbose prompt demonstrated consistently excellent performance across Q1–Q4 in the Fidan 2024 dataset, showing near-perfect accuracy, precision, recall, and F1 scores (Figure 8A). In Q1–Q3, it correctly identified all TPs and TNs with no false positives or negatives, thus maintaining 100% for all metrics; in Q4, there was a minor reduction to 90% precision caused by one FP, giving an F1 score of 94.7%.
Using the concise prompt, GPT-4o also sustained 100% accuracy and recall through Q1–Q3, encountering only a slight drop in Q4, where three FPs brought precision down to 75% and the F1 score to 85.7% (Figure 8B).
Concerning the Yang 2025 review, GPT-4o replicated this exceptional reliability in Q1–Q3, again scoring 100% for accuracy, precision, and recall under both prompt styles, while Q4 introduced marginal dips in precision—approximately 88.2% with the verbose prompt and 83.3% with the concise prompt—yet recall remained perfect, highlighting GPT-4o’s capacity to identify all relevant items at the cost of a few additional false positives at higher similarity thresholds (Figure 8C,D).

4. Discussion

We evaluated multiple large language models (LLMs) on two systematically compiled datasets derived from the Fidan 2024 and Yang 2025 reviews, each clustered into four quartiles of increasing similarity to their respective target randomized controlled trials (RCTs) [17]. In the first step, we retrieved and embedded thematically related articles from Medline and then partitioned them by quartiles based on their semantic proximity to either the Fidan or Yang RCT references. In the second step, we prompted various LLMs—ranging from smaller open-source models to more advanced proprietary ones—to apply PICOT-based criteria in an attempt to identify the known RCTs within each quartile.
The three smaller or older LLMs (OpenHermes (7B), Flan T5, and GPT-2) performed poorly under standard prompts. For example, OpenHermes failed to identify any target article except in Q1, which had the lowest resemblance to the target references. Flan T5 performed only marginally better, finding one target article per dataset. GPT-2 displayed the opposite extreme by accepting all articles, thus also failing to discriminate relevant from irrelevant material. These outcomes align with reports arguing that Flan T5 and OpenHermes showed considerable variation in sensitivity and specificity across datasets, underscoring the challenge of maintaining precision with smaller LLMs [7].
Next, we tested these lower-end LLMs with a double-prompt strategy. In the first prompt, articles lacking essential interventions (Emdogain or bone graft) were removed, followed by a second, more detailed prompt applying the inclusion criteria. However, this two-stage approach did not improve results for OpenHermes. Flan T5 initially achieved perfect recall by retaining all relevant studies in the first stage but then underperformed in the second, highlighting the risk of “re-contextualization” or changes in classification behavior upon re-prompting (a phenomenon also noted by Delgado-Chaves et al., who stressed that refining the inclusion/exclusion criteria can greatly enhance LLM performance) [10].
GPT-2 did reject some irrelevant articles in the initial filter but remained heavily biased toward inclusion afterward. Although it avoided missing any truly relevant articles, it generated numerous false positives, making it viable only where minimizing study loss is the absolute priority or for rough, cost-efficient pre-screening. These findings echo Dennstädt et al., who noted similar trade-offs in sensitivity versus specificity among open-source LLMs, even when threshold adjustments or multi-step prompts were introduced [7].
In contrast, Claude 3 Haiku, GPT-3.5 Turbo, and GPT-4o demonstrated a markedly stronger performance on both the Fidan 2024 and Yang 2025 datasets. Claude 3 Haiku consistently achieved perfect recall in all four quartiles, implying that no relevant RCTs were missed. Its precision, however, tended to drop in Quartile 4, particularly on abstracts that closely resembled the targets. GPT-3.5 Turbo similarly showed near-perfect recall in Quartiles 1 through 3 for both Fidan and Yang but also registered more false positives in Quartile 4, causing the precision and F1 scores to decline. GPT-4o delivered the most balanced outcomes, rarely failing to identify a true positive while keeping false positives minimal, especially in the first three quartiles. Although it also experienced a modest rise in false positives in Quartile 4, the model’s overall accuracy and F1 metrics remained relatively high, and its performance was consistently robust regardless of whether prompts were verbose or concise. This outcome echoes prior work indicating that advanced LLMs, when supported by well-tailored prompts, can approach human-level performance in systematic screening [9,42].
These differences among LLMs and prompt strategies highlight the importance of careful model selection and prompt design when using automated approaches for systematic review screening. Smaller or earlier-generation models risk excluding relevant articles or accepting too many irrelevant ones, while multi-stage prompts sometimes fail to maintain logical consistency across steps. However, our data show that Flan T5 and GPT-2, despite suboptimal results with structured prompts, can still be valuable for broad initial screening when high recall is paramount. Meanwhile, advanced LLMs like Claude 3 Haiku, GPT-3.5 Turbo, and GPT-4o can succeed with a single-prompt scenario, especially if prompts emphasize recall without unduly inflating false positives. Delgado-Chaves et al. further noted that performance gains can be substantial—up to 90% workload reduction—when inclusion and exclusion criteria are well-defined, although results can differ across domains [10].
Despite these encouraging findings, limitations remain. Our reference datasets were relatively small and domain-specific, potentially restricting the generalizability of our results. Although GPT-3.5 Turbo and GPT-4o performed well without domain-specific tuning, some LLMs might behave differently on larger or more varied corpora. We also relied on abstracts alone, and poor abstract quality can skew results. Additionally, the use of paid APIs for GPT-3.5 Turbo and GPT-4o may not be practical for every research team handling very large datasets. Finally, LLM-based approaches, however advanced, still require human oversight to resolve ambiguities, handle incomplete data, and ensure screening accuracy. We provided a general step-by-step roadmap to integrate LLMs into the systematic review workflow (Supplementary Information File S1). In practical terms, organizations could embed target trials and new articles, cluster them by semantic similarity, and apply structured prompts with an LLM for the first pass of screening. Subsequently, domain experts could review and refine borderline or uncertain classifications, thus blending the efficiency of automated tools with human oversight.
Looking ahead, future research could address these challenges by testing LLMs across a broader range of biomedical fields, exploring domain-specific embeddings or specialized models and optimizing prompt construction to balance false positives and negatives. Although canonical ML approaches (e.g., support vector machines or random forests) can handle text classification with enough domain-specific feature engineering, we focused on LLM-based solutions for their flexibility in context handling and minimal feature-engineering requirements. Hybrid methods that combine LLM-based classification with conventional machine learning or active learning in existing systematic review platforms could yield a more integrated pipeline, leveraging the computational efficiency of LLMs while preserving the expert judgment of human reviewers. Notably, tools like Rayyan [43] and Research Screener [44] emphasize iterative feedback, learning from each inclusion/exclusion to refine predictions. Although these systems can achieve up to 96% time savings, their reliance on ongoing user corrections contrasts with our more direct classification approach, where prompt refinement is pivotal but not incremental. Other systems, such as DistillerSR or RobotAnalyst [45], use active learning to flag uncertain articles for human verification, suggesting that a similar loop might benefit smaller LLMs like GPT-2 or Flan T5. Ultimately, by matching prompting strategies to specific LLM capabilities—or supplementing weaker models with simpler filtering rules—researchers can substantially reduce screening workload without sacrificing thoroughness in evidence-based reviews.

5. Conclusions

Our findings demonstrate that more advanced LLMs (Claude 3 Haiku, GPT-3.5 Turbo, and GPT-4o) can effectively facilitate automated screening in a systematic review. By creating progressively challenging datasets, we observed that these higher-end models excelled at identifying nearly all relevant randomized controlled trials while simultaneously curbing false positives. In contrast, smaller or earlier-generation LLMs (OpenHermes, Flan T5, and GPT-2) struggled to maintain balanced performance, though some improvement was seen when applying a double-prompt strategy.
These results underscore the importance of both prompt design and model selection in optimizing recall and precision for systematic reviews. Notably, GPT-4o achieved near-ideal classification outcomes across all datasets. Nevertheless, certain limitations remain, including the domain-specific nature of our reference set, the quality of abstracts, and the need for human oversight. Future work could validate these methods on larger or more varied corpora, explore domain-tailored models, and integrate human reviewers in iterative, hybrid workflows. Such approaches may further bolster the reliability and efficiency of LLM-based systematic review screening, ultimately reducing the burden on researchers while preserving the rigor of evidence-based reviews.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/biomedinformatics5020025/s1. https://github.com/CarloGalliUnipr/Literature-screening-with-LLM.git, Literature_screening_for_MDPI.ipynb: sample code, analogous to that used in our manuscript, in Jupyter notebook format (accessed on 28 April 2025). Table S1: Descriptive statistics of reduced datasets for both reviews; Table S2: List of topics in the datasets used for model testing. The lists include a topic identifier (topic), the topic size (count: number of articles in the topic), a BERTopic-generated topic label obtained by joining the 4 top identified keywords (name), and an LLM-generated label based on that list of keywords (LLM). File S1: Roadmap detailing step-by-step instructions to integrate LLMs into a literature screening workflow.

Author Contributions

Conceptualization, C.G., M.M. and E.C.; methodology, C.G.; software, C.G.; formal analysis, C.G. and M.T.C.; data curation, S.G. and M.T.C.; writing—original draft preparation, C.G. and M.M.; writing—review and editing, S.G. and E.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1

Table A1. List of the target articles identified in the systematic review by Fidan et al., 2024.
PMID | Title | Journal | Year | Reference
11990444 | A Clinical Comparison of a Bovine-Derived Xenograft Used Alone and in Combination with Enamel Matrix Derivative for the Treatment of Periodontal Osseous Defects in Humans. | Journal of Periodontology | 2002 | [46]
12186348 | Clinical Evaluation of an Enamel Matrix Protein Derivative (Emdogain) Combined with a Bovine-Derived Xenograft (Bio-Oss) for the Treatment of Intrabony Periodontal Defects in Humans. | International Journal of Periodontics & Restorative Dentistry | 2002 | [47]
11990441 | Clinical Evaluation of an Enamel Matrix Protein Derivative Combined with a Bioactive Glass for the Treatment of Intrabony Periodontal Defects in Humans. | Journal of Periodontology | 2002 | [48]
19053917 | Clinical Evaluation of Demineralized Freeze-Dried Bone Allograft With and Without Enamel Matrix Derivative for the Treatment of Periodontal Osseous Defects in Humans. | Journal of Periodontology | 2008 | [49]
20054593 | Comparative Study of DFDBA in Combination with Enamel Matrix Derivative Versus DFDBA Alone for Treatment of Periodontal Intrabony Defects at 12 Months Post-Surgery. | Clinical Oral Investigations | 2011 | [50]
23484181 | Evaluation of the Effectiveness of Enamel Matrix Derivative, Bone Grafts, and Membrane in the Treatment of Mandibular Class II Furcation Defects. | International Journal of Periodontics & Restorative Dentistry | 2013 | [51]
23379539 | Hydroxyapatite/Beta-Tricalcium Phosphate and Enamel Matrix Derivative for Treatment of Proximal Class II Furcation Defects: A Randomized Clinical Trial. | Journal of Clinical Periodontology | 2013 | [52]
26556577 | Enamel Matrix Protein Derivative and/or Synthetic Bone Substitute for the Treatment of Mandibular Class II Buccal Furcation Defects. A 12-Month Randomized Clinical Trial. | Clinical Oral Investigations | 2016 | [53]
31811645 | Adjunctive Use of Enamel Matrix Derivatives to Porcine-Derived Xenograft for the Treatment of One-Wall Intrabony Defects: Two-Year Longitudinal Results of a Randomized Controlled Clinical Trial. | Journal of Periodontology | 2020 | [54]

Appendix A.2

Table A2. List of the target articles identified in the systematic review by Yang et al., 2025.
PMID | Title | Journal | Year | Reference
36369295 | Shape-Sensing Robotic-Assisted Bronchoscopy with Concurrent Use of Radial Endobronchial Ultrasound and Cone Beam Computed Tomography in the Evaluation of Pulmonary Lesions | Lung | 2022 | [55]
36006070 | Efficacy and Safety of Cone-Beam CT Augmented Electromagnetic Navigation Guided Bronchoscopic Biopsies of Indeterminate Pulmonary Nodules. | Tomography | 2022 | [56]
35920067 | Diagnostic Yield of Cone-Beam-Derived Augmented Fluoroscopy and Ultrathin Bronchoscopy Versus Conventional Navigational Bronchoscopy Techniques. | Journal of Bronchology & Interventional Pulmonology | 2023 | [57]
24665347 | Cone Beam Computertomography (CBCT) in Interventional Chest Medicine—High Feasibility for Endobronchial Realtime Navigation. | Journal of Cancer | 2014 | [58]
30746241 | Cone Beam Computed Tomography-Guided Thin/Ultrathin Bronchoscopy for Diagnosis of Peripheral Lung Nodules: A Prospective Pilot Study. | Journal of Thoracic Disease | 2018 | [59]
30179922 | Cone-Beam CT With Augmented Fluoroscopy Combined With Electromagnetic Navigation Bronchoscopy for Biopsy of Pulmonary Nodules. | Journal of Bronchology & Interventional Pulmonology | 2018 | [60]
30505506 | Biopsy of Peripheral Lung Nodules Utilizing Cone Beam Computer Tomography With and Without Trans Bronchial Access Tool: A Retrospective Analysis. | Journal of Thoracic Disease | 2018 | [61]
31121593 | Transbronchial Biopsy Using an Ultrathin Bronchoscope Guided by Cone-Beam Computed Tomography and Virtual Bronchoscopic Navigation in the Diagnosis of Pulmonary Nodules. | Respiration | 2019 | [62]
33547938 | Robotic-Assisted Navigation Bronchoscopy as a Paradigm Shift in Peripheral Lung Access. | Lung | 2021 | [63]
33615626 | Cone-Beam Computed Tomography Versus Computed Tomography-Guided Ultrathin Bronchoscopic Diagnosis for Peripheral Pulmonary Lesions: A Propensity Score-Matched Analysis. | Respirology | 2021 | [64]
35054208 | Cone-Beam Computed Tomography-Derived Augmented Fluoroscopy Improves the Diagnostic Yield of Endobronchial Ultrasound-Guided Transbronchial Biopsy for Peripheral Pulmonary Lesions. | Diagnostics | 2021 | [65]
33401270 | Cone-Beam Computed Tomography-Guided Electromagnetic Navigation for Peripheral Lung Nodules. | Respiration | 2021 | [66]
32649327 | Cone-Beam CT Image Guidance With and Without Electromagnetic Navigation Bronchoscopy for Biopsy of Peripheral Pulmonary Lesions. | Journal of Bronchology & Interventional Pulmonology | 2021 | [67]
34162799 | Cone-Beam CT and Augmented Fluoroscopy-Guided Navigation Bronchoscopy: Radiation Exposure and Diagnostic Accuracy Learning Curves. | Journal of Bronchology & Interventional Pulmonology | 2021 | [68]
33845482 | Efficacy and Safety of Cone-Beam Computed Tomography-Derived Augmented Fluoroscopy Combined with Endobronchial Ultrasound in Peripheral Pulmonary Lesions. | Respiration | 2021 | [69]

Appendix B

Performance Metrics

Accuracy is the overall proportion of correct classifications (TPs + TNs) across all articles and is calculated as follows:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Although easy to calculate, accuracy can sometimes be misleading if the data are imbalanced (for instance, many irrelevant articles). Precision focuses on how reliable the model’s “Accepted” decisions are by measuring the fraction of accepted articles that are indeed relevant (TPs/(TPs + FPs)). A high precision indicates the model seldom accepts irrelevant papers, which keeps the screening burden manageable. Precision is calculated as follows:
Precision = TP / (TP + FP)
By contrast, recall captures how many of the truly relevant articles are successfully found. In the context of systematic reviews, recall is especially important because missing relevant articles (false negatives) could compromise the comprehensiveness of the final analysis. A model with perfect recall would accept every relevant article, but if it also accepts many irrelevant ones, precision would suffer. Recall is, therefore, calculated as follows:
Recall = TP / (TP + FN)
The F1 score combines precision and recall (using the harmonic mean) into a single indicator and is a useful measure to balance the trade-off between retrieving as many relevant articles as possible and not letting too many irrelevant articles through. The F1 score is calculated as follows:
F1 score = 2 × (Precision × Recall) / (Precision + Recall)
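As a purely illustrative example with hypothetical counts (not taken from our results): if a 200-article quartile yielded TP = 9, FP = 3, FN = 0, and TN = 188, then accuracy = 197/200 = 98.5%, precision = 9/12 = 75.0%, recall = 9/9 = 100%, and F1 score = 2 × (0.750 × 1.000)/(0.750 + 1.000) ≈ 85.7%.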

References

  1. Mulrow, C.D. Systematic Reviews: Rationale for systematic reviews. BMJ 1994, 309, 597–599. [Google Scholar] [CrossRef]
  2. Betrán, A.P.; Say, L.; Gülmezoglu, A.M.; Allen, T.; Hampson, L. Effectiveness of different databases in identifying studies for systematic reviews: Experience from the WHO systematic review of maternal morbidity and mortality. BMC Med. Res. Methodol. 2005, 5, 6. [Google Scholar] [CrossRef] [PubMed]
  3. Grivell, L. Mining the bibliome: Searching for a needle in a haystack? EMBO Rep. 2002, 3, 200–203. [Google Scholar] [CrossRef] [PubMed]
  4. Landhuis, E. Scientific literature: Information overload. Nature 2016, 535, 457–458. [Google Scholar] [CrossRef] [PubMed]
  5. Scells, H.; Zuccon, G.; Koopman, B.; Deacon, A.; Azzopardi, L.; Geva, S. Integrating the Framing of Clinical Questions via PICO into the Retrieval of Medical Literature for Systematic Reviews. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, Singapore, 6–10 November 2017; ACM: New York, NY, USA, 2017; pp. 2291–2294. [Google Scholar]
  6. Anderson, N.K.; Jayaratne, Y.S.N. Methodological challenges when performing a systematic review. Eur. J. Orthod. 2015, 37, 248–250. [Google Scholar] [CrossRef]
  7. Dennstädt, F.; Zink, J.; Putora, P.M.; Hastings, J.; Cihoric, N. Title and abstract screening for literature reviews using large language models: An exploratory study in the biomedical domain. Syst. Rev. 2024, 13, 158. [Google Scholar] [CrossRef]
  8. Khraisha, Q.; Put, S.; Kappenberg, J.; Warraitch, A.; Hadfield, K. Can large language models replace humans in systematic reviews? Evaluating GPT-4’s efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages. Res. Synth. Methods 2024, 15, 616–626. [Google Scholar] [CrossRef]
  9. Dai, Z.-Y.; Wang, F.-Q.; Shen, C.; Ji, Y.-L.; Li, Z.-Y.; Wang, Y.; Pu, Q. Accuracy of Large Language Models for Literature Screening in Systematic Reviews and Meta-Analyses. J. Med. Internet Res. 2024, 27, e67488. [Google Scholar] [CrossRef]
  10. Delgado-Chaves, F.M.; Jennings, M.J.; Atalaia, A.; Wolff, J.; Horvath, R.; Mamdouh, Z.M.; Baumbach, J.; Baumbach, L. Transforming literature screening: The emerging role of large language models in systematic reviews. Proc. Natl. Acad. Sci. USA 2025, 122, e2411962122. [Google Scholar] [CrossRef]
  11. Scherbakov, D.; Hubig, N.; Jansari, V.; Bakumenko, A.; Lenert, L.A. The emergence of Large Language Models (LLM) as a tool in literature reviews: An LLM automated systematic review. arXiv 2024, arXiv:2409.04600. [Google Scholar]
  12. Elliott, J.H.; Synnot, A.; Turner, T.; Simmonds, M.; Akl, E.A.; McDonald, S.; Salanti, G.; Meerpohl, J.; MacLehose, H.; Hilton, J.; et al. Living systematic review: 1. Introduction—The why, what, when, and how. J. Clin. Epidemiol. 2017, 91, 23–30. [Google Scholar] [CrossRef]
  13. Ren, J.; Fok, M.R.; Zhang, Y.; Han, B.; Lin, Y. The role of non-steroidal anti-inflammatory drugs as adjuncts to periodontal treatment and in periodontal regeneration. J. Transl. Med. 2023, 21, 149. [Google Scholar] [CrossRef] [PubMed]
  14. Mijiritsky, E.; Assaf, H.D.; Peleg, O.; Shacham, M.; Cerroni, L.; Mangani, L. Use of PRP, PRF and CGF in Periodontal Regeneration and Facial Rejuvenation—A Narrative Review. Biology 2021, 10, 317. [Google Scholar] [CrossRef] [PubMed]
  15. Miron, R.J.; Moraschini, V.; Estrin, N.E.; Shibli, J.A.; Cosgarea, R.; Jepsen, K.; Jervøe-Storm, P.; Sculean, A.; Jepsen, S. Periodontal regeneration using platelet-rich fibrin. Furcation defects: A systematic review with meta-analysis. Periodontol 2000 2024, 97, 191–214. [Google Scholar] [CrossRef]
  16. Woo, H.N.; Cho, Y.J.; Tarafder, S.; Lee, C.H. The recent advances in scaffolds for integrated periodontal regeneration. Bioact. Mater. 2021, 6, 3328–3342. [Google Scholar] [CrossRef] [PubMed]
  17. Fidan, I.; Labreuche, J.; Huck, O.; Agossa, K. Combination of Enamel Matrix Derivatives with Bone Graft vs Bone Graft Alone in the Treatment of Periodontal Intrabony and Furcation Defects: A Systematic Review and Meta-Analysis. Oral Health Prev. Dent. 2024, 22, 655–664. [Google Scholar]
  18. Yang, H.; Huang, J.; Zhang, Y.; Guo, J.; Xie, S.; Zheng, Z.; Ma, Y.; Deng, Q.; Zhong, C.; Li, S. The diagnostic performance and optimal strategy of cone beam CT-assisted bronchoscopy for peripheral pulmonary lesions: A systematic review and meta-analysis. Pulmonology 2025, 31, 1–2420562. [Google Scholar] [CrossRef]
  19. Reimers, N.; Gurevych, I. Sentence-BERT: Sentence Embeddings Using Siamese BERT-Networks. arXiv 2019, arXiv:1908.10084. [Google Scholar]
  20. Stankevičius, L.; Lukoševičius, M. Extracting Sentence Embeddings from Pretrained Transformer Models. Appl. Sci. 2024, 14, 8887. [Google Scholar] [CrossRef]
  21. Li, B.; Han, L. Distance Weighted Cosine Similarity Measure for Text Classification. In Intelligent Data Engineering and Automated Learning—IDEAL 2013; Springer: Berlin/Heidelberg, Germany, 2013; pp. 611–618. [Google Scholar]
  22. Cichosz, P. Assessing the quality of classification models: Performance measures and evaluation procedures. Open Eng. 2011, 1, 132–158. [Google Scholar] [CrossRef]
  23. Cottam, J.A.; Heller, N.C.; Ebsch, C.L.; Deshmukh, R.; Mackey, P.; Chin, G. Evaluation of Alignment: Precision, Recall, Weighting and Limitations. In Proceedings of the 2020 IEEE International Conference on Big Data (Big Data), Atlanta, GA, USA, 10–13 December 2020; pp. 2513–2519. [Google Scholar]
  24. Abbade, L.P.F.; Wang, M.; Sriganesh, K.; Mbuagbaw, L.; Thabane, L. Framing of research question using the PICOT format in randomised controlled trials of venous ulcer disease: A protocol for a systematic survey of the literature. BMJ Open 2016, 6, e013175. [Google Scholar] [CrossRef]
  25. White, J. PubMed 2.0. Med. Ref. Serv. Q 2020, 39, 382–387. [Google Scholar] [CrossRef]
  26. Chapman, B.; Chang, J. Biopython: Python tools for computational biology. ACM Sigbio Newsl. 2000, 20, 15–19. [Google Scholar] [CrossRef]
  27. Cock, P.J.; Antao, T.; Chang, J.T.; Chapman, B.A.; Cox, C.J.; Dalke, A.; Friedberg, I.; Hamelryck, T.; Kauff, F.; Wilczynski, B.; et al. Biopython: Freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 2009, 25, 1422–1423. [Google Scholar] [CrossRef] [PubMed]
  28. Mckinney, W. Data Structures for Statistical Computing in Python. In Proceedings of the 9th Python in Science Conference, Austin, TX, USA, 28 June–3 July 2010; pp. 51–56. [Google Scholar]
  29. Grootendorst, M. BERTopic: Neural topic modeling with a class-based TF-IDF procedure. arXiv 2022, arXiv:2203.05794. [Google Scholar]
  30. McInnes, L.; Healy, J.; Melville, J. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv 2018, arXiv:1802.03426. [Google Scholar]
  31. McInnes, L.; Healy, J.; Astels, S. hdbscan: Hierarchical density based clustering. J. Open Source Softw. 2017, 2, 205. [Google Scholar]
  32. Thakkar, H.; Manimaran, A. Comprehensive Examination of Instruction-Based Language Models: A Comparative Analysis of Mistral-7B and Llama-2-7B. In Proceedings of the 2023 International Conference on Emerging Research in Computational Science (ICERCS), Coimbatore, India, 7–9 December 2023; pp. 1–6. [Google Scholar]
  33. Oza, J.; Yadav, H. Enhancing Question Prediction with Flan T5-A Context-Aware Language Model Approach. Authorea 2023. [Google Scholar]
  34. Patwardhan, I.; Gandhi, S.; Khare, O.; Joshi, A.; Sawant, S. A Comparative Analysis of Distributed Training Strategies for GPT-2. arXiv 2024, arXiv:2405.15628. [Google Scholar]
  35. Nguyen, C.; Carrion, D.; Badawy, M. Comparative Performance of Claude and GPT Models in Basic Radiological Imaging Tasks. medRxiv 2024. [Google Scholar] [CrossRef]
  36. Williams, C.Y.K.; Miao, B.Y.; Butte, A.J. Evaluating the use of GPT-3.5-turbo to provide clinical recommendations in the Emergency Department. medRxiv 2023. [Google Scholar] [CrossRef]
  37. Islam, R.; Moushi, O.M. Gpt-4o: The cutting-edge advancement in multimodal llm. Authorea 2024. [Google Scholar] [CrossRef]
  38. Cao, C.; Sang, J.; Arora, R.; Kloosterman, R.; Cecere, M.; Gorla, J.; Saleh, R.; Chen, D.; Drennan, I.; Teja, B.; et al. Development of prompt templates for large language model–driven screening in systematic reviews. Ann. Intern. Med. 2025, 178, 389–401. [Google Scholar] [CrossRef]
  39. Bisong, E. Google Colaboratory. In Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners; Bisong, E., Ed.; Apress: Berkeley, CA, USA, 2019; pp. 59–64. [Google Scholar]
  40. Sentence-Transformers/All-Mpnet-Base-v2. Available online: https://huggingface.co/sentence-transformers/all-mpnet-base-v2 (accessed on 10 February 2024).
  41. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Müller, A.; Nothman, J.; Louppe, G.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  42. Luo, R.; Sastimoglu, Z.; Faisal, A.I.; Deen, M.J. Evaluating the Efficacy of Large Language Models for Systematic Review and Meta-Analysis Screening. medRxiv 2024. [Google Scholar]
  43. Ouzzani, M.; Hammady, H.; Fedorowicz, Z.; Elmagarmid, A. Rayyan—A web and mobile app for systematic reviews. Syst. Rev. 2016, 5, 210. [Google Scholar] [CrossRef] [PubMed]
  44. Chai, K.E.K.; Lines, R.L.J.; Gucciardi, D.F.; Ng, L. Research Screener: A machine learning tool to semi-automate abstract screening for systematic reviews. Syst. Rev. 2021, 10, 93. [Google Scholar] [CrossRef]
  45. Khalil, H.; Ameen, D.; Zarnegar, A. Tools to support the automation of systematic reviews: A scoping review. J. Clin. Epidemiol. 2022, 144, 22–42. [Google Scholar] [CrossRef]
  46. Scheyer, E.T.; Velasquez-Plata, D.; Brunsvold, M.A.; Lasho, D.J.; Mellonig, J.T. A clinical comparison of a bovine-derived xenograft used alone and in combination with enamel matrix derivative for the treatment of periodontal osseous defects in humans. J. Periodontol. 2002, 73, 423–432. [Google Scholar] [CrossRef] [PubMed]
  47. Sculean, A.; Chiantella, G.C.; Windisch, P.; Gera, I.; Reich, E. Clinical evaluation of an enamel matrix protein derivative (Emdogain) combined with a bovine-derived xenograft (Bio-Oss) for the treatment of intrabony periodontal defects in humans. Int. J. Periodontics Restor. Dent. 2002, 22, 259–267. [Google Scholar]
  48. Sculean, A.; Barbé, G.; Chiantella, G.C.; Arweiler, N.B.; Berakdar, M.; Brecx, M. Clinical evaluation of an enamel matrix protein derivative combined with a bioactive glass for the treatment of intrabony periodontal defects in humans. J. Periodontol. 2002, 73, 401–408. [Google Scholar] [CrossRef]
  49. Hoidal, M.J.; Grimard, B.A.; Mills, M.P.; Schoolfield, J.D.; Mellonig, J.T.; Mealey, B.L. Clinical evaluation of demineralized freeze-dried bone allograft with and without enamel matrix derivative for the treatment of periodontal osseous defects in humans. J. Periodontol. 2008, 79, 2273–2280. [Google Scholar] [CrossRef]
  50. Aspriello, S.D.; Ferrante, L.; Rubini, C.; Piemontese, M. Comparative study of DFDBA in combination with enamel matrix derivative versus DFDBA alone for treatment of periodontal intrabony defects at 12 months post-surgery. Clin. Oral Investig. 2011, 15, 225–232. [Google Scholar] [CrossRef]
  51. Jaiswal, R.; Deo, V. Evaluation of the effectiveness of enamel matrix derivative, bone grafts, and membrane in the treatment of mandibular Class II furcation defects. Int. J. Periodontics Restor. Dent. 2013, 33, e58–e64. [Google Scholar] [CrossRef]
  52. Peres, M.F.S.; Ribeiro, É.D.P.; Casarin, R.C.V.; Ruiz, K.G.S.; Junior, F.H.N.; Sallum, E.A.; Casati, M.Z. Hydroxyapatite/β-tricalcium phosphate and enamel matrix derivative for treatment of proximal class II furcation defects: A randomized clinical trial. J. Clin. Periodontol. 2013, 40, 252–259. [Google Scholar] [CrossRef]
  53. Queiroz, L.A.; Santamaria, M.P.; Casati, M.Z.; Ruiz, K.S.; Nociti, F.; Sallum, A.W.; Sallum, E.A. Enamel matrix protein derivative and/or synthetic bone substitute for the treatment of mandibular class II buccal furcation defects. A 12-month randomized clinical trial. Clin. Oral Investig. 2016, 20, 1597–1606. [Google Scholar] [CrossRef] [PubMed]
  54. Lee, J.-H.; Kim, D.-H.; Jeong, S.-N. Adjunctive use of enamel matrix derivatives to porcine-derived xenograft for the treatment of one-wall intrabony defects: Two-year longitudinal results of a randomized controlled clinical trial. J. Periodontol. 2020, 91, 880–889. [Google Scholar] [CrossRef] [PubMed]
  55. Styrvoky, K.; Schwalk, A.; Pham, D.; Chiu, H.T.; Rudkovskaia, A.; Madsen, K.; Carrio, S.; Kurian, E.M.; Casas, L.D.L.; Abu-Hijleh, M. Shape-Sensing Robotic-Assisted Bronchoscopy with Concurrent use of Radial Endobronchial Ultrasound and Cone Beam Computed Tomography in the Evaluation of Pulmonary Lesions. Lung 2022, 200, 755–761. [Google Scholar] [CrossRef]
  56. Podder, S.; Chaudry, S.; Singh, H.; Jondall, E.M.; Kurman, J.S.; Benn, B.S. Efficacy and Safety of Cone-Beam CT Augmented Electromagnetic Navigation Guided Bronchoscopic Biopsies of Indeterminate Pulmonary Nodules. Tomography 2022, 8, 2049–2058. [Google Scholar] [CrossRef]
  57. DiBardino, D.M.; Kim, R.Y.; Cao, Y.B.; Andronov, M.; Lanfranco, A.R.; Haas, A.R.; Vachani, A.; Ma, K.C.; Hutchinson, C.T. Diagnostic Yield of Cone-beam–Derived Augmented Fluoroscopy and Ultrathin Bronchoscopy Versus Conventional Navigational Bronchoscopy Techniques. J. Bronchol. Interv. Pulmonol. 2022, 30, 335–345. [Google Scholar] [CrossRef]
  58. Hohenforst-Schmidt, W.; Zarogoulidis, P.; Vogl, T.; Turner, J.F.; Browning, R.; Linsmeier, B.; Huang, H.; Li, Q.; Darwiche, K.; Freitag, L.; Simoff, M.; Kioumis, I.; Zarogoulidis, K.; Brachmann, J. Cone Beam Computertomography (CBCT) in Interventional Chest Medicine—High Feasibility for Endobronchial Realtime Navigation. J. Cancer 2014, 5, 231–241. [Google Scholar] [CrossRef]
  59. Casal, R.F.; Sarkiss, M.; Jones, A.K.; Stewart, J.; Tam, A.; Grosu, H.B.; Ost, D.E.; Jimenez, C.A.; Eapen, G.A. Cone beam computed tomography-guided thin/ultrathin bronchoscopy for diagnosis of peripheral lung nodules: A prospective pilot study. J. Thorac. Dis. 2018, 10, 6950–6959. [Google Scholar] [CrossRef] [PubMed]
  60. Pritchett, M.A.; Schampaert, S.; de Groot, J.A.H.; Schirmer, C.C.; van der Bom, I. Cone-Beam CT With Augmented Fluoroscopy Combined With Electromagnetic Navigation Bronchoscopy for Biopsy of Pulmonary Nodules. J. Bronchol. Interv. Pulmonol. 2018, 25, 274–282. [Google Scholar] [CrossRef] [PubMed]
  61. Sobieszczyk, M.J.; Yuan, Z.; Li, W.; Krimsky, W. Biopsy of peripheral lung nodules utilizing cone beam computer tomography with and without trans bronchial access tool: A retrospective analysis. J. Thorac. Dis. 2018, 10, 5953–5959. [Google Scholar] [CrossRef]
  62. Ali, E.A.A.; Takizawa, H.; Kawakita, N.; Sawada, T.; Tsuboi, M.; Toba, H.; Takashima, M.; Matsumoto, D.; Yoshida, M.; Kawakami, Y.; et al. Transbronchial Biopsy Using an Ultrathin Bronchoscope Guided by Cone-Beam Computed Tomography and Virtual Bronchoscopic Navigation in the Diagnosis of Pulmonary Nodules. Respiration 2019, 98, 321–328. [Google Scholar] [CrossRef]
  63. Benn, B.S.; Romero, A.O.; Lum, M.; Krishna, G. Robotic-Assisted Navigation Bronchoscopy as a Paradigm Shift in Peripheral Lung Access. Lung 2021, 199, 177–186. [Google Scholar] [CrossRef] [PubMed]
  64. Kawakita, N.; Takizawa, H.; Toba, H.; Sakamoto, S.; Miyamoto, N.; Matsumoto, D.; Takashima, M.; Tsuboi, M.; Yoshida, M.; Kawakami, Y.; et al. Cone-beam computed tomography versus computed tomography-guided ultrathin bronchoscopic diagnosis for peripheral pulmonary lesions: A propensity score-matched analysis. Respirology 2021, 26, 477–484. [Google Scholar] [CrossRef] [PubMed]
  65. Lin, C.-K.; Fan, H.-J.; Yao, Z.-H.; Lin, Y.-T.; Wen, Y.-F.; Wu, S.-G.; Ho, C.-C. Cone-Beam Computed Tomography-Derived Augmented Fluoroscopy Improves the Diagnostic Yield of Endobronchial Ultrasound-Guided Transbronchial Biopsy for Peripheral Pulmonary Lesions. Diagnostics 2021, 12, 41. [Google Scholar] [CrossRef]
  66. Kheir, F.; Thakore, S.R.; Uribe Becerra, J.P.; Tahboub, M.; Kamat, R.; Abdelghani, R.; Fernandez-Bussy, S.; Kaphle, U.R.; Majid, A. Cone-Beam Computed Tomography-Guided Electromagnetic Navigation for Peripheral Lung Nodules. Respiration 2021, 100, 44–51. [Google Scholar] [CrossRef] [PubMed]
  67. Verhoeven, R.L.J.; Fütterer, J.J.; Hoefsloot, W.; Van Der Heijden, E.H.F.M. Cone-Beam CT Image Guidance With and Without Electromagnetic Navigation Bronchoscopy for Biopsy of Peripheral Pulmonary Lesions. J. Bronchol. Interv. Pulmonol. 2021, 28, 60–69. [Google Scholar] [CrossRef]
  68. Verhoeven, R.L.J.; van der Sterren, W.M.; Kong, W.M.; Langereis, S.; van der Tol, P.M.; van der Heijden, E.H. Cone-Beam CT and Augmented Fluoroscopy-Guided Navigation Bronchoscopy: Radiation Exposure and Diagnostic Accuracy Learning Curves. J. Bronchol. Interv. Pulmonol. 2021, 28, 262–271. [Google Scholar] [CrossRef]
  69. Yu, K.-L.; Yang, S.-M.; Ko, H.-J.; Tsai, H.-Y.; Ko, J.-C.; Lin, C.-K.; Ho, C.-C.; Shih, J.-Y. Efficacy and Safety of Cone-Beam Computed Tomography-Derived Augmented Fluoroscopy Combined with Endobronchial Ultrasound in Peripheral Pulmonary Lesions. Respiration 2021, 100, 538–546. [Google Scholar] [CrossRef]
Figure 1. Flowchart illustrating the methodological pipeline for evaluating large language models (LLMs) in automated article screening for systematic reviews.
Figure 2. Box plot of the distribution of cosine similarity scores by quartile for the (A) Fidan et al., 2024 and (B) Yang et al., 2025 reviews. Quartiles (Q1–Q4) were generated by ranking articles based on their cosine similarity to the mean embedding vector of the nine target articles. Q1 contains articles with the lowest similarity, while Q4 contains those with the highest similarity. The box plots display the median, interquartile range, and outliers for each quartile.
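As a complement to Figure 2, the following minimal sketch illustrates how such a quartile stratification can be obtained with the all-mpnet-base-v2 sentence embedder [40]: candidate abstracts are scored by cosine similarity to the mean embedding of the target articles and then binned into quartiles. The example abstracts, variable names, and the use of pandas.qcut are illustrative assumptions, not the study's exact implementation.

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

# Illustrative abstracts; in practice these would be the target RCTs and the PubMed retrieval set.
target_abstracts = [
    "Randomized controlled trial of enamel matrix derivative combined with a bone graft in intrabony defects",
    "RCT comparing enamel matrix derivative plus xenograft versus graft alone in furcation defects",
]
candidate_abstracts = [
    "Randomized trial of enamel matrix derivative and synthetic bone substitute in furcation defects",
    "Narrative review of platelet-rich fibrin in periodontal regeneration",
    "Case report of cone-beam CT guided bronchoscopy for a peripheral nodule",
    "Prospective cohort study of implant survival after sinus augmentation",
]

# Mean embedding of the target articles, then cosine similarity of every candidate to it.
target_centroid = model.encode(target_abstracts).mean(axis=0, keepdims=True)
candidate_vectors = model.encode(candidate_abstracts)
scores = cosine_similarity(candidate_vectors, target_centroid).ravel()

# Rank candidates and split them into quartiles: Q1 = lowest similarity, Q4 = highest.
quartiles = pd.qcut(scores, q=4, labels=["Q1", "Q2", "Q3", "Q4"])
ranked = pd.DataFrame({"abstract": candidate_abstracts, "similarity": scores, "quartile": quartiles})
print(ranked.sort_values("similarity", ascending=False))

Q1 then holds the least similar articles and Q4 the most similar, matching the ordering used in the box plots.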
Figure 3. Performance metrics (accuracy, precision, recall, and F1 score) for the OpenHermes model across different quartiles (Q1–Q4). (A) Results for the single-prompt configuration. (B) Results for the initial filtering step of the double-prompt configuration. (C) Results for the second step of the double-prompt configuration. Accuracy is represented in blue, precision in orange, recall in green, and F1 score in red.
Figure 4. Performance metrics (accuracy, precision, recall, and F1 score) for the Flan T5 model across different quartiles (Q1–Q4). (A) Results for the single-prompt configuration. (B) Results for the first step of the double-prompt configuration. (C) Results for the second step of the double-prompt configuration. Accuracy is represented in blue, precision in orange, recall in green, and F1 score in red.
Figure 5. Performance metrics (accuracy, precision, recall, and F1 score) for the GPT-2 model across different quartiles (Q1–Q4). (A) Results for the single-prompt configuration. (B) Results for the first step of the double-prompt configuration. (C) Results for the second step of the double-prompt configuration. Accuracy is represented in blue, precision in orange, recall in green, and F1 score in red.
Figure 6. Performance metrics (accuracy, precision, recall, and F1 score) for the Claude 3 Haiku model evaluated across quartiles (Q1–Q4) of (A,B) Fidan et al., 2024 and (C,D) Yang et al., 2025 reviews, using two different prompt styles: (A,C) verbose prompt and (B,D) concise prompt. Accuracy is represented in blue, precision in orange, recall in green, and F1 score in red.
Figure 7. Performance metrics (accuracy, precision, recall, and F1 score) for the GPT-3.5 Turbo model evaluated across quartiles (Q1–Q4) of (A,B) Fidan et al., 2024 and (C,D) Yang et al., 2025 reviews, using two different prompt styles: (A,C) verbose prompt and (B,D) concise prompt. Accuracy is represented in blue, precision in orange, recall in green, and F1 score in red.
Figure 8. Performance metrics (accuracy, precision, recall, and F1 score) for the GPT-4o model evaluated across quartiles (Q1–Q4) of (A,B) Fidan et al., 2024 and (C,D) Yang et al., 2025 reviews, using two different prompt styles: (A,C) verbose prompt and (B,D) concise prompt. Accuracy is represented in blue, precision in orange, recall in green, and F1 score in red.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
