Large Language Models in Systematic Review Screening: Opportunities, Challenges, and Methodological Considerations
Abstract
1. Introduction
2. LLM-Based Screening: Rationale and Key Considerations
2.1. Foundational Concepts and Terminology
2.2. Technical and Logistical Considerations
2.3. Prerequisites for Literature Screening
- Population: Adults aged 18–75 with primary hypertension.
- Intervention: Daily oral administration of Drug A.
- Comparison: Placebo.
- Outcome: Mean change in systolic blood pressure at 12 weeks.
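A PICO frame like the one above can be kept as structured data so that each criterion appears explicitly, and identically, in every screening prompt. A minimal sketch (the variable and function names are illustrative, not from the article):

```python
# Encode the example PICO frame as structured data so each criterion
# can be referenced explicitly when assembling a screening prompt.
PICO = {
    "Population": "Adults aged 18-75 with primary hypertension",
    "Intervention": "Daily oral administration of Drug A",
    "Comparison": "Placebo",
    "Outcome": "Mean change in systolic blood pressure at 12 weeks",
}

def pico_block(criteria: dict) -> str:
    """Render the criteria as the bulleted block used inside a prompt."""
    return "\n".join(f"- {key}: {value}." for key, value in criteria.items())

print(pico_block(PICO))
```

Keeping the criteria in one place means the prompt, the protocol, and the reported methods cannot silently drift apart.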
3. Implementation Considerations for LLM-Based Screening
3.1. Data Acquisition and Preparation
3.2. LLM Selection and Prompt Design
3.3. Fundamentals and Challenges of Prompt Design
3.4. Model Deployment
3.5. Current Recommendations and Trade-Offs
4. Ethical, Practical, and Methodological Implications
5. Proposed Standardized Guidelines for LLM-Assisted Screening
5.1. Planning and Governance
5.2. Data Preparation
5.3. Model Selection and Disclosure
5.4. Prompt Engineering
5.5. Screening Procedure
5.6. Quality Control and Bias Monitoring
5.7. Documentation and Reproducibility
5.8. Ethical and Legal Compliance
5.9. Reporting in the Manuscript
6. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A
Appendix A.1
“You are assisting in a systematic review on periodontal regeneration comparing Emdogain (EMD) + bone graft (BG) versus BG alone. Your task is to decide whether the following article should be ACCEPTED or REJECTED based on the following “soft approach” criteria:

**Inclusion Criteria**:
1. **Population (P)**: Adult periodontitis patients (≥18 years old) with at least one intrabony or furcation defect.
2. **Intervention (I)**: Regenerative surgical procedures involving EMD combined with any type of bone graft material (EMD+BG).
3. **Comparison (C)**: Regenerative surgical procedures involving BG alone.
4. **Outcomes (O)**:
   - Primary: CAL (Clinical Attachment Level) gain, PD (Probing Depth) reduction.
   - Secondary: Pocket closure, wound healing, gingival recession, tooth loss, patient-reported outcome measures (PROMs), adverse events.
5. **Study Design**:
   - Randomized controlled trial (RCT), parallel or split-mouth design.
   - ≥10 patients per arm.
   - ≥6 months follow-up.

**Decision Approach**:
- If **at least one** of the above criteria is explicitly met or strongly implied, **AND** none of the criteria are explicitly contradicted, then **ACCEPT**.
- If **any** criterion is clearly violated (e.g., population is exclusively children, follow-up is 3 months, or design is not an RCT), then **REJECT**.
- If **no** criterion is clearly met, **REJECT**.

Below is the article’s title and abstract. Decide if it should be ACCEPTED or REJECTED according to the “soft approach” described.

Title: {title}
Abstract: {abstract}

**If the article is acceptable, respond with exactly:**
ACCEPT
**Otherwise, respond with exactly:**
REJECT”
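The {title} and {abstract} placeholders in the template above are filled per record before each model call. A minimal sketch of that substitution step, assuming the template is held as a Python string (the shortened template text and all names here are illustrative):

```python
# Shortened stand-in for the Appendix A.1 template; the full prompt text
# would be stored the same way, with {title} and {abstract} placeholders.
SOFT_PROMPT = (
    "You are assisting in a systematic review...\n"
    "Title: {title}\n"
    "Abstract: {abstract}\n"
    "Respond with exactly ACCEPT or REJECT."
)

def build_prompt(template: str, title: str, abstract: str) -> str:
    """Substitute one record's title and abstract into the template."""
    return template.format(title=title, abstract=abstract)

prompt = build_prompt(
    SOFT_PROMPT,
    title="Enamel matrix derivative plus bone graft in intrabony defects: an RCT",
    abstract="Sixty adults with periodontitis were randomized...",
)
print(prompt)
```

One caveat with this approach: `str.format` treats every brace pair as a placeholder, so any literal braces in the template text would need to be doubled (`{{` and `}}`).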
Appendix A.2
“You are an expert periodontology assistant. You are assisting in a systematic review on periodontal regeneration comparing Emdogain (EMD) + bone graft (BG) versus bone graft alone. Evaluate this article step by step:

1. Population: If the text states adult patients with intrabony/furcation defects, or is silent about age/defect type, it’s not violated.
2. Intervention: If Emdogain + bone graft is mentioned or strongly implied, we consider this met.
3. Comparison: If a group uses bone graft alone, or there’s at least a control lacking Emdogain, consider it met.
4. Outcomes: If they mention CAL gain or PD reduction, or are silent, do not penalize. Only reject if they clearly never measure any clinical outcomes.
5. Study design: If they claim RCT or strongly imply it, accept. If they mention a different design (case series, pilot with fewer than 10 patients, or <6-month follow-up), reject.

If at least one criterion is explicitly met and none are clearly violated, answer ACCEPT. Otherwise, REJECT. If you are unsure, default to ACCEPT unless a contradiction is stated.

Article Title: {title}
Abstract: {abstract}

Respond with ONLY ‘ACCEPT’ or ‘REJECT’ accordingly.”
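Both templates instruct the model to answer with a single token, but models do not always comply, so screening pipelines typically normalize the raw reply before tallying decisions. A minimal sketch of such a post-processing step (the function name and the fallback label are illustrative, not from the article):

```python
def parse_decision(response: str) -> str:
    """Map a raw model reply to a screening label.

    Models do not always obey 'respond with ONLY ACCEPT or REJECT',
    so anything unrecognized is routed to a human reviewer rather
    than being guessed at.
    """
    text = response.strip().upper()
    if text.startswith("ACCEPT"):
        return "ACCEPT"
    if text.startswith("REJECT"):
        return "REJECT"
    return "FLAG_FOR_HUMAN"

print(parse_decision("Accept"))                      # casing normalized
print(parse_decision("REJECT - not an RCT"))         # trailing rationale tolerated
print(parse_decision("The study seems relevant."))   # ambiguous reply escalated
```

Defaulting ambiguous replies to human review, rather than to ACCEPT or REJECT, keeps the automation's failure mode conservative.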
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Galli, C.; Gavrilova, A.V.; Calciolari, E. Large Language Models in Systematic Review Screening: Opportunities, Challenges, and Methodological Considerations. Information 2025, 16, 378. https://doi.org/10.3390/info16050378