Validating Large Language Models for Title-Abstract Screening in Low-Prevalence Systematic Reviews: An Environmental Science Case Study
Abstract
1. Introduction
2. Materials and Methods
2.1. Large Language Models (LLMs)
2.2. Prompts
{
  "verdict": "<your verdict here, either 'include' or 'exclude'>",
  "explanation": "<detailed explanation to justify your verdict here>",
  "confidence": "<confidence level of your decision here>"
}
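A reply in this format can be checked mechanically before it enters the analysis. The following is a minimal sketch, not the authors' pipeline: the function name `parse_screening_reply` is hypothetical, and the tolerance for text surrounding the JSON object is an assumption about how hosted models sometimes respond.

```python
import json

ALLOWED_VERDICTS = {"include", "exclude"}
REQUIRED_KEYS = {"verdict", "explanation", "confidence"}


def parse_screening_reply(reply: str) -> dict:
    """Parse one model reply and check it against the requested JSON format."""
    # Assumption: a reply may wrap the JSON object in extra prose or a code
    # fence, so only the outermost {...} span is parsed.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    record = json.loads(reply[start:end + 1])

    missing = REQUIRED_KEYS - set(record)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")

    # Normalise case and whitespace, then enforce the two permitted verdicts.
    verdict = str(record["verdict"]).strip().lower()
    if verdict not in ALLOWED_VERDICTS:
        raise ValueError(f"unexpected verdict: {record['verdict']!r}")
    record["verdict"] = verdict
    return record
```

Rejecting malformed replies explicitly, rather than coercing them to "exclude", keeps parsing failures from being counted as screening decisions.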
2.3. Statistical Analysis
3. Results
3.1. Classification Performance Metrics
3.2. Misclassifications
3.3. Inter-Rater Reliability
4. Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| LLM | Large Language Model |
Appendix A
Appendix A.1. Search Strategy for Nawrath et al. (2021) [25]
- Research questions
- The research questions were:
- (1) Do greenspaces promote good mental health of urban residents in LMICs?
- (2) What are the geographic characteristics of the evidence from LMICs?
- (3) Which contextual factors mediate and moderate how greenspaces and mental health are associated in LMICs?
- (4) How were greenspaces assessed and which mental health outcomes were studied in LMICs?
- Definitions
- Search strategy
| Greenspaces | Mental Health & Well-Being | Study Location |
|---|---|---|
| Greenspace, blue space, open space, urban park, urban forest, urban tree, urban ecosystem, urban green, urban blue, urban agriculture, natural environment, biodiversity, species richness, nature reserve, wilderness environment, spontaneous vegetation | Mental health, mental well-being, well-being, mental, psychiatric, psychologic, depression, MDD, anxiety, phobia, agoraphobia, dysthymia, ADNOS, schizophrenia, hebephrenia, oligophrenia, akathisia, neuroleptic-induced deficit syndrome, tardive dyskinesia, movement disorders, somatoform, somatisation, hysteria, briquet, multisomatic, MUPs, medically unexplained, dissociative disorders, dissociative reactions, dissociation, affective disorders, PTSD, psychological trauma, combat disorders, stress disorders, cognitive disorders, personality disorders, impulse control disorders, mood disorders, paranoid disorders, psychotic disorders, neurological disorders, nervous disorders, nervous system disorders, eating disorders, bipolar disorders, behavioural disorders, obsessive disorders, compulsive disorders, panic disorders, mood disorders, delusional disorders, trichotillomania, OCD, GAD, stress reaction, acute stress, neurosis, stress syndrome, pain disorder, dementia, Alzheimer, epilepsy, substance abuse disorders, personality disorders, sleep disorders | LMICs: Using the Development Assistance Committee country classification list [65]; Urban: Urban, city, town |
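A three-column term table of this kind is conventionally turned into a Boolean query by joining terms within a column with OR and joining the columns with AND. The sketch below illustrates that convention only: the combination logic is an assumption (it is not stated in the table), `build_query` is a hypothetical helper, the term lists are abbreviated, and real databases differ in field tags, truncation and phrase syntax.

```python
def build_query(concepts: dict[str, list[str]]) -> str:
    """Join terms with OR inside each concept and AND across concepts."""
    groups = []
    for terms in concepts.values():
        # Quote multi-word terms so they are searched as phrases.
        quoted = [f'"{term}"' if " " in term else term for term in terms]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)


# Abbreviated term lists taken from the three columns of the table.
concepts = {
    "greenspace": ["greenspace", "urban park"],
    "mental_health": ["mental health", "depression"],
    "urban": ["urban", "city", "town"],
}
print(build_query(concepts))
```

The LMIC filter, which relies on a country list rather than free-text terms, would be added as a further AND group built from that list.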
- Data screening
- Data extraction
- Authors, title, year of publication;
- Objectives;
- Study population;
- Methods, study design;
- Health outcome and measures;
- Measure of greenspace;
- General results.
Appendix A.2. Prompts
{
  "verdict": "<your verdict here, either 'include' or 'exclude'>",
  "explanation": "<detailed explanation to justify your verdict here>",
  "confidence": "<confidence level of your decision here>"
}
Appendix A.3. Formulas for Classification Performance Metrics
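For reference, the standard confusion-matrix definitions of the reported metrics are sketched below, with $TP$, $TN$, $FP$ and $FN$ denoting true positives, true negatives, false positives and false negatives. They are assumed, rather than confirmed by this appendix, to be the definitions used; they do reproduce the tabulated point estimates.

```latex
\begin{align}
\text{Accuracy}    &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Sensitivity} &= \frac{TP}{TP + FN} \\
\text{Specificity} &= \frac{TN}{TN + FP} \\
\text{PPV}         &= \frac{TP}{TP + FP} \\
\text{NPV}         &= \frac{TN}{TN + FN} \\
LR^{+}             &= \frac{\text{Sensitivity}}{1 - \text{Specificity}} \\
LR^{-}             &= \frac{1 - \text{Sensitivity}}{\text{Specificity}}
\end{align}
```

As a worked check with the GPT-4.1 counts ($TP = 7$, $TN = 431$, $FP = 61$, $FN = 1$): sensitivity is $7/8 = 0.875$, specificity is $431/492 \approx 0.876$, and $LR^{+} = 0.875 / (61/492) \approx 7.057$.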
Appendix A.4. Regression Analysis
| Model | Intercept β0 (SE) | Slope β1 (SE) | Test Statistic | Odds Ratio for β1 (95% CI) | Fit Criterion |
|---|---|---|---|---|---|
| GPT-4.1 | −1.9552 (0.1368) *** | 3.9011 (1.0778) *** | z = 3.62 | 49.46 (5.98–408.95) | AIC = 378.82 |
| Claude 3.5 Sonnet | −1.9552 (0.1368) *** | 3.9011 (1.0778) *** | z = 3.62 | 49.46 (5.98–408.95) | AIC = 378.82 |
| Gemini 2.0 Flash (Firth) | −1.7906 (0.1287) *** | 4.6238 (1.4609) *** | χ2 = 28.53 | 101.88 (12.47–13227.13) | — |
| Mistral Large (Firth) | −0.8319 (0.0980) *** | 3.6651 (1.4585) *** | χ2 = 16.47 | 39.06 (4.82–5062.47) | — |
| DeepSeek V3 | −3.8754 (0.3195) *** | 3.3645 (0.7971) *** | z = 4.22 | 28.92 (6.06–137.94) | AIC = 112.30 |
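The test statistic and odds ratio columns of the non-Firth rows follow from the slope and its standard error by the usual Wald relations: z = β1/SE, OR = exp(β1), and 95% CI = exp(β1 ± 1.96·SE). The sketch below applies these to the tabulated values; `wald_summary` is a hypothetical helper, and the use of a critical value of 1.96 is an assumption. The Firth rows report χ2 statistics and strongly asymmetric intervals, which presumably come from penalized profile likelihood and are therefore not reproduced by this calculation.

```python
import math


def wald_summary(beta: float, se: float, z_crit: float = 1.96) -> dict:
    """Wald z statistic, odds ratio and 95% CI for one logistic slope."""
    return {
        "z": beta / se,
        "odds_ratio": math.exp(beta),
        "ci_low": math.exp(beta - z_crit * se),
        "ci_high": math.exp(beta + z_crit * se),
    }


# Slope and standard error from the GPT-4.1 row of the table.
gpt = wald_summary(beta=3.9011, se=1.0778)
print({name: round(value, 2) for name, value in gpt.items()})
```

Because the inputs are already rounded to four decimals, the recomputed bounds agree with the table only to about two decimal places.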
References
- Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 Statement: An Updated Guideline for Reporting Systematic Reviews. BMJ 2021, 372, n71.
- Bornmann, L.; Mutz, R. Growth Rates of Modern Science: A Bibliometric Analysis Based on the Number of Publications and Cited References. J. Assoc. Inf. Sci. Technol. 2015, 66, 2215–2222.
- Borah, R.; Brown, A.W.; Capers, P.L.; Kaiser, K.A. Analysis of the Time and Workers Needed to Conduct Systematic Reviews of Medical Interventions Using Data from the PROSPERO Registry. BMJ Open 2017, 7, e012545.
- Shojania, K.; Sampson, M.; Ansari, M.; Ji, J.; Doucette, S.; Moher, D. How Quickly Do Systematic Reviews Go out of Date? A Survival Analysis. Ann. Intern. Med. 2007, 147, 224–233.
- Haddaway, N.R.; Pullin, A.S. The Policy Role of Systematic Reviews: Past, Present and Future. Springer Sci. Rev. 2014, 2, 179–183.
- Westgate, M.J.; Haddaway, N.R.; Cheng, S.H.; McIntosh, E.J.; Marshall, C.; Lindenmayer, D.B. Software Support for Environmental Evidence Synthesis. Nat. Ecol. Evol. 2018, 2, 588–590.
- Luo, X.; Chen, F.; Zhu, D.; Wang, L.; Wang, Z.; Liu, H.; Lyu, M.; Wang, Y.; Wang, Q.; Chen, Y. Potential Roles of Large Language Models in the Production of Systematic Reviews and Meta-Analyses. J. Med. Internet Res. 2024, 26, e56780.
- Naveed, H.; Khan, A.U.; Qiu, S.; Saqib, M.; Anwar, S.; Usman, M.; Akhtar, N.; Barnes, N.; Mian, A. A Comprehensive Overview of Large Language Models. ACM Trans. Intell. Syst. Technol. 2024, 16, 1–72.
- Minaee, S.; Mikolov, T.; Nikzad, N.; Chenaghlu, M.; Socher, R.; Amatriain, X.; Gao, J. Large Language Models: A Survey. arXiv 2025, arXiv:2402.06196.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
- Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; et al. A Survey on Evaluation of Large Language Models. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–45.
- Salah, M.; Abdelfattah, F.; Alhalbusi, H. AI vs. Humans: The Future of Academic Review in Public Administration. Res. Sq. 2023; preprint.
- Fabiano, N.; Gupta, A.; Bhambra, N.; Luu, B.; Wong, S.; Maaz, M.; Fiedorowicz, J.G.; Smith, A.L.; Solmi, M. How to Optimize the Systematic Review Process Using AI Tools. JCPP Adv. 2024, 4, e12234.
- López-Pineda, A.; Nouni-García, R.; Carbonell-Soliva, Á.; Gil-Guillén, V.F.; Carratalá-Munuera, C.; Borrás, F. Validation of Large Language Models (Llama 3 and ChatGPT-4o Mini) for Title and Abstract Screening in Biomedical Systematic Reviews. Res. Synth. Methods 2025, 16, 620–630.
- Marshall, I.J.; Wallace, B.C. Toward Systematic Review Automation: A Practical Guide to Using Machine Learning Tools in Research Synthesis. Syst. Rev. 2019, 8, 163.
- Pedreschi, D.; Giannotti, F.; Guidotti, R.; Monreale, A.; Ruggieri, S.; Turini, F. Meaningful Explanations of Black Box AI Decision Systems. Proc. AAAI Conf. Artif. Intell. 2019, 33, 9780–9784.
- Staudinger, M.; Kusa, W.; Piroi, F.; Lipani, A.; Hanbury, A. A Reproducibility and Generalizability Study of Large Language Models for Query Generation. In Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region (SIGIR-AP 2024); Association for Computing Machinery, Inc.: New York, NY, USA, 2024; pp. 186–196.
- O’Mara-Eves, A.; Thomas, J.; McNaught, J.; Miwa, M.; Ananiadou, S. Using Text Mining for Study Identification in Systematic Reviews: A Systematic Review of Current Approaches. Syst. Rev. 2015, 4, 5.
- Adel, A.; Alani, N. Can Generative AI Reliably Synthesise Literature? Exploring Hallucination Issues in ChatGPT. AI Soc. 2025, 40, 6799–6812.
- Sciurti, A.; Migliara, G.; Siena, L.M.; Isonne, C.; De Blasiis, M.R.; Sinopoli, A.; Iera, J.; Marzuillo, C.; De Vito, C.; Villari, P.; et al. Compact Large Language Models for Title and Abstract Screening in Systematic Reviews: An Assessment of Feasibility, Accuracy, and Workload Reduction. Res. Synth. Methods 2026, 17, 332–347.
- Nykvist, B.; Macura, B.; Xylia, M.; Olsson, E. Testing the Utility of GPT for Title and Abstract Screening in Environmental Systematic Evidence Synthesis. Environ. Evid. 2025, 14, 7.
- Trad, F.; Yammine, R.; Charafeddine, J.; Chakhtoura, M.; Rahme, M.; El-Hajj Fuleihan, G.; Chehab, A. Streamlining Systematic Reviews with Large Language Models Using Prompt Engineering and Retrieval Augmented Generation. BMC Med. Res. Methodol. 2025, 25, 130.
- Galli, C.; Gavrilova, A.V.; Calciolari, E. Large Language Models in Systematic Review Screening: Opportunities, Challenges, and Methodological Considerations. Information 2025, 16, 378.
- Van Dijk, S.H.B.; Brusse-Keizer, M.G.J.; Bucsán, C.C.; Van Der Palen, J.; Doggen, C.J.M.; Lenferink, A. Artificial Intelligence in Systematic Reviews: Promising When Appropriately Used. BMJ Open 2023, 13, e072254.
- Nawrath, M.; Guenat, S.; Elsey, H.; Dallimer, M. Exploring Uncharted Territory: Do Urban Greenspaces Support Mental Health in Low- and Middle-Income Countries? Environ. Res. 2021, 194, 110625.
- Arksey, H.; O’Malley, L. Scoping Studies: Towards a Methodological Framework. Int. J. Soc. Res. Methodol. 2005, 8, 19–32.
- Tricco, A.C.; Lillie, E.; Zarin, W.; O’Brien, K.K.; Colquhoun, H.; Levac, D.; Moher, D.; Peters, M.D.J.; Horsley, T.; Weeks, L.; et al. PRISMA Extension for Scoping Reviews (PRISMA-ScR): Checklist and Explanation. Ann. Intern. Med. 2018, 169, 467.
- NIVA LLM Benchmarks for Abstract Screening in Social and Environmental Scientific Publications. 2025. Available online: https://github.com/NIVANorge/ai-literature-review-public (accessed on 1 May 2026).
- Syriani, E.; David, I.; Kumar, G. Screening Articles for Systematic Reviews with ChatGPT. J. Comput. Lang. 2024, 80, 101287.
- Dwork, C.; Feldman, V.; Hardt, M.; Pitassi, T.; Reingold, O.; Roth, A. Generalization in Adaptive Data Analysis and Holdout Reuse. In NIPS’15: Proceedings of the 29th International Conference on Neural Information Processing Systems, Volume 2; MIT Press: Cambridge, MA, USA, 2015.
- Li, Y.; Datta, S.; Rastegar-Mojarad, M.; Lee, K.; Paek, H.; Glasgow, J.; Liston, C.; He, L.; Wang, X.; Xu, Y. Enhancing Systematic Literature Reviews with Generative Artificial Intelligence: Development, Applications, and Performance Evaluation. J. Am. Med. Inform. Assoc. 2025, 32, 616–625.
- Malik, F.S.; Terzidis, O. A Hybrid Framework for Creating Artificial Intelligence-Augmented Systematic Literature Reviews. Manag. Rev. Q. 2025, 1–27.
- Taylor, K.S.; Mahtani, K.R.; Aronson, J.K. Summarising Good Practice Guidelines for Data Extraction for Systematic Reviews and Meta-Analysis. BMJ Evid. Based Med. 2021, 26, 88–90.
- Schmidt, L.; Olorisade, B.K.; McGuinness, L.A.; Thomas, J.; Higgins, J.P.T. Data Extraction Methods for Systematic Review (Semi)Automation: A Living Systematic Review. F1000Research 2021, 10, 401.
- Atil, B.; Aykent, S.; Chittams, A.; Fu, L.; Passonneau, R.J.; Radcliffe, E.; Rajagopal, G.R.; Sloan, A.; Tudrej, T.; Ture, F.; et al. Non-Determinism of “Deterministic” LLM System Settings in Hosted Environments. In Proceedings of the 5th Workshop on Evaluation and Comparison of NLP Systems; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 135–148.
- Heinze, G.; Schemper, M. A Solution to the Problem of Separation in Logistic Regression. Stat. Med. 2002, 21, 2409–2419.
- Landis, J.R.; Koch, G.G. The Measurement of Observer Agreement for Categorical Data. Biometrics 1977, 33, 159–174.
- Chicco, D.; Jurman, G. The Advantages of the Matthews Correlation Coefficient (MCC) over F1 Score and Accuracy in Binary Classification Evaluation. BMC Genom. 2020, 21, 6.
- Chen, G.; Faris, P.; Hemmelgarn, B.; Walker, R.L.; Quan, H. Measuring Agreement of Administrative Data with Chart Data Using Prevalence Unadjusted and Adjusted Kappa. BMC Med. Res. Methodol. 2009, 9, 5.
- Zec, S.; Soriani, N.; Comoretto, R.; Baldi, I. High Agreement and High Prevalence: The Paradox of Cohen’s Kappa. Open Nurs. J. 2017, 11, 211–218.
- Delgado, R.; Tibau, X.A. Why Cohen’s Kappa Should Be Avoided as Performance Measure in Classification. PLoS ONE 2019, 14, e0222916.
- de la Cruz Huayanay, A.; Bazán, J.L.; Russo, C.M. Performance of Evaluation Metrics for Classification in Imbalanced Data. Comput. Stat. 2025, 40, 1447–1473.
- Page, M.J.; Higgins, J.P.; Sterne, J.A. Assessing Risk of Bias Due to Missing Results in a Synthesis. In Cochrane Handbook for Systematic Reviews of Interventions, 2nd ed.; Cochrane: London, UK, 2019.
- Feinstein, A.R.; Cicchetti, D.V. High Agreement but Low Kappa: I. The Problems of Two Paradoxes. J. Clin. Epidemiol. 1990, 43, 543–549.
- Jothi Prakash, B.; Barath Kannan, D.; Pankaj Seervi, A.; Meivezhi, G. Prompt Engineering for Large Language Models: A Systematic Review and Future Directions. Res. Sq. 2025; preprint.
- Sahoo, P.; Singh, A.K.; Saha, S.; Jain, V.; Mondal, S.; Chadha, A. A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications. arXiv 2025, arXiv:2402.07927.
- Adam, T.J.; Abosabie, S.A.S.; Dittmer, M.; Wolf, E.; Abosabie, S.A.; Behnke, C.; Baier, F.; Weickmann, A.; Köser, L.; Correll, C.U.; et al. Prompt Engineering of Large Language Models for Paper Screening in Medical Meta-Analyses and Systematic Reviews: A Prospective Comparative Study. Res. Synth. Methods 2026, 17, 1–18.
- Ye, A.; Maiti, A.; Schmidt, M.; Pedersen, S.J. A Hybrid Semi-Automated Workflow for Systematic and Literature Review Processes with Large Language Model Analysis. Future Internet 2024, 16, 167.
- Homiar, A.; Thomas, J.; Ostinelli, E.G.; Kennett, J.; Friedrich, C.; Cuijpers, P.; Harrer, M.; Leucht, S.; Miguel, C.; Rodolico, A.; et al. Development and Evaluation of Prompts for a Large Language Model to Screen Titles and Abstracts in a Living Systematic Review. BMJ Ment. Health 2025, 28, e301762.
- Huotala, A.; Kuutila, M.; Ralph, P.; Mäntylä, M. The Promise and Challenges of Using LLMs to Accelerate the Screening Process of Systematic Reviews. In Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering (EASE 2024); CEUR-WS; Association for Computing Machinery: New York, NY, USA, 2024; Volume 2657, pp. 1–9.
- Sollini, M.; Pini, C.; Lazar, A.; Gelardi, F.; Ninatti, G.; Bauckneht, M.; Chiti, A.; Kirienko, M. Human Researchers Are Superior to Large Language Models in Writing a Medical Systematic Review in a Comparative Multitask Assessment. Sci. Rep. 2025, 16, 173.
- Haltaufderheide, J.; Ranisch, R. The Ethics of ChatGPT in Medicine and Healthcare: A Systematic Review on Large Language Models (LLMs). npj Digit. Med. 2024, 7, 183.
- Peters, U.; Chin-Yee, B. Generalization Bias in Large Language Model Summarization of Scientific Research. R. Soc. Open Sci. 2025, 12, 241776.
- Deng, C.; Duan, Y.; Jin, X.; Chang, H.; Tian, Y.; Liu, H.; Wang, Y.; Gao, K.; Zou, H.P.; Jin, Y.; et al. Deconstructing the Ethics of Large Language Models from Long-Standing Issues to New-Emerging Dilemmas: A Survey. AI Ethics 2025, 5, 4745–4771.
- Fareed, M.; Fatima, M.; Uddin, J.; Ahmed, A.; Sattar, M.A. A Systematic Review of Ethical Considerations of Large Language Models in Healthcare and Medicine. Front. Digit. Health 2025, 7, 1653631.
- Aljohani, M.; Hou, J.; Kommu, S.; Wang, X. A Comprehensive Survey on the Trustworthiness of Large Language Models in Healthcare. arXiv 2025, arXiv:2504.00025.
- Gartlehner, G.; Kahwati, L.; Hilscher, R.; Thomas, I.; Kugley, S.; Crotty, K.; Viswanathan, M.; Nussbaumer-Streit, B.; Booth, G.; Erskine, N.; et al. Data Extraction for Evidence Synthesis Using a Large Language Model: A Proof-of-Concept Study. Res. Synth. Methods 2024, 15, 576–589.
- O’Connor, A.M.; Clark, J.; Thomas, J.; Spijker, R.; Kusa, W.; Walker, V.R.; Bond, M. Large Language Models, Updates, and Evaluation of Automation Tools for Systematic Reviews: A Summary of Significant Discussions at the Eighth Meeting of the International Collaboration for the Automation of Systematic Reviews (ICASR). Syst. Rev. 2024, 13, 290.
- Cacciamani, G.E.; Chu, T.N.; Sanford, D.I.; Abreu, A.; Duddalwar, V.; Oberai, A.; Kuo Jay, C.C.; Liu, X.; Denniston, A.K.; Vasey, B.; et al. PRISMA AI-Reporting Guidelines for Systematic Reviews and Meta-Analyses on AI in Healthcare. Nat. Med. 2023, 29, 14–15.
- Carlini, N.; Tramèr, F.; Lee, K.; Roberts, A.; Wallace, E.; Jagielski, M.; Herbert-Voss, A.; Brown, T.; Song, D.; Erlingsson, Ú.; et al. Extracting Training Data from Large Language Models. In Proceedings of the 30th USENIX Security Symposium, Online, 11–13 August 2021.
- Thode, L.; Iftikhar, U.; Mendez, D. Exploring the Use of LLMs for the Selection Phase in Systematic Literature Studies. Inf. Softw. Technol. 2025, 184, 107757.
- Golchin, S.; Surdeanu, M. Time Travel in LLMs: Tracing Data Contamination in Large Language Models. In Proceedings of the ICLR 2024, Vienna, Austria, 7–11 May 2024.
- Dietrich, J.; Hollstein, A. Performance and Reproducibility of Large Language Models in Named Entity Recognition: Considerations for the Use in Controlled Environments. Drug Saf. 2025, 48, 287–303.
- Flemyng, E.; Noel-Storr, A.; Macura, B.; Gartlehner, G.; Thomas, J.; Meerpohl, J.J.; Jordan, Z.; Minx, J.; Eisele-Metzger, A.; Hamel, C.; et al. Position Statement on Artificial Intelligence (AI) Use in Evidence Synthesis across Cochrane, the Campbell Collaboration, JBI and the Collaboration for Environmental Evidence 2025. Environ. Evid. 2025, 14, 20.
- Development Assistance Committee. DAC List of ODA Recipients; Development Assistance Committee: Paris, France, 2025.
- Hartig, T.; Mitchell, R.; de Vries, S.; Frumkin, H. Nature and Health. Annu. Rev. Public Health 2014, 35, 207–228.
- Barton, J.; Rogerson, M. The Importance of Greenspace for Mental Health. BJPsych Int. 2017, 14, 79–81.
- World Health Organisation. Mental Health: A State of Wellbeing. Available online: https://www.who.int/features/factfiles/mental_health/en/ (accessed on 1 May 2025).
- Linton, M.J.; Dieppe, P.; Medina-Lara, A. Review of 99 Self-Report Measures for Assessing Well-Being in Adults: Exploring Dimensions of Well-Being and Developments over Time. BMJ Open 2016, 6, e010641.
- Houlden, V.; Weich, S.; de Albuquerque, J.P.; Jarvis, S.; Rees, K. The Relationship between Greenspace and the Mental Wellbeing of Adults: A Systematic Review. PLoS ONE 2018, 13, e0203000.
- American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders, 5th ed.; American Psychiatric Association Publishing: Washington, DC, USA, 2013.
- United Nations. Habitat III Issue Papers—Informal Settlements; UN Habitat: New York, NY, USA, 2016.
- Gascon, M.; Mas, M.T.; Martínez, D.; Dadvand, P.; Forns, J.; Plasència, A.; Nieuwenhuijsen, M.J. Mental Health Benefits of Long-Term Exposure to Residential Green and Blue Spaces: A Systematic Review. Int. J. Environ. Res. Public Health 2015, 12, 4354–4379.
- Academic Unit of Health Economics University of Leeds. AUHE Search Strategy: Low and Middle Income Countries Geographic Search; Academic Unit of Health Economics University of Leeds: Leeds, UK, 2018.

| Model | True Positive | True Negative | False Positive | False Negative | Total Accuracy | Sensitivity | Sensitivity 95% CI | Specificity | Specificity 95% CI |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4.1 | 7 | 431 | 61 | 1 | 0.876 | 0.875 | 0.529–0.978 | 0.876 | 0.844–0.902 |
| Claude 3.5 Sonnet | 7 | 431 | 61 | 1 | 0.876 | 0.875 | 0.529–0.978 | 0.876 | 0.844–0.902 |
| Gemini 2.0 Flash | 8 | 422 | 70 | 0 | 0.860 | 1.000 | 0.676–1.000 | 0.858 | 0.824–0.886 |
| Mistral Large | 8 | 343 | 149 | 0 | 0.702 | 1.000 | 0.676–1.000 | 0.697 | 0.655–0.736 |
| DeepSeek V3 | 3 | 482 | 10 | 5 | 0.970 | 0.375 | 0.137–0.694 | 0.980 | 0.963–0.989 |

| Model | Positive Predictive Value | PPV 95% CI | Negative Predictive Value | Positive Likelihood Ratio | LR+ 95% CI | Negative Likelihood Ratio | LR− 95% CI |
|---|---|---|---|---|---|---|---|
| GPT-4.1 | 0.103 | 0.051–0.198 | 0.998 | 7.057 | 3.228–15.429 | 0.143 | 0.020–1.015 |
| Claude 3.5 Sonnet | 0.103 | 0.051–0.198 | 0.998 | 7.057 | 3.228–15.429 | 0.143 | 0.020–1.015 |
| Gemini 2.0 Flash | 0.103 | 0.053–0.190 | 1.000 | 7.029 | 3.382–14.606 | 0.000 | 0–NaN |
| Mistral Large | 0.051 | 0.026–0.097 | 1.000 | 3.302 | 1.621–6.725 | 0.000 | 0–NaN |
| DeepSeek V3 | 0.231 | 0.082–0.503 | 0.990 | 18.450 | 5.078–67.039 | 0.638 | 0.264–1.540 |
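
The point estimates in the two tables above follow directly from the four confusion-matrix counts, and the tabulated intervals for sensitivity, specificity and PPV appear consistent with Wilson score intervals. The sketch below recomputes both for one row. It is an illustration under that assumption (the interval method is not stated in the tables), the function names are hypothetical, and the likelihood-ratio intervals are not reproduced here.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at p = 0 or p = 1.
    return max(0.0, centre - half), min(1.0, centre + half)


def screening_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Point estimates derived from one 2x2 confusion matrix."""
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sens,
        "specificity": spec,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        # LR+ is undefined (infinite) when there are no false positives.
        "lr_pos": sens / (1 - spec) if spec < 1 else math.inf,
        # LR- is undefined (infinite) when there are no true negatives.
        "lr_neg": (1 - sens) / spec if spec > 0 else math.inf,
    }


# Counts from the GPT-4.1 row.
gpt = screening_metrics(tp=7, tn=431, fp=61, fn=1)
sens_ci = wilson_interval(7, 7 + 1)
print({name: round(value, 3) for name, value in gpt.items()})
print(tuple(round(bound, 3) for bound in sens_ci))
```

With only eight relevant records in the sample, each missed record moves sensitivity by 0.125, which is why the sensitivity intervals are so much wider than the specificity intervals.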

| Model | Cohen’s κ | z (κ) | MCC | z (MCC) | PABAK | z (PABAK) | Gwet’s AC1 | z (AC1) |
|---|---|---|---|---|---|---|---|---|
| GPT-4.1 | 0.160 | 1.60 | 0.275 *** | 6.39 | 0.752 *** | 25.51 | 0.856 *** | 49.90 |
| Claude 3.5 Sonnet | 0.160 | 1.60 | 0.275 *** | 6.39 | 0.752 *** | 25.51 | 0.856 *** | 49.90 |
| Gemini 2.0 Flash | 0.162 | 1.74 | 0.297 *** | 6.94 | 0.720 *** | 23.20 | 0.834 *** | 45.29 |
| Mistral Large | 0.069 | 1.07 | 0.188 *** | 4.29 | 0.404 *** | 9.88 | 0.589 *** | 20.85 |
| DeepSeek V3 | 0.271 | 1.46 | 0.280 *** | 6.51 | 0.940 *** | 61.61 | 0.969 *** | 121.76 |
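
All four agreement coefficients can be recomputed from the same confusion-matrix counts using their standard two-rater, two-category definitions. The sketch below does so for one row; `agreement_coefficients` is a hypothetical helper, the definitions are assumed to match those used for the table (they reproduce its point estimates), and the z statistics are not recomputed.

```python
import math


def agreement_coefficients(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Cohen's kappa, MCC, PABAK and Gwet's AC1 for a 2x2 table."""
    n = tp + tn + fp + fn
    po = (tp + tn) / n              # observed agreement
    model_pos = (tp + fp) / n       # share the model marked "include"
    human_pos = (tp + fn) / n       # share the reference marked "include"

    # Cohen's kappa: chance agreement from the product of the marginals.
    pe_kappa = model_pos * human_pos + (1 - model_pos) * (1 - human_pos)
    kappa = (po - pe_kappa) / (1 - pe_kappa)

    # Matthews correlation coefficient.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )

    # PABAK: kappa with chance agreement fixed at 0.5.
    pabak = 2 * po - 1

    # Gwet's AC1: chance agreement from the mean "include" share.
    pi = (model_pos + human_pos) / 2
    pe_ac1 = 2 * pi * (1 - pi)
    ac1 = (po - pe_ac1) / (1 - pe_ac1)

    return {"kappa": kappa, "mcc": mcc, "pabak": pabak, "ac1": ac1}


# Counts for GPT-4.1: 7 TP, 431 TN, 61 FP, 1 FN.
gpt = agreement_coefficients(tp=7, tn=431, fp=61, fn=1)
print({name: round(value, 3) for name, value in gpt.items()})
```

The gap between κ and the prevalence-adjusted coefficients reflects the low prevalence of relevant records (8 of 500): Cohen's chance-agreement term is already about 0.85 for this row, leaving little room for κ to rise even at 87.6% observed agreement.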
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Nawrath, M.; Merlina, A.; Knight, J.; Welch, S.A.; Rashidian, M.; Seifert-Dähnn, I. Validating Large Language Models for Title-Abstract Screening in Low-Prevalence Systematic Reviews: An Environmental Science Case Study. Information 2026, 17, 501. https://doi.org/10.3390/info17050501

