Article

Validating Large Language Models for Title-Abstract Screening in Low-Prevalence Systematic Reviews: An Environmental Science Case Study

Norwegian Institute for Water Research, Økernveien 94, 0579 Oslo, Norway
*
Author to whom correspondence should be addressed.
Information 2026, 17(5), 501; https://doi.org/10.3390/info17050501
Submission received: 16 April 2026 / Revised: 7 May 2026 / Accepted: 14 May 2026 / Published: 19 May 2026
(This article belongs to the Section Information Applications)

Abstract

Literature screening is a major bottleneck in systematic reviews, and Large Language Models (LLMs) can substantially reduce this workload. However, performance varies across models and is sensitive to evaluation metrics, particularly in low-prevalence screening contexts. We validated five LLMs (GPT-4.1, Claude 3.5 Sonnet, Gemini 2.0 Flash, DeepSeek V3, and Mistral Large) against a 500-record gold-standard dataset (8 inclusions; 1.6% prevalence) using a conservative zero-shot prompt aligned with standard systematic review workflows. Performance was assessed through classification metrics (sensitivity, specificity, precision), logistic regression (GLM; Firth-penalised where separation occurred), and agreement indices (Cohen’s κ, MCC, PABAK, Gwet’s AC1). Gemini 2.0 Flash and Mistral Large produced no false negatives (sensitivity 1.00) but differed in specificity (0.858 vs. 0.697) and accuracy (0.860 vs. 0.702). GPT-4.1 and Claude 3.5 Sonnet performed identically (sensitivity 0.875; specificity 0.876; accuracy 0.876). In contrast, DeepSeek V3 maximised specificity (0.980) and accuracy (0.970) but demonstrated lower sensitivity (0.375). Regression analyses confirmed strong positive associations with human decisions (OR 28.9–49.5). Agreement indices revealed the expected low-prevalence artefact: Cohen’s κ was low despite high concordance, while MCC, PABAK, and AC1 indicated substantially stronger agreement. Our results highlight a fundamental sensitivity-specificity trade-off, with conclusions dependent on the evaluation framework chosen. LLMs may meaningfully support title-abstract screening as decision-support tools, provided that human oversight is maintained and validation is transparent and reproducible.

1. Introduction

Systematic reviews are a methodology used to synthesise scientific evidence on a particular topic. They provide comprehensive summaries of the existing literature, minimise the impact of bias and errors in individual studies, identify knowledge gaps to inform future research and policy making, and combine findings from studies to provide novel insights [1]. Currently, the world’s scientific output doubles every nine years [2]. Systematic reviews enable researchers and practitioners to keep up with this rapidly growing body of evidence. However, systematic reviews are work-intensive and therefore time-consuming. A systematic review of reviews found that, on average, they take 13 months to complete [3]. The screening phase, where hundreds or even thousands of abstracts and full texts must be evaluated for eligibility, is particularly time-consuming and labour intensive. Due to the rapid pace of scientific progress, estimates suggest that a quarter of reviews are outdated within two years of publication [4]. Systematic reviews are crucial for evidence-based policymaking [5]. The demand for rapid evidence from policymakers outpaces the speed of traditional review processes, creating a “synthesis gap” [6].
Large Language Models (LLMs), a rapidly developing class of natural language processing methods, can help reduce the time and resources required for systematic reviews. Potential benefits of using LLMs in systematic reviews include decreasing the time needed for producing reviews, reducing associated costs, reducing variability between reviewers and reviews, and allowing researchers more time to focus on tasks that computers cannot perform [7].
LLMs are deep learning systems trained via self-supervised learning on very large text corpora to understand, summarise, generate, and predict content in human language [8]. They are built on neural network architectures called transformers, which enable them to perform a wide range of natural language processing tasks [9]. The foundational breakthrough came with the “Transformer” architecture introduced in the seminal paper “Attention Is All You Need” by Vaswani et al. (2017), which replaced traditional recurrent networks with scalable self-attention mechanisms [10]. As models scaled in size, researchers observed emergent abilities, and at the time of writing these models underpin numerous applications across the knowledge economy. LLMs are typically assessed using a growing array of benchmarks and have shown promising performance in text classification and language understanding [11]. However, a persistent issue with LLMs is that they produce “hallucinated” outputs, generating responses that are convincing yet factually incorrect.
Systematic reviews consist of several key steps: defining the research question, developing a review protocol, systematically searching the literature, screening and selecting studies that meet the inclusion criteria, extracting data, assessing risk of bias, and qualitative and quantitative data synthesis [1]. A recent review found that LLMs, such as GPT-4.1, can outperform human reviewers in a range of tasks, including comprehensibility, clarity of review, relevance of feedback, and accuracy of technical assessments [12]. LLMs also have the potential to substantially reduce the time required for the screening phase in particular [13]. For instance, using LLMs may improve the efficiency of study selection by automatically identifying and ranking the reviewed studies based on a set of predefined criteria [14]. This may enhance consistency and handle larger volumes of data more efficiently than human reviewers alone [15].
While there is substantial potential for LLMs to improve systematic reviews, there are some major concerns. LLM algorithms often lack transparency, making it difficult to understand and trust their decision-making processes [16]. This also results in reproducibility issues [17]. Risk of bias is another concern, given that biases in training data or model assumptions can skew results [18]. To detect and mitigate hallucinations, which are instances where models generate inaccurate or fabricated information, systematic review prompts have been developed to improve the transparency of the model’s reasoning and outputs [19]. Despite these concerns, LLMs have strong potential for reducing the workload in systematic reviews, provided that these tools are rigorously validated and appropriately applied [14].
The article selection process is a crucial step in systematic reviews, typically divided into two stages. In the first stage, reviewers screen titles and abstracts, and in the second stage, they review the full texts of studies. Traditionally, the screening is conducted independently by two or more reviewers to minimise the impact of bias and errors. Inter-rater reliability is commonly assessed using chance-corrected agreement statistics, most notably Cohen’s kappa (κ). Alternative measures such as the Matthews Correlation Coefficient (MCC), the Prevalence-Adjusted Bias-Adjusted Kappa (PABAK), and Gwet’s First-Order Agreement Coefficient (AC1) have been proposed to address limitations of κ under prevalence imbalance. Discrepancies are then resolved through discussion between the reviewers [1]. This ensures that all studies meeting the inclusion criteria are included and reduces the risk of excluding relevant studies.
Recent empirical studies have begun to evaluate the use of LLMs specifically for title-abstract screening in systematic reviews. These studies generally report high sensitivity and substantial workload reductions, particularly when LLMs are used as screening assistants rather than fully autonomous decision-makers [13,14,15,20]. Several validation studies show that LLM-based screening can achieve recall rates exceeding 95%, often at the cost of increased false positives, thereby preserving review validity while shifting effort downstream [21,22]. However, existing work also reveals substantial heterogeneity in evaluation approaches, with studies relying on different metrics, datasets, prevalence conditions, and prompt strategies, limiting comparability across findings [7,23]. In particular, reliance on single performance indicators such as Cohen’s κ or overall accuracy may obscure important trade-offs between sensitivity, specificity, and practical agreement in low-prevalence screening contexts, which underscores the need for multi-metric validation frameworks.
Given that LLMs are highly likely to become integrated parts of the systematic review process, it is crucial to validate these tools in a variety of contexts to ensure their reliability and accuracy. Comparing the performance of LLMs with traditional human review processes, which remain the gold standard, shows how automated screening performs in study selection, particularly in terms of sensitivity and specificity [14,24]. Ensuring high sensitivity and specificity of LLMs in article screening is crucial, as erroneously excluding studies can substantially bias the conclusions drawn from systematic reviews.
Against this background, this study aims to provide a rigorous validation of contemporary LLMs for title-abstract screening in systematic reviews, with particular attention to how conclusions about model performance depend on the choice of evaluation metrics. Using a gold-standard screening dataset [25], we address the following research questions: (1) How do different LLMs compare to human reviewers in terms of sensitivity, specificity, and overall classification performance during title-abstract screening? (2) To what extent are model performance patterns consistent across classification metrics, logistic regression analyses, and inter-rater reliability measures? (3) How do prevalence effects influence commonly used agreement statistics, and what are the implications for interpreting inter-rater reliability between LLMs and human reviewers in low-prevalence screening contexts? (4) What are the implications of observed sensitivity-specificity trade-offs for the practical suitability of LLMs in systematic review screening workflows? To address these questions, we apply a three-pronged evaluation framework combining classification metrics, logistic regression, and multiple inter-rater reliability indices.

2. Materials and Methods

This comparative validation study evaluated the effectiveness and accuracy of five LLMs in screening titles and abstracts from a set of articles previously reviewed by human experts and published in Nawrath et al. (2021) [25], which synthesises emerging evidence on the links between urban greenspaces and mental health outcomes in low- and middle-income countries (LMICs).
The review followed Arksey & O’Malley’s (2005) framework and the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) [26,27]. A total of 1801 potentially eligible articles were identified through searches in Web of Science Core Collection (1900–November 2019), Medline (1946–November 2019), Embase (1947–November 2019) and CAB Abstracts (1910–November 2019). Screening was performed in two stages by two independent reviewers using Microsoft Excel. In the first stage, titles and abstracts were assessed against predefined eligibility criteria as defined in [25]. In the second stage, full-text screening was conducted for the remaining studies by the same reviewers. All conflicts generated through the screening stages between the two reviewers were discussed until consensus was reached.
In this study, only the title-abstract screening stage was evaluated. Full-text screening was used solely to validate the human reviewers’ inclusion and exclusion decisions, which served as the gold standard against which LLM-based title-abstract screening decisions were compared. Studies were excluded if they did not meet the inclusion criteria and were included when the title and abstract did not contain sufficient information to make an informed judgement. The final inclusion/exclusion decisions served as the “gold standard” for this study, as this remains the validated approach in systematic reviews for initial screening [14]. The complete search strategy is included in Appendix A.1. From Nawrath et al. (2021) [25], we extracted the following data for a randomly selected 500-article subset: title, abstract, decision whether the article was included or excluded by the reviewers and decision reason.

2.1. Large Language Models (LLMs)

The five LLMs included in our analysis were OpenAI’s GPT-4.1, Anthropic’s Claude 3.5 Sonnet, Google’s Gemini 2.0 Flash, Mistral Large v24.11, and DeepSeek V3. All experiments were conducted in August 2025 using the latest available versions of these models. The Python (version 3.11.9) code used to generate the results presented in the following sections is openly available in the accompanying repository [28]. Each model was queried through its official API under identical prompt conditions.

2.2. Prompts

Prompts determine the LLM’s behaviour and context. We began with a prompt template developed by Syriani et al. (2024) [29] and refined it through a small number of iterative revisions to address recurring issues such as incomplete adherence to the screening instructions, sequential bias, inconsistent formatting, and omission of required fields. This included specifying a leniency rule, adopting a zero-shot configuration, and setting the temperature to the lowest possible value. Revisions were guided by qualitative inspection of model outputs to address general issues (e.g., the model not consistently returning valid JSON or omitting required fields), rather than tailoring the prompt to specific articles. However, because prompt refinement drew on outputs from the same 500-abstract corpus used for evaluation, we cannot exclude the possibility of adaptive overfitting to the evaluation set (data leakage) [30]. We therefore treat the reported results as specific to this final prompt and this dataset, and acknowledge that performance may be slightly better than would be expected under fully prospective, real-world zero-shot screening.
Importantly, the final prompt was implemented using a zero-shot learning approach, following the definition proposed by Syriani et al. (2024) [29], whereby models are provided with task instructions and decision criteria but no labelled examples or demonstrations of correct classifications. In this context, zero-shot learning reflects a realistic deployment scenario in which LLMs are applied at the planning and screening stages of systematic reviews, prior to the availability of curated training examples or task-specific fine-tuning.
To overcome observed challenges, we established a set of design principles aimed at improving the prompt while maintaining a zero-shot paradigm. Prompts instructed the models to analyse the entire document before summarising [31], delay synthesis until all sections were reviewed to mitigate sequential bias, and follow the logical order of scientific reporting (study design, sample size, variables, methods, results, conclusions) for contextual accuracy [32]. Domain-specific terminology and synonyms were incorporated to capture nuanced information, while prompts for complex data types (e.g., sample sizes, time points) were made detailed and flexible to accommodate diverse reporting styles [33]. Additional refinements included accounting for implicit information through conceptual definitions, providing explicit instructions on what to extract and ignore, and adapting prompts to study designs (e.g., field vs. laboratory research; qualitative vs. quantitative designs) [34]. However, we did not operationalise eligibility based on specific study design categories as in the original review. This decision was made because information on study designs is often not reliably reported in titles and abstracts and is typically assessed during full-text screening. Encoding it as a hard criterion at the title-abstract stage could therefore have introduced avoidable false negatives.
Based on these principles, we revised the prompt to meet four key requirements: (1) use only the information available during the planning phase of the systematic review, so that LLMs can be integrated into the standard workflow; (2) apply universally across systematic review topics; (3) minimise token count to reduce cost and bandwidth; and (4) ensure consistency by using the same system and user prompt for all LLMs (Appendix A.2; Syriani et al., 2024 [29]). We set the temperature, a parameter controlling the randomness of LLM responses, to 0.0 (the lowest possible value) to minimise sampling randomness and make responses as deterministic and reproducible as possible by consistently favouring the highest-probability token at each step. However, for API-hosted LLMs, identical queries can occasionally yield slightly different outputs due to backend nondeterminism (e.g., how the provider routes requests or runs the model) [35].
The models were instructed to return a binary verdict, either “include” or “exclude”, for each article in the systematic review. In addition, the models were required to provide justification for their decision and a confidence rating. The output was structured as a plain JSON object in the following format:
    {
      "verdict": "<your verdict here, either 'include' or 'exclude'>",
      "explanation": "<detailed explanation to justify your verdict here>",
      "confidence": "<confidence level of your decision here>"
    }
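The structured output described above lends itself to automatic validation before a verdict is recorded. The following is a minimal sketch of such a check, not the code from the study's repository: the function name `parse_screening_reply`, the case normalisation, and the example reply (including the confidence value "high") are all illustrative assumptions.

```python
import json

ALLOWED_VERDICTS = {"include", "exclude"}
REQUIRED_FIELDS = ("verdict", "explanation", "confidence")


def parse_screening_reply(raw_reply: str) -> dict:
    """Parse one model reply and check it against the expected JSON schema.

    Hypothetical helper: raises ValueError when the reply is not valid JSON,
    omits a required field, or carries a verdict other than include/exclude.
    """
    record = json.loads(raw_reply)  # json.JSONDecodeError is a ValueError
    if not isinstance(record, dict):
        raise ValueError("reply is not a JSON object")
    missing = [field for field in REQUIRED_FIELDS if field not in record]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    verdict = str(record["verdict"]).strip().lower()
    if verdict not in ALLOWED_VERDICTS:
        raise ValueError(f"unexpected verdict: {record['verdict']!r}")
    record["verdict"] = verdict  # store the normalised form
    return record


# Made-up reply, used only to exercise the parser.
example_reply = (
    '{"verdict": "Include", '
    '"explanation": "Abstract links urban park use to stress.", '
    '"confidence": "high"}'
)
parsed = parse_screening_reply(example_reply)
print(parsed["verdict"])
```

Rejecting malformed replies outright, rather than guessing a verdict, keeps formatting failures visible so that they can be counted or retried separately from genuine screening decisions.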
We did not compare zero-shot and few-shot prompting strategies in this study. This choice was deliberate, given that our objective was not to maximise performance under optimal prompting conditions, but to evaluate how LLMs perform under a conservative, generalisable, and workflow-compatible zero-shot configuration. Few-shot prompting typically requires curated examples and introduces additional sources of variability across reviews, domains, and model versions, which may limit comparability and reproducibility in applied evidence synthesis contexts [29]. Eligibility decisions were therefore based on the subset of the original inclusion/exclusion criteria that can plausibly be assessed from titles and abstracts, and we did not exclude records based on study design because titles and abstracts rarely provide sufficient information to apply this criterion consistently.
We explicitly instructed the models to apply a lenient decision rule (Appendix A.2). This reflects standard systematic review practice, where the costs of false negatives are typically considered higher than the costs of false positives, because missed eligible studies can propagate bias into the final evidence base. Consistent with this, prior work has recommended high recall thresholds for title-abstract screening even when this increases downstream workload [21]. Accordingly, we treated leniency as a prompt design choice aligned with typical screening priorities. This choice involves a trade-off: the leniency instruction is likely to increase false positives, and results should therefore be interpreted as reflecting a conservative, recall-oriented screening approach.

2.3. Statistical Analysis

We computed classification performance metrics for each model by comparing its inclusion and exclusion decisions to the human reference standard. Specifically, we calculated sensitivity (true positive rate), specificity (true negative rate), positive predictive value (PPV, or precision), negative predictive value (NPV), and the positive (LR+) and negative (LR−) likelihood ratios from two-by-two contingency tables, with 95% confidence intervals for each measure. Sensitivity measures the proportion of truly included studies that were correctly identified by the model, while PPV indicates the proportion of studies predicted as included that were truly included. Similarly, specificity measures the proportion of truly excluded studies that the model correctly excluded, and NPV indicates the proportion of studies predicted as excluded that were truly excluded. The likelihood ratios summarise the trade-off between sensitivity and specificity. The formulas for all classification performance metrics are included in Appendix A.3.
Where possible, we mapped misclassifications to specific eligibility criteria to improve interpretability and identify potential sources of systematic error. We reviewed each false negative and assigned it to a criterion category based on the reasoning the model had stated. Given the low number of misclassifications inherent to the low-prevalence dataset, a fully systematic error taxonomy was not feasible. Instead, we report illustrative cases narratively to highlight recurrent error patterns. The full dataset, including all model decisions and reasoning, is available at https://github.com/NIVANorge/ai-literature-review-public/blob/main/Full%20dataset.xlsx (accessed on 1 May 2025).
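As an illustration of these definitions, the sketch below computes the metrics from a two-by-two table using the GPT-4.1 counts reported in Section 3.1. The function names are hypothetical. The Wilson score interval is an assumption: the text defers the formulas to Appendix A.3, and Wilson is used here only because it reproduces the intervals reported for GPT-4.1 sensitivity and specificity. Intervals for the likelihood ratios are deliberately omitted because their method is not stated.

```python
from math import sqrt


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (assumed CI method)."""
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Point estimates from a 2 x 2 table; human decisions are the reference.

    Assumes no empty margin: LR+ is undefined when specificity equals 1,
    and PPV is undefined when the model includes nothing.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "lr_pos": sensitivity / (1 - specificity),
        "lr_neg": (1 - sensitivity) / specificity,
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# GPT-4.1 counts as reported in Section 3.1.
gpt41 = screening_metrics(tp=7, fp=61, fn=1, tn=431)
sens_ci = wilson_interval(7, 8)        # sensitivity: 7 of 8 inclusions
spec_ci = wilson_interval(431, 492)    # specificity: 431 of 492 exclusions

for name, value in gpt41.items():
    print(f"{name}: {value:.3f}")
print(f"sensitivity 95% CI: {sens_ci[0]:.3f} to {sens_ci[1]:.3f}")
print(f"specificity 95% CI: {spec_ci[0]:.3f} to {spec_ci[1]:.3f}")
```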
Moreover, we used logistic regression to model the binary inclusion outcome for each article as a function of the human screening decision, enabling estimation of odds ratios that quantify how strongly each model’s inclusion decisions aligned with human inclusion (Appendix A.4). Specifically, we fitted separate binomial logistic regression models for each LLM, with the model’s binary verdict (include/exclude) as the outcome and the human label as the predictor. We report slope coefficients, odds ratios, and 95% confidence intervals to summarise the strength of association between LLM and human decisions. Where standard maximum-likelihood logistic regression exhibited (quasi-) complete separation due to sparse inclusion events, we applied Firth’s penalised likelihood logistic regression to obtain finite, bias-reduced coefficient estimates and confidence intervals [36].
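With a single binary predictor, the maximum-likelihood odds ratio from such a logistic regression equals the cross-product ratio of the two-by-two table, which allows a quick check without fitting a model. The sketch below is illustrative only: `odds_ratio` is a hypothetical helper, the Wald interval is an assumption about how intervals might be formed, and the optional 0.5 continuity correction (Haldane-Anscombe) is a simple stand-in for sparse tables, not the Firth estimator used in the study.

```python
from math import exp, log, sqrt


def odds_ratio(tp: int, fp: int, fn: int, tn: int, correction: float = 0.0):
    """Odds of a model 'include' given human include vs. human exclude.

    Returns (OR, lower, upper) with a Wald 95% interval on the log scale.
    `correction` is added to every cell; 0.5 gives the Haldane-Anscombe
    adjustment, used here only as a stand-in when a cell is zero.
    """
    tp_c, fp_c, fn_c, tn_c = (count + correction for count in (tp, fp, fn, tn))
    if min(tp_c, fp_c, fn_c, tn_c) == 0:
        # An empty cell means (quasi-)complete separation: no finite ML estimate.
        raise ValueError("zero cell: odds ratio is not finite without a correction")
    estimate = (tp_c * tn_c) / (fp_c * fn_c)
    se_log = sqrt(1 / tp_c + 1 / fp_c + 1 / fn_c + 1 / tn_c)
    return estimate, exp(log(estimate) - 1.96 * se_log), exp(log(estimate) + 1.96 * se_log)


# Counts from Section 3.1.
or_gpt41, _, _ = odds_ratio(tp=7, fp=61, fn=1, tn=431)
or_deepseek, _, _ = odds_ratio(tp=3, fp=10, fn=5, tn=482)
print(f"GPT-4.1 OR: {or_gpt41:.1f}")
print(f"DeepSeek V3 OR: {or_deepseek:.1f}")

# Gemini 2.0 Flash has no false negatives, so the uncorrected call would raise.
or_gemini_adj, _, _ = odds_ratio(tp=8, fp=70, fn=0, tn=422, correction=0.5)
print(f"Gemini 2.0 Flash OR with 0.5 correction: {or_gemini_adj:.1f}")
```

The two uncorrected estimates agree with the OR range of 28.9 to 49.5 quoted in the abstract; the corrected Gemini value is shown only to illustrate why a penalised or adjusted estimator is needed when a cell is empty.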
Inter-rater reliability was assessed by comparing each LLM’s screening decisions with the human reference standard using multiple agreement metrics. For each LLM, we constructed a 2 × 2 contingency table of inclusion and exclusion decisions against human reviewer decisions, where cells represented: a (both exclude), b (reviewer exclude, model include), c (reviewer include, model exclude), and d (both include), with N = 500 total abstracts.
Cohen’s kappa was calculated as:
$$\kappa = \frac{P_o - P_e}{1 - P_e}$$
where $P_o$ is the observed proportion of agreement (the proportion of times the raters agree):
$$P_o = \frac{a + d}{N}$$
The expected agreement by chance ($P_e$) was calculated as:
$$P_e = \left(\frac{a + b}{N} \cdot \frac{a + c}{N}\right) + \left(\frac{c + d}{N} \cdot \frac{b + d}{N}\right)$$
where b is the number of false positives (LLM includes, human excludes) and c is the number of false negatives (LLM excludes, human includes).
Statistical significance of κ was evaluated using the z-statistic:
$$z = \frac{\kappa}{SE_{\kappa}}$$
with the standard error computed as:
$$SE_{\kappa} = \sqrt{\frac{P_o (1 - P_o)}{N (1 - P_e)^2}}$$
Cohen’s kappa values were interpreted using the Landis and Koch classification: <0.00 = poor agreement, 0.00–0.20 = slight agreement, 0.21–0.40 = fair agreement, 0.41–0.60 = moderate agreement, 0.61–0.80 = substantial agreement, and 0.81–1.00 = almost perfect agreement [37].
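As a worked illustration of the formulas above, the GPT-4.1 counts from Section 3.1 (a = 431, b = 61, c = 1, d = 7, N = 500) can be substituted directly. The arithmetic is ours and is rounded at each step; it recovers the κ and z values reported in Section 3.3.

```latex
\begin{align*}
P_o &= \frac{431 + 7}{500} = 0.876 \\
P_e &= \frac{492}{500}\cdot\frac{432}{500} + \frac{8}{500}\cdot\frac{68}{500}
     = 0.8502 + 0.0022 = 0.8524 \\
\kappa &= \frac{0.876 - 0.8524}{1 - 0.8524} \approx 0.160 \\
SE_{\kappa} &= \sqrt{\frac{0.876 \times 0.124}{500 \times (1 - 0.8524)^2}} \approx 0.100 \\
z &= \frac{0.160}{0.100} \approx 1.60
\end{align*}
```

Because chance agreement is already about 0.85 when nearly every record is excluded by both raters, the denominator of κ is small and observed agreement of 0.876 translates into only slight agreement on the Landis and Koch scale.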
To complement Cohen’s kappa, we calculated the Matthews Correlation Coefficient (MCC) [38], which is considered more informative for binary classification tasks and less susceptible to prevalence effects:
$$\mathrm{MCC} = \frac{ad - bc}{\sqrt{(a + b)(a + c)(d + b)(d + c)}}$$
MCC ranges from −1 (complete disagreement) through 0 (random agreement) to +1 (perfect agreement). The standard error was calculated as:
$$SE(\mathrm{MCC}) = \sqrt{\frac{1 - \mathrm{MCC}^2}{N}}$$
with significance assessed via z-statistic: $z = \mathrm{MCC} / SE(\mathrm{MCC})$.
Given the potential for prevalence imbalance in screening decisions, we calculated the Prevalence-Adjusted Bias-Adjusted Kappa (PABAK) [39], which removes the influence of prevalence and bias on agreement estimates. PABAK was calculated as:
$$\mathrm{PABAK} = 2P_o - 1$$
where $P_o$ is the observed agreement. The standard error was calculated as:
$$SE(\mathrm{PABAK}) = \sqrt{\frac{4 P_o (1 - P_o)}{N}}$$
with significance assessed via z-statistic: $z = \mathrm{PABAK} / SE(\mathrm{PABAK})$.
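For illustration, substituting the observed agreement of GPT-4.1 from Section 3.1 (438 of 500 decisions matching the human reference, so that the observed agreement is 0.876) gives the following. The arithmetic is ours and is not taken from the paper's tables.

```latex
\begin{align*}
\mathrm{PABAK} &= 2 \times 0.876 - 1 = 0.752 \\
SE(\mathrm{PABAK}) &= \sqrt{\frac{4 \times 0.876 \times 0.124}{500}} \approx 0.0295 \\
z &= \frac{0.752}{0.0295} \approx 25.5
\end{align*}
```

Because PABAK depends only on observed agreement, it is unaffected by the 1.6% prevalence that depresses Cohen's kappa for the same table.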
As an alternative to Cohen’s kappa that is more stable in the presence of high agreement and trait prevalence, we calculated Gwet’s AC1:
$$\mathrm{AC1} = \frac{P_o - P_{e(\mathrm{gwet})}}{1 - P_{e(\mathrm{gwet})}}$$
where the expected agreement is calculated using marginal probabilities:
$$\pi_1 = \frac{(a + b) + (a + c)}{2N}, \qquad \pi_2 = \frac{(c + d) + (b + d)}{2N}$$
$$P_{e(\mathrm{gwet})} = \pi_1 (1 - \pi_1) + \pi_2 (1 - \pi_2)$$
The standard error was calculated as:
$$SE(\mathrm{AC1}) = \sqrt{\frac{P_o (1 - P_o)}{N \left(1 - P_{e(\mathrm{gwet})}\right)^2}}$$
with significance assessed via z-statistic: $z = \mathrm{AC1} / SE(\mathrm{AC1})$.
For all metrics, statistical significance was assessed at three levels, p < 0.05 (z > 1.645), p < 0.01 (z > 1.96), and p < 0.001 (z > 2.576), using two-tailed tests. All analyses were performed in R (RStudio 2025.09.2), using the R package ‘irr’ to verify Cohen’s kappa and custom functions to implement the remaining metrics according to the formulae above.
We report multiple inter-rater reliability metrics because each captures a different facet of agreement. Although Cohen’s kappa is widely used, it can be paradoxically low under imbalanced prevalence and skewed marginals, a phenomenon known as the “prevalence paradox” [40,41]. We nevertheless included Cohen’s kappa because of its role as a standard metric in traditional systematic reviews and its familiarity to the research community. MCC, commonly preferred in machine learning settings, provides a more balanced and symmetric summary that weighs all four confusion matrix cells equally and remains informative under class imbalance [41]. PABAK adjusts for the observed prevalence (and bias) in our dataset and yields a prevalence-robust estimate of practical agreement [40]. Gwet’s AC1 provides a more stable alternative to κ that is less sensitive to marginal distributions and mitigates prevalence effects [41]. Complementing these chance-corrected agreement indices, we computed precision-recall-based metrics (e.g., F1 score) and likelihood ratios, which directly summarise the sensitivity-specificity trade-off that is central to screening validity and are recommended for LLM evaluation [23,42]. Consistent with recent guidance, we therefore do not interpret κ in isolation but alongside classification metrics and, where possible, calibration measures, to provide a comprehensive assessment of LLM screening performance in systematic review contexts [23].

3. Results

Beginning with a randomly selected 500-article sample, gold-standard human double-blind title and abstract screening resulted in the inclusion of 8 studies. Studies excluded at this stage either did not meet the inclusion criteria, were not written in English, or did not report primary research.

3.1. Classification Performance Metrics

GPT-4.1 correctly identified 7 true positives and 431 true negatives, but produced 61 false positives and one false negative. This yielded an overall accuracy of 87.6% (Table 1; Figure 1). Sensitivity was 0.875 (95% CI: 0.529–0.978), indicating that the model retrieved seven of the eight relevant studies. Specificity was high at 0.876 (95% CI: 0.844–0.902), reflecting strong performance in ruling out irrelevant studies. The positive predictive value was low at 0.103 (95% CI: 0.051–0.198), meaning that only 10.3% of records flagged for inclusion were truly relevant. The negative predictive value was very high (0.998), which indicates highly reliable exclusion decisions. Furthermore, an “include” decision raised the odds that a study was truly relevant about sevenfold (positive likelihood ratio = 7.06; 95% CI: 3.23–15.43), whereas an “exclude” decision greatly reduced those odds (negative likelihood ratio = 0.143; 95% CI: 0.020–1.015). Claude 3.5 Sonnet produced the same confusion-matrix counts as GPT-4.1 in this 500-record subset and therefore identical classification metrics (Table 1; Figure 1), although the single false negative of each model was a different record (Section 3.2).
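The gap between a positive likelihood ratio of about 7 and a positive predictive value of about 0.10 follows from the low prevalence. A likelihood ratio multiplies the pre-test odds, not the probability, so the GPT-4.1 figures above can be linked as follows (our arithmetic, using the 8 inclusions among 500 records):

```latex
\begin{align*}
\text{pre-test odds} &= \frac{8}{492} \approx 0.0163 \\
\text{post-test odds} &= \text{pre-test odds} \times LR^{+} \approx 0.0163 \times 7.06 \approx 0.115 \\
\text{post-test probability} &= \frac{0.115}{1 + 0.115} \approx 0.103 = \mathrm{PPV}
\end{align*}
```

At 1.6% prevalence, even a sevenfold increase in the odds leaves roughly nine in ten flagged records irrelevant, which is the downstream workload that a recall-oriented screening rule accepts.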
Gemini 2.0 Flash performed strongly on retrieval. It identified all 8 true positive studies with no false negatives, alongside 422 true negatives and 70 false positives, resulting in an overall accuracy of 86.0% (Table 1; Figure 1). Sensitivity was therefore 1.0 (95% CI: 0.676–1.0), which means that the model did not miss any relevant studies in this subset. Specificity was 0.858 (95% CI: 0.824–0.886), reflecting the fact that the model included a larger number of irrelevant records than GPT-4.1 and Claude. Precision remained low (Positive predictive value = 0.103; 95% CI: 0.053–0.190) given the very low prevalence of included studies, while negative predictive value was 1.0. The positive likelihood ratio was 7.03 (95% CI: 3.382–14.606); the negative likelihood ratio was 0 (CI not estimable because FN = 0).
Mistral Large also performed strongly on retrieval and identified all 8 true positive studies with no false negatives, but at the expense of substantially weaker exclusion performance. It correctly classified 343 true negatives while incorrectly including 149 false positives, resulting in an overall accuracy of 70.2% (Table 1; Figure 1), the lowest among the evaluated models. Sensitivity was therefore 1.0 (95% CI: 0.676–1.0), whereas specificity was 0.697 (95% CI: 0.655–0.736), reflecting frequent inclusion of irrelevant records. Precision was very low (Positive predictive value = 0.051; 95% CI: 0.026–0.097), consistent with the model’s broad inclusion tendency and the low prevalence of included studies, while negative predictive value was 1.0. The positive likelihood ratio was 3.30 (95% CI: 1.621–6.725), which indicates only a modest increase in the probability of relevance following an “include” decision; the negative likelihood ratio was 0 (CI not estimable because FN = 0).
DeepSeek V3’s screening decisions were the most restrictive. The model achieved the highest overall accuracy (97.0%) and very high specificity (0.980; 95% CI: 0.963–0.989), correctly excluding 482 true negative studies while producing only 10 false positives (Table 1; Figure 1). However, this strict exclusion strategy reduced sensitivity to 0.375 (95% CI: 0.137–0.694), as the model retrieved only 3 of the 8 relevant studies and missed the other 5. In line with its selective inclusion behaviour, DeepSeek’s positive predictive value was the highest (Positive predictive value = 0.231; 95% CI: 0.082–0.503), and an “include” decision substantially increased the likelihood of true relevance (Positive likelihood ratio = 18.45; 95% CI: 5.078–67.039). Negative predictive value remained high (0.990), while the negative likelihood ratio (Negative likelihood ratio = 0.638; 95% CI: 0.264–1.540) indicates that an “exclude” decision only weakly reduced the likelihood of relevance. Moreover, across all five models, logistic regression analyses showed that models were substantially more likely to return an ‘include’ verdict when the human reviewers included the record (Appendix A.4).

3.2. Misclassifications

Across models, we observed a small number of systematic misclassifications which reflect recurrent patterns of error in LLM-based screening. Misclassifications fell into three criterion categories: a narrow interpretation of mental health (4 cases), a narrow interpretation of the target population criterion (2 cases), and incorrect classification of OECD Development Assistance Committee (DAC) list countries (1 case). The most common error pattern involved overly restrictive interpretations of the mental health criterion. DeepSeek V3 accounted for the majority of these and excluded studies addressing human well-being, physical activity as a mediating factor linking greenspaces and mental health, and emotional aspects related to greenspaces, all of which had been included by the human reviewers under the leniency rule, although some were subsequently excluded at full-text screening. The remaining DeepSeek V3 false negatives involved an overly narrow interpretation of the ‘general urban population’ criterion, and excluded papers which focused on psychiatric and hospital patients despite these being within scope. Claude 3.5 Sonnet produced one false negative by incorrectly classifying Serbia as a high-income country, despite Serbia appearing on the OECD DAC list. This reflects a factual error in the model’s assessment of the geography criterion. One GPT-4.1 false negative involved a study on Dhaka, Bangladesh, that human reviewers had themselves been uncertain about at title-abstract screening and subsequently excluded at full-text screening, which suggests that this case may reflect ambiguity in the source data rather than a model error per se. The full dataset is openly accessible at https://github.com/NIVANorge/ai-literature-review-public/blob/main/Full%20dataset.xlsx (accessed on 1 May 2025).

3.3. Inter-Rater Reliability

Cohen’s kappa values revealed modest agreement between LLMs and human reviewers (Table 2). DeepSeek V3 achieved the highest agreement (κ = 0.271, z = 1.46), corresponding to slight-to-fair agreement. GPT-4.1 and Claude 3.5 Sonnet performed identically (κ = 0.160, z = 1.60) and indicated slight agreement. Gemini 2.0 Flash showed comparable performance (κ = 0.162, z = 1.74). Mistral Large exhibited the weakest agreement (κ = 0.069, z = 1.07) and did not reach statistical significance. Overall, only Gemini’s κ was statistically significant, and the κ estimates were lower than might be expected given the models’ screening accuracy, which may suggest the presence of prevalence-related bias in the kappa statistic.
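To illustrate why κ stays low here, the following sketch computes Cohen’s κ from the confusion-matrix counts reported in Section 3.1 and also returns the observed and chance-expected agreement. The function name is illustrative. For DeepSeek V3, observed agreement is 0.970, but chance-expected agreement is already about 0.959 because both raters exclude almost everything, which leaves little room for κ to rise.

```python
def cohens_kappa(tp, fp, fn, tn):
    """Return (kappa, observed agreement, chance-expected agreement)."""
    n = tp + fp + fn + tn
    p_obs = (tp + tn) / n
    # Chance agreement: product of the two raters' marginal rates per class.
    p_yes = ((tp + fn) / n) * ((tp + fp) / n)  # both say "include"
    p_no = ((tn + fp) / n) * ((tn + fn) / n)   # both say "exclude"
    p_exp = p_yes + p_no
    return (p_obs - p_exp) / (1 - p_exp), p_obs, p_exp


# Counts for DeepSeek V3 and Mistral Large as reported in Section 3.1.
ds_kappa, ds_obs, ds_exp = cohens_kappa(tp=3, fp=10, fn=5, tn=482)
mi_kappa, mi_obs, mi_exp = cohens_kappa(tp=8, fp=149, fn=0, tn=343)

print(f"DeepSeek V3:   kappa={ds_kappa:.3f}, p_obs={ds_obs:.3f}, p_exp={ds_exp:.3f}")
print(f"Mistral Large: kappa={mi_kappa:.3f}, p_obs={mi_obs:.3f}, p_exp={mi_exp:.3f}")
```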
MCC values indicated moderate positive correlation between LLM and human decisions across models, with all MCC estimates statistically significant (p < 0.001) (Table 2). Gemini 2.0 Flash showed the strongest correlation (MCC = 0.297, z = 6.94), followed by DeepSeek V3 (MCC = 0.280, z = 6.51) and GPT-4.1 and Claude 3.5 Sonnet, which again performed identically (MCC = 0.275, z = 6.39). Mistral Large showed the weakest, but still significant, correlation (MCC = 0.188, z = 4.29). Notably, MCC z-scores were substantially higher than those for Cohen’s κ, suggesting more stable agreement estimates when all four cells of the confusion matrix are weighted symmetrically.
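As a minimal sketch of the calculation, the Matthews correlation coefficient can be computed from the same four counts. The function name is illustrative, and the counts are those reported for DeepSeek V3 and Mistral Large in Section 3.1.

```python
from math import sqrt


def matthews_corrcoef_from_counts(tp, fp, fn, tn):
    """Matthews correlation coefficient for a binary confusion matrix."""
    numerator = tp * tn - fp * fn
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any margin is empty.
    return numerator / denominator if denominator else 0.0


mcc_deepseek = matthews_corrcoef_from_counts(tp=3, fp=10, fn=5, tn=482)
mcc_mistral = matthews_corrcoef_from_counts(tp=8, fp=149, fn=0, tn=343)
print(round(mcc_deepseek, 3), round(mcc_mistral, 3))
```

Because the numerator multiplies true positives by true negatives, the large number of correct exclusions cannot raise MCC on its own: the 149 false positives in Mistral Large's denominator keep its value low despite perfect recall.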
PABAK values revealed substantially higher agreement than unadjusted Cohen’s κ, indicating that class imbalance suppressed the original κ estimates (Table 2). DeepSeek V3 again achieved very high agreement (PABAK = 0.940, z = 61.61, p < 0.001), which indicates agreement with human reviewers in 97.0% of decisions. GPT-4.1 (PABAK = 0.752, z = 25.51, p < 0.001) and Claude 3.5 Sonnet (PABAK = 0.752, z = 25.51, p < 0.001) again performed identically, while Gemini 2.0 Flash showed slightly lower, albeit still substantial, agreement (PABAK = 0.720, z = 23.20, p < 0.001). Mistral Large, on the other hand, showed markedly lower agreement (PABAK = 0.404, z = 9.88, p < 0.001), corresponding to 70.2% agreement with human decisions. Overall, PABAK z-scores were far higher than Cohen’s κ z-scores. This is consistent with a strong observed agreement rate that is not well captured by Cohen’s κ under pronounced class imbalance.
Gwet’s AC1 coefficients, which are generally more stable than Cohen’s κ under high overall agreement and strong class imbalance, aligned with the PABAK results (Table 2). DeepSeek V3 showed very high agreement (AC1 = 0.969, z = 121.76, p < 0.001), the highest value across models. GPT-4.1 and Claude 3.5 Sonnet again performed identically (AC1 = 0.856, z = 49.90, p < 0.001), while Gemini 2.0 Flash showed similarly high agreement (AC1 = 0.834, z = 45.29, p < 0.001). Mistral Large demonstrated substantially lower, though still statistically significant, agreement (AC1 = 0.589, z = 20.85, p < 0.001). The higher PABAK and AC1 values suggest that the low κ estimates are mainly due to the strong imbalance between inclusions and exclusions, rather than poor agreement between LLMs and human reviewers.
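The two prevalence-robust indices differ from Cohen’s κ only in how chance agreement is defined. The sketch below uses the standard two-rater, two-category formulas: PABAK fixes chance agreement at 0.5, and Gwet’s AC1 uses 2π(1 − π), where π is the mean "include" rate of the two raters. Function names are illustrative, and the counts are those reported in Section 3.1.

```python
def pabak(tp, fp, fn, tn):
    """Prevalence- and bias-adjusted kappa: 2 * observed agreement - 1."""
    n = tp + fp + fn + tn
    return 2 * (tp + tn) / n - 1


def gwet_ac1(tp, fp, fn, tn):
    """Gwet's AC1 for two raters and two categories."""
    n = tp + fp + fn + tn
    p_obs = (tp + tn) / n
    # Mean proportion of "include" decisions across the two raters.
    pi = ((tp + fn) / n + (tp + fp) / n) / 2
    p_exp = 2 * pi * (1 - pi)
    return (p_obs - p_exp) / (1 - p_exp)


counts = {
    "DeepSeek V3": dict(tp=3, fp=10, fn=5, tn=482),
    "Mistral Large": dict(tp=8, fp=149, fn=0, tn=343),
}
results = {m: (pabak(**c), gwet_ac1(**c)) for m, c in counts.items()}
for model, (p, a) in results.items():
    print(f"{model}: PABAK={p:.3f}, AC1={a:.3f}")
```

When "include" decisions are rare, π is small and the AC1 chance term approaches zero, so AC1 stays close to observed agreement, whereas the chance term in Cohen’s κ approaches one.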
On κ, PABAK, and AC1, DeepSeek V3 showed the highest agreement with the human gold standard, whereas Gemini 2.0 Flash had the highest MCC; GPT-4.1, Claude 3.5 Sonnet, and Gemini 2.0 Flash otherwise clustered closely together, and Mistral Large consistently ranked lowest (Table 2). The contrast between low κ and substantially higher PABAK/AC1 is consistent with class imbalance at screening, where exclusion decisions dominate and the chance correction in Cohen’s κ can be overly influential. Accordingly, the prevalence-adjusted indices indicate that all models except Mistral Large achieved substantial-to-almost-perfect agreement with human reviewers in practical terms, even though traditional Cohen’s κ suggests only slight-to-fair agreement.

4. Discussion

The screening stage of systematic reviews is widely recognised as one of the most laborious and time-intensive components of evidence synthesis. Traditional manual methods, while rigorous, are slow and resource-intensive. In response, recent research has begun exploring the use of LLMs to automate this process. Against this background, the present study contributes a multi-metric validation of contemporary LLMs for title-abstract screening and explicitly examines how conclusions about model performance depend on the choice of evaluation framework. In recent evaluation studies, LLMs have demonstrated perfect recall in screening datasets and the potential to save up to 75% of manual effort while retaining comprehensive inclusion of relevant studies [21]. Another study presented an end-to-end LLM workflow that achieved nearly complete exclusion of irrelevant records while preserving all relevant studies and reducing manual workload by over 95% [22].
In our validation using a 500-article gold-standard screening dataset [25], GPT-4.1 and Claude 3.5 Sonnet showed high sensitivity (recall of relevant studies) and high overall accuracy, closely matching human reviewer decisions. DeepSeek V3, by contrast, achieved the highest specificity, the highest inter-rater reliability on most metrics (κ, PABAK, AC1), and the strongest logistic regression fit (lowest AIC), yet at the cost of reduced sensitivity, excluding some studies that met the inclusion criteria. Gemini 2.0 Flash showed no false negatives in the dataset, but had lower specificity and precision than GPT-4.1 and Claude 3.5 Sonnet, leading to more false positives and only moderate overall effectiveness depending on the evaluation metric.
These findings highlight a fundamental trade-off in screening automation. Prioritising high sensitivity to minimise missed relevant studies generally increases the false positive rate, thus requiring more human effort to weed out irrelevant records. However, given that missing relevant studies during screening can systematically bias review conclusions [43], prior studies have recommended high recall thresholds of ≥ 95% for screening tools to avoid bias through omitted evidence, even if this increases false positives [21,22]. In our study, Gemini 2.0 Flash and Mistral Large met this threshold (sensitivity 1.00), whereas GPT-4.1 and Claude 3.5 Sonnet (0.875) and DeepSeek V3 (0.375) fell below it. From a systematic review perspective, this suggests that models optimised for specificity and agreement may appear statistically reliable yet be less suitable in practice due to an elevated risk of false negatives. DeepSeek V3 illustrates this: its conservative exclusion strategy yielded the highest specificity and high inter-rater reliability statistics (κ, PABAK, AC1), but at the expense of recall, which increases the risk of bias from missed studies. Because inclusions were rare in our dataset (n = 8), sensitivity estimates and odds ratios reported in this study should be interpreted with caution, given that small changes in classification would materially affect these results.
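The caution about the small number of inclusions can be made concrete with simple arithmetic. The sketch below, using illustrative names, lists the sensitivity that results from each possible number of missed studies when only 8 records are relevant, and checks it against the ≥ 95% recall threshold cited above.

```python
N_INCLUDED = 8           # relevant records in the gold-standard dataset
RECALL_THRESHOLD = 0.95  # recall threshold recommended in prior studies


def sensitivity_after_misses(misses, n_included=N_INCLUDED):
    """Sensitivity when `misses` of the relevant records are excluded."""
    return (n_included - misses) / n_included


# Largest number of missed studies that still satisfies the threshold.
max_misses_allowed = max(
    m for m in range(N_INCLUDED + 1)
    if sensitivity_after_misses(m) >= RECALL_THRESHOLD
)

for m in range(0, 6):
    s = sensitivity_after_misses(m)
    status = "meets" if s >= RECALL_THRESHOLD else "below"
    print(f"{m} missed -> sensitivity {s:.3f} ({status} threshold)")
print("Maximum misses allowed:", max_misses_allowed)
```

Each missed study shifts sensitivity by 12.5 percentage points, so with this dataset a model meets the threshold only if it misses nothing. A single borderline record therefore decides whether a model passes or fails.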
Notably, the absence of false negatives for Gemini 2.0 Flash and Mistral Large in our dataset translated into strong positive associations in logistic regression analyses, which means that these models were much more likely to include a study when it was also included by human reviewers. In this context, regression provides a complementary perspective by quantifying agreement conditional on human inclusion decisions, rather than overall classification accuracy alone. GPT-4.1 and Claude 3.5 Sonnet also showed positive and statistically significant slopes, but their sensitivity was below the 95% recall threshold used in prior recommendations. DeepSeek V3’s conservative behaviour led to strong model fit (lowest AIC) and an elevated odds ratio, despite low sensitivity in the classification metrics. Across methods, Mistral Large showed high odds ratios but comparatively poor overall fit (high AIC), which is consistent with its highly inclusive screening approach and shows that regression estimates can be inflated under broad inclusion strategies and low-prevalence settings, particularly where separation is present.
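The role of separation can be illustrated without fitting a model. With a single binary predictor, the maximum-likelihood odds ratio from logistic regression equals the cross-product ratio of the 2×2 table, which is undefined (infinite) when a cell is zero. The sketch below uses the counts from Section 3.1. The optional 0.5 continuity correction (Haldane-Anscombe) is shown only as a simple stand-in to obtain a finite value; it is not the Firth-penalised estimation used in this study, and its output should not be read as a reported result.

```python
from math import inf, isfinite


def odds_ratio(tp, fp, fn, tn, correction=0.0):
    """Cross-product odds ratio of a 2x2 table.

    `correction` is added to every cell; 0.5 gives the Haldane-Anscombe
    correction (an illustrative stand-in, not Firth's penalised likelihood).
    """
    a, b, c, d = (x + correction for x in (tp, fp, fn, tn))
    if b * c == 0:
        return inf  # separation: no finite maximum-likelihood estimate
    return (a * d) / (b * c)


or_deepseek = odds_ratio(tp=3, fp=10, fn=5, tn=482)
or_mistral_raw = odds_ratio(tp=8, fp=149, fn=0, tn=343)
or_mistral_adj = odds_ratio(tp=8, fp=149, fn=0, tn=343, correction=0.5)

print(f"DeepSeek V3 OR: {or_deepseek:.2f}")
print(f"Mistral Large OR (uncorrected): {or_mistral_raw}")
print(f"Mistral Large OR (0.5 correction, illustrative): {or_mistral_adj:.2f}")
```

The DeepSeek V3 value of about 28.9 appears consistent with the lower end of the odds ratio range given in the abstract. For Mistral Large, the zero false negatives make the uncorrected ratio infinite, which is why some form of penalised or corrected estimation is needed.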
Inter-rater reliability analyses revealed a clear prevalence paradox: Cohen’s κ substantially underestimated agreement between LLMs and human reviewers despite high observed agreement [40,44]. While Cohen’s κ suggested only slight-to-fair agreement, prevalence-robust metrics (PABAK, Gwet’s AC1) and MCC indicated moderate-to-near-perfect agreement, with the strongest agreement for DeepSeek V3 and consistently high agreement for GPT-4.1, Claude 3.5 Sonnet, and Gemini 2.0 Flash, but only moderate agreement for Mistral Large. This discrepancy may arise because the chance-agreement term in Cohen’s κ becomes inflated under severe class imbalance, as exclusion decisions dominate abstract screening, thereby penalising high-sensitivity models that increase disagreement on the rare inclusion class [41].
Consistent with this mechanism, the most conservative model, DeepSeek V3, achieved higher Cohen’s κ values, whereas high-sensitivity models tended to show lower Cohen’s κ despite stronger alignment with screening priorities. MCC provided a more balanced assessment by weighting all confusion-matrix cells equally and remains informative under class imbalance [38], while PABAK and Gwet’s AC1 offered more stable estimates of practical agreement by attenuating prevalence effects. Interpreting these metrics jointly clarifies that low Cohen’s κ values primarily reflected prevalence effects rather than weak alignment between human and LLM screening decisions.
Furthermore, our study highlights that logistic regression can favour over-inclusive models through inflated odds ratios, particularly under separation where penalised estimation is required, while Cohen’s κ systematically underestimated agreement. In contrast, the combined MCC/PABAK/AC1 profile more closely tracked the high observed agreement in the data. These findings support recent recommendations that Cohen’s κ should not be interpreted in isolation when evaluating LLM-assisted screening and that prevalence-adjusted and balanced metrics provide a more valid basis for assessing agreement in evidence synthesis tasks [23,42].
Developing a reliable and high-quality prompt that resulted in few false negatives proved to be a time-intensive and iterative process. LLM performance is sensitive to prompt design, and prompt development requires multiple rounds of manual evaluation, refinement, and testing [45,46]. The performance reported in this study should therefore be interpreted as conditional on prompt optimisation rather than as an intrinsic property of the models alone. In particular, our prompt operationalised only those inclusion/exclusion criteria that could plausibly be assessed from titles and abstracts. This means that our prompt did not contain a screening instruction to control for study design, in contrast to the original review. This may have influenced the rate of false positives and false negatives. Additionally, the LLMs’ tendency towards overly narrow criterion interpretation in our study, particularly regarding the scope of mental health and the eligibility of the target population, suggests that prompt specificity and explicit operationalisation of inclusion criteria are important determinants of screening performance [47]. Because we refined the prompt iteratively by inspecting the outputs from the evaluated corpus of studies, results may be modestly optimistic due to overfitting to a reused test set. Future work should separate prompt development from final evaluation using a hold-out set or prospective screening. Furthermore, we did not conduct repeated-run stability testing, such as multiple identical API calls per model, to quantify run-to-run variability under otherwise fixed settings. Accordingly, although temperature was set to 0.0, slight output variation cannot be ruled out in API-based use cases [35]. Therefore, future work should assess stability via repeated runs and report variability in both outputs and decisions.
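One way such a repeated-run stability check could be summarised is sketched below. This is not part of the present study: the function name is illustrative and the decision lists are hypothetical placeholder data, standing in for the outputs of several identical API calls on the same records.

```python
from collections import Counter


def decision_stability(runs):
    """Summarise run-to-run stability of screening decisions.

    `runs` is a list of runs; each run is a list of decisions
    ("include"/"exclude"), one per record, in the same record order.
    Returns (per-record share of the modal decision, share of records
    on which all runs agree).
    """
    n_records = len(runs[0])
    if any(len(run) != n_records for run in runs):
        raise ValueError("All runs must cover the same records.")
    per_record = []
    for i in range(n_records):
        counts = Counter(run[i] for run in runs)
        per_record.append(counts.most_common(1)[0][1] / len(runs))
    unanimous = sum(1 for share in per_record if share == 1.0) / n_records
    return per_record, unanimous


# Hypothetical decisions from three identical runs over five records.
hypothetical_runs = [
    ["include", "exclude", "exclude", "include", "exclude"],
    ["include", "exclude", "include", "include", "exclude"],
    ["include", "exclude", "exclude", "include", "exclude"],
]
per_record, unanimous = decision_stability(hypothetical_runs)
print(per_record, unanimous)
```

Reporting both the share of unanimous records and the specific records that flip between runs would show whether instability is concentrated in borderline cases.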
Furthermore, this iterative approach is not a one-time task, but a necessary component of integrating LLMs into systematic review workflows [48,49]. Therefore, the time required must be factored into the decision about whether to use such automation tools [23]. Skilled researchers may still outperform LLMs on some tasks. Prompt design and prompt format (zero-shot, few-shot, chain-of-thought) have been shown to considerably affect LLM screening accuracy [50]. Therefore, prompt engineering remains a critical consideration when integrating LLMs into screening workflows, particularly when high recall is required to minimise bias [51].
While LLMs offer substantial potential to reduce workload and speed up screening, ethical safeguards remain essential [52]. Beyond efficiency, concerns about trustworthiness and reliability are important. LLMs can misrepresent or oversimplify scientific findings. For instance, some studies have shown up to five times greater risk of overgeneralisation compared to human experts, particularly in medical contexts [53]. These risks are compounded by persistent ethical challenges related to bias, fairness, and transparency, as well as hallucinations and inaccurate outputs that can undermine the integrity of evidence synthesis [54,55]. Moreover, key dimensions such as safety, robustness, and explainability remain underexplored, and existing evaluation frameworks are insufficient for high-stakes domains such as healthcare and policy-relevant research [56,57].
These concerns highlight the importance of transparency, accountability, and interpretability in LLM-assisted screening, as emphasised in the recent methodological and ethical literature [58]. In particular, maintaining human oversight and critical judgement remains essential, especially where screening decisions may influence downstream policy or clinical practice. Calls for ethical guidance and human oversight are recurrent; however, rather than treating oversight as a binary requirement, the ethical debate may benefit from being reframed towards defining what constitutes acceptable human oversight across different applications [52]. This reframing requires explicit consideration of context-specific risk, uncertainty, and potential harm.
At the same time, ethical evaluation must be balanced against the potential benefits of automation. LLM-assisted screening can contribute to substantial time and cost savings, may improve consistency, and can free up expert time for more complex analytical and interpretive tasks [21,22,23]. Nonetheless, no automated approach can fully replace domain expertise or methodological oversight, and rapid developments in model architectures mean that performance characteristics may change over time, raising challenges for reproducibility and long-term validity [17,58]. Additional risks include opaque decision-making processes, reinforcement of existing biases, and equity concerns, particularly if LLM outputs are reused or “fed” into subsequent automated systems without adequate validation or transparency [16,54,55].
These considerations highlight the need for continued human oversight to maintain rigorous quality control in LLM-supported systematic reviews [23]. At the same time, they point to the necessity of evolving ethical frameworks that accompany the deployment of LLMs in research evaluation, moving beyond ad hoc safeguards towards clearer standards for accountability, oversight, and acceptable risk across different applications [59]. In this study, we sought to mitigate some of these concerns by closely replicating established human screening procedures and maintaining explicit control over the screening process, even as new model versions and tools continue to emerge.
Several limitations of the present study warrant consideration. First, the gold-standard dataset was published openly in 2021 [25], meaning records may have appeared in the training corpora of the evaluated models. We cannot fully rule this out, as training data for proprietary models are not publicly disclosed. However, we consider direct memorisation of screening decisions unlikely, because the prompt requires the models to apply eligibility criteria and provide structured justifications rather than reproduce text verbatim. Moreover, evidence indicates that memorised training content is most evident as verbatim text sequences and typically requires specifically targeted queries [60]. Nevertheless, it still represents a threat to validity without a straightforward mitigation strategy [61]. In addition, we used zero-shot prompts without exposure to labelled outcomes, and LLM pretraining on scientific text is not known to encode study-level inclusion decisions from systematic review workflows. This remains an inherent limitation of validating LLMs on openly available datasets, and future studies should, where possible, use prospective or unpublished gold-standard corpora to eliminate this concern [62]. Second, all five models were accessed via proprietary APIs, limiting reproducibility and independent verification, which is a recognised challenge in LLM evaluation for evidence synthesis [63]. Future work should include open-weight alternatives such as locally deployed Llama or Mistral variants to enable fully reproducible pipelines.
Future research should systematically evaluate the performance of LLMs across a broader range of review topics, disciplines, languages, and inclusion criteria to strengthen generalisability and reproducibility. To date, most validation studies have focused on biomedical and environmental sciences, leaving substantial gaps in other fields such as the social sciences, engineering, and interdisciplinary research, where screening criteria and evidence standards may differ markedly [7,23]. Expanding evaluations to multilingual and non-English corpora is particularly important, as language bias may affect both screening accuracy and equity in global evidence synthesis [32].
Reproducibility represents a critical and underexplored gap. Future studies should explicitly test the stability of LLM screening performance across repeated runs, model versions, access modalities, and time, as well as assess sensitivity to prompt variation. Such work is essential given the limited transparency of many LLMs and the potential for model-specific biases, including overly strict or literal interpretations of eligibility criteria that may lead to systematic misclassification. Addressing these issues will be central to building trust in LLM-assisted screening and to distinguishing genuine performance differences from artefacts of model drift or prompt instability.
Methodological standardisation remains a key gap. Future studies would benefit from clearer reporting standards specifying model version, access modality, prompt strategy, and evaluation metrics, thereby enabling meaningful comparison across studies and over time. This need aligns with the 2025 joint position statement on artificial intelligence in evidence synthesis issued by Cochrane, the Campbell Collaboration, JBI, and the Collaboration for Environmental Evidence, which points out the importance of transparency, reproducibility, and human oversight as preconditions for responsible AI use [64]. Such guidance is particularly salient given the black-box nature of many LLMs and the rapid pace of model development, which complicates replication and trust.
Future work should also extend validation beyond title-abstract screening to full-text screening, where inclusion decisions are more complex and misclassification risks may differ. In addition, systematic exploration of prompt variation, prompt formats, and domain adaptation remains limited but necessary, as these design choices can substantially influence screening performance. Hybrid human-LLM workflows warrant further investigation as a pragmatic means of optimising both sensitivity and efficiency while safeguarding review quality.
Finally, benchmarking LLMs using shared datasets and reproducible pipelines, with explicit attention to consistency across model versions, is essential [23]. Given the rapid evolution of LLM capabilities, repeated validation over time will be necessary to ensure that conclusions remain robust and applicable, and to support the development of trustworthy standards for integrating LLMs into systematic review protocols and evidence synthesis workflows.

5. Conclusions

This study provides a rigorous, multi-metric validation of contemporary LLMs for title and abstract screening in systematic reviews, using a gold-standard human reference dataset and a conservative, zero-shot prompting framework. Across models, we find that LLMs can achieve high sensitivity and substantial agreement with human reviewers, but that performance varies markedly depending on the evaluation metrics used. In particular, the reliance on single indicators such as overall accuracy or Cohen’s κ can yield misleading conclusions in low-prevalence screening contexts, whereas prevalence-robust agreement measures and classification metrics better reflect screening priorities.
Our findings demonstrate that models optimised for recall are better aligned with systematic review objectives, even when this comes at the cost of increased false positives and downstream screening effort. Conversely, models prioritising specificity may appear statistically reliable yet pose a greater risk of bias through missed relevant studies. These trade-offs underscore the importance of aligning model selection and evaluation with the substantive goals of evidence synthesis rather than abstract notions of agreement alone.
More broadly, this study shows that LLM-assisted screening is not a plug-and-play solution. Performance is contingent on prompt design, metric choice, and continued human oversight. When deployed conservatively and transparently, LLMs can meaningfully reduce screening workload while preserving review validity. However, their integration into systematic review workflows must be accompanied by clear reporting standards, reproducibility testing, and evolving ethical guidance.
Our results support the use of LLMs as decision-support tools rather than autonomous reviewers and highlight the need for multi-metric validation frameworks that reflect the methodological realities of evidence synthesis. As LLM capabilities continue to evolve, repeated and standardised validation will be essential to ensure that efficiency gains do not come at the expense of rigour, trust, or transparency in systematic reviews.

Author Contributions

M.N.: Conceptualisation, Methodology, Investigation, Data Curation, Formal Analysis, Writing—Original Draft, Writing—Review and Editing; A.M.: Conceptualisation, Methodology, Investigation, Data Curation, Formal Analysis, Writing—Review and Editing; J.K.: Conceptualisation, Methodology, Investigation, Data Curation, Writing—Review and Editing; S.A.W.: Writing—Review and Editing; M.R.: Writing—Review and Editing; I.S.-D.: Conceptualisation, Funding acquisition, Writing—Review and Editing. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partly funded by the Research Council of Norway (contract number 342628/L10).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LLM: Large Language Model

Appendix A

Appendix A.1. Search Strategy for Nawrath et al. (2021) [25]

  • Research questions
While previous studies from high-income countries (HICs) have shown that greenspaces can generally contribute to mental health, this study provides novel insights by systematically reviewing the literature to examine the extent, quality and geographic characteristics of the scientific evidence regarding the mental health benefits of urban greenspaces in LMICs (based on the OECD DAC list [65]).
  • The research questions were:
(1) Do greenspaces promote good mental health of urban residents in LMICs?
(2) What are the geographic characteristics of the evidence from LMICs?
(3) Which contextual factors mediate and moderate how greenspaces and mental health are associated in LMICs?
(4) How were greenspaces assessed and which mental health outcomes were studied in LMICs?
  • Definitions
We defined urban greenspaces as all forms of ‘living nature’ of flora and fauna in cities, together with still and running water [66], including maintained and unmaintained environmental areas such as nature reserves, wilderness environments, urban parks [67] as well as urban wildlife.
Low-income countries are those with a Gross National Income (GNI) per capita of US $1045 or less in 2013, lower-middle-income countries are those with GNI per capita between US $1046–US $4125, upper-middle-income countries are those with GNI per capita US $4126–US $12,745, and those with GNI per capita higher than US $12,745 are high-income countries [65].
We used the [68] definition of health as ‘a state of well-being in which every individual realises their own potential, can cope with the normal stresses of life, can work productively and fruitfully, and is able to make a contribution to their community’. Mental health encompasses the presence of mental well-being and the absence of mental illness. Mental well-being is ‘the psychological, cognitive and emotional quality of a person’s life. This includes the thoughts and feelings that individuals have about the state of their life, and a person’s experience of happiness’ [69]. Mental well-being ‘comprises happiness and life satisfaction (hedonic well-being), and fulfilment, functioning and purpose in life (eudaimonic well-being), and therefore is a multi-dimensional measure of positive mental health’ [70]. Mental illness comprises the occurrence of disorders of cognition, affect and behaviour, defined through “The Diagnostic and Statistical Manual of Mental Disorders” [71]. These include conditions such as depression, anxiety, substance use disorders, as well as illnesses such as schizophrenia and autism.
Many cities in LMICs are characterised by informal settlements and slums. The former are defined as ‘residential areas where (1) inhabitants have no security of tenure, with modalities ranging from squatting to informal rental housing, (2) the neighbourhoods usually lack, or are cut off from, basic services and city infrastructure and (3) the housing may not comply with current planning and building regulations, and is often situated in geographically and environmentally hazardous areas’. Slums are the most deprived and excluded form of informal settlements [72]. While slums are characterised by poverty and substandard living conditions, informal settlements may have very good living conditions.
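The income thresholds given in the definitions above can be expressed as a small lookup. This is only a sketch of the stated 2013 GNI per capita bands, with illustrative function names. In the review itself, eligibility was determined by whether a country appeared on the DAC list [65], not by computing income groups from GNI values, and the handling of non-integer values between bands is an assumption.

```python
def income_group(gni_per_capita_usd):
    """Income group from 2013 GNI per capita (US$), per the bands above.

    Assumption: values falling between the integer band edges
    (e.g. 1045.50) are assigned to the higher band.
    """
    if gni_per_capita_usd < 0:
        raise ValueError("GNI per capita cannot be negative.")
    if gni_per_capita_usd <= 1045:
        return "low"
    if gni_per_capita_usd <= 4125:
        return "lower-middle"
    if gni_per_capita_usd <= 12745:
        return "upper-middle"
    return "high"


def is_lmic(gni_per_capita_usd):
    """True for low- and middle-income groups, False for high income."""
    return income_group(gni_per_capita_usd) != "high"


# Values at the band edges.
for value in (1045, 1046, 4125, 4126, 12745, 12746):
    print(value, income_group(value), is_lmic(value))
```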
  • Search strategy
Electronic databases. Comprehensive literature searches of electronic databases were conducted in Web of Science Core Collection, Medline, Embase and CAB Abstracts. The search strategies for each database were peer-reviewed and approved by the information specialist Natalie King (School of Medicine, University of Leeds). Searches used queries that target studies on (1) greenspaces and (2) mental health in (3) urban areas in LMICs. The selection of search terms relating to (1) greenspaces followed the methodology used in a previous review on the mental health benefits of exposure to greenspaces [73] (Table A1). The (2) mental health and (3) LMIC (LMIC ODA DAC 2003–2020 [65]) search terms were adopted from peer-reviewed search filters developed by the University of Leeds Institute of Health Sciences, which were optimised for the Medline database [74]. The Medline search filters were then translated for use in Embase, CAB Abstracts and Web of Science by one of the authors (MN). Studies were also included if they were conducted in a country that was listed as an LMIC on the DAC list at the time the study was published but has since graduated from the list.
Table A1. List of search terms, which were translated into search strategies for the electronic databases Web of Science, Medline, Embase and CAB Abstracts. This table shows an abbreviated presentation of all variations of search terms. In the search, truncations, synonyms and different spelling and word variants of the search terms were included as well.
Greenspaces: Greenspace, blue space, open space, urban park, urban forest, urban tree, urban ecosystem, urban green, urban blue, urban agriculture, natural environment, biodiversity, species richness, nature reserve, wilderness environment, spontaneous vegetation
Mental Health & Well-Being: Mental health, mental well-being, well-being, mental, psychiatric, psychologic, depression, MDD, anxiety, phobia, agoraphobia, dysthymia, ADNOS, schizophrenia, hebephrenia, oligophrenia, akathisia, neuroleptic-induced deficit syndrome, tardive dyskinesia, movement disorders, somatoform, somatisation, hysteria, briquet, multisomatic, MUPs, medically unexplained, dissociative disorders, dissociative reactions, dissociation, affective disorders, PTSD, psychological trauma, combat disorders, stress disorders, cognitive disorders, personality disorders, impulse control disorders, mood disorders, paranoid disorders, psychotic disorders, neurological disorders, nervous disorders, nervous system disorders, eating disorders, bipolar disorders, behavioural disorders, obsessive disorders, compulsive disorders, panic disorders, mood disorders, delusional disorders, trichotillomania, OCD, GAD, stress reaction, acute stress, neurosis, stress syndrome, pain disorder, dementia, Alzheimer, epilepsy, substance abuse disorders, personality disorders, sleep disorders
Study Location (LMICs): Using the Development Assistance Committee country classification list [65]
Study Location (Urban): Urban, city, town
Inclusion and exclusion criteria. To be included in the review, studies needed to be published in a peer-reviewed journal and be written in English. Study types to be included were randomised controlled trial studies, cohort studies, case–control studies, cross-sectional studies, before and after studies, time series, longitudinal studies and qualitative studies. Studies needed to involve aspects of urban greenspaces and mental health and consider one or more LMICs. Case reports, reviews, opinion pieces, editorials, comments, news, letters and grey literature were excluded from the review.
Population. The general urban population of upper/lower-middle-income and low-income countries, as defined by the OECD’s Development Assistance Committee (DAC), was considered [65].
Data screening. Studies were selected in a two-stage screening process performed independently by two researchers (MN and SG). In the first stage, the titles and abstracts of all database search results were screened to select studies that matched the stated eligibility criteria. Doubts regarding the inclusion or exclusion of studies were resolved by discussion between the two researchers.
Data extraction. Full-text screening of the selected studies was conducted as the second screening stage. Using the stated eligibility criteria, a data extraction form was designed to include information on the following:
  • Authors, title, year of publication;
  • Objectives;
  • Study population;
  • Methods, study design;
  • Health outcome and measures;
  • Measure of greenspace;
  • General results.

Appendix A.2. Prompts

I am screening papers for a systematic literature review.
The topic of the systematic review is assessing links between urban greenspaces and mental health in low- and middle-income countries. The general urban population of upper/lower-middle-income and low-income countries, as defined by OECD’s Development Assistance Committee (DAC) is included. Studies from high-income countries are excluded.
The study should focus exclusively on this topic.
Decide if the following article should be included or excluded from the systematic review. I give the title and abstract of the article as input.
Please respond with a plain JSON, without any formatting or backticks, that adheres to the following format:
{
  "verdict": "<your verdict here, either 'include' or 'exclude'>",
  "explanation": "<detailed explanation to justify your verdict here>",
  "confidence": "<confidence level of your decision here>"
}
Be lenient. I prefer including papers by mistake rather than excluding them by mistake.
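As an illustration of how a reply in the requested JSON format might be handled downstream, the following is a minimal Python sketch. The function name `parse_screening_response` and the `needs_review` fallback label are hypothetical and are not taken from the study's code; the fallback is an assumption chosen to match the lenient intent of the prompt, so that malformed replies are routed to a human rather than silently excluded.

```python
import json

VALID_VERDICTS = {"include", "exclude"}


def parse_screening_response(raw: str) -> dict:
    """Parse one model reply into verdict, explanation and confidence.

    Hypothetical helper: replies that are not valid JSON, or that carry an
    unexpected verdict, are flagged as 'needs_review' (an assumed label).
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Keep the raw text so a human reviewer can inspect it.
        return {"verdict": "needs_review", "explanation": raw, "confidence": None}

    if not isinstance(data, dict):
        return {"verdict": "needs_review", "explanation": raw, "confidence": None}

    # Normalise case and whitespace before validating the verdict.
    verdict = str(data.get("verdict", "")).strip().lower()
    if verdict not in VALID_VERDICTS:
        verdict = "needs_review"

    return {
        "verdict": verdict,
        "explanation": data.get("explanation", ""),
        "confidence": data.get("confidence"),
    }
```

Routing unparseable replies to manual review keeps the pipeline conservative with respect to false negatives, which is the error type the prompt asks the model to avoid.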

Appendix A.3. Formulas for Classification Performance Metrics

Total Accuracy (TA):
$$TA = \frac{TP + TN}{TP + TN + FP + FN}$$
Sensitivity (S):
$$S = \frac{TP}{TP + FN}$$
Specificity (E):
$$E = \frac{TN}{TN + FP}$$
Positive Predictive Value (PPV):
$$PPV = \frac{TP}{TP + FP}$$
Negative Predictive Value (NPV):
$$NPV = \frac{TN}{TN + FN}$$
Positive Likelihood Ratio (LR+):
$$LR^{+} = \frac{\text{Sensitivity}}{1 - \text{Specificity}} = \frac{TP / (TP + FN)}{FP / (FP + TN)}$$
Negative Likelihood Ratio (LR−):
$$LR^{-} = \frac{1 - \text{Sensitivity}}{\text{Specificity}} = \frac{FN / (TP + FN)}{TN / (FP + TN)}$$
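The formulas above can be evaluated directly from the four confusion-matrix counts. The following Python sketch is illustrative only; the function name `classification_metrics` is an assumption, and no guard is included for empty rows or columns (for example, a model that includes nothing would make PPV undefined).

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the Appendix A.3 metrics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        # LR+ is unbounded when there are no false positives.
        "lr_pos": sensitivity / (1 - specificity) if fp > 0 else float("inf"),
        # LR- is unbounded when there are no true negatives.
        "lr_neg": (1 - sensitivity) / specificity if tn > 0 else float("inf"),
    }


# Counts for DeepSeek V3 as reported in Table 1.
deepseek = classification_metrics(tp=3, tn=482, fp=10, fn=5)
```

With the DeepSeek V3 counts from Table 1, this yields an accuracy of 0.970, a sensitivity of 0.375 and a specificity of about 0.980, in line with the reported values.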

Appendix A.4. Regression Analysis

GPT-4.1 and Claude 3.5 Sonnet showed identical regression results, each with a slope estimate of 3.90 (SE = 1.08) and an odds ratio (OR) of 49.46 (95% CI: 5.98–408.93) (Table A2). Both models showed strong predictive associations and identical model fit (AIC = 378.82). The results suggest that the odds of GPT-4.1 or Claude 3.5 Sonnet including a study were about 49 times higher when the study had also been included by human reviewers.
Gemini 2.0 Flash showed a strong association under Firth penalised logistic regression (used because the standard GLM exhibited separation), with a slope coefficient of 4.62 and an OR of 101.88 (95% CI: 12.47–13,227.13). This implies that the odds of Gemini including a study were about 102 times higher when it was included by human reviewers. AIC is not directly comparable here, because it is not produced on the same basis under the penalised model.
Mistral Large likewise required Firth penalised logistic regression. It showed a slope coefficient of 3.67 and an OR of 39.06 (95% CI: 4.82–5062.47), indicating a strong positive association between human and model inclusion decisions. As with Gemini, AIC is not directly comparable under the penalised model.
DeepSeek V3 also demonstrated a strongly positive association, with a slope coefficient of 3.36 (SE = 0.80) and an OR of 28.92 (95% CI: 6.06–137.95). It achieved the lowest AIC among the models estimated via standard GLM (AIC = 112.30), indicating the best overall fit in that subset. This suggests that although DeepSeek V3 included relatively few studies, its inclusion decisions were highly aligned with human inclusion decisions when judged via this regression.
Table A2. Logistic regression summary comparing Large Language Model (LLM) predictions to human screening decisions (title-abstract stage, N = 500). For GPT-4.1, Claude 3.5 Sonnet, and DeepSeek V3, we used a standard logistic regression (Generalised Linear Model (GLM), Maximum Likelihood Estimation (MLE)). For Gemini and Mistral, we used Firth penalised logistic regression due to (quasi-) separation in the MLE fits (inflated coefficients/SEs). Coefficients represent the log-odds of the model predicting inclusion given the human label. Odds ratios are reported for β1. No fit criterion is reported for Firth-penalised models in this table. For Firth-penalised models, p-values and 95% confidence intervals are based on the profile penalised log-likelihood; for maximum-likelihood models, the test statistic is the Wald z. Significance: *** p < 0.001. Statistically significant coefficients are highlighted in bold.
| Model | Intercept β0 (SE) | Slope β1 (SE) | Test Statistic | Odds Ratio for β1 (95% CI) | Fit Criterion |
|---|---|---|---|---|---|
| GPT-4.1 | −1.9552 (0.1368) *** | 3.9011 (1.0778) *** | z = 3.62 | 49.46 (5.98–408.95) | AIC = 378.82 |
| Claude 3.5 Sonnet | −1.9552 (0.1368) *** | 3.9011 (1.0778) *** | z = 3.62 | 49.46 (5.98–408.95) | AIC = 378.82 |
| Gemini 2.0 Flash (Firth) | −1.7906 (0.1287) *** | 4.6238 (1.4609) *** | χ² = 28.53 | 101.88 (12.47–13,227.13) | not reported |
| Mistral Large (Firth) | −0.8319 (0.0980) *** | 3.6651 (1.4585) *** | χ² = 16.47 | 39.06 (4.82–5062.47) | not reported |
| DeepSeek V3 | −3.8754 (0.3195) *** | 3.3645 (0.7971) *** | z = 4.22 | 28.92 (6.06–137.94) | AIC = 112.30 |
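Because each regression has a single binary predictor, the maximum-likelihood coefficients can be recovered in closed form from the 2 × 2 confusion matrix: the slope is the log odds ratio and its standard error is the square root of the summed reciprocal cell counts. The Python sketch below illustrates this under stated assumptions. The function name `log_odds_ratio` and the `correction` parameter are hypothetical. Adding 0.5 to every cell is used here as a shortcut that happens to reproduce the Firth point estimates and standard errors in Table A2; it is not the fitting procedure described by the authors, and it does not reproduce their profile penalised likelihood confidence intervals.

```python
import math


def log_odds_ratio(tp, fn, fp, tn, correction=0.0, z=1.96):
    """Closed-form logistic regression of model decision on human label.

    Returns (beta0, beta1, se_beta1, odds_ratio, wald_ci).
    Set correction=0.5 when a cell is zero (assumed shortcut, see lead-in);
    with correction=0.0 a zero cell raises ZeroDivisionError.
    """
    a, b, c, d = (x + correction for x in (tp, fn, fp, tn))
    beta0 = math.log(c / d)              # log-odds of inclusion when human excludes
    beta1 = math.log((a * d) / (b * c))  # log odds ratio
    se1 = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    wald_ci = (math.exp(beta1 - z * se1), math.exp(beta1 + z * se1))
    return beta0, beta1, se1, math.exp(beta1), wald_ci


# GPT-4.1 counts from Table 1 (standard maximum-likelihood fit).
gpt = log_odds_ratio(tp=7, fn=1, fp=61, tn=431)

# Gemini 2.0 Flash has no false negatives, so the 0.5 correction is applied.
gemini = log_odds_ratio(tp=8, fn=0, fp=70, tn=422, correction=0.5)
```

For GPT-4.1 this gives a slope of about 3.90, a standard error of about 1.08 and an odds ratio of about 49.46; for Gemini 2.0 Flash the corrected odds ratio is about 101.88. The Wald interval returned for the corrected case should not be compared with the profile-based interval in Table A2.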

References

  1. Page, M.J.; McKenzie, J.E.; Bossuyt, P.M.; Boutron, I.; Hoffmann, T.C.; Mulrow, C.D.; Shamseer, L.; Tetzlaff, J.M.; Akl, E.A.; Brennan, S.E.; et al. The PRISMA 2020 Statement: An Updated Guideline for Reporting Systematic Reviews. BMJ 2021, 372, n71.
  2. Bornmann, L.; Mutz, R. Growth Rates of Modern Science: A Bibliometric Analysis Based on the Number of Publications and Cited References. J. Assoc. Inf. Sci. Technol. 2015, 66, 2215–2222.
  3. Borah, R.; Brown, A.W.; Capers, P.L.; Kaiser, K.A. Analysis of the Time and Workers Needed to Conduct Systematic Reviews of Medical Interventions Using Data from the PROSPERO Registry. BMJ Open 2017, 7, e012545.
  4. Shojania, K.; Sampson, M.; Ansari, M.; Ji, J.; Doucette, S.; Moher, D. How Quickly Do Systematic Reviews Go out of Date? A Survival Analysis. Ann. Intern. Med. 2007, 147, 224–233.
  5. Haddaway, N.R.; Pullin, A.S. The Policy Role of Systematic Reviews: Past, Present and Future. Springer Sci. Rev. 2014, 2, 179–183.
  6. Westgate, M.J.; Haddaway, N.R.; Cheng, S.H.; McIntosh, E.J.; Marshall, C.; Lindenmayer, D.B. Software Support for Environmental Evidence Synthesis. Nat. Ecol. Evol. 2018, 2, 588–590.
  7. Luo, X.; Chen, F.; Zhu, D.; Wang, L.; Wang, Z.; Liu, H.; Lyu, M.; Wang, Y.; Wang, Q.; Chen, Y. Potential Roles of Large Language Models in the Production of Systematic Reviews and Meta-Analyses. J. Med. Internet Res. 2024, 26, e56780.
  8. Naveed, H.; Khan, A.U.; Qiu, S.; Saqib, M.; Anwar, S.; Usman, M.; Akhtar, N.; Barnes, N.; Mian, A. A Comprehensive Overview of Large Language Models. ACM Trans. Intell. Syst. Technol. 2024, 16, 1–72.
  9. Minaee, S.; Mikolov, T.; Nikzad, N.; Chenaghlu, M.; Socher, R.; Amatriain, X.; Gao, J. Large Language Models: A Survey. arXiv 2025, arXiv:2402.06196.
  10. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 31st Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
  11. Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; et al. A Survey on Evaluation of Large Language Models. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–45.
  12. Salah, M.; Abdelfattah, F.; Alhalbusi, H. AI vs. Humans: The Future of Academic Review in Public Administration. Res. Sq. 2023; preprint.
  13. Fabiano, N.; Gupta, A.; Bhambra, N.; Luu, B.; Wong, S.; Maaz, M.; Fiedorowicz, J.G.; Smith, A.L.; Solmi, M. How to Optimize the Systematic Review Process Using AI Tools. JCPP Adv. 2024, 4, e12234.
  14. López-Pineda, A.; Nouni-García, R.; Carbonell-Soliva, Á.; Gil-Guillén, V.F.; Carratalá-Munuera, C.; Borrás, F. Validation of Large Language Models (Llama 3 and ChatGPT-4o Mini) for Title and Abstract Screening in Biomedical Systematic Reviews. Res. Synth. Methods 2025, 16, 620–630.
  15. Marshall, I.J.; Wallace, B.C. Toward Systematic Review Automation: A Practical Guide to Using Machine Learning Tools in Research Synthesis. Syst. Rev. 2019, 8, 163.
  16. Pedreschi, D.; Giannotti, F.; Guidotti, R.; Monreale, A.; Ruggieri, S.; Turini, F. Meaningful Explanations of Black Box AI Decision Systems. Proc. AAAI Conf. Artif. Intell. 2019, 33, 9780–9784.
  17. Staudinger, M.; Kusa, W.; Piroi, F.; Lipani, A.; Hanbury, A. A Reproducibility and Generalizability Study of Large Language Models for Query Generation. In Proceedings of the 2024 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region (SIGIR-AP 2024); Association for Computing Machinery, Inc.: New York, NY, USA, 2024; pp. 186–196.
  18. O’Mara-Eves, A.; Thomas, J.; McNaught, J.; Miwa, M.; Ananiadou, S. Using Text Mining for Study Identification in Systematic Reviews: A Systematic Review of Current Approaches. Syst. Rev. 2015, 4, 5.
  19. Adel, A.; Alani, N. Can Generative AI Reliably Synthesise Literature? Exploring Hallucination Issues in ChatGPT. AI Soc. 2025, 40, 6799–6812.
  20. Sciurti, A.; Migliara, G.; Siena, L.M.; Isonne, C.; De Blasiis, M.R.; Sinopoli, A.; Iera, J.; Marzuillo, C.; De Vito, C.; Villari, P.; et al. Compact Large Language Models for Title and Abstract Screening in Systematic Reviews: An Assessment of Feasibility, Accuracy, and Workload Reduction. Res. Synth. Methods 2026, 17, 332–347.
  21. Nykvist, B.; Macura, B.; Xylia, M.; Olsson, E. Testing the Utility of GPT for Title and Abstract Screening in Environmental Systematic Evidence Synthesis. Environ. Evid. 2025, 14, 7.
  22. Trad, F.; Yammine, R.; Charafeddine, J.; Chakhtoura, M.; Rahme, M.; El-Hajj Fuleihan, G.; Chehab, A. Streamlining Systematic Reviews with Large Language Models Using Prompt Engineering and Retrieval Augmented Generation. BMC Med. Res. Methodol. 2025, 25, 130.
  23. Galli, C.; Gavrilova, A.V.; Calciolari, E. Large Language Models in Systematic Review Screening: Opportunities, Challenges, and Methodological Considerations. Information 2025, 16, 378.
  24. Van Dijk, S.H.B.; Brusse-Keizer, M.G.J.; Bucsán, C.C.; Van Der Palen, J.; Doggen, C.J.M.; Lenferink, A. Artificial Intelligence in Systematic Reviews: Promising When Appropriately Used. BMJ Open 2023, 13, e072254.
  25. Nawrath, M.; Guenat, S.; Elsey, H.; Dallimer, M. Exploring Uncharted Territory: Do Urban Greenspaces Support Mental Health in Low- and Middle-Income Countries? Environ. Res. 2021, 194, 110625.
  26. Arksey, H.; O’Malley, L. Scoping Studies: Towards a Methodological Framework. Int. J. Soc. Res. Methodol. 2005, 8, 19–32.
  27. Tricco, A.C.; Lillie, E.; Zarin, W.; O’Brien, K.K.; Colquhoun, H.; Levac, D.; Moher, D.; Peters, M.D.J.J.; Horsley, T.; Weeks, L.; et al. PRISMA Extension for Scoping Reviews (PRISMA-ScR): Checklist and Explanation. Ann. Intern. Med. 2018, 169, 467.
  28. NIVA. LLM Benchmarks for Abstract Screening in Social and Environmental Scientific Publications. 2025. Available online: https://github.com/NIVANorge/ai-literature-review-public (accessed on 1 May 2026).
  29. Syriani, E.; David, I.; Kumar, G. Screening Articles for Systematic Reviews with ChatGPT. J. Comput. Lang. 2024, 80, 101287.
  30. Dwork, C.; Feldman, V.; Hardt, M.; Pitassi, T.; Reingold, O.; Roth, A. Generalization in Adaptive Data Analysis and Holdout Reuse. In Proceedings of the 29th International Conference on Neural Information Processing Systems (NIPS’15), Volume 2; MIT Press: Cambridge, MA, USA, 2015.
  31. Li, Y.; Datta, S.; Rastegar-Mojarad, M.; Lee, K.; Paek, H.; Glasgow, J.; Liston, C.; He, L.; Wang, X.; Xu, Y. Enhancing Systematic Literature Reviews with Generative Artificial Intelligence: Development, Applications, and Performance Evaluation. J. Am. Med. Inform. Assoc. 2025, 32, 616–625.
  32. Malik, F.S.; Terzidis, O. A Hybrid Framework for Creating Artificial Intelligence-Augmented Systematic Literature Reviews. Manag. Rev. Q. 2025, 1–27.
  33. Taylor, K.S.; Mahtani, K.R.; Aronson, J.K. Summarising Good Practice Guidelines for Data Extraction for Systematic Reviews and Meta-Analysis. BMJ Evid. Based. Med. 2021, 26, 88–90.
  34. Schmidt, L.; Olorisade, B.K.; McGuinness, L.A.; Thomas, J.; Higgins, J.P.T. Data Extraction Methods for Systematic Review (Semi)Automation: A Living Systematic Review. F1000Research 2021, 10, 401.
  35. Atil, B.; Aykent, S.; Chittams, A.; Fu, L.; Passonneau, R.J.; Radcliffe, E.; Rajagopal, G.R.; Sloan, A.; Tudrej, T.; Ture, F.; et al. Non-Determinism of “Deterministic” LLM System Settings in Hosted Environments. In Proceedings of the 5th Workshop on Evaluation and Comparison of NLP Systems; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 135–148.
  36. Heinze, G.; Schemper, M. A Solution to the Problem of Separation in Logistic Regression. Stat. Med. 2002, 21, 2409–2419.
  37. Landis, J.R.; Koch, G.G. The Measurement of Observer Agreement for Categorical Data. Biometrics 1977, 33, 159–174.
  38. Chicco, D.; Jurman, G. The Advantages of the Matthews Correlation Coefficient (MCC) over F1 Score and Accuracy in Binary Classification Evaluation. BMC Genom. 2020, 21, 6.
  39. Chen, G.; Faris, P.; Hemmelgarn, B.; Walker, R.L.; Quan, H. Measuring Agreement of Administrative Data with Chart Data Using Prevalence Unadjusted and Adjusted Kappa. BMC Med. Res. Methodol. 2009, 9, 5.
  40. Zec, S.; Soriani, N.; Comoretto, R.; Baldi, I. High Agreement and High Prevalence: The Paradox of Cohen’s Kappa. Open Nurs. J. 2017, 11, 211–218.
  41. Delgado, R.; Tibau, X.A. Why Cohen’s Kappa Should Be Avoided as Performance Measure in Classification. PLoS ONE 2019, 14, e0222916.
  42. de la Cruz Huayanay, A.; Bazán, J.L.; Russo, C.M. Performance of Evaluation Metrics for Classification in Imbalanced Data. Comput. Stat. 2025, 40, 1447–1473.
  43. Page, M.J.; Higgins, J.P.; Sterne, J.A. Assessing Risk of Bias Due to Missing Results in a Synthesis. In Cochrane Handbook for Systematic Reviews of Interventions, 2nd ed.; Cochrane: London, UK, 2019.
  44. Feinstein, A.R.; Cicchetti, D.V. High Agreement but Low Kappa: I. The Problems of Two Paradoxes. J. Clin. Epidemiol. 1990, 43, 543–549.
  45. Jothi Prakash, B.; Barath Kannan, D.; Pankaj Seervi, A.; Meivezhi, G. Prompt Engineering for Large Language Models: A Systematic Review and Future Directions. Res. Sq. 2025; preprint.
  46. Sahoo, P.; Singh, A.K.; Saha, S.; Jain, V.; Mondal, S.; Chadha, A. A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications. arXiv 2025, arXiv:2402.07927.
  47. Adam, T.J.; Abosabie, S.A.S.; Dittmer, M.; Wolf, E.; Abosabie, S.A.; Behnke, C.; Baier, F.; Weickmann, A.; Köser, L.; Correll, C.U.; et al. Prompt Engineering of Large Language Models for Paper Screening in Medical Meta-Analyses and Systematic Reviews: A Prospective Comparative Study. Res. Synth. Methods 2026, 17, 1–18.
  48. Ye, A.; Maiti, A.; Schmidt, M.; Pedersen, S.J. A Hybrid Semi-Automated Workflow for Systematic and Literature Review Processes with Large Language Model Analysis. Future Internet 2024, 16, 167.
  49. Homiar, A.; Thomas, J.; Ostinelli, E.G.; Kennett, J.; Friedrich, C.; Cuijpers, P.; Harrer, M.; Leucht, S.; Miguel, C.; Rodolico, A.; et al. Development and Evaluation of Prompts for a Large Language Model to Screen Titles and Abstracts in a Living Systematic Review. BMJ Ment. Health 2025, 28, e301762.
  50. Huotala, A.; Kuutila, M.; Ralph, P.; Mäntylä, M. The Promise and Challenges of Using LLMs to Accelerate the Screening Process of Systematic Reviews. In Proceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering (EASE 2024); CEUR-WS; Association for Computing Machinery: New York, NY, USA, 2024; Volume 2657, pp. 1–9.
  51. Sollini, M.; Pini, C.; Lazar, A.; Gelardi, F.; Ninatti, G.; Bauckneht, M.; Chiti, A.; Kirienko, M. Human Researchers Are Superior to Large Language Models in Writing a Medical Systematic Review in a Comparative Multitask Assessment. Sci. Rep. 2025, 16, 173.
  52. Haltaufderheide, J.; Ranisch, R. The Ethics of ChatGPT in Medicine and Healthcare: A Systematic Review on Large Language Models (LLMs). npj Digit. Med. 2024, 7, 183.
  53. Peters, U.; Chin-Yee, B. Generalization Bias in Large Language Model Summarization of Scientific Research. R. Soc. Open Sci. 2025, 12, 241776.
  54. Deng, C.; Duan, Y.; Jin, X.; Chang, H.; Tian, Y.; Liu, H.; Wang, Y.; Gao, K.; Zou, H.P.; Jin, Y.; et al. Deconstructing the Ethics of Large Language Models from Long-Standing Issues to New-Emerging Dilemmas: A Survey. AI Ethics 2025, 5, 4745–4771.
  55. Fareed, M.; Fatima, M.; Uddin, J.; Ahmed, A.; Sattar, M.A. A Systematic Review of Ethical Considerations of Large Language Models in Healthcare and Medicine. Front. Digit. Health 2025, 7, 1653631.
  56. Aljohani, M.; Hou, J.; Kommu, S.; Wang, X. A Comprehensive Survey on the Trustworthiness of Large Language Models in Healthcare. arXiv 2025, arXiv:2504.00025.
  57. Gartlehner, G.; Kahwati, L.; Hilscher, R.; Thomas, I.; Kugley, S.; Crotty, K.; Viswanathan, M.; Nussbaumer-Streit, B.; Booth, G.; Erskine, N.; et al. Data Extraction for Evidence Synthesis Using a Large Language Model: A Proof-of-Concept Study. Res. Synth. Methods 2024, 15, 576–589.
  58. O’Connor, A.M.; Clark, J.; Thomas, J.; Spijker, R.; Kusa, W.; Walker, V.R.; Bond, M. Large Language Models, Updates, and Evaluation of Automation Tools for Systematic Reviews: A Summary of Significant Discussions at the Eighth Meeting of the International Collaboration for the Automation of Systematic Reviews (ICASR). Syst. Rev. 2024, 13, 290.
  59. Cacciamani, G.E.; Chu, T.N.; Sanford, D.I.; Abreu, A.; Duddalwar, V.; Oberai, A.; Kuo Jay, C.C.; Liu, X.; Denniston, A.K.; Vasey, B.; et al. PRISMA AI-Reporting Guidelines for Systematic Reviews and Meta-Analyses on AI in Healthcare. Nat. Med. 2023, 29, 14–15.
  60. Carlini, N.; Tramèr, F.; Lee, K.; Roberts, A.; Wallace, E.; Jagielski, M.; Herbert-Voss, A.; Brown, T.; Song, D.; Erlingsson, Ú.; et al. Extracting Training Data from Large Language Models. In Proceedings of the 30th USENIX Security Symposium, Online, 11–13 August 2021.
  61. Thode, L.; Iftikhar, U.; Mendez, D. Exploring the Use of LLMs for the Selection Phase in Systematic Literature Studies. Inf. Softw. Technol. 2025, 184, 107757.
  62. Golchin, S.; Surdeanu, M. Time Travel in LLMs: Tracing Data Contamination in Large Language Models. In Proceedings of the ICLR 2024, Vienna, Austria, 7–11 May 2024.
  63. Dietrich, J.; Hollstein, A. Performance and Reproducibility of Large Language Models in Named Entity Recognition: Considerations for the Use in Controlled Environments. Drug Saf. 2025, 48, 287–303.
  64. Flemyng, E.; Noel-Storr, A.; Macura, B.; Gartlehner, G.; Thomas, J.; Meerpohl, J.J.; Jordan, Z.; Minx, J.; Eisele-Metzger, A.; Hamel, C.; et al. Position Statement on Artificial Intelligence (AI) Use in Evidence Synthesis across Cochrane, the Campbell Collaboration, JBI and the Collaboration for Environmental Evidence 2025. Environ. Evid. 2025, 14, 20.
  65. Development Assistance Committee. DAC List of ODA Recipients; Development Assistance Committee: Paris, France, 2025.
  66. Hartig, T.; Mitchell, R.; de Vries, S.; Frumkin, H. Nature and Health. Annu. Rev. Public Health 2014, 35, 207–228.
  67. Barton, J.; Rogerson, M. The Importance of Greenspace for Mental Health. BJPsych Int. 2017, 14, 79–81.
  68. World Health Organisation. Mental Health: A State of Wellbeing. Available online: https://www.who.int/features/factfiles/mental_health/en/ (accessed on 1 May 2025).
  69. Linton, M.J.; Dieppe, P.; Medina-Lara, A. Review of 99 Self-Report Measures for Assessing Well-Being in Adults: Exploring Dimensions of Well-Being and Developments over Time. BMJ Open 2016, 6, e010641.
  70. Houlden, V.; Weich, S.; de Albuquerque, J.P.; Jarvis, S.; Rees, K. The Relationship between Greenspace and the Mental Wellbeing of Adults: A Systematic Review. PLoS ONE 2018, 13, e0203000.
  71. American Psychiatric Association. Diagnostic and Statistical Manual of Mental Disorders, 5th ed.; American Psychiatric Association Publishing: Washington, DC, USA, 2013.
  72. United Nations. Habitat III Issue Papers—Informal Settlements; UN Habitat: New York, NY, USA, 2016.
  73. Gascon, M.; Mas, M.T.; Martínez, D.; Dadvand, P.; Forns, J.; Plasència, A.; Nieuwenhuijsen, M.J. Mental Health Benefits of Long-Term Exposure to Residential Green and Blue Spaces: A Systematic Review. Int. J. Environ. Res. Public Health 2015, 12, 4354–4379.
  74. Academic Unit of Health Economics University of Leeds. AUHE Search Strategy: Low and Middle Income Countries Geographic Search; Academic Unit of Health Economics University of Leeds: Leeds, UK, 2018.
Figure 1. Classification performance metrics for five Large Language Models (LLMs) in predicting study inclusion at the title and abstract screening stage. Metrics include total accuracy, sensitivity, specificity, positive predictive value, and negative predictive value. Each axis represents a normalised scale from 0 to 1. Higher values indicate better performance. We compared GPT-4.1, Claude 3.5 Sonnet, Gemini 2.0 Flash, Mistral Large, and DeepSeek V3.
Table 1. Classification performance metrics for Large Language Models (LLMs), comparing LLM decisions with human decisions at the title and abstract screening stage. Total accuracy: the overall proportion of correct predictions (true positives and true negatives); Sensitivity (recall): the proportion of truly relevant studies (as determined by human reviewers) that the model correctly identified for inclusion; Specificity: the proportion of irrelevant studies that the model correctly excluded; Positive Predictive Value: the proportion of studies included by the model that were actually relevant; Negative Predictive Value: the proportion of studies excluded by the model that were truly irrelevant; Positive Likelihood Ratio: quantifies how much more likely a study is to be relevant if the model includes it; Negative Likelihood Ratio: indicates how much the probability of relevance decreases if the model excludes a study.
| Model | True Positive | True Negative | False Positive | False Negative | Total Accuracy | Sensitivity | 95% CI | Specificity | 95% CI |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4.1 | 7 | 431 | 61 | 1 | 0.876 | 0.875 | 0.529–0.978 | 0.876 | 0.844–0.902 |
| Claude 3.5 Sonnet | 7 | 431 | 61 | 1 | 0.876 | 0.875 | 0.529–0.978 | 0.876 | 0.844–0.902 |
| Gemini 2.0 Flash | 8 | 422 | 70 | 0 | 0.860 | 1.000 | 0.676–1.000 | 0.858 | 0.824–0.886 |
| Mistral Large | 8 | 343 | 149 | 0 | 0.702 | 1.000 | 0.676–1.000 | 0.697 | 0.655–0.736 |
| DeepSeek V3 | 3 | 482 | 10 | 5 | 0.970 | 0.375 | 0.137–0.694 | 0.980 | 0.963–0.989 |

| Model | Positive Predictive Value | 95% CI | Negative Predictive Value | Positive Likelihood Ratio | 95% CI | Negative Likelihood Ratio | 95% CI |
|---|---|---|---|---|---|---|---|
| GPT-4.1 | 0.103 | 0.051–0.198 | 0.998 | 7.057 | 3.228–15.429 | 0.143 | 0.020–1.015 |
| Claude 3.5 Sonnet | 0.103 | 0.051–0.198 | 0.998 | 7.057 | 3.228–15.429 | 0.143 | 0.020–1.015 |
| Gemini 2.0 Flash | 0.103 | 0.053–0.190 | 1 | 7.029 | 3.382–14.606 | 0 | 0–NaN |
| Mistral Large | 0.051 | 0.026–0.097 | 1 | 3.302 | 1.621–6.725 | 0 | 0–NaN |
| DeepSeek V3 | 0.231 | 0.082–0.503 | 0.990 | 18.450 | 5.078–67.039 | 0.638 | 0.264–1.540 |
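With only eight human inclusions, the intervals around sensitivity are wide, and the choice of interval method matters. The Python sketch below computes a Wilson score interval for a binomial proportion. This section does not state which method the authors used; the Wilson interval is an assumption, chosen because it reproduces the sensitivity intervals reported in Table 1. The function name `wilson_interval` is hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (assumed method)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at p = 0 or p = 1.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Sensitivity of GPT-4.1: 7 of the 8 human inclusions were identified.
low, high = wilson_interval(7, 8)
```

For 7 of 8 inclusions the interval is approximately 0.529 to 0.978, and for 8 of 8 it is approximately 0.676 to 1.000. Even a model with no false negatives therefore has a lower bound well below 1 at this sample size.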
Table 2. Inter-rater reliability between Large Language Models (LLMs) and human reviewers across multiple agreement metrics (N = 500 abstracts for all models). Cohen’s kappa (κ) and Gwet’s AC1 are interpreted using Landis and Koch criteria: <0.00 = poor, 0.00–0.20 = slight, 0.21–0.40 = fair, 0.41–0.60 = moderate, 0.61–0.80 = substantial, 0.81–1.00 = almost perfect. MCC ranges from −1 to +1, with values >0.3 often being interpreted as moderate positive correlation. PABAK ranges from −1 to +1, with the same interpretation as kappa. Significance levels: *** p < 0.001. Statistically significant coefficients are highlighted in bold.
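| Model | Cohen’s κ | z | MCC | z | PABAK | z | Gwet’s AC1 | z |
|---|---|---|---|---|---|---|---|---|
| GPT-4.1 | 0.160 | 1.60 | 0.275 *** | 6.39 | 0.752 *** | 25.51 | 0.856 *** | 49.90 |
| Claude 3.5 Sonnet | 0.160 | 1.60 | 0.275 *** | 6.39 | 0.752 *** | 25.51 | 0.856 *** | 49.90 |
| Gemini 2.0 Flash | 0.162 | 1.74 | 0.297 *** | 6.94 | 0.720 *** | 23.20 | 0.834 *** | 45.29 |
| Mistral Large | 0.069 | 1.07 | 0.188 *** | 4.29 | 0.404 *** | 9.88 | 0.589 *** | 20.85 |
| DeepSeek V3 | 0.271 | 1.46 | 0.280 *** | 6.51 | 0.940 *** | 61.61 | 0.969 *** | 121.76 |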
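The gap between Cohen's κ and the prevalence-robust indices follows directly from how each index models chance agreement. The Python sketch below computes all four point estimates from a 2 × 2 confusion matrix using the standard two-rater, two-category definitions; it omits the z statistics and significance tests. The function name `agreement_indices` is an assumption, not the authors' code.

```python
import math


def agreement_indices(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Cohen's kappa, MCC, PABAK and Gwet's AC1 for two raters, two categories."""
    n = tp + tn + fp + fn
    po = (tp + tn) / n              # observed agreement
    human_pos = (tp + fn) / n       # proportion included by human reviewers
    model_pos = (tp + fp) / n       # proportion included by the model

    # Cohen's kappa: chance agreement from the product of the marginals.
    pe_kappa = human_pos * model_pos + (1 - human_pos) * (1 - model_pos)
    kappa = (po - pe_kappa) / (1 - pe_kappa)

    # Matthews correlation coefficient.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )

    # PABAK fixes chance agreement at 0.5.
    pabak = 2 * po - 1

    # Gwet's AC1: chance agreement from the mean inclusion proportion.
    pi = (human_pos + model_pos) / 2
    pe_ac1 = 2 * pi * (1 - pi)
    ac1 = (po - pe_ac1) / (1 - pe_ac1)

    return {"kappa": kappa, "mcc": mcc, "pabak": pabak, "ac1": ac1}


# DeepSeek V3 counts from Table 1.
deepseek = agreement_indices(tp=3, tn=482, fp=10, fn=5)
```

For DeepSeek V3, chance agreement under κ is about 0.959 because both raters exclude almost everything, which leaves little room above chance and gives κ ≈ 0.271 despite 97% observed agreement. Under AC1 the chance term is only about 0.041, giving AC1 ≈ 0.969.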
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Nawrath, M.; Merlina, A.; Knight, J.; Welch, S.A.; Rashidian, M.; Seifert-Dähnn, I. Validating Large Language Models for Title-Abstract Screening in Low-Prevalence Systematic Reviews: An Environmental Science Case Study. Information 2026, 17, 501. https://doi.org/10.3390/info17050501

