Abstract
Background: Artificial intelligence (AI) has been proposed as a transformative tool in suicide prevention, yet most evidence remains observational. To provide a rigorous benchmark, we systematically reviewed randomized controlled trials (RCTs) evaluating AI-based interventions targeting suicidal thoughts, behaviours, or help-seeking. Methods: Following PRISMA 2020 guidelines, MEDLINE, Web of Science, and Scopus were searched to 31 May 2025. Eligible studies were RCTs in humans that incorporated AI or machine learning for risk prediction, automated intervention, or treatment allocation. Methodological quality was assessed with the PEDro scale and certainty of evidence with GRADE. Results: From 1101 screened records, six RCTs (n = 793) met all criteria. Three studies tested machine learning risk prediction, two evaluated fully automated interventions (a transformer-based recommender and a digital nudge), and one examined AI-assisted treatment allocation. Risk-prediction models stratified short-term suicidal outcomes with accuracies of up to 0.67 and AUC values around 0.70. Digital interventions reduced counsellor response latency or increased crisis-service uptake by 23%. Algorithm-guided allocation reduced the occurrence of suicidal events when randomisation aligned with model recommendations. Methodological quality was moderate to high (median PEDro = 8/10), but GRADE certainty was low due to small samples and imprecision. Conclusions: AI can enhance discrete processes in suicide prevention, including risk stratification, help-seeking, and personalized treatment. However, the current evidence is limited, and larger multisite RCTs with longer follow-up, CONSORT-AI compliance, and equity-focused design are urgently required.
1. Introduction
Suicide remains a major public health emergency, claiming more than 700,000 lives each year and ranking among the leading causes of premature mortality worldwide []. After five decades of empirical work, our ability to anticipate who will act on suicidal thoughts—and when—remains limited []. Traditional actuarial or checklist-based risk tools, largely derived from static demographic variables, yield positive-predictive values only marginally better than chance []. At the same time, clinical services face profound capacity constraints: demand for evidence-based psychotherapies and crisis support routinely outstrips supply, leaving many high-risk individuals unassessed or untreated [].
In this context, artificial intelligence (AI) and machine learning (ML) have been heralded as pillars of “precision psychiatry”. Recent scoping and systematic reviews show that ML algorithms can ingest high-dimensional data—electronic health records (EHRs), social media language, wearable sensor streams—to uncover non-linear patterns that elude conventional statistics [,]. A deep-learning model trained on millions of Reddit posts, for example, detected rising suicidal ideation months before users self-reported a crisis [], and an EHR-based classifier identified postpartum depression trajectories with greater accuracy than clinician judgement []. In parallel, digital interventions powered by conversational agents or recommender systems have begun to deliver self-guided cognitive behavioural therapy, with measurable reductions in depression and anxiety symptoms [,].
However, empirical validation still lags behind these promises. Most AI studies in suicide prevention remain retrospective or cross-sectional [,], and their reporting quality is inconsistent, with key details on algorithm development, validation strategy, and error analysis frequently absent []. To address these gaps, the CONSORT-AI and SPIRIT-AI extensions now specify how randomized trials of AI interventions should be designed and reported []. Uptake of these guidelines, however, remains limited.
A functional-domain framework helps organize this growing but heterogeneous field. Current applications cluster around three complementary roles: (i) risk-prediction models, (ii) fully automated digital interventions, and (iii) AI-assisted treatment allocation.
Risk-prediction models estimate individualized probabilities of future suicidal thoughts or behaviours, enabling tiered monitoring and proactive outreach []. The U.S. Veterans Health Administration’s REACH-VET program illustrates this translational success. Its ensemble model mines EHR data to flag the top 0.1% of veterans at elevated risk, routing them to rapid follow-up and yielding observational gains in outcomes []. Meta-analyses indicate that ML-based models generally outperform traditional logistic approaches, although concerns regarding external validity and ethical deployment remain [,].
Fully automated digital interventions act directly on users in real time. A transformer-based recommender embedded in the Dutch 113 helpline shortened counsellor response latency and was deemed helpful in more than 80% of interactions []. Other studies focused on primary prevention by combining psychoeducation, self-assessment, and anonymous navigation to local resources []. Parallel work on passive digital biomarkers—including sleep pattern deviations and acoustic markers of hopelessness—aims to trigger just-in-time outreach before crises escalate [].
AI-assisted treatment allocation employs algorithmic decision rules to match patients to the most effective therapy. Blending structured clinical interviews with ML classifiers has sharpened suicide-risk stratification in emergency departments and streamlined referral pathways []. A personalized advantage index derived from ensemble learning recently reduced suicidal events when randomisation aligned with the model’s treatment recommendation, underscoring the translational potential of adaptive allocation [].
These advances also surface unresolved ethical and governance challenges. Large-scale aggregation of sensitive mental health data raises questions about ownership, informed consent, and algorithmic bias [,]. Addressing these concerns transparently is critical to ensuring that AI enhances, rather than erodes, trust and equity in suicide-prevention services.
Against the backdrop of rapid developments in clinical AI from 2023 to 2025, international policy frameworks increasingly emphasize safety-oriented, context-specific validation and governance. Recent recommendations—such as the OECD principles for trustworthy AI and contemporary commentaries on clinical AI regulation—call for transparent reporting, lifecycle risk management, and post-deployment monitoring in high-stakes health settings. In suicide prevention, where data are highly sensitive and populations are clinically vulnerable, these requirements translate into careful trial conduct, explicit bias auditing across demographic subgroups, and alignment between algorithmic function, intended use, and service pathways.
Consistent with this policy trajectory, emerging methodological proposals for evaluating large language models and related AI systems in healthcare [] stress dimensions that are directly relevant to mental health applications: conformance with international principles (e.g., OECD), temporal response consistency, performance under ambiguity, and adaptability to professional terminologies. Our focus on randomized evidence is therefore complementary to these governance priorities: by privileging experimental designs and domain-specific outcomes, the present review aims to benchmark effectiveness within a safety-first paradigm and to inform the design and reporting standards of future AI trials in suicide prevention.
In this context, and to benchmark AI tools against safety-aligned standards, we undertook a systematic review focused exclusively on randomized evidence. The present review had two aims: (i) to evaluate the methodological quality and the certainty of the evidence for suicide-related outcomes of randomized controlled trials deploying AI-based tools for suicide prevention, and (ii) to synthesize the effectiveness of these tools within each functional domain—risk prediction, fully automated digital interventions, and AI-assisted treatment allocation.
2. Materials and Methods
2.1. Design, Data Sources, and Search Strategy
We conducted a systematic review of randomized controlled trials (RCTs) that evaluated artificial intelligence–based tools for suicide prevention in humans.
A comprehensive literature search was performed in MEDLINE (via PubMed), Web of Science Core Collection, and Scopus from database inception to 31 May 2025. The strategy combined controlled vocabulary and free-text terms in three conceptual blocks: (i) AI and ML; (ii) suicide, suicidal ideation, self-harm, and suicide attempt; and (iii) intervention.
An example of the search syntax, as used in Scopus, was: (chatbot* OR “conversational ai” OR “conversational agent*” OR “dialog system*” OR “large language model*” OR LLM OR ChatGPT OR “generative ai” OR “artificial intelligence” OR “machine learning”) AND (suicid* OR “suicidal ideation” OR “suicide attempt*” OR “self-harm*” OR “self injur*”) AND (assist* OR support* OR counsel* OR therap* OR intervention*).
The syntax was adapted for Web of Science and MEDLINE. Reference lists of all included papers and of recent reviews on AI and suicide were hand-searched to capture studies missed by the database search. Grey literature, conference abstracts, preprints, and non-English records were excluded.
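To illustrate how such a block-structured strategy can be executed programmatically, the following minimal sketch queries PubMed (MEDLINE) through NCBI’s E-utilities via Biopython. The field tags, simplified terms, date limits, and placeholder e-mail address are illustrative assumptions, not the exact MEDLINE syntax used in this review.

```python
from Bio import Entrez  # Biopython wrapper around NCBI's E-utilities

Entrez.email = "reviewer@example.org"  # NCBI requires a contact address; placeholder only

# Illustrative translation of the three conceptual blocks (AI/ML terms, suicide-related
# terms, intervention terms) into PubMed syntax; the actual MEDLINE strategy may differ.
query = (
    '("artificial intelligence"[Title/Abstract] OR "machine learning"[Title/Abstract] '
    'OR chatbot*[Title/Abstract] OR "conversational agent"[Title/Abstract]) '
    'AND (suicid*[Title/Abstract] OR "self-harm"[Title/Abstract] OR "self injury"[Title/Abstract]) '
    'AND (intervention*[Title/Abstract] OR therap*[Title/Abstract] OR support*[Title/Abstract] '
    'OR counsel*[Title/Abstract])'
)

# Restrict to records published up to 31 May 2025, mirroring the review's search window.
handle = Entrez.esearch(db="pubmed", term=query, retmax=0,
                        datetype="pdat", mindate="1800/01/01", maxdate="2025/05/31")
record = Entrez.read(handle)
handle.close()
print("Records retrieved:", record["Count"])
```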
This review was conducted in accordance with the PRISMA 2020 statement (Table S1). The review protocol was not prospectively registered in PROSPERO, which we acknowledge as a methodological limitation. No AI tools were employed in the conduct of the search, data extraction, or drafting processes, in line with MDPI’s research integrity policies. Although the PEDro scale and GRADE framework were used to appraise methodological quality and certainty of evidence, we did not formally evaluate compliance with AI-specific reporting standards such as CONSORT-AI or SPIRIT-AI. These guidelines were not part of our eligibility criteria, but we recommend their systematic adoption in future randomized trials of AI-based suicide-prevention tools to enhance transparency, reproducibility, and clinical applicability.
2.2. Eligibility Criteria
We included parallel-group or cluster randomized controlled trials conducted in humans of any age or setting, published in English, that evaluated an intervention incorporating artificial intelligence or machine-learning methods, such as predictive algorithms, conversational agents, recommender systems, or adaptive treatment rules, designed to prevent suicidal thoughts, behaviours, or related outcomes. Eligible comparators encompassed usual care, sham or inactive controls, active digital controls, or alternative treatment modalities, provided randomization distinguished the AI component from the control condition. No temporal restriction was applied to publication date. We excluded (i) studies that were non-randomized, purely observational, diagnostic-accuracy-only, or modelling-only; (ii) trials without a genuine AI/ML element; (iii) systematic reviews and meta-analyses; (iv) animal or simulation work; and (v) reports published in languages other than English.
2.3. Study Selection, Data Extraction, and Data Synthesis
The database search retrieved 1756 records: 649 from MEDLINE, 717 from Scopus, and 390 from Web of Science. After automatic and manual de-duplication, 1101 unique citations remained. Title-and-abstract screening with the prespecified filters—randomized design, human subjects, and English language—reduced the pool to 36 full-text articles. Of these, 24 were excluded because the AI component was applied to drug discovery or pharmacological modelling rather than suicide prevention, four targeted outcomes unrelated to suicidality, and two described predictive algorithms without any linked clinical or digital intervention. Accordingly, six randomized controlled trials met all eligibility criteria and were included in the qualitative synthesis (Figure 1).
Figure 1.
PRISMA flow diagram. ** Full-text articles were excluded for prespecified reasons: (i) not a clinical trial, (ii) no AI-based conversational agent, (iii) no suicide-related outcomes, (iv) wrong population, (v) protocol/abstract only, (vi) insufficient outcome data.
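As a simple reproducibility check, the screening arithmetic reported above can be tallied directly; the counts in the sketch below are those stated in the text and in Figure 1, with no external data involved.

```python
# Tally of the PRISMA flow counts reported in the text.
retrieved = {"MEDLINE": 649, "Scopus": 717, "Web of Science": 390}
unique_after_deduplication = 1101
full_text_assessed = 36
full_text_exclusions = {
    "AI applied to drug discovery / pharmacological modelling": 24,
    "outcomes unrelated to suicidality": 4,
    "predictive algorithm without a linked intervention": 2,
}

total_retrieved = sum(retrieved.values())                            # 1756 records
duplicates_removed = total_retrieved - unique_after_deduplication    # 655 duplicates
included = full_text_assessed - sum(full_text_exclusions.values())   # 6 RCTs

print(f"Retrieved: {total_retrieved}, duplicates removed: {duplicates_removed}, "
      f"included RCTs: {included}")
assert included == 6
```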
Search results were exported to Zotero (version 7.0.1), and duplicates were removed automatically and manually. Two reviewers (I.F.-Q. and C.R.-N.) screened titles/abstracts independently and then assessed full texts against the eligibility criteria. Disagreements were resolved by discussion or by a third reviewer (I.H.-P.). The information retrieved from the manuscripts included study characteristics (year, setting, sample size, age, and sex), AI purpose (risk prediction, fully automated intervention, AI-assisted allocation, and clinical support system), description of the AI/ML method, comparator, suicide-related outcomes, follow-up duration, and key efficacy and safety results.
2.4. Risk-of-Bias Assessment and Grading the Certainty of Evidence
Risk of bias for all eligible randomized controlled trials (RCTs) was appraised with the 11-item PEDro scale; items 2–11 were scored dichotomously (Yes = 1, No = 0), yielding a total score from 0 to 10. Two reviewers (C.R.-N. and I.F.-Q.) rated the trials independently, demonstrating excellent inter-rater consistency (Cronbach’s alpha = 0.83).
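For transparency, the reliability statistic can be reproduced in a few lines of code. The sketch below computes Cronbach’s alpha from a trials × items matrix of dichotomous PEDro ratings; the data layout and the example values are assumptions for illustration, not the reviewers’ actual scoring sheets.

```python
import numpy as np

def cronbach_alpha(ratings: np.ndarray) -> float:
    """Cronbach's alpha for a (subjects x items) score matrix."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                                # number of items
    item_variances = ratings.var(axis=0, ddof=1)        # variance of each item
    total_variance = ratings.sum(axis=1).var(ddof=1)    # variance of the total score
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)

# Hypothetical dichotomous ratings for six trials on the ten scored PEDro items
# (items 2-11); the resulting alpha reflects these invented data, not the 0.83
# obtained from the reviewers' actual ratings.
example = np.array([
    [1, 1, 0, 0, 1, 1, 1, 1, 1, 1],
    [1, 1, 0, 0, 0, 1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1, 0, 1, 1, 1, 0],
    [1, 1, 0, 0, 1, 1, 1, 1, 1, 1],
    [1, 1, 0, 0, 1, 1, 0, 1, 1, 1],
    [1, 1, 0, 0, 0, 1, 1, 1, 1, 1],
])
print(f"Cronbach's alpha (example data): {cronbach_alpha(example):.2f}")
```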
Overall certainty for every clinically relevant outcome (suicide attempt, suicidal event, suicidal ideation trajectory, use of crisis services, and process metrics) was then judged with the GRADE framework. Certainty started at a high level (randomized evidence) and was downgraded by one or more levels for serious concerns regarding risk of bias, inconsistency, indirectness, imprecision, or publication bias. Upgrading criteria were not applied because no study showed a large effect uncontaminated by plausible bias or a clear dose–response gradient.
Given the small number of trials and heterogeneity in interventions and outcomes, quantitative pooling was deemed inappropriate. The results were tabulated descriptively and narratively synthesized within the predefined functional domains of AI usage. Where authors reported discrimination statistics (AUC, accuracy), we presented those values verbatim; no additional modelling was undertaken.
3. Results
3.1. Functional Domains of Artificial Intelligence in Suicide Prevention
Where authors reported discrimination metrics for prediction models, these values are presented verbatim. For example, Rozek et al. (2020) [] reported that their five-variable decision-tree ensemble identified 30.8% of future suicide attempters with high specificity, while Pontén et al. (2024) [] found that their random forest achieved an accuracy of 0.67 compared with clinician prediction at 0.63. Alexopoulos et al. (2021) [] applied multiple ensemble methods to distinguish short-term suicidal ideation trajectories, with discrimination supported by cross-validated performance metrics. These reported statistics provide the empirical basis for the discussion statement that machine learning models can outperform traditional scales in predictive discrimination (Table 1).
Table 1.
Clinical studies of AI applications in suicide prevention: summary of design, performance, and clinical relevance.
3.1.1. Risk-Prediction Models
Three randomized studies employed machine learning algorithms exclusively to anticipate subsequent suicidal thinking or behaviour. In a secondary analysis of a trial of brief cognitive behavioural therapy for suicide prevention, Rozek et al. (2020) [] used a proprietary decision-tree ensemble to analyze baseline data from 152 active-duty soldiers. Although the cohort was modest, the five-variable model identified 30.8% of participants who attempted suicide during the 24-month follow-up, a substantially greater sensitivity than that achieved with the parent study’s conventional statistics, while maintaining high specificity. Pontén et al. (2024) [] investigated adolescents receiving internet-based CBT for non-suicidal self-injury (n = 62). A random forest classifier predicted remission of self-injury, a pragmatic proxy for diminished suicide risk; emotion dysregulation emerged as the single most important predictor, highlighting the relevance of early affect-regulation skills to favourable trajectories (Table 1).
Finally, Alexopoulos et al. (2021) [] applied an ensemble of LASSO, random forest, gradient boosting, and CART to 249 older adults undergoing brief psychotherapy for major depression. The models distinguished two nine-week trajectories of suicidal ideation; 31% followed an unfavourable course that was independently driven by hopelessness, high neuroticism, and low self-efficacy, three modifiable psychological factors that can be targeted in subsequent sessions (Table 1). Collectively, these trials demonstrate that, even within controlled therapeutic settings, data-driven risk scores effectively stratify patients according to short-term suicide-related outcomes.
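To make the modelling workflow behind these trials concrete, the following is a minimal sketch of a cross-validated random-forest classifier of the kind used by Pontén et al. The data are synthetic, and the hyperparameters and resulting metrics are illustrative assumptions rather than a reproduction of any included trial.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for baseline predictors (e.g., emotion dysregulation, hopelessness)
# and a binary suicide-related outcome; the included trials used their own clinical data.
X = rng.normal(size=(250, 12))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.5, size=250) > 1.0).astype(int)

model = RandomForestClassifier(n_estimators=500, class_weight="balanced", random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

auc = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
acc = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print(f"Cross-validated AUC: {auc.mean():.2f} | accuracy: {acc.mean():.2f}")

# Refitting on the full sample exposes feature importances, analogous to how emotion
# dysregulation was ranked as the strongest predictor in Pontén et al.
model.fit(X, y)
print("Top feature importances:", np.round(np.sort(model.feature_importances_)[::-1][:3], 3))
```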
3.1.2. Fully Automated Digital Interventions
Two cluster- or individually randomized trials tested AI systems that intervened directly, in real time, to modify help-seeking behaviour or counselling flow. Salmi et al. (2024) [] embedded a transformer-based recommender (BERT) in the Dutch 113 suicide-prevention helpline. For each incoming message, the model surfaced five context-matched responses drawn from a library of successful past chats. When counsellors consulted those suggestions before composing their own reply, 83% rated them helpful, and mean response latency declined, although counsellors’ self-efficacy scores remained unchanged, suggesting that efficiency gains may precede measurable changes in practitioner confidence. Jaroszewski et al. (2019) [] deployed a brief, fully automated barrier reduction intervention on the public crisis-support app Koko.
Machine learning classifiers flagged posts signalling acute distress; the platform then delivered personalized messages that addressed common obstacles to contacting professional services. Within hours, the proportion of distressed users who accessed external crisis resources was 23% higher in the intervention arm than under usual care, indicating that minimal-contact digital nudges can shift help-seeking even during periods of intense psychological pain.
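The retrieval pattern underlying such recommender systems can be sketched in a few lines: embed the incoming message, score its similarity against a library of past responses, and surface the closest matches. The snippet below uses TF-IDF purely as a stand-in for the Dutch BERT (BERTje) encoder used in the 113 helpline study, and the candidate responses are invented examples rather than material from any deployed system.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical library of anonymized responses drawn from past successful chats.
library = [
    "It sounds like you are carrying a great deal right now. Can you tell me more?",
    "Thank you for reaching out tonight. You are not alone in this.",
    "Have you been able to talk to anyone close to you about these thoughts?",
    "Would it help to go through what has kept you safe so far?",
    "I hear how exhausted you are. Let's take this one step at a time.",
    "Is there something that made today feel especially hard?",
]
incoming = "I can't keep going like this, everything feels pointless."

# The deployed system embedded messages with a transformer model; TF-IDF stands in here
# solely to illustrate the nearest-neighbour retrieval step.
vectorizer = TfidfVectorizer().fit(library + [incoming])
similarities = cosine_similarity(
    vectorizer.transform([incoming]), vectorizer.transform(library)
)[0]

top_five = similarities.argsort()[::-1][:5]  # indices of the five closest candidates
for rank, idx in enumerate(top_five, start=1):
    print(f"{rank}. (similarity={similarities[idx]:.2f}) {library[idx]}")
```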
3.1.3. AI-Assisted Treatment Allocation
Myers et al. (2024) [] examined whether algorithmic decision rules could improve the match between patient and therapy in a sample of U.S. veterans at high suicide risk randomized to mindfulness-based cognitive therapy for suicide prevention (MBCT-S; n = 71) or enhanced treatment as usual (n = 69). A random forest-derived personalized advantage index estimated, for each participant, the relative benefit of MBCT-S versus control (Table 1).
When randomization happened to coincide with the model’s recommendation, the 12-month rate of suicidal events fell significantly compared with patients whose assignment contradicted the recommendation (AUC = 0.70). Although the study was powered only for proof of concept, it demonstrates that data-driven treatment rules can meaningfully influence clinical outcomes and merits validation in larger, more diverse cohorts (Table 1).
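A personalized advantage index of this kind is typically built by fitting separate outcome models within each randomized arm and comparing each patient’s predicted outcomes under the two options. The sketch below illustrates that logic on synthetic data; it is not Myers et al.’s model, and a real analysis would rely on cross-validated (out-of-fold) predictions and clinically defined outcomes.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 140
X = rng.normal(size=(n, 8))            # synthetic baseline predictors
arm = rng.integers(0, 2, size=n)       # 1 = MBCT-S, 0 = enhanced treatment as usual
# Synthetic outcome (lower = fewer suicidal events); treatment benefit depends on X[:, 0].
outcome = 1.0 - 0.8 * arm * (X[:, 0] > 0) + rng.normal(scale=0.5, size=n)

# One common construction: fit a separate outcome model within each randomized arm.
model_treatment = RandomForestRegressor(n_estimators=300, random_state=0).fit(
    X[arm == 1], outcome[arm == 1])
model_control = RandomForestRegressor(n_estimators=300, random_state=0).fit(
    X[arm == 0], outcome[arm == 0])

# PAI: predicted outcome under control minus under treatment (positive favours MBCT-S).
# A real analysis would use out-of-fold predictions to avoid in-sample optimism.
pai = model_control.predict(X) - model_treatment.predict(X)
recommended_arm = (pai > 0).astype(int)

concordant = recommended_arm == arm    # randomization happened to match the recommendation
print(f"Mean outcome, concordant assignments: {outcome[concordant].mean():.2f}")
print(f"Mean outcome, discordant assignments: {outcome[~concordant].mean():.2f}")
```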
3.2. Risk of Bias in Included Trials
PEDro scores spanned 5 to 8, with a median of 8, indicating that most studies met a large majority of methodological criteria. Four studies were classified as high-quality, one as moderate-quality, and one as low-quality (Table 2). Inter-rater reliability for the PEDro assessments was excellent. When the two reviewers independently scored all six trials, the overall Cronbach’s alpha for the ten rated PEDro items was 0.83, indicating a high level of internal consistency and confirming that the quality ratings were reproducible across evaluators.
Table 2.
Methodological quality of the included randomized controlled trials (PEDro Scale).
Applying the GRADE framework, the overall confidence we can place in the current body of evidence is limited. For the two trials that reported hard clinical end points, suicide attempts (Rozek et al., 2020) [] and suicidal events over 12 months (Myers et al., 2024) [], certainty was downgraded to low. Both studies were small, had appreciable attrition, and produced wide confidence intervals, raising serious concerns regarding risk of bias and imprecision. Evidence for change in suicidal ideation trajectories, derived from a single moderate-quality psychotherapy trial in adults aged ≥60 years (Alexopoulos et al., 2021) [], was rated very low: the absence of corroborating trials, potential baseline differences that could not be fully adjusted, and the age-restricted sample generated additional downgrades for inconsistency and indirectness. Outcomes rooted in help-seeking behaviour or helpline process metrics also received very low or low ratings. Jaroszewski et al. (2019) [] demonstrated a 23% increase in use of crisis services after a brief digital nudge, but the evidence was weakened by self-selection of platform users, non-blinded assessment, and a follow-up window of only a few hours. Salmi et al. (2024) [] showed that a BERT-based recommender improved counsellor latency and was judged helpful in 83% of instances; however, these are intermediate process outcomes whose link to reduced suicidality has yet to be demonstrated, leading to downgrades for indirectness (Table 2).
Beyond individual trial findings, several methodological patterns were consistent across the six included RCTs. Sample sizes were uniformly small (all <300 participants), limiting statistical power for rare events such as suicide attempts or deaths. Follow-up durations were generally short (ranging from hours to 24 months), with only one study extending observation beyond one year. None of the trials achieved participant or therapist blinding, and assessor blinding was incomplete in most cases, reflecting challenges typical of digital behavioural interventions. Attrition rates exceeded 15% in at least one trial, and intention-to-treat analyses were inconsistently applied. Where discrimination statistics were reported, they have been presented verbatim. These features collectively explain the GRADE downgrades for imprecision and indirectness, and they underscore the need for larger, multisite trials with longer-term follow-up. None of the included trials reported adverse events or safety issues directly attributable to AI-based interventions. The absence of such outcomes has been noted explicitly for completeness.
4. Discussion
4.1. Synthesis of Evidence Across Functional Domains
To structure the analysis, findings from the six included RCTs are synthesized according to three functional domains in which AI tools have been applied to suicide prevention: risk prediction, fully automated digital interventions, and AI-assisted treatment allocation. This framework enables comparison across heterogeneous designs and outcomes while highlighting both shared methodological patterns and domain-specific contributions.
4.1.1. What Do ML Risk Scores Add Beyond Clinician Judgment?
Evidence from three RCT-derived analyses suggests that machine-learning (ML) risk scores can meaningfully stratify patients even within controlled therapeutic environments. In military personnel, data-driven models captured a notable proportion of future suicide attempters, illustrating how ensemble methods can detect high-risk individuals overlooked by conventional statistics []. In adolescents receiving internet-delivered CBT, ML offered only marginal gains beyond clinician judgement, echoing meta-analytic findings that incremental predictive value diminishes in samples already enriched for risk []. In late-life depression, ensemble learning differentiated short-term trajectories of suicidal ideation and identified hopelessness, neuroticism, and low self-efficacy as proximal, modifiable drivers []. Collectively, these findings indicate that ML-based risk scores may enhance prognostic accuracy and inform within-trial stratification, but they also underscore context-specific ceilings on predictive utility, particularly in small, narrowly defined populations.
Beyond suicide prevention specifically, risk-prediction models should also be considered within the wider literature on clinical prognostics in psychiatry. ML has been applied to anticipate psychiatric hospitalizations, relapse trajectories, and acute crisis episodes, highlighting both overlapping opportunities and unique challenges in suicide-related prediction. These broader applications reinforce the view that AI-derived risk scores can refine prognostic accuracy, yet they also underscore the need for caution. Current RCTs remain underpowered, with short follow-up periods and small samples, which limit the certainty of evidence and raise questions about the stability and generalizability of such predictions across different populations and health systems.
4.1.2. From Efficiency to Outcomes: Can Automated Prompts Change Suicidal Behavior?
The Dutch helpline trial demonstrated that a transformer-based recommender reduced counsellor latency and was judged helpful in 83% of chats, yet did not improve counsellor self-efficacy []. This mirrors broader chatbot research in depression and anxiety, where efficiency gains often preceded measurable clinical benefit [,]. In contrast, a minimalist barrier-reduction nudge increased crisis-service uptake by 23% within hours [], echoing real-world observations that timely prompts can shift help-seeking behaviour in online communities []. Taken together, these findings suggest that even lightweight AI interventions can influence proximal behavioural targets, although the downstream impact on suicidal acts remains unproven.
4.1.3. Turning Prediction into Decisions: The Promise and Pitfalls of Algorithm-Guided Allocation
Only one study evaluated an algorithmic rule to personalize therapy choice. When randomization happened to match the personalized advantage index (PAI) recommendation, suicidal events fell significantly at one year []. The result dovetails with small adaptive-treatment pilots in adolescent depression [] and supports the hypothesis that algorithm-guided allocation can translate predictive power into actionable clinical decisions, provided larger confirmatory trials establish robustness.
4.2. Methodological Strengths and Weaknesses
PEDro scores were high (median = 8/10), with excellent inter-rater reliability (α = 0.83), indicating generally sound randomization, baseline equivalence, and statistical reporting.
As noted in the Methods, compliance with AI-specific reporting standards such as CONSORT-AI or SPIRIT-AI was not formally evaluated and did not form part of our eligibility criteria; we nevertheless recommend their systematic adoption in future randomized trials of AI-based suicide-prevention tools to enhance transparency, reproducibility, and clinical applicability.
A limitation common to all included RCTs is the absence of prospective protocol registration. Neither participants nor therapists were blinded, and blinding of assessors was incomplete; these are limitations typical of digital behavioural trials []. Attrition exceeded 15% in one study, and one trial failed to conduct an intention-to-treat analysis, driving GRADE downgrades for risk of bias. Imprecision was common: all six trials enrolled fewer than 300 participants, and confidence intervals around clinical endpoints were wide. As a result, certainty ratings never exceeded low for hard outcomes (attempts or events) and were very low for help-seeking and process surrogates (Table 3). These ratings align with simulation work showing that small trial samples inflate apparent ML accuracy and understate variance [,].
Table 3.
Certainty of the evidence for each outcome assessed with the GRADE approach.
The review protocol was not prospectively registered (e.g., PROSPERO) due to time constraints and scope evolution aligned with rapid developments in clinical AI. We recognize that non-registration may increase the risk of selective reporting; to mitigate this, we applied a priori eligibility criteria, dual independent screening/extraction, and we report all prespecified outcomes. We commit to prospective registration for future updates or extensions.
4.3. Relationship to the Wider Literature
Our synthesis aligns with earlier observational reviews reporting that ML models often outperform traditional scales in predictive discrimination [,]. However, the randomized evidence base shows that these gains have not yet translated into consistent reductions in suicidal behaviour. This discrepancy highlights the importance of trial design: while large observational rollouts such as REACH-VET demonstrated population-level impact [], smaller RCTs reveal that predictive accuracy alone does not guarantee clinical benefit. Similarly, fully automated digital interventions have demonstrated improvements in workflow efficiency and help-seeking behaviour [,], but these process outcomes remain surrogate markers, and sustained effects on symptoms or mortality have yet to be established. These gaps echo recent calls for hybrid effectiveness–implementation trials capable of linking proximal digital markers—such as latency, engagement, and sentiment—with long-term clinical outcomes [,].
4.4. Ethical, Governance, and Equity Considerations
Large-scale AI deployment raises unresolved questions about data ownership, consent, and algorithmic bias. Recent commentaries warn that over-reliance on proprietary “black box” models could erode patient trust and obscure sources of inequity []. Regulatory frameworks such as CONSORT-AI mandate transparency in algorithm specification and performance across demographic strata [], yet only two of the six trials provided error analyses stratified by sex or age. Future work should embed continuous bias auditing and adopt open-science practices, a direction urged by the open-science community in suicide research []. Moreover, none of the included trials were conducted in low- or middle-income countries, leaving questions about the generalisability of AI tools to settings with different digital infrastructures and care pathways.
Beyond traditional clinical concerns, AI can inadvertently generate misinformation, as underscored by the World Health Organization during the COVID-19 pandemic. Recent work on infodemic risk management demonstrates how machine learning–enhanced graph analytics can monitor and mitigate the spread of health-related fake news on a global scale []. These developments suggest that AI safety is not solely a clinical issue but also a public health priority, requiring international monitoring standards and potentially global centres for oversight.
Ethical and governance considerations also extend to privacy-preserving architectures. Suicide-prevention trials involve highly vulnerable populations and sensitive data, making methods such as federated learning and differential privacy particularly relevant. Emerging multi-parametric evaluation strategies for large language models, such as the framework proposed by Sblendorio et al., 2024 [], emphasize regulatory alignment with OECD principles, temporal response consistency, and robustness under ambiguous inputs. Embedding these criteria into trial design would enhance transparency, accountability, and fairness, ensuring that AI systems meet international safety standards.
4.5. Clinical and Research Implications
Future work should focus on several mutually reinforcing priorities. First, hopelessness, emotion dysregulation, and low self-efficacy repeatedly surfaced as high-leverage predictors across trials; embedding brief, evidence-based modules that directly modify these constructs may amplify both preventive and therapeutic impact. Second, the field must bridge proximal process gains, such as shorter chat latency, with distal clinical end points by incorporating mediation analyses and longer follow-up, thereby clarifying whether efficiency truly translates into fewer suicidal ideations or attempts. Third, the promising personalized-advantage index requires replication in multisite pragmatic RCTs that pit algorithm-guided allocation against stepped-care pathways already used in practice, with cost-effectiveness analyses to determine scalability. Fourth, universal adoption of the SPIRIT-AI and CONSORT-AI reporting extensions, along with public release of de-identified model code, would enhance reproducibility and facilitate systematic bias auditing []. Finally, equity must become a central design criterion: trials conducted in diverse cultural and resource settings, potentially leveraging passive sensing via ubiquitous mobile devices, are essential to ensure that AI-enabled suicide-prevention tools benefit underserved populations as well as those in digitally mature environments [].
Looking forward, several methodological approaches could strengthen the evidence base for AI in suicide prevention. Adaptive randomized controlled trials, stepped-wedge designs, and hybrid effectiveness–implementation trials are particularly well suited to digital psychiatry, where both technologies and user behaviours evolve rapidly. Such designs could balance internal validity with external generalisability, while also allowing iterative refinement of AI systems in real-world service contexts.
In addition, psychological and relational dynamics shape the effectiveness of AI-supported interventions. The RCTs included in this review primarily addressed acute suicidal crises, where short-term risk mitigation is paramount. By contrast, patients experiencing chronic psychosocial stressors—such as those with pain catastrophizing in oncology []—may require digital tools that emphasize continuity, resilience, and sustained engagement. Integrating these relational dimensions into evaluation frameworks could bridge the current gap between proximal process metrics (e.g., latency, accuracy) and distal clinical outcomes (e.g., distress reduction, adherence, and self-efficacy).
Finally, a more interdisciplinary perspective is essential. Future evaluations of AI in suicide prevention should draw not only on psychiatry and data science, but also on nursing science, behavioural psychology, and bioethics. Situating AI applications within diverse care trajectories and patient needs would yield a more nuanced, clinically grounded vision for safe and effective integration.
5. Conclusions
This systematic review is the first to evaluate randomized evidence for AI applications in suicide prevention through a functional-domain lens. Six RCTs—three risk-prediction models, two fully automated digital interventions, and one AI-assisted treatment-allocation tool—demonstrate that AI can (i) refine short-term risk stratification, (ii) nudge crisis help-seeking or streamline helpline workflows, and (iii) personalize therapy choice. Methodological quality was generally high, yet the certainty of evidence remained low because all trials were small, largely unblinded, and underpowered to detect rare hard outcomes such as suicide attempts or deaths.
Moreover, the absence of cost-effectiveness analyses, equity assessments, and long-term follow-up restricts our understanding of AI’s sustainable clinical and societal impact. The exclusive focus on high-income settings also underscores the urgent need for culturally sensitive and contextually adapted research in diverse populations.
From a clinical psychology perspective, translating these promising technological advances into real-world practice demands adherence to rigorous reporting standards (CONSORT-AI/SPIRIT-AI), larger and demographically diverse samples, and comprehensive outcome measures that capture both proximal indicators (e.g., engagement, response times) and distal, meaningful endpoints (e.g., suicidal behaviours, mortality). Continuous bias monitoring and economic evaluations are essential to ensure ethical, equitable, and cost-effective integration of AI tools. Until such data emerge, AI tools should be viewed as complementary, rather than replacement, elements within comprehensive, clinician-led suicide-prevention strategies.
Supplementary Materials
The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/psychiatryint6040143/s1, Table S1: The PRISMA Checklist.
Author Contributions
Conceptualization, I.H.-P. and C.R.-N.; methodology, I.F.-Q. and I.H.-P.; validation, I.F.-Q. and F.L.-E.; formal analysis, I.F.-Q., C.R.-N. and I.H.-P.; investigation, I.F.-Q.; data curation, I.H.-P. and C.S.-L.; writing—original draft preparation, I.H.-P. and I.F.-Q.; writing—review and editing, R.M.-S. and F.L.-E.; visualization, C.R.-N.; supervision, I.H.-P. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The raw data supporting the conclusions of this article will be made available by the authors on request.
Conflicts of Interest
The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| AI | Artificial Intelligence |
| ML | Machine Learning |
| RCT | Randomized Controlled Trial |
| BCBT | Brief Cognitive Behavioural Therapy for Suicide Prevention |
| TAU | Treatment as Usual |
| NSSI | Non-Suicidal Self-Injury |
| CBT | Cognitive Behavioural Therapy |
| MDD | Major Depressive Disorder |
| PST | Problem-Solving Therapy |
| LASSO | Least Absolute Shrinkage and Selection Operator regression |
| RF | Random Forest |
| GBM | Gradient Boosting Machine |
| CART | Classification and Regression Tree |
| EHR | Electronic Health Record |
| AUC | Area Under the Receiver Operating Characteristic Curve |
| PAI | Personalized Advantage Index |
| BERT | Bidirectional Encoder Representations from Transformers |
| BERTje | Dutch-language version of BERT |
| BRI | Barrier Reduction Intervention |
| MI | Motivational Interviewing |
References
- World Health Organization. Suicide Worldwide in 2019: Global Health Estimates; World Health Organization: Geneva, Switzerland, 2021; Available online: https://www.who.int/publications/i/item/9789240026643 (accessed on 10 July 2025).
- Jha, S.; Chan, G.; Orji, R. Identification of risk factors for suicide and insights for developing suicide-prevention technologies: A systematic review and meta-analysis. Hum. Behav. Emerg. Technol. 2023, 2023, 3923097.
- Franklin, J.C.; Ribeiro, J.D.; Fox, K.R.; Bentley, K.H.; Kleiman, E.M.; Huang, X.; Musacchio, K.M.; Jaroszewski, A.C.; Chang, B.P.; Nock, M.K. Risk factors for suicidal thoughts and behaviors: A meta-analysis of 50 years of research. Psychol. Bull. 2017, 143, 187–232.
- Naifeh, J.A.; Mash, H.B.H.; Stein, M.B.; Fullerton, C.S.; Kessler, R.C.; Ursano, R.J. The Army Study to Assess Risk and Resilience in Servicemembers (Army STARRS): Progress toward understanding suicide among soldiers. Mol. Psychiatry 2019, 24, 34–48.
- Kirtley, O.J.; Janssens, J.; Kaurin, A. Open science in suicide research is open for business. Crisis 2022, 43, 355–360.
- Carter, G.; Milner, A.; McGill, K.; Pirkis, J.; Kapur, N.; Spittal, M.J. Predicting suicidal behaviours using clinical instruments: Systematic review and meta-analysis of positive predictive values for risk scales. Br. J. Psychiatry 2017, 210, 387–395.
- Zare, A.; Shafaei Bajestani, N.; Khandehroo, M. Machine learning in public health. J. Res. Health 2024, 14, 207–208.
- Shatte, A.B.R.; Hutchinson, D.M.; Teague, S.J. Machine learning in mental health: A scoping review of methods and applications. Psychol. Med. 2019, 49, 1426–1448.
- Belsher, B.E.; Smolenski, D.J.; Pruitt, L.D.; Bush, N.E.; Beech, E.H.; Workman, D.E.; Morgan, R.L.; Evatt, D.P.; Tucker, J.; Skopp, N.A. Prediction models for suicide attempts and deaths: A systematic review and simulation. JAMA Psychiatry 2019, 76, 642–651.
- Sheu, Y.H.; Sun, J.; Lee, H.; Castro, V.M.; Barak-Corren, Y.; Song, E.; Madsen, E.M.; Gordon, W.J.; Kohane, I.S.; Churchill, S.E.; et al. An efficient landmark model for prediction of suicide attempts in multiple clinical settings. Psychiatry Res. 2023, 323, 115175.
- Garriga, R.; Mas, J.; Abraha, S.; Nolan, J.; Harrison, O.; Tadros, G. Machine-learning model to predict mental-health crises from electronic health records. Nat. Med. 2022, 28, 1240–1248.
- Wang, S.; Pathak, J.; Zhang, Y. Using electronic health records and machine learning to predict postpartum depression. Stud. Health Technol. Inform. 2019, 264, 888–892.
- Ahmad, R.; Siemon, D.; Gnewuch, U.; Robra-Bissantz, S. Designing personality-adaptive conversational agents for mental health care. Inf. Syst. Front. 2022, 24, 923–943.
- Fitzpatrick, K.K.; Darcy, A.; Vierhile, M. Delivering cognitive behavior therapy to young adults with symptoms of depression and anxiety using a fully automated conversational agent (Woebot): A randomized controlled trial. JMIR Ment. Health 2017, 4, e19.
- DeVylder, J.E. Suicide risk prediction in clinical settings—additional considerations for face-to-face screening and machine learning approaches. JAMA Netw. Open 2022, 5, e2212106.
- Dhelim, S.; Chen, L.; Ning, H.; Nugent, C.D. Artificial intelligence for suicide assessment using audiovisual cues: A review. Artif. Intell. Rev. 2022, 56, 5591–5618.
- Boudreaux, E.D.; Rundensteiner, E.; Liu, F.; Wang, B.; Larkin, C.; Agu, E.; Ghosh, S.; Semeter, J.; Simon, G.; Davis-Martin, R.E. Applying machine-learning approaches to suicide prediction using healthcare data: Overview and future directions. Front. Psychiatry 2021, 12, 707916.
- Liu, X.; Cruz-Rivera, S.; Moher, D.; Calvert, M.J.; Denniston, A.K.; SPIRIT-AI and CONSORT-AI Working Group. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: The CONSORT-AI extension. Lancet Digit. Health 2020, 2, e537–e548.
- Ni, Y.; Jia, F. A scoping review of AI-driven digital interventions in mental healthcare: Mapping applications across screening, support, monitoring, prevention, and clinical education. Healthcare 2025, 13, 1205.
- Kessler, R.C.; Stein, M.B.; Petukhova, M.V.; Bliese, P.; Bossarte, R.M.; Bromet, E.J.; Fullerton, C.S.; Gilman, S.E.; Ivany, C.; Lewandowski-Romps, L.; et al. Predicting suicides after outpatient mental health visits in the Army Study to Assess Risk and Resilience in Servicemembers (Army STARRS). Mol. Psychiatry 2017, 22, 544–551.
- Matarazzo, B.B.; Eagan, A.; Landes, S.J.; Mina, L.K.; Clark, K.; Gerard, G.R.; McCarthy, J.F.; Trafton, J.; Bahraini, N.H.; Brenner, L.A.; et al. The Veterans Health Administration REACH-VET Program: Suicide predictive modeling in practice. Psychiatr. Serv. 2023, 74, 206–209.
- Ehtemam, H.; Esfahlani, S.S.; Sanaei, A.; Ghaemi, M.M.; Hajesmaeel-Gohari, S.; Rahimisadegh, R.; Bahaadinbeigy, K.; Ghasemian, F.; Shirvani, H. Role of machine learning algorithms in suicide risk prediction: A systematic review–meta-analysis of clinical studies. BMC Med. Inform. Decis. Mak. 2024, 24, 138.
- Gual-Montolio, P.; Jaén, I.; Martínez-Borba, V.; Castilla, D.; Suso-Ribera, C. Using artificial intelligence to enhance ongoing psychological interventions for emotional problems in real- or close-to-real-time: A systematic review. Int. J. Environ. Res. Public Health 2022, 19, 7737.
- Sblendorio, E.; Dentamaro, V.; Lo Cascio, A.; Germini, F.; Piredda, M.; Cicolini, G. Integrating human expertise & automated methods for a dynamic and multi-parametric evaluation of large language models’ feasibility in clinical decision-making. Int. J. Med. Inform. 2024, 188, 105501.
- Rozek, D.C.; Andres, W.C.; Smith, N.B.; Leifker, F.R.; Arne, K.; Jennings, G.; Dartnell, N.; Bryan, C.J.; Rudd, M.D. Using machine learning to predict suicide attempts in military personnel. Psychiatry Res. 2020, 294, 113515.
- Pontén, M.; Flygare, O.; Bellander, M.; Karemyr, J.; Nilbrink, J.; Hellner, C.; Ojala, O.; Bjureberg, J. Comparison between clinician and machine-learning prediction in a randomized controlled trial for nonsuicidal self-injury. BMC Psychiatry 2024, 24, 904.
- Alexopoulos, G.S.; Raue, P.J.; Banerjee, S.; Mauer, E.; Marino, P.; Soliman, M.; Kanellopoulos, D.; Solomonov, N.; Adeagbo, A.; Sirey, J.A.; et al. Modifiable predictors of suicidal ideation during psychotherapy for late-life major depression: A machine-learning approach. Transl. Psychiatry 2021, 11, 536.
- Salmi, S.; Mérelle, S.; van Eijk, N.; Gilissen, R.; van der Mei, R.; Bhulai, S. Real-time assistance in suicide-prevention helplines using a deep-learning recommender system: A randomized controlled trial. Int. J. Med. Inform. 2024, 195, 105760.
- Jaroszewski, A.C.; Morris, R.R.; Nock, M.K. A randomized controlled trial of an online machine-learning-driven risk-assessment and intervention platform for increasing the use of crisis services. J. Consult. Clin. Psychol. 2019, 87, 370–379.
- Myers, C.E.; Dave, C.V.; Chesin, M.S.; Marx, B.P.; St Hill, L.M.; Reddy, V.; Miller, R.B.; King, A.; Interian, A. Initial evaluation of a Personalized Advantage Index to determine who may benefit from mindfulness-based cognitive therapy for suicide prevention. Behav. Res. Ther. 2024, 183, 104637.
- Hang, C.N.; Yu, P.D.; Chen, S.; Tan, C.W.; Chen, G. MEGA: Machine learning-enhanced graph analytics for infodemic risk management. IEEE J. Biomed. Health Inform. 2023, 27, 6100–6111.
- Mercadante, S.; Ferrera, P.; Lo Cascio, A.; Casuccio, A. Pain catastrophizing in cancer patients. Cancers 2024, 16, 568.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).