Artificial Intelligence and Suicide Prevention: A Systematic Review of Machine Learning Investigations

Suicide is a leading cause of death that defies prediction and challenges prevention efforts worldwide. Artificial intelligence (AI) and machine learning (ML) have emerged as a means of investigating large datasets to enhance risk detection. A systematic review of ML investigations evaluating suicidal behaviors was conducted using PubMed/MEDLINE, PsycINFO, Web of Science, and EMBASE, employing search strings and MeSH terms relevant to suicide and AI. Database searches were supplemented by hand-search techniques and Google Scholar. Inclusion criteria were: (1) journal article, available in English; (2) original investigation; (3) employment of AI/ML; and (4) evaluation of a suicide risk outcome. N = 594 records were identified based on abstract search, along with 25 hand-searched reports. N = 461 reports remained after duplicates were removed, and n = 316 were excluded after abstract screening. Of n = 149 full-text articles assessed for eligibility, n = 87 were included for quantitative synthesis, grouped according to suicide behavior outcome. Reports varied widely in methodology and outcomes. Results suggest high levels of risk classification accuracy (>90%) and area under the curve (AUC) in the prediction of suicidal behaviors. We report key findings and central limitations in the use of AI/ML frameworks to guide additional research, which holds the potential to impact suicide prevention on a broad scale.


Introduction
Suicide is a complex but preventable public health problem that challenges prediction due to its transdiagnostic, yet rare, occurrence at the population level. Beyond the inestimable costs at the individual, family, and community levels, suicide currently outnumbers homicide and motor vehicle collisions as a cause of death [1,2], representing a public health emergency and resulting in an estimated cost of $93.5 billion to the U.S. economy [3]. Despite unprecedented strategies to advance awareness and treatment [4][5][6], suicide rates have remained intractable over time and have recently increased in some cases, rising by approximately 24% in the U.S. (10.5 to 13/100,000) from 1999 to 2014 [7].

Study Selection
This review was performed according to the EQUATOR/PRISMA guidelines (Enhancing the Quality and Transparency of Health Research/Preferred Reporting Items for Systematic Reviews and Meta-Analyses), which serve as an evidence-based protocol for the selection and reporting of systematic reviews and meta-analyses [30]. Given that suicidal behaviors occur across all ages and diverse medical conditions, diagnostic and age-related variables were not a basis for exclusion.

Data Collection Process
Reviewers (R.A.B., A.M.H.) independently reviewed abstracts, followed by full-text articles. A third reviewer (R.M.) made a final decision when consensus was lacking. Source documents were assessed according to the following inclusion criteria: (1) journal article (available in English); (2) original investigation (non-review/commentary); (3) employment of AI/ML methodology; and (4) evaluation of a suicide risk outcome (i.e., defined using CDC-derived guidelines [31] for suicidal self-directed violence; non-suicidal self-injury was excluded), grouped and labeled by suicidal behavior type (e.g., suicide ideation, suicide attempts, suicide death, or other). Studies identified by the above search strategy were managed using EndNote X8. Reports failing to meet inclusion criteria were systematically excluded with reasons. A PRISMA flow chart [30] was created to graphically depict inclusion/exclusion of studies by level of (1) identification, (2) screening, (3) eligibility, and (4) inclusion, and reports were coded with established quality ratings [32]. Reports were further grouped according to suicidal behavior type, sample characteristics, and AI/ML methodology. The latter included use of supervised learning, which aims to predict outcomes based on a set of input values used to train a classifier; in unsupervised learning, by contrast, no labels are provided, and the aim is instead to describe data patterns, often by way of clustering, based on input measures. Studies were also coded for use of natural language processing (NLP), in which a computer automatically or semi-automatically processes human-generated language, and summarized by other study characteristics, including evaluation of biological markers of suicide risk.
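The supervised/unsupervised distinction described above can be illustrated with a minimal sketch (hypothetical one-dimensional risk scores, not data from any reviewed study): a supervised learner uses the labels to fit a decision rule, while an unsupervised learner describes structure in the same scores without labels.

```python
def train_threshold_classifier(scores, labels):
    """Supervised: use the labels to pick the cutoff that best separates classes."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(scores)):
        preds = [1 if s >= t else 0 for s in scores]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

def kmeans_1d(scores, iters=10):
    """Unsupervised: no labels; describe structure via two-cluster k-means."""
    c0, c1 = min(scores), max(scores)
    for _ in range(iters):
        g0 = [s for s in scores if abs(s - c0) <= abs(s - c1)]
        g1 = [s for s in scores if abs(s - c0) > abs(s - c1)]
        c0 = sum(g0) / len(g0) if g0 else c0
        c1 = sum(g1) / len(g1) if g1 else c1
    return c0, c1

scores = [0.1, 0.2, 0.25, 0.7, 0.8, 0.9]        # hypothetical risk scores
labels = [0, 0, 0, 1, 1, 1]                     # known outcomes (supervised only)
t = train_threshold_classifier(scores, labels)  # learned from the labels
c0, c1 = kmeans_1d(scores)                      # discovered without labels
```

Here the classifier recovers a cutoff between the two labeled groups, while clustering recovers two centers from the scores alone; in the reviewed studies, both roles are played by far richer algorithms (random forests, SOMs, etc.).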

Data Analysis
Descriptive analyses were employed to analyze study findings by key design characteristics according to suicide risk outcome, where ML parameters (e.g., area under the curve (AUC), accuracy, sensitivity, specificity) were reported. The collated data were not amenable to synthesis, and meta-analysis was therefore not possible.
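For reference, the ML parameters summarized above can be computed from a set of binary predictions and continuous scores as follows (a minimal sketch with hypothetical values; AUC is computed via the rank-based Mann-Whitney identity):

```python
def confusion_metrics(y_true, y_pred):
    """Accuracy, sensitivity (TPR), and specificity (TNR) from binary predictions."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity

def auc(y_true, y_score):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    pos = [s for s, t in zip(y_score, y_true) if t == 1]
    neg = [s for s, t in zip(y_score, y_true) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and model scores (illustrative only)
y_true = [0, 0, 0, 1, 1, 1]
y_score = [0.2, 0.4, 0.6, 0.5, 0.7, 0.9]
y_pred = [1 if s >= 0.5 else 0 for s in y_score]
acc, sens, spec = confusion_metrics(y_true, y_pred)
```

Note that accuracy, sensitivity, and specificity depend on the chosen decision threshold, whereas AUC summarizes ranking performance across all thresholds, which is one reason incomplete reporting of any subset of these metrics hampers cross-study comparison.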

Data Extraction
A total of n = 594 records were identified according to the above search methods. An additional 25 records were identified through hand-search techniques and Google Scholar. Of n = 461 unique articles, n = 316 were excluded after abstract screening. Full-text review was performed for n = 149 articles according to study inclusion criteria. Forty-nine reports failed to meet stated inclusion criteria, including failure to: represent an original report (i.e., vs. a review/commentary), employ AI/ML methodology, or evaluate a suicide risk outcome (i.e., according to CDC-defined suicidal behaviors). N = 87 studies were included for qualitative synthesis; a subsample of reports (n = 13) met criteria as a subset of this review (see Figure 1). These evaluated emotional content among suicide decedent notes using natural language processing (NLP) and are discussed separately. A total of n = 87 studies were included in the quantitative analysis. For reports meeting primary review inclusion, this represented an aggregate total of n = 5,986,238 patients. Sample size was unreported or unavailable in n = 7 studies.

Broad Outcome Groupings
For broad outcome groupings in the quantitative synthesis, a total of n = 42 reports assessed suicide attempt (n = 28) or suicide death (n = 14) as a primary outcome. A total of n = 45 studies evaluated suicidal ideation (i.e., history and current symptoms) (n = 9), multiple risk outcomes (n = 18), and other-social media (n = 10) or other-undifferentiated (n = 8) risk outcomes.

ML Techniques and Learning Methods
ML methodology varied widely across reports and included both supervised and unsupervised learning algorithms. The majority of studies employed supervised learning techniques, including ensemble learning methods (e.g., random forests), naïve Bayes classification, decision trees, logistic/least squares regression, and support vector machines (SVM). In comparison, n = 7 studies used unsupervised learning techniques, including clustering algorithms, neural networks, self-organizing maps (SOM), principal component analysis (PCA), and decision trees. Only three studies used both supervised and unsupervised learning methods. Cross-validation techniques, or methods for splitting the data into training and test sets for model performance testing, were variably reported, with few investigations using distinct datasets, separated in time, for training and testing models. (See Table 1 for all studies (n = 87); see Table 2 for the subset of reports (n = 13) in this review.)
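The temporally separated train/test split noted above, used by only a few investigations, can be sketched as follows (hypothetical records and field names, chosen here for illustration):

```python
# Each record carries a year so that training strictly precedes testing in time.
def temporal_split(records, cutoff_year):
    """Train on records before the cutoff; hold out later records for testing."""
    train = [r for r in records if r["year"] < cutoff_year]
    test = [r for r in records if r["year"] >= cutoff_year]
    return train, test

records = [
    {"year": 2010, "label": 0},
    {"year": 2012, "label": 1},
    {"year": 2015, "label": 0},
    {"year": 2017, "label": 1},
]
train, test = temporal_split(records, cutoff_year=2015)
```

Unlike a random k-fold split, this design prevents information from later records leaking into a model that is then evaluated on an earlier period's "future," which is the scenario a deployed risk-prediction model actually faces.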

Table 2. Investigations (n = 13) by study subset and ML parameters; outcomes focused on sentiment detection of suicide decedent notes using NLP. Notes: Quality ratings were performed according to the Oxford Centre for Evidence-Based Medicine Protocol; ML = machine learning; NLP = natural language processing; Precision = positive predictive value (PPV); NR = not reported.

Design Characteristics
Broadly grouped, several general approaches were evident in study design methodology: (1) investigations designed to explore the accuracy of diagnostic classification (i.e., using ML techniques and a large dataset or number of variables) to identify those at risk by classifying a binary suicide risk outcome (n = 65) (i.e., classification studies); or (2) investigations evaluating conceptual models of suicide risk, which employed PCA and other clustering algorithm methods (e.g., hidden layers, pattern discovery). Prospective designs were used in a small number of studies (n = 21), whereas the majority of investigations used a cross-sectional study design. Several reports (n = 12) utilized a population-based or epidemiologic design, and over half included multi-site investigations. In general, according to the Oxford Centre for Evidence-Based Medicine Protocol [32], articles ranged between ratings of 2-4, with most represented by a 3 rating. A 2 rating describes well-designed, controlled trials without randomization or prospective comparative cohort trials; a 3 rating refers to studies that employ case controls or retrospective cohort investigations; and a 4 rating represents case series studies, with or without intervention, or use of a cross-sectional design. No randomized controlled, adequately powered trials (i.e., a 1 rating) were identified in this review.

Sample Size and High Dimensional Data
Across all investigations, samples ranged in size from 55 to 975,057 (M = 74,815, SD = 217,839; Mdn = 761 participants). While the majority of reports harnessed big data, several studies (n = 6) investigated high-dimensional datasets with small sample sizes, which may increase the risk of overfitting. These studies evaluated a large volume of variables (i.e., >400), using smaller samples (i.e., ranging from n = 34-135 participants) to classify and detect differences in risk outcome.
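The overfitting risk described above can be simulated directly: when there are far more variables than participants, some variable will appear to separate the groups well purely by chance. A sketch with simulated, truly uninformative data (labels are random, so true predictive accuracy is 50%):

```python
import random

random.seed(0)
n_subjects, n_features = 40, 400  # many more variables than participants

# Simulated data in which NO feature is truly informative: labels are random.
labels = [random.randint(0, 1) for _ in range(n_subjects)]
features = [[random.randint(0, 1) for _ in range(n_subjects)]
            for _ in range(n_features)]

def apparent_accuracy(feature, labels):
    """In-sample accuracy of a single binary feature, allowing either direction."""
    acc = sum(f == y for f, y in zip(feature, labels)) / len(labels)
    return max(acc, 1.0 - acc)

# The best single feature looks predictive despite chance-level truth (50%).
best = max(apparent_accuracy(f, labels) for f in features)
```

With 400 candidate variables and 40 subjects, the best in-sample "classifier" will typically exceed 65% apparent accuracy even though nothing real is being detected, which is why out-of-sample validation is essential in these designs.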

Sample Characteristics
Samples varied significantly in the ages studied, with the majority evaluating adults, and a smaller proportion investigating pediatric (n = 26), geriatric (n = 15), or all-age (n = 8) samples. A total of n = 16 studies evaluated risk among military personnel or veterans, and across all reports, the use of a clinical sample was observed in the majority of cases. These included participants recruited from high-risk or triage settings, such as the emergency department (ED) (n = 15). Reports demonstrated a primarily transdiagnostic focus, with few focusing on risk among specific psychiatric conditions, such as mood disorders and schizophrenia (n = 13). Other studies (n = 10) included social media investigations without diagnostic specifiers or assessed an undifferentiated outcome of suicide risk (e.g., suicide risk stratification and clinical decision-making prediction; suicide gene marker detection; human vs. machine learning classification testing). Finally, the number of studies utilizing electronic medical records (EMR) or administrative chart data was high, particularly in comparison with those using epidemiologic surveys (n = 12) or social media user data (n = 10). Use of a convenience sample or re-evaluation of archival datasets using ML techniques was common in comparison with a priori-designed studies.

Natural Language Processing and Biological Markers of Risk
Twenty-nine investigations employed natural language processing (NLP) in association with suicidal behaviors. These included investigations evaluating acoustic features of speech to identify risk within emergency department settings (n = 2), text-based applications (n = 1), or social media user data or posts (n = 15). A few such studies generated a word map to note word frequency in association with risk within EMR, medical discharge notes, or social media posts. A small number of ML investigations (n = 12) evaluated a biological marker of risk, such as plasma and blood metabolites (n = 2), genes (n = 8), and neuroimaging (n = 1), to predict risk for suicidal behaviors and hospitalization.
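At its simplest, the word-map approach mentioned above amounts to token frequency counting; a minimal sketch on hypothetical note text (not actual clinical data):

```python
from collections import Counter
import re

# Hypothetical note text (illustrative only, not actual clinical data)
notes = [
    "patient reports hopelessness and sleep problems",
    "denies ideation but reports hopelessness",
    "sleep disturbance and hopelessness noted",
]

# Tokenize and count word frequencies across all notes
tokens = [w for note in notes for w in re.findall(r"[a-z]+", note.lower())]
freq = Counter(tokens)
top = freq.most_common(3)  # most frequent terms, e.g., the basis for a word map
```

Real NLP pipelines in the reviewed studies go well beyond raw counts (stop-word removal, negation handling such as "denies ideation," and weighting schemes), but frequency maps of this kind are the common starting point.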

Timeframe of Assessment and Predictive Modeling
Timeframe of risk detection was variable, ranging from the next 24 h to lifetime assessments of suicide outcomes. Where reported, the majority (n = 15) investigated suicide risk prediction over a monthly timeframe, ranging from 1 to 24 months. Two reports investigated risk over an acute timeframe (24-72 h), and n = 12 studies evaluated lifetime risk. Four investigations reported multiple timeframes of risk, whereas n = 21 failed to report or specify precise observation or time-at-risk periods. An exploratory one-way analysis of variance (ANOVA) evaluating non-weighted accuracy and AUC values across reports detected no significant mean differences in highest accuracy according to type of suicide-related outcome.
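A one-way ANOVA of this kind compares between-group to within-group variance in reported accuracy values; a minimal sketch with hypothetical accuracy values (illustrative only, not the review's data):

```python
def one_way_anova_F(groups):
    """F statistic = mean square between groups / mean square within groups."""
    all_vals = [v for g in groups for v in g]
    grand = sum(all_vals) / len(all_vals)
    k, n = len(groups), len(all_vals)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical highest-accuracy values by outcome grouping (illustrative only)
attempt = [0.90, 0.92, 0.88]
death = [0.91, 0.89, 0.93]
ideation = [0.87, 0.90, 0.92]
F = one_way_anova_F([attempt, death, ideation])  # small F -> group means similar
```

A small F (well below the critical value for the relevant degrees of freedom) is consistent with the null finding reported above: group means for accuracy do not differ more than within-group scatter would predict.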

Discussion
Eighty-seven reports were identified in this systematic review, which included a subset of investigations evaluating emotional sentiment among suicide notes using ML methods. Across reports meeting primary inclusion criteria, the majority of studies examined risk for suicide attempts, followed by death by suicide, suicidal ideation, and multiple risk outcomes. A small proportion of studies fell outside these groupings, including those examining an undifferentiated outcome (e.g., unspecified suicidal behavior, or "suicidality") or harnessing social media data (e.g., suicide-related risk by Twitter or internet post content) [85][86][87][88][89][90][91][92]. Based on this review, use of AI/ML methods for suicide risk prediction is a burgeoning area of inquiry, reflected by the diversity of fields represented and the pace of publications. Though 1999 marked the earliest publication, nearly half of reports were published in the past three years. This suggests an area of rapid growth at a nascent stage of investigation, presenting opportunities to critically guide the field forward and address key gaps in the extant literature.
Machine learning methods varied substantially across studies and ranged significantly in rigor and model testing. Supervised learning was used more commonly than unsupervised learning techniques, and few studies used both. In general, exploratory investigations were overrepresented, and replication or application of a predictive model within a new setting or sample was rare. Several reports tested replication in a new cohort within the same setting, or used a multiple-wave sampling approach [15,26,34,97]; methodologically, these represent critical areas of importance for future studies and warrant replication. Classification studies were most commonly observed in this review, and excellent accuracy and area under the curve (AUC) values were reported, despite considerable differences in design, methodology, sample, and learning methods. AUC was the most frequently reported model performance metric, whereas accuracy, sensitivity, and specificity were reported in fewer than a third of reports. Across broad outcome groupings, underreporting and low cell counts challenged interpretation and precluded adequately powered comparisons.
Regarding generalizability, reports reflected a transdiagnostic focus and primarily assessed adult participants or patient records. A smaller number of reports examined high-risk, pediatric, or geriatric samples, as well as military veterans [15,26,33,34,53,72,97,112]. These highlight areas of elevated need and align with prioritized strategies and nationally directed initiatives for technology innovation in suicide prevention [5,6,130]. Investigations predominantly evaluated clinical samples or emergency settings, consistent with increased risk post-hospitalization [15,26,34,43,68,131]. Regarding constraints, archival datasets were common, with fewer studies employing prospective data elements [15,34,55,106]. Though convenience samples present inherent limitations, this has likewise been emphasized as a relative strength, insofar as ML may be applied to large-scale datasets that, as yet, remain unstudied [34]. This highlights opportunity for re-analysis of existing datasets to advance early detection and prevention methods, where prospective samples warrant prioritization. Several reports used epidemiologic surveys within nationally representative samples [60,101,106]; such surveys, however, frequently relied on a single-item assessment of suicidal behaviors, which may misclassify risk [132,133]. In general, suicide outcomes were variably defined, and validated symptom instruments varied significantly across reports [26,56,57,59,72,78,86,95,102,103]. This aligns with calls for increased uniformity in the assessment of suicidal behaviors to enhance research comparisons and improve surveillance [31]. We recommend that such calls be applied to the study of suicidal behaviors across ML investigations to enhance uniformity, comparison, and opportunity to improve risk prediction frameworks.
In several cases, the development and testing of clinical prediction models were evaluated against traditional statistics [6,55], showing superiority of ML in the classification of risk. Reports likewise compared ML-guided decision tree models to clinician-based predictions (i.e., of hospitalization following a suicide attempt (SA) or predicting likelihood of a suicide risk outcome) to guide triage [6,55,59,76,78,96]. Importantly, ML-guided risk stratification models outperformed those relying on clinician-based prediction methods alone. This included model testing within acute time frames of risk (i.e., 3-6 months) [55,96]-in one case, with performance enhanced three-fold using ML risk stratification [96]. Such findings suggest that advanced data analytic methods, combined with computer-guided screening, may augment clinical decision-making. Replication is warranted, including how such models may guide triage to optimize patient care with minimal time burden to providers.
Given that the majority of suicide decedents consult with their physician prior to death [8,9], such methods hold promise to enhance early detection and opportunity for rapid intervention. This may be particularly relevant to emergency settings, where medical records have been compared with manual coding of suicide attempt encounters using machine learning, with promising results [43]. This aligns with research suggesting that brief, low-risk suicide prevention strategies targeting emergency settings are both efficacious and cost-effective [134][135][136]. The way suicidal behaviors are coded within EMR may likewise pose challenges to risk detection. Anderson and colleagues [102] used ML to evaluate correspondence between patient notes and ICD/E-Codes (International Classification of Disease/ICD External Cause of Injury Code) for suicidal behaviors, based on text-mining of clinical discharge notes in a sample of n = 15,761 patient records. They observed a low level of correspondence, with only 3% of encounters coded for suicidal ideation and 19% coded for suicide attempts. This suggests need for considerable caution when interpreting suicide risk using ICD/E-Codes from EMR data alone, in comparison with discharge notes.
A subset of studies investigated NLP as a novel area of inquiry in select settings or populations. Pestian et al. [26] investigated NLP (i.e., key words and vocal characteristics) in structured and free-text speech responses to accurately distinguish (96.6% accuracy) n = 60 youth presenting to an ED for suicide risk (i.e., versus those presenting for other reasons). Text-mining methods also predicted accurate classification of those at risk for later suicidal behaviors [109,110,112,113], in some cases generating word maps that may aid future research. Other novel approaches included social media investigations of microblog users and Twitter posts to detect suicide risk among users, online communities, or posts following a natural disaster to index public emotion [66,85-89,105]. Despite a large number of neuroimaging and neuroanatomical reports within suicide prevention, a smaller number of studies examined a biological variable in this review. Baca-Garcia and colleagues [56] showed that an algorithm based on three CNS (Central Nervous System) single nucleotide polymorphisms (SNPs) correctly classified those with and without a suicide attempt history, whereas other investigations evaluated candidate biomarkers to predict future risk for suicide [64,69,103]. Only one study used neuroimaging, comparing youth with suicidal ideation (n = 17) to matched controls (n = 17) on fMRI variables [74]. Based on neural representations in response to suicide and death-related scan stimuli, this generated a high (91%) classification accuracy [74]. This signals a promising approach to biomarker discovery, underscoring integration of biological, behavioral, and clinical variables to inform etiology and intervention in an area with few selective treatments [137,138].

Critical Challenges and Future Directions
A number of limitations should be noted. Methods varied widely across reports, both with respect to ML methods and study quality. Despite considerable diagnostic and methodological heterogeneity, high levels of model performance were observed. Incomplete reporting of test statistics (e.g., accuracy, AUC, sensitivity, specificity) and differing methods for assessing and defining risk across diverse ML methods highlight the need for improved reporting standards and a priori-designed studies. Key parameters, such as PPV, area under the precision-recall curve (AUPRC), and lead-time of the prediction, which allows for decision-making about when to potentially act and intervene, were also underreported. Challenges inherent in retrospectively analyzing health data collected for administrative and clinical purposes should also be noted, given the high number of studies using EMR. Hersh et al. [139] raised concerns regarding biases due to EMR data being collected only at hospital visits, incomplete records or missing data, and other considerations relevant to accurate coding, emphasizing that advanced statistical methods be used for correction. Others report similar concerns of omission in EMR, calling for longitudinal measurements [19]. Additionally, given that differences in the splitting of training data may alter the performance of predictive modeling [140], use of multiple methods to separate samples (i.e., for training versus testing of algorithms) is recommended. Critically, the majority of studies were cross-sectional in nature, underscoring the need for prospectively designed ML investigations to advance suicide risk prediction.
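The sensitivity of model performance to how the data are split can be demonstrated by scoring the same simple classifier under several random train/test splits (simulated data; the exact accuracies are illustrative, not taken from any reviewed study):

```python
import random

random.seed(1)
# 100 hypothetical (score, label) pairs; class-1 scores are shifted upward
data = [(random.gauss(0, 1) + y, y) for y in [0, 1] * 50]

def split_and_score(data, seed):
    """Shuffle with a given seed, fit a midpoint-threshold rule on one half,
    and report accuracy on the held-out half."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    train, test = shuffled[:half], shuffled[half:]
    m0 = sum(s for s, y in train if y == 0) / sum(1 for _, y in train if y == 0)
    m1 = sum(s for s, y in train if y == 1) / sum(1 for _, y in train if y == 1)
    t = (m0 + m1) / 2  # threshold halfway between the training class means
    return sum((s >= t) == (y == 1) for s, y in test) / len(test)

accuracies = [split_and_score(data, seed) for seed in range(5)]
spread = max(accuracies) - min(accuracies)
```

Reporting results across several such splits (or full cross-validation) exposes this split-to-split variability, whereas a single arbitrary split can make a model look better or worse than it is.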
A lack of application to new settings or populations also highlights the need for replication, particularly according to longitudinal, well-defined outcomes of risk. Though translation of one model to a new site poses inherent challenges, a model can be trained with data from any local site and tested using data from that same site [23]. Regarding future application, challenges in constructing and deploying a statistical model within a clinical setting include access to data, availability of skilled personnel, and the need to identify ways of integrating the model into healthcare workflows [23]. Others have emphasized associations between model complexity and predictive accuracy [141], in addition to key limitations [142]. For example, Siddaway and colleagues [142] suggest that ML may be best harnessed when led by clinical need, becoming machine-assisted learning akin to other statistical techniques, and caution against over-reliance on ML models. We recommend incorporating these considerations into the design of new investigations utilizing machine learning in the detection and prediction of risk for suicidal behaviors.

Conclusions
In conclusion, findings of this review highlight risk factors that align with past non-ML findings (e.g., mood/substance disorders, male gender, family history, previous hospitalization, unemployment, comorbidity, and delinquency), whereas newly identified risk variables or approaches point to sleep, circadian, and neural substrates, as well as NLP-derived indices of speech or user data. These findings reflect a burgeoning literature that warrants future study in an area of prevention prioritized worldwide. Though a leading cause of death, suicide defies prediction given its rare occurrence at the population level, which poses important challenges to prevention. AI and ML applications hold unique promise to enable precision medicine in the prevention of suicide, particularly given their ability to handle large and complex datasets. We propose that such methods may crucially inform the early detection of suicide risk, triage, and treatment development, with important methodological and statistical cautions. The application of NLP to social media in particular, and the integration of AI with real-time suicide risk assessments, hold unique promise to impact the prevention of suicide on a broad scale.