Identification of Risk Factors for Suicidal Ideation and Attempt Based on Machine Learning Algorithms: A Longitudinal Survey in Korea (2007–2019)

Investigating suicide risk factors is critical for socioeconomic and public health, and many researchers have tried to identify factors associated with suicide. In this study, the risk factors for suicidal ideation and attempt were compared, and the contributions of different factors to each were investigated. To reflect the diverse characteristics of the population, the large-scale and longitudinal dataset used in this study included both socioeconomic and clinical variables collected from the Korean public. Three machine learning algorithms (XGBoost classifier, support vector classifier, and logistic regression) were used to detect the risk factors for both suicidal ideation and attempt. The importance of the variables was determined using the model with the best classification performance. In addition, a novel risk-factor score, calculated from the rank and importance scores of each variable, was proposed. Socioeconomic and sociodemographic factors showed a high correlation with risks for both ideation and attempt. Mental health variables ranked higher than other factors for suicidal attempt, which poses a relatively higher suicide risk than ideation. These trends were further validated using the conditions from the integrated and yearly datasets. This study provides novel insights into the risk factors for suicidal ideation and attempt.


Introduction
In the past 10 years, approximately 800,000 people committed suicide annually [1][2][3]. Suicidal mortality is considered a critical factor for both social and public health [4][5][6]. Many previous studies have suggested that death by suicide has socioeconomic and psychological consequences, burdening members of society [7][8][9]. Paul et al. [10] analyzed the social and economic burden of suicide in the Hong Kong SAR. In addition, Shumona et al. [11] attempted to validate suicidal risk factors and their impact in rural Bangladesh. They suggested that the burden of suicide is a major health problem. In the current COVID-19 pandemic, depression and suicide are important mental health care challenges [12][13][14].
To solve problems related to suicide, researchers have attempted to identify its underlying factors. Using systematic reviews, Elizabeth et al. [15] proposed several risk factors associated with suicidal self-directed violence among veterans living in the US. In addition, Mościcki [16] investigated the contributions of sociodemographic factors (e.g., age, gender, race, and socioeconomic status) to suicidal risk based on epidemiologic studies. Madelyn et al. [17] demonstrated the risk of psychosocial factors associated with suicide in children and adolescents. In their final analysis, socioenvironmental circumstances were confirmed to be a significant factor in teenage suicide risk.
Int. J. Environ. Res. Public Health 2021, 18, 12772
Previous studies have focused on two categories of risk factors: "psychiatric or clinical" and "economic or social elements", which were determined to be influential factors in related studies. However, the focus on these variables in related studies differed depending on the topic of interest or the population studied. First, when conducting research with psychiatric patient groups, major associations between the clinical variables and suicidal risk were found. For example, Gregory et al. [18] attempted to determine the risk factors for psychiatric outpatient groups. Various variables, including the current intensity of the patients' specific attitudes and behaviors, were collected from a total of 6981 patients. The contributions of psychiatric variables to the risk of eventual suicide were identified among diverse categories of variables, including clinical and economic factors. Second, when conducting research on groups from specific populations (e.g., children or adolescents), socioeconomic factors were found to be major influencing factors. Esben et al. [19] estimated the risk factors for young people living in Denmark. Their participants answered survey questions detailing mental illness, employment, and income. Having parents with a low socioeconomic status was found to be a relatively high risk factor for suicide among young people. To reflect the diverse characteristics of the population as much as possible, we analyzed large-scale and longitudinal datasets collected from the Korean public. Moreover, in terms of a multivariate analysis, diverse variables, including both clinical and socioeconomic factors, were used to determine associations between the factors.
In addition, the study groups were divided according to suicide-related events. Nock and Banaji [20] predicted suicidal ideation and suicide attempts based on test results. Their participants were divided into three groups (i.e., non-suicidal, suicide ideators, and recent suicide attempters), and the analysis results were compared. To examine the different characteristics of suicide attempters and suicide completers, Konrad [21] evaluated their medical fitness, personality, and clinical characteristics. Matthew et al. [22] compared the risk factors of suicidal ideation, plans, and attempts through a cross-national analysis. Some researchers analyzed the effects of risk factors by comparing multiple datasets. To investigate the effects of factors contributing to suicide events, namely, suicide ideation, planning, attempts, and completion, datasets of multiple events were compared in numerous previous studies [23][24][25].
To identify risk factors based on participant characteristics, diverse methodologies for analysis, including statistical modeling, have been applied. Gutierrez et al. [26] used structural equation modeling to determine the relationships between candidate risk factors. A chi-square analysis was subsequently used to validate the modeling results. Berman [27] utilized descriptive statistics (e.g., means and proportions) to summarize patient characteristics. A two-tailed Fisher's exact test was used to compare the clinical characteristics of the patient groups. Using Pearson's correlation analysis, Park and Jang [28] evaluated the association between suicide rates and risk factors among Korean adolescents.
In recent studies, machine learning algorithms have shown better performance than traditional statistical methods for structured datasets (e.g., datasets collected from surveys). Subramani et al. [29] attempted to forecast the risk of diabetes through electronic medical records (EMRs) with machine learning classifiers, such as support vector machines and decision trees. Sangita et al. [30] used a logistic regression model to investigate the nutritional status of children using Indian demographic and health survey datasets. From these studies, it was confirmed that machine learning models have sufficient capability to analyze structured datasets.
Machine learning models have been applied in many previous studies to investigate the risk factors for suicide. De la Garza et al. [31] attempted to determine the risk factors for nonfatal suicide attempts. They utilized the importance of features in trained algorithms to determine the emphasized variables. Samah et al. [32] proposed a machine learning-based framework to detect potential suicide risk factors using text datasets collected from Twitter. The decision tree model and K-means clustering algorithms were applied to classify the risk levels. The clustering results and classification performances were used to identify the risk factors for suicide. The authors reported that words related to feelings of depression and self-harm were important in classifying suicide risk levels.
Suicidal ideation and attempt are compared in this study to analyze and determine the relative influence of risk factors. Suicide ideation was set as a low-risk condition, and suicide attempt was set as a relatively high-risk condition. The importance of different risk factors for a suicide attempt was examined in the high-suicide-risk group based on the underlying effects (e.g., economic and psychological burden). Machine learning algorithms were used to compare the influence of risk factors for suicidal ideation and for suicide attempts separately, according to this level of suicide risk. The longitudinal dataset obtained from the general population of Korea was utilized to determine the risk factors. In addition, a new risk-factor score based on the feature importance and rank of each variable was proposed to confirm the importance of the variables in suicidal ideation and attempt.
The major contributions of our research are as follows. First, to reflect the various characteristics of the study population, large-scale (n = 215,522) and longitudinal (from 1998 to 2019) datasets collected in Korea were used. Second, machine learning algorithms were applied to detect the inherent patterns and factors in the dataset associated with suicide ideation and suicide attempts. Third, a novel risk-factor score was proposed and applied to compare the importance of the factors using the results of the machine learning classifiers. Finally, the risks for suicide ideation and attempt were separately validated based on a comparison between the two.

Overview
To determine the major risk factors for suicidal ideation and attempts, our research was divided into six steps. First, the associated variables were collected from the Korea National Health and Nutrition Examination Survey (KNHANES) dataset. Second, missing or extreme values among the collected variables were removed to better reflect the characteristics of the participants. Third, the final datasets were constructed based on the main dependent variables (i.e., suicide ideation and suicide attempts). Fourth, three machine-learning algorithms were trained and evaluated using previously organized datasets. Fifth, the importance of each feature was determined using the model with the best classification performance. Finally, the score of each variable was calculated and compared to reveal the differences between ideation and suicide attempts. The detailed steps are shown in Figure 1.

Data Source
In this study, the open-source KNHANES dataset released by the Korea Disease Control and Prevention Agency (KDCA) [33] was utilized to compare the risk factors of suicidal ideation and attempts. KNHANES is a longitudinal survey that investigates the health status, health-related awareness and behavior, and nutritional status of people living in Korea. This survey is conducted annually by the KDCA. The first survey was conducted in 1998 and was repeated at three-year intervals until 2005. Since 2007, surveys have been conducted annually. A total of 216,815 people participated in the survey from 1998 to 2019. The original datasets collected from the surveys are available to the public. The KNHANES dataset was constructed using nine categories of survey variables. A detailed list of the categories is presented in Table 1.

Table 1 (partial).
No.   Category                         Type
      Food safety investigation        Categorical
7     Food intake frequency survey     Categorical
8     Food intake survey               Continuous
9     Dietary life evaluation index    Continuous

The survey results were stored in two data files. In the first file, survey variables for the health behavior, blood test, blood pressure test, and hand grip test results were included. The other test results (dietary life, food safety, food intake, and dietary life evaluation) were stored in a second data file. All variables in the data files can be merged based on the participant ID.
Additionally, the public dataset on suicide rates released by the Korean Statistical Information Service (KOSIS) was utilized to investigate the effects of suicide rate on the associated risk factors. For the KNHANES datasets from 1998 to 2019, twenty-one suicide rate values were used for the analysis.
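The participant-ID merge of the two data files described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used in the study: the key name `ID` and the field names `grip` and `intake_kcal` are hypothetical placeholders, not actual KNHANES variable names, and an inner join is assumed.

```python
def merge_by_id(first_rows, second_rows, key="ID"):
    """Inner-join two lists of record dicts on a shared participant key."""
    second_by_id = {row[key]: row for row in second_rows}
    merged = []
    for row in first_rows:
        match = second_by_id.get(row[key])
        if match is not None:
            # Combine both records; the key column appears only once.
            merged.append({**row, **match})
    return merged


# Hypothetical records: field names are placeholders, not KNHANES variables.
health = [{"ID": "A01", "grip": 31.5}, {"ID": "A02", "grip": 27.0}]
diet = [{"ID": "A02", "intake_kcal": 1850.0}, {"ID": "A03", "intake_kcal": 2100.0}]
merged = merge_by_id(health, diet)
```

Only participants present in both files survive an inner join; whether the study kept unmatched participants is not stated in the text.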

Collection and Selection of Associated Variables from Datasets
To reflect the various characteristics of the participants, all available datasets in KNHANES (i.e., from 1998 to 2019) were utilized. In addition, the associated variables, including socioeconomic and psychiatric variables, were selected from the datasets. Among the nine categories of variables, the variables for demographics, household, subjective health status, activity restriction and quality of life, education and economic activity, obesity and weight control, drinking, and mental health were extracted to identify major suicidal risk factors. The dimensions of the original dataset were (215,522, 851). After the extraction of relevant variables, the dimensions of the remaining dataset were (78,796, 58). The baseline characteristics of the datasets are listed in Table 2.
The distributions of the variables were analyzed to remove missing or extreme values from the data. In the KNHANES dataset, the missing values were coded as 99 or 9999. To reflect the exact response to each variable, the distribution of each variable was examined using histograms. After removing the variables whose values were missing in more than half of the responses, 48 variables remained. The distribution of the variables used in our study is shown in Figure 2.
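The missing-value handling described above can be sketched in a few lines. This is an illustrative sketch, not the preprocessing code of the study: the variable names are hypothetical, the codes 99 and 9999 are taken from the text, and the rule "drop a variable when more than half of its responses are missing" is one reading of the description.

```python
MISSING_CODES = {99, 9999}  # codes that the text says mark missing values


def recode_missing(rows, codes=MISSING_CODES):
    """Replace missing-value codes with None in every record."""
    return [{k: (None if v in codes else v) for k, v in row.items()} for row in rows]


def drop_sparse_variables(rows, max_missing_ratio=0.5):
    """Drop variables whose share of missing responses exceeds the threshold."""
    n = len(rows)
    keep = [
        var
        for var in rows[0]
        if sum(row[var] is None for row in rows) / n <= max_missing_ratio
    ]
    return [{var: row[var] for var in keep} for row in rows]


# Hypothetical toy records; "var_a", "var_b", and "var_c" are placeholders.
rows = [
    {"var_a": 1, "var_b": 99, "var_c": 3},
    {"var_a": 2, "var_b": 9999, "var_c": 99},
    {"var_a": 0, "var_b": 99, "var_c": 1},
    {"var_a": 1, "var_b": 2, "var_c": 2},
]
cleaned = drop_sparse_variables(recode_missing(rows))
```

In this toy input, `var_b` is missing in three of four responses and is dropped, while `var_c` is kept with one missing entry. A blanket recode like this would also erase a legitimate value of 99, so in practice the codes would need to be checked per variable, which is consistent with the histogram inspection mentioned in the text.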

Generation of the Final Datasets for the Evaluation of Machine Learning Classifiers
To verify the difference between suicidal ideation and attempt, we set "BP6_10" (suicidal ideation within the last year) and "BP6_31" (suicide attempts within the last year) as the dependent variables. Other variables were applied to the machine learning classifiers as independent variables. The dataset was divided by year to compare the analysis results individually, and the integrated dataset was further analyzed, regardless of the year, to investigate differences in the risk factors. The suicide rates per year in the dataset were used to determine the effect of suicide rate on major risk factors for suicide.
The experimental results were therefore compared under a total of 28 conditions (two dependent variables in each of 14 datasets covering 2007 to 2019, i.e., the yearly datasets and the integrated dataset). The dataset for each of the 28 conditions was divided into training and test datasets at an 8:2 ratio.
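The enumeration of the 28 conditions and the 8:2 split can be sketched as follows. The sketch assumes that the 14 datasets are the 13 yearly datasets from 2007 to 2019 plus the integrated dataset, and that the split is a simple random shuffle; the text does not state whether the split was stratified, and the helper name `train_test_split_indices` is hypothetical.

```python
import random

YEARS = list(range(2007, 2020))                       # 13 yearly datasets
DATASETS = [str(year) for year in YEARS] + ["integrated"]
TARGETS = ["BP6_10", "BP6_31"]                        # ideation, attempt

# One condition per (dataset, dependent variable) pair.
conditions = [(dataset, target) for dataset in DATASETS for target in TARGETS]


def train_test_split_indices(n_samples, test_ratio=0.2, seed=0):
    """Shuffle sample indices and split them into train and test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_ratio)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = train_test_split_indices(1000)
```

Because both outcomes are likely to be rare in a general-population survey, a stratified split that preserves the positive rate in both parts may be preferable in practice.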

Training and Evaluation of Machine Learning Classification Algorithm
As described in the previous section, the machine learning classifiers were applied to the 28 conditions in the datasets for a comparison. In this study, the XGBoost classifier, support vector classifier, and logistic regression were used. According to the binary characteristics of the dependent variables ("BP6_10" and "BP6_31"), all algorithms performed binary classification tasks under all experimental conditions. The BP6_10 and BP6_31 variables were collected from different survey questions. The BP6_10 values consisted of binary answers to "Have you thought about suicide within the last year?" The BP6_31 values consisted of binary answers to "Have you attempted suicide within the last year?" Participants answered both questions in binary format (i.e., yes = 1; no = 0). The importance of the features was recorded for the test dataset to identify important features among the independent variables using the trained algorithms.
A random search was conducted to determine the optimal hyperparameters of the three ML classifiers, as listed in Table 4. In addition, to prevent overfitting of the classification algorithms, 10-fold cross-validations were applied when training the algorithms.
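The combination of random search and 10-fold cross-validation can be illustrated with a small, dependency-free sketch. It is not the tuning code of the study: a toy one-dimensional k-nearest-neighbour classifier stands in for the three classifiers, the data are synthetic, and the candidate values are arbitrary. Only the structure (sample a hyperparameter at random, score it by the mean accuracy over 10 folds, keep the best) reflects the procedure described above.

```python
import random


def k_fold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, val in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, val


def knn_predict(train_x, train_y, x, k):
    """Toy 1-D k-nearest-neighbour classifier (stand-in for a real model)."""
    nearest = sorted(range(len(train_x)), key=lambda j: abs(train_x[j] - x))[:k]
    votes = sum(train_y[j] for j in nearest)
    return 1 if 2 * votes > k else 0


def cv_accuracy(xs, ys, k, n_folds=10):
    """Mean validation accuracy over the folds for one hyperparameter value."""
    scores = []
    for train, val in k_fold_indices(len(xs), n_folds):
        tx, ty = [xs[j] for j in train], [ys[j] for j in train]
        correct = sum(knn_predict(tx, ty, xs[j], k) == ys[j] for j in val)
        scores.append(correct / len(val))
    return sum(scores) / len(scores)


def random_search(xs, ys, candidates, n_iter=5, seed=0):
    """Sample hyperparameter values at random and keep the best CV score."""
    rng = random.Random(seed)
    best_k, best_score = None, -1.0
    for _ in range(n_iter):
        k = rng.choice(candidates)
        score = cv_accuracy(xs, ys, k)
        if score > best_score:
            best_k, best_score = k, score
    return best_k, best_score


# Synthetic, well-separated data: class 0 near 0.0 and class 1 near 1.0.
data_rng = random.Random(1)
xs = [data_rng.gauss(0.0, 0.1) for _ in range(50)]
xs += [data_rng.gauss(1.0, 0.1) for _ in range(50)]
ys = [0] * 50 + [1] * 50
best_k, best_score = random_search(xs, ys, candidates=[1, 3, 5, 7, 9])
```

With scikit-learn, the same pattern is usually expressed through `RandomizedSearchCV` with `cv=10`; the hand-written version is shown only to make the two nested loops explicit.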

Calculation of the Risk-Factor Score from Feature Importance Results
From the evaluation of the trained algorithms on the test datasets, the importance of each feature was determined for the best-performing classifier. The importance score (e.g., the F-score for the XGBoost classifier and the coefficients for SVC and logistic regression) and the rank of each variable were obtained through the analysis of feature importance. To consider both results (importance score and rank of the variable) simultaneously, a new quantified score for the risk of each variable was devised by integrating the importance and rank. The proposed score for each variable was calculated from a formula combining α and β, where α denotes a normalized rank between 0 and 1, and β represents the normalized importance score. In our study, 10-fold cross-validation was used for training and evaluating the algorithms. As a result, 10 evaluation sets were used for each experiment, and the same results for the evaluation of the variables were observed. Finally, to create a single score, the values obtained for each variable were averaged.
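A score of this kind can be sketched as follows. The exact formula is not available in this text, so every modelling choice in the sketch is an assumption made only for illustration: α is taken as a linear rescaling of the rank (best rank mapped to 1), β as a min-max scaling of the importance score, and the two are combined by a simple mean. The variable names and importance values are hypothetical.

```python
def normalize_rank(importances):
    """Map rank 1 (most important) to 1.0 and the last rank to 0.0."""
    order = sorted(importances, key=importances.get, reverse=True)
    n = len(order)
    return {name: (n - 1 - pos) / (n - 1) for pos, name in enumerate(order)}


def normalize_importance(importances):
    """Min-max scale the importance scores to the range [0, 1]."""
    lo, hi = min(importances.values()), max(importances.values())
    return {name: (value - lo) / (hi - lo) for name, value in importances.items()}


def risk_factor_score(importances):
    """Combine normalized rank (alpha) and normalized importance (beta)."""
    alpha = normalize_rank(importances)
    beta = normalize_importance(importances)
    # ASSUMPTION: a simple mean of alpha and beta; the study's formula may differ.
    return {name: (alpha[name] + beta[name]) / 2 for name in importances}


def average_over_folds(fold_importances):
    """Average the per-fold scores of each variable into a single score."""
    scores = [risk_factor_score(imp) for imp in fold_importances]
    return {
        name: sum(score[name] for score in scores) / len(scores)
        for name in scores[0]
    }


# Hypothetical importance values from two evaluation folds.
fold_importances = [
    {"income": 0.40, "age": 0.30, "depression": 0.10},
    {"income": 0.50, "age": 0.20, "depression": 0.10},
]
scores = average_over_folds(fold_importances)
```

The sketch requires at least two variables with distinct importance values; otherwise the normalizations divide by zero.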

Machine Learning (ML) Classification Algorithms

XGBoost Classifier
The XGBoost classifier is a supervised learning algorithm based on gradient boosting methods [34]. This classification model was constructed using classification and regression tree (CART) methods. In addition, it is an ensemble model composed of several decision tree models. The objective function of this algorithm with ensemble learning is as follows:

$$\mathrm{Obj} = \sum_{i=1}^{n} L(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k),$$

where

$$\hat{y}_i = \sum_{k=1}^{K} f_k(x_i).$$

In the above functions, the first formula is the objective function of the XGBoost classifier. The function $L$ is the loss function used for algorithm evaluation, and the function $\Omega$ denotes the regularization term, which measures the complexity of the models. The predicted values $\hat{y}_i$ are calculated from the decision trees, where the kth tree is represented by $f_k$. In this study, the class labels $y_i$ denote suicidal ideation or attempt within the past year (coded as 0 or 1).

Support Vector Classifier
The second classification algorithm used in this study is the support vector classifier (SVC) with radial basis function (RBF) kernels [35]. This classification algorithm finds a hyperplane in the feature space that separates the samples according to their class labels. A radial basis function kernel with non-linear characteristics was applied to the classifiers. In previous studies, linear kernels were used for binary classification tasks [36]. Here, it was experimentally verified that the SVC model with the RBF kernel resulted in better model performance compared with the model with a linear kernel. In addition, completely participant-separated datasets were used to avoid overfitting the classifiers.
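The difference between the two kernels compared above can be made concrete with a short sketch. The functions below implement the standard definitions of the RBF and linear kernels; the parameter name `gamma` and its default value are illustrative and do not correspond to the hyperparameters tuned in the study.

```python
import math


def rbf_kernel(x, z, gamma=1.0):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and z)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)


def linear_kernel(x, z):
    """Linear kernel: the plain dot product of x and z."""
    return sum(a * b for a, b in zip(x, z))


same = rbf_kernel([1.0, 2.0], [1.0, 2.0])      # identical points
near = rbf_kernel([0.0], [1.0], gamma=1.0)     # points one unit apart
dot = linear_kernel([1.0, 2.0], [3.0, 4.0])
```

The RBF kernel depends only on the distance between two samples and decays toward zero as they move apart, which is what allows the resulting decision boundary to be non-linear in the original feature space.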

Logistic Regression
The last classification algorithm used in our study was the logistic regression model. This classification model calculates the log-odds as a linear combination of the variables and applies a sigmoid function to the result [37]. The probability that the data belong to the corresponding class was calculated using this model. A probability higher than 0.5 indicates that the participant was classified as having had suicidal ideation or attempt in the past year. The basic form of this model including the variables and classes for suicidal ideation or attempts is as follows:

$$P(y = 1 \mid x) = \frac{1}{1 + e^{-z}},$$

where

$$z = w_0 + \sum_{j=1}^{m} w_j x_j.$$

Here, $x_j$ denotes the jth variable, $w_j$ its coefficient, $w_0$ the intercept, and $y$ the class label for suicidal ideation or attempt.
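The decision rule described above can be sketched in a few lines. The weights and intercept below are arbitrary numbers chosen for illustration, not coefficients estimated in the study; only the sigmoid transformation and the 0.5 threshold follow the text.

```python
import math


def sigmoid(z):
    """Map a log-odds value to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(features, weights, intercept):
    """Probability of the positive class from a linear log-odds model."""
    log_odds = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(log_odds)


def predict_label(features, weights, intercept, threshold=0.5):
    """Assign class 1 when the predicted probability exceeds the threshold."""
    return 1 if predict_proba(features, weights, intercept) > threshold else 0


# Hypothetical coefficients for two features.
weights, intercept = [2.0, -1.0], -1.0
label_a = predict_label([1.0, 0.0], weights, intercept)  # log-odds = 1.0
label_b = predict_label([0.0, 1.0], weights, intercept)  # log-odds = -2.0
```

A threshold of 0.5 on the probability is equivalent to a threshold of 0 on the log-odds, so the sign of the linear combination alone determines the predicted class.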

Evaluation Metrics
The classification performance of the machine learning classifiers was compared using five evaluation metrics. To evaluate the experimental results obtained from the model, the true positive (TP), true negative (TN), false negative (FN), and false positive (FP) values were calculated using the confusion matrix. The ratio of correctly classified samples was calculated based on TP and TN values. In addition, incorrectly classified samples were indicated by FN and FP. Finally, we obtained four indicators: precision, recall, f1-score, and accuracy. Furthermore, the true positive rate (TPR) and false positive rate (FPR) were determined to establish the receiver operating characteristic (ROC) curve. The area under the curve (AUC) was calculated from the ROC curve.
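The five metrics can be computed directly from the confusion-matrix counts and the ranked prediction scores. The sketch below uses the standard definitions with toy labels and scores; it is an illustration and not the evaluation code of the study.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN) for binary labels coded as 0 and 1."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    """Precision, recall, F1-score, and accuracy from the confusion matrix."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(y_true)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


def roc_auc(y_true, scores):
    """Area under the ROC curve via the trapezoidal rule over (FPR, TPR) points."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    pairs = sorted(zip(scores, y_true), reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Process all samples that share the same score as one threshold step.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return sum(
        (x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


# Toy labels, hard predictions, and prediction scores.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1]
metrics = classification_metrics(y_true, y_pred)
auc = roc_auc(y_true, y_score)
```

For outcomes as rare as a suicide attempt, accuracy alone can be high even when no positive case is detected, which is one reason to report precision, recall, and AUC alongside it.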

Tools
All code for the ML classifiers and data preprocessing was written in the Python (version 3.7.1; scikit-learn, version 0.24.1) and R (version 4.0.3) programming languages.

Classification Performance Results from ML Classifiers
To identify risk factors using various variables, three machine-learning classifiers were applied to the preprocessed datasets. The classification performance results were compared using five evaluation metrics (precision, recall, f1-score, accuracy, and AUC). Among the three classifiers (XGBoost, SVC, and LR) used in this study, the XGBoost classifier demonstrated the best classification performance. In the experimental results, the highest evaluation metrics for both dependent variables were obtained from the integrated datasets. In addition, for the yearly datasets, the XGBoost classifier showed better performance than the other two algorithms. Details of the classification performance results are listed in Tables 5-7.

Feature Importance for the Identification of Risk Factors
Based on the classification performance results, important features from the trained XGBoost classifier were determined. The models with the best classification performances were selected to determine which features of the trained algorithms were important and reflected the characteristics of the variables. Common and unique factors associated with suicidal ideation and suicide attempts were identified and compared.
For the integrated dataset, the important features and their risk-factor scores are listed in Table 8. There were seven common variables with a high rank (ranked 1 to 7). Socioeconomic variables (e.g., average monthly income, age, drinking age, and education level) and non-mental-health-related variables were identified as common variables. Next, the differences in ranks between the variables were examined for middle and low ranks (from rank 8 to 20). Unlike for suicidal ideation, mental health-related variables (e.g., prevalence of depression, anxiety/depression in quality of life, and depression for more than 2 weeks) were ranked higher as risk factors for suicidal attempt.
In the results for the yearly dataset conditions, trends similar to those of the integrated dataset conditions were identified. The detailed results of the yearly dataset conditions are listed in Tables 9-11 and Supplementary Tables S1-S5. In addition, the suicide rate was analyzed by year, together with the experimental results. First, socioeconomic variables, including average monthly income, were found among the high-ranking common variables of the two dependent variables. Second, similar to the analysis results obtained from the integrated dataset, mental health-related variables were confirmed to be ranked relatively high for risk of suicidal attempt. Third, considering the suicide rate, it was verified that the aforementioned characteristics of mental health variables were more prominent. For example, in 2009, 2010, and 2011, when the suicide rate was relatively high (31.0, 31.2, and 31.7 per 100,000 population, respectively), the prevalence of depression, subjective health status, and depression for more than 2 weeks were ranked higher than in the other years of the 13-year period, while the highly ranked socioeconomic variables stayed the same.

Discussion
In this study, the risk factors associated with suicidal ideation and attempts were compared using machine learning classifiers. To determine the important factors, machine learning algorithms were utilized for dataset analysis. After confirming the related factors using the classification algorithm, a novel risk-factor score based on the rank and importance scores obtained from the algorithm was calculated to evaluate the importance of the variables. To investigate the differences in the importance of factors based on suicide risk level, suicide ideation was set as having a low suicide risk and suicide attempt was set as having a high suicide risk prior to analyzing the results. In the experimental results, we found that the associations of socioeconomic and sociodemographic variables were high for both suicide ideation and attempt. In addition, the risk-factor scores of mental health variables were higher than those of other variables for the high suicide risk condition (i.e., suicide attempt).
Reasonable evidence was gathered from previous studies on the research topics (i.e., suicidal risk factors, suicidal ideation, suicidal attempts, and machine learning algorithms) before conducting our study. First, with regard to suicidal ideation and risk factors, Hintikka et al. [38] analyzed a 3-year prospective follow-up dataset collected from people living in Finland (n = 1339) to identify factors associated with suicidal ideation. From a longitudinal follow-up dataset, the authors focused on the risk factors of suicidal ideation. The impacts of sociodemographic and socioeconomic factors, including lifestyle, were identified for suicidal ideation. Weber et al. [39] examined the relationship between suicidal ideation and the diverse variables in a population sampled from college students. Among the variables used in that study, depression- and hopelessness-related variables showed a strong association with the main dependent variable (suicidal ideation). Kleiman et al. [40] attempted to identify related risk factors and their degree of variation in suicidal ideation within a short period of time. Well-known risk factors for suicidal ideation, such as hopelessness, burdensomeness, and loneliness, varied over time and were correlated with suicidal ideation.
Second, many previous studies have associated various risk factors with suicide attempts. For example, Beautrais et al. [41] analyzed a case-control dataset collected from 129 young people who had made serious suicide attempts. Among the various factors, the contributions of the risk factors of childhood adversity, social disadvantage, and psychiatric morbidity were found to be significant in the analysis results. Teti et al. [42] systematically reviewed several published studies to find similar risk factors for suicide among Latin American and Caribbean people. Major depressive disorder, family dysfunction, and prior suicide attempts were confirmed as the main risk factors. Parra-Uribe et al. [43] focused on the risk factors of suicide re-attempts and completed suicides after previous attempts. The authors identified the influences of alcohol use, personality disorders, and younger age on suicide re-attempts.
Finally, regarding the investigation of risk factors using machine learning algorithms, there were several previous studies on the detection of associated factors and suicide. Taneja et al. [44] applied a random forest classifier to predict the risk of sepsis from clinical variables in electronic medical record (EMR) datasets. The prediction model indicated that the "PCT" and "IL-6" variables are important in predicting the risk of sepsis. Walsh et al. [45] applied machine learning algorithms to predict the risk of suicide attempts. A dataset comprising 5167 people was analyzed using the random forest algorithm. Among the predictors used in the random forest models, prior non-fatal suicide attempts, hospital utilization history, and visit tallies were the most important predictors. Colin et al. [46] used a machine learning model to predict suicide attempts in adolescents. Longitudinal datasets collected from 974 adolescents over 17 years were used to investigate the effects of associated factors. Random forest algorithms were used to analyze the datasets. Among the feature importance results of the predictors, the top 20 predictors were compared to evaluate their importance. Body mass index (BMI), age, anilide medications, propionic acid derivative medication, and selective serotonin reuptake inhibitors were identified as the top five important factors for suicide attempt prediction. A history of episodic mood disorder and other medication-related variables ranked relatively low in the experimental results.
Based on previous studies, we found it reasonable to conduct research on the selected topics of our study (i.e., identification of major risk factors for suicidal ideation and attempts with machine learning models). The experimental design of previous studies was adopted in our study. Unlike previous researchers who used datasets on specific patient groups, De la Garza et al. [31] used longitudinal datasets from a national US survey to investigate the characteristics of the general population. They divided the datasets based on intervals (e.g., from 2001 to 2002 was wave 1 and from 2004 to 2006 was wave 2) to compare the effects of the factors between periods. From the collected datasets, risk factors for suicide attempts were detected using a machine learning algorithm (random forest model). In wave 1, various variables, such as disorder-related variables (e.g., alcohol use, drug use, and nicotine dependence) and mood disorders (panic and bipolar disorder), were collected through interviews with participants. Three years after wave 1, in wave 2, the main dependent variables (i.e., non-fatal suicide attempts) were collected. A balanced random forest classifier was used to identify factors associated with suicide attempts through processed features in wave 1 and binary suicide attempts in wave 2. To quantify the importance of each variable, Youden's J statistic was calculated to set the cut-off points for the evaluation metric values. Six evaluation indices (sensitivity, specificity, positive predictive value, negative predictive value, alarms per 100 evaluations, and number needed for evaluation) were used to examine the performance of the classification models. Among the 2985 available input features, the top 20 most important variables were compared. The authors found that individuals who "felt that they wanted to die" and "thought about committing suicide" showed the highest importance scores in the experimental results. In addition, the effects of socioeconomic disadvantages were observed in the analysis results.
Su et al. [47] used structured electronic health records (EHRs) from the Connecticut Children's Medical Center (CCMC) from 2011 to 2016 to predict suicidal risk in children and adolescents. From the CCMC EHR database, 641,708 visits for 129,485 patients were extracted. To compare the model classification performances, several datasets with different conditions for the period were analyzed. The main dependent variables (i.e., suicide attempts) were identified using the International Classification of Diseases, Ninth Revision (ICD-9) code. In addition, demographics, prescribed medications, and clinical variables were included in the extracted datasets. The prediction model was evaluated using four evaluation indices (receiver operating characteristic curve, sensitivity, specificity, and positive predictive value). The variables were grouped and the performances of the prediction models were evaluated to confirm the usefulness of the proposed predictive models. The frequency of specific predictors was measured to confirm the influence of the variables on prediction. Symptoms and signs involving emotional state, depressive episode, and gender showed the highest rank among the input variables. In addition, antidepressant medications, including sertraline and escitalopram, and urine culture test variables were found to be highly important variables.
Design processes from previous studies were incorporated in our study, as described above. First, partial datasets with related variables were extracted from the KNHANES dataset from 1998 to 2019. To retain the common variables, only the 13 datasets collected from 2007 to 2019 were selected. Second, the distributions of variables in the extracted datasets were examined to remove extreme or missing data. After removing the variables whose values were missing in more than half of the responses, 48 variables remained, including demographics, socioeconomic status, and mental health categories. Third, to select an optimized machine learning algorithm, the performances of the three classification models in our study were compared. Finally, the feature importance used to identify risk factors was determined from the best-performing classifier.
To compare the risk factors between suicide ideation and attempts, the importance of each variable in the datasets was evaluated with respect to suicide ideation and attempts. In addition, the integrated dataset and yearly dataset conditions were compared. The experimental results were analyzed by treating suicidal ideation as a relatively low-risk group and suicidal attempt as a high-risk group. First, the results were compared for the integrated dataset, which pools all years and ignores the time series captured by the yearly datasets. Socioeconomic and sociodemographic variables (e.g., average monthly income, age, education status, and drinking age) were confirmed to have a high rank and were common factors in both suicide ideation and attempt. Ferretti and Coluccia [48] reported similar trends in the socioeconomic determinants of suicide in the general population of Europe, unlike in groups of patients with mental illnesses. Mortensen et al. [49] suggested that a high risk of suicide was associated with unemployment and other socioeconomic factors in the Danish population. Among relatively low-ranking factors, the rank of mental health-related variables (e.g., prevalence of depression, anxiety, depression-related quality of life, and depression for more than 2 weeks) in the high-risk condition (i.e., suicidal attempt) was higher than that in the low-risk condition (i.e., suicidal ideation). Based on these results, in the integrated dataset condition, sociodemographic and socioeconomic factors are important for both suicidal ideation and attempts. In addition, mental health variables including depression or anxiety are risk factors for suicide attempts.
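The comparison of variable ranks between the ideation and attempt conditions can be illustrated with a short sketch. It assumes that each condition yields one importance value per variable (for example, from a fitted classifier); the helper names and the importance values below are hypothetical placeholders, not results of this study, and the sketch does not reproduce the proposed risk-factor score.

```python
def rank_features(importance):
    """Map each variable to its rank (1 = most important) by descending importance."""
    ordered = sorted(importance, key=importance.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def rank_shift(importance_low, importance_high):
    """Rank in the low-risk condition minus rank in the high-risk condition.

    A positive value means the variable ranks higher (closer to 1)
    in the high-risk condition than in the low-risk condition.
    """
    ranks_low = rank_features(importance_low)
    ranks_high = rank_features(importance_high)
    return {
        name: ranks_low[name] - ranks_high[name]
        for name in ranks_low
        if name in ranks_high
    }


# Hypothetical importance values, invented for illustration only.
ideation = {"income": 0.30, "age": 0.25, "education": 0.20,
            "depression": 0.15, "anxiety": 0.10}
attempt = {"income": 0.28, "depression": 0.24, "age": 0.20,
           "anxiety": 0.16, "education": 0.12}

shift = rank_shift(ideation, attempt)
print(shift)
```

In this toy input, the depression and anxiety variables move up in the attempt condition while income keeps its position, which mirrors the kind of pattern described above.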
Second, the risk factors were investigated using yearly datasets together with suicide rate data to compare the factors in years with high and low suicide rates. From 2007 to 2019, the suicide rates in 2009, 2010, and 2011 were higher than in other years. Similar to the integrated dataset condition, sociodemographic and socioeconomic factors were found to rank high for both ideation and attempt over the 13 years. In datasets with a relatively high suicide rate, depression prevalence and depression for more than two weeks clearly tended to rank high. From these trends, we confirmed that the yearly datasets with a high suicide rate showed trends similar to the results of the integrated dataset.
In conclusion, the analysis of a longitudinal dataset collected from the general population using a machine learning classification model indicated that socioeconomic and sociodemographic factors were the major risk factors associated with both suicidal ideation and suicide attempts. Similar trends were validated on yearly datasets with yearly suicide rates, resulting in a high rank for social variables and a relatively higher rank for mental health variables associated with high suicide rates.

Conclusions
In social and public health, investigating suicide risk factors is critical to reducing the impact of suicide on the public. In this study, we applied machine learning algorithms to identify risk factors for suicidal ideation and attempts using longitudinal datasets collected from the general population living in Korea. To compare the differences between suicidal ideation and suicidal attempt factors, the KNHANES dataset was preprocessed for two dependent variables ('BP6_10': suicide ideation and 'BP6_31': suicide attempt). In addition, datasets collected over 13 years (from 2007 to 2019) were analyzed to determine the importance of the associated risk factors in both the integrated dataset and the datasets by year. Furthermore, to identify the most suitable machine learning algorithm for our research topic, we compared the performances of three machine learning classifiers (XGBoost classifier, support vector classifier, and logistic regression). Among the three classifiers, XGBoost showed the best performance on five evaluation metrics. Based on these results, we used the feature importance of XGBoost to identify important risk factors for suicidal ideation and attempt. Sociodemographic and socioeconomic variables ranked high for both ideation and attempts. In addition, we found that mental health variables showed a relatively high rank in the suicidal attempt condition, which was considered a high risk for suicide. From these experimental results, it was concluded that sociodemographic and socioeconomic factors are critical for suicide in the general population. In the high-risk group with suicidal attempts, mental health factors could also influence suicide risk.
The first strength of this study was the application of machine learning algorithms to investigate the associated risk factors for suicidal ideation and suicide attempts. Second, the risk of each factor was determined using a new quantitative score calculated from the rank and importance scores obtained from the machine learning algorithms. Third, large-scale real-world datasets collected from people living in Korea were utilized to reflect practical tendencies. Finally, the influence of risk factors on suicide ideation and attempts was evaluated and compared across conditions of suicide risk. Our study has some limitations. First, we extracted and utilized a limited set of variables rather than considering a wide range of factors. However, we tried to select variables that were associated with suicide in previous studies. Second, various other methodologies, including deep learning algorithms, could be applied to address our research questions. In our study, ML algorithms were used to facilitate the confirmation of feature importance. Third, additional comparisons and external validation on datasets collected from other countries will be required in future studies to generalize the results.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/ijerph182312772/s1, Table S1: Important features and risk-factor score in yearly dataset condition (2007 and 2008), Table S2: Important features and risk-factor score in yearly dataset condition (2012 and 2013), Table S3: Important features and risk-factor score in yearly dataset condition (2014 and 2015), Table S4: Important features and risk-factor score in yearly dataset condition (2016 and 2017), Table S5: Important features and risk-factor scores in yearly dataset conditions (2018 and 2019).