Machine Learning-Based Three-Month Outcome Prediction in Acute Ischemic Stroke: A Single Cerebrovascular-Specialty Hospital Study in South Korea

Background: Functional outcomes after acute ischemic stroke are of great concern to patients and their families, as well as to the physicians and surgeons who make clinical decisions. We developed machine learning (ML)-based functional outcome prediction models for acute ischemic stroke. Methods: This retrospective study used a prospective cohort database. A total of 1066 patients with acute ischemic stroke between January 2019 and March 2021 were included. Variables such as demographic factors, stroke-related factors, laboratory findings, and comorbidities were collected at the time of admission. Five ML algorithms were applied to predict a favorable functional outcome (modified Rankin Scale 0 or 1) at 3 months after stroke onset. Results: Regularized logistic regression showed the best performance, with an area under the receiver operating characteristic curve (AUC) of 0.86. Support vector machines achieved the second-highest AUC of 0.85 with the highest F1-score of 0.86, and all ML models applied achieved an AUC > 0.8. The National Institutes of Health Stroke Scale at admission and age were consistently the top two important variables for the regularized logistic regression, random forest, and extreme gradient boosting models. Conclusions: ML-based functional outcome prediction models for acute ischemic stroke were validated and proved to be readily applicable and useful.


Introduction
Stroke is a representative disease with high mortality and morbidity [1], and 30-70% of stroke survivors reportedly remain disabled [2]. Functional disability affects, directly and indirectly, the rest of the patient's life. It not only hinders the patient's return to society and work but also places a burden on family members [3]. Moreover, disability is an important cause of psychiatric complications, such as depression, which significantly reduce the long-term quality of life of patients [4]; the prognosis of functional ability after a stroke is therefore understandably one of the most important concerns for patients and their families. Prediction of functional recovery is also needed by the physicians and surgeons who must establish long-term treatment plans [5].
To meet these needs, several prediction tools using risk scoring methods have been proposed for acute ischemic stroke, which accounts for the majority of strokes [6]. The representative tools are the Acute Stroke Registry and Analysis of Lausanne (ASTRAL)

Variables
Variables that were initially available on admission in patients with acute ischemic stroke were used. The detailed definitions of all the contributing variables are presented in the online Supplementary Materials (Document S1).
Personal factors, such as age, sex, body mass index, and abdominal circumference, were checked. We evaluated the National Institutes of Health Stroke Scale (NIHSS) at admission, stroke subtype, onset type, onset (or LNT) to arrival time, circulatory territory, involved side, and the type of acute intravenous/intraarterial treatment as stroke-related factors. We also investigated the initial laboratory findings and blood pressure at the time of admission. Finally, comorbidities, including previous stroke or transient ischemic attack, coronary artery disease, peripheral artery disease, hypertension, diabetes, dyslipidemia, atrial fibrillation, smoking habit, previous administration of antiplatelets/anticoagulants, and potential sources of cardiogenic embolism, were also checked and confirmed for each patient.
The dependent variable was defined as the mRS value at 3 months after stroke onset and was dichotomized for prediction. Favorable outcomes were defined as mRS scores of 0 and 1, and this group was designated as the target class; follow-up mRS values were measured at an outpatient clinic by a neurologist or neurosurgeon. For patients unable to visit the hospital, the mRS was measured by a structured telephone interview with the patient or family, according to the Korean Stroke Registry guideline [19].
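The outcome dichotomization described above is simple enough to sketch. The authors' pipeline was written in R, but the rule itself is language-agnostic; the function name below is ours:

```python
# Illustrative sketch (not the authors' code): dichotomize the 3-month
# modified Rankin Scale score into the binary target class used for prediction.
def dichotomize_mrs(mrs_3m: int) -> str:
    """mRS 0-1 -> 'favorable' (target class); mRS 2-6 -> 'unfavorable'."""
    if not 0 <= mrs_3m <= 6:
        raise ValueError("mRS must be an integer from 0 to 6")
    return "favorable" if mrs_3m <= 1 else "unfavorable"

labels = [dichotomize_mrs(s) for s in [0, 1, 2, 4, 6]]
# labels == ['favorable', 'favorable', 'unfavorable', 'unfavorable', 'unfavorable']
```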

Data Analysis and Machine Learning Processes
All statistical analyses and machine learning processes were performed using R software 4.1.0 (R Core Team, R Foundation for Statistical Computing, Vienna, Austria). The entire code used for data analyses and the ML processes in this study, as well as the corresponding results, are available in the online Supplementary Materials (Document S2). The entire ML modeling process is summarized in Figure 2.

Comparative analyses were conducted between outcome groups with mRS < 2 or mRS ≥ 2 at 3 months after stroke onset. Continuous variables were expressed as the mean ± standard deviation, and an independent t-test was used to compare the two groups. Categorical variables were presented as frequencies and proportions, and the chi-squared (trend) test was performed for comparative analysis, with the statistical significance defined as p < 0.05.
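As a minimal illustration of the two group comparisons (the study itself used R), the Welch t statistic and a 2x2 chi-squared test can be computed from summary data; the age figures below are those reported in Table 1, while the 2x2 counts are hypothetical:

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch two-sample t statistic computed from summary statistics."""
    return (mean1 - mean2) / math.sqrt(sd1**2 / n1 + sd2**2 / n2)

def chi2_2x2(a, b, c, d):
    """Pearson chi-squared statistic for a 2x2 table [[a, b], [c, d]] and its
    1-df p-value via the exact relation p = erfc(sqrt(x / 2))."""
    n = a + b + c + d
    x = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return x, math.erfc(math.sqrt(x / 2))

# Age: favorable 65.8 +/- 11.3 (n = 745) vs. unfavorable 74.4 +/- 11.4 (n = 321)
t = welch_t(65.8, 11.3, 745, 74.4, 11.4, 321)   # strongly negative => p < 0.001
```

The p-value for the t statistic additionally requires the t-distribution CDF (e.g. from `scipy.stats`), which is omitted here for brevity.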
Serial data pre-processing was performed for ML algorithm training and validation. Variables with near-zero variances were identified and removed, with correlation coefficients evaluated to confirm the collinearity between continuous variables. After the variable selection was completed, centering and scaling for continuous variables and one-hot encoding for categorical variables were conducted. The dataset was then randomly split into training and test sets at a ratio of 7:3, and the synthetic minority oversampling technique (SMOTE) and adaptive synthetic (ADASYN) sampling were then applied to the training set to deal with the imbalance in the target class.
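The SMOTE step can be illustrated with a minimal numpy sketch. The study used R implementations of SMOTE/ADASYN; this toy version only shows the core idea of interpolating between a minority-class sample and one of its k nearest minority-class neighbours:

```python
import numpy as np

def smote_like(X_min, n_new, k=5, rng=None):
    """Toy SMOTE-style up-sampling: each synthetic sample lies on the segment
    between a random minority sample and one of its k nearest minority-class
    neighbours. Illustrative only; not the implementation used in the study."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise Euclidean distances within the minority class
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)              # exclude each point from its own neighbours
    nn = np.argsort(d, axis=1)[:, :k]        # k nearest neighbours per sample
    synth = np.empty((n_new, X_min.shape[1]))
    for j in range(n_new):
        i = rng.integers(n)                          # pick a random minority sample
        nb = nn[i, rng.integers(min(k, n - 1))]      # and one of its neighbours
        lam = rng.random()                           # interpolation weight in [0, 1)
        synth[j] = X_min[i] + lam * (X_min[nb] - X_min[i])
    return synth
```

In this study, SMOTE raised the minority (unfavorable) class of the training set from 232 to 464 samples, i.e. 232 synthetic samples were generated.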
Five ML algorithms were evaluated in this study, namely, regularized logistic regression (RLR), support vector machines (SVM), random forest (RF), k-nearest neighbors (KNN), and extreme gradient boosting (XGB). Internal validation was performed by applying these ML algorithms to the training dataset, where pre-processing was completed. To obtain an optimal training model, 10-fold cross-validation with 10 repetitions was performed, with both random and grid search methods applied for hyperparameter tuning. An external validation was then performed on the test dataset using the trained models. This study used AUC as the main metric and investigated the F1-score and overall accuracy to evaluate model performance. The best-performing model was then selected for each ML algorithm based on external validation results. The variable importance from ensemble algorithms (RF and XGB) and a linear model (RLR) was also determined, with the top 10 important variables from each algorithm identified.
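The two headline metrics can be computed from first principles; a tie-free Python sketch (again, the study's own analysis was done in R):

```python
import numpy as np

def auc_score(y_true, scores):
    """AUC via the Mann-Whitney U statistic: the probability that a randomly
    chosen positive case is scored above a randomly chosen negative case.
    Assumes no tied scores for simplicity."""
    y_true, scores = np.asarray(y_true), np.asarray(scores)
    ranks = scores.argsort().argsort() + 1               # 1-based ranks
    n_pos = int(y_true.sum())
    n_neg = len(y_true) - n_pos
    u = ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

def f1_score_binary(y_true, y_pred):
    """F1 = harmonic mean of precision and recall for the positive class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = int(((y_true == 1) & (y_pred == 1)).sum())
    fp = int(((y_true == 0) & (y_pred == 1)).sum())
    fn = int(((y_true == 1) & (y_pred == 0)).sum())
    return 2 * tp / (2 * tp + fp + fn)
```

AUC depends only on the ranking of the scores, which is why it is insensitive to the decision threshold and to the class ratio, whereas F1 and overall accuracy depend on the chosen cutoff.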

Baseline Characteristics
Amongst the 1066 patients analyzed, 745 (69.9%) and 321 (30.1%) patients had favorable and unfavorable outcomes at the 3-month follow-up, respectively. Table 1 describes the personal and stroke-related features of both outcome groups. The mean age of the favorable outcome group was 65.8 ± 11.3 years, significantly younger than that of the unfavorable outcome group (74.4 ± 11.4 years; p < 0.001), and the proportion of men was significantly lower (33.7% vs. 48.6%; p < 0.001). In the favorable outcome group, the NIHSS at admission was significantly lower (2.3 ± 3.2 vs. 6.3 ± 5.9; p < 0.001), and the proportion without acute intravenous/intraarterial treatment was significantly higher (90.3% vs. 79.1%; p < 0.001). The favorable outcome group also showed significantly higher hemoglobin and triglyceride levels (p < 0.001 and p = 0.004, respectively) and significantly lower random glucose and blood urea nitrogen levels (p = 0.028 and p = 0.007, respectively). Table 2 presents the underlying risk factors for both outcome groups. The ratio of current smokers was significantly higher in the favorable outcome group (p = 0.002), whilst atrial fibrillation was significantly more frequent in the unfavorable outcome group (p = 0.009). There were no significant differences in the other risk factors between the two groups.

Data Pre-Processing
The following variables showed near-zero variance and were removed: previous transient ischemic attack, previous peripheral artery disease, previous cancer, previous administration of anticoagulants, and high and medium risks of potential sources of cardiogenic embolism. None of the continuous variables showed collinearity. After random splitting, the training and test datasets contained 769 and 297 samples, respectively, with 537 favorable and 232 unfavorable outcomes in the training dataset. After applying SMOTE, the counts were 537 and 464, respectively, and after applying ADASYN, 537 and 555, respectively. In the test dataset, the two outcome groups comprised 208 and 89 patients, respectively.
Among the optimal training models with the best test prediction result for each ML algorithm, SVM, XGB, and KNN showed the best performance when target class balancing was not performed. In contrast, the best-performing RLR and RF models were generated when SMOTE and ADASYN were applied, respectively. Supplementary Table S2 shows the balancing method and hyperparameter tuning results for the best model of each ML algorithm.

Variable Importance
For RLR, NIHSS at admission was the most important variable for model performance, followed by age and hemoglobin. RF and XGB shared the same top three variables: the most important variable for prediction performance was the NIHSS score at admission, followed by age and time to arrival. NIHSS at admission and age were thus the two most important variables across the three ML models. Finally, random glucose, hemoglobin, and triglyceride levels appeared among the top ten important variables in all three models (Figure 4).


Discussion
This study demonstrated models that predict the short-term functional prognosis of patients with acute ischemic stroke using ML algorithms. All ML algorithms utilized showed validated results with AUC > 0.8. In particular, the proposed models were established based on initial evaluation and examination findings at the time of admission, which has the advantage of being able to predict a favorable outcome within a short time after hospitalization. Moreover, this feature can be useful to physicians and surgeons in making clinical decisions or informing patients and their families.
The proposed ML models performed similarly to, or slightly better than, existing risk scoring tools such as ASTRAL and ISCORE; nevertheless, it is difficult to conclude that our ML models, which use more variables, are a substantially improved prediction tool. Two reasons may explain this finding. The first is the difference in outcome definition. Unlike ASTRAL and ISCORE, where mRS 0-2 was defined as a good outcome, only mRS 0 and 1 were defined as favorable outcomes in this study. According to the mRS [20], grade 2 or lower can be viewed as a favorable outcome when dependency is the criterion; we instead applied a narrower criterion requiring the absence of disability and the maintenance of usual daily activities. Direct comparison with these two risk scoring tools is therefore difficult because of the different target outcome definitions. The second reason can be inferred from the variable importance results. The ASTRAL score is based on six contributing factors: age, NIHSS score, time delay, visual field defect, glucose level, and level of consciousness [7]. ISCORE is calculated from age, sex, preadmission functionality, cancer, atrial fibrillation, congestive heart failure, renal function, stroke subtype, glucose level, and stroke severity [8]. There is substantial overlap between the contributing factors of these two tools and the variables with high importance in our ML models. Although ML-based data processing allowed us to utilize several more variables than the existing risk scoring tools, the models shared a similar list of critical variables.
There have been previous studies on ML-based prediction of functional outcomes in acute ischemic stroke. Heo et al. [11] predicted a favorable functional outcome at the 3-month follow-up. They designated mRS 0-2 as the target class and directly compared the predictive power of their ML models with the ASTRAL score. In their study, the deep neural network recorded a significantly higher AUC than the ASTRAL score, whereas RF and logistic regression did not differ significantly from it. Alaka et al. [21] also presented ML-based models for predicting the 3-month functional outcome by targeting mRS > 2 in ischemic stroke. The external validation results in their study were inferior to ours, with an AUC range of 0.66 to 0.71; this difference is thought to be due to our model using a larger sample and more variables. Jang et al. [22] defined mRS > 1 at 3 months after acute ischemic stroke as a bad outcome and thus performed an analysis based on the same outcome dichotomization as ours. Among the ML models they presented, XGB, RF, and SVM showed the highest AUC value of 0.84; our RLR and SVM models slightly outperformed these with higher AUC values.
Among the ML algorithms we applied, RLR showed the best performance. Regularization lowers parameter weights to reduce model complexity and prevent overfitting [23,24]; thus, a linear model sometimes outperforms ensemble algorithms such as RF and XGB [25,26]. We optimally trained our best model by applying L1 regularization, which avoids overfitting by increasing sparsity [27]. This suggests that the lower-complexity linear model generalized better on our relatively limited dataset; consequently, it showed better prediction performance than the other ML algorithms in this study.
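The sparsity-inducing effect of L1 regularization can be made concrete: in proximal optimization, each L1 update applies soft-thresholding, which sets small coefficients exactly to zero. A minimal sketch (illustrative only, with hypothetical coefficients, not the authors' fitted model):

```python
import numpy as np

def soft_threshold(w, lam):
    """Proximal operator of the L1 penalty lam * ||w||_1: shrink every
    coefficient toward zero and zero out those with |w_i| <= lam."""
    return np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

w = np.array([2.5, -0.3, 0.05, -1.2, 0.0])   # hypothetical coefficients
w_sparse = soft_threshold(w, 0.5)            # -> [2.0, 0.0, 0.0, -0.7, 0.0]
```

Coefficients whose magnitude falls below the regularization strength are eliminated, which is the mechanism behind the sparsity cited above [27].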
The target class in the dataset was slightly imbalanced, so we applied the SMOTE and ADASYN methods, both of which are KNN-based up-sampling techniques [28,29]. We obtained the best results by applying SMOTE to our best model, RLR, while RF achieved optimal results with ADASYN. In SVM and KNN, however, the training model without balancing showed better results, which is thought to be because balancing the training set caused overfitting. XGB can avoid overfitting even with balancing methods, yet its AUC was slightly higher when no balancing was implemented. Consequently, to create an ideal ML-based prediction model, it is necessary to examine the results derived from the original dataset as well as those with target class balancing, and to consider the overfitting caused by up-sampling.
Our ML models showed relatively low specificity, which is thought to be because our target was the majority class. We chose AUC as the main metric because it is not affected by the majority or minority of the target class [30,31]. In contrast, we should interpret the overall accuracy with caution because majority class predictivity may be overestimated in imbalanced data [32]. In fact, we selected the RLR model as the best model because it showed the test prediction result with the highest AUC and the most balanced sensitivity and specificity values among the investigated models.
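The caution about accuracy can be made concrete with a hypothetical example mirroring this cohort's roughly 70/30 class split: a degenerate model that always predicts the majority (favorable) class achieves roughly 70% accuracy while learning nothing, whereas a constant score yields only a chance-level AUC of 0.5:

```python
import numpy as np

rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.3).astype(int)   # ~30% minority class, as in this cohort
always_majority = np.zeros_like(y)         # degenerate "always favorable" model

acc = (always_majority == y).mean()        # ~0.70, despite zero discriminative power
# A constant score cannot rank positives above negatives, so its AUC is 0.5.
```

This is why AUC was used as the main metric and overall accuracy was interpreted with caution.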
There were several limitations to this study. First, mRS was used as the indicator of functional outcome; it has the advantage of being intuitively understandable, but it cannot reflect in detail the various neurologic syndromes observed after ischemic stroke, such as dysarthria-clumsy hand, ataxic hemiparesis, and pure sensory stroke. In these subtypes in particular, there may be a discrepancy between the measured functional score and the discomfort felt by the patient [33]; a better model could therefore be built if more specific clinical data were added. Second, this was a single-center study, so an integrated, multicenter study is required to generalize its results and may yield better performance. Finally, relatively broad inclusion criteria were applied to develop a generally applicable model for acute ischemic stroke, which may have introduced bias.

Conclusions
This study demonstrated that ML-based models early and effectively predicted favorable functional outcomes at 3 months after acute ischemic stroke. All ML-based prediction models in this study showed validated results, with an AUC > 0.8. In particular, RLR showed the best performance, with SVM showing promising results as well. Both models exhibited similar or slightly better performance than existing risk scoring tools or previously proposed ML-based prediction models; moreover, they are useful because they are readily applicable, informative to patients and families, and support clinical decision-making.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10.3390/diagnostics11101909/s1, Table S1: Optimal training model of each machine learning algorithm, Table S2: Confusion matrix, Document S1: Variable definitions and measurements, Document S2: R Markdown.
Data Availability Statement: Data are not publicly available because of privacy and ethical restrictions under the data sharing regulation of the Korean Stroke Registry. Only authorized researchers can access the dataset used in this study (www.stokedb.or.kr; accessed on 25 August 2021). The data analyses and machine learning results for this study are available in the online supplementary content.