Interpretable Machine Learning Model Predicting Early Neurological Deterioration in Ischemic Stroke Patients Treated with Mechanical Thrombectomy: A Retrospective Study

Early neurologic deterioration (END) is a common and feared complication in acute ischemic stroke (AIS) patients treated with mechanical thrombectomy (MT). This study aimed to develop an interpretable machine learning (ML) model for individualized prediction of END in AIS patients treated with MT. The retrospective cohort comprised AIS patients who underwent MT at two hospitals. The ML methods applied were logistic regression (LR), random forest (RF), support vector machine (SVM), and extreme gradient boosting (XGBoost). The area under the receiver operating characteristic curve (AUC) was the main evaluation metric. We also used Shapley Additive Explanations (SHAP) and Local Interpretable Model-Agnostic Explanations (LIME) to interpret the predictions of the model. A total of 985 patients were enrolled in this study, and END occurred in 157 patients (15.9%). Among the models, XGBoost had the highest predictive power (AUC = 0.826, 95% CI 0.781–0.871). The DeLong test and calibration curve indicated that XGBoost significantly outperformed the other models. In addition, the AUC in the validation set was 0.846, indicating good performance of XGBoost. The SHAP method revealed that blood glucose was the most important predictor variable. The constructed interpretable ML model can be used to predict the risk probability of END after MT in AIS patients and may help clinical decision making in the perioperative period.


Introduction
Recent clinical trials have established mechanical thrombectomy (MT) as the first-line standard treatment for acute ischemic stroke (AIS) caused by large vessel occlusion (LVO) [1][2][3][4][5]. However, AIS patients are susceptible to early neurologic deterioration (END) after MT, a common and feared complication [6] that is consistently associated with unfavorable outcomes at three months [7][8][9]. Therefore, it is desirable to predict the risk of END in AIS patients after MT and thereby help doctors make more accurate treatment decisions.
Over the years, many studies have explored predictors of END after MT in AIS patients, and several have been identified, such as blood glucose [10,11], baseline stroke severity [12], atrial fibrillation [13], and diabetes mellitus [14,15]. Several studies have also proposed predictive models for END after MT in AIS patients [16][17][18]. To some extent, these prediction models can help clinicians make more accurate treatment decisions. However, they are based on traditional regression algorithms, which have limited ability to capture nonlinear relationships between prognostic factors and outcomes. In addition, variable selection based on traditional statistical techniques may retain false-positive variables or exclude truly relevant ones, which may reduce predictive power. To overcome these problems, an alternative and effective technique is needed for constructing precise prediction models.
Machine learning (ML) has emerged as a promising predictive tool in medicine and has been applied in many fields, such as ischemic stroke outcome prediction [19][20][21][22], biomedical research [23], and rehabilitation for chronic stroke survivors [24]. During modeling, ML can fit complex relationships in multi-dimensional data, extract subtle information, and automatically summarize and generalize it into new knowledge. The use of ML in medicine has nevertheless been limited by its black-box nature [25,26]. More recently, ML models have been made interpretable with Shapley Additive Explanations (SHAP), which has encouraged further application of ML [19,20]. SHAP is a model explanation method that visualizes the relationship between each feature and the model's predictions, making feature importance easier to understand and enhancing clinical interpretability.
We aimed to develop an interpretable machine learning model using patient demographics and relevant preoperative and intraoperative variables. The model was used to predict the probability of postoperative neurological deterioration and to help doctors make more accurate treatment decisions for patients.

Study Population
This retrospective study was based on the clinical data of patients with LVO who were treated with MT at the National Advanced Stroke Center of Nanjing First Hospital (China) from July 2015 to December 2021. Clinical data of patients treated between January 2019 and December 2021 at Nanjing Drum Tower Hospital (China) were used for external validation of the model. The inclusion criteria were as follows: (I) age older than 18 years; (II) treatment with MT; (III) anterior or posterior circulation large vessel occlusion verified by digital subtraction angiography, magnetic resonance angiography, or computed tomographic angiography. We excluded patients with uncertain 24 h neurological deterioration status or without an NIHSS score on admission. We also excluded patients with missing vital data (i.e., age, sex, and vascular recanalization status) or with extreme outliers.

Patient Variables and Data Definitions
Patient information available for analysis included demographic characteristics, previous antithrombotic treatment, risk factors, past ischemic events, stroke course data within 24 h, stroke severity on admission assessed with the National Institutes of Health Stroke Scale (NIHSS) score, premorbid modified Rankin Scale (mRS) score, and laboratory findings (Table 1).
Symptomatic intracranial hemorrhage (sICH) occurring within 24 h of admission was defined according to the Heidelberg Bleeding Classification [27]. Successful vascular recanalization was defined as a modified Thrombolysis in Cerebral Infarction (mTICI) grade ≥ 2b [28]. The study was conducted in accordance with the requirements of Nanjing First Hospital and Nanjing Drum Tower Hospital and was approved by the ethics committee (approval document number: ChiCTR-OCH-14004382). The outcome was END, defined as an increase in the NIHSS score of at least 4 points from baseline to 24 h after the stroke event [29].
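The END definition reduces to a simple threshold rule on two NIHSS measurements. A minimal sketch, assuming hypothetical argument names (`nihss_baseline`, `nihss_24h`) and made-up patient scores rather than the study's actual data fields:

```python
def is_end(nihss_baseline: int, nihss_24h: int, threshold: int = 4) -> bool:
    """Return True if the NIHSS score rose by at least `threshold` points."""
    return (nihss_24h - nihss_baseline) >= threshold

# Hypothetical patients: (baseline NIHSS, NIHSS at 24 h)
patients = [(12, 17), (8, 10), (15, 19), (20, 14)]
end_flags = [is_end(baseline, followup) for baseline, followup in patients]
print(end_flags)  # [True, False, True, False]
```

An increase of exactly 4 points counts as END, matching the "at least 4 points" wording.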

Statistical Analysis
Categorical variables were expressed as numbers (percentages) and continuous variables as medians (quartiles). Baseline characteristics of patients with and without END were compared using the Student t-test or Mann-Whitney U test for continuous variables, depending on the normality of the distribution, and Pearson's test or Fisher's exact test for categorical variables, depending on the sample size.

Data Processing and Feature Selection
Variables with less than 20% missing data were included in the study, and missing values were imputed using the median for continuous variables or the most common value for categorical variables. Before modeling, the dataset was split into a training set (80%) and a testing set (20%) by stratified random sampling. The data from Nanjing Drum Tower Hospital were used as an external validation set. The training set was used for feature selection and model training. To address the class imbalance of the training set, we applied stratified random sampling to generate positive cases, which may help prevent overfitting. The testing set was used to evaluate model performance during and after training, while the validation set was used to evaluate the generalization of the model.
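The imputation and stratified split described here can be sketched in plain Python. This is an illustration only: the column names, the toy values, and the helper functions `impute` and `stratified_split` are hypothetical, and only the class counts (157 END, 828 non-END) come from the study.

```python
import random
from collections import Counter
from statistics import median

def impute(rows, continuous, categorical):
    """Fill None with the column median (continuous) or mode (categorical)."""
    fill = {}
    for col in continuous:
        fill[col] = median(r[col] for r in rows if r[col] is not None)
    for col in categorical:
        counts = Counter(r[col] for r in rows if r[col] is not None)
        fill[col] = counts.most_common(1)[0][0]
    return [{k: (fill[k] if v is None and k in fill else v) for k, v in r.items()}
            for r in rows]

def stratified_split(rows, label, test_fraction=0.2, seed=0):
    """Split rows so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = {}
    for r in rows:
        by_class.setdefault(r[label], []).append(r)
    train, test = [], []
    for members in by_class.values():
        members = members[:]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test

# Toy cohort with the study's class counts; feature values are made up.
rng = random.Random(1)
cohort = [{"glucose": rng.choice([None, 5.5, 7.0, 9.2]),
           "sex": rng.choice([None, "M", "F"]),
           "end": int(i < 157)} for i in range(985)]
cohort = impute(cohort, continuous=["glucose"], categorical=["sex"])
train, test = stratified_split(cohort, label="end", test_fraction=0.2)
print(len(train), len(test), sum(r["end"] for r in test))
```

With `test_fraction=0.2` the sketch yields 788 training and 197 testing patients. The split sizes reported in the Results (690 and 295) correspond to a test fraction of about 0.3, which is why the fraction is left as a parameter.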
Redundant and irrelevant factors introduce excessive detail and noise, which may lead to over-fitting and decrease the predictive ability of the model. Therefore, we used the Least Absolute Shrinkage and Selection Operator (LASSO) [30] to exclude uninformative variables before developing the ML models. LASSO combines the advantage of wrapper methods, which integrate feature selection with the learning algorithm, with the high computational efficiency of filter methods. In brief, the LASSO is a regularized regression technique commonly used for reducing a high-dimensional feature space. Its principle is to shrink some coefficients to exactly zero and thus perform variable selection.
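How an L1 penalty drives coefficients to exactly zero can be seen in a small coordinate-descent sketch. It uses the squared-error form of the LASSO on synthetic data; the study's outcome is binary, so its actual fit may have used a penalized logistic loss, and the penalty `lam`, the data, and the function names here are assumptions chosen for illustration.

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize (1/2n)||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual with feature j's current contribution removed.
            residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual / n
            w[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j] / n)
    return w

# Synthetic data: only features 0 and 2 truly affect the outcome.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize columns
true_w = np.array([1.5, 0.0, -1.0, 0.0, 0.0, 0.0])
y = X @ true_w + rng.normal(scale=0.5, size=300)
y = y - y.mean()                           # center the outcome

w = lasso_coordinate_descent(X, y, lam=0.2)
selected = [j for j in range(X.shape[1]) if w[j] != 0.0]
print("coefficients:", np.round(w, 3))
print("selected features:", selected)
```

The soft-thresholding step is what distinguishes LASSO from ridge regression: any coefficient whose partial correlation with the residual falls below `lam` is set to exactly zero and the feature is dropped.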

Modeling Strategies
The modeling process is shown in Figure 1. The END ML model was built with data from the training set. We constructed models using four ML algorithms: logistic regression (LR), random forest (RF), extreme gradient boosting (XGBoost), and support vector machine (SVM). We used 10-fold cross-validation to improve generalizability, and a grid search was used to tune the hyperparameters of each model.
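A grid search with 10-fold cross-validation might look like the following scikit-learn sketch. The data are synthetic, and the hyperparameter grids are small placeholders rather than the grids used in the study. XGBoost is left out so that the example depends on scikit-learn alone; `xgboost.XGBClassifier` follows the same estimator interface and could be added to `candidates` in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the training set: 9 features, about 16% positives.
X, y = make_classification(n_samples=400, n_features=9, n_informative=5,
                           weights=[0.84], random_state=0)

# Stratified folds keep the END rate similar in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Placeholder grids; the values are illustrative assumptions.
candidates = {
    "LR": (make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
           {"logisticregression__C": [0.1, 1.0, 10.0]}),
    "RF": (RandomForestClassifier(n_estimators=50, random_state=0),
           {"max_depth": [3, 5]}),
    "SVM": (make_pipeline(StandardScaler(), SVC()),
            {"svc__C": [0.1, 1.0]}),
}

results = {}
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, scoring="roc_auc", cv=cv)
    search.fit(X, y)
    results[name] = (search.best_params_, search.best_score_)
    print(name, search.best_params_, round(search.best_score_, 3))
```

Scoring by `roc_auc` mirrors the use of AUC as the main evaluation metric. Placing the scaler inside the pipeline means it is refit on each training fold, which avoids leaking information from the held-out fold.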

Model Evaluation
Model performance was assessed by the area under the receiver operating characteristic curve (AUC) on the independent testing set. The DeLong test was used to compare the ROC curves of the different models, and calibration curves were used to assess calibration, in order to identify the optimal model. Calibration was quantified with the Brier score (range: 0-1), which reflects the agreement between the model's predicted values and the cohort's observed outcomes, with lower scores indicating better calibration [31]. We also applied the model to the validation set to assess its generalization. The sensitivity, specificity, accuracy (ACC), and Youden index were also analyzed.
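The evaluation metrics can be made concrete with a small dependency-free sketch. The labels and predicted probabilities are made up, and the function names are hypothetical. The DeLong test is not implemented here.

```python
def auc(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)

def best_youden(y_true, y_prob):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best = (None, -1.0)
    for thr in sorted(set(y_prob)):
        tp = sum(1 for t, p in zip(y_true, y_prob) if t == 1 and p >= thr)
        tn = sum(1 for t, p in zip(y_true, y_prob) if t == 0 and p < thr)
        j = tp / n_pos + tn / n_neg - 1
        if j > best[1]:
            best = (thr, j)
    return best

# Made-up outcomes (1 = END) and predicted probabilities.
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_prob = [0.9, 0.2, 0.6, 0.4, 0.1, 0.3, 0.7, 0.2]

print("AUC:", auc(y_true, y_prob))                    # 12 of 15 pairs ranked correctly
print("Brier:", round(brier_score(y_true, y_prob), 3))
print("Youden:", best_youden(y_true, y_prob))
```

On this toy data the AUC is 0.8, the Brier score is 0.175, and the Youden index peaks at a threshold of 0.3.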

Explanation of the Model
Although the above algorithms are robust and well-performing ML methods that are popular in the medical field, their black-box character makes them difficult to interpret. Thus, based on the training set, we applied SHAP to our prediction model to address this issue; SHAP provides feature-level analyses and supports patient-specific interpretation of predictions [32]. SHAP is grounded in cooperative game theory; it computes the Shapley value of each feature in the prediction model and visually represents the importance ranking of the features [33]. In addition, to better interpret the predictions for individual patients, we used Local Interpretable Model-Agnostic Explanations (LIME) to show the impact of key variables at the individual level. Briefly, LIME explains a classifier's prediction by fitting a local linear model to perturbed inputs weighted by their proximity to the instance. We used LIME to explain two specific instances from the best-performing model.
Python version 3.7 was used in this study, and related packages included XGBoost, SHAP, and scikit-learn.

Results
Table 1 describes the characteristics of the study population from Nanjing First Hospital. A total of 985 AIS patients with LVO and 36 features were included in the study, comprising 157 who developed END during hospitalization and 828 without END. Univariate analysis revealed that coronary artery disease, sICH, stent retriever only, stent retriever/aspiration with rescue therapy, blood glucose, and glycated hemoglobin were associated with the risk of END. Supplementary Table S1 describes the demographics and clinical characteristics of the external validation set, which contained 56 positive cases and 177 negative cases.

Feature Selection
The data set was randomly split into the training set (n = 690) and the testing set (n = 295). The LASSO regularization process, applied to the 690 patients in the training set, selected 9 variables as crucial: blood glucose, NIHSS at baseline, interval from groin puncture to recanalization, serum creatinine, interval from onset to treatment, systolic blood pressure, diastolic blood pressure, platelets, and uric acid.

Model Building and Evaluation
We applied four ML algorithms (LR, RF, SVM, and XGBoost) with the 9 crucial variables as inputs to predict the risk of END. Figure 2 shows the ROC curve of each model on the training set and testing set. Table 2 shows the evaluation indexes of each model on the testing set, including AUC, sensitivity, specificity, accuracy, and the Youden index. Overall, among the four models, XGBoost had the highest predictive power (AUC = 0.826, 95% CI 0.781-0.871), whereas SVM showed the poorest performance (AUC = 0.643, 95% CI 0.584-0.702). The DeLong test was used to compare the performance of the four models on the testing set, and p < 0.05 was considered statistically significant. Supplementary Table S2 shows that RF and XGBoost significantly outperformed the other models. We further determined the optimal model from the calibration curves; XGBoost had the lowest Brier score, indicating the best calibration (Figure 3). Therefore, XGBoost was selected as the optimal model.
As shown in Table 3, we validated the XGBoost model in an external validation cohort of 233 patients, and the AUC in the validation set was 0.846. From the confusion matrix, the sensitivity, specificity, and overall accuracy were 0.750, 0.836, and 0.815, respectively; among the 56 cases with END, 46 were correctly predicted by the model, while among the 177 cases without END, 148 were correctly predicted.
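The relationship between the confusion matrix and the reported metrics can be checked with a few lines of arithmetic. In the sketch below, the cohort sizes (56 and 177) and the 148 true negatives come from the text, while `tp=42` is an assumption inferred from the reported sensitivity (0.750 × 56 = 42); the function name is hypothetical.

```python
def classification_metrics(tp, fn, tn, fp):
    """Derive the standard metrics from the four confusion-matrix cells."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "youden": sensitivity + specificity - 1,
    }

# tp=42 is inferred from the reported sensitivity, not stated in the text.
m = classification_metrics(tp=42, fn=56 - 42, tn=148, fp=177 - 148)
print({k: round(v, 3) for k, v in m.items()})
# {'sensitivity': 0.75, 'specificity': 0.836, 'accuracy': 0.815, 'youden': 0.586}
```

With 42 true positives the three reported values are reproduced exactly. Using the 46 correctly predicted END cases stated in the text would instead give a sensitivity of about 0.821 and an accuracy of about 0.833, so the count and the reported metrics do not fully agree.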

Explanation of the Model at the Feature Level
The SHAP algorithm was applied to the XGBoost model to obtain each variable's importance for END prediction. The importance of each variable is plotted in descending order in Figure 4A. Blood glucose was the strongest predictor of END, followed closely by NIHSS at baseline, interval from groin puncture to recanalization, serum creatinine, and interval from onset to treatment. In addition, to distinguish positive from negative predictors of the outcome, SHAP values were used to uncover the END risk factors. As shown in Figure 4B, each row represents a feature and each dot represents a sample; the redder the color, the greater the value of the feature, and the bluer the color, the smaller the value. Higher blood glucose concentrations had a positive impact on the prediction, making END more likely, whereas higher NIHSS scores at baseline had a negative impact, making END less likely.
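The Shapley values underlying SHAP can be computed exactly for a small model by enumerating feature coalitions. The sketch below is a brute-force illustration, not the tree-specific algorithm the SHAP package uses for XGBoost; `toy_risk`, its coefficients, the patient values, and the baseline are all made up.

```python
from itertools import combinations
from math import factorial

def shapley_values(predict, x, baseline):
    """Exact Shapley values; features outside a coalition take baseline values."""
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of coalitions of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                total += weight * (value(coalition | {i}) - value(coalition))
        phi.append(total)
    return phi

def toy_risk(z):
    # Hypothetical risk score with a glucose x blood-pressure interaction.
    glucose, nihss, sbp = z
    return 0.04 * glucose - 0.01 * nihss + 0.0002 * glucose * sbp

x = [11.0, 8.0, 160.0]          # patient: high glucose, low NIHSS, high SBP
baseline = [6.0, 14.0, 140.0]   # reference values, e.g. cohort averages
phi = shapley_values(toy_risk, x, baseline)
for name, p in zip(["glucose", "NIHSS", "SBP"], phi):
    print(f"{name}: {p:+.4f}")
```

The contributions sum to the difference between the patient's score and the baseline score, the additivity property that lets SHAP plots decompose a single prediction. In this toy model a lower-than-baseline NIHSS receives a positive contribution, the same direction as the relationship described for Figure 4B.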

Explanation of the Model at the Individual Level
We next used LIME to analyze the contribution of features to the prediction of END for new instances in LVO patients. As illustrated in Figure 5, the LIME plot shows, for each of the two instances, the overall predicted probability of END and No-END (left), the classification details (middle), and the true values of the nine main features (right). In patient 23, the predicted probability of END was high (0.70) because of several positive conditions: a high concentration of blood glucose, low NIHSS at baseline, high systolic blood pressure, and high platelets. In contrast, the predicted probability of END in patient 100 was low (0.29) because there was only one positive condition, high serum creatinine.
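The idea behind LIME, a distance-weighted linear surrogate fitted around one patient, can be sketched with NumPy alone. This is a simplified illustration and not the `lime` package: the black-box `toy_model`, its coefficients, the perturbation scales, and the kernel width are all assumptions.

```python
import numpy as np

def lime_explain(predict, x, scale, n_samples=2000, kernel_width=1.5, seed=0):
    """Return local linear coefficients (per standardized unit) around x."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n_samples, len(x)))        # perturbations, standardized
    preds = predict(x + z * scale)                  # query the black box near x
    dist = np.linalg.norm(z, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)  # nearer samples count more
    design = np.hstack([np.ones((n_samples, 1)), z])    # intercept + features
    sw = np.sqrt(weights)
    # Weighted least squares via square-root weighting of rows.
    coef, *_ = np.linalg.lstsq(design * sw[:, None], preds * sw, rcond=None)
    return coef[1:]                                 # drop the intercept

def toy_model(X):
    # Hypothetical black box; not the study's XGBoost model.
    logit = 0.3 * (X[:, 0] - 7.0) - 0.1 * (X[:, 1] - 12.0)
    return 1.0 / (1.0 + np.exp(-logit))

features = ["blood glucose", "NIHSS at baseline", "serum creatinine"]
x = np.array([11.0, 6.0, 90.0])        # one made-up patient
scale = np.array([2.0, 5.0, 20.0])     # assumed feature standard deviations
coefs = lime_explain(toy_model, x, scale)
for name, c in zip(features, coefs):
    print(f"{name}: {c:+.3f}")
```

The sign of each coefficient indicates whether increasing that feature near this patient pushes the predicted probability up or down. Serum creatinine, which the toy model ignores, gets a coefficient close to zero.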

Brain Sci. 2023, 13, 557

Figure 5. LIME plot for individual case explanation on two random patients from the testing set of the XGBoost model. LIME plot included the patient from the "true positive" group explained by LIME algorithm (A) and a patient from the "true negative" group explained by LIME algorithm (B).

Discussion
As far as we know, this study is the first attempt to use an explainable machine learning model to predict the risk of END after MT in AIS patients. The results of the DeLong test showed that our model had good discrimination, and its prediction accuracy was high. Among the four ML prediction models, XGBoost performed best, with an AUC of 0.826. The XGBoost algorithm controls model complexity by adding regularization terms, which helps prevent overfitting. Furthermore, it handles categorical and multi-dimensional data sets well and improves the computational efficiency and generalization ability of the model, which makes it suitable for clinical application. We also used the data of Nanjing Drum Tower Hospital to externally validate our prediction model, and the AUC was 0.846, indicating that our model has high potential for predicting END, at least in a wider range of the Chinese population. We hope that more data can be used to further validate our model in the future. Our prediction model included some variables associated with END that were not previously highlighted, such as the interval from onset to treatment, platelets, and uric acid. In our interpretable ML model, the variables most associated with END were visualized, which makes the risk factors associated with END easier to understand.
Our study is consistent with previous findings and validates the value of the ML model in predicting END after MT in AIS patients. The fasting blood glucose level [34][35][36][37] and the baseline NIHSS score [35,37] have been reported as risk factors for END. Previous work has shown that the effect of blood glucose may be related to vascular endothelial dysfunction caused by impaired blood glucose control [38]. Recent studies have also shown a close link between pre-existing hyperglycemia and increased cerebral ischemia/reperfusion injury in stroke. In addition, previous studies have shown that clinical features observed at stroke onset can by themselves help to distinguish cardioembolic from atherothrombotic infarction [39]. However, in our study, although the incidence of END was higher in patients with atherosclerotic infarcts than in patients with cardioembolic stroke, the difference was not statistically significant (p > 0.05). Larger samples and further research may be needed to clarify this.
High blood glucose also promotes thromboinflammation, which creates a harmful environment in the body [40]. The study by Girot et al. showed that a low NIHSS score on admission was an independent risk factor for END [17], and our findings are consistent with theirs. At first glance, this result may be surprising. It is possible that patients with higher NIHSS scores are less prone to a further increase in their score than those with lower scores. It may also be that patients with lower NIHSS scores are prone to vascular injuries in the acute phase and to hypoperfusion areas that lead to infarction progression. Rohini et al. [18] showed that higher systolic blood pressure (SBP) is strongly associated with END, which is consistent with our findings. One explanation is that patients with high SBP may have more severe strokes. In addition, previous studies have shown that elevated baseline SBP in AIS patients may be associated with death and dependence [41]. In our study, the time from groin puncture to revascularization and the time from symptom onset to treatment were also highly correlated with END. Zhong et al. [42] reported that the time from onset to revascularization was an independent predictor of END after EVT for acute basilar artery occlusion.
The complexity of ML models makes it difficult to determine the reasons behind their predictions, which may hinder clinical application. In this study, however, the SHAP algorithm was used to interpret the predictions at different levels, preserving model performance while providing clinical interpretability, and the results were presented to the user through user-friendly visualization tools. Clinicians can thus better understand the decision-making process of the model, which is conducive to the clinical use of the prediction results. Furthermore, our study demonstrates that interpretable machine learning methods are capable of predicting END and individualizing predictions for specific patients. Previous studies on END mainly focused on its pathophysiological explanation and on individual risk factors, and lacked combined use of these factors in large clinical samples [43][44][45]. Moreover, there is no uniformly accepted risk stratification algorithm for predicting END. Therefore, the strength of our ML model is that relevant variables can be extracted from real-world clinical data to predict END.
Our study has some limitations. First, although we had data for external validation, these data were retrospective and came from a single center. The retrospective design might have led to a certain degree of recall and selection bias. Therefore, before the model can be promoted, more data sets and prospective multicenter clinical trials are needed to verify our results and the accuracy of the model. Second, to retain more available data, we only considered medical records in which END occurred within 24 h after stroke, although END is also known to occur within days of onset. In addition, in some patients anesthetic drugs had not been completely metabolized or the patient was still intubated at assessment, which could bias the NIHSS score. Third, collateral status may affect END, but this variable was not collected. Fourth, in the process of building the model, feature selection was applied only once on the combined training-validation set and was not incorporated into the cross-validation splits; therefore, some variability in feature importance may be unaccounted for. Finally, the participants were all from the Chinese population, so the results cannot be easily extrapolated to other groups.

Conclusions
In this study, we constructed an interpretable ML model that can be used to predict the risk probability of END after MT in AIS patients. The model was externally validated and achieved good predictive performance. However, the amount of data used to validate the model was small, and more clinical data and prospective clinical studies are needed for further verification. The results of this study may guide clinical decision making in patient selection and in the intraoperative and postoperative management of AIS patients undergoing MT who are at high risk of END.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/brainsci13040557/s1, Table S1: Demographics and clinical characteristics of external validation set.
Funding: This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Institutional Review Board Statement:
This study was conducted in accordance with the requirements of Nanjing First Hospital and Nanjing Drum Tower Hospital and was approved by the ethics committee (approval document number: ChiCTR-OCH-14004382).
Informed Consent Statement: Patient consent was waived because this is a retrospective study.

Data Availability Statement:
The datasets used in this study are available from the corresponding author upon reasonable request.