Machine Learning Models of Survival Prediction in Trauma Patients

Background: We aimed to build machine learning models for the prediction of survival in trauma patients and to compare their predictions with those of the most commonly used algorithm, the Trauma and Injury Severity Score (TRISS). Methods: Hospitalized trauma patients enrolled from 2009 to 2016 were divided into a training dataset (70% of the original data set), used to generate a plausible model under supervised classification, and a test dataset (30% of the original data set), used to test the performance of the model. The training and test datasets comprised 13,208 (12,871 survival and 337 mortality) and 5603 (5473 survival and 130 mortality) patients, respectively. With the provision of additional information, such as pre-existing comorbidity status and laboratory data, logistic regression (LR), support vector machine (SVM), and neural network (NN) models (the latter built with the Stuttgart Neural Network Simulator (RSNNS)) were constructed for survival prediction and compared against the predictive performance of TRISS. Predictive performance was evaluated by accuracy, sensitivity, and specificity, as well as by the area under the curve (AUC) of receiver operating characteristic curves. Results: In the validation dataset, NN and TRISS presented the highest balanced accuracy (82.0%), followed by the SVM (75.2%) and LR (71.8%) models. In the test dataset, NN had the highest balanced accuracy (75.1%), followed by the SVM (70.6%), TRISS (70.2%), and LR (68.9%) models. All four models (LR, SVM, NN, and TRISS) exhibited a high accuracy of more than 97.5% and a sensitivity of more than 98.6%. However, NN exhibited the highest specificity (51.5%), followed by the TRISS (41.5%), SVM (40.8%), and LR (38.5%) models. Conclusions: All four models (LR, SVM, NN, and TRISS) exhibited similarly high accuracy and sensitivity in predicting the survival of trauma patients. In the test dataset, the NN model had the highest balanced accuracy and predictive specificity.


Background
Among various prediction models for mortality outcomes of trauma patients, the Trauma and Injury Severity Score (TRISS) remains the most commonly used algorithm [1][2][3]. The TRISS model was created by logistic regression (LR) in 1987 using a small single-center cohort to predict survival probability [4]. In 1990, it was used to provide a comprehensive overview of outcomes in trauma patients by the American College of Surgeons' Major Trauma Outcome Study (MTOS) after the evaluation of more than 80,000 patients [5]. The TRISS calculator determines the probability of survival from age as a dichotomous variable, the Injury Severity Score (ISS, an anatomical variable), the Revised Trauma Score (RTS, a physiological variable), and different coefficients for blunt and penetrating injuries. The ISS is a measure of the severity of injury using the Abbreviated Injury Scale (AIS) [6], which scores the severity of injury on a scale of 1 to 6 for each of six specified anatomical regions. The sum of the squares of the three highest AIS scores is used to calculate the ISS, which ranges from 1 to 75 [7]. The RTS is a weighted summation of coded values of the patient's initial Glasgow Coma Scale (GCS) score, systolic blood pressure (SBP), and respiratory rate (RR) [8].
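The ISS and RTS definitions above translate directly into code. The following is a minimal sketch; the band coding tables for GCS, SBP, and RR are the standard published RTS tables and are not taken from this article, and the convention of setting ISS to 75 when any region scores AIS 6 is likewise a standard rule assumed here.

```python
def iss(ais_by_region):
    """Injury Severity Score: sum of squares of the three highest AIS
    scores across the six body regions (range 1-75)."""
    # Standard convention (assumed): any unsurvivable injury (AIS 6) sets ISS to 75.
    if any(a == 6 for a in ais_by_region):
        return 75
    top3 = sorted(ais_by_region, reverse=True)[:3]
    return sum(a * a for a in top3)

def rts(gcs, sbp, rr):
    """Revised Trauma Score: weighted sum of coded GCS, SBP, and RR values
    (standard published weights and coding bands)."""
    def code(value, bands):
        for lo, hi, coded in bands:
            if lo <= value <= hi:
                return coded
        return 0
    gcs_c = code(gcs, [(13, 15, 4), (9, 12, 3), (6, 8, 2), (4, 5, 1)])
    sbp_c = code(sbp, [(90, 10**9, 4), (76, 89, 3), (50, 75, 2), (1, 49, 1)])
    rr_c = code(rr, [(10, 29, 4), (30, 10**9, 3), (6, 9, 2), (1, 5, 1)])
    return 0.9368 * gcs_c + 0.7326 * sbp_c + 0.2908 * rr_c
```

For a patient with normal vital signs (GCS 15, SBP 120, RR 16), all three coded values are 4 and the RTS takes its maximum value of 7.8408.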
The TRISS model has provided a good standard for survival prediction for more than 30 years, and attempts to replace TRISS have failed [9]. In a review of 90 articles on prognostic models for general trauma patients, the TRISS model was externally validated in 43 articles [3]. Furthermore, 112 TRISS-based models were subsequently developed [3]. However, criticism of the TRISS has persisted because TRISS scores in the middle ranges (25-75%) have large inconsistencies in terms of reliability [10,11]. Thus, the results of the TRISS model differ dramatically across subsets of trauma patients [12,13]. Furthermore, the TRISS model, which was derived from a study based on the North American trauma population, may yield unvalidated results in Europe, where a substantially lower proportion of patients suffer penetrating trauma [9], or in Asia, where motorcycle or fall accidents comprise a majority of the trauma population [14][15][16].
Given that the TRISS model estimates survival probability using linear statistical models and only a small set of selected variables, it is questionable whether such models are capable of analyzing complex biological systems such as traumatized humans. Furthermore, it is important to know whether the provision of additional information, including the presence of pre-existing comorbidities or laboratory data, would improve the predictive performance of TRISS [17]. Nonetheless, the limitations of overfitting and multicollinearity in regression analysis may preclude the analysis of many explanatory variables. Currently, machine learning (ML) has been successfully applied to aid in clinical diagnosis and predictive prognosis [18][19][20]. Examples of ML techniques applied in clinical settings include, but are not limited to, support vector machines (SVM), neural networks (NN), decision trees, random forests, and Bayes classification. SVM and NN can be used as nonlinear statistical data modeling tools to deal with complex relationships and have shown significantly better predictive performance than LR in some illnesses [20,21]. In our prior study, which built ML models to predict the mortality of individual hospitalized motorcycle riders, we found that LR and SVM outperformed the decision tree (DT) in prediction accuracy [22]. Therefore, the current study aimed to investigate whether the addition of more clinical variables and the application of the SVM or NN models can improve performance, compared to TRISS, in predicting the survival probability of trauma patients.

Subject and Data Preparation
This study was approved by the Institutional Review Board (IRB) of Kaohsiung Chang Gung Memorial Hospital, a 2686-bed level I trauma center in Southern Taiwan [14][15][16], with approval number 201701324B0. Because of the retrospective study design, the requirement for informed consent was waived according to the regulations of the IRB. Detailed patient information for the period January 2009 to December 2016 was retrieved from the Trauma Registry System of the hospital. Information used included age, sex, pre-existing comorbidities [coronary artery disease (CAD), congestive heart failure (CHF), cerebral vascular accident (CVA), diabetes mellitus (DM), end-stage renal disease (ESRD), and hypertension (HTN)], vital signs (temperature, systolic blood pressure (SBP), diastolic blood pressure (DBP), heart rate (HR), and respiratory rate (RR), recorded at triage upon arrival at the emergency department), Glasgow Coma Scale (GCS) score, AIS in different body regions, RTS, ISS, and TRISS. Blood-drawn laboratory data obtained at the emergency room, including white blood cell count (WBC), red blood cell count (RBC), hemoglobin (Hb), hematocrit (Hct), platelets, neutrophil (%), international normalized ratio (INR), glucose, sodium (Na), potassium (K), blood urea nitrogen (BUN), creatinine (Cr), aspartate aminotransferase (AST), and alanine aminotransferase (ALT), were also used. Finally, in-hospital mortality was recorded. The survival prognosis of TRISS is computed with a logistic regression equation: survival probability = 1/(1 + e^(−b)), where b (blunt injury) = −0.4499 + 0.8085 × RTS − 0.0835 × ISS − 1.7430 × Age(index) and b (penetrating injury) = −2.5355 + 0.9934 × RTS − 0.0651 × ISS − 1.1360 × Age(index). The Age(index) value is set to 0 in the formula for patients younger than 55 years and to 1 for patients aged 55 or older [4]. Patients with missing laboratory data were not included in the analysis.
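The TRISS survival equation above can be written directly as a short function. This is a sketch using the coefficients quoted in the text, with age dichotomized at 55 years; the function names and interface are illustrative, not from the original study's implementation.

```python
import math

# Coefficients (intercept, RTS, ISS, Age index) for each injury mechanism,
# as quoted in the text [4].
TRISS_COEF = {
    "blunt": (-0.4499, 0.8085, -0.0835, -1.7430),
    "penetrating": (-2.5355, 0.9934, -0.0651, -1.1360),
}

def triss_survival(rts, iss, age, mechanism="blunt"):
    """Probability of survival = 1 / (1 + e^(-b))."""
    b0, b_rts, b_iss, b_age = TRISS_COEF[mechanism]
    age_index = 1 if age >= 55 else 0  # dichotomized age variable
    b = b0 + b_rts * rts + b_iss * iss + b_age * age_index
    return 1.0 / (1.0 + math.exp(-b))
```

As expected from the signs of the coefficients, the predicted survival probability falls as ISS rises and drops further for patients aged 55 or older.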
Multiple imputation of missing values using Markov Chain Monte Carlo simulation was applied to deal with other missing data, which accounted for less than 2% of the total. To avoid multicollinearity, variables that were closely correlated with each other (Hct and Hb, AST and ALT, SBP/DBP/RR and RTS) were excluded before further analysis. From 2009 to 2016, all enrolled patients were divided into a training dataset (70% of the original data set), used to generate a plausible model under supervised classification, and a test dataset (30% of the original data set), used to test the performance of the model. Among the 18,811 enrolled patients, 13,208 (12,871 survival and 337 mortality) and 5603 (5473 survival and 130 mortality) patients comprised the training and test datasets, respectively. An additional validation dataset was created from data randomly selected from 30% of the training dataset and 30% of the test dataset to provide an unbiased evaluation of a model fit on the training dataset. The data in the validation dataset were divided into ten parts, with model hyperparameters fine-tuned by a grid search with 100-fold cross-validation to determine the values that gave the best performance while avoiding overfitting.
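The 70/30 partition described above can be sketched as a simple shuffled split. The shuffling procedure and seed here are illustrative assumptions; the study's exact randomization is not specified.

```python
import random

def split_dataset(records, train_frac=0.70, seed=42):
    """Random train/test partition (default 70/30). The seed is a
    placeholder for reproducibility, not taken from the study."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```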

Logistic Regression (LR)
The LR classifier used the generalized linear model (glm) and step functions in the stats package of R 3.3.3 (R Foundation for Statistical Computing, Vienna, Austria) to predict the likelihood of survival in the trauma patients. A stepwise-selected multivariate LR was performed to identify significant predictors of survival, applying forward selection and backward elimination repeatedly. The Hosmer-Lemeshow test was used to assess the goodness of fit of the LR models. A prediction model was established based on the sum of scores calculated from these independent risk factors and their regression coefficients.

Support Vector Machine (SVM)
The SVM classifier used the e1071 package in R with the tune.svm and svm functions. The SVM classifier handles nonlinear interactions [23]; the optimal operating point was estimated using a grid search with 10-fold cross-validation, varying the penalty parameter C, which determines the tradeoff between minimization of fitting error and model complexity, and the hyperparameter γ, which defines the nonlinear feature transformation onto a higher-dimensional space and controls the tradeoff between error due to bias and variance in the model [23].
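The exhaustive grid search over C and γ can be sketched independently of the SVM backend. In the sketch below, `fit_score` is an assumed interface standing in for a function that trains the SVM at the given (C, γ) and returns its cross-validated accuracy (in the study this role is played by e1071's tune.svm); the exponent ranges match the powers of two reported in the Results.

```python
from itertools import product

def grid_search(fit_score, c_exponents=range(-20, 5), g_exponents=range(-20, 5)):
    """Exhaustive grid over C = 2^x and gamma = 2^y.
    Returns (best score, best C, best gamma)."""
    best = (float("-inf"), None, None)
    for cx, gx in product(c_exponents, g_exponents):
        C, gamma = 2.0 ** cx, 2.0 ** gx
        score = fit_score(C, gamma)  # cross-validated score from the caller's backend
        if score > best[0]:
            best = (score, C, gamma)
    return best
```

With a synthetic scorer peaked at C = γ = 2^(−8) (≈0.003906, the optimum reported in the Results), the search recovers that point.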

Neural Networks (NN)
The NN model was created in R using the Stuttgart Neural Network Simulator (RSNNS) [24] with the multilayer perceptron (mlp) method and the Std_Backpropagation learning function. The models contained an input layer, two hidden layers, and a single output node. Thirty-one input variables were applied in the current models. The number of hidden-layer neurons was determined through trial and error during the selection of a predictive network with the best sensitivity and specificity. For the training process, the number of nodes in each hidden layer was optimized between 1 and 20, the maximum number of learning iterations was set to 200, and the other parameters were kept at their defaults. Iterations continued until the error no longer decreased significantly, to avoid over-training and to enhance the capacity for generalization [25].
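The shape of the network (31 inputs, two hidden layers, one output node giving a survival probability) can be illustrated with a plain forward pass. This is only a structural sketch: the hidden-layer sizes (5 and 18) are the tuned values reported in the Results, sigmoid activations are assumed, the random initialization is a placeholder, and RSNNS's Std_Backpropagation training is not reproduced.

```python
import math
import random

def mlp_forward(x, layers):
    """Forward pass of a multilayer perceptron with sigmoid activations.
    `layers` is a list of (weight matrix, bias vector) pairs."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    a = x
    for W, b in layers:
        a = [sigmoid(sum(w * v for w, v in zip(row, a)) + bias)
             for row, bias in zip(W, b)]
    return a[0]  # single output node: predicted survival probability

def init_layers(sizes=(31, 5, 18, 1), seed=0):
    """Small random weights for each layer transition (placeholder init)."""
    rng = random.Random(seed)
    return [([[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)],
             [0.0] * n_out)
            for n_in, n_out in zip(sizes, sizes[1:])]
```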

Predictive Performance
The accuracy, sensitivity, and specificity of the four models (LR, SVM, NN, and TRISS) were calculated. Discrimination of the models was assessed according to the area under the curve (AUC) of the receiver operating characteristic curves (ROCs) using the roc and roc.test functions in the pROC package in R, as this approach allows for the comparison of two or more empirical curves constructed from tests performed on the same individuals [26]. Based on a confusion matrix, the balanced accuracy is given by 1/2 × (true positives/positives + true negatives/negatives) to deal with imbalanced datasets. Somers' Dxy rank correlation coefficient, c-index, R², and Brier score of these models were calculated. Somers' Dxy assesses predictive discrimination as the probability of concordance minus the probability of discordance between predicted and observed outcomes [27]. The c-index is a measure of how well the model discriminates between those who survive and those who do not; a model is considered to have outstanding discrimination when the c-index is >0.9.
Calibration curves were used to determine the degree of agreement between predicted probabilities and observed outcomes. The R² quantifies the goodness of fit of a model [28], with R² = 1 indicating that the regression line perfectly fits the data. The Brier score is defined as the mean squared difference between the predicted probability and the actual outcome and provides an overall measure of model performance [29]. Brier scores vary between 0 and 1; a lower Brier score indicates a better-calibrated prediction.
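The two performance measures defined above, balanced accuracy and the Brier score, can be sketched as follows (binary labels with 1 = survival; function names are illustrative):

```python
def balanced_accuracy(y_true, y_pred):
    """1/2 * (true positives/positives + true negatives/negatives),
    as defined in the text, for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    pos = sum(y_true)
    neg = len(y_true) - pos
    return 0.5 * (tp / pos + tn / neg)

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome;
    0 is perfect, lower is better calibrated."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)
```

On a heavily imbalanced dataset such as this one, a classifier that predicts survival for everyone scores near-perfect plain accuracy but only 0.5 balanced accuracy, which is why the balanced form is reported.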

Statistical Analyses
All statistical analyses were performed using SPSS 23.0 (IBM Inc., Chicago, IL, USA) or R 3.3.3. For categorical variables, we used chi-square tests to determine the significance of associations between variables. For continuous variables, we used the Kolmogorov-Smirnov test to assess the normality of the data distribution and Mann-Whitney U tests to analyze non-normally distributed data. Results are presented as medians with interquartile ranges (IQRs). A p-value <0.05 was taken as statistically significant.

Patient Demographics
Among the 18,811 enrolled patients, there were 18,344 survival and 467 mortality outcomes. Except for CVA history, there were statistically significant differences in all variables between surviving and fatal patients, including age, gender, pre-existing comorbidities, HR, temperature, GCS, AIS in different body regions, ISS, RTS, and blood-drawn laboratory data (Tables 1 and 2). Thirty-one variables were used as inputs to the ML classifiers.

Construction of the ML Classifiers
Logistic regression identified 19 predictors (CHF, INR, CAD, Cr, ISS, WBC, age, HR, platelets, neutrophil, Na, Hb, AIS-Thorax, GCS, HTN, temperature, AIS-Extremity, RTS, AIS-Face) as independent risk factors for survival (Appendix A Table A1), with ISS and age being the two most important features determining the survival of an individual trauma patient (Appendix A Table A2). The SVM classifier was used to predict mortality, taking inputs from all 31 variables, with the two parameters (C, γ) determined by a grid search over 2^x, where x is an integer between −20 and 4 for both C and γ. The values that gave the highest 100-fold cross-validation accuracy were C = 0.003906 and γ = 0.003906 (i.e., 2^(−8)). The constructed NN model (RSNNS) included 31 inputs, five neurons in the first hidden layer, 18 neurons in the second hidden layer, and one output neuron. The single output node indicated the probability of survival.
A comparison of the AUCs of the ROCs among these four models for the training dataset (Figure 1) demonstrated that the LR model (AUC 0.967) had a significantly higher AUC than the NN (AUC 0.959) and TRISS (AUC 0.934) models, and that SVM (AUC 0.964) had a significantly higher AUC than TRISS; however, there was no significant difference in AUC between LR and SVM or between SVM and NN. In addition, NN had a significantly higher AUC than TRISS. In the test dataset, both LR (AUC 0.958) and SVM (AUC 0.964) had a significantly higher AUC than the NN (AUC 0.944) and TRISS (AUC 0.930) models; however, no significant difference in AUC between the LR and SVM models was apparent. Again, NN had a significantly higher AUC than TRISS. Table 3. Survival prediction performance (i.e., accuracy, sensitivity, and specificity) for the logistic regression (LR), support vector machine (SVM), neural network (NN), and Trauma and Injury Severity Score (TRISS) models in the training, validation, and test datasets.
In both the training and test datasets, the TRISS exhibited significantly worse performance in predicting survival than the three remaining models. The calibration curves of these four models demonstrated non-significant miscalibration in model development, with small differences between models. The LR model generated a nonparametric line close to the ideal diagonal, with the highest R² (0.569) and a Brier score of 0.014 (Figure 2), while the NN had the smallest R² and the lowest Brier score (0.010). By contrast, the TRISS model deviated from the ideal diagonal line, with a clear difference between the predicted probability and the actual outcome, especially for individuals with low predicted probability.

Discussion
More predictive variables may lead to more accurate performance for LR models such as TRISS. In a review of 90 articles on prognostic models for the general trauma population [3], the lowest AUCs were found for a model with age and comorbidities as predictors (AUC 0.59) and for the Kampala Trauma Score (KTS) (AUC 0.62), which is calculated using the patient's age, SBP, RR, neurologic status (alert, responds to voice, responds to pain, or unresponsive), and number of serious injuries [30]; the highest AUC was found for a TRISS-based model with coefficients updated according to goodness of fit for its own study population (AUC 0.98) [3]. Furthermore, the inclusion of dichotomized, categorical, or continuous predictors in the models may result in differences in predictive performance. Comparisons of the TRISS model with TRISS-based models that incorporated different measurement levels for age or ISS [3,31,32] revealed that models with categorical variables presented better calibration and discrimination than those with dichotomized variables, and models that included continuous variables showed even better calibration and discrimination [24,31,32].
Theoretically, models that include predictors from all categories (physiological, anatomical, and demographic variables, and injury cause/mechanism) should show better discrimination than models incorporating only one or two categories. However, systematic reviews have demonstrated that adding more predictors to the basic TRISS model did not always result in better performance [33,34]. Although more predictive variables may lead to more accurate performance, complex interactions among these independent and dependent variables may hinder predictive performance. It is also recognized that, with an increased number of potential risk factors, the complexity of the models can cause over-fitting and yield implausible results. A TRISS-based model that included acute ethanol poisoning as an additional predictor showed even worse discriminatory power [35]. Furthermore, the variables chosen as inputs to the model should be practical to obtain. For example, base deficit could be an important predictor of mortality in trauma patients. However, base deficit is mostly assessed only in severely injured patients [36]; it often would not be recorded and would therefore appear as a missing value in the general trauma population. Therefore, it is not practical to incorporate base deficit into a model that predicts mortality for the general trauma population. Likewise, as more input variables are included in the model, the problem of increasing rates of missing values becomes difficult to overcome [3].
One study used 16 anatomical and physiological predictor variables to predict mortality outcomes in 10,609 trauma patients and revealed that the performance of the NN (AUC 0.912) exceeded that of the TRISS (AUC 0.895) model [37]. NN models have also been more accurate and have outperformed LR models in predicting in-hospital mortality for patients in critical care [38] and under mechanical ventilation [39]. In this study, although SVM and NN were both expected to handle complex nonlinear relationships between independent and dependent variables better than LR [40,41], the SVM outperformed the NN in the test dataset, and there was no significant difference in AUC values between the LR and SVM. The performance of an NN depends on the number of parameters, the network weights, the selection of an appropriate training algorithm, the type of transfer functions used, and the determination of network size [42]. Neural networks require the initialization and adjustment of many individual parameters to optimize classification performance. The NN model was developed empirically and can be over-fitted to the training data [42]. Many researchers have compared NN and LR models and found that they have similar classification performance [42]. Nonetheless, inferences about the explanatory variables of an NN are more difficult to interpret than those derived from LR analysis [43]. In addition, the employment of kernels in SVM helps the model learn nonlinear decision boundaries, allowing the classifier to handle more complex data than linear analytic methods [44], and the SVM boundary is only minimally influenced by outliers [45].
Some limitations of this study include the following: First, patients declared dead on arrival at the emergency room were not recorded in the registered database, which may have resulted in a selection bias [15,16]. Furthermore, because the registered trauma data included only in-hospital mortality and no information regarding mortality at 30 days or longer, additional selection biases in assessing mortality may be present. Second, we included more variables, such as pre-existing comorbidities and laboratory data, to improve prediction performance. However, physiological and laboratory data collected at the time of arrival at the emergency department may not reflect subsequent changes in the hemodynamic and metabolic variables of patients undergoing management or resuscitation. The lack of specific mortality-related information, including resuscitation or the mechanism of trauma, may have limited the accuracy of the prediction model. Third, this study was unable to assess the effects of any one particular treatment intervention, especially surgical interventions; we can only assume that these treatments were uniform across the population. Fourth, our analysis used only single-center data obtained in southern Taiwan, which may not be representative of other populations. Further, in this study, we chose the RSNNS as one representative of NN models because it performed better than other NN packages in R (data not shown). Although the results of the NN model seemed unimpressive, we believe that NN models may still be valuable in predicting the outcomes of trauma because NN algorithms are evolving quickly. Further, although the addition of more clinical variables or the application of the SVM or NN models can significantly improve the predictive performance compared to that of TRISS, the improvement was modest.
It may be argued that such improvements are not worth the use of much more complex input variables than the TRISS model requires. However, we believe that in the future, real-time processing of monitored patient data will reduce the burden of such laborious work. In addition, random forests are an important class of ML models with a different underlying conception from the other models; whether they can improve predictive performance is worthy of further investigation but was not explored in this study. Finally, because of the low specificity of the above ML models in predicting survival, an ensemble of regressors from these different models may be worth building to test whether it can improve predictive power.

Conclusions
In this study, we demonstrated that the four models (LR, SVM, NN, and TRISS) exhibited similarly high accuracy and sensitivity in predicting the survival of trauma patients. In addition, in the test dataset, the NN model had the highest balanced accuracy and predictive specificity of the four models tested. The results of this study may provide encouraging information for the development of a new NN prediction model that can be integrated into trauma care systems to predict the survival of trauma patients.