Early Prediction of Seven-Day Mortality in Intensive Care Unit Using a Machine Learning Model: Results from the SPIN-UTI Project

Patients in intensive care units (ICUs) are at higher risk of poor prognosis and mortality. Here, we aimed to evaluate the ability of the Simplified Acute Physiology Score (SAPS II) to predict the risk of 7-day mortality, and to test a machine learning algorithm that combines the SAPS II with additional patient characteristics at ICU admission. We used data from the “Italian Nosocomial Infections Surveillance in Intensive Care Units” (SPIN-UTI) network. A Support Vector Machine (SVM) algorithm was used to classify 3782 patients according to sex, patient origin, type of ICU admission, non-surgical treatment for acute coronary disease, surgical intervention, SAPS II, presence of invasive devices, trauma, impaired immunity, antibiotic therapy and onset of healthcare-associated infection (HAI). The accuracy of SAPS II in discriminating patients who died from those who did not was 69.3%, with an area under the curve (AUC) of 0.678. Using the SVM algorithm, instead, we achieved an accuracy of 83.5% and an AUC of 0.896. Notably, SAPS II was the variable that weighed most in the model, and its removal resulted in an AUC of 0.653 and an accuracy of 68.4%. Overall, these findings suggest that the present SVM model is a useful tool for the early identification of patients at higher risk of death at ICU admission.


Data imputation
For replacing missing values, different imputation methods (i.e. replacement of missing values with zero, mean, median or mode values, and regression imputation) are commonly used. Here, we used a K-Nearest Neighbor (K-NN) imputation method to recover part of the missing values for continuous and categorical variables, according to Malarvizhi and Thanamani [1]. The K-NN method is based on the assumption that a point's value can be approximated by the values of the points closest to it, based on the other variables [2]. It is useful for dealing with all kinds of missing values whose distribution is unknown. In our study, we applied the algorithm for each target variable, considering the Euclidean distance in the feature space for non-binary variables and the Jaccard distance for dichotomous variables. In particular, the Jaccard distance is the complement of the Jaccard coefficient, which is defined as the size of the intersection divided by the size of the union of the sample sets.
Applying two cycles of 1-NN imputation separately to the two classes of data (deceased and surviving patients), we recovered 3258 records, approximately 73% of the incomplete ones. After imputation, all available data were included in the analysis.
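The 1-NN imputation step can be sketched with scikit-learn's KNNImputer. Note this is a minimal illustration on made-up values, not the SPIN-UTI data: KNNImputer only supports a Euclidean-type distance on the non-missing features, so the paper's Jaccard distance for dichotomous variables, and the two separate per-class imputation cycles, would require additional custom handling.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data: three records, two numeric features; one value is missing.
# (Illustrative values only -- not from the SPIN-UTI dataset.)
X = np.array([
    [1.0, 2.0],
    [3.0, 4.0],
    [np.nan, 6.0],
])

# 1-NN imputation: each missing entry is filled with the value of the
# single nearest record, where distance is computed on the non-missing
# features (scikit-learn's nan_euclidean metric).
imputer = KNNImputer(n_neighbors=1)
X_imputed = imputer.fit_transform(X)

print(X_imputed)
# Row 2 is closer to row 1 (|6 - 4| = 2) than to row 0 (|6 - 2| = 4),
# so the missing value is filled with 3.0.
```

To mirror the paper's scheme, one such pass would be run separately on each outcome class (deceased vs. surviving patients).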

Support Vector Machine model
Datasets are often not linearly separable, even in a feature space, so that not all the constraints of the SVM minimization problem can be satisfied [3]. To address this issue, slack variables are introduced to allow certain constraints to be violated. However, by choosing very large slack variable values we could find a degenerate solution, which would lead to model overfitting. To penalize the assignment of too large slack variables, a penalty term is added to the classification objective, which becomes min ½‖w‖² + C Σᵢ ξᵢ, where: ξᵢ denotes the slack variables, one for each data point i, which allow certain constraints to be violated; C denotes a tuning parameter that controls the trade-off between the penalty on the slack variables ξᵢ and the optimization of the margin. High values of C penalize slack variables, leading to a hard margin, whereas low values of C lead to a soft margin, i.e. a wider corridor that allows certain training points inside at the expense of misclassifying some of them. In particular, the parameter C sets the confidence interval range of the learning model.
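The effect of C on the margin can be illustrated with scikit-learn's SVC on a toy dataset (the data and C values below are illustrative, not the paper's): with a hard margin few points touch or enter the corridor, so few become support vectors, while a soft margin pulls more training points inside it.

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated toy clusters (illustrative data only).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [3, 3], [3, 4], [4, 3], [4, 4]], dtype=float)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# High C: slack variables are heavily penalized -> hard margin,
# only the points closest to the boundary become support vectors.
hard = SVC(kernel="linear", C=100.0).fit(X, y)

# Low C: slack is cheap -> soft margin, the corridor widens and
# more training points fall inside it (more support vectors).
soft = SVC(kernel="linear", C=0.01).fit(X, y)

print("support vectors, C=100 :", hard.n_support_.sum())
print("support vectors, C=0.01:", soft.n_support_.sum())
```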
The RBF kernel function on two samples, x and x′, is defined as K(x, x′) = exp(−γ‖x − x′‖²), where ‖x − x′‖² is the squared Euclidean distance between the two feature vectors and γ is a free parameter. The RBF kernel is applied to a dataset through the choice of two parameters, C and γ, and the classification performance of the SVM depends on the choice of these two parameters. A Grid Search method was used to find the optimal parameters of the RBF kernel for the SVM.
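The kernel definition can be checked directly: computing K(x, x′) = exp(−γ‖x − x′‖²) by hand and comparing it with scikit-learn's rbf_kernel (the vectors and γ below are arbitrary illustrative values).

```python
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

# Two illustrative feature vectors (not real patient data).
x1 = np.array([[1.0, 2.0, 3.0]])
x2 = np.array([[2.0, 0.0, 1.0]])
gamma = 0.5

# K(x, x') = exp(-gamma * ||x - x'||^2), written out explicitly.
sq_dist = np.sum((x1 - x2) ** 2)      # squared Euclidean distance = 9.0
k_manual = np.exp(-gamma * sq_dist)

# Same kernel value via scikit-learn.
k_sklearn = rbf_kernel(x1, x2, gamma=gamma)[0, 0]

print(k_manual, k_sklearn)
```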
This method considers m values of C and n values of γ, and trains a different SVM for each of the m×n combinations of C and γ [4] using K-fold cross-validation. Here, to optimize the F1-score of the positive class, we used a Grid Search with 5-fold cross-validation. The analyses were performed using Python and the Support Vector Classification (SVC) implementation from scikit-learn 0.22.1.
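The described procedure maps onto scikit-learn's GridSearchCV. The sketch below uses a synthetic dataset and an arbitrary small parameter grid standing in for the paper's (unreported) C and γ ranges; only the scoring="f1" and cv=5 settings follow the text.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic binary dataset standing in for the ICU cohort.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Illustrative grid: m values of C and n values of gamma
# (m * n candidate models are trained).
param_grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]}

# 5-fold cross-validated grid search optimizing the F1-score of the
# positive class, as in the paper's setup.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="f1", cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best CV f1-score: %.3f" % search.best_score_)
```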