Since sepsis is diverse disease with high mortality rate, various ways of mortality prediction were developed. Biomarkers like procalcitonin, presepsin, CRP all played some role in sepsis mortality prediction, but the reported AUC of ROC curve varies from different settings or sepsis definition. In the same emergency department setting, Liu reported the AUC 0.658 for presepsin and AUC 0.679 for procalcitonin in predicting 28 d mortality [32
]. Lee reported the AUC 0.76 for procalcitonin and 0.68 for CRP in predicting early mortality and the AUC 0.73 for procalcitonin and 0.64 for CRP in predicting late mortality [33
]. These results were not promising compared to other ways in clinical use. Although many biomarkers display relevant correlation with the mortality of patients with sepsis, their time courses may be more reliable than absolute levels [34
Compared to a single biomarker, the scoring system developed from the regression model could provide more feasible and cheap prediction methods. SIRS, SOFA score, qSOFA, MEDS (mortality in emergency department sepsis), NEWS (national early warning score) were used for this purpose in clinical practice and literature [35
]. Lee reported an AUC 0.82, Zhao showed an AUC 0.776 and Hermans revealed an AUC 0.81 (95% CI = 0.73–0.88) for MEDS in predicting mortality [35
]. Glouden reported an AUC 0.65 (95% CI = 0.61–0.68) for NEWS, 0.62 (95% CI = 0.59–0.66) for qSOFA in predicting in-hospital mortality [37
]. Macdonald revealed an AUC 0.81 (95% CI = 0.74–0.88) for MEDS and an AUC 0.78 (95% CI = 0.71–0.87) for SOFA score in predicting sepsis related mortality [38
]. The performance of MEDS was better than other scoring systems and biomarkers but it required both clinical presentation data and laboratory data to reach the accuracy [39
]. Zhao combined MEDS and procalcitonin in predicting sepsis related mortality and increased the performance of AUC from 0.776 to 0.813.
In our study, we demonstrate that the accuracy rate of mortality prediction at both 72 h and 28 d of suspected infection in sepsis patients can be improved with machine learning. Among the different methods tested, we found that CNN + Softmax yielded the best predictions for sepsis-related mortality. Although machine learning is now used for various medical applications, very few studies used these algorithms in sepsis-related mortality prediction. In Taylor’s study, 500 clinical variables were used with the RF model to predict 28 days in-hospital sepsis-related mortality among 5278 ED visits [40
]. With this, an area under the curve (AUC) of AUC 0.86 (95% CI = 0.82–0.90) was achieved, which is similar to our study that produced an AUC value of 0.89 (95% CI = 0.886–0.894). Taylor compared the traditional regression model and risk stratification score system and showed that they had a lower AUC than the RF model. In Rabias’s study, there were 34 clinical variables that were used with the RVM (relevance vector machine, a variant of SVM) model to predict severe sepsis-related mortality among 354 ICU patients [7
]. He showed that RVM had AUC of 0.80, lower than our result (0.93 at 72 h and 0.90 at 28 days).
Through the feature visualization method, we ensured that the feature extraction abilities of CNN and AE were better than that of PCA. We determined whether features of the same group were concentrated together, and whether the features of different groups were separated.
The extracted feature visualizations show that CNN performed best because it segregated the surviving patients from the non-surviving patients. Features common to both surviving and non-surviving patients were centralized. The feature-extracted results of AE were better than those of PCA because the features of non-surviving patients were more centralized than for PCA, although the degrees of separation of surviving patient data were similar.
The feature extraction ability of the CNN was still the best in predicting mortality at 28 days. Although the features of surviving patients and non-surviving patients were closer than the case for mortality in 72 h, the features of AE and PCA were considerably worse. The CNN centralized the features of non-surviving patients. AE also centralized the non-surviving-patient features; however, the degree of centralization was less than CNN. The extracted results when predicting mortality within 28 days using PCA was worse than CNN. Therefore, in centralizing the same-labeled data or separating different labeled data, the ability of PCA was considerably inferior.
Certain AUC of the no feature extraction case with SoftMax were better than those using the CNN + SoftMax, yet the performance of the accuracy score demonstrated that results generated by CNN + SoftMax were the best. We believe that the accuracy score is more representative than AUC because AUC needs to set a prediction threshold for tuning. Setting the threshold immediately as 0.5 is a more accurate method for comparing the predictive ability of the algorithms.
AUCs of all methods were larger than 0.5 at 28 d. In most of the experiments, SoftMax demonstrated the best performance. The AUC at both 72 h and 28 d revealed that the best AUC score was obtained using CNN + SoftMax; however, the AUC curve is based on the calculation of each operation point on the regression. In terms of machine learning, it was more advantageous to have fewer handcrafted parameters. Although the AUC curve demonstrated the performance of various algorithms, its performance was based on multiple operating points.
Irrespective of the mortality prediction for 72 h or 28 d, predictions made with machine learning were more accurate than those made by either qSOFA or SIRS. Machine learning algorithms easily obtained an accurate result and improved both the treatment strategy and the mortality rate of suspected infected patients. Such methods ensured that the algorithm was considerably biased to the accurate rate of a single category. The most obvious type of partiality was the RF. The best accuracy rate was that of CNN + SoftMax. For more complex algorithms of feature extraction, the KNN, SVM, and RF trends were more accurate. Conversely, higher complexity of feature extractions resulted in higher accuracy rates for SoftMax. Irrespective of the feature extraction, SoftMax demonstrated the best performance. The SVM ranked second while recognizing the extracted features by the CNN.
Our results favored the use of a median value of 0.5 for the regression lines as the standard operating point for automatic classifiers. This comparison method can be considered the fairest and most reasonable. Regardless of which machine learning algorithm is used, it must have the ability to learn the operating points itself. For mortality prediction at 72 h, the accuracy rate of CNN + SoftMax was 89.02%; however, for mortality prediction at 28 d, the accuracy rate of this approach was reduced to 81.79%. This is acceptable because a mortality prediction at 28 d is inherently more difficult. The physiological state of a patient 28 d after diagnosis can be significantly different from the physiological state during the first examination. The longer the timeline, the more difficult it is to predict the mortality rate. Thus, the best classifier for mortality prediction at 28 d was SoftMax.
Irrespective of the type of feature extraction, or in the case where no feature extraction was used, the performance of SoftMax was the best and that of RF was the worst. While using SoftMax, the accuracy rate did not significantly differ. This demonstrated that the 28 d mortality prediction easily reaches its maximum limitation. Nevertheless, complex feature extraction is useful for other classifiers, particularly, the SVM and RF. The recognition abilities of SVM and RF were not equivalent to that of SoftMax. Therefore, increasing the strength of feature extraction optimizes the performance of a classifier. However, if the ability of the classifier is sufficient to appropriately classify the original features, then complex feature extraction does not provide an advantage. While using the SVM and RF as a classifier, we proposed using a CNN for feature extraction to increase the accuracy of the 28-d mortality prediction.
RF performance was inconsistent and was not improved by the use of complex feature extraction. The performance of the RF in mortality prediction at 28 d was similar to that at 72 h. The reason the experimental result for RF was so poor is that it could not effectively distinguish non-survivors from survivors. Thus, it can be inferred that the false positive rate was extremely high.
As the highest accuracy rate was obtained using CNN with a SoftMax layer, this algorithm was recommended for predicting the mortality of suspected infected patients in the ED. Such a design will make it easier for clinicians to immediately determine the condition of the patient.
CNN + SoftMax performed the best due to the shared weights and biases system. Through the shared weights and bias system, the calculation of the neurons was reduced. Regardless of the input feature dimensions, calculations of the neuron number times feature numbers were fixed by the shared weights and biases system. Unlike a traditional neural network, the calculation of neurons was not increased while the input dimension increased. Additionally, the pooling system further increased the performance of CNN + SoftMax, as it reduces the dimension of the extracted feature by the convolutional layer. Through this system, CNN can construct the importance information into a lower dimension feature. For a training model, the lower the dimension, the less gradient descent needs to be calculated. These systems further demonstrate that CNN + SoftMax has the best performance in our experiment. Other research experiments also show that CNN + SoftMax has a better performance than the other machine learning algorithms [41
In our study, the most important feature extracted from RF was base excess (BE). Although it was not given much attention in our practice, one recent study showed that alactic base excess was associated with fluid balance, which is an important parameter in sepsis treatment [43
]. Arayici showed that BE could be used to predict neonatal sepsis with promising sensitivity and specificity [44
]. Another important feature is RDW, which has been well studied in various contexts; studies showed that it could be used to predict long-term outcomes in sepsis patients, irrespective of anemia [45
]. Other important features, like solid tumors and shock episodes, were more familiar to clinical physicians.
The features that were extracted by CNN and AE could not be used to determine which baseline features were most important. Certain machine learning algorithms, such as RF, are capable of ranking different features according to their level of importance. Although their accuracy scores and AUC were lower than those of the CNN, observing these features may lead to other clinical applications.
Although our study demonstrated the good prediction ability of machine learning, it had several limitations. First, although the AUC of CNN + SoftMax is better than that of qSOFA and SIRS, and possibly other score systems as well, our model requires a computer for complex calculations, while the other methods involve calculations that can be easily performed by doctors. Second, CNN + SoftMax needs more clinical data to maintain its accuracy, while qSOFA and SIRS only require a few clinical data points. Thus, our method requires more time to produce accurate results, which in turn creates further delays to doctors and patients before they receive the information. As the use of computers in daily practice progresses, this problem may be resolved in the future. Third, in this study, we used mortality as the primary outcome. Although it is truly the final and worst outcome, and would not be influenced by the management decisions of patients, other bad outcomes like intensive care unit admission, or intubation, could provide more time for attending physicians to act on this disease.