Predicting Outcome of Traumatic Brain Injury: Is Machine Learning the Best Way?

One of the main challenges in traumatic brain injury (TBI) patients is to achieve an early and definite prognosis. Despite the recent development of algorithms based on artificial intelligence for the identification of prognostic factors relevant to clinical practice, the literature lacks a rigorous comparison between classical regression and machine learning (ML) models. This study aims to provide this comparison on a sample of TBI patients evaluated at baseline (T0), 3 months after the event (T1), and at discharge (T2). A Classical Linear Regression Model (LM) was compared with the individual performances of Support Vector Machine (SVM), k-Nearest Neighbors (k-NN), Naïve Bayes (NB), and Decision Tree (DT) algorithms, together with an ensemble ML approach. On the analyzed sample, accuracy was similar between the LM and the ML algorithms when a two-class outcome (Positive vs. Negative) approach was used, whereas the NB algorithm showed the worst performance. This study highlights the utility of comparing traditional regression modeling with ML, particularly when using a small number of reliable predictor variables after TBI. The dataset of clinical data used to train the ML algorithms is publicly available to other researchers for future comparisons.


Introduction
Traumatic brain injury (TBI) has a tremendous impact on patients and family members, who must learn to live with a diminished potential for physical, emotional, cognitive, and social functioning. A recent meta-analysis [1] found an overall incidence rate of 262 per 100,000 per year, and in the USA 43.3% of hospitalized TBI survivors will have a long-term disability [2]. One of the main challenges in TBI-related research is to achieve an early and definite prognosis based on the best predictors of outcome, in order to administer effective treatments able to improve the clinical progression. Several factors have been proposed and predictive models have been constructed. Most studies used traditional regression techniques to identify these factors [3], defining age, diagnosis, and severity level (measured with the Coma Recovery Scale-Revised (CRS-r)) as the most important clinical indicators for predicting TBI outcome, although with poor implementation in clinical practice [4]. In this scenario, considerable effort has been put in recent years into implementing and developing artificial intelligence tools. Machine learning (ML) methods entered this neurological domain only a few years ago with promising and enthusiastic perspectives [5]. The first studies tested the performance of support vector machine (SVM) [6], Naïve Bayes (NB) [7], and random forest [8] algorithms in predicting the mortality of TBI patients. Although good performance has been reported (ranging from 67% to 97%), these preliminary ML studies are characterized by high variability in the predictive models and clinical predictors used for the training phase, which hinders a rigorous comparison among methods. For this reason, in this study, we compare the performance of a Classical Linear Regression Model (LM) with the most common ML algorithms for the prediction of the clinical outcome of TBI patients (measured with the Glasgow Outcome Scale-Extended) 6-9 months after hospitalization. Finally, we also tested an ensemble ML algorithm and applied different feature selection methods to optimize the models.

Population
The population enrolled for this study was composed of 102 subjects (Table 1). This study was a secondary analysis conducted on a large database used in previous studies [9,10] and further augmented with new data. For each patient, the collected data consisted of demographic information and clinical assessments on admission (T0), after three months (T1), and at discharge after 6-9 months (T2). At the end of the study, 11.8% of the TBI patients had died, whereas 37.2% had a positive outcome (Glasgow Outcome Scale score higher than 4).

Proposed Approach
In this section, we briefly describe the outcome measure, the predictors and the classification methods.

Outcome Measure
As the main measure, we used the extended version of the Glasgow Outcome Scale (GOS-E) [11], which is the most widely used clinical scale for outcome assessment after a head injury or non-traumatic acute brain insult. Details about the scale are reported in Table 1. For binary classification, we split the scale into two halves corresponding to Positive and Negative outcomes, respectively. For multi-class classification, we divided the dataset into four classes, joining the Lower and Upper sub-categories for the first three classes and PST and Death for the fourth class, as reported in Table 2.
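To make the outcome encoding concrete, the following MATLAB sketch (the same environment later used for model training) shows one way to map GOS-E scores to the two labeling schemes. The Positive threshold (GOS-E > 4) is taken from the Population section, while the four-class grouping shown here is an assumption meant to mirror the description of Table 2, not the authors' exact coding.

```matlab
% Hypothetical encoding of GOS-E scores (1-8) into the two labeling schemes
% described above; the Positive threshold follows the Population section and the
% four-class grouping is assumed to mirror Table 2.
gose = [8 3 5 1 6 2 7 4]';                              % example GOS-E scores

% Binary outcome: Positive when GOS-E is higher than 4, Negative otherwise
binaryLabel = repmat("Negative", numel(gose), 1);
binaryLabel(gose > 4) = "Positive";
binaryLabel = categorical(binaryLabel);

% Four-class outcome: Lower/Upper sub-categories merged, PST and Death joined
edges = [1 3 5 7 9];                                    % assumed class boundaries
names = {'PST/Death', 'Severe Disability', 'Moderate Disability', 'Good Recovery'};
fourClassLabel = discretize(gose, edges, 'categorical', names);

disp(table(gose, binaryLabel, fourClassLabel))
```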

Predictors' Selection
In the experimental section we tested the following predictors of outcome:
• Age, Sex.
• Marshall classification (T0) [12]: one of the most widely used systems for grading traumatic brain injury at admission. It is based on the non-contrast CT scan of the brain, considering the degree of swelling and the presence and size of hemorrhages. Patients are divided into six categories of increasing severity: diffuse injury I (no visible pathology), diffuse injury II, diffuse injury III (swelling), diffuse injury IV (shift), evacuated mass lesion V, and non-evacuated mass lesion VI. The highest categories therefore correspond to the worst prognosis.
• Early Rehabilitation Barthel Index (ERBI) [16]: extended version of the Barthel index for the assessment of early neurological rehabilitation patients over the course of recovery. It contains highly relevant items, such as mechanical ventilation, tracheostomy, or dysphagia, and it ranges from −325 to +100.

Feature Selection
In this work, we also applied statistical feature selection methods to reduce the computational cost of modeling, make the data easier to interpret, and explore possible improvements in model performance [17]. Univariate feature ranking according to predictor importance was performed using the two most common methods:
• Minimum Redundancy Maximum Relevance (MRMR) algorithm [18]: this method searches for the subset of features with maximum relevance for the response and minimum redundancy, using the pairwise mutual information among features and between each feature and the outcome.
• Chi-square test [19]: an approach based on individual chi-square tests examining the relationship between each predictor variable and the outcome.
Next, the optimal subset of features was defined by using the largest difference between consecutive importance scores as a breakpoint and keeping the predictors ranked above it. The size of the drop in importance scores reflects the confidence of the feature selection: a large drop indicates that the algorithm is confident in separating the most important variables, whereas a small drop suggests that the differences in predictor importance are not meaningful.
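As an illustration of this step, the following MATLAB sketch ranks features with `fscmrmr` and `fscchi2` and keeps the subset before the largest drop in importance scores. The variable names (`predictors`, `outcome`) are assumptions, and the snippet is a sketch of the described procedure rather than the authors' script.

```matlab
% Univariate feature ranking with MRMR and chi-square tests; 'predictors' is a
% table of clinical features and 'outcome' a categorical response (assumed names).
[idxMRMR, scoresMRMR] = fscmrmr(predictors, outcome);   % MRMR importance ranking
[idxChi2, scoresChi2] = fscchi2(predictors, outcome);   % chi-square importance ranking

% Keep the predictors ranked before the largest drop between consecutive scores:
% a large drop indicates confidence in separating important from unimportant features.
rankedScores = scoresMRMR(idxMRMR);                 % scores sorted by decreasing importance
[~, breakpoint] = max(abs(diff(rankedScores)));     % position of the largest drop
selected = idxMRMR(1:breakpoint);                   % indices of the retained predictors
disp(predictors.Properties.VariableNames(selected))
```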

Classification Methods
In the classification phase, we used the LM and the four most common classifiers, which are described below:
• Support Vector Machine (SVM) [20]: a widely used method that maps the data into a higher-dimensional feature space using kernel functions to make them separable, and then finds the best separating hyperplane. In this study, we used a Radial Basis Function (RBF) kernel,

K(x_j, x_k) = \exp\left(-\gamma \,\lVert x_j - x_k \rVert^2\right),

where x_j and x_k are the vectors representing the j-th and k-th observations and \gamma > 0 is the kernel parameter.
• k-Nearest Neighbors (k-NN) [21]: a simple approach in which each object is assigned to the most common class among its k nearest neighbors by majority voting. In this study, we set the number of nearest neighbors to 5, following the general rule k = \sqrt{n} to identify the optimal value, where n is the number of samples in the training data [22], and employed the standardized Euclidean distance as the metric,

d(q, x_i) = \sqrt{\sum_{j=1}^{p} \frac{(q_j - x_{ij})^2}{\sigma_j^2}},

where q is the query instance, x_i is the i-th observation of the sample, and \sigma_j is the standard deviation of the j-th of the p features computed over the training sample.
• Naïve Bayes (NB) [23]: based on Bayes' theorem, this technique applies density estimation to the data and assigns an observation to the most probable class, assuming that the predictors are conditionally independent given the class. In this study, the class-conditional probabilities were computed by modeling the data with a Gaussian distribution,

P(x_j \mid c) = \frac{1}{\sqrt{2\pi}\,\sigma_{jc}} \exp\!\left(-\frac{(x_j - \mu_{jc})^2}{2\sigma_{jc}^2}\right),

where \mu_{jc} and \sigma_{jc} are the mean and standard deviation of predictor x_j within class c.
• Decision Tree (DT) [17]: based on a tree-like model in which each internal node specifies a test involving an attribute, each branch descending from a node corresponds to one of the possible outcomes of the test, and each leaf node represents a class label.
Classifying an object with a decision tree means performing a sequence of cascading tests, starting at the root node and finishing at a leaf node. In this study, for the decision tree model, we set the maximal number of decision splits to 10. As the algorithm for selecting the best split predictor at each node, we chose standard CART, which selects the split maximizing the Gain. Gain is a split criterion given by the difference between the information needed to classify an object (I) and the residual information needed after the value of attribute A has been learned (I_res),

\mathrm{Gain}(A) = I - I_{\mathrm{res}}(A), \qquad I = -\sum_{c} p(c)\,\log_2 p(c), \qquad I_{\mathrm{res}}(A) = -\sum_{v \in \mathrm{values}(A)} p(v) \sum_{c} p(c \mid v)\,\log_2 p(c \mid v),

where I is the entropy measure, p(c) is the proportion of examples of class c, p(v) is the proportion of examples for which attribute A takes value v, and p(c | v) is the proportion of examples of class c among them.

All four classifiers were also combined with the majority-voting ensemble technique. Majority voting is a simple ensemble method usually adopted to obtain better performance than any single model in the ensemble. It works by combining the final classifications of the four ML models (SVM, k-NN, NB, and DT): the votes for each label are summed and the label with the largest number of votes is the final outcome [24]. ML models were trained and tested using Matlab R2020b (Mathworks, Natick, MA, USA).
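A minimal MATLAB sketch of the four base classifiers and the majority-voting step, using the hyper-parameters stated above, could look as follows. `Xtrain`, `ytrain`, and `Xtest` are assumed variable names, the SVM line covers the binary case only (a multi-class SVM would require `fitcecoc`), and the labels are assumed to be categorical so that `mode` can aggregate the votes.

```matlab
% Fit the four base classifiers with the hyper-parameters reported above
% (Xtrain, ytrain, Xtest are assumed names; ytrain is assumed categorical).
svmMdl = fitcsvm(Xtrain, ytrain, 'KernelFunction', 'rbf');  % binary case only
knnMdl = fitcknn(Xtrain, ytrain, 'NumNeighbors', 5, 'Distance', 'seuclidean');
nbMdl  = fitcnb(Xtrain, ytrain);                            % Gaussian densities per predictor
dtMdl  = fitctree(Xtrain, ytrain, 'MaxNumSplits', 10);      % standard CART splits

% Majority-voting ensemble: each model casts one vote per test observation and
% the most frequent label across the four columns is taken as the final outcome.
votes = [predict(svmMdl, Xtest), predict(knnMdl, Xtest), ...
         predict(nbMdl,  Xtest), predict(dtMdl,  Xtest)];
ensemblePred = mode(votes, 2);
```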

Performance Metrics
For the evaluation of the models, we employed two types of stratified cross-validation: Leave-One-Out Cross-Validation (LOOCV) and 10-fold Cross-Validation (10-fold CV) [25]. K-fold cross-validation consists in splitting the dataset into k subsets and iteratively leaving one subset out as a test set while keeping the remaining subsets together as a training set. Leave-one-out is the extreme version of cross-validation in which the number of subsets coincides with the number of samples in the dataset. LOOCV requires fitting and evaluating a model for each sample, which maximizes the computational cost. The main advantage of this technique is its robustness since, at each iteration, the training set is as similar as possible to the full dataset; this allows an unbiased and reliable estimation of performance while avoiding overfitting. Ten-fold CV is a commonly used and less computationally expensive version of cross-validation; for applications with real-world datasets, Kohavi recommends stratified 10-fold cross-validation [26]. Classification performance was measured using Accuracy, Precision, Recall, and F1-Score [27], defined for binary and multi-class tasks as reported in Table 3.

Table 3. Performance metrics (Accuracy, Precision, Recall, and F1-Score), with the corresponding definitions for binary and multi-class classification.
Statistical analysis of the performance metrics was carried out using RStudio Version 4.0.3 (10 October 2020) (RStudio, Boston, MA, USA). Since the variables were not normally distributed, the Kruskal-Wallis (KW) test was employed to compare the performance metrics of the ML algorithms in discriminating 2 and 4 classes of the outcome, respectively [28]. The KW test was also used to investigate the performances achieved with the different feature selection methods. A p-level of <0.05 was used to define significance, followed by post-hoc Dwass-Steel-Critchlow-Fligner pairwise comparisons.
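For completeness, the sketch below shows how a stratified 10-fold evaluation and the derived metrics might be computed in MATLAB for one of the classifiers. `X` and `y` are assumed names, and the macro-averaging of Precision, Recall, and F1 is one common convention, not necessarily the exact definition used in Table 3.

```matlab
% Stratified 10-fold cross-validation of a single classifier (a decision tree here),
% with Accuracy and macro-averaged Precision, Recall and F1 computed from the
% pooled confusion matrix. X is the predictor matrix, y the categorical labels
% (assumed names). For LOOCV, use cvpartition(numel(y), 'LeaveOut') instead.
cvp  = cvpartition(y, 'KFold', 10);     % stratified folds when y is a grouping variable
pred = y;                               % pre-allocate predictions with matching type
for k = 1:cvp.NumTestSets
    mdl = fitctree(X(training(cvp, k), :), y(training(cvp, k)), 'MaxNumSplits', 10);
    pred(test(cvp, k)) = predict(mdl, X(test(cvp, k), :));
end

C        = confusionmat(y, pred);       % rows: true classes, columns: predicted classes
accuracy = sum(diag(C)) / sum(C(:));
precC    = diag(C) ./ sum(C, 1)';       % per-class precision
recC     = diag(C) ./ sum(C, 2);        % per-class recall
f1C      = 2 * precC .* recC ./ (precC + recC);
fprintf('Accuracy %.3f, Precision %.3f, Recall %.3f, F1 %.3f\n', ...
        accuracy, mean(precC), mean(recC), mean(f1C));
```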

Results
No significant differences in performance metrics, for either 2 or 4 classes, were found between the MRMR and Chi-square feature selection methods using the KW test, as reported in Table 4. Moreover, we observed that MRMR achieved the same performances for both the 2- and 4-class outputs, as shown in Figures 1 and 2, with a larger drop among the predictor importance scores and a smaller number of selected features, which helps reduce computational costs. Furthermore, we extracted the correlations between each pair of clinical predictors, as shown in Figure 3, and observed that the Marshall, Entry Diagnosis, CRS-R, RLAS, and DRS scores were the most correlated features. For these reasons, we performed further analyses using MRMR with CRS-R, Age, and ERBI B for binary classification and Entry Diagnosis, Age, and Sex for four-class classification.

After feature selection, we applied the KW test between each ML model and the LM with 10-fold CV. Significant differences were found in both cases (Table 5). Post-hoc Dwass-Steel-Critchlow-Fligner pairwise comparisons among accuracies were used to compare each pair of ML models and identify the best performer. Using two classes of outcome, we observed a significant difference between LM and NB (see Table 6), with the accuracy of NB being lower than that of LM (Figure 4). In the case of four classes of outcome, no significant differences were detected, revealing comparable performances among all models, although a significant loss of accuracy was observed (Table 6 and Figure 5). The same trend was observed for the other ML metrics. Tables 7 and 8 report the performance metrics using LOOCV and 10-fold CV for 2 and 4 classes of the outcome, respectively.

Discussion
In this study, we compared ML approaches with the more traditional LM on contemporary TBI patient data to predict clinical evolution, using 2- and 4-class outcome approaches, respectively. We demonstrated that a classic LM can perform as well as more advanced ML and ensemble ML classifiers in terms of accuracy (sensitivity and specificity) when trained with the same predictors.
The LM had the advantage of identifying prognostic factors, associating each of them with an odds ratio, whereas the use of ML is limited by the difficulty of interpreting the model, often referred to as a 'black box'. This finding is in close agreement with the results recently obtained by Iosa et al. [29], who compared the performance of classical regression, neural networks, and cluster analysis in predicting the outcome of patients with stroke. Similarly, Gravesteijn et al. [30] reached the same conclusions on TBI patients by comparing the performance of logistic regression with that of SVM, random forests, gradient boosting machines, and artificial neural networks. In terms of model performance, our SVM and DT values are similar to those reported by Abujaber et al. [6], whereas k-NN, which had not previously been employed in this clinical domain, outperformed the other ML approaches on all evaluation metrics. The only algorithm that relatively underperformed was NB, whose accuracy (and sensitivity) was somewhat lower when passing from the analyzed sample to the test sample. Our data conflict with those reported by Amorim et al. [7], who described excellent performance of this algorithm as a screening tool for predicting the functional outcome of TBI patients. This discrepancy could be mainly due to the use of different clinical predictors. Indeed, there is large heterogeneity in the factors (i.e., age, gender, clinical severity, clinical comorbidities, systolic blood pressure, respiratory rate, lab values, and presence of mass lesion) identified as having prognostic value in TBI patients, thus making a direct comparison between ML approaches difficult to perform [6]. Another limitation is that the dataset is unbalanced (62.8% negative vs. 37.2% positive outcomes), which could negatively affect the performance of machine learning. To mitigate this issue and increase classification robustness, we also applied LOOCV, which is less affected by this problem and allows us to compare the four machine learning techniques, since each type of algorithm makes predictions differently. For instance, the DT algorithm performs well with unbalanced datasets thanks to splitting rules that take the class variable into account.
Moreover, the type of predictors, such as continuous and categorized (operator-dependent) variables, and the lack of objective, high-dimensional biological data (i.e., neuroimaging, genetics) might also limit the performance of ML techniques applied in this domain [31]. Our data would seem to confirm this hypothesis because the predictors identified for classification change with the task. Indeed, as shown in Figures 4 and 5, moving from the 2- to the 4-class outcome approach changes the most significant features extracted by the predictive models. Apart from age, to reach excellent performance with the 2-class approach, the LM and ML algorithms need the CRS-r and ERBI values at T1, whereas, for the 4-class approach, diagnosis at admission and sex are the most important features.
Finally, this is the first study employing an ensemble ML approach to improve outcome prediction in TBI patients. This approach has been shown to be useful for integrating multiple ML models into a single predictive model characterized by higher robustness, reducing the dispersion of predictions [32]. However, this method did not seem to boost performance except when the four-class approach was employed (Figure 5). Indeed, in our KW analysis, the ensemble ML for two classes reached a high accuracy similar to that of the other ML techniques, of about 84% for LOOCV and 82% for 10-fold CV (Table 7), whereas with the four-class approach (Table 8) a (non-significant) trend in the performance metrics was observed (71.5% for LOOCV and 70.5% for 10-fold CV), five to ten percentage points higher than the other models.

Conclusions
In summary, we found that ML algorithms do not perform better than more traditional regression models in predicting the outcome after TBI. As future work, we plan to perform further external validations of all these models on other datasets that capture dynamic changes in prognosis during the intensive care course, extending the current models with new objective predictors such as neuroimaging data (EEG, PET, fMRI) [33].
Funding: This research received no external funding.
Institutional Review Board Statement: The study was approved by the Ethical Committee of the Central Area Regione Calabria of Catanzaro (Protocol n. 244), according to the Helsinki Declaration.
Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.
Data Availability Statement: The datasets were collected at the S. Anna Institute, Crotone, Italy. All relevant data have been uploaded to a public repository: www.kaggle.com/dataset/6e8c66445ac5b0fda3b3d50cf3a0d1dc4fecb09a3a4e6df19abf98fc0c13a8f3 (accessed on 9 February 2022).

Conflicts of Interest:
The authors declare no conflict of interest.