Predicting Outcome in Patients with Brain Injury: Differences between Machine Learning versus Conventional Statistics

Defining reliable tools for early outcome prediction is a key target for physicians guiding care decisions in patients with brain injury. The application of machine learning (ML) is rapidly increasing in this field of study, but its translation to clinical practice remains poor, largely because of uncertainty about the advantages of this novel technique over traditional approaches. In this review we address the main differences between ML techniques and traditional statistics (such as logistic regression, LR) applied to predicting outcome in patients with stroke and traumatic brain injury (TBI). Thirteen papers directly addressing the differences in performance between ML and LR methods were included in this review. Overall, ML algorithms do not outperform traditional regression approaches for outcome prediction in brain injury. Better performance of specific ML algorithms (such as artificial neural networks) was mainly described in the stroke domain, but the high heterogeneity in the features extracted from low-dimensional clinical data reduces the enthusiasm for applying this powerful method in clinical practice. To better capture and predict the dynamic changes of patients with brain injury during the intensive care course, ML algorithms should be extended to high-dimensional data extracted from neuroimaging (structural and fMRI), EEG and genetics.


Introduction
Brain injury consists of damage to the brain that is not hereditary, congenital, degenerative, or induced by birth trauma or perinatal complications. The injury results in a modification of the brain's neural activity, structure, and functionality, with a consequent loss of cognitive, behavioral, and motor functions. Head trauma, ischemic and hemorrhagic stroke, infections, and brain tumors are among the most common causes of acquired brain injury.
Due to the severe social and economic burden of brain injury, the expectation of long-term outcome is an important factor in clinical practice. This is particularly true after severe traumatic brain injury (TBI), which still presents high mortality and unfavorable outcome rates with severe global disability [1].

Machine Learning Methods
ML is a subfield of Artificial Intelligence, studying the ability of computers to automatically learn from experience and solve specific problems without being explicitly programmed for it. These learning systems can continuously self-improve their performance and increase their efficiency in extracting knowledge from experimental data and analytical observations. ML includes three main approaches that differ in learning technique, type of input data and outcome, and typology of the task to solve: Supervised Learning, Unsupervised Learning and Reinforcement Learning.
Supervised Learning is the most common paradigm of ML, applied when both input variables and output targets are available, and is the most relevant for neurorehabilitation clinicians. This approach consists of algorithms that analyze the mapping function between "input" and "output" variables with the goal of learning how to predict a specified "output" given a set of "input" variables, also called "predictors". Supervised Learning can be broadly divided into two main types: • Classification: where the output variable is made up of a finite set of discrete categories that indicate the class labels of the input data, and the goal is to predict the class labels of new instances starting from a training set of observations with known class labels; • Regression: where the output is a continuous variable, and the goal is to find the mathematical relationship between the input variables and the outcome with a reasonable level of approximation.
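The two supervised-learning tasks above can be illustrated with a minimal sketch (not from the review; synthetic scikit-learn data, with LogisticRegression and LinearRegression standing in as example learners):

```python
# Illustrative sketch: the same supervised-learning workflow applied to a
# classification task (discrete labels) and a regression task (continuous
# output) on synthetic data.
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.model_selection import train_test_split

# Classification: predict a discrete class label from input predictors.
Xc, yc = make_classification(n_samples=200, n_features=5, random_state=0)
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(Xc, yc, random_state=0)
clf = LogisticRegression().fit(Xc_tr, yc_tr)
print("classification accuracy:", clf.score(Xc_te, yc_te))

# Regression: approximate the mathematical input-output relationship.
Xr, yr = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(Xr, yr, random_state=0)
reg = LinearRegression().fit(Xr_tr, yr_tr)
print("regression R^2:", reg.score(Xr_te, yr_te))
```

In both cases the algorithm learns the mapping from a labeled training set and is then evaluated on held-out instances.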
The unsupervised approach is characterized by unlabeled input data. The algorithm explores and models the structure and patterns inherent in the data without the guidance of a labeled training set. Typical applications of Unsupervised Learning are: • Clustering (or unsupervised classification): with the aim of dividing data so that the similarity of instances within the same cluster is maximized and the similarity between different clusters is minimized; • Dimensionality Reduction: where input instances are projected into a new lower-dimensional space.
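Both unsupervised tasks can be sketched in a few lines (an illustrative example, not from the review; k-means and PCA are chosen here as the standard representatives of clustering and dimensionality reduction):

```python
# Illustrative sketch: clustering (k-means) and dimensionality reduction
# (PCA) applied to unlabeled synthetic data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

X, _ = make_blobs(n_samples=300, centers=3, n_features=10, random_state=0)

# Clustering: group instances so within-cluster similarity is maximized.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Dimensionality reduction: project instances into a 2-dimensional space.
X2 = PCA(n_components=2).fit_transform(X)
print("cluster labels found:", sorted(set(labels)))
print("projected shape:", X2.shape)
```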
Reinforcement Learning aims to develop a system, called an agent, that improves through interaction with its environment. In particular, at each iteration the agent receives a reward (or a penalty) based on its action, which measures how good that action is for the desired goal. An exploratory trial-and-error approach is exploited to find the actions that maximize the cumulative reward. Very common applications of Reinforcement Learning are in computer games and robotics.
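The trial-and-error idea can be sketched with a minimal agent (an epsilon-greedy multi-armed bandit; the reward probabilities are hypothetical values invented for this example, not from the review):

```python
# Illustrative sketch: a minimal reinforcement-learning loop. The agent
# learns which of three actions yields the highest average reward purely
# by trial and error, balancing exploration and exploitation.
import random

random.seed(0)
true_reward_prob = [0.2, 0.5, 0.8]   # hypothetical; unknown to the agent
q = [0.0] * 3                        # estimated value of each action
n = [0] * 3                          # times each action was tried
epsilon = 0.1                        # exploration rate

for _ in range(5000):
    # Explore occasionally, otherwise exploit the current best estimate.
    a = random.randrange(3) if random.random() < epsilon else q.index(max(q))
    reward = 1.0 if random.random() < true_reward_prob[a] else 0.0
    n[a] += 1
    q[a] += (reward - q[a]) / n[a]   # incremental mean of observed rewards

print("best action found:", q.index(max(q)))
```

Over many iterations the estimated values converge toward the true reward probabilities, so the agent ends up preferring the most rewarding action.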
Among the wide number of possible machine learning algorithms, some conventional techniques are considered the gold standard for classification problems and have been employed in the studies presented in this review: • LR [17]: the simplest of the classification techniques, mainly used for binary problems. Assuming linear decision boundaries, LR works by applying a logistic function to model a dichotomous output variable: p(y = 1 | x) = 1/(1 + e^-(β0 + β1x1 + ... + βmxm)). This simple model allows a low training time and a low risk of overfitting, but at the same time it may lead to underfitting on complex datasets. For these reasons, LR is suitable for simple clinical datasets such as those related to patients with brain injuries. Ridge Regression and Lasso Regression are distinguished from Ordinary Least Squares Regression by their intent to shrink the predictors by imposing a penalty on the size of the coefficients: Ridge adds the squared L2 norm of the coefficients (λΣj βj²) to the least-squares objective, while Lasso adds the L1 norm (λΣj |βj|). They are therefore particularly useful in the case of big data problems;

• Generalized Linear Models (GLM) [18]: an extension of linear models in which data normality is no longer required; the predicted distribution ŷ is related to a linear combination of the input variables X through the inverse link function h, i.e., ŷ = h(Xw). Moreover, the unit deviance d of the chosen exponential dispersion model (EDM) is used instead of the squared loss function; • Support Vector Machine (SVM) [19]: it applies a kernel function to map the available data into a higher-dimensional feature space where they can be easily separated by an optimal classification hyperplane; • k-Nearest Neighbors (k-NN) [20]: it assigns the class of each instance by majority voting among its k nearest neighbors. This approach is very simple but requires some non-trivial choices, such as the value of k and the distance metric. The standardized Euclidean distance is one of the most used, because neighbors are weighted by the inverse of their distance: d(q, x) = √(Σj (qj − xj)²/σj²), where q is the query instance, x is an observation of the sample and σj is the standard deviation of the j-th variable; • Naïve Bayes (NB) [21]: based on Bayes' Theorem, it assigns to each instance the class with the highest posterior probability, applying density estimation and assuming independence of the predictors; • Decision Tree (DT) [22]: a tree-like model that classifies each instance through a sequence of cascading tests from the root node to a leaf node. Each internal node is a test on a specific variable, each branch descending from that node is one of the possible outcomes of the test, and each leaf node corresponds to a class label. At each node, the Information Gain function is maximized to select the best split variable A: Gain(A) = I − Ires(A), where I represents the information needed to classify the instance and is given by the entropy measure I = −Σc p(c) log2 p(c), with p(c) equal to the proportion of examples of class c.
Ires is the residual information needed after the selection of variable A: Ires(A) = Σv p(v) · (−Σc p(c|v) log2 p(c|v)), where v ranges over the values taken by A; • A common technique employed to enhance models' robustness and generalizability is the ensemble method [23-26], which combines the predictions of many base estimators. The aggregation can be done with the Bootstrap Aggregation (Bagging) technique, averaging several trees trained on subsets of the original dataset (as in the case of Random Forests (RF)), or with the Boosting technique, applying the single estimators sequentially and giving higher importance to the samples that were incorrectly classified by the previous trees (as in the AdaBoost algorithm); • Artificial Neural Networks (ANNs) [27]: a group of machine learning algorithms inspired by the way the human brain performs a particular learning task. In particular, neural networks consist of simple computational units called neurons connected by links representing synapses, which are characterized by weights used to store information during the training phase. A standard ANN architecture is composed of an input layer whose neurons represent the input variables {x1, x2, . . . , xm}, a certain number of hidden layers for intermediate calculations, and an output layer that converts the received values into outputs. Each internal node transforms the values from the previous layer using a weighted linear summation (u = w1x1 + w2x2 + . . . + wmxm), followed by a non-linear activation function (y = φ(u + b)) such as the step, sign, sigmoid or hyperbolic tangent functions. The learning process is performed through the backpropagation algorithm, which computes the error term at the output layer and then propagates it back to the previous layers, updating the weights. This process is repeated until a certain stop criterion, or a certain number of epochs, is reached.
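A minimal sketch of how the algorithms listed above can be compared on a classification problem (illustrative only; synthetic data, scikit-learn implementations, and default hyperparameters, not the datasets or settings of the reviewed studies):

```python
# Illustrative sketch: five-fold cross-validated accuracy of the listed
# classifiers on a synthetic binary dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
    "NB": GaussianNB(),
    "DT": DecisionTreeClassifier(random_state=0),
    "RF (bagging)": RandomForestClassifier(random_state=0),
    "AdaBoost (boosting)": AdaBoostClassifier(random_state=0),
    "ANN": MLPClassifier(hidden_layer_sizes=(20,), max_iter=2000,
                         random_state=0),
}

# Mean accuracy over five folds for each algorithm.
results = {name: cross_val_score(model, X, y, cv=5).mean()
           for name, model in models.items()}
for name, acc in results.items():
    print(f"{name:>20}: {acc:.3f}")
```

Cross-validation of this kind is the usual way such head-to-head comparisons are made, since a single train/test split can favor one algorithm by chance.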

Predicting Outcome: Conventional Statistics versus Machine Learning
The recent literature incorporating ML in the neurorehabilitation field raises a natural question: what is the innovation compared with conventional statistical techniques such as linear regression or LR? On the one hand, traditional statistics have long been used for regression and classification tasks and can likewise determine a relationship between input and output. Some authors even claim that both linear regression and LR are themselves ML techniques, although some important distinctions need to be made between classical statistical learning and ML (Table 1). Statistical methods are top-down approaches: it is assumed that we know the model from which the data have been generated (an underlying assumption of techniques like linear regression and LR), and the unknown parameters of this model are then estimated from the data. The potential drawback is that the link between input and output is user-chosen and could result in a suboptimal prediction model if the actual input-output association is not well represented by the selected model. This may occur if a user chooses LR even though the relationship between input and output is non-linear, or when many input variables are involved.
ML methods, conversely, are bottom-up approaches. No particular model is assumed; starting from a dataset, an algorithm develops a model with prediction as the main goal. Generally, the resulting models are complex, and some parameters cannot be directly estimated from the data. In this case, the common procedure is to choose the best parameters either from previous relevant studies or by tuning them during training to give the best prediction. ML algorithms can handle a larger number of variables than traditional statistical methods, but they also require larger sample sizes to predict the outcome with greater accuracy.
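The "tuning during training" step described above is typically done by cross-validated search over candidate parameter values. A minimal sketch (illustrative; the parameter grid and random forest are arbitrary choices, not from the review):

```python
# Illustrative sketch: hyperparameters that cannot be estimated directly
# from the data are chosen by cross-validated grid search.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=5,  # each candidate is scored by five-fold cross-validation
)
grid.fit(X, y)
print("best parameters:", grid.best_params_)
print("best CV accuracy:", round(grid.best_score_, 3))
```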
A potential limit of ML is that repeating the analysis may lead to slightly different results. The reliability of classical statistics is mainly related to the sampling process: the same data lead to the same results regardless of how many times the same analysis is applied, and the uncertainty is simply related to the fact that the sample was randomly extracted from the population. Techniques such as split-half, parallel form or bootstrap analysis have been introduced to retest the reliability of results across resampled data. In ML, there is often an over-dimensioned system that could provide the same level of accuracy in predicting the outcome in different ways, that is, associating different weights with each variable even when the same model is applied to the same data sample. In a recent study, the importance associated with factors influencing harmonic walking in patients with stroke was found to have a variability going from 6% (for the iliopsoas maximum force) up to 37% (for the patient's gender) [28].
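This repeatability issue is easy to reproduce in a sketch (illustrative only; a random forest on synthetic data stands in for the models discussed): refitting the same model family on the same data with different random initializations yields noticeably different variable importances.

```python
# Illustrative sketch: the same model refit on the same data with ten
# different random seeds assigns different importances to each variable.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=6, random_state=0)

importances = np.array([
    RandomForestClassifier(random_state=seed).fit(X, y).feature_importances_
    for seed in range(10)
])

# Spread of each feature's importance across the ten repetitions.
for i, (lo, hi) in enumerate(zip(importances.min(0), importances.max(0))):
    print(f"feature {i}: importance ranges from {lo:.3f} to {hi:.3f}")
```

Reporting such a spread, rather than a single run, is one simple way to assess the reliability of the extracted prognostic factors.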

Conventional Statistics versus Machine Learning Methods in TBI Patients
In the last few years, seven papers have been published comparing the performance of regression models with that of ML in extracting the best clinical indicators of outcome in TBI patients (Table 2). Amorim and colleagues were the first to apply ML approaches, in 517 patients with TBI of varying severity. A large amount of demographic (gender, age), clinical (pupil reactivity at admission, GCS, presence of hypoxia and hypotension, computed tomography findings, trauma severity score) and laboratory data were used as predictors. Using a mixed ML classification model, they found that the Naïve Bayes algorithm had the best predictive performance (90% accuracy), followed by a Bayesian generalized linear model (88% accuracy), when mortality was used as the outcome. The most important variables used by the ML models for prediction were (a) age, (b) Glasgow motor score, (c) prehospital GCS and (d) GCS at admission. In this paper, linear regression analysis was directly merged into the ML models in order to improve prediction performance. Following a similar multimodal approach, in which a series of ML algorithms were used individually and finally pooled into an ensemble model to evaluate performance with respect to the LR approach, our group demonstrated high but similar performance among methods [29]. Indeed, we found similar performance between LR (82%) and ML (85%) algorithms when a two-class outcome approach (positive vs. negative Glasgow Outcome Scale-Extended (GOS-E) measures) was used. Age, CRS-R, Early Rehabilitation Barthel Index (ERBI) and entry diagnosis were the best features for classification. Tunthanathip et al. evaluated the performance of several supervised algorithms (SVM, ANNs, RF, NB, k-NN) compared to LR in a wide population of pediatric TBI.
Unlike other studies, here the traditional binary LR was performed with a backward elimination procedure for extracting the best prognostic factors for classification (GCS, hypotension, pupillary light reflex and sub-arachnoid hemorrhage). The authors found that the SVM was the best algorithm to predict outcomes (accuracy: 94%). Gravesteijn et al. [30], instead, directly compared LR with a series of ML algorithms (SVM, RF, gradient boosting machines and ANNs) to predict outcomes in more than 11,000 TBI patients. All statistical methods showed the same performance in predicting mortality or unfavorable outcomes (ranging from 79% to 82%), with the RF algorithm performing worst. Similarly, Nourelahi et al. [31] described the same results evaluating 2381 TBI patients. Despite employing only SVM and RF for the ML analysis, they reached an accuracy of 79% in predicting post-trauma survival status, with Glasgow Coma Scale motor response, pupillary reactivity and age as the best extracted features. Similarly, Eftekhar et al. [32] used only the artificial neural network (ANN) algorithm to evaluate prediction performance with respect to the LR model. The ANN was able to predict the mortality of TBI patients in almost all cases (95% accuracy), although this performance was lower than that of LR (96%). Finally, following a single-algorithm ML approach, Chong et al. used a neural network to evaluate the predictive accuracy of different clinical data (i.e., presence of seizure, confusion, clinical signs of skull fracture). Evaluating data from a very small sample of TBI patients, they reported high but similar performance between LR and ML approaches (93% versus 98%), indicating as best features a list of previously unreported clinical variables.
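The backward elimination procedure described for binary LR can be sketched as follows (illustrative only; this uses scikit-learn's sequential backward selection on synthetic data as a close analogue of p-value-based backward elimination, with anonymous feature indices rather than the study's clinical variables):

```python
# Illustrative sketch: backward feature elimination for a binary LR model,
# retaining the most useful predictors. Data and feature count are synthetic.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

selector = SequentialFeatureSelector(
    LogisticRegression(max_iter=1000),
    n_features_to_select=4,
    direction="backward",  # start from all features, drop the least useful
    cv=5,
)
selector.fit(X, y)
print("selected feature indices:", list(selector.get_support(indices=True)))
```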

Conventional Statistics versus Machine Learning Methods in Stroke Patients
As shown in Table 3, six studies on patients with stroke have been included in this review because they compare the results of ML algorithms (ANN) with those of conventional regression analysis. The total number of patients included in these studies was very high (5346), going from 33 patients up to 2522. There was a wide variety of investigated outcomes, ranging from return to work to death, and an even wider variety of assessed independent variables. The accuracy of ANN ranged between 74% and 93.9%, greatly depending also on the chosen method of analysis. The accuracy of conventional regression analysis was generally lower, ranging from 40% to 85%. In five of these six studies, the ANN produced a more accurate prediction than conventional regression [28,36-40]. The unique exception was the study conducted on a large sample (2522 patients), in which the accuracy of ANN was slightly inferior (74% vs. 76.6%) [39]. Conversely, wider differences in favor of ANN were found in the two studies with the smaller sample sizes having the functional status of patients at discharge as the outcome [28,36]. When different types of ANNs were compared, the Deep Neural Network [38] and the k-Nearest Neighbors [40] showed more accurate performance. The features extracted by the models were widely variable among studies, leading to very different results, with some prognostic factors already well known in the literature, such as older age [37,39].

Discussion
Given the similarity in performance reported for LR and ML approaches and the large heterogeneity in the best features extracted, the main conclusion of this review is that ML does not confer substantial advantages over LR methodologies in predicting the outcome of TBI or stroke patients (Figure 1). Qualitative evaluation of the results suggested a trend towards better performance of ML algorithms in stroke patients with respect to LR. However, without a quantitative comparison (i.e., a benchmark analysis), a definitive conclusion cannot be drawn.
Despite using similar means, the boundary between LR (statistical inference) and ML is subject to debate. LR has the advantage of identifying relationships between prognostic factors, associating each of them with an odds ratio, while the use of ML is limited by the difficulty of interpreting the model, often used like a 'black box' to obtain the best performance on a specific test set. The most important advantage of ML algorithms is their capacity to perform non-linear predictions of the outcome without requiring statistical assumptions such as independence of observations and absence of multicollinearity. However, this common high non-linearity of the classification problem implies that the direction of the effect of each input cannot be easily recognized [41]. An issue poorly investigated is the repeatability of the results obtained with ML. The two most important psychometric properties of a test are validity and reliability. The high accuracy found in the above-reported studies could be seen as proof of the validity of the ML approach. However, most of these studies identified specific prognostic factors but did not test the reliability of these findings when the ML analysis was repeatedly applied. A recent study on prognostic factors related to walking ability in patients with stroke showed that the variability in the weight of each factor among 10 applications of an ANN analysis ranged from 6% up to 37%. The authors also reported that reliability was lower for the factors with reduced weight and higher for the most important factors [28]. This study highlights the need to assess not only the accuracy, and hence the validity, of ML algorithms, but also their reliability [28].
Figure 1. Accuracy (%) of outcome prediction in traumatic brain injury (TBI) and stroke patients for the considered studies (n: sample size). Machine learning approach (black bars) versus linear regression (grey bars) [17-29].
For stroke patients, the accuracy of ML algorithms in predicting the outcome ranged from 74% to 95%. It is often reported that ANNs require large samples to achieve good accuracy; however, the two studies among the six reviewed with the smaller samples showed higher accuracy of ANN with respect to conventional regression analysis [28,36], probably suggesting that regression also (or even more so) needs large samples to obtain solid results. It is important to also cite some other studies not reported in Table 2: for example, the 100% accuracy of ANN for Wolf Motor Function Test scores in George et al. [41] was very different from the 30% for the Functional Independence Measure in Sale et al. [42], although in the latter study the Barthel Index score was predicted with an accuracy of 84%. The ANN is the most common type of ML algorithm used in stroke, conceivably for its higher simplicity with respect to other types, followed by support vector machines and random forest algorithms. Some studies compared the performances of different algorithms. It should be noted that the literature also reports papers in which the accuracy of ML was not compared with that of conventional statistical methods: Oczkowski and Barreca reported an accuracy of 88% on 147 patients with stroke [43]; George et al. reported an accuracy of 100% [41]; Sale et al. [42], on 55 patients, an accuracy of 84% for predicting the Barthel Index score and of 30% for the Functional Independence Measure; Thakkar et al. [44], on 239 patients, an accuracy between 81% and 85%; Liu et al. [45], on 56 patients, 88-94%; Billot et al. [46], on 55 patients, an accuracy between 84% and 93%; and the study by Xie et al. [47], on a wide sample of patients (n = 512), an area under the ROC curve of 0.75. Some other studies also reported results obtained by regression analysis but without comparing its accuracy with that of ML [48,49].
Other studies included wide samples of patients [37-39], but the accuracy was not lower in studies including small samples [41,50]. Conversely, the largest study was conducted on 2522 patients, divided into 1522 patients to train an artificial neural network and 1000 patients to test its predictive capacity; it reported an accuracy of only 74%, lower than that obtained on the same samples with conventional statistics (linear regression and cluster analysis) [39].
In TBI patients the picture is similar, with accuracy ranging from 78% to 98% and no evidence of a best ML algorithm. Indeed, whether considering works using mixed or ensemble ML models [30,31,33,34] or those using one single algorithm [32,35], the result is the same: no evidence for a best ML algorithm and no substantial difference with respect to the LR approach.

Conclusions
ML algorithms do not perform better than more traditional regression models in predicting the outcome of TBI or stroke patients. Although ML has been demonstrated to be a powerful tool for capturing complex non-linear functional dependencies in several neurological domains [51,52], the state of the art in the TBI and stroke domains does not confirm this advantage. This could depend on the type of predictors employed in several studies, such as continuous and categorized (operator-dependent) variables (i.e., clinical scales, radiological metrics). Moreover, ML has demonstrated its value when trained on high-dimensional and complex data extracted from neuroimaging (structural and fMRI), EEG and genetics. Future works are needed to better capture changes in prognosis during the intensive care course, extending the current "black-box" or "static" approaches (data extracted only at admission and discharge) into a new era of mixed dynamic mathematical models [53].