Predicting Survival for Veno-Arterial ECMO Using Conditional Inference Trees—A Multicenter Study

Background: Despite increasing use and understanding of the process, veno-arterial extracorporeal membrane oxygenation (VA-ECMO) therapy is still associated with considerable mortality. Personalized and quick survival predictions using machine learning methods can assist in clinical decision making before ECMO insertion. Methods: This is a multicenter study to develop and validate an easy-to-use prognostic model to predict in-hospital mortality of VA-ECMO therapy, using unbiased recursive partitioning with conditional inference trees. We compared two sets with different numbers of variables (small and comprehensive), all of which were available just before ECMO initiation. The area under the curve (AUC), the cross-validated Brier score, and the error rate were applied to assess model performance. Data were collected retrospectively between 2007 and 2019. Results: 837 patients were eligible for this study; 679 patients in the derivation cohort (median (IQR) age 60 (49 to 69) years; 187 (28%) female patients) and a total of 158 patients in two external validation cohorts (median (IQR) age 57 (49 to 65) and 70 (63 to 76) years). For the small data set, the model showed a cross-validated error rate of 35.79% and an AUC of 0.70 (95% confidence interval from 0.66 to 0.74). For the comprehensive data set, the error rate was practically the same at 35.35%, with an AUC of 0.71 (95% confidence interval from 0.67 to 0.75). The mean Brier scores of the two models were 0.210 (small data set) and 0.211 (comprehensive data set). External validation showed an error rate of 43% and an AUC of 0.60 (95% confidence interval from 0.52 to 0.69) using the small tree, and an error rate of 35% with an AUC of 0.63 (95% confidence interval from 0.54 to 0.72) using the comprehensive tree. There were large differences between the two validation sets. Conclusions: Conditional inference trees are able to augment prognostic clinical decision making for patients undergoing ECMO treatment.
They may provide a degree of accuracy in mortality prediction and prognostic stratification using readily available variables.


Introduction
Veno-arterial extracorporeal membrane oxygenation (VA-ECMO) represents the ultimate treatment for cardiopulmonary failure [1,2]. Despite its growing adoption and an increased understanding of it, this therapy is still associated with significant morbidity and resource utilization, with notably high mortality rates [3,4]. Precise and personalized survival prediction based on data available prior to insertion has the potential to enhance clinical decision making and ultimately improve patient outcomes.
Numerous studies have explored predictive factors linked to the outcomes of VA-ECMO [5][6][7][8][9][10][11]. However, effective application of these factors to accurately predict survival in clinical practice presents challenges, as it involves data collection and score calculation immediately before the ECMO procedure. This is especially challenging, if not impossible, if the calculation of the score necessitates the use of a computer and the knowledge of how to correctly calculate it; having both the equipment and the expertise to use it is not realistic in a clinical emergency setting. A study by Schrutka et al. revealed that scoring systems for outcome predictions in patients undergoing ECMO after cardiovascular surgery had insufficient discriminatory power, except for the Simplified Acute Physiology Score (SAPS II) and the Survival After Venoarterial ECMO (SAVE) score [12]. Both scores, the SAVE [9] and the SAPS II [13], were developed using multiple regression models, but involve a large number of variables that are not always routinely collected, again rendering them hard to use if time is scarce. So far, only a limited number of publications have addressed the application of advanced statistical principles in order to easily and quickly predict personalized outcomes in critically ill patients receiving ECMO therapy [14,15].
The aim of the present study was to use the machine learning technique of unbiased recursive partitioning, based on conditional inference trees, to develop and validate an easy-to-use, pencil-and-paper prognostic model that can predict the probability of patient survival before the initiation of ECMO therapy. Such a prognostic model should assist clinicians in making decisions prior to ECMO implantation. Since such decisions often have to be made in time-critical situations, an additional goal was to provide an algorithm with concise variables typically available in the resuscitation room. We compared two sets with different numbers of variables, all of which were available just before ECMO initiation.

Study Design
The derivation cohort was based on a retrospective ECMO registry of the University Hospital Zurich, Switzerland, a tertiary care referral hospital [16]. Adult patients treated with VA-ECMO between January 2007 and December 2019 with complete follow-up were included. Exclusion criteria were an age of under 18 years and documented refusal of consent. Veno-venous (VV) and hybrid ECLS were also excluded.
The anonymized validation cohorts were retrospectively derived from two independent ECMO centers of the University Hospitals in Frankfurt and Würzburg. The study was reviewed and the requirement for written informed consent was waived by the Cantonal Ethics Commission of Zurich, Switzerland (BASEC No. 2019-01926). This study follows the TRIPOD (Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis) reporting policy.

Data Collection
Data were collected through a retrospective review of patient records (e.g., medical history, laboratory values, and survival status) and direct export from the clinical information system (e.g., age and sex). The laboratory value closest to VA-ECMO insertion was recorded, with a maximum tolerated interval of 4 h for blood gas analyses and 24 h for other laboratory values.
Only variables available prior to VA-ECMO initiation were included and categorized into two sets. First, a simple data set of limited variables immediately available in the emergency situation was defined to see if these variables could provide relatively reliable predictions (small data set, Table 1). Second, a more comprehensive set of variables was assembled (Supplementary Table S1). We grouped the indications for the VA-ECMO therapy into four categories according to current literature [2,17,18]: postcardiotomy, cardiopulmonary resuscitation, refractory cardiogenic shock, and other. The category "other" included ECMO indications for lung transplantation and expansive thoracic surgery.

Model Development
All analyses were performed in R, version 4.0.5. We used unbiased recursive partitioning based on conditional inference trees to derive the desired decision algorithm. The idea behind this statistical method is based on machine learning, and can be seen as a data-driven approach to find a set of subgroups that are as homogeneous as possible with respect to the clinical outcome of interest.
The method works as follows: Two distinct steps are executed alternately and iteratively until a predefined stopping criterion is reached. In the first step, a test is conducted for each potential predictor out of the set of candidate variables to evaluate its influence on the outcome variable. The resulting p-values are corrected for multiple testing. The variable with the strongest association is then selected. If the null hypothesis cannot be rejected for any of the candidate variables, the algorithm is terminated and it is concluded that there is no variable that is sufficiently associated with the outcome.
The second step aims to find the best possible split for the variable selected in step one, i.e., the value that leads to two subgroups that are as distinct as possible with respect to the outcome. This is done by evaluating all possible dichotomous splits to maximize the differences between the two resulting subgroups.
Those two steps are repeated until no additional associated predictor can be found, or until the a priori stopping criterion is reached. Different kinds of criteria can be defined, for example, a minimal number of individuals in each node. If a split would lead to even smaller nodes, the algorithm terminates. This method has been described and discussed extensively [19][20][21] and has been applied in other clinical contexts [22,23].
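As a rough illustration of the two alternating steps described above, the following Python sketch selects the most strongly associated variable with Bonferroni-corrected p-values and then searches all dichotomous cut-points. It is a deliberately simplified stand-in, not the R implementation used for the analyses: a normal approximation to a two-sample test replaces the permutation tests of conditional inference trees, and only numeric predictors are handled.

```python
import math
import statistics

def p_value_assoc(x, y):
    # Two-sided p-value for the association between a numeric predictor x
    # and a binary outcome y, via a normal approximation to the two-sample
    # t statistic (a simplified stand-in for the permutation tests used by
    # conditional inference trees).
    g0 = [xi for xi, yi in zip(x, y) if yi == 0]
    g1 = [xi for xi, yi in zip(x, y) if yi == 1]
    if len(g0) < 2 or len(g1) < 2:
        return 1.0
    se = math.sqrt(statistics.variance(g0) / len(g0)
                   + statistics.variance(g1) / len(g1))
    if se == 0:
        return 1.0
    z = (statistics.mean(g1) - statistics.mean(g0)) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def grow_tree(X, y, names, alpha=0.05, min_node=20):
    # Step 1: test every candidate variable, Bonferroni-correct the
    # p-values, and pick the strongest association; stop if none is
    # significant or the node would become too small.
    leaf = {"n": len(y), "p_death": sum(y) / len(y)}
    if len(y) < 2 * min_node:
        return leaf
    pvals = [min(1.0, len(names) * p_value_assoc([r[j] for r in X], y))
             for j in range(len(names))]
    j = min(range(len(names)), key=lambda k: pvals[k])
    if pvals[j] > alpha:          # global null hypothesis not rejected
        return leaf
    # Step 2: among all dichotomous splits of the selected variable,
    # choose the one maximizing the difference in outcome proportions.
    best_gap, cut = -1.0, None
    for c in sorted(set(r[j] for r in X))[:-1]:
        left = [yi for r, yi in zip(X, y) if r[j] <= c]
        right = [yi for r, yi in zip(X, y) if r[j] > c]
        if len(left) < min_node or len(right) < min_node:
            continue
        gap = abs(sum(left) / len(left) - sum(right) / len(right))
        if gap > best_gap:
            best_gap, cut = gap, c
    if cut is None:
        return leaf
    li = [i for i, r in enumerate(X) if r[j] <= cut]
    ri = [i for i, r in enumerate(X) if r[j] > cut]
    return {"var": names[j], "cut": cut,
            "left": grow_tree([X[i] for i in li], [y[i] for i in li],
                              names, alpha, min_node),
            "right": grow_tree([X[i] for i in ri], [y[i] for i in ri],
                               names, alpha, min_node)}
```

Because the variable is selected by a hypothesis test before any split is searched, the procedure avoids the selection bias towards variables with many possible cut-points that exhaustive-search trees suffer from.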
An alternative to conditional inference trees is the use of so-called conditional inference forests [19,[24][25][26], which we calculated to evaluate a potential loss of information in the conditional inference trees. When using this method, an ensemble of several classification trees is calculated based on many random draws from the data set. Predictions are obtained using the mean or majority prediction of the single trees. This usually leads to considerably higher prediction accuracy, but cannot be done by hand using an easily understandable pencil-and-paper system.
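The ensemble principle behind such forests — fit many learners on bootstrap resamples of the data and aggregate their votes — can be sketched as follows. For brevity, one-split "stumps" stand in for full conditional inference trees; the aggregation logic is the same.

```python
import random

def fit_stump(rows, y):
    # One-split decision "stump": pick the (feature, cut-point) pair that
    # minimizes misclassification of the binary outcome, then predict the
    # majority label on each side of the cut.
    best = None
    for j in range(len(rows[0])):
        for cut in sorted(set(r[j] for r in rows))[:-1]:
            left = [yi for r, yi in zip(rows, y) if r[j] <= cut]
            right = [yi for r, yi in zip(rows, y) if r[j] > cut]
            pl = round(sum(left) / len(left))     # majority label, left
            pr = round(sum(right) / len(right))   # majority label, right
            errs = (sum(yi != pl for yi in left)
                    + sum(yi != pr for yi in right))
            if best is None or errs < best[0]:
                best = (errs, j, cut, pl, pr)
    _, j, cut, pl, pr = best
    return lambda r: pl if r[j] <= cut else pr

def forest_predict(rows, y, new_row, n_trees=100, seed=1):
    # Bagging: fit each learner on a bootstrap resample of the data and
    # return the majority vote of the ensemble for new_row.
    rng = random.Random(seed)
    votes = 0
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in range(len(rows))]
        stump = fit_stump([rows[i] for i in idx], [y[i] for i in idx])
        votes += stump(new_row)
    return 1 if votes > n_trees / 2 else 0
```

The sketch makes the trade-off visible: the final prediction is the vote of a hundred fitted models, which cannot be reproduced with pencil and paper the way a single tree can.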
Thus, in total, four different classification models are developed and compared: one using the small data set with data that are immediately available before ECMO implantation, and one using the more comprehensive data set described above. For both data sets, a conditional inference tree and a random conditional inference forest are calculated.

Model Validation
In order to validate the binary predictions (death or survival) obtained from the two different methods and sets of variables, we calculate the receiver operating characteristic (ROC) curve along with the associated AUC, the error rate (i.e., the percentage of wrong predictions), and the Brier score [27]. The Brier score is defined as the mean squared difference between the predicted probability of the binary outcome and the outcome actually observed. It is a strictly proper scoring rule that is widely used to compare prediction models.
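The three validation measures can be written down compactly; the following minimal Python sketch computes the error rate, the Brier score, and the AUC via its rank-based (Mann-Whitney) formulation.

```python
def error_rate(y_true, y_pred):
    # Fraction of wrong binary predictions.
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)

def brier_score(y_true, p_hat):
    # Mean squared difference between the predicted probability and the
    # observed binary outcome; a strictly proper scoring rule
    # (lower is better).
    return sum((p - t) ** 2 for t, p in zip(y_true, p_hat)) / len(y_true)

def auc(y_true, p_hat):
    # AUC as the probability that a randomly chosen case with the event
    # (here: death, coded 1) receives a higher predicted risk than a
    # randomly chosen case without it, counting ties as 1/2.
    pos = [p for t, p in zip(y_true, p_hat) if t == 1]
    neg = [p for t, p in zip(y_true, p_hat) if t == 0]
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))
```

Note that the error rate scores the dichotomized predictions, whereas the Brier score and the AUC use the predicted probabilities themselves, which is why the three measures can rank models differently.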
Calculating these measures on the same data set that was used to develop the prediction models leads to overly optimistic results, which is why we use 10-fold cross-validation instead. This means that the data set is randomly partitioned into 10 subsets (folds). In each of ten runs, one fold is set aside before the calculation of the trees and forests, respectively, leading to ten trees that can differ because the training set differs in each run. The fold that was set aside before calculating its respective tree is then used for validating the predictions from the current run. The results of this cross-validation procedure are given as a summary of the ten runs. We do not use leave-one-out cross-validation, because taking out just one observation at a time leads to no relevant changes in the model derived from the data set. Note that, for pragmatic reasons, all confidence intervals in the cross-validation are calculated using the usual formulae for proportions, means, and AUCs, although there is some debate in the literature about whether this is appropriate, due to dependencies in the data [28,29].
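The cross-validation scheme described above can be sketched generically. In the sketch below, `fit` and `predict` are placeholders for any model-building and prediction routine (here a tree-growing function would be plugged in); each observation's last element is assumed to be its outcome label.

```python
import random

def k_fold_cv(data, fit, predict, k=10, seed=42):
    # Generic k-fold cross-validation: shuffle the indices, split them
    # into k folds, train on k-1 folds, and score the held-out fold.
    # Returns the per-fold error rates.
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errors = []
    for held_out in folds:
        ho = set(held_out)
        train = [data[i] for i in idx if i not in ho]
        model = fit(train)                      # refit on 9 folds
        test = [data[i] for i in held_out]      # validate on the 10th
        wrong = sum(predict(model, row) != row[-1] for row in test)
        errors.append(wrong / len(test))
    return errors
```

Summarizing the ten per-fold error rates (e.g., by their mean) yields the cross-validated error rate reported in the Results.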
In addition to internal validation, where the original data set is used, we also perform external validation. This means that we apply the predictions from the model that was calculated based on the whole data set and validate its predictions in two completely new data sets. They stem from the Department of Anesthesiology, Intensive Care, Emergency and Pain Medicine, University Hospital Würzburg, Germany; and the Department of Anesthesiology, Intensive Care Medicine and Pain Therapy, University Hospital Frankfurt, Goethe University, Germany (Table 2). These data sets contain only the variables that were selected for the conditional inference trees. Based on the two trees, the predictions are calculated for both data sets separately and taken together. This allows for an assessment of the predictive abilities of the two trees in an external setting.

Results
A descriptive summary of the variables that were included in the small and the comprehensive data sets is given in Table 1 and Table S1. Figures 1 and 2 show the two classification trees that were obtained for the two data sets. Each terminal node shows the survival probability for patients in this node, along with the total number of patients who ended up in this node. For the small data set, lactate, ECMO indication, and age were selected for the splits. In the comprehensive data set, lactate was selected for the first split, along with an additional lactate split as well as eGFR, ECMO indication, albumin, and alkaline phosphatase in the following nodes.

A closer look at the two plots shows quite distinct predictions for the different paths: In the small data set, the outcomes of the outer nodes 3 and 11 are very clear, resulting in less than 10% deaths and more than 90% deaths, respectively. With respect to the remaining nodes, the predictions are not unambiguous, especially for nodes 5 and 10, with around 40% deaths in both cases.
A very similar situation can be seen in the tree from the comprehensive data set: several nodes show an outcome that is almost certain (nodes 4, 8, and 12), or at least more likely (nodes 7 and 10), whereas in two of the nodes no clear prediction is possible (nodes 5 and 13).
Ten-fold cross-validation of the binary predictions with conditional inference trees showed a cross-validated error rate of 35.8% (95% confidence interval from 32.3% to 39.5%) in the small data set. This means that in about 36% of the cases the prediction of death or survival was wrong, whereas the prediction matched the actually observed result in about 64% of the cases when the classification tree in Figure 1 was used. The associated area under the receiver operating characteristic (ROC) curve (AUROC) was 0.70 (95% confidence interval from 0.66 to 0.74). In the case of the comprehensive data set (Figure 2), the cross-validated error rate was practically the same, with a value of 35.4% (95% confidence interval from 31.8% to 39.0%), and the AUC had a value of 0.71 (95% confidence interval from 0.67 to 0.75). The cross-validated mean Brier scores of the two trees were 0.210 (95% confidence interval from 0.208 to 0.212) in the small data set and 0.211 (95% confidence interval from 0.209 to 0.212) for the comprehensive data set. Calibration plots for both data sets are shown in Figure 3. Satisfactory calibration can be seen for both trees.
Ten-fold cross-validation with forests showed a cross-validated error rate of 32.3% (95% confidence interval from 28.9% to 35.9%), an AUC of 0.75 (95% confidence interval from 0.72 to 0.79), and a mean Brier score of 0.199 (95% confidence interval from 0.198 to 0.201) in the small data set and of 32.0% (95% confidence interval from 28.6% to 35.6%), 0.76 (95% confidence interval from 0.72 to 0.79), and 0.198 (95% confidence interval from 0.197 to 0.200), respectively, for the comprehensive data set.

Table 2 shows the characteristics of the two validation data sets from Frankfurt and Würzburg. The external validation resulted in an error rate of 43.0% (95% confidence interval from 35.6% to 50.8%) in the combined data set using the small tree and of 35.3% (95% confidence interval from 28.1% to 43.3%) using the comprehensive tree. This pattern is also reflected in the ROC curves (Figure 4): the predictions from the small tree result in an AUC of 0.60 (95% confidence interval from 0.52 to 0.69), and the AUC from the comprehensive tree is 0.63 (95% confidence interval from 0.54 to 0.72).
Separate external validation for the two hospitals shows differences between them: predicting survival from the Würzburg data only resulted in an error rate of 29.8% (95% confidence interval from 19.5% to 42.7%) and an AUC of 0.73 (95% confidence interval from 0.60 to 0.86) for the small tree, and an error rate of 24.1% (95% confidence interval from 14.6% to 36.9%) and an AUC of 0.74 (95% confidence interval from 0.60 to 0.89) for the comprehensive tree.
The error rate for the small tree from the data from Frankfurt was 50.5% (95% confidence interval from 41.0% to 60.0%) and the AUC was 0.52 (95% confidence interval from 0.41 to 0.63), while the external predictions from the comprehensive tree resulted in an error rate of 41.7% (95% confidence interval from 32.3% to 51.7%) and an AUC of 0.56 (95% confidence interval from 0.45 to 0.67).

Discussion
This study demonstrates a machine learning predictive model for in-hospital mortality in patients receiving VA-ECMO. In approximately 64% to 68% of cases, the prediction based on pre-ECMO variables matched the observed outcome, and the corresponding area under the receiver operating characteristic (ROC) curve (AUROC) values were 0.70 and 0.71. This compares favorably with the already established SAVE (Survival After Venoarterial ECMO) score, which has an AUROC of 0.68 (95% CI 0.64 to 0.71).
There were several different ECMO indications in our data set, which makes it relatively heterogeneous. However, this represents the typical clinical setting and makes the proposed algorithm more usable than if it were proposed for a very specific indication only. Despite the heterogeneity, it performed well, which might be due in part to the larger sample size of the training set.

As the review by Eric J. Topol [30] notes, the use of artificial intelligence is beginning to have an impact on predicting clinical outcomes that would be useful to healthcare systems. In the current literature, two studies report on applying machine learning to ECMO cohorts. Abbasi et al. [14] compared classification and regression models to predict bleeding and thrombosis. The study cohort included 44 patients on ECMO. The most common indication was acute respiratory distress syndrome (59%), and 66% were supported with veno-venous ECMO. Rankings for variables varied and included ECMO indications, cannulation strategies, and duration. The study by Ayers et al. [15] included 282 adult patients undergoing VA-ECMO. A deep neural network was trained to predict survival to discharge. The most important variables in predicting the primary outcome were lactate, age, total bilirubin, and creatinine. Their final model achieved high accuracy and a greater area under the curve than the SAVE score in predicting survival to discharge.
Typically, ECMO risk scores require many, and sometimes very detailed, variables. The calculation of the score is time consuming and the collection of the necessary variables requires the use of tools. In addition, the SAVE score [9] excludes ECMO during cardiopulmonary resuscitation. The ENCOURAGE score [8] is specific for the population with acute myocardial infarction and cardiogenic shock, whereas the PREDICT score [11] refers to the prognosis after ECMO implantation.
We compared two different data sets: a small set with limited variables immediately available in the acute situation, and a large set with more comprehensive variables that are still easily available before the onset of ECMO. Prediction with conditional inference trees showed a similar error rate for the small and large data sets (35.79% vs. 35.35%). In the clinical setting, this finding is very helpful because the variables of the small set, namely age, lactate, and ECMO indication, are usually readily available. In an acute situation requiring the placement of an ECMO, clinicians are challenged to make a quick decision. Lactate can be determined at the point of care within minutes using blood gas analysis, and age and ECMO indication are obvious. In contrast, measuring blood samples in the laboratory requires much more time. Larger data sets have been shown not to improve the accuracy of the prediction [8]. Hence, awaiting the laboratory values such as albumin, alkaline phosphatase (ALP), and estimated glomerular filtration rate (eGFR) appears to be unnecessary. This shortens the time needed for decision making considerably.
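To illustrate how such a pencil-and-paper rule is applied at the bedside, the following sketch encodes a three-variable decision path of the same shape as the small tree (lactate, then ECMO indication or age). All cut-points and node probabilities below are hypothetical placeholders invented for illustration; the actual thresholds and survival probabilities are those printed in Figure 1 of the study, not these values.

```python
def small_tree_risk(lactate_mmol_l, indication, age_years):
    # Bedside-style lookup mirroring the STRUCTURE of the small tree
    # (lactate -> ECMO indication -> age). Every threshold and survival
    # probability here is a HYPOTHETICAL placeholder, not a value taken
    # from Figure 1 of the study.
    if lactate_mmol_l <= 8.0:                      # placeholder cut-point
        if indication in ("refractory cardiogenic shock", "other"):
            return "node A: predicted survival ~70% (placeholder)"
        return "node B: uncertain, ~55% survival (placeholder)"
    if age_years <= 62:                            # placeholder cut-point
        return "node C: uncertain, ~45% survival (placeholder)"
    return "node D: predicted survival <10% (placeholder)"
```

The point of the sketch is the workflow, not the numbers: with the printed tree at hand, a clinician answers at most three yes/no questions and reads the survival probability off the terminal node, with no computer involved.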
Moreover, our suggested algorithm requires no computer and no training on how to use the respective programs, as it can be done using only a sheet of paper. This makes its application much more realistic and saves a lot of time compared to other predictions.
Consistent with the analyses by Abbasi et al. [14] and Ayers et al. [15], we have shown that parameters such as age, ECMO indication, lactate, alkaline phosphatase, and creatinine or eGFR are suitable variables for the prediction of outcomes.
External validation showed that predictions for patients in the nodes with either a high probability of death or of survival can be very useful in clinical practice, whereas the predictions made for patients in the remaining nodes reflect the grade of uncertainty associated with the potential outcome. The scheme proposed in our analysis might serve as a new uncomplicated and rapid tool for the prediction of mortality in patients on ECMO immediately before implantation.
The differences between the two validation data sets are explained, at least in part, by the number of patients in each node. In Frankfurt, only about 14-15% are in the comparably certain nodes 3 and 11, whereas in Würzburg, more than twice as many (39% and 33%) are located in this region (Supplementary Tables S2 and S3).
For both data sets, a conditional inference tree and a random forest were calculated. The error rates in conditional inference forests were slightly lower than in trees (32.3% vs. 35.79% for the small data set). However, the application of conditional inference forests is more complicated and time consuming, since predictions are made with a computer and cannot be obtained by hand using an easily understandable pencil-and-paper system. Therefore, the application of conditional inference forests is unsuitable in the acute situation.

Limitations
Data were collected retrospectively, which introduces potential bias. Over the long observation period, there have been technological and procedural changes in ECMO therapy that may affect the chances of survival. Despite the established classification of ECMO indications in the literature, indications may be interpreted and classified differently from center to center.

Conclusions
Conditional inference trees have the potential to contribute to prognostic clinical decision making for patients receiving ECMO therapy. They may provide a degree of accuracy in mortality prediction and prognostic stratification using readily available variables. Nonetheless, it is of utmost importance to further study the factors that influence the outcome in this complex situation.

Institutional Review Board Statement: The study was conducted in accordance with the Declaration of Helsinki. The study was reviewed and the requirement for written informed consent was waived by the Cantonal Ethics Commission of Zurich, Switzerland (BASEC No. 2019-01926).
Informed Consent Statement: The study was reviewed by the Cantonal Ethics Commission of Zurich, Switzerland. Patients with a documented refusal of informed consent were excluded.