Comparison of the Tree-Based Machine Learning Algorithms to Cox Regression in Predicting the Survival of Oral and Pharyngeal Cancers: Analyses Based on SEER Database

Simple Summary

Formulating accurate survival prediction models for oral and pharyngeal cancers (OPCs) is important, as such models may inform the decisions of clinicians and patients. Improving the quality of clinical prediction modelling studies benefits the reliability of the developed models and facilitates their implementation in clinical practice. Given the growing interest in applying machine learning methods to cancer research, we present the use of popular tree-based machine learning algorithms and compare them with standard Cox regression for predicting OPC survival. The predictive models discussed here are based on a large cancer registry dataset incorporating various prognostic factors and different forms of bias. The comparable predictive performance between the Cox and tree-based models suggests that these machine learning algorithms provide non-parametric alternatives to Cox regression and are of clinical use for estimating the survival probability of OPC patients.

Abstract

This study aims to demonstrate the use of tree-based machine learning algorithms to predict the 3- and 5-year disease-specific survival of oral and pharyngeal cancers (OPCs) and to compare their performance with traditional Cox regression. A total of 21,154 individuals diagnosed with OPCs between 2004 and 2009 were obtained from the Surveillance, Epidemiology, and End Results (SEER) database. Three tree-based machine learning algorithms (survival tree (ST), random forest (RF) and conditional inference forest (CF)), together with a reference technique (Cox proportional hazards model (Cox)), were used to develop the survival prediction models. To handle missing values in the predictors, we applied the substantive-model-compatible version of the fully conditional specification imputation approach for the Cox model, whereas we used RF to impute missing data for the ST, RF and CF models.
For internal validation, we used 10-fold cross-validation with 50 iterations in the model development datasets. Model performance was then evaluated using the C-index, integrated Brier score (IBS) and calibration curves in the test datasets. For predicting the 3-year survival of OPCs with the complete cases, the C-indexes in the development sets were 0.77 (0.77, 0.77), 0.70 (0.70, 0.70), 0.83 (0.83, 0.84) and 0.83 (0.83, 0.86) for Cox, ST, RF and CF, respectively. Similar results were observed for the 5-year survival prediction models, with the C-indexes for Cox, ST, RF and CF being 0.76 (0.76, 0.76), 0.69 (0.69, 0.70), 0.83 (0.83, 0.83) and 0.85 (0.84, 0.86), respectively, in the development datasets. The prediction error curves based on the IBS showed a similar pattern across these models. The predictive performance remained unchanged in the analyses with imputed data. Additionally, a free web-based calculator was developed for potential clinical use. In conclusion, compared to Cox regression, ST had lower, and RF and CF had higher, predictive accuracy in predicting the 3- and 5-year OPC survival using SEER data. The RF and CF algorithms provide non-parametric alternatives to Cox regression that are of clinical use for estimating the survival probability of OPC patients.

Table S1. Demographic characteristics of patients with oral and pharyngeal cancers in SEER cohorts (for imputed data).
Figure S1. Overtime C-index for predicting 3- and 5-year disease-specific survival of oral and pharyngeal cancers with various models in the imputed datasets.
Figure S4. Time-dependent receiver operator curves for predicting 1- to 5-year disease-specific survival of oral and pharyngeal cancers with Cox models.

Description of the methods for plotting the time-dependent ROC
The time-dependent receiver operator characteristic (ROC) curves are extensions of the standard ROC curves (developed for binary data) to situations where the event status (e.g. death) occurs at various time points during the study period, making them suitable for time-to-event analysis. The time-dependent ROC can be constructed from the cumulative sensitivity (Se^C) and dynamic specificity (Sp^D), which are well defined in the literature*: Let Ti denote the event time and Mi the predicted 'risk' (represented by hazard ratios of predictor values at baseline) for individual i (i = 1, …, n). At each observed time point t, each individual is classified as a case or a control. A case is defined as any individual experiencing the event between baseline t = 0 and time t, and a control as an individual remaining event-free at time t. The cases and controls change over time: each individual may play the role of a control at earlier times (when the event time is greater than the target time, i.e. Ti > t) but then contribute as a case at later times (when the event time is less than or equal to the target time, i.e. Ti ≤ t). For a given threshold c, the cumulative sensitivity is the probability that an individual has a predicted 'risk' greater than c among the individuals who experienced the event before time t, and the dynamic specificity is the probability that an individual has a predicted 'risk' less than or equal to c among those individuals remaining event-free beyond time t. Thus:

Se^C(c, t) = P(Mi > c | Ti ≤ t)
Sp^D(c, t) = P(Mi ≤ c | Ti > t)
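The cumulative/dynamic definitions above can be sketched in code. The following minimal Python illustration assumes fully observed event times (ignoring censoring, which the estimators in the literature handle with inverse-probability weighting or nearest-neighbour smoothing); it is an illustrative sketch, not the estimator used in the paper:

```python
def sens_spec_at(event_times, risks, c, t):
    """Cumulative sensitivity Se^C(c, t) and dynamic specificity Sp^D(c, t).

    Simplified sketch: every event time is assumed observed (no censoring).
    Cases are individuals with Ti <= t; controls are those with Ti > t.
    """
    cases = [m for ti, m in zip(event_times, risks) if ti <= t]
    controls = [m for ti, m in zip(event_times, risks) if ti > t]
    se = sum(m > c for m in cases) / len(cases) if cases else float("nan")
    sp = sum(m <= c for m in controls) / len(controls) if controls else float("nan")
    return se, sp
```

Sweeping c over the observed risk values at a fixed t traces one time-dependent ROC curve; repeating at t = 12, 24, 36, 48 and 60 months gives a family of curves like those plotted below.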

Results
Once the time-dependent setting is applied, the death status is observed and predicted at each time point, which yields different values of sensitivity and specificity throughout the investigated time period. We can therefore plot the time-dependent ROC curves at specific times of interest. For example, the following plots give the cumulative prediction performance at 12, 24, 36, 48 and 60 months.

Table S5. Sensitivity analysis to investigate the effect of an unmeasured factor on the estimation of hazard ratios.

Methods
Here is an example call to the sensitivity analysis (R code):

obsSensSCC(cox1, which=1, g0=c(0.1,0.5,2), p0=seq(0,1,0.2), p1=seq(0,1,0.2), logHaz=F)

where:
• obsSensSCC = a sensitivity analysis for three variables: the outcome Y is a survival outcome, the exposure X is a categorical variable such as sex, and the latent variable U is a categorical variable
• model (cox1 in the call above) = the fitted Cox regression model
• which = the parameter in the regression model that specifies the predictor, e.g. which = 2 refers to the second predictor, which was sex-male in our analysis
• g0 = strength of the relationship between U and the outcome (specified here as a hazard ratio); also called gamma
• p0 = prevalence of U in the unexposed group (i.e. when exposure = 0)
• p1 = prevalence of U in the exposed group (i.e. when exposure = 1)
• logHaz = whether the log of the hazard or the hazard ratio should be returned

In the sensitivity analysis, a range of g is chosen to include the unadjusted hazard ratio before adjusting for unmeasured predictors. Together with a range of p, the adjusted hazard ratio is then estimated for different values of g and p. For example, a range of g of 0.1 to 2 was chosen as the hypothetical effect of unmeasured predictors that could explain away the observed hazard ratio or reduce it to a specific level. If the confidence interval of the adjusted hazard ratio does not include 1, this suggests a direct beneficial relationship between the survival outcome and the predictor. On the other hand, if the confidence interval includes 1, then the unmeasured predictor could explain the relationship between the survival outcome and the predictor.
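For intuition, the kind of adjustment performed here can be sketched with the standard Lin-Psaty-Kronmal bias factor for an unmeasured binary confounder. This is an illustrative Python reimplementation under that assumption, not the obsSens package's exact code; the arguments mirror g0, p0 and p1 above:

```python
def adjusted_hr(hr_obs, gamma, p0, p1):
    """Adjust an observed hazard ratio for an unmeasured binary covariate U.

    gamma: hazard ratio of U on the outcome (the g0 argument above)
    p0/p1: prevalence of U in the unexposed/exposed groups
    Sketch of the Lin-Psaty-Kronmal bias-factor adjustment (an assumption
    about the underlying method, for illustration only).
    """
    bias = (gamma * p1 + (1 - p1)) / (gamma * p0 + (1 - p0))
    return hr_obs / bias
```

For example, an observed HR of 2.0 with gamma = 2, p1 = 0.5 and p0 = 0.2 is attenuated to 1.6; when p0 = p1 the bias factor is 1 and the HR is unchanged, which matches the intuition that a confounder equally prevalent in both groups cannot explain the association.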

Let's look at two examples:
The following table presents the impact of an unmeasured predictor on the hazard ratios of the association between two predictors and the 3-year survival outcome. For the predictor 'whether the surgery was performed or not', we found that the range of adjusted hazard ratios did not include 1 across all scenarios, which means the added unmeasured predictor did not alter the effect of 'Surgery' on the outcome. However, for the predictor 'T category, T3', when a larger proportion of T3 patients than non-T3 patients have the unmeasured predictor (e.g. as in the chemotherapy scenario), the effect might change, but only when the unmeasured predictor has an effect 2-fold larger than that of the existing predictor 'T category, T3'.

* g (gamma) refers to the effect estimate of the association between the predictors and an unmeasured covariate; p refers to the prevalence of the unmeasured covariate in the exposure groups. Effect estimates not including 1 represent conditions where the survival outcome is associated with the presence of the predictors, whereas those including 1 represent conditions where the survival outcome is not associated with the predictors.

Table S6. Pruning parameters for survival tree and random forests for survival.

Model Pruning Parameters
• 'minsplit' (lower = 1, upper = 20): the minimum number of observations that must exist in a node in order for a split to be attempted.
• 'maxdepth' (lower = 1, upper = 30): the maximum depth of a tree; depth is the length of the longest path from the root node to a leaf node.

Random forests for survival
• 'ntree' (lower = 1000, upper = 2000): the total number of trees in the forest.
• 'mtry' (lower = 1, upper = 12): the number of variables tested at any split.
• 'nsplit' (lower = 0, upper = 20): the size of the set of random split points for each 'mtry' candidate.
• 'splitrule' (values = 'logrank', 'logrankscore', 'random'): the split rule and formula.
• 'nodedepth' (lower = …, upper = …): the length of the longest path from a root to a leaf of any tree in the forest. By default, this parameter is ignored.
• 'nodesize' (lower = …, upper = …): the minimum number of unique cases (data points) in a terminal node of any tree in the forest. By default, this parameter is ignored.
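As a sketch of how such ranges can be used for tuning, the following Python snippet draws random candidate configurations from the ranges listed above (plain random search; the authors' exact tuning procedure may differ, so treat this as an assumption):

```python
import random

def sample_rf_config(rng):
    """Draw one RF-for-survival configuration from the tuning ranges above.

    'nodedepth' and 'nodesize' are omitted, matching their default
    behaviour of being ignored.
    """
    return {
        "ntree": rng.randint(1000, 2000),
        "mtry": rng.randint(1, 12),
        "nsplit": rng.randint(0, 20),
        "splitrule": rng.choice(["logrank", "logrankscore", "random"]),
    }

rng = random.Random(42)
candidates = [sample_rf_config(rng) for _ in range(50)]  # 50 candidate configurations
```

Each candidate would then be scored by cross-validated C-index, keeping the best-performing configuration.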

Conditional Inference Forest
• 'ntree' (lower = 1000, upper = 2000): the total number of trees in the forest.
• 'mtry' (lower = 1, upper = 12): the number of variables tested at any split.
• 'minsplit' (lower = 0, upper = 20): the minimum size of the set of random split points for each 'mtry' candidate.
• 'teststat' (values = 'quad', 'max'): a character specifying the type of test statistic to be applied.
• 'mincriterion': controls the depth of the trees. Usually unstopped and unpruned trees are used in random forests; to grow large trees, set it to a small value. By default, this parameter is ignored.

The development of a ST algorithm can be summarized as follows:
1  Start
2  Create an initial survival tree with root node t0
3  Create a stack S of open nodes and push t0 onto S
4  while S is not empty do
5      pop a node t from S
6      if the stopping criterion is met for t then mark t as terminal
7      else find the split that maximizes the survival difference between child nodes
8           partition the data into the two child nodes of t and push them onto S
9  end
10 End

For the tree growing and pruning purposes, one needs a splitting statistic (log-rank) that handles the dependence of failure times and a measure (C-index) to evaluate the performance of the tree.
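The log-rank splitting statistic named above compares the survival experience of the two candidate child nodes. A minimal Python sketch of the two-sample log-rank chi-square statistic (assuming complete follow-up for simplicity; real survival-tree software also handles censoring and ties):

```python
def logrank_stat(times_a, events_a, times_b, events_b):
    """Two-sample log-rank statistic (chi-square, 1 df).

    times_*: observed times per group; events_*: 1 if the event occurred.
    Larger values indicate a greater survival difference between the groups,
    so a split maximizing this statistic separates the children best.
    """
    pooled = sorted({t for t, e in zip(times_a + times_b, events_a + events_b) if e})
    o_minus_e = 0.0  # observed minus expected events in group A
    v = 0.0          # hypergeometric variance
    for t in pooled:
        n_a = sum(ta >= t for ta in times_a)  # at risk in group A at time t
        n_b = sum(tb >= t for tb in times_b)
        d_a = sum(ta == t and ea for ta, ea in zip(times_a, events_a))
        d_b = sum(tb == t and eb for tb, eb in zip(times_b, events_b))
        n, d = n_a + n_b, d_a + d_b
        if n < 2:
            continue
        o_minus_e += d_a - d * n_a / n
        v += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    return o_minus_e ** 2 / v if v > 0 else 0.0
```

Two clearly separated groups give a statistic above the 3.84 chi-square cutoff, while identical groups give 0.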

The development of a RF algorithm can be summarized as follows:
1  Start
2  Select the number of trees to build, ntree
3  for i = 1 to ntree do
4      Generate a bootstrap sample B (usually two thirds) of the original data; the remaining one third is left as out-of-bag (OOB) data
5      Train a tree model on this sample
6      for each split do
7          Randomly select k (< K) of the original predictors
8          Select the best predictor among the k predictors
9          for each splitting point of the best k do
10             Compare the survival curves of the two groups using one splitting rule r among (a) the log-rank splitting rule, (b) the log-rank score splitting rule, or (c) the random log-rank splitting rule
11             Select the best splitting rule and partition the data
12         end for (determining the splitting rule)
13     end for (determining one split in one tree)
14     Use the tree stopping criteria to determine when a tree is complete
15     Using the OOB data, calculate the prediction ability, represented by the C-index; the cumulative hazard function (CHF) is also calculated for each tree
16 end for (one tree)
17 The individual C-indexes are then averaged to obtain the ensemble C-index, and the individual CHFs are averaged to obtain the ensemble CHF
18 End

The development of a conditional inference tree can be summarized as follows:
1. For case weights w, test the global null hypothesis of independence between any of the p covariates and the response variable. Stop if this hypothesis cannot be rejected; otherwise select the j-th covariate Xj with the strongest association to the outcome.
2. Select a set A ⊂ Xj in order to split Xj into two disjoint sets. The weights wL and wR determine the two subgroups, with wL,i = wi·I(Xj,i ∈ A) and wR,i = wi·I(Xj,i ∉ A) for all i = 1, 2, …, n.
3. Recursively repeat steps 1 and 2 with modified case weights wL and wR, respectively.

TRIPOD checklist (excerpt):
• (V) For validation, identify any differences from the development data in setting, eligibility criteria, outcome, and predictors. — pages 3-4, Table 1
• 14b (D) If done, report the unadjusted association between each candidate predictor and outcome. — NA
• Model specification, 15a (D): Present the full prediction model to allow predictions for individuals (i.e., all regression coefficients, and model intercept or baseline survival at a given time point).
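Step 4 of the RF algorithm above (the bootstrap/OOB split) can be sketched as follows; on average about one third of the rows fall out-of-bag, matching the description:

```python
import random

def bootstrap_oob(n_rows, rng):
    """Draw a bootstrap sample of row indices (with replacement); rows that
    are never drawn form the out-of-bag (OOB) set used for validation."""
    in_bag = [rng.randrange(n_rows) for _ in range(n_rows)]
    oob = sorted(set(range(n_rows)) - set(in_bag))
    return in_bag, oob

in_bag, oob = bootstrap_oob(10_000, random.Random(1))
# len(oob) / 10_000 is close to (1 - 1/n)^n, i.e. about e^-1 ≈ 0.368
```

Each tree is trained on its in-bag rows, and its C-index and cumulative hazard function are evaluated on the corresponding OOB rows before ensemble averaging.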