Deep Learning-Based Survival Analysis for High-Dimensional Survival Data

Abstract: With the development of high-throughput technologies, more and more high-dimensional or ultra-high-dimensional genomic data are being generated, and effectively analyzing such data has become a significant challenge. Machine learning (ML) algorithms have been widely applied for modeling nonlinear and complicated interactions in a variety of practical fields, including the analysis of high-dimensional survival data. Recently, multilayer deep neural network (DNN) models have made remarkable achievements. Accordingly, a Cox-based DNN prediction survival model (DNNSurv model), built with Keras and TensorFlow, was developed, but its results were evaluated only on survival datasets that were high-dimensional or had large sample sizes. In this paper, we evaluated the prediction performance of the DNNSurv model using ultra-high-dimensional and high-dimensional survival datasets and compared it with three popular ML survival prediction models (i.e., the random survival forest and the Cox-based LASSO and Ridge models). For this purpose, we also present the optimal settings of several hyperparameters, including the selection of a tuning parameter. The data analysis demonstrated that the DNNSurv model performed well overall compared with the ML models, in terms of the three main evaluation measures of survival prediction performance (i.e., the concordance index, the time-dependent Brier score, and the time-dependent AUC).


Introduction
The survival time (i.e., time-to-event) is defined as the time to an event of interest, such as time-to-death, time-to-recurrence, or time-to-employment. One important characteristic of survival data is that they are often censored, which leads to incomplete outcomes. Due to the presence of censoring, the statistical analysis of survival data is usually much more complicated than regular statistical analysis [1]. Many statistical methods, typically non-parametric or semi-parametric, have been developed for survival analysis. In particular, with the development of high-throughput technologies, more and more high-dimensional (HD) or ultra-high-dimensional (ultra-HD) genomic data are being generated [2]. Unlike in regular cases, the HD case is often observed in biomedical data such as genomic data, i.e., the number of covariates (p) (e.g., gene features) is usually much larger than the sample size (n) (i.e., p ≫ n), leading to a challenging problem [2]. In this paper, we considered the HD case, as well as the ultra-HD case where p is extremely large (e.g., p > 10^5).
Recently, machine learning (ML) algorithms have been widely applied for modeling nonlinear and complicated interactions, and for improving predictability, in a variety of practical fields. They are also well suited to handling the incomplete (censored) outcomes that arise in HD survival data [3]. In particular, the neural network algorithm has been applied to survival analysis for a long time. The multilayer deep neural network (DNN) model has recently made remarkable achievements in complex and HD cases with complete data [4][5][6]. Nevertheless, the application of deep learning to survival analysis for censored data is still limited because the existing DNN survival modeling approaches use only a single hidden layer [7]. Faraggi and Simon [8] proposed a single-hidden-layer feedforward neural network, which is usually regarded as a nonlinear extension of the Cox proportional hazards (PHs) model. However, it did not outperform the classical Cox model in a study on prostate cancer survival data [9,10]. Katzman et al. [11] restudied such a single-layer model in a deep learning framework (named DeepSurv) and showed that, in terms of Harrell's [12] concordance index (C-index), the network performed better than the regular Cox model and the random survival forest [13] model in a study of breast cancer. Ching et al. [14] proposed a single-hidden-layer deep learning package (Cox-nnet) to predict patient prognosis from high-throughput omics survival data, which was also an extension of the neural network for the Cox model. It was demonstrated, using TCGA (The Cancer Genome Atlas) cancer data with high-throughput gene expressions, that the neural network survival model performed similar to or better than the regular Cox model, the penalized Cox model, and the random survival forest model. To overcome the single-hidden-layer restriction, very recently, Sun et al. [7] developed a multi-hidden-layer Cox-based DNN survival model (DNNSurv model) to predict the progression of age-related macular degeneration (AMD) disease and compared it with other survival models based on machine learning. It was shown that, in a study of AMD progression, the DNNSurv model not only outperformed several other survival models (e.g., the penalized Cox model and the random survival forest model) in terms of the evaluation metrics (e.g., the C-index), but also successfully obtained patient-specific predictor importance measures using the local interpretable model-agnostic explanation (LIME) method [15]. However, it was evaluated only on survival datasets that were high-dimensional and had large sample sizes.
In this paper, we evaluated the prediction performance of the DNNSurv model using several HD and ultra-HD datasets for survival analysis and compared it with three popular ML survival prediction models (i.e., random survival forest model and Cox-based LASSO and Ridge models). Here, we also present the optimal setting of several hyperparameters including the selection of the tuning parameter. The DNNSurv model was built with Keras [16] and TensorFlow [17] to make sure that the computation was stable and efficient.
Keras is an open-source software library and is often used to define and train deep learning models. Several backends are supported by Keras, and TensorFlow was used as the backend engine of Keras. Keras contains numerous commonly used building blocks for neural networks, such as layers, activation functions, and optimizers, which makes the work much easier in terms of writing the deep neural network code. The DNNSurv model [7] is compatible with both GPUs and CPUs, via Keras and TensorFlow.
The paper is organized as follows. In Section 2, we review the machine and deep learning survival methods, including the prediction evaluation measures for survival analysis. In Section 3, we present the hyperparameter settings, together with a cross-validation procedure to find the optimal tuning parameter, and we then assess the performance of four survival prediction models (DNNSurv, random survival forest, Cox-based LASSO, and Cox-based Ridge) using several real HD and ultra-HD survival datasets. The discussion is given in Section 4.

Machine and Deep Learning Methods for Survival Analysis
Let T be a non-negative continuous random variable that represents the time-to-event. The survival function and hazard function are denoted by S(t) and λ(t), respectively. For each individual i (i = 1, . . . , n), let T_i be the survival time, and let C_i be the corresponding censoring time. Then, the observable random variables are given by:

Y_i = min(T_i, C_i),  δ_i = I(T_i ≤ C_i),  (1)

where δ_i is the censoring indicator.
Let x = (x_1, . . . , x_p)^T be a vector of covariates for an individual, and let λ(t; x) be the hazard function at time t for an individual with covariates x. Under the Cox PH model, the hazard function for an individual takes the form:

λ(t; x) = λ_0(t) exp(x^T β),  (2)

where λ_0(t) is the unspecified baseline hazard function at time t under x = 0, and β = (β_1, . . . , β_p)^T is a vector of regression parameters corresponding to the covariates x. The term x^T β does not include an intercept, and it is called the linear predictor or prognostic index. In this paper, the models applied for the analysis of the HD or ultra-HD survival datasets were based on the Cox PH model, except for the random survival forest (RSF) model.

Methods
For HD data (e.g., high-throughput genomic data) with p ≫ n, standard statistical methods cannot be applied directly. The same problems also occur in survival data [18].
To overcome these problems, various improved methods (e.g., penalized-based methods, random forests, and deep learning) have been developed. Below, we review the machine and deep learning methods used in the survival analysis of the HD time-to-event data.

Penalized Cox Models
Penalized Cox models are often used for processing HD survival data. The commonly used penalized Cox models include the Cox-based LASSO (Cox-LASSO) and Cox-based Ridge (Cox-Ridge) models [19,20], which minimize the negative partial log-likelihood of the Cox model with different penalty functions. The partial log-likelihood of the Cox model is given by:

ℓ(β) = Σ_{r∈D} [ x_r^T β − log Σ_{i∈R_r} exp(x_i^T β) ],  (3)

where D is the set of all events, y_r is the rth (r = 1, . . . , E) smallest distinct event time among the Y_i's, E is the number of distinct event times, x_r is the corresponding covariate vector, and R_r = {i : Y_i ≥ y_r} is the set of individuals who are at risk at time y_r. We can then obtain the penalized maximum likelihood estimates of the regression parameters β corresponding to the two methods:

β̂ = argmin_β { −ℓ(β) + γ P(β) },  (4)

where P(β) = ||β||_1 = Σ_{j=1}^{p} |β_j| is the L_1-norm (LASSO) penalty, P(β) = ||β||_2^2 = Σ_{j=1}^{p} β_j^2 is the L_2-norm (ridge) penalty, and γ is the tuning parameter, which adjusts the amount of regularization. In particular, there is no regularization when γ = 0, and the model becomes more regularized as γ → ∞. Note here that the LASSO penalty performs well for selecting significant variables among a variety of genes, but its limitation is that it can select at most n variables in the p ≫ n case because of the nature of the convex optimization problem. The ridge penalty, on the other hand, is more suitable for handling multicollinearity among covariates, but is not appropriate for variable selection [3].
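As an illustration, the penalized objective above can be evaluated in a few lines of numpy. This is a didactic sketch only (it assumes no tied event times and uses Breslow-style risk sets), not the coordinate-descent algorithm that glmnet actually uses; the function name and interface are hypothetical.

```python
import numpy as np

def cox_neg_partial_loglik(beta, X, time, event, gamma=0.0, penalty="lasso"):
    """Negative partial log-likelihood of the Cox model plus an optional
    L1/L2 penalty. Didactic sketch assuming no tied event times."""
    eta = X @ beta                      # linear predictors x_i^T beta
    ll = 0.0
    for i in np.argsort(time):          # process times in increasing order
        if event[i]:                    # only observed events contribute
            at_risk = time >= time[i]   # risk set R_r = {j : Y_j >= y_r}
            ll += eta[i] - np.log(np.sum(np.exp(eta[at_risk])))
    if penalty == "lasso":
        pen = gamma * np.sum(np.abs(beta))   # L1-norm penalty
    else:
        pen = gamma * np.sum(beta ** 2)      # L2-norm (ridge) penalty
    return -ll + pen
```

Minimizing this objective over a grid of γ values mimics, in spirit, what the penalized Cox fitting routines do.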

Random Survival Forest
Breiman [21] proposed the random forest algorithm and showed that randomizing the base learning process can improve the performance of ensemble learning. Figure 1 shows a simple representation of a random forest. The random survival forest (RSF) algorithm [13] is an extension of the random forest to survival analysis with censored data. Because some parametric methods used for survival analysis rely on restrictive assumptions, survival analysis is much more difficult; the RSF method, however, handles these problems automatically. It is based on random bootstrap samples drawn from the training dataset. For each bootstrap sample, it randomly selects candidate variables at each node when growing the trees, and the candidate variable is used to split the node so as to maximize the survival difference among the child nodes [13]. Note here that, unlike in the random forest algorithm, the splitting rule of the RSF used for growing a tree must consider both the survival time and the censoring indicator. For survival data, the log-rank splitting rule is often used, splitting a node by maximizing the log-rank test statistic [22].
The main ideas of the RSF algorithm are growing a survival tree and building the ensemble cumulative hazard function, which is the average of the cumulative hazard functions (usually, the Nelson-Aalen cumulative hazard functions) [23]. The RSF is available in the "randomForestSRC" R package.

Figure 1. Diagrammatic representation of a random forest (for further details, see https://www.analyticsvidhya.com/blog/2020/05/decision-tree-vs-random-forest-algorithm/ (accessed on 1 April 2021)).
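To make the log-rank splitting rule concrete, the two-sample log-rank statistic can be computed directly from its defining sums. The sketch below is a plain numpy illustration of the standard formula, not the randomForestSRC implementation; the function name is ours.

```python
import numpy as np

def logrank_stat(time, event, group):
    """Two-sample log-rank statistic used as an RSF splitting rule:
    large absolute values indicate a bigger survival difference
    between the two candidate child nodes (group 1 vs. group 0)."""
    time, event, group = map(np.asarray, (time, event, group))
    num, var = 0.0, 0.0
    for yr in np.unique(time[event == 1]):        # distinct event times y_r
        at_risk = time >= yr
        n, n1 = at_risk.sum(), (at_risk & (group == 1)).sum()
        d = ((time == yr) & (event == 1)).sum()   # deaths at y_r
        d1 = ((time == yr) & (event == 1) & (group == 1)).sum()
        num += d1 - d * n1 / n                    # observed - expected
        if n > 1:                                 # hypergeometric variance
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return num / np.sqrt(var)
```

A node split that well separates early from late failures yields a large absolute statistic, while an uninformative split yields a value near zero.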

DNNSurv Model
The DNN model is well known for its capacity to learn complex covariate structures such as nonlinearities or interactions [24]. By the universal approximation theorem [25,26], for any continuous function g(x; β), it is guaranteed that there is a neural network that approximates this function. Thus, the neural network algorithm can be extremely effective even with a very simple architecture, such as a single hidden layer. The DNNSurv model [7] was built by combining the DNN survival model and the regular Cox PH model, and it can be applied to HD or ultra-HD survival datasets. The DNNSurv model is constructed as follows. The corresponding hazard function is of the form:

λ(t; x) = λ_0(t) exp(g(x; β)),  (5)

where g(x; β) is an unknown function with a vector of parameters β, representing a prognostic index that can be nonlinear. In other words, it is an extension of the linear predictor in the regular Cox PH model, and it reduces to the Cox model when g(x; β) = x^T β. As a result, the DNNSurv model can accommodate various nonlinear covariate structures [7]. Furthermore, because of the presence of tied events, i.e., more than one event occurring in different individuals at the same time, the DNNSurv model applies Efron's approach [27] to approximate the partial log-likelihood ℓ(β; x). It is defined by:

ℓ(β; x) = Σ_{r=1}^{E} [ Σ_{i∈K_r} g(x_i; β) − Σ_{j=0}^{k_r−1} log( Σ_{i∈R_r} e^{g(x_i;β)} − (j/k_r) Σ_{i∈K_r} e^{g(x_i;β)} ) ],  (6)

where D is the set of all events with size N_D, y_r is the rth (r = 1, . . . , E) smallest distinct event time, K_r is the set of individuals who fail at time y_r, k_r is the size of K_r, and R_r is the risk set at time y_r. On the other hand, the loss function of the DNNSurv model with the L_1 penalty, which is used for handling HD covariates, is defined as:

Loss(β) = −(1/N_D) ℓ(β; x) + γ ||β||_1,  (7)

where γ is the tuning parameter. A simple structure of the DNNSurv model includes one input layer, two hidden layers, and one output layer, as shown in Figure 2.
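The Efron-approximated partial log-likelihood in Equation (6) can be written down almost verbatim in numpy. This is a didactic sketch under our own notation (g[i] holds the prognostic index g(x_i; β) for individual i), not the Keras implementation; with no tied events it reduces to the usual Breslow partial log-likelihood.

```python
import numpy as np

def efron_partial_loglik(g, time, event):
    """Efron-approximated partial log-likelihood l(beta; x) for tied
    event times, where g[i] is the prognostic index g(x_i; beta)."""
    g, time, event = map(np.asarray, (g, time, event))
    ll = 0.0
    for yr in np.unique(time[event == 1]):    # distinct event times y_r
        K = (time == yr) & (event == 1)       # tied events K_r
        R = time >= yr                        # risk set R_r
        k = K.sum()                           # k_r = |K_r|
        ll += g[K].sum()
        sR, sK = np.exp(g[R]).sum(), np.exp(g[K]).sum()
        for j in range(k):                    # Efron's tie correction
            ll -= np.log(sR - (j / k) * sK)
    return float(ll)
```

Negating this quantity, dividing by the number of events, and adding the L_1 penalty gives the training loss of Equation (7).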
For each individual, the vector of covariates x is input into the input layer, and a scalar prognostic index g(x; β) is output from the output layer with weights. For the hidden layers, the model of the lth layer can be written as:

x^(l) = f(w^(l) x^(l−1)),  (8)

which is constructed from the weights w and the activation function f. The activation function of the DNNSurv model is the scaled exponential linear unit (SeLU) [28], which is defined by:

f(x) = b ( ReLU(x) + min(0, a(e^x − 1)) ),  (9)

where ReLU [29] is the rectified linear unit, with the form f(x) = max(0, x), and a and b are constants. The mini-batch stochastic gradient descent algorithm [30] was applied to obtain β̂ by minimizing the loss function (7); it is much faster than the standard stochastic gradient descent algorithm in terms of minimizing the loss function. From (6) and (7), the loss function contributed by the lth (l = 1, . . . , L) batch is given by:

Loss_l(β) = −(1/N_D^l) ℓ_l(β; x) + γ ||β||_1,  (10)

where N_D^l, D_l, K_r^l, k_r^l, and R_r^l are the corresponding terms for the lth batch, similar to those defined in Equation (6). Then, β can be updated through the following gradient descent formula contributed by the lth batch:

β ← β − η ∇_β Loss_l(β),  (11)

where η is the learning rate. The process is repeated n_e (epoch size) times until convergence. The selection of the DNN hyperparameters, which mainly include the number of hidden layers, the number of nodes in each hidden layer, the activation function, the tuning parameter, the batch size, the number of epochs, and the learning rate, is presented in detail in Section 3.
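For instance, the SeLU activation can be implemented in a few lines. The paper does not list the constants, so the values below are the ones commonly used in practice (a ≈ 1.6733, b ≈ 1.0507) and should be treated as an assumption.

```python
import numpy as np

def selu(x, a=1.6733, b=1.0507):
    """Scaled exponential linear unit (SeLU): b*x for x > 0,
    and b*a*(exp(x) - 1) otherwise. Constants are the commonly
    used values, assumed here since the paper omits them."""
    x = np.asarray(x, dtype=float)
    return b * np.where(x > 0, x, a * (np.exp(x) - 1.0))
```

For large negative inputs, the output saturates at −b·a, which is what gives SeLU networks their self-normalizing behavior.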

Evaluation Measures of Survival Prediction
Three popular survival accuracy metrics, i.e., the C-index, the time-dependent Brier score, and the AUC (area under the curve), were used to evaluate the performance of the survival prediction models. Below, we describe the three measures.

C-index
Let T_1 and T_2 be the survival times of two different subjects. The C-index [12] is defined by:

C = P(T̂_1 > T̂_2 | T_1 > T_2),  (12)

where T̂_1 and T̂_2 are the estimated survival times for T_1 and T_2, respectively, which can often be obtained through the estimated risk or prognostic scores g(x; β̂). It measures the proportion of pairs for which the predicted outcomes are concordant with the observed outcomes. The C-index can be estimated by [2,12]:

Ĉ = [ Σ_{i≠j} δ_i I(Y_i < Y_j) I(g(x_i; β̂) > g(x_j; β̂)) ] / [ Σ_{i≠j} δ_i I(Y_i < Y_j) ],  (13)

where a higher prognostic score corresponds to a shorter predicted survival time. The value of the C-index ranges from 0 to 1, and a larger value indicates better performance.
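A direct, didactic implementation of the pairwise counting behind the C-index estimator (with the usual convention that a higher risk score should accompany a shorter survival time) might look as follows; this is our own sketch, not the code used in the paper.

```python
def c_index(time, event, risk_score):
    """Harrell's C-index for right-censored data: among usable pairs
    (the subject with the smaller observed time had an event), count
    pairs where the shorter time received the higher risk score.
    Tied risk scores count as 0.5."""
    n = len(time)
    concordant, usable = 0.0, 0.0
    for i in range(n):
        if not event[i]:
            continue                     # i must be an observed event
        for j in range(n):
            if time[i] < time[j]:        # usable pair (i failed first)
                usable += 1
                if risk_score[i] > risk_score[j]:
                    concordant += 1
                elif risk_score[i] == risk_score[j]:
                    concordant += 0.5
    return concordant / usable
```

A perfectly anti-concordant model scores 0, random predictions score about 0.5, and perfect risk ordering scores 1.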

Time-Dependent Brier Score
The time-dependent Brier score [31,32] is defined by:

BS(t) = E[ (I(t) − S(t; x))^2 ],  (14)

where I(t) = I(T > t) indicates the event status at time point t. The Brier score BS(t) is the mean squared error of the difference between a model-based survival function S(t; x) and the event status I(t). Thus, the Brier score is estimated from the mean squared error between the predicted survival function Ŝ(t; x_i) and the observed event status, with the estimated Brier score [31,32] given by:

BŜ(t) = (1/n) Σ_{i=1}^{n} ŵ_i(t) ( I(Y_i > t) − Ŝ(t; x_i) )^2,  (15)

where ŵ_i(t) is the inverse probability of censoring weight (IPCW) [32], which is given by:

ŵ_i(t) = δ_i I(Y_i ≤ t) / Ĝ(Y_i−) + I(Y_i > t) / Ĝ(t),  (16)

where G(t) = P(C > t) for the censoring time C, and Ĝ(Y_i−) is the estimated survival function of the censoring time just prior to Y_i. Note that a lower Brier score indicates better performance.
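For intuition, the sketch below evaluates the Brier score in the simplest, fully observed case, where every IPCW weight w_i(t) equals 1; handling censoring would additionally require a Kaplan-Meier estimate Ĝ of the censoring distribution. The interface is hypothetical.

```python
import numpy as np

def brier_score(t, time, surv_prob):
    """Time-dependent Brier score at time t, sketched for the
    uncensored case (all IPCW weights equal 1). surv_prob[i] is
    the model's predicted survival probability S(t; x_i)."""
    status = (np.asarray(time) > t).astype(float)   # I(Y_i > t)
    return float(np.mean((status - np.asarray(surv_prob)) ** 2))
```

A model that predicts the event status perfectly at time t attains 0, while an uninformative constant prediction of 0.5 attains 0.25.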

Time-Dependent AUC
The receiver operating characteristic (ROC) curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. Here, the TPR and FPR are equal to the sensitivity and 1−specificity, respectively. In survival analysis, the ROC depends on the time t [33], with the following two definitions:

sensitivity(c, t) = P(M > c | T ≤ t),  (17)
specificity(c, t) = P(M ≤ c | T > t),  (18)

where c is an arbitrary threshold or cut-off value and M is a diagnostic test or marker (here, M = g(x; β̂)). The corresponding estimates of the time-dependent sensitivity and specificity are given by:

sensitivity(c, t) = [ {1 − Ŝ(t | M > c)} P̂(M > c) ] / [ 1 − Ŝ(t) ],  (19)
specificity(c, t) = [ Ŝ(t | M ≤ c) P̂(M ≤ c) ] / Ŝ(t),  (20)

where Ŝ(·) denotes a Kaplan-Meier-type estimate of the survival function [33]. Therefore, we can determine the ROC curve and, furthermore, obtain the corresponding AUC at each time point t. The value of the AUC is between 0 and 1, and the discriminant ability is stronger with a higher AUC.
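The cumulative/dynamic definitions above can be illustrated with a purely empirical sketch that, for simplicity, assumes fully observed (uncensored) times; with censoring, the Kaplan-Meier-based estimators of [33] would be used instead. The function name and interface are ours.

```python
import numpy as np

def td_roc(t, time, marker):
    """Empirical cumulative/dynamic sensitivity and specificity at
    time t over a grid of thresholds, assuming uncensored times.
    Cases are subjects with T <= t; controls have T > t."""
    time, marker = np.asarray(time), np.asarray(marker)
    case = time <= t                 # events by time t (cases)
    ctrl = ~case                     # still event-free (controls)
    thresholds = np.sort(np.unique(marker))
    sens = np.array([(marker[case] > c).mean() for c in thresholds])
    spec = np.array([(marker[ctrl] <= c).mean() for c in thresholds])
    return sens, spec
```

Plotting sens against 1 − spec over the threshold grid traces the time-dependent ROC curve, and a trapezoidal rule over that curve gives the AUC at time t.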

Analysis of High-and Ultra-High-Dimensional Survival Data
We used both high-dimensional (HD) and ultra-HD datasets to evaluate the performance of the DNNSurv model in comparison with the other three survival prediction models (i.e., RSF, Cox-LASSO, and Cox-Ridge).

Real Survival Datasets
We considered three datasets presented by Wang and Li [2] for the evaluation of the four survival models (i.e., DNNSurv, RSF, Cox-LASSO, and Cox-Ridge). The datasets are summarized in Table 1. They consisted of two ultra-HD datasets (EMTAB386 and GSE49997) and one HD dataset (TCGAmirna), all of which are available from the "curatedOvarianData" R package.
A brief introduction of the three datasets is as follows. In addition, as shown in Table 1, we also used the corresponding covariates (G + C), comprising both gene features (G) and clinical variables (C), in each dataset. Note that for each dataset, the time-to-event was the overall survival time, defined as the time from randomization to death due to any cause; patients who were alive or lost to follow-up were censored. Before starting the survival analysis, we preprocessed all the datasets by eliminating missing values and columns with identical values, and by normalizing the continuous covariates.

Performance Evaluation
For all datasets, we evaluated the performance of the DNNSurv model using the three evaluation measures mentioned in Section 2.2 (i.e., C-index, time-dependent Brier score, and time-dependent AUC) and compared it with the performance of the other three models (i.e., RSF, Cox-LASSO, and Cox-Ridge). For the performance evaluation of all four models, a 10-fold cross-validation (CV) was performed for each real dataset in Table 1. The final results of each evaluation measure were based on the average of the results of 10 test datasets. Below, we present the 10-fold CV procedure for the four models.
• For the DNNSurv model, (i) the first step was to perform a 10-fold CV grid search to select the optimal tuning parameter γ* that maximizes the C-index. Here, the 10-fold CV procedure was as follows. Denote the full dataset by f, and denote the CV training and test datasets by f_{−k} (= f − f_k) and f_k, respectively, for k = 1, . . . , 10.
For each γ and k, we found the estimator β̂_{f−k}(γ) using the training dataset f_{−k}. Then, for each γ, we computed the CV estimate C(γ) based on the C-index in (13):

C(γ) = Σ_{k=1}^{10} (M_k / n) Ĉ(f_k; β̂_{f−k}(γ)),  (21)

where M_k is the sample size of the kth subset f_k and Ĉ(f_k; β̂_{f−k}(γ)) is the C-index (13) evaluated on f_k. Thus, we found γ* (i.e., the optimal tuning parameter) that maximized C(γ); (ii) in the second step, given γ*, we performed a 10-fold CV for each dataset, i.e., we trained the model on the training dataset and then obtained the values of the three measures (i.e., C-index, time-dependent AUC, and Brier score (BS)) on the test dataset. For a further understanding of our CV procedure with steps (i) and (ii), a flowchart is presented in Figure 3;
• For the RSF model, we trained the model using the log-rank splitting rule for survival analysis. The corresponding parameters were as follows: the number of trees was 500; the number of variables randomly selected as candidates for splitting a node was √p; and the size of the terminal node was three;
• For the Cox-LASSO and Cox-Ridge models, the "glmnet" R package was applied. For the L_1 penalty (in the Cox-LASSO model) and the L_2 penalty (in the Cox-Ridge model), a 10-fold CV (via the cv.glmnet() R function) was first used in the training dataset to select the optimal tuning parameter γ** (denoted as lambda.min in the R package) that gave the minimum mean cross-validated error (cvm). After γ** was determined, we trained each model (Cox-LASSO or Cox-Ridge) on the training dataset and then validated it on the test dataset.
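Step (i) of the procedure above can be sketched generically: loop over a grid of γ values, average the fold-wise C-index over the CV test sets, and keep the maximizer. Here `fit` and `score` are hypothetical stand-ins for the model's training routine and C-index evaluation; this is a schematic of the procedure, not the authors' code.

```python
import numpy as np

def select_gamma(gammas, X, time, event, fit, score, k=10, seed=1):
    """Grid-search CV for the tuning parameter: return the gamma whose
    k-fold cross-validated score (e.g., C-index) is largest.
    `fit(X, time, event, gamma)` returns a fitted model and
    `score(model, X, time, event)` evaluates it on held-out data."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)           # test subsets f_1, ..., f_k
    cv_scores = []
    for g in gammas:
        fold_scores = []
        for fk in folds:
            train = np.setdiff1d(idx, fk)    # training subset f_{-k}
            model = fit(X[train], time[train], event[train], g)
            fold_scores.append(score(model, X[fk], time[fk], event[fk]))
        cv_scores.append(np.mean(fold_scores))   # C(gamma)
    return gammas[int(np.argmax(cv_scores))]     # gamma* maximizing C(gamma)
```

Given γ*, step (ii) then refits the model on each CV training set and reports the C-index, Brier score, and time-dependent AUC on the corresponding test set.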
Furthermore, we tuned the hyperparameters in order to boost the performance of the DNNSurv model. The settings of the selected hyperparameters for each type of covariate are summarized in Table 2. Figures 4 and 5 show the prediction performance of the four models (i.e., DNNSurv, RSF, Cox-LASSO, and Cox-Ridge) in terms of the C-index based on the 10 test datasets using the 10-fold CV. The covariates of each dataset in Figures 4 and 5 are, respectively, the gene features (G) and both gene features and clinical variables (G + C). As shown in Figures 4 and 5, the DNNSurv model provided the best performance compared to the other three models (i.e., RSF, Cox-LASSO, and Cox-Ridge) in terms of the C-index. In particular, for each dataset, the DNNSurv model showed better performance in Figure 5 than in Figure 4 in terms of the C-index, which means that the performance with the G + C covariates was better than that with G only. Figures 6 and 7 report the performance evaluation of the four models in terms of the time-dependent Brier score on the 10 test datasets using the 10-fold CV. The difference between the two figures is that, in addition to the gene features, the datasets shown in Figure 7 included the clinical variables. As shown in Figures 6 and 7, at each time point t, the value of the Brier score was the average of the Brier scores generated by the 10-fold CV under each model. In Figures 6 and 7, the Brier score of the DNNSurv model at each time point t was consistently lower than that of the other three models on the GSE49997 dataset, which means that the DNNSurv model was superior to the other three models on this dataset.
For the TCGAmirna dataset, at each time point t, the value of the Brier score for the DNNSurv model was slightly smaller than that of the other models, which means that the performance of the DNNSurv model was slightly better than the other three models in terms of the time-dependent Brier score. However, the DNNSurv model did not stand out on the EMTAB386 dataset, where the Cox-Ridge model appeared to perform better. Furthermore, the overall trends of the DNNSurv model for the last two datasets in Figures 6 and 7 were very similar. The datasets in Figures 8 and 9 were the same as those in Figures 6 and 7, respectively. For each model, the value of the AUC at each time point t was the average of the AUCs at that time point generated by the 10-fold CV. From Figures 8 and 9, we can see that at almost all time points and for all the datasets, the AUC values of the DNNSurv model were consistently higher than those of the other three models. These results demonstrate that the DNNSurv model performed the best in terms of the time-dependent AUC compared to the other three models (i.e., RSF, Cox-LASSO, and Cox-Ridge). Moreover, we found that for each dataset, the performance of the DNNSurv model according to the time-dependent AUC in Figure 9 was overall better than that in Figure 8. In addition, Table 3 summarizes the 10-fold CV C-index, together with the Brier score and AUC at a specified time point, for all four survival models under the two covariate cases (G and G + C) for each dataset. The results are the means and standard deviations (SDs) of the C-index and of the Brier score and AUC values at 5 years for the EMTAB386 dataset, 3.5 years for the GSE49997 dataset, and 8.5 years for the TCGAmirna dataset, based on the 10 test datasets.

From Table 3, we found that for each of the three datasets, the performance of the DNNSurv model was overall the best among the four models in terms of the three evaluation measures. For example, for the EMTAB386 dataset in the G + C case, the five-year AUC of the DNNSurv model was 0.639, the largest value (i.e., the best performance) among the four models. Similarly, for the GSE49997 dataset with gene features only, the 3.5-year Brier score of the DNNSurv model was 0.231, much smaller (i.e., better performance) than those of the other three models (i.e., RSF, Cox-LASSO, and Cox-Ridge). According to the C-index, the DNNSurv model performed best in both the G and G + C cases of each dataset. In particular, in terms of the C-index, for each dataset, the DNNSurv model performed better in the G + C case, which contained the additional clinical variables, than in the G case. These results again confirm the results shown in Figures 4 and 5.
In summary, from Table 3 and Figures 4-9, we can see that the DNNSurv model performed overall the best as compared to the three ML models (i.e., RSF, Cox-LASSO, and Cox-Ridge) in terms of the three evaluation measures on all the datasets we used here.

Discussion
In this paper, we successfully applied the DNNSurv model to real HD and ultra-HD survival datasets and effectively evaluated its ability to make accurate dynamic survival predictions. The results of the data analysis demonstrated that the DNNSurv model outperformed the three ML survival models (i.e., RSF, Cox-LASSO, and Cox-Ridge) in terms of the three evaluation measures (i.e., C-index, the time-dependent Brier score, and the AUC).
However, there were still some limitations to the DNNSurv model. For example, it took considerable time to run the DNNSurv model using Keras and TensorFlow. As an example, on a personal computer with an Intel Core i7-8700 3.20-GHz processor and 16 GB RAM, for the EMTAB386 dataset with gene features only, the training time for the DNNSurv model was about 50 min, whereas the training times for RSF, Cox-LASSO, and Cox-Ridge were about 35 s, 5.5 s, and 3.6 s, respectively. In addition, the prediction performance can be sensitive to the hyperparameter settings. Developing a unified procedure for finding the optimal hyperparameters, including the tuning parameter, would be interesting future work. Furthermore, an extension of the DNNSurv model to advanced survival models (e.g., the frailty model [1]) with clustered time-to-event data would also be interesting further work.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Publicly available datasets were analyzed in this study. These data can be found here: https://www.bioconductor.org/ (accessed on 1 April 2021).

Conflicts of Interest:
The authors declare no conflict of interest.