Survival Analysis with High-Dimensional Omics Data Using a Threshold Gradient Descent Regularization-Based Neural Network Approach

Analysis of data with a censored survival response and high-dimensional omics measurements is now common. Most of the existing analyses are based on specific (semi)parametric models, in particular the Cox model. Such analyses may be limited by not having sufficient flexibility, for example, in accommodating nonlinearity. For categorical and continuous responses, neural networks (NNs) have provided a highly competitive alternative. Comparatively, NNs for censored survival data remain limited. Omics measurements are usually high-dimensional, and only a small subset is expected to be survival-associated. As such, regularized estimation and selection are needed. In the existing NN studies, this is usually achieved via penalization. In this article, we propose adopting the threshold gradient descent regularization (TGDR) technique, which has competitive performance (for example, when compared to penalization) and unique advantages in regression analysis, but has not been adopted with NNs. The TGDR-based NN has a highly sensible formulation and an architecture different from the unregularized and penalization-based ones. Simulations show its satisfactory performance. Its practical effectiveness is further established via the analysis of two cancer omics datasets. Overall, this study can provide a practical and useful new way in the NN paradigm for survival analysis with high-dimensional omics measurements.


Introduction
In the study of complex diseases, data with a censored survival response (overall survival, progression-free survival, time to metastasis, etc.) and high-dimensional omics measurements (gene expression, SNPs, methylation loci, etc.) are now routine. When the dimensionality of omics measurements is high (compared to sample size), regularized estimation is needed. For many practical scenarios (for example, as in our data analysis), it is expected that only a subset of the omics measurements is associated with survival. As such, variable/model selection is needed. In the past two decades, many regularized estimation and selection methods have been developed, and techniques extensively adopted include penalization [1,2], thresholding [3], sparse boosting [4], Bayesian [5], and others [6,7]. Most of the existing studies are based on specific (semi)parametric models, especially the Cox model [8]. Here, we note that some "model-free" methods, for example, the one based on U-statistics [9], are still based on model assumptions, albeit very weak ones.
In recent years, it has been increasingly recognized that model-based approaches may not be sufficiently flexible, especially in accommodating unknown nonlinear effects. The neural network (NN) technique [10], with its superior flexibility and prediction performance, has drawn increasing attention and provided a highly competitive alternative to regression. Most of the existing NNs are for categorical and continuous responses [11-15].
We adopt the Cox model, which is perhaps the most popular survival model and has conditional hazard function:

λ(t | z_i) = λ_0(t) exp(β^T z_i),  (1)

where λ_0(t) is the unknown baseline hazard function, and β is the vector of unknown regression coefficients. With t_i and δ_i denoting the observed time and event indicator for sample i, the log partial likelihood function is:

l(β) = ∑_{δ_i = 1} [ β^T z_i − log ∑_{t_j ≥ t_i} exp(β^T z_j) ].

Following the strategy of quite a few published studies, we build an NN with a loss function motivated by the above log partial likelihood function. The proposed NN consists of one input layer, one or multiple hidden layers, and one Cox output layer. For each sample, we feed the p-dimensional vector of measurements z_i into the input layer and obtain the log hazard ratio θ_i from the output layer. To simplify notation, we first describe using one hidden layer. For the hidden layer with h nodes, the tanh activation function is used, which leads to satisfactory performance in our numerical study; here, we note that other activation functions may be needed for other data settings. The forward propagation of the entire NN is:

θ_i = W_2 σ(W_1 z_i + b),

where W_1 is the weight matrix between the input layer and hidden layer, with size h × p, W_2 is the 1 × h weight matrix between the hidden layer and the output layer, b is the bias term for the hidden layer, and σ is the tanh activation function:

σ(x) = (e^x − e^{−x}) / (e^x + e^{−x}).

Finally, for the cost part of the network, we adopt:

l(W_1, W_2, b) = ∑_{δ_i = 1} [ θ_i − log ∑_{t_j ≥ t_i} exp(θ_j) ],

which has been motivated by the log partial likelihood. This loss has also been adopted in some recent literature [21,22]. The overall structure of the proposed network is shown in Figure A1 (Appendix A).
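To make the forward propagation and the Cox-motivated objective concrete, the following is a minimal NumPy sketch. The function names and the cumulative-sum computation of the risk-set term are our own illustrative choices, not the authors' implementation:

```python
import numpy as np

def forward(Z, W1, b, W2):
    """Forward propagation of the one-hidden-layer network:
    theta_i = W2 tanh(W1 z_i + b)."""
    H = np.tanh(Z @ W1.T + b)   # (n, h) hidden-layer activations
    return H @ W2.ravel()       # (n,) log hazard ratios theta_i

def log_partial_likelihood(theta, time, event):
    """Cox log partial likelihood used as the network objective:
    sum over events of theta_i - log(sum over the risk set of exp(theta_j)),
    with risk set {j : t_j >= t_i}."""
    order = np.argsort(-time)                     # sort by decreasing observed time
    theta_s, event_s = theta[order], event[order]
    # cumulative sum over the risk set: samples with t_j >= t_i
    log_risk = np.log(np.cumsum(np.exp(theta_s)))
    return float(np.sum((theta_s - log_risk)[event_s == 1]))
```

The network is trained to maximize this quantity (equivalently, to minimize its negative); a production implementation would use a numerically stable log-sum-exp and handle ties in the observed times.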

TGDR-Based Optimization
First developed for linear regression [3] and later extended to many other regression models [30,31,35], TGDR is a generically applicable regularized optimization technique.
The key strategy is that, in gradient-based optimization, in each iteration, gradients are compared and important parameters with large gradients identified. Then, estimates are updated only for those important parameters. Thresholding and variable selection are conducted in each iteration, and hence the final estimates can have sparsity and regularization. Building on the NN structure shown in Figure A1 (Appendix A), in the upper panel of Figure 1, we schematically present the TGDR-based estimation, where the light dotted lines indicate that the connection parameters are zeros, and the light-colored nodes have no effect on the output. Denote z_i = (z_i^1, z_i^2, ..., z_i^p). Further, let 0 ≤ τ ≤ 1 denote the threshold, which is a tuning parameter. The TGDR-based algorithm proceeds as follows:
(1) Denote h as the number of hidden layer nodes. Initialize k = 0, the h × p matrix W_1 = 0, the 1 × h matrix W_2 = 0, b ∼ N(0, 0.01I), learning rate α = 0.01, and learning rate decay η = 0.99.
(2) Compute θ_i = W_2 σ(W_1 z_i + b) and l(W_1, W_2, b) = ∑_{δ_i = 1} [ θ_i − log ∑_{t_j ≥ t_i} exp(θ_j) ].
(3) Compute the gradients ΔW_2, ΔW_1, and Δb of l(W_1, W_2, b) with respect to W_2, W_1, and b.
(4) Compute the indicator matrices L_2 and L_1, where the jth element of L_2 is I(|ΔW_2^(j)| ≥ τ max_j |ΔW_2^(j)|) for j = 1, ..., h, and the (k, j)th element of L_1 is I(|ΔW_1^(k,j)| ≥ τ max_j |ΔW_1^(k,j)|) for k = 1, ..., h and j = 1, ..., p.
(5) Update W_2 ← W_2 + α L_2 ⊗ ΔW_2, W_1 ← W_1 + α L_1 ⊗ ΔW_1, and b ← b + α Δb, where the product of L and ΔW is component-wise. Set α ← ηα.
(6) Repeat Steps (2)-(5) K times, where K is another tuning parameter.
In Step (1), the initialization of the NN parameters is different from other networks. Here, parameters W_1 and W_2 need to be initialized as zero, which ensures sparsity in the subsequent optimization. Simultaneously, b is initialized as a random vector with components close to zero to facilitate activating all hidden layer nodes. In Step (2), we carry out forward propagation of the NN. The activation function in the hidden layer can increase the ability of the network to solve nonlinear problems and prevent gradient explosion. In Step (3), the gradients of the connection parameter matrices are calculated.
In Step (4), we set the threshold according to the maximum value of the gradients between the hidden layer and the output layer; this is similar to the operation in regression. In addition, the threshold of the gradients between the input layer and the hidden layer is determined by the maximum value of the gradients corresponding to each hidden layer node. In Step (5), the parameters are updated based on the threshold values generated in Step (4): only the important ones are updated. When a parameter's gradient remains below the threshold in all iterations, its final estimate will be exactly zero, thus achieving sparsity and regularization. In the lower panel of Figure 1, we show a small example with p = 8, h = 7.
On the left are the gradients between the hidden layer and the output layer, and on the right are the gradients between the input layer and the hidden layer. With the threshold, only a small number of parameters are updated. We note that sparsity here is more "transparent" than penalization and some other regularization approaches.
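The thresholding in Steps (4) and (5) can be sketched as follows. This is a minimal NumPy illustration under our reading of the algorithm (row-wise thresholding for the h × p matrix, which reduces to thresholding against the overall maximum for the 1 × h matrix); the function names are our own:

```python
import numpy as np

def tgdr_mask(grad, tau):
    """Indicator matrix L of Step (4): an entry is 1 when the absolute
    gradient reaches tau times the row-wise maximum absolute gradient."""
    g = np.abs(np.asarray(grad, dtype=float))
    return (g >= tau * g.max(axis=1, keepdims=True)).astype(float)

def tgdr_update(W, grad, tau, alpha):
    """Step (5): update only the thresholded entries,
    W <- W + alpha * (L component-wise-times gradient)."""
    return W + alpha * tgdr_mask(grad, tau) * grad
```

With τ close to 1, only the few entries with the largest gradients are updated in each iteration, which is what yields exact zeros in the final estimates.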
The final network structure is jointly determined by tuning parameters τ and K. A smaller value of τ and a larger value of K lead to denser estimates. In regression, it has been recognized that τ ≈ 0 leads to ridge-type estimation and τ ≈ 1 leads to lasso-type estimation. In numerical studies, we select τ and K via cross-validation. When it is desirable to further reduce computational cost for very high-dimensional data, τ can be fixed at a large value close to 1. Here, it is noted that parameters W_1 and W_2 in the model cannot be used for variable selection directly. We obtain a new 1 × p vector W_12 by the matrix multiplication of W_2 and W_1, and then select variables correspondingly: a variable is selected if its element of W_12 is nonzero.
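The composite-weight selection rule can be written in a few lines. A sketch, with a tolerance parameter of our own choosing to guard against floating-point noise:

```python
import numpy as np

def selected_variables(W1, W2, tol=1e-8):
    """Variable selection via the composite weights W12 = W2 W1 (a 1 x p
    vector): variable j is selected when its composite weight is nonzero."""
    W12 = (np.asarray(W2) @ np.asarray(W1)).ravel()
    return np.flatnonzero(np.abs(W12) > tol)
```

Note that a variable can be dropped even when some of its first-layer weights are nonzero, if the contributions cancel after multiplication by W_2.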
As the gradient update in each iteration is sparse, the proposed approach can be computationally less demanding than some of the existing ones. To further fix ideas, we consider a simulated dataset with sample size 2000, dimension of input 1000, number of hidden layer nodes 10, and number of iterations 200. With a "standard" configuration (CPU: Intel Core i9-8950HK, RAM 32.0 GB, GPU NVIDIA GeForce GTX 1080), the optimization of the proposed NN takes 6.50 s, about 14% more efficient than its key competitor. Additional comparisons are presented in Table A1 in Appendix A. Computational cost is expected to depend on sample size, number of nodes, and other parameters in a similar way as for other NNs.
To facilitate data analysis, the computer code implementing the proposed method is publicly available at: https://github.com/shuanggema/CNTsurv, accessed on 17 August 2022.

Simulation
We conduct simulation to better comprehend the effectiveness of the proposed method. Here, it is noted that many NN studies only analyze real data without a simulation study. Simulated data, although they may be somewhat simplified, can be advantageous with known generating mechanisms, which facilitates an objective evaluation and comparison. To better gauge its performance, we compare the proposed method against: (a) Cox-nnet with lasso (CNL), which is an NN approach based on the same Cox model-based loss function. Lasso, as a representative of penalization, is applied for regularization and sparsity. Similar methods have been adopted in the literature [21,22]; (b) Cox regression with TGDR (CT), which conducts Cox regression analysis and applies TGDR for regularization and sparsity; (c) Cox regression with lasso (CPH), which conducts Cox regression analysis and applies lasso for regularization and sparsity; and (d) random survival forest (RSF), which applies the random forest technique to survival analysis and has competitive performance. For brevity, we refer to the proposed method as CNT, where T stresses TGDR. We recognize that many approaches can be applied to analyze the simulated data. The above four are among the most relevant. Specifically, comparing with CNL can directly establish the advantage of TGDR, and comparing with CT can directly establish the advantage of NN. CPH and RSF represent popular and competitive existing approaches.
We compare the proposed and alternative methods in terms of prediction and variable-selection performance. In prediction evaluation, we first generate the training data and build models using the proposed and alternative methods. Then, a testing dataset is generated under the same settings. We make predictions for the testing-set samples using the training-data models and evaluate prediction performance using the C-index. This measure ranges between 0 and 1, with a larger value indicating better prediction. When evaluating variable-selection performance, we adopt the popular strategy with which a variable is identified as important if it has a nonzero effect; we again refer to Figure 1 for a schematic presentation. Consider the true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN) of variable selection under the selected tuning parameter values. We first calculate precision P = TP / (TP + FP) and recall R = TP / (TP + FN). The first variable-selection accuracy measure is the F1 score, defined as F1 = 2P·R / (P + R), which is the α = 1 case of the F measure. Here, a larger value indicates more accurate selection.
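The two per-replicate evaluation measures can be sketched as follows; the O(n^2) C-index loop is our own plain illustration (Harrell's definition, with risk ties counted as 0.5), not the implementation used in the paper:

```python
def c_index(time, event, risk):
    """Harrell's C-index: among usable pairs (the sample with the shorter
    time experienced an event), the fraction where that sample also has the
    higher predicted risk; risk ties count 0.5."""
    conc, usable = 0.0, 0
    for i in range(len(time)):
        if not event[i]:
            continue  # only pairs whose earlier time is an event are usable
        for j in range(len(time)):
            if time[j] > time[i]:
                usable += 1
                conc += 1.0 if risk[i] > risk[j] else (0.5 if risk[i] == risk[j] else 0.0)
    return conc / usable

def f1_score(tp, fp, fn):
    """F1 = 2 P R / (P + R), i.e., the alpha = 1 case of the F measure."""
    p, r = tp / (tp + fp), tp / (tp + fn)
    return 2 * p * r / (p + r)
```

For large datasets, an O(n log n) C-index implementation (as in standard survival-analysis libraries) would replace the double loop.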
Additionally, to get a "global picture" of variable selection, we follow the literature [36], consider a sequence of tuning parameter values, evaluate variable-selection performance (TP and FP) at each tuning parameter value, and generate the receiver-operating characteristic (ROC) curve and calculate the area under the curve (AUC) for evaluation. AUC ranges between 0.5 and 1, with a larger value indicating higher accuracy. When implementing CNL and CNT, for the hyperparameters, we consider: one hidden layer (noting that both approaches can be directly extended to multiple hidden layers), 10 nodes in the hidden layer, tanh activation function, initial learning rate α = 0.01, and learning rate decay η = 0.99. Cross-validation is applied to select the threshold and number of iterations for CNT and CT, the penalization parameter for CNL and CPH, and the number of trees for RSF.
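The selection AUC over a tuning sequence can be computed as follows. This is a sketch under our own simplifications (each tuning value contributes one (FPR, TPR) point, and the endpoints (0, 0) and (1, 1) are appended before applying the trapezoidal rule):

```python
import numpy as np

def selection_auc(fpr, tpr):
    """AUC of the variable-selection ROC curve traced out by a sequence of
    tuning parameter values, computed by the trapezoidal rule."""
    fpr = np.concatenate(([0.0], np.asarray(fpr, dtype=float), [1.0]))
    tpr = np.concatenate(([0.0], np.asarray(tpr, dtype=float), [1.0]))
    order = np.argsort(fpr)
    f, t = fpr[order], tpr[order]
    return float(np.sum(np.diff(f) * (t[1:] + t[:-1]) / 2.0))
```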
For NN methods in general, the number of hidden layers may have an impact on model performance. An increase in the number of hidden layers means increased network complexity and computational cost. Usually, it is recommended that, when sample size is sufficiently large and signals are dense, multiple hidden layers are needed to achieve sufficient model complexity. For the proposed settings with a limited sample size and sparse signals, one hidden layer may be preferred. In Table A2 (Appendix A), we consider Simulation 1 and Scenario 4 (details described below) and 1-3 hidden layers. It is observed that one hidden layer may be sufficient-here, we do note that more hidden layers may be needed for other data settings.
We comprehensively consider three simulation settings. Under Simulation setting 1, we further consider four scenarios with different conditional hazard functions.
Under Simulation 1, all linear effects have "equal effects" with regression coefficients 1. Under Simulation 2, half of the linear effects have relatively strong signals with regression coefficients 1, and the other half have relatively weak signals with regression coefficients 0.3. Under Simulations 1 and 2, Scenario 1 contains only linear effects, which favors the CT and CPH approaches. Scenario 2 contains linear and quadratic effects. Scenario 3 contains linear effects and interactions. Scenario 4 contains linear and quadratic effects, as well as interactions. Scenarios 2-4 have been designed to demonstrate the NN's flexibility in accommodating unspecified nonlinear effects. For these simulation settings, we have considered multiple values of data dimensionality and sample size. The relatively low-dimensional settings (for example, with p = 50) can shed light on the performance of the proposed approach with "regular" survival data. We note that the dimensionality is not ultrahigh. In practical data analysis, to improve stability, screening is often conducted with ultrahigh-dimensional omics measurements to reduce dimensionality to a more manageable level. This may be especially needed for NNs, which have significantly more parameters and hence lower stability. In Simulations 1 and 2, the number of linear effects is fixed. In addition, the number of quadratic terms and that of interactions are fixed at 2. Further, in Simulation 3, we fix the total data dimensionality, as well as the number of quadratic terms and that of interactions, and vary the number of linear effects.
The results based on 200 replicates are numerically summarized in Tables A3-A20 (Appendix A) and graphically summarized in Figures A2-A7 (Appendix A). It is observed that the proposed method has competitive performance: it outperforms the four alternatives under most of the settings. When sample size increases, in general, the performance of all methods improves (sometimes significantly). When dimensionality increases, performance may deteriorate, although in general not significantly. Under simpler settings, CT may excel; similar observations have been made in the literature [22]. However, the proposed method still has competitive performance. Consider, for example, Simulation 1, Scenario 1, p = 1000, N = 1000, and the first type of correlation structure. The five methods have average C-index values 0.917 (CNT), 0.91 (CNL), 0.919 (CT), 0.718 (CPH), and 0.845 (RSF), respectively, and the average F1 values are 0.895 (CNT), 0.284 (CNL), 0.995 (CT), 0.044 (CPH), and 0.033 (RSF), respectively. As data become more complicated, the advantages of the proposed method become more obvious. Consider, for example, Scenario 4 with otherwise the same settings as above, for which the advantage of the proposed method is more pronounced (Tables A3-A20, Appendix A).

Data Analysis
We further test the proposed method using a gene expression dataset obtained from GEO and a DNA methylation dataset obtained from TCGA.

High-Grade Serous Ovarian Cancer Data
The first dataset is on high-grade serous ovarian cancer (HGSOC) and was obtained from GEO (GSE132342; https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE132342, accessed on 17 May 2022). This dataset contains records on 3769 HGSOC patients, and 513 gene expression measurements are available for each patient. Among the 3769 patients, 2475 died during follow-up, and the survival times ranged from 0.03 to 117.67 months, with a median of 33.35 months. The rest were right-censored, and the follow-up times ranged from 0.03 to 117.74 months, with a median of 69.61 months. Additional sample characteristics are described in Table A21 (Appendix A).
When implementing the proposed and alternative NN methods, we consider the following hyperparameter specifications: one hidden layer, 20 nodes in the hidden layer, tanh activation function, initial learning rate α = 0.01, and learning rate decay η = 0.99. For all the methods, the other tuning parameters are selected using cross-validation. The numbers of identified variables are 99 (CNT), 132 (CNL), 26 (CT), 54 (CPH), and 73 (RSF). It is observed that different methods lead to different identification results. More details are available from the authors.
With practical data, there is a lack of an objective way of assessing variable-identification performance. We resort to a random splitting approach, which has been widely adopted in the literature [37], to evaluate prediction performance and stability of selection (which can provide "indirect" support to the validity of modeling). Specifically, we randomly split the data into a training set and a testing set of equal sizes. Here, as the effective sample size (number of events) is limited, we choose a relatively large testing set (rather than the more common smaller testing proportion) to ensure the stability of evaluation. Models are constructed using the training data, and prediction is made for samples in the testing data. With each random splitting, we can obtain the C-index value from the testing data and a set of identified genes. With 100 random splittings, the proposed approach has a mean C-index of 0.650 with SD = 0.003, compared to 0.627 (SD 0.011) for CNL, 0.638 (SD 0.012) for CT, 0.634 (SD 0.007) for CPH, and 0.610 (SD 0.007) for RSF. In Figure 2, we show the box plots of the C-index values. In the stability evaluation, the same 99 genes (Table A22, Appendix A) are identified in all the random splittings, which suggests a high level of stability. Stability information on the alternatives, which is not as satisfactory as that of the proposed approach, is available from the authors. Among these 99 genes, many have been previously suggested as relevant for ovarian cancer, suggesting the "biological validity" of our analysis. For example, studies have suggested that TAP1 is involved in the antigen-presenting pathway and positively associated with ovarian cancer survival [38]. Additionally, hypomethylation of TAP1 is associated with the prolongation of disease-recurrence time. ZFHX4 may play a role in neural and muscle differentiation and be involved in transcriptional regulation. CXCL9 is associated with higher lymphocytic infiltration, which is a characteristic of the immunoreactive HGSOC molecular subtype.
FBN1 is an extracellular matrix protein. As a biomarker of early recurrence in patients with ovarian cancer, it is closely related to connective tissue proliferation in HGSOC. TGER3 has been suggested as associated with relapse-free survival of HGSOC. DH1B belongs to a pathway that promotes ovarian cancer cell infiltration [39]. C10orf82 has been previously identified as associated with ovarian cancer overall survival [40]. The overexpression of COL11A1 has been found to be associated with the progression of ovarian cancer [41,42]. In terms of mechanism, it confers chemoresistance on ovarian cancer cells through the activation of the Akt/c/EBPβ pathway [43]. CXCL10 has a tumor-suppressive function by TIL recruitment in ovarian cancer [44]. FABP4 is a key factor of ovarian cancer and has strong metastatic potential [45]. LGALS4 has been suggested as an early diagnostic marker of mucinous ovarian cancer [46]. LBP has been validated as a diagnostic biomarker of ovarian cancer [47]. Many other genes also have strong implications, and we omit their discussions here.
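The random-splitting evaluation described above can be sketched as follows. The callback `fit_and_evaluate` is a hypothetical stand-in for any of the compared methods and is assumed to return a testing-set C-index together with the indices of the identified variables; the function names are our own:

```python
import numpy as np

def random_splitting_eval(n, n_splits, fit_and_evaluate, seed=0):
    """Equal-size random train/test splits; per split, record the testing
    C-index and count how often each variable is selected."""
    rng = np.random.default_rng(seed)
    c_values, selection_counts = [], {}
    for _ in range(n_splits):
        perm = rng.permutation(n)
        train_idx, test_idx = perm[: n // 2], perm[n // 2:]
        c, selected = fit_and_evaluate(train_idx, test_idx)
        c_values.append(c)
        for v in selected:
            selection_counts[v] = selection_counts.get(v, 0) + 1
    return float(np.mean(c_values)), float(np.std(c_values)), selection_counts
```

The selection counts directly yield the stability measure reported above (a gene selected in all splittings has count equal to the number of splittings).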

Breast Invasive Carcinoma Data
The second dataset is on breast cancer with DNA methylation measurements and was obtained from TCGA (https://portal.gdc.cancer.gov/projects/TCGA-BRCA, accessed on 6 September 2022). The pipeline for processing TCGA DNA methylation data has been described in the literature [48]. The analyzed data contain records on 1093 breast cancer patients. For methylation loci, we take a "candidate gene" approach and focus on the 63 genes that are included in three gene-testing platforms (Oncotype DX [49], MammaPrint [50], and EndoPredict [51]). Among the 1093 patients, 152 died during follow-up, and the survival times ranged from 0 to 248.5 months, with a median of 38.23 months. For the 941 censored patients, the follow-up times ranged from 0 to 286.83 months, with a median of 25.33 months. Additional sample characteristics are described in Table A23 (Appendix A).
Data are analyzed using the proposed and four alternative methods in the same way as above. The only difference is that, with a lower dimensionality, the number of nodes in the hidden layer is set as 10. Detailed estimation results are available from the authors. In the evaluation of prediction, the C-index values based on 100 random splittings are 0.725 (0.024) for the proposed method, 0.646 (0.039) for CNL, 0.652 (0.03) for CT, 0.671 (0.036) for CPH, and 0.562 (0.074) for RSF. The box plots are shown in Figure 2. Also in Figure 2, we show the ten genes with the highest selection stability. Among them, BAALC, REXO2, and CDCA8 are selected in over 60 splittings. Of the ten, BAALC, REXO2, CDCA8, HSPA14, CLDN1, SPARCL1, SMOC2, and C1S are from MammaPrint; SCUBE2 is from Oncotype DX; and RPL37A is from EndoPredict. The alternative approaches have less satisfactory stability (details omitted here).
Among the identified genes, BAALC is a protein-coding gene and associated with acute leukemia and other cancers. REXO2 may play a role in DNA repair, replication, and recombination, as well as RNA processing and degradation. It may also participate in the resistance of human cells to UVC-induced cell death through its role in the process of DNA repair. CDCA8 encodes a component of the chromosome passenger complex. This complex is an important regulator of mitosis and cell division. Its encoded protein is regulated by the cell cycle and required for chromatin-induced microtubule stabilization and spindle formation. HSPA14 has multiple functions, including ATP binding and the binding of misfolded and unfolded proteins. SCUBE2 enables calcium ion binding activities and is involved in signal transduction. It may also act upstream of or within several processes, including positive regulation of chondrocyte proliferation. RPL37A encodes a ribosomal protein that is a component of the 60S subunit. This protein belongs to the L37AE family of ribosomal proteins.

Discussion
In this study, we have developed a new NN-based modeling strategy for data with a censored survival response and high-dimensional omics measurements. A new regularized estimation and selection method has been developed based on the TGDR technique. Considering the superiority of the NN over regression, the relatively limited NN developments for censored survival data with high-dimensional covariates, and the satisfactory performance of TGDR in regression, the proposed methodological development is warranted. The proposed method has an intuitive formulation, computational efficiency, and satisfactory performance in simulation. Here, it is noted that the proposed method is not uniformly dominant in simulation, which is as expected. Its superiority becomes more prominent as data complexity increases. Its effectiveness has been further demonstrated using two cancer studies.
This study can be potentially extended in multiple ways. In our numerical studies, a single hidden layer has been adopted and has led to satisfactory results. Our limited exploration has suggested that, for our data settings, additional hidden layers may not be needed; however, this may not be true for other data. When determining the number of hidden layer nodes, we have taken sample size and the number of input variables into consideration. Our literature search has not suggested an objective criterion for determining node number. Searching over a few candidate values and selecting the optimal (for example, with cross-validation) may be needed. Additionally, following the extension of TGDR to other regression settings, we can also develop TGDR-based NNs for other data/model settings. Last, but not least, more data analysis using the proposed method may be warranted in future research.