Kidney Cancer Biomarker Selection Using Regularized Survival Models

Clear cell renal cell carcinoma (ccRCC) is the most common subtype of RCC showing a significant percentage of mortality. One of the priorities of kidney cancer research is to identify RCC-specific biomarkers for early detection and screening of the disease. With the development of high-throughput technology, it is now possible to measure the expression levels of thousands of genes in parallel and assess the molecular profile of individual tumors. Studying the relationship between gene expression and survival outcome has been widely used to find genes associated with cancer survival, providing new information for clinical decision-making. One of the challenges of using transcriptomics data is their high dimensionality which can lead to instability in the selection of gene signatures. Here we identify potential prognostic biomarkers correlated to the survival outcome of ccRCC patients using two network-based regularizers (EN and TCox) applied to Cox models. Some genes always selected by each method were found (COPS7B, DONSON, GTF2E2, HAUS8, PRH2, and ZNF18) with known roles in cancer formation and progression. Afterward, different lists of genes ranked based on distinct metrics (logFC of DEGs or β coefficients of regression) were analyzed using GSEA to try to find over- or under-represented mechanisms and pathways. Some ontologies were found in common between the gene sets tested, such as nuclear division, microtubule and tubulin binding, and plasma membrane and chromosome regions. Additionally, genes that were more involved in these ontologies and genes selected by the regularizers were used to create a new gene set where we applied the Cox regression model. With this smaller gene set, we were able to significantly split patients into high/low risk groups showing the importance of studying these genes as potential prognostic factors to help clinicians better identify and monitor patients with ccRCC.


Introduction
Renal cell carcinoma (RCC) is the most common type of kidney cancer [1]. In 2020, it was the 14th most common type of cancer (431,288 new cases), with a significantly higher incidence in developed countries, and the 15th most deadly one (179,368 deaths) worldwide [2]. These numbers of incidence and mortality are expected to increase to 666,000 and 301,000, respectively, by 2040 [3]. Among the different pathological subtypes of RCC, clear cell renal cell carcinoma (ccRCC) is the most common. ccRCC is known to have a significant percentage of mortality due to the high rate of metastasis and resistance to both radiotherapy and chemotherapy [4]. These, coupled with the fact that ccRCC early stage diagnosis is highly correlated to improved survival rate (83% at stage I vs. 6% at stage IV), make one of the priorities of kidney cancer research to identify RCC-specific biomarkers for early detection and screening of the disease [5]. This may help reduce both patient and healthcare systems' burden due to kidney cancer.
Nowadays, there are no effective biomarkers for early diagnosis of RCC in clinical treatment, and the molecular mechanism of RCC metastasis remains unclear [6]. With the development of high-throughput techniques such as RNA-sequencing (RNA-seq), it is now possible to measure the expression levels of thousands of genes in parallel. This has been widely used for the discovery of new biomarkers in oncology [7], where the molecular profiling of individual tumors may help us find putative genes that will likely improve diagnosis and treatment outcome and guide patient selection for targeted therapies [8].
In particular, the relationship between gene expression and survival outcomes, such as the time to death, has been widely studied. Identifying genes associated with cancer survival may provide new information for clinical decision-making, diagnosis, prognosis, and treatment options of patients [7,9,10].
Survival analysis studies the time until an event of interest occurs, such as death. The Cox proportional hazards models have been used to discover attributes that are related to survival and predict the outcome [11]. These survival regression models describe the relationship between the survival times and a set of covariates [12] such as genes. Some studies show that molecular biomarkers may be more effective in predicting survival outcomes than clinical parameters such as tumor stage and grade [13]. Nonetheless, one of the main challenges of using gene expression data is their high dimensionality, since the number of covariates is much larger than that of observations. This can lead to instability in the selection of gene signatures, making the selection of novel biomarkers a difficult task [14]. Dealing with the high dimensionality of patients' data represents a largely unsolved problem and several regularization methods have been proposed to tackle this challenge.
The Lasso penalty in the Cox model was proposed in 1997 [15], providing a solution with few variables selected. However, one of the problems of using this regularization is that when there are several correlated variables, Lasso randomly selects only one. More recently, to overcome this limitation, the elastic net that combines both Lasso and ridge penalties applied to Cox regression model was proposed [16,17]. The Lasso penalty chooses only a few nonzero coefficients, while a ridge regression scales all the coefficients towards zero [18]. Besides these methods, network-based regularizers have also been suggested. Since gene expression features may be connected through a graph structure where vertices are genes and edges a weighted relation between them, incorporating network information as a constraint in the loss function can be used to improve models' performance and interpretability. Such network information can be either obtained by functional knowledge (e.g., protein-protein interaction network) available in public databases or a weight attributed to each gene based on data itself [9]. Other methods include using network information from gene coexpression [19] and also applying a network constraint based on gene correlation patterns between two groups of interest (e.g., normal and tumor) [20], moving towards more meaningful biological solutions.
Altogether, the application of Cox regularized models to survival data with RNA-seq covariates allows the identification of gene expression signatures and may enable the identification of targeted therapeutics and genes that can serve as predictive or prognostic biomarkers in kidney cancer [21].
Here we try to identify potential prognostic biomarkers correlated to the survival outcome of ccRCC patients. Several putative genes were found using network-based regularizers applied to Cox models and these were analyzed using gene ontology (GO) classification to try to find potential enriched mechanisms and pathways. GO (http://geneontology.org, accessed on 1 June 2022) is the most comprehensive and widely used knowledge base concerning the functions of genes, describing the biological role of genomic products (e.g., genes) by classifying them according to their molecular functions (MF), biological processes (BP) and cellular components (CC) [22]. These may be used to perform a gene set enrichment analysis (GSEA), which allows identifying over-and under-represented functional biological groups within a list of genes. We believe that these genes could serve as prognostic factors to help clinicians better identify and monitor patients with ccRCC.

Datasets
The data used in this study was retrieved from The Cancer Genome Atlas (TCGA) Research Network: https://www.cancer.gov/tcga (accessed on 1 May 2022) Both gene expression profile (RNA-seq) and clinical data from kidney renal clear cell carcinoma (KIRC) patients were used. Regarding RNA-seq, the initial size of the dataset was 20,501 genes for 606 samples, containing both tumor (n = 529) and normal (n = 77) tissue. After filtering genes with a null standard deviation, a total of 19,819 genes remained. Regarding clinical data, 537 samples had information on age, status (dead = 1 or alive = 0), days to death, days to the last follow-up, stage, T-stage, N-stage, M-stage, sex, and race (summarized in Table 1). The staging system used here was the American Joint Committee on Cancer (AJCC) TNM Classification of Malignant Tumors (TNM) system, which is based on the size of the tumor (T), the spread to nearby lymph nodes (N), and the spread to distant sites (M). Data were downloaded from TCGA using the package RTCGAToolbox [23], which in turn processes RNA-seq with RSEM [24] that estimates gene and isoform expression levels. Data were then log transformed using the function log 2 (x + 1), which is a standard procedure when working with RNA-seq data to stabilize the variance across mean values [25].

Differential Gene Expression
A differential gene expression analysis was performed to assess which genes were differentially expressed between tumor vs. normal tissue and early stage (I, II, and III) vs. advanced stage (IV) of the disease, using the edgeR R package [26][27][28]. This package offers many variants of analysis. Here, we used the quasi-likelihood F-tests method, which is highly recommended for differential expression analysis of bulk RNA-seq data. The top ten most significant genes were examined and a multiplicity correction was performed by applying the Benjamini-Hochberg method on the p-values, to control the false discovery rate (FDR) [29].

Survival Analysis
A survival analysis studies the time until a certain event occurs (e.g., death). When analyzing survival data, two functions that are dependent on time are of particular interest: the survival function (S(t), the probability of surviving at least until time t), and the hazard function (h(t), the conditional probability rate of dying at time t having survived until that time). The Kaplan-Meier (KM, [30]) estimator is used to estimate the survival curve (S(t) against t) from a set of observed follow-up times and a variable indicating if the event took place or not [31]. The log-rank is a nonparametric statistical test used to compare the survival curves between two groups; it tests the null hypothesis that there is no difference between the population survival curves [32,33]. However, this test does not allow other variables to be taken into account. A Cox proportional hazards model [34] is a multiple regression model that allows the study of the relationship between several predictor variables and survival times.

Cox Regression
The Cox proportional hazards model is the most commonly used multivariate approach for analyzing survival time data in medical research. It describes the relationship between the event incidence (death) expressed by the hazard function and a set of covariates (genes) [12]. The Cox model can be expressed as: where h i (t) represents the hazard function of individual i = 1, . . . , n, dependent on a set of p covariates x i = (x i1 , x i2 , . . . , x ip ) T , and β = (β 1 , β 2 , . . . , β p ) are the regression coefficients.
The h 0 (t) term represents the baseline hazard, the value of hazard if all x i are equal to zero. The inference of the β parameters is made by maximizing Cox's partial log-likelihood function, given by: where R i = R(t i ) = {j : t j ≥ t i } denotes the set of all individuals that are at risk at t i , i.e., with a follow-up time greater than or equal to t i , and δ i indicates if the event was observed (δ i = 1) or not (δ i = 0) for patient i.
To address the high-dimensionality problem (n p), sparse penalized Cox's models have been considered, by adding a penalty term, F(β), to the partial log-likelihood, l(β). One example is an elastic net (EN), which uses the 1 -norm (Lasso) and the 2 -norm (ridge) to restrict the solution space by imposing sparsity and small coefficients to the parameters [16]. The EN penalty is defined as: where λ controls the penalizing weight and α the balance between the two norms (α = 0 for ridge regression and α = 1 for Lasso regression).
Network-based regularizers have also been proposed in the context of cancer genomics. The TCox method [20], a correlation-based regularizer, promotes the selection of features (genes) that have distinct correlation patterns in two different groups, such as tumor and normal tissue, highlighting potential differences in the corresponding subnetworks. Given two distinct datasets (tumor-T and normal-N), TCox builds the correlation matrices, , respectively. Each column σ j corresponds to the correlation of gene j with the remaining ones. The measure of gene j dissimilarity between T and N can be defined as: This dissimilarity term is then normalized by their maximum value: , j, k = 1, . . . , p.
In the context of the present study, we were interested in finding genes that exhibited different correlation patterns between tumor and normal tissues, i.e., genes that showed a larger dissimilarity between the two correlation matrices. Therefore, the penalty term used here is given by: where vector q = (w −1 1 , . . . , w −1 j , . . . , w −1 p ) represents the inverse of the normalized distances. These models were built using the R package glmnet [35].

Model Evaluation
A survival analysis was performed using two different regularizations: EN and TCox. For both models, the α parameter was set between α = 0.3 and α = 0.05, which provided a feasible number of features to be further analyzed. All models were generated 100 times to ensure the robustness and stability of the results.
Samples were randomly divided into a training set (70%) for model construction and all data were used for the model evaluation. Both subsets had the same proportion of censored samples (≈67%). The training sets were used to find the best λ value (controls the penalizing weight) using a 10-fold cross-validation. This fixed λ was then used in the survival model applied to the whole dataset. To evaluate the accuracy of the models, the observations were split into two groups defined by the median of the fitted relative risks (high vs. low risk of dying). This allowed us to perform the log-rank test via the Kaplan-Meier estimator, and to assess if we could separate the survival curves of the two groups by evaluating the p-value.
After assessing the mean results obtained by each model for the 100 runs (Table 2), we observed that the best results were obtained when the parameter that controlled sparsity was set to α = 0.1 for both the EN model and TCox. However, to select a similar and adequate number of variables to be further analyzed and interpreted in both regularization methods, gene sets obtained using TCox α = 0.1 (≈51 genes) and EN α = 0.2 (≈48 genes) were used to perform a further analysis. Notwithstanding, different α parameters may be tested to select different gene set sizes, using the code made available.
To perform the analysis described above, the glmnet [35] package was used in R statistical software. The q vector used in the TCox method was introduced as a penalty factor in the glmnet function.

Gene Set Enrichment Analysis
To highlight the enriched functions of biomarkers selected by the models tested, a gene set enrichment analysis (GSEA) was performed using the R package clusterProfiler [36]. The Gene Ontology (GO) [37] knowledge base is the world's largest source of information on the functions of genes and defines classes used to describe gene function regarding three aspects: molecular activities of gene products (molecular function-MF), where gene products are active (cellular component-CC) and pathways made up of multiple gene products (biological process-BP). Given a ranked set of genes, GSEA determines whether the members of the gene set are randomly distributed or if they are primarily found at the top/bottom of the ranked list. There are three key elements of the GSEA method: the enrichment score (ES), which is the degree to which a gene set is over-or under-represented at the top or bottom of the ranked list, the significance level of ES, and the adjustment for multiple hypothesis testing.
GSEA was applied to four different gene sets to test how different metrics used to rank genes affected the selection of enriched ontologies. Firstly, we used as ranking measure the log fold change of DEGs found between tumor vs. normal tissue (GS1) and early vs. advanced stage of the disease (GS2). Afterward, genes were ranked based on β coefficients obtained from Cox's regression model using an elastic net regularization (GS3) or TCox regularization (GS4). Then, the over-/under-represented GOs found were compared between groups, and genes that were more involved in enriched processes were further analyzed.
Lastly, KEGG (Kyoto Encyclopedia of Genes and Genomes) was used to perform a genes' functional analysis [38][39][40]. KEGG is a collection of manually drawn pathway maps representing molecular interaction and reaction networks, covering a wide range of biochemical processes regarding metabolism, genetic information processing, environmental information processing, cellular processes, organism systems, human diseases, and drug development.

Results
Since molecular biomarkers may be a way of predicting the survival outcome of patients, studying the relationship between gene expression and survival phenotype may help us identify prognostic genes related to ccRCC survival, providing new information to help clinical decision making.

Exploratory Analysis
To identify putative biomarkers associated with survival outcomes in ccRCC patients, gene expression and clinical data (described in Table 1) from TCGA were used.
Firstly, a survival analysis using Cox proportional hazards model was performed to investigate the association between the time to death and the explanatory variables selected (stage, T-stage, N-stage, M-stage, sex, and race). In the multivariate Cox analysis, the only statistically significant covariates were the stages. In particular, the higher hazard ratio found was for stage IV, HR ≈ 22 (p-value = 9.17 × 10 −5 ), indicating a strong relationship between the patients' stage and an increased risk of death.
Afterward, to find genes of interest associated with tumor formation and development in ccRCC patients, genes differentially expressed were found between tumor vs. normal tissues and also between the early (I, II and III) and advanced stages (IV) of the disease, using the edgeR package.
Comparing normal and tumor tissues, 13,906 differential expressed genes (DEGs) were found, 7076 up-and 6830 downregulated in tumor tissue. Table 3 shows the top 10 DEGs (ranked by FDR) found. Table 3. List of the ten most significant DEGs found between tumor and normal tissues from KIRC patients. LogFC-log fold change; FDR-false discovery rate.

Genes
LogFC Furthermore, DEGs between the early and advanced stages of the disease were found. A total of 6170 genes were DEGs, 3126 up-and 3044 downregulated in the early stages of the disease. The top 10 genes with the lowest FDR values are listed in Table 4. Table 4. List of the ten most significant DEGs found between early (stages I, II and III) and advanced stages (stage IV) from KIRC patients. LogFC-log fold change; FDR-false discovery rate.

Survival Models
To find possible prognostic markers in kidney cancer, two survival models with different regularization metrics to handle high-dimensional data were used: EN and TCox. Table 2 shows the results obtained for the Cox survival model with an EN penalty for the 100 runs tested with different α values, selecting a different number of variables. As we can see, the lowest p-value was obtained when α = 0.1, selecting a mean of 90 genes (p-value = 0). However, to have a similar gene set size between the two methods used (EN and TCox), the genes that were selected as important in the Cox regression for at least 50% of the 100 runs using α = 0.2 were listed (Table 5) and further analyzed, since those were the ones more correlated with the survival outcome of patients. A total of 50 genes were found. Some of these genes were upregulated and others downregulated (represented by ↑ and ↓, respectively, in Table 5) in tumor tissue or the advanced stage of disease. Here are presented only the top 20 genes selected. Three genes (COPS7B, DONSON, and LOC100272146) were always selected by EN. Table 5. List of the top 20 genes selected by elastic net in at least 50% of the runs when α = 0.2. Arrows represent if genes are upregulated (↑) or downregulated (↓) in tumor tissue or in the advanced stage of the disease and % is the percentage of the runs where a certain gene appears in the solution.

Elastic Net
-genes that are not differentially expressed in tumor tissue.

TCox
The second approach used to try to identify biomarkers associated with the survival outcome of patients was a penalization based on correlation-textttTCox. This regularizer promotes the selection of genes with distinct correlation patterns between the healthy and tumor tissues. Table 2 presents the results obtained for this method for the 100 runs tested. The curves split with the lowest mean p-value was obtained for α = 0.1, with a mean set comprising 47 genes. To assess which genes were more interesting to study, the ones selected at least 50% of the times were found (n = 49). In Table 6 are listed the top 20 most selected genes and how often they are selected (represented by the percentage within the 100 runs tested). Five genes (GTF2E2, HAUS8, PRH2, SEC61A2, and ZNF18) were always selected by the model. Furthermore, comparing the two gene lists obtained for the methods used, four genes were found in common, DONSON, SEC61A2, SNRPA1, and SORBS2.

Gene Ontology
To describe a gene function along with the biological process, molecular function, and cellular component, a GSEA was performed. Gene sets were ranked based on their phenotypes using two different measures, the commonly used log fold change of DEGs and the β coefficient parameter of the regression. A total of four gene sets were tested: GS1-logFC of DEGs found in tumor vs. normal tissues, GS2-logFC of DEGs found between early vs. advanced stages, GS3-genes selected by EN method ranked by β parameter of regression, and GS4-genes selected by the TCox method ranked by the β parameter of the regression.

DEGS Tumor vs. Normal
Firstly, the list of ranked genes based on the logFC between tumor and normal tissues was used (GS1). Enriched ontologies regarding biological processes (BP), molecular function (MF), and cellular component (CC) are presented in Figures 1-3, respectively. For each analysis, enriched terms are listed on the left panel, and genes that belong to the top three enriched categories are shown on the right panel.
Looking closely at the BP characterization, we observed that the most significant ontologies were: urogenital system development (GO:0001655), kidney development (GO:0001822), and positive regulation of cytokine production (GO:0001819). Some of the DEGs associated with these ontologies showing a higher absolute logFC value were PAEP, ORM1, SAA1, and LBP (upregulated) and AQP2, CALB1, and NPHS1/2 (downregulated).
Regarding the MF characterization, the top three ontologies found were immune receptor activity (GO:0140375), cation transmembrane transporter activity (GO:0008324), and inorganic molecular entity transmembrane transporter activity (GO:0015318), and the top DEGs found regarding the logFC were AQP2, SLC12A1, and ATP12A, all downregulated.
Concerning the CC classification, the top enriched ontologies were GO:0009897external side of plasma membrane, GO:0045177-apical part of the cell, and GO:0016324apical plasma membrane. MUC17 was found to be one of the highest upregulated genes and SLC14A1/2, ATP12A, UMOD, and AQP2 were downregulated.

DEGs Early vs. Advanced Stage
Afterward, the same methods were applied to a gene set ranked by logFC between the early and advanced stage of disease (GS2) to see if different enriched functions and pathways with different gene sets were obtained.
Results for the BP classification are presented in Figure 4. The three most enriched ontologies found were mitotic cell cycle (GO:0000278), nuclear division (GO:0000280), and organic acid metabolic process (GO:0006082). Within these gene sets, TAT, CRABP1, APOA1/5, APOC3, UGT2B4, SULT2A1, and ADH4 were upregulated in the early stages of the disease and gene ANKFN1 was downregulated.
Regarding the MF classification ( Figure 5), the main three ontologies were related to transporter activity (inorganic molecular entity transmembrane transporter activity-GO:0015318, transporter activity-GO:0005215, and transmembrane transporter activity-GO:0022857) and the genes found with the highest logFC between the early and advanced stage were APOA1/2, SLC12A3, and AQP6, all upregulated in the early stages of the disease. Figure 6 presents the most significant ontologies regarding the CC characterization (plasma membrane region-GO:0098590, chromosomal region-GO:0098687, and secretory granule lumen-GO:0034774), and the genes involved in these terms. The genes with the highest logFC were AHSG, TTR, HRG, ALB, AQP6, and LRRTM2, all upregulated.
Finally, regarding the KEGG enrichment pathways, only one enriched term was found, the metabolic pathways-hsa01100.  Figure 6. Gene ontology enrichment analysis regarding cellular components terms for a list of DEGs ranked by the log fold change between early and advanced stages of the disease. The left panel shows a dot chart with the most significant CC terms. The right panel shows a gene-concept network plot of the three most enriched terms that depicts the linkages of genes and biological concepts as a network.

Genes Based on EN Regularization
To see how different metrics to rank genes affected the enriched gene ontologies and pathways found, we also used the β coefficient parameter of the regression to order genes.
Firstly, we used the β coefficients obtained from EN regularization (GS3). The three most enriched ontologies found using this ranked list of genes regarding BP characterization ( Figure 7) were mitotic sister chromatid segregation (GO:0000070), nuclear division (GO:0000280), and sister chromatid segregation (GO:0000819). The genes with the highest correlation to the terms were SMC4, KIF, NEK2, PLK1, TTK, and NUF2, all with positive coefficients.
Lastly, regarding the KEGG enrichment pathways, the top three enriched terms found were Epstein-Barr virus infection (hsa05169), amyotrophic lateral sclerosis (hsa05014), and olfactory transduction (hsa04740). Figure 7. Gene ontology enrichment analysis regarding biological processes terms for a list of genes selected by EN ranked by the β coefficients of the regression. The left panel shows a dot chart with the most significant BP terms and on the right a gene-concept network plot of the three most enriched terms depicts the linkages of genes and biological concepts as a network.  Figure 9. Gene ontology enrichment analysis regarding cellular components terms for a list of genes selected by EN ranked by the β coefficients of the regression. The left panel shows a dot chart with the most significant CC terms and on the right a gene-concept network plot of the three most enriched terms depicts the linkages of genes and biological concepts as a network.

Genes Based on TCox Regularization
The last gene set analyzed (GS4) was ranked based on the β coefficients obtained for the survival regression using a TCox regularization. Figure 10 shows the most enriched terms and genes found regarding the BP classification. The top three enriched terms found were mitotic sister chromatid segregation (GO:0000070), nuclear division (GO:0000280), and sister chromatid segregation (GO:0000819), the same as the ones found in GS3 (gene set ranked based on the β coefficients using EN regularization). Some of the genes with the highest β coefficient found were NUF2, TRIP13, TTK, PRC1, KIF, NDC80, NEK2, PSRC1, AURKA, and FBX.
Regarding the MF characterization (Figure 11), the top three enriched terms were microtubule binding (GO:0008017), tubulin binding (GO:0015631), and receptor regulator activity (GO:0030545), and the most important genes found were NAV3, HAUS8, IFNB1, CCL8, GTSE1, and PSRC1, all with positive β coefficients values. The FNTA gene was the only gene found in common between these three ontologies.
Finally, the top three enriched terms regarding the KEGG enrichment pathways were found to be olfactory transduction (hsa04740), tight junction (hsa04530), and oocyte meiosis (hsa04114). Figure 10. Gene ontology enrichment analysis regarding biological processes terms for a list of genes selected by TCox ranked by the β coefficients of the regression. The left panel shows a dot chart with the most significant BP terms. Right panel shows a gene-concept network plot of the three most enriched terms that depicts the linkages of genes and biological concepts as a network. Figure 11. Gene ontology enrichment analysis regarding molecular function terms for a list of genes selected by TCox ranked by the β coefficients of the regression. The left panel shows a dot chart with the most significant MF terms. The right panel shows a gene-concept network plot of the three most enriched terms that depicts the linkages of genes and biological concepts as a network. Figure 12. Gene ontology enrichment analysis regarding cellular components terms for a list of genes selected by TCox ranked by the β coefficients of the regression. The left panel shows a dot chart with the most significant CC terms. The right panel shows a gene-concept network plot of the three most enriched terms that depicts the linkages of genes and biological concepts as a network.

Discussion
To identify potential biomarkers correlated to the survival outcome of ccRCC patients, different approaches were tested, namely a differential gene expression analysis, networkbased regularization applied to Cox models, and gene ontology classification. One of the limitations of using RCC TCGA data is that the information regarding cancer therapies is very sparse. Therefore, this confounder variable could not be considered, and we did not analyze how patients' therapy may be associated with gene expression profiles and survival.
It has been shown that studying the relationship between gene expression and survival outcome is very important to identify genes associated with cancer survival, providing new information for the prognosis and treatment of cancer diseases. Here, we tested two regularization functions applied to the Cox regression: EN and TCox.
Regarding EN, a regularization method that combines penalties from both Lasso and ridge regression, we could see that some genes were always selected by the model in all of the runs tested ( Table 5). One of those genes selected was COPS7B (COP9 Signalosome Subunit 7B). The upregulation of this gene has been associated with an advanced stage of the disease and metastasis in RCC, indicating that it may serve as a prognostic marker and therapeutic target [41]. Another gene always selected was DONSON (DNA replication fork stabilization factor). Previous studies have shown that the upregulation of this gene is associated with an advanced TNM stage and unfavorable prognosis in gastric cancer tissue [42]. Of note, this biomarker was previously associated with unfavorable overall survival in KIRC, and its prognostic potential was validated using quantitative real-time PCR and IHC, showing that this gene may be a robust biomarker for risk stratification with upregulation being associated with worst survival [43]. Here, both genes were found upregulated in the advanced stages and tumor tissue of ccRCC patients.
Afterwards, a penalization based on correlation-TCox was used. Within the genes found using this method (Table 6), some were also selected by the EN regularization (DON-SON, SEC61A2, SNRPA1, and SORBS2). All of these genes were already described to have a role in tumor formation or metastization. For example, several studies identified new roles of SNRPA1 in the progression of CRC, classifying it as a potential therapeutic target in the treatment of CRC [44] and showing that the upregulation of this gene was associated with a worst prognostic [45]. SEC61A2 may be important for metastasis to the brain in humans when upregulated [46] and finally SORBS2 has been reported to be a tumor suppressor gene in ccRCC [47,48]. It is worth mentioning that in this paper, although upregulated in tumor tissue, this tumor suppressor gene was found downregulated in metastatic patients (stage IV).
Regarding the TCox regularization, five genes were always selected by the model: GTF2E2, HAUS8, PRH2, and ZNF18. Interestingly, some genes were already associated with cancer disease. GTF2E2 (general transcription factor IIE subunit 2) was found upregulated in lung adenocarcinoma tissue and was negatively associated with patients' overall survival [49] and the genetic mutations of PRH2 (proline-rich protein HaeIII subfamily 2) were also identified in lung cancer tissue [50]. Lastly, ZNF18 is a zinc finger protein involved in transcriptional regulation [51] and HAUS8 (augmin-like complex subunit 8) a microtubule-binding complex involved in the mitotic spindle assembly and maintenance of centrosome integrity [51], both processes related to enriched GO terms found here when we performed the GSEA (mitotic cell cycle, microtubule binding and chromosomal region).
As explained above, to better understand the enriched functions of genes selected, a GSEA was used to identify over-and under-represented functional biological groups (regarding BP, MF, and CC terms) within distinct list of genes. Four lists of genes were used: (1) DEGs tumor vs. normal ranked by logFC (GS1); (2) DEGs early vs. advanced stage ranked by logFC (GS2); (3) EN ranked by β coefficient (GS3); and (4) TCox ranked by β coefficient (GS4).
In Table 7 are listed some of the genes present on the top three enriched gene ontology terms found for each group tested (BP, MF, and CC). When comparing gene ontology concerning the biological process (BP) terms between all gene sets tested, genes present in pathways involving nuclear division (GO:0000280) and sister chromatid segregation (GO:0000819) were found in common. Regarding the top genes present in these terms, we found genes involved in carcinogenesis with potential prognostic and therapeutic roles in cancer (FBXO5 [52] and SMC4 [53]) and also PRC1, whose deregulation is related to chromosomal instability and tumor heterogeneity [54].
The second gene ontology classification studied was the molecular function (MF), which describe activities that occur at the molecular level. Several enriched GO terms were found in common between the gene sets studied: various transmembrane transporter activities (GO:0008324, GO:0015318, and GO:0022857) and also microtubule and tubulin binding (GO:0008017 and GO:0015631). In Table 7 are listed the genes that contribute the most to these GOs. SLC34A1 and SLC12A3 that were found to be more involved in the transmembrane transporter activities terms have been studied as potential targets for the clinical diagnosis, prognosis, and treatment of ccRCC patients [55]. Regarding microtubule and tubulin binding, the KIF gene family was highly present in these terms.
This superfamily contains microtubule-dependent molecular motor proteins, which, upon alteration of their expression, lead to cancer development and progression [56]. Likewise, TPX2 has been studied as a factor critical for mitosis and spindle assembly and as a marker for diagnosis and prognosis when overexpressed in cancer [57]; GTSE1 can also act as an oncogene and a high expression was positively correlated with histological grade and poor survival [58]. In this study, both genes were found upregulated in tumor tissue of ccRCC patients.
The last GO considered describes the locations relative to cellular structures in which a gene product performs a certain function-cellular component (CC). The most enriched terms found between the gene sets were related to plasma membrane regions (GO:0098590, GO:0045177) and chromosomal region (GO:0098687, GO:0000776, GO:0000779). Regarding plasma membrane regions, genes belonging to the urea transporter family (SLC14) were highly present and alterations in those genes may increase the risk of bladder cancer [59]. Further, the aquaporins (AQPs) family responsible for the transport of small solutes were also highly present, with known functions in tumor biology such as cell proliferation and migration [60]. Finally, the most significant genes belonging to chromosomal regions were found (TTK, NEK2, and NUF2). The TTK gene is associated with cell proliferation and proteins encoded by this gene are essential for chromosome alignment and duplication. Furthermore, it was found strongly overexpressed in human pancreatic ductal adenocarcinoma [61], which is consistent with the results obtained in this study, where we found TTK gene upregulated in the tumor tissue of ccRCC patients. NEK2 and NUF2 have also well established roles in cell cycle regulation and cell proliferation and their overexpression was associated with a variety of cancer types including renal cell carcinoma [62,63].
To see if this smaller gene set may split ccRCC patients into two groups regarding their risk of dying, we created a new model where we applied a Cox regression using the ridge penalization (no feature selection was performed) to three different datasets, namely the full dataset (n = 527), the early stage patients (n = 441), and the metastatic/stage IV patients (n = 84).
We were still able to split the curves with a mean significance p-value in the three groups of patients tested ( Figure 13) and Table 8 shows the mean β coefficients obtained for each gene after we tested the model 100 times. Genes showing positive β coefficients (e.g., COPS7B), have HR > 1, so patients with an increased expression of these genes are expected to have a higher risk of dying. Likewise, genes showing negative β coefficients (e.g., SORBS2), have HR < 1, so patients with a downregulation of these genes are expected to have a higher risk of dying. Interestingly, as described before, SORBS2 has been reported to be a tumor suppressor gene in ccRCC and here, we found that it was downregulated in metastatic tissue.
We observed that even though this gene set was able to split the two survival curves (high/low risk of dying) significantly, the hazard ratios obtained for each gene were close to one (HR = 1, no effect). Therefore, the Cox regression with ridge penalization was applied only to the genes selected earlier by both EN and TCox regularizers (DONSON, SEC61A2, SNRPA1, and SORBS2). This was performed 100 times and Table 9 shows the mean β coefficients and HR obtained for each gene.
These results show that the biomarkers previously selected using the full dataset are also significant for risk stratification in early stage patients and also metastatic patients. This gene set comprises potential prognostic biomarkers with putative roles in cancer therapy that may help clinicians in the clinical decision-making of ccRCC patients.  Table 8. List of genes and corresponding β coefficients obtained in a multivariate Cox survival model using ridge regression. HR (hazard ratio) gives the effect size of covariates and it is calculated by exp(β). HR = 1, no effect; HR < 1, reduction in the hazard; HR > 1, increase in the hazard.  Table 9. List of genes previously selected by both EN and TCox regularizers and corresponding β coefficients obtained when we applied a multivariate Cox survival model with ridge penalization. HR (hazard ratio) gives the effect size of covariates and it is calculated by exp(β). HR = 1, no effect; HR < 1, reduction in the hazard; HR > 1, increase in the hazard.

Conclusions
One of the priorities of kidney cancer research is to identify RCC-specific biomarkers for the early detection and screening of the disease to reduce patient and healthcare systems' burden due to kidney cancer. Nowadays, there are no effective biomarkers for early diagnosis of RCC in clinical treatment, and the molecular mechanism of RCC metastasis remains unclear.
Altogether, these genes should be further investigated as potential prognostic factors to help clinicians better identify and monitor patients with ccRCC. Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: Publicly available datasets were analyzed in this study. These data can be found here: https://portal.gdc.cancer.gov/projects/TCGA-KIRC (accessed on 1 May 2022). Furthermore, the code used to perform this analysis is available at https://github.com/sysbiomed/ KIRC_analysis.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: