1. Introduction
Glioblastoma multiforme (GBM) is the most aggressive and fatal form of primary brain tumor, classified as grade IV astrocytoma by the World Health Organization [
1,
2,
3,
4]. GBM is characterized by rapid cellular proliferation, high intertumoral heterogeneity, and extensive infiltration into surrounding brain tissue, which severely limits the effectiveness of conventional therapies [
5,
6,
7]. Despite advances in surgical resection, radiotherapy, and chemotherapy, the median overall survival of GBM patients remains approximately 14 to 15 months, with a five-year survival rate of less than 5% [
8,
9]. This highlights the urgent need to identify robust molecular biomarkers to improve prognostic accuracy and guide personalized therapeutic approaches [
10,
11,
12]. It should be noted that the current study uses data collected prior to the 2021 WHO reclassification, which now defines GBM strictly as IDH-wildtype. Therefore, the term GBM in this manuscript may include cases that would now be reclassified as IDH-mutant astrocytoma.
Tumorigenesis in GBM involves complex molecular alterations, including gene mutations, aberrant gene expression, and dysregulated signaling pathways [
3,
13]. Gene expression profiling, particularly through RNA sequencing (RNA-Seq), provides insights into transcriptional changes that reflect tumor behavior and can reveal potential prognostic markers [
14,
15]. In particular, dysregulated expression of tumor suppressors and oncogenes identified through RNA-Seq has been associated with GBM progression and patient survival, supporting its utility in biomarker discovery [
16,
17].
The Cancer Genome Atlas (TCGA) provides a comprehensive and publicly accessible resource containing multidimensional molecular data across various cancer types, including GBM [
18,
19]. The TCGA GBM dataset includes RNA-Seq gene expression profiles along with detailed clinical information, such as patient survival time and vital status [
20,
21]. This data offers a valuable opportunity to explore the association between gene expression signatures and patient prognosis.
Recent developments in machine learning (ML) and deep learning (DL) have revolutionized cancer research by enabling the extraction of complex patterns from high-dimensional omics data [
22]. These computational approaches can be used to identify potential biomarkers, classify tumor subtypes, and build accurate prognostic models. ML methods such as random forest, support vector machine, and gradient boosting machine are widely used for feature selection and survival prediction [
23,
24,
25], while DL models such as autoencoders or deep neural networks can capture intricate non-linear relationships in large datasets [
20,
26].
In the context of GBM, integrating RNA-Seq expression data with survival analysis using ML and DL offers a promising strategy to uncover novel genes that are strongly associated with overall survival. However, due to the high dimensionality and variability inherent in gene expression data, careful feature selection and model validation are essential for building reliable and interpretable prognostic models.
In this study, we analyzed RNA-Seq-based gene expression profiles and corresponding survival data of GBM patients from the TCGA dataset. We applied a combination of statistical methods, machine learning algorithms, and deep learning approaches to identify key genes associated with patient prognosis. Our findings aim to provide insights into the transcriptional landscape of GBM and support the development of effective survival prediction models for clinical applications.
3. Results
3.1. Clinical Data Summary
This clinical dataset includes information on survival status, overall survival duration, gender, and patient follow-up time. The majority of patients in this dataset were deceased at the end of the observation period, accounting for 80%, while 20% of patients were still alive. Gender distribution shows that approximately 65% were male and 35% were female, indicating a male-dominated patient population in this study. Among the deceased patients, overall survival duration varied widely, ranging from 5 to 1537 days, with a median value of approximately 360 days, indicating significant variability in patient survival after diagnosis or the start of observation. For the surviving patients, the follow-up duration ranged from 13 to 958 days, with a median of approximately 268 days. Overall, the variation in survival status and observation period within this patient cohort provides a strong basis for conducting further survival analyses, such as the Cox proportional hazards model, to evaluate the relationship between gene expression and patient prognosis.
3.2. Exploratory Analysis of Gene Expression Data
To ensure the appropriateness of the RNA-Seq data for linear modeling, a voom transformation was applied to stabilize the variance across the range of gene expression levels. This transformation is crucial for meeting the assumptions of linear modeling, particularly in handling the heteroscedasticity typical of RNA-Seq data. The transformation’s effect is illustrated in
Figure 1, where each dot represents a gene. The x-axis shows the average log2 counts per million, while the y-axis displays the square root of the standard deviation. A clear downward trend is observed along the red trend line, indicating that genes with lower expression levels tend to exhibit higher variability, whereas highly expressed genes show more stable variance. This trend enables the calculation of observation-level precision weights, thereby improving the reliability of downstream differential expression analysis.
To further explore the overall structure and variability within the dataset, PCA was conducted. The PCA results are shown in
Figure 2, which visualizes the projection of samples onto the first two principal components (PC1 and PC2). In this plot, primary solid tumor samples are represented by black dots and normal solid tissue samples by pink dots. A distinct separation between the two groups is evident, particularly along PC1, which captures the largest proportion of variance and serves as the main axis distinguishing tumor from normal samples. PC2 accounts for additional, though smaller, variation. This clear clustering pattern indicates substantial differences in gene expression profiles between tumor and normal tissues, highlighting the biological relevance of the data and supporting its suitability for identifying differentially expressed genes and building prognostic models.
3.3. Univariate Survival Analysis Using Gene Expression Data
To assess the relationship between gene expression and overall survival, a univariate survival analysis was conducted using KM curves and log-rank tests. The gene expression matrix, consisting of 29,128 genes, was first pre-processed to match with clinical survival data. Genes with incomplete expression profiles were excluded from further analysis to ensure reliable statistical inference. For each gene, patients were stratified into two groups, high expression and low expression, based on the mean expression value of the respective gene across all samples. A KM survival model was then fitted to compare the survival distributions between these two groups, and a log-rank test was used to compute the p-value indicating statistical significance. This procedure was iteratively applied to all genes. Only genes with a p-value < 0.05 were retained as significantly associated with overall survival. As a result, 694 genes were identified as statistically significant, suggesting their potential relevance as prognostic biomarkers.
3.4. Machine Learning-Based Feature Selection and Survival Prediction Performance
To assess the impact of different feature selection techniques on survival prediction, multivariate Cox proportional hazards models were developed using gene sets selected by five machine learning algorithms. These models were trained and evaluated under varying training–test splits (60%, 70%, 80%, and 90% training data), and their predictive performance was measured using the c-index. The results are summarized in
Table 3.
The highest predictive performance was achieved using RF-RFE at 90% training data, with a c-index of 0.725, suggesting that this method was most effective in selecting gene features relevant to survival outcomes. The substantial increase in performance at 90% training also highlights the benefit of a larger training set when dealing with high-dimensional biological data. Across all training splits, PCA consistently yielded moderate to good performance, with a notable c-index of 0.619 at the 60% split. Despite being an unsupervised method, PCA’s ability to retain variance from the original dataset may have captured essential biological signals related to survival. Meanwhile, SVM-RFE and GB also demonstrated stable performance, especially at higher training proportions. GB maintained C-index values above 0.59 across all data splits, while SVM-RFE achieved a C-index of 0.627 at the 90% split. These results affirm the potential of combining ensemble and wrapper-based methods with survival analysis for robust biomarker selection. In contrast, RF without recursive elimination showed lower and relatively flat performance across different splits, indicating that direct importance ranking alone may not sufficiently refine feature selection in this context. Overall, the findings emphasize the importance of both the choice of machine learning-based feature selection method and the size of the training dataset in determining survival prediction performance. Recursive elimination techniques, particularly those based on ensemble models, appear most suitable for high-dimensional gene expression data and should be prioritized in similar analytical pipelines.
3.5. Application of DeepSurv for Survival Prediction and Exploration of Significant Genes
In this study, the DeepSurv approach was employed as a deep learning model for survival analysis based on gene expression in glioblastoma patients. This model is designed to capture complex non-linear relationships between gene expression and patient mortality risk—an aspect that classical statistical methods like the Cox proportional hazards model often fail to optimally model. The primary goal of applying DeepSurv is to improve survival prediction accuracy and identify the most influential genes associated with patient prognosis. The performance of DeepSurv was evaluated by testing various model training configurations, including variations in training data proportion, number of hidden units, activation functions, dropout rates, learning rates, number of epochs, and batch sizes. The evaluation results are summarized in
Table 4, which presents the c-index as the main metric for assessing survival prediction accuracy. The model trained with 90% of the data achieved the highest c-index value of 0.822, indicating excellent predictive performance. This value is significantly higher than that of the best ML-based Cox model (RF-RFE), which achieved a maximum c-index of around 0.725 using RF-RFE for feature selection and Cox proportional hazards models for survival prediction.
To further evaluate the robustness of our proposed DeepSurv model, we compared its performance with a previous study [
20], which used the same TCGA GBM dataset for survival prediction. As summarized in
Table 5, across different training data ratios (60%, 70%, 80%, and 90%), our model consistently achieved higher average c-index values than the reference model. For instance, at 90% training data, our model reached a c-index of 0.822 compared to 0.639 reported by Feng et al. This improvement can be attributed to the comprehensive hyperparameter tuning and model architecture optimization conducted in our study, including activation function selection, dropout regularization, and learning rate adjustment. Moreover, the exploration of multiple configurations helped mitigate overfitting and enhanced generalizability. These results suggest that with appropriate tuning and feature selection, deep learning models such as DeepSurv can significantly outperform classical or previously published models even on the same dataset, highlighting their potential utility in precision oncology. As seen, larger training datasets tend to produce relatively better c-index values on the test data. Overall, the proposed model outperformed the reference model across all configurations, confirming its robustness and superior capacity to model survival outcomes from gene expression data in glioblastoma.
Following model training, gene contributions to risk prediction was evaluated using a gradient-based method. From this, 10 key genes with the highest average scores were identified, indicating their major influence on glioblastoma patient survival predictions. These top genes are
CMTR1,
RPL23AP42,
TSPYL1,
AC011287.1,
RPL7L1P8,
CCDC107,
AL354743.2,
GMPR,
PPY, and
MT-TL1. To better summarize their biological functions and potential prognostic roles, the top genes identified by DeepSurv are presented in
Table 6.
Among the top prognostic genes identified, CMTR1 encodes a methyltransferase involved in mRNA cap methylation and immune regulation. Recent studies revealed that CMTR1 is overregulated in various cancers and promotes ribosomal protein gene expression, thereby enhancing tumor growth. RPL23AP42 and RPL7L1P8 are ribosomal pseudogenes with limited direct evidence in GBM. However, other ribosomal pseudogenes, such as RPL4P4, have been reported as prognostic markers and may act as competitive endogenous RNAs (ceRNAs), influencing gene expression regulation in gliomas. TSPYL1, a chromatin remodeling factor, is associated with neural development and may contribute to glioma progression, as identified in IDH1-associated tumor evolution studies. For AC011287.1 and AL354743.2, functional annotation is scarce, but they belong to the long non-coding RNA (lncRNA) class. Other lncRNAs, such as MEG3, have been shown to suppress glioma cell proliferation by modulating gene expression and chromatin states. These uncharacterized lncRNAs may play similar regulatory roles. CCDC107, though not yet directly studied in GBM, belongs to the coiled-coil domain-containing family. CCDC103, a related protein, has been implicated in glioma progression and cytoskeletal organization, suggesting that CCDC107 may also impact tumor cell motility or structure. GMPR, a key enzyme in purine metabolism, catalyzes the conversion of GMP to IMP. While its direct involvement in GBM is underexplored, purine metabolism has been shown to regulate DNA repair and drive therapy resistance in glioblastoma. PPY, a neuropeptide and member of the NPY family, is part of the broader neuroendocrine signaling system. NPY receptors and intertumoral neuropeptides are active in GBM, potentially influencing tumor behavior. MT-TL1, a mitochondrial tRNA for leucine, plays a critical role in mitochondrial protein synthesis. Alterations in mitochondrial DNA, including tRNA genes like MT-TL1, have been linked to metabolic reprogramming and mitochondrial dysfunction in GBM.
These findings indicate that DeepSurv not only enhances survival prediction accuracy but also uncovers biologically relevant genes with potential as prognostic biomarkers or therapeutic targets. Nevertheless, experimental validation is essential to confirm their mechanistic roles in glioblastoma. Furthermore, challenges such as overfitting and interpretability must be addressed to enable clinical implementation of deep learning-based survival models.
3.6. Kaplan–Meier Survival Analysis of Key Genes
To evaluate the clinical relevance of the top 10 prognostic genes identified through deep learning analysis, KM survival curves were generated for each gene. Patients were stratified into high- and low-expression groups (labeled UP and DOWN) based on the median expression threshold of each gene. Survival differences between the two groups were assessed using the log-rank test. As illustrated in
Figure 3, the KM plots reveal a statistically significant survival difference for all 10 genes, with log-rank test
p-values ranging from 0.0061 to 0.04. Notably,
GMPR (
p = 0.0094) and
PPY (
p = 0.0061) exhibit the most pronounced separation between high- and low-expression groups, suggesting strong prognostic potential.
The number at risk at various time points is also shown in the plots, providing additional context for interpreting survival trends over time. These results indicate that patients with high expression (or low expression, depending on the gene) consistently have significantly different survival outcomes compared to those with opposite expression level. For example, elevated expressions of CMTR1, TSPYL1, or RPL23AP42 are associated with poorer survival, aligning with their hypothesized roles in promoting tumor progression or interfering with gene regulation. The statistical significance of separation supports the relevance of these genes as individual prognostic biomarkers. From a clinical perspective, these genes could potentially be used to stratify glioblastoma patients into distinct risk categories, aiding personalized treatment planning. However, further investigation, including experimental validation and clinical trials, is essential before clinical implementation.
While all top 10 genes showed statistically significant survival differences, the nature of the curve separation varied. For instance, although PPY demonstrated the lowest p-value (p = 0.0061), its survival curves remained nearly overlapping until after the median survival time, suggesting a delayed prognostic effect. In contrast, GMPR (p = 0.0094) showed early and consistent divergence between high- and low-expression groups, indicating a more robust and clinically meaningful impact on patient prognosis. These distinctions highlight the importance of interpreting survival curves beyond p-values alone, considering both the magnitude and timing of risk stratification, which may influence their utility in future clinical applications.
3.7. Functional Network and Pathway Analysis of Key Genes
To further understand the biological functions and molecular interactions of the top 10 genes identified through machine learning and deep learning models, a network-based functional analysis was conducted. This analysis integrated diverse biological data sources, including co-expression, physical and genetic interactions, shared pathways, co-localization, and shared protein domains. The network was constructed for Homo sapiens, with the weighting based on biological process relevance to ensure a functionally meaningful arrangement. The resulting protein–protein interaction (PPI) network, shown in
Figure 4, illustrates how the selected genes are embedded within a broader biological framework. In this network, the central 10 genes are enriched with additional interacting partners predicted.
The lines connecting the genes are color-coded to indicate the nature of their interactions; co-expression is depicted in light purple, physical interactions in pink, predicted interactions in orange, co-localization in green, shared pathways in blue, and shared protein domains in gray. These edge types reflect distinct modes of functional relationships; co-expression suggests genes are transcriptionally coordinated, physical interactions imply direct protein binding, predicted interactions are inferred from orthologous data, and co-localization indicates shared subcellular compartments. Shared pathways and shared domains reflect common biological roles or structural motifs. These visual indicators provide a comprehensive representation of the complex relationships among the genes.
The analysis reveals that many of the selected genes share strong co-expression and pathway-based associations, suggesting that they may participate in related biological processes relevant to glioblastoma progression. Notably, several genes such as GMPR emerge as key hubs within the network, connecting multiple genes primarily involved in purine metabolism. GMPR acts as a central node linking metabolic and nucleotide regulatory elements crucial for maintaining the proliferative state of cancer cells. CMTR1, known for its involvement in RNA processing and immune response modulation, shows limited but specific interactions, notably with HPRT1, suggesting a specialized regulatory role rather than widespread connectivity. Notably, CMTR1 is not connected with TSPYL1 in the network. TSPYL1 exhibits limited but distinct associations, particularly with HPRT1 and NPY5R, reflecting its involvement in regulatory pathways such as chromatin remodeling and cell cycle control. In this analysis, TSPYL1 is not connected with PPY or CMTR1, suggesting distinct functional roles in the network. The neuropeptide-related genes, including PPY and members of the NPY family (e.g., NPY, NPY1R, NPY2R, NPY4R, NPY5R, PYY), form a distinct cluster characterized by strong co-expression and shared pathway interactions, indicating a potential role in neuroendocrine signaling relevant to glioblastoma. The network also includes CCDC107, which appears isolated with no direct interactions, implying that its role in glioblastoma remains unclear but may still be worth investigating given its potential involvement in cytoskeletal regulation. Overall, this functional network analysis emphasizes the interconnectedness of the selected prognostic genes and highlights their collective involvement in pathways that could be critical to glioblastoma survival outcomes. These findings provide not only a biological rationale for their selection as predictive biomarkers but also lay the groundwork for future investigations into targeted therapeutic strategies.
4. Discussion
This study demonstrates the effectiveness of both machine learning and deep learning methods in identifying prognostic biomarkers for GBM using RNA-Seq gene expression data. GBM remains one of the most lethal brain tumors, with limited improvement in patient outcomes over the last decades. Identifying molecular features that can reliably predict survival is crucial for improving risk stratification and guiding treatment decisions. The integration of computational approaches enables high-throughput and unbiased screening of potential biomarkers across the genome.
The univariate Cox regression analysis identified 694 survival-associated genes, which served as a foundation for building multivariate models. Among the ML-based approaches, the Cox model using random forest–recursive feature elimination (RF-RFE) with 90% training data achieved a concordance index (C-index) of 0.725, suggesting moderate predictive performance. These ML models offer interpretability and are less computationally intensive, making them attractive for practical applications. However, they may not fully capture complex interactions in high-dimensional data such as gene expression profiles.
In contrast, the DeepSurv model, a deep learning extension of the Cox proportional hazards model, demonstrated superior predictive power, achieving a C-index of 0.822. This underscores the strength of DL approaches in modeling non-linear and intricate relationships that are often missed by traditional or ML-based survival models. While the higher performance of DeepSurv is promising, it also introduces challenges in interpretability and requires more computational resources and expertise.
The top 10 genes identified by DeepSurv—CMTR1, RPL23AP42, TSPYL1, AC011287.1, RPL7L1P8, CCDC107, AL354743.2, GMPR, PPY, and MT-TL1—were further analyzed using Kaplan–Meier survival curves, which showed significant separation in survival outcomes between high- and low-risk groups. Several of these genes, such as CMTR1 and GMPR, have known roles in RNA modification and purine metabolism, respectively, processes that are commonly dysregulated in cancer, particularly in aggressive subtypes like GBM. The MT-TL1 gene is part of the mitochondrial genome and has been linked to metabolic regulation and energy production in cancer cells.
To further explore the biological relevance of these genes, GeneMANIA was employed to construct gene interaction networks and perform pathway enrichment analysis. The analysis revealed functional associations with RNA processing, neuroendocrine signaling, cell cycle regulation, and chromatin modification, all of which are critical in GBM pathogenesis. These findings are consistent with previous reports highlighting the role of transcriptional dysregulation in GBM progression, particularly as revealed through RNA-Seq–based gene expression profiling. The inclusion of lncRNAs and pseudogenes among the top-ranked features also suggests the involvement of non-coding regulatory mechanisms in gliomagenesis, an area that warrants further exploration.
To further substantiate the relevance of the top-ranked genes, we investigated existing literature to determine whether these genes have been previously associated with prognosis in other types of cancer. Several of them have demonstrated potential prognostic or functional roles across various tumor types. A summary of the prognostic relevance of the top 10 genes based on published studies is provided in
Table 7, which strengthens the biological plausibility of their involvement in GBM outcomes and supports their consideration as candidate biomarkers for future validation.
Our findings are consistent with recent studies that emphasize the value of combining high-throughput omics data with advanced computational models for biomarker discovery. For instance, previous research using expression data has demonstrated the prognostic significance of immune infiltration, stemness features, and metabolic reprogramming in GBM. However, our approach uniquely compares ML- and DL-based survival modeling, providing deeper insight into the relative strengths, limitations, and biological relevance of selected features derived from RNA-Seq data.
Despite these promising results, this study has some limitations. First, the analysis relies solely on TCGA data, which may introduce cohort-specific biases and limit generalizability. Gene expression patterns can vary significantly across populations and platforms. Therefore, external validation using independent cohorts, such as CGGA or REMBRANDT, is essential to assess the robustness and transferability of the proposed models. In future work, we plan to validate the selected prognostic genes and trained models using these external datasets to ensure generalizability across different patient populations. Second, although DeepSurv provides improved performance, its black-box nature hinders straightforward biological interpretation. Future studies should also focus on enhancing model explainability and performing functional validation of key genes through wet-lab experiments, including CRISPR knockout and qRT-PCR assays.
In this study, univariate Cox regression identified 694 genes with p-values below 0.05. However, after applying false discovery rate (FDR) correction using the Benjamini–Hochberg method across all 29,128 tested genes, none remained statistically significant at FDR < 0.05. This result is not unexpected given the high dimensionality and multiple testing burden inherent in genome-wide gene expression analyses. Nevertheless, these initially filtered genes served as a preliminary feature pool for further downstream analysis using multivariate Cox models and deep learning, which helped prioritize biologically meaningful candidates. While the results should be interpreted cautiously, this workflow aligns with common practices in exploratory omics studies and sets the stage for future independent validation efforts.
Overall, this integrative analysis supports the utility of ML and DL techniques in survival prediction and biomarker identification in GBM. The identified genes have the potential to enhance current prognostic models and inform the development of targeted therapies. By combining statistical rigor, computational power, and biological interpretation, this study contributes to the ongoing effort to improve patient stratification and precision oncology in glioblastoma.
5. Conclusions
In this study, we integrated statistical, machine learning, and deep learning approaches to analyze TCGA RNA-Seq gene expression data and identify key genes associated with overall survival in GBM. Univariate Cox analysis revealed hundreds of genes with potential prognostic relevance, which were further refined using various feature selection techniques. Among traditional models, RF-RFE combined with Cox regression achieved a notable c-index of 0.725. However, the deep learning-based DeepSurv model outperformed all traditional approaches, achieving a c-index of 0.822 and identifying 10 key prognostic genes: CMTR1, RPL23AP42, TSPYL1, AC011287.1, RPL7L1P8, CCDC107, AL354743.2, GMPR, PPY, and MT-TL1. Several of these genes, such as CMTR1, play critical roles in mRNA cap methylation and immune response regulation, while GMPR and MT-TL1 are involved in metabolic pathways essential for tumor growth. PPY, typically associated with neuroendocrine signaling, emerged as a significant prognostic indicator in glioblastoma. Kaplan–Meier analysis confirmed the clinical relevance of these genes, and network-based functional analysis revealed their potential interactions and pathways. These findings not only enhance our understanding of GBM biology but also provide a foundation for the development of personalized prognostic tools and targeted therapies. The identified genes could serve as candidate biomarkers for patient risk stratification and novel targets for therapeutic intervention. Further experimental validation and functional studies are warranted to confirm their roles and clinical applicability.