DeepCMS: A Feature Selection-Driven Model for Cancer Molecular Subtyping with a Case Study on Testicular Germ Cell Tumors
Abstract
1. Introduction
- The proposed methodology addresses a major issue faced by researchers working with genomic or transcriptomic data: data scarcity, i.e., the number of features far exceeds the number of available samples.
- The robustness of the proposed approach was assessed by comparing it with state-of-the-art models, namely Random Forest, SVM, and DeepCC. DeepCMS outperformed all three in aggregated accuracy (0.92), aggregated sensitivity (0.90), aggregated specificity (0.968), and aggregated balanced accuracy (0.93).
- To evaluate the generalizability of the framework, a case study was designed using another cancer type: testicular germ cell tumor (TGCT). The same pipeline was applied to the TGCT dataset, where comparable results were obtained, with an accuracy of 0.97.
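
The four evaluation metrics named above can be illustrated with a minimal sketch. It assumes a one-vs-rest, macro-averaged definition of sensitivity and specificity for the multi-class setting, which the text does not confirm; the function name `one_vs_rest_metrics` and the confusion matrix values are hypothetical and serve only as an illustration.

```python
from typing import Dict, List


def one_vs_rest_metrics(cm: List[List[int]]) -> Dict[str, float]:
    """Compute metrics from a confusion matrix.

    cm[i][j] is the number of samples with true class i predicted as class j.
    Sensitivity and specificity are computed per class (one-vs-rest) and then
    macro-averaged; this averaging scheme is an assumption.
    """
    k = len(cm)
    total = sum(sum(row) for row in cm)
    sens, spec = [], []
    for c in range(k):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                       # true class c, predicted otherwise
        fp = sum(cm[r][c] for r in range(k)) - tp  # other classes predicted as c
        tn = total - tp - fn - fp
        sens.append(tp / (tp + fn) if tp + fn else 0.0)
        spec.append(tn / (tn + fp) if tn + fp else 0.0)
    sensitivity = sum(sens) / k
    specificity = sum(spec) / k
    return {
        "accuracy": sum(cm[c][c] for c in range(k)) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Hypothetical four-class confusion matrix; the counts are made up.
example_cm = [
    [18, 1, 1, 0],
    [2, 16, 1, 1],
    [0, 1, 19, 0],
    [1, 0, 2, 17],
]
metrics = one_vs_rest_metrics(example_cm)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Balanced accuracy is included alongside plain accuracy because it weights sensitivity and specificity equally, which matters when subtype classes are imbalanced.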
Related Work
2. Materials
2.1. Datasets
2.2. Methods
Algorithm 1 DeepCMS
Algorithm Steps:
3. Results
4. Discussion
Case Study in Testicular Germ Cell Tumor
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
The following abbreviations are used in this manuscript:

| Abbreviation | Definition |
|---|---|
| TCGA | The Cancer Genome Atlas |
| TGCT | Testicular germ cell tumor |
| DL | Deep learning |
| SGD | Stochastic gradient descent |
References
- Chen, Y.; Wang, L.; Ding, B.; Shi, J.; Wen, T.; Huang, J.; Ye, Y. Automated Alzheimer’s disease classification using deep learning models with Soft-NMS and improved ResNet50 integration. J. Radiat. Res. Appl. Sci. 2024, 17, 100782.
- Choi, Y.A.; Park, S.J.; Jun, J.A.; Pyo, C.S.; Cho, K.H.; Lee, H.S.; Yu, J.H. Deep learning-based stroke disease prediction system using real-time bio signals. Sensors 2021, 21, 4269.
- Mahmud, S.H.; Chen, W.; Jahan, H.; Dai, B.; Din, S.U.; Dzisoo, A.M. DeepACTION: A deep learning-based method for predicting novel drug-target interactions. Anal. Biochem. 2020, 610, 113978.
- Alshmrani, G.M.M.; Ni, Q.; Jiang, R.; Pervaiz, H.; Elshennawy, N.M. A deep learning architecture for multi-class lung diseases classification using chest X-ray (CXR) images. Alex. Eng. J. 2023, 64, 923–935.
- Zhan, Z.; Jing, Z.; He, B.; Hosseini, N.; Westerhoff, M.; Choi, E.Y.; Garmire, L.X. Two-stage Cox-nnet: Biologically interpretable neural-network model for prognosis prediction and its application in liver cancer survival using histopathology and transcriptomic data. NAR Genom. Bioinform. 2021, 3, lqab015.
- Ahmadi, H.; Arji, G.; Shahmoradi, L.; Safdari, R.; Nilashi, M.; Alizadeh, M. The application of internet of things in healthcare: A systematic literature review and classification. Univers. Access Inf. Soc. 2019, 18, 837–869.
- Nasarudin, N.A.; Al Jasmi, F.; Sinnott, R.O.; Zaki, N.; Al Ashwal, H.; Mohamed, E.A.; Mohamad, M.S. A review of deep learning models and online healthcare databases for electronic health records and their use for health prediction. Artif. Intell. Rev. 2024, 57, 249.
- Wekesa, J.S.; Kimwele, M. A review of multi-omics data integration through deep learning approaches for disease diagnosis, prognosis, and treatment. Front. Genet. 2023, 14, 1199087.
- Singh, R.; Fazal, Z.; Freemantle, S.J.; Spinella, M.J. Between a rock and a hard place: An epigenetic-centric view of testicular germ cell tumors. Cancers 2021, 13, 1506.
- Pandya, R.; Grace San Diego, K.; Shabbir, T.; Modi, A.P.; Wang, J.; Dhahbi, J.; Barsky, S.H. The cell of cancer origin provides the most reliable roadmap to its diagnosis, prognosis (biology) and therapy. Med. Hypotheses 2021, 157, 110704.
- Lu, X.; Zhang, L.; Zhao, H.; Chen, C.; Wang, Y.; Liu, S.; Lin, X.; Wang, Y.; Zhang, Q.; Lu, T.; et al. Molecular classification and subtype-specific drug sensitivity research of uterine carcinosarcoma under multi-omics framework. Cancer Biol. Ther. 2019, 20, 227–235.
- Yang, H.; Zhao, L.; Li, D.; An, C.; Fang, X.; Chen, Y.; Liu, J.; Xiao, T.; Wang, Z. Subtype-WGME enables whole-genome-wide multi-omics cancer subtyping. Cell Rep. Methods 2024, 4, 100781.
- Rohani, N.; Eslahchi, C. Classifying breast cancer molecular subtypes by using deep clustering approach. Front. Genet. 2020, 11, 553587.
- Paz-Cabezas, M.; Calvo-López, T.; Romera-Lopez, A.; Tabas-Madrid, D.; Ogando, J.; Fernández-Aceñero, M.J.; Sastre, J.; Pascual-Montano, A.; Mañes, S.; Díaz-Rubio, E.; et al. Molecular classification of colorectal cancer by microRNA profiling: Correlation with the Consensus Molecular Subtypes (CMS) and validation of miR-30b targets. Cancers 2022, 14, 5175.
- Chen, L.; Zeng, T.; Pan, X.; Zhang, Y.H.; Huang, T.; Cai, Y.D. Identifying methylation pattern and genes associated with breast cancer subtypes. Int. J. Mol. Sci. 2019, 20, 4269.
- Liu, T.; Huang, J.; Liao, T.; Pu, R.; Liu, S.; Peng, Y. A hybrid deep learning model for predicting molecular subtypes of human breast cancer using multimodal data. IRBM 2022, 43, 62–74.
- Hong, R.; Liu, W.; DeLair, D.; Razavian, N.; Fenyö, D. Predicting endometrial cancer subtypes and molecular features from histopathology images using multi-resolution deep learning models. Cell Rep. Med. 2021, 2, 100400.
- Jiang, M.; Zhang, D.; Tang, S.C.; Luo, X.M.; Chuan, Z.R.; Lv, W.Z.; Jiang, F.; Ni, X.J.; Cui, X.W.; Dietrich, C.F. Deep learning with convolutional neural network in the assessment of breast cancer molecular subtypes based on US images: A multicenter retrospective study. Eur. Radiol. 2021, 31, 3673–3682.
- Sun, P.; Wu, Y.; Yin, C.; Jiang, H.; Xu, Y.; Sun, H. Molecular subtyping of cancer based on distinguishing co-expression modules and machine learning. Front. Genet. 2022, 13, 866005.
- Guo, L.Y.; Wu, A.H.; Wang, Y.X.; Zhang, L.P.; Chai, H.; Liang, X.F. Deep learning-based ovarian cancer subtypes identification using multi-omics data. BioData Min. 2020, 13, 10.
- Liu, L.P.; Lu, L.; Zhao, Q.Q.; Kou, Q.J.; Jiang, Z.Z.; Gui, R.; Luo, Y.W.; Zhao, Q.Y. Identification and validation of the pyroptosis-related molecular subtypes of lung adenocarcinoma by bioinformatics and machine learning. Front. Cell Dev. Biol. 2021, 9, 756340.
- Yang, H.; Chen, R.; Li, D.; Wang, Z. Subtype-GAN: A deep learning approach for integrative cancer subtyping of multi-omics data. Bioinformatics 2021, 37, 2231–2237.
- Islam, M.M.; Huang, S.; Ajwad, R.; Chi, C.; Wang, Y.; Hu, P. An integrative deep learning framework for classifying molecular subtypes of breast cancer. Comput. Struct. Biotechnol. J. 2020, 18, 2185–2199.
- Tafavvoghi, M.; Sildnes, A.; Rakaee, M.; Shvetsov, N.; Bongo, L.A.; Busund, L.T.R.; Møllersen, K. Deep learning-based classification of breast cancer molecular subtypes from H&E whole-slide images. J. Pathol. Inform. 2025, 16, 100410.
- Shen, J.; Shi, J.; Luo, J.; Zhai, H.; Liu, X.; Wu, Z.; Yan, C.; Luo, H. Deep learning approach for cancer subtype classification using high-dimensional gene expression data. BMC Bioinform. 2022, 23, 430.
- Li, S.; Yang, Y.; Wang, X.; Li, J.; Yu, J.; Li, X.; Wong, K.C. Colorectal cancer subtype identification from differential gene expression levels using minimalist deep learning. BioData Min. 2022, 15, 12.
- Matsui, Y.; Maruyama, T.; Nitta, M.; Saito, T.; Tsuzuki, S.; Tamura, M.; Kusuda, K.; Fukuya, Y.; Asano, H.; Kawamata, T.; et al. Prediction of lower-grade glioma molecular subtypes using deep learning. J. Neuro-Oncol. 2020, 146, 321–327.
- Woerl, A.C.; Eckstein, M.; Geiger, J.; Wagner, D.C.; Daher, T.; Stenzel, P.; Fernandez, A.; Hartmann, A.; Wand, M.; Roth, W.; et al. Deep learning predicts molecular subtype of muscle-invasive bladder cancer from conventional histopathological slides. Eur. Urol. 2020, 78, 256–264.
- Yang, B.; Xin, T.T.; Pang, S.M.; Wang, M.; Wang, Y.J. Deep subspace mutual learning for cancer subtypes prediction. Bioinformatics 2021, 37, 3715–3722.
- Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
- BrianMBot. Colorectal Cancer Subtyping Consortium (CRCSC), 2014. Available online: https://repo-prod.prod.sagebase.org/repo/v1/doi/locate?id=syn2623706&type=ENTITY (accessed on 9 October 2025).
- Guinney, J.; Dienstmann, R.; Wang, X.; De Reynies, A.; Schlicker, A.; Soneson, C.; Marisa, L.; Roepman, P.; Nyamundanda, G.; Angelino, P.; et al. The consensus molecular subtypes of colorectal cancer. Nat. Med. 2015, 21, 1350–1356.
- Synapse. Synapse Project Wiki. Available online: https://www.synapse.org/#!Synapse:syn2623706/wiki/ (accessed on 26 March 2025).
- Cancer Genome Atlas Research Network. The Cancer Genome Atlas Pan-Cancer analysis project. Nat. Genet. 2013, 45, 1113–1120.
- Subramanian, A.; Tamayo, P.; Mootha, V.K.; Mukherjee, S.; Ebert, B.L.; Gillette, M.A.; Paulovich, A.; Pomeroy, S.L.; Golub, T.R.; Lander, E.S.; et al. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA 2005, 102, 15545–15550.
- Gao, F.; Wang, W.; Tan, M.; Zhu, L.; Zhang, Y.; Fessler, E.; Vermeulen, L.; Wang, X. DeepCC: A novel deep learning-based framework for cancer molecular subtype classification. Oncogenesis 2019, 8, 44.
- Sung, H.; Ferlay, J.; Siegel, R.L.; Laversanne, M.; Soerjomataram, I.; Jemal, A.; Bray, F. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA A Cancer J. Clin. 2021, 71, 209–249.
- de Back, T.R.; van Hooff, S.R.; Sommeijer, D.W.; Vermeulen, L. Transcriptomic subtyping of gastrointestinal malignancies. Trends Cancer 2024, 10, 842–856.
- Raghav, S.; Suri, A.; Kumar, D.; Aakansha, A.; Rathore, M.; Roy, S. A hierarchical clustering approach for colorectal cancer molecular subtypes identification from gene expression data. Intell. Med. 2024, 4, 43–51.
- Young, M.R.; Craft, D.L. Pathway-informed classification system (PICS) for cancer analysis using gene expression data. Cancer Inform. 2016, 15, 151–161.
- Ghazarian, A.A.; Kelly, S.P.; Altekruse, S.F.; Rosenberg, P.S.; McGlynn, K.A. Future of testicular germ cell tumor incidence in the United States: Forecast through 2026. Cancer 2017, 123, 2320–2328.
- Kristensen, D.M.; Sonne, S.B.; Ottesen, A.M.; Perrett, R.M.; Nielsen, J.E.; Almstrup, K.; Skakkebaek, N.E.; Leffers, H.; Rajpert-De Meyts, E. Origin of pluripotent germ cell tumours: The role of microenvironment during embryonic development. Mol. Cell. Endocrinol. 2008, 288, 111–118.
- Fazal, Z.; Singh, R.; Fang, F.; Bikorimana, E.; Baldwin, H.; Corbet, A.; Tomlin, M.; Yerby, C.; Adra, N.; Albany, C.; et al. Hypermethylation and global remodelling of DNA methylation is associated with acquired cisplatin resistance in testicular germ cell tumours. Epigenetics 2021, 16, 1071–1084.





| Study Reference | Data Type | Approach | Limitation |
|---|---|---|---|
| [16] | Histopathological images, gene expression data, copy number variation (CNV) | CNN for images and a deep neural network for CNV and expression data; the results are combined using weighted linear aggregation. | Better image preprocessing, such as noise removal, may improve prediction results. Deep learning models that use hybrid-modality data can be computationally expensive. |
| [17] | H&E slides | Inception-ResNet-based CNN architectures. Separate models were trained for each resolution, and the third-to-last layer concatenates all three models for the final prediction. | Multi-resolution models are computationally expensive, and scalability can be a challenge. |
| [18] | Ultrasound images | ResNet50. Data augmentation was applied to reduce overfitting, and stochastic gradient descent was used as the optimizer. | Preprocessing the ultrasound images could minimize the impact of artifacts. |
| [19] | RNA-seq data and clinical information | Gene expression data were transformed into subtype-specific gene co-expression networks, from which the specific modules were identified. | Integrating multi-omics data may enhance its effectiveness. |
| [20] | mRNA-seq data, miRNA-seq data, and copy number variation (CNV) | Low-dimensional features were extracted using denoising autoencoders (DAE) and used to cluster the subtypes. A logistic regression classifier was then built on the clustered subtype data. | The robustness and accuracy of the model are not clearly defined. |
| [21] | RNA-seq data | Pyroptosis-related molecular subtypes were identified using consensus clustering. A classification model was created using CatBoost. | Only online databases were used, so the diversity of patients and treatment responses is not captured, which limits generalizability. |
| [22] | Multi-omics data | Features were extracted using an adversarial learning framework (GAN). Consensus clustering together with a Gaussian mixture model was used to identify the molecular subtypes of the tumor samples. | Selecting features carefully from the raw input data could lead to better performance. |
| [23] | Gene expression data and copy number variation (CNV) | Deep CNN models were trained on each data type and then combined in the last layer for the final prediction. | Considering the correlation between different data types during integration can strengthen the model’s validity. |
| [24] | Whole-slide images (WSI) | A model was designed to classify tumor and non-tumor H&E slide tiles; it was then used to keep only the tumor tiles for molecular subtyping. | This approach may lack generalizability, as testing was not performed on external data. |
| [25] | Gene expression data | Combined CNN and bidirectional gated recurrent unit. | Using genomic data from various platforms could improve its efficiency. |
| [26] | Gene expression data | Subtype-specific genes are extracted and fed into a feedforward neural network. L1 and L2 regularization and dropout are used to avoid overfitting. | The effectiveness of the proposed framework is highly dependent on the training dataset, limiting its generalizability to unseen data. |
| [27] | MRI, MET-PET images, and clinical numeric data | A deep neural network was designed using the MRI and patient-profile numeric data. | The test accuracy needs to be improved. |

| Dataset | DeepCMS Accuracy | DeepCC Accuracy | DeepCMS Sensitivity | DeepCC Sensitivity | DeepCMS Specificity | DeepCC Specificity | DeepCMS Balanced Accuracy | DeepCC Balanced Accuracy |
|---|---|---|---|---|---|---|---|---|
| GSE39582 | 0.88 | 0.86 | 0.87 | 0.87 | 0.95 | 0.95 | 0.91 | 0.91 |
| GSE13067 | 0.92 | 0.89 | 0.93 | 0.87 | 0.97 | 0.96 | 0.95 | 0.92 |
| GSE37892 | 0.96 | 0.93 | 0.93 | 0.90 | 0.98 | 0.97 | 0.96 | 0.94 |
| GSE17536 | 0.90 | 0.91 | 0.87 | 0.90 | 0.96 | 0.96 | 0.91 | 0.99 |
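
As a worked check of the per-dataset table above, the sketch below recomputes balanced accuracy as the mean of the reported sensitivity and specificity. This definition is an assumption, since the source does not state how balanced accuracy was derived; the two model columns for each metric are read here as DeepCMS and DeepCC. Under that assumption, every reported value is within 0.01 of the recomputed one except the DeepCC entry for GSE17536 (0.99 reported against 0.93 recomputed). Whether that reflects a different computation or a transcription issue cannot be determined from the table alone.

```python
# (sensitivity, specificity, reported balanced accuracy), copied from the table above.
# The model labels are an assumed reading of the two columns per metric.
rows = {
    ("GSE39582", "DeepCMS"): (0.87, 0.95, 0.91),
    ("GSE39582", "DeepCC"): (0.87, 0.95, 0.91),
    ("GSE13067", "DeepCMS"): (0.93, 0.97, 0.95),
    ("GSE13067", "DeepCC"): (0.87, 0.96, 0.92),
    ("GSE37892", "DeepCMS"): (0.93, 0.98, 0.96),
    ("GSE37892", "DeepCC"): (0.90, 0.97, 0.94),
    ("GSE17536", "DeepCMS"): (0.87, 0.96, 0.91),
    ("GSE17536", "DeepCC"): (0.90, 0.96, 0.99),
}


def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Assumed definition: arithmetic mean of sensitivity and specificity."""
    return (sensitivity + specificity) / 2


# Absolute gap between the recomputed and the reported value for each entry.
deviations = {
    key: abs(balanced_accuracy(sens, spec) - reported)
    for key, (sens, spec, reported) in rows.items()
}

# Entries that differ by more than rounding of two-decimal inputs would explain.
outliers = [key for key, gap in deviations.items() if gap > 0.01]
print(outliers)
```
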

| Metric | SVM | Random Forest | DeepCC | DeepCMS |
|---|---|---|---|---|
| Accuracy | 0.90 | 0.84 | 0.89 | 0.92 |
| Sensitivity | 0.87 | 0.84 | 0.89 | 0.90 |
| Specificity | 0.95 | 0.94 | 0.96 | 0.968 |
| Balanced Accuracy | 0.912 | 0.89 | 0.92 | 0.93 |
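
The source does not define how the aggregated metrics are computed. One possible reading, which is an assumption here and not the authors' stated method, is an unweighted mean over the four datasets. The sketch below applies that reading to the per-dataset DeepCMS values and compares the result with the aggregated DeepCMS column.

```python
from statistics import mean

# Per-dataset DeepCMS values in the order GSE39582, GSE13067, GSE37892, GSE17536.
per_dataset = {
    "accuracy": [0.88, 0.92, 0.96, 0.90],
    "sensitivity": [0.87, 0.93, 0.93, 0.87],
    "specificity": [0.95, 0.97, 0.98, 0.96],
    "balanced_accuracy": [0.91, 0.95, 0.96, 0.91],
}

# Aggregated DeepCMS values as reported in the table above.
reported = {
    "accuracy": 0.92,
    "sensitivity": 0.90,
    "specificity": 0.968,
    "balanced_accuracy": 0.93,
}

# Assumed aggregation: unweighted mean across datasets.
aggregated = {name: mean(values) for name, values in per_dataset.items()}
gaps = {name: abs(aggregated[name] - reported[name]) for name in reported}

for name in reported:
    print(f"{name}: mean={aggregated[name]:.4f} reported={reported[name]}")
```

The means agree with the reported DeepCMS figures to within 0.01, which is consistent with this reading but does not confirm it, because the per-dataset inputs are themselves rounded to two decimals.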
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Khan, M.W.; Ahmed, G.; Shahzad, M.; Namoun, A.; Hussain, S.; Alanazi, M.H. DeepCMS: A Feature Selection-Driven Model for Cancer Molecular Subtyping with a Case Study on Testicular Germ Cell Tumors. Diagnostics 2025, 15, 2730. https://doi.org/10.3390/diagnostics15212730

