DeepGene-BC: Deep Learning-Based Breast Cancer Subtype Prediction via Somatic Point Mutation Profiles
Simple Summary
Abstract
1. Introduction
2. Materials and Methods
2.1. Mutation Data Process
2.2. Mutual Information
2.3. Model Architecture
- (1) The Linear (Wide) Component: A simple linear layer that models the direct contribution of each mutation to the subtype logits. This component captures the marginal effect of individual gene mutations, which can reflect the influence of well-known subtype-associated driver genes (e.g., TP53, PIK3CA, or GATA3), as mutations in such genes alone may bias tumors toward specific molecular subtypes.
- (2) The Factorization Machine (FM) Component: To capture pairwise gene interactions without a combinatorial explosion of parameters, the binary input is projected into a dense embedding space, in which each feature is assigned a latent vector of fixed embedding dimension. The FM component computes second-order interactions via the inner products of these latent vectors. Biologically, this branch models co-occurring or mutually exclusive mutation patterns, capturing gene–gene relationships that may reflect functional cooperation, synthetic lethality, or shared involvement in signaling pathways (e.g., coordinated alterations within the PI3K-AKT or cell-cycle regulatory axes).
- (3) The Deep Component: To extract higher-order patterns, the feature embeddings are aggregated via sum-pooling and fed into a Multi-Layer Perceptron (MLP). The MLP consists of two hidden layers with 64 and 32 units, respectively. Each dense layer applies a non-linear activation, followed by Dropout to prevent overfitting. The output is a vector representing high-level abstractions of the mutational profile. This component captures complex, non-linear interactions among multiple mutations simultaneously, which may correspond to pathway-level dysregulation and broader molecular programs that cannot be explained by individual genes or pairwise interactions alone.
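The three branches can be illustrated with a minimal NumPy forward pass. This is a sketch under stated assumptions, not the authors' implementation: the number of genes, the number of subtypes, the embedding dimension, the ReLU activation, and the way the branches are merged into logits (`W_out`) are all placeholders that this text does not specify. Dropout is omitted because only inference is shown. The FM term uses the standard identity that replaces the explicit sum over gene pairs with a difference of squared sums.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only; none of these values is taken from the paper.
N_GENES, N_CLASSES, EMB_DIM = 200, 5, 8

# Randomly initialised parameters (they would be learned during training).
W_lin = rng.normal(0, 0.01, (N_GENES, N_CLASSES))  # wide branch weights
b_lin = np.zeros(N_CLASSES)
V = rng.normal(0, 0.1, (N_GENES, EMB_DIM))         # one latent vector per gene
W1, b1 = rng.normal(0, 0.1, (EMB_DIM, 64)), np.zeros(64)
W2, b2 = rng.normal(0, 0.1, (64, 32)), np.zeros(32)
W_out = rng.normal(0, 0.1, (1 + 32, N_CLASSES))    # assumed merge layer


def fm_term(x):
    """Second-order FM score per sample: sum over i<j of <v_i, v_j> x_i x_j."""
    summed = x @ V               # sum of embeddings of mutated genes
    sq_sum = (x ** 2) @ (V ** 2)
    return 0.5 * (summed ** 2 - sq_sum).sum(axis=1, keepdims=True)


def fm_bruteforce(x_row):
    """Explicit pairwise sum for one sample, used only to check fm_term."""
    idx = np.flatnonzero(x_row)
    return sum(V[i] @ V[j] for a, i in enumerate(idx) for j in idx[a + 1:])


def forward(x):
    """x: (batch, N_GENES) binary mutation matrix -> (batch, N_CLASSES) logits."""
    linear = x @ W_lin + b_lin                 # (1) marginal gene effects
    fm = fm_term(x)                            # (2) pairwise interactions
    h = np.maximum((x @ V) @ W1 + b1, 0.0)     # (3) sum-pooled embeddings -> 64 units
    h = np.maximum(h @ W2 + b2, 0.0)           #     -> 32 units (ReLU assumed)
    return linear + np.concatenate([fm, h], axis=1) @ W_out


# Sparse binary profiles: roughly 5% of genes mutated per sample.
x = (rng.random((4, N_GENES)) < 0.05).astype(float)
logits = forward(x)
```

Because the wide and FM branches share the same binary input, and the FM and deep branches share the same embedding matrix `V`, the pairwise and higher-order terms are computed from one set of per-gene latent vectors rather than from separate parameter tables.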
2.4. Model Training
2.5. Cross Validation
2.6. Recurrence Threshold Selection and Sensitivity Analysis
2.7. Statistical Uncertainty Estimation by Bootstrap Resampling
3. Results
3.1. Overview of deepGene-BC
3.2. Sparsity and Heterogeneity of Somatic Point Mutations in Breast Cancer
3.3. Pathway-Constrained Extraction of Subtype-Associated Mutational Signals
3.4. deepGene-BC Architecture and Cross-Validation Performance
3.5. Performance of deepGene-BC on the Independent Test Set
3.6. Ablation Analysis of Model Architecture
4. Discussion
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest





| Metric | deepGene-BC | DNN | SVM | KNN | RF |
|---|---|---|---|---|---|
| Accuracy | 0.81 (±0.03) | 0.73 (±0.03) | 0.67 (±0.02) | 0.66 (±0.04) | 0.65 (±0.02) |
| Precision (macro) | 0.81 (±0.04) | 0.64 (±0.04) | 0.56 (±0.04) | 0.56 (±0.13) | 0.42 (±0.08) |
| Recall (macro) | 0.75 (±0.03) | 0.63 (±0.04) | 0.51 (±0.03) | 0.48 (±0.06) | 0.46 (±0.03) |
| F1 (macro) | 0.77 (±0.04) | 0.63 (±0.04) | 0.52 (±0.03) | 0.46 (±0.07) | 0.42 (±0.04) |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Hou, P.; Liu, L.; Duan, Y.; Yin, S.; Yan, W.; Pang, C.; Yan, Y.; Aziz, S.; Torhola, M.; Kujanen, H.; et al. DeepGene-BC: Deep Learning-Based Breast Cancer Subtype Prediction via Somatic Point Mutation Profiles. Cancers 2026, 18, 570. https://doi.org/10.3390/cancers18040570
APA Style: Hou, P., Liu, L., Duan, Y., Yin, S., Yan, W., Pang, C., Yan, Y., Aziz, S., Torhola, M., Kujanen, H., Förger, K., Shi, H., He, G., & Shi, Y. (2026). DeepGene-BC: Deep Learning-Based Breast Cancer Subtype Prediction via Somatic Point Mutation Profiles. Cancers, 18(4), 570. https://doi.org/10.3390/cancers18040570

