Heterogeneous Clustering of Multiomics Data for Breast Cancer Subgroup Classification and Detection
Abstract
1. Introduction
2. Results
3. Materials and Methods
3.1. Data
3.2. Feature Selection
3.3. Joint Latent Variable Model
3.4. Data Clustering with Expectation–Maximization
Algorithm 1: Multiomics data integration with DCEM.
% load data
% feature selection as in Figure 7a
% initialize latent variable model as in Figure 7b
% optimize latent distributions
return
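The four steps named in Algorithm 1 can be sketched in Python as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the source uses the DCEM R package, whereas this sketch substitutes a plain NumPy EM for a diagonal-covariance Gaussian mixture. The synthetic two-view data, the variance-based feature filter, the SVD projection standing in for the joint latent variable model, the farthest-point seeding, and every function name are hypothetical choices made only to keep the example self-contained.

```python
import numpy as np


def load_data(rng, n_per_group=60, n_features=(200, 150)):
    """Hypothetical stand-in for loading omics layers: two synthetic views."""
    labels = np.repeat([0, 1], n_per_group)
    views = []
    for p in n_features:
        x = rng.normal(size=(labels.size, p))
        x[labels == 1, :10] += 3.0  # first 10 features carry the group signal
        views.append(x)
    return views, labels


def select_features(x, k=20):
    """Assumed filter: keep the k highest-variance features of one view."""
    top = np.argsort(x.var(axis=0))[::-1][:k]
    return x[:, top]


def latent_projection(views, n_latent=2):
    """Assumed latent step: concatenate standardized views, project via SVD."""
    z = np.hstack([(v - v.mean(0)) / v.std(0) for v in views])
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_latent] * s[:n_latent]


def farthest_point_init(x, k, rng):
    """Seed means: one random sample, then the sample farthest from those chosen."""
    idx = [int(rng.integers(x.shape[0]))]
    for _ in range(k - 1):
        d = ((x[:, None, :] - x[idx]) ** 2).sum(axis=2).min(axis=1)
        idx.append(int(d.argmax()))
    return x[idx].copy()


def em_gmm(x, k=2, n_iter=100, tol=1e-8, seed=0):
    """Plain EM for a diagonal-covariance Gaussian mixture (not DCEM itself)."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    means = farthest_point_init(x, k, rng)
    variances = np.tile(x.var(axis=0), (k, 1))
    weights = np.full(k, 1.0 / k)
    prev = -np.inf
    for _ in range(n_iter):
        # E-step: per-sample, per-cluster log joint density, then responsibilities
        log_p = -0.5 * (
            (x[:, None, :] - means) ** 2 / variances
            + np.log(2 * np.pi * variances)
        ).sum(axis=2) + np.log(weights)
        log_norm = np.logaddexp.reduce(log_p, axis=1)
        resp = np.exp(log_p - log_norm[:, None])
        # M-step: re-estimate weights, means, and variances
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / n
        means = (resp.T @ x) / nk[:, None]
        variances = np.maximum((resp.T @ x**2) / nk[:, None] - means**2, 1e-6)
        ll = log_norm.sum()
        if abs(ll - prev) < tol:
            break
        prev = ll
    return resp.argmax(axis=1), ll


def run_pipeline(seed=0):
    """Load, select features per view, project to latent space, cluster with EM."""
    rng = np.random.default_rng(seed)
    views, labels = load_data(rng)
    selected = [select_features(v) for v in views]
    latent = latent_projection(selected)
    clusters, ll = em_gmm(latent, k=2, seed=seed)
    return latent, clusters, labels, ll


if __name__ == "__main__":
    latent, clusters, labels, ll = run_pipeline()
    agreement = max((clusters == labels).mean(), (clusters != labels).mean())
    print(f"latent shape: {latent.shape}, agreement with planted groups: {agreement:.2f}")
```

Feature selection is applied to each view separately before the views are combined, so no single high-dimensional layer dominates the latent space; whether this matches the ordering used in the paper's Figure 7 cannot be confirmed from the text available here.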
3.5. Subtype Expression Analysis
4. Discussion
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Patients | Trial 1 | Trial 2 | Trial 3 | Trial 4 | Trial 5 |
---|---|---|---|---|---|
276 | 3 × 10⁻⁶ | 3 × 10⁻⁹ | 2 × 10⁻⁴ | 2 × 10⁻⁶ | 8 × 10⁻⁶ |
265 | 2 × 10⁻⁵ | 9 × 10⁻⁴ | 1 × 10⁻⁵ | 6 × 10⁻⁶ | 1 × 10⁻⁶ |
255 | 4 × 10⁻⁸ | 2 × 10⁻⁴ | 5 × 10⁻⁴ | 7 × 10⁻⁶ | 6 × 10⁻⁵ |
245 | 5 × 10⁻⁴ | 2 × 10⁻⁵ | 2 × 10⁻⁴ | 1 × 10⁻⁶ | 0.001 |
235 | 4 × 10⁻⁵ | 4 × 10⁻⁴ | 9 × 10⁻⁶ | 3 × 10⁻⁴ | 0.05 |
225 | 1 × 10⁻⁵ | 3 × 10⁻⁹ | 4 × 10⁻⁶ | 1 × 10⁻⁴ | 1 × 10⁻⁵ |
215 | 8 × 10⁻⁵ | 4 × 10⁻⁶ | 1 × 10⁻⁴ | 0.003 | 2 × 10⁻⁵ |
205 | 4 × 10⁻⁶ | 8 × 10⁻⁴ | 5 × 10⁻⁴ | 0.002 | 0.001 |
200 | 1 × 10⁻⁵ | 3 × 10⁻⁵ | 4 × 10⁻⁴ | 0.004 | 3 × 10⁻⁶ |
150 | 9 × 10⁻⁴ | 3 × 10⁻⁴ | 5 × 10⁻⁷ | 0.003 | 0.009 |
100 | 0.04 | 0.003 | 0.5 | 0.004 | 2 × 10⁻⁵ |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Pateras, J.; Lodi, M.; Rana, P.; Ghosh, P. Heterogeneous Clustering of Multiomics Data for Breast Cancer Subgroup Classification and Detection. Int. J. Mol. Sci. 2025, 26, 1707. https://doi.org/10.3390/ijms26041707