# Deciphering Genomic Heterogeneity and the Internal Composition of Tumour Activities through a Hierarchical Factorisation Model


## Abstract


## 1. Introduction

## 2. Materials and Methods

#### 2.1. Hierarchical Model of Factorisation

#### 2.2. Deriving the Internal Composition from Model Matrices

#### 2.3. Estimating the Optimal Number of Components

#### 2.4. Model Validation

## 3. Results

## 4. Discussion

## 5. Conclusions

## Supplementary Materials

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest


**Figure 1.** Description of the hierarchical model applied to a controlled simulation representing a cohort of 20 individuals divided into 2 phenotypic groups (Control and Disease). Applying the Hipathia function to $X_g$ yields the pathway activity matrix ($X_p = \hslash(X_g)$), composed of 5 rows corresponding to the 5 signalling cascades ([c1, c2, c3, c4, c5]) modelled by Hipathia (below). These 5 cascades involve 22 genes (A to V) responsible for regulating the biological process under study. The $S$ matrix shows the hierarchical relationship between the 3 components obtained at the pathway level ($W_p$) and the 7 components obtained at the gene level ($W_g$). The first pathway-level component (kp1) regulates the biological process by overactivating the c1 and c3 signalling cascades. The hierarchical model finds two different strategies (components) at the gene level that produce this effect: kg1 overactivates genes A and J, whereas kg2 overactivates genes B and K. Likewise, the mixing matrix obtained at the pathway level ($H_p$) shows different configurations for the two groups of individuals: the Control group weights the first component (kp1), the Disease group weights the second (kp2), and kp3 is shared by both groups.
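The matrix dimensions in this simulation can be summarised as follows. This is only a sketch under stated assumptions: the approximate products $X \approx WH$ are the standard NMF form, and the orientation of $S$ (gene-level components in rows, pathway-level components in columns) is inferred from the products $W_p S^T$ and $S^T H_g$ that appear in the captions of Figures 2 and 4, not stated in this caption.

$$
\begin{aligned}
X_g &\approx W_g H_g, & X_g &\in \mathbb{R}^{22 \times 20}, & W_g &\in \mathbb{R}^{22 \times 7}, & H_g &\in \mathbb{R}^{7 \times 20},\\
X_p &\approx W_p H_p, & X_p &\in \mathbb{R}^{5 \times 20}, & W_p &\in \mathbb{R}^{5 \times 3}, & H_p &\in \mathbb{R}^{3 \times 20},\\
H_p &\approx S^T H_g, & S &\in \mathbb{R}^{7 \times 3}.
\end{aligned}
$$

With these shapes, $S^T H_g$ is $3 \times 20$, matching $H_p$, and $W_p S^T$ is $5 \times 7$, with one column per gene-level component.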

**Figure 2.** Illustrative example of the protocol designed to estimate the optimal number of components ($k_g$ and $k_p$). Left: the error curve (dashed blue lines) fitted to 5 independent factorisations (orange dots) performed for pathways (top) and genes (bottom). This fit identifies $k_p = 7$ and $k_g = 33$ as potential “elbow points”. From these two values we define the intervals $K_p = [6, 7, 8]$ and $K_g = [32, 33, 34]$ to be explored by the hierarchical model, giving a total of 9 joint factorisations. Right: scatter plots comparing the values of the gene-level components ($W_g$) with their associated pathway-level components ($W_p S^T$) for the 9 joint factorisations. The pair $k_p = 7$ and $k_g = 32$ provides the best correlation between the components obtained at the two levels and is therefore the pair used for subsequent factorisations.
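The grid search described in this caption can be sketched in a few lines of Python. The function names `candidate_pairs` and `select_pair` and the `radius` parameter are hypothetical, and the scores below are placeholders standing in for the correlation that each joint factorisation would produce; they are not values from the paper.

```python
from itertools import product


def candidate_pairs(elbow_p, elbow_g, radius=1):
    """Return every (k_p, k_g) pair within `radius` of the two elbow points."""
    K_p = range(elbow_p - radius, elbow_p + radius + 1)
    K_g = range(elbow_g - radius, elbow_g + radius + 1)
    return list(product(K_p, K_g))


def select_pair(scores):
    """Pick the (k_p, k_g) pair with the highest cross-level correlation."""
    return max(scores, key=scores.get)


# Elbow points reported in the caption: k_p = 7 and k_g = 33.
pairs = candidate_pairs(7, 33)

# Placeholder scores only: in practice each value would come from one
# joint factorisation run with that pair of component counts.
scores = {pair: 0.0 for pair in pairs}
scores[(7, 32)] = 1.0

best = select_pair(scores)
print(len(pairs), best)
```

Exploring a small window around each elbow point keeps the number of joint factorisations low (here $3 \times 3 = 9$) while still allowing the two levels to settle on slightly different component counts than the independent fits suggest.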

**Figure 3.** Difference between the expected number of components and the number estimated over 100 simulations using the cophenetic correlation coefficient, the silhouette method and the hierarchical factorisation model (HFM), for genes and pathways, respectively. The hierarchical model yields the smallest differences in both spaces and in most cases recovers the exact number of expected components.

**Figure 4.** Distribution of the correlation values between the simulated and the optimised matrices obtained using the original NMF, the alternating non-negative least squares method (NNLS) and the hierarchical factorisation model (HFM). The hierarchical model provides the highest correlations, especially for the coherence between the two sets of components ($W_g$ vs. $W_p S^T$) and their corresponding mixtures ($S^T H_g$ vs. $H_p$).
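One way to compute a correlation between two matrices of the same shape, such as a simulated matrix and its optimised counterpart, is to take the Pearson coefficient over their flattened entries. The sketch below makes that assumption; the caption does not state which correlation measure was used, and the helper names `flatten` and `matrix_correlation` are hypothetical. It also assumes the components have already been matched, since NMF components are defined only up to permutation and scaling.

```python
from math import sqrt


def flatten(matrix):
    """Turn a list-of-rows matrix into a flat list of values."""
    return [value for row in matrix for value in row]


def matrix_correlation(a, b):
    """Pearson correlation between the entries of two same-shaped matrices."""
    x, y = flatten(a), flatten(b)
    if len(x) != len(y):
        raise ValueError("matrices must have the same number of entries")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant matrix")
    return cov / sqrt(var_x * var_y)


# Toy matrices: the second is an exact rescaling of the first.
simulated = [[1.0, 2.0], [3.0, 4.0]]
recovered = [[2.0, 4.0], [6.0, 8.0]]
r = matrix_correlation(simulated, recovered)
print(round(r, 6))
```

Because the Pearson coefficient is invariant to positive rescaling, this measure does not penalise the scale ambiguity between $W$ and $H$, but it does require the columns (or rows) of the two matrices to be in the same order.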

**Figure 5.** Graphical representation of the hierarchical model obtained for the biological function cellular response to epidermal growth factor stimulus, applied to the Her2 subtype. The most frequent combination observed at the pathway level relies on the kp4 component (coloured green), which overactivates two signalling cascades of the Adherens junction pathway. In the gene space, the kp4 component is associated with the kg4 component, whose relevant genes are ERBB4 and EGFR, both belonging to the epidermal growth factor receptor family.

**Figure 6.** Graphical representation of the hierarchical model obtained for the biological function Notch signalling, applied to the Basal subtype. The most frequent pathway-level combination relies on the components kp2 (dark green), kp12 (magenta) and kp4 (pink), which overactivate signalling cascades from the Wnt, PI3K-Akt, MAPK, Hippo and Adherens junction pathways. Interestingly, the combination also activates the Notch signalling pathway, which is highly relevant in the Basal subtype. In the gene space, these components include several cancer-related genes such as MYC, EGFR, SNAI1 and NOTCH1, among others.

**Figure 7.** Graphical representation of the hierarchical model obtained for the biological function response to oestrogen, applied to the Luminal A and Luminal B subtypes. The most frequent pathway-level combination is shared by both luminal subtypes and involves the kp3 and kp10 components. These components mainly activate thyroid hormone reception, as well as the Adherens junction, Tight junction and ErbB signalling pathways. At the gene level, the associated components activate relevant genes such as the growth factor receptors ERBB4 and AREG, as well as ESR1, one of the oestrogen receptor isoforms.

**Figure 8.** Graphical representation of the hierarchical model obtained for the biological function G2/M transition of mitotic cell cycle, applied to the Luminal A and Luminal B subtypes. The model found two dominant pathway-level combinations that together account for the majority of individuals. Both combinations share the kp12 component (green), which overactivates the PI3K-Akt pathway. Interestingly, the first combination is specific to the Luminal A subtype and adds the kp10 (orange) and kp2 (magenta) components, which mainly activate the FoxO and ErbB signalling pathways and partially activate the MAPK and TGF-beta pathways. In the gene space, these components are associated with the kg13/20 and kg15/19 components, respectively, which characteristically activate the cyclin inhibitors CDKN1A, CDKN1B and CDKN2B. The second combination is specific to the Luminal B subtype and incorporates the components kp6 (pink) and kp4 (light green), which significantly activate the PI3K-Akt pathway and the P53 protein activation pathway. In the gene space, these components are associated with the kg16 and kg4/18 components, which activate a set of genes that positively regulate cell proliferation, such as PLK1, or the cyclins CDC25C, CCNA2, CDK1 and CDK2.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Carbonell-Caballero, J.; López-Quílez, A.; Conesa, D.; Dopazo, J.
Deciphering Genomic Heterogeneity and the Internal Composition of Tumour Activities through a Hierarchical Factorisation Model. *Mathematics* **2021**, *9*, 2833.
https://doi.org/10.3390/math9212833
