# Minimum Spanning vs. Principal Trees for Structured Approximations of Multi-Dimensional Datasets

^{1}

^{2}

^{3}

^{4}

^{5}

^{*}

## Abstract

**:**

## 1. Introduction

## 2. Results

#### 2.1. Comparing and Benchmarking Graph-Based Data Approximation Methods Using Data Point Partitioning by Graph Segments

#### 2.1.1. Outlining the Method

- Split each approximating graph into segments. By a segment we mean certain path from one branching point or leaf node to another.
- Create clustering (partitioning) of the dataset by segments. Each data point is associated with the nearest segment of the graph. Thus, in general, if graph is partitioned into N segments at the previous step then the dataset will be partitioned into N clusters.
- Compare the clusterings for the dataset using standard metrics (like adjusted Rand index or adjusted mutual information or any other suitable measure). Since the clusterings are produced from the structure of the approximating graphs, the score shows how two GBDA results are similar to each other.

#### 2.1.2. Details on Step 1: Segmenting the Graphs

#### 2.1.3. Details on Step 2: Clustering (Partitioning) Data Points by Graph Segments

#### 2.1.4. Details on Step 3: Compare the Clustering Measures for the Datasets

#### 2.2. Unsupervised Scores for Comparing GBDA Methods and Parameter Fine-Tuning

- Split each approximating graph into segments
- Create clustering (partitioning) of the dataset by the segments
- Use standard metrics and tools to calculate how good is the clustering:, e.g., calculate silhouette, Calinski–Harabasz, Davies-Bouldin score etc.

#### 2.3. Comparison of Clustering Based Scores with Other GBDA Metrics

#### 2.4. Comparison of MST-Based Graph-Based Approximations and ElPiGraph

- Both MST and ElPiGraph can achieve similar quality of graph-based data approximation by tuning their main parameter, number of nodes. The quality here is estimated by the scores introduced above and in comparison with ground truth for simulated datasets. However, the advantage of ElPiGraph is in its stability with respect to this parameter. Small modifications do not change the quality dramatically, while for MST even small modifications of the parameter may lead to quite a drastic change of the approximating graph structure and, respectively, the quality of ground truth approximation. That means that in practical situation where ground truth is not available, ElPiGraph has an advantage since one should not be afraid of choosing the parameter incorrectly. Therefore, a reasonably wide range of parameters gives similar results, while for MST approach choosing an incorrect parameter would give a significant loss in approximation quality.
- The embedment of approximating graphs produced by ElPiGraph are much smoother comparing to MST. This observation is consistent with the previous one and with ElPiGraph algorithm which penalizes for non-smoothness and hence it produces graphs which would not produce too much branching points, the problem which underlies the MST instability.
- We analyse a possible combination of MST+ElPiGraph methods, where the initial graph approximation is obtained by MST and then ElPiGraph algorithm starts from this initialization. One of our conclusions is that ElPiGraph “forgets” the initial conditions quite fast. Nevertheless the combination of methods may have certain advantages.
- We consider a real-life single cell RNA sequencing dataset and demonstrate that our approach for choosing the parameters of the trajectory inference method based on unsupervised clustering scores provides reasonable results.

#### 2.4.1. Smoothness of ElpiGraph-Based Graph Approximators Comparing to MST

#### 2.4.2. Comparison between ElPiGraph and MST on Large Number of Different Datasets

#### 2.4.3. Initialization of ElPiGraph with MST-Based Tree

#### 2.5. Tuning Parameters of GBDA Algorithms Using Unsupervised Clustering Quality Scores

#### 2.6. Single Cell Data Analysis Example

- ElPiGraph demonstrates much higher stability than MST and choice of main parameter (node number) can be made more easily and reliably.
- Unsupervised metrics for trajectories introduced above (based on silhouette, Calinski–Harabas, Davies-Bouldin clustering scores) provides insights for the choice of parameters for trajectory inference methods.
- Heuristic for MST, to take number of nodes equal to approximately square root of number of datapoints fails significantly for the present example. It would give 21 nodes, however trajectories of MST with 21 nodes show significant “overbranching” (see figures below). Reasonable node numbers for MST is 9 or a little below, but 10 and above leads to biologically incorrect branching.

## 3. Discussion

## 4. Materials and Methods

#### 4.1. Minimal Spanning Tree-Based Data Approximation Method

- Split the dataset to certain pieces (for example, by applying K-means clustering for some K).
- Construct kNN (k-nearest neighbour) graph with nodes at the centers of clusters.
- Compute an MST (minimal spanning tree) of the kNN graph computed as described above.

#### 4.2. Method of Elastic Principal Graphs (ElPiGraph)

#### 4.3. Benchmark Implementation

## Author Contributions

## Funding

## Conflicts of Interest

## Abbreviations

PG | Principal graph |

SOM | Self-Organizing Map |

kNN | k-nearest neighbours |

GBDA | Graph-based data approximations |

TI | Trajectory inference |

MST | Minimal spanning tree |

PAGA | Partition-based graph abstraction |

ElPiGraph | Elastic Principal Graph |

## References

- Kohonen, T. Self-organized formation of topologically correct feature maps. Biol. Cybern.
**1982**, 43. [Google Scholar] [CrossRef] - Gorban, A.N.; Zinovyev, A.Y. Method of elastic maps and its applications in data visualization and data modeling. Int. J. Comput. Anticipatory Syst. CHAOS
**2001**, 12, 353–369. [Google Scholar] - Gorban, A.N.; Zinovyev, A. Principal manifolds and graphs in practice: From molecular biology to dynamical systems. Int. J. Neural Syst.
**2010**, 20, 219–232. [Google Scholar] [CrossRef] [PubMed][Green Version] - Hastie, T.; Stuetzle, W. Principal Curves. J. Am. Stat. Assoc.
**1989**, 84. [Google Scholar] [CrossRef] - Kégl, B.; Krzyzak, A.; Linder, T.; Zeger, K. A polygonal line algorithm for constructing principal curves. In Proceedings of the Advances in Neural Information Processing Systems, Denver, CO, USA, 1–5 December 1998. [Google Scholar]
- Gorban, A.; Kégl, B.; Wunch, D.; Zinovyev, A. (Eds.) Principal Manifolds for Data Visualisation and Dimension Reduction; Springer: Berlin, Germany, 2008; p. 340. [Google Scholar] [CrossRef][Green Version]
- Amsaleg, L.; Chelly, O.; Furon, T.; Girard, S.; Houle, M.E.; Kawarabayashi, K.I.; Nett, M. Estimating local intrinsic dimensionality. In Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia, 10–13 August 2015. [Google Scholar] [CrossRef]
- Albergante, L.; Bac, J.; Zinovyev, A. Estimating the effective dimension of large biological datasets using Fisher separability analysis. In Proceedings of the International Joint Conference on Neural Networks, Budapest, Hungary, 14–19 July 2019. [Google Scholar] [CrossRef][Green Version]
- Bac, J.; Zinovyev, A. Local intrinsic dimensionality estimators based on concentration of measure. arXiv
**2020**, arXiv:2001.11739. [Google Scholar] - Gorban, A.N.; Sumner, N.R.; Zinovyev, A.Y. Topological grammars for data approximation. Appl. Math. Lett.
**2007**, 20, 382–386. [Google Scholar] [CrossRef][Green Version] - Kruskal, J.B. On the shortest spanning subtree of a graph and the traveling salesman problem. Proc. Am. Math. Soc.
**1956**, 7, 48–50. [Google Scholar] [CrossRef] - Gorban, A.N.; Zinovyev, A.Y. Principal Graphs and Manifolds. arXiv
**2008**, arXiv:0809.0490. [Google Scholar] - Mao, Q.; Yang, L.; Wang, L.; Goodison, S.; Sun, Y. SimplePPT: A simple principal tree algorithm. In Proceedings of the 2015 SIAM International Conference on Data Mining, Vancouver, BC, Canada, 30 April–2 May 2015; pp. 792–800. [Google Scholar]
- Mao, Q.; Wang, L.; Tsang, I.W.; Sun, Y. Principal Graph and Structure Learning Based on Reversed Graph Embedding. IEEE Trans. Pattern Anal. Mach. Intell.
**2017**, 39, 2227–2241. [Google Scholar] [CrossRef] - Lähnemann, D.; Köster, J.; Szczurek, E.; McCarthy, D.J.; Hicks, S.C.; Robinson, M.D.; Vallejos, C.A.; Campbell, K.R.; Beerenwinkel, N.; Mahfouz, A.; et al. Eleven grand challenges in single-cell data science. Genome Biol.
**2020**, 21, 1–35. [Google Scholar] [CrossRef] - Saelens, W.; Cannoodt, R.; Todorov, H.; Saeys, Y. A comparison of single-cell trajectory inference methods. Nat. Biotechnol.
**2019**, 37, 547–554. [Google Scholar] [CrossRef] [PubMed] - Aynaud, M.M.; Mirabeau, O.; Gruel, N.; Grossetête, S.; Boeva, V.; Durand, S.; Surdez, D.; Saulnier, O.; Zaïdi, S.; Gribkova, S.; et al. Transcriptional Programs Define Intratumoral Heterogeneity of Ewing Sarcoma at Single-Cell Resolution. Cell Rep.
**2020**, 30, 1767–1779. [Google Scholar] [CrossRef] [PubMed][Green Version] - Kumar, P.; Tan, Y.; Cahan, P. Understanding development and stem cells using single cell-based analyses of gene expression. Development
**2017**, 144, 17–32. [Google Scholar] [CrossRef] [PubMed][Green Version] - Wolf, F.A.; Hamey, F.K.; Plass, M.; Solana, J.; Dahlin, J.S.; Göttgens, B.; Rajewsky, N.; Simon, L.; Theis, F.J. PAGA: Graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biol.
**2019**, 20, 1–9. [Google Scholar] [CrossRef][Green Version] - Chen, H.; Albergante, L.; Hsu, J.Y.; Lareau, C.A.; Lo Bosco, G.; Guan, J.; Zhou, S.; Gorban, A.N.; Bauer, D.E.; Aryee, M.J.; et al. Single-cell trajectories reconstruction, exploration and mapping of omics data with STREAM. Nat. Commun.
**2019**, 10, 1–14. [Google Scholar] [CrossRef][Green Version] - Bac, J.; Zinovyev, A. Lizard Brain: Tackling Locally Low-Dimensional Yet Globally Complex Organization of Multi-Dimensional Datasets. Front. Neurorob.
**2020**, 13, 110. [Google Scholar] [CrossRef][Green Version] - Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math.
**1987**, 20, 53–65. [Google Scholar] [CrossRef][Green Version] - Rand, W.M. Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc.
**1971**, 66, 846–850. [Google Scholar] [CrossRef] - Meilǎ, M. Comparing clusterings-an information based distance. J. Multivariate Anal.
**2007**, 98, 873–895. [Google Scholar] [CrossRef][Green Version] - Fowlkes, E.B.; Mallows, C.L. A method for comparing two hierarchical clusterings. J. Am. Stat. Assoc.
**1983**, 78, 553–569. [Google Scholar] [CrossRef] - Rosenberg, A.; Hirschberg, J. V-Measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Prague, Czech Republic, 28–30 June 2007. [Google Scholar]
- Shin, J.; Berg, D.A.; Zhu, Y.; Shin, J.Y.; Song, J.; Bonaguidi, M.A.; Enikolopov, G.; Nauen, D.W.; Christian, K.M.; Ming, G.L.; et al. Single-Cell RNA-Seq with Waterfall Reveals Molecular Cascades underlying Adult Neurogenesis. Cell Stem Cell
**2015**, 17, 360–372. [Google Scholar] [CrossRef] [PubMed][Green Version] - Ji, Z.; Ji, H. TSCAN: Pseudo-time reconstruction and evaluation in single-cell RNA-seq analysis. Nucleic Acids Res.
**2016**, 44, e117. [Google Scholar] [CrossRef] [PubMed][Green Version] - Street, K.; Risso, D.; Fletcher, R.B.; Das, D.; Ngai, J.; Yosef, N.; Purdom, E.; Dudoit, S. Slingshot: Cell lineage and pseudotime inference for single-cell transcriptomics. BMC Genom.
**2018**, 19, 477. [Google Scholar] [CrossRef] [PubMed][Green Version] - Parra, R.G.; Papadopoulos, N.; Ahumada-Arranz, L.; Kholtei, J.E.; Mottelson, N.; Horokhovsky, Y.; Treutlein, B.; Soeding, J. Reconstructing complex lineage trees from scRNA-seq data using MERLoT. Nucleic Acids Res.
**2019**, 47, 8961–8974. [Google Scholar] [CrossRef] [PubMed][Green Version] - Yang, L.; Wang, W.H.; Qiu, W.L.; Guo, Z.; Bi, E.; Xu, C.R. A single-cell transcriptomic analysis reveals precise pathways and regulatory mechanisms underlying hepatoblast differentiation. Hepatology
**2017**, 66, 1387–1401. [Google Scholar] [CrossRef][Green Version] - Albergante, L.; Mirkes, E.; Bac, J.; Chen, H.; Martin, A.; Faure, L.; Barillot, E.; Pinello, L.; Gorban, A.; Zinovyev, A. Robust and scalable learning of complex intrinsic dataset geometry via ElPiGraph. Entropy
**2020**, 22, 296. [Google Scholar] [CrossRef][Green Version] - Gorban, A.N.; Mirkes, E.; Zinovyev, A.Y. Robust principal graphs for data approximation. Arch. Data Sci.
**2017**, 2, 1:16. [Google Scholar] - Golovenkin, S.E.; Bac, J.; Chervov, A.; Mirkes, E.M.; Orlova, Y.V.; Barillot, E.; Gorban, A.N.; Zinovyev, A. Trajectories, bifurcations and pseudotime in large clinical datasets: Applications to myocardial infarction and diabetes data. GigaScience
**2020**, in press. [Google Scholar] [CrossRef]

**Figure 1.**Decomposing a graph into non-branching segments and partitioning the data accordingly to the graph segments. Segments are defined as paths with ends at branching or leaf nodes. Branching nodes belong to several segments—so for better illustration here it is emphasized by multiplexing branching nodes and slightly diverging these copies from their origin (it is for illustration only). Toy example of a graph approximating a cloud of data points (shown in grey), and using its decomposition into non-branching segments for partitioning the data which defines clustering of the data cloud.

**Figure 2.**Example—trajectory comparison results. Trajectories which are close to each other from visual point of view are close by introduced similarity scores (as well as induced clusterings are visually similar). For example: trajectory 1 is visually similar to 2, but less to 2 and even less to 3,4, the scores in the table shows the same ordering. Despite graphs 3 and 4 are quite different in the sense of graph theory (three nodes are added to graph 3 to obtain graph 4), but trajectories for them are almost identical, and thus clusterings are similar and their similarity scores are quite high. That illustrates the main desired property—scores depend on trajectories not graphs themselves.

**Figure 3.**Example of estimating trajectories with the help of unsupervised clustering scores—better trajectories—get higher scores. Consider three graphs approximating the dataset of dumbell like shape with 2 branching points. We choose graphs such that the first graph fits nicely the structure of the dataset, while graph with 40 nodes fits dataset in worse way (it has several branching nodes which does not correspond to underlying dataset) and the last graph is even worse (i.e., has even more branching nodes). Thus the good score should assign best score to the first, second to the second, etc. Silhouette and Davies-Bouldin scores fit that expectation: the first graph (8 nodes) has the best scores, the second (40 nodes) the second best and the third (60 nodes) is the worst. While Calinski–Harabasz index assigns the best score to the third graph, which is counter-intuitive, nevertheless it correctly orders the first and the second graphs. (The choice of concrete graphs (and nodes numbers 8,40,60) is somewhat arbitrary the only purpose is to illustrate that scores fit the intuition—good score—corresponds to good approximating graph). Thus the figure indicates that unsupervised scores can be used to select the best graph; however, that should be done with certain care. The rightest plot confirms the same conclusion in the other way. We plot silhouette scores values for all values of nodes between 6 and 60 is given (graphs are constructed by Minimal Spanning Tree (MST) technique). It is clear that large node number correspond to lower scores on that plot, and that fits the expectations; graphs with large number of nodes constructed by MST will create many branch points which does not fit the given dataset, which has only 2 branching points. Thus it again confirms the principle—bad scores—bad approximating property.

**Figure 4.**(

**A**) Examples of datasets with known underlying graph structure, a symmetric binary tree with 7 edges. Datasets are 20-dimensional, visualized with PCA. The first one is with low level of noise and the second one with high level of noise. Note that despite that the noisy dataset may visually seem quite difficult for the correct graph approximation, it is distortion of 2-dimensional visualization; in original 20-dimensional space correct graph approximation is not that unlikely as one can see here. (

**B**) Demonstration of agreement of our scores and naive scores like number and mutual distance to branching points, etc. On these plots X axis is number of nodes for approximate graph construction. The main observation is that ten nodes is the best value from all points of view: from our scores (the second subplot-adjusted Rand index, adjusted mutual information, Fowlkes–Mallows, etc.) as well as from the comparing distances between ground truth branching points and approximate branching points (the forth subplot), as well as elbow rule for inertia (the third subplot), as well as number from looking on number of branching points (the first subplot); ground truth has three branching points; approximate construction has three branching points for when number of nodes equal to 10 (first subplot gray line). All plots are average values over 100 randomly generated datasets having the underlying graph ground truth structure of the same shape, symmetric binary trees with 7 edges (3 branch points). See provided notebooks for details.

**Figure 5.**Comparison of ElPiGraph (green) and MST (black) trajectories. ElPiGraph produces correct trajectories for much wider range of main input parameter (nodes numbers) then MST.

**Figure 6.**Comparison of scores for MST (red) and ElPiGraph (blue), depending on number of nodes in graph (X-axis). Simulations on 4 different types of datasets with known ground truth, results are averaged over 100 random simulations of each type (i.e., each point on each plot is result of averaging over 100 simulations). For each dataset two types of plots are presented, number of branch points and adjusted Rand index comparing trajectories of ElPiGraph and MST with ground truth. Best Rand scores for both methods are approximately the same (except left bottom case where ElPiGraph gives higher score), but the advantage of ElPiGraph is much higher stability, wide range of parameters with near highest score.

**Figure 7.**ElPiGraph forgets the initialization.

**A.**Two constructions of graphs. Right: combined construction (MST+ElPiGraph) produce result similar to pure ElPiGraph despite it started from quite a different (overbranching and non-smooth) MST graph (left).

**B.**Analysis how fast ElPiGraph forgets the initialization. Dynamics of scores for each step of ElPiGraph processing which started from MST graph with 30 nodes and finishing on graph with 40 nodes. X axis, number of nodes. One can see that in 5 steps scores stabilize. ElPiGraph forgets the initialization in 5 steps (for that dataset).

**C.**Demonstration how fast ElPiGraph forgets the initialization. Results for MST, ElPiGraph and several combinations of MST+ElPiGraph differing by node number of MST intializing ElPiGraph. X-axis, number of nodes in constructed graph. Y-axis is number of branching points in constructed graph. Plots with different colors, different algorithms: gray—pure ElPiGraph (without MST initialization); violet—pure MST; yellow—combined: MST-25 nodes followed by ElPiGraph; green—combined: MST-30-nodes followed by ElPiGraph, etc. As one can see especially for green, yellow, red plots, they start from nearby violet plot (MST) quickly comes to around gray plot (pure ElPiGraph), thus demonstrating that ElPiGraph initialized by MST quickly forgets the initialization and shows results similar to pure ElPiGraph (sometimes even slightly better since for example several lines goes under gray line). Each point of the plot is obtained by averaging results of 100 simulations on random datasets of binary tree like shape with 3 branching points and 7 edges. Results for other simulations are similar.

**Figure 8.**Agreement of supervised and unsupervised scores (silhouette, Calinski–Harabasz, Davies-Bouldin). For datasets with a clear ground truth graph structure (binary tree with 3 branching points) MST trajectory inference algorithm is applied. Constructed tree is compared with ground truth with adjusted Rand index based score and also scored by unsupervised score (not requiring the ground truth). (By the methodologies described above). One can see that all methods suggest that choice of parameter in the 20-40 nodes is reasonable choice. So supervised and unsupervised scores agree at this example. Thus unsupervised scores may provide an indication how to choose parameters for GBDA algorithms. Each point of the plot obtained by averaging scores on hundred simulated datasets. See notebook “SimulationUnsupervisedScoresForTI” for further details.

**Figure 9.**(

**A**) PCA projections for single cell RNA data. Differentiation of bi-potential hepatoblasts into hepatocytes and cholangiocytes for mouse embryos. Color corresponds to different timestamps. (

**B**) Trajectory inference from a single cell transcriptomic dataset using MST approach. The method produces biologically correct trajectories for a much narrower range of parameters than ElPiGraph (see next figure). Already for 10 nodes biologically incorrect branching appear. This threshold can be seen from the drop of unsupervised metrics like silhouette metric (see figures below). (

**C**) Trajectory inference from a single cell transcriptomic dataset using ElPiGraph approach. The method produces biologically correct trajectories for much wider range of parameters than MST. Only for 23 nodes biologically incorrect branching appears. That threshold can be seen from drop of unsupervised metrics like silhouette metric (see next figures).

**Figure 10.**Unsupervised metrics for trajectories inferred by MST and ElPiGraph for a single cell dataset. X-axis, node number of the trajectory. (

**A**) Application of MST-based approach. The metrics have extremum for 9 nodes, that point is indeed a good choice for the node number parameter. For the number of nodes more than 10, the metrics clearly indicate worse clustering quality, which corresponds to the appearance of biologically incorrect branching for 10 nodes (see previous figures). (

**B**) ElPiGraph. The metrics are quite stable in wide range of node values. The clear drop of silhouette at 23 nodes corresponds to appearance of biologically incorrect branching (see previous figures). Thus it confirms that metrics can be used as an indication for the choice of parameters.

**Figure 11.**Examples of MST and ElPiGraph trajectories inference with different number of nodes. Overestimating node number would cause overbranching, creating branches which does not correspond to the ground truth. On the other hand underestimating would not allow to catch complex enough branching structure. Thus the correct choice of the number of nodes is important. (

**A**) Application of MST-based approach. (

**B**) ElPiGraph produces correct trajectories for much wider range of main input parameter (node number) than MST.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Chervov, A.; Bac, J.; Zinovyev, A.
Minimum Spanning vs. Principal Trees for Structured Approximations of Multi-Dimensional Datasets. *Entropy* **2020**, *22*, 1274.
https://doi.org/10.3390/e22111274

**AMA Style**

Chervov A, Bac J, Zinovyev A.
Minimum Spanning vs. Principal Trees for Structured Approximations of Multi-Dimensional Datasets. *Entropy*. 2020; 22(11):1274.
https://doi.org/10.3390/e22111274

**Chicago/Turabian Style**

Chervov, Alexander, Jonathan Bac, and Andrei Zinovyev.
2020. "Minimum Spanning vs. Principal Trees for Structured Approximations of Multi-Dimensional Datasets" *Entropy* 22, no. 11: 1274.
https://doi.org/10.3390/e22111274