# Robust and Scalable Learning of Complex Intrinsic Dataset Geometry via ElPiGraph


## Abstract


## 1. Introduction

## 2. Materials and Methods

#### 2.1. General Objective of the ElPiGraph as a Data Approximation and Dimensionality Reduction Method

Given a dataset $X = \{X_i\}$, $i = 1 \dots N$, in $\mathbf{R}^m$, where N is the number of data points and m is the number of variables, we aim to approximate it by the nodes of a graph injected into $\mathbf{R}^m$. The number of injected graph nodes |V| is supposed to be small, $|V| \ll N$, and the configuration of nodes is supposed to be regular and not too complicated, in a sense that depends on the graph topology. The graph topology is constrained as well: in most applications it is restricted to linear, grid-like, tree-like, or ring-like structures, even though more complex graph classes such as cubic complexes can be used [20]. The graph nodes, together with the linear segments corresponding to graph edges, form a base space onto which data points are projected, which provides dimensionality reduction. Geodesic distances along the graph edges define a new similarity measure for the data points, which may differ significantly from the distance in the ambient space $\mathbf{R}^m$.

#### 2.2. Elastic Graphs: Basic Definitions

Let G be a simple undirected graph with a set of nodes V and a set of edges E. A k-star in the graph G is a subgraph with k + 1 nodes $v_{0,1,\dots,k} \in V$ and k edges $\{(v_0, v_i)\,|\,i = 1, \dots, k\}$. Let $E^{(i)}(0)$, $E^{(i)}(1)$ denote the two nodes of a graph edge $E^{(i)}$, and $S^{(j)}(0), \dots, S^{(j)}(k)$ denote the nodes of a k-star $S^{(j)}$ (where $S^{(j)}(0)$ is the central node, to which all other nodes are connected). Let $\deg(v_i)$ denote a function returning the order k of the star with the central node $v_i$, and 1 if $v_i$ is a terminal node. Let $\varphi: V \to \mathbf{R}^m$ be a map that injects the graph into a multidimensional space by mapping each node of the graph to a point in the data space. For any k-star of the graph G, we call its injection 'harmonic' if the position of its central node coincides with the mean of the leaf positions, i.e., $\varphi(\text{central node}) = \frac{1}{k}\sum_{i=1}^{k}\varphi(i)$, where i iterates over the leaves. A mapping φ of the graph G is called harmonic if the injections of all k-stars of G are harmonic.

We define an elastic graph as a graph with a selected family of k-stars $S_m$ and for which all $E^{(i)} \in E$ and $S_m^{(j)}$ have associated elasticity moduli $\lambda_i > 0$ and $\mu_j > 0$. Furthermore, a primitive elastic graph is defined as an elastic graph in which every non-leaf node (i.e., a node with at least two neighbors) is associated with a k-star formed by all the neighbors of the node. All k-stars in a primitive elastic graph are in the selection, i.e., the $S_m$ sets are completely determined by the graph structure. Non-primitive elastic graphs are not considered here, but they can be used, for example, for constructing 2D and 3D elastic principal manifolds, where a node of a rectangular grid can be the center of two 2-stars [11].

#### 2.3. Elastic Energy Functional and Elastic Matrix

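The elastic energy of a graph injection is defined as the sum of squared edge lengths (weighted by the $\lambda_i$) and the sum of squared deviations from harmonicity for each star (weighted by the $\mu_j$).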

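The harmonicity term can be viewed as the sum of the energy of elastic springs connecting each star center with its leaves (with elasticity moduli $\mu_j/\deg(S^{(j)})$) and the energy of negative (repulsive) springs connecting all non-central nodes in a star pairwise (with negative elasticity moduli $-\mu_j/(\deg(S^{(j)}))^2$). The resulting system of springs, whose energy is minimized, is shown in Figure 1A,B.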
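The spring interpretation can be checked numerically. The following is a minimal sketch, not the ElPiGraph implementation: it assumes the elastic energy is exactly the weighted sum of squared edge lengths plus the weighted squared deviation of each star center from the mean of its leaves, as described in words above. The function names `elastic_energy` and `star_energy_as_springs` and the edge-list data layout are illustrative choices.

```python
import numpy as np

def elastic_energy(phi, edges, lam, stars, mu):
    """Return (stretching, bending) energy of a graph injection.

    phi   : (|V|, m) array of node positions
    edges : list of (a, b) node index pairs, lam: one modulus per edge
    stars : list of (center, [leaves]),      mu : one modulus per star
    """
    stretch = sum(l * np.sum((phi[a] - phi[b]) ** 2)
                  for (a, b), l in zip(edges, lam))
    # Deviation from harmonicity: center minus mean of leaf positions.
    bend = sum(m_j * np.sum((phi[c] - phi[leaves].mean(axis=0)) ** 2)
               for (c, leaves), m_j in zip(stars, mu))
    return stretch, bend

def star_energy_as_springs(phi, center, leaves, mu_j):
    """Same bending energy written as attractive plus repulsive springs."""
    k = len(leaves)
    attract = sum(np.sum((phi[center] - phi[l]) ** 2) for l in leaves)
    repel = sum(np.sum((phi[a] - phi[b]) ** 2)
                for i, a in enumerate(leaves) for b in leaves[i + 1:])
    return (mu_j / k) * attract - (mu_j / k ** 2) * repel

# Toy 3-star: node 0 is the center, nodes 1..3 are leaves.
rng = np.random.default_rng(0)
phi = rng.normal(size=(4, 3))
edges, lam = [(0, 1), (0, 2), (0, 3)], [0.01, 0.01, 0.01]
stars, mu = [(0, [1, 2, 3])], [0.1]
stretch, bend = elastic_energy(phi, edges, lam, stars, mu)
bend_springs = star_energy_as_springs(phi, 0, [1, 2, 3], mu[0])
```

On this toy star the two expressions for the bending energy agree up to floating-point error, and the bending energy vanishes when the center is moved to the mean of the leaves.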

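The elastic matrix EM(G) of a primitive elastic graph is a symmetric $|V| \times |V|$ matrix containing the edge elasticity modulus $\lambda_i$ at the intersection of the rows and columns corresponding to each pair $E^{(i)}(0)$, $E^{(i)}(1)$, and the star elasticity modulus $\mu_j$ in the diagonal element corresponding to $S^{(j)}(0)$. Therefore, EM(G) can be represented as a sum of two matrices: $\mathsf{\Lambda}$ (the off-diagonal edge moduli) and M (the diagonal star moduli).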

${\mathsf{\Lambda}}^{star\_edges}$ is a weighted adjacency matrix for the springs connecting each star center with its leaves, with elasticity moduli $\mu_j/k_j$, where $k_j$ is the number of edges connected to the jth star center. ${\mathsf{\Lambda}}^{star\_leafs}$ is a weighted adjacency matrix for the negative springs (shown in green in Figure 1A and in red in Figure 1B), with elasticity moduli $-\mu_j/(k_j)^2$. An example of transforming the elastic matrix EM(G) into the three weighted adjacency matrices $\mathsf{\Lambda}$, ${\mathsf{\Lambda}}^{star\_edges}$, and ${\mathsf{\Lambda}}^{star\_leafs}$ is shown in Figure 1B.
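The decomposition of the elastic matrix can be sketched in a few lines of NumPy. This is a hedged illustration rather than the reference code: it assumes a primitive elastic graph (the leaves of each star are exactly the neighbors of its center) and assumes that the matrix $\tilde{L}(EM(G))$ used in the optimization is the sum of the graph Laplacians of the three weighted adjacency matrices, which is what the spring picture suggests. The helper names are hypothetical.

```python
import numpy as np

def decompose_elastic_matrix(em):
    """Split EM(G) into (Lambda, Lambda_star_edges, Lambda_star_leafs)."""
    em = np.asarray(em, dtype=float)
    n = em.shape[0]
    mu = np.diag(em).copy()          # star moduli sit on the diagonal
    lam = em - np.diag(mu)           # edge moduli are the off-diagonal part
    star_edges = np.zeros((n, n))
    star_leafs = np.zeros((n, n))
    for c in np.nonzero(mu)[0]:
        leaves = np.nonzero(lam[c])[0]   # neighbors of the star center
        k = len(leaves)
        star_edges[c, leaves] += mu[c] / k
        star_edges[leaves, c] += mu[c] / k
        for a in leaves:                 # negative springs between leaves
            for b in leaves:
                if a != b:
                    star_leafs[a, b] -= mu[c] / k ** 2
    return lam, star_edges, star_leafs

def laplacian(w):
    """Graph Laplacian D - W of a symmetric weighted adjacency matrix."""
    return np.diag(w.sum(axis=1)) - w

def l_tilde(em):
    """Assumed form of L~: sum of the three spring-system Laplacians."""
    return sum(laplacian(w) for w in decompose_elastic_matrix(em))

# Small tree: node 1 centers a 3-star, node 3 centers a 2-star.
tree_edges = [(0, 1), (1, 2), (1, 3), (3, 4)]
em = np.zeros((5, 5))
for a, b in tree_edges:
    em[a, b] = em[b, a] = 0.01
em[1, 1] = em[3, 3] = 0.1
phi = np.random.default_rng(0).normal(size=(5, 2))
energy_quadratic = np.trace(phi.T @ l_tilde(em) @ phi)
```

Under these assumptions the quadratic form `trace(phi.T @ L~ @ phi)` reproduces the elastic energy computed directly from edge lengths and deviations from harmonicity, which is why a single linear system suffices for the node-position update.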

#### 2.4. Elastic Principal Graph Optimization

Let K(i) denote the index of the graph node closest to the i-th data point among all graph nodes. We define the optimization functional for fitting a graph to data as the sum of the truncated approximation error and the elastic energy of the graph. In this functional, $w_i$ is the weight of data point i (it can be equal to one for all points), $|V|$ is the number of nodes, and ||..|| is the usual Euclidean norm.

The objective is to find a map $\varphi: V \to \mathbf{R}^m$ such that $U^{\varphi}(X,G) \to \min$ over all possible injections of the elastic graph G into $\mathbf{R}^m$. In practice, we look for a local minimum of $U^{\varphi}(X,G)$ by applying the standard splitting-type algorithm, which is described by the pseudo-code provided below. The essence of the algorithm (similar to the simplest k-means clustering) is to alternate between (1) computing the partitioning K given the current estimate of the map φ and (2) computing a new map φ given the partitioning K. The first task is a standard nearest-neighbor search problem, while the second task requires solving a system of linear equations of size |V| (see Algorithm 1).

**Algorithm 1.** Base graph optimization for a fixed structure of the elastic graph

- (1) Initialize the graph G, its elastic matrix EM(G), and the map φ.
- (2) Compute the matrix $\tilde{L}\left(EM\left(G\right)\right)$.
- (3) Partition the data by proximity to the embedded nodes of the graph (i.e., compute the mapping $K: \{X\} \to \{V\}$ of a data point i to a graph node j).
- (4) Solve the following system of linear equations to determine the new map φ: $$\sum_{j=1}^{|V|}\left(\frac{\sum_{K(i)=j} w_i}{\sum_{i=1}^{N} w_i}\,\delta_{lj}+\tilde{L}\left(EM\left(G\right)\right)_{lj}\right)\varphi\left(V_j\right)=\frac{1}{\sum_{i=1}^{N} w_i}\sum_{K(i)=l} w_i X_i,\quad l = 1, \dots, |V|.$$
- Iterate steps 3–4 until the map φ changes by less than ε in some appropriate measure of difference between two consecutive iterations.
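The alternation in Algorithm 1 can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the published implementation: the matrix $\tilde{L}$ is passed in precomputed, nearest nodes are found by brute force, points farther than an optional trimming radius `r0` from every node are simply left out of the update, and convergence is measured by the largest coordinate change. The function name `fit_fixed_graph` and its parameters are hypothetical.

```python
import numpy as np

def fit_fixed_graph(X, phi, l_tilde, w=None, r0=np.inf, eps=1e-8, max_iter=100):
    """Alternate partitioning (step 3) and a linear solve (step 4)."""
    X = np.asarray(X, dtype=float)
    phi = np.asarray(phi, dtype=float).copy()
    n_nodes = phi.shape[0]
    w = np.ones(len(X)) if w is None else np.asarray(w, dtype=float)
    w_total = w.sum()
    for _ in range(max_iter):
        # Step 3: assign every data point to its closest node.
        d2 = ((X[:, None, :] - phi[None, :, :]) ** 2).sum(axis=2)
        k = d2.argmin(axis=1)
        inside = d2[np.arange(len(X)), k] <= r0 ** 2   # trimming (assumption)
        # Step 4: build and solve the |V| x |V| linear system.
        node_w = np.bincount(k[inside], weights=w[inside], minlength=n_nodes)
        rhs = np.zeros_like(phi)
        np.add.at(rhs, k[inside], w[inside][:, None] * X[inside])
        a = np.diag(node_w / w_total) + l_tilde
        new_phi = np.linalg.solve(a, rhs / w_total)
        shift = np.abs(new_phi - phi).max()
        phi = new_phi
        if shift < eps:
            break
    return phi, k

# Toy example: a 3-node chain fitted to points on the segment y = 0.
lam, mu = 0.01, 0.1
wadj = np.array([[0.0, lam + mu / 2, -mu / 4],
                 [lam + mu / 2, 0.0, lam + mu / 2],
                 [-mu / 4, lam + mu / 2, 0.0]])   # spring weights of the chain
L = np.diag(wadj.sum(axis=1)) - wadj
X = np.column_stack([np.linspace(0.0, 1.0, 101), np.zeros(101)])
phi0 = np.array([[0.0, 0.5], [0.5, 0.5], [1.0, 0.5]])
phi_fit, assignment = fit_fixed_graph(X, phi0, L)
```

In the toy example the nodes are pulled onto the line carrying the data and stay ordered along it; the elastic terms contract the chain slightly relative to plain k-means centroids.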

#### 2.5. Graph Grammar-Based Approach for Determining the Optimal Graph Topology

The search for the graph structure starts from an initial graph $G_0$ and an initial map $\varphi_0(G_0)$. In the simplest case, the initial graph consists of two nodes and one edge, and the map can correspond to a segment of the first principal component.

Grammar operations are then applied to the current graph embedment $\{G_i, \varphi_i(G_i)\}$. Each grammar operation $\Psi^p$ produces a set of new graphs and their injection maps, possibly taking into account the dataset X.

Given a pair $\{G_i, \varphi_i(G_i)\}$, a set of r different graph operations $\{\Psi^1, \dots, \Psi^r\}$ (which we call a 'graph grammar'), and an energy function $\overline{U}^{\varphi}(X,G)$, at each step of the algorithm the energetically optimal candidate graph embedment is selected as the next pair $\{G_{i+1}, \varphi_{i+1}(G_{i+1})\}$ (see Algorithm 2).

**Algorithm 2.** Graph grammar-based optimization of graph structure and embedment

- (1) Initialize the current graph embedment by some graph topology and some initial map $\{G_0, \varphi_0(G_0)\}$.
- (2) For the current graph embedment $\{G_i, \varphi_i(G_i)\}$, apply all grammar operations from a grammar set $\{\Psi^1, \dots, \Psi^r\}$, and generate a set of s candidate graph injections $\{D^k, \varphi(D^k), k = 1 \dots s\}$.
- (3) Further optimize each candidate graph embedment using Algorithm 1, and obtain a set of s energy values $\left\{\overline{U}^{\varphi(D^k)}(D^k)\right\}$.
- (4) Among all candidate graph injections, select the injection map with the minimum energy as $\{G_{i+1}, \varphi_{i+1}(G_{i+1})\}$.
- Repeat steps 2–4 until the graph contains the required number of nodes.

| GRAPH GRAMMAR OPERATION ‘BISECT AN EDGE’ | GRAPH GRAMMAR OPERATION ‘ADD NODE TO A NODE’ |
| --- | --- |
| Applicable to: any edge of the graph | Applicable to: any node of the graph |
| Update of the graph structure: for a given edge {A,B}, connecting nodes A and B, remove {A,B} from the graph, add a new node C, and introduce two new edges {A,C} and {C,B}. | Update of the graph structure: for a given node A, add a new node C, and introduce a new edge {A,C}. |
| Update of the elasticity matrix: the elasticity of edges {A,C} and {C,B} equals the elasticity of {A,B}. | Update of the elasticity matrix: if A is a leaf node (not a star center), then the elasticity of the edge {A,C} equals that of the edge connecting A and its neighbor, and the elasticity of the new star with the center in A equals the elasticity of the star centered in the neighbor of A; if the graph contains only one edge, then a predefined value is assigned. Otherwise, the elasticity of the edge {A,C} is the mean elasticity of all edges in the star with the center in A, and the elasticity of the star with the center in A does not change. |
| Update of the graph injection map: C is placed in the mean position between the positions of A and B. | Update of the graph injection map: if A is a leaf node (not a star center), then C is placed at the same distance and in the same direction as the edge connecting A and its neighbor; otherwise, C is placed at the mean point of all data points for which A is the closest node. |

| GRAPH GRAMMAR OPERATION ‘REMOVE A LEAF NODE’ | GRAPH GRAMMAR OPERATION ‘SHRINK INTERNAL EDGE’ |
| --- | --- |
| Applicable to: node A of the graph with deg(A) = 1 | Applicable to: any edge {A,B} such that deg(A) > 1 and deg(B) > 1. |
| Update of the graph structure: for a given edge {A,B}, connecting nodes A and B, remove edge {A,B} and node A from the graph. | Update of the graph structure: for a given edge {A,B}, connecting nodes A and B, remove {A,B}, reattach all edges connecting A with its neighbors to B, and remove A from the graph. |
| Update of the elasticity matrix: if B is the center of a 2-star, then set the elasticity of the star for B to zero (B becomes a leaf); otherwise, do not change the elasticity of the star for B. Remove the row and column corresponding to the vertex A. | Update of the elasticity matrix: the elasticity of the new star with the center in B becomes the average elasticity of the previously existing stars with the centers in A and B. Remove the row and column corresponding to the vertex A. |
| Update of the graph injection map: all nodes besides A keep their positions. | Update of the graph injection map: B is placed in the mean position between A and B. |
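The two growing operations in the table above can be sketched on a plain edge-list representation. This is a simplified illustration, not the ElPiGraph code: it updates only the graph structure and the injection map (the elasticity updates are omitted), the function names are hypothetical, and the fallback used when no data point is closest to an internal node A is an assumption made for this example.

```python
import numpy as np

def bisect_edge(edges, phi, edge):
    """'Bisect an edge': replace {A,B} by {A,C} and {C,B}, C at the midpoint."""
    a, b = edge
    c = len(phi)                                   # index of the new node C
    new_edges = [e for e in edges if set(e) != {a, b}] + [(a, c), (c, b)]
    new_phi = np.vstack([phi, (phi[a] + phi[b]) / 2])
    return new_edges, new_phi

def add_node_to_node(edges, phi, a, X):
    """'Add node to a node': attach a new node C to node A."""
    neighbors = [v if u == a else u for u, v in edges if a in (u, v)]
    c = len(phi)
    if len(neighbors) == 1:
        # A is a leaf: continue the existing edge past A by the same length.
        new_pos = 2 * phi[a] - phi[neighbors[0]]
    else:
        # A is a star center: mean of the data points closest to A.
        d2 = ((X[:, None, :] - phi[None, :, :]) ** 2).sum(axis=2)
        own = X[d2.argmin(axis=1) == a]
        new_pos = own.mean(axis=0) if len(own) else phi[a]  # fallback: assumption
    return edges + [(a, c)], np.vstack([phi, new_pos])

def grow_candidates(edges, phi, X):
    """All candidate embedments produced by the two growing operations."""
    cands = [bisect_edge(edges, phi, e) for e in edges]
    cands += [add_node_to_node(edges, phi, a, X) for a in range(len(phi))]
    return cands

# Simplest initial graph: two nodes and one edge.
edges0 = [(0, 1)]
phi_init = np.array([[0.0, 0.0], [1.0, 0.0]])
X = np.column_stack([np.linspace(-1.0, 2.0, 31), np.zeros(31)])
candidates = grow_candidates(edges0, phi_init, X)
```

For the two-node graph this yields three candidates (one bisection and two leaf extensions), each of which would then be refined by the fixed-structure optimization and ranked by energy.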

#### 2.6. Computational Complexity of ElPiGraph

The computational complexity of fitting an elastic graph with s nodes to N data points in m dimensions scales as $O(Nsm + ms^2)$. The first term, O(Nsm), comes from the simple nearest-neighbor search, in which each data point is attributed to the closest graph node position. The second term, $O(ms^2)$, comes from solving the system of s linear equations with a sparse matrix m times. In the usual situation N >> s, the first term dominates the complexity.

- (1) Effective reduction of the number of points in the dataset, e.g., by applying pre-clustering. The new data points represent the centroids of clusters, weighted according to the number of points in each cluster. Alternatively, ElPiGraph can use a stochastic approximation approach, exploiting a subsample of the data at each step of growth.
- (2) Application of accelerated strategies for the nearest-neighbor search step (with the number of neighbors = 1) between graph node positions and data points. In relatively small dimensions and for large graphs, the standard k-d tree method can already be beneficial. The new partitioning can exploit the results of the previous partitioning to avoid recomputing all distances, similar to the fast versions of the k-means algorithm [52,53]. Various strategies of approximate nearest-neighbor search can also be exploited.
- (3) Reducing the number of tested candidate graph topologies using simple heuristics. For example, at each application of the ‘Bisect an edge’ grammar operation one can test only the k longest edges, which are the most likely to be selected as optimal, with a small and fixed k. Similarly, the ‘Add node to a node’ grammar operation can be applied only to the few nodes with the most data points assigned to them.
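The candidate-pruning heuristics in item (3) are simple to express. The sketch below is illustrative only: the function names `longest_edges` and `most_loaded_nodes` are hypothetical, the graph is held as an edge list plus a node-position array, and brute-force distances are used for clarity.

```python
import numpy as np

def longest_edges(edges, phi, k):
    """Return the k longest edges, the only ones tested for bisection."""
    lengths = np.array([np.linalg.norm(phi[a] - phi[b]) for a, b in edges])
    order = np.argsort(-lengths)[:k]
    return [edges[i] for i in order]

def most_loaded_nodes(X, phi, k):
    """Return indices of the k nodes with the most data points assigned."""
    d2 = ((X[:, None, :] - phi[None, :, :]) ** 2).sum(axis=2)
    counts = np.bincount(d2.argmin(axis=1), minlength=len(phi))
    return [int(i) for i in np.argsort(-counts)[:k]]

chain_edges = [(0, 1), (1, 2), (2, 3)]
chain_phi = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [6.0, 0.0]])
pts = np.array([[0.0, 0.0], [0.1, 0.0], [4.0, 0.0],
                [4.1, 0.0], [3.9, 0.0], [6.0, 0.0]])
to_bisect = longest_edges(chain_edges, chain_phi, k=2)
to_extend = most_loaded_nodes(pts, chain_phi, k=1)
```

With a fixed small k, the number of candidates evaluated per growth step no longer increases with the size of the graph.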

#### 2.7. Strategies for Graph Initialization

When a finite trimming radius $R_0$ value is used, the advised strategy is graph growing, which can be initialized by (1) a rough estimation of the local data density in a limited number of data points and (2) placing two nodes: one at the data point characterized by the highest local density and another at the data point closest to the first one (but not coinciding with it).
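A density-based initialization of this kind might look as follows. This is a rough sketch under assumptions that are not in the text: local density is estimated by counting neighbors within a user-supplied `radius`, the "limited number of data points" is a random subsample of size `n_probe`, and the function name `init_two_nodes` is hypothetical.

```python
import numpy as np

def init_two_nodes(X, radius, n_probe=100, seed=0):
    """Place two initial nodes at the densest probed point and its neighbor."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    probe = rng.choice(len(X), size=min(n_probe, len(X)), replace=False)
    # Rough local density: number of data points within `radius`.
    d = np.linalg.norm(X[probe][:, None, :] - X[None, :, :], axis=2)
    density = (d <= radius).sum(axis=1)
    best = probe[density.argmax()]
    # Second node: closest data point that does not coincide with the first.
    dist = np.linalg.norm(X - X[best], axis=1)
    dist[dist == 0] = np.inf
    second = dist.argmin()
    return np.vstack([X[best], X[second]])

gen = np.random.default_rng(1)
dense = gen.normal(scale=0.05, size=(50, 2))        # tight cluster at origin
sparse = gen.uniform(5.0, 10.0, size=(10, 2))       # scattered far points
start_nodes = init_two_nodes(np.vstack([dense, sparse]), radius=0.2)
```

Both starting nodes land inside the dense cluster, so the growing graph begins in a well-populated region instead of on an isolated point.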

#### 2.8. Fine-Tuning the Final Graph Structure

#### 2.9. Choice of the Core ElPiGraph Parameters

$R_0$: trimming radius for the MSE-based data approximation term.

^{−2}, however, the exact value can depend on the requirements for the resulting graph properties.

A finite $R_0$ value changes the way the principal graph is grown. In this case, the graph explores the dataset starting from a small fragment of it, ‘crawling’ to the neighboring data points while simultaneously fitting the local features of the data topology. With a properly chosen ${R}_{0}$, the ElPiGraph algorithm can tolerate a significant amount of uniformly distributed background noise (see Figure 2C) and even deal with self-intersecting data distributions (Figure 2D). ElPiGraph includes a function for estimating the coarse-grained radius of the data based on local analysis of the density distribution, which can be used as an initial guess for the $R_0$ value. An alternative initial guess for the trimming radius can be obtained by taking the median of the distribution of pairwise distances between data points.
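The median-based initial guess is straightforward to compute. The sketch below is an illustration, not the function shipped with ElPiGraph; the random subsampling used to keep the distance matrix small for large datasets is an added assumption, and the name `median_pairwise_distance` is hypothetical.

```python
import numpy as np

def median_pairwise_distance(X, max_points=1000, seed=0):
    """Median of pairwise Euclidean distances, as a first guess for R0."""
    X = np.asarray(X, dtype=float)
    if len(X) > max_points:
        # Subsample to bound memory use (assumption, not from the paper).
        rng = np.random.default_rng(seed)
        X = X[rng.choice(len(X), size=max_points, replace=False)]
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    upper = np.triu_indices(len(X), k=1)       # each pair counted once
    return float(np.median(d[upper]))

r0_guess = median_pairwise_distance([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
# pairwise distances are 1, 1 and sqrt(2), so r0_guess == 1.0
```

Such a value is only a starting point; the trimming radius usually needs to be adjusted to the scale of the structures of interest.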

#### 2.10. Principal Graph Ensembles and Consensus Principal Graph

#### 2.11. Code Availability

- MATLAB from https://github.com/sysbio-curie/ElPiGraph.M
- Python from https://github.com/sysbio-curie/ElPiGraph.P. Note that the Python implementation supports GPU use.

## 3. Results

#### 3.1. Approximating Complex Data Topologies in Synthetic Examples

#### 3.2. Benchmarking ElPiGraph

#### 3.3. Inferring Branching Pseudo-Time Cell Trajectories from Single-Cell RNASeq Data via ElPiGraph

#### 3.4. Approximating the Complex Structure of Development or Differentiation from Single Cell RNASeq Data

^{−4}) associations with previously defined populations [37] (Figure 5F). Using the principal graph obtained, it is also possible to obtain a pseudotime that can be used to explore how different genes vary across the different branches (Figure 5G). Our approach was able to identify structured transcriptomic variation in a group of cells previously identified as ‘outliers’ (Figure 5G, top panel). Furthermore, our approach identified a loop in the part of the graph associated with the neural tube (Figure 5F, top left), suggesting the presence of complex diverging-converging cellular dynamics.

^{−4}) with previously defined cell types emerges (Figure 5I). As before, such structure can be used to identify how gene dynamics changes as cells commit to a specific cell type (Figure 5I).

## 4. Discussion

Data points located farther than the trimming radius ($R_0$) from any graph node are invisible to the optimization procedure. However, when growing, the principal graphs can gradually capture new data points, which makes it possible to learn a complex data structure starting from a simple small fragment of it. Such a ‘from local to global’ approach provides robustness and flexibility for the approximation; for example, it allows solving the problem of self-intersecting manifold clustering.

## Supplementary Materials

## Author Contributions

## Funding

## Conflicts of Interest

## References

1. Roux, B.L.; Rouanet, H. Geometric Data Analysis: From Correspondence Analysis to Structured Data Analysis; Springer: Dordrecht, The Netherlands, 2005; ISBN 1402022360.
2. Gorban, A.; Kégl, B.; Wunsch, D.; Zinovyev, A. Principal Manifolds for Data Visualisation and Dimension Reduction; Springer: Berlin/Heidelberg, Germany, 2008; ISBN 978-3-540-73749-0.
3. Carlsson, G. Topology and data. Bull. Am. Math. Soc. **2009**, 46, 255–308.
4. Nielsen, F. An elementary introduction to information geometry. arXiv Prepr. **2018**, arXiv:1808.08271.
5. Camastra, F.; Staiano, A. Intrinsic dimension estimation: Advances and open problems. Inf. Sci. **2016**, 328, 26–41.
6. Albergante, L.; Bac, J.; Zinovyev, A. Estimating the effective dimension of large biological datasets using Fisher separability analysis. In Proceedings of the International Joint Conference on Neural Networks, Budapest, Hungary, 14–19 July 2019.
7. Gorban, A.N.; Tyukin, I.Y. Blessing of dimensionality: Mathematical foundations of the statistical physics of data. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. **2018**, 376, 20170237.
8. Pearson, K. On lines and planes of closest fit to systems of points in space. Philos. Mag. **1901**, 2, 559–572.
9. Kohonen, T. The self-organizing map. Proc. IEEE **1990**, 78, 1464–1480.
10. Gorban, A.; Zinovyev, A. Elastic principal graphs and manifolds and their practical applications. Computing **2005**, 75, 359–379.
11. Gorban, A.N.; Zinovyev, A. Principal manifolds and graphs in practice: From molecular biology to dynamical systems. Int. J. Neural Syst. **2010**, 20, 219–232.
12. Smola, A.J.; Williamson, R.C.; Mika, S.; Scholkopf, B. Regularized Principal Manifolds. Comput. Learn. Theory **1999**, 1572, 214–229.
13. Tenenbaum, J.B.; De Silva, V.; Langford, J.C. A global geometric framework for nonlinear dimensionality reduction. Science **2000**, 290, 2319–2323.
14. Roweis, S.T.; Saul, L.K. Nonlinear dimensionality reduction by locally linear embedding. Science **2000**, 290, 2323–2326.
15. Van Der Maaten, L.J.P.; Hinton, G.E. Visualizing high-dimensional data using t-sne. J. Mach. Learn. Res. **2008**, 9, 2579–2605.
16. McInnes, L.; Healy, J.; Saul, N.; Großberger, L. UMAP: Uniform Manifold Approximation and Projection. J. Open Source Softw. **2018**, 3, 861.
17. Gorban, A.N.; Zinovyev, A.Y. Principal graphs and manifolds. In Handbook of Research on Machine Learning Applications and Trends: Algorithms, Methods, and Techniques; Information Science Reference: Hershey, PA, USA, 2009; pp. 28–59. ISBN 9781605667669.
18. Zinovyev, A.; Mirkes, E. Data complexity measured by principal graphs. Comput. Math. Appl. **2013**, 65, 1471–1482.
19. Mao, Q.; Wang, L.; Tsang, I.W.; Sun, Y. Principal Graph and Structure Learning Based on Reversed Graph Embedding. IEEE Trans. Pattern Anal. Mach. Intell. **2017**, 39, 2227–2241.
20. Gorban, A.N.; Sumner, N.R.; Zinovyev, A.Y. Topological grammars for data approximation. Appl. Math. Lett. **2007**, 20, 382–386.
21. Gorban, A.N.; Sumner, N.R.; Zinovyev, A.Y. Beyond the concept of manifolds: Principal trees, metro maps, and elastic cubic complexes. In Principal Manifolds for Data Visualization and Dimension Reduction; Lecture Notes in Computational Science and Engineering; Springer: Berlin/Heidelberg, Germany, 2008; Volume 58, pp. 219–237.
22. Mao, Q.; Yang, L.; Wang, L.; Goodison, S.; Sun, Y. SimplePPT: A simple principal tree algorithm. In Proceedings of the SIAM International Conference on Data Mining, Vancouver, BC, Canada, 30 April–2 May 2015; Society for Industrial and Applied Mathematics Publications: Philadelphia, PA, USA, 2015; pp. 792–800.
23. Wang, L.; Mao, Q. Probabilistic Dimensionality Reduction via Structure Learning. IEEE Trans. Pattern Anal. Mach. Intell. **2019**, 41, 205–219.
24. Briggs, J.; Weinreb, C.; Wagner, D.; Megason, S.; Peshki, L.; Kirschner, M.; Klein, A. The dynamics of gene expression in vertebrate embryogenesis at single-cell resolution. Science **2018**, 360, eaar5780.
25. Wagner, D.; Weinreb, C.; Collins, Z.; Briggs, J.; Megason, S.; Klein, A. Single-cell mapping of gene expression landscapes and lineage in the zebrafish embryo. Science **2018**, 360, 981–987.
26. Plass, M.; Solana, J.; Wolf, F.A.; Ayoub, S.; Misios, A.; Glažar, P.; Obermayer, B.; Theis, F.J.; Kocks, C.; Rajewsky, N. Cell type atlas and lineage tree of a whole complex animal by single-cell transcriptomics. Science **2018**, 360, eaaq1723.
27. Furlan, A.; Dyachuk, V.; Kastriti, M.E.; Calvo-Enrique, L.; Abdo, H.; Hadjab, S.; Chontorotzea, T.; Akkuratova, N.; Usoskin, D.; Kamenev, D.; et al. Multipotent peripheral glial cells generate neuroendocrine cells of the adrenal medulla. Science **2017**, 357, eaal3753.
28. Trapnell, C.; Cacchiarelli, D.; Grimsby, J.; Pokharel, P.; Li, S.; Morse, M.; Lennon, N.J.; Livak, K.J.; Mikkelsen, T.S.; Rinn, J.L. Pseudo-temporal ordering of individual cells reveals dynamics and regulators of cell fate decisions. Nat. Biotechnol. **2012**, 29, 997–1003.
29. Athanasiadis, E.I.; Botthof, J.G.; Andres, H.; Ferreira, L.; Lio, P.; Cvejic, A. Single-cell RNA-sequencing uncovers transcriptional states and fate decisions in hematopoiesis. Nat. Commun. **2017**, 8, 2045.
30. Velten, L.; Haas, S.F.; Raffel, S.; Blaszkiewicz, S.; Islam, S.; Hennig, B.P.; Hirche, C.; Lutz, C.; Buss, E.C.; Nowak, D.; et al. Human hematopoietic stem cell lineage commitment is a continuous process. Nat. Cell Biol. **2017**, 19, 271–281.
31. Tirosh, I.; Venteicher, A.S.; Hebert, C.; Escalante, L.E.; Patel, A.P.; Yizhak, K.; Fisher, J.M.; Rodman, C.; Mount, C.; Filbin, M.G.; et al. Single-cell RNA-seq supports a developmental hierarchy in human oligodendroglioma. Nature **2016**, 539, 309–313.
32. Cannoodt, R.; Saelens, W.; Saeys, Y. Computational methods for trajectory inference from single-cell transcriptomics. Eur. J. Immunol. **2016**, 46, 2496–2506.
33. Moon, K.R.; Stanley, J.; Burkhardt, D.; van Dijk, D.; Wolf, G.; Krishnaswamy, S. Manifold learning-based methods for analyzing single-cell RNA-sequencing data. Curr. Opin. Syst. Biol. **2017**, 7, 36–46.
34. Saelens, W.; Cannoodt, R.; Todorov, H.; Saeys, Y. A comparison of single-cell trajectory inference methods. Nat. Biotechnol. **2019**, 37, 547–554.
35. Qiu, X.; Mao, Q.; Tang, Y.; Wang, L.; Chawla, R.; Pliner, H.A.; Trapnell, C. Reversed graph embedding resolves complex single-cell trajectories. Nat. Methods **2017**, 14, 979–982.
36. Drier, Y.; Sheffer, M.; Domany, E. Pathway-based personalized analysis of cancer. Proc. Natl. Acad. Sci. USA **2013**, 110, 6388–6393.
37. Welch, J.D.; Hartemink, A.J.; Prins, J.F. SLICER: Inferring branched, nonlinear cellular trajectories from single cell RNA-seq data. Genome Biol. **2016**, 17, 106.
38. Setty, M.; Tadmor, M.D.; Reich-Zeliger, S.; Angel, O.; Salame, T.M.; Kathail, P.; Choi, K.; Bendall, S.; Friedman, N.; Pe’Er, D. Wishbone identifies bifurcating developmental trajectories from single-cell data. Nat. Biotechnol. **2016**, 34, 637–645.
39. Kégl, B.; Krzyzak, A. Piecewise linear skeletonization using principal curves. IEEE Trans. Pattern Anal. Mach. Intell. **2002**, 24, 59–74.
40. Hastie, T.; Stuetzle, W. Principal curves. J. Am. Stat. Assoc. **1989**, 84, 502–516.
41. Kégl, B.; Krzyzak, A.; Linder, T.; Zeger, K. A polygonal line algorithm for constructing principal curves. In Proceedings of the Advances in Neural Information Processing Systems, Denver, CO, USA, 29 November–4 December 1999.
42. Gorban, A.N.; Rossiev, A.A.; Wunsch, D.C.; Gorban, A.A.; Rossiev, D.C. Wunsch II. In Proceedings of the International Joint Conference on Neural Networks, Washington, DC, USA, 10–16 July 1999.
43. Zinovyev, A. Visualization of Multidimensional Data; Krasnoyarsk State Technical University: Krasnoyarsk, Russia, 2000.
44. Gorban, A.N.; Zinovyev, A. Method of elastic maps and its applications in data visualization and data modeling. Int. J. Comput. Anticip. Syst. Chaos **2001**, 12, 353–369.
45. Delicado, P. Another Look at Principal Curves and Surfaces. J. Multivar. Anal. **2001**, 77, 84–116.
46. Gorban, A.N.; Mirkes, E.; Zinovyev, A.Y. Robust principal graphs for data approximation. Arch. Data Sci. **2017**, 2, 1–16.
47. Chen, H.; Albergante, L.; Hsu, J.Y.; Lareau, C.A.; Lo Bosco, G.; Guan, J.; Zhou, S.; Gorban, A.N.; Bauer, D.E.; Aryee, M.J.; et al. Single-cell trajectories reconstruction, exploration and mapping of omics data with STREAM. Nat. Commun. **2019**, 10, 1–14.
48. Parra, R.G.; Papadopoulos, N.; Ahumada-Arranz, L.; Kholtei, J.E.; Mottelson, N.; Horokhovsky, Y.; Treutlein, B.; Soeding, J. Reconstructing complex lineage trees from scRNA-seq data using MERLoT. Nucleic Acids Res. **2019**, 47, 8961–8974.
49. Wolf, F.A.; Hamey, F.; Plass, M.; Solana, J.; Dahlin, J.S.; Gottgens, B.; Rajewsky, N.; Simon, L.; Theis, F.J. PAGA: Graph abstraction reconciles clustering with trajectory inference through a topology preserving map of single cells. Genome Biol. **2019**, 20, 59.
50. Cuesta-Albertos, J.A.; Gordaliza, A.; Matrán, C. Trimmed k-means: An attempt to robustify quantizers. Ann. Stat. **1997**, 25, 553–576.
51. Gorban, A.N.; Mirkes, E.M.; Zinovyev, A. Piece-wise quadratic approximations of arbitrary error functions for fast and robust machine learning. Neural Netw. **2016**, 84, 28–38.
52. Elkan, C. Using the Triangle Inequality to Accelerate k-Means. In Proceedings of the Twentieth International Conference on Machine Learning, Washington, DC, USA, 21–24 August 2003.
53. Hamerly, G. Making k-means even faster. In Proceedings of the 10th SIAM International Conference on Data Mining, Columbus, OH, USA, 29 April–1 May 2010.
54. Politis, D.; Romano, J.; Wolf, M. Subsampling; Springer: New York, NY, USA, 1999.
55. Babaeian, A.; Bayestehtashk, A.; Bandarabadi, M. Multiple manifold clustering using curvature constrained path. PLoS ONE **2015**, 10, e0137986.
56. Bac, J.; Zinovyev, A. Lizard Brain: Tackling Locally Low-Dimensional Yet Globally Complex Organization of Multi-Dimensional Datasets. Front. Neurorobot. **2020**, 13, 110.
57. Mao, Q.; Wang, L.; Goodison, S.; Sun, Y. Dimensionality Reduction Via Graph Structure Learning. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, Australia, 10–13 August 2015; pp. 765–774.
58. Aynaud, M.M.; Mirabeau, O.; Gruel, N.; Grossetête, S.; Boeva, V.; Durand, S.; Surdez, D.; Saulnier, O.; Zaïdi, S.; Gribkova, S.; et al. Transcriptional Programs Define Intratumoral Heterogeneity of Ewing Sarcoma at Single-Cell Resolution. Cell Rep. **2020**, 30, 1767–1779.
59. Paul, F.; Arkin, Y.; Giladi, A.; Jaitin, D.A.; Kenigsberg, E.; Keren-Shaul, H.; Winter, D.; Lara-Astiaso, D.; Gury, M.; Weiner, A.; et al. Transcriptional Heterogeneity and Lineage Commitment in Myeloid Progenitors. Cell **2015**, 163, 1663–1677.
60. Guo, G.; Pinello, L.; Han, X.; Lai, S.; Shen, L.; Lin, T.W.; Zou, K.; Yuan, G.C.; Orkin, S.H. Serum-Based Culture Conditions Provoke Gene Expression Variability in Mouse Embryonic Stem Cells as Revealed by Single-Cell Analysis. Cell Rep. **2016**, 14, 956–965.
61. Zhang, Z.; Wang, J. MLLE: Modified Locally Linear Embedding Using Multiple Weights. Adv. Neural Inf. Process. Syst. **2007**, 19(19), 1593–1600.
62. Weinreb, C.; Wolock, S.; Klein, A.M. SPRING: A kinetic interface for visualizing high dimensional single-cell expression data. Bioinformatics **2017**, 34, 1246–1248.
63. Gorban, A.N.; Zinovyev, A. Visualization of Data by Method of Elastic Maps and its Applications in Genomics, Economics and Sociology. IHES Preprints. 2001. IHES/M/01/36. Available online: http://cogprints.org/3088/ (accessed on 11 March 2011).
64. Gorban, A.N.; Zinovyev, A.Y.; Wunsch, D.C. Application of the method of elastic maps in analysis of genetic texts. In Proceedings of the International Joint Conference on Neural Networks, Portland, OR, USA, 20–24 July 2003; p. 3.
65. Failmezger, H.; Jaegle, B.; Schrader, A.; Hülskamp, M.; Tresch, A. Semi-automated 3D Leaf Reconstruction and Analysis of Trichome Patterning from Light Microscopic Images. PLoS Comput. Biol. **2013**, 9.
66. Cohen, D.P.A.; Martignetti, L.; Robine, S.; Barillot, E.; Zinovyev, A.; Calzone, L. Mathematical Modelling of Molecular Pathways Enabling Tumour Cell Invasion and Migration. PLoS Comput. Biol. **2015**, 11, e1004571.

**Figure 1.** Basic principles and examples of ElPiGraph usage. (**A**) Schematic workflow of the ElPiGraph method. Left, construction of the elastic graph starts by defining the initial graph topology and embedding it into the data space. The graph structure is fit to the data, using minimization of the mean square error regularized by the elastic energy. The elastic energy includes a term reflecting the overall stretching of the graph (symbolically shown as contractive red springs) and a term reflecting the overall bending of the graph branches and the harmonicity of branching points (shown as repulsive green springs). Middle, ElPiGraph explores a large region of the structural space by exhaustively applying a set of graph rewriting rules (graph grammars) and selecting, at each step, the structure leading to the minimum overall energy of the graph embedding. (**B**) Example of an elastic matrix for a simple graph and its transformation into the several parts used for optimization (see details in the Methods). (**C**) Left and middle, illustration of the robust local workflow of ElPiGraph, which makes it possible to deal with the presence of noise. In the global version, the graph structure is fit to all data points at the same time, while in the local version, the structure is fit to the points in the local graph neighborhood, which expands as the graph grows. Right, an illustration of the principal graph ensemble approach: 100 elastic principal graphs are superimposed, each constructed on a fraction of data points randomly sampled at each execution. (**D**) Application of a set of graph grammar rules to generate a set of tree topologies with up to seven nodes.

**Figure 2.** Synthetic examples showing features of ElPiGraph. (**A**) ElPiGraph is robust with respect to downsampling and oversampling of a dataset. Here, a reference branching dataset [32] is downsampled to 50 points (middle) or oversampled by sampling 20 points randomly placed around each point of the original dataset. A more systematic study of the effect of down/over-sampling on the robustness of the principal graph is shown in Supplementary Figure S2. (**B**) ElPiGraph is able to capture non-tree-like topologies. Here, the standard set of principal tree graph grammars was applied to a graph initialized by four nodes connected in a ring. (**C**) ElPiGraph is robust to large amounts of uniformly distributed background noise. Here, the initial dataset from Figure 1 is shown as black points, and the uniformly distributed noisy points are shown as grey points. ElPiGraph is blindly applied to the union of black and grey points. (**D**) ElPiGraph is able to solve the problem of learning intersecting manifolds. On the left, a toy dataset of three uniformly sampled curves intersecting in many points is shown. ElPiGraph learns a principal curve using the local version several times, each time on the complete dataset. However, at each iteration, ElPiGraph is initialized by a pair of neighboring points that do not belong to the points already captured by a principal curve. The fitted curves are shown in the middle, superimposed on the point distribution in different colors, and the clustering of the dataset based on the curve approximation is shown on the right. (**E**) Approximating a synthetic ten-dimensional dataset with known branch structure (with different colors indicating different branches), where one of the branches (the blue one) extrudes into higher dimensions and collapses with other branches when projected in 2D by principal component analysis (left). Middle, when applied in the space of the first two principal components, ElPiGraph does not recover the branch, while the branch is captured when ElPiGraph is applied to the complete 10-dimensional dataset (right). In both cases, the principal tree is visualized using a metro map layout [21], and a pie chart associated with each node of the graph indicates the percentage of points belonging to the different populations. The size of the pie chart indicates the number of points associated with each node.
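One way to see why background noise as in panel C need not dominate the fit is a trimmed squared error, in which the contribution of any point is capped once it lies farther than a chosen radius from its nearest node. The sketch below is an illustration under stated assumptions, not the ElPiGraph code: the function name `trimmed_mse`, the radius `r0`, and the toy data are hypothetical.

```python
import numpy as np

def trimmed_mse(X, nodes, r0):
    """Mean over points of min(d^2, r0^2), with d the distance to the nearest node.

    r0 is a hypothetical trimming radius; points farther than r0 from every
    node contribute a constant and therefore exert no pull on the nodes.
    """
    d2 = ((X[:, None, :] - nodes[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    return np.minimum(d2, r0 ** 2).mean()

# Toy example: two points near a single node plus one distant "noise" point.
nodes = np.array([[0.0, 0.0]])
clean = np.array([[0.0, 0.0], [0.1, 0.0]])
noisy_near = np.vstack([clean, [[100.0, 0.0]]])
noisy_far = np.vstack([clean, [[1000.0, 0.0]]])

err_near = trimmed_mse(noisy_near, nodes, r0=1.0)
err_far = trimmed_mse(noisy_far, nodes, r0=1.0)
```

With trimming, moving the outlier from distance 100 to distance 1000 leaves the error unchanged, whereas an untrimmed mean squared error would grow a hundredfold.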

**Figure 3.** Computational performance of ElPiGraph. (**A**) Time required to build principal curves and trees (y-axis) with a different number of nodes (x-axis) using the default parameters across synthetic datasets containing different numbers of points and different numbers of dimensions (color scale), using a single CPU. (**B**) Computational advantage of using a combination of GPU and CPU in ElPiGraph computations. Starting from 10k data points, using the GPU becomes advantageous. The computational experiment was done using the Google Colab environment. (**C**) Empirical performance scaling of ElPiGraph on an example with 12k data points in 5D. The exact ElPiGraph algorithm scales as a polynomial of degree 3, while the approximate reduced algorithm, with a limitation on the number of tested candidate structures, scales as $s^{2.5}$ in the s = 50…1000 range, where s is the number of nodes in the principal graph.
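An empirical scaling exponent of the kind reported in panel C can be estimated as the slope of a straight-line fit in log-log coordinates. The following sketch assumes that approach and uses synthetic timings generated from an exact power law; the function name `scaling_exponent`, the prefactor, and the node counts are illustrative and are not measurements from the paper.

```python
import numpy as np

def scaling_exponent(sizes, times):
    """Estimate p in times ~ c * sizes**p by least squares on log-log values."""
    slope, _intercept = np.polyfit(np.log(sizes), np.log(times), 1)
    return slope

# Synthetic timings following an exact power law (illustrative values only).
sizes = np.array([50, 100, 200, 500, 1000])
times = 3e-4 * sizes ** 2.5

exponent = scaling_exponent(sizes, times)
```

On real timings the fit is only meaningful over the range where a single power law holds, which is why the caption restricts the statement to s = 50…1000.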

**Figure 4.** Quantification of branching cellular trajectories from single-cell RNA-Seq (scRNA-Seq) data using ElPiGraph. (**A**) Diagrammatic representation of the concept behind branching cellular trajectories and biological pseudotime in an arbitrary 2D space associated with gene expression: as cells progress from Stage 1 they differentiate (Stages 2 and 3) and branch (Stage 4) into two different subpopulations (Stages 5 and 6). Local distances between the cells indicate epigenetic similarity. Note how embedding a tree into the data allows recovering the genetic changes associated with cell progression into the two differentiated states. (**B**) Application of ElPiGraph to scRNA-Seq data [59]. Each point indicates a cell and is color-coded using the cellular phenotype specified in the original paper. One hundred bootstrapped trees are represented (in black), along with the tree fitted on all the data (black nodes and edges). The projection on ElPiGraph principal components 2 and 3 is shown enlarged as the most informative. The fraction of variance of the original data explained by the projection on the nodes (FVE) and on the edges (FVEP) is reported on top of the plot. (**C**) Diagrammatic representation of the distribution of cells across the branches of the tree reconstructed by ElPiGraph, with the same color scheme as in panel B. Pie charts indicate the distribution of populations associated with each node. The black arrows highlight a particular cell fate decision trajectory from CMP to GMP and DC. (**D**) Single-cell dynamics of gene expression along the path from the root of the tree (at the top of panel C) to the branches corresponding to DC commitment and GMP commitment. The expression profiles have been fitted by a LOESS smoother, colored according to the majority cell type in the branch, with a 95% confidence interval (in grey). The vertical grey area represents a 95% confidence interval obtained by projecting the relevant branching points of the resampled trees shown in panel B on the path of the principal graph.
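The FVE value mentioned in panel B can be read as one minus the ratio between the residual sum of squares after projecting each point onto its nearest node and the total sum of squares around the data mean. The sketch below assumes exactly that definition; the function name `fraction_variance_explained` and the toy data are hypothetical, and the edge-based variant (FVEP) is not covered.

```python
import numpy as np

def fraction_variance_explained(X, nodes):
    """1 - (residual sum of squares to nearest node) / (total sum of squares).

    Assumed definition for illustration; each point is projected onto its
    nearest node rather than onto the graph edges.
    """
    total = ((X - X.mean(axis=0)) ** 2).sum()
    d2 = ((X[:, None, :] - nodes[None, :, :]) ** 2).sum(axis=2).min(axis=1)
    return 1.0 - d2.sum() / total

# Toy data: four points on a line.
X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])

fve_perfect = fraction_variance_explained(X, X)  # one node per point
fve_mean = fraction_variance_explained(X, X.mean(axis=0, keepdims=True))  # one node at the mean
```

The two limiting cases bracket the useful range: a node on every point explains all of the variance, and a single node at the data mean explains none.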

**Figure 5.** Using ElPiGraph to approximate complex datasets describing developing embryos (Xenopus) or adult organisms (planarian). (**A**) A kNN graph constructed using the gene expression of 7936 cells of a Stage 22 Xenopus embryo has been projected on a 3D space using a force-directed layout. The color in this and the related panels indicates the population assigned to the cells by the source publication. (**B**) The coordinates of the points in panel A have been used to fit 1280 principal trees with different parameters, hence obtaining a principal forest. (**C**) The principal forest shown in panel B has been used to produce 10 consensus graphs (one for each parameter set). (**D**) A final consensus graph has been produced using the consensus graphs shown in panel C. (**E**) A final principal graph has been obtained by applying standard grammar operations to the consensus graph shown in panel D. (**F**) The associations of the different cell types with the nodes of the consensus graph shown in panel E are reported on a plane with a pie chart for each node. Note the complexity of the graph and the predominance of different cell types in different branches, as indicated by the predominance of one or a few colors. (**G**) The dynamics of notable genes have been derived by computing a pseudotime for a branching structure (top) and a linear structure (bottom) present in the principal graph of panel E (see black polygons). Each point represents the gene expression of a cell, and its color indicates either the associated path (top) or the cell type (bottom). The gene expression profiles have been fitted with a LOESS smoother, which includes a 95% confidence interval. In the top panel the smoother has been colored to highlight the different paths, with the color indicated in the text of panel F. (**H**–**J**) The same approach described by panels A–G has been used to study the single-cell transcriptome of planarians. In panel J, the color of the smoother indicates the predominant cell type on the path. Interactive versions of key panels are available at https://sysbio-curie.github.io/elpigraph/.
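The pseudotime used in panel G can be thought of as a geodesic distance measured along the graph edges from a chosen root node. The following sketch is a simplification under stated assumptions: it computes node pseudotime with Dijkstra's algorithm on Euclidean edge lengths and then gives each cell the pseudotime of its nearest node, rather than projecting cells onto edges. The names `node_pseudotime` and `cell_pseudotime` and the Y-shaped toy tree are hypothetical.

```python
import heapq
import numpy as np

def node_pseudotime(nodes, edges, root):
    """Geodesic distance from `root` to every node along the graph edges."""
    adj = {i: [] for i in range(len(nodes))}
    for i, j in edges:
        w = float(np.linalg.norm(nodes[i] - nodes[j]))  # Euclidean edge length
        adj[i].append((j, w))
        adj[j].append((i, w))
    dist = {root: 0.0}
    heap = [(0.0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, np.inf):
            continue  # stale heap entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, np.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    # Nodes unreachable from the root get infinite pseudotime.
    return np.array([dist.get(i, np.inf) for i in range(len(nodes))])

def cell_pseudotime(X, nodes, node_pt):
    """Assign each cell the pseudotime of its nearest node (a simplification)."""
    d2 = ((X[:, None, :] - nodes[None, :, :]) ** 2).sum(axis=2)
    return node_pt[d2.argmin(axis=1)]

# Toy Y-shaped tree: a trunk from node 0 to node 1, then two branches.
nodes = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 1.0], [2.0, -1.0]])
edges = [(0, 1), (1, 2), (1, 3)]
node_pt = node_pseudotime(nodes, edges, root=0)

cells = np.array([[0.1, 0.0], [1.9, 0.9]])
cell_pt = cell_pseudotime(cells, nodes, node_pt)
```

Because both branches of the toy tree have the same length, their tips receive the same pseudotime, which is what allows gene expression along two diverging paths to be compared on a common axis.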

| Element | Initial Publication | Principal Advances |
|---|---|---|
| Principal curves | Hastie and Stuetzle, 1989 [24] | Definition of principal curves based on self-consistency |
| Piece-wise linear principal curves | Kégl et al., 1999 [25] | Length constrained principal curves, polygonal line algorithm |
| Elastic energy functional for elastic principal curves and manifolds | Gorban, Rossiev, Wunsch II, 1999 [26] | Fast iterative splitting algorithm to minimize the elastic energy, based on a sequence of solutions of simple quadratic minimization problems |
| Method of elastic maps | Zinovyev, 2000 [27], Gorban and Zinovyev, 2001 [28] | Construction of principal manifold approximations possessing various topologies |
| Principal Oriented Points | Delicado, 2001 [29] | Principal curves passing through a set of principal oriented points |
| Principal graphs specialized for image skeletonization | Kégl and Krzyzak, 2002 [30] | Coining the term principal graph, an algorithm extending the polygonal line algorithm, specialized on image skeletonization |
| Self-assembling principal graphs | Gorban and Zinovyev, 2005 [10] | Simple principal graph algorithm, based on application of the elastic map method, specialized on image skeletonization |
| General purpose elastic principal graphs | Gorban, Sumner, Zinovyev, 2007 [20] | Suggesting the principle of (pluri-)harmonic graph embedding, coining the terms ‘principal tree’ and ‘principal cubic complex’ with algorithms for their construction |
| Topological grammars | Gorban, Sumner, Zinovyev, 2007 [20] | Exploring multiple principal graph topologies via gradient descent-like search in the space of admissible structures |
| Explicit control of principal graph complexity | Gorban and Zinovyev, 2009 [17] | Introducing three types of principal graph complexity and ways to constrain it |
| Regularized principal graphs | Mao et al., 2015 [22] | Formulating the reverse graph embedding problem. Suggesting the SimplePPT algorithm. Further development in Mao et al., 2017 [19] |
| Robust principal graphs | Gorban, Mirkes, Zinovyev, 2015 [31] | Using a trimmed version of the mean squared error, resulting in the ‘local’ growth of the principal graphs and robustness to background noise |
| Domain-specific adaptations of principal graphs | Qiu et al., 2017 [32]; Chen et al., 2019 [33]; Parra et al., 2019 [34] | Use of principal graphs in single-cell data analysis as a part of the pipelines Monocle, STREAM, MERLoT. Introducing heuristics for problem-specific initializations of principal graphs. Benchmarked in Saelens et al., 2019 [35]. |
| Partition-based graph abstraction | Wolf et al., 2019 [36] | Dealing with non-tree-like topologies, large-scale and multi-scale data analysis using graphs |
| Excessive branching control | This publication | Introducing a penalty for excessive branching of elastic principal graphs (α parameter) |
| Principal graph ensemble approach | This publication | Estimating confidence of branching point positions, constructing consensus principal graphs |
| Reducing the computational complexity of elastic principal graphs | This publication | Accelerated procedures and heuristics in order to enable constructing large principal graphs |
| Use of GPUs for constructing elastic principal graphs | This publication | Scalable Python and R implementations of elastic principal graphs (ElPiGraph), improved scalability and introducing various plotting functions |

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Albergante, L.; Mirkes, E.; Bac, J.; Chen, H.; Martin, A.; Faure, L.; Barillot, E.; Pinello, L.; Gorban, A.; Zinovyev, A.
Robust and Scalable Learning of Complex Intrinsic Dataset Geometry via ElPiGraph. *Entropy* **2020**, *22*, 296.
https://doi.org/10.3390/e22030296
