Reconstruction of Single-Cell Trajectories Using Stochastic Tree Search

Recent advances in single-cell RNA sequencing technologies enable the understanding of dynamic cellular processes at the single-cell level. Using trajectory inference methods, pseudotime can be estimated from reconstructed single-cell trajectories and used to gain biological knowledge. Existing methods for modeling cell trajectories, such as the minimum spanning tree or the k-nearest neighbor graph, often lead to locally optimal solutions. In this paper, we propose a penalized likelihood-based framework and introduce a stochastic tree search (STS) algorithm aiming at the global solution in a large and non-convex tree space. Both simulated and real data experiments show that our approach is more accurate and robust than existing methods in terms of cell ordering and pseudotime estimation.


Introduction
The advancement of single-cell RNA sequencing enables the measurement of gene expression in individual cells, prompting an understanding of dynamic cellular processes, including cell state transitions such as cell differentiation. Reconstructing a cell trajectory from the gene expression of a sample of cells is one new research area made possible by this technology. However, the high-dimensional gene expression data space and the associated high noise level pose difficulties in modeling the trajectory from the original expression data [1]. One way to reconstruct the cell trajectory is by calculating pseudotime, where pseudotime is a measure of the distance of a particular cell from the origin of a dynamic process. This type of computational approach is called trajectory reconstruction (TR) [2]. To overcome the challenges in single-cell trajectory analysis, TR methods generally have two main steps. First, to handle the high dimensionality and high noise level in the expression data, a dimensionality reduction method is applied to convert the original high-dimensional data space into a low-dimensional space; both linear and nonlinear dimensionality reduction methods can be considered in this step to address different types of data. The second step is to model the trajectory in the dimension-reduced space. For instance, a minimum spanning tree (MST) is fitted or a k-nearest neighbor (KNN) graph is applied to model the cell trajectory [3]. This paper focuses on the second step, trajectory modeling, to further improve the performance of TR methods in the reconstruction of single-cell trajectories.
The existing TR methods can be classified into three main categories based on their trajectory-modeling steps. First, Wanderlust, Wishbone and SLICER are all designed based on the KNN graph. Wanderlust was not originally developed for single-cell transcriptomics data but for cytometry data, so no dimensionality reduction is implemented and the trajectory modeling is applied directly to the high-dimensional data space [4]. Building on Wanderlust, Wishbone adds a dimensionality reduction step before trajectory modeling to address the high-dimensionality challenge in scRNA-seq data. Unlike Wanderlust,

Methods
Similar to most existing TR methods, our single-cell trajectory reconstruction approach consists of two parts. The first part is compatible with any dimensionality reduction method, while for the second part we develop a novel stochastic tree search process to estimate the cell trajectory. Under the likelihood framework, we define an optimization function used to find the optimal embedding tree as the estimated cell trajectory. During the stochastic tree search, we explore the whole embedding tree space while maintaining a pool of candidate trees at each step, starting from the simplest one-edge tree structure and moving to more complicated tree structures with more edges. The details of our stochastic tree search algorithm are given in the following subsections.

Preprocessing
Before fitting the tree structure to the data, the raw gene read counts are normalized by log2 transformation. Then, the dimension of the normalized data is reduced by any linear or nonlinear method. Principal component analysis (PCA) applies a linear projection of the data that preserves the variance in the new lower-dimensional space. Locally linear embedding (LLE), diffusion maps, and t-SNE are more general approaches without the linearity assumption, so these methods are able to find nonlinear relationships between cells. The data drive the choice of dimensionality reduction approach: if the linearity assumption holds, PCA is applied since it is relatively computationally efficient [1]; otherwise, nonlinear methods are applied, especially for more recent data sets with more complex cell trajectories.
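As a concrete sketch of these two steps, the snippet below log2-transforms a count matrix and projects it onto its top principal components via an SVD; the helper name `preprocess` and the pseudocount of 1 are illustrative choices, not specified in the text.

```python
import numpy as np

def preprocess(counts, n_components=2):
    """Log2-normalize raw read counts and reduce dimension by PCA (via SVD).

    counts: (n_cells, n_genes) array of raw read counts.
    Returns an (n_cells, n_components) embedding of the cells.
    """
    logged = np.log2(counts + 1.0)           # pseudocount avoids log2(0)
    centered = logged - logged.mean(axis=0)  # center genes before PCA
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:n_components].T    # scores on top components

rng = np.random.default_rng(0)
counts = rng.poisson(5.0, size=(50, 200))    # toy count matrix
Z = preprocess(counts, n_components=2)
```

A nonlinear method (LLE, diffusion maps, t-SNE) would replace the SVD step when the linearity assumption fails.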

Penalized Likelihood
We use an embedding tree on the lower-dimensional space to estimate the cell trajectory. An embedding tree T = (V, E, y) is defined by three main components V, E and y: V is a vertex set of size |V|; E is an edge set with E ⊂ V × V; and y = (y_1, ..., y_|V|) is the associated vertex embedding, where y_j ∈ R^p. When we fit the embedding tree T with n data points X = {X_1, ..., X_n}, we assume that

$$X_i = \phi_T(X_i) + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2 I_p),$$

where X_i ∈ R^p and φ_T(X_i) is the projection of X_i onto the embedding tree T. This leads to the likelihood function

$$L(T) = \prod_{i=1}^{n} (2\pi\sigma^2)^{-p/2} \exp\left( -\frac{\| X_i - \phi_T(X_i) \|^2}{2\sigma^2} \right),$$

where we assume that σ^2 is known and can be estimated from the data. Up to an additive constant, the negative log-likelihood function is

$$-\log L(T) = \frac{1}{2\sigma^2} \sum_{i=1}^{n} \| X_i - \phi_T(X_i) \|^2 + \mathrm{const}.$$

Finally, we can find the optimal embedding tree by minimizing the penalized negative log-likelihood function, min_T f(T) + p(T). f(T) is the loss function derived from the negative log-likelihood,

$$f(T) = \sum_{i=1}^{n} \| x_i - \phi_T(x_i) \|^2,$$

where φ_T(x_i) is the projection of x_i onto the embedding tree T. The penalty term p(T) has two components,

$$p(T) = \lambda_1 |V|^{\alpha} + \lambda_2 \sum_{(j,k) \in E} \| y_j - y_k \|.$$

The first component controls the complexity of the tree structure: the larger the tree, the larger the penalty added to the optimization function. It is based on the BIC, whose penalty is K log n, where σ̂^2 is the sample variance and K is the number of free parameters of the tree model, so K ≈ p|V|; we set λ_1 and α to match this penalty. The second component shrinks the lengths of the tree edges. We use a small penalty with λ_2 = 0.01 to avoid unnecessary long edges.

Projection of the Data to the Tree
In order to obtain the penalized negative log-likelihood function for the data, we first introduce the way to calculate the projection of each data point onto a given tree. For an embedding tree T = (V, E, y), the projection of a data point x_i onto the tree T is its closest edge projection,

$$\phi_T(x_i) = \phi_{e^*}(x_i), \qquad e^* = \arg\min_{e \in E} \| x_i - \phi_e(x_i) \|.$$

Here, e is an edge connecting two vertices V_j, V_k ∈ V with the associated embeddings y_j, y_k ∈ R^p, where j, k ∈ {1, ..., |V|}. We can further write φ_e(x_i) as

$$\phi_e(x_i) = y_j + l_e(x_i) \, (y_k - y_j),$$

where l_e(x_i) is the mapping of φ_e(x_i) when mapping the line segment [y_j, y_k] to [0, 1], which can be calculated using the closed-form formula

$$l_e(x_i) = \min\left\{ 1, \; \max\left\{ 0, \; \frac{(x_i - y_j)^\top (y_k - y_j)}{\| y_k - y_j \|^2} \right\} \right\}.$$
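A minimal sketch of these projection formulas, assuming the clamped closed-form coefficient above; the function names are illustrative.

```python
import numpy as np

def project_to_edge(x, yj, yk):
    """Project x onto the segment [yj, yk]; return (l_e(x), phi_e(x))."""
    d = yk - yj
    l = float(np.dot(x - yj, d) / np.dot(d, d))
    l = min(1.0, max(0.0, l))               # clamp to the segment
    return l, yj + l * d

def project_to_tree(x, y, edges):
    """phi_T(x): the closest edge projection over all edges of the tree."""
    best_dist, best_proj = np.inf, None
    for j, k in edges:
        _, proj = project_to_edge(x, y[j], y[k])
        dist = np.linalg.norm(x - proj)
        if dist < best_dist:
            best_dist, best_proj = dist, proj
    return best_proj

# toy V-shaped tree with vertices (0,0)-(1,0)-(1,1)
y = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
edges = [(0, 1), (1, 2)]
p1 = project_to_tree(np.array([0.5, -1.0]), y, edges)  # lands on first edge
p2 = project_to_tree(np.array([2.0, 0.5]), y, edges)   # lands on second edge
```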

Updating Vertex Embedding Location
In the optimization process, we start with a random tree and then iteratively project the data to the tree (detailed in the section above) and update the embedding locations of its vertices. Without loss of generality, suppose there is a vertex v ∈ V and {v_1, v_2, ..., v_m} is the set of all the vertices that v connects to. Let x^k_1, ..., x^k_{n_k} ∈ R^p be the data points projected to edge e_k = (v_k, v) for k = 1, ..., m. The new embedding location y* for v can be calculated by minimizing the objective function

$$S(y) = \sum_{k=1}^{m} \sum_{t=1}^{n_k} \| x^k_t - \phi_{e_k}(x^k_t) \|^2.$$

Let l^k_t = l_{e_k}(x^k_t) be the mapping of φ_{e_k}(x^k_t) when mapping [y_k, y] to [0, 1]. If we assume that the mapping l^k_t is fixed, there is a closed-form solution for y*; see Appendix A for more details. However, the mapping l^k_t changes as the embedding location is updated. Hence, there is no simple closed-form solution for y*, and we apply a backtracking line search algorithm to obtain y*. We first set the initial values y^0 and τ_0 = τ_max = 1. Then, for the b-th iteration, b = 1, ..., B, we shrink the step size τ ← ρτ until the sufficient-decrease condition

$$S(y^b - \tau \nabla S_b) \le S(y^b) - c \, \tau \, \| \nabla S_b \|^2$$

holds, and set y^{b+1} = y^b − τ∇S_b. Because the mapping l^k_t also depends on y, there is no simple closed-form expression for the gradient ∇S_b, so we compute it numerically by the forward difference

$$(\nabla S_b)_i \approx \frac{S(y^b + \epsilon \, e_i) - S(y^b)}{\epsilon},$$

where e_i is the i-th coordinate vector. We use a small ε = 1 × 10^{-6} in this numerical approximation, and ρ = 0.8 and c = 0.5 in our simulation studies. Finally, we repeat the backtracking line search until the change in the objective function falls below the tolerance.
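The vertex update can be sketched as gradient descent with a numerical gradient and Armijo backtracking. This is a generic implementation under the stated constants (contraction 0.8, c = 0.5, ε = 1e-6), not the authors' code, and it is demonstrated on a simple quadratic surrogate rather than the tree objective S(y).

```python
import numpy as np

def numeric_grad(S, y, eps=1e-6):
    """Forward-difference approximation of the gradient of S at y."""
    g = np.zeros_like(y)
    s0 = S(y)
    for i in range(y.size):
        step = np.zeros_like(y)
        step[i] = eps
        g[i] = (S(y + step) - s0) / eps
    return g

def backtracking_update(S, y0, rho=0.8, c=0.5, max_iter=100, tol=1e-8):
    """Minimize S by descent steps with Armijo backtracking line search."""
    y = np.asarray(y0, dtype=float)
    for _ in range(max_iter):
        g = numeric_grad(S, y)
        tau = 1.0                            # tau_max = 1
        # shrink tau until the sufficient-decrease condition holds
        while S(y - tau * g) > S(y) - c * tau * float(np.dot(g, g)):
            tau *= rho
            if tau < 1e-12:
                break
        y_new = y - tau * g
        if abs(S(y) - S(y_new)) < tol:       # change below tolerance
            return y_new
        y = y_new
    return y

# demo on a quadratic surrogate S(y) = ||y - a||^2, minimized at a
a = np.array([1.0, 2.0])
S = lambda y: float(np.sum((y - a) ** 2))
y_star = backtracking_update(S, np.array([5.0, -3.0]))
```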

Tree Similarity Score
To facilitate the stochastic tree search, we maintain a pool of candidate trees. During the stochastic tree search process, we remove similar trees and thereby maintain the diversity of the candidate trees in the tree pool, which is achieved using a tree similarity score. The tree similarity score estimates the similarity between two trees of the same size. Considering two embedding trees T_1 = (V^(1), E^(1), y^(1)) and T_2 = (V^(2), E^(2), y^(2)) with the same size |V^(1)| = |V^(2)| = s, the tree similarity score is computed as

$$S(T_1, T_2) = D(V^{(1)}(y^{(1)}), V^{(2)}(y^{(2)})) + D(E^{(1)}, E^{(2)}).$$

The first part, D(V^(1)(y^(1)), V^(2)(y^(2))), denotes the similarity between the two vertex sets V^(1) and V^(2) through the embedding locations y^(1) and y^(2); the second part, D(E^(1), E^(2)), measures the distance between the two edge sets E^(1) and E^(2).

Stochastic Optimization

Initial Tree Generation

Starting from the simplest tree structure, the one-edge tree, the associated embedding locations are generated from a multivariate normal distribution MVN(μ, Σ). The mean vector μ and the diagonal covariance matrix Σ can be estimated from the data. To better search through the large tree space, multiple initial trees T^1_2, ..., T^{L_1}_2 are generated to form the initial tree pool {T^l_2}. We then update the embedding locations for each tree in the initial tree pool to better fit the data. For each updated tree T*_2, we calculate the tree similarity score S_1 and the optimization score S_2. Let T^min_2 denote the tree in the pool with the smallest tree similarity score S(T*_2, T_2); if S_1(T*_2) > a and S_2(T*_2) < min_{T_2 ∈ {T^l_2}} S_2(T_2), then we update the initial tree pool by replacing T^min_2 with T*_2.

Grow Trees by Adding Nodes
Based on the old tree pool {T^l_k}, in which all candidate trees have tree size k, we grow the trees by adding a new node to each tree. There are two ways to add a new node. The first is adding a new node v_new connected to an existing node v_j, j ∈ {1, ..., k}, of T_k, giving the new edge e_new = (v_new, v_j). Each existing node has an equal probability of being selected to connect with the new node, P(v_j) = 1/k. In order to make the stochastic optimization more efficient, we generate the new embedding location y_new of v_new guided by the data. In detail, we first compute the residuals based on T_k,

$$r_i = \| x_i - \phi_{T_k}(x_i) \|, \qquad i = 1, \ldots, n,$$

and then standardize the residuals to r*_i. Finally, we sample the embedding location of the new node y_new from x_1, ..., x_n with the sampling weights w_i = e^{λ_3 r*_i}. The new node is therefore more likely to be placed in regions with larger residuals, since they have a higher sampling probability. The parameter λ_3 controls how strongly the residuals drive the new embedding location.
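The residual-guided sampling can be sketched as below; the mean/SD standardization of the residuals is an assumption, since the text only states that the residuals are standardized, and the helper name is illustrative.

```python
import numpy as np

def sample_new_node(X, residuals, lam3=1.0, rng=None):
    """Sample the new node location from the data points x_1..x_n with
    weights w_i = exp(lam3 * r_i*), favoring poorly fit regions."""
    rng = rng or np.random.default_rng()
    r = np.asarray(residuals, dtype=float)
    r_star = (r - r.mean()) / r.std()        # assumed standardization
    w = np.exp(lam3 * r_star)
    p = w / w.sum()                          # sampling probabilities
    idx = rng.choice(len(X), p=p)
    return X[idx], p

# toy example: the point with a large residual dominates the sampling
X = np.arange(10, dtype=float).reshape(-1, 1)
r = np.zeros(10)
r[7] = 5.0
y_new, p = sample_new_node(X, r, lam3=1.0, rng=np.random.default_rng(1))
```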
The other way to generate a new node is by adding a middle point to an edge e = (v_j, v_g) of T_k. Each edge of T_k has the same probability of being chosen, P(e) = 1/(k − 1), since a tree with k nodes has k − 1 edges. We replace the previous edge e with two new edges, e_new1 = (v_j, v_new) and e_new2 = (v_new, v_g). The new embedding location is the midpoint

$$y_{new} = \frac{y_j + y_g}{2},$$

where y_j and y_g are the embedding locations for v_j and v_g.

Optimizing Tree with Data
From the old tree pool {T^l_k}, we obtain a new tree pool {T^l_{k+1}} by adding a new node as described in the previous section. For each new tree, we further update the embedding locations of all nodes to better fit the data. As in the initial tree updating procedure, for each updated tree T*_{k+1} in the optimization process we compute the tree similarity score S_1 and the optimization score S_2. If S_1(T*_{k+1}) > a and S_2(T*_{k+1}) < min_{T_{k+1} ∈ {T^l_{k+1}}} S_2(T_{k+1}), then we update the current new tree pool by replacing T^min_{k+1} with T*_{k+1}, where T^min_{k+1} is the tree in the pool with the smallest tree similarity score S(T*_{k+1}, T_{k+1}). We repeat the node-adding step on the old tree pool L_2 times and update the new tree pool {T^l_{k+1}} according to the above two criteria.

Final Optimal Tree
According to the optimization score S_2(T_k), we order the trees in each tree pool {T^l_k}, for k = 2, ..., and select the top-ranked tree T^opt_k with the smallest optimization score. Once the optimization score of the top-ranked tree no longer improves as the tree size grows, we stop developing trees and take the final optimal tree T_opt = T^opt_k. The details of the process of finding the final optimal tree are shown in Figure 1.

Pseudotime Calculation
Once the optimal tree is found through the stochastic tree search described above, we can compute pseudotime using the same shortest path algorithm applied by Bendall (2014) in the pseudotime calculation for Wanderlust [12]. We first define the length of an edge as the Euclidean distance between its two vertices, so that a corresponding adjacency matrix for the optimal tree can be constructed. Then, the distance from each cell to the origin is computed by the shortest path algorithm. Finally, we define this distance as the pseudotime of each cell and order cells by their pseudotimes.
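On a tree, the shortest path from the origin is the unique path, so a single traversal suffices. The sketch below computes pseudotime for the tree vertices; assigning pseudotime to individual cells would add each cell's offset along its projected edge. Function names are illustrative.

```python
import numpy as np
from collections import defaultdict

def vertex_pseudotime(y, edges, root=0):
    """Pseudotime of each vertex: path length from the root, where an
    edge's length is the Euclidean distance between its embeddings."""
    adj = defaultdict(list)
    for j, k in edges:
        w = float(np.linalg.norm(y[j] - y[k]))
        adj[j].append((k, w))
        adj[k].append((j, w))
    dist = {root: 0.0}
    stack = [root]
    while stack:                  # each vertex of a tree is reached once
        u = stack.pop()
        for v, w in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + w
                stack.append(v)
    return dist

# toy path tree with 3-4-5 geometry: edge 0-1 has length 5, edge 1-2 length 6
y = np.array([[0.0, 0.0], [3.0, 4.0], [3.0, 10.0]])
dist = vertex_pseudotime(y, [(0, 1), (1, 2)], root=0)
```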

Extension from Linear Trees to Nonlinear Trees
The assumption of a linear tree-shaped cell trajectory is not valid for many scRNA-seq data sets, and allowing nonlinear cell trajectory estimation can improve the accuracy of trajectory reconstruction. Hence, we also propose an improved global search algorithm, a stochastic curved tree search, to accommodate nonlinear cell trajectories. Instead of a linear embedding tree, the new algorithm fits a curved embedding tree whose edges are modeled by bounded principal curves. The details of the curved tree method can be found in Appendix B.

Design
To check the accuracy of our algorithm and its robustness to data noise, a k-node embedding tree in p dimensions is randomly generated, with the embedding location of each node simulated from a multivariate normal distribution N(0, I_p). To reflect the complexity of real data sets, additional noise is generated from a known distribution. In the first part of the simulation study, the noise is simulated from a standard normal distribution N(0, 1). In the second part, the noise is generated from a Student's t distribution with 3 degrees of freedom (t_3); in this case, extreme outliers can occur, driving the data far from the true tree structure. A parameter σ denotes the scale of the noise in the simulated data and controls the noise level: a larger σ means more noise. In the study, we gradually increase the noise level σ to examine the robustness of our algorithm. Because of the importance of pseudotime in real data applications, accuracy is measured on the ordering of pseudotime assignments: the Kendall rank correlation coefficient is computed between the estimated pseudotime and the simulated pseudotime based on the true structure.
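The evaluation metric can be computed with a small helper (scipy.stats.kendalltau provides an equivalent with tie corrections; the simple pairwise version below ignores ties):

```python
def kendall_tau(a, b):
    """Kendall rank correlation between two equal-length sequences:
    (concordant pairs - discordant pairs) / total pairs, ignoring ties."""
    n = len(a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (a[i] - a[j]) * (b[i] - b[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

A perfectly recovered ordering gives 1, a reversed ordering gives -1, and a random ordering is near 0.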

Comparison with Other Methods
Four existing single-cell trajectory inference methods are also applied to the simulated data for comparison. These four approaches use an MST, a KNN graph or more complicated tools such as principal curves in the trajectory-modeling step, which covers the majority of models for reconstructing a cell trajectory. Monocle models the trajectory by an MST in the lower-dimensional space obtained by ICA. More sophisticated than Monocle, Monocle2 uses an additional clustering algorithm to obtain the latent data representation in the lower dimension and then uses reversed graph embedding (RGE) to learn a principal curve. Slingshot also models the trajectory based on an MST; however, the MST is fitted on clusters of cells and is then treated as an initial guess for a simultaneous principal curve algorithm, which enables modeling of a nonlinear trajectory. SLICER first reduces the dimensionality by LLE and then applies a KNN graph in the trajectory-modeling step.
As shown in Figure 2, for normally distributed noise, all methods follow the same trend: both the Kendall correlation and the stability decrease as the noise level increases. Hence, the accuracy of the cell ordering based on pseudotime is associated with the noise level. However, STS has the highest correlation and is the most stable among all methods at every noise level. When the noise level is low, the correlation of our method is close to 1, so our method produces almost the same cell ordering as the truth. SLICER has the second highest correlation, and Slingshot performs worse than SLICER. The two versions of Monocle perform similarly, with the lowest correlation among all methods. When the noise level is extremely high, all methods have a Kendall correlation lower than 0.5, and the estimated cell orderings are close to random assignments.
When the noise is generated from the Student's t distribution (Figure 2), the Kendall correlations of all methods drop faster with increasing noise level than in the normally distributed case. The performance of all approaches is also less stable than in the normal case. With a low or moderate noise level, our method outperforms the other approaches with the highest Kendall correlation. When the noise level reaches 1.5, the correlations of all methods drop below 0.5; thus, all approaches fail to accurately identify the true cell orderings. Monocle with DDRTree (Monocle2) is the most stable and the least affected by extreme outliers among all methods. STS has a slightly lower correlation than Monocle2 but remains comparable with the other methods.
We also check the computational efficiency through simulations with different sample sizes and tree sizes. As Figure 3 shows, the computational cost increases nonlinearly with the sample size when the tree size is fixed. However, Figure 3 indicates a linear increase in running time as the tree size grows with the sample size held constant. When the tree size exceeds eight, the tree structure becomes complicated and difficult to estimate from the data; in this case, STS stops with a simpler estimated optimal tree with fewer edges.

Comparison between Linear and Curved Tree Methods
In order to check the accuracy and the robustness to data noise of our curved tree algorithm, a k-node curved tree with non-zero curvature parameters in p dimensions is randomly generated, with the embedding location of each node simulated from a multivariate normal distribution N(0, I_p). To reflect the complexity of real data sets, additional noise is generated from a known distribution. Since the data are noisier in the curved tree framework, we only simulate the data noise from a standard normal distribution N(0, 1). A parameter σ denotes the scale of the noise in the simulated data and controls the noise level: a larger σ means more noise. In the study, we first gradually increase the noise level σ to examine the robustness of our algorithm when the curvature level is fixed. Then, we simulate the curvature parameters of the bounded principal curves from the normal distribution N(0, cI_p). The standard deviation c of this normal distribution controls the curvature of the simulated curved trees: a larger c means a higher curvature on average, and thus a more complicated simulated tree. We also simulate data with different curvature levels when the noise level is fixed. We assess the performance of our method in two parts. To check the accuracy of cell orderings, the Kendall rank correlation coefficient is computed between the estimated pseudotime and the simulated pseudotime based on the true structure. In addition, we check the accuracy of the estimated cell trajectory by the residual standard error, calculated as the square root of the mean squared error between the estimated data projections and the true projections. We also compare our curved tree algorithm with the linear tree algorithm.
As shown in Table 1, both algorithms exhibit the same trend: the Kendall correlation decreases and the residual standard error (with its associated standard deviation) increases as the noise level increases. Hence, the accuracy of the cell ordering based on pseudotime is related to the noise level, as we found for the linear tree method, and the accuracy of trajectory estimation is also related to the noise level. Comparing the two algorithms at the same noise level, there is no significant difference in Kendall correlation between them. Regarding residual standard error, the curved tree algorithm outperforms the linear tree algorithm with much lower values. In general, the curved tree algorithm does not significantly improve the accuracy of cell orderings, but its estimation of the cell trajectory is more accurate.
From Table 2, both the Kendall correlation and the residual standard error are associated with the curvature for the two methods. At low curvature, there is no significant difference in Kendall correlation between the curved and linear tree methods, but the difference between the two methods becomes more pronounced as the curvature grows. In terms of residual standard error, the curved tree method attains a smaller value than the linear tree approach at the same curvature. Moreover, as the curvature increases, both approaches become more unstable, with higher variation in Kendall correlation and residual standard error.

Induction of Mouse Embryonic Stem (ES) Cell Differentiation
Time series data of 421 RamDA-seq samples of mouse ES cells were collected at five time points, with 157,717 gene features, during the cell differentiation process. These time points are 0, 12, 24, 48 and 72 h after the induction of cell differentiation into primitive endoderm (PrE) cells [13]. We pre-processed the data in the same way as Cannoodt (2016). The data are first filtered to keep only high-quality cells, reducing the number of cells to 414. Then, the count data are normalized by log2 transformation. Finally, 23,658 gene features are selected based on feature variability. With the final gene expression matrix, a diffusion map is used to further reduce the data dimensionality. With the normalized data transformed into the lower dimension, we apply STS to estimate the cell trajectory and calculate pseudotime to order the cells. We also fit the same four existing TR methods used in the simulation studies: Slingshot, SLICER, Monocle and Monocle2.
Hayashi (2018) identified a cell differentiation trajectory in their paper, in which cells move from the initial state to the final state through an intermediate transition state. From Figure 4, our two approaches obtain an estimated cell trajectory close to the truth. Among the four existing TR methods, Slingshot, Monocle and Monocle2 cluster the cells but identify different numbers of clusters, ranging from 5 to 16. Both SLICER and Monocle2 estimate the cell trajectory slightly differently from the other methods, as they also identify a small branch at the transition state. In terms of pseudotime estimation, not all of these methods accurately order the cells. Figure 5 shows the relationship between the estimated pseudotime and the true time for each method. For all methods except SLICER, there is a positive correlation between the estimate and the truth, indicating that the pseudotime calculation is consistent with the true time points. SLICER is unable to estimate a pseudotime accurately related to the truth. The same result is shown in Table 3: the Kendall correlation for SLICER is the lowest, at 0.39, among all TR methods, while the other methods all have a Kendall correlation greater than 0.8 and our two methods have the highest correlation. All methods except SLICER are able to estimate pseudotime accurately and recover the true cell orders. Comparing our linear and curved tree methods, the curved tree method (0.0045) has a slightly lower residual standard error than the linear tree method (0.0055). Therefore, the curved tree method estimates the cell trajectory more accurately than the linear tree approach. In terms of pseudotime estimation, consistent with the simulation findings, our curved tree method does not significantly improve the cell orderings compared with the linear tree method; both methods have the same Kendall correlation.

Resolution of Cell Fate Decisions from Zygote to Blastocyst
We also apply our STS methods to a single-cell expression data set of Ct values for 48 genes in 442 cells harvested over the first four days of mouse development. We aim to recover the dynamic cell development process from zygote to blastocyst (from the 1-cell stage to the 64-cell stage). We follow the same data cleaning and normalization process as Guo et al. (2010) [14]. First, duplicate cells from two or more cell embryos are removed, and cells with Ct values less than 28 are also removed. Then, we normalize the Ct values using the endogenous controls Actb and Gapdh by subtracting their average Ct values for each cell. With the final gene expression matrix, a diffusion map is used to further reduce the data dimensionality. With the normalized data transformed into the lower dimension, we apply our linear and curved tree methods to estimate the cell trajectory and calculate pseudotime to order the cells. We also fit the same four existing TR methods used in the simulation studies: Slingshot, SLICER, Monocle and Monocle2.
The previous study shows that the cells from the 64-cell stage subdivide into the trophectoderm (TE), the epiblast (EPI) and the primitive endoderm (PE). In addition, the cells from the 32-cell stage can be classified as inner cell mass (ICM) or TE in the middle of the cell development [14]. From Figure 6, our two approaches obtain an estimated cell trajectory close to the truth. Among the four existing TR methods, Monocle2 also recovers a three-branch structure similar to that of our method. However, Slingshot, SLICER and Monocle identify only two cell types at the 64-cell stage. In terms of pseudotime estimation, not all of these methods accurately order the cells. Figure 7 shows the relationship between the estimated pseudotime and the true time for each method. Both Slingshot and our approach show a positive correlation between the estimate and the truth, indicating that the pseudotime calculation is consistent with the true time points. SLICER, Monocle and Monocle2 fail to distinguish between the first three stages as well as between the last two stages. The same result is shown in Table 4: the Kendall correlation for Monocle is the lowest, at 0.51, among all TR methods; SLICER and Monocle2 have a Kendall correlation of 0.69; and our algorithm has the second highest correlation, 0.76, with Slingshot slightly higher. This data set reflects a more complicated dynamic process than the previous one, so none of the methods orders the cells very close to the truth. Nevertheless, STS and Slingshot order the majority of the cells accurately.
Comparing our linear and curved tree methods, the curved tree method (0.0220) has a slightly lower residual standard error than the linear tree method (0.0229). Therefore, the curved tree method estimates the cell trajectory more accurately than the linear tree approach. In terms of pseudotime estimation, consistent with the findings from the previous data set, our curved tree method does not significantly improve the cell orderings compared with the linear tree method. Both the curved and the linear tree methods have the same Kendall correlation.

Discussion
For many existing TR methods, the reconstruction of a cell trajectory from scRNA-seq data is treated as a nonlinear optimization problem. These TR methods are mainly based on an MST or a KNN graph and search for an optimal solution locally. Instead of searching locally, our algorithm provides a novel approach that directly searches for a globally optimal fitting tree. We apply a stochastic tree optimization algorithm after preprocessing the data with normalization and dimension reduction. Based on a penalized likelihood, we start searching with a one-edge-tree pool and gradually move to more sophisticated tree pools by adding new nodes. Although we currently use the squared Euclidean distance between each data point and its projection on the tree, the optimization function is flexible and can be extended to other forms. As the simulation study and the two real data examples show, STS is more accurate in cell-ordering estimation and less sensitive to outliers and skewed expression distributions. Specifically, the global optimal tree search improves the estimation performance compared with local search methods. In general, STS is applicable to reconstructing both a simple bifurcation trajectory and more complicated multifurcation trajectories, with high accuracy in both trajectory estimation and cell ordering.
As other challenges introduced by complex data structures remain, there are several potential future extensions to our STS method. Firstly, the enormous tree search space limits the computational efficiency of the dynamic optimization process. Secondly, the current optimization step is sensitive to extreme outliers, although extreme outliers do not occur frequently in real data sets; to reduce this sensitivity, the current sum-of-squares objective function could be replaced by other functions. Moreover, based on the likelihood function, we could also calculate the probability of each tree in the tree pool by assigning a prior probability to the tree size, and then extend the framework to inference, such as hypothesis testing [15]. Finally, instead of the backtracking line search algorithm, we could use other optimization algorithms to reduce the sensitivity to the initial tree guess. All of these extensions could enable our approach to be adapted to more complex and noisy real data sets.

Data Availability Statement:
The data and computer code that support the findings in this paper are available at https://github.com/kkttzjy/STS. These data were derived from the following resources available in the public domain: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE98664, accessed on 31 March 2016, and https://doi.org/10.1016/j.devcel.2010.02.012, accessed on 11 February 2022.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Appendix A. Updating Vertex Embedding Location with a Closed-Form Solution
Without loss of generality, suppose there is a vertex v ∈ V and {v_1, v_2, ..., v_m} is the set of all the vertices that v connects to. Let x^k_1, ..., x^k_{n_k} ∈ R^p be the data points projected to edge e_k = (v_k, v) for k = 1, ..., m. The new embedding location y* for v can be calculated by minimizing the objective function

$$S(y) = \sum_{k=1}^{m} \sum_{t=1}^{n_k} \left\| x^k_t - \left[ (1 - l^k_t) \, y_k + l^k_t \, y \right] \right\|^2,$$

where l^k_t = l_{e_k}(x^k_t) is the mapping of φ_{e_k}(x^k_t) when mapping [y_k, y] to [0, 1]. If we assume that l^k_t is fixed even when the embedding location y changes, then setting the gradient of S(y) to zero yields the closed-form solution

$$y^* = \frac{\sum_{k=1}^{m} \sum_{t=1}^{n_k} l^k_t \left( x^k_t - (1 - l^k_t) \, y_k \right)}{\sum_{k=1}^{m} \sum_{t=1}^{n_k} (l^k_t)^2}.$$
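The closed-form update can be checked numerically. This sketch holds the coefficients l fixed, as in the derivation above; the helper name is illustrative.

```python
import numpy as np

def closed_form_update(xs, ls, yks):
    """y* = [sum l (x - (1 - l) y_k)] / [sum l^2], with l held fixed.

    xs, ls, yks are flattened over all (k, t): each data point, its fixed
    coefficient l, and the opposite vertex y_k of the edge it projects to."""
    num = np.zeros_like(np.asarray(xs[0], dtype=float))
    den = 0.0
    for x, l, yk in zip(xs, ls, yks):
        num += l * (np.asarray(x, float) - (1.0 - l) * np.asarray(yk, float))
        den += l * l
    return num / den

# two data points projected to edges that share the vertex being updated
xs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
ls = [0.5, 1.0]
yks = [np.zeros(2), np.zeros(2)]
y_star = closed_form_update(xs, ls, yks)

def S(y):
    """Objective with the coefficients l held fixed."""
    return sum(float(np.sum((x - ((1 - l) * yk + l * y)) ** 2))
               for x, l, yk in zip(xs, ls, yks))
```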

Appendix B. Extension to Curved Tree Method
Instead of assuming a linear embedding tree, we model a cell trajectory as a curved embedding tree whose nonlinear edges are bounded principal curves. The details of our stochastic curved tree searching algorithm are given in the following subsections.

Appendix B.1. Curved Embedding Tree
We first define a new class of embedding trees with nonlinear edges as G = (V, E, y, F). V is a vertex set of size |V|, and E ⊂ V × V is an edge set. y = (y_1, . . . , y_{|V|}) is the associated vertex embedding, where y_j ∈ R^p. F = {f_j}_{j=1}^{|V|−1} is the set of curved edges, modeled as principal curves bounded by their vertices. Specifically, for the j-th edge E_j = (V_{j1}, V_{j2}) with associated embedding locations y_{j1}, y_{j2}, let X_k, k = 1, . . . , n_j, be the data points projected to the j-th edge, where n_j is the total number of such points and X_k ∈ R^p. For s = 1, . . . , p, the curved j-th edge is modeled as

X_{ks} = f_{js}(λ_k) + ε_{ks},

where λ_k is the projection index for X_k. We model λ_k as the arc length along the curve from the projection to the origin and, based on this assumption, define λ_k as the pseudotime of X_k. In addition, we assume that the residuals ε_{ks} are i.i.d. N(0, σ^2). Inspired by the standard principal curve algorithm introduced by Hastie [16], we define the projection index λ_{f_j}(X_k) of the data point X_k as the value of λ for which the point f_j(λ) on the curved edge E_j is closest to X_k:

λ_{f_j}(X_k) = sup{ λ : ||X_k − f_j(λ)|| = inf_μ ||X_k − f_j(μ)|| }.
Then, consider a principal curve f_j(λ) bounded by two vertices V_{j1} and V_{j2} with associated pseudotimes λ_{j1} and λ_{j2}. For any data point projected onto this curve, the projection index is restricted to the interval (λ_{j1}, λ_{j2}). We then rescale λ to the interval (0, 1).
We can model f_{js}(λ) with any form of nonlinear function. In this paper, we assume that f_{js}(λ) is a mixture of a linear function, a quadratic function, and a cubic function:

f_{js}(λ) = g_{js}(λ) + α_{js} h_{js}(λ) + β_{js} l_{js}(λ),
where the linear part is g_{js}(λ) = y_{j1s} + λ(y_{j2s} − y_{j1s}). Here, α_{js} and β_{js} are curvature parameters that control the structural complexity of a curved tree. We define the augmented vector of curvature parameters as γ_j = (α_j, β_j), where α_j = (α_{j1}, . . . , α_{jp}) and β_j = (β_{j1}, . . . , β_{jp}). The mixture of these three functions can cover a wide range of curves.
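A curved edge of this form can be evaluated as below. Note that the exact quadratic and cubic basis functions h_{js} and l_{js} are not reproduced in this excerpt; the choices here are our own illustrative assumption, picked so that both vanish at λ = 0 and λ = 1 and the curve stays anchored at its two vertices.

```python
import numpy as np

def edge_curve(lmbda, y1, y2, alpha, beta):
    """Evaluate a curved edge f_j(lambda), lambda in [0, 1], coordinate-wise.

    y1, y2      : (p,) embedding locations of the edge's two vertices
    alpha, beta : (p,) curvature parameters of the edge
    The quadratic and cubic terms below are ASSUMED forms that vanish at
    both endpoints (the paper's exact basis may differ).
    """
    lmbda = np.asarray(lmbda)[:, None]                 # (L, 1) for broadcasting
    g = y1 + lmbda * (y2 - y1)                         # linear interpolation
    h = lmbda * (1.0 - lmbda)                          # assumed quadratic bump
    l = lmbda * (1.0 - lmbda) * (1.0 - 2.0 * lmbda)    # assumed cubic term
    return g + alpha * h + beta * l                    # (L, p) points on the curve
```

Because h and l vanish at λ = 0 and λ = 1, the curve passes exactly through y1 and y2 regardless of the curvature parameters.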

Appendix B.2. Penalized Likelihood
Suppose that we have n data points X_1, . . . , X_n in p dimensions, where X_i ∈ R^p for i = 1, . . . , n. For a curved tree G = (V, E, y, F) fitted to the data, let f_j(λ_i) denote the projection of data point X_i on the j-th edge of the curved tree. We then assume that

X_i = f_j(λ_i) + ε_i, with ε_i i.i.d. ∼ N(0, σ^2 I_p).

The likelihood function is

L(G) = Π_{i=1}^{n} (2πσ^2)^{−p/2} exp( −||X_i − f_j(λ_i)||^2 / (2σ^2) ),

where we assume that σ^2 is known or can be estimated from the data. The negative log-likelihood function is

−log L(G) = (np/2) log(2πσ^2) + (1/(2σ^2)) Σ_{i=1}^{n} ||X_i − f_j(λ_i)||^2.

Based on the previous assumptions, we propose a penalized log-likelihood with a lasso penalty on the curvature and edge shrinkage:

S(G) = Σ_{i=1}^{n} ||X_i − f_j(λ_i)||^2 + φ_1 Σ_j Σ_{s=1}^{p} (|α_{js}| + |β_{js}|) + φ_2 Σ_j ||y_{j1} − y_{j2}||^2.

The above penalized objective is the sum of squared Euclidean distances between the data points and their projections on the tree, plus two penalties. The first penalty shrinks the curvature of the curved tree to control the complexity of the tree structure; the second shrinks the edge lengths to avoid unnecessarily long edges. We use a small penalty term φ_1 = 0.01 to control the curvature and a small penalty term φ_2 = 0.01 to avoid unnecessarily long edges.
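The penalized sum-of-squares objective can be sketched as follows, a minimal illustration assuming the projections and edge lengths have already been computed (the function name and argument layout are our own).

```python
import numpy as np

def penalized_loss(X, proj, alpha, beta, edge_lengths_sq, phi1=0.01, phi2=0.01):
    """Penalized sum-of-squares objective for a fitted curved tree.

    X               : (n, p) data points
    proj            : (n, p) projections f_j(lambda_i) of each point on the tree
    alpha, beta     : (|E|, p) curvature parameters of the edges
    edge_lengths_sq : (|E|,) squared Euclidean edge lengths ||y_j1 - y_j2||^2
    """
    fit = ((X - proj) ** 2).sum()                              # residual sum of squares
    lasso = phi1 * (np.abs(alpha).sum() + np.abs(beta).sum())  # curvature shrinkage
    shrink = phi2 * edge_lengths_sq.sum()                      # edge-length shrinkage
    return fit + lasso + shrink
```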

Appendix B.3. Projection of the Data to the Tree
In order to obtain the penalized negative log-likelihood for the data, we first describe how to calculate the projection of each data point onto a given curved tree. For an embedding curved tree G = (V, E, y, F), the projection of a data point X_i onto G is defined as φ_G(X_i) = f(λ_i). We use a piecewise linear approximation to compute φ_G(X_i). In particular, we divide the projected curved edge e into L small line segments: since λ_i ∈ [0, 1], we take the grid λ_l = l/L for l = 0, . . . , L. For each line segment e_l = (f(λ_{l−1}), f(λ_l)), l = 1, . . . , L, we compute the projection φ_{e_l}(X_i) using the formula for the linear tree, and define d_{il} = ||X_i − φ_{e_l}(X_i)||^2 to measure the distance between X_i and its projection on the segment. The projection index of X_i on the curved tree is then λ_i = λ_{l*}, where l* = argmin_l d_{il}, and the corresponding approximate projection of X_i on G is f(λ_i) = f(λ_{l*}).
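The piecewise linear approximation above can be sketched as follows; this is an illustrative implementation under our own naming, where the curve is supplied as L + 1 grid points and the projection index is linearly interpolated within the winning segment.

```python
import numpy as np

def project_to_curve(x, curve_pts):
    """Approximate the projection of x onto a curved edge.

    curve_pts : (L+1, p) grid points f(lambda_l), lambda_l = l/L, l = 0..L.
    The curve is approximated by the L line segments between consecutive grid
    points; x is projected onto each segment and the closest projection wins.
    Returns (projection index, projected point).
    """
    L = curve_pts.shape[0] - 1
    best_d, best_lmbda, best_proj = np.inf, 0.0, curve_pts[0]
    for l in range(1, L + 1):
        a, b = curve_pts[l - 1], curve_pts[l]
        seg = b - a
        # unit-interval position of x along segment e_l, clamped to [0, 1]
        t = np.clip(np.dot(x - a, seg) / np.dot(seg, seg), 0.0, 1.0)
        proj = a + t * seg
        d = np.linalg.norm(x - proj)
        if d < best_d:
            best_d, best_proj = d, proj
            best_lmbda = ((l - 1) + t) / L   # interpolated projection index
    return best_lmbda, best_proj
```

For a straight curve, the routine recovers the exact orthogonal projection, so the approximation error comes only from the curvature within each segment.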

Appendix B.4. Updating Vertex Embedding Location
In the optimization process, we start with the optimal linear tree found by the STS algorithm. We assign γ_0 as the initial guess for the curvature parameters of the principal curves and obtain the initial curved tree G_0 based on the optimal linear tree. We then update the embedding locations y and the curvature parameters α and β with the line search algorithm.
Without loss of generality, suppose there is a vertex v ∈ V and let {v_1, v_2, . . . , v_m} be the set of all vertices that v connects to. Let x^k_1, . . . , x^k_{n_k} ∈ R^p be the data points projected to edge e_k = (v_k, v) for k = 1, . . . , m. The new embedding location y* for v and the new curvature parameters γ*_k for the curved edge e_k are obtained by minimizing the objective function

S(y, γ_1, . . . , γ_m) = Σ_{k=1}^{m} Σ_{t=1}^{n_k} ||x^k_t − f_k(λ^k_t)||^2 + φ_1 Σ_{k=1}^{m} Σ_{s=1}^{p} {|α_{ks}| + |β_{ks}|} + φ_2 Σ_{k=1}^{m} ||y − y_k||^2.
When we update the curved tree G with y*, γ*_k, the projections of the data points onto the new curved tree also change; hence, there are no closed-form solutions for y* and γ*_k. We therefore apply the backtracking line search algorithm to compute y* and γ*_k. Since both the embedding location y and the curvature parameters γ_k must be updated, the backtracking line search algorithm in Section 2.4 is modified to alternate between the two parts. We first update γ_k as follows. We set the initial values γ^0_k and τ^0_γ = τ_max = 1. Then, for the b-th iteration, b = 1, . . . , B, we take the following steps: compute the gradient ∂S_b/∂γ^b_k; shrink the step size τ by the factor p until the sufficient-decrease condition with constant c is satisfied; and set γ^{b+1}_k = γ^b_k − τ ∂S_b/∂γ^b_k. Here, we calculate the gradient ∂S_b/∂γ^b_k numerically with forward differences,

(∂S_b/∂γ^b_k)_i ≈ [ S(γ^b_k + ε e_i) − S(γ^b_k) ] / ε,

where e_i is the i-th standard basis vector. We use a small ε = 1 × 10^{−6} in this numerical approximation.
Once we have the final updated curvature parameters γ*_k, we update y with a similar backtracking line search. We set the initial values y^0 and τ^0_y = τ_max = 1. Then, for the b-th iteration, b = 1, . . . , B, we take the analogous steps with the gradient ∂S_b/∂y^b, again computed numerically with forward differences. We use p = 0.8 and c = 0.5 in our simulation studies. Finally, we repeat the above two backtracking line search processes until the change in the objective function falls below the tolerance.
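The alternating updates above rely on a standard backtracking line search with a numerically approximated gradient; a minimal sketch of one such update step is given below, assuming a generic scalar objective (the function names are our own, and the guard against an endless shrink is an added safety detail).

```python
import numpy as np

def numerical_grad(f, z, eps=1e-6):
    """Forward-difference gradient of a scalar function f at z."""
    g = np.zeros_like(z)
    f0 = f(z)
    for i in range(z.size):
        z_eps = z.copy()
        z_eps[i] += eps                 # perturb one coordinate
        g[i] = (f(z_eps) - f0) / eps
    return g

def backtracking_update(f, z, p=0.8, c=0.5, tau_max=1.0):
    """One backtracking line-search step on f, starting from step size tau_max.

    Shrinks tau by the factor p until the sufficient-decrease condition
        f(z - tau * g) <= f(z) - c * tau * ||g||^2
    holds, then returns the updated parameter vector.
    """
    g = numerical_grad(f, z)
    f0, tau = f(z), tau_max
    while f(z - tau * g) > f0 - c * tau * np.dot(g, g):
        tau *= p
        if tau < 1e-12:                 # guard against an endless shrink
            break
    return z - tau * g
```

In the curved tree algorithm, the same step would be applied first to the curvature parameters γ_k and then to the embedding location y, repeating until the objective stabilizes.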