Article
Peer-Review Record

Topological Data Analysis in Time Series: Temporal Filtration and Application to Single-Cell Genomics

Algorithms 2022, 15(10), 371; https://doi.org/10.3390/a15100371
by Baihan Lin 1,2,3
Reviewer 1:
Reviewer 2:
Submission received: 15 August 2022 / Revised: 4 October 2022 / Accepted: 4 October 2022 / Published: 10 October 2022

Round 1

Reviewer 1 Report

Brief summary

This manuscript proposes a new family of filtrations of longitudinal multidimensional data together with auxiliary data analysis tools, and demonstrates their application to temporal inference problems using a set of time-resolved gene expression data. The key technique, called temporal filtration, substitutes a conjunctive distance and time threshold for the conventional distance threshold for point cloud data augmented with time stamps. In addition to persistent homology, mapper constructions, and the use of witness sampling with this technique, an original set of standardized summary statistics, the normalized complexities, are proposed. These techniques are used to conduct an exploratory analysis of zebrafish embryonic development through the lens of longitudinal single-cell RNA sequencing data. The applications showcase clear improvements in the interpretability of visualizations compared with a cross-sectional approach and suggest that key events in the evolution of a biological system can be more effectively detected using normalized complexity than using Betti numbers.

General conceptual comments

This strikes me as a natural and valuable approach to the exploratory analysis of longitudinal distance/similarity data. I also want to commend the author on the strength of their exposition and motivation in Section 1, in terms both of the biological questions and of the methods and their limitations (e.g. lines 49–52, 76–78, 86–88, 100–106).

I have several issues or questions about the methods:

First, was the calculation of persistent homology considered as part of this study? I would be interested to know if the persistent features detected by "temporal persistent homology" are noticeably different from those obtained using cross-sectional persistent homology.

Second, if I understand correctly that temporal filtration is equivalent to conventional filtration using the composite norm d( (x,t), (y,s) ) = max( 1/epsilon |y-x|, 1/tau |s-t| ), then it might reinforce the explanation to also describe it this way, which would also clarify that it can be used without additional specialist software.

Third, it is excellent to be able to read and use the source code for the empirical analysis, though the code repository suffers from a lack of illustrative worked examples in the README. These, I think, are increasingly expected and needed in order for others to benefit from the tools provided.

Finally, I don't fully understand how the normalized complexity values are being or should be interpreted. Because the approach uses flag complexes, a simplex amounts to a clique, and the n-simplicial complexity is the ratio of observed to expected (n+1)-cliques in the graph. Is this not, then, a measure of clustering? The normalized complexity is described as summarizing "cliques and cavities"; the "cliques" part is clear to me but the "cavities" part is not. This also suggests that a more appropriate comparison than to Betti numbers would be to a conventional measure of clustering, extended to several longitudinal thresholds in the same way. Their interpretation as possible "fundamental building blocks" would support this comparison. I may be missing something here and would welcome clarification.

Specific comments

I have a few suggestions or requests on the details:

The citation for TDA (Carlsson, 2009) is quite old for this highly active field. Several valuable reviews, some specific to biological applications, have appeared in recent years and would provide readers with a more current reference.

Most of the figures are difficult to read on the printed page and would benefit either from having their elements increased in size or from being enlarged in size altogether, by a factor of 1.5–2 in either case.

The approach is described in Sections 4 and 5 as "parameter-free" and as "machine learning". Yet the technique relies explicitly on the tau parameter, while the study does not take what I understand to be a machine learning approach to the evaluation of methods (by partitioning the data into training and testing sets, for example). I realize that these terms are used somewhat flexibly, but I would like to better understand what the author means when using them.

Author Response

The author would like to thank the reviewer for the careful reading of our manuscript, the helpful suggestions and the overall positive evaluation. We have revised our manuscript accordingly, and aim to answer the reviewer's questions as follows.

1. First, was the calculation of persistent homology considered as part of this study? I would be interested to know if the persistent features detected by "temporal persistent homology" are noticeably different from those obtained using cross-sectional persistent homology.

Yes, the computation of persistent homology is part of the study. We have now clarified this in the text and introduced more figures showing the analytical pipeline and the persistence diagrams computed for the dataset. From the persistence diagrams, the persistent features detected by temporal persistent homology are not noticeably different from those obtained with cross-sectional persistent homology. Further study using downstream machine learning tasks could potentially pinpoint the benefits of these temporal persistent features, but that is not the main focus of this work.

2. Second, if I understand correctly that temporal filtration is equivalent to conventional filtration using the composite norm d( (x,t), (y,s) ) = max( 1/epsilon |y-x|, 1/tau |s-t| ), then it might reinforce the explanation to also describe it this way, which would also clarify that it can be used without additional specialist software.

Thank you for the helpful suggestion. We agree that this framing provides a more intuitive understanding and a useful anchor for the explanation. We have now included this interpretation in the Methods section.
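To make the reviewer's point concrete, the composite norm can be precomputed as an ordinary distance matrix and fed to any standard persistent homology package (for example, ripser.py accepts a precomputed matrix via `distance_matrix=True`). The sketch below is illustrative only; the function name is hypothetical and not the paper's code.

```python
import numpy as np

def temporal_distance_matrix(X, t, eps, tau):
    """Pairwise composite norm d((x,t),(y,s)) = max(|y-x|/eps, |s-t|/tau).

    A point pair enters the filtration at threshold 1 exactly when the
    spatial distance is below eps AND the time gap is below tau, so a
    conventional Vietoris-Rips filtration on this matrix reproduces the
    temporal filtration without specialist software.
    """
    X = np.asarray(X, dtype=float)
    t = np.asarray(t, dtype=float)
    # Euclidean distances between feature vectors, scaled by eps.
    spatial = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1) / eps
    # Absolute time gaps between stamps, scaled by tau.
    temporal = np.abs(t[:, None] - t[None, :]) / tau
    return np.maximum(spatial, temporal)
```

The resulting matrix is symmetric with a zero diagonal, so it behaves like any other precomputed metric for downstream filtration software.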

3. Third, it is excellent to be able to read and use the source code for the empirical analysis, though the code repository suffers from a lack of illustrative worked examples in the README. These, I think, are increasingly expected and needed in order for others to benefit from the tools provided.

Thank you for the suggestion. We have now included a Jupyter notebook as a tutorial that walks through worked examples.

4. Finally, I don't fully understand how the normalized complexity values are being or should be interpreted. Because the approach uses flag complexes, a simplex amounts to a clique, and the n-simplicial complexity is the ratio of observed to expected (n+1)-cliques in the graph. Is this not, then, a measure of clustering? The normalized complexity is described as summarizing "cliques and cavities"; the "cliques" part is clear to me but the "cavities" part is not. This also suggests that a more appropriate comparison than to Betti numbers would be to a conventional measure of clustering, extended to several longitudinal thresholds in the same way. Their interpretation as possible "fundamental building blocks" would support this comparison. I may be missing something here and would welcome clarification.

Thank you for the question. The motivation behind the normalized complexity is that, given the simplicial complexes of different orders from the witness sampling approach, we need to correct for the effect of sampling: the larger the sample size, the more likely higher-order simplicial complexes are to emerge. One way to correct for this amplification effect is to normalize the observed quantity by the corresponding quantity computed from a null distribution of the data. It is still a measure of clustering, but of the non-random clustering effect. Despite the connection, we are more interested here in the order, or complexity, of the clustering than in which clusters form. Comparing against conventional clustering measures would be a good direction for a follow-up study.
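The normalization described above can be sketched in a few lines: count the k-cliques of the flag complex in the observed graph, then divide by the mean count over a degree-agnostic, edge-shuffled null with the same number of nodes and edges. This is a minimal illustration with hypothetical helper names, not the paper's exact null model.

```python
import itertools
import numpy as np

def count_cliques(adj, k):
    """Count k-cliques (i.e. (k-1)-simplices of the flag complex) by brute force."""
    n = adj.shape[0]
    count = 0
    for nodes in itertools.combinations(range(n), k):
        # A clique requires every pair of its nodes to be connected.
        if all(adj[i, j] for i, j in itertools.combinations(nodes, 2)):
            count += 1
    return count

def normalized_complexity(adj, k, n_null=100, rng=None):
    """Ratio of observed k-cliques to the mean count in an edge-shuffled null.

    Values above 1 indicate more higher-order structure than expected from
    a random graph with the same node and edge counts.
    """
    rng = np.random.default_rng(rng)
    observed = count_cliques(adj, k)
    n = adj.shape[0]
    iu = np.triu_indices(n, k=1)
    edges = adj[iu].astype(bool)
    null_counts = []
    for _ in range(n_null):
        # Shuffle edge placement while preserving the total edge count.
        shuffled = np.zeros_like(adj)
        shuffled[iu] = rng.permutation(edges)
        shuffled = shuffled + shuffled.T
        null_counts.append(count_cliques(shuffled, k))
    mean_null = np.mean(null_counts)
    return observed / mean_null if mean_null > 0 else float("nan")
```

Brute-force clique enumeration is only feasible for the small witness samples used here; larger graphs would need a dedicated clique counter.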

We agree that the "cliques" part is more straightforward in our approach, while the "cavities" part requires more clarification. We have now introduced more explanation in the main text to elaborate on it. Topological cavities are typically formed and later filled in as new edges (and potentially nodes) are added. When computing persistent homology, we perform a filtration that innately tracks the formation and subsequent filling of topological cavities of different dimensions. The temporal persistent homology characterizes these cavities through the lifespans of the corresponding topological features.

5. The citation for TDA (Carlsson, 2009) is quite old for this highly active field. Several valuable reviews, some specific to biological applications, have appeared in recent years and would provide readers with a more current reference.

Thanks for the suggestion; that is a very good point. We have now included several references to recent surveys in the introduction for the readers.

6. Most of the figures are difficult to read on the printed page and would benefit either from having their elements increased in size or from being enlarged in size altogether, by a factor of 1.5–2 in either case.

Thanks for the pointer. We have now enlarged all figures and font sizes to improve legibility.

7. The approach is described in Sections 4 and 5 as "parameter-free" and as "machine learning". Yet the technique relies explicitly on the tau parameter, while the study does not take what I understand to be a machine learning approach to the evaluation of methods (by partitioning the data into training and testing sets, for example). I realize that these terms are used somewhat flexibly, but I would like to better understand what the author means when using them.

Thank you for the question. By "parameter-free", we mean that the method has no arbitrary hyper-parameters that users must set in order to perform the analysis. The parameter $\tau$, by contrast, is a user-specified parameter tied to the specific application and problem of interest. As an analogy with a prediction model: the learning rate is an arbitrary hyper-parameter, while the prediction window is a user-specified parameter relevant to the application. By "machine learning", we refer to the general goal of building a model that learns from data; topological data analysis is a class of unsupervised learning methods. The topological features identified by the process can be further applied to downstream machine learning tasks, such as hierarchical clustering of cellular lineages (Figure 11). We have added clarifications of this usage in the text.

Thank you.

 

Reviewer 2 Report

In the manuscript 'Topological Data Analysis in Time Series: Temporal Filtration and Application to Single-Cell Genomics', the author presents novel methods for analysing single cell datasets based on clique counting inspired by persistent homology. The application of TDA methods to single cell developmental data and the attempt to pick up the gastrulation phase are interesting research directions and the author poses intriguing questions such as how to approach single cell data of cells that are similar and how to include a time direction. However, I remain unconvinced of the validity of the methods proposed. Moreover, the author shows a lack of understanding of the mathematics of (persistent) homology.

 

The manuscript does not sufficiently convince me that studying cliques of cells, irrespective of the filtering parameter or homology class, is a sensible thing to do, or that it detects something interesting in the dataset it is applied to.

In order for the manuscript to be reconsidered I would like to see:

- A worked-out synthetic example in which it becomes clear what clique counting picks up in a dataset (e.g. homogeneously distributed versus clustered points).

- A UMAP plot and more info on the dataset and comparison to different methods of analysing it such as clustering.

- Higher sampling of the dataset and/or a demonstration of robustness of results to resampling and differences in preprocessing.

 

 

A resubmitted document should also manifestly display a better understanding of the methods of persistent homology.

 

The use of the English language needs to be checked thoroughly throughout, in particular the use of particles, singular/plural, and the use of phrases such as 'in another word' for 'in other words', 'in most time' for 'most of the time' etc.

 

 

I have the following line-by-line comments:

 

 

1) Line 37 'visualize them' - what is them here?

2) Line 38 'cell complexity' - should this be library complexity?

3) Line 52 and after: why does expression similarity deserve the name complexity? Is there a reason to believe gene expression similarity has something to do with interactions rather than reflecting the number of similar cells that happen to be present in the sample? If the cells, for example, display clustering into cell types, and one cell type is predominantly present in the sample, would this not give rise to a 'high complexity' group?

4) Line 73 'at that static graph' – correct this sentence.

The author points to reference 14 as introducing simplicial analysis and claims that numbers of simplices are computed. However, ref14 studies cliques that bound cavities, i.e. it counts homology classes. Ref14 moreover considers directed cliques.

 

5) Fig. 1: Font is illegibly small in places;

The fact that colours in panels 1 and 3 agree is confusing: the colours in 1 correspond to cell types whereas in 3 they correspond to individual gene expression values;

'bbostrapping' typo;

It appears later in the text like persistent homology is not computed as a part of simplicial analysis, but that it is computed directly from the Vietoris-Rips complex. Persistent homology is also not an input for the temporally filtered Mapper graph as far as I can tell. Hence this workflow appears misleading to me.

 

6) Line 100 and after:

how does algebraic geometry relate to this?

The three points mentioned are not challenges but proposed solutions (to what questions remains unclear).

 

7) Fig. 2: The caption indicates an alarming lack of understanding of persistent homology. Homology is computed by quotienting cycles by boundaries, it is not immediate from 'the formation of complexes of order n'. A 2-simplex is a filled in triangle, whereas the term loop is generally reserved for a 1-cycle that represents a hole, i.e. is not filled in.

I can recommend the author reads for example 'A roadmap for the computation of persistent homology' (Otter et al.).

 

8) Section 2.1: I would like to see an introduction to the dataset that is used e.g. biological context and previous findings.

 

9) Line 150: More motivation is needed to refer to cliques of cells as representing a notion of ecology.

 

10) End of section 2.2: Since you are dealing with data with two filtrations, it is natural to discuss multiparameter persistence as an alternative method of analysis.

 

11) Section 2.3: It would be good to at least give a summary of what (persistent) homology is and emphasize the difference between the 'holes' that persistent homology counts and cliques or complexes that are counted here. Not every cycle representing a homology class can be represented by a single simplex.

 

12) Section 2.4: What are the filter functions applied for Mapper? How does this differ from setting the time parameter as a filter function?

 

13) The author says that including every data point would take too much computing power. However, I believe the computational bottleneck for computing persistent homology lies in the computation of homology. It appears to me one can read off the Vietoris-Rips complex at various filtration thresholds directly from the distance matrix. The distance matrix also should be easily computable with many more than 100 points and without performing dimension reduction first.

It appears to me that the manner of subsampling has a large effect on the number of cliques, so I think the use of dimension reduction, subsampling, and the application of witness complexes for the latter needs to be explored more thoroughly.

 

 

14) Fig 3: Please make axis directions coherent across Fig 3 A, B and Fig 5. Panel B: the plot is not shaded according to the scale.

Why does the simplicial dimension axis only go to 6 in (B)?

 

15) Apparently the application of MDS and sampling 100 points does not detect gastrulation whereas PCA and 80 points does? Why are two different methods applied and why do they not agree?

 

16) Line 258: better than what?

 

17) Fig 4: I don't find this figure insightful and I recommend cutting it. The figure appears not to agree with what is written in lines 272 etc.

 

18) Line 275 '...organized into numerous fundamental building blocks with increasing complexity.' For example (proto-)cell types? Please elaborate on the relevance of this.

 

19) Line 279: correct 'different' to 'differentiate'

 

20) A sample of 80 is really excessively small.

 

21) I cannot find what the value of the time difference tau is set to in the end for Figs 3 and 5.

 

22) Fig 6. I enjoy the new Mapper graph with the additional temporal constraint. However, is there independent validation that the two tracks represent the mentioned cell populations?

 

23) Fig 7 These Betti numbers do not look right to me. Is this computed also within a two-dimensional space? That would explain why there is no higher homology. What happens in the cell type with very high Betti_1 that makes it stand out? Please check these calculations.

 

24) Line 333: I do not see how the proposed methods would lead to the generation of pseudo-time series.

 

 

25) Conclusions: I remain unconvinced that the manuscript has tackled any of these three challenges. It is unclear what was done with the tau parameter, rather it appears computations were done per timestep so I don't see how the methods tackle the problem of integrating the temporal direction in the analysis; the method is not scalable to 10k+ datapoints as it subsamples only 100; the manuscript contains very little interpretation of the topology as biological features other than describing cell similarity as cell ecology without appropriate motivation.

 

 

Overall, I think the questions and ideas in this manuscript are interesting but the scientific content and exposition need to be significantly improved.

Author Response

The author would like to thank the reviewer for the careful reading of our manuscript, the helpful suggestions and their interest in our ideas and questions. We have revised our manuscript significantly according to your pointers, and aim to answer the reviewer's questions and requests as follows.

1. In order for the manuscript to be reconsidered I would like to see:

- A worked-out synthetic example in which it becomes clear what clique counting picks up in a dataset (e.g. homogeneously distributed versus clustered points).

Thank you for the suggestion. We included a synthetic example that illustrates our approach. 

2. - A UMAP plot and more info on the dataset and comparison to different methods of analysing it such as clustering.

Thank you for the suggestion. We included a figure comparing dimension reduction methods and more information about the dataset.

3. - Higher sampling of the dataset and/or a demonstration of robustness of results to resampling and differences in preprocessing.

Thank you for the suggestion. We have conducted a sensitivity analysis of our results to resampling. 
 
4. A resubmitted document should also manifestly display a better understanding of the methods of persistent homology.

We have reorganized and rewritten the relevant sections, and provided more details about persistent homology.
 
5. The use of the English language needs to be checked thoroughly throughout, in particular the use of particles, singular/plural, and the use of phrases such as 'in another word' for 'in other words', 'in most time' for 'most of the time' etc.

Thank you for the pointers. We have corrected them.

6. Line 37 'visualize them' - what is them here?

Data points. We have clarified it in the text.

7. Line 38 'cell complexity' - should this be library complexity?

Thanks. We have now fixed it.

8. Line 52 and after: why does expression similarity deserve the name complexity? Is there a reason to believe gene expression similarity has something to do with interactions rather than reflecting the number of similar cells that happen to be present in the sample? If the cells, for example, display clustering into cell types, and one cell type is predominantly present in the sample, would this not give rise to a 'high complexity' group?

Thank you for the question.

Simplicial complexes are higher-dimensional generalizations of neighborhood graphs that represent cliques of data points, and in other words, a notion of ecology. The ecology does not have to consist of organisms within a physical system. In data science, where we represent biological cells by their measurements (e.g. gene expression profiles) as data points residing in high-dimensional feature spaces, the ecology can be how these data points connect to one another in the feature space. From an ecology research point of view, in order to characterize the dynamics of a community, one needs knowledge or priors regarding the causal relationships between the agents (e.g. how prey and predators interact, and in what ways). In order to parse out causal relationships, the temporal sequence of events matters. Thus, the synchrony and asynchrony of events is key to translating the feature space (represented by a similarity graph) into an ecology with directed (e.g. causal) relationships among the agents. This is why a temporal take on topological data analysis can potentially unlock the first step from finding a static representation of the overall shape of the data points to discovering event-directed representations (i.e. a temporal skeleton) of the data points.

Many of your questions reiterate open questions we raise in the paper, which we wish to engage the field in discussing and investigating together rather than answering definitively in this first work. First, we have now disclaimed this in the paper to avoid unrealistic expectations. Second, we share our preliminary take on them below:

Why does expression similarity deserve the name of complexity? To clarify, expression similarity by itself may not be a measure of complexity. However, the temporally connected higher-order co-expression structure characterized by similarity can be a useful measure of complexity. If a task requires several agents to work together at the same time, or to follow a specific sequence of actions by different agents, then it is more complex than a task that requires only a few agents or no specific sequence. The notion of similarity usually relates to clustering and thus to the separation of homogeneous groups. Extending this understanding, similarity relationships further constrained by temporal sequences would relate to functionally separated groups of homogeneous agents, and are thus potentially informative about their interactions.

Is there a reason to believe gene expression similarity has something to do with interactions rather than reflecting the number of similar cells that happen to be present in the sample? These temporally constrained gene expression similarities can reflect the number of similar cells that coexist at the same time, but are also potentially related to some level of functional interaction, as discussed above. We leave the investigation of what types of interaction are involved to future work, and welcome discussion and critique of these interpretations.

If the cells, for example, display clustering into cell types, and one cell type is predominantly present in the sample, would this not give rise to a "high complexity" group? Similar to our treatment of the time steps, in the analysis of different cell types we first group the cells by type and then draw samples of the same size (bounded by the count of the rarest cell type). As a result, the measure is not affected by the predominance of any one type in the sample.

(This also relates to question 20, which we answer together here.) We have now clarified in the main text what we mean by the notions of ecology and complexity.

9. Line 73 'at that static graph' – correct this sentence.

Thanks. We have now fixed it.

10. The author points to reference 14 as introducing simplicial analysis and claims that numbers of simplices are computed. However, ref14 studies cliques that bound cavities, i.e. it counts homology classes. Ref14 moreover considers directed cliques.

Yes, Ref. 14 studies directed cliques, but it also performs simplicial analysis, which we found relevant and worth referencing. We have also removed the word "first" to avoid overstating the claim.

11. Fig. 1: Font is illegibly small in places;

Thanks. We have now fixed it.

12. The fact that colours in panel 1 and 3 agree is confusing: the colours in 1 correspond to cell types whereas in 3 they correspond to individual gene expression values;

Thanks for the suggestion. The shared colour scheme is an aesthetic choice; we trust readers can follow this simple step without too much confusion.

13. 'bbostrapping' typo;

Thanks. We have now fixed it.

14. It appears later in the text like persistent homology is not computed as a part of simplicial analysis, but that it is computed directly from the Vietoris-Rips complex. Persistent homology is also not an input for the temporally filtered Mapper graph as far as I can tell. Hence this workflow appears misleading to me.

Thanks for the pointer. The computation of the persistent homology is part of the study. We have now clarified this confusion in our writing and introduced more figures on the analytical pipeline and the persistent diagrams computed for the dataset. 

15. Line 100 and after: how does algebraic geometry relate to this?

We have now changed "algebraic geometry" to "algebraic topology".

16. The three points mentioned are not challenges but proposed solutions (to what questions remains unclear).

Thanks for the note. We agree, and have now changed "challenge" to "solution".

17. Fig. 2: The caption indicates an alarming lack of understanding of persistent homology. Homology is computed by quotienting cycles by boundaries, it is not immediate from 'the formation of complexes of order n'. A 2-simplex is a filled in triangle, whereas the term loop is generally reserved for a 1-cycle that represents a hole, i.e. is not filled in.

We agree with this clarification and have now improved our description of persistent homology.

18. I can recommend the author reads for example 'A roadmap for the computation of persistent homology' (Otter et al.).

Thank you. It is a nice introduction.

19. Section 2.1: I would like to see an introduction to the dataset that is used e.g. biological context and previous findings.

Thanks for the suggestion. We have now added the introduction to the dataset.

20. Line 150: More motivation is needed to refer to cliques of cells as representing a notion of ecology.

Thank you for the pointer. We have now provided more clarification of what we mean by the notion of ecology, and why cliques of cells can constitute one.

21. End of section 2.2: Since you are dealing with data with two filtrations, it is natural to discuss multiparameter persistence as an alternative method of analysis.

Thank you for the suggestion. We have now included the reference and discussed it as an alternative.

22. Section 2.3: It would be good to at least give a summary of what (persistent) homology is and emphasize the difference between the 'holes' that persistent homology counts and cliques or complexes that are counted here. Not every cycle representing a homology class can be represented by a single simplex.

We have now introduced persistent homology in Section 2.3 with more description. Please also kindly note that this is an applied paper, not a theoretical one.

23. Section 2.4: What are the filter functions applied for Mapper? How does this differ from setting the time parameter as a filter function?

We have now explicitly described the filter functions for Mapper. When temporal filtration is applied, edge formation is also controlled by the additional time-delay constraint: clusters are formed with both spatial and temporal proximity, and an edge exists between two clusters only if all points in the two clusters are within the time-delay limit $\tau$ of each other. In other words, the filter function is the same as the one we apply to persistent homology, which can be a single filtration with the temporal constraint, a single filtration with the temporal composite norm, or a multi-parameter filtration.
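The temporal edge rule described above can be sketched as a small post-processing step on a Mapper graph. This is an illustrative sketch with hypothetical names, not the paper's implementation: clusters are sets of point indices (one per Mapper node), and an edge survives only if the clusters share points and every cross-cluster pair of points is within $\tau$ in time.

```python
def temporal_mapper_edges(clusters, times, tau):
    """Keep Mapper edges that satisfy the time-delay constraint.

    clusters: list of sets of point indices, one set per Mapper node.
    times:    time stamp of each data point, indexable by point index.
    tau:      time-delay limit.
    """
    edges = []
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            if not clusters[a] & clusters[b]:
                continue  # standard Mapper rule: edges need shared points
            # Largest time gap over all cross-cluster point pairs.
            gap = max(abs(times[i] - times[j])
                      for i in clusters[a] for j in clusters[b])
            if gap <= tau:
                edges.append((a, b))
    return edges
```

A small tau splits the graph into temporally local tracks, while a large tau recovers the conventional Mapper connectivity.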

24. The author says that including every data point would take too much computing power. However, I believe the computational bottleneck for computing persistent homology lies in the computation of homology. It appears to me one can read off the Vietoris-Rips complex at various filtration thresholds directly from the distance matrix. The distance matrix also should be easily computable with many more than 100 points and without performing dimension reduction first.

You are correct from that perspective. However, as we pointed out, the computational bottleneck is only part of the story. The other bound on the sampling size is the number of samples at each time step. Taking our dataset as an example, the data collected at the 12 time steps are highly imbalanced: 1 (2225 data points), 2 (200), 3 (1158), 4 (1467), 5 (5716), 6 (1026), 7 (4101), 8 (6178), 9 (5442), 10 (5200), 11 (1614) and 12 (4404). For each time point, we can sample at most 200 data points, the lowest count among all time points.
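The balanced sampling scheme described above is simple to express in code. This is a minimal sketch with a hypothetical helper name, not the paper's exact code: the per-step sample size is the count of the rarest time step, so simplex counts remain comparable across steps.

```python
import numpy as np

def balanced_witness_sample(labels, rng=None):
    """Sample the same number of point indices from every time step.

    labels: time-step label of each data point.
    Returns a dict mapping each time step to an index array of equal size,
    where the size is the count of the rarest time step.
    """
    rng = np.random.default_rng(rng)
    labels = np.asarray(labels)
    steps, counts = np.unique(labels, return_counts=True)
    size = counts.min()  # bounded by the least-sampled time step
    return {s: rng.choice(np.flatnonzero(labels == s), size=size,
                          replace=False)
            for s in steps}
```

With the time-step counts quoted above, every step would be subsampled to 200 points, the size of time step 2.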

(This anticipates the answer to a later question, but we reiterate it here.) Using different numbers of data points is fine for persistent homology and Mapper visualization, but not if we compute the number of simplicial complexes to use as a comparable summary statistic. If one time point has only 3 data points, we obviously cannot obtain any 5-simplex, while a different time point with 10,000 data points is far more likely to contain one.

25. It appears to me that the manner of subsampling has a large effect on the number of cliques, so I think the use of dimension reduction, subsampling, and the application of witness complexes for the latter needs to be explored more thoroughly.

Yes, we fully agree. Given the simplicial complexes of different orders from the witness sampling approach, we need to correct for the effect of sampling: the larger the sample size, the more likely higher-order simplicial complexes are to emerge. One way to correct for this amplification effect is to normalize the quantity by the corresponding quantity computed from a null distribution of the data.

In addition, we followed your suggestion and explored these effects. To demonstrate the sensitivity of persistent homology to sampling size and reduced dimensions, we performed the following experiment. We used the full dimensions of the standard-scaled dataset, varied the sampling size over 50, 100, 500 and 1,000 data points, and computed the persistence diagrams. We then fixed the sample size at 1,000, varied the PCA dimensions over the first 2, 10 and 103 (full) dimensions, and computed the persistence diagrams. We observed no clear difference. Then, we performed the simplicial analysis with witness sampling using sample sizes of 10, 100, 200 and 300. In this case, we observed slightly higher numbers of higher-order simplices at larger sample sizes, but the overall shape and the distinction between the time steps were maintained. Future studies can investigate strategies for increasing the stability of the simplicial analysis with respect to sample size.

26. Fig 3: Please make the axes directions coherent across Fig 3A, B and Fig 5. Panel B: the plot is not shaded according to the scale.

Thank you for the suggestion. These are 3D plots, so we rotated each to the angle that conveys the most information to readers of the printed page. The color bar is generated by the default seaborn settings and should therefore be shaded according to the scale.

27. Why does the simplicial dimension axis only go to 6 in (B)?

This is because our computation cannot handle cases where the number of simplices in the null distribution is zero. In such cases, the normalized quantity is undefined (NaN) and thus not shown.

28. Apparently the application of MDS and sampling 100 points does not detect gastrulation, whereas PCA and 80 points does? Why are two different methods applied, and why do they not agree?

They agree with each other; the only difference is that they are plotted on different scales. In the previous version, these two plots used different dimension-reduction methods and sample sizes to showcase the flexibility of the approach across low-dimensional embeddings and sample sizes. Given the feedback that this caused confusion rather than appreciation, we have removed this analysis.

29. Line 258: better than what?

Better than Betti numbers.
 
30. Fig 4: I don't find this figure insightful and I recommend cutting it. The figure appears not to agree with what is written in lines 272 etc.

Thanks for the suggestion. We have decided to keep it because other reviewers and readers found it insightful.

We have hedged its description at line 272 by removing the phrase "overall above-null" and adding the word "increasing".

31. Line 275 '...organized into numerous fundamental building blocks with increasing complexity.' For example (proto-)cell types? Please elaborate on the relevance of this.
 
This is a speculative statement, hence the word "might". We appreciate your pointer connecting it to the concept of proto-cell types, but we did not wish to stretch to that interpretation, because we have not found clear evidence at this stage.

32. Line 279: ' different' correct to differentiate

We meant "different" low-dimensional embeddings, not "differentiate". We have removed this unnecessary analysis to avoid confusion.

33. A sample of 80 is really excessively small.

We have now used 200 throughout, the maximum possible.

34. I cannot find what the value of the time difference tau is set to in the end for Figs 3 and 5.

We have now included these details in the results.

35. Fig 6. I enjoy the new Mapper graph with the additional temporal constraint. However, is there independent validation that the two tracks represent the mentioned cell populations?

Thank you. Since this is a computational paper, we do not have the experimental means to validate it, but it does warrant interesting future experimental work.
 
36. Fig 7 These Betti numbers do not look right to me. Is this computed also within a two-dimensional space? That would explain why there is no higher homology. What happens in the cell type with very high Betti_1 that makes it stand out? Please check these calculations.

Thank you for the note. We have double-checked our calculation and it appears to be correct.

37. Line 333: I do not see how the proposed methods would lead to the generation of pseudo-time series.

The proposed methods provide useful summary statistics, from which pseudo-time series can be reconstructed. Reconstructing pseudo-time series is not a focus of our work, but that does not mean the methods cannot be applied to generate them; this would be a logical next step.

38. Conclusions: I remain unconvinced that the manuscript has tackled any of these three challenges. It is unclear what was done with the tau parameter, rather it appears computations were done per timestep so I don't see how the methods tackle the problem of integrating the temporal direction in the analysis; the method is not scalable to 10k+ datapoints as it subsamples only 100; the manuscript contains very little interpretation of the topology as biological features other than describing cell similarity as cell ecology without appropriate motivation.

Thank you for the question. Since this is an important point, we aim to answer the three challenges directly, as in our conclusion section:

i. A lack of time-series analytical methods in quantifying the underlying temporal skeleton within the manifold of the similarities among data points

In persistent homology and mapper visualization, our temporal filtration uses a user-specified time-separation parameter $\tau$, which can be either discrete (consecutive time steps) or continuous (a time-delay quantity). This restricts the computation of persistent components to data points that are temporally proximal, and thus provides a temporal skeleton representation. In the simplicial analysis, we can group the data points by time step and compute the normalized simplicial complexity as a quantity that informs the ecology of cells in the transcriptomic feature space.
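The conjunctive distance-and-time thresholding can be illustrated with the following toy sketch of the filtration's 1-skeleton (names and structure are ours, not the paper's code):

```python
import numpy as np

def temporal_rips_edges(X, t, eps, tau):
    """1-skeleton of the temporal filtration: connect two points only if
    they are close in feature space (distance <= eps) AND temporally
    proximal (|t_i - t_j| <= tau)."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    dt = np.abs(t[:, None] - t[None, :])
    i, j = np.nonzero((D <= eps) & (dt <= tau))
    return [(int(a), int(b)) for a, b in zip(i, j) if a < b]
```

Sweeping `eps` with `tau` fixed yields the temporal filtration; setting `tau` to infinity recovers the conventional Vietoris-Rips 1-skeleton.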

ii. A lack of scalable computational methods to characterize single-cell sequence signals at the scale of 10k+ data points, while single-cell sequencing data have come to dominate bioinformatics in recent years

Witness sampling and dimension reduction enable the computation of persistent homology on large numbers of high-dimensional data points. Sampling is also a required step for comparing topological features across groups of data points with different counts. Normalization against a null distribution of the data sample partly corrects for the amplification effect on higher-order topological quantities. Dimension-reduction techniques such as PCA help with data management and computation without a significant loss of performance.
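One common way to choose landmarks for a witness construction is greedy maxmin selection; the sketch below is a generic illustration of that heuristic under our own naming, not the exact sampler used in the paper:

```python
import numpy as np

def maxmin_landmarks(X, n_landmarks, seed=0):
    """Greedy maxmin landmark selection: start from a random seed point,
    then repeatedly add the point farthest from the current landmark set.
    The landmarks cover the cloud with far fewer points than the data."""
    rng = np.random.default_rng(seed)
    idx = [int(rng.integers(len(X)))]
    d = np.linalg.norm(X - X[idx[0]], axis=1)
    for _ in range(n_landmarks - 1):
        nxt = int(np.argmax(d))
        idx.append(nxt)
        # Keep, for every point, its distance to the nearest landmark.
        d = np.minimum(d, np.linalg.norm(X - X[nxt], axis=1))
    return np.array(idx)
```

The remaining points then act as witnesses, so the simplicial construction scales with the number of landmarks rather than the full 10k+ cells.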

iii. A lack of insight and interpretation that connects the mathematical language of algebraic topology to the physical references to the biological phenomena.

In the introduction and discussion, we initiate a discussion of the interpretations of the topological properties. More specifically, we point out how the temporally directed relationships among data points can be related to functionally separate groups of homogeneous agents in the feature space, and are thus potentially informative about their interactions. With our temporally directed treatment of filtration and grouping techniques, our study is a small but first step toward using topological data analysis not only as a descriptive tool for static manifolds, but also, in the future, as a discovery tool for dynamic or mechanistic components. Our goal in this work is not to fully answer the question of how to interpret the biological insights of topological properties, but to further motivate and facilitate our understanding of that question. As more topological data analysis techniques are applied to biological problems, we wish to encourage discussion and critique from the biology and machine learning research communities.

In summary, the goal of our work is not to fully solve these three challenges, but to provide tools that help us better understand and partly tackle these open questions.

Lastly, we would like to thank the reviewer for taking the time to read our manuscript and write a very detailed and thoughtful review, with many valuable suggestions. We have learned a lot by working through these questions. Guided by your pointers, the manuscript has improved in various aspects. We hope that our manuscript has clarified all the questions the reviewer is interested in.

 

Reviewer 3 Report

This is an interesting work exploring the use of techniques based on TDA.

I suggest including a diagram of the entire process. You talk about the use of algorithms such as PCA and MDS, but their contribution looks irrelevant.

You use only one figure to explain everything about TDA, and many details lose relevance. The large amount of work and your contribution are evident, but you need to explain the details; perhaps a flow diagram can help.

Some definitions are hard to understand; for example, line 50 relies on definitions given in line 39.

The discussion and conclusions help the reader understand your results after a dense text.

Author Response

The author would like to thank the reviewer for the careful reading of our manuscript, the helpful suggestions and the interest in our work. We have revised our manuscript significantly according to your pointers, and we answer the reviewer's questions and requests as follows.

1. I suggest including a diagram of the entire process. You talk about the use of algorithms such as PCA and MDS, but their contribution looks irrelevant.

Thank you for the suggestion. We have now included several additional diagrams to illustrate the full process. Dimension reduction is a useful step before the filtration. Due to the "curse of dimensionality", data points in a very high-dimensional space can be very sparse, and the distances between them usually collapse toward a constant, i.e., the points effectively reside on a hypersphere. As a result, the filtration computation around them can be ineffective and unstable. Mapping the points onto a low-dimensional space can partly solve this issue. We have now included these details.
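The distance-concentration effect described above is easy to verify numerically; the following toy experiment (our own illustration, not from the paper) measures the relative spread of pairwise distances for random Gaussian points as the dimension grows:

```python
import numpy as np

def distance_spread(dim, n=300, seed=0):
    """Relative spread (std/mean) of pairwise distances among n random
    Gaussian points in `dim` dimensions; it shrinks as dim grows, so a
    distance-based filtration loses dynamic range in high dimensions."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, dim))
    sq = (X ** 2).sum(axis=1)
    # Squared distances via the Gram matrix; clip tiny negatives from
    # floating-point round-off before taking the square root.
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    d = np.sqrt(D2[np.triu_indices(n, k=1)])
    return d.std() / d.mean()
```

In low dimensions the spread is large, while in hundreds of dimensions it is close to zero, which is why we reduce the dimension before computing the filtration.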

2. You use only one figure to explain everything about TDA, and many details lose relevance. The large amount of work and your contribution are evident, but you need to explain the details; perhaps a flow diagram can help.

Thank you for the suggestion. We have now included several additional diagrams to illustrate them better. We have also reorganized and rewritten the method sections to provide more details.

3. Some definitions are hard to understand; for example, line 50 relies on definitions given in line 39.

Thank you for the pointer. We have now provided necessary definitions in line 39.

4. Discussion and conclusions help to understand your results after a dense text.

Thank you.

 

Round 2

Reviewer 2 Report

The article has improved but there are still some points that need to be corrected.

 

 

  1. The descriptions of homology in the caption of Figure 2 and line 265 currently read as incorrect; please make sure this is written well:

 

“Hn indicates the n-th homology group, i.e. the formation of the simplex complexes of order n, with 0-simplex to be the nodes (or clusters), 1-simplex to be the edges between two nodes, 2-simplex to be the loops (or triangles in this case), 3-simplex to be the tetrahedrons and so on.”

 

 

Homology counts the number of essentially different cycles – linear combinations of simplices that form a cycle (for example a loop formed by a sequence of edges) – that are not the boundary of something that can fill in the hole (for example a combination of 2D simplices or triangles spanning the inside of the loop).

 

Replace the description of homology in both these places with something along the lines of your description in line 489, and make it clear that there is an essential difference between counting the number of simplices and counting homology groups. In particular, homology groups can be filled in as the nerve balls grow, whereas simplices will never disappear.

 

 

 

  2. There are still many errors in English language and style. Please have the text checked carefully by a native speaker, e.g.

Abstract:

...we can describe..., and analyze..”

“We propose single-cell..” (article inappropriate)

line 26 “the idea of persistence, which extracts..”

line 31 etc: It is bad style to start sentences with a reference.

 

3) The generally used term is persistence diagram rather than persistent diagram.

 

Author Response

We would like to thank the reviewer again for the careful reading and suggestions. We have now fixed all of these points in the current version.
