Article
Peer-Review Record

Physlr: Next-Generation Physical Maps

DNA 2022, 2(2), 116-130; https://doi.org/10.3390/dna2020009
by Amirhossein Afshinfard 1,2, Shaun D. Jackman 1, Johnathan Wong 1, Lauren Coombe 1, Justin Chu 1, Vladimir Nikolic 1,2, Gokce Dilek 1, Yaman Malkoç 1, René L. Warren 1 and Inanc Birol 1,3,*
Reviewer 1: Anonymous
Reviewer 2:
Reviewer 3:
Submission received: 29 March 2022 / Revised: 2 June 2022 / Accepted: 7 June 2022 / Published: 10 June 2022

Round 1

Reviewer 1 Report

 

Physlr, by Afshinfard et al., is a nice tool that makes use of the data available in linked reads, a technology originally developed by 10X Genomics. The existence of this tool is welcome, since the Supernova assembler did not allow for much latitude when trying to assemble linked reads. I have mainly focused my comments on the methods and made suggestions where the authors could improve the description of the algorithm for the typical reader (since this is submitted to DNA and not a bioinformatics or mathematics-related journal). Importantly, beyond clarifying details of the algorithm, the authors need to spend some time describing the occurrence and frequency of misassemblies: what are the main causes, and how do misassemblies affect the presented contiguity statistics? Finally, I think the figures could be improved substantially, at a minimum by making the captions more informative; in addition, some aspects of the figures don’t seem particularly useful (e.g. Fig. 2d and aspects of Figure 4, such as the big arrows on the X and Y axes). Point-by-point comments can be found below.

 

Abstract: the abstract mentions that Physlr can be used for misassembly correction and structural variant detection, but this is not demonstrated in the manuscript, so it should not be included in the abstract.

 

Introduction:

Lines 26-27: Consider rewording or removing the topic sentence. It is a sweeping statement, that is debatable in its meaning (what about epigenetic instructions passed on from the mother, sequence contained in various organelles, etc.).

 

Line 31: consider using the phrase “long target molecule” instead of “long target sequences”, since the sequencing device is producing readouts from a molecular library, not a collection of abstract sequences.

 

Lines 32-33: “wide range of downstream studies.” Please provide specific references/examples or omit this sentence.

 

Line 47: while it is true that genome assembly via physical maps is expensive, it is disingenuous to compare this cost to the HGP cost – since the HGP included the development of the technology itself, the sequencing of earlier ‘test’ genomes, etc. Either omit this cost comparison, or find a more appropriate comparator from a later genome sequenced using this method (e.g. zebrafish or the initial stickleback genomes).

 

Lines 64-66: methods used to scaffold fragmented assemblies should also include genetic maps.

 

Materials and Methods:

From my understanding, there are three aspects of linked read sequencing that are important to explain, in order to understand the Physlr algorithm. First, each molecule has an assigned barcode and those barcodes can repeat across molecules (which the authors explain). Second, each individual molecule is not fully sequenced (sub 1x coverage, as the authors explain). Third, the whole genome is covered redundantly by many, many molecules, so even though each molecule is not fully sequenced, the collection of molecules are fully sequenced – allowing for assembly (and for Physlr to work). Making these conditions a little more clear would help in the description of the algorithm, in section 2.2. So, reads are collected together by barcode, and barcodes are collected together by the sequencing similarity of underlying reads (represented by the kmer minimizers). A figure that provides a cartoon of these relationships would be very valuable for the reader.
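The relationship summarized above (reads grouped by barcode, barcodes compared through the minimizers of their reads) can be illustrated with a minimal sketch. It is not Physlr's implementation: the function names and toy reads are hypothetical, and minimizers are chosen by plain lexicographic order rather than the hash-based ordering that production tools typically use.

```python
def minimizers(seq, k, w):
    """Return the set of (w, k)-minimizers of seq.

    Assumption for illustration: the minimizer of a window is the
    lexicographically smallest of its w consecutive k-mers.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        picked.add(min(kmers[start:start + w]))
    return picked


def barcode_sketches(reads_by_barcode, k, w):
    """Pool the minimizers of all reads that carry the same barcode."""
    sketches = {}
    for barcode, reads in reads_by_barcode.items():
        pooled = set()
        for read in reads:
            pooled |= minimizers(read, k, w)
        sketches[barcode] = pooled
    return sketches


def shared_minimizers(sketches, a, b):
    """Number of minimizers two barcodes have in common."""
    return len(sketches[a] & sketches[b])


# Hypothetical toy data: bc1 and bc2 come from overlapping sequence, bc3 does not.
reads = {
    "bc1": ["ACGTACGT"],
    "bc2": ["GTACGTTT"],
    "bc3": ["TTTTTTTT"],
}
sketches = barcode_sketches(reads, k=3, w=2)
print(shared_minimizers(sketches, "bc1", "bc2"))
print(shared_minimizers(sketches, "bc1", "bc3"))
```

In this toy setting, barcodes whose reads derive from overlapping sequence share minimizers and would be connected, while unrelated barcodes share none.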

 

Line 126: “a small radius and a large diameter”, most readers are going to assume this is related to the common definition of a circle, in which case it is an impossible relationship. The authors should take some more space to describe what these terms mean in the context of graphs.
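For context on the terminology questioned here: in graph theory, the eccentricity of a vertex is its greatest shortest-path distance to any other vertex, the radius is the smallest eccentricity, and the diameter is the largest. Unlike for a circle, the diameter of a graph is not fixed at twice the radius; it can lie anywhere between the radius and twice the radius. The sketch below computes both on two hypothetical toy graphs and assumes an unweighted, connected graph given as an adjacency dictionary.

```python
from collections import deque


def eccentricities(adj):
    """Eccentricity of every vertex, via one breadth-first search per vertex.

    Assumes an unweighted, connected, undirected graph.
    """
    ecc = {}
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        ecc[source] = max(dist.values())
    return ecc


def radius_and_diameter(adj):
    """Radius is the minimum eccentricity, diameter the maximum."""
    ecc = eccentricities(adj)
    return min(ecc.values()), max(ecc.values())


# A path of five vertices: long and thin.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
# A star with four leaves: compact.
star = {"c": ["a", "b", "d", "e"], "a": ["c"], "b": ["c"], "d": ["c"], "e": ["c"]}

print(radius_and_diameter(path))
print(radius_and_diameter(star))
```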

 

Line 135: the community detection algorithm needs to be described in the Methods. I also strongly encourage the authors to give a specific example to enable the reader to better understand the method. In the current section of the Results, the authors cite figure 2b as an example, but is this typical? (This part can stay within the Results, if desired). The authors need to provide some metrics about how often barcode vertices are easily split and how often they are confounded.

 

Line 138: Why is a maximum spanning tree the proper construct to use to find useful graph components? One could think of a number of clustering algorithms that might be employed here, why use an MST?

 

Line 139: Related: why do we expect every vertex in a MST to have only two branches?
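As background to the two questions above, a maximum spanning tree keeps, for a connected weighted graph, a set of edges that connects all vertices without cycles and has the largest possible total weight. The sketch below uses Kruskal's algorithm on a hypothetical toy graph; it is a generic illustration, not the procedure used in the manuscript. It also shows that a vertex in such a tree can have more than two neighbours in general.

```python
from collections import Counter


def maximum_spanning_tree(vertices, edges):
    """Kruskal's algorithm, taking the heaviest edges first.

    edges is a list of (weight, u, v) tuples for an undirected graph.
    """
    parent = {v: v for v in vertices}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    tree = []
    for weight, u, v in sorted(edges, reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # the edge joins two components, so no cycle is created
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree


# Hypothetical weights, e.g. counts of shared minimizers between barcodes.
vertices = ["a", "b", "c", "d"]
edges = [(5, "a", "b"), (4, "b", "c"), (3, "b", "d"),
         (2, "c", "d"), (1, "a", "c"), (1, "a", "d")]

tree = maximum_spanning_tree(vertices, edges)
degree = Counter()
for _, u, v in tree:
    degree[u] += 1
    degree[v] += 1

print(tree)
print(degree["b"])
```

In this example vertex "b" ends up with three tree edges, which is the kind of branching the second question refers to.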

 

Lines 141-142: “Physlr employs a linear-time belief propagation algorithm” – this needs to be explained to the reader, citing a reference is not sufficient.

 

Results

 

Line 211: please define NGA50 and NG50.
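For reference, these metrics are commonly defined as follows (for example in QUAST): NG50 is the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the estimated genome size, and NGA50 is the same statistic computed on aligned blocks, after sequences have been broken at misassemblies. The sketch below follows that common definition; the function name and all lengths are hypothetical and are not taken from the manuscript.

```python
def ng50(lengths, genome_size):
    """Length at which the cumulative sum of the longest sequences first
    reaches half of genome_size. Returns 0 if half is never reached."""
    target = genome_size / 2
    covered = 0
    for length in sorted(lengths, reverse=True):
        covered += length
        if covered >= target:
            return length
    return 0


genome_size = 200
scaffolds = [50, 40, 30, 20, 10]

# Hypothetical aligned blocks: the 40-unit scaffold contains one
# misassembly and is broken into blocks of 25 and 15.
aligned_blocks = [50, 25, 15, 30, 20, 10]

print(ng50(scaffolds, genome_size))        # NG50
print(ng50(aligned_blocks, genome_size))   # NGA50, same formula on aligned blocks
```

Because the blocks are broken at misassemblies before the calculation, NGA50 can never exceed NG50 for the same assembly, which is why the two are usually reported together.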

 

Line 236: what does it mean that Physlr “increased misassemblies by less than 98% on average”? Does that mean there were almost double the misassemblies when Physlr was used?

 

Line 238: If I understand correctly, for a nanopore assembly, Physlr added almost 30% more misassemblies?

 

If I am interpreting these numbers correctly, the authors should spend some time and analysis describing what is happening here and why. It does seem that in a number of cases, Physlr has fewer misassemblies than other scaffolders, but what is generating the misassemblies and how do the number of misassemblies interact with the contiguity statistics (does it invalidate them)?
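The interpretation in question can be made concrete with a short calculation. It assumes that the reported figure is a relative increase over the baseline assembly, which is the reading under discussion and is not confirmed by the manuscript; the counts are hypothetical.

```python
def percent_increase(before, after):
    """Relative increase of after over before, in percent."""
    if before <= 0:
        raise ValueError("baseline count must be positive")
    return 100.0 * (after - before) / before


# Hypothetical baseline of 100 misassemblies.
print(percent_increase(100, 198))  # a 98% increase is close to doubling
print(percent_increase(100, 129))  # a 29% increase
```

Under this reading, a 98% increase would mean that the scaffolded assembly contains nearly twice as many misassemblies as the baseline.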

 

Figure 4 is not clear. What do the large grey arrows on the X and Y axes mean, and why are the NG50/NGA50 stats represented as a vertical line for each assembly?

Author Response

We thank our Reviewer for their thorough review of our research and for providing invaluable input. We believe that the comments led to many improvements in the revised version. We have carefully considered all the comments and tried our best to address every one of them. We hope the manuscript, after careful revisions, meets your standards. Enclosed, please find responses to each comment.

Author Response File: Author Response.docx

Reviewer 2 Report

Lines 37-43: I don't think this needs to be passive, as Sanger sequencing is still used.

Line 46: No need for passive

I love Figure 1.

The wording at line 126 is confusing.

Line 141: Could a line or 2 about "linear-time belief propagation algorithm" be added so that the reader need not hunt down that reference to know what it does?

Figures 4 and 5 are solid convincing arguments for Physlr.

I would add barcode as a "keyword" in the paper (or "barcode reuse"). The re-use of the barcodes is a great idea, worthy of being in the literature. Thus it should be more discoverable.

I wonder if lines 269-271 should also be in the figure caption somehow since it is impactful.

Lines 290-295: Not needed but desired. A cartoon depiction of this sub sub graph creation. I guess like Figures S2 and S3.

The GitHub page appears well organized and thorough.

Author Response

We thank our Reviewer for their thorough review of our research and for providing invaluable input. We believe that the comments led to many improvements in the revised version. We have carefully considered all the comments and tried our best to address every one of them. We hope the manuscript, after careful revisions, meets your standards. Enclosed, please find responses to each comment.

Author Response File: Author Response.docx

Reviewer 3 Report

Afshinfard et al. describe a new pipeline for scaffolding using linked reads. They introduce a significant number of novel approaches and algorithms. They also clearly master the subject, suggesting that these tools will be a valuable addition to the genome assembly toolkit, at least for those using linked reads. The work is generally well presented, although given its highly technical nature I think several passages could be improved for clarity. Also, while the authors did several validation experiments, the claim of superior results compared to other tools is not fully convincing. I also suggest moving most of the details on the novel algorithm for community detection to the relevant methods section. These are my three major revisions that the authors should address. Additional details on these concerns and minor revisions are reported in the comments below.

general comments:

-introduction lacks background on the importance of linked-reads in the current paradigm represented by long reads, ultra long reads, hic reads. What is the advantage of using linked reads today?

-Is it normal that code availability is stated in the conclusion? A dedicated section would make it easier to find.

-ARCS is already capable of scaffolding a genome using linked reads. What is the advantage of physlr over ARCS?

-referencing to figures from results in the methods section is misleading and should be avoided

-Based on Figure 4, Physlr appears to consistently have larger NG* values but also more misassemblies than ARCS, which could explain the increased contiguity. Can the authors rule this possibility out? I think the authors should have a more critical discussion about their findings in terms of the number of misassemblies, since the structural variation they induce is critical for many downstream analyses.

-It is hard to understand what is the actual format of the physical maps provided by Physlr. Could the authors add a paragraph specifying this?

 

lines26-33: it would be good to have at least one reference, e.g. Giani 2020 https://doi.org/10.1016/j.csbj.2019.11.002

lines65-66: I suggest tuning this statement down, many pipelines now achieve reference-level continuity, e.g. https://www.nature.com/articles/s41586-021-03451-0 The reference also does not seem appropriate to support the statement

line95: Why do linked reads provide information about barcodes rather than their constituent molecules? What do the authors mean here? Why "in principle"? Does it mean this is the easiest level of information accessible? Does it relate to the sparse sequencing of the molecule? How does barcode reuse affect this (in parentheses)? Please clarify this sentence.

lines97-98: This sentence is also not extremely clear. I take that there are no "overlapping barcodes", but rather a certain level of shared similarity between the minimizers in the molecules and this information is used to connect barcodes. Please rephrase.

line100: please define communities

line100-101: I have the feeling that the phrasing is sometimes misleading, e.g. in this sentence molecules do not "constitute" a barcode, but are rather "associated" with a given barcode. Please try to improve this here and throughout the text to improve clarity. Also, in figure 2b, what do the smaller groupings represent (please clarify in the legend)? And are the edges weighted in the graph of figure 2b?

lines101-14: details are needed about this transformation (i.e. how to go from figure 2b to 2c), even when looking at the next section (lines130-135) details are minimal. I think some of the details on the novel algorithm presented in the results should be reported here. See also my comment below.

line104: outputted

line117: as far as I understand ntCard is essentially a kmer counter, although it may not generate a kmer database. Why was it chosen among kmer counters (e.g. KMC, Meryl, FastK), given that they all generate kmer histograms? What information was used from the output of ntCard for the subsequent steps? Which cardinality are you referring to and what is the cardinality used for here?

line118: is the bloom filter used to increase efficiency/speed, or only to identify high-frequency kmers? Again, phrasing appears misleading, is the Bloom filter used to filter the kmers? In general this paragraph should be rephrased to explain *why* each step is conducted (same for minimizers)

lines119-120: is this the same Bloom filter from ntHits? It seems so, but phrasing is confusing

lines129-130: what about the effect of repeats on the graph?

lines130-135: while I understand that this is described in the results, this section requires a significant leap of faith. It also makes it harder to understand what exactly constitutes the physical map.

lines132-133: associated molecules?

line139: citation or clarification needed

line143: definition for 'junctions' in this context needed

line143-144: This is interesting, as I would have expected more of such ambiguous situations. What guarantees this? Is it based on the experimental evidence? Would it depend on genome content (e.g. if a more complex genome was assessed, would there be more) and/or the number of barcodes?

line144-145: after such reassessment are all ambiguities resolved? If not, how are they treated?

line145-146: is there any guarantee that only one such tree exists? If not, how is the representative one chosen?

line150: would "assign" be a better term than "splits"? And "associated" instead of "constituent"?

lines150-151: again, since the molecule graph wasn't presented yet, it is hard to understand how such selection is made

lines154-159: it seems like the authors are using ARKS rather than ARCS (https://github.com/bcgsc/arcs), please provide specific details on how this is achieved.

line167: please specify what the stLFR setting corresponds to.

lines168-169: please specify how such calibration is achieved

line174: which mode of ARCS was used? Also, some of the software versions are reported in the text, some in Supp Table 4. Can you homogenize this, as it is hard to see if anything is missing?

line182: if I understood correctly, GRCh38 is used as a reference, therefore they will not all be misassemblies but some will be individual-specific SVs. Please rephrase.

line221: Figure 3. The axis labels seem completely off. Chr 1 is 250 Mbp, so what does 15 M stand for?

line227: please specify what baseline stands for. Only contigs straight out of the assembler?

lines235-239: how is misassembly increase computed?

lines244-245: the use of "slightly" here is misleading. There are cases where the number of misassemblies compared to ARCS is twice, and it is hundreds of additional misassemblies more than ARCS in most cases. I think this section should be more critical. Do the authors have an explanation for increase in the number of misassemblies? Do they affect the same regions? See also my general comment.

line263: Ci,m is not labelled in Supp fig 2. note also small typo in Supp Fig 3 legend (anther)

line284: Figure 5 caption should be more informative. E.g. what are the percentages in the plots? Could it be possible to highlight more misassemblies? Currently, they blend in the rest of the plot, for instance crossing ribbons could be dark lines.

lines290-302: This is arguably (also in the authors' view) one of the most important contributions of this study. Therefore, I think a schematic figure that describes this crucial step would be very valuable. I also wonder whether Supp Fig 2 wouldn't fit in Figure 2.

lines255-307: while I can understand why the authors decided to put section 3.3 at the end of the results, also based on my previous comments it seems that it would be much more appropriate in the methods section. The results could then briefly refer to such section of the methods when they are first introduced in the workflow and explain the novelty of the approach. In its current form the flow is highly disrupted.

lines300-301: based on supplementary figure 4 it seems that Physlr is very demanding in terms of memory. Could this sentence be more carefully rephrased to show this?

lines345-346: Tigmint appears limited in scope for assembly pipelines, while now several assembly tools and pipelines exist, e.g. PMID 33526886 33911273 32801147

I suggest that the authors cite this and other literature on the subject.

line361: please rephrase "comparatively low" and the conclusion in light of the other comments.

Author Response

We thank you for your thorough review of our research and for providing invaluable input. We believe that the comments led to many improvements in the revised version. We have carefully considered all the comments and tried our best to address every one of them. We hope the manuscript, after careful revisions, meets your standards. Enclosed, please find responses to each comment.

Author Response File: Author Response.docx
