NCBI’s Virus Discovery Codeathon: Building “FIVE” —The Federated Index of Viral Experiments API Index
Round 1
Reviewer 1 Report
I tried to follow the authors' logic, but it was not easy.
To give some examples:
Section with Figure 1: "... recognition of protein domains."
If the recognition is done to classify metagenomic contigs as viral, why compare RPS-BLAST with Mash, which was demonstrated to have poor prediction accuracy?
It would be better to compare RPS-BLAST performance against the tools that specialize in viral sequence recognition; there are several (e.g., VirFind, VirSorter).
RPS-BLAST performance (run time) would be much better if, instead of a six-frame translation, the genes in the contigs were predicted first and RPS-BLAST were then run on the predicted proteins instead of RPS-tBLASTn.
Fig. 2 is missing
Fig. 3 is not very informative; the same information is presented twice.
Given the poor potential of Mash to solve these tasks, Figure 4 is of very low interest.
The meaning of Fig. 5 is impossible to understand.
For instance, the legend says "Node color corresponds to taxonomic distribution of k-mer," yet not a single color is explained.
Fig. 6 would be better placed in the Supplementary Materials; it is hard to understand without many additional clarifications of the field names.
Reviewer 2 Report
Fine for me.
This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.
Round 1
Reviewer 1 Report
I do not see that the authors have reached their goal of creating a virtual (federated) database of viral sequences with uniform functional and taxonomic annotation.
There are no statistics characterizing the structure, taxonomy, or number of entries in the database.
I also did not find a clear and logical description of the pipelines that were supposed to be in place to process a large number of contigs, on the order of 10 million, reassembled from 2,953 SRA entries.
Several important sections are poorly written, particularly the section on the domain search pipeline (lines 201-213).
The sequence of steps in this section is not well justified. The pipeline, as implemented, is split into two independent methods (Fig. 1); one of them (Mash-based) was eventually found to be substantially less sensitive than the other (RPS-tBLASTn-based), with an average recall of only 15.3%. This makes that method unusable in practice unless its precision is close to 100%; however, this second important measure, precision, is not mentioned at all.
The section does not make precise statements; for instance: "Recall percentages were calculated per dataset by dividing the unique viral CDD's identified by RPS-tBLASTn by the unique viral CDD's identified by the Mash pipeline." (lines 205-207)
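To illustrate the imprecision being criticized: as quoted, the manuscript's "recall" is a ratio of two set sizes, not the standard recall computed against a reference set. A minimal sketch, using hypothetical CDD accession sets (the accessions and set contents are invented for illustration), contrasts the quoted ratio with conventional recall and precision:

```python
# Hypothetical sets of unique viral CDD accessions per dataset
# (illustrative values only, not taken from the manuscript).
rps_tblastn_cdds = {"cd00009", "pfam00078", "cd01647"}
mash_cdds = {"cd00009", "pfam00078", "cd01647", "cd00304", "pfam02123"}

# "Recall" as quoted in lines 205-207: one set size divided by the other.
recall_as_quoted = len(rps_tblastn_cdds) / len(mash_cdds)

# A standard definition treats one set as the reference and counts
# the intersection -- the kind of precise statement the review asks for.
true_positives = rps_tblastn_cdds & mash_cdds
recall = len(true_positives) / len(mash_cdds)          # reference: Mash hits
precision = len(true_positives) / len(rps_tblastn_cdds)

print(recall_as_quoted, recall, precision)  # 0.6 0.6 1.0
```

Note that the two quantities coincide only when one set happens to be a subset of the other; in general the quoted ratio can even exceed 1, which is why reporting precision alongside a properly defined recall matters.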
Line 250: construction of whole-genome HMMs is not a well-defined concept, and the design method (which requires multiple alignments of viral genomes) has to be explained before introducing a database of whole-genome HMMs. Such a construct may be possible for viruses with short genomes but may not be a useful model for viruses with long genomes.
Lines 265-271: another critical section, on the taxonomy-domain-integration pipeline, is not well written; it is impossible to reconstruct the logic or to see the justification for the steps taken.
Only fragments of the database were built, such as the virus graphs, a concept that was not clearly explained, particularly with respect to its usefulness.
Lines 530-535: was the database actually built? There is no link to a full-size database, which seems to be left for future work.
Reviewer 2 Report
In this article, Carreras et al. describe the work performed during the Virus Hunting Toolkit codeathon, during which FIVE, a federated viral database, was elaborated. The initiative is particularly outstanding and needs to be promoted. The article is well written and describes what has been done, providing open access to the developed methods via a GitHub repository.
This reviewer considers that the article must be taken as it is, i.e., as the description of work that has been done and cannot be done again.
As a virologist, some aspects of what was attempted still remain totally obscure to me (Section 3.4, virus graph, Fig. 5), but a detailed explanation is probably beyond the scope of this article. Perhaps citing some references would help virologists understand the rationale of the virus graph working group.
Minor comments:
P14, L513: "searching potential" instead of "searaching potential".
P15, L556: "cloud computing" instead of "could computing".