MDPI - Publisher of Open Access Journals

22 pages, 3968 KiB

Open AccessEditor’s ChoiceArticle

Recommendations for Uniform Variant Calling of SARS-CoV-2 Genome Sequence across Bioinformatic Workflows

by Ryan Connor, Migun Shakya, David A. Yarmosh, Wolfgang Maier, Ross Martin, Rebecca Bradford, J. Rodney Brister, Patrick S. G. Chain, Courtney A. Copeland, Julia di Iulio, Bin Hu, Philip Ebert, Jonathan Gunti, Yumi Jin, Kenneth S. Katz, Andrey Kochergin, Tré LaRosa, Jiani Li, Po-E Li, Chien-Chi Lo, Sujatha Rashid, Evguenia S. Maiorova, Chunlin Xiao, Vadim Zalunin, Lisa Purcell and Kim D. Pruitt Show full author list Hide full author list

Viruses 2024, 16(3), 430; https://doi.org/10.3390/v16030430 - 11 Mar 2024

Cited by 2 | Viewed by 15651

Abstract

Genomic sequencing of clinical samples to identify emerging variants of SARS-CoV-2 has been a key public health tool for curbing the spread of the virus. As a result, an unprecedented number of SARS-CoV-2 genomes were sequenced during the COVID-19 pandemic, which allowed for rapid identification of genetic variants, enabling the timely design and testing of therapies and deployment of new vaccine formulations to combat the new variants. However, despite the technological advances of deep sequencing, the analysis of the raw sequence data generated globally is neither standardized nor consistent, leading to vastly disparate sequences that may impact identification of variants. Here, we show that for both Illumina and Oxford Nanopore sequencing platforms, downstream bioinformatic protocols used by industry, government, and academic groups resulted in different virus sequences from same sample. These bioinformatic workflows produced consensus genomes with differences in single nucleotide polymorphisms, inclusion and exclusion of insertions, and/or deletions, despite using the same raw sequence as input datasets. Here, we compared and characterized such discrepancies and propose a specific suite of parameters and protocols that should be adopted across the field. Consistent results from bioinformatic workflows are fundamental to SARS-CoV-2 and future pathogen surveillance efforts, including pandemic preparation, to allow for a data-driven and timely public health response. Full article

(This article belongs to the Special Issue Molecular and Computational Studies in Epidemiology and Evolution of Pandemic Virus)

► Show Figures

Figure 1

17 pages, 2120 KiB

Open AccessArticle

NCBI’s Virus Discovery Codeathon: Building “FIVE” —The Federated Index of Viral Experiments API Index

by Joan Martí-Carreras, Alejandro Rafael Gener, Sierra D. Miller, Anderson F. Brito, Christiam E. Camacho, Ryan Connor, Ward Deboutte, Cody Glickman, David M. Kristensen, Wynn K. Meyer, Sejal Modha, Alexis L. Norris, Surya Saha, Anna K. Belford, Evan Biederstedt, James Rodney Brister, Jan P. Buchmann, Nicholas P. Cooley, Robert A. Edwards, Kiran Javkar, Michael Muchow, Harihara Subrahmaniam Muralidharan, Charles Pepe-Ranney, Nidhi Shah, Migun Shakya, Michael J. Tisza, Benjamin J. Tully, Bert Vanmechelen, Valerie C. Virta, JL Weissman, Vadim Zalunin, Alexandre Efremov and Ben Busby Show full author list Hide full author list

Viruses 2020, 12(12), 1424; https://doi.org/10.3390/v12121424 - 10 Dec 2020

Cited by 3 | Viewed by 6132

Abstract

Viruses represent important test cases for data federation due to their genome size and the rapid increase in sequence data in publicly available databases. However, some consequences of previously decentralized (unfederated) data are lack of consensus or comparisons between feature annotations. Unifying or displaying alternative annotations should be a priority both for communities with robust entry representation and for nascent communities with burgeoning data sources. To this end, during this three-day continuation of the Virus Hunting Toolkit codeathon series (VHT-2), a new integrated and federated viral index was elaborated. This Federated Index of Viral Experiments (FIVE) integrates pre-existing and novel functional and taxonomy annotations and virus–host pairings. Variability in the context of viral genomic diversity is often overlooked in virus databases. As a proof-of-concept, FIVE was the first attempt to include viral genome variation for HIV, the most well-studied human pathogen, through viral genome diversity graphs. As per the publication of this manuscript, FIVE is the first implementation of a virus-specific federated index of such scope. FIVE is coded in BigQuery for optimal access of large quantities of data and is publicly accessible. Many projects of database or index federation fail to provide easier alternatives to access or query information. To this end, a Python API query system was developed to enhance the accessibility of FIVE. Full article

(This article belongs to the Special Issue Virus Bioinformatics 2020)

► Show Figures

Figure 1

18 pages, 3614 KiB

Open AccessArticle

NCBI’s Virus Discovery Hackathon: Engaging Research Communities to Identify Cloud Infrastructure Requirements

by Ryan Connor, Rodney Brister, Jan P. Buchmann, Ward Deboutte, Rob Edwards, Joan Martí-Carreras, Mike Tisza, Vadim Zalunin, Juan Andrade-Martínez, Adrian Cantu, Michael D’Amour, Alexandre Efremov, Lydia Fleischmann, Laura Forero-Junco, Sanzhima Garmaeva, Melissa Giluso, Cody Glickman, Margaret Henderson, Benjamin Kellman, David Kristensen, Carl Leubsdorf, Kyle Levi, Shane Levi, Suman Pakala, Vikas Peddu, Alise Ponsero, Eldred Ribeiro, Farrah Roy, Lindsay Rutter, Surya Saha, Migun Shakya, Ryan Shean, Matthew Miller, Benjamin Tully, Christopher Turkington, Ken Youens-Clark, Bert Vanmechelen and Ben Busby Show full author list Hide full author list

Genes 2019, 10(9), 714; https://doi.org/10.3390/genes10090714 - 16 Sep 2019

Cited by 9 | Viewed by 9159

Abstract

A wealth of viral data sits untapped in publicly available metagenomic data sets when it might be extracted to create a usable index for the virological research community. We hypothesized that work of this complexity and scale could be done in a hackathon setting. Ten teams comprised of over 40 participants from six countries, assembled to create a crowd-sourced set of analysis and processing pipelines for a complex biological data set in a three-day event on the San Diego State University campus starting 9 January 2019. Prior to the hackathon, 141,676 metagenomic data sets from the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) were pre-assembled into contiguous assemblies (contigs) by NCBI staff. During the hackathon, a subset consisting of 2953 SRA data sets (approximately 55 million contigs) was selected, which were further filtered for a minimal length of 1 kb. This resulted in 4.2 million (Mio) contigs, which were aligned using BLAST against all known virus genomes, phylogenetically clustered and assigned metadata. Out of the 4.2 Mio contigs, 360,000 contigs were labeled with domains and an additional subset containing 4400 contigs was screened for virus or virus-like genes. The work yielded valuable insights into both SRA data and the cloud infrastructure required to support such efforts, revealing analysis bottlenecks and possible workarounds thereof. Mainly: (i) Conservative assemblies of SRA data improves initial analysis steps; (ii) existing bioinformatic software with weak multithreading/multicore support can be elevated by wrapper scripts to use all cores within a computing node; (iii) redesigning existing bioinformatic algorithms for a cloud infrastructure to facilitate its use for a wider audience; and (iv) a cloud infrastructure allows a diverse group of researchers to collaborate effectively. The scientific findings will be extended during a follow-up event. Here, we present the applied workflows, initial results, and lessons learned from the hackathon. Full article

(This article belongs to the Special Issue Viral Diagnostics Using Next-Generation Sequencing)

► Show Figures

Figure 1

Search Results (3)

Further Information

Guidelines

MDPI Initiatives

Follow MDPI

Saved Queries

Search Filter Reset All

Years

Feature Papers

Subjects

Journals

Article Types

Countries / Regions

Search Results (3)

Further Information

Guidelines

MDPI Initiatives

Follow MDPI