Simulated High Throughput Sequencing Datasets: A Crucial Tool for Validating Bioinformatic Pathogen Detection Pipelines
Simple Summary
Abstract
1. Introduction
2. Types of Datasets
3. Existing HTS Simulators for Benchmarking Bioinformatic Pipelines for Pathogen Detection
Simulator Name | Taxonomic Profiling (Metagenome) | Platform Errors | Genomic Variants | Sequencing Platform | Quality Score | Ref. | Latest Release Date [M/D/Y] |
---|---|---|---|---|---|---|---|
ART | It is designed for single genomes but can be adapted for metagenomics. | Yes, it simulates substitution and INDEL (insertion-deletion) errors. | Yes, with VarSim | Illumina | Yes | [32] | 6/5/2016 |
BEAR | Yes, specifically designed for metagenomics. | Yes, emulates characteristics from real data. | Not specified, but designed to work with metagenomic datasets that inherently contain variations. | Ion Torrent, 454, Illumina | Yes | [42] | 5/8/2020 |
CAMISIM | Yes, can model different microbial abundance profiles, multi-sample time series, and differential abundance studies. | Yes, offers flexibility in simulating various error profiles. | Yes, includes real and simulated strain-level diversity. | Illumina, PacBio, Oxford Nanopore | Yes | [34] | 1/4/2022 |
CuReSim | No. | Yes, allows adjustments to error distribution along reads. | Yes, can introduce insertions, deletions, and substitutions at a controlled rate. | Ion Torrent | Yes | [35] | 6/24/2015 |
FASTQSim | Does not allow profile input, but it has been used for metagenome simulations. | Yes, designed to be platform-independent and simulate various NGS datasets. | Yes. | Platform-independent | Yes | [33] | 11/15/2016 |
Grinder | Yes, can simulate metagenomic data. User-defined profile or inferred from real HTS runs. | Yes, provides options for uniform, linear, and polynomial error models. | Not explicitly specified. | Sanger, 454, Illumina | Yes | [36] | 11/27/2016 |
metaSPARSim | Yes, specifically designed for 16S rRNA gene sequencing data. | Yes, utilizes a Multivariate Hypergeometric distribution to model sequencing and simulate realistic sparsity and compositionality. | Not explicitly specified. | Not specified | Not specified | [39] | 12/1/2020 |
MetaSim | Yes, explicitly designed for simulating metagenomic data. | Yes, supports user-defined parametric error models. | Not explicitly specified. | 454, Illumina | No | [37] | 10/8/2008 |
NanoSim | Yes, a metagenomic simulation option has been added. | Yes. | Not explicitly specified. | Nanopore | No | [41] | 8/16/2024 |
NeSSM | Yes, designed for metagenomic sequencing simulation. | Yes, incorporates sequencing error models based on the distribution of errors at each base and coverage bias. | Not explicitly specified. | 454, Illumina | Yes | [38] | 8/18/2024 |
nfcore-ReadSimulator | Yes. | Yes, from ART and capsim. | Not explicitly specified. | Illumina, PacBio | | [43] | 4/26/2024 |
ReadSim | Not explicitly specified. | Yes. | Not explicitly specified. | Nanopore, PacBio | Yes | [44] | 12/1/2014 |
ReSeq | Yes. | Yes. | Yes. | Illumina, BGI | Yes | [40] | 12/1/2020 |
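To make the columns of the table concrete, the sketch below shows in miniature what a read simulator does: it draws fixed-length reads from a target and a background genome at a chosen target fraction and applies a uniform substitution error rate. It is a toy illustration only, not the algorithm of ART or of any other simulator listed above. The function names (`simulate_reads`, `simulate_mixture`), the uniform error model, and the random example sequences are all assumptions made for illustration.

```python
import random

BASES = "ACGT"


def simulate_reads(genome, n_reads, read_len, sub_rate, rng):
    """Draw n_reads substrings of length read_len and add substitution errors."""
    if len(genome) < read_len:
        raise ValueError("genome is shorter than the requested read length")
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = list(genome[start:start + read_len])
        for i, base in enumerate(read):
            # Uniform substitution model (assumption): each base is replaced
            # by a different base with probability sub_rate.
            if rng.random() < sub_rate:
                read[i] = rng.choice([b for b in BASES if b != base])
        reads.append("".join(read))
    return reads


def simulate_mixture(target, background, total_reads, target_fraction,
                     read_len=50, sub_rate=0.01, seed=0):
    """Return shuffled (label, read) pairs with a known share of target reads."""
    rng = random.Random(seed)
    n_target = round(total_reads * target_fraction)
    labelled = [("target", r) for r in
                simulate_reads(target, n_target, read_len, sub_rate, rng)]
    labelled += [("background", r) for r in
                 simulate_reads(background, total_reads - n_target,
                                read_len, sub_rate, rng)]
    rng.shuffle(labelled)
    return labelled


# Hypothetical random sequences standing in for real reference genomes.
_rng = random.Random(42)
target_genome = "".join(_rng.choice(BASES) for _ in range(1000))
background_genome = "".join(_rng.choice(BASES) for _ in range(5000))

dataset = simulate_mixture(target_genome, background_genome,
                           total_reads=1000, target_fraction=0.05)
n_target_reads = sum(1 for label, _ in dataset if label == "target")
print(n_target_reads, len(dataset))
```

Because every read carries its true origin, a detection pipeline run on such a dataset can be scored against a known ground truth. The published simulators differ mainly in how realistic the error, quality-score, and abundance models are.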
4. Diagnostic Performance Metrics with Artificial HTS Datasets
4.1. Analytical Sensitivity
4.2. Analytical Specificity
4.3. Diagnostic Sensitivity
4.4. Diagnostic Specificity
4.5. Precision
4.6. Robustness
5. Conclusions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Soltani, N.; Stevens, K.A.; Klaassen, V.; Hwang, M.-S.; Golino, D.A.; Al Rwahnih, M. Quality Assessment and Validation of High-Throughput Sequencing for Grapevine Virus Diagnostics. Viruses 2021, 13, 1130.
2. Maina, S.; Zheng, L.; Rodoni, B.C. Targeted Genome Sequencing (TG-Seq) Approaches to Detect Plant Viruses. Viruses 2021, 13, 583.
3. Lebas, B.; Adams, I.; Al Rwahnih, M.; Baeyen, S.; Bilodeau, G.J.; Blouin, A.G.; Boonham, N.; Candresse, T.; Chandelier, A.; De Jonghe, K.; et al. Facilitating the Adoption of High-throughput Sequencing Technologies as a Plant Pest Diagnostic Test in Laboratories: A Step-by-step Description. Bull. OEPP 2022, 52, 394–418.
4. Piombo, E.; Abdelfattah, A.; Droby, S.; Wisniewski, M.; Spadaro, D.; Schena, L. Metagenomics Approaches for the Detection and Surveillance of Emerging and Recurrent Plant Pathogens. Microorganisms 2021, 9, 188.
5. Hu, X.; Hurtado-Gonzales, O.P.; Adhikari, B.N.; French-Monar, R.D.; Malapi, M.; Foster, J.A.; McFarland, C.D. PhytoPipe: A Phytosanitary Pipeline for Plant Pathogen Detection and Diagnosis Using RNA-Seq Data. BMC Bioinform. 2023, 24, 470.
6. Espindola, A.S.; Sempertegui-Bayas, D.; Bravo-Padilla, D.F.; Freire-Zapata, V.; Ochoa-Corona, F.; Cardwell, K.F. TASPERT: Target-Specific Reverse Transcript Pools to Improve HTS Plant Virus Diagnostics. Viruses 2021, 13, 1223.
7. Katsiani, A.; Maliogka, V.I.; Katis, N.; Svanella-Dumas, L.; Olmos, A.; Ruiz-García, A.B.; Marais, A.; Faure, C.; Theil, S.; Lotos, L.; et al. High-Throughput Sequencing Reveals Further Diversity of Little Cherry Virus 1 with Implications for Diagnostics. Viruses 2018, 10, 385.
8. Bester, R.; Cook, G.; Breytenbach, J.H.J.; Steyn, C.; De Bruyn, R.; Maree, H.J. Towards the Validation of High-Throughput Sequencing (HTS) for Routine Plant Virus Diagnostics: Measurement of Variation Linked to HTS Detection of Citrus Viruses and Viroids. Virol. J. 2021, 18, 61.
9. Maree, H.J.; Fox, A.; Al Rwahnih, M.; Boonham, N.; Candresse, T. Application of HTS for Routine Plant Virus Diagnostics: State of the Art and Challenges. Front. Plant Sci. 2018, 9, 1082.
10. Fajardo, T.V.M.; Silva, F.N.; Eiras, M.; Nickel, O. High-Throughput Sequencing Applied for the Identification of Viruses Infecting Grapevines in Brazil and Genetic Variability Analysis. Trop. Plant Pathol. 2017, 42, 250–260.
11. Amoia, S.S.; Chiumenti, M.; Minafra, A. First Identification of Fig Virus A and Fig Virus B in Ficus Carica in Italy. Plants 2023, 12, 1503.
12. Maliogka, V.I.; Minafra, A.; Saldarelli, P.; Ruiz-García, A.B.; Glasa, M.; Katis, N.; Olmos, A. Recent Advances on Detection and Characterization of Fruit Tree Viruses Using High-Throughput Sequencing Technologies. Viruses 2018, 10, 436.
13. Al-helu, M.H.; Zhongtian, X.; Li, J.-M.; Lahuf, A.A. Next-Generation Sequencing-Based Detection Reveals Erysiphe Necator-Associated Virus 1 in Okra Plants. J. Kerbala Agric. Sci. 2024, 11, 205–213.
14. Kinoti, W.M.; Nancarrow, N.; Dann, A.; Rodoni, B.C.; Constable, F.E. Updating the Quarantine Status of Prunus Infecting Viruses in Australia. Viruses 2020, 12, 246.
15. Dang, T.; Espindola, A.; Vidalakis, G.; Cardwell, K. An In Silico Detection of a Citrus Viroid from Raw High-Throughput Sequencing Data. In Viroids: Methods and Protocols; Rao, A.L.N., Lavagi-Craddock, I., Vidalakis, G., Eds.; Springer: New York, NY, USA, 2022; Volume 2316, pp. 275–283. ISBN 9781071614648.
16. Proaño-Cuenca, F.; Espindola, A.S.; Garzon, C.D. Detection of Phytophthora, Pythium, Globisporangium, Hyaloperonospora and Plasmopara species in High-Throughput Sequencing data by in silico and in vitro analysis using Microbe Finder (MiFi®). PhytoFrontiers 2023, 3, 124–136.
17. Espindola, A.; Schneider, W.; Hoyt, P.R.; Marek, S.M.; Garzon, C. A New Approach for Detecting Fungal and Oomycete Plant Pathogens in next Generation Sequencing Metagenome Data Utilising Electronic Probes. Int. J. Data Min. Bioinform. 2015, 12, 115–128.
18. Espindola, A.S.; Cardwell, K.; Martin, F.N.; Hoyt, P.R.; Marek, S.M.; Schneider, W.; Garzon, C.D. A Step Towards Validation of High-Throughput Sequencing for the Identification of Plant Pathogenic Oomycetes. Phytopathology 2022, 112, 1859–1866.
19. Stobbe, A.H.; Daniels, J.; Espindola, A.S.; Verma, R.; Melcher, U.; Ochoa-Corona, F.; Garzon, C.; Fletcher, J.; Schneider, W. E-Probe Diagnostic Nucleic Acid Analysis (EDNA): A Theoretical Approach for Handling of next Generation Sequencing Data for Diagnostics. J. Microbiol. Methods 2013, 94, 356–366.
20. Bocsanczy, A.M.; Espindola, A.S.; Cardwell, K.; Norman, D.J. Development and Validation of E-Probes with the MiFi System for Detection of Ralstonia Solanacearum Species Complex in Blueberries. PhytoFrontiers 2023, 3, 137–147.
21. Radhakrishnan, G.V.; Cook, N.M.; Bueno-Sancho, V.; Lewis, C.M.; Persoons, A.; Mitiku, A.D.; Heaton, M.; Davey, P.E.; Abeyo, B.; Alemayehu, Y.; et al. MARPLE, a Point-of-Care, Strain-Level Disease Diagnostics and Surveillance Tool for Complex Fungal Pathogens. BMC Biol. 2019, 17, 65.
22. Loit, K.; Adamson, K.; Bahram, M.; Puusepp, R.; Anslan, S.; Kiiker, R.; Drenkhan, R.; Tedersoo, L. Relative Performance of MinION (Oxford Nanopore Technologies) versus Sequel (Pacific Biosciences) Third-Generation Sequencing Instruments in Identification of Agricultural and Forest Fungal Pathogens. Appl. Environ. Microbiol. 2019, 85, e01368-19.
23. Bronzato Badial, A.; Sherman, D.; Stone, A.; Gopakumar, A.; Wilson, V.; Schneider, W.; King, J. Nanopore Sequencing as a Surveillance Tool for Plant Pathogens in Plant and Insect Tissues. Plant Dis. 2018, 102, 1648–1652.
24. Kutnjak, D.; Tamisier, L.; Adams, I.; Boonham, N.; Candresse, T.; Chiumenti, M.; De Jonghe, K.; Kreuze, J.F.; Lefebvre, M.; Silva, G.; et al. A Primer on the Analysis of High-Throughput Sequencing Data for Detection of Plant Viruses. Microorganisms 2021, 9, 841.
25. Standards & Guidelines: Generation and Analysis of High Throughput Sequencing Data. Available online: https://www.agriculture.gov.au/agriculture-land/animal/health/laboratories/hts-standards-and-guidelines (accessed on 18 August 2024).
26. PM 7/151 (1) Considerations for the Use of High Throughput Sequencing in Plant Health Diagnostics. Bull. OEPP 2022, 52, 619–642.
27. Tamisier, L.; Haegeman, A.; Foucart, Y.; Fouillien, N.; Al Rwahnih, M.; Buzkan, N.; Candresse, T.; Chiumenti, M.; De Jonghe, K.; Lefebvre, M.; et al. Semi-Artificial Datasets as a Resource for Validation of Bioinformatics Pipelines for Plant Virus Detection. Peer Community J. 2021, 1, e53.
28. Saah, A.J.; Hoover, D.R. “Sensitivity” and “Specificity” Reconsidered: The Meaning of These Terms in Analytical and Diagnostic Settings. Ann. Intern. Med. 1997, 126, 91–94.
29. Mostafa, H.H.; Hardick, J.; Morehead, E.; Miller, J.-A.; Gaydos, C.A.; Manabe, Y.C. Comparison of the Analytical Sensitivity of Seven Commonly Used Commercial SARS-CoV-2 Automated Molecular Assays. J. Clin. Virol. 2020, 130, 104578.
30. Espindola, A.S.; Cardwell, K.F. Microbe Finder (MiFi®): Implementation of an Interactive Pathogen Detection Tool in Metagenomic Sequence Data. Plants 2021, 10, 250.
31. Dang, T.; Wang, H.; Espíndola, A.S.; Habiger, J.; Vidalakis, G.; Cardwell, K. Development and Statistical Validation of E-Probe Diagnostic Nucleic Acid Analysis (EDNA) Detection Assays for the Detection of Citrus Pathogens from Raw High Throughput Sequencing Data. PhytoFrontiers 2022, 3, 113–123.
32. Huang, W.; Li, L.; Myers, J.R.; Marth, G.T. ART: A next-Generation Sequencing Read Simulator. Bioinformatics 2012, 28, 593–594.
33. Shcherbina, A. FASTQSim: Platform-Independent Data Characterization and in Silico Read Generation for NGS Datasets. BMC Res. Notes 2014, 7, 533.
34. Fritz, A.; Hofmann, P.; Majda, S.; Dahms, E.; Dröge, J.; Fiedler, J.; Lesker, T.R.; Belmann, P.; DeMaere, M.Z.; Darling, A.E.; et al. CAMISIM: Simulating Metagenomes and Microbial Communities. Microbiome 2019, 7, 17.
35. Caboche, S.; Audebert, C.; Lemoine, Y.; Hot, D. Comparison of Mapping Algorithms Used in High-Throughput Sequencing: Application to Ion Torrent Data. BMC Genom. 2014, 15, 264.
36. Angly, F.E.; Willner, D.; Rohwer, F.; Hugenholtz, P.; Tyson, G.W. Grinder: A Versatile Amplicon and Shotgun Sequence Simulator. Nucleic Acids Res. 2012, 40, e94.
37. Richter, D.C.; Ott, F.; Auch, A.F.; Schmid, R.; Huson, D.H. MetaSim—A Sequencing Simulator for Genomics and Metagenomics. PLoS ONE 2008, 3, e3373.
38. Jia, B.; Xuan, L.; Cai, K.; Hu, Z.; Ma, L.; Wei, C. NeSSM: A Next-Generation Sequencing Simulator for Metagenomics. PLoS ONE 2013, 8, e75448.
39. Patuzzi, I.; Baruzzo, G.; Losasso, C.; Ricci, A.; Di Camillo, B. MetaSPARSim: A 16S rRNA Gene Sequencing Count Data Simulator. BMC Bioinform. 2019, 20, 416.
40. Schmeing, S.; Robinson, M.D. ReSeq Simulates Realistic Illumina High-Throughput Sequencing Data. Genome Biol. 2021, 22, 67.
41. Yang, C.; Chu, J.; Warren, R.L.; Birol, I. NanoSim: Nanopore Sequence Read Simulator Based on Statistical Characterization. Gigascience 2017, 6, gix010.
42. Johnson, S.; Trost, B.; Long, J.R.; Pittet, V.; Kusalik, A. A Better Sequence-Read Simulator Program for Metagenomics. BMC Bioinform. 2014, 15, S14.
43. Ewels, P.A.; Peltzer, A.; Fillinger, S.; Patel, H.; Alneberg, J.; Wilm, A.; Garcia, M.U.; Di Tommaso, P.; Nahnsen, S. The Nf-Core Framework for Community-Curated Bioinformatics Pipelines. Nat. Biotechnol. 2020, 38, 276–278.
44. Lee, H.; Gurtowski, J.; Yoo, S.; Marcus, S.; McCombie, W.R.; Schatz, M. Error Correction and Assembly Complexity of Single Molecule Sequencing Reads. bioRxiv 2014, 006395.
45. Massart, S.; Adams, I.; Al Rwahnih, M.; Baeyen, S.; Bilodeau, G.J.; Blouin, A.G.; Boonham, N.; Candresse, T.; Chandelier, A.; De Jonghe, K.; et al. Guidelines for the Reliable Use of High Throughput Sequencing Technologies to Detect Plant Pathogens and Pests. Peer Community J. 2022, 2, e62.
46. Groth-Helms, D.; Rivera, Y.; Martin, F.N.; Arif, M.; Sharma, P.; Castlebury, L.A. Terminology and Guidelines for Diagnostic Assay Development and Validation: Best Practices for Molecular Tests. PhytoFrontiers 2023, 3, 23–35.
47. Armbruster, D.A.; Pry, T. Limit of Blank, Limit of Detection and Limit of Quantitation. Clin. Biochem. Rev. 2008, 29, S49–S52.
48. Gaafar, Y.Z.A.; Ziebell, H. Comparative Study on Three Viral Enrichment Approaches Based on RNA Extraction for Plant Virus/Viroid Detection Using High-Throughput Sequencing. PLoS ONE 2020, 15, e0237951.
49. Pecman, A.; Kutnjak, D.; Gutiérrez-Aguirre, I.; Adams, I.; Fox, A.; Boonham, N.; Ravnikar, M. Next Generation Sequencing for Detection and Discovery of Plant Viruses and Viroids: Comparison of Two Approaches. Front. Microbiol. 2017, 8, 1998.
Performance Metric | Aim | Variables | HTS Dataset Type |
---|---|---|---|
Analytical sensitivity | Determine the lowest concentration of the target that is consistently detectable. | Quantitative: Limit of Detection (LoD), calculated as the Limit of Blank (LoB) + 1.645 standard deviations of a low-concentration sample, where the LoB is the mean of replicate blank tests + 1.645 standard deviations of the blank. Qualitative: the lowest concentration that is consistently detected as positive in repeated testing. | Real: serially diluted samples. Simulated: datasets with varying target concentrations. |
Analytical specificity | Ensure the assay accurately detects all target variants (inclusivity) while excluding non-targets (exclusivity). | Inclusivity: percentage of target variants correctly identified. Exclusivity: percentage of non-target samples correctly identified as negative. Selectivity: ability to detect target in the presence of background matrix. | Real: panels of target and non-target samples, including closely related species and environmental samples. Simulated: datasets containing a mix of target and non-target sequences with controlled variations. |
Diagnostic sensitivity | Evaluate the assay’s ability to correctly identify true positive samples. | Percentage of known positive samples correctly identified by the test. | Real: panels of samples with confirmed presence or absence of the target. Simulated: not suggested. |
Diagnostic specificity | Assess the assay’s ability to correctly identify true negative samples. | Percentage of known negative samples correctly identified. | Real: panels of samples with confirmed absence of the target. Simulated: needed because they provide a known negative background; intentionally including organisms closely related to the target is suggested (see precision). |
Precision | Measure the closeness of agreement between independent test results obtained under specified conditions. | Repeatability: variation when the same operator runs the assay on the same sample multiple times. Intermediate precision: agreement between results from multiple operators or instruments within a laboratory. Reproducibility: variation in results when the assay is performed in different laboratories and by different operators. | Real: replicate testing of the same samples. Simulated: datasets containing a mix of target and non-target sequences with controlled variations and concentrations. |
Robustness | Evaluate the assay’s ability to maintain its precision under deliberately varied conditions. | Ability of the test to maintain precision when subjected to deliberate variations. For bioinformatic pipelines, most variation comes from the operator and can be introduced intentionally, e.g., read filtering, library size (rarefaction), slight variation in the number of pathogen reads, and inclusion of closely related organisms. Typically assessed through ring tests involving multiple laboratories. | Real: testing under various conditions and with minor protocol deviations; the range of variation is more limited. Simulated: variations in pathogen read abundance, library size, and read filtering, among others, can be introduced without restrictions. |
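The quantitative variables in the table can be sketched in a few lines of Python. The example below follows the LoB/LoD definitions of Armbruster and Pry [47] and the percentage-based definitions of diagnostic sensitivity and specificity given above. Treating the "measurement" as a per-dataset count of target-matching reads, using the sample standard deviation, and the function names and example numbers are all assumptions made for illustration, not part of any published pipeline.

```python
from statistics import mean, stdev


def limit_of_blank(blank_values):
    """LoB = mean of blank replicates + 1.645 * SD of blank replicates."""
    return mean(blank_values) + 1.645 * stdev(blank_values)


def limit_of_detection(blank_values, low_conc_values):
    """LoD = LoB + 1.645 * SD of low-concentration replicates."""
    return limit_of_blank(blank_values) + 1.645 * stdev(low_conc_values)


def percent_correct(n_correct, n_total):
    """Percentage of samples classified correctly."""
    if n_total == 0:
        raise ValueError("at least one sample is required")
    return 100.0 * n_correct / n_total


def diagnostic_sensitivity(true_pos, false_neg):
    """Percentage of known positive samples called positive."""
    return percent_correct(true_pos, true_pos + false_neg)


def diagnostic_specificity(true_neg, false_pos):
    """Percentage of known negative samples called negative."""
    return percent_correct(true_neg, true_neg + false_pos)


# Hypothetical hit counts from simulated datasets (illustrative values only):
# negative datasets (no target reads) and low-abundance positive datasets.
blank_hits = [0, 1, 0, 2, 1, 0, 1, 1]
low_abundance_hits = [4, 6, 5, 7, 5, 6, 4, 5]

lob = limit_of_blank(blank_hits)
lod = limit_of_detection(blank_hits, low_abundance_hits)
print(f"LoB = {lob:.2f} hits, LoD = {lod:.2f} hits")

# Hypothetical panel: 45 of 50 positives and 48 of 50 negatives called correctly.
print(diagnostic_sensitivity(true_pos=45, false_neg=5))
print(diagnostic_specificity(true_neg=48, false_pos=2))
```

With simulated datasets, the blank and low-abundance replicates can be generated in any number, which is what makes this kind of calculation practical for a bioinformatic pipeline.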
© 2024 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Espindola, A.S. Simulated High Throughput Sequencing Datasets: A Crucial Tool for Validating Bioinformatic Pathogen Detection Pipelines. Biology 2024, 13, 700. https://doi.org/10.3390/biology13090700