Simple Summary
Metagenomics analysis measures microbiome diversity in samples without prior enrichment. Advances in High-Throughput Sequencing (HTS) have expanded its use from identifying known organisms to diagnosing diseases. Reliable results need strong validation with standard samples and databases from real and synthetic controls. We introduce the Metagenomic Standards Generator (MeStanG), a tool for creating HTS Nanopore data sets to test bioinformatics pipelines. MeStanG allows users to design and generate samples with specific numbers of reads for each organism from reference sequences and error profiles. The accuracy was tested by simulating metagenomic samples with known diversities and abundances expressed as number of reads. The analysis showed results that matched the expected organism composition in the samples. MeStanG is a valuable tool for scientists to create mock metagenomic samples useful in diagnostic assay validation studies and assess bioinformatics pipeline performance using simulated samples.
Abstract
Metagenomics analysis has enabled the measurement of the microbiome diversity in environmental samples without prior targeted enrichment. Functional and phylogenetic studies based on microbial diversity retrieved using HTS platforms have advanced from detecting known organisms and discovering unknown species to applications in disease diagnostics. Robust validation processes are essential for test reliability, requiring standard samples and databases deriving from real samples and in silico generated artificial controls. We propose a MeStanG as a resource for generating HTS Nanopore data sets to evaluate present and emerging bioinformatics pipelines. MeStanG allows samples to be designed with user-defined organism abundances expressed as number of reads, reference sequences, and predetermined or custom errors by sequencing profiles. The simulator pipeline was evaluated by analyzing its output mock metagenomic samples containing known read abundances using read mapping, genome assembly, and taxonomic classification on three scenarios: a bacterial community composed of nine different organisms, samples resembling pathogen-infected wheat plants, and a viral pathogen serial dilution sampling. The evaluation was able to report consistently the same organisms, and their read abundances as provided in the mock metagenomic sample design. Based on this performance and its novel capacity of generating exact number of reads, MeStanG can be used by scientists to develop mock metagenomic samples (artificial HTS data sets) to assess the diagnostic performance metrics of bioinformatic pipelines, allowing the user to choose predetermined or customized models for research and training.
1. Introduction
Advances in molecular biology and genomics made possible the assessment of the diversity, richness, and interaction of the organisms present in an environment sample [1]. Metagenome is a term used to refer to a collection of genomes in samples retrieved by amplicon and whole-genome shotgun strategies, focusing on microbial diversity and functional studies [2]. Metagenomics has been used to profile several ecosystems and environments’ taxonomic and functional interactions, making identifying specific microbes possible [3]. Direct sequencing of raw environmental DNA is used as a technique to retrieve quantitative sequence information from a sample, allowing the classification of known and the discovery of new taxa by association with known organisms [4,5]. Shotgun High-throughput sequencing (HTS) provides fast and extensive insights into massive biological data with different sequencing platforms including Illumina, Nanopore, PacBio, and Ion Torrent platforms, which are used for studies on phylogenetic, functional, and descriptive metagenomics [6,7,8]. Microbial diversity has been one of the leading research areas that have improved over the years since the advent of shotgun metagenomics. Several tools have been developed for reconstructing microbial composition [9], viral discovery [10], and unculturable organism detection [11]. One challenge posed by these techniques is the complex data analysis required to effectively establish shotgun metagenomics HTS as a strategy for pathogen detection and diagnostics regarding the massive amounts of data, diversity of highly specialized pipelines, and computationally expensive processes [12].
Validating the accuracy of read classification procedures can be challenging due to the lack of accurate reference databases, samples, and quality controls. To address this, in silico validation can be performed using sequence read simulators. When validating the accuracy of read classifiers, it is essential to consider the diversity of metagenomic samples, how accurately they resemble real samples, and determine their analytical sensitivity and specificity as a first validation tier [13,14]. Using sequence sets generated in silico representing the diverse naturally occurring sequencing outputs from real samples is crucial for validating diagnostic tests based on HTS methods [15].
There are several HTS simulators available for short (ART v.2.5.8, DWGSIM v.0.1.15, InSilicoSeq v.1.5.4, Mason v.2.0.9, NEAT v.3.0, wgsim v.0.3.1-r13) [16], long (NanoSim v3.1.0, HeteroGenesis v1.5, DeepSimulator v1.5) [17], and metagenomic reads (NanoSim v3.1.0, CAMISIM 1.3). CAMISIM can use ART, wgsim, and NanoSim for metagenomic simulation in its framework, being NanoSim the only one capable of generating Nanopore reads through error model characterization on sequencing outputs followed by read simulation or de novo direct simulation using reference genomes and pre-trained models obtained from mock data sets [18,19]. However, there are cases where the corresponding real sequencing outputs are not available for model characterization or the pre-trained models generate data sets with read abundance distributions unsuitable for estimating performance indicators in precise sample composition analyses, such as assessing the limits of Detection, Sensitivity, and Specificity, metrics required for diagnostic assay validation.
Here, we propose Metagenomic Standards Generator (MeStanG) as a resource for simulating de novo specific nanopore data sets resembling sequencing data. The simulated data can be used to evaluate existing and emerging bioinformatics pipelines designed to analyze HTS data for taxonomic classification and diagnostic purposes.
2. Materials and Methods
MeStanG allows the generation of standard samples with precise composition features for the performance assessment of tools that rely on variable and abundances as number of reads per organism and diversities of the organisms in metagenomic communities. It is a de novo approach that requires reference data sets in FASTA format (assemblies, complete genomes, contigs, or reads) and a set of user-defined parameters for sample generation consistent with the reference sequence lengths. MeStanG introduces an algorithm for de novo error insertion resembling Guppy and Dorado base-calling performance implemented as error type [20,21,22] and specific base transition/transversion probability using an empiric model based on the chemical structures of the nucleotide nitrogenous bases. Users can also provide custom error rates and base-calling accuracy profiles (Figure 1).
Figure 1.
MeStanG workflow diagram. Created in BioRender [23].
Read abundance (RA) for the organisms in the metagenomic sample can be provided as (1) absolute number of reads, (2) relative to the total number of reads, or (3) assigned pseudo-randomly. Samples can be designed to resemble environmental or host-microbiome scenarios. Depending on the taxonomic distribution, the diversity design has two approaches: individual taxa or taxa with subtaxa. For designing diversity as individual taxa, each organism in the community exists as an independent taxon, and an assignment of absolute or relative RA is required individually. When designing diversity as taxa with subtaxa in scenarios where individual organisms’ RA cannot be provided or estimated but the information for a higher taxon is available, the RA for the higher taxon can be manually set and distributed among the subtaxa, i.e., a species complex with a total known RA but unknown for each organism in the group individually. Samples are generated in FASTA format, along with reports of the absolute and relative RA, error profile, error distribution, and run parameters.
To determine MeStanG’s capability of generating metagenomes with specified read length, depth, and taxon microbial RA when compared to NanoSim metagenome mode, several generated metagenomes with both simulation platforms using the same design parameters were evaluated using pipelines for metagenome analysis through assembly [24,25], mapping to reference [25], and taxonomic sequence classification [26].
2.1. Bacterial-Only Metagenome
Nine bacterial species assemblies stored in the National Center for Biotechnology Information (NCBI) database were used as input for generating a metagenome sample that resembles one that contains only bacterial organisms with MeStanG and NanoSim metagenome mode. Bacillus subtilis (GCF_000009045.1), Escherichia coli (GCF_000005845.2), Enterococcus faecalis (GCF_000393015.1), Klebsiella pneumoniae (GCF_000364385.3), Limosilactobacillus fermentum (GCF_029961225.1), Listeria monocytogenes (GCF_000438585.1), Pseudomonas aeruginosa (GCF_000006765.1), Staphylococcus aureus (GCF_000418345.1), and Salmonella enterica (GCF_000783815.2). MeStanG was run with parameters that aimed at generating an average read length of 2000 ± 200 nucleotides, varying the number of reads for each organism to get ~50× depth for assembly, optimal for the chosen read length [27].
Metagenome composition was detected through mapping to a combined reference of the nine bacteria using minimap2 v2.28-r1209 [28] and samtools v1.20 [29] for post-processing removing secondary and chimeric mappings, retrieving unique hits to each organism per read. Metagenome Assembly was performed with Miniasm v0.3-r179 [30] along with Racon v1.5.0 [31] for three polishing rounds and Flye v2.9.4-b1799 [32] optimized for metagenomic samples with three polishing rounds. Assemblies were evaluated using MetaQUAST v5.2.0 [33], and dot plots were generated using D-Genies for the assembly alignment to the combined reference [34]. Taxonomic sequence classification was carried out using Kraken2 v2.1.3 [35], followed by Bracken v2.9 [36], and displayed using Pavian v1.0 [37].
2.2. Host-Pathogen Metagenome Sample Generation
Fifteen metagenomic HTS data sets were generated with MeStanG and NanoSim, simulating bread wheat samples (Assembly accession: GCF_018294505.1) infected with three different pathogens. The pathogens included were Puccinia striiformis f. sp. tritici strain 134E16A+17+33+ (Assembly accession: GCF_021901695.1), Xanthomonas translucens pv. undulosa strain XtLr8 (Assembly accession: GCF_017301775.1), and Barley yellow dwarf virus—PAV (Nucleotide accession: NC_004750.1). Varying pathogen concentrations were used to resemble different host-pathogen interaction scenarios. Pathogens were detected on the samples using minimap2, Kraken2 followed by Bracken, and E-probe Diagnostic Nucleic Acid Analysis (EDNA) on Microbe Finder (MiFi®) [38,39]. Results were compared to the reported pathogen RA set in the sample design.
Additionally, a set of samples resembling five serial dilutions with 20 replicates each of wheat samples containing the viral pathogen Barley yellow dwarf virus were generated using MeStanG to evaluate the accuracy of RA design with the same approaches for detection previously described.
3. Results and Discussion
3.1. Bacterial-Only Metagenome Assessment
The results from mapping the bacterial metagenome to the combined reference genomes are consistent in mapping quality for MeStanG and NanoSim data sets, with higher accuracy in the number of mapped reads for MeStanG than NanoSim samples (Table 1).
Table 1.
Metagenome assessment statistics through mapping and assembly. SP: simulation platform; MeStanG (MSG) and NanoSim (NS), RA: Read abundances given as input for sample generation; absolute for MeStanG and relative for NanoSim, #reads mapped: number of best unique metagenome reads mapped to the respective reference organism genome, Mapq: mapping quality, MinR: assembly using Miniasm coupled with Racon, NA: data not reported by MetaQUAST.
Assembly results for MeStanG generated samples with Miniasm followed by Racon polishing assembly genome fractions ranged from 95.890 to 99.392% for S. enterica and B. subtilis, respectively, and the dot plot displays a continuous high identity alignment to the combined reference genomes. Flye genome fractions ranged from 69.373 to 92.381% for P. aeruginosa and E. faecalis, respectively, the genome alignment plot has a similar disposition as the Miniasm assembly plot (Figure 2). Assembly results for NanoSim-generated samples with Miniasm followed by Racon polishing assembly genome fractions ranged from 37.440 to 99.621% for K. pneumoniae and B. subtilis, respectively, and the dot plot displays a discontinuous identity alignment for K. pneumoniae and S. enterica. Flye genome fractions ranged from 10.460 to 71.668% for K. pneumoniae and L. fermentum, respectively, the alignment plot has a similar disposition as the Miniasm assembly plot (Figure 3).
Figure 2.
Genome alignment between the combined reference for nine different bacterial species to the metagenome assembly of the generated sample with MeStanG using (A) Miniasm + Racon and (B) Flye.
Figure 3.
Genome alignment between the combined reference for nine different bacterial species to the metagenome assembly of the generated sample with NanoSim using (A) Miniasm + Racon and (B) Flye.
NanoSim simulation pipeline generates two sets of reads, aligned and unaligned. The latter contains reads simulated at random; merging the two sets gives the total number of reads specified for simulation [19]. Unaligned reads generation results in a ~13% loss in the RA initially specified by design calculated from all the results presented using mapping. This makes NanoSim unsuitable for approaches requiring an exact number of reads for analysis as it is not possible to modify how random reads are generated or estimate a proper initial relative RA to generate a specific number of reads.
Assembly metrics N50 refers to the contig length such that using contigs of the same size would produce half of the bases in the assembly, NGA50 is computed as the length of the aligned blocks that represent 50% of the reference genome size instead of the total assembly length [40]. While the N50 reflects the assembler performance in getting long contigs, NGA50 is a more informative metric for assembly completeness respect to a reference genome [41]. MetaQUAST fails to esteem NGA50 in four cases (Table 1) as the genome fraction is lower than 50%. The results suggest a better assembly performance for the MeStanG than the NanoSim sample based on NGA50 (the longer the better) and genome fraction metrics (the higher the better).
Taxonomic classification using Kraken2 followed by Bracken reported RA consistent with the diversity distribution designed for the MeStanG sample, ranging from 96.735% for K. pneumoniae to 99.511% for B. subtilis (Figure 4). NanoSim sample reported RA ranged from 50.410% for S. aureus to 85.039% for L. fermentum with overestimations for E. coli (127.755%), E. faecalis (136.938%), and P. aeruginosa (123.163%) (Figure 5).
Figure 4.
Pavian graph of the taxonomic diversity of the simulated metagenomic sample using MeStanG containing nine different bacterial species detected using Kraken2 followed by Bracken with the number of reads assigned to each organism. Taxonomic levels shown as D: Domain, P: Phylum, F: Family, G: Genus, S: Species.
Figure 5.
Pavian graph of the taxonomic diversity of the simulated metagenomic sample using NanoSim containing nine different bacterial species detected using Kraken2 followed by Bracken with the number of reads assigned to each organism. Taxonomic levels shown as D: Domain, P: Phylum, F: Family, G: Genus, S: Species.
Kraken2 approach to classification is based on k-mers for efficient search against a database and it might misassign reads to a closely related species whose k-mers are similar within the same genus [42]. Bracken is used to improve the Kraken2 reported estimated RA at the species level leading to better estimates [43], however as Bracken redistributes RA at other taxonomic levels based on the initial classification, any unclassified reads with Kraken2 will likely remain unclassified. The classification accuracy for the bacterial-only sample generated with MeStanG was 98.417 ± 0.981%, consistent with the Bracken reported performance [36]. Accuracy for NanoSim generated samples was 69.320 ± 12.273% excluding the overestimations, this suboptimal read classification might be explained by the amount of unaligned random reads generated biased towards E. coli and P. aeruginosa.
3.2. Host-Pathogen Sample Generation
Results for mapping and EDNA-MiFi® were consistent with the reported RA by MeStanG, with high-quality mappings for the pathogens present in the sample ranging from 16.928 to 60 (97.971 to 99.999% accuracy rate) and low-quality mapping scores when absent ranging from 0 to 5 (0 to 68.377% accuracy rate). NanoSim samples mappings were also consistent with high-quality mappings for the pathogens present ranging from 19.584 to 60 (98.899 to 99.999% accuracy rate) and low-quality mapping scores when absent ranging from 0 to 2.833 (0 to 47.917% accuracy rate) despite failing in retrieving all the intended abundancies for all samples as per previously discussed (Table 2).
Table 2.
Host-pathogen sampling detection of select organisms using mapping, taxonomic classification, and EDNA-MiFi®. Sample: MSG—generated with MeStanG, NS—generated with NanoSim, RA: Read abundances given as input for sample generation; absolute for MeStanG and relative for NanoSim. # reads mapped: number of best unique metagenome reads mapped to the respective reference organism, Mapq: mapping quality, KB: reads assigned to each organism by Kraken2 + Bracken taxonomic classification, EM: EDNA-MiFi® detection P for Positive and N for Negative.
Kraken2 classification was able to assign the RA consistently for MeStanG and NanoSim samples for each organism to the level of species as no specific strain/biotype was detected (PAV for Barley yellow dwarf virus/Luteovirus pavhordei, f. sp. tritici strain 134E16A+17+33+ for Puccinia striiformis, and pv. undulosa strain XtLr8 for Xanthomonas translucens) as Kraken2 might underestimate RA when classifying reads to the strain resolution level [42].
It is worth noting that taxonomic classification was more accurate in host-pathogen NanoSim samples compared to the results of the bacterial-only metagenome. This is likely due to the diversity in both cases being more homogeneous in the bacterial-only sample making it more difficult to discriminate between closely related organisms. On the other hand, taxonomic classification in MeStanG samples was consistent regardless of the diversity, making it suitable for generating samples to be subject of pipelines using high-accuracy analysis thresholds.
As a demonstration of the applicability of MeStanG in generating samples for assessing sensitivity of diagnostic tests, a total of 100 samples resembling a serial dilution routine were generated (Table 3)
Table 3.
Simulated Serial dilution sampling with MeStanG of bread wheat plants infected with Barley yellow dwarf virus where each RA has 20 replicates displaying mean values with their corresponding standard deviation where available. Absolute RA: absolute read abundance used as input for number of reads simulation, Relative RA: relative read abundance respect to the total number of reads in the sample, # reads mapped: number of best unique reads mapped to the virus genome, Mapq: mapping quality, K2B: reads assigned to the virus by Kraken2 + Bracken taxonomic classification, EM (TPR%): EDNA-MiFi® true positive rate detection.
The number of mapped reads and their mapping quality was the same throughout each of the 20 samples for all RA. Taxonomic classification was not able to estimate the exact number of reads for all samples, and EDNA-MiFi® true positive rate decreases to 80% and 40% when there are 10 and five viral reads in the sample, respectively. The performance of taxonomic classification makes it more reliable than EDNA-MiFi® detection in the lowest RA for this virus, which must be considered when using multiple pipelines for detection of pathogens in HTS samples.
The parameters used to generate the samples used in this study were set according to optimal values to ensure a proper metagenome assembly in terms of read length and number of reads, changes in the parameters will reflect in different sequencing depths obtaining better assemblies with higher depths [27,41]. Error rates from pre-trained models or customized models impact directly to the accuracy of the assembly and detection methods, making it necessary to have higher sequencing depths to address the unreliability generated by high error rates for assembly [22,41] and run polishing or correction pipelines to address possible misassemblies [44].
4. Conclusions
Based on its capacity of generating samples with exact number of reads per organism and the performance metrics evaluated using tools for detection of the read abundance and diversity of HTS samples, MeStanG has potential various applications, including creating standards for evaluating existing and emerging bioinformatics pipelines, generating controls for validation assays, improving the estimation of diagnostic tests sensitivity and specificity by generating exclusion and inclusion panels with sufficient replicates, benchmarking read classification systems based on sequence alignment by testing their performance on complex synthetic metagenome compositions resembling natural and artificial environments, and providing mock samples for teaching basic and advanced bioinformatic methods.
With the guidance of the user manual available at the MeStanG GitHub repository found in the Data Availability Statement section, we want to enable users to choose predefined reported performance models for common usage and customized profiles for research and training purposes depending on the requirements for sample generation, expecting to nurture more research based on artificial controls to estimate performance indicators before translating technologies into real scenarios.
Author Contributions
Conceptualization, D.R.L., A.S.E. and F.J.F.; Methodology, D.R.L., A.S.E. and F.J.F.; Software, D.R.L.; Validation, D.R.L.; Formal analysis, D.R.L.; Investigation, D.R.L. and A.S.E.; Resources, D.R.L. and A.S.E.; Data curation, D.R.L.; Writing—original draft preparation, D.R.L.; Writing—review and editing, D.R.L., A.S.E. and F.J.F.; Visualization, D.R.L.; Supervision, A.S.E. and D.R.L.; Project administration, A.S.E.; Funding acquisition, A.S.E. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by Oklahoma State University—Oklahoma Agricultural Experiment Station—Hatch OKL03271.
Institutional Review Board Statement
Not applicable.
Informed Consent Statement
Not applicable.
Data Availability Statement
The generated samples by MeStanG used to assess its performance can be found at https://doi.org/10.5281/zenodo.13858384. MeStanG source code and user manual can be found at: https://github.com/ibmf-bioinformatics/MeStanG, 11 December 2024.
Acknowledgments
Some of the computing for this project was performed at the High Performance Computing Center at Oklahoma State University, supported in part through the National Science Foundation grant OAC-1531128.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- Handelsman, J.; Rondon, M.R.; Brady, S.F.; Clardy, J.; Goodman, R.M. Molecular Biological Access to the Chemistry of Unknown Soil Microbes: A New Frontier for Natural Products. Chem. Biol. 1998, 5, R245–R249. [Google Scholar] [CrossRef] [PubMed]
- Vecherskii, M.V.; Semenov, M.V.; Lisenkova, A.A.; Stepankov, A.A. Metagenomics: A New Direction in Ecology. Biol. Bull. Russ. Acad. Sci. 2021, 48, S107–S117. [Google Scholar] [CrossRef]
- Fierer, N.; Leff, J.W.; Adams, B.J.; Nielsen, U.N.; Bates, S.T.; Lauber, C.L.; Owens, S.; Gilbert, J.A.; Wall, D.H.; Caporaso, J.G. Cross-Biome Metagenomic Analyses of Soil Microbial Communities and Their Functional Attributes. Proc. Natl. Acad. Sci. USA 2012, 109, 21390–21395. [Google Scholar] [CrossRef] [PubMed]
- Tyson, G.W.; Chapman, J.; Hugenholtz, P.; Allen, E.E.; Ram, R.J.; Richardson, P.M.; Solovyev, V.V.; Rubin, E.M.; Rokhsar, D.S.; Banfield, J.F. Community Structure and Metabolism through Reconstruction of Microbial Genomes from the Environment. Nature 2004, 428, 37–43. [Google Scholar] [CrossRef]
- Von Meijenfeldt, F.A.B.; Arkhipova, K.; Cambuy, D.D.; Coutinho, F.H.; Dutilh, B.E. Robust Taxonomic Classification of Uncharted Microbial Sequences and Bins with CAT and BAT. Genome Biol. 2019, 20, 217. [Google Scholar] [CrossRef]
- Benoit, G.; Raguideau, S.; James, R.; Phillippy, A.M.; Chikhi, R.; Quince, C. High-Quality Metagenome Assembly from Long Accurate Reads with metaMDBG. Nat. Biotechnol. 2024, 42, 1378–1383. [Google Scholar] [CrossRef]
- Chuzel, L.; Sinha, A.; Cunningham, C.V.; Taron, C.H. High-Throughput Nanopore DNA Sequencing of Large Insert Fosmid Clones Directly from Bacterial Colonies. Appl. Environ. Microbiol. 2024, 90, e00243-24. [Google Scholar] [CrossRef]
- Zhou, J.; He, Z.; Yang, Y.; Deng, Y.; Tringe, S.G.; Alvarez-Cohen, L. High-Throughput Metagenomic Technologies for Complex Microbial Community Analysis: Open and Closed Formats. mBio 2015, 6, e02288-14. [Google Scholar] [CrossRef]
- Setubal, J.C. Metagenome-Assembled Genomes: Concepts, Analogies, and Challenges. Biophys. Rev. 2021, 13, 905–909. [Google Scholar] [CrossRef]
- Lu, C.; Peng, Y. Computational Viromics: Applications of the Computational Biology in Viromics Studies. Virol. Sin. 2021, 36, 1256–1260. [Google Scholar] [CrossRef]
- Liu, S.; Moon, C.D.; Zheng, N.; Huws, S.; Zhao, S.; Wang, J. Opportunities and Challenges of Using Metagenomic Data to Bring Uncultured Microbes into Cultivation. Microbiome 2022, 10, 76. [Google Scholar] [CrossRef] [PubMed]
- Sekse, C.; Holst-Jensen, A.; Dobrindt, U.; Johannessen, G.S.; Li, W.; Spilsberg, B.; Shi, J. High Throughput Sequencing for Detection of Foodborne Pathogens. Front. Microbiol. 2017, 8, 2029. [Google Scholar] [CrossRef] [PubMed]
- Espindola, A.S. Simulated High Throughput Sequencing Datasets: A Crucial Tool for Validating Bioinformatic Pathogen Detection Pipelines. Biology 2024, 13, 700. [Google Scholar] [CrossRef]
- Schlaberg, R.; Chiu, C.Y.; Miller, S.; Procop, G.W.; Weinstock, G.; the Professional Practice Committee and Committee on Laboratory Practices of the American Society for Microbiology; the Microbiology Resource Committee of the College of American Pathologists. Validation of Metagenomic Next-Generation Sequencing Tests for Universal Pathogen Detection. Arch. Pathol. Lab. Med. 2017, 141, 776–786. [Google Scholar] [CrossRef]
- Dulanto Chiang, A.; Dekker, J.P. From the Pipeline to the Bedside: Advances and Challenges in Clinical Metagenomics. J. Infect. Dis. 2020, 221, S331–S340. [Google Scholar] [CrossRef]
- Milhaven, M.; Pfeifer, S.P. Performance Evaluation of Six Popular Short-Read Simulators. Heredity 2023, 130, 55–63. [Google Scholar] [CrossRef]
- Tanner, G.; Westhead, D.R.; Droop, A.; Stead, L.F. Simulation of Heterogeneous Tumour Genomes with HeteroGenesis and in Silico Whole Exome Sequencing. Bioinformatics 2019, 35, 2850–2852. [Google Scholar] [CrossRef]
- Fritz, A.; Hofmann, P.; Majda, S.; Dahms, E.; Dröge, J.; Fiedler, J.; Lesker, T.R.; Belmann, P.; DeMaere, M.Z.; Darling, A.E.; et al. CAMISIM: Simulating Metagenomes and Microbial Communities. Microbiome 2019, 7, 17. [Google Scholar] [CrossRef]
- Yang, C.; Lo, T.; Nip, K.M.; Hafezqorani, S.; Warren, R.L.; Birol, I. Characterization and Simulation of Metagenomic Nanopore Sequencing Data with Meta-NanoSim. GigaScience 2023, 12, giad013. [Google Scholar] [CrossRef]
- Wick, R.R. ONT-Only Accuracy: 5 kHz and Dorado 2023. Zenodo. Available online: https://zenodo.org/records/10038673 (accessed on 20 September 2024).
- Wick, R.R. Yet Another ONT Accuracy Test: Dorado v0.5.0 2023. Zenodo. Available online: https://zenodo.org/records/10397818 (accessed on 20 September 2024).
- Zeng, J.; Cai, H.; Peng, H.; Wang, H.; Zhang, Y.; Akutsu, T. Causalcall: Nanopore Basecalling Using a Temporal Convolutional Network. Front. Genet. 2020, 10, 1332. [Google Scholar] [CrossRef]
- BioRender. Available online: https://app.biorender.com/citation/677829ec555ed7c8e986a946 (accessed on 10 January 2025).
- Latorre-Pérez, A.; Villalba-Bermell, P.; Pascual, J.; Vilanova, C. Assembly Methods for Nanopore-Based Metagenomic Sequencing: A Comparative Study. Sci. Rep. 2020, 10, 13588. [Google Scholar] [CrossRef] [PubMed]
- Abou Kubaa, R.; Amoia, S.S.; Altamura, G.; Minafra, A.; Chiumenti, M.; Cillo, F. Nanopore Technology Applied to Targeted Detection of Tomato Brown Rugose Fruit Virus Allows Sequencing of Related Viruses and the Diagnosis of Mixed Infections. Plants 2023, 12, 999. [Google Scholar] [CrossRef] [PubMed]
- Diao, Z.; Lai, H.; Han, D.; Yang, B.; Zhang, R.; Li, J. Validation of a Metagenomic Next-Generation Sequencing Assay for Lower Respiratory Pathogen Detection. Microbiol. Spectr. 2023, 11, e03812-22. [Google Scholar] [CrossRef]
- Khrenova, M.G.; Panova, T.V.; Rodin, V.A.; Kryakvin, M.A.; Lukyanov, D.A.; Osterman, I.A.; Zvereva, M.I. Nanopore Sequencing for De Novo Bacterial Genome Assembly and Search for Single-Nucleotide Polymorphism. Int. J. Mol. Sci. 2022, 23, 8569. [Google Scholar] [CrossRef] [PubMed]
- Li, H. New Strategies to Improve Minimap2 Alignment Accuracy. Bioinformatics 2021, 37, 4572–4574. [Google Scholar] [CrossRef] [PubMed]
- Danecek, P.; Bonfield, J.K.; Liddle, J.; Marshall, J.; Ohan, V.; Pollard, M.O.; Whitwham, A.; Keane, T.; McCarthy, S.A.; Davies, R.M.; et al. Twelve Years of SAMtools and BCFtools. GigaScience 2021, 10, giab008. [Google Scholar] [CrossRef]
- Li, H. Minimap and Miniasm: Fast Mapping and de Novo Assembly for Noisy Long Sequences. Bioinformatics 2016, 32, 2103–2110. [Google Scholar] [CrossRef]
- Vaser, R.; Sović, I.; Nagarajan, N.; Šikić, M. Fast and Accurate de Novo Genome Assembly from Long Uncorrected Reads. Genome Res. 2017, 27, 737–746. [Google Scholar] [CrossRef]
- Kolmogorov, M.; Bickhart, D.M.; Behsaz, B.; Gurevich, A.; Rayko, M.; Shin, S.B.; Kuhn, K.; Yuan, J.; Polevikov, E.; Smith, T.P.L.; et al. metaFlye: Scalable Long-Read Metagenome Assembly Using Repeat Graphs. Nat. Methods 2020, 17, 1103–1110. [Google Scholar] [CrossRef]
- Mikheenko, A.; Prjibelski, A.; Saveliev, V.; Antipov, D.; Gurevich, A. Versatile Genome Assembly Evaluation with QUAST-LG. Bioinformatics 2018, 34, i142–i150. [Google Scholar] [CrossRef]
- Cabanettes, F.; Klopp, C. D-GENIES: Dot Plot Large Genomes in an Interactive, Efficient and Simple Way. PeerJ 2018, 6, e4958. [Google Scholar] [CrossRef] [PubMed]
- Wood, D.E.; Lu, J.; Langmead, B. Improved Metagenomic Analysis with Kraken 2. Genome Biol. 2019, 20, 257. [Google Scholar] [CrossRef] [PubMed]
- Lu, J.; Breitwieser, F.P.; Thielen, P.; Salzberg, S.L. Bracken: Estimating Species Abundance in Metagenomics Data. PeerJ Comput. Sci. 2017, 3, e104. [Google Scholar] [CrossRef]
- Breitwieser, F.P.; Salzberg, S.L. Pavian: Interactive Analysis of Metagenomics Data for Microbiome Studies and Pathogen Identification. Bioinformatics 2020, 36, 1303–1304. [Google Scholar] [CrossRef]
- Espindola, A.S.; Cardwell, K.F. Microbe Finder (MiFi®): Implementation of an Interactive Pathogen Detection Tool in Metagenomic Sequence Data. Plants 2021, 10, 250. [Google Scholar] [CrossRef]
- Stobbe, A.H.; Daniels, J.; Espindola, A.S.; Verma, R.; Melcher, U.; Ochoa-Corona, F.; Garzon, C.; Fletcher, J.; Schneider, W. E-Probe Diagnostic Nucleic Acid Analysis (EDNA): A Theoretical Approach for Handling of next Generation Sequencing Data for Diagnostics. J. Microbiol. Methods 2013, 94, 356–366. [Google Scholar] [CrossRef]
- Gurevich, A.; Saveliev, V.; Vyahhi, N.; Tesler, G. QUAST: Quality Assessment Tool for Genome Assemblies. Bioinformatics 2013, 29, 1072–1075. [Google Scholar] [CrossRef]
- Wick, R.R.; Judd, L.M.; Gorrie, C.L.; Holt, K.E. Unicycler: Resolving Bacterial Genome Assemblies from Short and Long Sequencing Reads. PLoS Comput. Biol. 2017, 13, e1005595. [Google Scholar] [CrossRef]
- Da Silva, K.; Pons, N.; Berland, M.; Plaza Oñate, F.; Almeida, M.; Peterlongo, P. StrainFLAIR: Strain-Level Profiling of Metagenomic Samples Using Variation Graphs. PeerJ 2021, 9, e11884. [Google Scholar] [CrossRef]
- Zhu, K.; Schäffer, A.A.; Robinson, W.; Xu, J.; Ruppin, E.; Ergun, A.F.; Ye, Y.; Sahinalp, S.C. Strain Level Microbial Detection and Quantification with Applications to Single Cell Metagenomics. Nat. Commun. 2022, 13, 6430. [Google Scholar] [CrossRef]
- Chen, Y.; Nie, F.; Xie, S.-Q.; Zheng, Y.-F.; Dai, Q.; Bray, T.; Wang, Y.-X.; Xing, J.-F.; Huang, Z.-J.; Wang, D.-P.; et al. Efficient Assembly of Nanopore Reads via Highly Accurate and Intact Error Correction. Nat. Commun. 2021, 12, 60. [Google Scholar] [CrossRef] [PubMed]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. |
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).