Article

Three Rounds of Read Correction Significantly Improve Eukaryotic Protein Detection in ONT Reads

1
OMICS Research Unit, Health Science Centre, Kuwait University, Kuwait City 13110, Kuwait
2
Serology and Molecular Microbiology Reference Laboratory, Mubarak Al-Kabeer Hospital, Ministry of Health, Kuwait City 13110, Kuwait
3
Department of Microbiology, Faculty of Medicine, Kuwait University, Kuwait City 13110, Kuwait
*
Author to whom correspondence should be addressed.
Microorganisms 2024, 12(2), 247; https://doi.org/10.3390/microorganisms12020247
Submission received: 29 October 2023 / Revised: 19 January 2024 / Accepted: 23 January 2024 / Published: 24 January 2024
(This article belongs to the Section Microbial Biotechnology)

Abstract

Background: Whole-genome sequencing of eukaryotes is crucial for species identification, gene detection, and protein annotation. Oxford Nanopore Technology (ONT) is an affordable and rapid platform for sequencing eukaryotes; however, its relatively high error rates require computational and bioinformatic efforts to produce more accurate genome assemblies. Here, we evaluated the effect of read correction tools on eukaryote genome completeness, gene detection, and protein annotation. Methods: ONT reads of four eukaryotes, C. albicans, C. gattii, S. cerevisiae, and P. falciparum, were assembled using minimap2 and underwent three rounds of read correction using flye, medaka, and racon. The generated consensus FASTA files were compared for total length (bp), genome completeness, gene detection, and protein annotation by QUAST, BUSCO, BRAKER1, and InterProScan, respectively. Results: Genome completeness was dependent on the assembly method rather than on the read correction tool; however, medaka performed better than flye and racon. Racon performed significantly better than flye and medaka in gene detection, while both racon and medaka performed significantly better than flye in protein annotation. Conclusion: We show that three rounds of read correction significantly affect gene detection and protein annotation, which depend on assembly quality rather than assembly completeness.

1. Introduction

Oxford Nanopore Technology (ONT), a third-generation sequencing technology, serves as a platform to sequence small to large and multiplex genomes and is currently widely used globally, especially in low- and mid-income countries, due to its simplicity, feasibility, and sustainability in both medical research and clinical settings [1,2]. The main advantage of ONT is real-time analysis through the user-friendly interface, EPI2ME Agent, with no bioinformatic expertise required, allowing rapid microbe identification and detection of antimicrobial resistance (AMR) genes [3,4]. The agile and simple library preparation for ONT sequencing, without the biased PCR amplification step, is another significant advantage [5]. Furthermore, ONT overcomes the problems observed in next-generation sequencing (NGS) in sequencing genomic repeats and the production of incompletely assembled genomes [6]. ONT sequencing generates reads long enough to span repeated regions and produces near-complete assemblies in which the location of resistance genes (i.e., chromosomal vs. plasmid) can be determined [7,8].
Despite the advantages of ONT and the rapid advancement of the technology since its development, its major shortcoming is a relatively high error rate (~10–15%) compared to NGS when using R9 flow cells [9]. Although increasing the depth of ONT reads can produce contiguous assembled genomes, the errors accumulate as the sequencing depth increases [10]. ONT reads often require read correction with short reads to generate complete and robust genome assemblies. Hybrid genome assemblies produced using both long and short sequencing reads (with sufficient depth of both) enhance the accuracy of assembled genomes for downstream analysis [11]. However, having access to both long and short sequencing platforms and performing two sequencing experiments on a single sample is impractical, especially in low- and middle-income countries and in clinical settings where prompt diagnoses are important. Therefore, there is a need for alternative low-cost methods to obtain more accurate genome assemblies from ONT reads.
Computational and bioinformatics tools for analysing ONT reads are freely available and rapidly expanding. These tools are a reasonable and low-cost option to reduce error rates post-assembly. They use varied algorithms designed to identify and resolve sequencing errors to produce a genome assembly that is not only complete but also accurate, though the output of the read correction step depends on the applied methods and their specific parameters [12]. Several studies have benchmarked freely available read correction tools and their impact on downstream analysis [13,14,15,16]. Among the available read correction tools, flye, medaka, and racon are the most commonly used for ONT reads. While the flye read assembly and correction tool is based on the generalized de Bruijn graph, medaka and racon were created to outperform graph-based methods, generating a genomic consensus in much less time [12,13,16]. Benchmarking freely available read correction tools is important to the scientific community because it improves analytical precision and helps resolve critical issues.
Most benchmarking studies focus on prokaryotic rather than eukaryotic genome assemblies. As ONT has become an important platform for eukaryotic DNA sequencing, allowing an in-depth analysis of complex eukaryotic DNA sequences for virulence factors and gene annotation, there is a need to benchmark the impact of read correction tools on eukaryotic genomes and their downstream analysis.
In this study, we retrieved ONT sequencing reads of four pathogenic eukaryotes, Candida albicans, Cryptococcus gattii, Saccharomyces cerevisiae, and Plasmodium falciparum, from the Sequence Read Archive (SRA)–NCBI. We evaluated the impact of applying three read correction tools (flye, medaka, and racon) on genome length, fragmentation and completeness, and gene structure prediction, and we analysed and classified eukaryotic functional proteins. The selection of these organisms was primarily motivated by the availability of high-quality ONT sequencing data in the SRA–NCBI database. This choice was further supported by their significance as model organisms, as exemplified by S. cerevisiae, and their significance as pathogens.

2. Materials and Methods

The sequencing reads (FASTQ) of four eukaryotic species, (n = 6 each), were retrieved from the SRA–NCBI (Supplementary Table S1). The sequencing reads were all generated using an ONT ligation sequencing kit (LSK-109) with R9 flow cells. The FASTQ reads were then filtered based on quality (Q score > 10) using NanoFilt (version 2.6.0) [17]. The adapters and read barcodes were then trimmed by Porechop (version 0.2.1) (https://github.com/rrwick/Porechop, accessed on 1 September 2023).
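The quality filter described above can be illustrated with a minimal Python sketch. It is not NanoFilt's implementation; it assumes Phred+33 encoded quality strings and assumes that a read's mean quality is computed by averaging per-base error probabilities before converting back to a Q score. The function names and the two example reads are hypothetical.

```python
import math


def mean_read_quality(quality_string, phred_offset=33):
    """Mean Phred score of one read, averaged in error-probability space.

    Assumption: qualities are Phred+33 encoded, as in standard FASTQ files.
    """
    # Convert each quality character to its error probability 10^(-Q/10).
    error_probs = [10 ** (-(ord(ch) - phred_offset) / 10) for ch in quality_string]
    mean_error = sum(error_probs) / len(error_probs)
    # Convert the mean error probability back to a Phred-scaled score.
    return -10 * math.log10(mean_error)


def filter_fastq_records(records, min_quality=10.0):
    """Keep (name, sequence, quality) records whose mean quality exceeds the cutoff."""
    return [rec for rec in records if rec[2] and mean_read_quality(rec[2]) > min_quality]


# Made-up reads for illustration only.
reads = [
    ("read_high", "ACGTACGT", "55555555"),  # '5' encodes Q20
    ("read_low", "ACGTACGT", "&&&&&&&&"),   # '&' encodes Q5
]
kept = filter_fastq_records(reads)
print([name for name, _, _ in kept])  # only read_high passes the Q > 10 cutoff
```

Averaging in probability space penalises reads with a few very poor bases more than a plain arithmetic mean of Q scores would, which is why the cutoff is applied to the converted value.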
The filtered and trimmed FASTQ reads were then aligned against a reference genome sequence (Supplementary Table S2) using Minimap2 (version 2.17-r941) [18]—using default parameters—in combination with bcftools (version 1.5) (https://samtools.github.io/bcftools/, accessed on 1 September 2023) and bedtools (version 2.30) (https://bedtools.readthedocs.io/en/latest/, accessed on 1 September 2023) to remove missing and/or low-coverage sites/nucleotides. Qualimap (version 2.2.2-dev) [19] was used to detect the mapping percentage in the BAM files generated in the minimap2 procedure. Reads with >85% coverage mapping against the reference genome were further analysed (Supplementary Table S3). The consensus FASTA files generated went through three rounds of read correction process with flye (version 2.8.3-b1695) with polish-target parameter, medaka (version 0.11.0) (https://github.com/nanoporetech/medaka, accessed on 1 September 2023) and racon (version 1.4.10) with no-trimming parameter [20,21].
The quality of the consensus FASTA files generated from minimap2, flye, medaka, and racon (n = 24 per species, n = 96 in total) was assessed by QUAST (version 5.0.2) using the LG parameter. The total length (bp), total aligned length (bp), and GC% were evaluated [22].
Genome completeness, duplication rate, genome fragmentation, and missing genes were evaluated by Benchmarking Universal Single-Copy Orthologues (BUSCO) (version 5.2.2) [23]. Eukaryotic gene structure annotation of the consensus FASTA files was performed with BRAKER1 (version 3.0.3) with GeneMark-ET. The generated GFF3 files containing complete coding DNA (CDs), forward CDs, reverse CDs, mRNA, and introns were then visualized with pycirclize (version 0.5.1) (https://github.com/moshi4/pyCirclize, accessed on 1 September 2023) [24,25,26,27]. InterProScan (European Molecular Biology Laboratory’s European Bioinformatics Institute) (version 5.63–95.0) was used to analyse and classify eukaryotic functional proteins using ProSiteProfiles analysis [28]. All consensus FASTA files, codes, and commands are available at https://github.com/hussainsafar/eukaryotes_read_correction, accessed on 1 September 2023.
Statistical analysis was performed in GraphPad Prism (Boston, MA, USA) (version 8.0.1) using one-way ANOVA with Bonferroni’s multiple comparison test to determine significant differences (p < 0.05, p < 0.001) among the consensus FASTA files generated by minimap2 before and after read correction with flye, medaka, and racon, in QUAST-based assembly statistics and in gene and protein detection/prediction by BRAKER1 and InterProScan.
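As a rough illustration of two of the reported metrics, the sketch below computes total length and GC% directly from FASTA text. It is not QUAST's code; in particular, excluding ambiguous bases such as N from the GC% denominator is an assumption made here, and the function names and example sequences are hypothetical.

```python
def parse_fasta(text):
    """Return {sequence_name: sequence} from FASTA-formatted text."""
    parts = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the name.
            tokens = line[1:].split()
            name = tokens[0] if tokens else ""
            parts[name] = []
        elif name is not None:
            parts[name].append(line.upper())
    return {key: "".join(chunks) for key, chunks in parts.items()}


def assembly_stats(sequences):
    """Total length (bp) and GC% over all sequences.

    Assumption: GC% is computed over unambiguous A/C/G/T bases only.
    """
    total_length = sum(len(seq) for seq in sequences.values())
    gc = sum(seq.count("G") + seq.count("C") for seq in sequences.values())
    acgt = sum(seq.count(base) for seq in sequences.values() for base in "ACGT")
    gc_percent = 100.0 * gc / acgt if acgt else 0.0
    return {"total_length": total_length, "gc_percent": gc_percent}


# Made-up consensus sequences for illustration only.
example = ">chr1 example\nACGTNN\nGGCC\n>chr2\nATAT\n"
stats = assembly_stats(parse_fasta(example))
print(stats)  # 14 bp in total; 6 G/C out of 12 unambiguous bases
```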
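To make the statistical comparison concrete, the following Python sketch computes the one-way ANOVA F statistic for several groups and applies a Bonferroni adjustment to a set of pairwise p-values. It does not reproduce GraphPad Prism; the p-value of the F statistic itself (which requires the F distribution) is not computed, and all numbers are made up for illustration.

```python
from itertools import combinations


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    all_values = [value for group in groups for value in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(
        len(group) * (mean - grand_mean) ** 2
        for group, mean in zip(groups, group_means)
    )
    # Variation of observations around their own group mean.
    ss_within = sum(
        (value - mean) ** 2
        for group, mean in zip(groups, group_means)
        for value in group
    )
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Four assembly stages give six pairwise comparisons.
tools = ["minimap2", "flye", "medaka", "racon"]
pairs = list(combinations(tools, 2))
print(len(pairs))

# Made-up measurements for three groups.
f_stat, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(round(f_stat, 3), df_b, df_w)  # F is approximately 13 with 2 and 6 degrees of freedom

# Made-up raw p-values for three comparisons.
print(bonferroni_adjust([0.004, 0.02, 0.5]))
```

The Bonferroni step is deliberately conservative: with six pairwise comparisons, a raw p-value must fall below 0.05/6 to remain significant at the 0.05 level.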

3. Results and Discussion

Eukaryotic whole genome sequencing provides comprehensive insights into their complex genomes. ONT sequencing is a practical long-read sequencing platform that enables rapid and cost-effective identification of strains, and detection of virulence factors and proteins in both research and clinical settings. However, the relatively higher error rates produced by ONT reads require computational and bioinformatics efforts to produce contiguous and accurate eukaryotic genome assemblies. In this study, we examined the effect of three rounds of read corrections using flye, medaka, and racon after assembling ONT reads to a reference genome using minimap2. The evaluation was based on the genome total length (bp) and GC% produced by QUAST, genome completeness detected by BUSCO, gene prediction by BRAKER, and protein annotation by InterProScan. We used default parameters and datasets provided by the bioinformatic tools.
QUAST analysis assessed the quality and accuracy of genome assemblies before and after three rounds of read correction. The total length (bp) was significantly (p < 0.05) higher after read alignment with minimap2 against the reference genomes than after read correction for C. gattii, S. cerevisiae, and P. falciparum (Table 1). Nevertheless, the median total length was lowest after correction with flye and significantly (p < 0.05) improved with the second and third rounds of correction with medaka and racon, respectively (Table 1). An improvement in assembly total length after read correction is commonly reported. Studies have reported improvements of up to 57% in genome assemblies; however, in this study, we noticed improvements of only 9.36% [8,16]. The variation in improvement percentage depends on various factors, such as the organism sequenced, DNA library preparation, genome assembly, and the read correction tools used. Although the total aligned length (bp) was highest after minimap2 assembly, the difference was not significant (p > 0.05) (Table 1) when compared to assemblies after read correction. Among the corrected assemblies, the total aligned length was highest after the second round of read correction with medaka and lowest after the third round with racon. The GC% was significantly higher (p < 0.05) (Table 1) after read correction with flye and decreased after the second and third rounds of read correction. In line with other studies, we previously noticed similar outcomes; although medaka and racon had significantly lower GC%, both read correction tools performed better in the overall genome assembly, especially when combined [16,29,30,31].
BUSCO provides a quantitative measure of genome completeness to evaluate the quality of genome annotation. Among the four eukaryotic species examined in this study, medaka showed improvement over minimap2 in genome completeness only in C. albicans assembled genomes (Figure 1a). When comparing the read correction tools, medaka was also superior to flye and racon in genome completeness in all four species (Figure 1). While the usage of medaka for diploid cells has been controversial because of the diploid nature of yeast, we found that the newer version of medaka provided more accurate assemblies. These results are in line with Sigova et al. [32], who reported that read correction with medaka is superior to read correction with racon in fungal pathogens. In addition, the percentage of genome completeness significantly decreases (by ~40%) when a reference is added, even after using six read correction tools [32]. Moreover, Zhang et al. showed that medaka was superior to other read correction/polishing tools, improving the continuity and reducing mismatches in S. cerevisiae-assembled genomes [33]. In all species except P. falciparum, flye was superior to racon in genome completeness and duplication rates (Figure 1). The rate of genome fragmentation was comparable in all species for all three rounds of read correction (Figure 1).
Genome completeness is mainly affected by sequencing methods and genome assembly tools rather than read correction tools [33]. The higher genome completeness observed in uncorrected assemblies in this study was due to minimap2 assembly, which is a reference-based alignment method. Other studies using de novo genome assembly methods show, given sufficient sequencing depth, the advantages of using read correction tools in BUSCO analysis [33,34].
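The BUSCO categories discussed above (complete, duplicated, fragmented, and missing) relate to one another as simple proportions of the searched orthologue groups, with complete being the sum of single-copy and duplicated hits. A minimal Python sketch, with a hypothetical function name and made-up counts rather than values from this study:

```python
def busco_percentages(single, duplicated, fragmented, missing):
    """Convert BUSCO category counts into percentages of all searched groups.

    Complete = single-copy + duplicated.
    """
    total = single + duplicated + fragmented + missing
    if total == 0:
        raise ValueError("no BUSCO groups were searched")

    def pct(count):
        return round(100.0 * count / total, 1)

    return {
        "complete": pct(single + duplicated),
        "single_copy": pct(single),
        "duplicated": pct(duplicated),
        "fragmented": pct(fragmented),
        "missing": pct(missing),
        "total_groups": total,
    }


# Made-up counts for illustration only.
summary = busco_percentages(single=240, duplicated=5, fragmented=3, missing=7)
print(summary)
```

Because all categories share the same denominator, a drop in completeness after read correction must show up as a rise in the fragmented or missing fraction, which is how the panels of Figure 1 can be read against one another.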
BRAKER1 is a bioinformatic tool commonly utilized for gene prediction in eukaryotic genomes using GeneMark-ET. Ideally, eukaryotic genome assemblies are combined with RNA-seq data to improve gene prediction accuracy. However, the ability to combine both DNA and RNA-seq data is not often available in real scenarios. Here, we performed BRAKER1 analysis on assembled and corrected genomes to evaluate the total number of CDs, forward CDs, reverse CDs, mRNA, and introns (Figure 2, Figure 3, Figure 4 and Figure 5). The total numbers of CDs, forward CDs, and reverse CDs were significantly higher after the third round of read correction with racon (p < 0.05 vs. minimap2, p < 0.001 vs. flye, and p < 0.05 vs. medaka) (Figure 2, Figure 3, Figure 4, Figure 5 and Figure 6). Surprisingly, the total number of CDs increased after the first round of read correction with flye but decreased after the second round of read correction with medaka (Figure 2, Figure 3, Figure 4 and Figure 5). In the samples of C. albicans, C. gattii, and P. falciparum, the total number of CDs after read correction with racon was higher than flye by 55273, 176705, and 63178, respectively. However, the total number of CDs in the samples of S. cerevisiae was lower after read correction with racon. The effect of genome assembly and read correction pipelines on the S. cerevisiae genome has been well characterised [33]. The authors concluded that although read correction improved contiguity and coverage, sequencing depth and choice of sequencing method affect S. cerevisiae genome annotation [33]. The number of introns showed a parallel significance pattern to the total number of CDs. The total number of introns was significantly higher after read correction with racon (p < 0.05 vs. minimap2, p < 0.001 vs. flye and medaka) (Figure 2, Figure 3, Figure 4, Figure 5 and Figure 6) in the samples of C. albicans, C. gattii and P. falciparum, but not S. cerevisiae. Similarly, Shin et al. [35] found that applying the Nanopolish read correction tool to reads assembled by the Canu-SMARTdenovo method increased the detection of CDs and introns when using MAKER2 as an annotation tool. Interestingly, the number of introns after the first round of read correction with flye was significantly higher (p < 0.05) than after genome assembly with minimap2 (Figure 6). In contrast, the number of mRNA coding genes was the highest after genome assembly with minimap2. Among the three rounds of read correction, the highest number of mRNA coding genes was detected after the second round of read correction with medaka, which was only significant against racon (p < 0.05) (Figure 2, Figure 3, Figure 4, Figure 5 and Figure 6). Given that an mRNA coding gene is ~1500 nucleotides long on average, accurate detection of mRNA coding genes is critical [36,37]. Like other coding genes, these genes undergo quality control and trimming steps to remove low-quality bases and/or adapters present in the sequencing reads. Hence, the trimming process by read correction tools can generate even smaller gene sizes which no longer map to the reference genomes in the databases. Although the number of mRNA coding genes was lower after the third round of read correction with racon, this may result from removing false-positive genes detected after genome assembly with minimap2.
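The feature counts compared above (total, forward, and reverse CDs, mRNA, and introns) can be tallied from a GFF3 file by reading the feature type (column 3) and strand (column 7) of each record. The Python sketch below is an assumption about how such a tally could be done, not the pipeline used in this study; the function name and the example records are hypothetical, and it assumes that intron features are written explicitly in the annotation.

```python
from collections import Counter


def count_gff3_features(gff3_text):
    """Count feature types in GFF3 text and split CDS records by strand."""
    counts = Counter()
    for line in gff3_text.splitlines():
        # Skip blank lines, directives, and comments.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9:
            continue  # not a well-formed nine-column GFF3 record
        feature_type, strand = fields[2], fields[6]
        counts[feature_type] += 1
        if feature_type == "CDS":
            if strand == "+":
                counts["CDS_forward"] += 1
            elif strand == "-":
                counts["CDS_reverse"] += 1
    return counts


# Made-up annotation records for illustration only.
example = "\n".join([
    "##gff-version 3",
    "\t".join(["chr1", "sketch", "mRNA", "1", "900", ".", "+", ".", "ID=t1"]),
    "\t".join(["chr1", "sketch", "CDS", "1", "300", ".", "+", "0", "Parent=t1"]),
    "\t".join(["chr1", "sketch", "intron", "301", "600", ".", "+", ".", "Parent=t1"]),
    "\t".join(["chr1", "sketch", "CDS", "601", "900", ".", "+", "0", "Parent=t1"]),
    "\t".join(["chr1", "sketch", "CDS", "1200", "1500", ".", "-", "0", "Parent=t2"]),
])
counts = count_gff3_features(example)
print(dict(counts))
```

Running the same tally on the annotation of each assembly stage gives the per-tool counts that can then be compared statistically.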
Based on BRAKER1 gene prediction accuracy results, we investigated the effect of read correction tools on protein annotation by InterProScan with ProSiteProfiles analyses, describing protein domains, families, and functional sites. The overall hits of protein annotation were improved with each round of read correction in all four species, with racon being the top-performing read correction tool (Figure 7a). Several protein annotations were only detected after applying a read correction to the assembled genomes, such as TGF-beta binding (IPR017878), colipase family (IPR001981), and Cytochrome c class II (IPR002321) in C. gattii samples; streptavidin (IPR005468), Cytochrome c, class II (IPR002321), and GATA-type zinc finger (IPR000679) in S. cerevisiae; and platelet-derived growth factor (PDGF) (IPR000072), coronaviridae zinc-binding (CV ZBD) (IPR000072), GATA-type zinc finger (IPR000679), and C-terminal cystine knot (IPR006207) in P. falciparum samples (Figure 7a). Protein annotation hits of IPR002321 detected by medaka were significantly (p < 0.05) higher than minimap2, flye, and racon in C. albicans, whereas protein annotation hits of IPR00724 and IPR001002 detected by medaka were significantly (p < 0.05) higher than minimap2, and protein annotation hits of IPR002321 detected by medaka and racon were significantly (p < 0.05) higher than minimap2 and flye (Figure 7b). In S. cerevisiae samples, protein annotation hits of IPR007112 detected by racon were significantly higher than hits detected by minimap2 (Figure 7b). Protein annotation hits of IPR001938 detected by medaka were significantly (p < 0.05) higher than hits detected by flye in P. falciparum samples (Figure 7b).
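Counting protein annotation hits per InterPro entry, as reported above, amounts to tallying accessions in the annotation output. The Python sketch below assumes the tab-separated InterProScan 5 output in which column 4 holds the analysis name and column 12 holds the InterPro accession (present only when InterPro lookup is enabled); it is an illustration, not the code used in this study. The function name, protein identifiers, signature identifiers, and scores in the example rows are made up.

```python
from collections import Counter


def count_interpro_hits(tsv_text, analysis="ProSiteProfiles"):
    """Tally InterPro accessions for one member-database analysis.

    Assumption: column 4 is the analysis name and column 12 the InterPro
    accession, with '-' written when no InterPro entry is assigned.
    """
    hits = Counter()
    for line in tsv_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # row lacks the InterPro annotation columns
        if fields[3] != analysis:
            continue
        accession = fields[11]
        if accession.startswith("IPR"):
            hits[accession] += 1
    return hits


def make_row(protein, analysis, signature, interpro):
    """Build one made-up 13-column row for the example below."""
    return "\t".join([
        protein, "md5_placeholder", "350", analysis, signature, "example signature",
        "10", "60", "12.5", "T", "01-09-2023", interpro, "example entry",
    ])


example = "\n".join([
    make_row("g1.t1", "ProSiteProfiles", "PS_EXAMPLE_1", "IPR000679"),
    make_row("g2.t1", "ProSiteProfiles", "PS_EXAMPLE_1", "IPR000679"),
    make_row("g3.t1", "ProSiteProfiles", "PS_EXAMPLE_2", "IPR002321"),
    make_row("g4.t1", "Pfam", "PF_EXAMPLE", "IPR000679"),         # other analysis, skipped
    make_row("g5.t1", "ProSiteProfiles", "PS_EXAMPLE_3", "-"),    # no InterPro entry, skipped
])
hits = count_interpro_hits(example)
print(dict(hits))
```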
To our knowledge, this is the first study to evaluate the effect of read correction tools for long reads on gene prediction using BRAKER1 and protein annotation using InterProScan. Although BUSCO analysis showed superior genome completeness in uncorrected assemblies, we found that read correction tools offer advantages over uncorrected assemblies in BRAKER1 gene detection and protein annotation using InterProScan with ProSiteProfiles analysis. In this study, we showed that genome accuracy after three rounds of read correction is more vital for gene prediction and protein annotation than genome completeness. We showed that gene prediction accuracy relies on the quality of assembled genomes after read correction rather than the quantity or the number of genes present after genome assembly. In other words, a more accurate genome assembly leads to more reliable gene prediction and protein annotation [38,39]. However, the gene completeness analysis could still be improved. The development of more robust read assembly and read correction tools and pipelines is still an area to explore. Studies have shown that mixing and matching freely available read assembly and read correction tools significantly improves not only assembly parameters, but also antimicrobial resistance gene detection, plasmid identification, and pan-genome analysis, with and without using short sequencing reads for read correction [14,16,40,41,42]. In addition, adjusting the parameters of read assembly and/or read correction tools could be beneficial. Schiavone et al. [43] have documented the importance of applying ‘tailored’ bioinformatics analysis. Obtaining complete sequences of the chromosome and plasmid of Salmonella enterica was possible by modifying the corErrorRate and corMinCoverage parameters in the Canu assembler [43].
In addition, improving the sequencing platform itself can reduce sequencing error rates and increase accuracy, which has been observed throughout the development of ONT, from the production of R6 flow cells until now [44]. ONT has recently introduced R10.4.1 flow cells with a quality score >20. The preliminary outcome of these flow cells is very encouraging [45]. R10 flow cells outperform R9 flow cells, achieving a genome accuracy of >99% [45,46]. However, to achieve near-complete genomes, short reads may still be required for read correction [47]. The performance of the new R10 flow cells is still being investigated, and their combination with different read assembly and read correction tools is yet to be evaluated.

4. Conclusions

The rapid development of whole-genome sequencing platforms has revolutionised their usage and application in research and clinical settings. Using both short and long sequencing reads to produce hybrid genome assemblies is a very robust method for gene detection and protein annotation. However, access to both short- and long-read sequencing platforms is often unrealistic, especially in low- and mid-income countries. ONT serves as a reliable and relatively inexpensive long-read sequencing platform. However, the major drawback of this platform is its relatively high error rate. Therefore, improving the sequencing reads generated by ONT with computational and bioinformatics tools is a logical and cost-effective option.
Numerous long-read correction tools are regularly developed with the aim of achieving robust genome assemblies. These tools often use different bioinformatic algorithms. Benchmarking the freely available read correction tools is very important and drives the research field towards better analytical resolution. This study showed that, for gene prediction and protein annotation, genome quality is more important than genome completeness. Although genome completeness was significantly higher before read correction, significant improvement in gene prediction and protein annotation in eukaryotic genomes was noticeable after the second and third rounds of read correction. However, the assembled genomes can still be improved for better outcomes. Therefore, the investigation of several read correction tool combinations is required, along with the improvement of ONT sequencing technology.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/microorganisms12020247/s1, Table S1: The SRA numbers of sequencing reads (FASTQ) of four eukaryotic species, C. albicans, C. gattii, S. cerevisiae, and P. falciparum (n = 6 each), retrieved from the SRA–NCBI; Table S2: NCBI references used for the four eukaryotic species, C. albicans, C. gattii, S. cerevisiae, and P. falciparum; Table S3: Read mapping coverage percentage against the appropriate reference genome as detected by qualimap. Refs. [48,49,50,51,52,53,54,55] are cited in the Supplementary Materials.

Author Contributions

Conceptualization, H.A.S.; methodology, H.A.S. and F.A.; software, H.A.S.; validation, A.S.M.; formal analysis, H.A.S.; investigation, H.A.S.; resources, H.A.S.; data curation, H.A.S. and F.A.; writing—original draft preparation, H.A.S.; writing—review and editing, F.A. and A.S.M.; visualization, H.A.S. and F.A.; supervision, A.S.M.; funding acquisition, A.S.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data are contained within the article and Supplementary Materials.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Runtuwene, L.R.; Tuda, J.S.B.; Mongan, A.E.; Makalowski, W.; Frith, M.C.; Imwong, M.; Srisutham, S.; Thi, L.A.N.; Tuan, N.N.; Eshita, Y.; et al. Nanopore sequencing of drug-resistance-associated genes in malaria parasites, Plasmodium falciparum. Sci. Rep. 2018, 8, 8286.
  2. Stevanovski, I.; Chintalaphani, S.R.; Gamaarachchi, H.; Ferguson, J.M.; Pineda, S.S.; Scriba, C.K.; Tchan, M.; Fung, V.; Ng, K.; Cortese, A.; et al. Comprehensive genetic diagnosis of tandem repeat expansion disorders with programmable targeted nanopore sequencing. Sci. Adv. 2022, 8, eabm5386.
  3. Charalampous, T.; Kay, G.L.; Richardson, H.; Aydin, A.; Baldan, R.; Jeanes, C.; Rae, D.; Grundy, S.; Turner, D.J.; Wain, J.; et al. Nanopore metagenomics enables rapid clinical diagnosis of bacterial lower respiratory infection. Nat. Biotechnol. 2019, 37, 783–792.
  4. Cheng, H.; Sun, Y.; Yang, Q.; Deng, M.; Yu, Z.; Zhu, G.; Qu, J.; Liu, L.; Yang, L.; Xia, Y. A rapid bacterial pathogen and antimicrobial resistance diagnosis workflow using Oxford nanopore adaptive sequencing method. Brief. Bioinform. 2022, 23, bbac453.
  5. Zhao, W.; Zeng, W.; Pang, B.; Luo, M.; Peng, Y.; Xu, J.; Kan, B.; Li, Z.; Lu, X. Oxford nanopore long-read sequencing enables the generation of complete bacterial and plasmid genomes without short-read sequencing. Front. Microbiol. 2023, 14, 1179966.
  6. Salzberg, S.L.; Phillippy, A.M.; Zimin, A.; Puiu, D.; Magoc, T.; Koren, S.; Treangen, T.J.; Schatz, M.C.; Delcher, A.L.; Roberts, M.; et al. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res. 2011, 22, 557–567.
  7. Ashton, P.M.; Nair, S.; Dallman, T.; Rubino, S.; Rabsch, W.; Mwaigwisya, S.; Wain, J.; O’Grady, J. MinION nanopore sequencing identifies the position and structure of a bacterial antibiotic resistance island. Nat. Biotechnol. 2014, 33, 296–300.
  8. Wang, Y.; Zhao, Y.; Bollas, A.; Wang, Y.; Au, K.F. Nanopore sequencing technology, bioinformatics and applications. Nat. Biotechnol. 2021, 39, 1348–1365.
  9. Delahaye, C.; Nicolas, J. Sequencing DNA with nanopores: Troubles and biases. PLoS ONE 2021, 16, e0257521.
  10. Sutton, J.M.; Millwood, J.D.; McCormack, A.C.; Fierst, J.L. Optimizing experimental design for genome sequencing and assembly with Oxford Nanopore Technologies. Gigabyte 2021, 2021, 1–26.
  11. Brown, C.L.; Keenum, I.M.; Dai, D.; Zhang, L.; Vikesland, P.J.; Pruden, A. Critical evaluation of short, long, and hybrid assembly for contextual analysis of antibiotic resistance genes in complex environmental metagenomes. Sci. Rep. 2021, 11, 3753.
  12. Dohm, J.C.; Peters, P.; Stralis-Pavese, N.; Himmelbauer, H. Benchmarking of long-read correction methods. NAR Genom. Bioinform. 2020, 2, lqaa037.
  13. Cherukuri, Y.; Janga, S.C. Benchmarking of de novo assembly algorithms for Nanopore data reveals optimal performance of OLC approaches. BMC Genom. 2016, 17, 507.
  14. Juraschek, K.; Borowiak, M.; Tausch, S.H.; Malorny, B.; Käsbohrer, A.; Otani, S.; Schwarz, S.; Meemken, D.; Deneke, C.; Hammerl, J.A. Outcome of Different Sequencing and Assembly Approaches on the Detection of Plasmids and Localization of Antimicrobial Resistance Genes in Commensal Escherichia coli. Microorganisms 2021, 9, 598.
  15. Wick, R.R.; Holt, K.E. Benchmarking of long-read assemblers for prokaryote whole genome sequencing. F1000Research 2021, 8, 2138.
  16. Safar, H.A.; Alatar, F.; Nasser, K.; Al-Ajmi, R.; Alfouzan, W.; Mustafa, A.S. The impact of applying various de novo assembly and correction tools on the identification of genome characterization, drug resistance, and virulence factors of clinical isolates using ONT sequencing. BMC Biotechnol. 2023, 23, 26.
  17. De Coster, W.; D’Hert, S.; Schultz, D.T.; Cruts, M.; Van Broeckhoven, C. NanoPack: Visualizing and processing long-read sequencing data. Bioinformatics 2018, 34, 2666–2669.
  18. Li, H. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 2018, 34, 3094–3100.
  19. García-Alcalde, F.; Okonechnikov, K.; Carbonell, J.; Cruz, L.M.; Götz, S.; Tarazona, S.; Dopazo, J.; Meyer, T.F.; Conesa, A. Qualimap: Evaluating next-generation sequencing alignment data. Bioinformatics 2012, 28, 2678–2679.
  20. Kolmogorov, M.; Yuan, J.; Lin, Y.; Pevzner, P.A. Assembly of long, error-prone reads using repeat graphs. Nat. Biotechnol. 2019, 37, 540–546.
  21. Vaser, R.; Sović, I.; Nagarajan, N.; Šikić, M. Fast and accurate de novo genome assembly from long uncorrected reads. Genome Res. 2017, 27, 737–746.
  22. Mikheenko, A.; Prjibelski, A.; Saveliev, V.; Antipov, D.; Gurevich, A. Versatile genome assembly evaluation with QUAST-LG. Bioinformatics 2018, 34, i142–i150.
  23. Manni, M.; Berkeley, M.R.; Seppey, M.; Zdobnov, E.M. BUSCO: Assessing Genomic Data Quality and Beyond. Curr. Protoc. 2021, 1, e323.
  24. Stanke, M.; Schöffmann, O.; Morgenstern, B.; Waack, S. Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinform. 2006, 7, 62.
  25. Buchfink, B.; Xie, C.; Huson, D.H. Fast and sensitive protein alignment using DIAMOND. Nat. Methods 2014, 12, 59–60.
  26. Hoff, K.J.; Lange, S.; Lomsadze, A.; Borodovsky, M.; Stanke, M. BRAKER1: Unsupervised RNA-Seq-Based Genome Annotation with GeneMark-ET and AUGUSTUS. Bioinformatics 2015, 32, 767–769.
  27. Hoff, K.J.; Lomsadze, A.; Borodovsky, M.; Stanke, M. Whole-genome annotation with Braker. Methods Mol. Biol. 2019, 1962, 65–95.
  28. Jones, P.; Binns, D.; Chang, H.-Y.; Fraser, M.; Li, W.; McAnulla, C.; McWilliam, H.; Maslen, J.; Mitchell, A.; Nuka, G.; et al. InterProScan 5: Genome-scale protein function classification. Bioinformatics 2014, 30, 1236–1240.
  29. Chen, Z.; Erickson, D.L.; Meng, J. Benchmarking Long-Read Assemblers for Genomic Analyses of Bacterial Pathogens Using Oxford Nanopore Sequencing. Int. J. Mol. Sci. 2020, 21, 9161.
  30. Cali, D.S.; Kim, J.S.; Ghose, S.; Alkan, C.; Mutlu, O. Nanopore sequencing technology and tools for genome assembly: Computational analysis of the current state, bottlenecks and future directions. Brief. Bioinform. 2018, 20, 1542–1559.
  31. Lee, J.Y.; Kong, M.; Oh, J.; Lim, J.; Chung, S.H.; Kim, J.-M.; Kim, J.-S.; Kim, K.-H.; Yoo, J.-C.; Kwak, W. Comparative evaluation of Nanopore polishing tools for microbial genome assembly and polishing strategies for downstream analysis. Sci. Rep. 2021, 11, 20740.
  32. Sigova, E.A.; Pushkova, E.N.; Rozhmina, T.A.; Kudryavtseva, L.P.; Zhuchenko, A.A.; Novakovskiy, R.O.; Zhernova, D.A.; Povkhova, L.V.; Turba, A.A.; Borkhert, E.V.; et al. Assembling Quality Genomes of Flax Fungal Pathogens from Oxford Nanopore Technologies Data. J. Fungi 2023, 9, 301.
  33. Zhang, X.; Liu, C.-G.; Yang, S.-H.; Wang, X.; Bai, F.-W.; Wang, Z. Benchmarking of long-read sequencing, assemblers and polishers for yeast genome. Brief. Bioinform. 2022, 23, bbac146.
  34. Siadjeu, C.; Pucker, B.; Viehöver, P.; Albach, D.C.; Weisshaar, B. High Contiguity de novo Genome Sequence Assembly of Trifoliate Yam (Dioscorea dumetorum) Using Long Read Sequencing. Genes 2020, 11, 274.
  35. Shin, S.C.; Kim, H.; Lee, J.H.; Kim, H.-W.; Park, J.; Choi, B.-S.; Lee, S.-C.; Kim, J.H.; Lee, H.; Kim, S. Nanopore sequencing reads improve assembly and gene annotation of the Parochlus steinenii genome. Sci. Rep. 2019, 9, 5095.
  36. Hereford, L.M.; Rosbash, M. Number and distribution of polyadenylated RNA sequences in yeast. Cell 1977, 10, 453–462.
  37. von der Haar, T. A quantitative estimation of the global translational activity in logarithmically growing yeast cells. BMC Syst. Biol. 2008, 2, 87.
  38. Steward, C.A.; Parker, A.P.J.; Minassian, B.A.; Sisodiya, S.M.; Frankish, A.; Harrow, J. Genome annotation for clinical genomic diagnostics: Strengths and weaknesses. Genome Med. 2017, 9, 49. [Google Scholar] [CrossRef]
  39. Wingfield, B.D.; Berger, D.K.; Coetzee, M.P.A.; Duong, T.A.; Martin, A.; Pham, N.Q.; Berg, N.v.D.; Wilken, P.M.; Arun-Chinnappa, K.S.; Barnes, I.; et al. IMA genome-F17. IMA Fungus 2022, 13, 19. [Google Scholar] [CrossRef]
  40. Goldstein, S.; Beka, L.; Graf, J.; Klassen, J.L. Evaluation of strategies for the assembly of diverse bacterial genomes using MinION long-read sequencing. BMC Genom. 2019, 20, 23. [Google Scholar] [CrossRef]
  41. Chen, Z.; Erickson, D.L.; Meng, J. Benchmarking hybrid assembly approaches for genomic analyses of bacterial pathogens using Illumina and Oxford Nanopore sequencing. BMC Genom. 2020, 21, 631. [Google Scholar] [CrossRef] [PubMed]
  42. Wang, J.; Chen, K.; Ren, Q.; Zhang, Y.; Liu, J.; Wang, G.; Liu, A.; Li, Y.; Liu, G.; Luo, J.; et al. Systematic Comparison of the Performances of De Novo Genome Assemblers for Oxford Nanopore Technology Reads From Piroplasm. Front. Cell. Infect. Microbiol. 2021, 11, 696669. [Google Scholar] [CrossRef] [PubMed]
  43. Schiavone, A.; Pugliese, N.; Samarelli, R.; Cumbo, C.; Minervini, C.F.; Albano, F.; Camarda, A. Factors Affecting the Quality of Bacterial Genomes Assemblies by Canu after Nanopore Sequencing. Appl. Sci. 2022, 12, 3110. [Google Scholar] [CrossRef]
  44. Deamer, D.; Akeson, M.; Branton, D. Three decades of nanopore sequencing. Nat. Biotechnol. 2016, 34, 518–524. [Google Scholar] [CrossRef] [PubMed]
  45. Zhang, T.; Li, H.; Ma, S.; Cao, J.; Liao, H.; Huang, Q.; Chen, W. The newest Oxford Nanopore R10.4.1 full-length 16S rRNA sequencing enables the accurate resolution of species-level microbial community profiling. Appl. Environ. Microbiol. 2023, 89, e0060523. [Google Scholar] [CrossRef] [PubMed]
  46. Ni, Y.; Liu, X.; Simeneh, Z.M.; Yang, M.; Li, R. Benchmarking of Nanopore R10.4 and R9.4.1 flow cells in single-cell whole-genome amplification and whole-genome shotgun sequencing. Comput. Struct. Biotechnol. J. 2023, 21, 2352–2364. [Google Scholar] [CrossRef] [PubMed]
  47. Sereika, M.; Kirkegaard, R.H.; Karst, S.M.; Michaelsen, T.Y.; Sørensen, E.A.; Wollenberg, R.D.; Albertsen, M. Oxford Nanopore R10.4 long-read sequencing enables the generation of near-finished bacterial genomes from pure cultures and metagenomes without short-read or reference polishing. Nat. Methods 2022, 19, 823–826. [Google Scholar] [CrossRef]
  48. Wang, Y.; Chen, T.; Zhang, S.; Zhang, L.; Li, Q.; Lv, Q.; Kong, D.; Jiang, H.; Ren, Y.; Jiang, Y.; et al. Clinical evaluation of metagenomic next-generation sequencing in unbiased pathogen diagnosis of urinary tract infection. J. Transl. Med. 2023, 21, 762. [Google Scholar] [CrossRef]
  49. Panthee, S.; Hamamoto, H.; Ishijima, A.S.; Paudel, A.; Sekimizu, K. Utilization of Hybrid Assembly Approach to Determine the Genome of an Opportunistic Pathogenic Fungus, Candida albicans TIMM 1768. Genome Biol. Evol. 2018, 10, 2017–2022. [Google Scholar] [CrossRef]
  50. Rizzo, M.; Soisangwan, N.; Vega-Estevez, S.; Price, R.J.; Uyl, C.; Iracane, E.; Shaw, M.; Soetaert, J.; Selmecki, A.; Buscaino, A. Stress combined with loss of the Candida albicans SUMO protease Ulp2 triggers selection of aneuploidy via a two-step process. PLoS Genet. 2022, 18, e1010576. [Google Scholar] [CrossRef]
  51. Schotanus, K.; Heitman, J. Centromere deletion in Cryptococcus deuterogattii leads to neocentromere formation and chromosome fusions. eLife 2020, 9, e56026. [Google Scholar] [CrossRef] [PubMed]
  52. Farrer, R.A.; Chang, M.; Davis, M.J.; van Dorp, L.; Yang, D.-H.; Shea, T.; Sewell, T.R.; Meyer, W.; Balloux, F.; Edwards, H.M.; et al. A New Lineage of Cryptococcus gattii (VGV) Discovered in the Central Zambezian Miombo Woodlands. mBio 2019, 10, e02306–e02319. [Google Scholar] [CrossRef] [PubMed]
  53. Salazar, A.N.; de Vries, A.R.G.; Broek, M.v.D.; Wijsman, M.; Cortés, P.d.l.T.; Brickwedde, A.; Brouwers, N.; Daran, J.-M.G.; Abeel, T. Nanopore sequencing enables near-complete de novo assembly of Saccharomyces cerevisiae reference strain CEN.PK113-7D. FEMS Yeast Res. 2017, 17, fox074. [Google Scholar] [CrossRef] [PubMed]
  54. Dans, M.G.; Piirainen, H.; Nguyen, W.; Khurana, S.; Mehra, S.; Razook, Z.; Geoghegan, N.D.; Dawson, A.T.; Das, S.; Schneider, M.P.; et al. Sulfonylpiperazine compounds prevent Plasmodium falciparum invasion of red blood cells through interference with actin-1/profilin dynamics. PLoS Biol. 2023, 21, e3002066. [Google Scholar] [CrossRef]
  55. De Meulenaere, K.; Cuypers, W.L.; Gauglitz, J.M.; Guetens, P.; Rosanas-Urgell, A.; Laukens, K.; Cuypers, B. Selective whole-genome sequencing of Plasmodium parasites directly from blood samples by nanopore adaptive sampling. mBio 2023, e0196723. [Google Scholar] [CrossRef]
Figure 1. BUSCO analysis detecting genome completeness, genome duplication, fragmented genes, and missing genes in (a) C. albicans, (b) C. gattii, (c) S. cerevisiae, and (d) P. falciparum samples. mm = uncorrected minimap2, mmf = minimap2 corrected with flye, mmfm = minimap2 corrected with flye + medaka, and mmfmr = minimap2 corrected with flye + medaka + racon.
Figure 2. BRAKER1 analysis detecting forward-strand CDSs, reverse-strand CDSs, mRNAs, and introns in C. albicans. (a) Sample 1, (b) sample 2, (c) sample 3, (d) sample 4, (e) sample 5, and (f) sample 6. One-way ANOVA with Bonferroni’s multiple comparison test was performed to determine significant differences (p < 0.05, p < 0.001) among the groups.
Figure 3. BRAKER1 analysis detecting forward-strand CDSs, reverse-strand CDSs, mRNAs, and introns in C. gattii. (a) Sample 1, (b) sample 2, (c) sample 3, (d) sample 4, (e) sample 5, and (f) sample 6. One-way ANOVA with Bonferroni’s multiple comparison test was performed to determine significant differences (p < 0.05, p < 0.001) among the groups.
Figure 4. BRAKER1 analysis detecting forward-strand CDSs, reverse-strand CDSs, mRNAs, and introns in S. cerevisiae. (a) Sample 1, (b) sample 2, (c) sample 3, (d) sample 4, (e) sample 5, and (f) sample 6. One-way ANOVA with Bonferroni’s multiple comparison test was performed to determine significant differences (p < 0.05, p < 0.001) among the groups.
Figure 5. BRAKER1 analysis detecting forward-strand CDSs, reverse-strand CDSs, mRNAs, and introns in P. falciparum. (a) Sample 1, (b) sample 2, (c) sample 3, (d) sample 4, (e) sample 5, and (f) sample 6. One-way ANOVA with Bonferroni’s multiple comparison test was performed to determine significant differences (p < 0.05, p < 0.001) among the groups.
Figure 6. Heatmap of the statistical analysis of the BRAKER1 results. One-way ANOVA with Bonferroni’s multiple comparison test was performed to determine significant differences (p < 0.05, p < 0.001) among minimap2 assemblies before and after read correction with flye, medaka and racon.
Figure 7. InterProScan analysis using ProProfile for protein annotation in C. albicans, C. gattii, S. cerevisiae, and P. falciparum: (a) number of hits detected, and (b) significant differences among read correction methods. One-way ANOVA with Bonferroni’s multiple comparison test was performed to determine significant differences (p < 0.05, p < 0.001) among the groups.
Table 1. Total length (bp), total aligned (bp), and GC% of ONT-sequencing reads aligned with minimap2 before and after applying read correction tools.
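| Species | Sample | Minimap2 (not corrected): total length (bp) | Minimap2 (not corrected): total aligned (bp) | Minimap2 (not corrected): GC% | Flye: total length (bp) | Flye: total aligned (bp) | Flye: GC% | Flye + Medaka: total length (bp) | Flye + Medaka: total aligned (bp) | Flye + Medaka: GC% | Flye + Medaka + Racon: total length (bp) | Flye + Medaka + Racon: total aligned (bp) | Flye + Medaka + Racon: GC% |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| C. albicans | Sample 1 | 14268731 | 14255757 | 33.45 | 14272767 | 14231426 | 33.49 | 14317735 | 14250916 | 33.43 | 14319429 | 14255001 | 33.43 |
| C. albicans | Sample 2 | 14251618 | 14238188 | 33.46 | 14298244 | 14246769 | 33.5 | 14341847 | 14262678 | 33.43 | 14356382 | 14250132 | 33.42 |
| C. albicans | Sample 3 | 14275154 | 14217242 | 33.42 | 14240646 | 14111615 | 33.51 | 14320530 | 14166519 | 33.38 | 14312009 | 14138998 | 33.4 |
| C. albicans | Sample 4 | 14280549 | 14226612 | 33.4 | 14263763 | 14211021 | 33.34 | 14345200 | 14272900 | 33.2 | 14318448 | 14241382 | 33.17 |
| C. albicans | Sample 5 | 14268190 | 14182812 | 33.4 | 14218801 | 14102192 | 33.48 | 14287333 | 14155066 | 33.33 | 14304631 | 14158562 | 33.29 |
| C. albicans | Sample 6 | 14267575 | 14183870 | 33.41 | 14206012 | 14097160 | 33.5 | 14265963 | 14144176 | 33.37 | 14275276 | 14126308 | 33.33 |
| C. albicans | median | 14268460.5 | 14221927 | 33.415 | 14252204.5 | 14161318 | 33.495 | 14319132.5 | 14208717.5 | 33.375 | 14315228.5 | 14199972 | 33.365 |
| C. gattii | Sample 1 | 18374056 | 13963456 | 47.95 | 15618076 | 3018791 | 45.87 | 15848723 | 1127999 | 45.39 | 15649875 | 979468 | 45.62 |
| C. gattii | Sample 2 | 18373936 | 16738202 | 47.87 | 17275771 | 2829797 | 47.74 | 17401496 | 3193078 | 47.65 | 17335832 | 2478663 | 47.64 |
| C. gattii | Sample 3 | 18373817 | 16748750 | 47.87 | 17249122 | 2811973 | 47.77 | 17403823 | 2993154 | 47.66 | 17331969 | 2314973 | 47.69 |
| C. gattii | Sample 4 | 18373586 | 16911300 | 47.86 | 17292994 | 3406667 | 47.78 | 17435803 | 3916947 | 47.7 | 17395149 | 2892625 | 47.71 |
| C. gattii | Sample 5 | 18371784 | 17309929 | 47.88 | 17667739 | 10488842 | 47.95 | 17746664 | 10916558 | 47.91 | 17719423 | 10110195 | 47.82 |
| C. gattii | Sample 6 | 18374011 | 15590434 | 47.88 | 17093485 | 3649510 | 47.47 | 17283501 | 3129355 | 47.09 | 17341085 | 2347707 | 47.07 |
| C. gattii | median | 18373876.5 | 16743476 | 47.875 | 17262446.5 | 3212729 | 47.755 | 17402659.5 | 3161216.5 | 47.655 | 17338458.5 | 2413185 | 47.665 |
| S. cerevisiae | Sample 1 | 11900917 | 11786751 | 38.26 | 11756094 | 11627598 | 38.37 | 11762061 | 11614289 | 38.27 | 11770518 | 11610040 | 38.24 |
| S. cerevisiae | Sample 2 | 11927452 | 11786979 | 38.22 | 11817583 | 11391169 | 38.31 | 11835515 | 11392663 | 38.24 | 11841970 | 11389993 | 38.23 |
| S. cerevisiae | Sample 3 | 11867150 | 11717686 | 38.28 | 11714984 | 11611725 | 38.37 | 11728646 | 11591392 | 38.3 | 11734569 | 11542743 | 38.2 |
| S. cerevisiae | Sample 4 | 12048365 | 11746218 | 38.27 | 11701491 | 11557641 | 38.31 | 11744219 | 11530244 | 38.26 | 11726032 | 11472823 | 38.12 |
| S. cerevisiae | Sample 5 | 11848014 | 11728342 | 38.26 | 11844556 | 11579727 | 38.37 | 11847283 | 11568021 | 38.25 | 11841609 | 11544386 | 38.21 |
| S. cerevisiae | Sample 6 | 11898828 | 11680204 | 38.27 | 11650537 | 11518215 | 38.35 | 11683391 | 11519687 | 38.23 | 11676382 | 11483435 | 38.13 |
| S. cerevisiae | median | 11899872.5 | 11737280 | 38.265 | 11735539 | 11568684 | 38.36 | 11753140 | 11549132.5 | 38.255 | 11752543.5 | 11513089 | 38.205 |
| P. falciparum | Sample 1 | 23184099 | 23030452 | 19.3 | 22783133 | 22726603 | 19.63 | 23110345 | 23037187 | 19.36 | 23277887 | 23197642 | 19.16 |
| P. falciparum | Sample 2 | 23244418 | 23191818 | 19.33 | 22846745 | 22827099 | 19.64 | 23103471 | 23077430 | 19.44 | 23251109 | 23206304 | 19.29 |
| P. falciparum | Sample 3 | 23278091 | 23119804 | 19.27 | 22794879 | 22740838 | 19.59 | 23071068 | 22992830 | 19.36 | 23170782 | 23115122 | 19.2 |
| P. falciparum | Sample 4 | 23266743 | 23186289 | 19.33 | 22843636 | 22817074 | 19.64 | 23082452 | 23052262 | 19.44 | 23222395 | 23183221 | 19.29 |
| P. falciparum | Sample 5 | 23167744 | 22187311 | 19.55 | 22597393 | 22360095 | 19.53 | 22902857 | 22526387 | 19.36 | 22879437 | 22148919 | 19.29 |
| P. falciparum | Sample 6 | 23193836 | 20645915 | 19.63 | 21278952 | 20848467 | 19.64 | 22021131 | 21099604 | 19.32 | 21995137 | 20265232 | 19.27 |
| P. falciparum | median | 23219127 | 23075128 | 19.33 | 22789006 | 22733720.5 | 19.635 | 23076760 | 23015008.5 | 19.36 | 23196588.5 | 23149171.5 | 19.28 |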
QUAST-based assembly statistics for the C. albicans, C. gattii, S. cerevisiae, and P. falciparum genomes assembled with minimap2, before and after read correction with flye, medaka, and racon. One-way ANOVA with Bonferroni’s multiple comparison test was performed to determine significant differences (p < 0.05, p < 0.001) among the groups.
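As a minimal sketch of how the median rows of Table 1 can be checked, the snippet below recomputes the median total aligned length (bp) for the six C. gattii samples at each correction stage. The values are copied from Table 1. The dictionary layout, the variable names, and the reuse of the stage labels from Figure 1 (mm, mmf, mmfm, mmfmr) as keys are illustrative assumptions, not part of the authors' pipeline.

```python
from statistics import median

# Total aligned (bp) for the six C. gattii samples, copied from Table 1.
# Keys reuse the stage abbreviations of Figure 1 (an illustrative choice):
# mm = uncorrected minimap2, mmf = + flye, mmfm = + medaka, mmfmr = + racon.
c_gattii_total_aligned = {
    "mm": [13963456, 16738202, 16748750, 16911300, 17309929, 15590434],
    "mmf": [3018791, 2829797, 2811973, 3406667, 10488842, 3649510],
    "mmfm": [1127999, 3193078, 2993154, 3916947, 10916558, 3129355],
    "mmfmr": [979468, 2478663, 2314973, 2892625, 10110195, 2347707],
}

# With six samples, the median is the mean of the two middle values.
medians = {
    stage: median(values) for stage, values in c_gattii_total_aligned.items()
}

for stage, value in medians.items():
    print(f"{stage}: {value}")
```

The computed values agree with the median row reported for C. gattii in Table 1 (16743476, 3212729, 3161216.5, and 2413185 bp).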

MDPI and ACS Style

Safar, H.A.; Alatar, F.; Mustafa, A.S. Three Rounds of Read Correction Significantly Improve Eukaryotic Protein Detection in ONT Reads. Microorganisms 2024, 12, 247. https://doi.org/10.3390/microorganisms12020247
