Three Rounds of Read Correction Significantly Improve Eukaryotic Protein Detection in ONT Reads

Background: Whole-genome sequencing of eukaryotes is crucial for species identification, gene detection, and protein annotation. Oxford Nanopore Technology (ONT) is an affordable and rapid platform for sequencing eukaryotes; however, its relatively high error rates require computational and bioinformatic efforts to produce more accurate genome assemblies. Here, we evaluated the effect of read correction tools on eukaryotic genome completeness, gene detection, and protein annotation. Methods: ONT reads of four eukaryotes, C. albicans, C. gattii, S. cerevisiae, and P. falciparum, were assembled using minimap2 and underwent three rounds of read correction using flye, medaka, and racon. The generated consensus FASTA files were compared for total length (bp), genome completeness, gene detection, and protein annotation by QUAST, BUSCO, BRAKER1, and InterProScan, respectively. Results: Genome completeness depended on the assembly method rather than on the read correction tool; however, medaka performed better than flye and racon. Racon performed significantly better than flye and medaka in gene detection, while both racon and medaka performed significantly better than flye in protein annotation. Conclusion: We show that three rounds of read correction significantly affect gene detection and protein annotation, which depend on assembly quality rather than on assembly completeness.


Introduction
Oxford Nanopore Technology (ONT), a third-generation sequencing technology, serves as a platform to sequence small to large and multiplexed genomes and is widely used globally, especially in low- and middle-income countries, due to its simplicity, feasibility, and sustainability in both medical research and clinical settings [1,2]. The main advantage of ONT is real-time analysis through the user-friendly interface EPI2ME Agent, which requires no bioinformatic expertise and allows rapid microbe identification and detection of antimicrobial resistance (AMR) genes [3,4]. The agile and simple library preparation for ONT sequencing, without the biased PCR amplification step, is another significant advantage [5]. Furthermore, ONT overcomes the problems observed in next-generation sequencing (NGS) when sequencing genomic repeats, which lead to incompletely assembled genomes [6]. ONT sequencing generates reads that are 'long enough' to exceed the length of repeated regions and produces near-complete assemblies in which the location of resistance genes (i.e., chromosomal vs. plasmid) can be determined [7,8].
Despite the advantages of ONT and the rapid advancement of the technology since its development, its major shortcoming is a relatively high error rate (~10-15%) compared to NGS when using R9 flow cells [9]. Although increasing the depth of ONT reads can produce contiguous assembled genomes, the errors accumulate as the sequencing depth increases [10]. ONT reads often require read correction with short reads to generate complete and robust genome assemblies. Hybrid genome assemblies produced using both long and short sequencing reads (with sufficient depth of both) enhance the accuracy of assembled genomes for downstream analysis [11]. However, having access to both long- and short-read sequencing platforms and performing two sequencing experiments on a single sample is impractical, especially in low- and middle-income countries and in clinical settings where prompt diagnoses are important. Therefore, there is a need for alternative low-cost methods to obtain more accurate genome assemblies from ONT reads.
Computational and bioinformatics tools for analysing ONT reads are freely available and rapidly expanding. These tools are a reasonable and low-cost option to reduce error rates post-assembly. They use varied algorithms designed to identify and resolve sequencing errors so as to produce a genome assembly that is not only complete but also accurate, though the output of the read correction step depends on the applied methods and their specific parameters [12]. Several studies have benchmarked freely available read correction tools and their impact on downstream analysis [13-16]. Among the available read correction tools, flye, medaka, and racon are the most commonly used for ONT reads. While the flye read assembly and correction tool is based on the generalized de Bruijn graph, medaka and racon were created to outperform graph-based methods by generating a genomic consensus in much less time [12,13,16]. Benchmarking freely available read correction tools is therefore important, as it improves analytical precision and helps resolve critical issues in the field.
Most benchmarking studies focus on prokaryotic rather than eukaryotic genome assemblies. As ONT has become an important platform for eukaryotic DNA sequencing, allowing in-depth analysis of complex eukaryotic DNA sequences for virulence factors and gene annotation, there is a need to benchmark the impact of read correction tools on eukaryotic genomes and their downstream analysis.
In this study, we retrieved ONT sequencing reads of four pathogenic eukaryotes, Candida albicans, Cryptococcus gattii, Saccharomyces cerevisiae, and Plasmodium falciparum, from the Sequence Read Archive (SRA)-NCBI. We evaluated the impact of applying three read correction tools (flye, medaka, and racon) on genome length, fragmentation and completeness, and the accuracy of gene structure, and we analysed and classified eukaryotic functional proteins. The selection of these organisms was primarily motivated by the availability of high-quality ONT sequencing data in the SRA-NCBI database. This choice was further supported by their significance as model organisms, as exemplified by S. cerevisiae, and as pathogens.
The quality of the consensus FASTA files generated by minimap2, flye, medaka, and racon (n = 24 per species, n = 96 in total) was assessed by QUAST (version 5.0.2) using the LG parameter. The total length (bp), total aligned length (bp), and GC% were evaluated [22].
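As a rough illustration of two of the statistics named above, the sketch below computes total length (bp) and GC% directly from a FASTA stream. It is a minimal stand-in for intuition only, not QUAST's implementation; the function name `fasta_stats` and the two-contig example are hypothetical.

```python
from io import StringIO


def fasta_stats(handle):
    """Return (total_length_bp, gc_percent) over all records in a FASTA stream."""
    total = 0
    gc = 0
    for line in handle:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and record headers
        seq = line.upper()
        total += len(seq)
        gc += seq.count("G") + seq.count("C")
    gc_percent = 100.0 * gc / total if total else 0.0
    return total, gc_percent


# Hypothetical two-contig consensus, for illustration only.
example = StringIO(">contig_1\nATGCGC\nGGAT\n>contig_2\nTTAACC\n")
length_bp, gc_pct = fasta_stats(example)
print(length_bp, round(gc_pct, 2))  # prints: 16 50.0
```

Total aligned length additionally requires an alignment against the reference and is therefore not covered by this sketch.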
Statistical analysis was performed using one-way ANOVA with Bonferroni's multiple comparison test in GraphPad Prism (version 8.0.1; Boston, MA, USA) to determine significant differences (p < 0.05, p < 0.001) among the consensus FASTA files generated by minimap2 before and after read correction with flye, medaka, and racon, in QUAST-based assembly statistics and in gene and protein detection/prediction by BRAKER1 and InterProScan.
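To make the two ingredients of this test concrete, the following sketch computes the one-way ANOVA F statistic and applies a Bonferroni adjustment to a set of pairwise p-values. It is an assumption-laden illustration, not a reproduction of the GraphPad Prism procedure: the function names and all numbers are hypothetical, and converting F to a p-value (which needs the F distribution from a statistics package) is deliberately left out.

```python
from itertools import combinations
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    # Variation of group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def bonferroni(p_values):
    """Multiply each raw p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical measurements per tool (not data from this study).
data = {
    "minimap2": [12.1, 12.4, 12.3],
    "flye": [11.0, 11.2, 10.9],
    "medaka": [11.6, 11.8, 11.7],
    "racon": [11.9, 12.0, 11.8],
}
f_stat, df_b, df_w = one_way_anova_f(list(data.values()))
pairs = list(combinations(data, 2))  # 6 pairwise comparisons for 4 tools
print(f"F({df_b}, {df_w}) = {f_stat:.2f}; {len(pairs)} pairwise comparisons")
```

With four groups there are six pairwise comparisons, so under Bonferroni a raw pairwise p-value must fall below 0.05/6 to be reported as significant at the 0.05 level.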

Results and Discussion
Eukaryotic whole-genome sequencing provides comprehensive insights into complex eukaryotic genomes. ONT sequencing is a practical long-read sequencing platform that enables rapid and cost-effective identification of strains and detection of virulence factors and proteins in both research and clinical settings. However, the relatively high error rates of ONT reads require computational and bioinformatics efforts to produce contiguous and accurate eukaryotic genome assemblies. In this study, we examined the effect of three rounds of read correction using flye, medaka, and racon after assembling ONT reads against a reference genome using minimap2. The evaluation was based on the total genome length (bp) and GC% reported by QUAST, genome completeness detected by BUSCO, gene prediction by BRAKER1, and protein annotation by InterProScan. We used the default parameters and datasets provided by the bioinformatic tools.
QUAST analysis assessed the quality and accuracy of genome assemblies before and after three rounds of read correction. For C. gattii, S. cerevisiae, and P. falciparum, the total length (bp) was significantly (p < 0.05) higher after read alignment with minimap2 against the reference genomes than after read correction (Table 1). Nevertheless, the median total length was lowest after the first round of correction with flye and improved significantly (p < 0.05) with the second and third rounds of correction with medaka and racon, respectively (Table 1). Improvement in total assembly length after read correction is commonly reported. Studies have reported improvements of up to 57% in genome assemblies; in this study, however, we observed improvements of only 9.36% [8,16]. The variation in improvement percentage depends on various factors, such as the organism sequenced, DNA library preparation, genome assembly, and the read correction tools used. Although the total aligned length (bp) was highest after minimap2 assembly, the difference from the assemblies after read correction was not significant (p > 0.05) (Table 1). Among the corrected assemblies, the total aligned length was highest after the second round of read correction with medaka and lowest after the third round with racon. The GC% was significantly higher (p < 0.05) after read correction with flye and decreased after the second and third rounds of read correction (Table 1). In line with other studies, we previously noticed similar outcomes; although medaka and racon had significantly lower GC%, both read correction tools performed better in the overall genome assembly, especially when combined [16,29-31].
BUSCO provides a quantitative measure of genome completeness to evaluate the quality of genome annotation. Among the four eukaryotic species examined in this study, medaka improved genome completeness over minimap2 only in the C. albicans assembled genomes (Figure 1a). When comparing the read correction tools, medaka was also superior to flye and racon in genome completeness in all four species (Figure 1). While the use of medaka for diploid cells has been controversial because of the diploid nature of yeast, we found that the newer version of medaka provided more accurate assemblies. These results are in line with Sigova et al. [32], who reported that read correction with medaka is superior to read correction with racon in fungal pathogens. In addition, the percentage of genome completeness decreases significantly (by ~40%) when a reference is added, even after using six read correction tools [32]. Moreover, Zhang et al. showed that medaka outperformed other read correction/polishing tools, improving continuity and reducing mismatches in S. cerevisiae assembled genomes [33]. In all species except P. falciparum, flye was superior to racon in genome completeness and duplication rates (Figure 1). The rate of fragmented genome was comparable in all species across all three rounds of read correction (Figure 1).
Genome completeness is mainly affected by sequencing methods and genome assembly tools rather than by read correction tools [33]. The higher genome completeness observed in uncorrected assemblies in this study was due to minimap2 assembly, which is a reference-based alignment method. Other studies using de novo genome assembly methods show that, with sufficient sequencing depth, read correction tools offer advantages in BUSCO analysis [33,34].
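For readers less familiar with how BUSCO completeness, duplication, and fragmentation rates relate to one another, the sketch below derives the percentages from raw category counts and formats them in BUSCO's short summary notation. The function names and the counts are hypothetical and are not taken from this study.

```python
def busco_percentages(single, duplicated, fragmented, missing):
    """Convert BUSCO category counts into percentages of all searched groups."""
    total = single + duplicated + fragmented + missing
    if total == 0:
        raise ValueError("no BUSCO groups were searched")

    def pct(n):
        return round(100.0 * n / total, 1)

    return {
        "C": pct(single + duplicated),  # complete = single-copy + duplicated
        "S": pct(single),
        "D": pct(duplicated),
        "F": pct(fragmented),
        "M": pct(missing),
        "n": total,
    }


def busco_short_summary(p):
    """Format the percentages in the usual one-line BUSCO notation."""
    return (f"C:{p['C']}%[S:{p['S']}%,D:{p['D']}%],"
            f"F:{p['F']}%,M:{p['M']}%,n:{p['n']}")


# Hypothetical counts for one assembly.
result = busco_percentages(single=180, duplicated=10, fragmented=4, missing=6)
print(busco_short_summary(result))
# prints: C:95.0%[S:90.0%,D:5.0%],F:2.0%,M:3.0%,n:200
```

Because complete BUSCOs are the sum of single-copy and duplicated ones, a tool can raise completeness while also raising the duplication rate, which is why the two are reported together.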
BRAKER1 is a bioinformatic tool commonly used for gene prediction in eukaryotic genomes using GeneMark-ET. Ideally, eukaryotic genome assemblies are combined with RNA-seq data to improve gene prediction accuracy. However, both DNA and RNA-seq data are often not available in real scenarios. Here, we performed BRAKER1 analysis on assembled and corrected genomes to evaluate the total number of coding sequences (CDSs), forward CDSs, reverse CDSs, mRNA, and introns (Figures 2-5). The total numbers of CDSs, forward CDSs, and reverse CDSs were significantly higher after the third round of read correction with racon (p < 0.05 vs. minimap2, p < 0.001 vs. flye, and p < 0.05 vs. medaka) (Figures 2-6). Surprisingly, the total number of CDSs increased after the first round of read correction with flye but decreased after the second round with medaka (Figures 2-5). In the C. albicans, C. gattii, and P. falciparum samples, the total number of CDSs after read correction with racon was higher than after flye by 55,273, 176,705, and 63,178, respectively. However, in the S. cerevisiae samples, the total number of CDSs was lower after read correction with racon. The effect of genome assembly and read correction pipelines on the S. cerevisiae genome has been well characterised [33]. The authors concluded that although read correction improved contiguity and coverage, sequencing depth and the choice of sequencing method affect S. cerevisiae genome annotation [33]. The number of introns showed a significance pattern parallel to that of the total number of CDSs. The total number of introns was significantly higher after read correction with racon (p < 0.05 vs. minimap2, p < 0.001 vs. flye and medaka) (Figures 2-6) in the C. albicans, C. gattii, and P. falciparum samples, but not in S. cerevisiae. Similarly, Shin et al. [35] found that applying the Nanopolish read correction tool to reads assembled by the Canu-SMARTdenovo method increased the detection of CDSs and introns when using MAKER2 as an annotation tool. Interestingly, the number of introns after the first round of read correction with flye was significantly higher (p < 0.05) than after genome assembly with minimap2 (Figure 6). In contrast, the number of mRNA coding genes was highest after genome assembly with minimap2. Among the three rounds of read correction, the highest number of mRNA coding genes was detected after the second round with medaka, which was significant only against racon (p < 0.05) (Figures 2-6). Given that an mRNA coding gene is ~1500 nucleotides long on average, detecting mRNA coding genes is critical [36,37]. Like other coding genes, these genes undergo quality control and trimming steps to remove low-quality bases and/or adapters present in the sequencing reads. Hence, trimming by read correction tools can generate even smaller gene sizes that no longer map to the reference genomes in the databases. Although the number of mRNA coding genes was lower after the third round of read correction with racon, this may result from the removal of false-positive genes detected after genome assembly with minimap2.
Based on the BRAKER1 gene prediction results, we investigated the effect of read correction tools on protein annotation by InterProScan with ProSiteProfiles analysis, which describes protein domains, families, and functional sites. The overall number of protein annotation hits improved with each round of read correction in all four species, with racon being the top-performing read correction tool (Figure 7a). Several protein annotations were detected only after applying read correction to the assembled genomes, such as TGF-beta binding (IPR017878), colipase family (IPR001981), and cytochrome c class II (IPR002321) in C. gattii samples; streptavidin (IPR005468), cytochrome c class II (IPR002321), and GATA-type zinc finger (IPR000679) in S. cerevisiae; and platelet-derived growth factor (PDGF) (IPR000072), coronaviridae zinc-binding (CV ZBD) (IPR000072), GATA-type zinc finger (IPR000679), and C-terminal cystine knot (IPR006207) in P. falciparum samples (Figure 7a). Protein annotation hits of IPR002321 detected by medaka were significantly (p < 0.05) higher than those of minimap2, flye, and racon in C. albicans, whereas protein annotation hits of IPR00724 and IPR001002 detected by medaka were significantly (p < 0.05) higher than those of minimap2, and protein annotation hits of IPR002321 detected by medaka and racon were significantly (p < 0.05) higher than those of minimap2 and flye (Figure 7b). In S. cerevisiae samples, protein annotation hits of IPR007112 detected by racon were significantly higher than hits detected by minimap2 (Figure 7b). Protein annotation hits of IPR001938 detected by medaka were significantly (p < 0.05) higher than hits detected by flye in P. falciparum samples (Figure 7b).
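The per-accession hit counts compared above can be tallied from InterProScan's tab-separated output with a few lines of code. The sketch below assumes the standard TSV layout, in which the fourth column holds the analysis name and the twelfth the InterPro accession; the function name, the placeholder signature identifiers, and the example rows are hypothetical and are not data from this study.

```python
import csv
from collections import Counter
from io import StringIO


def count_interpro_hits(handle, analysis="ProSiteProfiles"):
    """Count hits per InterPro accession for one member-database analysis."""
    counts = Counter()
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 12:
            continue  # row carries no InterPro accession column
        if row[3] != analysis:
            continue  # hit comes from a different analysis
        accession = row[11]
        if accession and accession != "-":
            counts[accession] += 1
    return counts


# Hypothetical rows; "sigA" and "sigB" are placeholder signature IDs.
rows = [
    ["g1.t1", "md5a", "300", "ProSiteProfiles", "sigA", "desc", "10", "90",
     "12.5", "T", "01-01-2024", "IPR002321", "Cytochrome c, class II"],
    ["g2.t1", "md5b", "280", "ProSiteProfiles", "sigA", "desc", "15", "95",
     "11.0", "T", "01-01-2024", "IPR002321", "Cytochrome c, class II"],
    ["g3.t1", "md5c", "410", "Pfam", "sigB", "desc", "5", "60",
     "1e-5", "T", "01-01-2024", "IPR000679", "Zinc finger, GATA-type"],
    ["g4.t1", "md5d", "150", "ProSiteProfiles", "sigA", "desc", "1", "40",
     "9.0", "T", "01-01-2024", "-", "-"],
]
tsv = StringIO("\n".join("\t".join(r) for r in rows) + "\n")
print(dict(count_interpro_hits(tsv)))  # prints: {'IPR002321': 2}
```

Running the same count on the output for each consensus FASTA file gives one value per accession and tool, which is the kind of table a group-wise comparison would start from.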
To our knowledge, this is the first study to evaluate the effect of read correction tools for long reads on gene prediction using BRAKER1 and protein annotation using InterProScan. Although BUSCO analysis showed superior genome completeness in uncorrected assemblies, we found that read correction tools offer advantages over uncorrected assemblies in BRAKER1 gene detection and protein annotation using InterProScan with ProSiteProfiles analysis. In this study, we showed that genome accuracy after three rounds of read correction is more vital for gene prediction and protein annotation than genome completeness. Our results indicate that gene prediction accuracy relies on the quality of assembled genomes after read correction rather than on the number of genes present after genome assembly. In other words, a more accurate genome assembly leads to more reliable gene prediction and protein annotation [38,39]. However, the gene completeness analysis could still be improved. The development of more robust read assembly and read correction tools and pipelines is still an area to explore. Studies have shown that mixing and matching freely available read assembly and read correction tools significantly improves not only assembly parameters but also antimicrobial resistance gene detection, plasmid identification, and pan-genome analysis, with and without using short sequencing reads for read correction [14,16,40-42]. In addition, adjusting the parameters of read assembly and/or read correction tools could be beneficial. Schiavone et al. [43] documented the importance of applying 'tailored' bioinformatics analysis: obtaining complete chromosome and plasmid sequences of Salmonella enterica was possible by modifying the corErrorRate and corMincoverage parameters in the Canu assembler [43].
In addition, improving the sequencing platform itself can reduce sequencing error rates and increase accuracy, as has been observed throughout the development of ONT from the R6 flow cells until now [44]. ONT has recently introduced flow cells (R10.4.1) with a quality score >20. The preliminary results from these flow cells are very encouraging [45]. The R10 flow cells outperform the R9 flow cells, achieving a genome accuracy of >99% [45,46]. However, to achieve near-complete genomes, short reads may still be required for read correction [47]. The performance of the new R10 flow cells is still being investigated, and their combination with different read assembly and read correction tools is yet to be evaluated.

Conclusions
The rapid development of whole-genome sequencing platforms has revolutionised their usage and application in research and clinical settings. Using both short and long sequencing reads to produce hybrid genome assemblies is a very robust method for gene detection and protein annotation. However, access to both short- and long-read sequencing platforms is an unrealistic scenario, especially in low- and middle-income countries. ONT serves as a reliable and relatively inexpensive long-read sequencing platform. However, the major burden of this sequencing platform is its relatively high error rate. Therefore, improving the sequencing reads generated by ONT with computational and bioinformatics tools is a logical and cost-effective option.
Numerous long-read correction tools are regularly developed with the aim of achieving robust genome assemblies. These tools often use different bioinformatic algorithms. Benchmarking the freely available read correction tools is very important and drives the research field towards better analysis resolution. This study showed that, for gene prediction and protein annotation, genome quality is more important than genome completeness. Although genome completeness was significantly higher before read correction, significant improvement in gene prediction and protein annotation in eukaryotic genomes was noticeable after the second and third rounds of read correction. However, the assembled genomes can still be improved for better outcomes. Therefore, the investigation of several read correction tool combinations is required, along with the improvement of ONT sequencing technology.

Figure 6. Heatmap statistical analysis for BRAKER1 results. Bonferroni's multiple comparison one-way ANOVA was performed to determine significant differences (p < 0.05, p < 0.001) among minimap2 before and after read correction with flye, medaka and racon.

Figure 7. InterProScan protein annotation using ProSiteProfiles analysis in C. albicans, C. gattii, S. cerevisiae, and P. falciparum: (a) number of hits detected and (b) significant differences among read correction methods. Bonferroni's multiple comparison one-way ANOVA statistical analysis was performed to determine significant differences (p < 0.05, p < 0.001) among the different groups.

Table 1. Total length (bp), total aligned (bp), and GC% of ONT sequencing reads aligned with minimap2 before and after applying read correction tools.
QUAST-based assembly statistics for C. albicans, C. gattii, S. cerevisiae, and P. falciparum genomes assembled with minimap2, before and after read correction with flye, medaka, and racon. Bonferroni's multiple comparison one-way ANOVA statistical analysis was performed to determine significant differences (p < 0.05, p < 0.001) among the different groups.