Article

A Sequence-Based Novel Approach for Quality Evaluation of Third-Generation Sequencing Reads

School of Information Science and Engineering, Central South University, Changsha 410083, China
* Author to whom correspondence should be addressed.
Genes 2019, 10(1), 44; https://doi.org/10.3390/genes10010044
Submission received: 29 November 2018 / Revised: 7 January 2019 / Accepted: 8 January 2019 / Published: 14 January 2019

Abstract

The advent of third-generation sequencing (TGS) technologies, such as the Pacific Biosciences (PacBio) and Oxford Nanopore machines, provides new possibilities for contig assembly, scaffolding, and high-performance computing in bioinformatics due to their long reads. However, the high error rate and poor quality of TGS reads pose new challenges for accurate genome assembly and long-read alignment. Efficient processing methods are needed to prioritize high-quality reads and thereby improve the results of error correction and assembly. In this study, we propose a novel Read Quality Evaluation and Selection Tool (REQUEST) for evaluating the quality of third-generation long reads. REQUEST generates training data of high-quality and low-quality reads, which are characterized by their nucleotide combinations. A linear regression model is built to score the quality of reads. The method was tested on three datasets from different species. The results showed that the top-scored reads prioritized by REQUEST achieved higher alignment accuracies. The contig assembly results based on the top-scored reads also outperformed conventional approaches that use all reads. REQUEST is able to distinguish high-quality reads from low-quality ones without using reference genomes, making it a promising alternative to alignment-based algorithms for sequence-quality evaluation.


1. Introduction

Next-generation sequencing (NGS) has dominated the DNA sequencing area since its development, dramatically reducing the cost and error of sequencing by enabling a massively parallel approach capable of producing large numbers of reads [1]. Because the reads generated by most NGS machines are short (often less than 200 bp), the applications of NGS in gene/transcript reconstruction and complex genome assembly are limited [2,3,4].
The emergence of third-generation sequencing (TGS) technology offers many new prospects for genome research, especially because its dramatically increased read length [5] helps to resolve complex genome regions with long repeats [6,7,8,9]. In 2014, Oxford Nanopore Technologies (ONT) presented their tiny MinION sequencer. The MinION can produce reads thousands of bases long. Scientists have used this technology to construct genomes of new species [6], such as vaccinia virus [10], Saccharomyces cerevisiae [11], and tobacco [12]. The one-dimensional (1D) reads from MinION have a raw nucleotide accuracy of less than 75%, while the two-dimensional (2D) reads are of higher quality (80–88% accuracy) [13].
The standard for judging assembly and long transcripts is mapping rate or genome coverage, which depends on alignment and, therefore, is time-consuming. The accuracy of second-generation sequencing (SGS) is about 99.96%; however, errors still need to be corrected during assembly, scaffolding, and gap-filling [7,14,15]. At the same time, the genomes of many species are incomplete, so some reads cannot be aligned to the genome, which limits downstream analysis. Several widely used alignment methods are currently available, such as bowtie [16], HISAT (Hierarchical Indexing for Spliced Alignment of Transcripts) [17], BLAT (BLAST-like alignment tool) [18], and Tophat2 [19]. There are also several assembly approaches, such as de Bruijn graph (DBG), string graph, and overlap layout consensus (OLC) [20]. The DBG approach, which splits the reads into k-mers and builds a graph from the overlaps between them, is fast and suitable for large-scale SGS reads.
However, these tools were originally designed for NGS and do not work well for TGS reads. The high error rate of TGS poses new challenges for long-read alignment, assembly, structural variation detection [21], etc. To address this problem, several error correction methods have been put forward, including hybrid error correction methods such as LoRDEC (a hybrid error correction method) [22], LSC (a computational method to perform error Correction of TGS Long reads by SGS short reads) [23], proovread [24], and LSCplus [25], which borrow information from high-quality second-generation reads.
Due to the low quality of the data, multiple iterations of error correction are required to achieve adequate assembly quality [26]. Current approaches, such as MECAT (a fast Mapping, Error Correction, and de novo Assembly Tool) [32], FC_Consensus [27], DAGCon (a Directed Acyclic Graph Consensus method) [28], and FalconSense [29], take all reads as input without filtering, so poor-quality reads may have a negative influence on the results. MECAT uses different error-correction methods for different types of regions. A counting-based method is used in regions with consistent matches or deletions without insertions, which greatly improves the calculation speed. A local partial order graph (POG) is used in regions with insertions, which ensures maximum accuracy. The correcting speed of MECAT was about five times higher than that of other tools, and its accuracies were also consistently higher than those of the other two methods.
Alignment tools designed specifically for long reads, such as MECAT, Minimap [30], and BLASR (Basic Local Alignment with Successive Refinement) [31], are still time-consuming when precise alignment is required, even though MECAT, which combines mapping, error correction, and assembly, is very fast compared to several other tools.
Regarding sequencing machines, PacBio RS, PacBio RS II, and Nanopore MinION are biased toward generating certain types of erroneous nucleotide combinations. For example, insertions or deletions within runs of the same consecutive base were reported in recent studies [11,13]. We assumed that the contents of nucleotides, dinucleotides, and trinucleotides differ between high-quality and low-quality reads and, therefore, can be used for read-quality evaluation. The nucleotide combinations considered in our work include four kinds of single nucleotides (adenine, A; guanine, G; thymine, T; and cytosine, C), 16 kinds of dinucleotides, and 64 kinds of trinucleotides. Here, the Read Quality Evaluation and Selection Tool (REQUEST) was applied to three real-world third-generation sequencing read datasets from different species. We found that the reads selected by REQUEST were of higher quality and achieved better performance in read correction and contig assembly than randomly selected reads. These results support the view that using high-quality reads rather than all reads is a promising approach for genome assembly.

2. Materials and Methods

2.1. Data Availability

We used three 2D-pass datasets generated by Oxford Nanopore sequencing, from Escherichia coli (E. coli), Yersinia pestis (Yersinia), and Drosophila biarmipes (Drosophila). The E. coli dataset is available at the Loman lab website (http://lab.loman.net/). The Yersinia pestis and Drosophila biarmipes datasets are available at the National Center for Biotechnology Information (NCBI) Sequence Read Archive (SRA) database (SRR5117441, SRR7167956). The latest assembled genomes of E. coli, Yersinia, and Drosophila used here were downloaded from the RefSeq database (http://www.ncbi.nlm.nih.gov/refseq).

2.2. Methods

REQUEST evaluates the quality of reads based on the differential patterns of nucleotide combinations between high-quality and low-quality reads. It consists of three steps designed to mitigate the high error rate that hampers applications of TGS, as shown in Figure 1.
In step 1, to generate the training dataset, the contents of 84 kinds of nucleotide combinations were calculated as the sequence features for each read. The raw reads were regarded as the low-quality reads (LQ, labeled as ‘-1’). The error-corrected reads generated by MECAT with the raw reads as the input data were regarded as the high-quality reads (HQ, labeled as ‘1’).
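The feature extraction in step 1 can be sketched as follows. This is a minimal illustration, not the REQUEST implementation: the function name `combination_content` and the normalization (count divided by the number of windows of the same length) are assumptions, since the text only states that the contents of the 84 combinations were calculated.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"
# 4 single nucleotides + 16 dinucleotides + 64 trinucleotides = 84 combinations.
COMBINATIONS = ["".join(p) for k in (1, 2, 3) for p in product(ALPHABET, repeat=k)]


def combination_content(read):
    """Return 84 content features for one read, in the order of COMBINATIONS.

    Assumed normalization: occurrences of a k-mer divided by the number of
    overlapping windows of length k in the read.
    """
    read = read.upper()
    features = []
    for k in (1, 2, 3):
        windows = max(len(read) - k + 1, 0)
        # Count every overlapping window of length k once.
        counts = Counter(read[i:i + k] for i in range(windows))
        for p in product(ALPHABET, repeat=k):
            kmer = "".join(p)
            features.append(counts[kmer] / windows if windows else 0.0)
    return features
```

Under these assumptions, each raw read would contribute one feature row labeled -1 and each corrected read one row labeled 1. Windows containing characters other than A, C, G, and T (for example N) are counted in the denominator but match no feature.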
In step 2, the training set was split into two subsets; a linear model was trained on each subset separately and used to score the reads of the other subset (cross-scoring). The process is equivalent to solving a linear least-squares problem. The list of predicted Scores of read Quality, denoted as SQ scores, was calculated as shown in Equation (1).
$$\mathrm{SQ\ scores} = f(X, X_{\mathrm{new}}) = X_{\mathrm{new}} (X^{T} X)^{-1} X^{T} Y, \qquad (1)$$
where X refers to the feature matrix of the training set (Part 1 in Figure 1), Y refers to the corresponding vector of labels (1 for HQ reads and -1 for LQ reads), and X_new refers to the feature matrix of the test set (Part 2 in Figure 1). The SQ scores of all raw reads were obtained by combining the SQ scores of the two parts.
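The cross-scoring in step 2 can be sketched with NumPy as below. This is an illustrative sketch under stated assumptions rather than the REQUEST code: the function names are hypothetical, and `np.linalg.lstsq` is used in place of the explicit inverse in Equation (1). The two agree when the matrix X^T X is invertible; `lstsq` additionally returns a minimum-norm solution when the features are collinear, which can happen with content features that sum to a constant.

```python
import numpy as np


def sq_scores(X, Y, X_new):
    """Fit a linear model on (X, Y) by least squares and score the rows of X_new."""
    # Solves min_w ||X w - Y||^2; equals (X^T X)^{-1} X^T Y when that inverse exists.
    w, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return X_new @ w


def cross_score(X1, Y1, X2, Y2):
    """Train on each part and score the other part, as in two-fold cross-scoring."""
    scores_part2 = sq_scores(X1, Y1, X2)  # model from Part 1 scores Part 2
    scores_part1 = sq_scores(X2, Y2, X1)  # model from Part 2 scores Part 1
    return scores_part1, scores_part2
```

As in Equation (1), no intercept term is fitted here; adding a constant column to the feature matrices would be one way to include it.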
In step 3, to verify the effectiveness of the proposed method, we selected the top-ranked 80%, 85%, 90%, and 95% of reads and removed the lowest-scored reads to reduce the negative impact of poor-quality reads. The top-ranked reads were then used for error correction and contig assembly to test the effectiveness of our method.
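The selection in step 3 amounts to ranking reads by score and keeping a fixed percentage. A minimal sketch follows; the function name `select_top` and the choice to round the number of retained reads down are assumptions made for illustration.

```python
def select_top(read_ids, scores, percent):
    """Return the IDs of the top `percent` percent of reads, highest score first."""
    if len(read_ids) != len(scores):
        raise ValueError("read_ids and scores must have the same length")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    # Integer arithmetic avoids floating-point surprises; the count is rounded down.
    n_keep = len(read_ids) * percent // 100
    ranked = sorted(zip(read_ids, scores), key=lambda pair: pair[1], reverse=True)
    return [read_id for read_id, _ in ranked[:n_keep]]
```

For example, `select_top(ids, scores, 80)` would keep the 80% of reads with the highest SQ scores and drop the lowest-scored 20%.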
The REQUEST software was implemented in Python and R, and it is freely available at http://github.com/bioinfomaticsCSU/REQUEST.

2.3. Evaluation Method

For raw reads and corrected reads, the analytical indicators include the number of reads (Num), as well as the maximum (max), minimum (min), and average read length of each dataset. We aligned all reads to the genome and recorded the number of alignments, the aligned rate (%), and the mean and median identity.
Alignments refer to long reads whose overlapped lengths with the reference genome are longer than 2000 bp and whose mismatch rate is less than twice the read error rate [32]. Aligned rate (R) refers to the proportion of all reads that are aligned to the genome, calculated as
$$\mathrm{aligned\ rate} = \frac{n}{N} \times 100\%,$$
where n refers to the number of alignments, and N refers to the number of all reads.
The identity is a general measure of sequence quality, showing the degree to which a read matches the genome. The identity of a sequence is the proportion of bases that match the genome, calculated as
$$\mathrm{identity} = \frac{m}{\mathrm{Ref}} \times 100\%,$$
where m refers to the number of matched bases, and Ref refers to the length of reference sequence.
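The alignment criterion and the two read-level metrics above translate directly into code. The sketch below is illustrative only; the function and parameter names are hypothetical, and the inputs (overlap length, mismatch rate, matched bases) are assumed to come from an external aligner.

```python
def is_alignment(overlap_bp, mismatch_rate, read_error_rate, min_overlap_bp=2000):
    """Apply the stated criterion: overlap > 2000 bp and mismatch rate < 2x error rate."""
    return overlap_bp > min_overlap_bp and mismatch_rate < 2 * read_error_rate


def aligned_rate(n_alignments, n_reads):
    """Aligned rate R in percent: n / N * 100."""
    if n_reads <= 0:
        raise ValueError("n_reads must be positive")
    return 100.0 * n_alignments / n_reads


def identity(matched_bases, ref_length):
    """Identity in percent: m / Ref * 100."""
    if ref_length <= 0:
        raise ValueError("ref_length must be positive")
    return 100.0 * matched_bases / ref_length
```

With the values reported for all E. coli raw reads in Table 1 (n = 27,869, N = 31,858), `aligned_rate` gives 87.48% after rounding to two decimals, matching the tabulated R.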
We calculated the Pearson correlation coefficient between the identity and the SQ scores as follows:
$$P = \frac{\sum_{i=1}^{n} (\mathrm{Identity}_i - \overline{\mathrm{Identity}})(\mathrm{SQ}_i - \overline{\mathrm{SQ}})}{\sqrt{\sum_{i=1}^{n} (\mathrm{Identity}_i - \overline{\mathrm{Identity}})^{2}}\,\sqrt{\sum_{i=1}^{n} (\mathrm{SQ}_i - \overline{\mathrm{SQ}})^{2}}}.$$
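For reference, the Pearson correlation coefficient between per-read identity and SQ score can be computed with the standard library alone. This is a generic sketch of the textbook formula; the function name `pearson` is chosen for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Raises ZeroDivisionError if either sequence is constant.
    return cov / sqrt(var_x * var_y)
```

Here `xs` would hold the identity of each read and `ys` the corresponding SQ score.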
To further investigate whether this selection method could improve assembly, we assembled the selected datasets using MECAT2canu with the Nanopore assembly pipeline, and the contigs were evaluated by QUAST (Quality Assessment Tool for Genome Assemblies) [33]. The metrics used here were the number of contigs, the maximum contig length, the number of misassemblies (MA), the largest alignment, N50, NA50, and genome fraction. N50 is the contig length such that the contigs of this length or longer together cover at least half of the genome being assembled [34]. NA50 is the analogue of N50 [35] computed on corrected contigs, i.e., contigs broken at misassembly sites. Genome fraction is the percentage of aligned bases in the reference genome. A base in the reference genome is aligned if there is at least one contig covering this base [36].
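The N50 statistic can be illustrated with a short function. This is a sketch, not QUAST's implementation: by default it uses the total length of the contigs as the reference length, and the optional `total_length` parameter (an assumption added here) allows a genome length to be supplied instead.

```python
def n50(contig_lengths, total_length=None):
    """Return the length L such that contigs of length >= L cover half of the total.

    `total_length` defaults to the summed contig length; pass a genome size to
    measure coverage against the genome instead. Returns None if the contigs
    cover less than half of `total_length`.
    """
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    total = sum(contig_lengths) if total_length is None else total_length
    running = 0
    # Accumulate contigs from longest to shortest until half the total is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return None
```

For contigs of lengths 2, 3, 4, 5, and 6 (total 20), the two longest contigs already cover 11 of 20, so the function returns 5.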

3. Results

3.1. High-Quality and Low-Quality Reads Show Different Patterns of Nucleotide Combination Content

The differential pattern of long reads was illustrated using a Nanopore sequencing dataset. For instance, Figure 2 shows the differences in the content of four trinucleotides among the reference genome (representing gold-standard error-free reads, green lines), corrected reads (representing high-quality reads, blue lines), and raw reads (representing low-quality reads, red lines). The differences were prominent.
In order to determine whether the selected reads with high SQ scores could result in an improvement of error correction and assembly results, we also randomly selected the same number of raw reads and compared the results between our selected reads and the randomly selected reads. The results of E. coli (see Table 1), Yersinia (see Table 2), and Drosophila (see Table 3) consist of three parts: read alignment, read correction, and contig assembly.
The raw reads were ranked by the SQ scores, and the top 95%, top 90%, top 85%, and top 80% of reads were retained for subsequent analysis. For comparison, subsets of raw datasets of the same size as the reads selected by REQUEST were randomly selected; by repeating this process 20 times, 20 replicate sub-datasets were obtained, and the results on the randomly selected reads were averaged for comparison.
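The random baseline described above can be sketched as repeated sampling without replacement followed by averaging. The function name `random_baseline`, the `metric` callback, and the fixed seed are assumptions introduced for illustration; in the study, the metric would be one of the alignment, correction, or assembly indicators computed on each replicate.

```python
import random
from statistics import mean


def random_baseline(reads, n_select, metric, replicates=20, seed=0):
    """Average `metric` over `replicates` random subsets of `n_select` reads."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    values = []
    for _ in range(replicates):
        subset = rng.sample(reads, n_select)  # sampling without replacement
        values.append(metric(subset))
    return mean(values)
```

For example, with `reads` given as a list of read lengths and `metric=mean`, the function returns the mean read length averaged over 20 random subsets of the same size as the REQUEST selection.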
The corrected reads were produced by MECAT. The evaluation criteria for the raw reads and corrected reads comprised (1) the number of reads, (2) the maximum, minimum, and mean length, (3) the number of alignments and the proportion of aligned reads among all reads, and (4) the mean and median identity.
The evaluation criteria for contigs comprised the number of contigs, the maximum contig length, the number of misassemblies, the maximum length of alignment, N50, NA50, and genome fraction.

3.2. Experimental Results

3.2.1. Results of Escherichia coli

The raw dataset of E. coli contained 31,858 2D reads, ranging in length from 99 bp to 64,218 bp and in identity from 53.95% to 97.42%. The relationship between SQ score and identity is shown in Figure 3a, where an obvious positive correlation can be seen. The Pearson correlation coefficient between identity and SQ score for E. coli was 0.53 (p < 2.2 × 10^−16), suggesting that the SQ score is a useful indicator of alignment-based read quality.
Then, error correction and assembly were carried out on both the REQUEST-selected and the randomly selected reads. The results for the E. coli dataset are shown in Table 1. The reads selected by our method were longer on average than the randomly selected reads. The aligned rate reached 93.25%, which was 5.75 percentage points higher than that of the randomly selected reads (87.50%). The identity showed a similar trend. This indicates that SQ scores indeed correlate with the quality of reads.
In the second part, the results of error correction showed different trends. The mean and median identity of the REQUEST selection were lower than those of the random selection at the 85–95% retention levels and higher at the 80% level. Meanwhile, the number and mean length of corrected reads from the REQUEST selection were higher than those from the random selection. This means that REQUEST allowed more reads to be corrected and that the length of effective error correction was longer.
In the last part, the assembly results showed the advantages of REQUEST, with fewer and longer contigs. N50 and NA50 were also longer. Although there were slightly more misassemblies, the genome fraction reached 100%.

3.2.2. Results of Yersinia pestis

The raw dataset of Yersinia contained 28,429 2D reads, ranging in length from 125 bp to 61,191 bp and in identity from 54.24% to 95.14%. The relationship between SQ score and identity is shown in Figure 3b. The Pearson correlation coefficient between identity and SQ score for Yersinia was 0.48 (p < 2.2 × 10^−16). The results are shown in Table 2. The mean length of the reads selected by REQUEST was higher than that of the randomly selected reads. The identity showed a similar trend.
In the second part, the results of error correction showed trends similar to those of the E. coli dataset. The maximum length of error-corrected reads was about 23,000 bp longer than that of the random selection.
In the last part, the assembly results also showed the advantages of REQUEST. Overall, the results of model-based selection were comparable to those of all data and outperformed those of randomly selected reads. The maximum contig length, N50, and NA50 were also longer. Although there were slightly more misassemblies than with random selection, the genome fraction reached 99.96%.

3.2.3. Results of Drosophila biarmipes

The raw dataset of Drosophila contained 1,375,649 reads, ranging in length from 61 bp to 93,368 bp and in identity from 60.60% to 100.0%. The relationship between SQ score and identity is shown in Figure 3c. The Pearson correlation coefficient between identity and SQ score for Drosophila was 0.36 (p < 2.2 × 10^−16). Due to the large genome of Drosophila, the results were different from those of the above two datasets (Table 3). The aligned rate of the REQUEST-selected reads was higher than that of all reads and of randomly selected reads. In the second part, the number of corrected reads obtained from the REQUEST-selected reads was larger than that obtained from randomly selected reads. In the last part, the assembly results also showed the advantage of REQUEST. Overall, the results after selection were better than those without filtering.

4. Discussion

In this study, we proposed a sequence-based method, REQUEST, to evaluate and select TGS long reads based on the differential patterns of base combinations. It defines the corrected reads as high-quality reads and the raw reads as low-quality reads. The base combination contents of each read are used as features. REQUEST builds a linear model to score the raw reads, and the SQ scores are used as the criterion to select high-quality reads.
The selected reads with high SQ scores were longer and had higher identity and a higher aligned rate than randomly selected ones. In error correction, the selection yielded more corrected reads with longer effective length. The aligned rate of the REQUEST-selected reads was also higher than that of all reads without filtering. In contig assembly, the contigs obtained from the REQUEST selection were better than those from random selection and from all reads in terms of N50, NA50, and other aspects. The genome fraction was higher than that obtained using all reads. These results indicate that using only reads with high SQ scores has a positive impact on subsequent error correction and assembly. In the future, we plan to test the performance of REQUEST on larger and more complex genomes, such as human genome sequencing data.
REQUEST evaluates and selects third-generation long reads based on base combinations without a reference genome. The reads it selected performed better than randomly selected reads and all reads in terms of read quality, error correction, and assembly. REQUEST can quickly evaluate sequence quality, improve the results of error correction and assembly, and reduce the time of iterative error correction of reads generated by third-generation sequencing. REQUEST gives each read an SQ score. In addition to aiding the filtering of low-quality reads, this score can also be integrated with error correction and assembly algorithms to potentially improve their performance.

Author Contributions

Conceptualization and methodology, W.Z. and H.D.L.; software and writing—original draft preparation, W.Z.; validation and visualization, W.Z. and J.Z.; formal analysis, X.L., N.H., and J.W.; writing—review and editing, W.Z. and H.D.L.; supervision, J.W. and H.D.L.

Funding

This research was funded by the National Natural Science Foundation of China under Grant No. 61732009, No. 61702556, No. 61772557, the 111 Project (No. B18059), the Hunan Provincial Science and Technology Program (2018WK4001), and the Fundamental Research Funds for the Central Universities Freedom Explore Program of Central South University under Grant No. 2018zzts570.

Acknowledgments

We would like to thank the reviewers for their critical and constructive suggestions.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Treangen, T.J.; Salzberg, S.L. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat. Rev. Genet. 2012, 13, 36–46.
2. Alkan, C.; Sajjadian, S.; Eichler, E.E. Limitations of next-generation genome sequence assembly. Nat. Methods 2010, 8, 61.
3. Abnizova, I.; Leonard, S.; Skelly, T.; Brown, A.; Jackson, D.; Gourtovaia, M.; Qi, G.; Te, B.R.; Faruque, N.; Lewis, K. Analysis of context-dependent errors for Illumina sequencing. J. Bioinform. Comput. Biol. 2012, 10, 1241005.
4. Abnizova, I.; Skelly, T.; Naumenko, F.; Whiteford, N.; Brown, C.; Cox, T. Statistical comparison of methods to estimate the error probability in short-read Illumina sequencing. J. Bioinform. Comput. Biol. 2010, 8, 579–591.
5. Lu, H.; Giordano, F.; Ning, Z. Oxford Nanopore MinION Sequencing and Genome Assembly. Genom. Proteom. Bioinform. 2016, 14, 265–279.
6. Li, C.; Lin, F.; An, D.; Wang, W.; Huang, R. Genome Sequencing and Assembly by Long Reads in Plants. Genes 2018, 9, 6.
7. Li, M.; Tang, L.; Liao, Z.; Luo, J.; Wu, F.; Pan, Y.; Wang, J. A novel scaffolding algorithm based on contig error correction and path extension. IEEE/ACM Trans. Comput. Biol. Bioinform. 2018.
8. Li, M.; Tang, L.; Wu, F.; Pan, Y.; Wang, J. SCOP: A novel scaffolding algorithm based on contig classification and optimization. Bioinformatics 2018.
9. Liao, X.; Li, M.; Luo, J.; Zou, Y.; Wu, F.; Pan, Y.; Luo, F.; Wang, J. Improving de novo assembly based on read classification. IEEE/ACM Trans. Comput. Biol. Bioinform. 2018.
10. Prazsák, I.; Tombácz, D.; Szűcs, A.; Dénes, B.; Snyder, M.; Boldogkői, Z. Full Genome Sequence of the Western Reserve Strain of Vaccinia Virus Determined by Third-Generation Sequencing. Genome Announc. 2018, 6, e01570-01517.
11. Jenjaroenpun, P.; Wongsurawat, T.; Pereira, R.; Patumcharoenpol, P.; Ussery, D.W.; Nielsen, J.; Nookaew, I. Complete genomic and transcriptional landscape analysis using third-generation sequencing: A case study of Saccharomyces cerevisiae CEN.PK113-7D. Nucleic Acids Res. 2018, 46, e38.
12. Lu, P.; Jin, J.; Li, Z.; Cao, P.; Fan, K.; Xu, Y. Genome assembly based on the third-generation sequencing technology and its application in tobacco. Tobacco Sci. Technol. 2017, 51, 87–94.
13. Ip, C.L.; Loose, M.; Tyson, J.R.; De, C.M.; Brown, B.L.; Jain, M.; Leggett, R.M.; Eccles, D.A.; Zalunin, V.; Urban, J.M. MinION Analysis and Reference Consortium: Phase 1 data release and analysis. F1000Research 2015, 4, 1075.
14. Wu, B.; Wang, J.; Luo, J.; Li, M.; Wu, F.; Pan, Y. MEC: Misassembly error correction in contigs using a combination of paired-end reads and GC-contents. IEEE/ACM Trans. Comput. Biol. Bioinform. 2018.
15. Li, M.; Wu, B.; Yan, X.; Luo, J.; Pan, Y.; Wu, F.; Wang, J. PECC: Correcting contigs based on paired-end read distribution. Comput. Biol. Chem. 2017, 69, 178–184.
16. Langmead, B.; Trapnell, C.; Pop, M.; Salzberg, S.L. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 2009, 10, R25.
17. Daehwan, K.; Ben, L.; Salzberg, S.L. HISAT: A fast spliced aligner with low memory requirements. Nat. Methods 2015, 12, 357–360.
18. Kent, W.J. BLAT—The BLAST-like alignment tool. Genome Res. 2002, 12, 656–664.
19. Kim, D.; Pertea, G.; Trapnell, C.; Pimentel, H.; Kelley, R.; Salzberg, S.L. TopHat2: Accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 2013, 14, R36.
20. Sović, I.; Križanović, K.; Skala, K.; Šikić, M. Evaluation of hybrid and non-hybrid methods for de novo assembly of nanopore reads. Bioinformatics 2016, 32, btw237.
21. Zhang, Z.; Wang, J.; Luo, J.; Ding, X.; Zhong, J.; Wang, J.; Wu, F.; Pan, Y. Sprites: Detection of deletions from sequencing data by re-aligning split reads. Bioinformatics 2016, 32, 1788–1796.
22. Leena, S.; Eric, R. LoRDEC: Accurate and efficient long read error correction. Bioinformatics 2014, 30, 3506–3514.
23. Kin Fai, A.; Underwood, J.G.; Lawrence, L.; Wing Hung, W. Improving PacBio long read accuracy by short read alignment. PLoS ONE 2012, 7, e46679.
24. Hackl, T.; Hedrich, R.; Schultz, J.; Förster, F. proovread: Large-scale high-accuracy PacBio correction through iterative short read consensus. Bioinformatics 2014, 30, 3004.
25. Hu, R.; Sun, G.; Sun, X. LSCplus: A fast solution for improving long read accuracy by short read alignment. BMC Bioinform. 2016, 17, 451.
26. Sameith, K.; Roscito, J.G.; Hiller, M. Iterative error correction of long sequencing reads maximizes accuracy and improves contig assembly. Brief. Bioinform. 2017, 18, 1–8.
27. Chin, C.S.; Peluso, P.; Sedlazeck, F.J.; Nattestad, M.; Concepcion, G.T.; Clum, A.; Dunn, C.; O’Malley, R.; Figueroabalderas, R.; Moralescruz, A. Phased diploid genome assembly with single-molecule real-time sequencing. Nat. Methods 2016, 13, 1050–1054.
28. Koren, S.; Schatz, M.C.; Walenz, B.P.; Martin, J.; Howard, J.T.; Ganapathy, G.; Wang, Z.; Rasko, D.A.; Mccombie, W.R.; Jarvis, E.D. Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat. Biotechnol. 2012, 30, 693–700.
29. Berlin, K.; Koren, S.; Chin, C.S.; Drake, J.P.; Landolin, J.M.; Phillippy, A.M. Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat. Biotechnol. 2015, 33, 623–630.
30. Li, H. Minimap2: Pairwise alignment for nucleotide sequences. Bioinformatics 2017.
31. Chaisson, M.J.; Tesler, G. Mapping single molecule sequencing reads using basic local alignment with successive refinement (BLASR): Application and theory. BMC Bioinform. 2012, 13, 238.
32. Xiao, C.L.; Chen, Y.; Xie, S.Q.; Chen, K.N.; Wang, Y.; Han, Y.; Luo, F.; Xie, Z. MECAT: Fast mapping, error correction, and de novo assembly for single-molecule sequencing reads. Nat. Methods 2017, 14.
33. Gurevich, A.; Saveliev, V.; Vyahhi, N.; Tesler, G. QUAST: Quality assessment tool for genome assemblies. Bioinformatics 2013, 29, 1072–1075.
34. Li, M.; Liao, X.; He, Y.; Wang, J.; Luo, J.; Pan, Y. ISEA: Iterative Seed-Extension Algorithm for De Novo Assembly Using Paired-End Information and Insert Size Distribution. IEEE/ACM Trans. Comput. Biol. Bioinform. 2017, 14, 916–925.
35. Luo, J.; Wang, J.; Shang, J.; Luo, H.; Li, M.; Wu, F.X.; Pan, Y. GapReduce: A gap filling algorithm based on partitioned read sets. IEEE/ACM Trans. Comput. Biol. Bioinform. 2018.
36. Luo, J.; Wang, J.; Zhang, Z.; Li, M.; Wu, F.X. BOSS: A novel scaffolding algorithm based on an optimized scaffold graph. Bioinformatics 2016, 33, 169.
Figure 1. The workflow of the Read Quality Evaluation and Selection Tool (REQUEST). The method consists of three steps: (1) compiling of the training data of high- and low-quality reads; (2) splitting the training set into two parts to build the linear model and cross-score the reads; (3) selecting the top-scored reads and evaluating them. SQ stands for the score of sequencing read quality computed by REQUEST.
Figure 2. Distribution of nucleotide combinations of genome, high-quality, and low-quality reads of four example trinucleotides: (a) ACC; (b) CTC; (c) GAC; (d) GCA. The green, blue, and red lines represent the data from genome (gold-standard error-free reads), high-quality (corrected reads), and low-quality reads (raw reads), respectively.
Figure 3. Relationship of identity and predicted (SQ) score. The identity was grouped into 65–70%, 70–75%, 75–80%, 80–85%, 85–90%, 90–95%, and 95–100%. For each group, the distribution of SQ scores was plotted. (a) Comparison of Escherichia coli; (b) comparison of Yersinia pestis; (c) comparison of Drosophila biarmipes.
Table 1. Summary of the results of Escherichia coli in selection, correction, and contigs. REQUEST—Read Quality Evaluation and Selection Tool.
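| Stage | Selection | P (%) | Num | Max | Min | Mean | n | R (%) | Mean I | Median I |
|---|---|---|---|---|---|---|---|---|---|---|
| Read alignment | All reads | 100 | 31,858 | 64,218 | 99 | 7668 | 27,869 | 87.48 | 84.16 | 88.33 |
| Read alignment | Random | 95 | 30,265 | 62,072 | 99 | 7669 | 26,471 | 87.50 | 84.16 | 88.33 |
| Read alignment | Random | 90 | 28,672 | 62,072 | 99 | 7670 | 25,078 | 87.50 | 84.16 | 88.32 |
| Read alignment | Random | 85 | 27,079 | 61,357 | 100 | 7670 | 23,685 | 87.51 | 84.15 | 88.32 |
| Read alignment | Random | 80 | 25,486 | 59,926 | 100 | 7671 | 22,288 | 87.50 | 84.15 | 88.32 |
| Read alignment | REQUEST | 95 | 30,265 | 64,218 | 99 | 7875 | 27,238 | 90.00 | 84.46 | 88.60 |
| Read alignment | REQUEST | 90 | 28,672 | 64,218 | 99 | 7964 | 26,192 | 91.35 | 84.98 | 89.07 |
| Read alignment | REQUEST | 85 | 27,079 | 64,218 | 99 | 8028 | 25,016 | 92.38 | 85.52 | 89.48 |
| Read alignment | REQUEST | 80 | 25,486 | 64,218 | 99 | 8082 | 23,766 | 93.25 | 86.06 | 89.88 |
| Read correction | All reads | 100 | 26,034 | 33,912 | 2000 | 8144 | 25,775 | 99.01 | 96.67 | 98.36 |
| Read correction | Random | 95 | 24,252 | 33,882 | 2000 | 8147 | 24,011 | 99.01 | 96.78 | 98.36 |
| Read correction | Random | 90 | 22,943 | 33,817 | 2001 | 8143 | 22,715 | 99.01 | 96.76 | 98.34 |
| Read correction | Random | 85 | 21,626 | 33,724 | 2001 | 8137 | 21,411 | 99.00 | 96.73 | 98.31 |
| Read correction | Random | 80 | 20,303 | 33,519 | 2001 | 8130 | 20,101 | 99.00 | 96.70 | 98.28 |
| Read correction | REQUEST | 95 | 25,715 | 33,886 | 2001 | 8162 | 25,469 | 99.04 | 96.51 | 98.25 |
| Read correction | REQUEST | 90 | 24,906 | 33,886 | 2001 | 8224 | 24,670 | 99.05 | 96.59 | 98.29 |
| Read correction | REQUEST | 85 | 23,968 | 33,880 | 2001 | 8279 | 23,731 | 99.01 | 96.68 | 98.33 |
| Read correction | REQUEST | 80 | 22,883 | 33,880 | 2000 | 8335 | 22,673 | 99.08 | 96.76 | 98.38 |

| Stage | Selection | P (%) | Num | Max (kb) | MA | Largest alignment (kb) | N50 (kb) | NA50 (kb) | GF (%) |
|---|---|---|---|---|---|---|---|---|---|
| Contig assembly | All reads | 100 | 2 | 4636 | 6 | 2305 | 4636 | 1655.60 | 99.86 |
| Contig assembly | Random | 95 | 3 | 3724 | 3 | 3294 | 3724 | 3201.91 | 99.98 |
| Contig assembly | Random | 90 | 4 | 2958 | 3 | 2606 | 2947 | 2438.54 | 99.97 |
| Contig assembly | Random | 85 | 5 | 3463 | 3 | 3153 | 3380 | 3032.37 | 99.92 |
| Contig assembly | Random | 80 | 7 | 2496 | 3 | 2444 | 1970 | 1864.48 | 99.81 |
| Contig assembly | REQUEST | 95 | 2 | 4641 | 5 | 2530 | 4641 | 2529.56 | 100.00 |
| Contig assembly | REQUEST | 90 | 2 | 4639 | 7 | 3587 | 4639 | 3587.13 | 99.89 |
| Contig assembly | REQUEST | 85 | 3 | 4635 | 5 | 3956 | 4635 | 3956.42 | 100.00 |
| Contig assembly | REQUEST | 80 | 3 | 4636 | 5 | 3957 | 4636 | 3956.57 | 100.00 |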
1 P indicates the proportion of retained reads; Max, Min, and Mean indicate the maximum, minimum, and mean read lengths, respectively; “n” means the number of alignments; R means the aligned rate; “I” indicates the identity; MA indicates misassemblies; GF indicates genome fraction.
Table 2. Summary of the results of Yersinia pestis in selection, correction, and contigs.
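| Stage | Selection | P (%) | Num | Max | Min | Mean | n | R (%) | Mean I | Median I |
|---|---|---|---|---|---|---|---|---|---|---|
| Read alignment | All reads | 100 | 28,429 | 61,191 | 125 | 7679 | 26,989 | 94.93 | 83.44 | 86.70 |
| Read alignment | Random | 95 | 27,007 | 61,191 | 125 | 7680 | 25,628 | 94.90 | 83.44 | 86.70 |
| Read alignment | Random | 90 | 25,586 | 61,191 | 125 | 7689 | 24,277 | 94.88 | 83.44 | 86.70 |
| Read alignment | Random | 85 | 24,164 | 61,191 | 145 | 7679 | 22,928 | 94.89 | 83.44 | 86.70 |
| Read alignment | Random | 80 | 22,743 | 53,492 | 125 | 7686 | 21,573 | 94.86 | 83.44 | 86.70 |
| Read alignment | REQUEST | 95 | 27,008 | 61,191 | 184 | 7785 | 26,181 | 96.94 | 83.84 | 86.87 |
| Read alignment | REQUEST | 90 | 25,586 | 61,191 | 184 | 7827 | 25,024 | 97.80 | 84.32 | 87.08 |
| Read alignment | REQUEST | 85 | 24,164 | 61,191 | 184 | 7869 | 23,750 | 98.29 | 84.73 | 87.29 |
| Read alignment | REQUEST | 80 | 22,743 | 61,191 | 184 | 7904 | 22,402 | 98.50 | 85.08 | 87.45 |
| Read correction | All reads | 100 | 25,776 | 57,301 | 2000 | 7229 | 24,769 | 96.09 | 96.96 | 98.09 |
| Read correction | Random | 95 | 23,953 | 33,843 | 2001 | 7170 | 23,946 | 99.97 | 97.12 | 98.14 |
| Read correction | Random | 90 | 22,633 | 33,587 | 2001 | 7157 | 22,627 | 99.97 | 97.11 | 98.12 |
| Read correction | Random | 85 | 21,315 | 33,289 | 2000 | 7139 | 21,310 | 99.98 | 97.11 | 98.10 |
| Read correction | Random | 80 | 19,974 | 33,730 | 2001 | 7117 | 19,969 | 99.98 | 97.10 | 98.09 |
| Read correction | REQUEST | 95 | 25,357 | 56,560 | 2000 | 7263 | 25,350 | 99.97 | 96.86 | 98.03 |
| Read correction | REQUEST | 90 | 24,449 | 56,560 | 2000 | 7336 | 24,442 | 99.97 | 96.93 | 98.07 |
| Read correction | REQUEST | 85 | 23,312 | 56,587 | 2000 | 7399 | 23,305 | 99.97 | 97.04 | 98.12 |
| Read correction | REQUEST | 80 | 22,028 | 57,044 | 2000 | 7468 | 22,022 | 99.97 | 97.10 | 98.16 |

| Stage | Selection | P (%) | Num | Max (kb) | MA | Largest alignment (kb) | N50 (kb) | NA50 (kb) | GF (%) |
|---|---|---|---|---|---|---|---|---|---|
| Contig assembly | All reads | 100 | 4 | 4646 | 30 | 940 | 4646 | 377.69 | 99.96 |
| Contig assembly | Random | 95 | 5 | 2749 | 28 | 835 | 2310 | 370.53 | 99.72 |
| Contig assembly | Random | 90 | 8 | 2174 | 25 | 816 | 1642 | 345.93 | 99.55 |
| Contig assembly | Random | 85 | 11 | 1756 | 28 | 771 | 1141 | 301.66 | 99.28 |
| Contig assembly | Random | 80 | 19 | 1194 | 27 | 593 | 471 | 224.73 | 98.54 |
| Contig assembly | REQUEST | 95 | 6 | 4641 | 31 | 1012 | 4641 | 377.70 | 99.96 |
| Contig assembly | REQUEST | 90 | 4 | 4658 | 31 | 798 | 4658 | 377.69 | 99.96 |
| Contig assembly | REQUEST | 85 | 4 | 4645 | 29 | 1012 | 4645 | 377.69 | 99.96 |
| Contig assembly | REQUEST | 80 | 7 | 2571 | 30 | 798 | 2571 | 282.40 | 99.73 |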
Table 3. Summary of the results of Drosophila biarmipes in selection, correction, and contigs.
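| Stage | Selection | P (%) | Num | Max | Min | Mean | n | R (%) | Mean I | Median I |
|---|---|---|---|---|---|---|---|---|---|---|
| Read alignment | All reads | 100 | 1,375,649 | 93,368 | 61 | 4102 | 845,134 | 61.44 | 79.57 | 82.58 |
| Read alignment | Random | 95 | 1,306,867 | 93,368 | 61 | 4102 | 802,968 | 61.44 | 79.57 | 82.24 |
| Read alignment | Random | 90 | 1,260,870 | 93,368 | 61 | 4101 | 760,614 | 60.32 | 79.57 | 82.58 |
| Read alignment | Random | 85 | 1,192,229 | 93,368 | 61 | 4101 | 718,489 | 60.26 | 79.57 | 82.58 |
| Read alignment | Random | 80 | 1,123,446 | 93,368 | 61 | 4102 | 676,352 | 60.20 | 79.57 | 82.58 |
| Read alignment | REQUEST | 95 | 1,306,867 | 93,368 | 83 | 4298 | 844,504 | 64.62 | 79.58 | 82.58 |
| Read alignment | REQUEST | 90 | 1,260,870 | 93,368 | 83 | 4503 | 841,439 | 66.73 | 79.61 | 82.61 |
| Read alignment | REQUEST | 85 | 1,192,229 | 93,368 | 83 | 4725 | 833,911 | 69.95 | 79.65 | 82.67 |
| Read alignment | REQUEST | 80 | 1,123,446 | 93,368 | 105 | 4950 | 818,457 | 72.85 | 79.72 | 82.79 |
| Read correction | All reads | 100 | 628,180 | 53,163 | 2000 | 6743 | 625,472 | 99.57 | 89.25 | 94.68 |
| Read correction | Random | 95 | 594,932 | 52,702 | 2000 | 6654 | 592,270 | 99.55 | 89.22 | 94.67 |
| Read correction | Random | 90 | 571,876 | 52,531 | 2000 | 6579 | 558,019 | 97.58 | 89.21 | 94.68 |
| Read correction | Random | 85 | 536,463 | 49,260 | 2000 | 6452 | 522,184 | 97.34 | 89.19 | 94.68 |
| Read correction | Random | 80 | 498,685 | 47,746 | 2000 | 6297 | 483,630 | 96.98 | 89.19 | 94.69 |
| Read correction | REQUEST | 95 | 634,003 | 53,154 | 2000 | 6713 | 630,933 | 99.52 | 89.10 | 94.55 |
| Read correction | REQUEST | 90 | 633,478 | 53,154 | 2000 | 6715 | 629,206 | 99.33 | 89.11 | 94.56 |
| Read correction | REQUEST | 85 | 632,026 | 53,154 | 2000 | 6719 | 629,206 | 99.55 | 89.14 | 94.56 |
| Read correction | REQUEST | 80 | 627,427 | 53,157 | 2000 | 6731 | 575,145 | 91.67 | 89.25 | 94.72 |

| Stage | Selection | P (%) | Num | Max (kb) | MA | Largest alignment (kb) | N50 (kb) | NA50 (kb) | GF (%) |
|---|---|---|---|---|---|---|---|---|---|
| Contig assembly | All reads | 100 | 2185 | 673 | 10,602 | 304 | 67 | 31.00 | 55.65 |
| Contig assembly | Random | 95 | 2051 | 530 | 9689 | 216 | 57 | 27.00 | 46.36 |
| Contig assembly | Random | 90 | 1868 | 301 | 8973 | 176 | 50 | 23.00 | 36.92 |
| Contig assembly | Random | 85 | 1635 | 226 | 8165 | 160 | 43 | 16.00 | 28.09 |
| Contig assembly | Random | 80 | 1385 | 191 | 7376 | 112 | 39 | 10.00 | 20.90 |
| Contig assembly | REQUEST | 95 | 2164 | 552 | 10,815 | 307 | 68 | 31.00 | 55.82 |
| Contig assembly | REQUEST | 90 | 2142 | 552 | 10,732 | 234 | 68 | 31.00 | 55.75 |
| Contig assembly | REQUEST | 85 | 2132 | 552 | 10,616 | 234 | 68 | 31.00 | 55.57 |
| Contig assembly | REQUEST | 80 | 2113 | 552 | 10,734 | 234 | 67 | 31.00 | 54.95 |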

Share and Cite

MDPI and ACS Style

Zhang, W.; Huang, N.; Zheng, J.; Liao, X.; Wang, J.; Li, H.-D. A Sequence-Based Novel Approach for Quality Evaluation of Third-Generation Sequencing Reads. Genes 2019, 10, 44. https://doi.org/10.3390/genes10010044
