Assembling Reads Improves Taxonomic Classification of Species

Most current approach to metagenomic classification employ short next generation sequencing (NGS) reads that are present in metagenomic samples to identify unique genomic regions. NGS reads, however, might not be long enough to differentiate similar genomes. This suggests a potential for using longer reads to improve classification performance. Presently, longer reads tend to have a higher rate of sequencing errors. Thus, given the pros and cons, it remains unclear which types of reads is better for metagenomic classification. We compared two taxonomic classification protocols: a traditional assembly-free protocol and a novel assembly-based protocol. The novel assembly-based protocol consists of assembling short-reads into longer reads, which will be subsequently classified by a traditional taxonomic classifier. We discovered that most classifiers made fewer predictions with longer reads and that they achieved higher classification performance on synthetic metagenomic data. Generally, we observed a significant increase in precision, while having similar recall rates. On real data, we observed similar characteristics that suggest that the classifiers might have similar performance of higher precision with similar recall with longer reads. We have shown a noticeable difference in performance between assembly-based and assembly-free taxonomic classification. This finding strongly suggests that classifying species in metagenomic environments can be achieved with higher overall performance simply by assembling short reads. Further, it also suggests that long-read technologies might be better for species classification.


Introduction
The species level classification is an important problem in metagenomics analysis. Computational workflows aim for the identification of microbial species that are present in metagenomics samples. Examples include studies that assessed the host-microbe interactions in the gut microbiome to gain better insight into human health [1], revealed ecological differentiation of closely related bacteria [2], uncovered the presence of ancient sub-populations of marine bacteria [3], and highlighted extensive intra-species recombination [4,5].
Methods for classification and profiling of microbial communities are diverse. CLARK [6] uses a database of k-mers that aims to uniquely describes genomic regions of each targeted microbes. GOTTCHA [7] has a different approach to identifying unique genomic regions of targeted microbes by using a combination of empirical data and machine learning methods. Kraken [8] also utilizes k-mers, but builds taxonomic trees that help differentiate closely related microbes. MetaPhlAn2 [9] employs a similar taxonomic approach, but narrows read alignment and its analysis is on a set of only around one million markers.
More recent technologies can produce very long reads, but at the expense of having higher costs and much higher error rates [10]. However, longer reads have been found to be more appropriate or better compared to short reads in certain studies [11]. Single-molecule sequencing (SMS) offers exceptionally long reads that enable direct sequencing of genomic regions that are difficult to sequence with short reads, including long repetitive elements, extreme GC-content regions, and complex gene loci. Similarly, these platforms enable structural variation characterization at previously unparalleled resolution and direct detection of epigenetic marks in native DNA [12]. Similarly, the PacBio sequencing system can capture full-length 16S rRNA sequences [13]. Third-generation nanopore sequencing offers many solutions to the current problems of using whole metagenome sequencing (WMS) for infectious disease diagnostics. It has been successfully utilized for pathogen detection, AMR prediction, and characterization of mixed microbial communities [14].
While long read technologies are more appropriate for certain studies, short read technologies are mature and less expensive. Is it possible to leverage known strengths of short read technologies to garner the high performance of long reads?
In this paper, we demonstrate that it is possible to improve the performance of species classification in metagenomic applications using long reads that are assembled from short reads. This finding has two major implications. First, it suggests that many existing studies that utilize short reads can benefit from long reads that are assembled from the short reads. Although there is an extra computational cost of assembly and minor modification to the existing workflows, the increase in performance might justify the cost. Second, this finding suggests that there are potential gains in utilizing long-reads technologies in this type of applications. As current long-read technologies have different characteristics from short-read technologies in terms of cost and sequencing errors, the trade-offs between these pros and cons remain to be investigated.

Method
Most metagenomic classifiers, including those that we studied in this paper, consist of two main steps. In the preprocessing step, a classifier utilized reference genomic sequence of existing species to build an index or reference table. The index or reference table was built only once for a metagenomic environment. In the classification step, the classifier used the index or reference table to classify metagenomic data, in the form of reads, and make predictions.
Our method interrupted this workflow by modifying the classification step. Before feeding short reads as inputs to a classifier, we assembled them into long reads. Figure 1 depicts the process of comparing a classifier's performance on short reads and long reads. Figure 1A is the standard workflow of a classifier, where it took a short reads dataset as input and outputted species that it predicted to be present in the sample. Figure 1B shows a workflow, in which the same short reads were first assembled before feeding to the classifier. Different methods may have different types of prediction formats, which can be species label for each read, or predicted species for the entire dataset, or predicted percentages of species in the sample (in case of metagenomics profiling). Metagenomic classifiers outputted a rank separated taxanomic profile with relative abundances, whereas binning classifiers provided sequence identification, length used in the assignment, and taxon as output. The classifier's outputs were then converted into species names, which produced a list of species.
Each classifier required a reference database of genomic sequences to classify metagenomic reads into species. We used complete genomes of bacteria archaea, and viruses from NCBI to construct this database for each classifier. We removed species that were labeled unclassified or unknown because they might cause problems for taxonomic prediction [15].
For consistency, we used the NCBI taxonomy database [16] to standardize results from different classifiers. Further, for classifiers that produced strain-level predictions, we converted them to species-level predictions so that the results could be compared consistently across different classifiers.
We compared the outputs of classifiers using default parameters at the species level because not all classifiers still predicted at strain level. Species is a taxonomic rank more relevant in clinical diagnostics or pathogen identification than genus or phylum. Although some clinical diagnosis and epidemiological tracking often requires identification of strains, genomic databases remain poorly populated below the species level [17]. Evaluation was done in a similar way to [17,18]. For each classifier, we evaluated predicted species produced with assembled reads and predicted species produced with original short reads.
Kaiju, CLARK, Kraken, MetaCache are k-mer based methods for metagenomic read classification. CLARK and Kraken were run with the default k-mer size of 31, while MetaCache use 16-mers by default. Kaiju was run in the fastest MEM mode (with minimum fragment length m = 11), as well as in the heuristic greedy mode (with minimum score s = 65).
On the other hand, both MetaPhlAn2 [9] and DUDes [21] have to use results of read-to-reference mapping from Bowtie2 [22]; however, for some longer contigs (several million bps), Bowtie2 (version 2.3.4.2) crahsed. We used a read mapper designed for both short and long reads: Minimap2 (version 2.17) [23] as an alternative for mapping reads to reference genomes.
While running the classifiers above, we specified the "paired-end reads" option for raw read input as well as the "singleton read" option for assembled read input.
Assemblers were launched with (mostly) default parameters; taking a pair of FASTQ files that contains raw reads and then producing a single FASTA file that contains assembled reads for each dataset. The file names of assembled reads from MEGAHIT, metaSPAdes, and Ray were "final.contigs", "contigs" and "Contigs" respectively.
Ray parallelized assembly computations using the Message Passing Interface (MPI) standard, a run agent "mpirun". While metaSPAdes consumes very high memory, MEGAHIT specified multiple computational threads and optionally a graphical processing unit for improving its runtime. Due to the scope of this work, we do not report the runtime as well as memory usage of the assemblers.

Experimental Design
The main hypothesis was that the classification or identification of microbes in metagenomics samples was better done with long reads than with short reads. We aimed to design a controlled experiment to verify the hypothesis. To achieve this, we evaluated the ability to detect species in metagenomics samples of several well-known classifiers on several short-reads datasets and derived long-reads datasets. The choice of which long-reads datasets were used to compare against which short-reads datasets was an important design decision. If we chose a long-reads dataset produced by a current technology to compare against a short-reads dataset produced by a different technology, the result might be due to differences in technologies rather than in read lengths. As our goal was to examine the impact of read lengths on classification, we chose to use long reads that are derived from the same short reads. These derived long-reads datasets were constructed by assembling short reads from the datasets that are used to evaluate the classifiers' performance. Although this design choice removed the effect of sequencing technologies, it introduced the potential effect of assembling reads on the result. To address this, we evaluated classifiers with different assemblers to remove algorithmic bias on classification performance.

Performance Assessment
Classifiers were evaluated with synthetic and real samples. Although some tools could work on the strain level, we evaluated classification results at species levels since most methods still do not provide strain level identifications. Classification performance of synthetic data was measured in terms of precision, recall, and F1. To access classifiers' performance on real data, we define the overall pairwise similarity of a method c to other methods as is the number of species predicted by method i. This similarity is between 0 and 1. The closer it is to 1, the higher the overall similarity to other methods.

Data
We used the Mende datasets [28], which are synthetic data consisting of 890 genomes representing 457 strains, species, and sub-species. The simulators used to generate custom metagenomic data are freely available, as are the datasets we used http://www.bork.embl.de/~mende/simulated_data/.
Mende datasets were simulated for Sanger sequencing, pyrosequencing, and Illumina sequencing. For each technology, three metagenomes were simulated to mimic different community complexities 10 species (10 s), 100 species (100 s), and 400 species (400 s). However, the Sanger sequencing, pyrosequencing technologies seemed obsolete/out-of-date. We tested our hypothesis on Illumina paired-end raw reads of Mende datasets, which is a very widely used sequencing platform.
To test our hypothesis with the real data, we used the gut microbiome data [29]. The metagenomic shotgun-sequencing data for two samples (ERR2017411, ERR2017412) was downloaded from the European Bioinformatics Institute (EBI) database under the accession code ERP023788.
All data used were paired-end reads within the Illumina platform. Synthetic data consists of 26 million reads for each dataset (10 s, 100 s, and 400 s) with the read length of 75 bp. Real data had 17 million reads for each sample with the read length of 90 bp. There Methods a slight difference in read lengths between synthetic data and real data; however, the read lengths from 75 bp to 100 bp were reported [30,31] to produce the same alignment results.

Findings
Using assembled reads, four out of seven classifiers increased their precision by up to 2×, while maintaining similar recall; see Table 1. These four classifiers were Kaiju, CLARK, Kraken and MetaCache. The improvement in performance was most significant for smaller datasets. With the dataset 10 s, which consisted of 10 species, CLARK, for example, benefited from a 50× increase in precision with the same recall, when reads were assembled by any of the three assemblers. With the dataset 100 s, which consisted of 100 species, CLARK benefited from a 3 − 4× increase in precision with the same recall, when reads were assembled by any of the three assemblers. With the dataset 400 s, which consisted of 400 species, CLARK benefited from a 1.04× increase in precision with the same recall. Similarly, other three classifiers benefited from assembled reads. Kraken and MetaCache benefited from increases in both precision and recall with the larger datasets 100 s and 400 s.
MetaPhlAn2's performance got worse with assembled reads, compared to its performance on unassembled short reads. We also observed that DUDes and GOTTCHA did not benefit from assembled reads. The overall F1 scores were highest when reads were assembled by MEGAHIT and metaSPAdes.
We report the assembly statistics (Table 2) and read length distribution of assembled reads compared to two datasets of current long-read technologies ( Figure 2). Note that due to highly fragmented reads and contigs, the read/contig length distribution was log scaled. In Table 2, the performance of Ray with MetaPhlAn2, DUDes and GOTTCHA on the 400 s simulation dataset was much worse than that without assembly, likely because the number of contigs on 400 s was significantly smaller than that of 10 s, and 100 s. Therefore, Ray was not a good choice for assembling reads with a high number of species in the sample. We suggest that later assembly algorithm developments should consider the estimate number of species as an option.
We observed that with synthetic data, the majority of classifiers predicted much fewer species when reads were assembled; see Table 3. With fewer predicted species, there should be fewer false positives. Even if prediction mistakes were not random, making fewer positive predictions will naturally decrease the number of false positives, which tends to increase precision. This observation likely explains the drastic observed increase in precision, while maintaining similar recall, across the board with synthetic datasets. With real datasets, we observed a similar behavior that classifiers predicted much fewer species when reads were assembled. Although we could not compute precision and recall with real data, the same trend suggests that as with synthetic data, classifiers should be much more precise when reads were assembled. This increase in precision should, similarly, be drastic when the datasets have much fewer species than the index database that were used by classifiers to classify species. Additionally, Table 4 shows that with real datasets, the overall pairwise similarity decreased with assembled reads. This suggests that with assembled reads, classifiers had a higher chance of showing their uniqueness in predicting species. Table 1. Precision, recall, F-1 of species-level classification of four metagenomic classifiers on three synthetic short read datasets, which are, respectively, not assembled and assembled by three assemblers: MEGAHIT (MH), metaSPAdes (MS), and Ray.

Discussion
In this work, we showed the promising prospect of utilizing long reads in identifying species in metagenomic samples. Long reads, used in this study, are assembled from the same short reads, which were used to compare classification performance. This was performed to remove potential side effects of different sequencing technologies. As future long-read technologies achieve fewer sequencing errors and become less expensive, their use for species classification in metagenomics should be desirable.
At present, we have demonstrated that we can leverage the advantage of long reads by assembling short reads that would otherwise be used for species classification. We showed that at least two of the currently popular assemblers can be used for this purpose. We observed that MEGAHIT and metaSPAdes produced higher N50s across datasets, while Ray had lower N50s. In fact, it failed to assemble reads when the datasets contained 400 species. A quick comparison between metaSPAdes and MEGAHIT assemblers across all the datasets considered in this study confirmed that metaSPAdes performs better for a smaller dataset (10 s) while MEGAHIT performs better for larger datasets (100 s and 400 s).
We think that Kaiju, CLARK, Kraken, and MetaCache benefited from the longer reads because of their approach of using k-mers as unique markers to distinguish closely related species. On the other hand, MetaPhlAn2, DUDes and GOTTCHA have built-in statistical post-processing procedures that align reads to reference genomes, which appear not benefit from longer reads.

Conclusions
We demonstrated that improvement in taxonomic classification can be achieved with a novel assembly-based protocol. Specifically, we compared performance of popular metagenomic classifiers on short reads and longer reads, which are assembled from the same short reads. Using a number of popular assemblers to assemble short reads, we discovered that most classifiers made fewer predictions with longer reads and that they achieved higher classification performance on synthetic metagenomic data. Specifically, across most classifiers, we observed a significant increase in precision, while recall remained the same, resulting in higher overall classification performance. On real metagenomic data, we observed a similar trend as in the case of synthetic data that classifiers made fewer predictions. This suggested that they might have the same performance characteristics of having higher precision while maintaining the same recall with longer reads.
This finding has two main implications. First, it suggests that classifying species in metagenomic environments can be achieved with higher overall performance simply by assembling short reads. This finding can make a big impact on the many existing studies that utilize short reads. The modification to their existing workflow is minimal, although there is an extra computational cost of assembling short reads. We showed that a number of existing assemblers could be used for the purpose of assembling short reads into contigs for this specific purpose. Second, this finding suggests that the use of long-read technologies in taxonomic classification might result in significant improvements. Current long-read technologies tend to have higher sequencing errors and are more expensive compared to short-read technologies. The trade-offs between the pros and cons may be further investigated.
Author Contributions: Conceived and designed the experiments: Q.T., V.P.; performed the experiments: Q.T.; analyzed the data: Q.T., V.P.; contributed reagents/materials/analysis tools: Q.T.; wrote the paper: Q.T., V.P. All authors have read and agreed to the published version of the manuscript.
Funding: QT was funded by a University of Memphis teaching assistantship. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.