Advancing Mitochondrial Metagenomics: A New Assembly Strategy and Validating the Power of Seed-Based Approach

: Mitochondrial metagenomics (MMG) using Illumina sequencers for mixed-species samples provides a promising tool for evolutionary and ecological studies using mitogenomes. However, the traditional assembly procedure is still computationally intensive and expensive. Here, a novel MMG pipeline was applied to different DNA extractions, one per species, and their sequence as a mixed sample for rapid mitogenome assembly is presented. Our method integrated a faster and more accurate read mapper for ﬁltering non-mitochondrial reads. A seed-and-extend assembler for species-speciﬁc mitogenomes that detects ‘noisy species/sequences’ was also assessed. The MMG pipeline for each dataset was completed in a few hours on desktop PCs, maintaining high accuracy and completeness (COI divergence >10%), except for some very closely related taxa. Particularly for closely related species, the exclusion of ‘noisy reads’ (including chimera of non-targeted species) improved the target assembly. In addition, we observed that short barcodes used as references had almost identical detection power compared with mitogenomes but required greater sequencing depth. We tested our MMG pipeline on two real and one simulated dataset to validate its high efﬁciency in mixed-species sample mitogenome assembly.


Introduction
Understanding biodiversity, i.e., the variety and variability of life forms on Earth, has been recognized as a fundamental topic in evolutionary and ecological biology. Individualbased DNA techniques using Sanger sequencing can speed up species identification as complementary diagnoses for morphological taxonomy [1,2]. These studies usually analyze sequence divergence by amplifying nuclear and organellar gene regions, such as the standard Cytochrome C Oxidase Subunit I (COI) barcode for animals [1]. The successor metabarcoding can assess biodiversity from environmental DNA and bulk/mixed-species samples by amplicon next-generation sequencing (NGS) [3]. All of the above DNA-based approaches require PCR amplification and often suffer from some shortcomings, such as amplification difficulty across a broad taxonomic range [4], high requirements for samples (e.g., need fresh samples), and the experiment being complicated and time-consuming [5]. The recently emerged mitochondrial metagenomics (MMG) [6,7] can further unify the ecological and evolutionary understanding of biodiversity [8,9]. MMG has been applied in systematics and community phylogenetics by assembling high-copy mitochondrial contigs from low-coverage shotgun sequencing of specimen mixture [6,[10][11][12][13][14].
Despite the great potential for evolutionary and ecological studies, MMG has not been considered perfect in the efficiency of mitogenome assembly using published pipelines. These assembly procedures are usually time-consuming and labor-intensive [8]: filter nonmitochondrial reads by BLAST [15] searches against a mitogenome database, assemble putative mitochondrial reads using multiple assemblers, annotate preliminary contigs, and combine overlapping contigs to generate longer consensus sequences [6,7]. DNA searches with BLASTn are not suitable for aligning very divergent sequences [16,17]. In contrast, Burrows Wheeler transformation (BWT)-based short read mappers, e.g., BWA [18], often have higher sensitivity in several orders of magnitude in less time [19], although they are also less efficient in coping with high sequence divergence. Compared to BWT-based mappers, hash-based ones, e.g., NextGenMap [20] and NOVOPlasty [21], can handle highly polymorphic reads and genomes in a shorter running time. Unfortunately, the disk and memory usage for hash-based assembly in NOVOPlasty can be substantial in a regular PC (i.e., PC with less than 16 G memory). Therefore, read mappers completely have the potential ability to replace BLAST for the read filtering with higher accuracy in much less time.
MMG assembly usually uses tools initially designed for whole-genome data rather than AT-rich mitogenome, e.g., IDBA-UD [22], Celera Assembler [23], SOAPdenovo [24], SOAPdenovo-Trans [25], and Newbler [26]. To maximize sequence contiguity, subsequent contig consensus requires additional tools and manual checks, such as Minimus [27], TG-ICL [28], and Geneious [29]. Two major categories of mitogenome-specific assemblers exist: (1) seed-free assembly software, such as MitoZ [30] and MEANGS [31], which can directly assemble mitogenomes from raw data without seed/bait sequences; (2) seedbased assembly software, such as MITObim [32] and NOVOPlasty, which can directly assemble mitogenomes with the user-provided seed/bait sequences for single-species samples. The MitoZ assembly is accurate; however, it is very time-and computing-resourcesconsuming [30]. The MEANGS is a seed-free de novo assembly software that applies tree-search to extend contigs from self-discovery seeds and assembles the mitogenome from NGS data [31]. Nonetheless, MitoZ and MEANGS were not suitable to extract the mitochondrial genome from mixed-species samples. The earlier version (v1.6) of MITObim provided a trial proofreading algorithm for metagenomic data. It was only tested in samples with a few (<6) species [32,33] but failed in more complex samples of typical MMG pools (unpublished data). Many rounds of iterations in MITObim are very time-consuming and hinder its further applications in metagenomic data. NOVOPlasty is faster and more accurate than other software, allowing for seeds from the same species or genus, even different genera, and families. A relaxed seed design is a benefit for single-species samples but becomes a shortcoming for complex samples.
This study aims to improve the MMG assembly efficiency by integrating faster, more accurate, and low-consumption bioinformatic tools. The short read mapper of NextGenMap is used to filter non-mitochondrial reads. The mitogenome-specific assembler NOVOPlasty was used for mitogenome assembly by restricting its seed compatibility. We tested the pipeline on both real and simulated raw sequencing datasets, covering both distantly and closely related species.

Data Generation
We tested the MMG assembly pipeline on two real and one simulated sequencing datasets that have reference mitogenome sequences (COI barcode also included) for each species. Dataset DS_A (SRA174290, 150 PE) included 49 divergent animal species (see in Supplementary Materials Table S1 and Figure S1a) [7]. Dataset DS_B [34] represented an extremely closely related-bee-species case: 48 individual raw sequencing data (100 PE) were combined into a mixture (Table S2); COI p-distance among species varied from 0.036 to 0.289 (mean 0.207) ( Figure S1b). To better evaluate the performance of our pipeline in the presence of closely related species, we generated a dataset DS_C including 10 mosquito species, for which the interspecific p-distance was 0.015-0.166 (mean 0.109) ( Figure S1c); the mitochondrial coverage was set as 200x to guarantee a sufficient amount with the ratio of mitochondrial to nuclear reads as 1% for each species (Table S3). The simulation of raw sequencing reads was generated using ART [35] with a read length of 150 bp, an insert fragment size of 300 bp (standard deviation of 30 bp), and a sequencing error of 0.1% (art_illumina-ss HSXn-i GENOME.fasta-p-l 150-f 200-m 300-s 30-na-qs 30-qs2 30-o SPECIES.mito.).

Mitogenome Assembly
The workflow of MMG mitogenome assembly is shown in Figure 1. All analyses were executed in the CentOS 7 operating system on an AMD RYZEN 1700X CPU (8 cores/16 threads) and 32 G memory PC. Commands for merging all the forward or reverse reads, and the script of MMG are available on GitHub (https://github.com/xtmtd/MMG, accessed on 14 March 2022). Raw sequencing data were mapped to a mitogenome database using NextGenMap v0.5.5 with an identity threshold of 0.3. The sequences of mitogenome database were downloaded from NCBI Reference Sequence Database (RefSeq, 15 July, 2018): 1632 (117 Arachnida, eight Asteroidea, 11 Branchiopoda, six Danio, 1490 Insecta) for DS_A, 16 (Apoidea) for DS_B, 3 (one Aedes, one Anopheles, one Culex) for DS_C. Candidate paired mitochondrial reads, in which at least one of the paired ones can be mapped, were extracted with SAMtools v1.7 [36]. Mitogenomes were assembled with a loop script that included the NOVOPlasty v4.3.1 assembler while storing hash locally to speed up the assembly for possible multiple runs. Each mitogenome assembly requires a seed or bait sequence, a short mitochondrial fragment used for the initial assembly. In practice, seeds can be generated using barcoding for each species or metabarcoding for multiplexed species pool (DNA libraries with a unique identifier for each species). COI of 658 bp (standard 'barcode') and 313 bp ('mini-barcode', named mini-COI here) were selected as seeds and extracted from the reference in this study. Assembled contig(s) of each species were aligned to seeds of all species with VSEARCH v2.7.1 (-usearch_global -blast6out -maxaccepts 0 -maxrejects 0 -id 0.2) [37]; a contig of identity greater than 0.95 (i.e., highly matching seed sequences, query cover 100%) was assigned to the species from NOVOPlasty outputs, and then, mitogenomes were assembled. extremely closely related-bee-species case: 48 individual raw sequencing data (100 PE) were combined into a mixture (Table S2); COI p-distance among species varied from 0.036 to 0.289 (mean 0.207) ( Figure S1b). To better evaluate the performance of our pipeline in the presence of closely related species, we generated a dataset DS_C including 10 mosquito species, for which the interspecific p-distance was 0.015-0.166 (mean 0.109) ( Figure  S1c); the mitochondrial coverage was set as 200x to guarantee a sufficient amount with the ratio of mitochondrial to nuclear reads as 1% for each species (Table S3). The simulation of raw sequencing reads was generated using ART [35] with a read length of 150 bp, an insert fragment size of 300 bp (standard deviation of 30 bp), and a sequencing error of 0.1% (art_illumina-ss HSXn-i GENOME.fasta-p-l 150-f 200-m 300-s 30-na-qs 30-qs2 30-o SPECIES.mito.).

Mitogenome Assembly
The workflow of MMG mitogenome assembly is shown in Figure 1. All analyses were executed in the CentOS 7 operating system on an AMD RYZEN 1700X CPU (8 cores/16 threads) and 32 G memory PC. Commands for merging all the forward or reverse reads, and the script of MMG are available on GitHub (https://github.com/xtmtd/MMG, accessed on 14 March 2022). Raw sequencing data were mapped to a mitogenome database using NextGenMap v0.5.5 with an identity threshold of 0.3. The sequences of mitogenome database were downloaded from NCBI Reference Sequence Database (RefSeq, 15 July, 2018): 1632 (117 Arachnida, eight Asteroidea, 11 Branchiopoda, six Danio, 1490 Insecta) for DS_A, 16 (Apoidea) for DS_B, 3 (one Aedes, one Anopheles, one Culex) for DS_C. Candidate paired mitochondrial reads, in which at least one of the paired ones can be mapped, were extracted with SAMtools v1.7 [36]. Mitogenomes were assembled with a loop script that included the NOVOPlasty v4.3.1 assembler while storing hash locally to speed up the assembly for possible multiple runs. Each mitogenome assembly requires a seed or bait sequence, a short mitochondrial fragment used for the initial assembly. In practice, seeds can be generated using barcoding for each species or metabarcoding for multiplexed species pool (DNA libraries with a unique identifier for each species). COI of 658 bp (standard 'barcode') and 313 bp ('mini-barcode', named mini-COI here) were selected as seeds and extracted from the reference in this study. Assembled contig(s) of each species were aligned to seeds of all species with VSEARCH v2.7.1 (--usearch_global --blast6out -maxaccepts 0 --maxrejects 0 --id 0.2) [37]; a contig of identity greater than 0.95 (i.e., highly matching seed sequences, query cover 100%) was assigned to the species from NOVO-Plasty outputs, and then, mitogenomes were assembled. Assembled contig(s) from different NOVOPlasty procedures may be identical for closely related species due to the relaxed seed compatibility. With the species-specific seed Assembled contig(s) from different NOVOPlasty procedures may be identical for closely related species due to the relaxed seed compatibility. With the species-specific seed sequence (COI here), NOVOPlasty may assemble mitochondrial contigs belonging to other species present in the same sample. This means that the presence of 'noisy reads' of non-targeted 'noisy species' in the raw sequencing data obstructs the target assembly. The exclusion of these 'noisy reads' would alleviate this difficulty. These 'noisy species' were identified when the assembled contigs matched the seeds of non-targeted species revealed by VSEARCH results. Noisy reads were filtered against 'noisy mitochondrial contigs' assembled in the previous assembly round using NextGenMap, SAMtools with an identity value of 0.99. The resulting reads were used to assemble mitogenomes again for the species that failed in the previous assembly round. Then, the steps of the above noise detection-filtering-assembly procedure were repeated (usually one-two rounds) until no more noisy reads were discovered. Specific mitogenome reference sequences were drawn from Tang et al. [7,34] or NCBI public ones. Due to the great assembly difficulty in dataset DS_B, a second seed of 497-bp ND5 was used for assembly, and species with contigs shorter than 5000 bp were re-assembled in the filtering rounds.
Assembled contigs could be heterospecific contigs (chimera), although their partial regions well matched the seeds. We examined the chimera as follows: contigs were divided into multiple 200 bp fragments, and a VSEARCH search against all assembled mitogenomes was performed for each fragment. A contig having best matches (identity ≥ 97%) with two or more reference genomes was treated as a chimera. Those 'noisy' parts within the chimera were filtered and newer assemblies were generated following the above noisefiltering steps.
The final assembly was finished until no noisy regions including chimeras were detected. The assembly quality was measured relative to the reference using genome coverage (percentage of the reference genome) and accuracy (percentage of correctly assembled nucleotides for aligned regions). Only one round of filtering noisy reads was performed for most species. For the closely related species, two rounds of filtering noisy reads were needed. The complete process of mitochondrial genome assembly was executed following the custom script.
Three preconditions were used to assemble mitogenomes using the MMG pipeline: (a) having seeds in advance; (b) knowing that species inside the sample are not too similar, as to avoid cross assembly; and (c) ensuring that the DNA of each species is mixed in the right quantity.

Results
A total of 45.8, 130.8, and 3.1 Gbp of raw PE reads were generated for datasets DS_A, DS_B, and DS_C, respectively. After the removal of non-mitochondrial reads, 6.88%, 5.12%, and 1.32% of raw reads were filtered out for the subsequent assembly. Detailed assembly results and statistics can be found in Tables S1-S3. Each NOVOPlasty assembly was finished in less than ten minutes depending on the amount of input candidate mitochondrial reads. In the initial assembly, the use of longer COI as seed sequences generated more noisy assembled reads than shorter mini-COI seeds for DS_A (3 vs. 0) and DS_C (4 vs. 2). Thus, COI was not selected as seeds in DS_B due to the great number of closely related species. For the dataset DS_B, the assembly of six species failed to produce any sequences in the first round of assembly with mini-COI seeds, and three assemblies generated chimera although correct seed sequences were included in a chimera (Table S2). Chimeras of 14,974 bp assembled from two species of DS_C are highly similar to both reference sequences (Anopheles gambiae and A. merus), which had a 0.015 p-distance for COI. After one or two rounds of filtering noisy reads, all target assemblies were correctly recovered except for two species in DS_C.

Discussion
Our novel MMG pipeline greatly accelerated the assembly by efficient improvements in removing non-mitochondrial reads, particularly in simplifying the workflow and reducing running time (the run times were within 12 h, 20 h, and 8 h for datasets DS_A, DS_B, and DS_C, respectively). Assembly contiguity and accuracy can be comparable with the references (Figure 2) for both distantly and closely related taxa. Detection and removal of noisy reads, including chimera, further guarantee the correctness of target sequences.
Seed or bait sequences for each species are prerequisites for assembly using NOVO-Plasty. Seeds can be generated from standard barcoding or metabarcoding, with the latter pooling multiple amplification products to further reduce the cost [3]. In addition, the 313 bp mini-COI, which is the most frequently used marker in metabarcoding [38], outperforms the standard 658 bp-COI when serving as seeds for assembly, resulting in fewer incorrect assemblies (Tables S1 and S3). A single seed can produce nearly complete mitogenomes when sequencing coverage is enough [8], as exemplified in both DS_A and DS_C. Additional seeds (ND5 in DS_B, which was acquired from the same genome) may help to generate more contigs for one species [7], particularly when mitochondrial coverage is limited (often less than 50x) and sequencing read length is short (e.g., 100 PE in DS_B).
Although the seed-and-extend algorithm in NOVOPlasty greatly reduces the chimera for mixed-species samples, it is possible to fail for very closely related species (Tables  S2 and S3). Chimera detection and removal in our pipeline eliminated most errors but still failed in two mosquito species with 1.5% COI divergence (Table S3). According to our tests, we recommend that the accuracy and completeness of COI divergence >10% are necessary (see the details in Appendix A).
Our assessments of simulated datasets (Appendix A) indicate that species, even at the very low sequencing coverage (0.1x), can be assembled mitogenomes in bulk samples. Short barcodes as the reference have similar detection powers but require at least an order of magnitude greater in sequencing depth (much higher cost) than whole mitogenomes. Rapid decreases in sequencing costs, e.g., library preparation of CNY 20 plus CNY 4/Gbp per sample on the BGI or NovaSeq 6000 platforms (price from BerryGenomics, China, 1 March, 2022), allows MMG to be applied in a wider scope for ecological and evolutionary

Discussion
Our novel MMG pipeline greatly accelerated the assembly by efficient improvements in removing non-mitochondrial reads, particularly in simplifying the workflow and reducing running time (the run times were within 12 h, 20 h, and 8 h for datasets DS_A, DS_B, and DS_C, respectively). Assembly contiguity and accuracy can be comparable with the references (Figure 2) for both distantly and closely related taxa. Detection and removal of noisy reads, including chimera, further guarantee the correctness of target sequences.
Seed or bait sequences for each species are prerequisites for assembly using NOVO-Plasty. Seeds can be generated from standard barcoding or metabarcoding, with the latter pooling multiple amplification products to further reduce the cost [3]. In addition, the 313 bp mini-COI, which is the most frequently used marker in metabarcoding [38], outperforms the standard 658 bp-COI when serving as seeds for assembly, resulting in fewer incorrect assemblies (Tables S1 and S3). A single seed can produce nearly complete mitogenomes when sequencing coverage is enough [8], as exemplified in both DS_A and DS_C. Additional seeds (ND5 in DS_B, which was acquired from the same genome) may help to generate more contigs for one species [7], particularly when mitochondrial coverage is limited (often less than 50x) and sequencing read length is short (e.g., 100 PE in DS_B).
Although the seed-and-extend algorithm in NOVOPlasty greatly reduces the chimera for mixed-species samples, it is possible to fail for very closely related species (Tables S2 and S3). Chimera detection and removal in our pipeline eliminated most errors but still failed in two mosquito species with 1.5% COI divergence (Table S3). According to our tests, we recommend that the accuracy and completeness of COI divergence >10% are necessary (see the details in Appendix A).
Our assessments of simulated datasets (Appendix A) indicate that species, even at the very low sequencing coverage (0.1x), can be assembled mitogenomes in bulk samples. Short barcodes as the reference have similar detection powers but require at least an order of magnitude greater in sequencing depth (much higher cost) than whole mitogenomes. Rapid decreases in sequencing costs, e.g., library preparation of CNY 20 plus CNY 4/Gbp per sample on the BGI or NovaSeq 6000 platforms (price from BerryGenomics, China, 1 March, 2022), allows MMG to be applied in a wider scope for ecological and evolutionary studies [8,9] (e.g., the origin and phylodiversity of insects [39,40], relative species abundance estimation [39]).

Supplementary Materials:
The following supporting information can be downloaded at https: //www.mdpi.com/article/10.3390/d14050317/s1, Figure S1: The COI p-distance among species and phylogenetic relationships of datasets DS_A (a), DS_B (b), and DS_C (c); Table S1: Species information, mitogenome reference sequences, and assembly results for dataset DS_A. The yellow cells represent the chimeras, and the species name of the same sequence are written in brackets; Table S2: Species information, mitogenome reference sequences, and assembly results for dataset Diversity 2022, 14, 317 6 of 8 DS_B. The yellow cells represent the chimeras, and the species name and the length of the same sequence are written in brackets. The grey cells represent the fail to assemble; Table S3: Species information, mitogenome and genome reference sequences, and assembly results for dataset DS_C. The yellow cells represent the chimeras, and the species name of the same sequence are written in the brackets; Table S4: Assessment of the accuracy of MMG seed-based approach at varying mitochondrial coverages against mitogenome, COI, and mini-COI references, respectively.