Comparison of Long-Read Methods for Sequencing and Assembly of Lepidopteran Pest Genomes

Lepidopteran species are mostly pests, causing serious annual economic losses. High-quality genome sequencing and assembly uncover the genetic foundation of pest occurrence and provide guidance for pest control measures. Long-read sequencing technology and assembly algorithm advances have improved the ability to timeously produce high-quality genomes. Lepidoptera includes a wide variety of insects with high genetic diversity and heterozygosity. Therefore, the selection of an appropriate sequencing and assembly strategy to obtain high-quality genomic information is urgently needed. This research used silkworm as a model to test genome sequencing and assembly through high-coverage datasets by de novo assemblies. We report the first nearly complete telomere-to-telomere reference genome of silkworm Bombyx mori (P50T strain) produced by Pacific Biosciences (PacBio) HiFi sequencing, and highly contiguous and complete genome assemblies of two other silkworm strains by Oxford Nanopore Technologies (ONT) or PacBio continuous long-reads (CLR) that were unrepresented in the public database. Assembly quality was evaluated by use of BUSCO, Inspector, and EagleC. It is necessary to choose an appropriate assembler for draft genome construction, especially for low-depth datasets. For PacBio CLR and ONT sequencing, NextDenovo is superior. For PacBio HiFi sequencing, hifiasm is better. Quality assessment is essential for genome assembly and can provide better and more accurate results. For chromosome-level high-quality genome construction, we recommend using 3D-DNA with EagleC evaluation. Our study references how to obtain and evaluate high-quality genome assemblies, and is a resource for biological control, comparative genomics, and evolutionary studies of Lepidopteran pests and related species.


Introduction
Lepidopteran pests have a critical impact on vegetable crop production, often with a mixture of multiple pests, with many overlapping generations each year, causing huge annual economic losses. Genome sequencing has brought Lepidopteran pest control and genomics research to a new level. Genome sequencing of Plutella xylostella, Hyphantria cunea, Cydia pomonella, and Helicoverpa zea revealed the genetics of their invasive populations, explained their host and environmental adaptations at the genetic level, provided partial evidence for the causes of their rapid invasion, and determined potential genetic targets for innovative pest management strategies and the genetic basis of Bt toxin resistance [1][2][3][4][5]. Whole-genome sequencing of twenty Heliconius butterflies revealed the complex evolutionary history of the genus, demonstrating that chromosomal structural variation due to gradual penetration is responsible for increased polymorphism in butterfly wings [6]. approaches in B. mori, focusing on how to efficiently and accurately construct and evaluate chromosome-level genome assemblies in B. mori and other Lepidopteran pests. We believe our results will provide valuable guidance for future Lepidopteran pest genome projects as well as improve previous genome assemblies without generating new sequencing data.

Summary of Raw Data, Assemblies, and Benchmarks
To compare the performance of diverse TGS platforms on constructing highly contiguous genome assembly on Lepidopteran pests. We sequenced and analyzed three long-read datasets for three B. mori strains ( Table 1 and   Hi-C High-throughput Chromosome conformation capture sequencing; used for scaffolding Lu et al. (2020) [20] D9L×N4, D9L, and P50T, corresponding to three conditions: highly heterozygous, degradation, and normal. The quality of assembly was evaluated by QUAST [21], BUSCO, Inspector, and the newly proposed EagleC [22] based on deep learning. We assessed the performance of diverse TGS approaches in B. mori, focusing on how to efficiently and accurately construct and evaluate chromosome-level genome assemblies in B. mori and other Lepidopteran pests. We believe our results will provide valuable guidance for future Lepidopteran pest genome projects as well as improve previous genome assemblies without generating new sequencing data.

Summary of Raw Data, Assemblies, and Benchmarks
To compare the performance of diverse TGS platforms on constructing highly contiguous genome assembly on Lepidopteran pests. We sequenced and analyzed three longread datasets for three B. mori strains (Table 1 and Figure 1)  The genome sequencing datasets were assembled by seven different assembly tools ( Figure 2). The CLR reads were assembled by Canu, NextDenovo, MECAT, and wtdbg2. The genome sequencing datasets were assembled by seven different assembly tools ( Figure 2). The CLR reads were assembled by Canu, NextDenovo, MECAT, and wtdbg2. Assemblies of HiFi reads were performed using HiCanu and hifiasm. ONT data were assembled using NextDenovo, NECAT, and wtdbg2. Assemblies of HiFi reads were performed using HiCanu and hifiasm. ONT data were assembled using NextDenovo, NECAT, and wtdbg2. The assembly quality was evaluated according to the following six criteria: contig numbers (Contigs), contig N50 (N50) length, number of structural errors (Structural error), small structural errors per Mb (Small-scale error), number of BUSCO complete genes (Complete genes), Quality Value (QV) score and percentage of assembly errors (PAR) identified by EagleC for chromosome-level genomes.

ONT Genome Assembly
For investigating how to obtain high-quality haploid genome assemblies for fieldcaught genomically heterozygous Lepidopteran pests, we selected the silkworm D9L × N4 strain (approximately 1.11%, Figure S1a) with high genomic heterozygosity for ONT The assembly quality was evaluated according to the following six criteria: contig numbers (Contigs), contig N50 (N50) length, number of structural errors (Structural error), small structural errors per Mb (Small-scale error), number of BUSCO complete genes (Complete genes), Quality Value (QV) score and percentage of assembly errors (PAR) identified by EagleC for chromosome-level genomes.

ONT Genome Assembly
For investigating how to obtain high-quality haploid genome assemblies for fieldcaught genomically heterozygous Lepidopteran pests, we selected the silkworm D9L × N4 strain (approximately 1.11%, Figure S1a) with high genomic heterozygosity for ONT sequencing and assembly testing. The ONT sequence was assembled using three different long-read assembly tools (NextDenovo, wtdbg2, and NECAT) and eight different subsets of various coverage (10×, 20×, 40×, 60×, 80×, 100×, 120× and 160×). Detailed  statistics for each assembly are shown in Tables 2 and S1. The NextDenovo assemblies were the smallest in size (approximately 449-468 Mb) with contig numbers of approximately 89-114 ( Figure 3 and Table S1). NextDenovo generated the most contiguous assemblies (contig N50 approximately 10.0-13.8 Mb), with the highest number of complete (approximately 1181-1298) and single-copy (approximately 1176-1287) BUSCO genes. The wtdbg2 assemblies were the largest in size (approximately 452-794 Mb) and produced contig numbers of approximately 3273-13,714. Wtdbg2 generated the least contiguous assemblies (contig N50 0.15-0.81 Mb), and the lowest number of complete (669-1129) and single-copy (668-1083) BUSCO genes. The assembly quality of wtdbg2 for genomes with high heterozygosity was less satisfactory, but it is the only software that can generate assembly at 10× sequencing depth. The assembly quality of NECAT was between those of NextDenovo and wtdbg2. The NECAT assemblies' sizes were approximately 561-581 Mb, contig numbers were approximately 688-851, contig N50 were between 2.44 and 2.88 Mb, and complete BUSCO genes were between 1253 and 1272. Additionally, we analyzed the computational time of those assemblers and found that the wtdbg2 was the fastest assembler, followed by NextDenovo and NECAT (Figure 4b), saving between a third and a half of the time when the sequencing depth was greater than 80×.
To estimate the genome assembly accuracy, we calculated the number of Structural errors and Small-scale errors using Inspector. NextDenovo had the lowest number of Smallscale errors, and Structural error numbers just below those of wtdbg2 ( Figures 5 and S2). Wtdbg2 had the highest number of Small-scale errors, while the lowest number of Structural errors. NECAT had the highest number of Structural errors and the second highest of Small-scale errors. The subsequent Racon long-read polishing process greatly improved the wtdbg2 draft genome assemblies' completeness as indicated by the BUSCO complete gene percentages, which increased from between a minimum range of 49% to 82.6% to a maximum range of 74.7% to 92.1% (Table S1). The assembly accuracy was also greatly improved as indicated by the number of Small-scale errors, which decreased from between 3484 and 7767 per Mbp to between 633 and 1418 per Mbp ( Figure S2a). The subsequent Racon long-read polishing process greatly improved the wtdbg2 draft genome assemblies' completeness as indicated by the BUSCO complete gene percentages, which increased from between a minimum range of 49% to 82.6% to a maximum range of 74.7% to 92.1% (Table S1). The assembly accuracy was also greatly improved as indicated by the number of Small-scale errors, which decreased from between 3484 and 7767 per Mbp to between 633 and 1418 per Mbp ( Figure S2a).
For the purpose of investigating the effect of sequencing depth on assembly tools, we evaluated the quality of ONT assemblies on diverse sequencing depths (10×, 20×, 40×, 60×, 80×, 100×, 120× and 160×). The assembly quality on low-depth subsets (10× or 20×) varied greatly amongst different assemblers, whereas it was reliable on relatively high-depth subsets ( Figure 3). According to our findings, the dataset with around 40× ONT can construct the most genomes. However, a deeper sequencing effort is required to further enhance the genome quality.

CLR Genome Assembly
In order to investigate how to generate high-quality haploid genome assemblies from wild harvested genome degradation samples, we selected the slightly poorer quality genomes extracted from silkworm D9L samples for testing (N50 = 11,722 bp, E-size = 11,909 bp, Table 1 and Figure 1). The assembly of the CLR reads was conducted using four different long-read assembly tools (NextDenovo, Canu, wtdbg2, and MECAT2). Due to lower genomic heterozygosity, the CLR assemblies showed much smaller differences than ONT in genome size (426-506 Mb, excluding the 10× and 20× results, Table S1). When a certain sequencing depth is satisfied (>= 40×), the difference in the number of contigs for each genome assembly is not significant, and the result of NextDenovo remains the best. The continuity of all the assemblies (N50 of contigs) increased by the sequencing depth, the NextDenovo assembly increased the most pronounced ( Figure 3).
The CLR assemblies showed similar complete BUSCO gene numbers as the ONT assemblies (excluding MECAT2). Wtdbg2 generated the lowest number of Structural errors, followed by NextDenovo ( Figure 5). NextDenovo generated the lowest number of Small-scale errors followed by Canu ( Figure S2). The NextDenovo assembly showed the highest contiguous (contig N50 = 9.41 Mb), smallest size (477 Mb), and the least contigs (n = 205) ( Table S1). The Canu assembly was the largest (506 Mb) but contained a high degree of duplication as indicated by the percentage of duplicated BUSCOs (2.9%). Therefore, as was recently discovered, the assembly of Canu probably contains uncollapsed haplotypes corresponding to artifactually duplicated areas [23]. The assembly quality of four assemblers was assessed by the metric mean of the six different subsets ( Table 2). NextDenovo shows the best overall performance, followed by Canu. Though Canu needs the longest CPU hours and generates fragmented assemblies, the accuracy is excellent (Figure 4b).
Before polishing, the wtdbg2 assembly was the most fragmented (contig N50: 0.154-2.56 Mb) and the MECAT2 assembly was the least complete (12.5-57.8% complete BUSCOs) (Table S1). Subsequently, we polished the wtdbg2 and MECAT2 assemblies using the CLR long-reads. The Racon polishing steps greatly improved the wtdbg2 and MECAT2 draft assemblies' genome completeness ( Figure 3). As expected, a reduced number of Small-scale errors and Structural errors were identified in CLR assemblies when compared with ONT assemblies (Figures 5 and S2). The long-read polishing process resulted in the percentage of single-copy BUSCOs increasing significantly and the number of Small-scale errors for the MECAT2 assembly reduced sharply, while the number of Small-scale errors of wtdbg2 assembly reduced modestly (Table S1). Interestingly, the long-read polishing process did not improve the integrity of the NextDenovo and Canu assemblies. In this part, we found that the dataset with roughly 40× CLR can construct most genomes, increase the sequencing depth could improve the genome quality, and need a polishing strategy or not depending on which assemblers are used.

HiFi Genome Assembly
Purposed to testing the contribution of the new technology to high-quality genome assembly, we performed PacBio HiFi sequencing on P50T silkworm whose genomes had been previously sequenced. HiFi reads were assembled using HiCanu and hifiasm. Compared with CLR and ONT assemblies, the genomic continuity and integrity of HiFi assemblies were significantly superior. There were no significant differences in the size, continuity, and completeness of the HiFi genome assemblies. The greatest difference is reflected in the contig numbers, which are much smaller in hifiasm assemblies than those in HiCanu assemblies ( Figure 3 and Table S1). We polished the HiFi assemblies using the HiFi long-read sequences. Just as expected, the polished HiFi assembly showed a similar percentage of complete BUSCO genes compared to the raw HiFi assembly (Table S1). When compared with the ONT and CLR assembly, the HiFi assembly contained the fewest Structural errors and Small-scale errors ( Figures 5, S2 and S3). Furthermore, we evaluated the quality of HiFi assemblies on datasets with different sequencing depths (10×, 20×, 30×, 40×, 50×, and 60×) to investigate the effect of data depth on assemblers. For low-depth subsets (as the 10× subset depicted in Figure 3), the assembly quality was highly varied amongst assemblers, while on relatively high-depth subsets, it was resilient. When exceeding a certain threshold, high coverage subsets do not significantly improve the quality of assembly, either. However, higher depths will require more computing resources and instantiation times. Therefore, choosing an appropriate depth is crucial. According to our findings, the dataset with about 20× HiFi data was able to create the most genomes. Since HiFi only requires a sequencing depth of 20× or more to build most of the genome and does not require a subsequent polish process, the time used for genome assembly is much less than that of ONT and CLR, especially when using Canu.
Compared with the other two sequencing methods, HiFi assembly shows the best assembly quality, the lowest contig number, and the highest continuity, accuracy, and completion, without relying on other scaffolding. It also requires the least amount of time and computer memory and can be considered the optimal sequencing method for future Lepidopteran pest genomes.

Construction and Quality Assessment of Hi-C-Based Chromosome-Level Genomes
The quality of the three long-read sequencing assemblies was significantly superior compared with short-read sequencing. However, none of the HiFi assemblies completed the assembly of all the chromosomes. We selected the best genome assembly for each sequencing method using 3D-DNA for genome construction at the chromosome level. Using default parameters, 3D-DNA achieved clustering of most of the chromosomes. However, there remained some chromosome clustering errors and contig translocations and inversions, which were identified using the Hi-C map ( Figure S4). Further manual adjustment by Juicebox can fix these organizational errors. However, this individualized manual adjustment often does not conform to a uniform standard. We then designed a quality assessment standard for chromosome-level genome assembly based on EagleC. This can identify organizational errors rapidly and accurately and is able to report the percentage of misassemblies in the genome assembly in the form of a table to facilitate the correction of these assembly errors (Figure 6c and Table 3). Based on EagleC's recommendations, we completed the adjustment of the genome assemblies and performed polishing using Racon and gap filling using TGS-GapCloser. Finally, using five-base telomere repeats ('TTAGG') [24] as a sequence query, we identified 50 telomeres and constructed 28 pseudomolecules (25 of 28 were represented by a single large contig, and the remaining three were assembled from two main contigs) for the silkworm (P50T-HiFi) genome (Figure 6a,c). Compared with the SilkBase reference genome (P50T-SilkBase), the P50T-HiFi assembly filled 30 gaps that were found in the SilkBase assembly. These gaps ranged from 99 to 75,391 bp and were distributed throughout the genome. The parallel plots showed that the P50T-HiFi assembly displayed good collinearity with the P50T-SilkBase assembly in most of the chromosomes, however, we also found some differences (Figure 6c). According to the EagleC report, these discrepant regions are caused by several Mb-level assembly errors, for example Chr24 (Figure 6e). The assembly mistake in the P50T-SilkBase assembly is also confirmed by the Chr19 parallel plots of five silkworm genome assemblies (Figure 6d). Furthermore, the final remaining three gaps in the P50T-HiFi assembly were filled with P50T-SilkBase assembly, resulting in a gap-free silkworm genome assembly (P50T). This is the first nearly complete telomere-to-telomere reference genome of silkworm (P50T). Although the genome assembly quality of CLR and ONT is not as good as that of HiFi, both completed very high consecutive and complete chromosome-level genome assemblies after treatment with EagleC and 3D-DNA, which is based on Hi-C (Figure 6b and Table 3). silkworm (P50T). Although the genome assembly quality of CLR and ONT is not as good as that of HiFi, both completed very high consecutive and complete chromosome-level genome assemblies after treatment with EagleC and 3D-DNA, which is based on Hi-C ( Figure 6b and Table 3).

Case
Whether our assembly process is applicable to the genome assembly of other Lepidopteran pests, and whether it can help optimize the genome assembly of published genomes, we selected the genome sequencing data of Korean silkworm (KRSM) [25] and Dendrolimus punctatus (D. punctatus) [26] for testing. Based on the results of the above comparison, an optimal pipeline was selected for those Lepidopteran insects.
We evaluated the metrics of the final assemblies and demonstrated these in Table 3  and Table S2. The pipeline can build the high-completeness genome assembly in KRSM with approximately 16.89 Mb scaffold N50, 97.5% complete BUSCO genes (Figure 6b), 55.13 small-scale errors per Mb, 34 structural errors and the value of PAR is 0.05%, demonstrating the accuracy of this genome assembly. On the other hand, misassemblies in the genome assembly of D. punctatus were identified and optimized using the EagleC evaluation process, significantly improving the quality of the optimized genome (Figure 7b-d).
The circle plots showed that the genome of D. punctatus we assembled here shared good collinearity with Dendrolimus kikuchii (D. kikuchii), and confirmed the assembly errors published in previous studies (Figure 7e). This demonstrates the compelling potential of the EagleC evaluation process to assess and optimize the quality of published genome assemblies.
For genome sequencing of Lepidopteran pests, we recommend HiFi and Hi-C sequencing followed by hifiasm and 3D-DNA for assembly and chromosome mounting, which achieves the best haploid genome assembly. For species already sequenced by ONT or CLR, we recommend NextDenovo, 3D-DNA, and EagleC for chromosome-level genome optimization. For genome sequencing of Lepidopteran pests, we recommend HiFi and Hi-C sequencing followed by hifiasm and 3D-DNA for assembly and chromosome mounting, which achieves the best haploid genome assembly. For species already sequenced by ONT or CLR, we recommend NextDenovo, 3D-DNA, and EagleC for chromosome-level genome optimization.

Discussion
Lepidoptera is the second largest order of insects, some of which are severe pests of agriculture and forests and cause significant economic losses each year. Genome sequencing of Lepidopteran pests has contributed greatly to their control. The accumulating genomic resources have been a crucial source for major breakthroughs in life science innovations and discoveries. Seventy-six arthropod genome assemblies were used to characterize the changes in genes and protein contents for a better understanding of 500 million years of evolution [27]. A study of 195 insect genomes revealed a high diversity of transposable elements across insects with varying degrees of conservation depending on phylogenetic position [28]. Horizontal gene transfer (HGT) events in 218 insects acquired from nonmetazoan sources provide insight into the adaptation of GTs in insects [29]. The breadth of Lepidopteran pest genome sequencing spans approximately 300 million years of evolution and roughly two orders, with genome sizes ranging from the tiny 229.9 Mb genome of Papilio polytes to the massive genome of Parnassius apollo at 1392 Mb [10]. However, this represents a mere 0.08% of the approximately 160,000 described insects [11], with many orders remaining without genomic representation. In the future, more Lepidopteran pest genomes will be sequenced. There are a number of questions that need to be addressed at that time: What sequencing strategy and which assembler should be chosen? How can we complete the genome assembly and quality assessment faster and more accurately?
In this study, we use the silkworm as a model to compare and analyze the mainstream TGS strategies, including sequencing platform, sequencing depth, and assemblers. Using the silkworm high-quality genome assembly as a reference, we compared BUSCO, Inspector, and EagleC to find the most appropriate long-read sequencing strategy under different conditions. We performed 32, 42, and 12 de novo genome assemblies of silkworms on high-depth ONT, CLR, and HiFi reads, correspondingly to comprehensively assess the effect of assemblers and sequencing coverage on Lepidopteran insect genome assembly. Each sequencing strategy has its own advantages: the ONT genomic library has the largest fragment size, the HiFi data has the highest accuracy, and the CLR is in between. The focus of different assemblers is different.
On graft genome assembly, the NextDenovo assembler shows the best performance on both CLR and ONT sequencing datasets, and hifiasm is better for HiFi datasets. We recommend the use of 3D-DNA in combination with the EagleC evaluation to complete the construction of chromosome-level genomes. The polishing process is completed without the original assembly, you can finish the scaffolding and polish it afterward, the quality of the assembly is similar, and it is timesaving. In the case application, the KRSM genome was obtained from CLR and Hi-C data with excellent continuity and integrity. Other factors, such as throughput, convenience, and price should also be reasons for considering which genome sequencing platform to choose.
Here, using various techniques and ultra-high-depth datasets, we evaluated the effects of sequencing coverage on Lepidopteran pest genomes assembly. The quality of the genome tends to stabilize after 40× on the ONT and CLR datasets, and at 20× on HiFi datasets, and improves as the sequencing depth increases. The sequencing depth of 20× is the minimum for genome construction, however, the higher the sequencing depth, the better the assembly effect is not necessary, and too high a sequencing depth will cause excessive consumption of computing resources. Especially for large genome projects, such as pan-genome, you must select an appropriate sequencing depth to efficiently reduce the burden of computing resources and the cost of time and money.
The currently available mainstream genome quality assessment markers are N50, BUSCO, Merqury, QUAST-LG, and Inspector. However, the N50 is just a simple continuous statistic, BUSCO can only evaluate conserved genomic regions, Mercury requires users to input high-precision reads and is not suitable for long-read data, and QUAST-LG relies excessively on existing reference genomes [18]. Several software programs have been developed to implement scaffolding based on Hi-C data, HiRISE, LACHESIS, SALSA, 3D-DNA, and ALLHiC [30]. EagleC combines deep-learning and ensemble-learning strategies can predict the whole range of SVs with a Hi-C map very quickly and accurately [22]. SVs can induce de novo chromatin interactions across the breakpoints, which are similar to assembly errors, both show aberrant interaction blocks. We then designed a quality assessment standard for chromosome-level genome assembly based on EagleC. The EagleC process that we developed is different from the previous evaluation work. It is a deep learning-based process used to accurately and rapidly evaluate the quality of chromosomelevel genome assemblies and direct the repair of assembly errors. It can also be used for the optimization of published chromosome-level genome assemblies. The quality of de novo genome assemblies has a great significant impact on gene annotation and comparative genomic research [31]. At the same time, we noticed that although we have developed good quality assessment standards for chromosome-level genome assembly, it is difficult for novices to complete the assessment. Building an online database that can generate results reports with one click can greatly solve this problem, and facilitate the widespread use of scholars with different research backgrounds, which is what we are currently doing.
In this study, we take into account that practical issues are faced in Lepidopteran pest genome sequencing projects, including high genome heterozygosity, poor quality genome libraries with short fragments, and assembly approaches tailored to various scenarios. Additionally, it has demonstrated how to improve an existing genome assembly based on the findings of a genome evaluation without producing new sequencing data. This benchmark work offers insights for other eukaryote genomes such as mildew and microalgae, and even complicated human genomes, in addition to helping to build the high-quality genomes of Lepidopteran pests.

Insect Material
B. mori strains (P50T, D9L, and D9 × N4) were sourced from the Center for Frontier Interdisciplinary Biology, Southwestern University, China. Silkworms were reared on mulberry leaves under 12-h light and 12-h dark photoperiod at 28 • C from the 1st to 4th instars, and 25 • C after the 4th ecdysis.

Genome Sequencing and Creation of Subsets
Genomic DNA of P50T, D9L, and D9 × N4 was extracted, detected, and sequenced for generating PacBio HiFi, PacBio CLR and ONT reads at Frasergen (Wuhan, China), separately. Among them, D9L and D9 × N4 have not been previously sequenced.

Genome Assembly Evaluation
QUAST (v5.0.2) and Inspector were used to assess the assembly quality generated by different assembly tools. BUSCO(v4) was used to assess the completeness of the genome assembly with the insect (odb10) protein set. We selected the number and N50 of contigs, complete genes number from BUSCO, Small-scale errors per Mb, the number of Structural errors, and Quality Value (QV) score from Inspector to visualize in the main text. The QV was calculated on the base of the identified structural and small-scale errors in the assemblies [18].
Furthermore, we designed a quality assessment standard for chromosome-level genome assembly based on EagleC, a deep-learning framework for detecting a full range of assembly errors from Hi-C contact maps that were used to identify both small-scale and large-scale assembly errors, accurately. It reports the percentage (Equation (1)), type, and locus of specific assembly errors, and provides solutions on how to fix these assembly errors.
Percentage of assembly errors (PAR) = Total length of the assembly errors Total length of the assembly (1)