Chromosome-Level Genome Assembly of Papilio elwesi Leech, 1889 (Lepidoptera: Papilionidae)

Simple Summary Swallowtail butterflies are renowned for their extensive morphological diversity, especially their wing color patterns, with around 600 described species. Given their morphological diversity and species richness, as well as their ecological habits, their phytophagous insect lineage is regarded to have evolved successfully. According to the NCBI database, sixty Papilionidae genomes have been reported, of which only two species have been assembled at the chromosome level. Currently, the paucity of high-quality genome data, especially for rarely seen and endemic butterflies, has hampered our knowledge of their evolution and biology at the genome level. Here, we report a chromosome-level genome assembly of Papilio elwesi, a butterfly species endemic to the Chinese mainland. The sequencing, assembly, and annotation of Papilio elwesi in this study enrich the available butterfly genome resources and lay the data foundation for future research. Abstract A rarely seen butterfly species, the large swallowtail butterfly Papilio elwesi Leech, 1889 (Lepidoptera: Papilionidae), endemic to the Chinese mainland, has been declared a state-protected animal in China since 2000, but its genome is not yet available. To obtain high-quality genome assembly and annotation, we sequenced the genome and transcriptome of P. elwesi using the PacBio and PromethION platforms, respectively. The final assembled genome was 358.51 Mb, of which 97.59% was anchored to chromosomes (30 autosomes and 1 Z sex chromosome), with a contig/scaffold N50 length of 6.79/12.32 Mb and 99.0% (n = 1367) BUSCO completeness. The genome annotation pointed to 36.82% (131.99 Mb) repetitive elements and 1296 non-coding RNAs in the genome, along with 13,681 protein-coding genes that cover 98.6% (1348) of the BUSCO genes. Among the 11,499 identified gene families, 104 underwent significantly rapid expansions or contractions, and these rapidly expanding families play roles in detoxification and metabolism. Additionally, strong synteny exists between the chromosomes of P. elwesi and P. machaon. The chromosome-level genome of P. elwesi could serve as an important genomic resource for furthering our understanding of butterfly evolution and for more in-depth genomic analyses.


Introduction
Among the 18,000+ known butterfly species (Lepidoptera: Papilionoidea), the family Papilionidae Latreille 1802, commonly known as swallowtail butterflies, comprises more than 3% of all butterfly species [1]. Owing to the colorful wing color patterns and the extensive morphological diversity among the species, sexes, populations, and seasonal forms [2][3][4], it is widely regarded as one of the most popular and esthetically attractive butterflies and of great significance for exploring the morphological diversification and Insects 2023, 14, 304 3 of 12 pupae rather than adult specimens to acquire purer DNA and RNA while minimizing the risk of potential digestive and other microbial contamination. The pupae were delivered to Berry Genomics (Beijing, China), and then stored in liquid nitrogen for subsequent DNA and RNA extraction, library preparation, and sequencing. One pupal individual was used for PacBio HiFi and Illumina transcriptome sequencing, whereas Illumina wholegenome (genome survey), Hi-C (genome), and ONT (transcriptome) sequencing were performed on the other pupa. Genomic DNA and transcriptome RNA were extracted by the cetyltrimethylammonium bromide (CATB) method and TRIzol™ Reagent, respectively. Prior to sequencing, quality control was performed on the extracted DNA and RNA to ensure that they met the requirements of the sequencing platforms. The quality of the DNA was checked using the NanoDrop 2000 Spectrophotometer, Qubit fluorometer, pulsed-field electrophoresis (Bio-rad CHEF), and agarose gel electrophoresis, while the RNA was monitored using the Agilent 4200 TapeStation system and NanoDrop 2000 Spectrophotometer. All details about the sequencing platform, insert size, and data amount are summarized in Table S1.

Genome Survey and Assembly
Quality control was performed using Fastp v0. 23.0 [29] with parameters "-q 20 -D -g -x -u 10-5 -r -c" to eliminate bases with quality scores below 20, drop duplication, trim polyG/X tails, and enable base correction in overlapped regions. The script khist.sh in BBTools v38.82 [30] was used to generate a histogram file of 21-kmers for the genome survey in GenomeScope2 v2.0.0 [31], and the genome size estimation for P. elwesi was set to "-k 21 -p 2 -m 10000".
Prior to the genome assembly, a format conversion of PacBio HiFi ccs reads from BAM to FASTA was performed in SAMtools v1.10 [32]. Hifiasm v0.16.1-r375 [33] was used for the preliminary genome assembly, and Gfatools converted the assembly graph of the primary contigs in the GFA format to the FASTA file. Contigs of read coverage lower than 20× were filtered. Minimap2 v2.24-r1122 [34,35] was used for mapping reads and Purge_Dups v1.2.5 [36] was used to delete redundant heterozygous regions.
Scaffolding with Hi-C data employed Juicer v1.6.2 [37] and 3D-DNA v180922 [38]; the former was used to align reads to the assembly, remove duplicates, and extract Hi-C contacts, the latter was used for anchoring primary contigs into chromosomes. Then we manually reviewed the candidate assembly in conjunction with Juicebox v1.11.08 [37] according to the Hi-C heatmaps. Additionally, compared to the UniVec and nucleotide databases in the NCBI, a blastn-like search was performed using MMseqs2 v11 [39] for detecting potential contaminant sequences within the assembly. Further assessments of the genome quality were carried out based on the raw reads mapping rate, consensus quality (QV) estimation, and genome completeness, calculated using SAMtools, Merqury v1.3 [40], and BUSCO v5.2.2 [41], respectively.
Protein-coding gene prediction integrated evidence at the DNA, RNA, and protein levels through the MAKER v.3.01.03 [50] pipeline, which can be boiled down into three strategies. The first strategy was ab initio gene prediction using protein and transcriptome evidence in BRAKER v2.1.6 [51], where Augustus v3.4.0 [52] was combined with GeneMark-ES/ET/EP v4.68_lic [53] as predictors integrated for gene model training; the reference Arthropoda protein evidence was downloaded from the OrthoDB10 protein database [54] and incorporated into BRAKER along with RNA-seq data to increase the coding gene prediction accuracy. The second strategy was transcriptome-based prediction, where transcripts (transcriptome evidence) were assembled using StringTie v.2.1.3 [55] with mixed Illumina short reads and ONT reads. HISAT2 v.2.2.0 [56] was used to align RNA short reads to the assembled P. elwesi genome, while Minimap2 was used to map ONT reads. The third strategy was homology-based prediction, for which we downloaded high-quality genome assemblies and annotations of the following six reference species: Drosophila melanogaster (Diptera), Spodoptera frugiperda (Noctuoidea), Aricia agestis (Lycaenidae), Danaus plexippus (Nymphalidae), Pieris rapae (Pieridae) and Papilio machaon (Papilioninae) as evidence of protein homology, and combined them with transcriptome evidence for gene predictions in GeMoMa v1.8 [57]. Using the evidence presented above, MAKER produced the final genome structure annotation.
Divergence time estimation was performed by using the script mcmctree_AA.sh (https://github.com/xtmtd/Phylogenomics/tree/main/scripts accessed on 23 August 2022), which applies MCMCTree from PAML v.4.9j [72]. Four fossil calibration points from Espeland et al. [7] were selected here: root (<201. To guarantee convergence, the script was run at least twice. At last, changes (expansion or contraction) in the genome families were inferred across the divergence time estimation tree using CAFE v.4.2.1 [73], which applied an error correction model and lambda (birth-death parameter) search using the default parameters. Functional enrichment analysis based on PCGs from significantly expanded families enriched the GO and KEGG terms using clusterProfiler v3.10.1 [74], and the cutoff values of the p-value and q-value were respectively set to 0.01 and 0.05, with the Benjamini-Hochberg (BH) correction method.

Chromosomal Synteny
To reveal the interspecific chromosomal variation between P. elwesi and P. machaon, MM-seqs2 was searched against their protein sequence alignments using the easy-search module with a target sensitivity of 7.5, a maximum E-value threshold of 1e-5, and 5 hits accepted per query sequence (-s 7.5 -alignment-mode 3 -num-iterations 1-e 1e-5 -max-accept 5). Then, we obtained the synteny between two papilionid species using MCXcanX [75] with an alignment significance level of 1e-10 (-e 1e-10) and a syntenic block of five genes (-s 5).

Sequencing and Genome Survey
There was a total of 147.24 Gb data sequenced for P. elwesi (Table S1) (Table S1), respectively. For the 54.51 Gb Illumina DNA data, 40.06 Gb was retained after the quality control process and then used for the genome survey. The genome was estimated to be 331. 80-331.96 Mb in size, with 0.93% heterozygosity and 17.78% (58.99-59.02 Mb) repetitive sequences. The peak (~44×) in front of the main peak (~88×) may be the sex chromosome, which accounts for about half of the coverage of the main peak ( Figure S1a).

Genome and Mitochondrion Assembly
The final P. elwesi assembly contained 147 scaffolds and 207 contigs, the N50 length of the scaffold/contig was 12.32/6.79 Mb, and the length of the max scaffold/contig was 20.18/15.81 Mb (Table S2). Its size was 358.51 Mb, with 97.59% anchored on 31 chromosomes (Figure 1a and Figure S2), and its GC content was 37.08% (Table S2). The sequencing coverage of each chromosome was calculated and is shown in Table S3. The coverage of one chromosome is 45.86×, which is approximately half the coverage of most chromosomes. Since a few Lepidoptera females have no W sex chromosome, the sex chromosome system of Lepidoptera is WZ/Z for females and ZZ for males [26], and both the Z and W sex chromosomes for females are theoretically half the coverage of the autosomes. However, only one chromosome was sequenced about 45× here. We judged it to be sex chromosome Z, and the P. elwesi we collected was a female swallowtail butterfly. The genome survey estimated a smaller genome size because only the autosomal size was calculated. The coverage of some chromosomes is apparently higher than 90×, which is because of the high proportion of transposons (repetitive sequences), resulting in multiple reads repeatedly mapped to one region. The BUSCO assessment indicated that the completeness of the assembled genome is up to 99.0% (insecta_odb10, n = 1367) with 98.2% single-copy orthologues and 0.8% duplicated genes. The raw sequencing data mapped to the assembly had a high mapping ratio: 99.9%/95.07% for long/short DNA and 91.96% for short RNA, and the estimated consensus quality value for the assembly reached 55 (QV = 55). These results indicate that the genome we assembled is of high quality. and the estimated consensus quality value for the assembly reached 55 (QV = 55). These results indicate that the genome we assembled is of high quality. In comparison to the two Papilionidae chromosome-level assembled genomes available in the NCBI database, the genome size of P. elwesi is bigger than that of P. machaon (GCA_912999745.1, 252.11 Mb) but smaller than that of Iphiclides podalirius (GCA_933534255.1, 430.73 Mb). It has a much longer scaffold N50 length than the prior assembly of P. machaon (8.78 Mb) and the highest GC content. In terms of the BUSCO completeness, it is on par with P. machaon at 99% and higher than Iphiclides podalirius at 97% (Table S2). It is evident from the high degree of contig contiguity and completeness that the assembled genome of P. elwesi is of high quality. There is general consensus that most lepidopteran species have a sex chromosome system of WZ/ZZ (female/male), but In comparison to the two Papilionidae chromosome-level assembled genomes available in the NCBI database, the genome size of P. elwesi is bigger than that of P. machaon (GCA_912999745.1, 252.11 Mb) but smaller than that of Iphiclides podalirius (GCA_933534255.1, 430.73 Mb). It has a much longer scaffold N50 length than the prior assembly of P. machaon (8.78 Mb) and the highest GC content. In terms of the BUSCO completeness, it is on par with P. machaon at 99% and higher than Iphiclides podalirius at 97% (Table S2). It is evident from the high degree of contig contiguity and completeness that the assembled genome of P. elwesi is of high quality. There is general consensus that most lepidopteran species have a sex chromosome system of WZ/ZZ (female/male), but the genus Papilio is a special group among lepidopterans, possessing both the WZ/ZZ (P. machaon) and Z/ZZ (P. elwesi) sex chromosome systems. The mitochondrial genome of P. elwesi was assembled into a circularity with a size of 15,337 bp, and the length of the assembled mitochondrion was close to that of most Papilio species (~15,300 bp) public in the NCBI. The mitochondrion was annotated with 37 genes; the numbers of transfer RNA genes, protein-coding genes, and ribosomal RNA genes were 22, 13, and 2, respectively. A majority of the genes were found on the positive strand (majority strand or J-strand), including fourteen tRNA genes (tRNA-Met, tRNA-Ile, tRNA-Trp, tRNA-Leu, tRNA-Lys, tRNA-Asp, tRNA-Gly, tRNA-Ala, tRNA-Arg, tRNA-Asn, tRNA-Ser, tRNA-Glu, tRNA-Thr, and tRNA-Ser2) and nine protein-coding genes (ND2, COX1, COX2, ATP8, ATP6, COX3, ND3, ND6, and CYTB), with the remaining genes in the reverse strand (minority strand or N-strand). It displays the same gene order as the publicly available swallowtail butterflies deposited in the NCBI, such as Papilio maraho (NC_014055.1), Papilio helenus (KP247522.1), and Papilio dialis (OP135983.1). Moreover, its gene order and orientation are consistent with some other lepidopteran species, e.g., moths [76,77]. Nevertheless, the order of trnM-trnI-trnQ in P. elwesi is different from the ancestor of Ditrysia, which is trnI-trn-trnM [78], while the former undergoes rearrangement as trnM-trnI-trnQ. The base compositions of A, T, G, and C were 40.17%, 39.61%, 7.49%, and 12.73%, respectively, indicating a high A+T content (79.78%) in the mitochondrion of P. elwesi. The mitochondrial genome assembled and annotated in the study was deposited in GenBank under the accession number OQ581151  (Table S4; Figure 1a). The proportion of repetitive elements was subequal to that of P. machaon (37.24%, data from NCBI Papilio machaon Annotation Release 100).
The MAKER pipeline predicted 13,681 PCGs with a mean gene/protein length of 8529.8/571.7. The numbers of exons, introns, and CDSs per gene were 7.4, 6.4, and 7.2, respectively, and their corresponding mean lengths were 332.3, 1003.9, and 221.3 (Table S6). The BUSCO achieved a completeness assessment of 98.60% (insecta_odb10, n = 1367) for the protein sequences, demonstrating predictions of high quality. A total of 13,442 (98.25%) genes were hit in the UniProtKB database with at least one record, and 11,414 (83.43%) and 13,042 (95.33%) were predicted by InterProScan and eggNOG, respectively. Genes with 10,208 GO items and 4995 KEGG pathway terms were identified by combining the InterProScan and eggNOG results (Table S6).

Gene Family Evolution and Phylogenetic Relationships
OrthoFinder identified 156,231 genes among 11 species, including 150,006 (96.02%) genes assigned to 14,488 orthogroups and 6225 (3.98%) unassigned genes. There were 7650 orthogroups shared by all species, including 3891 single-copy orthogroups, and 5891 species-specific genes clustered into 1524 orthogroups (Table S7; Figure 1c). The distribution of the gene families (orthogroups) for the 11 species is summarized in Table S8. In the P. elwesi genome, 13,411 (98.03%) of the 13,681 identified genes were contained in 11,499 orthogroups, and the numbers of species-specific orthogroups and genes were 38 and 113, respectively (Table S8).
The gene family evolution analysis in CAFE indicated that the global error estimation of the input data was 0.0875, and the estimated lambda was 0.00193608081496. In total, 621 gene families expanded and 863 gene families contracted in P. elwesi. Among them, 104 (77 expansions and 27 contractions) gene families experienced rapid evolution (Table S9; Figure 1c). These rapid expansion families included those responsible for cuticle formation (cuticular protein: 17; chitin-binding type-2 domain-containing protein: 15; insect cuticle protein: 7), detoxification metabolism (glutathione S-transferase: 14; cytochrome P450: 5), digestion (trypsin: 11), juvenile hormone synthesis (farnesyl pyrophosphate synthase: 9), chemosensation (gustatory receptor: 8; PBP/GOBP family: 8; odorant receptor: 5), immunity (cecropin family: 7), and so on (Table S10, Figure 1b). We speculate that these expanded families are related to foraging and adaptation to the host plant, as terpenoidderived metabolites in Lauraceae trees, such as linalool, have recently been reported to act on the expressions of cuticular proteins and cytochrome P450s simultaneously [80]. The GO and KEGG enrichments of the rapid expansion families further indicate that their functions are involved in detoxification, metabolism, alarm pheromone, and some functions regarding biological processes for which we could not discover the specific meaning ( Figure S3).

Chromosomal Synteny
Chromosome comparisons between the genomes of P. elwesi and P. machaon revealed conserved syntenic relationships among the chromosomes (Figure 1d). Eighty syntenic blocks with 20,067 (74.2%) collinear genes from the two genomes were identified.
Most chromosomes (1, 4, 6-10, 13, 14, 16-20, 22, 25, 29) of P. elwesi matched perfectly with the corresponding chromosomes of P. machaon, whereas no collinearity was found between chromosome 30 of P. elwesi (Pelw30) and chromosome 28 of P. machaon (Pmac28) (Figure 1d). We can confirm that Pelw30, with the interaction signals, is a complete chromosome belonging to P. elwesi according to the Hi-C heatmap ( Figure S2). We suspect that Pelw30 is a neo-chromosome or W sex chromosome. Specifically, the high content of repetitive sequences in Pelw30 is particularly similar to that of the W sex chromosome [25,27,81], but there is no clear evidence for this at present. Chromosome 15 of P. machaon (Pmac15) was collinear to chromosomes 15 (Pelw15) and 28 (Pelw28) of P. elwesi. The Z sex chromosomes of the two papilionid species also showed collinearity, and the W sex chromosome of P. machaon (PmacW) was related to the Z chromosome of P. elwesi (PelwZ). The dominant role that the W chromosome plays in determining the sex of Lepidoptera seems to be doubtful, as demonstrated here, and has been raised by several previous studies (e.g., Yoshido et al. [82]; Dalíková et al. [26]; Fraïsse et al. [27]). This also implies that Z/ZZ may be an ancestral system, with the W chromosome evolving later from a common ancestor [26]. Taking this into consideration, the genome assembled here in high quality may serve as a genomic resource for future exploration and understanding of the sex determination mechanism and sex chromosome evolution in Lepidoptera.

Conclusions
With PacBio HIFi, Hi-C, ONT, and Illumina sequencing data, we report the lepidopteran genome assembly of Papilio elwesi, an endemic butterfly from the Chinese mainland. The chromosome-level assembly comprises 207 contigs with a size of 358.5 Mb. The BUSCO completeness of the assembly and the 13,681 predicted protein-coding genes reach 99% and 98.6%, respectively. The results of the phylogenomic analysis support Papilionidae at the basal position to all other butterflies. The sex chromosome system of P. elwesi is Z/ZZ (female/male), which differs from the majority of lepidopterans (WZ/ZZ). The high-quality genome assembled here is important evidence for revealing lepidopteran sex determination systems and an important source of data for future studies exploring lepidopteran evolution and comparative genomics.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/insects14030304/s1, Figure S1. GenomeScope profile plots of Papilio elwesi: (a) untransformed linear; (b) untransformed log. Figure S2. Genome-wide chromosomal heatmap of Papilio elwesi. Each chromosome and scaffold are framed in blue and green, respectively. Figure S3. Top 20 enriched functions of GO (a) and KEGG (b) pathways in significantly expanded gene families. Table S1. Statistics of sequencing data used for genome assembly and annotation. Table S2. Genome assembly statistics of three Papilionidae species. Table S3. The coverage of the 31 chromosomes. Table S4. Annotation of repetitive elements in the Papilio elwesi genome. Table S5. Annotation of non-coding RNAs in the Papilio elwesi genome. Table S6. Summary statistics of structural and functional annotations in the Papilio elwesi genome. Table S7. Statistics of gene family among 11 selected species. Table S8. Gene counts for each species. Table S9. Gene family evolution of 11 species inferred from CAFE. Table S10. Statistics of significantly expanded gene families in Papilio elwesi, excluding families lacking functional annotation.  Data Availability Statement: All of the raw sequencing data (accessions CRR588447-CRR588454) and the final genome assembly (accession GWHBOVJ00000000) of Papilio elwesi have been deposited at the BioProject PRJCA012581 at the China National Center for Bioinformation-National Genomics Data Center (CNCB-NGDC). We have also uploaded all raw sequencing data (SRR23648870, SRR23689296, SRR23700075, SRR23702046, SRR23703133) and final genome assembly (JARFMI000000000) to the NCBI under BioProject PRJNA937921. Additionally, information regarding the genome annotation has been uploaded to Figshare and is available at https://doi.org/10.6084/m9.figshare.22045085.v1.