Transcriptome Characterization for Non-Model Endangered Lycaenids, Protantigius superans and Spindasis takanosis, Using Illumina HiSeq 2500 Sequencing

The Lycaenidae butterflies, Protantigius superans and Spindasis takanosis, are endangered insects in Korea known for their symbiotic association with ants. However, necessary genomic and transcriptomics data are lacking in these species, limiting conservation efforts. In this study, the P. superans and S. takanosis transcriptomes were deciphered using Illumina HiSeq 2500 sequencing. The P. superans and S. takanosis transcriptome data included a total of 254,340,693 and 245,110,582 clean reads assembled into 159,074 and 170,449 contigs and 107,950 and 121,140 unigenes, respectively. BLASTX hits (E-value of 1.0 × 10−5) against the known protein databases annotated a total of 46,754 and 51,908 transcripts for P. superans and S. takanosis. Approximately 41.25% and 38.68% of the unigenes for P. superans and S. takanosis found homologous sequences in Protostome DB (PANM-DB). BLAST2GO analysis confirmed 18,611 unigenes representing Gene Ontology (GO) terms and a total of 5259 unigenes assigned to 116 pathways for P. superans. For S. takanosis, a total of 6697 unigenes were assigned to 119 pathways using the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database. Additionally, 382,164 and 390,516 Simple Sequence Repeats (SSRs) were compiled from the unigenes of P. superans and S. takanosis, respectively. This is the first report to record new genes and their utilization for conservation of lycaenid species population and as a reference information for closely related species.


Introduction
Butterflies form an invincible part of the Earth's rich biodiversity and are considered as quality-of-life indicators. Butterflies accord aesthetic value, are part of our natural heritage, and are portrayed as a symbol of beauty, peace or freedom. They are also of immense scientific value and have been used in diverse areas of biological research including pest control, population dynamics, biodiversity conservation, evolution, and genetics. Lately, they have been studied in the context of climate change and global warming. This is not surprising, as populations of many butterfly species have declined and many show substantial changes in their distribution. The primary cause for the vulnerability of butterfly species includes the damage to habitations due to human activities, agricultural activities and air pollution [1,2]. The consistent decline in butterfly populations has prompted ecologists to engage in species and community conservation initiatives that explore themes such as the evolutionary origins of butterfly diversity, population dynamics and threats (parasitism, predation and human impacts). Other scientific strategies used for the conservation assessment of butterfly populations around the globe include expressed sequence information, which is whole genome or transcriptome sequencing to characterize genes for positive selection of the species.
In the context of the declining butterfly populations, a Red List assessment is imperative as it addresses the likelihood and predisposition of the species in becoming extinct in their natural environments and hence promotes the prioritization and strict assessment of conservation programs [1]. The Red List of butterflies has been published on different scales and from different countries following the guidelines of the International Union for Conservation of Nature and Natural Resources (IUCN) [1,3,4]. In the Korean peninsula, an exhaustive investigation on the endangered butterfly populations was conducted to identify critically endangered, endangered, and vulnerable species to understand the causes of decline over time [5]. Subsequent to this, a Red List assessment was conducted in Korea using the IUCN Red List Categories and Criteria (version 3.1) and IUCN Red List Categories and Criteria (version 8.0), that classified the status of animals and plant species into several categories [6]. From these assessments, we summarized that a majority of threatened butterflies (classified as Critically Endangered, Endangered, and Vulnerable) belonged to Family Lycaenidae and Nymphalidae of the Lepidopteran taxon. The Lycaenidae Family of butterflies (gossamer-winged butterflies) represents around 40% of all known butterfly species. Generally, the butterfly species from the family (about 75% of species) show a mutualistic or commensal relationship with ants in nature, but few have evolved into parasites of ants [7,8]. Many such ant-parasitic Lycaenid species are highly endangered and are the prime focus of insect conservation biology [9]. A total of 15 Lycaenid species have been included in the Korean Red List of Threatened Species (2014). The population of these Lycaenid species have shown a decline due to climate change and loss of symbiotic ants. An extinction in host ant species is detrimental to the existence of such associated ant-parasitic Lycaenid butterflies, hence a conservation of the former is also a priority to protect the latter in the Korean peninsula [10].
The Lycaenidae family species Spindasis takanosis and Protantigius superans were listed as endangered in Korea by a Ministry of Environment report in 2005 [11]. These Lycaenid have been classified as Level 2 species and are vulnerable to face extinction unless threatening factors are eliminated or mitigated. Spindasis takanosis (Matsumara) has been reported from Gyunggi-do, Gangwon-do, Chungchungnam-do and Jeollanam-do provinces of Korea. The population of this species has declined alarmingly due to the loss of forests and symbiotic ants such as Crematogaster matsumurai [12]. S. takanosis belongs to the majority non-parasitic species of Lycaenids that are symbionts of C. matsumurai [8]. Protantigius superans was not included as a threatened butterfly species in an earlier study [5], but it was included as a vulnerable species by the Korean Red List assessment (2014). The ant-parasitism status of P. superans has not been reported. In fact, there are only scattered occurrences of ant-parasitism among the Lycaenidae species, suggesting an independent evolution of such interactions. The major barrier for the conservation efforts even after the legalization of the Korean Endangered Species Act in 2005, was the lack of data that determined the genetics, ecology and distribution of the species. As an initiative towards preserving the gene pool of the Lycaenid species S. takanosis and P. superans, the genomic and RNA sequence information initiative was considered as a foolproof strategy, more so after the completion of the mitogenome sequencing for the species [13,14]. The need for conservation genomics initiatives towards biodiversity assessment and management has been advocated, especially with the rapid evolution of Next-Generation Sequencing (NGS) platforms [15,16]. The NGS enables the derivation of a global gene or RNA expression profile that could lead to the discovery of genetic markers such as Simple Sequence Repeats (SSRs) and Single Nucleotide Polymorphisms (SNPs), and candidate transcripts as markers for ecological fitness, quantitative trait loci (QTL) and so on [17,18].
The transcriptome sequencing of butterfly species using NGS has provided with clues towards understanding the population genetic structure and setting conservation goals and priorities. A case in this initiative pertains to the rapidly declining populations of the marsh fritillary butterfly, Euphydryas aurinia, wherein next-generation 454-pyrosequencing characterized seven novel microsatellite loci [19]. Other 454-sequencing data has established the transcriptome of the Glanville fritillary butterfly (Melitaea cinxia) appropriately, with the discovery of large number of SNPs [18]. The phylogeography of the Karner blue butterfly, belonging to the genus Lycaeides, using 454-pyrosequencing has also been reported [20]. Lately, Illumina sequencing has been extremely successful in the identification of SSRs in endangered specimens within and across taxonomic groups [21,22]. Most notably, the genome assembly of the Monarch butterfly involved the use of Illumina paired-end sequencing [23]. We have used the Illumina HiSeq 2500 NGS technology to characterize the transcriptome of endangered Lycaenids S. takanosis and P. superans and annotated the genomic resources from the species for the mechanistic dissection of ecologically relevant traits. This study also bridges the gap between the genomic sequence information of model organisms vs. non-model species that are essentially the viable targets of biodiversity conservation and phylogenetics.

Transcriptome Analysis
In order to obtain the transcriptomes of the endangered Lycaenid butterflies S. takanosis and P. superans, a cDNA library was constructed from the RNA isolated from the whole body of adult insects and sequenced on the Illumina HiSeq 2500 NGS platform. The transcriptome assembly and analysis work-flow has been depicted in Figure 1.
The Illumina HiSeq 2500 sequencing of P. superans generated a total of 258,875,070 raw reads (32,618,258,820 bases) with a mean length of 126 bp. The filtering of reads based on quality parameters resulted in the discard of 0.32% of bases to get a paired-end profile of 32,514,410,974 bases with an average length of 125.6 bp. After stringent quality assessment, a total of 254,340,693 clean reads (Q 20~9 9% and percent of unknown nucleotide is 0%) were obtained, which represents 98.25% of the obtained raw reads. The mean length and the N50 length of the obtained clean reads were 124.3 bp and 126 bp, respectively, with GC% (or guanine-cytosine content) of 39.81%. In the case of S. takanosis, the transcriptome generated a total of 249,312,792 raw reads (31,413,411,792 bases) with a mean length of 126 bp. Adapter trimming led to a discard of 0.35% of the total base pairs processed, which, after stringent quality assessment, generated a total of 245,110,582 clean reads (30,515,812,866 bases). The mean length, N50 length, and GC% of the obtained clean reads were 124.5, 126, and 41.96%, respectively. The summary of the read processing analysis based on quality parameters is shown in Table S1. The sequence reads generated from the transcriptome sequencing of P. superans and S. takanosis have been submitted to GenBank Sequence Read Archive (SRA) at National Center for Biotechnology Information (NCBI) under accession numbers SRP063812 and SRP063813, respectively. The processed high quality reads were assembled using the Trinity program which uses three sequential software modules, namely Inchworm, Chrysalis, and Butterfly, for de novo transcriptome assembly [24]. Trinity is an exclusive program for assembling transcript sequences from Illumina transcriptome data and scores over other de novo transcriptome algorithms developed including SOAPdenovo-Trans [25], Trans-ABySS [26], and Oases [27]. With Trinity, a total of 159,074 contigs were assembled for the P. superans transcriptome, with N50 (contig length such that equal or longer contigs amount to half of the total assembly length) and mean length of 1220 bp and 746.3 bp, respectively. A significant proportion of the assembled contigs (39.28%) were ≥500 bp with the longest contig size of 15,152 bp. A total of 170,449 contigs were assembled for the S. takanosis transcriptome, with N50 and mean length of 1372 bp and 786.4 bp, respectively. Out of the total assembled contigs, 66,844 contigs (39.21%) were ≥500 bp with the longest contig size of 16,820 bp. The size distribution of the assembled contigs for the P. superans and S. takanosis transcriptomes have been shown in Figure 2A. It is evident that the proportion of short contigs and contigs over 1 kb were high in our datasets. Also, the contig N50 value was found higher in the Lycaenid butterfly transcriptome compared to N50s obtained from transcriptome assemblies of distinct insects [28][29][30]. The contigs were finally clustered to a total of 107,950 unigenes with 89,022,313 bases for P. superans and 121,140 unigenes with 100,232,710 bases for S. takanosis. For P. superans, the N50 and mean length of unigenes were 1452 and 824.7 bp, respectively, with a GC% of 38.46%. Among these unigenes, 29,596 (27.42%) had a size of no more than 300 bp, 54,518 (50.50%) were in the sizes of 301-1000 bp, 13,325 (12.34%) were of lengths in between 1001 and 2000 bp, and 10,511 (9.74%) were over 2000 bp. For S. takanosis, the N50 and mean length of unigene sequences were 1537 and 827.4 bp, respectively, with a GC% of 38.68%. Among these unigenes, 35,101 (28.97%) had a size of ≤300 bp, 61,001 (50.36%) were in the size range of 301-1000 bp, 13,000 (10.73%) were of lengths 1001-2000 bp, and 12,038 (9.93%) were over The processed high quality reads were assembled using the Trinity program which uses three sequential software modules, namely Inchworm, Chrysalis, and Butterfly, for de novo transcriptome assembly [24]. Trinity is an exclusive program for assembling transcript sequences from Illumina transcriptome data and scores over other de novo transcriptome algorithms developed including SOAPdenovo-Trans [25], Trans-ABySS [26], and Oases [27]. With Trinity, a total of 159,074 contigs were assembled for the P. superans transcriptome, with N50 (contig length such that equal or longer contigs amount to half of the total assembly length) and mean length of 1220 bp and 746.3 bp, respectively. A significant proportion of the assembled contigs (39.28%) were ě500 bp with the longest contig size of 15,152 bp. A total of 170,449 contigs were assembled for the S. takanosis transcriptome, with N50 and mean length of 1372 bp and 786.4 bp, respectively. Out of the total assembled contigs, 66,844 contigs (39.21%) were ě500 bp with the longest contig size of 16,820 bp. The size distribution of the assembled contigs for the P. superans and S. takanosis transcriptomes have been shown in Figure 2A. It is evident that the proportion of short contigs and contigs over 1 kb were high in our datasets. Also, the contig N50 value was found higher in the Lycaenid butterfly transcriptome compared to N50s obtained from transcriptome assemblies of distinct insects [28][29][30]. The contigs were finally clustered to a total of 107,950 unigenes with 89,022,313 bases for P. superans and 121,140 unigenes with 100,232,710 bases for S. takanosis. For P. superans, the N50 and mean length of unigenes were 1452 and 824.7 bp, respectively, with a GC% of 38.46%. Among these unigenes, 29,596 (27.42%) had a size of no more than 300 bp, 54,518 (50.50%) were in the sizes of 301-1000 bp, 13,325 (12.34%) were of lengths in between 1001 and 2000 bp, and 10,511 (9.74%) were over 2000 bp. For S. takanosis, the N50 and mean length of unigene sequences were 1537 and 827.4 bp, respectively, with a GC% of 38.68%. Among these unigenes, 35,101 (28.97%) had a size of ď300 bp, 61,001 (50.36%) were in the size range of 301-1000 bp, 13,000 (10.73%) were of lengths 1001-2000 bp, and 12,038 (9.93%) were over 2000 bp. The size distribution of assembled unigenes for both the Lycaenid butterflies are shown in Figure 2B. A summary of the P. superans and S. takanosis transcriptomes depicting the datasets obtained after the processing of raw reads, Trinity de novo assembly, and TGIR Gene Indices Clustering Tool (TGICL) clustering are depicted in Table 1. The transcriptome sequence and assembly efficiency was better in case of S. takanosis as it resulted in greater number of transcripts from a lesser number of raw read sequences.

Sequence Annotation
The assembled unigenes of P. superans and S. takanosis were used as query sequences and blasted (BLASTX search; E-value ď 1.0ˆ10´5) against various protein databases, including a locally curable Protostome DB (PANM-DB), Unigene and EuKaryotic Orthologous Groups (KOG) databases. Significant matches were found for the assembled unigenes with the subject sequences in PANM-DB with 44,529 (41.25%) and 46,852 (38.68%) unigenes of P. superans and S. takanosis recovering the BLAST results. Similarly, a total of 15,331 (14.2%) unigenes of P. superans and 22,124 (18.26%) unigenes of S. takanosis found homology to sequences in the Unigene database. Roughly, 18,511 (17.14%) and 24,603 (20.31%) unigenes of P. superans and S. takanosis show BLASTX hits in the KOG database. The sequence-based annotation of unigenes from the P. superans and S. takanosis transcriptomes are shown in Table 2.  We found that a significant proportion of longer unigene sequences (≥1000 bp) are circumstantial in returning a higher proportion of BLAST hits against various protein databases as compared to short-read unigenes (≤300 bp). In the case of P. superans, a total of 18,506 and 13,234 unigenes had common homologous matches in the PANM-DB and KOG DB and PANM-DB and Unigene DB, respectively. A total of 9659 unigene sequences overlapped within the three protein databases. Only three unigenes sequences were found exclusive to the KOG DB, whereas two sequences were uniquely shared between the KOG and Unigene DB. A total of 22,448 and 2095 sequences showed homologies exclusive to the PANM-DB and Unigene DB, respectively. The sequence annotation results for the P. superans butterfly are shown in Figure 3A. The unigene annotation for S. takanosis ( Figure 3B) showed a total of 13,888 sequences overlapping the three databases. A total of 24,528, 18,538, and 13,940 unigenes recovered BLAST hits from both the PANM-DB and KOG DB, the PANM-DB and Unigene DB, and the KOG DB and Unigene DB, respectively. BLASTX was also used  We found that a significant proportion of longer unigene sequences (ě1000 bp) are circumstantial in returning a higher proportion of BLAST hits against various protein databases as compared to short-read unigenes (ď300 bp). In the case of P. superans, a total of 18,506 and 13,234 unigenes had common homologous matches in the PANM-DB and KOG DB and PANM-DB and Unigene DB, respectively. A total of 9659 unigene sequences overlapped within the three protein databases. Only three unigenes sequences were found exclusive to the KOG DB, whereas two sequences were uniquely shared between the KOG and Unigene DB. A total of 22,448 and 2095 sequences showed homologies exclusive to the PANM-DB and Unigene DB, respectively. The sequence annotation results for the P. superans butterfly are shown in Figure 3A. The unigene annotation for S. takanosis ( Figure 3B) showed a total of 13,888 sequences overlapping the three databases. A total of 24,528, 18,538, and 13,940 unigenes recovered BLAST hits from both the PANM-DB and KOG DB, the PANM-DB and Unigene DB, and the KOG DB and Unigene DB, respectively. BLASTX was also used to search for the matches of P. superans and S. takanosis unigenes against Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) protein functional databases. A total of 18,661 (17.29%) and 5259 (4.87%) unigenes of P. superans found matches to protein sequences in the GO and KEGG databases, respectively ( Table 2). As expected, the majority (more than 55%) of GO and KEGG annotated transcripts were ě1000 bp. The non-annotated transcripts may also attribute to novel genes, but it is possible that most of these shorter sequences may lack a functional conserved domain and hence are missing sequence matches in the databases. to search for the matches of P. superans and S. takanosis unigenes against Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) protein functional databases. A total of 18,661 (17.29%) and 5259 (4.87%) unigenes of P. superans found matches to protein sequences in the GO and KEGG databases, respectively ( Table 2). As expected, the majority (more than 55%) of GO and KEGG annotated transcripts were ≥1000 bp. The non-annotated transcripts may also attribute to novel genes, but it is possible that most of these shorter sequences may lack a functional conserved domain and hence are missing sequence matches in the databases.

Homology Characteristics of Assembled Unigenes
The characteristics of the homology search of assembled unigenes from P. superans recovered by BLAST hits against the PANM-DB were analyzed. We studied for homology characteristics such as the E-value, identity and similarity distribution ( Figure 4). We have also performed the homology search of assembled unigenes against the Unigene DB ( Figure S1). The E-value distribution revealed that a significant proportion of unigenes (24,694, 55.46%) showed significant homology to previously deposited sequences in the PANM-DB with an E-value ranging from 1.0 × 10 −50 to 1.0 × 10 −5 ( Figure  4A). Identity distribution chart revealed a close distribution with 14,728 (33.08%), 14,144 (31.76%), and 10,740 (24.12%) unigenes showing identity of 60%-80%, 40%-60%, and 80%-100%, respectively. About 603 (1.35%) unigenes showed an identity of 100% to match subject sequences in the PANM-DB ( Figure 4B). According to the similarity distribution chart, 20,838 (46.80%) unigenes had a similarity of 80%-100% and 17,618 (39.57%) unigenes had a similarity of 60%-80% with the deposited sequences ( Figure 4C). In addition, our results showed that the unigene hit percentage increased steadily with an increase in the length of the unigenes. Above 70% of P. superans unigene sequences over 1500 bp in length showed BLASTx hits to protein sequences in the PANM-DB. In contrast, only close to 20% of sequences shorter than 300 bp found a hit to homologous sequences in the database ( Figure 4D). In the homology search of P. superans assembled sequences using the Unigene DB ( Figure  S1), we found a majority showing an E-value ranging from 1.0 × 10 −50 to 1.0 × 10 −5 ( Figure S1A). The identity distribution plot revealed most assembled sequences showing an identity of 80%-100% ( Figure S1B) to sequences in the Unigene DB. About 50% of sequences >2001 bp showed annotation hits, while none of the <200 bp sequences were annotated to sequences in the database ( Figure S1C).

Homology Characteristics of Assembled Unigenes
The characteristics of the homology search of assembled unigenes from P. superans recovered by BLAST hits against the PANM-DB were analyzed. We studied for homology characteristics such as the E-value, identity and similarity distribution ( Figure 4). We have also performed the homology search of assembled unigenes against the Unigene DB ( Figure S1). The E-value distribution revealed that a significant proportion of unigenes (24,694, 55.46%) showed significant homology to previously deposited sequences in the PANM-DB with an E-value ranging from 1.0ˆ10´5 0 to 1.0ˆ10´5 ( Figure 4A). Identity distribution chart revealed a close distribution with 14,728 (33.08%), 14,144 (31.76%), and 10,740 (24.12%) unigenes showing identity of 60%-80%, 40%-60%, and 80%-100%, respectively. About 603 (1.35%) unigenes showed an identity of 100% to match subject sequences in the PANM-DB ( Figure 4B). According to the similarity distribution chart, 20,838 (46.80%) unigenes had a similarity of 80%-100% and 17,618 (39.57%) unigenes had a similarity of 60%-80% with the deposited sequences ( Figure 4C). In addition, our results showed that the unigene hit percentage increased steadily with an increase in the length of the unigenes. Above 70% of P. superans unigene sequences over 1500 bp in length showed BLASTx hits to protein sequences in the PANM-DB. In contrast, only close to 20% of sequences shorter than 300 bp found a hit to homologous sequences in the database ( Figure 4D). In the homology search of P. superans assembled sequences using the Unigene DB ( Figure S1), we found a majority showing an E-value ranging from 1.0ˆ10´5 0 to 1.0ˆ10´5 ( Figure S1A). The identity distribution plot revealed most assembled sequences showing an identity of 80%-100% ( Figure S1B) to sequences in the Unigene DB. About 50% of sequences >2001 bp showed annotation hits, while none of the <200 bp sequences were annotated to sequences in the database ( Figure S1C). We have summarized the characteristics of the homology search of assembled unigenes from S. takanosis in Figure 5. The E-value distribution chart revealed that about 22,791 (48.64%) unigenes showed an E-value ranging from 1.0 × 10 −50 to 1.0 × 10 −5 ( Figure 5A). As with P. superans, S. takanosis unigenes reveal close identity distribution with 15,658 (33.42%), 13,138 (28.04%), and 12,843 (27.41%) unigenes showing identity in the range of 60%-80%, 80%-100%, and 40%-60%, respectively. About 670 (1.43%) unigenes showed an identity of 100% to subject sequences in the PANM-DB ( Figure 5B). The similarity distribution chart shows a larger share of unigenes (24,613, 52.53%) having similarity in the range of 80%-100% ( Figure 5C). Unlike the P. superans unigenes hit percentage, almost 50% of the unigenes in S. takanosis with a length of less than 200 bp showed a positive hit. Following the same, the hit percentage increased by about 90% for unigenes above 2001 bp in length ( Figure 5D). Homology search of S. takanosis assembled sequences using the Unigene DB ( Figure S2) showed a majority of sequences having an E-value ranging from 1.0 x 10 -50 to 1.0 x 10 -5 ( Figure S2A). A majority of sequences showed an identity of 80%-100% ( Figure S2B) to sequences in the Unigene DB. Assembled sequences of >2001 bp showed about 60% annotation hits to sequences in the Unigene DB ( Figure S2C). Moreover, more unigene hits were observed with the PANM-DB compared to the Unigene DB over the length of assembled sequences. This indicates that longer unigenes are more likely to get an identifiable affiliation during BLAST matches, due to the likely presence of a representative protein domain that may be hard to find in shorter sequences. More so, the longest unigenes yield BLAST hits and annotations with a higher frequency [31,32].
The BLASTx top-hit species distribution of unigenes matched to the PANM-DB has been shown for P. superans and S. takanosis in Figure 6A,B, respectively. In the case of both the butterfly species, the highest match was observed to Danaus plexippus (14,239 unigenes, 13.19%, for P. superans; and 12,316 unigenes, 10.17%, for S. takanosis), followed by Bombyx mori (10,244 unigenes, 9.49%; and 8601 unigenes, 7.10%). Other species ranked high with a greater number of hits included the Arthropods such as Plutella xylostella, Pararge aegeria and the mollusk Aplysia californica among others with more than 1000 unigene hits. We also analyzed the top-hit species distribution of the 15,331 and 22,124 assembled sequences for P. superans ( Figure 6C) and S. takanosis ( Figure 6D) matched to the Unigene We have summarized the characteristics of the homology search of assembled unigenes from S. takanosis in Figure 5. The E-value distribution chart revealed that about 22,791 (48.64%) unigenes showed an E-value ranging from 1.0ˆ10´5 0 to 1.0ˆ10´5 ( Figure 5A). As with P. superans, S. takanosis unigenes reveal close identity distribution with 15,658 (33.42%), 13,138 (28.04%), and 12,843 (27.41%) unigenes showing identity in the range of 60%-80%, 80%-100%, and 40%-60%, respectively. About 670 (1.43%) unigenes showed an identity of 100% to subject sequences in the PANM-DB ( Figure 5B). The similarity distribution chart shows a larger share of unigenes (24,613, 52.53%) having similarity in the range of 80%-100% ( Figure 5C). Unlike the P. superans unigenes hit percentage, almost 50% of the unigenes in S. takanosis with a length of less than 200 bp showed a positive hit. Following the same, the hit percentage increased by about 90% for unigenes above 2001 bp in length ( Figure 5D). Homology search of S. takanosis assembled sequences using the Unigene DB ( Figure S2) showed a majority of sequences having an E-value ranging from 1.0ˆ10´5 0 to 1.0ˆ10´5 ( Figure S2A). A majority of sequences showed an identity of 80%-100% ( Figure S2B) to sequences in the Unigene DB. Assembled sequences of >2001 bp showed about 60% annotation hits to sequences in the Unigene DB ( Figure S2C). Moreover, more unigene hits were observed with the PANM-DB compared to the Unigene DB over the length of assembled sequences. This indicates that longer unigenes are more likely to get an identifiable affiliation during BLAST matches, due to the likely presence of a representative protein domain that may be hard to find in shorter sequences. More so, the longest unigenes yield BLAST hits and annotations with a higher frequency [31,32].
The BLASTx top-hit species distribution of unigenes matched to the PANM-DB has been shown for P. superans and S. takanosis in Figure 6A,B, respectively. In the case of both the butterfly species, the highest match was observed to Danaus plexippus (14,239 unigenes, 13.19%, for P. superans; and 12,316 unigenes, 10.17%, for S. takanosis), followed by Bombyx mori (10,244 unigenes, 9.49%; and 8601 unigenes, 7.10%). Other species ranked high with a greater number of hits included the Arthropods such as Plutella xylostella, Pararge aegeria and the mollusk Aplysia californica among others with more than 1000 unigene hits. We also analyzed the top-hit species distribution of the 15,331 and 22,124 assembled sequences for P. superans ( Figure 6C) and S. takanosis ( Figure 6D)        Functional prediction and classification of the butterfly unigenes were achieved by a search against the KOG database (Figure 7). A total of 18,511 (17.15% of total unigenes) and 24,603 (20.31% of total unigenes) unigenes of P. superans ( Figure 7A) and S. takanosis ( Figure 7B), respectively were ascribed functions under 25 categories arranged to four main functional groups. A greater proportion of unigenes were clustered to the cellular processes and signaling group (4985 unigenes for P. superans and 6981 unigenes for S. takanosis), followed by the metabolism group (3408 for P. superans and 4882 for S. takanosis) and the information storage and processing group (2868 for P. superans and 3801 for S. takanosis). Within the cellular processes and signaling group, most of the unigenes were classified under the signal transduction process category followed by the post-translational modification, protein turnover and chaperone categories. Furthermore, a significant fraction of unigenes remained poorly characterized (5275 unigenes for P. superans and 6169 unigenes for S. takanosis).  10 Functional prediction and classification of the butterfly unigenes were achieved by a search against the KOG database (Figure 7). A total of 18,511 (17.15% of total unigenes) and 24,603 (20.31% of total unigenes) unigenes of P. superans ( Figure 7A) and S. takanosis ( Figure 7B), respectively were ascribed functions under 25 categories arranged to four main functional groups. A greater proportion of unigenes were clustered to the cellular processes and signaling group (4985 unigenes for P. superans and 6981 unigenes for S. takanosis), followed by the metabolism group (3408 for P. superans and 4882 for S. takanosis) and the information storage and processing group (2868 for P. superans and 3801 for S. takanosis). Within the cellular processes and signaling group, most of the unigenes were classified under the signal transduction process category followed by the post-translational modification, protein turnover and chaperone categories. Furthermore, a significant fraction of unigenes remained poorly characterized (5275 unigenes for P. superans and 6169 unigenes for S. takanosis). unigenes into four major categories of information storage and processing, cellular processes and signaling and metabolism. The code descriptions for KOG categories are as follows: J, translation, ribosomal structure, and biogenesis; A, RNA processing and modification; K, transcription; L, replication, recombination, and repair; B, chromatin structure and dynamics; D, cell cycle control, cell division, and chromosome portioning; Y, nuclear structure; V, defense mechanisms; T, signal transduction mechanisms; M, cell wall/membrane/envelope biogenesis; N, cell motility; Z, cytoskeleton; W, extracellular structures; U, intracellular trafficking, secretion, and vesicular transport; O, posttranslational modification, protein turnover, and chaperones; C, energy production and conversion; G, carbohydrate transport and metabolism; E, amino acid transport and metabolism; F, nucleotide transport and metabolism; H, co-enzyme transport and metabolism; I, lipid transport and metabolism; P, inorganic ion transport and metabolism; Q, secondary metabolites biosynthesis, transport and catabolism; R, general function prediction only; S, unknown function; Multi, more than one classified function. unigenes into four major categories of information storage and processing, cellular processes and signaling and metabolism. The code descriptions for KOG categories are as follows: J, translation, ribosomal structure, and biogenesis; A, RNA processing and modification; K, transcription; L, replication, recombination, and repair; B, chromatin structure and dynamics; D, cell cycle control, cell division, and chromosome portioning; Y, nuclear structure; V, defense mechanisms; T, signal transduction mechanisms; M, cell wall/membrane/envelope biogenesis; N, cell motility; Z, cytoskeleton; W, extracellular structures; U, intracellular trafficking, secretion, and vesicular transport; O, post-translational modification, protein turnover, and chaperones; C, energy production and conversion; G, carbohydrate transport and metabolism; E, amino acid transport and metabolism; F, nucleotide transport and metabolism; H, co-enzyme transport and metabolism; I, lipid transport and metabolism; P, inorganic ion transport and metabolism; Q, secondary metabolites biosynthesis, transport and catabolism; R, general function prediction only; S, unknown function; Multi, more than one classified function.

Functional Annotation Using GO and KEGG
To functionally classify P. superans and S. takanosis unigenes, GO terms and KEGG pathway classification were assigned to each unigene using the BLAST2GO software suite. For P. superans, the 18,661 unigenes annotated from the total unigene profile were allocated one or more GO terms based on sequence similarity. A total of 89,289 unigene sequences were without GO terms. The GO annotated unigenes were functionally classified into three broad categories such as biological process, cellular component, and molecular function. A summary of P. superans unigenes and GO terms have been shown in Figure 8. The molecular function category was assigned 16,200 unigenes, followed by biological process with 11,757, and cellular components with 6252 unigenes. Of these unigenes, 3880 showed functional attributes shared within the three main categories. Additionally, 5448, 808 and 737 unigenes were found uniquely attached to the molecular function, cellular component, and biological process categories, respectively ( Figure 8A). As shown in Figure 8B, a significant number of unigenes were represented by more than a single GO term. Only 5492 (29.43%) unigenes were ascribed to one GO term with most unigenes (5760, 30.87%) represented by two GO terms.

Functional Annotation Using GO and KEGG
To functionally classify P. superans and S. takanosis unigenes, GO terms and KEGG pathway classification were assigned to each unigene using the BLAST2GO software suite. For P. superans, the 18,661 unigenes annotated from the total unigene profile were allocated one or more GO terms based on sequence similarity. A total of 89,289 unigene sequences were without GO terms. The GO annotated unigenes were functionally classified into three broad categories such as biological process, cellular component, and molecular function. A summary of P. superans unigenes and GO terms have been shown in Figure 8. The molecular function category was assigned 16,200 unigenes, followed by biological process with 11,757, and cellular components with 6252 unigenes. Of these unigenes, 3880 showed functional attributes shared within the three main categories. Additionally, 5448, 808 and 737 unigenes were found uniquely attached to the molecular function, cellular component, and biological process categories, respectively ( Figure 8A). As shown in Figure 8B, a significant number of unigenes were represented by more than a single GO term. Only 5492 (29.43%) unigenes were ascribed to one GO term with most unigenes (5760, 30.87%) represented by two GO terms. All unigene sequences were classified into molecular function, cellular components, and biological process level 2. Within the molecular function category, the unigenes were further classified to 13 sub-categories, out of which a majority were represented under binding (9502 unigenes, 46.89%), catalytic activity (7611, 37.56%), and transporter activity (1263, 6.23%). About 61 sequences represented the antioxidant activity with very few transcripts assigned to protein tag, nutrient reservoir activity, and metallochaperone activity ( Figure 9A). Within the cellular components category, the majority of unigenes represented cell (3530, 31.57%), membrane (3033, 27.12%), organelle (2458, 21.98%) and macromolecular complex (1510, 13.50%). Few unigenes also represented synapse (50, 0.45%) and extracellular matrix (49, 0.44%) ( Figure 9B). For the biological process category, there were 19 sub-categories, and metabolic process (8303, 27.86%), cellular process (7985, 26.79%) and single-organism process (18.18%) were the predominant GO groups. A smaller proportion of unigenes also fell under response to stimulus (1447, 4.86%), signaling (1081, 3.63%), reproduction (41, 0.14%), reproductive process (37, 0.12%), and immune system process (29, 0.10%) ( Figure 9C).
In the case of S. takanosis, 22,275 unigenes (18.39% of assembled unigenes) were assigned one or more GO terms, while 98,865 unigenes were without GO terms. The GO distribution for S. takanosis unigenes assigned 16,200 terms to the molecular functions category, followed by 11,757 terms to the biological process and 6252 terms to the cellular components categories ( Figure 10A). A total of 6737, 978, and 757 unigenes were found exclusive to molecular function, cellular components, and biological process categories, respectively. About 4932 unigenes showed functional attributes shared between the three major categories of GO terms. As with P. superans, a greater number S. takanosis unigenes (15,486) also got represented by more than one GO term. Only 6789 (30.48%) unigenes showed homology to a single GO term ( Figure 10B). Int. J. Mol. Sci. 2015, 16, page-page 12 978, and 757 unigenes were found exclusive to molecular function, cellular components, and biological process categories, respectively. About 4932 unigenes showed functional attributes shared between the three major categories of GO terms. As with P. superans, a greater number S. takanosis unigenes (15,486) also got represented by more than one GO term. Only 6789 (30.48%) unigenes showed homology to a single GO term ( Figure 10B).   , and 757 unigenes were found exclusive to molecular function, cellular components, and biological process categories, respectively. About 4932 unigenes showed functional attributes shared between the three major categories of GO terms. As with P. superans, a greater number S. takanosis unigenes (15,486) also got represented by more than one GO term. Only 6789 (30.48%) unigenes showed homology to a single GO term ( Figure 10B).   The number of S. takanosis unigenes ascribed to various sub-categories under the three major categories of GO terms (level 2) are shown in Figure 11. Under the molecular function category, binding (11,887 unigenes, 48.08%) and catalytic activity (8845 unigenes, 35.78%) were majorly represented. A total of 83 unigene sequences were ascribed to antioxidant activity with only three transcripts linked to metallochaperone activity ( Figure 11A). Regarding the cellular components category, a high proportion of unigenes were ascribed to cell (4373 unigenes, 31.67%), membrane (3726, 26.99%), organelle (2894, 20.96%), and macromolecular complex (2079, 15.06%) ( Figure 11B). Out of the 35,050 unigene hit to the biological process category, 20 sub-categories were represented. A high proportion of sequences showed homology to metabolic process (9482 unigenes, 27.05%), cellular process (8773, 25.03%) and single-organismal process (6937 unigenes, 19.79%). A total of 1305, 39, 32, and 23 unigene sequences fell under signaling, reproduction, reproductive process, and immune system process ( Figure 11C). In discussing the dominant GO terms for Lycaenid butterflies P. superans and S. takanosis, we find that the profiles are very similar. The prominence of the GO biological process category over the GO molecular function and cellular components categories is understood in related species [33]. Also, the dominance of metabolic and cellular process under the GO biological process category has been consistently predicted in other Lepidopteran species. GO classification for functions in the sugarcane giant borer (Telchin licus licus) transcriptome show over 50% of GO terms represented under metabolic and cellular processes [34]. Consistent with our observations, the most prominent GO molecular function categories include binding and catalytic activity and the most prominent GO cellular components are cell, organelle and macromolecular complex in Lepidoptera and other representative insects [33,35]. The number of S. takanosis unigenes ascribed to various sub-categories under the three major categories of GO terms (level 2) are shown in Figure 11. Under the molecular function category, binding (11,887 unigenes, 48.08%) and catalytic activity (8845 unigenes, 35.78%) were majorly represented. A total of 83 unigene sequences were ascribed to antioxidant activity with only three transcripts linked to metallochaperone activity ( Figure 11A). Regarding the cellular components category, a high proportion of unigenes were ascribed to cell (4373 unigenes, 31.67%), membrane (3726, 26.99%), organelle (2894, 20.96%), and macromolecular complex (2079, 15.06%) ( Figure 11B). Out of the 35,050 unigene hit to the biological process category, 20 sub-categories were represented. A high proportion of sequences showed homology to metabolic process (9482 unigenes, 27.05%), cellular process (8773, 25.03%) and single-organismal process (6937 unigenes, 19.79%). A total of 1305, 39, 32, and 23 unigene sequences fell under signaling, reproduction, reproductive process, and immune system process ( Figure 11C). In discussing the dominant GO terms for Lycaenid butterflies P. superans and S. takanosis, we find that the profiles are very similar. The prominence of the GO biological process category over the GO molecular function and cellular components categories is understood in related species [33]. Also, the dominance of metabolic and cellular process under the GO biological process category has been consistently predicted in other Lepidopteran species. GO classification for functions in the sugarcane giant borer (Telchin licus licus) transcriptome show over 50% of GO terms represented under metabolic and cellular processes [34]. Consistent with our observations, the most prominent GO molecular function categories include binding and catalytic activity and the most prominent GO cellular components are cell, organelle and macromolecular complex in Lepidoptera and other representative insects [33,35]   Furthermore, the unigenes were searched against the KEGG database for the identification of biological pathways active in the Lycaenid butterflies under investigation. In P. superans, a total of 5259 unigenes were assigned to 116 pathways. Among them, 709 enzymes were assigned to these pathways. The number of unigenes assigned to the main pathways have been presented in Figure 12. The unigenes predominantly fall into the metabolism (5013 unigenes, 95.32%) group, followed by the organismal systems (118 unigenes, 2.24%), genetic information processing (68 unigenes, 1.29%), and environmental information processing (60 unigenes, 1.14%) groups. Among the metabolism group, the majority of the unigenes were ascribed to the nucleotide metabolism sub-group (1655, 31.47%) followed by the metabolism of co-factors and vitamins (1017, 19.34%). Apart from the metabolism group, the unigenes exclusively fell under the translation sub-group for the genetic information processing group, the signal transduction sub-group for the environmental information processing group and the immune system sub-group for the organismal systems group. Furthermore, the unigenes were searched against the KEGG database for the identification of biological pathways active in the Lycaenid butterflies under investigation. In P. superans, a total of 5259 unigenes were assigned to 116 pathways. Among them, 709 enzymes were assigned to these pathways. The number of unigenes assigned to the main pathways have been presented in Figure 12. The unigenes predominantly fall into the metabolism (5013 unigenes, 95.32%) group, followed by the organismal systems (118 unigenes, 2.24%), genetic information processing (68 unigenes, 1.29%), and environmental information processing (60 unigenes, 1.14%) groups. Among the metabolism group, the majority of the unigenes were ascribed to the nucleotide metabolism sub-group (1655, 31.47%) followed by the metabolism of co-factors and vitamins (1017, 19.34%). Apart from the metabolism group, the unigenes exclusively fell under the translation sub-group for the genetic information processing group, the signal transduction sub-group for the environmental information processing group and the immune system sub-group for the organismal systems group. The mapping of S. takanosis annotated unigenes to typical KEGG pathways identified a total of 6697 assembled sequences assigned to 119 pathways. Among them, 800 enzymes were assigned to these pathways. A total of 6438 (96.13%) unigenes belonged to the metabolism pathway, out of which 1887 unigenes fell under nucleotide metabolism and 1178 unigenes fell under the metabolism of cofactors and vitamins sub-group. A total of 128 unigenes (1.91%) were represented under the immune system pathway under the category of organismal systems. A total of 86 and 45 unigenes fell under the translation and signal transduction pathways of the genetic information processing and environmental information processing categories, respectively ( Figure 13). The KEGG pathways identified in S. takanosis but not in P. superans were chlorocyclohexane and chlorobenzene degradation, flavone and flavonol biosynthesis, fluorobenzoate degradation and steroid degradation, while the betalain biosynthesis pathway was observed in only P. superans.
The GO classification, in a stricter sense, does not mean evidence of functionality; instead, it only suggests that a unigene sequence can be grouped to those of known (or predicted) function. An analysis to consider is the evidence code associated with each GO term. We find that a majority of GO terms (above 99% in both the sequenced Lycaenids) represented in our study are assigned the code "IEA" (inferred from electronic annotation), which are not manually curated and probably may contain more false positives. This is true, as out of the over 16 million GO annotations as of October 2007, 15,687,382 are in fact computationally derived IEA codes [36]. Hence, in the discussion of GO term annotations for the study, we emphasize that not all GO terms are of equal validity and, based The mapping of S. takanosis annotated unigenes to typical KEGG pathways identified a total of 6697 assembled sequences assigned to 119 pathways. Among them, 800 enzymes were assigned to these pathways. A total of 6438 (96.13%) unigenes belonged to the metabolism pathway, out of which 1887 unigenes fell under nucleotide metabolism and 1178 unigenes fell under the metabolism of cofactors and vitamins sub-group. A total of 128 unigenes (1.91%) were represented under the immune system pathway under the category of organismal systems. A total of 86 and 45 unigenes fell under the translation and signal transduction pathways of the genetic information processing and environmental information processing categories, respectively ( Figure 13). The KEGG pathways identified in S. takanosis but not in P. superans were chlorocyclohexane and chlorobenzene degradation, flavone and flavonol biosynthesis, fluorobenzoate degradation and steroid degradation, while the betalain biosynthesis pathway was observed in only P. superans.
The GO classification, in a stricter sense, does not mean evidence of functionality; instead, it only suggests that a unigene sequence can be grouped to those of known (or predicted) function. An analysis to consider is the evidence code associated with each GO term. We find that a majority of GO terms (above 99% in both the sequenced Lycaenids) represented in our study are assigned the code "IEA" (inferred from electronic annotation), which are not manually curated and probably may contain more false positives. This is true, as out of the over 16 million GO annotations as of October 2007, 15,687,382 are in fact computationally derived IEA codes [36]. Hence, in the discussion of GO term annotations for the study, we emphasize that not all GO terms are of equal validity and, based on this, the interpretations of unigenes relate only to predicted function. Additionally, we presume at this point that, using BLASTx, unigene sequences are found to share homology with known pathway genes in the KEGG database. In addition, it is critical to study both the partial and full-length unigene sequences at the functional level for major applications of transcriptome sequencing.
Int. J. Mol. Sci. 2015, 16, page-page 15 on this, the interpretations of unigenes relate only to predicted function. Additionally, we presume at this point that, using BLASTx, unigene sequences are found to share homology with known pathway genes in the KEGG database. In addition, it is critical to study both the partial and fulllength unigene sequences at the functional level for major applications of transcriptome sequencing.

Protein Domain Analysis
InterProScan searches were conducted on the identified 107,950 P. superans unigenes by BLAST2GO. We discovered a total of 154,298 protein domains that include a maximum of 10,978 C2H2-like zinc finger domains. A summary of top 40 domains predicted in the P. superans transcriptome has been shown in Table S2. The notable conserved protein domains included the C2H2-like zinc finger, protein kinase, WD40 (also known as WD or β-transducin repeats), ABCtransporter, EGF-like (epidermal growth factor-like), Immunoglobulin-like, and fibronectin type-III domains. Among the enlisted top 40, we also found cytochrome P450, insect cuticle protein, G proteincoupled receptor, UDP-glucosyltransferase (Uridine diphosphate-glucosyltransferase), and major facilitator superfamily domains.
As with P. superans, a protein domain classification of S. takanosis transcripts was identified using BLAST2GO with the top 40 InterPro domains represented in Table S3. With the notable presence of the C2H2-like zinc finger domain, WD40 repeat domain, Armadillo-like helical domain, and cytochrome P450 domain, as also observed with P. superans, some other functional domains were characteristic of S. takanosis unigenes.
The C-type lectin domain, cadherin domain and the thioredoxin-like fold domain showed significant S. takanosis unigene hits. The C-type lectin domain proteins are known to be encoded by a number of genes in insects with strategic functions in the regulation of antimicrobial activity, proPO activation, and other associated immune functions [37,38].
The protein kinases and zinc finger domains show conspicuous presence in other invertebrate transcriptomes as these process cellular functions including survival, differentiation and apoptosis [31,39,40]. The transcriptome abundant and conserved C2H2-like zinc finger domains exists in proteins as multiple tandem pairs of zinc fingers or tandem arrays of three or more zinc fingers, and hence are often represented by few proteins in surveyed species. These proteins are most likely DNA-binding transcription factors but can also bind to RNA and other protein targets [41,42]. The presence of WD40 repeat and Armadillo-like helical domains in 172 and 106 unigenes of P. superans is consistent with similar analysis in insects and crustaceans [43,44]. The immunoglobulin and fibronectin type-III domains are basically involved in cell-signaling mechanisms related to cellular

Protein Domain Analysis
InterProScan searches were conducted on the identified 107,950 P. superans unigenes by BLAST2GO. We discovered a total of 154,298 protein domains that include a maximum of 10,978 C 2 H 2 -like zinc finger domains. A summary of top 40 domains predicted in the P. superans transcriptome has been shown in Table S2. The notable conserved protein domains included the C 2 H 2 -like zinc finger, protein kinase, WD40 (also known as WD or β-transducin repeats), ABC-transporter, EGF-like (epidermal growth factor-like), Immunoglobulin-like, and fibronectin type-III domains. Among the enlisted top 40, we also found cytochrome P450, insect cuticle protein, G protein-coupled receptor, UDP-glucosyltransferase (Uridine diphosphate-glucosyltransferase), and major facilitator superfamily domains.
As with P. superans, a protein domain classification of S. takanosis transcripts was identified using BLAST2GO with the top 40 InterPro domains represented in Table S3. With the notable presence of the C 2 H 2 -like zinc finger domain, WD40 repeat domain, Armadillo-like helical domain, and cytochrome P450 domain, as also observed with P. superans, some other functional domains were characteristic of S. takanosis unigenes.
The C-type lectin domain, cadherin domain and the thioredoxin-like fold domain showed significant S. takanosis unigene hits. The C-type lectin domain proteins are known to be encoded by a number of genes in insects with strategic functions in the regulation of antimicrobial activity, proPO activation, and other associated immune functions [37,38].
The protein kinases and zinc finger domains show conspicuous presence in other invertebrate transcriptomes as these process cellular functions including survival, differentiation and apoptosis [31,39,40]. The transcriptome abundant and conserved C 2 H 2 -like zinc finger domains exists in proteins as multiple tandem pairs of zinc fingers or tandem arrays of three or more zinc fingers, and hence are often represented by few proteins in surveyed species. These proteins are most likely DNA-binding transcription factors but can also bind to RNA and other protein targets [41,42]. The presence of WD40 repeat and Armadillo-like helical domains in 172 and 106 unigenes of P. superans is consistent with similar analysis in insects and crustaceans [43,44]. The immunoglobulin and fibronectin type-III domains are basically involved in cell-signaling mechanisms related to cellular processes and immunity [31,45]. While cytochrome P450 genes are predominantly involved in the metabolism of xenobiotics in mollusks, polychaete and crustaceans, these sequences are categorized as "environmental response genes" in insects [46]. The protein domain information provides vital clues to understanding the mechanisms of cellular survival and signaling mechanisms leading to adaptation in the endangered Lycaenid butterflies, P. superans and S. takanosis.

Sample Preparation and Illumina Sequencing
The geographic origins of P. superans and S. takanosis for sequencing were Gangwon-do and Gyeonggi-do regions of Korea, respectively. A total of two individuals from each species were used for experimental purposes as per the notification of the Ministry of Environment, South Korea. Total RNA was extracted from the adults (pooled whole body samples) of S. takanosis and P. superans using Trizol Reagent (Invitrogen, Carlsbad, CA, USA) according to the manufacturer's instructions. The processed RNA were checked for purity and integrity using Nanodrop-2000 spectrophotometer (Thermo Scientific, Wilmington, DE, USA) and the Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA, USA). The mRNA-seq library was constructed using the mRNA-seq sample preparation kit (Illumina, San Diego, CA, USA). In the process, the total RNA was treated with DNase I, and magnetic beads with Oligo(dT) to purify poly(A+) mRNA from it. The purified mRNA was fragmented using the DNA fragmentation kit (Ambion, Austin, TX, USA) prior to cDNA synthesis. The short fragments of mRNA were used to transcribe first-strand cDNA using reverse-transcriptase (Invitrogen) and random hexamer-primers. The synthesis of second-strand cDNA was accomplished using DNA polymerase I (New England BioLabs, Ipswich, MA, USA) and RNase H (Invitrogen). Subsequently, the double-stranded cDNA was end-repaired using T4 DNA polymerase, the Klenow fragment, and the T4 polynucleotide kinase (New England BioLabs). The end-repaired cDNA fragments were connected with PE (Paired-end) Adapter Oligo Mix using T4 DNA ligase (New England BioLabs) at room temperature for 15 min. The suitable fragments (200˘25 bp) separated on a 2% agarose gel electrophoresis matrix were paired-end sequenced on an ultra-high-throughput Illumina HiSeq 2500 sequencer. Illumina short-reads is an appropriate NGS platform for the sequencing of transcriptomes in non-model species due to its affordability and output efficiency [31,52].

De Novo Assembly and Annotation
The raw paired-end reads of the two Lycaenid transcriptomes (S. takanosis and P. superans) were cleaned by filtering out adapter (nucleotide length of recognized adapter ď13 and the remaining adapter-excluded nucleotide length ď35), repeated, and low-quality reads (phred quality score of less than 20) that may affect optimum assembly analysis and annotation. We command-line tool Cutadapt with default parameters (for paired-end reads: -a ADAPT1 -A ADAPT2; -o out1. fastq -p out2. fastq in1. fastq in2. fastq) [53] for pre-processing of raw reads. Cutadapt was selected over other popular adaptor trimmer programs [54] as it has one of the highest Mathew's correlation coefficients (mCC)-a quality indicator for pattern recognition. The clean reads from the samples were assembled with the short reads assembling program, called Trinity (v2.0.6) [24] with 200 GB of memory and a path reinforcement distance of 50. The Trinity assembler with the default options (fastq type reads; paired read: RF; number of CPUs: eight; minimum assembled contig length of 200 bp) first assembled the reads to form longer fragments without gaps called contigs. These contigs were further assembled to unigenes (having 94% identity, 30 bp overlap) using sequence clustering software TIGR gene indices clustering tool (TGICL) [55]. After the elimination of redundant sequences, the longest transcripts were recognized as unigenes and were used for functional annotation analysis.
All the unigenes were searched against the PANM reference database (PANM-DB) [56] using the BLASTx program with an E-value threshold of 1.0ˆ10´5 for the identification of functional transcripts. PANM-DB combines protein sequence data of Arthropoda, Nematoda, and Mollusks (in multi-FASTA format) downloaded from the Taxonomy browser of NCBI nr database. PANM-DB is freely downloadable from amino acid database BLAST web-interface of Malacological Society of Korea. Subsequently, the unigenes were blasted against Unigene DB [57], Eukaryotic clusters of orthologous groups (KOG) DB [58], and Kyoto Encyclopedia of Genes and Genomes (KEGG) DB [59] using BLASTX at a typical cut-off E-value of less than 1.0ˆ10´5. Number of unigenes that were either unique or shared among PANM-DB, Unigene DB and KOG DB were visualized using a three-way Venn diagram plot constructed using Venny [60]. The gene ontology (GO) annotations presented represent the level 2 analysis, illustrating the predicted function of the assembled unigenes under biological process, molecular function, and cellular component category. The GO analysis was conducted using the professional BLAST2GO suite [61]. InterProScan at BLAST2GO was used to annotate the assembled unigenes with characteristic protein domains [62].

Identification of cSSR Markers
MicroSAtellite (MISA) [63] was used to decipher microsatellites in the unigene sequences of the Lycaenid butterflies S. takanosis and P. superans. The Simple Sequence Repeats (SSR) searches were run on default mode with detection of mono-, di-, tri-, tetra-, penta-, and hexa-nucleotide motifs, including the compound SSR (with more than one type of repeat unit). Primer pairs flanking the SSR motifs were designed using BatchPrimer3 [64] with the following criteria: primer lengths of 18-23 bases (optimum size of 21 bases), product size of 100-300 bases, Tm-50-470˝C (optimum 55˝C), and primer GC content of 30%-70%.

Conclusions
In this study, the transcriptomes of endangered Lycaenid butterflies P. superans and S. takanosis were sequenced using Illumina HiSeq 2500. A de novo assembly and transcript annotation approach resulted in the identification of unigenes related to functional GO categories and KEGG pathways. Furthermore, the identification of SSRs with repeat types will assist in the development of large-scale molecular markers for the species. The valuable transcriptome sequence and functional information will provide necessary cues towards the successful implementation of sustainable conservation plans for the butterfly species in their preferred habitat. The sequence information to be indexed in the databases will be the basis for an evolutionary developmental study of Lycaenidae with symbiotic ants.