Transcriptome Analysis of the Tadpole Shrimp (Triops longicaudatus) by Illumina Paired-End Sequencing: Assembly, Annotation, and Marker Discovery

The tadpole shrimp (Triops longicaudatus) is an aquatic crustacean that helps control pest populations. It inhabits freshwater ponds and pools and has been described as a living fossil. T. longicaudatus was officially declared an endangered species South Korea in 2005; however, through subsequent protection and conservation management, it was removed from the endangered species list in 2012. The limited number of available genetic resources on T. longicaudatus makes it difficult to obtain valuable genetic information for marker-aided selection programs. In this study, whole-transcriptome sequencing of T. longicaudatus generated 39.74 GB of clean data and a total of 269,822 contigs using the Illumina HiSeq 2500 platform. After clustering, a total of 208,813 unigenes with an N50 length of 1089 bp were generated. A total of 95,105 unigenes were successfully annotated against Protostome (PANM), Unigene, Eukaryotic Orthologous Groups (KOG), Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases using BLASTX with a cut-off of 1E−5. A total of 57,731 unigenes were assigned to GO terms, and 7247 unigenes were mapped to 129 KEGG pathways. Furthermore, 1595 simple sequence repeats (SSRs) were detected from the unigenes with 1387 potential SSR markers. This is the first report of high-throughput transcriptome analysis of T. longicaudatus, and it provides valuable insights for genetic research and molecular-assisted breeding of this important species.


Introduction
The tadpole shrimp, Triops spp. (order: Notostraca; class: Branchiopoda) is a crustacean that inhabits freshwater, ephemeral ponds in arid regions worldwide [1], and it has been described as a living fossil from the late Cretaceous period similar to other members of this ancient crustacean order. This is allegedly due to their virtually unchanged morphology during an evolutionary time scale spanning more than 70 million years [2,3]. This includes the ability to control the size of mosquito populations by consuming Culex larvae [4,5], and its utilization as a biological agent to control weeds in paddy fields [6]. The diversification of cryptic species within the genus occurred more recently than this, based on the subtle differences in genetic composition and morphology [7][8][9]. Triops longicaudatus is the most widespread notostracan crustacean, being found in North America, South America, the Caribbean, Saudi Arabia, Japan, and the Pacific Islands [10][11][12][13][14][15][16]. There are a number of reports on its distribution morphology, and reproduction [2,10,11,15,17]. In South Korea, the species has been reported since 1986, where it was collected from paddy fields in the cities of Changnyeong and Samcheonpo (Gyeongsangnam-Do Province) [13]. It was registered as an endangered species in South Korea by the Ministry of Environment in 2004. Since then, populations of T. longicaudatus have increased through regional conservation measures and it was removed from the endangered species list in 2012. T. longicaudatus is economically important species to be used for environmental friendly agriculture. It is proposed that genetic studies involving genome, transcriptom, and gene function analysis will be necessary to preserve the genotypes of this species by assisting in determining their developmental and regulatory functions. Furthermore, the elucidation of cDNA simple sequence repeat (SSR) markers in the putative coding transcripts will be necessary to assess population genetic structure and diversity.
Among the limited number of genomic resources on T. longicaudatus, only the mitochondrial DNA sequence is known [18]. The variation in mitochondrial genes has been successfully utilized to identify cryptic lineages of the genus Triops [9]. Despite these studies, genetic and genomic information on the species is limited due to the lack of whole genome sequencing, RNA sequencing, expression profiles of transcripts, and microsatellite markers. The traditional method of expressed sequence tag (EST) construction using Sanger sequencing is time consuming and inefficient, producing at best 10,000 sequences. This is likely an insufficient representation of the size of the genome and thus a major limitation of functional research applications [19]. High-throughput next-generation sequencing (NGS) technologies, such as 454 (Roche), Solexa/Illumina (Illumina), and SOLiD (ABI), collect massive amounts of sequencing data in a single run with increased efficiency at an affordable level [20,21]. This technology has enabled genome and transcriptome-level computational analyses [22], leading to the discovery of molecular markers such as SSRs, single nucleotide polymorphisms (SNPs), and quantitative trait loci (QTL) [23]. Because genome sequences are currently unavailable or unreliable in many non-model species, transcriptome sequencing provides direct relevance to the genetic level by measuring the expression of relevant traits [24][25][26]. Among the NGS technologies, Illumina sequencing is a preferred choice due to the generation of short-read sequences with greater coverage [27][28][29][30].
Over the last four to five years, significant progress has been made in characterizing the transcriptome of economically important crustacean species such as Litopenaeus vannamei, Fenneropenaeus chinensis, Eriocheir sinensis, Macrobrachium nippoense, Portunus trituberculatus, and Carcinus maenas. These analyses have provided insights into species biology, the functional regulation of defense signaling pathways, growth and reproduction, and strategies to improve culture productivity. In this study, we present the first massive sequencing data for the tadpole shrimp, T. longicaudatus, using the Illumina HiSeq 2500 NGS platform. The assembled and annotated sequencing data were utilized for the large-scale identification of putative functional transcripts. Furthermore, the identification and analysis of SSR loci and SSR markers in the transcriptome will be useful for population genomics and variability studies, further assisting in the marker assisted selection breeding of T. longicaudatus.

Ethics Statement
The experiments in this study were performed in accordance with relevant national and international guidelines. Because T. longicaudatus is not an endangered or protected species, in Korea, sample collection did not require special permits. Our project was approved by the National Institute of Biological Resources (NIBR), Korea.
The adult whole-body tissues of hermaphrodite T. longicaudatus (n = 10) were pooled and total RNAs were extracted using Trizol reagent (Invitrogen, Carlsbad, CA, USA) and extracted in accordance with the manufacturer's protocol. The extracted RNA was treated with RNase-free DNase I (Qiagen) to remove the genomic DNA. RNA purity and concentration were measured using a Nanodrop-2000 spectrophotometer (Thermo Scientific, Wilmington, DE, USA). The Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA, USA) measures RNA quantity and agarose gel electrophoresis. Total RNAs were pooled, purified to obtain mRNA using oligo (dT) magnetic beads, and subsequently fragmented using an RNA Fragmentation Kit (Ambion, Austin, TX, USA).

cDNA Synthesis and HiSeq 2500 Sequencing
First-strand cDNA synthesis was performed using reverse-transcriptase (Invitrogen) and random hexamer-primers. Second-strand cDNA was synthesized using RNase H (Invitrogen) and DNA polymerase I (New England BioLabs, Ipswich, MA, USA). The double-stranded cDNA was end-repaired using T4 DNA polymerase, the Klenow fragment (New England BioLabs), and T4 polynucleotide kinase (New England BioLabs). The end-repaired cDNA fragments were ligated to the PE (paired-end) Adapter Oligo Mix with T4 DNA ligase (New England BioLabs) at room temperature for 15 min. The ligated products were purified and separated by size on a 2% agarose gel. DNA fragments of the desired size (200 ± 25 bp) were excised and sequenced on the Illumina HiSeq 2500 sequencing platform with 2 × 126 bp after validation.

De Novo Assembly and Assessment of De Novo Assemblies
Before de novo transcriptome assembly, the raw reads were cleaned by removing adaptor-only reads (nt length of the recognized adaptor ≤13 and the remaining adaptor-excluded nt length of ≤35), repeated reads, and low-quality reads (Phred quality score ≤20) using Sickle (http://github. com/najoshi/sickle) [31] and Cutadapt (http://cutadapt.readthedocs.io/en/stable/index.html) [32]. High-quality reads were assembled using Trinity software (software version 2013-02-25) with default parameters (100 GB of memory, path reinforcement distance of 50, and minimum allowed length of 200 bp) [33]. The Trinity program assembles reads of a certain length that overlap to form longer fragments without gaps; these are called contigs. The total number of contigs, as well as the mean length, the N 50 length, and GC% were recorded. The contigs were further assembled into sequences that could not be extended at either end; these are called unigenes (having 94% identity, 30 bp overlap) [34,35]. Such unigenes were subjected to annotation analysis against public protein and nucleotide databases. The assessment of the assembly and annotation completeness we applied the software tool BUSCO (software version 1.1b) [36].

Illumina Reads and Sequence Assembly
Transcriptome information for the T. longicaudatus was characterized from adult whole-body. The Illumina HiSeq 2500 platform generated a total of 323,319,608 paired-end reads (40,738,270,608 bases) were generated with a read length of 100 bp. All raw sequencing data were deposited into the NCBI Sequence Read Archive (SRA) under accession number SRR3961747. After adaptor trimming, a total of 318,610,596 clean sequencing reads (98.54%) were filtered, which were used for further analysis. The mean length, the N 50 length, and GC% of the obtained clean reads were 124.7 bp, 126 bp, and 48.39%, respectively.
Because the reference genome sequence is unavailable, de novo assembly of the transcriptome was performed. Trinity assembly with default parameters was used to resolve the clean transcripts to overlapping contiguous sequences. De novo assembly of the high-quality sequences generated a total of 269,822 contigs (192,327,026) with a mean length of 712.8 bp and an N 50 length of 1148 bp. Of the total assembled contigs, 89,407 were ≥500 bp, with the longest contig size of 40,450 bp. The clustering of the contigs generated 208,813 unigenes with a mean length of 700 bp and an N 50 length of 1089 bp. The lengths of the unigenes varied from 224 bp to 40,450 bp. Table 1 summarizes the transcriptome sequencing, de novo assembly, and clustering of contigs. Among the unigenes, 85.86%, 7.50%, and 6.63% showed lengths of 200-1000 bp, 1001-2000 bp, and >2000 bp, respectively. The size distribution of the contigs and unigenes are shown in Figure 1. The unigenes represent a comprehensive resource of functional information on the T. longicaudatus genome and may facilitate the discovery of relevant phenotypes in this species.

Sequence Annotation of Unigenes
Several public databases comprised of known protein and nucleotide sequences were used as subject databases for the sequence annotation of T. longicaudatus unigenes. The unigene sequences (as queries) were searched to identify homologous sequences using BLASTX and BLASTN (E-value cut-off of 1E−5) for protein and nucleotide databases, respectively. The PANM-DB, KOG, GO, and KEGG databases were used as protein databases, while the Unigene database was used as the nucleotide database. Of the total of 208,813 unigenes, 95,105 (45.55%) were annotated to any one of the databases with a great number of unigenes having lengths of 300-1000 bp. The number of matches to PANM-DB was the greatest (87,719 unigenes), followed by the KOG (63,978 unigenes). The annotation results of unigenes to the public databases are shown in Table 2. The results also show that 23,732 (27.1%), 7729 (28.8%), 20,131 (31.55%), 16,663 (28.8%), and 2112 (29.1%) of the unigenes that were over 1000 bp in length had BLAST matches in the PANM, Unigene, KOG, GO, and KEGG databases, respectively. Next, to understand the overlap of the unigene sequence annotations between PANM-DB and Unigene and KOG databases, we constructed a three-way Venn diagram ( Figure 2). We found that a maximum number of 39,763 unigenes matched in both PANM-DB and KOG database, and 22,348 unigenes matched in all three databases. The number of unigenes annotated exclusively to PANM-DB, and the Unigene and KOG databases without any overlap were 23,501, 1710, and 1187, respectively. Several public databases comprised of known protein and nucleotide sequences were used as subject databases for the sequence annotation of T. longicaudatus unigenes. The unigene sequences (as queries) were searched to identify homologous sequences using BLASTX and BLASTN (E-value cut-off of 1E−5) for protein and nucleotide databases, respectively. The PANM-DB, KOG, GO, and KEGG databases were used as protein databases, while the Unigene database was used as the nucleotide database. Of the total of 208,813 unigenes, 95,105 (45.55%) were annotated to any one of the databases with a great number of unigenes having lengths of 300-1000 bp. The number of matches to PANM-DB was the greatest (87,719 unigenes), followed by the KOG (63,978 unigenes). The annotation results of unigenes to the public databases are shown in Table 2. The results also show that 23,732 (27.1%), 7,729 (28.8%), 20,131 (31.55%), 16,663 (28.8%), and 2112 (29.1%) of the unigenes that were over 1000 bp in length had BLAST matches in the PANM, Unigene, KOG, GO, and KEGG databases, respectively. Next, to understand the overlap of the unigene sequence annotations between PANM-DB and Unigene and KOG databases, we constructed a three-way Venn diagram (Figure 2). We found that a maximum number of 39,763 unigenes matched in both PANM-DB and KOG database, and 22,348 unigenes matched in all three databases. The number of unigenes annotated exclusively to PANM-DB, and the Unigene and KOG databases without any overlap were 23,501, 1710, and 1187, respectively. The number of unigenes hits using BLASTX search (E-value < 1E−5).  19.08% for 15%-40% and 60%-80% unigenes, respectively ( Figure S1B). According to the similarity distribution analysis, 36,411 (41.51%) unigenes showed a similarity of 60%-80% with homologous sequences in the PANM-DB. Only 18.03% of unigene sequences showed similarity of 80%-100% to sequences in PANM-DB ( Figure S1C). The BLASTX annotation hits to homologous protein sequences in PANM-DB increased with increasing unigene length. More than 90% of unigenes with a sequence length >2000 bp showed annotation hits against PANM-DB ( Figure S1D).

KOG, GO and KEGG Classifications
For a functional classification of the T. longicaudatus unigenes, we conducted a BLAST search against the KOG, GO, and KEGG databases. Under the KOG classification, a total of 63,978 unigenes were predicted under 25 functional categories excluding the "multi" category. Within the 25 categories, the unigenes were predominantly distributed to "translation, ribosomal structure and biogenesis (7210 unigenes)", followed by "general function prediction only" (6591 unigenes), "post-translational modification, protein turnover and chaperones" (6005 unigenes), and "signal transduction mechanisms" (5017 unigenes). The least represented groups included "cell motility" We also examined homology search characteristics such as score, identity and similarity distribution. The score distribution, which represents the quality of the BLAST alignment, showed that 45,158 (51.48%) unigenes had a score <100 ( Figure S1A). The identity distribution revealed that 36,329 (41.42%) unigenes showed an identity of 40%-60%, followed by identities of 33.17% and 19.08% for 15%-40% and 60%-80% unigenes, respectively ( Figure S1B). According to the similarity distribution analysis, 36,411 (41.51%) unigenes showed a similarity of 60%-80% with homologous sequences in the PANM-DB. Only 18.03% of unigene sequences showed similarity of 80%-100% to sequences in PANM-DB ( Figure S1C). The BLASTX annotation hits to homologous protein sequences in PANM-DB increased with increasing unigene length. More than 90% of unigenes with a sequence length >2000 bp showed annotation hits against PANM-DB ( Figure S1D).

KOG, GO and KEGG Classifications
For a functional classification of the T. longicaudatus unigenes, we conducted a BLAST search against the KOG, GO, and KEGG databases. Under the KOG classification, a total of 63,978 unigenes were predicted under 25 functional categories excluding the "multi" category. Within the 25 categories, the unigenes were predominantly distributed to "translation, ribosomal structure and biogenesis (7210 unigenes)", followed by "general function prediction only" (6591 unigenes), "post-translational modification, protein turnover and chaperones" (6005 unigenes), and "signal transduction mechanisms" (5017 unigenes). The least represented groups included "cell motility" (90 unigenes) and "nuclear structure" (103 unigenes) ( Figure 4).  GO is an international standardization of the gene functional classification system. The GO classification system comprises three large categories: molecular function, biological process and cellular components. Among all the unigenes with GO annotations, we found that 57,731 (27.65% of all unigenes) unigenes matched to GO terms and 14,379 unigenes showed functional attributes shared within the three main categories. The unigenes predominantly shared the biological process and molecular function categories ( Figure 5A). Approximately 15,971 (27.7%) unigenes were represented by one GO term; 15,454 (26.8%) unigenes were represented by two GO terms; and 14,420 (25.0%) unigenes were represented by three GO terms of predicted functions ( Figure 5B). Additionally, biological processes, molecular functions and cellular components were associated with 75,548, 54,306, and 34,193 unigenes, respectively. In the biological process category, metabolic process (22,634 unigenes), cellular process (21,511 unigenes), and single-organism process (14,281 unigenes) were the most abundant groups, whereas cell killing (2 unigenes) and biological phase (1 unigene) were the least abundant groups. Under the molecular function category, binding (25,159 unigenes) and catalytic activity (20,067 unigenes) were the most abundant groups, while antioxidant GO is an international standardization of the gene functional classification system. The GO classification system comprises three large categories: molecular function, biological process and cellular components. Among all the unigenes with GO annotations, we found that 57,731 (27.65% of all unigenes) unigenes matched to GO terms and 14,379 unigenes showed functional attributes shared within the three main categories. The unigenes predominantly shared the biological process and molecular function categories ( Figure 5A). Approximately 15,971 (27.7%) unigenes were represented by one GO term; 15,454 (26.8%) unigenes were represented by two GO terms; and 14,420 (25.0%) unigenes were represented by three GO terms of predicted functions ( Figure 5B). Additionally, biological processes, molecular functions and cellular components were associated with 75,548, 54,306, and 34,193 unigenes, respectively. In the biological process category, metabolic process (22,634 unigenes), cellular process (21,511 unigenes), and single-organism process (14,281 unigenes) were the most abundant groups, whereas cell killing (2 unigenes) and biological phase (1 unigene) were the least abundant groups. Under the molecular function category, binding (25,159 unigenes) and catalytic activity (20,067 unigenes) were the most abundant groups, while antioxidant activity (283 unigenes) and metallochaperone activity (3 unigenes) were also observed. In cellular component terms, cell (11,360 unigenes), organelle (7747 unigenes), macromolecular complex (7489 unigenes), and membrane (6509 unigenes) were the dominant groups. An account of the suggested function of T. longicaudatus unigenes under the GO term categories is shown in Figure 6.  We classified unigenes into biological pathways by annotating the unigene sequences against the KEGG database. A total of 7247 unigenes were predicted to function in a total of 129 pathways. Predominantly, the unigene sequences were classified into the metabolism pathway group, wherein "nucleotide metabolism", "metabolism of cofactors and vitamins", and "carbohydrate metabolism" constituted the major groups (Table S1). A total of 293 unigene sequences were predicted to be classified under translation group, followed by 288 under the immune system and 101 under the  We classified unigenes into biological pathways by annotating the unigene sequences against the KEGG database. A total of 7247 unigenes were predicted to function in a total of 129 pathways. Predominantly, the unigene sequences were classified into the metabolism pathway group, wherein "nucleotide metabolism", "metabolism of cofactors and vitamins", and "carbohydrate metabolism" constituted the major groups (Table S1). A total of 293 unigene sequences were predicted to be We classified unigenes into biological pathways by annotating the unigene sequences against the KEGG database. A total of 7247 unigenes were predicted to function in a total of 129 pathways. Predominantly, the unigene sequences were classified into the metabolism pathway group, wherein "nucleotide metabolism", "metabolism of cofactors and vitamins", and "carbohydrate metabolism" constituted the major groups (Table S1). A total of 293 unigene sequences were predicted to be classified under translation group, followed by 288 under the immune system and 101 under the signal transduction group. The identified KEGG pathways for T. longicaudatus unigenes are presented in Figure 7. Using the InterPro Scan analysis feature in BLAST2GO, we identified the most prominent protein domains predicted for T. longicaudatus unigenes. A total of 1252 unigenes showed top-hits to the P-loop-containing nucleoside triphosphate hydrolase (P-loop NTPase) domain. Other top domains identified based on unigene homology included the insulin-like growth factor binding protein, N-terminal domain, zinc finger, C2H2-like domain, heat shock protein 70 family, EGF-like domain, and helicase C-terminal domain (Table S2).  (Table S2).

Discussion
In this study, we used high-throughput mRNA-Seq technology to analyze expressed transcripts of the longtail tadpole shrimp T. longicaudatus. RNA-Seq platform technology has been used for the rapid characterization of genomic and genetic resources in related non-model species including the Pacific white shrimp (Litopenaeus vannamei) [28,42], the Banana shrimp (Fenneropenaeus merguiensis) [43], the Brine shrimp (Artemia franciscana) [44], and the Triops newberryi [1]. Transcriptome studies have also provided advances in establishing putative genes involved in the growth, reproduction and innate immune system pathways in the European shore crab (Carcinus maenas) [45], the Mud crab (Scylla paramamosain) [46], and the swimming crab (Portunus trituberculatus) [47]. These studies have researched the need for genetic data on these species through the screening and exploitation of microsatellites in a cost-efficient and timely manner. In this study, using the Illumina HiSeq 2500 sequencing method and Trinity de novo assembly, 269,822 contigs and 208,813 unigenes were generated. The N50 length (1148 bp) and the average length (712.8 bp) of the contigs and unigenes (N50 length of 1089 bp and an average length of 700 bp) are greater than in the transcriptomic analysis of other crustacean species such as L. vannamei (42,336 unigenes with an N50 of 736 bp and an average length of 561 bp) [48], brine shrimp, A. franciscana (36,896 contigs with an average length of 746 bp) [44], crayfish, Cherax quadricarinatus (36,128 contigs with an N50 of 936 bp and an average length of 800 bp) [49], and pandalid shrimp, Pandalus latirostris (45,467 contigs with an N50 of 493 bp) [50], and are lower than in the transcriptome of Parhyale hawaiensis (35,301 contigs with an N50 of 1510 bp) [51]. For further we applied the BUSCO, which is reference based software for assessing quality of de novo assembles. Out of 2675 single copy orthologs for arthropods our assembly is 88.56% complete (1708 complete single copy BUSCOs and 661 complete duplicated BUSCOs), while 5.35% of contigs are fragmented (143 fragmented BUSCOs) and 6.09% are missing (163 missing BUSCOs).
We annotated the T. longicaudatus unigene sequences against the PANM, Unigene, KOG, GO, and KEGG databases by BLASTX with a cut-off value of 1E−5. Approximately 45.55% of unigenes matched to homologous sequences in the databases, which is less than half of the unigenes present

Discussion
In this study, we used high-throughput mRNA-Seq technology to analyze expressed transcripts of the longtail tadpole shrimp T. longicaudatus. RNA-Seq platform technology has been used for the rapid characterization of genomic and genetic resources in related non-model species including the Pacific white shrimp (Litopenaeus vannamei) [28,42], the Banana shrimp (Fenneropenaeus merguiensis) [43], the Brine shrimp (Artemia franciscana) [44], and the Triops newberryi [1]. Transcriptome studies have also provided advances in establishing putative genes involved in the growth, reproduction and innate immune system pathways in the European shore crab (Carcinus maenas) [45], the Mud crab (Scylla paramamosain) [46], and the swimming crab (Portunus trituberculatus) [47]. These studies have researched the need for genetic data on these species through the screening and exploitation of microsatellites in a cost-efficient and timely manner. In this study, using the Illumina HiSeq 2500 sequencing method and Trinity de novo assembly, 269,822 contigs and 208,813 unigenes were generated. The N 50 length (1148 bp) and the average length (712.8 bp) of the contigs and unigenes (N 50 length of 1089 bp and an average length of 700 bp) are greater than in the transcriptomic analysis of other crustacean species such as L. vannamei (42,336 unigenes with an N 50 of 736 bp and an average length of 561 bp) [48], brine shrimp, A. franciscana (36,896 contigs with an average length of 746 bp) [44], crayfish, Cherax quadricarinatus (36,128 contigs with an N 50 of 936 bp and an average length of 800 bp) [49], and pandalid shrimp, Pandalus latirostris (45,467 contigs with an N 50 of 493 bp) [50], and are lower than in the transcriptome of Parhyale hawaiensis (35,301 contigs with an N 50 of 1510 bp) [51]. For further we applied the BUSCO, which is reference based software for assessing quality of de novo assembles. Out of 2675 single copy orthologs for arthropods our assembly is 88.56% complete (1708 complete single copy BUSCOs and 661 complete duplicated BUSCOs), while 5.35% of contigs are fragmented (143 fragmented BUSCOs) and 6.09% are missing (163 missing BUSCOs).
We annotated the T. longicaudatus unigene sequences against the PANM, Unigene, KOG, GO, and KEGG databases by BLASTX with a cut-off value of 1E−5. Approximately 45.55% of unigenes matched to homologous sequences in the databases, which is less than half of the unigenes present in the T. longicaudatus transcriptome could be annotated. Lineage-specific genes are often difficult to annotate because their function is specific to the species [1,52]. We also characterized the homology search using PANM-DB due to the greater degree of annotation of unigene sequences obtained with this database. PANM-DB is preferred over the NCBI nr database due to faster processing of NGS datasets (15 times faster than that of the NCBI nr database) and a higher number of annotation hits [37]. The locally curated PANM-DB was an addition to the Molluscs database, and covers the available sequences of the Protostomia group in a multi-FASTA format [53]. Furthermore, our results showed that more than 90% of unigenes with a sequence length >2000 bp matched with a homologous protein in the databases, which is possible because the protein-coding genes generally give rise to longer full-length transcripts [54]. The BLASTX top-hit species distribution showed putative homology of the annotated unigene sequences across species in the PANM-DB. Most sequences matched the crustacean, Daphnia pulex (15.32%), followed by Crassostrea gigas (5.47%) and Lottia gigantea (3.67%).
Functional annotations of the assembled unigenes using KOG, GO terms, KEGG pathway analysis, as well as an InterPro conserved domain scan, were conducted to obtain a comprehensive description of the properties of these genes and their products in the species. GO classification only suggests that a unigene is related to a predicted function, as all GO terms are not of equal validity [55]. Most of the evidence codes are based on electronic annotations and are not manually created. The computational source of evidence constitutes more than 95% of the total GO annotation results in non-model species [56,57]. KEGG pathway analysis suggests the classification of unigenes into regulatory biological pathways that include metabolism, genetic information processing, environmental information processing, and organismal systems. The T. longicaudatus unigenes were mapped to 129 reference canonical pathways, among which distribution to the metabolism pathways was predominant. In the transcriptome analysis of Litopenaeus vannamei, a total of 9621 unigenes were mapped to 317 pathways, wherein the most enriched sequences were assigned to metabolic pathways, followed by the biosynthesis of secondary metabolites and spliceosome and RNA transport [58]. In the mud crab (Scylla paramamosain) transcriptome using 454 sequencing, 4878 unigenes were classified into 281 KEGG pathways, and the identified genes were found to be involved in growth, development, and disease resistance pathways [46]. Among the top-hit InterPro domain obtained in the present analysis, P-loop NTPases were predominant. These represent a large protein family that is involved in a variety of cellular functions, such as signal transduction, translation, protein transport and localization, signal-sequence recognition, chromosome partitioning, and membrane transport [59]. The C 2 H 2 type zinc finger domains are widely found in DNA binding motifs in eukaryotic transcription factors [60].
Polymorphic microsatellite markers such as SSRs have been utilized for a variety of genetic and breeding studies [61]. NGS technologies can be used to develop abundant SSR or SNP markers with high efficiency and accuracy [62]. In this study, we screened 1595 SSRs of 2-6 bp in length from unigene sequences >1000 bp in length. The tri-nucleotide repeats were predominant, followed by di-and tetra-nucleotide repeats. The tri-nucleotide SSR motifs have been consistently found as the predominant markers in the transcriptome sequences of many monocotyledonous plants [63,64]; however, in animals, the di-nucleotide repeats are predominant [65]. One nucleotide repeat motifs were detected but were not considered as these may be the result of single nucleotide stretch errors generated by sequencing [66,67]. These SSR loci provide an abundant marker resource for studying the genetic variation, population, and conservation genomics of species. In a previous study that constructed a genetic linkage map of L. vannamei using AFLP and SSR markers, 25 SSR markers were found to be informative in mapping a population of L. vannamei and are available for map construction [68].
The abundance of AC/GT motifs found in the present study is consistent with the SSR motif study in the mud crab, Scylla paramamosain [46]. The tri-nucleotide motifs AGC/CTG and ACC/GGT found in this study were also the preferred motifs in the SSRs isolated from the transcriptome of the Red Swamp Crayfish Procambarus clarkii [69]. A total of 1387 potential SSR markers identified in this study will provide important research advances for genetic studies including the assessment of genetic diversity, the development of genetic maps, comparative genomics, and marker-assisted selection breeding. The primer pairs designed for polymorphism identification would add towards genotyping of the species diversity and exploitation of the economic potential of the species.

Conclusions
This is the first report of high-throughput transcriptome analysis of T. longicaudatus. In total, 95,105 unigenes were annotated for putative functions using BLASTX with a cut-off of 1E−5. A total of 57,731 unigenes were assigned to GO terms, and 7247 unigenes were mapped to 129 KEGG pathways. Furthermore, 1595 SSRs were detected from the unigenes with 1387 potential SSR markers. A total of 1387 potential SSR markers identified in this study will provide important research advances for genetic studies including the assessment of genetic diversity, the development of genetic maps, comparative genomics, and marker assisted selection breeding.
Supplementary Materials: The following are available online at www.mdpi.com/2073-4425/7/12/114/s1. Figure S1: Summary of homology search of assembled unigenes of T. longicaudatus against PANM-DB. (A) score distribution, (B) identity distribution, (C) similarity distribution, (D) distribution of hit and non-hit sequences as compared with the length of unigenes; Table S1: KEGG mappings for T. longicaudatus unigenes; Table S2: List of top-hit InterPro domains in T. longicaudatus, Table S3: Summary of SSR types in the T. longicaudatus transcriptome, Table S4: Sequences of 1,387 primer pairs for SSR markers.