Characterization of Global Transcriptome Using Illumina Paired-End Sequencing and Development of EST-SSR Markers in Two Species of Gynostemma (Cucurbitaceae)

Gynostemma pentaphyllum is an important medicinal herb of the Cucurbitaceae family, but limited genomic data have hindered genetic studies. In this study, transcriptomes of two closely-related Gynostemma species, Gynostemma cardiospermum and G. pentaphyllum, were sequenced using Illumina paired-end sequencing technology. A total of 71,607 nonredundant unigenes were assembled. Of these unigenes, 60.45% (43,288) were annotated based on sequence similarity search with known proteins. A total of 11,059 unigenes were identified in the Kyoto Encyclopedia of Genes and Genomes Pathway (KEGG) database. A total of 3891 simple sequence repeats (SSRs) were detected in 3526 nonredundant unigenes, 2596 primer pairs were designed and 360 of them were randomly selected for validation. Of these, 268 primer pairs yielded clear products among six G. pentaphyllum samples. Thirty polymorphic SSR markers were used to test polymorphism and transferability in Gynostemma. Finally, 15 SSR makers that amplified in all 12 Gynostemma species were used to assess genetic diversity. Our results generated a comprehensive sequence resource for Gynostemma research.


Introduction
Gynostemma (Cucurbitaceae), is a genus of perennial creeping herbs with both sexual reproduction and clonal growth by rhizomes or bulbils [1]. This genus has around 16 species and two varieties, distributed in forests, scrubs and bush habitats at 60-3200 m elevations throughout China, India, Myanmar, Korea and Japan [2]. Drainage areas of the Yangtze River in Southwest China's Yunnan Province are thought to be the modern distribution center of this genus. Gynostemma can be divided into two subgenera (Gynostemma and Triostellum) according to different fruit morphology [2]. In recent years, Gynostemma has been attracting attention since they contain saponins, amino acids, and reducing sugars, which could be commercially useful. For example, Gynostemma pentaphyllum, widespread as a traditional Chinese medicinal herb, is thought to inhibit tumor cell growth, anti-ulceration and to enhance immunity [3,4]. Approximately 84 dammarane-type saponin glycosides were found in G. pentaphyllum [5], some of them have structural similarities to glycosides found in Panax ginseng C.A. Mey [6], but different saponin contents were observed in various other species of Gynostemma [7]. Natural populations of Gynostemma were destroyed in recent years due to excessive harvesting, especially G. pentaphyllum, which has been listed as a Grade II Key Protected Wild Plant Species by the Chinese Government [8]. It is necessary to preserve natural stocks of Gynostemma spp., and assess their genetic diversity and differentiation. Molecular genetic research of Gynostemma is limited [9] because most studies have mainly focused on extracting bioactive components [5,[10][11][12][13]. Subramaniyam et al. (2011) [14] reported de novo transcriptome assembly of G. pentaphyllum with Roche platforms using the materials of leaves and roots, but the research focused on the identification of secondary metabolite genes. So far, only 14 genomic simple sequence repeats (SSRs) and 14 inter-simple sequence repeats (ISSRs) have been exploited in Gynostemma [15,16]. Thus, more markers are needed to better understand the genetic diversity and to develop conservation strategies for Gynostemma.
SSR markers have turned out to be an effective tool for germplasm characterization and genetic diversity studies. SSRs can be divided into two categories based on the original sequences used for development of SSRs: genomic SSRs and expressed sequence tag (EST)-SSRs. Developing genomic SSR markers from random genomic sequences is labor, money and time intensive [17,18]. On the contrary, EST-SSRs identified from transcribed RNA sequences are more conserved than noncoding sequences. EST-SSRs are becoming more and more widespread, not only because they are potentially linked with particular transcriptional regions that contribute to agronomic phenotypes [19,20], but also because they have high transferability among closely-related species [21][22][23][24][25][26]. With the development of next-generation sequencing (NGS), it has become possible to generate large numbers of transcriptomic datasets for nonmodel organisms [27] using various platforms such as Roche 454, Illumina HiSeq, and Applied Biosystems SOLiD. Obtaining large numbers of valuable EST sequences via NGS is important for gene annotation and discovery [28,29], comparative genomics [30], development of molecular markers [31,32], and population genomics studies of genetic variation linked to adaptive traits [33]. Recently, an increasing number of EST datasets have become available for model and non-model organisms, but only a limited number of Gynostemma EST sequences are available in the public database.
In this study, we describe the generation, de novo assembly, and annotation of a transcriptome-derived EST dataset using Illumina paired-end sequencing technology from two Gynostemma species, G. pentaphyllum and G. cardiospermum. In addition, we mined and validated a large set of EST-SSR markers and investigated the genetic relationship among 12 selected species. This EST datasets will serve as a valuable genomic resource for further studies in Gynostemma, e.g., novel gene discovery and marker-assisted selective breeding.

Assembly of Gynostemma Transcriptome Data from Illumina Sequencing
After stringent quality assessment and data filtering, Illumina HiSeq™ 2000 sequencing generated 43,175,448 high-quality reads for Gynostemma pentaphyllum and 52,782,146 high-quality reads for Gynostemma cardiospermum, respectively. The raw data were deposited in the NCBI Sequence Read Archive (SRA) under the accession number SRA305674. Using the Trinity assembler software [34], short-read sequences from G. pentaphyllum and G. cardiospermum were assembled de novo into 1,488,035 contigs and 1,911,378 contigs, respectively. Transcriptome reads and assembled contigs information for two Gynostemma species are shown in Table 1. The frequency distribution of contigs length from these two Gynostemma species showed little difference, except for 100-200 bp, which showed more contigs in G. cardiospermum (Figure 1). Using paired end-joining, gap-filling, and Trinity, these contigs were assembled into scaffolds which were further assembled into unigenes. Finally, we obtained 40,257 and 44,000 unigenes from G. pentaphyllum and G. cardiospermum, respectively (Table 2). In addition, we obtained a nonredundant set of unigene sequences by pooling contigs from the two species and assembling them together into 71,607 unigenes. As shown in Table 2, the 71,607 nonredundant unigenes were used for in silico mining and validation of genic-SSR markers.

Functional Annotation and Classification
A homology-based approach was conducted for validation and annotation of assembled unigenes. Among 71,607 unigenes, 60.28% (43,167) showed homology in the nonredundant (nr) database, and 46.04% (32,970) unigenes had BLAST hits in Swiss-Prot database. A total of 60.45% (43,288) unigenes were successfully annotated in the nr and/or Swiss-Prot databases. Additionally, 97.73% (19,895 of 20,357) of the unigenes over 1000 bp in length showed homologous matches, whereas only 32.33% (6619 of 20,474) of the unigenes shorter than 300 bp showed matches ( Figure 2). The unigenes homologous to known protein sequences in nr database were further assigned to gene ontology (GO) terms using Blast2GO. A total of 35,968 unigenes were assigned to 549,570 GO term annotations, which belonged to biological processes, molecular functions and cellular components clusters ( Figure 3). Among biological processes, "cellular process" was the most dominant group, followed by "metabolic process", "response to stimulus", and "biological regulation". Regarding the molecular functions category, the major GO terms were "binding" and "catalytic activity". Under the cellular components category, "cell part" and "cell" represented the most abundant classification, followed by "organelle" and "organelle part". All unigenes were searched against the COG database to predict possible functions and phylogenetically classify orthologous gene products. Out of 43,167 nr hits, 20,585 sequences were assigned to one or more COG classifications ( Figure 4). Among the 25 COG categories, the cluster for "general function prediction" was the largest group, followed by "translation ribosomal structure and biogenesis", "replication, recombination and repair", and "posttranslational

Functional Annotation and Classification
A homology-based approach was conducted for validation and annotation of assembled unigenes. Among 71,607 unigenes, 60.28% (43,167) showed homology in the nonredundant (nr) database, and 46.04% (32,970) unigenes had BLAST hits in Swiss-Prot database. A total of 60.45% (43,288) unigenes were successfully annotated in the nr and/or Swiss-Prot databases. Additionally, 97.73% (19,895 of 20,357) of the unigenes over 1000 bp in length showed homologous matches, whereas only 32.33% (6619 of 20,474) of the unigenes shorter than 300 bp showed matches ( Figure 2). The unigenes homologous to known protein sequences in nr database were further assigned to gene ontology (GO) terms using Blast2GO. A total of 35,968 unigenes were assigned to 549,570 GO term annotations, which belonged to biological processes, molecular functions and cellular components clusters ( Figure 3). Among biological processes, "cellular process" was the most dominant group, followed by "metabolic process", "response to stimulus", and "biological regulation". Regarding the molecular functions category, the major GO terms were "binding" and "catalytic activity". Under the cellular components category, "cell part" and "cell" represented the most abundant classification, followed by "organelle" and "organelle part". All unigenes were searched against the COG database to predict possible functions and phylogenetically classify orthologous gene products. Out of 43,167 nr hits, 20,585 sequences were assigned to one or more COG classifications ( Figure 4). Among the 25 COG categories, the cluster for "general function prediction" was the largest group, followed by "translation ribosomal structure and biogenesis", "replication, recombination and repair", and "posttranslational modification, protein turnover, chaperones". In contrast, only a few unigenes were assigned to "nuclear structure and extracellular structure". According to the Kyoto Encyclopedia of Genes and Genomes (KEGG) database, 11,059 unigenes were identified with pathway annotation and were assigned to 117 KEGG pathways (Table S1). The top 20 pathways, including 5443 unigenes, are listed in Figure 5. The most highly represented pathways were "Ribosome", followed by "Protein processing in endoplasmic reticulum" and "Plant hormone signal transduction". Being important medicinal plants used in China, previous research on Gynostemma spp. has mostly focused on saponins biosynthesis pathways. As expected, some key genes encoding enzymes related to the synthesis of triterpene compounds, which are important components of saponins, were revealed in metabolism pathway. For example, the genes involved in mevalonate (MVA) and 2-C-methyl-D-erythritol 4-phosphate (MEP) pathways were identified in our study (Table 3) and these genes may provide valuable resources for research on gynosaponin biosynthesis.
Molecules 2015, 20, page-page 4 modification, protein turnover, chaperones". In contrast, only a few unigenes were assigned to "nuclear structure and extracellular structure". According to the Kyoto Encyclopedia of Genes and Genomes (KEGG) database, 11,059 unigenes were identified with pathway annotation and were assigned to 117 KEGG pathways (Table S1). The top 20 pathways, including 5443 unigenes, are listed in Figure 5. The most highly represented pathways were "Ribosome", followed by "Protein processing in endoplasmic reticulum" and "Plant hormone signal transduction". Being important medicinal plants used in China, previous research on Gynostemma spp. has mostly focused on saponins biosynthesis pathways. As expected, some key genes encoding enzymes related to the synthesis of triterpene compounds, which are important components of saponins, were revealed in metabolism pathway. For example, the genes involved in mevalonate (MVA) and 2-C-methyl-D-erythritol 4-phosphate (MEP) pathways were identified in our study (Table 3) and these genes may provide valuable resources for research on gynosaponin biosynthesis.   Molecules 2015, 20, page-page 4 modification, protein turnover, chaperones". In contrast, only a few unigenes were assigned to "nuclear structure and extracellular structure". According to the Kyoto Encyclopedia of Genes and Genomes (KEGG) database, 11,059 unigenes were identified with pathway annotation and were assigned to 117 KEGG pathways (Table S1). The top 20 pathways, including 5443 unigenes, are listed in Figure 5. The most highly represented pathways were "Ribosome", followed by "Protein processing in endoplasmic reticulum" and "Plant hormone signal transduction". Being important medicinal plants used in China, previous research on Gynostemma spp. has mostly focused on saponins biosynthesis pathways. As expected, some key genes encoding enzymes related to the synthesis of triterpene compounds, which are important components of saponins, were revealed in metabolism pathway. For example, the genes involved in mevalonate (MVA) and 2-C-methyl-D-erythritol 4-phosphate (MEP) pathways were identified in our study (Table 3) and these genes may provide valuable resources for research on gynosaponin biosynthesis.

Frequency and Distribution of Different Types of SSR Markers
A total of 3891 SSR loci were identified from 3526 nonredundant unigenes, representing 4.92% of the total 71,607 unigenes. The distribution density was one SSR locus per 15.78 kb. This study excluded mononucleotide repeats and complex SSRs. There were 329 unigenes with more than one SSR locus. The most frequent repeat unit in nonredundant unigenes were trinucleotides followed by dinucleotides ( Figure 6A); and di-and tri-nucleotide repeats constituted 3778 (97.10%) of the identified SSR loci. The number of reiterations of a given repeat unit ranged from five to 12 and SSRs with five reiterations were the most abundant ( Figure 6B). The majority of the SSR sequences were from 12 to 21 bp in length, accounting for 96.53% (3756) of the total identified SSR loci; SSR loci with a length of 15 bp were the most frequent. The longest SSR locus was 30 bp ( Figure 6C). More details about different repeat motif of di-and trinucleotide repeats in EST-SSRs are listed in Table 4. The dominant di-and tri-nucleotide repeat motif in SSRs were AG/CT and AAG/CTT respectively. There was only one CG/CG repeat motif and very few ACT/AGT repeats in our results (Table 4).  Table 4. Frequency distribution of the di-and tri-nucleotide repeat motifs in the Gynostemma.

Development, Validation and Transferability of SSR Markers
To further evaluate the assembly quality, 2586 primer pairs (Table S2) were designed using Primer 3.0 based on 3891 SSR loci generated from MISA. Primer pairs for the remaining 1305 SSR loci could not be designed successfully because their flanking sequences were either too short or the nature of sequences did not satisfy the criteria of BatchPrimer3 v 1.0 software. All 2586 unigene sequences were subjected to BLAST analysis to predict the likely function of these EST-SSRs. There were 2559 transcriptome sequences that showed homology to functional loci of other plants (Table S2). From the 2586 primer pairs, 360 (Table S3) were randomly selected for validation using DNA from the six samples of G. pentaphyllum of three different populations. Among the 360 primer pairs, 268 were successfully amplified via PCR. The remaining 92 primer pairs failed to generate PCR products, even when the annealing temperature was reduced by 8˝C. Of the 268 working primer pairs, 239 amplified products of the expected size including 23 monomorphic loci and 216 polymorphic loci among six genotypes of G. pentaphyllum. The other 29 generated larger products than the expected size, suggesting that there may exist introns in the amplifying regions. To test interspecies transferability across 12 related species (26 individuals) in the genus Gynostemma, 30 SSR pairs were selected from the 216 microsatellites that produce polymorphic size fragments. Of the 30 polymorphic SSRs, 15 primer pairs could amplify PCR products and show polymorphic fragments from all 12 Gynostemma species (Tables 5 and 6).

Genetic Diversity and Relatedness in the Genus Gynostemma
The 15 primer pairs that yielded clear, highly polymorphic bands from all Gynostemma species were used to assess the genetic diversity in a set of 26 individual plants representing 12 species of Gynostemma (Tables 5 and 6). A total of 101 alleles were identified, the number of alleles per locus ranged from four to 11 with an average of 6.73 alleles. Polymorphism information content (PIC) ranged from 0.55 to 0.87 with an average of 0.73, suggesting that the developed EST-SSRs were highly polymorphic. A phenogram based on Jaccard's similarity coefficients was constructed to resolve the relationship of 26 individuals from 12 species (Figure 7), which showed two distinct clusters at a cut-off similarity index of 0.71. Cluster I contained seven species, which corresponded to subgen. Gynostemma, and was divided into five sub-clusters: Ia, Ib, Ic, Id and Ie, at a cut-off similarity index of 0.78; Sub-cluster Ia comprised six G. pentaphyllum genotypes (three wild and three cultivar genotypes); Sub-cluster Ib comprised four G. pubescens from four populations and one G. laxum; Sub-cluster Ic comprised four G. longipes from four populations and one G. burmanicum var. molle; Sub-cluster Id comprised two G. burmanicum from two populations; and Sub-cluster Ie comprised two G. pentaphyllum var. dasycarpum from Jinghong and Mengla locations, respectively. Cluster II included five species that corresponded to subgen. Triostellum, and was divided into two sub-clusters, IIa and IIb, at a cut-off similarity index of 0.79. Sub-cluster IIa comprised one G. microspermum, one G. laxiflorum and one G. caulopterum, and Sub-cluster IIb comprised one G. yixingense and two G. cardiospermum individuals.

Functional Annotation of Unigenes
Presently, most research concentrate on isolating bioactive components from Gynostemma spp., but the potential molecular mechanisms producing such compounds is still unclear. Transcriptome sequencing is an effective method for novel gene discovery and SSR marker development. In this study, 71,607 nonredundant unigenes were obtained after assembly. In total, 60.45% (43,288) of all unigenes had homologs in the NCBI nr or Swiss-Prot protein databases, which was lower than that reported by Subramaniyam et al. [14] with leaves and roots as sequencing materials in Gynostemma pentaphyllum. Compared with other plants used in Chinese medicine, this was higher than Epimedium sagittatum [35], but lower than Panax quinquefolius [36] and Panax notoginseng [37]. Among the 43,288 unigenes with BLAST matches in the NCBI nr or Swiss-Prot protein database, 97.73% were over 1000 bp, whereas unigenes shorter than 300 bp in length only accounted for 32.33% of the total. Therefore, we infer that the large proportion of BLAST matches in Gynostemma was probably due to the large number of long sequences in our unigene database, which was also validated in other plants [38][39][40]. Perhaps the lack of a characterized protein domain, a common feature of the shorter unigene sequences, was the cause of the small number of shorter sequences showing BLAST hits in the protein databases. Further research with GO analysis revealed that most genes are involved in many biological processes in Gynostemma. Many genes were assigned to "metabolic process" and "catalytic activity" classes, which suggest a great deal of enzymes involved in primary and secondary metabolism.

Functional Annotation of Unigenes
Presently, most research concentrate on isolating bioactive components from Gynostemma spp., but the potential molecular mechanisms producing such compounds is still unclear. Transcriptome sequencing is an effective method for novel gene discovery and SSR marker development. In this study, 71,607 nonredundant unigenes were obtained after assembly. In total, 60.45% (43,288) of all unigenes had homologs in the NCBI nr or Swiss-Prot protein databases, which was lower than that reported by Subramaniyam et al. [14] with leaves and roots as sequencing materials in Gynostemma pentaphyllum. Compared with other plants used in Chinese medicine, this was higher than Epimedium sagittatum [35], but lower than Panax quinquefolius [36] and Panax notoginseng [37]. Among the 43,288 unigenes with BLAST matches in the NCBI nr or Swiss-Prot protein database, 97.73% were over 1000 bp, whereas unigenes shorter than 300 bp in length only accounted for 32.33% of the total. Therefore, we infer that the large proportion of BLAST matches in Gynostemma was probably due to the large number of long sequences in our unigene database, which was also validated in other plants [38][39][40]. Perhaps the lack of a characterized protein domain, a common feature of the shorter unigene sequences, was the cause of the small number of shorter sequences showing BLAST hits in the protein databases. Further research with GO analysis revealed that most genes are involved in many biological processes in Gynostemma. Many genes were assigned to "metabolic process" and "catalytic activity" classes, which suggest a great deal of enzymes involved in primary and secondary metabolism. Among the KEGG pathways, the well-represented pathways discovered in our study were "Ribosome", "Protein processing in endoplasmic reticulum" and "Plant hormone signal transduction". Furthermore, some key genes involved in the biosynthesis of terpenoids were identified, several of which were found in other species [36,37]. Compared with the transcriptome of leaves and roots, genes related to biosynthesis of terpenoids showed little difference. For instance, 4-hydroxy-3-methylbut-2-en-1-yl diphosphate synthase (EC: 1.17.7.1), farnesyl-diphosphate farnesyltransferase (EC: 2.5.1.21) and Squalene monooxygenase (EC: 1.14.99.7), which were found in our study, were not presented in the results of Subramaniyam et al. [14]. Likewise, the genes present in results of Subramaniyam et al. [14], Hydroxymethylglutaryl-CoA reductase (EC: 1.1.1.88), Geranyl transtransferase (EC: 2.5.1.10), Dimethylallyltranstransferase (EC: 2.5.1.1), etc., did not appear in our study either. These results reflect that there might exist different transcriptomic signatures in different tissues. Some unigenes without BLASTx hits may be potential Gynostemma-specific genes. Both of these classes can provide valuable information for the further study of Gynostemma spp., such as novel gene discovery and cloning, functional studies, and metabolic engineering of enzymes.

SSR Marker Frequency and Distribution in Gynostemma Transcriptome
Polymorphic SSRs play an important role in genetic diversity research, genetic mapping studies, comparative genomics, and marker-assisted selection breeding [41]. Transcriptomics provides a rich source for SSR discovery because it generates plenty of sequences. A total of 3891 SSRs greater than 12 bp in length were identified from 3526 nonredundant unigenes, 4.92% of the total 71,607 unigenes possessed SSRs. It is obvious that the SSR frequency detected in Gynostemma is in accordance with the range of frequencies (2.65%-16.82%) reported before for other dicotyledonous species [42]. Several factors affect the EST-SSR frequency. First, the criteria for calling microsatellites is the most important factor of EST-SSR frequency, e.g., the repeat length threshold and the number of repeat motifs. Most studies have excluded the mononucleotide repeat motifs because they may result from sequencing errors. Some studies take three-repeat units into account when calculating the number of dinucleotide repeat units [40], while others do not [43][44][45]. In addition, we identified SSRs primarily from unigenes over 1000 bp, which may reduce the frequency to a certain degree. Secondly, genome structure or composition could also influence SSR frequency [46]. For example, it is reported that the small genome size of rice was the cause of the high frequency of EST-SSR sequences [47]. Finally, the different software used to detect SSR loci can also affect the SSR frequency.
Theoretically, the frequency of di-, tri-, tetra-, penta-, and hexanucleotide repeats should be in turn decreased according to the relative probability of replication slippage events [48]. The most abundant type of repeat motif among the Gynostemma unigenes analyzed was trinucleotide ( Figure 6A). This finding is consistent with the earlier results reported before [22,35,45,[48][49][50][51][52][53][54], which showed the trinucleotide motif is the most frequent repeat type. Some studies point out the reason for the high frequency of tri-nucleotide SSRs is that the selection against frameshift mutations might limit the expansion of other SSR types [43,[55][56][57]. Meanwhile, other studies show that the most abundant class of SSRs was dinucleotide [38,39,42,58,59]. There are also some plant species showing approximately equal proportions of dinucleotide and tri-nucleotide repeats in their transcriptome sequences, e.g., Aspidistra saxicola [44], sweet potato [39], and oak [60]. The most frequent repeats of di-and tri-nucleotide were AG/CT and AAG/CTT, respectively, which was in accordance with the reports in sesame [38], oil palm [61], sweet potato [39], Primula [62], and Nothofagus nervosa [63].

Transferability of SSR Markers and Genetic Relationships among Different Species of Gynostemma
In this study, 3891 SSR markers were developed and 360 primer pairs were randomly selected to evaluate the assembly quality of reads and validity of markers in Gynostemma. In total, 268 primer pairs (74.44%) yielded clear fragments among six G. pentaphyllum genotypes. This result matches the 60%-90% success rate reported before. In total, 216 Polymorphic EST-SSR markers were obtained with a polymorphic proportion of 90.38%, which was similar to Amorphophallus [26], but was higher than other plants [20,21,52]. Our results suggest that the transcriptome assembly was reliable, and that the EST-SSR markers are usable across 12 species in the genus Gynostemma. The observed number of alleles ranged from four to 11 with an average of 6.73, indicating the potential application of these primer pairs. In the present study, EST-SSRs derived from G. pentaphyllum and G. cardiospermum had a higher transferability rate, which was also observed in other plant taxa [22,25,26,35,64]. It has been proposed that the high transferability rate of EST-SSRs might be due to several factors: (1) the EST-SSRs derived from transcriptome database are conservative when compared with genomic SSRs [65,66]; (2) the more consistent efficiency of amplification of EST-SSRs enhances cross-species transferability [67,68]; and (3) closely-related species benefit from a high SSR transferability rate [26]. However, at the same time, [49] explained that the limitation on the interspecific transfer of SSR markers is caused by homoplasy of band sizes and complex mutational events. The genetic relationship among 26 individuals representing 12 species of Gynostemma based on 15 polymorphic SSR loci was clearly shown in dendrogram graph. Two major groups representing subgen. Gynostemma and subgen. Triostellum respectively were identified at a cut-off similarity index of 0.71, the level of genetic similarity was 0.70-1.00, indicating relatively high resolution power and potential utility of polymorphic SSR markers in phylogenetics of Gynostemma. As expected, the six G. pentaphyllum individuals were classified into three groups, and wild individuals were clustered with cultivated individuals from the same population. The variation between populations was higher than the other Gynostemma species, implying that G. pentaphyllum, as a widespread species, has a high level of genetic diversity. These results are concordant with previous reports [69]. Therefore, the potential EST-SSRs identified in this study will be an effective tool for germplasm polymorphism assessment or quantitative trait loci mapping in Gynostemma.

Plant Materials
Young leaves, flowers and immature seeds from two species in the genus Gynostemma (G. pentaphyllum and G. cardiospermum) were used for RNA extraction and transcriptome sequencing. DNA from 26 individual plants collected from southeast China was used to validate SSR markers and diversity analysis. Detailed information for the plant materials is listed in Table 6.

RNA Extraction, Reverse Transcription and Sequencing
G. pentaphyllum and G. cardiospermum were collected from two locations of Ankang in Shaanxi province during July 2013 (G. pentaphyllum: 32˝25 1 N, 109˝04 1 E; G. cardiospermum: 32˝13 1 N, 109˝01 1 E), the multiple individual plants mixture including leaves, stems, flowers, shoot tips and developing seeds for each species were frozen immediately in liquid nitrogen, and stored at´70˝C. After mixing an approximately equal weight of mixture for each species, total RNA was extracted using the TRIzol reagent (Invitrogen, Carlsbad, CA, USA) according to manufacturer instructions, then poly-A mRNA was isolated from total RNA using poly-T oligo-attached magnetic beads (Illumina Inc., San Diego, CA, USA). The quantity and quality of RNA were assessed by gel electrophoresis and spectrophotometry. Purified RNA was used to construct a directional cDNA library using the cDNA Synthesis Kit (Illumina), and then the cDNA library was sequenced using a HiSeq 2000 (Illumina) to obtain short sequences.

Transcript Assembly and Analysis
All raw reads from the two Gynostemma species were prescreened to remove adapter sequences, reads with greater than 10% unknown bases, and reads with an average base quality less than 30. High-quality filtered transcriptome reads were assembled into contigs by de novo assembly using Trinity tools [34]. A nonredundant set of unigene sequences was then created using paired-end reads by further alignments of the contigs from each species. To annotate them, all unigenes were searched against NCBI's nonredundant protein (nr) database and Swiss-Prot protein databases using BLASTx with an E-value <10´5. The Blast2GO program [70] was used to get Gene Ontology (GO) terms to describe gene products according to three ontologies: molecular function, biological process and cellular component [71]. The unigene sequences were also aligned to the COG database to predict and classify functions. To further understand the biological functions and interactions of genes, pathway assignments were performed based on the Kyoto Encyclopedia of Genes and Genomes (KEGG) database [72] using BLASTx with an E-value threshold of 10´5.

EST-SSR Detection and Pprimer Design
Nonredundant unigene sequences longer than 1000 bp were used for mining SSR loci using the MISA tool [49], and primers were designed using BatchPrimer3 v1.0 software with default parameters [73]. Only cDNA-based SSR loci containing two to six nucleotide motifs were considered, the criteria for selection of SSRs were a minimum of six repeats for di-nucleotide motifs and five repeats for tri-, four repeats for tetra-, penta-, and hexa-nucleotide motifs. Mononucleotide repeats and complex SSR types were ignored. Frequency of SSR refers to the average number of kilobase pairs of cDNA sequence containing one SSR. The parameters for designing PCR primer pairs from sequences flanking SSRs were as follows: (1) primer length range from 18 to 25 bases (optimal 20 bases); (2) PCR product size range of 100 to 300 bp (optimal 200 bp); (3) annealing temperature of 50-60˝C (optimal 55˝C); and (4) a GC content of 40%-60% (optimal 50%). Other parameters were set at the default value of BatchPrimer3v1.0.

Plant DNA Extraction, PCR Conditions and Separation of SSR Markers
26 individuals, representing 12 Gynostemma species (Table 6), were selected for analysis of intraspecific genetic diversity, cross-species amplification with the EST-SSRs, and interspecific relationships. Plant DNA was extracted from leaf samples using the CTAB method [74], and DNA integrity was checked via electrophoresis on 1.5% agarose gel. PCR amplifications were carried out using a MyCycler™ Thermal Cycler (Bio-RAD, CA, USA) in a 10 µL final volume containing 1ˆPCR buffer [10 mM Tris-HCl (pH 8.4), 1.5 mM MgCl 2 ], 0.2 mM dNTPs, 0.2 µM of each primer, 50 ng of genomic DNA, and 0.5 U Taq polymerase (Biostar, New Taipei, Taiwan). The PCR reaction program was: DNA denaturation at 95˝C for 5 min; followed by 32 cycles of 95˝C for 40 s, 50-60˝C (depending on optimized annealing temperature) for 30 s and 72˝C for 50 s. The final extension was performed at 72˝C for 10 min. PCR products were analyzed using 8% PAGE and silver stained [75] with a PBR322 DNA marker ladder (Tiangen Biotech, Beijing Co., Ltd., Beijing, China) for assessing the length of the DNA bands. A total of 360 genic SSR markers were selected randomly for genotyping six G. pentaphyllum samples from three populations, 30 highly polymorphic loci were selected for testing the transferability of EST-SSRs to the other ten species in the genus Gynostemma.

Genetic Analysis and Data Scoring
Of the 30 highly polymorphic loci, genic-SSR markers that amplified successfully in all 12 species were used to assess the genetic diversity in a set of 26 individual plants. Each allele was scored as present (1) or absent (0) for each of the SSR loci. The polymorphism information content (PIC) of each SSR primer was calculated to estimate the allelic variation of SSRs in the 26 individuals according to the formula: PIC " 1´Σ n i"0 Pi 2 , where Pi is the frequency of the i th allele for a given SSR marker, and n is the total number of alleles detected for that SSR marker [76]. The genetic similarity between any two individuals was estimated based on Jaccard's similarity coefficient. All 26 individuals were clustered with the UPGMA algorithm and SAHN procedure of the NTSYS-PC v2.10t [77]. Bootstrapping analysis with 1000 replicates was carried out using the software FREETREE V.0.9.1.50 [78]. Bootstrap values over 50 were considered significant and provided on the dendrogram.

Conclusions
In this study, we used high-throughput sequencing to characterize the transcriptomes of two Gynostemma species. A large-scale EST dataset with 71,607 nonredundant unigenes from G. pentaphyllum and G. cardiospermum was established, which provided valuable sequences for the discovery of new genes and EST-SSR markers. These results support the view that NGS is a fast and cost-effective approach for gene discovery and molecular marker development in nonmodel species.