De Novo Assembly and Characterization of the Transcriptome of the Chinese Medicinal Herb, Gentiana rigescens

Gentiana rigescens is an important medicinal herb in China. The main validated medicinal component gentiopicroside is synthesized in shoots, but is mainly found in the plant’s roots. The gentiopicroside biosynthetic pathway and its regulatory control remain to be elucidated. Genome resources of gentian are limited. Next-generation sequencing (NGS) technologies can aid in supplying global gene expression profiles. In this study we present sequence and transcript abundance data for the root and leaf transcriptome of G. rigescens, obtained using the Illumina Hiseq2000. Over fifty million clean reads were obtained from leaf and root libraries. This yields 76,717 unigenes with an average length of 753 bp. Among these, 33,855 unigenes were identified as putative homologs of annotated sequences in public protein and nucleotide databases. Digital abundance analysis identified 3306 unigenes differentially enriched between leaf and root. Unigenes found in both tissues were categorized according to their putative functional categories. Of the differentially expressed genes, over 130 were annotated as related to terpenoid biosynthesis. This work is the first study of global transcriptome analyses in gentian. These sequences and putative functional data comprise a resource for future investigation of terpenoid biosynthesis in Gentianaceae species and annotation of the gentiopicroside biosynthetic pathway and its regulatory mechanisms.

The objective of this research was to compare the transcriptomes of the leaf and root of G. rigescens, using Illumina Hiseq2000. To determine genes involved in the gentiopicroside biosynthesis pathway and its regulatory mechanism, transcripts from leaves and roots of gentian were isolated, quantified, sequenced, and annotated. The results described here will aid further functional genomic studies in gentian.

Sequencing and Assembly
To determine the transcriptomes of the leaves and roots of G. rigescens, two sequencing libraries were prepared and sequenced with the Illumina paired-end technique. As a result, over 50 million clean reads per library were obtained after cleaning and quality checks were performed. The sequencing data quality assessments are shown in Table 1. The error rate of both root and leaf is 0.03% (Q20 and Q30 are over 96% and 90%, separately), indicating a high quality of sequence. The sequencing raw data has been deposited into the Short Reads Archive (SRA) database under the accession number SRP027253. Leaf_1, Root_1, Leaf_2, Root_2: The left and the right reads, separately; Raw reads: Statistical raw sequence data with four lines as a unit, to sum the sequence number of each file; The Clean Reads of the Leaf or Root were the sum of the left and right reads. Error rate: Bases error rate; Q20 and Q30: The percentages of the bases whose Phred values were more than 20 and 30, separately.
The clean reads were combined and assembled by using the Trinity program, which has been shown to be an excellent assembler for de novo transcriptome assembly from short-read RNA-Seq data [20]. Assembled sequences were subjected to cluster using the Trinity algorithm. As a result, 191,541 contigs clustered into 78,433 Trinity components (mean size = 743 bp, N50 = 1365 bp). Each Trinity component defines a collection of transcripts that are most likely to be derived from the same locus (except a portion from very closely related paralogs) [20,21]. This component was defined as a unigene and the longest transcript in each component was used to represent the corresponding unigene in this study. After removal of 1716 (2.2% of total) contaminant unigene sequences from non-plant species (see Materials and Methods), a transcriptome of 76,717 unigenes with a total size ~57.7 Mb was established for G. rigescens. The sequences of the unigenes (longer than 200 bp) were deposited in the NCBI (National Center for Biotechnology Information) Transcriptome Shotgun Assembly Sequence Database (TSA) according to its standard criteria (downloaded from the BioProject: PRJNA211794, Accession Number: GDAB00000000). The full list of transcript sequences is also available upon request.

Assembly Assessment
By comparing our results to Gentiana sequences downloaded from the NCBI (Available online: http://www.ncbi.nlm.nih.gov), we demonstrated that the assembly succeeded in constructing a large amount of transcripts with desirable length. Of 43,611 Gentiana sequences, 33,773 (77.4%) sequences were represented in our assembly (Megablast, E-value was 10 −9 ), among which 23,908 (70.8%) sequences were matched with more than 80% identity and 80% coverage. RNA-Seq reads were mapped back to the assembly to calculate the proportion of reads assembled, indicating a statistic report comparable to other de novo assemblies. The total alignment rate was 92.72% (Table 2), and 78.3% of the mapped paired-reads aligned concordantly, which showed good physical evidence of sequence contiguity. Transcript length (such as N50, average length) is another broadly used parameter to overview the quality of the transcriptome assembly. As shown in Figure 1, the unigenes ranged from 201 to 16,728 bp, with a mean length of 753 bp and an N50 length of 1384 bp, which is comparable to similar RNAseq reports. Thus, we have successfully constructed a desirable assembly from Illumina paired-end sequencing.  Figure 1. Length distribution frequency of the unigenes in G. rigescens.

Gene Function Annotation and Classification
All the 76,717 assembled putative unigenes were aligned using the BLAST program against the NR, NT, Swiss-Prot and COG databases with the E-value cutoff of 10 −5 . A total of 33,855 unigenes were annotated, accounting for 44.13% (Table 3). Among them, 26,686 unigenes (34.78%) showed high homology, with sequences in the NR database, 24,371 unigenes (31.77%) matched to protein sequences in TAIR10, and 18,627 unigenes (24.28%) showed homology with known genes in SwissProt. The detailed results are shown in Tables 3 and S1-S3. Based on the top-hit species distribution of the homology result against NR databases, 26,361 unigenes (92.08%) showed high homology with sequences from land plants, among which the highest matches were to genes from Coffea canephora (36.08%), followed by Vitis vinifera (8.57%), and Sesamum indicum (7.36%) ( Figure 2).  Putative protein sequences were obtained by translating using a standard codon table. The CDSs of unigenes that did not match the above databases were predicted with the ESTSCAN software. The gene length distribution is shown in Figure 3. The length of peptides predicted by BLASTp ranges from 60-810, while that of ESTSCAN are 30-240. In this study, all unigenes were searched against the GO database. Out of 76,717 unigenes, 26,494 were successfully annotated and classified into three GO categories: biological process, cellular component, and molecular function, and assigned to 56 functional groups ( Figure 4). As shown in Figure 4, assignments which fell under cellular component ranked the highest, followed by biological process, and molecular function. In the biological process category, "cellular process" (16,075, 60.67%) and "metabolic process" (15,223,57.46%) were the two most representative subcategories. In the cellular component category, unigenes related to "cell" (10,308, 38.91%) and "cell part" (10,282, 38.81%) were dominant, while in the molecular function category, the majority of unigenes were involved in "binding" (14,903, 56.25%) and "catalytic activity" (12,326, 46.52%). These results suggested that many kinds of enzyme pathways were active in gentian.
A total of 10,524 sequences were classified into 26 KOG/COG (Clusters of Orthologous Groups of proteins) groups ( Figure 5), where "General function prediction only" category accounted for the most frequent group (1948, 18.51%), with the second largest group being "Post-translational modification, protein turnover, chaperon" (1319, 12.53%), followed by "Signal transduction" (932, 8.86%) and "Translation" (654, 6.21%). These results showed that in the flower stage of gentian, the protein translation and signal transduction are active.  The KEGG (Kyoto Encyclopedia of Genes and Genomes) metabolic system is a group of metabolic maps which represents current understanding of biomolecular interaction networks. In order to determine the active pathways in flowering gentian, KEGG assignments of all unigenes were performed.
Referencing the 7998 unigenes of G. rigescens through the KEGG database predicted a total of five categories (level 1, cellular processes, environmental information processing, genetic information processing, and metabolism and organismal systems), 31 sub-categories (level 2, Figure 6) and 238 pathways (level 3). Unigenes identified as related to the "Translation" (861, 10.77%), "carbohydrate metabolism" (852, 10.65%), "Folding, sorting and degradation" (699, 8.74%) and "Signal transduction" (685, 8.56%) were the top four representative pathways ( Figure 6). Unigenes counts for "Terpenoid backbone biosynthesis", "Monoterpenoid biosynthesis", "Diterpenoid biosynthesis", "Sesquiterpenoid and triterpenoid biosynthesis", and "Ubiquinone and other terpenoid-quinone biosynthesis" were 55, 5, 22, 21, and 31, separately. These results indicated that the terpenoid pathways were active in flowering gentian, and the corresponding genes would be candidate genes for gentiopicroside biosynthesis. Gene expression was calculated using the RPKM method, which takes into account both sequencing depth and gene length effects on read count [22]. On the basis of the applied criteria q-value <0.005 and log2(foldchange) >1, 3306 genes (4.31% of all genes) were identified as significantly differentially expressed genes (DEGs) between these two tissues, which comprised 2204 up-regulated genes (accounting for 67%) and 1102 down-regulated genes (33%) in leaves ( Figure 7, Table S4). The log2(fold changes) ranged from one to 15. Not surprisingly, among these DEGs, most were related to photosynthesis, for example, ribulose-1,5-bisphosphate carboxylase/oxygenase (RuBisCO)-a, a key enzyme of the Calvin-Benson cycle of autotrophic CO2 assimilation [23], chloroplast chlorophyll a/b-binding protein, photosystem II 22 kDa protein gene, and chloroplastic ferredoxin genes, were all up-regulated over 10-fold in leaves compared to roots. The terpenoid biosynthesis related genes, such as geranyl diphosphate synthase (GPPS), geraniol synthase (GES), geraniol 10-hydroxylase (G10H), and iridoid oxidase (IO), four key enzymes involving monoterpene biosynthesis, were all up-regulated over 10-fold in leaves compared to roots. Of the down-regulated genes, a late embryogenesis abundant (LEA) protein, was 13-fold higher in roots than in leaves. Late Embryogenesis Abundant (LEA) proteins are a group of hydrophilic proteins with a high content of glycine, and are associated with stress tolerance in plants and animals through protecting enzymatic function and inhibition of aggregation in dehydration, heat, and salt stress [24,25]. In Arabidopsis thaliana, overexpression of LEA14 enhances salt stress tolerance [26]. Ectopic expression of ZmLEA5C in tobacco and yeast enhances their tolerance to osmotic and low temperature stresses [27]. A calcium-dependent protein kinase (CDPK) gene involved in plant defense responses [28] was nine-fold higher in roots than in leaves. Previous research suggests that CCaMK is an important component of the symbiosis signaling pathway [29][30][31][32][33][34]. In Zea mays, calcium/calmodulin-dependent protein kinase (ZmCCaMK) is required for abscisic acid (ABA)-induced antioxidant defense systems [35]. A high affinity nitrate transporter [36] was eight-fold higher in roots than in leaves. In higher plants, there are two nitrate uptake systems, the high and low affinity transporter systems, and the high affinity nitrate transporter functions when the nitrate concentration is low [37,38].

Candidate Transcription Factors Involved in Regulating the Terpenoid Biosynthetic Pathway
TFs play key roles in controlling gene expression [51], and the controlled transcription of biosynthetic genes is one major mechanism regulating secondary metabolite production in plant cells [52][53][54]. The floral terpenoids of snapdragon appear to be derived exclusively from the MEP pathway in plastids, and this pathway controls precursor levels for GPPS, which in turn is transcriptionally regulated [55]. In our G. rigescens unigene dataset, 7176 unigenes were annotated as transcription factors (Table S6), including bHLH (349), AP2-EREBP (172), WRKY (141), MYB (129), bZIP (115), and GRAS (94) family members. Among these, most were expressed in both root and leaf tissues, with 80 showing a significantly higher expression level in leaves than in roots ( Table 6, Table S7). Table 6. Summary of transcription factor unigenes of G. rigescens. HLH  349  26  5  AP2-EREBP  172  20  4  WRKY  141  17  1  MYB  129  7  2  bZIP  115  3  4  GRAS  94  7  1  Total  1000  80  17 Members of the WRKY transcription factor family have been shown to regulate secondary metabolism pathways [56]. In Gossypium arboreum, GaWRKY1 regulates sesquiterpene biosynthesis via activation of δ-cadinene synthase (CAD1-A) [57]. In Coptis japonica, the biosynthesis of berberine is controlled by CjWRKY1 [58]. In tomato trichomes, terpene synthase are controlled by SlMYC1 and SlWRKY73 [59]. In Catharanthus roseus, CrWRKY1, a regulator in biosynthesis of terpenoid indole alkaloids, interacts with transcription factors, including ORCA3, CrMYC, and ZCTs, to play a role in determining the root-specific accumulation of serpentine [60,61]. In Nicotiana attenuata, biosynthesis of diterpene glycosides are regulated by WRKY3 and WRKY6 [62]. In leaves of Artemisia annua, AaWRKY1 activated the expression of the majority of artemisinin biosynthetic genes, including AaADS and AaHMGR [63]. In the present analysis, 141 unigenes were annotated as WRKY family transcription factors, of which 17 were more highly expressed in leaves than in roots (Table 6). qRT-PCR results showed that GrWRKY7 genes were more highly expressed in leaves than in roots, while it was the opposite for GrWRKY5 and GrWRKY6 ( Figure 10). Thus, GrWRKY7 is a good candidate to study in the regulation of the biosynthesis of gentiopicroside. Figure 10. The expression pattern of three selected WRKY genes in roots and leaves in G. rigescens. Means ± SE; each qRT-PCR was biologically repeated three times.

Plant Materials and RNA Isolation
The cultivated variety of G. rigescens was grown in pots with humus soil and yellow soil mixed in a 1:1 ratio. The fresh roots and leaves were collected from 3-year-old flowering gentian plants in October 2012 ( Figure 11). To reduce biological bias, material of three individual plants was pooled to give 1 g of roots and 1 g of leaf samples. All samples were immediately frozen in liquid nitrogen and stored at −80 °C .
Total RNA of each sample (three plants mixed) was isolated by Illumina TruSeq™ RNA Sample Preparation Kit (RS-122-2001). RNA degradation and contamination were monitored on 1% agarose gels. RNA purity was checked using the NanoPhotometer ® spectrophotometer (IMPLEN, Westlake Village, CA, USA). RNA concentration was measured using Qubit ® RNA Assay Kit in Qubit ® 2.0 Flurometer (Life Technologies, Carlsbad, CA, USA). RNA integrity was assessed using the RNA Nano 6000 Assay Kit of the Bioanalyzer 2100 system (Agilent Technologies, Palo Alto, CA, USA).

Transcriptome Sample Preparation for Sequencingc
The construction of a cDNA library and the following sequencing procedures were as in Zhu et al. [64].

Data Filtering
Raw data (raw reads) of fastq format were firstly processed through in-house perl scripts (available upon request). In this step, clean reads were obtained by standard quality control criteria to remove all of the reads which meet any one of the following parameters: (1) The reads that aligned to adaptors with no more than two mismatches; (2) The reads with more than 10% unknown bases (N bases); (3) The reads with more than 50% of low-quality bases (quality value ≤ 5) in one read. At the same time, Q20, Q30, GC-content, and sequence duplication level of the clean data was calculated. All the downstream analyses were based on clean data with a high quality of sequencing.

Transcriptome Assembly and Contamination Sequences Filtering
As there are few reference sequences available for Gentianaceae, the reads for unigenes of both root and leaf were assembled together. The left files (read1 files) from all libraries/samples were pooled into one left.fq file, and right files (read2 files) into one right.fq file, both in FastQ format. Transcriptome assembly was accomplished based on the pooled paired-end reads files (left.fq and right.fq) using Trinity software (Version 2012-10-05) [20] with min_k-mer_cov set to 2 and all other parameters settings as default. The following processes are referred to Shu et al. [65].
Contaminant sequence level was investigated according to species distribution based on protein similarity searching against NR protein databases. Coding sequences from Non-land-plant species were identified and discarded using a previously described taxonomy-based method [66]. Contaminant sequence from major plant pathogens, and human and other microorganisms (including bacteria, virus, and fungi) was investigated using the stand-alone version of DeconSeq [67].

Gene Functional Annotation
To assign putative gene function, unigenes were searched against the NR (NCBI non-redundant protein sequences), NT (NCBI nucleotide sequences), TAIR10, PFAM (Protein family; Available online: http://pfam.sanger.ac.uk/), and Swiss-Prot (A manually annotated and reviewed protein sequence database; Available online: http://www.ebi.ac.uk/uniprot/) databases using BLAST software with an E-value cutoff of 10 −5 [68]. Hmmerscan was adopted for PFAM annotation, and Blast2GO was used for GO annotation (Gene Ontology; Available online: http://www.geneontology.org/) [69] with the same E-value. To evaluate the completeness of the library and the efficacy of the annotation process, the annotated sequences were searched for the possible functions involved in KOG/COG (Available online: http://www.ncbi.nlm.nih.gov/COG/) classifications. To determine which pathways are active in leaves and roots, the annotated sequences were mapped to the reference pathways in KOG/COG, KO (KEGG Ortholog database; Available online: http://www.genome.jp/kegg/).

Differential Expression Analysis
The calculation of unigene expression used the RPKM method (Reads per kb per Million reads) [70]. Gene expression levels were estimated by RSEM [71] for each sample: (1) Clean data were mapped back onto the assembled transcriptome; (2) Readcount for each gene was obtained from the mapping results.
Differential expression analysis and GO enrichment analysis of leaves vs. roots was referred to Lv et al. [72]. To figure out the transcription factor families existing in leaves and roots, the transcript sequences were aligned against the Plant Transcription Factor Database with BLASTX and a cutoff of E-value < 10 −6 [73].

KEGG Enrichment Analysis
KEGG pathway enrichment analysis of the DEGs was done using KOBAS [74].

Real-Time PCR Analysis
DNase I-treated total RNA of root and leaf was converted into first-strand cDNA by the use of PrimeScript RTase (Takara, Tokyo, Japan). qRT-PCRs were performed in an ABI7000 Fluorescence Quantitative PCR Instrument (Applied Biosystems, Foster City, CA, USA) using a SuperReal PreMix Plus Kit (Tiangen, China). The PCR condition was: 95 °C for 3 min; 95 °C for 15 s; 60 °C for 30 s. Each reaction was repeated three times. Glyceraldehyde-3-phosphate dehydrogenase (GAPDH) was chosen as the internal reference gene. The 2 −ΔΔCt method was adopted for the relative gene expression. The primers used are listed in Table S8.

Conclusions
Next generation sequencing of RNA has now replaced microarrays as the preferred method for gene expression profiling. One key advantage of this method is that it enables examination of the transcriptome of non-model organisms [75]. Despite the Chinese traditional herb G. rigescens being used for thousands of years, the biosynthesis pathway and regulation of its main effective component, gentiopicroside, remains unknown. Few genetic or genomic studies have been performed. The results presented here addresses this by using the Illumina Hiseq2000 platform to identify sequences and transcript abundance levels of genes expressed in developing roots and leaves of G. rigescens. These sequences provide a starting point for further investigation of gentiopicroside biosynthesis, and include the 3306 unigenes from diverse pathways that were differentially expressed between root and leaf. The results represent a genetic resource for G. rigescens, and may serve as the foundation for further genomic research on G. rigescens and its relatives.