Isoform Sequencing Provides a More Comprehensive View of the Panax ginseng Transcriptome

Korean ginseng (Panax ginseng C.A. Meyer) has been widely used for medicinal purposes and contains potent plant secondary metabolites, including ginsenosides. To obtain transcriptomic data that offers a more comprehensive view of functional genomics in P. ginseng, we generated genome-wide transcriptome data from four different P. ginseng tissues using PacBio isoform sequencing (Iso-Seq) technology. A total of 135,317 assembled transcripts were generated with an average length of 3.2 kb and high assembly completeness. Of those unigenes, 67.5% were predicted to be complete full-length (FL) open reading frames (ORFs) and exhibited a high gene annotation rate. Furthermore, we successfully identified unique full-length genes involved in triterpenoid saponin synthesis and plant hormonal signaling pathways, including auxin and cytokinin. Studies on the functional genomics of P. ginseng seedlings have confirmed the rapid upregulation of negative feed-back loops by auxin and cytokinin signaling cues. The conserved evolutionary mechanisms in the auxin and cytokinin canonical signaling pathways of P. ginseng are more complex than those in Arabidopsis thaliana. Our analysis also revealed a more detailed view of transcriptome-wide alternative isoforms for 88 genes. Finally, transposable elements (TEs) were also identified, suggesting transcriptional activity of TEs in P. ginseng. In conclusion, our results suggest that long-read, full-length or partial-unigene data with high-quality assemblies are invaluable resources as transcriptomic references in P. ginseng and can be used for comparative analyses in closely related medicinal plants.


Introduction
Korean ginseng (Panax ginseng C.A. Meyer) is one of the most important perennial herbs and has been used as a crude drug for thousands of years in Asian countries, especially South Korea, Japan and China [1]. Its effects have been well defined via clinical trials and animal tests and include improved immunity, increased energy, and rejuvenation [2]. The medicinal constituents of ginseng are derived

PacBio Iso-Seq
cDNA was size-selected in fractions of lengths 1-2 kb, 2-3 kb, 3-6 kb, and >6 kb from RNAs pooled from four tissues, which contain leaf, stem, root and reproductive organs (fruits and flowers). SMRTbell Template libraries were made from those cDNAs for sequencing on PacBio RS II system as recommended by Pacific Biosciences (Palo Alto, CA, USA). The templates were sequenced via polymerase binding using the DNA polymerase binding kit P6 v2 primers.

Iso-Seq Assembly and Quality Assessment
Iso-Seq assembly was performed using SMRT-Analysis software v2.3.0 [22]. First, Read of Inserts (ROIs) were generated using minimum read quality of 75. The Iso-Seq classify tool classified the ROIs into full-length nonchimeric and non-full length reads. The classification was carried out by identifying 5 and 3 adapters used in library preparation. Full-length reads were defined as containing both adapters. Iso-Seq cluster tool was then used for clustering all the full-length reads. In the last step, the consensus sequences produced by the cluster tool were polished using the non-full length reads through the Quiver algorithm [22]. The high and low quiver consensus isoforms were clustered by CD-HIT with a sequence identity threshold of 0.99 [23]. Protein coding sequences (CDSs) were analyzed by using TransDecoder with the following steps [24]; (1) search for all possible CDSs; (2) verify CDSs by GeneID software [25]; and (3) choose the region with the highest score. To assess assembly quality of unigenes, CEGMA [26], which assesses high reliable set of gene annotations in genome and transcriptome assembly, was employed. The compatibility of unigenes to other datasets was analyzed by mapping with BWA [27] using RNA-Seq samples that were derived from 16 tissues of P. ginseng reported by Wang et al. (2015) [13], with the following parameters; -k 19 for minimum seed length, -A1 for matching score, -B 4 for mismatch penalty, and -T 30 for alignment output with score higher than 30.

Unigene Annotation
For functional annotation, unigenes were searched against the UniProt, NCBI non-redundant (NR), TAIR, PlantTFDB databases using BLASTX [10] with an E-cutoff value of 10 −5 . Protein domains were also searched using InterProScan [28]. Gene ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway annotations were performed using Blast2GO [29]. Transposable elements (TEs), microsatellites, and other repeats were screened using RepeatMasker, which was developed for de novo repeat family identification and modeling [30]. In particular, a TE sequence database was constructed from the genomic sequences of Panax notoginseng [31] by searching for known TEs coding sequences against Repbase [32] using TBLASTX with a cutoff 10 −10 . The TE sequences searched were annotated with reference of Gydb [33]. SSRs (simple sequence repeats) were searched using SSR finder [34].

Identification of Alternative Splicing Isoforms
To identify alternative splicing events of internal exon(s) that corresponded to exon skipping, we used BLASTN to search for unigenes that showed exact matches for the 100 bases at the 5 and 3 ends within their complete open reading frames (ORFs). Then, sequence identities among the unigenes were examined with reciprocal BLAST. The unigenes were clustered, and those that exhibited insertions/deletions of more than 100 bases were selected to exclude ambiguous alternative splicing isoforms, such as heterogeneous forms with almost identical sequences. Finally, the resulting clusters were validated by CLUSTALW [35].

Plant Hormone Treatments and Real Time qRT-PCR
One-year-old ginseng seedlings were transferred to 1/2 B5 liquid medium with or without plant hormones (auxin, NAA 10 µM, Sigma # n0640 and cytokinin, t-zeatin 1 µM, Sigma # z0876) and incubated for 2 h. After treatment, the samples were immediately frozen in liquid nitrogen. To extract total RNA, the Easy-Spin IIp Plant RNA Extraction Kit (iNtRON Biotecnology, Seongnam, Korea) was used. cDNA was synthesized with a first-strand synthesis KIT (Enzynomics, Daejeon, Korea). One microgram of total RNA was used. The reaction conditions were as follows: 50 • C for 60 min, 95 • C for 5 min, and 1 min of cooling on ice during the synthesis of first strand cDNA. Selected target genes were subjected to quantitative real-time PCR under the same reaction conditions. The primers were designed using Primer Express 3.0.1 software (Table S6, PE Applied Biosystems, Foster City, CA, USA). qRT-PCR analysis was performed using the QuantBase 3 (Applied Biosystem, Foster City, CA, USA) instrument with SYBR Green Real-time PCR Master Mix (Applied Biosystem, Foster City, CA, USA) according to the manufacturer's recommendations. The primer sequences used in this study are listed in Table S5. The expression levels using the threshold cycle (Ct) value were calculated using the 2 −∆∆Ct method.

GenBank Accession Code
The PacBio sequence data generated for this work are accessible via the NCBI Sequence Read Archive under accession number (SRA: SUB2796783).

Transcriptome Sequencing of P. ginseng Using PacBio Iso-Seq
We generated large P. ginseng cDNA fragments using the PacBio Iso-Seq, which favors the reverse transcription of intact, full-length RNA molecules and identifies splice variants of genes. To obtain equal samples of long and short transcripts, the cDNA was size-selected in lengths of 1-2 kb, 2-3 kb, 3-6 kb, and >6 kb from RNA pooled from four tissues, including leaves, stems, roots and reproductive organs (flowers). Using SMRT sequencing technology, a total of 8.2 million sequencing subreads were generated (Table S1) and merged, from 163,195 to 247,189 isoform clusters that were classified as full-length and non-full-length reads (Table S2). After consensus sequence calling and quality filtration (Table S3), in which the sequences were well distributed in the size fractions of libraries ( Figure S1), the sequences were clustered into a total of 135,317 isoforms with 90% coverage and 99% sequence identity among sequence overlaps and 91.94% assembly completeness ( Figure S2), accounting for approximately 430 Mb in cumulative length. This procedure is summarized in Figure 1. The average length of clustered transcripts (unigenes) was 3178 bp with a length distribution of 2-4 kb ( Figure S3), indicating that our dataset contained larger unigenes compared with previous datasets, in which unigenes ranged from 0.79 kb to 1.9 kb [9,13,36]. ORFs were found in 100,280 unigenes (74.1%), with an average length of 756 bp and a cumulative length of approximately 76.9 Mb. Remarkably, 91,345 unigenes were predicted as complete full-length ORFs.
We also assessed the accuracy and performance of the P. ginseng unigenes. The CEGMA evaluation, which assesses a highly reliable set of gene annotations in genome and transcriptome assembly, revealed a very complete (91.94%) dataset compared with the two de novo assembly results, dataset 1 (85.08%) and dataset 2 (71.77%), which were generated using RNA-Seq. These RNA-Seq samples, dataset 1 and dataset 2, were derived from 16 tissues reported by Wang et al. (2015) dataset 1 [13] and produced from P. ginseng leaves (dataset 2, unpublished), respectively ( Figure 2A). To assess the quality and novelty of our sequencing data, we compared the transcript lengths between our PacBio data and the Illumina RNA-seq de novo assembled unigenes. As presented in Figure S4, almost 56%, 24% and 12% of unigenes distributed in <1 kb, 1-2 kb and 2-3 kb, respectively, were unaligned with the previously reported unigene dataset 1 ([13], Figure S4). However, the PacBio data could recover many portions of longer transcripts than those analyzed using the Illumina RNA-seq de novo assembly ( Figure S5). Next, exact matches of the unigenes were validated by Sanger sequencing (i.e., 100% sequence identity in KG_ISO_000349) ( Figure 2B). Finally, the RNA-Seq samples derived from the 16 tissues of P. ginseng reported by Wang et al. (2015) [13] were mapped to unigenes generated by Iso-Seq ( Figure 2C) with an average mapping rate of 90.8%, although they were poorly mapped due to the absence of fruit pedicel (74.4%) and seed (70.1%) tissue in the pooled RNA samples. Mapping rate of the RNA-seq samples to RNA-seq dataset 1 was 83.8% on average. The high mapping rate for Iso-Seq was also supported by a comparison of sequence alignment using STAR which showed an average mapping rate of 83.4% for Iso-Seq and 66.3% for RNA-Seq dataset 1, respectively ( Figure S6). Our results suggest that high-quality, longer full-length sequencing reads are an invaluable resource and a reliable transcriptome reference in P. ginseng. Procedure of transcript assembly of Panax ginseng using Iso-Seq. Sub-reads were merged from 163,195 to 247,189 isoform clusters and classified as full-length (FL) and/or non-FL reads. After consensus sequence calling and quality filtration were performed, sequences were finally clustered into 135,317 isoforms. Open reading frames (ORFs) were also predicted in those unigenes. After consensus sequence calling and quality filtration were performed, sequences were finally clustered into 135,317 isoforms. Open reading frames (ORFs) were also predicted in those unigenes. Figure 1. Procedure of transcript assembly of Panax ginseng using Iso-Seq. Sub-reads were merged from 163,195 to 247,189 isoform clusters and classified as full-length (FL) and/or non-FL reads. After consensus sequence calling and quality filtration were performed, sequences were finally clustered into 135,317 isoforms. Open reading frames (ORFs) were also predicted in those unigenes.

Identification of Repeat Sequences in P. ginseng Unigenes
We next identified an abundance of TE sequences in unigenes by searching against a TE-related coding sequence database constructed from the genome of P. notoginseng. These TE sequences accounted for approximately 17.9 Mb (4.2%) in cumulative length. Of TEs searched, long terminal repeat (LTR) retrotranspons (58.2% out of the searched TE sequences, Figure S7), especially Ty3/gypsy (34.9%) and Ty1/copia (21.4%), were predominantly distributed ( Figure 3A). The analysis of Ty3/gypsy-and Ty1/copia-encoding sequences including ORFs for gag, pol and other accessory genes revealed the abundance of non-chromodomain retrotransposons for Ty3/gypsy and LTR retrotransposons including copia-like hemiviruses and sireviruses for Ty1/copia ( Figure 3B). In addition to LTRs, DNA transposons including CMC-EnSpm (1.8%), PIF (2.6%), MULE (7.4%), hAT (7.8%), and helitron (15.3%) were also abundant. This result suggests that TEs have transcriptional activity and proliferate in P. ginseng tissues, thus playing an important role in genome and gene evolution by controlling transposition at a post-transcriptional level [37]. We also identified a total of 60,304 simple sequence repeats (SSRs) in unigenes ( Figure 3C), more than the 13,044 SSRs previously reported [38]. The SSR motifs in di-and tri-nucleotide repeats were primarily AG (accounting for 53.9% of dinucleotide repeats) and AAG (24.3% of trinucleotide repeats) sequences. Di-and trinucleotide repeats were also predominantly identified in untranslated regions (UTRs) and ORFs. This result indicates that the P. ginseng transcribed regions are characterized by SSRs. Sequence identity between unigene KG_ISO_000349 and the corresponding Sanger sequence was examined using zPicture. (C) Performance assessment of P. ginseng Iso-Seq. Mapping of RNA-seq dataset 1, derived from 16 tissues, to the two unigene sets assembled from Iso-Seq and RNA-Seq dataset 1 of P. ginseng using BWA.

Identification of Repeat Sequences in P. ginseng Unigenes
We next identified an abundance of TE sequences in unigenes by searching against a TE-related coding sequence database constructed from the genome of P. notoginseng. These TE sequences accounted for approximately 17.9 Mb (4.2%) in cumulative length. Of TEs searched, long terminal repeat (LTR) retrotranspons (58.2% out of the searched TE sequences, Figure S7), especially Ty3/gypsy (34.9%) and Ty1/copia (21.4%), were predominantly distributed ( Figure 3A). The analysis of Ty3/gypsyand Ty1/copia-encoding sequences including ORFs for gag, pol and other accessory genes revealed the abundance of non-chromodomain retrotransposons for Ty3/gypsy and LTR retrotransposons including copia-like hemiviruses and sireviruses for Ty1/copia ( Figure 3B). In addition to LTRs, DNA transposons including CMC-EnSpm (1.8%), PIF (2.6%), MULE (7.4%), hAT (7.8%), and helitron (15.3%) were also abundant. This result suggests that TEs have transcriptional activity and proliferate in P. ginseng tissues, thus playing an important role in genome and gene evolution by controlling transposition at a post-transcriptional level [37]. We also identified a total of 60,304 simple sequence repeats (SSRs) in unigenes ( Figure 3C), more than the 13,044 SSRs previously reported [38]. The SSR motifs in diand tri-nucleotide repeats were primarily AG (accounting for 53.9% of dinucleotide repeats) and AAG (24.3% of trinucleotide repeats) sequences. Di-and tri-nucleotide repeats were also predominantly identified in untranslated regions (UTRs) and ORFs. This result indicates that the P. ginseng transcribed regions are characterized by SSRs.

Efficient Gene Annotation of P. ginseng
Obtaining complete or longer ORFs using Iso-Seq increases the efficiency of functional gene prediction or annotation, especially in the absence of reference genome information. To date, de novo transcriptome assembly using Illumina RNA-Seq data has been limited to more accurate and longer contig assembly because of short read lengths, resulting in chimeric contigs and/or low gene annotation. Of all unigenes called by Iso-Seq, 129,599 unigenes (95.8%) were aligned to protein sequence databases such as UniProt, NR, TAIR, and InterPro using BLAST and InterProScan ( Figure  4A), demonstrating outstanding hit scores within the protein sequence databases. In particular, 64,676 unigenes (47.8%) were predicted as known homologous genes by searching within databases ( Figure 4B), and 120,626 of these annotated unigenes showed homology to 14,403 genes in Arabidopsis ( Figure 4B). A total of 52,920 transcription factor (TF)-related sequences were also matched against PlantTFDB using BLAST, corresponding to 1488 TF-related Pfam domains. MYB-related TFs, WRKY, C3H, B3, and HB families in P. ginseng were abundant compared with the Arabidopsis and rice genomes, although the frequencies of TF families varied ( Figure 4C). The annotated unigenes were assigned to GO categories (89,239 unigenes), KEGG (50,239 unigenes), and Plant Metabolic Pathways

Efficient Gene Annotation of P. ginseng
Obtaining complete or longer ORFs using Iso-Seq increases the efficiency of functional gene prediction or annotation, especially in the absence of reference genome information. To date, de novo transcriptome assembly using Illumina RNA-Seq data has been limited to more accurate and longer contig assembly because of short read lengths, resulting in chimeric contigs and/or low gene annotation. Of all unigenes called by Iso-Seq, 129,599 unigenes (95.8%) were aligned to protein sequence databases such as UniProt, NR, TAIR, and InterPro using BLAST and InterProScan ( Figure 4A), demonstrating outstanding hit scores within the protein sequence databases. In particular, 64,676 unigenes (47.8%) were predicted as known homologous genes by searching within databases ( Figure 4B), and 120,626 of these annotated unigenes showed homology to 14,403 genes in Arabidopsis ( Figure 4B). A total of 52,920 transcription factor (TF)-related sequences were also matched against PlantTFDB using BLAST, corresponding to 1488 TF-related Pfam domains. MYB-related TFs, WRKY, C3H, B3, and HB families in P. ginseng were abundant compared with the Arabidopsis and rice genomes, although the frequencies of TF families varied ( Figure 4C). The annotated unigenes were assigned to GO categories (89,239 unigenes), KEGG (50,239 unigenes), and Plant Metabolic Pathways (73,453 unigenes) ( Figure 4D). The Plant Metabolic Pathways revealed the functional categories of highly abundant genes in P. ginseng compared to seven model plant genomes ( Figure 4E) and included: aerobic respiration, 1,3,5-trimethoxybenzene biosynthesis, homogalacturonan biosynthesis, gluconeogenesis, homogalacturonan degradation, phospholipases, adenosine ribonucleotides de novo biosynthesis, glycerophosphodiester degradation, sucrose biosynthesis, cellulose biosynthesis, callose biosynthesis, D-myo-inositol (1,4,5)-trisphosphate biosynthesis, trehalose biosynthesis, flavonoid biosynthesis, dolichyl-diphosphooligosaccharide biosynthesis, folate transformations, chlorophyllide a biosynthesis, acyl-ACP thioesterase pathway, triacylglycerol degradation, 3-phosphoinositide biosynthesis, choline biosynthesis, pentose phosphate pathway, mevalonate pathway, and maackiain biosynthesis. Interestingly, the abundance of protein-coding genes involved in these pathways of P. ginseng was highly similar to those of P. notoginseng, even though slight variation was observed ( Figure 4E). However, flavonoid biosynthesis and triacylglycerol degradation were more abundant in P. notoginseng ( Figure 4E). (73,453 unigenes) ( Figure 4D). The Plant Metabolic Pathways revealed the functional categories of highly abundant genes in P. ginseng compared to seven model plant genomes ( Figure 4E) and included: aerobic respiration, 1,3,5-trimethoxybenzene biosynthesis, homogalacturonan biosynthesis, gluconeogenesis, homogalacturonan degradation, phospholipases, adenosine ribonucleotides de novo biosynthesis, glycerophosphodiester degradation, sucrose biosynthesis, cellulose biosynthesis, callose biosynthesis, D-myo-inositol (1,4,5)-trisphosphate biosynthesis, trehalose biosynthesis, flavonoid biosynthesis, dolichyl-diphosphooligosaccharide biosynthesis, folate transformations, chlorophyllide a biosynthesis, acyl-ACP thioesterase pathway, triacylglycerol degradation, 3phosphoinositide biosynthesis, choline biosynthesis, pentose phosphate pathway, mevalonate pathway, and maackiain biosynthesis. Interestingly, the abundance of protein-coding genes involved in these pathways of P. ginseng was highly similar to those of P. notoginseng, even though slight variation was observed ( Figure 4E). However, flavonoid biosynthesis and triacylglycerol degradation were more abundant in P. notoginseng ( Figure 4E). To date, candidate genes involved in triterpenoid saponin synthesis have been identified and are critical enzyme-encoding genes from P. ginseng [9,39]. However, low coverage and/or partial short contig assembly in transcriptome sequencing hinders the identification of such candidates. In this study, we identified 314 unigenes previously reported to be involved in triterpenoid saponin synthesis [9,39] (Table 1). Remarkably, of those genes, 233 unigenes (74.2%) were predicted to contain complete ORFs. β-amyrin synthase and β-amyrin 28-oxidase, key upstream molecules involved in triterpenoid saponin synthesis, were expressed as five and two paralogs in complete ORFs. Moreover, squalene synthase, squalene epoxidase, and dammarenediol-II synthase appear as To date, candidate genes involved in triterpenoid saponin synthesis have been identified and are critical enzyme-encoding genes from P. ginseng [9,39]. However, low coverage and/or partial short contig assembly in transcriptome sequencing hinders the identification of such candidates.
In this study, we identified 314 unigenes previously reported to be involved in triterpenoid saponin synthesis [9,39] (Table 1). Remarkably, of those genes, 233 unigenes (74.2%) were predicted to contain complete ORFs. β-amyrin synthase and β-amyrin 28-oxidase, key upstream molecules involved in triterpenoid saponin synthesis, were expressed as five and two paralogs in complete ORFs. Moreover, squalene synthase, squalene epoxidase, and dammarenediol-II synthase appear as families containing several paralogs with alternative splicing isoforms. We also analyzed the number of homologous genes involved in triterpene saponin biosynthesis between P. notoginseng and P. ginseng by using TBLASTX with a cutoff ≤10 −10 (Table S4). Compared to the candidate genes of P. notoginseng, those of P. ginseng appear to be overestimated despite of showing high similarities. These results would be come from abundance of sequence variation in our assembled unigenes which can generate different isoform clusters and increase the number of unigenes. It is also possible that sequence heterozygosity by pooling four different tissues may increase different isoform clusters. These results suggest that our data will be an invaluable resource for identifying candidate genes and understanding functional pathways involved in triterpenoid saponin synthesis.

Identification and Functional Analysis of Auxin and Cytokinin Signaling Pathways in P. ginseng
Plant hormones are key players in plant growth and developmental processes. Among the nine phytohormones studied, auxin and cytokinin have fundamental roles in embryogenesis and meristem development. Recent studies have revealed essential functions of auxin and cytokinin's synergistic or antagonistic interactions in the regulation of plant growth and other developmental processes [40,41]. However, neither annotation nor functional studies of genes related to auxin or cytokinin signaling or biosynthetic pathways in ginseng plants are available. As shown in Table 2, we identified a total of 354 complete ORFs of putative auxin signaling-related genes, including SCF TIR/AFBs receptor complexes (47 unigenes), Auxin Response Factors as key transcription factors (274 unigenes) and AUX/IAAs negative regulators (33 unigenes). We also screened 53 complete ORFs that encoded cytokinin-related two-component signaling genes (Table 3). To test whether the common canonical signaling pathways of auxin and cytokinin are also evolutionary conserved in P. ginseng, we surveyed the negative feedback loop effects in these signaling responses. As previously reported, rapid upregulation of AUX/IAAs and type-A response regulators, which were identified as negative feedback loops in auxin and cytokinin signaling, respectively, was observed in the presence of the corresponding phytohormones ( Figure 5). These results suggest that the transcriptome database is useful for the identification of candidate genes and to understand their functionality in P. ginseng.   Figure 5. Negative feedback loops in auxin and cytokinin signaling responses in P. ginseng seedlings.
Relative gene expression levels of type-A response regulators (A) and AUX/IAAs (B) in three-monthold P. ginseng seedlings treated with cytokinin and auxin, respectively, were quantified by a real time qRT-PCR (n = 4, Error bars indicate ± standard error).

Identification of Alternative Splicing Isoforms
We also confirmed that a total of 88 clusters contained alternative splicing events ( Figure 6). Because whole-genome de novo assembly data of P. ginseng are not yet available, we analyzed alternative splicing events for internal exon(s), such as exon skipping, within unigenes (but not in the 5-or 3′-ends of unigenes). Unigenes that were exact matches for both 100 bases at the 5′ and 3′ ends within their complete ORFs were clustered, and those exhibiting insertions/deletions of more than 100 bases were selected and validated by multiple-sequence alignments ( Figure 6A). For example, an exon skipping event with a 183-base deletion was detected in the ARF6 homolog (auxin response factor 6, Figure 6B). In addition, a 78-base deletion was found in the unigene cluster that was translated even though it was not identified as having alternative splicing isoforms. Insertions/deletions (InDels) were mostly distributed in the size range of 100 bp to 400 bp ( Figure  6C), and resulted in the variation of protein translation. These clusters might result from tissuespecific expression, explaining the differences in amino acid sequence and in biological function. Those clusters were also compared with homologous genes in P. notoginseng. A total of seven clusters were identified by multiple sequence alignment with high sequence identity, and InDels were found only in P. ginseng at the level of protein sequences, suggesting the modification of gene structures in P. ginseng ( Figure 6D). Future studies that use diverse strategies to assess genome-wide transcriptomics in P. ginseng will provide insight into the growth and developmental features of this important plant. in three-month-old P. ginseng seedlings treated with cytokinin and auxin, respectively, were quantified by a real time qRT-PCR (n = 4, Error bars indicate ± standard error).

Identification of Alternative Splicing Isoforms
We also confirmed that a total of 88 clusters contained alternative splicing events ( Figure 6). Because whole-genome de novo assembly data of P. ginseng are not yet available, we analyzed alternative splicing events for internal exon(s), such as exon skipping, within unigenes (but not in the 5-or 3 -ends of unigenes). Unigenes that were exact matches for both 100 bases at the 5 and 3 ends within their complete ORFs were clustered, and those exhibiting insertions/deletions of more than 100 bases were selected and validated by multiple-sequence alignments ( Figure 6A). For example, an exon skipping event with a 183-base deletion was detected in the ARF6 homolog (auxin response factor 6, Figure 6B). In addition, a 78-base deletion was found in the unigene cluster that was translated even though it was not identified as having alternative splicing isoforms. Insertions/deletions (InDels) were mostly distributed in the size range of 100 bp to 400 bp ( Figure 6C), and resulted in the variation of protein translation. These clusters might result from tissue-specific expression, explaining the differences in amino acid sequence and in biological function. Those clusters were also compared with homologous genes in P. notoginseng. A total of seven clusters were identified by multiple sequence alignment with high sequence identity, and InDels were found only in P. ginseng at the level of protein sequences, suggesting the modification of gene structures in P. ginseng ( Figure 6D). Future studies that use diverse strategies to assess genome-wide transcriptomics in P. ginseng will provide insight into the growth and developmental features of this important plant.

Discussion
Transcriptome data generated by Iso-Seq generate longer and improved unigenes from P. ginseng with a high level of assembly completeness and gene annotation, enabling a comprehensive view of the transcriptome. Conventional methods, such as cDNA cloning and expressed sequence tags (EST) sequencing, have limitations and do not efficiently provide accurate sequence information, including expressed mRNA. Although high-throughput sequencing using the Illumina HiSeq platform has recently produced genome-wide transcriptome data with good sequencing depth and coverage [13], de novo transcriptome assembly using short reads have generated short and partial transcript contigs containing artifacts including chimeras, structural errors, incomplete assembly, and base errors, resulting in a high misassembly rate and unreliable gene annotation [42]. In these cases, it has been almost impossible to elucidate the function of genes associated with traits of interest as well as novel transcripts. To avoid such limitations in P. ginseng, we first collected a large portion of full-length ORF containing transcriptome data from four independent tissues (root, stem, leaf, and flower), which maximizes transcript diversity, and used the PacBio SMRT sequencing approach. As expected,

Discussion
Transcriptome data generated by Iso-Seq generate longer and improved unigenes from P. ginseng with a high level of assembly completeness and gene annotation, enabling a comprehensive view of the transcriptome. Conventional methods, such as cDNA cloning and expressed sequence tags (EST) sequencing, have limitations and do not efficiently provide accurate sequence information, including expressed mRNA. Although high-throughput sequencing using the Illumina HiSeq platform has recently produced genome-wide transcriptome data with good sequencing depth and coverage [13], de novo transcriptome assembly using short reads have generated short and partial transcript contigs containing artifacts including chimeras, structural errors, incomplete assembly, and base errors, resulting in a high misassembly rate and unreliable gene annotation [42]. In these cases, it has been almost impossible to elucidate the function of genes associated with traits of interest as well as novel transcripts. To avoid such limitations in P. ginseng, we first collected a large portion of full-length ORF containing transcriptome data from four independent tissues (root, stem, leaf, and flower), which maximizes transcript diversity, and used the PacBio SMRT sequencing approach. As expected, a large amount of transcriptome data was generated, including 135,317 unigenes with 91.94% high assembly completeness that were much larger (0.79 to 1.9 kb) than reported previously [13,36,39]. Moreover, a reference for quantifying gene expression in the absence of reference genome information is likely to be very useful with high mapping rate ( Figure 2C and Figure S9). For this, genome-wide coverage of transcripts should be strengthened. Our data were not relatively well-covered for fruit tissue-related expressed genes ( Figure 2C) due to absence of RNA extraction from reproductive ginseng fruits. Moreover, considerable small size transcripts less than approximately 1 kb seem to be missed ( Figure S4). This problem appears to result from technical limitations of a PacBio sequencing platform associated with size selection in the construction of mRNA sequencing libraries. From the observations, Illumina RNA-Seq data are likely to be effective for the coverage of expressed genes, especially small size transcripts, in the genome, and PacBio Iso-Seq data are useful for the verification of gene prediction with full-length open reading frames (FL-ORF), thus suggesting combinatorial data analysis in de novo transcriptome assembly. In spite of such defect, Iso-Seq data increases the efficiency of functional gene prediction or annotation by the finding of FL-ORFs. In the result, 64,676 (47.8%) unigenes out of 135,317 isoforms identified were successfully annotated as known homologous genes using BLAST searches of the NR, TAIR, Swiss-Prot and InterProScan databases ( Figure 4). This implies that more than half of unigenes generated in this study were not annotated according to existing databases. This could be as a result of several reasons such as the absence of reference genomic information on the family Araliaceae and the unigenes without hits probably belonged to untranslated regions. It is also possible that they could contain non-coding RNA and short sequences absenting protein functional domains.
Since lacking genomic and transcriptomic information in the family Araliaceae, these unigenes without hits may be considered putative novel transcribed sequences. Therefore, according to these results, there is a need to generate a large collection of unigenes and further characterize the gene structures and expression patterns in P. ginseng. Furthermore, as previously reported [21], the PacBio isoform data could provide much longer transcript, isoform and gene structure information than the preexisted genome databases Full-length transcript sequence information with complete ORF structures is valuable for gene annotation and functional genomics in plants. FL information provides a more effective approach for discovering candidate genes involved in secondary metabolite biosynthesis. In our study, we identified 315 unigenes related to triterpenoid saponin biosynthesis and 233 unigenes predicted to harbor complete ORFs. Although previous studies also identified candidates related to triterpenoid saponin biosynthesis [13,43], most of the sequences described were partial transcripts from root tissues. Based on SMRT sequencing and analysis of the four different types of P. ginseng tissues, we have found the 16 genes with complete ORFs encoding key enzymes involved in ginsenoside backbone biosynthesis, including squalene epoxidase, dammarenediol-II synthase and farnesyl diphosphate synthase (Table 1). Therefore, these results strongly indicate that most of the genes involved in the synthesis of ginsenosides are contained within our transcriptome data. Accordingly, our results will provide a foundation for understanding the molecular mechanism of ginsenoside biosynthesis in P. ginseng plants.
As represented in Figure 4, a comparative analysis of transcription factor gene structure showed that when compared to Arabidopsis and rice, P. ginseng contained a higher proportion of transcription factors that control secondary metabolism, cell morphogenesis, and transmission of hormone and stress signals, including MYB, WRKY, C3H, B3, and HB. MYB and WRKY are involved within the synthesis of the secondary metabolites of ginseng [44,45]. C3H, B3, and HB are responsible for precisely controlling gene expression in Arabidopsis and rice following environmental changes [46,47]. Understanding the metabolic pathways for ginseng secondary metabolite synthesis and identifying the transcription factors controlling related genes have economic importance because the activity and quality of plant secondary metabolite synthetic pathways are directly linked. Overall, our transcriptome data generated using long-read sequencing exhibited the following features: (1) identification of much more portion of complete ORF structures; (2) a high level of gene annotation with efficient identification of genes associated with useful traits such as triterpenoid saponin and auxin/cytokinin signaling; (3) abundance of TEs with transcriptional activity in P. ginseng; and (4) identification of alternative splicing isoforms encoded by single genes. Moreover, when other RNA-Seq data were compared with our data, a high mapping rate was revealed (Figure 2), which will be a valuable transcriptomics resource. Therefore, our results will contribute to studies on gene function and will hasten the completion of a reference genome for P. ginseng.
Some unigenes in our dataset were exactly matched by comparing with the Sanger sequencing ( Figure 2B). These results indicated that the SMRT sequencing approach is reliable for generating high-quality, full-length sequence data from P. ginseng. Accordingly, we expect that these long-read transcripts will provide the accuracy of transcriptome characterization compared with transcript tags assembled from short RNA-seq reads. Additionally, our survey of plant hormonal signaling-related genes (Tables 2 and 3) and their expression patterns from upstream signal cues suggested that our annotations can predict the function of specific P. ginseng genes ( Figure 5 and Figure S8). Recent advances in molecular genetic studies for crop plants have revealed that key developmental and hormone-related genes are closely linked to diverse quantitative traits and play critical roles in domestication during breeding and artificial selection [48]. Interestingly, large number of transcription regulators of auxin and cytokinin signaling pathways is presented in P. ginseng transcriptome (Tables 2 and 3). Complex gene networks would be able to increase the cope of efficiency to diverse environmental changes during long terms of growth and developmental processes of this plant. Therefore, the functional identification of key genes and their signaling networks in P. ginseng will be helpful for developing biotechnological approaches to enhance the quality and consumer-friendly market of P. ginseng and related medicinal plants. Taken together, our Iso-Seq data will serve as a valuable resource for understanding the metabolic pathways related to triterpenoid saponin synthesis as well as growth and developmental signaling pathways. In addition, our results will improve the development of ginseng varieties with increased ginsenoside content and enhanced resistance to environmental stress.
Supplementary Materials: The following are available online at www.mdpi.com/2073-4425/8/9/228/s1. Figure S1: The distribution of read length of consensus isoforms. Figure S2: Summary of transcript assemblies along with sequence coverage and identity for clustering of all the libraries. Figure S3: The length distribution of unigenes and predicted ORFs. Figure S4: Mapping of unigenes (query) from Illumina RNA-Seq reported by Wang et al. (2016) [13] to unigenes (subject) from Iso-Seq. Figure S5: The length distribution of unigenes generated by Illumina RNA-Seq (green) and PacBio Iso-Seq (blue). Figure S6: Mapping of Illumina RNA-seq dataset 1 to the two unigene sets from Iso-Seq and Illumina RNA-Seq dataset 1 using STAR. Figure S7: Abundance of TEs in P. gineng unigenes. Figure S8: Identification of plant hormone signal components. Figure S9: Mapping of Illumina RNA-Seq dataset 1 reported by Wang et al. (2016) [13] to Iso-Seq unigene dataset as a reference. Table S1: Summary of subreads after removal of adapters and artifacts. Table S2: Classification of isoform clusters. Table S3: Summary of consensus isoforms. Table S4: List of protein-coding genes of P. notoginseng homologous to unigenes involved in triterpene saponin biosynthesis in P. ginseng. Table S5: Identification of alternative splicing (AS) events in unigenes of P. ginseng. Table S6: Primer lists for qRT-PCR.