Comparative Transcriptome Analysis Identifies Putative Genes Involved in Dioscin Biosynthesis in Dioscorea zingiberensis

Dioscorea zingiberensis is a perennial herb native to China. The rhizome of D. zingiberensis has long been used as a traditional Chinese medicine to treat rheumatic arthritis. Dioscin is the major bioactive ingredient conferring the medicinal property described in the Chinese pharmacopoeia. Several previous studies have suggested cholesterol as an intermediate in the biosynthesis of dioscin; however, the biosynthetic steps from cholesterol to dioscin remain unknown. In this study, a comprehensive D. zingiberensis transcriptome derived from its leaf and rhizome was constructed. Based on annotation against various public databases, all possible enzymes in the biosynthetic steps to cholesterol were identified. In the late steps beyond cholesterol, cholesterol undergoes site-specific oxidation by cytochrome P450s (CYPs) and glycosylation by UDP-glycosyltransferases (UGTs) to yield dioscin. From the D. zingiberensis transcriptome, a total of 485 unigenes were annotated as CYPs and 195 unigenes with a sequence length above 1000 bp were annotated as UGTs. Transcriptomic comparison revealed 165 CYP-annotated unigenes whose expression correlated with dioscin biosynthesis in the plant. Further phylogenetic analysis suggested that four of these CYP candidates are the most likely to be involved in the biosynthetic steps from cholesterol to dioscin. Additionally, among the UGT-annotated unigenes, six were annotated as 3-O-UGTs and two as rhamnosyltransferases; these constitute the potential UGT candidates involved in dioscin biosynthesis. To further explore the function of the UGT candidates, two 3-O-UGT candidates, named Dz3GT1 and Dz3GT2, were cloned and functionally characterized. Both Dz3GT1 and Dz3GT2 were able to catalyze C3-glucosylation of diosgenin. In conclusion, this study will facilitate our understanding of the dioscin biosynthesis pathway and provides a basis for further mining of the genes involved in dioscin biosynthesis.


Introduction
Dioscorea zingiberensis, commonly known as "yellow ginger" in China, is extensively cultivated as a traditional medicinal crop in the southern region of Shaanxi province in China. Extracts from D. zingiberensis rhizomes have been suggested in Chinese medicine for the treatment of chest pain, coronary heart disease, and hypoglycemic problems [1]. These medicinal effects are mostly attributed to the presence of steroidal saponins in its rhizome tissue [2], of which dioscin is the principal active component [3]. A variety of bioactivities, such as anti-tumor [4], anti-inflammatory [5], and bone-protecting [6] properties, have been shown for dioscin. More importantly, the aglycone of biosynthetic pathway through 2,3-oxidosqualene followed by cycloartenol synthase, and genes involved in the steps from cycloartenol to cholesterol have been discovered (Figure 1) [11]. Cholesterol can be transformed into dioscin through modifications that include oxidation at the C-22, C-26, and C-16 positions by enzymes such as P450-dependent monooxygenases, dioxygenases, and other catalysts, and subsequent addition of glucose and rhamnose groups to the C-3 position of diosgenin by UDP-glycosyltransferases (UGTs) (Figure 1). Although the specific steps in dioscin biosynthesis starting from cholesterol were previously proposed by Mehrafarin et al. [12], the proposed steps have never been experimentally validated, and genes involved in these steps have never been isolated.
Advances in RNA sequencing technology provide an efficient and economical platform for gene discovery in non-model plants without knowing their genome sequences [13]. Owing to the tissue specificity of the synthesis of many types of plant metabolites, transcript profiling together with metabolite analysis from different tissues could prove valuable for the identification of genes in different metabolic pathways [14][15][16]. We have confirmed that dioscin is highly synthesized in the underground rhizome tissue of D. zingiberensis, while in the aerial leaf tissue it is formed at very low concentrations (Figure 2). Thus, to identify specific genes putatively involved in the downstream steps of dioscin biosynthesis, a transcriptomic comparison between these two tissues was conducted and extensively analyzed in this study. Genes encoding several members of the cytochrome P450 family and the glycosyltransferase group, which could be putatively involved in the late steps of dioscin biosynthesis, have been identified. By heterologous expression in E. coli, two UGTs capable of glucosylating diosgenin at the C-3 position were functionally confirmed.

Figure 2 (caption, partial). (C) Content of diosgenin and dioscin in the different plant parts: leaves harvested in August (Aug_L), rhizomes harvested in August (Aug_R), and rhizomes harvested in November (Nov_R). Two biological replicates were included in this experiment. Different letters on the bars indicate a significant difference at p < 0.05 based on generalized linear models with Bonferroni multiple-comparison tests.

Generation of D. zingiberensis Transcriptomes from its Leaf and Rhizome Tissues
It is well known that rhizome tissue is the richest source of dioscin in D. zingiberensis. We confirmed this tissue specificity for dioscin, as well as for its aglycone diosgenin, by measuring their contents in leaf and newly formed rhizome tissues collected at an early stage, on 28 August 2015 (Figure 2A). Rhizomes (five months old) contained higher amounts of dioscin and diosgenin (3.50 mg g⁻¹ for dioscin and 5.84 mg g⁻¹ for diosgenin), whereas much less was detected in leaf tissue (five months old; 0.12 mg g⁻¹ for dioscin and 0.20 mg g⁻¹ for diosgenin) (Figure 2C). At the later stage (2 November 2015), most of the leaves had faded and fallen from the plants (Figure 2B), and in the rhizome tissue of this stage (eight months old) the accumulation of both dioscin and diosgenin was markedly lower than in the younger rhizomes (1.23 mg g⁻¹ of dioscin and 2.00 mg g⁻¹ of diosgenin) (Figure 2C).
Based on the metabolite accumulation pattern described above, the leaves (Aug_L) and newly formed rhizomes (Aug_R) harvested at the younger stage, and the newly formed rhizomes (Nov_R) harvested at the later stage, were chosen for comparative transcriptome analysis to understand the molecular mechanism underlying dioscin biosynthesis. Total RNA from each tissue sample was prepared and sent to Novogene Company (Beijing, China) for cDNA library construction and RNA-Seq analysis. The numbers of raw reads and clean reads, the Q20, Q30, and GC content of each sample, and the N50 are shown in Table 1. All the raw data have been submitted to the NCBI sequence read archive (SRA; http://www.ncbi.nlm.nih.gov/sra) under the accession numbers SRR6281651 for Aug_L, SRR6281650 for Aug_R, and SRR6281649 for Nov_R. The de novo assembly of all the clean reads was performed using Trinity software. A total of 176,406 transcripts were assembled, with a mean length of 715 bp and a range of 201 to 16,773 bp. Transcripts were further assembled into 143,245 unigenes, and the assembled unigenes were then searched against several public databases, including the NCBI non-redundant protein database (Nr), the NCBI nucleotide sequence database (Nt), the Swiss-Prot protein database, the Pfam database, the Clusters of Orthologous Groups of proteins database (COG), and Gene Ontology (GO). In total, 74,908 (52.29%) unigenes were successfully annotated in at least one of the above public databases (Table S1). According to the Nr annotation, 8229 (16.2%) unigenes had their top hits from Elaeis guineensis, followed by Phoenix dactylifera (6857, 13.5%), Musa acuminata (2692, 5.3%), Hordeum vulgare (2387, 4.7%), and Vitis vinifera (1117, 2.2%).
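The N50 reported alongside the read statistics is a length-weighted summary of an assembly. As a minimal sketch (not the pipeline used in this study), it can be computed from a list of transcript lengths as shown below. The function name `n50` and the toy lengths are illustrative assumptions, not data from Table 1.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bp).

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example with made-up lengths (bp); these are NOT values from the study.
toy_lengths = [201, 300, 450, 715, 1200, 2500]
print(n50(toy_lengths))
```

Because long transcripts dominate the cumulative sum, the N50 of an assembly is typically larger than its mean transcript length.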

CD-HIT and BlastP Analysis
Unigene sequences were clustered with CD-HIT (Cluster Database at High Identity with Tolerance) to reduce redundancy. The 143,245 unigenes were grouped into 143,095 clusters (Table S2). In the BlastP analysis, the amino acid sequences of 407 CYP unigenes showed homology (21.15% to 86.55% identity) to CYP450 proteins from Arabidopsis thaliana, while the amino acid sequences of 488 UGT unigenes showed 19.57% to 83.33% identity to UGT proteins from Arabidopsis thaliana. The results are shown in Table S3 and Table S4, respectively.

Identification of Genes Related to Steroidal Backbone Biosynthesis
Steroidal saponins are biosynthesized from C5 units, isopentenyl diphosphate (IPP), which is derived either from the cytosolic mevalonate pathway (MVA pathway) or from the plastidic methylerythritol phosphate pathway (MEP pathway). In the D. zingiberensis transcriptome of this study, eleven unigenes putatively code for the MVA pathway enzymes (Table 2), including three for AACT, one for HMGS, one for HMGR, two for MK, one for PMK, one for MVD, and two for IDI; seven unigenes were identified as the putative genes in the MEP pathway (Table 2), which included two for DXS, one for DXR, one for CMK, one for MECPS, one for HDS and one for HDR. Among the MVA pathway unigenes, the transcripts of the c46893_g1 and c102447_g1 coding for AACT, the c76083_g1 for MK, the c61824_g1 for HMGS, and the c74391_g1 for IDI were specifically expressed in the rhizomes (Aug_R) but not in the leaves (Aug_L), which is in accordance with the rhizome-specificity for dioscin biosynthesis in D. zingiberensis. On the other hand, all the MEP pathway backbone unigenes were expressed at much higher levels in the leaves (Aug_L) than in the rhizomes (Aug_R), and their expressions in both rhizome tissues (Aug_R and Nov_R) were essentially similar (Table 2), which does not match the metabolite accumulation pattern ( Figure 2C). These results indicate that dioscin biosynthesis is likely to be mainly from the MVA pathway.
In the Solanaceae plant family, especially in species producing steroidal glycoalkaloids (SGAs), two separate sets of pathway enzymes are responsible for the biosynthesis of cholesterol and phytosterols [11]. However, based on the bioinformatics analysis here, we propose that the same set of enzymes probably operates in the cholesterol and phytosterol pathways in D. zingiberensis. A similar case was previously suggested for another diosgenin-producing plant, Trigonella foenum-graecum [17]. The unigenes encoding all the known pathway enzymes from IPP to cholesterol were discovered in the D. zingiberensis transcriptome (Table 2), including two for GPPS, one for FPPS, one for SS, one for SE, one for CAS, two for SSR, one for SMO1, one for CYP51, one for C14-R, one for 8,7 SI, one for SMO2, one for C5-SD, and one for 7-DR. Except for the unigenes putatively coding for GPPS, FPPS, CAS, CPI, and 8,7 SI, all the other unigenes in the pathway were more highly expressed in the rhizomes (Aug_R) than in the leaves (Aug_L), and their transcript abundances were much higher in the younger rhizomes (Aug_R) than in the mature rhizomes (Nov_R) (Table 2). Taken together, except for the MEP pathway genes, which displayed higher expression in the leaves (Aug_L), most of the unigenes involved in steroidal saponin backbone biosynthesis showed the highest expression in Aug_R, followed by Nov_R and Aug_L (Table 2), which parallels the accumulation pattern of diosgenin and dioscin in the plants (Figure 2C). These data imply that the biosynthesis of the steroidal metabolites in D. zingiberensis is highly controlled by the transcriptional regulation of steroidal backbone biosynthesis genes.
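The expression ordering described above (highest in Aug_R, then Nov_R, then Aug_L) can be written as a simple filter on per-sample FPKM values. The following is a minimal sketch, assuming the expression data are held in a dictionary keyed by unigene ID; the function names, unigene IDs, and FPKM values are hypothetical and are not taken from Table 2.

```python
def follows_rhizome_pattern(fpkm):
    """Return True if FPKM(Aug_R) > FPKM(Nov_R) > FPKM(Aug_L)."""
    return fpkm["Aug_R"] > fpkm["Nov_R"] > fpkm["Aug_L"]


def select_candidates(expression_table):
    """Return the sorted IDs of unigenes whose expression follows the pattern."""
    return sorted(
        unigene
        for unigene, fpkm in expression_table.items()
        if follows_rhizome_pattern(fpkm)
    )


# Hypothetical unigene IDs and FPKM values, for illustration only.
toy_table = {
    "unigene_A": {"Aug_L": 0.5, "Aug_R": 42.0, "Nov_R": 10.3},
    "unigene_B": {"Aug_L": 30.1, "Aug_R": 4.2, "Nov_R": 3.9},
    "unigene_C": {"Aug_L": 2.0, "Aug_R": 8.0, "Nov_R": 9.5},
}
print(select_candidates(toy_table))
```

Only `unigene_A` passes in this toy table: `unigene_B` is leaf-biased, and `unigene_C` is higher in the older rhizome than in the younger one. A strict ordering of this kind ignores the magnitude of the differences, so in practice a fold-change threshold might be added.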

Identification of CYP450 Unigenes Putatively Involved in Diosgenin Biosynthesis
It has been suggested that diosgenin is biosynthesized from the steroidal skeleton compound cholesterol by a series of oxidative reactions at the C-22, C-26, and C-16 positions [18] (Figure 1). The hydroxylation of cholesterol is likely catalyzed by cytochrome P450 (CYP) enzymes. Through annotation using different databases, a total of 485 CYP-encoding unigenes were discovered in the D. zingiberensis transcriptome (Table S5), among which 165 CYP unigenes showed the highest expression level in Aug_R, followed by Nov_R and Aug_L (Table S6), coincident with the accumulation pattern of diosgenin in those tissue samples. These CYP unigenes, whose expression pattern correlated with diosgenin accumulation, were considered potential candidates for involvement in diosgenin biosynthesis. Among them, 22 CYP unigenes were putatively represented by full-length cDNAs (Table S6), of which the unigene c56203_g1 was annotated as a sterol 14α-demethylase. It is well known that sterol 14α-demethylase functions in the pathway steps leading to cholesterol formation [11]. Cholesterol was proposed to be the precursor for diosgenin biosynthesis, and thus the presence of a sterol 14α-demethylase-annotated unigene in our list of potential P450 candidates indicates that some of the other CYP candidates might play an important role in diosgenin biosynthesis. We performed phylogenetic analysis of these full-length CYP candidates together with well-characterized CYPs from various metabolic pathways, including those relevant to the biosynthesis of triterpenoids and steroidal saponins. The 2-oxoglutarate-dependent dioxygenases from potato (St16DOX) and tomato (Sl16DOX), which oxidize hydroxycholesterols at the C-16 position [19], were included in the alignments as the outgroup.
As shown in Figure 3, the unigene c65998_g1, which belongs to the CYP90B subfamily, grouped together with AtCYP90B1 from A. thaliana [20] and OsCYP90B2 and OsCYP724B1 from rice [21], all of which have been characterized as steroid C-22 hydroxylases in the brassinosteroid pathway, strongly suggesting that c65998_g1 might be the CYP candidate responsible for cholesterol C-22 oxidation in diosgenin biosynthesis. It should be noted that StCYP72A188 from potato, a CYP enzyme that also catalyzes sterol C-22 hydroxylation in steroidal glycoalkaloid (SGA) biosynthesis [22], fell in a clade distinct from the above-mentioned sterol C-22 hydroxylases (Figure 3). Also, among the full-length CYP candidates matching the accumulation of diosgenin (Table S6), no members of the CYP72A group were found. These data suggest that the C-22 hydroxylase in the SGA pathway might have evolved independently from the CYPs of the diosgenin or brassinosteroid pathways. The unigene c47099_g1, which belongs to the CYP505 family, showed a relatively close relationship to the sterol C-26 hydroxylases, which include SlCYP734A7 from Solanum lycopersicum [23], OsCYP734A1 from rice [24], and StCYP72A208 from potato [22], suggesting that c47099_g1 might encode the C-26 hydroxylase in diosgenin biosynthesis.
CYP enzymes displaying sterol C-16 hydroxylation activity have not yet been identified; however, a 2-oxoglutarate-dependent dioxygenase (designated St16DOX) has recently been reported to play a role in SGA biosynthesis in potato by catalyzing C-16 oxidation of hydroxycholesterols [19]. Its equivalent in tomato, designated Sl16DOX, has essentially the same biochemical activity as St16DOX [19]. We searched the D. zingiberensis transcriptome data for close homologs of St16DOX. However, the closest homolog, unigene c51094_g1, displayed only 36% amino acid identity to St16DOX while showing 77% identity to the Oryza sativa Japonica 2-oxoglutarate-Fe-dioxygenase. The unigene c51094_g1 was found specifically in the D. zingiberensis leaf tissue and was not observed in the rhizome (data not shown). It therefore cannot be ruled out that CYP enzymes catalyze the sterol C-16 oxidation in diosgenin biosynthesis. Phylogenetic analysis of the unigene c63041_g1 revealed a relatively close relationship to genes encoding the sterol C-22 hydroxylases and a β-amyrin (a triterpenoid) C11-oxidase [25]. The c63041_g1 sequence shows similarity to genes encoding enzymes of the CYP88D group, some members of which have already been reported to participate in the biosynthesis of SGA [26] or triterpenoids [25]. Thus, it will be a priority to test whether the unigene c63041_g1 participates in diosgenin biosynthesis as a C-16 oxidase. The rest of the CYP unigenes were grouped into many other P450 subfamilies (i.e., CYP76, CYP75A, CYP71D55, CYP93E, CYP78A5,

Identification of UGT Candidates Involved in Dioscin Biosynthesis
The biosynthesis of dioscin requires the addition of glucose and rhamnose groups onto the C-3 hydroxyl group, which is catalyzed by UGTs (Figure 1). In the present study, 195 unigenes encoding UGTs with a sequence length above 1000 bp were obtained (Table S4). Inspection of these UGT unigenes revealed six unigenes (c73651_g1, c70118_g1, c61522_g1, c66977_g1, c60958_g1, and c59129_g1) annotated as sterol 3-O-UGTs and two unigenes (c60525_g2 and c69762_g2) annotated as UDP-rhamnose: rhamnosyltransferases (Table S7). These UGT unigenes are possible candidates for dioscin biosynthesis. However, the expression trend of all the identified UGTs (Table S7) did not match the distribution pattern of dioscin in the tissue samples (Figure 2C). A similar case has been reported for Panax japonicus UGTs, whose expression pattern across various tissues differs from the tissue specificity of triterpenoid saponin biosynthesis in that species [27]. These data indicate that the glycosylation reactions governed by UGTs might not influence the tissue specificity of saponin synthesis in some plant species.
For biochemical characterization of the identified UGTs, it is necessary to know their full open reading frames (ORFs). Unfortunately, only partial transcripts of the two rhamnosyltransferase candidates were present in the transcriptome. Among the identified sterol 3-O-UGTs, the unigenes c60958_g1 and c59129_g1 were expressed at very low levels in all the tissue samples; moreover, they were not full-length and thus were not subjected to further analysis. Of the remaining 3-O-UGTs, only c61522_g1 and c66977_g1 contain complete ORFs. The gene products of c61522_g1 and c66977_g1 were designated Dz3GT1 (NCBI accession no. MG488289) and Dz3GT2 (NCBI accession no. MG488290), respectively, in this study. Dz3GT2 shows only 56% amino acid identity to Dz3GT1 but displays 99.8% identity to the previously identified DzS3GT from D. zingiberensis, an enzyme with C3-glucosylating activity on diosgenin [28]. To further characterize the biochemical properties of Dz3GT1 and Dz3GT2, their ORFs were expressed in E. coli cells, and the recombinant UGTs were purified by means of their N-terminal fused tags. The recombinant proteins were assayed with diosgenin and cholesterol, followed by HPLC analysis. Both Dz3GT1 and Dz3GT2 were able to convert diosgenin to a new product (peak 1), which corresponds to the authentic standard of trillin (diosgenin 3-O-glucoside) (Figure 4A), suggesting that both have 3-O-glucosylating activity on diosgenin. When cholesterol was used as a substrate, a new product (peak 2) was observed in the assays with both D. zingiberensis UGTs (Figure 4B). Based on the C3-glucosylating activities of the Dz3GTs described above, we assumed peak 2 to be cholesterol 3-O-glucoside, although a cholesterol 3-O-glucoside standard is not commercially available at present. Although both Dz3GT1 and Dz3GT2 catalyze C3-O-glucosylation of diosgenin in vitro, and thus appear functionally redundant, the importance of Dz3GT1 relative to Dz3GT2 in dioscin formation is not clear.
Further characterization of Dz3GT1 and Dz3GT2, by silencing their expression in D. zingiberensis, is required to confirm their roles in dioscin formation.
Figure 4 (caption, partial). CK means the control reaction. For the controls, the purified Dz3GTs were excluded from the reaction mixtures. Both Dz3GT1 and Dz3GT2 were able to convert diosgenin to one new product (peak 1), corresponding to trillin (diosgenin 3-O-glucoside), which was not produced in the controls. When cholesterol was used as the substrate, a new product (peak 2) was produced by both UDP-glycosyltransferase (UGT) proteins but was not observed in the control assays. Peak 2 was assumed to be cholesterol 3-O-glucoside based on the C3-glucosylating activities of the Dz3GTs.

Plant Material
Wild D. zingiberensis plants were collected from Zhuxi County, Hubei province of China. On 12 March 2015, healthy rhizomes were planted in a field managed by the Wuhan Botanical Garden of the Chinese Academy of Sciences. After about five months of growth, samples of the newly formed rhizome and young leaf tissues were harvested on 28 August 2015 to identify genes differentially transcribed between the two tissues. To monitor the change in gene expression in mature (eight-month-old) rhizomes relative to the younger (five-month-old) ones, the newly formed rhizomes at a late developmental stage were also harvested on 2 November 2015. It should be noted that we did not collect leaf tissue at this later stage because at that age most of the leaves had faded and fallen from the plants. All the harvested plant materials were cleaned, flash frozen in liquid nitrogen, and stored at −80 °C until use.

Transcriptome Sequencing, De Novo Assembly and Unigene Functional Annotation
Total RNA was isolated from the rhizome and leaf tissues using a plant RNA Prep Kit (Tiangen Biotech, Beijing, China) following the product manual. Each tissue was sampled from two individual plants, and after quality determination the RNA from the two biological repeats was mixed in equal amounts into a single pool for RNA sequencing. The RNA samples were sent to Novogene Company (Beijing, China), where the cDNA libraries were constructed and RNA sequencing was performed on an Illumina HiSeq2000 platform. Raw reads were cleaned by removing adapter-containing reads and low-quality reads. To monitor sequencing quality, the Q20 and Q30 values of the clean data were determined. Clean reads were de novo assembled by Trinity software with default parameters [29]. Gene function was annotated by BLASTx (E-value < 1e-5) searches against the Nr, Nt, KOG/COG, Swiss-Prot, KO (KEGG Ortholog), Pfam, and GO databases. To calculate gene expression levels, clean reads were mapped back to the assembled transcriptome to obtain read counts for each unigene, and gene expression levels were normalized to Fragments Per kb per Million fragments (FPKM) [30].
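For readers unfamiliar with the normalization, FPKM scales a raw fragment count by the unigene length (in kilobases) and by the sequencing depth (in millions of mapped fragments). The following is a minimal sketch of the standard formula, not the specific software used in this study; the function name and the numbers in the example are assumptions chosen for illustration.

```python
def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments.

    FPKM = fragments * 1e9 / (length in bp * total mapped fragments)
    """
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total fragments must be positive")
    return fragment_count * 1e9 / (transcript_length_bp * total_mapped_fragments)


# Made-up example: 500 fragments on a 2,000 bp unigene,
# in a library with 20 million mapped fragments.
print(fpkm(500, 2000, 20_000_000))
```

Because the count is divided by both length and depth, FPKM values can be compared between unigenes of different lengths and between libraries of different sizes, which is what allows the Aug_L, Aug_R, and Nov_R samples to be compared.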

CD-HIT Analysis
Cd-hit-est was used to cluster the unigene sequences to reduce redundancy (http://weizhongli-lab.org/cdhit_suite/cgi-bin/index.cgi). The sequence identity cut-off was set to 0.9.

Phylogenetic Analysis
The open reading frames of the identified CYP450s were predicted using the Translate tool (http://www.expasy.ch/tools/dna.html/). Amino acid sequences were aligned using the ClustalW program (http://www.ebi.ac.uk/clustalW/). A phylogenetic tree was constructed by the neighbor-joining method using MEGA 5 software [31].

Plant Metabolite Extraction
To measure diosgenin content, plant tissue was ground into fine powder in liquid nitrogen and extracted with methanol under sonication (180 W, 40 kHz, 30 °C, 20 min). The methanol extracts were dried in a vacuum concentrator and hydrolyzed with 1.5 M sulfuric acid for 4 h at 100 °C. After acid hydrolysis, the residue was extracted with petroleum ether and washed to neutral pH. The petroleum ether extracts were evaporated to dryness and re-dissolved in methanol prior to HPLC analysis. To measure dioscin content, powdered tissue was extracted with methanol, and the methanol extracts were dried in a vacuum concentrator and re-dissolved in methanol for HPLC analysis.

Recombinant Protein Expression in E. coli and Purification
The full-length Dz3GT2 was amplified and sub-cloned into the pET28a vector (Novagen, Darmstadt, Germany) to give an in-frame N-terminal fusion with a His tag. The Dz3GT2 expression vector was then transformed into BL21 (DE3) E. coli competent cells. Dz3GT1 was initially cloned into the pET28a vector as a His-tag fusion, but with this construct the recombinant Dz3GT1 was expressed only in inclusion bodies and not in a soluble form. To obtain soluble Dz3GT1, the full-length Dz3GT1 was then sub-cloned into the pGEX-2T vector, which encodes an N-terminal GST-tagged fusion protein of Dz3GT1. The Dz3GT1 expression vector was transformed into BL21 E. coli competent cells. The transformed E. coli cells were grown in Luria-Bertani (LB) medium, and protein expression was induced by adding 1 mM isopropyl-1-thio-β-D-galactopyranoside (IPTG) to the medium. Following the manufacturer's protocol, the soluble recombinant Dz3GT2 was purified using a Nickel affinity HisTrap™ column (GE Healthcare), while Dz3GT1 was purified using a glutathione sepharose 4B kit (GE Healthcare).

Enzyme Assays
The activity of the recombinant UGT was tested in a 200 µL reaction mixture containing 50 mM Tris-HCl buffer (pH 8.0), 200 µM substrate (sugar acceptor), 2 mM dithiothreitol (DTT), 2 mM UDP-glucose, and 3 µg of the purified UGT. Controls were performed by omitting the purified UGT from the reaction mixture. After incubating the reaction mixture overnight at 30 °C, the reaction was stopped with 200 µL methanol and the reaction products were directly subjected to HPLC analysis.
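The reagent amounts implied by the stated concentrations can be checked with a short calculation. The sketch below converts the acceptor and UDP-glucose concentrations in the 200 µL reaction into absolute amounts; the helper name `amount_nmol` is an illustrative assumption, and only concentrations given in the text are used.

```python
def amount_nmol(concentration_uM, volume_uL):
    """Convert a concentration (µM) and a volume (µL) to an amount in nmol.

    µM x µL gives pmol, so the product is divided by 1000 to obtain nmol.
    """
    return concentration_uM * volume_uL / 1000.0


reaction_volume_uL = 200

# Sugar acceptor (diosgenin or cholesterol) at 200 µM.
acceptor_nmol = amount_nmol(200, reaction_volume_uL)

# UDP-glucose (sugar donor) at 2 mM, i.e. 2000 µM.
donor_nmol = amount_nmol(2000, reaction_volume_uL)

print(acceptor_nmol, donor_nmol, donor_nmol / acceptor_nmol)
```

Each reaction therefore contains 40 nmol of acceptor and 400 nmol of UDP-glucose, a tenfold molar excess of the sugar donor.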

HPLC (High Performance Liquid Chromatography) Analysis
HPLC analysis was performed on an LC-20AT instrument (Shimadzu, Kyoto, Japan) using an Inertsil ODS-SP reverse-phase column (250 mm × 4.6 mm, 5 µm) at 30 °C. The monitoring wavelength was set to 203 nm. Milli-Q water (solvent A) and HPLC-grade methanol (solvent B) were used as the mobile phase. To measure diosgenin content in the plant materials or to analyze the products of the in vitro assays with diosgenin, the samples were separated using 90% B for 30 min at a flow rate of 1 mL/min. To detect the products of the enzyme assays with cholesterol, the samples were eluted in 98% B. To measure dioscin content in the plant materials, the separation was achieved with Milli-Q water (solvent A) and HPLC-grade acetonitrile (solvent B) as the mobile phase at a flow rate of 0.8 mL/min, and the solvent gradient was set as follows: 0-20 min, 10-90% B; 20-30 min, 90-10% B; 30-40 min, 10% B.
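The acetonitrile gradient used for dioscin can be expressed as a piecewise function of time. The sketch below assumes linear ramps between the stated time points, which is the usual behavior of a gradient program but is not stated explicitly in the text; the names `GRADIENT` and `percent_b` are illustrative.

```python
# (time in min, % solvent B) breakpoints taken from the stated program:
# 0-20 min, 10-90% B; 20-30 min, 90-10% B; 30-40 min, 10% B.
GRADIENT = [(0.0, 10.0), (20.0, 90.0), (30.0, 10.0), (40.0, 10.0)]


def percent_b(time_min, program=GRADIENT):
    """Return % solvent B at a given time, assuming linear interpolation."""
    if not program[0][0] <= time_min <= program[-1][0]:
        raise ValueError("time is outside the gradient program")
    for (t0, b0), (t1, b1) in zip(program, program[1:]):
        if t0 <= time_min <= t1:
            return b0 + (b1 - b0) * (time_min - t0) / (t1 - t0)


for t in (0, 10, 20, 25, 35):
    print(t, percent_b(t))
```

Under this assumption, solvent B rises by 4 percentage points per minute during the first 20 min, falls by 8 points per minute over the next 10 min, and is then held at 10% to re-equilibrate the column.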
Supplementary Materials: The following are available online. Table S1: Annotation of all the D. zingiberensis unigenes based on public databases. Table S2: CD-HIT analysis of unigenes from D. zingiberensis transcriptome. Table S3: Blastp analysis of CYP amino acid sequences from D. zingiberensis with the CYP450 protein database from Arabidopsis thaliana. Table S4: Blastp analysis of UGT amino acid sequences from D. zingiberensis with the UGT protein database from Arabidopsis thaliana. Table S5: Annotation of all the putative cytochrome P450 unigenes from D. zingiberensis transcriptome. Table S6: Putative cytochrome P450 unigenes with the expression pattern in the order of FPKM(Aug_R) > FPKM(Nov_R) > FPKM(Aug_L). Table S7: Putative UDP-glycosyltransferase (UGT) unigenes with a length above 1000 bp from D. zingiberensis transcriptome. Figure S1: The UV absorption spectrum of the peak 2 in Figure 4.