Identifying Terpenoid Biosynthesis Genes in Euphorbia maculata via Full-Length cDNA Sequencing

The annual herb Euphorbia maculata L. produces anti-inflammatory and biologically active substances such as triterpenoids, tannins, and polyphenols, and it is used in traditional Chinese medicine. Of these bioactive compounds, terpenoids, also called isoprenoids, are major secondary metabolites in E. maculata. Full-length cDNA sequencing was carried out to characterize the transcripts of terpenoid biosynthesis reference genes and determine the copy numbers of their isoforms using PacBio SMRT sequencing technology. The Illumina short-read sequencing platform was also employed to identify differentially expressed genes (DEGs) in the secondary metabolite pathways from leaves, roots, and stems. PacBio generated 62 million polymerase reads, resulting in 81,433 high-quality reads. From these high-quality reads, we reconstructed a genome of 20,722 genes, in which 20,246 genes (97.8%) did not have paralogs. About 33% of the identified genes had two or more isoforms. DEG analysis revealed that the expression level differed among gene paralogs in the leaf, stem, and root. Whole sets of paralogs and isoforms were identified in the mevalonic acid (MVA), methylerythritol phosphate (MEP), and terpenoid biosynthesis pathways in the E. maculata L. The nucleotide information will be useful for identifying orthologous genes in other terpenoid-producing medicinal plants.


Introduction
Euphorbia (Euphorbiaceae) is a genus of flowering plants with about 2000 species that is subdivided into many subgenera and sections [1,2]. Distributed worldwide from desert to temperate zones, Euphorbia species range from tiny annuals to large and long-lived trees (https://www.finegardening.com/genus/euphorbia; accessed on 1 December 2021). Many Euphorbia species are used in traditional Chinese, Japanese, and Korean medicine [3]. Shi et al. (2008) surveyed biomolecules in Euphorbia and identified 535 molecules among the terpenoids, steroids, phenolic compounds, and flavonoids [2]. Their biological activities include cytotoxicity, effects on cell division, DNA damage, tumor promotion, and antimicrobial activity [3,4]. E. maculata L., commonly called spotted spurge, is an annual herb native to North America but grows worldwide. Although the sap from E. maculata may

Illumina RNA-Seq Library Construction and Sequencing
Total RNA was purified from leaves, stem, and roots using the RiboPure Kit (Applied Biosynthesis, Foster City, CA, USA). DNase1 (Sigma, St. Louis, MO, USA) was used for residual DNA digestion, and the total RNA was quantified using a NanoDrop 2000 spectrophotometer (Thermo Scientific, Wilmington, DE, USA). Paired-end sequencing was performed with a Nova Seq platform (Illumina, San Diego, CA, USA) at the professional sequencing provider Theragen Bio Co., Ltd. (Seongnam, South Korea). The quality of the constructed libraries was checked by a LabChip GX system (PerkinElmer, Waltham, MA, USA).

Full-Length cDNA Sequencing
Total RNAs from the three tissues (leaf, root, and stem) were pooled, and RNA quality was checked (Agilent Technologies, Santa Clara, CA, USA). The cDNA size selection was performed with a BluePippin system (Sage Science, Beverly, MA, USA) to build two cDNA libraries of ≤4 and ≥4 kb. Iso-Seq library preparation and sequencing were carried out using the PacBio full-length cDNA library and sequencing kit according to the manufacturer's protocol (Pacific Biosciences Inc., San Diego, CA, USA) at the sequencing service provider Theragen Bio Co., Ltd. (Seongnam, South Korea).

De Novo Assembly and Iso-Seq Data Analysis Using a Bioinformatics Pipeline
PacBio raw sequencing reads were processed via the standard Iso-Seq protocol in SMRTlink 4.0 software. Polymerase reads shorter than 50 bp were removed, and the subread BAM files were set to error-corrected circular consensus sequences (CCSs) using the following parameters: full passes ≥0 and predicted consensus accuracy >0.75. Fulllength (5 -and 3 -adapters and the poly-A tail) and non-full-length reads (CCSs with all 5 -and 3 -reads) were clustered into consensus sequences using the Iterative Clustering for Error Correction (ICE) algorithm (https://www.pacb.com/products-and-services/ analytical-software; accessed on 1 April 2022). These reads were further combined with non-full-length transcripts and polished in clusters by Quiver [30].

Full-Length Unique Transcript Model Reconstruction
Error-corrected, high-quality (HQ) and low-quality (LQ), full-length, polished consensus transcripts were combined to remove redundancy using the CD-HITv4.6 package with the parameters -c 0.99 -G 0 -aL 0.00 -aS 0.99 -AS 30 -M 0 -d 0 -p 1 [31]. The non-redundant transcripts were processed with the Coding GENome reconstruction Tool (Cogent v7.0.0, https://github.com/Magdoll/Cogent; accessed on 1 April 2022). Cogent creates the k-mer profile of non-redundant transcripts, computes pairwise distance, and clusters the transcripts into families based on their k-mer similarity. Each transcript family was further reconstructed into one or several unique transcript models (referred to as UniTransModels) using a De Bruijn graph method.

Isoform and Paralog Identification
Error-corrected, non-redundant transcripts (transcripts before Cogent reconstruction) were mapped to UniTransModels using Minimap2 v2.6 (Li 2018). Splicing junctions for transcripts mapped to the same UniTransModels were examined, and transcripts with the same splicing junctions were collapsed using Cupcake ToFU v13.0.0 [25]. Collapsed transcripts with different splicing junctions were identified as transcription isoforms of UniTransModels. Paralogs were analyzed by the BLASTclust program with the unigene sequences (https://www.ncbi.nlm.nih.gov/Web/Newsltr/Spring04/blastlab.html; accessed on 1 April 2022) with a score coverage threshold of 1.75 and a length coverage threshold of 0.9.

Functional Annotation
Functional annotations were obtained by mapping sequences into several databases. Non-redundant protein sequences (Nr) and non-redundant nucleotide sequences (Nt) were compared against the NCBI database by BLAST v2.10.1 with an E-value cut-off of 1 × 10 −5 . Gene Ontology (GO) analyses were carried out by BLAST2GO v5.2.5 (bioinformatics software) with an E-value cut-off of 1 × 10 −5 . Figure 1 shows the genomics and bioinformatics pipeline used in this study. transcripts with the same splicing junctions were collapsed using Cupcake ToFU v13.0.0 [25]. Collapsed transcripts with different splicing junctions were identified as transcription isoforms of UniTransModels. Paralogs were analyzed by the BLASTclust program with the unigene sequences (https://www.ncbi.nlm.nih.gov/Web/Newsltr/Spring04/blastlab.html; accessed on 1 April 2022) with a score coverage threshold of 1.75 and a length coverage threshold of 0.9.

Functional Annotation
Functional annotations were obtained by mapping sequences into several databases. Non-redundant protein sequences (Nr) and non-redundant nucleotide sequences (Nt) were compared against the NCBI database by BLAST v2.10.1 with an E-value cut-off of 1 × 10 −5 . Gene Ontology (GO) analyses were carried out by BLAST2GO v5.2.5 (bioinformatics software) with an E-value cut-off of 1 × 10 −5 . Figure 1 shows the genomics and bioinformatics pipeline used in this study.

Differential Gene Expression Analysis
Illumina reads were aligned using Bowtie 2 v2.4.2 [32]. The read count values were directly obtained and converted to fragments per kilobase of transcript per million mapped reads (FPKM) values using RSEM (v1.1.12) [33]. Then, the DEGs between different tissue samples (leaf vs. stem, leaf vs. root, and stem vs. root) were detected with the standardization trimmed mean of M values (TMM) normalization method using edge R [34]. The significant DEGs were screened at false discovery rates (FDRs) < 0.05 and fold change of 2 as a cut-off.

Differential Gene Expression Analysis
Illumina reads were aligned using Bowtie 2 v2.4.2 [32]. The read count values were directly obtained and converted to fragments per kilobase of transcript per million mapped reads (FPKM) values using RSEM (v1.1.12) [33]. Then, the DEGs between different tissue samples (leaf vs. stem, leaf vs. root, and stem vs. root) were detected with the standardization trimmed mean of M values (TMM) normalization method using edge R [34]. The significant DEGs were screened at false discovery rates (FDRs) < 0.05 and fold change of 2 as a cut-off.

E. maculata Transcriptome Analysis Using PacBio Iso-Seq
We clustered raw sequencing reads from the full-length cDNA libraries into consensus transcripts using the TOFU pipeline (GitHub version) supported by PacBio (Table 1). We obtained approximately 62 million polymerase reads with an average length of 56,777 bp in the ≤4 kb library and 51,584 bp in the ≥4 kb library. We obtained 467,479 CCSs with an average length of 2471 bp and a CCS read score of 0.989 in the ≤4 kb library and 465,085 CCSs with an average length of 4040 bp and a read score of 0.983 in the ≥4 kb library. Using the standard Iso-Seq protocol for transcript clustering, we obtained 47,860 highquality (HQ) isoforms and 405 low-quality (LQ) isoforms in the ≤4 kb library and 33,573 HQ and 993 LQ isoforms in the ≥4 kb library (Table 1). Then, we processed 81,433 HQ transcripts with the COding GENome reconstruction Tool (Cogent v7.0.0, https://github. com/Magdoll/Cogent; accessed on 1 April 2022) to develop a fake genome of 20,722 reads (containing 18,481 reconstructed contigs and 2241 unassigned sequences). The fake genome was then used as a reference to map the HQ transcripts, which produced 20,172 isoforms ( Figure 1, Table 2). The transcript length showed a normal distribution with the greatest number of transcripts in the 2000-2999-bp range ( Figure 2).

Number of Paralogs
Number of Transcripts
In the functional classification, we assigned Gene Ontology (GO) terms to each of the UniTransModels via the BLAST2GO program based on the annotation of the Nr database. Overall, 16,652 (82.55%) unigenes were classified into three major categories: 'biological process', 'molecular function', and 'cellular component' (Figure 4). Genes in the biological process category primarily fell into seven major subgroups with over 10,000 transcripts: cellular process (GO: 00099871), metabolic process (GO: 0008152), response to stimulus (GO: 0050896), biological regulation (GO: 0065007), regulation of biological process (GO: 0050789), developmental process (GO: 0032502), and multi-multicellular organism process (GO: 0044706). In the molecular function category, two subgroups, binding (GO: 0005488) and catalytic activity (GO: 0003824), were predominant. Genes fell mainly into two subgroups in the cellular component category: cellular anatomical entity (GO: 0110165) and protein-containing complex (GO: 0032991). : DOXP had three paralogs: DOXP.para1, DOXP.para2, and DOXP.para3. DOXP.para1 had three isoforms with different translation termination sites. DOXP.para3 had two isoforms due to alternative splicing and differences in translation initiation and termination sites. (B): PB84.1 is a tRNA ligase gene. It had no paralogs, but 10 isoforms, which differed by alternative splicing and different translation initiation and termination sites.

Functional Annotation
Of the 20,172 unigenes, 19,190  In the functional classification, we assigned Gene Ontology (GO) terms to each of the UniTransModels via the BLAST2GO program based on the annotation of the Nr database. Overall, 16,652 (82.55%) unigenes were classified into three major categories: 'biological process', 'molecular function', and 'cellular component' (Figure 4). Genes in the biological process category primarily fell into seven major subgroups with over 10,000 transcripts: cellular process (GO: 00099871), metabolic process (GO: 0008152), response to stimulus (GO: 0050896), biological regulation (GO: 0065007), regulation of biological process (GO: 0050789), developmental process (GO: 0032502), and multi-multicellular organism process (GO: 0044706). In the molecular function category, two subgroups, binding (GO: 0005488) and catalytic activity (GO: 0003824), were predominant. Genes fell mainly into two subgroups in the cellular component category: cellular anatomical entity (GO: 0110165) and protein-containing complex (GO: 0032991).

Gene Expression Analysis across Different Tissues
We analyzed the DEGs in leaf, root, and stem tissues by mapping the Illumina sequencing reads to the Pac-Bio unigene reference sequences ( Table 4). The percent mapped paired-end reads to unigene reference sequences was 70.9, 60, and 64.8 in the leaf, root, and stem, respectively. The number of expressed genes was 17,735, 17,260, and 18,008 in the leaf, root, and stem, respectively. Of the 20,172 unigenes, 16,477 (81.7%) were expressed constitutively among the three organs. There were 295 organ-specific genes in the root, 300 in the leaf, and 395 in the stem ( Figure 5). The number of DEGs with more than a two-fold difference in expression was distinct among the three organs. We identified more upregulated genes in the root than in the shoot or stem (Table 4). Figure  6 shows the GO analysis of the organ-specific genes. In the biological process category, the proportion of genes involved in metabolic processes was higher in the aboveground organs (leaf and stem) than in the root. However, the distribution of genes in the molecular function and cellular process categories was similar among the three organs.

Gene Expression Analysis across Different Tissues
We analyzed the DEGs in leaf, root, and stem tissues by mapping the Illumina sequencing reads to the Pac-Bio unigene reference sequences ( Table 4). The percent mapped paired-end reads to unigene reference sequences was 70.9, 60, and 64.8 in the leaf, root, and stem, respectively. The number of expressed genes was 17,735, 17,260, and 18,008 in the leaf, root, and stem, respectively. Of the 20,172 unigenes, 16,477 (81.7%) were expressed constitutively among the three organs. There were 295 organ-specific genes in the root, 300 in the leaf, and 395 in the stem ( Figure 5). The number of DEGs with more than a two-fold difference in expression was distinct among the three organs. We identified more upregulated genes in the root than in the shoot or stem (Table 4). Figure 6 shows the GO analysis of the organ-specific genes. In the biological process category, the proportion of genes involved in metabolic processes was higher in the aboveground organs (leaf and stem) than in the root. However, the distribution of genes in the molecular function and cellular process categories was similar among the three organs.

Terpenoid Biosynthesis Pathway Genes
We identified all genes in the MVA, MEP, and terpenoid biosynthesis pathways (Table 5; Figure 7). The nucleotide sequences of paralogous genes and isoforms in these

Terpenoid Biosynthesis Pathway Genes
We identified all genes in the MVA, MEP, and terpenoid biosynthesis pathways (Table 5; Figure 7). The nucleotide sequences of paralogous genes and isoforms in these pathways are listed in Supplementary Table S1. In the MVA pathway, six genes encode the enzymes involved in IPP biosynthesis, with one (AAC thiolase and MVA kinase) to five (HMG-CoA reductase) paralogs per gene and one to three isoforms of each paralog. The first reaction in the MEP pathway is the condensation of pyruvate with glyceraldehyde 3-phosphate to form DOXP by DOXP synthase. The DOXP synthase gene had two paralogs and one and three isoforms of each paralog. There are five genes involved in the conversion of DOXP to 1-hydroxyl-2-methyl-2(E)-butenyl-4-diphosphate (HMBPP), which had one (CDP-ME synthase) to five (HMG-CoA reductase) paralogs and one to three isoforms of each paralog. HMBPP is reduced to dimethylallyl diphosphate or IPP by IPP/DMAPP synthase, which has two paralogous genes with only one isoform each. IPP and DMAPP are isomers that are interconverted by IDI. IDI has two paralogous genes with one and three isoforms. IPP undergoes head-to-tail dimerization to form geranyl diphosphate (GPP) by GPP synthase, which has two paralogous genes with a single isoform each. GPP is converted to monoterpenes by monoterpene synthase, which has two paralogous genes with a single isoform each. GPP is also converted to farnesyl diphosphate (FPP) by farnesyl synthase, which is encoded by a single-copy gene with two isoforms. FPP is processed into sesquiterpenoids or squalene by sesquiterpene synthase or squalene synthase, respectively. Squalene is further processed to triterpenoid by triterpene synthase, which is encoded by three paralogous genes with a single isoform each. Geranylgeranyl diphosphate (GGPP) is converted into diterpenes by diterpene synthase, which is annotated as ent-kaurene synthase. Ent-kaurene synthase is encoded by a single-copy gene with one isoform.  In a single gene, different paralogs had different numbers of isoforms as exemplified by the DOXP gene in Figure 3. DOXP.para1 had three isoforms with different termination sites, and DOXP.para3 had two isoforms with different starting and termination sites, as well as different exons. The expression of the paralogs differed among the tissues ( Figure  7). For instance, of the five paralogs of the gene encoding HMG-CoA reductase in IPP biosynthesis, PB.10074 had the highest expression in the leaf and the lowest expression in the root, but PB.10076 had the opposite expression pattern. Supplementary Table S1 shows the sequence information of all the genes involved in the terpenoid synthesis in E. maculata. In a single gene, different paralogs had different numbers of isoforms as exemplified by the DOXP gene in Figure 3. DOXP.para1 had three isoforms with different termination sites, and DOXP.para3 had two isoforms with different starting and termination sites, as well as different exons. The expression of the paralogs differed among the tissues (Figure 7). For instance, of the five paralogs of the gene encoding HMG-CoA reductase in IPP biosynthesis, PB.10074 had the highest expression in the leaf and the lowest expression in the root, but PB.10076 had the opposite expression pattern. Supplementary Table S1 shows the sequence information of all the genes involved in the terpenoid synthesis in E. maculata.

Discussion
NGS technologies have revolutionized many areas of genetics. Transcriptomics captures a snapshot of the total transcripts in a cell at a specific time and is used to quantify gene expression profiles during development [19,35]. High-throughput short RNA-Seq analysis was used to identify the genes involved in the biosynthesis of phytochemicals in medicinal plants [25,27,36]. Here, we used transcriptome profiling to analyze the genes involved in terpenoid biosynthesis in the medicinal plant E. maculata L., which is used in folk medicine in oriental countries [4]. Terpenoids are major secondary metabolites in E. maculata that have pharmacological benefits including anti-inflammation, antioxidant, antitumor, hepatoprotection, and anti-HIV protease activity [4,5,7,37].
The E. maculata genome has not been sequenced; therefore, we obtained transcriptome sequences from PacBio SMRT full-length cDNA sequencing. We obtained 20,172 full-length unigenes, which is similar to that obtained in Berberis koreana (23,246) by PacBio SMRT sequencing [27]. Although full-length unigenes may not accurately represent the number of genes in a species, the number of genes in E. maculata may be low compared to other plant species. Gene numbers in plants range from 20,000 to 124,000. The small genome of Arabidopsis thaliana encodes 26,000 genes [38]. We previously reported an Illumina NovaSeqderived transcriptome of Euphorbia jolkini having 123,215 assembled transcripts [27]. In our functional annotation of E. maculata genes, 19,190 (92.6%) and 19,407 (93.65%) matched with the Nt and Nr databases in NCBI, respectively, indicating that the function of most of the transcripts is known and only about 7% of the transcripts have not been annotated. The top three species BLAST-matching with E. maculata transcripts were the Pará rubber tree (Hevea brasiliensis), castor bean (Ricinus communis), and cassava (Manihot esculenta), all in Euphorbiaceae. These plants produce a milky latex containing terpenes [18,39,40]. The high match to these species may be because they have well-characterized transcriptome data due to their economic importance, as reported in the Pará rubber tree [41,42], castor bean [43], and cassava [44,45]. GO allows the comparison and functional classification of genes and their products across species (http://www.geneontology.org/; accessed on 1 April 2022) and covers three domains: cellular components, molecular functions, and biological processes. In our E. maculata transcriptomes, the distribution of genes in the different functional categories was similar to that of other medicinal plants [26,27,46].
PacBio SMRT sequencing is a third-generation sequencing system that allows the identification of isoforms [20,47]. Paralogs are homologous genes in a species that arise from the duplication of a single ancestral gene [48]. We identified isoforms and paralogs in our PacBio SMRT sequencing data. In humans, approximately 70% of protein-coding genes have at least one paralog [49]. Arabidopsis has at least 21,843 paralogs, which account for approximately 84% of its protein-coding genes [50]. However, 97.8% of the E. maculata unigenes were single copy, which is unexpectedly high because most eukaryotes underwent several whole-genome duplication events that resulted in the duplication of ancestral genes. Thus, it will be interesting to determine the number of paralogs in other Euphorbia species to verify our findings. Currently, only one Euphorbia transcriptome has been reported, but it was generated by Illumina NovaSeq, which does not permit the analysis of paralogs of full-length transcripts [26]. Transcript isoforms are derived from alternative splicing of the introns and the differential initiation or termination of translation from primary transcripts, which allows a single gene to code for multiple forms of a protein [51]. Proteome plasticity from alternative splicing plays a major role in adaptation to environmental stresses [52]. In plants, alternative splicing occurs in about 24% of transcripts in wheat (Triticum aestivum) to 60% in Arabidopsis in intron-containing genes [44]. In the E. maculata transcriptome, about 35.8% of the unigenes had isoforms; two examples are shown in Figure 3. Different paralogs had different isoform patterns. Furthermore, the expression patterns of paralogs differed among root, stem, and leaf tissues. Thus, paralogs and their isoforms might help plants adapt to stresses, as demonstrated in cassava under cold stress [44].
Terpenoids are the major bioactive compounds in E. maculata. We isolated the genes, as well as their isoforms and paralogs, involved in the MVA, MEP, and terpenoid biosyn-thesis pathways in E. maculata. The MVA pathway begins with Acetoacetyl-CoA synthase (AAC thiolase), which catalyzes the condensation of two 2-Acetyl-CoA (AAC) molecules. AAC is subsequently transformed into five intermediate molecules to form IPP, which involves five enzymes: HMG-CoA synthase, HMG-CoA reductase, MVA kinase, MVAP kinase, and MVAPP decarboxylase (Figure 7) [11]. In E. maculata, the genes encoding these enzymes were present as single-copy up to five-copy genes, with one to three isoforms per gene (Table 4). HMG-CoA reductase is a key regulatory enzyme in the MVA pathway in plants [53] and catalyzes the conversion of HMG-CoA to MVA, which is a rate-limiting step in the MVA pathway [10,13]. The HMG-CoA reductase gene is highly conserved among organisms, and we identified 1929 HMG-CoA reductase mRNAs among all biological kingdoms from viruses to bacteria to eukaryotes in the NCBI database (data not shown). The gene encoding HMG-CoA reductase had five copies in E. maculata, and each paralog was expressed differently in stem, leaf, and root tissue. Developmental and organ-specific expression of the HMG-CoA reductase gene was also reported in plants [53]. The HMG-CoA reductase gene was expressed higher in stems than in roots and leaves in lavender (Lavandula pubescens), which also produces terpenoids [54]. In E. maculata, one of the HMG-CoA reductase-paralogous genes was highly expressed in stems. The various paralogs expressed differently among the three organs, which may be highly coordinated for plant development.
The MEP pathway, also known as the non-mevalonate (non-MVA) pathway [13], occurs in plastids; thus, animals do not have this pathway, which has spurred interest as a potential strategy to develop anti-bacterial or herbicide products [55,56]. We identified all enzyme-encoding genes of the MEP pathway in E. maculata. Except for the gene encoding CDP-ME kinase, all other enzyme-encoding genes had two to four copies and several isoforms. IPP derived from the MVA pathway and DMAPP derived from the MEP pathway are structurally unrelated isomers that are interconverted by IDI. Because IPP is derived directly from the MVA pathway, IDI is not essential for plant survival; thus, IDI may play a role in modulating the IPP/DMAPP ratio in the cell [13].
IPP is a C 5 molecule that undergoes enzyme-mediated sequential head-to-tail condensation to become GPP (C 10 ), FPP (C 15 ), and GGPP (C 20 ) [12]. There were two, one, and two copies of GPP synthase, FPP synthase, and GGPP synthase in E. maculata, respectively. GPP is converted to monoterpenes by monoterpene synthase, which was encoded by two paralogous genes, and both copies had very high expression in the three organs in our analysis. Monoterpenoids have not been reported in E. maculata, but several monoterpenoid compounds were reported in other Euphorbia species [57,58]. FPP is converted to sesquiterpenes (C 15 ) by sesquiterpene synthase or squalene (C 30 ) by squalene synthase. We found one copy of the sesquiterpene synthase gene in E. maculata. A sesquiterpene synthase gene was isolated from Euphorbia fischeriana, which produced several sesquiterpenoids, including cedrol and eupho-acorenols [59,60]. Oxygenated sesquiterpenes and sesquiterpene hydrocarbons were identified in different Euphorbia species, and their bioactivities were also reported [3]. Squalene (C 30 ) is a precursor of steroids [61]. Squalene is biosynthesized by combining two molecules of FPP by squalene synthase. A squalene synthase gene was isolated from Euphorbia pekinensis [62] and Euphorbia tirucalli [63]. We found two copies of squalene synthase in E. maculata, and both copies were actively expressed in the three organs. Squalene is converted to triterpenoids (C 30 ) by triterpenoid synthase, also called oxidosqualene cyclase [64]. A triterpene synthase gene was isolated from the bark of Euphorbia lathyris, in which triterpenoids are abundant [63]. The terpene synthase gene was highly expressed in the latex of E. lathyris. We identified three copies of the triterpene synthase gene in E. maculata, and their expression was high in leaves and stems compared to roots. Sun et al. (2018) reported two new triterpenes from dried whole E. maculata plants, which had anti-inflammatory properties [4]. Triterpenes have been isolated from diverse Euphorbia species [63][64][65]. Diterpenoids (C 20 ) are derived from GGPP by diterpene synthase. Diterpenoids are abundant in Euphorbia species [60]. We found one copy of the diterpene synthase gene in E. maculata. Plants produce thousands of diterpenoids, and diterpene synthases have numerous functions in diverse plants [66].

Conclusions
E. maculata L. is a medicinal herb that produces bioactive compounds including terpenoids. We conducted transcriptome sequencing via PacBio SMRT and Illumina RNA-Seq to identify the genes involved in terpenoid biosynthesis in E. maculata. Because the E. maculata genome sequence is not available, we used de novo assembly and obtained 20,722 unique full-length transcripts. PacBio SMRT sequencing allowed us to identify paralogous genes and isoforms. GO and DEG analyses revealed that paralogs of each gene expressed differently in stem, leaf, and root tissues. Using this approach, we identified the genes involved in the terpenoid biosynthesis pathway in E. maculata. Our sequence information will be useful for isolating orthologs in other terpenoid-producing medicinal plants.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/molecules27144591/s1, Table S1: The DNA sequence information of all the genes involved in the terpenoid synthesis in E. maculata.