Whole-Genome Identification and Analysis of Multiple Gene Families Reveal Candidate Genes for Theasaponin Biosynthesis in Camellia oleifera

Camellia oleifera is an economically important oilseed tree. Seed meals of C. oleifera have a long history of use as biocontrol agents in shrimp farming and as cleaning agents in peoples’ daily lives due to the presence of theasaponins, the triterpene saponins from the genus Camellia. To characterize the biosynthetic pathway of theasaponins in C. oleifera, members of gene families involved in triterpenoid biosynthetic pathways were identified and subjected to phylogenetic analysis with corresponding members in Arabidopsis thaliana, Camellia sinensis, Actinidia chinensis, Panax ginseng, and Medicago truncatula. In total, 143 triterpenoid backbone biosynthetic genes, 1169 CYP450s, and 1019 UGTs were identified in C. oleifera. The expression profiles of triterpenoid backbone biosynthetic genes were analyzed in different tissue and seed developmental stages of C. oleifera. The results suggested that MVA is the main pathway for triterpenoid backbone biosynthesis. Moreover, the candidate genes for theasaponin biosynthesis were identified by WGCNA and qRT-PCR analysis; these included 11 CYP450s, 14 UGTs, and eight transcription factors. Our results provide valuable information for further research investigating the biosynthetic and regulatory network of theasaponins.


Introduction
Camellia oleifera, also called the Camellia oil tree, is a traditionally cultivated woody species in China that is used as a source of high-quality seed oil. Seed meals of C. oleifera have a long history of application along with camellia oil. They have been used as a biocontrol agent in shrimp farming and agricultural production, as a cleaning agent in peoples' daily lives, and as a protectant to maintain the health of hair. The seed meals of C. oleifera contain significant quantities of triterpene saponins, also known as theasaponins. Many studies have suggested that theasaponins have a variety of biological and pharmacological activities, including inhibiting the growth of human carcinoma cells [1][2][3][4], anti-microbial effects [5], anti-inflammatory activity [6], neuroprotection [7], and foaming and detergent properties [8].
Secondary metabolites of natural plants such as triterpene saponins, phytosterols, and flavonoids are vital resources for humans. Identification of the relevant biosynthetic pathways would thus be helpful for our understanding and utilization of theses natural products. To date, more than 70 different theasaponins have been isolated from Camellia seeds, all of which are oleanane-type triterpene saponins [9]. However, the biosynthetic pathway of theasaponins is still unclear. In plants, triterpenoid saponins are synthesized from isopentenyl pyrophosphate (IPP) and dimethylallyl diphosphate (DMAPP) derived from the mevalonate (MVA) pathway and methylerythritol phosphate (MEP) pathway. IPP and DMAPP are then transformed into 2,3-oxidosqualene, the common precursor of triterpenoid saponins and sterols in eukaryotes, by the action of farnesyl diphosphate synthase (FPS), squalene synthase (SS), and squalene epoxidase (SE). The above steps are also known as the conserved biosynthesis pathway of the terpenoid backbone. Next, the structural diversity of saponins is formed by three main steps: (1) oxidosqualene cyclases (OSCs) catalyze cyclization of the precursor to different triterpenoid backbones; (2) triterpenoid backbones are modified by cytochrome P450 monooxygenase enzymes (CYP450s); (3) addition of sugar (chains) is catalyzed by UDP-glycosyltransferases (UGTs) [10][11][12].
Genes functioning in common processes always have similar expression patterns. Weighted gene co-expression network analysis (WGCNA) is a systems biology approach for describing the co-expression networks between genes across large-scale gene expression profiling data [19,20]. This is a powerful tool for screening potential genes related to biosynthesis and regulation of plant secondary metabolism. CaMYB48 as a regulator of capsaicinoid biosynthesis was screened by WGCNA [21]. TSAR1, TSAR2, multiple CYP450s, and UGTs were predicted to be involved in triterpene saponin biosynthesis in Medicago truncatula through co-expression analysis [22,23]. These examples show that candidate genes for theasaponin biosynthesis in C. oleifera could be revealed by WGCNA.
In this study, we selected genes involved in triterpene saponin biosynthetic pathways in C. oleifera, including MVA and MEP pathways and the IDI, FPS, SS, SE, OSC, CYP450, and UGT families, and constructed a gene co-expression network by WGCNA based on the expression pattern of bAS. The resulting network was combined with correlation results obtained through quantitative real-time PCR (qRT-PCR) to screen for candidate genes involved in theasaponin biosynthesis. This study establishes a foundation for further research investigating the biosynthetic and regulatory networks of theasaponins.
OSC catalyzes the first committed step in the triterpenoid biosynthesis pathway, and it plays an important role in the formation of diverse triterpenoid backbones. It has been found that OSCs such as CAS and LAS catalyze the generation of sterols, while bAS and DDS catalyze the generation of triterpenoids. In this work, a total of 83 OSCs were identified, 31 from C. oleifera. All of the OSCs contained two conserved domains, SQHop_cyclase_C and SQHop_cyclase_N. Those OSCs and some functionally characterized OSCs were aligned and used to construct a phylogenetic tree ( Figure 1A). The results showed that C. oleifera OSCs consisted of two sterol-related OSCs (1 CAS and 1 LAS) and 29 triterpenoidrelated OSCs (16 bASs and 13 DDSs).
The transcriptomes of different tissues (root, stem, leaf, petal, filament, anther, style, and ovary) and different seed developmental stages (seeds in July, August, September, and October) of C. oleifera were used for the expression analysis of triterpenoid backbone biosynthetic genes. The OSC genes displayed divergent expression patterns ( Figure 1B). In sterolrelated OSCs, CAS (Co10022113) was expressed equally in all samples; LAS (Co10411291) showed a low level of expression in roots. For triterpenoid-related OSCs, six bAS genes (Co10176456, Co10051732, Co10352372, Co10286845, Co10307396, and Co10350615) and two DDS genes (Co10339102 and Co10389481) showed high expression in some samples, while other OSC genes were not expressed or were expressed at very low levels. The six bASs were strongly expressed in roots, stems, leaves, and especially in the seeds in August, September, and October. The two DDSs showed high levels of expression in roots and stems, especially in roots. In addition, no triterpenoid-related OSCs were expressed in any flower parts, except for three bASs and two DDSs that were expressed at low levels in filaments and styles, respectively. Those results indicated that triterpenoids are primarily derived from the β-amyrin scaffold in C. oleifera seeds, while in roots, stems, and leaves they are derived from β-amyrin and dammarendiol scaffold.
To further screen for major-effect genes and pathways involved in triterpenoid backbone biosynthesis in C. oleifera, the gene expression data of identified genes in MVA, MEP, and IPP-related downstream pathways were used for a clustering analysis ( Figure 1C, Table S2). A total of 27 genes were grouped together with the total FPKM of 29 triterpenoidrelated OSC genes. In other words, the 27 genes exhibited a similar expression pattern as the triterpenoid-related OSCs, and thus may be responsible for triterpenoid backbone biosynthesis. Most of these genes (14 genes) were involved in the MVA pathway, and 8 genes were IPP-related downstream genes. Except for MVD, all of the enzymes of MVA and IPP-related downstream pathways were encoded by one or multiple genes in this cluster that contained four AACT genes, four HMGS genes, one HMGR gene, two MVK genes, three PMK genes, two IDI genes, three FPS genes, one SS gene, and two SE genes. This result suggested that MVA was the main pathway for triterpenoid backbone biosynthesis in C. oleifera. Moreover, an HMGR (Co10265376) and an SE (Co10322248) in this cluster exhibited extremely high expression in August and September in C. oleifera seeds, with average FPKM values of 1047, 1011, and 923, 990, respectively (Table S2). This indicated that the HMGR and SE genes may play important roles in the theasaponin biosynthetic pathway in C. oleifera seeds ( Figure 2).  (v2.1.7 SSE3) using the JTT + CAT model and SH-like test with 1000 resamples. (B) Expression profiles of OSCs in different tissues and seed developmental stages. Gene expression is presented as log2(FPKM + 1). (C) Expression pattern hierarchical clustering for triterpenoid backbone biosynthesis genes based on RNA-Seq results. Seed 7, seed picked on 15 July; Seed 8, seed picked on 15 August; Seed 9, seed picked on 15 September; Seed 10, seed picked on 15 October; Root 1, root of containerized seedlings; Root 2, root of bare-root seedlings; known genes, previously characterized genes from other species; T-OSC total, the total FPKM values of triterpenoid-related OSC genes.

Identification of Candidate CYP450s, UGTs, and Transcription Factors Related to
Theasaponin Synthesis in C. oleifera Seeds by WGCNA CYP450 and UGT are large gene superfamilies, each containing more than 1000 members in C. oleifera; this makes it difficult to obtain candidate genes involved in theasaponin biosynthesis. As genes belonging to the same pathway had similar expression patterns in different tissues and developmental periods, we generated a co-expression network via WGCNA using the FPKM values of all above-identified genes and predicted transcription factors in C. oleifera as source data. As the enzyme bAS catalyzes the first and most critical step of the theasaponin biosynthesis pathway in C. oleifera seeds, we considered the FPKM of bASs as representative of the level of theasaponin biosynthesis. Finally, among 11 modules, the MEbrown module containing 475 genes had an expression pattern tightly corre- 2.3. Identification of Candidate CYP450s, UGTs, and Transcription Factors Related to Theasaponin Synthesis in C. oleifera Seeds by WGCNA CYP450 and UGT are large gene superfamilies, each containing more than 1000 members in C. oleifera; this makes it difficult to obtain candidate genes involved in theasaponin biosynthesis. As genes belonging to the same pathway had similar expression patterns in different tissues and developmental periods, we generated a co-expression network via WGCNA using the FPKM values of all above-identified genes and predicted transcription factors in C. oleifera as source data. As the enzyme bAS catalyzes the first and most critical step of the theasaponin biosynthesis pathway in C. oleifera seeds, we considered the FPKM of bASs as representative of the level of theasaponin biosynthesis. Finally, among 11 modules, the MEbrown module containing 475 genes had an expression pattern tightly correlated with theasaponin biosynthesis ( Figure 3A,B). The heatmap of MEbrown genes ( Figure 3C) demonstrated that most of these genes exhibited a seed-specific pattern and had a higher expression level in seeds in August, September, and October. There were 291 genes with gene significance (GS) > 0.7 and intramodular connectivity (kME) values > 0.7 in this module. Consistent with the results of the clustering analysis of triterpenoid backbone biosynthetic genes, there were several genes for encoding enzymes of MVA and IPP-related downstream pathways among these 291 genes. Additionally, 41 CYP450s and 40 UGTs were present and may be related to theasaponin biosynthesis (Table S3).
ure 3C) demonstrated that most of these genes exhibited a seed-specific pattern and had a higher expression level in seeds in August, September, and October. There were 291 genes with gene significance (GS) > 0.7 and intramodular connectivity (kME) values > 0.7 in this module. Consistent with the results of the clustering analysis of triterpenoid backbone biosynthetic genes, there were several genes for encoding enzymes of MVA and IPPrelated downstream pathways among these 291 genes. Additionally, 41 CYP450s and 40 UGTs were present and may be related to theasaponin biosynthesis (Table S3).  Apart from the above, fifteen transcription factors (TFs) with GS > 0.8 and kME > 0.95 in the MEbrown module, in which three basic helix-loop-helix transcription factors (bHLH), four growth-regulating factors (GRF), four B3 domain-contain transcription factors (B3), and one lateral organ boundary domain gene (LBD) were present. In addition, eight MYB transcription factors, an important gene family in regulation of the biosynthesis of secondary metabolites, were contained in the MEbrown module with kME > 0.90 ( Figure 3D, Table S3).

Verification of the Results in WGCNA by qRT-PCR
To further screen for candidate CYP450s, UGTs, and TFs involved in theasaponin biosynthesis in C. oleifera seeds, we first aligned CDSs of CYP450s, UGTs, and TFs from WGCNA results and bASs. Genes with identity ≥ 96% were regarded as alleles. In the results, 17 non-allelic CYP450s, 25 non-allelic UGTs, three bHLHs, three GRFs, three B3s, one LBD gene, seven MYBs, and one bAS were obtained and named according to the family they belonged to (Table S3). Relative expression levels of these genes were then quantified by qRT-PCR in six different developmental stages (30 June, 15 July, 30 July, 30 August, 30 September, and 20 October) of C. oleifera seeds ( Figure 4A). The results were in general agreement with those from the RNA-Seq; most of those genes had relatively high-level expression in August, September, and October, while having relatively low-level expression in June and July ( Figure 4B). Finally, the Pearson correlation coefficient (R) was calculated between each selected gene and bAS using their relative expression (Table S4). Genes with R < 0.7 were dropped. Finally, we obtained 11 CYP450s (three in CYP71, two in CYP716, and one each in CYP72, CYP73, CYP79, CYP81, CYP83, and CYP87) ( Figure 4C), 14 UGTs (two each in UGT73, UGT91, and UGT93 and one each in UGT72, UGT75, UGT78, UGT79, UGT80, UGT90, UGT94, and UGT708) ( Figure 5) and eight TFs (two in bHLH, one in B3, one in GRF, and four in MYB) ( Figure 6) as the candidate genes indicating a variety of structures and complex mechanisms involved in theasaponin biosynthesis and regulation in C. oleifera seeds. Among the 11 CYP450s and 14 UGTs, CoCYP716-1, CoCYP716-2, CoCYP87-1, and CoUGT73-1 exhibited extremely high FPKM values (Table S3) and relative expression (Figures 4 and 5) in C. oleifera seeds during August, September, and October.

Biosynthesis Pathway of the Triterpenoid Backbone in C. oleifera
In addition to high-quality edible oils, the C. oleifera seeds contain abundant secondary metabolites such as flavonoids, saponins, phytosterols, and squalenes. Saponin is one Figure 6. The relative expression patterns of candidate TFs and bASs. ** p < 0.01 (two-sided test).

Biosynthesis Pathway of the Triterpenoid Backbone in C. oleifera
In addition to high-quality edible oils, the C. oleifera seeds contain abundant secondary metabolites such as flavonoids, saponins, phytosterols, and squalenes. Saponin is one of the main active ingredients extracted from Camellia oleifera seeds. In the past several decades, theasaponins, a group of triterpenoid saponins from the genus Camellia, have received increasing attention due to their bioactivities. Many individual theasaponins have been isolated, structurally characterized, and functionally identified [3,9,26]. However, the biosynthetic pathway of theasaponin in C. oleifera has not yet been completely resolved. As a widespread bioactive compound in plants, triterpenoid saponin has been studied extensively not only in regard to structure and function but also concerning the biosynthetic pathway in many species, especially in medicinal plants such as Panax ginseng [27,28], Bupleurum falcatum [29,30], Platycodon grandiflorus [12], and model plants such as Medicago truncatula [23,31]. Generally, the triterpenoid saponins synthesis pathway can be divided into two stages, the triterpenoid skeleton synthesis stage and the triterpenoid skeleton modification stage. Triterpenoid skeleton modification includes oxidation modification catalyzed by CYP450, glycosylation modification catalyzed by UGT, and other modifications.
In this study, we mined members of the gene families related to triterpenoid saponin synthesis, and most of those families had more members in C. oleifera than in the other five species, especially the CYP450 and UGT gene families. This may be because there are more alleles in C. oleifera because it is a hexaploid and is highly polymorphic. Further analysis of the expression of triterpenoid skeleton biosynthetic genes revealed that genes presenting similar expression patterns as triterpenoid-related OSCs were concentrated in the MVA pathway, while the MEP pathway was less prominent. We speculated that the MVA pathway was the main pathway for triterpenoid backbone biosynthesis in C. oleifera. This result was consistent with previous studies [16]. HMGR is a rate limiting enzyme in the MVA pathway, and 18 HMGRs were identified in C. oleifera and exhibited diverse expression patterns. One of the eighteen HMGRs had a very similar expression pattern to bAS in different seed developmental stages of C. oleifera and in addition had a very high expression level. As in HMGR, one of the eleven SE genes also had a similar expression pattern and high expression level in C. oleifera seeds. We speculated that these two genes play important roles in theasaponin biosynthesis, and different genes from the same family may be responsible for different end-products of biosynthesis. OSCs catalyze the first step in the specific biosynthesis of triterpenoid saponins and determine the structure of diverse triterpene skeletons. OSCs expression analysis showed that only bAS exhibited high-level expression in C. oleifera seeds. This finding indicates that the triterpenoids are primarily derived from the β-amyrin scaffold in C. oleifera seeds, explaining why all the theasaponins detected in C. oleifera seeds to date are oleanane-type triterpene saponins [9]. The expression trends of bAS indicated that rapid saponins synthesis in C. oleifera seeds occurred in August, September, and October. In addition, as in other species, saponins have different accumulation levels and types in different tissue parts and developmental stages of C. oleifera.

Candidate CYP450s Involved in Theasaponin Biosynthesis in C. oleifera Seeds
Extensive experimentation has shown that C. oleifera seeds contain a variety of oleananetype triterpenoids such as oleiferasaponin [32][33][34] and camelliasaponin [33]. A basic structure of theasaponin is shown in Figure 2B,C, illustrating that oxidative modifications exist at C-16, C-21, C-22, C-23, and C-28 of triterpenoid backbones. By far the most common modifications found in triterpene saponins are catalyzed by CYP450s. CYP716 is an ancient CYP450 gene family in the CYP85 clan, and its proposed origin is in triterpenoid primary metabolism [35]. CYP716 enzymes are to date the only CYP450s known to perform C-28 three-step oxidation of triterpenoids [36]. Of all the candidate CYP450s obtained in this study, there were two non-allelic CYP716s, CoCYP716-1 and CoCYP716-2. In C. oleifera seeds, they had similar expression patterns to bAS, and extremely high expression in August and September from transcriptome data (Table S3) and qRT-PCR ( Figure 4C). Moreover, CoCYP716-1 and CoCYP716-2 exhibited a high-level sequence identity (more than 50%) with P. ginseng CYP716A52v2 and Platycodon grandiflorus CYP716A140v2 that are β-amyrin C-28oxidase enzymes involved in oleanolic acid production [30,37]. As such, we speculated that the C-28 oxidation of β-amyrin in C. oleifera seeds is most likely performed by CoCYP716-1 and CoCYP716-2. Nonetheless, most of the characterized CYP716s catalyze C-28, but not only C-28 oxidation. For example, CYP716A141 from Platycodon grandifloras was characterized as a C-16β hydroxylation enzyme [38]; CYP716Y1 from Bupleurum falcatum catalyzed C-16α hydroxylation of β-amyrin [30]; and CYP716A2 displayed 22α-hydroxylation activity in Arabidopsis thaliana [39], and CYP716A14v2 oxidized the C-3 hydroxyl group to carbonyl group, a prerequisite for further additions such as glycosylation at this position [40]. These results suggest that CoCYP716-1 and CoCYP716-2 may also have catalytic activity on C-16, C-22, and C-3 of theasaponin. CYP87D16 is another CYP450 that has C-16α oxidase activity in Maesa lanceolate [41]. This enzyme belongs to the CYP87 family that, similar to the CYP716 family, is a member of the CYP85 clan. We screened a member of the CYP87 family (CoCYP87-1, containing three alleles, Co10319325, Co10341811, and Co10360453) as a candidate gene involved in theasaponin biosynthesis in C. oleifera seeds. CoCYP87-1 had an amino acid similarity of 76.9% to CYP87D16, and thus we hypothesized that CoCYP87-1 was the best candidate gene for C-16 oxidation of theasaponin in C. oleifera seeds.
C-23 is one of the most common positions for oxidation of the triterpenoid backbone, and the presence of the hydroxyl group at C-23 is crucial for biological activity [38]. Previous studies demonstrated that C-23 oxidation of oleanolic acid is catalyzed by CYP72A68v2 in Medicago truncatula [42], CYP72A552 in Barbarea vulgaris [43], and CYP71A16 in Arabidopsis thaliana [44]. In our analysis, three CYP71 genes and one CYP72 gene were co-expressed with bAS, suggesting the possibility that one or more of those genes performed C-23 oxidation in theasaponin biosynthesis. Additionally, some saponins contained additions at C-21 and C-22, but the enzyme is largely unknown. In this study, in addition to the above genes, four CYP450s belonging to CYP73, CYP79, CYP81, and CYP83 were also co-expressed with bAS. This suggested that those genes may have catalytic activity on β-amyrin or related compounds. Nonetheless, CYP450s are ubiquitous enzymes; some from two different gene families exhibit the same biochemical function [41], while some from the same family perform different biochemical functions [42,45]. Our current findings suggest possible yet unexplored functions of candidate CYP450s related to theasaponin biosynthesis in C. oleifera seeds.

Candidate UGTs Involved in Theasaponin Biosynthesis in C. oleifera Seeds
Glycosylation, which is usually catalyzed by UGTs, can alter bioactivity and solubility and increase the diversity of saponins [46]. In theasaponin, sugar moieties are attached to C-3 with a glucuronic acid (GlcA) or its methyl ester, and substituted at position 2 (one sugar unit) and position 3 (one or two sugar units) by glucose (Glc), galactose (Gal), arabinose (Ara), xylose (Xyl), or rhamnose (Rha) ( Figure 3C) [9]. Usually, the diversification of triterpenoids created by UGTs has been thought to be by far the most common [47]. However, a few UGT enzymes have been identified to glycosylate triterpene aglycones. Earlier studies showed that UGTs glycosylated triterpenes are mostly members of groups A, D, and E [47]. Group A contains UGT79, UGT80, UGT91, and UGT94 families; group D contains the UGT73 family, and group E contains UGT71, UGT72, UGT88, and UGT708 families (Table 3). In this study, most of the candidate UGTs (nine of fourteen) belonged to groups A, D, or E, implying that the results are reliable. The other five genes comprised two UGT93s, one UGT75, one UGT78, and one UGT90. These may be undiscovered genes with glycosylated triterpenoids, or they may catalyze other reactions that happen to coincide with the biosynthesis of saponins in C. oleifera seeds. In previous studies, several UGT73s have been shown to have the function of glycosylating oleanane-type triterpene saponins at the C-3 position. UGT73C10, UGT73C11, and four OAGTs (members of the UGT73 family) are responsible for the addition of the first sugars of the C-3 sugar chain [48,49], while UGT73P2 (galactosyltransferase) [50] and UGT73P10 (arabinosyltransferase) [51] catalyze the addition of the second sugars of the C-3 sugar chain. In our current study, there were two UGT73s as candidate genes. The two genes, especially CoUGT73-1, had high FPKM values (Table S3) and relative expression levels ( Figure 5) in C. oleifera seeds in August, September, and October. Moreover, amino acid sequences of the two genes were more similar to UGT73P2 (about 45%) than to UGT73C10 (about 40%), suggesting that the two UGT73s are more likely to be responsible for the addition of second sugars than first sugars of the C-3 sugar chain. Cellulose synthase is another gene superfamily identified as the first glucuronosyltransferase at the C-3 position of oleanane-type aglycones [52,53]. However, we did not analyze this superfamily in this study. The functions of UGTs are very complicated, and the correlation between substrate selectivity and the sequence is very low, making it difficult to identify the target UGTs. Hence, the candidate UGTs we obtained need further in vitro and in vivo functional analysis.

TFs Involved in Theasaponin Biosynthesis in C. oleifera Seeds
Plants deploy a variety of secondary metabolites as defense mechanisms against various stress situations. Their biosynthesis is tightly regulated, and multiple phytohormones, such as jasmonate (JA) and salicylic acid (SA), are involved in the process. Previous studies and this report imply that the triterpene biosynthetic and regulation networks are extremely complex, and many transcription factors participate in the pathway. bHLH is the mostreported transcription factor family involved in the regulation of saponin biosynthesis, as TSAR1-3, MYC2 and TSARL1-2 [54][55][56]. These usually directly bind to the promoters of triterpene biosynthetic genes and could be induced by JAs. MYB is a famous transcription factor family that participates in the regulation of secondary metabolite biosynthesis. There is some evidence that MYB can regulate triterpene biosynthesis. For example, BpMYB21 and PgMYB2 positively regulate triterpenoid biosynthesis, while the VvMYB5b gene decreases β-amyrin in tomato [57][58][59]. MYB can directly modulate secondary metabolites or form complexes with bHLH to regulate secondary metabolites. In this study, two bHLHs, one GRF, one B3, and four MYBs exhibited high co-expression with bAS ( Figure 6), and they may thus regulate the theasaponin biosynthesis in C. oleifera seeds. Whether these CobHLHs and CoMYBs act directly or form complexes to regulate theasaponin biosynthesis in C. oleifera seeds remains to be further investigated. The other transcription factors we screened have not been shown to be directly involved in the regulation of triterpenoid synthesis, but they play crucial roles in many important biological processes including secondary metabolites and stress responses. They can also interact with many transcription factors, such as MYB and bHLH. Thus, we cannot rule out the possibility that they also participate in theasaponin biosynthesis in C. oleifera seeds. In conclusion, as for other biological processes, theasaponins biosynthesis involves a complex network with co-functioning of multiple transcription factors and structural genes. Whether the transcription factors we screened actually regulate theasaponin biosynthesis and how to regulate this process merit further study.

Data Resources Used
The hidden Markov model (HMM) files corresponding to the domains were downloaded from the Pfam protein family database (http://pfam.xfam.org/, accessed on 8 April 2021). The domains each family contained and their Pfam IDs are provided in Supplementary Table S1.
The genomic data were obtained from the following websites: Arabidopsis thaliana, http://plants. ensembl RNA sequencing data were retrieved from previous studies by our research group [64], with accession number: PRJNA 693152.

Identification of Triterpenoid Saponin Biosynthesis-Related Genes
According to Supplementary Table S1, hmmsearch v.3.1b1 was used to screen all proteins containing these domains from the predicted protein database of C. oleifera 'Huashuo', A. thaliana, C. sinensis, A. chinensis, P. ginseng, and M. truncatula, with 1e-3 as the threshold E-value. Proteins that exceeded this threshold were analyzed for the existing domains using the plug-in "Batch SMART" of TBtools [65] (v1.098696). This plug-in links to the SMART website (http://smart.embl-heidelberg.de/, accessed on 12 July 2021). If domains belonged only to the specified family, the domain-containing proteins were identified as members of that family. If not, a phylogenetic tree was constructed, and identified family members were based on known proteins from A. thaliana and other species.

Sequence Alignment and Phylogenetic Tree Construction
Amino acid sequences were aligned using MAFFT v7.215. The approximate maximum likelihood tree was constructed by FastTree (v2.1.7 SSE3) using the JTT + CAT model and SH-like test with 1000 resamples [66,67]. The phylogenetic trees were visualized and drawn using MEGA7 and Adobe Illustrator 2020 software (Adobe Inc., San Jose, CA, USA).

Visualization of Gene Expression and Identification of Co-Expression Modules
The RNA sequencing data were re-analyzed with the C. oleifera 'Huashuo' genome as a reference, and the values of FPKM (the fragments per kilobase of exon per million mapped reads) were calculated for each transcript. Gene expression visualization was realized by the plug-in "HeatMap" for TBtools. A gene co-expression network was built using the plug-in "WGCNA shiny" for TBtools. The networks were visualized using Cytoscape v.3.8.2 (Cytoscape Consortium, USA).

Plant Materials
The different developmental stages of seeds of C. oleifera 'Huashuo' used for RT-qPCR were collected at the HuJu forest farm in Chaling county, Zhuzhou, Hunan Province, China (113 • 25 E, 26 • 55 N). Nine healthy trees in adjacent geographical locations with the same age and the same growth potential were randomly selected, and we marked the flowers that bloomed on the same day in the full-bloom stage. Three trees with the most marked fruits were then selected, and each tree was considered as a biological replicate. Finally, the samples were randomly collected from marked fruits of the three trees on 30 June, 15 July, 30 July, 30 August, 30 September, and 20 October 2021. The seeds were removed from the fruits and immediately frozen in liquid nitrogen. After returning to the laboratory, the samples were stored at −80 • C.

RNA Extraction and RT-qPCR Analysis
The frozen stored seeds were ground into fine powder in liquid nitrogen. Approximately 140-170 mg powder samples were used to extract total RNA using the M5 HiPer Plant RNeasy Complex mini kit (Mei5 Biotechnology, Co., Ltd., Beijing, China). First strand cDNA was synthesized from 2 µg of total RNA using a Goldenstar™ RT6 cDNA synthesis Kit Ver.2 (TsingKe Biotech Co., Ltd., Beijing, China) according to the manufacturer's instructions. One microliter of cDNA was used as a template for qPCR analysis using 2× TSINGKE ® Master qPCR Mix (SYBR Green I) (TsingKe Biotech Co., Ltd., Beijing, China). Analysis was performed on a LightCycler 96 Real-Time PCR System (Roche, Basel, Switzerland) and followed the program proposed by 2× TSINGKE ® Master qPCR Mix (SYBR Green I) protocol with a 56 • C annealing temperature and 45 cycles. The relative transcript level of each gene was normalized to glyceraldehyde-3-phosphate dehydro-genase gene (GAPDH) and calculated according to the 2 −∆∆Ct method. The relative expression of each gene on 30 June was set to 1, and those of all other stages were calculated relative to that of 30 June. Data represent the mean ± SD of three biological replicates. The primers used in this study are listed in Supplementary Table S4. The correlations between candidate genes and bAS were tested using the Pearson coefficient analyzed by two-sided tests using IBM SPSS 19 (SPSS Inc., Chicago, IL, USA).

Conclusions
In this study, we identified the members of multiple gene families that cover the whole triterpenoid backbone biosynthetic pathway, as well as CYP450 and UGT families, through mining the protein database for C. oleifera (Huashuo) by searching HMM. In total, 143 triterpenoid backbone biosynthetic genes, 1169 P450s, and 1019 UGTs from C. oleifera were identified. The CYP450s and UGTs were further categorized using tree-based methods. The transcriptome data analysis indicated that MVA was the main pathway for triterpenoid backbone biosynthesis in C. oleifera, and that HMGR and SE genes may play important roles in this pathway. Through WGCNA and RT-qPCR analysis, 11 CYP450s, 14 UGTs, and 8 TFs were identified as the candidate genes involved in theasaponin biosynthesis in Camellia oleifera seeds. The results of this study provide valuable information for further research investigating the biosynthesis and regulatory network of theasaponins.