The Identification of SQS/SQE/OSC Gene Families in Regulating the Biosynthesis of Triterpenes in Potentilla anserina

The tuberous roots of Potentilla anserina (Pan) are an edible and medicinal resource on the Qinghai–Tibetan Plateau, China. The triterpenoids from the tuberous roots have shown promising anti-cancer, hepatoprotective, and anti-inflammatory properties. In this study, we carried out a phylogenetic analysis of the squalene synthases (SQSs), squalene epoxidases (SQEs), and oxidosqualene cyclases (OSCs) in the triterpene biosynthetic pathway. In total, 6 SQS, 26 SQE, and 20 OSC genes were retrieved from the genome of Pan. Moreover, 6 SQS and 25 SQE genes were expressed in the two sub-genomes (A and B) of Pan. SQSs were not expanded after whole-genome duplication (WGD), whereas duplicated genes were detected among SQEs. The twenty OSCs were divided by a phylogenetic tree into two clades, cycloartenol synthases (CASs) and β-amyrin synthases (β-ASs), characterized by gene duplication and evolutionary divergence. We speculate that β-ASs and CASs may participate in triterpene synthesis. The data presented here provide a valuable reference for future studies on the triterpene synthetic pathway of Pan.


Introduction
Triterpenes are one of the largest groups of secondary metabolites of natural origin and have various skeletons; many, such as ursolic acid, oleanolic acid, and ginsenoside, display promising anti-cancer and anti-oxidant activity and are widely used in the pharmaceutical industry [1,2]. Triterpenoids discovered in plants are commonly generated through the farnesyl pyrophosphate (FPP) pathway. Squalene synthases (SQSs) convert FPP to squalene, which is then oxidized by squalene epoxidases (SQEs) to 2,3-oxidosqualene; this is further converted into different triterpene skeletons by different members of the oxidosqualene cyclases (OSCs) (Figure S1). Triterpene scaffolds are further oxidized or glycosylated into structurally diverse triterpenoids by various cytochrome P450 monooxygenases (CYP450) and UDP glucosyltransferases or cellulose synthase-like M-subfamilies [3]. In previous studies, 121 triterpene skeletons have been summarized according to conformation and ring number, among which 51 skeletons have been experimentally characterized as products of OSCs [4]. Interestingly, 24 skeletons not reported from natural sources were generated by OSCs in heterologous expression [4]. Protosteryl and dammarenyl cations are the parents of many triterpene skeletal types. The protosteryl cation, with a chair-boat-chair (CBC) conformation, is further catalyzed by the OSCs cycloartenol synthases (CASs), lanosterol synthases (LASs), and cucurbitadienol synthases (CDSs) to form cycloartenol, lanosterol, and cucurbitadienol, respectively, which are associated with the tetracyclic triterpene skeleton (6-6-6-5).
In this study, from the genomic data of Pan, we retrieved 6, 26, and 20 genes for SQSs, SQEs, and OSCs, respectively (Figure 1A and Table S2). Six copies of SQSs were obtained by searching the genomic data of Pan with the Protein family database (Pfam, ID PF00494), while no copies were derived from homolog-based prediction (HBP).
For SQEs, 25 copies were acquired by searching with Pfam (ID PF08491), while 24 were obtained by HBP; among them, 23 were shared, 1 new copy (Pan3G00107-1) was annotated by HBP, and 2 copies with incomplete annotation were refined (Pan8G00699-1 and Pan3G00106-1) (Figure 1D, Table S2). For OSCs, 18 and 19 copies were obtained via the Pfam database (IDs PF13249 and PF13243) and the HBP method, respectively, with 16 copies shared. Pan11G02732 and Pan11G02733, identified by Pfam, were merged into Pan11G02732-1 by HBP (Figure 1D). Pan12G01400 (annotated) and Pan12G01401 (not acquired by Pfam) were merged into Pan12G01400-1 by the HBP method; moreover, the HBP method refined two genes (Pan7G00839-1 and Pan12G01259-1) and screened out one new copy (Pan11G02694-1) of OSCs (Table S2).
Correction of the Pfam analysis by the HBP method. New copy: a new gene copy was obtained by the HBP method; Refined: genes were adjusted; Merged: multiple genes were merged into a complete gene; Split: one gene was split into multiple genes.
According to the Conserved Domain Database (CDD), we obtained the domains of SQSs/SQEs/OSCs (Figure 1C). The predicted products of SQSs, SQEs, and OSCs ranged in length from 318 aa to 627 aa, 215 aa to 420 aa, and 619 aa to 778 aa, respectively, and their isoelectric points ranged from 7.11 to 9.32, 5.01 to 8.84, and 5.73 to 7.61, respectively. Subcellular localization revealed that SQSs were located in organelle membranes and chloroplasts, the products of SQEs in the endomembrane system and organelle membranes, and the proteins encoded by OSCs mainly in plasma membranes and the nucleus (Table S2). In total, 6 SQS genes and 25 SQE genes were detected at different expression levels in root and tuberous root, and 8 OSC genes were expressed in root and tuberous root (Figure 1B).
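The SQE counts reported above can be cross-checked with simple inclusion-exclusion: copies found by Pfam plus copies found by HBP, minus those found by both, should give the number of distinct SQE genes. The short Python sketch below does this arithmetic; the function name `union_size` is hypothetical and is not part of the authors' pipeline. The OSC counts are left out because merged gene models may break the one-to-one correspondence between the two result sets.

```python
def union_size(n_pfam: int, n_hbp: int, n_shared: int) -> int:
    """Number of distinct gene copies found by at least one of the two methods."""
    if n_shared > min(n_pfam, n_hbp):
        raise ValueError("shared count cannot exceed either individual count")
    return n_pfam + n_hbp - n_shared


# SQE counts as reported in the text: 25 by Pfam, 24 by HBP, 23 found by both.
sqe_total = union_size(n_pfam=25, n_hbp=24, n_shared=23)
hbp_only = 24 - 23  # copies found only by HBP

print(sqe_total)  # 26
print(hbp_only)   # 1
```

The total of 26 agrees with the number of SQEs retrieved from the Pan genome, and the single HBP-only copy agrees with the one new copy (Pan3G00107-1) annotated by HBP.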

The Chromosomal Locations and Collinearity Analysis of SQSs/SQEs/OSCs
The genome of Pan contains 14 chromosomes and underwent a tetraploidization at 6.4 Mya. The A sub-genome (A1-A7, from Chr1 to Chr13, odd numbers) and the B sub-genome (B1-B7, from Chr2 to Chr14, even numbers) showed good collinearity [16]. SQSs were symmetrically distributed on the two sub-genomes. SQEs were scattered on seven chromosomes, such as Chr1 and Chr2; more SQE genes were located on the B sub-genome than on the A sub-genome, and tandem-duplicated genes were found on Chr2, Chr3, Chr4, Chr7, and Chr8. OSCs were concentrated on Chr7, Chr11, and Chr12 with tandem-duplicated genes, and the A sub-genome contained more copies (Figure 2).
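The chromosome numbering described above (odd chromosomes form sub-genome A, even chromosomes form sub-genome B) can be written as a small mapping. The Python sketch below is illustrative only: it assumes that the number following "Pan" in a gene identifier is the chromosome number, which is consistent with the identifiers discussed in this paper but is not stated explicitly, and the function name `subgenome` is hypothetical.

```python
import re
from typing import Tuple


def subgenome(gene_id: str) -> Tuple[str, int]:
    """Map a Pan gene ID to (sub-genome letter, homeolog index).

    Assumption: IDs look like 'Pan<chromosome>G<number>', e.g. 'Pan7G00204'.
    """
    match = re.match(r"Pan(\d+)G", gene_id)
    if match is None:
        raise ValueError(f"unrecognized gene ID: {gene_id}")
    chrom = int(match.group(1))
    if not 1 <= chrom <= 14:
        raise ValueError(f"chromosome out of range: {chrom}")
    letter = "A" if chrom % 2 == 1 else "B"  # odd -> A, even -> B
    index = (chrom + 1) // 2                 # Chr1/Chr2 -> 1, ..., Chr13/Chr14 -> 7
    return letter, index


print(subgenome("Pan7G00204"))     # ('A', 4)
print(subgenome("Pan11G02732-1"))  # ('A', 6)
print(subgenome("Pan12G01400-1"))  # ('B', 6)
```

These results match the labels used in the phylogenetic analysis, where chromosome 7 is A4, chromosome 11 is A6, and chromosome 12 is B6.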

Phylogenetic Analysis of OSCs
The phylogenetic tree of OSCs retrieved from Pan, Fve, Rru, Pmi, Pyrus pyrifolia (Ppy), and Vitis vinifera (Vvi) was constructed along with the published/verified OSCs of Arabidopsis thaliana (Ath) and Oryza sativa (Osa) as outgroups (Figure 4; Tables S1 and S3). According to the functional annotation from KEGG and Swiss-Prot, the products of OSCs were classified into eight categories (Figure 4). OSCs from Vvi, Fve, Rru, Ppy, and Pmi were classified into the CAS, LUS, and β-AS clades. OSCs from Pan were divided into two branches, CAS and β-AS. CASs were entirely distributed on the A sub-genome and divided into three sub-clades (Pan7G00204 and Pan7G00839-1; Pan7G00207 and Pan7G00837; and Pan11G02732-1), and the first two sub-clades, on chromosome 7 (A4), showed tandem duplication events. Meanwhile, β-ASs located on chromosomes 11 (A6) and 12 (B6) were divided into two branches, shown in Figure 4 as clades I and II, characterized by a large number of tandem-duplicated genes.

Gene Families Identified by the Pfam Database and the HBP Method
The development of sequencing technology and sophisticated genome assembly methods has promoted the release of more plant genomes. Generally, the reliability of subsequent genome analysis depends on the quality of the assembly and the integrity of the annotation. In this study, we used high-threshold Pfam results, with the HBP method as a supplement, to identify genes in the triterpene pathway. We found that the HBP method had a correction rate (new copies plus genes refined, merged, or split by HBP, divided by the total number of genes) of 14-17% for the SQS, SQE, and OSC gene families in different species (Table 1 and Table S3). There were merged OSCs in Pan and Ppy (Table S3), possibly due to incomplete or low-accuracy annotation leading to fragmentation, or to the wide occurrence of alternative splicing in the genome. The split genes that appeared in Fve (i.e., FvH4_2g02660) and Rru were confirmed by the transcriptome data (Figure 1D, Table S4); they may be caused by tandem-duplicated genes resulting in overly long annotations. The sequence of Pan12G01345-1 obtained by the HBP method contained a premature termination codon, which may have been introduced during evolution. Eleven SQSs were obtained via a search in the Pfam database (Figure S2), of which three SQS-like genes (Pan4G02790, Pan3G00908, and Pan3G00874) were located in independent branches with incomplete motifs. Another two SQSs (Pan9G01625 and Pan10G01518) had complete PF00494 alignment information, but no corresponding motif or KEGG annotation information. A possible explanation is that SQSs and the phytoene synthase family (PSYs) have structural similarities and share three conserved regions [17]. Thus, we included the remaining six SQSs in this study. In the HBP analysis, only the three SQS-like genes (Pan4G02790, Pan3G00908, and Pan3G00874) were obtained (Figure S2), probably because there were few reference genes for HBP. Therefore, accurate gene sets for homologous prediction and reliable databases for functional verification are critical in HBP analysis.
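The correction rate defined above can be illustrated with a short calculation. The Python sketch below applies the definition to Pan, using tallies read from the Results text of this paper (one new and two refined SQE copies; one new, two refined, and two merged OSC copies; no corrections for SQSs). These tallies are an assumption of the sketch, in particular that each merged gene model is counted once; they are not values taken from Table 1 or Table S3, and the function name `correction_rate` is hypothetical.

```python
def correction_rate(new: int, refined: int, merged: int, split: int, total: int) -> float:
    """(new + refined + merged + split) / total number of genes."""
    if total <= 0:
        raise ValueError("total number of genes must be positive")
    return (new + refined + merged + split) / total


# Tallies for Pan as read from the text (an assumption, not Table 1 values).
pan = {
    "SQS": dict(new=0, refined=0, merged=0, split=0, total=6),
    "SQE": dict(new=1, refined=2, merged=0, split=0, total=26),
    "OSC": dict(new=1, refined=2, merged=2, split=0, total=20),
}

# Sum each category over the three families, then apply the definition once.
summed = {key: sum(family[key] for family in pan.values())
          for key in ("new", "refined", "merged", "split", "total")}
pan_rate = correction_rate(**summed)

print(f"{pan_rate:.1%}")  # 15.4%
```

Under these assumptions, 8 of 52 genes were corrected in Pan, which falls within the reported range of 14-17%.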

SQS/SQE/OSC Gene Family Expression and Evolution
Gene duplications are recognized as contributors to the evolution of genes with divergent functions. Sub-functional, hypofunctional, and neo-functional genes, as well as compensatory drift and neutral variation, may result from gene duplications [18]. Multiple paralogous genes may be generated by whole-genome polyploidization, segmental duplication, or tandem duplication [18]. The divergence of Vitales and Rosales occurred at about 121.9 Mya, while the divergence of Rosoideae from Amygdaloideae occurred at 84.4 Mya [19]. The genus Potentilla separated from the genus Fragaria at about 40.68 Mya, followed by the divergence between Pan and Pmi (28.52 Mya) [16]. Pmi and Fve diverged at about 34.5 Mya [19], and Fve diverged from Rru at about 52.3 Mya [19]. An ancient whole-genome triplication event occurred in Vvi [20]. Plants in Rosaceae have experienced an ancient genome triplication, though no evidence of large-scale genome duplication was found in Fve. It is speculated that chromosome rearrangement and genome shrinkage (or the selective loss of duplicated genes) have obscured the ancient triplication event in Fve [21]. Rru has undergone no additional genome-wide duplication after whole-genome triplication (WGT) [22]. The genome of Pan underwent a recent WGD event to form an allotetraploid after the divergence of Rosaceae, and the A and B sub-genomes were found to have good collinearity [16].
Based on the OSC phylogenetic tree and the distribution of OSCs on the chromosomes, we discuss the characteristics of OSCs. In the CAS clade, the Fve:Rru:Pan ratio was 3:2:5, and all CASs of Pan were distributed on the A sub-genome (Figure 4). Four copies of OSCs on Chr7 (Pan7G00207, Pan7G00837, Pan7G00204, and Pan7G00839-1) were distributed into two branches, indicating that a tandem duplication event had taken place. These copies on Chr7 showed discrepant expression in root or tuberous root, and we speculate that the low expression may result from the shortening of coding sequences, resulting in sub-functionalization (Table S2).
In clade I of β-AS (Rru6g04897~Pan12G01344) (Figure 4), we hypothesize that β-AS has three ancestral copies, based on the three genes of Vvi in the outgroup (clade III). The ratio of Fve:Rru:Pan was 4:5:7, and Pan had five OSC copies on Chr12 and two copies on Chr11. Pan12G01344, Pan12G01400-1, and Pan12G01353 showed similar transcription levels (Table S2), and two of these copies may have arisen through tandem duplication. In clade II of β-AS (Rru6g04906~Pan12G01345-1), the ratio of Fve:Rru:Pan was 3:3:8. In the sub-branch Rru6g05056~Pan11G02823, the ratio of Fve:Rru:Pan was 1:1:4. Furthermore, the OSCs of Pan in this sub-branch (Pan11G02822, Pan11G02823, Pan12G01259-1, and Pan12G01261) were evenly distributed between the A and B sub-genomes, and the genes generated by tandem duplication were not uniformly expressed in the tuberous root (Table S2).
Similarly, according to the phylogenetic tree of SQSs and SQEs (Figure 3), Fve and Pan had three and six SQSs, respectively. The symmetrical distribution of the SQSs of Pan on the A and B sub-genomes may result from the WGD event of Pan. For SQEs, one of Pan2G00943 and Pan2G00944 seemed to have expanded in clade I, while tandem duplication occurred in Pan3G00304 or Pan3G00105 and in Pan4G03465 or Pan4G03385, and the expression of the duplicated genes was consistent in the root transcriptome data (Table S2). In clade II, Pan8G00700, Pan8G00721, and Pan8G00699-1 were duplicated genes, and Pan8G00699-1 had higher expression in the root transcriptome data. In clade III, Pan11G01188 was not expressed in root or tuberous root, which may be attributed to spatio-temporally specific expression (Figure 3, Table S2).

The Functional Prediction of OSCs and Triterpene Synthesis in Pan
In the phylogenetic tree of OSCs from eight species, CASs were located in the outermost branch, which is in agreement with previous findings that LASs, LUSs, and β-ASs evolved from ancient CASs [5,6]. The classification of these enzymes has been extensively studied with Osa as the outgroup [4]. OsOSC2 converts 2,3-oxidosqualene to cycloartenol [23] when using the CBC conformation as the precursor under the catalysis of S-adenosyl-L-methionine-sterol-C24-methyltransferase 1 (SMT1) and CYP450 [24], and cycloartenol can be converted into phytosterols [25] and cholesterol [26]. Moreover, steroidal diosgenin is obtained with further hydroxylation and cyclization [27]. OsPS (japonica subspecies) is a tetracyclic parkeol synthase. In contrast, OsOS (indica subspecies) synthesizes a novel pentacyclic triterpene, orysatinol, and 12 other triterpenes, and key amino acid residues were found to determine the functional divergence between OsPS and OsOS [28]. OsABAS is a multifunctional enzyme annotated as achilleol B synthase, showing functions of both α-AS and β-AS enzymes [29]. The OSCs of Pan were classified into two groups, CAS and β-AS. CASs synthesize the tetracyclic triterpene skeleton and may be involved in the synthesis of sterols. SQEs act as rate-limiting enzymes in the steroid biosynthesis pathway [30]. The β-ASs detected in Pan are pentacyclic triterpene (oleanane) synthases, which is consistent with the many oleanane-type triterpenoids isolated from Pan [10,14,15].

SQS/SQE/OSC Gene Family Identification
The sequenced and annotated genome data of Potentilla anserina (Pan) [16], Potentilla micrantha (Pmi) [31], Fragaria vesca (Fve) [32], Rosa rugosa (Rru) [22], and Pyrus pyrifolia (Ppy) [33] of Rosaceae were obtained from the Genome Database for Rosaceae (GDR) (https://www.rosaceae.org/tools/jbrowse, accessed on 14 March 2023), while the sequence of Vitis vinifera (Vvi) [20] was downloaded from the National Center for Biotechnology Information (NCBI) database. The sequences of Oryza sativa (Osa) and Arabidopsis thaliana (Ath), serving as outgroups, were also downloaded from the NCBI. PF13249 and PF13243, downloaded from Pfam (http://pfam.xfam.org/, accessed on 14 March 2023), were applied to search for OSC-coding proteins using the hidden Markov model (HMM) (E-value < 1 × 10⁻²⁰). In order to reduce the deviation caused by different gene annotation methods across genomes, we used homolog-based prediction (HBP) to predict the number of genes. The strategy of HBP is as follows. Firstly, the OSC protein sequences downloaded from the NCBI (Table S1) were aligned to the genome using TBLASTN with an E-value < 1 × 10⁻⁵. Secondly, the high-scoring pairs (HSPs) were conjoined by Solar (https://github.com/gigascience/papers/tree/master/zhou2013/MTannotationBGI/solar, accessed on 14 March 2023) to determine the rough genomic region of each gene. Thirdly, we extended the Solar region by 2 kb both upstream and downstream, extracted it, and defined gene models using GeneWise (v2.4.1) [34]. We also filtered out the predictions with less than 30% coverage. The functional annotation of the resulting genes was carried out with the Pfam, KEGG (https://www.genome.jp/kegg/, accessed on 14 March 2023), and Swiss-Prot (http://www.ebi.ac.uk/sprot, accessed on 14 March 2023) databases. Motifs were recognized using MEME (https://meme-suite.org/meme/tools/meme, accessed on 14 March 2023), with a default number of 6. Furthermore, the domains were predicted with the CDD (https://www.ncbi.nlm.nih.gov/cdd/, accessed on 14 March 2023) database. We combined the prediction results of Pfam and HBP, and then retained the accurate ones. The identification of SQSs (PF00494) and SQEs (PF08491) was carried out using the same method (Table S1).
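Two steps of the HBP strategy described above, extending each candidate region by 2 kb on both sides and discarding predictions with less than 30% coverage, can be sketched in Python as follows. This is a minimal illustration and not the authors' code: the `Candidate` record, its field names, and the definition of coverage as aligned query length divided by full query length are all assumptions made for the example, and the coordinates are invented.

```python
from dataclasses import dataclass
from typing import List, Tuple

FLANK = 2000         # 2 kb extension on each side, as stated in the text
MIN_COVERAGE = 0.30  # predictions below 30% coverage are discarded


@dataclass
class Candidate:
    """Hypothetical record for one conjoined hit region (1-based, inclusive)."""
    query_id: str
    start: int
    end: int
    query_len: int    # length of the reference protein
    aligned_len: int  # number of reference residues covered by the prediction


def extend_region(start: int, end: int, chrom_len: int, flank: int = FLANK) -> Tuple[int, int]:
    """Extend a region by `flank` bp on both sides, clamped to the chromosome."""
    return max(1, start - flank), min(chrom_len, end + flank)


def coverage(c: Candidate) -> float:
    """Assumed definition: fraction of the reference protein that is covered."""
    return c.aligned_len / c.query_len


def filter_candidates(cands: List[Candidate]) -> List[Candidate]:
    """Keep candidates whose coverage is at least MIN_COVERAGE."""
    return [c for c in cands if coverage(c) >= MIN_COVERAGE]


# Invented example data for illustration only.
cands = [
    Candidate("ref_OSC_1", 10_000, 12_000, query_len=760, aligned_len=380),
    Candidate("ref_OSC_2", 1_500, 5_000, query_len=760, aligned_len=100),
]
kept = filter_candidates(cands)
print([c.query_id for c in kept])           # ['ref_OSC_1']
print(extend_region(10_000, 12_000, 1_000_000))  # (8000, 14000)
print(extend_region(1_500, 5_000, 6_000))        # (1, 6000)
```

Clamping the extended window to the chromosome boundaries avoids requesting sequence that does not exist when a hit lies close to a chromosome end.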

Conclusions
Triterpenes are major pharmacodynamic substances of Pan, which show good anti-inflammatory and hepatoprotective activity. In this study, we analyzed members of the OSC, SQS, and SQE gene families in the triterpene synthesis pathway using the Pfam and HBP methods. The results provide a research basis for understanding the biosynthetic pathway of triterpenes in Pan by clarifying the function and evolutionary relationships of triterpene-synthesis-related genes.

Conflicts of Interest:
The authors declare no conflict of interest.