Transcriptome Analysis Reveals Putative Genes Involved in the Lipid Metabolism of Chaulmoogra Oil Biosynthesis in Carpotroche brasiliensis (Raddi) A.Gray, a Tropical Tree Species

Abstract: Chaulmoogra oil is found in the seeds of Carpotroche brasiliensis (Raddi) Endl. (syn. Mayna brasiliensis Raddi), an oil tree of the Achariaceae family and native to Brazil’s Atlantic Forest biome, which is considered the fifth most important biodiversity hotspot in the world. Its main constituents are cyclopentenic fatty acids. Chaulmoogra oil has economic potential because of its use in the cosmetics industry and as a drug with anti-tumor activity. The mechanisms related to the regulation of oil biosynthesis in C. brasiliensis seeds are not fully understood, especially from a tissue-specific perspective. In this study, we applied a de novo transcriptomic approach to investigate the transcripts involved in the lipid pathways of C. brasiliensis and to identify genes involved in lipid biosynthesis. Comparative analysis of gene orthology, expression analysis and visualization of metabolic lipid networks were performed, using data obtained from high-throughput sequencing (RNAseq) of 24 libraries of vegetative and reproductive tissues of C. brasiliensis. Approximately 10.4 million paired-end reads (Phred (Q) > 20) were generated and re-assembled into 107,744 unigenes, with an average length of 340 base pairs (bp). The analysis of transcripts from different tissues identified 1131 proteins involved in lipid metabolism and transport and 13 pathways involved in lipid biosynthesis, degradation, transport, lipid bodies, and lipid constituents of membranes. This is the first transcriptome study of C. brasiliensis, providing basic information for biotechnological applications of great use for the species, which will help understand chaulmoogra oil biosynthesis.


Introduction
Carpotroche brasiliensis (Raddi) Endl. (syn. Mayna brasiliensis Raddi) is a tree of the Achariaceae family; the seeds are composed of 70% oil [1], including mainly linear cyclic fatty acids and triacylglycerols [2]. Chaulmoogra oil is composed mainly of cyclopentenyl fatty acids (AGCs) [3], and to a lesser extent, fatty acids such as palmitic, palmitoleic, stearic, oleic, and linoleic acids [4]. The knowledge about the biosynthetic pathway of fatty acids present in the seeds of C. brasiliensis is still scarce. Existing information is restricted to the chemical reactions and components associated with its production [4].
Cyclopentenyl fatty acids present in chaulmoogra oil have a well-defined chemical structure comprising a 5-membered unsaturated ring attached to a linear side chain terminated by a carboxylic acid [5]. Evidence suggests that AGCs arise from the elongation of the aleprolic acid chain, possibly obtained from cyclopentenylglycine. The conversion of cyclopentenylglycine into aleprolic acid may occur by transamination and oxidative decarboxylation. Activated aleprolic acid is then elongated to cyclopentenyl fatty acids [6]. These findings indicate that (i) cyclopentenylglycine is the precursor of cyclopentenyl fatty acids, and (ii) unsaturated linear chain acids undergo cyclization. Studies of the biosynthesis pathway of cyclopentenic fatty acids in the leaves and chloroplasts of Caloncoba echinata (Oncobeae) indicate that cyclopentenic fatty acids are synthesized from aspartate and pyruvate or glutamate and acetate [2]. Therefore, the chaulmoogra oil biosynthesis pathways suggested in previous studies are quite controversial.
Given the scarcity of information about the components of the chaulmoogra oil biosynthetic pathway, identifying genes/transcripts encoding enzymes involved in synthesizing these fatty acids may provide valuable information for future studies. The discovery of genes encoding enzymes involved in plant specialized metabolism (such as the synthesis of chaulmoogra oil) represents a unique challenge. Complex enzymatic networks can produce several byproducts instead of a single compound [7]. An additional complication is that different plant species vary considerably in their oil content and fatty acid composition [8]. This is particularly true of the pathway of cyclopentenic acids, for which there are few studies. Additionally, chaulmoogra oil-producing species do not have genomic information available in public databases, which makes omics studies difficult. Such studies are relevant for accessing and understanding biosynthetic pathways, identifying new genes and understanding the mechanisms of oil accumulation.
Our study aimed to uncover the mechanisms involved in chaulmoogra oil biosynthesis, answering the following questions: (i) Which lipid metabolism genes are present in different tissues of C. brasiliensis? and (ii) What are the expression levels of these genes in different tissues?
Here, we report the first transcriptome of the tropical tree C. brasiliensis and seek new insights into the molecular bases of lipid biosynthesis. Our study provides rich genetic information that will be useful for understanding the synthesis of chaulmoogra oil.

Plant Material
C. brasiliensis trees were randomly chosen to compose the sample group. Collections were carried out on cocoa farms, with authorization from the cooperative of seed collectors of the Cabruca agroforestry system, in the municipalities of Camamu and Maraú, Bahia, Brazil (Supplementary Table S1). Voucher specimens have been stored at the Santa Cruz State University herbarium (RH-Uesc 20315-RH-Uesc 20316); the species was identified by José Lima da Paixão. The material was immediately frozen in liquid nitrogen and stored in an ultrafreezer at −80 °C before RNA extraction. Based on the quality of the extracted RNA, 24 samples were selected for library development. Thus, each library was developed from the extraction of 1 tissue from 1 tree. The samples were collected in triplicate to compose the 24 developed libraries, using 12 buds and 12 other plant tissues, namely: 2 leaves, 2 roots, 2 flowers, 3 seeds, pulp, and the skin pool of 3 fruits.

RNA Extraction and cDNA Library Preparation
The RNAqueous® Total RNA Isolation Kit (AM1912) was used to perform the total RNA extraction, following the manufacturer's recommendations. The integrity and amount of each RNA sample were evaluated in a NanoDrop spectrophotometer (NanoDrop, Wilmington, DE, USA). The cDNA libraries were built from 10 ng of total RNA using the NEBNext Ultra RNA Library Prep Kit for Illumina (E7530S) and NEBNext Multiplex Oligos for Illumina (E7335), following the manufacturer's protocols (Illumina, CA, USA). The libraries were quantified with the KAPA Library Quantification Kit and an Agilent 2100 Bioanalyzer. A total of 24 cDNA libraries were sequenced on an Illumina MiSeq with 2 × 250 bp paired-end reads. Raw sequencing data were deposited at the NCBI Sequence Read Archive (SRA) under project number PRJNA858666.

Quality Control and Assembly
The Trimmomatic v.0.32 software (https://github.com/timflutre/trimmomatic, accessed on 18 October 2022) was used to verify the quality of the reads. We used a Phred score of Q > 20 as a threshold to trim low-quality bases from the ends of the reads. High-quality reads were used to assemble a de novo transcriptome using the Trinity v.2.1.1 software (https://github.com/trinityrnaseq/trinityrnaseq, accessed on 18 October 2022) following an analysis pipeline (Supplementary Figure S1).
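The end-trimming step described above can be illustrated with a minimal sketch. It is not Trimmomatic's implementation: the function names are hypothetical, Phred+33 encoding is assumed, and bases are kept only when Q > 20, following the threshold stated in the text.

```python
def phred_scores(quality: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def trim_ends(seq: str, quality: str, threshold: int = 20) -> tuple[str, str]:
    """Drop bases from both ends of a read while their score is <= threshold."""
    scores = phred_scores(quality)
    start, end = 0, len(scores)
    # Walk inward from the 5' end until a base with Q > threshold is found.
    while start < end and scores[start] <= threshold:
        start += 1
    # Walk inward from the 3' end in the same way.
    while end > start and scores[end - 1] <= threshold:
        end -= 1
    return seq[start:end], quality[start:end]


if __name__ == "__main__":
    # '#' decodes to Q2, 'I' to Q40 and '5' to Q20 under Phred+33.
    print(trim_ends("ACGTACGT", "##IIII#5"))
```

A read whose bases all fall at or below the threshold comes back empty and would normally be discarded before assembly.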

Quality of Transcripts, Complete Transcripts, and Super Transcripts
The transcripts' quality and completeness were assessed using the Benchmarking Universal Single-Copy Orthologs software (BUSCO 3.1.0) (https://github.com/openpaul/busco, accessed on 18 October 2022), based on ortholog groups from Arabidopsis thaliana and Solanaceae. BUSCO was chosen because it allowed us to detect the natural variation of conserved genes within clades, based on evolutionarily informed expectations of gene content of near-universal single-copy orthologs for a variety of eukaryotic clades. The indexing and alignment of transcripts was carried out using Bowtie2 2.2.5 (https://github.com/BenLangmead/bowtie, accessed on 18 October 2022). The trimmed transcripts of each library were aligned against the indexed reference and subsequently analyzed with RSEM (http://deweylab.github.io/RSEM/, accessed on 18 October 2022) to estimate the expression values normalized by transcripts per million (TPM) in each RNA-Seq library.
The TransDecoder software 5.5.0 (https://github.com/TransDecoder/TransDecoder/, accessed on 18 October 2022) was used to identify candidate coding regions with start and end transcript sequences. For this purpose, the Basic Local Alignment Search Tool (BLAST) was used against the Uniref and Pfam 31.0 databases (http://pfam.xfam.org, accessed on 18 October 2022). This allowed us to predict the complete sequences of the isoforms selected by TransDecoder, as well as their protein sequences. Pfam is an extensive collection of protein families, represented by multiple sequence alignments and hidden Markov models (HMMs). Only complete sequences (i) with start and end transcription sequences, (ii) showing high similarity with curated sequences from the Uniref database, and (iii) with functional domains based on the statistical alignment of Pfam were considered.
The Trinity software (https://github.com/trinityrnaseq/trinityrnaseq, accessed on 18 October 2022) was used to reconstruct the primary transcript sequence and obtain super transcripts. This analysis recognizes unique and common sequence regions between isoforms and merges these isoforms into a single linear sequence. Super transcripts are useful in de novo analyses without a reference genome, since they yield a gene sequence similar to what can be obtained when sequencing the genome.
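As an illustration of the TPM normalization mentioned above, the following sketch computes TPM from per-transcript read counts and lengths. It is a simplified stand-in and not RSEM's estimation procedure: RSEM works with expected counts and effective lengths, while the function name and the plain inputs used here are assumptions.

```python
def tpm(counts: list[float], lengths: list[float]) -> list[float]:
    """Convert read counts to transcripts per million (TPM).

    Each count is first divided by the transcript length (reads per base),
    then the rates are rescaled so that they sum to one million.
    """
    if len(counts) != len(lengths):
        raise ValueError("counts and lengths must have the same size")
    rates = [count / length for count, length in zip(counts, lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    return [rate / total * 1_000_000 for rate in rates]


if __name__ == "__main__":
    # Three hypothetical transcripts with the same read density.
    print(tpm([10, 20, 30], [1000, 2000, 3000]))
```

Because the values in every library sum to one million, TPM values can be compared across libraries of different sequencing depth.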

Functional Annotation
The functional annotation of the C. brasiliensis transcripts was generated with the EggNOG mapper, using HMM searches to obtain the EggNOG 4.5 orthology data (Supplementary Figure S1). For all transcripts, the clusters of orthologous groups (COGs) and archaeal clusters of orthologous genes (arCOG) were assigned [9]. The sequences of C. brasiliensis proteins were also annotated based on the InterProScan databases (http://www.ebi.ac.uk/InterProScan/, accessed on 18 October 2022) and the Pannzer 2 tool (protein annotation using Z-score) [10]. These analyses recovered the EC protein codes, which were used in the KEGG mapper [11] for the identification of lipid pathways. Annotations with ARGOT_PPV greater than 0.35 were selected from the results obtained with Pannzer 2. The gene ontology (GO) categories of fruit and seed libraries were retrieved and formatted according to the native format of the WEGO platform (Web Gene Ontology Annotation Plot) (http://wego.genomics.org.cn/cgi-bin/wego/index.pl, accessed on 18 October 2022) to obtain a bar plot for categorization into cellular components, molecular functions, and biological processes.
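The PPV cut-off applied to the Pannzer 2 results can be sketched as a simple filter. The record layout below (dictionaries with hypothetical "protein", "go_id" and "ppv" keys) is an assumption made for illustration and does not reproduce the actual Pannzer 2 output format.

```python
def filter_by_ppv(annotations: list[dict], min_ppv: float = 0.35) -> list[dict]:
    """Keep annotations whose PPV is strictly greater than min_ppv."""
    return [record for record in annotations if record["ppv"] > min_ppv]


if __name__ == "__main__":
    # Hypothetical records; identifiers are placeholders, not real results.
    records = [
        {"protein": "unigene_1", "go_id": "GO:0006629", "ppv": 0.62},
        {"protein": "unigene_2", "go_id": "GO:0006633", "ppv": 0.35},
        {"protein": "unigene_3", "go_id": "GO:0016042", "ppv": 0.10},
    ]
    for record in filter_by_ppv(records):
        print(record["protein"], record["go_id"], record["ppv"])
```

The comparison is strict, so an annotation with a PPV of exactly 0.35 is not retained, matching the "greater than 0.35" wording.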

Phylogenetic Analysis
The 18S ribosomal sequence of C. brasiliensis was identified by searching for putative homologs of the 18S A. thaliana sequence. The 18S transcript of C. brasiliensis was then searched against the Non-Redundant Database (NR) of the National Center for Biotechnology Information (NCBI) to obtain the 18S data of the species used in the phylogeny construction. Redundant sequences were excluded and at least one copy of each genus was maintained. Multiple nucleotide alignment (ClustalW) of 79 sequences was performed using the MEGA X 10.1 software [12]. The model test was also performed using MEGA X 10.1. Genetic relationships between accessions were calculated using a distance matrix obtained by the Tamura-Nei nucleotide substitution model [13]. The identified taxa were grouped based on the maximum likelihood method. The consistency of the clustering patterns was assessed by 1000 bootstrap replications. The generated dendrogram was edited with the Figtree v.1.4.4 software (http://tree.bio.ed.ac.uk/, accessed on 18 October 2022).

Analysis of Orthologous Gene Families
The ARALIP website platform was used to select gene families related to lipids in A. thaliana (http://aralip.plantbiology.msu.edu/pathways/pathways, accessed on 18 October 2022). Searches for the term "fatty acids" were carried out in the Phytozome database v.12 (https://phytozome.jgi.doe.gov/, accessed on 18 October 2022), including searches for gene sequences related to lipid metabolism in A. thaliana, Glycine max, and Populus trichocarpa.
BLASTp was used to compare the lipid families of the model species against the C. brasiliensis protein database. The BLASTp result was used to further analyze gene orthology in OrthoMCL v.1.4 [14] with the standard inflation parameter I = 1.5. OrthoMCL was used to obtain the best hits and explore similarity measures to ortholog and paralog groups. The Markov clustering algorithm was applied, and only families with one species were excluded, but those with at least one C. brasiliensis gene were kept.
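The family filter applied after clustering can be sketched as follows. This reflects one reading of the rule, assumed here: a family is kept when it spans at least two species and contains at least one C. brasiliensis gene. The species codes, gene identifiers and data layout are hypothetical and do not reproduce the OrthoMCL output format.

```python
Family = list[tuple[str, str]]  # (species code, gene identifier) pairs


def filter_families(families: dict[str, Family], target: str = "cbr") -> dict[str, Family]:
    """Keep multi-species families that include at least one target-species gene."""
    kept = {}
    for family_id, members in families.items():
        species = {code for code, _gene in members}
        if len(species) >= 2 and target in species:
            kept[family_id] = members
    return kept


if __name__ == "__main__":
    # Hypothetical families; "cbr" stands for C. brasiliensis, "ath" for A. thaliana.
    families = {
        "fam1": [("cbr", "cbr_001"), ("ath", "ath_101"), ("gma", "gma_201")],
        "fam2": [("ath", "ath_102"), ("ptr", "ptr_301")],
        "fam3": [("cbr", "cbr_002"), ("cbr", "cbr_003")],
    }
    print(sorted(filter_families(families)))
```

In this toy input only the first family passes: the second lacks a C. brasiliensis gene and the third contains a single species.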
All C. brasiliensis genes belonging to at least one of the families (identified by OrthoMCL) were associated with lipid synthesis in the species. The lipid genes obtained in OrthoMCL were used to perform the analysis in different tissues.

Heatmap
A heatmap was constructed with all the lipid-related transcripts identified in all libraries using OrthoMCL, Pannzer 2, EggNOG mapper, KEGG Mapper, and InterProScan. The heatmap was built to visualize the expression of the C. brasiliensis lipid transcripts identified in the different tissues with the aid of the "ComplexHeatmap" package (https://github.com/jokergoo/ComplexHeatmap, accessed on 18 October 2022) in the R version 4.0.1 environment (R Core Team). The TPM (transcripts per million) values were transformed into z-scores with the function "scale". Euclidean distance and grouping with complete linkage were used to add the dendrogram to the heatmap. From the functional annotation analyses, a gene ontology (GO) enrichment analysis was performed for each heatmap cluster, using BiNGO [15].
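The scaling and clustering steps described above can be sketched in plain Python. This is an illustration rather than the R code used in the study: scaling each transcript (row) across tissues is an assumption, the sample standard deviation is used as in R's "scale", and the clustering routine is a naive complete-linkage implementation with hypothetical function names.

```python
import math
import statistics


def zscore_rows(matrix: list[list[float]]) -> list[list[float]]:
    """Scale each row to mean 0 and sample standard deviation 1."""
    scaled = []
    for row in matrix:
        mean = statistics.fmean(row)
        sd = statistics.stdev(row)  # n - 1 denominator
        scaled.append([(x - mean) / sd if sd > 0 else 0.0 for x in row])
    return scaled


def complete_linkage(rows: list[list[float]], n_clusters: int) -> list[list[int]]:
    """Agglomerative clustering with Euclidean distance and complete linkage."""
    if n_clusters < 1:
        raise ValueError("n_clusters must be at least 1")
    clusters = [[i] for i in range(len(rows))]

    def linkage(a: list[int], b: list[int]) -> float:
        # Complete linkage: distance between the two farthest members.
        return max(math.dist(rows[i], rows[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]


if __name__ == "__main__":
    # Hypothetical TPM values: rows are transcripts, columns are tissues.
    tpm_values = [[1, 2, 10], [2, 4, 20], [10, 2, 1], [30, 5, 4]]
    print(complete_linkage(zscore_rows(tpm_values), n_clusters=2))
```

Scaling by row makes transcripts with very different absolute TPM values comparable, so the clustering groups them by expression pattern across tissues rather than by abundance.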

Construction of the Lipid Metabolism Model
The following tools were used to build the lipid metabolism model: (i) the OrthoMCL software to identify the orthologous lipid gene families; (ii) the Mercator4 v5.0 software (https://www.plabipd.de/portal/mercator4, accessed on 18 October 2022) to map the unigene file previously identified, obtaining a text file with one or more BINs per protein; (iii) the MapMan program to visualize the expression and meta-analysis data, to annotate the plant omics data [16], to correlate each unigene to its expression level, and to identify the lipid metabolism pathways (X4.2 Lipid metabolism R2.0); (iv) Pannzer 2 [10] and InterProScan (http://www.ebi.ac.uk/InterProScan/, accessed on 18 October 2022) for protein annotation (focusing on keywords such as fatty, lipid, and desat to cover the largest number of genes from the lipid pathway); (v) the KEGG Mapper platform [11] to retrieve the protein EC (Enzyme Commission) numbers, to automatically assign KO identifiers (KEGG Orthology; https://www.kegg.jp/ or https://www.genome.jp/kegg/, accessed on 18 October 2022) and to map the described lipid metabolism pathways (biosynthesis of fatty acids and unsaturated fatty acids, and elongation of fatty acids); and (vi) the "ComplexHeatmap" package (https://github.com/jokergoo/ComplexHeatmap, accessed on 18 October 2022) in R version 4.0.1 to generate a heatmap from the retrieved EC numbers. For heatmap generation, the TPM values were transformed to z-scores with the "scale" function, while for clustering, the Euclidean distance and clustering with complete linkage function were used. The lipid pathways and the expression data were manually associated in a general scheme.
The GC content was 41.37% and the N50 was 696,928 bp. We used TransDecoder, which reduced the total number of transcripts and the number of unigenes but increased the average N50 length (Supplementary Table S2).
The total number of initial unigenes was 263,562; after filtering with TransDecoder, the total number of unigenes was 12,908.
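For readers unfamiliar with the N50 statistic reported above, the following sketch shows how it is commonly computed from a list of contig or transcript lengths. The function name and the example lengths are illustrative and unrelated to the assembly described here.

```python
def n50(lengths: list[int]) -> int:
    """Return the N50 of a set of sequence lengths.

    The N50 is the length L such that sequences of length >= L
    together contain at least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return ordered[-1]  # not reached for positive lengths


if __name__ == "__main__":
    # Hypothetical lengths: total 54 bp, so half is 27 bp (10 + 9 + 8).
    print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

Removing many short sequences lowers the total number of bases contributed by small fragments, which is why filtering steps tend to raise the N50.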
The quality of the assembly allowed us to detect 73.5% (BUSCO, Table S3) of the genes found in A. thaliana. This result can be considered satisfactory since the species C. brasiliensis is a non-model organism, and studies with genomes of non-model species generally report BUSCO scores ranging from 50% to 95%, depending on aspects of the species (genome size, number of repetitive elements), and their taxonomic position [17].
A total of 38,841 proteins were detected. Of them, 21,393 (55.07%) proteins were annotated with GO terms, based on functional annotation of the transcripts. In the three main GO categories (biological processes, cellular components, and molecular functions), proteins assigned to the activity subcategories "metabolic process", "cell" and "cell process" were found in higher percentages (Supplementary Figure S2). Similar results were observed in RNA-Seq data for Brassica napus L. during seed maturation [18]. We managed to detect new proteins for C. brasiliensis, as well as finding similarities in functional categories of proteins in comparison to model species.
Search for proteins based on sequence identity found 38,841 sequences, which were categorized into 25 functional groups. Proteins with "unknown function" represented the largest group (8416, or 21.6%), followed by "post-translational modifications" proteins (2685, or 6.9%), proteins related to "signal transduction mechanisms" (2669, or 6.8%). Proteins associated with "cellular motility" (41, or 0.10%), and "extracellular structures" (64, or 0.16%) were the smallest groups. No proteins were associated with the category "general function prediction" (Figure 1). The shorter sequences may lack a characterized protein domain or may be too short to show sequence matches, resulting in false-negative results. Because genomic and transcriptomic information is currently lacking for C. brasiliensis in databases, these cases of no hits can be considered putative novel protein sequences.
The metabolic network analysis of C. brasiliensis revealed 148 pathways, but the combined global map showed 742 metabolic pathways (ko01100) and 334 secondary metabolite biosynthesis enzymes (ko01110). This result reveals that C. brasiliensis can be considered a plant rich in secondary metabolites. Due to the commercial importance of the oil in C. brasiliensis, our study focused on "lipid metabolism", for which 13 pathways were identified (Figure 2).
A large number of sequences involved in lipid synthesis and metabolism was predicted, as well as the biochemical pathways involved in the synthesis of chaulmoogra oil. We found 1131 unigenes that possibly encode proteins associated with metabolism and lipids, representing 2.9% of total proteins, in all libraries of C. brasiliensis (Figure 1).

Phylogenetic Analyses with 18S: A Search for Related Species Producing Lipids of Economic Importance
Searches for transcripts including the ribosomal 18S from C. brasiliensis revealed 78 similar sequences, which matched lipid producing tree species or other plant species that produce secondary metabolites, such as Cannabis sativa, Triadica sebifera, and Citrus clementina. When comparing these RNAs with A. thaliana and other oilseed models, such as Glycine max, Populus trichocarpa, and species of the families Achariaceae and defunct Flacourtiaceae, the phylogenetic tree indicated a division of all species into three major groups (Supplementary Figure S3). C. brasiliensis showed high proximity with Camptostylus manii, Dasylepis brevipedicellata, and Erythrospermum phytolaccoides, which belong to the Achariaceae family (Supplementary Figure S3).
It was interesting to find that the model plant species Arabidopsis thaliana is relatively closely related to C. brasiliensis (Supplementary Figure S3) according to the 18S gene sequence analysis, as well as in terms of gene completeness (Supplementary Table S3). Thus, based on the phylogeny results obtained with 18S sequences in the present study, we chose A. thaliana as a biological model in the subsequent analyses of groups of orthologous genes.

Orthologous Groups Involved in Lipid Metabolism
The main family of lipid genes found in our results was the chaperones (Figure 3). A previous study with the Chinese tallow tree (Triadica sebifera) evaluated proteins in lipid droplets of the fruit mesocarp, and numerous proteins were found related to signal transduction and activity similar to molecular chaperones [19]. This study identified proteins with similar functions to chaperones, such as annexin D8, aspartic proteinases, bcl-2-associated athanogene (BAG), and aquaporin. Therefore, we believe that chaperones are important for the synthesis of lipids.

The second most numerous gene family among species was that of lipids involved with cell membranes, such as galactolipids, sulfolipids, and phospholipids (Figure 3).
Our results suggest that seven putative enzymes identified in the fast oil accumulation stage may be involved in the synthesis of triacylglycerols (1-acyl-sn-glycerol-3-phosphate acyltransferase; phosphatidate phosphatase; diacylglycerol O-acyltransferase; phospholipid:diacylglycerol acyltransferase; phosphatidylcholine:diacylglycerol cholinephosphotransferase; lysophospholipid acyltransferase; and phospholipase) (Supplementary Table S4), suggesting their involvement in oil accumulation in C. brasiliensis seeds.

Special Characteristics of Lipids in Fruits and Seeds
In C. brasiliensis, lipids are primarily stored in seeds. We identified a high level of gene expression for oil synthesis in the seeds and immature fruits of C. brasiliensis. Five clusters represent the biological processes involved in lipid metabolism, including: Cluster 1 (acetyl-CoA metabolism and lipid A metabolism); Cluster 2 (lipid A metabolism and seed oil body biogenesis); Cluster 3 (unsaturated fatty acid metabolism and acylglycerol metabolism); and Cluster 4 (fatty acid oxidation and glycerolipid biosynthesis) (Figure 4).

We also observed 106 fatty acid enzymes (LACS, HCD, SAD, FAD5, oleosin, caleosin) in Cluster 2 (Figure 4 and Supplementary Table S4). Additionally, we identified four oleosins and three caleosins that may play a role in the accumulation and maintenance of chaulmoogra oil. For the formation of unsaturated fatty acids (Cluster 3) (Figure 4), such as those in chaulmoogra oil, six transcripts encoding fatty acid desaturases (FAD) were identified, including FAD5 and stearoyl-ACP desaturase (SAD), which removes two hydrogen atoms from stearic acid (C18: 0) to form oleic acid (C18: 1) (Supplementary Table S4).
One unigene (phospholipase A) encoding lipases (Supplementary Table S4) was identified from our libraries. Lipases present in developing castor bean may be involved in the remodeling of triacylglycerols (TGs) after synthesis [20]. The function of these lipases in developing C. brasiliensis seeds remains unclear.

Special Features of Lipids in Flower Tissue
The genes involved in lipid metabolism were analyzed and identified on a scale (Z-score) with relative levels between transcripts and tissues, with a specific focus on the synthesis of fatty acids and triacylglycerol accumulation/storage pathways (Figure 4). Most of the genes had a strong contrast between seed and non-seed lipid genes, showing substantially different levels of expression among tissues. The libraries with the highest levels of expression were the flower, floral bud, immature fruit, and leaf libraries (Figure 4). Although seeds are by far the largest commercial sources of oils from C. brasiliensis, oil is also abundant in many other tissues.

Special Features of Root and Leaf Lipids
We detected transcripts related to the synthesis of lipids in the root library of C. brasiliensis (Cluster 5) (Figure 4). Such results were expected, since suberin layers serve as an infection barrier. The wax surface influences plant-insect interactions, and helps to prevent germination of pathogenic microbes [21].

Metabolic Pathways Related to Oil Biosynthesis
The organelles involved in the synthesis and elongation of fatty acids are chloroplasts (de novo synthesis of fatty acids) and rough endoplasmic reticulum (Figure 5).

Figure 5. Lipid biosynthesis pathways and transcript expression patterns in immature and mature Carpotroche brasiliensis seeds and fruits. The transcript patterns for enzymes involved in glycolysis reactions in mitochondria. Fatty acid synthesis occurs in plastids, and in the endoplasmic reticulum, triacylglycerol synthesis occurs (the elongation and establishment of fatty acids). The gene expression heatmap (Z-score), where an average of the expression of the libraries of immature and ripe fruits was performed. The color corresponds to the Z-score per gene that is calculated from TPM. The color intensity indicates the relative level of expression. The first column represents the library of immature seed, immature fruit, immature fruit, ripe seed, and ripe fruit, respectively. The enzymes found on this route are marked in red. The gray box represents unsaturated fatty acids. Abbreviations: FabA, 3-hydroxyacyl-ACP dehydratase II; FabB, β-ketoacyl-ACP synthase I; FabD, ACP-S-malonyltransferase; FabF, 3-oxoacyl-ACP synthase II; FabG, 3-oxoacyl-ACP reductase; FabH, 3-oxoacyl-ACP synthase III; FabY, Acetoacetyl-ACP synthase; FabV, Enoyl-ACP reductase; FabK, Enoyl-ACP reductase II; FabZ, 3-hydroxyacyl-ACP dehydratase; ACP, acyl-carrier protein.
In terms of the synthesis of fatty acids in plastids, genes were differentially regulated between the libraries of immature and ripe fruits and seeds. For example, the genes for malonyl-CoA:ACP-S-malonyltransferase (FabD), enoyl-ACP reductase (FabV), and enoyl-ACP reductase II (FabK) show higher levels of expression in the tissues of immature seeds and fruits (Figure 5). In mature seeds and fruits, on the other hand, these genes are down-regulated.
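The per-gene Z-score used in the Figure 5 heatmap can be illustrated with a minimal sketch. It assumes that each gene is standardized across libraries directly from its TPM values using the population standard deviation; whether a log transformation or the sample standard deviation was applied in the original analysis is not stated, and the TPM values and the four library labels below are hypothetical.

```python
from statistics import mean, pstdev

def zscore_per_gene(tpm_by_gene):
    """Standardize each gene across its libraries: z = (TPM - mean) / SD."""
    scores = {}
    for gene, values in tpm_by_gene.items():
        mu = mean(values)
        sigma = pstdev(values)  # assumption: population SD (sample SD is equally plausible)
        if sigma == 0:
            # A gene with constant expression has no variation to display.
            scores[gene] = [0.0 for _ in values]
        else:
            scores[gene] = [(v - mu) / sigma for v in values]
    return scores

# Hypothetical library order and TPM values, for illustration only.
LIBRARIES = ["immature seed", "immature fruit", "ripe seed", "ripe fruit"]
toy_tpm = {
    "FabD": [40.0, 30.0, 10.0, 8.0],
    "FabV": [25.0, 35.0, 5.0, 15.0],
    "FabK": [12.0, 12.0, 12.0, 12.0],
}

z = zscore_per_gene(toy_tpm)
for gene, row in z.items():
    print(gene, dict(zip(LIBRARIES, (round(x, 2) for x in row))))
```

Because each gene is scaled against its own mean, the heatmap colors compare libraries within a gene and not absolute expression between genes.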
The table in Figure 5 shows that the lipids from C. brasiliensis seeds are composed mainly of saturated fatty acids, such as palmitic acid (C16:0) and stearic acid (C18:0), and unsaturated fatty acids, such as oleic acid (C18:1), linoleic acid (C18:2), alpha-linolenic acid (C18:3), and palmitoleic acid (C16:1). In C. brasiliensis, we observed the presence of 3-oxoacyl-[acyl-carrier-protein] synthase II (EC: 2.3.1.179), the main enzyme involved in determining fatty acid chain length (that is, the ratio of 16C to 18C fatty acids). We also observed many saturated fatty acids. Thus, we believe that, as already reported in a study of Carya illinoinensis [22], these lipid genes are important in the developing embryo.

Discussion
This is the first transcriptome and the first characterization of gene expression in leaves, flowers, floral buds, roots, fruits, and seeds of C. brasiliensis. Hence, it represents a unique transcriptomic resource for this chaulmoogra oil-producing species, endemic to the Atlantic Forest biome.
Based on the assembly results, the C. brasiliensis transcriptome is similar to those previously reported for members of the defunct Flacourtiaceae family, such as Idesia polycarpa [23]. These results suggest that our data are robust and of similar quality to the transcriptomes of other oilseed species. The 12,908 unigenes are considered a good estimate of the total number of genes for a tree species, since plant genomes are expected to have between 12,000 and 45,000 genes [24], suggesting that the transcripts we detected are relatively complete.
Additionally, we included Arabidopsis thaliana as a possible model species because it belongs to a relatively close group (Supplementary Figure S3) and because of its gene completeness (Supplementary Table S3). Thus, based on the phylogeny obtained with 18S sequences in the present study, we confirmed A. thaliana as the biological model for subsequent analyses of orthologous gene groups. We highlight that the 18S phylogeny data were critical in improving the understanding of the evolutionary history of C. brasiliensis, as well as in providing additional information on the family Achariaceae. By analyzing orthologous groups involved in lipid metabolism, we identified clusters of lipid gene families, including orthologous genes among C. brasiliensis, A. thaliana, G. max, and P. trichocarpa. Consequently, it was possible to identify orthologous genes with common ancestry among different species, since these genes tend to preserve the function of their ancestor. Therefore, our results indicate that the C. brasiliensis transcriptome can be used as a reference for studies of other phylogenetically close oil tree species. When another model, Solanum lycopersicum, was used to assess the C. brasiliensis transcriptome assembly, 49.6% of the searched BUSCO groups were found (Supplementary Table S3). This finding may be related to the evolutionary distance between the botanical families used to compare gene completeness. In this regard, A. thaliana would be more adequate for measuring gene completeness and verifying the quality of the C. brasiliensis transcriptome assembly. Studies using RNA-Seq data from 24 species of vascular plants have previously reported BUSCO scores between 60% and 85% [25].
In most plants, the three main unsaturated fatty acids (UFAs) are oleic (18:1), linoleic (18:2), and α-linolenic (18:3) acids. The biosynthetic pathway in Arabidopsis is taken as an example. Briefly, in plastids, fatty acids are synthesized de novo from acetyl-coenzyme A (CoA) through the joint action of acetyl-CoA carboxylase (ACC) and fatty acid synthase (FAS). Once produced, 18:0, conjugated to the acyl carrier protein (ACP), mainly enters the desaturation pathway mediated by a series of desaturases, and 18:1-ACP is rapidly formed by stearoyl-ACP desaturase (SAD). However, the biosynthesis of polyunsaturated fatty acids is coupled with that of membrane glycerolipids, which proceeds through two parallel pathways: the 'prokaryotic' pathway in the plastids and the 'eukaryotic' pathway in the endoplasmic reticulum [26]. The lipid unigenes identified here provide critical clues for cloning and identifying key functional genes involved in unsaturated fatty acid and triacylglycerol biosynthesis in C. brasiliensis seeds.
In comparison with other studies related to oil production in plants, our results indicate better coverage. In a study carried out with pecan (Carya illinoinensis), 153 unigenes associated with lipid biosynthesis were identified, including 107 unigenes for fatty acid biosynthesis, 34 for triacylglycerol biosynthesis, 7 for oil bodies, and 5 for transcription factors [22]. In another study, using peanut (Arachis hypogaea), the authors identified 654 unigenes involved in lipid metabolism and transport, which represented 4.5% of the total number of unigenes [27]. Furthermore, the total number of enzymes involved in the metabolism of glycerophospholipids in C. brasiliensis was 22, which is similar to that observed in studies with seeds of Camellia meiocarpa and Camellia oleifera, where 10 unigenes related to the metabolism of glycerophospholipids were found [28]. The similar results for these two genera (Carpotroche and Camellia) indicate that oil-producing species have similar lipid metabolism enzymes, even though they produce oils with different properties. The high number of libraries used in the present study may have contributed to obtaining higher and more robust numbers of unigenes related to the transport and metabolism of lipids in C. brasiliensis. Our metabolic network results show that the sequences related to the biosynthesis of secondary metabolites will be a good resource for future research into the biosynthesis of flavonoids, phytosterols, saponins, and other medicinal compounds of the species.
Our study also provides important information on the biological sequences of species belonging to the Achariaceae family and can contribute to elucidating the evolutionary history of this group. The phylogenetic relationships with similar species that also produce lipids (important compounds for the production of biodiesel) and essential oils of economic importance can help to better understand the mechanisms of oil production in plants. Knowledge of the evolutionary relationships of C. brasiliensis also improves the understanding of the homology of unknown genes, based on genes previously described in closely related species. This knowledge can help in the discovery of new genes involved in several metabolic pathways.
Previously, C. brasiliensis was classified as belonging to the Flacourtiaceae family. However, most taxonomic groups producing cyanogenic cyclopentene glycosides and flowers with unequal numbers of sepals and petals have already been reclassified [29]. This was the case for our target species and for others that produce cyclopentenyl fatty acids (AGCs). They were therefore placed in the family Achariaceae [29], a placement accepted in the latest classification of angiosperms (APG IV 2016) [30]. Consequently, Achariaceae needs broader phylogenetic and basic studies, which could identify uses of plants of the family beyond potential drug production. We consider that our 18S phylogeny data were fundamental to revealing the level of evolutionary relationship with A. thaliana, improving the understanding of the evolutionary history of C. brasiliensis, as well as providing additional information on the family Achariaceae.
Furthermore, we identified clusters of lipid gene families, including orthologous genes among C. brasiliensis, A. thaliana, G. max, and P. trichocarpa. Consequently, it was possible to identify orthologous genes with common ancestry among different species, since these genes tend to preserve the function of their ancestor. One interesting example of orthologous groups involved in lipid metabolism was the family of chaperone lipid genes (Figure 3). A previous study with the Chinese tallow tree (Triadica sebifera) evaluated proteins in lipid droplets of the fruit mesocarp and found numerous proteins related to signal transduction and with activity similar to that of molecular chaperones [19]. Our results identified proteins with functions similar to those of chaperones, such as annexin D8, aspartic proteinases, bcl-2-associated athanogene (BAG), and aquaporin. Therefore, we believe that chaperones are important for the synthesis of lipids such as chaulmoogra oil in C. brasiliensis.
Other important gene families for plant lipid synthesis are related to cell membranes. The assembly of triacylglycerols (TG) in the endoplasmic reticulum is integrated with the assembly of membrane glycerolipids, which has two possible routes leading to the formation of triacylglycerol [31]. In the Kennedy pathway, glycerol-3-phosphate acyltransferase (EC: 2.3.1.15), 1-acyl-sn-glycerol-3-phosphate acyltransferase (EC: 2.3.1.51), phosphatidic acid phosphatase (EC: 3.1.3.4), and diacylglycerol acyltransferase (EC: 2.3.1.20) are the sequential enzymes involved in the synthesis of triacylglycerols [31]. Recently, some additional acyl-CoA-independent reactions that contribute to the synthesis of TGs have been identified in developing seeds and other plant tissues, although the relative contributions of these alternative pathways to the accumulation of TG vary depending on the tissue and/or species [31].
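The sequential nature of the Kennedy pathway can be made explicit with a small sketch that stores the four enzymes in order and reports which steps are missing from a set of annotated EC numbers. The enzyme names and EC numbers are those given in the text; the substrate and product names are the standard textbook intermediates and are not taken from this study, and the set of detected EC numbers in the usage example is hypothetical.

```python
# Ordered steps of the Kennedy pathway: (enzyme, EC number, substrate, product).
# Intermediates are standard textbook names, added here as an assumption.
KENNEDY_PATHWAY = [
    ("glycerol-3-phosphate acyltransferase", "2.3.1.15",
     "glycerol-3-phosphate", "lysophosphatidic acid"),
    ("1-acyl-sn-glycerol-3-phosphate acyltransferase", "2.3.1.51",
     "lysophosphatidic acid", "phosphatidic acid"),
    ("phosphatidic acid phosphatase", "3.1.3.4",
     "phosphatidic acid", "diacylglycerol"),
    ("diacylglycerol acyltransferase", "2.3.1.20",
     "diacylglycerol", "triacylglycerol"),
]

def pathway_is_continuous(steps=KENNEDY_PATHWAY):
    """Check that each step's product is the next step's substrate."""
    return all(a[3] == b[2] for a, b in zip(steps, steps[1:]))

def missing_steps(detected_ecs, steps=KENNEDY_PATHWAY):
    """Return EC numbers of pathway steps absent from an annotation, in pathway order."""
    return [ec for _, ec, _, _ in steps if ec not in detected_ecs]

# Hypothetical annotation result, for illustration only.
detected = {"2.3.1.15", "2.3.1.20"}
print(pathway_is_continuous())
print(missing_steps(detected))
```

A check of this kind only shows whether a route is fully represented in the annotation; it says nothing about flux through the acyl-CoA-independent alternatives mentioned above.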
For oilseed and medicinal species, most oil synthesis genes were expected to be expressed in seeds and immature fruits. Our findings corroborate those of similar studies carried out with Hydnocarpus anthelminthica, which belongs to the same family as C. brasiliensis. A study evaluating the synthesis of AGCs showed a decline in the synthesis of cyclopentenyl fatty acids as seed maturity progressed, indicating that the activity ceases almost completely at full maturity [6].
The results of gene expression involved in lipid biosynthesis imply that the Acyl-CoA dependent TG biosynthesis pathway might be an active pathway in TG biosynthesis in C. brasiliensis seeds. Similar results were found for the oil species Sacha Inchi (Plukenetia volubilis L.) [32] and Brassica napus L. [18]. These identified proteins are putative enzymes in the synthesis of lipids and are also found in other oleaginous species. Furthermore, it is important to highlight that the expression of oleosin genes is generally closely associated with oil accumulation in developing seeds [33]. A similar pattern of lipid gene expression has been shown in studies with fruits of Vitellaria paradoxa [34] and with Idesia polycarpa [23], although there are no studies identifying the production of chaulmoogra oil by these species.
Oleic and linoleic acids are known constituents of chaulmoogra oil in C. brasiliensis seeds. Based on this knowledge, it is interesting to note that the functions of the six enzymes identified may be the molecular basis for the formation of polyunsaturated fatty acids in the seeds. In addition, the acyl-CoA-independent TG assembly pathway, including acyl editing and phosphatidylcholine/1,2-sn-diacylglycerol interconversion, is believed to facilitate the incorporation of polyunsaturated fatty acids into TG in some plant species [35]. Other studies have also identified classes of desaturases in oilseed plants, such as Sacha Inchi (Plukenetia volubilis L.) [32] and two species of Camellia [28].
Additionally, the results based on flower tissue reveal some very specific changes in expression patterns (Figure 4). High levels of expression in non-seed tissues of C. brasiliensis for transcripts associated with lipid synthesis may indicate not only distinct roles during oil accumulation and seed development but also tissue-specific differences in their functions, such as in flower buds and flowers. The flowers were found to contain several secondary metabolites, pigments, and complex molecules involved in cellular processes, which probably occur throughout development. Therefore, these genes are necessary for the proper regulation of cell proliferation and expansion, the development of reproductive tissues, and the sculpting of the final shape of the different organs [36]. Our results draw attention to the importance of lipids such as phosphatidylcholine, a phospholipid found in C. brasiliensis that is usually present in biological membranes (Figure 3). This class of lipids exhibits circadian oscillation and is also correlated with flowering, pointing to the important role of lipids in flowers in general [37]. On the other hand, the vital importance of plant surface wax in protecting tissue from environmental stresses is reflected in the huge commitment of epidermal cells to cuticle formation [38]. Another characteristic common to all terrestrial plants, identified in the annotation in this study, was cutin (Figure 2). Cutin is a hydrophobic substance that covers surfaces exposed to air to prevent non-stomatal water loss and protect against pathogens. Cutin and wax contain derivatives of very-long-chain fatty acids, such as alkanes and alcohols, with chain lengths > 20 carbons. The composition of the wax varies according to the species, organ, and developmental stage [38].
Similar to other studies of species of the Achariaceae family (e.g., Hydnocarpus anthelminthica and Caloncoba echinata) [6], our results indicate that cyclopentenyl fatty acids are synthesized not just in seeds but in other tissues as well, including leaves, as is the case in many other families of oil plants.
Our analyses of orthology, lipid metabolic pathways, and functional annotation were used to annotate the transcripts of C. brasiliensis encoding orthologs of putative plant enzymes involved in the biosynthesis of saturated and unsaturated fatty acids, as well as fatty acid elongation. These data were integrated and compiled to propose metabolic routes that lead to the accumulation of lipids in C. brasiliensis seeds (Figure 5). The de novo biosynthesis of fatty acids in plants occurs in plastids and is performed by a dissociable complex of monofunctional fatty acid synthase enzymes [22]. Briefly, the pyruvate dehydrogenase (PDH) complex generates acetyl-CoA, the building block used in the production of fatty acids. The first step in fatty acid biosynthesis is the conversion of acetyl-CoA to malonyl-CoA by acetyl-CoA carboxylase (ACC). The malonyl group is then transferred from CoA to the acyl carrier protein (ACP), and the condensation between malonyl-ACP and acetyl-CoA is catalyzed by the fatty acid synthase complex. This is the first in a series of sequential reactions of condensation, reduction, and dehydration, each round adding two carbons to the lengthening acyl chain. The acyl chains are finally hydrolyzed by acyl-ACP thioesterases, which release fatty acids [39]. Fatty acid synthases rely on a small protein, the acyl carrier protein (ACP), to carry the growing fatty acid from enzyme to enzyme during the elongation and synthesis of fatty acids.
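As a worked illustration of the two-carbon elongation described above, the sketch below counts the number of fatty acid synthase rounds needed to reach a given chain length. It assumes a two-carbon acetyl primer and an even-numbered, unbranched chain; the function name and parameters are hypothetical and are not part of the analysis pipeline.

```python
def fas_cycles(chain_length, primer_carbons=2, carbons_per_cycle=2):
    """Number of condensation rounds needed to build a chain of the given length.

    Assumes an acetyl (C2) primer extended by two carbons per round.
    """
    extra = chain_length - primer_carbons
    if extra < 0 or extra % carbons_per_cycle != 0:
        raise ValueError("chain length is not reachable by two-carbon additions")
    return extra // carbons_per_cycle

for name, carbons in [("palmitic acid (C16:0)", 16), ("stearic acid (C18:0)", 18)]:
    print(f"{name}: {fas_cycles(carbons)} elongation rounds")
```

Under these assumptions, palmitic acid requires seven rounds and stearic acid eight, which is why the enzyme controlling the final C16 to C18 extension influences the ratio between the two.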
Our results suggest that fatty acid biosynthesis, fatty acid elongation, and the tricarboxylic acid (TCA) cycle are all activated during seed and fruit development. Many of the enzymes involved in fatty acid metabolism were up- or down-regulated, and specific enzymes are critical in the biosynthesis of cyclopentenic fatty acids.

Conclusions
We conclude from the results generated and the analyses carried out that the C. brasiliensis transcriptome can significantly broaden the understanding of lipid metabolism and synthesis, since we developed a very comprehensive lipid unigene resource. It was possible to identify 13 lipid pathways and 1131 proteins involved in lipid metabolism and transport, as well as to identify fatty acids, such as oleic acid, linoleic acid, palmitic acid, and stearic acid, in chaulmoogra oil. From our results, we also conclude that the synthesis of chaulmoogra oil starts in immature seeds, where the highest number and expression values of lipid transcripts were found. However, we also found transcribed putative genes involved in the synthesis, metabolism, transport, and degradation of lipids in all tissues.
This is the first transcriptome study of C. brasiliensis, providing basic information for biotechnological applications of great use for the species, which will help understand chaulmoogra oil biosynthesis. The dataset developed in this study expands the sequence data available for C. brasiliensis. These resources can contribute to the discovery of new genes in developing seeds, which reinforces the importance of the present study. The basic information produced can be applied in future biotechnological approaches, providing a basis for obtaining improved varieties via genetic engineering, as well as for the development of molecular markers. Finally, the set of identified unigenes can also contribute to the future annotation and assembly of the complete genome sequence of C. brasiliensis. To our knowledge, this is one of the most complete and extensive efforts to annotate lipid genes from tropical non-timber trees.

Declarations
Ethics Approval and Consent to Participate
Ethical approval for the botanical material sampling of Carpotroche brasiliensis for this study was granted by the Brazilian "Biodiversity Authorization and Information System" (SISBIO-42397-1). Additionally, we obtained the genetic access authorization from the "National System for the Management of Genetic Heritage and Associated Traditional Knowledge" (SisGen-A99B6B8) for this research work, in compliance with relevant institutional, national, and international guidelines and legislation. The authors declare that they are in accordance with the IUCN Policy on Research Involving Endangered Species and the Convention on International Trade in Endangered Species of Wild Fauna and Flora. However, Carpotroche brasiliensis does not appear on any list of species at risk of or threatened with extinction.
Supplementary Materials: The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/f13111806/s1, Figure S1: Schematic overview of the structure of Carpotroche brasiliensis RNA-Seq data analysis; Figure S2: Gene Ontology (GO) analysis of the categories for immature and mature fruits and seeds from Carpotroche brasiliensis; Figure S3: Phylogeny based on the maximum likelihood method with 1000 bootstrap replicates and 18S gene sequences; Table S1: Data from samples and sequencing of RNA extraction from tissues of Carpotroche brasiliensis; Table S2: Assembly statistics of C. brasiliensis transcriptome by RNA-Seq and TransDecoder; Table S3: Assessment of transcriptome quality by BUSCO; Table S4: Enzymes involved in fatty acid biosynthesis and catabolism identified by the annotation of Carpotroche brasiliensis transcriptome.