Discovery of Anthocyanin Biosynthetic Pathway in Cosmos caudatus Kunth. Using Omics Analysis

: Cosmos caudatus Kunth. or “king’s salad” contains high values of nutritional compounds that act as health promoters. Although widely consumed for its medicinal value, information on phytochemical contents and their biosynthesis in the species is scarce. Among the interesting compounds are the anthocyanins that possess a dual role; an antioxidant and natural colorant. A complete anthocyanin biosynthetic pathway in C. caudatus was elucidated using transcriptomics, metabolomics, and anatomical approaches in this study. The transcriptomic analysis revealed genes encoding enzymes in the anthocyanin biosynthetic pathway and the genes encoding the transcription factors relevant to the latter pathway. A total of 11 anthocyanins of cyanidin, pelargonidin, and delphinidin derivatives that are signiﬁcantly abundant in the species were identiﬁed, correlating with the anthocyanin mainstream gene pathway. The occurrence of anthocyanin was further validated by light microscopy. Anthocyanin pigments in C. caudatus were detected at the epidermal layer of the leaf, stem, and ﬂower, and at the cortex of stem and root. To our knowledge, this is the ﬁrst work that has delineated the complete anthocyanin biosynthetic pathway in Malaysia’s underutilized plant, C. caudatus Kunth. This study correlated multi-omics data that will help integrate systems biology and synthetic biology, for a detailed understanding of the molecular mechanism and characterization of the anthocyanin biosynthesis using heterologous expression studies. In our study, a total of 82 unigenes belonging to 11 main enzymes in the anthocyanin biosynthesis have been discovered. Among these unigenes, 30 are mapped to three general phenylpropanoid enzymes (PAL, C4H, and 4CL), 30 are early (CHS, CHI, F3H, F3 (cid:48) H, and F3 (cid:48) 5 (cid:48) H) and 22 are late (DFR, LDOX, and UFGT) anthocyanin biosynthetic genes in which all these genes complete the whole anthocyanin biosynthetic pathway. The anthocyanin biosynthetic pathway begins from L-phenylalanine, which is then converted into chalcones by enzymes of phenylalanine ammonia-lyase (PAL; 16 unigenes), cinnamate 4-hydroxylase (C4H; 10 unigenes), 4-coumarate CoA ligase (4CL; 4 unigenes), and chalcone synthase (CHS; 7 unigenes) in the phenylpropanoid pathway. Along this line, stereospeciﬁc cycliza-tion of chalcones catalyzed by the enzyme chalcone isomerase (CHI; 6 unigenes) forms either naringenin or ﬂavanones. Hydroxylation of ﬂavanones by the enzyme ﬂavanone 3-hydroxylase (F3H; 8 unigenes) forms dihydroﬂavonols which are then converted to leucoanthocyanidins by dihydroﬂavonol 4-reductase (DFR; 12 unigenes) during the pathway branching towards the anthocyanin production. Conversion of leucoanthocyanidins into anthocyanidin was through the activity of leucoanthocyanidin dioxygenase (LDOX; 4 unigenes). In either form, anthocyanins were then produced by the enzyme UDP glucose activity: ﬂavonoid-3- O -glucosyltransferase


Introduction
Anthocyanin has garnered the attention of researchers due to its potent antioxidant properties and potential for lowering the risk of cancer [1], heart problems [2,3], regulating blood glucose level [4], and acting in preventing carcinogenesis [5]. Anthocyanin constitutes a class of flavonoids that play a pivotal role in the coloration of fruits and most flowers, as well as various plant tissues such as the root, stem, and leaves [6,7]. The word anthocyanin is derived from the Greek term anthos, flower, and kyanos, dark blue, referring to the colored pigments commonly found in angiosperms [8]. Anthocyanins are secondary metabolites derived from the phenylpropanoid pathway's biosynthetic network, which has been well-studied in various plant species such as Oryza sativa, Arabidopsis thaliana, and Malus domestica [9][10][11].
In general, two classes of genes are involved in this pathway; the structural genes encoding the enzymes that directly participate in forming anthocyanins and the regulatory genes controlling these structural gene's expression [12]. Environmental factors such as light, temperature, environmental stress, and loss of nutrient elements affect the expression levels of both structural and regulatory genes and correspondingly promote or inhibit the synthesis and accumulation of anthocyanin, resulting in different colorations in plant tissues [13]. The use of anthocyanins as a natural food colorant and additive approved by the Food and Drug Administration (FDA) and European nations (E-163) is, however, still limited due to their non-stable properties at different pH and temperatures [7].
C. caudatus is also known as king's salad in Malaysia due to its unique flavor. C. caudatus was found to have the highest phenolic content among other medicinal plants [14]. It is considered a potential source of dietary flavonoids and can be incorporated into the daily human diet [15,16]. In Malaysia, the plant is widely recognized for its pharmaceutical properties [17]. Due to its nutritional value, the demand for this species has drastically increased as various researchers have proven that its consumption helps regulate blood circulation and promote bone formation [18,19]. Other studies have demonstrated the medicinal effects of C. caudatus, such as being anti-hypertensive [20] and antioxidant [21][22][23], as well as its antimicrobial and antifungal activity [24,25]. Although C. caudatus is generally known to possess a myriad of bioactive phytochemicals, available reports on its anthocyanin metabolite profile are scarce. The total anthocyanin content of the species has been reported to be as high as 0.78 ± 0.05 mg/100 g of fresh weight sample using liquid extraction methods in comparison to other plants with intense color; in addition to the presence of several other metabolites such as quercetin, β-carotene, kaempferol, ascorbic acid, and ferulic acid [14,26].
Despite the importance of anthocyanin as a potential nutraceutical product, its biological pathways and specific compounds of interest produced by C. caudatus have not been studied. Lack of genomic information is one of the limiting factors in further exploration of the C. caudatus biosynthetic pathways contributing to its medicinal properties. This research gap has created a need to explore the species more comprehensively using omics technology. Omics technology has enabled the identification of whole complex frameworks underlying different mechanisms in plant metabolism. Next-generation sequencing (NGS) of RNA in transcriptome analysis has been widely used to analyze specific genes for identifying various biosynthetic pathways. This technique is crucial and advantageous in exploring non-model plants [27][28][29]. For plant metabolomic platforms, liquid chromatography-mass spectrometry (LC-MS) has been extensively used in profiling secondary metabolites, such as phenylpropanoids and alkaloids [30][31][32][33][34][35][36][37]. The correlation of information from multiomics analysis platforms will increase our understanding of the medicinal properties of C. caudatus and encourage information trade with other model plants [38].
Several studies focusing on metabolites from C. caudatus extracts of different tissues have been performed, using chromatography and spectroscopy analysis [21,25,[39][40][41]. More information needs to be gained to shed light on this important species; therefore, we further expanded the investigation by employing advanced and holistic omics methodologies. This is the first report of complementary approaches, in which the C. caudatus transcriptome analysis was combined with metabolite profiling and anatomical investigation of the leaf, flower, stem, and root, to describe the anthocyanin biosynthetic pathways in C. caudatus.

Plant Cultivation and Maintenance
C. caudatus seeds were sown in germination pots and transferred to the plot of an open field after two weeks, with an average daily sunlight of 7 h. The plants were watered daily, while fertilizer was applied every three weeks. Generally, the plant was grown in the warm and humid conditions of the tropics. Samples of leaf, flower, stem, and root tissues of C. caudatus were harvested from the experimental herbal plot at the UKM campus, Bangi, Malaysia, at the recorded GPS location 2.922165, 101.788304, after 10 weeks of growth, specifically when the plants began flowering, for further analysis.

RNA Extraction for Transcriptomic Analysis
Samples of each of the harvested tissues were frozen immediately in liquid nitrogen and stored at −80 • C. Each tissue sample was ground using a mortar and pestle with liquid nitrogen. According to the manufacturer's protocols, the total RNA from each tissue (powdered sample) was extracted using a Qiagen RNeasy Plant Mini Kit (Hilden, Germany). The quality check of the RNA was done using Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA, USA), where an RNA integrity number (RIN) of 7 or higher was used for sequencing [42]. One biological replicate for each tissue was used for sequencing, in which the obtained result for all four tissues was assembled for further analysis. All the apparatus and disposable items used throughout this study were maintained in a sterile condition.

cDNA Library Construction and Illumina Sequencing
The cDNA library preparation was done using SureSelect Strand-Specific RNA Library Prep (Agilent Technologies, Santa Clara, CA, USA). Oligo (dT) magnetic beads were used to isolate the RNAs and fragmentation buffer was used to fragment them into small pieces. Reverse transcription of the first and second-strand cDNA was obtained using First Strand Master Mix. cDNA was ligated with adapters to both ends of the fragments using a SureSelect Oligo Adaptor, followed by attaching DNA to the surface to allow bridge amplification and denaturation of the double-stranded molecules. The sequencing of cDNA libraries was performed using the Illumina Hiseq4000 150PE (Theragene Etex, Seoul, Korea) platform.

De Novo Transcriptome Assembly
Adapter sequences were removed from raw read data using "cutadapt" followed by raw data filter out reads with inaccurate bases >10% and <Q20 using an in-house script by TheragenEtex Bio Institute (Seoul, Gyeonggi-do, Korea). The remaining clean reads from the sample library were assembled. Trinity software version 2.0.2 was used to assemble the sequences using de novo assembly [43]. All transcripts were clustered using TGICL software to eliminate redundant sequences, and the most extended transcript in each cluster was selected as unigene [44].

Gene Annotation and Classification
The assembled unigene sequences were used to perform searches in public databases, including the NCBI non-redundant protein (NR) and UniProt. A homology search was done using BLASTx local and InterProScan, and setting an e-value threshold of <1 × 10 −5 . The Uniprot database id was uploaded to the online database PANTHER (http://www. pantherdb.org/ accessed online on 24 December 2019) to retrieve the gene ontology (GO) annotation, according to molecular function, biological process, and cellular components, apart from each sample's protein class. The same database was then used to perform a GO functional classification [45]. Pathway assignments were carried out based on the Kyoto Encyclopedia of Genes and Genomes (KEGG) database [46]. The coding region of the transcripts was identified using Transdecoder (http://transdecoder.sf.net accessed online on 15 November 2019). The transcription factors were predicted based on transcripts uploaded in the online database PlantTFDB version 5.0 (http://planttfdb.gao-lab.org/ index.php accessed online on 22 February 2020). The domains of the anthocyanin key genes using peptide sequences were searched against InterProScan software (https://www. ebi.ac.uk/interpro accessed online on 21 February 2021) [47].

Metabolite Extraction
The samples from the four tissues of three biological replicates each were immersed in liquid nitrogen before grinding using a mortar and pestle. For each of the tissues and its replicates, 1 g of the pulverized sample was transferred into a vial followed by the addition of 10 mL 100% MeOH for metabolite extraction [48]. The mixture was vigorously vortexed for 1 min, followed by sonication in a water bath (S30H, Elma, Singen, Germany) Agronomy 2021, 11, 661 4 of 23 at room temperature, at the maximum frequency of 37 kHz, for approximately 15 min. Filtration of the samples was carried out using a 0.22 µm nylon membrane filter with a syringe aid. The extracts were then transferred into autosampler glass vials before liquid chromatography-mass spectrometry (LC-MS) analysis. All the solvents and chemicals used in this study were HPLC grade.

LC-MS Analysis
Separation of the extracts was performed on a Thermo Scientific C18 reversed-phase column of Acclaim TM Polar Advantage II, 3 × 150 mm dimension, 3 µm particle size on a Dionex UltiMate 3000 ultra-high-performance liquid chromatography (UHPLC) system

Statistical and Multivariate Analysis
Data obtained from the time-of-flight (TOF) analyzer was processed using Compass Data Analysis software, version 4.2 (Bruker, Bremen, Germany). The data file obtained in .d format was converted to .csv (comma-separated values) format file. Signal-to-noise ratio value (S/N) of 10 and a smoothing width of 3 was applied to the entire data set to perform data analysis. The metabolome data was normalized to a reference feature of caffeine (m/z 195.0877), spiked as an internal standard in the samples. The data were also analyzed using one-way analysis of variance (ANOVA) in MetaboAnalyst 3.0 online software [50] for anthocyanins with p < 0.05 significant values. The heatmap was constructed using the default Euclidean distance measure and Ward clustering algorithm in MetaboAnalyst 3.0 and viewed as group averages.

Anatomy Studies
The sectioning of the fresh leaf, stem, flower petal, and root tissues was performed using a sliding microtome and sections were viewed immediately, without any staining procedure using a light microscope, Olympus BX43 (Tokyo, Japan) connected to Olympus DP72 (Tokyo, Japan) and Canon EOS 700D (Tokyo, Japan) cameras. Photographs of the cells were taken and processed using Analysis Docu and EOS Utility 2 software. The images were observed under magnifications of 10× and 20× for a clearer cell view. Images were saved as tagged image file format (TIFF) files. As the cells' distribution was similar within replicates, each tissue replicate's best image was processed further. The high-throughput transcriptome sequencing of C. caudatus using Illumina Hiseq4000 (150PE) produced 263,163,738 total raw reads. After data cleaning, 242,973,712 total clean reads were obtained, and 68,120,202, 60,354,324, 55,512,786, and 58,986,400 reads were from root, flower, leaf, and stem libraries, respectively (Supplementary S1). All clean reads from the four libraries were assembled, the length distribution of unigenes is shown in Figure 1. Out of the total unigene (187,206) sequences found, most unigenes obtained had a length between 200 and 500 nucleotides (nt). The total assembly statistics showed a GC content of 41.89%, with average contigs of 476.71 based on statistical analysis (Table 1). In addition, the open reading frames (ORFs) analysis using Transdecoder showed that 89.34% of the total transcripts (167,252 sequences) contained the predicted coding sequences (CDS). Among the predicted CDS, 18.74% (31,338 sequences) contained both start and stop codon, indicating these transcripts contained full-length CDS.

Transcriptome Analysis
total clean reads were obtained, and 68,120,202, 60,354,324, 55,512,786, and 58,986,400 reads were from root, flower, leaf, and stem libraries, respectively (Supplementary S1). All clean reads from the four libraries were assembled, the length distribution of unigenes is shown in Figure 1. Out of the total unigene (187,206) sequences found, most unigenes obtained had a length between 200 and 500 nucleotides (nt). The total assembly statistics showed a GC content of 41.89%, with average contigs of 476.71 based on statistical analysis (Table 1). In addition, the open reading frames (ORFs) analysis using Transdecoder showed that 89.34% of the total transcripts (167,252 sequences) contained the predicted coding sequences (CDS). Among the predicted CDS, 18.74% (31,338 sequences) contained both start and stop codon, indicating these transcripts contained full-length CDS.

. Functional Annotation of Unigenes
To further analyze the criteria of the obtained C. caudatus transcriptome, all unigenes were annotated against the NCBI non-redundant protein database (NR), Swiss-Prot, Interpro, and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases (Table 2).

. Functional Annotation of Unigenes
To further analyze the criteria of the obtained C. caudatus transcriptome, all unigenes were annotated against the NCBI non-redundant protein database (NR), Swiss-Prot, Interpro, and Kyoto Encyclopedia of Genes and Genomes (KEGG) databases (Table 2). Blastx analysis with the threshold value of e-value < 1 × 10 −5 successfully matched 51.25% and 38.25% of our unigenes to a known protein available in the NR and Swiss-Prot databases. The E-value distribution showed 27% of our annotated unigenes had a homology of an e-value < 1 × 10 −45 with a known protein, while the remaining annotated unigenes showed homology with an e-value more than 1 × 10 −45 ( Figure 2). Our annotated unigenes' similarity distribution showed that 81% of these genes had similarities higher than 40% to known proteins. Altogether, these results indicated that transcripts of C. caudatus are conserved in comparison with known proteins.
of an e-value < 1 × 10 −45 with a known protein, while the remaining annotated unigenes showed homology with an e-value more than 1 × 10 −45 ( Figure 2). Our annotated unigenes' similarity distribution showed that 81% of these genes had similarities higher than 40% to known proteins. Altogether, these results indicated that transcripts of C. caudatus are conserved in comparison with known proteins.  The predicted abundant protein classes found in the C. caudatus transcriptome are shown in Figure 3. The nucleic acid-binding (PC00171) protein family members were present in the highest numbers, due to the interactions between nucleic acid and other protein domains are fundamental to the entire genetic information's maintenance and the accessions of an organism, as these enzymes are predominant to various biological related processes [51,52]. The protein classes of transferase (PC00220) and hydrolase (PC00121) account for the second most abundant protein class in the species with a similar distribution value. Transferases catalyze a particular functional group's relocation from one donor  The predicted abundant protein classes found in the C. caudatus transcriptome are shown in Figure 3. The nucleic acid-binding (PC00171) protein family members were present in the highest numbers, due to the interactions between nucleic acid and other protein domains are fundamental to the entire genetic information's maintenance and the accessions of an organism, as these enzymes are predominant to various biological related processes [51,52]. The protein classes of transferase (PC00220) and hydrolase (PC00121) account for the second most abundant protein class in the species with a similar distribution value. Transferases catalyze a particular functional group's relocation from one donor molecule to an acceptor molecule, while hydrolases catalyze the complex compounds' cleavage through hydrolytic action [53,54]. Transporter (PC00227) protein accounts for the fourth most abundant protein class in the C. caudatus plant, as it aids the movement of a wide range of molecules and ions transversing the membrane of cells [55].
Agronomy 2021, 11, x FOR PEER REVIEW 7 of 23 molecule to an acceptor molecule, while hydrolases catalyze the complex compounds' cleavage through hydrolytic action [53,54]. Transporter (PC00227) protein accounts for the fourth most abundant protein class in the C. caudatus plant, as it aids the movement of a wide range of molecules and ions transversing the membrane of cells [55]. Gene ontology analysis was carried out to elucidate the function of C. caudatus predicted unigenes. Based on the sequence similarity, a total of 70,065 unigenes were assigned with one or more GO terms. The number of GO terms per transcript ranged from 1 to 167. In total, 14,894 GO terms were successfully assigned to our unigenes and were classified into three major categories: biological process, molecular function, and cellular component ( Figure 4). Within the biological process GO category, organic substance metabolic process (4583 unigenes), cellular metabolic process (4195 unigenes), and primary metabolic process (4118 unigenes) were the most abundant groups. The molecular function GO category was dominated by organic cyclic compound binding (11,398 unigenes), heterocyclic compound binding (11,379), and catalytic activity, acting on a protein (7278 unigenes). Moreover, under the cellular component category, organelles (12,964 unigenes), intracellular organelles (12,275 unigenes), and cytoplasm (9474 unigenes) were found to be the most dominant groups. Gene ontology analysis was carried out to elucidate the function of C. caudatus predicted unigenes. Based on the sequence similarity, a total of 70,065 unigenes were assigned with one or more GO terms. The number of GO terms per transcript ranged from 1 to 167. In total, 14,894 GO terms were successfully assigned to our unigenes and were classified into three major categories: biological process, molecular function, and cellular component ( Figure 4). Within the biological process GO category, organic substance metabolic process (4583 unigenes), cellular metabolic process (4195 unigenes), and primary metabolic process (4118 unigenes) were the most abundant groups. The molecular function GO category was dominated by organic cyclic compound binding (11,398 unigenes), heterocyclic compound binding (11,379), and catalytic activity, acting on a protein (7278 unigenes). Moreover, under the cellular component category, organelles (12,964 unigenes), intracellular organelles (12,275 unigenes), and cytoplasm (9474 unigenes) were found to be the most dominant groups.
KEGG pathway analysis was then performed to understand the specific function further, and identify active biological pathways in C. caudatus [47]. A total of 49,984 of our CDS were assigned to 7251 KEGG ortholog groups and 433 biological pathways, which were classed into several main categories, including metabolism, genetic information processing, environmental information processing, and cellular processes. Many CDS (573 CDS) were assigned to the biosynthesis of secondary metabolites pathways, including phenylpropanoid biosynthesis, flavonoid biosynthesis, anthocyanin biosynthesis, isoflavonoid biosynthesis, and isoquinoline alkaloid biosynthesis. In this study, annotated unigenes with the processes associated with anthocyanin biosynthesis were selected for further analysis.  KEGG pathway analysis was then performed to understand the specific function further, and identify active biological pathways in C. caudatus [47]. A total of 49,984 of our CDS were assigned to 7251 KEGG ortholog groups and 433 biological pathways, which were classed into several main categories, including metabolism, genetic information processing, environmental information processing, and cellular processes. Many CDS (573 CDS) were assigned to the biosynthesis of secondary metabolites pathways, including phenylpropanoid biosynthesis, flavonoid biosynthesis, anthocyanin biosynthesis, isofla-

Candidate Genes for Anthocyanin Biosynthesis
The unigenes from C. caudatus transcripts involved in the anthocyanin biosynthetic pathway remapped from unigenes correlated with the KEGG pathway database ( Figure 5). The mapping was done by integration of three KEGG pathways; the phenylpropanoid pathway (map00940), flavonoid pathway (map00941), and anthocyanin biosynthesis pathway (map00942). The anthocyanin biosynthesis pathway generally starts from the phenylpropanoid pathway towards the flavonoid biosynthesis pathway, followed by the anthocyanin biosynthesis pathway responsible for synthesizing various types of anthocyanin. In our study, a total of 82 unigenes belonging to 11 main enzymes in the anthocyanin biosynthesis have been discovered. Among these unigenes, 30 are mapped to three general phenylpropanoid enzymes (PAL, C4H, and 4CL), 30 are early (CHS, CHI, F3H, F3 H, and F3 5 H) and 22 are late (DFR, LDOX, and UFGT) anthocyanin biosynthetic genes in which all these genes complete the whole anthocyanin biosynthetic pathway. The anthocyanin biosynthetic pathway begins from L-phenylalanine, which is then converted into chalcones by enzymes of phenylalanine ammonia-lyase (PAL; 16 unigenes), cinnamate 4-hydroxylase (C4H; 10 unigenes), 4-coumarate CoA ligase (4CL; 4 unigenes), and chalcone synthase (CHS; 7 unigenes) in the phenylpropanoid pathway. Along this line, stereospecific cyclization of chalcones catalyzed by the enzyme chalcone isomerase (CHI; 6 unigenes) forms either naringenin or flavanones. Hydroxylation of flavanones by the enzyme flavanone 3-hydroxylase (F3H; 8 unigenes) forms dihydroflavonols which are then converted to leucoanthocyanidins by dihydroflavonol 4-reductase (DFR; 12 unigenes) during the pathway branching towards the anthocyanin production. Conversion of leucoanthocyanidins into anthocyanidin was through the activity of leucoanthocyanidin dioxygenase (LDOX; 4 unigenes). In either form, anthocyanins were then produced by the enzyme UDP glucose activity: flavonoid-3-O-glucosyltransferase (UFGT; 6 unigenes). The peptide sequence of four key genes of the anthocyanin biosynthesis pathway (F3H, DFR, LDOX, and UFGT) found in the transcript of C. caudatus were searched for their domain region to further confirm the characteristics of the genes were as the functions predicted. Table 3 summarizes each key gene's domain characteristics from C. caudatus transcripts based on the UniprotPro database. The domain hit for the sequence of respective genes is included in Supplementary S4.

Candidate Transcription Factors (TFs) Involved in Anthocyanin Biosynthesis and Transport
Several TFs regulate the anthocyanin biosynthetic pathways, namely C2H2, basic he lix-loop-helix (bHLH), MYB related proteins, ethylene response factor (ERF), and WRK transcription factor [56,57]. Unigenes encoding for these TFs were discovered in this stud as the top five most abundant unigenes annotated from the transcriptome assembl among several other TFs. The transcriptome data analyzed using the Plant Transcriptio Factor Database (PlantTFDB) for C. caudatus found the most abundant encodes were o

Candidate Transcription Factors (TFs) Involved in Anthocyanin Biosynthesis and Transport
Several TFs regulate the anthocyanin biosynthetic pathways, namely C2H2, basic helix-loop-helix (bHLH), MYB related proteins, ethylene response factor (ERF), and WRKY transcription factor [56,57]. Unigenes encoding for these TFs were discovered in this study as the top five most abundant unigenes annotated from the transcriptome assembly among several other TFs. The transcriptome data analyzed using the Plant Transcription Factor Database (PlantTFDB) for C. caudatus found the most abundant encodes were of C2H2 (233 unigenes), followed by bHLH (190 unigenes), MYB_related proteins (179 unigenes), ERF (163 unigenes), and WRKY (138 unigenes), and followed by several other TF ( Figure 6). The top five TF found in our data are known to be regulating and influencing the anthocyanin biosynthesis in various species.

Metabolomic Analysis
The LC-MS data were examined for anthocyanin derivative ion masses of naturally occurring anthocyanidins, i. The anthocyanins detected were of cyanidin, pelargonidin, and delphinidin derivatives, distinguished by the aglycones that were easily detached from the molecular ions at their respective retention times. Although this phenomenon may be observed more evidently in tandem MS with elevated collision energy, the 6eV minimal applied energy to allow ion movement through the cell in this MS analysis caused initial breakage of the O-glycosidic bonds in the anthocyanin molecular ions. The free anthocyanidins or the aglycone of anthocyanins cyanidin, pelargonidin, and delphinidin were not observed as separate metabolite peaks in all the LC-MS chromatograms of the C. caudatus extracts. The structure of these anthocyanidins and the 11 observed anthocyanins in the C. caudatus extracts are illustrated in Table 4, along with their molecular weights. The high-resolution time-of-flight (TOF) mass spectrometer (HRMS) with massresolving power of ≥10,000 at the full width at half maximum (FWHM) of a peak, subppm (parts per million) mass measurement error, and characteristic sensitive full-spectrum scanning allowed the calculation of accurate elemental formulae and the elucidation of the anthocyanin identities in the C. caudatus extracts [58][59][60]. The generated molecular formulae were then compared to previous reports on anthocyanins, as listed in Table 5. The anthocyanins' abundance marked the differences in the C. caudatus flower, leaf, root, and stem tissues significant with a p-value less than 0.05. These are illustrated in the heatmap in Figure 7.

Metabolomic Analysis
The LC-MS data were examined for anthocyanin derivative ion masses of naturally occurring anthocyanidins, i. The anthocyanins detected were of cyanidin, pelargonidin, and delphinidin derivatives, distinguished by the aglycones that were easily detached from the molecular ions at their respective retention times. Although this phenomenon may be observed more evidently in tandem MS with elevated collision energy, the 6eV minimal applied energy to allow ion movement through the cell in this MS analysis caused initial breakage of the O-glycosidic bonds in the anthocyanin molecular ions. The free anthocyanidins or the aglycone of anthocyanins cyanidin, pelargonidin, and delphinidin were not observed as separate metabolite peaks in all the LC-MS chromatograms of the C. caudatus extracts. The structure of these anthocyanidins and the 11 observed anthocyanins in the C. caudatus extracts are illustrated in Table 4, along with their molecular weights. The high-resolution time-of-flight (TOF) mass spectrometer (HRMS) with mass-resolving power of ≥10,000 at the full width at half maximum (FWHM) of a peak, sub-ppm (parts per million) mass measurement error, and characteristic sensitive full-spectrum scanning allowed the calculation of accurate elemental formulae and the elucidation of the anthocyanin identities in the C. caudatus extracts [58][59][60]. The generated molecular formulae were then compared to previous reports on anthocyanins, as listed in Table 5. The anthocyanins' abundance marked the differences in the C. caudatus flower, leaf, root, and stem tissues significant with a p-value less than 0.05. These are illustrated in the heatmap in Figure 7. Agronomy 2021, 11, x FOR PEER REVIEW 13 Figure 7. Heatmap of the most significant metabolites of different abundance levels marking the divergence between the flower, leaf, root, and stem tissues of C. caudatus. Anthocyanins in the heatmap are labelled, showing pelargonidin, cyanidin, and delphinidin derivatives in high abundance in flower tissue. In contrast, delphinidin derivatives were detected in high abundance in the leaf tissue compared to the stem and root.

Anatomy Data
The anthocyanin was observed to be accumulated in the inner side of developed vacuoles, synthesized around the tonoplast and condensed on the vacuoles' inward side in the C. caudatus tissues. The presence of anthocyanin can be recognized with evident coloration, without any staining. The vibrant hue of the red-pigmented area was determined based on the color visible on the cell's outer part. Further focusing on the region revealed that the anthocyanin was accumulated in the granule (Figure 8). The cross-section of the petiole, stem, and root allowed the observation of the anthocyanin accumulation in C. caudatus, primarily located in the epidermal layer and below the epidermal layer, known as the subepidermal layer. The anthocyanin was mostly found in the vacuoles of parenchyma cells. Cells with an accumulation of anthocyanin were slightly more prominent than other cells and their color density varied ( Figure 8A-F). However, in the flower tissue, anthocyanin accumulation was, relatively, very highly distributed almost all over the adaxial epidermal cells, including the papillae and in the area slightly lower, in the abaxial epidermal cells ( Figure 8G,H). The anthocyanin density was high in the epidermis cells' outer layer and gradually decreased moving inward to the inner part of the flesh. Anthocyanin was observed to accumulate more in the flower part, especially in the adaxial epidermal layer.

Discussions
The formation of coloration in plant species is highly influenced by anthocyanin, a group of compounds derived from the phenylpropanoid pathway [78]. Previous studies on C. caudatus extracts suggested anthocyanin's role in contributing to its high bioactivity [14,21]. However, an in-depth study to confirm the specific anthocyanin compounds produced by this species is still lacking. In this investigation, the next-generation sequencing technology of Illumina Hiseq4000 was employed to unravel the anthocyanin biosynthetic genes present in C. caudatus, which did not have a reference genome. The domain part in a sequence is important in determining the function of a specific gene of the same class according to its conserved regions. The transcript of the four key genes in anthocyanin biosynthesis found in this study were searched for their domains. The first key gene, F3H

Discussions
The formation of coloration in plant species is highly influenced by anthocyanin, a group of compounds derived from the phenylpropanoid pathway [78]. Previous studies on C. caudatus extracts suggested anthocyanin's role in contributing to its high bioactivity [14,21]. However, an in-depth study to confirm the specific anthocyanin compounds produced by this species is still lacking. In this investigation, the next-generation sequencing technology of Illumina Hiseq4000 was employed to unravel the anthocyanin biosynthetic genes present in C. caudatus, which did not have a reference genome. The domain part in a sequence is important in determining the function of a specific gene of the same class according to its conserved regions. The transcript of the four key genes in anthocyanin biosynthesis found in this study were searched for their domains. The first key gene, F3H annotated in this study has a domain of Fe-dependent dioxygenase, as this conserved region has been found in the F3H gene from Carthamus tinctorius L. and Ascocenda sp. [79,80]. The DFR gene liable in the formation of leucoanthocyanidins from dihydroflavonols by stereospecific reduction has the domain of epimerase/dehydratase, which was found to be similar to other species like Camellia sinensis and Brassica rapa [81,82]. The LDOX gene found in this study showed the iron/2oxoglutarate dependent dioxygenase, which is also be known as anthocyanidin synthase (ANS), and responsible for converting leucoanthocyanidin into anthocyanidin, as proven in a previous study involving Perilla frutescens [83]. The conversion of anthocyanidin to anthocyanin is catalyzed by the enzyme UFGT, a vital enzyme of the anthocyanin biosynthetic pathway. From domain analysis, the transcripts obtained in this study and previously reported in Ginkgo biloba for the gene UFGT were of the same superfamily, UDP-glycosyltransferase [84]. Previous research involving Prunus mume (Japanese apricot) has demonstrated that the UFGT gene plays a crucial role in pigmentation, and that the transcription level of the gene is directly proportional to the anthocyanin level detected in the species [85]. Some other studies also demonstrated that these enzymes' expression is vital in anthocyanin biosynthesis in Vitis vinifera, Solanum tuberosum, Phalaenopsis, and Musa itenerans [86][87][88]. The previous research proved that the key gene's presence contributes to the occurrence of anthocyanin across different plant species, as found in the transcriptome data of C. caudatus [80][81][82].
The interaction of regulatory genes such as basic helix-loop-helix (bHLH), WD40, and MYB TFs with promoters of the structural genes plays an important role in the biosynthesis of anthocyanin [89,90]. One of the most abundant TFs in plants is the superfamily bHLH [91][92][93]. TF bHLH, along with other activators such as MYB, plays a pivotal role in anthocyanin's biosynthesis [94]. MYB is important in the accumulation of anthocyanin intensity patterns in various plant species [95]. The activation of this biosynthetic pathway is regulated by the bHLH-MYB-WD40 complex, where modification of any of these genes causes changes in color formation in various plants, such as Malus domestica, Brassica oleracea var. botrytis, Dahlia variabilis, and Citrus sinensis [96][97][98][99]. A study on Solanum tuberosum demonstrated that TF C2H2, besides several other protein families such as zf-HD, LOB, bZIP, MYB, and MADS, played an essential part in regulating the anthocyanin biosynthetic pathway in wild purple and red potatoes, using a comparative transcriptome approach [100]. Several studies have demonstrated a strong correlation of ERF with anthocyanin's expression level in several species, such as Vaccinium myrtillus [101]. It binds with MYB and bHLH promoters as co-regulator of anthocyanin biosynthesis in Malus domestica [102] and Pyrus sp. [103], respectively. Another study on Malus domestica identified WRKY as a protein that interacts with the positive regulator of anthocyanin synthesis in the species, which bind MYB to its target genes better [104]. By comparing sequences of known transcription factor families, the most abundant TFs: C2H2, MYB, bHLH, ERF, and WRKY, were annotated in the transcriptome profile of C. caudatus, which might play a pivotal role in regulating the anthocyanin pathway in the species studied, as shown in other angiosperms species.
Anthocyanidins such as cyanidin, delphinidin, and pelargonidin occur in a more stable and soluble form of anthocyanin, i.e., glycosylated, and are very rarely found without their sugar substituents in nature [105]. This facilitates absorption of the anthocyanidin into the bloodstream upon consumption of plants containing anthocyanin. Various anthocyanins of cyanidin, delphinidin, and pelargonidin derivatives were identified in C. caudatus, with its flowers detected as having the most anthocyanin, followed by the leaf tissue. The results confirmed existing assertions of C. caudatus containing anthocyanin. The anatomical approach further supported this finding in which anthocyanin was observed to accumulate more in the flower tissue, especially in the adaxial epidermal layer. This is consistent with the proposed mechanism of anthocyanin accumulation in proportion to light exposure [106]. Anthocyanin and chlorophyll levels in green leaves are negatively correlated, as demonstrated in Brassica oleracea and Camellia sinensis [107]. This explains the lower density of anthocyanin detected in the C. caudatus leaf tissue. Anthocyanin level slowly decreases with plant development and a developed or mature leaf contains lower levels of anthocyanins compared to chlorophyll. The distribution of anthocyanin within the plant tissue also differs considerably. As a rule, anthocyanin was found to localize in the cell vacuoles or just below the adaxial epidermis. Occasionally, the anthocyanin pigments also accumulate in the abaxial epidermis' cell vacuoles, palisade, spongy mesophyll, and root tissues as compounds involved in photoprotection. In this study, the findings showed a very low density of anthocyanin in the root, stem, and petiole tissues, which were more sheltered from the sun.
Previous anatomical screening on 463 plant species discovered anthocyanin in the leaf tissue of 260 species at different life stages, such as development, maturity, and senescence [108]. It has also been reported in various tissues other than trichomes, such as palisade and spongy layers, [108]. This correlates with the findings in our study, where the granules accommodating anthocyanin pigment were observed in all the tissues studied, although some were detected in different parts of the cell. At these parts of the tissues ( Figure 8A-F), anthocyanin was only found in an individual cell or groups of several cells in one or two rows, as previously reported in Impatiens balsamina [109] The glandular trichomes found in the floral part of the C. caudatus were larger, immobile, and colorless, with several of the trichomes red-pigmented, indicating the accumulation of anthocyanin in them. The accumulation of anthocyanin in trichomes is supported by the previous examination on the foliar anatomy of Castanopsis fissa, suggesting the photoinhibition activity as a defense mechanism for protecting the tissue against high light intensity [110,111].
The high accumulation of anthocyanin in flower tissues with visible pink coloration is due to the abundance of cyanidin and pelargonidin derivative. Metabolomics data support this; a higher abundance of anthocyanins being responsible for red to pink spectrum is supported by previous reports on various flowers, such as Chrysanthemum morifolium, Lilium spp., and Centaurea cyanus [112][113][114]. In this study, the highest abundance of anthocyanins was found in flowers, followed by the leaves. This was found to correspond to our previous research on C. caudatus involving FT-IR, in which flower and leaf tissues were grouped together, apart from the other tissues studied [39]. Transcriptome data found the C. caudatus unigenes obtained were shown to have a complete anthocyanin biosynthesis pathway, directly correlating with three major anthocyanins (cyanidin-3-O-glucoside, pelargonidin-3-O-glucoside, and delphinidin-3-O-glucoside) found in the LC-MS analysis, elucidating the occurrence of complete anthocyanin biosynthesis in this species for the first time, to our knowledge.
Based on the results obtained from transcriptome data and metabolite analysis combined with the anatomy work, this study successfully demonstrated the occurrence of anthocyanin biosynthesis in C. caudatus ( Figure 9). All the genes related to the anthocyanin pathway and the metabolites of interest were identified in this work. These findings offer a depiction of different anthocyanins present in the species and present more in-depth insights into each type of anthocyanin present in the investigated tissues. The traditional practice of C. caudatus dietary consumption for various health benefits may relate to the anthocyanins present in the species and other bioactive compounds, which demands further thorough research [115]. The identification of the C. caudatus unigenes encoding the overall anthocyanin biosynthesis would provide information on new candidate genes for anthocyanin bioproduction using metabolic engineering and synthetic biology approaches. Agronomy 2021, 11, x FOR PEER REVIEW 18 of 23

Conclusions
C. caudatus is a herbal plant with a high content of anthocyanins. To the best of our knowledge, this is the first comprehensive analysis of the anthocyanin biosynthetic pathway in C. caudatus based on the integration of transcriptomics, metabolomics, and anatomical approaches. The work enabled the identification of eleven main genes involved in the anthocyanin biosynthetic pathway and 11 correlative anthocyanins present in the species. The pathway was elucidated, and the identification of transcripts related to anthocyanin biosynthesis will help discern the molecular mechanism for future research prospects. The results obtained from this research can be used to choose specific tissues of interest to enhance anthocyanin production. This can be done by understanding and controlling the gene regulation of the anthocyanin biosynthesis pathway. Furthermore, this discovery will also assist and complement future translational research in biotransformation and characterization of different anthocyanins and the exploration of synthetic biology to harness this highly potent plant using heterologous expression studies.
Supplementary Materials: The following are available online at www.mdpi.com/xxx/s1, Table S1: Quality control and filtering of transcriptome raw reads of Cosmos caudatus, Table S2: KEGG pathway analysis Cosmos caudatus, Table S3: List of unigene anthocyanin biosynthetic pathway Cosmos caudatus, Figure S4

Conclusions
C. caudatus is a herbal plant with a high content of anthocyanins. To the best of our knowledge, this is the first comprehensive analysis of the anthocyanin biosynthetic pathway in C. caudatus based on the integration of transcriptomics, metabolomics, and anatomical approaches. The work enabled the identification of eleven main genes involved in the anthocyanin biosynthetic pathway and 11 correlative anthocyanins present in the species. The pathway was elucidated, and the identification of transcripts related to anthocyanin biosynthesis will help discern the molecular mechanism for future research prospects. The results obtained from this research can be used to choose specific tissues of interest to enhance anthocyanin production. This can be done by understanding and controlling the gene regulation of the anthocyanin biosynthesis pathway. Furthermore, this discovery will also assist and complement future translational research in biotransformation and characterization of different anthocyanins and the exploration of synthetic biology to harness this highly potent plant using heterologous expression studies.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/ 10.3390/agronomy11040661/s1, Table S1: Quality control and filtering of transcriptome raw reads of Cosmos caudatus, Table S2: KEGG pathway analysis Cosmos caudatus, Table S3: List of unigene anthocyanin biosynthetic pathway Cosmos caudatus, Figure S4  Data Availability Statement: The raw read sequences of transcriptome assembly obtained from Illumina Hiseq4000 (150PE) platform have been deposited in Sequence Read Archive (SRA) under the accession number SRP158180 in NCBI and details of the sample can be viewed via BioProject id: PRJNA486042. The assembled transcriptome data (TSA) were deposited NCBI database, under processing, will be released upon publication.