Genome Assembly and Analysis of the Flavonoid and Phenylpropanoid Biosynthetic Pathways in Fingerroot Ginger (Boesenbergia rotunda)

Boesenbergia rotunda (Zingiberaceae), is a high-value culinary and ethno-medicinal plant of Southeast Asia. The rhizomes of this herb have a high flavanone and chalcone content. Here we report the genome analysis of B. rotunda together with a complete genome sequence as a hybrid assembly. B. rotunda has an estimated genome size of 2.4 Gb which is assembled as 27,491 contigs with an N50 size of 12.386 Mb. The highly heterozygous genome encodes 71,072 protein-coding genes and has a 72% repeat content, with class I TEs occupying ~67% of the assembled genome. Fluorescence in situ hybridization of the 18 chromosome pairs at the metaphase showed six sites of 45S rDNA and two sites of 5S rDNA. An SSR analysis identified 238,441 gSSRs and 4604 EST-SSRs with 49 SSR markers common among related species. Genome-wide methylation percentages ranged from 73% CpG, 36% CHG and 34% CHH in the leaf to 53% CpG, 18% CHG and 25% CHH in the embryogenic callus. Panduratin A biosynthetic unigenes were most highly expressed in the watery callus. B rotunda has a relatively large genome with a high heterozygosity and TE content. This assembly and data (PRJNA71294) comprise a source for further research on the functional genomics of B. rotunda, the evolution of the ginger plant family and the potential genetic selection or improvement of gingers.

The ethno-medicinal and potential pharmaceutical importance of B. rotunda have led to interest in exploring cell and tissue cultures for secondary metabolite production. In commercial farms, the plant is propagated clonally from rhizomes, and several protocols for multiplication via in vitro culture have been reported including plantlet regeneration via somatic embryogenesis from callus cultures [38,39], shoot bud explants [40] and embryogenic cell suspension cultures [38]. Cell suspensions of B. rotunda [18,41] and various types of callus [42,43] have been explored as potential sources for alpinetin, cardamonin, pinocembrin, pinostrobin and panduratin A. Reproducible methods for the in vitro cell culture of B. rotunda, have led to protocols for genetic transformation [39], that could facilitate the metabolic engineering of cell materials for specific desirable metabolite production; however, current knowledge of the underlying biosynthetic pathways is sparse. Other than biochemical profiling [17,43], the application of current technologies for determining the deep sets of the genetic sequences expressed in various tissue and cell types can deliver useful information.
Genomic level studies improve the understanding of the biology and biochemistry of the plant and can be applied in breeding for improved agronomy and plant products. Whole genome sequencing identifies the genes and regulatory sequences for complex biological processes such as secondary metabolite biosynthesis [44][45][46], while transcriptional profiling provides information for functional studies. Structural genomic studies have been undertaken in other Zingiberales including turmeric (Curcuma longa; genome size of 1.24 Gb) [47] and for several Musaceae species and cultivars, which have genome sizes ranging from 462 Mb to 598 Mb (Banana Genome Hub https://banana-genome-hub.southgreen.fr/ accessed on 28 May 2022) [48], while the Pan-genome of Musa Ensete has a genome size of 951.6 Mb [49]. Larger plant genomes have now been sequenced including those of important monocot species such as wheat~17 Gb (International Wheat Genome Sequencing Consortium), Aegilops tauschii~4.3 Gb [50], oil palm~1.8 Gb [51] and maize~2.6 Gb [52], in addition to species known for their unique metabolites such as tea (Camellia sinensis) 2.98 Gb [53,54] and ginseng (Panax ginseng)~3.2 Gb [55]; however, even with the recent advances in long sequence technology, large plant genomes can be challenging to assemble due to a high repeat content and high levels of heterozygosity [56,57]. The availability of an assembled genome sequence expands the functional biological questions that can be asked, since regulatory and variable elements, many of which may be involved in epigenetic regulation, cannot be seen purely using expression data. Consequently, while transcriptome [41] and proteome [58] data for B. rotunda are available, the lack of a previously published genome assembly is a limitation for functional studies. Genome assemblies also facilitate the exploration of genomic repeats which can not only be a source for genetic markers but are also drivers of genome size, gene content and order, centromere function and reflect genome evolution [59,60]. Last but not least, the epigenetic dynamism in genomes mainly involves "non-coding DNA", thus, a genome assembly provides the framework for epigenetic studies; therefore, in the current investigation we performed the first complete genome sequence for B. rotunda made with a hybrid assembly strategy using the Pacific Biosciences (PacBio) and Illumina HiSeq platforms. We explored the sites of 45S rDNA and 5S rDNA on metaphase chromosomes observed by Fluorescence in situ hybridization (FISH). In addition, we carried out a deep transcriptome (RNA-seq data) assembly from five B. rotunda samples, including various types of callus cultures, and leaves. Gene expression profiles and bisulfite seq DNA methylation data from these tissues and samples were used for a co-expression analysis to identify any association of gene 4 of 26 expression and local DNA methylation of unigenes related to methylation, somatic embryogenesis, and the pathways for flavonoid and phenylpropanoid biosynthesis. We also report novel expressed sequence tags-SSR (EST-SSR) and genomic SSR markers for B. rotunda and the estimated cross-transferability of the designed primers between B. rotunda and closely related species to provide deeper genetic resources to support further study of the biology and biodiversity in this genus. Genomic information and complete sequence data for this less investigated herb should provide a solid foundation as a vital step in genetic analysis to facilitate B. rotunda improvement and to reach a deeper understanding of the metabolic pathways of its natural products.

Chromosomes and Location of rDNA Sites
Boesenbergia rotunda (2n = 36; 18 pairs of submetacentric chromosomes) has three pairs of 45S rDNA sites near the ends of three pairs of chromosomes ( Figure 1A). One pair of the 5S rDNA sites ( Figure 1D) is on a chromosome pair not bearing 45S rDNA.

Genome Assembly
Genomic DNA from the leaves of a single, clonal B. rotunda plant was sequenced using multiple approaches (Table S1), with 114 Gb PacBio long reads, 260 Gb of Illumina HiSeq 2500 250 bp paired-end reads, and 90 Gb of mate-paired reads with 2, 5, 10, 20 and 40 kb insert sizes. Based on the k-mer analysis (k = 17, GenomeScope), the estimated haploid genome size of B. rotunda was 2.4 Gb ( Figure S1), consistent with the flow cytometry ( Figure S2). The heterozygosity was estimated as 3.01%. A hybrid genome assembly pipeline combining the Illumina data and PacBio data was adopted ( Figure S3). The final assembled genome size was 2.347 Gb, characterized by 27,491 contigs and 10,627 scaffolds, with a contig N50 of 123.86 kb and a scaffold N50 of 394.68 kb (Table 1). Based on a benchmarking universal single-copy orthologs (BUSCO) analysis [61] mapping the B. rotunda genome against a set of 1440 core eukaryotic genes, 1232 (85.6%) were present (Table S2). An assembly quality assessment showed that over 95% of the Illumina PE250 reads were mapped to the contig assembly (Table 2).

Annotation of the B. rotunda Genome
Five sets of RNA-seq datasets were generated from three cell culture types, in vitro and ex vitro leaves of B. rotunda, given the importance for secondary metabolites production. Individual transcriptomes were assembled from these RNA-seq reads using different de novo transcriptome assemblers (Table 3, Figure S4). The assembled transcriptome size ranged from 31 to 71 million base pairs with 72,085 to 158,465 contigs for the Oases, SOAPdenovo-Trans, TransAbyss, and Trinity (Table 3, Figure S4). Oases had the largest N50 size and average contig length. The BUSCO quantitative measure of the completeness transcriptomes in terms of expected gene content scores, also showed Oases (36.7%) and TransAbyss (36.6%) to give the assemblies with higher numbers of complete and singlecopy contigs compared to the SOAPdenovo-Trans and Trinity (31.7%) ( Figure S5). The non-redundant transcript sequences formed from the Oases followed by TGICL were used to annotate the B. rotunda genome and for the downstream expression analysis. Based on a combination of de novo and homology-based gene prediction methods, 72.51% of the genome (1.70 Gb) was annotated as repeats including 6.94% tandem repeats. Among the Class I TEs (Retroelements), long terminal repeats (LTRs) constituted the greatest proportion of the genome (67.16%) while DNA TE made up 3.29 % of the genome ( Figure S6, Table 4).
From the 10,627 assembled contigs and 95,847 assembled transcriptome sequences searched for SSRs, (Table 5, Figure 2), the density of the microsatellites was 102 SSR loci per Mbp in the genomic and 69 SSR loci per Mbp in the transcriptome sequences. Among the identified repeat motif types, trinucleotides were the most abundant in both genomic (35.62%) and transcriptome (51.67%) sequences, followed by mono-and dinucleotide repeats (Table 5, Figure 2a). The class II type SSR-loci (<30 bp) were two-fold higher than the class I type in the genomic sequences, whereas the class II type SSR-loci were fourfold higher than the class I type SSR loci in the transcriptome sequences ( Figure 2b). The number of AT rich microsatellites was significantly higher than that of the GC rich and microsatellites with a balanced GC content. The mapping of B. rotunda SSR to close relatives using newly designed primer sequences showed that from 93.81% of the genomic SSR and 73.12% of the transcriptome sequences suitable for the SSR primer design, only a low number of primers mapped to the selected relatives, Musa acuminata, Musa balbisiana, Musa itinerans and Ensete ventricosum (Table 5). Overall, 224 G-SSR and 65 EST-SSR primers showed transferability into any of the four related species (with slightly more in the Ensete), while only 42 genomic SSRs and 7 transcript SSRs were common to all five genomes (Figure 2c,d). A subset of 14 B. rotunda SSR primer pairs (Table S3) were tested for their marker potentiality and all showed amplified bands of the expected sizes for each species ( Figure S7).
The annotation of the predicted protein-coding genes was a combination of homologybased and de novo prediction in addition to comparison with the B. rotunda transcriptome data (Table S4). After consolidation, 73,102 protein-coding genes were predicted in the B. rotunda genome with an average transcript length of 4312 bp (excluding UTR), CDS length of 1360 bp, average exon and intron lengths of 303 bp and 812 bp, respectively, and 4.49 exons per gene (Table S4). For the homology-based protein-coding gene predictions, the protein sequences from four species (M. acuminata, Phoenix dactylifera, Oryza sativa and Arabidopsis thaliana) were mapped onto the B. rotunda genome. From these alignments, B. rotunda had the highest number of matches with P. dactylifera followed by O. sativa, A. thaliana and M. acuminata ( Figure S8). Functional annotation of the 73,102 predicted proteins from B. rotunda against seven databases enabled functional predictions for 97.8% of the predicted genes (Table 6). Non-coding RNA analysis of the assembly identified 213 microRNA (miRNA), 2727 transfer RNA (tRNA), 486 ribosomal RNA (rRNA), and 2136 small nuclear RNA (snRNA) genes (Table 7).  The annotation of the predicted protein-coding genes was a combination of homology-based and de novo prediction in addition to comparison with the B. rotunda transcriptome data (Table S4). After consolidation, 73,102 protein-coding genes were predicted in the B. rotunda genome with an average transcript length of 4312 bp (excluding UTR), CDS length of 1360 bp, average exon and intron lengths of 303 bp and 812 bp, respectively, and 4.49 exons per gene (Table S4). For the homology-based protein-coding gene predictions,  A final genome annotation was performed by using MAKER together with de novo assembled non-redundant transcripts, predicted proteins, non-coding RNAs and repeats.

Functional Classification by Gene Ontology
From a total of 95,847 unigenes derived from the B. rotunda transcriptome, 41,550 unigenes (43.35%) were found significantly scoring BLASTX hits against the NR protein database. Of these, 6850 (7.15% of the total unigenes) returned significant sequence alignments that could not be linked to Gene Ontology entries; 6038 (6.3%) of the GO mapped dataset did not obtain an annotation assignment and we could assign functional labels to 28,662 (29.9%) of the input sequences ( Figure S9). Species distribution among the BLASTX matches showed that M. acuminata subsp. malaccensis had a very high similarity score with 87,000 top BLASTX hits from B. rotunda. Other species matches included Ethiopian banana, Ensete ventricosum (Musaceae) with 70,000 hits, African oil palm, Elaeis guineensis (Arecaceae) with 62,500 BLASTX hits and date palm, P. dactylifera (Arecaceae) with 62,000 BLASTX hits ( Figure S10). Biological processes were the most represented functional group based on the categorization of the GO classes for the Nr annotated sequences (Figure 3a).
A Blast2GO enzyme code (EC) annotation showed the distribution of B. rotunda predicted proteins among six main enzyme classes of oxidoreductases (1400), transferases (3500), hydrolases (2250), lyases (450), isomerases (250), and ligases (270) ( Figure S11). The KOG function classification produced Nr hits for 18,767 unigenes annotated and classified functionally into 25 KOG functional categories ( Figure 3b). General function prediction was the most represented group (2396) followed by signal transduction mechanism (2178) and posttranslational modification, protein turnover, and chaperons' with 2031 genes. Comparison of all the B. rotunda unigenes with the KEGG pathway database resulted in 1494 out of 28,662 annotated unigene sequences (5.21%) being assigned to 145 predicted metabolic pathways.

Phylogenetic Orthology Inference of B. rotunda Genes
A total of 62,520 orthogroups were found with Orthofinder [62] (Table S5) with matches of the genes from B. rotunda to 979,315 genes from 13 other species (Glycine max, Cucumis melo, Gossypium raimondii, Brassica napus, Arabidopsis thaliana, Solanum tuberosum, Solanum lycopersicum, Musa acuminata, Zea mays, Oryza sativa subsp. japonica, Hordeum vulgare, Phoenix dactylifera and Brachypodium distachyon). Of these, 7276 orthogroups were shared among all species and there were no single copy orthogroups (Table S5). The species tree inferred by STAG [63] and rooted by STRIDE [64] indicated that B. rotunda has the closest relationship with M. acuminata (order Zingiberales) and P. dactylifera (order

Phylogenetic Orthology Inference of B. rotunda Genes
A total of 62,520 orthogroups were found with Orthofinder [62] (Table S5) with matches of the genes from B. rotunda to 979,315 genes from 13 other species (Glycine max, Cucumis melo, Gossypium raimondii, Brassica napus, Arabidopsis thaliana, Solanum tuberosum, Solanum lycopersicum, Musa acuminata, Zea mays, Oryza sativa subsp. japonica, Hordeum vulgare, Phoenix dactylifera and Brachypodium distachyon). Of these, 7276 orthogroups were shared among all species and there were no single copy orthogroups (Table S5). The species tree inferred by STAG [63] and rooted by STRIDE [64] indicated that B. rotunda has the closest relationship with M. acuminata (order Zingiberales) and P. dactylifera (order Arecales) followed by members of the Poaceae family (Z. mays, O. sativa subsp. japonica, H. vulgare, and B. distachyon), and was distant from the plant species from the Solanaceae, Brassicaceae, Malvaceae, Cucurbitaceae, and Fabaceae ( Figure 4a, Table S5). UpSet plotting showed 7276 orthogroups shared between the B. rotunda and 13 selected reference genomes (Figure 4b). Additionally, 1849 protein orthologs are specific for B. rotunda and 274 orthogroups are shared among the 13 selected reference genomes except for B. rotunda.  Figure 4a, Table S5). UpSet plotting showed 7276 orthogroups shared between the B. rotunda and 13 selected reference genomes ( Figure 4b). Additionally, 1849 protein orthologs are specific for B. rotunda and 274 orthogroups are shared among the 13 selected reference genomes except for B. rotunda.

Gene Family Expansion and Contraction
Using the data generated from OrthoFinder [62], we explored the gene family expansion and contractions in B. rotunda (Figure 4a). In total, there are 17,106 gene families shared by the most recent common ancestor (MRCA). There were large numbers of gene families expanding (53-10,855) or contracting 754) between 14 plant genomes (Figure 4a). Our results show the substantial expansion of gene families in the Poaceae (5557) followed by the Brassicaceae (5104) and the Pooideae subfamilies (4205). A large gene family contraction was observed in Solanaceae (8975). Interestingly, the majority of the genomes with reported ancient whole genome duplication or massive segmental duplications or major chromosomal duplications show a higher number of gene family duplications than gene family losses (indicated by asterisks in Figure 4a).

Transcriptome Changes of B. rotunda Unigenes Related to Flavonoid and Phenylpropanoid Biosynthesis Pathways
A Transcriptome analysis showed that, in total, 167 unigenes from B. rotunda were mapped to five different classes of enzymes including oxidoreductase, transferase, ligase, lyase, and hydrolase in the flavonoid and phenylpropanoid pathways. Of these, only 23 enzymes showed differential expression in the different samples, i.e., in vitro leaf (IVL), embryogenic callus (EC), and non-embryogenic calli (dry callus (DC) and watery callus (WC)) using ex vitro leaf (EVL) samples as the comparator ( Figure 5, Table S6). Phenylalanine ammonia-lyase (PAL), the first enzyme in the phenylpropanoid pathway, turns phenylalanine to cinnamic acid. The PAL was expressed at the lowest levels among all the samples in the IVL, with the highest expression level in the WC (indicated by the dark red squares in Figure 5). Then, the coenzyme A (CoA) would attach to cinnamic acid or p-

Gene Family Expansion and Contraction
Using the data generated from OrthoFinder [62], we explored the gene family expansion and contractions in B. rotunda (Figure 4a). In total, there are 17,106 gene families shared by the most recent common ancestor (MRCA). There were large numbers of gene families expanding (53-10,855) or contracting 754) between 14 plant genomes ( Figure 4a). Our results show the substantial expansion of gene families in the Poaceae (5557) followed by the Brassicaceae (5104) and the Pooideae subfamilies (4205). A large gene family contraction was observed in Solanaceae (8975). Interestingly, the majority of the genomes with reported ancient whole genome duplication or massive segmental duplications or major chromosomal duplications show a higher number of gene family duplications than gene family losses (indicated by asterisks in Figure 4a).

Transcriptome Changes of B. rotunda Unigenes Related to Flavonoid and Phenylpropanoid Biosynthesis Pathways
A Transcriptome analysis showed that, in total, 167 unigenes from B. rotunda were mapped to five different classes of enzymes including oxidoreductase, transferase, ligase, lyase, and hydrolase in the flavonoid and phenylpropanoid pathways. Of these, only 23 enzymes showed differential expression in the different samples, i.e., in vitro leaf (IVL), embryogenic callus (EC), and non-embryogenic calli (dry callus (DC) and watery callus (WC)) using ex vitro leaf (EVL) samples as the comparator ( Figure 5, Table S6). Phenylalanine ammonia-lyase (PAL), the first enzyme in the phenylpropanoid pathway, turns phenylalanine to cinnamic acid. The PAL was expressed at the lowest levels among all the samples in the IVL, with the highest expression level in the WC (indicated by the dark red squares in Figure 5). Then, the coenzyme A (CoA) would attach to cinnamic acid or p-coumaric acid by 4-coumarate-CoA ligase (4CL) and form cinnamoyl-CoA or p-coumaroyl-CoA. This enzyme showed a relatively higher expression in all the samples except for the EC.
coumaric acid by 4-coumarate-CoA ligase (4CL) and form cinnamoyl-CoA or p-coumaroyl-CoA. This enzyme showed a relatively higher expression in all the samples except for the EC. In the phenylpropanoid pathway, cinnamic acid is also converted to coumarinate by Beta-glucosidase (BGLU) to produce coumarin. BGLU was expressed in all samples, with the highest expression level in the non-embryogenic calli (NEC). Then CHS, chalcone synthase (CHS) converts cinnamoyl-CoA to pinocembrin chalcone and p-coumaroyl CoA to naringenin chalcone. CHS was expressed in all the samples except the IVL, with the highest expression level in the WC. In the next step, the two flavanones of pinocembrin and naringenin are synthesised by chalcone isomerase (CHI). CHI was expressed in all the samples except the IVL, with the highest expression level in the EC. Pinocembin is converted to pinostrobin by flavanone-3-hydroxylase (F3H) which serves as a precursor of panduratin A synthesis. Expression analysis of the unigenes related to the F3H enzyme and dihydroflavonol 4-reductase (DFR), which are involved in the synthesis of anthocyanidins such as pelargonidin, cyanidin, and delphinidin, showed the DFR to be more highly expressed in the DC and WC compared to the other samples, while the  In the phenylpropanoid pathway, cinnamic acid is also converted to coumarinate by Beta-glucosidase (BGLU) to produce coumarin. BGLU was expressed in all samples, with the highest expression level in the non-embryogenic calli (NEC). Then CHS, chalcone synthase (CHS) converts cinnamoyl-CoA to pinocembrin chalcone and p-coumaroyl CoA to naringenin chalcone. CHS was expressed in all the samples except the IVL, with the highest expression level in the WC. In the next step, the two flavanones of pinocembrin and naringenin are synthesised by chalcone isomerase (CHI). CHI was expressed in all the samples except the IVL, with the highest expression level in the EC. Pinocembin is converted to pinostrobin by flavanone-3-hydroxylase (F3H) which serves as a precursor of panduratin A synthesis. Expression analysis of the unigenes related to the F3H enzyme and dihydroflavonol 4-reductase (DFR), which are involved in the synthesis of anthocyanidins such as pelargonidin, cyanidin, and delphinidin, showed the DFR to be more highly expressed in the DC and WC compared to the other samples, while the F3H was only relatively up-regulated in the WC. Other enzymes in the phenylpropanoid pathway include hydroxycinnamoyl-CoA shikimate (HCT), cinnamoyl-CoA reductase (CCR), cinnamyl alcohol dehydrogenase (CAD; EC1.1.1.195), caffeoyl-CoA O-methyltransferase (CCOAOMT), and lactoperoxidase enzyme (LPO), involved in monolignols synthesis such as p-hydroxyphenyl (H), guaiacyl (G) and syringyl (S). Among them, the HCT, CCOAOMT, and CAD showed a higher expression in all the samples, except the IVL for the CAD, while CCR showed a higher expression in the WC. The gene expression differences between the tissue samples for cinnamic acid 4-hydroxylase (C4H), ρ-coumarate 3-hydroxylase (C3H), ferulate 5-hydroxylase (F5H), caffeic acid O-methyltransferase (COMT), and cinnamyl alcohol dehydrogenase (CAD) were below the threshold of the FPKM without any differential expression in the studied samples.

DNA Methylation Analysis Using Bisulfite Sequencing
Genome wide methylation percentages determined from the bisulfite sequence data from the leaf and four tissue cultured samples, were higher in all methylated cytosine contexts for the samples from the EVL (CpG 73.2%, CHG 36.2% and CHH 33.7%) and IVL (CpG71.3%, CHG 35.4% and CHH 33.5%). The lowest levels were for the EC (CpG 53.4%, CHG 18.5% and CHH 25.3%), followed by the WC (CpG 63.8%, CHG 21.9% and CHH 28.1%) and the DC (CpG 68.4%, CHG 25.9% and CHH 28.6%). We also evaluated the DNA methylation levels of three groups of genes (30 genes in total) including DNA methyltransferase-related genes across the genome of B. rotunda (Figure 6a-d). In general, the CHH methylation levels were higher than the methylation levels in the CpG and CHG context, and out of 30 genes, 22 genes (73.3%) showed low methylation levels (<0.1) in the CHG and CHH cytosine contexts, whereas only 30% of the genes showed low methylation levels in the CpG context. Cytosine methylation of the methylation-related genes in all cytosine contexts (CpG, CHG and CHH) was the highest for the DRM2, followed by MET1 and CMT3 (Figure 6a-d). Among the somatic embryogenesis-related genes, the WOX gene was heavily methylated in the CpG and CHG contexts compared to the other somatic embryogenesis-related genes (LEC2, BBM, SERK) (Figure 6a,d). For the pathway-related genes, the LPOs methylated more in all the studied samples compared to the other genes.
(CCOAOMT), and lactoperoxidase enzyme (LPO), involved in monolignols synthesis such as p-hydroxyphenyl (H), guaiacyl (G) and syringyl (S). Among them, the HCT, CCOAOMT, and CAD showed a higher expression in all the samples, except the IVL for the CAD, while CCR showed a higher expression in the WC. The gene expression differences between the tissue samples for cinnamic acid 4-hydroxylase (C4H), ρ-coumarate 3hydroxylase (C3H), ferulate 5-hydroxylase (F5H), caffeic acid O-methyltransferase (COMT), and cinnamyl alcohol dehydrogenase (CAD) were below the threshold of the FPKM without any differential expression in the studied samples.

DNA Methylation Analysis Using Bisulfite Sequencing
Genome wide methylation percentages determined from the bisulfite sequence data from the leaf and four tissue cultured samples, were higher in all methylated cytosine contexts for the samples from the EVL (CpG 73.2%, CHG 36.2% and CHH 33.7%) and IVL (CpG71.3%, CHG 35.4% and CHH 33.5%). The lowest levels were for the EC (CpG 53.4%, CHG 18.5% and CHH 25.3%), followed by the WC (CpG 63.8%, CHG 21.9% and CHH 28.1%) and the DC (CpG 68.4%, CHG 25.9% and CHH 28.6%). We also evaluated the DNA methylation levels of three groups of genes (30 genes in total) including DNA methyltransferase-related genes across the genome of B. rotunda (Figure 6a-d). In general, the CHH methylation levels were higher than the methylation levels in the CpG and CHG context, and out of 30 genes, 22 genes (73.3%) showed low methylation levels (<0.1) in the CHG and CHH cytosine contexts, whereas only 30% of the genes showed low methylation levels in the CpG context. Cytosine methylation of the methylation-related genes in all cytosine contexts (CpG, CHG and CHH) was the highest for the DRM2, followed by MET1 and CMT3 (Figure 6a-d). Among the somatic embryogenesis-related genes, the WOX gene was heavily methylated in the CpG and CHG contexts compared to the other somatic embryogenesis-related genes (LEC2, BBM, SERK) (Figure 6a,d). For the pathway-related genes, the LPOs methylated more in all the studied samples compared to the other genes.

Correlation between Gene Expression Levels and DNA Methylation Levels of Genes Related to Methylation, Somatic Embryogenesis and Secondary Metabolite Pathway
From the gene expression analysis, we observed that the expression level of the DNA methyltransferase genes, MET 1, CMT 3, and DRM2 was higher in the callus than the leaf samples and was highest in the embryogenic callus (EC) for all three genes, but lowest in the in vitro leaf (IVL) (Figure 7a). DRM2 showed the lowest level of expression and the highest level of DNA methylation. The DNA methylation levels of these genes at the CpG, CHG, CHH cytosine contexts were the highest in the DC and WC with similar and lower methylation levels in the embryogenic callus and leaf samples. Overall, the expression of methylation-related genes was higher in the samples of EC, DC, and WC but other than for the CMT3, which showed an inverse relationship between the expression level and methylation levels, there was no clear correlation between the level of DNA methylation and the level of gene expression (Figure 7b). methyltransferase genes, MET 1, CMT 3, and DRM2 was higher in the callus than the leaf samples and was highest in the embryogenic callus (EC) for all three genes, but lowest in the in vitro leaf (IVL) (Figure 7a). DRM2 showed the lowest level of expression and the highest level of DNA methylation. The DNA methylation levels of these genes at the CpG, CHG, CHH cytosine contexts were the highest in the DC and WC with similar and lower methylation levels in the embryogenic callus and leaf samples. Overall, the expression of methylation-related genes was higher in the samples of EC, DC, and WC but other than for the CMT3, which showed an inverse relationship between the expression level and methylation levels, there was no clear correlation between the level of DNA methylation and the level of gene expression (Figure 7b).
Similarly, while there were different expression patterns for the four somatic embryogenesis-related genes of SERK, BBM, LEC2, and WOX between the different leaf and callus samples (Figure 7c), the DNA methylation level of each gene across the different leaf and callus samples was largely unchanged (Figure 7d). A comparison of 23 of the B. rotunda genes involved in the flavonoid and phenylpropanoid pathways showed them to be expressed differentially in the B. rotunda leaf and callus samples. Among them, the BGLU, CAD, CHS, LPO8, LPO9 and PAL were expressed more highly in the callus than in the leaf samples (Figure 7e). The highest level of DNA methylation was observed for the HCT, CCR, and LPO2 genes in all the studied samples and again, there was no general correlation between the gene expression levels and methylation levels for these samples (Figure 7f).

Discussion
We present a genome assembly of Boesenbergia rotunda (2n = 36) with an estimated genome size of 2.4 Gb. The genome of the plant we sequenced, when in cultivation a largely vegetatively propagated species, shows an unusually high heterozygosity of 3.01%, suggesting that the cultivar may be of hybrid origin or may have undergone whole genome duplication events. This is also suggested based on the large number of unigenes, 408 in B. rotunda, notably more than twice that of Ensete glaucum [57], and 46,765 duplicated genes (65.8% of the B. rotunda genome, with at least 50% support). As noted in Citrus Similarly, while there were different expression patterns for the four somatic embryogenesisrelated genes of SERK, BBM, LEC2, and WOX between the different leaf and callus samples (Figure 7c), the DNA methylation level of each gene across the different leaf and callus samples was largely unchanged (Figure 7d). A comparison of 23 of the B. rotunda genes involved in the flavonoid and phenylpropanoid pathways showed them to be expressed differentially in the B. rotunda leaf and callus samples. Among them, the BGLU, CAD, CHS, LPO8, LPO9 and PAL were expressed more highly in the callus than in the leaf samples (Figure 7e). The highest level of DNA methylation was observed for the HCT, CCR, and LPO2 genes in all the studied samples and again, there was no general correlation between the gene expression levels and methylation levels for these samples (Figure 7f).

Discussion
We present a genome assembly of Boesenbergia rotunda (2n = 36) with an estimated genome size of 2.4 Gb. The genome of the plant we sequenced, when in cultivation a largely vegetatively propagated species, shows an unusually high heterozygosity of 3.01%, suggesting that the cultivar may be of hybrid origin or may have undergone whole genome duplication events. This is also suggested based on the large number of unigenes, 408 in B. rotunda, notably more than twice that of Ensete glaucum [57], and 46,765 duplicated genes (65.8% of the B. rotunda genome, with at least 50% support). As noted in Citrus limon [65], high levels of heterozygosity complicate the assembly process and due to the clonal propagation nature of the fingerroot ginger, offspring resulting from sexual hybridization is rather limited. Thus, we applied a similar approach as reported by Chin et al. (2016) and Baek et al. (2018), for the assembly of the B. rotunda genome [66,67]. The sequencing assembly of B. rotunda using long PacBio reads, in addition to the Illumina short-reads, and followed by assembly using the FALCON assembler resulted in a scaffold number of 10,627. This relatively high scaffold number is not unexpected considering the high repeat content (72.51%) of the B. rotunda genome, coupled with the relatively high level of heterozygosity (3.01%), and the lack of any molecular marker and breeding data for B. rotunda. Future mapping and marker studies could help to resolve an assembly into the anticipated 18 chromosomes, as could more recent technologies such as single chromosome sequencing and optical mapping [56].
Sequence information for other Boesenbergia species is not yet available, with the closest relative of B. rotunda from sequenced genomes at the time of our study being M. acuminata, based on previous analyses using amino acid data from single genes including chalcone isomerase (CHI) [68] and phytyltransferase (BrPT2) [69]. Our phylogeny analysis also showed M. acuminata as the closest relative among those compared with Z. mays, O. sativa, H. vulgare, and B. distachyon from the Poaceae family, though more distantly related, as expected.
The repeat content of the B. rotunda at~72% of the assembled genome is high compared to many other plant genomes in this order such as Musa itinerans (38.95%) [70] and M. acuminata (35.43%) [71], but similar to that of Z. officinale (ginger official) at 81% [72]. A higher level of repeat content has been observed to correlate with the larger genome sizes in Fabaceae [73] and Melampodium [74]. Both of those reports suggest the greater genome size to be largely driven by Ty3/gypsy LTR-retrotransposons and it is interesting to note that B. rotunda also has a high LTR content of 64%. While data for the genome sizes and content are not yet available for other Boesenbergia species, the Z. officinale genome has a similarly high value of 61% LTR which was also suggested to contribute to the high genome size [75]. Studies in other plant species have reported that plant genomes generally have over 50% transposable elements content (e.g., maize) while some small plant genomes such as Arabidopsis may have as low as 10% repeat content [76][77][78]. Cytosine methylation is usually much denser in transposons than in genes [79][80][81] and this has also been correlated with the evolution of genome size in angiosperms [77]. The large genome size and high repeat content of B. rotunda with relatively low gene body cytosine methylation levels of the genes selected for observation in the current study, fit well with this model and it will be interesting to compare this with other Boesenbergia species in the future when similar data becomes available.
As DNA methylation is dynamic, we saw variations in the global DNA methylation levels in the different samples. Unmethylated DNA has been shown to demarcate expressed genes [82] and so to be able to examine this in the context of gene expression in B. rotunda and to add depth to our genome data, we included deep sequencing of the leaf and callus transcriptomes from B. rotunda. There are several alternative tools for the de novo assembly of RNA-seq short reads into a reference transcriptome and we compared analysis from four assemblers. The quality of the assembly was noticeably affected by both the k-mer size and assembler tool, with Oases delivering the highest N50 size and average contig length at k-mer 21 compared to at k-mer 24 or other assemblers ( Figure S4, Table 3), indicating a more effective and accurate assembly. In comparison to a previous transcriptome assembly of B. rotunda by the SOAPdenovo-Trans de novo assembler, our study obtained a longer N50 size (1019) compared to an N50 value of 236 reported by [40]. An Oases assembly of the genome sequence data from a Fern, Lygodium japonicum, was also found to give the best mean transcript length and N50 size when compared to assemblies using Trinity and SOAPdenovo-Trans [83]. The BUSCO assessment of the B. rotunda transcriptome data also showed that Oases had higher numbers of complete and single copy contigs and less fragmented contigs. Based on this, the transcriptome assembly using Oases offered an improved resource for genome annotation and the gene expression study in B. rotunda.
We focused the functional aspects of the B. rotunda genome study on the methylation and the flavonoid and phenylpropanoid pathways, as the chalcone, panduratin A, is considered one of the most promising bioactive compounds from B. rotunda and previous studies from our research group had indicated that DNA methylation may influence the gene expression in tissue cultured materials [84,85]. From the 23 flavonoid and phenylpropanoid pathway genes that showed differential expression between the leaf and any of the callus samples, most were more highly expressed in the EC, DC, and WC, including PAL, CHS, CHI, DFR, BGLU, HCT, CCOAOMT, and CAD (Figure 7), with the highest expression level in the non-embryogenic callus (DC and WC). This aligns with previous ultra performance liquid chromatography-mass spectrometry (UPLC-MS) data showing the WC followed by the DC to have a higher concentration of panduratin, pinocembrin, pinostrobin, cardamonin and alpinetin [43]. Based on this, the unigenes identified in the genome assembly that correspond to CHS and CHI, encode key enzymes in the biosynthesis of panduratin A in B. rotunda. Although DNA methylation plays an important role in the regulation of gene expression, comparison of the methylation of the differentially expressed flavonoid and phenylpropanoid pathway genes, with their cytosine methylation, showed no obvious patterns to indicate any correlation for this gene set.
As our samples included embryogenic and non-embryogenic callus tissue, we also evaluated the expression level of DNA methylase genes (MET1, CMT3, DRM2) and genes related to somatic embryogenesis (SERK, BBM, LEC2, WUS) with the DNA methylation levels across the genome of B. rotunda based on a bisulfite sequence analysis. An earlier study with some quantitative qRT-PCR validation suggested that the higher level of expression of the methyltransferase-related genes and the lower CG, CHG and CHH sequence contexts in the EC samples was negatively correlated with the total methylation level of the DNA methyltransferase-related genes [85]. We did observe a similar pattern for CMT3 in all five sample types in the current study ( Figure 7); however, no similar correlation between the expression level and cytosine methylation was observed in the current data for the other genes examined. The lack of correlation between the transcript expression and the respective gene body methylation from our data may be due to the limitations of the current genome assembly, such that the cis regions could not be well annotated. In the future, a higher resolution genome assembly for B. rotunda would be useful to examine the methylation data from the current study.
Although only a minor portion of the B. rotunda genome at around 0.35%, microsatellites are the key elements in plant genomes. Among these, short sequence repeat mi-crosatellites (SSRs) have found wide utility as co-dominant markers useful in breeding and diversity studies [86,87]. In this study, we identified the genomic and EST-SSRs from B. rotunda, designing primers and showing several to have a transferability to Musa and Ensete genomes, mostly in silico analysis, but with 14 tested in PCR experiments. Boesenbergia, Musa and Ensete are members of the same plant family Zingiberales, and all have abundant AT-rich SSR sequences; however, they are not from the same genus, so are phylogenetically somewhat distant as reflected in the fairly low numbers with potential as markers across these species. Nevertheless, these newly developed SSR markers enhance the genetic resources for B. rotunda as well as the plant family Zingiberales, and these markers could be utilized for genotyping, population structure analysis, association studies and cultivar identification as well as any other breeding application of the Boesenbergia spp.
In conclusion, the genome assembly of B. rotunda covers some 2300 Mbp divided among 18 relatively similar submetacentric chromosomes. The cultivated accession sequenced was highly heterozygous. The genome assembly, transcriptome, gene expression, SSR analysis and DNA methylation data from this study are resources that will allow further understanding of the unique secondary metabolite properties and their biosynthetic pathways in the genus Boesenbergia and for functional genomics of B. rotunda characteristics. As this data represent a first report of a Boesenbergia genome, this, together with the existing data from sister families such as Musacea, can contribute to the understanding of the pan-genome of the Zinzerberales, evolution of the ginger plant family and the potential genetic selection or improvement of gingers.

Ethics
The conduct of this study was approved by the University of Malaya's grant management committee, chaired by Professor Noorsaadah Abdul Rahman (noorsaadah@um.edu.my), the Director of the Institute of Research Management and Monitoring, and no human, animal, or endangered or protected plant species were used as materials.

Plant Materials and Samples Preparation
Rhizomes of B. rotunda (L.) Mansf. were collected from a commercial farm in Temerloh, Pahang, Malaysia (Latitude: 3.27 • N, Longitude: 102.25 • E) and propagated in the laboratory using the steps outlined by Karim et al. (2018b) [85]. To promote rhizome sprouting, first the rhizomes were cleaned under running tap water for ten minutes, then they were allowed to air dry before being planted in the black polybags.
Each day, water was sprayed on the samples to induce the growth of shoots and leaves. Four weeks after planting, young ex vitro leaf (EVL) samples were collected from the rhizome-derived plants. To generate the callus samples, meristematic block explants were subcultured on MS medium supplemented with 30 g L −1 of sucrose and 2 g L −1 of Gelrite ® with 2,4-dichlorophenoxy acetic acid (2,4-D) at concentrations of 1 mg L −1 (4.5 µM) for the watery callus (WC), 3 mg L −1 (13.5 µM) for the embryogenic callus (EC) and 4 mg L −1 (18 µM) for the dry callus (DC). The WC, EC and DC samples were collected after four weeks on the respective media (eight weeks after initial culturing from the explant). Plants grown from the embryogenic calli were cultured on regeneration media (MS0) and the in vitro leaves (IVL) were collected after 8 weeks, when the leaves were large enough for nucleic acid extraction (16 weeks after the initial culturing from meristematic block explants) [84].

DNA Extraction and Sequencing for Genome and Bisulfite Sequence (BS-Seq) Analysis
Total genomic DNA was isolated using a modified cetyl trimethyl ammonium bromide (CTAB) method from the ex vitro leaf (EVL) of B. rotunda [88]. The quality and quantity of the extracted genomic DNA were determined by a NanoDrop 2000 Spectrophotometer (Thermo Fisher Scientific Inc., Waltham, MA, USA) and Qubit ® 2.0 Fluorometer (Thermo Fisher Scientific Inc., Waltham, MA, USA). The DNA sample was sent to BGI Shenzhen (Shenzhen, China) for library construction and de novo sequencing on the Illumina HiSeq2000 and HiSeq2500 platforms (Illumina Inc., San Diego, CA, USA) and the PacBio RS II platform (PacBio Inc., Menlo Park, CA, USA) [89]. The BS sequence analysis was performed using the extracted DNA from the EVL, EC, DC, WC, and IVL treated by sodium bisulfite. The paired-end reads were generated using an Illumina HiSeqTM 2000 platform (Illumina Inc., San Diego, CA, USA) by Sengenics Sdn. Bhd., Malaysia from a total of five samples with three biological replicates for each sample.

RNA Extraction and Sequencing for Transcriptome (RNA-Seq) Analysis
The total RNA was extracted from the EVL, EC, DC, WC, and IVL samples using a CTAB method [90] and three independent sets of RNAs for each sample were generated. High quality RNA samples were sent to BGI-Shenzhen (Shenzhen, China) for library construction and sequencing using the Illumina Genome Analyzer IIx (GAIIx) platform (Illumina Inc., San Diego, CA, USA) to generate single-end reads. The FISH technique was performed according to Schwarzacher and Heslop-Harrison (2000) [91]. 'Fingers' of B. rotunda were placed in shallow dishes with soil to initiate root growth and kept in the glasshouse at the University of Leicester, UK. Newly grown root tips of 1-2 cm length were treated with 2 mM 8-hydroxyquinoline at growth temperature for 2 h followed by incubation overnight at 4 • C, and then fixed with 96% ethanol:glacial acetic acid (3:1). The roots were digested for 1-3 h at 37 • C with a mixture of cellulose (32 U/mL, Sigma-Aldrich, St. Louis, MO, USA, C1184), 'Onozuka' RS cellulose (20 U/mL), pectinase (from Aspergillus niger, Sigma-Aldrich P4716) and Viscozyme (20 U/mL, Sigma-Aldrich V2010) in a 10 mM citric acid/sodium citrate buffer (pH4.6). Chromosome preparations of dissected meristems were made in 60% acetic acid by squashing under a cover slip. Slides were stored at −20 • C until the FISH.

Genome Size Estimation
Flow cytometry was used to determine the genome size of the B. rotunda as an unknown sample on a MACSQuant Analyzer (Miltenyl Biotec Inc., Bergisch Gladbach, Germany), using soybean (Glycine max cv. Polanka (G) 2C = 2.50 pg DNA) and pea (Pisum sativum cv. Ctirad (P) 2C = 9.09 pg DNA) as the internal standards and propidium iodide as the stain. Each plant (sample and comparator) was compared using an average of four biological replicates [53,93,94]. We also performed a K-mer analysis to estimate the B. rotunda genome size and heterozygosity rate using Jellyfish [95] and GenomeScope [96].

Genome Assembly
A combination of sequencing technologies of the PacBio RSII platform, Illumina HiSeq 2500 paired-end reads (PE) with a 450 bp insert size library, and Illumina HiSeq 2000 matepair reads (MP) with insert size libraries of 2, 5, 10, 20, and 40 kb was performed for the genome assembly. Before assembly, the Illumina HiSeq sequence reads were filtered by removing the adaptors and low-quality nucleotides. PacBio reads were filtered to remove the short reads of less than 500 bp or a quality score lower than 0.8, then error correction for the long reads was completed by FALCON [67,97], following the general principles proposed by [98]. We tried to use several de novo assemblers to construct the assembly with both the Illumina and PacBio reads. Finally, we chose the SMARTdenovo software [99]. Corrected PacBio reads were assembled with the SMARTdenovo software (https://github.com/ruanjue/smartdenovo accessed on 1 June 2018) to construct the contigs. For the PacBio data, constructed contigs were subsequently polished by standalone consensus modules [98] and Pilon software [100] for the Illumina PE reads. Polished contigs were used as input for the scaffolding. The scaffolds were constructed by SOAP scaffolding, and the SSPACE tool [101] with Illumina mate-pair reads (2-40 k) with default parameters to extend the length of the scaffolds for the raw assembly. The gaps within the scaffolds, and consensus sequences generated from the PacBio sub-reads were filled using PBJelly2 [102]. Finally, the scaffolds were corrected by Pilon [100] with Illumina PE reads to correct the assembly errors and to obtain the final genome assembly.
The completeness of the genome assembly was tested by searching for 1440 core eukaryotic genes using benchmarking universal single-copy orthologs (BUSCO) (v2.0) [61]. The quality of the genome assembly was assessed by mapping of the Illumina paired-end 250 bp read data to the contigs using BWA-MEM (version 0.7.15-r1142) [103].

Repeat Annotation
Tandem repeats were identified with the tandem repeat finder (TRF) [104] (version 4.0.4). The transposable elements (TE) were identified with integrated homology-based and de novo methods [53]. The homology-based prediction was completed at the DNA and protein levels by comparing the assembly to the RepBase v.20.04 [105]

Gene Annotation
The protein-coding genes prediction was completed by homology, de novo, and RNA-Seq-based approaches. For generation of the homology-based predictions, the gene sets from four species, i.e., M. acuminata (http://www.promusa.org/Musa+acuminata accessed on 28 May 2019), P. dactylifera, O. sativa (http://rice.plantbiology.msu.edu/ accessed on 28 May 2019) and A. thaliana (https://www.arabidopsis.org/ accessed May 2019) were downloaded. The nonredundant protein sequences for each gene set was searched by TBLASTN. For generation of the expression-based evidence, RNA-Seq short reads originated from the EVL, IVL, EC, DC, and WC tissues were mapped to the ginger genome with the Hisat2 v.2.0.4 [106] alignment program. For the de novo gene annotation, wellsupported transcripts identified both by the homology-based and the RNA-Seq based predictions were selected for ab initio prediction using AUGUSTUS v.3.2.3 program [107,108]. The exon-intron structure of the genes was predicted using Genscan [109] and SNAP [110]. The results from the three approaches were consolidated using MAKER v.2.31.9 [111] to generate a protein-coding gene set. For the functional information, in silico translated products of the coding genes were aligned to the seven known protein databases of NR [112], InterPro [113], GO [114], KEGG [115], Swissprot and TrEMBL [116], COG [117].

Non-Coding RNAs Annotation
Four types of ncRNA, including microRNA (miRNA), transfer RNA (tRNA), ribosomal RNA (rRNA), and small nuclear RNA (snRNA), were annotated in the assembled genome based on de novo/or homology-based search methods. The tRNAs genes were annotated using tRNAScan-SE v.1.3.1 [118] with default parameters and filtered to remove the pseudo annotated tRNA genes. To identify the rRNA genes, the B. rotunda genome assembly was searched against the rRNA template sequences (Rfam database, release 13.0) [119] of A. thaliana, ca and M. acuminata with BLASTN, with an identity cutoff of ≥90% and a coverage at 80% or more. Using Infernal v.1.1.2 [120], mapping of the B. rotunda genome sequences to the Rfam database was performed to identify the miRNA and snRNA genes [53,89].

Construction of Phylogenetic Trees
The conserved orthologs genes (COS) in the B. rotunda genome and 13 other species were identified using the Orthofinder program [62]. Using identified single-copy orthologous genes, a neighbour joining (NJ) tree was constructed using a MEGAX [121] and UpSet plot using UpSetR [122].

Gene Family Expansion and Contraction Analysis
The data generated from OrthoFinder were used as input for the computational Analysis of gene Family Evolution (CAFE) [123] to estimate the gene family expansion and contraction. The phylogenetic tree from OrthoFinder was converted to an ultrametric tree using the make_ultrametric.py in OrthoFinder. Gene families with a large variance (≥100 gene copies) were removed using the clade_and_size_filter.py in the CAFE package. Divergence times in the phylogenetic tree were estimated using PATHd8 [124] and were calibrated using the divergence time between Brachypodium and Oryza (40-45 million years ago) (The International Brachypodium Initiative 2010) [125] and Arabidopsis and Oryza (130-200 million years ago) [126]. CAFE version 5 [123] was used to determine the stochastic birth and death processes and for modelling of the gene family evolution. The parameters for CAFE5 were "cafe5 -i orthofinder_gene_families.txt -t orthofinder_ultrametric.tre -p -e".

DNA Methylation Analysis Using Bisulfite Sequencing (BS-Seq)
For the bisulfite sequencing analysis, low quality reads and adapters were trimmed by the Trim-Galore [127] tool specific for bisulfite sequencing. The trimmed reads were mapped to draft the ginger genome with the Bismark [128] tool by choosing the Bowtie aligner with options set to the best, minimum map length of 50 bp and insert size of 500 bp. Mapping duplicates were removed by the Methpipe [129] tool. The Methcounts program from the Methpipe software package was used for the mapping of methylated and unmethylated cytosines where the methylation level at a single base resolution was calculated based on the number of 5-methylated cytosines (5 mC) in the reads, divided by the sum of the C and thymines (T) in the CG, CHG and CHH sequence contexts within the coding sequences of all the selected genes from B. rotunda [84].

De Novo Transcriptome Assembly of B. rotunda and Functional Annotation
To gather information related to the secondary metabolites, expression of the genes involved in the flavonoid and phenylpropanoid pathways of B. rotunda was based on the deep transcriptome sequencing of three cell culture types, in vitro and ex vitro leaves of B. rotunda. Based on our previous studies of embryogenesis related genes [84,85] and on the levels of metabolites in the cell cultures [43], we generated deep transcriptome data from five tissue types (each with three replicates) to investigate the gene regulation patterns in the phenylpropanoid and flavonoid pathways, to identify the metabolite producing cells in the B. rotunda in vitro cultured cells. The quality of the RNA-Seq raw reads were checked by the FastQC software (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/ accessed on 28 May 2018) to remove the low-quality reads and adapter sequences. Four different short-read assemblers: Oases [130], TransABySS [131], SOAPdenovo-Trans [132] and Trinity [133], were used to assemble the high-quality reads to the contigs. Using a best k-mer strategy, the high-quality reads were assembled at different k-mer lengths 21-51 using Oases, TransABySS and SOAPdenovo-Trans, whereas the assembly by Trinity used default parameters (K-mer 25). The contigs generated from all the k-mers by each respective assembler were merged and the redundancy was removed using CD-HIT [134]. Then, the TGICL clustering tool [135] was used to assemble the non-redundant contigs with a maximum identity of 90 and a minimum overlap length of 40. The completeness of the transcriptome assemblies was measured using the BUSCO [61] software.
High-throughput functional annotation was performed with the Blast2GO Command Line [136]. The BLASTX algorithm was performed to obtain a list of the potential homologous for each input sequence. To evaluate the functional annotation for the query sequences [137], gene ontology (GO) terms associated with the obtained BLAST hits were mapped by Blast2GO. GO mapping and Enzyme Commission (EC) classification were performed based on the annotation parameters of a Cut-off 55, E-Value 1 × 10 e−6 , GO Weight of 5, and HSP-Hit Coverage Cut-off 0. In order to identify the functional enrichment categories among the differentially expressed genes (DEGs), a Fisher exact test with a false discovery rate (FDR) cutoff of 0.05 was used. Classification of the B. rotunda transcripts into functional categories was performed using the Eukaryotic Orthologous Groups (KOG) [117] protein database. Biological pathways mapping of the B. rotunda transcripts was completed using the KEGG database [115]. Unigenes potentially related to the Panduratin A and other secondary metabolites biosynthesis were identified as those with a unigene annotated function matching to the enzymes assigned to the flavonoid and phenylpropanoid biosynthesis pathways in the KEGG pathway database.

Estimation of Transcript Abundance and Differential Expression
The RSEM software package [138] was used for the estimation of the gene expression level with a mean fragment length of 200 bp and fragment length standard deviation of 80 bp. Then, the FPKMs (fragments per feature kilobase per million reads mapped) were used to normalize the expression level for each gene and comparison between the samples. The Bioconductor tool (EdgeR) [139] was used for the differential expression analysis with a p-value threshold of ≤0.05 and |log2 (Fold Change)| ≥1 used to identify the significant differential expression of the transcripts.

Mining of Simple Sequence Repeats (SSRs) from B. rotunda Transcriptome and Genome Assembly
The whole genome assembly and assembled transcriptome sequences were searched for SSRs using a modified Liliaceae simple sequence analysis tool (LSAT) pipeline [140]. The searches were standardized for mining SSRs from mono to 20 bp with a minimum repeat loci of 12 nucleotides. The SSRs were classified based on the SSR locus length (class I > 20 nt and class II 12-20 nt) and the nucleotide base composition of the SSR loci (AT-rich, GC-rich and AT-GC balance). Primer pair sequences were developed for each identified SSR loci using the default parameters of the primer 3 (http://bioinfo.ut.ee/primer3 accessed on 28 May 2020) software [141]. Redundant primer pairs were eliminated using the perl script developed by Biswas et al. [87]. An electronic polymerase chain reaction (ePCR) [142] strategy was applied for mapping and estimating the transferability of the designed primers. The primers were mapped on four genomes viz. M. acuminata, M. balbisiana, M. itinerans and E. ventricosum, as those are the most related plant species of B. rotunda. A maximum 2 nt mismatch with two gaps was set as a cut off value for the ePCR result filter.
Wet lab validation of the transcriptome SSR (EST-SSR) and genomic SSR (G-SSR) markers A total of 14 (8 EST-SSR and 6 G-SSR) primer pairs were selected based on their in silico transferability result to assess their marker potentiality. Three B. rotunda, two Ensete and three Musa species were used to validate the selected primer sets. Fresh leaf samples were harvested from the greenhouse grown plants and the total genomic DNA was extracted following the CTAB methods. PCR amplifications were carried out for the SSR primer validation under the following conditions: 95 • C for 2 min, 35 cycles at 95 • C for 1 min, 60 • C for 1 min, and 72 • C for 1 min, followed by a final elongation at 72 • C for 10 min. The amplified DNA fragments were run on 2% agarose gels in a 1 × Tris-Borate-EDTA (TBE) buffer with 80v for 90 min. A 100 bp molecular ladder was used to estimate the amplicon size.