Transcriptome Analysis of Bael ( Aegle marmelos ( L . ) Corr . ) a Member of Family Rutaceae

Aegle marmelos (L.) Corr. is a medicinally and horticulturally important tree member of the family Rutaceae. It is native to India, where it is also known as Bael. Despite its importance, the genomic resources of this plant are scarce. This study presented the first-ever report of expressed transcripts in the leaves of Aegle marmelos. A total of 133,616 contigs were assembled to 46,335 unigenes with minimum and maximum lengths of 201 bp and 14,853 bp, respectively. There were 7002 transcription factors and 94,479 simple sequence repeat (SSR) markers. The A. marmelos transcripts were also annotated based on information from other members of Rutaceae; namely Citrus clementina and Citrus sinensis. A total of 482 transcripts were annotated as cytochrome p450s (CYPs), and 314 transcripts were annotated as glucosyltransferases (GTs). In the A. marmelos leaves, the monoterpenoid biosynthesis pathway was predominant. This study provides an important genomic resource along with useful information about A. marmelos.


Introduction
Aegle marmelos (L.) Corr.(2 n = 18) or Bael is an underexploited member of family Rutaceae.Believed to be native to the Indian subcontinent, it is well distributed throughout the tropical and subtropical belts of southeast Asia [1,2].Botanically, A. marmelos is a deciduous tree stretching up to 10 m in height that flowers during the months of May-June [3,4].It is also commonly grown as a horticultural plant in India, and its fruits are processed as juice or candies, as well as eaten fresh.During the past few decades, a spike in its cultivation as a horticulture plant has been attributed to its medicinal properties, along with a hardy nature that allows it to be cultivated on marginal lands with acidic or alkaline soils [5,6].
The traditional medicine system of Ayurveda in India routinely uses every part of A. marmelos as a therapy for medical conditions [7,8].The leaves are most easily accessible, and are therefore more regularly used for the treatments than any other plant part. A. maremelos leaves are used to treat jaundice and help in wound-healing when applied as a paste on a wound surface [9].Moreover, A. marmelos leaf extracts have been proved to be a better cure for gastrointestinal and hematopoietic damage than its fruits [10].The leaf extract of A. marmelos is used as a medication against a number of chronic diseases such as diabetes, pancreatic cancer, and arthritis [11][12][13][14].All of these medicinal properties of A. marmelos leaves are attributed to various phytochemicals present in the leaves such as aeglin, rutin, γ-sitosterole, β-sitosterol, eugenol, marmesinin, glycoside, skimmianine, etc. Broadly, these phytochemicals can be divided into three main classes: alkaloids, phenylpropanoids, and terpenoids [15,16].However, there is no genomic data-based information about the pathways of these important metabolic compounds that are present in the A. marmelos leaf.The information regarding the biosynthetic pathways and the encoding enzymes present in the A. marmelos leaf will be highly useful for the functional genomics in A. marmelos via transgenics and metabolic engineering approaches.Furthermore, A. marmelos leaf extract is used for the green synthesis of gold and silver nanoparticles [17,18].
The sum total of all of the transcripts captured in the cell of an individual organism is called its transcriptome [19].There are two ways to capture the expressed transcripts: either by microarray, which is limited to predefined sequences, or by performing RNA-Seq using second-generation sequencing technologies [20].This kind of sequencing has revolutionized the understanding of non-model organisms, and has evolved as one of the first choices of methods to apply to gene discovery and the expression profiling of non-model organisms [21,22].The availability of well-defined computational tools, along with a well-applied methodology, has further demonstrated the effectiveness of de novo transcriptome assemblies in organisms even without a reference genome [23,24].
Genomic resources in A. marmelos are scarce compared with other members of Rutaceae, such as Citrus sinensis (Sweet Orange) and Citrus clementina (Clementine), both having well-annotated genomes [25,26].Moreover, the unavailability of molecular markers based on the genomic information has further decelerated the molecular breeding efforts in A. marmelos.Earlier, a diversity study was carried out using only 12 random amplification of polymorphic DNA (RAPDs) [1].This limitation can be overcome by developing an appropriate resource of genomic information-based molecular markers using a next-generation sequencing (NGS)-based approach such as transcriptomics [20,22].To the best of our knowledge, this is the first detailed report on the transcriptome of this medically important plant.Moreover, only six expressed sequence tag (ESTs) are available in the National Center for Biotechnology Information (NCBI) database (accessed on 25 May 2018) [27].An investigation into the leaf transcriptome of A. marmelos can help answer key questions regarding various aspects related to genes and their gene function, via the pathways involved in the metabolic compound formation.Therefore, we used RNA sequencing followed by the de novo transcriptome assembly of A. marmelos leaves to identify the transcription factors, simple sequence repeats (SSRs), and transcripts related to important metabolic pathways in the leaves of A. marmelos.Also, the information regarding cytochrome P450s (CYPs) and glucosyltransferases (GTs) extant in the leaf of A. marmelos was also accomplished.

RNA Isolation and Sequencing
Young and tender leaves from three mature and healthy plants of A. marmelos variety "Kaghzi" (~five years old) were collected from the Government Garden Nursery (coordinates at 29 • 58 06.9" N 76 • 52 50.8"E) in Haryana, India.The sampled leaf tissues were stored in RNAlater (Life Technologies, Carlsbad, CA, USA) till further use.RNA was extracted with a TRIZOL reagent (Life Technologies Corporation, Carlsbad, CA, USA) based RNA extraction protocol for plant leaves [28,29].The quality of the extracted RNA was checked on a 1% formaldehyde denaturing agarose gel, and further quantified using a Nanodrop ND-1000 spectrophotometer (Nanodrop Technologies, Montchanin, DE, USA).A pooled sample of RNA from three selected plants was used for a single cDNA library preparation.The library was prepared with a TruSeq RNA Library Prep Kit v2 from Illumina ® (Illumina Inc., San Diego, CA, USA), and the library quantification was done using a Qubit Fluorometer (Qubit™ dsDNA HS Assay Kit, Life Technologies Corporation, Carlsbad, California, USA) and Agilent D1000 ScreenTape system (Agilent Technologies, Santa Clara, CA, USA).The library was further sequenced on the Illumina HiSeq 2500 (2 × 150 bp) platform (Illumina, Dedham, MA, USA).

De Novo Assembly and Identification Coding Sequences
The cleaned reads were assembled using Trinity software (version 2.4.0) and TransDecoder v. 3.0.1 (http://transdecoder.sourceforge.net/)[30] was used to identify candidate coding regions within the generated transcripts and look for the open reading frames (ORF) that were at least 100 amino acids long in order to decrease the chances of false positives.
The pooled RNA sample of A. marmelos leaves with RIN values around 8.0 generated a total of 115.92 million paired reads of high quality (Phred score > 30).Trinity assembler was used for the assembly, and after trimming of adapters, there was a total of 133,616 contigs (only from reads of 200 bp and above in length) clustered into 46,345 unigenes (Table 1).The raw data that was obtained as a result of sequencing was submitted to NCBI BioProject (PRJNA433585).The assembly completeness report from gVolante estimated that the transcriptome assembly was 90.15% complete (Figure S1).We scrutinized for an open reading frame that was at least 100 amino acids long in order to decrease the chances of false positives during open reading frames (ORF) predictions.The annotated transcripts with ORFs are listed in Table 2.A total of 90,525 transcripts were annotated to GO terms (Table S1).The transcripts related to plant species were extracted and used for gene ontology (Table 2).

GO Annotation
In total, 600,642 Gene Ontology (GO) terms were mapped to the A. marmelos leaf contigs belonging to all the three possible classes, i.e., biological process (227,921 transcripts), cellular component (188,465 transcripts), and molecular function (184,25 transcripts) in the GO database (Figure 1).The breakdown of the proteins associated with the various biological process, cellular components, and molecular functions is illustrated in Figure 2. The "integral component of membrane" (GO: 0016021) associated with various cellular components, "transcription DNA-templated" (GO: 00006351) associated with biological processes and "ATP binding" (GO: 00005524) associated with molecular function were the most mapped terms in their respective categories (Figure 2.).
The GO terms primarily define three categories of functions: namely, the biological, cellular, and molecular functions for a gene product.This is achieved by associating a gene with their ontologies [45,46].Earlier studies have pointed out a higher metabolic activity in the leaves of A. marmelos, which is because of the presence of phytochemicals such as alkaloids, flavonoids, and phenols [47,48].We have identified a number of GO terms in the leaves of A. marmelos; this information could lead to the identification of important pathways of metabolic compounds in A. marmelos [49].

Citrus Database Annotation
The A. marmelos transcripts were also annotated via Phytozome (https://phytozome.jgi.doe.gov/) with reference to the Citrus clementina and Citrus sinensis genomes.This resulted in the mapping of 78.44% of the transcripts to the Citrus clementine, and 79.85% to the Citrus sinensis genome (Table 3).An almost similar number of transcripts were annotated with GO terms and KEGG annotation, respectively (Tables S2 and S3).However, recently, an extensive amount of relatedness was observed within the members of genus Citrus of family Rutaceae, which was based on the study performed by

Citrus Database Annotation
The A. marmelos transcripts were also annotated via Phytozome (https://phytozome.jgi.doe.gov/) with reference to the Citrus clementina and Citrus sinensis genomes.This resulted in the mapping of 78.44% of the transcripts to the Citrus clementine, and 79.85% to the Citrus sinensis genome (Table 3).An almost similar number of transcripts were annotated with GO terms and KEGG annotation, respectively (Tables S2 and S3).However, recently, an extensive amount of relatedness was observed within the members of genus Citrus of family Rutaceae, which was based on the study performed by

Citrus Database Annotation
The A. marmelos transcripts were also annotated via Phytozome (https://phytozome.jgi.doe.gov/) with reference to the Citrus clementina and Citrus sinensis genomes.This resulted in the mapping of 78.44% of the transcripts to the Citrus clementine, and 79.85% to the Citrus sinensis genome (Table 3).An almost similar number of transcripts were annotated with GO terms and KEGG annotation, respectively (Tables S2 and S3).However, recently, an extensive amount of relatedness was observed within the members of genus Citrus of family Rutaceae, which was based on the study performed by using the whole genome sequences of 60 members in the Citrus genus; the authors even pointed out the need for reformulation of the genus [50].Simple sequence repeats (SSRs), or short tandem repeats or microsatellites, are short repeat motifs that show length polymorphism due to the insertion or deletion mutations of one or more repeat types [51].We analyzed for the abundance of SSRs of annotated plant transcripts for A. marmelos leaf transcripts using the MISA tool, and the predicted SSRs statistics are shown in Figure 3.There were 58,354 transcripts that contained SSRs, and among these, 23,034 had more than one SSRs (Table S4).In total, 94,479 SSRs were identified, of which 65.27% were monorepeats, 19.78% were direpeats, and 13.40% were trirepeats (Figure 3).Tetra, penta and hexarepeats made up 1.01%, 0.26% and 0.24% of the total, respectively (Figure 3).However, out of a total of 94,479 identified SSRs, 11,400 (12.06%) were related to the compound formation.

Simple Sequence Repeats (SSRs) Prediction
Simple sequence repeats (SSRs), or short tandem repeats or microsatellites, are short repeat motifs that show length polymorphism due to the insertion or deletion mutations of one or more repeat types [51].We analyzed for the abundance of SSRs of annotated plant transcripts for A. marmelos leaf transcripts using the MISA tool, and the predicted SSRs statistics are shown in Figure 3.There were 58,354 transcripts that contained SSRs, and among these, 23,034 had more than one SSRs (Table S4).In total, 94,479 SSRs were identified, of which 65.27% were monorepeats, 19.78% were direpeats, and 13.40% were trirepeats (Figure 3).Tetra, penta and hexarepeats made up 1.01%, 0.26% and 0.24% of the total, respectively (Figure 3).However, out of a total of 94,479 identified SSRs, 11,400 (12.06%) were related to the compound formation.
SSRs are codominant markers that are well dispersed throughout plant genomes.SSRs are popularly used for marker-assisted selection, fingerprinting, diversity assessment, and quantitative trait loci (QTLs) identification [52].Routinely, SSRs are identified in the medicinal plants via transcriptome assemblies, because they are more robust and can also be transferred among different species within the same genus.These identified SSRs can also be used for the marker-assisted breeding in A. marmelos i.e., to breed this tree for a particular environment or condition.Otherwise, until recently, only diversity-related studies were conducted in A. marmelos using universal primers, and researchers were even limited to only 12 RAPDs and 16 universal ISSRs to access diversity among their A. marmelos genotypes collection [1,53].Furthermore, these genomic information-based SSRs can help to identify and differentiate between homozygous and heterozygous individuals.SSRs are also commonly used for the map-based cloning of genes; a close association between genes and their SSRs is crucial in the context of genotyping and haplotyping [51,52].SSRs are codominant markers that are well dispersed throughout plant genomes.SSRs are popularly used for marker-assisted selection, fingerprinting, diversity assessment, and quantitative trait loci (QTLs) identification [52].Routinely, SSRs are identified in the medicinal plants via transcriptome assemblies, because they are more robust and can also be transferred among different species within the same genus.These identified SSRs can also be used for the marker-assisted breeding in A. marmelos i.e., to breed this tree for a particular environment or condition.Otherwise, until recently, only diversity-related studies were conducted in A. marmelos using universal primers, and researchers were even limited to only 12 RAPDs and 16 universal ISSRs to access diversity among their A. marmelos genotypes collection [1,53].Furthermore, these genomic information-based SSRs can help to identify and differentiate between homozygous and heterozygous individuals.SSRs are also commonly used for the map-based cloning of genes; a close association between genes and their SSRs is crucial in the context of genotyping and haplotyping [51,52].

Transcriptional Factors Identification
Gene expression patterns are regulated by transcription factors that in turn determine the different biological process [54].A total of 7002 transcription factors were retrieved from the PlnTFDB.The 52 transcription factors were unique to the A. marmelos leaves; although these were out of a total 6122 that were extant above 100 in the unigenes (Table S5).The most abundant were Auxin response factors (ARFs) (717), myeloblastosis (MYB-related) (562), a basic domain/leucine zipper (bZIP) (437), and basic helix-loop-helix (bHLH) (417), whereas HB-Other (132) and CAMATA (109) were the least abundant (Figure 4).
Auxin is the plant hormone that regulates the different plant processes from growth to senescence.Auxin response factors are necessary for the plant to response to auxin stimuli; they channelize the response via auxin response DNA elements that are present in the primary auxin response genes.ARFs switch the auxin response gene on and off via their transcriptional activation domain or transcriptional repression domain [55,56].MYB-related transcription factors play many roles like protection against biotic and abiotic stresses.MYB transcription factors also regulate the metabolism of the phenylpropanoid pathway, and are well studied with respect to the regulation of primary and secondary metabolism in the plant [57,58].Likewise, bZIP and bHLH transcription factors are also involved in the metabolic biosynthesis in plants, especially by activation of phenylpropanoid genes [59,60].

Transcripts Encoding Cytochrome p450s (CYPs) and Glucosyltransferase (GTs)
CYPs help in the primary and secondary metabolism of plants by catalyzing monooxygenation reactions.These cytochromes assist in the diversification of metabolic pathways in plants.Currently, these are potential targets for metabolic engineering for the overproduction of metabolites of interest [61].There were 477 transcripts in total that were annotated cytochrome p450s (Table S6).Considering their vital role inmetabolic pathways, we further analyzed the abundance of SSRs annotated within these cytochrome p450 transcripts (Table 4).Among the 128 identified SSRs, 85 were with monorepeats, seven were with direpeats, and 36 were with trirepeats (Table S7).
The last step in the production of plant secondary metabolites is glycosylation, which is carried out by glycotransferases (GTs) [62][63][64].A total of 314 transcripts were annotated as glucosyltransferase (Table S8).We analyzed the abundance of SSRs that were present in these transcripts (Table S9), and among the 247 identified, 109 were monorepeats, 79 were direpeats, 58 were trirepeats, and only one was identified as a tetrarepeat (Table 4).The SSRs that were identified  Auxin is the plant hormone that regulates the different plant processes from growth to senescence.Auxin response factors are necessary for the plant to response to auxin stimuli; they channelize the response via auxin response DNA elements that are present in the primary auxin response genes.ARFs switch the auxin response gene on and off via their transcriptional activation domain or transcriptional repression domain [55,56].MYB-related transcription factors play many roles like protection against biotic and abiotic stresses.MYB transcription factors also regulate the metabolism of the phenylpropanoid pathway, and are well studied with respect to the regulation of primary and secondary metabolism in the plant [57,58].Likewise, bZIP and bHLH transcription factors are also involved in the metabolic biosynthesis in plants, especially by activation of phenylpropanoid genes [59,60].

Transcripts Encoding Cytochrome p450s (CYPs) and Glucosyltransferase (GTs)
CYPs help in the primary and secondary metabolism of plants by catalyzing monooxygenation reactions.These cytochromes assist in the diversification of metabolic pathways in plants.Currently, these are potential targets for metabolic engineering for the overproduction of metabolites of interest [61].There were 477 transcripts in total that were annotated cytochrome p450s (Table S6).Considering their vital role inmetabolic pathways, we further analyzed the abundance of SSRs annotated within these cytochrome p450 transcripts (Table 4).Among the 128 identified SSRs, 85 were with monorepeats, seven were with direpeats, and 36 were with trirepeats (Table S7).The last step in the production of plant secondary metabolites is glycosylation, which is carried out by glycotransferases (GTs) [62][63][64].A total of 314 transcripts were annotated as glucosyltransferase (Table S8).We analyzed the abundance of SSRs that were present in these transcripts (Table S9), and among the 247 identified, 109 were monorepeats, 79 were direpeats, 58 were trirepeats, and only one was identified as a tetrarepeat (Table 4).The SSRs that were identified as using CYPs and GTs can be of immense potential for identifying genetic diversity among different A. marmelos accessions with divergent metabolic profiles.

Identification of Biosynthetic Pathways in A. Marmelos Leaf
A. marmelos leaves are used for the treatment of several medical conditions in Ayurveda and Yunani medicine systems [65].The transcripts with the highest fragments per kilobase per million mapped reads (FPKM) values were extracted from an annotation file along with the Kyoto Encyclopedia of Genes and Genomes (KEGG) ID and sorted from the RNA-Seq by Expectation Maximization (RSEM) file that was obtained from the assembly for transcript quantification.Using the KEGG ID, pathways were identified (Table S10).RSEM is commonly used to obtain information regarding transcript abundance from the RNA-Seq data of an organism, even without a reference genome [66].The pathway analysis identified that monoterpenoid biosynthesis and thiamine pathways were the two most expressed pathways present in the A. marmelos leaves (Figure 5).  A. marmelos leaves have been reported to contain monoterpenoids as the principal metabolites in the leaves in levels as high as 93.9% [67,68].Moreover, this monoterpenoid content is not affected by the geographic location of the plant, as it remains unaffected by changes in altitude, unlike many other metabolites [69].Thiamine is naturally produced in the plants as a sulfur-comprising and water-soluble compound.A. marmelos contains thiamine, although thiamine concentration is higher in the fruits than the leaves, and is among the fruits with the highest thiamine content [70,71].Detailed information regarding the routes, reactions, and encoded enzymes of the two most expressed pathways identified in the leaf transcriptome of A. marmelos is provided in Figures 6 and 7.This detailed information generated regarding the monoterpenoid and thiamine biosynthetic pathway will be useful for further genetic analysis of the production of these important metabolites and their pathway engineering.

Conclusions
In cases of underexploited plant species, there is often not enough genomic information available to proceed with their genetic improvement, and subsequently transfer important genes from them to cultivated crops.Transcriptome assembly is a cost-effective alternative to genome sequencing for obtaining the information of expressed genes and assisting in the more effective development of underexploited crops and medicinal plants.RNA-Seq shines a light on genes and their functions, as

ForestsForests 15 Figure 1 .
Figure 1.Genes associated with the biological process, cellular components, and molecular functions in the A. marmelos leaf transcriptome assembly.

Figure 2 .
Figure 2. Gene Ontology (GO) classification of A. marmelos transcripts.GO term are divided in three main categories: biological process (a), cellular component (b), and molecular function (c).

Figure 1 . 15 Figure 1 .
Figure 1.Genes associated with the biological process, cellular components, and molecular functions in the A. marmelos leaf transcriptome assembly.

Figure 2 .
Figure 2. Gene Ontology (GO) classification of A. marmelos transcripts.GO term are divided in three main categories: biological process (a), cellular component (b), and molecular function (c).

Figure 2 .
Figure 2. Gene Ontology (GO) classification of A. marmelos transcripts.GO term are divided in three main categories: biological process (a), cellular component (b), and molecular function (c).

Figure 3 .
Figure 3. Simple sequence repeats (SSRs) classes identified in the leaf transcripts of A. marmelos.

Figure 3 .
Figure 3. Simple sequence repeats (SSRs) classes identified in the leaf transcripts of A. marmelos.

Figure 4 .
Figure 4. Top 21 families of transcription factors identified in the A. marmelos leaf.

Figure 4 .
Figure 4. Top 21 families of transcription factors identified in the A. marmelos leaf.

Figure 5 .
Figure 5.The top 15 pathways in the A. marmelos leaf.

Figure 7 .
Figure 7. A. marmelos leaf transcriptome encoded enzymes (highlighted) involved in the thiamine biosynthetic pathway identified in the leaf.

Table 1 .
Assembly statistics of the leaf transcriptome.

Table 2 .
Annotation summary of A. marmelos leaf transcripts.COG: cluster of orthologous groups, GO: gene ontology, KEGG: Kyoto Encyclopedia of Genes and Genomes, ORFs: open reading frames, Pfam: protein family.

Table 3 .
Annotation summary of A. marmelos leaf transcripts with Citrus sinensis and Citrus clementine genome.