Chloroplast Genome Sequence of Clusterbean (Cyamopsis tetragonoloba L.): Genome Structure and Comparative Analysis

Clusterbean (Cyamopsis tetragonoloba L.), also known as guar, belongs to the family Leguminosae, and is an annual herbaceous legume. Guar is the main source of galactomannan for gas mining industries. In the present study, the draft chloroplast genome of clusterbean was generated and compared to some of the previously reported legume chloroplast genomes. The chloroplast genome of clusterbean is 152,530 bp in length, with a quadripartite structure consisting of large single copy (LSC) and small single copy (SSC) of 83,025 bp and 17,879 bp in size, respectively, and a pair of inverted repeats (IRs) of 25,790 bp in size. The chloroplast genome contains 114 unique genes, which includes 78 protein coding genes, 30 tRNAs, 4 rRNAs genes, and 2 pseudogenes. It also harbors a 50 kb inversion, typical of the Leguminosae family. The IR region of the clusterbean chloroplast genome has undergone an expansion, and hence, the whole rps19 gene is included in the IR, as compared to other legume plastid genomes. A total of 220 simple sequence repeats (SSRs) were detected in the clusterbean plastid genome. The analysis of the clusterbean plastid genome will provide useful insights for evolutionary, molecular and genetic engineering studies.


Introduction
Clusterbean (Cyamopsis tetragonoloba L.), also known as guar, is an annual herbaceous legume, tolerant to drought and salinity [1][2][3]. It belongs to the family Leguminosae and subfamily Papilionoideae. Due to its short growing season (90-120 days) [4], it is grown in rotation with other crops, like cotton, grain, sorghum, flax, etc. [5,6]. Guar also increases the nitrogen content and organic matter of the soil by the process of nitrogen fixation, and hence, leads to the increase in yield of other crops grown in rotation with it [6][7][8]. Guar is primarily cultivated in arid and semi-arid regions, like North-West India and South-East Pakistan. Guar pods are consumed as vegetables across the globe as they are a rich source of minerals, fibres, proteins and Vitamin C [9]. sequences [57], like grapevine Cp genome sequence, which was obtained during the sequencing of the whole genome [58]. Next generation techniques have an advantage over labour intensive and low throughput cloning techniques, as enriched or un-enriched DNA can be used directly for sequencing [30]. The first attempt to use Next Generation Sequencing technology (454 GS 20 system) for the sequencing of Cp genome, was made by Moore et al. (2006) [59]. With the advent of Next Generation Sequencing, a large number of Cp genomes have now been sequenced [38,[60][61][62]. Herbarium genomics has also been reported to be a promising field with respect to organelle genome sequencing [63]. Organelle genome sequencing is of great value to the branch of phylogenomics. Like genome skimming, which involves sequencing of chloroplast, mitochondrial, or rDNA, can be used to recover matrilineal genealogy [64]. Multiple platforms are available for sequencing of Cp genome, but Illumina is the most used platform for sequencing of chloroplast genomes [55,[65][66][67]. In this study, we used purified chloroplast DNA as a template for sequencing by Illumina HiSeq 1000 platform (San Diego, CA, USA). This is the first report of the guar chloroplast genome, and hence, would help in phylogenetic analysis, DNA barcoding, and breeding in future.

Plant Material and Chloroplast DNA Isolation
Guar (variety RGC 936) was used in this study. Fresh leaves were harvested from the plant and kept in dark for 48 h prior to Cp DNA isolation. The Cp DNA isolation from the leaves was performed as per Kirti et al. (1993) [68].

Chloroplast Genome Sequencing, Assembly, and Annotation
The plastid libraries were prepared by Illumina Nextera DNA library preparation kit (San Diego, CA, USA). Initially, 50 ng of the plastid DNA was tagmented, cleaned, and amplified, and libraries were prepared as per manufacturer's protocol, with an average size of 500 bp. The quality check (QC) of the libraries were validated by Bioanalyzer, using DNA High sensitivity chips (Agilent Technologies, California, USA), and thereafter, the samples were run on Illumina Hiseq 1000 platform.
FastQC v0.11.5 was used to assess the per base quality of the raw reads (50,642,415). A Phred score of 30 was set as the threshold for filtering reads. Average length of the reads was 96 bp. All 50,642,415 paired-end raw reads passed the quality filter threshold of 30 Phred score. With CLC genomics (workbench 9.5.1), (CLC Bio, Arhus, Denmark) 5,882,271 (11.62%) of these reads were mapped using the chloroplast reference genome of Glycine max (G. max), and assembled at 23 k-mer (auto). The thus obtained large contigs were reassembled again by guidance based de novo assembly with G. max Cp genome to obtain an assembly containing the largest contig, >150 kb size and N50 of 90,670 bp. A BLASTN search-based approach was used to order the contigs against the G. max Cp genome, with >80% matches and gaps filled by filtered reads at 90% similarity over 50% length.
The annotation of the chloroplast genome was performed by Dual Organellar Genome Annotator (DOGMA) [69] and hence coding sequences (cds), rRNAs, and tRNAs were identified by using plastid genetic code and BLAST homology searches. The tRNAs were verified by online tRNAscan-SE 1.21 search serve [70]. The exact gene and exon boundaries were verified, and the start and stop codons were manually corrected.
The entire chloroplast genome sequence of Cyamopsis tetragonoloba, along with gene annotations was submitted to GenBank (accession number: MF352008).

Genome Features of Clusterbean Chloroplast Genome
The complete chloroplast genome of clusterbean is 152,530 bp in length. It has a typical quadripartite structure, with the Cp genome divided into LSC and SSC of 83,025 bp and 17,879 bp in size, respectively, and a pair of IRs of 25,790 bp in size ( Figure 1). The size of the Cp genome is similar to other reported legume genomes (Supplementary Table S1). The GC content for the whole genome is 35%, which is in accordance with other reported legume genomes, like Glycine max [75], Cicer arietinum [33], Vigna radiata [76], and Cajanus cajan [38]. Similarly, the GC content for LSC, SSC, and IRs, is 33%, 29%, and 42%, respectively ( Table 1). The high GC content for the IR regions can be attributed to the presence of four rRNAs genes (rrn4.5, rrn5, rrn16, rrn23), thus leading to sequence complexity and stabilisation of the whole genome.
The clusterbean chloroplast genome contains 114 unique genes when duplicated genes are counted only once, and includes 78 protein coding genes, 30 tRNAs, 4 rRNAs genes, and 2 pseudogenes. Individually, LSC contains 80 genes (57 protein coding genes, 22 tRNAs, and 1 pseudogene), SSC contains 13 genes (12 protein coding genes and 1 tRNA). The IR region consists of duplicated copies of 10 protein coding genes, 7 tRNAs, 4 rRNAs genes, and 1 pseudogene, therefore, in total, it consists of 22 genes ( Table 2). The tRNA genes are distributed throughout the genome, and are encoded by 61 possible codons (excluding the stop codon). Duplicated tRNAs, trnM-CAU, and trnT-GGU, are present in the LSC region. Such tRNA duplications have been observed in the past in black pine, Actinidia, and pigeonpea [38,77,78]. In total, 12 intron-containing genes are present in the clusterbean Cp genome, out of which two (ycf3 and clpP) contain two introns each (Supplementary Table S2). The trnK-UUU gene contains the largest intron (2573 bp), which also harbors the matK gene. Trans-splicing of rps12 gene is observed in the clusterbean Cp genome, as is the case with other Cp genomes, like Actinidia [78] and Pongamia pinnata [79]. As a result of trans-splicing, 5 exon is present in the LSC region, and the 3 exon is duplicated in the IR region.
In clusterbean Cp genome, the protein coding region accounts for 52.7%, while the tRNA and rRNA coding regions account for 2.07% and 5.94%, respectively. The remaining genome consists of the intergenic region, introns, and pseudogenes.
Codon usage was calculated for the protein coding genes present in the clusterbean Cp genome. A total of 78 protein coding genes, comprising a length of 80,166 nucleotides, are represented by 26,722 codons ( Table 3). As reported earlier also, leucine (2831 codons, 10.5% of the total) and cysteine (318 codons, 1.19% of the total) represent the most and least abundant amino acids, respectively [80][81][82]. Salim and Cavalcanti (2008) [83] suggested that there exists a relationship between codon usage bias and translational efficiency. They further explain that codon usage is biased towards either abundant tRNAs, or those codons which binds their cognate tRNAs more strongly than others. Also, the codon usage pattern in the clusterbean Cp genome is observed to be biased towards a high presentation of A or T at third codon position (Table 1), as supported by relative synonymous codon usage (RSCU) values. The value for codons ending with A and T is 40.42%, while codons ending with C and G is 12.61%. This bias towards the high presentation of A or T is also observed in other Cp genomes [82,84]. It has also been reported in the past that organellar proteins are encoded mainly by codons ending with A or U [85]. The innermost darker gray corresponds to GC content, whereas the lighter gray corresponds to AT content. Different genes are colour coded. IR: inverted repeat; LSC: large single copy region; SSC: small single copy region.  During the course of evolution, there have been many cases of intron and full gene losses among the angiosperms. Homologous recombination between intron-less cDNA and the original intron containing copy of DNA has been proposed as one of the mechanisms for the loss of introns. Loss of intron of atpF gene in Malpighiales was explained by the above mechanism [86,87]. Likewise, it has been reported that a clade known as IR-lacking clade (IRLC), which includes Cicer arietinum, Medicago truncatula, Trifolium subterraneum, Pisum sativum, and Lathyrus sativus, has lost clpP introns. Loss of intron from rpl2 gene was also reported from various lineages of flowering plants [24]. Nevertheless, introns play an important role in gene expression, as the presence of an intron enhances the gene's transcription. Also, introns present within the gene can be used as flanking sequences for the purpose of genetic engineering, thus providing efficient processing of foreign transcripts [53].    As is the case with intron loss, various gene losses have also been reported in angiosperms. Likewise, rpl22 and infA genes are observed to be missing from clusterbean chloroplast genome. The independent transfer of rpl22 gene to the nucleus has been reported in Fabaceae [40] and Fagaceae [88]. Similarly, transfer of infA gene to the nucleus has been documented in rosids, and was supported by the finding of expressed copies of the gene with stretches of chloroplast transit peptide in the nucleus [89]. The transfer of rpl32 gene in Salicaceae [90,91] and accD gene in Trifolium [41] have also been well documented. The accD gene has been reported to be lost at least seven times in the course of evolution of angiosperms. In some plastid genomes, like Medicago and Populus, the phenomenon of nuclear substitution has been reported as a cause for loss of rps16 gene from the plastome, as the nuclear encoded, mitochondrial copy of the gene is targeted to the plastid also [92]. But in the plastid genome of clusterbean, rps16 gene has been found to be present as a pseudogene. Similarly, it is present as a pseudogene in Cp genome of pigeonpea [38], while its non-functional copy is present in V.radiata [76]. Loss of splicing activity might be the reason for it to be present as a pseudogene [93]. Also, ycf15 is observed to be present as a pseudogene in clusterbean Cp genome. It has also been reported, in the past, to be present as a pseudogene in the Cp genome of pigeonpea [38], Phaseolus vulgaris, and Vigna radiata [33,76], as it contains premature stop codons within the coding sequence. It can be concluded from the above reports that increased rate of hypermutation has made the legumes more prone to rearrangements, and thus, more number of genes are lost or relocated to the nucleus in legumes [41,94].

Gene Order
Each of the sequenced legume Cp genomes possesses a unique structure. To deduce the structural homology, we compared the Cp genome of clusterbean with the sequenced legume genomes using MAUVE (Figure 2), taking Arabidopsis Cp genome as a reference. On comparison with Arabidopsis, it was found that the clusterbean and all the legume Cp genomes possesses a 50 kb inversion in LSC, spanning the region between the rbcl and rps16 genes of the chloroplast. the former. This loss of IR region has been reported earlier too [96], and such legume tribes, lacking one IR region, form a new clade known as Inverted Repeat-lacking clade (IRLC) [97,98]. An inversion unique to subtribe phaseolinae, occurring as a result of expansion and subsequent contraction of IRs [99], is present in the Cp genome of V. radiata and P. vulgaris, but absent from other plastid genomes. This suggests that legume Cp genomes have undergone considerable rearrangements and diversification, and thus, provide a valuable resource for phylogenetic analysis.

Plastid Genome Sequence Comparison
The availability of various legume plastid genomes provides the opportunity for comparison of Cp genomes. Hence, the sequence identity of the legume Cp genomes was plotted with the help of mVISTA (Figure 3), using annotations of clusterbean as reference. On alignment, it was found that the overall chloroplast genomes were conservative with some divergent regions. Similar to other plant species, coding regions were found to be more conservative than the non-coding regions. The most conserved region was the IR region, probably due to the presence of conserved rRNA genes and the phenomenon of copy correction [100]. The coding regions like clpP, accD, petA, petD, and cemA show high a degree of divergence, while the intergenic region between the genes rpoB-psbD, ndhC-atpB, psbE-psbB, petD-rps3, trnK-UUU-rbcl, and ndhJ-ycf3, also show a high degree of divergence. The clusterbean Cp genome also possesses an additional inversion within the LSC and the IR region, occurring as a result of flip flop intramolecular recombination [95]. This inversion is also observed for G. max and pigeonpea Cp genome. The Cp genomes of Cicer arietinum and Medicago truncatula generally share the same gene order with clusterbean, except for the loss of IRb region in the former. This loss of IR region has been reported earlier too [96], and such legume tribes, lacking one IR region, form a new clade known as Inverted Repeat-lacking clade (IRLC) [97,98].
An inversion unique to subtribe phaseolinae, occurring as a result of expansion and subsequent contraction of IRs [99], is present in the Cp genome of V. radiata and P. vulgaris, but absent from other plastid genomes. This suggests that legume Cp genomes have undergone considerable rearrangements and diversification, and thus, provide a valuable resource for phylogenetic analysis.

Plastid Genome Sequence Comparison
The availability of various legume plastid genomes provides the opportunity for comparison of Cp genomes. Hence, the sequence identity of the legume Cp genomes was plotted with the help of mVISTA (Figure 3), using annotations of clusterbean as reference. On alignment, it was found that the overall chloroplast genomes were conservative with some divergent regions. Similar to other plant species, coding regions were found to be more conservative than the non-coding regions. The most conserved region was the IR region, probably due to the presence of conserved rRNA genes and the phenomenon of copy correction [100]. The coding regions like clpP, accD, petA, petD, and cemA show high a degree of divergence, while the intergenic region between the genes rpoB-psbD, ndhC-atpB, psbE-psbB, petD-rps3, trnK-UUU-rbcl, and ndhJ-ycf3, also show a high degree of divergence. Figure 3. Sequence alignment of legume plastid genomes, with C. tetragonoloba Cp genome set as a reference using mVISTA. Position and transcriptional direction of each gene is indicated by gray arrows. Intergenic and genic regions are indicated by red and blue areas, respectively. Sequence identity between the Cp genomes is shown on y-axis as a percentage between 50% to 100%.

Comparison of Inverted Repeat Boundaries of Clusterbean with Other Closely Related Plastid Genomes
The IR regions are known to promote the stability of the rest of the genome by intramolecular recombination between the two copies of inverted repeats, and thus, limiting the recombination between the two single copy regions [97,101]. The contraction and expansion of IRs leads to the size variation of the plastid genomes among the angiosperms. The comparison of the boundaries of clusterbean Cp genome with other plastid genomes is presented in Figure 4. The IR region of clusterbean contains 22 completely duplicated genes. At the IR/LSC junction, rps19 gene is included in the IR region, and hence, is completely duplicated. On the other hand, at the IR/SSC junction, 485 bp of ycf1 gene is included in the IR. As a result, a partial ycf1 gene is included at the IRa/SSC junction, while the complete ycf1 gene is included in the IR at the SSC/IRb junction. In comparison to other genomes, these boundaries fluctuate, like in G.max Cp genome, where 68 bp of rps19 gene is included in the IR. Meanwhile, complete duplication of the rps19 gene is seen in Vigna radiata and Phaseolus vulgaris, in contrast with the absence of rps19 in the IR regions of pigeonpea. On the other hand, ycf1 gene is included in the IRs of all the compared legumes, but the size varies among them. On comparison with other closely related legumes, the IR region of clusterbean (25,790 bp) was found to be smaller than that of Vigna radiata (26,474 bp), but larger than the IR region of Cajanus cajan (25,398 bp).

Simple Sequence Repeats Analysis
A group of tandem repeat sequences consisting of 1-6 nucleotide repeat units are known as simple sequence repeats (SSRs), or microsatellites [102]. CpSSRs are known to be relatively abundant, and demonstrate high reproducibility and polymorphism. Thus, these are frequently used in species identification and genetic analysis. Chloroplast SSRs were extracted using MISA perl script, and a total of 220 SSRs were detected in the clusterbean Cp genome (Supplementary Table S3). The numbers of SSR loci found are similar to that reported in Vigna radiata, but less than that reported in pigeonpea [88,103]. Among 220 SSRs reported, 67.27% (148 SSRs) are present in LSC region, 11.81% (26 SSRs) are present in IR region, and 20.9% (46 SSRs) are present in the SSC region ( Figure 5). The findings were similar to that reported in artichoke [104] and Datura stramonium Cp genome [80]. Further, the SSRs were distributed among the coding, non-coding, and intergenic regions. Additionally, it was found that 33% (72) of the SSRs were present in the coding region, 4% (10 SSRs) in the intronic region and 63% (138) were present in the intergenic region ( Figure 6). Also on comparison between coding and non-coding region, a higher number of SSRs were found to be present in the non-coding region than the coding region, making the results consistent with those observed for G. max [105] and Datura stramonium [80].

Simple Sequence Repeats Analysis
A group of tandem repeat sequences consisting of 1-6 nucleotide repeat units are known as simple sequence repeats (SSRs), or microsatellites [102]. CpSSRs are known to be relatively abundant, and demonstrate high reproducibility and polymorphism. Thus, these are frequently used in species identification and genetic analysis. Chloroplast SSRs were extracted using MISA perl script, and a total of 220 SSRs were detected in the clusterbean Cp genome (Supplementary Table S3). The numbers of SSR loci found are similar to that reported in Vigna radiata, but less than that reported in pigeonpea [88,103]. Among 220 SSRs reported, 67.27% (148 SSRs) are present in LSC region, 11.81% (26 SSRs) are present in IR region, and 20.9% (46 SSRs) are present in the SSC region ( Figure 5). The findings were similar to that reported in artichoke [104] and Datura stramonium Cp genome [80]. Further, the SSRs were distributed among the coding, non-coding, and intergenic regions. Additionally, it was found that 33% (72) of the SSRs were present in the coding region, 4% (10 SSRs) in the intronic region and 63% (138) were present in the intergenic region ( Figure 6). Also on comparison between coding and non-coding region, a higher number of SSRs were found to be present in the non-coding region than the coding region, making the results consistent with those observed for G. max [105] and Datura stramonium [80].
On the basis of the arrangement of nucleotides in the repeat motif, 78% of the SSRs were found to be perfect repeats, while 15%, 6%, and 1% were found to be compound interrupted, imperfect, and compound repeats, respectively. The microsatellites were further analysed on the basis of repeat types. The most abundant repeat type was mononucleotide, and the least abundant was pentanucleotide, with no hexanucleotide motifs detected (Figure 7). These repeat types were distributed among the coding and non-coding region (Figure 8). On analysis of these repeat types, it was found that mono, di, and trinucleotide repeats were mainly composed of A or T nucleotides. Wheeler et al. (2014) [106] also reported that the majority of mononucleotides are A/T rich. This bias in base composition is also consistent with overall AT richness in the clusterbean plastid genome. Such A/T rich repeats were also reported in Camellia species, Sesame indicum, Glycine species and Sesamum indicum [105,[107][108][109].    Among the coding sequences, maximum numbers of repeats were detected in the ycf1 gene region. In recent studies, ycf1 gene is considered as the most variable locus [80,110]. The finding is consistent with others reported from Glycine species, V. radiata, Camellia species, Cynaracardunculus [76,104,105,108].

Conclusions
The clusterbean plastid genome was sequenced on an Illumina Hi-Seq 1000 platform, and the assembly was done using CLC Genomics Workbench 9.5.1. The genome is 152,530 bp in length, with a typical quadripartite structure, and consists of 114 unique genes, similar to other reported legume plastid genomes. This is the first study reporting the draft chloroplast genome sequence of clusterbean. The plastid genome of clusterbean, on comparison with other legumes, shows similar organisation, except for the IR expansion, where rps19 is included in the IRs, and hence, completely duplicated. It also consists of two pseudogenes, namely rps16 and ycf15. Gene loss is also observed, as the genes rpl22 and infA are absent from the plastid genome. On doing SSR analysis, 220 SSR loci were found, with most SSRs present in the intergenic region. This study would be helpful in evolutionary and molecular studies.

Acknowledgments:
We acknowledge the financial support from ICAR-CRP on Genomics platform (NBFGR, Lucknow, India). We also acknowledge the logistic support provided by PD, ICAR-NRCPB, New Delhi.
Author Contributions: T.K. carried out the experiments, prepared the genomic library for Illumina sequencing run and wrote the manuscript. P.K.C. and H.C.R. performed chloroplast genome assembly and bioinformatics analysis. S.S. carried out the SSR markers discovery and validation. T.K., P.K.C., S.S., and A.T. were involved in the result interpretation, analysis, and finalisation of the manuscript. S.V.A.M., A.U.S., T.R.S. and N.K.S. contributed in data analysis, genome annotation and manuscript finalisation. P.K. provided the seeds and contributed to manuscript. K.G. conceived the study, designed the experiments, and coordinated the work. All the authors have read and approved the final manuscript.

Conflicts of Interest:
The authors declare no conflict of interest.