Comparative Analysis of Chloroplast Genomes of Four Medicinal Capparaceae Species: Genome Structures, Phylogenetic Relationships and Adaptive Evolution

This study presents for the first time the complete chloroplast genomes of four medicinal species in the Capparaceae family belonging to two different genera, Cadaba and Maerua (i.e., C. farinosa, C. glandulosa, M. crassifolia and M. oblongifolia), to investigate their evolutionary process and to infer their phylogenetic positions. The four species are considered important medicinal plants, and are used in the treatment of many diseases. In the genus Cadaba, the chloroplast genome ranges from 156,481 bp to 156,560 bp, while that of Maerua ranges from 155,685 bp to 155,436 bp. The chloroplast genome of C. farinosa, M. crassifolia and M. oblongifolia contains 138 genes, while that of C. glandulosa contains 137 genes, comprising 81 protein-coding genes, 31 tRNA genes and 4 rRNA genes. Out of the total genes, 116–117 are unique, while the remaining 19 are replicated in inverted repeat regions. The psbG gene, which encodes for subunit K of NADH dehydrogenase, is absent in C. glandulosa. A total of 249 microsatellites were found in the chloroplast genome of C. farinosa, 251 in C. glandulosa, 227 in M. crassifolia and 233 in M. oblongifolia, the majority of which are mononucleotides A/T found in the intergenic spacer. Comparative analysis revealed variable hotspot regions (atpF, rpoC2, rps19 and ycf1), which can be used as molecular markers for species authentication and as regions for inferring phylogenetic relationships among them, as well as for evolutionary studies. The monophyly of Capparaceae and other families under Brassicales, as well as the phylogenetic positions of the studied species, are highly supported by all the relationships in the phylogenetic tree. The cp genomes reported in this study will provide resources for studying the genetic diversity of Capparaceae, as well as resolving phylogenetic relationships within the family.


Introduction
The family Capparaceae, whose members are distributed in both arid and semi-arid areas, has about 470 morphologically diverse species in ca. 17 genera, which include Cadaba and Maerua [1][2][3]. Members of the family possess highly essential compounds used in folk medicine that are extracted from them [4]. The four species in question are considered important medicinal plants and are used in the treatment of many diseases. Most Cadaba species contain important compounds, such as alkaloids, sesquiterpene lactones and cadabicine. Cadaba farinosa and Cadaba glandulosa are used as purgative, anthelmintic, antisyphilitic, emmenagogue, aperient, antiscorbutic, and antiphlogistic substances; for liver damage and cancer, dysentery, fever and body pain; in therapy as a hepatoprotective and hypoglycemic [5,6]. Maerua crassifolia and Maerua oblongifolia species are used in the treatment of fever, stomach troubles, skin infections, diabetes mellitus, epilepsy and abdominal colic; they demonstrate antimicrobial and antioxidant properties and are used (C. glandulosa), corresponding to 49.89% and 48.93% of the total genome length. The LSC regions ranged from 85,681 bp (C. glandulosa) to 84,153 bp (M. oblongifolia) in size, whereas the SSC ranged from 18,481 bp (M. oblongifolia) to 18,031 bp (C. glandulosa); the pair of inverted repeats are separated by the small single copy region and ranged from 26,430 bp (C. farinosa) to 26,294 bp (M. crassifolia) ( Table 1 and Figure 1). These four Capparaceae chloroplast genome sequences were deposited in GenBank (accession numbers: C. farinosa, MN603027; C. glandulosa, MN603028; M. crassifolia, MN603029 and M. oblongifolia, MN603030).  17 17 In the four species, the plastomes are well conserved in gene order and number of genes, with slight variation in the presence of the psbG gene, which is absent in C. glandulosa. The result of the gene annotation revealed a total of 137 in C. glandulosa and 138 genes in C. farinosa, M. crassifolia and M. oblongifolia, among which 116-117 are situated in the SSC and the LSC copy regions, and 19 genes are located in the IRa and IRb. The plastome contained 80 protein-coding genes in C. glandulosa and 81 in other species, four rRNA genes and 31 tRNA genes ( Figure 1 and Table 2). Eight protein-coding genes, four rRNA and seven tRNA were found in the IR regions. In the C. glandulosa species, the LSC region contained 61 protein-coding genes, whereas it included 62 in other species and 23 tRNA genes; the remaining 12 protein-coding genes and 1 tRNA are situated in the SSC region.
Some protein-coding genes and tRNA genes in the chloroplast genome of angiosperms contain an intron [49,50], as is found in the plastomes of the four species (C. farinosa, C. glandulosa, M. crassifolia and M. oblongifolia). In the total genes of the cp genomes (out of the total genes in all chloroplast genomes), 13 genes in C. glandulosa and M. crassifolia and 14 genes in C. farinosa and M. oblongifolia include an intron; some genes are protein-coding genes (nine in C. farinosa and M. oblongifolia and eight in C. glandulosa and M. crassifolia) while the remaining five are tRNA genes. Four genes (rpl2, ndhB, trnI-GAU and trnA-UGC) that have introns are situated in the inverted repeat region, ndhA is located in the SSC region and the remainder is found in the LSC region. All genes have only one intron and only two genes, namely ycf3 and clpP, have two introns. The gene trnK-UUU has the longest intron of 2542-2571 bp; this is a result of the matK gene being located within the intron of the gene.

Figure 1.
Chloroplast genome maps of the four Capparaceae species. Genes drawn inside the circle are transcribed clockwise, while those outside the circle are transcribed counter-clockwise. The inner dark gray circle corresponds to GC content and the inner light gray circle corresponds to the AT content. Different colors are used as a representation of distinctive genes within separate functional groups.

Codon Usage
One of the factors shaping the evolution of the chloroplast genome is codon usage, which occurs as a result of bias in mutation [51]. The codon usage bias in the plastome was computed using the protein-coding genes and tRNA gene nucleotide sequences-104,575 bp, 106,488 bp, 105,750 bp and 99,100 bp for C. farinosa, C. glandulosa, M. crassifolia and M. oblongifolia, respectively. Supplementary Tables S1-S4 present the relative synonymous codon usage (RSCU) of each codon in the genome; these results suggested that all the genes are encoded by 33,686 codons in C. farinosa, 34,303 codons in C. glandulosa, 34,064 codons in M. crassifolia and 31,920 codons in M. oblongifolia. The most frequently occurring codons are for the amino acids leucine 3290 (9.76%), 3599 (10.49%), 3452 (10.13%) and 2951 (9.24%), respectively (Figure 2), which have been found in other flowering plant genomes [52], whereas methionine has the fewest codons in the genome at 620 (1.84%) in C. farinosa and 606 (1.89%) in M. oblongifolia, and for tryptophan it is 674 (1.96%) in C. glandulosa and 695 (2.04%) in M. crassifolia. A-and T-endings were discovered to be less frequent than their counterparts G-and C-. The codon usage bias is low in the cp genomes of Capparaceae species (Tables S1-S4). Additionally, the results show that the RSCU values of 27 codons were >1, all with A/T-endings, whereas the other 28 codons were <1, and all ended with G/C. The RSCU values of tryptophan and methionine amino acids are 1, so they are the only amino acids with no codon bias.

RNA Editing Sites
RNA editing features a set of processes that comprise of insertion, deletion or modifications of nucleotides that alter the DNA-encoded sequence [53], which represents a way to create transcript and protein diversity [54]. Some chloroplast RNA editing sites are preserved in plants [55]. The RNA editing sites in the C. farinosa, C. glandulosa, M. crassifolia and M. oblongifolia chloroplast genomes were predicted using the PREP suite program; the first codon position of the first nucleotide was used in all of the analyses. The results show that conversion of the amino acid serine into leucine was the majority of the conversions in the codon positions (Tables S5-S8). This conversion is found to occur more frequently [56]. In total, 48 editing sites in the genus Maerua and 50 in the genus Cadaba were revealed by the program. Twenty protein-coding genes in C. farinosa and 19 protein-coding genes in C. glandulosa, M. crassifolia and M. oblongifolia were distributed across the editing sites. As stated in previous studies [57][58][59], the ndhB genes have the largest number of editing sites (nine sites), followed by ndhD (nine sites in C. farinosa and M. oblongifolia and eight sites in C. glandulosa and M. crassifolia), while accD, atpF, ccsA, clpP, PsaI, psbG, psbF, rpoA, rpl20, rps2 and rps16 have at least one site each. Certain RNA sites, amidst all the conversions in the RNA editing (modification) sites, changed the amino acid from proline to serine. RNA-predicting sites in the first codon of the first nucleotides are not present in the following genes: atpA, atpB, atpI, ccsA (only in C. glandulosa), petB, petD, petG, petL, psaB, psbB, psbL, rpl2, rpl20 (except in M. oblongifolia), rpl23, rps8 and ycf3, among others. This result indicated that the preservation of RNA editing is fundamental [60,61].

RNA Editing Sites
RNA editing features a set of processes that comprise of insertion, deletion or modifications of nucleotides that alter the DNA-encoded sequence [53], which represents a way to create transcript and protein diversity [54]. Some chloroplast RNA editing sites are preserved in plants [55]. The RNA editing sites in the C. farinosa, C. glandulosa, M. crassifolia and M. oblongifolia chloroplast genomes were predicted using the PREP suite program; the first codon position of the first nucleotide was used in all of the analyses. The results show that conversion of the amino acid serine into leucine was the majority of the conversions in the codon positions (Tables S5-S8). This conversion is found to occur more frequently [56]. In total, 48 editing sites in the genus Maerua and 50 in the genus Cadaba were revealed by the program. Twenty protein-coding genes in C. farinosa and 19 protein-coding genes in C. glandulosa, M. crassifolia and M. oblongifolia were distributed across the editing sites. As stated in previous studies [57][58][59], the ndhB genes have the largest number of editing sites (nine sites), followed by ndhD (nine sites in C. farinosa and M. oblongifolia and eight sites in C. glandulosa and M. crassifolia), while accD, atpF, ccsA, clpP, PsaI, psbG, psbF, rpoA, rpl20, rps2 and rps16 have at least one site each. Certain RNA sites, amidst all the conversions in the RNA editing (modification) sites, changed the amino acid from proline to serine. RNA-predicting sites in the first codon of the first nucleotides are not present in the following genes: atpA, atpB, atpI, ccsA (only in C. glandulosa), petB, petD, petG, petL, psaB, psbB, psbL, rpl2, rpl20 (except in M. oblongifolia), rpl23, rps8 and ycf3, among others. This result indicated that the preservation of RNA editing is fundamental [60,61].

Simple Sequence Repeats (SSRs)
The SSRs or microsatellites are a group of short repeat sequences of nucleotide series (1-6 bp), which are used as a tool to facilitate the assessment of molecular diversity [65]. The genetic variation within and among species with the valuable molecular marker of the SSRs is extremely important for studying genetic heterogeneity and contributes to species recognition [66][67][68]. In this study, there are 249 microsatellites found in the plastid

Simple Sequence Repeats (SSRs)
The SSRs or microsatellites are a group of short repeat sequences of nucleotide series (1-6 bp), which are used as a tool to facilitate the assessment of molecular diversity [65]. The genetic variation within and among species with the valuable molecular marker of the SSRs is extremely important for studying genetic heterogeneity and contributes to species recognition [66][67][68]. In this study, there are 249 microsatellites found in the plastid genome of C.  (Figure 4). A high richness in mononucleotides poly A and T has been observed in most flowering plants' cp genomes [62]. Table 3. Simple sequence repeats in the C. farinosa, C. glandulosa, M. crassifolia and M. oblongifolia chloroplast genomes.

SSR Type
Repeat Unit Species

Comparative Analysis of the Capparaceae Species Cp Genome
To analyze the DNA sequence divergence in the chloroplast genomes of the five species of Capparaceae, a comparative analysis was done using the mVISTA program to align the sequences. Sequence alignment was conducted among four chloroplast genomes of Capparaceae and compared with the chloroplast genome of Capparis versicolor (MH142726), available in GenBank. To understand the structural characteristics in the cp genomes, the annotation of C. farinosa was used as a reference. The alignment outcome reveals highly conserved genomes with few variations. As in most chloroplast genomes of angiosperm plants, non-coding counterparts were conserved less than the gene-coding regions ( Figure 6). Among the five cp genomes, the results showed that trnH(GUG)-psbA, rps16-trnQ, psbI-trnS, trnS-trnR, petN-psbM, psbM-trnD, trnE-trnT, trnS-trnG, trnT-trnL, trnF-ndhJ, rbcL-accD, psbE-petL, rbs16-rbs3 and ndhF-rpl32 were the most divergent noncoding regions. However, it was detected that some variations occurred in the following genes: atpF, rpoC2, rps19 and ycf1.
Although angiosperms retain the structure and size of the chloroplast genome [68], some evolutionary events occur in the genome, such as expansion and contraction, that alter the size of the genome and the boundaries of the LSC, SSC, IRa and IRb regions [69,70]. We compared between IR-LCS and IR-SSC the boundaries of the five cp genomes of Capparaceae (Cadaba farinosa, Cadaba glandulosa, Maerua crassifolia, Maerua oblongifolia and Capparis versicolor) and the result presented a similarity among the compared plastomes of Cadaba and Maerua species, with a slight variation among C. versicolor (Figure 7). The chloroplast genome of C. versicolor (155,051 bp) was the smallest, whereas the genome of C. glandulosa (156,560 bp) was the largest. The smallest IR region is in C. versicolor (26,

Comparative Analysis of the Capparaceae Species Cp Genome
To analyze the DNA sequence divergence in the chloroplast genomes of the five species of Capparaceae, a comparative analysis was done using the mVISTA program to align the sequences. Sequence alignment was conducted among four chloroplast genomes of Capparaceae and compared with the chloroplast genome of Capparis versicolor (MH142726), available in GenBank. To understand the structural characteristics in the cp genomes, the annotation of C. farinosa was used as a reference. The alignment outcome reveals highly conserved genomes with few variations. As in most chloroplast genomes of angiosperm plants, non-coding counterparts were conserved less than the gene-coding regions ( Figure 6). Among the five cp genomes, the results showed that trnH(GUG)-psbA, rps16-trnQ, psbI-trnS, trnS-trnR, petN-psbM, psbM-trnD, trnE-trnT, trnS-trnG, trnT-trnL, trnF-ndhJ, rbcL-accD, psbE-petL, rbs16-rbs3 and ndhF-rpl32 were the most divergent non-coding regions. However, it was detected that some variations occurred in the following genes: atpF, rpoC2, rps19 and ycf1.
Although angiosperms retain the structure and size of the chloroplast genome [68], some evolutionary events occur in the genome, such as expansion and contraction, that alter the size of the genome and the boundaries of the LSC, SSC, IRa and IRb regions [69,70]. We compared between IR-LCS and IR-SSC the boundaries of the five cp genomes of Capparaceae (Cadaba farinosa, Cadaba glandulosa, Maerua crassifolia, Maerua oblongifolia and Capparis versicolor) and the result presented a similarity among the compared plastomes of Cadaba and Maerua species, with a slight variation among C. versicolor (Figure 7). The chloroplast genome of C. versicolor (155,051 bp) was the smallest, whereas the genome of C. glandulosa (156,560 bp) was the largest. The smallest IR region is in C. versicolor (26,141 bp).

Divergence of Protein-Coding Gene Sequence
The cp genomes of four Capparaceae species include 80 protein-coding genes in C. glandulosa and 81 genes in other species. To detect the genes under selective pressure, the rates of synonymous (dS) and non-synonymous (dN) substitution and dN/dS ratio were calculated. The results showed that in all of the paired genes of C. farinosa vs. C. glandulosa, the dN/dS ratio is less than 1, and most of the paired genes are less than 1 except atpF in C. farinosa vs. M. crassifolia and cemA, psbK and rps18 in C. farinosa vs. M. oblongifolia, having values of 1.16, 1.52 and 1.2, respectively ( Figure 8). The result of the dN/dS ratio obtained in this study is consistent with other related studies [52,53]. In all the genes, the synonymous (dS) values range from 0 to 0.32 (Figure 8).

Divergence of Protein-Coding Gene Sequence
The cp genomes of four Capparaceae species include 80 protein-coding genes in C. glandulosa and 81 genes in other species. To detect the genes under selective pressure, the rates of synonymous (dS) and non-synonymous (dN) substitution and dN/dS ratio were calculated. The results showed that in all of the paired genes of C. farinosa vs. C. glandulosa, the dN/dS ratio is less than 1, and most of the paired genes are less than 1 except atpF in C. farinosa vs. M. crassifolia and cemA, psbK and rps18 in C. farinosa vs. M. oblongifolia, having values of 1.16, 1.52 and 1.2, respectively (Figure 8). The result of the dN/dS ratio obtained in this study is consistent with other related studies [52,53]. In all the genes, the synonymous (dS) values range from 0 to 0.32 (Figure 8).

Phylogenetic Analysis
Phylogenetic relationships based on Bayesian analysis and maximum parsimony were congruent and placed all samples into three main clades, with strong support in all the nodes with PP 1.00 (Figure 9). The first clade contains species of the Capparaceae family and is divided into two subclades; the first subclade includes species of genera Cadaba and Maerua, while the second clade includes species of genus Capparis. The second clade comprises Cleomaceae species, while the third clade includes species from the Brassicaceae family. The phylogenetic tree showed that the Capparaceae family is the earliest diverging lineage among the three families and is sister to Cleomaceae and Brassicaceae. It is clear in this phylogenetic result that Cleomaceae was separated from Capparaceae and became a sister to the Brassicaceae family, as reported by [19,20]; this is consistent with some previous classifications of the order Brassicales.

Phylogenetic Analysis
Phylogenetic relationships based on Bayesian analysis and maximum parsimony were congruent and placed all samples into three main clades, with strong support in all the nodes with PP 1.00 ( Figure 9). The first clade contains species of the Capparaceae family and is divided into two subclades; the first subclade includes species of genera Cadaba and Maerua, while the second clade includes species of genus Capparis. The second clade comprises Cleomaceae species, while the third clade includes species from the Brassicaceae family. The phylogenetic tree showed that the Capparaceae family is the earliest diverging lineage among the three families and is sister to Cleomaceae and Brassicaceae. It is clear in this phylogenetic result that Cleomaceae was separated from Capparaceae and became a sister to the Brassicaceae family, as reported by [19,20]; this is consistent with some previous classifications of the order Brassicales.

Library Construction, Sequencing and Assembly
Input material for the DNA sample preparations was derived (or taken) from a total amount of 1.0 µg DNA. The NEBNext DNA Library Prep Kit was used to generate sequence libraries according to the manufacturer's recommendation; indices were also added to each sample. Genomic DNA was randomly fragmented by shearing to a size of 350 bases in length. The ends of randomly fragmented DNA were repaired and A-tailed, adapters were ligated with NEBNext for Illumina sequencing, then the PCR improved by P5 and indexed P7 oligo sequences. The AMPure XP system was used to purify the PCR products; subsequent findings were analyzed by the Agilent 2100 Bioanalyzer for size distribution and later quantified using real-time PCR. After pooling, the qualified libraries were fed into an Illumina HiSeq 2500 system (350 bp paired ends reads); this was based on its effective concentration and expected data volume. The raw reads (19,844,190 bp, 19,053,503 bp, 19,440,639 bp and 19,929,468 bp for C. farinosa, C. glandulosa, M. crassifolia and M. oblongifolia, respectively) were cleaned reads (5 Gb) to remove low-quality sequences and adaptors; they were then filtered for PCR duplicates using PRINSEQlite v0.20.4 [71]. The clean raw reads were subjected to de novo assembly from the whole genome sequences using NOVOPlasty 2.7.2 [72] with kmer (K-mer = 31-33). The trnH-psbA of Cadaba farinosa (KR735837.1) was used as a seed and the complete plastome of Arabidopsis thaliana (KX551970.1) was used as a reference for the assembly of the Cadaba farinosa cp genome. The assembled cp genome of Cadaba farinosa was used as seed and reference for the assembly of the Cadaba glandulosa plastome. For Maerua crassifolia, the rpoC1 gene of M. crassifolia (JQ845894.1) was used as seed and the complete cp genome of C. farinosa was used as reference. The assembled cp genome of M. crassifolia was used as seed and reference for the assembly of M. oblongifolia. Finally, each species generated one contig that contained the complete chloroplast genome sequence.

Sequence Analysis
MEGA 6.0 was used to compute the codon usage, base composition, and the relative synonymous codon usage values (RSCUs). The RNA editing sites in cp protein-coding genes of the Capparaceae species were predicted using PREP suite [76] with a 0.8 cutoff value. Simple sequence repeats (SSRs) were identified in the chloroplast genomes of the four species (C. farinosa, C. glandulosa, M. crassifolia and M. oblongifolia) using the online software MIcroSAtellite (MISA) [77] with the following parameters set: eight, five, four and three repeat units for mononucleotides, dinucleotides, trinucleotides and tetra-, penta-, hexanucleotide SSR motifs, respectively. To identify the size and location of long repeats (palindromic, forward, reverse and complement) in the four species of Capparaceae being studied, the online program REPuter (https://bibiserv.cebitec.uni-bielefeld.de/reputer (accessed on 22 June 2019) [76], with standard settings, was used.

Genome Comparison
The chloroplast genomes of C. farinosa, C. glandulosa, M. crassifolia and M. oblongifolia were compared using the program mVISTA [78], and the annotation of C. farinosa was used as a reference in the Shuffle-LAGAN mode [79]. The four species of Capparaceae were compared against the border region between inverted repeat (IR), large single copy (LSC) and small single copy (SSC).

Characterization of Substitution Rate
To detect the genes that are under selection pressure, the substitution rate of the synonymous (dS) and non-synonymous (dN) substitution and the dN/dS ratio were analyzed using DNAsp v5.10.01 [80], the cp genome of C. farinosa was compared with the cp genome of C. glandulosa, M. crassifolia and M. oblongifolia. Separate protein-coding genes were aligned individually using Geneious version 8.1.3 software, while the protein sequence was translated from aligned sequences.

Phylogenetic Analysis
The analysis was conducted based on the complete chloroplast genome sequences of nine Capparaceae taxa, six species and three varieties, C. farinosa, MN603027, C. glandulosa, MN603028, M. crassifolia, MN603029, M. oblongifolia, MN603030, Capparis spinosa, MT041701, Capparis spinosa var. spinosa, MK639365, Capparis spinosa var. herbacea, MK639366, Capparis spinosa var. ovata, MK637690 and Capparis decidua MT948186, two Cloemaceae species, eight species of Brassicaceae and two species of Malvaceae, as an outgroup. All sequences were aligned using MAFFT software [81] with default settings. The phylogenetic trees were reconstructed based on maximum parsimony analysis using PAUP software (version 4.0b10) [82], utilizing tree bisection and reconnection branch swapping, with MulTrees on, saving a maximum of 1000 trees per replicate. Missing characters were treated as gaps. The bootstrap analysis confidence was based on 1000 replicates. MrBayes version 3.2.6 [83] was used to conduct Bayesian analysis and jModelTest version 3.7 [84] was used to select the appropriate model.

Conclusions
This current study used the Illumina HiSeq 2500 platform to obtain the first complete chloroplast sequences of four medicinal Capparaceae species: C. farinosa, C. glandulosa, M. crassifolia and M. oblongifolia. The four species are divided into two groups: C. farinosa and C. glandulosa belong to the tribe Cadabeae; M. crassifolia and M. oblongifolia belong to the tribe Maerueae. This study can be used to accurately identify species during different medicinal uses based on their plastid genome.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/ 10.3390/plants10061229/s1, Table S1: Codon-anticodon recognition patterns and codon usage of the Cadaba farinosa chloroplast genome; Table S2: Codon-anticodon recognition patterns and codon usage of the Cadaba glandulosa chloroplast genome; Table S3: Codon-anticodon recognition patterns and codon usage of the Maerua crassifolia chloroplast genome; Table S4: Codon-anticodon recognition patterns and codon usage of the Maerua oblongifolia chloroplast genome; Table S5: Predicted RNA editing site in the Cadaba farinosa chloroplast genome; Table S6: Predicted RNA editing site in the Cadaba glandulosa chloroplast genome; Table S7: Predicted RNA editing site in the Maerua crassifolia chloroplast genome; Table S8: Predicted RNA editing site in the Maerua oblongifolia chloroplast genome; Table S9: Repeat sequences present in the Cadaba farinosa chloroplast genome; Table S10: Repeat sequences present in the Cadaba glandulosa chloroplast genome; Table S11: Repeat sequences present in the Maerua crassifolia chloroplast genome; Table S12: Repeat sequences present in the Maerua oblongifolia chloroplast genome.

Conflicts of Interest:
The authors declare no conflict of interest.