Comparative Chloroplast Genomics in Phyllanthaceae Species

: Family Phyllanthaceae belongs to the eudicot order Malpighiales, and its species are herbs, shrubs, and trees that are mostly distributed in tropical regions. Here, we elucidate the molecular evolution of the chloroplast genome in Phyllanthaceae and identify the polymorphic loci for phylogenetic inference. We de novo assembled the chloroplast genomes of three Phyllanthaceae species, i.e., Phyllanthus emblica , Flueggea virosa , and Leptopus cordifolius , and compared them with six other previously reported genomes. All species comprised two inverted repeat regions (size range 23,921–27,128 bp) that separated large single-copy (83,627–89,932 bp) and small single-copy (17,424–19,441 bp) regions. Chloroplast genomes contained 111–112 unique genes, including 77–78 protein-coding, 30 tRNAs, and 4 rRNAs. The deletion/pseudogenization of rps 16 genes was found in only two species. High variability was seen in the number of oligonucleotide repeats, while guanine-cytosine contents, codon usage, amino acid frequency, simple sequence repeats, synonymous and non-synonymous substitutions, and transition and transversion substitutions were similar. The transition substitutions were higher in coding sequences than in non-coding sequences. Phylogenetic analysis revealed the polyphyletic nature of the genus Phyllanthus . The polymorphic protein-coding genes, including rpl 22, ycf 1, mat K, ndh F, and rps 15, were also determined, which may be helpful for reconstructing the high-resolution phylogenetic tree of the family Phyllanthaceae. Overall, the study provides insight into the chloroplast genome evolution in Phyllanthaceae.


Introduction
The family Phyllanthaceae and its closely related sister family Picorodendraceae comprise a separate clade of phyllantoid taxa within the order Malpighiales [1,2]. The species of the family Phyllanthaceae are predominantly tropical in distribution and include herbs, shrubs, and trees [3,4]. This family consists of two subfamilies (Phyllanthoideae and Antidesmatoideae), 10 tribes, 57 genera, and 2050 species [5]. Certain taxonomic discrepancies exist within Phyllanthaceae, i.e., Phyllanthus is polyphyletic since species of genera Breynia, Glochidion, Sauropus, and Synostemon were found embedded in its phylogeny [3,6,7]. Moreover, taxonomic discrepancies exist at intra-genus level since several subgenera and sections of Phyllanthus were paraphyletic [6,8]. Hence, researchers have suggested either merging all embedded genera into Phyllanthus L. to create a giant genus or dividing Phyllanthaceae into several monophyletic genera for grouping morphological similar species [3,6,8]. Phyllanthus L. is considered to be the largest genera of the family

Analysis of Codon Usage, Amino Acid Frequency, and Repeats
The relative synonymous codon usage (RSCU) and amino acid frequency of each species were determined using Geneious R8.1 [44]. The RSCU value of each species was shown with the help of a heatmap using TBtool [47]. MIcroSAtellite (MISA) [48] was used for the determination of simple sequence repeats (SSRs), with a maximum threshold of 10 for mononucleotide SSRs, 5 for dinucleotide SSRs, 4 for trinucleotide SSRS, and 3 for tetra-, penta-, and hexanucleotide SSRs. REPuter [49] was used for the determination of four types of oligonucleotide repeats, including forward (F), complementary (C), reverse (R), and palindromic (P). Parameters such as repeats size ≥ 30 with at least 90% similarities were adjusted.

Synonymous, Non-Synonymous, Transition, and Transversion Substitution Analyses
The synonymous (K s ) and non-synonymous (K a ) substitutions were analyzed in TBtool [47] for 77 protein-coding genes using Baccaurea ramiflora Lour. as a reference for all other species, as this species lies basal to the current species following Abdullah et al. [19]. Each gene selection was predicted by considering the ratios of Ka/Ks < 1 purifying selection, Ka/Ks = 1 neutral selection, and Ka/Ks > 1 positive selection [50]. The positive selection on protein coding genes or their codons was further analyzed using the Mixed Effects Model of Evolution (MEME) [51] and Fast Unconstrained Bayesian Approximation (FUBAR) [52] as implemented in the DATAMONKEY web server [53], according to previous parameters [54]. The transition and transversion substitutions of complete genome and protein-coding genes were determined by comparison with closely as well as with distantly related species. The whole genome was aligned using MAFFT, whereas the protein-coding sequences of each species were concatenated, except for ycf 1, and also aligned using MAFFT alignment [55]. For closely related species, the genome of P. emblica (Pakistan) was used as a reference for Breynia fruticosa, Phyllanthus amarus, Phyllanthus emblica (China), Glochidion chodoense, and Sauropus spatulifolius. Similarly, for distantly related species, Baccaurea ramiflora was used as a reference for the selected species of Leptopus cordifolius, Flueggea virosa, and Phyllanthus emblica (Pakistan).

Polymorphism of Protein-Coding Genes
To determine the extent of polymorphism of protein-coding genes for further phylogenetic studies, all protein-coding genes of each species were extracted. The sequence of each gene was multiple aligned using Geneious R8.1 and analyzed in DnaSP v.6 [56].

Reconstruction of Phylogenetic Tree
For inferring phylogeny, the sequences of 76 protein-coding genes, except ycf 1, of each species were extracted and concatenated in Geneious R8.1 [44]. The ycf 1 was excluded from the analysis as this gene exists at the junction of IR/SSC, with one part of ycf 1 located in the IR region and the other part in the SSC region. The part situated in the IR region showed a different rate of evolution than the part in the SSC region due to rate heterotachy [57][58][59]. These concatenated sequences of all 10 species, including Linum usitatissimum L. as outgroup from the family Linaceae, were multiple aligned using MAFFT [60] extension in Geneious R8.1. The maximum likelihood tree was reconstructed with best fit model GTR + F + G4 and determined by jModelTest 2 [61] using IQ-tree [62] with the same parameters as described previously [54]. The phylogenetic tree was also reconstructed with the identified polymorphic loci following the same approach. However, the best fit model TVM + F + G4 was predicted by jModelTest 2 and applied. The Interactive Tree of Life [63] was used to improve the presentation of the tree.

Chloroplast Genome Features and Organization
HiSeq 2500 produced about 10.52 GB data with 31.4 million reads for Phyllanthus emblica and 7.58 GB data with 22.6 million reads for Leptopus cordifolius. Chloroplast genomes assembled from the data exhibited a high average coverage depth of 855X for Phyllanthus emblica and 375X for Leptopus cordifolius. The Flueggea virosa which was assembled from the SRA data of NCBI showed an average coverage depth of 138X. These three de novo assembled genomes along with the previously reported genome provided an opportunity to perform in-depth comparative chloroplast genomics of members of the family Phyllanthaceae.
The gene contents and organization of the chloroplast genomes of the family Phyllanthaceae are provided in Tables 1 and 2 and Figure 1. The genomes showed high similarity in gene contents, except rps16, which was missing in Baccaurea ramiflora and Leptopus cordifolius. The chloroplast genomes of all selected species displayed quadripartite structure, comprising IRs (IRa and IRb) that separate the LSC and SSC regions. Chloroplast genomes possessed 111-112 unique genes that showed high similarities in the arrangement within the genomes and no inversion events related to gene rearrangement were found, as predicted by Mauve alignment (Figure 2). Among these 111-112 genes, 77 -78 were protein-coding genes, 30 transfer RNA (tRNA) genes, and 4 ribosomal RNA (rRNA) genes (Table 1). Moreover, 16-17 genes were also duplicated in IR regions, including 5-6 protein-coding genes, 7 tRNA genes, and 4 rRNA genes. Here, two pseudogenes of ycf 1 Ψ and rps19 Ψ were excluded, which exist at the junctions of LSC/IRa and SSC/IRa ( Table 1). The rps12 gene was a trans-splicing gene; the 5 end was present in LSC in the form of a single copy, whereas the 3 end was duplicated in IRa and IRb. The average GC content of the complete chloroplast genome was 36 The highest GC content of IRs belongs to rRNAs (55.5%) and tRNAs (53%); the genomic features are given in a detailed summary in Table 2. Total 131 * Gene with one intron, ** gene with two introns, a gene with two copies, £Ψ gene pseudo in Baccaurea ramiflora and Leptopus cordifolius, $ single copy existed of rps19 in Glochidion chodoense and of rpl2 in Sauropus spatulifolius, Ψ pseudo-copy also existed along with a functional copy.  a Duplicating genes in IRs including pseudogenes, information in parentheses indicates the total number of functional genes, * represents sequenced and assembled in the current study, ** represents assembled from raw reads of NCBI.
There were 17-18 intron-containing genes of which 11-12 were protein-coding genes and 6 were tRNA genes. Moreover, an intron was found in atpF gene in only Baccaurea ramiflora, which was missing in all other species examined. All protein-coding genes contained one intron, except clpP and ycf 3, which had two introns as shown in Table 1. Of these intron-containing genes, three were protein-coding genes and two tRNA genes were also duplicated in IR regions.

Phylogenetic Analysis
Phylogenetic analysis of the selected species of the family Phyllanthaceae was performed based on 76 protein-coding genes and suitable polymorphic loci. Both methods provided similar results and resolved the phylogenetic relationship within the family Phyllanthaceae with high bootstrapping support of 100. The genus Phyllanthus was found to be polyphyletic, with P. amarus closely related to G. chodoense instead of P. emblica ( Figure 3). The nodes "P. amarus and G. chodoense" and "B. fruticosa and S. spatulifolius" were sisters to one another and were rooted by P. emblica. This whole clade was rooted by F. virosa, followed by L. cordifolius and then by B. ramiflora, which lies basal to the selected species in the tree.  a Duplicating genes in IRs including pseudogenes, information in parentheses indicates the total number of functional genes, * represents sequenced and assembled in the current study, ** represents assembled from raw reads of NCBI.

Figure 2.
Mauve alignment of nine species of the family Phyllanthaceae shows high similarities in gene arrangement and gene contents. The small blocks represent genes: red = rRNA; green = tRNA with intron; black = tRNA without intron; white = protein-coding genes. The large reddish color represents LSC and IR regions, while the greenish part represents SSC regions. The small purple and green blocks represent part of the ycf1 Ψ gene, e.g., in Leptopus cordifolius the size of ycf1 Ψ was 1032 bp and both blocks are present at the start of the large greenish block, while in Flueggea virosa the size of ycf1 Ψ was 220 bp and both blocks are present at the end of the large greenish block.

Phylogenetic Analysis
Phylogenetic analysis of the selected species of the family Phyllanthaceae was performed based on 76 protein-coding genes and suitable polymorphic loci. Both methods Figure 2. Mauve alignment of nine species of the family Phyllanthaceae shows high similarities in gene arrangement and gene contents. The small blocks represent genes: red = rRNA; green = tRNA with intron; black = tRNA without intron; white = protein-coding genes. The large reddish color represents LSC and IR regions, while the greenish part represents SSC regions. The small purple and green blocks represent part of the ycf 1 Ψ gene, e.g., in Leptopus cordifolius the size of ycf 1 Ψ was 1032 bp and both blocks are present at the start of the large greenish block, while in Flueggea virosa the size of ycf 1 Ψ was 220 bp and both blocks are present at the end of the large greenish block. Phyllanthaceae with high bootstrapping support of 100. The genus Phyllanthus was found to be polyphyletic, with P. amarus closely related to G. chodoense instead of P. emblica (Figure 3). The nodes "P. amarus and G. chodoense" and "B. fruticosa and S. spatulifolius" were sisters to one another and were rooted by P. emblica. This whole clade was rooted by F. virosa, followed by L. cordifolius and then by B. ramiflora, which lies basal to the selected species in the tree.

Inverted Repeat Contraction and Expansion
Comparative analysis of the junctions of LSC/IR and SSC/IR showed similarities among species of the family Phyllanthaceae, except for a few differences that mostly occurred at the junction of LSC/IRb (JLB) (Figure 4). At JLB, the rps19 started from the IRb and entered the LSC regions in six species. Hence, truncated rps19 pseudogene remained at JLA (IRa/LSC) in these species (Figure 4). In the remaining three species, the rps19 completely existed in the LSC regions at the JLB junction and did not lead to the generation of a pseudogene at the JLA junction. The rpl2 gene was completely found in the IR regions away from the JLB and JLA junctions, except in Sauropus spatulifolius, where the rpl2 started from IRb and entered the LSC regions, retaining a pseudogene at the junction of JLA instead of a functional copy due to contraction of IRs. Moreover, at JLA the rpl23 was present in IRb and IRa instead of rpl2. At the SSC/IRb junction (JSB), ndhF was situated entirely in the SSC region in the six species of Baccaurea ramiflora, Flueggea virosa, Phyllanthus emblica (China), Phyllanthus emblica (Pak), Breynia fruticosa, and Sauropus spatulifolius, while it was integrated into the IRb regions within the remaining three species of Phyllanthus amarus, Glochidion chodoense, and Leptopus cordifolius, where it overlapped with the pseudogene of ycf1. The ycf1 started in IRa and ended in SSC, as seen at the JSA (SSC/IRa) junction. Consequently, pseudogene of ycf1 remained at JSB (IRb/SSC), which

Inverted Repeat Contraction and Expansion
Comparative analysis of the junctions of LSC/IR and SSC/IR showed similarities among species of the family Phyllanthaceae, except for a few differences that mostly occurred at the junction of LSC/IRb (JLB) (Figure 4). At JLB, the rps19 started from the IRb and entered the LSC regions in six species. Hence, truncated rps19 pseudogene remained at JLA (IRa/LSC) in these species (Figure 4). In the remaining three species, the rps19 completely existed in the LSC regions at the JLB junction and did not lead to the generation of a pseudogene at the JLA junction. The rpl2 gene was completely found in the IR regions away from the JLB and JLA junctions, except in Sauropus spatulifolius, where the rpl2 started from IRb and entered the LSC regions, retaining a pseudogene at the junction of JLA instead of a functional copy due to contraction of IRs. Moreover, at JLA the rpl23 was present in IRb and IRa instead of rpl2. At the SSC/IRb junction (JSB), ndhF was situated entirely in the SSC region in the six species of Baccaurea ramiflora, Flueggea virosa, Phyllanthus emblica (China), Phyllanthus emblica (Pak), Breynia fruticosa, and Sauropus spatulifolius, while it was integrated into the IRb regions within the remaining three species of Phyllanthus amarus, Glochidion chodoense, and Leptopus cordifolius, where it overlapped with the pseudogene of ycf 1. The ycf 1 started in IRa and ended in SSC, as seen at the JSA (SSC/IRa) junction. Consequently, pseudogene of ycf 1 remained at JSB (IRb/SSC), which ranges in size from 195 to 1921 bp. At the junction of JLA (IRa/LSC), the trnH gene was found in all species (Figure 4). Thus, IR contraction and expansion were not responsible for complete duplication of a functional copy of a gene, but they led to the generation of truncated rps19, rpl2, and ycf 1 pseudogenes.
Diversity 2021, 13, x FOR PEER REVIEW 9 of 18 ranges in size from 195 to 1921 bp. At the junction of JLA (IRa/LSC), the trnH gene was found in all species (Figure 4). Thus, IR contraction and expansion were not responsible for complete duplication of a functional copy of a gene, but they led to the generation of truncated rps19, rpl2, and ycf1 pseudogenes.

Codon Usage and Amino Acid Frequency, and Repeats
Codon usage analysis was interpreted in terms of RSCU values. The analysis showed that most of the amino acids were encoded from codons ended with A/T at 3′ (having RSCU > 1) instead of C/G (having RSCU < 1) ( Figure S1). The amino acid frequency analysis revealed that leucine was the most encoded amino acid while cysteine was the rarest ( Figure S2). Codon usage analysis and amino acid frequency showed high similarities in all species of the family Phyllanthaceae. The simple sequence repeats (SSRs) and oligonucleotide repeats were analyzed in the Cp genome of selected species. The analysis of MISA revealed 630 SSRs mostly consist of A/T motifs in all species (Table S1). All six types of SSRs were based on motif types, including mononucleotide, dinucleotide, trinucleotide, tetranucleotide, pentanucleotide, and hexanucleotide. However, SSRs were predominantly mononucleotide, followed by dinucleotide ( Figure S3a). The highest number of SSRs was found in the LSC region, followed by the SSC region and the IR ( Figure S3b). The highest number of SSRs (84) was observed in Phyllanthus emblica (China), and the lowest number of SSRs (51) was seen in Glochidion chodoense. The REPuter program detected 311 oligonucleotide repeats with sizes of 30-96 bp. The number of repeats varied from 24 (Glochidion chodoense) to 49 (Baccaurea ramiflora). Most of the oligonucleotide repeats were forward and palindromic, but reverse and complementary repeats were also found ( Figure S3c). Most of the repeats ranged in size from 30 to 34 bp ( Figure S3d). The number of repeats was higher in LSC than in SSC and IR, whereas some repeats were also shared between LSC/IR, LSC/SSC, and SSC/IR regions of the chloroplast genome ( Figure  S3e).

Codon Usage and Amino Acid Frequency, and Repeats
Codon usage analysis was interpreted in terms of RSCU values. The analysis showed that most of the amino acids were encoded from codons ended with A/T at 3 (having RSCU > 1) instead of C/G (having RSCU < 1) ( Figure S1). The amino acid frequency analysis revealed that leucine was the most encoded amino acid while cysteine was the rarest ( Figure S2). Codon usage analysis and amino acid frequency showed high similarities in all species of the family Phyllanthaceae. The simple sequence repeats (SSRs) and oligonucleotide repeats were analyzed in the Cp genome of selected species. The analysis of MISA revealed 630 SSRs mostly consist of A/T motifs in all species (Table S1). All six types of SSRs were based on motif types, including mononucleotide, dinucleotide, trinucleotide, tetranucleotide, pentanucleotide, and hexanucleotide. However, SSRs were predominantly mononucleotide, followed by dinucleotide ( Figure S3a). The highest number of SSRs was found in the LSC region, followed by the SSC region and the IR ( Figure S3b). The highest number of SSRs (84) was observed in Phyllanthus emblica (China), and the lowest number of SSRs (51) was seen in Glochidion chodoense. The REPuter program detected 311 oligonucleotide repeats with sizes of 30-96 bp. The number of repeats varied from 24 (Glochidion chodoense) to 49 (Baccaurea ramiflora). Most of the oligonucleotide repeats were forward and palindromic, but reverse and complementary repeats were also found ( Figure S3c). Most of the repeats ranged in size from 30 to 34 bp ( Figure S3d). The number of repeats was higher in LSC than in SSC and IR, whereas some repeats were also shared between LSC/IR, LSC/SSC, and SSC/IR regions of the chloroplast genome ( Figure S3e).

Synonymous and Non-Synonymous Substitutions
The synonymous (Ks) and non-synonymous (Ka) substitutions and their ratio (Ka/Ks) revealed high similarities among the species of Phyllanthaceae (Table S2). The lowest average values of Ka = 0.0298, Ks = 0.1866, and Ka/Ks = 0.1597 were recorded, which showed that high purifying selection pressure acted on the protein-coding genes of species of the family Phyllanthaceae. Two genes-psbL and petL-showed a signature of positive selection in Sauropus spatulifolius and Flueggea virosa, respectively, based on the values of Ka/Ks. However, the results of FUBAR and MEME did not support signature of positive selection.

Transition and Transversion Substitutions
The transition and transversion substitutions within the complete chloroplast genome and in the protein-coding regions were analyzed. The comparative analysis of the complete chloroplast genomes among the distantly related species of Phyllanthaceae revealed the existence of 10,950-12,603 substitutions (Figure 5a), whereas comparison of closely related species revealed the presence of 257-2077 substitutions (Figure 6a). Furthermore, 3237-3747 substitutions were observed in protein-coding sequences of distantly related species (Figure 5b) and 65-685 substitutions in closely related species (Figure 6b). The ratio of transition and transversion substitutions (Ts/Tv) also revealed variations in both distantly related and closely related species. The Ts/Tv ratio for distantly related species varied from 1.34 to 1.40 (Figure 5c), whereas for closely related species it varied from 0.77 to 1.0 (Figure 6c). The analysis revealed that the Ts/Tv ratio was higher in protein-coding sequences than in the complete chloroplast genome. The Ts/Tv ratio for distantly related species was 2.09-2.34 ( Figure 5c) and for closely related species 1.42-1.71 (Figure 6c). The higher Ts/Tv ratio in protein-coding regions showed the occurrence of higher transition substitutions than transversion substitutions. The lower Ts/Tv ratio in the complete genome than in coding regions was due to inclusion of an intergenic spacer region in which higher transversion substitutions take place than transition substitutions.

Intraspecies Variations in Chloroplast Genomes of Phyllanthus Emblica
The intraspecies variations in the chloroplast genomes of Phyllanthus emblica belonging to two different countries-Pakistan and China-with totally different climatic changes were determined. Slight differences emerged in genome size, LSC, and SSC, as shown in Table 2. The numbers of protein-coding genes, unique genes, and IR regions were the same in both species. The comparative analysis of both P. emblica chloroplast genomes revealed 257 SNPs in the whole chloroplast genome, 79 of which belong to coding regions. In total, 108 insertion-deletion (InDel) events with an average length of 5.32 in which 3 InDels belong to coding regions were observed. No InDels or SNPs in genes of transfer RNAs and ribosomal RNAs were seen.

Polymorphism of Protein-Coding Genes
The polymorphism of all protein-coding genes was assessed to facilitate identification of suitable polymorphic loci for future phylogenetic inference of the family Phyllanthaceae (Table S3). The 15 genes of ≥200 bp were selected. The missing data produced by the highly polymorphic regions due to InDels to provide information about the suitability of these loci were also assessed and are presented in Table 3. These loci included in the analysis were rpl22, ycf 1, matK, ndhF, and rps15, as shown in Table 3. y 2021, 13, x FOR PEER REVIEW 11 of 18    Although the angiosperm's chloroplast genome is believed to be conserved in both gene content and gene order, with conserved intronic regions in protein-coding genes and tRNA genes. However, the loss of some genes or introns were also reported [20,22,35,64]. Several studies have reported de novo assembled chloroplast genomes from whole genome shotgun reads [65,66]. In the present study, whole genome shotgun sequencing was employed to assemble the complete chloroplast genomes of Phyllanthus emblica (Pak), Leptopus cordifolius, and Flueggea virosa with high coverage depth.
We observed that the chloroplast genome features of the species belonging to the family Phyllanthaceae were highly similar to each other. The intron of atpF gene was deleted in all of the species, except Baccaurea ramiflora, which was present at a basal point in the current species. The rps16 gene was found to be non-functional in the two species of Baccaurea ramiflora (deletion of exon 1) and Leptopus cordifolius (gene loss with only few base pairs remaining). The deletion of the rps16 gene has previously been reported [27] in the species of the order Malpighiales and a functional copy was found for some species in nuclear genome but was not reported in the family Phyllanthaceae due to inclusion of Glochidion chodoense in the comparison containing a functional rps16 gene. Here, comparison of several species helped us to determine the deletion/pseudogenization of rps16 for the first time in the family Phyllanthaceae. The GC content of species of the family Phyllanthaceae is similar to other angiosperms and to other species of the order Malpighiales [20,67,68]. Another gene, infA, was completely absent in the chloroplast genome of the species of the family Phyllanthaceae. The deletion and pseudogenization of the infA gene were also reported in other families of angiosperms, including Araceae [57,66,69], Malvaceae [19], and Malpighiaceae [20]. Since the product of the inf A gene has a vital function as a translation initiation factor, it can be inferred that the functional copy of inf A has probably been transferred to the nuclear genome [35,70].
Previously, it was documented that IR's contraction and expansion can generate a pseudo-copy as well as a functional gene copy or may change duplicated genes to a single-copy gene due to transfer of IRs to single-copy regions (LSC or SSC) or may convert a single-copy gene to a duplicated gene when transferred from LSC or SSC to IRs [19,34,57,65]. In our study, the analyzed species were found to be conserved in terms of contraction and expansion of IRs. Likewise, the generation of a pseudo-copy of ycf 1, rpl2, and rps19 remained similar to other angiosperms [71,72]. However, the duplication of complete functional genes, such as ycf 1, rps15, and rps19, observed in other angiosperms was not detected here [19,57,65].

Simple Sequence Repeats and Oligonucleotide Repeat Analyses
The highest number of simple sequence repeats (SSRs) was observed in the LSC region as compared with lower numbers in SSC and IR regions. Mononucleotide SSRs (A/T) and dinucleotide SSRs (AT/TA) were more abundant than trinucleotide SSRs, consistent with previous studies [54,[73][74][75][76][77]. However, a few studies have reported higher numbers of trinucleotides than of dinucleotides [19,68,78,79]. The SSR marker identified in the current study can serve as a resource for population genetics studies of species of the family Phyllanthaceae. REPuter detected different oligonucleotide repeats, including forward, reverse, complementary, and palindromic repeats in the chloroplast genomes. These repeats originate InDels and substitutions [80][81][82] and can be used as proxies for identification of mutational hotspot regions [69,81,83]. High variations were detected in the number of oligonucleotide repeats, but further analysis involving more species is needed to gain insight into the evolutionary events leading to loss/gain of repeats. Previously, correlations between the identified number of repeats and genome size and phylogenetic position of species have not been established [65].

Phylogenetics Analysis and Suitable Polymorphic Loci for Further Phylogenetic Inference
The phylogenetic analysis showed that the polyphyletic nature of the genus Phyllanthus was similar to the recently obtained results using nuclear (ITS, PHYC) and chloroplast (matK, accD-psaI, trnS-trnG) markers along with morphological data [6]. Hence, the author suggested the division of the genus Phyllanthus into nine small genera instead of combining the embedded genera to Phyllanthus. Our study was based on 76 protein-coding sequences, which supports the polyphyletic nature of the genus Phyllanthus with a limited number of species. Further sequencing will thus be required to gain broad insight into the phylogenetics of Phyllanthus and the tribe Phyllantheae. A similar phylogenetic position has been reported in other species earlier [4,7]. Certain discrepancies still exist in the phylogeny of the family Phyllanthaceae, which warrant further elaboration [3,4,7]. Here, we have identified 15 suitable highly polymorphic regions from protein-coding sequences based on the comparative analysis of chloroplast genomes. The phylogenetic analysis based on these loci provided similar results to the dataset of 76 genes, which suggests a high efficacy of these loci for phylogenetic inference. These will serve as suitable markers for quality phylogenetic inference of the family Phyllanthaceae, as lineage-specific molecular markers are more authentic, robust, and cost-effective [22,36,[83][84][85].
In conclusion, our study provides insight into the molecular evolution of the chloroplast genome and sheds light on the deletion/pseudogenization of rps16 in the family Phyllanthaceae for the first time. However, sequencing and analysis of a broad number of species may elucidate the evolution and biological factors that have led to deletion of this gene. The phylogenetic analysis of our limited species showed the polyphyletic nature of the genus Phyllanthus but sequencing multiple species of this genus may be helpful to reconstruct a high-resolution phylogenetic tree and obtain insight into its systematics. The suitable polymorphic markers determined in this study may also be validated and applied in the future to a large number of species for broad phylogenetic inference of the family Phyllanthaceae.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/10 .3390/d13090403/s1, Figure S1. Heatmap representing the relative synonymous codon usage. The y-axis represents 61 amino acid coding codons, and the x-axis represents species names. Phyllanthus emblica (Pak) was included in the comparison. Figure S2. Comparison of amino acid frequency within the species of Phyllanthaceae. The x-axis represents the amino acid with standard symbol, and the y-axis represents the frequency of amino acids. Phyllanthus emblica (Pak) was included in the comparison. Figure Table S1. Simple sequence repeats analysis among species of the family Phyllanthaceae. Table S2. Comparison of synonymous and non-synonymous substitutions among species of the family Phyllanthaceae. Table S3. Polymorphic analysis of protein-coding genes.
Author Contributions: Manuscript drafting, A. and U.R.; data analyses, U.R., A., M.M. and A.J.; data curation, U.R. and A.; data interpretation, U.R., A., P.P. and N.S.; conceptualization, U.R., N.S. and P.P.; review and editing of first draft, N.S. and P.P.; supervision, N.S.; project administration and resources, N.S. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest:
The authors have no conflict of interest to declare.