Comparison of the Complete Eragrostis pilosa Chloroplast Genome with Its Relatives in Eragrostideae (Chloridoideae; Poaceae)

Eragrostis of the tribe Eragrostideae is a taxonomically complex genus, because of its polyploid nature and the presence of similar morphological characters among its species. However, the relationship between these morphologically indistinguishable species at the genomic level has not yet been investigated. Here, we report the complete chloroplast genome of E. pilosa and compare its genome structures, gene contents, simple sequence repeats (SSRs), sequence divergence, codon usage bias, and Kimura 2-parameter (K2P) interspecific genetic distances with those of other Eragrostideae species. The E. pilosa chloroplast genome was 134,815 bp in length and contained 132 genes and four regions, including a large single-copy region (80,100 bp), a small single-copy region (12,661 bp), and a pair of inverted repeats (21,027 bp). The average nucleotide diversity between E. pilosa and E. tef was estimated to be 0.011, and 0.01689 among all species. The minimum and maximum K2P interspecific genetic distance values were identified in psaA (0.007) and matK (0.029), respectively. Of 45 SSRs, eight were shared with E. tef, all of which were in the LSC region. Phylogenetic analysis resolved the monophyly of the sampled Eragrostis species and confirmed the close relationship between E. pilosa and E. tef. This study provides useful chlorophyll genomic information for further species identification and phylogenetic reconstruction of Eragrostis species.


Introduction
Eragrostis is a large genus comprising more than 350 species in the tribe Eragrostideae (subfamily Chloridoideae; family Poaceae), which includes three subtribes (Cotteinae (4 genera), Eragrostidinae (11 genera), and Uniolinae (5 genera) [1,2]. Eragrostis species are distributed in weedy disturbed areas and dry habitats of the world's tropical and subtropical regions [3]. It is commonly called love grass or cane grass, and has significant uses in different areas. Most species belonging to this genus are used as animal feed. However, E. tef is a staple food in some countries, such as Ethiopia. E. amabilis is grown as an ornamental plant, and E. cynosuroides has religious meaning and is used in Hindu temples. Growing E. bahiensis can remove highly toxic radioactive atoms from the environment, and E. curvula is widely planted to prevent soil erosion.
Previously, all Eragrostideae species were thought to use the C4 photosynthetic pathway. Leaf blade anatomies common to the C4 pathway, including NAD-ME, PCK-like, and intermediate

Characteristic cp Genome Features of E. pilosa among Eragrostideae Species
The complete cp genome of E. pilosa was 134,815 bp long, and displayed a typical quadripartite structure containing a large single-copy (LSC) region (80,100 bp), a short single-copy (SSC) region (12,661 bp) and a pair of inverted repeat (IR) regions (21,027 bp). It consisted of 85 protein-coding genes, 39 tRNA genes, and eight rRNA genes, for a total of 132. Seven protein-coding genes (rps15, ycf68, rps7, ndhB, rpl23, rpl2, and rps19), eight tRNA genes (trnN-GUU, trnR-ACG, trnA-UGC, trnI-GAU, trnV-GAC, trnL-CAA, trnI-CAU, and trnH-GUG), and four rRNA genes were duplicated in the IR regions. In addition, trnM-CAU and trnfM-CAU were repeated in the LSC region. Ten protein-coding genes, including two duplicated genes (rpl2 and ndhB) and eight tRNA genes, including two duplicated genes (trnI-GAU and trnA-UGC) contained one intron, whereas the protein-coding gene ycf3 had two introns. Similar to most of the other angiosperms, the 5' end of rps12 was located in the LSC region and 3' end was duplicated in the IR regions ( Figure 1). rRNA genes. E. tenellula encoded the lowest total number of genes (124) due to loss of one copy of trnfM-CAU and the protein-coding genes rps19, ycf68, rps15, and ndhH. A copy of trnfM-CAU also lost in En. caerulescens, E. setifolia, and E. walteri. The additional trnT-GGU gene clustered with trnM-CAU in U. paniculata, En. oblongus, and E. minor. The protein-coding genes rps19 and ycf68 were also lost in E. setifolia and En. caerulescens, and loss of ycf68 was also observed in U. paniculata and E. walteri (Table  1 and Table S1).   The E. pilosa plastome sequence generated in this study was compared to all available complete plastomes from the tribe Eragrostideae. Six plastomes belonged to Eragrostis, two to Enneapogon, and one to Uniola. Of the Eragrostideae species, E. tenellula had the smallest cp genome (130,773 bp) while E. walteri had the largest (137,056 bp). The total number of genes varied among species, but E. pilosa and E. tef encoded an identical number of total genes, including the same protein-coding, tRNA, and rRNA genes. E. tenellula encoded the lowest total number of genes (124) due to loss of one copy of trnfM-CAU and the protein-coding genes rps19, ycf68, rps15, and ndhH. A copy of trnfM-CAU also lost in En. caerulescens, E. setifolia, and E. walteri. The additional trnT-GGU gene clustered with trnM-CAU in U. paniculata, En. oblongus, and E. minor. The protein-coding genes rps19 and ycf68 were also lost in E. setifolia and En. caerulescens, and loss of ycf68 was also observed in U. paniculata and E. walteri (Table 1 and Table S1). The IR/SSC boundary regions of Eragrostideae species displayed some significant differences, even among species in the genus Eragrostis. E. pilosa, E. tef, and E. minor had similar IR/SSC boundary patterns, except for length variations in ndhF and ndhH. In E. pilosa and E. minor, ndhF was 2211 bp, but was 2217 bp in E. tef. This gene crossed the IRB/SSC boundary region by 23 bp in all three species. Furthermore, ndhH crossed the IRA/SSC region by 4 bp, and was 1188 bp long in E. pilosa and 1182 bp long in E. tef and E. minor. The trnH gene was located in the IRA region and duplicated in the IRB region, and was 75 bp in all species. The 282-bp rps19 gene was present in the IRA region and repeated in IRB in all species except E. setifolia, E. tenellula, and En. caerulescens ( Figure 2). The IR/SSC boundary regions of Eragrostideae species displayed some significant differences, even among species in the genus Eragrostis. E. pilosa, E. tef, and E. minor had similar IR/SSC boundary patterns, except for length variations in ndhF and ndhH. In E. pilosa and E. minor, ndhF was 2211 bp, but was 2217 bp in E. tef. This gene crossed the IRB/SSC boundary region by 23 bp in all three species. Furthermore, ndhH crossed the IRA/SSC region by 4 bp, and was 1188 bp long in E. pilosa and 1182 bp long in E. tef and E. minor. The trnH gene was located in the IRA region and duplicated in the IRB region, and was 75 bp in all species. The 282-bp rps19 gene was present in the IRA region and repeated in IRB in all species except E. setifolia, E. tenellula, and En. caerulescens ( Figure 2).

Codon Usage
The GC contents of the entire cp genomes, the coding sequences (CDS), and each codon position were similar in all species. The GC content in the plastomes and CDS were lower (approximately 38.3% and 39%, respectively) suggesting an abundance of AT regions. The mean GC content at the first codon position (GC1, 48.0%) was higher than the mean values at the second and third positions (GC2 and GC3; 39.1% and 30.0% respectively) in all species ( Table 2). The average number of codons ranged between 14,407 (in En. caerulescens) and 14,431 (in E. walteri). In all species, leucine and cysteine were the most and least abundant amino acids, respectively. Methionine and tryptophan were each represented by only one codon, while all other amino acids were encoded by more than one codon in all Eragrostideae species. In E. pilosa, leucine was encoded by six synonymous codons (UUA (522), UUG (281), CUU (328), CUC (95), CUA (225), and CUG (77)), with relative synonymous codon usage (RSCU) values of 2.05, 1.1, 1.29, 0.37, 0.88, and 0.3, respectively (Table S2). Table 2. GC content of the Eragrostideae cp genome sequences. GC content (g) whole cp genome (c) 44 analyzed CDS (1) (2) (3) first, second, and third codon positions, respectively.

Codon Usage
The GC contents of the entire cp genomes, the coding sequences (CDS), and each codon position were similar in all species. The GC content in the plastomes and CDS were lower (approximately 38.3% and 39%, respectively) suggesting an abundance of AT regions. The mean GC content at the first codon position (GC1, 48.0%) was higher than the mean values at the second and third positions (GC2 and GC3; 39.1% and 30.0% respectively) in all species ( Table 2). The average number of codons ranged between 14,407 (in En. caerulescens) and 14,431 (in E. walteri). In all species, leucine and cysteine were the most and least abundant amino acids, respectively. Methionine and tryptophan were each represented by only one codon, while all other amino acids were encoded by more than one codon in all Eragrostideae species. In E. pilosa, leucine was encoded by six synonymous codons (UUA (522), UUG (281), CUU (328), CUC (95), CUA (225), and CUG (77)), with relative synonymous codon usage (RSCU) values of 2.05, 1.1, 1.29, 0.37, 0.88, and 0.3, respectively (Table S2).

Simple Sequence Repeats
SSRs in cp genomes are widely used in species identification, plant population genetics, and evolutionary studies [13,14]. In E. pilosa, 45 SSRs > 10 bp in length were detected, all of which were A/T and C/G mononucleotide repeats. The highest number of SSRs was detected in E. walteri (52), including 51 mononucleotide repeats and one dinucleotide repeat. The lowest number (41) was identified in E. minor. In all species, the SSRs were both mononucleotide or dinucleotide repeats, most were located in the LSC region, and A/T repeats were the most abundant type. All Eragrostis species except E. walteri contained SSRs incorporating C/G nucleotides. Compound SSRs also contained remarkably high A/T content, and in E. pilosa, only one compound SSR contained G nucleotides. E. tef and E. walteri had the highest number of compound SSRs containing A or T. The length of compound SSRs ranged from 22 to 200 bp in all species (Table 3).  (9), while E. tef and E. walteri shared the least (2). E. pilosa shared eight, four, and five SSRs with E. tef, E. setifolia, and E. tenellula, respectively, while E. tef and E. minor shared five ( Figure 3 and Table S3). Ten E. pilosa SSRs were randomly selected, and primers were designed that can be used in future studies of the genetic diversity and population structure of these species (Table S4).  (9), while E. tef and E. walteri shared the least (2). E. pilosa shared eight, four, and five SSRs with E. tef, E. setifolia, and E. tenellula, respectively, while E. tef and E. minor shared five ( Figure 3 and Table S3). Ten E. pilosa SSRs were randomly selected, and primers were designed that can be used in future studies of the genetic diversity and population structure of these species (Table S4).

Sequence Divergence and Kimura 2-Parameter (K2P) Genetic Distance
Coding and non-coding regions that are highly variable between cp genomes are useful as potential DNA barcodes for species identification and phylogenetic reconstruction [15]. According to DNA Sequence Polymorphism (DnaSP), the estimated average nucleotide diversity (pi) value between E. pilosa and E. tef was 0.011. The highest variation was found in the LSC region (atpI-atpH) with a value of 0.125. Five regions in the LSC (rps16-trnQ, atpI-atpH, ycf3-trnS, trnL-UAA, and rbcL-psaI) showed high variability, with pi values > 0.06. The highest nucleotide diversity in the IR regions was in rrn23 (pi=0.117) while rrn16, which clusters with rrn23, had a pi value > 0.063. For the nine cp genomes, pi values ranged from 0-0.05872, with an average of 0.01689. The loci rps16-trnQ, trnG-trnM-trnT, trnL, rbcL-psaI, and ndhF-rpl32 were highly divergent among Eragrostideae species (pi > 0.04; Figure 4). The highest nucleotide diversity was observed in rps16-trnQ (0.05872). Furthermore, we performed a multiple sequence alignment between the nine complete cp genomes in mVISTA, using E. pilosa as a reference. (Figure 5). Similar to the estimated nucleotide diversity, the most divergent regions were located in the intergenic spacer regions of the LSC.
Using the sequences of 44 protein-coding genes, we calculated the interspecific K2P genetic distances among the nine species. Minimum and maximum K2P values were observed in psaA (0.007) and matK (0.029), respectively. The average K2P interspecific genetic distance was estimated at 0.014 ( Figure 6). Pairwise comparisons of genetic divergence were estimated using the same protein-coding sequences. The lowest genetic divergence value was between E. pilosa and E. minor (0.003), while the highest divergence values (0.03) were between E. walteri and En. caerulescens/En. oblongus. Genetic divergence values between E. pilosa and E. minor, E. tef, E. walteri, E. setifolia, and E. tenellula were 0.003, 0.006, 0.028, 0.007, and 0.008, respectively, with the highest divergence observed between E. pilosa and E. walteri. When Enneapogon species were included, genetic divergence values with Eragrostis species were always high (ranging from 0.014 to 0.030). U. paniculata also showed relatively high genetic divergence with Eragrostis (ranging from 0.008 to 0.026) and Enneapogon (0.013) species. The genetic distance between En. caerulescens and En. oblongus was 0.002 (Table S5). psaI) showed high variability, with pi values > 0.06. The highest nucleotide diversity in the IR regions was in rrn23 (pi=0.117) while rrn16, which clusters with rrn23, had a pi value > 0.063. For the nine cp genomes, pi values ranged from 0-0.05872, with an average of 0.01689. The loci rps16-trnQ, trnG-trnM-trnT, trnL, rbcL-psaI, and ndhF-rpl32 were highly divergent among Eragrostideae species (pi > 0.04; Figure 4). The highest nucleotide diversity was observed in rps16-trnQ (0.05872). Furthermore, we performed a multiple sequence alignment between the nine complete cp genomes in mVISTA, using E. pilosa as a reference. (Figure 5). Similar to the estimated nucleotide diversity, the most divergent regions were located in the intergenic spacer regions of the LSC. Using the sequences of 44 protein-coding genes, we calculated the interspecific K2P genetic distances among the nine species. Minimum and maximum K2P values were observed in psaA (0.007) and matK (0.029), respectively. The average K2P interspecific genetic distance was estimated at 0.014 ( Figure 6). Pairwise comparisons of genetic divergence were estimated using the same protein-coding sequences. The lowest genetic divergence value was between E. pilosa and E. minor (0.003), while the   (Table S5).

Phylogeny
The nine Eragrostideae species and the outgroup species Eleusine indica (NC_030486) from Cynodonteae were used for phylogenetic analysis. Similar topologies were observed in maximum likelihood (ML) and Bayesian inference (BI) generated trees based on complete cp genome sequences, with high bootstrap values for almost all relationships. E. pilosa was in the same clade as E. minor, and both were sisters of E. tef (Figure 7). BI trees based on highly divergent regions (rps16-trnQ + trnL+ rbcL-psaI) and 30 protein-coding genes of length > 600 bp were identical with high posterior probability values, and E. pilosa, E minor, and E. tef clustered together (Figure 8). All Eragrostis species

Phylogeny
The nine Eragrostideae species and the outgroup species Eleusine indica (NC_030486) from Cynodonteae were used for phylogenetic analysis. Similar topologies were observed in maximum likelihood (ML) and Bayesian inference (BI) generated trees based on complete cp genome sequences, with high bootstrap values for almost all relationships. E. pilosa was in the same clade as E. minor, and both were sisters of E. tef (Figure 7). BI trees based on highly divergent regions (rps16-trnQ + trnL+ rbcL-psaI) and 30 protein-coding genes of length > 600 bp were identical with high posterior probability values, and E. pilosa, E minor, and E. tef clustered together ( Figure 8). All Eragrostis species except E. walteri had close relationships in all analysis. E. walteri was in the basal position followed by the Enneapogon species.

Discussion
Chloroplasts are essential plant cell organelles that convert solar energy to usable chemical energy [12]. They are involved in the synthesis of most leaf proteins, fatty acids, amino acids, and secondary metabolites. The chloroplast enzyme ribulose-1,5-bisphosphate carboxylase (RbcL) is the most abundant protein on earth, and it is widely used in evolutionary studies of plants [16]. During

Discussion
Chloroplasts are essential plant cell organelles that convert solar energy to usable chemical energy [12]. They are involved in the synthesis of most leaf proteins, fatty acids, amino acids, and secondary metabolites. The chloroplast enzyme ribulose-1,5-bisphosphate carboxylase (RbcL) is the most abundant protein on earth, and it is widely used in evolutionary studies of plants [16]. During evolution, specific cp genes have been lost in various species, genera, and families [17]. Therefore, comparing the cp genomes of plant species is useful to understand their evolutionary processes over time. The development of advanced technologies for cp genome sequencing has allowed many research groups to produce complete cp DNA sequences. In addition, no single DNA barcode exists that can distinguish all angiosperms from each other. Therefore, many recent studies have been conducted to detect appropriate barcodes to identify species, genera, and families using complete cp sequences [18].
All cp genomes used in this study had a typical quadripartite structure containing two copies of the IR region separating the SSC and LSC regions, similar to most other angiosperms [19,20]. However, the size and gene content varied even with the Eragrostis genus. Sequence alignment indicated that specific genes were lost in some cp genomes compared to E. pilosa. E. tenellula, for example, displayed loss of trnfM, rps19, rps15, ndhH, and ycf68.
IR contraction and expansion causes length variation in angiosperm cp genomes [21]. Various gene migrations are common in most Poaceae species, such as the ndhF and ndhH genes, which are located at opposite ends of the SSC region and migrate toward the IR regions. The tribe Eragrostideae is classified in the PACMAD clade. Davis et al. (2010) [22] reported that in this clade, ndhF extended into the IR by < 30 nucleotides. Similar patterns of overlap were observed in this study, with the exception of E. tenellula, in which ndhF extended 53 nucleotides into the IRB region and ndhH was not present. The pattern in En. oblongus differed from other Eragrostideae species, with ndhF located entirely in the SSC.
The phenomenon of codon usage bias refers to the unequal use of synonymous codons in an organism [23]. It is associated with the gene expression level and accuracy of translation [24], gene length [25], gene translation initiation signals, protein amino acid composition, protein structure, tRNA abundance, mutation frequency and patterns, and GC composition [26,27]. Consistent with previously reported cp genomes [28], the GC content of the Eragrostideae plastid genomes was reduced, suggesting an abundance of AT-rich intergenic regions in cp genomes. The AT-rich bias was strongest in the third codon position, also consistent with previously reported plastomes.
As mentioned above, E. pilosa and E. tef are closely related species with similar morphology. We aligned their cp genome sequences to observe similarities and divergences at the genomic level. Between these two species, we observed variable regions that were mainly found in intergenic regions. We plotted the sequence identity using mVISTA using E. pilosa as a reference. The divergence was mostly located in the LSC region, with less divergence observed in the IR regions. However, divergence in this region was associated with the rrn cluster, containing rrn16 and rrn23. Regions of high sequence divergence identified in this study could be used in deep phylogenetic analysis and as potential DNA barcodes.
Taking the E. walteri to outside of the Chloridoideae, low genetic divergence was observed in Eragrostis compared to inter-generic divergence in Eragrostideae. Within the Eragrostis genus, comparatively low genetic divergence was observed between E. tef and E. pilosa, suggesting this inter-specific variation may account for E. pilosa's many climatic adaptations. Evidence of chloroplast genes contribute to plant adaptation to the different environment has been discussed previously in Najas flexilis [29] and the Hawaiian genus Schiedea [30].
The proper taxonomic classification of Eragrostis species within Chloridoideae has been debated for many years, due to their similar morphologies, polyploid natures, and broad geographic distributions. However, this study does not suggest the need for a new infrageneric classification, but rather adds additional molecular information on the cp genomes for existing infrageneric classifications. Hilu and Alice (2001) [31] demonstrated that Eragrostideae species do not appear to be monophyletic, according to their matK sequences. However, Ingram and Doyle (2007) [3] used the plastid locus rps16 and the nuclear gene waxy to demonstrate that Eragrostis species form a monophyletic group when a few segregate genera, including Acamptoclados, Diandrochloa, and Neeragrostis are included. In our data, Eragrostis species formed a monophyletic group, while Enneapogon and Uniola clustered separately. Meanwhile, Ingram et al. (2011) [7] demonstrated that E. walteri was misclassified in the Chloridoideae. Therefore, the placement of E. walteri in our phylogenetic tree is not surprising.
Alignment of the complete cp genomes demonstrated that E. pilosa and E. minor formed a clade and were sisters to E. tef. Additionally, it recovered a major Eragrostis clade and other genera. However, it is challenging to construct phylogenies for Eragrostis, because most known species are polyploid and the ploidy levels of many others remain unknown [3]. In BI trees based on highly divergent regions and 30 protein-coding genes, E. pilosa, E. tef, and E. minor clustered together. The similarities between E. pilosa and E. tef have been reported previously using morphological [32,33], cytological/biochemical [34,35] and molecular methods [10]. Based on cytological examinations, Tavassoli (1986) [35] suggested that E. minor and E. pilosa are close relatives of E. tef. Ingram and Doyle (2003) [10] showed that E. minor clustered with E. pilosa and E.tef with different progenitors. Therefore, extended sampling with several landraces of these close relatives will be necessary to place E. pilosa in its correct position. Our study highlights the utility of complete cp genome sequences, several protein-coding genes, and highly divergent regions for the phylogenetic analysis of Eragrostideae.

Genome Assembly and Annotation
We obtained high-quality reads from raw reads using the Trimmomatic tool [37]. The high-quality reads were then assembled using the GetOrganelle software (https://github.com/ Kinggerm/GetOrganelle/blob/master/get_organelle_from_reads.py). Assembly was polished by mapping raw reads back to the retrieved sequence using the map reference tool in Geneious 10.0.5 (http://www.geneious.com). The generated consensus sequence was further adjusted by alignments with related species. Dual Organellar GenoMe Annotator online tool (https://dogma.ccbb.utexas.edu/) was used to annotate all protein-coding sequences and tRNA and mRNA genes [38]. Annotations of tRNA genes were confirmed with tRNAscan-SE [39]. Finally, the circular cp genome of E. pilosa was drawn with the online tool Organellar Genome DRAW (ORDRAW, https://chlorobox.mpimp-golm. mpg.de/OGDraw.html) [40]. The complete chloroplast genome of E. pilosa was deposited in GenBank with the accession number MN268502.
Alignments of whole cp genomes were visualized with mVISTA [43]. Sequence divergence values were determined using the sliding window method in DnaSP v5.10. [44] with a 200-bp step size and an 800-bp window length. We estimated the K2P genetic distance for each protein-coding gene and the pairwise distance between each species using the K2P model in MEGA 6.0.

SSR Analysis
The MISA tool [45] was used to detect SSRs, with the parameters set to ten repeated units for mononucleotide SSRs, six repeated units for dinucleotide SSRs, and five repeated units for tri-, tetra-, penta-, and hexanucleotide SSRs. Shared repeats among Eragrostis species were determined by aligning the species to detect identical repeats located in homologous regions. The Primer3 online tool (http://primer3.ut.ee/) was used to design E. pilosa SSR primers for future studies [46]. IR border regions were visualized using the IRscope online program (https://irscope.shinyapps.io/irapp/).

Phylogenetic Analysis
All nine Eragrostideae species and Eleusine indica from Cynodonteae were used for phylogenetic analysis. Phylogenetic trees were derived from the following three data sets: (1) the complete cp genome sequences; (2) 30 protein-coding genes of > 600 bp length, to avoid sampling bias; and (3) a concatenation of highly divergent regions (rps16-trnQ + trnL+ rbcL-psaI). The analyses were based on the ML and BI methods. Coding sequences extracted from each sequence were aligned using MAFFT [47] and concatenated using Geneious software. The ML tree was constructed in MEGA 6.0 based on the Tamura-Nei model using a heuristic search for initial trees. Bootstrap analysis was performed with 1,000 replicates. A discrete gamma distribution was used with four categories (+G, parameter = 0.3797). Initial ML tree(s) were obtained using the neighbor joining and BIONJ algorithms. BI analysis was conducted using MrBayes 3.2 [48] with a General Time Reversible substitution model. Two independent Markov chain Monte Carlo chains were run for 1.5 million generations, with a subsampling frequency of 200 generations.

Conclusions
In this study, we sequenced the complete cp genome sequence of E. pilosa, a grass belonging to tribe Eragrostideae. Specific genome features and SSRs were identified and compared with those of other Eragrostideae species. This study is the first complete cp genome comparison of Eragrostideae species, using all complete Eragrostideae cp genomes available in the NCBI database to date. Phylogenetic analysis demonstrates that cp genome data sets can be used to resolve Eragrostideae phylogeny through broad sampling. This study makes a significant contribution to the literature because it provides new information on the genetic relationships between Eragrostis species that could be used for further species identification and phylogenetic analysis.