Comparative Analysis and Characterization of Ten Complete Chloroplast Genomes of Eremurus Species (Asphodelaceae)

: Eremurus , a perennial rhizomatous mesophytic ornamental plant and one of the largest genera of the family Asphodelaceae, is distributed mainly in southwestern and central Asia. We sequenced the complete chloroplast genomes of ten species corresponding to all sections of the genus and analyzed their basic structure and evolutionary relationships. The cp genomes showed signiﬁcant similarities in size, gene sequences, gene classes, and inverted repeat regions (IRs). The complete chloroplast genome of Eremurus has a typical tetrad structure, ranging in length from 153,782 bp ( E. lactiﬂorus ) to 155,482 bp ( E. aitchisonii ). The length of the large single-copy region (LSC) ranges from 84,005 bp ( E. lactiﬂorus ) to 84,711 bp ( E. robustus ), that of the small single-copy region (SSC) ranges from 16,727 bp ( E. soogdianus ) to 17,824 bp ( E. suworowii ), and that of the inverted repeat regions (IR) ranges from 26,484 bp ( E. lactiﬂorus ) to 26,597 bp ( E. inderiensis and E. soogdianus ). A total of 131 genes were detected, including 85 protein-coding genes, 8 rRNA genes, and 38 tRNA genes. In addition, we found seven common and eight unique SSRs in ten Eremurus species. Among the protein-coding genes, ﬁve highly variable genes ( ycf1 , rps15 , rps16 , and rpl36 ) with high Pi values were detected and showed potential as DNA barcodes for the genus. Three genes ( rps19 , ycf1 , and ndhB ) had positive Ka/Ks values. Codon usage patterns were very similar across species: 33 codons had relative synonymous codon usage values of more than one, of which three ended with G, and the remaining codons ended with A and U. Phylogenetic analyses using complete cp genomes and 81 protein-coding genes conﬁrmed previous studies with the genus as well as subgenus Eremurus monophyletic and the subgenus Henningia paraphyletic.


Introduction
Eremurus M.Bieb. is one of the largest genera of Asphodelaceae Juss. [1], with 45 to 50 [2] or 59 [3] species of perennial, rhizomatous mesophytic plants. The genus is distributed in loess slopes and arid to semiarid mountainous areas in Central Asia, China, India, Pakistan, Afghanistan, Iran, Iraq, Lebanon, the Caucasus area, and Turkey [4,5]. Species of Eremurus are important as ornamental plants [6] and are called "foxtail lily" or "desert candle" because of their large and colorful inflorescence spikes [1]. They are also used in industry for products such as bio-oil [7] and adhesives [8], and some species within

Sequencing, Assembly, and Annotation
The DP305 Plant Genomic DNA Kit (Tiangen, Beijing, China) was used to extract total genomic DNA from leaf material, following the manufacturer s protocol. The sequencing library was generated using the NEBNext ®® UltraTM DNA Library Prep Kit for Illumina (New England, USA, NEB, Catalog: E7370L) following the manufacturer s recommendations, and index codes were assigned to each sample. Briefly, the genomic

Sequencing, Assembly, and Annotation
The DP305 Plant Genomic DNA Kit (Tiangen, Beijing, China) was used to extract total genomic DNA from leaf material, following the manufacturer's protocol. The sequencing library was generated using the NEBNext ®® UltraTM DNA Library Prep Kit for Illumina (New England, USA, NEB, Catalog: E7370L) following the manufacturer's recommendations, and index codes were assigned to each sample. Briefly, the genomic DNA sample was fragmented to a size of 350 bp by sonication. Then, the DNA fragments were polished at the ends, A-tailed, and ligated to the full-length adapter for Illumina sequencing, followed by further PCR amplification. The PCR products were purified using the AMPure XP system (Beverly, MA, USA). Subsequently, the quality of the library was checked by Agilent 5400 System (Agilent, Santa Clara, CA, USA) and quantified by QPCR (1.5 nM). The qualified libraries were pooled and sequenced on Illumina platforms using the PE150 strategy from Novogene Bioinformatics Technology Co., Ltd. (Beijing, China), depending on the effective library concentration and the amount of data required.
The resulting clean reads were assembled using the GetOrganelle pipeline [19] with the optimized parameters "-F plant_cp -w 0.6 -o -R 20 -t 8 -k 75,95,115,127 &". Gene annotation was performed in Geneious v. 10.0.2, and E. robustus (accession number: NC046772) was set as the reference. Start and stop codons and intron/exon boundaries for protein-coding genes were manually checked [20].

Simple Sequence Repeats (SSRs)
The MIcroSAtellite (MISA) web tool was used for chloroplast simple sequence repeat (SSR) identification [21]. The search parameters for SSRs were configured to identify ideal mono-, di-, tri-, tetra-, penta-, and hexa-nucleotide patterns with at least 10, 5, 4, 3, 3, and 3 repeats, respectively. The REPuter program [22] was used to identify repeats: forward, reverse, palindrome, and complement sequences in cp genomes. The following settings were used to identify repeats: (1) hamming distance equal to 3; (2) minimal repeat size set to 30 bp; and (3) maximum calculated repeats set to 90 bp.

Comparative Analysis of Chloroplast Genomes
Physical maps of cp genomes were generated using OGDRAWv1.1 [23]. The program mVISTA in Shufe-LAGAN mode [24] was used to compare the complete cp genomes of the 10 Eremurus species, using the annotation of E. robustus as a reference (NC046772). After manual multiple alignments using the program MUSCLE [25] in the software MEGA X [26], coding regions were extracted to detect variable sites. The nucleotide variability (Pi) was calculated for the whole cp genome and protein-coding genes separately using DnaSP v. 6 software [27]. The window length was set to 800 bp and the step size to 200 bp. To determine whether protein-coding genes were under selection pressure, the synonymous (Ks) and nonsynonymous (Ka) substitution rates and ω-value (ω = Ka/Ks) for shared protein-coding genes in ten Eremurus cp genomes were analyzed using DnaSP v. 6 software [27].

Codon Usage Bias Analysis
Coding sequences (CDS) found in chloroplast genomes were extracted manually one by one. Codon usage frequency analysis was performed for each species using the MEGA X [26]. The relative synonymous codon usage (RSCU) indicates whether a plastid gene is being favored, and codons with an RSCU value greater than 1 were considered high-frequency codons.

Phylogenetic Analysis
A total of 17, of which 11 cp were genomes of Eremurus and 6 cp were genomes of outgroups (Aloe vera, A. maculata, Aloidendron pillansii, Xanthorrhorea preissii, Hemerocallis fulva from Asphodelaceae and Asparagus officinalis from Asparagaceae), were used for phylogenetic analysis. Table 1 provides information about their NCBI accession numbers. Phylogenetic tree reconstruction was performed using the complete cp genomes and protein-coding sequences, which were first aligned multiple times using MAFFT software v. 7 [28]. Maximum likelihood (ML), Bayesian inference (BI), and maximum parsimony (MP) methods were used in this study to reconstruct phylogenetic trees. Nucleotide substitution models were statistically selected using jModelTest2 on XSEDE (www.phylo.org, accessed on 2 January 2020) using the Akaike information criterion (AIC). The GTR+I+G and TIM1+I+G models were selected as the best models for the protein-coding sequences and the complete cp genomes, respectively. For BI, we used MrBayes v. 3.2.7a [29] with 10 million generations, randomly sampling the trees every 1000 generations. In the latter analysis, after the first 25% of the trees were discarded as burn-in, a consensus tree with 50% majority rule was constructed from the remaining trees to estimate posterior probabilities (PP). ML trees were constructed with 1000 replicates for bootstrapping using RAxML v8.2.11 [30] via raxmlGUI 2.0.10 platform [31]. For MP analysis, we used PAUP* 4.0a169 [32]. The MP bootstrap analysis was performed with heuristic search, TBR branch-swapping, 1000 bootstrap replicates, random addition sequence with 10 replicates, and a maximum of 1000 saved trees per round.

Repeat Sequences and SSRs Analysis
Our study examined both common and unique SSRs in ten Eremurus species, which are listed in Tables S3 and S4. Our results showed that the majority of repeat units consisted of A and T, with rare occurrences of C or G, suggesting that the SSRs of different species have a clear preference for certain base types of repeat units. The common SSRs present in all ten species included A, AG, AT, AAT, AAAC, AAAT, and AATG. In addition, we found eight unique SSRs, including ATCC, AAATT, and ATATC in E. lactiflorus; AAACT and AAATTC in E. hissaricus; and AATT and AAAAAT in E. suworowii. In addition, a single AAATTG SSR was detected in E. aitchisonii, while no unique SSRs were identified in E. inderiensis, E. iae, E. regelii, E. soogdianus, E. albertii, and E. luteus. and 1 hexanucleotide SSRs, respectively. Among the ten Eremurus cp genomes, the most abundant repeats were the mononucleotides from 59 (E. lactiflorus) to 79 (E. aitchisonii), and the most dominant SSR was A. The second most predominant SSR was the dinucleotides, especially AT, varying from seven to nine. AG were three, and trinucleotides were one in each species. A total of 40 repeats of tetranucleotides, varying from 3 (E. inderiensis, E. hissaricus, E. iae, E. regelii, E. sogdianus, and E. albertii) to 6 (E. luteus and E. suworowii), were identified among the ten Eremurus cp genomes. Our analysis revealed that only E. aitchisonii had no pentanucleotide repeats. Eight pentanucleotide repeats in the other nine Eremurus species varied from one (E. luteus and E. suworowii) to four (E. hissaricus and E. lactiflorus). As well, E. lactiflorus did not exhibit any pentanucleotide repeats, whereas the other species varied from one to four ( Figure 3B,C). Our study examined both common and unique SSRs in ten Eremurus species, which are listed in Tables S3 and S4. Our results showed that the majority of repeat units consisted of A and T, with rare occurrences of C or G, suggesting that the SSRs of different species have a clear preference for certain base types of repeat units. The common SSRs present in all ten species included A, AG, AT, AAT, AAAC, AAAT, and AATG. In addition, we found eight unique SSRs, including ATCC, AAATT, and ATATC in E. lactiflorus; AAACT and AAATTC in E. hissaricus; and AATT and AAAAAT in E. suworowii. In addition, a single AAATTG SSR was detected in E. aitchisonii, while no unique SSRs were identified in E. inderiensis, E. iae, E. regelii, E. soogdianus, E. albertii, and E. luteus.
In this study, we found many repeat regions, including forward, reverse, palindromic, and complementary repeats ( Figure 3A). Among the ten studied Eremurus species, the longest repetitive sequences were detected in the cp genome of E. iae, which In this study, we found many repeat regions, including forward, reverse, palindromic, and complementary repeats ( Figure 3A). Among the ten studied Eremurus species, the longest repetitive sequences were detected in the cp genome of E. iae, which had 102 repetitive sequences with a length of 29 bp or less. In contrast, the smallest repetitive sequences were found in the cp genome of E. lactiflorus, which had 36 scattered repetitive sequences with a length not exceeding 12 bp. The length of the largest forward and reverse repeats was 40 bp and 42 bp, respectively, in the E. iae cp genome, while the largest palindromic repeats were 21 bp in the E. inderiensis and E. soogdianus cp genomes. Equal numbers of complement repeats were detected in all ten species. In addition, the reverse repeat was not found in the cp genome of E. lactiflorus.
The cp genome sequences of 11 Eremurus species were compared using mVISTA software, and their alignments were visualized with annotation data ( Figure 5). After this visualization analysis, differences occurred between the sequences in accD, AtpF, ndhA, ndhB, ycf1, and ycf2 genes from coding regions and mainly in noncoding intergenic regions. The encoded gene classes and the alignments of most coding regions of the ten Eremurus species were highly congruent. and reverse repeats was 40 bp and 42 bp, respectively, in the E. iae cp genome, while the largest palindromic repeats were 21 bp in the E. inderiensis and E. soogdianus cp genomes. Equal numbers of complement repeats were detected in all ten species. In addition, the reverse repeat was not found in the cp genome of E. lactiflorus.

Comparative Genomic Divergence and Hotspot Regions
We calculated nucleotide diversity (Pi) to estimate levels of interspecific sequence divergence across the genome ( Figure 4A,B). The highest variations (Pi > 0.01) were mainly concentrated in the SSC regions, between 125000 bp and 135000 bp. Across protein-coding genes, ycf1 (0.00961), rps15 (0.00626), rpl36 (0.00585), and rps16 (0.00569) had the highest variability (Pi > 0.0055), and the rpl2 gene had the lowest (0.00026). Values of Pi were less than 0.001 in 39.29% of the protein-coding genes and were 0.001-0.002 in 27.05%. Only 37.55% of protein-coding genes had Pi > 0.002 (Table S5). The cp genome sequences of 11 Eremurus species were compared using mVISTA software, and their alignments were visualized with annotation data ( Figure 5). After this visualization analysis, differences occurred between the sequences in accD, AtpF, ndhA, ndhB, ycf1, and ycf2 genes from coding regions and mainly in noncoding intergenic regions. The encoded gene classes and the alignments of most coding regions of the ten Eremurus species were highly congruent.

Phylogenetic Analysis
As previously mentioned, our phylogenetic analysis included cp genome data from eleven species of Eremurus and six outgroups. The phylogenetic trees constructed based

Phylogenetic Analysis
As previously mentioned, our phylogenetic analysis included cp genome data from eleven species of Eremurus and six outgroups. The phylogenetic trees constructed based on 81 protein-coding genes and complete cp genome sequences by ML ( Figure 8A,B), BI, and MP ( Figure S1A,B) methods yielded quite similar topologies. The species of Heme-

Phylogenetic Analysis
As previously mentioned, our phylogenetic analysis included cp genome data from eleven species of Eremurus and six outgroups. The phylogenetic trees constructed based on 81 protein-coding genes and complete cp genome sequences by ML ( Figure 8A,B), BI, and MP ( Figure S1A,B) methods yielded quite similar topologies. The species of Hemerocallis and Xanthorrhea (H. fulva and X. preissii, subfamilies Hemerocallidoideae and Xanthorrhoeoideae, respectively, Asphodelaceae) are sister to the rest. Aloe and Aloidendron (subfamily Asphodeloideae, Asphodelaceae) are sister to Eremurus. Thus, our results confirmed Naderi's studies [13] by revealing the monophyly of the genus/subgenus Eremurus and paraphyly of the subgenus Henningia. The genus is divided into two clades, the first containing only species of sect. Hennigia (E. robustus, E. suworowii, E. luteus, and E. aitchisonii) and the second containing species of all three sections (E. albertii and E. lactiflorus (sect. Henningia), E. inderiensis (sect. Ammolirion), E. soogdianus, E. hissaricus, E. iae, and E. regelii (sect. Eremurus)) (Figures 8 and S1). Subgenus Eremurus is well supported in the protein-coding analysis, with ML, BI, and MP support values of 74%, 0.99, and 78%, respectively, and has even higher support in the complete cp genome analysis, with support values of 96%, 1, and 97%, respectively. E. regelii and E. iae (sect. Eremurus), which are Central Asian endemics, clustered together with weak to moderate support (ML = 53%, BI = 0.68, and MP = 71%) based on protein-coding genes and with high support of 97% (ML), 1 (BI), and 99% (MP) based on the complete cp genome sequences.  (Figures 8 and S1). Subgenus Eremurus is well supported in the protein-coding analysis, with ML, BI, and MP support values of 74%, 0.99, and 78%, respectively, and has even higher support in the complete cp genome analysis, with support values of 96%, 1, and 97%, respectively. E. regelii and E. iae (sect. Eremurus), which are Central Asian endemics, clustered together with weak to moderate support (ML = 53%, BI = 0.68, and MP = 71%) based on protein-coding genes and with high support of 97% (ML), 1 (BI), and 99% (MP) based on the complete cp genome sequences.

Discussion
This study is the first comparative analysis of complete cp genome sequences in Eremurus. The sizes of the ten cp genomes sequenced ranged from 153,782 bp (E. lactiflorus) to 155,482 bp (E. aitchisonii). It is worth noting that many related genera with similar cp genome sizes to Eremurus have been reported in recent years [33][34][35][36][37]. The cp genomes of ten Eremurus species showed high similarity with regard to genome size, gene sequences, gene classes, and the IR region. Their GC content was also similar, which is an important indicator of species affinity, according to Tamura et al. [38].
Introns are recognized as being central to the regulation of gene expression in plants and animals [39][40][41]. In the present study, 16 genes with one intron and two genes (ycf3 and clpP) with two introns were identified in each of the cp genomes of the ten studied Eremurus species. Most of the 18 identified genes have a high similarity in the structure of introns. However, a structural change was detected in the intron of the atpF (764-800),

Discussion
This study is the first comparative analysis of complete cp genome sequences in Eremurus. The sizes of the ten cp genomes sequenced ranged from 153,782 bp (E. lactiflorus) to 155,482 bp (E. aitchisonii). It is worth noting that many related genera with similar cp genome sizes to Eremurus have been reported in recent years [33][34][35][36][37]. The cp genomes of ten Eremurus species showed high similarity with regard to genome size, gene sequences, gene classes, and the IR region. Their GC content was also similar, which is an important indicator of species affinity, according to Tamura et al. [38].
Introns are recognized as being central to the regulation of gene expression in plants and animals [39][40][41]. In the present study, 16 genes with one intron and two genes (ycf3 and clpP) with two introns were identified in each of the cp genomes of the ten studied Chloroplast SSRs are often used as fingerprinting markers in studies of phylogenetic relationships, population genetics, and species identification [42,43]. In Eremurus, 82 (E. inderiensis) to 93 (E. aitchisonii) SSRs were found across species. Several studies found that mononucleotide repeats are dominant among SSRs in the cp genome, where A/T bases account for the majority [44][45][46]. Eremurus shows the same pattern. The SSRs found here may be useful in future analyses of genetic diversity in Eremurus. In particular, the unique hexanucleotide SSRs identified in E. aitchisonii (AAATTG); penta-and hexanucleotide SSRs identified in E. hissaricus (AAACT and AAATTC, respectively); tetra-(AATT) and hexanucleotide (AAAAAT) SSRs identified in E. suworowii; and tetra-(ATCC) and two pentanucleotide (AAATT and ATATC) SSRs identified in E. lactiflorus have potential for future use in species identification and assessment of population genetic diversity. The unique SSRs were absent in the cp genome of E. inderiensis, E. iae, E. regelii, E. soogdianus, E. albertii, and E. luteus. In addition, it is well known that repeat sequences play a significant role in cp genome rearrangement, recombination, gene duplication, deletion, and gene expression [47][48][49][50], as well as being responsible for substitutions and indels [51]. We identified 36 (E. lactiflorus) to 102 (E. iae) repeat sequences among the ten Eremurus cp genomes analyzed, with forward and palindromic repeats being the most common in E. inderiensis (sect. Ammolirion), E. hissaricus, E. regelii, E. soogdianus (sect. Eremurus), E. aitchisonii, E. albertii, E. lactiflorus, E. luteus, and E. suworowii (sect. Henningia) whereas reverse repeats were the most abundant in E. iae (sect. Eremurus). Notably, species of section Henningia had fewer repeat regions (36)(37)(38)(39)(40)(41) compared to section Eremurus (40-102). Additional research focused on repeat sequences in section Henningia is recommended.
Highly variable DNA barcodes are essential for species identification, resource conservation, and phylogenetic analyses [52][53][54]. The cp genome length varied among species, with the E. lactiflorus genome (153,782 bp) being the longest and that of E. aitchisonii (155,482 bp) the shortest. There was significant similarity in the content and order of genes. Noncoding regions exhibited a higher level of sequence variation compared to other regions. The analysis of nucleotide diversity identified that the SSC region has the highest level of variation, whereas the IR region exhibits the lowest degree of variation. The Pi values of the coding regions indicate that ycf1, rps15, rps16, and rpl36 exhibit high levels of variation.
The value of Ka/Ks is an indicator of selective pressure and molecular adaptation. In this study, only three genes (rps19, ycf1, and ndhB) underwent positive selection with Ka/Ks>1. ycf1 exhibited high values of both Pi and Ka/Ks ratios, suggesting that its evolution has been important in Eremurus and can be a potential molecular marker for future studies. Previous studies have reported ycf1 to be highly variable in flowering plants [55,56] and crucial for plant viability [57]. It may prove useful in future barcoding studies in Eremurus.
In the cp genomes of Eremurus, the 85 protein-coding genes encoded 26,494 to 26,663 codons, comparable to that of the genus Iris, in which cp genes encode 26,169-26,353 codons [58]. The codon usage patterns in Eremurus species indicated a notable level of conservation in their cp genomes. Like Iris [58], Amomum [59], Panax [60], Dipterygium, and Cleome [61], and many other species, the conservation of codon usage patterns in Eremurus species' cp genomes was evident. A noteworthy observation was that the RSCU value of a single amino acid exhibited a positive correlation with the number of codons that encoded it. Furthermore, it was observed that 27 commonly utilized codons terminated with A/U, which could potentially be linked to the significant proportion of A/T present in cp genomes [62].
Our phylogenetic analysis based on complete cp genomes and protein-coding genes confirmed previous studies based on morphological cladistic analyses [13,14] as well as trnL-F and nuclear (ITS) data determining the monophyly of Eremurus [17]. We also found that the subgenus Eremurus is monophyletic while the subgenus Henningia is paraphyletic because E. albertii and E. lactiflorus are sister to subgenus Eremurus. Further studies with more species, particularly from sections Eremurus and Ammolirion, are necessary to confirm this outcome, but we note that the monophyly of section Eremurus is supported morphologically by shared characteristics of sections Eremurus and Ammolirion, including campanulate or tubular flowers, inward curved tepals and abaxial 3−5-nerved tepals and filaments longer than parianth. By contrast, subgenus Henningia has subrotate flowers, filaments shorter than perianth, and one-nerved tepals. Further research on morphological variation among species of the genus is needed. Additional studies with more sampling are currently being conducted by the authors to confirm the above phylogenetic results.

Conclusions
Our study is the first research work to investigate the genome characteristics of the genus Eremurus. We sequenced, assembled, and annotated the cp genome of E. inderiensis, E. hissaricus, E. iae, E. regelii, E. soogdianus, E. aitchisonii, E. albertii, E. lactiflorus, E. luteus, and E. suworowii using high-throughput technology. Our study is based on cp genome data from a total of eleven Eremurus species, including one previously published E. robustus. The cp genomes of all ten Eremurus species analyzed contained 131 genes, including 85 protein-coding genes, 8 rRNA genes, and 38 tRNA genes. We identified between 79 and 91 microsatellites and 36 to 102 pairs of repeat sequences among the ten Eremurus species cp genomes. In addition, we identified seven common SSRs and eight unique SSRs in the studied Eremurus species. Furthermore, we detected highly variable regions in the ycf1, rps15, rps16, and rpl36 protein-coding genes. Out of all the genes studied, only rps19, ycf1, and ndhB have a positive Ka/Ks value. Remarkably, the ycf1 gene stands out with both a high Pi and Ka/Ks value. These repeat motifs and highly variable genes could be used for evolutionary studies, phylogenetic relationships, and plant population genetics and species identification. Our phylogenetic reconstructions using the complete cp genome and protein-coding genes confirmed the monophyly of Eremurus. The subgenus Eremurus is monophyletic, whereas the subgenus Henningia is paraphyletic. However, further studies with more species, especially with the sections Eremurus and Ammolirion, are needed to confirm this result and understand the biogeography of the genus.

Supplementary Materials:
The following supporting information can be downloaded at: https:// www.mdpi.com/article/10.3390/f14091709/s1. Figure S1. Phylogenetic trees of 17 species, including 11 Eremurus species using BI and MP analysis based on 81 protein-coding genes (A) and complete cp genomes (B). Table S1. Information on sampling localities and voucher specimens of ten newly sequenced Eremurus species. Table S2. Location and length of intron-containing genes in ten Eremurus species. Table S3. Common simple sequence repeats (SSRs) in the chloroplast genome (cpDNA) of 10 Eremurus species. Table S4. Unique simple sequence repeats (SSRs) in the chloroplast genome (cpDNA) of 10 Eremurus species. Table S5. Individual characteristics of 85 protein-coding genes.