Evolutionary Comparison of the Complete Chloroplast Genomes in Convallaria Species and Phylogenetic Study of Asparagaceae

The genus Convallaria (Asparagaceae) comprises three herbaceous perennial species that are widely distributed in the understory of temperate deciduous forests in the Northern Hemisphere. Although Convallaria species have high medicinal and horticultural values, studies related to the phylogenetic analysis of this genus are few. In the present study, we assembled and reported five complete chloroplast (cp) sequences of three Convallaria species (two of C. keiskei Miq., two of C. majalis L., and one of C. montana Raf.) using Illumina paired-end sequencing data. The cp genomes were highly similar in overall size (161,365–162,972 bp), and all consisted of a pair of inverted repeats (IR) regions (29,140–29,486 bp) separated by a large single-copy (LSC) (85,183–85,521 bp) and a small single-copy (SSC) region (17,877–18,502 bp). Each cp genome contained the same 113 unique genes, including 78 protein-coding genes, 30 transfer RNA genes, and 4 ribosomal RNA genes. Gene content, gene order, AT content and IR/SC boundary structure were nearly identical among all of the Convallaria cp genomes. However, their lengths varied due to contraction/expansion at the IR/LSC borders. Simple sequence repeat (SSR) analyses indicated that the richest SSRs are A/T mononucleotides. Three highly variable regions (petA-psbJ, psbI-trnS and ccsA-ndhD) were identified as valuable molecular markers. Phylogenetic analysis of the family Asparagaceae using 48 cp genome sequences supported the monophyly of Convallaria, which formed a sister clade to the genus Rohdea. Our study provides a robust phylogeny of the Asparagaceae family. The complete cp genome sequences will contribute to further studies in the molecular identification, genetic diversity, and phylogeny of Convallaria.


Introduction
The monocot genus Convallaria L. (Asparagaceae) are perennial herbs commonly found in the understory of temperate deciduous forests in the Northern Hemisphere [1,2]. Three morphologically similar but isolated species with different geographical distribution were recognized in the genus, namely C. keiskei, C. majalis, and C. montana [3,4]. Convallaria keiskei is widely distributed in Sakhalin, Korea, China, Japan and Eastern Siberia [5,6]. Convallaria majalis, commonly known as Lily of the Valley, is native to Europe and now widely ranged in the temperate regions of Eurasia and Eastern North America [7]. C. montana, the American Lily of the Valley, has a limited distribution endemic to the southern Appalachian Mountains in North America [5,6]. This genus is characterized by a 10-15 cm leaf length, two leaves and a raceme of about 10 white flowers on the stem apex, 4-10 mm bracts length, nearly globose seeds and a base chromosome number of 18 [8]. Like many perennial herbaceous plants, the Convallaria species reproduces asexually by rhizome and sexually by seed [9]. They always form extensive colonies by spreading underground stems [10,11]. The Convallaria species is widely cultivated as an ornamental plant for its white bell-shaped flowers and sweet fragrance [7]. However, potential poisonings are a concern in terms of the cardenolide glycosides found throughout these plants, such as convallatoxin and convallatoxol [12]. These components could have medicinal value. For example, steroidal glycosides extracted from C. majalis whole plants had the potential for use as an anti-lung cancer agent [13]. Convallaria keiskei plants was used in the treatment of salivary gland cancer as an efficient therapeutic alternative [14]. Although Convallaria species have high medicinal and horticultural values, studies on the phylogenetic relationships and evolution of these Convallaria species are few.
Asparagales is the largest order of monocots, though in APG III [15], the number of families recognized has fallen from 26 to 14 [16]. In previous research, the phylogeny of Asparagaceae was reconstructed using cp or nuclear loci, with low support value for the position of Convallaria [17,18]. In recent years, chloroplast genomes have been widely used in reconstructing the phylogenetic relationships among plant groups for their uniparental inheritance, lack of recombination, and conservativeness in gene content and gene order [19][20][21]. Many phylogenetic relationships that remained unresolved with few loci have been clarified by using whole cp genome sequences, especially in recently diverged plant groups [22][23][24][25]. Furthermore, the fast-evolving regions of the cp genome can be utilized as DNA barcodes to identify the morphological similar species, as well as molecular markers for systematic and phylogeographic analyses [26,27]. Thus, phylogenetic analyses with cp genomes offer an efficient method to obtain a robust phylogenetic relationship of the Asparagaceae for further evolutionary study.
In this study, we report and annotate five complete cp genomes of three Convallaria species. We aimed to (1) investigate the global structure patterns of Convallaria cp genomes; (2) analyze the repeat sequences and SSRs among the five cp genomes; (3) screen hotspot DNA regions; and (4) reconstruct the phylogenetic relationships within Asparagaceae to locate the phylogenetic position of Convallaria and confirm its monophyly.

Plant Samples, DNA Extraction, and Sequencing
Fresh leaves of five individual plants (two C. keiskei, two C. majalis and one C. montana) were collected in the field and dried with silica gel immediately. Voucher specimens were deposited in the Herbarium of Zhejiang Sci-Tech University. Total genomic DNA was extracted from~10 mg leaf tissue using the CTAB extraction protocol with modification [28]. After purification, the extracted DNA was used to generate short-insert (<800 bp) pairedend sequencing libraries according to the Illumina standard protocol (Illumina, San Diego, CA, USA). The genomic DNA of each individual specimen was indexed by tags and pooled together in one lane of Hiseq 2500 (Illumina) for sequencing at Beijing Genomics Institute (BGI, Shenzhen, China). After sequencing and data treatment, 14,576,340-21,266,964 clean reads with pair-end 125 bp read length were retrieved for the five cp genomes.

Chloroplast Genome Assembly and Annotation
The cp genomes were assembled with both de novo and reference guided approaches [29]. The cp genome sequence of Rohdea chinensis (MH356725; [30]) obtained from NCBI was used as a reference (http://www.ncbi.nlm.nih.gov/) (accessed on 5 February 2020). Firstly, we assembled the clean reads into contigs on the de novo assembler using CLC Genomics Workbench with the following optimized parameters: bubble size of 98, minimum contig length of 200, mismatch cost of 2, deletion and insertion costs of 3, length fraction of 0.9, and similarity fraction of 0.8. Next, all the contigs were aligned to the reference genome using local BLAST (http://blast.ncbi.nlm.nih.gov/) (accessed on 12 February 2020) with ≥90% similarity and query coverage. The contigs of each individual were aligned with R. chinensis genome and ordered in Geneious v11.1.2 (http://www.geneious.com) (accessed on 15 February 2020) to construct the draft chloroplast genomes. Then, the clean reads were mapped to the draft genome sequences again to check the contigs' concatenation. Finally, the complete chloroplast genome sequences were obtained by connecting these overlapping contigs. The yielded cp genomes were annotated through the online program Dual Organellar Genome Annotator (DOGMA; [31]. The annotated organelle genomes of C. keiskei (A4, A118), C. majalis (A63, A69 and C. montana(A114) were deposited in GenBank (accession numbers ON645923, ON303653, ON303655, ON645922 and ON303654). The start and stop codons and the exon-intron boundaries of the encoded genes were accurately confirmed by comparison with homologous genes of R. chinensis using MAFFT v7 [32]. In addition, the tRNAscan-SE v1.21 was used to verify the tRNA genes with default parameters [33]. The graphic maps of the cp genomes of Convallaria were drawn in Organellar Genome DRAW [34].

Comparative Analysis of Convallaria cp Genomes and Hotspot Regions Identification
We used the five complete chloroplast genomes of Convallaria for comparative analyses. MAFFT v7 was used in the alignment of the plastid genome sequences [32]. Boundary regions between the LSC, IR and SSC and their lengths were compared and analyzed using cp genomes. The sequence identity of the five Convallaria cp genomes was plotted using the online software mVISTA (http://genome.lbl.gov/vista/mvista/submit.shtml) (accessed on 22 February 2020) with the Shuffle-LAGAN mode [35]. The cp DNA rearrangement analysis of the five chloroplast genomes were conducted using Mauve Alignment [36]. In order to screen the fast-evolving regions among the cp genomes, the sequence alignment was subjected to a sliding window analysis to evaluate the total number of mutations (Eta) and nucleotide diversity (Pi) for all protein-coding and intergenic spacer regions using the DNA Polymorphism calculation in DNASP v5.1 [37]. Regions that met the following two criteria were used: (1) an aligned length > 200 bp; and (2) mutation site > 0. We also calculated the Eta and Pi among the Asparagaceae species to obtain the hypervariable regions which could be used in future molecular evolutionary and systematic studies of this family.

Long Repeats and Simple Sequence Repeats
The position and size of three repeat sequence types, including direct (forward), inverted (palindromic) and reverse repeats, were identified in the five cp genomes of Convallaria using the online program REPuter [38] (accessed on 10 March 2020). For the above repeat types, we set the constraints in REPuter as the following: (1) repeat size longer than 30 bp; and (2) sequence identity more than 80%, with a hamming distance of 3 (i.e., the gap size between repeats larger than 3 bp). SSRs were detected using MIcroSAtellite (MISA) perl script [39] with a threshold for mono-, di-, tri-, tetra-, penta-, and hexanucleotide SSRs containing 10, 5, 5, 3, 3, and 3 repeat units, respectively.

Phylogenetic Analysis
The 5 cp genomes of Convallaria and other 43 cp genomes of Asparagaceae downloaded from NCBI were recovered to infer their phylogenetic relationships within this family (Table S1). Agapanthus coddii was used as the outgroup. Multiple alignment of coding sequences from 48 cp genomes were performed using MAFFT v7 [32]. The phylogenetic reconstructions were applied using two methods: maximum likelihood (ML) and Bayesian inference (BI). The best-fitting models (GTR + F + R3) of nucleotide substitutions were determined by the Bayesian Information Criterion (BIC) yielded using jModelTest v2.1.7 [40]. The ML analyses were performed in RAxML-HPC v8.2.20 [41] with 5000 bootstrap (BS) replicates. Bayesian inference (BI) analyses were run with MrBayes v.3.2.5 [42]. The Markov chain Monte Carlo (MCMC) algorithm was run for 1,000,000 generations. Trees were sampled every 500 generations. The first 25% trees were discarded as burn-in.
A consensus tree was obtained from the remaining trees, and we estimated the posterior probabilities (PPs) and visualized them in FigTree v1.4.2.5.

Genome Organization and Features
Five complete cp genomes of Convallaria were assembled with no gaps. The full length of five Convallaria chloroplast genomes ranged from 161,365 bp to 162,972 bp ( Figure 1, Table 1). All these chloroplast genomes exhibited the typical quadripartite structure, consisting of a pair of IRs (58,280-58, (Table 1). All cp genomes identically contain 78 protein-coding genes, 30 tRNAs, and four rRNAs. Altogether, the five cp genomes of Convallaria were highly conserved in gene content, gene order, and GC content. Nine of the protein-coding genes (rps16, atpF, rpoC1, petB, petD, rpl16, rpl2, ndhB and ndhA) contained one single intron, whereas three genes (clpP, ycf3, and rps12) contained two introns. The ycf1 gene in IRa was partially duplicated and formed a pseudogene ( Table 2).

Variation at IR/SC Boundaries
Comparison of the 5 cp genomes of Convallaria revealed minor differences at the IR/LSC boundaries (Figure 2). At the IRA/LSC border, the space length from rpl19 to the border varied from 46 bp to 66 bp. At the IRB/LSC border, the space length from psbA to the border was all 82 bp, except in C. montana (220 bp). No variation was observed at the IR/LSC boundaries. All the IRb regions expanded 910 bp into ycf1 and formed a pseudogene Ψycf1 in IRa by duplication. All IRa regions expanded into ndhF, causing a 34 bp overlap with Ψycf1.

Comparative Analysis of Convallaria cp Genomes and Hotspot Regions Identification
We analyzed the whole sequence divergence of the five Convallaria cp genomes using the mVISTA software with C. keiskei (A4) as reference. After alignment, the sequences showed high similarities with only a few regions' sequence identities less than 90%, suggesting that the cp genomes of Convallaria species are conservative (Figure 3). The two IR regions showed a lower level of sequence divergence than LSC and SSC regions. The coding genes and non-coding regions were compared among 5 individuals to detect divergence hotspots. In total, 82 loci (33 coding genes and 49 intergenic spacers) were generated ( Figure 4; Table S2), and the nucleotide diversity (Pi) value for each locus ranged from 0.00019 (rpoC1) to 0.02222 (ccsA-ndhD). Three regions, psaJ (0.0093), petA-psbJ (0.00985) and ccsA-ndhD (0.02222), showed a Pi > 0.009, which were recognized as hotspot regions that could be applied as molecular markers in species identification and phylogenetic analyses. For all chloroplast genome sequences of Asparagareae, we obtained three hypervariable regions, petA-psbJ (0.0929), psbI-trnS (0.08097), and ccsA-ndhD (0.07816), which could be used in future molecular evolutionary and systematic studies of this family (Table S3).
A total of 338 SSRs were identified by MISA analysis across the five Convallaria cp genomes. The number of SSRs ranged from 64 (C. majalis A69) to 74 (C. montana A114) SSRs, with 21 SSRs shared among all genomes (Table S6). Among these SSRs, more than one-half (60.35%) were composed of A/T bases ( Figure 6). In general, the SSRs of these cp genomes showed abundant variation, which can be used in population genetics study of these Convallaria species.  (Table S6). Among these SSRs, more than one-half (60.35%) were composed of A/T bases ( Figure 6). In general, the SSRs of these cp genomes showed abundant variation, which can be used in population genetics study of these Convallaria species.

Phylogenetic Analysis
We used 48 cp genome sequences of Asparagareae and 1 outgroup of Agapanthoideae in total for phylogenetic analysis. The trees reconstructed with the 68 common CDSs shared between the plastomes displayed almost identical topologies with generally high support values, in both ML and BI analyses. The phylogenetic tree was almost fully supported (PP/BS = 1/100) at all nodes (Figure 7). The phylogenetic tree indicated that the subfamilies Scilloideae and Brodiaeoideae constituted the earliest diverging lineage in Asparagaceae. All these phylogenetic trees identically supported the monophyly of Convallaria, which in turn formed a sister clade to the Rohdea. Within Convallaria, C. keiskei was resolved as a monophyletic clade with a sister relationship to C. majalis, and C. montana was at the basal position.

Phylogenetic Analysis
We used 48 cp genome sequences of Asparagareae and 1 outgroup of Agapanthoideae in total for phylogenetic analysis. The trees reconstructed with the 68 common CDSs shared between the plastomes displayed almost identical topologies with generally high support values, in both ML and BI analyses. The phylogenetic tree was almost fully supported (PP/BS = 1/100) at all nodes (Figure 7). The phylogenetic tree indicated that the subfamilies Scilloideae and Brodiaeoideae constituted the earliest diverging lineage in Asparagaceae. All these phylogenetic trees identically supported the monophyly of Convallaria, which in turn formed a sister clade to the Rohdea. Within Convallaria, C. keiskei was resolved as a monophyletic clade with a sister relationship to C. majalis, and C. montana was at the basal position.

Discussion
Comparative analysis results indicate that five cp genome sequences of Convallaria showed highly conserved genomic structures. No variation and rearrangement of the gene content were found between the five cp genomes of Convallaria. All plastomes had the same number of protein-coding genes, tRNAs and rRNAs. Genes of ycf1 and infA were found to be pseudogenes in five individuals. The pseudogenizations of ycf1 and locations of ψycf1 copies commonly found in other plants [43,44]. The pseudogene was firstly thought to have lost the ability of protein coding [45] but was now considered as an

Discussion
Comparative analysis results indicate that five cp genome sequences of Convallaria showed highly conserved genomic structures. No variation and rearrangement of the gene content were found between the five cp genomes of Convallaria. All plastomes had the same number of protein-coding genes, tRNAs and rRNAs. Genes of ycf1 and infA were found to be pseudogenes in five individuals. The pseudogenizations of ycf1 and locations of ψycf1 copies commonly found in other plants [43,44]. The pseudogene was firstly thought to have lost the ability of protein coding [45] but was now considered as an evolutionary relic of the functional gene [46]. Such conservatism revealed accords with the low substitution rate of chloroplast genomes and the presumed recent divergence within genus Convallaria. Similar findings were also found in other closely related species [47,48]. Within cp genomes, the results of comparative analysis indicated that CDS and IRs regions were more conserved than IGS and SCs, correspondingly. The IR regions are highly conserved, which is important for the stabilization of the chloroplast genome structure [49]. Comparison of the IR/SC boundary areas among species suggested expansions and contractions of the IR region. The expansion and contraction of the IR regions often results in the length change of cp genomes [50]. The mechanism of larger IRs expansion may be caused by the double-strand break repair (DSBR) [51]. Because of the conservatism, the IR regions showed a lower level of sequence divergence than LSC and SSC regions in Convallaria cp genomes, in accordance with other studies [52][53][54].
The polymorphic cp DNA non-coding regions have been widely used to investigate species identification and molecular phylogeny at the interspecific level [55,56]. We have detected three variable regions (petA-psbJ, psbI-trnS and ccsA-ndhD) that can be used in species identification and phylogeny. SSR is a repetitive sequence consisting of simple repeating units in tandem and has been widely used as molecular marker in genetic structure and genetic diversity analysis [57,58]. The SSRs detected in present study showed abundant variation, and can therefore be applied in the genetic diversity, population genetics analyses [59,60].
In recent years, chloroplast genomes have been widely used to evaluate the relationship of closely related species in taxonomic studies. For example, cp genomes of 35 species representing 31 genera from Ranunculaceae were sequenced and utilized to clarify the long-standing systematic controversies of this family [22]. Our phylogenetic analyses based on 48 cp genomes successfully resolved intergeneric relationships within Asparagaceae. We obtained a well-resolved and robust phylogenetic tree. Two main clades including seven subfamilies were confirmed: Lomandroideae + (Asparagoideae + Nolinoideae) and Brodiaeoideae + (Scilloideae + (Aphyllanthoideae + Agavoideae)). The first two diverged clades were Lomandroideae and Brodiaeoideae, respectively. This result was congruent with previous research [18,61]. Convallaria was confirmed as a monophyly most closely related to the genus Rohdea. The Convallaria clade consisted of two lineages: one contains the North American species C. montana, and another contains two Eurasian distributed species, C. keiskei and C. majalis. A similar phylogenetic relationship was revealed in other plant taxa that displayed an Eastern Asian-Eastern North American disjunct distribution, such as Croomia, Polygonatum, Maianthemum [62][63][64]. This prevalent phylogeographic pattern in plants was explained as the result of vicariance events after dispersal via the Bering or North Atlantic land bridge during the Late Tertiary [65][66][67][68].

Conclusions
In conclusion, our findings reveal the detailed characteristics of the complete cp genome of three Convallaria species. The gene content, gene order, and gene orientation are highly conserved. Comparative analyses revealed that no rearrangements occurred in Convallaria, and that intergenic regions were more variable than coding regions. Three hypervariable regions (petA-psbJ, psbI-trnS and ccsA-ndhD) were identified as valuable molecular markers. The cp genome data provided strong support for the relationships within Convallaria and among the subfamily clades within Asparagareae, which proved the cp genome to be useful genetic resources in dealing with phylogenetically difficult taxa. The complete cp genome sequences will contribute to further studies in molecular identification, genetic diversity, and phylogeny.
Supplementary Materials: The following supporting information can be downloaded at: https: //www.mdpi.com/article/10.3390/genes13101724/s1; Table S1: Accession numbers of chloroplast genomes used for phylogenetic analyses; Table S2: Nucleotide variability (Pi) values and total number of mutation (Eta) in Convallaria; Table S3: Nucleotide variability (Pi) values and total number of mutation (Eta) in Asparagareae; Table S4: Repeat types and lengths of Convallaria chloroplast genomes; Table S5: Analyses of repeat sequences in five Convallaria chloroplast genomes; Table S6: Simple sequence repeat (SSR) polymorphism in five Convallaria chloroplast genomes.

Conflicts of Interest:
The authors declare no conflict of interest.