Comparative Analysis of the Chloroplast Genomes of the Chinese Endemic Genus Urophysa and Their Contribution to Chloroplast Phylogeny and Adaptive Evolution

Urophysa is a Chinese endemic genus comprising two species, Urophysa rockii and Urophysa henryi. In this study, we sequenced the complete chloroplast (cp) genomes of these two species and of their relative Semiquilegia adoxoides. Illumina sequencing technology was used to compare sequences, elucidate the intra- and interspecies variations, and infer the phylogeny relationship with other Ranunculaceae family species. A typical quadripartite structure was detected, with a genome size from 158,473 to 158,512 bp, consisting of a pair of inverted repeats separated by a small single-copy region and a large single-copy region. We analyzed the nucleotide diversity and repeated sequences components and conducted a positive selection analysis by the codon-based substitution on single-copy coding sequence (CDS). Seven regions were found to possess relatively high nucleotide diversity, and numerous variable repeats and simple sequence repeats (SSR) markers were detected. Six single-copy genes (atpA, rpl20, psaA, atpB, ndhI, and rbcL) resulted to have high posterior probabilities of codon sites in the positive selection analysis, which means that the six genes may be under a great selection pressure. The visualization results of the six genes showed that the amino acid properties across each column of all species are variable in different genera. All these regions with high nucleotide diversity, abundant repeats, and under positive selection will provide potential plastid markers for further taxonomic, phylogenetic, and population genetics studies in Urophysa and its relatives. Phylogenetic analyses based on the 79 single-copy genes, the whole complete genome sequences, and all CDS sequences showed same topologies with high support, and U. rockii was closely clustered with U. henryi within the Urophysa genus, with S. adoxoides as their closest relative. Therefore, the complete cp genomes in Urophysa species provide interesting insights and valuable information that can be used to identify related species and reconstruct their phylogeny.


Introduction
The genus Urophysa (Ranunculaceae) is a Chinese endemic genus with only two species, Urophysa rockii Ulbr. and Urophysa henryi (Oliv.) Ulbr. U. rockii is an extremely rare species with fewer than 2000 individuals living in Jiangyou, a Sichuan province of China, and U. henryi is distributed in Guizhou, south Chongqing, north Hunan, and west Hubei [1]. The two species' natural populations are restricted to small and isolated areas separated by high mountains and deep valleys and grow in steep and karstic cliffs with dramatically shrinking and fragmenting natural distributions [2]. In addition, the plants are collected for Chinese traditional medicine for the treatment of contusions and bruises, which contributed to the decline of their populations [3]. Previous studies on the genus Urophysa are scarce and mainly focused on the endangered U. rockii, its growing environment and conservation strategies [4], its biological and ecological characteristics, and its reproductive biology [5,6]. A recent study suggested that the uplift of the Yungui Plateau played an important role in the species divergence of Urophysa [2]. However, the chloroplast DNA (cpDNA) phylogeny showed inconsistency with the nuclear ribosomal DNA (nrDNA). Hence, to gain a better insight into the relationship of these two species and understand their genome structure so as to facilitate their speciation process and the conservation of U. rockii, we assembled and characterized the complete chloroplast genome sequence of U. rockii and U. henryi using the Illumina paired-end sequencing reads.
The angiosperm cp genome is one of the three DNA genomes (the other two are nuclear and mitochondrial genome), is uniparentally inherited, and has a high conserved circular DNA arrangement [7]. It is widely considered an informative and valuable resource for investigating evolutionary biology because of its relatively stable genome structure, gene content, and gene order [8][9][10][11][12][13]. The cp genome of plants always ranges from 115 to 210 kb and has a quadripartite structure that is typically composed of two copies of inverted repeat (IR) regions, which are separated by a large single-copy (LSC) region and a small single-copy (SSC) region [14][15][16]. Because of its compact size, less recombination, and maternal inheritance, the cp genome has been used to generate genetic markers for phylogenetic analysis [17,18], molecular identification [19], and divergence dating [20]. Especially, the low evolutionary rate of the cp genome in taxa that are not very young makes it an ideal system for assessing plant phylogeny [21].
In the present study, we report the complete chloroplast genome sequences of these two Urophysa species and their relative Semiquilegia adoxoides for the first time. Combining previously reported cp genome sequences, we performed phylogenetic analyses according to the whole cp genome and shared single-copy genes. Our findings will contribute to our understanding of the evolutionary history of the genus Urophysa. Additionally, highly variable regions and genes that were detected to be under positive selection could be employed to develop potential markers for phylogenetic analyses or candidates for DNA barcoding in future studies.

Complete Chloroplast Genomes of Three Species
The complete chloroplast genome of U. rockii, U. henryi, and S. adoxoides showed a single circular molecule with a typical quadripartite structure ( Figure 1). The sizes of the U. rockii, U. henry, and S. adoxoides cp genomes were found to be 158,512 bp, 158,303, and 158,340 bp, respectively, which are in the range of most angiosperm plastid genomes [22]. The cp genome consists of a pair of IRs (IRa and IRb, with length 26,473-26,584 bp), separated by a LSC (87,031-87,202 bp) region and one SSC (18,,220 bp) region ( Table 1). The GC content of each species was very similar in the whole cp genome and the same region (LSC, SSC, and IR), but in the IR regions it was clearly higher than in the other regions, possibly because of the high GC content of the rRNA (55.8%) that was located in the IR regions (Table 2). These results are similar to a previously reported high GC percentage in IR regions [23][24][25].

Category for Genes
Group of Genes Name of Genes

Repeat Analysis
Chloroplast repeats are potentially useful genetic resources to investigate population genetics and biogeography of allied taxa [26]. Analyses of various cp genomes revealed that repeat sequences are essential to induce indels and substitutions [27]. Repeat analysis of the U. rockii cp genome revealed 22 palindromic repeats, 23 forward repeats, 5 reverse, and 1 complement repeats. Among them, 16 palindromic, 18 forward, and 5 reverse repeats are 20-40 bp in length. Six palindromic and five forward repeats are 41-60 in length ( Figure 2). Similarly, 23 and 25 palindromic repeats, 21 and 22 forward repeats, 5 and 2 reverse repeats, and 1 complement repeats were detected, and the detailed repeats length distributions are shown in Figure 2. The number and length of the repeats indicate that U. rockii is more similar to U. henryi than to S. aquilegia. Previous studies suggested that the slipped-strand mispairing and improper recombination of repeat sequences can result in sequence variation and genome rearrangement [28][29][30]. These repeats are informative sources for developing genetic markers for phylogenetic and population studies [31]. Simple sequence repeats (SSRs) in the cp genome can be highly variable at the intra-specific level and are therefore often used as genetic markers in population genetic and evolutionary studies [12,[32][33][34]. Because of a high polymorphism rate at the species level, SSRs have been recognized as one of the main sources of molecular markers and have been extensively researched in phylogenetic and biogeographic studies of populations [35][36][37]. In this study, we analyzed the SSRs in the cp genomes. Five categories of perfect SSRs (mono-, di-, tri-, tetra-, and penta-nucleotide repeats) were detected in the cp genome of these three species, with an overall length ranging from 10 to 26 bp ( Figure 3, Table S2). Certain parameters were set, because SSRs of 10 bp or longer are prone to slipped-strand mispairing, which is believed to be the main mutational mechanism for polymorphism [38][39][40]. A total of 169 microsatellites were detected in the U. rockii cp genome on the basis of the SSR analysis. Similarly, 171 and 174 SSRs were detected in U. henryi and S. adoxoides, respectively ( Figure 3A). The most abundant were tri-nucleotide repeats, which accounted for about 33.85% of the total SSRs, and whose number varies from 56 in U. rockii to 60 in S. adoxoides, followed by mono-nucleotide repeats (27.63%), di-nucleotide repeats (26.46%), and tetra-nucleotides repeats (11.28%). Penta-nucleotide repeats were the least abundant (0.78%; Figure 3, Table S2). Most previous studies revealed that the richness of SSR types varies between species. In Quercus species, mono-nucleotide repeats are the most abundant, accounting for about 80% of the total SSRs [34]. In the cp genome of Forthysia, the number of di-nucleotide repeat is the highest [41]. Tri-nucleotide SSRs are most abundant in Nicotiana species, accounting for approximately 43.03% [42]. These results suggest that different repeats may contribute to the genetic variations differently among species. Thus, the SSR information will be important for understanding the genetic diversity status of Urophysa and its relatives.
In U. rockii, more than 96.2% mono-nucleotides are composed of A/T, and a majority of di-nucleotides (84.9%) is composed of A/T ( Figure 3B, Table S2), which is consistent with U. henryi (97.8% mono-nucleotides and 83.0% di-nucleotides) and S. aquilegia (97.9% mono-nucleotides and 85.6% di-nucleotides). Our findings are comparable to previously reported observations that SSRs found in the chloroplast genome are generally composed of poly-thymine (polyT) or poly-adenine (polyA) repeats and infrequently contain tandem cytosine (C) and guanine (G) repeats [43]. Therefore, these SSRs contribute to the AT richness of the three species cp genome, as previously reported for different species [43,44]. SSRs were also detected in CDS regions of the U. rockii cp genome. The CDS regions account for approximately 49% of the total length. About 68.6% of SSRs (68.4% for U. henryi and 67.2% for S. adoxoides) were detected in non-coding regions, whereas only 28.9%of SSRs (29.2% for U. henryi and 30.5% for S. adoxoides) are present in the protein-coding region of U. rockii. Furthermore, about 62.1% of SSRs are present in the LSC region of U. rockii (66.1% for U. henryi and 68.9% for S. adoxoides), and a minority of SSRs exist in IR regions (17.8% in IRa and IRb in total). It was observed that 49 SSRs (28.9%) were located in 19 genes (CDS) regions (atpF, rpoC1, rpoC2, rps14, rps15, rps19, psaB, psaA, rbcL, rpl33, rpl22, ndhB, ndhD, ndhF, ndhH, ccsA, ycf1, ycf2, ycf3) in U. rockii. The detailed SSR location information is listed in Table S2. These results suggest an uneven distribution of SSRs in the U. rockii, U. henryi, and S. adoxoides cp genomes, as was also reported in different angiosperm cp genomes [44]. Moreover, the cp SSRs of the three species presented abundant variation and are useful for detecting genetic polymorphisms at population, intraspecific, and cultivar levels, as well as for comparing more distant phylogenetic relationships among species.

Genomes Sequence Divergence among the Three Species
In order to calculate the sequence divergence level, the nucleotide diversity values in the LSC, SSC, and IR regions of the chloroplast genomes were calculated ( Figure 4, Table S3). In the LSC regions, these values varied from 0 to 0.05496, with a mean of 0.00705, in the IR regions they varied from 0 to 0.01265, with a mean of 0.00363, and only the SSC region had >0.010 average sequence nucleotide diversity, and its values varied from 0 to 0.02369, with a mean of 0.01048. All these results indicated that the differences among these genome regions were small. However, some highly variable loci, including trnK-UUU, trnG-UCC, trnD-GUC, atpF, rps4, trnL-UAA, accD, cemA, rpl36, rpl22, rps19, ndhF, trnL-UAG, ccsA, ndhA, and ycf3 were more precisely located ( Figure 4, Table S3). All these regions displayed higher nucleotide diversity values than other regions (value > 0.015). Twelve of these loci were found to be located in the LSC region, and four in the SSC region, but the nucleotide diversity in the IR regions appeared small, less than 0.015. Among these loci, atpF, accD, ndhF, rpl22, ccsA, and ycf3 have been detected as highly variable regions in different plants [19,23,45,46]. On the basis of these results, we believe that accD, rps4, ccsA, rpl36, and ndhF, which have comparatively high sequence deviation, are good sources for interspecies phylogenetic analysis, as shown in previous studies [42,44].
Expansion and contraction at the borders of IR regions is the main reason for size variations in the cp genome and plays a vital role in its evolution [39,47,48]. The IR/LSC and IR/SSC junction regions were compared to identify IR expansion or contraction. The rps19, ndhF, ycf1, and psbA genes were located in the junctions of the LSC/IRa, IRa/SSC, SSC/IRb, and IRb/LSC regions, respectively ( Compared to species of other genera, the IRb/SSC and SSC/IRa regions of Urophysa showed an expansion in ycf1, but a contraction in rps19 ( Figure 5). The expansion and contraction detected in the IR regions may act as a primary mechanism in creating the length variation of the cp genomes in U. rockii, U. henryi, and S. adoxoides, as previous studies suggested [32,34,42,49].

Phylogenetic Analysis
To study the phylogenetic position of U. rockii and U. henryi within the Ranunculaceae family, we used 79 single-copy genes shared by the cp genomes of 12 Ranunculaceae members, representing seven genera ( Figure 6). For Bayesian inference (BI) and maximum parsimony (MP), the posterior probabilities and bootstrap values were very high for each lineage, with all values ≥98%. Both the maximum likelihood (ML), BI, and MP phylogenetic results strongly supported that U. rockii is closely clustered with U. henryi within the genus Urophysa, with S. adoxoides as their closest relative with 100% bootstrap value (Figure 6), which is consistent with the results of previous molecular studies [50][51][52]. Furthermore, the species in each genus formed a single clade. The first clade is formed by species of the genera Urophysa, Semiaquilegia, and Trollius, the second clade was divided into two clades: one clade includes the Ranunculus and Clematis species, and the other clade consists of just the Aconitum species. Additionally, the topological structures from the whole complete chloroplast genome sequences and the CDS sequences are similar to that from single-copy genes ( Figure S1), and all lineages possess high bootstrap values. These results suggest that there is no conflict among the entire genome data set, CDS sequences, and 79 shared single-copy genes of these cp genomes. Furthermore, these results are in accord with previous phylogeny research [53]. All these phylogenetic analyses are substantially increasing our understanding of the evolutionary relationship among species in Ranunculaceae.

Positive Selected Analysis
Of 57 single-copy CDS genes initially considered for the positive selection analysis (Table S4), 47 were eventually selected (Table 4). No significant positive selection was detected for all genes (p-value > 0.05), but six genes that possess high posterior probabilities for codon sites were found in the Bayesian Empirical Bayes (BEB) test (atpA, rpl20, psaA, atpB, ndhI, and rbcL) (Figure 7, Figure S2 and Table 4). Previous studies suggested that codon sites with a high posterior probability should be regarded as positively selected sites [54], which means that these six genes may be under positive selection pressure [55]. After Jalview visualization, the results of the amino acid properties across each column of all species revealed that many amino acids vary between different genera, such as the 88th amino acid (G in U. rockii and U. henryi, R in other species) of the rpl20 gene ( Figure 7A) and other amino acids (marked with red blocks in Figure 7A). In the ndhI gene, two amino acids (the A in 168th and the P in 174th) were specific for U. rockii and U. henryi, and three amino acids (the 9th, 148th, and 165th, marked with red blocks in Figure 7B) were only possessed by U. rockii, U. henryi, and S. adoxoides. The amino acid properties of the other four genes (atpA, atpB, rbcL, and psaA) are shown in Figure S2. As we know, most amino acids may be under strong structural and functional constraints and not free to change [55]. We detected six genes with high posterior probability in codon site and many different amino acids among species, which may play an important role in Urophysa species evolution and environment adaptation. Populations of U. rockii and U. henryi are distributed only in karst regions of southern China, and the karst environments are characterized by low soil water content, insufficient light, and poor nutrient availability, which might have exerted strong selective forces on plant evolution [56].    However, five of the abovementioned six genes are involved in photosynthesis (atpA, psaA, atpB, ndhI, and rbcL) ( Table 3). The gene rpl20 is involved in translation, which is an important part of protein synthesis [57]. The genes atpA and atpB participate in ATP synthesis, which is the main source of energy for the functioning of living cells and all multicellular organisms [58]. Additionally, rbcL is the gene for the Rubisco large subunit protein, which is an important component of photosynthetic electron transport [59,60]. Most previous research has revealed that positive selection of the rbcL gene in land plants may be a common phenomenon [61]. All these genes might play important roles when founder effects occur in populations; both changes in selection pressures and genetic drift result in the rapid shift of these genes to a new, coadapted combination. Therefore, all these genes under positive selection give an indication of why U. rockii and U. henryi could adapt to the harsh environment of karst (characterized by low soil water content, periodic water deficiency, and poor nutrient availability). Moreover, the results of the gene effectiveness test (rbcL and rpl20) ( Figure S3) suggested that these genes can distinguish the species of Urophysa and its relatives and can be used for future phylogenetic analyses. The six genes will not only provide insights into chloroplast genome evolution of species of Urophysa, but also offer valuable genetic markers for population phylogenomic studies of Urophysa and its close lineages.

Chloroplast Genome Sequencing and Assembling
All cp genomes were sequenced using an Illumina Hiseq 2500 platform by Biomarker Technologies, Inc. (Beijing, China) In order to eliminate the interference from mitochondrial or nuclear DNAs, all the cp genome reads were extracted by mapping all raw reads to the reference cp genome of Trollius chinensis (KX752098) with Burrows Wheeler Alignment (BWA) [63]. High-quality reads were obtained using the CLC Genomics Workbench v7.5 (CLC Bio, Aarhus, Denmark) with the default parameters set. A few gaps in the assembled cp genomes were corrected by Sanger sequencing. The primers were designed using Lasergene 7.1 (DNASTAR, Madison, WI, USA). Primer synthesis and the sequencing of the polymerase chain reaction products were conducted by Sangon Biotech (Shanghai, China). The primers and amplifications are shown in Supplementary Table S5.

Genome Annotation and Analysis
The complete cp genomes were annotated using the online program DOGMA [64]. The annotation results were checked manually, and the codon positions were adjusted by comparing to a previously homologous gene from various chloroplast genomes present in the database using Geneious R11 (Biomatters, Ltd., Auckland, New Zealand). Furthermore, the OGDRAW1 program [65] was used to draw the circular plastid genome maps. GC content and codon usage were analyzed by the MEGA 6 software [66]. The complete cp genomes of U. rockii, U. henryi, and S. adoxoides are deposited in the GenBank under the accession numbers MH006686, MH142266, and MH142265, respectively.

Phylogenetic Analysis
Phylogenetic analysis was conducted using the single-copy genes of the three taxa, together with nine species downloaded from the NCBI GenBank (Tables S6 and S7). The sequences were aligned using MAFFT v5 [69] in GENEIOUS R11 (Biomatters, Ltd.) with the default parameters set and were manually adjusted in MEGA 6.0 [66]. Maximum parsimony (MP) analyses were conducted using PAUP [70]. All characters were equally weighted, gaps were treated as missing, and character states were treated as unordered. Heuristic search was performed with MULPARS option, tree bisection-reconnection (TBR) branch swapping, and random stepwise addition with 1000 replications. The maximum likelihood (ML) analyses were performed using RAxML 8.0 [71]. For ML analyses, the best-fit model, general time reversible (GTR) + G was used with 1000 bootstrap replicates. Bayesian inference (BI) was performed with Mrbayes v3.2 [72]. The Markov chain Monte Carlo (MCMC) analysis was run for 1 × 10 8 generations. The trees were sampled at every 1000 generations with the first 20% discarded as burn-in. The remaining trees were used to build a 50% majority-rule consensus tree. The stationarity was considered to be reached when the average standard deviation of split frequencies remained below 0.001. Additionally, in order to test the utility of different cp regions, phylogenetic analyses were performed for the complete chloroplast genome sequences and the CDS sequences, respectively.

Chloroplast Genome Nucleotide Diversity and Positive Selected Analysis
The cp genome sequences were aligned using MAFFT v5 [69] and adjusted manually. Furthermore, a sliding window analysis was conducted for nucleotide diversity in LSC, SSC, and IR regions of the cp genomes using the DnaSP version 5.1 [73]. In addition, to identify the genes under positive selection in U. rockii and U. henryi, endemic to special karst environment, an optimized branch-site model [74] combined with Bayesian Empirical Bayes (BEB) methods [55] were used by comparison with their relatives. We firstly extracted all CDS sequences from U. rockii, U. henryi, S. adoxoides, and nine closely related species downloaded from GenBank (Table S6). The single-copy CDS sequences between these twelve species were obtained (see the Table S4). Each single-copy CDS sequence of these twelve species was aligned according to their amino acid sequence alignment generated by MUSCLE [75], and the "number of gaps" in the alignments was further checked. Then, the alignments of the corresponding DNA codon sequences were further trimmed by TRIMAL [76], and the bona fide alignments were used to support the subsequent positive selection analysis. The optimized branch-site model in the CODEML program implemented in the PAML 4 package [77] was used to assess potential positive selection affecting individual codons along a specifically designated lineage, which was set as U. rockii and U. henryi. Selective pressure is measured by the ratio (ω) of the nonsynonymous substitution rate (dN) to the synonymous substitutions rate (dS). A ratio ω > 1 indicates positive selection, ω = 1 implies neutral selection, and ω < 1 suggests negative selection [78]. Log-likelihood values were calculated in an alternative branch-site model (Model = 2; NSsites = 2; and Fix = 0) that allowed ω to vary among different codons along particular lineages and a neutral branch-site model (Model = 2; NSsites = 2; Fix = 1; Fix ω = 1) that confined the codon sites under neutral selection (ω = 1) on the basis of the likelihood ratio tests (LRT). The right-tailed chi-square test was performed to calculate the p values based on the difference in log-likelihood values between the alternative model and the neutral model with one degree of freedom to assess the model fit. Then, the p values were further adjusted according to multiple statistical tests [79]. A gene with an adjusted p value smaller than 0.05 and with positively selected sites was considered a positively selected gene (PSG). Moreover, in order to identify specific amino acid sites that are potentially under positive selection, a BEB method was implemented to calculate the posterior probabilities for sites classes. Codon sites with a high posterior probability were regarded as positively selected sites [54]. Jalview [80] was used to view the amino acid sequences of positively selected genes. In the end, in order to test the effectiveness of genes under positive selection, we randomly chose two genes to conduct the phylogenetic analyses.