Comparative Analysis of the Complete Plastid Genome of Five Bupleurum Species and New Insights into DNA Barcoding and Phylogenetic Relationship.

Bupleurum L. (Apiaceae) is a perennial and herbal genus, most species of which have high medicinal value. However, few studies have been performed using plastome data in this genus, and the phylogenetic relationships have always been controversial. In this study, the plastid genomes of Bupleurum chinense and Bupleurum commelynoideum were sequenced, and their gene content, order, and structure were counted and analyzed. The only three published Bupleurum species (B. boissieuanum, B. falcatum, and B. latissimum) and other fifteen allied species were selected to conduct a series of comparative and phylogenetic analyses. The genomes of B. chinense and B. commelynoideum were 155,869 and 155,629 bp in length, respectively, both of which had a typical quadripartite structure. The genome length, structure, guanine and cytosine (GC) content, and gene distribution were highly similar to the other three Bupleurum species. The five Bupleurum species had nearly the same codon usages, and eight regions (petN-psbM, rbcL-accD, ccsA-ndhD, trnK(UUU)-rps16, rpl32-trnL(UAG)-ccsA, petA-psbJ, ndhF-rpl32, and trnP(UGG)-psaJ-rpl33) were found to possess relatively higher nucleotide diversity, which may be the promising DNA barcodes in Bupleurum. Phylogenetic analysis revealed that all Bupleurum species clustered into a monophyletic clade with high bootstrap support and diverged after the Chamaesium clade. Overall, our study provides new insights into DNA barcoding and phylogenetic relationship between Bupleurum and its related genera, and will facilitate the population genomics, conservation genetics, and phylogenetics of Bupleurum in Apiaceae.


Codon Usage Bias Analysis
In the plastid genomes of the five Bupleurum species (B. chinense, B. commelynoideum, B. boissieuanum, B. falcatum, and B. latissimum), the 20 amino acids were also encoded by 64 codons (Figure 2), among which only methionine (Met) and tryptophan (Trp) were encoded by single codon, while arginine (Arg), leucine (Leu) and serine (Ser) had the maximum codons of six. Most of the amino acids had codon preferences except Met and Trp. The total number of codons in the five Bupleurum species ranged from 21,188 to 21,195 (Table S1 in Supplementary Material). Leucine (Leu) and cysteine (Cys) were the most and least abundant amino acids, respectively. The relative synonymous codon usage (RSCU) values of the same codon were subequal, with a maximum difference of 0.2 for very few. 30 codons preferences were identified, including 24 high preference (RSCU > 1.3), 2 moderate preference (1.2 ≤ RSCU ≤ 1.3) and 4 low preference (1.0 < RSCU < 1.2). These codons were from 18 amino acids and 1 stop codon. Preferences between codons may result from mutation, selection, and random genetic drift, and be affected by translation efficiency, which may be an adaptive factor [42,43]. The ENC, CAI, CBI and FOP values ranged from 49.83 to 49.90, 0.166 to 0.167, −0.102 to −0.100 and 0.353 to 0.354, respectively, which meant there was no obvious preference in the five Bupleurum species' plastid genomes ( Table 3). All the GC3 content values were 0.269%, indicating that these genes preferred the codons ended with A/T, which is a universal phenomenon in the plastid genome of higher plants [44][45][46].    The five Bupleurum species and their allied species in Apiaceae had similar codon usages, of which UAA, AGA, GCU, UCU and ACU had the highest frequency, while AGC, CUG, CUC, CGC and UAC had the lowest frequency ( Figure 3). This indicated that codons preferred to end with A/T, which was consistent with the conclusion of GC3 content above. Notably, the five Bupleurum species exhibited lower usages in the stop codon UAA (light red) and higher usages in the stop codon UGA (light blue). On the contrary, Daucus carota, Pleurospermum camtschaticum and Chamaesium viridiflorum showed higher usages in UAA (deep red) and lower usages in UGA (deep blue). These differences in the use of stop codons suggest that stop codons may not undergo the same strict selection as other codons. In addition, the codon CGU, CUU, and CAC had slightly higher usages, and CGA, AGG, UUC, and AUC had slightly lower usages than other allied species. The synonymous codons are generated by mutations, and the evolutionary pressures cause the use of these synonymous codons to vary in frequency [42,47]. Codon preference is the result of long-term adaption of species to their base composition, tRNA abundance, and environmental selection pressure. Moreover, the preference can affect the initiation, elongation, and accuracy of the translation, the shearing of mRNA, and the folding of proteins [48][49][50]. These preferences will contribute to the plastid gene engineering of Bupleurum species and lay a theoretical foundation for modification and efficient expression of exogenous genes.

Repetitive Sequences Analysis
Repetitive sequences in plastid genomes play an essential role in population genetics and biogeography studies [51,52], and these repeats may result from slipped-strand mispairing and improper recombination [53]. In this study, short dispersed repeats (SDRs) analysis found 22 forward, 0 reverse, 0 complement and 20 palindromic repeats in the plastid genome of B. chinense ( Figure 4A Together with the other three Bupleurum species (B. boissieuanum, B. falcatum, and B. latissimum), they tended to generate more forward and palindromic repeats rather than reverse and complement repeats ( Figure 4C). Comparing to the repeats with length more than 50 bp, the repeats with 30-40 bp in length more widely existed in the plastid genomes ( Figure 4D). Additionally, repeat numbers are also obviously different. All the tendencies were the same in the allied species of the Apiaceae except that Chamaesium viridiflorum had a reversed length distribution in SDRs ( Figure 4D). To figure out whether the abnormal SDRs distribution appears in a single species, a population or the whole genus, more comparative analysis needs to be performed on different levels in the future.

Repetitive Sequences Analysis
Repetitive sequences in plastid genomes play an essential role in population genetics and biogeography studies [51,52], and these repeats may result from slipped-strand mispairing and improper recombination [53]. In this study, short dispersed repeats (SDRs) analysis found 22 forward, 0 reverse, 0 complement and 20 palindromic repeats in the plastid genome of B. chinense ( Figure 4A , they tended to generate more forward and palindromic repeats rather than reverse and complement repeats ( Figure 4C). Comparing to the repeats with length more than 50 bp, the repeats with 30-40 bp in length more widely existed in the plastid genomes ( Figure 4D). Additionally, repeat numbers are also obviously different. All the tendencies were the same in the allied species of the Apiaceae except that Chamaesium viridiflorum had a reversed length distribution in SDRs ( Figure 4D). To figure out whether the abnormal SDRs distribution appears in a single species, a population or the whole genus, more comparative analysis needs to be performed on different levels in the future.  Simple sequence repeats (SSRs) analysis showed 48 mono-, 8 di-, 7 tri-, 4 tetra-, 2 penta-and 1 hexa-nucleotides in B. chinense, and the corresponding values of B. commelynoideum were 41, 9, 7, 5, 3 and 1, respectively ( Figure 5A). Among these sorts of SSRs, the number of mono-nucleotides was largest, and the number of di-, tri-and tetra-nucleotides were far less than the mono-nucleotides. Pentaand hexa-nucleotides had the least number, even some species existed no penta-and hexa-nucleotides. This tendency was also found in allied species ( Figure 5B). Furthermore, the size distribution of SSRs varies in different species. In most species such as Lilium [54], Primula [55], Allium [56] and Quercus [57], the most abundant repeats are the mono-nucleotide repeats, but in Forthysia [58] are di-nucleotide repeats, and in Nitotiana [59] are tri-nucleotide repeats. This indicates that SSR variations will devote to genetic diversity in different species. Previous studies suggested that repeats diversity in plastid genomes are highly related in the rearrangement of the genome, and are generated by slipped-strand mispairing and abnormal recombination [53,60] in the process of DNA replication. Thus, SSRs have been widely used as molecular markers in population genetics and evolutionary studies [61,62]. Simple sequence repeats (SSRs) analysis showed 48 mono-, 8 di-, 7 tri-, 4 tetra-, 2 penta-and 1 hexa-nucleotides in B. chinense, and the corresponding values of B. commelynoideum were 41, 9, 7, 5, 3 and 1, respectively ( Figure 5A). Among these sorts of SSRs, the number of mono-nucleotides was largest, and the number of di-, tri-and tetra-nucleotides were far less than the mono-nucleotides. Penta-and hexa-nucleotides had the least number, even some species existed no penta-and hexanucleotides. This tendency was also found in allied species ( Figure 5B). Furthermore, the size distribution of SSRs varies in different species. In most species such as Lilium [54], Primula [55], Allium [56] and Quercus [57], the most abundant repeats are the mono-nucleotide repeats, but in Forthysia [58] are di-nucleotide repeats, and in Nitotiana [59] are tri-nucleotide repeats. This indicates that SSR variations will devote to genetic diversity in different species. Previous studies suggested that repeats diversity in plastid genomes are highly related in the rearrangement of the genome, and are generated by slipped-strand mispairing and abnormal recombination [53,60] in the process of DNA replication. Thus, SSRs have been widely used as molecular markers in population genetics and evolutionary studies [61,62].

Nucleotide Diversity Analysis
Nucleotide diversity (Pi) of the plastid genomes in the five Bupleurum species was calculated to assess the sequence divergence level in the genus Bupleurum. In the LSC region, Pi values ranged from 0 to 0.02250, with an average of 0.00358, and in the SSC region, they ranged from 0.00067 to 0.01350, with an average value of 0.00505, while the IR region had the least average value of 0.00059, of which the Pi values ranged from 0 to only 0.00467 ( Figure 6A). Low Pi values in the IR region indicated that the IR region existed fewer mutations and was highly conserved at the genus level. Sequences with high Pi values were all spacer regions between genes. Among these spacer regions, petN-psbM, rbcL-accD, ccsA-ndhD, trnK(UUU)-rps16, rpl32-trnL(UAG)-ccsA, petA-psbJ, ndhF-rpl32, and trnP(UGG)-psaJ-rpl33 were the only eight regions which had >0.01000 Pi values. Intergenic regions are under weaker selection pressure and possess a higher evolutionary rate than genes. They are more suitable for the study of the classification and evolution of low taxonomic levels [22,63]. Most of the Bupleurum species have high medicine values, such as B chinense and B. scorzonerifolium [2], so identifying these species becomes important and necessary. Therefore, these eight hotspot sequences may be the promising DNA barcodes for identification, classification and genetic divergence of the Bupleurum taxa, and some of these have been used in other species [64]. Also, Pi values together within the other 13 allied species were calculated, and the same tendency was observed over the Pi variations ( Figure 6B). The Pi values of LSC, SSC and IR region ranged from 0.00897 to 0.10574, 0.02639 to 0.10867 and 0.00037 to 0.05375, respectively, and the corresponding averages were 0.03831, 0.05282 and 0.01058, respectively. The averages of LSC and SSC region were nearly ten times as large as those in the five Bupleurum species. Furthermore, more than half of the higher diversity regions were different from those in the five Bupleurum species, which indicated a great difference in the sequence differentiation among the genus. The three highest diversity regions were ycf1, ndhF-rpl32 and trnE(UUC)-trnT(GGU), and the Pi values were 0.10867, 0.10576 and 0.10574, respectively, which were the only three regions whose Pi values were over 0.10000. These regions are more suitable for the study of genus levels, and have been widely used as molecular markers in previous phylogenetic studies [65][66][67], and easier to align than nrITS. In addition, we chose the ten regions with the highest Pi values for the phylogenetic analysis.

Selective Pressure Analysis
The ω (Ka/Ks) of 80 protein-coding genes were calculated to assess the selection pressure among the five Bupleurum species and allied species in Apiaceae. Genes perform important biological functions, and the mutations will undergo rigorous selection. When the ratio of non-synonymous mutation rate (Ka) to synonymous mutation rate (Ks) is greater than 1, the gene is under positive selection. The Ka/Ks < 1 illustrates purifying selection and Ka/Ks close to 1 illustrates neutral evolution [68,69]. The result showed that there were no genes with Ka/Ks > 1, indicating that no genes were under positive selection (Figure 7). The gene ycf15 had the highest Ka/Ks value of 0.94785, while the gene ycf2 with the Ka/Ks value of 0.84184 ranked second. Both of them had values close to 1, which indicated that there could be as many non-synonymous mutations as synonymous mutations, and they might be in the process of neutral evolution, especially ycf15. However, the gene ycf15 is regarded as a pseudogene that had lost its function (Table 2), which indicates that mutations will not undergo selection. Also, the gene ycf2 is one of the Hypothetical Chloroplast Reading Frames (YCF) ( Table 2), so it is controversial whether ycf2 encodes a protein [70]. This may be the reason why ycf15 and ycf2 underwent a neutral evolution. The remaining genes with Ka/Ks away from 1 were regarded under purifying selection. We found the Ka values of atpH, psbJ, psbF, rpl36 and rpl23 were 0, and thereby the Ka/Ks values were all 0 in Figure 7. Among them, genes atpH, psbJ and psbF are related to photosynthesis, and genes rpl36 and rpl23 are for the synthesis of large ribosome subunits ( Table 2). All these genes are crucial to the plant, and non-synonymous mutations that occurred can affect survival and, thus, being eliminated.

IR Boundary Comparative Analysis
The IR/LSC and IR/SSC junction of the five Bupleurum species and their allied species were compared to assess the expansion and contraction of the IR regions. Among the five Bupleurum species, the genes rpl22 and rpl2 flanked the LSC/IRb junction, and gene rps19 traversed the LSC and the IRb region (JLB line), with 49-84 bp located in the IR region ( Figure 8). Therefore, the portions of the rps19 located in the IRb were also duplicated in the IRa/LSC junction, which was identified as pseudogenes (marked with 'ψ'). The ycf1 gene, which was the second-largest gene of the plastid genome in higher plants [30], traversed the SSC and IRa region, with the same length of 1877 bp in IR region, thus in the junction of the IRb/SSC lying the ψycf1 with 1877 bp, too. On the other side of the IRb/SSC lied the gene ndhF, which was the same length of 26 bp away from the IRb/SSC junction. No obvious expansion or contraction was observed within the five Bupleurum species, but an obvious shift of JLB was observed in other species in Apiaceae, which indicated they were undergoing a contraction in IR regions. It's worth noting that the gene ndhF traversed the IRb and SSC region, which indicated that there might also be a tendency of expansion in gene ndhF. Among Apiaceae species, the size variation of the plastid genome in the process of evolution mainly results from the expansion and contraction of the IR regions [71,72], and double-strand break (DSB) events may be the main reason of expansions [73].

Phylogenetic Analysis
Plastid genomes of plants have been widely used to investigate the phylogenetic and evolutionary relationships among families [27], genera [74], species [75], and even within species [76]. To investigate the phylogenetic position of the B. chinense and B. commelynoideum in Apiaceae, six partitions datasets including the complete plastid genomes, LSC regions, SSC regions, IR regions, single-copy CDS sequences and ten high-variation regions of 18 Apiaceae and 2 Araliaceae species plastid genomes were used to construct the maximum likelihood (ML) tree. All six datasets produced the same topology trees with a slight difference in bootstrap support ( Figure S1). Among them, the trees based on SSC regions, single-copy CDS sequences and ten high-variation regions possessed higher bootstrap support values (>90), which suggested that these three datasets can better show species differentiation. All the phylogenetic trees revealed that B. chinense was most related to B. commelynoideum, and they were gathered within a clade (Figure 9). Also, B. falcatum and B.

Phylogenetic Analysis
Plastid genomes of plants have been widely used to investigate the phylogenetic and evolutionary relationships among families [27], genera [74], species [75], and even within species [76]. To investigate the phylogenetic position of the B. chinense and B. commelynoideum in Apiaceae, six partitions datasets including the complete plastid genomes, LSC regions, SSC regions, IR regions, single-copy CDS sequences and ten high-variation regions of 18 Apiaceae and 2 Araliaceae species plastid genomes were used to construct the maximum likelihood (ML) tree. All six datasets produced the same topology trees with a slight difference in bootstrap support ( Figure S1). Among them, the trees based on SSC regions, single-copy CDS sequences and ten high-variation regions possessed higher bootstrap support values (>90), which suggested that these three datasets can better show species differentiation. All the phylogenetic trees revealed that B. chinense was most related to B. commelynoideum, and they were gathered within a clade (Figure 9). Also, B. falcatum and B. boissieuanum were gathered within a clade. These two clades were sister groups. B. latissimum was the first to speciate and was at the base among the five Bupleurum species in phylogenetic trees. The five Bupleurum species clustered into a monophyletic clade with strong support in all trees. The phylogenetic relationship in the genus Bupleurum was similar to previous studies based on nrITS and plastid DNA markers (trnH-psbA and matK) [14]. In our study, the Bupleurum clade was differentiated after the Chamaesium clade, and Chamaesium clade was at the base representing the basal taxa in Apioideae. However, in previous studies based on nrITS and plastid DNA markers, the Bupleurum clade was earlier differentiated than the Chamaesium and was at the base [10,11,77]. The inconsistent results indicated that the nrITS may be affected by hybridization and incomplete lineage sorting. Plastid DNA, which is maternally inherited, is longer and has less mutation than the nuclear ITS region, so it will provide more phylogenetic information and not be interfered with by paralogous genes in the phylogenetic studies [23,78]. Therefore, plastid DNA can better reflect the evolutionary relationship. This may be the main reason for the discrepancy in the phylogenetic analyses based on nrITS, plastid DNA markers, and plastid genomes. Anyway, the genus Bupleurum is still the relatively basal taxa in the Apiaceae. In addition, the genus Daucus, Semenovia, etc. are the crown groups, which is consistent with their external morphology and fruit characteristics [1,79]. Our study may provide information for population genomics and taxonomy in Bupleurum, and a new insight for phylogenetic reconstruction in Apiaceae.

Plant Materials and DNA Extraction
Mature and healthy leaves of single individuals of B. chinense and B. commelynoideum were collected from Mao county (Sichuan province, China; coordinates: 31 • 41 28.94" N, 103 • 49 32.41" E) and LuHuo county (Sichuan province, China; coordinates: 31 • 38 29 N, 100 • 15 05" E), respectively. All voucher specimens were deposited in the Sichuan University Herbarium (SZ). The fresh leaves above were immediately dried with silica gel for further DNA extraction. The total genomic DNA was extracted from the dried leaves using the modified cetyltrimethyl ammonium bromide (CTAB) method [80].

Genome Sequencing, Assembly and Annotation
The total genomic DNA was sequenced at Novogene (Novogene BioTech, Inc. Beijing, China) by Illumina NovaSeq 6000 Platform (Illumina, San Diego, CA). Libraries with an average length of 350 bp were constructed, and the average length of generated reads was 150 bp. Deep coverage of plastid genomes was obtained from the total genomic DNA via genome skimming sequencing strategy [81]. The clean data were used to assemble the complete plastid genome via NOVOplasty [82] with the complete plastid genome of B. boissieuanum as the reference (GenBank accession No. MF663725). The assembled plastid genomes were annotated via Geneious v9.0.2 [83] with the sequence of B. boissieuanum as the reference, and the annotation result was revised according to the B. latissimum, B. falcatum and allied genus including Angelica, Chamaesium, Chuanminshen, and Pleurospermum manually. Finally, the physical maps of the genome were generated using OGDraw v1.3.1 [84] (http: //ogdraw.mpimp-golm.mpg.de/). The annotated plastid genomes of B. chinense and B. commelynoideum had been submitted to GenBank under the Accession Number MN893666 and MT162552, respectively.

Codon Usage Bias Analysis
Codon usage analysis was conducted via the program codon W [85]. 53 protein-coding genes (CDS) ( Table S2) were filtered from the five Bupleurum species (other three were downloaded from NCBI) after removing the CDS less than 300 bp and the repeat sequences. Five important indices were calculated to assess the extent of the codon usage bias including the codon adaptation index (CAI), codon bias index (CBI), frequency of optimal codons (Fop), GC content of the synonymous third codons positions (GC3s) and the effective number of codons (ENC). The relative synonymous codon usage values (RSCU) [86] of 15 allied species and 2 outgroup species (also downloaded from NCBI) were calculated to assess the difference of the codon usages (Table S3).

Nucleotide Diversity Analysis
The plastid genome sequences were aligned using MAFFT [89] and adjusted manually. Then a slide window analysis was conducted to calculate the nucleotide diversity (Pi) in DnaSP v5 [90]. The parameters were set as follows: (1) windows size of 600 bp; (2) step size of 200 bp. Ten high-variation regions of the allied species were selected to conduct the phylogenetic analysis.

Selective Pressure Analysis
80 single-copy CDS sequences were extracted from the aligned plastid genome sequences after removing the repeat sequences. All the CDS sequences of the 18 species were aligned using MAFFT [89], then all the termination codons were removed. The final alignments were used to conduct the selective pressure analysis using DnaSP v5 [90]. The ratio (ω) of the non-synonymous substitution rate (Ka) to the synonymous substitution rate (Ks) was calculated to measure the selective pressure.

IR Boundary Comparative Analysis
All the plastid genome sequences were aligned using MAFFT [89] and adjusted manually. The plugin repeat finder in Geneious v9.0.2 was used to find the inverted repeats of some species without IR annotation. The genes and pseudogenes (marked with 'ψ') located in and beside the junctions of the boundaries were drawn manually to show the expansion and contraction of the IR region.

Phylogenetic Analysis
Phylogenetic analysis was conducted to investigate the relationship between the five Bupleurum species. Sixteen complete genome sequences of allied species were downloaded from NCBI, including other 3 species in Bupleurum, 3 species in Chamaesium, and 1 species in Anethum, Foeniculum, Petroselinum, Apium, Semenovia, Glehnia, Daucus, Cicuta, Chuanminshen and Pleurospermum. Two Araliaceae species (Eleutherococcus senticosus and Fatsia japonica) were added as outgroups (Table S3). The following six partitions datasets were used to conduct a phylogenetic analysis: (A) the complete plastid genomes; (B) the LSC regions; (C) the SSC regions; (D) the IR regions; (E) the single-copy CDS sequences; (F) ten high-variation region sequences. All the datasets were aligned and trimmed via MAFFT [89], and manually adjusted in MEGA7 if necessary. Maximum likelihood (ML) analysis was performed using RAxML v8.2.4 [91] with GTR + G model and 1000 bootstrap replicates.

Conclusions
In this study, we sequenced the complete plastid genomes of two Bupleurum species, B. chinense and B. commelynoideum, and compared their genome structure with the only three published Bupleurum species (B. boissieuanum, B. falcatum, and B. latissimum). All the five Bupleurum species plastid genomes exhibited a typical quadripartite and circular structure with very similar length, and the gene contents and orders were highly conserved. They also had very similar codon usage preferences but differed in two stop codon UAA and UGA from allied species. Repetitive sequences analysis showed they had the same tendency and certain diversity in numbers and distributions of SDRs and SSRs. The spacer regions petN-psbM, rbcL-accD, ccsA-ndhD, trnK(UUU)-rps16, rpl32-trnL(UAG)-ccsA, petA-psbJ, ndhF-rpl32, and trnP(UGG)-psaJ-rpl33 had relative higher nucleotide diversity in five Bupleurum species, which could be the promising DNA barcodes in Bupleurum taxa, while ycf1, ndhF-rpl32, and trnE(UUC)-trnT(GGU) were the three highest regions with Pi values over 0.10000 compared with allied species, which have been widely used as molecular markers. No genes were under positive selection, but obvious shifts of JLB and traversing ndhF was observed in other species in Apiaceae. Phylogenetic analysis based on plastid genome datasets produced the same topology trees with high support. The five Bupleurum species clustered into a monophyletic clade and were later differentiated from the Chamaesium clade. This study will enrich the plastid genome data and genetic resources of the genus Bupleurum and provide new insights into DNA barcoding of Bupleurum and phylogenetic reconstruction of the family Apiaceae.
Supplementary Materials: The following are available online at http://www.mdpi.com/2223-7747/9/4/543/s1, Table S1: The RSCU values of 64 codons in the five Bupleurum species, Table S2: 53 protein-coding genes (CDS) used in codon usage bias analysis and corresponding alignment length and GC content, Table S3: Complete plastid genomes of sixteen Apiaceae species and two Araliaceae species from GenBank, Figure S1