Insights into Comparative Genomics, Codon Usage Bias, and Phylogenetic Relationship of Species from Biebersteiniaceae and Nitrariaceae Based on Complete Chloroplast Genomes

Biebersteiniaceae and Nitrariaceae, two small families, were classified in Sapindales recently. Taxonomic and phylogenetic relationships within Sapindales are still poorly resolved and controversial. In current study, we compared the chloroplast genomes of five species (Biebersteinia heterostemon, Peganum harmala, Nitraria roborowskii, Nitraria sibirica, and Nitraria tangutorum) from Biebersteiniaceae and Nitrariaceae. High similarity was detected in the gene order, content and orientation of the five chloroplast genomes; 13 highly variable regions were identified among the five species. An accelerated substitution rate was found in the protein-coding genes, especially clpP. The effective number of codons (ENC), parity rule 2 (PR2), and neutrality plots together revealed that the codon usage bias is affected by mutation and selection. The phylogenetic analysis strongly supported (Nitrariaceae (Biebersteiniaceae + The Rest)) relationships in Sapindales. Our findings can provide useful information for analyzing phylogeny and molecular evolution within Biebersteiniaceae and Nitrariaceae.


Introduction
Biebersteiniaceae and Nitrariaceae are two small families. Biebersteiniaceae consists of only one genus Biebersteinia, which comprises 4-5 species, and is mainly distributed in the eastern Mediterranean, Central, and Western Asia, northwest of China, and the Himalayas [1,2]. Some species of Biebersteinia was used as traditional folk medicines in Iran and China with the activities of anti-inflammatory, analgesic, antibacterial, antioxidant, antihypertensive, and hypoglycemic effects [3]. Nitrariaceae consists of four genera, namely Nitraria, Tetradiclis, Malacocarpus, and Peganum, which comprise about 20 species, and are native to arid and semi-arid regions of Mexico, North Africa, Europe, Australia, and Asia [1,4]. The plants of Nitraria genus are highly adaptable to high salinity, arid, or semi-arid environments. They can effectively alleviate the salinization degree of soil, improve the utilization rate of saline-alkali land and prevent soil desertification [5].
The placement of the five genera (Biebersteinia, Nitraria, Tetradiclis, Malacocarpus, and Peganum) have long been controversial [6]. Biebersteinia was traditionally placed in Geraniaceae or near Geraniaceae as a distinct family [7]. However, molecular evidence from recent studies supported a strong position for Biebersteiniaceae within Sapindales [7,8]. In earlier classifications, the position of Nitraria, Tetradiclis,

Genome Comparison
The analysis revealed high sequence similarity among the five species, suggesting they are conserved in size and structure. The rpl22 gene was positioned at the LSC/IRb junction, with 44-75 bp in the IRb region ( Figure S1). The ndhF gene was located at the IRb/SSC junction, with 0-18 bp in the IRb region. The SSC/IRa junction was crossed by ycf1, with 1298-1441 bp in the IRa region in all the five chloroplast genomes. The rps19-trnH genes were located at the IRa/LSC junction, with 17-62 bp separating the spacer from the end of the IRa region. Overall, IR region contraction/expansion events were detected across the five chloroplast genomes.
To elucidate the genome divergence, we performed multiple sequence alignment of the five chloroplast genomes using the mVISTA program with N. tangutorum as the reference (Figure 1). The comparison demonstrated highly conserved coding regions compared with the non-coding regions among the five species. The IR regions were less divergent than the LSC and SSC regions. The highest divergence was found among the intergenic spacers, including trnH-psbA, matK-rps16, rps16-psbK, psbI-atpA, rpoB-psbD, petN-psbM, psaA-rps4, rps4-ndhJ, ndhC-atpE, ycf4-cemA, and psbE-psaJ in LSC and ndhF-ccsA and ndhG-ndhI in SSC. In addition, a greater sequence divergence was found in the coding regions of matK, rpoA, rps19, ndhF, ccsA, ndhD, and ycf1.
Furthermore, the nucleotide variability (Pi) was calculated to determine the sequence divergence levels among the five chloroplast genomes. Analysis revealed that the IR regions are more conserved than the LSC and SSC regions ( Figure S2). We detected seven highly variable regions (Pi > 0.10), namely trnH-psbA, matK-rps16, psbK-psbI, trnE-trnT, trnF-ndhJ, ndhD-ndhG, and rrn23-trnA, which can be used as potential markers for further genetic studies. roborowskii as a reference. The x-axis represents the coordinates in the chloroplast genome. The y-axis indicates the average percent identity of sequence similarity in the aligned regions, ranging between 50% and 100%. Genome regions are color coded as protein coding, rRNA coding, tRNA coding, or conserved noncoding sequences (CNS). Furthermore, the nucleotide variability (Pi) was calculated to determine the sequence divergence levels among the five chloroplast genomes. Analysis revealed that the IR regions are more conserved than the LSC and SSC regions ( Figure S2). We detected seven highly variable regions (Pi > 0.10), namely trnH-psbA, matK-rps16, psbK-psbI, trnE-trnT, trnF-ndhJ, ndhD-ndhG, and rrn23-trnA, which can be used as potential markers for further genetic studies.

Variation in Nucleotide Substitution Rates
Synonymous and nonsynonymous nucleotide substitution patterns are important markers used in the study of gene evolution. In most genes, except rapidly evolving genes, the frequency of nonsynonymous nucleotide substitution (dN) is lower than that of synonymous substitution (dS) due to purifying selection. Generally, dN/dS < 1 (especially less than 0.5) indicates purifying selection; dN/dS > 1 indicates possible forward selection; and dN/dS value approaching 1 indicates neutral evolution [29]. The average dN/dS value of the five species' 74 protein-coding genes was 0.2566 ( Figure 2). The genes atp, pet, and psb, with an average dN/dS value between 0 and 0.1. The model averaging method in dN/dS calculator showed an average dN/dS > 1 for clpP. Figure 1. Comparison of five chloroplast genomes using the mVISTA alignment program with N. roborowskii as a reference. The x-axis represents the coordinates in the chloroplast genome. The y-axis indicates the average percent identity of sequence similarity in the aligned regions, ranging between 50% and 100%. Genome regions are color coded as protein coding, rRNA coding, tRNA coding, or conserved noncoding sequences (CNS).

Variation in Nucleotide Substitution Rates
Synonymous and nonsynonymous nucleotide substitution patterns are important markers used in the study of gene evolution. In most genes, except rapidly evolving genes, the frequency of nonsynonymous nucleotide substitution (d N ) is lower than that of synonymous substitution (d S ) due to purifying selection. Generally, d N /d S < 1 (especially less than 0.5) indicates purifying selection; d N /d S > 1 indicates possible forward selection; and d N /d S value approaching 1 indicates neutral evolution [29]. The average d N /d S value of the five species' 74 protein-coding genes was 0.2566 ( Figure 2). The genes atp, pet, and psb, with an average d N /d S value between 0 and 0.1. The model averaging method in d N /d S calculator showed an average d N /d S > 1 for clpP.  harmala and the three Nitraria species. Among the Nitraria species, N. roborowskii had higher ENC values than N. sibirica and N. tangutorum. Further, to identify the forces that determine the five genomes' overall codon usage, the ENC-GC3S plot was drawn ( Figure 3). Most genes lay close to Wright's curve [30]. A few genes, including ccsA, with lower ENC values, lay below the curve.

Codon Usage Bias
The total GC content (GC, 0.386-0.391), GC content at first codon position (GC 1 , 0.475-0.477), GC content at second codon position (GC 2 , 0.395-0.397), and GC content at third codon position (GC 3 , 0.287-0.300) less than 0.5 ( Table 2) suggest that the five chloroplast genomes tend to use A/T bases and A/T-ending codons. The codon adaptation index (CAI) values between 0.167 and 0.170 indicate a slight bias in codon usage in the five species, especially B. heterostemon. The values of effective number of codons (EN C ) ranged from 48.71 to 49.74. B. heterostemon had higher EN C values than P. harmala and the three Nitraria species. Among the Nitraria species, N. roborowskii had higher EN C values than N. sibirica and N. tangutorum. Further, to identify the forces that determine the five genomes' overall codon usage, the EN C -GC 3S plot was drawn ( Figure 3). Most genes lay close to Wright's curve [30]. A few genes, including ccsA, with lower EN C values, lay below the curve. Table 2. Codon usage of the five species. GC: the total GC content; GC 1 : the GC content at first codon position; GC 2 : the GC content at second codon position; GC 3 : the GC content at third codon position; CAI: the codon adaptation index; T 3S : the thymine content at synonymous third codon position; C 3S : the cytosine content at synonymous third codon position; A 3S : the adenine content at synonymous third codon position; G 3S : the guanine content at synonymous third codon position; GC 3S : the GC content at synonymous third codon position; EN C : the effective number of codons.

Species
GC  Slight differences were found in the relative synonymous codon usage (RSCU) among the five species ( Figure 4). RSCU > 1 was found for 30 identical codons, of which 29 were A/T-ending codons (except for UUG) and RSCU < 1 for 32 identical codons, of which 29 were C/G-ending codons (except AUA, CUA and UGA). The highest RSCU value was recorded for UUA and the lowest for CUC, both encode leucine. To conclude, UUA was positively biased, while CUC was negatively biased.  Slight differences were found in the relative synonymous codon usage (RSCU) among the five species ( Figure 4). RSCU > 1 was found for 30 identical codons, of which 29 were A/T-ending codons (except for UUG) and RSCU < 1 for 32 identical codons, of which 29 were C/G-ending codons (except AUA, CUA and UGA). The highest RSCU value was recorded for UUA and the lowest for CUC, both encode leucine. To conclude, UUA was positively biased, while CUC was negatively biased.  Slight differences were found in the relative synonymous codon usage (RSCU) among the five species ( Figure 4). RSCU > 1 was found for 30 identical codons, of which 29 were A/T-ending codons (except for UUG) and RSCU < 1 for 32 identical codons, of which 29 were C/G-ending codons (except AUA, CUA and UGA). The highest RSCU value was recorded for UUA and the lowest for CUC, both encode leucine. To conclude, UUA was positively biased, while CUC was negatively biased.  A parity rule 2 (PR2) plot was generated based on AT-bias (A 3 / (A 3 + T 3 )) and GC-bias (G 3 / (G 3 + C 3 )) ( Figure 5). The AT-biases of B. heterostemon, P. harmala, N. roborowskii, N. sibirica, and N. tangutorum were 0.475, 0.475, 0.476, 0.476, and 0.475, respectively, while the GC-biases were 0.501, 0.513, 0.511, 0.511, and 0.511, respectively. Thus, we detected T/G bias at the third codon position of chloroplast genes in the five species. However, different genes showed different preferences. The analysis revealed T/G bias in accD, matK, ndh, pet, and rpo; A/C bias in ccsA; T/C bias in cemA, psa, psb, and rbcL; and A/G bias in rpl. B. heterostemon showed different codon usage bias at the third codon position in atp, clpP, and rps (p < 0.05). The atp genes of B. heterostemon and P. harmala were T/C-biased, whereas those in N. roborowskii, N. sibirica, and N. tangutorum were T/G-biased. The clpP gene in B. heterostemon was A/G biased, while the genes in P. harmala, N. roborowskii, N. sibirica, and N. tangutorum were A/C biased. The rps in B. heterostemon was A/C biased, while the genes in P. harmala, N. roborowskii, N. sibirica and N. tangutorum were A/G bias.
The difference in GC 3 was more evident than that in GC 1 and GC 2 in the five genomes, reflecting the neutral mutation bias leading to different codon choice. The neutrality plots (GC 12 -GC 3 ) were used to analyze the correlation between the three codon positions ( Figure 6).  The difference in GC3 was more evident than that in GC1 and GC2 in the five genomes, reflecting the neutral mutation bias leading to different codon choice. The neutrality plots (GC12-GC3) were used to analyze the correlation between the three codon positions ( Figure 6

Phylogenetic Analysis
The maximum likelihood (ML) and Bayesian inference (BI) analysis of the whole plastome and 67 protein-coding genes dataset resulted in congruent topologies and only the nodes supporting was different (Figure 7 and Figure S3). We used the 67 protein-coding genes topology for the downstream analysis. Most clades were strongly supported by high bootstrap values (BS) and posterior probabilities (PP). Five well-supported clades were recovered within Sapindales in the phylogeny tree. Topology of (Nitrariaceae (Biebersteiniaceae + The Rest)) was supported in confidence. Biebersteiniaceae and Nitrariaceae was at the basal in Sapindales with a high support (BS = 97 and PP = 100). Anacardiaceae and Burseraceae together formed a strongly supported clade (BS = 100 and PP = 100). Meanwhile, the position of Sapindaceae was moderately supported (BS = 77 and PP = 100). Meliaceae, Simaroubaceae, and Rutaceae formed a clade with the highest support (BS = 100 and PP = 100). A sister relationship was observed between Rutaceae and Simaroubaceae, which was strongly supported (BS = 100 and PP = 100).

Phylogenetic Analysis
The maximum likelihood (ML) and Bayesian inference (BI) analysis of the whole plastome and 67 protein-coding genes dataset resulted in congruent topologies and only the nodes supporting was different (Figures 7 and S3). We used the 67 protein-coding genes topology for the downstream analysis. Most clades were strongly supported by high bootstrap values (BS) and posterior probabilities (PP). Five well-supported clades were recovered within Sapindales in the phylogeny tree. Topology of (Nitrariaceae (Biebersteiniaceae + The Rest)) was supported in confidence. Biebersteiniaceae and Nitrariaceae was at the basal in Sapindales with a high support (BS = 97 and PP = 100). Anacardiaceae and Burseraceae together formed a strongly supported clade (BS = 100 and PP = 100). Meanwhile, the position of Sapindaceae was moderately supported (BS = 77 and PP = 100). Meliaceae, Simaroubaceae, and Rutaceae formed a clade with the highest support (BS = 100 and PP = 100). A sister relationship was observed between Rutaceae and Simaroubaceae, which was strongly supported (BS = 100 and PP = 100).

Comparison of the Chloroplast Genomes
Highly-variable regions in the chloroplast genome can be used for the phylogenetic analysis and

Comparison of the Chloroplast Genomes
Highly-variable regions in the chloroplast genome can be used for the phylogenetic analysis and species discrimination. Studies have shown that genetic polymorphism in the single-copy regions (SSC and LSC) are more variable than the IR regions [31][32][33]. Similarly, in this study, the IR regions are more conserved than the LSC and SSC regions with a lower Pi value. Several researchers have conducted taxonomic studies; however, the taxonomic position and the phylogenetic relationship among the species of Biebersteiniaceae and Nitrariaceae remain poorly resolved [8,10,13,14]. Based on mVISTA and sliding window analysis, we identified the divergent regions in the chloroplast genomes that could be used as useful molecular markers. The variable regions trnH-psbA, matK-rps16, rps16-psbK, psbI-atpA, rpoB-psbD, petN-psbM, psaA-rps4, rps4-ndhJ, ndhC-atpE, ycf4-cemA, psbE-psaJ, ndhF-ccsA, and ndhG-ndhI would be ideal molecular markers to reconstruct the phylogenetic relationship in the Biebersteiniaceae and Nitrariaceae.

Variation in Nucleotide Substitution Rates
In photosynthetic angiosperms, the gene order, content, and rate of sequence evolution of protein-coding genes of the chloroplast genomes are generally conserved [34]. However, our analysis revealed variations in the substitution rate among the five species. Several protein-coding genes, including accD, clpP, matK, and rpl, showed accelerated d N , consistent with the previous reports [35][36][37][38]. Growing evidence suggests that the acceleration in gene-specific substitution rates may be due to local hypermutation or mutagenic retroprocessing [38]. The most extreme case was clpP, which had significantly higher d N (d N = 0.0973, d N /d S = 1.0611) in the present study. In angiosperm cells, clpP gene, comprise two introns and encode the ATP-dependent Clp protease proteolytic subunit. High substitution rates of the clpP gene was found in several species and mainly accompanied gene structural changes, including lack of the first intron or both the introns [38]. The amino acid sequence alignment of clpP gene ( Figure S4) showed that B. heterostemon contains three large insertions that interrupted the conserved domains, which may induced the gene structural changes and a high d N /d S value.

Codon Usage Bias
Codon usage bias has been correlated with different factors, including gene expression level, GC content, amino acid conservation, and transcriptional selection [24]. Our study reveals selection and mutation as the probable mechanisms for codon bias. Selection theory explains that the codon bias contributes to the efficiency and/or accuracy of protein expression, and therefore undergoes positive selection. Meanwhile, the mutational explanation posits that codon bias exists due to the non-randomness in mutational patterns [25,39]. Although the mechanism behind codon bias selection remains controversial, a strong correlation is identified between the GC content and codon usage patterns in this study [24]. Furthermore, we found slight differences in the optimal codon usage patterns between the Biebersteiniaceae and Nitrariaceae species. The GC content of the protein-coding region (38.6-39.1%) is consistent with the whole genome (37.1-37.9%). The EN C value reflects the codon bias level; EN C value of 20 indicates maximum codon bias; a value of 61 indicates unbiased codon usage. A gene with an EN C value of 35 or less is thought to possess strong codon bias. In current study, the EN C values of the protein-coding genes ranged from 39 to 61; the majority were greater than 45, indicating a weak codon usage bias. According to EN C -plot analysis, the gene should fall on the standard curve in the graph when the codon use pattern is affected only by GC mutation. Nearly half of the genes were positioned on or close to the standard curve. The actual EN C value closer to the theoretical EN C value indicates that the codon usage is greatly affected by GC content, mutations. Few genes, especially ccsA and psb gene, lay below the standard curve. This finding confirms that codon usage is affected by selection, as reported in the previous studies [40][41][42].
GC content reflects the overall trend in mutation, and changes in the third base of the codon do not cause changes in the coding amino acid. Therefore, mutations in the third base are subject to low selection pressure. GC 3 is also used to analyze the codon usage pattern. PR2 analysis used to determine the correlation between A and T and G and C in the third position of the codon showed that T was used slightly more frequently than A, while C was used more frequently than G, which indicates that pyrimidine was used more frequently than purine. Most of the mutations in the third position of the codon are synonymous mutations. The base changes in the first and second positions of the codon, which result in changes in the amino acids, are nonsynonymous mutations. Furthermore, the correlation between GC 12 and GC 3 was weak (R 2 = 0.0786~0.1217), and the mutation has different effects on the composition of the first, second, and third positions of the codon. The third position of the codon is affected by weak mutation and other factors, such as selection, which may play an important role.

Phylogenetic Analysis within Sapindales
The developments in next-generation sequencing technologies have led to the sequencing of many chloroplast genomes. In addition, phylogenetic studies have been intensified, and many disputed relationships among plant species have been resolved [17]. In the present study, we obtained a robust phylogeny of (Nitrariaceae (Biebersteiniaceae + The Rest)) in the Sapindales. Phylogenetic relationships among Biebersteiniaceae and Nitrariaceae and related genera of the order Sapindales have long been controversial. The two-gene tree generated by Muellner et al., (2007) showed Biebersteiniaceae as a monophyletic group, with a possible sister relationship with eight Sapindales families [8]. Chen et al., (2016) found the (Nitrariaceae (Biebersteiniaceae + The Rest)) relationship; however, with little support (BS = 27) [14]. Some analyses supported (Nitrariaceae + Biebersteiniaceae) clade as sister to the rest of the order, the support was also poor [10,13]. Our results revealed that Nitrariaceae was at the basal of Sapindales followed by Biebersteiniaceae with strong support (BS = 97 and PP = 100). Meanwhile, A strong support for (Bursereae + Anacardiaceae) and (Meliaceae (Simaroubaceae + Rutaceae)) clades was also found, and such topology was supported by Soltis et al., 2011, Muellner-Riehl et al., 2016. The current phylogenomic study sheds new light on the disentangling of complex evolutionary events within Sapindales. Furthermore, inferences based on large-scale phylogenetic frameworks within Sapindales should benefit from phylogenetic genomics based upon whole-plastome sequencing. All the specimens were deposited in the Qinghai-Tibetan Plateau Museum of Biology (HNWP). Total genomic DNA was extracted from the fresh leaves of one representative plant via the modified cetyltrimethylammonium bromide (CTAB) method [43] and measured by NanoDrop spectrophotometer (Thermo Scientific, Carlsbad, CA, USA). All the DNA samples were fragmented randomly, and paired-end libraries were constructed according to the Illumina preparation manual (San Diego, CA, USA). Universal primer and index primer for Illumina library amplification were listed in Table S1. Sequencing was performed on an Illumina HiSeq2500 platform (San Diego, CA, USA). Clean data was obtained by trimming the raw reads via the Trimmomatic tool [44].

Genome Comparison
The online program IRscope (https://irscope.shinyapps.io/irapp/) [48] was used to check the contraction and expansion at the borders of IR regions. The mVISTA program (http://genome.lbl.gov/ vista/index.shtml) [49] was employed in Shuffle-LAGAN mode to determine the differences among the chloroplast genomes of the five species. The nucleotide variability (average pairwise divergence; Pi) among the five chloroplast genomes was calculated via a sliding window analysis using DnaSP v5.10 (http://www.ub.edu/dnasp/) [50] with the following settings: 400 bp window length and 200 bp step size.

Codon Usage Bias
The level of codon usage bias was determined by calculating EN C , GC 3S , CAI, RSCU, A 3 , U 3 , C 3 , and G 3 using CodonW (http://codonw.sourceforge.net/) for all the protein-coding genes in the 14 data sets mentioned in Section 4.3. The EN C vs. GC 3S plots and a heatmap of RSCU were generated based on the data sets. The PR2 plots were drawn based on the AU-bias (A 3 /(A 3 + U 3 )) and GC-bias (G 3 /(G 3 + C 3 )). The GC 1 , GC 2 , and GC 3 was calculated in EMBOSS explorer (http://www.bioinformatics.nl/emboss-explorer/). The neutrality plot was drawn based on GC 3 and GC 12 (the average of GC 1 and GC 2 ). The heatmap and the different plots were drawn using ggplot2 in R v.3.6.3 (https://www.r-project.org/).

Conclusions
In present study, we compared the complete chloroplast genomes of five species from Biebersteiniaceae and Nitrariaceae. High similarity of structure and gene order, content was found among the five species. Several mutation hotspot regions for the five genomes included trnH-psbA, matK-rps16, rps16-psbK, psbI-atpA, rpoB-psbD, petN-psbM, psaA-rps4, rps4-ndhJ, ndhC-atpE, ycf4-cemA, psbE-psaJ, ndhF-ccsA, and ndhG-ndhI. Accelerated substitution rate was found in several protein-coding genes, including accD, clpP, matK, rpl and clpP, especially in clpP. The relative synonymous codon usage was biased among the five species, and the probable mechanisms may be selection and mutation. The phylogenetic analysis confirmed the topology of (Nitrariaceae (Biebersteiniaceae + The Rest)) in the Sapindales with strong support, providing a better-resolved phylogenetic relationship for the studied species of Sapindales than previous studies. Our findings reported here shed light on the structural evolution of chloroplast genomes and phylogenetic relationships of Biebersteiniaceae and Nitrariaceae.
Supplementary Materials: The following are available online at http://www.mdpi.com/2223-7747/9/11/1605/s1, Table S1. Primer used for Illumina library amplification, Table S2. Protein-coding genes used for nucleotide substitution rate, codon usage bias, and phylogenetic analysis, Table S3. Information of the species used for the phylogenetic analysis, Figure S1. Comparison of the borders of large single-copy (LSC), small single-copy (SSC), and inverted repeat (IR) regions among the chloroplast genomes of five species, Figure S2. Sliding window analysis of nucleotide variability (pairwise divergence) among the five species, Figure S3. The phylogenetic relationships within Sapindales resolved by complete chloroplast genome. Numbers associated with the branches are ML bootstrap value (BS) and BI posterior probabilities (PP). Nodes without numbers are supported by 100/1, Figure S4. Amino acid sequence alignment of clpP gene for the five species. Red box indicates the gene structural changes domains.

Conflicts of Interest:
The authors declare no conflict of interest.