The Uneven Rate of the Molecular Evolution of Gene Sequences of DNA-Dependent RNA Polymerase I of the Genus Lamium L

RNA polymerase type I (plastid-encoded polymerase, PEP) is one of the key chloroplast enzymes. However, the rpo genes that encode its subunits (rpoA, rpoB, rpoC1 and rpoC2) are relatively rapidly evolving sequences. The aim of this study was to investigate the rate of the molecular evolution of rpo genes and to evaluate them as phylogenetic markers on the example of the genus Lamium L. (Lamiaceae). The analyzed genes were shown to differ in the level of variation, rate of intragenic mutations, phylogenetic informativeness, and in the impact of these mutations on the properties of encoded peptides. Destabilizing effects of the positive pressure were observed in all genes examined coding for PEP enzyme. We have demonstrated the relationship between mutations fixed by positive selection and the separation of phylogenetic lines within the genus Lamium. The study showed also that the rpo genes were reliable phylogenetic markers, useful in the reconstruction of interconnections of species belonging to the same genus. Of the four tested genes, the most promising phylogenetic marker was rpoA gene, while the least useful gene appeared to be rpoC1.


Introduction
Plastid-encoded plastid RNA polymerase (PEP), called also RNA polymerase type I, is one of the two DNA-dependent RNA polymerases occurring in chloroplasts of higher plants. In contrast to type II polymerase (nucleus-encoded plastid RNA polymerase, NEP), it is entirely encoded by the chloroplast genome. PEP is involved in tRNA and mRNA synthesis in the chloroplast, thus being one of the key enzymes that maintain the semi-autonomous nature of the plastid [1][2][3]. Its primary role includes the transcription of photosynthetic genes [4][5][6]. PEP is constituted by 4 enzymatic subunits: α, β, β' and β'', which are encoded by the four respective genes from the chloroplast genome, rpoA, rpoB, rpoC1 and rpoC2.
In spite of performing very important function, the rpo genes are relatively fast evolving sequences [7]. For this reason, they have been applied as markers in phylogenetic studies [7][8][9][10]. This group of genes has recently received considerable attention also due to their utility in the field of DNA barcoding [11][12][13]. Previous studies have shown the high relevance of the rpo genes in the analysis performed at a high taxonomic level such as division. The analysis of the phylogenetic tree of angiosperms (Magnoliophyta Cronquist) based solely on the rpo genes was congruent with the tree built based on complete chloroplast genome [7]. Individual genes of this group have also been applied in the study of evolutionary history at the family level in plants [10,12] and genus level in bacteria [8,9].
The goal of research presented in this paper was to evaluate the rpo genes as phylogenetic markers using Lamium L. (Lamiaceae) as an exemplar genus. The quality of a phylogenetic marker is determined by both the degree of variation and by the strength and reliability of the phylogenetic signal carried by this DNA region. This signal may be reduced due to the saturation of the tested sequences with substitutions, the nucleotide composition bias or within-site rate variation bias [14]. If these phenomena are pronounced, they can affect the phylogenetic reconstruction, potentially resulting in false though statistically reliable results [14]. In view of the above, we assessed the reliability of the phylogenetic signal carried by the rpo genes by analyzing the profile of their phylogenetic informativeness. We also examined the degree of sequences saturation with mutations as well as the variation in the rate of evolution within these genes. The phylogenetic picture of the genus Lamium, obtained by analyzing the variability of the rpo genes, was compared with findings of previous studies with a larger number of loci [15]. Results achieved enrich the knowledge on the internal phylogenetic relationships of this genus.
The second aim of this work was to complete the phylogeny reconstruction based on nucleotide sequences with information about the evolution at a protein level. For this purpose, we analyzed the impact of substitutions triggering changes of amino acid in the encoded proteins. The effect of substitution depends on the extent of differences in structural and physicochemical properties between the introduced amino acid and the one being replaced. When the selection elicits local but significant changes in protein properties, then this effect is called destabilizing selection [16]. In our research, we have focused our attention on the phenomenon of destabilizing positive selection acting on properties of the corresponding proteins. This is the first study of this type conducted with the use of rpo genes.

Sequence Characteristics
The study was carried out with rpo gene fragments ranging from 676 bp (rpoC2) to 1269 bp (rpoB) in length ( Table 1). The variability of rpoB, rpoC1 and rpoC2 genes was caused exclusively by substitutions, whereas the rpoA gene was found to contain one indel ( Figure 1). The analyzed sequences of different species could be divided into three groups. The first one consisted of two species: L. album and L. tomentosum, where the length of the rpoA gene fragment was 728 bp. In the second group, consisting of L. amplexicaule, L. bifidum, L. gevorense and L. incisum, this region was shorter by 15 bp, which corresponded to five amino acids. The remaining species that constituted the third group, were characterized by the absence of 6 bp when compared to the first group. This indel mutation did not cause changes in the reading frame in any of these groups.  All the rpo genes were characterized by a slightly higher content of purines over pyrimidines and by a higher number of AT than GC bonds (Table 1). Among the genes tested, the highest level of variation was detected in the rpoA gene. The average rate of its evolution was 1.42 substitutions per nucleotide. The contribution of variable (V) and parsimony-informative (Pi) sites reached 5.91% and 4.81%, respectively. Slightly less variable was the rpoC2 gene (V = 5.03%, Pi = 3.99%) with the substitution rate at 1.15%. The rpoB gene was characterized by the lowest proportion of the variable (2.52%) and parsimony-informative sites (1.97%), and by the lowest average rate of substitution introduction (Table 1).
Literature data are lacking that would allow to compare the variation of rpo genes of Lamium species with that in other genera of vascular plants. We can only refer our results to the few data concerning variability of rpo genes within families. Based on a study published by Petersen and Seberg [10], it could be concluded that the degree of variability observed in the rpoA gene in the genus Lamium (V = 5.91%) was higher than in the family Triticaceae (V = 4.31%). The variability of rpoB and rpoC1 genes observed in the genus Lamium (Table 1) was also higher than in the family Myristicaceae, where the variability of these genes amounted to V = 0.82% and V = 1.24%, respectively [12]. Although in these families, the variability of genes from the rpo group is lower than in the genus analyzed in our study, it is still relatively beneficial when compared with other phylogenetic markers. For example, in the family Myristicaceae, the variability of rpoC1 was at the similar level as the variability of matK gene or accD gene [12].
Our research showed that the lower average rate of mutations in the rpoB and rpoC1 genes from the Lamium genus was correlated with a greater variability of this rate within the genes. Both genes contain vast conserved regions as well as several extremely variable sites ( Figure 2). The rpoA and rpoC2 genes are in this respect much less diverse, which makes their phylogenetic signal more reliable. From the viewpoint of molecular taxonomy, a uniform rate of substitution is preferred within a given segment of the genome [17,18]. The presence of highly variable sites increases the likelihood of reverse mutations or homoplasy, which in turn makes it difficult to correctly reconstruct the phylogenetic relationships [7]. Our observations support findings of investigations carried out at higher taxonomic levels, i.e., in division Magnoliophyta, in which the rpoA and rpoC2 are better phylogenetic markers than rpoB and rpoC1 [7]. The result obtained by Logacheva et al. [7] based on the number of correctly resolved nodes of phylogenetic tree, which was adopted as a measure of effectiveness.

Phylogenetic Signal and Informativeness
Individual genome fragments can provide a reliable phylogenetic signal on condition that they have not yet reached the state of substitution saturation. Otherwise, it is difficult to ascertain whether the similarity observed in the pair of sequences is an inherited trait from a common ancestor, or a homoplasy [14,19]. The level of substitution saturation in rpo genes was measured by comparing the number of transitions and transversions with the size of the genetic distance for each pair of sequences ( Figure 3). The analysis of these genes showed that the amount of substitutions was increasing with the extension of genetic distance. Only the number of transversions within the rpoC2 gene showed a lower increasing tendency. However, none of the four plots took the form of a plateau, which is typical of the state of saturation with substitutions. Hence, the result achieved speaks for the usefulness of the rpo genes in phylogenetic inference in the genus Lamium. . The estimated number of transitions and transversions for each pairwise comparison was plotted against the genetic distance. This pattern can be interpreted as indicating that substitution saturation has not been reached, so that the data can be expected to provide reliable phylogenetic signal.
The analysis of the phylogenetic informativeness of individual genes leads to similar conclusions. The PI profiles obtained ( Figure 4) demonstrate that the analyzed genes reach the peak of informativeness for events older evolutionarily than the speciation within the genus Lamium. This means that the information noise that arose from the accumulation of mutations carrying non-phylogenetic signal was minimized [20,21]. A comparison of the PI profile based on nucleotide sequences with the profile obtained for the amino acid sequences additionally demonstrated the difference in the load of information carried by these two types of data. Caution should, however, be exercised when interpreting graphs shown in Figure 4, as they represent PI values per site. The amino acid sequences are three times shorter than the nucleotide ones, and thus to compare the PI values for these two types of data, results shown on the Y-axis of graph in Figure 4B should be divided by three. The results, obtained in this way, showed that the phylogenetic signal carried by all tested genes was significantly stronger at the nucleotide level.
The PI analysis revealed rpoA to be the most useful phylogenetic marker for the genus Lamium among the genes tested. High informativeness of the nucleotide sequence of the rpoA gene ( Figure 4A) corresponded with a very strong phylogenetic signal at the amino acid level ( Figure 4B). This proved the high convergence of the evolution of a protein sequence with the topology of the phylogenetic tree. Although the rpoB gene presented low informativeness at the nucleotide level ( Figure 4A), it had a relatively strong phylogenetic signal at the amino acid level ( Figure 4B). It could be because the analysis based on the amino acid sequence was less obscured by the information noise compared to the analysis incorporating nucleotide sequences. Therefore, the result of the former analysis was highly reliable even though the amount of information was lower than in the case of nucleotides [22].
There is little data available in the literature on the PI levels of phylogenetic markers in plants. There are no publications that would provide information about the values of PI per site in relation to the genus. For this reason, we could not discuss the level of rpo genes informativeness with the results pertaining to the informativeness of other markers at the genus level.

Phylogenetic Tree
The phylogenetic tree obtained with the MP method constitutes a consensus of 2258 most parsimonious trees with a length of 294 steps each (CI = 0.891473 and RI = 0.979651). Its topology is fully consistent with the result of the Bayesian inference. The tree constructed with the ML method was very similar to the remaining two and differed only in topology within the clade A. Cumulative results of the phylogenetic analyses are shown in Figure 5. Phylogenetic analysis based on rpo sequence variation does not reflect the full picture of the evolutionary history of the Lamium genus. However, it provides valuable information about the phylogenetic relationships between the Lamium species examined. The phylogenetic tree of this genus, calculated based on the rpo genes, was generally in agreement with results based on the analysis of six, mainly non-coding, chloroplast markers [15].
Our findings demonstrated a large genetic distance between the group comprising L. album and L. tomentosum (clade A) and the other Lamium species. This is indicative of the early separation of this phylogenetic line. Changes in rpoA, rpoB and rpoC2 genes in group A were also noticeable at the protein level (Table 2). Group A was additionally interesting because of the relationships of two subspecies, L. album and L. tomentosum. Among these taxa, only representatives of L. album subsp. barbatum grouped in one clade, while L. album and L. tomentosum formed mixed clades. L. album ssp. barbatum individuality observed at the nucleotide level was confirmed by the presence of an adaptive mutation in a protein encoded by rpoB gene. The result obtained with the ML method divided clade A into two smaller clades, thus indicating a greater similarity of L. album to L. tomentosum than the actual similarity of two subspecies of L. album with one another (clade A'). This situation occurred not only in the analysis of the rpo genes, but also in that of other chloroplast and nuclear loci [15]. This points to the incomplete lineage sorting of the species discussed. Such a phenomenon occurs when the effective size of the evolving population is large, and the time between successive divisions of phylogenetic lines is too short to fix the mutations that have arisen. It leads to the stochastic sorting of ancestral polymorphisms [23,24].  rpoB, rpoC1, rpoC2). Credibility values above 0.50 are given in the top line. Bootstrap values of clades supported by maximum parsimony and maximum likelihood analysis are given below (MP value before the slash and ML after the slash). Clades discussed in the text and in Table 2 are marked with a letter. Topography of the clade A obtained with ML method supplied with ML values is shown in a frame as a clade A'. N s -average number of surrounding residues; a m -power to be at the middle of alpha-helix; pH i -isoelectric point; C-composition; pK'-equilibrium constant (ionization of COOH); α n -power to be at the N-terminal; P α -alpha-helical tendencies.
Our results showed that L. galeobdolon (clade E) did not form a separate phylogenetic line in relation to the genus studied and thereby confirmed that the presence of L. galeobdolon in the Lamium genus was justified. The analysis of rpo genes variation supports the hypothesis [15,25,26] that L. purpureum is one of the ancestral species of L. confertum (clade C). A similar situation applied to L. bifidum and L. incisum (clade B), where L. bifidum is considered the so-called organellar parent of the hybrid species L. incisum [15,[27][28][29].
Worthy of special attention in this study was a clear distinction of L. garganicum var. armenum from other representatives of L. garganicum, indicated by a separate, highly supported clade (B2) formed by L. garganicum on the phylogram. It was additionally distinguished by three adaptive mutations, which did not appear in other phylogenetic lines ( Table 2). The subspecies L. garganicum var. armenum was located in the clade C, also very reliable statistically. Due to such large differences between these taxa, it was reasonable to exclude L. garganicum var. armenum from the species L. garganicum and adopt its correct synonymous name L. armenum Boiss. To our knowledge, the discussed taxon has never been addressed in molecular studies.

Positive Selective Pressure
The genes we tested differed in the content of silent mutations and missense mutations. Predominance of variable sites containing synonymous substitutions was reported only for the rpoC1 gene ( Figure 6). Other genes had more non-synonymous sites. Their largest number and the largest percentage in relation to all variable sites was found in the rpoC2 gene, where out of 34 variable sites, 28 had non-synonymous substitutions.  Table 2.
Within the rpoA and rpoC2 genes, the positive pressure was associated with the structural properties of the respective PEP subunits. In the case of rpoB and rpoC1 genes, the pressure focused mainly on the trait described as the equilibrium constant (Table 2). This is a chemical characteristics related to the ionization of a carboxyl group of the amino acid. The formation of a radical mutation usually resulted in a rapid change in one amino acid property. The substitution of cysteine by arginine in the rpoA (codon 84) was the only one that resulted in a simultaneous change of two properties, i.e., composition and average number of surrounding residues. When radical mutations were referred to the phylogenetic tree, we observed that they occurred at various organizational levels (Table 2, Figure 5). Some mutations are typical of the multi-species clades, while others-of the species, or even single individuals. One of the mutations observed in the rpoB gene (affecting codon 192) occurred simultaneously in three different branches of the cladogram (Table 2).
Our findings showed that the adaptive pressure exerted the greatest influence on the α-subunit of PEP enzyme, encoded by the rpoA gene. Within this gene, 25 missense mutations were observed, including 6 resulting from the destabilizing positive pressure. Of these, 4 mutations were associated with well separated clades of the phylogenetic tree. The formation of indel mutations was also significantly correlated with the separation of phylogenetic lines. The group 1 from Figure 1 corresponds to clade A ( Figure 5), whilst the group 2 corresponds to clade B1. In the subunit β of PEP (product of rpoB), a trait defined as the equilibrium constant (pK') and amino acid 192 were under strong selection pressure ( Table 2). As a result, there was a change of isoleucine to methionine, observed as a parallel mutation in clades A1 and F, and in one representative of L. maculatum. In the The observed cases of destabilizing positive selection pressure acting on the rpo genes indicated that the process of evolution of the PEP enzyme in genus Lamium was due not only to the genetic drift. This is evidenced by the evolutionary events that were correlated with the separation of phylogenetic lines of Lamium genus. The isolated mutations have been noted in five clades (A1, C, E, E1 and F), and each of these mutations affected one property of the amino acid. In clades A and B2 there were two and three mutations observed, respectively ( Table 2). In the clade A, pressure was put on the structural properties (a n , P α ), while in the clade B2 on two physicochemical traits (pH i and pK') and one structural property (N S ). Further experimental investigations are needed to estimate the impact of these changes on enzyme functioning. Currently, we can only assume that the natural selection acts to increase the overall efficiency of the enzyme or to optimize its performance in different environments.

Plant Material
The study comprised 19 species belonging to the Lamium genus, represented by 66 specimens. Different subspecies of L. album, L. bifidum, L. garganicum and L. galeobdolon, were also included, which expanded the number of taxa to 28. Nomenclature was adopted from the "World Checklist of Lamiaceae and Verbenaceae" [30]. The exceptions were: L. garganicum subsp. laevigatum Ces. Pass. & Gibelli, L. moschatum var. rhodium (Gand.) R.R.Mill., L. lycium Boiss., L. incisum Willd. and L. confertum Fr., which were not considered by these authors as distinct taxonomic units. The names of these taxa were used as described in the herbarium sheets. In the taxonomic approach adopted in our study, L. galeobdolon belongs to the genus, however, species transferred by Ryding [31] to the genus Matsumurella are excluded. Glechoma hederacea L. was included into analysis as an outgroup. Most of the samples were herbarium specimens, and some were collected by the authors during field research (Supplementary material 2). For the amplification and sequencing of rpoC1 and rpoC2 we used the primers from Royal Botanical Garden in Kew website [32]. Primers for rpoA and rpoB genes were designed with the use of chloroplast genome of Sezamum indicum, annotated as JN 637766.2 in the GenBank database [33]. The sequences of the applied primers are given in Table 3.

Experimental Procedures and Sequence Alignment
Electropherograms were edited and assembled using Sequencher 4.1.4 (Gene Codes Corporation, Ann Arbor, MI, USA). The assembled sequences were aligned and manually adjusted with BioEdit 7 [34]. Regions of ambiguous alignment and incomplete data (i.e., at the beginning and end of sequences) were excluded from the analyses. Table 3. Sequences of DNA primers used in the present study.

Patterns of Substitutions and Phylogenetic Informativeness
The distribution of rates of substitution across sites were estimated using the HyPhy program [35] by assigning a rate class to each site based on the general reversible model. The level of nucleotide substitution saturation was evaluated in DAMBE [36] software by plotting transitions and transversions against pairwise genetic distance. PhyDesign server [19,37] was used to estimate phylogenetic informativeness (PI) of rpo genes. PI profiles were plotted with reference to an uncalibrated tree. The tree used to overlay the historic changes in substitution rates was obtained with ML and ultrametrisized using PATHd8 [38]. To obtain relative ages for the clades, the root of the tree was set at an evolutionary time of 1.0 and tips at time of 0. The HyPhy program [35] which is using empirical base frequencies and a time-reversible model of substitution was used to calculate PI of nucleotide data sets. For the aminoacids (AA) data sets the program Rate4Site [39] was used, using a JTT model of evolution and ML inference method. In order to eliminate an influence of gene length on the outcome per site informativeness was calculated.

Phylogenetic Analysis
Bayesian inference (BI), maximum likelihood (ML) and maximum parsimony (MP) methods were applied to infer phylogenetic relationships. Bayesian inference phylogenetic analyses were done using MrBayes v. 3.1.2 [40] with the priors set according to the output of jModelTest 0.1.1 [41]. Optimal models of nucleotide substitution for rpo sequences were selected based on Bayesian Information Criterion (BIC) results. The parameters of the likelihood model applied for ITS region were adequate for general time reversible model with a gamma-shaped distribution of rates across sites (GTR + Γ), (n st = 6). Bayesian inference was estimated running six incrementally heated chains (MCMC algorithm) for 1,500,000 generations, sampling one out of every 100 generations of random trees. Test run was performed to define the number of generations, which should be excluded from consensus tree calculations. The first 30,000 generations were discarded as "burn-in". The remaining generations were used to construct the Bayesian consensus tree. The phylogenetic trees generated were graphically adjusted in FigTree v1.3.1 software [42].
ML and MP analyses were conducted using MEGA v.5 [43]. The ML method based on the GTR + Γ model of nucleotide substitution. In the MP analysis the tree inference was done with Close Neighbor Interchange (CNI) algorithm at the search level of 3 and the number of initial trees equal to 10. The phylogeny was tested both in ML and MP with bootstrap method [44] with the number of bootstrap replications at the level of 1000.

Detection of Positive Selection
The occurrence of a positive selection pressure at the protein level was studied using software TreeSAAP v3.2 [45]. The program measures the selective influence on 31 structural and biochemical amino acid properties across a phylogenetic tree. A gradient of eight categories was used to classify each property change, with lower categories indicating more conservative changes and higher categories denoting more radical changes. Then Z-test was performed to determine if certain magnitudes of change deviate from neutral expectations [46]. Positive destabilizing selection on amino-acid property was detected when the frequency of radical changes (6, 7 or 8 category) exceeded the frequency expected by chance, as indicated by positive z-scores. Each of rpo genes was tested in TreeSAAP using both the entire data set and a sliding window analysis with 15 codons in width. The results were used to identify the particular amino acid residues that contained positive destabilizing selection for each property.

Conclusions
Results achieved in our study demonstrated the rpo genes to be good phylogenetic markers, useful in the reconstruction of the interconnections of species belonging to the same genus. As coding regions, they contain a highly reliable phylogenetic signal, the value of which can be enhanced by the analysis of changes in the amino acid chain. Another advantage is ease in the alignment of the sequences analyzed. Of the four tested genes, the best performing phylogenetic marker was gene rpoA, while the least useful gene appeared to be rpoC1. Destabilizing effects of the positive pressure were observed in all genes examined coding for PEP enzyme. We have demonstrated the relation between the mutations fixed by positive selection and the separation of phylogenetic lines within the genus tested.