Multiple Evolutionary Selections Involved in Synonymous Codon Usages in the Streptococcus agalactiae Genome

Streptococcus agalactiae is an important human and animal pathogen. To better understand the genetic features and evolution of S. agalactiae, this study analyzed multiple factors influencing its synonymous codon usage patterns. Overall codon usage analysis showed that A- and U-ending codons were preferred in S. agalactiae functional genes, indicating that adenine (A)/thymine (T) compositional constraints might play an important role in the synonymous codon usage pattern. The plot of GC3% against the effective number of codons (ENC) suggested that translational selection was an important factor for codon bias in the microorganism. Principal component analysis (PCA) showed that (i) mutational pressure was the most important factor in shaping codon usage of all open reading frames (ORFs) in the S. agalactiae genome; (ii) strand-specific mutational bias did not influence codon usage bias in the leading and lagging strands; and (iii) gene length was not an important factor in the synonymous codon usage pattern of this organism. Additionally, the high correlation of the tRNA adaptation index (tAI) with the codon adaptation index (CAI) and the frequency of optimal codons (Fop) reinforced the role of natural selection for efficient translation in S. agalactiae. Comparison of synonymous codon usage patterns between S. agalactiae and susceptible hosts (human and tilapia) showed that synonymous codon usage of S. agalactiae was independent of that of its susceptible hosts. The study of codon usage in S. agalactiae may provide evidence about the molecular evolution of the bacterium and a greater understanding of the evolutionary relationships between S. agalactiae and its hosts.


Introduction
Synonymous codon usage patterns have great theoretical significance as well as practical value in the study of molecular evolution. Among the 61 sense codons (excluding the three stop codons), synonymous codons are not used at equal or random frequencies. Synonymous codon usage patterns often represent additional and essential genetic characteristics developed through evolution by most life forms. Therefore, analysis of synonymous codon usage bias in open reading frames (ORFs) has an important role in elucidating the mechanisms underlying synonymous codon usage and gene evolution [1]. Many

The Synonymous Codon Usage Pattern of S. agalactiae Genomes
In this study, we compared five S. agalactiae strains whose complete genomes were available from the GenBank database. The average GC12% (G + C content at the first and second codon positions) and GC3% (G + C content at the third codon position) of all functional genes were 33.71% and 25.87%, respectively. The nucleotide contents at these positions also fluctuated considerably: GC12% varied from 18.53% to 56.41% and GC3% varied from 12.78% to 53.57% across all functional genes in S. agalactiae. This suggested that A/T nucleotides were preferred over G/C at the third (wobble) position.
Analysis of variance (ANOVA) also revealed that the C content at the wobble position differed significantly (p = 0.001) between the leading and the lagging strands. However, the other three nucleotides at the wobble position did not differ significantly between the two strands (data not shown). This result suggested that although the GC content of the S. agalactiae genome was considerably low, mutation pressure from C content variation contributed, to a relatively small degree, to the nucleotide composition of the S. agalactiae genome.
The codon usage data and relative synonymous codon usage (RSCU) values for all preferentially used codons in the five datasets are listed in Table 1. All preferred codons were A- or U-ended with the exception of the UUG (Leu) codon (marked in bold in Table 1). This demonstrated that A/T compositional constraints might play a key role in the synonymous codon usage pattern. In addition, since these genomes showed an A + T redundancy (mean 64.24, SD 3.50), compositional limitation also influenced the codon usage pattern of S. agalactiae.

The Overall Codon Usage Pattern of S. agalactiae
The overall codon usage pattern of each gene in the S. agalactiae genomes was determined by PCA based on the RSCU values. From this, we could detect one major trend in the first axis, which accounted for 14.944% of the total variation, and another major trend in the second axis, which accounted for 9.164% of the total variation (Figure 1). This finding indicated that the first major axis could explain a substantial amount of the variation in codon usage of S. agalactiae, and that the second major axis also had an appreciable impact on the total variation in its codon usage pattern [21]. When we analyzed strand-specific mutational bias, overlay plots of genes on the leading strand (green plus signs) and the lagging strand (orange plus signs) revealed no strand-specific effect (Figure 1). Therefore, strand compositional asymmetry was not the main factor in shaping the codon usage patterns in S. agalactiae.
The negative correlations between GC and AT skews, both across all three nucleotide positions and at the first and second positions, indicated that G/C and A/T composition bias had only a slight ability to influence the synonymous codon usage of S. agalactiae (Table 2). On the other hand, the effect of AT skew at the third codon position (AT3 skew) was stronger than the mutation pressure reflected by GC and AT skews at the first and second codon positions. GC3 skew was also slightly, but significantly, higher, suggesting a role for GC compositional bias at the third codon position even though S. agalactiae has an AT-rich genome. To examine this further, we plotted GC3% against the effective number of codons (ENC) using data from all five genomes. The majority of points with low ENC values lay below the expected curve (Figure 2). This suggested that, apart from compositional bias, mutational pressure was the most important factor in determining codon usage bias.
Most of the genes fell within a restricted cloud in a relatively narrow range of GC3 values. However, these points covered a wide range of ENC values, indicating that translational selection was also responsible for codon bias among the genes. The few points that lay above the curve could be accounted for by extreme compositional constraints (Figure 2). Furthermore, the negative correlation between the ENC value and the first major variation (Axis 1) was analyzed (r = 0.048, p < 0.01). Together, these results suggested that both nucleotide composition and other evolutionary factors played roles in the overall codon usage in the S. agalactiae genome.

Using PR-2 Plots to Differentiate Replication and Transcription Associated Biases
As mentioned above, compositional bias among genes is very strong in the S. agalactiae genomes. These biases can be induced by many factors, such as asymmetric replication mutation pressure and asymmetric transcription/translation-associated mutation/selection pressure [12]. The G/C strand bias indicated by the horizontal axis was slightly stronger than the T/A bias (vertical axis) (Figure 3). Following the rule of PR-2 plots, the extent of replication-induced bias was 0.322, and the transcription/translation-associated bias was 0.013. Therefore, the replication effect on gene composition bias was much higher than that of transcription/translation effects.

Genetic Diversities Based on Synonymous Codon Usage by PCA
A negative correlation was found between gene length and ENC value, suggesting that gene length had little influence on codon usage. Correlation analysis also showed negative correlations between gene length and GC3%, the codon adaptation index (CAI), the first major trend by PCA (Axis 1) and the second major trend by PCA (Axis 2). These findings suggested that gene length did not play a major role in codon usage bias in this organism.
In addition, a significant correlation between the first major trend by PCA (Axis 1) and GC3% was observed (Spearman r = 0.055 *, p < 0.05), further highlighting that the nucleotide compositional constraint from mutational pressure was one factor in shaping the codon usage patterns in the S. agalactiae genomes.

Codon Adaptation Index (CAI) and Identification of Translationally Optimal Codons
The CAI value can be used to predict levels of gene expression. Significant correlations were found between the gene expression level estimated by CAI and ENC, GC%, GC12% and GC3% (Spearman r = 0.136 **, 0.233 **, 0.056 *, 0.277 *, respectively, p < 0.01) (Figure 4). A negative correlation between the first major trend (Axis 1) and CAI was also evident in the S. agalactiae genomes (Spearman r = 0.014, p < 0.01), indicating that the more highly expressed genes would cluster at the extreme end of the first major trend from PCA. This demonstrated that gene expression level was the most important factor in codon usage in the S. agalactiae genomes. In addition, genes with higher expression levels exhibited a greater degree of codon usage bias.

In addition, we found a significant correlation between the first major trend (Axis 1) on RSCU and the frequency of optimal codons (Fop) value (Spearman r = 0.053 *, p < 0.01), indicating that the variation among genes resulted from their preference for optimal codons. The significant correlation between CAI and Fop values (Spearman r = 0.078 *, p < 0.01) suggested that highly expressed genes preferentially used a particular subset of optimal codons in order to enhance their translational efficiency.
Twenty-three codons were determined as the optimal codons containing 11 A-ending and 16 U-ending codons (Table 1). These optimal codons, which were recognized by the most abundant tRNAs, might be able to influence the expression levels of heterologous genes and thereby increase the abundance of specific proteins. Therefore, translational selection had influenced the synonymous codon pattern of S. agalactiae, and highly expressed genes used optimal codons more frequently. Furthermore, the positive correlations of the tRNA adaptation index (tAI) values with CAI values (Spearman r = 0.277 *, p < 0.01) and Fop values (Spearman r = 0.455 *, p < 0.01) in the S. agalactiae genomes reinforced the role of natural selection for efficient translation.

Effects of the Overall Codon Usage of Hosts on that of S. agalactiae
The D (A, B) values, which reflect the potential effect of the overall codon usage pattern of human, tilapia, common carp, and crucian carp on that of S. agalactiae, averaged 0.1686, 0.1694, 0.15579, and 0.1566, respectively. These low values indicated that S. agalactiae evolved its synonymous codon usage patterns with strong independence from host pressures. The bacterium acted like an independent replication unit that, unlike a virus, did not need components from its host for translation.

Discussion
The majority of the codon usage variation seen in the S. agalactiae genome was affected by mutational pressure. In this study, a large number of points with low ENC values lay below the expected curve (ENC against GC3%). This result suggests that, apart from compositional bias, mutational pressure is the most important factor in shaping the codon usage bias of S. agalactiae. A significant correlation was found between the first major trend (Axis 1) generated from PCA and GC3%, further highlighting that the nucleotide compositional constraint from mutational pressure is the major source of the codon usage patterns. In addition, we found that the synonymous codon usage bias in S. agalactiae genes was low (mean ENC = 44.06, higher than 40). The low codon bias may help S. agalactiae replicate efficiently in host cells by affecting cellular protein synthesis and host mRNA degradation, which would serve to reduce competition between the bacterium and its host [21]. However, the low D (A, B) values suggest that S. agalactiae has a simple form of self-replication with strong independence in the evolution of its synonymous codon usage. Compositional bias is the other factor influencing the synonymous codon usage of S. agalactiae. S. agalactiae has an A + T redundant genome with an A + T content of more than 50% (mean 64.24, SD 3.50); the AT richness of this genus is a distinctive feature, and this is reflected in its codon usage variation. Nevertheless, GC compositional bias also accounts for some of the observed variation. Translational selection was also evident: codon usage in the genomes of S. agalactiae was shaped by the tRNA pool so as to enhance translational efficiency, as shown by the significant correlations of tAI with CAI and Fop values, which reinforce the role of natural selection on translational efficiency.
The same results were reported for other organisms such as Escherichia coli, Caenorhabditis elegans, and Bifidobacterium spp. [22][23][24]. Thus, both compositional constraints and translational selection played roles in shaping the codon usage pattern of this species.
The CAI value is used to predict gene expression levels. In this study, we analyzed the correlation between codon usage bias and predicted gene expression levels, and found a significant correlation. The optimal codons, which contained 10 A-ending and 16 U-ending codons, may be able to influence the expression levels of heterologous genes and thereby increase the abundance of specific proteins. The same results have been reported in E. coli, Drosophila melanogaster, Caenorhabditis elegans, Chromohalobacter salexigens, Aedes aegypti and Anopheles gambiae [22,23,25,26,27].
Phylogenetic analysis indicates that S. agalactiae is a rapidly evolving bacterium, located on one of the longest branches of the phylogenetic tree of life. Based on its biological characteristics, S. agalactiae is able to survive hostile environments and to adapt to new environments or hosts after it infects its natural hosts. Accordingly, the synonymous codon usage patterns of the S. agalactiae genome were independent of those of its hosts.

Sequence Data
To investigate the codon usage pattern of functional genes of S. agalactiae isolated from human and tilapia, the complete genome sequences of five strains, S. agalactiae A909 (GenBank No. CP000114.1), NEM316 (GenBank No. AL732656.1), 2603V/R (GenBank No. AE009948.1), COH1 (GenBank No. HG939456.1) and GD201008-001 (GenBank No. CP003810.1), were downloaded from the GenBank database [28]. The location of each gene on the leading or lagging strand was recorded to estimate the role of strand-specific mutational bias in the formation of synonymous codon usage in S. agalactiae. To avoid errors, sequences shorter than 300 bp were excluded from this study, and the remaining gene sequences were used for the analysis. In addition, codon usage frequencies of susceptible hosts, including human (Homo sapiens) and tilapia (Oreochromis mossambicus), and of nonsusceptible hosts, including common carp (Cyprinus carpio) and crucian carp (Carassius auratus), were obtained from the codon usage database to investigate codon adaptation to the hosts [29].

Codon Usage Pattern of Functional Genes of S. agalactiae
Plotting the ENC value against GC3% (G + C content at the third codon position) is one of the best ways to detect departures from the codon usage expected under random usage. ENC values range from 20 to 61. When only one codon is used for each amino acid, the value is 20; if all synonymous codons are used equally, it is 61. The stronger the codon preference in a gene, the lower the ENC value [21,30,31].
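The expected curve in an ENC against GC3 plot is commonly drawn from Wright's approximation for codon usage constrained by GC3 composition alone. The following is a minimal sketch assuming that this is the curve meant here (the text does not state the formula); the function name `expected_enc` is illustrative.

```python
def expected_enc(gc3: float) -> float:
    """Expected ENC when codon usage is constrained only by GC3 composition.

    Uses Wright's approximation ENC = 2 + s + 29 / (s^2 + (1 - s)^2),
    where s is the G + C fraction at the third codon position (0 to 1).
    Assumption: this is the 'expected curve' referred to in the text.
    """
    if not 0.0 <= gc3 <= 1.0:
        raise ValueError("gc3 must be a fraction between 0 and 1")
    return 2.0 + gc3 + 29.0 / (gc3 ** 2 + (1.0 - gc3) ** 2)


# Sample the curve in steps of 0.05 across the full GC3 range.
curve = [(s / 100.0, expected_enc(s / 100.0)) for s in range(0, 101, 5)]
```

Genes whose observed ENC lies well below the curve value at their GC3 show more codon bias than composition alone would predict.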

Analysis of the Genetic Diversities of the S. agalactiae ORFs by Principal Component Analysis (PCA)
Based on the relative synonymous codon usage (RSCU) values, PCA was employed in this study to investigate the genetic diversity of S. agalactiae using genes longer than 300 bp. PCA reduced data dimensionality by performing a covariance analysis among the 59 synonymous codons.
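The RSCU values that feed the PCA can be sketched as follows, assuming the usual definition: the observed count of a codon divided by the mean count of all synonymous codons for the same amino acid. The helper name `rscu` and the use of the standard genetic code table are assumptions for illustration, not details taken from the study.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds: str) -> dict:
    """Return RSCU values for the synonymous codons present in one coding sequence."""
    seq = cds.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    counts = Counter(seq[i:i + 3] for i in range(0, usable, 3))

    # Group codons by the amino acid they encode.
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    values = {}
    for aa, family in families.items():
        # Skip stop codons and single-codon amino acids (Met, Trp),
        # which leaves the 59 synonymous codons.
        if aa == "*" or len(family) == 1:
            continue
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from this gene
        for c in family:
            values[c] = counts[c] * len(family) / total
    return values
```

A value above 1 means the codon is used more often than expected under equal usage of its synonyms, and a value below 1 means it is used less often.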

Nucleotide Composition Statistics for Genes in S. agalactiae
Strand asymmetry in nucleotide composition exists between the leading and lagging strands. Specifically, when there is an excess of guanine (G) relative to cytosine (C) in the leading strand, the opposite skew exists in the lagging strand, and this characteristic is frequently accompanied by a bias of thymine (T) versus adenine (A). Herein, GC skew and AT skew were calculated for the first and second codon positions and for the third codon position to evaluate the effects of nucleotide composition bias at different positions on the synonymous codon usage pattern [32,33].
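A position-specific skew calculation might look like the sketch below. It assumes the conventional definitions GC skew = (G - C)/(G + C) and AT skew = (A - T)/(A + T), which the text does not spell out; the function name `position_skews` and the returned key names are hypothetical.

```python
def position_skews(cds: str) -> dict:
    """GC and AT skews at codon positions 1+2 and at position 3 of one gene."""
    seq = cds.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon

    def skew(x: int, y: int) -> float:
        # Assumed convention: (x - y) / (x + y); report 0.0 when both are absent.
        return (x - y) / (x + y) if (x + y) else 0.0

    result = {}
    for label, offsets in (("12", (0, 1)), ("3", (2,))):
        bases = [seq[i + p] for i in range(0, usable, 3) for p in offsets]
        g, c, a, t = (bases.count(b) for b in "GCAT")
        result["GC" + label + "_skew"] = skew(g, c)
        result["AT" + label + "_skew"] = skew(a, t)
    return result
```

Computing these per gene, separately for leading-strand and lagging-strand genes, gives the values that can then be correlated with codon usage indices.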

Replication and Transcription Associated Biases Differentiation
If there are no strand-specific biases between the two strands of DNA, the expected theoretical values are A = T and G = C within each strand. To evaluate the asymmetry in mutation and/or selection between the two strands, the G/(G + C) value is plotted against the A/(A + T) value at the third codon position for all genes, as proposed by Lobry and Sueoka [34]. Here, we evaluated nucleotide composition bias in the S. agalactiae genomes according to the PR2-plot method.
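The coordinates of one gene in a PR2 plot can be sketched as below, taking G/(G + C) at the third codon position as the horizontal value and A/(A + T) as the vertical value, as described above. The function name `pr2_point` and the choice to return `None` for an undefined ratio are illustrative assumptions.

```python
def pr2_point(cds: str):
    """Return (G3/(G3+C3), A3/(A3+T3)) for one coding sequence.

    A coordinate is None when its denominator is zero.
    Under no strand bias a gene is expected near (0.5, 0.5).
    """
    seq = cds.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    third = seq[2:usable:3]  # every third-position base
    g, c, a, t = (third.count(b) for b in "GCAT")
    x = g / (g + c) if (g + c) else None
    y = a / (a + t) if (a + t) else None
    return x, y
```

The distance of the gene cloud from the centre along each axis then indicates the direction and size of the G/C and A/T biases.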

Preferred Gene Analysis by Codon Adaptation Index (CAI)
CAI is used to predict levels of gene expression. The CAI value of a gene ranges from 0 to 1.0, where a higher value is likely to indicate stronger codon usage bias and a potentially higher expression level [21,35,36]. The CAI value was calculated using the EMBOSS CUSP program [37].
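For orientation, the CAI of a gene is the geometric mean of the relative adaptiveness of its codons, where each codon's adaptiveness is its frequency in a reference gene set divided by that of the most frequent synonymous codon. The sketch below is not the EMBOSS implementation; the function names, the reference set passed in, and the 0.5 pseudocount for codons unseen in the reference set are assumptions.

```python
import math
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def split_codons(cds: str) -> list:
    """Split a coding sequence into valid codons, dropping a partial last codon."""
    seq = cds.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    return [c for c in codons if c in CODON_TABLE]


def relative_adaptiveness(reference_genes: list) -> dict:
    """Weight of each synonymous codon relative to the most used synonym."""
    counts = Counter()
    for gene in reference_genes:
        counts.update(split_codons(gene))
    families = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)
    weights = {}
    for family in families.values():
        if len(family) == 1:
            continue  # Met and Trp carry no synonymous information
        top = max(counts[c] for c in family)
        if top == 0:
            continue  # amino acid absent from the reference set
        for c in family:
            # Assumed pseudocount of 0.5 keeps the logarithm finite.
            weights[c] = max(counts[c], 0.5) / top
    return weights


def cai(cds: str, weights: dict) -> float:
    """Geometric mean of codon weights over the scored codons of one gene."""
    logs = [math.log(weights[c]) for c in split_codons(cds) if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))
```

In practice the reference set would be a group of genes presumed to be highly expressed, and the choice of that set strongly affects the resulting values.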

Computation of tRNA Adaptation Index (tAI) and Frequency of Optimal Codons (Fop)
The tRNA adaptation index (tAI) is a simple estimate of tRNA usage by the coding sequences of a genome. It represents the level of co-adaptation between a particular codon and the corresponding tRNA pool, and it correlates more strongly with protein abundance than other measures of codon bias [38]. The tAI values for the S. agalactiae genomes were calculated according to dos Reis et al. [39]. The frequency of optimal codons (Fop) was calculated using the CodonW program [40].

Estimating Effects of the Overall Codon Usage of Human and Tilapia on that of S. agalactiae
To estimate the effect of the overall codon usage of human and tilapia on that of S. agalactiae, the D (A, B) formula was used to evaluate the potential role of the overall codon usage pattern of human and tilapia in the formation of codon usage of S. agalactiae, in comparison with the nonsusceptible hosts Cyprinus carpio and Carassius auratus.
$$R(A,B) = \frac{\sum_{i=1}^{59} a_i b_i}{\sqrt{\sum_{i=1}^{59} a_i^2 \cdot \sum_{i=1}^{59} b_i^2}}$$
$$D(A,B) = \frac{1 - R(A,B)}{2}$$
where R (A, B) is defined as the cosine of the included angle between the A and B spatial vectors, representing the degree of similarity between S. agalactiae and its hosts in terms of the overall codon usage pattern; a_i is the RSCU value for a specific codon among the 59 synonymous codons of S. agalactiae, and b_i is the RSCU value for the same codon in the host. D (A, B) represents the potential effect of the overall codon usage of human and tilapia on that of S. agalactiae, and $D(A,B) \in (0,1)$ [41]. The higher D (A, B) is, the stronger the effect of the environment-related synonymous codon usage patterns of the hosts on that of S. agalactiae.
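As a sketch of how D (A, B) follows from two RSCU vectors, the function below computes the cosine similarity R (A, B) over the codons shared by both inputs and then applies D = (1 - R)/2. The function name and the dictionary input format (codon mapped to RSCU value) are assumptions for illustration.

```python
import math


def codon_usage_distance(a: dict, b: dict) -> float:
    """D(A, B) = (1 - R(A, B)) / 2, with R the cosine of the angle between
    the RSCU vectors a (bacterium) and b (host)."""
    codons = sorted(set(a) & set(b))  # compare only codons present in both
    dot = sum(a[c] * b[c] for c in codons)
    norm = math.sqrt(sum(a[c] ** 2 for c in codons) *
                     sum(b[c] ** 2 for c in codons))
    if norm == 0:
        raise ValueError("RSCU vectors must be non-zero")
    r = dot / norm
    return (1.0 - r) / 2.0
```

Because RSCU values are non-negative, R lies between 0 and 1 here, so D falls between 0 (identical usage profiles) and 0.5 (no shared usage).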

Statistical Analysis
PCA was employed to analyze the major trends in codon usage pattern; it reduces data dimensionality by performing a covariance analysis among the 59 synonymous codons [5,21,42]. Correlation analyses in this study were based on Spearman's rank correlation and were performed using the SPSS program.
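The study ran its correlations in SPSS; for readers who want to reproduce the coefficient elsewhere, Spearman's r is the Pearson correlation of the ranks, with tied values given their average rank. The pure-Python sketch below illustrates that definition and is not the SPSS procedure; both function names are hypothetical.

```python
import math


def average_ranks(values: list) -> list:
    """Rank values from 1, assigning tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_r(x: list, y: list) -> float:
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((p - mx) * (q - my) for p, q in zip(rx, ry))
    var_x = sum((p - mx) ** 2 for p in rx)
    var_y = sum((q - my) ** 2 for q in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)
```

For example, `spearman_r(cai_values, enc_values)` would give the rank correlation between two per-gene index lists, where both list names are placeholders.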

Conclusions
Synonymous codon usage pattern analysis has contributed to the understanding of codon usage bias in S. agalactiae. Based on its biological characteristics, S. agalactiae is able to survive hostile environments and to adapt to new environments or hosts. The codon usage pattern analysis in this study is helpful for understanding the evolution of S. agalactiae.