Genome-Wide Analysis of the Synonymous Codon Usage Patterns in Riemerella anatipestifer

Riemerella anatipestifer (RA) belongs to the Flavobacteriaceae family and can cause a septicemia disease in poultry. The synonymous codon usage patterns of bacteria reflect a series of evolutionary changes that enable bacteria to improve tolerance of the various environments. We detailed the codon usage patterns of RA isolates from the available 12 sequenced genomes by multiple codon and statistical analysis. Nucleotide compositions and relative synonymous codon usage (RSCU) analysis revealed that A or U ending codons are predominant in RA. Neutrality analysis found no significant correlation between GC12 and GC3 (p > 0.05). Correspondence analysis and ENc-plot results showed that natural selection dominated over mutation in the codon usage bias. The tree of cluster analysis based on RSCU was concordant with dendrogram based on genomic BLAST by neighbor-joining method. By comparative analysis, about 50 highly expressed genes that were orthologs across all 12 strains were found in the top 5% of high CAI value. Based on these CAI values, we infer that RA contains a number of predicted highly expressed coding sequences, involved in transcriptional regulation and metabolism, reflecting their requirement for dealing with diverse environmental conditions. These results provide some useful information on the mechanisms that contribute to codon usage bias and evolution of RA.


Introduction
Riemerella anatipestifer (RA), belonging to the Flavobacteriaceae family, is a non-spore-forming, rod-shaped, and atrichous Gram-negative bacterium [1]. It can cause a contagious disease in domestic ducks, geese, turkeys, and various other wild birds. To date, more than 21 serovars have been identified [2]. In addition, no cross-protection has been observed with inactivated bacterins made from different serotypes of RA [3]. Thus, RA can easily cause large economic losses in the duck industry over the world.
Codon usage bias (CUB) of genes generally exists in prokaryotes and eukaryotes. The genetic code in organisms is not strictly one-for-one code. Most amino acids, except Trp (UGG) and Met (ATG) can allow more than one codon (called synonymous codon). Synonymous codons usually (ATG) can allow more than one codon (called synonymous codon). Synonymous codons usually differ by one base in the third codon position (or for some amino acids, in the second position) [4,5]. Among prokaryotes, it is well known that CUB is mainly influenced by mutational bias and natural selection [6,7]. Mutational bias can drive the change in the G + C content of the whole genome. Examples of mutational bias affecting codon usage can be illustrated in many prokaryotes with extremely AT or GC-rich genome [8]. Moreover, CUB may be associated with some other factors, including gene expression level [9,10], gene length [11], amino acid conservation, protein structure [12], gene function [13], and isoaccepting tRNA [14]. There are some variations in codon usage among the genomes of bacteria, which suggests that these genomes bear different pressure in evolution process. CUB analysis has important significance in many aspects. It was proved useful in studying molecular genetic engineering for codon optimization and heterologous protein expression in some species [15]. CUB analysis at genomic scale can also reveal the genetic information about the molecular evolution of individual genes and help to understand evolution of living organisms [16]. Furthermore, CUB can enrich our understanding about the relationship between pathogens and their hosts by analyzing their codon usage patterns [17].
At present, CUB in RA has not been investigated in any detail, and is not clear which factors shape the codon usage pattern. In this study, we analyzed the genome-wide codon usage patterns of 12 RA species. Our results show that natural selection is the main driving factor for codon usage patterns of RA. Additionally, the evolutionary relationship of the species shown in our study is different from that of the traditional classification.

The Codon Usage Pattern between Riemerella anatipestifer (RA)
To identify and understand codon usage patterns of RA, the values of relative synonymous codon usage (RSCU) were computed for every codon in each genome. A codon with an RSCU value of more than 1.0 has a positive codon usage bias, while a value of less than 1.0 has a negative codon usage bias. When the RSCU value is equal to 1.0, it means that this codon is chosen equally and randomly [18]. The results showed a general bias toward codons having nucleotides A or U in the third position while U was more frequently detected ( Figure 1 and Table 1). There were 30 codons having the high RSCU values (RSCU > 1, Table 1; in bold), and optimal codons (shown in *, Table 1) identified by χ-squared test which were similarly biased. Among the RA strains, it can be clearly observed that the frequencies of UCU (Ser) and CCU (Pro) are considerably high. Figure 1. Comparison of RSCU between 12 different species of RA. The heat-map was drawn by HemI using hierarchical clustering method. The higher RSCU value, suggesting more frequent codon usage, was represented with darker shades of red. In Riemerella anatipestifer (RA) genomes, codons ending in A or U have higher RSCU value than codons ending in G or C. Figure 1. Comparison of RSCU between 12 different species of RA. The heat-map was drawn by HemI using hierarchical clustering method. The higher RSCU value, suggesting more frequent codon usage, was represented with darker shades of red. In Riemerella anatipestifer (RA) genomes, codons ending in A or U have higher RSCU value than codons ending in G or C.

The Codon Usage Bias of RA not Affected by Mutation Bias
The preference of A or U in the third position of the codon in RA observed in the RSCU comparative analysis could be due to the overall GC bias within the genome. Differences in GC content were the greatest at the third codon position followed by the first and second positions [19]. The GC3s values of RA strains varied from 27.07% to 26.50% with a mean of 26.6% and standard deviation (SD) of 0.23. The effective number of codons (ENc) has been widely used to measure the codon bias level of individual genes. Among the 12 isolates, the values of ENc were higher than 40 ( Plotting ENc versus GC3s is an effective strategy to investigate patterns of synonymous codon usage [20]. The distribution plot of ENc and GC3s values for these genes have been presented in Figure 2. The solid line represents the curve if codon usage is only determined by GC3s. The actual ENc values for some genes lay near to the solid line on the left region of this distribution, and a majority of the points with low ENc values lay below the expected curve. This implies that not only mutation but also other factors are likely to be involved in determining the selective constraints on codon bias in RA genomes. majority of the points with low ENc values lay below the expected curve. This implies that not only mutation but also other factors are likely to be involved in determining the selective constraints on codon bias in RA genomes.

Correspondence Analysis (COA)
To investigate the synonymous codon usage variation among RA strains, COA was performed on the variation of RSCU value for this study. The coordinate of each coding sequence (CDS) on the two principal axes (Axes 1 and 2) is shown in Figure 3. The relative inertia explained by the first axis in RA contributes approximately 10% of the total variation. It must be remembered that although the first principal axis explains a substantial amount of variation of codon usage among the genes in RA, its value is not remarkably high for relative inertia explained by the first axis in other organisms studied earlier [11,21,22]. The low value might be due to the AT-rich genomic composition of this genome. As it can be seen, these strains of RA isolated from different places, even the same serotype, have the same trend in codon usage variation. The previous studies have shown that the codon usage variation among the genes in the extremely AT or GC rich organisms is only shaped by compositional bias, The third codon position in the preferred codons should also have the base composition of A or T [23,24]. The mutation bias toward a high G + C content seems to have resulted in a preponderance of GC-rich optimal codons [25]. As shown in Table 1, the third positions of optimal codons in RA were preferred in A or T, which suggests that the strongest influence on the choice of codon usage might not be mutation bias, but translation optimality in RA.

Correspondence Analysis (COA)
To investigate the synonymous codon usage variation among RA strains, COA was performed on the variation of RSCU value for this study. The coordinate of each coding sequence (CDS) on the two principal axes (Axes 1 and 2) is shown in Figure 3. The relative inertia explained by the first axis in RA contributes approximately 10% of the total variation. It must be remembered that although the first principal axis explains a substantial amount of variation of codon usage among the genes in RA, its value is not remarkably high for relative inertia explained by the first axis in other organisms studied earlier [11,21,22]. The low value might be due to the AT-rich genomic composition of this genome. As it can be seen, these strains of RA isolated from different places, even the same serotype, have the same trend in codon usage variation. The previous studies have shown that the codon usage variation among the genes in the extremely AT or GC rich organisms is only shaped by compositional bias, The third codon position in the preferred codons should also have the base composition of A or T [23,24]. The mutation bias toward a high G + C content seems to have resulted in a preponderance of GC-rich optimal codons [25]. As shown in Table 1, the third positions of optimal codons in RA were preferred in A or T, which suggests that the strongest influence on the choice of codon usage might not be mutation bias, but translation optimality in RA.

Natural Selection Influences the Codon Bias of RA
The GC content is calculated according to the first, second, and third codon positions (P1, P2 and P3 respectively). P12 is the average of P1 and P2, it is used for analysis of neutrality plot (P12 against P3). The neutrality plot is drawn to characterize the correlation among the three codon positions, and then used to estimate the extent of directional mutation pressure against selection on CUB. In the

Natural Selection Influences the Codon Bias of RA
The GC content is calculated according to the first, second, and third codon positions (P 1 , P 2 and P 3 respectively). P 12 is the average of P 1 and P 2 , it is used for analysis of neutrality plot (P 12 against P 3 ).
The neutrality plot is drawn to characterize the correlation among the three codon positions, and then used to estimate the extent of directional mutation pressure against selection on CUB. In the neutrality plot, each point represents one gene (Figure 4). If a gene is under neutral selection pressure, a point should be located on diagonal line with a significant correlation between its P 12 and P 3 . If a gene is close to X-axis, below the diagonal line, meaning the gene is under mutational pressure. Thus, the slope less than 1 should indicate a whole genome trend of non-neutral mutational pressure [26,27]. In this study, all RA species had relative neutralities ranging from 9% to 15% (Figure 4). It means CUB was affected a little by neutral evolution since natural selection was more than 85%. The points in all RA species were located above the diagonal distribution and the regression curve (bold line) with a slope less than 1, indicated the whole genomes in RA species trend of non-neutral mutational pressure. The subsequent correlation analysis revealed little positive correlation between P 12 and P 3 . These results showed that natural selective pressure dominated over mutation shaping the composition of coding sequences. neutrality plot, each point represents one gene (Figure 4). If a gene is under neutral selection pressure, a point should be located on diagonal line with a significant correlation between its P12 and P3. If a gene is close to X-axis, below the diagonal line, meaning the gene is under mutational pressure. Thus, the slope less than 1 should indicate a whole genome trend of non-neutral mutational pressure [26,27]. In this study, all RA species had relative neutralities ranging from 9% to 15% (Figure 4). It means CUB was affected a little by neutral evolution since natural selection was more than 85%. The points in all RA species were located above the diagonal distribution and the regression curve (bold line) with a slope less than 1, indicated the whole genomes in RA species trend of non-neutral mutational pressure. The subsequent correlation analysis revealed little positive correlation between P12 and P3. These results showed that natural selective pressure dominated over mutation shaping the composition of coding sequences.

Cluster Analysis
To gain more insight into evolution of the RA, the RSCU values between 12 species were used in hierarchical clustering ( Figure 5B). Cluster analysis for RA family yielded five major clusters, similar to dendrogram based on genomic BLAST by neighbor-joining method ( Figure 5A). Cluster I is composed of RA-CH-2, RA-GD, Yb2 and RCAD0122, meanwhile RA-CH-2 and Yb2 stay the closest and are isolated almost from the root. Cluster II contains ATCC11845, RA153, RA-SG and RA-YM. RA-SG and RA-YM appear closely related to RA153, but are on different branches

Cluster Analysis
To gain more insight into evolution of the RA, the RSCU values between 12 species were used in hierarchical clustering ( Figure 5B). Cluster analysis for RA family yielded five major clusters, similar to dendrogram based on genomic BLAST by neighbor-joining method ( Figure 5A). Cluster I is composed of RA-CH-2, RA-GD, Yb2 and RCAD0122, meanwhile RA-CH-2 and Yb2 stay the closest and are isolated almost from the root. Cluster II contains ATCC11845, RA153, RA-SG and RA-YM. RA-SG and RA-YM appear closely related to RA153, but are on different branches compared to ATCC11845. RA17 has a close relationship with cluster II divided into cluster III. RA-CH-1 and CH3, belonging to the serotype 1, are clustered in cluster IV. The highly biased RA-JLLY is clustered alone as a minor cluster of cluster V and close to the branch of RA-CH-1 and CH3. compared to ATCC11845. RA17 has a close relationship with cluster II divided into cluster III. RA-CH-1 and CH3, belonging to the serotype 1, are clustered in cluster IV. The highly biased RA-JLLY is clustered alone as a minor cluster of cluster V and close to the branch of RA-CH-1 and CH3.

Understanding Pathway Level Functions in RA through CUB
The codon adaptation index (CAI) for a gene is a measurement of its optimal codons usage, which is the codon commonly used by highly expressed proteins in a given genome [28]. CAI values of all CDS in RA genomes were calculated using the ribosomal protein codon usages as a reference set. As shown in Figure 6, the CAI values of all RA genes were distributed over a very wide range from 0.3 to 0.8 (the mean value of 0.6), but most of the genes had CAI values between 0.5 and 0.7. Only about 6% of the CAI values were greater than or equal to 0.7. No obvious correlation was observed between CAI values and the corresponding gene lengths (p > 0.05). This implies that codon bias is not the primary mechanism determining the translational efficiency of long genes in RA. Within each RA strain, the top 5% of genes with the highest CAI value were predicted to be highly expressed genes. This is corresponded to CAI cut-off   Comparison of phylogenetic tree with RSCU based clustering of RA strains. (A) The phylogenetic tree derived for the RA genome using genomic BLAST by neighbor-joining method; (B) Cluster analysis of the 12 species in RA based on RSCU value. The observed distances range from 1 to 25, the ratio of the rescaled distances within the dendrogram is the same as the ratio of the original Squared Euclidean distances.

Understanding Pathway Level Functions in RA through CUB
The codon adaptation index (CAI) for a gene is a measurement of its optimal codons usage, which is the codon commonly used by highly expressed proteins in a given genome [28]. CAI values of all CDS in RA genomes were calculated using the ribosomal protein codon usages as a reference set. As shown in Figure 6, the CAI values of all RA genes were distributed over a very wide range from 0.3 to 0.8 (the mean value of 0.6), but most of the genes had CAI values between 0.5 and 0.7. Only about 6% of the CAI values were greater than or equal to 0.7. No obvious correlation was observed between CAI values and the corresponding gene lengths (p > 0.05). This implies that codon bias is not the primary mechanism determining the translational efficiency of long genes in RA. Within each RA strain, the top 5% of genes with the highest CAI value were predicted to be highly expressed genes. This is corresponded to CAI cut-off of 0.

Understanding Pathway Level Functions in RA through CUB
The codon adaptation index (CAI) for a gene is a measurement of its optimal codons usage, which is the codon commonly used by highly expressed proteins in a given genome [28]. CAI values of all CDS in RA genomes were calculated using the ribosomal protein codon usages as a reference set. As shown in Figure 6, the CAI values of all RA genes were distributed over a very wide range from 0.3 to 0.8 (the mean value of 0.6), but most of the genes had CAI values between 0.5 and 0.7. Only about 6% of the CAI values were greater than or equal to 0.7. No obvious correlation was observed between CAI values and the corresponding gene lengths (p > 0.05). This implies that codon bias is not the primary mechanism determining the translational efficiency of long genes in RA. Within each RA strain, the top 5% of genes with the highest CAI value were predicted to be highly expressed genes. This is corresponded to CAI cut-off of 0.701 in ATCC11845, 0.698 in RA-CH-1, 0.708 in CH3, 0.691 in RA-CH-2, 0.709 in RA-GD, 0.706 in RA-SG, 0.706 in RA-YM, 0.715 in Yb2, 0.708 in RA-JLLY, 0.703 in RA153, 0.700 in RA17, and 0.706 in RCAD0122 (included about 100 genes for each RA strains), respectively.   Small subunit ribosomal protein S11 + rpsO Small subunit ribosomal protein S15 + rpsP Small subunit ribosomal protein S16 CH3 rpsR Small subunit ribosomal protein S18 +  To further analyze the highly expressed genes estimated by functional analysis, we used blastKOALA based on KEGG annotations [29]. As the limitation of gene annotation and functional studies, about forty-five orthologous high-level expression gene pairs from the all RA genomes were annotated as hypothetical proteins. In this way, their CAI values could rightfully indicate the gene expression level. These hypothetical proteins with the predicted high expression may become attractive candidates for experimental characterization, thus we assumed that they should have important functions in those organisms. Functional analysis showed that only half of genes in all 12 genomes were classified. The high-level expression genes were involved in genetic information processing, carbohydrate metabolism, energy metabolism, metabolism of cofactors and vitamins, nucleotide metabolism and cellular processes ( Table 3). The high-level expression genes involved in genetic information processing were the largest functional group. An investigation of the functional categories to which the CAI reference genes (top 1% of genes) belong has revealed that RA contains a significant fraction of ribosomal proteins (large subunit ribosomal in 62.5% and small ribosomal subunits in 37.5%). This is in agreement with the ribosomal criterion defined by Carbone [30], which states that ribosomal proteins have significantly higher CAI value than other protein encoding genes in translationally biased organisms. The rplL encoding ribosomal protein L9 with the highest CAI value (0.834) was one of the most abundant proteins under the rapid growth conditions in RA while codon selection was expected to be effective. The second most high-level expression genes was for various enzymes including carbohydrate metabolism, metabolism of cofactors and vitamins, energy metabolism and nucleotide metabolism. As we know, acnA, mdh, sucC and sucD gene encoding aconitate hydratase, malate dehydrogenase and succinyl-CoA synthetase are participant in tricarboxylic acid cycle (TCA) pathway. Several genes encoding cytochrome, transferases, and ATP synthase were also found in the 12 RA strains. Enolase is involved in secondary metabolism. Apart from ribosomal proteins and enzymes, three genes encoding elongation factor Tu, G, Ts and two chaperone encoding GroEL and DnaK were observed as the high-level expression genes in RA genomes. In addition, the outer membrane protein was also found high in expression. This analysis has offered the prospective method to further carry out the characterization on those genes.

Discussion
To confirm the observed dominance of mutational bias, the RSCU patterns are conducted in these strains. As a general rule, AT-rich genome of bacteria can result in the dominance of the A/U-ended codons. RA has extremely AT-rich genome, which is the main reason why there are 31 optimized codons ending with A/U among 32 optimized codons. This predominance of A and T at the synonymous sites is better displayed in Table 2, which reveals that amino acid usage is strongly associated with AT content in AT-rich genome [31]. In bacteria with extreme genomic GC compositions, synonymous codon usage could be dominated by strong compositional bias [32].What is more, the mutation is universally biased towards AT in bacteria [33,34]. Therefore, it likely can be concluded that the main force driving codon usage in RA is the strong compositional bias towards A and T. It is reasonable that compositional bias may be a potential bias in the evolution of the codons in RA.
The codon usage bias was conserved in RA strains. The RSCUs of each codon were very similar in 12 RA strains. Meanwhile, the distributions of the plot of Axes 1 and 2 in each CDS were almost in the same region. The plot of Axes 1 and 2 of each open reading frame (ORF) shows that there is a quite small amount of the codon usage variation in RA strains. In addition, the COA also has highly negative correlation with the GC3s value, which suggests codon usage variation is directly related to mutational bias. The ENc values of RA genome are all more than 45, which demonstrates that codon usage bias is low in RA strains. The ENc-plot suggests that not only mutational pressure but also other factors affect the codon bias among the genes. This conclusion is also supported by the highly significant correlation analysis. Comparisons of 12 RA species show a significant positive correlation between ENc and GC3s (p < 0.01). Moreover, it is obviously that the codon usage bias has no significant difference by comparing the ENc-plot of 12 RA strains. In summary, the data presented herein reveals that the differences of codon usage are small among different RA strains.
Most CAI values of RA genes are near to 0.6 that is lower than other bacteria, such as E. coli, Nocardia farcinica, and Streptomyces coelicolor [35][36][37]. The results provide evidence why RA strains need rich nutrition to grow but still slow and consequently have low environment adaptability. By correlation analysis between average RSCU values of RA ORFs and high/low ENc value groups, there are high correlations between RA ORFs and high/low ENc value groups. The codon usage patterns have no obvious difference between high and low ENc value genes. Hence, gene expression levels only have a weak influence on codon usage bias in RA.
Finally, the CAI values were set as the expression level indicator of genes in RA. The notion of gene expression by CAI values was proposed for a long time ago, however, in recent years, the methods have been widely used to qualitatively assess high-level expression genes in prokaryote and eukaryote [38][39][40][41]. Fast development of the whole-genome analysis technologies, especially whole genome sequencing as well as proteomics has made it possible to compare computational data of codon usage and expression ability with experimental evidence. In our research, the highly expressed genes can be considered as the strength of relative codon bias, most of the highly expressed genes are identified by ribosomal proteins genes. Moreover, the genes encoding elongation factor, chaperone proteins, enzymes of essential carbon metabolism pathways of TCA cycle, genes of ATP synthesis, nucleotide biosynthesis, outer membrane protein, transport and binding protein are identified as highly expressed genes in our approach. The study also proves our prediction, based on their codon usage, that some of hypothetical proteins would be highly expressed. Further research of hypothetical proteins by integrated computational and experimental data will enhance our knowledge of the metabolism in RA.

Sequences Data
A total of 12 RA genomes were used in this study. The coding sequences (CDS) datasets from the whole genome sequences were obtained from National Center of Biotechnology Information (NCBI). To minimize sampling bias in codon usage calculations only CDS of at least 100 codons in length with correct initiation were used in further analysis. Detailed information about these strains is listed in Table 4, and the distribution of these strains except ATCC11845 in the different provinces of China is shown in Figure 7. Most CAI values of RA genes are near to 0.6 that is lower than other bacteria, such as E. coli, Nocardia farcinica, and Streptomyces coelicolor [35][36][37]. The results provide evidence why RA strains need rich nutrition to grow but still slow and consequently have low environment adaptability. By correlation analysis between average RSCU values of RA ORFs and high/low ENc value groups, there are high correlations between RA ORFs and high/low ENc value groups. The codon usage patterns have no obvious difference between high and low ENc value genes. Hence, gene expression levels only have a weak influence on codon usage bias in RA.
Finally, the CAI values were set as the expression level indicator of genes in RA. The notion of gene expression by CAI values was proposed for a long time ago, however, in recent years, the methods have been widely used to qualitatively assess high-level expression genes in prokaryote and eukaryote [38][39][40][41]. Fast development of the whole-genome analysis technologies, especially whole genome sequencing as well as proteomics has made it possible to compare computational data of codon usage and expression ability with experimental evidence. In our research, the highly expressed genes can be considered as the strength of relative codon bias, most of the highly expressed genes are identified by ribosomal proteins genes. Moreover, the genes encoding elongation factor, chaperone proteins, enzymes of essential carbon metabolism pathways of TCA cycle, genes of ATP synthesis, nucleotide biosynthesis, outer membrane protein, transport and binding protein are identified as highly expressed genes in our approach. The study also proves our prediction, based on their codon usage, that some of hypothetical proteins would be highly expressed. Further research of hypothetical proteins by integrated computational and experimental data will enhance our knowledge of the metabolism in RA.

Sequences Data
A total of 12 RA genomes were used in this study. The coding sequences (CDS) datasets from the whole genome sequences were obtained from National Center of Biotechnology Information (NCBI). To minimize sampling bias in codon usage calculations only CDS of at least 100 codons in length with correct initiation were used in further analysis. Detailed information about these strains is listed in Table 4, and the distribution of these strains except ATCC11845 in the different provinces of China is shown in Figure 7.

Measurement Indices of Codon Usage Bias
In order to normalize codon usage within datasets of different amino acid compositions, relative synonymous codon usage (RSCU) values were calculated by dividing the observed codon usage by the expected ones under the condition that all codons for the same amino acid are used equally. The RSCU was used to compute relative codon frequency. The codon adaptation index (CAI) has been proved to be the best gene expression value index and was extensively used as a measure of gene expression level. The CAI was generally calculated using the codon preference of genes for highly expressed proteins, such as ribosomal proteins and elongation factors. In this study, the values of CAI were calculated using a reference set of ribosomal proteins. Based on the calculated CAI value, 5% of the total genes with extremely high CAI values were regarded as the highly expressed datasets.
Effective number of codons (ENc) was often used to quantify the codon usage bias of a gene. The ENc value of a gene could range from 20 (extreme bias where one codon for each codon family was used) to 61 (all synonymous codons were used randomly). As in the previous report, P 1 , P 2 , and P 3 were calculated after excluding ATG, TGG, ATA, and the stop codons (TAA, TAG, or TGA) [52]. The value of GC3s was the frequency of G + C at the synonymous third position of sense codons and it was employed to better understand the codon usage variation and compositional constraints (i.e., excluding Met, Trp, and termination codons). The ENc value against GC3s was computed, which was assumed equal to the use of G and C (A and T) in degenerate codon groups. The expected ENc value under random codon usage was calculated for any value of GC3s as below: ENc " 2`s`29rs 2`p 1´sq 2 s´1 (1) where s represents the given GC3s value. If the G + C content at the third position is the only determinant factor that shapes the codon usage, the point of ENc should fall on the standard curve described by Formula (1).

Correspondence Analysis and Cluster Analysis
The correspondence analysis (COA) was used to investigate the major trend in codon usage variation among genes of 12 RA strains. The CDS of each gene was represented as a 59 dimensional vector (excluding ATG, TGG, and the stop codons), and each dimension corresponds to the RSCU value of one sense codon. Since the first two axes, compared to the other axes, would be enough to explain the higher fraction of the variance of the data, genes and codons were plotted on these two axes only [53,54]. In the cluster analysis, RA species were clustered according to their RSCU values by hierarchical methods through measurement of the Squared Euclidean distance.

Software and Statistical Analysis
RSCU, ENc, total G + C genomic content, as well as COA, were calculated by CodonW 1.4 version [55]. The heat map was drawn with HemI (Huazhong University of Science and Technology, Wuhan, Hubei, China) [56] and clustered the RSCU values using an average linkage cluster algorithm. Values of CAI, P1, P2 and P3 were calculated by CAIcal Server [57]. The highly-expressed-gene datasets were interpretation of high-level functions by BlastKOALA [58]. Correlation analysis was performed using the statistical software SPSS 19.0 (IBM, Chicago, IL, USA). Graphs were generated with GraphPad Prism 6.0 (GraphPad Software Inc., La Jolla, CA, USA).

Conclusions
To summarize, our study reveals that codon usage bias in RA is slightly biased, and there is no significant difference between the strains in codon usage. Natural selection is the main factor that affects codon usage variation in RA. Other factors, such as GC content and gene expression also have an influence on codon usage pattern. In addition, all RA strains have the common highly abundant proteins. To our knowledge, this research is the first work of its kind to report of codon usage analysis in RA, and it gives us a basic understanding of the mechanisms for codon usage bias and gene expression during the evolution of RA. Moreover, this study has provided a basis for further studies on the mechanisms of codon usage that affects the RA strains through evolution.