Genome-Wide Analysis of Codon Usage Bias in Epichloë festucae

Analysis of codon usage data has both practical and theoretical applications in understanding the basics of molecular biology. Differences in codon usage patterns among genes reflect variations in local base compositional biases and the intensity of natural selection. Recently, there have been several reports related to codon usage in fungi, but little is known about codon usage bias in Epichloë endophytes. The present study aimed to assess codon usage patterns and biases in 4870 sequences from Epichloë festucae, which may help reveal constraining factors such as mutation or selection pressure and support bioreactor applications involving the cloning, expression, and characterization of particular genes. The GC content (56.41%) is higher than the AT content (43.59%) in E. festucae. The results of neutrality and effective number of codons plot analyses showed that both mutational bias and natural selection play roles in shaping codon usage in this species. We found that gene length is significantly correlated with codon usage and may contribute to the codon usage patterns observed in genes. Nucleotide composition and gene expression levels also shape codon usage bias in E. festucae. E. festucae exhibits codon usage bias based on the relative synonymous codon usage (RSCU) values of 61 sense codons, with 25 codons showing an RSCU larger than 1. In addition, we identified 27 optimal codons that end in a G or C.


Introduction
The genetic code refers to the sequences of DNA and RNA nucleotides that determine amino acid sequences in proteins. The genetic code comprises 64 codons encoding 20 amino acids. Therefore, most amino acids are encoded by more than one codon. Different codons that encode the same amino acid are termed synonymous codons. Although translation speed may differ among synonymous codons because of the relative abundances of their corresponding tRNAs, all codons are recognized by the ribosome. Synonymous codons are not used randomly throughout the genome, however, a phenomenon that is referred to as codon usage bias [1,2]. Differences in codon usage can modulate the efficiency and accuracy of protein production while maintaining the same protein sequence.
Studies of codon usage have determined that several factors may influence codon usage patterns, including mutational bias and natural selection. Analysis of codon usage patterns sheds light on the molecular biology of gene regulation, gene expression, secondary protein structure, selective transcription, and the external environment. Among these, the major factors that are responsible for codon usage variation among different organisms are compositional constraints under mutational pressure and natural selection [3][4][5][6].

Results and Discussion
Codon usage bias in genes is an important evolutionary parameter and has been increasingly documented in a wide range of organisms from prokaryotes to eukaryotes. Two theories, neutral evolution and natural selection, have been used to explain the origin of codon usage bias [3,28,29]. If a synonymous mutation occurs at the third codon position, it should result in a random codon choice, with GC and AT being substituted proportionally among the degenerate codons in a gene [30,31]. In contrast, if translational selection pressure influences codon usage, the bias should be significantly positively correlated with expression levels, with some translation-preferred codons appearing more frequently than others. Previous studies have demonstrated that genes within a species often share similar codon usage patterns, though a few species, such as Bacillus subtilis, appear to refute this [32].

Neutrality Plot
A neutrality plot revealed the relationship between GC12 and GC3 (Figure 1), which may reflect the mutation-selection equilibrium that shapes codon usage in E. festucae. The neutrality plot shows that E. festucae genes exhibit a wide range of GC3 values, from 21.82% to 91.67%. There was a significant positive correlation between GC12 and GC3 (r = 0.121, p < 0.01), and the slope of the regression line for all coding sequences was 0.0486. This slope implies that mutation pressure accounts for only 4.86% of the effect, so codon bias was only slightly affected by neutral evolution, while other factors, for example natural selection, account for the remaining 95.14%. The significantly positive correlation in the neutrality plot indicates that the effect of intragenomic GC mutation bias on GC content was similar at all three codon positions [33]. Accordingly, mutation pressure (nucleotide bias) plays only a minor role in shaping the codon bias, whereas natural selection probably dominates. These results suggest that an effect of natural selection is present at all codon positions.

Association between Effective Number of Codons (ENC) and GC3s
The ENC in E. festucae genes ranged from 26.02 to 61.00, with an average of 51.58. Among the 4870 genes, only 132 genes exhibited high codon bias (ENC < 35), indicating that E. festucae genes, in general, reflect random codon usage without strong codon bias.
We estimated the difference between the observed and the expected ENC values for all genes using a plot of the frequency distribution of (ENCexp − ENCobs)/ENCexp (Figure 2). Most genes appear in the 0.0-0.1 range, suggesting that most observed ENC values are smaller than the ENC values expected based on the GC3s. These results show that E. festucae codon usage can be predicted from GC3s and that mutation plays a role in shaping codon usage. An ENC plot was generated to explore the influence of GC3s on codon bias in E. festucae. If a gene is located on the expected curve, its codon usage is determined by base composition alone, with no additional bias. In this study, most ENC values were lower than expected and were located just below the curve (Figure 3), indicating that other factors, combined with mutation pressure, affect codon usage.
Kawabe and Miyashita [34] reported that the width of the GC3s distribution might be related to variation in the strength of directional selection against mutation pressure. In E. festucae, the GC3s distribution was between 0.4 and 1.0, indicating that E. festucae mainly evolved by mutation pressure.
The ENC is often used in population genetics research to measure the overall codon bias for an individual gene without knowledge of the optimal codons or a reference set of highly expressed genes. From the ENC plot, a comparison of the observed distribution of genes with the expected distribution based on GC3s can reveal whether the codon biases of genes are influenced by mutation, although mutation might not be the only factor [30].

If a given gene is only subject to GC composition/mutation constraints, it will lie just above or below the standard curve. However, if a particular gene is under selective pressure for high expression, its ENC value will deviate more strongly from the expected value, and it will lie significantly below the curve. In the ENC plot, at a GC3 of approximately 0.4, there were some genes that displayed a more biased codon usage than expected based on the respective GC3s.
Translation efficiency constrains codon choice, in that the frequency of codon usage is positively correlated with tRNA availability. The degree of codon usage bias is related to the level of gene expression, with highly expressed genes exhibiting greater codon bias than lowly expressed genes. Thus, under selection pressure, highly expressed genes reduce the use of rare codons as far as possible [35][36][37]. Because the ENC values of highly expressed genes deviate more strongly from the expected value, translation efficiency is associated with small ENC values.

Correlations between Codon Usage Bias, Hydrophobicity, Aromaticity, and Gene Length in E. festucae
To determine the relationship between relative codon bias and nucleotide composition in E. festucae, relationships between codon usage bias and hydrophobicity, aromaticity, and gene length were determined using multivariate correlation analysis (Table 2). The results showed that neither the GRAVY (general average hydropathicity) values nor the Aromo values were significantly correlated with GC3s. Aromo and GRAVY values did, however, exhibit significant negative correlations with ENC values (r = −0.034, p < 0.05; r = −0.164, p < 0.01, respectively), indicating that Aromo and GRAVY values are associated with codon usage bias in E. festucae. Gene length was positively correlated with ENC values (r = 0.227, p < 0.01), suggesting that gene length may contribute to codon usage bias. The ENC values were significantly positively correlated with the first axis (r = 0.836, p < 0.01) and the second axis (r = 0.193, p < 0.01) values, but were significantly negatively correlated with the GC3s (r = −0.808, p < 0.01) (Table 2). This suggests that ENC may be the main factor shaping codon bias in E. festucae.
Table 2. Correlation coefficients between the positions of genes along the first two major axes and the indices of total genes' codon usage and synonymous codon usage bias.
The CAI (codon adaptation index), which reflects the gene expression level, exhibited a significant positive correlation with GC (r = 0.018, p < 0.01), GC3 (r = 0.265, p < 0.01), GC3s (r = 0.266, p < 0.01), T3s (r = 0.045, p < 0.01), C3s (r = 0.526, p < 0.01), GRAVY (r = 0.080, p < 0.01), and Aromo (r = 0.144, p < 0.01) values. However, the CAI was significantly negatively correlated with the first and second axes and with the other indices (gene length, GC1, GC2, A3s, G3s, and ENC). These results indicate that both nucleotide composition and gene expression level are major factors shaping codon usage bias in E. festucae.
To statistically measure the relationship between CAI and codon usage bias in E. festucae, the correlation coefficients for the positions of the genes along the first four major axes were analyzed with their indices of amino acid usage (Table 3). Although the CAI was negatively correlated with the first axis (r = −0.328, p < 0.01), the second axis (r = −0.623, p < 0.01), and the third axis (r = −0.159, p < 0.01), it was not significantly correlated with the fourth axis (r = 0.005, p > 0.05). GRAVY values were positively correlated with Aromo values (r = 0.420, p < 0.01) and were negatively correlated with all four axes. Aromo values exhibited a significant correlation with the first and second axes.
Table 3. Correlation coefficients between the positions of genes along the first four major axes and the indices of total genes' amino acid usage.

These results indicate that the most important factor in the amino-acid usage is hydrophobicity, followed by CAI and aromaticity. This provides strong evidence for the inference that selection for translational efficiency of amino acids exists in E. festucae.
In addition, correspondence analysis was applied to the codon usage of some specific aromatic amino acids in our research, but the effect of amino acid composition on codon usage across the whole genome needs further study. Some studies of codon usage ignore the amino acid composition of the genome; in the process, some very important properties of correspondence analysis, such as row weighting, are lost, which often diminishes the quantity of information available for analysis and occasionally results in interpretation errors [38].
Four methods of correspondence analysis (CA) have been compared for synonymous codon usage in 241 bacterial genomes: CA on three kinds of input data (absolute codon frequency, relative codon frequency, and relative synonymous codon usage (RSCU)), as well as within-group CA (WCA). The results showed that WCA is more effective than the other three methods in generating axes that reflect variations in synonymous codon usage, and that WCA reveals sources of variation that were previously unnoticed in some genomes, such as synonymous codon usage related to replication strand skew [39]. However, these studies were based on bacteria and other prokaryotes, so it is unclear whether WCA is also the best method for eukaryotic organisms such as fungi. In our study, we selected CA based on RSCU, which is widely used to identify major sources of variation in synonymous codon usage. As this is a first study of codon usage in Epichloë endophytes, we aimed to find common codon usage patterns in these fungi. It will be necessary to compare more fungal and bacterial genomes in future studies.

Optimal Codons in E. festucae
We found that E. festucae exhibits weak codon bias based on the RSCU values of the 61 sense codons (Table 4). Twenty-five codons were frequently used, such as CUC (RSCU = 1.84) and GGC (RSCU = 1.79), encoding Leu and Gly, respectively. Most frequent codons ended in a G or C, such as UUC, UUG, AUC, and GUC. The putative optimal codons of E. festucae are presented in Table 5. For each amino acid, the table lists its synonymous codons, each followed by the RSCU values and codon counts in the "high" and "low" expression datasets. The number of synonymous codons differs among amino acids; for Ser, for example, the codons UCU, UCC, UCA, and UCG are listed.
There are 27 optimal codons, all of which end in a G (14/27) or C (13/27). This suggests that the preferred codons of E. festucae may be related to the GC content at third positions. There are three optimal codons (AGG, CGC, and CGG) encoding the amino acid Arg and two optimal codons each for Ala, Thr, Pro, Ser, Val, and Leu. These codons were significantly correlated with translation levels and may be useful in the design of degenerate primers and in investigations into the evolutionary history of E. festucae. As in E. festucae, the optimal codons of Aspergillus nidulans [40], Oryza sativa [41], Triticum aestivum [33], Zea mays [42], and other higher plant nuclear genomes end in G or C, though this differs from results for E. coli, B. subtilis, Dictyostelium discoideum, D. melanogaster, Schizosaccharomyces pombe, S. cerevisiae, and other Saccharomyces spp. [7,43], in which close to one-third of all optimal codons end in a uracil, while the others end in cytosine or guanine. This difference may be related to the evolutionary origins and relationships of these organisms.
In summary, codon usage bias in E. festucae was found to be relatively weak and affected by nucleotide composition, mutational pressure, natural selection, and gene expression level. However, natural selection may play a major role in shaping codon usage variation, manifesting itself in weaker codon usage bias. In addition, the codon preferences of E. festucae were more biased than those of A. thaliana, E. coli, or Caenorhabditis elegans [44].
Currently, no complete Epichloë sp. mitochondrial genome is available in GenBank. As more complete mitochondrial and nuclear genomes of Epichloë species are released, further comparative analyses will be possible, allowing for investigation of the genetic and environmental constraints that influence codon usage patterns at the intra- and inter-species levels. In addition, because Epichloë is an endophytic fungus, different strains possess different host specificities. Comparing the differences in codon usage of different Epichloë strains from the same species may explain these host specificities. Moreover, comparisons of codon usage between the mitochondrial and nuclear genomes in the same Epichloë strain may enable exploration into the mechanism of interaction between Epichloë endophytes and host grasses.

Materials and Methods
The complete genome sequences of E. festucae (E2368, version 4) were obtained from Genome Projects at University of Kentucky [45]. CDS (Coding DNA sequences) were downloaded from GenBank [46]. To improve the quality of sequences and minimize sampling errors, CDS were filtered based on the following considerations: (i) the presence of a start codon at the beginning and a stop codon at the end of each CDS was required; (ii) each CDS had to be greater than 300 nucleotides in length; and (iii) duplicated sequences (exact matches) were detected and excluded from the dataset. As a result, 4870 CDS were used for further analysis.
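The three filtering criteria can be sketched in a few lines of Python. This is only an illustration of the stated rules, not the pipeline actually used: the function name `filter_cds`, the use of ATG as the only accepted start codon, and the standard stop codons TAA/TAG/TGA are assumptions.

```python
START_CODON = "ATG"                    # assumption: only ATG counts as a start codon
STOP_CODONS = {"TAA", "TAG", "TGA"}    # assumption: standard nuclear stop codons
MIN_LENGTH = 300                       # nucleotides; the text requires "greater than 300"


def filter_cds(records):
    """Apply criteria (i)-(iii) to an iterable of (identifier, sequence) pairs."""
    seen = set()
    kept = []
    for name, seq in records:
        seq = seq.upper()
        if not seq.startswith(START_CODON):   # (i) start codon at the beginning
            continue
        if seq[-3:] not in STOP_CODONS:       # (i) stop codon at the end
            continue
        if len(seq) <= MIN_LENGTH:            # (ii) longer than 300 nucleotides
            continue
        if seq in seen:                       # (iii) exact duplicates are excluded
            continue
        seen.add(seq)
        kept.append((name, seq))
    return kept
```

Applied to a list of parsed CDS records, the function keeps the first copy of each sequence that passes all three checks.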

Indices of Codon Usage and Synonymous Codon Usage Bias
The GC3s value is defined as the proportion of GC nucleotides at the third (variable) coding position of synonymous codons. It is a useful parameter for evaluating the degree of base composition bias.
Similarly, A3s, G3s, C3s, and T3s values can be deduced by analogy to quantify the usage of each base at synonymous third codon positions. The GC content of each full-length gene, as well as at the first, second, and third codon positions (GC, GC1, GC2, and GC3, respectively), was also calculated. GC12 represents the average of GC1 and GC2 and was used for neutrality plot analysis.
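As a minimal sketch of these position-specific indices, the following Python function computes GC1, GC2, GC3, GC12, and overall GC for a single coding sequence. The function name `gc_by_position` is hypothetical, the input is assumed to be a non-empty, in-frame DNA string, and GC3 is computed here over all codons; GC3s would additionally exclude Met, Trp, and stop codons.

```python
def gc_by_position(cds):
    """Return GC fractions at codon positions 1-3, plus GC12 and overall GC."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3           # ignore a trailing partial codon
    fractions = []
    for offset in range(3):
        bases = cds[offset:usable:3]           # every third base from this offset
        fractions.append(sum(b in "GC" for b in bases) / len(bases))
    gc1, gc2, gc3 = fractions
    return {
        "GC1": gc1,
        "GC2": gc2,
        "GC3": gc3,
        "GC12": (gc1 + gc2) / 2,               # average of positions 1 and 2
        "GC": (gc1 + gc2 + gc3) / 3,           # equal counts per position
    }
```

For the toy sequence ATGGCCTAA, the third positions are G, C, and A, so GC3 is 2/3 while GC1 and GC2 are both 1/3.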
Codon adaptation index (CAI) values are often used to measure the extent of bias toward codons that are known to be preferred in highly expressed genes [47]. With values ranging from 0 to 1.0, the higher the value is, the higher the expression level will be.
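To make the definition concrete, the sketch below computes CAI as the geometric mean of relative adaptiveness values derived from a reference set of highly expressed genes. It is an illustration of the general formula and not the implementation of the software used in this study; the function names, the toy reference set, and the pseudocount of 0.5 for codons absent from the reference are all assumptions.

```python
import math
from collections import Counter, defaultdict
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons_of(seq):
    """Split an in-frame DNA sequence into codons, dropping a partial tail."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def relative_adaptiveness(reference_seqs, pseudocount=0.5):
    """w = codon count / count of the most used synonymous codon in the reference."""
    counts = Counter(c for s in reference_seqs for c in codons_of(s))
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa not in "*MW":                    # skip stops and single-codon amino acids
            families[aa].append(codon)
    weights = {}
    for codons in families.values():
        adjusted = {c: counts[c] or pseudocount for c in codons}  # assumed pseudocount
        best = max(adjusted.values())
        for c in codons:
            weights[c] = adjusted[c] / best
    return weights


def cai(seq, weights):
    """Geometric mean of w over the scored codons of a gene (needs at least one)."""
    logs = [math.log(weights[c]) for c in codons_of(seq) if c in weights]
    return math.exp(sum(logs) / len(logs))
```

A gene built only from the most frequent codon of each family in the reference set scores 1.0, and every use of a less frequent synonym lowers the value.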
The effective number of codons (ENC) value, ranging from 20 to 61, is used to measure the magnitude of codon bias in individual genes. This is also a measure of the unevenness of use of codons across all amino acids in a protein. It is worth noting that ENC values are affected by base composition. If all codons for each amino acid were used equally (completely random usage), the ENC would be 61, while if a single codon was used for each amino acid, the ENC would be 20 [30].
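The two limits described above (20 and 61) follow from the commonly stated form of the ENC calculation, in which a homozygosity value F is averaged within each degeneracy class and combined as 2 + 9/F2 + 1/F3 + 5/F4 + 3/F6. The Python sketch below follows that form under simplifying assumptions: the function name `enc` is hypothetical, amino acids observed fewer than twice are skipped, and the function returns None when a degeneracy class cannot be estimated, whereas production tools apply additional corrections.

```python
from collections import Counter, defaultdict
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

FAMILIES = defaultdict(list)
for _codon, _aa in CODON_TABLE.items():
    if _aa != "*":
        FAMILIES[_aa].append(_codon)


def enc(cds):
    """Effective number of codons for one in-frame CDS, or None if not estimable."""
    cds = cds.upper()
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3))
    f_by_class = defaultdict(list)
    for codons in FAMILIES.values():
        k = len(codons)                        # degeneracy of this amino acid
        n = sum(counts[c] for c in codons)     # times the amino acid is used
        if k == 1 or n < 2:
            continue
        squares = sum((counts[c] / n) ** 2 for c in codons)
        f = (n * squares - 1) / (n - 1)        # codon homozygosity
        if f > 0:
            f_by_class[k].append(f)
    total = 2.0                                # Met and Trp contribute one codon each
    for k, n_amino_acids in ((2, 9), (3, 1), (4, 5), (6, 3)):
        values = f_by_class.get(k)
        if not values:
            return None                        # simplification: no fallback estimate
        total += n_amino_acids / (sum(values) / len(values))
    return min(total, 61.0)                    # cap at the number of sense codons
```

When every amino acid is encoded by a single codon, each F equals 1 and the sum is 2 + 9 + 1 + 5 + 3 = 20, the lower bound given in the text.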
The relative synonymous codon usage (RSCU) is the ratio of the observed frequency of codons relative to the expected frequency of the codon under a uniform synonymous codon usage. An RSCU value equal to 1 reflects that codon use is not biased. RSCU values less than 1.0 occur when the observed frequency is less than the expected frequency, and vice versa [48].
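The RSCU definition translates directly into code: the observed count of a codon is divided by the count expected if all synonymous codons for that amino acid were used equally. The following Python sketch assumes DNA input (T rather than U) and the standard genetic code; the function name `rscu` is hypothetical.

```python
from collections import Counter, defaultdict
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(sequences):
    """RSCU per sense codon, pooled over an iterable of in-frame DNA sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:           # skip codons with ambiguous bases
                counts[codon] += 1
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)
    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue                           # amino acid absent: RSCU undefined
        for c in codons:
            # observed count divided by the uniform expectation total / len(codons)
            values[c] = counts[c] * len(codons) / total
    return values
```

For a toy input with three CTC and one CTG codon, the six-fold Leu family has an expected count of 4/6 per codon, giving RSCU values of 4.5 for CTC and 1.5 for CTG.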
General average hydropathicity (GRAVY) values represent the sum of the hydropathy values of all amino acids in the gene product divided by the number of residues in the sequence [49]. The more negative the GRAVY value, the more hydrophilic the protein, while the more positive the GRAVY value, the more hydrophobic the protein.
Aromo values denote the frequency of aromatic amino acids (Phe, Tyr, Trp) in the hypothetical translated gene product. The Aromo and GRAVY values have been used to quantify the major correspondence analysis (COA) trends in the amino acid composition of E. coli genes [40].
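Both protein-level indices are simple averages over the translated product. The sketch below assumes that the hydropathy values are those of the Kyte-Doolittle scale and that the input contains only the 20 standard one-letter residues; the function names are hypothetical.

```python
# Assumed hydropathy scale (Kyte-Doolittle values per residue).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AROMATIC = set("FYW")  # Phe, Tyr, Trp


def gravy(protein):
    """Sum of residue hydropathy values divided by the number of residues."""
    protein = protein.upper()
    return sum(HYDROPATHY[aa] for aa in protein) / len(protein)


def aromo(protein):
    """Frequency of aromatic residues in the translated product."""
    protein = protein.upper()
    return sum(aa in AROMATIC for aa in protein) / len(protein)
```

A positive `gravy` result indicates a hydrophobic product and a negative result a hydrophilic one, matching the interpretation given above.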

ENC Plot
The ENC plot (a plot of ENC vs. GC3s) is a strategic investigation into patterns of synonymous codon usage, providing a visual display of the main features of codon usage patterns for a number of genes. Values of ENC were always within the range from 20 (only one codon effectively used for each amino acid) to 61 (codons used randomly). The expected ENC values were calculated as follows [30]:
$$\mathrm{ENC}_{\mathrm{exp}} = 2 + S + \frac{29}{S^{2} + (1 - S)^{2}}$$
where S is the frequency of G + C (i.e., GC3s).
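The expected curve and the relative deviation (ENCexp − ENCobs)/ENCexp used for the frequency distribution can be computed as in the following sketch. The function names are hypothetical, and GC3s is assumed to be given as a fraction between 0 and 1.

```python
def enc_expected(gc3s):
    """Expected ENC when codon usage is constrained only by GC3s."""
    return 2.0 + gc3s + 29.0 / (gc3s ** 2 + (1.0 - gc3s) ** 2)


def enc_ratio(enc_observed, gc3s):
    """Relative deviation (ENCexp - ENCobs) / ENCexp for one gene."""
    expected = enc_expected(gc3s)
    return (expected - enc_observed) / expected
```

At S = 0.5 the expected value is 2 + 0.5 + 29/0.5 = 60.5, the top of the curve, so a gene with an observed ENC of 54.45 at that GC3s has a ratio of 0.1. A positive ratio places the gene below the curve.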

Neutrality Plot
A neutrality plot [50] can be used to analyze factors influencing codon usage patterns and biases, including estimation and characterization of the relationships between GC12 and GC3.
A neutrality plot regression with a slope of 0 indicates no effects of directional mutation pressure (complete selective constraints), while a slope of 1 is indicative of complete neutrality [50].
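The slope and correlation of a neutrality plot can be obtained by ordinary least squares. The sketch below assumes the conventional layout with GC3 on the x-axis and GC12 on the y-axis and uses plain Python instead of the statistical package used in this study; the function name is hypothetical.

```python
def neutrality_regression(gc3, gc12):
    """Least-squares fit of GC12 on GC3; returns (slope, intercept, pearson_r)."""
    n = len(gc3)
    mean_x = sum(gc3) / n
    mean_y = sum(gc12) / n
    sxx = sum((x - mean_x) ** 2 for x in gc3)
    syy = sum((y - mean_y) ** 2 for y in gc12)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(gc3, gc12))
    slope = sxy / sxx                          # 0: no mutation effect, 1: full neutrality
    intercept = mean_y - slope * mean_x
    pearson_r = sxy / (sxx * syy) ** 0.5
    return slope, intercept, pearson_r
```

Under this reading, a slope of 0.0486 corresponds to a 4.86% contribution of mutation pressure and 95.14% for other factors.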

Determination of Optimal Codons
The independent optimal codon index can be used as a standard to distinguish between strong and weak translation-coupled biases in datasets. In this study, we ordered the sequences by their ENC ratio values. Using the 5% of sequences from either end of the ordered dataset, we formed two subsets: the "high bias" dataset comprised genes with higher overall ENC ratios, suggesting that their observed ENC values were far from those expected based on GC content, while the "low bias" dataset comprised genes with the lowest ENC ratios [33]. When the difference between the RSCU of "high bias" and "low bias" dataset (∆RSCU) was larger than 0.08, the corresponding codon was defined as the optimal codon [41].
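The selection procedure can be sketched as two steps: take the top and bottom 5% of genes ranked by ENC ratio, then flag codons whose RSCU difference between the two subsets exceeds 0.08. The function names and the dictionary-based inputs are assumptions made for illustration, and the RSCU values of each subset are taken as already computed.

```python
def split_by_bias(enc_ratios, fraction=0.05):
    """Return (high_bias_ids, low_bias_ids) from a {gene_id: ENC ratio} mapping."""
    ranked = sorted(enc_ratios, key=enc_ratios.get)   # ascending ENC ratio
    k = max(1, int(len(ranked) * fraction))           # size of each tail
    return ranked[-k:], ranked[:k]


def optimal_codons(rscu_high, rscu_low, threshold=0.08):
    """Codons whose RSCU in the high-bias set exceeds the low-bias set by > threshold."""
    return sorted(
        codon
        for codon, value in rscu_high.items()
        if value - rscu_low.get(codon, 0.0) > threshold
    )
```

With illustrative RSCU values of 2.0 versus 1.2 for CTC and 1.05 versus 1.0 for CTG, only CTC passes the 0.08 threshold.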

Correspondence Analysis of RSCU
Correspondence analysis (COA) is a widely used method in multivariate statistical analysis of codon usage patterns. There are a total of 59 synonymous codons (excluding the three termination codons and the codons for methionine (Met) and tryptophan (Trp)); to generate a COA of RSCU, the degrees of freedom were reduced to 40 after removing variation caused by the unequal usage of amino acids [44].

Software Used
We used the Mobyle server [51], which includes Codon W (Ver. 1.4.4) [52], and selected yeast as the model in this research. CHIPS [53] and CUSP [54] were used to calculate the indices of codon usage bias.

Statistical Analysis
Correlations between codon usage and various indices were carried out using SPSS 19.0 (SPSS Inc., Chicago, IL, USA). Effects were corrected for multiple testing with a Tukey-Kramer test, with p ≤ 0.01 and p ≤ 0.05 as significance levels, respectively [55]. All analyses were performed with SPSS, version 22.0 and GraphPad Prism 5 (GraphPad Software, San Diego, CA, USA).