Analysis of Codon Usage Patterns in Giardia duodenalis Based on Transcriptome Data from GiardiaDB

Giardia duodenalis, a flagellated parasitic protozoan, the most common cause of parasite-induced diarrheal diseases worldwide. Codon usage bias (CUB) is an important evolutionary character in most species. However, G. duodenalis CUB remains unclear. Thus, this study analyzes codon usage patterns to assess the restriction factors and obtain useful information in shaping G. duodenalis CUB. The neutrality analysis result indicates that G. duodenalis has a wide GC3 distribution, which significantly correlates with GC12. ENC-plot result—suggesting that most genes were close to the expected curve with only a few strayed away points. This indicates that mutational pressure and natural selection played an important role in the development of CUB. The Parity Rule 2 plot (PR2) result demonstrates that the usage of GC and AT was out of proportion. Interestingly, we identified 26 optimal codons in the G. duodenalis genome, ending with G or C. In addition, GC content, gene expression, and protein size also influence G. duodenalis CUB formation. This study systematically analyzes G. duodenalis codon usage pattern and clarifies the mechanisms of G. duodenalis CUB. These results will be very useful to identify new genes, molecular genetic manipulation, and study of G. duodenalis evolution.


Introduction
Codon usage bias is a widespread feature among both prokaryotes and eukaryotes, which describes synonymous codons that are not used at the same frequency in the process of gene translation [1][2][3]. CUB is widely distributed and affected by tRNA abundance [4], nucleotide composition [5], translational processes [6], gene function [7,8], protein structure [9], hydrophobicity [10], environment temperature [11], etc. Among the known factors affecting CUB in different organisms, constraints on composition and translation selection are believed to be the primary causes for the differences in CUB among genes in different species [12]. CUB can be used to optimize the expression of foreign genes in given host cells. In addition, CUB could also provide clues about the evolution and environmental adaptation of various species [13].
Nine thousand, seven hundred and forty-seven coding sequences (CDS) of G. duodenalis (Giardia Assemblage A isolate WB) obtained from GiardiaDB were investigated based on the transcriptome profiling of Oscar Franzén et al. [35]. To minimize outliers caused by small size, CDS with sizes below 300 bp were eliminated. Thus, a total of 5968 CDS were selected for the analysis.

Indices of Codon Usage
A codon usage table of G. duodenalis was obtained from Codon Usage Database via Optimizer software [36]. CDS sequences codon usage of G. duodenalis genes was analyzed by Codon W 1.4.4 software and EMBOSS online tools (https://www.bioinformatics.nl/ emboss-explorer/, accessed on 26 July 2021). 17 CDS sequences codon usage of G. duodenalis highly expressed variant-specific surface proteins (VSPs) [37] were analyzed by EMBOSS online tools (https://www.bioinformatics.nl/emboss-explorer/, accessed on 26 July 2021). The analysis parameters included an effective number of codons (ENC), the codon adaptation index (CAI), relative synonymous codon usage (RSCU), GC content (guanine and cytosine content), A3s, T3s, G3s, C3s, and GC3s. Nucleotide composition and its frequency on the third synonymous codon can be used to reflect the codon usage preference of CDS sequences. The RSCU value is the observed codon frequency divided by the frequency expected under the same assumption for amino acid synonymous codons, and it is an index for studying differences in synonymous codon usage among genes. The RSCU values were measured according to the previous study [12]. ENC values reflect the number of codon types used in a gene, and its value generally ranged from 20 (when only one codon was used) to 61 (when all codons are used equally), and the ENC values were measured as previously described [38]. CAI refers to the consistency between the usage frequency of synonymous codons and optimal codons in the coding region, with a value of 0-1. CAI is used to estimate the degree of CUB, which is preferred in highly expressed genes. Higher CAI values mean that CUB may be stronger, and the potential expression level may be higher, and the CAI values were calculated as previously described [39].

Neutrality Plot
The effect of selection on CUB was usually measured by neutrality plot analysis. In this method, the average GC content of GC12 and GC3s was calculated. The scatter plot was drawn with GC3s as an independent variable and GC12 as a dependent variable. The points representing genes are distributed on or near the diagonal, indicating that the codon usage pattern is greatly affected by mutation; on the contrary, the smaller the slope of the scatter formation curve is, or even parallels to the horizontal axis, indicating that codon usage pattern is greatly influenced by environmental selection [40].

ENC Plot
The ENC plot of ENC values plotted against GC3s values is used to analyze the influence of base composition on the codon usage in a genome [41]. A standard curve is used to show the functional relationship between ENC and GC3s values under mutation pressure rather than selection pressure. In the ENC plot correlation analysis, GC3s were used as the independent variable and ENC as the dependent variable to construct the scatter diagram and to analyze the correlation between ENC and GC3s. In addition, according to the CUB, the standard curve was constructed under the condition of mutation pressure, but not selection pressure. If the predicted ENC value is on or near the standard curve, it represents that the CUB is mainly influenced by mutation rather than selection pressure; if the predicted ENC value is far below the standard curve, it represents that the codon composition is mainly affected by selection pressure [41].

PR2 Bias Plot Analysis
Parity rule 2 analysis, also known as parity rule analysis, is a method to study the base composition of codons. If the gene is not under the pressure of mutation or environmental selection, the internal composition of the base is A = T, C = G. However, due to the influence of gene mutation and environmental selection pressure, the usage of G and C in the genome coding sequence is often uneven, especially the third codon deviates from the rule of intrachain equivalence. In this method, amino acids encoded by four synonymous codons were analyzed, and the calculated results of G3/ (G3 + C3) and A3/ (A3 + T3) were plotted. The coordinates (0.5, 0.5) represent the PR2 principle (A = T, C = G). The distance and position of the scattered points from the center indicate the degree and direction of the gene deviation from the rules [42].

Determination of Optimal Codons
Based on the CAI values, 5% of the total genes with extremely high and low CAI values were regarded as high and low datasets, respectively. Codon usage was compared using a Chi-squared contingency test of the two groups, and codons whose frequency of usage was significantly higher (p < 0.01) in highly-expressed genes than those with low levels of expression were defined as optimal codons [43].

Correspondence Analysis (COA)
The connection between variables and samples was widely analyzed by Multivariate statistical analysis. COA has been widely used to study codon usage variation. CondonW was used to analyze RSCU values by COA analysis, COA was performed on RSCU values using to compared the intra-genomic variation of 59 informative codons partitioned along 59 orthogonal axes (excluding Met, Trp, and stop codons) [44].

Statistical Analysis
The indices of codon usage were analyzed by CondonW1.4.4 software. Microsoft Excel and SPSS 19.0 were used to analyze the correlation based on Spearman's rank correlation.

Nucleotide Contents of G. duodenalis Genes
The nucleotide contents of G. duodenalis CDSs (expressed as % GC) were shown in Figure 1. The results suggested that the GC content of 5968 G. duodenalis genes exhibited a distinctly unimodal distribution. The GC contents of G. duodenalis genes varied from 39.1% to 81.1% (SD = 0.0523), and the GC contents of the 5968 CDSs were mainly ranged from 45% to 55%. To understand the distribution of nucleotides, we studied the content of GC and GC3s. GC1, GC2 and GC3 were 53.38%, 41.79% and 51.67%, respectively. This result suggested that the GC2 was different from GC1, and GC3-GC2 was the lowest among the three codon positions. The mean value of GC contents of all codons was 48.95%. To study the relationship between the three codon positions, the neutr (GC12 versus GC3) of the G. duodenalis gene was constructed. Our result de that the GC3s of G. duodenalis genes was widely distributed (26.17% to 99.32% was a significant correlation between GC12 and GC3 (r = 0.283, p < 0.0001) (Fi dicating that mutation might affect the codon preference in G. duodenalis geno  To study the relationship between the three codon positions, the neutrality graph (GC12 versus GC3) of the G. duodenalis gene was constructed. Our result demonstrated that the GC3s of G. duodenalis genes was widely distributed (26.17% to 99.32%), and there was a significant correlation between GC12 and GC3 (r = 0.283, p < 0.0001) (Figure 2), indicating that mutation might affect the codon preference in G. duodenalis genome. To study the relationship between the three codon positions, the neutr (GC12 versus GC3) of the G. duodenalis gene was constructed. Our result dem that the GC3s of G. duodenalis genes was widely distributed (26.17% to 99.32%) was a significant correlation between GC12 and GC3 (r = 0.283, p < 0.0001) (Fi dicating that mutation might affect the codon preference in G. duodenalis geno

Codon Usage in G. duodenalis
The synonymous codon usage patterns of G. duodenalis were shown in T G + C content of the G. duodenalis genome was 49.1%, which indicated that the G genome was a little AT-rich. The total codon usage was biased towards G-and

Codon Usage in G. duodenalis
The synonymous codon usage patterns of G. duodenalis were shown in Table 1. The G + C content of the G. duodenalis genome was 49.1%, which indicated that the G. duodenalis genome was a little AT-rich. The total codon usage was biased towards G-and C-terminal codons (31 codons were common codons, and 16/31 frequently used codons ended with G or C). These results indicated that gene composition restriction played a significant role in the formation of codon usage variation in the G. duodenalis genome. In addition, we analyzed 17 CDS sequences codon usage of G. duodenalis highly expressed VSPs genes based on the actual protein levels [37], and 12/15 most frequently used codons were biased towards G-and C-terminal codons ( Table 2). At last, a codon usage table of G. duodenalis was obtained from Codon Usage Database via Optimizer software, and 9/10 most frequently used codons were biased towards G-and C-terminal codons ( Table 3). These results further confirm the accuracy of our predicted codon usage bias.

Relation between ENC and GC3
ENC values were analyzed according to the fraction of GC3s (Figure 3) to clarify the relationship between nucleotide composition and G. duodenalis CUB. The ENC values ranged from 24 to 61, suggesting significant differences in codon bias among these genes. As shown in Figure 3, most of the points were clustered near the expected ENC curve, which indicated that the ENC of most genes was close to the expected ENC value based on their GC3. In addition, the ENC of some points was lower than the expected curve, indicating that the codon usage was also affected by other factors beyond mutation pressure.
(ENCexp-ENCobs)/ENCexp was calculated to estimate the observed and expected ENC values more accurately. We found that (ENCexp-ENCobs)/ENCexp was mainly located in 0-0.05, and the values of most genes were located in −0.05-0.15 (Figure 4). The results suggested that the ENCs of most genes were slightly different from the expected ENCs values in GC3s. The observed ENCs values of most genes were near expected ENCs in GC3s, although the ENCs values of some genes were much lower.
Genes 2021, 12, x FOR PEER REVIEW

Correspondence Analysis
The differences in synonymous codon usage among G. duodena studied by RSCU correspondence analysis. We determined four maj 1-4. Axis 1 and axis 2 accounted for 14.81% and 5.30% of the total v while axis 3 and axis 4 accounted for 4.05% and 3.46% of the total v

Correspondence Analysis
The differences in synonymous codon usage among G. duodenalis genes were further studied by RSCU correspondence analysis. We determined four major contributors as axis 1-4. Axis 1 and axis 2 accounted for 14.81% and 5.30% of the total variance, respectively, while axis 3 and axis 4 accounted for 4.05% and 3.46% of the total variance, respectively, suggesting that axis 1 and axis 2 were the main contributors to G. duodenalis CUB. Axis 1 and axis 2 showed each gene in Figure 5A. To clarify the effect of gene GC contents on CUB, the gene GC contents were color coded. The genes with GC content more than or equal to 60% were shown in green, while those with GC content less than 45% were shown in red. GC content ranged from 45% to 60% were shown in blue. The results showed that the high and low GC contents of G. duodenalis genes could be separated by the primary axis. In addition, as shown in Figure 5B, we found that the terminal codons of different bases could be separated along two axes. It seemed that the separation of the first axis codon was mainly due to the frequency difference of G/C and A/T terminal codons. Further calculation showed a significant correlation between the GC content of each gene and its position on the first axis (r = 0.1154, p < 0.0001). The location of axis 1 gene was positively correlated with GC3s (r = 0.1160, p < 0.0001) and negatively correlated with ENC (r = −0.0626, p < 0.0001). These results indicated that the genes with higher GC and GC3 contents and lower ENC content on the left side of the first axis showed stronger codon bias, which demonstrated that the main factor affecting the CUB of G. duodenalis was the nucleotide composition.
To investigate different gene codon usages, we selected hydrophobic genes, aromatic genes, ribosomal genes, and other genes from 5968 genes. Figure 5C showed the distribution of these four categories of genes. Multivariate analysis of variance indicated that there were statistically significant differences in codon usage among these different genes (p < 0.01).

PR2-Bias Plot Analysis
To study whether high bias genes restrict the selection of biased codons, the relationship between A/G purines and C/T pyrimidines in amino acid was analyzed by PR2 bias plot. Figure 6 showed that genes in the upper left quadrant had low expression and the genes in the lower right had high expression. We demonstrated that G. duodenalis preferred to use G and T rather than use C and A (Figure 6), which suggested that mutation bias, selection, and other factors were involved in G. duodenalis codon usage bias.

Role of Gene Expression Level and Encoded Protein Size Synonymous CUB
To investigate the relationship between CUB and gene expression level, the correlation coefficient between CAI and ENC was calculated and analyzed. The CAI values were used to evaluate G. duodenalis genes expression level, which ranged from 0.157 to 0.912 (mean = 0.318, SD = 0.07942). As shown in Figure 7, our results indicated that genes expression level were significantly negatively correlated with genes position along axis 1 (r = 0.6811, −0.1086, p < 0.0001), while CAI value showed positive relationship significantly with GC3s and GC content (r = 0.9022, 0.6486, respectively, p < 0.0001). These results demonstrated that highly expressed genes had a strong CUB and preferred to choose G or C codons in synonymous positions.
genes, ribosomal genes, and other genes from 5968 genes. Figur tion of these four categories of genes. Multivariate analysis of va were statistically significant differences in codon usage among the Figure 5. Correspondence analysis of RSCU for G. duodenalis genes. (A distribution of G. duodenalis genes. Green dots represent G + C content + C content ≥ 45%, but less than 60%; Red dots represent G + C conten plot. Figure 6 showed that genes in the upper left quadrant had low express genes in the lower right had high expression. We demonstrated that G. duo ferred to use G and T rather than use C and A (Figure 6), which suggested th bias, selection, and other factors were involved in G. duodenalis codon usage b

Role of Gene Expression Level and Encoded Protein Size Synonymous CUB
To investigate the relationship between CUB and gene expression level, tion coefficient between CAI and ENC was calculated and analyzed. The CAI v used to evaluate G. duodenalis genes expression level, which ranged from 0.1 (mean = 0.318, SD = 0.07942). As shown in Figure 7, our results indicated tha pression level were significantly negatively correlated with genes position alo = 0.6811, −0.1086, p < 0.0001), while CAI value showed positive relationship s with GC3s and GC content (r = 0.9022, 0.6486, respectively, p < 0.0001). Th demonstrated that highly expressed genes had a strong CUB and preferred t or C codons in synonymous positions.  As shown in Figure 8, correlation analysis between protein size and ax indicated that the 3 correlation coefficients (r = −0.0817, 0.1822, −0.1254, respe 0.01) all significantly correlated, suggesting that genes with higher expression a smaller size. As shown in Figure 8, correlation analysis between protein size and axis 1 values indicated that the 3 correlation coefficients (r = −0.0817, 0.1822, −0.1254, respectively, p < 0.01) all significantly correlated, suggesting that genes with higher expression levels had a smaller size.
As shown in Figure 8, correlation analysis between protein size and axis 1 va indicated that the 3 correlation coefficients (r = −0.0817, 0.1822, −0.1254, respectively 0.01) all significantly correlated, suggesting that genes with higher expression levels a smaller size.

Optimal Codons
Translational optimal codons of G. duodenalis were represented by the average value of RSCU in high/low expressed genes (Table 4). Chi-square test showed that 26 codons were optimal codons, and the frequency of these codons was significantly higher in highly expressed genes (p < 0.01), which all end in G or C, indicating that G. duodenalis preferred to use G or C ending synonymous codons.  Comparison of codon usage frequency between high and low expression genes of G. duodenalis. The optimal codons were determined by a Chi-square contingency test. * indicates that the frequency of the codons is much higher (p < 0.01). AA, amino acid; N, number of codons.

Discussion
CUB widely exists among both prokaryotes and eukaryotes. This is an interesting and complex phenomenon in the process of biological evolution. Previous studies have proposed hypotheses trying to explain the origin of CUB, among which the selection-mutation-drift balance model and the neutral theory are the most influential ones [12,45], which considers that CUB is determined by the balance between mutation pressure, genetic drift, and weak selection [12,46], while neutral theory believes that mutations of degenerate coding cites should be neutral selection, which leads to random synonymous codon usage selection [45]. However, studies have also suggested that many others factors could affect CUB, such as GC-content [47,48], gene size [25], gene expression level [25,49,50], and gene recombination rate [47,49,51,52]. Furthermore, RNA and protein structure [41,[53][54][55], intron length [56], evolutionary age of the genes [57], population size [58], the aromaticity and the hydrophobicity of the coding proteins [10,59] have all been found to be influencing factors.
A previous analysis of G. duodenalis codon usage was restricted, which only considered eight genes, and yet it seems that the codon usage of G. duodenalis has been quite heterogeneous [33]. Another analysis of G. duodenalis codon usage investigated 65 genes, and 21 codons were the optimal codons, which were all end in C or G and almost exclusively used in the highly expressed genes [34], which was similar to our conclusion. However, the CUB of G. duodenalis has not been fully studied yet. In the present study, we performed a more comprehensive analysis based on the whole transcriptome data and found that multiple factors influence shaping G. duodenalis CUB, such as mutation pressure, selection, gene expression, and compositional constraints, and protein size.
Generally speaking, nucleotide composition is one of the most important effect factors in forming codon usage, while GC content reflects the overall trend of codon mutation [60]. Some GC-rich organisms, such as Triticum Aestivum, Oryzasativa, Bacteria, Archea, and Fungi [61,62], have been proved that they tend to use G or C ending codons. In contrast, some AT-rich organisms, such as Onchocerca volvulus, Mycoplasma capricolum, and P. falciparum [63][64][65], have been shown that they tend to A or T ending codons. In this study, we demonstrated that the average GC content among the 5968 G. duodenalis genes was 49.1%. Although the genome of G. duodenalis seems to be slightly AT-rich, the usage of all codons is biased towards G or C-terminal codons (Table 2), which is similar to those in T. saginata [20] and T. multiceps [21].
Previous studies have demonstrated that over-expressed genes are expressed more frequently, and produced more protein than other genes [22]. Moreover, preferred codons were more frequently used in highly expressed genes than other genes [66]. ENC represents the species independent synonymous bias in genes [38,67]. The genes expression level of protein coding genes can be classified according to ENC values. Highly expressed genes show less ENC value, while the lowly expressed genes show more ENC value. In the present study, the ENC values of G. duodenalis genes ranged from 24-61, suggesting that the codon bias among these genes was significantly different.
CAI is a method of recognizing differential gene expression by codon selection. Highly expressed genes exhibit a high tendency to use certain codons and tend to use them frequently [68,69]. In the present study, the low CAI of G. duodenalis indicated that highly expressed genes faced greater translation selection pressure in shaping CUB, which is similar to the codon usage in T. saginata [20] and P. falciparum [70].
It has been confirmed that various organisms, such as T. multiceps [21], T. saginata [20], S. cerevisiae [29], Silenelatifolia [71], Caenorhabditis elegans [25], and Arabidopsis thaliana [25], showed significant negative correlations between gene size and CUB. Interestingly, our research suggested that G. duodenalis also showed a negative correlation between gene size and CUB, suggesting that selection restriction made G. duodenalis tend to produce smaller proteins with similar functions to larger proteins, thus reducing the energy consumption of producing specific functional proteins [72].
Identifying the optimal codons could provide an effective means for rational codon usage rearrangement and evolutionary molecular genetics research [73][74][75]. The optimal codons tend to reflect the GC and AT content of the genomes [62,76]. In this study, 26 codons were identified as optimal codons, ending with G or C. This phenomenon is similar to previous study of G. duodenalis [34] and other eukaryotic genomes, such as T. multiceps [21] and T. saginata [20]. Identifying optimal codons in the G. duodenalis genome might provide valuable information for genetic engineering and evolutionary study.
Our study revealed the pattern of CUB in the G. duodenalis genome and its influencing factors. Our results indicated that the CUB of G. duodenalis seemed to be a complex equilibrium under different pressures: Natural selection, mutation, GC content, gene expression level, and protein size. Interestingly, all 26 optimal codons ended with G or C, which would be useful for cloning and expression of foreign genes in G. duodenalis. Together, our study elucidated the codon usage pattern of G. duodenalis and provided useful information for genetic engineering and evolutionary studies in this primitive eukaryote.

Conclusions
This study systematically analyzes G. duodenalis codon usage pattern and clarifies the mechanisms of G. duodenalis CUB, which will be very useful to identify new genes, molecular genetic manipulation, and study of G. duodenalis evolution.
Author Contributions: X.L., X.W. and J.L. drafted the main manuscript and performed the data analysis; X.L., N.Z. and J.L. were responsible for experimental design; P.G., X.Z. and J.L. were responsible for guiding and manuscript revisions. All authors have read and agreed to the published version of the manuscript.