Genetic Evolution and Molecular Selection of the HE Gene of Influenza C Virus

Influenza C virus (ICV) was first identified in humans and swine, but recently also in cattle, indicating a wider host range and potential threat to both the livestock industry and public health than was originally anticipated. The ICV hemagglutinin-esterase (HE) glycoprotein has multiple functions in the viral replication cycle and is the major determinant of antigenicity. Here, we developed a comparative approach integrating genetics, molecular selection analysis, and structural biology to identify the codon usage and adaptive evolution of ICV. We show that ICV can be classified into six lineages, consistent with previous studies. The HE gene has a low codon usage bias, which may facilitate ICV replication by reducing competition during evolution. Natural selection, dinucleotide composition, and mutation pressure shape the codon usage patterns of the ICV HE gene, with natural selection being the most important factor. Codon adaptation index (CAI) and relative codon deoptimization index (RCDI) analysis revealed that the greatest adaption of ICV was to humans, followed by cattle and swine. Additionally, similarity index (SiD) analysis revealed that swine exerted a stronger evolutionary pressure on ICV than humans, which is considered the primary reservoir. Furthermore, a similar tendency was also observed in the M gene. Of note, we found HE residues 176, 194, and 198 to be under positive selection, which may be the result of escape from antibody responses. Our study provides useful information on the genetic evolution of ICV from a new perspective that can help devise prevention and control strategies.


Introduction
Influenza C virus (ICV) was first identified in the United States from a human patient with upper respiratory disease in 1947 [1]. Despite its high seroprevalence rate [2][3][4], viruses were only occasionally isolated by cell culture. ICV was thought to exclusively infect humans until it was isolated from pigs [5], and recently from cattle [6]. Nevertheless, whether the bovine ICV isolates possess a zoonotic potential to cause human disease remains unknown. ICV usually causes a mild upper respiratory tract illness in children and is seldom associated with severe disease [7][8][9][10]. However,

Nucleotide Composition Analysis
The component parameters of the ICV HE gene were calculated after removing AUG, UGG, and termination codons (UAA, UGA, UAG) since AUG and UGG solely encode Met and Trp, respectively. In addition, stop codons do not encode amino acids and, thus, have no codon usage bias. The frequencies of A, U, G, C, GC, and AU were computed using BioEdit. The nucleotide frequencies of the third position of synonymous codons were calculated using the CodonW software (http://codonw.sourceforge.net/). The GC content at the first, second, and third positions (GC 1s , GC 2s , GC 3s ) were calculated in Emboss explorer (http://www.bioinformatics.nl/emboss-explorer/); the GC 12s is the mean of GC 1s and GC 2s .

Relative Synonymous Codon Usage (RSCU)
The RSCU values of the ICV HE gene were calculated using MEGA 7 software to determine the codon usage patterns without the effect of amino acid composition and sequence length. The RSCU value equals the ratio between the observed usage frequency and the expected usage frequency when all the synonymous codons for one amino acid are used equally. When the RSCU value equals 1, it means there is no bias for that codon [31]. Furthermore, codons with RSCU values of more than 1.6 and less than 0.6 are considered to be over-represented and under-represented, respectively [32].

Effective Number of Codons (ENC)
The ENC was calculated using the CodonW software. The ENC values range from 20 (an extreme codon usage bias when only one synonymous codon is used) to 61 (no bias when all the synonymous codons are used equally for the corresponding amino acid) [33]. The ENC values were calculated as follows: where k (k = 2, 3, 4, 6) represents k-fold degenerate amino acids, and F k represents the mean values for F k , which in turn is calculated with the following formula: where n denotes the total occurrences of the codons for the corresponding amino acid and S was calculated using the following formula: where n i is the total occurrences number of the ith codon for that amino acid. In general, it is accepted that ENC values ≤ 35 indicate strong codon usage bias [33,34].

ENC Plot
ENC plots are generally used to visualize whether mutation pressure plays a vital role in codon usage bias or not. The plot is constructed by graphing GC 3s in the abscissa and ENC in the ordinate. If a point lies on or around the standard curve, it means that the codon usage bias is merely constrained Viruses 2019, 11, 167 4 of 18 by mutation pressure. Otherwise, other factors may also play a vital role, such as natural selection [33]. The calculations of the expected ENC values for each GC 3s were estimated as follows: ENC expected = 2 + s + 29 where s stands for the value of GC 3s .

Neutrality Plot Analysis
A neutrality plot graphs GC 12s in the ordinate and GC3 in the abscissa. The neutrality plot was produced to evaluate the degree of influence of natural selection and mutation pressure on the codon usage patterns of the ICV HE gene, as well as the M gene. When the regression coefficient is close to 1, no or weak external selection pressure exists. In contrast, if the regression curve deviates from the diagonal line, it indicates a low correlation between GC 12s and GC 3s and natural selection plays a crucial role in shaping codon usage bias [35].

Principal Component Analysis (PCA)
As a multivariate statistical method, PCA analysis identifies the correlations between variables and samples. PCA was based on a 59-dimensional hyperspace in accordance with the usage of the 59 synonymous codons. Codons UGG and AUG and the three termination codons were excluded from the analysis.

Parity Rule 2 Analysis (PR2)
The Parity rule 2 (PR2) plot graphs G 3 /(G 3 + C 3 ) in the abscissa and A 3 /(A 3 + U 3 ) in the ordinate were employed to explore the influence of mutation and natural selection on the codon usage of the ICV HE gene. A point for which both coordinates are 0.5 means no bias [36,37].

Codon Adaptation Index (CAI) Analysis
CAI analysis is performed to predict the adaptability of the ICV HE and M genes to their hosts (H. sapiens, Sus scrofa, and Bos taurus). The most common codons have the highest relative adaptability to their hosts. In addition, sequences with higher CAIs are considered to be preferred over those with lower CAIs [38]. The CAI values were computed using the CAIcal SERVER [39] (http://genomes. urv.cat/CAIcal/RCDI/) and had a range from 0 to 1. The reference datasets for H. sapiens, Sus scrofa, and Bos taurus were retrieved from the Codon Usage Database (http://www.kazusa.or.jp/codon/).

Relative Codon Deoptimization Index (RCDI)
The RCDI reflects the similarity of the codon usage between a given coding sequence and a reference genome. The RCDI values of different lineages of the HE gene were computed using the RCDI/eRCDI server [40] (http://genomes.urv.cat/CAIcal/RCDI/). The reference codon usage patterns for Homo sapiens, Sus scrofa, and Bos Taurus were retrieved from the Codon Usage Database. An RCDI value of 1 implies that the virus is completely adapted to its host. In contrast, RCDI values higher than 1 indicate deoptimization codon usage patterns between the virus and its host(s) [23].

Similarity Index (SiD)
To further reveal the effect of the codon usage patterns of the host (H. sapiens, Sus scrofa, and Bos Taurus) on the overall codon usage patterns of the HE gene of ICV, the similarity index was calculated as follows: where a i is the RSCU value of a specific codon in the 59 synonymous codons of the HE gene of ICV, and b i is the RSCU value for the same codon in the ICV host. D (A, B) represents the SiD values and it ranges from 0 to 1 [41].

Relative Dinucleotide Abundance Analysis
The relative abundances and CpG dinucleotide content of individual ICV HE and M genes were calculated. The odds ratio was calculated using the following formula: where fy represents the frequency of nucleotide Y, f x represents the frequency of nucleotide X, f y f x represents the expected frequency of the dinucleotide XY, and f xy is the real frequency of the dinucleotide XY. As a generally accepted criterion, when P XY > 1.23 or < 0.78, the XY pair is supposed to be over-represented or under-represented, respectively [42].

Phylogenetic Analysis of the ICV HE Gene
Phylogenetic analysis revealed that the HE gene could be classified into six individual lineages: C/Kanagawa, C/Yamagata, C/Aichi, C/Sao Paulo, C/Taylor, and C/Mississippi with relative high bootstrap values, which is in agreement with previous studies [20,43] (Figure 1A). Importantly, these lineages coincide with antigenic groups [44]. ICV is not only pathogenic to humans and pigs, but also to bovine (C/Bovine/Montana/12/2016), which was first identified in 2016 in the United States [45]. The ML tree showed that the C/Bovine/Montana/12/2016 strain belongs to the C/Mississippi lineage, consistent with a previous study which showed that the bovine strain is closely related to the C/Mississippi/80 lineage [45]. Additionally, the strains isolated in pigs in China clustered with the C/Yamagata lineage.

PCA Analysis of ICV HE Gene
PCA of the RSCU values of the HE gene revealed that axis1 and axis2 account for 40.65% and 18.72% of the variation, respectively (Supplementary Materials 2). Then, we plotted the axis1 against axis2 based on lineage and host to identify the specific distribution at the individual gene level ( Figure 1B). In terms of lineage, there is overlap between strains of different genotypes, indicating that the ICV HE strains might diverge from a common ancestor [46]. C/Sao Paulo formed a separate cluster, which indicates an independent evolution. Additionally, we found overlap among the different hosts, implying similar codon usage trends ( Figure 1C). It is essential to note that the analysis included only one bovine and three swine sequences. Therefore, these results require further confirmation.

The ICV HE Coding Regions Are Rich in A and U Nucleotides
To analyse the potential impact of compositional constraints on codon usage, the nucleotide compositions of the ICV HE coding sequences were calculated. The mean content (%) of A (33.48 ± 0.42) and U (26.51 ± 0.25) was higher than G (21.02 ± 0.30) and C (18.98 ± 0.18). The mean percentages of nucleotides at the third codon position of A3 (47.2 ± 1.7) and U3 (39.2 ± 0.85) were also higher than G 3 (16.6 ± 1.67) and C 3 (23.2 ± 0.61), indicating that A-and U-ended codons are preferred in the ICV HE coding sequence. The averages of GC 12s and GC 3s were 30.8 ± 0.98% and 44.6 ± 0.22%, respectively. According to the lineage classification, it is important to mention that significant differences in the mean of A 3 and G 3 were observed between the C/Sao Paulo lineage and the other five lineages

PCA Analysis of ICV HE Gene
PCA of the RSCU values of the HE gene revealed that axis1 and axis2 account for 40.65% and 18.72% of the variation, respectively (Supplementary Materials 2). Then, we plotted the axis1 against axis2 based on lineage and host to identify the specific distribution at the individual gene level ( Figure  1B). In terms of lineage, there is overlap between strains of different genotypes, indicating that the ICV HE strains might diverge from a common ancestor [46]. C/Sao Paulo formed a separate cluster, which indicates an independent evolution. Additionally, we found overlap among the different hosts, implying similar codon usage trends ( Figure 1C). It is essential to note that the analysis included only one bovine and three swine sequences. Therefore, these results require further confirmation.

Codon Usage Bias of the HE Coding Sequence
To evaluate the codon usage bias of the ICV HE, the ENC values were calculated. A mean value of 44.15 ± 0.92 was obtained for the overall HE sequences, regardless of lineages, indicating a low codon usage bias. When the lineage classification was considered, the highest ENC value (45.19 ± 0.64) was for the C/Sao Paulo lineage, and thus the lowest codon usage bias among the six lineages. In contrast, the lowest ENC value (42.86 ± 0.33) was for the C/Aichi lineage, and thus the highest codon usage bias ( Figure 2A). When the structural components of the HE protein were considered, the ENC values of HEF1 and HEF2 were calculated, with a mean ENC value of 45.62 ± 0.77 and 42.42 ± 1.43, respectively. Additionally, the mean ENC value of the M gene was 45.19 ± 0.67.

Codon Usage Bias of the HE Coding Sequence
To evaluate the codon usage bias of the ICV HE, the ENC values were calculated. A mean value of 44.15 ± 0.92 was obtained for the overall HE sequences, regardless of lineages, indicating a low codon usage bias. When the lineage classification was considered, the highest ENC value (45.19 ± 0.64) was for the C/Sao Paulo lineage, and thus the lowest codon usage bias among the six lineages. In contrast, the lowest ENC value (42.86 ± 0.33) was for the C/Aichi lineage, and thus the highest codon usage bias ( Figure 2A). When the structural components of the HE protein were considered, the ENC values of HEF1 and HEF2 were calculated, with a mean ENC value of 45.62 ± 0.77 and 42.42 ± 1.43, respectively. Additionally, the mean ENC value of the M gene was 45.19 ± 0.67.

The HE Gene Displays a Contrasting Codon Usage Pattern with its Hosts
RSCU analysis was performed to confirm the codon usage pattern of the ICV HE coding sequence. We found an extreme preference for A-and U-ended codons. Among the 18 preferred codons, 16 were A/U-ended (A-ended: 6; U-ended: 10) and two were G/C-ended (C-ended: 1; G-

The HE Gene Displays a Contrasting Codon Usage Pattern with its Hosts
RSCU analysis was performed to confirm the codon usage pattern of the ICV HE coding sequence. We found an extreme preference for A-and U-ended codons. Among the 18 preferred codons, 16 were A/U-ended (A-ended: 6; U-ended: 10) and two were G/C-ended (C-ended: 1; G-ended: 1). Within these preferred codons, 10 had a RSCU value > 1.6, with the highest being AGA (4.24), indicative of extreme over-presentation. We also found that the codons containing CpG dinucleotides were under-represented (RSCU < 0.6). Estimation of the overall RSCU might potentially hide lineage-specific patterns, so the RSCU values of the HE coding sequences were next calculated according to lineages (Table 1). There was little difference among lineages. The C/Mississippi and C/Yamagata lineages were more prone to use the UCU codon to code for Ser and the C/Taylor, C/Mississippi, and C/Yamagata lineages were more likely to use the CCA codon to code for Pro. Furthermore, we found an extreme difference between the HE gene and it hosts in the usage of the 18 preferred codons. The only two exceptions were the UGC codon (coding for Cysteine) used in both the ICV HE gene and its hosts (H. sapiens, Sus scrofa, and Bos Taurus), and AGA (coding for Arginine), which was the preferred codon for the HE gene in H. sapiens, but not in Sus scrofa and Bos Taurus. Most of these results are consistent with the RSCU analysis of the M gene (Supplementary Materials 2).

Natural Selection Dominates the Codon Usage Pattern of the ICV HE Gene
ENC plots, PR2 analysis, and neutrality plot analyses were performed to explore whether the codon usage bias of the HE coding sequences was shaped by natural selection, mutation pressure, or both. The ENC-plot according to the different lineages showed that all the points were below the standard curves, indicating that other factors, such as natural selection, shape the codon usage of the ICV HE and M genes ( Figure 2B and Figure 4A, respectively). Moreover, the PR2-plot showed that all the points were away from the origin point (0.5, 0.5) (Figure 2C), indicating a bias between the degree of mutation pressure and natural selection. Furthermore, neutrality analysis revealed the narrow distribution of GC 3s in all the HE lineages. There was only one significant correlation between GC 12s and GC 3s among the six lineages: C/Aichi lineage (R 2 = 0.4372, p < 0.05). The slops of the C/Yamagata, C/Kanagawa, C/Aichi, C/Taylor, C/Sao Paulo, and C/Mississippi lineages were −0.0602, 0.0474, 0.2734, −0.00329, 0.01124, and −0.1495, respectively (Figure 3). This implies that the effect of direct mutation pressure on the codon usage bias of the six lineages is only 6.02%, 4.74%, 27.34%, 0.33%, 1.124%, and 14.9%, respectively. Natural selection plays a critical role in shaping the codon usage bias, especially in the C/Sao Paulo lineage, with the contribution of natural selection being 99.67%. As shown in Figure 4B, natural selection also plays a dominant role in shaping the codon usage patterns of the M gene (95.22%).

Natural Selection Dominates the Codon Usage Pattern of the ICV HE Gene
ENC plots, PR2 analysis, and neutrality plot analyses were performed to explore whether the codon usage bias of the HE coding sequences was shaped by natural selection, mutation pressure, or both. The ENC-plot according to the different lineages showed that all the points were below the standard curves, indicating that other factors, such as natural selection, shape the codon usage of the ICV HE and M genes ( Figure 2B and Figure 4A, respectively). Moreover, the PR2-plot showed that all the points were away from the origin point (0.5, 0.5) (Figure 2C), indicating a bias between the degree of mutation pressure and natural selection. Furthermore, neutrality analysis revealed the narrow distribution of GC3s in all the HE lineages. There was only one significant correlation between GC12s and GC3s among the six lineages: C/Aichi lineage (R 2 = 0.4372, p < 0.05). The slops of the C/Yamagata, C/Kanagawa, C/Aichi, C/Taylor, C/Sao Paulo, and C/Mississippi lineages were −0.0602, 0.0474, 0.2734, −0.00329, 0.01124, and −0.1495, respectively (Figure 3). This implies that the effect of direct mutation pressure on the codon usage bias of the six lineages is only 6.02%, 4.74%, 27.34%, 0.33%, 1.124%, and 14.9%, respectively. Natural selection plays a critical role in shaping the codon usage bias, especially in the C/Sao Paulo lineage, with the contribution of natural selection being 99.67%. As shown in Figure 4B, natural selection also plays a dominant role in shaping the codon usage patterns of the M gene (95.22%).

Dinucleotide Abundance Plays an Important Role in Shaping the Codon Usage Bias of the HE Gene
We calculated the relative abundances of the 16 dinucleotides of the ICV HE and M gene coding sequences. The occurrence of dinucleotides was found to be non-random, and CpG (mean ± SD = 0.180 ± 0.030 for HE gene; 0.256 ± 0.027 for M gene) was extremely under-represented. Interestingly, the recently reported strains (C/Bovine/Montana/12/2016) had higher CpG dinucleotide relative abundances than other ICV strains in both the HE (0.281) and M gene (0.350). However, given the limited available sequences, this needs to be further confirmed.

Natural Selection Dominates the Codon Usage Pattern of the ICV HE Gene
ENC plots, PR2 analysis, and neutrality plot analyses were performed to explore whether the codon usage bias of the HE coding sequences was shaped by natural selection, mutation pressure, or both. The ENC-plot according to the different lineages showed that all the points were below the standard curves, indicating that other factors, such as natural selection, shape the codon usage of the ICV HE and M genes ( Figure 2B and Figure 4A, respectively). Moreover, the PR2-plot showed that all the points were away from the origin point (0.5, 0.5) (Figure 2C), indicating a bias between the degree of mutation pressure and natural selection. Furthermore, neutrality analysis revealed the narrow distribution of GC3s in all the HE lineages. There was only one significant correlation between GC12s and GC3s among the six lineages: C/Aichi lineage (R 2 = 0.4372, p < 0.05). The slops of the C/Yamagata, C/Kanagawa, C/Aichi, C/Taylor, C/Sao Paulo, and C/Mississippi lineages were −0.0602, 0.0474, 0.2734, −0.00329, 0.01124, and −0.1495, respectively (Figure 3). This implies that the effect of direct mutation pressure on the codon usage bias of the six lineages is only 6.02%, 4.74%, 27.34%, 0.33%, 1.124%, and 14.9%, respectively. Natural selection plays a critical role in shaping the codon usage bias, especially in the C/Sao Paulo lineage, with the contribution of natural selection being 99.67%. As shown in Figure 4B, natural selection also plays a dominant role in shaping the codon usage patterns of the M gene (95.22%).

Dinucleotide Abundance Plays an Important Role in Shaping the Codon Usage Bias of the HE Gene
We calculated the relative abundances of the 16 dinucleotides of the ICV HE and M gene coding sequences. The occurrence of dinucleotides was found to be non-random, and CpG (mean ± SD = 0.180 ± 0.030 for HE gene; 0.256 ± 0.027 for M gene) was extremely under-represented. Interestingly, the recently reported strains (C/Bovine/Montana/12/2016) had higher CpG dinucleotide relative abundances than other ICV strains in both the HE (0.281) and M gene (0.350). However, given the limited available sequences, this needs to be further confirmed.

Dinucleotide Abundance Plays an Important Role in Shaping the Codon Usage Bias of the HE Gene
We calculated the relative abundances of the 16 dinucleotides of the ICV HE and M gene coding sequences. The occurrence of dinucleotides was found to be non-random, and CpG (mean ± SD = 0.180 ± 0.030 for HE gene; 0.256 ± 0.027 for M gene) was extremely under-represented. Interestingly, the recently reported strains (C/Bovine/Montana/12/2016) had higher CpG dinucleotide relative abundances than other ICV strains in both the HE (0.281) and M gene (0.350). However, given the limited available sequences, this needs to be further confirmed.

Host-Specific Codon Adaption Patterns of ICV
CAI analysis was performed to determine the gene expression levels in specific hosts. We calculated the CAI values of the six phylogenetic lineages based on the three currently identified hosts. Mean CAI values of 0.74 ± 0.0056, 0.61 ± 0.0059, and 0.66 ± 0.0056 were estimated for the overall HE gene in relation to H. sapiens, Sus scrofa, and Bos Taurus, respectively. The highest CAI value was for H. sapiens and the lowest for Sus scrofa. The C/Sao Paulo lineage had the highest CAI value (0.7469 ± 0.0028 for H. sapiens, 0.6212 ± 0.0030 for Sus scrofa, 0.6674 ± 0.0027 for Bos Taurus) compared to the other five lineages ( Figure 5A). RCDI values were calculated to compare the similarity in the codon usage of the HE gene and hosts. The mean of the RCDI values was highest for Sus scrofa (1.72 ± 0.038), followed by Bos Taurus (1.64 ± 0.034) and H. sapiens (1.50 ± 0.028), indicating the highest codon usage deoptimization for Sus scrofa. In terms of lineages, the C/Kanagawa and C/Sao Paulo lineages showed the highest and the lowest RCDI values in relation to the hosts (excluding the C/Taylor lineage for its limited number of sequence), respectively ( Figure 5A). Moreover, the results among different ICV lineages or hosts were significant, with a p cut off < 0.001, in relation to both CAI and RCDI. For the M gene, the lowest CAI value and the highest RCDI value were observed in swine, similar to the HE gene ( Figure 6A).  Figure 5A). RCDI values were calculated to compare the similarity in the codon usage of the HE gene and hosts. The mean of the RCDI values was highest for Sus scrofa (1.72 ± 0.038), followed by Bos Taurus (1.64 ± 0.034) and H. sapiens (1.50 ± 0.028), indicating the highest codon usage deoptimization for Sus scrofa. In terms of lineages, the C/Kanagawa and C/Sao Paulo lineages showed the highest and the lowest RCDI values in relation to the hosts (excluding the C/Taylor lineage for its limited number of sequence), respectively ( Figure 5A). Moreover, the results among different ICV lineages or hosts were significant, with a p cut off < 0.001, in relation to both CAI and RCDI. For the M gene, the lowest CAI value and the highest RCDI value were observed in swine, similar to the HE gene ( Figure 6A).

Sus scrofa Induced the Stronger Selection Pressure on ICV
SiD analysis was performed to demonstrate whether the host (H. sapiens, Sus scrofa, Bos Taurus) influences the codon usage patterns of the HE gene coding sequences ( Figure 5B). The SiD value was highest in Sus scrofa (0.139), followed by Bos Taurus (0.129) and H. sapiens (0.108), indicating that swine exerted a stronger selection pressure than bovine and human. Additionally, the C/Kanagawa lineage displayed the highest SiD value, excluding the C/Taylor given its limited number of sequences, and the C/Sao Paulo lineage had the lowest SiD. The differences of SiD values among ICV lineages or hosts were significant with a p cut off < 0.001. A similar pattern was observed for the M gene ( Figure  6B).

Sus scrofa Induced the Stronger Selection Pressure on ICV
SiD analysis was performed to demonstrate whether the host (H. sapiens, Sus scrofa, Bos Taurus) influences the codon usage patterns of the HE gene coding sequences ( Figure 5B). The SiD value was highest in Sus scrofa (0.139), followed by Bos Taurus (0.129) and H. sapiens (0.108), indicating that swine exerted a stronger selection pressure than bovine and human. Additionally, the C/Kanagawa lineage displayed the highest SiD value, excluding the C/Taylor given its limited number of sequences, and the C/Sao Paulo lineage had the lowest SiD. The differences of SiD values among ICV lineages or hosts were significant with a p cut off < 0.001. A similar pattern was observed for the M gene ( Figure 6B).

Selection Analysis of the ICV HE Gene
The relative rate of non-synonymous and synonymous mutations (dN/dS) for the ICV HE gene was 0.147. Site-by-site selection analysis revealed three positive amino acid sites: 176, 194, and 198 (numbering does not include the cleaved signal peptide, amino acid positions 1 to 14) under positive selection, with amino acid sites 194 and 198 supported by four algorithms and site 174 supported by three algorithms, expect for SLAC ( Table 2). The site-by-site selection analysis results of the rest of the amino acid can be seen in Table S3 (Supplementary materials 2).

Structural Analysis of Sites under Positive Selection
To speculate what the role of the selected amino acid residues for the evolution of ICV might be, we labelled their location within the crystal structure of the ectodomain of the HE protein from the prototype human Johannesburg strain of ICV, which was visualized using PyMOL [47]. All three amino acids are located on the surface of the molecule in the globular head domain of the HEF1 subunit near the receptor binding site. Residues 194 and 198 (shown as red spheres) are situated in a loop region at the very distal part of HE. This region shows amino acid substitutions between the otherwise very conserved lineages of human ICVs [48] and are part of antigenic sites A-3 and J-1 of HE of human ICV [49]. Asparagine 194 is changed to (negatively charged) glutamate in the Taylor strain of human ICV and to an (also negatively charged) aspartate residue in other human isolates. In ICV bovine isolates, residue 194 is deleted and early isolates from pigs contain a glutamate at this position, which in later isolates is reverted to glutamine. The positively charged lysine 198 is located in the same loop and is replaced by a negatively charged glutamate in the bovine HE and early porcine isolates of ICV, but exchanged by a lysine in later porcine isolates.
Residue 176 is located in a loop below the receptor-binding site. It contains asparagine in the HE of most isolates, but sometimes changes to an aspartate, for example, in the Miyagi clade of human ICV. Residue 176 is adjacent to the N-glycosylation site asparagine 175 (shown as an orange stick in Figure 7B). There is experimental evidence showing that this site is used, at least in the AA/50 strain that contains an asparagine at position 176 [50]. Residue 176 is part of the N-glycosylation motif (asparagine, any amino acid, serine, or threonine), in that case asparagine 175, asparagine 176, and serine 177. Since every amino acid (except a proline) is accepted by the oligosaccharyltransferase at that position, it seems unlikely that the mutation from asparagine to aspartate prevents the attachment of a carbohydrate to asparagine 175. Figure 7B). There is experimental evidence showing that this site is used, at least in the AA/50 strain that contains an asparagine at position 176 [50]. Residue 176 is part of the N-glycosylation motif (asparagine, any amino acid, serine, or threonine), in that case asparagine 175, asparagine 176, and serine 177. Since every amino acid (except a proline) is accepted by the oligosaccharyltransferase at that position, it seems unlikely that the mutation from asparagine to aspartate prevents the attachment of a carbohydrate to asparagine 175.

Discussion
ICV was first identified in humans in the United States [1] and later also in swine and cattle. ICV has a worldwide distribution, which is demonstrated by its high rate of seroprevalence, and may pose a threat to public health [4,11]. However, comprehensive studies on ICV are still limited. Recently, the HE genes of ICV isolates from 39 cases that tested positive from December 2014 to February 2016 in the United States were sequenced [11] and were included in our study after collapsing identical sequences. During the evolution of ICV, according to the HE gene, six lineages evolved, consistent with previous studies [20]. The HE sequences reported between 2014 and 2016 mainly clustered with the C/Sao Paulo and C/Kanagawa lineages, supporting a previous study that only the C/Sao Paulo and C/Kanagawa lineages have circulated since 2004. An exception is the C/Bovine/Montana/12/2016 strain, which is part of the C/Mississippi lineage [20]. The prevalence and recurrence of multiple lineages, together with cross species transmission, raises public health concerns. In this study, the codon usage patterns of the ICV HE and M genes were analysed to better understand the processes that determine ICV spread following a host range shift and to comprehend the genetic evolution process under the host pressure. This will ultimately assist the development of prevention strategies.
As previously reported, codon usage bias can be greatly influenced by the overall nucleotide composition [51]. The results of nucleotide composition showed that the ICV HE coding sequences

Discussion
ICV was first identified in humans in the United States [1] and later also in swine and cattle. ICV has a worldwide distribution, which is demonstrated by its high rate of seroprevalence, and may pose a threat to public health [4,11]. However, comprehensive studies on ICV are still limited. Recently, the HE genes of ICV isolates from 39 cases that tested positive from December 2014 to February 2016 in the United States were sequenced [11] and were included in our study after collapsing identical sequences. During the evolution of ICV, according to the HE gene, six lineages evolved, consistent with previous studies [20]. The HE sequences reported between 2014 and 2016 mainly clustered with the C/Sao Paulo and C/Kanagawa lineages, supporting a previous study that only the C/Sao Paulo and C/Kanagawa lineages have circulated since 2004. An exception is the C/Bovine/Montana/12/2016 strain, which is part of the C/Mississippi lineage [20]. The prevalence and recurrence of multiple lineages, together with cross species transmission, raises public health concerns. In this study, the codon usage patterns of the ICV HE and M genes were analysed to better understand the processes that determine ICV spread following a host range shift and to comprehend the genetic evolution process under the host pressure. This will ultimately assist the development of prevention strategies.
As previously reported, codon usage bias can be greatly influenced by the overall nucleotide composition [51]. The results of nucleotide composition showed that the ICV HE coding sequences were AU-rich, especially in the third position of synonymous codons. RSCU analysis also indicated that the ICV HE gene shows great codon usage bias towards A-and U-ended codons. It is generally accepted that directional mutation pressure explains the interspecific variation of nucleotide composition, which is primarily affected by the bias of AU/GC content [35]. Nevertheless, natural selection basically involves the selection of a specific set of codons, for instance, codons are preferred because of the host tRNAs abundance, translation selection, or the reduction of dinucleotides (CpG), which activate the host innate immunity [52][53][54][55]. As previously reported for influenza A virus, which originated from birds and has been replicating in humans for many years, the virus reduces its CpG dinucleotide frequency in its genome by directional selection [53]. Likewise, our dinucleotide analysis showed that ICV has adapted to the human host since it exhibits an extremely low CpG dinucleotide abundance in the HE gene and since all the codons containing CpG dinucleotides were under-represented. These results may indicate that escape from host innate immunity shapes the codon usage patterns of the ICV HE gene. The RSCU results also indicate that extreme differences exist in the 18 preferred codons between the HE gene and its known hosts. Influenza D virus (IDV), a novel genus of the Orthomyxoviridae, was first identified in 2011 and also contains seven genome segments [56]. Phylogenetic analysis of PB1 gene indicated that ICV and IDV had a common ancestor [57]. However, the RSCU results of the IDV HEF gene showed that the codon usage patterns and the most preferred codons were very similar to its known hosts for the majority of codons [58]. It has been proposed that a consistent codon usage between viruses and their hosts allows a higher translation efficiency of the corresponding amino acids, and thus it is expected that extremely different RSCU will reduce the translation efficiency of the viral protein, but may cause it to fold properly [59]. Therefore, we hypothesized that ICV may represent a lower translation efficiency in its hosts than IDV, but may fold the viral protein properly.
In spite of the overall ENC value being higher than 35 (44.15± 0.92), the HE gene of ICV displayed a higher codon usage bias than many other RNA viruses, such as Zika virus [60], H1N1pdm IAV [61], H5N1 influenza virus [62], and H3N2 canine influenza virus [63]. It is hypothesized that the low codon bias of RNA viruses facilitates efficient replication by reducing the competition between the virus and the host during protein synthesis [51]. The C/Sao Paulo lineage, currently circulating, had the highest ENC values and thus the lower codon usage bias, which may favour the maintenance of successful replication. In addition, the ENC value of the HEF1 subunit (45.62 ± 0.77) is higher than that of the HEF2 subunit (42.42 ± 1.43) and the M gene (45.19 ± 0.67) has a higher ENC value than the HE gene, for reasons that remain to be further investigated. A possible explanation is that it is associated with their respective functions.
PR2 analysis and ENC-plot results revealed that natural selection plays a critical role in shaping the codon usage bias. Neutrality analysis revealed the specific magnitude of these factors in shaping codon usage bias, indicating the predominant role of natural selection in the six lineages of the ICV HE gene, especially in the C/Sao Paulo lineage. In order to further confirm the role of natural selection, CAI analysis was performed. CAI is frequently used to measure gene expression and assess the adaptability of viral genes to their hosts [64]. We found that the greatest adaption of ICV was to H. sapiens, followed by Bos Taurus and finally Sus scrofa, consistent with RCDI analysis. A low RCDI indicates good adaption to the host, while a high RCDI value implies that the virus might show a low replication rate with alternative codon usage patterns or some viral genes expressed in the latent period [40]. SiD analysis revealed that swine exerted more selection pressure on the codon usage pattern compared to humans. As previously reported, ICV may be transmitted between humans and pigs in nature [5,65]. Furthermore, the viruses isolated from swine or cattle are closely related to human ICV isolates. However, there is still no experimental evidence for interspecies transmission, and we have no data to show that ICV is a zoonotic pathogen. Whether ICV infection of swine and bovine species originated from humans and the role of bovine in the evolution of ICV remain to be further investigated. The C/Sao Paulo lineage, a current circulating lineage, revealed a greater adaptability than the other five lineages. Of note, the lowest adaption was found for the C/Kanagawa lineage, excluding the C/Taylor strain. The C/Kanagawa lineage was considered extinct because it had not been isolated since 1977. However, it re-emerged in Japan in 1996 (C/Miyagi/9/96) and acquired genes encoding internal proteins from epidemic strains of the C/Yamagata lineage [66][67][68], indicating that some difference exists in the evolution process between the C/Sao Paulo and the C/Kanagawa lineages. Selection analysis revealed three sites under positive selection in HE: residue 176, 194, and 198. The residues under positive selection are all located in loop regions in the globular head domain of HE surrounding the receptor binding site. Since the loop containing the residues 194 and 198 is unlikely to be of structural or functional relevance for HE and since these two amino acids are located in one of the known antigenic epitopes of HE from human ICV, it seems likely that the amino acids were selected by neutralizing antibodies, i.e., virus variants with the mentioned changes could escape the antibody response of their hosts. Residue 176, which is located in a loop below the receptor binding site, is part of an N-glycosylation motif NXS/T, but the substitution at the variable X-position is unlikely to have an effect on glycosylation at asparagine 175. However, this region constitutes the antigenic site A-1 in human ICV HE [49] and thus it seems that escape from a host antibody response also selects the amino acid at this site. However, note that no significant positive selection was identified for the HE of human ICV isolates and only a few mutations affecting its antigenicity have been induced during evolution [43]. Thus, it also seems possible that the selected residues might subtly affect binding of HE to its receptor determinant 9-O-acetyl-N-acetylneuraminic acid [16,50].
In conclusion, this study showed that the codon usage bias of the ICV HE gene is low, but higher compared to many other RNA viruses. The major factor that has contributed to shaping codon usage is natural selection. Moreover, dinucleotides and mutation pressure also influence codon usage. Furthermore, we discovered three positive selected sites, which seem to be related to escape from the antibody responses. This is the first detailed genetic analysis of ICV devoted to extending our understanding of the mechanisms that contribute to ICV evolution.
Supplementary Materials: The following are available online at http://www.mdpi.com/1999-4915/11/2/167/s1, Table S1: Explanation of the variation of ICV HE gene by axis, Table S2: The relative synonymous codon usage (RSCU) patterns of ICV M gene and its hosts, Table S3: The site-by-site selection analysis results of the rest of amino acids. Data S1: Sequence data of influenza C viruses HE gene and M gene used in this study. Data S3: The nucleotide compositions analyses of HE and M gene used in this study.