Influenza A Hemagglutinin Passage Bias Sites and Host Specificity Mutations

Animal studies aimed at understanding influenza virus mutations that change host specificity to adapt to replication in mammalian hosts are necessarily limited in sample numbers due to high cost and safety requirements. As a safe, higher-throughput alternative, we explore the possibility of using readily available passage bias data obtained mostly from seasonal H1 and H3 influenza strains that were differentially grown in mammalian (MDCK) and avian cells (eggs). Using a statistical approach over 80,000 influenza hemagglutinin sequences with passage information, we found that passage bias sites are most commonly found in three regions: (i) the globular head domain around the receptor binding site, (ii) the region that undergoes pH-dependent structural changes and (iii) the unstructured N-terminal region harbouring the signal peptide. Passage bias sites were consistent among different passage cell types as well as between influenza A subtypes. We also find epistatic interactions of site pairs supporting the notion of host-specific dependency of mutations on virus genomic background. The sites identified from our large-scale sequence analysis substantially overlap with known host adaptation sites in the WHO H5N1 genetic changes inventory suggesting information from passage bias can provide candidate sites for host specificity changes to aid in risk assessment for emerging strains.


Introduction
Influenza pandemics typically occur when an influenza virus from animals infects humans and evolves the capacity for human-to-human transmission [1][2][3][4][5]. The viral surface protein hemagglutinin (HA) is critical for recognizing the respective host cell receptors [6]. Previous studies in ferrets have shown that as few as 3-4 HA mutations (along with a mutation in the influenza virus polymerase complex) can be sufficient to enable a highly pathogenic A/H5N1 avian influenza virus to become mammal-to-mammal transmissible [7,8]. Such experiments are controversial [9] and cannot be performed on a wide variety of strains due to cost and ethical issues; indeed, such experiments produce results which cannot be extrapolated across relatively small changes in the genetic background [10]. Alternative data sources that could complement our understanding of genetic sites important for host specificity could provide valuable information about mutations associated with virus adaptation to new hosts. Commonly observed in laboratory practice, influenza viruses mutate at specific positions when cultured or passaged in different cell types (e.g., from different species). These passage bias mutations are assumed to increase fitness of the virus in the respective species and are sometimes necessary to grow in culture at all. Although passage bias mutations occur in almost all studies involving influenza virus culture, only a few studies have investigated passage bias directly and were limited to specific contexts such as effects on inferring phylogenies for one subtype [11,12]. Since passage annotation data is available for tens of thousands of viruses, we wanted to investigate systematically over common subtypes which positions in HA show passage bias and if this information could be linked to host specificity mutations which could also aid in pandemic risk assessment in the future [13][14][15][16][17].

Sequence and Passage Data Sources and Analysis
Passage annotation is available in GISAID's EpiFlu database, the most complete source for influenza sequences, but annotation submission has historically been heterogeneous without controlled vocabulary [18]. For example, 'M1', 'MDCK3' or 'MDCK 2 + 2' all refer to passage in Madin-Darby Canine Kidney (MDCK) cells. Based on frequency of occurrence and relevance for interpretation in our host specificity context, we classified available ambiguous passage annotations into four broad categories: (i) embryonated hen's eggs (denoted as "EGG"), (ii) Madin-Darby canine kidney cells ("MDCK"), (iii) α-2,6-sialyltransferase enriched MDCK cells (denoted as "SIAT") and (iv) original clinical samples (denoted as "ORI"). Table 1 shows the number of classified passage categories for HA sequences from different subtypes (about 80,000 in this analysis). Full length HA protein sequences from GISAID with passage information that can be classified in the four groups above were aligned with MAFFT [19] and odds ratios for different culture categories were calculated for every pair of amino acids occurring at the same position. Only amino acids with at least 20 occurrences at the respective position were analysed to avoid low frequency biases. Positions with at least one relevant odds ratio of ≥2 or ≤ 1/2 were considered candidate passage bias sites for further analysis. To account for the fact that different subtypes might be preferentially grown in different cell types we calculated the odds ratios only within each subtype. Furthermore, we verified that use of different passage systems in these data does not fluctuate significantly over time to avoid temporally biased sequence variants appearing as passage sites.

Same Strain Different Passage (SSDP) Analysis
Some viral strains are passaged in more than one cell type and represent prime candidates to study passage mutations in context of the same strain genomic backbone. For each of the viral subtypes, we searched for isolates with the same strain name that has been passaged in both EGG cells and mammalian cells (MDCK/SIAT or ORI for human viruses). We counted the amino acid positions that are different between the pair of differentially passaged isolates with the same isolate name. For some strains, there can be several isolates that are passaged in egg cells and several isolates that are passaged in mammalian cells. For such cases, we consider all possible combinations of Egg-mammalian pairs. To reduce bias caused by over counting a single position for these combinations, mutation at a site is counted only once for any given isolate with the same strain name. Hence, the final count of mutations at a given site represents the number of different strains where we found a mutation at the site when differentially passaged in egg and mammalian cells. We excluded H5 data because there were too few examples and this allows to test if analysis without H5 still allows to find host specificity sites also relevant for H5.

Pair Analysis
In addition to understanding passage bias for single sites, we examined if the association between sites is affected by passage. We tested the association between all the pairs of sites, and considered three pairs of passages, MDCK vs. ORI, MDCK vs. EGG, and ORI vs. EGG, by the following conditional logistic regression method conditioned on time: where AA1 and AA2 represent two sites that are considered and AAi is equal to 1 if the observed amino acid is the same as the most frequent amino acid variant and 0 if it is the same as the second most frequent amino acid in a given site i. The analysis was performed in R using the clogit function in the survival package [20]. We only showed the significant results after Bonferroni correction. The unit of time is month. If β 3 is different from 0, it suggests that the association of AA2 with AA1 differs between two passages. The method assumes that the same proportions of different aa combinations (denoted by P 00 , P 01 , P 10 , P 11 ) were passed into different cell types, but does not assume that the sample sizes were the same across passages or across time. However, it is possible that the violation of the assumption was related to the balance of the sample sizes from different cell types (for example, if the ratio of the sample sizes of EGG to MDCK was particularly high at one time point, it is possibile that one study passed many samples to EGG and these samples might have very different P 00 , P 01 , P 10 , P 11 from the rest of the population) and therefore we excluded the time periods where the sample size ratio between two passages fell farther than two standard deviations away from the mean in order to minimize this possibility. To increase statistical power, we updated the analysis sets for H1 and H3 subsets (Table S2).

Structural Overview of Passage Bias Sites in HA Across Different Influenza A Subtypes
Based on a highly permissive threshold of odds ratio >2 in any culture cell type with any virus subtype, we identified 472 candidate passage bias sites in HA which would cover 83% of all HA positions. To understand how to filter and prioritize these sites, we analysed them in more detail regarding independence of genomic background and the structural positions of these substitutions relative to known functional sites, especially those previously associated with host specificity.
Within a given influenza subtype (e.g., pandemic H1, seasonal H1, H3, H5), the sites where passage bias was detected were mostly shared between different passage types ( Figure S1). This is not surprising since preference for one amino acid in eggs often means another amino acid at the same site is preferred in MDCK or original samples.
Passage bias sites were also frequently shared between different influenza subtypes (e.g. pandemic H1, seasonal H1, H3, H5). Interestingly, the extent of sharing ( Figure 1) appears to mirror the genetic distance of the subtypes relative to each other known from phylogenetic trees. For example, pandemic H1 shares more sites with H5 than H3. Furthermore, this allows consideration of passage sites shared among multiple subtypes as universal consensus to avoid subtype-specific biases such as culture type preference or temporal variations. Scoring each passage bias site by the number of different subtypes and different passage cell types and calculating the geometric mean of their odds ratios gave a ranking of sites that were dominantly involved in passage bias, independent of subtype (Table S2). Next, we wanted to compare if the reliably identified sites with geometric mean odds ratio >5 are robust across different collection/time periods and compared sites using the same described analysis for data until May 2013 against those from June 2013-May 2019 and found good agreement with 45 of 52 (87%) sites of the smaller set being shared ( Figure 2). Occasionally, virus strains can be found in the database where the same strain has been passaged in different cell types (Same Strain Different Passage = SSDP). Especially relevant for us is if the different passage was between mammalian (MDCK/SIAT) and avian (EGG) cell types. These examples are of great value for this study as they represent a clean signal of different passage conditions for the very same starting strain or genomic backbone. Considering only H3, H1s and H1p since H5 data were limited, there are 59 sites shared between at least two subtypes ( Figure S3). These conserved direct passage sites are then compared against the two time period sets from the indirect passage sites from large-scale statistical analysis and define our proposed set of robust common passage bias sites. 19 passage bias sites are shared between all three data sets and 54 are shared between at least two (Table 2 and Figure 2). These 54 common passage bias sites can be found across the HA structure (Figures 2 and 3) but with higher density in the globular head domain. Interestingly, the distribution of dominant passage sites within the globular head domain was not limited to the immediate receptor binding pocket but rather spread around the globular head domain which also is involved in docking to cells and includes antigenic regions consistent with earlier findings [12]. A secondary passage site cluster was detected in a region required to undergo structural changes for pH-dependent fusion [21,22]. A third cluster is not visible in the structure but is located in the N-terminal signal peptide region. This region seems experimentally understudied for influenza [23] but mutations in a signal peptide can in principle alter secretion efficiency and therefore surface expression. All 3 main clusters make mechanistic sense in terms of what we know about importance for influenza HA biology.
Some of the sites with low geometric mean (below 5) also tend to have a lower number of occurrences in the 18 possible subtype and cell combinations, indicating that although these sites met the criteria of being present in at least 2 out of 3 of the analyses, they may not be as dominant across different subtypes or were just not found in the SSDP analysis. On the other hand, some positions such as 183 and 187 with a relatively high geometric mean of 11.88 and 16.16 respectively were not listed in Table 2 because they were not found in the SSDP analysis and they did not meet the (geometric mean >5) criteria for one of the time period analyses.

Epistasis of Site Pairs Differs by Passage Types
In order to investigate if pairs of sites show passage type dependent co-occurrence, we used a conditional logistic regression approach, conditioned on time, to test for association between sites in seasonal human influenza A viruses (H1 and H3) in regard to passage type. We identified a list of site pairs with significantly different association between passages (Table 3 and Figure 3). The majority of these were between ORI and EGG, suggesting that epistasis might differ between mammalian and avian hosts.

Passage Adaptations cannot Easily Predict Directionality of Host Specific Adaptation Mutations of Specific Strains in Ferret-Adaptation Studies
We compared the identified HA passage bias for seven mutations identified in the gain-of-ferret-transmissibility studies on H5N1 viruses by Herfst et al and Imai et al [7,8]. Passage bias data correctly matched direction of avian to mammalian adaptation (as approximated by EGG vs. MDCK preference) for four mutations (N172D, T174A, Q240L and G242S (H1pdm numbering)), not enough passage data was available for another two (H120Y and N238K) and the opposite/wrong adaptation direction was predicted for only one mutations (T333I). It seems H5 mammalian adaptation is not easy to fully predict with passage data as most H5 strains are grown in eggs and few mammalian examples are available and extrapolating from other subtypes can only be a rough approximation (Table S4). Although directionality of adaptation appeared difficult to predict fully, suggesting a role of the involved structural positions for adaptation without strongly postulating direction may be possible.

Passage Adaptation Sites Overlap with Previously Known H5 Sites That Affect Host Specificity
The purpose of this study was to test if passage data can help in identifying mutations that may be linked to host specificity changes and hence pandemic risk. Therefore, we compared the different derived lists of passage sites for overlap with the compiled list of host specificity sites in the H5 genetic changes inventory [24]. Figure 4 shows that, as expected, H5 data as independent set is too small to give rise to significant odds ratios for most relevant sites while data even from seasonal viruses, because of their quantity, does get substantial overlaps between common passage and known H5 host specificity sites. Finally and supporting the validity of the consensus approach, the most commonly shared 19 mutations from the overlap of three sets (red in Figure 2 structure) gave the highest odds ratio of 11.85 [95% c.i.3.81-36.9] and significant Fisher exact p-value of 0.0003 when comparing common passage sites with the H5 inventory sites (Figure 4).

Discussion
Amino acid changes due to passage adaptations occur commonly in viruses grown in laboratory culture. They have been shown to enable the virus to switch from growing well in the original host cells to the lab cultured cells [25][26][27]. While multiple factors and genes determine the ability of influenza viruses to cross the host species barrier [16,28] this study focuses on hemagglutinin which is the main surface protein recognizing the host receptors. Additonally, in hemagglutinin, overlap of antigenic regions and passage sites might lead to reduced efficacy of egg-grown vaccines [12,29]. In this study, we show that passage adaptation sites in influenza viruses are strongly associated with sites involved in host specificity shifts identified previously in animal experiments and from human cases. Across different influenza A subtypes, we identified three main passage bias clusters where adaptation most commonly occurs: (i) receptor binding site and the globular head region, (ii) region that undergoes pH-dependent structural changes and (iii) the N-terminal signal peptide region (Figure 2). When the virus switches from its original host environment to the cell culture environment, many things change: the cell surface receptors and its concentration may be different, pH may be different, the intracellular transport and signal peptide guided secretion may differ in efficiency. Hence, it is not surprising that the most common passage bias sites occur at the observed sites, where they can help the virus adapt to the new cellular environment.
Our study analyses a large number of passage data across multiple influenza subtypes and their associated sequences but the statistical power still depends on the sample size. There are limitations to the approach because, although the sites undergoing mutations necessary for host adaptation can be predicted, the exact residues necessary for the virus to adapt to the new host could not be fully predicted (4 out of 7 correct, 1 wrong, 2 without prediction because of too little data). For the latter, understanding the role of the genetic background of the virus may improve prediction capacity as hinted by the epistatic relations among passage sites we also observed. Alternatively, since the mammalian cell type used here was canine kidney cells, while the ultimate target of interest is human respiratory epithelia and the ferret experiments involved still a third mammalian species, the lack of overlap may point to the limitations of generalizing across mammalian species and/or cell types. It will be interesting to confirm the sites as well as epistatic links through analysis on even more data in future as more data becomes available. Another limitation of the current study is the use of only the hemagglutinin protein to examine passage bias. In spite of the fact that HA plays a key role in the life cycle and host adaptation of the influenza virus, mutations in other viral proteins such as in the polymerase complex (PB2, PB1 and PA), as well as PA-X and the HA-NA balance are also known to play a role in host adaptation and the agreement, or not, with passage sites should be subject for future studies. Additionally, it is unclear how other experimental variables such as temperature and infective dose may affect passage mutations as well as the different number of passage steps.
Above limitations aside, the sites identified from large scale sequence analysis here significantly overlap with known host adaptation sites and show that passage bias can provide candidate sites for host specificity changes to aid in risk assessment for emerging strains [16,17] when combined with a tool such as the FluSurver (https://flusurver.bii.a-star.edu.sg) that allows automated mutation annotation of query sequences with phenotypically relevant information. Last but not least, analysis of this readily available and growing data set of sequences with passage adaptation seems to be an interesting cheap and safe approach for identifying sites associated with adaptation to host-shifts.
Supplementary Materials: The following are available online at http://www.mdpi.com/2073-4409/8/9/958/s1, Figure S1: Overlap of passage bias sites between different passage types. Table S1: Updated passage data set for seasonal influenza A site pair analysis. Table S2: HA numbering conversion table for influenza subtypes H1, H3 and H5. Figure S2: Distribution of passage bias sites across the HA sequence comparing data from two time periods. Figure S3: Overlap of SSDP (same strain different passage) sites between the 3 seasonal subtypes. Table  S3: Overlap of selected sets with WHO H5 genetic change inventory.