1. Introduction
The Stockholm Paradigm aims to recover Darwinian ideas in order to discuss evolution, claiming that the historical construction of Neodarwinism lost essential information on the study of diversification of life [
1,
2,
3]. The authors who proposed this Paradigm aimed to discuss the coevolution and emergence of diseases as processes that imply the synergy of capacity and opportunity in determining biological interactions.
Capacity is defined as the whole ability of an organism to utilize resources, either as a free-living agent or as a symbiont, given by the Information Space inherited by their history, including actual genetic, epigenetic and developmental information, as well as information acquired by other means, such as education. On the other hand, opportunity is defined as the coincidence of resource availability (either environmental or host-wise) in time and space; thus, the realization of capacity is conditioned by the opportunity [
4].
The authors extensively discuss the framework underlying the Stockholm Paradigm, which builds upon earlier Darwinian proposals and challenges the general Neodarwinian view of the one host–one pathogen relationship. It relies on three general pillars: (1) ecological fitting [
5,
6,
7], (2) oscillation hypothesis [
8], and (3) taxon pulse [
9,
10]. Ecological fitting approaches the initial stages of an interaction, describing the interaction in terms of realization of capacity, limited opportunity, selection of phenotypic variations, and establishment of an interaction.
Oscillation hypothesis and taxon pulse, on the other hand, address the phylogenetic and phylogeographic aspects of evolution. The former explores the transitional aspect of specialism and generalism, while the latter describes the spatiotemporal reality of evolution. In particular, it highlights the exploration of hosts by parasites, for example, by cycles of stability and perturbation, or better yet, through the oscillation of opportunity. The Stockholm Paradigm aims, therefore, to present evolution from the populational to the phylogeographic perspective [
1,
9,
10].
The Stockholm Paradigm had not been applied to host–virus coevolutionary scenarios prior to the emergence of COVID-19 [
11]. However, it has been extensively analyzed in the study of other infectious diseases. Therefore, this research aimed to expand the study of viral coevolution with its hosts under the scope of the Stockholm Paradigm by applying a new methodology and evaluating whether it can help understand and draw expectations on host-switching events in virus–host interactions.
In order to achieve this, one needs to choose a viral lineage that can be deeply analyzed according to capacity and opportunity, and to gather sufficient data on both parameters. Also, the viral lineage needs to be one that has been sampled, studied and monitored in depth and over a long period of time.
With that in mind, we opted to use Influenza A H1N1 virus lineage as a study model, since it has a long-lasting relationship with the human population (
Figure 1), and has been under the scope of the global public health since the beginning of the 20th century, marked by the 1918 Spanish flu [
12], followed by several epidemic and pandemic events, with a vast host oscillation among avian and mammalian taxa [
13].
For these reasons, Influenza H1N1 offers a large amount of public data that fits the concepts of viral capacity and transmission opportunity. These data allow us to discuss the evolutionary patterns and expectations the Paradigm presents and to better understand host–parasite coevolution, herein designated as host–Influenza coevolution, in the field of disease emergence risk assessment, as has been previously discussed for other diseases [
14,
15,
16].
It is important to note, however, that H1N1 was chosen as a study model and not as the sole model for this application. Our main purpose here is to develop a method for zoonotic prediction of Influenza viruses in a broader sense.
This work relies heavily on the assumptions of ecological fitting, where capacity and opportunity conditions precede and enable the emergence of new symbiont interactions. The idea here is to describe the capacity Influenza A H1N1 strains have to establish themselves in human populations by analyzing their epitope and substitution profile, coupled with the current ecological circumstances (available surveillance data) that allow this. By analyzing both capacity and opportunity, we expect to be able to visualize, and hierarchize, the sampled strains according to their emergence risk in humans, contributing to emerging infectious diseases preparedness.
2. Materials and Methods
In order to fulfill what is described by the Stockholm Paradigm—a synergistic view of all of the possible capacity information available for each strain as well as the opportunity scenario involved in an interaction—we needed to create a method which would allow us to analyze all these parameters at the same time, with no predetermined weight, direction or hierarchy to the information.
To obtain this global understanding of each strain, we applied the unsupervised learning method named Multiple Correspondence Analysis. Multiple Correspondence Analysis (MCA) is a dimensionality reduction method [
17] which allows us to obtain just what the Paradigm requires: an unbiased understanding of the different capacities of each strain. It also allows us to confront that information with the ecological context of the strain. By understanding the capacity variation in a group of strains, we expect to rank them according to their compatibility with the human host (using the Hierarchical clustering analysis—HCA), and filter, among those with capacity to use the human host, the strains with the spatial and temporal opportunities to do so, which would allow them to actually emerge in our species.
We will first describe the capacity and opportunity data we were able to collect from the Influenza Research Database (IRD), deposited in the Bacterial and Viral Bioinformatics Resource Center v.3.30.19 (BV-BRC). We will then discuss the data preprocessing and processing methods, followed by the statistical application of the aforementioned MCA.
2.1. Raw Data Collection
All the data used in this work was collected from the Influenza Research Database (IRD) [
18], which is currently part of the Bacterial and Viral Bioinformatics Resource Center v.3.30.19 (BV-BRC), a larger database that includes all bacterial and viral research databases, as well as archaea and eukaryotic hosts. The current database version can be accessed at BV-BRC [
19].
The data that describe viruses’ capacity to utilize their host will be herein described as (1) strain information; (2) epitope information and (3) substitutions information—or known genomic substitutions that are associated with capacity alterations, such as transmission capacity, replication change, infectivity in a host, and antiviral resistance, among other specifications. Opportunity will be described by (4) surveillance information, which includes variables that describe the ecological context of different H1N1 strains.
The whole preprocessing, processing and Multiple Correspondence Analysis (MCA) were performed using the JupyterLab platform with Python version 3.9, while Hierarchical clustering analysis (HCA) was applied through Google Colaboratory in order to provide access to some specific dependencies.
2.2. Data Importation
All the raw data collected from IRD was manually imported by filtering the database information for Influenza A/H1N1/avian, human, or mammalian hosts. Meanwhile, the genomic information utilized in the Epitope Occurrence section was collected by remotely accessing GenBank with the Entrez and SeqIO Python libraries, which locate and collect the genomic information through the GenBank ID, delivering the nucleotide and amino acid sequences of all analyzed segments in string format. See the Entrez documentation, as part of the Biopython package (Bio.Entrez).
The dataset used to characterize viral capacity includes (1) strain information, with strain name and GenBank ID for the 8 genome segments; (2) epitope information, with information on all linear peptides registered in the BV-BRC; and (3) substitution information, corresponding to all human, avian and mammalian genomic substitutions that have been identified in published works (indexed in PubMed) and are described as altering the viruses’ capacity to interact with their host. Capacity information was processed and evaluated using MCA and HCA.
In order to describe the opportunity for H1N1 strains to emerge in our species, we collected surveillance information from the original BV-BRC dataset. This dataset includes all information on data collection (date and location in terms of coordinates, city, state/province, and country), and host identification and condition (in terms of natural state, capture mode, and health at the time of sampling).
The raw version of this data was obtained in January 2023 through the following link (Raw Strain Data IRD, [
19]).
2.3. Data Processing Protocol
We imported each preprocessed dataset as will be described—and quantified—below (the method workflow is described in
Figure 2).
For the strain information, by the end of the data processing pipeline, we obtained 18,045 H1N1 strains, with the GenBank ID for all 8 segments of the strain. For the epitope information, we obtained the genomic sequence of each of the 8 segments of the 18,045 strains and translated it into the amino acid sequence for detection of the 8595 epitopes given by the BV-BRC database. We then filtered the detected epitopes with a 5% occurrence threshold in order to avoid low-frequency epitopes that could bias our analysis. By the end of this process, we had imported 6488 epitopes into our MCA.
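The detection and filtering step described above can be sketched as follows. This is a minimal illustration, not the pipeline used in this work: it assumes that detection is an exact substring match of each linear peptide against the translated segments, and that an epitope is kept when it occurs in at least the threshold fraction of strains. The function names, strain names, sequences and peptides are hypothetical.

```python
def epitope_occurrence(proteins_by_strain, epitopes):
    """Mark, for every strain, which linear peptides occur in any of its proteins."""
    return {
        strain: {ep: int(any(ep in seq for seq in proteins)) for ep in epitopes}
        for strain, proteins in proteins_by_strain.items()
    }


def filter_by_occurrence(table, min_fraction=0.05):
    """Keep only epitopes detected in at least `min_fraction` of the strains."""
    n_strains = len(table)
    epitopes = list(next(iter(table.values()))) if table else []
    kept = [
        ep for ep in epitopes
        if sum(row[ep] for row in table.values()) / n_strains >= min_fraction
    ]
    return {strain: {ep: row[ep] for ep in kept} for strain, row in table.items()}


# Toy data (hypothetical sequences and peptides, not taken from BV-BRC).
proteins_by_strain = {
    "strain_A": ["MKAILVVLLYTFATANA", "MASQGTKRSYEQMETDG"],
    "strain_B": ["MKAILVVLLYTFTTANA", "MASQGTKRSYEQMETDG"],
    "strain_C": ["MKTIIALSYILCLVFAQ", "MASQGTKRSYEQMETGG"],
    "strain_D": ["MKTIIALSYILCLVFAQ", "MASQGTKRSYEQMETDG"],
}
epitopes = ["YTFATANA", "SYEQMETDG", "ALSYILCL"]

occurrence = epitope_occurrence(proteins_by_strain, epitopes)
# A 50% threshold is used here only because the toy table has four strains.
filtered = filter_by_occurrence(occurrence, min_fraction=0.5)
print(sorted(next(iter(filtered.values()))))
```

In this toy table the first peptide occurs in a single strain and is dropped, which mirrors how the 5% threshold removes rare epitopes before the MCA.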
In order to process the substitution information, we imported each substitution subgroup (avian, human and mammalian) and merged them separately, because we did not require that a strain present in the avian substitution table, for example, also be present in the human or mammalian tables. It is important to note that when we merged the epitope and substitution datasets, we identified the following Influenza strain host groups: 3625 non-human mammalian strains, 13,335 human strains and 608 avian strains, totaling 17,568 strains. The remaining strains (of the 18,045 originally imported in the strain information) were not present in the substitution table and were excluded from the analysis.
The final dataset—herein referred to as the Capacity Table—included 17,568 strains and 6521 columns which correspond to the cumulative capacity data and the strains that are present on both datasets (Genbank ID, epitope occurrence and substitution data). This information was the input used in our subsequent analyses.
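The merging logic behind the Capacity Table can be illustrated with a short sketch. It assumes that the three host-specific substitution tables are stacked (a strain only needs to appear in one of them), that a substitution not reported for a strain is coded as 0, and that the result is then inner-joined with the epitope table by strain name. All names and values below are hypothetical.

```python
def stack_host_tables(*tables):
    """Stack host-specific substitution tables; a strain needs to be in only one."""
    columns = sorted({col for t in tables for row in t.values() for col in row})
    stacked = {}
    for t in tables:
        for strain, row in t.items():
            # Assumption: a substitution not reported for a strain is coded as 0.
            stacked[strain] = {col: row.get(col, 0) for col in columns}
    return stacked


def build_capacity_table(epitope_rows, substitution_rows):
    """Inner join: keep only strains present in both tables, columns side by side."""
    shared = sorted(set(epitope_rows) & set(substitution_rows))
    return {s: {**epitope_rows[s], **substitution_rows[s]} for s in shared}


# Toy tables (hypothetical strains, epitopes and substitutions).
avian = {"strain_A": {"sub_1": 1}}
human = {"strain_B": {"sub_2": 1}}
mammalian = {"strain_C": {"sub_1": 0, "sub_3": 1}}
epitope_rows = {
    "strain_A": {"ep_1": 1},
    "strain_B": {"ep_1": 0},
    "strain_D": {"ep_1": 1},
}

substitutions = stack_host_tables(avian, human, mammalian)
capacity = build_capacity_table(epitope_rows, substitutions)
print(sorted(capacity))  # strains kept for the downstream analyses
```

Strains found in only one of the two inputs (here strain_C and strain_D) drop out of the join, which is the same mechanism by which the strain count falls from 18,045 to 17,568.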
2.4. Statistical Analysis
2.4.1. Multiple Correspondence Analysis (MCA) Method
We performed MCA on the Capacity Table in order to derive a set of continuous and uncorrelated Principal Components (PCs), which summarize the information of all parameters at once in as few components as possible [
17].
MCA is an unsupervised learning method suitable for reducing the dimension of data tables represented by individuals and their answers to a set of categorical variables. It can be applied to study individuals, variables, or categories. We applied MCA to the study of parameter categories to be able to represent Influenza strains according to the commonness of their parameter states (or commonness of capacity information), which was represented by a cloud of category points in a Euclidean space [
17].
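For readers unfamiliar with the computation, MCA can be sketched as a correspondence analysis of the one-hot (indicator) matrix of the categorical table, solved by singular value decomposition. The code below is a minimal sketch under that standard formulation; it assumes NumPy is available, uses a hypothetical six-strain toy table, and is not the implementation applied in this work.

```python
import numpy as np


def indicator_matrix(table):
    """One-hot encode a list of categorical rows (one column per category)."""
    columns, labels = [], []
    for j in range(len(table[0])):
        for level in sorted({row[j] for row in table}):
            labels.append((j, level))
            columns.append([1.0 if row[j] == level else 0.0 for row in table])
    return np.array(columns).T, labels


def mca(indicator):
    """Correspondence analysis of an indicator matrix via SVD."""
    P = indicator / indicator.sum()           # correspondence matrix
    r = P.sum(axis=1)                         # row (strain) masses
    c = P.sum(axis=0)                         # column (category) masses
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)    # standardized residuals
    U, sing, Vt = np.linalg.svd(S, full_matrices=False)
    keep = sing > 1e-10                       # drop numerically null axes
    eigenvalues = sing[keep] ** 2             # principal inertias
    row_coords = U[:, keep] * sing[keep] / np.sqrt(r)[:, None]
    col_coords = Vt.T[:, keep] * sing[keep] / np.sqrt(c)[:, None]
    return eigenvalues, row_coords, col_coords


# Toy table: six hypothetical strains, three presence/absence variables.
rows = [
    ("yes", "yes", "no"),
    ("yes", "yes", "no"),
    ("yes", "no", "no"),
    ("no", "no", "yes"),
    ("no", "no", "yes"),
    ("no", "yes", "yes"),
]
Z, labels = indicator_matrix(rows)
eigenvalues, strain_coords, category_coords = mca(Z)
print(np.round(eigenvalues / eigenvalues.sum(), 3))  # share of inertia per axis
```

For an indicator matrix with Q variables and J categories in total, the eigenvalues sum to J/Q - 1, which gives a quick sanity check on any implementation.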
MCA interpretation is based on the estimated eigenvalues and eigenvectors. The former are positive quantities related to inertias, representing the total amount of explained variability (percentage of inertia) by a given PC; this is the information we utilize in order to consider how well a specific PC explained the data points’ variability in the Euclidean space. The eigenvalues order the PCs from the one that most explains the variability of the data to the one that explains the least [
20]. So, we retained for HCA those PCs that accumulated approximately 65% of explained variability in order to adequately identify subgroups among the Influenza strains that might have biological significance in the identification of potential zoonotic H1N1.
Eigenvectors represent the orientation of the PCs.
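The retention rule described above (keeping the leading PCs until roughly 65% of the inertia is accumulated) can be written as a short helper. This is an illustrative sketch; the eigenvalues below are hypothetical and the exact stopping rule (first component at which the cumulative share reaches the target) is an assumption.

```python
def components_to_retain(eigenvalues, target=0.65):
    """Return how many leading PCs are needed to reach `target` cumulative inertia."""
    total = sum(eigenvalues)
    cumulative = 0.0
    ordered = sorted(eigenvalues, reverse=True)
    for k, value in enumerate(ordered, start=1):
        cumulative += value / total
        if cumulative >= target:
            return k, cumulative
    return len(ordered), cumulative


# Hypothetical eigenvalues, for illustration only.
n_pcs, explained = components_to_retain([0.4, 0.3, 0.2, 0.1], target=0.65)
print(n_pcs, round(explained, 2))
```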
2.4.2. Hierarchical Clustering Analysis (HCA)
For HCA, we used the Agglomerative Hierarchical Clustering (AHC) method to calculate individual similarity using Ward’s method and Euclidean distance, resulting in a hierarchical tree, which segregates the Influenza strains from most similar to most dissimilar in the Euclidean space (also known as the bottom-up dendrogram construction method, described by [
21]).
The number of clusters was initially determined by visual inspection of the hierarchical tree. Additionally, we considered the increase in within-cluster homogeneity, represented by an inertia plot, and cluster interpretability, according to Influenza strain information, that might contribute to cluster identification and standardization.
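To make the bottom-up procedure concrete, the sketch below implements a naive version of agglomerative clustering with Ward's criterion on Euclidean coordinates. It is intended only for small toy inputs, is not the library routine used in this work, and uses hypothetical points in place of the retained PC coordinates. The cost of each merge is the increase in within-cluster sum of squares, which is the quantity an inertia plot summarizes.

```python
def ward_clusters(points, n_clusters):
    """Naive agglomerative clustering with Ward's criterion (O(n^3), toy sizes only)."""
    n_clusters = max(n_clusters, 1)
    clusters = [([i], list(p)) for i, p in enumerate(points)]  # (members, centroid)
    merge_costs = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                (ma, ca), (mb, cb) = clusters[a], clusters[b]
                dist2 = sum((u - v) ** 2 for u, v in zip(ca, cb))
                # Ward's criterion: increase in within-cluster sum of squares.
                cost = len(ma) * len(mb) / (len(ma) + len(mb)) * dist2
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        cost, a, b = best
        (ma, ca), (mb, cb) = clusters[a], clusters[b]
        na, nb = len(ma), len(mb)
        centroid = [(na * u + nb * v) / (na + nb) for u, v in zip(ca, cb)]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append((ma + mb, centroid))
        merge_costs.append(cost)
    return [sorted(members) for members, _ in clusters], merge_costs


# Hypothetical 2D coordinates standing in for strain positions on two PCs.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0), (20.0, 0.0), (21.0, 0.0)]
groups, costs = ward_clusters(points, n_clusters=3)
print(sorted(groups), costs)
```

A sharp jump in the merge cost when two groups are forced together is the kind of signal that, together with visual inspection of the tree, supports the choice of the number of clusters.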
2.4.3. Epitope Profile Characterization
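Following cluster identification, we analyzed the distribution of epitope occurrences to identify key differences among the groupings formed by HCA. For this purpose, we calculated the relative risk (RR), defined simply as the relative frequency of the epitope in a cluster divided by its relative frequency in the reference cluster (for example, cluster 1):
$$\mathrm{RR}_{e,k} = \frac{f_{e,k}}{f_{e,\mathrm{ref}}}$$
where $f_{e,k}$ is the relative frequency of epitope $e$ among the strains of cluster $k$ and $f_{e,\mathrm{ref}}$ is its relative frequency among the strains of the reference cluster.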
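As a worked illustration of this ratio, the sketch below computes the RR of one epitope for one cluster against a reference cluster. The strain names, cluster labels and occurrences are hypothetical, and returning None when the epitope is absent from the reference cluster is an assumption of this sketch, since the ratio is undefined in that case.

```python
def relative_frequency(rows, epitope):
    """Fraction of strains (rows) in which the epitope was detected."""
    return sum(row[epitope] for row in rows) / len(rows)


def relative_risk(table, clusters, epitope, cluster, reference=1):
    """Relative frequency in `cluster` divided by that in the reference cluster."""
    in_cluster = [table[s] for s, c in clusters.items() if c == cluster]
    in_reference = [table[s] for s, c in clusters.items() if c == reference]
    reference_frequency = relative_frequency(in_reference, epitope)
    if reference_frequency == 0:
        return None  # undefined when the epitope is absent from the reference
    return relative_frequency(in_cluster, epitope) / reference_frequency


# Toy data: hypothetical strains, cluster labels and epitope occurrences.
clusters = {"s1": 1, "s2": 1, "s3": 1, "s4": 1, "s5": 2, "s6": 2, "s7": 2, "s8": 2}
table = {
    "s1": {"ep_1": 1}, "s2": {"ep_1": 0}, "s3": {"ep_1": 0}, "s4": {"ep_1": 0},
    "s5": {"ep_1": 1}, "s6": {"ep_1": 1}, "s7": {"ep_1": 1}, "s8": {"ep_1": 0},
}
rr = relative_risk(table, clusters, "ep_1", cluster=2, reference=1)
print(rr)  # 0.75 / 0.25
```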
2.4.4. Surveillance Data Processing
The surveillance data was analyzed only after the capacity clusters were explored, and was not used as input for the MCA and HCA. Of the 115,576 surveillance events gathered from the BV-BRC database, 48,159 involved a strain name present in the capacity data and were therefore considered for subsequent analyses.
We were interested in information that depicted any aspect of host interaction
opportunity, or the ecological setting of the different Influenza strains. Thus, we processed and organized the surveillance data according to the identified strain and retained the following information of potential interest: for steps i and j of the workflow, we labeled and grouped the data as (i)
Dados_host_location, which included host species, host common name, host group, and host natural state; and (j)
Dados_host_location, which included collection year, collection country, collection state, and collection city. The final surveillance information was separated into two datasets: (1) host nature data, describing the host species and group, as well as nature (domestic or wild). Unfortunately, only the host group contained a sufficient amount of data. Of the 17,568 strains included in the clustering analysis, only 1218 strains had host nature information (this is step i of the method workflow described in Figure 1 and Figure 2); and (2) host location data, including city, state and country information. We manually searched for all the location specifications, so as to enable us to subsequently relate locations based on regional congruence (this is step
j of the method workflow described in
Figure 1). Both datasets were then filed to be used as the
opportunity information of the selected strains.
In order to try and increase the number of strains analyzed in terms of host use, we applied a second method to observe the host group distribution according to the capacity clusters, which we referred to as the Strain Name Segregation Method (
Supplementary Materials).
This additional method is applied using the crosstab files, which are the outputs of the spreading technique described in the
Supplementary Materials. This application allowed us to observe the strain distribution across host groups for 4210 strains, but did not enable a wider discussion about the other surveillance information, unlike the main result presented here.
2.5. Phylogenetic Reconstruction
2.5.1. Sampling Representative Strains
The complete set of strains included in the MCA and HCA amounted to 17,521 strains. Ideally, we would construct our phylogenies using all sequences and strains; however, for this number of strains, such an approach was time-consuming and computationally challenging. Thus, we selected the 30 most representative strains of each cluster using the following criteria: first, we separated, for each of the three clusters, all epitopes that occurred at a frequency above 50% among the epitope possibilities.
After that, we arbitrarily pulled the top 30 strains with the highest frequency of these epitopes to represent the cluster epitope profile. Not all selected strains carry all the high-frequency epitopes of the cluster, but they are the ones with the highest number of them. We then used the Strain Table in order to pull all amino acid sequences from the sampled strains and construct the trees, beginning with the hemagglutinin segment construction.
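The two-step sampling described in this subsection can be sketched as follows: first collect the epitopes whose within-cluster frequency exceeds 50%, then rank the strains of that cluster by how many of those epitopes they carry and keep the top ones. The data are hypothetical, and breaking ties by strain name is an assumption of this sketch (the text only states that the pull was arbitrary).

```python
def representative_strains(table, clusters, cluster, top_n=30, min_frequency=0.5):
    """Rank strains of one cluster by the number of high-frequency epitopes they carry."""
    members = [s for s, c in clusters.items() if c == cluster]
    epitopes = list(table[members[0]])
    frequent = [
        e for e in epitopes
        if sum(table[s][e] for s in members) / len(members) > min_frequency
    ]
    scores = {s: sum(table[s][e] for e in frequent) for s in members}
    # Assumption: ties are broken alphabetically by strain name.
    ranked = sorted(members, key=lambda s: (-scores[s], s))
    return ranked[:top_n], frequent


# Toy cluster with four hypothetical strains and three epitopes.
clusters = {"s1": 1, "s2": 1, "s3": 1, "s4": 1}
table = {
    "s1": {"e1": 1, "e2": 1, "e3": 0},
    "s2": {"e1": 1, "e2": 1, "e3": 1},
    "s3": {"e1": 1, "e2": 0, "e3": 0},
    "s4": {"e1": 0, "e2": 1, "e3": 0},
}
selected, frequent = representative_strains(table, clusters, cluster=1, top_n=2)
print(selected, frequent)
```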
2.5.2. Why So Many Trees?
When analyzing the evolution of an Influenza group, one must consider the different ways we can draw the phylogenetic trees. A group is traditionally understood according to one gene, which is considered the most representative of the taxon, since it requires less information in order to be drawn and the method is less time-consuming. However, one of our goals here was to test the different ways we can draw the tree using one or more genes.
After applying MCA and HCA we decided to construct two trees with the highest cluster differentiation power—the hemagglutinin (HA) and the Nucleoprotein (NP) segments—and see how they resolved the phylogenetic relationships among the strains in comparison to the capacity clusters. After that, we drew a two-gene tree containing the two segments and compared it to a full genome tree (with the eight segments of each strain). This comparison was important for helping us understand the distinguishing power of the two most significant segments (according to our method) compared to the full genome.
2.5.3. Phylogenetic Procedure
In order to explore the phylogenetic relationship of the selected Influenza A H1N1 strains, we followed the traditional pathway for Bayesian Phylogenetic Application using the StarBeast package in BEAST2 (v.2.7.6).
We first collected the
fasta sequences using the Python library Entrez for all genome segments of interest. In order to generate a reliable phylogeny, we then (1) aligned the sequences with
Muscle and trimmed all the amino acid sequences using Geneious (v. 7.1.3) [
22]; (2) defined the best-fit substitution model for all sets of sequences (single-gene, two-segment, and later, eight-segment reconstructions) using MEGA X [
23]; (3) created the BEAST2 [
24] input in BEAUti (v.2.7.6); and (4) ran a Bayesian analysis using BEAST with a chain of 50 million MCMC generations, sampled every 1000 generations, using the JTT+G substitution model.
The phylogenetic analysis was assessed in Tracer (v.1.7.2) [
24], which allowed us to determine the Effective Sample Size (ESS) of each tree metric, and thus comprehend the general quality of the parameter landscape estimations. The trees were run in TreeAnnotator (v.1.10.4) [
24] to remove the burn-in (10% of the sampled MCMC states) and create a consensus regarding the tree topology. The final tree was drawn using FigTree (v.1.4.4). Further editing of the tree was manually performed using Canva (version of 24 April 2024).
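The chain settings above translate into the following sample counts. This is simple arithmetic on the reported values, assuming the initial state (generation 0) is not counted as a sample.

```python
chain_length = 50_000_000    # MCMC generations (from the text)
sampling_interval = 1_000    # one state logged every 1000 generations
burn_in_percent = 10         # fraction discarded as burn-in

# Assumption: the initial state (generation 0) is not counted as a sample.
sampled_states = chain_length // sampling_interval
burn_in_states = sampled_states * burn_in_percent // 100
retained_states = sampled_states - burn_in_states
print(sampled_states, burn_in_states, retained_states)
```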
This multi-gene construction is only possible using the newest versions of BEAST2 [
24], because it contains the *Beast template, which allows us to add independent gene evolutionary scenarios to the workspace and draw a consensus tree from them. This variation was applied both to the HA+NP phylogeny and the 8-segment reconstruction, where we tested all tree priors available in the software (Yule, calibrated Yule, FBD, and birth–death) for the first multi-locus reconstruction and analyzed their likelihood using Tracer. Since the most adequate model for the two-segment tree was the FBD, we kept the same model in order to analyze the 8-segment construction.
2.5.4. Phylogenetic Signal Analysis
This application was later utilized to analyze two characters in the resulting trees: (1) HCA cluster phylogenetic signal and (2) host use phylogenetic signal, both of which were constructed using the Mesquite software (v. 3.61) [
25]. This statistical analysis explains in a more systematic way whether the character-state transformation along the tree is highly associated with the group evolution or whether it is distributed randomly along the lineages. The phylogenetic signal can be seen as a tool for visualizing the association between strain and character evolution. A null model was created by reshuffling the terminal taxa of the tree 1000 times to assess the significance of the phylogenetic signal.
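The logic of the null model can be illustrated with a small permutation test. The sketch below is not the Mesquite procedure itself: it assumes that the test statistic is the number of parsimony steps (Fitch algorithm) of the character on a rooted binary tree, and that the p-value is the share of reshuffled trees requiring as few steps as the observed one, or fewer. The tree, tip names and character states are hypothetical.

```python
import random


def fitch_steps(tree, states):
    """Minimum number of state changes (Fitch parsimony) on a rooted binary tree."""
    steps = 0

    def visit(node):
        nonlocal steps
        if isinstance(node, str):          # terminal taxon
            return {states[node]}
        left, right = (visit(child) for child in node)
        common = left & right
        if common:
            return common
        steps += 1                         # a change is needed at this node
        return left | right

    visit(tree)
    return steps


def tip_names(tree):
    """List the terminal taxa of a nested-tuple tree."""
    if isinstance(tree, str):
        return [tree]
    return [name for child in tree for name in tip_names(child)]


def signal_p_value(tree, states, n_permutations=1000, seed=0):
    """Reshuffle terminal states and count how often steps <= observed."""
    rng = random.Random(seed)
    names = tip_names(tree)
    observed = fitch_steps(tree, states)
    values = [states[name] for name in names]
    as_low = 0
    for _ in range(n_permutations):
        rng.shuffle(values)
        if fitch_steps(tree, dict(zip(names, values))) <= observed:
            as_low += 1
    return observed, (as_low + 1) / (n_permutations + 1)


# Hypothetical tree in which the character matches the two main clades exactly.
tree = ((("a", "b"), ("c", "d")), (("e", "f"), ("g", "h")))
states = {t: "cluster 1" for t in "abcd"} | {t: "cluster 3" for t in "efgh"}
observed_steps, p_value = signal_p_value(tree, states)
print(observed_steps, round(p_value, 3))
```

A character that needs far fewer steps on the real tree than on most reshuffled trees is associated with the group evolution; a step count typical of the reshuffled trees indicates a distribution indistinguishable from random.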
4. Discussion
When applying the Stockholm Paradigm’s theoretical framework, based on the DAMA protocol described by [
26], we expected to obtain, beyond the identification of potentially emerging Influenza strains, a more universal understanding of H1N1’s capacity and opportunity to emerge in humans. In order to obtain the aforementioned results, we systematically processed, merged, and analyzed data from IRD to finally understand, statistically and visually, the synergistic relation of this information.
Applying the MCA followed by the HCA to represent the concept of capacity described by the Paradigm [
2,
26] was an efficient way of representing all non-genomic information at the same time. This method allowed us to observe the 17,568 strains in a single plot, considering the information of five PCs, which cumulatively represent more than 60% of the variability present in the 6521 epitopes and the 96 genomic substitutions known to be associated with viral capacity change, so that the strains can be organized according to their similarity (
Figure 3).
The HCA grouped the strains hierarchically, generating actual clusters based on the strains’ epitope-substitution identity (
Figure 3). This hierarchization was essential for us to relate the non-genomic information to our phylogenetic reconstructions and discuss the validity of how we traditionally study viral lineages or viruses’ capacity to emerge, host-switch, or alter their host-repertoire utilization (
Figure 7,
Figure 8 and
Figure 9).
The biological understanding of the clusters, however, was provided through our subsequent analysis of the most influential epitopes given by the MCA, which shaped the work we have presented here. The simple way the MCA allows us to explore the contributions of each dimension (
Table 1), as well as the parameter’s weight on the dimension, directly influenced the results presented here.
Based on the identification of the most influential epitopes, we were able to establish a systematic understanding of the biological meaning of the three clusters suggested by the HCA and bring biological meaning to the three-cluster description of the group (
Figure 4). We achieved this by first establishing a general view of the inter-cluster differentiation of capacity (through the epitope frequency distribution calculation for each cluster—
Figure 5 and
Figure 6 and
Table 2). From these plots we were able to actually see that the clusters suggested by the HCA are in fact distinct in terms of epitope profile, and also match the host use analysis (
Figure 10).
The individual characterization of the ten most influential epitopes was also very important for adding depth to the understanding of the clusters. When calculating the epitope frequency of the ten epitopes (
Figure 6) we were surprised by the great differentiation of the three groups, considering both the RR estimation and the pattern among the ten epitopes analyzed.
The divergence of epitope frequency among the three clusters was then understood when we traced the genomic location of the ten epitopes and discovered that they were present in the hemagglutinin (HA) genomic segment (epitopes 1 through 5), which encodes an envelope protein, and in the Nucleoprotein (NP) segment (epitopes 6 through 10), which encodes a protein essential to the encapsidation of the viral segments.
It was interesting to note how the three clusters contain such divergent epitope profiles, and at such fundamental proteins of the viral structure, where the first is directly related to the viral entrance to the host cell, which has also been studied in depth when discussing tissue and disease severity (for upper or lower respiratory tract damage) and host specificity according to Sia2-3Gal (preferential for avian and equine HAs) or Sia2-6Gal receptors (preferential in human and swine HAs) [
27]. Of course, this host use pattern is not absolute in terms of sialylgalactosyl tropism, but is quite homogeneous in terms of host use and disease severity, and is heavily discussed in viral hemagglutinin specificity selection (see [
28,
29] for the in-depth tissue tropism analyses).
The biological importance of the HA protein also leads us to hypothesize about the immunological pathway it activates, since it is an envelope protein that is constantly exposed in the host’s circulation. Based on its historical utilization in the seasonal flu vaccination programs, described as the most abundant and immunogenic protein of Influenza [
30], we can understand that these epitopes interact more strongly with B lymphocytes and induce antibody production through the activation of that pathway [
31].
On the other hand, the NP importance for host utilization also leads to important ramifications in the host–parasite immunological compatibility. Since the NP protein (and the whole vRNP) is only exposed to the immune system once the viral envelope has already fused to the endosomal membrane [
27] to then establish itself in the nucleus, we can infer that these epitopes can only induce an immunological response from differentiated CD8+ T lymphocytes, except when there is an induction of adaptive immunization through vaccination [
31].
The application of the MCA followed by the HCA is not innovative when discussing the simultaneous visualization and interpretation of data [
32,
33] or more specifically, health-related datasets [
29]. It is more amply used in studies of human behavior and psychology [
34], but this is the first work, to our knowledge, that discusses HCA results with an actual multi-locus phylogeny, approaching the idea that capacity might not be sufficiently explained by a single- or two-gene phylogeny.
After analyzing the single-gene trees, we constructed the multi-locus trees and compared the differentiation capacity of HA+NP and the whole-genome reconstruction, as well as how they resemble the HCA, which can be seen as a cumulative representation of “non-genomic” capacity information. This is an important step toward our goal because, when analyzing a biological group, especially in terms of their emergence risk in humans, we typically synonymize the single-gene tree with capacity to emerge, and for the HA and NP trees, capacity is different from phylogeny, or at least different from epitope-substitution information (
Figure 4).
Both multi-locus trees were more similar to the HCA results compared to the single-gene trees. For both HA+NP and eight-segment trees we calculated the RR of cluster identity for the phylogenetic clusters, as an initial strategy to visualize the correlation between HCA clusters and phylogenetic groupings (
Figure 7 and
Figure 8). It is interesting to note, however, that although the eight-segment tree is more reliable in terms of phylogenetic resolution, it did not significantly change the clusters’ differentiation to the HCA groups, as evidenced by the small change in the relative risk calculations in
Figure 7 and
Figure 8 and the small change in the phylogenetic signal analysis (the phylogenetic signal analysis for the HA+NP tree is in the
Supplementary Materials).
The full-genome tree was very similar to the HA+NP analysis in terms of cluster identity, even though the eight-segment tree contains four times more genome segments for the same number of strains (
Figure 7 and
Figure 8). The HA+NP tree contains the two genome segments with the highest number of epitopes and substitutions that differentiate the Influenza clusters in the HCA (or non-genomic capacity information of the strains).
Although convergent, the two-segment and full-genome trees do not coincide with the capacity cluster, indicating that Influenza evolution needs to be analyzed considering other historical factors such as reassortment or recombination between strains, as well as the implications of that for the host-use repertoire. Capacity, in summary, requires more information besides gene evolution data.
For our analyzed sample, we can only understand that, for the three clusters, there is a variety of hosts being utilized, although their use did not occur in a homogeneous manner (as the capacity was not uniform among the clusters), as observed based on the host nature dataset (
Figure 10b). For the three clusters, we can observe that cluster 1 contains the largest ratio of strains that utilize human hosts (more than 80% of the strains that colonize the host), whereas almost 40% of the strains that utilize swine hosts—identified as wild boars, pigs or tagged as swine in general in the dataset—are from cluster 1. This information coincides with the discussion associated with the Stockholm Paradigm, where pathogens, when exploiting a specific host (realized capacity space), still retain an ancestral capacity to explore other groups given by the sloppy fitness space of the strain, or an unused capacity in the human host which enables a host-switch when given the opportunity to do so [
26].
Cluster 2, being the smallest of the three (
Figure 4), but presenting the most diverse epitope frequency (
Figure 5), contains strains that utilize humans, avians or swine in a homogeneous ratio. Unfortunately, only one host is attributed to each strain, since that information takes only the documented sample in which the strain was identified. Therefore, we cannot directly discuss the host repertoire of the analyzed strains [
35].
When analyzing avian hosts, we can see how this host has been specified among 16 avian species (
Figure 10a), which is very important when discussing the opportunity to emerge. In general terms (
Figure 10b), a significant quantity of avian strains are found in cluster 3 (more than 90% of the avian cases), and almost 40% of swine Influenza strains are also from that cluster. Around 10% of strains that occur in humans are from cluster 3.
When analyzing the avian species with the highest number of records in our dataset, we can observe that both American black ducks (
Anas rubripes) and mallards (
A. platyrhynchos) have ample migration routes that extend from the coastal areas of North America, all through Europe and Asia [
36,
37]. More recently, these two species have altered their typical locations during their wintering migrations and besides being seen in coastal areas, they have explored the recently developed rural and urban localities, as their presence has been recorded in the Atlantic region of Canada [
36]. Both species are traditionally monitored in various countries due to their appeal in hunting practices in all the previously mentioned regions, as well as various works related to the hypothesis of competitive exclusion dynamics between the species, as well as reproductive success changes associated with wetland quality alteration [
3,
38,
39,
40].
American black ducks and mallards have a long monitoring history, and for the last 100 years they have been studied together, especially with regard to hybridization cases among the species [
40,
41]. The migratory patterns of ducks are usually centered on the breeding, molting, and wintering areas [
42] and more recently in the mid-migratory pathways [
41,
43]. In Canada, both species’ migration routes have been influenced by the urbanization and agricultural degradation of forested areas, which coincides with the growing loss of wetlands due to landscape alteration in the Atlantic region of Canada [
36,
40,
41].
It is also well documented that both species, with the loss of natural habitat, have explored urban and periurban foraging environments such as urban parks and wetlands, as well as agricultural regions, which is reflected in a dietary change according to the microregion to which these species migrate for the winter season [
36,
44,
45]. Notably, both species explore coastal environments, although there is an ongoing discussion on how black ducks usually dominate these regions compared to mallards, due to a competitive exclusion dynamic between the two species [
46,
47]. The ecological importance of the black duck in all these different microregions has even culminated in the development of the Black Duck Joint Venture for the continuous monitoring of these different populations [
38,
48].
When discussing microregion specifications, it is very important that we differentiate the host species that were identified in our work, and how they relate to the microhabitat manually identified in our dataset. We can see from
Figure 10 that all clusters were present in all six types of environments at similar ratios. Although not depicted by the plot, the location type labeled “Brackish” is the microregion with the highest number of records. It includes regions with saline water; estuarine environments, beaches, islands, and river deltas are the specific landscapes identified in our data.
Only the “Forestal” microregion diverges in terms of cluster strain frequency: it shows a higher frequency of cluster 2 strains and a lower presence of cluster 1 strains. Cluster 2 is the cluster with the highest diversity in terms of epitope profile and host use, whereas cluster 1 has a lower epitope frequency diversity (a more homogeneous capacity profile), with a dominance of Influenza strains that occur in human and swine hosts.
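One possible way to put a number on how much a microregion "diverges" is to compare its cluster composition with the composition pooled over all location types. The sketch below uses the total variation distance for that comparison; this metric is an assumption introduced here for illustration and is not the procedure used in this work. The toy counts are hypothetical, and "Urban" is a placeholder label for one of the other location types.

```python
from collections import Counter, defaultdict


def composition(counter):
    """Convert raw counts into proportions that sum to 1."""
    total = sum(counter.values())
    return {key: value / total for key, value in counter.items()}


def divergence_from_pooled(records):
    """records: iterable of (location_type, cluster) pairs.

    Returns {location_type: total variation distance between that location's
    cluster composition and the composition pooled over all locations}.
    """
    per_location = defaultdict(Counter)
    pooled = Counter()
    for location, cluster in records:
        per_location[location][cluster] += 1
        pooled[cluster] += 1
    pooled_share = composition(pooled)
    result = {}
    for location, counts in per_location.items():
        share = composition(counts)
        result[location] = 0.5 * sum(
            abs(share.get(c, 0.0) - pooled_share[c]) for c in pooled_share
        )
    return result


# Hypothetical toy records; made up for illustration only.
toy_records = (
    [("Brackish", 1)] * 4 + [("Brackish", 2)] * 3 + [("Brackish", 3)] * 3
    + [("Urban", 1)] * 4 + [("Urban", 2)] * 3 + [("Urban", 3)] * 3
    + [("Forestal", 1)] * 1 + [("Forestal", 2)] * 6 + [("Forestal", 3)] * 3
)

distances = divergence_from_pooled(toy_records)
most_divergent = max(distances, key=distances.get)
print(most_divergent, round(distances[most_divergent], 2))
```

A distance of 0 means the location has exactly the pooled composition, and larger values indicate a stronger departure. With small record counts per location, such differences would still need a formal test before being interpreted.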
We can hypothesize that this differential relation among clusters in this type of environment hints at a lower presence of humans in forested areas (although not a complete absence), and a larger distribution of avian hosts, drawn by the type of microhabitat. Forest locations have historically been described as reservoirs for various diseases, and the human invasion of these environments has been associated with the emergence of historical diseases such as HIV [
34], malaria [
49], Zika [
50], and Ebola [
51], among others.
When specifically analyzing the four strains with full data, we find that very little information is available beyond the sequence submission to GenBank and the sample information uploaded into IRD. Our discussion will be restricted, therefore, to the data gathered in this work.
Of the four strains with both capacity and opportunity information, two are human and two are avian pathogens. Both avian pathogens utilize mallards as their main host (A/mallard/Ohio/11OS2078/2011 and A/mallard/NewBrunswick/00340/2010) and are similar to one of the identified human strains in terms of capacity (A/Managua/4905.02/2009), since the three belong to cluster 1 of the HCA. For the New Brunswick strain (with no microregion specification), this information coincides with the previously mentioned wintering distribution of the host in the Atlantic Canada region, whereas the Ohio strain occurs in a preserved forest microregion, the Ottawa National Wildlife Refuge, which is also a naturally occurring microhabitat preferred by mallards.
The human-utilizing Influenza strain that belongs to cluster 1 was identified at the Sustainable Sciences Institute in Managua, the capital of Nicaragua, which is not a traditional breeding, mid-migratory, or wintering site of mallards (or has not been surveyed for the presence of this duck species). We can hypothesize here that the human presence in the environments where mallards appear, or the recent expansion of this host's foraging habits to periurban and urban areas, is enabling the continuous spillover of Influenza H1N1 between this host (and other ducks with similar behavior, such as its congeners) and humans.