An Unsupervised Algorithm for Host Identification in Flaviviruses

Early characterization of emerging viruses is essential to control their spread, as illustrated by the Zika virus outbreak in 2014. Among other non-viral factors, host information is essential for the surveillance and control of virus spread. Flaviviruses (genus Flavivirus), akin to other viruses, are modulated by high mutation rates and selective forces to adapt their codon usage to that of their hosts. However, a major challenge is the identification of potential hosts for novel viruses. Usually, potential hosts of emerging zoonotic viruses are identified only after several confirmed cases, which is inefficient for deterring future outbreaks. In this paper, we introduce an algorithm to identify the host range of a virus from its raw genome sequences. The proposed strategy relies on comparing codon usage frequencies across viruses and hosts by means of a normalized Codon Adaptation Index (CAI). We have tested our algorithm on 94 flaviviruses and 16 potential hosts. This novel method is able to distinguish between arthropod and vertebrate hosts for several flaviviruses with high values of accuracy (virus group 91.9% and host type 86.1%) and specificity (virus group 94.9% and host type 79.6%), in comparison to empirical observations. Overall, this algorithm may be useful as a complementary tool to current phylogenetic methods in monitoring current and future viral outbreaks by understanding host–virus relationships.


Introduction
Recent viral pandemics have shown that rapid characterization of the virus is essential during the development of an outbreak [1][2][3][4]. Among other factors, host information is essential for surveillance and control of virus spread. However, emerging viruses are fully characterized only after several confirmed cases occur; this is an inefficient method of deterring current and future outbreaks [5]. Fast and reliable computational biology methods are needed to develop antiviral treatments, to improve medical diagnoses and to efficiently contain viral outbreaks [6]. Viral genomes are modulated by high mutation rates [7] and by selective forces to adapt their codon usage to that of their hosts, especially when the viruses can infect a wide host range, as is the case for flaviviruses [8]. Previous methods identify flavivirus host range from an analysis of dinucleotides [9,10], on the premise that a virus that infects multiple hosts has a weaker dinucleotide bias [11].
In this article, we introduce an unsupervised algorithm to identify putative virus host ranges based on only genome sequence information. The proposed methodology has been tested on 94 viruses of genus Flavivirus and 16 potential hosts. Several flaviviruses are major human pathogens, with potential host ranges from vertebrates to arthropods [12]. Flaviviruses are classified by vector type into mosquito-borne (MBFV), tick-borne (TBFV), insect-only (IOFV) and unknown vector (UVFV) [13] flaviviruses. In MBFVs, there exists a paraphyletic subgroup of mosquito-specific viruses [14], also known as dual-host insect-only flaviviruses (dhIOFVs). However, certain annotation ambiguities exist; e.g., the Ecuador Paraíso Escondido Virus (EPEV) is defined as MBFV based on phylogeny, but may also be classified as dhIOFV [15]. Flaviviruses with the same host type tend to be monophyletic and are subject to the same selective pressures as the host; this situation is reflected in their codon usage and dinucleotide composition [16]. The most widespread and prevalent flaviviruses include Dengue virus (DENV), West Nile virus (WNV), Japanese encephalitis virus (JEV), and Zika virus (ZKV) [17].
Several articles suggest that highly similar codon usage frequencies between viruses and hosts are indicative of a high virus-host adaptation level [18]. Thus, the codon adaptation index (CAI) [19] may be a robust indicator for determining putative hosts. Here, we use a normalized CAI (nCAI) and a correspondence analysis (CA) to compare codon usage frequencies across virus and host sequences (see Materials and Methods section). Therefore, the nCAI-CA algorithm provides a fast and reliable method of identifying the putative host range of a virus. This method requires only coding sequences (CDSs) without prior knowledge, and can be implemented with minimal computational equipment. In addition, we have developed an easy-to-use web server, available at http://ppuigbo.me/programs/CAIcal/nCAI (accessed on 13 May 2021), to calculate nCAI values.

Materials and Methods
The optimal host identification algorithm (Figure 1) consists of two phases. In the first phase, the algorithm computes the required codon usage tables through two subroutines: one for the host and the other for the virus. These tables, along with complete genomic CDSs, are then used as the input data for CAIcal [20]. This produces CAI data between the virus and host (CAIh), using virus CDSs and host codon usage tables, and CAI data for the virus itself (CAIs), using virus CDSs and virus codon usage tables. In the second phase, the CAIh values are normalized by dividing each by its respective CAIs as in Equation (1):
$$\mathrm{nCAI} = \frac{\mathrm{CAIh}}{\mathrm{CAIs}} \tag{1}$$
This yields the normalized CAI (nCAI) value, from which the optimal and likely hosts can be inferred depending on how similar the codon usage of a virus is to the codon usage of its host organisms. The nCAI values range between −∞ and +∞, and the optimal value is 1.0, indicating identical codon usage between the virus and host and therefore perfect adaptation to the host. Values above and below 1.0 would indicate over- and underoptimization, respectively, and thus suboptimal adaptation to a host.
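The two phases described above can be illustrated with a minimal Python sketch. It is not the CAIcal implementation: it assumes the classical definition of CAI as the geometric mean of relative adaptiveness values, excludes Met, Trp and stop codons, uses a pseudocount of 0.5 for codons absent from a reference table, and builds the virus table from the query CDS itself. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

# Synonymous codon families; Met, Trp and stops carry no codon choice.
SYNONYMOUS = {}
for codon, aa in GENETIC_CODE.items():
    if aa not in "*MW":
        SYNONYMOUS.setdefault(aa, []).append(codon)


def count_codons(cds):
    """Count in-frame codons of a DNA or RNA coding sequence."""
    seq = cds.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return Counter(c for c in codons if c in GENETIC_CODE)


def relative_adaptiveness(usage):
    """w = codon count / count of the most used synonymous codon."""
    weights = {}
    for family in SYNONYMOUS.values():
        top = max(usage.get(c, 0) for c in family)
        if top == 0:
            continue  # amino acid absent from the reference table
        for c in family:
            # Assumption: pseudocount of 0.5 avoids log(0) for unused codons.
            weights[c] = max(usage.get(c, 0), 0.5) / top
    return weights


def cai(cds, usage):
    """Geometric mean of the weights of the codons in the CDS."""
    weights = relative_adaptiveness(usage)
    log_sum, n = 0.0, 0
    for codon, k in count_codons(cds).items():
        if codon in weights:
            log_sum += k * math.log(weights[codon])
            n += k
    return math.exp(log_sum / n) if n else float("nan")


def ncai(cds, host_usage):
    """Equation (1): CAIh divided by CAIs."""
    cai_h = cai(cds, host_usage)          # virus CDS vs. host table
    cai_s = cai(cds, count_codons(cds))   # virus CDS vs. its own table
    return cai_h / cai_s
```

With a host table identical to the virus table, `ncai` returns 1.0, which corresponds to the optimal value described above.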
The nCAI calculations can be performed with the CAIcal tool in a dedicated web server, written in PHP, that works on any web browser (http://ppuigbo.me/programs/CAIcal/nCAI, accessed on 13 May 2021). The server requires two sets of inputs: complete DNA or RNA CDSs of the viruses of interest in FASTA format and the codon usage tables of the potential host animals in the format used by the Codon Usage Database [21]. CAIcal will then output the results in a tab-delimited table with the following values: name of the query sequence (Name); CAI of the virus to a host (CAIh); CAI of the virus to itself (CAIs); normalized CAI, calculated by dividing CAIh by CAIs (nCAI); length of the query sequence (Length); overall %GC; and GC content at the first, second or third nucleotide of each codon (%GC1-3).
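As an illustration of the GC-related output columns, the following Python sketch computes the overall %GC and the GC content at each codon position of a CDS. The function name and the choice to trim an incomplete trailing codon are assumptions made for this example; the server may handle such cases differently.

```python
def gc_profile(cds):
    """Return overall %GC and %GC at codon positions 1, 2 and 3."""
    seq = cds.upper().replace("U", "T")
    seq = seq[: len(seq) - len(seq) % 3]  # assumption: drop a partial last codon

    def pct(bases):
        # Percentage of G or C among the given bases.
        if not bases:
            return float("nan")
        return 100.0 * sum(b in "GC" for b in bases) / len(bases)

    return {
        "%GC": pct(seq),
        "%GC1": pct(seq[0::3]),  # first codon position
        "%GC2": pct(seq[1::3]),  # second codon position
        "%GC3": pct(seq[2::3]),  # third codon position
    }
```

For the two-codon sequence `ATGGCC`, this gives 50% at the first and second positions and 100% at the third.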
In this study, a vector is defined as an organism capable of transmitting a virus to another type of organism. This definition does not take into account whether the virus is virulent within a vector, i.e., there is no differentiation between a vector and a vector-host. A host, on the other hand, is an organism in which the virus primarily replicates, and it does not directly transmit the virus to another organism of the same type. The host organisms (N = 16) for this study were chosen based on information primarily provided by the Virus-Host Database [27], which includes representative arthropod (mosquitoes and tick) and vertebrate (mammals, birds, reptiles and amphibians) host species. Additionally, a more comprehensive list of hosts and vectors for each flavivirus is included in Supplementary Table S1. This table includes only confirmed cases of viruses sequenced from an organism, or cases in which viruses have successfully infected the cells of a host in a laboratory experiment. It is important to note that not all host animals listed in the database are primary hosts, as they might have acquired the viruses through happenstance. We computed a codon usage reference table for 16 putative hosts representing all possible flavivirus host types among vertebrates (mammals, birds, reptiles, and amphibians) and arthropods (mosquitoes and ticks). We analyzed genomes that contained over 10,000 CDSs to reflect actual codon usage frequencies, as well as those of Gallus gallus (6017) and Sus scrofa (2953).
The %GC and relative synonymous codon usage (RSCU) values were calculated from the CDSs of the flaviviruses. The RSCU describes the preference bias for a codon to be used to encode an amino acid [28]. This can be calculated by dividing the observed number of a codon by the expected frequency of the same codon, assuming that individual codons for amino acids were used at equal frequency [29]. CA was performed for two different types of nCAI datasets. The first analysis included all known flaviviruses, and the second included separate datasets, containing only the values for DENV, JEV, WNV and ZKV. The correspondence analyses were performed with the "ca" package (version 0.70) and then plotted with the "ggplot2" package (version 2. [...] acid sequences of each genome were aligned using MUSCLE [30]. Tree construction was performed with FastTree [31]. Host trees were built based on NCBI taxonomy [32]. Each of the viruses and host organisms were sorted to match their respective phylogenies. For the DENV, JEV, WNV and ZKV genomes, their results were clustered based on k-means (5) in the heat map (Supplementary Figure S9). The clustering of each subgroup was performed and visualized by computing centroids based on the multivariate normal distribution of each subgroup with a confidence level of 0.95. This was achieved with the "ggplot2" package (version 2.2.1) in R (version 3.4.4). The virus subgroups included MBFV, TBFV, IOFV, UVFV and dhIOFV, and the host type subgroups were vertebrates, mosquitoes, and ticks.
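The RSCU definition given in this section (observed codon count divided by the count expected under equal use of synonymous codons) can be sketched in Python as follows. This is an illustrative example only; the function name is hypothetical, and the standard genetic code is assumed.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
GENETIC_CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds):
    """RSCU per codon: observed count / (family total / family size)."""
    seq = cds.upper().replace("U", "T")
    counts = Counter(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))

    # Group codons by amino acid, ignoring stop codons.
    families = {}
    for codon, aa in GENETIC_CODE.items():
        if aa != "*":
            families.setdefault(aa, []).append(codon)

    values = {}
    for family in families.values():
        total = sum(counts[c] for c in family)
        if len(family) < 2 or total == 0:
            continue  # no synonymous choice, or amino acid not observed
        expected = total / len(family)  # equal use of all synonymous codons
        for c in family:
            values[c] = counts[c] / expected
    return values
```

An RSCU of 1.0 means a codon is used as often as expected under equal usage; values above or below 1.0 indicate a preferred or avoided codon, respectively.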

Results
First, to assess whether a codon usage methodology could distinguish subgroups within a viral species, we performed nCAI-CA analysis for all the available CDSs of DENV, WNV, JEV, and ZKV, which numbered 4865, 297, 1619 and 494, respectively. Each viral subgroup formed a distinct cluster based on relative synonymous codon usage (Supplementary Figures S3 and S4) and GC content (%GC) (Supplementary Figures S5-S7). In addition, we determined the interspecies and intraspecies variability of the RSCU and %GC in the DENV, JEV, WNV and ZKV genomes. The results show that the RSCU values could differentiate viral subgroups within species and that their distances mostly reflected the evolutionary histories of the viruses (Supplementary Figure S5). The %GC was not a discriminating factor at the intraspecific level (Supplementary Figure S6). At the interspecies level, the clustering patterns based on the RSCU were only slightly more similar to the evolutionary histories of the viruses than the %GC (Supplementary Figure S7).
Next, we used the nCAI-CA algorithm (Figure 1) to identify the optimal hosts of 94 flaviviruses, based on only complete CDSs and codon usage tables from 16 potential hosts (vertebrates: mammals, birds, reptiles and amphibians; arthropods: mosquitoes and ticks) (Supplementary Tables S1-S3). The nCAI-CA algorithm was able to accurately determine host types for MBFVs and UVFVs (vertebrates), IOFVs (Aedes mosquitoes), and TBFVs (Ixodes scapularis) (Figure 2). The paraphyletic group of dhIOFVs clustered between Aedes mosquitoes and vertebrates (Supplementary Figure S1). The CA plot shows a partial overlap between the MBFV and TBFV groups (Supplementary Figure S1); however, on average, TBFVs had higher nCAI values (0.813) than MBFVs (0.765) for I. scapularis, suggesting a higher degree of optimization for tick hosts (Table 1 and Supplementary Table S2). The nCAI-CA analysis also revealed unexpected findings for individual viruses, e.g., WNVs clustered within MBFVs but near TBFVs, which aligns with the results of previous infectivity tests [33] and some observational studies (Supplementary Table S1). All the viruses could be classified into two general host groups: vertebrates (MBFVs, TBFVs, UVFVs and dhIOFVs) and mosquitoes (IOFVs) (Supplementary Figure S2). Our analysis shows that no group clusters near Culex quinquefasciatus, suggesting that this is not an optimal host for most flaviviruses. However, Culex mosquitoes are relatively good vector-hosts for certain flaviviruses (e.g., JEV and WNV) and in many cases the preferred mosquito vector-host is debatable [34][35][36][37][38]. Our results indicate a higher adaptation of MBFV towards Aedes; however, a higher genomic adaptation does not imply that Aedes is currently the most common host-vector for all MBFV, as additional factors should be considered.
Moreover, our algorithm mostly rules out Anopheles gambiae as the main host-vector, in agreement with the literature [39], although there are a few notable exceptions [40,41]. Overall, the nCAI-CA algorithm is able to predict virus groups and host types with high values of accuracy and specificity in comparison to empirical observations (Table 2 and Supplementary Table S4).

[...] Figure S8). Likely optimal hosts (within an nCAI range of 0.9-1.1) include mammals (Myotis brandtii, M. davidii, Mus musculus, Bos taurus, and Homo sapiens) and Aedes mosquitoes (Aedes aegypti and A. albopictus). Unlikely hosts due to low adaptation (nCAI < 0.9) include C. quinquefasciatus, A. gambiae, I. scapularis and S. scrofa. These results are in accordance with previous studies and observations, e.g., most MBFVs have a reproductive cycle that includes Aedes (host-vector) or Culex (vector) mosquitoes and a primary mammalian host (Supplementary Table S1). Based on these analyses, flaviviruses are potentially less adapted to reproduction in Culex mosquitoes due to the differences in %GC between Culex and Aedes mosquitoes. Moreover, our analysis suggests that TBFV is a group of flaviviruses optimized to reproduce in vertebrates and use ticks as vectors (and we speculate that they may occasionally reproduce in ticks). In general, flavivirus codon usage is overoptimized (nCAI > 1.1) for birds (Columba livia, G. gallus, Anas platyrhynchos), amphibians (Xenopus laevis) and reptiles (Alligator mississippiensis).
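The nCAI ranges used above to label hosts can be expressed as a small Python helper. This is only a sketch of the thresholds stated in the text (0.9 and 1.1); the function name and the inclusive treatment of the boundary values are assumptions.

```python
def classify_host(ncai, low=0.9, high=1.1):
    """Label a host by the nCAI ranges reported in the Results."""
    if ncai < low:
        return "unlikely host (low adaptation)"
    if ncai > high:
        return "overoptimized"
    # Assumption: the boundary values 0.9 and 1.1 count as likely optimal.
    return "likely optimal host"


# Mean nCAI values for I. scapularis reported in the Results.
for group, value in {"TBFV": 0.813, "MBFV": 0.765}.items():
    print(group, value, classify_host(value))
```

Both group means reported for I. scapularis fall below 0.9, which is consistent with its listing among the unlikely hosts.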

Discussion
Despite the high values of specificity and accuracy produced by the nCAI-CA (Table 2 and Supplementary Table S4), there are certain limitations in its application. The algorithm is based on the assumption that there is a selection pressure to optimize the relative use of synonymous codons in the virus. However, it is well known that some viruses use the opposite strategy and deoptimize codon usage to hide from host defense mechanisms [42]. The highest level of optimization is at nCAI = 1.0, when the relative use of synonymous codons in the virus and host is identical. Virus-host adaptations were also evaluated with a correspondence analysis (CA) plot (Figure 2). Flaviviruses able to infect a wide range of hosts (generalists) tend to be in the center of the plot, whereas host-specific flaviviruses move away from the center, towards their optimal hosts (Supplementary Figures S1 and S2). Overoptimized codon usage (nCAI >> 1) might be explained by multiple factors, e.g., adaptation to multiple hosts, effects of extreme %GC bias or adaptation to highly expressed genes [43]. Moreover, some gene-specific codon usage biases may better explain adaptations in certain viruses.
Nevertheless, further empirical investigations are necessary to determine reliable confidence intervals for nCAI. Host determination may be uncertain if viruses display approximately equal optimizations for different host types; for example, although dhIOFV codon usage is optimized for both vertebrate and mosquito hosts, they are insect-specific [14]. Although common host preference patterns are observed, the optimal hosts vary depending on the virus or subgroup and may not reflect documented cases (Supplementary Table S1). The observed host ranges also do not distinguish between vectors and hosts, and classical phylogenomic methods cannot determine potential hosts without confirmed cases. Our nCAI-based method overcomes this limitation by directly measuring the adaptation of viruses to the translational machinery of their hosts.

Conclusions
In conclusion, this novel algorithm provides a fast and proactive method to assess the potential host ranges and the risk of zoonotic host shift for new and emerging viruses. In flaviviruses, this method distinguishes between arthropod and vertebrate hosts with high accuracy. However, it might produce ambivalent results for viruses undergoing host shifts. Overall, this nCAI-based algorithm may be used as a complement to current phylogenetic methods to monitor current and future outbreaks.