Dynamic Molecular Epidemiology Reveals Lineage-Associated Single-Nucleotide Variants That Alter RNA Structure in Chikungunya Virus

Chikungunya virus (CHIKV) is an emerging Alphavirus which causes millions of human infections every year. Outbreaks have been reported in Africa and Asia since the early 1950s, from three CHIKV lineages: West African, East Central South African, and Asian Urban. As new outbreaks occurred in the Americas, individual strains from the known lineages have evolved, creating new monophyletic groups that generated novel geographic-based lineages. Building on a recently updated phylogeny of CHIKV, we report here the availability of an interactive CHIKV phylodynamics dataset, which is based on more than 900 publicly available CHIKV genomes. We provide an interactive view of CHIKV molecular epidemiology built on Nextstrain, a web-based visualization framework for real-time tracking of pathogen evolution. CHIKV molecular epidemiology reveals single nucleotide variants that change the stability and fold of locally stable RNA structures. We propose alternative RNA structure formation in different CHIKV lineages by predicting more than a dozen RNA elements that are subject to perturbation of the structure ensemble upon variation of a single nucleotide.


Introduction
Chikungunya virus (CHIKV) is an arthropod-borne Alphavirus of the family Togaviridae that causes millions of human infections every year, particularly in tropic and subtropic regions. CHIKV is the etiological agent of chikungunya fever, an acute febrile illness associated with joint pain, rash, and, rarely, neurological manifestations [1]. CHIKV infection can culminate in chronic arthralgia and arthritis lasting up to several years. CHIKV cycles between vertebrate hosts and hematophagous arthropod vectors, predominantly Aedes aegypti and Aedes albopictus [2]. In Africa, CHIKV occurs in an enzootic, sylvatic cycle involving nonhuman primates as hosts, while in Asia, CHIKV is mainly maintained in an urban cycle with direct human-mosquito-human transmission [3]. The absence of a sylvatic life cycle in Asia suggests that CHIKV originated in Africa and was later carried to Asia [4].

Geographical Spread of Chikungunya Virus
The first documented outbreak of CHIKV was in 1952 on the Makonde Plateau in the Southern Province of Tanganyika (present-day Tanzania) [5,6]. Since then, CHIKV has been emerging in Africa and Asia, with larger outbreaks in the 1960s and 1990s [7]. In 2004, CHIKV re-emerged in Kenya and in 2005, an outbreak hit the island of La Réunion, which

RNA Structure Conservation in Chikungunya Virus Genomes
CHIKV is a small, spherical, enveloped virus with a single-stranded, (+)-sense RNA genome of approximately 11.8 kb [2,26] that contains a 5 cap structure and a 3 poly-A tail. The CHIKV genome contains two open reading frames (ORFs) that encode nonstructural and structural proteins, respectively, as polyproteins that are post-translationally cleaved [27]. The non-structural proteins (nsP1, nsP2, nsP3 and nsP4) constitute the viral replication machinery and are translated from the full-length genome, while the structural proteins (C, E1, E2, E3) form the virus particles and are produced from a subgenomic messenger RNA. The coding sequence is flanked by structured untranslated regions (UTRs) on both ends, which represent the most variable regions of the CHIKV genome [28]. This divergence manifests in variable 3 UTR (and thus genome) lengths of individual CHIKV lineages, with ECSA-derived lineages being shorter than WA, and AUL comprising the longest isolates [15]. It is plausible to propose patterns of coupled historic mutation and recombination events that eventually resulted in the plasticity observed in present day CHIKV isolates [29]. Specific patterns and copy numbers of sequence-level direct repeats in the 3 UTR are characteristic of particular CHIKV lineages and likely represent adaptations of the virus to environmental constraints [30].
Like many other RNA viruses, CHIKV encodes not only viral proteins but also functional RNAs that mediate the viral life cycle. These structured RNAs are found in coding and non-coding regions of the viral genome. A specific fold is often a prerequisite for functional RNAs, and there are selective evolutionary pressures on maintaining these folds, both in coding and non-coding regions [31]. As RNA structure is typically conserved at the level of secondary structures, structural homology can be interpreted as a result of evolutionary forces that act on particular RNAs, requiring them to maintain a critical set of structure-determining base-pair interactions. In nature, this is achieved by compensatory mutations, i.e., those that maintain base-pair complementarity by a combination of two mutations, e.g., AU → GC, or consistent mutations that change only one pairing partner, e.g., AU → GU. While this kind of structural conservation of functional RNAs, which is also known a covariation, is a ubiquitous trait that is found in all domains of life, increased mutation rates render viruses particularly interesting in this context.
Although CHIKV is one of the best-studied viruses within the genus Alphavirus, knowledge of functional RNA elements and their specific association to viral pathogenesis and replication remains elusive. Unlike other RNA viruses that are characterized by structural conservation of a critical amount of functional RNAs in their UTRs, such as flaviviruses [32] or coronaviruses [33], genus-wide RNA structure conservation does not appear to be prevalent in alphaviruses. In this line, evidence for pervasive RNA structure conservation has not been observed among Sindbis virus (SINV), Venezuelan equine encephalitis virus (VEEV), and CHIKV [34], probably because the genomic location of recognized RNA structure motifs, such as packaging signals, are found at divergent locations in different alphaviruses. However, the apparent absence of covariation patterns among alphavirus species does not exclude the ability of individual species to form highly stable, functional structures. This has been recently affirmed in a genome-wide RNA structure probing study that could confirm known functional RNAs in CHIKV by SHAPE-MaP and characterize several highly structured, potentially functional RNA elements [35].
Likewise, the association between primary sequence and secondary structure in the terminal regions of CHIKV genomes has raised considerable research interest over the last years, mainly motivated by the observation that different CHIKV lineages maintain variable-length 3 UTRs that comprise specific patterns of sequence repeats. While earlier studies identified sequence repeat patterns in the 3 UTR of several alphaviruses [36], recombination by copy-choice mechanisms has been proposed to accelerate CHIKV adaptability, resulting in novel 3'UTR variants [30]. In a recent study, we proposed an unambiguous association of sequence repeats in the 3 UTRs of different CHIKV lineages with evolutionarily conserved, structured and unstructured RNA elements [16]. Current knowledge about the lack of functional conservation in alphaviruses suggests that potentially functional RNA elements evolved independently in each viral species.
An aspect related to the formation and specificity of RNA structure is the effect of single nucleotide variants (SNVs). These are mutations that can alter the RNA structural ensemble by mediating the base-pairing pattern, potentially resulting in an alternative fold and disrupted functionality. SNVs are sometimes associated with so-called riboSNitches, i.e., RNA elements that are subject to perturbation of the structural ensemble resulting in large conformational changes [37,38]. Examples where such events can lead to disease phenotypes in human have been described in the literature [39][40][41][42].

Molecular Epidemiology Reveals RNA Structure-Affecting SNVs
With the availability of large numbers of next-generation sequencing data in public databases, multiple efforts to analyze and visualize the spread of infectious diseases have been made [43][44][45]. Nextstrain (https://www.nextstrain.org (accessed on 20 January 2021)) [46] is an open source project that makes pathogen phylogenetic data easily accessible to researchers and the interested public, thus facilitating research efforts in the field of pathogen evolution and epidemiology. Nextstrain allows for the set up socalled community builds that employ the Nextstrain software stack to construct custom real-time phylodynamics resources. The Nextstrain community builds have become increasingly popular, and were used, for example, to highlight the genomic epidemiology of the 2018-2020 Ebola virus outbreak in DRC [47], and showcase the mutational dynamics of SARS-CoV-2 superspreading events in Austria [48]. We report here the availability of a custom Nextstrain build for CHIKV that encompasses 924 publicly available genomes.
To better understand the evolutionary traits associated with functional RNA conservation among different lineages, we set out to use the CHIKV molecular epidemiology data to study the impact of lineage-associated sequence variability on viral RNA structure. We were particularly interested in the structural divergence induced by fixed SNVs that are specific to particular lineages, and predict the existence of more than a dozen locally stable RNA elements in the coding regions of the CHIKV genome, whose structural ensemble is substantially altered by lineage-associated SNVs.

Taxon Selection
We downloaded viral genome and annotation data from the public National Center for Biotechnology Information (NCBI) Genbank database [49] on 23 October 2020. We compiled all temporal and geographic metadata available in these genome records to create data sets for building the Nextstrain instance. Eight sequences of the data set were removed due to missing geographic location metadata or designation as a vaccine sequence. Metadata related to geographic location was associated with the United Nations geoscheme for consistency and labeled in the same way as in de Bernardi Schneider et al. [16]. For the temporal analysis, we identified two sequences with erroneous sampling dates in the Genbank record. Upon contacting the corresponding authors, we were able to correct the dates of the NCBI accessions KX262991.1 and KY435454.1 to 2013 and 2014, respectively.

Genetic Distance
To calculate the genetic distance within and between Chikungunya lineages, we estimated the evolutionary divergence over sequence pairs between groups as implemented in MEGA X under default settings [50,51]. The analyses were conducted using the Maximum Composite Likelihood model [52].

CHIKV Nextstrain
We employed the workflow management system Snakemake [53] to build a pipeline for rapid deployment and reproducibility of the CHIKV Nextstrain build. In the first step of the Snakemake workflow, we retrieved metadata including collection date, country, and place of isolation (if available) from the Genbank records of all available CHIKV isolates. Each country was then assigned to one of the following regions: South East Asia, East Asia, South Asia, West Asia, Caribbean, Northern America, Central America, Southern America, Europe, Oceania, Eastern Africa, Middle Africa, Southern Africa, or Western Africa. In the next step, a file with lineage association for each isolate was created. Isolates with unknown lineage association were assigned to lineages via the the time-resolved phylogenetic tree produced by Nextstrain iteratively. The standard augur pipeline was then applied to construct all relevant data for Nextstrain visualization [46].

RNA sTructure Modulation via Lineage-Associated SNVs
For characterizing SNVs that affect CHIKV RNA structure formation, we performed local RNA structure prediction with RNALfold [54] in the reference strain of our Nextstrain build (KT327163.2), limited to sequence lengths of 150 nt and filtered for thermodynamically stable structures. We required a free energy z-score of at least −2 when comparing to 1000 dinucleotide shuffled sequences of the same nucleotide composition, resulting in 138 locally stable candidate structures spread throughout the CHIKV genome. These were then intersected with Nextstrain genome diversity data, yielding a set of 759 candidate RNAs that overlap either one or more variable sites of the CHIKV genome. For each candidate RNA, we computed the minimum free energy (MFE) of the non-mutated wild-type (WT) sequence as well as MFEs of all SNV mutants, assessed the base-pair distance between WT and mutant MFE structures with RNAdistance from the ViennaRNA Package [55], employing a base-pair distance cutoff of 15, and filtered for variants that show (almost) complete fixation in one or more lineages. This yielded 14 locally stable RNA elements of the reference strain that overlap a total of 16 lineage-associated SNV, as listed in Table 3. Each SNV was then evaluated for its potential to alter the thermodynamic ensemble of RNA structures with the MutaRNA web server [56] using default parameters.

Data Availability
The CHIKV Nexstrain instance is available at https://nextstrain.org/community/ ViennaRNA/CHIKV (accessed on 20 January 2021). The data can be downloaded by scrolling down to the bottom of the page and clicking on the "Download Data" link.

Genetic Distance between Chikungunya Virus Lineages
Nucleotide divergence analyses based on the average number of base substitutions per site can highlight the proximity among groups of taxa. To get an updated view of the distances within and between CHIKV lineages at the level of nucleotides, we computed the average evolutionary divergence over sequence pairs. Our results show that the nucleotide divergence within each individual lineage is relatively low (<0.01), with an average of 0.006 (Appendix A, Table A1). The evolutionary divergence between lineages has an average value of 0.073 (Table 1). The WA lineage has a nucleotide divergence to the other lineages that ranges from 0.17-0.19, showing the highest divergence to all other lineages. Additionally, the major clades encompassing AUL and AUL-Am on one side, and EAL, IOL, MAL, and SAL on the other side, present a nucleotide divergence ranging from 0.065-0.069 between each other. While the genetic divergence between these major clades confirms genotypes that have previously been described in the literature as well-defined lineages, i.e., WA, ECSA, and AUL, lower divergence can be observed between AUL-Am and AUL (0.011), as well as between SAL, MAL, EAL, IOL, AAL, sECSA, ranging from 0.006-0.031. Table 1. Estimates of evolutionary divergence over sequence pairs between Chikungunya virus (CHIKV) lineages. The number of base substitutions per site from averaging over all sequence pairs between groups are shown. Analyses were conducted using the Maximum Composite Likelihood model [52] with 924 CHIKV nucleotide sequences encompassing the lineages/groups observed in this study. Lineages: Asian urban (AUL), AUL-America (AUL-Am), South America (SAL), Middle Africa (MAL), Indian Ocean (IOL), East Africa (EAL), Africa and Asia (AAL), Sister Taxa to ECSA (sECSA), West Africa (WA).

A Nextstrain Build for Chikungunya Virus
In an attempt to provide a publicly available epidemiological and phylogeographical interactive visualization of CHIKV spread, we created a custom Nextstrain build that encompasses all currently available CHIKV genome data. Our community build is available via https://nextstrain.org/community/ViennaRNA/CHIKV (accessed on 20 January 2021) and comprises 924 genomes, which represents a substantial increase compared to the 590 genomes considered in the previous most comprehensive study of CHIKV phylogeny [16]. The Nextstrain phylogeny is based on a maximum-likelihood tree, which is used to infer a timed tree with TreeTime [57] (Figure 1), making available the time of the most recent common ancestor (TMRCA) associated with each individual lineage/major clade of interest (   Figure A1. The CHIKV Nextstrain build allows the user to filter data and change the visualization according to preferences, utilizing a set of filters such as date ranges, multiple tree visualization layouts, feature filters and colors. Besides visualizing phylogeography-related characteristics, Nextstrain provides information about the diversity of the underlying sequence data as normalized Shannon entropy in a separate diversity panel. The analysis of nucleotide divergence within the full genome sequences available on the Nextstrain build enabled us to explore lineage-associated genomic variants that change the stability and fold of locally stable RNA structures in silico.

Lineage-Specific RNA Structures
To better understand within-species RNA genotype-phenotype associations in viruses, wet set out to assess the impact of lineage-associated, evolutionary fixed SNVs on RNA structure formation in CHIKV. To this end, we performed local RNA secondary structure prediction in the reference strain of our Nextstrain build (KT327163.2, clustering with the AUL-Am lineage), aiming at characterizing structural elements that show increased thermodynamic stability, as expressed by z-score statistics. We intersected loci that fold into locally stable RNA structures with genome diversity data from Nextstrain to obtain regions of the CHIKV genome that both fold into stable RNA structures and contain one or more single-nucleotide mutations. Each SNV in this set was then characterized in terms of geographic spread and association to specific CHIKV lineages, as well as their base-pair distance between non-mutated wild-type MFE structure and mutant MFE structure. Using base-pair distance to pre-filter variants that induce a substantial change in the global fold of the locally stable RNAs, we identified 12 candidate RNAs within the CHIKV coding regions that overlap one SNV, and two candidates that overlap two SNVs each. In total, we have 14 candidate RNAs and 16 fixed SNVs, as shown in Figure 2. Details, such as the genomic location and thermodynamic stability of the candidate RNAs, as well as the lineage-association of SNVs are listed in Table 3.
For each candidate RNA, we quantified the effect of mutation-induced changes on the RNA structure ensemble with the MutaRNA webserver [56]. In addition to comparing the characteristics of wild-type and mutant (Due to the selection of our reference strain, 'wild-type' refers to isolate KT327163.2, whereas 'mutant' refers to the respective SNV in all candidates listed in Table 3). For MFE structures, we assessed the impact of lineageassociated mutations on the entire thermodynamic ensemble of RNA structures by partition function folding. RNA #1        RNA #13 We selected two examples that exhibit interesting structural traits: The first example is a variant at nucleotide position 1653 (Figure 3), which has an Adenine (A, wild-type) in the majority of AUL and AUL-Am isolates, while a Guanine (G, mutant) is found at this position in all other lineages. The A1653G variant overlaps a locally stable RNA of 81 nt, whose wild-type sequence folds into a bulged stem-loop structure with an MFE of −23.30 kcal/mol. The mutant folds into a three-way junction structure with an MFE of −28.10 kcal/mol and an equilibrium frequency of 0.3 in the thermodynamic ensemble, suggesting that the mutant is thermodynamically more stable than the wild-type variant. The base pairing potential of wild-type and mutant sequences are depicted as heat-map dot-plots as well as circular plots in Figure 3. These plots demonstrate the differences in base pairing patterns, highlighting that the SNV considerably modulates the fold of the RNA, thereby enabling the formation of alternatively stacked regions in the mutant that result in thermodynamic stabilization.     An interesting observation relates to a C-U mutation at position 10,651, which is responsible for the E1-A226V mutation that has been associated with increased CHIKV transmissibility by A. albopictus. Although the C10651U mutation overlaps a locally stable region (positions 10,594-10,682 of the CHIKV genome, z = −3.73), the overall fold of this structure is not altered by the mutation. This suggests that the biological effect is mediated by the protein level rather than the RNA level.

Discussion
In this contribution, we address the question as to what can be learned about CHIKV genotype/phenotype associations by comparative approaches, bringing together different concepts of molecular epidemiology, phylogeny reconstruction, and computational RNA biology. To this end, we build on the Nextstrain [46] framework to provide an interactive phylodynamics resource of CHIKV that reveals spatiotemporal and epidemiological facets of global virus dissemination. Moreover, we employ established tools for RNA structure prediction based on the ViennaRNA [55] Package to infer lineage-specific structural traits.
While the nucleotide divergence within the observed CHIKV lineages is relatively low, this can be explained by the geographical constraint and the reduced collection period for novel lineages. Conversely, the divergence between lineages varies considerably. The highest divergence of the West African lineage when compared to the other CHIKV lineages can be explained by the observation that WA has been an isolated lineage that emerged decades before the more recent outbreaks [14].
However, the divergence between major clades and geographically delineated lineages does raise the question of whether the current nomenclature, as canonized by experts from the field when first identified, is not misguiding. IOL is an example of a well-established lineage in the literature that presents low divergence from its African origin, but it does introduce the A226V mutation in E1, which mediates the capacity of the virus to replicate in A. albopictus [59].
Most viral genotypes/subgenotypes/subclades are based on whole-genome nucleotide divergence of a specific percentage, usually determined by phylogeny showing clades/lineages of a defined 'high' statistical support (>70% bootstrap) [60,61]. Hepatitis B viruses, for example, are classified into genotypes and subgenotypes based on their monophyly, amino acid signatures, and genetic distance [62]. In HIV, major distinct clades are classified as major genetic groups, with multiple subtypes within the genetic groups [63].
In the most recent comprehensive phylogeny of CHIKV to date, de Bernardi Schneider et al. [16] analyzed the three major lineages of CHIKV, AUL, IOL, and ECSA. These lineages were broken down based on their monophyly and geographic predominance. Here, we can see that although these strains can still be classified into distinct groups, there should be definite layers, such as lineages/genotypes and sublineages/subgenotypes.
Although we were not aiming to discuss a reclassification of CHIKV into a system independent of geography, we find the urge to bring to attention that a new coherent system should replace the current taxa classification. Such a new system could assist drug and vaccine development researchers to target specific genotypes or subgenotypes.
In an epidemiological context, the current lineage system, or, respectively, the way the strains are currently grouped, allows the inspection of major outbreak instances and looking at TMRCAs as a way to gauge when a lineage has been introduced in a region, causing outbreaks. From this perspective, our Nextstrain instance provides a reasonable amount of data to conduct a more in-depth investigation of the outbreaks that have recently occurred. Importantly, the calculated TMRCA of the major clades is in accordance with previous studies [15]. While before December 2013, local CHIKV transmission had not been identified in the Americas [24], our results suggest that the time of the most recent common ancestor for sequences in SAL and AUL-Am were 2011 and 2008, respectively. This result emphasizes the importance of increased surveillance, to identify the virus at the time of introduction, rather than at the time of pandemic [64]. The consistency of the dates between Nextstrain TimeTree calculations and previously described TMRCAs in the literature is also reassuring of our ability to provide this additional information on the CHIKV Nextstrain instance.
Molecular epidemiology provides a detailed picture of the geographical spread and fixation of RNA variants in viruses and can be used in combination with in silico RNA structure prediction to study the structural divergence of different lineages. Owing to the error-prone replication machinery inherent to many RNA viruses, SNVs are created continuously and represent the constitutive driving force behind viral quasispecies [65]. Although most of these mutations are considered neutral in an evolutionary context [66], a detailed understanding of the impact and functional associations of lineage-specific SNVs in viruses remains elusive. Importantly, both thermodynamic stabilization and destabilization of the RNA can have an effect, such as making the RNA accessible for protein binding. Single nucleotide mutations can lead to non-synonymous mutations at the amino acid level that result in potentially different protein function. Likewise, nucleotide mutations that culminate in synonymous mutations still have the capacity to alter RNA structure formation, leading to RNA phenotypes that can influence, e.g., co-translational protein folding efficiency and thereby mediate viral gene expression patterns [67].
We asked whether Nextstrain can be utilized to infer novel insight into RNA structureassociation of individual clades. As Nextstrain is particularly convenient for discriminating characteristics of viral clades/lineages, we set out to expand the sequence-centric approach to computation of lineage-associated structural traits. By combining Nextstrain genome diversity data with RNA structure prediction methods, we could associate sequence variants observed in different CHIKV lineages with alternative RNA structure formation. Although we cannot associate lineage-specific SNVs with particular biological functionality at this point, the fixation of nucleotide mutations in certain CHIKV lineages suggests that they are either neutral or they confer an adaptive advantage in an evolutionary setting.
Building on a thermodynamic model for RNA structure formation, we present here 14 RNA candidates in the CHIKV coding region that exhibit an alternative fold upon mutation of individual nucleotides. While we were focusing on mutations that induce the most obvious changes to the RNA structural ensemble, there are many more lineagespecific mutations that may have only a subtle effect on local RNA folding or may be involved in long-range RNA structure formation. Importantly, phenotypic traits result from the complex interplay of all lineage-specific mutations, and in this line, our approach can help by defining target RNAs for experimental verification.
In summary, we show here that the combination of molecular epidemiology data with RNA structure prediction can help to gain insight into hitherto unresolved aspects of genotype-phenotype associations within viral species. On a broader scale, specifically when applied to different viruses, this can augment our understanding of RNA structure evolution.