In-Depth Analysis of Genetic Variation Associated with Severe West Nile Viral Disease

West Nile virus (WNV) is a mosquito-borne virus which causes symptomatic disease in a minority of infected humans. To identify novel genetic variants associated with severe disease, we utilized data from an existing case-control study of WNV and included population controls for an expanded analysis. We conducted imputation and gene-gene interaction analysis in the largest and most comprehensive genetic study conducted to date for West Nile neuroinvasive disease (WNND). Within the imputed West Nile virus dataset (severe cases n = 381 and asymptomatic/mild controls = 441), we found novel loci within the MCF.2 Cell Line Derived Transforming Sequence Like (MCF2L) gene (rs9549655 and rs2297192) through the individual loci analyses, although none reached statistical significance. Incorporating population controls from the Wisconsin Longitudinal Study on Aging (n = 9012) did not identify additional novel variants, a possible reflection of the cohort’s inclusion of individuals who could develop mild or severe WNV disease upon infection. Many of the top gene-gene interaction results were intergenic, with currently undefined biological roles, highlighting the need for further investigation into these regions and other identified gene targets in severe WNND. Further studies including larger sample sizes and more diverse populations reflective of those at risk are needed to fully understand the genetic architecture of severe WNDD and provide guidance on viable targets for therapeutic and vaccine development.


Introduction
Mosquito-borne viruses such as West Nile virus (WNV) are a major global health threat, with estimates of more than 340 million infections and more than 440,000 fatalities each year [1]. WNV was first isolated in an individual from the West Nile region of Uganda and is now found in more than 80 countries [2,3]. Since its introduction to the United States in 1999, the virus has since spread throughout the continental states and into Mexico and Canada [4]. Among infected humans, the majority-especially the young-remain asymptomatic, while approximately 20% mainly older individuals present mild symptoms such as fever, and <1% exhibit severe, potentially life-threatening West Nile neuroinvasive disease (WNND) [5][6][7][8][9].
Genetic factors are among the host characteristics identified to place a minority of individuals at increased risk of severe disease. To date, 13 candidate-gene studies and one larger-scale 13,371 single-nucleotide polymorphisms (SNPs) study have been conducted to examine genetic factors

WNV-Infected Study Population
Study participants were enrolled between 2004 and 2009 from Nebraska, Pennsylvania, Texas, and Colorado in the United States and Alberta and Saskatchewan in Canada, as previously described [11]. All study participants were adults ≥18 years with confirmed infection with WNV based on the guidelines of the Centers for Disease Control and Prevention. The 381 cases were diagnosed with meningitis, encephalitis, or acute flaccid paralysis, while the 441 control patients did not have any of these diagnoses. Blood samples from adults were collected for DNA extraction and genotyping with the Illumina 1 Million Duo and Illumina OmniExpress as previously described (Illumina, San Diego, CA, USA) [11].

Quality Control and Imputation
For quality control, we used PLINK (Harvard University, Boston, MA, USA) to remove all individuals and SNPs with >5% missing rates and SNPs that deviated significantly from Hardy Weinberg Equilibrium (p < 1 × 10 −5 ) [16]. We updated the data for human genome build 37 (hg19) using Batch Coordinate Conversion (LiftOver) [17]. Genotype phasing was conducted with SHAPEIT (University of Oxford, Oxford, UK) and imputation was performed using IMPUTE2 (University of Oxford, Oxford, UK), with 1000 Genomes Project, Phase 3 dataset serving as reference [18][19][20]. Imputation quality is scored by the INFO score (0-1.0, with a higher score indicating increased certainty), and SNPs with INFO scores >0.7 were included in the analysis. Principal component analysis was conducted on the directly genotype data using EIGENSTRAT (Harvard University, Boston, MA, USA) [21,22]. We merged the two imputed chip datasets in PLINK and subsequently conducted association analyses, with the top 10 principal components, sex, and the type of chip (1 Million Duo or OmniExpress) as covariates (Supplemental Figure S1). To adjust for multiple testing, the genome-wide Bonferroni-corrected threshold of 5 × 10 −8 was used to assess p-value significance [23,24]. Manhattan and quantile-quantile (QQ) plots were created through the R package qqman [25].

Population Controls
The cohort from the Wisconsin Longitudinal Study on Aging (dbGaP accession: phs001157.v1.p1) was used as a population control dataset during an expanded analysis of the imputed WNV dataset [26,27]. The Wisconsin dataset was procured from dbGaP under an approved agreement [28]. As previously described, 9012 individuals were enrolled and provided samples that were genotyped on the Illumina HumanOmniExpress chip, human genome build 37 (Illumina, San Diego, CA, USA) [27,29]. Similar to the methods we used in the WNV dataset, they conducted imputation using the IMPUTE2 software (University of Oxford, Oxford, UK), with the 1000 Genomes Project, Phase 3 dataset serving as controls [20,30]. We merged this Wisconsin population controls imputed dataset with the WNV imputed dataset in PLINK, and conducted univariate analysis using the top 10 principal components, sex, the type of chip for the WNV participants (1 Million Duo or OmniExpress), and the cohort (Wisconsin population control or WNV sample) as covariates. Principal component analysis of the merged, directly genotyped datasets was conducted using EIGENSTRAT [21,22,31]. To reduce heterogeneity between the samples, we limited analysis to SNPs with population control minor allele frequencies (MAF) ±5% of the WNV control MAF.

Analysis of Gene-Gene Interactions
To assess the directly genotyped dataset for gene-gene interactions, we applied PLINK's fast epistasis function, which compares genotype count tables, to test interactions between all SNPs with MAF ≥10%. We then applied the regular epistasis function, which employs a logistic regression approach, on all interactions with p-values below 5 × 10 −8 from the fast epistasis test [16]. We used a Bonferroni-corrected threshold of 5 × 10 −14 to assess p-value significance.

Imputation Increases Identification of Genes Associated with WNV
To identify additional gene loci with a role in infection with WNV, we made use of imputation to analyze the existing GWAS dataset. The initial analysis covered 1,426,538 directly genotyped SNPs, and with imputation, we increased the number of loci which can be assessed for a role in WNV infection by 8-fold. Following imputation, 11,883,872 SNPs passed quality control (Supplemental Table  S1). The top hits from the imputed data set revealed several novel targets including rs3065938 in an intergenic region of Chromosome 4 (p-value = 9.45 × 10 −7 ) and rs78660320 in the Testis Expressed 15, Meiosis and Synapsis Associated (TEX15) gene on Chromosome 8 (1.76 × 10 −6 ) ( Table 1; Manhattan plot, Supplemental Figure S2; QQ plot, Supplemental Figure S3). While still short of the statistical significance threshold of 5 × 10 −8 , these imputed SNPs were more significantly associated with disease than the top associations from the original analysis (Supplemental Table S2). Other novel SNPs of potential interest that approached significance were in the MCF.2 Cell Line Derived Transforming Sequence Like (MCF2L) gene (rs9549655 and rs2297192; p-values 5.84 × 10 −7 and 1.35 × 10 −6 ), the TIMELESS Interacting Protein (TIPIN) gene (rs36061431; p-value 2.43 × 10 −6 ), and the Potassium Calcium-Activated Channel Subfamily N Member 3 (KCNN3) gene (rs7553671; p-value of 2.49 × 10 −6 ).

Population Controls Extend Univariate Analysis
The WNV dataset includes WNV patients of differing severity of disease but does not include asymptomatic subjects who make up the majority of infections [11,32]. For additional power to discern genes with a significant role in WNV infection, we added a large dataset of population controls from the Wisconsin Longitudinal Study on Aging (n = 9012) [27]. We re-analyzed the imputed WNV dataset after incorporating the imputed dataset of population controls. In order to reduce heterogeneity between samples, results were restricted to SNPs with population controls MAF within 5% of the WNV controls MAF. From the reanalysis with population controls, the top SNP was in the MCF2L gene (rs2297192) on Chromosome 13; p-value of 2.30 × 10 −6 ( Table 2). Table 2. Top SNPs following individual variant analysis of the imputed datasets of WNV severe cases, non-severe WNV-infected controls, and population controls from the Wisconsin Longitudinal Study on Aging. The table shows the top 10 results by chromosome (Chr), single-nucleotide polymorphism (SNP), odds ratio (OR), p-value from the combined WNV cohort and population control analysis, p-value from the WNV-only cohort analysis, and minor allele frequency (MAF) within the WNV severe cases, WNV-infected non-severe controls, and population controls from the Wisconsin Longitudinal Study on Aging. None of the interactions passed the Bonferroni-corrected p-value of 5 × 10 −8 . Gene acronyms: MCF.2 Cell Line Derived Transforming Sequence Like (MCF2L), Nicotinamide N-Methyltransferase (NNMT), Coagulation Factor X (F10).    The top SNPs from the univariate WNV-only analysis (Table 1) were not included in this analysis as the MAF between the WNV controls and Wisconsin population controls was greater than the 5% cutoff. Many of the p-values of the population controls analysis were similar or even slightly higher than the WNV-only data analysis, suggesting the frequency of the risk alleles may be greater in the population controls than in the WNV controls. Notably, for some of the SNPs, the population control MAF was between the WNV case MAF and WNV control MAF, limiting the potential boost in p-values despite the comparatively large number of population controls. No additional novel areas of interest were noted in the Manhattan plot (Supplemental Figure S4; QQ plot, Supplemental Figure S5).

Gene-Gene Interactions Identify Combined Effects of Variation at Two Loci of Interest
To identify potential interactions of two genes which may contribute to genetic susceptibility to infection with WNV, we examined gene-gene interactions across the genome. None of the interactions passed the Bonferroni-corrected p-value of 5 × 10 −14 (Table 3). Although not reaching significance, the top interaction was between rs7021419, an intergenic SNP on chromosome 9, and rs9989408, a SNP located in the Heparan Sulfate-Glucosamine 3-Sulfotransferase 4 (HS3ST4) gene on chromosome 16 (p-value of 2.42 × 10 −11 ). Other top interactions included SNPs from the Neuregulin 1 (NRG1), Microtubule-Associated Serine/Threonine-Protein Kinase 4 (MAST4), Ribosomal Protein L17 (RPL17) genes. No additional interactions were identified in the combined cohort of WNV cases and controls and the Wisconsin population controls. As expected, and consistent with the univariate analysis outlined above, the addition of the population controls resulted in less significant findings of gene-gene interactions.

Conclusions
With only a minority of individuals infected with WNV developing severe, potentially life-threatening disease, it is vital to identify factors associated with susceptibility to disease [33,34]. Human genetic variation is of critical importance to an individual's response to infection and subsequent progression to severe disease and previous studies have identified several genes associated with WNV infection [35][36][37][38][39][40]. To identify additional novel genetic variants associated with severe WNND, we conducted advanced genetic analyses of loci in a North American cohort by using an imputation approach, incorporating population controls, and analyzing gene-gene interactions. Through univariate analysis of the imputed dataset, we found novel loci, particularly near the MCF2L gene on Chromosome 13, that may be associated with severe WNND. MCF2L belongs to the Rho guanine nucleotide exchange factors, which are one of three groups of proteins that control Rho GTPases [41]. Viruses have been shown to interact with and evade Rho GTPase signaling, including WNV during its infiltration of murine neurons [42,43]. While many of the other top SNPs were intergenic, one SNP is located in the TIPIN gene. The TIPIN gene product is the TIMELESS-interacting protein, which has been implicated in Epstein-Barr viral replication and at the pathway level during infection with zika, a related flavivirus [44][45][46]. These newly identified SNPs, while not reaching statistical significance in our study sample, are more significant than the findings of the original study. The advanced analysis using imputation provided important additional clues of genetic susceptibility. Future investigations, both of these and other SNPs analyzed within larger samples sizes with greater statistical power, integrated with profiles of WNV immune responses and animal models, can illuminate the potential roles of these genes on disease pathogenesis [47][48][49][50][51].
To expand the sample size and increase power for our analysis, we utilized population controls, also of northern-European descent, from the Wisconsin Longitudinal Study on Aging. We selected the control dataset due to its similarities to the WNV dataset, namely in location and subject population of individuals of northern European descent. Including the control dataset of >9000 individuals resulted in surprisingly limited improvement across SNP p-values. This may be due to occult infection within the Wisconsin group, or differences across MAFs, even following the removal of SNPs with population control MAFs greater than 5% different from the WNV control MAF. As shown in Table 2, the MAF in the population control dataset often fell between the MAF estimates for the WNV cases and controls, indicating the population controls represent both potential severe cases as well as asymptomatic or mild controls upon infection with WNV. The population controls are more likely to carry the risk alleles for severe disease compared to the WNV controls, a population infected with WNV but resistant to severe disease. The inclusion of an additional, external dataset could introduce potential bias through misclassification of severe cases and batch effects and confounding through population substructure [52].
Including population controls did not lead to more significant findings as anticipated, and the greatest increase in statistical power will be gained through the inclusion of a larger sample of infected cases. Total cases of WNV reported in the USA since 1999 are 50,830 [53], thus, only limited enrollment is possible. Most of the WNV genetic studies conducted to date have been relatively small in sample size, with 5 of the 13 published genetic studies including fewer than 50 severe cases [10]. We conducted our analyses in the largest WNV genetic study conducted to date, by both number of genetic variants tested and number of severe cases. However, adequate power (>80%) to identify gene-gene interactions required odds ratios of 2.4 or higher, depending on the minor allele frequencies (Supplemental Table S3). Furthermore, ongoing and future studies need to be conducted in the appropriate at-risk populations; a mutation that plays an important role in disease progression in one population may not be universally significant. WNV has been found on every continent excluding Antarctica, but published studies exploring the role of genetic variation in WNV disease severity have focused largely on populations of Northern European descent [54]. Only three of the 13 studies were conducted outside North America, and focused on Greece, Macedonia, and Israel.
When we analyzed gene-gene interactions, many of the top pairs included an intergenic region, highlighting the importance of ongoing efforts to understand the roles and effects of these areas. The top interaction involved a SNP in the HS3ST4 gene, which encodes the enzyme heparan sulfate D-glucosaminyl 3-O-sulfotransferase 4 and has been linked to herpes simplex virus type 1 pathogenesis and tumor growth [55,56]. HS3ST4 has been shown to be regulated by TRF2, with abnormal expression limiting natural killer cell recruitment and activation; notably, natural killer cells play an important role in the immune system response to WNV infection [57][58][59]. While the top interactions approached but did not pass the Bonferroni-corrected p-value threshold, future research may reveal the potential role of HS3ST4 and other identified genes during WNV infection.
There are several potential limitations to our study design. LiftOver was used to update the genetic data to human genome build 37, which was a necessary step for the software and reference panels used in the subsequent analytical steps. While studies evaluating LiftOver and other similar conversion tools have shown low rates of errors when converting between builds [60,61], this could have potentially introduced error. To impute the genetic data for the WNV study participants, we used the IMPUTE2 software in order to match the approach previously used on the Wisconsin Longitudinal Study on Aging cohort; studies have shown similar accuracy across IMPUTE2 and other imputation software [62,63]. As imputation estimates ungenotyped loci, we only included imputed variants with an INFO quality score above 0.7 to limit potential error. In the univariate analyses, we used the Bonferroni-corrected genome-wide threshold of 5 × 10 −8 to assess p-value significance. The Bonferroni-corrected threshold is widely acknowledged to be more conservative than alternative approaches [64,65], but is less likely to provide false positive results [66,67]. Our top findings, while not statistically significant using the Bonferroni-corrected threshold, are biologically plausible and may reflect the underlying biology of WNND; however, further genetic studies are needed to replicate these findings. Our analyses utilized data from the largest WNV GWAS conducted to date, but as noted above, future studies incorporating more cases and diverse populations are needed to best understand the genetic architecture of WNND. If our findings are replicated in additional cohorts, biological experiments will be needed to elucidate the functional roles of the identified genetic variation in WNV infection. The importance of understanding genetic architecture of complex traits to guide vaccine development has been well documented in other diseases [68][69][70], and our analyses show further studies need to be conducted to adequately inform and guide vaccine development.
As WNV and other mosquito-borne viruses become of increasing concern due to climate change, there is a time-sensitive need for future work to explore the roles of identified genes in disease pathogenesis and a continued need to conduct larger, more diverse studies for a more universal understanding of genetic risk factors and viable therapeutic targets. As noted here, novel analyses such as imputation, population controls, and gene-gene interactions in existing datasets may provide additional leads for future investigation, but well-designed, fully-powered new studies with sufficient asymptomatic, mild, and severe infection subgroups are needed to completely understand the genetic architecture of severe WNND [71]. In addition to guidance for much-needed vaccines and treatment, this knowledge could identify individuals at increased risk for severe disease, allowing for more targeted precaution measures and earlier intervention.
Supplementary Materials: The following are available online at http://www.mdpi.com/2076-393X/8/4/744/s1. Figure S1: Quality control and analysis flowchart. Overview of approach for quality control, phasing, imputation, and association analysis of the genetic dataset of WNV sample, Figure S2: Manhattan plot of imputed dataset of WNV severe cases and mild and asymptomatic infection controls. The chromosomes are along the x-axis and the negative log p-value along the y-axis. The blue line indicates SNPs that approach statistical significance, and the red line separate SNPs that reach the genome-wide corrected p-value threshold. The top association hits are located in a region of chromosome 13, Figure S3: QQ plot of the expected and observed negative log of the p-values from the association analysis of the imputed WNV dataset. The negative log of the observed p-values (x-axis) are plotted against the negative-log of the expected p-values (y-axis). The majority of the SNPs fall along the red line, indicating the results are not inflate, Figure S4: Manhattan plot of imputed datasets of WNV severe cases, non-severe WNV-infected controls, and population controls from the Wisconsin Longitudinal Study on Aging. The chromosomes are along the x-axis and the negative log p-value along the y-axis. The blue line indicates SNPs that approach statistical significance, with none of the SNPs reaching statistical significance, Figure S5: QQ plot of the expected and observed negative log of the p-values from the association analysis of the imputed datasets of WNV severe cases, non-severe WNV-infected controls, and population controls from the Wisconsin Longitudinal Study on Aging. The negative log of the observed p-values (x-axis) are plotted against the negative-log of the expected p-values (y-axis), Table S1: Imputation of the West Nile virus dataset. The number of directly genotyped SNPs and the total number of SNPs following imputation, by chromosome. The 'before imputation' numbers only include directly genotyped SNPs; the 'after imputation' includes both directly genotyped and imputed SNPs, Table S2: Comparison of p-values for the top three SNPs from the previously published GWAS within the current subset, with no covariate adjustment to reflect the original analysis. The previously published GWAS, including 560 neuroinvasive case patients and 950 control patients analyzed for 13,371 SNPs, overlaps with the current study containing 444 neuroinvasive cases and 829 control patients, Table S3: Power calculations for detection of gene-gene interactions among the WNV dataset. Calculations were based on 444 severe disease cases and 829 non-severe infections, population prevalence of severe WNV of 0.01, and a range of minor allele frequencies (MAF). Each loci has log-additive inheritance and marginal odds ratio (OR) of 1.5, which reflects the top SNPs from the initial study. There is sufficient power (>0.80) to detect interaction ORs depicted in each cell or higher for the two corresponding MAF. Bonferroni-corrected two-sided p-value = 4 × 10 −6 , assuming 110 pairwise interactions.  studies. Additional IRB approval from Yale University was not required to obtain the Wisconsin Longitudinal Study on Aging dataset through dbGaP [72]. This study was conducted in accordance with the Declaration of Helsinki, and the parent study protocols were approved by the Ethics Committee of the respective universities: for the West Nile virus dataset, institutional review board approval was received from McMaster University, McGill University, Nebraska Medical Center, University of Pennsylvania, and University of Texas at San Antonio [11]; for the Wisconsin Longitudinal Study on Aging dataset, approval was obtained through University of Wisconsin-Madison [27,29].