Reworking GWAS Data to Understand the Role of Nongenetic Factors in MS Etiopathogenesis

Genome-wide association studies have identified more than 200 multiple sclerosis (MS)-associated loci across the human genome over the last decade, suggesting complexity in the disease etiology. This complexity poses at least two challenges: the definition of an etiological model including the impact of nongenetic factors, and the clinical translation of genomic data that may be drivers for new druggable targets. We reviewed studies dealing with single genes of interest, to understand how MS-associated single nucleotide polymorphism (SNP) variants affect the expression and the function of those genes. We then surveyed studies on the bioinformatic reworking of genome-wide association studies (GWAS) data, with aggregate analyses of many GWAS loci, each contributing with a small effect to the overall disease predisposition. These investigations uncovered new information, especially when combined with nongenetic factors having possible roles in the disease etiology. In this context, the interactome approach, defined as “modules of genes whose products are known to physically interact with environmental or human factors with plausible relevance for MS pathogenesis”, will be reported in detail. For a future perspective, a polygenic risk score, defined as a cumulative risk derived from aggregating the contributions of many DNA variants associated with a complex trait, may be integrated with data on environmental factors affecting the disease risk or protection.


Introduction
Multiple sclerosis (MS) is the most common chronic inflammatory disease of the central nervous system causing neurological disability. There is a strong need to understand the still elusive MS cause(s) and use this knowledge to develop safe drugs that specifically target disease mechanisms. Like other complex traits, MS results from the interaction of genetic and environmental factors [1,2].
annotations, and knowledge of network connectivity, the authors were able to maximize the genetic information for target validation of 30 immune traits. The added value of this approach was the incorporation of network connectivity information that increases enrichment for established therapeutic targets and helps overcome the possibility that many potential targets do not contain naturally-occurring variants that disrupt gene function and are associated with a relevant trait. A possible channel for the future development of similar approaches may lie in the bioinformatic reworking of GWAS data, considering the components of PI together with supposedly active, nonheritable factors which are known to interact with the genetic signals resulting from genome-scale data.

Bioinformatic Reworking of GWAS Data
Conventional analyses of GWAS data considered only SNPs passing a p-value threshold of 5 × 10⁻⁸ as being relevant for association with the disease. This approach stems from the assumption that genetic markers independently contribute to disease development. Other approaches took into consideration a looser cutoff (p-values of less than 0.05), with the assumption that a combined effect of many loci, each with a modest contribution, may account for overall disease susceptibility (Table 1).

Table 1 (partial; only the final rows are recoverable).
... | Prioritization of cell-specific gene/protein networks | Explanation of the potential role of GWAS signals in a tissue/cell-specific manner: identification of cell-specific susceptibility pathways.
IMSGC 2019 [11] | IMSGC 2019 [11] | p < 5 × 10⁻⁸ | Multiple approaches: cell-specific eQTL, pathway analyses; PPI | Prioritization of genes putatively associated with the disease, and identification of possible major implications for resident microglia and the B cell in MS.
Pathway analysis was one of the approaches selected for GWAS data interpretation, based on the identification of molecular functions and biological processes that can be targeted by disease-associated SNPs [39]. An interpretation of nominally MS-associated SNPs (p < 0.05), obtained from two studies [30,31] and combining pathway analysis with protein-protein interaction networks (PINs), identified subnetworks of genes involved in several immunological and neural pathways which are enriched in MS [29]. In another study, this approach highlighted the oxidative stress and immune dysfunction pathways as being relevant in primary progressive MS, even though conventional GWAS analysis had not shown any association at the single-SNP level [40]. To further unravel the heritable factors that are potentially involved in MS, a PIN-based pathway analysis (PINBPA, [41]) was conducted. Specifically, nominally significant MS-associated SNPs (p < 0.05) from two independent GWAS datasets [8,33] were matched with protein-protein interaction network data, and the resulting networks were then subjected to pathway analysis. This combined approach showed that proteins encoded by genes carrying risk variants are more likely to interact and to share the same or related pathways. Moreover, the integration with protein-protein interaction pathways suggested new MS-susceptibility loci, including TNF-receptor-associated factor 3 (TRAF3), B cell membrane protein (CD48), B cell lymphoma 10 (BCL10), v-rel reticuloendotheliosis viral oncogene homolog (REL), and TEC protein tyrosine kinase (TEC) [32]. This approach proved to be informative: two TRAF3 MS risk SNPs (rs12147246 and rs12588969; [11]) were later identified as being of genome-wide significance, while other studies have shown their involvement in a dysregulated response to Epstein-Barr virus (EBV) infection [42], i.e., one of the main environmental factors associated with MS [43,44].
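The core statistical idea behind many pathway analyses can be illustrated with a simple over-representation test. The sketch below is a minimal illustration under stated assumptions and is not the PINBPA procedure used in the cited studies: it asks whether the genes tagged by nominally associated SNPs overlap a pathway more often than expected by chance, using a hypergeometric tail probability. The gene names (G1 to G20), the pathway membership, and the function name overrepresentation_p are all hypothetical.

```python
from math import comb

def overrepresentation_p(hits, pathway, background):
    """Return (overlap, p) where p = P(X >= overlap) under a hypergeometric null.

    hits       -- genes tagged by associated SNPs
    pathway    -- genes annotated to the pathway of interest
    background -- all genes that could have been tagged
    """
    background = set(background)
    hits = set(hits) & background        # ignore genes outside the background
    pathway = set(pathway) & background
    N, K, n = len(background), len(pathway), len(hits)
    k = len(hits & pathway)
    # Upper tail of the hypergeometric distribution, computed exactly.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return k, tail / comb(N, n)

# Hypothetical toy data: 20 background genes, a 5-gene pathway, 4 tagged genes.
background = [f"G{i}" for i in range(1, 21)]
pathway = ["G1", "G2", "G3", "G4", "G5"]
hits = ["G1", "G2", "G10", "G11"]

overlap, p_value = overrepresentation_p(hits, pathway, background)
print(f"overlap = {overlap}, p = {p_value:.4f}")
```

Real analyses additionally have to account for gene size, linkage disequilibrium, and multiple testing across pathways, none of which this toy example addresses.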
Another approach to GWAS reworking is based on the wealth of information on regulatory elements which is available thanks to the efforts of the Encyclopedia of DNA Elements (ENCODE) and Regulome Epigenomics Consortiums [45,46]. These repositories can help to identify disease-associated functional variants that could have an impact on specific cell types. Epigenetic changes, including DNA methylation, posttranslational modification of histones, and the synthesis of noncoding RNAs, are representative of the molecular mechanisms through which environmental signals are translated within cells to change their gene expression [47,48]. In MS pathogenesis, a great deal of evidence suggests that the risk related to genetic predisposition is integrated with cell-type-specific epigenetic changes occurring in the immune system and in the brain in response to environmental stimuli [49-51]. A reworking of GWAS data, aimed at identifying the cell type where the MS-associated variants might exert functional effects, was recently performed in an association study that analyzed a total of 47,351 cases and 68,284 healthy controls [38]. Taking into consideration the 200 autosomal susceptibility variants outside the major histocompatibility complex (MHC), the authors considered all the regulatory elements that could be affected by the presence of these variants in a cell-specific manner, and created a cell-specific protein network. This paradigmatic approach to decoding the disease risk in a cell/tissue-specific context suggested that MS-associated variants operative in CNS-resident microglia are important contributors (besides those of peripheral immune cells) to disease development [11,38].
A further level of complexity stems from the fact that many disease-associated polymorphisms identified by GWAS lie within regulatory regions of the genome. In fact, many MS-associated variants are located within noncoding regions [8,50], with a potential impact on gene expression as expression quantitative trait loci (eQTLs). Currently available data on eQTLs suggest a relationship between gene variants and gene expression in certain cell types; these effects can be disease-specific, and possibly depend on external stimuli (i.e., signaling pathways activated by cytokines affecting a specific cell type) [50,52]. Most eQTLs have been studied in healthy volunteers, with the assumption that the effect of a single nucleotide variant may be independent of the diseased condition. Conversely, recent data have revealed a specificity of eQTL effects in diseased subjects. A recent work that used RNA-Seq in PBMC from MS patients to identify eQTLs in regions centered on at-risk SNPs showed 77 statistically relevant eQTL associations, 40% of which were more pronounced in MS patients compared with noninflammatory neurological disease patients [52]. Another interesting approach to eQTL analysis was based on public RNA-sequencing and microarray data of blood-derived cells. A group investigated the role of SNP rs1414273, which is located within the microRNA-548ac stem-loop sequence in the first intron of the CD58 gene. They provided evidence that this MS-associated SNP might alter Drosha cleavage activity, thus modifying CD58 and microRNA-548ac gene expression in immune cells, a change that has already been reported to be relevant for MS development [53-55].

Interactome-Based Approach
To understand a complex disease such as MS, the impact of environmental factors should be included in investigations. The influence of genetic variants on environmental exposures that associate with multifactorial diseases is far from being fully understood, particularly at a genome-wide level. The complexity of interactions between genes and environment needs to be explored using analytical approaches which are capable of considering many variables simultaneously; this may account for the so-called 'missing heritability'. New investigations on the aggregate analysis of many GWAS loci, each contributing a small effect to overall disease predisposition, might uncover new information when combined with nongenetic factors that can play a role in disease etiology. We tried to implement this concept, using available actionable data on interactions at the protein level between human gene products and exposures. The possible causal significance of environmental exposures was investigated by measuring, at a genome-wide level, the enrichment of MS-associated genetic variants in genomic regions coding for proteins interacting with the exposure (i.e., influencing the exposure). Being aware that most SNPs associated with complex traits fall within regulatory genomic regions with distal or even trans effects, we operationally chose to consider a 20 kbp distance between MS variants and the nearest genes. This analysis was performed using the Association LIst Go AnnoTatOR (ALIGATOR) bioinformatic tool [34] to search for statistical enrichment in associations between the interactome genes and the MS genome-wide association data published by IMSGC in 2011 [8].
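The logic of this enrichment test can be sketched in a few lines. The example below is a simplified illustration and not the ALIGATOR implementation: it assigns each SNP to genes lying within a 20 kbp flanking window, counts how many genes of a candidate module are tagged by associated SNPs, and compares that count with counts obtained from equally sized random SNP draws. The gene coordinates, SNP positions, module membership, and all function names are hypothetical.

```python
import random

WINDOW_BP = 20_000  # flanking distance taken from the text (20 kbp)

def genes_near(snp, genes, window=WINDOW_BP):
    """Names of genes whose span, extended by `window` on both sides, contains the SNP."""
    chrom, pos = snp
    return {name for name, (g_chrom, start, end) in genes.items()
            if g_chrom == chrom and start - window <= pos <= end + window}

def tagged_genes(snps, genes, window=WINDOW_BP):
    """Union of genes tagged by any SNP in `snps`."""
    tagged = set()
    for snp in snps:
        tagged |= genes_near(snp, genes, window)
    return tagged

def empirical_enrichment(assoc_snps, all_snps, genes, module, n_perm=2000, seed=0):
    """Observed module overlap and an empirical p-value from random SNP draws."""
    rng = random.Random(seed)
    observed = len(tagged_genes(assoc_snps, genes) & module)
    as_extreme = 0
    for _ in range(n_perm):
        sample = rng.sample(all_snps, len(assoc_snps))
        if len(tagged_genes(sample, genes) & module) >= observed:
            as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (as_extreme + 1) / (n_perm + 1)

# Hypothetical toy data: gene name -> (chromosome, start, end).
genes = {
    "GENE_A": ("1", 100_000, 120_000),
    "GENE_B": ("1", 300_000, 310_000),
    "GENE_C": ("2", 50_000, 60_000),
}
all_snps = [("1", p) for p in range(0, 500_000, 5_000)] + \
           [("2", p) for p in range(0, 500_000, 5_000)]
assoc_snps = [("1", 135_000), ("2", 55_000), ("2", 400_000)]
module = {"GENE_A", "GENE_C"}  # hypothetical candidate interactome

observed, p_emp = empirical_enrichment(assoc_snps, all_snps, genes, module)
print(f"module genes tagged: {observed}, empirical p = {p_emp:.4f}")
```

A production tool would also correct for linkage disequilibrium between SNPs and for differences in gene size, which this toy permutation scheme ignores.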
The interactomes can be defined as "modules of genes whose products are known to physically interact with environmental or human factors with plausible, uncertain, or unlikely relevance for MS pathogenesis" [56]. The analysis was centered on viral interactomes, based on the classical hypothesis of a viral etiology of MS, examining only direct interactions between viral and human proteins, i.e., those of primary importance for the phenotypic impact of the environment within the host physiology [57,58]. This approach took into consideration MS-associated SNPs each contributing a small effect to overall disease susceptibility (p-value cutoff of association less than 0.05), thus including GWAS signals that fall far short of the genome-wide significance cutoff. This candidate-interactome analysis allowed us to identify a relatively low number of known or new nongenetic factors which are associated with MS risk/protection, and to formalize interplays between the heritable and nonheritable elements of possible causal nature. Specifically, the relevance and complexity of interactions between host genotype and EBV were disclosed, highlighting, through a pathway analysis, some cellular functions that may be affected by such interactions (Figure 1). Moreover, this approach has the power to identify other viruses, and related interacting proteins, which are potentially relevant for MS etiology [56].
Other complementary approaches may unveil the relevance of interactions between environmental factors and genetic predisposition. MS susceptibility regions may be preferentially targeted by both the viral and cellular proteins which are directly involved in molecular mechanisms that are able to translate environmental signals into cellular perturbations. Epstein-Barr nuclear antigen 2 (EBNA2, a viral transactivator of viral and cellular genes) binding motifs were shown to be significantly enriched in genomic intervals associated with MS [35,37]. Another layer of complexity is represented by a striking overlap between MS-associated loci, EBNA2 binding motifs, and vitamin D receptor binding sites [35], suggesting that a complex interplay between host genetic variants and known associated environmental factors [5] may contribute to disease development. These results indicate that looking at genetic predisposition through the lens of nonheritable risk factors represents an advancement in studies aiming at disclosing the functional meaning of, and prioritizing, genetic variants coming from GWAS data.

Future Perspectives
The above studies have highlighted the need to integrate different analytical approaches with the aim of deepening our understanding of MS pathophysiology and etiology. New approaches, still experimental, could have future applications in quantifying the overall burden of genetic risk factors or serving as a stratification biomarker for treatment optimization. A promising approach in this field is the polygenic risk score (PRS), also known as 'genetic risk score'. PRS can be defined as a cumulative risk derived from aggregating the contributions of many DNA variants associated with a complex trait or disease. Existing research using PRS mainly focuses on two problems: association analysis and outcome prediction.
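In its most common form, a PRS is a weighted sum of risk-allele counts, with each variant weighted by its estimated effect size (typically the natural logarithm of the per-allele odds ratio). The following minimal sketch illustrates that calculation only. The variant identifiers, odds ratios, and genotype are hypothetical, and the treatment of missing genotypes as zero dosage is an assumption made here for simplicity.

```python
import math

def polygenic_risk_score(dosages, weights):
    """Weighted sum of risk-allele dosages (0, 1 or 2) over the scored variants."""
    # Assumption: variants absent from `dosages` contribute nothing (dosage 0).
    return sum(w * dosages.get(variant, 0) for variant, w in weights.items())

# Hypothetical per-allele odds ratios; the weights are their natural logs.
odds_ratios = {"variant_a": 1.20, "variant_b": 1.10, "variant_c": 0.90}
weights = {variant: math.log(odds) for variant, odds in odds_ratios.items()}

# Hypothetical individual: two copies of a, one copy of b, none of c.
dosages = {"variant_a": 2, "variant_b": 1, "variant_c": 0}

score = polygenic_risk_score(dosages, weights)
print(f"PRS (log-odds scale): {score:.4f}")
```

In practice, the score is interpreted relative to its distribution in a reference population, and the choice of variants and weights (for example, which p-value cutoff to apply) strongly affects its performance.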
Although the use of PRS has not yet achieved clinical accuracy levels, interesting potential perspectives have emerged in diseases like cancer [59], psoriasis [60], rheumatoid arthritis [61], mental disorders [62,63], atherosclerosis [64], Type 2 diabetes [65,66], asthma [67], Parkinson's disease [68,69], and cardiovascular diseases (CVD) [70], including coronary heart disease (CHD) [71]. Polygenic risk scores can help to select a therapy for disease prevention. For example, statin therapy was shown to lead to a greater relative risk reduction for coronary heart disease among patients with a high genetic risk score compared with patients at low genetic risk [63]. Polygenic risk scores have also been used to explore the genetic overlap between different diseases, where the PRS derived from one disease is evaluated in another (e.g., application of a schizophrenia-specific PRS to bipolar disorder).
In a complex disease such as MS, the PRS could be correlated with phenotypic data (the 'classical' clinical and neuroradiological parameters, but also some emerging biomarkers that are being intensively investigated in biological fluids) as well as with MS endophenotypes (such as the radiologically isolated syndrome; [72]) to generate a complex risk model which is able to predict the cumulative effects leading to overt disease onset. Moreover, the PRS may in turn be integrated with data on environmental factors affecting disease risk or protection. In this context, it is possible to speculate on the calculation of an individual's risk of disease based on the "candidate interactome" approach, integrating the genetic-environmental interplay reported in the previous section. Also, the recently reported genomic variants of EBV [73,74], the main environmental risk factor for MS development, may prompt us to define a PRS combining the MS-associated loci of the host with the risk-conferring or protective genomic variants of the virus. All these approaches would require a model calibrated to proportionately include genetic variants (G) and environmental exposures (E), i.e., taking into account G, E, and their interaction GxE. Such a tool does not currently seem to be within reach for complex diseases, where problems such as pleiotropy, causal effect estimation, and other questions of etiologic epidemiology remain. Overall, the etiology of multifactorial diseases seems to be more complex than anticipated [75]; therefore, their inherently multifaceted nature requires new models. Nonetheless, emerging biostatistical approaches seem poised to start a post-GWAS era [76].
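One conventional way to write a model containing G, E, and GxE is a logistic regression with an interaction term. It is given here only as an illustrative sketch, not as a model proposed in the studies reviewed. All symbols are assumptions introduced for illustration: D is disease status, G is a genetic summary such as a PRS, E is a measure of exposure, and the β coefficients are unknown parameters to be estimated from data.

```latex
\[
\mathrm{logit}\,\Pr(D = 1 \mid G, E)
  = \beta_0 + \beta_G\,G + \beta_E\,E + \beta_{G \times E}\,(G \cdot E)
\]
```

Under this formulation, a nonzero interaction coefficient would mean that the effect of the exposure on the log-odds of disease changes with the genetic burden, which is the kind of interplay the candidate-interactome analyses aim to capture.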

Conflicts of Interest:
The authors declare no conflict of interest.