The Role of SNP Interactions when Determining Independence of Novel Signals in Genetic Association Studies—An Application to ARG1 and Bronchodilator Response

Genome-wide association studies (GWAS) play a critical role in identifying many loci for common diseases and traits. There has been a rapid increase in the number of GWAS over the past decade. As additional GWAS are being conducted, it is unclear whether a novel signal associated with the trait of interest is independent of single nucleotide polymorphisms (SNPs) in the same region that has been previously associated with the trait of interest. The general approach to determining whether the novel association is independent of previous signals is to examine the association of the novel SNP with the trait of interest conditional on the previously identified SNP and/or calculate linkage disequilibrium (LD) between the two SNPs. However, the role of epistasis and SNP by SNP interactions are rarely considered. Through simulation studies, we examined the role of SNP by SNP interactions when determining the independence of two genetic association signals. We have created an R package on Github called gxgRC to generate these simulation studies based on user input. In genetic association studies of asthma, we considered the role of SNP by SNP interactions when determining independence of signals for SNPs in the ARG1 gene and bronchodilator response.


Introduction
Genome-wide association studies (GWAS) play a critical role in identifying many loci for common diseases [1] as well as complex traits [2]. Over the past decade, the rapid increase in the number of GWAS provides an extraordinary opportunity to examine the potential impact of common and rare genetic variants on complex diseases [3]. Two independent GWAS may identify two different single nucleotide polymorphisms (SNPs) in the same gene or region that are both significantly associated with the trait of interest [4]. When determining whether these two signals are independent in GWAS, epistasis and SNP by SNP interactions are often not considered.
Epistasis is defined as the interaction between different genes or SNPs and refers to the departure from independence of the effect of different genetic loci on the disease or trait of interest [5,6]. There is a complex relationship with epistasis and linkage disequilibrium (LD) [7,8]. Multiple unobserved functional polymorphisms can lead to genotyped SNPs 2 of 6 that do not properly represent the causal variants [9] and high order LD can lead to spurious statistical epistatic associations [10]. Strong LD may suggest that detected epistasis between pairs of SNPs in different association studies need to be interpreted with caution [9,10].
Given epistasis, it is important to consider SNP by SNP interactions when determining whether genetic signals in GWAS are considered independent. For example, if an SNP is associated with the trait of interest in a GWAS, an investigator may want to determine if this SNP is an independent signal or purely the result of a correlated SNP that was previously associated with the trait of interest. In order to determine if the two signals are independent, the most popular approaches are to fit a regression of the trait of interest with the novel SNP conditional on the previously identified SNP and/or calculate LD between the two SNPs.
Through simulation studies, we examined the impact of SNP by SNP interactions when determining whether two signals are independent in a GWAS. We have created the R package, called gxgRC, which implements these simulation studies for user specified parameters. In addition, we considered the role of SNP by SNP interactions when determining independence of signals for SNPs in the ARG1 gene and bronchodilator response in genetic association studies of asthma.

Materials and Methods
In the following simulation scenarios, we examined the impact of SNP by SNP interactions when determining the independence of two SNPs by regressing the trait of interest Y with the SNP X 1 conditional on the SNP X 2 . We generated 1000 subjects for 5000 simulations using a significance level of 5*10 −8 . SNP X 1 is generated from a binomial distribution with a binary genetic coding (i.e., dominant or recessive model) and P(X 1 = 1) = 0.5. SNP X 2 is generated from a logistic regression based on X 1 such that: where γ 0 = 0 and γ 1 = 0.3. While X 1 and X 2 assume a binary genetic coding for simplicity, the results are generalizable to an additive genetic coding (i.e., X j = 0, 1, 2 for j = 1, 2). The continuous outcome Y is generated from a normal distribution with a variance of 1 and a mean such that where β 0 = 0, β 1 = 0.3 or 0, β 2 = 0.3 or 0, and β I varies from 0.3 to 1 by 0.05. We considered additional simulation scenarios for different values for γ 0 , γ 1 in Equation (1) and β 0 , β 1 , β 2 in Equation (2). However, we observed similar results to the presented simulations scenarios; therefore, the results are not shown here. After the data were simulated using Equations (1) and (2), we then fit 3 algorithms and tested the following null hypotheses for each algorithm: we tested H 0 : α 1 = 0 to determine if the SNP X 1 is associated with the trait of interest Y conditional on the SNP X 2 .

Results
In Figure 1, β 1 = 0.3 and β 2 = 0.3 in Equation (2) for the plot on the left and β 1 = 0 and β 2 = 0 in Equation (2) for the plot on the right. For both plots and simulations, when a stronger interaction between the two SNPs is generated (i.e., β I closer to 1 in Equation (2)), the majority of simulations concluded scenario 1: rejecting Algorithm 0 H 0 : δ 1 = 0, Algorithm 1 H 0 : α 1 = 0, and Algorithm 2 H 0 : ϕ I = 0. These simulations show that there can be a significant association between the SNP X 1 and the trait of interest Y in Algorithm 0, and the SNP X 1 is still significantly associated with the trait of interest Y when conditioning on the SNP X 2 in Algorithm 1. However, there is a significant interaction between the two SNPs in Algorithm 2. This shows that if a researcher were to use Algorithm 1 to conclude that the two SNPs are independent since the SNP X 1 is significantly associated with the trait of interest Y conditional on the SNP X 2 , a false conclusion would be reached because there is a significant interaction of the two SNPs on the trait Y in Algorithm 2 and as generated by the data using Equation (2), such that β I = 0. These simulations demonstrate that it is not sufficient to consider independence of two genetic signals by considering Algorithm 1: E[Y] = α 0 + α 1 X 1 + α 2 X 2 and testing H 0 : α 1 = 0. One needs to also consider if there is a significant SNP by SNP interaction by fitting Algorithm 2:  (2) for the plot on the right. For both plots and simulations, when a stronger interaction between the two SNPs is generated (i.e., βI closer to 1 in Equation (2)), the majority of simulations concluded scenario 1: rejecting Algorithm 0 : = 0, Algorithm 1 : = 0, and Algorithm 2 : = 0. These simulations show that there can be a significant association between the SNP and the trait of interest Y in Algorithm 0, and the SNP is still significantly associated with the trait of interest Y when conditioning on the SNP in Algorithm 1. However, there is a significant interaction between the two SNPs in Algorithm 2. This shows that if a researcher were to use Algorithm 1 to conclude that the two SNPs are independent since the SNP is significantly associated with the trait of interest Y conditional on the SNP , a false conclusion would be reached because there is a significant interaction of the two SNPs on the trait Y in Algorithm 2 and as generated by the data using Equation (2), such that 0. These simulations demonstrate that it is not sufficient to consider independence of two genetic signals by considering Algorithm 1: = + + and testing : = 0. One needs to also consider if there is a significant SNP by SNP interaction by fitting Algorithm 2: = + + + and testing : = 0.  (2)), the majority of simulations concluded scenario 2: rejecting Algorithm 0 : = 0 and Algorithm 1 : = 0, but failing to reject for Algorithm 2 (i.e., there was not a signification SNP by SNP interaction). For the plot on the right, when the interaction was simulated to be weaker (i.e., βI closer to 0.3 in Equation (2)), the majority of simulations concluded scenario 5: failing to reject Algo-  (2) for the plot on the right, where both y-axes are the proportion of simulations where the null hypothesis was rejected. For the plot on the left, when the interaction was simulated to be weaker (i.e., β I closer to 0.3 in Equation (2)), the majority of simulations concluded scenario 2: rejecting Algorithm 0 H 0 : δ 1 = 0 and Algorithm 1 H 0 : α 1 = 0, but failing to reject H 0 for Algorithm 2 (i.e., there was not a signification SNP by SNP interaction). For the plot on the right, when the interaction was simulated to be weaker (i.e., β I closer to 0.3 in Equation (2)), the majority of simulations concluded scenario 5: failing to reject Algorithm 0 H 0 : δ 1 = 0 (i.e., the SNP X 1 was not associated with the trait of interest Y). For both plots, when a stronger interaction between the 2 SNPs was simulated (i.e., β I closer to 1 in Equation (2)), the majority of simulations concluded scenario 1: rejecting Algorithm 0 H 0 : δ 1 = 0, Algorithm 1 H 0 : α 1 = 0, and Algorithm 2 H 0 : ϕ I = 0.

Data Analysis
To illustrate the effect of SNP by SNP interactions when determining conditional independence of genetic signals, we considered SNPs in chromosome 6 [ARG1], which has previously been associated with bronchodilator response (BDR) in asthma [11]. In the CAMP (N = 560) [12], CARE (N = 206) [13,14], and LODO (N = 126) [15] cohorts, we used weighted least squares regression to examine the conditional effect of four SNPs in the ARG1 gene on BDR among subjects of European ancestry (total N = 892) adjusting for cohort, age, sex, body mass index (BMI) as a categorical variable (obese, overweight, vs. normal/underweight), and genetic ancestry. We picked these covariates based on other studies of BDR [16][17][18], as BDR has been found to differ depending on age [19,20], sex [21] and BMI [22]. Three of these SNPs are common variants that have previously been associated with BDR (rs2781659, rs2781663, rs2781665) [11] and one SNP (rs185631674) is a rare variant in the region.
Based on previous studies [11], rs2781659 had the most significant association with BDR in AGR1. In our cohorts, rs2781659 was associated with BDR adjusting for cohort, age, sex, BMI as a categorical variable (obese, overweight, vs. normal/underweight), and genetic ancestry (minor allele frequency (MAF) = 0.32, Beta = −0.15, sd = 0.05, pvalue = 0.0037). In order to determine if the association of rs2781659 with BDR is independent of the other three SNPs, we considered the following algorithms: where C is a vector of the covariates: cohort, age, sex, BMI as a categorical variable (obese, overweight, vs. normal/underweight), and genetic ancestry.
As seen in Table 1, the association between rs2781659 and BDR was still significant when conditioning on rs2781663 (p = 0.003) and rs185631674 (p = 0.004) but not when conditioning on rs2781665 (p = 0.78). The SNP by SNP interaction on BDR was marginally significant for rs2781663 (p = 0.06) and rs2781665 (p = 0.06), and significant for rs185631674 (p = 0.03). However, rs2781659 is in LD with rs2781663 (r 2 = 0.995) and rs2781665 (r 2 = 0.891) and rs185631674 is a rare variant in a study with a small sample size (N = 892). This shows that if a researcher were to only consider Algorithm 1, where the association of rs2781659 with BDR is still significant conditioning on rs2781663 and rs185631674, the researcher would falsely conclude that there is more than one independent genetic signal with BDR. This false conclusion would not be reached with the SNP by SNP interaction considered in Algorithm 2 as well as the LD as measured by r 2 . This also shows that special consideration needs to be given to rare variants.

Discussion
Through simulation studies and a data analysis of SNPs in ARG1 with BDR, we demonstrate that it is not sufficient to consider independence of two genetic signals by considering Algorithm 1: E[Y] = α 0 + α 1 X 1 + α 2 X 2 and testing H 0 : α 1 = 0. One needs to also consider whether there is a significant SNP by SNP interaction by fitting Algorithm 2: E[Y] = ϕ 0 + ϕ 1 X 1 + ϕ 2 X 2 + ϕ I X 1 X 2 and testing H 0 : ϕ I = 0 and/or calculating an estimate of LD, such as r 2 or D'. Also, prior knowledge, for example protein-protein interactions or biological pathways, should be considered when examining SNP by SNP interactions [23].
There are some potential limitations for our simulation studies and data analysis. For the data analysis, all subjects are of European origin. A more diverse population could provide varying results, which should be considered for future analyses. Additionally, our data analysis considered genetic association studies of asthma but there is opportunity to explore the role of SNP by SNP interactions in conditional analyses by examining other diseases or traits. The sample size for the data analysis was relatively moderate (n = 892). Since the power to detect an interaction is substantially smaller than detecting a main effect, it should be noted that a larger sample size may be needed or the study may be underpowered to detect SNP by SNP interactions. While we have only presented two simulation studies here, we have created an R package on Github called gxgRC (https://github.com/SharonLutz/gxgRC (accessed on 12 December 2020)) [24] to generate similar simulations based on user input.
Understanding the role of epistasis and SNP by SNP interactions is important for the development of pharmacogenetic tests and personalized medicine. To date, studies in asthma pharmacogenetics have not resulted in clinical practice changes; however, exploring the role of SNP by SNP interactions has the potential to increase the likelihood of translatable findings.

Conflicts of Interest:
The authors have no conflicts of interests to disclose.