FDHE-IW: A Fast Approach for Detecting High-Order Epistasis in Genome-Wide Case-Control Studies

Detecting high-order epistasis in genome-wide association studies (GWASs) is of importance when characterizing complex human diseases. However, the enormous numbers of possible single-nucleotide polymorphism (SNP) combinations and the diversity among diseases presents a significant computational challenge. Herein, a fast method for detecting high-order epistasis based on an interaction weight (FDHE-IW) method is evaluated in the detection of SNP combinations associated with disease. First, the symmetrical uncertainty (SU) value for each SNP is calculated. Then, the top-k SNPs are isolated as guiders to identify 2-way SNP combinations with significant interaction weight values. Next, a forward search is employed to detect high-order SNP combinations with significant interaction weight values as candidates. Finally, the findings were statistically evaluated using a G-test to isolate true positives. The developed algorithm was used to evaluate 12 simulated datasets and an age-related macular degeneration (AMD) dataset and was shown to perform robustly in the detection of some high-order disease-causing models.


Introduction
In recent years, genome-wide association studies (GWASs) have played an important role in identifying single-nucleotide polymorphisms (SNPs) associated with complex human diseases. This approach is non-candidate-driven, and investigates the entire genome, thus offering a more comprehensive method when compared to gene-specific candidate-driven studies [1]. Genome-wide association studies was first employed by Klein et al. to investigate patients with age-related macular degeneration (AMD), and they identified two SNPs (rs380390 and rs10272438) with significant AMD associations [2]. Since its conception, 1800 diseases and traits, and thousands of associated SNPs have been identified, with the main focus being on individual SNPs that are isolated based on their contribution to disease status [3]. However, more recently, it has become widely accepted that multiple SNPs may contribute to a given pathogenicity via epistasis. Epistasis is defined as the effect of one gene (allele) on a phenotype that is modified by another gene (allele) or several other genes (alleles) [4], and includes additive/synergistic epistasis, dominant epistasis, recessive epistasis, functional epistasis, and sign epistasis [5].
Nevertheless, detecting and identifying the disease-causing SNP combinations on a genome-wide scale is met with several challenges. First, there is an enormous computational burden that is associated with the examination of SNP combinations due to multiple testing. Second, it is challenging to develop a method that is able to reliably identify disease-causing SNP combinations from those that are not given the diversity that exists among disease models [6], especially when there is insufficient sample data.
To tackle these challenges, some algorithms were developed to detect synergistic SNP combinations associated with complex diseases. The majority of these methods can be classified into three categories: Figure 1. An example of the fast approach for detecting high-order epistasis with interaction weight (FDHE-IW) approach for detecting 3-way single nucleotide polymorphism (SNP) combinations associated with a given phenotype. (1) Using the dataset with N SNPs, the SU value of each SNP was calculated. (2) The top three SNPs with the largest SU values were selected from the N SNPs, and then the three SNPs were employed as seeds to calculate the 2-way interaction weights (IW) with the other SNPs. (3) The top nine 2-way SNP-combinations were selected from the 2-way SNP combinations that pair with the three parent SNPs from (2), based on IW values. (4) The top 9 × 3 = 27 3-way SNP-combinations that are formed from a parent SNP-combination in (3) were selected based on IW. (5) The G-test statistical method was employed to test the 27 3-way SNP combinations, and two 3-way SNP combinations were verified using the G-test. 3 8 150 { , , } X X X ), I denotes the number of genotype combinations (I = 3 k for k-way SNP combination), J is the number of phenotype states (C), with J = 2 for a given disease phenotype (it equals 2 for a case-control dataset. 1 denotes case label, 0 is the control label). The number of samples in a dataset is defined as ni, with the SNP loci taking the value of the ith genotype combination (1 ≤ i ≤ I), and nij representing the number of samples of the ith genotype combination being associated with the phenotype state (cj).

Definitions
"Information entropy" has been defined as the average amount of information that is produced by a stochastic source of data that can be used to measure the data distribution diversity, and to measure the uncertainty of random variables [28]. To introduce our method well, the terms were defined: Figure 1. An example of the fast approach for detecting high-order epistasis with interaction weight (FDHE-IW) approach for detecting 3-way single nucleotide polymorphism (SNP) combinations associated with a given phenotype. (1) Using the dataset with N SNPs, the SU value of each SNP was calculated. (2) The top three SNPs with the largest SU values were selected from the N SNPs, and then the three SNPs were employed as seeds to calculate the 2-way interaction weights (IW) with the other SNPs. (3) The top nine 2-way SNP-combinations were selected from the 2-way SNP combinations that pair with the three parent SNPs from (2), based on IW values. (4) The top 9 × 3 = 27 3-way SNP-combinations that are formed from a parent SNP-combination in (3) were selected based on IW. (5) The G-test statistical method was employed to test the 27 3-way SNP combinations, and two 3-way SNP combinations were verified using the G-test.

Materials and Methods
Sets of SNP variables are defined by X = {X 1 , X 2 , · · · , X N } with N SNP loci, where X i is the genotype variable of the ith SNP locus with values of x 1 i , x 2 i , · · · , x n i , accounting for the homozygous major allele (0), the heterozygous allele (1) and the homozygous minor allele (2). C denotes the phenotype variable with values of c 1 , c 2 , . . . , c J (J = 2 for a given disease phenotype). For a k-way SNP combination X a 1 , X a 2 , · · · , X a k (1 ≤ a i ≤ N, 1 ≤ i ≤ k) (such as a 3-way SNP combination {X 3 , X 8 , X 150 }), I denotes the number of genotype combinations (I = 3 k for k-way SNP combination), J is the number of phenotype states (C), with J = 2 for a given disease phenotype (it equals 2 for a case-control dataset. 1 denotes case label, 0 is the control label). The number of samples in a dataset is defined as n i , with the SNP loci taking the value of the ith genotype combination (1 ≤ i ≤ I), and n ij representing the number of samples of the ith genotype combination being associated with the phenotype state (c j ).

Definitions
"Information entropy" has been defined as the average amount of information that is produced by a stochastic source of data that can be used to measure the data distribution diversity, and to measure the uncertainty of random variables [28]. To introduce our method well, the terms were defined: Entropy: Let p(x i ) be the probability of the ith genotype of a SNP variable (X), with the entropy of X being expressed as: Joint entropy: Let X 1 , X 2 , · · · , X k be k SNP variables, with (x i 1 , x i 2 , · · · , x i k ) being a particular genotype value for these SNP loci, and p(x i 1 , x i 2 , · · · , x i k ) being the probability of the genotype occurring together. The joint entropy (JE) of multiple variables is defined as: The JE is a measure of the uncertainty that is associated with a set of variables and can be used to measure the genotype distribution of a k-way SNP combination (X 1 , X 2 , · · · , X k ); however, it cannot be used in assessing genotype-phenotype correlations. Recently, mutual information has attracted extensive attention for identifying the association.
Mutual information: To measure the mutual dependence between a SNP locus (X) and phenotype (C), the mutual information was defined as follows: where p(x) denotes the marginal probability distribution function of X. Joint mutual information: The joint mutual information between k variables (X 1 , X 2 , X k ) and the phenotype (C) was defined as follows: The mutual information can also be regarded as a measure of the interaction strength between two SNPs (X and C), or between a SNP (X) and the disease status (C).
Interaction gain: Mutual information can also be called an interaction gain (IG) between X and Y. Based on a previous proposed interaction gain with three variables X, Y, and C [29], the following was derived: IG(X; Y; C) = I([X, Y]; C) − I(X; C) − I(Y; C) In GWASs, X and Y can be regarded as SNP loci, and C can be either a SNP variable or phenotype variable. In IG(X; Y; C), a 2-way interaction between SNP X and SNP Y for phenotype C can be depicted, or a 3-way interaction between X, Y, and C can be indicated. If X, Y, and C are a feature set, such as in can be regarded as an IG between X and Y.
In Equation (5), if IG(X; Y; C) > 0 (I(X, Y; C) > I(X; C) + I(Y; C)) then X and Y together yield a more synergistic effect on phenotype than would be expected from their sum individually. Thus, X and Y interacting with each other (interaction effect), and the presence of Y will increase the ability of predicting the phenotype (C). Conversely, IG(X; Y; C) < 0 (I(X, Y; C) < I(X; C) + I(Y; C)) indicates a redundancy between X and Y.
In GWAS, measuring the interaction between genotype variables (X, Y) and phenotype C is very important but difficult. In this study, a new interaction weight (IW) factor was employed.
Interaction weight factor: The interaction weight factor (IWF) between X and Y has been previously [30] defined as: IWF(X, Y) has the following properties: Evaluating the association of all SNP combinations with specific phenotypes may be timeconsuming, and the computational evaluation of high-order SNP combinations is generally difficult to perform. To speed up the process of detecting epistasis from high-dimensional data sets, the present study employed Symmetrical uncertainty in identifying search seeds.
Symmetrical uncertainty (SU): Mutual information has been widely adopted for data mining from high-dimensional data. However, it tends to favor features with more values while ignoring interactive features [31]. In this study, SU was utilized to compensate for the bias of mutual information toward features with more values, as previously described [27,30], and this was defined as Equation (7): In Equation (7), I(X; C) I(X;C) is used to measure the mutual dependence between the SNP locus (X) and the disease status (C). H(X) and H(C) measure the diversity of the SNP genotype distribution and the phenotype, respectively. Nevertheless, SU cannot effectively measure low associations. To enhance the ability to detect lower SNP and phenotype associations, the SU equation was modified as follows: In Equation (8), H(X) + H(C) is replaced with H(X, C) because we have found that the joint distribution of variables X* and C (let X* be the disease locus) usually have a smaller degree of dispersion than those of the other variables, namely, X and C. This improved Equation (8) thus enables the identification of some susceptibility loci with low marginal effects. This proposed algorithm is further described in Algorithm 1.
The FDHE-IW algorithm first calculates the SU value for each SNP ( Step (2)). Then the SNP s a with a maximum SU value is chosen (see step (3.1)) to find K k-way SNP combinations that are associated with the status of the diseases. In step (3.2) of the FDHE-IW algorithm, the IW and weight coefficient (W) for each SNP in F are updated iteratively, and a new SNP having a maximum relevance with the phenotype is combined with SNP s a .
Time Complexity: In the FDHE-IW algorithm, the time complexity is defined as O(N + N*(k!)*K). Generally, the values of k and K are very small, and the value of (k!)*K is much less than N. Therefore, the time complexity of FDHE-IW is less than O(N 2 ), which is a feasible computation amount for current computers to detect high-order SNP epistasis from a data set with thousands to millions of SNPs.
G-Test: A G-test is a maximum likelihood statistical significance test [32]. Compared to a Chi-square test, the G-test will lead to the same test results for samples of a rational size. However, for some cell cases, it is always better than the Chi-squared test [33].
In this study, an improved G-test method [19] was employed to verify the association between genotype and phenotype. For the k-way SNP combination model, the formula for calculating the G value is as follows: T-the candidate size; θ-the threshold of the G-test p-value; k-the number of SNPs in a k-way SNP combination; and K-the number to find the SNP combinations based on a seed SNP. Outputs: SNP combinations (SC)-the k-way SNP combinations that are associated with disease status.
(1) Initialize: (3) Search a k-way SNP combination based on the interaction weight.
(3.1) Select a SNP locus with a maximum SU × W value. s a ← argmax

EndIf (5) Statistical test
Perform G-test statistic for each SNP combination in SC.
Output the k-way SNP combinations with a p-value < θ The degree of freedom d (d = (I − 1)(J − 1)) is modified correspondingly, as follows: Q ij · P ij are the observed numbers and the expected number of genotypes, respectively, when the phenotype takes the state y j , and the genotype takes the ith k-combination.
ln denotes the natural logarithm function. The observed number G = 2 Q ij · P ij is obtained from the dataset by using a simple counting statistical method. The expected genotype frequency number (E ij ) is obtained according to Hardy-Weinberg principles [34]. ξ is a small integer that is less than or equal to 5.

Performance Evaluation
To evaluate the performance of the proposed algorithm, equations for the power Equation (10), the F-measure Equation (11), the recall Equation (12), and the precision Equation (13) were utilized.
Power is a measure of the capability to detect disease-causing models in all datasets, where #S is the number of disease-causing models from all #T datasets (there are 100 data matrices for each disease model).
True positives (TPs) are defined as the discovery of a k-way SNP combination that is associated with disease status, and FNs (false negatives) are defined as a non-discovery of a SNP combination that is associated with disease. TNs (true negatives) indicate no discovery, and FPs (false positives) are defined as a k-way SNP combination that is falsely associated with a disease status [31].
In this experiment, recall, precision and F-measure were used to evaluate the statistical precision of this hypothesis testing method (G-test in our method) for finding disease models in the screening stage. #TP is equal to the number of disease-causing SNP combinations that have passed the threshold p-value, while #FN is the number of the disease-causing SNP combinations that failed to pass the threshold. #FP is the number of non-disease-causing combinations that passed the threshold, while #TN equals the number of non-disease-causing combinations that failed to pass the threshold.

Simulation Data Sets and Case Study
Twelve disease loci with marginal effects (DME)-simulated disease models (multiplicative models: DME-1-DME-4, threshold models: DME-5-DME-8, and concrete models: DME-9-DME-12) (see Tables S1 and S2 in Supplementary File 2) that are well characterized were utilized [17], with 100 simulated data sets being generated for each DME disease model using GAMETES_2.0 [35]. One of the generated data sets contained 100 SNPs, with 800 controls and 800 cases; while another contained 1000 SNPs, with 2000 cases and 2000 controls. The experimental results of our algorithm were then compared with BEAM [14], BOOST [7], MACOED [17], and SNPHarvester [12] results. BEAM is a classical heuristic search algorithm for detecting epistasis interactions. It applies a Bayesian partitioning model to disease-associated markers and their interactions, and uses a Markov chain Monte Carlo (MCMC) to compute the posterior probability that each marker set is associated with a given disease [14]. Boost detects the SNP epistasis interaction very rapidly using a filter and an exhaustive method. SNPHarvester is a classical and effective filtering-based approach for detecting epistatic interactions in genome-wide association studies. MACOED is a new intelligent search algorithm for detecting high-order epistatic interactions that reduces the number of SNP combinations that are examined in association with a phenotype.
To further validate the algorithm, real AMD data containing 103,611 genotypes SNPs for 50 controls and 96 cases [14] were utilized. To balance the case samples and controls, we enlarged the control sample size to 96 using a bootstrap method [36], which can significantly increase the statistical power, and imputed missing data using the k-nearest-neighbor method [37].

Paremeters Setting
In the DME simulation experiments, the threshold θ of the G-test p-value was set equal to 0.01×MAF C k N , with minor allele frequency (MAF). The candidate size set to T = 2*k for simulation data sets and T = 200 for AMD. K = k for simulation data sets and K = 5 for AMD data. All experiments were performed using a Windows 10 operation system with Intel(R) Core(TM) i7-4790 CPU@3.6GHz and 8 GB memory, and all program codes were written in MATLAB R2015b (MathWorks, Natick, MA, USA). The source code is in Supplementary File FDHE-IW.rar.

Simulated Models
The detection power of FDHE-IW was first investigated by comparing it with four state-of-the-art algorithms (BEAM, SNPHarvester, MACOED, and BOOST) using a DME dataset with 100 SNPs ( Figure 2) and a data set with 1000 SNPs (Figure 3). The results in Figures 2 and 3 show that FDHE-IW is superior to the other four algorithms when analyzing DME-1 and DME-3-DME-10, and is comparable with BOOST for DME-2 and DME-12.  The recall, precision and F-measure were also determined when using FDHE-IW for all 12 DME models (Table 1), with the five comparative algorithms also evaluated using a subset of DME models (Tables 2 and 3). These results indicated that the performance of FDHE-IW when using the dataset with 1600 samples was inferior when compared to the dataset with 4000 samples in recall, precision, and F-measure.   The recall, precision and F-measure were also determined when using FDHE-IW for all 12 DME models (Table 1), with the five comparative algorithms also evaluated using a subset of DME models (Tables 2 and 3). These results indicated that the performance of FDHE-IW when using the dataset with 1600 samples was inferior when compared to the dataset with 4000 samples in recall, precision, and F-measure. The recall, precision and F-measure were also determined when using FDHE-IW for all 12 DME models (Table 1), with the five comparative algorithms also evaluated using a subset of DME models (Tables 2 and 3). These results indicated that the performance of FDHE-IW when using the dataset with 1600 samples was inferior when compared to the dataset with 4000 samples in recall, precision, and F-measure. Runtimes are displayed as a mean value of models (multiplicative models: DME-1-DME-4, threshold models: DME-5-DME-8, and concrete models: DME-9-DME-12). The results in Tables 2 and 3 showed that the performance of FDHE-IW was superior to BEAM, SNPHarvester, and BOOST in recall, precision, and F-measure. However, it was found to be inferior to MACOED for most of the DME models. This is because some disease-causing SNP combinations identified by FDHE-IW in the searching stage were then rejected in the testing stage, but in MACOED, only SNP combinations that were significantly associated with disease status were identified. For many disease-causing SNP combinations with a low MAF and low hereditary, MACOED fails to identify them in the screening stage. Therefore, the MACOED has higher precision than FDHE-IW in the testing stage, but the detection power of MACOED is much lower than that of FDHE-IW ( Figure 2). As can be seen in Tables 2 and 3, FDHE-IW has a lower runtime than MACOED, BEAM, and SNPHarvester, but it has a longer runtime than Boost. Table 3. Performance comparisons on recall, precision and F-measure (1000 SNPs, 4000 sample size). Runtimes are displayed as the mean value of models (multiplicative models: DME-1-DME-4, threshold models: DME-5-DME-8, and concrete models: DME-9-DME-12).

Experimental Results Using an AMD Dataset
To further validate the FDHE-IW algorithm, a real AMD data set was evaluated. It took 5 h to find 200 2-way candidate solutions, 15 h to find 200 3-way candidate solutions, and 25 h to find 8 4-way candidate solutions. The algorithm identified 45 2-way SNP combinations (p-value < 10 −11 ), 18 3-way SNP-combinations (p-value < 1 × 10 −15 ; Table 4) and 2 4-way SNP-combinations (p-value = 0; Table 5) that were associated with AMD out of the 103,611 examined SNPs (Supplementary File 1, sheets 1-way-4-way). The obtained findings were further examined using Cytoscape (http://www. cytoscape.org/) [38], and a 2-way SNP interaction network and a 2-way gene interaction network were constructed (Figure 4), where the genes were mapped based on the SNP interaction networks. Table 4. Three-way SNP combinations, SNP1, SNP2, and SNP3 were mapped to gene1, gene2, and gene3, respectively. (NA denotes that the corresponding SNP is situated within a non-coding region).

SNP1
Gene1 Within the SNP network (Figure 4), two widely reported SNPs (rs380390, rs1329428) that are located in an intron within the CFH gene were also identified. The CFH gene has been commonly association with AMD [14,18,19]. Furthermore, the constructed gene network showed that many of the SNPs are mapped to non-gene coding regions, thus denoted NA. Within the interaction network, NA and CFH had six connections due to six SNP pairs within the SNP network being mapped to a CFH-NA gene-pair. CFH was also found to have a novel connection with the JMJD2C gene, a histone lysine demethylase. JMJD2C has been reported to play a crucial role in the progression of breast cancer, prostate carcinomas, osteosarcoma, and blood neoplasms, thus indicating that JMJD2C represents a promising anti-cancer target [39][40][41]. These findings also suggest that JMJD2C may have an important role in AMD.   Figure 4a, where each node denotes a SNP locus. An edge represents a 2-way SNP combination that has a strong association with the phenotype. The yellow SNPs (nodes) have been reported to be associated with age-related macular degeneration (AMD). (b) In Figure 4b, the nodes and edges are mapped from nodes and edges from Figure 4a, in which a node denotes a gene, and each edge represents a 2-way SNP combination that is mapped to two genes. NA denotes non-gene coding regions; there are multiple NA-NA edges because multiple SNP-SNP pairs were mapped to non-gene-coding regions. The greater the number of edges between two gene nodes, the more the SNP combination maps into the two genes. The yellow genes (nodes) are believed to be associated with AMD. In node NA, there are many edges, which means there are multiple SNP combinations in the non-coding region.   Figure 4a, where each node denotes a SNP locus. An edge represents a 2-way SNP combination that has a strong association with the phenotype. The yellow SNPs (nodes) have been reported to be associated with age-related macular degeneration (AMD). (b) In Figure 4b, the nodes and edges are mapped from nodes and edges from Figure 4a, in which a node denotes a gene, and each edge represents a 2-way SNP combination that is mapped to two genes. NA denotes non-gene coding regions; there are multiple NA-NA edges because multiple SNP-SNP pairs were mapped to non-gene-coding regions. The greater the number of edges between two gene nodes, the more the SNP combination maps into the two genes. The yellow genes (nodes) are believed to be associated with AMD. In node NA, there are many edges, which means there are multiple SNP combinations in the non-coding region.
Eight 3-way gene combinations containing the CFH gene, and four 3-way gene combinations containing the JMJD2C gene were identified. Only one 3-way combination did not involve the CFH and JMJD2C genes, suggesting that these are important to AMD.
The two 4-way SNP combinations in Table 5 shows little uncertainty on whether every SNP contributes to the phenotype. Therefore, the AUCs (areas under the curve) of each SNP involved in a potential SNP combination (1-way, 2-way, and 3-way) were computed and then compared to the AUC relative to the 4-way SNP combinations (see Figures S1 and S2 in Supplementary File 3). The AUC of the 4-way SNP combination was larger than those of other sub SNP combinations. Hence, each SNP in the SNP combinations contribute to the development of AMD.
Recent investigations have utilized AMD data sets, which include algorithm IOBLPSO [42], epiACO [43], BEAM [14], epi forest [44], DCHE [45], FHSA-SED [18] and NHSA-DHSC [19]. Table 6 summarizes the results of these seven studies. Two SNPs (rs380390, rs1329428) and the CFH gene have been reported by all the seven algorithms. However, other SNPs and SNP combinations that are associated with AMD were detected by different algorithms. The proposed FDHE-IW also detected novel SNPs and genes, where rs10511467 (in NA) and rs3776652 (in the JMJD2C gene) were also reported by FHSA-SED and NHSA-DHSC. SNPs rs6598991 and rs10507949, both in NA, have not been reported to date. Table 6. Comparison of the results of seven algorithms using the age-related macular degeneration (AMD) data set.

Advantage
The FDHE-IW algorithm does not need to evaluate all k-way SNP combinations to detect k-way disease-causing SNP combinations. It is capable of detecting some higher-order epistasis on a whole genome scale with a time complexity that is much less than O(N 2 ).

Limitations
The proposed algorithm can effectively detection nested epistasis [35], suggesting that one or more of the interacting loci are the major contributors to disease, and that at least one proper subset of the loci also interacts epistatically. However, for some disease-causing SNP interaction combinations with pure, strict epistasis, the detection power of FDHE-IW is still unsatisfactory.

Future Work
At present there is still no fast or effective approach for detecting various disease-causing models with multi-loci in GWAS, due to the enormous computational burden. Therefore, detecting high-order disease models has room to be explored using high-performance methods and cloud computing. In future research, we will also focus on non-coding genomic regions, we will continue to focus on rare variants in GWASs, and we will develop new methods to aid in the identification of the causes of complex diseases. With the rapid development of high-performance cloud computing techniques, these abilities should continue to improve.