HSIC CR : A Lightweight Scoring Criterion Based on Measuring the Degree of Causality for the Detection of SNP Interactions

: Recently, research on detecting SNP interactions has attracted considerable attention, which is of great signiﬁcance for exploring complex diseases. The formulation of effective swarm intelligence optimization algorithms is a primary resolution to this issue. To achieve this goal, an important problem needs to be solved in advance; that is, designing and selecting lightweight scoring criteria that can be calculated in O ( m ) time and can accurately estimate the degree of association between SNP combinations and disease status. In this study, we propose a high-accuracy scoring criterion (HSIC CR ) by measuring the degree of causality dedicated to assessing the degree. First, we approximate two kinds of dependencies according to the structural equation of the causal relationship between epistasis SNP combination and disease status. Then, inspired by these dependencies, we put forward this scoring criterion that integrates a widely used method of measuring statistical dependencies based on kernel functions (HSIC). However, the computing time complexity of HSIC is O ( m 2 ) , which is too costly to be an integral part of the scoring criterion. Since the sizes of the sample space of the disease status, SNP loci and SNP combination are small enough, we propose an efﬁcient method of computing HSIC for variables with a small sample in O ( m ) time. Eventually, HSIC CR can be computed in O ( m ) time in practice. Finally, we compared HSIC CR with ﬁve representative high-accuracy scoring criteria that detect SNP interactions for 49 simulation disease models. The experimental results show that the accuracy of our proposed scoring criterion is, overall, state-of-the-art.


Introduction
Since many complex diseases are usually caused by multiple genes and multiple factors, in recent years, with the emergence of high-throughput genotypic technology, genome-wide association analysis (GWAS) has been one of the main methods used to study complex diseases.Furthermore, the identification of single-nucleotide polymorphism (SNP) interactions from GWAS data is of great importance for exploring the explanation, prevention and treatment of complex diseases [1,2].Therefore, over the past decade, this research topic has attracted considerable attention [3][4][5][6][7][8][9].
It is well known that SNP interactions represent combinations of multiple SNPs that affect complex diseases in a linear or non-linear manner, also known as k-order epistasis SNPs.The research topic of detecting k-order epistasis SNPs is a typical case of combinatorial optimization problems in k-dimensional discrete space (k ∈ {2, 3, 4, 5} in practice), and swarm intelligence optimization (SIO) algorithms are one of the main methods used to solve the problems [9][10][11].For this study to be successful, an important problem needs to be solved in advance; that is, designing and selecting lightweight scoring criteria that can be calculated in O(m) time and can accurately estimate the degree of association between SNP combinations and disease status.
To date, few lightweight scoring criteria can accurately estimate the degree of association of SNP combinations with disease status in most disease models due to the widely varying characteristics of different disease models.As one of the primary methods used to work on this combinatorial optimization problem, SIO algorithms mostly tackle this issue by combining multiple criteria [9][10][11][12][13]; however, using too many objective functions will often make the proposed algorithm difficult to converge effectively.Therefore, picking a few high-accuracy objective functions instead of using too many objective functions can dramatically improve the performance of the used algorithms [14,15].This paper's goal is not to contribute toward fixing the issue entirely but to use a different methodology to propose a scoring criterion that can accurately estimate the associations in most disease models.The contributions of this paper are: 1.
We propose a high-accuracy scoring criterion based on measuring the degree of causality that integrates a widely used method of measuring statistical dependencies (HSIC); 2.
We put forward an efficient algorithm of computing HSIC on two variables with a small sample in O(m) time, thus enabling us to compute HSIC CR in O(m) time in practice.

Related Works
So far, the proposed lightweight scoring criteria can be roughly divided into two categories.
The first category covers various approaches, which are so-called Bayesian scoring criteria.The Bayesian scoring criteria calculate the posterior probability distribution, proceeding from a prior belief on the possible DAG models, conditional on the data [16].The K2-Score is an efficient Bayesian scoring criterion that obtains priors under the assumption that all DAG models are equally likely.Other such scoring criteria representatives include the Bayesian Dirichlet equivalent (BDe) scoring criterion and the Bayesian Dirichlet equivalent uniform (BDeu) scoring criterion [4,[17][18][19].
The second category is usually known as the information-theoretic scoring criteria.Mutual information (Mi) is a lightweight method but has preferences for certain disease models [6,20].The JS divergence is a symmetrized divergence measure, derived from the Kullback-Leibler (KL) divergence, which is an asymmetric divergence measure of two probability distributions [21].This approach can be utilized to evaluate the SNP genotype deviation between control samples and case samples.Lately, joint entropy (JE) and normalized distance with joint entropy (ND-JE) have been proposed as criteria for guiding harmony search algorithms to discover clues for exploring the epistasis of SNP combinations [9].
There are also a few approaches that do not fall into any of the above two main categories.For example, the LR is a composite indicator that reflects both sensitivity and specificity and can be used for a related measure to find the likelihood difference between a disease-causing SNP combination and an SNP combination that is not involved in the disease process [22,23].
In statistics, G-test is a significant test method of natural ratio or maximum likelihood.In recent years, scholars have tended to use the G-test independence test instead of the chi-square independence test recommended in the past.In genome association analysis, G-test has been extensively used.Different from other scoring criteria, G-test will provide its p-value when measuring the relationship between an SNP combination and sample state, which can indicate whether the SNP combination has a significant relationship with the sample state [24].
Published research has found that the results were different when employing different scoring criteria, The K2-Score has been widely used to evaluate the association.This measure has a high capacity for detecting SNP interactions and is superior in discriminating certain disease models with low marginal effects.However, for the interaction model with low minor allele frequencies (MAFs) and low genetic heritability (h), the K2-Score has a low performance in detecting high-order SNP interactions.The ND-JE is proposed based on the properties of the disease-causing SNP combination models without marginal effects, so this metric is more suitable for evaluating diseases with this type of mode.The LR-score aims to discover the relationship between likelihood differences in functional SNP combinations and non-functional SNP combinations.The method is well adapted to unbalanced datasets of cases and controls.In practice, the use of the G-test as a single evaluation criterion for detection is found to be inadequate, as there are often many SNP combinations with G-test values close to 0 [6,9,11].
The scoring criterion proposed in this work is based on the theory of causality.It is distinct from the theoretical approach taken by the current existing criteria.From the perspective of a comparison with correlation, causality strictly distinguishes "cause" variables and "result" variables, and plays an irreplaceable role in revealing the mechanism of the occurrence of things and guiding intervention behaviors [25].Thus, the proposed criterion is a useful for and complementary to the current existing criteria.

Concepts and Terms
In this work, x = {x 1 , x 2 , . . ., x n } represents a set of n SNP loci, and is a set of m samples of x; Y = {y 1 , y 2 , . . ., y m } t denotes a set of m samples of disease status y.For ∀1 i n, 1 j m , i, j ∈ Z, x ij ∈ {0, 1, 2}, x ij is equal to 0, 1 and 2, which implies that it is the homozygous major allele (AA), heterozygous allele (AT), and homozygous minor allele (TT), respectively; y j = 0 for control and y j = 1 for case; D = (X,Y) is a dataset with m samples.Definition 1 (k-order epistasis SNP combination).Let S k = {{x i 1 , x i 2 , . . ., x i k }} be a collection of a set with k SNP loci (1 < k < n).f (D): S k → R + is a score function used for measuring the association between any k SNP loci and disease status y based on a dataset D. If x has the k-order epistasis SNP combination on y (denoted as s * k , s * k ∈ S k ), and f (D) is a correct score function (or scoring criterion), then, for

Causal Relationship
According to how the data are generated, the structural equation of the causal relationship between epistasis SNP combination and disease status can be modeled as [26]: where e y is the noise variable.
From the above equation, we can find that s * k and e y are independent (denoted as s * k ⊥ e y ).In other words, among all s k ∈ S k , s * k and e y have the lowest degree of dependency.However, it is unrealistic to measure the dependence degree of any s k and e y as the evaluation criterion for epistasis detection, because it requires too high a computational cost to obtain data generated by e y based on the regression method.
Thus, this paper herein let e y be the constant 0, i.e., y ≈ f (s * k ) , which approximately introduces two kinds of dependencies as described in Figures 1 and 2, respectively [25].Obviously, the dependence between s * and y is direct (denoted as s * k ⊥ y); and the other is derived from the v-structure, i.e., x i 1 and x i 2 are dependent given a value of y and s * V-structure-related dependence.

Scoring Criterion
These two kinds of dependencies described above inspire us to raise this scoring criterion, which integrates a widely used method of measuring statistical dependencies based on kernel functions (HSIC). For is the number of rows of the data slice (m ; HSIC(X, Y) is used to measure the degree of statistical dependence of two random variables (x and y) based on dataset (X, Y); the scoring criterion can be computed by the following Equations ( 2)-( 6).
To facilitate the reader's understanding, we now define following notations: For all The value of HSIC q CR is a linear weighted sum of HSIC(X i 1 i 2 ...,i k , Y i 1 i 2 ...,i k ) and b q based on sample size of D i 1 i 2 ...,i k and average sample size of all D q i a i b , which is a component and basis of HSIC CR ; 4.
In particular, b q i 1 i 2 = b q as is HSIC q CR = HSIC CR when k = 2; 5.
For robustness purposes, let b q i 1 i 2 = 0 if and only if the denominator of the weighted factor term is 0, like b q ; 6.
The effort to calculate scoring criterion is reduced to calculate HSIC( Thus, our estimate of s * k can eventually be obtained by solving the problem where k(, ) and l(, ) are two kernel functions.
Let X and Y be the separable sample spaces of random variables x and y, respectively, assuming that (X, Γ) and (Y, Λ) are furnished with probability measures p x , p y , respectively (Γ being the Borel sets on X, and Λ the Borel sets on Y); p xy is a joint measure over (X × Y, Λ × Γ); HSIC P xy ≥ 0 (the higher the degree of dependence of x and y, the greater the value) and HSIC P xy is zero if and only if x and y are independent.
In order to show that HSIC is a practical criterion for measuring independence or the degree of dependence given a finite number of observations, it consists of an empirical estimator with O(m −1 ) expectation bias, denoted by HSIC(D), formulated as follows: where An advantage of HSIC compared with other kernel-based independence criteria is that it can be computed in O(m 2 ) time.However, such computational costs are too high as an integral part of the scoring criterion.Fortunately, the sample space of the disease status, SNP loci and k-order SNP combination is finite discrete.Thus, immediately below, we put forward an efficient HSIC calculation method for variables with a small sample, which can be approximately calculated in O(m) time.Thus, as k ∈ {2, 3, 4, 5}, we can compute

Efficient Computation
Proposition 1 (Efficient computation).Let x and y be two random discrete variables with p and q states, respectively, where p 2 × q 2 < m, or p 2 × q 2 ≈ m.Then, we can compute trace(KHLH) in O(m) time.
Proof.Let e be a column vector with a length of m, e =  , and I be an identity matrix with a size of m * m; we have where each l i is the sum of the corresponding row elements of L.
Thus, without a loss of generality, we can assume that: D={(c 1 , y i 1 ), . . .(c 1 , y i 2 −1 ), (c 2 , y i 2 ), . . .(c 2 , y i 3 −1 ), . . .(c p , y i p ), . . .(c p , y i m )}, i.e., the number of observed instances of x with the value of c j (denoted as xt(c j )) is i j+1 − i j (i p+1 = i m + 1); K can be viewed as a p × p partitioned matrix.Let K jl be the j × lth block having xt(c j ) × xt(c l ) elements, all having the same value ( k(c j , c l )).Let K = K − K, where all elements in Kjl have the same (denoted as K(j, l)).Let yx(j, i) be the number of observed instances with the value (c i , d j ) (1 ≤ j ≤ q, d j is the j-th state of y); the definition of L and L is similar to that of K and K.We also view L as a p × p partitioned matrix, where each L jl has the same number of rows and columns as , we can obtain that the computational complexity of trace( K L) is O(p 2 q 2 ).
As described above, we can know that the total computational complexity of xt, yt and yx is O(m), and that those of K and L are O(p 2 ) and O(q 2 ), respectively.
Hence, we have that the total computational complexity of HSIC is O(m), where The proof is complete.
In fact, the proof above gives the simplified process of efficient computation to HSIC.The detailed processes are shown in Algorithms 1-5.Algorithm 1 is the main process of the method, consisting of three functions: 1.
(X, Y) is m observations of a tuple of x and y with p and q states, respectively; 2.
kernels includes two kernel functions used to calculate K i,j and L i,j , the parameters of which are delta(1) and delta(2), respectively; 3.
GetIn f o (see Algorithm 2) is used to calculate xt, yt and yx ; 4.
Trace (see Algorithm 4) is used to calculate trace( K L); 6.
RowAverage (see Algorithm 5) is used to calculate K and L.

Evaluation Criterion
The evaluation criterion that we adopted in the experiments is by [9]: where S is the number of found disease-causing SNP combinations (the epistasis SNPs score the highest) and T is the number of datasets.Each dataset includes one disease-causing SNP combination.Power is a measure of the accuracies of scoring criteria from genome data.

Simulated Datasets
For any data set, the worst-case scenario for checking the correctness of the scoring criteria is extensive testing of all SNP combinations.It is too computationally expensive for k = 4 and k = 5 cases.Therefore, tests were only conducted for k = 2 and k = 3.

Disease Models with k = 2
For k = 2, we used thirty-five disease models without marginal effects (DNME1-35) and six disease models with marginal effects (DME1-6).The models were designed based on interaction structures with different diseases, MAFs, prevalence (p) and h (the parameter settings are described in the supplementary files).Each data set contains 1000 SNPs and includes pairs of interacting SNPs (M0P0 and M1P1) generated according to the disease model setting, while other SNPs are generated using MAFs uniformly selected in [0.05, 0.5).For each model, we generated two simulated 100 data sets using the software GAMETES2.1 [30] with sample sizes of 400 (200 controls and 200 cases) and with sample sizes of 800 (400 control and 400 cases) [31].
The analysis results of subgroups of DNME1-35, each of which has 400 samples, are shown in Figure 3: 1.
Except for Mi, using tests on DNME1-10, the accuracy of all scoring criteria is close to 100%; 2.
All criteria are not very accurate using tests on DNME21-25 and DNME31-35; 3.
Mi has an extremely poor accuracy on all subgroup tests; 4.
LR has the highest accuracy using tests on DNME11-15 and DNME16-20, close to 100%, but is only a little more accurate than Mi on DNME21- When the size of samples increased from 400 to 800, the accuracy of all criteria was greatly improved.The analysis results of subgroups of DNME1-35, each of which has 800 samples, are shown in Figure 4: 1.
Except for Mi, the accuracy of all criteria is close to 100% excluding tests on the two most difficult model subgroups (DNME21-25 and DNME31-35); 2.
Although the accuracy rate of Mi can be significantly improved with the increase in the size of samples, it is still relatively poor overall; 3.
With the number of samples increasing, there is still no change in the overall ranking, but the accuracy of the K2-Score on the DNME31-35 test rises to second; 4.
HSIC CR has the highest accuracy on the two most difficult model subgroup tests.Table 1 reveals the total average accuracy.From Table 1, we can find that Mi has a poor average accuracy; HSIC CR has the best average accuracy regardless of the model's sample scale of 400 or 800; although HSIC CR is only slightly higher than the other four criteria on the total average accuracy, and the average accuracy on the most difficult model subgroup test is much better than other criteria.where k = 2, that each scoring criterion identified epistasis SNPs of snp1000 for sample sizes of 400 and 800.The fourth column gives the total accuracy over all sample sizes.The last column gives the accuracy over all sample sizes in the most difficult subgroup models.The scoring criteria are listed in descending order of total accuracy.
The analysis results of DME1-6, each of which has 400 samples, are shown in Figure 5: 1.
The accuracy of all scoring criteria is close to 100% tested on DME1, except for Mi; 2.
Mi has extremely poor accuracy on all six models tests; 3.
Except for Mi, the accuracy rate of LR is worse than the other four criteria on DME2-6 tests, except that the accuracy on the DME3 test is nearly the same as that of HSIC CR ; 4.
The accuracy rate of ND-JE ranks third on DME1 and DME3 tests, and fourth on the other four models tests; 5.
The accuracy rate of G-test ranks first on the DME1 test, second on DME2 and DME3 tests and third on the other four models tests; 6.
HSIC CR has the highest accuracy rate on DME4 and DME6 tests, its accuracy rate on the DME5 test is slightly worse than LR and the accuracy rate on DME1-2 tests ranks third, whereas the accuracy rate on the DME3 test is a little better than Mi; 7.
K2-Score has the highest accuracy rate on DME1-3 and DME5 tests, its accuracy rate on DME4 and DME6 tests ranks second and it significantly outperforms the others on the most difficult model (DME3) test (although its accuracy rate in DME3 is below 50%).When the size of samples increased from 400 to 800, the accuracy of all criteria was greatly improved.The analysis results of DME1-6, each of which has 800 samples, are shown in Figure 6: 1.
Although the accuracy of Mi can be significantly improved with the increase in the size of samples, it is still relatively poor overall; 2.
The accuracy rates of the other five scoring criteria all exceed 95% tested by the models, except on DME3; 3.
K2-Score has the highest accuracy rate on the most difficult model test (over 90%), the accuracy rate of G-test ranks second (over 80 %) and the accuracy rates of ND-JE, HSIC CR and LR are not good enough, at just over 70%.Table 2 reveals the total average accuracy.From Table 2, we can find that Mi has a poor average accuracy; the K2-Score has the best average accuracy rate regardless of the model's sample scale of 400 or 800, where the main reason is that its accuracy rate on the DME3 test is much better than the other five scoring criteria; although the average accuracy of HSIC CR tested on DME3 is not good enough, it ranks second in the overall average accuracy rate.For k = 3, the data sets are generated by eight third-order epistasis pathogenic models (DM1-8), which are modeled by GAMETES2.1 according to the combinations of different MAFs ([0.2, 0.4]) and different heritability ([0.025, 0.05, 0.1, 0.2]) (DM1 MAF = 0.2, h = 0.025; DM2 MAF = 0.2, h = 0.05; DM3 MAF = 0.2, h = 0.1; DM4 MAF = 0.2, h = 0.2; DM5 MAF = 0.4, h = 0.025; DM6 MAF = 0.4, h = 0.05; DM7 MAF = 0.4, h = 0.1; DM8 MAF = 0.4 h = 0.2).The quantiles of each combination is five.Every quantile of each pathogenic model corresponds to 100 simulated data files.Each file contains 100 SNPs and 1600 samples (800 normal, 800 diseased), and includes three interacting SNPs (M0P0, M1P1 and M2P2) generated according to the disease model settings, while other SNPs were generated using MAFs uniformly selected in [0.05, 0.5].Therefore, the total number of the data sets is 4000 [6].Detailed parameter settings are described in the supplementary file.
The analysis results of DM1-8, each of which has 1600 samples, are shown in Figure 7: 1.
The accuracy of all scoring criteria is close to 80% on the DM1 test; 3.
The K2-Score has a poor accuracy rate on the most difficult model (DM5) test, whose accuracy is just close to 10%; 4.
The accuracy rates of the scoring criteria on the DM6 test are good enough, and the accuracy rates of the scoring criteria are close to 100%, except the K2-Score; 5.
HSIC CR has the highest accuracy rate on the DM5 test, and its accuracy rate is the only one that exceeds 60% among all scoring criteria.Table 3 reveals the total average accuracy.From Table 3, we can find that the K2-Score has a poor average accuracy on the most difficult model test; the total average accuracy rates for all scoring criteria are good enough, with all criteria except the K2-Score achieving over 90% accuracy.Although HSIC has the best total average accuracy, its accuracy is not significantly better than other four criteria; however, tested on the most difficult model, the accuracy rate of HSIC CR outperforms LR by 3.2%, and significantly outperforms the other four criteria, especially the K2-Score.

The Running Time Analysis
To demonstrate that our proposed method can be used as a lightweight scoring criterion, we proved in the previous section that its time complexity is O(m).Furthermore, we calculated the average running time per dataset (unit in seconds) for the two-order with a sample size of 800 and the three-order in the simulation experiments, and we found that the average running time per dataset of our proposed method is between the other five lightweight methods (see Table 4).This further demonstrates the applicability of our proposed method as a lightweight scoring criterion.
In the experiment, all scoring criteria were implemented based on Matlab, and all tests were run on the environment of Windows 10 64 desktop computer with 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80 GHz, and 16.0 GB memory.

Case Study: A Real Chronic Dialysis Data
A real data set of 193 cases and 704 controls was selected from the mitochondrial Dloop region of chronic dialysis patients who were observed in a study by other authors [32].The genotypes and locations of 77 SNPs are presented in Table 5 [33].
The 77 SNPs contained in the subset of the chronic dialysis data set were used in the case study, which aims to give our readers more specificity regarding our proposed scoring criterion.First, for this dataset, we performed a full-space two-order SNP combination detection, meaning that the HSIC CR values were evaluated for 2926 (C 2 77 ) possible combinations.Then, we selected the top ten HISC CR -valued combinations as candidate two-order epistasis SNP combinations to be raised for medical researchers, and the 10 candidate combinations are presented in Table 6.

Conclusions
In this paper, we verified with rigorous mathematical proof that HSIC CR can be computed in O(m) time.Moreover, we compared HSIC CR with five representative scoring criteria for 49 simulation disease models.The experimental results show that: Mi has a poor accuracy on two-order disease models; the K2-Score has a poor accuracy on three-order difficult disease models; the accuracy rates of LR are not good enough on two-order disease models tests; HSIC CR , G-test and ND-JE have a high accuracy on all three classes of disease models tests; the accuracy rates of HSIC CR rank first on two-order disease models without marginal effects tests and three-order disease model tests, and rank second on two-order disease models with marginal effects tests, although its advantage is not significant.
The advantages of HSIC CR are: the methodology used is different from other scoring criteria, which makes it more complementary to other scoring criteria; it has a high accuracy on most disease models.
In the future, we will further investigate proposing efficient SIO algorithms to solve this problem by combining HSIC CR and other effective lightweight criteria that already exist as weighted single or multi-objective functions.In addition, we will work with several local medical research institutions to use their real disease case-control study data to mine for disease-related SNP combinations by using our proposed approach.This will ultimately provide new guidance for drug development in complex diseases.

Figure 5 .
Figure 5. Disease model with marginal effects, with 400 samples and k = 2.

Figure 6 .
Figure 6.Disease model with marginal effects, with 800 samples and k = 2.

Table 1 .
The number of times, out of 3500 data sets generated by 35 models without marginal effects,

Table 2 .
The number of times, out of 600 data sets generated by six models with marginal effects, where k = 2, that each scoring criterion identified epistasis SNPs of snp1000 for sample sizes of 400 and 800.The fourth column gives the total accuracy over all sample sizes.The last column gives the accuracy over all sample sizes in the most difficult model.The scoring criteria are listed in descending order of total accuracy.

Table 3 .
The number of times, out of 4000 data sets generated by eight models, where k = 3, that each scoring criterion identified epistasis SNPs of snp100 for 1600 samples.The second column gives the total accuracy over a sample size of 1600.The last column gives the accuracy over a sample size of 1600 in the most difficult model.The scoring criteria are listed in descending order of total accuracy.

Table 4 .
Average running time (s) for the six scoring criteria per dataset for both the two-order tests and the three-order tests in the simulation experiments.

Table 5 .
Positions of chronic dialysis-associated 77 SNPS in mitochondrial d-loop region.a Left and right letters are major and minor genotypes, respectively.The number is the SNP position in the mitochondrial D-loop region.

Table 6 .
The top ten highest HISC CR -valued two-order SNPs were used as ten candidate combinations.