Use of Information Measures and Their Approximations to Detect Predictive Gene-Gene Interaction

Abstract: We reconsider the properties and relationships of the interaction information and its modified versions in the context of detecting the interaction of two SNPs for the prediction of a binary outcome when interaction information is positive. This property is called predictive interaction, and we state some new sufficient conditions for it to hold true. We also study chi square approximations to these measures. It is argued that interaction information is a different and sometimes more natural measure of interaction than the logistic interaction parameter especially when SNPs are dependent. We introduce a novel measure of predictive interaction based on interaction information and its modified version. In numerical experiments, which use copulas to model dependence, we study examples when the logistic interaction parameter is zero or close to zero for which predictive interaction is detected by the new measure, while it remains undetected by the likelihood ratio test.


Introduction
The aim of the paper is to review existing measures and to introduce a new measure of interaction strength of two nominal factors in predicting a binary outcome and to investigate how they perform for this task.We show that there exist interactive effects that are not detectable by parametric tests, such as the likelihood ratio test, and we propose a test statistic, which performs much better in such situations.A specific situation that we have in mind is the case of two Single Nucleotide Polymorphisms (SNPs) and their joint strength to predict the occurrence of a certain disease as compared to their individual strengths.We will speak of gene-gene interaction when the prediction effects of genotypes at two corresponding loci combine non-additively.Thus, rather than aiming at a simpler task of testing associations that allow for interactions, we focus here on testing the predictive interaction of two SNPs.We will refer to the phenomenon as the interaction effect rather than (statistical) epistasis in order to avoid confusion, as the term "epistasis" is frequently used to mean blocking of one allelic effect by another allele at a different locus (cf.[1]).
There are problems encountered when applying some of the aforementioned methods.The first one, which is not generally recognized, is that some methods are designed to detect the dependence between pairs of genes and a disease and do not distinguish whether the dependence is due to main effects (genes influencing disease individually) or their interaction (the combined effect of two genes).Thus, the case with strong main effects and with no or a weak interaction effect is still frequently referred to as gene-gene interaction.In the paper, we carefully distinguish between main and interaction effects and discuss situations when interaction can be detected based on the strength of overall dependence.The second problem is that some of the methods used, such as logistic regression, are model dependent and define interaction in a model-specific way.Thus, it may happen, as we show, that genes interacting predictively, as it is understood in this paper, do not interact when modeled, e.g., by logistic regression.Logistic regression is thus blind to predictive interaction in such cases.The third one is that the fact of whether loci are interlinked or not influences the analysis.We show that the dependence between genes may have profound effects on the strength of interaction and its detection.A focus is here on information measures and their approximations, as their decompositions allow one to neatly attribute parts of dependence either to the main effects or to the interaction effect.Let us note that the problem of the dependency of SNPs to be taken into account is recognized; see, e.g., [11].We also refer to [12], where the maximal strength of interactions for two-locus models without main effects is studied.Sensitivity analysis of a biological system using interaction information is discussed in [13].The analogous problem of measuring interaction in the Quantitative Trait Loci (QTL) setting with quantitative response is analyzed in, e.g., in [14].
We study the properties of the modified interaction information introduced by [8].In particular, we state new sufficient conditions for the predictive interaction of the pair of genes.We also study its chi square approximations in the neighborhood of total independence (loci and the outcome are mutually independent) and in the case when the loci may be interdependent.Plug-in estimators of the interaction measures are introduced, and we investigate their behavior in parametric models.This analysis leads us to the introduction of a new test statistic for establishing predictive interaction, which is defined as a maximum of the information interaction estimator and its modified version.It is shown that it leads to a test that is superior or on par with the Likelihood Ratio Test (LRT) in all parametric models considered, and the superiority is most pronounced when logistic interaction is zero or close to zero.Thus, there are cases when genes interact predictively that are not detected by LRT.As the detection of weak interactions becomes increasingly important, the proposed method seems worth studying.The proposal is also interesting from the computational point of view, as LRT, which is a standard tool to detect interactions in the logistic regression model, does not have a closed form expression; its calculation is computationally intensive and requires iterative methods.Our experience shows that the execution of LRT described below is at least fifteen-times longer than for any test considered based on interaction information.This is particularly important when the interaction effect has to be tested for a huge number of pairs of SNPs.We do not treat this case, here leaving it for a separate paper.
The paper is structured as follows.In Section 2, we discuss interaction information measures, their variants, as well as approximations and the corresponding properties.Parametric approaches to interaction are examined in Section 3. We also show some links between the lack of predictive interaction and additive logistic regression (cf.Proposition 7).Moreover, the behavior of the introduced measures in logistic models for independent and dependent SNPs is studied there.In Section 4, we investigate the performance of tests based on empirical counterparts of the discussed indices by means of numerical experiments.An illustrative example of the analysis for a real dataset is also included.Section 5 concludes the paper.

Interaction Information Measure
We adopt the following qualitative definition of the predictive interaction of SNPs X 1 and X 2 in explaining dichotomous qualitative outcome Y.We say that X 1 and X 2 interact predictively in explaining Y when the strength of the joint prediction ability of X 1 and X 2 in explaining Y is (strictly) larger than the sum of the individual prediction abilities of X 1 and X 2 for this task.This corresponds to a synergetic effect between X 1 and X 2 as opposed to the case when the sum is larger then the strength of a joint prediction, which can be regarded as a redundancy between X 1 and X 2 .
In order to make this definition operational, we need a measure of the strength of prediction ability of X in explaining Y where X is either a single SNP: X = X i or a pair of SNPs: X = (X 1 , X 2 ).This can be done in various ways; we apply the information-theoretic approach and use mutual information to this aim.The Kullback-Leibler distance between P and Q will be denoted by KL(P||Q).We will consider mass function p corresponding to probability distribution P and use p(x i ) and p(x j ) to denote mass functions of X 1 and X 2 , respectively, when no confusion arises.Mutual information between X and Y is defined as: where sums range over all possible values y k of Y and ) and P X × P Y is the so-called product measure of marginal distributions of X and Y, defined by P X × P Y (x i , y k ) = p(x i )p(y k ).P X × P Y is thus the probability distribution corresponding to ( X, Ỹ), where X and Ỹ are independent and have distributions corresponding to p(x i ) and p(y k ), respectively.Therefore, mutual information is the Kullback-Leibler distance between joint distribution and the product of the marginal distributions.Note that if X = (X 1 , X 2 ), the value x i in (1) is two-dimensional and equals one of the possible values of (X 1 , X 2 ).
The motivation behind the definition of I(X; Y) is of a geometric nature and is based on the idea that if Y and X are strongly associated, their joint distribution should significantly deviate from the joint distribution of X and Ỹ.In view of this interpretation and taking X = (X 1 , X 2 ), we define the strength of association of (X 1 , X 2 ) with Y as: and, analogously, the strengths of individual associations of X i with Y as: We now introduce the interaction information as (cf.[15,16]): Thus, in concordance with our qualitative definition above, we say that SNPs X 1 and X 2 interact predictively in explaining Y when I I(X 1 ; X 2 ; Y) is positive.We stress that the above definition of interaction is not model dependent in contrast to, e.g., the definition of interaction in a logistic regression.This is a significant advantage as for model-dependent definitions of interaction, the absence of such an effect under one model does not necessarily extend to other models.This will be discussed in greater detail later.
Let us note that I I(X 1 ; X 2 ; Y) defined above is one of the equivalent forms of interaction information.Namely, observe that: where H(X 1 , X 2 ) = − ∑ ij p ij log p ij is the entropy of X = (X 1 , X 2 ) with other quantities defined analogously.This easily follows from noting that I(X; Y) = H(X) + H(Y) − H(X, Y) (cf.[17], Chapter 2).The second equality above is actually a restatement of the decomposition of entropy H(X 1 , X 2 , Y) in terms of the values of its difference operator ∆ (cf.[18]).Namely, Formula (4.10) in [18] asserts that II(X 1 ; X 2 ; Y) = −∆H(X 1 , X 2 , Y). Interaction information is not necessarily positive.It is positive when the strength of association of (X 1 , X 2 ) with Y is larger than an additive effect of both X 1 and X 2 , i.e., when X 1 and X 2 interact predictively in explaining Y. Below, we list some properties of II(X 1 ; X 2 ; Y).
(ii) (X 1 , X 2 ) are independent of Y if and only if I(X 1 ; Y) = 0, I(X 2 ; Y) = 0 and I I(X 1 ; X 2 ; Y) = 0; (iii) We have: (iv) It holds that: where I(X 1 ; X 2 |Y) is conditional mutual information defined by: I(X 1 ; X 2 |Y) = ∑ y p(y) ∑ x i ,x j p(x i , x j |y) log p(x i , x j |y) p(x i |y)p(x j |y) .
Some comments are in order.Note that (i) is an obvious restatement of (4) and is analogous to the decomposition of variability in ANOVA models.The proof of (ii) easily follows from (i) after noting that the independence (X 1 , X 2 ) of Y is equivalent to KL(P X 1 ,X 2 ,Y ||P X 1 ,X 2 × P Y ) = 0 in view of the information inequality (see [17], Theorem 2.6.3).Whence, if (X 1 , X 2 ) is independent of Y, then I((X 1 , X 2 ); Y) = 0, and thus, also, I(X i ; Y) = 0 for i = 1, 2. In view of (6), we then have that I I(X 1 ; X 2 ; Y) = 0.The trivial consequence of (i) is that I I(X 1 ; X 2 ; Y) ≥ −[I(X 1 ; Y) + I(X 2 ; Y)]; thus, when main effects I(X i ; Y) are zero, i.e., X i are independent of Y, we have I I(X 1 ; X 2 ; Y) ≥ 0, and in this case, I I(X 1 ; X 2 ; Y) is a measure of association between (X 1 , X 2 ) and Y.
Part (ii) asserts that in order to check the joint independence of (X 1 , X 2 ) and Y, one needs to check that X i for i = 1, 2 are individually independent with Y, and moreover, interaction information I I(X 1 ; X 2 ; Y) has to be zero.Part (iii) follows easily from (5).
Part (iv) yields another interpretation of I I(X 1 ; X 2 ; Y) as a change of mutual information (information gain) when the outcome Y becomes known.This can be restated by saying that in the case when X 1 and X 2 are independent, the interaction between genes can be checked by testing the conditional dependence between genes given Y.This is the source of the methods discussed, e.g., in [19,20] based on testing the difference of inter-locus associations between cases and controls.Note however that this works only for independent SNPs, and in the case when X 1 and X 2 are dependent, conditional mutual information I(X 1 ; X 2 |Y) overestimates interaction information.
Let pK be the function appearing in the denominator of (7): pK (x i , x j , y k ) = p(x i , x j )p(x i , y k )p(x j , y k ) p(x i )p(x j )p(y k ) (10) and PK the associated distribution.PK is called the (unnormalized) Kirkwood superposition approximation of P. Note that (7) implies that if the KL distance between P and PK is small, then interaction II(X 1 ; X 2 ; Y) is negligible.Let: be the Kirkwood parameter.If Kirkwood parameter equals one, then Kirkwood approximation PK is a probability distribution.In general, is a probability distribution, which will be called the Kirkwood superposition distribution.We say that a discrete distribution p has perfect bivariate marginals if the following conditions are satisfied ( [21]): Note that Condition (13) implies that bivariate marginals of pK coincide with those of p(x i , x j , y k ).Now, we state some new facts on the interplay between predictive interaction, the value of the Kirkwood parameter and Condition (13).In particular it follows that if η < 1, then genes interact predictively, and the sufficient condition for that is given in Part (iv) below.Proposition 2. (i) II(X 1 ; X 2 ; Y) ≥ log(1/η), and thus, if η < 1, then II(X 1 ; X 2 ; Y) > 0; (ii) If any of the conditions in (13) are satisfied then η = 1 and II(X 1 ; X 2 ; Y) ≥ 0; (iii) If any two components of random vector (X 1 , X 2 , Y) are independent, then η = 1; (iv) If for any i, j ∑ k p(x i , y k )p(x j , y k )/p(y k ) < p(x i )p(x j ) then η < 1.
Part (i) is equivalent to KL(P X 1 ,X 2 ,Y ||P K ) = II(X 1 ; X 2 ; Y) + log η ≥ 0. The proof of (ii) follows by direct calculation.Assume that, e.g., the first condition in (13) is satisfied.Then: (iii) is a special case of (ii) as the independence of two components of (X 1 , X 2 , Y) implies that a respective condition in (13) holds.Note that the condition in (iv) is weaker than the third equation in (13), and Part (iv) states that if X i are weakly individually associated with Y, they either do not interact or interact predictively.
The usefulness of using the normalized Kirkwood approximation to test for interactions was recognized by [8].It is applied in the BOOST package to screen off pairs of genes that are unlikely to interact.In [7], interaction information is used for a similar purpose; see also [22].We call: modified interaction information, which is always nonnegative.Numerical considerations indicate that it is also useful to consider: ĪI = max(II, II m ).
Note that ĪI = II is equivalent to η ≤ 1.In connection with Proposition 2, we note that we also have another representation of II(X 1 ; X 2 ; Y) in terms of the Kullback-Leibler distance, namely: where PK is a distribution on values of Y pertaining to pK (y k |x i , x j ) = p(y k )p(x i |y k )p(x j |y k )/ p(x i )p(x j ) and: The last representation of II(X 1 ; X 2 ; Y) follows from ( 5) by an easy calculation.

Other Nonparametric Measures of Interaction
We define: where: Note that bivariate marginals of p a coincide with those of p x i , x j , y k , e.g., ∑ k p a x i , x j , y k = p x i , x j ; however, p a is necessarily positive.We have the following decomposition: Thus, the terms π(x j , y k ), π(x i , y k ) and π(x i , x j ) correspond to the first order dependence effects for p(x i , x j , y k ), whereas π(x i , x j , y k ) reflects the second order effect.Furthermore, note that the second order effect is equivalent to the dependence effect when all of the first order effects, including π(x i , x j ), are zero.
Moreover, let: and: We have: for some α ij , β jk and γ ik .
Part (i) is checked directly.Note, e.g., that π(x j , y k ) = 0; π(x j , y k ) = 0; π(x i , x j , y k ) = 0 is equivalent to X i ; X j are independent of Y; and further, p a (x i , x j , y k ) = p(x i , x j )p(y k ).Thus, π(x i , x j , y k ) = 0 is equivalent to (X 1 , X 2 ) being independent of Y. Part (ii) is obvious in view of (i).Note that it is an analogue of Proposition 1 (ii).Part (iii) is easily checked.This is due to [23].
In the proposition below, we prove a new decomposition of χ 2 (X 1 ,X 2 );Y , which can be viewed as an analogue of ( 6) for the chi square measure.In particular, in view of this decomposition, χ 2 X 1 ;X 2 ;Y is a measure of interaction information.Namely, the following analogue of Proposition 1 (i) holds: Proposition 4. We have: In order to prove (27), noting the rewriting of ( 22), we have: We claim that squaring both sides, multiplying them by p(x i )p(x j )p(y k ) and summing over i, j, k yields (27).Namely, we note that all resulting mixed terms disappear.Indeed, the mixed term pertaining to the first two terms on the right-hand side equals: due to ∑ j p(x i , x j , y k ) − p a (x i , x j , y k ) = 0.The mixed term pertaining to the last two terms on the right-hand side equals: We note that ( 27) is an analogue of the decomposition of ∑ i,j,k (p(x i , x j , y k ) − p(x i )p(x j )p(y k )) 2 /p(x i )p(x j )p(y k ) into four terms (see Equation ( 9) in [23]).Han in [18] proved that for the distribution of P X 1 ,X 2 ,Y close to independence, i.e., when all three variables X 1 , X 2 and Y are approximately independent, interaction information II(X 1 ; X 2 ; Y) and 2 −1 χ 2 X 1 ;X 2 ;Y are approximately equal.A natural question in this context is how those measures compare in general.
In particular, we would like to to allow for the dependence of X 1 and X 2 .In this, case Han's result is not applicable, as P X 1 ,X 2 ,Y is not close to independence (cf.( 22)).It turns out that despite analogous decompositions in ( 6) and (27) in the vicinity of mass function p 0 (x i , x j , y k ) = p(x i , x j )p(y k ) (independence of (X 1 , X 2 ) and Y), II(X 1 ; X 2 ; Y) is approximated by different functions of chi squares.
Proposition 5. We have the following approximation in the vicinity of p 0 (x i , x j , y k ) = p(x i , x j )p(y k ): where term o(1) tends to zero when the vicinity of p 0 shrinks to this point.
Expanding f (t) = t log t for t = p(x i , x j , y k ) around t 0 = p(x i , x j )p(y k ), we obtain: Rearranging the terms, we have: Summing the above equality over i, j and k and using the definition of I((X 1 , X 2 ); Y), we have: Reasoning analogously, we obtain Using now the definition of interaction information, we obtain the conclusion.

Estimation of the Interaction Measures
We discuss now the estimators of the introduced measures.Suppose that we have n observations on genotypes of the two SNPs under consideration.The data can be cross-tabulated in a 3 × 3 × 2 contingency table with n ijk denoting the number of data points falling in the cell X 1 = x i , X 2 = x j and Y = k.The considered estimators are plug-in versions of theoretical quantities.Namely, we define (cf.( 7)): where p(x i , x j , y k ) = n ijk /n and other empirical quantities are defined analogously.Let: where η is a plug-in estimator of η, an estimator of I I m .Analogously, we define: Moreover, let: χ2 = χ2 denote the plug-in estimator of χ 2 X 1 ;X 2 ;Y defined in (25) and χ2 m the plug-in estimator of the main term on the right-hand side of (31).
Han in [18] proved that for the distribution of P X 1 ,X 2 ,Y close to independence, i.e., when all three variables X 1 , X 2 and Y are approximately independent, we have that the distribution of 2n I I(X 1 ; X 2 ; Y) is close to the distribution of n χ2 X 1 ;X 2 ;Y for large sample sizes.Moreover: in the distribution when the sample size tends to infinity, where χ 2 (4) denotes the chi square distribution with four degrees of freedom.However, the large sample distribution of 2n I I(X 1 ; X 2 ; Y) for p(x i , x j , y k ) = p 0 (x i , x j , y k ) is unknown.Note that although one can establish the asymptotic behavior of empirical counterparts of each term on the right-hand side of (31), these parts are dependent, which contributes to the difficulty of the problem.
In Section 3 below, we discuss the problem of detecting predictive interaction using the empirical indices defined above as the test statistics.

Modeling Gene-Gene Interactions
Interaction information and its modifications are model-free indices of measuring the interaction of SNPs in predicting the occurrence of a disease.Below, we list several approaches to measure interaction based on modeling.Although, as it turns out, the logistic model encompasses all of them, various parametrizations used for such models imply that the meaning of interaction differs in particular cases.

Logistic Modeling of Gene-Gene Interactions
As a main example of parametric modeling involving the quantification of interaction strength, we consider logistic regression.It turns out that any type of conditional dependence can be described by it.Namely, a general logistic model with interactions that models conditional dependence P(Y = 1|X 1 , X 2 ), where X 1 and X 2 are qualitative variables with I and J values, respectively, has I J parameters.Indeed, it allows for an intercept term, (I − 1) + (J − 1) main effects of X 1 and X 2 and (I − 1)(J − 1) interactions, i.e., I J parameters in total.This is equal to the number of possible pairs (x i , x j ).Thus, any form of conditional dependence of Y on X 1 and X 2 can be described by this model for some specific choice of intercept, main effects and interactions.
We discuss a specific setting frequently used for GWAS analysis when X 1 and X 2 stand for two biallelic genetic markers with the respective genotypes AA, Aa, aa (reference value) and BB, Bb, bb (reference value).For convenience, the values of SNPs will be denoted, although not treated as, consecutive integers 1, 2 and 3. Y is a class label, i.e., a binary outcome, which we want to predict, with one standing for cases and zero for controls.
We consider the additive logistic regression model ω, which asserts that: and compare it with a general saturated model Ω: In the logistic regression model, X 1 and X 2 interact when γ ij is non-zero for some i, j, and Model (40) is frequently called a general logistic model with interactions.Thus, interactions here correspond to all coefficients γ ij , and a lack of interactions means that all of them are zero.Note that the number of independent parameters in (40) when both X 1 and X 2 have three values is nine and equals the number of possible pairs (x i , x j ).For specific values of the main effects and interaction, which have been used in GWAS, see, e.g., [19,22], and for complete enumeration of models with 0/1 penetrance, see [24].
We discuss now different modeling approaches.The main difference in comparison with (40) is that another function of the odds is parametrized.The choice of the function influences the parametrization of the model, however; for discrete predictors, all such parametrizations are equivalent.
In particular, in [19], so-called multiplicative and threshold models for a disease were considered (Table 1, p. 365).In the multiplicative model, the odds of the disease have a baseline value γ and increase multiplicatively once there is at least one disease allele at each locus: where δ AA,BB = 4, δ Aa,BB = δ AA,Bb = 2, δ Aa,Bb = 1 and δ ij = 0, otherwise.In the threshold model, the odds of the disease increase by a constant multiplicative factor once there is at least one disease allele at each locus, i.e., δ AA,BB = δ Aa,BB = δ AA,Bb = δ Aa,Bb = 1 and δ ij = 0, otherwise.It is easily seen that both models are special cases of (40).Moreover, if δ ij is such that δ ij = η i + κ j , then Equation ( 41) is a special case of the additive model (39).Note that interactions in (41) are measured by coefficients δ ij , as well as by θ.
An analogous approach is to model the prevalence of the disease P(Y = 1|X 1 , X 2 ) instead of the odds as in (41) or the logarithmic odds as in (40).This is adopted in [20], where Table 1, p. 832, lists six representative models.In particular, a threshold model corresponds to the situation when the prevalence of the disease increases from zero to f , where f is positive, provided a disease allele is present at each locus.Interaction in the threshold model is measured by f .As the model considers zero as a baseline value, it is not a special case of the threshold model in [19].
Interesting insight for particular dependence models is obtained when they are parametrized using Minor Allele Frequency (MAF), overall prevalence P(Y = 1) and heritability h 2 .This is adopted in [22,25] where several models with varying values of these parameters are considered.
Below, we state the proposition that describes the connection between the additive logistic regression model and the lack of predictive interaction.Namely, Part (ii) states that if the Kirkwood parameter is not larger than one and I I = 0, then the additive logistic model holds true.Proposition 7. (i) Equation ( 39) is equivalent to: for some values It is easily checked that Condition (42) implies (39).On the other hand, if we assume (39), then we have p ij1 = p ij0 exp(µ + α i + β j ), and we can take This proves (i).Now, for (ii), if η ≤ 1 and I I (X 1 ; X 2 ; Y) = 0, then which satisfies (42).The last equality follows from the generalization of the information inequality stating that KL distance D(p||q) = 0 if and only if p = q when p is a probability mass function and q is nonnegative and such that ∑ i q(x i ) ≤ 1.Thus, if η ≤ 1, then any model with interaction information of zero has to be an additive logistic regression model.However, we will show that conditions I I = 0 and γ ij ≡ 0 are not equivalent even in the case when X 1 and X 2 are independent and η = 1.
The principal tool to detect interactions in logistic regression is the log-likelihood ratio statistic (LRT), defined as: where L(Ω), L(ω) are respectively the values of the likelihood for models Ω and ω, and PΩ , Pω are respectively estimated probability distributions.Large positive values of LRTare interpreted as an indication that the additive model is not adequate and that interactions between genes occur.In order to check what is the usual range of values of LRT under ω, we use the property stating that when ω is adequate, LRT is approximately distributed as a χ square distribution with four degrees of freedom provided that all cells contain at least five observations.Whereas the calculation of L(Ω) is straightforward, as it involves only fractions n ijk /n as parametric estimates of probabilities of interest, the calculation of L(ω) is computationally intensive and involves the Iterated Weighted Least Squares (IWLS) procedure.Thus, it is also if interest to find an easily computable approximation of LRT.This was exactly the starting point of [8] where it was noticed that probability mass function p K (x i , x j , y k ) = pK (x i , x j , y k )/η follows the additive logistic regression model.Indeed, we have, in view of (10): and thus, it satisfies (39).In particular, it follows that: where L K is a value of the likelihood for a logistic regression model using plug-in estimators to estimate Kirkwood probabilities.Since I I m is easily computable, the lower bound on 2 I I m can be imposed to screen pairs that are unlikely to interact, as in view of (45), the cases with small values of 2 log L(Ω)/L K yield even smaller values of LRT.However, as we discuss in Section 3, there are cases of interactions that will be detected by I I m , but they will remain undetected by LRT.Note also that from the considerations above, we have revealed the interpretation of the interaction information: where PK is a probability distribution corresponding to estimated non-normalized Kirkwood approximations.
Another interaction modeling tool for contingency tables is a log-linear model for which the logarithm of the expected value of the number of observations falling into each cell is modeled.Since the expected value for the cell (i, j, k) equals µ ijk = np(x i , x j , y k ), it is seen that the approach is equivalent to logistic modeling.In particular, Model (39) is equivalent to the so-called homogeneous association model: log Because of the equivalence, we will discuss only the logistic setting later on.

ANOVA Model for Binary Outcome
Additive ANOVA models briefly described below are used to model the dependence of the quantitative outcome on qualitative predictors, in QTL studies in particular.However, they work reasonably well for a binary outcome.We provide a brief justification for this.
In an additive ANOVA model ω, we assume that the conditional distribution of Y given X 1 and X 2 : and for model Ω with interactions, we postulate that: Estimation of the parameters of ANOVA models is based on least squares analysis, i.e., minimization of the following sum of squares: where y l is a prediction for the l-th observation under assumed model M. It is well known (see, e.g., [26]) that the F statistic defined by (51) below has an asymptotically F distribution with parameters p − q and n − p (p and q denote the number of coefficients in models Ω and ω, respectively).
In our problem, the outcome is binary, so formally, it is not legitimate to use the ANOVA model in this case.Nevertheless, the prediction has an interesting property.Let us denote y (1) := Y = I (Y = 1) and y (0) := 1 − Y = I (Y = 0).Then, for the additive ANOVA model and the model with interaction, we have: Moreover, if the values of the respective SNPs are denoted by x i and x j , then for the model with interaction, we have y (0) = n ij0 /n ij = P Y = 0|x i , x j and y (1) = n ij1 /n ij = P Y = 1|x i , x j .Using this notation for both models, we can treat predictors y (0) and y (1) as estimators of P Y = 0|X 1 = x i , X 2 = x j and P Y = 1|X 1 = x i , X 2 = x j , respectively.Now, manipulating conditional probabilities, we can rewrite (50) as: Note that it follows that R(M) can be treated as the weighted variability of prediction, which is the largest when P Y = 0 X 1 = i, X 2 = j = 0.5.Thus, minimizing (53) leads to finding the parameters of model M that yield the most certain prediction.This provides some intuition, in addition to (52), for why the ANOVA model yields reasonable estimates in the case of a binary outcome.

Behavior of Interaction Indices for Logistic Models
Our main goal here is to check whether estimators of the information interaction lead to satisfactory, universal and easy to compute tests for predictive interaction.We recall that X 1 and X 2 interact predictively in explaining Y when I I(X 1 ; X 2 ; Y) > 0. Thus, we consider as the null hypothesis H 0 : I I = 0 and as an alternative H 1 , the hypothesis we are interested in, namely H 1 : I I > 0. As test statistics, we employ sample versions of interaction information indices and their approximations introduced above.We discuss the behavior of the pertaining tests for logistic models (see Section 3), as for discrete predictors, they cover all possible types of conditional dependence of Y given their values.Two types of distributions of (X 1 , X 2 ) will be considered, the first, when X 1 and X 2 are independent and the second one, when their dependence is given by Frank's copula with parameter −1 (see Appendix A for the definition).Frank's copula with parameter −1 was chosen as for the logistic models considered below, it leads to predictive interaction.In both cases, we set Minor Allele Frequency MAF = 0.25, where MAF = P(X 1 = AA) = P(X 2 = BB).In Table A1, the conditional distributions are specified (see also Appendix A for the method of generation).As discussed below, larger values of λ and γ lead to larger values of interaction measures.

Behavior of Interaction Indices When X 1 and X 2 Are Independent
We consider first a special case when X 1 and X 2 are independent.We recall that it follows from Proposition 2 (iii) that parameter η then equals one regardless of the form of prevalence mapping.Thus, Kirkwood superposition approximation is a probability distribution, and the fact that I I = 0 is equivalent to the property that distribution P equals its Kirkwood approximation.Moreover, we then have I I = I I m and χ 2 = χ 2 m .We omit the proof of the second equality.Below, we discuss specific logistic regression models that are used in the simulations.For each model, the intercept was set so that prevalence P(Y = 1) is approximately 0.1.In Table 1, the coefficients for additive logistic models considered here are specified.39)).In each model, intercept µ was chosen, such that prevalence P(Y = 1) is equal to approximately 0.1.
Note that for model M0, both predictors are independent of Y, whereas in M1 and M2, only X 2 is independent of Y.It is seen from (39) that for M3, logarithmic odds depend on the occurrence of either value AA, BB or both, whereas for M4, they also depend on the values Aa and Bb.The additive influence of both loci is the same and measured by parameter λ.
We also consider a general logistic model with interactions given in (40).We employ three types of models, the additive parts of which are the same as in models M0 1 , M3 1 and M4 1 , respectively.Note that M0 1 denotes model M0 λ with λ = 1.The form of interaction is stated in Table 2. Mi (γ) for i = 0, 3, 4 will denote the logistic model, the additive part of which is as in model Mi 1 and interaction part as in Table 2 for a fixed value of γ.Thus, M0(γ) is a model with no main effects, but with a logistic interaction effect, whereas additive effects, as well logistic interaction are nonzero in M3(γ) and M4(γ).Note that in the models considered, parameter γ measures the strength of the logistic interaction.
We discuss now how interaction indices behave for the introduced models.We start with additive logistic regression models M0 λ − M4 λ .Note that for models M0 λ , M1 λ and M2 λ , all considered interactions measures are zero, since for M0 λ , response Y is independent of (X 1 , X 2 ), whereas for M1 λ and M2 λ , predictor X 2 is independent of (X 1 , Y).The values of I I, I I m , χ 2 and χ 2 m as a function of λ for models M3 λ and M4 λ are shown in Figure 1.In models M3 λ and M4 λ , the values of I I are small, but strictly positive for λ > 0. Note that the monotone dependence of χ 2 = χ 2 m on λ is much stronger than for I I = I I m .In Figure 2, the behavior of interaction measures as a function of γ for the logistic model with the nonzero interaction term is depicted.Thus, we check how nonparametric interaction information and its modifications depend on logistic interaction parameter γ.Observe that I I is positive, close to zero for γ = 0.5 and for γ ≥ 0.5 grows slowly in all models considered.There is no significant difference between the values of I I for all models M0 (γ) , M3 (γ) and M4 (γ) when γ is fixed, which means that the additive part in the saturated logistic model has a weak influence on the interaction information.Index χ 2 is also approximately 0 for γ = 0.5, but grows much faster than I I when γ increases.The differences between the values of χ 2 for models M0 (γ) , M3 (γ) and M4 (γ) are much more pronounced than for interaction information I I, and they increase with γ.The values of all indices are larger when the logistic interaction is nonzero.

Behavior of Interaction Indices When X 1 and X 2 Are Dependent
The situation when X 1 and X 2 are dependent is more involved.First, Kirkwood superposition approximation does not have to be a probability distribution; therefore, the fact that I I = 0 is not equivalent to the equality of mass functions p x i , x j , y k and pK x i , x j , y k .Second, the dependence of predictors means that distribution (X 1 , X 2 , Y) deviates more strongly from joint independence, i.e., from the situation when the asymptotic behavior of 2n I I is known and given in (38).
As before, we consider the logistic model with and without interactions.For logistic regression models, we choose the same values of parameters as in the previous section (see Tables 1 and 2) with the exception of µ, which was set such that in every model, prevalence is equal approximately to 0.1.
The behavior of interaction indices for the discussed models is shown below.Their variability in λ for models M3 λ , M4 λ when predictors are dependent is shown in Figure 3 and for models with interaction in Figure 4. Model M0 λ is omitted as all considered indices are zero there independently of λ.Note that we have a stronger dependence of I I on λ in M4 than in M3; the effect is much weaker in the case of I I m due to the fact that log η negatively depends on λ for M4.Furthermore, for this model, the dependence of I I on λ is much stronger than for the independent case.When logistic interaction is nonzero, we again see pronounced differences in the behavior of I I between M3 and M4.This indicates that in the contrast to the independence case, two factors, namely the value of γ, as well as the form of main effects, influence the value of I I.The differences between the values of I I m for different models are negligible.Values I I are larger for the dependent than for the independent case for the same value of γ.Note that from the logistic model perspective.the strength of interaction in all three models is the same and corresponds to the value of γ.Again, the dependence of χ 2 and χ 2 m on γ is much stronger than for I I.

Tests for Predictive Interaction
The main challenge with the application of the constructed statistics for testing is the determination of their behavior under the null hypothesis I I = 0.The asymptotic distribution of Î I is known only for the case when X 1 , X 2 and Y are independent, which is a special case only of the null hypothesis, and the asymptotic distributions of Î I m and Î I are unknown.In order to overcome this problem, we hypothesize that the distributions of all three statistics do not deviate much from the χ 2 (4) distribution, at least in the right tails, and we check the validity of our conjecture by calculating the actual Type I error rates for nominal error rates α using the Monte Carlo method.The results are shown in Figure 5 for the independent predictors and in Figure 6 for dependent ones.The results for LRT and ANOVA tests are included as the benchmarks.It is seen that for the independent predictors, discrepancies between actual and nominal rates for Î I and Î I m are negligible and comparable to the discrepancy of LRT for all models M i , i = 0, 1, 2, and the same is true for Î I in the case of models M 0 and M 1 .The same observation holds for the dependent predictors, although here, empirical evidence is restricted to model M 0 , as it is the only model known to us that satisfies the null hypothesis in this case.Based on this observation, in the following, we compare the power of LRT and ANOVA with the powers of the new tests with a rejection region { Ĩ I ≥ χ 2 0.95 (4)}, where χ 2 0.95 (4) stands for the quantile of order 0.95 of the χ 2 (4) distribution and Ĩ I stands for either Î I, Î I m or Î I. • t 0 , t 1 , the number of observations in controls (Y = 0) and cases (Y = 1), set equal in our experiments and n = t 0 + t 1 , the total number of observations.Values of n = 500 and n = 1000 were considered.• MAF, the minor allele frequency for X 1 and X 2 .We set MAF = 0.25 for both loci.• copula, the function that determines the cumulative distribution of (X 1 , X 2 ) based on its marginal distributions.• p(i, j) := P Y = 1 x i , x j , the prevalence mapping, which in our experiments was either additive logistic or logistic with nonzero interaction.
For every model, 1000 datasets were generated, and for each of them, tests of the interaction effect based on the introduced indices were performed.The null hypothesis is the lack of predictive interaction I I = 0.The null hypothesis at the specific significance level α is not rejected if the value of the test statistic is less than (1 − α), the quantile of appropriate distribution; otherwise, we reject this hypothesis and claim that there is a predictive interaction effect.As discussed above, the tests based on all indices except the ANOVA test were compared with the χ 2 (4) distribution, and for the ANOVA test, null distribution was the F-Snedecor distribution with parameters four and n − 9.For models with I I = 0, we estimate the probability of Type I error for a given α as a fraction of the tests that are falsely rejected.For the model with the predictive interaction effect, statistical power is estimated as a fraction of the tests that are rejected at a significance level of 0.05.We will refer to those tests as simply interaction information, χ 2 , LRT and ANOVA tests in the following.
All experiments have been performed in the R environment.The source codes of functions used are available from the corresponding author's website.
4.1.Behavior of Interaction Tests When X 1 and X 2 Are Independent We consider the situation when SNPs X 1 and X 2 are independent.In this case, η = 1 and both I I and I I m estimate I I consistently.

Type I Errors for Models M0-M2
First, we consider models M0-M2, where I I = 0 and for which the null hypothesis holds.Recall that for M0, all components of random vector (X 1 , X 2 , Y) are independent; thus, in this case, convergence (38) to the χ 2 (4) distribution holds.In Figure 5, we compare the Type I error rate with the nominal error rate.Note that Type I errors of I I, I I m and LRT approximately agree with nominal level α.This indicates that the distributions of these statistics under the null hypothesis are approximately χ 2 (4) distributed.For ANOVA and χ 2 tests, the probability of Type I error is slightly smaller than α.For the maximum statistic Ī I, the probability of Type I error agrees well with the nominal level for models M0 and M1 for M2; it exceeds α, but the relative difference is not larger than 20% for α ≤ 0.05.

Power for Additive Logistic Models
Now, we focus on models M3 λ and M4 λ .From Figure 1, we see that for λ = 0.5, index I I is very close to zero, and predictably, the power is close to the significance level for such λ (see Figure 7).However, for larger λ, the power of I I m and I I increases for M3, and in the case of M4, this holds also for I I m .The difference between the behavior of I I m on the one side and I I and χ 2 approximations on the other for model M4 is remarkable and shows that modification incorporating log η is worthwhile.However, for model M3, it does not improve the performance of I I.Note also that unsurprisingly.The power of the LRT test stays close to the significance level, but the power of ANOVA starts to increase for large λ.A clear winner in both cases is the test based on Ī I, which performs very well for both models, as it takes advantage of the good performance of I I for M3 and I I m for model M4.

Power for the Logistic Model with Interactions
We consider now the power of the tests with respect to logistic interaction parameter γ for models M0 (γ) , M3 (γ) and M4 (γ) .Note that M0 (γ) is now included as I I is positive for γ > 0. In all cases, I I m performs better than I I, supporting the usefulness of the correction.LRT works comparably to ANOVA and, in some cases, worse than I I m and Ī I. We stress that this does not contradict the fact that LRT is the most powerful test at the level α for null hypothesis H0 : γ = 0. Indeed, it follows from the previous experiments that predictive interaction tests are not level α tests for H0 ; thus, they may have larger power than LRT.The power of I I m and Ī I is around 30% for M4, even for γ close to zero, when for LRT, it stays around a significance level of 0.05.Note also that I I and χ2 m fail to detect the predictive interaction for model M4.

Behavior of the Interaction Tests When X 1 and X 2 Are Dependent
We consider the situation where SNPs X 1 and X 2 are dependent.The distribution of (X 1 , X 2 ) is modeled by the Frank copula with θ = −1 (see Table A1).

Type I Errors for Model M0
Note that for dependent X 1 and X 2 , only M0 satisfies the null hypothesis, as for the models M1 and M2, index I I is negative.It follows from Figure 6 that for all tests apart from χ 2 tests, Type I error for M0 is approximately α, and it is significantly smaller than α for both χ 2 tests.For larger α and n = 1000 errors for information, interaction tests exceed slightly α, similarly to LRT.

Power for Additive Logistic Models When X 1 and X 2 Are Dependent
The behavior of the discussed tests is analogous to the case of independence (see Figure 8) with one important difference.Namely, now I I performs better than I I m apart from model M4 for λ ≥ 1.5 and n = 500.LRT and ANOVA do not detect any interaction for λ ≤ 1.5, but the power of ANOVA starts to pick up for large λ.Note the erratic behavior of the χ 2 m test for which the power actually starts to decrease for larger λ.This possibly is caused by the fact that the test is not well calibrated and the fact that χ 2 m is likely to become negative for large λ.The power of Ī I is the largest one among all tests.
4.2.3.The Powers for Logistic Models with Interaction When X 1 and X 2 Are Dependent Obviously, when logistic interaction is present, the LRT test performs much better (see Figure 9).The same applies to ANOVA and I I, but the best performance is again exhibited by Ī I. Consider, e.g., its performance for the model M4.Its excellent behavior for small γ's stems from the very good performance of I I, whereas for larger γ's from the performance of I I m .Comparing Figures 9 and 10, we see that the dependence between predictors improved the performance of the tests based on the interaction information measures, especially for a smaller sample size, which is consistent with the larger values of the interaction information in the dependent than in the independent case.Note that in the dependent case, the power of I I and Ī I for M4 is above 0.9 for γ close to zero, whereas it is less then 0.3 for such γ in the independent case. .Powers for models M0 (γ) , M3 (γ) and M4 (γ) against γ when X 1 and X 2 are independent.

Real Data Example
We perform the analysis of a real dataset in order to show that the pairs of SNPs that correspond to large values of interaction indices exhibit interesting patterns of conditional dependence.We used data on pancreatic cancer studied by [27] and downloaded from the address [28]  Note that the pattern of the change of the conditional probabilities of the occurrence of the AA genotype for SNP2 given the genotypes of SNP1 for the cases Y = 1 is reversed in the case of the controls Y = 0.For the pooled sample, the conditional probabilities are approximately equal.Moreover, observe that the conditional frequency of SNP2 = BB given SNP1 = BB is around 0.2 for the cases, whereas it is zero for the controls.
This preliminary example indicates that the analysis based on large values of interaction information indices allows one to detect interesting patterns of dependence between pairs of SNPs and the response.

Discussion
In the theoretical part of the paper, we reviewed and proved some new properties of interaction information, its modification and their approximations.It is argued that parameter η introduced in (11) plays an important role in establishing predictive interaction.Theoretical analysis supported considering a new measure defined in (16), being the maximum of interaction information, and its modified version, which was considered in numerical experiments.There are several conclusions that can be drawn from the conducted numerical experiments.The first one is that the dependence between predictors influences the performance of interaction information tests: while I I m performs in general better than I I for independent predictors, the situation is reversed for dependent ones.Their natural combination, statistics Ī I, is superior to any of them and LRT.When compared to LRT, the difference in performance is most striking when detecting predictive interaction in the cases when logistic interaction is 0 as, e.g., in models M3 λ and M4 λ .This should serve as a cautionary remark for the situations when the interaction information test is used as a screening test and then LRT is applied for remaining pairs of genes: the screening test may rightly retain pairs of genes interacting predictively; however, the interaction may not be confirmed by the LRT test.Such cases are likely to occur especially for dependent pairs.It is also worthwhile to note that Ī I is as least good as LRT and sometimes much better for detecting interactions when the logistic interaction γ is close to zero.This is a promising feature, as detecting such weak interactions is the most challenging and bears important consequences for revealing the dependence structure of diseases.This of course requires the construction of interaction tests suitable for a huge number of pairs and is a subject of further study.On the negative side, tests based on χ 2 approximations tend to perform less well than those based on interaction information and their modifications.This is possibly due to their non-adequate calibration in the case when the null hypothesis holds.It also should be stressed that in numerical experiments, we studied the problem of distinguishing between no interaction information (I I = 0) and predictive interaction (I I > 0).In the case when interaction information is negative, I I m and Ī I are likely not to detect such an effect.

Figure 1 .
Figure 1.Theoretic values of interaction measures for the additive logistic model and independent SNPs.

Figure 2 .
Figure 2. Theoretic values of the interaction measures for the logistic model with the interaction and independent SNPs.

Figure 3 .Figure 4 .
Figure 3. Theoretic values of the interaction measures for the additive logistic model and dependent SNPs.

Figure 7 .
Figure 7.Powers for models M3 λ and M4 λ against λ when X 1 and X 2 are independent.
. They consist of 208 observations (121 cases and 87 controls) with values of 901 SNPs.We applied tests I I, I I m and Ī I for all pairs and α = 0.05 with Bonferroni correction resulting in an individual level of significance 1.23 × 10 −7 = 0.05/K, where K = (901 × 900)/2.All three tests rejected the null hypothesis I I = 0 for 11 pairs.The pair of SNPs with the largest values of I I and I I m is the pair (SNP1, SNP2) =(rs1131854,rs7374) with values I I = 0.1239 and I I m = 0.1246.Figure 11 shows the probability mass function of this pair (a) and of the conditional mass functions given Y = 1 (b) and Y = 0 (c).

Table 2 .
Values of the γ ij coefficients from Model (40) for the same constant value of γ.
Actual Type I error rates against the nominal error α rate when X 1 and X 2 are independent for models M0 1 , M1 1 and M2 1 .Datasets pertaining to the discussed models are generated as described in the Appendix.The pertinent parameters are: Actual Type I error rates against nominal error rate α when X 1 and X 2 are dependent for model M0 1 .