#### 2.1. Interaction Information Measure

We adopt the following qualitative definition of the predictive interaction of SNPs ${X}_{1}$ and ${X}_{2}$ in explaining a dichotomous qualitative outcome Y. We say that ${X}_{1}$ and ${X}_{2}$ interact predictively in explaining Y when the strength of the joint prediction ability of ${X}_{1}$ and ${X}_{2}$ in explaining Y is (strictly) larger than the sum of the individual prediction abilities of ${X}_{1}$ and ${X}_{2}$ for this task. This corresponds to a synergistic effect between ${X}_{1}$ and ${X}_{2}$, as opposed to the case when the sum is larger than the strength of the joint prediction, which can be regarded as redundancy between ${X}_{1}$ and ${X}_{2}$.

In order to make this definition operational, we need a measure of the strength of the prediction ability of X in explaining Y, where X is either a single SNP: $X={X}_{i}$, or a pair of SNPs: $X=({X}_{1},{X}_{2})$. This can be done in various ways; we apply the information-theoretic approach and use mutual information to this aim. The Kullback–Leibler distance between P and Q will be denoted by $KL(P||Q)$. We will consider the mass function p corresponding to probability distribution P and use $p({x}_{i})$ and $p({x}_{j})$ to denote the mass functions of ${X}_{1}$ and ${X}_{2}$, respectively, when no confusion arises. Mutual information between X and Y is defined as:

$I(X;Y)={\sum}_{i,k}p({x}_{i},{y}_{k})\mathrm{log}\frac{p({x}_{i},{y}_{k})}{p({x}_{i})p({y}_{k})}=KL({P}_{X,Y}||{P}_{X}\times {P}_{Y}),\qquad(1)$

where the sums range over all possible values ${y}_{k}$ of Y and ${x}_{i}$ of X, $p({x}_{i},{y}_{k})=P(X={x}_{i},Y={y}_{k})$, $p({x}_{i})=P(X={x}_{i})$, $p({y}_{k})=P(Y={y}_{k})$ and ${P}_{X}\times {P}_{Y}$ is the so-called product measure of the marginal distributions of X and Y, defined by ${P}_{X}\times {P}_{Y}({x}_{i},{y}_{k})=p({x}_{i})p({y}_{k})$. ${P}_{X}\times {P}_{Y}$ is thus the probability distribution corresponding to $(\tilde{X},\tilde{Y})$, where $\tilde{X}$ and $\tilde{Y}$ are independent and have distributions corresponding to $p({x}_{i})$ and $p({y}_{k})$, respectively. Therefore, mutual information is the Kullback–Leibler distance between the joint distribution and the product of the marginal distributions. Note that if $X=({X}_{1},{X}_{2})$, the value ${x}_{i}$ in (1) is two-dimensional and equals one of the possible values of $({X}_{1},{X}_{2})$.
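The definition above is easy to check numerically. The following Python sketch (our illustration, not from the paper; the joint table is invented) computes $I(X;Y)$ for a small joint distribution and verifies that it vanishes for the product measure ${P}_{X}\times {P}_{Y}$:

```python
# Hypothetical numerical check of the mutual information definition (1):
# I(X;Y) = sum_{i,k} p(x_i, y_k) * log( p(x_i, y_k) / (p(x_i) * p(y_k)) ).
import numpy as np

def mutual_information(pxy):
    """Mutual information (in nats) of a 2-D joint probability table."""
    px = pxy.sum(axis=1, keepdims=True)   # marginal of X
    py = pxy.sum(axis=0, keepdims=True)   # marginal of Y
    mask = pxy > 0                        # terms with p = 0 contribute 0
    return float((pxy[mask] * np.log(pxy[mask] / (px * py)[mask])).sum())

# A 3x2 joint table (rows: genotypes of one SNP, columns: case/control).
pxy = np.array([[0.20, 0.10],
                [0.15, 0.25],
                [0.15, 0.15]])
print(mutual_information(pxy))           # small positive number

# For the product of the marginals, I(X;Y) = 0:
prod = pxy.sum(axis=1, keepdims=True) * pxy.sum(axis=0, keepdims=True)
print(mutual_information(prod))          # 0.0
```

Treating a pair of SNPs as one variable X amounts to flattening the first two axes of a three-dimensional table, as used below.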

The motivation behind the definition of $I(X;Y)$ is of a geometric nature and is based on the idea that if Y and X are strongly associated, their joint distribution should significantly deviate from the joint distribution of $\tilde{X}$ and $\tilde{Y}$. In view of this interpretation and taking $X=({X}_{1},{X}_{2})$, we define the strength of association of $({X}_{1},{X}_{2})$ with Y as:

$I(({X}_{1},{X}_{2});Y)=KL({P}_{{X}_{1},{X}_{2},Y}||{P}_{{X}_{1},{X}_{2}}\times {P}_{Y})$

and, analogously, the strengths of the individual associations of ${X}_{i}$ with Y as:

$I({X}_{i};Y)=KL({P}_{{X}_{i},Y}||{P}_{{X}_{i}}\times {P}_{Y}),\quad i=1,2.$

We now introduce the interaction information as (cf. [15,16]):

$II({X}_{1};{X}_{2};Y)=I(({X}_{1},{X}_{2});Y)-I({X}_{1};Y)-I({X}_{2};Y).\qquad(4)$

Thus, in concordance with our qualitative definition above, we say that SNPs ${X}_{1}$ and ${X}_{2}$ interact predictively in explaining Y when $II({X}_{1};{X}_{2};Y)$ is positive. We stress that the above definition of interaction is not model-dependent, in contrast to, e.g., the definition of interaction in logistic regression. This is a significant advantage, as for model-dependent definitions of interaction, the absence of such an effect under one model does not necessarily extend to other models. This will be discussed in greater detail later.
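To make the definition concrete, here is a small Python sketch (ours, purely illustrative) that computes $II({X}_{1};{X}_{2};Y)$ from a three-dimensional joint probability table; the XOR-type distribution used below is the textbook example of pure synergy, where each SNP alone is independent of Y but the pair determines it:

```python
# Illustrative computation (not from the paper) of interaction information
# II(X1;X2;Y) = I((X1,X2);Y) - I(X1;Y) - I(X2;Y) for a table
# p[i, j, k] = P(X1 = x_i, X2 = x_j, Y = y_k).
import numpy as np

def mi(pab):
    """Mutual information (nats) of a 2-D joint probability table."""
    pa = pab.sum(1, keepdims=True); pb = pab.sum(0, keepdims=True)
    m = pab > 0
    return float((pab[m] * np.log(pab[m] / (pa * pb)[m])).sum())

def interaction_information(p):
    joint = p.reshape(-1, p.shape[2])   # treat the pair (X1, X2) as one variable
    return mi(joint) - mi(p.sum(1)) - mi(p.sum(0))

# XOR-type example: each variable alone carries no information about Y,
# yet the two jointly determine it, so II equals the full joint information.
p = np.zeros((2, 2, 2))
for i in range(2):
    for j in range(2):
        p[i, j, i ^ j] = 0.25
print(interaction_information(p))       # log 2 ≈ 0.6931 (pure synergy)
```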

Let us note that $II({X}_{1};{X}_{2};Y)$ defined above is one of the equivalent forms of interaction information. Namely, observe that:

$II({X}_{1};{X}_{2};Y)=I(({X}_{1},{X}_{2});Y)-I({X}_{1};Y)-I({X}_{2};Y)=H({X}_{1},{X}_{2})+H({X}_{1},Y)+H({X}_{2},Y)-H({X}_{1})-H({X}_{2})-H(Y)-H({X}_{1},{X}_{2},Y),\qquad(5)$

where $H({X}_{1},{X}_{2})=-{\sum}_{ij}{p}_{ij}\mathrm{log}{p}_{ij}$ is the entropy of $X=({X}_{1},{X}_{2})$, with the other quantities defined analogously. This easily follows from noting that $I(X;Y)=H(X)+H(Y)-H(X,Y)$ (cf. [17], Chapter 2). The second equality above is actually a restatement of the decomposition of the entropy $H({X}_{1},{X}_{2},Y)$ in terms of the values of its difference operator Δ (cf. [18]). Namely, Formula (4.10) in [18] asserts that $II({X}_{1};{X}_{2};Y)=-\mathrm{\Delta}H({X}_{1},{X}_{2},Y)$.
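The entropy representation above can be verified numerically; the following sketch (ours, with a randomly generated table) checks it against the definition of $II$ written via $I(A;B)=H(A)+H(B)-H(A,B)$:

```python
# Numerical check (ours) that the entropy form of II agrees with definition (4).
import numpy as np

def H(p):
    """Shannon entropy (nats) of a probability array of any shape."""
    q = p[p > 0]
    return float(-(q * np.log(q)).sum())

rng = np.random.default_rng(0)
p = rng.random((3, 3, 2)); p /= p.sum()        # joint table for (X1, X2, Y)

p12, p1y, p2y = p.sum(2), p.sum(1), p.sum(0)   # bivariate marginals
p1, p2, py = p.sum((1, 2)), p.sum((0, 2)), p.sum((0, 1))

# Entropy form: bivariate entropies minus univariate ones minus H(X1,X2,Y).
entropy_form = H(p12) + H(p1y) + H(p2y) - H(p1) - H(p2) - H(py) - H(p)

# Definition (4), written via I(A;B) = H(A) + H(B) - H(A,B):
i_12y = H(p12) + H(py) - H(p)
i_1y = H(p1) + H(py) - H(p1y)
i_2y = H(p2) + H(py) - H(p2y)
print(np.isclose(entropy_form, i_12y - i_1y - i_2y))   # True
```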

Interaction information is not necessarily positive. It is positive when the strength of association of $({X}_{1},{X}_{2})$ with Y is larger than an additive effect of both ${X}_{1}$ and ${X}_{2}$, i.e., when ${X}_{1}$ and ${X}_{2}$ interact predictively in explaining Y. Below, we list some properties of $II({X}_{1};{X}_{2};Y)$.

**Proposition** **1.** (i) We have:

$I(({X}_{1},{X}_{2});Y)=I({X}_{1};Y)+I({X}_{2};Y)+II({X}_{1};{X}_{2};Y);\qquad(6)$

(ii) $({X}_{1},{X}_{2})$ are independent of Y if and only if $I({X}_{1};Y)=0$, $I({X}_{2};Y)=0$ and $II({X}_{1};{X}_{2};Y)=0$; (iii) We have:

$II({X}_{1};{X}_{2};Y)={\sum}_{i,j,k}p({x}_{i},{x}_{j},{y}_{k})\,\mathrm{log}\left(p({x}_{i},{x}_{j},{y}_{k})\Big/\frac{p({x}_{i},{x}_{j})p({x}_{i},{y}_{k})p({x}_{j},{y}_{k})}{p({x}_{i})p({x}_{j})p({y}_{k})}\right);\qquad(7)$

(iv) It holds that:

$II({X}_{1};{X}_{2};Y)=I({X}_{1};{X}_{2}|Y)-I({X}_{1};{X}_{2}),$

where $I({X}_{1};{X}_{2}|Y)$ is conditional mutual information defined by:

$I({X}_{1};{X}_{2}|Y)={\sum}_{k}p({y}_{k}){\sum}_{i,j}p({x}_{i},{x}_{j}|{y}_{k})\mathrm{log}\frac{p({x}_{i},{x}_{j}|{y}_{k})}{p({x}_{i}|{y}_{k})p({x}_{j}|{y}_{k})}.$

Some comments are in order. Note that (i) is an obvious restatement of (4) and is analogous to the decomposition of variability in ANOVA models. The proof of (ii) easily follows from (i) after noting that the independence of $({X}_{1},{X}_{2})$ and Y is equivalent to $KL({P}_{{X}_{1},{X}_{2},Y}||{P}_{{X}_{1},{X}_{2}}\times {P}_{Y})=0$ in view of the information inequality (see [17], Theorem 2.6.3). Whence, if $({X}_{1},{X}_{2})$ is independent of Y, then $I(({X}_{1},{X}_{2});Y)=0$, and thus, also, $I({X}_{i};Y)=0$ for $i=1,2$. In view of (6), we then have that $II({X}_{1};{X}_{2};Y)=0$. A trivial consequence of (i) is that $II({X}_{1};{X}_{2};Y)\ge -[I({X}_{1};Y)+I({X}_{2};Y)]$; thus, when the main effects $I({X}_{i};Y)$ are zero, i.e., ${X}_{i}$ are independent of Y, we have $II({X}_{1};{X}_{2};Y)\ge 0$, and in this case, $II({X}_{1};{X}_{2};Y)$ is a measure of association between $({X}_{1},{X}_{2})$ and Y.

Part (ii) asserts that in order to check the joint independence of $({X}_{1},{X}_{2})$ and Y, one needs to check that ${X}_{i}$, $i=1,2$, are individually independent of Y and, moreover, that the interaction information $II({X}_{1};{X}_{2};Y)$ is zero. Part (iii) follows easily from (5).

Part (iv) yields another interpretation of $II({X}_{1};{X}_{2};Y)$ as a change of mutual information (information gain) when the outcome Y becomes known. This can be restated by saying that in the case when ${X}_{1}$ and ${X}_{2}$ are independent, the interaction between genes can be checked by testing the conditional dependence between the genes given Y. This is the source of the methods discussed, e.g., in [19,20], based on testing the difference of inter-locus associations between cases and controls. Note, however, that this works only for independent SNPs; in the case when ${X}_{1}$ and ${X}_{2}$ are dependent, conditional mutual information $I({X}_{1};{X}_{2}|Y)$ overestimates interaction information.
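Part (iv) is also straightforward to check numerically; the sketch below (ours, with an arbitrary random table) compares the definition of $II$ with the representation $I({X}_{1};{X}_{2}|Y)-I({X}_{1};{X}_{2})$:

```python
# Numerical sketch (ours) of Proposition 1 (iv):
# II(X1;X2;Y) = I(X1;X2|Y) - I(X1;X2).
import numpy as np

def mi(pab):
    """Mutual information (nats) of a 2-D joint probability table."""
    pa = pab.sum(1, keepdims=True); pb = pab.sum(0, keepdims=True)
    m = pab > 0
    return float((pab[m] * np.log(pab[m] / (pa * pb)[m])).sum())

def cond_mi(p):
    """I(X1;X2|Y) = sum_k p(y_k) * I(X1;X2 | Y = y_k) for p[i, j, k]."""
    py = p.sum((0, 1))
    return sum(py[k] * mi(p[:, :, k] / py[k]) for k in range(p.shape[2]))

rng = np.random.default_rng(1)
p = rng.random((3, 3, 2)); p /= p.sum()        # arbitrary joint table

ii_def = mi(p.reshape(-1, 2)) - mi(p.sum(1)) - mi(p.sum(0))   # definition (4)
ii_alt = cond_mi(p) - mi(p.sum(2))                            # Proposition 1 (iv)
print(np.isclose(ii_def, ii_alt))              # True
```

Since $I({X}_{1};{X}_{2})\ge 0$, the sketch also makes visible why $I({X}_{1};{X}_{2}|Y)$ alone overestimates $II$ whenever the SNPs are dependent.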

Let ${\tilde{p}}_{K}$ be the function appearing in the denominator of (7):

${\tilde{p}}_{K}({x}_{i},{x}_{j},{y}_{k})=\frac{p({x}_{i},{x}_{j})p({x}_{i},{y}_{k})p({x}_{j},{y}_{k})}{p({x}_{i})p({x}_{j})p({y}_{k})},$

and ${\tilde{P}}_{K}$ the associated distribution. ${\tilde{P}}_{K}$ is called the (unnormalized) Kirkwood superposition approximation of P. Note that (7) implies that if the KL distance between P and ${\tilde{P}}_{K}$ is small, then the interaction $II({X}_{1};{X}_{2};Y)$ is negligible. Let:

$\eta ={\sum}_{i,j,k}{\tilde{p}}_{K}({x}_{i},{x}_{j},{y}_{k})$

be the Kirkwood parameter. If the Kirkwood parameter equals one, then the Kirkwood approximation ${\tilde{P}}_{K}$ is a probability distribution. In general,

${p}_{K}({x}_{i},{x}_{j},{y}_{k})={\eta }^{-1}{\tilde{p}}_{K}({x}_{i},{x}_{j},{y}_{k})$

is a probability distribution, which will be called the Kirkwood superposition distribution. We say that a discrete distribution p has perfect bivariate marginals if the following conditions are satisfied ([21]):

${\sum}_{i}\frac{p({x}_{i},{x}_{j})p({x}_{i},{y}_{k})}{p({x}_{i})}=p({x}_{j})p({y}_{k}),\quad {\sum}_{j}\frac{p({x}_{i},{x}_{j})p({x}_{j},{y}_{k})}{p({x}_{j})}=p({x}_{i})p({y}_{k}),\quad {\sum}_{k}\frac{p({x}_{i},{y}_{k})p({x}_{j},{y}_{k})}{p({y}_{k})}=p({x}_{i})p({x}_{j})\qquad(13)$

for all ${x}_{i},{x}_{j},{y}_{k}$.

Note that Condition (13) implies that the bivariate marginals of ${\tilde{p}}_{K}$ coincide with those of $p({x}_{i},{x}_{j},{y}_{k})$. Now, we state some new facts on the interplay between predictive interaction, the value of the Kirkwood parameter and Condition (13). In particular, it follows that if $\eta <1$, then the genes interact predictively; a sufficient condition for this is given in Part (iv) below.
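The Kirkwood approximation is simple to compute; the sketch below (ours, with invented tables) evaluates ${\tilde{p}}_{K}$ and η, and checks that the approximation is exact for a fully independent triple (a special case of Proposition 2 (iii) below, which only requires one independent pair):

```python
# Numerical sketch (ours, not from the paper) of the unnormalized Kirkwood
# superposition approximation and its parameter eta.
import numpy as np

def kirkwood(p):
    """Return the unnormalized Kirkwood approximation of p[i, j, k] and eta.

    Assumes strictly positive univariate marginals."""
    p12, p1y, p2y = p.sum(2), p.sum(1), p.sum(0)
    p1, p2, py = p.sum((1, 2)), p.sum((0, 2)), p.sum((0, 1))
    pk = (p12[:, :, None] * p1y[:, None, :] * p2y[None, :, :]
          / (p1[:, None, None] * p2[None, :, None] * py[None, None, :]))
    return pk, float(pk.sum())

rng = np.random.default_rng(2)
p = rng.random((3, 3, 2)); p /= p.sum()     # generic joint table: eta need not be 1
pk, eta = kirkwood(p)

# For a product distribution of the three marginals, the approximation is exact:
q = np.einsum('i,j,k->ijk', p.sum((1, 2)), p.sum((0, 2)), p.sum((0, 1)))
qk, eta_indep = kirkwood(q)
print(round(eta_indep, 10))                 # 1.0
print(np.allclose(qk, q))                   # True
```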

**Proposition** **2.** (i) $II({X}_{1};{X}_{2};Y)\ge \mathrm{log}(1/\eta )$, and thus, if $\eta <1$, then $II({X}_{1};{X}_{2};Y)>0$; (ii) If any of the conditions in (13) is satisfied, then $\eta =1$ and $II({X}_{1};{X}_{2};Y)\ge 0$; (iii) If any two components of the random vector $({X}_{1},{X}_{2},Y)$ are independent, then $\eta =1$; (iv) If ${\sum}_{k}p({x}_{i},{y}_{k})p({x}_{j},{y}_{k})/p({y}_{k})<p({x}_{i})p({x}_{j})$ for all $i,j$, then $\eta <1$.

Part (i) is equivalent to $KL({P}_{{X}_{1},{X}_{2},Y}||{P}_{K})=II({X}_{1};{X}_{2};Y)+\mathrm{log}\,\eta \ge 0$. The proof of (ii) follows by direct calculation. Assume that, e.g., the first condition in (13) is satisfied. Then:

$\eta ={\sum}_{j,k}\frac{p({x}_{j},{y}_{k})}{p({x}_{j})p({y}_{k})}{\sum}_{i}\frac{p({x}_{i},{x}_{j})p({x}_{i},{y}_{k})}{p({x}_{i})}={\sum}_{j,k}p({x}_{j},{y}_{k})=1.$

(iii) is a special case of (ii), as the independence of two components of $({X}_{1},{X}_{2},Y)$ implies that the respective condition in (13) holds. Note that the condition in (iv) is weaker than the third equation in (13), and Part (iv) states that if ${X}_{i}$ are weakly individually associated with Y, they either do not interact or interact predictively.

The usefulness of the normalized Kirkwood approximation for testing for interactions was recognized in [8]. It is applied in the BOOST package to screen off pairs of genes that are unlikely to interact. In [7], interaction information is used for a similar purpose; see also [22]. We call:

$KL({P}_{{X}_{1},{X}_{2},Y}||{P}_{K})=II({X}_{1};{X}_{2};Y)+\mathrm{log}\,\eta$

modified interaction information; it is always nonnegative. Numerical considerations indicate that it is also useful to consider:

$\overline{II}({X}_{1};{X}_{2};Y)=\mathrm{max}\left\{II({X}_{1};{X}_{2};Y),\,II({X}_{1};{X}_{2};Y)+\mathrm{log}\,\eta \right\}.$

Note that $\overline{II}=II$ is equivalent to $\eta \le 1$. In connection with Proposition 2, we note that we also have another representation of $II({X}_{1};{X}_{2};Y)$ in terms of the Kullback–Leibler distance, namely:

$II({X}_{1};{X}_{2};Y)={\sum}_{i,j}p({x}_{i},{x}_{j})\,KL({P}_{Y|{x}_{i},{x}_{j}}||{\overline{P}}_{K}(\cdot |{x}_{i},{x}_{j})),$

where ${\overline{P}}_{K}$ is a distribution on the values of Y pertaining to ${\overline{p}}_{K}({y}_{k}|{x}_{i},{x}_{j})=p({y}_{k})p({x}_{i}|{y}_{k})p({x}_{j}|{y}_{k})$/$p({x}_{i})p({x}_{j})$ and:

$KL({P}_{Y|{x}_{i},{x}_{j}}||{\overline{P}}_{K}(\cdot |{x}_{i},{x}_{j}))={\sum}_{k}p({y}_{k}|{x}_{i},{x}_{j})\mathrm{log}\frac{p({y}_{k}|{x}_{i},{x}_{j})}{{\overline{p}}_{K}({y}_{k}|{x}_{i},{x}_{j})}.$

The last representation of $II({X}_{1};{X}_{2};Y)$ follows from (5) by an easy calculation.

#### 2.2. Other Nonparametric Measures of Interaction

Note that the bivariate marginals of ${p}^{a}$ coincide with those of $p({x}_{i},{x}_{j},{y}_{k})$, e.g., ${\sum}_{k}{p}^{a}({x}_{i},{x}_{j},{y}_{k})=p({x}_{i},{x}_{j})$; however, ${p}^{a}$ is not necessarily positive. We have the following decomposition:

$p({x}_{i},{x}_{j},{y}_{k})=p({x}_{i})p({x}_{j})p({y}_{k})+\pi ({x}_{i},{x}_{j})p({y}_{k})+\pi ({x}_{i},{y}_{k})p({x}_{j})+\pi ({x}_{j},{y}_{k})p({x}_{i})+\pi ({x}_{i},{x}_{j},{y}_{k}).\qquad(22)$

Thus, the terms $\pi ({x}_{j},{y}_{k}),\pi ({x}_{i},{y}_{k})$ and $\pi ({x}_{i},{x}_{j})$ correspond to the first order dependence effects for $p({x}_{i},{x}_{j},{y}_{k})$, whereas $\pi ({x}_{i},{x}_{j},{y}_{k})$ reflects the second order effect. Furthermore, note that the second order effect is equivalent to the dependence effect when all of the first order effects, including $\pi ({x}_{i},{x}_{j})$, are zero.

We have:

**Proposition** **3.** (i) $({X}_{1},{X}_{2})$ are independent of Y if and only if $\pi ({x}_{i},{y}_{k})=0$, $\pi ({x}_{j},{y}_{k})=0$ and $\pi ({x}_{i},{x}_{j},{y}_{k})=0$ for any ${x}_{i},{x}_{j},{y}_{k}$; (ii) The independence of $({X}_{1},{X}_{2})$ and Y is equivalent to ${\chi}_{{X}_{1};Y}^{2}=0$, ${\chi}_{{X}_{2};Y}^{2}=0$ and ${\chi}_{{X}_{1};{X}_{2};Y}^{2}=0$; (iii) Condition $\pi ({x}_{i},{x}_{j},{y}_{k})=0$ for any ${x}_{i},{x}_{j},{y}_{k}$ is equivalent to:

$p({x}_{i},{x}_{j},{y}_{k})={\alpha}_{ij}+{\beta}_{jk}+{\gamma}_{ik}$

for some ${\alpha}_{ij},{\beta}_{jk}$ and ${\gamma}_{ik}$.

Part (i) is checked directly. Note, e.g., that the conjunction of $\pi ({x}_{i},{y}_{k})=0$ and $\pi ({x}_{j},{y}_{k})=0$ is equivalent to ${X}_{1}$ and ${X}_{2}$ being individually independent of Y, and then ${p}^{a}({x}_{i},{x}_{j},{y}_{k})=p({x}_{i},{x}_{j})p({y}_{k})$. Thus, the additional condition $\pi ({x}_{i},{x}_{j},{y}_{k})=0$ is equivalent to $({X}_{1},{X}_{2})$ being independent of Y. Part (ii) is obvious in view of (i). Note that it is an analogue of Proposition 1 (ii). Part (iii) is easily checked. This is due to [23].

In the proposition below, we prove a new decomposition of ${\chi}_{({X}_{1},{X}_{2});Y}^{2}$, which can be viewed as an analogue of (6) for the chi square measure. In particular, in view of this decomposition, ${\chi}_{{X}_{1};{X}_{2};Y}^{2}$ is a measure of interaction. Namely, the following analogue of Proposition 1 (i) holds:

${\chi}_{({X}_{1},{X}_{2});Y}^{2}={\chi}_{{X}_{1};Y}^{2}+{\chi}_{{X}_{2};Y}^{2}+{\chi}_{{X}_{1};{X}_{2};Y}^{2}.\qquad(27)$

In order to prove (27), noting the rewriting of (22), we have:

$\frac{p({x}_{i},{x}_{j},{y}_{k})-p({x}_{i},{x}_{j})p({y}_{k})}{p({x}_{i})p({x}_{j})p({y}_{k})}=\frac{p({x}_{i},{x}_{j},{y}_{k})-{p}^{a}({x}_{i},{x}_{j},{y}_{k})}{p({x}_{i})p({x}_{j})p({y}_{k})}+\frac{p({x}_{i},{y}_{k})-p({x}_{i})p({y}_{k})}{p({x}_{i})p({y}_{k})}+\frac{p({x}_{j},{y}_{k})-p({x}_{j})p({y}_{k})}{p({x}_{j})p({y}_{k})}.$

We claim that squaring both sides, multiplying them by $p({x}_{i})p({x}_{j})p({y}_{k})$ and summing over $i,j,k$ yields (27). Namely, we note that all of the resulting mixed terms disappear. Indeed, the mixed term pertaining to the first two terms on the right-hand side equals:

$2{\sum}_{i,k}\frac{p({x}_{i},{y}_{k})-p({x}_{i})p({y}_{k})}{p({x}_{i})p({y}_{k})}{\sum}_{j}\left(p({x}_{i},{x}_{j},{y}_{k})-{p}^{a}({x}_{i},{x}_{j},{y}_{k})\right)=0$

due to ${\sum}_{j}p({x}_{i},{x}_{j},{y}_{k})-{p}^{a}({x}_{i},{x}_{j},{y}_{k})=0$. The mixed term pertaining to the last two terms on the right-hand side equals:

$2{\sum}_{j,k}\frac{p({x}_{j},{y}_{k})-p({x}_{j})p({y}_{k})}{p({y}_{k})}{\sum}_{i}\left(p({x}_{i},{y}_{k})-p({x}_{i})p({y}_{k})\right)=0,$

as ${\sum}_{i}p({x}_{i},{y}_{k})-p({x}_{i})p({y}_{k})=0$. The remaining mixed term vanishes analogously.

We note that (27) is an analogue of the decomposition of ${\sum}_{i,j,k}{(p({x}_{i},{x}_{j},{y}_{k})-p({x}_{i})p({x}_{j})p({y}_{k}))}^{2}/p({x}_{i})p({x}_{j})p({y}_{k})$ into four terms (see Equation (9) in [23]). Han in [18] proved that for the distribution ${P}_{{X}_{1},{X}_{2},Y}$ close to independence, i.e., when all three variables ${X}_{1}$, ${X}_{2}$ and Y are approximately independent, interaction information $II({X}_{1};{X}_{2};Y)$ and ${2}^{-1}{\chi}_{{X}_{1};{X}_{2};Y}^{2}$ are approximately equal. A natural question in this context is how those measures compare in general.

In particular, we would like to allow for the dependence of ${X}_{1}$ and ${X}_{2}$. In this case, Han’s result is not applicable, as ${P}_{{X}_{1},{X}_{2},Y}$ is not close to independence (cf. (22)). It turns out that, despite the analogous decompositions in (6) and (27), in the vicinity of the mass function ${p}_{0}({x}_{i},{x}_{j},{y}_{k})=p({x}_{i},{x}_{j})p({y}_{k})$ (independence of $({X}_{1},{X}_{2})$ and Y), $II({X}_{1};{X}_{2};Y)$ is approximated by different functions of the chi squares.

**Proposition** **5.** We have the following approximation in the vicinity of ${p}_{0}({x}_{i},{x}_{j},{y}_{k})=p({x}_{i},{x}_{j})p({y}_{k})$:

$II({X}_{1};{X}_{2};Y)=\frac{1}{2}{\chi}_{({X}_{1},{X}_{2});Y}^{2}-\frac{1}{2}\left\{{\chi}_{{X}_{1};Y}^{2}+{\chi}_{{X}_{2};Y}^{2}\right\}+o(1),\qquad(31)$

where the term $o(1)$ tends to zero as the vicinity of ${p}_{0}$ shrinks to this point.

Expanding $f(t)=t\,\mathrm{log}\,t$ for $t=p({x}_{i},{x}_{j},{y}_{k})$ around ${t}_{0}=p({x}_{i},{x}_{j})p({y}_{k})$, we obtain:

$f(t)=f({t}_{0})+(\mathrm{log}\,{t}_{0}+1)(t-{t}_{0})+\frac{{(t-{t}_{0})}^{2}}{2{t}_{0}}+o\left({(t-{t}_{0})}^{2}\right).$

Rearranging the terms, we have:

$p({x}_{i},{x}_{j},{y}_{k})\,\mathrm{log}\frac{p({x}_{i},{x}_{j},{y}_{k})}{p({x}_{i},{x}_{j})p({y}_{k})}=(t-{t}_{0})+\frac{{(t-{t}_{0})}^{2}}{2{t}_{0}}+o\left({(t-{t}_{0})}^{2}\right).$

Summing the above equality over $i,j$ and k (the terms $t-{t}_{0}$ sum to zero) and using the definition of $I(({X}_{1},{X}_{2});Y)$, we have:

$I(({X}_{1},{X}_{2});Y)=\frac{1}{2}{\chi}_{({X}_{1},{X}_{2});Y}^{2}+o(1).$

Reasoning analogously, we obtain $I({X}_{i};Y)=\frac{1}{2}{\chi}_{{X}_{i};Y}^{2}+o(1)$ for $i=1,2$. Using now the definition of interaction information, we obtain the conclusion.
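The Taylor step underlying the proof can be observed numerically. The following sketch (ours; the perturbation is an invented example that preserves both marginals) shows that $I(X;Y)/\left({2}^{-1}{\chi}^{2}\right)$ tends to one as the joint distribution approaches the product of its marginals:

```python
# Sketch (ours) of the second-order Taylor step behind Proposition 5: near
# independence, mutual information is approximately half the chi square.
import numpy as np

def mi(pab):
    """Mutual information (nats) of a 2-D joint probability table."""
    pa = pab.sum(1, keepdims=True); pb = pab.sum(0, keepdims=True)
    m = pab > 0
    return float((pab[m] * np.log(pab[m] / (pa * pb)[m])).sum())

def chi2(pab):
    """Population chi square between p(a, b) and p(a) p(b)."""
    q = pab.sum(1, keepdims=True) * pab.sum(0, keepdims=True)
    return float(((pab - q) ** 2 / q).sum())

base = np.outer([0.3, 0.4, 0.3], [0.5, 0.5])        # a product distribution
d = np.array([[1, -1], [-2, 2], [1, -1]]) * 0.01    # marginal-preserving shift
for eps in (1.0, 0.1, 0.01):
    p = base + eps * d
    print(eps, mi(p) / (0.5 * chi2(p)))             # ratio tends to 1
```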

Note that it follows from the last two propositions that we have the following generalization of Lemma 3.3 in [18].

**Proposition** **6.** In the vicinity of ${p}_{0}({x}_{i},{x}_{j},{y}_{k})=p({x}_{i},{x}_{j})p({y}_{k})$, it holds that:

$II({X}_{1};{X}_{2};Y)=\frac{1}{2}{\chi}_{{X}_{1};{X}_{2};Y}^{2}+o(1).$

This easily follows by replacing $-{2}^{-1}\left\{{\chi}_{{X}_{1};Y}^{2}+{\chi}_{{X}_{2};Y}^{2}\right\}$ in (31) by ${2}^{-1}\left\{{\chi}_{{X}_{1};{X}_{2};Y}^{2}-{\chi}_{\left({X}_{1},{X}_{2}\right);Y}^{2}\right\}+o(1)$ and using the definition of ${\chi}_{\left({X}_{1},{X}_{2}\right);Y}^{2}$.