Information Theoretic Methods for Variable Selection—A Review

We review the principal information theoretic tools and their use for feature selection, with the main emphasis on classification problems with discrete features. Since it is known that empirical versions of conditional mutual information perform poorly for high-dimensional problems, we focus on various ways of constructing its counterparts and on the properties and limitations of such methods. We present a unified way of constructing such measures based on truncation, or truncation and weighting, of the Möbius expansion of conditional mutual information. We also discuss the main approaches to feature selection which apply the introduced measures of conditional dependence, together with ways of assessing the quality of the obtained vector of predictors. This involves a discussion of recent results on asymptotic distributions of empirical counterparts of the criteria, as well as advances in resampling.


Introduction
Conditional independence is one of the main concepts in statistics which plays a fundamental role in such areas as causal inference, dependence analysis, and graphical modelling. In this review, we discuss how measures of dependence considered in information theory are used in classification and regression problems to choose predictors which significantly influence the outcome. This is a vital application of the information theoretic approach to variable selection, also known as feature selection. Let us stress, however, that information-theoretic measures are frequently applied in many other problems in Machine Learning and Statistics, in contexts other than feature selection. Some representative examples include data visualisation (t-SNE, [1]), clustering [2], Independent Component Analysis (ICA [3], Chapter 15), Variational Inference ( [4], Chapter 10), and Natural Language Modelling [5], among others.
In the times of the big data challenge, the problem of choosing a small group of variables from the pool of all potential variables, which would accurately describe the changes in response, is gaining in importance. This is needed for a better understanding of the studied phenomena, as well as for the construction of tractable models for them. Moreover, feature selection is instrumental in avoiding the curse of dimensionality, which occurs when a prohibitive amount of data is required for adequate fitting of the model ( [6], Chapter 2), and in avoiding overfitting (ibid., Chapter 7). This is also important for prediction, as classifiers which avoid using inactive predictors are usually less variable. Feature selection is also frequently applied because of cost considerations, especially when measuring some of the potentially useful predictors is costly or inconvenient.
Another competing direction of research with a similar aim is dimensionality reduction, in particular variable extraction, which transforms given variables to obtain a lean group of new predictors. Primary examples of such methods are Principal Components Regression

The second equality in (1) motivates the name Information Gain, also used for MI. It also yields an intuitive meaning for MI as the decrease in the variability of Y, measured by its entropy, when the information about another variable X is available. We remark that I(Y; X) is frequently denoted as MI(Y; X). Note that the semicolon in I(Y; X) determines between which variables dependence is considered: either between Y and X in the case of I(Y; X) or between Y and (X, Z) in the case of I(Y; X, Z). I(Y; X) evaluates how similar the joint distribution P Y,X of (Y, X) is to the product P Y ⊗ P X of their marginal distributions. As P Y ⊗ P X corresponds to independence of X and Y, I(Y; X) can be considered a measure of the strength of dependence between Y and X. It follows from the definition that ( [15], Section 2.3)

I(Y; X) = KL(P Y,X ||P Y ⊗ P X ),

where the Kullback-Leibler (KL) divergence between distributions P W and P Z is defined as KL(P W ||P Z ) = ∑ w P(W = w) log{P(W = w)/P(Z = w)}.
We note that the KL divergence is closely related to the Maximum Likelihood (ML) method, and the popular feature selection method Akaike Information Criterion (AIC) is derived as a bias-corrected ML [17]. We also remark that the Kullback-Leibler divergence can be replaced by another pseudo-distance between probability distributions, resulting in a different measure of dependence. It follows from the properties of KL divergence that I(Y; X) is non-negative and is equal to zero if, and only if, P Y,X = P Y ⊗ P X , i.e., when Y and X are independent. Moreover, it follows from the definition (1) that Mutual Information is symmetric: I(Y; X) = I(X; Y). It is also easily seen that (see formula (2.45) in [15])

I(Y; X) = H(X) + H(Y) − H(X, Y). (2)

Conditional Mutual Information CMI
We denote by P X|Z=z the conditional distribution of X given Z = z and define I(Y; X|Z = z) = KL(P Y,X|Z=z ||P Y|Z=z ⊗ P X|Z=z ) as the strength of dependence between Y and X given Z = z. Thus, I(Y; X|Z = z) = 0 means that Y and X are independent given that Z equals z. Now we define conditional mutual information as the average of these quantities over the distribution of Z:

I(Y; X|Z) = ∑ z P(Z = z) ∑ x,y P(X = x, Y = y|Z = z) log [ P(X = x, Y = y|Z = z) / (P(X = x|Z = z)P(Y = y|Z = z)) ]. (3)
Thus, the conditional mutual information is the mutual information of Y and X given Z = z averaged over the values of Z. It is a measure of dependence between Y and X given the knowledge of Z. From the properties of the Kullback-Leibler divergence it follows that I(Y; X|Z) = 0 ⇐⇒ X and Y are conditionally independent given Z. This is a powerful property, not satisfied by other measures of dependence, such as the partial correlation coefficient in the case of continuous random variables. The conditional independence of X and Y given Z will be denoted by X ⊥ ⊥ Y|Z and abbreviated to CI. We note that since I(Y; X|Z) is defined as a probabilistic average of I(Y; X|Z = z) over Z = z, it follows that I(Y; X|Z) = 0 ⇐⇒ I(Y; X|Z = z) = 0 for every z in the support of Z.
Thus, formally, testing conditional independence of X and Y given Z is equivalent to testing unconditional independence of X and Y on every stratum Z = z. This, however, does not make the problem an easy task, as is sometimes claimed, since performing such tests simultaneously on each stratum (global null), even when we have sufficient data to do it, would result in a lack of control of the probability of false signals (multiple testing problem).
The above definitions can be naturally extended to the case of random vectors (i.e., when X, Y, and Z are multivariate) by using multivariate probability mass functions instead of univariate ones. It is also easily seen that the following chain formula holds:

I(Y; X, Z) = I(Y; X) + I(Y; Z|X). (4)
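These definitions are easy to verify numerically. The sketch below (a minimal illustration; all function names are ours, not from the reviewed literature) estimates MI and CMI by plugging sample frequencies into their entropy representations and checks the chain formula (4) on simulated data:

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Plug-in entropy (natural log) of a sequence of discrete labels."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def mutual_information(x, y):
    """Plug-in I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    return entropy(x) + entropy(y) - entropy(list(zip(x, y)))

def conditional_mutual_information(x, y, z):
    """Plug-in I(X; Y | Z) = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z)."""
    return (entropy(list(zip(x, z))) + entropy(list(zip(y, z)))
            - entropy(list(zip(x, y, z))) - entropy(z))

rng = np.random.default_rng(0)
x = list(rng.integers(0, 2, 1000))
z = list(rng.integers(0, 2, 1000))
y = [a ^ b for a, b in zip(x, z)]   # Y depends jointly on X and Z

# Chain formula (4): I(Y; X, Z) = I(Y; X) + I(Y; Z | X)
lhs = mutual_information(list(zip(x, z)), y)
rhs = mutual_information(x, y) + conditional_mutual_information(z, y, x)
assert abs(lhs - rhs) < 1e-9
```

The identity holds exactly for the plug-in estimates, since both sides reduce to the same combination of empirical entropies.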

Interaction Information I I
The 3-way interaction information I I(Y; X; Z) of, possibly multivariate, X, Y, and Z also plays an important role in feature selection. It is defined as

I I(Y; X; Z) = I(Y; X|Z) − I(Y; X) = I(Y; X, Z) − I(Y; X) − I(Y; Z). (5)
The second equality, stemming from (4), neatly explains the idea of interaction information: we want to evaluate the synergistic effect (positive or negative) of both X and Z influencing Y, disregarding the sum of their individual effects. Thus, from the evaluated strength of the overall dependence between Y and the pair (X, Z), measured by I(Y; X, Z), we subtract the strengths of the individual dependences of Y on X and of Y on Z, measured by I(Y; X) and I(Y; Z), respectively. Although it is not immediately clear from the definition, I I(Y; X; Z) is actually a symmetric function of its arguments. However, it can be positive, negative, or 0. If it is positive, X and Z are said to be complementary with respect to Y, or to interact positively in influencing Y. A negative value of I I indicates redundancy among features or their inhibition.
We define 2-way I I as I I(Y; X) = I(Y; X) and its conditional version as I I(Y; X|Z) = I(Y; X|Z).
Definition (5) can be generalised in a recursive way to define k-way interaction information (see [18][19][20] for alternative equivalent definitions). For any subset of indices S = {s 1 , s 2 , . . . , s |S| }, Z S := (Z s 1 , Z s 2 , . . . , Z s |S| ) will stand for the sub-vector of Z i s with indices in S. Abusing the notation, Z ∈ Z S will mean that Z ∈ {Z s 1 , Z s 2 , . . . , Z s |S| }. Let S = {1, . . . , |S|} and k = |S| be the cardinality of S. We define k-way interaction information I I(Z 1 ; Z 2 ; . . . ; Z |S| ) so that the chain formula holds.
Definition 4. k-way interaction information I I(Z 1 ; Z 2 ; . . . ; Z |S| ) is defined recursively as

I I(Z 1 ; . . . ; Z |S| ) = I I(Z 1 ; . . . ; Z |S|−1 |Z |S| ) − I I(Z 1 ; . . . ; Z |S|−1 ), (6)

where, consistently with the definition of the conditional mutual information in (3), we define I I(Z 1 ; . . . ; Z |S|−1 |Z |S| ) as the average, over the distribution of Z |S| , of the interaction information computed for the conditional distribution given Z |S| = z. Using expansions of MI and CMI in terms of entropies, the following formula for k-way I I is obtained

I I(Z 1 ; . . . ; Z |S| ) = − ∑ T⊆S (−1) |S|−|T| H(Z T ), (7)

where Z T = (Z t 1 , . . . , Z t i ) for T = {t 1 , . . . , t i }. In particular, we have

I I(Z 1 ; Z 2 ; Z 3 ) = −H(Z 1 ) − H(Z 2 ) − H(Z 3 ) + H(Z 1 , Z 2 ) + H(Z 1 , Z 3 ) + H(Z 2 , Z 3 ) − H(Z 1 , Z 2 , Z 3 ).

Note that when, e.g., Z 3 = XOR(Z 1 , Z 2 ) = I{Z 1 ≠ Z 2 } and (Z 1 , Z 2 ) are independent copies of a random variable having Bernoulli distribution with p = 1/2, we obtain I I(Z 1 ; Z 2 ; Z 3 ) = log 2, as in this case I(Z 3 ; Z 2 |Z 1 ) = H(Z 3 ) and I(Z 3 ; Z 2 ) = 0. In the following, the so-called Möbius expansion (see, e.g., [20]) plays a crucial role.
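The XOR example can be verified directly from the alternating entropy sum for k-way interaction information; the following sketch (helper names are ours) recovers I I(Z 1 ; Z 2 ; Z 3 ) = log 2 exactly:

```python
import itertools, math

def entropy(pmf):
    """Entropy (natural log) of a pmf given as a dict {value: probability}."""
    return -sum(p * math.log(p) for p in pmf.values() if p > 0)

def marginal(joint, idx):
    """Marginal pmf of the coordinates listed in idx, from a joint pmf over tuples."""
    out = {}
    for v, p in joint.items():
        key = tuple(v[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out

def interaction_information(joint, k):
    """k-way I I via the alternating entropy sum over all non-empty subsets T of S."""
    total = 0.0
    for r in range(1, k + 1):
        for T in itertools.combinations(range(k), r):
            total -= (-1) ** (k - r) * entropy(marginal(joint, T))
    return total

# Z1, Z2 iid Bernoulli(1/2) and Z3 = XOR(Z1, Z2)
joint = {(z1, z2, z1 ^ z2): 0.25 for z1 in (0, 1) for z2 in (0, 1)}
assert abs(interaction_information(joint, 3) - math.log(2)) < 1e-12

# 2-way I I reduces to MI, which is 0 for the independent pair (Z1, Z2)
assert abs(interaction_information(marginal(joint, (0, 1)), 2)) < 1e-12
```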
Theorem 1. The conditional mutual information satisfies the equation

I(X; Y|Z S ) = ∑ T⊆S I I(X; Y; Z t 1 ; . . . ; Z t i ), (8)

where T = {t 1 , . . . , t i }, and, conversely,

I I(X; Y; Z 1 ; . . . ; Z |S| ) = ∑ T⊆S (−1) |S|−|T| I(X; Y|Z T ). (9)

We stress that the sum in (8) ranges over all subsets of S, including the empty set, for which the corresponding term is I I(X; Y) = I(X; Y). For a description of set functions for which (8) and (9) are equivalent, see [20]. Note that we carefully distinguish between 3-way interaction information I I(X; Y; Z S ) with multivariate Z S as the third component and I I(X; Y; Z 1 ; . . . ; Z |S| ) being (|S| + 2)-way interaction information between X, Y and all components Z 1 , . . . , Z |S| . Equality (8) can be restated, in view of I I(X; Y) = I(X; Y) and (6), as

I(X; Y|Z S ) = I(X; Y) + ∑ ∅≠T⊆S I I(X; Y; Z t 1 ; . . . ; Z t i ). (10)

Finally, note that it follows from (9) that I I(X; Y; Z 1 ; . . . ; Z |S| ) = 0 provided X and Y are conditionally independent given any subvector Z T of Z S , including Z ∅ . As we shall see in Section 4.1, Möbius expansion (8) is a natural starting point to introduce CMI-related criteria for feature selection.
Let us indicate that for any classifier Ŷ = Ŷ(X) of class variable Y with g classes, its unconditional probability of error P(Y ≠ Ŷ) can be related to conditional entropy H(Y|X) by means of Fano's inequality ( [15,21]), which, using I(X; Y) = H(Y) − H(Y|X), can be written as

P(Y ≠ Ŷ) ≥ (H(Y|X) − log 2)/log g = (H(Y) − I(X; Y) − log 2)/log g.

This shows that when I(Y; X) decreases, i.e., Y and X become less dependent, the lower bound on the probability of error of any classifier increases.

Feature Selection
In this section we discuss an objective of feature selection, the concept of Markov Blanket and its properties, as well as a greedy search for active features.

General Considerations: Characterisations of Markov Blanket
Suppose now that we consider class variable Y and a p-dimensional discrete vector (X 1 , . . . , X p ) =: (Z 1 , . . . , Z p ) of all available predictors. Our aim is to describe Y using those features among X 1 , . . . , X p which jointly influence Y. Note that this is a different task than selecting features which individually affect Y. Namely, we want to find out which features can be discarded without loss of information about Y when the remaining features are taken into account. Thus, our aim is to check which features become redundant in the presence of other features. Moreover, we would like to investigate synergy between active features, that is, a possible additional effect due to their simultaneous acting. Joint informativeness is, thus, crucial for feature selection. In this context, we define a minimal set of active predictors.

Definition 5. We define a minimal subset of active predictors as a minimal subset S * ⊆ {1, . . . , p} =: F such that

I(Y; X S * ) = max T⊆F I(Y; X T ) = I(Y; X F ), (11)

where X T denotes the subvector of (X 1 , . . . , X p ) with indices in the set T. Minimality is meant in the set-theoretic sense, i.e., S * is minimal if there is no proper subset W ⊂ S * which satisfies (11).
Note that the second equality in (11) is due to the chain formula, from which it follows that I(Y; X T ) ≤ I(Y; X F ) for any T ⊆ F. The problem of determining S * is a feature selection problem stated in an information-theoretic setting, as X S * may be considered the minimal set describing adequately the overall dependence of Y on the available vector of predictors. The other possible way of defining a small subset of predictors which contains 'almost' all information about target Y is, for a given value of a hyperparameter γ > 0, to consider the smallest subset S * γ such that

I(Y; X S * γ ) ≥ (1 − γ)I(Y; X F ). (12)

The other variant of the problem is its constrained version, when a solution is sought for a specific cardinality k of the feature set, which involves (|F| choose k) evaluations of mutual information. We also refer to the related information bottleneck problem, in which the information in X is compressed to a random variable M satisfying a given compression condition making it easy to transmit, while having maximal mutual information with Y, see, e.g., [22].
Note that since I(Y; X S ) is monotone in S (I(Y; X S ) ≥ I(Y; X T ) for T ⊆ S), we have I(Y; X S * ) = max T⊆F I(Y; X T ), thus the RHS can be used as a criterion when looking for S * ( [23]). It is shown in [24] that when the distribution of (X 1 , . . . , X p ) is determined by a subvector X S * corresponding to a certain S * ⊆ F and is parametrised by τ * , finding a maximiser of I(Y; X T ) is the first step of maximising the log-likelihood L(T, τ) over T and τ under the so-called filter assumption stating that optimisations over T and over τ are independent. The problem of uniqueness of S * satisfying (11) is important and sometimes disregarded in feature selection.

Remark 1.
Note that although (11) is trivially satisfied for S * = F, it does not mean that the minimal set in the sense of inclusion is uniquely defined. The obvious example is constructed by taking any Y, X such that I(Y; X) > 0 and letting X 1 = X and X 2 = f (X 1 ), where f is any 1-1 transform which maps the set of values of X onto itself. In this case, I(Y; X 1 ) = I(Y; X 2 ) = I(Y; X 1 , X 2 ), thus both subsets {X 1 } and {X 2 } satisfy (11) and are minimal. Moreover, note that Y ⊥ ⊥ X 2 |X 1 . However, uniqueness of S * satisfying (11) holds for the case of continuous X and binary Y under strict positiveness assumptions stating that the density p(x) is positive almost everywhere with respect to Lebesgue measure and P X (P(Y = 1|X) = 1/2) = 0 (see [25], p. 97).
We discuss now properties of the set S * . The first one is obtained by noticing that, in view of the chain formula, for S * defined in (11) we have I(Y; X S * c |X S * ) = 0, where T c denotes the complement of T in F. This is equivalent to stating that S * is a so-called Markov Blanket of Y.

Definition 6. A subset M ⊆ F is called a Markov Blanket of Y, denoted MB(Y), if

Y ⊥ ⊥ X M c |X M . (13)
Thus, MB(Y) shields Y from the rest of the predictors in the sense that they become irrelevant once MB(Y) is known. Again, MB(Y) does not need to be uniquely defined. In order to discuss properties of MB(Y) we introduce the concept of strong relevancy.

Definition 7. X i is a strongly relevant feature provided I(Y; X i |X F\{i} ) > 0.

Note that the last property is equivalent to the conditional dependence of Y and X i given X F\{i} and it can be restated as P Y|X F ≠ P Y|X F\{i} [24]. Intuitively, strongly relevant features should be included in S * as, after exclusion of such features, the dependence of Y on X F is not adequately described. In the class of General Linear Models, under minimal conditions, features which are not strongly relevant are exactly those for which the corresponding regression coefficient equals 0 (see [26], Proposition 2.2). The following facts concerning S * have been formally proved:

Theorem 2. Assume that Markov Blanket MB(Y) exists. Then, we have: (i) S * is a Markov Blanket of Y; (ii) for any T ⊆ F,

Y ⊥ ⊥ X T |X S * \T ⇐⇒ T ∩ S * = ∅; (14)

(iii) Every strongly relevant feature belongs to S * .
The first statement is justified above. Part (ii) is due to [27] and indicates that all subsets of F are partitioned into two parts: the first consisting of sets T disjoint with S * , for which Y ⊥ ⊥ X T |X S * holds, and the remaining ones, which are conditionally dependent with Y given X S * \T .
We note that (iii) can be justified by the following simple reasoning: it is enough to show that if j ∉ S * then X j is not strongly relevant. Indeed, we have

I(Y; X F ) = I(Y; X j , X F\{j} ) = I(Y; X F\{j} ) + I(Y; X j |X F\{j} ) = I(Y; X F\{j} ),

where the second equality follows from the chain formula and the third from the fact that j ∉ S * and the chain formula again. Thus, from the second equality it follows that I(Y; X j |X F\{j} ) = 0, whence X j is not strongly relevant. It is not true in general that MB(Y) consists only of strongly relevant features. Note that in the example of Remark 1 both X 1 and X 2 substituted for MB(Y) satisfy (13), but neither of them is strongly relevant as Y ⊥ ⊥ X 1 |X 2 and Y ⊥ ⊥ X 2 |X 1 . In the case when X is continuous and Y is binary, this can be proved for strictly positive distributions defined in Remark 1 (cf. Theorem 10 in [28]).
Ref. [27] introduces a concept of m-Markov Blanket S * m for m ≤ p for which equivalence in (14) is satisfied for any subset T ⊆ F \ S * m , such that |T| ≤ m. As for m = p the condition |T| ≤ m is vacuous, p-Markov Blanket of Y is MB(Y).

Greedy Feature Selection
In view of (i) of Theorem 2, finding S * is equivalent to finding the minimal set satisfying (13). This is a difficult task when p is large, as it would involve checking this condition for all 2 p subsets of {1, . . . , p}. The problem is frequently replaced by its greedy selection analogue: given S defined as the set of indices of features already selected, one determines

j * = arg max j∈S c I(Y; X S∪{j} ) = arg max j∈S c I(X j ; Y|X S ), (15)

where the second equality follows from (4). The feature having the largest MI with Y is chosen first and the sequential search is stopped when for all j ∈ S c we have I(X j ; Y|X S ) = 0. Note that, in view of the chain equality, the maximised criterion equals

I(X j ; Y|X S ) = I(X j ; Y) − I(X j ; X S ) + I(X j ; X S |Y) = I(X j ; Y) + I I(X j ; X S ; Y), (16)

where I I(X j ; X S ; Y) is the three-way interaction information of X j , X S , and Y. The first two terms of the first equality are called relevance and redundancy of X j wrt Y and X S , respectively. Note that in the maximisation of I(X j ; Y|X S ) over X j , redundancy of X j with respect to X S may be outweighed by the magnitude of the conditional information which X j contains about X S within classes: I(X j ; X S |Y). The RHS of (16) involves I I(X j ; X S ; Y). In the next section, we show that I I(X j ; X S ; Y) is frequently replaced by algebraic expressions involving I I(X j ; X t 1 ; . . . ; X t k ; Y), where k does not exceed 2. This is motivated by (8) and allows for easier estimation of the expression.
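The greedy scheme can be sketched as follows (our own illustrative code: plug-in CMI estimates and a small fixed threshold eps stand in for a formal test of I(X j ; Y|X S ) = 0):

```python
import numpy as np

def _H(cols):
    """Plug-in joint entropy (natural log) of the columns of an n x k array."""
    if cols.shape[1] == 0:
        return 0.0
    _, counts = np.unique(cols, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def cmi(x, y, z):
    """Plug-in I(X; Y | Z); x, y are 1-d arrays, z is an n x k array (k may be 0)."""
    c = lambda a: np.asarray(a).reshape(-1, 1)
    return (_H(np.hstack([c(x), z])) + _H(np.hstack([c(y), z]))
            - _H(np.hstack([c(x), c(y), z])) - _H(z))

def greedy_select(X, y, eps=0.01):
    """Forward selection: repeatedly add the feature maximising I(X_j; Y | X_S)."""
    S = []
    while len(S) < X.shape[1]:
        rest = [j for j in range(X.shape[1]) if j not in S]
        scores = {j: cmi(X[:, j], y, X[:, S]) for j in rest}
        best = max(scores, key=scores.get)
        if scores[best] <= eps:       # no remaining feature adds information
            break
        S.append(best)
    return S

rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(2000, 5))
y = X[:, 0] | X[:, 3]                 # only features 0 and 3 are active
assert sorted(greedy_select(X, y)) == [0, 3]
```

For this OR-type response the two active features are recovered; note that a purely synergistic response such as XOR would defeat the greedy first step, since then I(X j ; Y) = 0 for every single feature.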

Criteria Based on Möbius Expansion
As estimation of conditional quantities, such as I(Y; X|X S ), and their effective use for detecting conditional dependence when X S is high-dimensional require large sample sizes (see, e.g., [29], Section 3.2), the common approach is to modify expressions such as the RHS of (10) to define analogues of CMI. We consider two approaches to achieve this aim. The first is based on truncation of the sum in (10) at a certain order; the second consists of weighting the corresponding terms and truncating them. In both cases, conditioning on the random vector X S is replaced by conditioning on low-dimensional sub-vectors of this vector, which drastically reduces the need for large sample sizes. This makes the criteria easier to estimate and analyse. As a motivational example, consider the situation when X S consists of |S| = 10 binary coordinates, uniformly distributed on {0, 1}, and the sample size is n = 1000. Then, we expect around 1000/2 10 ≈ 1 observation for each value X S = x S of the conditioning variable X S in I(Y; X|X S ), whereas the respective number is around 500 for a single conditioning variable X i appearing in the definition of, e.g., J MI in (21) below. The fact that low dimensionality of the conditioning set facilitates estimation is well recognised (see, e.g., [29]).
More specifically, we will consider below the following criteria based on truncation:
• Mutual Information Maximisation criterion MI M;
• Conditional Infomax Feature Selection criterion CIFE of order two and three.
Additionally, criteria based on truncation and weighting of the Möbius expansion will be considered. The simplest approach is to consider only the first term of expansion (10), leading to the Mutual Information Maximisation criterion MI M(X) := I(X; Y) (cf. [30]). This completely ignores the dependence structure between Y and X S , thus taking into account only feature relevance and disregarding its redundancy. However, it is useful and frequently applied for preliminary screening of predictors, which are then subjected to more precise scrutiny. Actually, the name 'filters' is frequently used for exactly such criteria when MI is replaced by a dependence measure between Y and X of the user's choice.
Consideration of the first two summands of expansion (10) leads to Conditional Infomax Feature Selection (CIFE, [31], see also [24]), which is also called Short Expansion of CMI of order 2 (SECMI2) in [29] and is defined as CIFE(X) = I(Y; X) + ∑ i∈S I I(Y; X; X i ). We stress that, in the definition of CIFE and in the definitions of the following criteria, the argument of the criterion is X := X i for i ∈ S c , which is the variable over which maximisation is performed.
Weighting as well as truncating the terms of expansion (10) leads to the family of criteria

J β,γ (X) = I(Y; X) − β ∑ i∈S I(X; X i ) + γ ∑ i∈S I(X; X i |Y), (20)

which encompasses, in particular, MIFS β (γ = 0), mRMR (β = 1/|S|, γ = 0), CIFE (β = γ = 1) and, for β = γ = 1/|S|, the Joint Mutual Information criterion

J MI(X) = I(Y; X) + (1/|S|) ∑ i∈S I I(Y; X; X i ). (21)

Remembering that I I is symmetric and writing I I(Y; X; X i ) = I(X; X i |Y) − I(X; X i ) = I(Y; X|X i ) − I(Y; X), we obtain

J MI(X) = (1/|S|) ∑ i∈S I(Y; X|X i ), (22)

which, in particular, shows that J MI(X) = 0 is equivalent to Y ⊥ ⊥ X|X i for any i ∈ S.

Remark 2.
Note that J MI(X) is always non-negative, but this is not the case for CIFE(X) (nor for CIFE3(X)). Indeed, taking an arbitrary X 1 such that I(Y; X 1 ) > 0 and letting X 2 = · · · = X p = X 1 , it is easy to check that for X := X p , S = {1, . . . , p − 1} and p > 2 we have

CIFE(X) = I(Y; X 1 ) + ∑ i∈S [I(Y; X|X i ) − I(Y; X)] = (2 − p)I(Y; X 1 ) < 0,

as I(X j ; Y|X 1 ) = 0 (cf. also [34]).
In [38], 3-way J MI has been introduced by starting from the equality

∑ i<j, i,j∈S I(Y; X, X i , X j ) = ∑ i<j I(Y; X i , X j ) + ∑ i<j I(Y; X|X i , X j ).

Note that by dropping the first sum on the RHS, which does not depend on X (as the introduced criteria will be used to choose X), and scaling the second term by 2/(|S|(|S| − 1)), one obtains, in view of (8),

J MI3(X) = (2/(|S|(|S| − 1))) ∑ i<j I(Y; X|X i , X j ),

whose Möbius expansion has non-zero coefficients only on interaction informations I I(Y; X; X t 1 ; . . . ; X t l ) with l ≤ 2, and all other coefficients equal to 0. The important criterion Conditional Mutual Information Maximisation CMI M, using a different approach consisting of considering a non-linear function of I(Y; X|X i ), i = 1, . . . , |S|, has been proposed in [39]:

CMI M(X) = min i∈S I(Y; X|X i ),

which should be maximised over X := X i for i ∈ S c . Thus, for each candidate we look for its best surrogate among the already chosen variables, and the candidate for which even the best surrogate is least effective is chosen. A generalisation of the rule based on CMI M is considered in [40].
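For concreteness, the criteria above can be written down in a few lines using plug-in entropy estimates (a sketch with our own helper names; the data-generating scenario mimics Remark 2, in which all already-selected features are copies of X 1 ):

```python
import numpy as np

def _H(*cols):
    """Plug-in joint entropy (natural log) of one or more discrete columns."""
    _, counts = np.unique(np.stack(cols, axis=1), axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def mi(x, y):
    return _H(x) + _H(y) - _H(x, y)

def cmi(x, y, z):
    return _H(x, z) + _H(y, z) - _H(x, y, z) - _H(z)

def cife(x, y, Xs):
    """CIFE(X) = I(Y;X) - sum_i I(X;X_i) + sum_i I(X;X_i|Y)."""
    return mi(x, y) + sum(cmi(x, Xs[:, i], y) - mi(x, Xs[:, i])
                          for i in range(Xs.shape[1]))

def jmi(x, y, Xs):
    """J MI(X) = (1/|S|) sum_i I(Y;X|X_i) -- the averaged-CMI form."""
    return float(np.mean([cmi(x, y, Xs[:, i]) for i in range(Xs.shape[1])]))

def cmim(x, y, Xs):
    """CMI M(X) = min_i I(Y;X|X_i)."""
    return min(cmi(x, y, Xs[:, i]) for i in range(Xs.shape[1]))

rng = np.random.default_rng(2)
x1 = rng.integers(0, 2, 5000)
y = x1 ^ (rng.random(5000) < 0.1).astype(int)   # noisy copy, so I(Y; X1) > 0
Xs = np.stack([x1, x1, x1], axis=1)             # S consists of copies of X1

assert cife(x1, y, Xs) < 0          # CIFE can be negative (cf. Remark 2)
assert jmi(x1, y, Xs) > -1e-9       # J MI is always non-negative
assert abs(cmim(x1, y, Xs)) < 1e-9  # each I(Y; X1 | X_i) equals 0 here
```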

Variational Approach
The other promising approach to approximating MI and CMI is based on the construction of variational lower bounds of these quantities. The bounds obtained are then used as selection criteria. We discuss two approaches which are similar in nature. The first one is based on the Donsker-Varadhan inequality, which states that [41]

I(X; Y) ≥ sup f∈F { E P X,Y f (X, Y) − log E P X ⊗P Y exp( f (X, Y)) }, (25)

and the inequality becomes equality when the family F contains h(x, y), the logarithm of the ratio of densities of P X,Y and P X ⊗ P Y . Indeed, note that by plugging h(x, y) + C into the expression on the RHS of (25) we obtain equality. Estimation of I(X; Y) based on the Donsker-Varadhan formula has been proposed in [42], where a neural network is applied for F, resulting in the MINE method. The approach has been further pursued in [43,44]. Other lower bounds, such as the Nguyen-Wainwright-Jordan bound [45], can also be used. Note that in order to evaluate the expected value under independence in (25) one uses permutations of the original sample, which consists in permuting X values while keeping the values of Y at their original places (see Section 7). For feature selection, one can either directly evaluate I(Y; X S∪{j} ) or I(Y; X j |X S ). For the second approach, resampling methods which would ensure conditional independence for the generated data are needed. Note that the efficiency of such methods depends on the flexibility of the function class F over which the bound in (25) is optimised. The empirical evidence suggests that relatively simple neural nets are sufficiently flexible to yield satisfactory estimators of I(X; Y). They are also preferable, as large networks will increase the variability of the estimators. The second bound, similar in flavour to (25), is due to [46]. It is noticed there that for an arbitrary density q(x, y) we have

I(X; Y) = E p(x,y) log [q(y|x)/p(y)] + E p(x) D(p(Y|x)||q(Y|x)) ≥ E p(x,y) log [q(y|x)/p(y)],

where the inequality is due to D(p(Y|x)||q(Y|x)) ≥ 0. Assuming that q(y) = p(y), in view of Bayes theorem it is now sufficient to specify the density q(x|y) to obtain a lower bound on I(Y; X).
In [46] q(x|y) are considered which either satisfy naive Bayes assumption or more general assumption that q(x t |y, Other proposals, using specific tree-representation for q(x|y) are possible (see, e.g., [47]).
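The equality case of the Donsker-Varadhan bound (25) can be checked directly on a small discrete example: plugging the true log density ratio (or the ratio shifted by a constant C) into the bound recovers I(X; Y) exactly, while any other function gives a lower bound. This is only a toy check of the inequality (our code), not the MINE procedure itself, which optimises the critic over a neural network class:

```python
import numpy as np

# A 2 x 2 joint pmf of (X, Y) with clear dependence
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px = pxy.sum(axis=1, keepdims=True)   # marginal of X (column vector)
py = pxy.sum(axis=0, keepdims=True)   # marginal of Y (row vector)
prod = px * py                        # product measure P_X x P_Y

mi = float(np.sum(pxy * np.log(pxy / prod)))   # I(X; Y)

def dv_objective(f):
    """Donsker-Varadhan objective: E_P[f] - log E_{P_X x P_Y}[exp(f)]."""
    return float(np.sum(pxy * f) - np.log(np.sum(prod * np.exp(f))))

f_star = np.log(pxy / prod)           # optimal critic: the log density ratio
assert abs(dv_objective(f_star) - mi) < 1e-12        # equality at the optimum
assert abs(dv_objective(f_star + 3.0) - mi) < 1e-12  # shifting by a constant C preserves it
assert dv_objective(np.zeros((2, 2))) <= mi          # any other f only lower-bounds MI
```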

Interplay between CMI and CMI-Related Criteria
The following result states conditions under which CMI and one of the introduced criteria coincide. As before, X denotes a feature from X F\S considered as a candidate to be added to {X s , s ∈ S}.

Theorem 3.
(i) Assume that all features are conditionally independent given the class (naive Bayes assumption), i.e., P(X 1 , . . . , X p |Y) = ∏ i P(X i |Y), and that features in S are conditionally independent given any not-chosen feature (i.e., belonging to S c ). Then, for X ∈ X F\S , I(Y; X|X S ) differs from MIFS 1 (X) = I(Y; X) − ∑ i∈S I(X; X i ) by a factor which does not depend on X. (ii) Assume that all k-way interaction informations I I(Y; X; X t 1 ; . . . ; X t l ) = 0 for k > 3 (l > 1).
Then, I(Y; X|X S ) = CIFE(Y, X|X S ) for any X ∈ X F\S . (iii) Ref. [24] If X i , i ∈ S, are conditionally independent given X, for X ∈ X F\S , and additionally they are conditionally independent given (X, Y), then I(Y; X|X S ) differs from CIFE(Y, X|X S ) by a factor which does not depend on X. (iv) Ref. [48] Assume that for any i ∈ S we have X ⊥ ⊥ X S\{i} |X i and additionally X ⊥ ⊥ X S\{i} |(X i , Y). Then, I(Y; X|X S ) = J MI(Y; X|X S ).
For completeness, the proof is included in Appendix A. Note that the conditions imposed by (i) are very stringent. There are no known probabilistic vectors satisfying I I(Y; X; X t 1 ; . . . ; X t l ) = 0 for l > 1 apart from simple situations, such as X S ⊥ ⊥ (X, Y), which obviously does not hold when X S is the set of chosen predictors.
We remark that under conditions of (i) MIFS 1 (X) = CIFE(X), as due to the naive Bayes assumption ∑ i∈S I(X i ; X|Y) = 0. Additionally, it follows from (21) that J MI(X) = mRMR(X) provided that the naive Bayes condition holds, thus in order to have J MI(X) = CMI(X) = mRMR(X) under rather strong assumptions of (iv), we have to assume additionally naive Bayes condition. This indicates that the last equality can hold only under restrictive conditions and is not true in general (see [24], Section 4.1).
Assumptions in (iii) and (iv) are not compatible, as they lead to different forms of criteria equivalent to the CMI criterion. It is argued in [46] that the only plausible graph representation of the probability structure for which the conditions of (iii) hold is the graph with edges E = {(Y, X)} ∪ {(X, X i )} i∈S . In this case, due to the data-processing inequality, we have I(Y; X) ≥ I(Y; X S ). This means, however, that X should have been chosen before any features from S.
Additional results of similar flavours to Theorem 3 are discussed in [38,49].
It follows from the preceding discussion that the criteria introduced in Section 4.1 are formally obtained by truncation (or truncation and weighting) of the terms in the Möbius expansion. Their analytical properties, concerning how well they approximate CMI, have yet to be established.

Asymptotic Distributions of Information-Theoretic Measures
Sequential feature selection for predicting target variable Y typically involves checking whether a new candidate feature X is a valuable addition to the set of already chosen features X S . This is usually based on testing whether the null hypothesis H 0 : X ⊥ ⊥ Y|X S holds, and its rejection is interpreted as an indication that X carries additional information about Y beyond that provided by X S . To this end, I(Y; X|X S ) or its modified versions are used. The usual strategy, following the statistical testing approach, is to derive the asymptotic distribution of the empirical version of I(Y; X|X S ), or of its modifications described in Section 4, under the null. This distribution is used as a benchmark against which the value of the statistic for the sample under consideration is compared to obtain the asymptotic p-value of the test. In the following, we describe the asymptotic distributions of CMI and CMI-related criteria under H 0 . A competing approach, approximating the distribution of the considered statistic under H 0 by resampling, is described in Section 7.

Asymptotic Distribution of CMI
We assume that X, Y, X S take I, J, K possible values, respectively. As X S is an |S|-dimensional vector, K is the number of all possible combinations of values of its coordinates. We let p(x, y, x S ) = P(X = x, Y = y, X S = x S ) and consider the case when all probabilities p(x, y, x S ) are positive. It is assumed throughout that the estimation of I(Y; X|X S ) and related quantities is based on n independent identically distributed (iid) samples from the distribution of (X, Y, X S ). Construction of estimators of CMI relies on plugging in frequencies in place of unknown probabilities, e.g.,

Î(Y; X|X S ) = ∑ x,y,x S p̂(x, y, x S ) log [ p̂(x, y|x S )/( p̂(x|x S ) p̂(y|x S )) ],

where p̂(x, y, x S ) = n(x, y, x S )/n and n(x, y, x S ) is the number of samples equal to (x, y, x S ).
We will consider only frequencies as estimators of discrete probabilities. Other estimators exist, e.g., regularised versions of sample frequencies, for which regularisation reflects a level of departure from conditional independence, see [38,50]. The following known result is frequently used in dependence analysis (compare [51], see also [52]).

Theorem 4. (i) If I(Y; X|X S ) > 0, then √n( Î(Y; X|X S ) − I(Y; X|X S )) is asymptotically normal N(0, σ 2 ), where σ 2 > 0 depends on the distribution of (X, Y, X S ). (ii) If I(Y; X|X S ) = 0, then 2n Î(Y; X|X S ) converges in distribution to the chi-square distribution with (I − 1)(J − 1)K degrees of freedom.
The frequently applied test of conditional independence X ⊥ ⊥ Y|X S is based on the above useful fact (ii): under conditional independence the asymptotic distribution of Î(Y; X|X S ) does not depend on the distribution of (X, Y, X S ) and is chi-square with a known number of degrees of freedom. On the other hand, when conditional independence does not hold, the limiting distribution of Î(Y; X|X S ) − I(Y; X|X S ) is normal with the variance depending on the underlying probability distribution of (X, Y, X S ). We stress that the speeds of convergence of Î(Y; X|X S ) to I(Y; X|X S ) are different in the two cases: they equal n −1 in the first case and n −1/2 in the second.
The test based on CMI is a popular tool in dependence analysis, in particular for Markov Blanket discovery (see Section 8). It has different names, among which the G 2 test is the most popular (see, e.g., [53]). Additionally, in the literature X 2 denotes the second-order approximation of CMI, which turns out to be the conditional chi-square test statistic and has the same asymptotic distribution as Î(Y; X|X S ).
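A sketch of the resulting G 2 test (our own code, assuming SciPy for the chi-square tail probability): the statistic 2n Î(Y; X|X S ) is referred to the chi-square distribution with (I − 1)(J − 1)K degrees of freedom:

```python
import numpy as np
from scipy.stats import chi2

def cmi_plugin(x, y, z):
    """Plug-in estimate of I(X; Y | Z) via empirical entropies (natural log)."""
    def H(*cols):
        _, c = np.unique(np.stack(cols, axis=1), axis=0, return_counts=True)
        p = c / c.sum()
        return float(-np.sum(p * np.log(p)))
    return H(x, z) + H(y, z) - H(x, y, z) - H(z)

def g2_test(x, y, z):
    """G^2 test of H0: X independent of Y given Z for discrete data."""
    stat = 2 * len(x) * cmi_plugin(x, y, z)
    dof = (len(set(x)) - 1) * (len(set(y)) - 1) * len(set(z))
    return stat, chi2.sf(stat, dof)   # statistic and asymptotic p-value

rng = np.random.default_rng(3)
z = rng.integers(0, 2, 3000)
x = (z + (rng.random(3000) < 0.3).astype(int)) % 2      # X depends on Z
y_ind = (z + (rng.random(3000) < 0.3).astype(int)) % 2  # H0 true: X indep Y | Z
y_dep = (x + (rng.random(3000) < 0.1).astype(int)) % 2  # H0 false: Y depends on X

_, p_ind = g2_test(x, y_ind, z)   # approximately uniform under H0
_, p_dep = g2_test(x, y_dep, z)
assert p_dep < 1e-3               # strong conditional dependence is detected
```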

Asymptotic Distribution of Modified Criteria
Asymptotic distribution of the modified criteria related to CMI can be also derived.
Let Ĵ β,γ (X, Y|X S ) be a plug-in version of J β,γ defined in (20). Moreover, let p = (p(x, y, x S )) be a vector of probabilities of dimension M = I × J × K, p̂ the corresponding vector of fractions, and f : [0, 1] M → R a function which represents J β,γ as a function of p, i.e., J β,γ = f (p). For example, for the J MI criterion (see (22)) the corresponding function is defined as

f (p) = (1/|S|) ∑ s∈S ∑ x,y,x s p(x, y, x s ) log [ p(x, y|x s )/(p(x|x s )p(y|x s )) ],

where x s denotes the coordinate of x S corresponding to s and p(x, y, x s ) the corresponding marginal of p. We state here the result on the asymptotic distribution of Ĵ β,γ (X, Y|X S ) proved in [29]. Let Σ x′,y′,x′ S x,y,x S denote the element of matrix Σ with row index (x, y, x S ) and column index (x′, y′, x′ S ).

Theorem 5.
(i) If σ 2 J > 0, then √n( Ĵ β,γ − J β,γ ) is asymptotically normal N(0, σ 2 J ), where σ 2 J = D f (p) T ΣD f (p) = nVar(D f (p) T p̂) and Σ = nΣ p̂ , with Σ p̂ the covariance matrix of p̂ consisting of elements Σ p̂ x′,y′,x′ S x,y,x S = p(x, y, x S )(I(x = x′, y = y′, x S = x′ S ) − p(x′, y′, x′ S ))/n.
(ii) If σ 2 J = 0, then n( Ĵ β,γ − J β,γ ) converges in distribution to V T HV/2, where V follows the N(0, Σ) distribution, and H = D 2 f (p) is the Hessian of f .
In particular, we have the following result for J MI [54]: if I(Y; X|X i ) = 0 for all i ∈ S, then 2n Ĵ MI(X) converges in distribution to ∑ i λ i Z 2 i , where V and H are defined in Theorem 5, Z i are iid N(0, 1) variables and λ i are the eigenvalues of the matrix W = ΣH. It follows from Theorem 5 that if for every i ∈ S we have I(Y; X|X i ) = 0, i.e., Y and X are conditionally independent given each X i , then the asymptotic distribution is that of the quadratic form specified in (ii); otherwise the distribution is normal.
The main advantage of using modified criteria instead of CMI is that their estimation does not require as large samples as CMI itself. Note that for modifications of order k, conditioning involves k-dimensional strata, whereas for CMI |S|-dimensional strata are considered. However, the modified criteria considered in Section 4.1 suffer from the fact that under the hypothesis of conditional independence X ⊥ ⊥ Y|X S the asymptotic distribution of the empirical criterion is not uniquely determined and has to be estimated from the sample. As in the case of J MI, the asymptotic distribution can be either normal (when Y and X are conditionally dependent given X j for at least one j ∈ S) or coincide with the distribution of the quadratic form. Which type of asymptotic distribution is valid can be decided using the resampling schemes shortly discussed in Section 7. The chosen distribution can be used as a benchmark to test the conditional independence hypothesis (see, e.g., [29,55]). Alternatively, testing of H 0 : I(Y; X|X S ) = 0 can be replaced by testing H̃ 0 : I(Y; X|X i ) = 0 for any i ∈ S at each stage of the forward feature selection procedure, and the benchmark distribution can be obtained by approximating the eigenvalues of W specified in (ii) (see [54]) or by using a scaled and shifted chi-squared distribution α + βχ 2 d , where α, β, d are estimated from the sample [56]. Note that, although H 0 and H̃ 0 are not equivalent, in the case of faithful distributions (see Section 9.2 for the definition of faithfulness) H̃ 0 implies H 0 , as conditional independence is inherited by conditioning supersets in such a case.

Resampling Schemes
When a conditional independence test is based on a statistic whose exact, or even asymptotic, distribution under conditional independence is not known — a very common situation — the usual practice is to use conditional randomisation (CR) or resampling schemes to approximate this distribution and use the resulting approximation as the benchmark distribution, as described in the beginning of Section 6. Analogously as before, comparison of the value of the statistic based on the observed sample with the benchmark distribution is used to calculate the CR or resampling p-value. This can also serve as an alternative way of assessing the distribution of the statistic even when its approximate distribution is known. We briefly discuss these procedures, first for a discrete X_S, then for the continuous case.

•
Conditional Randomisation (CR). In the CR approach we assume that the probability mass function (or density) p(x|X_S = x_S) of X given X_S = x_S is known. Then, given a sample (X, Y, X_S), a CR sample is obtained when X*_i are independently drawn from the p.m.f. or density p(x|X_S = x_{S,i}) for i = 1, . . . , n. Let T(X, Y, X_S) be a test statistic which we would like to use for testing CI, and consider M generated CR samples. The CR p-value p_CR is defined as the fraction, among the M + 1 values of the statistic computed on the CR samples and the original sample, of those at least as large as the observed value. It follows that p_CR is a valid p-value; i.e., conditionally on (Y, X_S), its distribution under CI is uniform on the set {1/(M + 1), 2/(M + 1), . . . , 1} and, thus, P(p_CR ≤ α) ≤ α. This follows from the following simple fact: suppose that CI holds and X* is such that P_{X*|X_S} = P_{X|X_S}; then, given (Y, X_S), the samples (X, Y, X_S) and (X*, Y, X_S) are equal in distribution (see, e.g., [57]). For recent improvements of the CR method, in which permutations are chosen appropriately rather than at random, see [58].
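The CR p-value computation above can be sketched as follows. This is a minimal illustration, not the implementation from the cited works; the names `T` and `sample_cond` are placeholders for a user-supplied statistic and the (assumed known) conditional sampler:

```python
import numpy as np

def cr_pvalue(T, X, Y, XS, sample_cond, M=199, rng=None):
    """CR p-value for testing X indep. of Y given X_S, assuming the
    conditional law of X given X_S is known and available as a sampler.

    T           -- statistic T(X, Y, XS); large values indicate dependence
    sample_cond -- sample_cond(XS, rng): one fresh draw X*_i from
                   p(x | X_S = X_S,i) for each row i of XS
    """
    rng = np.random.default_rng(rng)
    t_obs = T(X, Y, XS)
    exceed = sum(T(sample_cond(XS, rng), Y, XS) >= t_obs for _ in range(M))
    # (1 + #{T* >= t_obs}) / (M + 1): uniform on {1/(M+1), ..., 1} under CI
    return (1 + exceed) / (M + 1)
```

With a statistic sensitive to the X–Y association, strongly dependent data yield the minimal p-value 1/(M + 1), while under CI the p-value is approximately uniform.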
In the case when p(x|X_S = x_S) is unknown, CR sampling is replaced by the Conditional Permutation or Bootstrap X method.
•
Conditional Permutation (CP). The CP method is similar to the CR method and differs in that, for each value x_S taken by elements of X_S, we consider the stratum P_i of the sample corresponding to this value. The CP sample is obtained from the original sample by replacing (X_j, Y_j, X_{S,j}) for j ∈ P_i by (X_{π_i(j)}, Y_j, X_{S,j}), where π_i is a randomly chosen permutation of P_i. Thus, on every stratum X_S = x_S, we randomly permute the values of the corresponding X independently of the values of Y. Once M CP samples are obtained independently in this fashion, we calculate the p-value p_CP based on them analogously as before.
•
Conditional Bootstrap X (CB.X). Instead of permuting the values of X on each stratum, we draw a bootstrap sample from the X observations; that is, we sample them with replacement as many times as the size of the stratum. The remaining steps are as previously described.
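A single CP draw for discrete X_S can be sketched as below; this is a toy illustration (the function name is ours), with the p-value then computed from M such draws exactly as for CR:

```python
import numpy as np

def cp_sample(X, Y, XS, rng=None):
    """One Conditional Permutation pseudo-sample: X is permuted within each
    stratum {j : X_S,j = x_S}; Y and X_S are kept unchanged, so the result
    satisfies X indep. of Y given X_S while preserving the law of X | X_S."""
    rng = np.random.default_rng(rng)
    XS = np.atleast_2d(np.asarray(XS))
    if XS.shape[0] != len(X):          # accept (n,) or (n, |S|) input
        XS = XS.T
    X_perm = np.asarray(X).copy()
    strata = {}
    for j, key in enumerate(map(tuple, XS)):
        strata.setdefault(key, []).append(j)
    for idx in strata.values():
        idx = np.asarray(idx)
        X_perm[idx] = X_perm[rng.permutation(idx)]  # permute X within stratum
    return X_perm, Y, XS
```

Note that within each stratum the multiset of X values is preserved, which is exactly what makes the pseudo-sample mimic the conditional distribution of X given X_S.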
We now discuss resampling for continuous predictors. Feature selection for this case is covered briefly in Section 9. We remark here that the aim, the methodology of the solutions, and the criteria considered are very similar; the main differences stem from the technical problems of adequately estimating information-theoretic measures in the continuous case.
In the case of X_S being continuous, CR methods work as stated; however, the situation is more complicated for the CP and CB.X methods, as these depend on partitioning the sample into strata, which degenerate to single observation points in this case. One possible solution is to sample from some estimate p̂(x|X_S = x_S), e.g., a kernel estimate; however, this requires a large number of observations in each stratum. The following proposal for constructing a pseudo-sample, whose distribution is close to P_{X,Y,X_S} under CI, has been suggested in [59] and consists in using observations with values close to X_S = x_S. More specifically, consider an observation (x_i, y_i, x_{S,i}) and find the k-th nearest observation to x_{S,i} in X_S space, say x_{S,j}, with the corresponding triple (x_j, y_j, x_{S,j}). Then, the observation (x_i, y_i, x_{S,i}) is replaced in the resampling sample by (x_j, y_i, x_{S,i}).
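The nearest-neighbour replacement step can be sketched as follows (brute-force distances for clarity; a k-d tree would be used at scale):

```python
import numpy as np

def knn_pseudo_sample(X, Y, XS, k=1):
    """Pseudo-sample for continuous X_S in the spirit of [59]: each X_i is
    replaced by the X value of the k-th nearest neighbour of x_S,i in X_S
    space (self excluded), while (y_i, x_S,i) are kept unchanged."""
    XS2 = np.asarray(XS, float)
    if XS2.ndim == 1:
        XS2 = XS2[:, None]
    # pairwise Euclidean distances in X_S space
    d = np.linalg.norm(XS2[:, None, :] - XS2[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # exclude the point itself
    j = np.argsort(d, axis=1)[:, k - 1]          # index of k-th neighbour
    return np.asarray(X)[j], Y, XS
```

When points of X_S form close pairs, each observation simply swaps its X value with its neighbour, which leaves the joint law approximately unchanged under CI.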
Another technique for data augmentation, applicable in both discrete and continuous cases, is the knock-off construction, which we now describe. Consider the regression problem involving a vector X = (X_1, . . . , X_p) of features and a response Y. The vector X̃ = (X̃_1, . . . , X̃_p) is the knock-off vector for X in this regression problem if X̃ ⊥⊥ Y | X and, moreover, for any S ⊆ {1, . . . , p} we have the following equality in distribution [57]: (X_1, . . . , X_p, X̃_1, . . . , X̃_p)_{swap(S)} =_d (X_1, . . . , X_p, X̃_1, . . . , X̃_p), where swap(S) means swapping the entries X_j and X̃_j for every j ∈ S. Thus, in the above sense, X̃_1, . . . , X̃_p are interchangeable with X_1, . . . , X_p and, at the same time, carry no additional information about Y.
The construction of knock-offs is complicated in general and, as of now, feasible only in special cases, such as the Gaussian case; it requires knowledge of the distribution P_X. Thus, even more information is needed than in the case of the CR approach. However, the gains are considerable, as for natural classes of statistics measuring the influence of predictors on the response one can compare the performance of X_1, . . . , X_p with that of their knock-offs and, on this basis, decide which ones are not strongly relevant. Intuitively, when the performance of X_j, measured, e.g., by the absolute value of an estimated regression coefficient, is comparable to that of its knock-off, then such a variable should be discarded. This approach yields a bound on the FDR of strongly relevant features (Theorem 3 in [26]). Thus, in the cases discussed in Section 3, when the set of strongly relevant features is exactly equal to MB(Y), this approach yields a bound on the FDR of its recovery. We stress that, in this way, one can analyse the properties of the whole procedure of MB determination, and not only those of individual steps of the algorithm. An additional advantage is that, unlike for the resampling schemes above, only one sample of knock-offs needs to be generated.
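For the tractable Gaussian case mentioned above, a minimal sketch of the equicorrelated model-X knockoff construction follows. It assumes zero-mean rows X_i ~ N(0, Σ) with unit variances; the function name and the specific choice of s are ours:

```python
import numpy as np

def gaussian_knockoffs(X, Sigma, rng=None):
    """Equicorrelated model-X knockoffs for rows X_i ~ N(0, Sigma) with unit
    diagonal: draw X~ | X from the Gaussian conditional implied by the joint
    covariance G = [[Sigma, Sigma - D], [Sigma - D, Sigma]] with D = s*I."""
    rng = np.random.default_rng(rng)
    p = Sigma.shape[0]
    s = min(1.0, 2.0 * np.linalg.eigvalsh(Sigma).min()) * 0.99  # keeps G PSD
    D = s * np.eye(p)
    Sinv_D = np.linalg.solve(Sigma, D)
    cond_mean = X @ (np.eye(p) - Sinv_D)        # E[X~ | X]
    cond_cov = 2.0 * D - D @ Sinv_D             # Cov[X~ | X]
    L = np.linalg.cholesky(cond_cov)
    return cond_mean + rng.standard_normal(X.shape) @ L.T
```

By construction, Cov(X̃) = Σ and Cov(X, X̃) = Σ − D, which gives the swap-exchangeability required by the definition; the knockoffs are generated without looking at Y, which ensures X̃ ⊥⊥ Y | X.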

Markov Blanket Discovery Algorithms
We now discuss Markov Blanket discovery algorithms. Their aim is to solve the feature selection problem posed in Section 3 (see Equation (11)), applying CMI or the CMI approximations introduced in Section 4, for which theoretical guarantees are discussed in Section 5. Such algorithms use conditional independence tests as building blocks, based either on the asymptotic distributions of test statistics discussed in Section 6 or on approximations of their distributions based on the resampling covered in Section 7.

•
The GS (Grow and Shrink) algorithm [65]. It consists of two phases. A specific ordering of the variables in F is considered; e.g., variables may be ordered according to the value of I(Y; X), so that the variable with the largest value of mutual information is chosen first. Denoting by S the current set of chosen variables, we pick the first variable (in the considered ordering) which depends on Y given X_S (I(Y; X|X_S) > 0), add it to S, and repeat the step. When there is no longer such a variable among the candidates, the first phase terminates; we call the resulting chosen set S*. In the second phase, we remove from S*, again using the considered ordering, any variable X_j which is not strongly relevant with respect to the current S*, i.e., such that X_j ⊥⊥ Y|X_{S*\{j}} (I(Y; X_j|X_{S*\{j}}) = 0), and we let S* := S* \ {j}.
•
The IAMB (Incremental Association Markov Blanket) algorithm [66]. The algorithm is similar to GS, with one important difference in the first step. Namely, it disregards the initial ordering, and S is augmented by the most plausible candidate, that is, the variable realising max_{i∈S^c} I(X_i; Y|X_S), provided it is not conditionally independent of Y given X_S. It is proved in [27] that the GS(m) algorithm yields the m-Markov Blanket (see Section 3 for the definition of the m-Markov Blanket) of Y. Thus, for m = p, the output of GS(p) is MB(Y). However, since in the growing phase all subsets of F \ S of size not exceeding m have to be checked, which is computationally intensive, in practice only k subsets, for k large enough, are checked at every step.
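The grow/shrink scheme above can be sketched for discrete data with a plug-in CMI estimate. This is a toy illustration, not the algorithm of [65] verbatim: since the empirical CMI is never exactly zero, a small threshold `eps` (our assumption) stands in for the exact I = 0 checks, which in practice would be the tests of Section 6:

```python
import numpy as np
from collections import Counter

def cmi(y, x, z_cols):
    """Plug-in estimate of I(Y; X | Z) for discrete 1-D arrays;
    z_cols is a (possibly empty) list of conditioning columns."""
    n = len(y)
    z = list(zip(*z_cols)) if z_cols else [()] * n
    def H(*cols):
        p = np.array(list(Counter(zip(*cols)).values())) / n
        return -np.sum(p * np.log(p))
    # I(Y; X | Z) = H(Y, Z) + H(X, Z) - H(X, Y, Z) - H(Z)
    return H(y, z) + H(x, z) - H(x, y, z) - H(z)

def grow_shrink(y, X_cols, eps=0.02):
    """GS sketch: forward adds the first (in MI order) variable with
    estimated CMI above eps; backward removes variables whose CMI given
    the rest of the chosen set falls at or below eps."""
    order = sorted(range(len(X_cols)), key=lambda i: -cmi(y, X_cols[i], []))
    S, changed = [], True
    while changed:                                   # grow phase
        changed = False
        for i in order:
            if i not in S and cmi(y, X_cols[i], [X_cols[j] for j in S]) > eps:
                S.append(i); changed = True; break
    for i in list(S):                                # shrink phase
        if cmi(y, X_cols[i], [X_cols[j] for j in S if j != i]) <= eps:
            S.remove(i)
    return S
```

On data where Y is a function of one active binary feature and a second feature is pure noise, the sketch recovers the active one and discards the noise.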
Note that such results assume that conditions of the form Y ⊥⊥ X_{T\S}|X_S or Y ⊥⊥ X|X_S can be verified exactly. In practice, it is impossible to check the conditional independence of X and Y given the set of variables already chosen without error, and this check has to be replaced by an appropriate test discussed in Section 6. Obviously, as such a test has to be performed at each step, one does not control the probability of including false discoveries. False discoveries are dealt with in the second phase, but no formal results exist concerning how likely the recovery of MB(Y) is for the practical algorithm. However, see [67] for results concerning the PC algorithm in the Gaussian case and the knock-off construction discussed in Section 7.
Another possibility is to omit phase two and stop phase one early, applying stopping rules devised for multiple testing problems to control the Family-Wise Error Rate or the False Discovery Rate (see, e.g., [68], where a stopping rule using the Holm procedure has been applied to the CIFE criterion). These methods work promisingly in practice; however, due to the fact that the set S augmented at each step is data-dependent, formal results on the consistency of such procedures are not available in this case either.
There is no definitive study of the empirical performance of the presented criteria and/or selection procedures; note that any criterion considered can be used with any feature selection method described above, thus creating a large number of filter methods. Moreover, these can be evaluated using their classification performance as well as, for synthetic datasets, according to their ability to choose active predictors. Thus, we discuss only reports on the simplest greedy procedures consisting of the application of the discussed criteria on synthetic datasets with the number of chosen features fixed in advance. The authors of [24] discuss, among other things, results for datasets coming from the NIPS Feature Selection Data Challenge, Gisette and Madelon (procedures stopped at 200 variables in the first case and 20 in the second). Two interesting points arise from the study: the strong performance of JMI, which was the second best or co-winner in those cases with respect to balanced accuracy (BA) and the number of features chosen, together with the strong performance of CMIM (the winner in the case of Gisette and, together with CIFE, the co-winner in the case of Madelon). The overall strong performance of CIFE was also confirmed in [68,69]. The second important observation is the failure of CMI which, due to scarcity of data, stopped detecting conditionally dependent variables very early. This is due to the fact that for a large conditioning set the test becomes very conservative (see also [70] for a discussion of this property).

Case of Continuous Distribution P X,Y,Z
In the following, we briefly discuss conditional independence testing for continuous distributions. The main motivation to include the continuous case in this review, devoted mainly to the discrete case, is to underline the strong similarities between the two settings. In particular, we discuss below that all information-theoretic tools defined for the discrete case have their analogues for continuous distributions. Moreover, we note that the selection methods presented so far do not depend on the continuity of the underlying measure, once a specific test for the conditional independence X ⊥⊥ Y|Z which takes this into account is used. Thus, the differences between the two cases are mostly due to the fact that the estimation of information-theoretic measures is much more difficult for continuous distributions. Moreover, their asymptotic behaviour under CI is not known, which makes the construction of corresponding tests difficult. On the other hand, for the Gaussian case, significant simplifications exist due to the closed-form formula for CMI discussed in Section 9.2.

General Considerations
Returning to the notation of Section 2, we discuss conditional independence testing in the continuous case. In this context, it is enlightening to mention the result in [71], where it is shown that, in the continuous case, conditional independence testing is a hard problem in the following sense: any test which, uniformly over a natural family of distributions P_{X,Y,Z} satisfying X ⊥⊥ Y|Z defined in [71], achieves type I error asymptotically no larger than a prescribed significance level α, also has power no larger than α against any alternative. Thus, such a test is useless for discriminating between CI and conditional dependence.
The indices defined in Section 2 have natural analogues in the continuous case; namely, the probabilities P(X = x, Y = y, Z = z) are replaced by density values p(x, y, z) and summations by integrals (for details, see [15]). What is important for the development here is that conditional independence X ⊥⊥ Y|Z in the continuous case is also equivalent to the conditional mutual information I(X; Y|Z) being 0, as in the discrete case. Thus, the problem boils down to accurate estimation of I(X; Y|Z) and the corresponding testing of CI using the constructed estimator as a test statistic. This, however, is a much harder task than in the discrete case, as plug-in estimators of entropy and of conditional and unconditional mutual information, obtained when, e.g., kernel estimators of densities (see, e.g., [6], Chapter 6) are plugged in, are unstable and work satisfactorily only when very large samples are available. Much better results are obtained using the variational approach described in Section 4.2. Additionally, some estimators based on the nearest neighbour idea, such as the Kozachenko-Leonenko estimator of entropy [72,73] and its refinements, such as the Kraskov et al. estimator [74], work better than the straightforward plug-in estimator. Estimators of MI are then constructed using (1). Additionally, the simple approach based on discretising the underlying variables is frequently used. There are two problems related to this approach. The first one is the loss of information due to this operation. The second, related one, is the difficulty of choosing an appropriate bin size, especially in many dimensions. Too large a bin size may hide interesting characteristics of the predictors' distributions and, consequently, cause a loss of predictive power.
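The Kozachenko-Leonenko k-NN entropy estimate mentioned above can be sketched compactly (brute-force O(n²) distances for clarity; tree-based neighbour search would be used at scale):

```python
import numpy as np
from math import gamma, log, pi

EULER = 0.5772156649015329

def psi_int(m):
    """Digamma function at a positive integer: psi(m) = -gamma + sum 1/j."""
    return -EULER + sum(1.0 / j for j in range(1, m))

def kl_entropy(X, k=3):
    """Kozachenko-Leonenko k-NN estimate of differential entropy (nats):
    H ~= psi(n) - psi(k) + log V_d + (d/n) * sum_i log r_i(k),
    where r_i(k) is the distance from X_i to its k-th nearest neighbour
    and V_d is the volume of the d-dimensional unit ball."""
    X = np.asarray(X, float)
    if X.ndim == 1:
        X = X[:, None]
    n, d = X.shape
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)               # exclude the point itself
    r_k = np.sort(dist, axis=1)[:, k - 1]        # distance to k-th neighbour
    log_Vd = (d / 2.0) * log(pi) - log(gamma(d / 2.0 + 1.0))
    return psi_int(n) - psi_int(k) + log_Vd + d * np.mean(np.log(r_k))
```

For a standard normal sample, the estimate is close to the true entropy (1/2) log(2πe) ≈ 1.419 nats already for moderate sample sizes.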
We mention two other approaches for the continuous case. The first, frequently used one is the kernel-based approach, which relies on the following fact [75]: although conditional independence is not equivalent to the conditional covariance being zero, the equivalence does hold when transformations of the vectors (X, Z) and (Y, Z) are allowed. In view of this result, it is possible to find function spaces rich enough that the generalised conditional covariance of the transformed X and Y being equal to 0 is equivalent to conditional independence; moreover, appropriately defined Reproducing Kernel Hilbert Spaces (RKHS) can serve as such function spaces. Conceptually, this is much more involved than checking that I(X; Y|Z) is 0, even before the technical problems of testing this condition in the continuous case are considered.
We also mention in this context a second approach, called the distillation method [76], which relies on ideas similar to the construction of partial residual plots and on resampling.

Gaussian Case
We review the case when (Y, X) is Gaussian, where X = (X_1, . . . , X_p). This case offers significant simplifications, as the quantities on which feature selection is based can be calculated explicitly; direct calculation expresses the mutual information in terms of determinants |Σ| of the relevant covariance matrices [15]. In particular, for the bivariate normal case with ρ(X_1, Y) = ρ, we have I(Y; X_1) = −log(1 − ρ²)/2. Whence, for the conditional distribution of (Y, X_1) given X_S, where S ⊆ {2, . . . , p}, this yields I(Y; X_1|X_S) = −log(1 − ρ²_par)/2, where ρ_par is the partial correlation coefficient between Y and X_1 given X_S. Thus, in the Gaussian case, information-based dependence indices can be expressed in terms of either correlation or partial correlation coefficients. In order to present the PC algorithm, we first introduce the concept of faithfulness of a distribution, which plays an important role in its construction (see, e.g., [67]). Let X_0 := Y and consider an undirected graph G = (V, E), where the vertices V = {0, 1, . . . , p} correspond to the variables (Y, X_1, . . . , X_p) and E ⊂ V × V, such that (j, k) ∈ E ⇐⇒ (k, j) ∈ E, is the set of edges. Definition 8. The distribution P_{Y,X} is faithful to the graph G when, for any triple A, B, C ⊂ V, we have: A and B are separated by C ⇐⇒ X_A ⊥⊥ X_B|X_C, where separation by C means that every path joining elements of A and B has to pass through C.
(The implication from the LHS to the RHS in (33) is called the global Markov property.) In the case when (Y, X_1, . . . , X_p) is Gaussian and there exists a faithful representation of its distribution, the PC algorithm [77] reconstructs its dependence structure with probabilistic guarantees. The main tool is a clever usage of one of the consequences of faithfulness. The algorithm starts from the fully connected graph and removes edges between any vertices which are marginally uncorrelated (and thus, in view of (34) and Gaussianity, conditionally independent given any subset of variables). Then, it proceeds to remove edges between vertices corresponding to variables which are conditionally independent given a subset of their neighbours of size l, where l = 1, 2, . . . is increased stepwise. Conditional independence is tested using the empirical partial correlation coefficient ρ̂_par and the property that its Fisher transform (1/2) log[(1 + ρ̂_par)/(1 − ρ̂_par)], suitably scaled by the square root of the effective sample size, is approximately N(0, 1) under CI (details in [67], Section 13.7). The case when Y is discrete and the P_{X|Y=y_i} are normal is more complicated, as in this scenario the distribution of X is a normal mixture, for which no explicit formulae for the entropy exist (see, however, [34] for special cases of mixtures of summands having the same covariance matrix). The authors of [78] use approximations of MI (see (18) in [78]) to address this problem.
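The CI test used inside the PC algorithm can be sketched as follows. This is a minimal residual-based version (the function name is ours) using the scaled Fisher transform with the conventional n − |S| − 3 factor:

```python
import numpy as np
from math import erf, log, sqrt

def partial_corr_test(x, y, Z):
    """Test X indep. of Y given Z for jointly Gaussian data: compute the
    partial correlation via least-squares residuals, then use the fact that
    sqrt(n - |S| - 3) * (1/2) log((1 + r)/(1 - r)) ~ N(0, 1) under CI."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    Z = np.asarray(Z, float)
    if Z.ndim == 1:
        Z = Z[:, None]
    n = len(x)
    A = np.column_stack([np.ones(n), Z])         # regress out Z (with intercept)
    rx = x - A @ np.linalg.lstsq(A, x, rcond=None)[0]
    ry = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]
    r = rx @ ry / sqrt((rx @ rx) * (ry @ ry))    # partial correlation
    stat = sqrt(n - Z.shape[1] - 3) * 0.5 * log((1 + r) / (1 - r))
    pval = 2.0 * (1.0 - 0.5 * (1.0 + erf(abs(stat) / sqrt(2.0))))
    return r, pval
```

With conditionally dependent data, the partial correlation is large and the p-value vanishes; under CI, the partial correlation fluctuates around 0 at the O(n^{-1/2}) scale.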
To summarise, this section discusses possible approaches to testing conditional independence and discovering the Markov Blanket of the target Y when the underlying distribution is continuous. The first task can be performed using some of the available estimators of CMI discussed above, such as the Kraskov et al. estimator [74], in conjunction with a benchmark distribution based on resampling. Another possibility is the kernel method, which reduces the problem to checking whether a generalised conditional covariance is 0. For the parametric case when (X, Y) is Gaussian, the testing problem reduces to testing whether a partial correlation coefficient is 0. Moreover, in this case, under the faithfulness assumption on the distribution, the Markov Blanket can be recovered with theoretical guarantees.

Interaction Detection
Detection of interactions between features in influencing the outcome Y is an important task of data analysis, primary examples being gene-gene interactions in Genome-Wide Association Studies (GWAS) and gene-environment interactions. In this section, we show how the information-theoretic measures introduced before can be applied to solve this problem. Moreover, we indicate that interaction detection problems may be viewed as a special case of feature selection. Various indices have been proposed to measure the strength of interactions, one of the most popular tools being the interaction information (II) discussed in Sections 2 and 4.1. There, it was considered as a tool to define new selection criteria by truncating and possibly weighing summands in the Möbius expansion of CMI.
Here, II is adopted as the main tool for the detection of interactions. Its most popular competitor is the value of the Likelihood Ratio Test (LRT) statistic for two nested logistic models: an additive model with no interactions and a saturated model taking all possible interactions into account. However, it has been shown in [68] that, for the logistic model, when the predictors are independent and at least one of the interaction terms is non-zero, then II ≠ 0, but not vice versa. This shows that there are situations when interactions are present and are detected by II but would go undetected using the logistic model approach. II is applied as the main tool to find interactions in the AMBIENCE package [79], and the BOOST package uses the so-called Kirkwood approximation discussed below, which is closely related to II [80]. Other competitors to LRT and II include Multifactor Dimensionality Reduction (MDR) [81] and the Restricted Partitioning Method (RPM) [82]. We will now discuss two additional properties of Interaction Information which are useful in interaction analysis, focusing on its three-way variant applied to predict the synergistic effect of two variables, X_1 and X_2 say, in determining the target Y.
From the first equality in (5) it follows that, when both variables are jointly independent of Y, i.e., (X_1, X_2) ⊥⊥ Y, then II(X_1; X_2; Y) = 0, as I(X_1; X_2|Y) = I(X_1; X_2). Although the converse is not true, counterexamples can be shown only in adversarially constructed cases and, thus, testing H_0: II = 0 is usually replaced by the (more stringent) hypothesis H̃_0: (X_1, X_2) ⊥⊥ Y. The other property is an insightful representation of II in terms of the Kirkwood Superposition Approximation P̃_K, a mass function supported on the values (x_1, x_2, y) of (X_1, X_2, Y), whose masses are positive but do not necessarily sum to 1. It can be shown that, for η denoting the total mass of P̃_K, if η ≤ 1 and II = 0, then P_{X_1,X_2,Y} is equal to its Kirkwood approximation [68]. Let us discuss two commonly used methods of testing that II(X_1; X_2; Y) = 0. The most popular one is based on Han's approximation [20], derived under the stringent assumption of overall independence, which results in taking as the benchmark distribution the chi-squared distribution with (I − 1)(J − 1)(K − 1) degrees of freedom, where I, J, K are the numbers of values of X_1, X_2, and Y, respectively. This, however, may lead to a large number of false signals when the pertaining test is employed, as overall independence is only a very special situation in which II vanishes. It is shown in [69] that, when (X_1, X_2) is independent of Y, similarly to other measures of conditional dependence discussed in Section 4.1, the approximate distribution is a weighted chi-squared distribution. Its complete description has been given for X_1 and X_2 having three values and Y being binary, which includes the typical GWAS situation of two loci with two co-dominant alleles. The properties of the corresponding test have not been investigated yet.
The other method permutes the Y values of the considered sample, in this way obtaining M random samples satisfying H̃_0: (X_1, X_2) ⊥⊥ Y. Then, the permutation p-value can be calculated as (M + 1)^{−1}(1 + ∑_{i=1}^{M} I(ÎI_i ≥ ÎI)) (compare (32)). One can also use a chi-squared distribution with an estimated number of degrees of freedom as the benchmark distribution, analogously to [70]. The advantage of the latter approach is that the number of permuted samples M may be much smaller than in the case of the permutation test.
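Both routes start from a plug-in estimate of II; the permutation variant above can be sketched as follows (a toy illustration, with our function names):

```python
import numpy as np
from collections import Counter

def _H(*cols):
    """Plug-in Shannon entropy of the joint distribution of the columns."""
    n = len(cols[0])
    p = np.array(list(Counter(zip(*cols)).values())) / n
    return -np.sum(p * np.log(p))

def interaction_info(x1, x2, y):
    """Plug-in II(X1; X2; Y) = I(X1; X2 | Y) - I(X1; X2), written as the
    signed entropy sum of the Mobius expansion."""
    return (-_H(x1) - _H(x2) - _H(y)
            + _H(x1, x2) + _H(x1, y) + _H(x2, y) - _H(x1, x2, y))

def ii_perm_pvalue(x1, x2, y, M=199, rng=None):
    """Permutation p-value for H~0: (X1, X2) indep. of Y (Y is permuted),
    following (M + 1)^{-1} (1 + sum_i I(II_i >= II))."""
    rng = np.random.default_rng(rng)
    ii_obs = interaction_info(x1, x2, y)
    count = sum(interaction_info(x1, x2, rng.permutation(y)) >= ii_obs
                for _ in range(M))
    return (1 + count) / (M + 1)
```

A purely synergistic mechanism such as Y = X_1 XOR X_2 with independent inputs has no main effects, yet II ≈ log 2, and the permutation test detects it.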
The main challenge in detecting significant interactions among the potential predictors F = {1, . . . , p} is usually the computational burden of the procedure. Indeed, any of the methods discussed should in principle include an interaction term for every pair of predictors X_i, X_j, i, j ∈ F. This requires enlarging the set F by |F| × (|F| − 1)/2 hot-encoded interactions, i.e., the fitted model will contain |F| × (|F| + 1)/2 nominal predictors. This is practically infeasible; e.g., for 100 Single Nucleotide Polymorphisms (SNPs) as predictors, this would yield a number of needed tests of order 10^6 at each step of the selection procedure. A frequently adopted approach is to screen the set of variables first and then apply a more sophisticated procedure to the chosen variables only.
Consider two other possible solutions. The first one uses the premise that significant interactions may arise only among variables which are themselves significant. Thus, any of the methods described in Section 8 can be used for the predictors in F, and then the two methods described above are applicable to all interactions of the chosen features, using the Bonferroni adjustment for multiple testing or the Holm method [83]. The problematic aspect of this approach is the exclusion of possible interactions between two features when only one of them has a non-negligible main effect.
The other group of methods, of which BOOST, proposed in [80], is a representative example, screens interactions using a modified version of the LRT which consists of replacing the likelihood of the fitted additive model by the likelihood of the normalised Kirkwood approximation P̃_K/η, which can be computed very quickly. The pairs for which the value of the LRT statistic modified in this way exceeds a certain threshold are retained, and an exact LRT is then performed for all remaining pairs. The weak side of this approach is the choice of the threshold τ in the first stage, which needs to be made by a rule of thumb, as the properties of the modified LRT remain unknown.

Conclusions
This paper reviews the main ideas concerning feature selection using information-theoretic tools, which revolve around detecting and measuring the strength of conditional dependence. As estimation of CMI, a natural and powerful tool for this endeavour, encounters problems when the dimensionality of the task is large, a natural way forward is to look for good substitutes for it. Several ways in which such substitutes are constructed have been discussed, and the current knowledge about their properties presented. It has been argued that selection criteria based on truncation of the Möbius decomposition make quite far-reaching compromises on the dependence structures of feature sets, whereas the properties of approximations based on variational approaches are yet to be established. Additionally, the major Markov Blanket discovery algorithms are constructed under the assumption that conditional independence, or its absence, can be established without error when, in reality, an intrinsic feature of any practical CI test applied at each stage is its fallibility. Therefore, further efforts, both theoretical and experimental, are needed to understand the advantages, drawbacks, and interrelations of the up-to-date developments. The paper presents some recently established tools which can be used for this purpose, such as the asymptotic distributions of CMI-related criteria discussed in Section 6. These are used to construct tests of conditional independence with approximate control of the probability of false signals. The problem of the existence of interactions can be approached similarly using the results in [69].
Another challenging task is to extend the general approaches discussed here, such as the construction of feature selection criteria by truncation of the Möbius expansion or by the variational approach, to multilabel classification, where Y is a multivariate binary vector. Several criteria, such as CIFE and JMI, have already been generalised to this case (cf. [84,85], respectively; see also [86] for an approach based on Han's inequality and [87] for a review), but a general approach, e.g., in the spirit of [24], is still missing. Another important challenge here seems to be the construction of feature selection criteria which efficiently take into account the dependence between the coordinates of the response.

Conflicts of Interest:
The authors declare no conflict of interest.