Simple Stopping Criteria for Information Theoretic Feature Selection

Feature selection aims to select the smallest feature subset that yields the minimum generalization error. In the rich literature on feature selection, information theoretic approaches seek a subset of features such that the mutual information between the selected features and the class labels is maximized. Despite the simplicity of this objective, several open problems remain in its optimization. These include, for example, the automatic determination of the optimal subset size (i.e., the number of features), or equivalently a stopping criterion when a greedy search strategy is adopted. In this paper, we suggest two stopping criteria that simply monitor the conditional mutual information (CMI) among groups of variables. Using the recently developed multivariate matrix-based Rényi's α-entropy functional, which can be estimated directly from data samples, we show that the CMI among groups of variables can be easily computed without any decomposition or approximation, which makes our criteria easy to implement and to integrate seamlessly into any existing information theoretic feature selection method that uses a greedy search strategy.


I. INTRODUCTION
Feature selection finds the smallest feature subset that yields the minimum generalization error [1]. Ever since the pioneering work of Battiti [2], information theoretic feature selection has been extensively investigated in the signal processing and machine learning communities (e.g., [3], [4]). Given a set of F features S = {x_1, x_2, ..., x_F} (each x_i denotes an attribute) and the corresponding class labels y, these methods aim to seek a subset of informative attributes S′ ⊂ S such that the mutual information (MI) between S′ and y, i.e., I(S′; y), is maximized [5].
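Formally, the objective above amounts to the combinatorial search

    S′ = arg max_{T ⊆ S, |T| ≤ k} I(T; y),

where T ranges over candidate subsets and k, the target subset size, is our notation for a quantity that is typically unknown in advance; greedy forward selection approximates this search by adding one feature at a time.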
Despite the simplicity of this objective, there still remain several open problems in information theoretic feature selection. These include, for example, the reliable estimation of I(S′; y) in a high-dimensional space, where S′ denotes an arbitrary subset of S [5], [6]. In fact, S′ may contain both continuous and discrete variables, whereas y is a discrete variable. There is no universal agreement on the definition of MI between a discrete variable and a group of mixed variables, let alone its estimation [7]. Therefore, almost all existing information theoretic feature selection methods estimate I(S′; y) by first discretizing the feature space and then approximating I(S′; y) with low-order MI quantities, in particular the relevancy I(x_i; y), the joint relevancy I({x_i, x_j}; y), the conditional relevancy I(x_i; y|x_j), the redundancy I(x_i; x_j), the conditional redundancy I(x_i; x_j|y), and the synergy [8]. These low-order MI quantities only capture low-order feature dependencies and hence severely limit the performance of existing information theoretic feature selection methods [9]. Interested readers can refer to [5] for a systematic review of 17 popular low-order information theoretic criteria proposed over the last two decades. Apart from MI estimation, another challenging problem is the automatic determination of the optimal size of S′. This is because most information theoretic feature selection methods do not have a stopping criterion [1]; hence, a predefined maximum number of features is required.
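To make the notion of a low-order criterion concrete, many of the 17 criteria surveyed in [5] can be written, in the unified form given there, as a score for a candidate feature x_c built only from pairwise terms:

    J(x_c) = I(x_c; y) − β Σ_{x_j ∈ S′} I(x_c; x_j) + γ Σ_{x_j ∈ S′} I(x_c; x_j | y),

where x_c is our notation for a candidate feature and β, γ are criterion-specific weights taken from the survey [5], not quantities introduced in this letter; for instance, setting γ = 0 with an averaged redundancy term recovers the well-known mRMR-style criterion.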
Regarding the first problem, our recent work [10] suggested that I(S′; y) can be simply estimated using the normalized eigenspectrum of a Hermitian matrix of the projected data in a reproducing kernel Hilbert space (RKHS). In this letter, we extend [10] and illustrate that the novel multivariate matrix-based Rényi's α-entropy functional also enables simple strategies to guide early stopping in the greedy search procedure of information theoretic feature selection methods.

A. Related work
Perhaps the most acknowledged stopping criterion for information theoretic feature selection is that the value of I(S′; y) stops increasing or reaches its maximum [11], [12]. Unfortunately, such an over-optimistic rule cannot be applied in practice. In fact, I(S′; y) is monotonically increasing with the size of S′, i.e., the maximum value of I(S′; y) is exactly I(S; y), in which all the features are incorporated. Given the current subset S′, after adding a new feature x*, by the chain rule of mutual information [13] we have:

    I({S′, x*}; y) = I(S′; y) + I(x*; y|S′),    (1)

that is, the incremental value of MI is exactly the CMI I(x*; y|S′). Since the CMI is non-negative [13] and rarely reduces to zero in practice due to statistical variation and chance agreement between variables [14], we always have I({S′, x*}; y) ≥ I(S′; y). An alternative approach to optimal feature subset selection is to use the concept of the Markov blanket (MB) [15], [16]. Recall that the MB M of a target variable y is the smallest subset of S such that y is conditionally independent of the rest of the variables S − M, i.e., y ⊥ (S − M) | M [1]. From the perspective of information theory, this indicates that the CMI I({S − M}; y|M) is zero. Again, by the chain rule of mutual information, we have:

    I(S; y) = I(M; y) + I({S − M}; y|M).    (2)

As mentioned earlier, I(M; y) is monotonically increasing with the size of M; thus, I({S − M}; y|M) is monotonically decreasing correspondingly, given that I(S; y) is a fixed value. This suggests that the ideal zero-CMI scenario is most likely to occur only when M = S; in other words, a perfect MB of y is perhaps the feature set S itself.
Admittedly, one could argue that we can stop the selection once the increment of I(S′; y), or the decrement of I({S − S′}; y|S′), approaches zero with only a tiny residual. Unfortunately, since there was no reliable estimator of MI and CMI in high-dimensional spaces (before [10]), it is hard to measure or determine how small these residual terms are.
To the best of our knowledge, there are only two methods in the literature that can stop the greedy search. François et al. [11] suggest monitoring the value of I(S′; y) using a permutation test [17]. Specifically, suppose the new feature selected in the current iteration is x_cand; the authors create a random permutation of x_cand (without permuting the corresponding y), denoted x̃_cand. If I({S′, x_cand}; y) is not significantly larger than I({S′, x̃_cand}; y), x_cand can be discarded and the feature selection is stopped. Vinh et al. [14], on the other hand, propose to monitor the increment of I(S′; y) after adding x_cand (i.e., I(x_cand; y|S′)) using the χ² distribution. If I(x_cand; y|S′) is smaller than a threshold obtained from the χ² distribution at a certain significance level, the feature selection is stopped.
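To make the permutation-test idea concrete, the following is a minimal Python sketch, assuming a generic black-box estimator mutual_info(X, y) of I(X; y) is available; the estimator, the number of permutations n_perm, and the significance level alpha are our own illustrative choices and may differ from the exact procedure of [11].

import numpy as np

def permutation_stop(mutual_info, X_selected, x_cand, y, n_perm=100, alpha=0.05, rng=None):
    """Return True if x_cand should be discarded, i.e., the greedy search should stop.

    mutual_info : callable estimating I(features; y) from arrays.
    X_selected  : (n, d) array of already-selected features S'.
    x_cand      : (n,) candidate feature.
    y           : (n,) class labels.
    """
    rng = np.random.default_rng(rng)
    # MI after adding the true candidate feature.
    mi_true = mutual_info(np.column_stack([X_selected, x_cand]), y)
    # MI after adding shuffled (label-uninformative) copies of the candidate.
    mi_perm = np.empty(n_perm)
    for b in range(n_perm):
        x_shuf = rng.permutation(x_cand)
        mi_perm[b] = mutual_info(np.column_stack([X_selected, x_shuf]), y)
    # One-sided permutation p-value: fraction of shuffled MIs reaching mi_true.
    p_value = (1 + np.sum(mi_perm >= mi_true)) / (1 + n_perm)
    return p_value > alpha   # not significantly larger -> discard x_cand and stop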

II. SIMPLE STOPPING CRITERIA FOR INFORMATION THEORETIC FEATURE SELECTION
In this section, we start with a brief introduction to the recently proposed matrix-based Rényi's α-entropy functional [18] and its multivariate extension [10]. Benefiting from this novel definition, two simple stopping criteria are then presented.

A. Matrix-based Rényi's α-entropy functional and its multivariate extension
In information theory, a natural extension of the well-known Shannon's entropy is Rényi's α-order entropy [19]. For a random variable X with probability density function (PDF) f(x) over a finite set X, the α-entropy H_α(X) is defined as:

    H_α(X) = 1/(1 − α) log ∫_X f^α(x) dx.    (3)

Based on this entropy definition, Rényi also proposed a divergence measure (the α-relative entropy) between random variables with PDFs f and g:

    D_α(f ‖ g) = 1/(α − 1) log ∫_X f^α(x) g^{1−α}(x) dx.    (4)

Rényi's entropy and divergence have a long track record of usefulness in information theory and its applications [20]. Unfortunately, the need for accurate PDF estimation impedes their more widespread adoption in data-driven science. To address this problem, [18], [10] suggest similar quantities that resemble quantum Rényi's entropy [21], defined in terms of the normalized eigenspectrum of the Hermitian matrix of the projected data in an RKHS, thus estimating the entropy and the joint entropy among two or more variables directly from data, without PDF estimation. For brevity, we directly give the definitions.

Definition 1. Let κ : X × X → R be a real-valued positive definite kernel that is also infinitely divisible [22]. Given X = {x_1, x_2, ..., x_n} and the Gram matrix K obtained from evaluating κ on all pairs of exemplars, that is, (K)_ij = κ(x_i, x_j), a matrix-based analogue to Rényi's α-entropy for a normalized positive definite (NPD) matrix A of size n × n, such that tr(A) = 1, is given by the following functional:

    S_α(A) = 1/(1 − α) log₂ ( tr(A^α) ) = 1/(1 − α) log₂ ( Σ_{i=1}^n λ_i(A)^α ),    (5)

where A_ij = (1/n) K_ij / sqrt(K_ii K_jj) and λ_i(A) denotes the i-th eigenvalue of A.

Definition 2. Given a collection of n samples {s^i = (x_1^i, x_2^i, ..., x_k^i)}_{i=1}^n, where the superscript i denotes the sample index, each sample contains k (k ≥ 2) measurements obtained from the same realization, and given positive definite kernels κ_1 : X_1 × X_1 → R, ..., κ_k : X_k × X_k → R defined over them, a matrix-based analogue to Rényi's α-order joint entropy among the k variables can be defined as:

    S_α(A_1, A_2, ..., A_k) = S_α( (A_1 ∘ A_2 ∘ ... ∘ A_k) / tr(A_1 ∘ A_2 ∘ ... ∘ A_k) ),    (6)

where (A_l)_ij = κ_l(x_l^i, x_l^j), and ∘ denotes the Hadamard product.
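As an illustration of how these functionals can be computed in practice, here is a minimal NumPy sketch under the definitions above; the Gaussian kernel, its width sigma, the value α = 1.01, and the eigenvalue clipping are illustrative choices on our part (any infinitely divisible kernel works).

import numpy as np
from scipy.spatial.distance import cdist

def npd_gram(X, sigma=1.0):
    """NPD matrix A with tr(A) = 1 built from a Gaussian Gram matrix on the rows of X."""
    K = np.exp(-cdist(X, X, 'sqeuclidean') / (2 * sigma ** 2))
    D = np.sqrt(np.diag(K))
    return K / np.outer(D, D) / K.shape[0]           # A_ij = K_ij / (n * sqrt(K_ii K_jj))

def renyi_entropy(A, alpha=1.01):
    """Matrix-based Renyi alpha-entropy S_alpha(A), cf. Eq. (5)."""
    eigvals = np.linalg.eigvalsh(A)                   # A is real symmetric (Hermitian)
    eigvals = np.clip(eigvals, 0, None)               # guard against tiny negative eigenvalues
    return np.log2(np.sum(eigvals ** alpha)) / (1 - alpha)

def joint_entropy(mats, alpha=1.01):
    """Multivariate joint entropy S_alpha(A_1, ..., A_k), cf. Eq. (6), via Hadamard products."""
    H = mats[0].copy()
    for A in mats[1:]:
        H *= A                                        # elementwise (Hadamard) product
    return renyi_entropy(H / np.trace(H), alpha)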

B. Stopping criteria based on conditional mutual information
Denote by S′ the feature subset selected so far and by S − S′ the remaining features in S. Given the entropy and joint entropy estimators shown in Eqs. (5)-(6), the MI between y and S′ (i.e., I(S′; y)) and the CMI between y and S − S′ conditioned on S′ (i.e., I({S − S′}; y|S′)) can be estimated with Eq. (7) and Eq. (8), respectively:

    I(S′; y) = S_α(A_{S′}) + S_α(A_y) − S_α(A_{S′}, A_y),    (7)

    I({S − S′}; y|S′) = S_α(A_{S′}, A_{S−S′}) + S_α(A_{S′}, A_y) − S_α(A_{S′}) − S_α(A_{S′}, A_{S−S′}, A_y),    (8)

where A_{S′}, A_{S−S′}, and A_y denote the NPD matrices evaluated on S′, S − S′, and y, respectively. As can be seen, the multivariate matrix-based Rényi's α-entropy functional enables simple estimation of both MI and CMI in high-dimensional spaces, regardless of the data characteristics (e.g., continuous or discrete) in each dimension. Benefiting from these elegant expressions, and supposing the new feature selected in the current iteration is x_cand, we present two simple criteria to guide the early stopping of the greedy search. Specifically, we aim to test the "goodness-of-fit" of the MB condition, i.e., whether the selected subset (together with x_cand) already forms an MB of y. Intuitively, if I({S − S′ − x_cand}; y|{S′, x_cand}) approaches zero, the MB condition is approximately satisfied.
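Continuing the sketch from Section II-A, the MI and CMI of Eqs. (7)-(8) can be assembled directly from the entropy building blocks; the helper names below are ours and reuse renyi_entropy and joint_entropy from the previous sketch.

def matrix_mi(A_sel, A_y, alpha=1.01):
    """I(S'; y) as in Eq. (7): H(S') + H(y) - H(S', y)."""
    return (renyi_entropy(A_sel, alpha) + renyi_entropy(A_y, alpha)
            - joint_entropy([A_sel, A_y], alpha))

def matrix_cmi(A_sel, A_rest, A_y, alpha=1.01):
    """I({S - S'}; y | S') as in Eq. (8), written purely with (joint) entropies."""
    return (joint_entropy([A_sel, A_rest], alpha) + joint_entropy([A_sel, A_y], alpha)
            - renyi_entropy(A_sel, alpha) - joint_entropy([A_sel, A_rest, A_y], alpha))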
Criterion I. If I({S − S′ − x_cand}; y|{S′, x_cand}) ≤ ε, where ε is a tiny threshold, then we stop the selection. We term this criterion CMI-heuristic, since ε is a heuristic value.

Criterion II. Motivated by [11], in order to quantify how x_cand affects the MB condition, we create a random permutation of x_cand (without permuting the corresponding y), denoted x̃_cand. If I({S − S′ − x_cand}; y|{S′, x_cand}) is not significantly smaller than I({S − S′ − x_cand}; y|{S′, x̃_cand}), x_cand can be discarded and the feature selection is stopped. We term this criterion CMI-permutation (see Algorithm 1 for more details).
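A minimal sketch of how the two criteria plug into a greedy forward search is given below; it reuses the hypothetical helpers npd_gram, matrix_mi, and matrix_cmi from the previous sketches, the Gaussian kernel on the numeric labels is a simplification, and the permutation count n_perm is again our own illustrative choice (the actual Algorithm 1 may differ in detail).

def greedy_select(X, y, criterion='cmi-heuristic', eps=1e-4, n_perm=100, alpha_sig=0.05, seed=0):
    """Greedy forward selection with CMI-based early stopping (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n, F = X.shape
    A_y = npd_gram(y.reshape(-1, 1))                  # label kernel: a simplification for this sketch
    selected, remaining = [], list(range(F))
    while remaining:
        # Greedy step: pick the candidate that maximizes I({S', x}; y).
        scores = [matrix_mi(npd_gram(X[:, selected + [j]]), A_y) for j in remaining]
        cand = remaining[int(np.argmax(scores))]
        rest = [j for j in remaining if j != cand]
        if rest:  # test the MB condition I({S - S' - x_cand}; y | {S', x_cand})
            cmi = matrix_cmi(npd_gram(X[:, selected + [cand]]), npd_gram(X[:, rest]), A_y)
            if criterion == 'cmi-heuristic' and cmi <= eps:
                selected.append(cand)                 # Criterion I: {S', x_cand} is already an approximate MB
                break
            if criterion == 'cmi-permutation':
                # Criterion II: compare against shuffled (uninformative) copies of x_cand.
                cmi_perm = np.empty(n_perm)
                for b in range(n_perm):
                    x_shuf = rng.permutation(X[:, cand])
                    A_perm = npd_gram(np.column_stack([X[:, selected], x_shuf]))
                    cmi_perm[b] = matrix_cmi(A_perm, npd_gram(X[:, rest]), A_y)
                p_value = (1 + np.sum(cmi_perm <= cmi)) / (1 + n_perm)
                if p_value > alpha_sig:               # not significantly smaller -> discard and stop
                    break
        selected.append(cand)
        remaining.remove(cand)
    return selected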

III. EXPERIMENTS AND DISCUSSIONS
We compare our two criteria with the existing ones [14], [11] on 10 well-known public datasets used in previous feature selection research [5], [14], covering a wide variety of sample-feature ratios and a range of multi-class problems. The detailed properties of these datasets, including the number of features (F), the number of examples (N), and the number of classes (C), are available in [5]. We refer to the criterion in [14] as ∆MI-χ², since it monitors the increment of MI I(S′; y) (i.e., I(x_cand; y|S′)) against a χ² distribution. We refer to the criterion in [11] as MI-permutation, since it uses a permutation test to quantify the impact of x_cand on I(S′; y). Throughout this letter, we select ε = 10⁻⁴ in CMI-heuristic and a significance level of 0.05 in the permutation tests. To provide a fair comparison, instead of using the k-nearest neighbors (KNN) estimator [23], which may yield negative CMI estimates (see the results in [11]), we use the multivariate matrix-based Rényi's α-entropy functional to estimate all MI quantities in MI-permutation. The baseline information theoretic feature selection method used in this letter is from [10], which directly optimizes I(S′; y) in a greedy manner without any decomposition or approximation. An example of the different stopping criteria on the dataset waveform is shown in Fig. 1.
The quantitative results are summarized in Table I. For each criterion, we report the number of selected features and the average classification accuracy across 100 bootstrap runs. In each run, N bootstrap samples are drawn for the training set, while the unselected samples serve as the test set. Following [5], we use a linear support vector machine (SVM) as the baseline classifier. As a reference, we define the "optimal" number of features (an unknown parameter) as the one that yields the maximum bootstrap accuracy, or that first achieves a bootstrap accuracy with no statistically significant difference from the maximum (evaluated by a paired t-test at the 0.05 significance level), and we rank all criteria by the difference between their estimated number of features and this optimal one.
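The evaluation protocol can be summarized by the following hedged sketch; the scikit-learn classifier, the paired t-test call, and the helper names are our own choices for illustration and mirror, rather than reproduce, the exact setup of this letter.

import numpy as np
from scipy import stats
from sklearn.svm import LinearSVC

def bootstrap_accuracy(X, y, feat_idx, n_runs=100, seed=0):
    """Average out-of-bag accuracy of a linear SVM over bootstrap resamples."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    accs = []
    for _ in range(n_runs):
        train = rng.integers(0, n, size=n)            # bootstrap sample (with replacement)
        test = np.setdiff1d(np.arange(n), train)      # out-of-bag samples form the test set
        clf = LinearSVC().fit(X[train][:, feat_idx], y[train])
        accs.append(clf.score(X[test][:, feat_idx], y[test]))
    return np.asarray(accs)

def optimal_size(acc_per_k):
    """"Optimal" subset size: smallest k statistically indistinguishable from the best accuracy."""
    best = max(acc_per_k, key=lambda k: acc_per_k[k].mean())
    for k in sorted(acc_per_k):
        _, p = stats.ttest_rel(acc_per_k[k], acc_per_k[best])
        if k == best or p >= 0.05:                    # paired t-test at the 0.05 level
            return k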
Table I: Comparison of the number of selected features (#F) and the bootstrap classification accuracy (acc.) for CMI-heuristic and CMI-permutation against the other stopping criteria. All criteria are ranked by the difference between their selected number of features and the optimal value. The best two ranks are marked in green and blue, respectively. The average rank across all datasets is reported in the bottom line. The value behind the name of each dataset indicates its total number of features.

As can be seen, ∆MI-χ² is likely to severely underestimate the number of features, which is accompanied by the lowest bootstrap accuracy. One possible reason is that I(x_cand; y|S′) does not precisely follow a χ² distribution when the MB condition is not satisfied. CMI-permutation and MI-permutation always obtain the same ranks. This is because I(S′; y) + I({S − S′}; y|S′) = I(S; y), a fixed value; thus, it is equivalent to monitor the increment of I(S′; y) or the decrement of I({S − S′}; y|S′). On the other hand, it is perhaps surprising to find that CMI-heuristic performs the best on most datasets. This indicates that, although the permutation test is effective for testing the MB condition, ε = 10⁻⁴ is a reliable threshold that can speed up this test, since the permutation test is always time-consuming. Finally, the Wilcoxon rank-sum test at the 0.1 significance level, shown in Table II, corroborates our analysis that our criteria perform on par with MI-permutation but significantly better than ∆MI-χ².

Table II: p-values and decisions (in parentheses) of the Wilcoxon rank-sum test at the 0.1 significance level on the ranks of our criteria against ∆MI-χ² and MI-permutation. A p-value smaller than 0.1 indicates rejection of the null hypothesis that the two criteria perform equally.
                    ∆MI-χ²        MI-permutation
CMI-heuristic       0.0781 (1)    0.5455 (0)
CMI-permutation     0.0561 (1)    0.9036 (0)

IV. CONCLUSIONS

This letter suggests two simple stopping criteria, namely CMI-heuristic and CMI-permutation, for information theoretic feature selection, by monitoring the value of the conditional mutual information (CMI) estimated with the novel multivariate matrix-based Rényi's α-entropy functional. Experiments on 10 benchmark datasets indicate that CMI is a more tractable quantity than MI to guide early stopping in feature selection. Moreover, as an alternative to the permutation test, a tiny threshold is sufficient to test the Markov blanket (MB) condition.

Figure 1: (a) shows the values of the MI I(S′; y) and the CMI I({S − S′}; y|S′) with respect to the number of selected features, i.e., the size of S′. I(S′; y) is monotonically increasing, whereas I({S − S′}; y|S′) is monotonically decreasing. (b) shows the termination points produced by the different stopping criteria, namely CMI-heuristic (black solid line), CMI-permutation (black dashed line), ∆MI-χ² (green solid line), and MI-permutation (blue solid line). The red curve with shaded area indicates the average bootstrap classification accuracy with its 95% confidence interval. In this example, the bootstrap classification accuracy reaches its statistical maximum with 11 features, and CMI-heuristic performs best.