Simple Stopping Criteria for Information Theoretic Feature Selection
Abstract
1. Introduction
Efficient stopping criteria for feature selection are still missing in the literature. We present two rules based on the recently proposed estimators of information theoretic descriptors of Rényi’s $\alpha$-entropy in a reproducing kernel Hilbert space (RKHS). The properties, utilities, and possible applications of these new estimators are still largely unknown to practitioners.
Our criteria are extremely simple and flexible. First, the proposed tests and stopping criteria can be incorporated into any sequential feature selection procedure, such as mutual information-based feature selection (MIFS) [4] and maximum-relevance minimum-redundancy (MRMR) [24]. Second, benefiting from the higher accuracy of the estimation, we demonstrate that one can simply use a threshold in a heuristic manner, which is very valuable for practitioners across different domains.
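To make the threshold heuristic concrete, here is a minimal, hypothetical sketch (not the authors' code) of a greedy forward selector that stops once the best candidate's incremental score, e.g., an estimate of the CMI between the candidate and the labels given the already-selected subset, falls below a fixed threshold. The name `score_fn` is a placeholder for any such estimator.

```python
def forward_select(score_fn, n_features, threshold=1e-3, max_features=None):
    """Greedy forward selection with a heuristic threshold stopping rule.

    score_fn(selected, candidate) returns the incremental score of adding
    `candidate` to the current subset `selected`; selection stops as soon
    as the best remaining candidate scores below `threshold`.
    """
    selected, remaining = [], set(range(n_features))
    limit = max_features if max_features is not None else n_features
    while remaining and len(selected) < limit:
        best, best_score = None, float("-inf")
        for j in sorted(remaining):          # deterministic tie-breaking
            s = score_fn(selected, j)
            if s > best_score:
                best, best_score = j, s
        if best_score < threshold:           # stopping criterion fires
            break
        selected.append(best)
        remaining.remove(best)
    return selected
```

Because the stopping test is applied once per iteration, the rule adds no overhead beyond the scores the selector already computes.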
2. Simple Stopping Criteria for Information Theoretic Feature Selection
2.1. Matrix-Based Rényi’s $\alpha$-Entropy Functional and Its Multivariate Extension
2.2. Stopping Criteria Based on Conditional Mutual Information
Algorithm 1 CMI-permutation
Input: Feature set S; selected feature subset ${S}^{\prime}$; class labels y; feature selected in the current iteration ${x}^{\mathrm{cand}}$; number of permutations P; significance level $\theta$.
Output: Decision (stop selection or continue selection).
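A hedged sketch of the CMI-permutation test, assuming the matrix-based Rényi $\alpha$-entropy functional (Giraldo et al.) with Gaussian Gram matrices and its Hadamard-product multivariate extension (Yu et al.); the function names, the kernel width `sigma`, and the exact acceptance rule below are illustrative choices, not the authors' reference implementation.

```python
import numpy as np

def gram(x, sigma=1.0):
    """Gaussian Gram matrix; x has shape (n_samples,) or (n_samples, d)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def renyi_entropy(K, alpha=1.01):
    """Matrix-based Rényi alpha-entropy (in bits) of a PSD Gram matrix."""
    lam = np.linalg.eigvalsh(K / np.trace(K))
    lam = lam[lam > 1e-12]                    # drop numerical zeros
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

def joint_entropy(Ks, alpha=1.01):
    """Multivariate extension: entropy of the Hadamard product of Gram
    matrices (renormalized by its trace inside renyi_entropy)."""
    H = Ks[0].copy()
    for K in Ks[1:]:
        H = H * K
    return renyi_entropy(H, alpha)

def cmi(Kx, Ky, Ks, alpha=1.01):
    """I(x; y | S) = H(x,S) + H(y,S) - H(x,y,S) - H(S)."""
    return (joint_entropy([Kx, Ks], alpha) + joint_entropy([Ky, Ks], alpha)
            - joint_entropy([Kx, Ky, Ks], alpha) - renyi_entropy(Ks, alpha))

def cmi_permutation(x_cand, y, S, P=100, theta=0.05, alpha=1.01, seed=0):
    """Continue selection only if I(x_cand; y | S) is significantly
    larger than its value under the permutation null."""
    rng = np.random.default_rng(seed)
    Ky, Ks = gram(y), gram(S)
    observed = cmi(gram(x_cand), Ky, Ks, alpha)
    null = [cmi(gram(rng.permutation(np.asarray(x_cand))), Ky, Ks, alpha)
            for _ in range(P)]
    p_value = np.mean([v >= observed for v in null])
    return "Continue" if p_value <= theta else "Stop"
```

Permuting ${x}^{\mathrm{cand}}$ preserves its marginal distribution while destroying its dependence on y, so the P permuted CMI values approximate the null distribution under conditional independence.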

3. Experiments and Discussions
4. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
Abbreviations
MI  Mutual information 
CMI  Conditional mutual information 
RKHS  Reproducing kernel Hilbert space 
MB  Markov blanket 
PDF  Probability density function 
References
1. Vergara, J.R.; Estévez, P.A. A review of feature selection methods based on mutual information. Neural Comput. Appl. 2014, 24, 175–186.
2. Guyon, I.; Elisseeff, A. An introduction to variable and feature selection. J. Mach. Learn. Res. 2003, 3, 1157–1182.
3. Lu, Y.; Fan, Y.; Lv, J.; Noble, W.S. DeepPINK: Reproducible feature selection in deep neural networks. In Proceedings of the 2018 Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 8690–8700.
4. Battiti, R. Using mutual information for selecting features in supervised neural net learning. IEEE Trans. Neural Netw. 1994, 5, 537–550.
5. Meyer, P.E.; Schretter, C.; Bontempi, G. Information-theoretic feature selection in microarray data using variable complementarity. IEEE J. Sel. Top. Signal Process. 2008, 2, 261–274.
6. Gurban, M.; Thiran, J.P. Information theoretic feature extraction for audio-visual speech recognition. IEEE Trans. Signal Process. 2009, 57, 4765–4776.
7. Eriksson, T.; Kim, S.; Kang, H.G.; Lee, C. An information-theoretic perspective on feature selection in speaker recognition. IEEE Signal Process. Lett. 2005, 12, 500–503.
8. Brown, G.; Pocock, A.; Zhao, M.J.; Luján, M. Conditional likelihood maximisation: A unifying framework for information theoretic feature selection. J. Mach. Learn. Res. 2012, 13, 27–66.
9. Fleuret, F. Fast binary feature selection with conditional mutual information. J. Mach. Learn. Res. 2004, 5, 1531–1555.
10. Ross, B.C. Mutual information between discrete and continuous data sets. PLoS ONE 2014, 9, e87357.
11. Singha, S.; Shenoy, P.P. An adaptive heuristic for feature selection based on complementarity. Mach. Learn. 2018, 107, 1–45.
12. Vinh, N.X.; Zhou, S.; Chan, J.; Bailey, J. Can high-order dependencies improve mutual information based feature selection? Pattern Recognit. 2016, 53, 46–58.
13. Yu, S.; Giraldo, L.G.S.; Jenssen, R.; Principe, J.C. Multivariate extension of matrix-based Rényi’s α-order entropy functional. arXiv 2018, arXiv:1808.07912.
14. François, D.; Rossi, F.; Wertz, V.; Verleysen, M. Resampling methods for parameter-free and robust feature selection with mutual information. Neurocomputing 2007, 70, 1276–1288.
15. Gómez-Verdejo, V.; Verleysen, M.; Fleury, J. Information-theoretic feature selection for the classification of hysteresis curves. In International Work-Conference on Artificial Neural Networks; Springer: Berlin/Heidelberg, Germany, 2007; pp. 522–529.
16. Koller, D.; Sahami, M. Toward Optimal Feature Selection; The Stanford University InfoLab: Stanford, CA, USA, 1996.
17. Yaramakala, S.; Margaritis, D. Speculative Markov blanket discovery for optimal feature selection. In Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM’05), Houston, TX, USA, 27–30 November 2005; pp. 809–812.
18. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2012.
19. Vinh, N.X.; Chan, J.; Bailey, J. Reconsidering mutual information based feature selection: A statistical significance view. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence (AAAI’14), Quebec City, QC, Canada, 27–31 July 2014; pp. 2092–2098.
20. Good, P. Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses; Springer: Berlin/Heidelberg, Germany, 2013.
21. Kraskov, A.; Stögbauer, H.; Grassberger, P. Estimating mutual information. Phys. Rev. E 2004, 69, 066138.
22. Campos, L.M. A scoring function for learning Bayesian networks based on mutual information and conditional independence tests. J. Mach. Learn. Res. 2006, 7, 2149–2187.
23. Giraldo, L.G.S.; Rao, M.; Principe, J.C. Measures of entropy from data using infinitely divisible kernels. IEEE Trans. Inf. Theory 2015, 61, 535–548.
24. Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238.
25. Rényi, A. On Measures of Entropy and Information; Hungarian Academy of Sciences: Budapest, Hungary, 1961.
26. Principe, J.C. Information Theoretic Learning: Renyi’s Entropy and Kernel Perspectives; Springer: Berlin/Heidelberg, Germany, 2010.
27. Ohya, M.; Petz, D. Quantum Entropy and Its Use; Springer: Berlin/Heidelberg, Germany, 2004.
28. Bhatia, R. Infinitely divisible matrices. Am. Math. Mon. 2006, 113, 221–235.
29. Müller-Lennert, M.; Dupuis, F.; Szehr, O.; Fehr, S.; Tomamichel, M. On quantum Rényi entropies: A new generalization and some properties. J. Math. Phys. 2013, 54, 122203.
30. Mosonyi, M.; Hiai, F. On the quantum Rényi relative entropies and related capacity formulas. IEEE Trans. Inf. Theory 2011, 57, 2474–2487.
31. Verdú, S. α-mutual information. In Proceedings of the 2015 IEEE Information Theory and Applications Workshop (ITA), San Diego, CA, USA, 1–6 February 2015; pp. 1–6.
32. Chang, C.C.; Lin, C.J. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2011, 2, 1–27.
33. Li, J.; Liu, H. Challenges of feature selection for big data analytics. IEEE Intell. Syst. 2017, 32, 9–15.
34. Martin, C.D.; Porter, M.A. The extraordinary SVD. Am. Math. Mon. 2012, 119, 838–851.
| Dataset (#features) | CMI-heuristic | | | CMI-permutation | | | ΔMI-χ² [19] | | | MI-permutation [14] | | | "Optimal" | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | #F | acc. | rank | #F | acc. | rank | #F | acc. | rank | #F | acc. | rank | #F | acc. |
| waveform (21) | 11 | 84.7 ± 1.8 | 1 | 4 | 78.3 ± 1.8 | 3 | 3 | 76.1 ± 1.8 | 4 | 5 | 80.2 ± 1.8 | 2 | 11 | 84.7 ± 1.8 |
| breast (30) | 2 | 92.3 ± 1.7 | 1 | 2 | 92.3 ± 1.7 | 1 | 2 | 92.3 ± 1.7 | 1 | 2 | 92.3 ± 1.7 | 1 | 2 | 95.2 ± 1.2 |
| heart (13) | 13 | 81.7 ± 3.5 | 4 | 4 | 80.4 ± 3.3 | 1 | 2 | 76.9 ± 3.8 | 3 | 4 | 80.4 ± 3.3 | 1 | 6 | 82.6 ± 3.0 |
| spect (22) | 22 | 80.6 ± 3.6 | 4 | 11 | 82.1 ± 3.3 | 1 | 1 | 80.1 ± 3.3 | 3 | 7 | 81.1 ± 3.3 | 2 | 11 | 82.1 ± 3.3 |
| ionosphere (34) | 15 | 83.3 ± 2.8 | 1 | 7 | 81.8 ± 2.8 | 2 | 1 | 76.7 ± 3.2 | 4 | 7 | 81.8 ± 2.8 | 2 | 33 | 85.3 ± 3.0 |
| parkinsons (22) | 12 | 85.2 ± 3.7 | 1 | 4 | 85.0 ± 3.2 | 2 | 1 | 85.1 ± 3.5 | 4 | 4 | 85.0 ± 3.2 | 2 | 9 | 86.5 ± 3.4 |
| semeion (256) | 59 | 86.1 ± 1.3 | 1 | 20 | 77.7 ± 1.5 | 2 | 4 | 49.6 ± 1.7 | 4 | 20 | 77.7 ± 1.5 | 2 | 73 | 93.3 ± 1.3 |
| Lung (325) | 5 | 74.2 ± 7.7 | 3 | 10 | 73.9 ± 8.0 | 2 | 1 | 46.5 ± 7.5 | 4 | 13 | 79.1 ± 7.9 | 1 | 41 | 84.3 ± 6.5 |
| Lympth (4026) | 6 | 81.3 ± 5.8 | 1 | 248 | 88.7 ± 6.1 | 3 | 2 | 62.8 ± 6.5 | 2 | 249 | 88.9 ± 6.2 | 4 | 70 | 90.7 ± 5.4 |
| Madelon (500) | 3 | 69.5 ± 1.6 | 2 | 2 | 59.5 ± 1.6 | 3 | 4 | 76.7 ± 1.5 | 1 | 2 | 59.5 ± 1.6 | 3 | 4 | 76.7 ± 1.5 |
| average rank | | | 1.9 | | | 2.0 | | | 3.0 | | | 2.0 | | |
| Dataset (#features) | CMI-heuristic | | | CMI-permutation | | | ΔMI-χ² [19] | | | MI-permutation [14] | | | "Optimal" | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | #F | acc. | rank | #F | acc. | rank | #F | acc. | rank | #F | acc. | rank | #F | acc. |
| waveform (21) | 17 | 84.2 ± 1.9 | 1 | 4 | 78.2 ± 2.1 | 2 | 3 | 76.1 ± 2.0 | 4 | 4 | 78.2 ± 2.1 | 2 | 11 | 84.7 ± 1.8 |
| breast (30) | 1 | 90.9 ± 1.5 | 2 | 1 | 90.9 ± 1.5 | 2 | 2 | 95.2 ± 1.2 | 1 | 1 | 90.9 ± 1.5 | 2 | 2 | 95.2 ± 1.2 |
| heart (13) | 6 | 82.6 ± 3.0 | 1 | 1 | 55.0 ± 4.4 | 3 | 2 | 76.9 ± 3.8 | 2 | 1 | 55.0 ± 4.4 | 3 | 6 | 82.6 ± 3.0 |
| spect (22) | 17 | 80.2 ± 4.2 | 1 | 2 | 78.2 ± 3.2 | 3 | 1 | 79.6 ± 3.2 | 4 | 3 | 79.1 ± 3.2 | 2 | 11 | 81.4 ± 4.2 |
| ionosphere (34) | 18 | 84.7 ± 2.9 | 1 | 8 | 81.9 ± 3.5 | 2 | 1 | 76.4 ± 3.9 | 4 | 8 | 81.9 ± 3.5 | 2 | 33 | 85.3 ± 3.0 |
| parkinsons (22) | 13 | 83.7 ± 3.7 | 1 | 4 | 84.6 ± 3.2 | 2 | 1 | 77.0 ± 5.0 | 4 | 3 | 84.6 ± 3.4 | 3 | 9 | 86.0 ± 3.9 |
| semeion (256) | 53 | 85.5 ± 1.1 | 1 | 40 | 82.7 ± 1.4 | 2 | 4 | 49.6 ± 1.7 | 4 | 40 | 82.7 ± 1.4 | 2 | 73 | 93.3 ± 1.3 |
| Lung (325) | 9 | 76.1 ± 7.9 | 2 | 5 | 74.2 ± 7.7 | 3 | 1 | 46.5 ± 7.5 | 4 | 11 | 73.0 ± 7.1 | 1 | 41 | 84.3 ± 6.5 |
| Lympth (4026) | 8 | 86.0 ± 5.3 | 3 | 65 | 89.6 ± 5.5 | 1 | 2 | 62.8 ± 6.5 | 4 | 64 | 89.6 ± 5.5 | 2 | 70 | 90.7 ± 5.4 |
| Madelon (500) | 3 | 69.5 ± 1.6 | 2 | 1 | 52.4 ± 1.7 | 3 | 4 | 76.7 ± 1.5 | 1 | 1 | 52.4 ± 1.7 | 3 | 4 | 76.7 ± 1.5 |
| average rank | | | 1.5 | | | 2.3 | | | 3.2 | | | 2.2 | | |
$\alpha = 1.01$

| | ΔMI-χ² | MI-permutation |
|---|---|---|
| CMI-heuristic | 0.0781 (1) | 0.5455 (0) |
| CMI-permutation | 0.0561 (1) | 0.9036 (0) |

$\alpha = 2$

| | ΔMI-χ² | MI-permutation |
|---|---|---|
| CMI-heuristic | 0.0081 (1) | 0.0341 (1) |
| CMI-permutation | 0.0587 (1) | 0.7340 (0) |
© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Yu, S.; Príncipe, J.C. Simple Stopping Criteria for Information Theoretic Feature Selection. Entropy 2019, 21, 99. https://doi.org/10.3390/e21010099