Entropic Statistics: Concept, Estimation, and Application in Machine Learning and Knowledge Extraction
Abstract
1. Introduction of Entropic Statistics
- The data lie within non-ordinal space.
- The data are a mixture of ordinal and non-ordinal spaces, and the non-ordinal part is expected to carry a non-negligible share of the information.
- The data lie within ordinal space, yet ordinal statistical methods fail to perform as expected.
- Let $\mathcal{X} = \{x_k; k \ge 1\}$ and $\mathcal{Y} = \{y_j; j \ge 1\}$ be two countable alphabets with cardinalities $K_1$ and $K_2$, respectively.
- Let the Cartesian product be $\mathcal{X} \times \mathcal{Y}$ with a joint probability distribution $\mathbf{p}_{X,Y} = \{p_{k,j}\}$.
- Let the two marginal distributions be respectively denoted by $\mathbf{p}_X = \{p_{k,\cdot}\}$ and $\mathbf{p}_Y = \{p_{\cdot,j}\}$, where $p_{k,\cdot} = \sum_j p_{k,j}$ and $p_{\cdot,j} = \sum_k p_{k,j}$; hence $X$ is a variable on $\mathcal{X}$ with distribution $\mathbf{p}_X$ and $Y$ is a variable on $\mathcal{Y}$ with distribution $\mathbf{p}_Y$.
- For uni-variate situations, $K$ stands for the cardinality of $\mathcal{X}$, and $\mathbf{p} = \{p_k; k = 1, \dots, K\}$ stands for $\mathbf{p}_X$.
- Let $\{X_1, \dots, X_n\}$ be an independent and identically distributed (i.i.d.) random sample of size $n$ from $\mathcal{X}$ under $\mathbf{p}$. Let $Y_k = \sum_{i=1}^{n} \mathbb{1}[X_i = x_k]$; hence $Y_k$ is the count of occurrence of letter $x_k$ in the sample. Let $\hat{p}_k = Y_k / n$. $\hat{\mathbf{p}} = \{\hat{p}_k\}$ is called the plug-in estimator of $\mathbf{p}$ (a small numerical sketch follows this list). Similarly, one can construct the plug-in estimators for the marginal and joint distributions and name them $\hat{\mathbf{p}}_X$, $\hat{\mathbf{p}}_Y$, and $\hat{\mathbf{p}}_{X,Y}$, respectively.
- For any two functions f and g taking values in with , the notation means
- For any two functions f and g taking values in , the notation means
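A minimal numerical sketch of the plug-in estimator $\hat{\mathbf{p}}$: the sample below is hypothetical, and the observed counts play the role of the $Y_k$ above.

```r
# Minimal sketch: plug-in estimate of a distribution on a small alphabet.
# The sample below is hypothetical; any categorical vector works.
set.seed(1)
x <- sample(letters[1:5], size = 200, replace = TRUE,
            prob = c(0.4, 0.3, 0.15, 0.1, 0.05))
y_k   <- table(x)            # counts of each observed letter
p_hat <- y_k / length(x)     # plug-in estimator of the distribution
p_hat
```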
2. Classic Entropic Statistics Quantities and Estimation
2.1. Shannon’s Entropy and Mutual Information
2.1.1. Shannon’s Entropy
1. $H$ is a measurement of dispersion. It is always non-negative by definition.
2. $H = 0$ if and only if the probability of a single letter in $\mathcal{X}$ is 1; hence there is no dispersion.
3. For a finite alphabet with cardinality $K$, $H$ is bounded from above by $\log K$, and the maximum is achieved when the distribution is uniform ($p_k = 1/K$ for every $k$); hence maximum dispersion (a numeric sketch follows this list).
4. For a countably infinite alphabet, $H$ may not exist (see Example 4 in Section 3).
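A numeric sketch of properties 1-3, assuming natural logarithms and the definition $H(\mathbf{p}) = -\sum_k p_k \log p_k$: the uniform distribution attains $\log K$, a degenerate distribution gives 0, and any other distribution falls strictly in between.

```r
# Shannon's entropy H(p) = -sum_k p_k * log(p_k); natural log assumed here.
H <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }

K <- 6
H(rep(1/K, K))                          # uniform distribution: attains log(K)
log(K)
H(c(1, rep(0, K - 1)))                  # degenerate distribution: H = 0
H(c(0.5, 0.3, 0.1, 0.05, 0.03, 0.02))   # strictly between 0 and log(K)
```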
Entropy Estimation-The Plug-in Estimator
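Given the counts $Y_k$ and plug-in probabilities $\hat{p}_k = Y_k/n$ defined in Section 1, the plug-in estimator of Shannon's entropy is $\hat{H} = -\sum_k \hat{p}_k \log \hat{p}_k$. The sketch below uses hypothetical counts and natural logarithms; the commented line shows the corresponding call to entropy.plugin from the entropy package [80] (Appendix A), which is assumed to take relative frequencies.

```r
# Plug-in entropy estimator computed from a vector of observed counts.
H_plugin <- function(counts) {
  p_hat <- counts[counts > 0] / sum(counts)
  -sum(p_hat * log(p_hat))
}

counts <- c(48, 31, 12, 6, 3)   # hypothetical counts over K = 5 letters
H_plugin(counts)

# The same value should be obtainable from entropy::entropy.plugin()
# (package [80]); that function is assumed to take relative frequencies.
# entropy::entropy.plugin(counts / sum(counts))
```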
Entropy Estimation-The Miller–Madow and Jackknife Estimators
- for each $i = 1, \dots, n$, construct $\hat{H}_{-i}$, which is a plug-in estimator based on a sub-sample of size $n - 1$ obtained by leaving the $i$th observation out;
- obtain $\hat{H}_{-i}$ for $i = 1, \dots, n$; and then
- compute the jackknife estimator $\hat{H}_{JK} = n\hat{H} - \frac{n-1}{n}\sum_{i=1}^{n}\hat{H}_{-i}$ (a small sketch follows this list).
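A minimal sketch of both estimators named above, assuming the Miller–Madow correction adds $(\hat{K}-1)/(2n)$ to the plug-in estimate, with $\hat{K}$ the number of distinct letters observed; the sample is simulated for illustration only.

```r
# Miller–Madow correction and leave-one-out jackknife for the plug-in
# entropy estimator; a minimal sketch following the steps above.
H_plugin <- function(x) {               # x: raw categorical sample
  p_hat <- table(x) / length(x)
  -sum(p_hat * log(p_hat))
}

set.seed(2)
x <- sample(letters[1:8], 300, replace = TRUE)
n <- length(x)

# Miller–Madow: add (K_hat - 1) / (2n), with K_hat the observed richness.
H_mm <- H_plugin(x) + (length(unique(x)) - 1) / (2 * n)

# Jackknife: recompute the plug-in estimator leaving one observation out.
H_loo <- sapply(seq_len(n), function(i) H_plugin(x[-i]))
H_jk  <- n * H_plugin(x) - (n - 1) / n * sum(H_loo)

c(plugin = H_plugin(x), miller_madow = H_mm, jackknife = H_jk)
```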
Entropy Estimation-The Z-Estimator
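The sketch below only illustrates usage of Entropy.z and Entropy.sd from the EntropyEstimation package [82] listed in Appendix A; both functions are assumed to accept a vector of observed counts, and the counts themselves are hypothetical.

```r
# Usage sketch for the Z-estimator of entropy from the EntropyEstimation
# package [82]; Entropy.z() and Entropy.sd() are assumed to take a vector
# of observed counts, per the package documentation on CRAN.
# install.packages("EntropyEstimation")
library(EntropyEstimation)

counts <- c(120, 75, 40, 22, 9, 4)   # hypothetical counts over six letters
Entropy.z(counts)                    # Z-estimator of Shannon's entropy
Entropy.sd(counts)                   # estimated asymptotic standard deviation
```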
Remarks
2.1.2. Mutual Information
1. MI is a measurement of dependence. It is always non-negative by definition.
2. $MI(X;Y) = 0$ if and only if the two marginals are independent.
3. $MI(X;Y) > 0$ if and only if the two marginals are dependent (a worked sketch of items 2 and 3 follows this list).
4. A non-zero MI does not always indicate the degree (level) of dependence.
5. MI may not exist when the cardinality of the joint space is countably infinite.
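A worked sketch of properties 2 and 3 using the identity $MI(X;Y) = H(X) + H(Y) - H(X,Y)$; the joint probability tables below are hypothetical, and natural logarithms are assumed.

```r
# Mutual information from a joint probability table via the identity
# MI(X, Y) = H(X) + H(Y) - H(X, Y); the joint tables below are hypothetical.
H  <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }
MI <- function(p_xy) H(rowSums(p_xy)) + H(colSums(p_xy)) - H(p_xy)

p_indep <- outer(c(0.6, 0.4), c(0.5, 0.3, 0.2))    # product of marginals
MI(p_indep)                                        # essentially 0: independence

p_dep <- matrix(c(0.30, 0.05, 0.05,
                  0.05, 0.25, 0.30), nrow = 2, byrow = TRUE)
MI(p_dep)                                          # strictly positive: dependence
```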
MI Estimation-The Plug-in Estimator and Z-Estimator
Remarks
2.2. Kullback–Leibler Divergence
1. KL is a measurement of non-metric distance between two distributions on the same alphabet (with the same discrete support). It is always non-negative because of Gibbs’ inequality.
2. $KL(\mathbf{p} \parallel \mathbf{q}) = 0$ if and only if the two underlying distributions are the same; namely, $p_k = q_k$ for each $k$.
3. $KL(\mathbf{p} \parallel \mathbf{q}) > 0$ if and only if the two underlying distributions are different; namely, $p_k \neq q_k$ for some $k$ (a minimal numeric sketch follows this list).
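A minimal numeric sketch of the three properties, assuming natural logarithms and the convention that $KL(\mathbf{p} \parallel \mathbf{q})$ is computed only when $q_k > 0$ wherever $p_k > 0$; the distributions are hypothetical.

```r
# Kullback–Leibler divergence KL(p || q) = sum_k p_k * log(p_k / q_k),
# defined when q_k > 0 wherever p_k > 0; natural log assumed.
KL <- function(p, q) {
  stopifnot(length(p) == length(q), all(q[p > 0] > 0))
  i <- p > 0
  sum(p[i] * log(p[i] / q[i]))
}

p <- c(0.5, 0.3, 0.2)
q <- c(0.4, 0.4, 0.2)
KL(p, q)              # > 0: the two distributions differ
KL(p, p)              # = 0: identical distributions
KL(p, q) + KL(q, p)   # symmetrized KL, as discussed in Section 2.2.2
```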
2.2.1. KL Point Estimation-The Plug-in Estimator, Augmented Estimator, and Z-Estimator
2.2.2. Symmetrized KL and Its Point Estimation
2.2.3. Asymptotic Properties for KL and Symmetrized KL Estimators
3. Recently Developed Entropic Statistics Quantities and Estimation
3.1. Standardized Mutual Information
3.2. Entropic Basis: A Generalization from Shannon’s Entropy
3.3. Generalized Shannon’s Entropy and Generalized Mutual Information
4. Application of Entropic Statistics in Machine Learning and Knowledge Extraction
4.1. An Entropy-Based Random Forest Model
4.2. Feature Selection Methods
4.3. A Keyword Extraction Method
5. Conclusions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Meaning |
|---|---|
| ANOVA | Analysis of Variance |
| GSE | Generalized Shannon’s Entropy |
| GMI | Generalized Mutual Information |
| i.i.d. | independent and identically distributed |
| KL | Kullback–Leibler Divergence |
| MI | Mutual Information |
| ML | Machine Learning |
| PID | Partial Information Decomposition |
| SMI | Standardized Mutual Information |
| UMVUE | Uniformly Minimum-Variance Unbiased Estimator |
Appendix A. R Functions
| Statistic | R Package Name | Function Name |
|---|---|---|
| Plug-in entropy estimator | entropy [80] | entropy.plugin |
| Miller–Madow entropy estimator | entropy | entropy.MillerMadow |
| Jackknife entropy estimator | bootstrap [81] | jackknife |
| Chao–Shen entropy estimator | entropy | entropy.ChaoShen |
| Entropy Z-estimator | EntropyEstimation [82] | Entropy.z |
| Asymptotic standard deviation in Theorems 1 and 3 | EntropyEstimation | Entropy.sd |
| Plug-in MI estimator | entropy | mi.plugin |
| MI Z-estimator | EntropyEstimation | MI.z |
| Asymptotic standard deviation in Theorem 6 | EntropyEstimation | MI.sd |
| Plug-in KL estimator | entropy | KL.plugin |
| KL Z-estimator | EntropyEstimation | KL.z |
| Plug-in symmetrized KL estimator | EntropyEstimation | SymKL.plugin |
| Symmetrized KL Z-estimator | EntropyEstimation | SymKL.z |
| Rényi entropy Z-estimator | EntropyEstimation | Renyi.z |
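A short usage sketch for several of the functions listed above. The argument conventions are assumed from the CRAN documentation of packages [80,82]: entropy.plugin and mi.plugin are assumed to take relative frequencies, while the EntropyEstimation functions are assumed to take observed counts; the counts below are hypothetical.

```r
# Usage sketch for functions in Appendix A; argument conventions assumed
# from CRAN documentation (frequencies for entropy.plugin / mi.plugin,
# observed counts for the EntropyEstimation functions).
library(entropy)
library(EntropyEstimation)

counts <- c(60, 25, 10, 4, 1)
entropy.plugin(counts / sum(counts))   # plug-in estimator
entropy.MillerMadow(counts)            # Miller–Madow corrected estimator
Entropy.z(counts)                      # Z-estimator
Entropy.sd(counts)                     # asymptotic s.d. for the Z-estimator

joint <- matrix(c(30, 10,  5,
                   8, 25, 22), nrow = 2, byrow = TRUE)
mi.plugin(joint / sum(joint))          # plug-in mutual information
MI.z(joint)                            # Z-estimator of mutual information
```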
References
- Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef]
- Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
- Zhang, Z.; Grabchak, M. Entropic representation and estimation of diversity indices. J. Nonparametr. Stat. 2016, 28, 563–575. [Google Scholar] [CrossRef]
- Grabchak, M.; Zhang, Z. Asymptotic normality for plug-in estimators of diversity indices on countable alphabets. J. Nonparametr. Stat. 2018, 30, 774–795. [Google Scholar] [CrossRef]
- Zhang, Z. Generalized Mutual Information. Stats 2020, 3, 158–165. [Google Scholar] [CrossRef]
- Burnham, K.P.; Anderson, D.R. Practical use of the information-theoretic approach. In Model Selection and Inference; Springer: Berlin/Heidelberg, Germany, 1998; pp. 75–117. [Google Scholar]
- Dembo, A.; Cover, T.M.; Thomas, J.A. Information theoretic inequalities. IEEE Trans. Inf. Theory 1991, 37, 1501–1518. [Google Scholar] [CrossRef]
- Chatterjee, S.; Hadi, A.S. Regression Analysis by Example; John Wiley & Sons: Hoboken, NJ, USA, 2006. [Google Scholar]
- Speed, T. What is an analysis of variance? Ann. Stat. 1987, 15, 885–910. [Google Scholar] [CrossRef]
- Hardy, M.A. Regression with Dummy Variables; Sage: Newcastle upon Tyne, UK, 1993; Volume 93. [Google Scholar]
- Kent, J.T. Information gain and a general measure of correlation. Biometrika 1983, 70, 163–173. [Google Scholar] [CrossRef]
- Lin, J. Divergence measures based on the Shannon entropy. IEEE Trans. Inf. Theory 1991, 37, 145–151. [Google Scholar] [CrossRef]
- Van Erven, T.; Harremoës, P. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory 2014, 60, 3797–3820. [Google Scholar] [CrossRef]
- Sethi, I.K.; Sarvarayudu, G. Hierarchical classifier design using mutual information. IEEE Trans. Pattern Anal. Mach. Intell. 1982, 4, 441–445. [Google Scholar] [CrossRef] [PubMed]
- Safavian, S.R.; Landgrebe, D. A survey of decision tree classifier methodology. IEEE Trans. Syst. Man Cybern. 1991, 21, 660–674. [Google Scholar] [CrossRef]
- Li, J.; Cheng, K.; Wang, S.; Morstatter, F.; Trevino, R.P.; Tang, J.; Liu, H. Feature selection: A data perspective. ACM Comput. Surv. (CSUR) 2017, 50, 1–45. [Google Scholar] [CrossRef]
- Basharin, G.P. On a statistical estimate for the entropy of a sequence of independent random variables. Theory Probab. Appl. 1959, 4, 333–336. [Google Scholar] [CrossRef]
- Harris, B. The Statistical Estimation of Entropy in the Non-Parametric Case; Technical Report; Wisconsin Univ-Madison Mathematics Research Center: Madison, WI, USA, 1975. [Google Scholar]
- Zhang, Z.; Zhang, X. A normal law for the plug-in estimator of entropy. IEEE Trans. Inf. Theory 2012, 58, 2745–2747. [Google Scholar] [CrossRef]
- Miller, G.A.; Madow, W.G. On the Maximum Likelihood Estimate of the Shannon-Weiner Measure of Information; Operational Applications Laboratory, Air Force Cambridge Research Center, Air Research and Development Command, Bolling Air Force Base: Washington, DC, USA, 1954. [Google Scholar]
- Zahl, S. Jackknifing an index of diversity. Ecology 1977, 58, 907–913. [Google Scholar] [CrossRef]
- Chen, C.; Grabchak, M.; Stewart, A.; Zhang, J.; Zhang, Z. Normal Laws for Two Entropy Estimators on Infinite Alphabets. Entropy 2018, 20, 371. [Google Scholar] [CrossRef]
- Antos, A.; Kontoyiannis, I. Convergence properties of functional estimates for discrete distributions. Random Struct. Algorithms 2001, 19, 163–193. [Google Scholar] [CrossRef]
- Paninski, L. Estimation of entropy and mutual information. Neural Comput. 2003, 15, 1191–1253. [Google Scholar] [CrossRef]
- Zhang, Z. Entropy estimation in Turing’s perspective. Neural Comput. 2012, 24, 1368–1389. [Google Scholar] [CrossRef]
- Schürmann, T. A note on entropy estimation. Neural Comput. 2015, 27, 2097–2106. [Google Scholar] [CrossRef] [PubMed]
- Zhang, Z. Asymptotic normality of an entropy estimator with exponentially decaying bias. IEEE Trans. Inf. Theory 2013, 59, 504–508. [Google Scholar] [CrossRef]
- Zhang, Z. Statistical Implications of Turing’s Formula; John Wiley & Sons: Hoboken, NJ, USA, 2016. [Google Scholar]
- Chao, A.; Shen, T.J. Nonparametric estimation of Shannon’s index of diversity when there are unseen species in sample. Environ. Ecol. Stat. 2003, 10, 429–443. [Google Scholar] [CrossRef]
- Nemenman, I.; Shafee, F.; Bialek, W. Entropy and inference, revisited. arXiv 2001, arXiv:physics/0108025. [Google Scholar]
- Agresti, A.; Hitchcock, D.B. Bayesian inference for categorical data analysis. Stat. Methods Appl. 2005, 14, 297–330. [Google Scholar] [CrossRef]
- Hausser, J.; Strimmer, K. Entropy inference and the James-Stein estimator, with application to nonlinear gene association networks. J. Mach. Learn. Res. 2009, 10, 1469–1484. [Google Scholar]
- Shi, J.; Zhang, J.; Ge, Y. CASMI—An Entropic Feature Selection Method in Turing’s Perspective. Entropy 2019, 21, 1179. [Google Scholar] [CrossRef]
- Zhang, Z.; Zheng, L. A mutual information estimator with exponentially decaying bias. Stat. Appl. Genet. Mol. Biol. 2015, 14, 243–252. [Google Scholar] [CrossRef]
- Zhang, J.; Chen, C. On “A mutual information estimator with exponentially decaying bias” by Zhang and Zheng. Stat. Appl. Genet. Mol. Biol. 2018, 17, 20180005. [Google Scholar] [CrossRef]
- Williams, P.L.; Beer, R.D. Nonnegative decomposition of multivariate information. arXiv 2010, arXiv:1004.2515. [Google Scholar]
- Bertschinger, N.; Rauh, J.; Olbrich, E.; Jost, J.; Ay, N. Quantifying unique information. Entropy 2014, 16, 2161–2183. [Google Scholar] [CrossRef]
- Griffith, V.; Koch, C. Quantifying synergistic mutual information. In Guided Self-Organization: Inception; Springer: Berlin/Heidelberg, Germany, 2014; pp. 159–190. [Google Scholar]
- Tax, T.M.; Mediano, P.A.; Shanahan, M. The partial information decomposition of generative neural network models. Entropy 2017, 19, 474. [Google Scholar] [CrossRef]
- Wollstadt, P.; Schmitt, S.; Wibral, M. A rigorous information-theoretic definition of redundancy and relevancy in feature selection based on (partial) information decomposition. arXiv 2021, arXiv:2105.04187. [Google Scholar]
- Mori, T.; Nishikimi, K.; Smith, T.E. A divergence statistic for industrial localization. Rev. Econ. Stat. 2005, 87, 635–651. [Google Scholar] [CrossRef]
- Wang, Q.; Kulkarni, S.R.; Verdú, S. Divergence estimation for multidimensional densities via k-Nearest-Neighbor distances. IEEE Trans. Inf. Theory 2009, 55, 2392–2405. [Google Scholar] [CrossRef]
- Nguyen, X.; Wainwright, M.J.; Jordan, M.I. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Trans. Inf. Theory 2010, 56, 5847–5861. [Google Scholar] [CrossRef]
- Zhang, Z.; Grabchak, M. Nonparametric estimation of Küllback-Leibler divergence. Neural Comput. 2014, 26, 2570–2593. [Google Scholar] [CrossRef]
- Press, W.H.; Teukolsky, S.A. Numerical Recipes in Fortran: The Art of Scientific Computing; Cambridge University Press: Cambridge, UK, 1993. [Google Scholar]
- De Mántaras, R.L. A distance-based attribute selection measure for decision tree induction. Mach. Learn. 1991, 6, 81–92. [Google Scholar] [CrossRef]
- Kvalseth, T.O. Entropy and correlation: Some comments. IEEE Trans. Syst. Man Cybern. 1987, 17, 517–519. [Google Scholar] [CrossRef]
- Strehl, A.; Ghosh, J. Cluster ensembles—A knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 2002, 3, 583–617. [Google Scholar]
- Yao, Y. Information-theoretic measures for knowledge discovery and data mining. In Entropy Measures, Maximum Entropy Principle and Emerging Applications; Springer: Berlin/Heidelberg, Germany, 2003; pp. 115–136. [Google Scholar]
- Vinh, N.X.; Epps, J.; Bailey, J. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. J. Mach. Learn. Res. 2010, 11, 2837–2854. [Google Scholar]
- Zhang, Z.; Stewart, A.M. Estimation of Standardized Mutual Information; Technical Report; UNC Charlotte Technical Report: Charlotte, NC, USA, 2016. [Google Scholar]
- Zhang, Z.; Zhou, J. Re-parameterization of multinomial distributions and diversity indices. J. Stat. Plan. Inference 2010, 140, 1731–1738. [Google Scholar] [CrossRef]
- Chen, C. Goodness-of-Fit Tests under Permutations. Ph.D. Thesis, The University of North Carolina at Charlotte, Charlotte, NC, USA, 2019. [Google Scholar]
- Simpson, E.H. Measurement of diversity. Nature 1949, 163, 688. [Google Scholar] [CrossRef]
- Gini, C. Measurement of inequality of incomes. Econ. J. 1921, 31, 124–126. [Google Scholar] [CrossRef]
- Rényi, A. On measures of entropy and information. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 1 January 1961; Volume 1. [Google Scholar]
- Emlen, J.M. Ecology: An Evolutionary Approach; Addison-Wesley: Boston, MA, USA, 1977. [Google Scholar]
- Zhang, Z.; Chen, C.; Zhang, J. Estimation of population size in entropic perspective. Commun. Stat. Theory Methods 2020, 49, 307–324. [Google Scholar] [CrossRef]
- Beck, C.; Schögl, F. Thermodynamics of Chaotic Systems; Cambridge University Press: Cambridge, UK, 1995. [Google Scholar]
- Zhang, J.; Shi, J. Asymptotic Normality for Plug-In Estimators of Generalized Shannon’s Entropy. Entropy 2022, 24, 683. [Google Scholar] [CrossRef]
- Zhang, J.; Zhang, Z. A Normal Test for Independence via Generalized Mutual Information. arXiv 2022, arXiv:2207.09541. [Google Scholar]
- Kontoyiannis, I.; Skoularidou, M. Estimating the directed information and testing for causality. IEEE Trans. Inf. Theory 2016, 62, 6053–6067. [Google Scholar] [CrossRef]
- Huang, N.; Lu, G.; Cai, G.; Xu, D.; Xu, J.; Li, F.; Zhang, L. Feature selection of power quality disturbance signals with an entropy-importance-based random forest. Entropy 2016, 18, 44. [Google Scholar] [CrossRef]
- Brown, G.; Pocock, A.; Zhao, M.J.; Luján, M. Conditional likelihood maximisation: A unifying framework for information theoretic feature selection. J. Mach. Learn. Res. 2012, 13, 27–66. [Google Scholar]
- Lewis, D.D. Feature selection and feature extraction for text categorization. In Proceedings of the Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, NY, USA, 23–26 February 1992. [Google Scholar]
- Battiti, R. Using mutual information for selecting features in supervised neural net learning. IEEE Trans. Neural Netw. 1994, 5, 537–550. [Google Scholar] [CrossRef] [PubMed]
- Yang, H.; Moody, J. Feature selection based on joint mutual information. In Proceedings of the International ICSC Symposium on Advances in Intelligent Data Analysis, Rochester, NY, USA, 22–25 June 1999; Volume 1999, pp. 22–25. [Google Scholar]
- Ullman, S.; Vidal-Naquet, M.; Sali, E. Visual features of intermediate complexity and their use in classification. Nat. Neurosci. 2002, 5, 682–687. [Google Scholar] [CrossRef] [PubMed]
- Yu, L.; Liu, H. Efficient feature selection via analysis of relevance and redundancy. J. Mach. Learn. Res. 2004, 5, 1205–1224. [Google Scholar]
- Tesmer, M.; Estévez, P.A. AMIFS: Adaptive feature selection by using mutual information. In Proceedings of the 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No. 04CH37541), Budapest, Hungary, 25–29 July 2004; Volume 1, pp. 303–308. [Google Scholar]
- Fleuret, F. Fast binary feature selection with conditional mutual information. J. Mach. Learn. Res. 2004, 5, 1531–1555. [Google Scholar]
- Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238. [Google Scholar] [CrossRef]
- Jakulin, A. Machine Learning Based on Attribute Interactions. Ph.D. Thesis, Univerza v Ljubljani, Ljubljana, Slovenia, 2005. [Google Scholar]
- Lin, D.; Tang, X. Conditional infomax learning: An integrated framework for feature extraction and fusion. In Proceedings of the European Conference on Computer Vision, Graz, Austria, 7–13 May 2006; pp. 68–82. [Google Scholar]
- Meyer, P.E.; Bontempi, G. On the use of variable complementarity for feature selection in cancer classification. In Proceedings of the Workshops on Applications of Evolutionary Computation, Budapest, Hungary, 10–12 April 2006; pp. 91–102. [Google Scholar]
- El Akadi, A.; El Ouardighi, A.; Aboutajdine, D. A powerful feature selection approach based on mutual information. Int. J. Comput. Sci. Netw. Secur. 2008, 8, 116. [Google Scholar]
- Guo, B.; Nixon, M.S. Gait feature subset selection by mutual information. IEEE Trans. Syst. Man Cybern. Part A Syst. Hum. 2008, 39, 36–46. [Google Scholar]
- Cheng, G.; Qin, Z.; Feng, C.; Wang, Y.; Li, F. Conditional Mutual Information-Based Feature Selection Analyzing for Synergy and Redundancy. ETRI J. 2011, 33, 210–218. [Google Scholar] [CrossRef]
- Singhal, A.; Sharma, D. Keyword extraction using Renyi entropy: A statistical and domain independent method. In Proceedings of the 2021 7th International Conference on Advanced Computing and Communication Systems (ICACCS), Coimbatore, India, 19–20 March 2021; Volume 1, pp. 1970–1975. [Google Scholar]
- R Package Entropy. Available online: https://cran.r-project.org/web/packages/entropy/index.html (accessed on 27 September 2022).
- R Package Bootstrap. Available online: https://cran.r-project.org/web/packages/bootstrap/index.html (accessed on 27 September 2022).
- R Package EntropyEstimation. Available online: https://cran.r-project.org/web/packages/EntropyEstimation/index.html (accessed on 27 September 2022).
n | 100 | 300 | 500 | 1000 | 1500 | 2000 |
---|---|---|---|---|---|---|
avg. of | 4.56 | 5.57 | 6.00 | 6.51 | 6.75 | 6.89 |
avg. of | 5.11 | 6.09 | 6.49 | 6.92 | 7.11 | 7.21 |
| Method | Proposed Criterion (Score) | Different Estimation Method |
|---|---|---|
| MIM [65], MIFS [66], JMI [67] | … | Use … with jackknife procedure |
| IF [68] | … | Use … with jackknife procedure |
| FCBF [69] | … | Use [51] with jackknife procedure |
| AMIFS [70] | … | Use … and [51] with jackknife procedure |
| CMIM [71] | … | Use … with jackknife procedure |
| MRMR [72] | … | Use … with jackknife procedure |
| ICAP [73] | … | Use … with jackknife procedure |
| CIFE [74] | … | Use … with jackknife procedure |
| DISR [75] | … | Use [51] with jackknife procedure |
| IGFS [76] | … | Use … with jackknife procedure |
| SOA [77] | … | Use … with jackknife procedure |
| CMIFS [78] | … | Use … with jackknife procedure |
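The table above pairs each criterion with a re-estimation that applies the jackknife procedure to the underlying entropic quantities. As a hedged illustration only (not the paper's exact prescription), the sketch below scores categorical features MIM-style, ranking each feature by a jackknife-corrected plug-in estimate of its mutual information with the label; the data are simulated.

```r
# Hedged sketch of an MIM-style score: rank each categorical feature by a
# jackknife-corrected plug-in estimate of MI(feature; label). The estimator
# choice here is illustrative, not the paper's exact prescription.
H <- function(x) { p <- table(x) / length(x); -sum(p * log(p)) }
MI_plugin <- function(x, y) H(x) + H(y) - H(paste(x, y))

MI_jackknife <- function(x, y) {
  n   <- length(x)
  loo <- sapply(seq_len(n), function(i) MI_plugin(x[-i], y[-i]))
  n * MI_plugin(x, y) - (n - 1) / n * sum(loo)
}

set.seed(3)
label <- sample(c("pos", "neg"), 200, replace = TRUE)
f1 <- ifelse(runif(200) < 0.7, label,
             sample(c("pos", "neg"), 200, TRUE))    # informative feature
f2 <- sample(letters[1:4], 200, replace = TRUE)     # pure noise feature
c(f1 = MI_jackknife(f1, label), f2 = MI_jackknife(f2, label))
```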
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Zhang, J. Entropic Statistics: Concept, Estimation, and Application in Machine Learning and Knowledge Extraction. Mach. Learn. Knowl. Extr. 2022, 4, 865-887. https://doi.org/10.3390/make4040044