Article

Weighted Mean Squared Deviation Feature Screening for Binary Features

1 School of Mathematics and Statistics, Northeast Normal University, Changchun 130000, China
2 Key Laboratory for Applied Statistics of the MOE, School of Economics and Management, Northeast Normal University, Changchun 130000, China
* Author to whom correspondence should be addressed.
Entropy 2020, 22(3), 335; https://doi.org/10.3390/e22030335
Submission received: 22 February 2020 / Revised: 13 March 2020 / Accepted: 13 March 2020 / Published: 14 March 2020
(This article belongs to the Special Issue Information Theoretic Feature Selection Methods for Big Data)

Abstract

In this study, we propose a novel model-free feature screening method for ultrahigh dimensional binary features in binary classification, called weighted mean squared deviation (WMSD). Compared to the Chi-square statistic and mutual information, WMSD gives more opportunities to binary features with probabilities near 0.5. In addition, the asymptotic properties of the proposed method are theoretically investigated under the assumption $\log p = o(n)$. The number of features is selected in practice by a Pearson correlation coefficient method based on a property of the power-law distribution. Lastly, an empirical study of Chinese text classification illustrates that the proposed method performs well when the number of selected features is relatively small.

1. Introduction

Feature screening is a practical and powerful tool in the analysis and statistical modeling of ultrahigh dimensional data, such as genomes, biomedical images and text data. In supervised learning, the features often satisfy a sparsity assumption; that is, only a small number of features among a large collection are relevant to the response. Fan and Lv [1] therefore proposed a sure independence screening method based on correlation learning for linear models and theoretically proved its screening consistency. Subsequently, a series of model-free feature screening methods were proposed, which do not require model specification [2,3,4,5,6,7]. These methods learn the marginal relationship between the response and each feature, and filter out the features whose relationship with the response is weak.
In this study, we focus on feature screening for binary classification with ultrahigh dimensional binary features. The purpose of feature screening in classification is to filter out the large number of irrelevant features that do not help discriminate the class labels, while taking both computational speed and classification accuracy into account. For categorical features, statistical tests (e.g., the Chi-square test) [8,9], information theory (e.g., information gain, mutual information, cross entropy) [10,11,12,13], and Bayesian methods [14,15] are commonly used for feature screening, especially in the field of text classification. In this study, we propose a novel model-free feature screening method called weighted mean squared deviation (WMSD), which can be considered as a simplified version of the Chi-square statistic and mutual information. Next, based on a property of the power-law distribution [16,17], a Pearson correlation coefficient method is developed to select the number of relevant features. Lastly, the proposed method is applied to Chinese text classification, where it outperforms the Chi-square statistic and mutual information when a small number of words are selected.
The rest of this article is organized as follows. In Section 2.1, we introduce the weighted mean squared deviation feature screening method and investigate its asymptotic properties. In Section 2.2, a Pearson correlation coefficient method is developed for model selection based on a property of the power-law distribution. In Section 2.3, the relationships between the Chi-square statistic, mutual information and WMSD are discussed. In Section 3, the performance of the proposed method is numerically confirmed on both simulated and empirical datasets. Lastly, some conclusions are given in Section 4. Derivations and theoretical proofs are collected in Appendix A and Appendix B.

2. Methodology

2.1. Weighted Mean Squared Deviation

Consider a general binary classification task and let $(X_i, Y_i)$, $1 \le i \le n$, be independent and identically distributed observations. For the $i$-th observation, $X_i = (X_{i1}, \ldots, X_{ip})^\top \in \{0,1\}^p$ is the associated $p$-dimensional binary feature vector, and $Y_i \in \{0,1\}$ is the corresponding binary class label. Denote the necessary parameters as follows: $P(Y_i = 1) = \pi$, $P(X_{ij} = 1 \mid Y_i = 1) = \theta_{1j}$, $P(X_{ij} = 1 \mid Y_i = 0) = \theta_{0j}$, $P(X_{ij} Y_i = 1) = \mu_{1j} = \pi\theta_{1j}$, $P(X_{ij}(1 - Y_i) = 1) = \mu_{0j} = (1 - \pi)\theta_{0j}$, and $P(X_{ij} = 1) = \theta_j = \pi\theta_{1j} + (1 - \pi)\theta_{0j}$, for $1 \le i \le n$ and $1 \le j \le p$. Under the model-free feature screening framework, we need to filter out the features that are irrelevant to (i.e., independent of) the class label, namely those with $\theta_{1j} = \theta_{0j} = \theta_j$. Intuitively, feature $X_{ij}$ is independent of $Y_i$ if and only if
$$\omega_j = \pi(\theta_{1j} - \theta_j)^2 + (1 - \pi)(\theta_{0j} - \theta_j)^2 = \pi(1 - \pi)(\theta_{1j} - \theta_{0j})^2 = 0.$$
Note that the probabilities of the two classes serve as weights in $\omega_j$. In contrast, the $j$-th feature is relevant if and only if $\omega_j \ne 0$. We then define the true model as $T = \{j : \omega_j \ne 0, 1 \le j \le p\}$ with model size $|T| = d_0$, and the full model as $F = \{1, \ldots, p\}$.
Next, the Laplace smoothing method [18] is adopted for parameter estimation, so that all estimators are bounded away from 0 and 1. The parameter estimators are $\hat\pi = (2 + \sum_{i=1}^n Y_i)/(n + 4)$, $\hat\mu_{1j} = (1 + \sum_{i=1}^n Y_i X_{ij})/(n + 4)$ and $\hat\mu_{0j} = (1 + \sum_{i=1}^n (1 - Y_i) X_{ij})/(n + 4)$, for $1 \le j \le p$. It follows that $\hat\theta_{1j} = \hat\mu_{1j}/\hat\pi$, $\hat\theta_{0j} = \hat\mu_{0j}/(1 - \hat\pi)$ and $\hat\theta_j = \hat\mu_{1j} + \hat\mu_{0j}$, for $1 \le j \le p$. Then, a model-free feature screening statistic, called the weighted mean squared deviation (WMSD), is constructed as
$$\hat\omega_j = \hat\pi(1 - \hat\pi)(\hat\theta_{1j} - \hat\theta_{0j})^2, \qquad (1)$$
which is an estimator of $\omega_j$. Features far from independence are expected to be selected: intuitively, features with larger $\hat\omega_j$ values are more likely to be relevant, and those with smaller values less so. Consequently, the estimated model is defined as $\hat M = \{j : \hat\omega_j > c, j \in F\}$, where $c$ is a positive critical value. The following theorem gives the asymptotic properties of the WMSD method in the ultrahigh dimensional regime.
Theorem 1.
Assume $\log p = o(n)$ and that there exists a positive constant $\epsilon < 1/3$ such that $\epsilon \le \pi \le 1 - \epsilon$, $\epsilon \le \theta_{kj} \le 1 - \epsilon$ for any $k \in \{0,1\}$ and $j \in F$, and $|\theta_{1j} - \theta_{0j}| \ge \epsilon$ for $j \in T$. Then the following two results hold:
(1) $\max_{j \in F} |\hat\omega_j - \omega_j| = O_P(\sqrt{\log p / n})$;
(2) there exists $0 < c < (1 - \epsilon)\epsilon^3$ such that $\lim_{n \to \infty} P(\hat M = T) = 1$.
Note that the conditions $\epsilon \le \pi \le 1 - \epsilon$ and $\epsilon \le \theta_{kj} \le 1 - \epsilon$ imply that all parameters are bounded away from 0 and 1, and the condition $|\theta_{1j} - \theta_{0j}| \ge \epsilon$ implies $P(X_{ij} = 1 \mid Y_i = 1) \ne P(X_{ij} = 1 \mid Y_i = 0)$ for $j \in T$. Theorem 1 states that (1) $\hat\omega_j$ is a consistent estimator of $\omega_j$ uniformly over $j \in F$, and (2) $\hat M$ is a consistent estimator of $T$ as long as the critical value $c$ lies between 0 and $(1 - \epsilon)\epsilon^3$; this is the strong screening consistency of WMSD. However, the lower bound $\epsilon$ is unknown in real applications. To this end, a practicable method is proposed in the next section. The proof of this theorem is given in Appendix A.
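To make the estimator concrete, the following Python sketch computes $\hat\omega_j$ for all features simultaneously from Formula (1); the function name and vectorized layout are our own illustration and not part of the original paper.

```python
import numpy as np

def wmsd(X, y):
    """WMSD statistics for all binary features (Formula (1)).

    X : (n, p) array of 0/1 features; y : (n,) array of 0/1 labels.
    Laplace smoothing keeps every estimator away from 0 and 1.
    """
    n = len(y)
    pi_hat = (2 + y.sum()) / (n + 4)                  # estimate of P(Y = 1)
    mu1_hat = (1 + X[y == 1].sum(axis=0)) / (n + 4)   # estimate of P(X_j = 1, Y = 1)
    mu0_hat = (1 + X[y == 0].sum(axis=0)) / (n + 4)   # estimate of P(X_j = 1, Y = 0)
    theta1_hat = mu1_hat / pi_hat                     # estimate of P(X_j = 1 | Y = 1)
    theta0_hat = mu0_hat / (1 - pi_hat)               # estimate of P(X_j = 1 | Y = 0)
    return pi_hat * (1 - pi_hat) * (theta1_hat - theta0_hat) ** 2
```

Screening then amounts to ranking the features by these values and keeping those above the critical value $c$, or the top $\hat d$ features as described in the next section.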

2.2. Feature Selection Via Pearson Correlation Coefficient

While the true model $T$ can in theory be recovered by Theorem 1, the result depends strongly on the critical value $c$, which is not given beforehand in empirical studies and varies with the data. To solve this problem, the following strategy is developed for feature selection. First, without loss of generality, assume the features have been reordered so that $\hat\omega_1 > \hat\omega_2 > \cdots > \hat\omega_p$. All candidate models are then given by $\mathcal{M} = \{M(d) : 1 \le d \le p\}$ with $M(d) = \{1, \ldots, d\}$ for $1 \le d \le p$, a finite set of $p$ nested candidate models. Thus, the original problem of choosing a critical value $c$ from $(0, +\infty)$ is converted into a model selection problem over $\mathcal{M}$. Next, in our experience with text classification, the relatively large $\omega_j$s of irrelevant features approximately follow a power-law distribution, whereas the $\omega_j$s of relevant features and the relatively small $\omega_j$s of irrelevant features do not fit a power law well. The density function of the power-law distribution can be written as
$$p(x) = \frac{\alpha - 1}{x_0}\left(\frac{x}{x_0}\right)^{-\alpha}, \qquad x \ge x_0, \qquad (2)$$
where the power parameter $\alpha > 1$ and the lower bound parameter $x_0 > 0$. A typical property of the power-law distribution is that $\log p(x) = -\alpha \log x + C$; that is, it follows a straight line on a doubly logarithmic plot, where $C$ is a constant depending on $\alpha$ and $x_0$. Therefore, a common way to probe for power-law behavior is to construct the frequency distribution histogram of the data and plot it on doubly logarithmic axes: if the doubly logarithmic histogram approximately falls on a straight line, the data can be considered to follow a power-law distribution [16]. This inspires us to use the Pearson correlation coefficient of the doubly logarithmic plot of the $\hat\omega_j$s to find an optimal model in $\mathcal{M}$. The Pearson correlation coefficient of the sequences $\{\log j\}_{1 \le j \le m}$ and $\{\log \hat\omega_j\}_{d \le j \le d+m-1}$ can be written as
$$r_d = \frac{m\sum_{j=1}^m \log j\,\log\hat\omega_{j+d-1} - \left(\sum_{j=1}^m \log j\right)\left(\sum_{j=1}^m \log\hat\omega_{j+d-1}\right)}{\sqrt{m\sum_{j=1}^m (\log j)^2 - \left(\sum_{j=1}^m \log j\right)^2}\;\sqrt{m\sum_{j=1}^m (\log\hat\omega_{j+d-1})^2 - \left(\sum_{j=1}^m \log\hat\omega_{j+d-1}\right)^2}}, \qquad (3)$$
for $1 \le d \le p - m + 1$, where $m$ is the number of points used when calculating the Pearson correlation coefficient. The absolute value of $r_d$ measures how closely the sequence $\{\hat\omega_j\}_{d \le j \le d+m-1}$ follows a power-law distribution. Thus, the best model is selected as $\hat M = M(\hat d)$, with
$$\hat d = \mathop{\mathrm{argmax}}_{d_{\min} \le d \le d_{\max}} |r_{d+1}|, \qquad (4)$$
where $d_{\min}$ and $d_{\max}$ are the smallest and largest true model sizes to be considered. In other words, if the sequence $\{\hat\omega_j\}_{\hat d + 1 \le j \le \hat d + m}$ fits the power-law distribution best over all candidate contiguous subsequences of $\{\hat\omega_j\}_{1 \le j \le p}$, then the features $\{\hat d + 1 \le j \le \hat d + m\}$ are more likely to be irrelevant and the features in the model $\{1 \le j \le \hat d\}$ are more likely to be relevant. As a result, the Pearson correlation coefficient method is adopted to determine the model size estimated by WMSD. In numerical studies, the parameters $m$, $d_{\min}$ and $d_{\max}$ must be specified beforehand from empirical experience. Our numerical studies suggest that this feature selection method works quite well on both simulated and empirical data.
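A minimal sketch of this selection rule follows, under our reading of Formula (4); `np.corrcoef` replaces the explicit sums in Formula (3), to which it is algebraically equivalent.

```python
import numpy as np

def select_model_size(omega_hat, m=100, d_min=10, d_max=100):
    """Choose d_hat so that {omega_(d+1), ..., omega_(d+m)} of the sorted
    WMSD statistics looks most like a power law, i.e., most like a straight
    line on doubly logarithmic axes (maximal |r_{d+1}| in Formula (4))."""
    omega_sorted = np.sort(omega_hat)[::-1]          # decreasing order
    log_rank = np.log(np.arange(1, m + 1))
    best_d, best_r = d_min, -np.inf
    for d in range(d_min, d_max + 1):
        log_window = np.log(omega_sorted[d:d + m])   # omega_(d+1), ..., omega_(d+m)
        r = abs(np.corrcoef(log_rank, log_window)[0, 1])
        if r > best_r:
            best_d, best_r = d, r
    return best_d
```

With Laplace smoothing all $\hat\omega_j$ are strictly positive, so the logarithms are well defined.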

2.3. The Relationships between Chi-Square Statistic, Mutual Information and WMSD

As is well known, the Chi-square statistic and mutual information are two popular feature screening methods for discrete features. We now investigate the relationships between these two methods and WMSD. Using the parameter estimators defined above, the Chi-square statistic can be written as
$$\chi_j^2 = \frac{n\{n_{1j}(n - n_{1\cdot} - n_{\cdot j} + n_{1j}) - (n_{\cdot j} - n_{1j})(n_{1\cdot} - n_{1j})\}^2}{n_{\cdot j}\, n_{1\cdot}\,(n - n_{\cdot j})(n - n_{1\cdot})} \approx n\,\hat\theta_j^{-1}(1 - \hat\theta_j)^{-1}\hat\omega_j, \qquad (5)$$
where $n_{1\cdot} = \sum_{i=1}^n Y_i$, $n_{\cdot j} = \sum_{i=1}^n X_{ij}$, and $n_{1j} = \sum_{i=1}^n X_{ij} Y_i$ for $1 \le j \le p$. Formula (5) shows the relationship between the Chi-square statistic and WMSD (see Appendix B.1 for the detailed derivation). Thus, WMSD can be considered as a simplified version of the Chi-square statistic.
In a similar way, the mutual information can be written as
$$\begin{aligned} MI_j ={}& \frac{n_{1j}}{n}\log\frac{n\,n_{1j}}{n_{1\cdot}\, n_{\cdot j}} + \frac{n_{1\cdot} - n_{1j}}{n}\log\frac{n(n_{1\cdot} - n_{1j})}{n_{1\cdot}(n - n_{\cdot j})} + \frac{n_{\cdot j} - n_{1j}}{n}\log\frac{n(n_{\cdot j} - n_{1j})}{n_{\cdot j}(n - n_{1\cdot})} \\ &+ \frac{n - n_{1\cdot} - n_{\cdot j} + n_{1j}}{n}\log\frac{n(n - n_{1\cdot} - n_{\cdot j} + n_{1j})}{(n - n_{1\cdot})(n - n_{\cdot j})} \approx n^{-1}\chi_j^2 \approx \hat\theta_j^{-1}(1 - \hat\theta_j)^{-1}\hat\omega_j, \qquad (6) \end{aligned}$$
for $1 \le j \le p$. Formula (6) shows the relationship among mutual information, the Chi-square statistic and WMSD (see Appendix B.2 for the detailed derivation). In particular, the Chi-square statistic and mutual information are asymptotically equivalent for feature screening in binary classification with binary features, if the sample size factor $n$ is ignored.
Remark 1.
From Formulas (5) and (6), compared to the Chi-square statistic and mutual information, WMSD gives more opportunities to features with probabilities (i.e., $\theta_j$) near 0.5. For example, if $n = 100$, $\hat\theta_1 = 0.2$, $\hat\theta_2 = 0.1$, $MI_1 = 0.2$ and $MI_2 = 0.3$, then $\chi_1^2 \approx 20$, $\chi_2^2 \approx 30$, $\hat\omega_1 \approx 0.032$ and $\hat\omega_2 \approx 0.027$. Obviously, $MI_1 < MI_2$ and $\chi_1^2 < \chi_2^2$, but $\hat\omega_1 > \hat\omega_2$. This property is also confirmed in the empirical study of Chinese text classification below.
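The numbers in Remark 1 follow directly from the approximate relations in Formulas (5) and (6); the snippet below merely reproduces that arithmetic.

```python
# Reproducing Remark 1 via Formulas (5) and (6):
# chi2_j ~ n * MI_j and omega_j ~ MI_j * theta_j * (1 - theta_j).
n = 100
for theta, mi in [(0.2, 0.2), (0.1, 0.3)]:
    chi2 = n * mi
    omega = mi * theta * (1 - theta)
    print(f"theta={theta}: chi2 ~ {chi2:.0f}, omega ~ {omega:.3f}")
# theta=0.2: chi2 ~ 20, omega ~ 0.032
# theta=0.1: chi2 ~ 30, omega ~ 0.027
```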

3. Numerical Studies

3.1. Simulation Study

To evaluate the finite sample performance of the WMSD feature screening method for binary classification with binary features, two standard feature selection methods are considered as competitors: the Chi-square statistic (Chi2) and mutual information (MI). In addition, to investigate the robustness of the proposed method under different classifiers, two popular classification methods are considered: naive Bayes (NB) and logistic regression (LR). To generate the simulated data, a multi-variate Bernoulli model [19] with both relevant and irrelevant binary features is used. Moreover, different training sample sizes ($n$ = 1000, 2000, 5000), feature dimensions ($p$ = 500, 1000) and true model sizes ($d_0$ = 20, 50) are considered in the parameter setup. For each fixed parameter setting, a total of 1000 simulation replications are conducted, and the three feature screening methods (Chi2, MI and WMSD) are applied to each simulated dataset. Subsequently, the false positive rate of WMSD, $\mathrm{FPR} = |T \setminus \hat M| / |T|$, and the false negative rate, $\mathrm{FNR} = |(F \setminus T) \cap \hat M| / |F \setminus T|$, are calculated, and their averages over the 1000 replications are reported. Lastly, to evaluate classification performance, another 1000 independent observations are generated as a testing sample for each replication. The area under the receiver operating characteristic curve (AUC) is adopted to evaluate out-of-sample prediction accuracy: the AUC values of NB and LR on the three estimated models (selected by Chi2, MI and WMSD, respectively) are calculated on the testing sample and averaged over the 1000 replications.
For the given simulation model and parameter setup, the data are generated as follows. First, generate the class label $Y_i \in \{0,1\}$ with probability $P(Y_i = 1) = \pi = 0.5$ in the balanced case and $\pi = 0.8$ in the unbalanced case. Next, given $Y_i$, the $j$-th binary feature $X_{ij}$ is generated from a multi-variate Bernoulli model with probabilities $P(X_{ij} = 1 \mid Y_i = 1) = \theta_{1j} = 0.05\{j^{-0.2}p^{0.2} + I(1 \le j \le 0.5 d_0)\, j^{-0.5} d_0^{0.5}\}$ and $P(X_{ij} = 1 \mid Y_i = 0) = \theta_{0j} = 0.05\{j^{-0.2}p^{0.2} + I(0.5 d_0 + 1 \le j \le d_0)\, j^{-0.5} d_0^{0.5}\}$ for $j \in \{1, \ldots, p\}$, where $I(\cdot)$ is the indicator function. Note that, without loss of generality, we set $T = \{1, \ldots, d_0\}$; that is, the first $d_0$ features are relevant. Moreover, in this simulation, the parameters in Formulas (3) and (4) are set to $m = 100$, $d_{\min} = 10$ and $d_{\max} = 100$.
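For reference, one replication of this design could be generated as follows. The theta formulas in the code follow our reconstruction of the text above and should be read as an assumption; the FPR computation uses the definition of Section 3.1 together with the `wmsd` and `select_model_size` functions sketched in Section 2.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, p, d0, pi):
    """One dataset from the multi-variate Bernoulli design (our reading)."""
    j = np.arange(1, p + 1)
    base = j ** (-0.2) * p ** 0.2
    bump = j ** (-0.5) * d0 ** 0.5
    theta1 = 0.05 * (base + np.where(j <= 0.5 * d0, bump, 0.0))
    theta0 = 0.05 * (base + np.where((j > 0.5 * d0) & (j <= d0), bump, 0.0))
    y = rng.binomial(1, pi, size=n)
    X = rng.binomial(1, np.where(y[:, None] == 1, theta1, theta0))
    return X, y

X, y = simulate(n=1000, p=500, d0=20, pi=0.5)
omega = wmsd(X, y)
d_hat = select_model_size(omega, m=100, d_min=10, d_max=100)
M_hat = set(np.argsort(omega)[::-1][:d_hat])   # estimated model (0-indexed)
T = set(range(20))                             # true model {1, ..., d0}
fpr = len(T - M_hat) / len(T)                  # FPR = |T \ M_hat| / |T|
```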
The detailed simulation results are given in Table 1. In the balanced case ($\pi = 0.5$), the following observations can be made. First, with $p$ and $n$ fixed, a larger true model size $d_0$ leads to a larger AUC, because the more relevant features are involved, the better we can predict. Second, with $d_0$ and $n$ fixed, a larger feature dimension $p$ leads to worse AUC performance; this is reasonable because a larger feature dimension makes feature selection more challenging and thus prediction worse. Third, with $p$ and $d_0$ fixed, a larger sample size $n$ leads to a larger AUC and a smaller FPR, as expected, since a larger sample yields more accurate estimators and hence better prediction. Fourth, in almost all parameter settings, the AUC values of WMSD are larger than those of Chi2 and MI, which indicates that WMSD performs better than the other two methods on the simulated data. Last, the FNR values are relatively small in all parameter settings, which indicates that WMSD can filter out most irrelevant features. The results of the unbalanced case ($\pi = 0.8$) are similar to those of the balanced case, although the FPR values are larger for every parameter setting, implying that feature selection is harder in the unbalanced case.

3.2. An Application in Chinese Text Classification

The dataset is downloaded from CNKI (www.cnki.net), one of the largest academic literature platforms in China. It contains n = 14,473 abstracts of articles published in CSSCI (Chinese Social Sciences Citation Index) journals in the fields of economics and management in 2018. The abstracts are composed of p = 2385 Chinese words (words with frequencies less than 10 are ignored). Our purpose is to classify the articles into fields (economics or management) according to their abstracts, and to select a small number of feature words that are helpful for classification. One field (economics or management) is treated as class 1 (i.e., $Y_i = 1$) and the other as class 0 (i.e., $Y_i = 0$). In total, there are 8570 abstracts from economics and 5903 from management. Naive Bayes and logistic regression are both considered as standard classification methods, and the Chi-square statistic, mutual information and WMSD are compared as feature screening methods on top of these two classifiers. Note that the results of the feature selection methods are invariant when class 1 and class 0 are exchanged.
Next, we randomly sample 10,000 abstracts as the training set and take the rest as the testing set. To compare the feature screening methods, different numbers of selected words $d$ (from 10 to 100 in steps of 10) are considered, and the AUC values of the two classification methods are calculated for each $d$. For each setting, a total of 200 random replications are conducted. The averaged AUC values of the two classifiers (NB and LR) over the 200 replications for the three feature screening methods (Chi2, MI and WMSD) with different numbers of selected words, when economics and management are each considered as class 1, are reported in Figure 1. Panel (1) of Figure 1 shows that when the naive Bayes classifier is applied and economics is considered as class 1, the AUC values based on the three estimated models (selected by Chi2, MI and WMSD, respectively) increase as $d$ becomes larger. WMSD clearly outperforms the other methods when $d < 50$, and the methods perform similarly when $d \ge 50$. Panel (2) shows a similar result for logistic regression. Panels (3) and (4) of Figure 1 show that WMSD also far outperforms Chi2 and MI when $d < 50$ if the classes are exchanged.
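As a sketch of the evaluation loop behind Figure 1 (not the authors' original code), screening with WMSD and scoring a naive Bayes classifier could look as follows; `wmsd` is the function from Section 2.1, `X` is assumed to be a binary document-word matrix, and scikit-learn supplies the classifier and the AUC.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import roc_auc_score

def screening_auc(X_train, y_train, X_test, y_test, d):
    """Keep the top-d features by WMSD, fit NB, and return the test AUC."""
    top_d = np.argsort(wmsd(X_train, y_train))[::-1][:d]
    clf = BernoulliNB().fit(X_train[:, top_d], y_train)
    scores = clf.predict_proba(X_test[:, top_d])[:, 1]
    return roc_auc_score(y_test, scores)

# Sweep d from 10 to 100 in steps of 10, as in Figure 1:
# aucs = [screening_auc(Xtr, ytr, Xte, yte, d) for d in range(10, 101, 10)]
```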
Furthermore, the Pearson correlation coefficient method is used to determine the estimated model size of WMSD. To calculate $\hat d$, the parameters in Formulas (3) and (4) are set to $m = 100$, $d_{\min} = 20$ and $d_{\max} = 100$; the averaged $\hat d$ over the 200 replications is 25.86. In each replication, for the same $\hat d$, the AUC values of NB and LR based on the three estimated models (Chi2, MI and WMSD) are calculated separately. Figure 2 shows the boxplots of AUC for the six combinations (NB+Chi2, NB+MI, NB+WMSD, LR+Chi2, LR+MI and LR+WMSD) over the 200 replications. When the estimated model size is relatively small (the averaged $\hat d$ is 25.86), WMSD is more accurate and robust than Chi2 and MI in terms of AUC, whether economics or management is considered as class 1.
Lastly, the probabilities of the top 10 words ranked by each of the three feature screening methods are calculated based on all n = 14,473 abstracts. Table 2 shows that the probabilities of the top 10 words ranked by WMSD are larger than those of the other two methods. This indicates that WMSD gives more opportunities to high-frequency words: since the frequencies of almost all words are below 0.5, the frequencies of high-frequency words are the ones closest to 0.5. This validates the property of WMSD discussed in Section 2.3.

4. Conclusions

In this study, a novel model-free feature screening method called weighted mean squared deviation is proposed for ultrahigh dimensional binary features in binary classification; it measures the dependence between each feature and the class label. WMSD can be considered as a simplified version of the Chi-square statistic and mutual information, and it gives more opportunities to features with probabilities near 0.5. Furthermore, the strong screening consistency of WMSD is investigated theoretically, the number of features is determined in practice by a Pearson correlation coefficient method, and the performance of WMSD is numerically confirmed both on simulated data and on a real example of Chinese text classification. Three potential directions are proposed for future studies. First, for multi-class classification with categorical features, the corresponding WMSD statistics need to be theoretically and numerically investigated. Second, the feature selection method via the Pearson correlation coefficient has not been theoretically verified, which is an important open problem. Last, to further confirm the performance of WMSD in empirical research, it may make sense to specifically investigate the observations for which other methods give a probability near 0.5 (i.e., whose class labels are hard to predict).

Author Contributions

Conceptualization, G.W. and G.G.; methodology, G.G.; software, G.W.; validation, G.G.; formal analysis, G.W. and G.G.; investigation, G.W.; resources, G.G.; data curation, G.W.; writing, original draft preparation, G.W. and G.G.; writing, review and editing, G.W. and G.G.; visualization, G.W.; supervision, G.G.; project administration, G.G.; funding acquisition, G.G. All authors have read and agreed to the published version of the manuscript.

Funding

The research is supported by National Social Science Fund of China, grant number 19CTJ013.

Acknowledgments

The authors thank all the anonymous reviewers for their constructive comments. The authors also thank Ningzhen Wang and Chao Wu from the University of Connecticut for correcting the English writing.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proof of Theorem 1

According to the definitions of $\hat\pi$, $\hat\mu_{1j}$ and $\hat\mu_{0j}$, all three estimators lie between $(n+4)^{-1}$ and $1 - (n+4)^{-1}$. In addition, by the conditions of Theorem 1, $\pi$, $\theta_{1j}$ and $\theta_{0j}$ are all bounded away from 0 and 1 for $j \in F$, and hence so are $\mu_{1j}$ and $\mu_{0j}$. By Lemma 1 in [12], for any $\varepsilon > 0$ and sufficiently large $n$, we have $P(|\hat\pi - \pi| > \varepsilon) \le 2\exp(-2n\varepsilon^2)$ and $P(|\hat\mu_{kj} - \mu_{kj}| > \varepsilon) \le 2\exp(-2n\varepsilon^2)$, for $k = 0, 1$ and $1 \le j \le p$. In addition, $\omega_j$ and $\hat\omega_j$ can be rewritten as
$$\omega_j = \pi(1 - \pi)\{\pi^{-1}\mu_{1j} - (1 - \pi)^{-1}\mu_{0j}\}^2, \qquad \hat\omega_j = \hat\pi(1 - \hat\pi)\{\hat\pi^{-1}\hat\mu_{1j} - (1 - \hat\pi)^{-1}\hat\mu_{0j}\}^2.$$
Then, by Lemma 2 in [12], for any $\varepsilon > 0$, we have $P(|\hat\omega_j - \omega_j| > \varepsilon) \le C_1\exp(-C_2 n \varepsilon^2)$, where $C_1$ and $C_2$ are some positive constants. Next, by Bonferroni's inequality [20],
$$P\left(\max_{j \in F}|\hat\omega_j - \omega_j| > \sqrt{2/C_2}\sqrt{\log p / n}\right) \le \sum_{j=1}^p P\left(|\hat\omega_j - \omega_j| > \sqrt{2/C_2}\sqrt{\log p / n}\right) \le p\, C_1\exp\{-C_2(2/C_2)\log p\} = C_1\exp(-\log p) \to 0.$$
Consequently, $\max_{j \in F}|\hat\omega_j - \omega_j| = O_P(\sqrt{\log p / n})$, which proves result (1).
By the conditions of Theorem 1, $\epsilon \le \pi \le 1 - \epsilon$ and $|\theta_{1j} - \theta_{0j}| \ge \epsilon$ for $j \in T$, we have $\omega_j = \pi(1 - \pi)(\theta_{1j} - \theta_{0j})^2 \ge (1 - \epsilon)\epsilon^3$ for $j \in T$ and $\omega_j = 0$ for $j \notin T$. For $0 < c < (1 - \epsilon)\epsilon^3$ and $\log p = o(n)$, we have
$$P(\hat M = T) = P\left(\min_{j \in T}\hat\omega_j > c,\ \max_{j \notin T}\hat\omega_j < c\right) \ge P\left(\min_{j \in T}\hat\omega_j > c\right) + P\left(\max_{j \notin T}\hat\omega_j < c\right) - 1 \ge P\left(\max_{j \in T}|\hat\omega_j - \omega_j| < (1 - \epsilon)\epsilon^3 - c\right) + P\left(\max_{j \notin T}|\hat\omega_j - \omega_j| < c\right) - 1.$$
By result (1), both probabilities on the right-hand side tend to 1; hence $P(\hat M = T) \to 1$ as $n \to \infty$. The proof is completed.

Appendix B. Some Necessary Derivations

Appendix B.1. Derivation of the Relationship between Chi-Square Statistic and WMSD

Denote $n_{1\cdot} = \sum_{i=1}^n Y_i$, $n_{\cdot j} = \sum_{i=1}^n X_{ij}$, and $n_{1j} = \sum_{i=1}^n X_{ij} Y_i$. According to the definitions of the estimators, we have $\hat\pi \approx n_{1\cdot}/n$, $\hat\theta_j \approx n_{\cdot j}/n$, $\hat\theta_{1j} \approx n_{1j}/n_{1\cdot}$ and $\hat\theta_{0j} \approx (n_{\cdot j} - n_{1j})/(n - n_{1\cdot})$. Then we have
$$\chi_j^2 = \frac{n\{n_{1j}(n - n_{1\cdot} - n_{\cdot j} + n_{1j}) - (n_{\cdot j} - n_{1j})(n_{1\cdot} - n_{1j})\}^2}{n_{\cdot j}\, n_{1\cdot}\,(n - n_{\cdot j})(n - n_{1\cdot})} \approx n\,\hat\theta_j^{-1}(1 - \hat\theta_j)^{-1}\hat\pi(1 - \hat\pi)(\hat\theta_{1j} - \hat\theta_{0j})^2 = n\,\hat\theta_j^{-1}(1 - \hat\theta_j)^{-1}\hat\omega_j.$$
This is the relationship between the Chi-square statistic and WMSD.

Appendix B.2. Derivation of the Relationship between Mutual Information and WMSD

Using the notation of Appendix B.1 and Taylor's theorem (i.e., $\log x \approx x - 1$ near $x = 1$), we have
$$\begin{aligned} MI_j ={}& \frac{n_{1j}}{n}\log\frac{n\,n_{1j}}{n_{1\cdot}\, n_{\cdot j}} + \frac{n_{1\cdot} - n_{1j}}{n}\log\frac{n(n_{1\cdot} - n_{1j})}{n_{1\cdot}(n - n_{\cdot j})} + \frac{n_{\cdot j} - n_{1j}}{n}\log\frac{n(n_{\cdot j} - n_{1j})}{n_{\cdot j}(n - n_{1\cdot})} \\ &+ \frac{n - n_{1\cdot} - n_{\cdot j} + n_{1j}}{n}\log\frac{n(n - n_{1\cdot} - n_{\cdot j} + n_{1j})}{(n - n_{1\cdot})(n - n_{\cdot j})} \\ ={}& \hat\pi\big[\hat\theta_{1j}\log(\hat\theta_{1j}/\hat\theta_j) + (1 - \hat\theta_{1j})\log\{(1 - \hat\theta_{1j})/(1 - \hat\theta_j)\}\big] \\ &+ (1 - \hat\pi)\big[\hat\theta_{0j}\log(\hat\theta_{0j}/\hat\theta_j) + (1 - \hat\theta_{0j})\log\{(1 - \hat\theta_{0j})/(1 - \hat\theta_j)\}\big] \\ \approx{}& \hat\pi\big[\hat\theta_{1j}(\hat\theta_{1j}/\hat\theta_j - 1) + (1 - \hat\theta_{1j})\{(1 - \hat\theta_{1j})/(1 - \hat\theta_j) - 1\}\big] \\ &+ (1 - \hat\pi)\big[\hat\theta_{0j}(\hat\theta_{0j}/\hat\theta_j - 1) + (1 - \hat\theta_{0j})\{(1 - \hat\theta_{0j})/(1 - \hat\theta_j) - 1\}\big] \\ ={}& \hat\pi(1 - \hat\pi)(\hat\mu_{1j} + \hat\mu_{0j})^{-1}(1 - \hat\mu_{1j} - \hat\mu_{0j})^{-1}\{\hat\pi^{-1}\hat\mu_{1j} - (1 - \hat\pi)^{-1}\hat\mu_{0j}\}^2 \\ \approx{}& n^{-1}\chi_j^2. \end{aligned}$$
As a result, $MI_j \approx \hat\theta_j^{-1}(1 - \hat\theta_j)^{-1}\hat\omega_j$. This is the relationship between mutual information and WMSD.

References

  1. Fan, J.; Lv, J. Sure independence screening for ultrahigh dimensional feature space. J. R. Stat. Soc. B 2008, 70, 849–911. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  2. Zhu, L.; Li, L.; Li, R.; Zhu, L. Model-free feature screening for ultrahigh dimensional data. J. Am. Stat. Assoc. 2011, 106, 1464–1475. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  3. Li, R.; Zhong, W.; Zhu, L. Feature screening via distance correlation learning. J. Am. Stat. Assoc. 2012, 107, 1129–1139. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  4. Cui, H.; Li, R.; Zhong, W. Model-free feature screening for ultrahigh dimensional discriminant analysis. J. Am. Stat. Assoc. 2015, 110, 630–641. [Google Scholar] [CrossRef] [PubMed]
  5. Yu, Z.; Dong, Y.; Zhu, L. Trace Pursuit: A general framework for model-free variable selection. J. Am. Stat. Assoc. 2016, 111, 813–821. [Google Scholar] [CrossRef]
  6. Lin, Y.; Liu, X.; Hao, M. Model-free feature screening for high-dimensional survival data. Sci. China Math. 2018, 61, 1617–1636. [Google Scholar] [CrossRef]
  7. Pan, W.; Wang, X.; Xiao, W.; Zhu, H. A generic sure independence screening procedure. J. Am. Stat. Assoc. 2019, 114, 928–937. [Google Scholar] [CrossRef]
  8. An, B.; Wang, H.; Guo, J. Testing the statistical significance of an ultra-high-dimensional naive Bayes classifier. Stat. Interface 2013, 6, 223–229. [Google Scholar] [CrossRef] [Green Version]
  9. Huang, D.; Li, R.; Wang, H. Feature screening for ultrahigh dimensional categorical data with applications. J. Bus. Econ. Stat. 2014, 32, 237–244. [Google Scholar] [CrossRef]
  10. Lee, C.; Lee, G.G. Information gain and divergence-based feature selection for machine learning-based text categorization. Inform. Process. Manag. 2006, 42, 155–165. [Google Scholar] [CrossRef]
  11. Pascoal, C.; Oliveira, M.R.; Pacheco, A.; Valadas, R. Theoretical evaluation of feature selection methods based on mutual information. Neurocomputing 2017, 226, 168–181. [Google Scholar] [CrossRef] [Green Version]
  12. Guan, G.; Shan, N.; Guo, J. Feature screening for ultrahigh dimensional binary data. Stat. Interface 2018, 11, 41–50. [Google Scholar] [CrossRef]
  13. Dai, W.; Guo, D. Beta Distribution-Based Cross-Entropy for Feature Selection. Entropy 2019, 21, 769. [Google Scholar] [CrossRef] [Green Version]
  14. Feng, G.; Guo, J.; Jing, B.; Hao, L. A Bayesian feature selection paradigm for text classification. Inform. Process. Manag. 2012, 48, 283–302. [Google Scholar] [CrossRef]
  15. Feng, G.; Guo, J.; Jing, B.; Sun, T. Feature subset selection using naive Bayes for text classification. Pattern Recogn. Lett. 2015, 65, 109–115. [Google Scholar] [CrossRef]
  16. Clauset, A.; Shalizi, C.R.; Newman, M.E. Power-law distributions in empirical data. SIAM Rev. 2009, 51, 661–703. [Google Scholar] [CrossRef] [Green Version]
  17. Stumpf, M.P.; Porter, M.A. Critical Truths About Power Laws. Science 2012, 335, 665–666. [Google Scholar] [CrossRef] [PubMed]
  18. Murphy, K.P. Machine Learning: A Probabilistic Perspective; The MIT Press: Cambridge, MA, USA, 2012. [Google Scholar]
  19. Mccallum, A.; Nigam, K. A comparison of event models for naive Bayes text classification. In Proceedings of the AAAI-98 Workshop on Learning for Text Categorization, Madison, WI, USA, 26–31 July 1998; pp. 41–48. [Google Scholar]
  20. Galambos, J.; Simonelli, I. Bonferroni-Type Inequalities with Applications; Springer: New York, NY, USA, 1996. [Google Scholar]
Figure 1. Averaged AUC values of NB and LR on the three models ranked by Chi2, MI and WMSD with different model sizes (from 10 to 100 in steps of 10), when economics and management are each considered as class 1, over 200 replications.
Figure 2. The boxplots of the AUC values of NB and LR based on the three estimated models by Chi2, MI and WMSD, when economics and management are each considered as class 1, over 200 replications.
Table 1. Results of the simulation study. The averaged area under the receiver operating characteristic curve (AUC) values of naive Bayes (NB) and logistic regression (LR) based on the three estimated models (Chi-square statistic (Chi2), mutual information (MI) and weighted mean squared deviation (WMSD)) are reported, together with the averaged false positive rate (FPR) and false negative rate (FNR) values of WMSD, over 1000 replications.

                          AUC of NB                  AUC of LR             WMSD
 d_0     p      n    Chi2     MI     WMSD      Chi2     MI     WMSD      FPR     FNR
-------------------------------------------------------------------------------------
 pi = 0.5
  20    500   1000  0.7238  0.7233  0.7318    0.6966  0.6960  0.7033   0.4188  0.0001
  20    500   2000  0.7610  0.7609  0.7625    0.7411  0.7411  0.7428   0.1930  0.0000
  20    500   5000  0.7778  0.7778  0.7779    0.7673  0.7673  0.7676   0.0108  0.0013
  20   1000   1000  0.7145  0.7135  0.7303    0.6849  0.6839  0.7007   0.4014  0.0001
  20   1000   2000  0.7545  0.7543  0.7591    0.7335  0.7332  0.7399   0.1599  0.0001
  20   1000   5000  0.7693  0.7693  0.7697    0.7584  0.7584  0.7592   0.0024  0.0010
  50    500   1000  0.8936  0.8935  0.8973    0.8463  0.8460  0.8499   0.2976  0.0008
  50    500   2000  0.9102  0.9102  0.9110    0.8837  0.8837  0.8850   0.1058  0.0001
  50    500   5000  0.9165  0.9165  0.9165    0.8998  0.8998  0.8998   0.0096  0.0005
  50   1000   1000  0.8789  0.8787  0.8851    0.8239  0.8233  0.8313   0.3408  0.0004
  50   1000   2000  0.9014  0.9013  0.9031    0.8717  0.8716  0.8748   0.1106  0.0001
  50   1000   5000  0.9097  0.9097  0.9098    0.8921  0.8921  0.8923   0.0017  0.0007
 pi = 0.8
  20    500   1000  0.6372  0.6502  0.6883    0.6422  0.6545  0.6905   0.4796  0.0007
  20    500   2000  0.7206  0.7237  0.7303    0.7203  0.7239  0.7307   0.3413  0.0001
  20    500   5000  0.7692  0.7692  0.7696    0.7658  0.7659  0.7664   0.0706  0.0001
  20   1000   1000  0.6171  0.6329  0.6908    0.6268  0.6405  0.6936   0.4833  0.0007
  20   1000   2000  0.7183  0.7210  0.7328    0.7190  0.7216  0.7330   0.3214  0.0001
  20   1000   5000  0.7642  0.7640  0.7658    0.7614  0.7613  0.7627   0.0406  0.0002
  50    500   1000  0.8636  0.8665  0.8746    0.8537  0.8542  0.8594   0.4739  0.0017
  50    500   2000  0.9018  0.9022  0.9043    0.8930  0.8923  0.8935   0.2115  0.0005
  50    500   5000  0.9149  0.9149  0.9150    0.9107  0.9107  0.9107   0.0442  0.0000
  50   1000   1000  0.8428  0.8468  0.8583    0.8326  0.8337  0.8425   0.5433  0.0008
  50   1000   2000  0.8894  0.8899  0.8943    0.8790  0.8783  0.8821   0.2291  0.0004
  50   1000   5000  0.9075  0.9074  0.9079    0.9028  0.9027  0.9034   0.0295  0.0001
Table 2. The probabilities of the top 10 words ranked by the three feature screening methods, Chi2, MI and WMSD.

Method   Probabilities of the top 10 words (in rank order)
Chi2     0.285  0.034  0.133  0.029  0.043  0.047  0.012  0.014  0.022  0.017
MI       0.285  0.133  0.034  0.029  0.022  0.043  0.026  0.019  0.012  0.047
WMSD     0.285  0.133  0.541  0.211  0.223  0.203  0.034  0.235  0.047  0.043
