Improved Measures of Redundancy and Relevance for mRMR Feature Selection

Abstract: Many biological or medical datasets have numerous features. Feature selection is a data preprocessing step that can remove noise from data as well as save computing time when a dataset has several hundred thousand or more features. Another goal of feature selection is to improve classification accuracy in machine learning tasks. Minimum Redundancy Maximum Relevance (mRMR) is a well-known feature selection algorithm that selects features by calculating redundancy between features and relevance between features and the class vector. mRMR adopts mutual information theory to measure redundancy and relevance. In this research, we propose a method to improve the performance of mRMR feature selection. We apply Pearson's correlation coefficient as a measure of redundancy and the R-value as a measure of relevance. To compare the original mRMR and the proposed method, features were selected from various datasets using both methods, and then we performed classification tests. Classification accuracy was used as the measure for performance comparison. In many cases, the proposed method showed higher accuracy than the original mRMR.


Introduction
Recently, with the rapid development of machine learning and the increasing accumulation of data through the internet, traditional data analysis methods have become difficult to apply to modern big data problems, and various data preprocessing techniques have been developed. Among them, feature selection is the process of selecting a set of features (variables, attributes) that meets the purpose of analysis from a high-dimensional dataset having thousands or tens of thousands of features. Analysts can benefit from feature selection in several ways, including better performance of predictive models and faster, more efficient data analysis. The advantages of feature selection are as follows: (a) it reduces the dimension of the dataset and therefore the cost of computing resources; (b) it improves classification model performance by reducing data noise; (c) it facilitates data visualization and understanding.

The main purpose of general feature selection is to determine a set of features related to particular events or phenomena of interest. Feature selection is usually divided into filter methods and wrapper methods, depending on how the relevant features are searched [1-4]. Filter techniques assess the relevance of features by evaluating only the intrinsic properties of the data [1]. In most cases, relevance scores between each feature and the class vector are calculated, and high-scoring features are selected. Filter techniques are simple, fast, and easy to understand. However, they do not consider redundancy and interaction between features; they assume that features are independent of each other. To capture the interactions between features, wrapper methods embed a classification model within the feature subset evaluation.
However, as the space of feature subsets grows exponentially with the number of features, heuristic search methods such as forward search and backward elimination are used to guide the search toward an optimal subset [1]. Feature selection can also be categorized into supervised, unsupervised, and semisupervised methods [5-7]. Supervised feature selection algorithms assess features' relevance by evaluating their correlation with the class information, whereas unsupervised feature selection algorithms may exploit data variance or data distribution to evaluate features' relevance without labels. Semisupervised feature selection algorithms use a small amount of labeled data as additional information to improve unsupervised feature selection [5]. Minimum Redundancy Maximum Relevance (mRMR) and the proposed method belong to the supervised category.
Ding and Peng [8,9] suggested the mRMR measure to reduce redundant features during the feature selection process. They tried to measure both the redundancy among features and the relevance between features and the class vector for a given set of features. Their redundancy and relevance measures are based on mutual information, defined in Equation (1):

I(x, y) = Σ_{x,y} p(x, y) log [ p(x, y) / (p(x) p(y)) ]    (1)

In Equation (1), x and y are feature vectors or a class vector, and p() represents probability. Suppose S is a given set of features and h is a class variable. The redundancy W_I of S is measured by Equation (2):

W_I = (1 / |S|²) Σ_{i,j ∈ S} I(i, j)    (2)

In Equation (2), |S| is the number of features in S. The relevance V_I of S is measured by Equation (3):

V_I = (1 / |S|) Σ_{i ∈ S} I(h, i)    (3)

There are two criteria to evaluate S, given in Equations (4) and (5):

MID: V_I − W_I    (4)
MIQ: V_I / W_I    (5)

In many cases, MIQ (Mutual Information Quotient) shows better performance than MID (Mutual Information Difference). We cannot test all subsets of features S for a given dataset, so the mRMR algorithm adopts a forward search in its implementation. The procedure is described in Algorithm 1. In the context of statistics or information theory, the term 'variable' is used instead of 'feature'; we will use 'variable' and 'feature' interchangeably according to context. Mutual information can only be applied to two categorical variables (x, y). Therefore, if a dataset has continuous variables, they need to be converted into categorical variables before performing mRMR. The performance of mRMR depends on the quality of its redundancy and relevance measures; if we can improve the measures, we can enhance the performance of mRMR. Several studies [2,10,11] have attempted to improve the redundancy measure W_I by introducing joint mutual information I(x_1, x_2, ..., x_n). Auffarth et al. [12] compared various redundancy and relevance measures, and suggested the 'Fit Criterion' and the 'Value Difference Metric' as the best measures. These measures, however, can be applied only to two-class datasets.
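For illustration, the MIQ-based forward search of Algorithm 1 can be sketched in plain Python as follows. This is written from the standard mRMR definitions, not the authors' implementation; the function names `mutual_info` and `mrmr_miq` are our own, and features are assumed to be categorical (discretized).

```python
# Illustrative sketch of mRMR forward selection with the MIQ criterion.
# Written from the standard definitions; not the authors' code.
import math
from collections import Counter

def mutual_info(x, y):
    """I(x, y) = sum over (a, b) of p(a, b) * log(p(a, b) / (p(a) p(b)))."""
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def mrmr_miq(features, klass, k):
    """Greedy forward search: repeatedly add the feature maximizing V_I / W_I."""
    # Start with the single most relevant feature.
    selected = [max(range(len(features)),
                    key=lambda i: mutual_info(features[i], klass))]
    while len(selected) < k:
        rest = [i for i in range(len(features)) if i not in selected]
        def miq(i):
            v = mutual_info(features[i], klass)             # relevance
            w = sum(mutual_info(features[i], features[j])   # mean redundancy
                    for j in selected) / len(selected)
            return v / w if w > 0 else float("inf")
        selected.append(max(rest, key=miq))
    return selected
```

The exponential number of subsets is avoided by committing to one feature per step, which is why mRMR scales to datasets with thousands of features.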
mRMR is widely used in bioinformatics, including gene selection and disease diagnosis [8,13-15].
In this study, we propose new measures for redundancy and relevance. We suggest Pearson's correlation coefficient [16] as a redundancy measure and the R-value [17] as a relevance measure. Both the R-value and the correlation coefficient are designed for continuous variables, whereas mutual information requires categorical variables. We also implement advanced mRMR (AmRMR) using the new measures. Details of the new measures and AmRMR are provided in the next section.

Pearson's Correlation Coefficient and R-Value
Pearson's correlation coefficient is a measure of the linear correlation between two variables x and y, defined by Equation (6):

Cor(x, y) = Σ_i (x_i − x̄)(y_i − ȳ) / (n · S_x · S_y)    (6)

where x̄, ȳ are the means of x and y, S_x, S_y are the standard deviations of x and y, and n is the number of samples. It has the value range [−1, +1]. If the absolute value of the correlation coefficient is near 1, the variables (x, y) are strongly correlated. In the context of feature selection, if two features (x, y) take similar values, their correlation coefficient will be high; this means the correlation coefficient can be used to measure redundancy. If two features (a, b) have a strong negative correlation, their values will differ. However, from the point of view of information theory, the amount of information in a and b is similar, and they can also be considered redundant features.
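Because negative correlation also indicates redundancy, the absolute value of the coefficient is what matters. A minimal Python sketch (our own helper names, not part of the published R code) could look like this:

```python
# Minimal sketch: absolute Pearson correlation as a pairwise redundancy
# score, so that strong negative correlation counts as redundant too.
import math

def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def redundancy(x, y):
    # abs() maps r = -1 and r = +1 to the same redundancy score.
    return abs(pearson(x, y))
```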
The R-value was proposed as an evaluation measure for datasets [17,18]. The motivation for using the R-value is that the quality of a dataset has a profound effect on classification accuracy, and the overlapping areas among classes in a dataset strongly determine its quality. For example, dataset D1 produces higher classification accuracy than dataset D2 in Figure 1. An overlapping area is a region where samples from different classes are gathered closely to one another. If an unknown sample is located in an overlapping area, it is difficult to determine its class label. Therefore, the size of the overlapping areas can be a criterion to measure the quality of features or of the entire dataset [19]. The R-value captures the overlapping areas among classes in a dataset. It uses a k-nearest neighbor algorithm to define overlapping areas: if an instance has many neighbors with different class values, it may belong to an overlapping area. Suppose DS is a given dataset, S is a subset of features, and C is a class vector. Algorithm 2 describes the procedure to calculate the R-value of S. The R-value has range [0, 1], and if the R-value of S is near 1, then S is likely to produce lower classification accuracy.
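One simplified reading of Algorithm 2 can be sketched as follows. This is an illustrative interpretation, not the authors' exact code; the threshold parameter `theta` and the helper names are our assumptions.

```python
# Simplified sketch of the R-value idea: an instance whose k nearest
# neighbors mostly carry other class labels lies in an overlapping area.
# Illustrative reading of Algorithm 2, not the authors' exact procedure.
def r_value(X, C, k=3, theta=1):
    """Fraction of instances with more than theta of their k nearest
    neighbors in a different class. X: list of feature tuples, C: labels."""
    def dist2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    n = len(X)
    overlapping = 0
    for i in range(n):
        neighbors = sorted((j for j in range(n) if j != i),
                           key=lambda j: dist2(X[i], X[j]))[:k]
        if sum(1 for j in neighbors if C[j] != C[i]) > theta:
            overlapping += 1
    return overlapping / n
```

Under this reading, two well-separated clusters yield an R-value near 0, while thoroughly mixed classes yield a value near 1, matching the stated [0, 1] range.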

Formal Description of AmRMR
Suppose we evaluate a feature set S that has m features. The new relevance measure V_R for S is simply defined using the R-value, as in Equation (7):

V_R = 1 − Rvalue(S, C)    (7)

If a feature set S produces a high R-value, large overlapping areas exist between classes, which may cause lower classification accuracy. Therefore, the lower the R-value, the better the classification. We define the new relevance measure as 1 − Rvalue so that a lower R-value receives a higher score.
To develop a better redundancy measure, we replace mutual information with the correlation coefficient. The original redundancy measure, W_I, is simply the mean of the mutual information over pairs of features in S. From several experiments, we found that when the correlation of a specific pair of features is high, that value is more important than the mean over all pairs. Therefore, we calculate the maximum (maxC) and the mean (meanC) of the absolute pairwise correlation coefficients, as in Equations (8) and (9):

maxC = max_{i,j ∈ S, i ≠ j} abs(Cor(f_i, f_j))    (8)
meanC = mean_{i,j ∈ S, i ≠ j} abs(Cor(f_i, f_j))    (9)

and choose maxC as the new redundancy measure W_R if maxC ≥ 0.5; otherwise, W_R = meanC. If the absolute value of the correlation coefficient of variables (x, y) is ≥ 0.5, we accept that they have a meaningful correlation. In Equations (8) and (9), Cor() is the correlation coefficient function, abs() is the absolute value function, and max() is the maximum value function.
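The W_R rule described above (maximum absolute pairwise correlation if it crosses the 0.5 threshold, otherwise the mean) can be sketched in Python as follows; the helper names are our own, and this is an illustration rather than the published implementation.

```python
# Sketch of the proposed redundancy measure W_R over a feature set S:
# use maxC when it signals a meaningful correlation (>= 0.5), else meanC.
import math
from itertools import combinations

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def w_r(features):
    """features: list of feature vectors (the set S, with at least 2 features)."""
    cors = [abs(pearson(f, g)) for f, g in combinations(features, 2)]
    max_c = max(cors)
    mean_c = sum(cors) / len(cors)
    return max_c if max_c >= 0.5 else mean_c
```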
From the new definitions of the relevance measure V_R and the redundancy measure W_R, we redefine MID and MIQ as RVD and RVQ, respectively. RVD is analogous to MID. We define RVQ in a more sophisticated manner: in the evaluation function RVQ, V_R indicates benefit and W_R indicates penalty, so (V_R/W_R) should not be larger than V_R. However, since 0 ≤ V_R, W_R ≤ 1 in our equations, sometimes (V_R/W_R) > V_R. Therefore, we adjust for this discrepancy in Equation (12).
We have described a new evaluation measure for a feature subset S. As mentioned earlier, we cannot evaluate all candidate subsets S for a given dataset; thus, a heuristic approach is required. We implemented AmRMR based on the mRMR code. It applies a forward search to reduce the search space. Algorithm 3 describes the pseudocode for AmRMR. We only consider the case of RVQ.
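The forward-search skeleton of Algorithm 3 can be sketched as below. For brevity this sketch scores candidate sets with the difference form V_R − W_R (RVD); the paper's RVQ criterion additionally applies the adjustment of Equation (12), which is not reproduced here. The `relevance` and `redundancy` arguments are assumed to be implementations of the measures defined above.

```python
# Sketch of the AmRMR greedy forward search: grow S one feature at a time,
# trading relevance against redundancy. Scoring uses the RVD (difference)
# form for simplicity; the paper's Algorithm 3 uses the RVQ ratio.
def amrmr(features, klass, k, relevance, redundancy):
    """features: list of feature vectors; returns indices of k selected features.
    relevance(S, klass) and redundancy(S) score a list of feature vectors."""
    # Seed with the single most relevant feature.
    selected = [max(range(len(features)),
                    key=lambda i: relevance([features[i]], klass))]
    while len(selected) < k:
        rest = [i for i in range(len(features)) if i not in selected]
        def score(i):
            cand = [features[j] for j in selected] + [features[i]]
            return relevance(cand, klass) - redundancy(cand)
        selected.append(max(rest, key=score))
    return selected
```

Injecting the measures as parameters mirrors how the same forward search serves both mRMR and AmRMR, differing only in how subsets are scored.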

Results
To compare the mRMR and AmRMR algorithms, we collected several types of datasets with different numbers of features, classes, and instances. Table 1 summarizes the datasets. We obtained GDS2546, GDS2547, and GDS3715 from the NCBI Gene Expression Omnibus [20], and arcene and madelon from the NIPS2003 feature selection challenge [21]; the others were obtained from the UCI Machine Learning Repository [22]. We selected 5 to 25 features using mRMR and AmRMR, and performed classification tests using k-nearest neighbor (KNN), support vector machine (SVM), C5.0 (C50), and random forest (RF). To avoid overfitting, we adopted k-fold cross-validation with k = 10. In the case of arcene and madelon, we selected the feature set from the training dataset and performed classification tests on the validation datasets, because these datasets provide separate training/validation splits. Tables 2-5 summarize the results. In most cases, AmRMR produces better performance than mRMR. Figure 2 summarizes the classification results in Tables 2-5; each accuracy is the average classification accuracy over 5 to 25 selected features. Each graph clearly shows that AmRMR chooses better features than mRMR.
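The evaluation protocol above (select features, then estimate accuracy by k-fold cross-validation) can be illustrated with a minimal plain-Python sketch. A 1-nearest-neighbor classifier stands in for the KNN/SVM/C5.0/RF classifiers actually used; the fold-assignment scheme here is a simple interleaving, not necessarily the one used in the experiments.

```python
# Minimal sketch of k-fold cross-validation with a 1-nearest-neighbor
# classifier, illustrating the accuracy-estimation step of the experiments.
def one_nn_predict(train_X, train_y, x):
    """Predict the label of x as the label of its nearest training point."""
    d2 = lambda a, b: sum((u - v) ** 2 for u, v in zip(a, b))
    return min(zip(train_X, train_y), key=lambda t: d2(t[0], x))[1]

def cv_accuracy(X, y, folds=10):
    """Average accuracy over interleaved folds (fold f tests indices f, f+folds, ...)."""
    n = len(X)
    correct = 0
    for f in range(folds):
        test_idx = set(range(f, n, folds))
        tr_X = [X[i] for i in range(n) if i not in test_idx]
        tr_y = [y[i] for i in range(n) if i not in test_idx]
        for i in test_idx:
            if one_nn_predict(tr_X, tr_y, X[i]) == y[i]:
                correct += 1
    return correct / n
```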

Discussion
In general, the R-value is better than mutual information as a measure of relevance between features and the class vector. Mutual information is a statistical measure and needs categorical values to calculate probabilities. Therefore, if a target dataset contains continuous values, we need to discretize them before applying mRMR, and information loss is inevitable in discretization. The R-value does not need discretization and is thus more advantageous than mutual information when a dataset has continuous values. Another weak point of mutual information is that it can calculate I(f_i, C), where f_i is a feature and C is a class vector, but it cannot calculate I({f_1, f_2, f_3}, C), because it is based on probability. Therefore, it uses (I(f_1, C) + I(f_2, C) + I(f_3, C))/3 to calculate the relevance between {f_1, f_2, f_3} and C. This calculation cannot fully capture the interactions among {f_1, f_2, f_3}. In contrast, the R-value is a dimensionless distance-based measure, so Rvalue({f_1, f_2, f_3}, C) can be calculated directly.
mRMR and AmRMR output different feature sets from the same dataset, resulting in different classification accuracies. Table 6 shows a list of 25 features from the GDS3715 dataset evaluated by mRMR and AmRMR. In the case of Arcene, there is only one shared feature (9970) between mRMR and AmRMR. In the case of Madelon, there are five shared features. This means that mRMR and AmRMR have different evaluation criteria for feature selection. Figure 3 shows PCA (Principal Component Analysis) plots for Arcene and Madelon using five features selected by mRMR and AmRMR. As we can see, the PCA plots of AmRMR show a clearer separation of class instances than those of mRMR. This explains why the feature set of AmRMR produces better classification accuracy than the one chosen by mRMR.
Table 7 shows the average improvement in classification accuracy over the 10 datasets. For the four classifiers, accuracies are improved by 4-10%. This result indicates that the proposed redundancy and relevance measures enhance performance compared to the original mRMR measures. The KNN classifier shows a remarkably improved result (10.7%). The reason lies in the R-value, the measure of relevance: both KNN and the R-value are based on the k-nearest neighbor algorithm, so a set of features with a good R-value is likely to produce good classification accuracy with KNN. The relationship between the R-value and KNN is similar to the relationship between the classifier and the feature evaluation measure in the wrapper method.
The proposed redundancy and relevance measures are tailored to datasets that have continuous values, which means they are not suitable for datasets with categorical values; the mutual information measure in the original mRMR is more suitable for categorical datasets. Nevertheless, AmRMR is useful, because many high-dimensional continuous datasets exist, such as microarray data, disease diagnosis data, image analysis data, and so on. To show the effect of AmRMR, we also compare it with three filter feature selection methods: mutual information (MI), linear correlation (Linear), and rank correlation (Rank.Corr). The conditions of comparison are the same as for mRMR. For simplicity, we test only KNN and SVM.
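The baseline filter methods score each feature individually against the class vector and keep the top-k. A sketch of the linear (Pearson) and rank (Spearman) variants follows; the helper names are ours, and ties in the rank computation are broken arbitrarily rather than averaged as a full Spearman implementation would do.

```python
# Sketch of simple filter feature selection: rank features by the absolute
# value of a per-feature relevance score against the class vector.
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rank(v):
    """Positions of values in sorted order (ties broken arbitrarily)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    for pos, i in enumerate(order):
        r[i] = float(pos)
    return r

def spearman(x, y):
    # Rank correlation: Pearson correlation of the ranks.
    return pearson(rank(x), rank(y))

def filter_select(features, klass, k, score):
    """Keep the k features with the highest absolute relevance score."""
    idx = sorted(range(len(features)),
                 key=lambda i: -abs(score(features[i], klass)))
    return idx[:k]
```

Unlike mRMR and AmRMR, these filters ignore redundancy between the selected features, which is the gap the redundancy term W_R is meant to close.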

Conclusions
In this study, we proposed new redundancy and relevance measures to improve mRMR feature selection. The proposed method provides better performance than mRMR on suitable target datasets. However, it should be noted that the proposed method has limitations on the types of datasets it can analyze. The performance of feature selection depends on the characteristics of the target dataset. Therefore, users are encouraged to test both mRMR and AmRMR, and choose the better feature subset according to the test results. The entire set of R code for AmRMR is available at https://bitldku.github.io/home/sw/AmRMR.html.