An Attribute Reduction Method Using Neighborhood Entropy Measures in Neighborhood Rough Sets

Attribute reduction is an important preprocessing step for data mining and has become a hot research topic in rough set theory. Neighborhood rough set theory can overcome the shortcoming that classical rough set theory may lose some useful information in the process of discretization for continuous-valued data sets. In this paper, to improve the classification performance of complex data, a novel attribute reduction method using neighborhood entropy measures in neighborhood rough sets is proposed. The method combines the algebra view with the information view, and it can deal with continuous data whilst maintaining the classification information of the original attributes. First, to efficiently analyze the uncertainty of knowledge in neighborhood rough sets, a new average neighborhood entropy is presented by combining neighborhood approximate precision with neighborhood entropy; it builds on the strong complementarity between the algebra definition of attribute significance and the definition of the information view. Then, a concept of decision neighborhood entropy is investigated for handling the uncertainty and noisiness of neighborhood decision systems, which integrates the credibility degree with the coverage degree of neighborhood decision systems to fully reflect the decision ability of attributes. Moreover, some of their properties are derived and the relationships among these measures are established, which helps to understand the essence of knowledge content and the uncertainty of neighborhood decision systems. Finally, a heuristic attribute reduction algorithm is proposed to improve the classification performance of complex data sets. The experimental results on an illustrative instance and several public data sets demonstrate that the proposed method is effective for selecting the most relevant attributes with high classification performance.


Introduction
Attribute reduction in rough set theory has been recognized as an important feature selection method, which aims to select the most representative attribute subset with a high resolution by eliminating redundant and unimportant attributes [1]. Attribute reduction methods are widely applied in the fields of data classification, data mining, machine learning, and pattern recognition [2][3][4][5][6]. With the development of the internet, the scale of data keeps growing, and some real-world databases may contain thousands of attributes. In order to shorten the processing time and obtain better generalization, the attribute reduction problem has attracted more and more attention in recent years [5,7,8].
In the classical rough set theory, there are two forms of definition for attribute reduction. One is the algebra definition based on approximate precision, which determines whether certain conditional attributes can be removed according to the variation of approximate precision and considers the effect of attributes on the deterministic subsets in the universe. The other is the definition of the information view based on information entropy, which determines whether certain conditional attributes can be removed according to the changes of conditional entropy and considers the effect of attributes on the indeterminate subsets in the universe [9]. Many attribute reduction algorithms so far are based on the algebra definition. Mi et al. [10] introduced the concepts of a lower distribution reduct and an upper distribution reduct based on the variable precision rough sets, and obtained an approach for knowledge reduction in variable precision rough sets. Syau et al. [11] provided the characterizations of lower and upper approximations for the connection between the concepts of variable precision generalized rough set model and neighborhood systems by introducing minimal neighborhood systems. Moreover, as a measure to evaluate the uncertainty of discrete sample spaces, information entropy is a significant tool for characterizing the discriminative information of attribute subsets [12]. Information entropy based on neighborhood systems has been established, and extensions of information entropy and its variants have been adapted for attribute reduction. Gao et al. [13] developed a heuristic attribute reduction algorithm based on the maximum decision entropy in the decision-theoretic rough set model. Dai et al. [14] proposed a framework for attribute reduction in interval-valued data from the information view.
It is known that there is a strong complementarity between the algebra view and the information view of attribute importance, and the two views can be combined to produce a more comprehensive measurement mechanism [15]. Wang [15] summarized the reduct in rough sets from algebra view and information view. This inspires the authors to investigate new attribute reduction methods from algebra view and information view in this paper.
The classical rough set theory is established on the equivalence approximate space and is only compatible with discrete data sets, so it can be of little use for continuous numerical data [13,15,16]. In general, continuous numerical data need to be discretized before processing, which leads to the loss of information (including the neighborhood structure information and order structure information in real spaces) [17,18]. To overcome this drawback, many extensions of classical rough set theory have been presented [19][20][21][22][23][24][25][26], such as fuzzy rough set [21,22], tolerance approximate models [23], similarity rough approximate model [24], covering approximation model [25], and neighborhood granular model [26]. Among all the extensions, Hu et al. [18] developed a neighborhood rough set model that processes both numerical and categorical data sets via a neighborhood relation controlled by neighborhood parameters. Wang et al. [27] presented a local neighborhood rough set combining the neighborhood rough set and local rough set, to be applied to rough data analysis in big data. Fan et al. [28] designed an attribute reduction algorithm based on the max-decision neighborhood rough set model. Chen et al. [29] investigated approaches to attribute reduction in parallel using dominance-based neighborhood rough sets. Therefore, this paper studies an attribute reduction algorithm based on neighborhood rough sets by making full use of the advantages of rough sets.
It is noted that the reduction calculation of decision neighborhood systems is a key problem in neighborhood rough set theory. In addition, the reducts of an information system need to be obtained to further extract rule-like knowledge from information systems [29]. In practical decision-making applications, the certainty factor and the object coverage factor of a rule are two important standards for evaluating the decision ability of decision systems [30,31]. However, some of the existing knowledge reduction methods cannot objectively reflect the change of decision ability for classification. It is known that the credibility degree and the coverage degree can efficiently reflect the classification ability of conditional attributes with respect to the decision attribute [30]. The conditional attributes with higher credibility and coverage degrees are more important with respect to the decision attribute. Therefore, it is necessary to investigate a new uncertainty measure and an effective heuristic search algorithm. Sun et al. [32] proposed a concept of decision degree based on the notions of the certainty factor and the coverage factor of a rule in rough sets, which includes the degree of sufficiency of a proposition and the degree of its necessity. Until now, these notions have not been considered in neighborhood rough sets. This inspires the authors to investigate a new measure to effectively illustrate the classification ability and the decision ability of neighborhood decision systems. Based on this idea, the concepts of coverage and credibility degrees are introduced into neighborhood decision systems in this paper as measures to reflect the classification ability of conditional attributes with respect to decision attributes. In order to fully reflect the decision ability of attributes, the credibility degree and the coverage degree based on neighborhood rough sets are integrated into neighborhood entropy measures.
Then, a heuristic attribute reduction method based on decision neighborhood entropy is presented to address the uncertainty and noisiness of complex data sets in neighborhood rough sets.
The remainder of this paper is organized as follows: Section 2 reviews some basic concepts of rough sets, information entropy measures, and neighborhood rough sets. In Section 3, some uncertainty measures based on neighborhood entropy in neighborhood decision systems are investigated, their properties are derived, and the relationships among these measures are established. An attribute reduction algorithm based on decision neighborhood entropy with complexity analysis is designed. Section 4 analyzes the classification experiments conducted on several public data sets. Finally, Section 5 summarizes the study.

Rough Sets
Given a decision system DS = (U, C, D, V, f), usually written more simply as DS = (U, C, D), where $U = \{x_1, x_2, \cdots, x_n\}$ is a sample set named universe, $C = \{a_1, a_2, \cdots, a_m\}$ is a conditional attribute set that describes the samples, D is a set of classification attributes, $f: U \times (C \cup D) \to V$ is an information function which associates a unique value of each attribute with every object belonging to U, and f(a, x) represents the value of x ∈ U on attribute a ∈ C∪D. For any B⊆C and two samples x, y ∈ U, the equivalence relation is described as

$$IND(B) = \{(x, y) \in U \times U \mid f(a, x) = f(a, y), \ \forall a \in B\}. \quad (1)$$

Then, U/IND(B) is called a partition that is composed of the equivalence classes, and for any sample x ∈ U, $[x]_B = \{y \in U \mid (x, y) \in IND(B)\}$ is the equivalence class of x with respect to B.

In a decision system DS = (U, C, D) with B⊆C and X⊆U, the lower approximation set and the upper approximation set of X with respect to B can be expressed, respectively, as

$$\underline{B}(X) = \{x \in U \mid [x]_B \subseteq X\}, \quad (2)$$

$$\overline{B}(X) = \{x \in U \mid [x]_B \cap X \neq \emptyset\}. \quad (3)$$
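The partition U/IND(B) and the two approximation sets can be illustrated with a short Python sketch. The table, the attribute names, and the helper names `partition` and `approximations` are hypothetical and only serve the illustration; they are not taken from the paper.

```python
from collections import defaultdict


def partition(table, attrs):
    """Group sample ids into the equivalence classes of U/IND(attrs)."""
    blocks = defaultdict(set)
    for x, row in table.items():
        # Samples with identical values on every attribute share a block.
        blocks[tuple(row[a] for a in attrs)].add(x)
    return list(blocks.values())


def approximations(table, attrs, target):
    """Return the lower and upper approximation sets of `target`."""
    lower, upper = set(), set()
    for block in partition(table, attrs):
        if block <= target:   # block is entirely inside the target set
            lower |= block
        if block & target:    # block overlaps the target set
            upper |= block
    return lower, upper


# Hypothetical discrete decision table (illustrative values only).
table = {
    "x1": {"a": 0, "b": 1},
    "x2": {"a": 0, "b": 1},
    "x3": {"a": 1, "b": 0},
    "x4": {"a": 1, "b": 1},
}
lower, upper = approximations(table, ["a", "b"], {"x1", "x3"})
print(sorted(lower), sorted(upper))
```

Here x1 and x2 are indiscernible on {a, b}, so x1 cannot belong to the lower approximation of {x1, x3}, while both x1 and x2 enter the upper approximation.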

Information Entropy Measures
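Given a decision system DS = (U, C, D) with B⊆C, and $U/B = \{X_1, X_2, \cdots, X_N\}$, then the information entropy of B is described as

$$H(B) = -\sum_{i=1}^{N} p(X_i) \log p(X_i), \quad (4)$$

where $p(X_i) = |X_i| / |U|$ is the probability of $X_i \in U/B$, and $|X_i|$ denotes the cardinality of the equivalence class $X_i$.

Given a decision system DS = (U, C, D) with $B_1, B_2 \subseteq C$, $U/B_1 = \{X_1, X_2, \cdots, X_N\}$, and $U/B_2 = \{Y_1, Y_2, \cdots, Y_M\}$, then the joint entropy of $B_1$ and $B_2$ is denoted as

$$H(B_1 \cup B_2) = -\sum_{i=1}^{N} \sum_{j=1}^{M} p(X_i \cap Y_j) \log p(X_i \cap Y_j), \quad (5)$$

where $p(X_i \cap Y_j) = |X_i \cap Y_j| / |U|$, i = 1, 2, ···, N, and j = 1, 2, ···, M.

Given a decision system DS = (U, C, D) with $B_1, B_2 \subseteq C$, $U/B_1 = \{X_1, X_2, \cdots, X_N\}$, and $U/B_2 = \{Y_1, Y_2, \cdots, Y_M\}$, then the conditional information entropy of $B_2$ with respect to $B_1$ is defined as

$$H(B_2 \mid B_1) = -\sum_{i=1}^{N} p(X_i) \sum_{j=1}^{M} p(Y_j \mid X_i) \log p(Y_j \mid X_i), \quad (6)$$

where $p(Y_j \mid X_i) = |Y_j \cap X_i| / |X_i|$, i = 1, 2, ···, N, and j = 1, 2, ···, M.

Given a decision system DS = (U, C, D) with $B_1, B_2 \subseteq C$, the mutual information between $B_1$ and $B_2$ is defined as

$$I(B_1; B_2) = H(B_2) - H(B_2 \mid B_1). \quad (7)$$

Given a decision system DS = (U, C, D) with $B_1, B_2 \subseteq C$, the mutual information has the following properties:
(1) $I(B_1; B_2) = I(B_2; B_1) \ge 0$;
(2) $I(B_1; B_2) = H(B_1) + H(B_2) - H(B_1 \cup B_2)$;
(3) $I(B_1; B_2) = H(B_1) - H(B_1 \mid B_2) = H(B_2) - H(B_2 \mid B_1)$.

Given a decision system DS = (U, C, D) with B⊆C, if $I(B; D) = I(C; D)$ and $I(B - \{a\}; D) < I(B; D)$ for any a∈B, then B is a reduct of C with respect to D.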
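As a minimal sketch of these measures, the snippet below computes the entropy, joint entropy, conditional entropy, and mutual information of attributes given as per-sample value lists. It assumes base-2 logarithms, obtains the conditional entropy through the chain rule H(B2 | B1) = H(B1 ∪ B2) − H(B1), and uses hypothetical function names and toy data.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (base 2, an assumption) of the induced partition."""
    n = len(values)
    return -sum(c / n * log2(c / n) for c in Counter(values).values())


def joint_entropy(v1, v2):
    """Entropy of the partition induced by both attributes together."""
    return entropy(list(zip(v1, v2)))


def conditional_entropy(v2, v1):
    """H(B2 | B1), computed through the chain rule."""
    return joint_entropy(v1, v2) - entropy(v1)


def mutual_information(v1, v2):
    """I(B1; B2) = H(B2) - H(B2 | B1)."""
    return entropy(v2) - conditional_entropy(v2, v1)


# Hypothetical attribute columns over four samples.
decision = [0, 0, 1, 1]
informative = [5, 5, 7, 7]   # determines the decision completely
irrelevant = [0, 1, 0, 1]    # statistically independent of the decision

print(mutual_information(informative, decision))
print(mutual_information(irrelevant, decision))
```

An attribute that determines the decision carries the full decision entropy as mutual information, whereas an independent attribute carries none; this is the quantity the information-view reduct preserves.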

Neighborhood Rough Sets
Given a neighborhood decision system NDS = (U, C, D, V, f, ∆, δ), usually written more simply as NDS = (U, C, D, δ), where $U = \{x_1, x_2, \cdots, x_n\}$ is a sample set named universe, $C = \{a_1, a_2, \cdots, a_m\}$ is a conditional attribute set that describes the samples, D = {d} is a decision attribute set that contains only one decision attribute, $V = \bigcup_{a \in C \cup D} V_a$, $V_a$ is a value set of attribute a, $f: U \times (C \cup D) \to V$ is a map function, $\Delta: U \times U \to [0, \infty)$ is a distance function, and δ is a neighborhood parameter with 0 ≤ δ ≤ 1.
For any samples x, y, z ∈ U on a subset B, the distance function $\Delta_B(x, y)$ satisfies the following three conditions:
(1) $\Delta_B(x, y) \ge 0$, and $\Delta_B(x, y) = 0$ if and only if x = y;
(2) $\Delta_B(x, y) = \Delta_B(y, x)$;
(3) $\Delta_B(x, z) \le \Delta_B(x, y) + \Delta_B(y, z)$.

It is well known that there are three classical metrics, namely the Manhattan, Euclidean, and Chebychev distance functions, where the Euclidean distance function effectively reflects the basic information of the unknown data [33].

Given a neighborhood decision system NDS = (U, C, D, δ) with B⊆C, for any x, y ∈ U, the Euclidean distance function between x and y is expressed as

$$\Delta_B(x, y) = \sqrt{\sum_{a \in B} |f(a, x) - f(a, y)|^2}. \quad (8)$$

Given a neighborhood decision system NDS = (U, C, D, δ) with B⊆C, the similarity relation induced by B is defined as

$$NR_{\delta}(B) = \{(x, y) \in U \times U \mid \Delta_B(x, y) \le \delta\}. \quad (9)$$

Given a neighborhood decision system NDS = (U, C, D, δ) with B⊆C, for any x ∈ U, the neighborhood class of x with respect to B is described as

$$n_B^{\delta}(x) = \{y \in U \mid (x, y) \in NR_{\delta}(B)\}. \quad (10)$$

Given a neighborhood decision system NDS = (U, C, D, δ) with B⊆C and X⊆U, the neighborhood lower approximation set and the neighborhood upper approximation set of X with respect to B are described, respectively, as

$$\underline{B}_{\delta}(X) = \{x_i \in U \mid n_B^{\delta}(x_i) \subseteq X\}, \quad (11)$$

$$\overline{B}_{\delta}(X) = \{x_i \in U \mid n_B^{\delta}(x_i) \cap X \neq \emptyset\}. \quad (12)$$

Given a neighborhood decision system NDS = (U, C, D, δ) with B⊆C and X⊆U, $\underline{B}_{\delta}(X)$ is the neighborhood lower approximation set of X with respect to B, and $\overline{B}_{\delta}(X)$ is the neighborhood upper approximation set of X with respect to B, then the approximate precision of X with respect to B is described as

$$p_B(X) = \frac{|\underline{B}_{\delta}(X)|}{|\overline{B}_{\delta}(X)|}. \quad (13)$$

Given a neighborhood decision system NDS = (U, C, D, δ) with B⊆C, $U/D = \{X_1, X_2, \cdots, X_N\}$, then the neighborhood lower approximation set and the neighborhood upper approximation set of D with respect to B are described respectively as

$$\underline{B}_{\delta}(D) = \bigcup_{i=1}^{N} \underline{B}_{\delta}(X_i), \quad (14)$$

$$\overline{B}_{\delta}(D) = \bigcup_{i=1}^{N} \overline{B}_{\delta}(X_i), \quad (15)$$

where $\underline{B}_{\delta}(X_i)$ denotes the neighborhood lower approximation set of $X_i$ with respect to B, $\overline{B}_{\delta}(X_i)$ denotes the neighborhood upper approximation set of $X_i$ with respect to B, and i = 1, 2, · · · , N.

Given a neighborhood decision system NDS = (U, C, D, δ) with B⊆C, then the neighborhood approximate precision of D with respect to B is described as

$$p_B(D) = \frac{|\underline{B}_{\delta}(D)|}{|\overline{B}_{\delta}(D)|}. \quad (16)$$

The neighborhood approximate precision can be used to reflect the completeness of the knowledge of a set, but this precision measure does not take into account the size of the granules that are completely included in the lower approximation set. Therefore, it is not sufficient to only consider attribute reduction from the algebra view.
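The neighborhood constructions above can be sketched in a few lines of Python. The sample values, the decision labels, δ = 0.2, and all function names are hypothetical choices for illustration; samples are addressed by their index in the list.

```python
from math import sqrt


def neighborhood(data, i, attrs, delta):
    """Indices of samples within Euclidean distance delta of sample i."""
    xi = data[i]
    return {
        j for j, xj in enumerate(data)
        if sqrt(sum((xi[a] - xj[a]) ** 2 for a in attrs)) <= delta
    }


def neighborhood_approximations(data, attrs, delta, target):
    """Neighborhood lower and upper approximation sets of `target`."""
    lower, upper = set(), set()
    for i in range(len(data)):
        n_i = neighborhood(data, i, attrs, delta)
        if n_i <= target:
            lower.add(i)
        if n_i & target:
            upper.add(i)
    return lower, upper


def precision_of_decision(data, labels, attrs, delta):
    """Ratio of the lower to the upper approximation of the decision."""
    lower_d, upper_d = set(), set()
    for label in set(labels):
        block = {i for i, l in enumerate(labels) if l == label}
        lo, up = neighborhood_approximations(data, attrs, delta, block)
        lower_d |= lo
        upper_d |= up
    return len(lower_d) / len(upper_d)


# Hypothetical normalized samples and decision labels.
data = [
    {"a": 0.1, "b": 0.1},
    {"a": 0.4, "b": 0.2},
    {"a": 0.5, "b": 0.8},
    {"a": 0.9, "b": 0.9},
]
labels = [0, 0, 1, 1]

print(precision_of_decision(data, labels, ["a"], 0.2))
print(precision_of_decision(data, labels, ["a", "b"], 0.2))
```

With attribute a alone, the second and third samples fall into each other's neighborhoods although they carry different labels, so only two of the four samples lie in the lower approximation of D; adding attribute b separates them and the precision rises to 1.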

Attribute Reduction Method Using Neighborhood Entropy Measures in Neighborhood Decision Systems
Attribute reduction is a core part of the rough set theory [13]. In the classical rough set theory, there are two forms of definition for attribute reduction: One is the algebra definition based on set theory; the other is the definition of information view based on information entropy. There is a strong complementarity between the algebra definition of attribute significance and the definition of information view. The former considers the influence of attributes on the defined subset in the domain of theory, while the latter considers the influence of attributes on the uncertain subset in the domain of theory. Therefore, the two views can be combined to produce a more comprehensive measurement mechanism. In rough sets, the equivalence classes-based information entropy does not work for numerical data. Then, for continuous data sets, a discretization should be performed before further processing. However, the discretization may result in information loss, and it is difficult to employ mutual information in attribute evaluation due to the difficulty in estimating the probability density of attributes [36]. To address this issue, the concept of neighborhood can be combined with information theory to extend Shannon entropy, and then some correlative concepts of neighborhood entropy are defined to measure the uncertainty of numerical data. Then, some concepts of neighborhood entropy-based uncertainty measures are presented to measure the uncertainty of knowledge in neighborhood decision systems, some important properties and relationships of these measures are deduced respectively as well, and a heuristic attribute reduction algorithm is investigated to improve the classification performance of complex data sets.

Neighborhood Entropy-Based Uncertainty Measures
Given a neighborhood decision system NDS = (U, C, D, δ) with B⊆C, $n_B^{\delta}(x_i)$ is a neighborhood class of $x_i \in U$, then Hu et al. [37] described the neighborhood entropy of $x_i$ as

$$H_{\delta}^{x_i}(B) = -\log \frac{|n_B^{\delta}(x_i)|}{|U|}. \quad (17)$$

Given a neighborhood decision system NDS = (U, C, D, δ) with B⊆C, Hu et al. [37] and Chen et al. [33] computed the average neighborhood entropy of the sample set as

$$H_{\delta}(B) = -\frac{1}{|U|} \sum_{i=1}^{|U|} \log \frac{|n_B^{\delta}(x_i)|}{|U|}. \quad (18)$$

The concept of neighborhood entropy is defined based on the information entropy theory, which granulates the universe by the neighborhood relation and is used to measure the uncertainty and classification ability of a numerical knowledge classification system [38]. In this paper, the neighborhood approximate precision is combined with the neighborhood entropy to reflect the uncertainty of knowledge, and then a new average neighborhood entropy is defined as follows.

Definition 1. Given a neighborhood decision system NDS = (U, C, D, δ) with B⊆C, a new average neighborhood entropy of the sample set is defined as

From Definition 1, the new average neighborhood entropy combines the neighborhood approximate precision with the average neighborhood entropy, so it can make full use of the advantages of the algebra view and the information view and overcome the drawbacks of the traditional precision measure.
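A minimal sketch of the average neighborhood entropy attributed above to Hu et al. [37], that is, the mean of −log(|n(x_i)| / |U|) over all samples, is given below. It does not implement the new measure of Definition 1. Base-2 logarithms, the toy data, and the function names are assumptions made for the illustration.

```python
from math import log2, sqrt


def neighborhood_sizes(data, attrs, delta):
    """Size of the delta-neighborhood of every sample (Euclidean distance)."""
    return [
        sum(
            1 for y in data
            if sqrt(sum((x[a] - y[a]) ** 2 for a in attrs)) <= delta
        )
        for x in data
    ]


def average_neighborhood_entropy(data, attrs, delta):
    """Mean of -log2(|n(x_i)| / |U|) over all samples."""
    n = len(data)
    sizes = neighborhood_sizes(data, attrs, delta)
    return -sum(log2(s / n) for s in sizes) / n


# Hypothetical normalized samples.
data = [
    {"a": 0.1, "b": 0.1},
    {"a": 0.4, "b": 0.2},
    {"a": 0.5, "b": 0.8},
    {"a": 0.9, "b": 0.9},
]

print(average_neighborhood_entropy(data, ["a"], 0.2))
print(average_neighborhood_entropy(data, ["a", "b"], 0.2))
```

Adding an attribute can only shrink the neighborhoods, so the entropy grows from 1.5 to 2.0 in this toy case, the maximum log2(4) that is reached when every neighborhood is a singleton.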

Property 1. Given a neighborhood decision system NDS
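Proof. It follows from Equation (10)

Proof. Suppose that $B_1 \subseteq B_2 \subseteq C$, and similar to the proof of Proposition 1 in [33], one has that $n_{B_1}^{\delta}(x_i) \supseteq n_{B_2}^{\delta}(x_i)$ for any $x_i \in U$. It follows from Equations (11) and (12) that $\underline{B_1}_{\delta}(X) \subseteq \underline{B_2}_{\delta}(X)$ and $\overline{B_1}_{\delta}(X) \supseteq \overline{B_2}_{\delta}(X)$. By Equation (16), one has that $p_{B_1}(D) \le p_{B_2}(D)$. Hence, it can be obtained from Equation (19)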

and [x i ] D is an equivalence class of x i ∈U generated by IND(D), then a decision neighborhood entropy of B and D is defined as
In a decision system DS = (U, C, D) with any $x_i \in U$, Pawlak et al. [34] and Wang et al. [30] express a decision rule as $d_{x_i}: des([x_i]_C) \to des([x_i]_D)$, where $|[x_i]_C \cap [x_i]_D| / |[x_i]_C|$ is called the credibility degree of decision rule $d_{x_i}$, and $|[x_i]_C \cap [x_i]_D| / |[x_i]_D|$ is called the coverage degree of decision rule $d_{x_i}$. Wang et al. [30] stated that the credibility degree and the coverage degree can reflect the classification ability of conditional attributes with respect to the decision attribute, and that the conditional attributes with higher credibility and coverage degrees are more important with respect to the decision attribute. Furthermore, Tsumoto [31] emphasized that the credibility degree indicates the adequacy of the proposition, and the coverage degree describes the necessity of the proposition. Then, in order to fully reflect the decision ability and the classification ability of neighborhood decision systems, this paper investigates some neighborhood entropy-based uncertainty measures by combining the credibility degree with the coverage degree in neighborhood rough sets.
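To make the two degrees concrete, the sketch below computes them for one sample of a small discrete decision table, using the standard certainty and coverage factors of a rule: the overlap of the condition class and the decision class of a sample, divided by the size of the condition class and of the decision class, respectively. The table and the function names are hypothetical.

```python
from fractions import Fraction


def equivalence_class(table, x, attrs):
    """All samples that agree with x on every attribute in attrs."""
    key = [table[x][a] for a in attrs]
    return {y for y in table if [table[y][a] for a in attrs] == key}


def credibility_and_coverage(table, x, cond_attrs, dec_attrs):
    """Credibility and coverage degrees of the decision rule induced by x."""
    cond = equivalence_class(table, x, cond_attrs)
    dec = equivalence_class(table, x, dec_attrs)
    common = len(cond & dec)
    return Fraction(common, len(cond)), Fraction(common, len(dec))


# Hypothetical decision table with one condition attribute "a".
table = {
    "x1": {"a": 0, "d": "yes"},
    "x2": {"a": 0, "d": "no"},
    "x3": {"a": 1, "d": "yes"},
    "x4": {"a": 2, "d": "yes"},
}
cred, cov = credibility_and_coverage(table, "x1", ["a"], ["d"])
print(cred, cov)
```

For x1 the rule "a = 0 implies d = yes" holds for one of the two samples with a = 0 (credibility 1/2) and accounts for one of the three samples with d = yes (coverage 1/3).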

Property 2. Given a neighborhood decision system NDS
Proof. It follows immediately from Definition 2 that Property 2 shows that the decision neighborhood entropy of B and D combines the credibility degree and the coverage degree in the neighborhood decision system, which can fully reflect the decision ability of the neighborhood decision system.

Proposition 2. Given a neighborhood decision system NDS
holds. In addition, from Equations (11) and (12), it follows that $\underline{B_1}_{\delta}(X) \subseteq \underline{B_2}_{\delta}(X)$ and $\overline{B_1}_{\delta}(X) \supseteq \overline{B_2}_{\delta}(X)$. According to Equation (16), one has that $p_{B_1}(D) \le p_{B_2}(D)$. Hence, it can be obtained from Definition 2 , and one has $p_{B_1}(D) = p_{B_2}(D)$. From Definition 2, it follows that $H_{\delta}(D,$

Monotonicity is one of the most important properties of an effective uncertainty measure for attribute reduction. According to Proposition 2, it is quite obvious that the decision neighborhood entropy is monotonic, decreasing when adding the condition attributes, which validates the monotonicity of the proposed uncertainty measure.

Proof. It follows immediately from Definitions 1 and 2 that:

B is dependent if any attribute in B with respect to D is indispensable. B is called a reduct of C with respect to D if it satisfies the following two conditions:

Obviously, a reduct of C with respect to D is the minimal attribute subset that retains the decision neighborhood entropy of C with respect to D.

Definition 4. Given a neighborhood decision system NDS = (U, C, D, δ) with B⊆C and any attribute a∈B, then the significance measure of a in B with respect to D is defined as

Definition 5. Given a neighborhood decision system NDS = (U, C, D, δ) with B⊆C and any attribute a∈C − B, then the significance measure of a with respect to D is defined as

When B = ∅, $Sig_{out}(a, B, D) = H_{\delta}(D, \{a\})$. From Definition 5, the significance of attribute a is the increment of the distinguishing information after adding a into B. The larger the value of Sig(a, B, D) is, the greater the importance of attribute a for B with respect to D is.

Attribute Reduction Algorithm Based on Decision Neighborhood Entropy
The process of the attribute reduction method for classification is shown in Figure 1. To support efficient knowledge reduction, an attribute reduction algorithm based on decision neighborhood entropy (ARDNE) is constructed and described as Algorithm 1.

Algorithm 1 Input:
for any a ∈ (S − R) do
end for
Select $a_i$ ∈ R.
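Since only fragments of Algorithm 1 survive here, the following Python sketch shows a generic greedy forward selection of the kind the significance measure suggests: start from an empty subset, repeatedly add the candidate attribute with the largest gain, and stop when no candidate improves the measure. It is not the authors' ARDNE procedure. As a stand-in for the decision neighborhood entropy, it uses the neighborhood dependency degree, i.e., the fraction of samples whose neighborhood contains a single decision class. All names, the data, and δ = 0.2 are hypothetical.

```python
from math import sqrt


def neighborhood(data, i, attrs, delta):
    """Indices of samples within Euclidean distance delta of sample i."""
    xi = data[i]
    return {
        j for j, xj in enumerate(data)
        if sqrt(sum((xi[a] - xj[a]) ** 2 for a in attrs)) <= delta
    }


def dependency(data, labels, attrs, delta):
    """Stand-in measure: share of samples with a label-pure neighborhood."""
    if not attrs:
        return 0.0
    pure = sum(
        1 for i in range(len(data))
        if all(labels[j] == labels[i]
               for j in neighborhood(data, i, attrs, delta))
    )
    return pure / len(data)


def greedy_reduct(data, labels, attributes, delta, measure=dependency):
    """Forward selection: add the attribute with the largest positive gain."""
    selected = []
    current = measure(data, labels, selected, delta)
    while True:
        candidates = [a for a in attributes if a not in selected]
        if not candidates:
            break
        gains = {
            a: measure(data, labels, selected + [a], delta) - current
            for a in candidates
        }
        best = max(candidates, key=lambda a: gains[a])
        if gains[best] <= 0:   # no candidate adds distinguishing information
            break
        selected.append(best)
        current = measure(data, labels, selected, delta)
    return selected


# Hypothetical data: "b" separates the classes, "c" is constant noise.
data = [
    {"a": 0.1, "b": 0.1, "c": 0.5},
    {"a": 0.4, "b": 0.2, "c": 0.5},
    {"a": 0.5, "b": 0.8, "c": 0.5},
    {"a": 0.9, "b": 0.9, "c": 0.5},
]
labels = [0, 0, 1, 1]
print(greedy_reduct(data, labels, ["a", "b", "c"], 0.2))
```

The `measure` argument is left pluggable so that a different uncertainty measure can be substituted without changing the search loop.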

Complexity Analysis of ARDNE Algorithm
From Algorithm 1, the decision neighborhood entropy and the neighborhood classes induced by the conditional attributes need to be frequently calculated in the computation of the attribute significance measure. This computational process largely determines the time complexity of selecting attributes. Suppose that the number of attributes is m, and the number of samples is n. It is well known that $m_R \ll m$ in most cases, where $m_R$ denotes the number of selected attributes. Therefore, the time complexity of ARDNE is close to O(mn). So far, ARDNE appears to be more efficient than some of the existing algorithms for attribute reduction in neighborhood decision systems [33,39-41]. Furthermore, its space complexity is O(mn).

An Illustrative Example
In the following, the performance of the ARDNE algorithm is shown through an illustrative example in [42]. A neighborhood decision system NDS = (U, C, D, δ) is employed, where $U = \{x_1, x_2, x_3, x_4\}$, C = {a, b, c}, D = {d}, and δ = 0.3, as shown in Table 1. For Table 1, an example for attribute reduction using Algorithm 1 is given. Then, the neighborhood class of each attribute in Table 1 is calculated by using the Euclidean distance function as follows.

Experiment Preparation
It is known that the objective of an attribute reduction algorithm usually has two aspects: One is to select a small attribute subset and the other is to maintain high classification accuracy. To demonstrate the classification performance of our proposed attribute reduction algorithm described in Section 3.2 on several public data sets, comprehensive results of all contrasted algorithms are obtained and analyzed. The four selected UCI (University of California at Irvine) Machine Learning Repository data sets with low-dimensional attributes include Ionosphere, Wisconsin Diagnostic Breast Cancer (Wdbc), Wisconsin Prognostic Breast Cancer (Wpbc), and Wine, which were downloaded from https://archive.ics.uci.edu/ml/datasets.html. The seven selected microarray gene expression data sets with high-dimensional attributes include Brain_Tumor1, Diffuse Large B Cell Lymphoma (DLBCL), Leukemia, Small Round Blue Cell Tumor (SRBCT), Colon, Lung, and Prostate, where the four gene expression data sets (Brain_Tumor1, DLBCL, Leukemia and SRBCT) can be downloaded at http://www.gems-system.org, the Colon gene expression data set can be downloaded at http://eps.upo.es/bigs/datasets.html, the Lung data set can be downloaded at http://bioinformatics.rutgers.ed/Static/Supple-mens/CompCancer/datasets, and the Prostate gene expression data set can be downloaded at http://www.gems-system.org. All of the data sets above are summarized in Table 2. The experiments were performed on a personal computer running Windows 7 with an Intel(R) Core(TM) i5-3470 CPU operating at 3.20 GHz, and 4 GB memory. All the simulation experiments were implemented in MATLAB R2014a, and the k-nearest neighbors (KNN) classifier and the support vector machine (SVM) classifier were selected to verify the classification accuracy in WEKA software, where the parameter k in KNN was set to 3 and the linear kernel function was selected in SVM.
All of the following experimental comparisons for classification on the selected attributes are implemented using 10-fold cross-validation on all the test data sets. First, every data set is randomly divided into ten subsets of equal size; one subset is used as the testing data set, the remaining nine subsets are used as the training data set, and each of the ten subsets is employed exactly once as the testing data set. Second, the cross-validation is repeated ten times. Finally, the average of the ten test results is taken as the classification accuracy.
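The evaluation protocol can be sketched in pure Python as follows. This is an illustrative stand-in, not the WEKA setup used in the experiments: it pairs a plain 3-nearest-neighbor classifier with unstratified random folds, pools the correct predictions over the ten folds of one run, and averages over ten runs with different shuffles. All function names and the synthetic data are hypothetical.

```python
import random
from collections import Counter
from math import dist


def k_fold_indices(n, k, seed):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def knn_predict(train_x, train_y, x, k=3):
    """Majority label among the k nearest training samples."""
    nearest = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]


def cross_val_accuracy(xs, ys, k=10, seed=0):
    """Accuracy of one k-fold run; every sample is tested exactly once."""
    correct = 0
    for test in k_fold_indices(len(xs), k, seed):
        held_out = set(test)
        train = [i for i in range(len(xs)) if i not in held_out]
        train_x = [xs[i] for i in train]
        train_y = [ys[i] for i in train]
        correct += sum(knn_predict(train_x, train_y, xs[i]) == ys[i] for i in test)
    return correct / len(xs)


def repeated_cv_accuracy(xs, ys, repeats=10, k=10):
    """Average accuracy over several k-fold runs with different shuffles."""
    return sum(cross_val_accuracy(xs, ys, k, seed) for seed in range(repeats)) / repeats


# Synthetic, well-separated two-class data (illustrative only).
xs = [(i * 0.01, 0.0) for i in range(20)] + [(5 + i * 0.01, 0.0) for i in range(20)]
ys = [0] * 20 + [1] * 20
print(repeated_cv_accuracy(xs, ys))
```

Repeating the split with different seeds reduces the dependence of the reported accuracy on one particular random partition, which matters for the small-sample microarray data sets.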

Effect of Different Neighborhood Parameter Values
Since the value of the neighborhood parameter decides the granularity of data manipulation, which affects both the cardinality of the data set and the classification accuracy of the attribute subset, the experiments in this subsection concern the number of selected attributes and the classification accuracy under different neighborhood parameter values. Following the experimental techniques designed by Chen et al. [33], the number of selected attributes and the classification accuracy of the selected attribute subset for the different neighborhood parameter values are discussed to obtain a suitable neighborhood parameter value and a better attribute subset. The classification results of the data sets given in Table 2 were obtained by using the ARDNE algorithm with the different neighborhood parameters, as shown in Figure 2, where the horizontal axis denotes the neighborhood parameter with δ ∈ [0.05, 1] at intervals of 0.05, and the left and right vertical axes represent the classification accuracy and the number of selected attributes, respectively.
For the Ionosphere data set in Figure 2a, the classification accuracy reached its maximum when the parameter was 0.3. As the parameter value continued to increase, the number of selected attributes decreased, resulting in a rapid decrease in classification accuracy. For the Wdbc data set in Figure 2b, when the parameter took values in the interval [0.05, 0.15], there was little change in the classification accuracy, and the number of selected attributes was smallest when the parameter was 0.15. For the Wine data set in Figure 2c, the classification accuracy reached its maximum when the parameter was 0.15. Similarly, for the Wpbc data set in Figure 2d, the classification accuracy reached its maximum when the parameter was 0.2. For the Brain_Tumor1 data set in Figure 2e, the classification accuracy reached its maximum when the parameter was set to 0.15.
For the Colon data set in Figure 2f, as the parameter value continued to increase, the number of selected attributes first increased and then decreased, and the classification accuracy reached its maximum when the parameter was 0.05. For the DLBCL data set in Figure 2g, when the parameter took values in the interval [0.15, 0.3], there was only a slight difference in the classification accuracy, and the number of selected attributes was smallest when the parameter was 0.15. For the Leukemia, Lung, Prostate, and SRBCT data sets in Figure 2h-k, the classification accuracy reached its maximum when the parameters were 0.1, 0.3, 0.5, and 0.25, respectively. In addition, when the neighborhood parameter was about 0.5, the number of selected attributes would be close to zero. Therefore, the appropriate neighborhood parameters of the eleven data sets should take values in the interval [0.05, 0.5].
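The parameter choice described above, which prefers the highest accuracy and, among nearly equal accuracies, the smaller attribute subset, can be sketched as a simple selection rule over a grid of δ values. The tolerance `tol`, the function name `pick_delta`, and the result triples are hypothetical and do not reproduce any value from Figure 2.

```python
def delta_grid(start=0.05, stop=1.0, step=0.05):
    """Neighborhood parameter values from start to stop, inclusive."""
    count = round((stop - start) / step) + 1
    return [round(start + step * k, 2) for k in range(count)]


def pick_delta(results, tol=0.0):
    """Choose a (delta, accuracy, n_attributes) triple.

    Keeps every triple whose accuracy is within `tol` of the best one,
    then prefers fewer attributes and, after that, the smaller delta.
    """
    best_acc = max(acc for _, acc, _ in results)
    near_best = [r for r in results if r[1] >= best_acc - tol]
    return min(near_best, key=lambda r: (r[2], r[0]))


# Hypothetical sweep results: (delta, accuracy, number of selected attributes).
results = [
    (0.05, 0.950, 12),
    (0.10, 0.950, 10),
    (0.15, 0.949, 7),
    (0.20, 0.900, 5),
]
print(len(delta_grid()), pick_delta(results, tol=0.005))
```

With a zero tolerance the rule returns the smallest subset among exact accuracy ties; a small positive tolerance trades a negligible accuracy loss for fewer attributes.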

Classification Results of ARDNE Algorithm under Different Neighborhood Parameter Values
In this part of our experiments, by using the neighborhood parameters selected in Section 4.2, the classification results of the raw data and the reduced data using Algorithm 1 on the eleven data sets in Table 2 were obtained. The number of attributes selected by the ARDNE algorithm and the corresponding classification accuracy with SVM and KNN based on 10-fold cross-validation are shown in Table 3. The corresponding neighborhood parameter values are listed in the last column.
Table 3. The number of selected attributes and the classification accuracy under the SVM and KNN classifiers on the raw data and the reduced data with Algorithm 1.
From Table 3, it can be found that our proposed algorithm can greatly reduce the attributes for all the data sets without loss of classification accuracy, and most of the redundant attributes are removed. On the four low-dimensional data sets, the classification accuracy of both the SVM and KNN classifiers was higher than that of the raw data. On the seven high-dimensional gene expression data sets, the classification accuracy of the KNN classifier was higher than that of the raw data, while there were some differences in the classification accuracy of the SVM classifier. On the KNN classifier, the classification accuracy of all the data sets was higher than that of the raw data. On the SVM classifier, the classification accuracy of the Brain_Tumor1 data set was 3% less than that of the raw data, and the classification accuracy of the Colon and Leukemia data sets was slightly lower than that of the raw data. This shows that the reduced attribute set can largely maintain the classification accuracy of the raw data. However, on the SVM classifier, for the Prostate data set, the classification accuracy was 8.7% less than that of the raw data. The reason is that some attributes with important information were lost during reduction.
Moreover, for the average classification accuracy, our ARDNE algorithm obtained 91.04% and 91.86% on the SVM and KNN classifiers, respectively, which was 10% higher than that of the raw data sets. Therefore, the proposed ARDNE algorithm was efficient in dimension reduction of low-dimensional and high-dimensional data sets.

Classification Results of UCI Data Sets with Low-Dimensions
This portion of our experiments evaluated the performance of our proposed algorithm in terms of classification accuracy, and the classification performance of the ARDNE algorithm was compared with those of four related state-of-the-art attribute reduction algorithms on the four UCI data sets selected from Table 2. The algorithms used in the comparison included: (1) The classical rough set algorithm (RS) [34], (2) the neighborhood rough set algorithm (NRS) [49], (3) the covering decision algorithm (CDA) [50], and (4) the max-decision neighborhood rough set algorithm (MDNRS) [28]. Table 4 gives the numbers of selected attributes in the reduced data with the different algorithms. Tables 5 and 6 show the comparison results of classification accuracy using the different methods.
From Table 4, comparing the numbers of selected attributes, the NRS, CDA, MDNRS, and ARDNE algorithms were all superior to the RS algorithm, but the ARDNE algorithm was slightly inferior to the NRS, CDA, and MDNRS algorithms. From Tables 5 and 6, the classification accuracy of the proposed ARDNE algorithm outperformed that of the other algorithms on most of the UCI data sets, except for the Wpbc data set. Furthermore, the average classification accuracy of ARDNE was the highest and greatly improved on the SVM and KNN classifiers. For the Wpbc data set, the number of attributes selected by the ARDNE algorithm was six, which was not far from that of the MDNRS algorithm, and its classification accuracy was 0.8% lower than that of the MDNRS algorithm on KNN, while the accuracy was 8% higher than that of the MDNRS algorithm on SVM. Meanwhile, for the Wpbc data set, the number of attributes selected by RS was seven, and its classification accuracy was 1% lower than that of the ARDNE algorithm on KNN, while the accuracy was 0.6% higher than that of the ARDNE algorithm on SVM. For the RS, NRS, and CDA algorithms, the classification accuracy on the Wdbc and Wine data sets was unstable.
The classification accuracy on the Wine data set was only 40.23% on the SVM classifier, and the average classification accuracy of the NRS model was the lowest. It can be seen that the classification accuracy of the ARDNE algorithm on the SVM and KNN classifiers was relatively steady. Based on the results in Table 4, it can be seen that some important attributes were lost in the process of reduction for the RS, NRS, and CDA algorithms, resulting in the decrease of classification accuracy of the reduced data sets with fewer attributes. The experimental results show that our attribute reduction algorithm could remove most of the redundant attributes and improve the classification accuracy for most of the data sets.

Classification Results of Microarray Data Sets with High-Dimensions
This subsection of our experiments continued testing the classification performance of the ARDNE algorithm, compared with those of the other three state-of-the-art entropy-based attribute reduction algorithms on the five microarray gene expression data sets with high-dimensional attributes, selected from Table 2. The algorithms used in the comparison included: (1) The mutual entropy-based attribute reduction algorithm (MEAR) [50], (2) the entropy gain-based attribute reduction algorithm (EGAR) [33], and (3) the average decision neighborhood entropy-based attribute reduction algorithm (ADNEAR) [42]. The objective of these further experiments was to show the classification power of the proposed approach to gene selection. Tables 7 and 8 show the number of selected genes and the classification accuracy of the five high-dimensional gene expression data sets with the KNN and SVM classifiers, respectively.
From Tables 7 and 8, the ARDNE algorithm obtained average classification accuracies of 93.8% and 91.9% on the KNN and SVM classifiers, respectively. The classification accuracy of the genes selected by the MEAR, EGAR, and ADNEAR algorithms was far lower than that of the ARDNE algorithm. The MEAR algorithm achieved lower classification accuracy because discretization generally loses a large amount of useful gene information. For the number of selected genes, there was no significant difference among the EGAR, ADNEAR, and ARDNE algorithms. However, the classification accuracy of the ARDNE algorithm was superior to that of the EGAR and ADNEAR algorithms, which shows that the proposed ARDNE algorithm was able to find the most informative genes for classification. For the Colon data set, the classification accuracy of ARDNE was 80.8%, which was slightly lower than that of the MEAR algorithm. This indicates that the MEAR algorithm was greatly affected by the data set, and its classification results were not as stable as those of the ARDNE algorithm. For the SRBCT data set, the classification accuracy of the ARDNE algorithm was clearly higher than that of the other algorithms, and its number of selected genes was only six. The results of these further experiments show that the proposed method had significant classification ability on the five microarray gene expression data sets.

Classification Results of Dimensionality Reduction Methods on Gene Expression Data Sets
To further verify the classification performance of our proposed method, eight methods were evaluated in terms of the number of selected genes and the classification accuracy on the four gene expression data sets selected from Table 2. The ARDNE algorithm was compared with seven related state-of-the-art dimensionality reduction methods: (1) the sequential forward selection algorithm (SFS) [51], (2) the sparse group lasso algorithm (SGL) [52], (3) the adaptive sparse group lasso based on conditional mutual information algorithm (ASGL-CMI) [53], (4) the Spearman's rank correlation coefficient algorithm (SC²) [44], (5) the gene selection algorithm based on Fisher linear discriminant and neighborhood rough set (FLD-NRS) [39], (6) the gene selection algorithm based on locally linear embedding and neighborhood rough set (LLE-NRS) [40], and (7) the RelieF algorithm [41] combined with the NRS algorithm [49] (RelieF+NRS). The SVM classifier in the WEKA tool was used for the simulation experiments. The number of selected genes and the classification accuracy are shown in Tables 9 and 10, respectively, where the symbol (-) denotes that no results were obtained for Leukemia using the SGL and ASGL-CMI algorithms.
According to the experimental results in Tables 9 and 10, the differences among the eight methods could be clearly identified. For the SGL and ASGL-CMI methods, the number of selected genes was clearly higher than that of the other six algorithms, and their classification accuracy was not ideal. For some methods, such as the SFS, SC², FLD-NRS, and ARDNE algorithms, the average number of selected genes was less than 10. It follows that our proposed ARDNE algorithm selected fewer genes than the SFS, LLE-NRS, and RelieF+NRS algorithms, and roughly the same number as SC² and FLD-NRS. For the Colon data set, the classification accuracy of the ARDNE algorithm was 81%, which was slightly lower than that of the SGL, ASGL-CMI, FLD-NRS, and LLE-NRS methods. For the Leukemia, Lung, and Prostate data sets, however, the classification accuracy of the ARDNE algorithm was 96.7%, 98.7%, and 85.8%, respectively, which was higher than that of the other methods. The classification results of the SFS, SC², LLE-NRS, and RelieF+NRS algorithms were not as stable as those of the ARDNE algorithm. Thus, although the classification performance of the ARDNE algorithm varied slightly across the four gene expression data sets, its average classification ability was not affected. In terms of average classification accuracy, the ARDNE algorithm obtained the highest value. Therefore, our method was an efficient dimensionality reduction technique for high-dimensional, large-scale microarray data sets.

Statistical Analysis
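The final part of our experiments further examined the statistical significance of the results using the Friedman test [53] and the Bonferroni-Dunn test [54]. The Friedman statistic is described as follows:
$$\chi_F^2 = \frac{12N}{k(k+1)}\left(\sum_{i=1}^{k} R_i^2 - \frac{k(k+1)^2}{4}\right), \qquad F_F = \frac{(N-1)\chi_F^2}{N(k-1) - \chi_F^2},$$
where $k$ is the number of algorithms, $N$ is the number of data sets, and $R_i$ is the average ranking of algorithm $i$ over all the data sets. The statistic $F_F$ follows an F-distribution with $k-1$ and $(k-1)(N-1)$ degrees of freedom. The critical distance [55] is defined as
$$CD_\alpha = q_\alpha \sqrt{\frac{k(k+1)}{6N}}, \tag{25}$$
where $q_\alpha$ is the critical tabulated value for the test and $\alpha$ is the significance level of the Bonferroni-Dunn test.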
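The computation behind these two tests can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the experiments: it assumes the standard forms of the Friedman statistic, its F-distributed variant $F_F$, and the Bonferroni-Dunn critical distance $CD = q_\alpha\sqrt{k(k+1)/(6N)}$. The function names (`average_ranks`, `friedman_statistics`, `critical_distance`) and the accuracy values in `toy_scores` are hypothetical and chosen only for demonstration.

```python
import math


def average_ranks(scores):
    """Average rank of each algorithm over all data sets.

    scores[d][a] is the accuracy of algorithm a on data set d.
    Rank 1 is the best (highest accuracy); tied algorithms share the mean rank.
    """
    n, k = len(scores), len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda a: -row[a])  # best first
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1  # mean of 1-based positions pos+1..end+1
            for a in order[pos:end + 1]:
                totals[a] += shared
            pos = end + 1
    return [t / n for t in totals]


def friedman_statistics(avg_ranks, n_datasets):
    """Return (chi2_F, F_F) for k algorithms ranked on n_datasets data sets.

    F_F is undefined when chi2_F equals n_datasets * (k - 1), i.e. when every
    data set yields exactly the same ranking; that case is not handled here.
    """
    k, n = len(avg_ranks), n_datasets
    chi2 = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4
    )
    f_f = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return chi2, f_f


def critical_distance(k, n_datasets, q_alpha):
    """Bonferroni-Dunn critical distance for k algorithms and n_datasets data sets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_datasets))


if __name__ == "__main__":
    # Made-up accuracies: 4 data sets (rows) x 3 algorithms (columns).
    toy_scores = [
        [0.90, 0.85, 0.80],
        [0.88, 0.90, 0.70],
        [0.95, 0.91, 0.91],
        [0.80, 0.78, 0.75],
    ]
    ranks = average_ranks(toy_scores)
    chi2, f_f = friedman_statistics(ranks, len(toy_scores))
    # q_alpha = 1.960 is assumed here as the tabulated value for k = 3, alpha = 0.1.
    cd = critical_distance(len(ranks), len(toy_scores), q_alpha=1.960)
    print("average ranks:", ranks)
    print("chi2_F:", chi2, "F_F:", f_f, "CD:", cd)
```

Two algorithms would then be regarded as significantly different when their average ranks differ by at least the critical distance, and the null hypothesis of equal performance is rejected when $F_F$ exceeds the tabulated critical value of the F-distribution.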
In the following, based on the classification performance of the five attribute reduction algorithms in Tables 4 to 6, the rankings of the five algorithms are given in Tables 11 and 12, and the values of the different evaluation measures are shown in Table 13. Tables 14 and 15 show the rankings of the four algorithms in Tables 7 and 8, and the corresponding values of the evaluation measures are shown in Table 16. Similarly, Table 17 shows the rankings of the eight attribute reduction algorithms in Table 10 under the SVM classifier, and the values of the different evaluation measures under the SVM classifier are shown in Table 18. In summary, Tables 11 to 18 show that the proposed ARDNE algorithm was statistically superior to the other algorithms. It can be seen from Tables 13, 16, and 18 that the values of $F_F$ were 13 and 23.78 under the KNN classifier, and 2.33, 6.31, and 0.9 under the SVM classifier. At the significance level α = 0.1, the critical value of F(4,12) was 2.48, that of F(3,12) was 2.61, and that of F(7,7) was 2.78. The critical value $q_{0.1}$ = 2.241 can be found in [55], and it follows from Equation (25) that the values of CD were 2.506, 1.83, and 1.093, respectively.
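As a rough check of the first two critical distances, the arithmetic can be written out explicitly. The sketch below assumes the standard Bonferroni-Dunn form $CD = q_\alpha\sqrt{k(k+1)/(6N)}$ for Equation (25), and assumes $k = 5$ algorithms on $N = 4$ UCI data sets for the first comparison and $k = 4$ algorithms on $N = 5$ microarray data sets for the second; these counts are read from the experimental setup described earlier and are not stated alongside the CD values.

```latex
\begin{align*}
CD_{\text{UCI}} &= 2.241\sqrt{\frac{5 \times 6}{6 \times 4}}
                 = 2.241\sqrt{1.25} \approx 2.506, \\
CD_{\text{microarray}} &= 2.241\sqrt{\frac{4 \times 5}{6 \times 5}}
                 = 2.241\sqrt{0.6667} \approx 1.830.
\end{align*}
```

Under these assumptions, both results agree with the reported values of 2.506 and 1.83.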

Conclusions
Attribute reduction is one of the important steps in data mining and classification learning. A number of measures for calculating the discriminating ability of attribute subsets have been developed in recent years. Owing to its effectiveness, neighborhood entropy is widely employed and discussed for evaluating attributes in neighborhood rough sets. In this paper, an attribute reduction method using neighborhood entropy measures in neighborhood rough sets is proposed. Based on the strong complementarity between the algebra definition of attribute importance and the definition of information view, some neighborhood entropy-based uncertainty measures in neighborhood decision systems are studied. Then, a significance measure is presented that combines the credibility degree with the coverage degree to analyze the classification ability of the selected attribute subset. On the basis of these theories, a heuristic attribute reduction algorithm is developed for the dimensionality reduction task. A series of experiments on the four UCI data sets with low-dimensional attributes and the seven microarray gene expression data sets with high-dimensional attributes is carried out to verify the effectiveness of the proposed method. The experimental results indicate that our algorithm is effective at removing most redundant attributes without loss of classification accuracy. Compared with the other related reduction algorithms, the proposed algorithm achieves better reduction ability and classification accuracy.