Feature Selection Combining Information Theory View and Algebraic View in the Neighborhood Decision System

Feature selection is one of the core contents of rough set theory and its applications. Since the reduction ability and classification performance of many feature selection algorithms based on rough set theory and its extensions are not ideal, this paper proposes a feature selection algorithm that combines the information theory view and the algebraic view in the neighborhood decision system. First, the neighborhood relationship in the neighborhood rough set model is used to retain the classification information of continuous data, and some uncertainty measures of neighborhood information entropy are studied. Second, to fully reflect the decision ability and classification performance of the neighborhood system, the neighborhood credibility and the neighborhood coverage are defined and introduced into the neighborhood joint entropy. Third, a feature selection algorithm based on the neighborhood joint entropy is designed, which overcomes the limitation that most feature selection algorithms consider only the information theory definition or only the algebraic definition. Finally, experiments and statistical analyses on nine data sets show that the algorithm can effectively select the optimal feature subset, and that the selection result can maintain or improve the classification performance of the data set.


Introduction
Today, society has entered the era of network information; the rapid development of computer and network information technology makes data and information in various fields grow rapidly. How to dig out potential and valuable information from massive, disordered and noisy data poses an unprecedented challenge to intelligent information processing, and has produced a new field of artificial intelligence research: feature selection. Among the many methods of feature selection, rough set theory is an effective way to deal with complex systems, because it does not need any prior information beyond the data set itself [1].
Rough set theory was proposed by the Polish scientist Pawlak in 1982 to deal with uncertain, imprecise and fuzzy problems [1]. Its basic idea is to use equivalence relations to granulate the discrete sample space into a cluster of pairwise disjoint equivalence classes, thereby describing the knowledge and concepts in the sample space. Feature selection is one of the core contents of rough set theory and its application research. Rough set theory performs information granulation on the original data set, deletes redundant conditional attributes without reducing the data classification ability, and obtains a more concise description than the original data set [2,3]. Classical rough set theory can only handle discrete data well and cannot cope with the large amount of continuous and mixed data (containing both continuous and discrete attributes) in practical applications [4-6]. Even if discretization techniques are adopted [7], important information in the data will be lost, which ultimately affects the selection result. For this reason, Wang et al. [8] proposed the k-nearest neighborhood rough set model. Chen et al. [9] explored the granular structure, distance and metric in the neighborhood system. Yao et al. [10] studied the relationship between the 1-step neighborhood system and rough set approximation. Based on the above research, Hu et al. [11] proposed the neighborhood rough set model and successfully applied it to feature selection, classification and uncertainty reasoning on continuous and mixed data. As a data preprocessing method, feature selection based on the neighborhood rough set has been widely used in cancer classification [12], character recognition [13] and facial expression feature selection [14], and has good research value and application prospects.
Traditional feature selection has been proven to be an NP-hard problem by Wong and Ziarko [15]. Therefore, in the research on feature selection algorithms, how to speed up convergence and reduce time complexity has become a mainstream research direction [16]. Chen et al. [17] proposed a heuristic feature selection algorithm using a joint entropy measurement. Jiang et al. [16] studied a feature selection accelerator based on the supervised neighborhood. Most of the above feature selection methods rely on monotonic evaluation functions [11]. However, feature selection algorithms that satisfy monotonicity have the problem that, when the classification performance of the original data set is poor, the measured value of the evaluation function is low and the final reduction effect is not good [18]. To solve this problem, Li et al. [19] proposed a non-monotonic feature selection algorithm based on the decision rough set model. Sun et al. [18] designed a gene feature selection algorithm based on the uncertainty measurement of neighborhood entropy. Wang et al. [20] studied a greedy feature selection algorithm based on a non-monotonic conditional discriminant index.
Some existing uncertainty measures cannot objectively reflect changes in classification decision capability [21]. Sun et al. [18] believe that credibility and coverage can reflect the classification ability of condition attributes relative to decision attributes, and that condition attributes with higher credibility and coverage are more important for decision attributes. In addition, Tsumoto et al. [22] emphasize that credibility represents the sufficiency of propositions while coverage describes their necessity. Therefore, this paper defines the credibility and coverage in the neighborhood decision system, namely the neighborhood credibility and the neighborhood coverage.
The information theory definition based on information entropy and the algebraic definition based on approximate precision are two forms of definition in classical rough set theory [23]. The information theory definition considers the influence of attributes on uncertain subsets, while the algebraic definition considers the influence of attributes on defined subsets [24,25]; the two are strongly complementary measurement mechanisms [26]. So far, most feature selection algorithms consider only the information theory definition or only the algebraic definition. For example, Hu et al. [11] proposed a hybrid feature selection algorithm based on neighborhood information entropy. Wang et al. [27,28] used the equivalence relation matrix to calculate knowledge granularity, resolution and attribute importance from the algebraic view of rough sets. Sun et al. [2,29] studied feature selection methods based on entropy measures. The uncertainty measures based on neighborhood information entropy reflect the information theory view in the neighborhood decision system, while the neighborhood approximate precision belongs to the algebraic view [18].
Inspired by the above, this paper combines the information theory view and the algebraic view in the neighborhood decision system, and proposes a heuristic non-monotonic feature selection algorithm. The experimental results on nine data sets of different scales show that the algorithm can effectively select the optimal feature subset, and the selection results can maintain or improve the classification performance of the data set.
In summary, the main contributions of this paper are as follows:
• The credibility and coverage degrees can reflect the decision-making ability and the classification ability of conditional attributes with respect to the decision attribute [18]. In order to effectively analyze the uncertainty of knowledge in the neighborhood rough set, the credibility and coverage are introduced into the neighborhood decision system, and then the neighborhood credibility and neighborhood coverage are defined and introduced into the neighborhood joint entropy.
• Based on the proposed neighborhood joint entropy, some uncertainty measures of neighborhood information entropy are studied, and the relationships between the measures are derived, which is conducive to understanding the nature of knowledge uncertainty in neighborhood decision systems.
• To construct a more comprehensive measurement mechanism and overcome the problem of poor selection results when the classification performance of the original data set is not good, the information theory view and the algebraic view in the neighborhood decision system are combined to propose a heuristic non-monotonic feature selection algorithm.
Section 2 briefly introduces the basic concepts of the neighborhood rough set and information entropy measures. Section 3 studies the heuristic non-monotonic feature selection algorithm based on information theory view and algebraic view. Section 4 analyzes the experimental results on four low-dimensional data sets and five high-dimensional data sets. Section 5 summarizes the content of this paper.

Basic Concepts
In this part, we briefly review the basic concepts of information entropy measures and the neighborhood rough set [2,30-33].

Information Entropy Measures
In the DS, if B ⊆ C divides the sample set U into U/B = {X_1, X_2, ..., X_K}, then the information entropy is defined as

$$H(B) = -\sum_{i=1}^{K} p(X_i)\log p(X_i),$$

where p(X_i) = |X_i|/|U| represents the probability of X_i in the sample set. In the DS, if B, Q ⊆ C, U/B = {X_1, X_2, ..., X_K} and U/Q = {Y_1, Y_2, ..., Y_L}, then the conditional information entropy of Q relative to B is defined as

$$H(Q|B) = -\sum_{i=1}^{K} p(X_i)\sum_{j=1}^{L} p(Y_j|X_i)\log p(Y_j|X_i),$$

where p(Y_j|X_i) = |X_i ∩ Y_j|/|X_i|. Under the same partitions, the joint information entropy of Q and B is defined as

$$H(B, Q) = -\sum_{i=1}^{K}\sum_{j=1}^{L} p(X_i \cap Y_j)\log p(X_i \cap Y_j),$$

where p(X_i ∩ Y_j) = |X_i ∩ Y_j|/|U|.
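For concreteness, the following sketch (illustrative Python, not code from the paper) computes H(B), H(Q|B) and H(B, Q) from partitions represented as block-label arrays; the log base 10 mirrors Example 1, and all function names are our own.

```python
# Classical entropy measures over the partitions induced by attribute
# subsets; blocks[i] is the index of the block containing sample i.
from collections import Counter
import math

def entropy(blocks):
    """H(B) of the partition encoded by block labels (log base 10)."""
    n = len(blocks)
    return -sum(c / n * math.log(c / n, 10) for c in Counter(blocks).values())

def joint_entropy(b_blocks, q_blocks):
    """H(B,Q): entropy of the intersection partition."""
    return entropy(list(zip(b_blocks, q_blocks)))

def conditional_entropy(q_blocks, b_blocks):
    """H(Q|B) = H(B,Q) - H(B)."""
    return joint_entropy(b_blocks, q_blocks) - entropy(b_blocks)

# Toy check on a four-sample universe.
B = [0, 0, 1, 1]
Q = [0, 1, 1, 1]
print(entropy(B), conditional_entropy(Q, B), joint_entropy(B, Q))
```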

Neighborhood Rough Set
NDS = (U, C, D, δ) is called the neighborhood decision system, where U is the sample set, called the universe; C is the conditional attribute set; D is the decision attribute; and δ is the neighborhood radius.
In the NDS, if B ⊆ C, then the Minkowski distance between different sample points x_i = (x_{i1}, x_{i2}, ..., x_{im}) and x_j = (x_{j1}, x_{j2}, ..., x_{jm}) on U is defined as

$$MD_p(x_i, x_j) = \left( \sum_{k=1}^{m} |x_{ik} - x_{jk}|^p \right)^{1/p}.$$

Given the NDS and the distance measurement function MD, if B ⊆ C, then the neighborhood information granule of x_i ∈ U relative to B is defined as

$$n_B^{\delta}(x_i) = \{ x_j \in U \mid MD(x_i, x_j) \le \delta \}.$$

In the NDS, if B ⊆ C and N_B is the neighborhood relationship on U, then the neighborhood upper approximation set and the neighborhood lower approximation set of a sample set X ⊆ U relative to B are respectively defined as

$$\overline{N_B}X = \{ x_i \in U \mid n_B^{\delta}(x_i) \cap X \ne \emptyset \}, \qquad \underline{N_B}X = \{ x_i \in U \mid n_B^{\delta}(x_i) \subseteq X \}.$$

In the NDS, if B ⊆ C, U/D = {Y_1, Y_2, ..., Y_L} and N_B is the neighborhood relationship on U, then the neighborhood upper approximation set and the neighborhood lower approximation set of D relative to B are respectively defined as

$$\overline{N_B}(D) = \bigcup_{j=1}^{L} \overline{N_B}Y_j, \qquad \underline{N_B}(D) = \bigcup_{j=1}^{L} \underline{N_B}Y_j.$$

In the NDS, if B ⊆ C, then the neighborhood approximate precision of a sample set X ⊆ U relative to B is defined as

$$P_B(X) = \frac{|\underline{N_B}X|}{|\overline{N_B}X|},$$

and, for U/D = {Y_1, Y_2, ..., Y_L}, the neighborhood approximate precision of D relative to B is defined as

$$P_B(D) = \frac{|\underline{N_B}(D)|}{|\overline{N_B}(D)|}.$$

P_B(D) describes the knowledge completeness of a set: it considers the influence of the attributes in the neighborhood decision system on the defined subset, and it is the view of the neighborhood decision system under the algebraic definition [18].
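The constructions above translate directly into code. The sketch below is a minimal illustration (our own naming, with Euclidean distance as the default Minkowski case) of the neighborhood granules and the neighborhood approximate precision P_B(D); the data are hypothetical.

```python
# Neighborhood granules, lower/upper approximations of the decision and
# the approximate precision P_B(D) for a small numeric data set.
import numpy as np

def minkowski(x, y, p=2):
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def granules(X, delta, p=2):
    """n_B^delta(x_i) for every sample, as sets of sample indices."""
    n = len(X)
    return [{j for j in range(n) if minkowski(X[i], X[j], p) <= delta}
            for i in range(n)]

def approximate_precision(X, y, delta):
    nbh = granules(X, delta)
    lower, upper = set(), set()
    for cls in np.unique(y):                         # decision classes Y_j
        members = set(np.where(y == cls)[0])
        lower |= {i for i, g in enumerate(nbh) if g <= members}
        upper |= {i for i, g in enumerate(nbh) if g & members}
    return len(lower) / len(upper)                   # P_B(D)

X = np.array([[0.1, 0.2], [0.15, 0.25], [0.9, 0.8], [0.85, 0.75]])
y = np.array([0, 0, 1, 1])
print(approximate_precision(X, y, delta=0.3))        # 1.0: consistent data
```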

Feature Selection Algorithm Design
This part first defines the neighborhood credibility and neighborhood coverage. Second, some uncertainty measures of neighborhood information entropy are studied, and the relationship between the measures is derived. Then, using the information theory view and algebraic view in the neighborhood decision system, a heuristic non-monotonic feature selection algorithm is designed. The following introduces related concepts and their properties.

Neighborhood Credibility and Neighborhood Coverage
In the NDS, if B ⊆ C, U/B = {X_1, X_2, ..., X_K} and U/D = {Y_1, Y_2, ..., Y_L}, then the credibility α_ij and the coverage κ_ij [18] are respectively defined as

$$\alpha_{ij} = \frac{|X_i \cap Y_j|}{|X_i|}, \qquad \kappa_{ij} = \frac{|X_i \cap Y_j|}{|Y_j|},$$

where i = 1, 2, ..., K and j = 1, 2, ..., L. Credibility and coverage reflect the classification ability of condition attributes relative to decision attributes. Condition attributes with higher credibility and coverage are more important for decision attributes [22].
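As a quick illustration, the following sketch computes the credibility and coverage matrices α_ij and κ_ij for two partitions given as block labels; the names and data are hypothetical.

```python
# Credibility/coverage matrices for partitions U/B and U/D.
import numpy as np

def credibility_coverage(b_blocks, d_blocks):
    Xs = [np.where(b_blocks == v)[0] for v in np.unique(b_blocks)]
    Ys = [np.where(d_blocks == v)[0] for v in np.unique(d_blocks)]
    alpha = np.array([[len(np.intersect1d(X, Y)) / len(X) for Y in Ys] for X in Xs])
    kappa = np.array([[len(np.intersect1d(X, Y)) / len(Y) for Y in Ys] for X in Xs])
    return alpha, kappa

b = np.array([0, 0, 1, 1, 1])
d = np.array([0, 1, 1, 1, 0])
alpha, kappa = credibility_coverage(b, d)
print(alpha)   # row i: how sufficiently block X_i implies each class
print(kappa)   # row i: how much of each class is covered by X_i
```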

Definition 1.
In the NDS, if B ⊆ C, then the joint neighborhood information granule of x_i ∈ U is defined as

$$n_{(B,D)}(x_i) = n_B^{\delta}(x_i) \cap [x_i]_D.$$

n_(B,D)(x_i) combines the neighborhood information granule n_B^δ(x_i) and the decision equivalence class [x_i]_D, which more accurately reflects the amount of class information when each class in n_B^δ(x_i) has a different distribution; the amount of class information provided is embodied in the number of elements in n_(B,D)(x_i). Therefore, n_(B,D)(x_i) can accurately reflect the decision information.

Definition 2.
In the NDS, if B ⊆ C, then the neighborhood credibility nα_i and the neighborhood coverage nκ_i of x_i ∈ U are respectively defined as

$$n\alpha_i = \frac{|n_{(B,D)}(x_i)|}{|n_B^{\delta}(x_i)|}, \qquad n\kappa_i = \frac{|n_{(B,D)}(x_i)|}{|[x_i]_D|}.$$

nα_i and nκ_i respectively use the joint neighborhood information granule and the decision equivalence relationship to describe the credibility and the coverage of the neighborhood decision system, which makes full use of the decision information provided by the decision system.
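A minimal sketch of Definitions 1 and 2 under the reconstruction above: the joint granule n_(B,D)(x_i) is the intersection of the δ-neighborhood with the decision class of x_i, from which the per-sample neighborhood credibility and coverage follow. All names are illustrative.

```python
# Per-sample neighborhood credibility and coverage.
import numpy as np

def granules(X, delta):
    # Euclidean (p = 2) neighborhoods, as in Example 1
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    return [set(np.where(dist[i] <= delta)[0]) for i in range(len(X))]

def neighborhood_cred_cov(X, y, delta):
    nbh = granules(X, delta)
    n_alpha, n_kappa = [], []
    for i, g in enumerate(nbh):
        dec = set(np.where(y == y[i])[0])     # [x_i]_D
        joint = g & dec                       # n_(B,D)(x_i)
        n_alpha.append(len(joint) / len(g))
        n_kappa.append(len(joint) / len(dec))
    return n_alpha, n_kappa

X = np.array([[0.1, 0.2], [0.15, 0.25], [0.9, 0.8], [0.85, 0.75]])
y = np.array([0, 1, 1, 0])
print(neighborhood_cred_cov(X, y, delta=0.3))
```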

Uncertainty Measures of Neighborhood Information Entropy
In the NDS, if B ⊆ C, then the neighborhood entropy [34] of x_i ∈ U is defined as

$$NE_{\delta}(x_i) = -\log \frac{|n_B^{\delta}(x_i)|}{|U|},$$

and the average neighborhood entropy [34] is defined as

$$NE_{\delta}(B) = -\frac{1}{|U|} \sum_{i=1}^{|U|} \log \frac{|n_B^{\delta}(x_i)|}{|U|}.$$

Definition 3. In the NDS, if B ⊆ C, then the new neighborhood entropy of x_i ∈ U is defined as

$$H_{\delta}(x_i) = -\log \frac{|n_{(B,D)}(x_i)|}{|U|}.$$

Definition 4. In the NDS, if B ⊆ C, then the new average neighborhood entropy is defined as

$$H_{\delta}(B) = -\frac{1}{|U|} \sum_{i=1}^{|U|} \log \frac{|n_{(B,D)}(x_i)|}{|U|}.$$

The new average neighborhood entropy H_δ(B) introduces the joint neighborhood information granule into the neighborhood entropy, which makes full use of the decision information in the neighborhood decision system.

Definition 5. In the NDS, if B ⊆ C, then the neighborhood conditional entropy of D relative to B is defined as

$$H_{\delta}(D|B) = -\frac{1}{|U|} \sum_{i=1}^{|U|} \log \frac{|n_{(B,D)}(x_i)|}{|n_B^{\delta}(x_i)|}.$$

Definition 6. In the NDS, if B ⊆ C, then the neighborhood joint entropy of D and B is defined as

$$H_{\delta}(D, B) = -\frac{1}{|U|} \sum_{i=1}^{|U|} \log \frac{|n_{(B,D)}(x_i)|^2}{|n_B^{\delta}(x_i)| \cdot |[x_i]_D|}.$$
Since nα_i = |n_(B,D)(x_i)|/|n_B^δ(x_i)| and nκ_i = |n_(B,D)(x_i)|/|[x_i]_D|, the neighborhood joint entropy can be rewritten as $H_{\delta}(D, B) = -\frac{1}{|U|}\sum_{i=1}^{|U|} \log(n\alpha_i \cdot n\kappa_i)$. From Theorem 2, we can see that the definition of neighborhood joint entropy can be derived from the neighborhood credibility and the neighborhood coverage.
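The two entropies of Definitions 5 and 6, under the reconstruction above, can be computed per sample from the three granule sizes. The sketch below (log base 10, Euclidean neighborhoods, our own naming) returns H_δ(D|B) and H_δ(D, B).

```python
# Neighborhood conditional entropy and neighborhood joint entropy.
import math
import numpy as np

def granules(X, delta):
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    return [set(np.where(dist[i] <= delta)[0]) for i in range(len(X))]

def neighborhood_entropies(X, y, delta):
    """Return (H_delta(D|B), H_delta(D,B)) with log base 10."""
    nbh = granules(X, delta)
    h_cond = h_joint = 0.0
    for i, g in enumerate(nbh):
        dec = set(np.where(y == y[i])[0])            # [x_i]_D
        j = len(g & dec)                             # |n_(B,D)(x_i)| >= 1
        h_cond += -math.log(j / len(g), 10)
        h_joint += -math.log(j * j / (len(g) * len(dec)), 10)
    n = len(X)
    return h_cond / n, h_joint / n

X = np.array([[0.1, 0.2], [0.15, 0.25], [0.9, 0.8], [0.85, 0.75]])
y = np.array([0, 1, 1, 0])
print(neighborhood_entropies(X, y, delta=0.3))
```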

Sun et al. [18] show that information entropy and its extensions belong to the view under the information theory definition, while the neighborhood approximate precision comes from the view under the algebraic definition. Therefore, Definitions 4-6, together with the neighborhood approximate precision, can be used to measure the uncertainty of knowledge in the neighborhood decision system from both the information theory view and the algebraic view.

Heuristic Non-Monotonic Feature Selection Algorithm Design
The feature selection algorithm that satisfies the monotonicity has the problem that the reduction effect is not good when the classification performance of the original data set is poor. Therefore, based on the uncertainty measures combining algebraic view and information theory view in Section 3.2, a heuristic non-monotonic feature selection algorithm is designed.
Proof of Theorem 4. From Equation (5), we know that the neighborhood granule n_B^δ(x_i) changes as attributes are added to B, so the numerical relationship between the granule sizes under different attribute subsets is not fixed, and hence the numerical relationship between the corresponding per-sample entropy terms is unknown. According to Equations (9), (10) and (12), the neighborhood credibility and the neighborhood coverage therefore have no fixed order under subset expansion. According to Equation (23), Theorem 4 holds; that is, the neighborhood joint entropy H_δ(D, B) is not monotonic with respect to B.
In the NDS, if B ⊆ C and b ∈ B, when H_δ(D, B − {b}) ≥ H_δ(D, B), it is said that attribute b is redundant with respect to D; otherwise, it is said that attribute b is indispensable for D. If B satisfies H_δ(D, B) ≥ H_δ(D, C) and every attribute in B is indispensable for D, then B is called a feature subset of C.
From a numerical point of view, looking for an optimal feature subset means finding the B corresponding to the maximum H_δ(D, B).
To accurately reflect the decision information and eliminate redundant features, a heuristic non-monotonic feature selection algorithm based on neighborhood joint entropy (BONJE) is designed. The implementation steps of this algorithm are shown in Algorithm 1.
Algorithm 1 starts from the empty subset B = ∅ and, in each round of its while loop, evaluates every candidate attribute a ∈ C − B, adds to B the candidate that maximizes the neighborhood joint entropy H_δ(D, B ∪ {a}), ends the while loop when the joint entropy no longer increases (steps 11-12 of the listing), and returns the feature subset B. A sketch of this loop is given below; to facilitate the understanding of the specific calculation steps of the algorithm, an example then follows.
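Since the paper's Algorithm 1 listing did not survive extraction intact, the sketch below is a plausible reading of the BONJE loop rather than the authors' exact pseudocode: greedy forward selection that stops when H_δ(D, B) no longer increases. All names are our own.

```python
# Greedy forward selection driven by the neighborhood joint entropy.
import math
import numpy as np

def h_joint(X, y, delta):
    """Neighborhood joint entropy H_delta(D,B) (log base 10)."""
    dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    total = 0.0
    for i in range(len(X)):
        g = set(np.where(dist[i] <= delta)[0])       # n_B^delta(x_i)
        dec = set(np.where(y == y[i])[0])            # [x_i]_D
        j = len(g & dec)                             # joint granule size
        total += -math.log(j * j / (len(g) * len(dec)), 10)
    return total / len(X)

def greedy_selection(X, y, delta):
    remaining = list(range(X.shape[1]))
    selected, best = [], -math.inf
    while remaining:
        scores = {a: h_joint(X[:, selected + [a]], y, delta) for a in remaining}
        a_star = max(scores, key=scores.get)
        if scores[a_star] <= best:                   # entropy stops rising
            break
        best = scores[a_star]
        selected.append(a_star)
        remaining.remove(a_star)
    return selected

X = np.random.default_rng(0).random((20, 5))
y = np.array([0] * 10 + [1] * 10)
print(greedy_selection(X, y, delta=0.3))
```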

Example 1.
An NDS = (U, C, D, δ) is given in Table 1, where U = {x_1, x_2, x_3, x_4} is the universe, C = {a, b, c} is the conditional attribute set, D = {d} is the decision attribute, and the neighborhood radius parameter is δ = 0.3. Let the initial feature subset be B = ∅; the base of the logarithm is 10, and the calculation results are kept to three decimal places. In the distance measurement function Equation (4), p = 2 is used.
From Equation (6), the decision equivalence classes [x_i]_D are obtained. When B = {a}, the distance between each pair of samples is calculated with Equation (4), and according to Equation (5) the neighborhood granules n_B^δ(x_i) are obtained. From Equations (9), (10) and (12), the neighborhood credibility and the neighborhood coverage can be obtained, and according to Equation (23) the neighborhood joint entropy of each candidate subset is computed. It can be seen from the results that B = {a, c} meets the stopping criterion, so B = {a, c} is the optimal feature subset.

Experiment and Analysis
This part uses the BONJE algorithm to select the appropriate neighborhood radius for different data sets and designs different comparative experiments to prove the efficiency of the BONJE algorithm in feature selection.

Experimental Data Introduction
To verify the efficiency of the BONJE algorithm in feature selection, this experiment selects nine data sets of different dimensions as the experimental objects, including four low-dimensional data sets (Wine, WDBC, WPBC, Ionosphere) and five high-dimensional data sets (Colon, SRBCT, DLBCL, Leukemia, Lung). The details of each data set are shown in Table 2.

Experimental Environment
The experiments in this paper are performed on a personal computer with Microsoft Windows 10 Professional Edition (64-bit), an Intel(R) Core(TM) i5-6500 CPU @ 3.20 GHz (3192 MHz) and 16.00 GB RAM. The simulation experiment is implemented on the IntelliJ IDEA 2020.1.2 platform using Java version 1.8.0_144. The C4.5, SVM (support vector machine) and KNN (k-nearest neighbors) classifiers in the Weka software are used to verify the classification accuracy of the selected feature subsets, where the SVM uses PolyKernel as the kernel function and KNN sets K = 3. In order to reduce the generalization error, all three classifiers adopt ten-fold cross-validation to obtain the final classification accuracy.
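The evaluation itself runs on Weka; for readers working in Python, the following sketch shows an analogous (not identical) protocol with scikit-learn: KNN with K = 3 and a polynomial-kernel SVM, scored by ten-fold cross-validation on a hypothetical feature subset of the Wine data.

```python
# An illustrative analogue of the evaluation protocol (the paper itself
# uses Weka's C4.5/SVM/KNN classifiers).
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
subset = [0, 6, 9, 12]                     # hypothetical selected features
for name, clf in [("KNN", KNeighborsClassifier(n_neighbors=3)),
                  ("SVM", SVC(kernel="poly"))]:
    acc = cross_val_score(clf, X[:, subset], y, cv=10).mean()
    print(f"{name}: {acc:.3f}")
```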

Neighborhood Radius Selection
Since the neighborhood radius affects the granularity of neighborhood information, and thus the neighborhood joint entropy, it is very important to choose a proper neighborhood radius. In order to unify the value of the neighborhood radius, eliminate the difference in dimensions and make each feature be treated equally by the classifier, this experiment first normalizes the data into [0, 1] by min-max normalization, x' = (x − min)/(max − min), and then selects, for each data set, the neighborhood radius δ with the best classification performance among a set of candidate values.
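A minimal sketch of this preprocessing step, assuming min-max normalization and a simple grid of candidate radii; the scoring function is pluggable (here the h_joint() helper from the Algorithm 1 sketch, although in practice downstream classification accuracy can be used), and the grid values are illustrative.

```python
# Min-max normalization followed by a grid search over candidate radii.
import numpy as np

def min_max_normalize(X):
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)  # constant columns -> 0

def pick_radius(X, y, score, candidates=np.arange(0.05, 0.55, 0.05)):
    """Return the candidate radius delta with the best score(Xn, y, delta)."""
    Xn = min_max_normalize(X)
    return max(candidates, key=lambda d: score(Xn, y, d))

X = np.random.default_rng(1).random((20, 4)) * 10.0
y = np.array([0] * 10 + [1] * 10)
print(pick_radius(X, y, score=h_joint))   # h_joint() from the sketch above
```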

Classification Results of the BONJE Algorithm
This part of the experiment compares the classification accuracy and the number of features between the original data and the feature subset selected by the BONJE algorithm. The comparison results are shown in Table 3, and the neighborhood radii selected for the different data sets are listed in the last column. In addition, the feature subsets selected by the BONJE algorithm for the different data sets are shown in Table 4. Please note that boldface indicates the better value in the compared data. From the comparison of average classification accuracy in Table 3, it can be seen that the average classification accuracy of the BONJE algorithm on the Wine, WDBC, and Ionosphere data sets is slightly lower than that of the original data, by 0.2%, 0.2%, and 0.8%, respectively. The accuracy loss caused by the BONJE algorithm is kept within 1%, which shows that the BONJE algorithm maintains the classification accuracy of the original data. The average classification accuracy of the BONJE algorithm on the WPBC, Colon, SRBCT, DLBCL, Leukemia, and Lung data sets is higher than that of the original data by 1.5%, 4.8%, 3.7%, 7.4%, 7.4%, and 2.5%, respectively, which indicates that the BONJE algorithm eliminates many redundant features and improves the classification accuracy of the data sets. From the comparison of feature numbers in Table 3, it can be seen that the BONJE algorithm can delete redundant features without reducing the classification accuracy, especially on the high-dimensional data sets. In summary, the BONJE algorithm can effectively select the optimal feature subset, and the feature selection result can maintain or improve the classification ability of the data set.

The Performance of BONJE Algorithm on Low-Dimensional Data Sets
This part of the experiment compares the BONJE algorithm with four other advanced feature selection algorithms on the low-dimensional data sets from the perspective of the number of selected features and the classification accuracy under the KNN and SVM classifiers. The four advanced feature selection algorithms are: (1) the Classic Rough Set Algorithm (RS) [1], (2) the Neighborhood Rough Set Algorithm (NRS) [40], (3) the Covering Decision Algorithm (CDA) [41], and (4) the Maximum Decision Neighborhood Rough Set Algorithm (MDNRS) [35]. Tables 5-7 show the experimental results of the five feature selection algorithms. A comprehensive analysis of Tables 5-7 shows that, for the Wine data set, the CDA algorithm selects the fewest features, but its KNN and SVM classification accuracies are far lower than those of the BONJE algorithm, by 23.4% and 31.8% respectively, which indicates that the CDA algorithm loses features with important information during selection. For the WDBC data set, although the BONJE algorithm selects more features than the other algorithms, its classification accuracy under the two classifiers is higher than that of the other algorithms. For the WPBC data set, the NRS and CDA algorithms choose the fewest features, but their classification accuracy under the two classifiers is lower than that of the BONJE algorithm. For the Ionosphere data set, the classification accuracy of the BONJE algorithm is relatively high compared with the other algorithms, and the number of features it selects is smaller. In general, the average number of features selected by the BONJE algorithm is smaller, and the BONJE algorithm has the highest average classification accuracy under the two classifiers, which shows that the BONJE algorithm has stable reduction ability and can improve the classification accuracy of low-dimensional data sets.

The Performance of BONJE Algorithm on High-Dimensional Data Sets
This part of the experiment compares the BONJE algorithm with four other advanced entropy-based feature selection algorithms on the different high-dimensional data sets. The four entropy-based feature selection algorithms are: (1) the mutual entropy-based attribute reduction algorithm (MEAR) [42], (2) the entropy gain-based gene selection algorithm (EGGS) [17], (3) the EGGS algorithm combined with the Fisher score (EGGS-FS) [29], and (4) the feature selection algorithm with the Fisher score based on decision neighborhood entropy (FSDNE) [18]. Tables 8-12 show the experimental results of the five entropy-based feature selection algorithms. As shown in Table 8, the KNN and C4.5 classification accuracies of the BONJE algorithm are better than those of the other algorithms. Although the SVM classification accuracy of the BONJE algorithm is slightly lower, by 0.9%, than that of the first-ranked MEAR algorithm, the average classification accuracy of the BONJE algorithm is higher, by 3.5%, than that of the second-ranked FSDNE algorithm. In general, the BONJE algorithm performs excellently on the Colon data set. Table 9 shows that the KNN and C4.5 classification accuracies of the BONJE algorithm are better than those of the other algorithms. Although the SVM classification accuracy of the BONJE algorithm is lower, by 1.5%, than that of the first-ranked FSDNE algorithm, the average classification accuracy of the BONJE algorithm is higher, by 4.2%, than that of the second-ranked FSDNE algorithm. Therefore, BONJE has stable classification performance on the SRBCT data set. According to the experimental results in Table 10, the KNN, SVM and C4.5 classification accuracies of the BONJE algorithm are all better than those of the other algorithms. Compared with the BONJE algorithm, the MEAR and EGGS-FS algorithms select fewer features, but their average classification accuracy is much lower than that of the BONJE algorithm. Therefore, the BONJE algorithm can delete many redundant features on the DLBCL data set without reducing the data classification ability. According to the results in Table 11, although the KNN classification accuracy of the BONJE algorithm is lower than that of the FSDNE algorithm, the SVM and C4.5 classification accuracies of the BONJE algorithm are as high as 95.8% and 94.4%, respectively, and its average classification accuracy is 1.5% higher than that of the second-ranked FSDNE algorithm. Therefore, the BONJE algorithm can effectively select feature subsets on the Leukemia data set and improve the classification ability of the data set.
It can be seen from Table 12 that the number of features selected by the BONJE algorithm is relatively high compared with other algorithms, but the BONJE algorithm has the highest average classification accuracy. Therefore, the BONJE algorithm can effectively reduce noise and improve classification accuracy on the Lung data set. Based on the above experimental results and analyses, the BONJE algorithm can effectively select feature subsets under high-dimensional data, and the feature selection results can improve the classification ability of the data set.

Comparison of BONJE Algorithm and Multiple Dimensionality Reduction Algorithms
To further verify the reduction performance and classification ability of the BONJE algorithm, this part of the experiment compares the BONJE algorithm with ten other reduction algorithms from the perspective of the number of selected features and the SVM classification accuracy on three representative tumor data sets (Colon, Leukemia, Lung). The ten dimensionality reduction methods are: (1) the neighborhood rough set-based reduction algorithm (NRS) [35], (2) the feature selection algorithm with Fisher linear discriminant (FLD-NRS) [32], (3) the gene selection algorithm based on locally linear embedding (LLE-NRS) [43], (4) the Relief algorithm [44] combined with the NRS algorithm (Relief + NRS) [35], (5) the fuzzy backward feature elimination algorithm (FBFE) [44], (6) the binary differential evolution algorithm (BDE) [2], (7) the sequential forward selection algorithm (SFS) [29], (8) the Spearman's rank correlation coefficient algorithm (SC2) [36], (9) the mutual information maximization algorithm (MIM) [2], and (10) the feature selection algorithm with the Fisher score based on decision neighborhood entropy (FSDNE) [18]. Tables 13 and 14 show the experimental results of the 11 dimensionality reduction algorithms. According to these results, the SVM classification accuracy of the BONJE and LLE-NRS algorithms on the Colon data set is the same and ranks second, but the number of features selected by the LLE-NRS algorithm is twice that of the BONJE algorithm. The SVM classification accuracy of the BONJE algorithm on the Colon data set is lower than that of the FLD-NRS algorithm, but on the Leukemia and Lung data sets it is much higher than that of the FLD-NRS algorithm, by 13% and 10.5% respectively, which shows that the classification performance of the BONJE algorithm is more stable. Although the BDE algorithm selects the fewest features on the Colon data set, its SVM classification accuracy is only 75%, which indicates that the BDE algorithm loses some important features in the process of selecting feature subsets. The SVM classification accuracy of the BONJE algorithm on the Leukemia data set is 0.1% lower than that of the first-ranked SFS algorithm, and the number of features selected by the BONJE algorithm is only one more than that of the SFS algorithm, so the two algorithms perform similarly on the Leukemia data set. Compared with the other algorithms, the BONJE algorithm selects more features on the Lung data set, but its SVM classification accuracy is the highest. In general, the BONJE algorithm is at a medium level in terms of the number of selected features and has the highest average SVM classification accuracy, which is enough to show that the BONJE algorithm has stable dimensionality reduction performance and can select features with important classification information from the data set.

Statistical Analyses
To systematically explore the statistical significance of algorithm classification results, this part of the experiment introduces the Friedman statistic test [45] and Nemenyi test [46].
The Friedman statistic and the corresponding Iman-Davenport test are computed as

$$\chi_F^2 = \frac{12N}{M(M+1)} \left( \sum_{i=1}^{M} R_i^2 - \frac{M(M+1)^2}{4} \right), \qquad F_F = \frac{(N-1)\chi_F^2}{N(M-1) - \chi_F^2},$$

where M is the number of algorithms, N is the number of data sets, and R_i represents the average ranking of the classification accuracy of the i-th algorithm on all data sets. F_F follows an F-distribution with M − 1 and (M − 1)(N − 1) degrees of freedom.
If the null hypothesis that all algorithms have the same performance is rejected, the performance of the algorithms is significantly different. Then, the Nemenyi test is used as a post-hoc test for algorithm comparison. If the difference between the average rankings of two algorithms is greater than the critical distance CD, it means that the algorithm with the better average ranking significantly outperforms the algorithm with the worse average ranking.
The critical distance CD is computed as

$$CD = q_{\alpha} \sqrt{\frac{M(M+1)}{6N}},$$

where q_α is the critical value of the test and α represents the significance level of the Bonferroni-Dunn test.
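The two statistics and the critical distance are easy to compute directly. The sketch below (our own helper names, hypothetical rank data, and q_0.1 ≈ 2.459 for five algorithms from the standard Nemenyi table) reproduces the procedure.

```python
# Friedman statistic, Iman-Davenport F_F and the Nemenyi critical distance;
# ranks[i][j] is the rank of algorithm i on data set j.
import math

def friedman(ranks):
    M, N = len(ranks), len(ranks[0])
    R = [sum(row) / N for row in ranks]                  # average ranks
    chi2 = 12 * N / (M * (M + 1)) * (sum(r * r for r in R) - M * (M + 1) ** 2 / 4)
    ff = (N - 1) * chi2 / (N * (M - 1) - chi2)           # Iman-Davenport
    return chi2, ff

def nemenyi_cd(q_alpha, M, N):
    return q_alpha * math.sqrt(M * (M + 1) / (6 * N))

# Hypothetical ranks of 5 algorithms on 4 data sets.
ranks = [[1, 1, 2, 1], [3, 2, 1, 3], [2, 4, 3, 2], [5, 3, 5, 4], [4, 5, 4, 5]]
chi2, ff = friedman(ranks)
print(chi2, ff, nemenyi_cd(q_alpha=2.459, M=5, N=4))
```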
According to the classification accuracy results of Tables 6 and 7 on the low-dimensional data sets, the rankings of the five feature selection algorithms under the KNN and SVM classifiers are shown in Tables 15 and 16, respectively. Please note that the content in parentheses in all tables is the classification accuracy under the corresponding classifier.

According to the algorithm rankings in Tables 15 and 16, the two evaluation measurement values (the Friedman statistic χ²_F and the Iman-Davenport test F_F) of the five feature selection algorithms under the KNN and SVM classifiers are shown in Table 17. At the significance level α = 0.1, the F_F values under both classifiers are greater than the critical value F(4, 12), so the null hypothesis under the two classifiers is rejected. Then the Nemenyi test is used as a post-hoc test to compare the algorithm performance, and the comparison results are shown in Figure 2. It is worth noting that the average ranking of each algorithm is plotted along the axis in the graph, with the best ranking on the left. In particular, when there are thick lines between algorithms, it means that the classification capabilities of these algorithms are similar; otherwise, they are regarded as significantly different from each other [47].

It can be clearly seen from Figure 2 that the BONJE algorithm ranks first under the two classifiers. The classification performance of the BONJE, MDNRS, RS and NRS algorithms under the KNN classifier is similar, and the BONJE algorithm is significantly better than the CDA algorithm. Under the SVM classifier, the classification performance of the BONJE, RS, CDA and MDNRS algorithms is similar, and the BONJE algorithm performs better than the NRS algorithm.

According to the classification accuracy results of Tables 8-12 on the high-dimensional data sets, the rankings of the entropy-based feature selection algorithms under the KNN, C4.5 and SVM classifiers are shown in Tables 18-20, respectively. According to the algorithm rankings in Tables 18-20, the two evaluation measurement values of the five entropy-based feature selection algorithms under the KNN, SVM, and C4.5 classifiers are shown in Table 21. At the significance level α = 0.1, the critical value of the Friedman statistic test is F(4, 16) = 2.333, and the F_F values under the three classifiers exceed it, so the null hypothesis under the three classifiers is rejected. The Nemenyi test is used as a post-hoc test to compare the performance of the algorithms, and the comparison results are shown in Figure 3.
According to the results in Figure 3, the ranking of the BONJE algorithm is the best under the three classifiers. Under the KNN classifier, the classification performance of the BONJE, FSDNE and EGGS-FS algorithms is similar, and the BONJE algorithm is significantly better than the MEAR and EGGS algorithms. Under the SVM classifier, the classification performance of the BONJE, EGGS-FS and FSDNE algorithms is similar, and the BONJE algorithm performs better than the EGGS algorithm. Under the C4.5 classifier, the BONJE algorithm has better classification performance than the EGGS and EGGS-FS algorithms.

According to the classification accuracy results of Table 14 on the three representative tumor data sets, the rankings of the 11 dimensionality reduction algorithms under the SVM classifier are shown in Table 22. According to the ranking in Table 22, χ²_F = 17.0491 and F_F = 2.6329 for the 11 dimensionality reduction algorithms under the SVM classifier. At the significance level α = 0.1, the critical value of the Friedman statistic test is F(10, 20) = 1.9367. Since F_F = 2.6329 is greater than F(10, 20), the null hypothesis under the SVM classifier is rejected. The Nemenyi test is used as a post-hoc test to compare the algorithm performance, and the comparison result is shown in Figure 4.

Figure 4 shows that the dimensionality reduction effect of the BONJE algorithm is significantly better than that of the NRS algorithm. In addition, the BONJE algorithm has the highest ranking, which shows that it has stable classification performance compared with the other algorithms.
In general, the classification results of the BONJE algorithm on the different data sets are significantly better than those of the compared algorithms, which shows that, from a statistical point of view, the classification performance of the BONJE algorithm is more stable and efficient.

Conclusions
Since the classification performance of many feature selection algorithms based on rough set theory and its extensions is not ideal, this paper proposes a feature selection algorithm combining the information theory view and the algebraic view in the neighborhood decision system to deal with redundant features and noise in data. First, some uncertainty measures of the neighborhood information entropy are studied to measure the uncertainty of knowledge in the neighborhood decision system. In addition, the credibility and coverage are introduced into the neighborhood decision system, and then the neighborhood credibility and neighborhood coverage are defined and introduced into the neighborhood joint entropy. Finally, based on the information theory view and the algebraic view in the neighborhood decision system, a heuristic non-monotonic feature selection algorithm is proposed. A series of comparative experiments and statistical analyses on four low-dimensional data sets and five high-dimensional data sets show that the algorithm can effectively remove redundant features and select the optimal feature subset. Since the BONJE algorithm needs to frequently calculate the neighborhood information granules of all samples, it has high time complexity when processing high-dimensional data. Moreover, the BONJE algorithm cannot completely balance the classification performance of the selected feature subset. In future work, it is necessary to study more effective search methods and uncertainty evaluation criteria to reduce the time complexity and classification error of the algorithm.