A Fast Feature Selection Algorithm by Accelerating Computation of Fuzzy Rough Set-Based Information Entropy

The information entropy developed by Shannon is an effective measure of uncertainty in data, and rough set theory is a useful tool in computer applications for dealing with vague and uncertain data. At present, information entropy has been extensively applied in rough set theory, and different information entropy models have been proposed in rough sets. In this paper, based on an existing feature selection method that uses a fuzzy rough set-based information entropy, a corresponding fast algorithm is provided to achieve an efficient implementation, in which the fuzzy rough set-based information entropy serving as the evaluation measure for selecting features is computed by an improved mechanism with lower complexity. The essence of the acceleration is to compute the λ-conditional entropy on iteratively reduced sets of instances. Numerical experiments are conducted to show the performance of the proposed fast algorithm, and the results demonstrate that it acquires the same feature subset as its original counterpart, but in significantly less time.


Introduction
Rough set theory [1], presented by Pawlak in 1982, is a useful tool for dealing with vague and uncertain information in computer science. Research on rough set theory has mainly focused on both generalizations of rough set models and applications in different data environments, and it has already attracted much attention in granular computing [2][3][4], feature selection [5][6][7][8], dynamic data mining [9][10][11], and big data mining [12,13]. On the other hand, since information entropy is a powerful measure of information uncertainty, it has been extensively applied to practical problems, such as decision making [14], time series [15], portfolio selection [16], and so on.
In view of its effectiveness in measuring uncertainty in information, information entropy has been extensively applied in rough set theory to mine knowledge. Research in this direction mainly concentrates on constructing rough set-based entropies in different information systems to measure the significance of features (or attributes) or the quality of knowledge granules, and on exploring practical applications of rough set-based entropy. Specifically, in the aspect of constructing rough set-based entropy [17][18][19][20][21][22][23][24][25][26][27][28], references [18] and [19] introduced the concepts of information entropy, rough entropy, and knowledge granulation in complete and incomplete information systems, respectively, and provided their important properties. Hu et al. [20] proposed generalizations of entropy to calculate the information of a fuzzy approximation space and a fuzzy probabilistic approximation space, respectively. Xu et al. [21] introduced the definition of rough entropy of rough sets in ordered information systems. Mi et al. [22] formulated the entropy of the generalized fuzzy approximation space. Dai and Tian [25] provided the concepts of knowledge information entropy and knowledge rough entropy in set-valued information systems and investigated their properties. Dai et al. [26] presented the rough decision entropy to evaluate the uncertainty of interval-valued decision systems. Chen et al. [27] introduced the neighborhood entropy to evaluate the uncertainty of neighborhood information systems. Wang et al. [28] put forward a unified form of uncertainty measures for general binary relations.
In the aspect of exploring practical applications of rough set-based entropy [29][30][31][32][33][34][35], Pal et al. [31] defined the measure "rough entropy of image" for image object extraction in the framework of rough sets. Tsai et al. [32] provided an entropy-based fuzzy rough classification approach to acquire classification rules. Chen and Wang [33] presented an improved clustering algorithm based on both rough set theory and entropy theory. Sen and Pal [34] gave classes of entropy measures based on rough set theory to quantify the grayness and spatial ambiguity in images. Chen et al. [35] put forward an entropy-based gene selection method based on the neighborhood rough set model. Furthermore, it is worth noting that one of the most important applications of rough set-based entropy is feature selection (attribute reduction) [36][37][38][39][40][41][42][43][44]. For example, Miao and Hu [36] defined the significance of attributes from the viewpoint of information and then proposed a heuristic attribute reduction algorithm using mutual information. Wang et al. [37] developed two novel heuristic attribute reduction algorithms based on the conditional information entropy. Hu et al. [39] introduced a fuzzy entropy to measure the uncertainty in kernel approximation based on fuzzy rough sets and thus proposed a feature evaluation index and a feature selection algorithm. Sun et al. [40] provided rough entropy-based uncertainty measures for feature selection in incomplete decision systems. Liang et al. [41] introduced incremental mechanisms for three representative information entropies and then developed a group incremental entropy-based feature selection algorithm based on rough set theory with multiple instances being added to a decision system. Chen et al. [43] proposed a neighborhood entropy to select feature subsets based on the neighborhood rough set model. Zhang et al. [44] presented a feature selection method using the fuzzy rough set-based information entropy.
Since the computation of the fuzzy rough set-based information entropy in [44] is quite time-consuming, in this paper we propose a corresponding improved mechanism with lower complexity to compute the entropy and develop a fast feature selection algorithm that quickly obtains the same result as the feature selection algorithm in [44]. In addition, the performance of the fast algorithm is shown by numerical experiments.
In the remainder of this paper, we briefly review in Section 2 the feature selection algorithm in [44] and some related knowledge. In Section 3, the computational properties of the fuzzy rough set-based information entropy in [44] are presented, and a fast feature selection approach with lower complexity is developed. Numerical experiments are documented in Section 4 to show the performance of the proposed fast feature selection algorithm.

Preliminaries
As indicated in [45], a fuzzy information system is a pair (U, A) in which U = {x_1, x_2, ..., x_n} is the universe of discourse and A = {a_1, a_2, ..., a_m} is the attribute set. For each attribute a_t ∈ A, a mapping a_t : U → V_{a_t} holds, where V_{a_t} is the domain of a_t, and a fuzzy relation R_{a_t} can be defined.
It is possible to define corresponding fuzzy relations for attributes with different types of values; one can refer to [44] for the details. Here, a fuzzy relation R is a fuzzy set in the fuzzy power set F(U × U) that measures the similarity between two objects in the universe U.
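As a concrete illustration, the following sketch builds pairwise fuzzy relation matrices for a real-valued and a nominal attribute and combines them by element-wise minimum, a standard way to obtain the relation for an attribute subset. All helper names are illustrative, and the relation 1 - |a(x_i) - a(x_j)| for normalized real-valued attributes is only a common choice, not necessarily the exact one used in [44].

```python
def fuzzy_relation_real(values):
    """Pairwise fuzzy similarity for a real-valued attribute normalized to [0, 1].

    Assumes the common choice R(x_i, x_j) = 1 - |a(x_i) - a(x_j)|; [44] may use
    a different parametric relation.
    """
    n = len(values)
    return [[1.0 - abs(values[i] - values[j]) for j in range(n)] for i in range(n)]


def fuzzy_relation_nominal(values):
    """Equivalence (crisp) relation for a nominal attribute: 1 iff values agree."""
    n = len(values)
    return [[1.0 if values[i] == values[j] else 0.0 for j in range(n)]
            for i in range(n)]


def intersect_relations(rels):
    """Relation for an attribute subset B: element-wise minimum of the
    per-attribute relations (a standard intersection of fuzzy relations)."""
    n = len(rels[0])
    return [[min(r[i][j] for r in rels) for j in range(n)] for i in range(n)]
```

A nominal attribute thus acts as a crisp partition that zeroes out similarity between objects in different classes when intersected with real-valued relations.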
By adding an attribute set D = {d} with A ∩ D = ∅ into a fuzzy information system (U, A), we obtain a fuzzy decision system (U, A ∪ D), where A is the conditional attribute set and D is the decision attribute set. It should be pointed out that d is a nominal attribute on which a mapping d : U → V_d holds, where V_d is the domain of d.
By utilizing a fuzzy rough set-based information entropy, a forward addition feature selection algorithm is proposed in [44], which is presented as follows.

Algorithm 1: Computing an ε-approximate reduct of a fuzzy decision system.
Input: a fuzzy decision system (U, A ∪ D) with U = {x_1, x_2, ..., x_n}, and a parameter ε ≥ 0.
Output: an ε-approximate reduct B.

In Step 3 of Algorithm 1, R_A[x_i]_D is the fuzzy lower approximation of the decision class [x_i]_D based on the fuzzy relation R_A, which was proposed in the pioneering work on fuzzy approximation operators [46] and is concretely computed by

R_A[x_i]_D(x_i) = min_{x_j ∈ U} max{1 - R_A(x_i, x_j), [x_i]_D(x_j)}.

Here, [x_i]_D is the crisp decision class to which the object x_i belongs. SIG_λ(a_i, B, D) denotes the significance of the attribute a_i for B relative to D, which is in fact the decrease of the λ-conditional entropy caused by adding the attribute, i.e., SIG_λ(a_i, B, D) = H_λ(D|B) - H_λ(D|B ∪ {a_i}). Here, the λ-conditional entropy H_λ(D|B) of the decision attribute set D relative to the conditional attribute subset B is defined in [44] in terms of the fuzzy granule [x_i]^{λ_i}_B of x_i with respect to B. It should be pointed out that |X| is the cardinality of a fuzzy set X, which is defined in [38] as |X| = Σ_{x ∈ U} X(x). Moreover, as indicated in [44], the λ-conditional entropy is generally less than n/e; thus, H_λ(D|B) is initialized to n/e in Step 1 of Algorithm 1. Furthermore, the λ-conditional entropy is monotonic, i.e., H_λ(D|B ∪ {a}) ≤ H_λ(D|B) for any a ∈ A \ B.

As indicated in [44], the time complexity of Algorithm 1 is O(|U|^2 |A|^2), in which Step 7 is the critical step for selecting features: the complexity of computing SIG_λ(a_i, B, D) is O(|U|^2), and the complexity of running Steps 2-4 is O(|U|^2 |A|). Here, | · | denotes the cardinality of a crisp set. Computing SIG_λ(a_i, B, D) may require a great amount of time when |U| is large. Therefore, a natural idea for accelerating Algorithm 1 is to accelerate the computation of SIG_λ(a_i, B, D) according to the computational properties of the λ-conditional entropy.
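To make the structure of Algorithm 1 concrete, here is a minimal sketch of its forward greedy loop. The λ-conditional entropy is left as a user-supplied callable, since its full definition is given in [44]; the lower-approximation helper follows the standard Dubois-Prade-style operator attributed to [46]. All function names are illustrative, not from the paper.

```python
import math


def lower_approx_degree(R_A, decision, i):
    """Membership of x_i in the fuzzy lower approximation of its own crisp
    decision class, in the Dubois-Prade style of [46]:
        R_A[x_i]_D(x_i) = min_j max(1 - R_A(x_i, x_j), [x_i]_D(x_j)).
    `R_A` is an n x n similarity matrix; `decision` holds crisp class labels.
    """
    n = len(R_A)
    return min(max(1.0 - R_A[i][j], 1.0 if decision[j] == decision[i] else 0.0)
               for j in range(n))


def greedy_select(attributes, entropy, n, eps=0.0):
    """Forward greedy selection in the shape of Algorithm 1.

    `entropy(B)` returns H_lambda(D|B) for an attribute list B (the exact
    entropy of [44] is not reproduced here).  H is initialized to n/e as in
    Step 1; selection stops when the best significance
    SIG = H(D|B) - H(D|B + {a}) drops to eps or below.
    """
    B = []
    H = n / math.e
    remaining = list(attributes)
    while remaining:
        best_sig, best_a = max((H - entropy(B + [a]), a) for a in remaining)
        if best_sig <= eps:
            break
        B.append(best_a)
        remaining.remove(best_a)
        H -= best_sig
    return B
```

With a monotone entropy callable, the loop adds one attribute per pass and evaluates the entropy |A| times per pass, which matches the O(|U|^2 |A|^2) total cost quoted above when each entropy evaluation costs O(|U|^2).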

Accelerated Computation of λ-Conditional Entropy
In the following, we concentrate on the computational characteristics of the λ-conditional entropy. Firstly, we recall Theorem 1 of [44], which characterizes the fuzzy lower approximation of an arbitrary fuzzy set X ∈ F(U) in a fuzzy information system (U, A) equipped with a fuzzy relation R_B for each B ⊆ A.

Let (U, A ∪ D) be a fuzzy decision system with U = {x_1, x_2, ..., x_n} and B ⊆ A. Denote by U*_B the set of objects of U that make a nonzero contribution to the λ-conditional entropy H_λ(D|B); obviously, U*_B ⊆ U. We then have the following property.

Property 1. Let (U, A ∪ D) be a fuzzy decision system with U = {x_1, x_2, ..., x_n} and B ⊆ A. Then every object in U \ U*_B also makes a zero contribution to H_λ(D|B ∪ {a}) for any a ∈ A \ B; consequently, H_λ(D|B ∪ {a}) can be computed by summing over U*_B only.

Proof. Assume that U \ U*_B ≠ ∅ (otherwise the claim is trivial). For any x_j ∈ U \ U*_B, the contribution of x_j to H_λ(D|B) is zero; since adding an attribute can only shrink the fuzzy granules, the contribution of x_j to H_λ(D|B ∪ {a}) is also zero for any a ∈ A \ B.

Assume that the similarity relation R_B(x_i, x_j) has been computed for any x_i ∈ U and x_j ∈ U. Then, according to Property 1, the time complexity of computing H_λ(D|B ∪ {a}) is reduced accordingly. Furthermore, denote by U^{x_i}_B the set of objects that belong to the fuzzy granule [x_i]^{λ_i}_B with the degree being λ_i. Since the granule takes only the values 0 and λ_i, for any x_j ∈ U \ U^{x_i}_B, it is easily obtained that [x_i]^{λ_i}_B(x_j) = 0. We then have the following property.

Property 2. Let (U, A ∪ D) be a fuzzy decision system with U = {x_1, x_2, ..., x_n} and B ⊆ A. Then, for any a ∈ A \ B, the granule [x_i]^{λ_i}_{B∪{a}} and its cardinality |[x_i]^{λ_i}_{B∪{a}}| can be computed on U^{x_i}_B alone (Equations (10) and (11)).

Proof. Assume that U \ U^{x_i}_B ≠ ∅. Then, for any a ∈ A \ B and any x_j ∈ U \ U^{x_i}_B, the fuzzy similarity relation satisfies R_{B∪{a}}(x_i, x_j) ≤ R_B(x_i, x_j), so [x_i]^{λ_i}_{B∪{a}}(x_j) = 0 as well. Substituting Equations (10) and (11) into Equation (8), we then obtain Equation (12), in which H_λ(D|B ∪ {a}) is evaluated by summing only over x_i ∈ U*_B and x_j ∈ U^{x_i}_B.

Corollary 1. Let (U, A ∪ D) be a fuzzy decision system with U = {x_1, x_2, ..., x_n} and B ⊆ A. Then, for any a ∈ A \ B, we have U^{x_i}_{B∪{a}} ⊆ U^{x_i}_B.

Proof. For any x_j ∈ U \ U^{x_i}_B, we have [x_i]^{λ_i}_B(x_j) = 0. It can be obtained from the proof of Property 2 that [x_i]^{λ_i}_{B∪{a}}(x_j) = 0 holds for any a ∈ A \ B, which yields x_j ∈ U \ U^{x_i}_{B∪{a}}. Thus, U^{x_i}_{B∪{a}} ⊆ U^{x_i}_B.

Assume that the similarity relation R_B(x_i, x_j) has been computed for any x_i ∈ U and x_j ∈ U. Then, according to Equation (12), the time complexity of computing H_λ(D|B ∪ {a}) is O(C|U*_B|), which is generally less than O(|U|^2) since both C ≤ |U| and |U*_B| ≤ |U| hold. Here, C = max{|U^{x_i}_B| : x_i ∈ U*_B}. Therefore, according to Properties 1 and 2, we can use Equation (12) to compute H_λ(D|B ∪ {a}) and then obtain an accelerated algorithm in the following.
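The set-shrinking behavior behind Properties 1 and 2 and Corollary 1 can be sketched as follows: when an attribute a is added to B, the similarity R_{B∪{a}} = min(R_B, R_a) can only decrease, so a dropped neighbor can never re-enter, and it suffices to rescan the current neighbor lists instead of all of U × U. The threshold test below is only a stand-in for the exact membership condition of [44], and all names are illustrative.

```python
def refine_neighbors(neighbors, R_a, thresholds):
    """One accelerating step of the neighbor-list refinement.

    `neighbors[i]` is the current list of indices j with x_j in U^{x_i}_B.
    When attribute a is added, R_{B+{a}}(x_i, x_j) = min(R_B(x_i, x_j),
    R_a(x_i, x_j)) can only shrink, so we keep only those current neighbors
    whose new per-attribute similarity still exceeds the cut `thresholds[i]`
    (a stand-in for the condition under which the granule membership of [44]
    stays nonzero).
    """
    return [[j for j in neighbors[i] if R_a[i][j] > thresholds[i]]
            for i in range(len(neighbors))]
```

Because each refinement only filters the previous lists, the cost of a pass is proportional to the current list sizes (the C|U*_B| factor above), not to |U|^2.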
Compared with Algorithm 1, there are three differences in Algorithm 2. First, Algorithm 2 sets U*_B and U^{x_i}_B (x_i ∈ U) to U in Steps 1-4. Second, the evaluation measure H_λ(D|B ∪ {a_i}) is computed by the improved mechanism of Equation (12) in Step 10, in which U*_{B∪{a_i}} can be automatically acquired without additional computation; here, the complexity of computing H_λ(D|B ∪ {a_i}) is O(C|U*_B|). Third, U*_B and U^{x_i}_B are iteratively updated in Steps 16-20, and Steps 17-20 need O(C|U*_B|). Furthermore, the main procedure of Algorithm 2 for selecting features, namely Steps 8-22, needs to be run at most |A| times, so the time complexity is O(C|U*_B||A|^2). By contrast, the main process of Algorithm 1 for selecting features, Steps 5-14, requires O(|U|^2|A|^2). It should be pointed out that both |U*_B| and C may monotonically decrease in the iteration process of Algorithm 2, which is the main source of the acceleration.

Numerical Experiment
In this section, numerical experiments are conducted to assess the performance of Algorithm 2. The experiments mainly focus on showing the computational efficiency of Algorithm 2. To achieve this task, nine data sets were downloaded from the UCI Machine Learning Repository. The data sets are briefly described in Table 1.

Pretreatment of the Data Sets and Design of the Experiment
For each data set, the object set, conditional attribute set, and decision attribute set are denoted by U, A, and D, respectively. If there are real-valued conditional attributes in A, then, for each real-valued attribute a ∈ A, the attribute value of each object is normalized according to the method in [44], so that a(x_i) ∈ [0, 1] for each x_i ∈ U. Here, a is still used to denote the corresponding normalized conditional attribute for notational simplicity.
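For instance, a min-max normalization maps each real-valued attribute into [0, 1]; this is shown only as a common choice, since the exact scheme is the one given in [44].

```python
def normalize_attribute(values):
    """Min-max normalize a real-valued attribute into [0, 1] (a common choice;
    see [44] for the exact scheme used in the paper)."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant attribute: map everything to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```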
The experiment was designed as follows. Given one of the pretreated data sets, the objects were randomly divided into 20 approximately equal parts. The first part was taken as the 1st data set, the combination of the first and second parts was taken as the 2nd data set, the combination of the first three parts was taken as the 3rd data set, ..., and the combination of all twenty parts was taken as the 20th data set. For each of the generated 20 data sets, a fuzzy relation is defined for each normalized conditional attribute a by Equation (18). On the other hand, a special fuzzy relation, namely an equivalence relation, is defined for each nominal attribute a ∈ A by Equation (19), i.e., R_a(x_i, x_j) = 1 if a(x_i) = a(x_j) and R_a(x_i, x_j) = 0 otherwise, where x_i, x_j ∈ U_k. Here, U_k is the universe determined by the k-th data set. In this way, a fuzzy decision system (U_k, A ∪ D) is formed for the k-th data set. Then, Algorithms 1 and 2 were run on these fuzzy decision systems and their computation times were recorded.

Furthermore, the "ten-fold approach" was also used to assess the efficiency of the fast algorithm proposed in this paper. Specifically, for each of the pretreated data sets, the instances were randomly divided into 10 approximately equal parts. The k-th part was removed and the remainder was taken as the k-th data set, which generates ten data sets called the ten-fold data sets. The fuzzy relations for real-valued and nominal attributes were then defined according to Equations (18) and (19), respectively, forming a fuzzy decision system for each of the ten-fold data sets, and Algorithms 1 and 2 were run to obtain their computation times.

Moreover, it should be pointed out that Algorithms 1 and 2 produce the same output for the same threshold value ε. The parameter ε determines the number of selected features: the smaller the threshold ε, the more features are selected and thus the more computation time is needed.
Therefore, the parameter ε in both Algorithms 1 and 2 was set to 0. The experiment was performed in MATLAB R2016a on a personal computer with an Intel(R) Core(TM) i7-4510U CPU @ 2.00 GHz, 8 GB of memory, and the 64-bit Windows 7 system.
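The two splitting schemes described above can be sketched as follows (illustrative helper names; the paper does not specify the random partitioning beyond "approximately equal parts"):

```python
import random


def incremental_parts(objects, k=20, seed=0):
    """Split objects into k roughly equal parts and return the k nested data
    sets used in the experiment: part 1, then parts 1-2, parts 1-3, and so on
    up to all k parts."""
    idx = list(range(len(objects)))
    random.Random(seed).shuffle(idx)
    parts = [idx[i::k] for i in range(k)]   # round-robin -> roughly equal sizes
    datasets, acc = [], []
    for p in parts:
        acc = acc + p
        datasets.append([objects[i] for i in acc])
    return datasets


def tenfold_datasets(objects, seed=0):
    """Remove each of 10 parts in turn; the remainder forms one ten-fold set."""
    idx = list(range(len(objects)))
    random.Random(seed).shuffle(idx)
    folds = [set(idx[i::10]) for i in range(10)]
    return [[objects[i] for i in idx if i not in f] for f in folds]
```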

Comparison of Computation Time on 20 Data Sets Generated by Each Data Set
The computation time obtained by Algorithms 1 and 2 on the 20 data sets generated from each data set is depicted in Figure 1. For each of the sub-figures in Figure 1, the x-coordinate indicates the generated data sets, where the number k denotes the k-th data set; in other words, the x-coordinate expresses the size of each data set, and the k-th data set contains (5k)% of the original data. The y-coordinate shows the running time (in seconds). It can be seen from Figure 1 that, for each data set, both Algorithms 1 and 2 require more time as the data size increases. At the beginning, the two algorithms cost an almost equivalent amount of time, and Algorithm 2 may even need a little more time than Algorithm 1, since its advantage is limited on smaller data sets and it must additionally run Steps 17-20. However, as the data set size increases, Algorithm 2 clearly requires less running time than Algorithm 1. Therefore, the proposed Algorithm 2 is efficient and can be regarded as an accelerated version of Algorithm 1.

Comparison of Computation Time on Ten-Fold Data Sets Produced by Each Data Set
The computation time on the ten-fold data sets generated from each data set is depicted in Figure 2. For each of the sub-figures in Figure 2, the x-coordinate indicates the generated data sets, where the number i denotes the i-th data set, and the y-coordinate shows the running time (in seconds). Furthermore, the average computation time is listed in Table 2. In addition, the average cardinalities of the selected feature subsets, expressed by | · |, are listed in the 3rd and 5th columns of Table 2. Moreover, in order to illustrate the variation tendency of |U*_B| in the iteration process of the proposed Algorithm 2, the relevant result obtained on one of the ten-fold data sets is depicted in Figure 3. For each of the sub-figures in Figure 3, the x-coordinate indicates the number of iterations in Algorithm 2 and the y-coordinate expresses the cardinality of U*_B. It can be clearly seen from Figure 2 and Table 2 that, for each of the data sets, Algorithm 2 requires less time than Algorithm 1 on the ten-fold data sets. Especially for the data sets German, Musk1, HV, and Robot, Algorithm 2 requires much less time, needing no more than about 60% of the running time of Algorithm 1. Thus, Algorithm 2 appears to save significantly more running time on data sets with a larger size or with more features. Moreover, the results in the 3rd and 5th columns of Table 2 verify that the features selected by Algorithms 1 and 2 are the same. In addition, it can be seen from Figure 3 that |U*_B| does monotonically decrease as the iteration number increases; in fact, the decrease of |U*_B| contributes to the accelerated computation of Algorithm 2. Therefore, Algorithm 2 is again validated to be effective on the ten-fold data sets.

Conclusions
Based on an existing feature selection algorithm that uses a fuzzy rough set-based information entropy, this paper presents an accelerated feature selection algorithm built on the computational properties of that entropy, in which the entropy is computed with lower time complexity. The numerical experimental results demonstrate that the algorithm effectively decreases computation time and is thus efficient and effective. In future work, the proposed fast feature selection algorithm will be extended to dynamic data environments in which new instances or new features are added.