Combining Fuzzy C-Means Clustering with Fuzzy Rough Feature Selection

With the rapid development of networks, data fusion has become an important research hotspot. Large amounts of data must be preprocessed in data fusion; in practice, the features of a dataset can be filtered to reduce the amount of data. Feature selection based on fuzzy rough sets can process large numbers of continuous and discrete attributes to reduce the data dimension, making the selected feature subset highly correlated with the classification but weakly dependent on other features. In this paper, a new fuzzy rough feature selection method is proposed which adds the membership function determination method of fuzzy c-means clustering and a fuzzy equivalence relation to the original selection. Different from existing research, our method takes full advantage of knowledge about the dataset itself and of the differences between datasets, which gives the selected features a higher correlation with the classification, improves the classification accuracy, and reduces the data dimension. Experimental results on datasets from the UCI machine learning repository confirmed the performance and effectiveness of our method: compared to existing methods, smaller feature subsets and classification accuracies an average of 1% higher were achieved.


Introduction
We live in an age of data explosion; dealing with ever-growing datasets usually requires considerable time and expense with existing computers and algorithms. We may want a dataset to contain more and more features to increase the likelihood of distinguishing different categories. Unfortunately, this is not necessarily helpful: a higher-dimensional dataset increases the chance of discovering spurious patterns that do not hold in general. An effective way to resolve this problem is to select the most relevant and informative features from the dataset and eliminate redundant or irrelevant ones. Unlike other dimensionality reduction methods, feature selection retains the original meaning of the features. It can effectively reduce the size of a dataset without distorting the information expressed by the data, thus reducing cost and saving time.
Researchers have proposed different definitions of feature selection. Ideally, feature selection finds the minimal feature subset that is necessary and sufficient to identify the target [1]. Defined from the angle of prediction accuracy, it is a process that increases the classification accuracy, or reduces the feature dimension without lowering the classification accuracy [2]. The basic procedure of feature selection is to generate a feature subset (the search algorithm) and then evaluate that subset (the evaluation criterion); these are its two key parts. An excellent search algorithm speeds up the search for the optimal feature subset; common search strategies include global optimization, random search, and heuristic search. The evaluation criterion scores each candidate subset and directly determines the output of the algorithm and the performance of the resulting classification model; a good criterion ensures the chosen subset carries a large amount of information with little redundancy. Evaluation functions can be divided into filter (the evaluation function is independent of the classifier), wrapper (the error rate of a classifier is used as the evaluation function), and embedded (a mixture of the first two) approaches. Common feature selection methods include Relief (relevant features), LVW (Las Vegas wrapper), LARS (least angle regression), and the attribute reduction of rough sets.
Fuzzy rough sets are an extension of rough sets. Two important applications of rough sets are rule induction and attribute reduction (the latter has the same meaning as feature selection, but in rough set theory it is usually called attribute reduction). However, rough set feature selection can only handle discrete data; to overcome this drawback, fuzzy sets were combined with rough sets. Fuzzy rough feature selection (FRFS) can effectively reduce datasets of real-valued or noisy discrete data (or both) without user-supplied information. In addition, the technique can be applied to data with a continuous or nominal decision attribute, so it can be used on both regression and classification datasets, and it only requires, for each feature, a fuzzy partition that can be derived automatically from the data. An important problem in FRFS is determining the similarity relations between objects; the existing methods are neither flexible nor tailored to the data. In this paper, a method is proposed that combines a membership function generation method based on fuzzy c-means clustering with a fuzzy equivalence relation. The improved FRFS automatically generates a membership function from the knowledge of the dataset itself and then completes feature selection.
This paper is structured as follows: Section 2 is the literature review. The theoretical background is given in Section 3, introducing rough set attribute reduction, membership functions, and fuzzy rough feature selection. In Section 4, the improved fuzzy rough feature selection is presented, and experiments are reported in Section 5. The paper is concluded in Section 6.

Literature Review
Rough set theory is a framework introduced by Zdzisław Pawlak in the early 1980s, which can construct concept approximations from incomplete information. The available information consists of a set of examples of a concept and the relationships between them, such as indiscernibility, set approximation, reducts, and dependency [3,4]. As a soft computing method, rough sets have received more and more attention and remain a research hotspot in the field of artificial intelligence. Hu and Yao proposed structured rough set approximations in complete and incomplete information systems to serve as a basis for three-way decisions with rough sets [5]. To deal with incomplete information systems, a more generalized approach that considers potential candidates was presented [6].
Rule induction and feature selection are two important applications of rough sets. Every component of the rule induction model is introduced in detail in [7]. In [8,9], rule induction is carried out for information systems with missing feature values. In [10,11], the researchers used the result of attribute reduction to classify datasets with neural networks; the test results indicated that the misclassification rate does not increase significantly while the training time drops, and they concluded that rough set attribute reduction has practical potential. Because the attribute reduction of rough sets is an NP-hard problem, much research has focused on acceleration algorithms [12-14]. Recently, two quick feature selection algorithms based on neighbor inconsistent pairs were presented which reduce the time consumed in finding a reduct [15].
Fuzzy sets were introduced independently by Lotfi A. Zadeh and Dieter Klaua in 1965 as an extension of the classical notion of a set [16]. Because both rough sets and fuzzy sets deal with uncertain data, many scholars have compared the two methods and made great contributions [17,18]. Dubois and Prade first combined fuzzy sets and rough sets [19]; thereafter, research centering on fuzzy rough sets appeared one after another [20-25], and accelerated algorithms emerged in the meantime, such as feature selection based on ant colony optimization [26] and on information entropy [27]. In recent years, a feature selection algorithm built on a new definition of fuzzy rough set approximations based on a divergence measure of fuzzy sets was proposed and its properties were explored [28]. Another line of interest is accelerating fuzzy rough feature selection; a method based on sample reduction and dimensionality reduction was proposed [29].

Theoretical Background
Rough set theory was proposed by Zdzisław Pawlak in the early 1980s [30,31]. A basic notion of rough sets is the pair of lower and upper approximations: a vague concept is approximated by a pair of precise concepts. Objects belonging to the same category have the same attribute values and are therefore indistinguishable.

Rough Set Attribute Reduction
The central notion of RSAR (rough set attribute reduction) is indiscernibility. Assume there is an information system I = (U, A), where U is a non-empty finite set of objects (the universe) and A is a non-empty finite set of attributes such that for each a ∈ A there is a map a : U → V_a, where V_a is the value set of attribute a. A = C ∪ D, where C is the set of condition attributes and D is the set of decision attributes.

Definition 1 (Indiscernibility).
For any P ⊆ A, there is an associated equivalence relation IND(P): IND(P) = {(x, y) ∈ U² | ∀a ∈ P, a(x) = a(y)}. If (x, y) ∈ IND(P), then x and y are indiscernible with respect to the attributes in P. The equivalence class of x under the P-indiscernibility relation is denoted [x]_P.
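For illustration, the indiscernibility partition can be computed directly. The following Python sketch groups object indices by their values on an attribute subset P; the toy decision table and attribute names are our own, not taken from Table 1:

```python
from collections import defaultdict

def equivalence_classes(objects, attrs):
    """Partition objects by their values on the attribute subset P.

    `objects` is a list of dicts mapping attribute name -> value; `attrs`
    is the subset P. Two objects land in the same class iff they agree
    on every attribute in P (the IND(P) relation).
    """
    classes = defaultdict(list)
    for i, obj in enumerate(objects):
        key = tuple(obj[a] for a in attrs)
        classes[key].append(i)
    return list(classes.values())

# Hypothetical four-object decision table with condition attributes a, b
# and decision attribute q.
U = [
    {"a": 1, "b": 0, "q": "yes"},
    {"a": 1, "b": 1, "q": "yes"},
    {"a": 0, "b": 0, "q": "no"},
    {"a": 1, "b": 0, "q": "no"},
]
print(equivalence_classes(U, ["a"]))  # objects 0, 1, 3 agree on a
```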

Definition 2 (Lower Approximation).
The lower approximation of a set X ⊆ U with respect to P is defined as PX = {x | [x]_P ⊆ X}.

Definition 3 (Positive Region).
Let P and Q be equivalence relations over U. The positive region is defined as POS_P(Q) = ∪_{X ∈ U/Q} PX. The positive region contains the objects of U that can be assigned with certainty to a class of U/Q using the knowledge in the attributes P.

Definition 4 (Dependency).
An important task of data analysis is finding dependencies between attributes. If all values of a set of attributes Q are uniquely determined by the values of another set of attributes P, i.e., there is a functional dependency between the values of Q and P, then Q depends totally on P, denoted P ⇒ Q. Dependency is defined as follows: for P, Q ⊆ A, Q depends on P with degree k (0 ≤ k ≤ 1), denoted P ⇒_k Q, where k = γ_P(Q) = |POS_P(Q)| / |U|. If k = 1, Q depends totally on P; if 0 < k < 1, Q depends partially on P with degree k; if k = 0, Q does not depend on P.
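The dependency degree can be computed directly from the two partitions. A minimal Python sketch (with a hypothetical four-object table, not the paper's data) follows:

```python
from collections import defaultdict

def partition(objects, attrs):
    """Equivalence classes of IND(attrs) as sets of object indices."""
    groups = defaultdict(set)
    for i, obj in enumerate(objects):
        groups[tuple(obj[a] for a in attrs)].add(i)
    return list(groups.values())

def dependency(objects, P, Q):
    """Degree k = |POS_P(Q)| / |U|: fraction of objects whose P-class
    fits inside a single Q-class, i.e. is classified unambiguously."""
    q_classes = partition(objects, Q)
    pos = set()
    for p_class in partition(objects, P):
        if any(p_class <= q_class for q_class in q_classes):
            pos |= p_class
    return len(pos) / len(objects)

U = [
    {"a": 1, "b": 0, "q": "yes"},
    {"a": 1, "b": 1, "q": "yes"},
    {"a": 0, "b": 0, "q": "no"},
    {"a": 1, "b": 0, "q": "no"},
]
print(dependency(U, ["a"], ["q"]))       # 0.25: only object 2 is certain
print(dependency(U, ["a", "b"], ["q"]))  # 0.5: objects 1 and 2 are certain
```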
A basic idea is to calculate the dependency of every possible subset of C; any subset with γ(D) = 1 is a reduct, and the smallest such subset is the minimal reduct. However, this exhaustive approach is infeasible for large datasets. Algorithm 1, the QUICKREDUCT algorithm [20], avoids evaluating all possible subsets: start with an empty set and repeatedly add the attribute that yields the greatest increase in dependency until the maximum possible value (usually 1) is reached. Note that the algorithm is not guaranteed to find the minimal reduct every time. In the worst case, the complexity reaches n! for an attribute dimensionality of n.
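The greedy loop of QUICKREDUCT can be sketched as follows. This is an illustrative Python rendering of Algorithm 1, not the authors' code; the small decision table at the bottom is invented:

```python
from collections import defaultdict

def dependency(objects, P, Q):
    """gamma_P(Q) = |POS_P(Q)| / |U| for a crisp decision table."""
    def partition(attrs):
        groups = defaultdict(set)
        for i, obj in enumerate(objects):
            groups[tuple(obj[a] for a in attrs)].add(i)
        return list(groups.values())
    q_classes = partition(Q)
    pos = set()
    for p_class in partition(P):
        if any(p_class <= q for q in q_classes):
            pos |= p_class
    return len(pos) / len(objects)

def quickreduct(objects, conditions, decisions):
    """Greedy QUICKREDUCT: repeatedly add the attribute giving the
    largest rise in dependency until gamma stops improving. It may
    return a superset of the minimal reduct, as noted in the text."""
    reduct, gamma = [], 0.0
    while True:
        best_attr, best_gamma = None, gamma
        for a in conditions:
            if a in reduct:
                continue
            g = dependency(objects, reduct + [a], decisions)
            if g > best_gamma:
                best_attr, best_gamma = a, g
        if best_attr is None:
            return reduct
        reduct.append(best_attr)
        gamma = best_gamma

# Invented table: attribute a alone already discerns decision q.
U = [
    {"a": 1, "b": 0, "c": 0, "q": "yes"},
    {"a": 1, "b": 1, "c": 0, "q": "yes"},
    {"a": 0, "b": 0, "c": 1, "q": "no"},
    {"a": 0, "b": 1, "c": 1, "q": "no"},
]
print(quickreduct(U, ["a", "b", "c"], ["q"]))  # ["a"]
```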

Membership Function
The American cybernetician L.A. Zadeh created fuzzy set theory with a groundbreaking paper in 1965 [16]. A fuzzy set is an extension of the classical set, containing more general mathematical concepts and forming the new discipline of fuzzy mathematics. Assuming X is the universe, a classical subset A of X can be represented by its characteristic function χ_A : X → {0, 1}. For a fuzzy subset A of X and any x ∈ X, x neither absolutely belongs to A nor absolutely does not belong to A; the degree to which x belongs to A is represented by a value in [0, 1].

Definition 5 (Membership Function).
Assuming X is the universe, a map A : X → [0, 1] is called a membership function, and A(x) is the membership degree of x in the fuzzy set A.
The generation of the fuzzy membership function is fundamentally important, and it matters to find a proper one. A basic construction method uses reference functions: commonly used families with proper parameters give the membership functions we need, such as the triangular membership function A(x) = max(min((x − a)/(b − a), (c − x)/(c − b)), 0), the trapezoidal membership function A(x) = max(min((x − a)/(b − a), 1, (d − x)/(d − c)), 0), and the Gaussian membership function A(x) = exp(−(x − c)²/(2σ²)). Membership functions can also be generated from available data by many methods, such as the histogram method, the transformation of probability distributions into possibility distributions, clustering, and neural networks [32-36]. To make full use of fuzzy theory, we need an effective membership function generation mechanism, and it should have the following properties [37]:

1. Accuracy. The membership function should reflect the knowledge contained in the data accurately.
2. Flexibility. The method should provide a broad family of membership functions.
3. Computability. The method should be computationally feasible so that it has practical value; the literature [38] emphasizes the importance of easy optimization and adjustment of membership functions.
4. Ease of use. Once the membership function is determined, for any given x, the corresponding A(x) can be found easily.
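The reference functions mentioned above are standard; a small Python sketch of the triangular, trapezoidal, and Gaussian families follows, using the usual parameter conventions (a ≤ b ≤ c ≤ d):

```python
import math

def triangular(x, a, b, c):
    """Triangular membership: rises on [a, b], falls on [b, c]."""
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)

def trapezoidal(x, a, b, c, d):
    """Trapezoidal membership: flat top of grade 1 on [b, c]."""
    return max(min((x - a) / (b - a), 1.0, (d - x) / (d - c)), 0.0)

def gaussian(x, c, sigma):
    """Gaussian membership centred at c with width sigma."""
    return math.exp(-((x - c) ** 2) / (2 * sigma ** 2))

print(triangular(2.0, 0, 2, 4))      # 1.0 at the peak
print(trapezoidal(5.0, 0, 2, 4, 6))  # 0.5 on the falling edge
print(gaussian(0.0, 0.0, 1.0))       # 1.0 at the centre
```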
In this paper, we use clustering to find the membership functions: the fuzzy c-means clustering (FCM) method generates the fuzzy membership function during the clustering process.

Fuzzy Rough Feature Selection
The RSAR described above only applies to discrete datasets, but real-life datasets usually contain real values and noise. Using fuzzy set theory, we can deal with this more complex situation. Fuzzy mathematics is the mathematical theory that studies fuzzy phenomena. Its core is the fuzzy set, which, unlike a classical set, has no definite elements and can only be characterized through its membership function.
The intersection, union, and complement operations of fuzzy sets are similar to those of classical sets, but in some cases the usual operators are not appropriate, so the choice of operators should be analyzed in detail. T-norms and s-norms can be regarded as generalized intersection and union operations.

Definition 6 (t-norm).
A triangular norm, or shortly t-norm, reflects the nature of the logical operator "and". A t-norm is a binary function T : [0, 1]² → [0, 1] that satisfies commutativity T(x, y) = T(y, x), associativity T(x, T(y, z)) = T(T(x, y), z), monotonicity T(x, y) ≤ T(x, z) for y ≤ z, and the boundary condition T(x, 1) = x. Some frequently used t-norms ⊗ are:
the standard min operator x ⊗ y = min{x, y};
the algebraic product x ⊗ y = xy;
the Łukasiewicz t-norm x ⊗ y = max{x + y − 1, 0}.

Definition 7 (s-norm).
An s-norm is also called a triangular conorm, or shortly t-conorm. An s-norm is a binary function S : [0, 1]² → [0, 1] that satisfies commutativity, associativity S(x, S(y, z)) = S(S(x, y), z), monotonicity, and the boundary condition S(x, 0) = x. Three well-known s-norms ⊕ are:
the standard max operator x ⊕ y = max{x, y};
the probabilistic sum x ⊕ y = x + y − xy;
the bounded sum x ⊕ y = min{x + y, 1}.
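The t-norms of Definition 6 and the s-norms of Definition 7 pair up as De Morgan duals under the standard negation n(x) = 1 − x; the following short Python sketch implements the three pairs and checks this duality:

```python
# Standard t-norms and their dual s-norms from Definitions 6 and 7.
T_NORMS = {
    "min":         lambda x, y: min(x, y),
    "product":     lambda x, y: x * y,
    "lukasiewicz": lambda x, y: max(x + y - 1.0, 0.0),
}
S_NORMS = {
    "max":         lambda x, y: max(x, y),
    "prob_sum":    lambda x, y: x + y - x * y,
    "bounded_sum": lambda x, y: min(x + y, 1.0),
}

# Each s-norm is the De Morgan dual of the matching t-norm under the
# standard negation n(x) = 1 - x:  S(x, y) = 1 - T(1 - x, 1 - y).
for t_name, s_name in [("min", "max"), ("product", "prob_sum"),
                       ("lukasiewicz", "bounded_sum")]:
    T, S = T_NORMS[t_name], S_NORMS[s_name]
    x, y = 0.3, 0.8
    assert abs(S(x, y) - (1.0 - T(1.0 - x, 1.0 - y))) < 1e-12

print(T_NORMS["lukasiewicz"](0.3, 0.8))  # max(0.3 + 0.8 - 1, 0), i.e. about 0.1
```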
A fuzzy rule can be expressed as "if A then B", shortly A → B, where A and B are fuzzy sets with truth degrees A(x) and B(y). The truth degree of A → B is expressed as A(x) → B(y); the degree of the proposition depends on the truth degrees of the antecedent and the consequent.

Definition 8 (Fuzzy Implicator).
A fuzzy implicator is a binary function I : [0, 1]² → [0, 1] that is decreasing in its first argument, increasing in its second, and satisfies I(0, 0) = I(0, 1) = I(1, 1) = 1 and I(1, 0) = 0. Frequently used implicators include the Łukasiewicz implicator I(x, y) = min(1 − x + y, 1) and the Kleene–Dienes implicator I(x, y) = max(1 − x, y).
The fuzzy upper and lower approximations are defined either with a fuzzy partition of the input space [39] or as follows:
µ_{R_P X}(x) = inf_{y∈U} I(µ_{R_P}(x, y), µ_X(y)),
µ_{R_P X̄}(x) = sup_{y∈U} T(µ_{R_P}(x, y), µ_X(y)),
where I is the fuzzy implicator, T is a t-norm, and R_P represents the fuzzy similarity relation on the feature subset P, with µ_{R_a}(x, y) representing the degree of similarity of x and y with respect to feature a.
The fuzzy positive region and dependency are defined as in rough set [40].
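As a sketch of how the fuzzy lower approximation is evaluated in practice, the following Python code applies the Łukasiewicz implicator I(x, y) = min(1 − x + y, 1) to a hypothetical 3 × 3 similarity matrix; the matrix and the crisp set X are illustrative, not the Table 1 data:

```python
def lukasiewicz_implicator(x, y):
    """I(x, y) = min(1 - x + y, 1), one frequently used implicator."""
    return min(1.0 - x + y, 1.0)

def lower_approximation(similarity, membership,
                        implicator=lukasiewicz_implicator):
    """mu_{R_P X}(x) = inf_{y in U} I(mu_{R_P}(x, y), mu_X(y)).

    `similarity` is an n x n fuzzy similarity matrix for a feature
    subset P, and `membership` holds the grades of the n objects in X.
    Returns the grade of each object in the lower approximation.
    """
    n = len(membership)
    return [min(implicator(similarity[x][y], membership[y])
                for y in range(n))
            for x in range(n)]

# Illustrative 3-object similarity matrix and a crisp set X = {0, 1}.
R = [[1.0, 0.8, 0.1],
     [0.8, 1.0, 0.2],
     [0.1, 0.2, 1.0]]
X = [1.0, 1.0, 0.0]
print(lower_approximation(R, X))  # about [0.9, 0.8, 0.0]
```

Object 2 gets grade 0 because it is fully similar to itself yet outside X, while objects 0 and 1 keep high grades since everything similar to them is mostly inside X.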
An example with the dataset in Table 1 follows. There are six objects, features (condition attributes) a, b, and c, and a label (decision attribute) q.
Using Equation (6), we can obtain the fuzzy similarity matrices; the matrices for the remaining objects are computed in the same way as for object 3.

New Method
In the existing fuzzy rough feature selection algorithms, there are two ways of choosing the fuzzy sets: one supplies a fuzzy partition together with the input data [39]; the other defines fuzzy similarity relations and a fuzzy implicator [40,41]. Both methods have their own drawbacks. The first complicates the algorithm, since extra knowledge must be added to the feature selection, which departs from our original intention. The second has problems in the definition of the fuzzy similarity relations. Common definitions of the relation at present include
µ_{R_a}(x, y) = 1 − |a(x) − a(y)| / (a_max − a_min),
µ_{R_a}(x, y) = exp(−(a(x) − a(y))² / (2σ_a²)),
where σ_a² is the variance of feature a. As we can see, these define the fuzzy similarity relations of all datasets with a single equation and ignore the differences between datasets. Generating a fuzzy set automatically is therefore highly desirable. A dataset is the universe of its fuzzy sets and contains many of them; if we can abstract the fuzzy sets and fuzzy similarity relations from each dataset itself, so that they differ between datasets, the algorithm model will have better generalization ability.

Reduction
The fuzzy c-means clustering algorithm (FCM) has a wide range of applications and is among the most successful of the numerous fuzzy clustering algorithms. It obtains the membership degree of every sample point to each cluster center through the optimization of an objective function [42].
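The standard FCM optimization can be rendered as a compact NumPy sketch; the initialization, tolerance, and example data below are our own choices, and the closed-form updates are the usual alternating FCM ones:

```python
import numpy as np

def fcm(X, c, m=2.0, max_iter=100, eps=1e-5, seed=0):
    """Fuzzy c-means: alternate the closed-form updates for the c x n
    membership matrix U and the centres until U stabilises.

    Minimises J = sum_i sum_j u_ij^m ||c_i - x_j||^2
    subject to sum_i u_ij = 1 for every sample j.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    U = rng.random((c, n))
    U /= U.sum(axis=0)                  # enforce the column constraint
    for _ in range(max_iter):
        Um = U ** m
        centers = (Um @ X) / Um.sum(axis=1, keepdims=True)
        d = np.linalg.norm(centers[:, None, :] - X[None, :, :], axis=2)
        d = np.fmax(d, 1e-10)           # avoid division by zero
        U_new = 1.0 / (d ** (2.0 / (m - 1.0)))
        U_new /= U_new.sum(axis=0)      # u_ij = 1 / sum_k (d_ij/d_kj)^(2/(m-1))
        if np.abs(U_new - U).max() < eps:
            return centers, U_new
        U = U_new
    return centers, U

# Two well-separated 1-D clusters; memberships come out close to crisp.
data = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
centers, U = fcm(data, c=2)
print(np.round(centers, 2).ravel())  # centres near 0.1 and 5.1
```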
The objective function is expressed through the Euclidean distances between cluster centers and sample points; each cluster center is found by minimizing this dissimilarity measure:
J = Σ_{i=1}^{c} Σ_{j=1}^{n} u_ij^m d_ij², subject to Σ_{i=1}^{c} u_ij = 1, (12)
where u_ij ∈ [0, 1], c_i is the clustering center of the ith fuzzy set, d_ij = ||c_i − x_j|| is the Euclidean distance between the ith clustering center and the jth sample point, and m > 1 is the weighting exponent. Constructing the Lagrangian of the constrained formula and setting the derivatives with respect to all parameters to zero makes Equation (12) reach its minimum, which yields the update equations
c_i = Σ_{j=1}^{n} u_ij^m x_j / Σ_{j=1}^{n} u_ij^m, (13)
u_ij = 1 / Σ_{k=1}^{c} (d_ij / d_kj)^{2/(m−1)}. (14)
The whole procedure of the clustering algorithm is given in Algorithm 2. The outputs of FCM are the centers c and the membership matrix U, which contains the degree to which every object belongs to each center.

Algorithm 2 FCM
1: Input data X, cluster number c, exponent m, tolerance ε, maximum iterations T
2: Initialize U subject to Σ_i u_ij = 1
3: t ← 0
4: Repeat
5: Update c_i and U with Equations (13) and (14)
6: t ← t + 1
7: Until t = T or ||U_t − U_{t−1}|| ≤ ε
8: Return U

The definition of the lower approximation of the fuzzy rough set is:
µ_{R_P X}(x) = inf_{y∈U} I(µ_{R_P}(x, y), µ_X(y)), (15)
where I is the fuzzy implicator and µ_{R_P}(x, y) represents the similarity relation between x and y over the whole feature subset P. In order to obtain a single similarity relation, we take the intersection of the relations of all features in P, where the intersection is calculated with the t-norm:
µ_{R_P}(x, y) = ⊗_{a∈P} µ_{R_a}(x, y),
where µ_{R_a}(x, y) represents the similarity degree of x and y with respect to feature a. Fuzzy clustering is applied to every feature with Algorithm 2 to obtain the membership degree of every object with respect to that feature, Equation (17). Because equivalence relations are used to model equality, fuzzy equivalence relations are commonly considered to represent approximate equality or similarity [41,43]. We use the fuzzy equivalence relation R of Equation (18) from the literature [44]. Figure 1 shows that the equation takes different forms for different values of its parameters a and b: the value of a defines the basic shape of the function, and the value of b defines the size of the opening on one side. Since we use the function to describe the similarity of two objects, we choose b equal to 0 to keep the function balanced between the two objects whether their values are large or small. As a goes to 0, the function approaches a crisp relation whose output is 1 when two objects are equal and 0 otherwise. We therefore chose a small value of a, because after the FCM algorithm the differences between objects are small.

Algorithm 3 C-FRFS
1: R ← {}
2: Cluster every feature with Algorithm 2
3: Calculate E(x, y) for every pair of objects
4: γ_prev ← 0, γ_best ← 0
5: do
6: γ_prev ← γ_best
7: for every a ∈ C \ R
8: choose the attribute a with the greatest γ_{R∪{a}}(D)
9: T ← R ∪ {a}
10: R ← T
11: γ_best ← γ_R(D)
12: until (γ_best − γ_prev) × |U| < 1
13: return R
According to the clustering membership and Equation (18), we can get the fuzzy similarity relation of two objects.
The definitions of positive region and dependency are the same as we mentioned above [40].
The steps of the algorithm are given in Algorithm 3; we simply call it C-FRFS, meaning fuzzy rough feature selection based on clustering. We apply fuzzy c-means clustering to every feature of the universe. For every two objects x and y, the fuzzy equivalence relation of Equation (19) describes their fuzzy similarity. Then, according to Equations (15), (20), and (21), we can obtain the dependency degree γ of every feature subset of C. Starting with an empty set R, each iteration selects the feature giving the greatest increase in the dependency degree. The algorithm stops when adding a feature can no longer change the classification of at least one object.
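A simplified Python sketch of the C-FRFS selection loop follows. It assumes the per-feature fuzzy similarity matrices have already been built; here a Gaussian relation stands in for the paper's Equations (17)–(19), the relations are combined with the min t-norm, the Łukasiewicz implicator is used, and the stopping rule is simplified to "no further gain". All data are invented for illustration:

```python
import numpy as np

LUK = lambda x, y: np.minimum(1.0 - x + y, 1.0)  # Lukasiewicz implicator

def dependency(sim_by_feature, features, labels):
    """gamma_P(Q) for a feature subset P: combine the per-feature
    similarity matrices with the min t-norm, take the fuzzy lower
    approximation of each decision class, and average the resulting
    positive region over the universe."""
    R = np.minimum.reduce([sim_by_feature[f] for f in features])
    n = len(labels)
    pos = np.zeros(n)
    for q in set(labels):
        X = np.array([1.0 if l == q else 0.0 for l in labels])
        lower = np.min(LUK(R, X[None, :]), axis=1)  # inf over y
        pos = np.maximum(pos, lower)
    return pos.sum() / n

def c_frfs(sim_by_feature, labels):
    """Greedy selection in the spirit of Algorithm 3: add the feature
    with the largest dependency gain until no feature improves gamma."""
    selected, gamma = [], 0.0
    remaining = set(sim_by_feature)
    while remaining:
        best, best_gamma = None, gamma
        for f in remaining:
            g = dependency(sim_by_feature, selected + [f], labels)
            if g > best_gamma:
                best, best_gamma = f, g
        if best is None:
            break
        selected.append(best)
        remaining.remove(best)
        gamma = best_gamma
    return selected

# Toy data: feature "f1" separates the classes, "f2" is noise.
vals = {"f1": np.array([0.0, 0.1, 0.9, 1.0]),
        "f2": np.array([0.5, 0.4, 0.5, 0.4])}
labels = ["no", "no", "yes", "yes"]
sims = {f: np.exp(-(v[:, None] - v[None, :]) ** 2 / (2 * 0.1 ** 2))
        for f, v in vals.items()}
print(c_frfs(sims, labels))  # only the discriminative feature survives
```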

Example
We still use the example in Table 1; with FCM we calculate the membership matrix shown in Table 2. For every element of IND(q), we calculate the lower approximation of every object. As an example, take {1, 3, 6} in IND(q) and object 2, and then compute the dependency of feature a. In the same way, we obtain γ_{b}(Q) = 0.4624 and γ_{c}(Q) = 0.4837, so in the first iteration we choose feature c. In the second iteration, we obtain γ_{a,c}(Q) = 0.6305 and γ_{b,c}(Q) = 0.9831, and in the end we choose features b and c as the result of feature selection.

Experiments
In this section, we used nine classifiers to classify nine datasets from the UCI machine learning repository [45] in order to verify and evaluate our feature selection algorithm. The algorithm uses fuzzy clustering to generate the fuzzy sets defined by Equation (17). After feature selection, the datasets are reduced, and these reduced datasets are classified with the respective classifiers (the unreduced datasets skip the feature selection step).
T-FRFS is short for the threshold fuzzy rough feature selection presented in [46], and FRFS is the method of [40]; C-FRFS is the method proposed in this paper. All three methods were written in MATLAB 2017b running on a computer with the following characteristics: OS: Microsoft Windows 10; CPU: Intel Core i5-8400 2.80 GHz; RAM: 8 GB.
We compared these two earlier methods with our new method, and the comparison indicates that we have made some progress on fuzzy rough feature selection. Nine datasets, as depicted in Table 3, were employed to evaluate each method's performance. The results of all methods, as well as of the unreduced datasets, in terms of the number of selected features are also shown in Table 3. As shown, our new method C-FRFS always obtains the smallest reduced subset, followed by T-FRFS; FRFS does not perform well and sometimes cannot select features. Table 3. Information on the nine datasets and reduct sizes of the three feature selection methods. C-FRFS: c-means fuzzy rough feature selection. T-FRFS: threshold fuzzy rough feature selection. Figure 2 shows the running times of the three methods. In Figure 2a, our new method is almost always slower than the other two methods. However, on the Hillvalley dataset our new method performs better, which suggests it is more suitable for high-dimensional data: because our method obtains a smaller reduct, the program terminates earlier and the running time is shorter.

To test the scalability of the new method, we applied the three methods to datasets of different sizes. As the number of objects grew from 200 to 5000, the time consumption increased exponentially. However, as shown in Figure 3, the three methods have the same tendency and are almost coincident, which demonstrates that the clustering step has almost no influence on the running time of our new method. In other words, our new method obtains the best result with no impact on efficiency.

Nine classifiers of different categories, namely Bayes Net, Naïve Bayes, RBFNetwork, JRip, PART, BFTree, FT, J48, and NBTree in Weka, were selected to classify the subsets of features produced by each selection method [47]. To obtain more reliable results, we used 10-fold cross validation when building each classifier. The results are presented in Table 4, where each column gives the average classification accuracies of the nine classifiers for each dataset under the three selection methods and the unreduced datasets, and the last column gives the mean of the nine classification accuracies. The highest mean classification accuracy was gained at the expense of employing all features, which means that feature selection methods are not always successful in increasing classification accuracy; rather, they decrease model complexity while sacrificing an inconsiderable amount of accuracy. As Table 4 shows, the original model gained the highest classification accuracy on four datasets, and FRFS and our new method each on three. Since both the classification accuracy and the number of selected features are important, the classification accuracies of Table 4 divided by the numbers of selected features of Table 3 were taken as a measure to compare the results [48]: a method with a higher classification accuracy and a smaller set of features is regarded as better. The results are shown in Table 5. The lowest values came from the original datasets, and our new method reaches the highest value on each dataset; sometimes, T-FRFS also reaches the highest value.
A Friedman test was used to compare the methods statistically. Table 6 shows the average ranks of the methods corresponding to Table 5. The chi-square statistic with 3 degrees of freedom was 26.1 and the p-value was approximately 0, which means there are statistically significant differences between the methods.
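The Friedman test itself is readily reproduced; the sketch below uses SciPy's `friedmanchisquare` on invented accuracy-per-feature scores (the real values are those of Table 5), one list per method across six hypothetical datasets:

```python
from scipy import stats

# Hypothetical accuracy-per-feature scores of four methods over six
# datasets; the values are made up, chosen so the methods rank the
# same way on every dataset.
original = [0.08, 0.05, 0.09, 0.07, 0.06, 0.08]
frfs     = [0.21, 0.18, 0.20, 0.19, 0.22, 0.17]
t_frfs   = [0.30, 0.28, 0.33, 0.29, 0.31, 0.27]
c_frfs   = [0.41, 0.39, 0.44, 0.40, 0.42, 0.38]

# Friedman test: ranks the methods within each dataset, then tests
# whether the mean ranks differ (chi-square with k - 1 = 3 dof).
stat, p = stats.friedmanchisquare(original, frfs, t_frfs, c_frfs)
print(f"chi-square = {stat:.2f}, p = {p:.4f}")
```

With perfectly consistent rankings as above, the statistic reaches its maximum 12n/(k(k+1)) · Σ(R_j − (k+1)/2)² = 18 for n = 6 datasets and k = 4 methods, giving a p-value well below 0.01.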

Discussion and Conclusions
A new method named C-FRFS based on fuzzy rough feature selection has been presented in this paper. The first development, based on fuzzy c-means clustering, uses a new kind of membership function that forms the membership matrix automatically, using only the knowledge contained in the dataset itself. The second development concerns setting up the fuzzy relation: unlike the many articles that use a single equation, we combine a fuzzy equivalence relation from other research with the membership function generated in the above way. An example is given to illustrate how reduction may be achieved. Note that no user-defined thresholds are required for the method, although a choice must be made regarding the fuzzy connectives and the prior coefficient of the fuzzy equivalence relation. Experimental results over nine datasets taken from the UCI repository showed the applicability and effectiveness of our proposed method. To measure the performance of every method, we considered the number of selected features, the classification accuracies, and the classification accuracies divided by the number of selected features. We compared the results of the different methods, namely T-FRFS, FRFS, the unreduced datasets, and our new method C-FRFS; the comparisons indicated that our method outperformed the others. We also compared the running times of the three methods on datasets of different sizes; the results demonstrated that our new method does not noticeably affect efficiency. That is to say, we improved the accuracy and validity of feature selection with nearly no time loss.
Further research in this area will include a more in-depth experimental investigation of the prior coefficient and its evaluation, and a more efficient fuzzy clustering method to raise the accuracy of our feature selection method.