Multi-Granulation Entropy and Its Applications

From the viewpoint of granular computing, several general uncertainty measures have been proposed through single granulation by generalizing Shannon's entropy. However, in practical settings we often need to describe a target concept concurrently through multiple binary relations. In this paper, we extend the classical information entropy model to a multi-granulation entropy model (MGE) by using a series of general binary relations. Two types of MGE are discussed, and a number of theorems are obtained; it follows that single-granulation entropy is a special instance of MGE. We employ the proposed model to evaluate the significance of attributes for classification, and we construct a forward greedy search algorithm for feature selection. The experimental results show that the proposed method presents an effective solution for feature analysis.


Introduction
Uncertainty analysis represents one of the most significant and challenging tasks in intelligent computation. Since Shannon introduced information entropy to measure the uncertainty of a system, a series of measures have been proposed for machine learning, data mining, pattern recognition, etc. [1][2][3].
In the field of granular computing, Yu et al. introduced fuzzy entropy for attribute reduction [4]. Hu et al. presented kernel entropy by extending Yu's work [5]. In [6], the authors defined neighborhood entropy by using a neighborhood relation. From the viewpoint of granular computing, the entropy methodologies mentioned above share two modules: (1) granulation of the data (samples) into a set of information granules according to a relation on the objects; (2) calculation of the sum of the uncertainty quantities of all the information granules. We will give an example illustrating this two-step process in detail in Section 2. It shows that granulation plays a key role in these entropy models. However, the classical information entropy theory utilizes solely the granularity structure of the given data, expressed by one suitable binary relation: the neighborhood entropy is based only on the neighborhood granulation, the fuzzy entropy on the fuzzy granulation, and the kernel entropy on the kernel granulation. In [7], Qian et al. pointed out that two different binary relations may contradict each other in some data analysis issues; in other words, in some decision-making processes the view of each decision maker may be independent for the same object. Accordingly, Qian et al. proposed the multi-granulation rough set (MGRS) according to a user's different requirements or targets of problem solving. Since then, many researchers have extended the classical MGRS by using various generalized binary relations. Lin et al. [8] proposed a covering-based pessimistic multi-granulation rough set. Xu et al.
[9] proposed another generalized version, called the variable precision multi-granulation rough set. As with the information entropy model, two essential problems must be addressed when employing rough set models in real-world applications: (1) information granulation [10,11]; (2) approximate classification realized in the presence of the induced information granules [12,13]. In MGRS, the idea of multi-granulation is expressed through the realization of approximate classification; for example, one of the contributions of MGRS is to describe the lower and upper approximations by multiple equivalence relations instead of a single equivalence relation. As a matter of fact, we can also build a multi-granulation structure in the process of information granulation. Based on this idea, the contributions of this paper are: (1) we extend the classical information entropy model to a multi-granulation entropy model (MGE) by using a series of general binary relations; (2) we obtain a number of theorems; (3) we employ the proposed model to evaluate the significance of attributes for classification and construct a forward greedy search algorithm for feature selection. The experimental results show that the proposed method presents an effective solution for feature analysis.
The paper is organized as follows. In Section 2, some basic concepts about entropy from the viewpoint of granular computing are briefly reviewed. In Section 3, the MGE model is proposed and a series of theorems about MGE is discussed. Section 4 shows the application of MGE to feature evaluation and feature selection. Numeric experiments are reported in Section 5. Finally, Section 6 concludes the paper.

Entropy in the View of Granular Computing
Knowledge representation is realized via an information system (IS), which has a tabular form similar to a database. An information system is a pair IS = (U, A), where U = {x_1, x_2, …, x_n} is a nonempty finite set of objects, A is a nonempty finite set of attributes, and f_a: U → V_a is a mapping for any a ∈ A, where V_a is called the value set of a.
Relations, as a fundamental concept in mathematics, represent the connections among the elements of a set in the domain. A binary relation R on U can be represented as a matrix M(R) = (r_ij)_{n×n}, where r_ij denotes the degree to which x_i is related to x_j. In classical set theory, relations take values in the set {0, 1}; for a general (fuzzy) binary relation, r_ij ∈ [0, 1]. With each sample x_i, we express the information granule R(x_i) in the form of a fuzzy set, whose membership function is given by the i-th row of M(R). Here, we give an instance of kernel entropy to illustrate the information entropy model in the view of kernel granulation [5]:

Example 1. Given the information system described in Table 1 (Table 1. IS description), the two modules of the kernel entropy methodology are as follows:

(1) Information granulation: the kernel relation is computed with the Gaussian kernel as

r_ij = exp(−‖x_i − x_j‖² / (2σ²)),    (3)

where ‖x_i − x_j‖ is the Euclidean distance between samples x_i and x_j.

With σ set to 0.1, the kernel granules can be constructed according to Equation (2).
(2) Calculating the kernel entropy: the uncertainty of each granule R(x_i) is computed as −log₂(|R(x_i)| / |U|), where |U| is the cardinality of the set U and |R(x_i)| = ∑_j r_ij is the fuzzy cardinality of the granule.
The kernel entropy is then defined as

H(R) = −(1/|U|) ∑_{i=1}^{|U|} log₂(|R(x_i)| / |U|),

and the kernel entropy of this IS can be computed accordingly.

Remark. To deal with nominal attributes and numerical attributes, which are common in practice, we use an extended Euclidean distance that handles both attribute types, following the method introduced in the literature [13]. In a real environment, we often need to describe a target concept concurrently through multiple binary relations (e.g., a neighborhood relation, a kernel relation, and a fuzzy relation) according to a user's requirements or targets of problem solving. Therefore, we study the multi-granulation entropy model in the next section.
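To make the two-step procedure concrete, the following sketch (in Python with NumPy; the function names and the toy data are our own choices, not from the paper) granulates a small numeric data set with the Gaussian kernel of Equation (3) and then sums the granule uncertainties:

```python
import numpy as np

def kernel_relation(X, sigma=0.1):
    """Step 1: granulation. Gaussian kernel relation matrix,
    r_ij = exp(-||x_i - x_j||^2 / (2 * sigma^2)); row i is the
    fuzzy information granule of sample x_i."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def granulation_entropy(R):
    """Step 2: entropy. Average -log2(|R(x_i)| / |U|) over all samples,
    where |R(x_i)| is the fuzzy cardinality (row sum) of granule i."""
    n = R.shape[0]
    card = R.sum(axis=1)          # expected cardinality per granule
    return float(-np.mean(np.log2(card / n)))

X = np.array([[0.1, 0.2], [0.15, 0.25], [0.9, 0.8]])
H = granulation_entropy(kernel_relation(X, sigma=0.1))
```

When every sample is related only to itself (R the identity matrix), each granule has cardinality 1 and the entropy reaches its maximum log₂|U|, matching the intuition that finer granulation means higher uncertainty.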

Multi-Granulation Entropy
In this section, two types of multi-granulation entropy (MGE) are introduced to measure the uncertainty of knowledge in information systems.Then, the joint entropy and conditional entropy are presented in the view of multi-granulation.A number of theorems will be discussed in detail.

Two Types of MGE
Definition. Let IS = (U, A) be an information system, where U is a nonempty finite set of objects. Given a set of general binary relations R_1, R_2, …, R_m on U, the optimistic multi-granulation granule of a sample x_i is computed as

OR(x_i)(x_j) = R_1(x_i)(x_j) ∨ R_2(x_i)(x_j) ∨ … ∨ R_m(x_i)(x_j),

where "∨" means "max". Each granule R_k(x_i) is defined in the form of a fuzzy set, so the "max" operation is taken membership-wise. The word "optimistic" expresses the idea that the information granulation seeks common ground while reserving the differences among these general binary relations.
Definition. Let IS = (U, A) be an information system, where U is a nonempty finite set of objects. Given a set of general binary relations R_1, R_2, …, R_m on U, the pessimistic multi-granulation granule of x_i is computed as

PR(x_i)(x_j) = R_1(x_i)(x_j) ∧ R_2(x_i)(x_j) ∧ … ∧ R_m(x_i)(x_j),

where "∧" means "min". Each granule R_k(x_i) is defined in the form of a fuzzy set, so the "min" operation is taken membership-wise. The word "pessimistic" expresses the idea that the information granulation seeks common ground while rejecting the differences among these general binary relations.
Then, the expected cardinalities of OR(x_i) and PR(x_i) are computed as |OR(x_i)| = ∑_{j=1}^{|U|} OR(x_i)(x_j) and |PR(x_i)| = ∑_{j=1}^{|U|} PR(x_i)(x_j), respectively. Here, we give the definitions of the two types of MGE.
Definition. Let IS = (U, A) be an information system, where U is a nonempty finite set of objects. Given a set of general binary relations induced by an attribute subset B ⊆ A, the first type of MGE, called optimistic multi-granulation entropy (OMGE), is denoted by

OMGE(B) = −(1/|U|) ∑_{i=1}^{|U|} log₂(|OR(x_i)| / |U|),

and the second type of MGE, called pessimistic multi-granulation entropy (PMGE), is denoted by

PMGE(B) = −(1/|U|) ∑_{i=1}^{|U|} log₂(|PR(x_i)| / |U|).

The following example illustrates the two types of MGE in detail.

Example 2. Given a nonempty finite set of objects U, two relation matrices R_1 and R_2 are given. The optimistic and pessimistic relation matrices, denoted by M_O and M_P respectively, are obtained by taking the elementwise "max" and "min" of R_1 and R_2. Every row of these matrices denotes an information granule (e.g., OR(x_1)).
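Under the same conventions as the sketch in Section 2 (relation matrices whose rows are fuzzy granules; the function names are our own), the two fusions differ only in the elementwise operator:

```python
import numpy as np

def _entropy(R):
    """-(1/|U|) * sum_i log2(|R(x_i)| / |U|), with |R(x_i)| = sum_j r_ij."""
    n = R.shape[0]
    return float(-np.mean(np.log2(R.sum(axis=1) / n)))

def omge(relations):
    """Optimistic MGE: fuse the granules with elementwise max
    ("seek common ground while reserving differences"), then take the entropy."""
    return _entropy(np.maximum.reduce(relations))

def pmge(relations):
    """Pessimistic MGE: fuse the granules with elementwise min
    ("seek common ground while rejecting differences"), then take the entropy."""
    return _entropy(np.minimum.reduce(relations))
```

Because the optimistic granule always contains the pessimistic one, OMGE(B) ≤ PMGE(B) holds for any set of relations; with a single relation the two entropies coincide.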
OMGE and PMGE are then computed according to Equations (12) and (13).

Definition. Let IS = (U, A) be an information system, where U is a nonempty finite set of objects. Given two sets of general binary relations induced by attribute subsets B_1 and B_2, the optimistic joint granules of x_i are induced by B_1 and B_2 together. The optimistic joint entropy is expressed as

OMGE(B_1, B_2) = −(1/|U|) ∑_{i=1}^{|U|} log₂(|OR_{B_1}(x_i) ∧ OR_{B_2}(x_i)| / |U|),

where "∧" means "min".
The pessimistic joint entropy PMGE(B_1, B_2) is defined analogously, with the pessimistic granules PR_{B_1}(x_i) and PR_{B_2}(x_i) combined by "min".

Definition. Let IS = (U, A) be an information system, where U is a nonempty finite set of objects. Given two sets of general binary relations induced by B_1 and B_2, the optimistic conditional entropy of B_2 to B_1 is expressed as

OMGE(B_2 | B_1) = OMGE(B_1, B_2) − OMGE(B_1).

Similarly, we have the pessimistic conditional entropy

PMGE(B_2 | B_1) = PMGE(B_1, B_2) − PMGE(B_1).

The conditional entropy reflects the uncertainty of B_2 if B_1 is given.
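As a sketch (our naming; it assumes, as in the definitions above, that joint granules are intersections by "min" and that conditional entropy is joint minus marginal):

```python
import numpy as np

def entropy(R):
    n = R.shape[0]
    return float(-np.mean(np.log2(R.sum(axis=1) / n)))

def joint_entropy(R1, R2):
    """Joint granules combine the B1- and B2-granules by elementwise min."""
    return entropy(np.minimum(R1, R2))

def conditional_entropy(R2, R1):
    """H(B2 | B1) = H(B1, B2) - H(B1): the extra uncertainty of B2 once B1 is known."""
    return joint_entropy(R1, R2) - entropy(R1)
```

If B_2 induces exactly the same relation as B_1, the joint granules equal the B_1-granules and the conditional entropy is zero, as expected; and since intersecting granules can only shrink them, the conditional entropy is never negative.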

Some Theorems about MGE
Theorem. Let IS = (U, A) be an information system, where U is a nonempty finite set of objects. Given only one equivalence binary relation R, MGE degenerates to Shannon's entropy. The proof is straightforward: an equivalence binary relation divides the samples into disjoint equivalence classes X_1, X_2, …, X_k. Assume there are w_t samples in class X_t; then every sample in X_t has a granule of cardinality w_t, and

MGE = −(1/|U|) ∑_{t=1}^{k} w_t log₂(w_t / |U|) = −∑_{t=1}^{k} (w_t / |U|) log₂(w_t / |U|),

which is exactly Shannon's entropy of the partition. This proof shows that MGE is a natural generalization of Shannon's entropy in the view of granulation. In [14,15], the authors generalized Shannon's entropy to fuzzy entropy, kernel entropy and neighborhood entropy, respectively. These entropy models utilize solely the granularity structure of the given data, expressed by one suitable binary relation: the neighborhood entropy is based only on the neighborhood granulation, the fuzzy entropy on the fuzzy granulation, and the kernel entropy on the kernel granulation. Hence, it can also be concluded that single-granulation entropy, such as neighborhood entropy, kernel entropy, fuzzy entropy, etc., is a special instance of MGE.
Theorem. Let IS = (U, A) be an information system, where U is a nonempty finite set of objects. Given a set of general binary relations, adding a further relation to the set cannot increase OMGE and cannot decrease PMGE: enlarging the optimistic granules by "max" lowers the entropy, while shrinking the pessimistic granules by "min" raises it. For convenience, this monotonicity of the entropy value induced by the set of relations is called the granulation monotonicity.
Theorem. Let IS = (U, A) be an information system, where U is a nonempty finite set of objects. Given a set of general binary relations, we have OMGE(B) ≤ PMGE(B). This follows directly from the granulation monotonicity, since each optimistic granule contains the corresponding pessimistic granule.

Theorem. Let IS = (U, A) be an information system, where U is a nonempty finite set of objects. Given a set of general binary relations, the conditional entropies satisfy the corresponding monotonicity. Proof. According to Lemma 4.1 in Ref. [16], the combination of information granules by the "∧" operator increases the conditional entropy monotonically. Similarly, it can be concluded that the conditional entropy decreases when information granules are combined by the "∨" operator. QED
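A quick numeric check of the granulation monotonicity, using random reflexive fuzzy relations and the entropy defined earlier (a sketch of the property, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(42)

def entropy(R):
    n = R.shape[0]
    return float(-np.mean(np.log2(R.sum(axis=1) / n)))

def rand_relation(n):
    R = rng.uniform(size=(n, n))
    R = np.maximum(R, R.T)        # symmetric
    np.fill_diagonal(R, 1.0)      # reflexive
    return R

R1, R2, R3 = (rand_relation(10) for _ in range(3))

# Optimistic fusion: adding a relation enlarges the granules, so entropy cannot rise.
o2 = entropy(np.maximum(R1, R2))
o3 = entropy(np.maximum(np.maximum(R1, R2), R3))
assert o3 <= o2 <= entropy(R1)

# Pessimistic fusion: adding a relation shrinks the granules, so entropy cannot fall.
p2 = entropy(np.minimum(R1, R2))
p3 = entropy(np.minimum(np.minimum(R1, R2), R3))
assert entropy(R1) <= p2 <= p3
```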

Feature Selection Based on MGE
One of the most important applications of information entropy theory is to evaluate the classification power of the attributes in a decision system by computing the significance of the condition attributes for the resulting decision.This entropy-based model was widely used in feature selection algorithms for categorical data [17].However, classical entropy models cannot be used to express multi-granulation which represents the different points of view for describing one concept.Here, we show a feature selection technique based on MGE.
If the set of samples is assigned a decision attribute D, we call the information system a decision system, where C is the set of condition attributes. As explained with the conditional entropy definitions above, the multi-granulation conditional entropy measures the uncertainty of the decision D given an attribute subset over the nonempty finite set of objects U. For a set of general binary relations induced by B ⊆ C, we thus define the significance of the attribute subset B from the multi-granulation point of view:

OSIG(B, D) = OMGE(B) + OMGE(D) − OMGE(B, D),

which is used to evaluate the significance of the attribute subset B under the optimistic multi-granulation. Similarly, PSIG(B, D) = PMGE(B) + PMGE(D) − PMGE(B, D) is another evaluation measure of the attributes, in which the pessimistic granules, formed by combining the binary relations with "∧", are used. Defined this way, SIG(B, D) becomes a symmetric uncertainty measure. In fact, this is the mutual information of B and D defined in Shannon's information theory if B and D generate Boolean equivalence relations according to Equation (23) [18]. As is well known, mutual information is widely applied in evaluating features and constructing decision trees [19,20], but the classical definition of mutual information can only deal with one granulation; the multi-granulation significance defined here can express many views with a series of binary relations. Equations (30) and (31) can be used to find the significant features for classification. Actually, it is impractical to obtain the optimal subset of features from the 2^n − 1 candidates through exhaustive search, where n is the number of features. A greedy search guided by some heuristic is usually far more efficient than plain brute-force exhaustive search. In a forward greedy search, one starts with an empty set of attributes and keeps adding features to the subset of selected attributes one by one; each selected attribute maximizes the increment of significance of the current subset. A forward search algorithm for feature selection based on MGE is given as follows. Here, OSIG and PSIG are denoted uniformly as SIG.
Algorithm 1. Feature selection based on MGE (OMGE or PMGE). Input: a decision system and a set of binary relations, where n and m are the numbers of features and samples, respectively. Output: the selected feature subset. Starting from the empty set, the algorithm repeatedly adds the feature with the greatest increment of SIG and stops when no feature increases the significance. It is worth noting that the proposed measures of mutual information can be incorporated with other search strategies used in other feature selection algorithms, such as ABB (Automatic Branch and Bound), probabilistic search [21] and GP (Genetic Programming) [22]. In this study, we do not compare the influence of search strategies on the results of feature selection; we focus instead on comparing the proposed method under different evaluation measures.
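A minimal sketch of the forward search (the stopping threshold `eps` and the function names are our additions; `sig` stands for either OSIG or PSIG, supplied by the caller):

```python
def forward_select(sig, n_features, eps=1e-4):
    """Forward greedy search: start from the empty subset and repeatedly add
    the feature whose inclusion maximizes the significance SIG(B, D),
    stopping when no candidate raises it by more than eps."""
    selected, best = [], sig([])
    remaining = set(range(n_features))
    while remaining:
        gain, pick = max((sig(selected + [f]) - best, f) for f in remaining)
        if gain <= eps:
            break
        selected.append(pick)
        remaining.remove(pick)
        best += gain
    return selected
```

With n features the loop evaluates O(n²) candidate subsets instead of the 2^n − 1 required by exhaustive search.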

Experimental Analysis
In this section, we compare the effectiveness of MGE in evaluating feature quality. The data sets are downloaded from the UCI Machine Learning Repository and are described in Table 2. The numerical attributes of the samples are linearly normalized as

x′ = (x − x_min) / (x_max − x_min),

where x_min and x_max are the bounds of the given attribute. Three popular learning algorithms, CART, linear SVM and RBF SVM, are used to evaluate the quality of the selected features. The experiments were run in 10-fold cross-validation mode. The parameters of the linear SVM and RBF SVM are taken as the default values (using the MATLAB toolkit osu_svm3.00). In the experiment, we employ three symmetric membership functions for multi-granulation. One is the kernel relation defined as Equation (3) in Example 1; the other two are the neighborhood relation and the fuzzy relation. Equation (33), the neighborhood relation, is used to compute neighborhood entropy (NE) in Ref. [6], where the threshold δ > 0: according to this definition, the samples in a neighborhood granule lie within distance δ of the center sample. Literature [6] has shown that the result is optimal if the threshold is set between 0.1 and 0.2; in the following, if not specified, δ = 0.15. Similarly, fuzzy entropy (FE) is based on the fuzzy relation of Equation (34) [4]. We compare MGE with kernel entropy (KE), NE and FE, which are typical single-granulation entropies. The parameters of KE and NE are kept consistent with Ref. [5] and Ref. [6]. We compute the significance of each single feature with five evaluation functions: OMGE, PMGE, KE, NE and FE. At the same time, we report the classification accuracy of each feature obtained with the linear SVM and RBF SVM.
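The preprocessing and the two extra relations can be sketched as follows (the exact functional forms of Equations (33) and (34) are paraphrased from the description above, a crisp δ-neighborhood and a distance-based fuzzy similarity, so the fuzzy membership here is an illustrative choice, and constant-valued attributes are assumed absent):

```python
import numpy as np

def normalize(X):
    """Linear (min-max) normalization of each numerical attribute to [0, 1]."""
    xmin, xmax = X.min(axis=0), X.max(axis=0)
    return (X - xmin) / (xmax - xmin)

def neighborhood_relation(X, delta=0.15):
    """Crisp neighborhood relation (cf. Equation (33)): x_j belongs to the
    granule of x_i iff their distance is at most delta."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return (d <= delta).astype(float)

def fuzzy_relation(X):
    """Fuzzy similarity relation (an illustrative stand-in for Equation (34)):
    membership decays linearly with distance, clipped to [0, 1]."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return np.clip(1.0 - d, 0.0, 1.0)
```

Both relations are reflexive and symmetric, so they can be fed directly into the OMGE/PMGE fusion alongside the kernel relation.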
Two data sets, wine and glass, are used in this experiment. There are 13 features in the wine data set and nine in the glass data set. The results are given in Figures 1 and 2. For the wine data, features 1, 6, 7, 10, 11, 12 and 13 produce higher values of all evaluation functions, as shown in Figure 1a; at the same time, the classification accuracies of these features are better than those of the others (Figure 1b). For the glass data, features 2, 3, 4 and 8 are better than the others in terms of the five evaluation functions, and correspondingly their classification accuracies are also higher than those of the other features. These results show that all five evaluation functions produce good estimates of the classification ability of the features, and it can be concluded that OMGE and PMGE are competitive with the other entropy models. Regarding OMGE, PMGE, FE, NE and KE, the orders of the features presented in the tables are the orders in which the features were added to the feature space; these orders reflect the relative significance of the features in terms of the corresponding measures. Several observations can be derived from the selected attributes. First, whichever attribute selection technique is used, most of the attributes in all data sets can be deleted; the reduction rate is as high as 90% for some data sets, such as sonar and wpbc. Second, the selected attribute subsets differ slightly; in particular, some of the selected features are subsets of the attributes selected by other models. Since feature selection considers the ranking of features, a small difference in feature quality may lead to a completely different ranking. Therefore, a large difference between the selected feature subsets reflects a difference between the feature qualities computed with diverse granularities; in other words, the values under one granularity can be inconsistent with those under another granularity. In [7],
the authors give a tentative study suggesting that a multi-granulation model displays its advantage for rule extraction when two granularities possess a contradictory relationship. We test this idea in the following experiment. We build classification models with the selected features and test their classification performance with 10-fold cross-validation; the average value and standard deviation are used to measure the classification performance. We compare the raw data, MGE, FE, NE and KE in Tables 5-7, where the learning algorithms CART, linear SVM and RBF SVM are used to evaluate the selected features. Comparing the performance of the raw data and granulation-based selection, we find that although most of the features have been removed, most of the classification accuracies derived from the reduced data sets do not decrease but increase. This shows that there are redundant and irrelevant attributes in the raw data.
The experimental results show that no matter which classification algorithm is used, MGE is better than or equivalent to KE. Table 6 shows that MGE outperforms FE and NE with respect to the linear SVM. For the CART learning algorithm in Table 5, MGE is better than or equivalent to NE on six of the seven databases. It can be concluded that MGE is a better choice for diverse granularities. In practice, different decision makers have different granulation points of view; therefore, it is necessary to take diverse factors into consideration for granular computing in the real world.

Conclusions
In this paper, the classical single-granulation entropy theory has been extended.As a result of this extension, a multi-granulation entropy model (MGE) has been developed.The uncertainty of the information system is defined by using multiple relations on the universe.These relations can be chosen according to a user's requirements or targets of problem solving.
In the MGE model, we introduce OMGE and PMGE to describe the relations between different granularities. Based on the mutual information defined through MGE, we propose forward greedy feature selection algorithms, which will be helpful for applying this theory to practical issues. MGE provides an effective approach in the context of multiple granulations. We conclude that single-granulation entropy is a special instance of MGE. The experimental results show that MGE displays its advantage for rule extraction and knowledge discovery when the different granularities in information systems possess a contradictory or inconsistent relationship.
Future work could move along two directions. First, existing entropy-based feature selection algorithms sometimes may not be robust enough for real-world applications; how to improve this is an important issue. Second, we will continue to construct MGE models with various binary relations in order to study the common properties of this kind of entropy model.

Figure 1. Significance and accuracy of a single feature (wine). (a) Significance of a single feature computed with different evaluation functions. (b) Classification accuracies obtained for single features when using linear SVM and RBF SVM.
The above results show that MGE can be used to evaluate single attributes. Now, we show its effectiveness in attribute reduction; the features selected with the different algorithms are presented in Tables 3 and 4. Throughout, we use a general binary relation R to denote any instantiated relation, such as a fuzzy relation or a kernel relation; the fuzziness of relations is the essential characteristic in these cases. The multi-granulation conditional entropy of D given the condition attributes C measures the uncertainty of D when C is known, and thus reflects the relevance between the condition attributes and the decision.

Table 3. Subsets of features selected with OMGE and PMGE.

Table 4. Subsets of features selected with FE, NE and KE.