Article

Compact Belief Rule Base Learning for Classification with Evidential Clustering †

School of Automation, Northwestern Polytechnical University, Xi’an 710072, China
* Author to whom correspondence should be addressed.
† This paper is an extended version of our paper published in the 2018 International Conference on Information Fusion, Cambridge, UK, 10–13 July 2018.
Entropy 2019, 21(5), 443; https://doi.org/10.3390/e21050443
Submission received: 27 March 2019 / Revised: 24 April 2019 / Accepted: 28 April 2019 / Published: 28 April 2019
(This article belongs to the Special Issue Entropy Based Inference and Optimization in Machine Learning)

Abstract

The belief rule-based classification system (BRBCS) is a promising technique for addressing different types of uncertainty in complex classification problems, introducing the belief function theory into the classical fuzzy rule-based classification system. However, in the BRBCS, high numbers of instances and features generally induce a belief rule base (BRB) of large size, which degrades the interpretability of the classification model for big data sets. In this paper, a BRB learning method based on the evidential C-means clustering (ECM) algorithm is proposed to efficiently design a compact belief rule-based classification system (CBRBCS). First, a supervised version of the ECM algorithm is designed by means of weighted product-space clustering to partition the training set with the goals of obtaining both good inter-cluster separability and inner-cluster pureness. Then, a systematic method is developed to construct belief rules based on the obtained credal partitions. Finally, an evidential partition entropy-based optimization procedure is designed to get a compact BRB with a better trade-off between accuracy and interpretability. The key benefit of the proposed CBRBCS is that it can provide a more interpretable classification model while maintaining comparable accuracy. Experiments based on synthetic and real data sets have been conducted to evaluate the classification accuracy and interpretability of the proposal.

1. Introduction

Pattern classification is a popular research field in artificial intelligence. The main purpose of classification is to assign objects, represented by feature vectors, to one of a set of predefined classes [1]. In the past five decades, a variety of classification techniques, such as K-nearest neighbors (K-NN) [2], decision trees (DT) [3], support vector machines (SVM) [4], and rule-based classification (RBC) [5], have been proposed. Among these methods, RBC not only has the advantage of interpretable classification results, but can also be easily enhanced by adding new rules from experts' domain knowledge. As one of the most representative RBC methods, the fuzzy rule-based classification system (FRBCS) [5,6] was developed by incorporating fuzzy sets [7]. The FRBCS is widely used because it can build a linguistic model interpretable to users. It has been successfully applied to many classification tasks where model interpretability is important, such as terrain classification [8], intrusion detection [9], fault prediction [10], disease diagnosis [11], and target recognition [12].
However, in real-world complex systems, different types of uncertainty (such as fuzziness, imprecision, and incompleteness) may coexist. The FRBCS, which is based on fuzzy set theory, cannot model imprecise or incomplete information effectively. The belief function theory, proposed by Dempster [13] and Shafer [14], provides a powerful framework for uncertainty modeling and reasoning. As fuzzy set theory and belief function theory are suited to dealing with different types of uncertainty, some researchers have investigated the relationship between them and suggested different ways of integrating them [15,16,17,18]. In particular, in [18], a belief rule-based classification system (BRBCS) was developed by extending the FRBCS within the framework of belief function theory to address imprecise or incomplete information in complex classification problems. In contrast to the traditional fuzzy rule, the new belief rule assigns the consequent part a belief distribution structure, so that different kinds of uncertain information existing in the training set can be well characterized. Besides, to reduce the risk of misclassification in noisy conditions, the classification of a query pattern is made by combining all the activated belief rules. In many situations, this method has been found experimentally to yield better classification accuracy and robustness than the FRBCS using the same information.
Rule learning is the most important issue in developing the BRBCS. In [18], a heuristic belief rule base (BRB) learning method was developed by defining belief rules based on fuzzy-grid partitions of the feature space and individuals of the training patterns, and the resulting BRB can provide an accurate mapping between the feature space and the class space. However, with this method, higher numbers of instances and features generally induce a BRB with larger size. This may lead to a large rule base for big data sets, which degrades the interpretability of the classification model. Motivated by the above consideration, in this paper, a compact belief rule-based classification system (CBRBCS) is developed for a better trade-off between accuracy and interpretability (a preliminary version of some of the ideas introduced here was presented in [19,20]; the present paper is a deeply revised and extended version of this work, with several new results). We propose to learn a compact BRB based on partitions of the training set realized with clustering techniques. The evidential C-means (ECM) algorithm [21], which extends the fuzzy C-means (FCM) algorithm [22] within the framework of belief functions, is used for its capability to address imprecise and partial information existing in the observed data. As belief rules are constructed based on credal partitions of the training set, this method can greatly reduce the number of generated rules. The main contributions of this paper are as follows:
  • A supervised version of the ECM algorithm is designed by means of weighted product-space clustering to take into account the class labels, which can obtain credal partitions with both good inter-cluster separability and inner-cluster pureness.
  • A systematic method is developed to construct belief rules (composed of the antecedent part, the consequent class, and the rule weight) based on credal partitions of the training set.
  • A two-objective optimization procedure based on both the mean squared error and the evidential partition entropy is designed to get a compact BRB with a better trade-off between accuracy and interpretability.
Two types of experiments, using both synthetic and real data sets, were conducted to evaluate the performance of the proposed CBRBCS. In the synthetic data test, a two-dimensional four-class synthetic data set was designed to illustrate the interest of the compact BRB learning under different parameter settings. In the real data test, 20 data sets varying greatly in the number of instances, features, and classes were selected from the UCI Machine Learning Repository [23] for evaluation. The comparison methods cover the traditional BRBCS, as well as some of the most representative classifiers, including K-NN, C4.5, SVM, and FRBCS. The reported results show that the proposed CBRBCS can obtain competitive performance compared with those representative classifiers for a variety of real tasks involving different data conditions, and achieves a better trade-off between accuracy and interpretability than the traditional BRBCS. Therefore, it provides a better choice of classification technique for those problems where both high accuracy and interpretability are needed.
The rest of the paper is organized as follows. In Section 2, some preliminaries of the related theories and methods are reviewed. The compact BRB learning with ECM is developed in Section 3. The experiments to evaluate the performance of the proposed method are reported in Section 4. At last, Section 5 concludes the paper.

2. Background

In this section, we provide some preliminaries of the related theories and methods. We first introduce some basic concepts of the belief function theory in Section 2.1. After that, we give overviews of the BRBCS classification method and the ECM clustering method in Section 2.2 and Section 2.3, respectively. The symbols used and their definitions are listed in Table 1 to facilitate reading.

2.1. Basics of the Belief Function Theory

In belief function theory [13,14], a problem domain is represented by a finite set $\Omega = \{\omega_1, \omega_2, \ldots, \omega_C\}$ called the frame of discernment. A mass function expressing the belief committed to the elements of $2^\Omega$ by a given source of evidence is a mapping $m: 2^\Omega \to [0, 1]$ such that
$$\sum_{A \in 2^\Omega} m(A) = 1. \tag{1}$$
Elements $A \subseteq \Omega$ with $m(A) > 0$ are called the focal sets of the mass function m. The mass function m has several special cases, which represent different types of information. A mass function is said to be
  • normal, if $m(\emptyset) = 0$; otherwise, it is subnormal, and $m(\emptyset)$ is interpreted as a mass of belief given to the hypothesis that $\omega$ might not lie in $\Omega$;
  • Bayesian, if all its focal sets are singletons; in this case, the mass function reduces to a precise probability distribution;
  • certain, if the whole mass is allocated to a unique singleton; this corresponds to a situation of complete knowledge;
  • vacuous, if the whole mass is allocated to $\Omega$; this situation corresponds to complete ignorance.
After representing the available pieces of evidence as mass functions, one often needs to combine several mass functions into a single one. Dempster's rule is the most popular way to combine several distinct pieces of evidence. Dempster's rule of combination of two normal mass functions $m_1$ and $m_2$ defined on the same frame of discernment $\Omega$ is given by
$$(m_1 \oplus m_2)(A) = \begin{cases} 0, & A = \emptyset, \\ \dfrac{\sum_{B \cap C = A} m_1(B)\, m_2(C)}{1 - \sum_{B \cap C = \emptyset} m_1(B)\, m_2(C)}, & A \in 2^\Omega \setminus \{\emptyset\}. \end{cases} \tag{2}$$
Dempster’s rule of combination is both commutative and associative.
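To make the mechanics of Equation (2) concrete, the following Python sketch combines two mass functions represented as dictionaries mapping focal sets (frozensets) to masses; the representation and function name are our own illustrative choices, not part of [13,14].

```python
from itertools import product

def dempster_combine(m1, m2):
    """Dempster's rule for two normal mass functions on the same frame.

    Mass functions are dicts {frozenset: mass} with masses summing to 1.
    """
    raw, conflict = {}, 0.0
    for (B, mB), (C_, mC) in product(m1.items(), m2.items()):
        inter = B & C_
        if inter:
            raw[inter] = raw.get(inter, 0.0) + mB * mC
        else:
            conflict += mB * mC          # mass sent to the empty set
    if conflict >= 1.0:
        raise ValueError("totally conflicting evidence")
    return {A: v / (1.0 - conflict) for A, v in raw.items()}

# Example on the frame {w1, w2, w3}:
m1 = {frozenset({"w1"}): 0.6, frozenset({"w1", "w2", "w3"}): 0.4}
m2 = {frozenset({"w2"}): 0.3, frozenset({"w1", "w2", "w3"}): 0.7}
print(dempster_combine(m1, m2))
```

The commutativity and associativity noted above mean that any number of sources can be combined by applying this function pairwise in any order.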

2.2. Belief Rule-Based Classification System (BRBCS)

The BRBCS is composed of two components, the BRB and the belief reasoning method (BRM) [18]. The BRB is first constructed to establish a mapping between the feature space and the class space, and the BRM is then used to classify a query pattern based on the constructed BRB.
For an M-class (denoted as $\mathcal{C} = \{c_1, c_2, \ldots, c_M\}$) classification problem with P features, the BRB consists of a collection of belief rules defined as follows:
$$R_j:\ \text{If } x_1 \text{ is } A_1^j \text{ and } x_2 \text{ is } A_2^j \text{ and} \cdots \text{and } x_P \text{ is } A_P^j, \text{ then class is } C^j = \{(c_1, \beta_1^j), \ldots, (c_M, \beta_M^j)\}, \text{ with rule weight } \theta_j,\ j = 1, 2, \ldots,$$
where $x_1, x_2, \ldots, x_P$ represent the antecedent features and $A^j = (A_1^j, A_2^j, \ldots, A_P^j)$ is the antecedent part of belief rule $R_j$, with each $A_p^j$ belonging to the fuzzy partitions $\{A_{p,1}, A_{p,2}, \ldots, A_{p,n_p}\}$ associated with the $p$-th feature, $p = 1, \ldots, P$. $\beta_k^j$ is the belief degree that input data $\mathbf{x} = (x_1, x_2, \ldots, x_P)$ belongs to $c_k$, $k = 1, \ldots, M$. In the belief structure, the consequent may be incomplete, i.e., $\sum_{k=1}^{M} \beta_k^j \le 1$, and the remaining belief $1 - \sum_{k=1}^{M} \beta_k^j$ denotes the degree of global ignorance about the consequent. The rule weight $\theta_j$, with $0 \le \theta_j \le 1$, characterizes the certainty grade of the belief rule $R_j$.
The BRB can be learned from training data or derived from expert knowledge [24]. In [18], a heuristic BRB learning method was developed based on fuzzy-grid partitions of the feature space. To generate the BRB, this method uses the following steps:
Step 1: 
Partition of the feature space.
A fuzzy-grid-based method is used to divide the P-dimensional feature space into $\prod_{p=1}^{P} n_p$ fuzzy regions, with $n_p$ being the number of partitions for the $p$-th feature.
Step 2: 
Generation of the consequent class for each fuzzy region.
Each training pattern is assigned to the fuzzy region with the greatest matching degree, and those patterns assigned to the same fuzzy region are fused to get the consequent class.
Step 3: 
Generation of the rule weights.
The rule weights are determined by two measures called confidence and support jointly.
Once the BRB is generated, the BRM is used to classify a query pattern by combining the consequent parts of all the activated belief rules (refer to [18] for details of this reasoning method).

2.3. Evidential C-Means (ECM)

In [21], the ECM algorithm was proposed to derive credal partitions from object data. In this algorithm, the class membership of an object $\mathbf{x}_i$ is represented by a mass function $m_i$ defined on the power set of a given frame of discernment $\Omega = \{\omega_1, \omega_2, \ldots, \omega_C\}$. The credal partition of the N observed data $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\} \subset \mathbb{R}^P$ is then defined as the N-tuple $M = (m_1, m_2, \ldots, m_N)$. It can be seen as a general model of partitioning, where:
  • when each $m_i$ is a certain mass function, M defines a conventional, crisp partition of the set of objects;
  • when each $m_i$ is a Bayesian mass function, M specifies a fuzzy partition, as defined by Bezdek [22].
For each object $\mathbf{x}_i$, the quantities $m_{ij} = m_i(A_j)$ ($A_j \subseteq \Omega$, $A_j \neq \emptyset$) are determined in such a way that the mass of belief $m_{ij}$ is low (high) when the distance $d_{ij}$ between object $\mathbf{x}_i$ and set $A_j$ is high (low). The distance between object $\mathbf{x}_i$ and set $A_j$ is calculated by $d_{ij} = \|\mathbf{x}_i - \bar{v}_j\|$, where $\bar{v}_j$ is the barycenter of the centers associated with the classes composing $A_j$. Denoting by $v_k$ the center of the single cluster $\omega_k$, the barycenter $\bar{v}_j$ is calculated as
$$\bar{v}_j = \frac{1}{|A_j|} \sum_{k=1}^{C} s_{kj} v_k, \quad \text{with } s_{kj} = \begin{cases} 1, & \text{if } \omega_k \in A_j, \\ 0, & \text{otherwise}. \end{cases} \tag{3}$$
Finally, the objective function used to derive the credal partition matrix M of size $2^C \times N$ and the cluster center matrix V of size $C \times P$ is given by
$$J_{\mathrm{ECM}}(M, V) = \sum_{i=1}^{N} \sum_{\{j \,:\, A_j \subseteq \Omega,\, A_j \neq \emptyset\}} |A_j|^{\alpha}\, m_{ij}^{\beta}\, d_{ij}^2 + \sum_{i=1}^{N} \delta^2 m_{i\emptyset}^{\beta}, \tag{4}$$
subject to
$$\sum_{\{j \,:\, A_j \subseteq \Omega,\, A_j \neq \emptyset\}} m_{ij} + m_{i\emptyset} = 1, \quad i = 1, \ldots, N, \tag{5}$$
where $\beta > 1$ is a weighting exponent controlling the fuzziness of the partition, $\alpha \ge 0$ is a weighting exponent controlling the degree of penalization for the subsets of $\Omega$ with high cardinality, $\delta > 0$ is a distance controlling the amount of data considered to be outliers, and $m_{i\emptyset}$ denotes $m_i(\emptyset)$, the mass that the class of object $\mathbf{x}_i$ does not lie in $\Omega$. This objective function is minimized with an iterative algorithm, which alternately optimizes the credal partition matrix M and the cluster center matrix V.
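As an illustration of how Equations (3)-(5) fit together, the sketch below evaluates $J_{\mathrm{ECM}}$ for a given credal partition and set of centers. It is not the alternating optimizer of [21]; the array layout (one column of M per non-empty focal set) and the default $\delta$ are assumptions made for this example, while $\alpha = \beta = 2$ mirrors the settings of Table 3.

```python
import numpy as np
from itertools import combinations

def nonempty_focal_sets(C):
    """All non-empty subsets of {0, ..., C-1}, by increasing cardinality."""
    out = []
    for r in range(1, C + 1):
        out.extend(frozenset(s) for s in combinations(range(C), r))
    return out

def j_ecm(X, M, m_empty, V, alpha=2.0, beta=2.0, delta=10.0):
    """Evaluate the ECM criterion of Eq. (4) for a candidate solution.

    X: (N, P) data; V: (C, P) cluster centers; M: (N, F) masses over the
    sets from nonempty_focal_sets(C); m_empty: (N,) masses m_i(empty).
    """
    sets = nonempty_focal_sets(V.shape[0])
    J = 0.0
    for j, A in enumerate(sets):
        v_bar = V[list(A)].mean(axis=0)          # barycenter, Eq. (3)
        d2 = ((X - v_bar) ** 2).sum(axis=1)      # squared distances d_ij^2
        J += (len(A) ** alpha) * ((M[:, j] ** beta) @ d2)
    return J + (delta ** 2) * (m_empty ** beta).sum()   # outlier term
```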

3. Compact BRB Learning with ECM

As reviewed in Section 2.2, in the traditional BRB learning method, belief rules are defined based on fuzzy-grid partitions of the feature space and individuals of the training instances. This may lead to a large rule base for big data sets with large numbers of instances and features, which degrades the interpretability of the classification model. In this section, we propose to learn a compact BRB based on partitions of the training set realized with clustering techniques. The ECM algorithm is used here to incorporate into the BRBCS the additional degrees of freedom and information obtained from the derived credal partitions. The flow diagram of the proposed compact BRB learning with ECM is shown in Figure 1. First, the ECM algorithm operates in a supervised way in Section 3.1, by means of weighted product-space clustering, with the goals of obtaining credal partitions with both good inter-cluster separability and inner-cluster pureness. Then, Section 3.2 shows how to construct belief rules based on credal partitions of the training set. Finally, a two-objective optimization procedure is designed in Section 3.3 to get a compact BRB with a better trade-off between accuracy and interpretability.

3.1. Credal Partition with Supervised ECM

In typical classification problems, a set of N labeled patterns $\mathcal{T} = \{(\mathbf{x}_1, c^{(1)}), (\mathbf{x}_2, c^{(2)}), \ldots, (\mathbf{x}_N, c^{(N)})\}$ with input vectors $\mathbf{x}_i \in \mathbb{R}^P$ and class labels $c^{(i)} \in \{c_1, c_2, \ldots, c_M\}$ is available, and the problem is to classify a query pattern $\mathbf{y}$ based on the training set $\mathcal{T}$. In contrast to unsupervised clustering problems, which only consider the inter-cluster separability, a good partition of labeled patterns should also take into account the inner-cluster pureness. For this purpose, we cluster the N labeled patterns in the following weighted product space, where each pattern is represented by its feature vector augmented with its weighted class label:
$$\mathbf{z} = (\mathbf{x},\ W c), \tag{6}$$
where $W \ge 0$ controls the weight of the class labels in the clustering process. If $W = 0$, the procedure reduces to unsupervised clustering; as $W \to \infty$, the resulting clusters coincide with those obtained by dividing the training set directly according to the class labels. A suggested choice of W for balancing the effects of feature values and class values is
$$W = \sqrt{\frac{\sum_{p=1}^{P} \sigma_p^2}{\sigma_c^2}}, \tag{7}$$
where $\sigma_p^2$ is the variance of the $p$-th feature values, $p = 1, 2, \ldots, P$, and $\sigma_c^2$ is the variance of the class values.
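A minimal sketch of the supervised construction of Equations (6) and (7) is given below. It assumes numeric class values and uses the square-root reading of Equation (7), under which the total variance of the weighted class component matches the total feature variance.

```python
import numpy as np

def weighted_product_space(X, y):
    """Augment each pattern with its weighted class label, z = (x, W*c).

    X: (N, P) feature matrix; y: (N,) numeric class values.
    W balances total feature variance against class-value variance
    (our sqrt reading of Eq. (7)).
    """
    W = np.sqrt(X.var(axis=0).sum() / y.var())
    Z = np.column_stack([X, W * y])   # clustering then runs on Z
    return Z, W
```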
With a given weight W and number of clusters C, the ECM clustering algorithm operates in the above supervised way to discover credal partitions of the training set $\mathcal{T}$ in the weighted product space. Two practical issues concerning the credal partitions are further considered, as follows.
  • Limiting the number of credal partitions. By minimizing the objective function in Equation (4), a maximum number of $2^C$ credal partitions can be obtained. However, credal partitions composed of many classes are quite difficult to interpret and are usually also less important in practice. Therefore, in order to learn a compact BRB, we constrain the focal sets to be either $\Omega$, or to be composed of at most two classes, thereby reducing the maximum number of credal partitions from $2^C$ to $(C^2 + C)/2 + 2 \triangleq F(C)$ (see the sketch after this list).
  • Discarding the outlier cluster. In ECM, the training patterns assigned to the empty set are considered to be outliers, which are adverse to classification. Thus, we only construct belief rules based on the remaining $F(C) - 1$ credal partitions associated with non-empty focal sets.
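The restricted family of focal sets can be enumerated directly, which also makes the count $F(C)$ explicit; the enumeration below (a sketch with integer class indices) is what the construction in Section 3.2 iterates over.

```python
from itertools import combinations

def restricted_focal_sets(C):
    """Focal sets kept for a compact BRB: the empty set, all singletons,
    all pairs, and the whole frame Omega, i.e., F(C) = (C^2 + C)/2 + 2."""
    sets = [frozenset()]                                       # outliers
    sets += [frozenset({k}) for k in range(C)]                 # C singletons
    sets += [frozenset(p) for p in combinations(range(C), 2)]  # C(C-1)/2 pairs
    sets.append(frozenset(range(C)))                           # Omega
    assert len(sets) == (C * C + C) // 2 + 2
    return sets

print(len(restricted_focal_sets(4)) - 1)  # 11 rules once the outlier
                                          # cluster is discarded
```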

3.2. Belief Rule Base Construction

As shown in Section 2.2, each belief rule is composed of three components, namely the antecedent part, the consequent class, and the rule weight. In the following part, we will show how to construct belief rules from these three aspects based on credal partitions of the training set obtained previously.

3.2.1. Antecedent Parts Generation

From the obtained credal partition matrix M, whose $ij$-th element $m_{ij} \in [0, 1]$ is the membership degree of the data $\mathbf{x}_i$ in partition j, it is possible to extract the fuzzy sets in the antecedent parts of the belief rules. One-dimensional antecedent fuzzy sets $A_p^j$ are obtained from the multidimensional credal partition M by point-wise projection [25] onto the space of the antecedent features $x_p$, $p = 1, 2, \ldots, P$:
$$\mu_{A_p^j}(x_{ip}) = \mathrm{proj}_p(m_{ij}). \tag{8}$$
With the above point-wise defined membership, a continuous membership function $\mu_{A_p^j}(x)$ for fuzzy set $A_p^j$ can be approximated. Several types of functions, such as triangular, trapezoidal, or Gaussian, can be used. In this work, we choose the Gaussian membership function of the form
$$\mu_{A_p^j}(x) = f(x; \bar{v}_{jp}, \sigma_{jp}) = \exp\!\left(-\frac{(x - \bar{v}_{jp})^2}{2\sigma_{jp}^2}\right), \tag{9}$$
where $\bar{v}_{jp}$ is the mean value calculated as in Equation (3), and $\sigma_{jp}$ is the standard deviation to be estimated.
In this way, for each credal partition j, $j = 1, 2, \ldots, F(C) - 1$, a series of fuzzy sets $A_1^j, A_2^j, \ldots, A_P^j$ can be defined on the antecedent features with Gaussian membership functions, which finally constitute the antecedent part of belief rule $R_j$.
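The text leaves the estimation of $\sigma_{jp}$ open; one plausible choice, used in the sketch below purely for illustration, is a mass-weighted second moment around the known barycenter. The function returns the Gaussian membership of Equation (9) for one rule.

```python
import numpy as np

def fit_antecedent(X, mass_j, v_bar_j):
    """Gaussian antecedent of one belief rule, Eq. (9).

    X: (N, P) training features; mass_j: (N,) column j of the credal
    partition matrix (the projected memberships of Eq. (8));
    v_bar_j: (P,) barycenter of partition j from Eq. (3).
    """
    w = mass_j / mass_j.sum()
    sigma = np.sqrt(w @ (X - v_bar_j) ** 2)   # weighted spread per feature
    return lambda x: np.exp(-(x - v_bar_j) ** 2 / (2.0 * sigma ** 2))
```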

3.2.2. Consequent Classes Generation

Based on the credal partition matrix M, the training set $\mathcal{T}$ can be divided into $F(C) - 1$ groups by assigning each pattern to the partition with the highest mass:
$$\mathcal{T}_j = \{(\mathbf{x}_i, c^{(i)}) \mid m_{ij} = \max_k m_{ik},\ i = 1, \ldots, N\}, \quad j = 1, 2, \ldots, F(C) - 1. \tag{10}$$
The training subsets $\mathcal{T}_j$, $j = 1, 2, \ldots, F(C) - 1$, define a hard credal partition [21] of the training set $\mathcal{T}$. In the following, we derive the consequent class of belief rule $R_j$ by combining the class information of the patterns in subset $\mathcal{T}_j$, $j = 1, 2, \ldots, F(C) - 1$.
First, for any pattern $\mathbf{x}_i \in \mathcal{T}_j$, we calculate its matching degree with the antecedent part of belief rule $R_j$ using the geometric mean operator:
$$\mu_{A^j}(\mathbf{x}_i) = \sqrt[P]{\prod_{p=1}^{P} \mu_{A_p^j}(x_{ip})}, \tag{11}$$
where $\mu_{A_p^j}$ is the membership function of the fuzzy set $A_p^j$ defined in Equation (9).
Then, assume the class label of pattern $\mathbf{x}_i$ is $c_k$, which takes its value in the class set $\mathcal{C}$. This can be regarded as a piece of evidence supporting the hypothesis that the consequent class is $c_k$. However, this piece of evidence is not fully certain. In belief function theory, this can be expressed by saying that only some part of the belief (measured by the matching degree $\mu_{A^j}(\mathbf{x}_i)$) is committed to $c_k$. Because $\mathrm{Class}(\mathbf{x}_i) = c_k$ does not point to any other particular class, the rest of the belief should be assigned to the frame of discernment $\mathcal{C}$, representing global ignorance. Therefore, this item of evidence can be represented by a mass function $m_j(\cdot \mid \mathbf{x}_i)$ verifying:
$$\begin{cases} m_j(\{c_k\} \mid \mathbf{x}_i) = \mu_{A^j}(\mathbf{x}_i), \\ m_j(\mathcal{C} \mid \mathbf{x}_i) = 1 - \mu_{A^j}(\mathbf{x}_i), \\ m_j(A \mid \mathbf{x}_i) = 0, \quad \forall A \in 2^{\mathcal{C}} \setminus \{\mathcal{C}, \{c_k\}\}. \end{cases} \tag{12}$$
Finally, the mass functions derived from all the patterns in $\mathcal{T}_j$ are combined to obtain the consequent class of belief rule $R_j$. As the items of evidence from different labeled patterns are collected independently, Dempster's rule of combination is used in this work to synthesize the final consequent class membership:
$$m_j = \bigoplus_{\mathbf{x}_i \in \mathcal{T}_j} m_j(\cdot \mid \mathbf{x}_i). \tag{13}$$
Note that since each piece of evidence has only one focal set besides the global set $\mathcal{C}$, the computation of Dempster's rule is quite efficient. The belief degrees of the consequent class of rule $R_j$ are then obtained as $\beta_k^j = m_j(\{c_k\})$, $k = 1, 2, \ldots, M$.
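Because each mass function in Equation (12) has a single singleton focal set besides the frame $\mathcal{C}$, the combination in Equation (13) reduces to products of $(1 - \mu)$ terms. The sketch below exploits this closed form; the variable names are illustrative.

```python
import numpy as np

def consequent_beliefs(mu, labels, M):
    """Combine the simple mass functions of Eq. (12) with Dempster's rule.

    mu: (n,) matching degrees of the patterns in T_j;
    labels: (n,) their class indices in {0, ..., M-1}.
    Returns the belief degrees beta_k^j and the residual ignorance m(C).
    """
    one_minus = 1.0 - mu
    prod_all = one_minus.prod()               # all patterns vote for C
    tilde = np.empty(M)
    for k in range(M):
        in_k = one_minus[labels == k].prod()  # patterns labelled c_k
        out_k = one_minus[labels != k].prod()
        tilde[k] = (1.0 - in_k) * out_k       # unnormalized m({c_k})
    total = tilde.sum() + prod_all            # equals 1 - conflict
    return tilde / total, prod_all / total
```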

3.2.3. Rule Weights Generation

As in [18], the rule weights can be derived from two concepts called confidence and support, which are often used for evaluating association rules in data mining. The confidence is a measure of the validity of a rule, defined for belief rule $R_j$ as
$$c(R_j) = 1 - \bar{K}_j, \tag{14}$$
where $0 \le \bar{K}_j \le 1$ is the average conflict factor, which measures the conflict among the pieces of evidence used for building the consequent class of rule $R_j$:
$$\bar{K}_j = \begin{cases} 0, & \text{if } |\mathcal{T}_j| = 1, \\ \dfrac{1}{|\mathcal{T}_j|(|\mathcal{T}_j| - 1)} \sum\limits_{\mathbf{x}_p, \mathbf{x}_q \in \mathcal{T}_j;\ c^{(p)} \neq c^{(q)}} \mu_{A^j}(\mathbf{x}_p)\, \mu_{A^j}(\mathbf{x}_q), & \text{otherwise}, \end{cases} \tag{15}$$
with $|\mathcal{T}_j|$ denoting the number of training patterns in the $j$-th hard credal partition.
On the other hand, the support indicates the grade of coverage of a rule, defined as the ratio of the number of covered patterns to the total number of patterns:
$$s(R_j) = \frac{|\mathcal{T}_j|}{N}. \tag{16}$$
Based on the above two measures, the rule weights are finally derived as
$$\theta_j = \frac{c(R_j)\, s(R_j)}{\max_{j'} \{c(R_{j'})\, s(R_{j'}),\ j' = 1, \ldots, F(C) - 1\}}, \quad j = 1, 2, \ldots, F(C) - 1. \tag{17}$$
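Putting Equations (14)-(17) together, the sketch below computes the rule weights from the hard credal partition; the double sum in Equation (15) is read here as running over ordered pairs, which matches the $|\mathcal{T}_j|(|\mathcal{T}_j| - 1)$ normalization.

```python
import numpy as np

def rule_weights(mu_by_rule, labels_by_rule, N):
    """Rule weights theta_j from confidence and support, Eqs. (14)-(17).

    mu_by_rule[j]: (n_j,) matching degrees of the patterns in T_j;
    labels_by_rule[j]: (n_j,) their class labels; N: training set size.
    """
    cs = []
    for mu, lab in zip(mu_by_rule, labels_by_rule):
        n = len(mu)
        if n == 1:
            K = 0.0
        else:
            diff = lab[:, None] != lab[None, :]   # ordered pairs, c(p) != c(q)
            K = (np.outer(mu, mu) * diff).sum() / (n * (n - 1))
        cs.append((1.0 - K) * n / N)              # c(R_j) * s(R_j)
    cs = np.asarray(cs)
    return cs / cs.max()                          # Eq. (17)
```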

3.3. Parameter Optimization for Trade-Off between Accuracy and Interpretability

In the above BRB learning process, the number of clusters C plays a key role in determining the accuracy and the interpretability of the learned classification model. More clusters mean more rules, which usually leads to higher classification accuracy but degrades the model's interpretability. Therefore, we need to search for an optimal number of clusters to obtain the desired trade-off between accuracy and interpretability.
On the one hand, to obtain a model with high accuracy, the following leave-one-out mean squared error (MSE) should be minimized:
$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} \left( P^{(i)}(\{\omega_j\}) - t_j^{(i)} \right)^2, \tag{18}$$
where $P^{(i)}(\{\omega_j\})$, $j = 1, \ldots, M$, are the outputs of the belief reasoning method for training pattern $\mathbf{x}^{(i)}$, and $t_j^{(i)}$, $j = 1, \ldots, M$, are binary indicator variables defined by $t_j^{(i)} = 1$ if the real label of training pattern $\mathbf{x}^{(i)}$ is $\omega_j$, and $t_j^{(i)} = 0$ otherwise.
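Given the outputs of the belief reasoning method for each held-out pattern, Equation (18) is a one-liner; the sketch below assumes integer class indices and one-hot targets.

```python
import numpy as np

def loo_mse(P_out, labels, M):
    """Leave-one-out MSE of Eq. (18).

    P_out: (N, M) singleton outputs P^(i)({w_j}) of the belief reasoning
    method for each held-out training pattern; labels: (N,) true classes.
    """
    T = np.eye(M)[labels]                        # indicators t_j^(i)
    return ((P_out - T) ** 2).sum(axis=1).mean()
```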
On the other hand, to obtain a model with high interpretability, the number of rules or, equivalently, the number of clusters should be minimized. However, the number of clusters should not be so small that cluster validity is compromised. To assess the quality of fuzzy partitions, a great number of validity indexes have been proposed in the literature [26,27,28,29]. One representative is the fuzzy partition entropy (FPE) [30], defined by
$$\mathrm{FPE} = \frac{1}{N \log_2(C)} \sum_{i=1}^{N} \sum_{j=1}^{C} \mu_{ij} \log_2 \frac{1}{\mu_{ij}}, \tag{19}$$
where $\mu_{ij}$ is the membership degree of the $i$-th pattern in the $j$-th cluster. The optimal number of clusters C is obtained by minimizing the FPE over $C = 2, 3, \ldots, C_{\max}$.
The above fuzzy entropy-based validity index inspires us to use similar definitions of entropy in the belief function framework to assess the quality of evidential partitions. The definition of entropy in the belief function framework has been a hot research subject in the past few years [31,32,33]. A representative one is the aggregated entropy (AE) introduced by Pal et al. [34], which satisfies natural requirements and has interesting properties. It is defined for a normal mass function m as
$$\mathrm{AE}(m) = \sum_{A \in \mathcal{F}(m)} m(A) \log_2 \frac{|A|}{m(A)}, \tag{20}$$
where $\mathcal{F}(m)$ denotes the set of focal sets of m. This entropy measure can be decomposed as the sum of two terms:
$$\mathrm{AE}(m) = \sum_{A \in \mathcal{F}(m)} m(A) \log_2 |A| + \sum_{A \in \mathcal{F}(m)} m(A) \log_2 \frac{1}{m(A)}. \tag{21}$$
The first term is the nonspecificity measure, which reflects the degree of imprecision of m, whereas the second term reflects the inconsistency in m and can be seen as a measure of conflict. Therefore, $\mathrm{AE}(m)$ tends to be small when the mass is assigned to few focal sets of small cardinality.
Though the above AE measure was defined for normal mass functions (i.e., m with $m(\emptyset) = 0$), it can easily be extended to the subnormal mass functions considered in this paper by defining the cardinality of the empty set as C (this extension is justified by the fact that the mass given to the empty set corresponds to a situation of maximal uncertainty, just like the mass given to $\Omega$ [35]). The evidential partition entropy (EPE) is then defined as the average AE:
$$\mathrm{EPE} = \frac{1}{N \log_2(C)} \sum_{i=1}^{N} \sum_{A \in \mathcal{F}(m_i)} m_i(A) \log_2 \frac{|A|}{m_i(A)}. \tag{22}$$
When all the patterns are assigned to the singleton sets $\{\omega_1\}, \{\omega_2\}, \ldots, \{\omega_C\}$, the EPE attains its lower bound of 0. The maximum value of the EPE is attained for $m_i$ such that $m_i(A) \propto |A|$ for all $A \in \mathcal{F}(m_i)$. It should also be noted that the fuzzy partition entropy FPE defined in Equation (19) is a special case of our evidential partition entropy EPE, obtained when each $m_i$ is a Bayesian mass function.
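Equation (22) can be computed directly from the credal partition matrix once each focal set is given its cardinality (with $|\emptyset|$ set to C, as above). A sketch:

```python
import numpy as np

def evidential_partition_entropy(M_mass, card, C):
    """EPE of Eq. (22).

    M_mass: (N, F) masses over the focal sets; card: (F,) cardinalities,
    with the empty set assigned cardinality C as in the text.
    """
    m = np.asarray(M_mass, dtype=float)
    card = np.asarray(card, dtype=float)
    safe = np.where(m > 0, m, 1.0)          # 0 * log(.) = 0 convention
    terms = m * np.log2(card / safe)
    return terms.sum() / (m.shape[0] * np.log2(C))
```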
Finally, a single scalar objective function for the number of clusters C is defined from the two objectives MSE and EPE as
$$J(C) = \lambda \cdot \mathrm{MSE} + (1 - \lambda) \cdot \mathrm{EPE}, \tag{23}$$
where $\lambda \in [0, 1]$ is a weight characterizing the user's preference for classification accuracy. When $\lambda = 1$, classification accuracy is the only objective, whereas when $\lambda = 0$, only cluster validity is guaranteed. With a given weight $\lambda$, minimizing the above objective function yields an optimal number of clusters C with a better trade-off between accuracy and interpretability.
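The resulting model selection is then a one-dimensional search over candidate C; a sketch, assuming the MSE and EPE values for each candidate have already been computed with the pipeline above:

```python
def select_num_clusters(mse_by_C, epe_by_C, lam=0.5):
    """Pick C minimizing J(C) of Eq. (23); lam is the accuracy weight."""
    J = {C: lam * mse_by_C[C] + (1.0 - lam) * epe_by_C[C] for C in mse_by_C}
    return min(J, key=J.get)

# e.g., with the lambda = 0.5 setting of Table 3:
# C_opt = select_num_clusters(mse, epe, lam=0.5)
```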

4. Experiments

The performance of the proposed CBRBCS was assessed by two different types of experiments. In the first experiment, a synthetic data set was used to show the behavior of the proposal in controlled settings. In the second one, 20 real data sets from the UCI Machine Learning Repository [23] were considered, with the aim of showing that the proposed technique is adequate for a variety of real tasks.

4.1. Synthetic Data Set Test

A two-dimensional four-class synthetic data set was designed to illustrate the interest of the compact BRB learning method in the CBRBCS. The following normal class-conditional distributions were assumed:
Class $\omega_1$: $\mu_1 = (0, 0)^T$, $\Sigma_1 = 2I$;  Class $\omega_2$: $\mu_2 = (5, 0)^T$, $\Sigma_2 = 2I$;
Class $\omega_3$: $\mu_3 = (2, 5)^T$, $\Sigma_3 = 2I$;  Class $\omega_4$: $\mu_4 = (3, 5)^T$, $\Sigma_4 = 2I$.
A set of 400 samples was generated from the above distributions using equal prior probabilities. This data set is displayed in Figure 2. The proposed method was used to learn BRBs from this data set. The default values of open parameters in ECM were used and different values of the accuracy weight λ were considered for comparison.
Figure 3 shows the objective values $J(C)$ of the learned BRBs under different numbers of clusters ($C = 2, 3, 4, 5, 6$). When the accuracy weight $\lambda = 1$, the objective function $J(C)$ reduces to the MSE measure. It can be seen that, as the number of clusters (or, equivalently, the number of rules) increases, the MSE decreases gradually. By minimizing the MSE, a large BRB is obtained with high classification accuracy. By contrast, when the accuracy weight $\lambda = 0$, the EPE measure is recovered. We see that the EPE reaches its minimum when the number of clusters equals 3, after which it increases with the number of clusters. By minimizing the EPE, we can thus get a small BRB with higher model interpretability but relatively lower classification accuracy. Finally, when the accuracy weight $0 < \lambda < 1$, the objective value $J(C)$ provides a trade-off between the MSE and the EPE. Please note that the three considered weights ($\lambda = 0.2$, 0.5, and 0.8) lead to the same decision for the optimal number of clusters, $C = 4$, in which case a total of $(C^2 + C)/2 + 1 = 11$ belief rules are learned, with a classification accuracy of 83.55%.

4.2. Real Data Set Test

In this experiment, 20 representative real data sets from UCI Machine Learning Repository were selected to evaluate the performance of the proposed CBRBCS. The main characteristics of the 20 data sets are summarized in Table 2. It can be seen that the selected data sets vary greatly in the number of instances (from 80 to 12,690), the number of features (from 4 to 60), and the number of classes (from 2 to 11).
To develop the experiment, we consider the B-fold cross-validation (B-CV) model. Each data set is divided into B blocks, with $B - 1$ blocks used as the training set and the remaining block as the test set; each block is thus used exactly once as a test set. We use the 10-CV model here, i.e., ten random partitions of the original data set, each with nine blocks (90%) as the training set and the remaining block (10%) as the test set. For each data set, we report the average results over the ten partitions.
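The split protocol can be sketched as follows (illustrative code, not the original experimental scripts):

```python
import numpy as np

def b_fold_splits(n, B=10, seed=0):
    """Yield (train, test) index arrays for the B-CV protocol."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), B)
    for b in range(B):
        test = folds[b]
        train = np.concatenate([folds[i] for i in range(B) if i != b])
        yield train, test
```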
The performance of the proposed classifier is compared with the traditional BRBCS [18] as well as several other representative classifiers, including K-NN (instance-based classifier) [2], C4.5 (decision tree-based classifier) [3], SVM (statistical classifier) [4], and FRBCS (rule-based classifier) [5]. Settings of these comparison methods are summarized in Table 3.
Table 4 shows the classification accuracy rates of the different methods for the real data sets. The numbers in brackets represent the ranks of the classification accuracy for each method, and the last row shows the average ranks of all the methods over the 20 data sets. It can be seen that the performance of the two belief rule-based classifiers, i.e., BRBCS and CBRBCS, is comparable with that of the classical methods. To compare the classification results statistically, we carry out nonparametric tests [36,37] for multiple comparisons based on the average accuracy ranks obtained over the considered data sets. First, we use the Iman-Davenport test to determine whether significant differences exist among the different methods. The Iman-Davenport statistic (distributed according to the F-distribution with $k - 1 = 5$ and $(k - 1)(N - 1) = 95$ degrees of freedom, where k is the number of compared methods and N is the number of data sets) is 4.99 for the average ranks, and the corresponding critical value is 2.29 for a significance level of $\alpha = 0.05$. Given that the Iman-Davenport statistic is clearly greater than the critical value, the test rejects the null hypothesis, and therefore it can be said that there are significant differences among the accuracy results of the considered methods. Then, we apply the post hoc Bonferroni-Dunn test to compare the control method (i.e., the proposed CBRBCS) with the remaining ones. Figure 4 shows the test result for the average accuracy ranks with a significance level of $\alpha = 0.05$, in which case the calculated critical difference is 1.52. The critical difference value is represented as a thicker horizontal line, and the methods whose values exceed this line give results significantly different from the control method. It can be seen that the proposed CBRBCS performs significantly better than the FRBCS and obtains classification accuracy similar to the traditional BRBCS. Compared with the other non-rule-based classifiers, including SVM, C4.5, and K-NN, although the classification accuracy differences among them are not very significant, the proposed CBRBCS is preferable as it can provide a more interpretable classification model with comparable accuracy.
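The reported statistic can be reproduced from the last row of Table 4: with k = 6 methods and N = 20 data sets, the Friedman statistic is converted to the Iman-Davenport F-statistic as below.

```python
def iman_davenport(avg_ranks, N):
    """Iman-Davenport statistic from average ranks over N data sets."""
    k = len(avg_ranks)
    chi2_f = 12.0 * N / (k * (k + 1)) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4.0)
    return (N - 1) * chi2_f / (N * (k - 1) - chi2_f)

ranks = [3.45, 3.05, 2.90, 5.20, 3.15, 3.25]  # last row of Table 4
print(round(iman_davenport(ranks, N=20), 2))   # -> 4.99
```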
To evaluate the interpretability of the classification models, Table 5 displays the numbers of generated rules for the two belief rule-based methods, i.e., BRBCS and CBRBCS. It can be seen that, for all the evaluated data sets, far fewer rules are generated by the proposed CBRBCS. To show the rule reduction performance more clearly, we also provide the rule reduction rate (defined as $(\#\mathrm{Rule}_{\mathrm{BRBCS}} - \#\mathrm{Rule}_{\mathrm{CBRBCS}}) / \#\mathrm{Rule}_{\mathrm{BRBCS}}$) in the last column. We can notice that, for those data sets with large numbers of training instances and features but a small number of classes (like Australian, Car, Contraceptive, Ionosphere, Nursery, Sonar, Thyroid, and Vehicle), the proposed CBRBCS achieves more significant rule reduction (with a rule reduction rate above 90%). The reason is that, in the traditional BRBCS, the rules are generated with the fuzzy-grid method, in which case the number of generated rules is positively correlated with both the number of training instances and the number of features. By contrast, the number of rules generated by the clustering-based learning method used in the CBRBCS is determined only by the underlying structure of the data, which is closely related to the number of classes. Therefore, compared with the traditional BRBCS, the proposed CBRBCS obtains a better trade-off between accuracy and interpretability (similar classification accuracy is obtained with a much smaller number of rules).

5. Conclusions

In this paper, a compact belief rule-based classification system with ECM clustering has been proposed to overcome the limitations of the traditional BRBCS in large data set conditions. Instead of defining belief rules for individuals of the training patterns, belief rules are constructed based on credal partitions of the training set. The two-objective optimization procedure based on both the mean squared error and the evidential partition entropy can successfully find an optimal number of clusters. This method can discover the underlying data structure, which can be successfully translated into belief rules. From the results reported in the last section, we can conclude that the proposed technique obtains a better trade-off between accuracy and interpretability than the traditional one. Furthermore, compared with other non-rule-based classifiers, the proposed technique obtains competitive classification accuracy. Therefore, this technique is a better choice for those classification problems where both high accuracy and interpretability are needed.

Author Contributions

L.J. conceived the idea and designed the methodology. X.G. wrote the paper. Q.P. provided the laboratory support and improved the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant Nos. 61790552 and 61801386), the Natural Science Basic Research Plan in Shaanxi Province of China (Grant No. 2018JQ6043), and the Aerospace Science and Technology Foundation of China.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Aggarwal, C.C. Data Classification: Algorithms and Applications; Chapman & Hall: Boca Raton, FL, USA, 2014.
  2. Cover, T.M.; Hart, P.E. Nearest neighbor pattern classification. IEEE Trans. Inf. Theory 1967, 13, 21–27.
  3. Quinlan, J.R. C4.5: Programs for Machine Learning; Morgan Kaufmann: San Francisco, CA, USA, 1993.
  4. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
  5. Chi, Z.; Yan, H.; Pham, T. Fuzzy Algorithms with Applications to Image Processing and Pattern Recognition; World Scientific: Singapore, 1996.
  6. Ishibuchi, H.; Nozaki, K.; Tanaka, H. Distributed representation of fuzzy rules and its application to pattern classification. Fuzzy Sets Syst. 1992, 52, 21–32.
  7. Zadeh, L.A. Fuzzy sets. Inf. Control 1965, 8, 338–353.
  8. Stavrakoudis, D.G.; Galidaki, G.N.; Gitas, I.Z.; Theocharis, J.B. A genetic fuzzy-rule-based classifier for land cover classification from hyperspectral imagery. IEEE Trans. Geosci. Remote Sens. 2012, 50, 130–148.
  9. Samantaray, S.R. Genetic-fuzzy rule mining approach and evaluation of feature selection techniques for anomaly intrusion detection. Appl. Soft Comput. 2013, 13, 928–938.
  10. Singh, P.; Pal, N.R.; Verma, S.; Vyas, O.P. Fuzzy rule-based approach for software fault prediction. IEEE Trans. Syst. Man Cybern. Syst. 2017, 47, 826–837.
  11. Paul, A.K.; Shill, P.C.; Rabin, M.R.I.; Murase, K. Adaptive weighted fuzzy rule-based system for the risk level assessment of heart disease. Appl. Intell. 2018, 48, 1739–1756.
  12. Wu, H.; Mendel, J. Classification of battlefield ground vehicles using acoustic features and fuzzy logic rule-based classifiers. IEEE Trans. Fuzzy Syst. 2007, 15, 56–72.
  13. Dempster, A. Upper and lower probabilities induced by a multivalued mapping. Ann. Math. Stat. 1967, 38, 325–339.
  14. Shafer, G. A Mathematical Theory of Evidence; Princeton University Press: Princeton, NJ, USA, 1976.
  15. Yager, R.R.; Filev, D.P. Including probabilistic uncertainty in fuzzy logic controller modeling using Dempster-Shafer theory. IEEE Trans. Syst. Man Cybern. 1995, 25, 1221–1230.
  16. Liu, J.; Yang, J.B.; Wang, J.; Sii, H.S.; Wang, Y.M. Fuzzy rule based evidential reasoning approach for safety analysis. Int. J. Gen. Syst. 2004, 33, 183–204.
  17. Yang, J.B.; Liu, J.; Wang, J.; Sii, H.S.; Wang, H.W. Belief rule-based inference methodology using the evidential reasoning approach–RIMER. IEEE Trans. Syst. Man Cybern. Part A Syst. 2006, 36, 266–285.
  18. Jiao, L.; Pan, Q.; Denœux, T.; Liang, Y.; Feng, X. Belief rule-based classification system: Extension of FRBCS in belief functions framework. Inf. Sci. 2015, 309, 26–49.
  19. Jiao, L.; Geng, X.; Pan, Q. A compact belief rule-based classification system with interval-constrained clustering. In Proceedings of the 2018 International Conference on Information Fusion, Cambridge, UK, 10–13 July 2018; pp. 2270–2274.
  20. Jiao, L. Classification of Uncertain Data in the Framework of Belief Functions: Nearest-Neighbor-Based and Rule-Based Approaches. Ph.D. Thesis, Université de Technologie de Compiègne, Compiègne, France, 26 October 2015.
  21. Masson, M.H.; Denoeux, T. ECM: An evidential version of the fuzzy c-means algorithm. Pattern Recognit. 2008, 41, 1384–1397.
  22. Bezdek, J.C. Pattern Recognition with Fuzzy Objective Function Algorithms; Plenum Press: New York, NY, USA, 1981.
  23. Dua, D.; Karra Taniskidou, E. UCI Machine Learning Repository. Available online: http://archive.ics.uci.edu/ml (accessed on 1 July 2017).
  24. Jiao, L.; Denœux, T.; Pan, Q. A hybrid belief rule-based classification system based on uncertain training data and expert knowledge. IEEE Trans. Syst. Man Cybern. Syst. 2016, 46, 1711–1723.
  25. Almeida, R.J.; Denoeux, T.; Kaymak, U. Constructing rule-based models using the belief functions framework. In Advances in Computational Intelligence; Springer: Berlin/Heidelberg, Germany, 2012; pp. 554–563.
  26. Wang, W.; Zhang, Y. On fuzzy cluster validity indices. Fuzzy Sets Syst. 2007, 158, 2095–2117.
  27. Yang, M.S.; Nataliani, Y. Robust-learning fuzzy c-means clustering algorithm with unknown number of clusters. Pattern Recognit. 2015, 71, 45–59.
  28. Wu, C.H.; Ouyang, C.S.; Chen, L.W.; Lu, L.W. A new fuzzy clustering validity index with a median factor for centroid-based clustering. IEEE Trans. Fuzzy Syst. 2015, 23, 701–718.
  29. Lei, Y.; Bezdek, J.C.; Chan, J.; Vinh, N.X.; Romano, S.; Bailey, J. Extending information-theoretic validity indices for fuzzy clustering. IEEE Trans. Fuzzy Syst. 2017, 25, 1013–1018.
  30. Bezdek, J.C. Cluster validity with fuzzy sets. J. Cybern. 1974, 3, 58–73.
  31. Jiroušek, R.; Shenoy, P.P. A new definition of entropy of belief functions in the Dempster-Shafer theory. Int. J. Approx. Reason. 2018, 92, 49–65.
  32. Pan, L.; Deng, Y. A new belief entropy to measure uncertainty of basic probability assignments based on belief function and plausibility function. Entropy 2018, 20, 842.
  33. Xiao, F. An improved method for combining conflicting evidences based on the similarity measure and belief function entropy. Int. J. Fuzzy Syst. 2018, 20, 1256–1266.
  34. Pal, N.R.; Bezdek, J.C.; Hemasinha, R. Uncertainty measures for evidential reasoning II: New measure of total uncertainty. Int. J. Approx. Reason. 1993, 8, 1–16.
  35. Denœux, T.; Masson, M.H. EVCLUS: Evidential clustering of proximity data. IEEE Trans. Syst. Man Cybern. Part B Cybern. 2004, 34, 95–109.
  36. Demšar, J. Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 2006, 7, 1–30.
  37. García, S.; Fernández, A.; Luengo, J.; Herrera, F. Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Inf. Sci. 2010, 180, 2044–2064.
Figure 1. Flow diagram of the compact BRB learning with ECM.
Figure 2. Synthetic data set ('◯' for class ω1, '□' for class ω2, '▽' for class ω3, and '◊' for class ω4).
Figure 3. Objective values J(C) of the learned BRBs under different numbers of clusters C.
Figure 4. Bonferroni-Dunn test result of the average accuracy ranks.
Table 1. List of symbols and definitions.
Symbol | Definition
AE | aggregated entropy
BRB | belief rule base
BRBCS | belief rule-based classification system
BRM | belief reasoning method
CBRBCS | compact belief rule-based classification system
ECM | evidential C-means
EPE | evidential partition entropy
FCM | fuzzy C-means
FPE | fuzzy partition entropy
FRBCS | fuzzy rule-based classification system
K-NN | K-nearest neighbors
MSE | mean squared error
RBC | rule-based classification
SVM | support vector machine
$A^j$ | antecedent part of belief rule $R_j$
$\mathcal{C}$ | set of classes
$c$ | class label
$C$ | number of clusters
$d_{ij}$ | distance between object $\mathbf{x}_i$ and set $A_j$
$M$ | credal partition matrix
$n_p$ | number of partitions for the $p$-th feature
$R_j$ | $j$-th belief rule in the rule base
$\mathcal{T}$ | training data set
$V$ | cluster center matrix
$W$ | weight of class labels in the clustering process
$\mathbf{x}$ | input feature vector
$\alpha$ | weighting exponent for cardinality in ECM
$\beta$ | weighting exponent for fuzziness in ECM
$\delta$ | distance to the empty set in ECM
$\theta_j$ | weight of belief rule $R_j$
$\lambda$ | weight of classification accuracy
$\sigma_p^2$ | variance of the $p$-th feature values
$\sigma_c^2$ | variance of class values
$\Omega$ | frame of discernment
Table 2. Statistics of the real data sets used in the experiment.
Data Set | # Instances | # Features | # Classes
Australian | 690 | 14 | 2
Balance | 625 | 4 | 3
Car | 1278 | 6 | 4
Contraceptive | 1473 | 9 | 3
Dermatology | 358 | 34 | 6
Ecoli | 336 | 7 | 8
Glass | 214 | 9 | 6
Hepatitis | 80 | 19 | 2
Ionosphere | 351 | 33 | 2
Iris | 150 | 4 | 3
Lymphography | 148 | 18 | 4
Nursery | 12,690 | 8 | 5
Page-blocks | 5472 | 10 | 5
Sonar | 208 | 60 | 2
Thyroid | 7200 | 21 | 3
Vehicle | 846 | 18 | 4
Vowel | 990 | 13 | 11
Wine | 178 | 13 | 3
Yeast | 1484 | 8 | 10
Zoo | 101 | 16 | 7
Table 3. Settings of the comparison methods.
Method | Parameter | Value
K-NN | number of neighbors K | 3
 | distance metric | Euclidean
C4.5 | pruned? | TRUE
 | confidence level c | 0.25
 | minimal instances per leaf i | 2
SVM | kernel type | RBF
 | penalty coefficient C | 100
 | kernel parameter γ | 0.01
FRBCS | number of partitions per feature n | 5
 | membership function type | triangular
 | reasoning method | single winner
BRBCS | number of partitions per feature n | 5
 | membership function type | triangular
 | reasoning method | belief reasoning
CBRBCS | weighting exponent for cardinality α | 2
 | weighting exponent for fuzziness β | 2
 | accuracy weight λ | 0.5
Table 4. Classification accuracy rates (in %) for real data sets.
Data Set | K-NN | C4.5 | SVM | FRBCS | BRBCS | CBRBCS
Australian | 88.78 (1) | 85.22 (2) | 75.51 (6) | 79.86 (5) | 82.74 (4) | 83.90 (3)
Balance | 83.37 (5) | 76.80 (6) | 95.51 (1) | 89.60 (4) | 92.66 (3) | 93.20 (2)
Car | 92.31 (6) | 92.55 (5) | 94.33 (2) | 92.78 (4) | 95.23 (1) | 93.12 (3)
Contraceptive | 44.95 (5) | 52.68 (2) | 55.95 (1) | 39.86 (6) | 49.15 (3) | 48.20 (4)
Dermatology | 96.90 (1) | 94.42 (2) | 94.34 (3) | 72.29 (6) | 85.12 (5) | 93.35 (4)
Ecoli | 80.67 (3) | 79.47 (4) | 81.96 (2) | 76.02 (6) | 78.34 (5) | 82.62 (1)
Glass | 70.11 (1) | 67.44 (5) | 70.00 (2) | 66.04 (6) | 69.04 (3) | 68.15 (4)
Hepatitis | 82.51 (3) | 84.00 (1) | 82.18 (4) | 74.41 (6) | 76.28 (5) | 83.68 (2)
Ionosphere | 85.18 (6) | 90.90 (3) | 92.60 (1) | 86.55 (5) | 89.11 (4) | 91.66 (2)
Iris | 94.00 (4) | 96.00 (3) | 97.33 (1) | 93.67 (5) | 96.67 (2) | 93.33 (6)
Lymphography | 77.39 (4) | 74.30 (5) | 81.27 (1) | 72.27 (6) | 79.20 (2) | 77.90 (3)
Nursery | 92.54 (6) | 97.30 (1) | 93.18 (5) | 94.02 (4) | 96.05 (2) | 94.65 (3)
Page-blocks | 95.91 (2) | 96.97 (1) | 92.36 (5) | 91.92 (6) | 95.10 (4) | 95.28 (3)
Sonar | 83.07 (1) | 70.07 (5) | 78.71 (2) | 59.60 (6) | 74.80 (3) | 73.33 (4)
Thyroid | 93.89 (5) | 99.63 (1) | 93.49 (6) | 94.03 (4) | 95.04 (2) | 94.39 (3)
Vehicle | 71.75 (3) | 74.69 (1) | 52.95 (6) | 60.77 (5) | 71.95 (2) | 70.54 (4)
Vowel | 97.78 (1) | 81.52 (5) | 95.76 (2) | 79.90 (6) | 93.28 (3) | 92.10 (4)
Wine | 95.49 (3) | 94.90 (4) | 89.74 (6) | 95.82 (2) | 96.14 (1) | 94.48 (5)
Yeast | 53.17 (5) | 55.53 (2) | 58.09 (1) | 48.51 (6) | 54.66 (4) | 55.08 (3)
Zoo | 92.81 (4) | 93.64 (3) | 96.50 (1) | 85.06 (6) | 90.55 (5) | 95.30 (2)
Average rank | 3.45 | 3.05 | 2.90 | 5.20 | 3.15 | 3.25
Table 5. Numbers of generated rules of BRBCS and CBRBCS for real data sets.
Data Set | # Train Instances | # Features | # Classes | # Rules (BRBCS) | # Rules (CBRBCS) | Reduction Rate
Australian | 621 | 14 | 2 | 317 | 16 | 94.95%
Balance | 552 | 4 | 3 | 66 | 11 | 83.33%
Car | 1150 | 6 | 4 | 682 | 22 | 96.77%
Contraceptive | 1326 | 9 | 3 | 233 | 22 | 90.56%
Dermatology | 322 | 34 | 6 | 315 | 37 | 88.25%
Ecoli | 302 | 7 | 8 | 45 | 37 | 17.78%
Glass | 192 | 9 | 6 | 40 | 22 | 45.00%
Hepatitis | 72 | 19 | 2 | 67 | 7 | 89.55%
Ionosphere | 316 | 33 | 2 | 227 | 11 | 95.15%
Iris | 135 | 4 | 3 | 14 | 11 | 21.43%
Lymphography | 133 | 18 | 4 | 129 | 22 | 82.95%
Nursery | 11,421 | 8 | 5 | 4238 | 37 | 99.13%
Page-blocks | 4925 | 10 | 5 | 55 | 22 | 60.00%
Sonar | 187 | 60 | 2 | 187 | 16 | 91.44%
Thyroid | 6480 | 21 | 3 | 460 | 22 | 95.22%
Vehicle | 761 | 18 | 4 | 230 | 16 | 93.04%
Vowel | 891 | 13 | 11 | 141 | 67 | 52.48%
Wine | 160 | 13 | 3 | 122 | 16 | 86.89%
Yeast | 1336 | 8 | 10 | 96 | 56 | 41.67%
Zoo | 91 | 16 | 7 | 55 | 29 | 47.27%
