Compact Belief Rule Base Learning for Classification with Evidential Clustering

The belief rule-based classification system (BRBCS) is a promising technique for addressing different types of uncertainty in complex classification problems, by introducing the belief function theory into the classical fuzzy rule-based classification system. However, in the BRBCS, high numbers of instances and features generally induce a belief rule base (BRB) of large size, which degrades the interpretability of the classification model for large data sets. In this paper, a BRB learning method based on the evidential C-means clustering (ECM) algorithm is proposed to efficiently design a compact belief rule-based classification system (CBRBCS). First, a supervised version of the ECM algorithm is designed by means of weighted product-space clustering to partition the training set with the goals of obtaining both good inter-cluster separability and inner-cluster pureness. Then, a systematic method is developed to construct belief rules based on the obtained credal partitions. Finally, an evidential partition entropy-based optimization procedure is designed to get a compact BRB with a better trade-off between accuracy and interpretability. The key benefit of the proposed CBRBCS is that it provides a more interpretable classification model while keeping comparable accuracy. Experiments on synthetic and real data sets have been conducted to evaluate the classification accuracy and interpretability of the proposal.


Introduction
Pattern classification is a popular research field in artificial intelligence. The main purpose of classification is to assign objects, represented by feature vectors, to a predefined set of classes [1]. In the past five decades, a variety of classification techniques have been proposed, such as K-nearest neighbors (K-NN) [2], decision trees (DT) [3], support vector machines (SVM) [4], and rule-based classification (RBC) [5]. Among these methods, RBC not only yields classification results that are easy to interpret, but can also be easily enhanced by adding new rules from experts' domain knowledge. As one of the most representative RBC methods, the fuzzy rule-based classification system (FRBCS) [5,6] has been developed by incorporating fuzzy sets [7]. The FRBCS is widely used because it can build a linguistic model interpretable to users. It has been successfully applied to many classification tasks where model interpretability is important, such as terrain classification [8], intrusion detection [9], fault prediction [10], disease diagnosis [11], and target recognition [12].
However, in real-world complex systems, different types of uncertainty (such as fuzziness, imprecision and incompleteness) may coexist. The FRBCS, which is based on fuzzy set theory, cannot effectively model imprecise or incomplete information. The belief function theory, proposed by Dempster [13] and Shafer [14], provides a powerful framework for modeling and reasoning under uncertainty. As fuzzy set theory and belief function theory are suited to dealing with different types of uncertainty, some researchers have investigated the relationship between them and suggested different ways of integrating them [15-18]. Among them, in [18], a belief rule-based classification system (BRBCS) was developed by extending the FRBCS within the framework of belief function theory to address imprecise or incomplete information in complex classification problems. In contrast to the traditional fuzzy rule, the belief rule has a belief distribution structure as its consequent part, so that the different kinds of uncertain information existing in the training set can be well characterized. In addition, to reduce the risk of misclassification in noisy conditions, a query pattern is classified by combining all the activated belief rules. In many situations, this method is found experimentally to yield better classification accuracy and robustness than the FRBCS using the same information.
Rule learning is the most important issue in developing the BRBCS. In [18], a heuristic belief rule base (BRB) learning method was developed by defining belief rules based on fuzzy-grid partitions of the feature space and on the individual training patterns, and the resulting BRB can provide an accurate mapping between the feature space and the class space. However, with this method, higher numbers of instances and features generally induce a larger BRB. This may lead to a large rule base for large data sets, which degrades the interpretability of the classification model. Motivated by the above consideration, in this paper, a compact belief rule-based classification system (CBRBCS) is developed for a better trade-off between accuracy and interpretability (A preliminary version of some of the ideas introduced here was presented in [19,20]. The present paper is a deeply revised and extended version of this work, with several new results.). We propose to learn a compact BRB based on partitions of the training set obtained with clustering techniques. The evidential C-means (ECM) algorithm [21], which extends the fuzzy C-means (FCM) algorithm [22] within the framework of belief functions, is used for its capability to address imprecise and partial information existing in the observed data. As belief rules are constructed based on credal partitions of the training set, this method can greatly reduce the number of generated rules. The main contributions of this paper are as follows:

1. A supervised version of the ECM algorithm is designed by means of weighted product-space clustering to take into account the class labels, which can obtain credal partitions with both good inter-cluster separability and inner-cluster pureness.
2. A systematic method is developed to construct belief rules (composed of the antecedent part, the consequent class, and the rule weight) based on credal partitions of the training set.
3. A two-objective optimization procedure based on both the mean squared error and the evidential partition entropy is designed to get a compact BRB with a better trade-off between accuracy and interpretability.

Two types of experiments using both synthetic and real data sets have been developed to evaluate the performance of the proposed CBRBCS. In the synthetic data test, a two-dimensional four-class synthetic data set was designed to illustrate the interest of the compact BRB learning under different parameter settings. In the real data test, 20 data sets varying greatly in the number of instances, features, and classes were selected from the UCI Machine Learning Repository [23] for evaluation. The comparison methods cover the traditional BRBCS, as well as some of the most representative classifiers, including K-NN, C4.5, SVM, and FRBCS. The reported results show that the proposed CBRBCS can obtain competitive performance compared with those representative classifiers for a variety of real tasks involving different data conditions, and get a better trade-off between accuracy and interpretability than the traditional BRBCS. Therefore, it provides a better choice of classification technique for those problems where both high accuracy and interpretability are needed.
The rest of the paper is organized as follows. In Section 2, some preliminaries of the related theories and methods are reviewed. The compact BRB learning with ECM is developed in Section 3. The experiments to evaluate the performance of the proposed method are reported in Section 4. Finally, Section 5 concludes the paper.

Background
In this section, we provide some preliminaries of the related theories and methods. We first introduce some basic concepts of the belief function theory in Section 2.1. After that, we give overviews of the BRBCS classification method and the ECM clustering method in Sections 2.2 and 2.3, respectively. The symbols used and their definitions are listed in Table 1 to facilitate reading.

Basics of the Belief Function Theory
In belief function theory [13,14], a problem domain is represented by a finite set $\Omega = \{\omega_1, \omega_2, \cdots, \omega_C\}$ called the frame of discernment. A mass function expressing the belief committed to the elements of $2^\Omega$ by a given source of evidence is a mapping $m: 2^\Omega \to [0, 1]$ such that

$$\sum_{A \in 2^\Omega} m(A) = 1.$$

Elements $A \subseteq \Omega$ with $m(A) > 0$ are called the focal sets of the mass function $m$. The mass function $m$ has several special cases, which represent different types of information. A mass function is said to be

• normal, if $m(\emptyset) = 0$. Otherwise, it is subnormal, and $m(\emptyset)$ is interpreted as a mass of belief given to the hypothesis that $\omega$ might not lie in $\Omega$;
• Bayesian, if all its focal sets are singletons. In this case, the mass function reduces to a precise probability distribution;
• certain, if the whole mass is allocated to a unique singleton. This corresponds to a situation of complete knowledge;
• vacuous, if the whole mass is allocated to $\Omega$. This situation corresponds to complete ignorance.
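After representing the available pieces of evidence as mass functions, one often needs to combine several mass functions into a single one. Dempster's rule is the most popular way to combine several distinct pieces of evidence. Dempster's rule of combination of two normal mass functions $m_1$ and $m_2$ defined on the same frame of discernment $\Omega$ is given by

$$(m_1 \oplus m_2)(A) = \begin{cases} 0, & A = \emptyset, \\ \dfrac{\sum_{B \cap D = A} m_1(B)\, m_2(D)}{1 - \sum_{B \cap D = \emptyset} m_1(B)\, m_2(D)}, & A \subseteq \Omega,\ A \neq \emptyset. \end{cases}$$

Dempster's rule of combination is both commutative and associative.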
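As an illustration of Dempster's rule, the following is a minimal sketch in Python. The representation of a mass function as a dictionary from `frozenset` focal sets to masses, and the function name `dempster_combine`, are assumptions made for this example only.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Combine two normal mass functions with Dempster's rule.

    Each mass function is a dict mapping frozenset focal sets to masses
    (an illustrative representation, not prescribed by the paper).
    """
    combined = {}
    conflict = 0.0
    for (b, mass_b), (d, mass_d) in product(m1.items(), m2.items()):
        intersection = b & d
        if intersection:
            # Product of masses goes to the intersection of the focal sets
            combined[intersection] = combined.get(intersection, 0.0) + mass_b * mass_d
        else:
            # Empty intersection: this product counts as conflict
            conflict += mass_b * mass_d
    if conflict >= 1.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    # Normalize by the non-conflicting mass
    return {a: v / (1.0 - conflict) for a, v in combined.items()}


# Two pieces of evidence on a three-element frame
omega = frozenset({"w1", "w2", "w3"})
m1 = {frozenset({"w1"}): 0.6, omega: 0.4}
m2 = {frozenset({"w2"}): 0.5, omega: 0.5}
m12 = dempster_combine(m1, m2)
```

In this example the conflicting product is 0.6 × 0.5 = 0.3, so the remaining masses are divided by 0.7; swapping the two arguments gives the same result, in line with the commutativity stated above.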

Belief Rule-Based Classification System (BRBCS)
The BRBCS is composed of two components, the BRB and the belief reasoning method (BRM) [18]. The BRB is first constructed to establish a mapping between the feature space and the class space, and then the BRM is used to classify a query pattern based on the constructed BRB.
For an M-class (denoted as $\mathcal{C} = \{c_1, c_2, \cdots, c_M\}$) classification problem with P features, the BRB consists of a collection of belief rules defined as follows:

$$R^j: \text{If } x_1 \text{ is } A_1^j \text{ and } x_2 \text{ is } A_2^j \text{ and } \cdots \text{ and } x_P \text{ is } A_P^j, \text{ then class is } C^j = \{(c_1, \beta_1^j), \cdots, (c_M, \beta_M^j)\}, \text{ with rule weight } \theta^j,$$

where $x_1, x_2, \cdots, x_P$ represent the antecedent features and $A^j = (A_1^j, A_2^j, \cdots, A_P^j)$ is the antecedent part of the belief rule $R^j$, with each $A_p^j$ belonging to the fuzzy partitions $\{A_{p,1}, A_{p,2}, \cdots, A_{p,n_p}\}$ associated with the p-th feature, $p = 1, \cdots, P$. $\beta_k^j$ is the belief degree that the input data $x = (x_1, x_2, \cdots, x_P)$ belongs to $c_k$, $k = 1, \cdots, M$. In the belief structure, the consequence may be incomplete, i.e., $\sum_{k=1}^{M} \beta_k^j \leq 1$, and the remaining belief $1 - \sum_{k=1}^{M} \beta_k^j$ denotes the degree of global ignorance about the consequence. The rule weight $\theta^j$, with $0 \leq \theta^j \leq 1$, characterizes the certainty grade of the belief rule $R^j$.
The BRB can be learned from training data or derived from expert knowledge [24]. In [18], a heuristic BRB learning method was developed based on fuzzy-grid partitions of the feature space. To generate the BRB, this method uses the following steps:

Step 1: Partition of the feature space. A fuzzy-grid-based method is used to divide the P-dimensional feature space into $\prod_{p=1}^{P} n_p$ fuzzy regions, with $n_p$ being the number of partitions for the p-th feature.
Step 2: Generation of the consequent class for each fuzzy region. Each training pattern is assigned to the fuzzy region with the greatest matching degree, and those patterns assigned to the same fuzzy region are fused to get the consequent class.
Step 3: Generation of the rule weights. The rule weights are determined jointly by two measures called confidence and support.

Once the BRB is generated, the BRM is used to classify a query pattern by combining the consequent parts of all the activated belief rules (refer to [18] for details of this reasoning method).

Evidential C-Means (ECM)
In [21], the ECM algorithm was proposed to derive credal partitions from object data. In this algorithm, the class membership of an object $x_i$ is represented by a mass function $m_i$ defined on the power set of a given frame of discernment $\Omega = \{\omega_1, \omega_2, \cdots, \omega_C\}$. The credal partitions of N observed data $\{x_1, x_2, \cdots, x_N\} \subset \mathbb{R}^P$ are then defined as the N-tuple $M = (m_1, m_2, \cdots, m_N)$. It can be seen as a general model of partitioning, where:

• when each $m_i$ is a certain mass function, $M$ defines the conventional, crisp partitions of the set of objects;
• when each $m_i$ is a Bayesian mass function, $M$ specifies the fuzzy partitions, as defined by Bezdek [22].

The masses $m_{ij} = m_i(A_j)$, for $A_j \subseteq \Omega$, $A_j \neq \emptyset$, are determined in such a way that the mass of belief $m_{ij}$ is low (high) when the distance $d_{ij}$ between object $x_i$ and set $A_j$ is high (low). The distance between object $x_i$ and set $A_j$ is calculated by

$$d_{ij}^2 = \| x_i - \bar{v}_j \|^2,$$

where $\bar{v}_j$ is the barycenter of the centers associated with the classes composing $A_j$. Denoting by $v_k$ the center of the single cluster $\omega_k$, the barycenter $\bar{v}_j$ is calculated as

$$\bar{v}_j = \frac{1}{|A_j|} \sum_{k=1}^{C} s_{kj} v_k, \quad \text{with } s_{kj} = \begin{cases} 1, & \text{if } \omega_k \in A_j, \\ 0, & \text{otherwise}. \end{cases}$$

Finally, the objective function used to derive the credal partition matrix $M$ of size $2^C \times N$ and the cluster center matrix $V$ of size $C \times P$ is given by

$$J_{ECM}(M, V) = \sum_{i=1}^{N} \sum_{\{j : A_j \neq \emptyset\}} |A_j|^{\alpha} m_{ij}^{\beta} d_{ij}^2 + \sum_{i=1}^{N} \delta^2 m_{i\emptyset}^{\beta},$$

subject to

$$\sum_{\{j : A_j \subseteq \Omega, A_j \neq \emptyset\}} m_{ij} + m_{i\emptyset} = 1, \quad i = 1, \cdots, N,$$

where $\beta > 1$ is a weighting exponent to control the fuzziness of the partition, $\alpha \geq 0$ is a weighting exponent to control the degree of penalization for the subsets in $\Omega$ of high cardinality, $\delta > 0$ is a distance to control the amount of data considered to be outliers, and $m_{i\emptyset}$ denotes $m_i(\emptyset)$, the mass that the class of object $x_i$ does not lie in $\Omega$. This objective function is minimized using an iterative algorithm, which alternately optimizes the credal partition matrix $M$ and the cluster center matrix $V$.
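To make the roles of the barycenters and of the parameters α, β and δ concrete, the sketch below evaluates the ECM objective for a given credal partition and set of cluster centers. It only evaluates the criterion and does not implement the alternating optimization of [21]. The data layout (lists of lists, focal sets enumerated by increasing cardinality) and all function names are assumptions made for this illustration.

```python
from itertools import combinations


def focal_sets(num_clusters):
    """All non-empty subsets of {0, ..., C-1}, as tuples of cluster indices."""
    return [s for r in range(1, num_clusters + 1)
            for s in combinations(range(num_clusters), r)]


def barycenter(subset, centers):
    """Mean of the centers of the clusters that compose `subset`."""
    dim = len(centers[0])
    return [sum(centers[k][p] for k in subset) / len(subset) for p in range(dim)]


def squared_distance(x, v):
    return sum((a - b) ** 2 for a, b in zip(x, v))


def ecm_objective(data, masses, empty_masses, centers,
                  alpha=1.0, beta=2.0, delta=10.0):
    """Evaluate the ECM criterion for fixed masses and centers.

    masses[i][j] is the mass of object i on the j-th non-empty focal set
    (in the order returned by focal_sets); empty_masses[i] is m_i(empty set).
    """
    sets = focal_sets(len(centers))
    total = 0.0
    for x, m_i, m_empty in zip(data, masses, empty_masses):
        for subset, m_ij in zip(sets, m_i):
            d2 = squared_distance(x, barycenter(subset, centers))
            # Cardinality penalty |A_j|^alpha times m_ij^beta times d_ij^2
            total += len(subset) ** alpha * m_ij ** beta * d2
        # Outlier term: delta^2 times m_i(empty set)^beta
        total += delta ** 2 * m_empty ** beta
    return total
```

For two clusters with centers (0, 0) and (2, 0), an object located at (1, 0) and fully assigned to the pair of both clusters contributes nothing to the criterion, since it coincides with the barycenter of that pair.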

Compact BRB Learning with ECM
As reviewed in Section 2.2, in the traditional BRB learning method, belief rules are defined based on fuzzy-grid partitions of the feature space and on the individual training instances. This may lead to a large rule base for data sets with large numbers of instances and features, which degrades the interpretability of the classification model. In this section, we propose to learn a compact BRB based on partitions of the training set obtained with clustering techniques. The ECM algorithm is used here so that the BRBCS can exploit the additional degrees of freedom and information provided by the derived credal partitions. The flow diagram of the proposed compact BRB learning with ECM is shown in Figure 1. First, in Section 3.1, the ECM algorithm operates in a supervised way by means of weighted product-space clustering, with the goal of obtaining credal partitions with both good inter-cluster separability and inner-cluster pureness. Then, Section 3.2 shows how to construct belief rules based on credal partitions of the training set. Finally, a two-objective optimization procedure is designed in Section 3.3 to get a compact BRB with a better trade-off between accuracy and interpretability.

Credal Partition with Supervised ECM
In typical classification problems, a set of N labeled patterns $\mathcal{T} = \{(x_1, c^{(1)}), (x_2, c^{(2)}), \cdots, (x_N, c^{(N)})\}$ with input vectors $x_i \in \mathbb{R}^P$ and class labels $c^{(i)} \in \{c_1, c_2, \cdots, c_M\}$ is available, and the problem is to classify a query pattern $y$ based on the training set $\mathcal{T}$. In contrast to unsupervised clustering problems, which only consider the inter-cluster separability, a good partition of labeled patterns should also take into account the inner-cluster pureness. For this purpose, we cluster the N labeled patterns in a weighted product space of the feature values and the class values, where a weight $W \geq 0$ controls the influence of the class labels in the clustering process. If $W = 0$, the method reduces to unsupervised clustering, and as $W \to \infty$, the resulting clusters are the same as those obtained by dividing the training set directly according to the class labels. The suggested choice of $W$ for balancing the effects of feature values and class values depends on $\sigma_p^2$, the variance of the p-th feature values, $p = 1, 2, \cdots, P$, and on $\sigma_c^2$, the variance of the class values. With given weight $W$ and number of clusters $C$, the ECM clustering algorithm operates in the above supervised way to discover credal partitions of the training set $\mathcal{T}$ in the weighted product space. Two practical issues concerning the credal partitions are further considered as follows.

• Limiting the number of credal partitions. By minimizing the objective function displayed as Equation (4), a maximum number of $2^C$ credal partitions can be obtained. However, those credal partitions composed of many classes are quite difficult to interpret and are usually also less important in practice. Therefore, in order to learn a compact BRB, we constrain the focal sets to be either $\Omega$, or to be composed of at most two classes, thereby reducing the maximum number of credal partitions from $2^C$ to $F(C) = (C^2 + C)/2 + 2$.
• Discarding the outlier cluster. In ECM, the training patterns assigned to the empty set are considered to be outliers, which are adverse to classification. Thus, we only construct belief rules based on the remaining $F(C) - 1$ credal partitions associated with non-empty focal sets.
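The count F(C) of restricted focal sets can be checked by direct enumeration. The short sketch below lists the empty set, the singletons, the pairs and the whole frame; the function name is hypothetical, and C ≥ 3 is assumed so that the pairs differ from the whole frame.

```python
from itertools import combinations


def restricted_focal_sets(num_clusters):
    """Empty set, singletons, pairs, and the whole frame (assumes C >= 3)."""
    clusters = range(num_clusters)
    sets = [frozenset()]                                            # outlier cluster
    sets += [frozenset(s) for s in combinations(clusters, 1)]       # C singletons
    sets += [frozenset(s) for s in combinations(clusters, 2)]       # C(C-1)/2 pairs
    sets.append(frozenset(clusters))                                # the frame Omega
    return sets


C = 4
all_sets = restricted_focal_sets(C)
num_rules = len(all_sets) - 1  # the empty set is discarded
```

With C = 4, the enumeration yields 12 focal sets instead of 2^4 = 16, and 11 of them are non-empty, which is the number of belief rules obtained for four clusters.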

Belief Rule Base Construction
As shown in Section 2.2, each belief rule is composed of three components, namely the antecedent part, the consequent class, and the rule weight. In the following part, we will show how to construct belief rules from these three aspects based on credal partitions of the training set obtained previously.

Antecedent Parts Generation
From the obtained credal partition matrix $M$, whose ij-th element $m_{ij} \in [0, 1]$ is the membership degree of the data $x_i$ in partition j, it is possible to extract the fuzzy sets in the antecedent parts of the belief rules. One-dimensional antecedent fuzzy sets $A_p^j$ are obtained from the multidimensional credal partition $M$ by point-wise projection [25] onto the space of the antecedent features $x_p$, $p = 1, 2, \cdots, P$:

$$\mu_{A_p^j}(x_p) = \exp\left(-\frac{(x_p - v_{jp})^2}{2\sigma_{jp}^2}\right),$$

where $v_{jp}$ is the mean value calculated as Equation (3), and $\sigma_{jp}$ is the standard deviation to be estimated.
In this way, for each credal partition j, $j = 1, 2, \cdots, F(C) - 1$, a series of fuzzy sets $A_1^j, A_2^j, \cdots, A_P^j$ can be defined on the antecedent features with Gaussian membership functions, which finally constitute the antecedent part of belief rule $R^j$.

Consequent Classes Generation
Based on the credal partition matrix $M$, the training set $\mathcal{T}$ can be divided into $F(C) - 1$ groups by assigning each pattern to the partition with the highest mass:

$$T^j = \{x_i \in \mathcal{T} : m_{ij} = \max_{l = 1, \cdots, F(C)-1} m_{il}\}.$$

The training subsets $T^j$ for $j = 1, 2, \cdots, F(C) - 1$ define a hard credal partition [21] of the training set $\mathcal{T}$. In the following, we will derive the consequent class of belief rule $R^j$ by combining the class information of the patterns in subset $T^j$, $j = 1, 2, \cdots, F(C) - 1$.
First, for any pattern $x_i \in T^j$, we calculate the matching degree with the antecedent part of belief rule $R^j$ using the geometric mean operator as

$$\mu_{A^j}(x_i) = \left(\prod_{p=1}^{P} \mu_{A_p^j}(x_{ip})\right)^{1/P},$$

where $\mu_{A_p^j}$ is the membership function of the fuzzy set $A_p^j$ defined in Equation (9). Then, assume the class label of pattern $x_i$ is $c_k$, which takes its value in the class set $\mathcal{C}$. This can be regarded as a piece of evidence that supports the consequent class being $c_k$. However, this piece of evidence does not provide full certainty. In belief function theory, this can be expressed by saying that only some part of the belief (measured by the matching degree $\mu_{A^j}(x_i)$) is committed to $c_k$. Because $\text{Class}(x_i) = c_k$ does not point to any other particular class, the rest of the belief should be assigned to the frame of discernment $\mathcal{C}$, representing global ignorance. Therefore, this item of evidence can be represented by a mass function $m^j(\cdot \mid x_i)$ verifying:

$$m^j(\{c_k\} \mid x_i) = \mu_{A^j}(x_i), \quad m^j(\mathcal{C} \mid x_i) = 1 - \mu_{A^j}(x_i), \quad m^j(A \mid x_i) = 0, \ \forall A \in 2^{\mathcal{C}} \setminus \{\mathcal{C}, \{c_k\}\}.$$

Finally, the mass functions derived from all the patterns in $T^j$ are combined to obtain the consequent class of belief rule $R^j$. As the items of evidence from different labeled patterns are collected independently, Dempster's rule of combination is used in this work to synthesize the final consequent class membership as

$$m^j = \bigoplus_{x_i \in T^j} m^j(\cdot \mid x_i).$$

Noting that all the pieces of evidence have only one focal set besides the global set $\mathcal{C}$, the computation of Dempster's rule is quite efficient. The belief degrees of the consequent class of rule $R^j$ are then obtained as $\beta_k^j = m^j(\{c_k\})$, $k = 1, 2, \cdots, M$.
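The consequent generation can be sketched as follows. Because every piece of evidence has a single class and the whole class set as focal sets, Dempster's rule admits a simple closed form, which the sketch uses instead of pairwise combination. The Gaussian membership form with a factor of 2 in the denominator, the function names, and the input layout are assumptions made for this illustration.

```python
import math


def gaussian_membership(x, center, sigma):
    """Gaussian membership degree (assumed form: exp(-(x - v)^2 / (2 sigma^2)))."""
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))


def matching_degree(x, centers, sigmas):
    """Geometric mean of the P one-dimensional membership degrees."""
    product = 1.0
    for x_p, v_p, s_p in zip(x, centers, sigmas):
        product *= gaussian_membership(x_p, v_p, s_p)
    return product ** (1.0 / len(x))


def consequent_beliefs(degrees, labels, classes):
    """Closed-form Dempster combination of the simple mass functions.

    Pattern i commits mass degrees[i] to {labels[i]} and the rest to the
    whole class set. Returns (belief degree per class, mass on the class set).
    """
    # q[c]: product of (1 - mu) over the patterns labeled c
    q = {c: 1.0 for c in classes}
    for mu, label in zip(degrees, labels):
        q[label] *= 1.0 - mu
    ignorance = math.prod(q.values())
    unnormalized = {}
    for c in classes:
        others = math.prod(q[other] for other in classes if other != c)
        unnormalized[c] = (1.0 - q[c]) * others
    total = sum(unnormalized.values()) + ignorance
    if total == 0.0:
        raise ValueError("total conflict among the pieces of evidence")
    beliefs = {c: v / total for c, v in unnormalized.items()}
    return beliefs, ignorance / total


# Two patterns of different classes with matching degrees 0.6 and 0.5
beliefs, ignorance = consequent_beliefs([0.6, 0.5], ["a", "b"], ["a", "b"])
```

The returned belief degrees play the role of the β values of the rule, and the returned ignorance is the mass left on the whole class set after normalization.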

Rule Weights Generation
As in [18], the rule weights can be derived based on two concepts called confidence and support, which are often used for evaluating association rules in data mining. The confidence is a measure of the validity of a rule. For a belief rule, it is defined in terms of the average conflict factor $0 \leq K^j \leq 1$, which measures the conflict among the pieces of evidence used for building the consequent class of rule $R^j$, with $|T^j|$ denoting the number of training patterns in the j-th hard credal partition. On the other hand, the support indicates the grade of coverage by a rule, which is defined as the ratio of the number of covered patterns to the total number of patterns:

$$\text{support}(R^j) = \frac{|T^j|}{N}.$$

Based on the above two measures, the rule weights are finally derived as in [18].

Parameter Optimization for Trade-Off between Accuracy and Interpretability
In the above BRB learning process, the number of clusters C plays a key role in determining the accuracy and the interpretability of the learned classification model. Many clusters means many rules, which usually leads to high classification accuracy, but degrades the model's interpretability. Therefore, we need to search for an optimal number of clusters to get the desired trade-off between accuracy and interpretability.
On the one hand, to obtain a model with high accuracy, the following leave-one-out test (mean squared) error MSE should be minimized:

$$MSE = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} \left(P^{(i)}(\{\omega_j\}) - t_j^{(i)}\right)^2,$$

where $P^{(i)}(\{\omega_j\})$, $j = 1, \cdots, M$, are the outputs of the belief reasoning method for training pattern $x^{(i)}$, and $t_j^{(i)}$, $j = 1, \cdots, M$, are binary indicator variables defined by $t_j^{(i)} = 1$ if the class label of $x^{(i)}$ is $\omega_j$, and $t_j^{(i)} = 0$ otherwise. On the other hand, to obtain a model with high interpretability, the number of rules or, equivalently, the number of clusters should be minimized. However, the number of clusters should not be so small that cluster validity is lost. To assess the quality of fuzzy partitions, a great number of validity indexes have been proposed in the literature [26-29]. One of the representatives is the fuzzy partition entropy FPE [30] defined by

$$FPE = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} \mu_{ij} \log_2 \mu_{ij},$$

where $\mu_{ij}$ is the membership degree of the i-th pattern in the j-th cluster. The optimal number of clusters $C$ is obtained by minimizing FPE with respect to $C = 2, 3, \cdots, C_{max}$. The above fuzzy entropy-based validity index has inspired us to use similar definitions of entropy in the belief function framework to assess the quality of evidential partitions. The definition of entropy in the belief function framework has been an active research subject in the past few years [31-33]. A representative one is the aggregated entropy AE introduced by Pal et al. [34], which satisfies natural requirements and has interesting properties. It is defined for a normal mass function $m$ as

$$AE(m) = \sum_{A \in \mathcal{F}(m)} m(A) \log_2 \frac{|A|}{m(A)},$$

where $\mathcal{F}(m)$ denotes the set of focal sets of $m$. This entropy measure can be further decomposed as the sum of two terms:

$$AE(m) = \sum_{A \in \mathcal{F}(m)} m(A) \log_2 |A| - \sum_{A \in \mathcal{F}(m)} m(A) \log_2 m(A).$$

The first term is the nonspecificity measure, which reflects the degree of imprecision of $m$, whereas the second term reflects the inconsistency in $m$ and can be seen as a measure of conflict. Therefore, $AE(m)$ tends to be small when the mass is assigned to few focal sets with small cardinality.
Though the above AE measure was defined for normal mass functions (i.e., $m$ with $m(\emptyset) = 0$), it can be easily extended to the subnormal mass functions considered in this paper by defining the cardinality of the empty set as $C$ (This extension is justified by the fact that the mass given to the empty set corresponds to a situation of maximal uncertainty, just like the mass given to $\Omega$ [35].). The evidential partition entropy EPE is then defined as the average AE:

$$EPE = \frac{1}{N} \sum_{i=1}^{N} AE(m_i) = \frac{1}{N} \sum_{i=1}^{N} \sum_{A \in \mathcal{F}(m_i)} m_i(A) \log_2 \frac{|A|}{m_i(A)}.$$

When all the patterns are assigned to singleton sets $\emptyset, \omega_1, \omega_2, \cdots, \omega_C$, EPE gets the lower bound value 0. The maximum value of EPE is attained for $m_i$ such that $m_i(A) \propto |A|$, for all $A \in \mathcal{F}(m_i)$. It should also be noted that the fuzzy partition entropy FPE defined in Equation (19) is a special case of the evidential partition entropy EPE when each $m_i$ is a Bayesian mass function.
Finally, a single scalar objective function for the number of clusters $C$ is defined based on the above two objectives MSE and EPE as

$$J(C) = \lambda \, MSE(C) + (1 - \lambda) \, EPE(C),$$

where $\lambda \in [0, 1]$ is the weight characterizing the user's preference for classification accuracy. When $\lambda = 1$, the classification accuracy is the only objective, whereas when $\lambda = 0$, only the cluster validity is guaranteed. With a given weight $\lambda$, minimizing the above objective function yields an optimal number of clusters $C$ with a better trade-off between accuracy and interpretability.
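The entropy-based part of this procedure can be sketched in a few lines of Python. The sketch computes the aggregated entropy with base-2 logarithms, treats the cardinality of the empty set as C, averages it into the EPE, and picks the number of clusters minimizing a convex combination of MSE and EPE. The convex-combination form of the objective is an assumption consistent with the limiting cases λ = 1 and λ = 0 described above; the function names and the dictionary-based inputs are also hypothetical.

```python
import math


def aggregated_entropy(mass_function, num_clusters):
    """AE of a mass function given as {frozenset: mass}.

    The cardinality of the empty set is taken as the number of clusters C.
    """
    total = 0.0
    for focal_set, mass in mass_function.items():
        if mass > 0.0:
            cardinality = len(focal_set) if focal_set else num_clusters
            total += mass * math.log2(cardinality / mass)
    return total


def evidential_partition_entropy(credal_partition, num_clusters):
    """Average AE over the mass functions of all patterns."""
    entropies = [aggregated_entropy(m, num_clusters) for m in credal_partition]
    return sum(entropies) / len(entropies)


def objective(mse, epe, lam):
    """Assumed scalarization: lam * MSE + (1 - lam) * EPE."""
    return lam * mse + (1.0 - lam) * epe


def select_num_clusters(mse_by_c, epe_by_c, lam):
    """Number of clusters minimizing the scalar objective.

    mse_by_c and epe_by_c map each candidate C to its MSE and EPE value.
    """
    return min(mse_by_c, key=lambda c: objective(mse_by_c[c], epe_by_c[c], lam))
```

For instance, a mass function fully committed to one singleton has zero aggregated entropy, whereas a vacuous mass function on a frame of four clusters has an aggregated entropy of 2 bits.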

Experiments
The performance of the proposed CBRBCS was assessed by two different types of experiments. In the first experiment, a synthetic data set was used to show the behavior of the proposal in controlled settings. In the second one, 20 real data sets from the UCI Machine Learning Repository [23] were considered, with the aim to show that the proposed technique is adequate for a variety of real tasks.

Synthetic Data Set Test
A two-dimensional four-class synthetic data set was designed to illustrate the interest of the compact BRB learning method in CBRBCS. Normal class-conditional distributions were assumed for the four classes. A set of 400 samples was generated from these distributions using equal prior probabilities. This data set is displayed in Figure 2. The proposed method was used to learn BRBs from this data set. Figure 3 shows the objective values $J(C)$ of the learned BRBs under different numbers of clusters ($C = 2, 3, 4, 5, 6$). When the accuracy weight $\lambda = 1$, the objective function $J(C)$ reduces to the MSE measure. It can be seen that as the number of clusters (or, equivalently, the number of rules) increases, the MSE decreases gradually. By minimizing the MSE, a large BRB is obtained with high classification accuracy. By contrast, when the accuracy weight $\lambda = 0$, the EPE measure is recovered. We see that the EPE reaches its minimal value when the number of clusters equals 3, and then increases with the number of clusters. Thus, by minimizing the EPE, we get a small BRB with higher model interpretability, but relatively lower classification accuracy. Finally, when the accuracy weight satisfies $0 < \lambda < 1$, the objective value $J(C)$ provides a trade-off between the MSE and the EPE. Note that the three considered weights ($\lambda = 0.2$, $0.5$, and $0.8$) give the same optimal number of clusters, $C = 4$, in which case a total of $(C^2 + C)/2 + 1 = 11$ belief rules are learned, with a classification accuracy of 83.55%.

Real Data Set Test
In this experiment, 20 representative real data sets from the UCI Machine Learning Repository were selected to evaluate the performance of the proposed CBRBCS. The main characteristics of the 20 data sets are summarized in Table 2. It can be seen that the selected data sets vary greatly in the number of instances (from 80 to 12,690), the number of features (from 4 to 60), and the number of classes (from 2 to 11).

Table 2. Statistics of the real data sets used in the experiment.

Data Set        # Instances   # Features   # Classes
Australian      690           14           2
Balance         625           4            3
Car             1278          6            4
Contraceptive   1473          9            3
Dermatology     358           34           6
Ecoli           336           7            8
Glass           214           9            6
Hepatitis       80            19           2
Ionosphere      351           33           2
Iris            150           4            3
Lymphography    148           18           4
Nursery         12,690        8            5
Page-blocks     5472          10           5
Sonar           208           60           2
Thyroid         7200          21           3
Vehicle         846           18           4
Vowel           990           13           11
Wine            178           13           3
Yeast           1484          8            10
Zoo             101           16           7

To carry out the experiment, we use the B-Fold Cross-Validation (B-CV) model. Each data set is divided into B blocks, with B − 1 blocks used as the training set and the remaining block as the test set, so that each block is used exactly once as a test set. We use the 10-CV model here, i.e., ten random partitions of the original data set, with nine of them (90%) as the training set and the remainder (10%) as the test set. For each data set, we report the average results over the ten partitions.
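The B-fold cross-validation protocol described in this subsection can be sketched as follows. This is a generic illustration using only the Python standard library; the function name, the fixed random seed and the round-robin assignment of shuffled indices to blocks are assumptions of this example, not details taken from the experiments.

```python
import random


def b_fold_splits(num_samples, num_folds=10, seed=0):
    """Yield (train_indices, test_indices) pairs for B-fold cross-validation."""
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into B blocks of (almost) equal size
    folds = [indices[b::num_folds] for b in range(num_folds)]
    for b in range(num_folds):
        test = folds[b]
        train = [i for j, fold in enumerate(folds) if j != b for i in fold]
        yield train, test


# Example: 10-CV on a data set of 150 instances (the size of Iris in Table 2)
splits = list(b_fold_splits(150, num_folds=10))
```

Each instance appears in exactly one test block, and the reported accuracy of a classifier is the average of the accuracies obtained on the ten test blocks.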
The performance of the proposed classifier is compared with the traditional BRBCS [18] as well as several other representative classifiers, including K-NN (instance-based classifier) [2], C4.5 (decision tree-based classifier) [3], SVM (statistical classifier) [4], and FRBCS (rule-based classifier) [5]. The settings of these comparison methods are summarized in Table 3. Table 4 shows the classification accuracy rates of the different methods on the real data sets. The numbers in brackets represent the ranks of the classification accuracy for each method, and the last row shows the average ranks of all the methods over the 20 data sets. It can be seen that the performance of the two belief rule-based classifiers, i.e., BRBCS and CBRBCS, is comparable with that of the classical methods. To compare the classification results statistically, we carry out nonparametric tests [36,37] for multiple comparisons based on the average accuracy ranks obtained over the considered data sets. First, we use the Iman-Davenport test to determine whether significant differences exist among the different methods. The Iman-Davenport statistic (distributed according to the F-distribution with k − 1 = 5 and (k − 1)(N − 1) = 95 degrees of freedom, where k is the number of compared methods and N is the number of data sets) is 4.99 for the average ranks, and the corresponding critical value is 2.29 for a significance level of α = 0.05. Given that the Iman-Davenport statistic is clearly greater than the critical value, the test rejects the null hypothesis, and therefore there are significant differences among the accuracy results of the considered methods. Then, we apply the post hoc Bonferroni-Dunn test to compare the control method (i.e., the proposed CBRBCS) with the remaining ones. Figure 4 shows the test result for the average accuracy ranks with a significance level of α = 0.05, in which case the calculated critical difference is 1.52. The critical difference value is represented as a thicker horizontal line, and the methods whose values exceed this line have significantly different results from the control method. It can be seen that the proposed CBRBCS performs significantly better than the FRBCS, and obtains classification accuracy similar to that of the traditional BRBCS. Compared with the non-rule-based classifiers SVM, C4.5 and K-NN, although the classification accuracy differences are not significant, the proposed CBRBCS is preferable as it provides a more interpretable classification model with comparable accuracy.
To evaluate the interpretability of the classification models, Table 5 displays the numbers of generated rules for the two belief rule-based methods, i.e., BRBCS and CBRBCS. It can be seen that for all the evaluated data sets, a much smaller number of rules is generated by the proposed CBRBCS. To show the rule reduction performance more clearly, we also provide the rule reduction rate (defined as (#Rule BRBCS − #Rule CBRBCS)/#Rule BRBCS) in the last column. We can notice that for those data sets with large numbers of training instances and features, but a small number of classes (like Australian, Car, Contraceptive, Ionosphere, Nursery, Sonar, Thyroid, Vehicle), the proposed CBRBCS achieves a more significant rule reduction (with a rule reduction rate >90%). The reason is that in the traditional BRBCS, the rules are generated with the fuzzy-grid method, in which case the number of generated rules is positively correlated with both the number of training instances and the number of features. However, the number of rules generated by the clustering-based learning method used in CBRBCS is only determined by the underlying structure of the data, which is closely related to the number of classes. Therefore, compared with the traditional BRBCS, the proposed CBRBCS obtains a better trade-off between accuracy and interpretability (similar classification accuracy is obtained with a much smaller number of rules).
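For readers who wish to reproduce the statistical comparison, the sketch below computes the Friedman and Iman-Davenport statistics from average ranks and the Bonferroni-Dunn critical difference. It follows the usual textbook formulas for these tests, which are assumed (not confirmed by the text) to be the ones used here; the function names are hypothetical, and the average ranks themselves must be taken from Table 4.

```python
import math
from statistics import NormalDist


def friedman_chi2(avg_ranks, num_datasets):
    """Friedman statistic computed from the average ranks of k methods."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    return 12.0 * num_datasets / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4.0)


def iman_davenport(avg_ranks, num_datasets):
    """Iman-Davenport statistic, F-distributed with (k-1, (k-1)(N-1)) d.o.f."""
    k = len(avg_ranks)
    chi2 = friedman_chi2(avg_ranks, num_datasets)
    return (num_datasets - 1) * chi2 / (num_datasets * (k - 1) - chi2)


def bonferroni_dunn_cd(num_methods, num_datasets, alpha=0.05):
    """Critical difference of average ranks for the Bonferroni-Dunn test."""
    # Two-sided normal quantile with Bonferroni correction over k - 1 comparisons
    q_alpha = NormalDist().inv_cdf(1.0 - alpha / (2.0 * (num_methods - 1)))
    return q_alpha * math.sqrt(num_methods * (num_methods + 1) / (6.0 * num_datasets))


# Six methods compared over 20 data sets, as in the experiment above
cd = bonferroni_dunn_cd(6, 20, alpha=0.05)
```

With k = 6 methods, N = 20 data sets and α = 0.05, the computed critical difference is approximately 1.52, in agreement with the value reported above.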

Conclusions
In this paper, a compact belief rule-based classification system with ECM clustering has been proposed to overcome the limitations of the traditional BRBCS on large data sets. Instead of defining belief rules for individual training patterns, belief rules are constructed based on credal partitions of the training set. The two-objective optimization procedure, based on both the mean squared error and the evidential partition entropy, can successfully find an optimal number of clusters. This method can discover the underlying data structure, which can then be translated into belief rules. From the results reported in the last section, we can conclude that the proposed technique obtains a better trade-off between accuracy and interpretability than the traditional one. Furthermore, compared with other non-rule-based classifiers, the proposed technique obtains competitive classification accuracy. Therefore, this technique is a better choice for those classification problems where both high accuracy and interpretability are needed.