Using the Choquet Integral in the Fuzzy Reasoning Method of Fuzzy Rule-Based Classification Systems

In this paper we present a new fuzzy reasoning method in which the Choquet integral is used as the aggregation function. In this manner, we can take into account the interaction among the rules of the system. To this end, we consider several fuzzy measures, since the choice of fuzzy measure is key to the success of the Choquet integral, and we first apply the new method with the same fuzzy measure for all the classes. However, the relationship among the set of rules of each class can be different, and therefore the best fuzzy measure can change depending on the class. Consequently, we propose a learning method based on a genetic algorithm that computes the most suitable fuzzy measure for each class. The results show that our proposal enhances the performance of the classical fuzzy reasoning methods of the winning rule and additive combination whenever the fuzzy measure is appropriate for the tackled problem.


Introduction
A classification problem [1,2] consists of assigning objects to predefined groups or classes based on the observed variables related to the objects. To do so, a learning algorithm uses the available information to build a decision function that determines the class to which the objects belong.
Fuzzy Rule-Based Classification Systems (FRBCSs) [3], aside from their good performance, provide a model close to the one used by humans, since it is composed of a set of rules formed of linguistic labels. For these reasons, they are widely used to deal with real-world problems [4]. The two main components of FRBCSs are the knowledge base, where the information about the problem is stored, and the Fuzzy Reasoning Method (FRM).
The FRM is an inference procedure that uses the information in the knowledge base to determine the class in which new examples are classified. To do so, the local information is computed first, that is, the compatibility between the example and each fuzzy rule in the system. Then, this local information is aggregated to generate global information that is associated with each class of the problem and, finally, the example is classified in the class having the maximum global information.
The FRM of the winning rule is traditionally used in the specialized literature [5][6][7][8][9]. It uses the maximum as aggregation function [10,11] to obtain the global information. For each class, this FRM only considers the information given by the single fuzzy rule having the greatest compatibility with the example, and consequently it ignores the available information given by the remaining fuzzy rules of the system.
In this paper, we propose a new FRM that takes into account the information given by several or even all the fuzzy rules in the system. To do so, we consider the use of the Choquet integral [12,13] as the aggregation operator in the FRM. The Choquet integral is related to a fuzzy measure [12,14], which models the interaction among the elements to be aggregated (the information given by the rules of the system in this case). Therefore, a key point is the choice of an appropriate fuzzy measure for each problem we want to deal with. To perform this choice, we propose to use a genetic algorithm [15] in order to learn the most suitable fuzzy measure for each class to carry out the aggregation stage.
In order to study the usefulness of the new proposal, we apply the well-known algorithm of Chi et al. [8] to accomplish the fuzzy rule learning process. We compare the performance of the classic FRMs of the winning rule and additive combination with both the ones obtained when using the Choquet integral related to several fuzzy measures and the Choquet integral when the fuzzy measure is genetically learned. The behaviour of the approaches is tested on seventeen numerical data-sets selected from the KEEL data-set repository [16,17], and in order to support our conclusions, we use some non-parametric statistical tests [18,19].
This paper is arranged as follows. In Section 2 we recall some preliminary concepts that are necessary to understand the paper. The new proposal is described in detail in Section 3, including the new FRM, the fuzzy measures considered in the paper and the method to genetically learn the fuzzy measure. Next, the experimental set-up and the corresponding analysis of the results are presented in Sections 4 and 5 respectively. Finally, the main conclusions are drawn in Section 6.

Preliminaries
This section introduces the background necessary to understand the new proposal. First we recall some theoretical concepts, next we introduce basic concepts about FRBCSs and finally we describe the evolutionary model considered in this paper.

Theoretical Concepts
In this paper we use fuzzy sets to model the linguistic labels composing the antecedents of the rules.
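Definition 1 [20] A fuzzy set $F$ defined on a finite and non-empty universe $U = \{u_1, \ldots, u_n\}$ is given by
$$F = \{(u_i, \mu_F(u_i)) \mid u_i \in U\},$$
where $\mu_F : U \to [0, 1]$ is the membership function.
The conjunction among the antecedents of the rules is modelled by means of t-norms.
Definition 2 [10,11] A triangular norm (t-norm) $T : [0, 1]^2 \to [0, 1]$ is an associative, commutative, increasing function such that $T(1, x) = x$ for all $x \in [0, 1]$.
When several numerical values need to be combined into a single value, we use aggregation functions.
Finally, we recall the concept of fuzzy measure, which leads to the definition of the aggregation function known as the Choquet integral [12].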
In the context of aggregation functions, fuzzy measures are used to model the importance of a coalition, that is, the relationship among the elements to be aggregated.
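For reference, both notions can be written in their usual textbook form. The notation below ($m$, $x_{(i)}$, $A_{(i)}$) follows a common convention and is assumed here; it is not necessarily the exact notation or numbering used in the original definitions.

```latex
% Fuzzy measure on N = {1, ..., n}: a monotone set function with boundary conditions
m : 2^{N} \to [0,1], \qquad m(\emptyset) = 0, \qquad m(N) = 1, \qquad
A \subseteq B \;\Rightarrow\; m(A) \le m(B).

% Discrete Choquet integral of x = (x_1, ..., x_n) in [0,1]^n with respect to m
\mathfrak{C}_{m}(x) \;=\; \sum_{i=1}^{n} \bigl( x_{(i)} - x_{(i-1)} \bigr)\, m\bigl(A_{(i)}\bigr),
\qquad x_{(0)} = 0,

% where x_{(1)} \le \dots \le x_{(n)} is an increasing rearrangement of the inputs
% and A_{(i)} = \{(i), \dots, (n)\} collects the indices of the n - i + 1 largest inputs.
```

With the uniform measure $m(A) = |A|/n$ this expression reduces to the arithmetic mean, which is a quick sanity check of the formula.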

Fuzzy Rule-Based Classification Systems
FRBCSs are widely used in data mining, since they allow the inclusion of all the available information in system modelling, i.e., expert knowledge, empirical measures or mathematical models. They have the advantage of generating an interpretable model and therefore allowing the knowledge representation to be understandable for the users of the system. The two main components of FRBCSs are:
• Knowledge Base: It is composed of both the Rule Base (RB) and the Data Base, where the rules and the membership functions are stored respectively.
• Fuzzy Reasoning Method: It is the mechanism used to classify examples using the information stored in the knowledge base.
Any classification problem consists of $m$ training examples $x_p = (x_{p1}, \ldots, x_{pn}, y_p)$, $p = 1, 2, \ldots, m$, from $M$ classes, where $x_{pi}$ is the value of the $i$-th variable ($i = 1, 2, \ldots, n$) and $y_p$ is the class label of the $p$-th training example.
We use fuzzy rules of the following form:
Rule $R_j$: If $x_1$ is $A_{j1}$ and $\ldots$ and $x_n$ is $A_{jn}$ then Class = $C_j$ with $RW_j$
where $R_j$ is the label of the $j$-th rule, $x = (x_1, \ldots, x_n)$ is an $n$-dimensional example vector, $A_{ji}$ is an antecedent fuzzy set representing a linguistic term, $C_j$ is a class label, and $RW_j \in [0, 1]$ is the rule weight [21].
In the remainder of this subsection, the FRM applied to determine the classes of new examples and the fuzzy rule learning algorithm used to generate the RB are described in detail.

Fuzzy Reasoning Method
Let $x_p = (x_{p1}, \ldots, x_{pn})$ be a new example to be classified, $L$ the number of rules in the RB and $M$ the number of classes of the problem. The steps of the FRM [22] are the following:
1. Matching degree, that is, the strength of activation of the if-part of every rule in the RB with the example $x_p$. To compute it we use a t-norm.
2. Association degree, that is, the association degree of the example $x_p$ with the class of each rule in the RB.
3. Example classification soundness degree for all classes. We use an aggregation function that combines the positive association degrees calculated in the previous step.
4. Classification. We apply a decision function $F$ over the example classification soundness degrees of all classes. This function determines the class corresponding to the maximum value.
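The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the product t-norm is used for the matching degree, the association degree is taken as the matching degree multiplied by the rule weight, and the names `triangular` and `classify` as well as the rule representation are hypothetical. Passing `max` as `aggregate` gives the winning rule FRM and passing `sum` gives the additive combination.

```python
from math import prod


def triangular(x, a, b, c):
    """Membership of x in a triangular fuzzy set with support [a, c] and peak b."""
    if x == b:
        return 1.0
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)


def classify(example, rules, n_classes, aggregate=max):
    """rules: list of (antecedent, class_index, rule_weight),
    where antecedent is a list of (a, b, c) triangles, one per variable."""
    per_class = [[] for _ in range(n_classes)]
    for antecedent, cls, weight in rules:
        # Step 1: matching degree (product t-norm, an assumption).
        matching = prod(triangular(x, *abc) for x, abc in zip(example, antecedent))
        # Step 2: association degree (matching degree times rule weight, an assumption).
        association = matching * weight
        if association > 0:
            per_class[cls].append(association)
    # Step 3: soundness degree per class from the positive association degrees.
    soundness = [aggregate(degrees) if degrees else 0.0 for degrees in per_class]
    # Step 4: the class with the maximum soundness degree.
    label = max(range(n_classes), key=soundness.__getitem__)
    return label, soundness


rules = [([(0.0, 0.0, 1.0)], 0, 1.0), ([(0.0, 1.0, 1.0)], 1, 0.5)]
label, soundness = classify([0.25], rules, n_classes=2)
```

Keeping the aggregation function as a parameter makes it easy to swap in another operator for step 3 without touching the other steps.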

Chi et al. Rule Generation Algorithm
The fuzzy rule learning method of Chi et al. [8] is the extension of the Wang and Mendel algorithm [23] to solve classification problems. This method is one of the most used learning algorithms in the specialized literature due to the simplicity of the fuzzy rule generation method.
To generate the fuzzy RB, this FRBCS design method determines the relationship between the variables of the problem and establishes an association between the space of the features and the space of the classes by means of the following steps:
1. Establishment of the linguistic partitions. Once the domain of variation of each feature $A_i$ is determined, Ruspini's fuzzy partitions are computed, using triangular-shaped membership functions in this paper.
2. Generation of a fuzzy rule for each example $x_p = (x_{p1}, \ldots, x_{pn}, C_p)$, applying the following process:
2.1. Compute the matching degree $\mu(x_p)$ of the example with all the fuzzy regions using a conjunction operator (usually modelled with the minimum or product t-norm).
2.2. Assign the example $x_p$ to the fuzzy region with the greatest matching degree.
2.3. Generate a rule for the example, whose antecedent is determined by the selected fuzzy region and whose consequent is the class label of the example.
2.4. Compute the rule weight. In this paper, we use the Penalized Certainty Factor (PCF) defined in [24] as:
$$RW_j = PCF_j = \frac{\sum_{x_p \in \text{Class } C_j} \mu_{A_j}(x_p) - \sum_{x_p \notin \text{Class } C_j} \mu_{A_j}(x_p)}{\sum_{p=1}^{m} \mu_{A_j}(x_p)}$$
where $\mu_{A_j}(x_p)$ is the matching degree of the example $x_p$ with the antecedent of the rule that is being generated.
We must remark that rules with the same antecedent can be generated during the learning process.If they have the same class in the consequent we just remove one of the duplicated rules, but if they have a different class only the rule with the highest weight is kept in the RB.
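The rule generation process can be sketched as below. This is an illustrative reading of the steps rather than a reference implementation: it assumes uniform triangular partitions, the product t-norm, and the usual form of the PCF (matching degrees of the rule's class minus those of the other classes, divided by the total). All function names are hypothetical.

```python
from math import prod


def triangular_partition(lo, hi, n_labels):
    """Uniform Ruspini partition of [lo, hi] as (a, b, c) triangles."""
    step = (hi - lo) / (n_labels - 1)
    return [(lo + (k - 1) * step, lo + k * step, lo + (k + 1) * step)
            for k in range(n_labels)]


def membership(x, abc):
    a, b, c = abc
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)


def matching(xs, labels, partitions):
    """Matching degree of an example with a fuzzy region (product t-norm)."""
    return prod(membership(x, partitions[i][label])
                for i, (x, label) in enumerate(zip(xs, labels)))


def chi_rules(examples, classes, partitions):
    # Steps 2.1-2.3: with the product t-norm, the best region is obtained by
    # taking the best label of each variable independently.
    candidates = set()
    for xs, cls in zip(examples, classes):
        labels = tuple(max(range(len(p)), key=lambda k: membership(x, p[k]))
                       for x, p in zip(xs, partitions))
        candidates.add((labels, cls))  # the set drops exact duplicates
    # Step 2.4: PCF weight; for equal antecedents keep the highest weight.
    rule_base = {}
    for labels, cls in candidates:
        degrees = [matching(xs, labels, partitions) for xs in examples]
        total = sum(degrees)
        same = sum(d for d, c in zip(degrees, classes) if c == cls)
        weight = (same - (total - same)) / total
        if labels not in rule_base or weight > rule_base[labels][1]:
            rule_base[labels] = (cls, weight)
    return rule_base


partitions = [triangular_partition(0.0, 1.0, 3)]
rule_base = chi_rules([(0.0,), (0.1,), (1.0,)], [0, 0, 1], partitions)
```

In this toy run the two examples near 0 collapse into a single rule for the first label, and the example at 1 produces a rule for the third label.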

Evolutionary Model
In this paper, we consider the evolutionary model of CHC [25] to accomplish the learning of the fuzzy measure.CHC is a GA that presents a good trade-off between exploration and exploitation, being a good choice in problems with complex search spaces.
The CHC evolutionary model makes use of a population-based selection approach in order to perform a suitable global search: N parents and their corresponding offspring are combined to select the best N individuals to form the next population. The CHC approach uses an incest prevention mechanism and a restarting process to provoke diversity in the population, instead of the well-known mutation operator.
The incest prevention mechanism is only considered in order to apply the crossover operator. In our case, two parents are only crossed if half their Hamming distance is above a predetermined threshold, $Th$. Since we consider a real coding scheme, we have to transform each gene considering a Gray Code (binary code) with a fixed number of bits per gene ($BITSGENE$), which is determined by the system expert. In this way, the threshold value is initialized as:
$$Th = \frac{\#Genes \cdot BITSGENE}{4}$$
where $\#Genes$ stands for the total length of the chromosome. Following the original CHC scheme, $Th$ is decremented by one ($BITSGENE$ in our case) when there are no new individuals in the next generation. The algorithm restarts when $Th$ is below zero. The scheme of this model is depicted in Figure 1.
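The incest prevention check can be illustrated with a short sketch. The linear quantisation of each real gene before Gray coding, the initial threshold of one quarter of the chromosome length in bits (the usual CHC choice), and all function names are assumptions made for this example.

```python
def to_gray(value, lo, hi, bits):
    """Quantise a real gene in [lo, hi] to `bits` bits and return its Gray code."""
    level = round((value - lo) / (hi - lo) * (2 ** bits - 1))
    return level ^ (level >> 1)


def hamming(parent_a, parent_b, lo, hi, bits):
    """Number of differing bits between the Gray-coded chromosomes."""
    return sum(bin(to_gray(a, lo, hi, bits) ^ to_gray(b, lo, hi, bits)).count("1")
               for a, b in zip(parent_a, parent_b))


def may_cross(parent_a, parent_b, threshold, lo, hi, bits):
    """Incest prevention: cross only if half the Hamming distance exceeds the threshold."""
    return hamming(parent_a, parent_b, lo, hi, bits) / 2 > threshold


def initial_threshold(n_genes, bits):
    """Assumed initialisation: a quarter of the chromosome length in bits."""
    return n_genes * bits / 4


distance = hamming([0.0, 0.0], [1.0, 0.0], lo=0.0, hi=1.0, bits=4)
```

Because adjacent quantisation levels differ in a single bit under Gray coding, parents with close gene values tend to have a small Hamming distance and are therefore less likely to be crossed.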

A Novel Fuzzy Reasoning Method Using the Choquet Integral
This section describes our new FRM, which makes use of the Choquet integral to aggregate the local information given by the rules in the RB, that is, the values of $b_j^k$ computed with Equation (3). Specifically, we propose to modify the third step of the general FRM introduced in Section 2.2.1 by applying Equation (7) instead of Equation (4).
In Equation (7), the soundness degree of the $k$-th class is obtained as the Choquet integral of the association degrees $b_1^k, \ldots, b_L^k$ with respect to $m_k$, where $m_k$ is the fuzzy measure considered for the $k$-th class of the problem, $k = 1, \ldots, M$, $M$ is the number of classes of the classification problem and $L$ is the number of rules composing the RB.
In the remainder of this section, we first introduce the fuzzy measures considered in this paper, and then we provide a learning proposal in which we optimize the fuzzy measure for each class of the problem.
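A minimal sketch of the modified aggregation step is given below. It assumes that each class contributes the list of its positive association degrees and that a fuzzy measure is represented as a Python function on sets of indices; `choquet`, `classify` and `cardinality` are hypothetical names used only for illustration.

```python
def choquet(values, measure):
    """Discrete Choquet integral of non-negative values.
    `measure` maps a frozenset of indices to a number in [0, 1]."""
    order = sorted(range(len(values)), key=values.__getitem__)
    total, previous = 0.0, 0.0
    for position, index in enumerate(order):
        # Coalition of the indices whose values are at least values[index].
        coalition = frozenset(order[position:])
        total += (values[index] - previous) * measure(coalition)
        previous = values[index]
    return total


def cardinality(n):
    """Uniform measure m(A) = |A| / n."""
    return lambda coalition: len(coalition) / n


def classify(association, measures):
    """association[k]: association degrees for class k; measures[k]: its fuzzy measure."""
    soundness = [choquet(b, m) if b else 0.0 for b, m in zip(association, measures)]
    return max(range(len(soundness)), key=soundness.__getitem__), soundness


association = [[0.2, 0.6, 0.4], [0.5]]
label, soundness = classify(association, [cardinality(3), cardinality(1)])
```

In this toy case the winning rule FRM would choose the first class (0.6 against 0.5), whereas the Choquet integral with the uniform measure averages the three rules of the first class to 0.4 and selects the second class, which shows how the fuzzy measure changes the decision.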

Fuzzy Measures
According to Equation (7), we apply the Choquet integral to obtain the global information from the local information given by each rule of the system. A key point for the success of the Choquet integral is the definition of the fuzzy measure related to it. Let $N = \{1, \ldots, n\}$ and $A \subseteq N$; we consider the use of the following five fuzzy measures:
(1) Cardinality or uniform measure.
$$m(A) = \frac{|A|}{n}$$
(2) Dirac's measure.
$$m(A) = \begin{cases} 1 & \text{if } i \in A \\ 0 & \text{otherwise} \end{cases}$$
where $i \in N$ is selected beforehand. We must point out that the result of the Choquet integral with this fuzzy measure is the $i$-th smallest value of $X$, that is, the $i$-th order statistic.
(3) Weighted mean. We assign the following values for the fuzzy measure: $m(\{1\}) = w_1, \ldots, m(\{n\}) = w_n$. For $|A| > 1$ the fuzzy measure is
$$m(A) = \sum_{i \in A} m(\{i\})$$
(4) Ordered Weighted Averaging (OWA). We assign the following values for the fuzzy measure: $m(\{i\}) = w_j$, with $i$ being the $j$-th largest component to be aggregated, that is, we construct an OWA operator. For $|A| > 1$ the fuzzy measure is
$$m(A) = \sum_{i \in A} m(\{i\})$$
(5) Exponential cardinality.
$$m(A) = \left(\frac{|A|}{n}\right)^{q}, \quad q > 0$$
We must point out that all these fuzzy measures are additive (the exponential cardinality is additive only when $q = 1$) and the cardinality and exponential cardinality are also symmetric.
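The effect of the symmetric measures can be checked numerically with the sketch below. It assumes that the exponential cardinality has the form $(|A|/n)^q$ and reads Dirac's measure on the sorted positions of the inputs, which is what yields the order statistic mentioned above; both readings, and all function names, are assumptions of this example.

```python
def choquet_symmetric(values, size_measure):
    """Choquet integral when m(A) depends only on |A|: m(A) = size_measure(|A|)."""
    ordered = sorted(values)
    n = len(ordered)
    total, previous = 0.0, 0.0
    for position, value in enumerate(ordered):
        # The coalition of values >= value has n - position elements.
        total += (value - previous) * size_measure(n - position)
        previous = value
    return total


def cardinality(n):
    return lambda size: size / n


def exponential_cardinality(n, q):
    # Assumed form (|A| / n) ** q; q = 1 recovers the cardinality measure.
    return lambda size: (size / n) ** q


def order_statistic(n, i):
    # Position-based Dirac measure: 1 while the coalition of the largest
    # values still contains the i-th smallest one, 0 afterwards.
    return lambda size: 1.0 if size >= n - i + 1 else 0.0


values = [0.2, 0.6, 0.4]
mean_like = choquet_symmetric(values, cardinality(3))
near_max = choquet_symmetric(values, exponential_cardinality(3, 0.01))
near_min = choquet_symmetric(values, exponential_cardinality(3, 100))
median = choquet_symmetric(values, order_statistic(3, 2))
```

Under this form, exponents below 1 push the aggregation towards the maximum (the winning rule) and exponents above 1 push it towards the minimum, so a single parameter per class spans a wide family of behaviours.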

Fuzzy Measure Learning Method
The definition of an appropriate fuzzy measure plays an essential role in the success of the Choquet integral. In the proposed FRM, the first attempt is to apply the Choquet integral with the same fuzzy measure for each class of the problem. However, the set of rules of each class can interact in a different way. This fact could be taken into account by taking different values of the parameter $q$ for each class using the exponential cardinality. In this manner, a specific fuzzy measure would be constructed for the different classes of the problem, that is, $m_k(A) = \left(\frac{|A|}{n}\right)^{q_k}$, with $k = 1, \ldots, M$. Consequently, we propose a learning method to compute the most appropriate fuzzy measure for each class of the problem, since it can lead to an increase in the system's accuracy.
In order to carry out this optimization problem, we consider the use of the CHC evolutionary model [25] (see Section 2.3).In the remainder of this section, we describe the specific features of our evolutionary model.
• Coding scheme. We have a set of real parameters to be optimized ($q_k$, with $k = 1, \ldots, M$), where the range in which we suggest varying each one is $[0.01, 100]$. However, we do not directly encode them in a chromosome; instead, we adapt them using chromosomes of the form:
$$(G_1, G_2, \ldots, G_M)$$
where $G_k \in [0.01, 1.99]$ with $k = 1, \ldots, M$. In order to compute their real values (in the range $[0.01, 100]$) we apply Equation (8).
$$q_k = \begin{cases} G_k & \text{if } 0 < G_k \le 1 \\ \dfrac{1}{2 - G_k} & \text{if } 1 < G_k < 2 \end{cases} \qquad (8)$$
The change of range is needed because we must give the same chances to produce offspring in the ranges $[0.01, 1]$ and $[1, 100]$ after applying the crossover operator. Looking at how the crossover operator works, if we encoded the parameters in the range $[0.01, 100]$ we would favour the generation of offspring in the range $[1, 100]$ and, consequently, we would reduce the probability of the generation of offspring in the range $[0.01, 1]$. For this reason, we adapt the range in order to avoid this undesirable situation.
• Initial Gene Pool. We include an individual having all genes with value 1. In this manner, at least we obtain the results provided by the cardinality measure.
• Chromosome Evaluation. We use the most common metric for classification, i.e., the accuracy rate, which is the percentage of correctly classified examples.
• Crossover Operator. The crossover operator is based on the concept of environments (the offspring are generated around their parents). These kinds of operators cooperate well when they are introduced within evolutionary models that force convergence by pressure on the offspring (as is the case of CHC). Figure 2 depicts the behaviour of these kinds of operators, which allow the offspring genes to be around the genes of one parent, Parent Centric BLX (PCBLX), or around a wide zone determined by both parent genes, BLX-α. Specifically, we consider the PCBLX operator, which is based on BLX-α [26].
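The PCBLX is described as follows. Assume that $X = (x_1, \ldots, x_n)$ and $Y = (y_1, \ldots, y_n)$, with $x_i, y_i \in [a_i, b_i] \subset \mathbb{R}$, $i = 1, \ldots, n$, are two real-coded chromosomes that are going to be crossed. The PCBLX operator generates the two following offspring:
- $O_1 = (o_{11}, \ldots, o_{1n})$, where $o_{1i}$ is a randomly (uniformly) chosen number from the interval $[l_i^1, u_i^1]$, with $l_i^1 = \max\{a_i, x_i - I_i\}$, $u_i^1 = \min\{b_i, x_i + I_i\}$ and $I_i = |x_i - y_i|$.
- $O_2 = (o_{21}, \ldots, o_{2n})$, where $o_{2i}$ is a randomly (uniformly) chosen number from the interval $[l_i^2, u_i^2]$, with $l_i^2 = \max\{a_i, y_i - I_i\}$ and $u_i^2 = \min\{b_i, y_i + I_i\}$.
• Restarting Approach. To get away from local optima, this algorithm uses a restarting approach, since it does not apply mutation during the recombination phase. Therefore, when the threshold value is lower than zero, all the chromosomes are regenerated randomly to introduce new diversity to the search. Furthermore, the best global solution found is included in the population to increase the convergence of the algorithm, as in the elitist scheme.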
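The coding scheme and the crossover operator can be sketched together. The decoding below assumes a piecewise mapping ($q = G$ for $G \le 1$ and $q = 1/(2 - G)$ otherwise), which is consistent with the stated ranges $[0.01, 1.99]$ and $[0.01, 100]$ and with the gene value 1 giving the cardinality measure. The crossover assumes the usual parent-centric intervals of half-width $|x_i - y_i|$ clipped to the gene bounds. The names `decode_gene` and `pcblx` are hypothetical.

```python
import random


def decode_gene(gene):
    """Map a gene in [0.01, 1.99] to an exponent q in [0.01, 100] (assumed mapping)."""
    return gene if gene <= 1.0 else 1.0 / (2.0 - gene)


def pcblx(parent_x, parent_y, lo=0.01, hi=1.99, rng=random):
    """Parent-centric BLX: each offspring gene is drawn uniformly around one parent."""
    def around(center, other):
        spread = abs(center - other)
        return rng.uniform(max(lo, center - spread), min(hi, center + spread))

    child_1 = [around(x, y) for x, y in zip(parent_x, parent_y)]
    child_2 = [around(y, x) for x, y in zip(parent_x, parent_y)]
    return child_1, child_2


child_1, child_2 = pcblx([0.5, 1.0], [0.7, 1.0], rng=random.Random(0))
exponents = [decode_gene(g) for g in child_1]
```

With this decoding, genes in $[0.01, 1]$ and genes in $[1, 1.99]$ cover intervals of almost equal length, so offspring are equally likely to decode to exponents below and above 1.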

Experimental Framework
In this section, we first present the real-world classification data-sets selected for the experimental study. Next, we introduce the parameter set-up considered throughout this study. Finally, we introduce the statistical tests that are necessary to compare the results achieved throughout the experimental study.

Data-Sets
We have selected seventeen numerical data-sets from the KEEL data-set repository [16,17]. Table 1 summarizes the properties of the selected data-sets, showing for each data-set the number of examples (#Ex.), the number of attributes (#Atts.) and the number of classes (#Class.). We must point out that the magic, ring and twonorm data-sets have been reduced by stratified sampling at 10% in order to reduce their size for training, and that examples with missing values have been removed, as in the wisconsin data-set.
A 5-fold cross-validation model was considered in order to carry out the different experiments. That is, we split the data-set into 5 random partitions of data, each one with 20% of the examples, and we use a combination of 4 of them (80%) to train the system and the remaining one to test it. This process is repeated five times by using a different partition to test the system each time. We consider the average result of the five partitions as the final classification rate of the algorithm. This procedure is a standard for testing the performance of classifiers [27,28].
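The validation scheme can be sketched as follows. This is only an illustration: the partitions are plain random splits produced with a fixed seed, and `evaluate` stands for any hypothetical function that trains on the first index list and returns the accuracy on the second.

```python
import random


def five_fold_indices(n_examples, n_folds=5, seed=0):
    """Yield (train, test) index lists; each fold is used once for testing."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[k::n_folds] for k in range(n_folds)]
    for k in range(n_folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, folds[k]


def cross_validated_accuracy(n_examples, evaluate):
    """Average of the accuracies obtained on the five test partitions."""
    scores = [evaluate(train, test) for train, test in five_fold_indices(n_examples)]
    return sum(scores) / len(scores)


splits = list(five_fold_indices(10))
```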

Configuration of the Proposals and Notation
We will apply the following configuration for the Chi et al. rule generation algorithm:
• Conjunction operator: Product t-norm.
For the new proposal using Dirac's fuzzy measure, the value selected as $i$ is the one associated with the median, that is, if the number of elements is odd we take $i = \frac{n+1}{2}$, whereas if the number of elements is even we take $i = \frac{n}{2} + 1$. We must stress that when using Dirac's measure taking $i = n$ we obtain the same results provided by the maximum (Max.). In addition, if we used $i = 1$, we would obtain the results provided by the minimum as aggregation function [29], but we do not include them since the achieved performance is poor.
Regarding the genetic process, we have used the values suggested in [30], which are:
• Population Size: 50 individuals.
• Bits per gene for the Gray codification (for incest prevention): 30 bits.
Finally, for the sake of clarity, Table 2 shows the names given to the different approaches considered throughout the experimental study.
In this paper, we use some hypothesis validation techniques in order to give statistical support to the analysis of the results [31,32]. We will use non-parametric tests because the initial conditions that guarantee the reliability of the parametric tests cannot be fulfilled, which implies that the statistical analysis loses credibility with these parametric tests [18]. Specifically, we use the Friedman aligned ranks test [33] to detect statistical differences among a group of results and the Holm post-hoc test [34] to find the algorithms that reject the equality hypothesis with respect to a selected control method.
The post-hoc procedure allows us to know whether a hypothesis of comparison could be rejected at a specified level of significance α.Furthermore, we compute the adjusted p-value (APV) in order to take into account the fact that multiple tests are conducted.In this manner, we can directly compare the APV with respect to the level of significance α in order to be able to reject the null hypothesis.
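The adjusted p-values of the Holm procedure can be computed as in the sketch below, which uses the standard step-down definition. The p-values in the usage line are hypothetical numbers chosen only to illustrate the computation; they are not results from this study.

```python
def holm_adjusted(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    k = len(p_values)
    order = sorted(range(k), key=p_values.__getitem__)
    adjusted, running = [0.0] * k, 0.0
    for rank, index in enumerate(order):
        # The smallest p-value is multiplied by k, the next one by k - 1, and so on;
        # the running maximum keeps the adjusted values monotone.
        running = max(running, (k - rank) * p_values[index])
        adjusted[index] = min(1.0, running)
    return adjusted


apv = holm_adjusted([0.01, 0.04, 0.03])  # hypothetical p-values
```

Each adjusted value can then be compared directly with the significance level α, which is how the APV is used in the analysis.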
Furthermore, we consider the method of aligned ranks of the algorithms in order to show graphically how good a method is with respect to its partners. The first step to compute this ranking is to obtain the average performance of the algorithms in each data-set. Next, for each data-set, we compute the difference between the accuracy of each algorithm and this average value. Then, we rank all these differences in descending order and, finally, we average the rankings obtained by each algorithm. In this manner, the algorithm that achieves the lowest average ranking is the best one.
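The ranking procedure just described can be sketched as follows. Tied differences receive the average of the ranks they span, which is an assumption of this example since the text does not state how ties are handled; the function name and the accuracy values in the usage line are hypothetical.

```python
def aligned_ranks(accuracy):
    """accuracy[d][a]: accuracy of algorithm a on data-set d.
    Returns the average aligned rank of each algorithm (lower is better)."""
    n_algorithms = len(accuracy[0])
    differences = []
    for row in accuracy:
        mean = sum(row) / len(row)
        differences.extend((value - mean, a) for a, value in enumerate(row))
    # Rank all differences together in descending order (rank 1 = largest).
    differences.sort(key=lambda item: -item[0])
    totals = [0.0] * n_algorithms
    start = 0
    while start < len(differences):
        end = start
        while end < len(differences) and differences[end][0] == differences[start][0]:
            end += 1
        shared_rank = (start + 1 + end) / 2  # average of ranks start+1 .. end
        for _, a in differences[start:end]:
            totals[a] += shared_rank
        start = end
    return [total / len(accuracy) for total in totals]


ranks = aligned_ranks([[90.0, 80.0], [70.0, 75.0]])  # hypothetical accuracies
```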
These tests are suggested in the studies presented in [18,31,35], where it is shown that their use in the field of machine learning is highly recommended.

Experimental Results
Table 3 shows the classification accuracy along with the standard deviation obtained both in training and in testing by the different approaches used in the experimental study, where the best global result for each data-set is emphasised in bold-face. From these results it can be observed that the proposals using standard fuzzy measures (Card., Dirac, WMean and OWA) behave similarly to one another, except for the proposal associated with Dirac's measure, which provides worse results. Regarding the behaviour of these proposals with respect to the classical FRM of the winning rule, which uses the maximum as aggregation function (Max.), we can observe that although they provide a worse mean performance, the loss of accuracy is mainly due to three data-sets, namely balance, iris and twonorm, the latter being especially bad for these proposals. However, when the fuzzy measure is appropriate for the specific problem we are dealing with, like the one that has been genetically learned (Card GA), both the increase in the system's performance and the robustness of the method can be noted, since it provides the best result in eleven out of the seventeen data-sets of the study.
Finally, we also compare our new FRM with the classical additive combination FRM (AC), which aggregates the positive association degrees by summing them and therefore, unlike the Choquet integral, does not provide a result in the range between the minimum and maximum of the aggregated values. In this comparison, we can stress that the Card GA proposal obtains a mean enhancement of 2.04%, which is based on the improvement of the performance of the AC FRM in more than half of the data-sets.
These facts are confirmed in Figure 3, where it is clearly shown that Card GA is the best ranking method. The p-value obtained with the Friedman aligned ranks test is 0.02, which confirms the existence of statistical differences among these seven approaches. For this reason, we perform the Holm post-hoc test to check whether the best ranking method (Card GA) statistically outperforms the remaining methods. From the results in Table 4, the goodness of the proposal using the Choquet integral with a suitable fuzzy measure is clearly determined, since it outperforms both the proposals using a standard fuzzy measure and the classical FRM of the winning rule. Furthermore, the obtained APV shows that Card GA allows the performance of the additive combination FRM to be clearly enhanced. Therefore, it can be concluded that the best approach is the one that makes use of the Choquet integral with the fuzzy measure genetically learned.

Conclusions
In this paper we have proposed a novel FRM in which the Choquet integral is used to aggregate the local information of the rules.The Choquet integral is associated with a fuzzy measure, which allows us to model the relationship among the rules.For this reason, we have applied several standard fuzzy measures in order to take into account such an interaction.However, the definition of an appropriate fuzzy measure is a complex problem and consequently, we have proposed a genetic learning method in which a fuzzy measure is computed to aggregate the information related to the different classes of the problem.
In the experimental study, we have used the Chi et al.'s algorithm to generate the fuzzy rules.In order to test the goodness of our method, we have used a wide benchmark of numerical data-sets to compare the behaviour of the classical FRMs of both the winning rule and the additive combination with respect to our new approach using both several well-known fuzzy measures and the fuzzy measure genetically learned.From this comparison, it can be concluded that the use of our new approach is advisable to face classification problems when the fuzzy measure is learnt to suit the features of each specific problem, since it statistically outperforms the results of the FRM of the winning rule and it clearly enhances the performance of the additive combination FRM.

Figure 2. Scheme of the behaviour of the BLX and PCBLX operators.

Figure 3. Rankings of the seven approaches considered in the study.

Table 1. Summary description of the employed data-sets.

Table 2. Names given to the seven approaches used in the paper.

Table 3. Results in training (Tr.) and testing (Tst.) along with their standard deviations achieved by the seven approaches considered in this paper.

Table 4. Holm test to compare Card GA with respect to the different approaches.