Benchmarking Analysis of the Accuracy of Classification Methods Related to Entropy

In the machine learning literature we can find numerous methods to solve classification problems. We propose two new performance measures to analyze such methods. These measures are defined by using the concept of proportional reduction of classification error with respect to three benchmark classifiers: the random classifier and two intuitive classifiers, which are based on how a non-expert person could perform classification simply by applying a frequentist approach. We show that these three simple methods are closely related to different aspects of the entropy of the dataset. Therefore, these measures account, to some extent, for the entropy in the dataset when evaluating the performance of classifiers. This allows us to measure the improvement in the classification results compared to simple methods and, at the same time, how entropy affects classification capacity. To illustrate how these new performance measures can be used to analyze classifiers taking into account the entropy of the dataset, we carry out an intensive experiment in which we use the well-known J48 algorithm and a UCI repository dataset on which we have previously selected a subset of the most relevant attributes. Then we carry out an extensive experiment in which we consider four heuristic classifiers and 11 datasets.


Introduction
Classification is one of the most relevant topics in machine learning [1][2][3][4]. In general, the purpose of supervised classification is to predict the correct class, among a set of known classes, of a new given observation, based on the knowledge provided by a dataset known as "training data". In addition, the classification problem is very important in decision-making in many different fields, so it is not difficult to find applications in areas such as medicine, biotechnology, marketing, security in communication networks, robotics, and image and text recognition, among others. Three issues in classification problems are attribute subset selection, the design and implementation of classifiers, and the performance evaluation of classifiers [1][2][3][4]. In this paper, we will focus mainly on the latter.
On the other hand, entropy appears in statistics or information theory as a measure of diversity, uncertainty, randomness or even complexity. For this reason, we can find the use of entropy in the feature selection problem and the design of classifiers. Shannon [5] introduced entropy in the context of communication and information theory. This concept has been used frequently in information-based learning models [2]. Two extensions of the Shannon entropy measure, which are also frequently used, are the Renyi's entropy [6] and the Tsallis' entropy [7]. In [8], a review on generalized entropies can be found.
One of the most frequent difficulties found in the analysis of a dataset is that of high dimensionality: when there are too many variables, the analysis is more difficult and computationally expensive, and there may be correlated or redundant variables, among other issues. Regarding performance evaluation, the properties and the behavior of 12 performance measures for flat multi-class classifiers have been analyzed in the literature. Jiao and Du [68] reviewed the most common performance measures used in bioinformatics predictors for classification. Valverde-Albacete and Peláez-Moreno [69][70][71][72] analyzed classification performance with information-theoretic methods. In particular, they proposed to analyze classifiers by means of entropic measures on their confusion matrices. To do this, they used the de Finetti entropy diagram (or entropy triangle) and a suitable decomposition of a Shannon-type entropy, and then defined two performance measures for classifiers: the entropy-modified accuracy (EMA) and the normalized information transfer (NIT) factor. The EMA is the expected proportion of times the classifier will guess the output class correctly, and the NIT factor is the proportion of available information transferred from input to output. The quotient of these two measures provides information on how much information is available for learning.
In this paper, we focus on the definition of performance measures. In particular, following the ideas on agreement coefficients from statistics, Cohen's κ [73] and Scott's π [74], which have also been used as performance measures of classifiers [75], we consider three performance measures closely related to them. These statistics were originally defined to measure the level of concordance between the classifications made by two evaluators. The mathematical formula is the following:

(P_0 − P_e) / (1 − P_e), (1)

where P_0 represents the observed proportion of classifications on which the two evaluators agree when classifying the same data independently, and P_e is the proportion of agreement to be expected on the basis of chance. Depending on how P_e is defined, Cohen's κ or Scott's π is obtained. In machine learning, these statistics are used as performance measures by considering the classifier to be evaluated and a random classifier, where P_0 is the accuracy of the classifier. In this paper, we look at these performance measures from another point of view and define two new performance measures based on Scott's π. In particular, we use the interpretation given by Goodman and Kruskal [76] for the λ statistic. Thus, we consider three benchmark classifiers: the random classifier and two intuitive classifiers. The three classifiers assign classes to new observations by using the information of the frequency distribution of the attributes in the training data. To be more specific, the random classifier, X, predicts at random with the frequency distribution of the classes at hand; the first intuitive classifier, V, predicts the most likely outcome for each possible observation with the frequency distribution of the classes in the training data; and the second intuitive classifier, I, predicts the most likely outcome for each possible observation with the joint frequency distribution of all attributes in the training data.
The two intuitive classifiers described were postulated, built, and analyzed, but rejected in favor of more modern classifier technologies before 2000. However, they can still be useful to define other performance measures in the style of Cohen's κ or Scott's π. Thus, in order to evaluate a classifier, we determine the proportional reduction of classification error when we use the classifier to be evaluated with respect to using one of the benchmark classifiers. In this sense, P_0 is the accuracy of the classifier to be evaluated and P_e is the (expected) accuracy of the benchmark classifier. In the case where the benchmark classifier is the random classifier, we obtain a performance measure like Scott's π, but with an interpretation different from the usual one in the machine learning literature. This is also an interesting approach to the performance evaluation of classifiers, because we can measure how advantageous a new classifier is with respect to three simple benchmark classifiers, which can be seen as the best common-sense options for non-expert (but sufficiently intelligent) people, and whose error rates are simpler to determine than the Bayes error.
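To make the role of the benchmarks X and V concrete, the following minimal Python sketch (our own illustration; the class names and the toy class balance are assumptions) implements both frequency-based classifiers:

```python
import random
from collections import Counter

class RandomClassifier:
    """Benchmark X: predicts a class at random, with probabilities equal to
    the class frequencies observed in the training data."""
    def fit(self, y_train):
        counts = Counter(y_train)
        total = sum(counts.values())
        self.classes = list(counts)
        self.weights = [counts[c] / total for c in self.classes]
        return self

    def predict(self, n):
        # One random draw per instance, weighted by the class frequencies.
        return random.choices(self.classes, weights=self.weights, k=n)

class MajorityClassifier:
    """Benchmark V: always predicts the most frequent class seen in training."""
    def fit(self, y_train):
        self.majority = Counter(y_train).most_common(1)[0][0]
        return self

    def predict(self, n):
        return [self.majority] * n

# Toy class balance (hypothetical labels for illustration only).
y = ["neg"] * 6772 + ["pos"] * 2401
v = MajorityClassifier().fit(y)
print(v.predict(3))   # ['neg', 'neg', 'neg']
```

Note that neither benchmark looks at the explanatory attributes; both use only the frequency distribution of the classes, which is what ties them to the entropy of the target attribute.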
On the other hand, we analyze the relationship between the three benchmark classifiers and different aspects of the entropy of the dataset. Thus, the random classifier X and the intuitive classifier V are directly related to the entropy of the target attribute, while the intuitive classifier I is closely related to the entropy of the target attribute when the whole dataset is considered, i.e., to the conditional entropy of the target attribute given the remaining variables in the dataset. With these relationships in mind, we can analyze the performance of classifiers taking into account the entropy of the dataset [77]. This is an interesting approach because it allows us to identify under what conditions of information uncertainty (measured by means of entropy) a classifier works better.
To the best of our knowledge, the main contributions of the paper to the machine learning literature are the following:

1.
We consider the random classifier and two intuitive classifiers as benchmark classifiers. These classifiers can be considered as simple, intuitive and natural for common sense non-expert decision-makers.

2.
We define three new performance measures of classifiers based on the Scott's π, the accuracy of classifiers, and the benchmark classifiers.

3.
We interpret our performance measures of classifiers in terms of proportional reduction of classification error. Therefore, we measure how much a classifier improves the classification made by the benchmark classifiers. This interpretation is interesting because it is easy to understand and, at the same time, we determine the gain in accuracy related to three simple classifiers. In a sense, they provide information on whether the design of the classifier has been worth the effort.

4.
The three performance measures of classifiers lie in the interval [−1, 1], where −1 means that the classifier under evaluation worsens by 100% the correct classification made by the corresponding benchmark classifier (this corresponds to the classifier assigning all observations incorrectly), and 1 means that the classifier reduces by 100% the incorrect classification made by the corresponding benchmark classifier (this corresponds to the classifier assigning all observations correctly).

5.

The benchmark classifiers capture the entropy of the dataset. The random classifier X and the intuitive classifier V measure the entropy of the target attribute, and the intuitive classifier I reflects the conditional entropy of the target attribute given the remaining variables in the dataset. Therefore, they allow us to analyze the performance of a classifier taking into account the entropy in the dataset. These measures, particularly those based on the intuitive classifiers, offer different information than other performance measures of classifiers, which we consider to be interesting. The aim, therefore, is not to substitute any known performance measure, but to provide a measure of a different aspect of the performance of a classifier.

6.

We carry out an intensive experiment to illustrate how the proposed performance measures work and how the entropy can affect the performance of a classifier. For that, we consider a particular dataset and the classification algorithm J48 [78][79][80], an implementation provided by Weka [75,81,82,83] of the classic C4.5 algorithm presented by Quinlan [36,37].

7.

In order to validate what was observed in the previous experiment, we carry out an extensive experiment using four classifiers implemented in Weka and 11 datasets.
The rest of the paper is organized as follows. In Section 2, we provide the methodology and materials used in the paper: in particular, the feature selection method, the algorithm of the intuitive classifier I, a description of several heuristic classifiers implemented in Weka [75,81,82,83], and the definition and theoretical analysis of the performance measures introduced in this paper. In Section 3, we carry out the experiments to illustrate how the performance measures work and how they can be used to analyze the classifiers' performance in terms of entropy. In Section 4, we discuss the results obtained and conclude. Tables are included in Appendix A.

Method and Software Used for Feature Selection
The method used to perform the selection and ranking of the most influential variables is Gain Ratio Attribute Evaluation [25] (implemented in Weka [75,81,82,83]). This measure, GR(att) in Equation (2), provides an objective criterion for sorting explanatory variables by importance with respect to the target variable. By design, Gain Ratio penalizes the proliferation of nodes, correcting the bias of information gain in favor of attributes with many uniformly distributed values. The gain ratio of each attribute is calculated using the following formula:

GR(att) = IG(att) / H(att), (2)

where IG(att) is a measure of the information gain provided by each attribute, which is a popular measure to evaluate attributes; in particular, it is the difference between the entropy of the consequent attribute and the entropy of the consequent when att is known. H(att) is the entropy of attribute att. Thus, the feature selection method calculates the gain ratio for each attribute att [25].
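As an illustration, the gain ratio computation of Equation (2) can be sketched in a few lines of Python (our own sketch, not the Weka implementation; the toy attribute and class values are assumptions):

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2) of a list of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def gain_ratio(att_column, class_column):
    """GR(att) = IG(att) / H(att): information gain of the attribute with
    respect to the class, normalized by the entropy of the attribute itself."""
    n = len(class_column)
    # Conditional entropy H(class | att): entropy of the class within each
    # attribute-value group, weighted by the group size.
    groups = {}
    for a, c in zip(att_column, class_column):
        groups.setdefault(a, []).append(c)
    h_cond = sum(len(g) / n * entropy(g) for g in groups.values())
    ig = entropy(class_column) - h_cond   # information gain IG(att)
    h_att = entropy(att_column)           # entropy of the attribute H(att)
    return ig / h_att if h_att > 0 else 0.0

att = ["a", "a", "b", "b"]
cls = ["pos", "pos", "neg", "neg"]
print(gain_ratio(att, cls))   # 1.0: the attribute determines the class perfectly
```

Ranking the explanatory attributes by this score is what the feature selection step in Section 2.1 does.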

Methodology and Software for the Intuitive Classification Method I
The basic idea of the intuitive classifier I is to generate classification rules from a dataset where all values are discrete (text tags). The dataset will have C columns or attributes (A_1, ..., A_C). One of the attributes (A_C in Figure 1) is the target variable, used to classify instances. The remaining attributes (A_1, ..., A_{C−1}) are the explanatory variables of the problem, or antecedents.
A classification rule will consist of an antecedent (left side of the rule) and a consequent (right side of the rule), as illustrated in Equation (3). The antecedent is composed of C − 1 attribute/value pairs (<A_i = V_i>), where the attributes are the explanatory variables. The consequent consists of a single attribute/value pair (target variable/value) of the form <A_C = V_C>:

<A_1 = V_1> ∧ <A_2 = V_2> ∧ ... ∧ <A_{C−1} = V_{C−1}> → <A_C = V_C>. (3)

The intuitive classifier I counts the most repeated values within the data sample. In our opinion, this could be what any non-expert person would do to try to identify the most likely patterns of a data sample by applying common sense. The algorithm of the intuitive classifier I (see Algorithm 1) performs a comprehensive scan of all records in the dataset and counts how many times each combination of values on the left side of the rule (antecedent) occurs; we call this amount the rule support (R.supp). Analogously, for each classification rule, given an antecedent, the algorithm counts the number of times each of its possible consequents (right part of the rule) occurs; we call this the rule confidence (R.conf).

Algorithm 1 Generation of the classification rule set

BEGIN ALGORITHM
INPUT: dataset D
CRS ← ∅
for all row in D do
  if there exists a rule Rj in CRS such that Antecedent(Rj) = Antecedent(row) and Consequent(Rj) = Consequent(row) then
    for all Ri in CRS such that Antecedent(Ri) = Antecedent(row) do
      Ri.supp ← Ri.supp + 1
    end for
    Rj.conf ← Rj.conf + 1
  else
    R ← New Rule
    R.antecedent ← Antecedent(row)
    R.consequent ← Consequent(row)
    R.supp ← 1
    R.conf ← 1
    for all Ri in CRS such that Antecedent(Ri) = Antecedent(row) do
      Ri.supp ← Ri.supp + 1
    end for
    CRS ← CRS ∪ {R}
  end if
end for
return CRS: Classification Rule Set {/*OUTPUT*/}
END ALGORITHM

Note that each rule (R) of the rule set (CRS), generated according to Algorithm 1, has both support and confidence values associated with it (R.supp, R.conf).
These values are, as indicated above, the number of times the antecedent is repeated in the data sample and the number of times that, given a particular antecedent, its consequent class is repeated in the data sample. These two counters allow us to determine which patterns are the most repeated. This model, formed by the whole set of rules CRS, predicts the class variable of an instance "s" by applying Algorithm 2.
Algorithm 2 infers the class value of an instance "s" by using the rule of the set CRS whose antecedent most closely resembles the antecedent of "s" (matching a greater number of attributes). In the case where there are multiple rules with the same number of matches, the one with the largest support is selected. If there are several rules with equal support, the one with the highest confidence is chosen. Once that rule is identified, the predicted class is the value of the consequent of the selected rule. The selection among the set RSS of best-matching rules proceeds as follows:

if RSS ≠ ∅ then
  R ← R1 {/* R1 is the first rule of RSS */}
  for j = 2 to |RSS| do
    if R.supp < Rj.supp or (R.supp = Rj.supp and R.conf < Rj.conf) then
      R ← Rj
    end if
  end for
  return R.consequent {/* predicted class */}
end if
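The counting and prediction steps of Algorithms 1 and 2 can be sketched compactly in Python (a hedged sketch of the same idea, storing rules as antecedent/consequent counters rather than explicit rule objects; the toy weather data are an assumption):

```python
from collections import Counter

def build_rules(rows):
    """Algorithm 1 sketch: one pass over the dataset counting, for each
    antecedent (tuple of explanatory values), how often it appears (support)
    and how often each consequent class follows it (confidence)."""
    supp = Counter()   # antecedent -> R.supp
    conf = Counter()   # (antecedent, consequent) -> R.conf
    for *antecedent, consequent in rows:
        antecedent = tuple(antecedent)
        supp[antecedent] += 1
        conf[(antecedent, consequent)] += 1
    return supp, conf

def predict(supp, conf, instance):
    """Algorithm 2 sketch: among the rules whose antecedent matches the
    instance on the largest number of attributes, pick the one with the
    largest support, breaking ties by confidence, and return its consequent."""
    def score(rule):
        antecedent, _consequent = rule
        matches = sum(a == b for a, b in zip(antecedent, instance))
        return (matches, supp[antecedent], conf[rule])
    best = max(conf, key=score)
    return best[1]

rows = [("rain", "cold", "stay"), ("rain", "cold", "stay"),
        ("rain", "warm", "go"), ("sun", "warm", "go")]
supp, conf = build_rules(rows)
print(predict(supp, conf, ("rain", "cold")))   # 'stay'
```

Note that the ratio of exactly repeated antecedent/consequent patterns that drives these counters is what links the classifier I to the conditional entropy of the target given the remaining attributes.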

Methodology and Software for the Heuristic Classifiers
For the generation of predictive models from the heuristic approach, we consider several heuristic classifiers: J48, Naïve Bayes, SMO, and Random Forest.
The decision tree learner J48 [78][79][80] is an implementation provided by Weka of the classic C4.5 algorithm [36,37]. J48 extends some of the functionalities of C4.5, such as allowing the post-pruning process of the tree to be carried out by a method based on error reduction, or making the divisions over discrete variables always binary, among others [75]. These decision trees are considered supervised classification methods. There is a dependent or class variable (of a discrete nature), and the classifier, from a training sample, determines the value of that class for new cases. The tree construction process begins with the root node, which has all training examples or cases associated with it. First, the variable or attribute used to divide the original training sample (root node) is chosen, seeking minimal variability with respect to the class in the generated subsets. This process is recursive: once the variable producing the highest homogeneity with respect to the class in the child nodes is obtained, the analysis is performed again for each of the child nodes. The recursion stops when all leaf nodes contain cases of the same class; overfitting must then be avoided, for which pre-pruning and post-pruning methods are implemented.
We also consider the Naïve Bayes algorithm implemented in Weka [75,81,82,83], which is a well-known classifier [48,49] based on Bayes' theorem. Details on Naïve Bayes classifiers can be found in almost any data science or machine learning book. On the other hand, Ref. [81] is an excellent reference for the Weka software.
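For illustration, a minimal categorical Naive Bayes can be written from scratch (our own sketch with add-one smoothing, not Weka's implementation; the toy data and the smoothing denominator are assumptions):

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows):
    """Estimate P(class) and P(attribute = value | class) from frequency
    counts; each row is (feature_1, ..., feature_n, class)."""
    class_counts = Counter(r[-1] for r in rows)
    value_counts = defaultdict(Counter)   # (attr index, class) -> value counts
    for *features, cls in rows:
        for i, v in enumerate(features):
            value_counts[(i, cls)][v] += 1
    return class_counts, value_counts

def predict_naive_bayes(model, instance):
    class_counts, value_counts = model
    n = sum(class_counts.values())
    def log_posterior(cls):
        lp = math.log(class_counts[cls] / n)
        for i, v in enumerate(instance):
            counts = value_counts[(i, cls)]
            # Add-one (Laplace) smoothing so unseen values do not zero out
            # the product of conditional probabilities.
            lp += math.log((counts[v] + 1) / (class_counts[cls] + len(counts) + 1))
        return lp
    return max(class_counts, key=log_posterior)

rows = [("rain", "cold", "stay"), ("rain", "cold", "stay"),
        ("rain", "warm", "go"), ("sun", "warm", "go")]
model = train_naive_bayes(rows)
print(predict_naive_bayes(model, ("rain", "cold")))   # 'stay'
```

Working in log space avoids numeric underflow when many attributes are multiplied together, which is the standard practical trick for this classifier.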
SMO is an implementation in Weka [75,81,82,83] of Platt's sequential minimal optimization algorithm [84][85][86] for training a support vector machine classifier [45]. SMO is a simple algorithm that quickly solves the support vector machine quadratic programming problem by decomposing the overall quadratic problem into smaller quadratic sub-problems, which are easier and faster to solve.
Finally, we will also use the random forest classifier implemented in the Weka software [75,81,82,83]. Random forest classifiers [41] are ensembles of decision trees, each built from a randomly selected subset of the training set, and the final classification is the aggregation of the classifications provided by the individual trees.

Evaluation Measures
The evaluation of classifiers or predictive models is very important because it allows us (1) to compare different classifiers or models to make the best choice, (2) to estimate how the classifier or model will perform in practice, and (3) to convince the decision maker that the classifier or model will be suitable for its purpose (see [1,2]). The simplest way to evaluate a classifier for a particular problem given by a dataset is to consider the ratio of correct classifications. If we denote by Z the classifier and by D the dataset, then the performance of Z classifying a particular attribute (the consequent) in D is given by

μ(Z, D) = (number of instances of D correctly classified by Z) / (total number of instances of D). (4)

This measure is known as accuracy. There are other evaluation measures [1,2], but in this paper we focus on defining new measures based in some way on the concepts of proportional reduction of classification error [76] and entropy [5].
Our approach for defining evaluation measures based on entropy is by considering simple classifiers that capture the entropy of the problem. These classifiers play the role of benchmark when evaluating other classifiers.
Let us consider a dataset D with N instances (rows) and C attributes (columns) such that attributes A_1, A_2, ..., A_{C−1} are considered the explanatory variables (antecedents) and A_C is the attribute to be explained (consequent) or predicted. Let a_C1, a_C2, ..., a_CK be the categories or classes of variable A_C, and let p_C1, p_C2, ..., p_CK be the relative frequencies of those categories in D. Associated with this problem, we can consider a random variable X from the sample space Ω = {a_C1, a_C2, ..., a_CK} to R, such that X(a_Cj) = j and Prob(X = j) = p_Cj. Therefore X has the non-uniform discrete distribution D(p_C1, p_C2, ..., p_CK), i.e., X ∼ D(p_C1, p_C2, ..., p_CK). This X can be considered the random classifier for the consequent A_C in the dataset D, defined as

X(i) = a_Cj with probability p_Cj, (5)

where i is an observation or instance. Furthermore, we can define another simple and intuitive classifier for the consequent A_C in the dataset D as follows:

V(i) = a_Cj*, where j* = arg max_j p_Cj, (6)

where i is an observation or instance, i.e., this intuitive classifier predicts the most likely outcome for each possible observation with the frequency distribution of the consequent A_C. If we take the N instances of the dataset, then the classification of each instance i by the random classifier X has a categorical, generalized Bernoulli or multinoulli distribution with parameter p_i, where p_i is the frequency associated with the category that attribute A_C takes for the instance i, i.e., X(i) ∼ B(p_i). Therefore, the expected number of successes in the classification of the N instances is given by

Σ_{i=1}^{N} p_i = N Σ_{j=1}^{K} p_Cj^2. (7)

Assuming that the classification of each instance is made independently, the variance of the number of successes in the classification of the N instances is given by

Σ_{i=1}^{N} p_i (1 − p_i) = N Σ_{j=1}^{K} p_Cj^2 (1 − p_Cj). (8)

Note that if we consider a set of instances different from dataset D, then Equations (7) and (8) would be given by

Σ_{j=1}^{K} N_Cj p_Cj and Σ_{j=1}^{K} N_Cj p_Cj (1 − p_Cj), (9)

where N_Cj is the number of instances for which attribute A_C takes the value a_Cj.
Likewise, if we are interested in the ratio of success in the classification, then Equation (7) simply becomes

Σ_{j=1}^{K} p_Cj^2. (10)

Thus, Equation (10) provides the expected accuracy of the random classifier X, i.e.,

μ(X, D) = Σ_{j=1}^{K} p_Cj^2. (11)

In the same way, we can arrive at the accuracy of the classifier V, which is

μ(V, D) = max_j p_Cj. (12)

On the other hand, the Shannon entropy [5] of attribute A_C in dataset D is given by

H_S(A_C) = − Σ_{j=1}^{K} p_Cj log2 p_Cj. (13)

Shannon entropy can be seen as a Rényi entropy measure [6] or a Tsallis entropy measure [7], which have the following mathematical expressions for attribute A_C in dataset D, respectively:

H_R,α(A_C) = (1/(1 − α)) log2 ( Σ_{j=1}^{K} p_Cj^α ), (14)

H_T,α(A_C) = (1/(α − 1)) ( 1 − Σ_{j=1}^{K} p_Cj^α ). (15)

The Rényi and Tsallis entropy measures coincide with the Shannon entropy when α goes to 1; therefore, Shannon's measure of entropy can be seen as a Rényi or a Tsallis entropy measure of order α = 1. If we consider the Rényi and the Tsallis entropy measures of order α = 2, we obtain

H_R,2(A_C) = − log2 ( Σ_{j=1}^{K} p_Cj^2 ), (16)

H_T,2(A_C) = 1 − Σ_{j=1}^{K} p_Cj^2. (17)

The entropy measures given in Equations (16) and (17) are very closely related to Equation (10), which measures the expected ratio of success in the classification of the random classifier X. Now, we have the following result, which relates the expected ratio of success of the random classifier X and the different entropy measures above of consequent A_C when it is binary.

Theorem 1. Let D and D* be two datasets with the same attributes and A_C a binary attribute which is considered the consequent. Then, the following statements hold:

1. H_R,2(A_C) in D is greater than H_R,2(A_C) in D* if and only if μ(X, D) < μ(X, D*).
2. H_T,2(A_C) in D is greater than H_T,2(A_C) in D* if and only if μ(X, D) < μ(X, D*).
3. H_S(A_C) in D is greater than H_S(A_C) in D* if and only if μ(X, D) < μ(X, D*).

Proof of Theorem 1. In order to prove the theorem, all we need is to prove statement 3, because the other two statements follow from the mathematical expressions of H_R,2 and H_T,2 and statement 3. Let p_C1, p_C2 and p*_C1, p*_C2 be two frequency distributions of A_C such that the entropy associated with the first is greater than the entropy associated with the second. Consider that p_C1 ≠ p*_C1; otherwise p_C2 = p*_C2 and the result immediately follows.
Since the entropy of the first frequency distribution is greater than the entropy of the second frequency distribution, and for a binary attribute the Shannon entropy is strictly decreasing in the distance of p_C1 from 1/2, we know that |p_C1 − 1/2| < |p*_C1 − 1/2|. On the other hand, we have that Σ_{j=1}^{2} p_Cj^2 = 2(p_C1 − 1/2)^2 + 1/2, which is strictly increasing in |p_C1 − 1/2|. After some calculations, we have that Σ_{j=1}^{2} p_Cj^2 < Σ_{j=1}^{2} (p*_Cj)^2. The proof of the converse follows similarly.
Theorem 1 cannot be extended to attributes with more than 2 possible values, as the following example shows: with K = 3, take the frequency distributions (0.7, 0.15, 0.15) in D and (0.5, 0.5, 0) in D*; the Shannon entropy of the first (≈1.18) is greater than that of the second (1), and yet μ(X, D) = 0.535 > 0.5 = μ(X, D*). On the other hand, if we consider the Rényi entropy measure when α goes to ∞, we obtain

H_R,∞(A_C) = − log2 ( max_j p_Cj ),

and results similar to the above can be proved. However, all Rényi entropy measures are correlated; therefore H_S, H_R,2, and H_R,∞ are also correlated.
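The quantities discussed in this section are straightforward to compute; the following Python sketch (our own illustration) evaluates the expected accuracies of the benchmarks X and V together with the Shannon, Rényi (α = 2), and Tsallis (α = 2) entropies for a class distribution:

```python
import math

def class_stats(p):
    """p: list of class relative frequencies (summing to 1).
    Returns the expected accuracy of the random benchmark X (sum of p_j^2),
    the accuracy of the majority benchmark V (max p_j), and three entropies."""
    mu_X = sum(q * q for q in p)                   # Eq. (10)/(11)
    mu_V = max(p)                                  # Eq. (12)
    h_shannon = -sum(q * math.log2(q) for q in p if q > 0)   # Eq. (13)
    h_renyi2 = -math.log2(mu_X)                    # Eq. (16): -log2(sum p^2)
    h_tsallis2 = 1 - mu_X                          # Eq. (17): 1 - sum p^2
    return mu_X, mu_V, h_shannon, h_renyi2, h_tsallis2

# Binary case: as entropy grows toward its maximum at p = (0.5, 0.5),
# the expected accuracy of both benchmarks shrinks (Theorem 1).
for p in [(0.9, 0.1), (0.7, 0.3), (0.5, 0.5)]:
    print(p, class_stats(p))
```

Note how H_T,2 = 1 − μ(X, D) makes the link between the order-2 Tsallis entropy and the random benchmark's expected accuracy explicit.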
In view of the analysis above, the entropy of attribute A_C is somehow captured by the random classifier X and the intuitive classifier V, in the sense that the higher the entropy, the lower the (expected) number of successes in the classification, and conversely. Therefore, the random classifier X and the intuitive classifier V can be used as benchmarks when evaluating other classifiers, taking into account the entropy of the consequent. Next, we define an evaluation measure based on the analysis above.

Definition 1. Let Z be a classifier. Given a dataset D and a consequent A_C, the performance of Z with respect to the random classifier X is given by

γ_X(Z, D) = (μ(Z, D) − μ(X, D)) / (1 − μ(X, D)) if μ(Z, D) ≥ μ(X, D),
γ_X(Z, D) = (μ(Z, D) − μ(X, D)) / μ(X, D) otherwise, (21)

where μ(X, D) = (1/M) Σ_{j=1}^{K} N_Cj p_Cj, such that M is the total number of predictions and N_Cj is, as in Equation (9), the number of predicted instances for which attribute A_C takes the value a_Cj, and μ(Z, D) is the ratio of correct classifications using classifier Z.
Note that the first case of the definition of the performance measure γ_X coincides with Scott's π. If we use the intuitive classifier V instead of X as the benchmark classifier, we obtain the performance measure γ_V. The evaluation measure γ_X (resp. γ_V) runs between −1 and 1, where −1 is the worst case, achieved when the classifier does not predict any instance correctly; 0 means that the performance is the same as that of the random classifier X (resp. the intuitive classifier V); and 1 is the best case, achieved when the classifier correctly classifies all instances. The intermediate values measure the proportion in which the classifier performs better (positive values) or worse (negative values) than the random classifier X (resp. V).
On the other hand, we can interpret the performance measure γ_X (resp. γ_V) in terms of proportional reduction of classification error with respect to the random classifier X (resp. V). Indeed, if we predict M instances, we can write the first case of Equation (21) as follows:

γ_X(Z, D) = (M μ(Z, D) − M μ(X, D)) / (M − M μ(X, D)). (22)

Now, we can write Equation (22) in the following way:

γ_X(Z, D) = ((M − M μ(X, D)) − (M − M μ(Z, D))) / (M − M μ(X, D)). (23)

Finally, Equation (23) can be interpreted as follows: M − M μ(X, D) is the expected number of classification errors of the random classifier X, and M − M μ(Z, D) is the number of classification errors of Z, so γ_X is the proportion of the benchmark's errors removed by Z. Thus, the first case of γ_X measures the proportional reduction of classification error when we use classifier Z with respect to using the random classifier X. The second case of γ_X measures the proportional reduction of classification success when we use classifier Z with respect to using the random classifier X. The same can be said when using the intuitive classifier V as benchmark.
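A small numeric sketch (our own illustration; the classifier accuracy mu_Z is a hypothetical value) of this proportional-reduction reading of the measures:

```python
def gamma(mu_Z, mu_B):
    """Performance of a classifier Z with accuracy mu_Z against a benchmark
    with (expected) accuracy mu_B (our reading of Definitions 1 and 2):
    positive values are the fraction of the benchmark's errors removed,
    negative values the fraction of its successes lost."""
    if mu_Z >= mu_B:
        return (mu_Z - mu_B) / (1 - mu_B)   # reduction of error, in [0, 1]
    return (mu_Z - mu_B) / mu_B             # reduction of success, in [-1, 0)

p = [0.7, 0.3]                 # class frequencies of the consequent
mu_X = sum(q * q for q in p)   # expected accuracy of the random benchmark: 0.58
mu_V = max(p)                  # accuracy of the majority benchmark: 0.7
mu_Z = 0.9                     # hypothetical accuracy of the classifier under evaluation
print(gamma(mu_Z, mu_X))       # ~0.762: Z removes about 76% of X's errors
print(gamma(mu_Z, mu_V))       # ~0.667: Z removes about 67% of V's errors
```

The same function applied with the accuracy of the intuitive classifier I as the benchmark value gives the measure Γ of Definition 2.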
Therefore, γ_X gives us information about how much a classifier Z improves or worsens the classification with respect to a classifier that decides the class randomly according to the frequency distribution of the classes. Likewise, γ_V gives us information about how much a classifier Z improves or worsens the classification with respect to a classifier that simply predicts the most likely class according to the frequency distribution of the classes. Since these two benchmark classifiers only use information related to the classes, these two measures indicate whether it is worthwhile to use more sophisticated classifiers that incorporate information from other attributes.
On the other hand, the measures γ_X and γ_V incorporate, in a way, the information on the entropy of the consequent into the evaluation of a classifier, but they do not take into account the rest of the attributes (the antecedents). Nevertheless, a similar analysis can be carried out by considering all possible different strings of attribute values, obtaining analogous results. On the other hand, the intuitive classification method described in Section 2.2 can be another way of taking into account all the attributes and the entropy of the dataset, since its definition is based on the repetition of instances, which is related to the entropy of the dataset. In particular, it is related to the conditional entropy of the attribute A_C given the remaining variables in the dataset. Thus, another evaluation measure of classifiers related to entropy can be obtained by using this intuitive classification method as a benchmark, its definition being analogous to those previously given. Below, we formally state the definition of this measure.

Definition 2. Let Z be a classifier. Given a dataset D and a consequent A_C, the performance of Z with respect to the intuitive classifier I is given by

Γ(Z, D) = (μ(Z, D) − μ(I, D)) / (1 − μ(I, D)) if μ(Z, D) ≥ μ(I, D),
Γ(Z, D) = (μ(Z, D) − μ(I, D)) / μ(I, D) otherwise,

where μ(I, D) is the ratio of correct classifications using classifier I, and μ(Z, D) is the ratio of correct classifications using classifier Z.
The interpretation of Γ is completely analogous to that of γ_X and γ_V above, only changing the random classifier X and the intuitive classifier V for the intuitive classifier I. However, it gives some extra information about classifiers: since the intuitive classifier I uses all the information in the dataset, Γ indicates how relevant it is to use more sophisticated classifiers.

Computer-Based Experiments: Design and Results
In this section, we illustrate how the evaluation measures introduced in Section 2 work. To that end, we design an experiment in which we consider five entropy scenarios for a binary attribute (the consequent), and for each of those scenarios we study 31 combinations of explanatory attributes (the antecedents). Thus, we can give a better idea of how these evaluation measures work and how they measure the performance of classifiers in different entropy situations. We then go further and carry out an extensive comparison of four classifiers using 11 different datasets, whose results are concisely presented.

Datasets and Scenarios
We start from the hypothesis of working in a classification context where the target to be predicted is discrete, and more specifically binary, although a multi-class target variable could also be considered. A well-known dataset from the UCI Machine Learning Repository [87], named "thyroid0387.data" [88], has been chosen for the most intensive experiment.
This dataset has been widely used in the literature in problems related to the field of classification. Since it is only used in this paper as an example, and we are not interested in the clinical topic itself that the data collect, in order to facilitate the experiment of this study and make it exhaustive, the dataset has been minimally preprocessed as follows:

• Headers have been added and renamed.

• The numeric attributes have been removed and we have left only those which are nominal.

• The class variable has been recoded in positive and negative cases (the original sample has several types of positive instances).
Finally, the dataset used to perform the experiment has the following features: the target variable used to classify, which corresponds to a clinical diagnosis, is unbalanced, as it has a positive value in 2401 tuples and a negative value in 6772. From these data, we consider five types of possible scenarios with different ratios between positive and negative values (see Table 1). The remaining 10 datasets used in the most extensive experiment are also from the UCI Machine Learning Repository [87]. The following modifications, common to all of them, have been made.

1.
A header row has been added to all the datasets that lacked one, taking into account the specifications of the "Attribute Information" section of each of these UCI repository datasets.

2.
The configuration used in Weka to discretize was the parameter "bins" = 5 (to obtain 5 groups) and the parameter "useEqualFrequency" = true (so that the groups of data obtained were of equal size).

3.
When discretizing in Weka (filter→unsupervised→discretize), the results obtained were numerical intervals, which were later renamed.
In particular, apart from the dataset already mentioned, we have used the datasets whose main features are summarized in Table 2. In addition, some specific preprocessing was carried out on the datasets "Adult.data" [93] and "Bank marketing" [95,96]. In "Adult.data", the rows with missing values were removed and three attributes were discarded (capital-gain, capital-loss, native-country); in "Bank marketing", the selected dataset was "bank-full.csv", and 6 attributes were discarded (balance, day, duration, campaign, pdays, and previous).
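The equal-frequency discretization applied to the numeric attributes can be sketched as follows (our own Python approximation of the effect of Weka's filter, not its actual code; tie handling at bin boundaries may differ):

```python
def equal_frequency_bins(values, bins=5):
    """Sketch of equal-frequency discretization (the effect of Weka's
    unsupervised Discretize filter with bins=5 and useEqualFrequency=true):
    sorts the numeric values and cuts them into groups of (roughly) equal
    size, returning a bin label for each original value."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        # Assign the bin by rank so every bin receives about the same
        # number of values; ties across a boundary may be split.
        labels[i] = min(rank * bins // len(values), bins - 1)
    return labels

values = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
print(equal_frequency_bins(values, bins=5))
```

The resulting integer labels play the role of the renamed nominal intervals used in the experiments.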

Experimental Design
The experiment consists of determining the accuracy of a heuristic classifier, the aforementioned J48, in comparison with three benchmark classifiers: the random classifier and two intuitive classifiers. As explained in the previous section, these three classifiers capture, to a certain extent, the entropy present in the dataset. Therefore, we provide evaluation measures of the heuristic classifier that take into account the entropy of the system. In this sense, we evaluate how the classifier performs in terms of the improvement (or deterioration) obtained with respect to three classifiers that can be considered benchmarks and that are based on the simple distribution of the data in the dataset, and thus on its entropy.
On the other hand, we are also interested in observing the differences between the three evaluation measures of classifiers introduced in the previous section, and in the effect that considering more or less information from the dataset has when classifying instances. To do this, we consider the five scenarios described in Table 1, which have different levels of Shannon entropy in the consequent. For each of these scenarios, we follow the process depicted in Figure 1.
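Since the scenarios are distinguished by the Shannon entropy of the consequent attribute, a minimal sketch of that quantity may be helpful. The 3:1 positive/negative ratio below is an illustrative example in the spirit of Table 1, not the paper's actual data.

```python
from math import log2
from collections import Counter

# Shannon entropy of a target attribute, given its list of class labels.
def shannon_entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# Illustrative 3:1 positive/negative ratio:
labels = ["pos"] * 3 + ["neg"] * 1
print(round(shannon_entropy(labels), 4))  # 0.8113
```

A perfectly balanced binary consequent gives the maximum entropy of 1 bit; the more imbalanced the scenario, the lower the entropy.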
First, starting from the original data sample and fixing the consequent (or target) variable A C to be studied, the five variables (attributes) most correlated with the target variable are selected. They are then sorted (A 1 , A 2 , A 3 , A 4 , A 5 ), that is, we determine which are more and which are less correlated with the consequent, using the gain ratio attribute method described in Section 2.1. In Table 3, we show the gain ratio scores observed for each of the five scenarios (S1, S2, S3, S4, S5) considered. At this point, we would like to emphasize once again that our purpose is not to analyze a particular problem, but only to use a dataset to analyze the evaluation measures introduced in this paper, and to show an analysis of heuristic classifiers that considers the entropy characteristics of the dataset. For this reason, attributes A 1 , . . . , A 5 are not necessarily the same variables, nor in the same order, in the five scenarios. We simply call A 1 , generically, the attribute best correlated with the target variable in each scenario, even if it is not the same variable in each of them. Accordingly, the other attributes occupy the second to fifth positions in the correlation ranking with the consequent attribute in each scenario, always according to the gain ratio attribute evaluation. In each scenario, these five attributes are used as predictor or explanatory variables (antecedents) to generate the classification models. It is not an objective of this work to delve into the different methods of feature (attribute) subset selection; we simply use one of them, always the same (gain ratio attribute), in order to work only with those attributes that are really significant in each case.
Reducing the size of the problem from 22 to 5 explanatory variables allows a comprehensive experiment with which to illustrate and analyze the two introduced evaluation measures, and to show a way of analyzing the performance of a heuristic classifier under different degrees of entropy in the dataset. To select the five best attributes, we use the Weka software [75,82,83], in particular its Select attributes function, with GainRatioAttributeEval as the attribute evaluator, Ranker as the search method, and cross-validation as the attribute selection mode. Note that Weka gives two measures of the relevance of the (antecedent) attributes: the average merit and its standard deviation, and the average rank and its standard deviation. The first refers to the mean of the correlations measured with GainRatioAttributeEval over 10 validation folds (although 5 would have sufficed, since only the first 5 attributes are wanted). The average rank refers to the average position in which each attribute was ranked in each of the ten folds. See [75,82] for details about Weka.
Once the five best attributes are chosen, the next step is to establish the 31 possible combinations of the set of predictor variables. These 31 combinations will be the antecedent sets considered in a set of classification rules or in a decision tree. That is, 31 classification studies are carried out to predict the consequent attribute A C based on each of these combinations of explanatory variables (see Table 4).
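The count of 31 follows from the non-empty subsets of five attributes (2^5 − 1 = 31). The enumeration can be sketched as follows; the attribute names A1–A5 are the generic labels used above.

```python
from itertools import combinations

# The 2^5 - 1 = 31 non-empty subsets of the five selected attributes,
# each one the antecedent set of one classification study.
attributes = ["A1", "A2", "A3", "A4", "A5"]
antecedent_sets = [
    list(combo)
    for r in range(1, len(attributes) + 1)
    for combo in combinations(attributes, r)
]
print(len(antecedent_sets))  # 31
```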
For each of these attribute combinations we generate 100 subsamples to avoid possible biases in the selection of records.
Third, for each of the scenarios described (Table 1), each of the 31 combinations of antecedent attributes (Table 4), and each of the 100 random subsamples, classification models are generated, both with the two intuitive classifiers and with the heuristic method J48. Thus, we have carried out 15,500 (5 × 31 × 100) classification models with the J48 method, as well as with our own implementation of the intuitive classifier I.
Finally, for both classifiers we calculate their accuracies from their corresponding confusion matrices by using cross-validation. To calculate the success ratio µ(X , D) of the random classifier X , we directly use the theoretical result given by Equation (7), and similarly for the intuitive classifier V using Equation (12), while the success ratio µ(I, D) of the intuitive classifier I is calculated from the confusion matrix obtained by cross-validation. Likewise, the success ratio µ(Z, D) of the heuristic classifier, in our case J48, is calculated from the confusion matrix obtained by cross-validation. From these results, the evaluation measures introduced in Section 2.4 can be calculated.
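The computation of accuracy from a confusion matrix can be sketched as below. The layout (rows = actual classes, columns = predicted classes) and the example counts are assumptions for illustration, not the paper's results.

```python
# Minimal sketch: success ratio (accuracy) from a confusion matrix.
# Assumed layout: rows = actual classes, columns = predicted classes.

def accuracy(confusion):
    """Fraction of instances on the main diagonal (correct predictions)."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total

# Hypothetical 2x2 matrix for a positive/negative target:
cm = [[50, 10],
      [5, 35]]
print(accuracy(cm))  # 0.85
```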
Therefore, we have an experimental design with two factors (entropy scenarios and attribute combinations), with 100 replications for each crossed combination of factors. This allows us to analyze in depth how a heuristic classifier performs when we consider both the entropy of the consequent variable and the number of attributes used as antecedents.
The experiment thus illustrates both how the evaluation measures work and how to analyze the effects of entropy and of the number of attributes selected to predict the consequent variable on the performance of a heuristic classifier.

Results
After building all the classification models described in the previous section for each of the five scenarios, each model is subjected to a cross-validation test, and confusion matrices are determined. With this information we can calculate performance measures for the heuristic classifier J48. The simplest performance measure is accuracy, which measures the success rate in prediction. Table 5 shows the accuracy of J48 and the intuitive classifier I for each of the five scenarios considered.
Table 5. Accuracy measures for the random classifier X , the intuitive classifier V, J48, and the intuitive classifier I when using combination of attributes A31 for each scenario. The accuracy and the mean absolute error are calculated as the averages over the 100 subsamples. Results are presented as accuracy ± mean absolute error.

(Table 5 columns: Scenario, E(acc(X (D))), acc(V (D)), acc(J48(D)), acc(I (D)); the numerical entries for S1–S5 are not reproduced here.)
In Table 5, we observe that, for this dataset, the performance of J48 is on average slightly better than that of the intuitive classifier I, but the mean absolute errors for J48 are worse than those of the intuitive classifier I, except for S5. However, this comparison could be analyzed in more detail by considering other aspects, such as the number of times that one method beats the other, or the entropy. Likewise, the improvements with respect to the intuitive classifier V are not very large, which means that either the model is not very good, or that in this specific case the use of information from other attributes and/or more sophisticated classifiers does not provide noticeable improvements over the intuitive classifier V.
We now say that a classifier beats another each time the first correctly classifies more items from the test set than the second. When the reverse occurs, we say that the second classifier beats the first; when the difference between the items correctly classified by both methods is 0, we say that a draw has occurred. The number of times that J48 and the intuitive classifier I win, for each scenario and each combination of the best five attributes, is shown in Tables A1-A5 in Appendix A. Table 6 summarizes the percentage of times each method wins in each scenario. In Table 6, we observe that J48 classifies better than the intuitive method I in 47.48% of the instances, while the intuitive method I classifies better than J48 in 24.63% of the instances. J48 classifies particularly better in scenarios S5 and S3, while the intuitive method I classifies better in scenarios S2 and S4. Moreover, J48 clearly beats the intuitive classifier V in all scenarios except S1, while the intuitive method I classifies better than the intuitive classifier V in scenarios S2, S4, and S5. Therefore, in absolute terms we can say that J48 performs reasonably well on the dataset used. However, in addition to knowing whether one method classifies better than another, it is even more relevant to know how much better it classifies in relative terms, as mentioned above. In this sense, having a benchmark is important to assess how much improvement there is when compared to it. In Tables A1-A5 in Appendix A, we can find the evaluation measures introduced in Section 2.4 applied to the average of the results obtained over the 100 subsamples for each combination of the best attributes when J48 and the intuitive classifier are used. Table 7 summarizes these measures for each of the five scenarios considered. First, note that in this case the measure γ X coincides in all scenarios with Scott's π.
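The win/draw/loss comparison just described can be sketched directly. The counts below are hypothetical examples, not the paper's results.

```python
# Sketch of the win/draw/loss comparison: classifier A "beats" B on a
# subsample when A correctly classifies more test items than B.

def compare(correct_a, correct_b):
    """Tally wins for A, wins for B, and draws over paired subsamples."""
    wins_a = sum(a > b for a, b in zip(correct_a, correct_b))
    wins_b = sum(b > a for a, b in zip(correct_a, correct_b))
    draws = len(correct_a) - wins_a - wins_b
    return wins_a, wins_b, draws

# Hypothetical counts of correctly classified items on 5 subsamples:
print(compare([90, 85, 88, 92, 80], [88, 85, 90, 91, 79]))  # (3, 1, 1)
```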
On the other hand, beyond evaluating which method classifies best simply in terms of the number of successes, in Table 7 we observe that the performances of J48 and the intuitive classifier I are very similar when compared with the random classifier X and the intuitive classifier V for each of the scenarios (columns corresponding to evaluation measures γ X and γ V ). This is clearly reflected in the evaluation measure Γ of J48, which is the result of the comparison with the intuitive method I (see Definition 2). We also observe that, for the dataset used in the experiment, the performance of the classifiers improves as the entropy of the consequent decreases, i.e., the lower the entropy, the higher the performance of both classifiers with respect to the random classifier X .
Moreover, if we look, for example, at scenario S3, γ V (J48) tells us that J48 improves the performance of the intuitive classifier V, which only uses the information provided by the frequency distribution of the target attribute, by as much as 5%, by using the information provided by attributes other than the target attribute. This percentage can therefore be interpreted as the exploitation that J48 makes of this additional information. If we now look at Γ(J48), we see that this improvement reaches almost 8.5% with respect to the intuitive classifier I. This percentage can be interpreted as how much better J48 exploits the information than the intuitive classifier I does. At this point, one could already assess, taking into account the practical implications of better performance, whether the use of a classifier more sophisticated than the two intuitive classifiers is worthwhile.
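The exact definitions of these measures live in Section 2.4 and are not reproduced here; as a rough sketch, a proportional-reduction-of-error measure takes the standard form gamma = (err_benchmark − err_classifier) / err_benchmark. The accuracies below are hypothetical, chosen only so the result echoes an improvement of about 5%.

```python
# Sketch of a proportional-reduction-of-error (PRE) measure against a
# benchmark classifier. This is the generic PRE form, assumed here; the
# paper's own definitions are given in its Section 2.4.

def pre(acc_classifier, acc_benchmark):
    """Fraction of the benchmark's error removed by the classifier."""
    err_c = 1.0 - acc_classifier
    err_b = 1.0 - acc_benchmark
    return (err_b - err_c) / err_b

# Hypothetical accuracies: benchmark at 0.70, evaluated classifier at 0.715:
print(round(pre(0.715, 0.70), 3))  # 0.05
```

A value of 0 means no improvement over the benchmark, 1 means perfect classification, and negative values mean the classifier does worse than the benchmark.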
Comparison with a benchmark is therefore important, because performance measures often do not reflect what is actually gained with respect to a simple, even random, way of classifying. Thus, the use of measures based on simple benchmark classifiers that somehow capture the entropy of the dataset seems appropriate and provides relevant information on the performance of classifiers. In particular, the use of both intuitive classifiers as benchmarks seems reasonable: although as classifiers they have been discarded in favor of classifiers that use more modern and elaborate technologies, they are still easy to understand and intuitive enough to serve at least as benchmark classifiers when measuring performance, just as the random classifier is commonly used in machine learning.
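The intuitive classifiers are defined formally earlier in the paper; as a rough illustration of the frequentist idea behind the intuitive classifier I (our reconstruction, not the paper's code), one can predict, for each profile of antecedent attribute values seen in training, the most frequent class observed for that profile. The tiny dataset below is hypothetical.

```python
from collections import Counter, defaultdict

# Sketch of the frequentist idea behind the intuitive classifier I:
# predict the most frequent class seen for each antecedent profile.

def train_intuitive(rows, labels):
    table = defaultdict(Counter)
    for row, label in zip(rows, labels):
        table[tuple(row)][label] += 1
    # Map each profile to its majority class.
    return {profile: counts.most_common(1)[0][0]
            for profile, counts in table.items()}

# Hypothetical training data with two antecedent attributes:
X = [("a", "x"), ("a", "x"), ("a", "y"), ("b", "y")]
y = ["pos", "pos", "neg", "neg"]
model = train_intuitive(X, y)
print(model[("a", "x")])  # pos
```

The intuitive classifier V is simpler still: it ignores the antecedents entirely and always predicts the overall majority class of the target attribute.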

Extensive Experiment
In this subsection, we present the results of an extensive experiment in which we consider four heuristic classifiers besides the intuitive classifier I, and 11 datasets. In particular, we consider four classification algorithms implemented in Weka [75,81-83], J48, Naïve Bayes, SMO, and Random Forest, which have been briefly described in Section 2.3, and 11 datasets from the UCI Machine Learning Repository [87], which have been described in Section 3.1.
The purpose of this extensive analysis is to check whether the results obtained in the previous experiment hold for other classifiers and other datasets. The first step in all cases is to select the 5 most relevant attributes by using the feature selection method described in Section 2.1. The results are shown in Table 8.
Then the five classifiers are applied with the selection of attributes in Table 8. We calculate their accuracies from their corresponding confusion matrices by using cross-validation. The resulting accuracies for each classifier and dataset are shown in Table 9. In Tables 10 and 11, we present the results obtained when γ X and γ V are used as evaluation performance measures.
As mentioned before, the γ X measure is closely related to the κ and π measures. In Tables 10 and 11, we observe that a higher entropy in the consequent attribute does not imply a worse performance of the classifiers [70]. This is not surprising, since all classifiers use not only the frequency distribution information of the consequent attribute, but also the information provided about it by the remaining attributes in the dataset. Therefore, it seems appropriate to use the entropy of the entire dataset as a reference when assessing the performance of the classifiers. This entropy is somehow captured by the intuitive classifier I, as explained earlier. In Table 12, we present the results obtained when Γ is used as the evaluation performance measure.
The intuitive classifier I will have better accuracy the lower the conditional entropy of the target attribute given the entire dataset (or the subset of selected attributes, if feature selection is previously carried out); therefore, it will be more difficult for a classifier to significantly improve on the classification results of this intuitive classifier. On the other hand, it is necessary to emphasize that the selection of the best subset of attributes has been relevant throughout the classification process, since the method used is based on the reduction of entropy. In this sense, Γ measures how much a classifier contributes to the complete classification procedure with respect to what is contributed by the attribute selection process. Therefore, Γ offers information different from that of other performance measures of classifiers, which we consider interesting. The aim, therefore, is not to substitute any known performance measure, but to provide a measure of a different aspect of the performance of a classifier.

(Table 10 columns: #, Dataset, Entropy, γ X (I), γ X (J48), γ X (SMO), γ X (NB), γ X (RF); the numerical entries are not reproduced here.)
Table 11. Evaluation measure γ V for the five classifiers and the 11 datasets, and the accuracy of the intuitive classifier V.
In Tables 11 and 12, we observe that the performance measures γ V and Γ provide complementary information about the classifiers. In Table 11, we can observe how each classifier takes advantage of the information provided by the attributes in the dataset to better classify the target attribute, while in Table 12 we can observe how much better than the intuitive classifier I the classifiers are at using the information in the dataset to correctly predict the classes of the target attribute.

Discussion and Conclusions
In the experiment, we have shown that both feature selection and the entropy of the consequent attribute may be relevant to the performance of a classification algorithm. Therefore, it appears to be of interest to consider the diversity of the response variable or of the dataset when evaluating a classifier. In addition, the effect of entropy is observed, in the sense that the lower the entropy, the higher the success rate of the classifications, which seems intuitively reasonable. On the other hand, we observe in the experiment that choosing a greater number of features does not always yield better performance of the classification algorithm, so this kind of analysis is relevant when selecting an adequate number of features, above all when the feature selection algorithm has not used the classifier for optimal selection. A rigorous analysis of the latter can be found in [104].
Performance measures of classifiers that only use the results of the classification algorithm itself, such as the ratio of successes (accuracy), do not really provide information on how well the algorithm classifies relative to unsophisticated methods. For this reason, the use of relative measures based on comparison with simple benchmark classifiers is important, because they give us information about the relationship between the gain in correctly classified instances and the effort made in designing new classifiers with respect to the use of simple, intuitive classifiers; i.e., we can better assess the real improvement provided by the classification algorithm. Moreover, if the benchmark classifier incorporates some type of additional information, such as different aspects of the entropy of the whole dataset or of the consequent attribute, the information provided by the performance measure will be even more relevant.
In this paper, three simple classifiers have been used: the random classifier X , the intuitive classifier V, and the intuitive classifier I. The first two simply use the distribution of the consequent attribute to classify, and we have shown that they are closely related to the entropy of that attribute, while the third uses the entire distribution of the whole dataset to classify, and its performance is close to the conditional entropy of the consequent attribute given the remaining attributes (or a subset of attributes, if feature selection is previously applied) in the dataset. These three classifiers have been used as references to introduce three measures of the performance of classifiers, which quantify how much a classifier improves (or worsens) over these simple classifiers related to certain aspects of the entropy of the consequent attribute within the dataset. Therefore, they are measures that reflect the performance of heuristic classifiers while taking entropy into account in some way; this is important because, as seen in the experiment, the greater the entropy, the greater the difficulty of classifying correctly, and accounting for it gives a better idea of the true performance of a classifier. Likewise, the three performance measures can be interpreted in terms of the proportional reduction of classification error, which makes them easily understandable. In particular, γ X is closely related to the well-known κ and π measures, and provides information on how much a classifier improves the classification results relative to a random classifier that only takes into account the information contained in the frequency distribution of the target attribute classes.
γ V gives information on how well a classifier is able to use the information contained in the whole dataset (or a subset of it) to improve the classification results relative to a classifier that only uses the information of the frequency distribution of the target attribute classes and always predicts the most likely class. Lastly, Γ provides information on how much a classifier improves the classification results by using a more elaborate way of managing the data than the intuitive classifier I, which simply predicts the most likely class given a particular profile of attributes in the dataset.
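Since γ X is reported to be closely related to Cohen's κ, a minimal sketch of the standard κ formula may be useful for reference. This is the textbook definition for a confusion matrix (rows = actual, columns = predicted), not the paper's own definition of γ X; the example matrix is hypothetical.

```python
# Standard Cohen's kappa from a confusion matrix (rows = actual,
# columns = predicted). Reference formula, not the paper's gamma_X.

def cohen_kappa(cm):
    n = sum(sum(row) for row in cm)
    # Observed agreement: fraction on the main diagonal.
    po = sum(cm[i][i] for i in range(len(cm))) / n
    # Chance agreement: product of marginal row and column proportions.
    pe = sum((sum(cm[i]) / n) * (sum(row[i] for row in cm) / n)
             for i in range(len(cm)))
    return (po - pe) / (1 - pe)

# Hypothetical 2x2 matrix:
cm = [[50, 10],
      [5, 35]]
print(round(cohen_kappa(cm), 3))  # 0.694
```

Like the γ measures, κ can be read as a proportional reduction of error relative to chance agreement.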
To conclude, although the two intuitive classifiers used in this paper were long ago discarded in favor of more modern and sophisticated classifiers, we believe that they are still useful as benchmark classifiers, just as the random classifier is commonly used in machine learning, and hence for designing performance measures based on them which, as we have shown throughout this work, provide information about the performance of classifiers that differs from that of other performance measures. Table A1. Scenario S1, 3200 rows, 3:1 ratio of positive/negative values for the target variable, 100 subsamples per combination; the gain ratio attribute evaluations of the five best variables are 0.036, 0.037, 0.033, 0.034, and 0.029 (from most to least relevant).