A Comparison of Four Approaches to Discretization Based on Entropy †

Abstract: We compare four discretization methods, all based on entropy: the original C4.5 approach to discretization, two globalized methods, known as equal interval width and equal frequency per interval, and a relatively new method for discretization called multiple scanning. All four methods were used with the C4.5 decision tree generation system. The main objective of our research is to compare the quality of these four methods using two criteria: an error rate evaluated by ten-fold cross-validation and the size of the decision tree generated by C4.5. Our results show that multiple scanning is the best discretization method in terms of the error rate and that decision trees generated from datasets discretized by multiple scanning are simpler than decision trees generated directly by C4.5 or generated from datasets discretized by either of the two globalized discretization methods.


Introduction
Mining data with numerical attributes requires discretization. Among many discretization techniques, discretization based on entropy is one of the most successful methods. Entropy was used for discretization applied to ranking data [7]. A special kind of discretization for data with many attributes was presented in [25]. Discretization combined with semi-supervised learning was presented in [3]. Many papers emphasizing the importance of discretization to data mining were recently published [13,17,18,28,29,31].
In this paper, we present the results of our experiments conducted on 17 numerical datasets using the C4.5 decision tree generation system, combined with four discretization methods: the original C4.5 approach to discretization, two globalized methods, known as equal interval width and equal frequency per interval, and a relatively new method for discretization called multiple scanning. The original approach to discretization included in the C4.5 system, as well as discretization based on equal interval width and equal frequency per interval, are well known. Multiple scanning, introduced in [15,32], was very successful when combined with rule induction and the classification system of LERS (learning from examples based on rough sets) [33].
In multiple scanning, during every scan, the entire attribute set is analyzed. For each attribute, the best cut point is selected. At the end of a scan, some sub-tables that still need discretization are created. The entire attribute set of any sub-table is scanned again, and the best corresponding cut points are selected. The process continues until the stopping condition is satisfied or the required number of scans is reached. If necessary, discretization is completed by another discretization technique, called dominant attribute [15,32]. In the dominant attribute method, initially we select the best attribute. For this attribute, the best cut point is selected using conditional entropy. This process continues until the same stopping criterion is satisfied. The stopping criterion used in this paper is based on rough set theory.
The main objective of our research is to compare the quality of these four discretization methods using two criteria: an error rate evaluated by ten-fold cross-validation and the size of the decision tree generated by C4.5. Experimental results presented in [32] show that multiple scanning is the best discretization method among these four discretization methods. In [32], four discretization techniques were compared using a rule-based methodology. Experiments were conducted using the MLEM2 (modified learning from examples module, Version 2) rule induction algorithm [34] and the LERS classification system. There is a possibility that the results of [32] depend on the choice of experimental setup. Therefore, to remove this bias, we changed the original setup and conducted new experiments using the standard C4.5 decision tree generation methodology. Our new results fully support the results of [32]. For 17 numerical datasets, four sets of experiments were conducted: first, the C4.5 system was used to compute an error rate using ten-fold cross-validation; then, the same datasets were discretized using two globalized methods (equal interval width and equal frequency per interval) and multiple scanning, and for such discretized datasets, the same C4.5 system was used to establish an error rate.
The same methodology, based on computing the C4.5 error rate, was used in [23] to compare nine successful and well-known discretization methods using 11 datasets. Seven of these 11 datasets (australian, bupa, glass, ionosphere, iris, pima and wine recognition) were also used in our experiments. For each of these seven datasets, the best result accomplished using our methods is better than the corresponding best result cited in [23]. Thus, our choice of the four discretization methods is well justified: we used very efficient methods. Our results show that, in terms of the error rate computed by ten-fold cross-validation, the multiple scanning discretization technique is significantly better than the internal discretization used in C4.5 and the two globalized discretization methods, equal interval width and equal frequency per interval (two-tailed test, 5% level of significance). Additionally, decision trees generated from data discretized by multiple scanning are significantly simpler than decision trees generated directly by C4.5 and decision trees generated from datasets discretized by both globalized discretization methods.
The main idea of the multiple scanning method, giving the same chance to all attributes, is highly successful. In each consecutive step of this method, every attribute is taken into account. In other discretization methods, some attributes may be eliminated to begin with.

Discretization
Let $a$ be a numerical attribute with domain $[a_i, a_j]$. A partition of the domain $[a_i, a_j]$ into $k$ intervals:

$$\{[a_{i_0}, a_{i_1}), [a_{i_1}, a_{i_2}), \ldots, [a_{i_{k-2}}, a_{i_{k-1}}), [a_{i_{k-1}}, a_{i_k}]\},$$

where $a_{i_0} = a_i$, $a_{i_k} = a_j$, and $a_{i_l} < a_{i_{l+1}}$ for $l = 0, 1, \ldots, k-1$, determines a discretization of $a$. The numbers $a_{i_1}, a_{i_2}, \ldots, a_{i_{k-1}}$ are called cut points. In this paper, corresponding intervals are denoted as follows:

$$a_{i_0}..a_{i_1},\ a_{i_1}..a_{i_2},\ \ldots,\ a_{i_{k-1}}..a_{i_k}.$$

An example of a dataset with numerical attributes is presented in Table 1. In this table, all cases are described by variables called attributes and one variable called a decision. The set of all attributes is denoted by $A$. The decision is denoted by $d$. The set of all cases is denoted by $U$. In Table 1, the attributes are length, width and height, while the decision is quality. Additionally, $U = \{1, 2, 3, 4, 5, 6, 7\}$.
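As a small illustration of the interval notation, the following sketch maps a numerical value to the label of the interval that contains it. The function name `interval_label` and the sample domain 4.0 to 4.7 are illustrative assumptions, not taken from Table 1; a cut point itself is assigned to the interval on its right, which is one common convention.

```python
from bisect import bisect_right

def interval_label(value, low, high, cut_points):
    """Return the label 'lo..hi' of the interval containing value.

    low and high are the endpoints of the attribute domain and
    cut_points are the interior cut points (all hypothetical inputs).
    """
    if not low <= value <= high:
        raise ValueError("value lies outside the attribute domain")
    bounds = [low] + sorted(cut_points) + [high]
    # Position of the interval; clamp so that value == high
    # falls into the last (closed) interval.
    idx = min(bisect_right(bounds, value) - 1, len(bounds) - 2)
    return f"{bounds[idx]}..{bounds[idx + 1]}"

# Hypothetical domain [4.0, 4.7] with cut points 4.1 and 4.5.
print(interval_label(4.3, 4.0, 4.7, [4.1, 4.5]))  # 4.1..4.5
```

With $k - 1$ cut points the function distinguishes exactly $k$ labels, matching the definition of a discretization given above.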
For a subset $S$ of the set $U$ of all cases, an entropy of a variable $v$ (attribute or decision) with values $v_1, v_2, \ldots, v_n$ is defined by the following formula:

$$H_S(v) = -\sum_{i=1}^{n} p(v_i) \cdot \log p(v_i),$$

where $p(v_i)$ is a probability (relative frequency) of value $v_i$ in the set $S$, $i = 1, 2, \ldots, n$. All logarithms in this paper are binary. The conditional entropy of the decision $d$ given an attribute $a$ is:

$$H_S(d|a) = -\sum_{j=1}^{m} p(a_j) \cdot \sum_{i=1}^{n} p(d_i|a_j) \cdot \log p(d_i|a_j),$$

where $p(a_j)$ is the probability of value $a_j$ of the attribute $a$ and $p(d_i|a_j)$ is the conditional probability of the value $d_i$ of the decision $d$ given $a_j$; $a_1, a_2, \ldots, a_m$ are all values of $a$, and $d_1, d_2, \ldots, d_n$ are all values of $d$. Discretization based on the conditional entropy of the concept given the attribute is considered to be one of the most successful discretization techniques [2,5,6,9,11,12,14,15,19,24,26,27].
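The two definitions can be sketched in a few lines of Python. The function names and the sample values are assumptions made for illustration; probabilities are relative frequencies and logarithms are binary, as in the text.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Entropy of a sequence of values, using relative frequencies."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def conditional_entropy(attribute, decision):
    """Conditional entropy of the decision given the attribute.

    attribute and decision are parallel sequences, one entry per case.
    """
    n = len(attribute)
    total = 0.0
    for a_value, count in Counter(attribute).items():
        # Decision values of the cases on which the attribute equals a_value.
        block = [d for a, d in zip(attribute, decision) if a == a_value]
        total += (count / n) * entropy(block)
    return total

# Hypothetical four-case example.
attribute = ["low", "low", "high", "high"]
decision = ["good", "bad", "good", "good"]
print(conditional_entropy(attribute, decision))  # 0.5
```

The inner sum of the conditional entropy formula is the entropy of the decision restricted to one attribute value, which is why `conditional_entropy` can reuse `entropy` directly.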
Let $a$ be an attribute and $q$ be a cut point that splits the set $S$ into two subsets, $S_1$ and $S_2$. The conditional entropy $H_S(d|q)$ is defined as follows:

$$H_S(d|q) = \frac{|S_1|}{|S|} H_{S_1}(d) + \frac{|S_2|}{|S|} H_{S_2}(d),$$

where $|X|$ denotes the cardinality of the set $X$. The cut point $q$ for which the conditional entropy $H_S(d|q)$ has the smallest value is selected as the best cut point.
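A minimal sketch of best cut point selection follows. It assumes that candidate cut points are the midpoints between consecutive distinct attribute values, which the text does not state explicitly; the function names and data are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def cut_point_entropy(values, decisions, q):
    """Size-weighted entropy of the decision in the two subsets split by q."""
    left = [d for v, d in zip(values, decisions) if v < q]
    right = [d for v, d in zip(values, decisions) if v >= q]
    n = len(decisions)
    return sum(len(part) / n * entropy(part) for part in (left, right) if part)

def best_cut_point(values, decisions):
    """Return (q, entropy) for the candidate with the smallest entropy.

    Candidates are midpoints of consecutive distinct values (an assumption).
    Returns None when the attribute has a single distinct value.
    """
    distinct = sorted(set(values))
    candidates = [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]
    if not candidates:
        return None
    scored = [(q, cut_point_entropy(values, decisions, q)) for q in candidates]
    return min(scored, key=lambda pair: pair[1])

# Hypothetical data: the split at 2.5 separates the two concepts perfectly.
print(best_cut_point([1.0, 2.0, 3.0, 4.0], ["a", "a", "b", "b"]))
```

When several candidates tie, `min` keeps the first one in sorted order; the original method may break ties differently.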

Equal Interval Width and Equal Frequency per Interval
Some discretization methods are obvious. To this category belong the equal interval width and equal frequency per interval methods [14]. These methods are applied to a single attribute at a time, so they are called local [14]. Note that a discretization method is called global if it depends on all attributes [14]. In both local methods, the user must specify a positive integer $k$, the number of intervals required for discretization. In the former method, the domain of a numerical attribute $a$ should be divided into $k$ intervals of approximately equal width. In the latter method, the domain of the numerical attribute $a$ should be divided into $k$ intervals, each containing approximately an equal number of cases.
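Both local methods can be sketched as functions that return the $k - 1$ cut points for one attribute. The function names, the midpoint placement of equal-frequency cut points and the sample values are assumptions made for this illustration.

```python
def equal_width_cut_points(values, k):
    """k - 1 cut points dividing [min, max] into k intervals of equal width."""
    low, high = min(values), max(values)
    width = (high - low) / k
    return [low + i * width for i in range(1, k)]

def equal_frequency_cut_points(values, k):
    """k - 1 cut points so that each interval holds about len(values) / k cases.

    Assumes len(values) >= k; repeated values can make the counts uneven.
    """
    ordered = sorted(values)
    n = len(ordered)
    cuts = []
    for i in range(1, k):
        pos = i * n // k  # index of the first case of the next interval
        # Place the cut halfway between the two neighbouring cases.
        cuts.append((ordered[pos - 1] + ordered[pos]) / 2)
    return cuts

values = [1, 2, 3, 4, 10, 20]  # hypothetical attribute values
print(equal_width_cut_points(values, 2))      # [10.5]
print(equal_frequency_cut_points(values, 2))  # [3.5]
```

The example shows how the two methods diverge on skewed data: equal width leaves five cases in the first interval, while equal frequency puts three cases in each.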
These two methods were converted to global methods, using the idea of entropy, in [5]. First, all numerical attributes are discretized using $k = 2$. After that, we need to compute the level of consistency for the set of all discretized attributes. If the level of consistency satisfies the requirement, the discretization is done. If not, the worst attribute must be selected for further discretization.
For an attribute $a$, let $a^d$ denote the discretized attribute. Let $A^d$ denote the set of all discretized attributes. For any partially-discretized attribute $a^d$, we define a measure of quality, called the average block entropy, in the following way:

$$M(a^d) = \frac{\sum_{B \in \{a^d\}^*} \frac{|B|}{|U|} \cdot H(B)}{|\{a^d\}^*|},$$

where $\{a^d\}^*$ is the partition of $U$ into blocks of cases with the same value of $a^d$ and $H(B)$ is the entropy of the decision on the block $B$. A partially-discretized attribute $a^d$ with the largest $M(a^d)$ is the worst attribute [5]. The worst attribute is the subject of additional discretization for $k + 1$ intervals. The rest of the algorithm is defined by recursion. These new methods of discretization are called the globalized version of equal interval width and the globalized version of equal frequency per interval. As follows from [2], both methods are quite successful.
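The selection of the worst attribute can be sketched as follows. The sketch assumes that the average block entropy is the sum, over all blocks of cases sharing one discretized value, of the block's relative size times the entropy of the decision in the block, divided by the number of blocks. The function names and the two-attribute table are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def average_block_entropy(discretized, decisions):
    """Average block entropy of one discretized attribute (assumed form)."""
    blocks = defaultdict(list)
    for interval, d in zip(discretized, decisions):
        blocks[interval].append(d)
    n = len(decisions)
    weighted = sum(len(b) / n * entropy(b) for b in blocks.values())
    return weighted / len(blocks)

def worst_attribute(table, decisions):
    """Name of the attribute with the largest average block entropy."""
    return max(table, key=lambda name: average_block_entropy(table[name], decisions))

# Hypothetical table discretized with k = 2 intervals per attribute.
table = {
    "length": ["lo", "lo", "hi", "hi"],
    "width": ["lo", "hi", "lo", "hi"],
}
decisions = ["good", "good", "bad", "bad"]
print(worst_attribute(table, decisions))  # width
```

In the globalized methods, the attribute returned by `worst_attribute` would be re-discretized with one more interval, and the check of the level of consistency would be repeated.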

Multiple Scanning
In the multiple scanning discretization method, the entire attribute set is scanned $t$ times; $t$ is a parameter selected by the user. In our experiments, we applied $t = 1, 2, \ldots$, until the error rate, a result of ten-fold cross-validation, computed by C4.5, was the same for two consecutive values of $t$. Initially, for each attribute, the best cut point is selected, using the minimum of conditional entropy $H_S(d|q)$, for all possible values of $q$. During the next scans (i.e., for $t = 2, 3, \ldots$), the entire attribute set is scanned again; for each attribute, we identify one cut point: for each block $X$ of $(A^d)^*$, the best cut point is selected, and the best cut point among all such blocks is accepted as the best cut point for the attribute. If the requested parameter $t$ is reached and the dataset needs more discretization since $L(A^d) \neq 1$, the dominant attribute technique is used for the remaining discretization.
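The scanning loop can be sketched as below. This is a simplified reading of the description, not the authors' implementation: candidate cut points are assumed to be midpoints of consecutive distinct values, only blocks with more than one decision value are scanned, cut points from different blocks are compared by their entropy within the block, and the dominant attribute fallback and interval merging are omitted. All names and data are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_cut(values, decisions):
    """Best (entropy, cut point) pair on one block, or None if no cut exists."""
    distinct = sorted(set(values))
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        q = (lo + hi) / 2
        left = [d for v, d in zip(values, decisions) if v < q]
        right = [d for v, d in zip(values, decisions) if v >= q]
        h = (len(left) * entropy(left) + len(right) * entropy(right)) / len(values)
        if best is None or h < best[0]:
            best = (h, q)
    return best

def blocks(rows, cuts):
    """Group case indices that fall into the same interval on every attribute."""
    groups = {}
    for i, row in enumerate(rows):
        key = tuple(sum(v >= q for q in cuts[a]) for a, v in enumerate(row))
        groups.setdefault(key, []).append(i)
    return list(groups.values())

def multiple_scanning(rows, decisions, t):
    """Run at most t scans and return the sorted cut points per attribute."""
    cuts = [set() for _ in rows[0]]
    for _ in range(t):
        mixed = [b for b in blocks(rows, cuts)
                 if len({decisions[i] for i in b}) > 1]
        if not mixed:  # every block is consistent, nothing left to split
            break
        for a in range(len(rows[0])):
            found = [best_cut([rows[i][a] for i in b], [decisions[i] for i in b])
                     for b in mixed]
            found = [f for f in found if f is not None]
            if found:
                cuts[a].add(min(found)[1])  # one new cut point per attribute
    return [sorted(c) for c in cuts]

# Hypothetical two-attribute dataset.
rows = [(1.0, 5.0), (2.0, 5.0), (3.0, 6.0), (4.0, 6.0)]
print(multiple_scanning(rows, ["a", "a", "b", "b"], t=2))
```

Note that every attribute receives a cut point in the first scan, even when one attribute alone would already separate the concepts; this is the sense in which the method gives the same chance to all attributes.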
Let us discretize Table 1 using the multiple scanning method. First, we need to compute the conditional entropy $H_U(d|q)$ for each attribute and for all possible cut points $q$ of that attribute. The first attribute is length, with two possible cut points: 4.1 and 4.5. Of the two corresponding conditional entropies, the one for 4.1 is smaller, so for the attribute length, the better cut point is 4.1. For the attribute width, there are two possible cut points: 1.75 and 1.85, with $H_{width}(1.75, U) = 1.373$ and $H_{width}(1.85, U) = 1.659$; the better cut point is 1.75. For the attribute height, there are two possible cut points: 1.45 and 1.55, with $H_{height}(1.45, U) = 0.980$ and $H_{height}(1.55, U) = 1.251$; the better cut point is 1.45. A partially-discretized dataset, after the first scan, is presented in Table 4.
For Table 4, $(A^d)^* = \{\{1\}, \{2\}, \{3, 4, 6, 7\}, \{5\}\}$, and $L(A^d) = 0.429$. The remaining discretization is conducted using the dominant attribute method for the sub-table presented in Table 5. It is clear that the attribute length is the best attribute (with the smallest entropy) and that the remaining cut point is 4.5 for the attribute length. As a result, we obtain Table 6, for which $L(A^d) = 1$.
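The value 0.429 can be checked with a short sketch. It assumes the usual rough-set reading of the level of consistency: the fraction of cases that lie in blocks whose cases all share one decision value. Table 1 is not reproduced here, so the decision values below are placeholders chosen only so that the block {3, 4, 6, 7} contains more than one decision value.

```python
def level_of_consistency(partition, decision):
    """Fraction of cases lying in blocks whose cases share one decision value."""
    total = sum(len(block) for block in partition)
    consistent = sum(
        len(block) for block in partition
        if len({decision[case] for case in block}) == 1
    )
    return consistent / total

# The partition quoted in the text for Table 4.
partition = [{1}, {2}, {3, 4, 6, 7}, {5}]
# Hypothetical decision values (placeholders, not the values of Table 1).
decision = {1: "x", 2: "y", 3: "x", 4: "y", 5: "x", 6: "x", 7: "y"}
print(round(level_of_consistency(partition, decision), 3))  # 0.429
```

Three of the seven cases (1, 2 and 5) lie in consistent blocks, which gives 3/7 and agrees with the value reported for Table 4.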

Interval Merging
In the discretization techniques presented in this paper, except the internal discretization method of C4.5, the last step is an attempt to merge intervals. When merging intervals, we want to reduce the number of intervals while preserving consistency. The corresponding algorithm has two steps:
• safe merging: for any discretized attribute $a^d$ and for any two neighboring intervals i..j and j..k, if both intervals belong to the same concept, these intervals are merged (or replaced by the interval i..k);
• proper merging: for any discretized attribute $a^d$ and for any two neighboring intervals i..j and j..k, if the new interval i..k, the result of merging, does not reduce the level of consistency $L(A^d)$, these intervals are merged (or replaced by the new interval i..k).
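The first of the two steps, safe merging, can be sketched for a single attribute as follows. The sketch reads "both intervals belong to the same concept" as "all cases in the two intervals have the same decision value"; the function name and data are hypothetical, and proper merging, which needs the level of consistency of the whole table, is not shown.

```python
def safe_merge(cut_points, values, decisions):
    """Drop every cut point whose two neighbouring intervals hold one concept."""
    kept = []
    cuts = sorted(cut_points)
    for idx, q in enumerate(cuts):
        # The left neighbour starts at the last cut point that was kept,
        # so intervals merged earlier are treated as one interval.
        lower = kept[-1] if kept else float("-inf")
        upper = cuts[idx + 1] if idx + 1 < len(cuts) else float("inf")
        # Decision values of all cases in the two intervals adjacent to q.
        seen = {d for v, d in zip(values, decisions) if lower <= v < upper}
        if len(seen) > 1:
            kept.append(q)  # the neighbours differ, so the cut point stays
    return kept

# Hypothetical attribute with three cut points; only 3.5 separates concepts.
values = [1, 2, 3, 4, 5, 6]
decisions = ["a", "a", "a", "b", "b", "b"]
print(safe_merge([1.5, 3.5, 5.5], values, decisions))  # [3.5]
```

Removing a cut point is equivalent to replacing the intervals i..j and j..k by i..k, as described in the list above.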
In Table 3, we may merge intervals 1.75..1.85 and 1.85..1.9 of the attribute $width^d$. Additionally, we may merge the only two intervals 1.4..1.55 and 1.55..1.6 of the attribute $height^d$, so the attribute $height^d$ becomes redundant. Table 7 presents the final table, discretized by the equal frequency per interval method.
In Table 6, we may merge both intervals of the attribute $height^d$; so finally, the table discretized by the multiple scanning method is identical to Table 7.

Experiments
We conducted experiments on 17 datasets with numerical attributes presented in Table 8. These datasets, except bankruptcy, were taken from the Machine Learning Repository stored at the University of California, Irvine. The bankruptcy dataset is a well-known dataset used by Altman to predict the bankruptcy of companies [37]. For all four discretization methods, an error rate was estimated using the ten-fold cross-validation procedure of C4.5, with the level of consistency equal to 100%. Table 9 presents error rates for all four discretization methods.
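For readers unfamiliar with the evaluation protocol, a generic ten-fold cross-validation loop is sketched below. It is not the cross-validation procedure built into C4.5, whose details (for example, how folds are formed) may differ; the function names and the `errors_on_fold` callback are hypothetical.

```python
import random

def ten_fold_splits(n_cases, seed=0):
    """Yield (train, test) index lists for ten-fold cross-validation."""
    indices = list(range(n_cases))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::10] for i in range(10)]
    for i, test in enumerate(folds):
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train, test

def cross_validated_error(n_cases, errors_on_fold):
    """Pool misclassifications over all folds and divide by the number of cases.

    errors_on_fold(train, test) is a user-supplied callback that builds a
    model on train and returns the number of misclassified test cases.
    """
    wrong = sum(errors_on_fold(train, test)
                for train, test in ten_fold_splits(n_cases))
    return wrong / n_cases

# Dummy callback that reports one error per fold, for illustration only.
print(cross_validated_error(20, lambda train, test: 1))  # 0.5
```

Each case is used for testing exactly once, so the pooled error rate is an estimate over the whole dataset.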
Table 10 shows the size of decision trees generated by the C4.5 system. For the analysis of the experimental results, we used the Friedman rank sum test combined with multiple comparisons, with a 5% level of significance. We conclude that the multiple scanning discretization method is associated with a significantly smaller error rate than all three remaining discretization methods: the original C4.5 discretization method and the globalized versions of the equal interval width and equal frequency per interval discretization methods. The differences in performance among these three remaining methods (C4.5 and the globalized versions of the equal interval width and equal frequency per interval discretization methods) are statistically insignificant. Additionally, decision trees generated by C4.5 from datasets discretized by multiple scanning are simpler than decision trees generated by C4.5 from datasets discretized by both globalized versions of equal interval width and equal frequency per interval.
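The Friedman rank sum statistic used in this analysis can be computed as in the following sketch. It uses the basic form of the statistic without a correction for ties and does not include the multiple comparisons step; the function names and the small table are hypothetical and are not the results of Table 9.

```python
def average_ranks(row):
    """Rank the entries of row from 1 (smallest); ties share the mean rank."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    ranks = [0.0] * len(row)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and row[order[end + 1]] == row[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks

def friedman_statistic(table):
    """Friedman statistic for a table with one row per dataset
    and one column per method (no tie correction)."""
    n, k = len(table), len(table[0])
    rank_sums = [0.0] * k
    for row in table:
        for j, r in enumerate(average_ranks(row)):
            rank_sums[j] += r
    return 12 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3 * n * (k + 1)

# Hypothetical error rates: three datasets, three methods, identical ordering.
table = [[0.10, 0.20, 0.30], [0.05, 0.15, 0.25], [0.12, 0.18, 0.40]]
print(friedman_statistic(table))
```

The statistic is compared with a chi-square critical value with $k - 1$ degrees of freedom; for four methods and a 5% level of significance, that value is approximately 7.815.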

Conclusions
We present results of our experiments using four different discretization techniques based on entropy. These discretization techniques were validated by conducting experiments on 17 datasets with numerical attributes. Our results show that, in terms of an error rate computed by ten-fold cross-validation, the multiple scanning discretization technique is significantly better than the internal discretization used in C4.5 and the two globalized discretization methods, equal interval width and equal frequency per interval (two-tailed test, 5% level of significance). Additionally, decision trees generated from data discretized by multiple scanning are significantly simpler than decision trees generated directly by C4.5 and decision trees generated from datasets discretized by both globalized discretization methods.
Our results show that multiple scanning is the best discretization method in terms of the error rate and that decision trees generated from datasets discretized by multiple scanning are simpler than decision trees generated from datasets discretized by both globalized discretization methods.
For a subset $B$ of the attribute set $A$, the indiscernibility relation $IND(B)$ is defined as follows: $(x, y) \in IND(B)$ if and only if $a(x) = a(y)$ for any $a \in B$, where $a(x)$ denotes the value of the attribute $a \in A$ for the case $x \in U$. The relation $IND(B)$ is an equivalence relation. The equivalence classes of $IND(B)$ are denoted by $[x]_B$ and are called $B$-elementary sets. Any finite union of $B$-elementary sets is $B$-definable.

Table 1. An example of a dataset with numerical attributes.

Table 4. Partially-discretized Table 1 using multiple scanning, after the first scan.

Table 5. A sub-table of the dataset presented in Table 1.

Table 6. Partially-discretized Table 1 using multiple scanning and the dominant attribute.

Table 7. Table 1 discretized using equal frequency per interval.