Article

Reduced Data Sets and Entropy-Based Discretization

by Jerzy W. Grzymala-Busse 1,2,*, Zdzislaw S. Hippe 2 and Teresa Mroczek 2

1 Department of Electrical Engineering and Computer Science, University of Kansas, Lawrence, KS 66045, USA
2 Department of Artificial Intelligence, University of Information Technology and Management, 35–225 Rzeszow, Poland
* Author to whom correspondence should be addressed.
Entropy 2019, 21(11), 1051; https://doi.org/10.3390/e21111051
Submission received: 13 September 2019 / Revised: 23 October 2019 / Accepted: 25 October 2019 / Published: 28 October 2019
(This article belongs to the Special Issue The Ubiquity of Entropy)

Abstract

Results of experiments on numerical data sets discretized using two methods, global versions of Equal Frequency per Interval and Equal Interval Width, are presented. Globalization of both methods is based on entropy. For every discretized data set, left and right reducts were computed. For each discretized data set and for the two data sets based, respectively, on its left and right reducts, we applied ten-fold cross validation using the C4.5 decision tree generation system. Our main objective was to compare the quality of all three types of data sets in terms of the error rate. Additionally, we compared the complexity of the generated decision trees. We show that reduction of data sets may only increase the error rate and that decision trees generated from reduced data sets are not simpler than decision trees generated from non-reduced data sets.

1. Introduction

The problem of reducing (or selecting) the set of attributes (or features) has been known for many decades [1,2,3,4,5,6]. It is a central topic in multivariate statistics and data analysis [7]. It was recognized as a variant of the Set Covering problem in [3]. The problem of finding a minimal set of features is crucial for the study of medical and bioinformatics data, with tens of thousands of features representing genes [7]. An analysis of feature selection methods presented in [7] included filter, wrapper and embedded methods based on mathematical optimization, e.g., on linear programming. An algorithm for discretization that removes redundant attributes was presented in [8,9,10]. An example of a MicroArray Logic Analyzer (MALA), where feature selection was based on cluster analysis, was presented in [11]. An approach to feature selection based on logic was presented in [12]. A lot of attention has been paid to the benefits of feature reduction in data mining, machine learning and pattern recognition, see, for example, References [4,5,6,13,14,15,16,17]. Recently, feature reduction, or reduction of the attribute set, combined with discretization of numerical attributes, was discussed in References [13,15,16,18,19]. In Reference [18], results of experiments conducted on ten numerical data sets with four different types of reducts were presented. The authors used two classifiers, Support Vector Machine (SVM) [20] and C4.5 [21]. In both cases, the results of the Friedman test were inconclusive, so the benefits of data reduction are not clear. In the experiments presented in Reference [19], genetic algorithms and artificial neural networks were used for a few tasks: discretization, feature reduction and prediction of a stock price index. It is difficult to evaluate how much feature reduction contributed to the final results. In Reference [13], experimental results on ten numerical data sets, discretized using a system called C-GAME, are reported. C-GAME uses reducts during discretization. The authors claim that C-GAME outperforms five other discretization schemes. Some papers, for example, References [14,15,16], discuss reducts combined with discretization; however, no convincing experimental results are included. A related problem, namely how the reduction of the attribute set, occurring as a side effect of discretization of numerical attributes, changes the error rate, was discussed in Reference [17].
For symbolic attributes, it was shown in [22] that the quality of rule sets induced from reduced data sets, measured by an error rate evaluated by ten-fold cross validation, is worse than the quality of rule sets induced from the original data sets, with no reduction of the attribute set.

2. Reducts

The set of all cases of the data set is denoted by U. An example of a data set with numerical attributes is presented in Table 1. For simplicity, all attributes have repetitive values, though in real life numerical attribute values are seldom repetitive. In our example, U = {1, 2, 3, 4, 5, 6, 7, 8}. The set of all attributes is denoted by A. In our example, A = {Length, Height, Width, Weight}. One of the variables is called a decision; in Table 1 it is Quality.
Let B be a subset of the set A of all attributes. The indiscernibility relation $IND(B)$ [23,24] is defined as follows:
$$(x, y) \in IND(B) \textrm{ if and only if } a(x) = a(y) \textrm{ for any } a \in B,$$
where $x, y \in U$ and $a(x)$ denotes the value of an attribute $a \in A$ for a case $x \in U$. The relation $IND(B)$ is an equivalence relation. An equivalence class of $IND(B)$ containing $x \in U$ is called a B-elementary class and is denoted by $[x]_B$. A family of all sets $[x]_B$, where $x \in U$, is a partition on U denoted by $B^*$. A union of B-elementary classes is called B-definable. For a decision d we may define an indiscernibility relation $IND(\{d\})$ by analogy. Additionally, {d}-elementary classes are called concepts.
A decision d depends on a subset B of the set A of all attributes if and only if $B^* \leq \{d\}^*$. For partitions $\pi$ and $\tau$ on U, $\pi \leq \tau$ if and only if for every $X \in \pi$ there exists $Y \in \tau$ such that $X \subseteq Y$. For example, for B = {Width, Weight}, $B^*$ = {{1}, {2, 3}, {4}, {5}, {6, 7, 8}}, $\{d\}^*$ = {{1, 2, 3}, {4, 5}, {6, 7, 8}} and $B^* \leq \{d\}^*$. Thus d depends on B.
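To make these definitions concrete, here is a minimal sketch in Python (not the authors' code) that builds the partition $B^*$ for the Table 1 data and checks the dependency condition $B^* \leq \{d\}^*$. The function names `partition`, `finer_or_equal` and `depends_on` are illustrative choices, not notation from the paper.

```python
# Table 1: case -> (Length, Height, Width, Weight, Quality)
TABLE1 = {
    1: (4.8, 1.2, 1.6, 0.8, "high"),
    2: (4.8, 1.4, 1.8, 0.8, "high"),
    3: (4.8, 1.4, 1.8, 0.8, "high"),
    4: (4.4, 1.4, 1.6, 1.0, "medium"),
    5: (4.4, 1.2, 1.6, 1.4, "medium"),
    6: (4.2, 1.2, 1.8, 1.4, "low"),
    7: (4.2, 1.8, 1.8, 1.4, "low"),
    8: (4.2, 1.8, 1.8, 1.4, "low"),
}
ATTRS = {"Length": 0, "Height": 1, "Width": 2, "Weight": 3, "Quality": 4}

def partition(attrs):
    """Family of elementary classes of IND(attrs), i.e. the partition attrs*."""
    blocks = {}
    for case, row in TABLE1.items():
        key = tuple(row[ATTRS[a]] for a in attrs)
        blocks.setdefault(key, set()).add(case)
    return list(blocks.values())

def finer_or_equal(pi, tau):
    """pi <= tau: every block of pi is contained in some block of tau."""
    return all(any(x <= y for y in tau) for x in pi)

def depends_on(b):
    """The decision Quality depends on attribute set b iff b* <= {d}*."""
    return finer_or_equal(partition(b), partition(["Quality"]))

print(partition(["Width", "Weight"]))   # [{1}, {2, 3}, {4}, {5}, {6, 7, 8}]
print(depends_on(["Width", "Weight"]))  # True
```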
For Table 1 and for B = {Weight}, $B^*$ = {{1, 2, 3}, {4}, {5, 6, 7, 8}}. The concepts {4, 5} and {6, 7, 8} are not B-definable. For a set X that is not definable we define two definable sets, called the lower and upper approximations of X [23,24]. The lower approximation of X is defined as
$$\underline{B}X = \{x \mid x \in U, [x]_B \subseteq X\}$$
and the upper approximation of X is defined as
$$\overline{B}X = \{x \mid x \in U, [x]_B \cap X \neq \emptyset\}.$$
For Table 1 and B = {Weight}, $\underline{B}\{6, 7, 8\}$ = ∅ and $\overline{B}\{6, 7, 8\}$ = {5, 6, 7, 8}.
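A short sketch of the two approximations, under the same assumptions as above (Table 1 data, helper names of my own choosing):

```python
def approximations(blocks, X):
    """Lower and upper approximations of a concept X for the partition `blocks`."""
    lower, upper = set(), set()
    for blk in blocks:
        if blk <= X:      # elementary class entirely inside X
            lower |= blk
        if blk & X:       # elementary class intersecting X
            upper |= blk
    return lower, upper

blocks_weight = [{1, 2, 3}, {4}, {5, 6, 7, 8}]   # {Weight}* from Table 1
print(approximations(blocks_weight, {6, 7, 8}))  # (set(), {5, 6, 7, 8})
print(approximations(blocks_weight, {4, 5}))     # ({4}, {4, 5, 6, 7, 8})
```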
A set B is called a reduct if and only if B is a minimal set with $B^* \leq \{d\}^*$. The set {Width, Weight} is a reduct since $\{Width\}^*$ = {{1, 4, 5}, {2, 3, 6, 7, 8}} $\not\leq \{d\}^*$ and $\{Weight\}^*$ = {{1, 2, 3}, {4}, {5, 6, 7, 8}} $\not\leq \{d\}^*$, so no proper subset of B satisfies the condition $B^* \leq \{d\}^*$.
The idea of a reduct is important since we may restrict our attention to a subset B and construct a decision tree with the same ability to distinguish all concepts that are distinguishable in the data set with the entire set A of attributes. Note that any algorithm for finding all reducts is of exponential time complexity. In practical applications, we have to use some heuristic approach. In this paper, we suggest two such heuristic approaches, left and right reducts.
A left reduct is computed by a sequence of attempts to remove one attribute at a time, from right to left, checking after every attempt whether $B^* \leq \{d\}^*$, where B is the current set of attributes. If this condition is true, we remove the attribute for good; if not, we put it back. For the example presented in Table 1, we start from an attempt to remove the rightmost attribute, that is, Weight. The current set B is {Length, Height, Width}, $B^*$ = {{1}, {2, 3}, {4}, {5}, {6}, {7, 8}} $\leq \{d\}^*$, so we remove Weight for good. The next candidate for removal is Width; for B = {Length, Height}, $B^*$ = {{1}, {2, 3}, {4}, {5}, {6}, {7, 8}} $\leq \{d\}^*$, so we remove Width as well. The next candidate is Height; if we remove it, B = {Length} and $B^*$ = {{1, 2, 3}, {4, 5}, {6, 7, 8}} $\leq \{d\}^*$, so {Length} is the left reduct since it cannot be further reduced.
A right reduct is computed by the same process, this time attempting to remove attributes from left to right. Again, after every attempt we check whether $B^* \leq \{d\}^*$. It is not difficult to see that the right reduct for Table 1 is the set {Width, Weight}.
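The left- and right-reduct heuristics just described can be sketched as a single function that scans the attribute list in one of the two directions; `depends_on` refers to the dependency check sketched earlier, and the code is an illustration rather than the authors' implementation.

```python
def heuristic_reduct(attrs, depends_on, right_to_left=True):
    """Drop attributes one at a time; keep a drop only if B* <= {d}* still holds."""
    order = list(reversed(attrs)) if right_to_left else list(attrs)
    current = list(attrs)
    for a in order:
        trial = [b for b in current if b != a]
        if trial and depends_on(trial):
            current = trial          # the condition still holds: remove a for good
    return current

attrs = ["Length", "Height", "Width", "Weight"]
# With the `depends_on` check sketched earlier:
# heuristic_reduct(attrs, depends_on, right_to_left=True)   -> ['Length']
# heuristic_reduct(attrs, depends_on, right_to_left=False)  -> ['Width', 'Weight']
```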
For a discretized data set we may compute the left and right reducts and create three data sets: the discretized (non-reduced) data set and two data sets with the attribute set restricted to the left and right reducts, respectively. For all three data sets we then compute an error rate evaluated by the C4.5 decision tree generation system using ten-fold cross validation. Our results show again that reduction of data sets may increase the error rate.

3. Discretization

For a numerical attribute a, let $a_i$ be the smallest value of a and let $a_j$ be the largest value of a. In discretizing a we are looking for numbers $a_{i_0}, a_{i_1}, \ldots, a_{i_k}$, called cutpoints, where $a_{i_0} = a_i$, $a_{i_k} = a_j$, $a_{i_l} < a_{i_{l+1}}$ for $l = 0, 1, \ldots, k-1$ and k is a positive integer. As a result of discretization, the domain $[a_i, a_j]$ of the attribute a is divided into k intervals
$$\{[a_{i_0}, a_{i_1}), [a_{i_1}, a_{i_2}), \ldots, [a_{i_{k-2}}, a_{i_{k-1}}), [a_{i_{k-1}}, a_{i_k}]\}.$$
In this paper we denote such intervals as follows:
$$a_{i_0}..a_{i_1},\; a_{i_1}..a_{i_2},\; \ldots,\; a_{i_{k-2}}..a_{i_{k-1}},\; a_{i_{k-1}}..a_{i_k}.$$
Discretization is usually conducted not on a single numerical attribute but on many numerical attributes. Discretization methods may be categorized as supervised (decision-driven, i.e., concepts are taken into account) or unsupervised. Discretization methods processing all attributes are called global or dynamic, while discretization methods processing a single attribute at a time are called local or static.
Let v be a variable and let $v_1, v_2, \ldots, v_n$ be the values of v, where n is a positive integer. Let S be a subset of U and let $p(v_i)$ be the probability of $v_i$ in S, where i = 1, 2, ..., n. The entropy $H_S(v)$ is defined as follows:
$$H_S(v) = -\sum_{i=1}^{n} p(v_i) \cdot \log p(v_i).$$
All logarithms in this paper are binary.
Let a be an attribute, let $a_1, a_2, \ldots, a_m$ be all values of a restricted to S, let d be a decision and let $d_1, d_2, \ldots, d_n$ be all values of d restricted to S, where m and n are positive integers. The conditional entropy $H_S(d \mid a)$ of the decision d given an attribute a is defined as follows:
$$H_S(d \mid a) = -\sum_{j=1}^{m} p(a_j) \cdot \sum_{i=1}^{n} p(d_i \mid a_j) \cdot \log p(d_i \mid a_j),$$
where $p(d_i \mid a_j)$ is the conditional probability of the value $d_i$ of the decision d given $a_j$, with $j \in \{1, 2, \ldots, m\}$ and $i \in \{1, 2, \ldots, n\}$.
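A direct transcription of this definition in Python, as a sketch, with probabilities estimated by relative frequencies in S; the function name and the frequency estimate are my choices.

```python
from collections import Counter
from math import log2

def conditional_entropy(attr_vals, dec_vals):
    """H_S(d | a) with probabilities estimated by relative frequencies in S."""
    n = len(attr_vals)
    by_value = {}
    for a, d in zip(attr_vals, dec_vals):
        by_value.setdefault(a, []).append(d)
    h = 0.0
    for decisions in by_value.values():
        p_a = len(decisions) / n
        counts = Counter(decisions)
        h -= p_a * sum(c / len(decisions) * log2(c / len(decisions))
                       for c in counts.values())
    return h

weight = [0.8, 0.8, 0.8, 1.0, 1.4, 1.4, 1.4, 1.4]      # attribute Weight, Table 1
quality = ["high"] * 3 + ["medium"] * 2 + ["low"] * 3  # decision Quality
print(conditional_entropy(weight, quality))            # about 0.41 bits
```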
As is well known [25,26,27,28,29,30,31,32,33,34,35,36], discretization that uses the conditional entropy of the decision given the attribute is believed to be one of the most successful discretization techniques.
Let S be a subset of U, let a be an attribute and let q be a cutpoint splitting the set S into two subsets $S_1$ and $S_2$. The corresponding conditional entropy, denoted by $H_S(d \mid a)$, is defined as follows:
$$\frac{|S_1|}{|U|} H_{S_1}(a) + \frac{|S_2|}{|U|} H_{S_2}(a),$$
where $|X|$ denotes the cardinality of the set X. Usually, the cutpoint q for which $H_S(d \mid a)$ is the smallest is considered to be the best cutpoint.
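A sketch of cutpoint selection based on the above, assuming the usual reading that each half produced by a candidate cutpoint is scored by the entropy of the decision restricted to it; taking candidate cutpoints as midpoints between consecutive distinct attribute values is my own simplification, not a rule stated in the paper.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Binary entropy of the empirical distribution of `values`."""
    counts, n = Counter(values), len(values)
    return -sum(c / n * log2(c / n) for c in counts.values())

def cutpoint_entropy(attr_vals, dec_vals, q):
    """Size-weighted entropy of the decision in the two halves produced by q."""
    left = [d for a, d in zip(attr_vals, dec_vals) if a < q]
    right = [d for a, d in zip(attr_vals, dec_vals) if a >= q]
    n = len(dec_vals)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

def best_cutpoint(attr_vals, dec_vals):
    """Candidate cutpoints: midpoints between consecutive distinct values."""
    distinct = sorted(set(attr_vals))
    candidates = [(x + y) / 2 for x, y in zip(distinct, distinct[1:])]
    return min(candidates, key=lambda q: cutpoint_entropy(attr_vals, dec_vals, q))

weight = [0.8, 0.8, 0.8, 1.0, 1.4, 1.4, 1.4, 1.4]
quality = ["high"] * 3 + ["medium"] * 2 + ["low"] * 3
print(best_cutpoint(weight, quality))   # 0.9 for this toy column
```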
We also need a criterion for halting discretization. Commonly, discretization is halted when we may distinguish the same cases in the discretized data set as in the original data set with numerical attributes. In this paper, discretization is halted when the level of consistency [26], denoted by $L(A)$ and defined as
$$L(A) = \frac{\sum_{X \in \{d\}^*} |\underline{A}X|}{|U|},$$
is equal to 1. For Table 1, $A^*$ = {{1}, {2, 3}, {4}, {5}, {6}, {7, 8}}, so $\underline{A}X = X$ for any concept X from $\{d\}^*$ and $L(A) = 1$. On the other hand, for B = {Weight},
$$L(B) = \frac{|\underline{B}\{1, 2, 3\}| + |\underline{B}\{4, 5\}| + |\underline{B}\{6, 7, 8\}|}{|U|} = \frac{|\{1, 2, 3\}| + |\{4\}| + |\emptyset|}{8} = 0.5.$$
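A minimal sketch of the level-of-consistency computation, reusing the Table 1 example; the helper name is mine.

```python
def level_of_consistency(blocks_B, concepts, n_cases):
    """L(B): total size of lower approximations of all concepts, divided by |U|."""
    total = 0
    for X in concepts:
        lower = set()
        for blk in blocks_B:
            if blk <= X:          # elementary class entirely inside the concept
                lower |= blk
        total += len(lower)
    return total / n_cases

blocks_weight = [{1, 2, 3}, {4}, {5, 6, 7, 8}]           # {Weight}*
concepts = [{1, 2, 3}, {4, 5}, {6, 7, 8}]                # {d}*
print(level_of_consistency(blocks_weight, concepts, 8))  # 0.5
```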

4. Equal Frequency per Interval and Equal Interval Width

Both discretization methods, Equal Frequency per Interval and Equal Interval Width, are frequently used and both are known to be efficient [25]. In the local versions of these methods, only a single numerical attribute is discretized at a time [31]. The user provides a parameter k, equal to the requested number of intervals. In the Equal Frequency per Interval method, the domain of a numerical attribute is divided into k intervals containing approximately equal numbers of cases. In the Equal Interval Width method, the domain of a numerical attribute is divided into k intervals of approximately equal width.
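For illustration, a simplified sketch of the two local schemes for one attribute and a requested number of intervals k; the exact cutpoint placement used for Tables 2 and 3 (for example, snapping cutpoints to midpoints between observed values) is not fully specified in the text, so the details below are my own reading.

```python
def equal_interval_width_cutpoints(values, k):
    """k intervals of (approximately) equal width over the attribute's domain."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + i * step for i in range(1, k)]

def equal_frequency_cutpoints(values, k):
    """k intervals containing approximately equal numbers of cases (k <= len(values))."""
    ordered = sorted(values)
    n = len(ordered)
    return [(ordered[(i * n) // k - 1] + ordered[(i * n) // k]) / 2
            for i in range(1, k)]

weight = [0.8, 0.8, 0.8, 1.0, 1.4, 1.4, 1.4, 1.4]   # Weight from Table 1, k = 2
print(equal_interval_width_cutpoints(weight, 2))    # close to [1.1]
print(equal_frequency_cutpoints(weight, 2))         # [1.2]
```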
In this paper we present a supervised and global version of both methods, based on entropy [26]. Using this idea, we start by discretizing all numerical attributes with k = 2. Then the level of consistency is computed for the data set with discretized attributes. If the level of consistency is sufficient, discretization ends. If not, we select the worst attribute for additional discretization. The quality of a discretized attribute $a_d$ is measured by the average block entropy, defined as follows:
$$M(a_d) = \frac{\sum_{B \in \{a_d\}^*} \frac{|B|}{|U|} H(B)}{|\{a_d\}^*|}.$$
The discretized attribute with the largest value of $M(a_d)$ is the worst attribute. This attribute is further discretized into k + 1 intervals. The process continues recursively. The worst-case time complexity is $O(m \cdot \log m \cdot n^2)$, where m is the number of cases and n is the number of attributes. The method is illustrated by applying the Equal Frequency per Interval version to the data set from Table 1. Table 2 presents the resulting discretized data set. It is not difficult to see that the level of consistency for Table 2 is 1.
For the data set presented in Table 2, the left and right reducts are equal to each other and equal to {Height_d, Width_d, Weight_d}.
Table 3 presents the data set from Table 1 discretized by the Global Equal Interval Width discretization method. Again, the level of consistency for Table 3 is equal to 1. Additionally, for the data set from Table 3, both reducts, left and right, are equal to each other and equal to {Length_d, Width_d}.
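A self-contained sketch of the global, entropy-guided loop described above, using Equal Interval Width as the local scheme and reading H(B) as the entropy of the decision restricted to block B; all function names, the iterative (rather than recursive) formulation and the max_k safeguard are my own choices.

```python
from collections import Counter
from math import log2

# Table 1, stored per attribute; decisions in case order.
DATA = {
    "Length": [4.8, 4.8, 4.8, 4.4, 4.4, 4.2, 4.2, 4.2],
    "Height": [1.2, 1.4, 1.4, 1.4, 1.2, 1.2, 1.8, 1.8],
    "Width":  [1.6, 1.8, 1.8, 1.6, 1.6, 1.8, 1.8, 1.8],
    "Weight": [0.8, 0.8, 0.8, 1.0, 1.4, 1.4, 1.4, 1.4],
}
DECISION = ["high", "high", "high", "medium", "medium", "low", "low", "low"]

def entropy(values):
    counts, n = Counter(values), len(values)
    return -sum(c / n * log2(c / n) for c in counts.values())

def equal_width_bins(values, k):
    """Interval index (0..k-1) of each value under Equal Interval Width."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [min(int((v - lo) / step), k - 1) if step else 0 for v in values]

def blocks(labels_per_attr):
    """Elementary classes of the discretized data (cases with identical rows)."""
    groups = {}
    for case in range(len(DECISION)):
        key = tuple(labels[case] for labels in labels_per_attr.values())
        groups.setdefault(key, set()).add(case)
    return list(groups.values())

def consistency(labels_per_attr):
    """Level of consistency: a case counts iff its whole block has one decision."""
    ok = sum(len(b) for b in blocks(labels_per_attr)
             if len({DECISION[c] for c in b}) == 1)
    return ok / len(DECISION)

def average_block_entropy(labels):
    """M(a_d) for a single discretized attribute."""
    groups = {}
    for case, lab in enumerate(labels):
        groups.setdefault(lab, []).append(DECISION[case])
    n = len(labels)
    return sum(len(g) / n * entropy(g) for g in groups.values()) / len(groups)

def global_equal_interval_width(target=1.0, max_k=10):
    """Start with k = 2 everywhere; refine the worst attribute until consistent."""
    k = {a: 2 for a in DATA}
    while True:
        labels = {a: equal_width_bins(DATA[a], k[a]) for a in DATA}
        if consistency(labels) >= target:
            return k
        worst = max(DATA, key=lambda a: average_block_entropy(labels[a]))
        if k[worst] >= max_k:
            return k              # safeguard against non-termination
        k[worst] += 1

print(global_equal_interval_width())
```

On Table 1 the loop stops immediately, since k = 2 for every attribute already yields a level of consistency of 1, in agreement with the two-interval discretizations shown in Tables 2 and 3.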

5. Experiments

We conducted experiments on 13 numerical data sets, presented in Table 4. All of these data sets, except bankruptcy, may be accessed at the Machine Learning Repository, University of California, Irvine. The bankruptcy data set was described in Reference [37].
The main objective of our research is to compare the quality of decision trees generated by C4.5 directly from discretized data sets and from data sets based on reducts, in terms of the error rate and tree complexity. Data sets were discretized by the Global Equal Frequency per Interval and Global Equal Interval Width methods with the level of consistency equal to 1. For each numerical data set, three data sets were considered:
  • an original (non-reduced) discretized data set,
  • a data set based on the left reduct of the original discretized data set and
  • a data set based on the right reduct of the original discretized data set.
The discretized data sets were input to the C4.5 decision tree generating system [21]. In our experiments, the error rate was computed using the internal ten-fold cross validation mechanism of C4.5.
Additionally, the internal discretization mechanism of C4.5 was not used in the experiments on left and right reducts, since in this case the data sets had already been discretized by the global discretization methods, so C4.5 treated all attributes as symbolic.
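C4.5 itself is not reproduced here. As a rough stand-in, the sketch below one-hot encodes already discretized (symbolic) attributes and estimates an error rate with ten-fold cross validation using scikit-learn's CART-style decision tree; C4.5 differs from CART (multi-way splits, pruning and error estimation), so the numbers will not match Tables 5, 6 and 7. The file name and column name in the usage comment are hypothetical.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def cv_error_rate(frame: pd.DataFrame, decision: str, folds: int = 10) -> float:
    """Ten-fold cross-validated error rate on a symbolic (discretized) data set."""
    X = pd.get_dummies(frame.drop(columns=[decision]))  # symbolic -> indicator columns
    y = frame[decision]
    accuracy = cross_val_score(DecisionTreeClassifier(random_state=0),
                               X, y, cv=folds, scoring="accuracy")
    return 1.0 - accuracy.mean()

# Usage idea (hypothetical file and column names): load the discretized data
# set and the two reduced versions, then compare cv_error_rate on each.
# full = pd.read_csv("australian_discretized.csv")
# print(cv_error_rate(full, decision="class"))
```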
We illustrate our results with Figure 1 and Figure 2. Figure 1 presents the discretization intervals for the yeast data set, where discretization was conducted by the internal discretization mechanism of C4.5. Figure 2 presents the discretization intervals for the same data set with discretization conducted by the global version of the Equal Frequency per Interval method (right reducts and left reducts were identical).
Results of our experiments are presented in Table 5, Table 6, Table 7 and Table 8. These results were analyzed by the Friedman rank sum test with multiple comparisons, at the 5% level of significance. For data sets discretized by the Global Equal Frequency per Interval method, the Friedman test shows that there are significant differences between the three types of data sets: the non-reduced discretized data sets and the data sets based on left and right reducts. In most cases, the original, non-reduced data sets are associated with smaller error rates than both left and right reducts. However, the test of multiple comparisons shows that these differences are not statistically significant.
For data sets discretized by the Global Equal Interval Width method, the results are more conclusive. There are statistically significant differences between the non-reduced discretized data sets and the data sets based on left and right reducts. Moreover, the error rate for the non-reduced discretized data sets is significantly smaller than for both types of data sets based on left and right reducts. As expected, the difference between left and right reducts is not significant.
For both discretization methods and all types of data sets (non-reduced, based on left and right reducts) the difference in complexity of generated decision trees, measured by tree size and depth, is not significant.
Table 8 shows the size of left and right reducts created from data sets discretized by the Global versions of Equal Frequency per Interval and Equal Interval Width methods. For some data sets, for example, for bupa, both left and right reducts are identical with the original attribute set.

6. Conclusions

Our preliminary results [22] showed that data reduction combined with rule induction causes an increase in the error rate. The results presented in this paper confirm this conclusion: the reduction of data sets, combined with the C4.5 tree generation system, has the same effect. Decision trees generated from reduced data sets have larger error rates, as evaluated by ten-fold cross validation. Additionally, decision trees generated from reduced data sets are not simpler, in terms of tree size or tree depth, than decision trees generated from non-reduced data sets. Therefore, reduction of data sets (or feature selection) should be used with caution since it may degrade the results of data mining.
In the future we plan to extend our experiments to large data sets and to include classifiers other than rule induction and decision tree generation systems.

Author Contributions

T.M. designed and conducted experiments, Z.S.H. validated results and J.W.G.-B. wrote the paper.

Funding

This research received no external funding.

Acknowledgments

The authors would like to thank the editor and referees for their helpful suggestions and comments on the manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Almuallim, H.; Dietterich, T.G. Learning Boolean concepts in the presence of many irrelevant features. Artif. Intell. 1994, 69, 279–305.
  2. Kira, K.; Rendell, L.A. The feature selection problem: Traditional methods and a new algorithm. In Proceedings of the 10-th National Conference on AI, San Jose, CA, USA, 12–16 July 1992; pp. 129–134.
  3. Garey, M.; Johnson, D. Computers and Intractability: A Guide to the Theory of NP-Completeness; W. H. Freeman: New York, NY, USA, 1979.
  4. Fuernkranz, J.; Gamberger, D.; Lavrac, N. Foundations of Rule Learning; Springer: Berlin/Heidelberg, Germany, 2012.
  5. Stanczyk, U.; Jain, L.C. (Eds.) Feature Selection for Data and Pattern Recognition; Springer: Berlin/Heidelberg, Germany, 2015.
  6. Stanczyk, U.; Zielosko, B.; Jain, L.C. (Eds.) Advances in Feature Selection for Data and Pattern Recognition; Springer: Cham, Switzerland, 2018.
  7. Bertolazzi, P.; Felici, G.; Festa, P.; Fiscon, G.; Weitschek, E. Integer programming models for feature selection: New extensions and a randomized solution algorithm. Inf. Sci. 2016, 250, 389–399.
  8. Santoni, D.; Weitschek, E.; Felici, G. Optimal discretization and selection of features by association rates of joint distributions. RAIRO-Oper. Res. 2016, 50, 437–449.
  9. Liu, H.; Rudy, S. Feature selection via discretization. IEEE Trans. Knowl. Data Eng. 1997, 9, 642–646.
  10. Sharmin, S.; Ali, A.A.; Khan, M.A.H.; Shoyaib, M. Feature selection and discretization based on mutual information. In Proceedings of the IEEE International Conference on Imaging, Vision & Pattern Recognition, Dhaka, Bangladesh, 13–14 February 2017; pp. 1–6.
  11. Weitschek, E.; Felici, G.; Bertolazzi, P. MALA: A microarray clustering and classification software. In Proceedings of the International Workshop on Database and Expert Systems Applications, Vienna, Austria, 3–6 September 2012; pp. 201–205.
  12. Felici, G.; Weitschek, E. Mining logic models in the presence of noisy data. In Proceedings of the International Symposium on Artificial Intelligence and Mathematics, Fort Lauderdale, FL, USA, 9–11 January 2012.
  13. Tian, D.; Zeng, X.J.; Keane, J. Core-generating approximated minimum entropy discretization for rough set feature selection in pattern classification. Int. J. Approx. Reason. 2011, 52, 863–880.
  14. Jensen, R.; Shen, Q. Fuzzy-rough sets for descriptive dimensionality reduction. In Proceedings of the International Conference on Fuzzy Systems FUZZ-IEEE 2002, Honolulu, HI, USA, 12–17 May 2002; pp. 29–34.
  15. Nguyen, H.S. Discretization problem for rough sets methods. In Proceedings of the 1-st International Conference RSCTC 1998 on Rough Sets and Current Trends in Computing, Warsaw, Poland, 22–26 June 1998; Springer-Verlag: Berlin/Heidelberg, Germany, 1998; pp. 545–552.
  16. Swiniarski, R.W. Rough set methods in feature reduction and classification. Int. J. Appl. Math. Comput. Sci. 2001, 11, 565–582.
  17. Grzymala-Busse, J.W.; Mroczek, T. Attribute selection based on reduction of numerical attribute during discretization. In Advances in Feature Selection for Data and Pattern Recognition; Stanczyk, B., Zielosko, B., Jain, L.C., Eds.; Springer International Publishing AG: Cham, Switzerland, 2017; pp. 13–24.
  18. Hu, Q.; Yu, D.; Xie, Z. Information-preserving hybrid data reduction based on fuzzy-rough techniques. Pattern Recognit. Lett. 2006, 27, 414–423.
  19. Kim, K.j.; Han, I. Genetic algorithms approach to feature discretization in artificial neural networks for the prediction of stock price index. Expert Syst. Appl. 2000, 19, 125–132.
  20. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
  21. Quinlan, J.R. C4.5: Programs for Machine Learning; Morgan Kaufmann Publishers: San Mateo, CA, USA, 1993.
  22. Grzymala-Busse, J.W. An empirical comparison of rule induction using feature selection with the LEM2 algorithm. In Communications in Computer and Information Science; Greco, S., Bouchon-Meunier, B., Coletti, G., Fedrizzi, M., Matarazzo, B., Yager, R.R., Eds.; Springer: Berlin/Heidelberg, Germany, 2012; Volume 297, pp. 270–279.
  23. Pawlak, Z. Rough sets. Int. J. Comput. Inf. Sci. 1982, 11, 341–356.
  24. Pawlak, Z. Rough Sets. Theoretical Aspects of Reasoning about Data; Kluwer Academic Publishers: Dordrecht, The Netherlands; Boston, MA, USA; London, UK, 1991.
  25. Blajdo, P.; Grzymala-Busse, J.W.; Hippe, Z.S.; Knap, M.; Mroczek, T.; Piatek, L. A comparison of six approaches to discretization—A rough set perspective. In Proceedings of the Rough Sets and Knowledge Technology Conference, Chengdu, China, 17–19 May 2008; pp. 31–38.
  26. Chmielewski, M.R.; Grzymala-Busse, J.W. Global discretization of continuous attributes as preprocessing for machine learning. Int. J. Approx. Reason. 1996, 15, 319–331.
  27. Clarke, E.J.; Barton, B.A. Entropy and MDL discretization of continuous variables for Bayesian belief networks. Int. J. Intell. Syst. 2000, 15, 61–92.
  28. Elomaa, T.; Rousu, J. General and efficient multisplitting of numerical attributes. Mach. Learn. 1999, 36, 201–244.
  29. Fayyad, U.M.; Irani, K.B. On the handling of continuous-valued attributes in decision tree generation. Mach. Learn. 1992, 8, 87–102.
  30. Fayyad, U.M.; Irani, K.B. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, Chambery, France, 28 August–3 September 1993; pp. 1022–1027.
  31. Grzymala-Busse, J.W. Discretization of numerical attributes. In Handbook of Data Mining and Knowledge Discovery; Kloesgen, W., Zytkow, J., Eds.; Oxford University Press: New York, NY, USA, 2002; pp. 218–225.
  32. Grzymala-Busse, J.W. A multiple scanning strategy for entropy based discretization. In Proceedings of the 18th International Symposium on Methodologies for Intelligent Systems, Prague, Czech Republic, 14–17 September 2009; pp. 25–34.
  33. Kohavi, R.; Sahami, M. Error-based and entropy-based discretization of continuous features. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, Portland, OR, USA, 2–4 August 1996; pp. 114–119.
  34. Nguyen, H.S.; Nguyen, S.H. Discretization methods in data mining. In Rough Sets in Knowledge Discovery 1: Methodology and Applications; Polkowski, L., Skowron, A., Eds.; Physica-Verlag: Heidelberg, Germany, 1998; pp. 451–482.
  35. Stefanowski, J. Handling continuous attributes in discovery of strong decision rules. In Proceedings of the First Conference on Rough Sets and Current Trends in Computing, Warsaw, Poland, 22–26 June 1998; pp. 394–401.
  36. Stefanowski, J. Algorithms of Decision Rule Induction in Data Mining; Poznan University of Technology Press: Poznan, Poland, 2001.
  37. Altman, E.I. Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. J. Financ. 1968, 23, 589–609.
Figure 1. Attribute intervals for the yeast data set discretized by the internal discretization mechanism of C4.5.
Figure 2. Attribute intervals for the yeast data set discretized by the global version of Equal Frequency per Interval method and then computing right reducts.
Table 1. An example of a data set with numerical attributes.

Case    Length    Height    Width    Weight    Quality
1       4.8       1.2       1.6      0.8       high
2       4.8       1.4       1.8      0.8       high
3       4.8       1.4       1.8      0.8       high
4       4.4       1.4       1.6      1.0       medium
5       4.4       1.2       1.6      1.4       medium
6       4.2       1.2       1.8      1.4       low
7       4.2       1.8       1.8      1.4       low
8       4.2       1.8       1.8      1.4       low
Table 2. A data set discretized by Equal Frequency per Interval.

Case    Length_d    Height_d    Width_d     Weight_d    Quality
1       4.3..4.8    1.2..1.3    1.6..1.7    0.8..1.2    high
2       4.3..4.8    1.3..1.8    1.7..1.8    0.8..1.2    high
3       4.3..4.8    1.3..1.8    1.7..1.8    0.8..1.2    high
4       4.3..4.8    1.3..1.8    1.6..1.7    0.8..1.2    medium
5       4.3..4.8    1.2..1.3    1.6..1.7    1.2..1.4    medium
6       4.2..4.3    1.2..1.3    1.7..1.8    1.2..1.4    low
7       4.2..4.3    1.3..1.8    1.7..1.8    1.2..1.4    low
8       4.2..4.3    1.3..1.8    1.7..1.8    1.2..1.4    low
Table 3. A data set discretized by Equal Interval Width.

Case    Length_d    Height_d    Width_d     Weight_d    Quality
1       4.5..4.8    1.2..1.5    1.6..1.7    0.8..1.2    high
2       4.5..4.8    1.2..1.5    1.7..1.8    0.8..1.2    high
3       4.5..4.8    1.2..1.5    1.7..1.8    0.8..1.2    high
4       4.2..4.5    1.2..1.5    1.7..1.8    0.8..1.2    medium
5       4.2..4.5    1.2..1.5    1.7..1.8    1.2..1.4    medium
6       4.2..4.5    1.5..1.8    1.7..1.8    1.2..1.4    low
7       4.2..4.5    1.5..1.8    1.7..1.8    1.2..1.4    low
8       4.2..4.5    1.5..1.8    1.7..1.8    1.2..1.4    low
Table 4. Data sets.

Data Set            Number of Cases    Number of Attributes    Number of Concepts
Abalone             4177               8                       29
Australian          690                14                      2
Bankruptcy          66                 5                       2
Bupa                345                6                       2
Echocardiogram      74                 7                       2
Ecoli               336                8                       8
Glass               214                9                       6
Ionosphere          351                34                      2
Iris                150                4                       3
Leukemia            415                17                      52
Wave                512                21                      3
Wine Recognition    178                13                      3
Yeast               1484               8                       9
Table 5. Error rates for discretized data sets.

                    Equal Frequency per Interval                   Equal Interval Width
Data Set            No Reduction   Left Reducts   Right Reducts    No Reduction   Left Reducts   Right Reducts
Abalone             76.87          76.99          77.47            76.90          77.42          77.42
Australian          12.46          14.49          22.46            13.33          30.14          14.64
Bankruptcy          3.03           3.03           3.03             10.61          10.61          10.61
Bupa                35.94          35.94          35.94            34.49          34.49          34.49
Echocardiogram      27.03          27.03          27.03            31.08          39.19          39.19
Ecoli               30.65          28.87          28.87            28.57          28.57          28.57
Glass               41.12          39.25          41.59            33.18          33.64          33.64
Ionosphere          13.11          19.15          20.85            10.83          11.97          15.82
Iris                12.67          12.67          12.67            4.00           4.00           4.00
Leukemia            1.45           2.93           1.60             1.32           1.99           1.59
Wave                25.59          27.15          28.91            27.54          31.46          28.13
Wine Recognition    10.11          10.67          12.36            9.55           8.99           10.67
Yeast               57.82          57.82          57.82            56.54          56.87          56.87
Table 6. Tree size for discretized data sets.

                    Equal Frequency per Interval                   Equal Interval Width
Data Set            No Reduction   Left Reducts   Right Reducts    No Reduction   Left Reducts   Right Reducts
Abalone             28,236         27,202         24,905           18,711         16,491         16,491
Australian          41             3              12               39             95             3
Bankruptcy          3              3              3                6              6              6
Bupa                17             17             17               27             27             27
Echocardiogram      8              13             13               16             16             16
Ecoli               61             40             40               109            107            107
Glass               70             63             56               186            191            191
Ionosphere          33             53             71               34             24             70
Iris                11             11             11               4              4              4
Leukemia            139            125            132              229            139            174
Wave                62             42             44               107            73             96
Wine Recognition    19             30             30               18             15             18
Yeast               662            678            678              913            914            914
Table 7. Tree depth for discretized data sets.

                    Equal Frequency per Interval                   Equal Interval Width
Data Set            No Reduction   Left Reducts   Right Reducts    No Reduction   Left Reducts   Right Reducts
Abalone             2              2              2                3              3              3
Australian          5              1              2                5              4              1
Bankruptcy          1              1              1                1              1              1
Bupa                2              2              2                2              2              2
Echocardiogram      2              2              2                3              3              3
Ecoli               3              2              2                3              3              3
Glass               5              4              5                5              5              5
Ionosphere          6              6              5                6              4              6
Iris                2              2              2                1              1              1
Leukemia            6              4              4                6              4              5
Wave                9              7              9                11             9              6
Wine Recognition    6              6              6                5              4              5
Yeast               4              4              4                3              3              3
Table 8. Number of attributes in left and right reducts for discretized data sets.

                    Discretized Non-Reduced    Equal Frequency per Interval    Equal Interval Width
Data Set            Data Set                   Left Reducts   Right Reducts    Left Reducts   Right Reducts
Abalone             8                          6              6                6              6
Australian          14                         8              10               9              9
Bankruptcy          5                          4              4                4              4
Bupa                6                          6              6                6              6
Echocardiogram      7                          5              5                6              6
Ecoli               7                          5              5                5              5
Glass               9                          8              8                7              7
Ionosphere          34                         11             13               11             10
Iris                4                          4              4                4              4
Leukemia            15                         8              8                11             10
Wave                21                         15             14               14             14
Wine Recognition    13                         8              8                10             10
Yeast               8                          6              6                6              6
