An Accurate and Easy to Interpret Binary Classifier Based on Association Rules Using Implication Intensity and Majority Vote
Abstract
1. Introduction
2. Methodology
2.1. A New Quality Measure of Association Rules
2.2. The Proposed Approach
- Transform every input variable into a set of binary variables, one per category (this step introduces one or more parameters: the number of intervals into which each numeric variable is categorized). The classification variable Y is doubled into two binary variables, one per class.
- Mine all the 2-length classifying rules (whose lhs is a binary input variable and whose rhs is one of the two binary class variables). This choice allows all the rules to be mined in reasonable time, without sacrificing a minimum support, even for large datasets.
- Choose a threshold for implifidence (a new parameter) and keep the rules whose implifidence exceeds it; these are called significant rules. Any original variable whose related binary variables all appear as the lhs of a significant rule is called a significant variable.
- Extract the subset of significant variables (in this step, variables with low effect on the classification variable Y are discarded).
- Make the table of 1-predictor classifications, using the significant rules, where each individual (row) is classified according to each of the selected significant variables (columns).
- Choose a low odd number m of ‘premises’ (another parameter). Rules will be built with at most m premises in order to classify data.
- For every combination of m significant variables:
- Classify each individual in the sample using the classifying rules involving the m variables, by majority voting among the outcomes of those rules.
- Compare classification with the true class.
- Assess the prediction of the m-tuple by the four measures of performance (accuracy, precision, sensitivity, specificity).
- Choose the m-tuple with the best performance (accuracy by default, but any other criterion can be chosen).
- Return the final classifier, which uses the m-tuple leading to the best performance.
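The search over m-tuples and the majority vote can be sketched in Python as follows (a minimal illustration under assumed data structures, not the package's actual implementation; variable names and the dict layout are assumptions):

```python
from itertools import combinations

def vote(record, rules, min_votes):
    """One m-tuple vote: count the single-variable rules that fire on this record."""
    fired = sum(record[v] in values for v, values in rules.items())
    return fired >= min_votes

def best_m_tuple(records, labels, var_rules, m):
    """Try every combination of m significant variables; keep the most accurate one.

    records   -- list of dicts: variable name -> observed value
    labels    -- list of booleans: True for the positive class
    var_rules -- dict: significant variable -> set of values predicting the positive class
    m         -- maximum number of premises (a low odd number)
    """
    min_votes = m // 2 + 1            # strict majority of the m votes
    best = (None, -1.0)
    for combo in combinations(var_rules, m):
        rules = {v: var_rules[v] for v in combo}
        hits = sum(vote(r, rules, min_votes) == y for r, y in zip(records, labels))
        acc = hits / len(records)
        if acc > best[1]:
            best = (combo, acc)
    return best                        # (tuple of variables, accuracy)
```

The voting threshold is a strict majority of m, which is why m is chosen odd: ties cannot occur.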
2.3. Data Description
2.4. Classifier Comparison Criteria
3. Results
- Place the Classif.zip file in a local folder (we denote its path as path_to_file)
- Run install.packages(path_to_file, repos = NULL, type="source") in the R Console. The package will be installed from source.
- Run library(Classif) in the R Console
- Run Classif() in the R Console. It provides a convenient window interface where the user can pick the dataset file and select the number of votes without writing code. Its output is the comparison among the classifiers that we show in Table 2. The object rules (which the user can print by simply typing rules in the R Console) contains the classifying rules and the accuracy for each possible number of votes.
- Internally, the function SIAclassif() performs all the steps of our algorithm, according to the implifidence threshold and the maximum number of votes. It computes the implifidence of all the rules, selects the significant variables, and evaluates every combination of groups of significant variables together with the resulting classification of each one, keeping the combination that maximizes the accuracy (by default; any other criterion can be specified by the user). It prints a sentence explaining the classification rule. As an example, the result for the WBC dataset is shown in Appendix A.
- The function predict.SIA() takes the object returned by SIAclassif() as its first argument and a data frame of new instances as its second argument; it applies the classification and returns the predicted output.
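The steps above can be condensed into a short R session (a sketch based only on the function names listed here; the argument names for SIAclassif() and the objects `train` and `newdata` are illustrative assumptions):

```r
## Install the package from the local source archive and load it
install.packages(path_to_file, repos = NULL, type = "source")
library(Classif)

## Interactive route: a window interface for choosing the dataset
## and the number of votes, producing the comparison of Table 2
Classif()
rules   # prints the classifying rules and the accuracy per number of votes

## Programmatic route (argument names are assumptions):
## fit on a data frame `train`, then predict on new instances `newdata`
model <- SIAclassif(train)          # implifidence threshold + majority vote
pred  <- predict.SIA(model, newdata)
```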
4. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A
- If the user wants to use only one variable, the best classification is obtained by checking V2, which predicts the class “malignant” for an individual whenever it takes one of the values 10, 3, 4, 5, 6, 7, 8 or 9. The accuracy of this classifier is approximately 0.927 (see component [[1]] below).
- If the user accepts using up to three variables, the best classification reaches an accuracy of approximately 0.962 by checking the variables V8, V3 and V6. The rule forecasts the class “malignant” for a record if at least two of the variables V8, V3 and V6 take one of their listed values (see component [[2]] below).
- In general, the user decides on a particular classifier according to the importance given to the accuracy, as well as to the simplicity of the rule. In any case, contrary to other classifiers, the classification produced by our approach is easy to interpret, because the practitioner only needs to check the values of a few of the original variables in order to decide the output class.
```
[1] "rules"

[[1]]
[1] "Classify as 'malignant' iff at least 1 of the following variables hold: V2={10,3,4,5,6,7,8,9}. The 'accuracy' is 0.926793557833089."

[[2]]
[1] "Classify as 'malignant' iff at least 2 of the following variables hold: V8={10,3,4,5,6,7,8,9}; V3={10,3,4,5,6,7,8,9}; V6={10,3,4,5,6,7,8,9}. The 'accuracy' is 0.961932650073206."

[[3]]
[1] "Classify as 'malignant' iff at least 2 of the following variables hold: V7={10,4,5,6,7,8,9}; V3={10,3,4,5,6,7,8,9}; V6={10,3,4,5,6,7,8,9}. The 'accuracy' is 0.961932650073206."

[[4]]
[1] "Classify as 'malignant' iff at least 3 of the following variables hold: V8={10,3,4,5,6,7,8,9}; V7={10,4,5,6,7,8,9}; V3={10,3,4,5,6,7,8,9}; V6={10,3,4,5,6,7,8,9}; V1={10,6,7,8,9}. The 'accuracy' is 0.972181551976574."

[[5]]
[1] "Classify as 'malignant' iff at least 3 of the following variables hold: V7={10,4,5,6,7,8,9}; V5={10,3,4,5,6,7,8,9}; V2={10,3,4,5,6,7,8,9}; V6={10,3,4,5,6,7,8,9}; V1={10,6,7,8,9}. The 'accuracy' is 0.972181551976574."

[[6]]
[1] "Classify as 'malignant' iff at least 3 of the following variables hold: V7={10,4,5,6,7,8,9}; V5={10,3,4,5,6,7,8,9}; V3={10,3,4,5,6,7,8,9}; V6={10,3,4,5,6,7,8,9}; V1={10,6,7,8,9}. The 'accuracy' is 0.972181551976574."

[[7]]
[1] "Classify as 'malignant' iff at least 3 of the following variables hold: V7={10,4,5,6,7,8,9}; V2={10,3,4,5,6,7,8,9}; V3={10,3,4,5,6,7,8,9}; V6={10,3,4,5,6,7,8,9}; V1={10,6,7,8,9}. The 'accuracy' is 0.972181551976574."

[[8]]
[1] "Classify as 'malignant' iff at least 4 of the following variables hold: V8={10,3,4,5,6,7,8,9}; V7={10,4,5,6,7,8,9}; V5={10,3,4,5,6,7,8,9}; V2={10,3,4,5,6,7,8,9}; V3={10,3,4,5,6,7,8,9}; V6={10,3,4,5,6,7,8,9}; V1={10,6,7,8,9}. The 'accuracy' is 0.97510980966325."

[[9]]
[1] "Classify as 'malignant' iff at least 4 of the following variables hold: V8={10,3,4,5,6,7,8,9}; V7={10,4,5,6,7,8,9}; V5={10,3,4,5,6,7,8,9}; V3={10,3,4,5,6,7,8,9}; V6={10,3,4,5,6,7,8,9}; V1={10,6,7,8,9}; V4={10,2,3,4,5,6,7,8,9}. The 'accuracy' is 0.97510980966325."
```
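As an illustration, component [[2]] above can be checked by hand: a record is classified as “malignant” when at least two of V8, V3 and V6 take a value in {10,3,4,5,6,7,8,9} (the three listed sets coincide here). A minimal Python sketch (the record values and the label "benign" for the complementary class are illustrative assumptions):

```python
# Component [[2]]: classify as "malignant" iff at least 2 of V8, V3, V6
# take a value in their listed sets (all three coincide in this rule).
LISTED = {10, 3, 4, 5, 6, 7, 8, 9}

def rule_2(record):
    votes = sum(record[v] in LISTED for v in ("V8", "V3", "V6"))
    return "malignant" if votes >= 2 else "benign"
```

For instance, rule_2({"V8": 7, "V3": 1, "V6": 4}) yields "malignant", since two of the three variables fall in the listed set.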
|  | b | b̄ | Total |
|---|---|---|---|
| ā | 55 | 5 | 60 |
| a | 20 | 20 | 40 |
| Total | 75 | 25 | 100 |
| Dataset | Method | Accuracy | Precision | Sensitivity | Specificity |
|---|---|---|---|---|---|
| WDBC | NB | 0.7240773 | 0.6629947 | 0.5329353 | 0.8361464 |
|  | CART | 0.6432337 | 0.5986015 | 0.1629407 | 0.9303413 |
|  | J48 | 0.6274165 | NA | 0.0000000 | 1.0000000 |
|  | RadSVM | 0.6274165 | NA | 0.0000000 | 1.0000000 |
|  | SIA | 0.8347979 | 0.7786700 | 0.7688728 | 0.8728805 |
| IDs_mapping | NB | 0.6417223 | 0.6092675 | 0.5617475 | 0.7067785 |
|  | CART | 0.6578772 | 0.6288379 | 0.5829920 | 0.7187949 |
|  | J48 | 0.6104806 | 0.6032557 | 0.3858690 | 0.7933888 |
|  | RadSVM | 0.5512016 | NA | 0.0000000 | 1.0000000 |
|  | SIA | 0.6550734 | 0.6325261 | 0.5529404 | 0.7383278 |
| SOBAR | NB | 0.9027778 | 0.9652778 | 0.7460317 | 0.9739583 |
|  | CART | 0.6527778 | 0.5027778 | 0.4365079 | 0.7619464 |
|  | J48 | 0.7083333 | NA | 0.0000000 | 1.0000000 |
|  | RadSVM | 0.7083333 | NA | 0.0000000 | 1.0000000 |
|  | SIA | 0.8055556 | 0.7833333 | 0.7301587 | 0.8952020 |
| WBC | NB | 0.9765739 | 0.9475863 | 0.9870416 | 0.9707784 |
|  | CART | 0.9458272 | 0.9225932 | 0.9247400 | 0.9574295 |
|  | J48 | 0.9311859 | 0.9130195 | 0.8921389 | 0.9536268 |
|  | RadSVM | 0.9663250 | 0.9570372 | 0.9447121 | 0.9775771 |
|  | SIA | 0.9428990 | 0.9107039 | 0.9285842 | 0.9500720 |
| WPBC | NB | 0.6752577 | 0.3009450 | 0.2740059 | 0.7963057 |
|  | CART | 0.7577320 | 0.5552283 | 0.1666237 | 0.9403461 |
|  | J48 | 0.7628866 | NA | 0.0000000 | 1.0000000 |
|  | RadSVM | 0.7628866 | NA | 0.0000000 | 1.0000000 |
|  | SIA | 0.6752577 | 0.2424399 | 0.1640648 | 0.8377091 |
| haberman | NB | 0.7156863 | 0.4185765 | 0.2281000 | 0.8850047 |
|  | CART | 0.6960784 | 0.2857298 | 0.2097401 | 0.8622492 |
|  | J48 | 0.7352941 | NA | 0.0000000 | 1.0000000 |
|  | RadSVM | 0.7352941 | NA | 0.0000000 | 1.0000000 |
|  | SIA | 0.6143791 | 0.2898471 | 0.2910234 | 0.7110647 |
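The four columns of this comparison follow the standard confusion-matrix definitions of accuracy, precision, sensitivity and specificity. A minimal Python sketch (tp, fn, fp, tn denote the confusion-matrix cells; the numbers in the usage comment are illustrative):

```python
def performance(tp, fn, fp, tn):
    """Four performance measures of a binary classifier from its confusion matrix."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,                              # correct predictions overall
        "precision": tp / (tp + fp) if tp + fp else None,           # undefined (NA) if nothing predicted positive
        "sensitivity": tp / (tp + fn),                              # true positive rate (recall)
        "specificity": tn / (tn + fp),                              # true negative rate
    }

# Illustrative call: performance(55, 5, 20, 20) on a sample of 100 records.
```

This also explains the NA entries above: a classifier that never predicts the positive class has tp + fp = 0, so its precision is undefined, its sensitivity is 0 and its specificity is 1, as in the J48 and RadSVM rows.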
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Ghanem, S.; Couturier, R.; Gregori, P. An Accurate and Easy to Interpret Binary Classifier Based on Association Rules Using Implication Intensity and Majority Vote. Mathematics 2021, 9, 1315. https://doi.org/10.3390/math9121315