# A New Method to Compare the Interpretability of Rule-Based Algorithms


## Abstract


## 1. Introduction

## 2. Predictivity Score

## 3. q-Stability Score

“A rule learning algorithm is stable if two independent estimations based on two independent samples, drawn from the same distribution $\mathbb{Q}$, result in two similar lists of rules.”
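The q-stability score builds on this definition by comparing the two learned rule lists as sets. The paper's full construction (including the q-quantile discretization of the features that gives the score its name) is not reproduced in this excerpt; the sketch below shows only the set-comparison core, assuming a Dice–Sorensen similarity and using hypothetical rule strings:

```python
def dice_similarity(rules_a, rules_b):
    """Dice-Sorensen similarity between two lists of rules,
    each rule encoded as a canonical string.

    Returns 1.0 when the two rule lists coincide and 0.0 when
    they share no rules.
    """
    set_a, set_b = set(rules_a), set(rules_b)
    if not set_a and not set_b:
        return 1.0  # two empty rule lists are trivially identical
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))

# Hypothetical rule lists learned on two independent samples:
rules_1 = ["x1 <= 3.0", "x2 > 1.5", "x3 <= 0.2"]
rules_2 = ["x1 <= 3.0", "x2 > 1.5", "x4 > 7.0"]
print(round(dice_similarity(rules_1, rules_2), 2))  # 0.67
```

Encoding rules as canonical strings makes "similar lists of rules" precise: two rules count as equal only if their conditions match exactly, which is why a common discretization of the features is applied first.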

## 4. Simplicity Score

**Definition 1.**

## 5. Interpretability Score

“the ability to explain or present to a person in an understandable form”
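The body of this section is not preserved in this excerpt, but the reported results are consistent with the interpretability score $\mathcal{I}$ being a weighted average of the three component scores, with equal weights reproducing the tabulated values (e.g. the Ozone/SIRUS entry of Table 4: $(0.60 + 0.99 + 0.29)/3 \approx 0.63$). A minimal sketch under that equal-weights assumption:

```python
def interpretability(predictivity, q_stability, simplicity,
                     weights=(1 / 3, 1 / 3, 1 / 3)):
    """Interpretability score as a weighted average of the
    predictivity, q-stability and simplicity scores.

    Equal weights are an assumption; they match the values
    reported in Tables 4 and 6 of this paper.
    """
    w_p, w_q, w_s = weights
    return w_p * predictivity + w_q * q_stability + w_s * simplicity

# Ozone / SIRUS row of Table 4: P_n = 0.60, S_n^q = 0.99, S_n = 0.29
print(round(interpretability(0.60, 0.99, 0.29), 2))  # 0.63
```

Making the weights explicit lets a practitioner tune the trade-off, e.g. down-weighting simplicity when predictive accuracy matters most.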

## 6. Application

#### 6.1. Brief Overview of the Selected Algorithms

#### 6.2. Datasets

#### 6.3. Execution

#### 6.4. Results for Regression

#### 6.5. Results for Classification

## 7. Conclusions and Perspectives

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

1. Molnar, C. Interpretable Machine Learning. 2020. Available online: https://www.lulu.com (accessed on 25 May 2021).
2. Molnar, C.; Casalicchio, G.; Bischl, B. Interpretable machine learning—A brief history, state-of-the-art and challenges. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Ghent, Belgium, 14–18 September 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 417–431.
3. Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. **2001**, 29, 1189–1232.
4. Goldstein, A.; Kapelner, A.; Bleich, J.; Pitkin, E. Peeking inside the black box: Visualizing statistical learning with plots of individual conditional expectation. J. Comput. Graph. Stat. **2015**, 24, 44–65.
5. Ribeiro, M.T.; Singh, S.; Guestrin, C. Why should I trust you?: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144.
6. Shrikumar, A.; Greenside, P.; Kundaje, A. Learning important features through propagating activation differences. In Proceedings of the International Conference on Machine Learning, Sydney, Australia, 6–11 August 2017; pp. 3145–3153.
7. Lundberg, S.M.; Lee, S.I. A unified approach to interpreting model predictions. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 4765–4774.
8. Guidotti, R.; Monreale, A.; Ruggieri, S.; Turini, F.; Giannotti, F.; Pedreschi, D. A survey of methods for explaining black box models. ACM Comput. Surv. **2018**, 51, 1–42.
9. Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. **2019**, 1, 206–215.
10. Breiman, L.; Friedman, J.; Olshen, R.; Stone, C. Classification and Regression Trees; CRC Press: Boca Raton, FL, USA, 1984.
11. Quinlan, J.R. Induction of decision trees. Mach. Learn. **1986**, 1, 81–106.
12. Quinlan, J.R. C4.5: Programs for Machine Learning; Elsevier: Amsterdam, The Netherlands, 1993.
13. Wang, Y.; Witten, I.H. Inducing model trees for continuous classes. In Proceedings of the European Conference on Machine Learning, Prague, Czech Republic, 23–25 April 1997.
14. Landwehr, N.; Hall, M.; Frank, E. Logistic model trees. Mach. Learn. **2005**, 59, 161–205.
15. Cohen, W. Fast effective rule induction. In Machine Learning Proceedings; Elsevier: Amsterdam, The Netherlands, 1995; pp. 115–123.
16. Karalič, A.; Bratko, I. First order regression. Mach. Learn. **1997**, 26, 147–176.
17. Holmes, G.; Hall, M.; Frank, E. Generating rule sets from model trees. In Proceedings of the Australasian Joint Conference on Artificial Intelligence, Sydney, Australia, 6–10 December 1999; Springer: Berlin/Heidelberg, Germany, 1999; pp. 1–12.
18. Friedman, J.; Popescu, B. Predictive learning via rule ensembles. Ann. Appl. Stat. **2008**, 2, 916–954.
19. Dembczyński, K.; Kotłowski, W.; Słowiński, R. Solving regression by learning an ensemble of decision rules. In Proceedings of the International Conference on Artificial Intelligence and Soft Computing, Zakopane, Poland, 22–26 June 2008; pp. 533–544.
20. Meinshausen, N. Node harvest. Ann. Appl. Stat. **2010**, 4, 2049–2072.
21. Bénard, C.; Biau, G.; Da Veiga, S.; Scornet, E. SIRUS: Stable and interpretable rule set for classification. Electron. J. Stat. **2021**, 15, 427–505.
22. Bénard, C.; Biau, G.; Da Veiga, S.; Scornet, E. Interpretable random forests via rule extraction. In Proceedings of the International Conference on Artificial Intelligence and Statistics, San Diego, CA, USA, 13–15 April 2021; pp. 937–945.
23. Margot, V.; Baudry, J.P.; Guilloux, F.; Wintenberger, O. Consistent regression using data-dependent coverings. Electron. J. Stat. **2021**, 15, 1743–1782.
24. Lipton, Z.C. The mythos of model interpretability. Queue **2018**, 16, 31–57.
25. Doshi-Velez, F.; Kim, B. Towards a rigorous science of interpretable machine learning. arXiv **2017**, arXiv:1702.08608.
26. Yu, B.; Kumbier, K. Veridical data science. Proc. Natl. Acad. Sci. USA **2020**, 117, 3920–3929.
27. Murdoch, W.J.; Singh, C.; Kumbier, K.; Abbasi-Asl, R.; Yu, B. Interpretable machine learning: Definitions, methods, and applications. arXiv **2019**, arXiv:1901.04592.
28. Hammer, P.L.; Kogan, A.; Simeone, B.; Szedmák, S. Pareto-optimal patterns in logical analysis of data. Discret. Appl. Math. **2004**, 144, 79–102.
29. Alexe, G.; Alexe, S.; Hammer, P.L.; Kogan, A. Comprehensive vs. comprehensible classifiers in logical analysis of data. Discret. Appl. Math. **2008**, 156, 870–882.
30. Alexe, G.; Alexe, S.; Bonates, T.O.; Kogan, A. Logical analysis of data—The vision of Peter L. Hammer. Ann. Math. Artif. Intell. **2007**, 49, 265–312.
31. Arlot, S.; Celisse, A. A survey of cross-validation procedures for model selection. Stat. Surv. **2010**, 4, 40–79.
32. Vapnik, V. The Nature of Statistical Learning Theory; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013.
33. Bousquet, O.; Elisseeff, A. Stability and generalization. J. Mach. Learn. Res. **2002**, 2, 499–526.
34. Poggio, T.; Rifkin, R.; Mukherjee, S.; Niyogi, P. General conditions for predictivity in learning theory. Nature **2004**, 428, 419–422.
35. Yu, B. Stability. Bernoulli **2013**, 19, 1484–1500.
36. Letham, B.; Rudin, C.; McCormick, T.; Madigan, D. Interpretable classifiers using rules and Bayesian analysis: Building a better stroke prediction model. Ann. Appl. Stat. **2015**, 9, 1350–1371.
37. Fayyad, U.M.; Irani, K.B. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, 28 August–3 September 1993; pp. 1022–1027.
38. Margot, V.; Baudry, J.P.; Guilloux, F.; Wintenberger, O. Rule induction partitioning estimator. In Proceedings of the International Conference on Machine Learning and Data Mining in Pattern Recognition, New York, NY, USA, 15–19 July 2018; Springer: Berlin/Heidelberg, Germany, 2018; pp. 288–301.
39. Dougherty, J.; Kohavi, R.; Sahami, M. Supervised and unsupervised discretization of continuous features. In Machine Learning Proceedings; Elsevier: Amsterdam, The Netherlands, 1995; pp. 194–202.
40. Luštrek, M.; Gams, M.; Martinčić-Ipšić, S. What makes classification trees comprehensible? Expert Syst. Appl. **2016**, 6, 333–346.
41. Fürnkranz, J.; Kliegr, T.; Paulheim, H. On cognitive preferences and the plausibility of rule-based models. Mach. Learn. **2020**, 109, 853–898.
42. Frank, E.; Witten, I.H. Generating accurate rule sets without global optimization. In Proceedings of the Fifteenth International Conference on Machine Learning, Madison, WI, USA, 24–27 July 1998; Morgan Kaufmann Publishers Inc.: San Francisco, CA, USA, 1998; pp. 144–151.
43. Hornik, K.; Buchta, C.; Zeileis, A. Open-source machine learning: R meets Weka. Comput. Stat. **2009**, 24, 225–232.
44. Friedman, J.; Popescu, B. Importance sampled learning ensembles. J. Mach. Learn. Res. **2003**, 94305, 1–32.
45. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B **1996**, 58, 267–288.
46. Friedman, J.H. Stochastic gradient boosting. Comput. Stat. Data Anal. **2002**, 38, 367–378.
47. Fürnkranz, J.; Gamberger, D.; Lavrač, N. Foundations of Rule Learning; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2012.
48. Fürnkranz, J.; Kliegr, T. A brief overview of rule learning. In Proceedings of the International Symposium on Rules and Rule Markup Languages for the Semantic Web, Berlin, Germany, 3–5 August 2015; Springer: Berlin/Heidelberg, Germany; pp. 54–69.
49. Dua, D.; Graff, C. UCI Machine Learning Repository. 2017. Available online: http://archive.ics.uci.edu/ml (accessed on 25 May 2021).
50. Hastie, T.; Friedman, J.; Tibshirani, R. The Elements of Statistical Learning; Springer Series in Statistics; Springer: Berlin, Germany, 2001; Volume 1.
51. Cortez, P.; Silva, A.M.G. Using data mining to predict secondary school student performance. In Proceedings of the 5th Future Business Technology Conference, Porto, Portugal, 9–11 April 2008.
52. Harrison, D., Jr.; Rubinfeld, D.L. Hedonic housing prices and the demand for clean air. J. Environ. Econ. Manag. **1978**, 5, 81–102.
53. Fokoue, E. UCI Machine Learning Repository. 2020. Available online: https://archive.ics.uci.edu/ml/index.php (accessed on 25 May 2021).

| Name | $(n \times d)$ | Description |
|---|---|---|
| Ozone | $330 \times 9$ | Prediction of atmospheric ozone concentration from daily meteorological measurements [50]. |
| Machine | $209 \times 8$ | Prediction of published relative performance [49]. |
| MPG | $398 \times 8$ | Prediction of city-cycle fuel consumption in miles per gallon [49]. |
| Boston | $506 \times 13$ | Prediction of the median price of neighborhoods [52]. |
| Student | $649 \times 32$ | Prediction of the final grade of the student based on attributes collected by reports and questionnaires [51]. |
| Abalone | $4177 \times 7$ | Prediction of the age of abalone from physical measurements [49]. |

| Name | $(n \times d)$ | Description |
|---|---|---|
| Wine | $4898 \times 11$ | Classification of white wine quality from 0 to 10 [49]. |
| Covertype | $581{,}012 \times 54$ | Classification of forest cover type (classes 1–7) based on cartographic variables [49]. |
| Speaker | $329 \times 12$ | Classification of accent (six possibilities) based on features extracted from the first reading of a word [53]. |

| Algorithm | Parameters |
|---|---|
| CART | `max_leaf_nodes=20` |
| RuleFit | `tree_size=4`, `max_rules=2000` |
| NodeHarvest | `max.inter=3` |
| CA | `generator_func=RandomForestRegressor`, `n_estimators=500`, `max_leaf_nodes=4`, `alpha=1/2-1/100`, `gamma=0.95`, `k_max=3` |
| SIRUS | `max.depth=3`, `num.rule=10` |

**Table 4.** Average of the predictivity score (${\mathcal{P}}_{n}$), q-stability score (${\mathcal{S}}_{n}^{q}$), simplicity score (${\mathbb{S}}_{n}$), and interpretability score ($\mathcal{I}$) over a 10-fold cross-validation of commonly used interpretable algorithms on various public regression datasets. Best values are in bold, as are values within 10% of the maximum value for each dataset.

**${\mathcal{P}}_{n}$ (predictivity)**

| Dataset | RT | RuleFit | NodeHarvest | CA | SIRUS |
|---|---|---|---|---|---|
| Ozone | 0.55 | 0.74 | 0.66 | 0.56 | 0.60 |
| Machine | 0.79 | 0.95 | 0.73 | 0.59 | 0.46 |
| MPG | 0.75 | 0.85 | 0.78 | 0.59 | 0.74 |
| Boston | 0.61 | 0.74 | 0.67 | 0.26 | 0.57 |
| Student | 0.08 | 0.16 | 0.22 | 0.13 | 0.24 |
| Abalone | 0.40 | 0.55 | 0.37 | 0.39 | 0.30 |

**${\mathcal{S}}_{n}^{q}$ (q-stability)**

| Dataset | RT | RuleFit | NodeHarvest | CA | SIRUS |
|---|---|---|---|---|---|
| Ozone | 1.00 | 0.11 | 0.92 | 0.24 | 0.99 |
| Machine | 0.63 | 0.27 | 0.91 | 0.17 | 1.00 |
| MPG | 1.00 | 0.14 | 0.87 | 0.25 | 1.00 |
| Boston | 0.85 | 0.15 | 0.81 | 0.26 | 0.97 |
| Student | 0.98 | 0.14 | 1.00 | 0.26 | 1.00 |
| Abalone | 1.00 | 0.21 | 0.86 | 0.25 | 0.99 |

**${\mathbb{S}}_{n}$ (simplicity)**

| Dataset | RT | RuleFit | NodeHarvest | CA | SIRUS |
|---|---|---|---|---|---|
| Ozone | 0.12 | 0.01 | 0.04 | 0.96 | 0.29 |
| Machine | 0.14 | 0.02 | 0.04 | 0.90 | 0.25 |
| MPG | 0.15 | 0.01 | 0.05 | 0.98 | 0.34 |
| Boston | 0.26 | 0.01 | 0.07 | 1.00 | 0.52 |
| Student | 0.37 | 0.05 | 0.25 | 0.91 | 0.97 |
| Abalone | 0.58 | 0.02 | 0.13 | 0.66 | 1.00 |

**$\mathcal{I}$ (interpretability)**

| Dataset | RT | RuleFit | NodeHarvest | CA | SIRUS |
|---|---|---|---|---|---|
| Ozone | 0.56 | 0.29 | 0.54 | 0.59 | 0.63 |
| Machine | 0.52 | 0.41 | 0.56 | 0.55 | 0.57 |
| MPG | 0.63 | 0.33 | 0.57 | 0.61 | 0.69 |
| Boston | 0.57 | 0.30 | 0.52 | 0.50 | 0.69 |
| Student | 0.47 | 0.12 | 0.49 | 0.43 | 0.74 |
| Abalone | 0.66 | 0.26 | 0.45 | 0.43 | 0.76 |

|  | ${\mathcal{P}}_{n}$ | ${\mathcal{S}}_{n}^{q}$ | ${\mathbb{S}}_{n}$ |
|---|---|---|---|
| ${\mathcal{P}}_{n}$ | 1 | $-0.10$ | $-0.27$ |
| ${\mathcal{S}}_{n}^{q}$ | − | 1 | $-0.10$ |
| ${\mathbb{S}}_{n}$ | − | − | 1 |

**Table 6.** Average of the predictivity score (${\mathcal{P}}_{n}$), q-stability score (${\mathcal{S}}_{n}^{q}$), simplicity score (${\mathbb{S}}_{n}$), and interpretability score ($\mathcal{I}$) over a 10-fold cross-validation of commonly used interpretable algorithms on various public classification datasets. Best values are in bold, as are values within 10% of the maximum value for each dataset.

**${\mathcal{P}}_{n}$ (predictivity)**

| Dataset | CART | RIPPER | PART |
|---|---|---|---|
| Wine | 0.13 | 0.12 | 0.01 |
| Covertype | 0.37 | 0.46 | 0.50 |
| Speaker | 0.24 | 0.31 | 0.35 |

**${\mathcal{S}}_{n}^{q}$ (q-stability)**

| Dataset | CART | RIPPER | PART |
|---|---|---|---|
| Wine | 1.00 | 1.00 | 1.00 |
| Covertype | 1.00 | 1.00 | 1.00 |
| Speaker | 0.95 | 1.00 | 1.00 |

**${\mathbb{S}}_{n}$ (simplicity)**

| Dataset | CART | RIPPER | PART |
|---|---|---|---|
| Wine | 0.99 | 0.64 | 0.01 |
| Covertype | 1.00 | 0.12 | 0.01 |
| Speaker | 0.71 | 1.00 | 0.45 |

**$\mathcal{I}$ (interpretability)**

| Dataset | CART | RIPPER | PART |
|---|---|---|---|
| Wine | 0.71 | 0.59 | 0.34 |
| Covertype | 0.79 | 0.53 | 0.50 |
| Speaker | 0.63 | 0.77 | 0.60 |

|  | ${\mathcal{P}}_{n}$ | ${\mathcal{S}}_{n}^{q}$ | ${\mathbb{S}}_{n}$ |
|---|---|---|---|
| ${\mathcal{P}}_{n}$ | 1 | $0.09$ | $-0.04$ |
| ${\mathcal{S}}_{n}^{q}$ | − | 1 | $0.06$ |
| ${\mathbb{S}}_{n}$ | − | − | 1 |


© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Margot, V.; Luta, G. A New Method to Compare the Interpretability of Rule-Based Algorithms. *AI* **2021**, *2*, 621-635.
https://doi.org/10.3390/ai2040037
