1. Introduction
Machine learning methods are widely used in domains such as healthcare and finance. In many cases, the learned models are so-called black-box models, meaning that the learned representation is not easily interpretable and, hence, the predictions they make are not easily comprehensible to humans.
The need for explanations of how a model arrives at its predictions has led to substantial research on explaining learned models [1]. One can distinguish between local explanations, which approximate the black-box model in the vicinity of a single example to be explained (e.g., [2,3]), and global models, which try to capture the behavior of the entire black-box model in an interpretable surrogate. Recently, several approaches have been investigated that construct global models from local explanations (e.g., [4,5]). Furthermore, one can distinguish between model-specific explanation methods, which are tailored to specific types of black-box models such as deep neural networks (e.g., [6]), and model-agnostic explanation methods, which do not make any assumptions about the nature of the learned black-box model (e.g., [2]).
While the importance of explaining black-box models in high-stakes decision problems is undeniable, various challenges and issues have renewed the interest in learning interpretable models, such as decision trees or rule sets, in the first place.
The obvious problem is that post-hoc explanation methods only approximate the underlying black-box model, so that the found explanations often do not accurately reflect the behavior of the model they are meant to explain. This is typically captured by monitoring the fidelity of the surrogate model, i.e., the degree to which it follows the underlying model. In addition, even if an explanation reproduces the model's predictions without any errors, it might rely on completely different features, which means that it is not faithful to the computations actually performed inside the black-box model. Furthermore, if the black-box model is flawed, troubleshooting becomes more complicated, since both the explanation and the black-box model must be maintained. For these and other reasons, it has been argued that more effort should be devoted to learning more accurate interpretable models [7].
Motivated by this observation, this paper evaluates to what extent post-hoc explanations can be replaced with directly learned interpretable models that are unaware of the underlying black-box model. The goal is to investigate whether the performance of such an interpretable model is accurate enough for it to serve as a replacement for model-agnostic explanation methods or, conversely, to quantify how much information is lost when doing so. To this end, we conduct a series of experiments that evaluate and compare the performance of several interpretable models for explaining black-box models. Our results on rule-based and feature-based explanatory models seem to confirm our hypothesis.
This article is organized as follows. Section 2 briefly reviews important work on interpretability and explainability, Section 3 describes the research goals and the methods used in our experiments, and Section 4 discusses the experimental results.
3. Methods and Experimental Setup
3.1. Problem Statement
This study addresses the validation of the idea proposed by Rudin [7] that research should focus more on interpretable models rather than on explaining black-box models. To this end, we select and compare pairs consisting of a model-agnostic post-hoc explanation method and an independent, directly trained interpretable method, which both produce the same syntactic class of models. More precisely, as shown in Figure 1, we learn a black-box model M from a training set consisting of n examples, where each example has m features and a label. We then employ common methods from explainable AI to approximate M with an interpretable surrogate model. In parallel, we directly learn a syntactically comparable interpretable model from the same data and compare it to the surrogate.
Thus, the research question that we investigate is to what extent an interpretable model that has been directly learned from the data can approximate an independently learned black-box model M, and how much fidelity is lost compared to an interpretable surrogate model that had access to M. One would, of course, expect the surrogate to have a higher fidelity (and consequently maybe also a higher accuracy) than the directly learned model, because the surrogate had access to M, whereas the directly learned model was trained independently. However, both are trained on the same data, so that implicit correlations may emerge.
Moreover, it is well known that interpretable surrogate models are often less accurate than M because they typically only approximate the underlying black-box model. This approximation is often measured in terms of fidelity, i.e., how well the surrogate approximates the predictions of M.
Thus, we intend to find out how the two interpretable models compare, not only in terms of commonly used criteria such as their complexity or accuracy, but also in terms of this fidelity.
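For illustration, the following is a minimal sketch of how accuracy and fidelity can be computed in this setting, assuming scikit-learn-style models; the names black_box and model are illustrative placeholders, not the actual code used in our experiments.

```python
# Minimal sketch: accuracy vs. fidelity of an interpretable model (illustrative).
from sklearn.metrics import accuracy_score

def evaluate(model, black_box, X_test, y_test):
    """Return (accuracy, fidelity) of an interpretable model.

    Accuracy is measured against the true labels, fidelity against the
    predictions of the black-box model M on the same held-out data.
    """
    y_pred = model.predict(X_test)
    y_bb = black_box.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)   # agreement with ground truth
    fidelity = accuracy_score(y_bb, y_pred)     # agreement with the black box
    return accuracy, fidelity
```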
Furthermore, different ways of explaining a model might exist according to the so-called Rashomon effect [20], which, in a nutshell, states that, in particular with structured models such as trees or rules, there are often multiple different models that explain the data equally well. We are therefore also interested in understanding whether there are differences in the explanations provided by our selected interpretation methods for a given model.
Generally, we focus on rule-based and feature-based methods, whereby we compare the former with respect to the logical rules they learn and the latter with respect to the feature weights they attribute to the features. The following sections introduce the selected methods and algorithms.
3.2. Rule-Based Interpretability Methods
GLocalX and JRip are selected as the model-agnostic and the interpretable model, respectively. Both methods generate explanations in the form of rule sets, which are our preference because they produce compact models and are very close to the language of human reasoning [11].
JRip [12] is a classic rule learning algorithm that generates rules in three main steps: grow, prune, and optimize. Before learning each individual rule, JRip splits the examples it covers into two sets: a growing set from which the next rule is learned and a pruning set used to simplify the learned rule. The rule set is further optimized by re-learning individual rules in the context of the other rules once a sufficient number of positive examples have been covered.
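As an illustration, a rule set of this kind can be learned directly from data, for instance, with the third-party wittgenstein package, which implements RIPPER (the algorithm underlying JRip). The sketch below is only an example; the file path and column names are hypothetical and not the exact setup used in our experiments.

```python
# Minimal sketch: learning a RIPPER rule set directly from data
# (assumes the third-party `wittgenstein` package; file path and column names are hypothetical).
import pandas as pd
import wittgenstein as lw

df = pd.read_csv("adult.csv")                      # hypothetical path to the adult dataset
split = int(0.8 * len(df))
train, test = df.iloc[:split], df.iloc[split:]

ripper = lw.RIPPER()                               # grow, prune, and optimize happen internally
ripper.fit(train, class_feat="income", pos_class=">50K")

print(ripper.ruleset_)                             # the learned, human-readable rule set
print("accuracy:", ripper.score(test.drop(columns="income"), test["income"]))
```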
GLocalX [4] generates global explanations for a black-box model from local explanations created by a local surrogate method such as LORE [14] and from the labels predicted by the black-box model. The algorithm takes a set of local explanations as input and then iteratively merges and combines them into more general rules. At each iteration, it sorts the local explanations into a queue according to their similarities and samples a batch of data on which candidate explanations are merged. The merge operation is executed once the pair of explanations with the closest similarity is popped from the queue. The merge function consists of cut and join operators, which allow the algorithm to generalize a set of explanations while balancing fidelity and complexity. To merge two local explanations e1 and e2, the join and cut operators are applied to non-conflicting and conflicting explanations, respectively. Thus, join generalizes explanations at the cost of fidelity, while cut specializes explanations at the cost of generality. If the result of the merge function satisfies simplicity and accuracy constraints, the merged pair is kept, and e1 and e2 are replaced by it. Finally, explanations with low fidelity are filtered out using the α parameter, which indicates a per-class trimming threshold.
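A highly simplified sketch of this merge loop is given below. It is meant only to illustrate the control flow described above; join, cut, the conflict test, the acceptance test, and the fidelity function are abstract placeholders, and the real GLocalX implementation differs in several details (e.g., how the queue is maintained and how batches are sampled).

```python
# Simplified sketch of a GLocalX-style merge loop (illustrative, not the actual implementation).
# `similarity`, `conflicting`, `join`, `cut`, `acceptable`, and `fidelity` are placeholder callables.
import heapq

def merge_explanations(explanations, batch, alpha, similarity, conflicting,
                       join, cut, acceptable, fidelity):
    rules = set(explanations)
    # queue of candidate pairs, most similar pair first
    queue = [(-similarity(e1, e2), i, j)
             for i, e1 in enumerate(explanations)
             for j, e2 in enumerate(explanations) if i < j]
    heapq.heapify(queue)
    while queue:
        _, i, j = heapq.heappop(queue)
        e1, e2 = explanations[i], explanations[j]
        if e1 not in rules or e2 not in rules:
            continue                                   # one of the pair was already merged away
        merged = cut(e1, e2, batch) if conflicting(e1, e2) else join(e1, e2, batch)
        if acceptable(merged, batch):                  # simplicity and accuracy constraints
            rules.discard(e1)
            rules.discard(e2)
            rules.add(merged)
    # filter out explanations with low fidelity (trimming controlled by alpha)
    return [r for r in rules if fidelity(r, batch) >= alpha]
```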
3.3. Feature-Based Interpretability Methods
Among the various feature-based interpretability methods, GA2Ms [13] is selected as the glass-box, intelligible model, and MAPLE is selected as the post-hoc method. Both GA2Ms and MAPLE are based on linear models and provide feature weights that explain the contribution of the features to the prediction.
The GA2Ms algorithm is based on Generalized Additive Models (GAMs), which generalize linear models by modeling the target as a sum of arbitrary functions of the individual features instead of a weighted sum of the features themselves. GA2Ms extends GAMs with terms that capture pairwise interactions of feature values:

$$g\left(E[y]\right) = \beta_0 + \sum_{i} f_i(x_i) + \sum_{i \neq j} f_{ij}(x_i, x_j)$$

where g is a link function, the f_i are the single-feature shape functions, and the f_{ij} are the pairwise interaction terms.
The method starts by building a small tree for each feature separately in a boosting fashion, so that each tree depends on only one feature. This procedure is repeated for a fixed number of iterations, so that eventually we obtain an ensemble of trees for each feature. In the next step, the trees generated for each feature are summarized by recording their predictions in a single graph. At the end of this step, there is one graph per feature, and together these graphs constitute the model. Since GA2Ms is an additive model, we can easily reason about the contribution of each feature to the prediction [13].
For prediction, each per-feature function in GA2Ms acts like a lookup table that returns a term contribution. The returned term contributions are added up, and the final prediction is calculated by passing their sum through the link function g. This additivity enables GA2Ms to report the impact of each feature on each prediction.
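The following minimal sketch, assuming precomputed per-feature lookup tables and a logistic link, illustrates how such an additive prediction and its per-feature contributions can be read off. It is not the actual GA2Ms implementation (which is available, for instance, as the Explainable Boosting Machine in the InterpretML library), and the toy shape functions are purely illustrative.

```python
# Minimal sketch of additive GA2M-style prediction with per-feature lookup functions.
# Shape functions are assumed to be precomputed; pairwise interaction terms are omitted.
import numpy as np

def predict_proba(x, shape_functions, intercept=0.0):
    """x: 1-D feature vector; shape_functions: list of callables f_i(x_i)."""
    contributions = np.array([f(v) for f, v in zip(shape_functions, x)])
    score = intercept + contributions.sum()      # additive terms
    prob = 1.0 / (1.0 + np.exp(-score))          # logistic link, i.e., g^{-1}
    return prob, contributions                   # contributions explain the prediction

# toy example: two piecewise-constant shape functions acting as lookup tables
f_age = lambda v: 0.8 if v > 40 else -0.3
f_gain = lambda v: 1.2 if v > 5000 else 0.0
prob, contrib = predict_proba(np.array([45, 0]), [f_age, f_gain])
print(prob, contrib)                             # per-feature term contributions
```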
MAPLE combines classical linear modeling with a tree-ensemble-based supervised neighborhood approach and feature selection in order to provide both global and example-based explanations. To explain a prediction for an instance x, the algorithm first identifies the training points that are most relevant to that prediction. It assigns a similarity weight to each training point x_i by calculating how often x_i and x fall into the same leaf node across the K trees of the ensemble, as defined in (2):

$$w_i(x) = \frac{1}{K} \sum_{k=1}^{K} \mathbb{1}\left[\mathrm{leaf}_k(x_i) = \mathrm{leaf}_k(x)\right] \qquad (2)$$

The weights of the training points are then used to make a prediction and a local explanation by solving the weighted linear regression problem in (3):

$$\hat{\beta}_x = \arg\min_{\beta} \sum_{i=1}^{n} w_i(x)\,\left(y_i - \beta^{\top} x_i\right)^2 \qquad (3)$$

The local prediction for x is obtained from this fitted linear model, and its coefficients serve as the local explanation.
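A minimal sketch of this weighting and local fitting, using a scikit-learn random forest as the tree ensemble, might look as follows; it is a simplified illustration under these assumptions, not the reference MAPLE implementation.

```python
# Simplified MAPLE-style local explanation: leaf co-occurrence weights (Eq. 2)
# followed by a weighted linear fit (Eq. 3). Illustrative, not the reference implementation.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

def local_explanation(x, X_train, y_train, forest):
    # leaf indices of every training point and of the query point, per tree
    train_leaves = forest.apply(X_train)               # shape (n, n_trees)
    x_leaves = forest.apply(x.reshape(1, -1))[0]       # shape (n_trees,)
    weights = (train_leaves == x_leaves).mean(axis=1)  # fraction of shared leaves (Eq. 2)
    lin = LinearRegression()
    lin.fit(X_train, y_train, sample_weight=weights)   # weighted regression (Eq. 3)
    return lin.predict(x.reshape(1, -1))[0], lin.coef_ # local prediction and explanation

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 3)), rng.normal(size=200)
rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
pred, coefs = local_explanation(X[0], X, y, rf)
```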
3.4. Experimental Setup
The two experiments were performed on several commonly used datasets, mostly from the UCI collection of machine learning databases [21]. All datasets are binary classification problems. In the adult dataset, the task is to determine whether a person earns over 50 K a year. The compas two-year dataset contains recidivism risk scores that predict a person's likelihood of committing a crime within the next two years. The German dataset records whether a loan applicant has a good or bad credit risk. The NHANES I dataset contains follow-up mortality data from the National Health and Nutrition Examination Survey epidemiologic follow-up study. The credit card fraud dataset contains credit card transactions labeled as legitimate or fraudulent. Finally, the Bank dataset stems from a direct marketing campaign of a Portuguese banking institution, where the goal is to predict whether a client will subscribe to a term deposit.
3.5. Experimental Setup on Rule-Based Models
To prepare the experiments, we follow the same procedure as in [4]. As a preprocessing step, each dataset is separated into three parts: 60% of the data is dedicated to training the black-box model M, 20% is used for training GLocalX, and the remaining 20% is used as unseen data for validation.
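A minimal sketch of this 60/20/20 split with scikit-learn's train_test_split is shown below; it assumes a feature matrix X and label vector y have already been loaded, and the variable names are illustrative.

```python
# Minimal sketch of the 60/20/20 split described above (variable names illustrative).
from sklearn.model_selection import train_test_split

# 60% for training the black-box model, 40% held out for the remaining steps
X_bb, X_rest, y_bb, y_rest = train_test_split(X, y, test_size=0.4, random_state=42)
# split the remaining 40% evenly: 20% for GLocalX, 20% unseen validation data
X_glx, X_val, y_glx, y_val = train_test_split(X_rest, y_rest, test_size=0.5, random_state=42)
```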
As pointed out in Section 3.2, GLocalX requires a black-box model and a local explanation method to extract global rules. To this end, we use a random forest [22] as the black-box model to predict the labels of the 20% of the data reserved for training GLocalX, and we use the LORE algorithm to find a local explanation for each sample in the same partition. An overview of the required building blocks is shown in Figure 2.
For the experiments with JRip, we employ the first 80% of the dataset as training data. Note that JRip internally also splits the data into a growing set and a pruning set, which is quite similar to the internal split of GLocalX.
Since rule-based interpretability methods directly predict labels, they are evaluated in terms of accuracy and fidelity. In addition, the number of rules is considered as another evaluation metric.
4. Results and Discussion
This section describes the results of the experiments with rule-based and feature-based models.
4.1. Results on Rule-Based Models
In order to evaluate the performance of the glass-box model JRip as a substitute for the explanatory model GLocalX, we tried various values for the α parameter of GLocalX and compared the resulting rule sets against the rule set that can be directly obtained from JRip. Table 1 shows the results of JRip and of all tested α values on the adult dataset.
As can be seen, for the adult dataset, GLocalX obtained its best results in terms of fidelity and the number of rules for one particular value of α, which is therefore selected for further discussion. By comparing the GLocalX results to those of JRip in Table 1, we see that both methods obtain quite comparable performance in terms of accuracy and fidelity: the best theory learned by GLocalX with respect to accuracy and fidelity still has a lower accuracy than JRip, which, however, learns a somewhat more complex rule set. Even if we look at a GLocalX rule set with a comparable complexity (obtained with a different α), the picture remains the same: even though JRip has not seen the underlying black-box model, it seems to deliver a better explanation of that model than GLocalX, in the sense that it has a higher fidelity to the black-box model, despite the fact that GLocalX tried to mimic the black-box model while JRip learned an independent rule set.
Table 2 shows the results for all datasets, with the best α for each dataset. We can see that both algorithms obtain quite comparable performance in terms of the number of rules and accuracy, with, again, slight advantages for JRip.
To better understand the rules, Table 3 compares the rules learned by GLocalX and JRip for the adult dataset. Even though there are some common features such as "age", "capital-gain", and "marital-status" in the global rules of both GLocalX and JRip, in general very different rules were learned, which nevertheless had similar fidelity as global explanations. This is an instantiation of the phenomenon known as the Rashomon effect [20], namely that very different models can obtain a similar predictive accuracy and that small changes in a dataset may often lead to significant changes in a learned symbolic model. However, here we see this phenomenon from a novel angle: very different rules may provide different explanations with the same fidelity to an underlying black-box model.
Furthermore, the trade-off between interpretability and accuracy states that models with high interpretability might have lower accuracy. Comparing the rules provided by GLocalX and JRip in terms of rule length and the number of conditions per rule, we see that GLocalX generally generates shorter rules with fewer conditions than JRip. As a simple inference, GLocalX may therefore be more understandable for humans, since it generates shorter and simpler rules, whereas JRip obtains higher accuracy while generating more complex rules. It is worth noting, however, that the interpretability and understandability of rules can be evaluated from different perspectives and deserves a more thorough study.
4.2. Results on Feature-Based Interpretability Methods
As mentioned in the previous sections, GA2Ms and MAPLE are selected as the feature-based interpretability methods. Again, we aim to evaluate the performance of the two methods in terms of their stand-alone performance (accuracy) as well as their similarity to a black-box model (fidelity). We first show the contribution of features to the prediction by plotting the feature importance ranks provided by GA2Ms and MAPLE. Since the MAPLE algorithm does not provide global feature importance in the form of weights, we use the average of the linear regression weights for each feature as its feature importance.
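A minimal sketch of this aggregation, building directly on the local_explanation sketch given in Section 3.3 (and reusing the rf, X, and y defined there), could look as follows; it is one plausible reading of averaging the local regression weights, not the exact procedure used in our experiments.

```python
# Sketch: deriving a global feature importance for MAPLE by averaging the local
# linear-regression coefficients over a set of query points (illustrative only;
# reuses `local_explanation`, `rf`, `X`, and `y` from the earlier MAPLE sketch).
import numpy as np

def maple_global_importance(X_query, X_train, y_train, forest):
    coef_list = [local_explanation(x, X_train, y_train, forest)[1] for x in X_query]
    return np.mean(coef_list, axis=0)        # average weight per feature

importance = maple_global_importance(X[:50], X, y, rf)
```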
The feature importance values and ranks obtained by GA2Ms and MAPLE on the adult dataset are shown graphically in Figure 3. Comparing the results of MAPLE and GA2Ms, we see that "age" and "capital_gain" contribute strongly to the prediction in both methods. Other features, such as "marital_status", have a low contribution in MAPLE but a high contribution in GA2Ms, which again illustrates the Rashomon effect, i.e., quite different feature weights can be provided as explanations.
In order to compare both methods with respect to fidelity, we need an identical prediction mechanism for both MAPLE and GA2Ms. To that end, we use the normalized feature importance weights of the two methods and, for each explanation method, compute a prediction score for each sample in the dataset according to (4):

$$\hat{y}_i = \sum_{j=1}^{m} w_j \, x_{ij} \qquad (4)$$

where w_j is the normalized importance weight of feature j and x_{ij} is the value of feature j for sample i.
In (4), for each sample in the dataset, the weights derived from MAPLE or GA2Ms are multiplied by the feature values and summed. To convert these scores into binary labels, we tune a threshold t on the ROC curve as defined in (5):

$$t^{*} = \arg\max_{t}\,\left(\mathrm{TPR}(t) - \mathrm{FPR}(t)\right) \qquad (5)$$

where TPR and FPR denote the true positive and false positive rates, respectively. The obtained predictions and scores are then used to measure accuracy and fidelity by computing the AUC.
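A minimal sketch of this scoring and thresholding step with scikit-learn's roc_curve is given below; it assumes the Youden-style criterion written in (5) and that the normalized weights are available, and is illustrative rather than the exact evaluation code.

```python
# Sketch: scoring samples with normalized importance weights (Eq. 4) and tuning the
# decision threshold on the ROC curve (Eq. 5). Illustrative only.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def threshold_and_auc(weights, X, y_reference):
    """weights: normalized feature importances; y_reference: true labels (accuracy)
    or black-box predictions (fidelity)."""
    scores = X @ weights                                   # Eq. (4): weighted sum of features
    fpr, tpr, thresholds = roc_curve(y_reference, scores)
    t_star = thresholds[np.argmax(tpr - fpr)]              # Eq. (5): maximize TPR - FPR
    y_pred = (scores >= t_star).astype(int)
    return y_pred, roc_auc_score(y_reference, scores)
```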
In this way, the accuracy and fidelity of the two methods are again evaluated on the different datasets, and the results are shown in Table 4. The results confirm that on all datasets, GA2Ms achieves higher accuracy and fidelity than MAPLE, again underlining the hypothesis behind this work, namely that directly learned interpretable models may provide excellent explanations for black-box models, even if they have never seen the model, simply because well-trained interpretable and black-box models will necessarily correlate with each other.
5. Conclusions
Interpretable machine learning has gained importance in various problems and applications. The key idea behind many approaches that aim at explaining a black-box model is to approximate it globally or locally with an interpretable surrogate model. However, in this approximation, much of the predictive quality of the original model is lost, and it is unclear whether the surrogate model is actually sufficiently faithful to the black-box model. In this work, we showed that, maybe somewhat surprisingly, interpretable models that have never seen the black-box model may be just as faithful to it as the surrogate models that were explicitly learned from it.
In particular, we selected GLocalX and JRip as the post-hoc and the directly learned rule-based method, and MAPLE and GA2Ms as the post-hoc and the directly learned feature-based method, respectively. According to the experimental results, the performance of the interpretable models in terms of accuracy and fidelity is as good as that of the post-hoc methods, even though the two kinds of methods may provide quite different explanations in the form of rules or feature importances. Thus, interpretable models can be used instead of post-hoc methods. Measuring the differences between the explanations provided by rule-based and feature-based methods, and determining which explanation is most effective for a given dataset, is an interesting topic for future research.